Journal of Cheminformatics. 2025 Sep 25;17:144. doi: 10.1186/s13321-025-01067-4

Evaluating ligand docking methods for drugging protein–protein interfaces: insights from AlphaFold2 and molecular dynamics refinement

Jordi Gómez Borrego 1, Marc Torrent Burgas 1
PMCID: PMC12465486  PMID: 40999494

Abstract

Advances in docking protocols have significantly enhanced the field of protein–protein interaction (PPI) modulation, with AlphaFold2 (AF2) and molecular dynamics (MD) refinements playing pivotal roles. This study evaluates the performance of AF2 models against experimentally solved structures in docking protocols targeting PPIs. Using a dataset of 16 interactions with validated modulators, we benchmarked eight docking protocols, revealing similar performance between native and AF2 models. Local docking strategies outperformed blind docking, with TankBind_local and Glide providing the best results across the structural types tested. MD simulations and other ensemble generation algorithms such as AlphaFlow, refined both native and AF2 models, improving docking outcomes but showing significant variability across conformations. These results suggest that, while structural refinement can enhance docking in some cases, overall performance appears to be constrained by limitations in scoring functions and docking methodologies. Although protein ensembles can improve virtual screening, predicting the most effective conformations for docking remains a challenge. These findings support the use of AF2-generated structures in docking protocols targeting PPIs and highlight the need to improve current scoring methodologies.

Scientific contribution

This study provides a systematic benchmark of docking protocols applied to protein–protein interactions (PPIs) using both experimentally solved structures and AlphaFold2 models. By integrating molecular dynamics ensembles and AlphaFlow-generated conformations, we show that structural refinement improves docking outcomes in selected cases, but overall performance remains constrained by docking scoring function limitations. Our analysis shows that AlphaFold2 models perform comparably to native structures in PPI docking, validating their use when experimental data are unavailable. These results establish a reference framework for future PPI-focused virtual screening and underscore the need for improved scoring functions and ensemble-based approaches to better exploit emerging structural prediction tools.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13321-025-01067-4.

Keywords: Protein interaction, Virtual screening, AlphaFold, Molecular docking, Molecular dynamics, Benchmarking

Introduction

Structure-based molecular docking plays a critical role in drug discovery by predicting the binding modes and affinities of protein–ligand interactions. Molecular docking can expedite rational drug design, enabling the rapid evaluation of large chemical libraries to identify promising candidates [1, 2]. However, docking strategies can be inaccurate in predicting binding affinities and struggle to account for protein flexibility [3–5]. Such limitations are especially relevant in designing protein–protein interaction (PPI) modulators, where tailored screening libraries are scarce and druggable regions of proteins are difficult to target. Unlike traditional enzyme or receptor targets, which have well-defined grooves for ligand binding, PPIs present unique challenges due to their large, flat contact surfaces and limited structural data [6, 7]. Nevertheless, PPIs are promising therapeutic targets given their involvement in essential biological processes, with potential applications in treating diseases such as cancer, neurodegenerative disorders, and infections.

Earlier research assessed the efficacy of diverse docking algorithms in identifying true binding compounds using experimental structures [3, 8, 9], and subsequent studies showed the utility of homology and ab initio models [2]. The advent of AlphaFold2 (AF2) enabled the prediction of high-resolution protein structures even in the absence of homologous templates [10, 11]. Recent analyses explored AF2 models' suitability in hit-to-lead modeling [12–17]. Some studies reported that AF2 models performed less consistently than Protein Data Bank (PDB) structures, emphasizing the need for refinement via molecular dynamics (MD) simulations, induced-fit docking (IFD), or re-scoring functions. However, other studies suggest that docking inaccuracies stem more from methodological limitations than model quality [14]. Comparative analyses between apo and holo PDB structures showed that AF2 models perform similarly to apo structures but worse than holo structures [12, 18]. As far as we are aware, no extensive ligand docking benchmark focused specifically on PPIs has been carried out to date.

Here, we conducted a comparative structural analysis of AF2 models and experimentally solved structures to assess the performance of eight docking protocols targeting PPIs (Fig. 1). We used a dataset of 16 interactions with validated modulators from the ChEMBL and 2P2Idb databases [19, 20]. Benchmarking revealed similar performance between native and AF2 models. While refining AF2 models with MD simulations or other ensemble generation algorithms improved docking results in some cases, the outcomes varied significantly across conformations. These findings suggest that performance variations in docking protocols originate from the scoring functions rather than model quality. Although using protein ensembles rather than single structures may enhance virtual screening protocols, predicting which conformation will yield better docking results remains a challenge.

Fig. 1.

Workflow diagram developed in this study. We first selected 16 PPIs with confirmed active ligands from the ChEMBL and 2P2Idb databases. Subsequently, we conducted a comparative structural analysis using modeled structures from five sources: (1) AF2 models derived from the native PDB structures (AFnat), (2) AF2 models of the full-length protein (AFfull), (3) ten representative conformations following a 500 ns MD simulation on native PDB structures (MD-PDB), (4) ten representative conformations following a 500 ns MD simulation of AFnat structures (MD-AF), and (5) ten representative conformations from ensembles generated by AlphaFlow. We then applied eight docking protocols, including local and blind docking approaches, and evaluated their performance by calculating multiple metrics for each protocol

Results and discussion

AlphaFold2 models are suitable starting structures for molecular docking

Previous results from our group demonstrated the robust capabilities of AF2 (version 2.3.1) in accurately predicting heterodimeric complexes with low homology [21]. An evaluation of 140 non-homologous PPI structures excluded from the AF2 training dataset showed that 81% of the complexes were accurately predicted. This result emphasizes the predictive power of the algorithm, even in the absence of prior structural information. These findings are consistent with recent literature highlighting the utility of AF2 structures in virtual screening.

Here, we aimed to understand how AF2 protein complex models can be used for ligand virtual screening and how reliable these strategies might be. To achieve this, we compared 16 experimentally resolved PPIs with known active ligands (Table 1) to their corresponding AF2 models. Since PDB structures do not always represent the full protein but instead a particular domain or section, two types of AF2 models were generated: models derived from native PDB structures (AFnat) and those derived from full-length proteins based on genetic sequences (AFfull). Additionally, protein ensembles were created through 500 ns all-atom MD simulations or using the AlphaFlow sequence-conditioned generative model. These ensembles served as input for benchmarking eight docking strategies targeting protein interfaces.

Table 1.

List of protein complexes analyzed in this study and information about the selected active ligands

Protein complex Type PDB_ID # ligands Source Source ID
RUNX1/CBFB Protein–protein 1E50 175 ChEMBL CHEMBL2093862
DCUN1D1/Ubc12 Protein–protein 3TDZ 64 ChEMBL CHEMBL4523603
Gag-Pol/LEDGF Protein–protein 2B4J 63 2P2Idb 20401
KRAS/SOS1 Protein–protein 6EPL 53 2P2Idb 20202
PCSK9/LDLR Protein–protein 2W2M 49 ChEMBL CHEMBL4523996
XIAP-BIR3/SMAC Protein–protein 1G73 29 2P2Idb 10503
BRD4-1/H4 Protein-peptide 3UVW 255 2P2Idb 30205
PPARG/NCOR2 Protein-peptide 8AQN 176 ChEMBL CHEMBL2096976
KPNB1/SNUPN Protein-peptide 2P8Q 97 ChEMBL CHEMBL3885594
MDM2/P53 Protein-peptide 1YCR 46 2P2Idb 10301
Mcl-1/BID-MM Protein-peptide 5C3F 39 ChEMBL CHEMBL3430886
VHL/HIF1A Protein-peptide 4AJY 37 2P2Idb 20701
WDR5/MLL1 Protein-peptide 4ESG 37 2P2Idb 10601
BCLXL/BAK Protein-peptide 5FMK 36 2P2Idb 10102
ANXA2/S100A10 Protein-peptide 4FTG 27 ChEMBL CHEMBL2111435
KEAP1/NRF2 Protein-peptide 2FLU 25 2P2Idb 10201

We first evaluated the quality of the models generated by AF2 using several metrics. The AlphaFold-Multimer algorithm combines the interface pTM (ipTM) and pTM scores into a unified accuracy metric (ipTM + pTM), prioritizing interface accuracy [22]. Models with scores above 0.7 were classified as high-quality. Besides the AF2 score, structural similarity was also evaluated using the TM-score, while complex prediction accuracy was assessed through DockQ and interface root mean square deviation (iRMS) metrics [23, 24]. The TM-score measures how well the predicted model aligns with the experimental structure in terms of residue coverage and distance. DockQ assesses the quality of docking models by comparing predicted protein complexes to experimental structures, while iRMS evaluates the accuracy of protein–protein interfaces. For AFfull structures, quality was assessed using the ipTM + pTM and pDockQ2 scores [25]. The pDockQ2 score estimates the quality of multimeric models, much like DockQ, but is specifically tailored for situations where the native protein is either unknown or altered.

All AF2 models derived from native PDB structures in the benchmark dataset exhibited ipTM + pTM scores above 0.7, categorizing them as high-quality models. Furthermore, all models recorded a TM-score above 0.6 (median: 0.972; Supplementary Fig. 1, Supplementary File 1), indicating similar topology and chain orientation compared to the experimental structures (Fig. 2A). AF2 accurately predicted all benchmark structures (DockQ > 0.23) except for 1G73 and delivered high-quality predictions (DockQ > 0.8) for 9 of the 16 complexes (median: 0.838; Supplementary Fig. 1, Supplementary File 1). Most predicted structures closely resembled the native structures (14 structures, iRMS < 2 Å) or were within acceptable limits (11 structures, iRMS < 4 Å; median: 0.88), except for 1G73 and 8AQN (Supplementary File 1). In the case of 1G73 (XIAP/BIR3-SMAC), although the overall fold was nearly native-like (TM-score = 0.616) and both chains aligned well in isolation, the highly flexible, unfolded N-terminal domain of the SMAC protein hindered accurate prediction of the correct partner position. This resulted in a low-quality model, as reflected in the DockQ and iRMS metrics (Fig. 2B). The high iRMS in 8AQN (PPARG-NCOR2) was due to differences in two regions involving large, unfolded loops connecting alpha helices in PPARG (residues 250–284 and 452–477). In the model, the C-terminal region of PPARG interacted with the protein partner, unlike in the native structure, which negatively affected the iRMS (Fig. 2C). These high-quality metrics for the AFnat models were expected as the targets in the benchmark dataset had representative structures in the PDB before September 30, 2021, and were included in the AF 2.3.1 training dataset. Notwithstanding, multiple studies suggest that high-accuracy models can be achieved even for structures not included in the training dataset [21, 22, 25].
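As a minimal illustration of how the cutoffs used above translate into quality classes, the rules can be encoded as follows; the function and the example input values are illustrative and not part of the published pipeline:

```python
def classify_model(iptm_ptm: float, dockq: float, irms: float) -> str:
    """Bin an AF2 complex model using the thresholds applied in this benchmark.

    iptm_ptm : AlphaFold-Multimer combined ipTM + pTM score
    dockq    : DockQ score against the native complex
    irms     : interface RMSD (in Angstrom) against the native complex
    """
    if iptm_ptm < 0.7:
        return "low-confidence AF2 model"
    if dockq > 0.8 and irms < 2.0:
        return "high-quality, near-native interface"
    if dockq > 0.23 and irms < 4.0:
        return "acceptable interface"
    return "incorrect interface"

# Hypothetical scores for a well-predicted complex
print(classify_model(iptm_ptm=0.85, dockq=0.84, irms=1.2))
```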

Fig. 2.

Structural alignments of PDB structures with AFnat and AFfull models. PDB structures are colored in grey, AFnat models in wheat and AFfull models in salmon. Quality metrics are provided in Supplementary File 1. A Structural alignment of the native 3TDZ structure with the corresponding AFnat model and a magnified view of the protein interface showing the side chains. The quality metrics TM-score, DockQ and iRMS indicate that the AFnat model is nearly native-like. B Structural alignment of the native 1G73 and the corresponding AFnat model in front and top views. The SMAC protein (three-helix bundle) is aligned with an RMSD of 0.543 Å; however, the protein partner XIAP is displaced. C Structural alignment of the native 8AQN and the corresponding AFnat model. The two structures align over most of their length; however, notable structural differences are observed in the highlighted disordered region. D Structural alignments of the native 1YCR (left) and 2P8Q (right) with their corresponding AFfull models. In 1YCR, both chains contain a large percentage of unfolded regions, negatively impacting quality metrics (ipTM < 0.5 and pDockQ2 < 0.23); however, the interface region is conserved (iRMSD < 1 Å, highlighted). In 2P8Q, the additional regions in the AFfull structure modify the protein interface, leading to large discrepancies between the PDB and AFfull structures (iRMSD > 20 Å)

To identify structural differences that may arise upon complex formation, the apo and holo conformations of the individual partners were compared (Supplementary Table 1). Peptides were excluded due to their high flexibility and lack of resolved structures in the PDB. Of the 21 protein partners analyzed, 20 were available in the PDB. Among these, 16 exhibited an RMSD below 2 Å, indicating nearly identical conformations. Significant structural variations were observed in 4 cases, which underwent conformational changes upon interacting with their partner, primarily due to differences in highly mobile loops (Supplementary Table 1). This suggests that the individual protein interfaces in the dataset undergo minimal structural rearrangements when interacting with partners, mainly affecting highly mobile loops, potentially reflecting the challenges of targeting protein complexes that undergo substantial structural rearrangements upon binding.

Modeling truncated versus full-length proteins affects model quality

In this study, we assessed the impact of full-length protein regions on the structure and interface of protein complexes by comparing AFfull models with native structures. The AFfull models generally showed lower quality, evidenced by decreased pDockQ2 (less than 0.23) and ipTM + pTM scores (below 0.7), primarily due to high predicted aligned errors (PAEs) from unfolded or altered regions (Fig. 2D, Supplementary File 1, Supplementary Fig. 1). These models were notably larger, containing up to 2700 additional residues and more unfolded regions, which compromised the interface quality across all complexes, particularly in protein-peptide interactions (Supplementary File 1). Although a few complexes, such as 1E50 and 6EPL, maintained high-quality interfaces with minimal alterations, most, including 1YCR and 2P8Q, exhibited significant interface changes (Fig. 2D). Notably, 15 of the 16 complexes analyzed showed a negative impact from unfolded regions, which should be considered when predicting protein complexes, particularly in downstream applications such as molecular docking.

Conformational ensembles as starting structures for molecular docking

Previous research has shown that refining protein structures using MD simulations can improve virtual screening performance, even with short simulation times [26]. To investigate this, we conducted 500-ns MD simulations on both PDB and AFnat structures, clustering the resulting trajectories into 10 representative structures and comparing them to the native PDB structures (Supplementary File 1, Supplementary Fig. 1). The consistency of global motion and exploration of conformational space were validated through triplicate MD simulations (Supplementary Fig. 2). Additionally, we used AlphaFlow, a generative model designed to emulate MD ensembles and predict the conformational space of proteins [27]. The MD-derived structures were classified into two categories (Fig. 3):

  1. Stable Structures. Representative structures closely resembled initial configurations throughout the simulation, with all structures displaying DockQ scores ≥ 0.50 and iRMS ≤ 2 Å.

  2. Rearranged Structures. Some structures exhibited global or local rearrangements relative to the initial structure, resulting in certain representative structures with DockQ scores < 0.50 and iRMS > 2 Å.

Fig. 3.

Representative analysis of the MD trajectories performed in this study. The 1E50 (A) and 2FLU (B) complexes are analyzed. Top-left: Structural alignments of the native and the 10 representative structures resulting from the 500-ns MD simulation. Top-right: RMSD plots of both chains throughout the 500-ns simulation. The colors of the lines in the plot correspond to the colors of the chains shown in the top-left figure. Bottom-left: Radius of gyration (blue) and potential energy (green) plotted across the same trajectory frames. Bottom-right: Free energy landscape from principal component analysis of molecular dynamics trajectories. The results of replicates are shown in Supplementary Fig. 2

Analysis of MD simulations on PDB structures showed that 11 of the 16 complexes fell into the first category, indicating no significant conformational changes. In these cases, the representative structures closely resembled the initial configurations, reflecting the stability of the complexes during the simulations (Fig. 3, top-left). This category included both protein–protein and protein-peptide complexes, such as 2P8Q and 8AQN, which remained stable despite the high flexibility associated with peptides. The stability is attributed to extensive contacts between the peptide and its protein partner. For example, in 1E50, the protein chains exhibited consistent stability throughout the simulation, as shown by the RMSD plot (Fig. 3A, top-right, Supplementary Fig. 2A). Radius of gyration and potential energy were also plotted to assess compactness and energy trends. The results indicated stable compactness over time, with occasional fluctuations (Fig. 3A, bottom-left). The potential energy profile revealed low-energy conformations, inversely correlated with compactness, implying that densely packed complexes are associated with lower energy levels. Principal component analysis (PCA) of the simulation trajectory data highlighted the conformational exploration pattern (Fig. 3A, bottom-right), revealing similar behaviors across replicates and convergence toward a stable, low-energy structure. The second category, comprising complexes like 2FLU, 3UVW, 4AJY, 4ESG, and 4FTG, included protein-peptide complexes where the protein partners remained relatively stable, whereas the peptides exhibited high flexibility, significantly influencing DockQ and iRMS metrics (Fig. 3B, top-left). In the 2FLU simulation, we identified low-energy conformations that inversely correlated with protein compactness throughout the entire simulation (Fig. 3B, bottom-left). As before, replicate simulations explored similar conformational space, ultimately converging toward two low-energy conformations (Supplementary Fig. 2B).

MD simulations on AFnat models revealed no major differences compared to the representative MD models derived from PDB structures, except in 1G73, 2FLU, 4AJY, and 8AQN (Supplementary File 1). Achieving convergence in these cases proved challenging due to the complex potential energy landscape, requiring longer timescales or enhanced sampling to explore effectively. Differences in quality metrics for protein-peptide complexes (2FLU, 4AJY) can be attributed to peptide flexibility (Supplementary File 1). Atomic-level flexibility, assessed by root mean square fluctuation (RMSF), generally followed similar patterns between AF2 and PDB MD structures, except in 1G73, due to its extensive protein interface (Fig. 4A).

Fig. 4.

MD-PDB, MD-AF and AlphaFlow ensemble comparison examples. C-alpha RMSF plots by residue index for 1G73 (A), 6EPL (B) and 4AJY (C). Each plot includes the structural alignments of the 10 representative structures from the MD-PDB, MD-AF and AlphaFlow ensembles. On the right, energy distributions are shown for 250 structures from each protein ensemble.

AlphaFlow ensembles exhibited similar RMSF patterns to PDB/AF2 MD simulations but showed increased fluctuations (Fig. 4). When analyzing energy distributions, AlphaFlow ensembles displayed broader and higher-energy distributions than MD ensembles (Fig. 4). While 4AJY displayed high structural similarity across ensembles, significant structural diversification was observed in 6EPL (Fig. 4B). Compared to MD, AlphaFlow achieved faster wall-clock convergence across several equilibrium properties, positioning it as a rapid alternative for generating diversified protein ensembles. These findings suggest that AlphaFlow captures conformational flexibility and generates conformations not typically observed in MD simulations.

Evaluation of docking performance across strategies

We assessed the docking performance of eight strategies: Glide, Glide-IFD, Vina, Gnina, TankBind_local, TankBind_blind, EquiBind, and DiffDock (Fig. 5) [28–33]. This set includes well-established local docking tools, such as Glide, AutoDock Vina, and Gnina, and deep-learning (DL)-based blind docking methods like TankBind and DiffDock. For TankBind, two approaches were employed: using the protein interface as the binding site (TankBind_local) or using binding sites predicted by P2Rank (TankBind_blind). This diversity allows for a comprehensive evaluation of different protein interface search spaces and scoring functions. Active and inactive compounds (Table 1) were docked and ranked using native PDB structures, AFnat models, and structural ensembles generated using AlphaFlow or MD simulations of the native structures. For each docking strategy, the top poses were evaluated using the area under the receiver operating characteristic curve (AUROC) to gauge performance (Fig. 6A, Supplementary File 2). For MD and AlphaFlow structures, confidence intervals of AUROC values were calculated across 10 representative structures from the entire structural ensemble, and the best-performing AUROC scores were also considered (Supplementary File 2). Each docking protocol was applied to individual protein chains, centering the grid on the interface region for local docking tools (Fig. 5). Software versions and parameters for each protocol are detailed in the methodology section.
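A sketch of how per-ensemble AUROC statistics of this kind can be aggregated is shown below (scikit-learn for the AUROC itself; the score arrays, the sign convention of more negative scores meaning stronger predicted binding, and the percentile interval are illustrative assumptions rather than the exact procedure used):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_auroc(labels, score_matrix):
    """AUROC per representative structure plus summary statistics.

    labels       : 1 for active, 0 for decoy (length n_ligands)
    score_matrix : docking scores, one row per representative structure
                   (shape n_structures x n_ligands); lower scores are assumed
                   to indicate stronger binding, so they are negated first.
    """
    aurocs = np.array([roc_auc_score(labels, -row) for row in score_matrix])
    lo, hi = np.percentile(aurocs, [2.5, 97.5])   # empirical 95% interval
    return {"per_structure": aurocs, "mean": aurocs.mean(),
            "ci95": (lo, hi), "best": aurocs.max()}

# Hypothetical example: 10 MD representatives, 20 ligands (5 actives, 15 decoys)
rng = np.random.default_rng(0)
labels = np.array([1] * 5 + [0] * 15)
scores = rng.normal(loc=-7.0, scale=1.0, size=(10, 20))
print(ensemble_auroc(labels, scores)["mean"])
```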

Fig. 5.

Overview of the docking performance evaluation workflow. A Schematic overview of the workflow used to assess docking performance. B Diagram summarizing the workflow: (1) Selection of 16 experimentally solved PPIs with known active ligands from ChEMBL and 2P2Idb based on specific criteria, (2) generation of a decoy set by selecting random ligands with physicochemical properties similar to those of the active ligands in a 1:10 ratio, simulating a real-world scenario, (3) protein preparation, (4) ligand preparation, (5) docking and ranking with eight docking protocols using native PDB structures, AFnat models, and structural ensembles generated through AlphaFlow or MD simulations, and (6) evaluation of docking performance and analysis of binding pockets

Fig. 6.

Evaluation of docking protocols. A Heatmap illustrating the AUROC values for the docking protocols examined in this study. The empty cells in TankBind_blind denote complexes where P2Rank did not identify binding pockets. B Violin plot showing the AUROC values, comparing structure types within each docking protocol. No significant differences (n.s.) were found between PDB and AF structures in any of the docking strategies. Mann–Whitney test p-values are shown for comparisons between PDB and MD, and between PDB and AlphaFlow. Glide, AutoDock Vina, Gnina, and TankBind_local (targeting the protein interface) are considered local docking protocols, whereas DiffDock is classified as a blind docking strategy. Glide-IFD was excluded because no significant performance differences were observed between Glide and Glide-IFD, as detailed in Supplementary Fig. 5. EquiBind and TankBind_blind were also excluded: EquiBind often resulted in ligand clashes and yielded positive binding energies, and TankBind_blind had a small sample size due to difficulty in predicting binding sites for many complexes. Black lines indicate median values. Examples of representative ROC curves are shown in Supplementary Fig. 6

Analysis of AUROC values from local docking strategies revealed no significant differences between using PDB or AFnat structures, regardless of protocol (Fig. 6B, Supplementary Fig. 3). The protocols resulted in modest AUROC values overall, ranging from a median of 0.522 for AutoDock Vina (PDB structures) to 0.623 for TankBind_local, and from 0.537 (Glide-IFD) to 0.644 (TankBind_local using AFnat structures). No correlation was found between AUROC and quality metrics such as DockQ or ipTM + pTM (Supplementary Fig. 4A). Despite the high-quality metrics provided by AlphaFold2 for protein topology and interface prediction, this accuracy did not consistently translate into reliable molecular docking performance. Similar results were obtained using both native structures and AF2 models, suggesting that the primary limitation may lie in the docking strategies or the active/decoy ligand selection rather than in the use of AF2 models. Using the Glide-IFD protocol did not significantly enhance performance compared to Glide alone across structure types (Supplementary Fig. 5), likely due to limited flexibility in core interface residues (typically only three). Increasing flexibility might improve results but would substantially raise computational cost.

MD simulations and AlphaFlow capture protein dynamics, including side-chain conformations and loop rearrangements that single AF2 or native structures cannot represent. These refinements led to modest improvements in AUROC values across all strategies, with significant gains observed in Glide and Gnina (Fig. 6B, Supplementary Fig. 3). No significant differences were observed between MD and AlphaFlow for any strategy (Fig. 6B). While MD outperformed AlphaFlow in Glide and Glide-IFD, AlphaFlow had slight advantages elsewhere. The MD and AlphaFlow results reflect optimal performance across 10 clustered structures within ensembles, but identifying the optimal structure a priori remains challenging. These results align with studies advocating refinement techniques to improve enrichment factors, but differences in force fields or scoring functions among programs may also influence performance [13, 14].

Blind docking approaches enabled ligand placement and performance evaluation without prior binding site information. TankBind with P2Rank-predicted binding sites (TankBind_blind) struggled with peptide structures and certain proteins due to undetected binding pockets. In contrast, EquiBind and DiffDock consistently identified at least one pocket for ligand localization. TankBind_blind and DiffDock distributed ligands across multiple sites, while EquiBind generally predicted a single binding site. However, ligands were inconsistently placed at protein interfaces in large proteins, often scattered across solvent-exposed surface patches (Fig. 7, left). Proteins with a single active site had ligands across the entire interface (Fig. 7, center), while peptides had ligands surrounding the entire structure, likely due to limited binding sites (Fig. 7, right). Many ligands exhibited steric clashes and lacked physical plausibility, consistent with recent studies reporting structurally implausible poses from DL-based blind docking methods, especially EquiBind [34]. Therefore, poor performance in blind docking may arise from ligand misplacement, structurally implausible ligands, or peptide handling issues.

Fig. 7.

Ligand positioning in blind docking approaches tested in this study: EquiBind (yellow), DiffDock (cyan), and TankBind_blind (green). In 1G73, multiple binding sites were predicted by DiffDock and TankBind_blind, but a single binding site was predicted by EquiBind. In 2FLU, a single pocket was identified by all approaches. In 4AJY, EquiBind and DiffDock interacted with different regions of the peptide; however, EquiBind showed significant steric clashes. TankBind_blind failed to predict a binding site in 4AJY

Unlike other docking methods, EquiBind and DiffDock cannot predict binding energy directly; EquiBind provides ligand poses, while DiffDock assigns a confidence score. Binding energy can be estimated by rescoring their poses using Gnina's docking function, but EquiBind poses often led to ligand clashes and unrealistic binding energies and were therefore excluded from further analysis. As with traditional docking methods, no significant AUROC differences were found between PDB and AFnat structures for blind docking approaches (Fig. 6B), nor was any correlation observed between AUROC and DockQ (Supplementary Fig. 4B). Refinement using MD simulations and AlphaFlow ensembles improved DiffDock docking success when binding affinity, rather than the confidence score, was considered. Some complexes processed with TankBind_blind also showed improvements (Fig. 6B). Further analysis by protein and peptide categories yielded similar results, indicating no systematic performance advantage for proteins over peptides (Supplementary Fig. 3). For DL-based blind docking, combining DiffDock with external scoring tools to compute binding affinity is recommended; suitable options include docking scoring functions or MM/GBSA for better discrimination between active and decoy ligands, together with protein structures generated through MD or AlphaFlow.

Complementary metrics for docking effectiveness and binding pocket analysis

AUROC provides an intuitive, threshold-independent measure of how effectively actives and decoys are separated, although it can be overly optimistic under class imbalance. To provide a more complete evaluation and to help guide users in choosing the most appropriate strategy, we also report recall and precision, the area under the precision–recall curve (AUPRC), which emphasizes positive-class performance and is sensitive to class imbalance, and the enrichment factor at 1% (EF1%), which quantifies early retrieval by comparing the fraction of actives found in the top 1% of ranked compounds to the random expectation (Supplementary File 3).
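For reference, a minimal sketch of how AUPRC and EF1% can be computed from ranked docking scores is given below (scikit-learn for AUPRC; the EF formula follows the definition above; variable names and the toy data are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def auprc(labels, scores):
    """Area under the precision-recall curve (higher score = predicted active)."""
    return average_precision_score(labels, scores)

def enrichment_factor(labels, scores, fraction=0.01):
    """EF at a given fraction of the ranked list: actives retrieved in the top
    slice relative to the number expected from a random ranking."""
    labels = np.asarray(labels)
    order = np.argsort(scores)[::-1]              # best-scored compounds first
    n_top = max(1, int(round(fraction * len(labels))))
    hits_top = labels[order][:n_top].sum()
    expected = labels.sum() * n_top / len(labels)
    return hits_top / expected

# Hypothetical screen: 10 actives among 110 compounds (1:10 active-to-decoy ratio)
rng = np.random.default_rng(1)
labels = np.array([1] * 10 + [0] * 100)
scores = labels * 0.5 + rng.normal(size=110)      # weakly informative scores
print(round(auprc(labels, scores), 3), round(enrichment_factor(labels, scores), 2))
```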

Across 16 benchmark PPIs, we observed substantial variation in docking performance both between methods and across individual targets. On target 1E50, for example, AUPRC values ranged from 0.105 (TankBind_local) to 0.769 (Gnina), and from 0.118 (Glide) to 0.932 (TankBind_blind) on 3TDZ. Although all methods performed above the random classifier baseline, overall performance from AutoDock Vina was relatively low (mean AUPRC = 0.299) and the best score was obtained by Gnina (mean = 0.670). Intermediate scores were obtained with Glide, DiffDock, and both TankBind variants, which yielded mean AUPRC values ranging from 0.30 to 0.50, showing moderate power to distinguish active molecules from decoys. As virtual screening typically involves experimental testing of only a small number of top-ranked compounds, we also assessed EF1%. TankBind_local and Glide performed best in terms of early enrichment (mean EF1% = 6.20 and 5.85, respectively), with TankBind_blind and Glide-IFD close behind. In comparison, Gnina, despite its strong global discrimination, had a relatively low mean EF1% of 1.76, only marginally above the random baseline. DiffDock and AutoDock Vina had moderate early enrichments but were highly variable across the different targets. Some PPIs, such as 1E50, 2B4J, 2P8Q, 4FTG, and 8AQN, with small interfaces, generally showed poor early enrichment across most docking programs. Conversely, 1YCR, 2W2M, 5FMK, and 6EPL with larger, more defined interfaces, consistently achieved high enrichment, with mean EF1% values higher than 7.

To better understand these differences, we investigated whether binding site characteristics correlate with docking performance. We analyzed the PPI interfaces with SiteMap [35], calculating SiteScore and Dscore, as indicators of overall ligand-binding potential and druggability, respectively. SiteScore incorporates features of pocket size, enclosure, and hydrophilicity, with scores above 0.8 generally indicating ligand-binding potential. Dscore returns a more druggability-oriented assessment: targets with scores below 0.5 are "difficult," 0.5 to 0.75 "moderately druggable," and above 0.75 "highly druggable" [36]. We observed that highly druggable targets—i.e., 1YCR, 5FMK, and 6EPL, all with Dscore > 0.75 and SiteScore near or above 0.8—tended to yield stronger early enrichment regardless of the docking method. On the other hand, targets with low or medium Dscores, i.e., 1E50 (0.399), 2B4J (0.576), 2P8Q (0.722), 4FTG (0.467), and 8AQN (0.623), tended to have poorer early retrieval. Nevertheless, no strong correlation between pocket descriptors and docking performance was observed, likely due to the relatively limited number of PPIs analyzed.

Collectively, these findings emphasize the importance of selecting a docking protocol based not only on its overall discrimination (e.g., AUPRC) but also its potential for early enrichment (EF1%), depending on the specific goal of the virtual screening campaign (Table 2). If the primary aim is to achieve good global separation between active molecules and decoys, Gnina provides the best ranking performance. On the other hand, for campaigns in which active molecules are desired among top-ranked compounds, methods such as TankBind_local, Glide, TankBind_blind, and Glide-IFD provide superior early enrichment. Notably, TankBind_local offers a balanced compromise between early retrieval and overall ranking performance. The druggability of the binding site also informs method choice: Glide and TankBind_local perform best on highly druggable pockets (SiteScore > 0.8, Dscore > 0.75), while Glide-IFD, TankBind_blind, and DiffDock perform better on sites with moderate druggability (Table 3). For poorly druggable or cryptic pockets, more exhaustive sampling protocols and MD-refined structures, coupled with induced-fit docking, can help recover valid binding poses and improve screening success.
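The decision logic of Table 3 can be summarized in a short helper; the thresholds are the SiteMap cutoffs described above, the tool lists restate the table, and the example descriptor values are hypothetical:

```python
def recommend_tools(sitescore, dscore):
    """Map SiteMap descriptors onto the druggability classes of Table 3 and the
    docking tools that showed the best early enrichment on each class here."""
    if dscore > 0.75 and sitescore > 0.8:
        return "highly druggable", ["Glide", "TankBind_local"]
    if dscore >= 0.5:
        return "moderately druggable", ["Glide-IFD", "TankBind_blind", "DiffDock"]
    return "difficult", ["TankBind_blind", "DiffDock"]

# Hypothetical descriptors for a well-defined interface pocket
print(recommend_tools(sitescore=0.85, dscore=0.80))
```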

Table 2.

Recommended docking tools based on screening objectives and performance metrics

Objective Recommended Tools Rationale
Global separation Gnina Highest average AUPRC, suggesting strong ranking over full dataset, particularly when full ranking matters more than top hits
Early enrichment TankBind_local, Glide, TankBind_blind, Glide-IFD Highest early-retrieval performance, suitable when testing only top-ranked compounds
Balanced performance TankBind_local Balanced EF1% and AUPRC, useful when both overall ranking and early retrieval are critical

Table 3.

Recommended docking tools based on target pocket druggability

Target Type Recommended Tools Rationale
Druggable pocket (SiteScore > 0.8, Dscore > 0.75) Glide, TankBind_local High early enrichment (EF1%) observed across docking tools
Moderately druggable (Dscore 0.5–0.75) Glide-IFD, TankBind_blind, DiffDock Perform better on moderately defined or partially occluded pockets
Poorly druggable (Dscore < 0.5) TankBind_blind, DiffDock Extensive sampling may compensate for weak or ambiguous binding sites
Uncertain or cryptic pockets MD-derived structures with induced fit docking Captures induced fit and flexibility

Conclusions

PPIs play key regulatory roles in the cell and represent promising, yet challenging, therapeutic targets. While most PPI-targeting efforts to date have focused on oncology and immunology, PPI targeting is a possibility in other underrepresented fields as well. For example, targeting essential PPIs in pathogens or host–pathogen interfaces may provide a new avenue for antimicrobial discovery [37–39]. Growing attempts to drug PPIs in different biological situations could enable the identification of novel therapeutics for unmet medical needs. In this study, we evaluated the ability of current docking algorithms to predict how small molecules target interfaces in protein complexes. We benchmarked eight docking strategies using protein complexes from various sources: PDB, AF2, MD, and AlphaFlow. Our results indicate generally modest performance across all docking strategies. AUROC analysis revealed no significant differences in docking performance between high-quality AF2 models and native PDB structures. This suggests that AF2 models can be a promising alternative when protein structures are unavailable, despite the expected low performance of current docking strategies. Refinement via MD simulations or generating structural ensembles with AlphaFlow can significantly improve performance in virtual screening protocols. However, high variability in AUROC values across multiple representative structures from these ensembles indicates that local rearrangements in backbone or side-chain conformations can heavily influence docking performance. Predicting which conformation will yield the best results remains challenging. Conversely, induced fit docking, which grants core interface residues some flexibility, did not significantly improve docking performance, likely due to the limited number of residues allowed flexibility. Local docking protocols like Glide and TankBind_local exhibited higher discriminative power than blind docking approaches, particularly with proteins over peptides, possibly due to better scoring functions or force fields. Inaccurate binding site predictions, steric clashes, and physically implausible poses likely contribute to the reduced performance of blind docking [34]. Similar results have been reported in previous studies using monomers and other docking strategies, emphasizing the current limitations of docking protocols [13, 15–17].

We acknowledge several caveats to this benchmark: (i) the dataset is limited by the number of protein complexes with known active compounds, and ligand activities are sometimes measured under different assay conditions and techniques, which could also affect which ligands are classified as active and which are not; (ii) some structures involve protein-peptide complexes, which complicate the analysis due to the high flexibility of peptides. Docking scoring functions may not perform equally well with proteins and peptides, and the top-ranked pose used to analyze docking performance may not represent the experimental binding orientation [40].

The field of protein structure and complex prediction is rapidly evolving. During manuscript preparation, a new version of AlphaFold, AlphaFold 3, was released [41]. This version offers enhanced capabilities for modeling protein structures in complex with ligands and reportedly provides significantly improved results compared to traditional methods like Vina. However, AlphaFold 3 was not publicly available when this analysis was performed, preventing replication of the study with the latest methodology. Future studies advancing in this direction are expected soon, potentially enhancing the accuracy of current methods and accelerating drug development.

Methods

Compilation of PPI interactions with experimentally validated inhibitors

A combined dataset from 2P2IDB (retrieved 2022-09-07) and ChEMBL (accessed 2023-06-22) was compiled, comprising 16 protein–protein and protein-peptide complexes with confirmed small-molecule modulators (Table 1) [19, 20]. In the ChEMBL database, targets were filtered based on the "PROTEIN–PROTEIN INTERACTION" type. Records were subsequently organized according to the number of compounds with experimentally determined activities, focusing on those with over 25 active compounds. Only entries with experimentally resolved structures of the protein complexes in the PDB were selected. Compounds with a pChEMBL value above 5 were defined as "active," and their ligands were downloaded as SD files. In 2P2IDB, the protein complexes were sorted by the number of known ligands, and only those with over 25 compounds were chosen. Complex 4GQ6 was excluded because the peptide partner consisted of only three residues. Among class 3 complexes associated with Bromodomain/Histone protein–protein interactions, the complex 3UVW was exclusively selected for having the most ligands. For each complex, random ligands with comparable physicochemical properties from the ChemBridge libraries were selected at a 1:10 active-to-decoy ratio.
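A sketch of the kind of property-matched decoy selection described here is shown below (RDKit descriptors; the tolerance windows, the specific descriptors, and the file names are illustrative assumptions, not the exact criteria used for the benchmark):

```python
import random
from rdkit import Chem
from rdkit.Chem import Descriptors

def profile(mol):
    """Simple physicochemical profile used to match decoys to an active."""
    return (Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.NumHDonors(mol), Descriptors.NumHAcceptors(mol))

def pick_decoys(active, library, n_decoys=10, mw_tol=50.0, logp_tol=1.0):
    """Draw n_decoys library molecules whose MW and logP lie close to the active."""
    mw, logp, _, _ = profile(active)
    candidates = [m for m in library
                  if abs(Descriptors.MolWt(m) - mw) <= mw_tol
                  and abs(Descriptors.MolLogP(m) - logp) <= logp_tol]
    return random.sample(candidates, min(n_decoys, len(candidates)))

# Usage sketch (placeholder file names): actives from an SD file, decoys from a library
actives = [m for m in Chem.SDMolSupplier("actives.sdf") if m is not None]
library = [m for m in Chem.SDMolSupplier("screening_library.sdf") if m is not None]
decoys = {Chem.MolToSmiles(a): pick_decoys(a, library) for a in actives}
```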

Protein structures and predictions

The protein structures were retrieved from the PDB, and water molecules, ligands, and additional chains were removed. The FASTA sequences for both the PDB structures and full-length proteins (obtained from UniProt, accessed 2023-06-22) were used to perform protein 3D structure predictions using AF2.

Modeling protein complexes

AlphaFold v2.3.1 was used for modeling [11]. The following database versions were used for predictions: UniRef90 v2022_01, MGnify v2022_05, Uniclust30 v2021_03, BFD, PDB (downloaded 2023-01-10), and PDB70 (downloaded 2023-01-10). Default parameters were used for multiple sequence alignment (MSA) generation and model recycling. The model with the best combined score (ipTM + pTM) was selected. Models with a combined score (ipTM + pTM) > 0.7 were considered high-quality.
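As an illustration of how the best-ranked model can be selected, assuming the ranking_debug.json layout written by AlphaFold-Multimer (a per-model "iptm+ptm" score dictionary); the prediction directory path is a placeholder:

```python
import json
from pathlib import Path

def best_model(prediction_dir, cutoff=0.7):
    """Return the model with the highest combined ipTM + pTM score and whether
    it passes the high-quality cutoff used in this study."""
    ranking = json.loads(Path(prediction_dir, "ranking_debug.json").read_text())
    scores = ranking["iptm+ptm"]                 # {model_name: combined score}
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return name, score, score > cutoff

name, score, high_quality = best_model("predictions/1E50")
print(f"{name}: ipTM+pTM = {score:.3f} (high quality: {high_quality})")
```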

Comparison of models and native structures

The AF2 models, as well as the MD and AlphaFlow ensembles, were compared to their native PDB structures using metrics such as the TM-score (computed with the MM-align package), DockQ, and pDockQ2 [24, 25, 42]. The residue numbering and protein chain names were confirmed to match in both models and native structures before computing scores. Residues with a pLDDT score below 50 were classified as unfolded to estimate the proportion of intrinsically disordered regions in native and AF2 structures.
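AlphaFold2 writes per-residue pLDDT values into the B-factor column of its PDB output, so the disorder estimate described here can be reproduced along the following lines (Biopython; the 50-pLDDT cutoff is the one used above, and the file path is a placeholder):

```python
from Bio.PDB import PDBParser

def fraction_disordered(model_pdb, cutoff=50.0):
    """Fraction of residues classified as unfolded (mean pLDDT below cutoff),
    reading pLDDT from the B-factor column of an AF2 model."""
    structure = PDBParser(QUIET=True).get_structure("model", model_pdb)
    plddt = []
    for residue in structure.get_residues():
        bfactors = [atom.get_bfactor() for atom in residue]
        if bfactors:
            plddt.append(sum(bfactors) / len(bfactors))
    return sum(1 for p in plddt if p < cutoff) / len(plddt)

print(fraction_disordered("AFnat_model.pdb"))
```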

Modeling protein conformational ensembles with AlphaFlow

A total of 250 conformations for each of the 16 native protein complexes were generated using AlphaFlow [27]. The AlphaFlow-MD model was used for inference. Since AlphaFlow does not currently support multimeric structures, a 25-residue glycine linker was inserted to separate the two chains. AlphaFlow ensembles were clustered using an all-to-all RMSD matrix and near-neighbor clustering via maxcluster, with representative structures chosen from the 10 most populated clusters for docking experiments. The structural diversity of AlphaFlow and MD ensembles was compared using an all-to-all RMSD matrix between the 250 AlphaFlow and 250 MD conformations. FoldX 5.0 was used to compute the energy distribution of the ensembles using the "Stability" command, and the structures were repaired beforehand with the "RepairPDB" command [43].
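A minimal sketch of the linker workaround described here (the sequence handling and split-back step are illustrative; AlphaFlow's exact input format should be checked against its repository, and the example sequences are placeholders):

```python
LINKER = "G" * 25  # 25-residue glycine linker separating the two chains

def link_chains(seq_a, seq_b):
    """Concatenate two chain sequences with a poly-glycine linker so the
    complex can be passed to AlphaFlow as a single-chain input."""
    return seq_a + LINKER + seq_b

def split_chains(linked_seq, len_a):
    """Recover the two chains from a linked sequence (drop the linker)."""
    return linked_seq[:len_a], linked_seq[len_a + len(LINKER):]

# Usage sketch with placeholder sequences
chain_a, chain_b = "MKTAYIAKQR", "GSHMLEDPAR"
linked = link_chains(chain_a, chain_b)
assert split_chains(linked, len(chain_a)) == (chain_a, chain_b)
```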

Molecular dynamics simulations

All-atom 500-ns MD simulations were performed for all 16 PPIs with confirmed inhibitors, involving both structures from PDB and predicted AF models. Gromacs version 2022.3 was used to carry out MD simulations [44]. Topologies and coordinates were generated based on the CHARMM36 force field as of July 2022. Systems were solvated within a rhombic dodecahedron TIP3P water-box, maintaining a 1 nm distance between the box edges. Na+ and Cl- counter ions were added to maintain system neutrality. The systems underwent a 100 ps energy minimization to resolve steric clashes and improper geometries, followed by 100 ps each of NVT and NPT equilibration phases. During the NVT phase, the V-rescale temperature-coupling method was employed with a constant coupling interval of 1 ps at 300 K. For the NPT phase, the Parrinello-Rahman pressure coupling method was used with a constant coupling interval of 2 ps at 300 K. Electrostatic forces were computed using the Particle Mesh Ewald method for both the NVT and NPT simulations. The 500 ns simulations were carried out with an integration time step of 0.002 ps, and trajectory data were collected at 100 ps intervals. After completing the simulations, the Gromacs utility "gmx cluster" and the gromos algorithm were used to cluster the resulting trajectories, excluding the first 20 ns. The clustering threshold for each complex was adjusted to yield approximately 30 clusters, ensuring that 90% of frames were within the 10 most populated clusters. The representative structures of the 10 most populated clusters were used for the docking experiments. RMSF plots for C-alpha atoms were generated using the "gmx rmsf" tool, employing the averaged structure from the simulation.
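For readers who prefer a Python route, an equivalent C-alpha RMSF analysis can be sketched with MDAnalysis (an alternative to the gmx rmsf workflow actually used; API as in MDAnalysis 2.x, and topology/trajectory file names are placeholders):

```python
import MDAnalysis as mda
from MDAnalysis.analysis import align
from MDAnalysis.analysis.rms import RMSF

# Load topology and trajectory (placeholder file names)
u = mda.Universe("complex.gro", "traj_500ns.xtc")
calphas = u.select_atoms("protein and name CA")

# Align the trajectory onto the average C-alpha structure before computing RMSF
average = align.AverageStructure(u, select="protein and name CA").run()
align.AlignTraj(u, average.results.universe,
                select="protein and name CA", in_memory=True).run()

rmsf = RMSF(calphas).run()
for resid, value in zip(calphas.resids, rmsf.results.rmsf):
    print(resid, round(float(value), 3))
```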

Docking protocols

In all docking methods, ligands and proteins were prepared using Schrödinger Maestro's LigPrep and Protein Preparation protocols, respectively. For ligands, 3D conformations were generated using the OPLS4 force field, and the processed ligands were saved as SD files. Ligands containing more than 500 atoms were excluded. Proteins underwent preprocessing and optimization, which included the addition of hydrogens and disulfide bonds, filling in missing side chains with Prime, and generating het states using Epik. An energy minimization step was then performed using the OPLS4 force field. Docking was carried out on both protein chains separately, centering the grid on the protein complex interface for local docking tools. To define binding pockets used for our local docking tools, we first identified interface residues using the “Getinterfaces.py” Python script developed by the Oxford Protein Informatics Group (OPIG, https://www.blopig.com/blog/tools/). The grid was centered around the centroid of the interface residues. Grid dimensions were adjusted to match the size of each protein interface. PPI interfaces were analyzed using SiteMap to obtain SiteScore and Dscore, as indicators of ligand-binding potential and druggability, respectively [35]. Detailed parameters for all binding pockets are listed in Supplementary File 4. For each PPI–docking protocol combination, we selected the structure type yielding the highest AUROC and then computed all performance metrics using only that optimal structure (Supplementary File 3). EquiBind was omitted from this evaluation because its predictions resulted in ligand clashes and yielded positive binding energies. For most cases, the selected structures are derived from MD simulations.
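As an illustrative alternative to the Getinterfaces.py script for locating the interface and the grid center (Biopython/NumPy; the 5 Å heavy-atom cutoff, the chain IDs, and the file path are assumptions, not the script's exact criteria):

```python
import numpy as np
from Bio.PDB import PDBParser, NeighborSearch

def interface_center(pdb_file, chain_a, chain_b, cutoff=5.0):
    """Return the interface residues of chain_a (any heavy atom within `cutoff`
    of chain_b) and the centroid of their atoms, used to center the docking grid."""
    model = PDBParser(QUIET=True).get_structure("cplx", pdb_file)[0]
    atoms_b = [a for a in model[chain_b].get_atoms() if a.element != "H"]
    search = NeighborSearch(atoms_b)
    interface = [res for res in model[chain_a]
                 if any(search.search(a.coord, cutoff)
                        for a in res if a.element != "H")]
    coords = np.array([a.coord for res in interface for a in res])
    return interface, coords.mean(axis=0)

residues, center = interface_center("complex.pdb", "A", "B")
print(len(residues), center)
```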

Glide Glide version 2021-2 was used for docking [28]. The Glide Grid Generation module was used to create the docking grid. Inner grid dimensions were fixed at 10 × 10 × 10 Å3, while the size of the outer box varied depending on the protein interface's size (Supplementary File 4). Molecular docking was performed using the standard precision (SP) protocol, and the resulting docking scores were used to evaluate this method's performance. Additionally, an IFD protocol based on Glide and the Refinement module in Prime was conducted.

AutoDock Vina and Gnina AutoDock Vina 1.2.5 and Gnina 1.0 were employed [29, 33]. Protein and ligand files were converted into PDBQT format using Open Babel 3.1.0. A total of 10 docking poses were generated with an exhaustiveness setting of 8, and the highest-ranked pose was selected for further analysis.
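A sketch of the corresponding AutoDock Vina invocation is given below (command-line options of Vina 1.2; receptor/ligand paths and the box geometry are placeholders, with per-complex values listed in Supplementary File 4):

```python
import subprocess

def run_vina(receptor, ligand, center, size, out="poses.pdbqt"):
    """Dock one prepared ligand into the interface-centered grid with AutoDock Vina."""
    cmd = [
        "vina",
        "--receptor", receptor, "--ligand", ligand,
        "--center_x", str(center[0]), "--center_y", str(center[1]), "--center_z", str(center[2]),
        "--size_x", str(size[0]), "--size_y", str(size[1]), "--size_z", str(size[2]),
        "--exhaustiveness", "8", "--num_modes", "10",
        "--out", out,
    ]
    subprocess.run(cmd, check=True)

# Placeholder grid centered on the interface centroid computed earlier
run_vina("receptor.pdbqt", "ligand.pdbqt", center=(10.0, 8.5, -3.2), size=(20, 20, 20))
```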

EquiBind The EquiBind code was downloaded from the authors' publicly available repository [31]. The multiligand_inference.py script was executed to infer the binding structures with default search space options.

DiffDock DiffDock code was obtained from the publicly available repository [32]. The protocol outlined in the README.md file was followed to create the ESM2 embeddings for the proteins and conduct the inference with default search space options. Specific flags used for the inferences were: --inference_steps 20, --samples_per_complex 20, --batch_size 10, --actual_steps 18, and --no_final_step_noise. The confidence scores of the rank 1 poses were used to compute the AUROC values, and these poses were combined with the docking function of Gnina to obtain the binding affinities using the --minimize flag for energy minimization.
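A hedged sketch of this re-scoring step is shown below; it assumes gnina accepts smina-style receptor/ligand flags (-r, -l) alongside the --minimize flag mentioned above, and the file paths are placeholders (exact output parsing may differ between gnina versions):

```python
import subprocess

def rescore_with_gnina(receptor_pdb, pose_sdf):
    """Minimize and score a DiffDock rank-1 pose with gnina's scoring function.
    Returns gnina's text output, from which the affinity can be parsed."""
    result = subprocess.run(
        ["gnina", "-r", receptor_pdb, "-l", pose_sdf, "--minimize"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(rescore_with_gnina("receptor.pdb", "diffdock_rank1.sdf"))
```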

TankBind TankBind code was acquired from the openly accessible repository [30]. TankBind was run in two modes: considering the protein interface as a binding pocket (TankBind_local) and only considering the binding pockets predicted by P2Rank (TankBind_blind) [45]. The procedure specified in the prediction_example_using_PDB_6hd6.ipynb notebook was followed to run the first mode by modifying the code so it considers the protein interface as a pocket instead of the center of the protein. The same notebook was followed for running the second mode, utilizing version 2.3 of P2Rank to predict the binding pockets.

Supplementary Information

Additional file 1. (2.3MB, docx)
Additional file 2. (23.5KB, xlsx)
Additional file 3. (24.7KB, xlsx)
Additional file 4. (17.7KB, xlsx)
Additional file 5. (21.6KB, xlsx)

Abbreviations

AF2

AlphaFold2

AFfull

AlphaFold full

AFnat

AlphaFold native

AUROC

Area under receiver operating characteristic

DL

Deep learning

IFD

Induced-fit docking

ipTM

Interface predicted template modeling

iRMS

Interface root mean squared deviation

MD

Molecular dynamics

MSA

Multiple sequence alignment

PAE

Predicted aligned error

PCA

Principal component analysis

PDB

Protein Data Bank

pLDDT

Predicted local distance difference test

PPI

Protein–protein interaction

pTM

Predicted template modeling

RMSD

Root mean squared deviation

RMSF

Root mean squared fluctuation

RoG

Radius of gyration

Author contributions

M.T.B. designed, directed, obtained funding for, and coordinated the study. J.G.B. performed all experiments and analyses. J.G.B. wrote an initial version of the paper, subsequently edited by M.T.B. All authors contributed to editing and revising the final version of the manuscript.

Funding

Financial support from the Spanish Ministry of Economy and Innovation (PDC2021-121544-I00 funded by MCIN/AEI/ and European Union Next GenerationEU/ PRT.; PID2020-114627RB-I00 funded by MCIN/AEI /10.13039/501100011033; PID2023-152706OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by FEDER, UE, all to M.T.). This work has been co-financed by the Spanish Ministry of Science and Innovation with funds from the European Union NextGenerationEU, from the Recovery, Transformation and Resilience Plan (PRTR-C17.I1) and from the Autonomous Community of Catalonia within the framework of the Biotechnology Plan Applied to Health. J.G.B. is a recipient of a Joan Oró Fellowship from the Generalitat de Catalunya (2023 FI-100278).

Data availability

Data and code to reproduce the results are available from Github: https://github.com/SysBioUAB/docking_benchmark.

Code availability

Scripts used in this study to run the analyses and reproduce the figures have been deposited in the GitHub repository https://github.com/SysBioUAB/docking_benchmark.

Declarations

Ethical approval and consent to participate

Not applicable. No human samples or clinical data were used in this study.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Zhang B, Li H, Yu K, Jin Z (2022) Molecular docking-based computational platform for high-throughput virtual screening. CCF Trans High Perform Comput 4:63–74 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Blanes-Mira C et al (2022) Comprehensive survey of consensus docking for high-throughput virtual screening. Molecules 28:175 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Chaput L, Mouawad L (2017) Efficient conformational sampling and weak scoring in docking programs? Strategy of the wisdom of crowds. J Cheminformatics 9:37 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Singh N, Chaput L, Villoutreix BO (2020) Fast rescoring protocols to improve the performance of structure-based virtual screening performed on protein-protein interfaces. J Chem Inf Model 60:3910–3934 [DOI] [PubMed] [Google Scholar]
  • 5.Schneider G (2010) Virtual screening: an endless staircase? Nat Rev Drug Discov 9:273–276 [DOI] [PubMed] [Google Scholar]
  • 6.Wells JA, McClendon CL (2007) Reaching for high-hanging fruit in drug discovery at protein–protein interfaces. Nature 450:1001–1009 [DOI] [PubMed] [Google Scholar]
  • 7.Scott DE, Bayly AR, Abell C, Skidmore J (2016) Small molecules, big targets: drug discovery faces the protein–protein interaction challenge. Nat Rev Drug Discov 15:533–550 [DOI] [PubMed] [Google Scholar]
  • 8.Chen Z et al (2009) Pharmacophore-based virtual screening versus docking-based virtual screening: a benchmark comparison against eight targets. Acta Pharmacol Sin 30:1694–1708 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Chaput L, Martinez-Sanz J, Saettel N, Mouawad L (2016) Benchmark of four popular virtual screening programs: construction of the active/decoy dataset remains a major determinant of measured performance. J Cheminformatics 8:56 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Akdel M et al (2022) A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol. 10.1038/s41594-022-00849-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Jumper J et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Zhang Y et al (2023) Benchmarking refined and unrefined alphafold2 structures for hit discovery. J Chem Inf Model. 10.1021/acs.jcim.2c01219 [DOI] [PubMed] [Google Scholar]
  • 13.Scardino V, Di Filippo JI, Cavasotto CN (2023) How good are AlphaFold models for docking-based virtual screening? IScience 26:105920 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Wong F et al (2022) Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Mol Syst Biol 18:e11081 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Holcomb M, Chang Y-T, Goodsell DS, Forli S (2023) Evaluation of AlphaFold2 structures as docking targets. Protein Sci 32:e4530 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Karelina M, Noh JJ, Dror RO (2023) How accurately can one predict drug binding modes using AlphaFold models? Elife 12:89386 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Díaz-Rovira AM et al (2023) are deep learning structural models sufficiently accurate for virtual screening? Application of docking algorithms to AlphaFold2 predicted structures. J Chem Inf Model 63:1668–1674 [DOI] [PubMed] [Google Scholar]
  • 18.Peng, Y. et al. Assessment of AlphaFold Structures and Optimization Methods for Virtual Screening. 10.1101/2023.01.10.523376 (2023)
  • 19.Mendez D et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930–D940 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Basse M-J, Betzi S, Morelli X, Roche P (2016) 2P2Idb v2: update of a structural database dedicated to orthosteric modulation of protein–protein interactions. Database J. Biol. Databases Curation 2016:007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Gómez Borrego J, Torrent Burgas M (2024) Structural assembly of the bacterial essential interactome. Elife 13:e94919 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. 2021.10.04.463034 Preprint at 10.1101/2021.10.04.463034 (2022).
  • 23.Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57:702–710 [DOI] [PubMed] [Google Scholar]
  • 24.Basu S, Wallner B (2016) DockQ: a quality measure for protein-protein docking models. PLoS ONE 11:e0161879 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Zhu W, Shenoy A, Kundrotas P, Elofsson A (2023) Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics 39:424 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Guterres H, Im W (2020) Improving protein-ligand docking results with high-throughput molecular dynamics simulations. J Chem Inf Model 60:2189–2198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Jing B, Berger B & Jaakkola T. AlphaFold meets flow matching for generating protein ensembles. Preprint at 10.48550/arXiv.2402.04845 (2024).
  • 28.Friesner RA et al (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47:1739–1749 [DOI] [PubMed] [Google Scholar]
  • 29.McNutt AT et al (2021) GNINA 1.0: molecular docking with deep learning. J. Cheminformatics 13:43 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Lu W et al (2022) TANKBind: trigonometry-aware neural networks for drug-protein binding structure prediction. Adv Neural Inf Process Syst 35:7236–7249 [Google Scholar]
  • 31.Stärk H, Ganea O-E, Pattanaik L, Barzilay R, Jaakkola T. EquiBind: geometric deep learning for drug binding structure prediction. Preprint at 10.48550/arXiv.2202.05146 (2022).
  • 32.Corso G, Stärk H, Jing B, Barzilay R, Jaakkola T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. Preprint at 10.48550/arXiv.2210.01776 (2023).
  • 33.Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading. J Comput Chem 31:455–461 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Buttenschoen M, Morris GM, Deane CM. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Preprint at http://arxiv.org/abs/2308.05777 (2023). [DOI] [PMC free article] [PubMed]
  • 35.Halgren TA (2009) Identifying and characterizing binding sites and assessing druggability. J Chem Inf Model 49:377–389 [DOI] [PubMed] [Google Scholar]
  • 36.Alzyoud L, Bryce RA, Al Sorkhy M, Atatreh N, Ghattas MA (2022) Structure-based assessment and druggability classification of protein–protein interaction sites. Sci Rep 12:7975 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Crua Asensio N, Muñoz Giner E, de Groot NS, Torrent Burgas M (2017) Centrality in the host–pathogen interactome is associated with pathogen fitness during infection. Nat Commun 8:14092 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.de Groot NS, Burgas MT (2020) Bacteria use structural imperfect mimicry to hijack the host interactome. PLOS Comput Biol 16:e1008395 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Gómez Borrego J, Torrent Burgas M (2022) Analysis of host-bacteria protein interactions reveals conserved domains and motifs that mediate fundamental infection pathways. Int J Mol Sci 23:11489 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Ramírez D, Caballero J (2018) Is It Reliable to take the molecular docking top scoring position as the best solution without considering available structural data? Molecules 23:1038 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Abramson J et al (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 10.1038/s41586-024-07487-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Mukherjee S, Zhang Y (2009) MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res 37:e83 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Delgado J, Radusky LG, Cianferoni D, Serrano L (2019) FoldX 5.0: working with RNA, small molecules and a new graphical interface. Bioinformatics 35:4168–4169 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Abraham MJ et al (2015) GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2:19–25 [Google Scholar]
  • 45.Krivák R, Hoksza D (2018) P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminformatics 10:39 [DOI] [PMC free article] [PubMed] [Google Scholar]


