Author manuscript; available in PMC: 2026 Mar 10.
Published in final edited form as: J Chem Inf Model. 2025 Feb 14;65(5):2180–2190. doi: 10.1021/acs.jcim.4c02296

The Need for Continuing Blinded Pose- and Activity Prediction Benchmarks

Christian Kramer 1, John Chodera 2, Kelly L Damm-Ganamet 3, Michael K Gilson 4, Judith Günther 5, Uta Lessel 6, Richard A Lewis 7, David Mobley 8, Eva Nittinger 9, Adam Pecina 10, Matthieu Schapira 11, W Patrick Walters 12
PMCID: PMC12818095  NIHMSID: NIHMS2136461  PMID: 39951479

Abstract

Computational tools for structure-based drug design (SBDD) are widely used in drug discovery and can provide valuable insights to advance projects in an efficient and cost-effective manner. However, despite the importance of SBDD to the field, the underlying methodologies and techniques have many limitations. In particular, binding pose and activity predictions (P-AP) are still not consistently reliable. We strongly believe that a limiting factor is the lack of a widely accepted and established community benchmarking process that independently assesses the performance and drives the development of methods, similar to the CASP benchmarking challenge for protein structure prediction. Here, we provide an overview of P-AP, unblinded benchmarking datasets and blinded benchmarking initiatives (concluded and ongoing) and offer a perspective on learnings and the future of the field. To accelerate a breakthrough on the development of novel P-AP methods, it is necessary for the community to establish and support a long-term benchmark challenge that provides non-biased training/test/validation sets, a systematic independent validation, and a forum for scientific discussions.

Introduction

Structure-based drug design (SBDD) has become a core method in drug discovery. It has been applied to many different protein families.1,2 SBDD presumes that a small molecule binds to its target protein in a fixed pose around which it fluctuates due to thermal motion. While SBDD has contributed to numerous clinical candidates, many of the underlying techniques still have limitations. Chief among these is binding pose prediction for a small molecule in a target binding pocket, based on a variety of algorithms. Determining this pose is a key component of all structure-based design techniques, as it enables drug discovery teams to evaluate and optimize the interactions between a proposed molecule and a protein binding site.

The pose-finding workflow has two components: sampling, which generates a pose for a molecule in a binding site, and scoring, which ranks the poses. For a manageably small number of compounds, one may use detailed, more accurate (but costlier) simulation-based methods to estimate affinities.3–5 For virtual screening of large compound libraries, physical simplifications, such as neglecting protein flexibility, entropy and solvation, are typically employed to reduce the computational time. The recent advent of purchasable ultra-large chemical spaces has further magnified the need for fast and efficient ways to virtually screen large datasets of compounds, such as in ultra-large virtual screening (ULVS) applications. Classical docking methods based on physical principles have been combined with machine learning/deep learning (ML/DL) models. The result can be an active learning (AL) approach in which a small fraction (1–10%) of the overall dataset is docked and used to build an ML model that predicts the docking score. The ML model, which is typically several orders of magnitude faster than docking, is then used to score the entire dataset.6 This process is typically repeated for 3–5 cycles to obtain a final set of molecules for purchase or synthesis.
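
The active learning loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow of any cited method: `dock` is a hypothetical toy stand-in for a real docking program, molecules are represented by made-up two-dimensional feature vectors, and the surrogate is a simple nearest-neighbour average rather than a trained ML/DL model.

```python
import random

def dock(features):
    """Stand-in for an expensive docking run (hypothetical toy score, lower is better)."""
    x, y = features
    return (x - 0.3) ** 2 + (y + 0.5) ** 2

def surrogate_predict(labelled, query, k=3):
    """Cheap surrogate: mean docking score of the k nearest already-docked molecules."""
    nearest = sorted(
        labelled,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)

def active_learning_screen(library, fraction=0.05, cycles=3, seed=0):
    """Dock a small fraction per cycle and let the surrogate choose the next batch."""
    rng = random.Random(seed)
    batch = max(1, int(fraction * len(library)))
    # Cycle 0: dock a random subset to obtain initial training data.
    docked = {i: dock(library[i]) for i in rng.sample(range(len(library)), batch)}
    for _ in range(cycles):
        labelled = [(library[i], score) for i, score in docked.items()]
        # Score every not-yet-docked molecule with the fast surrogate.
        predicted = {
            i: surrogate_predict(labelled, library[i])
            for i in range(len(library))
            if i not in docked
        }
        # Dock only the molecules the surrogate ranks best.
        for i in sorted(predicted, key=predicted.get)[:batch]:
            docked[i] = dock(library[i])
    return docked  # index -> docking score for every molecule actually docked

# Usage with a made-up library of 1000 random feature vectors.
rng = random.Random(1)
library = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
docked = active_learning_screen(library)
best = min(docked, key=docked.get)
print(len(docked), "of", len(library), "molecules docked; best index:", best)
```

The point of the design is that the expensive function is called on only a fraction of the library per cycle, while the surrogate is evaluated on everything else.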

Both pose and activity prediction (P-AP) are relevant across all preclinical drug design phases. Based on predicted poses, binding affinities to on- and off-targets can be calculated during lead optimization using molecular dynamics simulation techniques, such as relative (RBFE) or absolute (ABFE) binding free energy calculations.3,4,7 Emerging generative algorithms that optimize a chemical structure with respect to a particular scoring function depend heavily on P-AP approaches to direct the chemical modifications. Small changes need to be scored accurately, so the accuracy of the activity prediction function needs to be sufficiently high. All SBDD, artificial intelligence (AI) and ULVS applications critically depend on reliable P-AP methods.

Despite its importance to drug design and decades of development, P-AP is still not a solved challenge. P-AP fails too often, in particular for novel targets where the structures of co-complexes have yet to be solved. Even when the structures of co-complexes are known, P-AP does not always perform well. Zev et al. showed in a study of Mpro X-ray complexes8 that the experimental pose could be regenerated within 2 Å RMSD for only 26% of non-covalently bound ligands and 46% of covalent inhibitors. It is very difficult to quantify the current state of the art, as there is no independent, continuous benchmarking effort. The protein-ligand (P-L) structure and affinity prediction field has seen several benchmarking initiatives9–13 (for more detail see section “Past and Present Protein-Ligand Binding Benchmarks” below). Unfortunately, most of these have been discontinued, usually due to lack of sustained funding. Currently, there is no established community benchmark that independently assesses and thus drives development of novel P-AP methods.

The protein structure prediction field, which recently experienced its “AlphaFold moment”, was able to achieve positive progress primarily due to the CASP benchmarking challenge which has now been run through 15 iterations over 30 years. Although the latest CASP15 included ligand-containing structures, only a small fraction of the complexes were relevant for drug discovery.14 CASP16 will continue to feature a section on protein-ligand predictions, but CASP has not so far become central to the P-AP community, as it has to the protein structure prediction community, and stable community commitment would be needed for this to occur.

With this perspective, we want to bring attention to the need for a widely accepted, ongoing, benchmarking process for P-AP methods. We are united in the belief that P-AP approaches, with all their physical approximations, can and need to be substantially better than they are today. Novel physical approximations and AI/ML algorithms, together with more and better data to train and validate activity prediction approaches, will provide substantial improvements. The potential benefit for drug discovery, and ultimately patients, will be huge, as P-AP is still a key technology for boosting the efficiency of drug design. Developing activity prediction approaches will be a long-term effort that requires continuous enablement, not only monetarily but also by supplying diverse high-quality structures and affinity data to enable rigorous benchmarking and feedback on methodological improvements. This is especially true for testing of AI/ML models which make increasing use of available data for training; such models need new, diverse data for fair benchmarking.

In what follows, we briefly review various algorithmic approaches to P-AP. Then, we provide an overview of evaluation datasets and benchmarks related to P-AP that have either concluded or are still ongoing and discuss learnings from them. Finally, we discuss open issues in P-AP and requirements for the next generation of benchmarks and conclude with a perspective on the future of the field.

Approaches to Pose & Activity Prediction

P-AP approaches are regularly reviewed,15–17 so we only give a short overview of the general directions.

Pose Prediction Overview

Pose prediction algorithms attempt to predict the most stable pose of a small molecule bound to a target protein. Classic algorithms typically consist of two components: a sampling method that establishes poses and a scoring method that ranks the stabilities of the poses. In many cases, the sampling and scoring components work in concert. An initial set of poses is sampled and scored, then the most promising poses are subjected to additional sampling and scoring. It should be noted that scores obtained from docking programs only provide a rough estimate of the “fit” of a small molecule in a protein binding pocket and should not be considered “activity predictions”. The prediction of binding affinity is typically a separate procedure that is performed on a docked pose.

Pose generation algorithms have been developed in two waves. The first wave, which we term classical methods,18–25 uses a combination of physics-based, knowledge-based, and empirical terms to define and evaluate the poses of ligands in a protein binding site. The second wave is based on deep learning architectures like DiffDock26 and EquiBind.27 These networks are trained directly on known 3D structures without any human-imposed constraints on preferred geometries. While these methods show promise, they are not mature and the generated structures can be problematic.28

Additional criteria are often used together with activity prediction approaches to filter out physically unrealistic poses. These include terms such as electrostatic and van der Waals energies, rotatable bond penalties, lipophilic contacts, binding surface area, ligand strain energy, as well as interaction fingerprints, shape similarity, and pharmacophore points on known interaction hotspots.

Activity Prediction Overview

Many approaches have been developed for activity prediction, which is typically approached through affinity prediction. Standard scoring functions have primarily been designed for the initial screening of extensive drug candidate libraries. Here, the total number of P-L complexes to be evaluated grows extremely rapidly, considering that the number of compounds must be multiplied by their numerous protonation and tautomer states, as well as flexibility in the protein binding site. Therefore, these scoring functions were designed to highlight plausible ligand binding modes rather than rank them accurately by their binding free energies. Their efficiency relies on severe approximations, which limit their accuracy in predicting binding affinities.29 Structure-based deep-learning scoring functions either augment standard scoring functions with additional machine-learning corrections or use standalone machine-learning algorithms to estimate affinity directly from the structure of the P-L complex. These deep-learning-based approaches are fast but not necessarily as reliable.

Computationally less efficient but more rigorous methods include MM-GB/PBSA30 and the linear interaction energy (LIE) method.31 These are typically based on molecular dynamics simulations and so are intermediate in both accuracy and computational cost between standard scoring functions and rigorous perturbation methods. The MM/GBSA method, for example, offers the decomposition of binding free energies32 into individual terms and residues, but its predictive power is limited.30 A more rigorous approach is the direct computation of binding free energies from MD simulation, represented by thermodynamic integration33 (TI) and free energy perturbation (FEP) methods. These methods offer the advantage of a theoretically sound derivation of the computed quantities based on statistical mechanics. They also blur the strict separation between pose prediction and activity prediction, as many local poses are sampled during a simulation. Calculating simulation-based free energies can take up to tens of GPU hours per molecule. The accuracy to be expected is highest for congeneric series of small molecules that only differ by a small part of the molecule and may be compromised by the use of approximate force fields.34 Poor 3D starting points for binding free energy calculations may have a substantial negative effect on the predicted affinities. Mining Minima provides an alternate, rigorously grounded approach that uses a force field and an implicit solvent model and falls between docking and simulation-based free energy methods in terms of speed and accuracy.35

Quantum Mechanics-based (QM) methods have been recently utilized to avoid the limitations imposed by the simplified formalism of both the conventional scoring functions and simulation force fields.36 QM methods are currently too computationally demanding to be routinely applied to P-L systems. Hybrid QM/MM schemes (where only a small part of the system is treated at the QM level), fragmentation schemes and semiempirical methods have been utilized in the physics-based end-point scoring functions.37,38

Consensus scoring, the application of different activity prediction and combination approaches to the same complex, has repeatedly been attempted. This is based on the assumption that a single scoring function is not optimal, but different scoring functions can compensate for weakness of the individual ones.39,40
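
As one illustration of how scores from different functions might be combined, the sketch below averages per-function ranks. Rank averaging is only one possible combination rule, chosen here for simplicity; the text does not specify a particular scheme. The compound names and score values are made up, lower scores are assumed to be better for every function, and ties are not treated specially.

```python
def rank_positions(scores):
    """Map each compound to its rank (1 = best), assuming a lower score is better."""
    ordered = sorted(scores, key=scores.get)
    return {name: position for position, name in enumerate(ordered, start=1)}

def consensus_rank(score_tables):
    """Average the rank of each compound over several scoring functions."""
    ranks = [rank_positions(table) for table in score_tables]
    names = score_tables[0].keys()
    return {name: sum(r[name] for r in ranks) / len(ranks) for name in names}

# Made-up scores from three hypothetical scoring functions on different scales.
function_a = {"cpd1": -9.0, "cpd2": -7.5, "cpd3": -8.1}
function_b = {"cpd1": -41.0, "cpd2": -30.0, "cpd3": -42.0}
function_c = {"cpd1": 0.2, "cpd2": 0.9, "cpd3": 0.5}
consensus = consensus_rank([function_a, function_b, function_c])
for name in sorted(consensus, key=consensus.get):
    print(name, round(consensus[name], 2))
```

Working with ranks rather than raw scores avoids having to put scoring functions with different units and ranges on a common scale.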

Judging P-AP approaches

There are three classic criteria by which P-AP approaches are judged: pose prediction, ranking/early enrichment, and correlation of predicted and measured binding affinities. For pose prediction, geometries of predicted poses are compared to experimentally determined poses, typically from cryo-EM or X-ray crystallography. The identification of the ligand near-native or native pose within all generated poses as the minimum-free-energy structure is an unambiguous criterion for the performance of any scoring/sampling method. The difficulty of classic pose prediction generally varies between self-docking (i.e. docking the ligand into a perfectly pre-formed protein binding pocket) and cross-docking into (1) experimental holo structures, (2) experimental apo structures and (3) structures obtained through structure prediction techniques that start with an amino-acid sequence. When comparing predictions to experimentally determined poses, it is important to note that the experimental pose is also a model which has been fit into calculated electron density. The quality of the experimental pose fit should always be confirmed when utilizing this criterion.
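
A minimal sketch of the geometric comparison is given below. It assumes that predicted and experimental ligand coordinates share the same atom ordering and the same protein reference frame, and it applies no correction for molecular symmetry, which a production tool would need. The 2 Å default mirrors the threshold quoted in the Introduction; the coordinates are made up.

```python
import math

def pose_rmsd(predicted, reference):
    """Root-mean-square deviation (in Å) between two poses with matching atom order."""
    if len(predicted) != len(reference):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(predicted, reference)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(reference))

def success_rate(rmsds, threshold=2.0):
    """Fraction of predictions whose RMSD is at or below the threshold."""
    return sum(r <= threshold for r in rmsds) / len(rmsds)

# Made-up two-atom ligand, with the predicted pose shifted by 1 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(pose_rmsd(predicted, reference))
print(success_rate([0.5, 1.9, 2.5, 4.0]))
```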

Ranking and early enrichment are typically used to examine the virtual screening power of a P-AP setup. Here, the goal is to assign a higher score to known active ligands and lower scores to inactive decoy compounds. Success can be measured as the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, or as the enrichment of known active ligands within the top X% of a rank-ordered list. Success can also be measured as the reduction of the number of false negatives and false positives. However, in lead optimization campaigns, most or all of the compounds of interest are likely to be active, so the goal is to rank candidate ligands by their affinities: performance is evaluated by calculating the correlation between the experimental and calculated binding affinity, or the difference (e.g. RMSD) between the predicted and true values. All three classic criteria, i.e. posing, early enrichment, and affinity prediction, come with their own practical challenges for the evaluation, as will be discussed in later sections.
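
The two virtual screening metrics named above can be sketched in a few lines. The example assumes that a higher score means "more likely active" (for docking scores where lower is better, the sign would need to be flipped), and the scores and labels are made up.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a random active outscores a random decoy."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    # Ties between an active and a decoy count as half a win.
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))

def enrichment_factor(scores, labels, top_fraction=0.01):
    """Hit rate in the top-ranked fraction divided by the hit rate of the whole set."""
    n_total = len(scores)
    n_top = max(1, round(top_fraction * n_total))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(is_active for _, is_active in ranked[:n_top])
    return (hits_in_top / n_top) / (sum(labels) / n_total)

# Made-up screen: 3 actives (label 1) among 10 compounds.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(roc_auc(scores, labels))
print(enrichment_factor(scores, labels, top_fraction=0.2))
```

The pairwise AUC formulation is quadratic in the number of compounds, which is fine for a sketch but would be replaced by a rank-based computation for large screens.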

When setting up protein systems for P-AP predictions, structural approximations need to be made: Which protein structure should be docked into? Which rotamers, tautomers, and protomers should be used? Which water molecules are important to consider? Water is well known to mediate crucial interactions for binding, but should the water be considered as a part of the ligand or the protein? Those questions often do not have clear scientific answers, and answers may be system- and method-dependent. Covalent binding, metal coordination, or cofactors can introduce additional layers of complexity that need careful consideration. The ultimate performance depends on the individual user bringing in their expert understanding of the system.41 While commercial tools typically come with a pipeline to prepare protein-ligand systems, there is a lack of good open-source tools for doing so. Additionally, the quality of system preparation in isolation is difficult to benchmark, making it challenging to assess the utility of tools in this space.

Deep Learning-based P-AP approaches

The recent surge in deep learning (DL) has led to numerous publications with novel DL-based methods for pose prediction. They can be differentiated into pocket-based docking with pre-defined pockets, such as DeepDock42 or Uni-Mol,43 as well as blind pose prediction, e.g. DiffDock,26 EquiBind,27 or TankBind,44 where the binding site is identified by the method itself. A recent publication by Deane and coworkers pointed out that many evaluations of these are flawed because evaluation criteria are inadequate for these methods.26 DL-based methods need a lot of data to learn the physics-based principles that are hard-coded in the conventional methods. Most pose prediction evaluation criteria developed for conventional methods do not test for physical correctness (e.g. the flatness of aromatic systems, or the absence of intramolecular clashes). Deane et al. recently showed that those algorithms are immature, as the local small molecule geometries generated can be physically highly implausible, and when presented with an unbiased test, none of the current DL-based pose prediction algorithms performs better than conventional algorithms.28 This is particularly true if the DL-based methods are applied to structures that they have never seen before.28,50,51
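
To make the idea of a physical-correctness test concrete, the sketch below flags intramolecular clashes in a predicted ligand pose. It is an illustration only and does not reproduce the criteria of any published benchmark: it considers heavy atoms only, ignores atom pairs that are directly bonded or share a bonded neighbour, and uses an arbitrary 2.0 Å distance cutoff chosen for this example.

```python
import math
from itertools import combinations

def find_clashes(coords, bonds, min_distance=2.0):
    """Return non-bonded heavy-atom pairs that sit closer than min_distance (Å)."""
    neighbours = {i: set() for i in range(len(coords))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    clashes = []
    for i, j in combinations(range(len(coords)), 2):
        # Skip 1-2 pairs (bonded) and 1-3 pairs (sharing a neighbour).
        if j in neighbours[i] or neighbours[i] & neighbours[j]:
            continue
        if math.dist(coords[i], coords[j]) < min_distance:
            clashes.append((i, j))
    return clashes

# Made-up four-atom chain: an extended pose and a pose folded back on itself.
bonds = [(0, 1), (1, 2), (2, 3)]
extended = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (0.5, 0.5, 0.0)]
print(find_clashes(extended, bonds))
print(find_clashes(folded, bonds))
```

A pose can pass an RMSD threshold and still fail a check of this kind, which is why the two are complementary.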

More recently, the field has seen the emergence of a new docking approach referred to as “co-folding”. This method, which has its origins in protein structure prediction, simultaneously generates the structure of the protein while predicting a ligand binding pose. Such models generate the protein structure based on the amino acid sequence information and include the ligand during this process. This way, protein structure generation and ligand pose generation are tackled simultaneously, which would solve current induced fit binding issues. Perhaps the most noteworthy example of co-folding is AlphaFold3,45 a successor to the widely acclaimed AlphaFold2.46 Co-folding approaches learn the relationships between proteins and ligands from structures in the PDB. Some of these methods can reproduce the poses of covalently bound ligands. The published methods include NeuralPLexer,47 RoseTTAFold All-Atom,48 and AlphaFold3,45 some of which have performed well on the PoseBusters28 benchmark. However, the limited availability of program code and a suitable shared benchmark has prevented broader benchmarking by the community.

Blinded vs unblinded benchmarking

It is necessary to differentiate between blinded and unblinded benchmarking when judging pose prediction and activity prediction approaches. Unblinded benchmarking is typically done by the authors of new approaches, using a publicly available dataset. Those datasets often have some relationship to the model training set, for example by being split out of a ground truth pool before training (such as by a random partition into train/validation/test sets, with the test set only used for evaluation of the final model). Thus, validation and test sets may have the same basic distribution as the training dataset, which can result in an unfair evaluation if the ultimate goal is to predict an unseen new protein-ligand system (the core point of discovery and lead optimization in Pharma is to discover and/or optimize new chemical matter). If standard datasets are used for the evaluation, there is a real risk that samples overlap between training and evaluation sets. Beyond this, judging what is inside and outside of the distribution is very hard, and novel proposals for how to best do this are still being published. Overlap between training and evaluation sets is a particular concern for AI/ML-based scoring functions that can potentially memorize training samples.28,49,50
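
A first, coarse check for overlap between training and evaluation sets can be sketched as below. The identifiers are hypothetical, and the check only detects identical protein-ligand pairs and shared proteins; detecting related proteins or similar ligands would require sequence and chemical similarity measures that are not shown here.

```python
def overlap_report(train, test):
    """Classify each (protein, ligand) test pair by its relationship to the training set."""
    train_pairs = set(train)
    train_proteins = {protein for protein, _ in train}
    report = {"exact_duplicate": [], "known_protein": [], "novel_protein": []}
    for pair in test:
        if pair in train_pairs:
            report["exact_duplicate"].append(pair)  # memorization is possible
        elif pair[0] in train_proteins:
            report["known_protein"].append(pair)    # same target, new ligand
        else:
            report["novel_protein"].append(pair)    # target unseen in training
    return report

# Hypothetical identifiers, for illustration only.
train = [("protA", "lig1"), ("protA", "lig2"), ("protB", "lig3")]
test = [("protA", "lig1"), ("protA", "lig9"), ("protC", "lig4")]
for category, pairs in overlap_report(train, test).items():
    print(category, pairs)
```

Reporting performance separately for each category makes it visible how much of a headline number comes from cases the model may already have seen.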

In contrast, blinded benchmarking is performed on completely independent, new data that has not been previously available to the creators of a novel P-AP approach. Blinded benchmarking sets a higher bar than unblinded benchmarking, as it seeks to make it impossible for models to cheat or accidentally give artificially inflated performance. To guarantee that benchmark data has not been used to train a model, blinded benchmarking either has to be done by an independent organization on unreleased or secret data, or P-AP approaches have to be locked and containerized at a certain point in time and then tested on new data. For an evolving discipline, this means that blinded benchmark datasets must regularly be refreshed or updated with the latest data. There are examples for both approaches, and we will discuss them below.
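
The "lock the method, then test on newer data" variant amounts to a strict time split, sketched below. The record layout, identifiers, and dates are made up for illustration; a real benchmark would additionally need to verify that the locked method cannot access data released after the lock date.

```python
from datetime import date

def blinded_subset(entries, model_locked_on):
    """Keep only entries released strictly after the method was frozen."""
    return [entry for entry in entries if entry["released"] > model_locked_on]

# Hypothetical structure records with made-up identifiers and release dates.
entries = [
    {"id": "complex_001", "released": date(2023, 11, 1)},
    {"id": "complex_002", "released": date(2024, 3, 15)},
    {"id": "complex_003", "released": date(2024, 6, 30)},
]
lock_date = date(2024, 1, 1)
print([entry["id"] for entry in blinded_subset(entries, lock_date)])
```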

Benchmarking datasets

Although we favor blinded benchmarking as final assessment, unblinded benchmarking datasets are very useful for P-AP development. Unblinded benchmarking datasets for pose prediction are typically curated directly from the PDB. These sets differ in the years of publication, size, and criteria used to measure experimental quality.51–54

Datasets for ranking include decoys,55–58 whereas later datasets include property-matching of decoys59,60 and ensure that decoys are experimentally validated (i.e. measured inactives). Negative data is published less often and is thus more difficult to compile for a dataset.

Binding free energy prediction has benchmarking datasets of different kinds. There are large datasets comprising protein-ligand systems with experimentally determined cocrystal structures and measured Ki or IC50 values. Structures typically come from the PDB. The PDBbind database, underlying the CASF benchmark (evaluation according to our nomenclature), and BindingMOAD61 are such datasets. There are also much larger datasets of binding free energies that do not include crystal structures, notably BindingDB62 and ChEMBL.63 One challenge with all of these datasets is that Ki and IC50 values are not really transferable across assays.64,65 It is safest to evaluate ranking only within each assay series, for example as in a recent publication introducing the PL-REX benchmark dataset consisting of 10 diverse protein-ligand series with consistently measured affinities and high-resolution X-ray crystal structures.38 There are also collections of congeneric series of ligands crystallized or template-docked into the same target structure, with activity values that have been measured in the same assay setup,5,34,66,67 as well as the OpenFF benchmark set (in process).

Issues with evaluation datasets

Apart from the risk that evaluation datasets overlap with the training data, there are a number of problems with current benchmarking datasets:

  • Decoys are often not experimentally confirmed (may be true actives).

  • Pose prediction: docking into the native structure is overly simple; cross-docking is a more realistic scenario.

  • Docking evaluations based on X-ray structures do not evaluate how activity prediction approaches account for nonphysical poses, i.e. clashes (see comment on DL-based scoring function above).

  • Public benchmarking datasets based on compilations from mixed assays can have a huge spread in measured values (PDBbind has an activity range of 2–12 log units). While this is not a problem per se, evaluators need to be aware that the R² values observed from such datasets do not translate to R² values observed on datasets obtained from a single assay, which typically has a spread of 4 log units (10 μM to 1 nM). Also, affinity values need to be log-transformed, as the logarithm of the fold difference corresponds linearly to a change in the Gibbs free energy of binding.

  • Public benchmarking datasets (as opposed to blinded benchmarks) do not allow comparing different P-AP approaches, since it is very hard to control the overlap of training and evaluation data. Note that overlap does not necessarily mean that the identical protein-ligand structure is part of both datasets. A setup where representatives of the same protein family with very similar ligands are part of both datasets also renders the evaluation much less meaningful. In addition, different methods usually are tested against different subsets of the publicly available data, and this inconsistency introduces uncontrolled variation into any comparison of methods.
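
The log transformation mentioned in the list above can be illustrated with the standard relation ΔG = RT ln Kd (standard state 1 M). The sketch below assumes a temperature of 298.15 K and treats the input as a dissociation constant in mol/L; using it for Ki is a common approximation, while IC50 values additionally depend on assay conditions.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def p_activity(value_molar):
    """Negative decadic logarithm of an affinity value given in mol/L (e.g. pKi)."""
    return -math.log10(value_molar)

def binding_free_energy(kd_molar, temperature=298.15):
    """Standard binding free energy in kcal/mol from a dissociation constant."""
    return R_KCAL * temperature * math.log(kd_molar)

# A 10 μM to 1 nM series spans 4 log units.
print(p_activity(1e-9) - p_activity(1e-5))
# Each tenfold gain in affinity changes the free energy by the same amount.
print(binding_free_energy(1e-9) - binding_free_energy(1e-8))
print(binding_free_energy(1e-6) - binding_free_energy(1e-5))
```

At this temperature each tenfold change in affinity corresponds to roughly 1.36 kcal/mol, which is why errors and correlations are computed on the log scale rather than on raw concentrations.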

Past and Present Protein-Ligand Binding Benchmarks

The need for rigorous blinded benchmarking has motivated the organization of community-wide prediction challenges in this space since the first SAMPL challenge in 2008.68 The SAMPL challenges, most recently SAMPL9,69 have maintained a distinctive focus on the calculation of physical properties simpler than protein-ligand affinities as tests of the fundamental reliability of physics-based methods, though they have also included some protein-ligand binding challenges.70–72 For about a decade, the U.S. National Institutes of Health supported the CSAR series of protein-ligand pose- and affinity-prediction challenges,73–75 followed by the D3R challenges.9,76–78 The TDT Challenge, sponsored by the American Chemical Society, combined the generation of tutorials with a prospective prediction component where selected molecules were screened experimentally.10 With the sunsetting of CSAR, D3R, and TDT, the CASP challenge series, which has traditionally focused on protein structure prediction, has branched into protein-ligand pose prediction, starting with CASP15 (ref 79) and heading into CASP16. The recently launched CACHE challenges are benchmarking hit finding methods where the predicted compounds are experimentally tested. Therefore, CACHE is complementary to benchmarking initiatives focused on P-AP.12,80

The protein-ligand challenges led to some specific conclusions, though many of these have concerned the validation process as much as the computational methods themselves. At the last D3R workshop, it was noted that every D3R challenge included some methods with accurate pose predictions, and that most affinity predictions did correlate positively with the experimental “ground truth”, but that the results did not make it possible to identify a single best method or to rank methods in terms of accuracy. Such results highlight the importance of parameter settings and protein preparation. Similarly, methods that incorporated machine learning/artificial intelligence in some cases underperformed and in other cases outperformed more structure- and physics-based methods. Overall, it was not feasible to track the performance of a given method through multiple challenges across the years, because methods and their applications changed over time.

Two efforts, CELPP13 and CAMEO,11 addressed this limitation by establishing continuous, high-throughput benchmarking of protein-ligand pose predictions. These took advantage of the fact that the RCSB Protein Data Bank81 releases a tranche of new structures every week, and several days before the structures are released, it provides a file with a list of the forthcoming structures. Thus, an up-tempo, weekly, blinded prediction challenge can be held in the days between releases of the listing files and release of the structures. In about a year of operation, CELPP generated over 2,000 pose-prediction challenges and used them to benchmark methods of external participants. This reflects a data flow about 100 times that of D3R, and it demonstrated the ability to provide statistically significant comparisons among methods.

The CAMEO project provides a decentralized resource for model evaluation. In a manner similar to CELPP, unpublished structures are transmitted to users’ servers for predictions. The predictions are then returned to the central CAMEO servers that evaluate the predicted structures. While CAMEO was originally intended for protein structure prediction, it had for some time expanded its capabilities to include the prediction of liganded structures but discontinued that in 2016. Although CELPP and CAMEO are not currently operational for predicting protein-ligand structures, this powerful approach gives a strong direction for future work.

There is no analog of the PDB to enable an exactly analogous high-throughput benchmark of protein-ligand affinity calculations. Something very similar could be implemented by taking advantage of the ongoing flow of new protein-ligand affinity data available in newly published articles and patents. We envision that challenge participants would provide fully automated affinity-calculation workflow software to a trusted evaluation group, which would curate newly published protein-ligand affinity data and use them to benchmark the workflows in an effectively blinded manner. Given that BindingDB and ChEMBL currently curate on the order of 10⁵ protein-ligand affinity data points per year, it is likely that a robust benchmarking flow could be established.

Protein-ligand binding benchmarks may also emerge from large open-science initiatives dedicated to generating experimental screening data to train ML models. The Structural Genomics Consortium (SGC) is using DNA-encoded-library and affinity selection mass spectrometry screening platforms to screen dozens (and soon hundreds) of proteins against billions and thousands of compounds respectively. Computational predictions from ML models trained on this set are then tested experimentally at the SGC and data shared openly.

Open Issues for P-AP prediction and benchmarking

In the field of P-AP development and benchmarking, there are some issues beyond individual P-AP approaches that are rarely addressed:

P-AP performance is highly dependent on the individual user bringing in their expert understanding of the system when setting up the system and evaluating the output.41 Errors will propagate and poorly modelled poses can affect the performance of more rigorous methods such as FEP.82 Additionally, users who are experts on various target families will often be able to determine the most appropriate pose by simple visual inspection.83 To make benchmarks fully comprehensible, we propose that participants deposit exact input files used in the calculations.

The limits of the experimental data against which P-AP predictions are compared also require a better understanding. Studies have demonstrated that mixing Ki or IC50 data from different assays leads to increased levels of noise.64,65,84 In reality, assays measure more than the idealized protein-ligand binding event that P-AP models assume. In fact, various idiosyncrasies of each individual assay contribute to the experimental outcome. This not only leads to a parallel shift of measured activities, but also to a substantial amount of random noise. Thus, it would be important to understand how good a general P-AP model can become when compared to a typical Ki/IC50 assay: even with a perfect model, the upper limit on agreement is much lower than the reproducibility of an individual assay.
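
The effect of experimental noise on the best achievable correlation can be illustrated with a simple model. The sketch assumes that each measured value is the true value plus independent, zero-mean noise, and that a perfect model predicts the true value exactly; under these assumptions the expected squared Pearson correlation between predictions and measurements is var(true) / (var(true) + var(noise)). The noise level of 0.5 log units and the uniform activity distribution are hypothetical choices for illustration, not values from the cited studies.

```python
import math

def uniform_sd(spread):
    """Standard deviation of values spread uniformly over a range of the given width."""
    return spread / math.sqrt(12)

def max_expected_r2(true_sd, noise_sd):
    """Expected R² of a perfect model against measurements with additive noise."""
    return true_sd ** 2 / (true_sd ** 2 + noise_sd ** 2)

# Hypothetical noise of 0.5 log units, compared for two activity ranges.
for spread in (4, 10):
    ceiling = max_expected_r2(uniform_sd(spread), noise_sd=0.5)
    print(f"spread {spread} log units: R² ceiling {ceiling:.3f}")
```

The same expression also shows why R² values from datasets with a wide activity range are not comparable to those from a single assay with a narrow range: for a fixed error, R² rises with the spread of the data.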

Classical evaluation metrics for predicted poses, such as RMSD between docked and crystal pose, are not sufficient to evaluate the quality of a pose prediction tool, especially for ML/DL based pose prediction methods, as demonstrated in the PoseBusters benchmark. Here, additional metrics are necessary to get a real estimate of performance. In addition, for ML/DL based pose prediction methods, a classical non-DL method should be included to test advantages over the baseline state-of-the-art. Similarly, it is important to include comparisons to null hypotheses. As a null hypothesis for activity prediction, we recommend comparing to correlations with simple descriptors like molecular weight or ClogP. These descriptors can be calculated via many open source tools and are fully transparent. For pose prediction, simple comparator models include predicting pose of a new compound based on the most similar compound-target pair in public databases or running a docking using an established approach with a defined version.
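
A null-hypothesis comparison of the kind recommended above can be sketched as follows. All numbers are made up for illustration, and the molecular weights are supplied directly; in practice they would be calculated with a cheminformatics toolkit.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

def compare_to_null(model_predictions, descriptor, measured):
    """Return R² of the model and of a simple descriptor against measured activity."""
    r2_model = pearson_r(model_predictions, measured) ** 2
    r2_null = pearson_r(descriptor, measured) ** 2
    return r2_model, r2_null

# Made-up series: measured pIC50, model predictions, and molecular weights.
measured = [5.1, 5.8, 6.4, 7.0, 7.9]
predicted = [5.5, 5.6, 6.9, 6.8, 7.5]
mol_weight = [310.0, 342.0, 355.0, 401.0, 430.0]
r2_model, r2_null = compare_to_null(predicted, mol_weight, measured)
print(f"model R² = {r2_model:.2f}, molecular-weight R² = {r2_null:.2f}")
```

If the descriptor alone correlates with activity about as well as the model does, the model has not demonstrated that it captures anything beyond molecular size.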

Setting Up Future Benchmarks

The requirements for a good benchmark set have been articulated,52,85 but it is worthwhile to look at them in the context of advancing the field.

Relevance

Benchmarks should aim to contain relevant drug targets with drug-like molecules. As this is the key use case for pose prediction, it would help if the dataset were drawn from drug-like chemical space and from macromolecules that are representative drug targets. This space is expanding to include macrocyclic compounds, radioligand therapies, harder protein targets,86 as well as nucleic acid targets.79 Pose prediction benchmarks should also seek to expand their domain. This implies sets of complexes with ligands that are dissimilar to currently known ligands in public benchmark sets, not just in terms of structure but also in terms of the protein-ligand interactions, e.g. covering rare ligand functional groups and protein co-factors. It should also be recognized that targets previously considered undruggable now need to be drugged, as discovery projects are being challenged with multimers as well as monomers, with highly flexible and poorly defined pockets, with cryptic pockets, and with other types of macromolecules. Target diversity is important, not just from the point of view of biological function, but also with respect to the properties of the pocket.

The data sets used should also include a range of scenarios that are challenging to current methods. For example, they should include activity cliffs, where similar ligands have very different affinities or binding modes. In addition, docking calculations typically yield highly similar poses for similar ligands, and cases like those described in the development of BAY069,87 where there are two flipped binding modes for very similar structures, are beyond the current state of the art. Such cases emphasize the need for experimental validation of any pose hypothesis. They would be a very hard test for pose prediction tools and would demonstrate that, even with prior knowledge of a similar system in the training set, predictions can be flawed. It would also be good to have sets split by differing protonation states and by whether the water network is a key factor, as these aspects are currently dealt with only poorly.

Interpretability

An important component of a benchmark test is not just to numerically assess a particular approach but also to understand why it did well or had issues. P-AP methods have many components that can be tuned and modified; several have been discussed above. A given docking method may perform well or poorly, depending on the methods used to set up the proteins and ligands. A recent review of DL-based activity prediction approaches aimed to introduce measures for comparing different types of such approaches, as well as their explainability.88 All evaluated methods exhibited a lack of interpretability when applied to real-world data sets. The authors of the programs should be afforded the opportunity to present their results and freely discuss these topics. In benchmarking challenges, it is important to capture as much information as possible on the parameters, set-up, etc. that were employed, in order to identify whether trends exist.

Structure quality

To be a good benchmark, the experimental ground truth should be clearly defined. For a protein-ligand complex X-ray structure, this means clear electron density maps for the binding pocket. The quality of the pose prediction should be assessed by eventual fit to these maps in addition to using the typical RMSD criterion of < 2 Å. If the binding site is well defined and some other parts of the structure not involved in binding are not, this criterion can be relaxed. By the same token, NMR structures and cryo-EM structures offer important challenges of several binding modes/protein conformers with near-equivalent energies. Pose prediction programs should be able to deal with this energy landscape with several minima.
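
For reference, the RMSD criterion mentioned above can be written down as a minimal sketch. It assumes that the predicted and experimental ligand poses are given in the same protein frame, with identical heavy-atom ordering, and it applies no symmetry correction; production tools handle symmetry-equivalent atoms, which this example deliberately omits. The coordinates are a hypothetical three-atom ligand.

```python
from math import sqrt


def pose_rmsd(predicted, reference):
    """Heavy-atom RMSD in angstrom between two poses in the same frame.

    Assumes identical atom ordering and applies no symmetry correction.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return sqrt(squared / len(predicted))


def is_success(rmsd, threshold=2.0):
    """Apply the conventional success criterion (RMSD below the threshold)."""
    return rmsd < threshold


# Hypothetical three-atom ligand; the predicted pose is shifted by 1 angstrom in x.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted = [(x + 1.0, y, z) for x, y, z in crystal]
rmsd = pose_rmsd(predicted, crystal)
print(f"RMSD = {rmsd:.2f} A, success: {is_success(rmsd)}")
```

Because a single number of this kind ignores both the electron density and the physical plausibility of the pose, it is best treated as one criterion among several.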

Affinity data quality

In tandem with structure, the experimental ground truth of measured affinity needs to be reliable and consistent, measured using a single method under the same conditions (ideally in the same laboratory), with enough information to provide Kd values. Individually measured Kd data for enantiomeric pairs with a substantial difference in pKd (> 2 log units, the more the better) will help to highlight approaches that can distinguish strong and weak binding based on modelled 3D interactions. With all benchmarks, there needs to be enough data to draw statistically significant conclusions.
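
One common way to judge whether a data set is large enough is to attach a resampling-based uncertainty interval to the performance metric. The sketch below computes a percentile bootstrap interval for Kendall's tau (tau-a, where tied pairs count as zero) between measured and predicted affinities. The data are hypothetical toy values, the function names are assumptions, and the percentile bootstrap is only one of several possible choices.

```python
import random


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs divided by all pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


def bootstrap_interval(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Kendall's tau (resampling compounds)."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(kendall_tau([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical measured and predicted pKd values for eight compounds.
measured = [5.1, 5.6, 6.0, 6.4, 7.0, 7.2, 7.9, 8.3]
predicted = [5.4, 5.2, 6.3, 6.9, 6.6, 7.5, 8.1, 8.0]

tau = kendall_tau(measured, predicted)
lower, upper = bootstrap_interval(measured, predicted)
print(f"tau = {tau:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

With only a handful of compounds the interval is typically wide, so two methods whose intervals overlap strongly cannot be ranked with confidence on that data set.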

“Fairness”

How do we ensure people don’t train on highly related data to optimize performance on the benchmarks? If benchmarks are very kinase-heavy then training on kinase data will improve performance, presumably, even if it is not exactly the benchmark data. A target with a novel pocket architecture and varying binding modes would be much harder. We recognize that pose prediction programs have been monetized and more spectacular results can be published in higher ranking journals, so there are incentives to appear to do better. To make matters fairer, it behooves scientists to disclose their training sets, so that a benchmark can be assessed in this context.

All of these issues can be addressed by running a diverse array of truly blind benchmark tests, in which containerized methods are tested either on newly released, and hence effectively blind, data or on data released only after predictions are made. In both cases, high-quality new datasets are needed on a recurring basis. Datasets with consistent measurement of affinity, with a good range of affinity, are most available in the industrial setting. It would be useful if Pharma/Biotech had a clear incentive to disclose these data sets as part of the publication strategy for projects that have come to an end. The incentive for those participating in a challenge may seem clearer (publications, status of best-in-class), but may not be compelling for existing industry leaders, who may feel there is more to lose than to gain. In these cases, the incentive must come from industrial customers insisting on seeing the results of commercial software participation in blinded challenges. For the end-user community, the tests need a realistic prospect of running long enough and of being accepted by the whole community (CASP is an inspiring example, as it is now in iteration 16 and has a diverse international board). The resulting insights will enable well-informed purchasing decisions and may even help determine which tools merit further investment and development.

The Future of the Field

Throughout the last few years, virtual libraries have increased dramatically in size, such as ENAMINE’s REadily AvaiLable for synthesis (REAL) combinatorial library,89 which nowadays contains 48 billion molecules. The appearance of ultra-large on-demand chemical libraries reinvigorated the docking and scoring field. Screening such ultra-large virtual libraries demands novel approaches for P-AP. Fast docking approaches, e.g. based on DOCK3.7,90–92 or heavily parallelized setups like VirtualFlow93 were developed to perform P-AP of these ultra-large libraries. Other approaches such as VSYNTHES,89 SpaceDocking,94 and others95 make use of approximations, such as a combinatorial approach, modeling first minimum scaffolds and then expanding on the best modelled scaffolds. Hybrid models and deep docking methods were built, where a subset of molecules is docked and a DL model is built based on these docking scores to predict docking scores for the remaining library. Subsequently, active learning iterations improve the model by taking more actual docking scores into account during training.96–99 As ultra-large libraries continue to grow, the interest in novel and better-performing P-AP approaches will continue to grow.
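
The surrogate-plus-active-learning loop described above can be sketched on a toy problem. In this example each "molecule" is a single number, `toy_dock` is a hypothetical stand-in for an expensive docking run, and a k-nearest-neighbour regressor replaces the DL model used by the published methods; the greedy selection rule is likewise an assumption made for illustration, not the protocol of any cited approach.

```python
import random


def toy_dock(x):
    """Hypothetical stand-in for an expensive docking run (lower = better)."""
    return (x - 0.7) ** 2 - 1.0


def knn_predict(x, labelled, k=3):
    """Predict a score as the mean score of the k nearest docked molecules."""
    nearest = sorted(labelled, key=lambda item: abs(item[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning_screen(library, n_initial=20, batch=10, rounds=5, seed=0):
    """Dock a random subset, then repeatedly dock the best-predicted molecules."""
    rng = random.Random(seed)
    docked = {}  # library index -> docking score
    for idx in rng.sample(range(len(library)), n_initial):
        docked[idx] = toy_dock(library[idx])
    for _ in range(rounds):
        labelled = [(library[i], s) for i, s in docked.items()]
        undocked = [i for i in range(len(library)) if i not in docked]
        # Rank undocked molecules by predicted score, best (lowest) first.
        undocked.sort(key=lambda i: knn_predict(library[i], labelled))
        for idx in undocked[:batch]:
            docked[idx] = toy_dock(library[idx])
    return docked


library_rng = random.Random(1)
library = [library_rng.random() for _ in range(1000)]
docked = active_learning_screen(library)
best_idx = min(docked, key=docked.get)
print(len(docked), "of", len(library), "docked; best score", round(docked[best_idx], 3))
```

Only a small fraction of the library is ever docked, which is the point of the approach; the price is that the quality of the final hit list depends on how well the surrogate model generalizes, which is itself something a blinded benchmark can test.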

Even though P-AP is not a new problem, open questions and challenges still abound as described above. Due to the frequent application of P-AP during drug development campaigns, novel methods and approaches have the potential for high impact. To advance the field, blinded benchmarks are needed to guide developers and continuously reveal the evolving state of the art. As the field advances, our understanding of how to best run those benchmarks also evolves. To fully leverage the potential of P-AP and develop novel approaches that will further reduce cost and cycle times of preclinical drug discovery, ongoing benchmarking initiatives need a continuous supply of novel blinded structures and other relevant data, an independent cross-institutional community that facilitates the evaluation, and a community of method developers that engages in the blinded benchmarks.

Funding

DLM appreciates financial support from the National Institutes of Health (R35GM148236). The Chodera laboratory receives or has received funding from multiple sources, including the National Institutes of Health, the National Science Foundation, the Parker Institute for Cancer Immunotherapy, Relay Therapeutics, Entasis Therapeutics, Silicon Therapeutics, EMD Serono (Merck KGaA), AstraZeneca, Vir Biotechnology, Bayer, XtalPi, Interline Therapeutics, the Molecular Sciences Software Institute, the Starr Cancer Consortium, the Open Force Field Consortium, Cycle for Survival, a Louis V. Gerstner Young Investigator Award, and the Sloan Kettering Institute. A complete funding history for the Chodera lab can be found at http://choderalab.org/funding.

Footnotes

Author disclosures

CK is an employee of F. Hoffmann-La Roche Ltd and may own stock or stock options of F. Hoffmann-La Roche Ltd. EN is an employee of AstraZeneca and may own stock or stock options in AstraZeneca. MKG has an equity interest in and is a cofounder and scientific advisor of VeraChem LLC. He is also a scientific advisor to Denovicon Therapeutics, Beren Therapeutics, and In Cerebro Inc. RL is a scientific advisor to Cresset Ltd. DLM is on the scientific advisory boards of Anagenex and OpenEye Scientific Software, Cadence Molecular Sciences. WPW is an employee of Relay Therapeutics and may own stock or stock options. KG is an employee of Johnson and Johnson and may own stock or stock options in Johnson and Johnson. J.D.C. is a current member of the Scientific Advisory Board of OpenEye Scientific Software, and has equity interests in Achira, Inc. JG is an employee of Bayer AG and may own stock or stock options in Bayer AG.

Contributor Information

Christian Kramer, F. Hoffmann-La Roche Ltd. Research and Early Development, Basel, Switzerland.

John Chodera, Memorial Sloan Kettering Cancer Center, New York, United States.

Kelly L. Damm-Ganamet, Therapeutics Discovery, Janssen Research & Development.

Michael K. Gilson, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA.

Judith Günther, Bayer AG, Drug Discovery Sciences, 13353 Berlin, Germany.

Uta Lessel, Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG.

Richard A. Lewis, Global Discovery Chemistry, Novartis Pharma AG.

David Mobley, Departments of Pharmaceutical Sciences and Chemistry, University of California, Irvine.

Eva Nittinger, Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden.

Adam Pecina, Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Prague, Czech Republic.

Matthieu Schapira, Structural Genomics Consortium and Department of Pharmacology & Toxicology, University of Toronto.

W. Patrick Walters, Computation, Relay Therapeutics, Cambridge, MA.

References

  • (1).van Montfort RLM; Workman P Structure-Based Drug Design: Aiming for a Perfect Fit. Essays Biochem. 2017, 61 (5), 431–437. 10.1042/EBC20170052. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (2).Bissantz C; Kuhn B; Stahl M A Medicinal Chemist’s Guide to Molecular Interactions. J. Med. Chem. 2010, 53 (14), 5061–5084. 10.1021/jm100112j. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (3).Cournia Z; Allen B; Sherman W Relative Binding Free Energy Calculations in Drug Discovery: Recent Advances and Practical Considerations. J. Chem. Inf. Model. 2017, 57 (12), 2911–2937. 10.1021/acs.jcim.7b00564. [DOI] [PubMed] [Google Scholar]
  • (4).Feng M; Heinzelmann G; Gilson MK Absolute Binding Free Energy Calculations Improve Enrichment of Actives in Virtual Compound Screening. Sci. Rep. 2022, 12 (1), 13640. 10.1038/s41598-022-17480-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (5).Wang L; Wu Y; Deng Y; Kim B; Pierce L; Krilov G; Lupyan D; Robinson S; Dahlgren MK; Greenwood J; Romero DL; Masse C; Knight JL; Steinbrecher T; Beuming T; Damm W; Harder E; Sherman W; Brewer M; Wester R; Murcko M; Frye L; Farid R; Lin T; Mobley DL; Jorgensen WL; Berne BJ; Friesner RA; Abel R Accurate and Reliable Prediction of Relative Ligand Binding Potency in Prospective Drug Discovery by Way of a Modern Free-Energy Calculation Protocol and Force Field. J. Am. Chem. Soc. 2015, 137 (7), 2695–2703. 10.1021/ja512751q. [DOI] [PubMed] [Google Scholar]
  • (6).Sivula T; Yetukuri L; Kalliokoski T; Käsnänen H; Poso A; Pöhner I Machine Learning-Boosted Docking Enables the Efficient Structure-Based Virtual Screening of Giga-Scale Enumerated Chemical Libraries. J. Chem. Inf. Model. 2023, 63 (18), 5773–5783. 10.1021/acs.jcim.3c01239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (7).Chen W; Cui D; Abel R; Friesner RA; Wang L Accurate Calculation of Absolute Protein-Ligand Binding Free Energies. ChemRxiv April 13, 2022. 10.26434/chemrxiv-2022-2t0dq. [DOI] [Google Scholar]
  • (8).Zev S; Raz K; Schwartz R; Tarabeh R; Gupta PK; Major DT Benchmarking the Ability of Common Docking Programs to Correctly Reproduce and Score Binding Modes in SARS-CoV-2 Protease Mpro. J. Chem. Inf. Model. 2021, 61 (6), 2957–2966. 10.1021/acs.jcim.1c00263. [DOI] [PubMed] [Google Scholar]
  • (9).Gathiaka S; Liu S; Chiu M; Yang H; Stuckey JA; Kang YN; Delproposto J; Kubish G; Dunbar JB; Carlson HA; Burley SK; Walters WP; Amaro RE; Feher VA; Gilson MK D3R Grand Challenge 2015: Evaluation of Protein–Ligand Pose and Affinity Predictions. J. Comput. Aided Mol. Des. 2016, 30 (9), 651–668. 10.1007/s10822-016-9946-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (10).Jansen JM; Cornell W; Tseng YJ; Amaro RE Teach–Discover–Treat (TDT): Collaborative Computational Drug Discovery for Neglected Diseases. J. Mol. Graph. Model. 2012, 38, 360–362. 10.1016/j.jmgm.2012.07.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (11).Haas J; Barbato A; Behringer D; Studer G; Roth S; Bertoni M; Mostaguir K; Gumienny R; Schwede T Continuous Automated Model EvaluatiOn (CAMEO) Complementing the Critical Assessment of Structure Prediction in CASP12. Proteins Struct. Funct. Bioinforma. 2018, 86 (S1), 387–398. 10.1002/prot.25431. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (12).Mullard A When Can AI Deliver the Drug Discovery Hits? Nat. Rev. Drug Discov. 2024, 23 (3), 159–161. 10.1038/d41573-024-00036-0. [DOI] [PubMed] [Google Scholar]
  • (13).Wagner JR; Churas CP; Liu S; Swift RV; Chiu M; Shao C; Feher VA; Burley SK; Gilson MK; Amaro RE Continuous Evaluation of Ligand Protein Predictions: A Weekly Community Challenge for Drug Docking. Structure 2019, 27 (8), 1326–1335.e4. 10.1016/j.str.2019.05.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (14).Kryshtafovych A; Antczak M; Szachniuk M; Zok T; Kretsch RC; Rangan R; Pham P; Das R; Robin X; Studer G; Durairaj J; Eberhardt J; Sweeney A; Topf M; Schwede T; Fidelis K; Moult J New Prediction Categories in CASP15. Proteins Struct. Funct. Bioinforma. 2023, 91 (12), 1550–1557. 10.1002/prot.26515. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (15).Klauda JB Virtual Issue on Docking. J. Phys. Chem. B 2021, 125 (21), 5455–5457. 10.1021/acs.jpcb.1c03303. [DOI] [PubMed] [Google Scholar]
  • (16).Jiang D; Zhao H; Du H; Deng Y; Wu Z; Wang J; Zeng Y; Zhang H; Wang X; Wu J; Hsieh C-Y; Hou T How Good Are Current Docking Programs at Nucleic Acid–Ligand Docking? A Comprehensive Evaluation. J. Chem. Theory Comput. 2023, 19 (16), 5633–5647. 10.1021/acs.jctc.3c00507. [DOI] [PubMed] [Google Scholar]
  • (17).Elokely KM; Doerksen RJ Docking Challenge: Protein Sampling and Molecular Docking Performance. J. Chem. Inf. Model. 2013, 53 (8), 1934–1945. 10.1021/ci400040d. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (18).Kuntz ID; Blaney JM; Oatley SJ; Langridge R; Ferrin TE A Geometric Approach to Macromolecule-Ligand Interactions. J. Mol. Biol. 1982, 161 (2), 269–288. 10.1016/0022-2836(82)90153-X. [DOI] [PubMed] [Google Scholar]
  • (19).Trott O; Olson AJ AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010, 31 (2), 455–461. 10.1002/jcc.21334. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (20).Böhm HJ The Computer Program LUDI: A New Method for the de Novo Design of Enzyme Inhibitors. J. Comput. Aided Mol. Des. 1992, 6 (1), 61–78. 10.1007/BF00124387. [DOI] [PubMed] [Google Scholar]
  • (21).Abagyan R; Totrov M; Kuznetsov D ICM—A New Method for Protein Modeling and Design: Applications to Docking and Structure Prediction from the Distorted Native Conformation. J. Comput. Chem. 1994, 15 (5), 488–506. 10.1002/jcc.540150503. [DOI] [Google Scholar]
  • (22).Jones G; Willett P; Glen RC; Leach AR; Taylor R Development and Validation of a Genetic Algorithm for Flexible Docking1. J. Mol. Biol. 1997, 267 (3), 727–748. 10.1006/jmbi.1996.0897. [DOI] [PubMed] [Google Scholar]
  • (23).Kramer B; Rarey M; Lengauer T Evaluation of the FLEXX Incremental Construction Algorithm for Protein–Ligand Docking. Proteins Struct. Funct. Bioinforma. 1999, 37 (2), 228–241. 10.1002/(SICI)1097-0134(19991101)37:2&lt;228::AID-PROT8&gt;3.0.CO;2-8. [DOI] [PubMed] [Google Scholar]
  • (24).Friesner RA; Banks JL; Murphy RB; Halgren TA; Klicic JJ; Mainz DT; Repasky MP; Knoll EH; Shelley M; Perry JK; Shaw DE; Francis P; Shenkin PS Glide:  A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47 (7), 1739–1749. 10.1021/jm0306430. [DOI] [PubMed] [Google Scholar]
  • (25).McGann M FRED and HYBRID Docking Performance on Standardized Datasets. J. Comput. Aided Mol. Des. 2012, 26 (8), 897–906. 10.1007/s10822-012-9584-8. [DOI] [PubMed] [Google Scholar]
  • (26).Corso G; Stärk H; Jing B; Barzilay R; Jaakkola T DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. arXiv February 11, 2023. 10.48550/arXiv.2210.01776. [DOI] [Google Scholar]
  • (27).Stärk H; Ganea O; Pattanaik L; Barzilay DR; Jaakkola T EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. In Proceedings of the 39th International Conference on Machine Learning; PMLR, 2022; pp 20503–20521. [Google Scholar]
  • (28).Buttenschoen M; Morris GM; Deane CM PoseBusters: AI-Based Docking Methods Fail to Generate Physically Valid Poses or Generalise to Novel Sequences. Chem. Sci. 2024, 15 (9), 3130–3139. 10.1039/D3SC04185A. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (29).Irwin JJ; Shoichet BK Docking Screens for Novel Ligands Conferring New Biology. J. Med. Chem. 2016, 59 (9), 4103–4120. 10.1021/acs.jmedchem.5b02008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (30).Godschalk F; Genheden S; Söderhjelm P; Ryde U Comparison of MM/GBSA Calculations Based on Explicit and Implicit Solvent Simulations. Phys. Chem. Chem. Phys. 2013, 15 (20), 7731–7739. 10.1039/C3CP00116D. [DOI] [PubMed] [Google Scholar]
  • (31).Åqvist J; Medina C; Samuelsson J-E A New Method for Predicting Binding Affinity in Computer-Aided Drug Design. Protein Eng. Des. Sel. 1994, 7 (3), 385–391. 10.1093/protein/7.3.385. [DOI] [PubMed] [Google Scholar]
  • (32).Hou T; Wang J; Li Y; Wang W Assessing the Performance of the MM/PBSA and MM/GBSA Methods. 1. The Accuracy of Binding Free Energy Calculations Based on Molecular Dynamics Simulations. J. Chem. Inf. Model. 2011, 51 (1), 69–82. 10.1021/ci100275a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (33).Bhati AP; Wan S; Wright DW; Coveney PV Rapid, Accurate, Precise, and Reliable Relative Free Energy Prediction Using Ensemble Based Thermodynamic Integration. J. Chem. Theory Comput. 2017, 13 (1), 210–222. 10.1021/acs.jctc.6b00979. [DOI] [PubMed] [Google Scholar]
  • (34).Ross GA; Lu C; Scarabelli G; Albanese SK; Houang E; Abel R; Harder ED; Wang L The Maximal and Current Accuracy of Rigorous Protein-Ligand Binding Free Energy Calculations. Commun. Chem. 2023, 6 (1), 1–12. 10.1038/s42004-023-01019-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (35).Gilson MK; Stewart LE; Potter MJ; Webb SP Rapid, Accurate, Ranking of Protein–Ligand Binding Affinities with VM2, the Second-Generation Mining Minima Method. J. Chem. Theory Comput. 2024, 20 (14), 6328–6340. 10.1021/acs.jctc.4c00407. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (36).Ginex T; Vázquez J; Estarellas C; Luque FJ Quantum Mechanical-Based Strategies in Drug Discovery: Finding the Pace to New Challenges in Drug Design. Curr. Opin. Struct. Biol. 2024, 87, 102870. 10.1016/j.sbi.2024.102870. [DOI] [PubMed] [Google Scholar]
  • (37).Fedorov DG The Fragment Molecular Orbital Method: Theoretical Development, Implementation in GAMESS, and Applications. WIREs Comput. Mol. Sci. 2017, 7 (6), e1322. 10.1002/wcms.1322. [DOI] [Google Scholar]
  • (38).Pecina A; Fanfrlík J; Lepšík M; Řezáč J SQM2.20: Semiempirical Quantum-Mechanical Scoring Function Yields DFT-Quality Protein–Ligand Binding Affinity Predictions in Minutes. Nat. Commun. 2024, 15 (1), 1127. 10.1038/s41467-024-45431-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (39).Jorgensen WL The Many Roles of Computation in Drug Discovery. Science 2004, 303 (5665), 1813–1818. 10.1126/science.1096361. [DOI] [PubMed] [Google Scholar]
  • (40).Charifson PS; Corkery JJ; Murcko MA; Walters WP Consensus Scoring:  A Method for Obtaining Improved Hit Rates from Docking Databases of Three-Dimensional Structures into Proteins. J. Med. Chem. 1999, 42 (25), 5100–5109. 10.1021/jm990352k. [DOI] [PubMed] [Google Scholar]
  • (41).Warren GL; Andrews CW; Capelli A-M; Clarke B; LaLonde J; Lambert MH; Lindvall M; Nevins N; Semus SF; Senger S; Tedesco G; Wall ID; Woolven JM; Peishoff CE; Head MS A Critical Assessment of Docking Programs and Scoring Functions. J. Med. Chem. 2006, 49 (20), 5912–5931. 10.1021/jm050362n. [DOI] [PubMed] [Google Scholar]
  • (42).Méndez-Lucio O; Ahmad M; del Rio-Chanona EA; Wegner JK A Geometric Deep Learning Approach to Predict Binding Conformations of Bioactive Molecules. Nat. Mach. Intell. 2021, 3 (12), 1033–1039. 10.1038/s42256-021-00409-9. [DOI] [Google Scholar]
  • (43).Zhou G; Gao Z; Ding Q; Zheng H; Xu H; Wei Z; Zhang L; Ke G Uni-Mol: A Universal 3D Molecular Representation Learning Framework. ChemRxiv May 26, 2022. 10.26434/chemrxiv-2022-jjm0j. [DOI] [Google Scholar]
  • (44).Lu W; Wu Q; Zhang J; Rao J; Li C; Zheng S TANKBind: Trigonometry-Aware Neural NetworKs for Drug-Protein Binding Structure Prediction. bioRxiv June 6, 2022, p 2022.06.06.495043. 10.1101/2022.06.06.495043. [DOI] [Google Scholar]
  • (45).Abramson J; Adler J; Dunger J; Evans R; Green T; Pritzel A; Ronneberger O; Willmore L; Ballard AJ; Bambrick J; Bodenstein SW; Evans DA; Hung C-C; O’Neill M; Reiman D; Tunyasuvunakool K; Wu Z; Žemgulytė A; Arvaniti E; Beattie C; Bertolli O; Bridgland A; Cherepanov A; Congreve M; Cowen-Rivers AI; Cowie A; Figurnov M; Fuchs FB; Gladman H; Jain R; Khan YA; Low CMR; Perlin K; Potapenko A; Savy P; Singh S; Stecula A; Thillaisundaram A; Tong C; Yakneen S; Zhong ED; Zielinski M; Žídek A; Bapst V; Kohli P; Jaderberg M; Hassabis D; Jumper JM Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 2024, 630 (8016), 493–500. 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (46).Jumper J; Evans R; Pritzel A; Green T; Figurnov M; Ronneberger O; Tunyasuvunakool K; Bates R; Žídek A; Potapenko A; Bridgland A; Meyer C; Kohl SAA; Ballard AJ; Cowie A; Romera-Paredes B; Nikolov S; Jain R; Adler J; Back T; Petersen S; Reiman D; Clancy E; Zielinski M; Steinegger M; Pacholska M; Berghammer T; Bodenstein S; Silver D; Vinyals O; Senior AW; Kavukcuoglu K; Kohli P; Hassabis D Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596 (7873), 583–589. 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (47).Qiao Z; Nie W; Vahdat A; Miller TF; Anandkumar A State-Specific Protein–Ligand Complex Structure Prediction with a Multiscale Deep Generative Model. Nat. Mach. Intell. 2024, 6 (2), 195–208. 10.1038/s42256-024-00792-z. [DOI] [Google Scholar]
  • (48).Krishna R; Wang J; Ahern W; Sturmfels P; Venkatesh P; Kalvet I; Lee GR; Morey-Burrows FS; Anishchenko I; Humphreys IR; McHugh R; Vafeados D; Li X; Sutherland GA; Hitchcock A; Hunter CN; Kang A; Brackenbrough E; Bera AK; Baek M; DiMaio F; Baker D Generalized Biomolecular Modeling and Design with RoseTTAFold All-Atom. Science 2024, 384 (6693), eadl2528. 10.1126/science.adl2528. [DOI] [PubMed] [Google Scholar]
  • (49).Chen L; Cruz A; Ramsey S; Dickson CJ; Duca JS; Hornak V; Koes DR; Kurtzman T Hidden Bias in the DUD-E Dataset Leads to Misleading Performance of Deep Learning in Structure-Based Virtual Screening. PLOS ONE 2019, 14 (8), e0220113. 10.1371/journal.pone.0220113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (50).Sieg J; Flachsenberg F; Rarey M In Need of Bias Control: Evaluating Chemical Data for Machine Learning in Structure-Based Virtual Screening. J. Chem. Inf. Model. 2019, 59 (3), 947–961. 10.1021/acs.jcim.8b00712. [DOI] [PubMed] [Google Scholar]
  • (51).Hartshorn MJ; Verdonk ML; Chessari G; Brewerton SC; Mooij WTM; Mortenson PN; Murray CW Diverse, High-Quality Test Set for the Validation of Protein−Ligand Docking Performance. J. Med. Chem. 2007, 50 (4), 726–741. 10.1021/jm061277y. [DOI] [PubMed] [Google Scholar]
  • (52).Warren GL; Do TD; Kelley BP; Nicholls A; Warren SD Essential Considerations for Using Protein–Ligand Structures in Drug Discovery. Drug Discov. Today 2012, 17 (23), 1270–1281. 10.1016/j.drudis.2012.06.011. [DOI] [PubMed] [Google Scholar]
  • (53).Friedrich N-O; Meyder A; de Bruyn Kops C; Sommer K; Flachsenberg F; Rarey M; Kirchmair J High-Quality Dataset of Protein-Bound Ligand Conformations and Its Application to Benchmarking Conformer Ensemble Generators. J. Chem. Inf. Model. 2017, 57 (3), 529–539. 10.1021/acs.jcim.6b00613. [DOI] [PubMed] [Google Scholar]
  • (54).Liu Z; Su M; Han L; Liu J; Yang Q; Li Y; Wang R Forging the Basis for Developing Protein–Ligand Interaction Scoring Functions. Acc. Chem. Res. 2017, 50 (2), 302–309. 10.1021/acs.accounts.6b00491. [DOI] [PubMed] [Google Scholar]
  • (55).Mysinger MM; Carchia M; Irwin John. J.; Shoichet BK. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 2012, 55 (14), 6582–6594. 10.1021/jm300687e. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (56).Rohrer SG; Baumann K Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity Data. J. Chem. Inf. Model. 2009, 49 (2), 169–184. 10.1021/ci8002649. [DOI] [PubMed] [Google Scholar]
  • (57).Tran-Nguyen V-K; Jacquemard C; Rognan D LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening. J. Chem. Inf. Model. 2020, 60 (9), 4263–4273. 10.1021/acs.jcim.0c00155. [DOI] [PubMed] [Google Scholar]
  • (58).Gori DNP; Alberca LN; Rodriguez S; Alice JI; Llanos MA; Bellera CL; Talevi A LIDeB Tools: A Latin American Resource of Freely Available, Open-Source Cheminformatics Apps. Artif. Intell. Life Sci. 2022, 2, 100049. 10.1016/j.ailsci.2022.100049. [DOI] [Google Scholar]
  • (59).Ibrahim TM; Bauer MR; Boeckler FM Applying DEKOIS 2.0 in Structure-Based Virtual Screening to Probe the Impact of Preparation Procedures and Score Normalization. J. Cheminformatics 2015, 7 (1), 21. 10.1186/s13321-015-0074-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (60).Vogel SM; Bauer MR; Boeckler FM DEKOIS: Demanding Evaluation Kits for Objective in Silico Screening — A Versatile Tool for Benchmarking Docking Programs and Scoring Functions. J. Chem. Inf. Model. 2011, 51 (10), 2650–2665. 10.1021/ci2001549. [DOI] [PubMed] [Google Scholar]
  • (61).Benson ML; Smith RD; Khazanov NA; Dimcheff B; Beaver J; Dresslar P; Nerothin J; Carlson HA Binding MOAD, a High-Quality Protein–Ligand Database. Nucleic Acids Res. 2008, 36 (suppl_1), D674–D678. 10.1093/nar/gkm911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (62).Liu T; Lin Y; Wen X; Jorissen RN; Gilson MK BindingDB: A Web-Accessible Database of Experimentally Determined Protein–Ligand Binding Affinities. Nucleic Acids Res. 2007, 35 (Database issue), D198–D201. 10.1093/nar/gkl999. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (63).Gaulton A; Bellis LJ; Bento AP; Chambers J; Davies M; Hersey A; Light Y; McGlinchey S; Michalovich D; Al-Lazikani B; Overington JP ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40 (Database issue), D1100–1107. 10.1093/nar/gkr777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (64).Kalliokoski T; Kramer C; Vulpetti A; Gedeck P Comparability of Mixed IC₅₀ Data - a Statistical Analysis. PloS One 2013, 8 (4), e61007. 10.1371/journal.pone.0061007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (65).Kramer C; Kalliokoski T; Gedeck P; Vulpetti A The Experimental Uncertainty of Heterogeneous Public K(i) Data. J. Med. Chem. 2012, 55 (11), 5165–5173. 10.1021/jm300131x. [DOI] [PubMed] [Google Scholar]
  • (66).Schindler CEM; Baumann H; Blum A; Böse D; Buchstaller H-P; Burgdorf L; Cappel D; Chekler E; Czodrowski P; Dorsch D; Eguida MKI; Follows B; Fuchß T; Grädler U; Gunera J; Johnson T; Jorand Lebrun C; Karra S; Klein M; Knehans T; Koetzner L; Krier M; Leiendecker M; Leuthner B; Li L; Mochalkin I; Musil D; Neagu C; Rippmann F; Schiemann K; Schulz R; Steinbrecher T; Tanzer E-M; Unzue Lopez A; Viacava Follis A; Wegener A; Kuhn D Large-Scale Assessment of Binding Free Energy Calculations in Active Drug Discovery Projects. J. Chem. Inf. Model. 2020, 60 (11), 5457–5474. 10.1021/acs.jcim.0c00900. [DOI] [PubMed] [Google Scholar]
  • (67).Tosstorff A; Rudolph MG; Cole JC; Reutlinger M; Kramer C; Schaffhauser H; Nilly A; Flohr A; Kuhn B A High Quality, Industrial Data Set for Binding Affinity Prediction: Performance Comparison in Different Early Drug Discovery Scenarios. J. Comput. Aided Mol. Des. 2022, 36 (10), 753–765. 10.1007/s10822-022-00478-x. [DOI] [PubMed] [Google Scholar]
  • (68).Nicholls A; Mobley DL; Guthrie JP; Chodera JD; Bayly CI; Cooper MD; Pande VS Predicting Small-Molecule Solvation Free Energies: An Informal Blind Test for Computational Chemistry. J. Med. Chem. 2008, 51 (4), 769–779. 10.1021/jm070549+. [DOI] [PubMed] [Google Scholar]
  • (69).Amezcua M; Setiadi J; Mobley DL The SAMPL9 Host–Guest Blind Challenge: An Overview of Binding Free Energy Predictive Accuracy. Phys. Chem. Chem. Phys. 2024, 26 (12), 9207–9225. 10.1039/D3CP05111K. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (70).Mobley DL; Liu S; Cerutti DS; Swope WC; Rice JE Alchemical Prediction of Hydration Free Energies for SAMPL. J. Comput. Aided Mol. Des. 2012, 26 (5), 551–562. 10.1007/s10822-011-9528-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (71).Amezcua M; Setiadi J; Ge Y; Mobley DL An Overview of the SAMPL8 Host–Guest Binding Challenge. J. Comput. Aided Mol. Des. 2022, 36 (10), 707–734. 10.1007/s10822-022-00462-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (72).Grosjean H; Işık M; Aimon A; Mobley D; Chodera J; von Delft F; Biggin PC SAMPL7 Protein-Ligand Challenge: A Community-Wide Evaluation of Computational Methods against Fragment Screening and Pose-Prediction. J. Comput. Aided Mol. Des. 2022, 36 (4), 291–311. 10.1007/s10822-022-00452-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (73).Dunbar JB Jr.; Smith RD; Damm-Ganamet KL; Ahmed A; Esposito EX; Delproposto J; Chinnaswamy K; Kang Y-N; Kubish G; Gestwicki JE; Stuckey JA; Carlson HA CSAR Data Set Release 2012: Ligands, Affinities, Complexes, and Docking Decoys. J. Chem. Inf. Model. 2013, 53 (8), 1842–1852. 10.1021/ci4000486.
  • (74).Smith RD; Damm-Ganamet KL; Dunbar JB Jr.; Ahmed A; Chinnaswamy K; Delproposto JE; Kubish GM; Tinberg CE; Khare SD; Dou J; Doyle L; Stuckey JA; Baker D; Carlson HA CSAR Benchmark Exercise 2013: Evaluation of Results from a Combined Computational Protein Design, Docking, and Scoring/Ranking Challenge. J. Chem. Inf. Model. 2016, 56 (6), 1022–1031. 10.1021/acs.jcim.5b00387.
  • (75).Dunbar JB Jr.; Smith RD; Yang C-Y; Ung PM-U; Lexa KW; Khazanov NA; Stuckey JA; Wang S; Carlson HA CSAR Benchmark Exercise of 2010: Selection of the Protein–Ligand Complexes. J. Chem. Inf. Model. 2011, 51 (9), 2036–2046. 10.1021/ci200082t.
  • (76).Gaieb Z; Liu S; Gathiaka S; Chiu M; Yang H; Shao C; Feher VA; Walters WP; Kuhn B; Rudolph MG; Burley SK; Gilson MK; Amaro RE D3R Grand Challenge 2: Blind Prediction of Protein–Ligand Poses, Affinity Rankings, and Relative Binding Free Energies. J. Comput. Aided Mol. Des. 2018, 32 (1), 1–20. 10.1007/s10822-017-0088-4.
  • (77).Gaieb Z; Parks CD; Chiu M; Yang H; Shao C; Walters WP; Lambert MH; Nevins N; Bembenek SD; Ameriks MK; Mirzadegan T; Burley SK; Amaro RE; Gilson MK D3R Grand Challenge 3: Blind Prediction of Protein–Ligand Poses and Affinity Rankings. J. Comput. Aided Mol. Des. 2019, 33 (1), 1–18. 10.1007/s10822-018-0180-4.
  • (78).Parks CD; Gaieb Z; Chiu M; Yang H; Shao C; Walters WP; Jansen JM; McGaughey G; Lewis RA; Bembenek SD; Ameriks MK; Mirzadegan T; Burley SK; Amaro RE; Gilson MK D3R Grand Challenge 4: Blind Prediction of Protein–Ligand Poses, Affinity Rankings, and Relative Binding Free Energies. J. Comput. Aided Mol. Des. 2020, 34 (2), 99–119. 10.1007/s10822-020-00289-y.
  • (79).Elofsson A Progress at Protein Structure Prediction, as Seen in CASP15. Curr. Opin. Struct. Biol. 2023, 80, 102594. 10.1016/j.sbi.2023.102594.
  • (80).Ackloo S; Al-awar R; Amaro RE; Arrowsmith CH; Azevedo H; Batey RA; Bengio Y; Betz UAK; Bologa CG; Chodera JD; Cornell WD; Dunham I; Ecker GF; Edfeldt K; Edwards AM; Gilson MK; Gordijo CR; Hessler G; Hillisch A; Hogner A; Irwin JJ; Jansen JM; Kuhn D; Leach AR; Lee AA; Lessel U; Morgan MR; Moult J; Muegge I; Oprea TI; Perry BG; Riley P; Rousseaux SAL; Saikatendu KS; Santhakumar V; Schapira M; Scholten C; Todd MH; Vedadi M; Volkamer A; Willson TM CACHE (Critical Assessment of Computational Hit-Finding Experiments): A Public–Private Partnership Benchmarking Initiative to Enable the Development of Computational Methods for Hit-Finding. Nat. Rev. Chem. 2022, 6 (4), 287–295. 10.1038/s41570-022-00363-z.
  • (81).Berman HM; Westbrook J; Feng Z; Gilliland G; Bhat TN; Weissig H; Shindyalov IN; Bourne PE The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1), 235–242. 10.1093/nar/28.1.235.
  • (82).Cappel D; Jerome S; Hessler G; Matter H Impact of Different Automated Binding Pose Generation Approaches on Relative Binding Free Energy Simulations. J. Chem. Inf. Model. 2020, 60 (3), 1432–1444. 10.1021/acs.jcim.9b01118.
  • (83).Kuhn B; Haap W; Obst-Sander U; Kramer C; Stahl M What We Learned in 25 Years of Interactive Molecular Design Sessions. ChemMedChem 2021, 16 (18), 2760–2763. 10.1002/cmdc.202100351.
  • (84).Landrum GA; Riniker S Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise. J. Chem. Inf. Model. 2024, 64 (5), 1560–1567. 10.1021/acs.jcim.4c00049.
  • (85).Hahn D; Bayly C; Boby ML; Macdonald HB; Chodera J; Gapsys V; Mey A; Mobley D; Benito LP; Schindler C; Tresadern G; Warren G Best Practices for Constructing, Preparing, and Evaluating Protein-Ligand Binding Affinity Benchmarks [Article v1.0]. Living J. Comput. Mol. Sci. 2022, 4 (1), 1497–1497. 10.33011/livecoms.4.1.1497.
  • (86).Hakkennes MLA; Buda F; Bonnet S MetalDock: An Open Access Docking Tool for Easy and Reproducible Docking of Metal Complexes. J. Chem. Inf. Model. 2023, 63 (24), 7816–7825. 10.1021/acs.jcim.3c01582.
  • (87).Günther J; Hillig RC; Zimmermann K; Kaulfuss S; Lemos C; Nguyen D; Rehwinkel H; Habgood M; Lechner C; Neuhaus R; Ganzer U; Drewes M; Chai J; Bouché L BAY-069, a Novel (Trifluoromethyl)Pyrimidinedione-Based BCAT1/2 Inhibitor and Chemical Probe. J. Med. Chem. 2022, 65 (21), 14366–14390. 10.1021/acs.jmedchem.2c00441.
  • (88).Wang DD; Wu W; Wang R Structure-Based, Deep-Learning Models for Protein-Ligand Binding Affinity Prediction. J. Cheminformatics 2024, 16 (1), 2. 10.1186/s13321-023-00795-9.
  • (89).Sadybekov AA; Sadybekov AV; Liu Y; Iliopoulos-Tsoutsouvas C; Huang X-P; Pickett J; Houser B; Patel N; Tran NK; Tong F; Zvonok N; Jain MK; Savych O; Radchenko DS; Nikas SP; Petasis NA; Moroz YS; Roth BL; Makriyannis A; Katritch V Synthon-Based Ligand Discovery in Virtual Libraries of over 11 Billion Compounds. Nature 2022, 601 (7893), 452–459. 10.1038/s41586-021-04220-9.
  • (90).Lyu J; Wang S; Balius TE; Singh I; Levit A; Moroz YS; O’Meara MJ; Che T; Algaa E; Tolmachova K; Tolmachev AA; Shoichet BK; Roth BL; Irwin JJ Ultra-Large Library Docking for Discovering New Chemotypes. Nature 2019, 566 (7743), 224–229. 10.1038/s41586-019-0917-9.
  • (91).Tummino TA; Iliopoulos-Tsoutsouvas C; Braz JM; O’Brien ES; Stein RM; Craik V; Tran NK; Ganapathy S; Liu F; Shiimura Y; Tong F; Ho TC; Radchenko DS; Moroz YS; Rosado SR; Bhardwaj K; Benitez J; Liu Y; Kandasamy H; Normand C; Semache M; Sabbagh L; Glenn I; Irwin JJ; Kumar KK; Makriyannis A; Basbaum AI; Shoichet BK Large Library Docking for Cannabinoid-1 Receptor Agonists with Reduced Side Effects. bioRxiv February 28, 2024, p 2023.02.27.530254. 10.1101/2023.02.27.530254.
  • (92).Coleman RG; Carchia M; Sterling T; Irwin JJ; Shoichet BK Ligand Pose and Orientational Sampling in Molecular Docking. PLOS ONE 2013, 8 (10), e75992. 10.1371/journal.pone.0075992.
  • (93).Gorgulla C; Boeszoermenyi A; Wang Z-F; Fischer PD; Coote PW; Padmanabha Das KM; Malets YS; Radchenko DS; Moroz YS; Scott DA; Fackeldey K; Hoffmann M; Iavniuk I; Wagner G; Arthanari H An Open-Source Drug Discovery Platform Enables Ultra-Large Virtual Screens. Nature 2020, 580 (7805), 663–668. 10.1038/s41586-020-2117-z.
  • (94).Beroza P; Crawford JJ; Ganichkin O; Gendelev L; Harris SF; Klein R; Miu A; Steinbacher S; Klingler F-M; Lemmen C Chemical Space Docking Enables Large-Scale Structure-Based Virtual Screening to Discover ROCK1 Kinase Inhibitors. Nat. Commun. 2022, 13 (1), 6447. 10.1038/s41467-022-33981-8.
  • (95).Gryniukova A; Kaiser F; Myziuk I; Alieksieieva D; Leberecht C; Heym PP; Tarkhanova OO; Moroz YS; Borysko P; Haupt VJ AI-Powered Virtual Screening of Large Compound Libraries Leads to the Discovery of Novel Inhibitors of Sirtuin-1. J. Med. Chem. 2023, 66 (15), 10241–10251. 10.1021/acs.jmedchem.3c00128.
  • (96).Gentile F; Yaacoub JC; Gleave J; Fernandez M; Ton A-T; Ban F; Stern A; Cherkasov A Artificial Intelligence–Enabled Virtual Screening of Ultra-Large Chemical Libraries with Deep Docking. Nat. Protoc. 2022, 17 (3), 672–697. 10.1038/s41596-021-00659-2.
  • (97).Ton A-T; Gentile F; Hsing M; Ban F; Cherkasov A Rapid Identification of Potential Inhibitors of SARS-CoV-2 Main Protease by Deep Docking of 1.3 Billion Compounds. Mol. Inform. 2020, 39 (8), 2000028. 10.1002/minf.202000028.
  • (98).Tropsha A; Isayev O; Varnek A; Schneider G; Cherkasov A Integrating QSAR Modelling and Deep Learning in Drug Discovery: The Emergence of Deep QSAR. Nat. Rev. Drug Discov. 2024, 23 (2), 141–155. 10.1038/s41573-023-00832-0.
  • (99).Bedart C; Simoben CV; Schapira M Emerging Structure-Based Computational Methods to Screen the Exploding Accessible Chemical Space. Curr. Opin. Struct. Biol. 2024, 86, 102812. 10.1016/j.sbi.2024.102812.
