Author manuscript; available in PMC: 2024 Jun 6.
Published in final edited form as: Mol Inform. 2023 Dec 19;43(1):e202300207. doi: 10.1002/minf.202300207

Hit Discovery using Docking ENriched by GEnerative Modeling (HIDDEN GEM): A Novel Computational Workflow for Accelerated Virtual Screening of Ultra-large Chemical Libraries

Konstantin I Popov 1,2, James Wellnitz 1,2, Travis Maxfield 1,2, Alexander Tropsha 1
PMCID: PMC11156482  NIHMSID: NIHMS1990114  PMID: 37802967

Abstract

The recent rapid expansion of make-on-demand, purchasable chemical libraries comprising dozens of billions or even trillions of molecules has challenged the efficient application of traditional structure-based virtual screening methods that rely on molecular docking. We present a novel computational methodology termed HIDDEN GEM (HIt Discovery using Docking ENriched by GEnerative Modeling) that greatly accelerates virtual screening. This workflow uniquely integrates machine learning, generative chemistry, massive chemical similarity searching, and molecular docking of small, selected libraries at the beginning and end of the workflow. For each target, HIDDEN GEM nominates a small number of top-scoring virtual hits prioritized from ultra-large chemical libraries. We have benchmarked HIDDEN GEM by conducting virtual screening campaigns for 16 diverse protein targets using the Enamine REAL Space library comprising 37 billion molecules. We show that HIDDEN GEM yields the highest enrichment factors compared to state-of-the-art accelerated virtual screening methods, while requiring the least computational resources. HIDDEN GEM can be executed with any docking software and employed by users with limited computational resources.

Introduction

Over the past several years, libraries of purchasable, make-on-demand chemical compounds have rapidly expanded, with the Enamine REAL Space now encompassing over 37 billion compounds (https://enamine.net/compound-collections/real-compounds) and the eXplore space from eMolecules reportedly containing a staggering 7 trillion-plus molecules (https://www.emolecules.com/explore). This expansion of readily accessible chemical space has afforded new opportunities1-3 but also presented significant challenges for structure-based virtual screening (VS)4. Indeed, it has been shown that VS of ultra-large chemical libraries using molecular docking can identify a greater number of diverse ligands with high binding affinity5-10. However, molecular docking is inherently computationally demanding, even when accelerated with recent advancements in hardware11 and cloud computing12,13. Expanding upon estimates reported recently by Sadybekov et al.14, docking of the Enamine REAL Space would cost approximately $3,000,000 using Amazon Web Services (AWS; https://aws.amazon.com/). Thus, the need for very significant computational resources accessible only to a handful of research groups and the exorbitant cost of high-performance computing make conventional, docking-based VS of ultra-large libraries increasingly difficult, especially for users with limited computational resources.

Recognizing the limitations of molecular docking but also acknowledging the advantages of ultra-large library screening, several groups recently reported accelerated VS approaches14-23. These approaches attempt to minimize the number of expensive docking calculations while still identifying top-scoring compounds from ultra-large chemical libraries. Some approaches, like DeepDocking17, employ deep learning models trained on the docking scores of relatively small subsets of the chemical library. These models are then used to select a small number of top-scoring hit compounds from the entire VS library, and only these compounds are docked into the respective binding sites. Others, like VirtualFlow18, leverage faster but less accurate score estimation approaches instead of a learned model, followed by actual docking of only the top compounds identified by the faster methods. Both types of approaches, however, still incur a high computational cost, especially when GPUs, rather than CPUs, are required (as is the case for deep learning). Chemistry-based rules have also been used to reduce the number of actual docking runs, as in V-SYNTHES14. This approach starts by docking all chemical building blocks used to create an ultra-large screening library. Then, a small number of molecules from the entire screening library that contain the building blocks with the best docking scores are selected for actual docking. Unfortunately, the utility of this approach is limited by the requirement that the screening libraries have a combinatorial, building-block design, as well as by the need for detailed knowledge of the chemistry that makes up the library, information that is often proprietary or of limited access.

Alternative approaches to targeted ligand discovery that avoid the pitfalls of docking and virtual screening of known chemical libraries altogether have been proposed24-26, where generative models27,28 are used to create de novo virtual compounds predicted to have the desired affinity. Unlike structure-based VS methods, generative approaches propose molecules that are not limited to a given enumerated library, allowing for more freedom in hit selection. However, the generated compounds typically lack any known synthesis route, undermining their immediate utility in drug discovery campaigns29,30. Furthermore, model training requires knowledge of a relatively large number of molecules with the desired activity, which are frequently unavailable for many targets.

Herein, we present a novel, highly efficient computational approach dubbed HIDDEN GEM (HIt Discovery using Docking ENriched by GEnerative Modeling) for structure-based VS of ultra-large chemical libraries. HIDDEN GEM uniquely integrates and leverages the advantages of the aforementioned methods, including molecular docking, machine learning, and generative modeling, while attempting to circumvent their limitations. The methodology starts with traditional molecular docking of a small, chemically diverse library. The docking results are then used to bias a pretrained generative model, as well as to train a filtering model, to create and select novel compounds with better docking scores. The subsets of these de novo designed molecules confirmed to have high scores in actual docking simulations are then used as queries for massive chemical similarity searching to identify, in any ultra-large library, a small set of purchasable compounds highly similar to the queries. Molecules in this small set are then docked and scored for nomination as purchasable hits, or used for additional tuning of the generative model. These steps, including generative modeling, filtering, minimal docking, similarity searching, and final docking of a small number of hits, can be repeated several times if needed. Below, we demonstrate that iterating between initial docking, generation, similarity searching, and final docking helps HIDDEN GEM rapidly focus on the region of chemical space containing top-scoring compounds, including both hits in the screening library and de novo virtual hits.

We evaluated the performance of HIDDEN GEM on 16 different protein targets from various families, using the 37-billion-compound Enamine REAL Space as the virtual screening library. We have shown that for all 16 targets, HIDDEN GEM identified sets of hits with docking scores enriched up to 1,000-fold over scores for a random subset of the REAL Space library. Surprisingly, this enrichment was observed even after a single cycle of generative modeling and similarity searching. The virtual screening for each target was completed in as little as two days, leveraging only a single 44-CPU-core machine for docking, an 800-CPU-core computing cluster for similarity searching, and one Nvidia GTX 1080 Ti GPU for generative modeling. We posit that, due to its high computational efficiency, relatively low demand for expensive computational resources, and universal success in identifying high-scoring hits for diverse targets, HIDDEN GEM offers an attractive alternative to current VS methods.

Results and Discussion

General description of the HIDDEN GEM Workflow

For a given protein target with a known structure and binding site, HIDDEN GEM starts with the selection of a small initial library of compounds; we refer to this step as “Initialization” (Figure 1). Depending on the prior knowledge of the target, the initial compound set may be either target-focused or target-independent. In this work, we treat all targets as orphans (lacking any known binders) and thus, take a target-independent approach; we also report the results of a case study utilizing prior knowledge. To initialize the workflow, we utilize the Hit Locator Library (HLL) from Enamine (https://enamine.net/compound-libraries/diversity-libraries/hit-locator-library-200), which contains roughly 460,000 drug-like compounds. All molecules in this library are docked into each target and the best score per compound is kept.
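For illustration, the bookkeeping at the end of Initialization (retaining only the best docking score per compound across all of its poses) can be sketched in a few lines; the pair format and function name below are our own, not part of the released workflow, and lower (more negative) scores are assumed to be better:

```python
def best_score_per_compound(pose_scores):
    """Keep the best (most negative) docking score per compound.

    pose_scores: iterable of (compound_id, score) pairs, one per docked pose.
    Returns a dict mapping compound_id -> best score seen for that compound.
    """
    best = {}
    for cid, score in pose_scores:
        # Lower docking scores indicate stronger predicted binding.
        if cid not in best or score < best[cid]:
            best[cid] = score
    return best
```

The same reduction applies regardless of the docking engine, as long as its per-pose scores can be exported as (compound, score) pairs.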

Figure 1: Illustration of using HIDDEN GEM workflow for virtual screening of an Ultra-large chemical library.


The Initialization step employs traditional molecular docking of a small, diverse chemical library (left). The docking results are then used to bias a generative model to create de novo compounds with high docking scores (Generation step). A subset of these de novo molecules is docked into the binding site, and the resulting top-scoring molecules are then used for massive chemical similarity searching to identify a small set of purchasable compounds most similar to the query molecules in any ultra-large library (Similarity step). This small set of molecules is then docked and scored for hit nomination, or used to train a new series of generative models. The HIDDEN GEM cycle, including generative modeling, filtering, and similarity searching, can be repeated several times if needed. HIDDEN GEM rapidly delivers both purchasable in-library and de novo virtual hits.

In the second step, referred to as “Generation” (Figure 1), the docking results from Initialization are used to bias a pre-trained generative model toward creating de novo compounds predicted to have better docking scores. Herein we use a SMILES-based31 generative model pretrained on all the compounds from ChEMBL32; however, any other large library of compounds can be used for this purpose. To achieve faster convergence, the biasing process is implemented using a combination of two schemes. First, the compounds with the top 1% of docking scores are selected from Initialization to fine-tune the generative model towards the generation of compounds structurally similar to molecules in this top-scoring set. Second, all compounds from Initialization are used to build a binary classification filtering model, trained to discriminate the top 1% scoring compounds from the remaining 99%. After training, these two schemes are used in tandem, with the fine-tuned generative model proposing hit-like compounds and the filtering model assessing whether a given compound is predicted to be in the top 1% class. Only de novo generated compounds that are predicted to be in the top 1% are kept, and the rest are rejected. While generation can continue indefinitely, we only generate approximately 10,000 novel and unique compounds. This biased generation process takes as little as 4 hours on a single NVIDIA GeForce GTX 1080 Ti GPU. All resulting hit compounds are then docked and scored using the same approach as in Initialization.
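The two biasing schemes share a simple preprocessing step: ranking the Initialization docking scores, taking the top 1% as the fine-tuning set, and labeling every compound for the binary filtering classifier. A minimal sketch of this split is shown below (function and variable names are ours; lower scores are assumed to be better):

```python
def split_top_fraction(scores, fraction=0.01):
    """Split docked compounds for the two biasing schemes.

    scores: dict of compound_id -> docking score (lower = better).
    Returns (top_ids, labels): top_ids is the set used to fine-tune the
    generative model; labels maps every compound to 1 (top fraction) or 0,
    providing training data for the binary filtering model.
    """
    ranked = sorted(scores, key=scores.get)          # best scores first
    n_top = max(1, int(len(ranked) * fraction))
    top_ids = set(ranked[:n_top])
    labels = {cid: int(cid in top_ids) for cid in scores}
    return top_ids, labels
```

In the workflow, `top_ids` would feed the generator fine-tuning and `labels` the classifier; the actual models (SMILES generator, filter) are omitted here.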

In the third, "Similarity" step (Figure 1), up to 1,000 top-scoring compounds resulting from the docking conducted at the end of the Generation step are used for a massive similarity search against an ultra-large VS library. In this work, we use the 37-billion-compound Enamine REAL Space as the reference VS library. Compounds in this library with the highest calculated similarity (see Methods) to any of the top-scoring query compounds are identified. The number of compounds selected is a tunable parameter and could be any number desired; in this work, we chose 100,000 to minimize the number of subsequent docking calculations. The Similarity step is more computationally expensive than Generation, requiring approximately 3,600 core-hours (one core-hour being one CPU core running for one hour); however, the use of emerging, more advanced similarity searching algorithms33 has the potential to significantly accelerate this step. This set of 100,000 similar, in-library compounds is then docked and scored.
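Conceptually, the Similarity step ranks library compounds by Tanimoto similarity to each query. A toy sketch over fingerprints represented as sets of on-bit indices follows; the production search over 37 billion compounds relies on specialized indexing rather than this brute-force loop, and the names here are illustrative:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    # |A ∩ B| / |A ∪ B|
    return inter / (len(fp_a) + len(fp_b) - inter)

def most_similar(query_fp, library, k):
    """Return ids of the k library compounds most similar to the query.

    library: dict of compound_id -> fingerprint (set of on-bit indices).
    """
    ranked = sorted(library,
                    key=lambda cid: tanimoto(query_fp, library[cid]),
                    reverse=True)
    return ranked[:k]
```

Running `most_similar` for each of the up-to-1,000 queries and pooling the results would yield the set forwarded to final docking.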

A single iteration of Generation, minimal docking of the generative hits, Similarity, and final docking of the 100,000 hits is referred to as a HIDDEN GEM "Cycle" (Figure 1). Completing one cycle of the HIDDEN GEM workflow leads to the selection of compounds from an ultra-large chemical library, along with de novo generated compounds, with significantly improved scores compared to those in the library used for Initialization (Figure 2). Given the size of the Enamine HLL used in the Initialization step, the total number of molecules subjected to actual docking simulations after one HIDDEN GEM cycle of VS against the 37-billion-compound Enamine REAL Space library is under 600,000. If desired, additional cycles of the HIDDEN GEM workflow can be run to further optimize the selection of high-scoring compounds; in this case, the hits nominated and scored in the previous cycle are used for biasing in the Generation step of the next cycle. Theoretically, cycles can be repeated indefinitely but, as we discuss below, even a single cycle is surprisingly sufficient to identify high-scoring hits.
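The "under 600,000" figure follows directly from the step sizes quoted above (roughly 460,000 HLL compounds, approximately 10,000 generated compounds, and 100,000 similarity hits); the tally below is only this back-of-the-envelope sum:

```python
# Approximate number of compounds actually docked in one HIDDEN GEM cycle,
# starting from the Enamine HLL (step sizes as quoted in the text).
docked_per_step = {
    "Initialization (Enamine HLL)": 460_000,
    "Generation (de novo hits)": 10_000,
    "Similarity (in-library hits)": 100_000,
}
total_docked = sum(docked_per_step.values())  # well under 600,000
```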

Figure 2: Comparison of score distributions for each step of HIDDEN GEM.


Docking scores of the top 1,000 scoring compounds after the Initialization (blue), Generation (orange) and Similarity (green) steps of HIDDEN GEM for 16 different targets. For all targets except 4R06, docking scores are from FRED68; 4R06 docking scores are from AutoDock VINA67. Score distributions are plotted as normalized densities such that the area under each curve equals 1.

Assessment of HIDDEN GEM

We assessed the effectiveness of HIDDEN GEM by running a single cycle of the workflow on each of the 16 different protein targets34-49 (Table S1). For each of the targets, a significant shift in the top 1,000 docking scores for unique compounds (the top set) was observed after each step of the HIDDEN GEM cycle, with scores after Generation and Similarity steps being significantly better than those from Initialization (Figure 2). With the exception of two targets (PDB ID: 3L5D and 3MAX), the top set selected from the ultra-large library by the Similarity step had better scores compared to the top set produced by the Generation step. These results did not depend on the protein class of the target, suggesting HIDDEN GEM can function on any protein class.

To establish a baseline control to assess HIDDEN GEM performance, we docked a random subset of 3.5 million compounds from Enamine REAL (which required roughly the same computational resources as those needed for completing one cycle of HIDDEN GEM) into each target. We found that top-scoring compounds from both the Generation and Similarity steps had significantly better scores than those from the random baseline (Figure S1).

We compared the computational efficiency of HIDDEN GEM to that of several recent methods capable of screening billion-sized libraries14,18,22,23, along with a brute-force docking baseline (Table 1). A standard metric for comparing screening approaches is the enrichment factor (EF)50. For a given score threshold, EF is the ratio of the number of compounds identified by a VS method that score better than the threshold to the number of compounds in an equally sized random collection of docked compounds that score better than the same threshold (see Methods). We used EF1000, where the threshold is set to the 1000th-best score produced by the virtual screening method. HIDDEN GEM had an average EF1000 of 270 (Table S2). This is higher than the 220-250 estimated from the data reported for V-SYNTHES14. DeepDocking23 and VirtualFlow18 do not provide an EF value or the data required to calculate one. All other methods cannot be assigned an EF, as they dock the entire library rather than a subset, making EF calculations inapplicable.
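Under one plausible reading of the EF1000 definition above (threshold at the 1000th-best method score, with fractions compared against a random docked baseline of arbitrary size), the metric can be computed as follows; the function name and exact tie-handling are our assumptions:

```python
def enrichment_factor(method_scores, random_scores, n_top=1000):
    """Enrichment factor at the n_top-th best score of the VS method.

    Scores are docking scores (lower = better). Returns the fraction of
    method compounds scoring at or better than the threshold, divided by
    the same fraction in the random baseline.
    """
    threshold = sorted(method_scores)[min(n_top, len(method_scores)) - 1]
    frac_method = sum(s <= threshold for s in method_scores) / len(method_scores)
    frac_random = sum(s <= threshold for s in random_scores) / len(random_scores)
    if frac_random == 0:
        return float("inf")  # no random compound reaches the threshold
    return frac_method / frac_random
```

For example, if half of a method's docked set beats the threshold but only one in ten random compounds does, the EF is 5.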

Table 1: Comparison of several large scale virtual screening methods.

Reported resources are discussed in respective publications. Core hours required to screen the 37 billion REAL Space were calculated by scaling reported compute times to the 37 billion compounds (^) or assuming a docking time of 10 seconds per compound (#). EF1000 was calculated if the data required for its assessment was reported in the respective publications. AWS cost was estimated assuming an a1.medium instance for CPUs and a p3.2xlarge for GPUs. Estimated AWS time assumed that only 160,000 instances could be used at any given time.

Method                 | Reported resources used          | Core-hours required to screen 37 billion compounds | EF1000 | Estimated cost on AWS | Estimated time on AWS
HIDDEN GEM             | 1 RTX 1080Ti GPU; 800 + 44 CPUs  | 5,380 CPU; 2 GPU                                   | 270    | $138                  | 0.03 hours
DeepDocking23          | 250 V100 GPUs; 650 CPUs          | 296,400 CPU; 114,000 GPU                           | NA*    | $356,398              | 2.6 hours
V-SYNTHES14            | NA*                              | 5,556 CPU^; 0 GPU                                  | 220    | $142                  | 0.03 hours
VirtualFlow18          | 8,000 CPUs                       | 152,672,000 CPU#; 0 GPU                            | NA*    | $3,893,136            | 954.2 hours
Summit Supercomputer22 | 27,648 Tesla V100 GPUs           | 12,275,712 GPU#; 0 CPU                             | NA**   | $37,563,678           | 76.7 hours
Brute Force Docking    | NA                               | 102,777,777 CPU^; 0 GPU                            | NA**   | $2,620,833            | 642.4 hours

* Not reported in the respective publication.

** Enrichment factor may not be calculated because all compounds are docked.

All methods can also be compared by estimated computational cost. Both HIDDEN GEM and V-SYNTHES14 require only a few thousand CPU-core hours and would cost under $150 per target on the AWS cloud (Table 1). This compares favorably to the other four methods listed in Table 1, which require hundreds of thousands of GPU- and CPU-core hours, driving their costs into the million-dollar range. HIDDEN GEM is over 19,000-fold less costly than brute-force docking. Overall, HIDDEN GEM utilizes fewer resources than any other method except V-SYNTHES; yet it achieves the highest enrichment factor across all methods, making it, to the best of our knowledge, the most efficient among publicly accessible virtual screening methods.
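The brute-force row of Table 1 can be reproduced from the scaling rule stated in its legend (10 seconds of docking per compound, marked with #). The $0.0255 hourly rate below is our assumption for an a1.medium on-demand instance, implied by the table's cost figures rather than stated in the text:

```python
def brute_force_estimate(n_compounds=37_000_000_000,
                         sec_per_compound=10.0,
                         usd_per_core_hour=0.0255):
    """Back-of-the-envelope core-hour and AWS-cost estimate for brute-force
    docking, following the per-compound scaling used for Table 1 (#)."""
    core_hours = n_compounds * sec_per_compound / 3600.0
    cost_usd = core_hours * usd_per_core_hour
    return core_hours, cost_usd
```

With the defaults, this recovers roughly 102.8 million CPU core-hours and about $2.6 million, matching the brute-force row.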

Importance of the Generation step.

In principle, the HIDDEN GEM workflow could still function even if the Generation step were bypassed. To examine the impact of the biased generative modeling on overall workflow performance, we compared the complete HIDDEN GEM to a version with no Generation step, referred to as No GEM, in which top compounds from Initialization are used directly as the query for the Similarity step. Specifically, we compared a single complete cycle of HIDDEN GEM, which includes the Generation step followed by the Similarity step (cf. Figure 1), to two cycles of No GEM (where the Initialization step is followed by the Similarity step in the first cycle, and the second cycle is another Similarity step using the Similarity results from the first cycle as a reference). In general, we found that for all 16 targets the complete HIDDEN GEM workflow was superior in identifying better-scoring in-library compounds (Figures 3, S2 and S3). Moreover, with the exception of four targets (1Q4X, 1T7R, 2ZV2, and 3K23), the Generation step alone produced better-scoring compounds than those identified by No GEM (Figure S2). Notably, when executing the complete HIDDEN GEM workflow, only the top 100 hits from the Generation step are used in the Similarity step as reference compounds. Thus, the score distribution of the 1,000 top hits is expected to shift toward better docking scores, as is observed in most, but not all, cases (cf. Figure 2). In analyzing these results, one should keep in mind that similarity searches are not expected to always produce hits with better (i.e., lower) scores than those of the reference compounds. Thus, it is not surprising that the results are target dependent. Interestingly, for the same four targets (1Q4X, 1T7R, 2ZV2, and 3K23), the score distribution of hits after the Similarity step of the full HIDDEN GEM cycle is shifted toward better scores compared to those following the Generation step (Figure 2).
In fact, this pattern holds for most targets, with the exception of 3L5D, 3MAX and 1NDE (in the latter case, the distributions after the Generation and Similarity steps are practically the same). These observations suggest complex, and perhaps indecipherable, relationships among the features of target binding sites that influence docking scores, the diversity of hits used as reference compounds, similarity metrics, and docking score distributions. Despite these complex relationships, our key result, i.e., the substantial improvement of docking scores for hits produced by the complete HIDDEN GEM workflow, holds in all studied cases, as clearly illustrated by Figure 2.

Figure 3: Importance of the Generation step.


Docking scores for the top 1,000 scoring compounds after one cycle of HIDDEN GEM for Generation (orange) and Similarity (green) steps and two cycles of Similarity (not using Generation) called No GEM (red). Score distributions are plotted as normalized densities such that the area under each curve sums to 1. No GEM docking score distributions for all 16 targets are shown in Figure S2.

We also observe that two cycles of No GEM required significantly more resources than a single cycle of HIDDEN GEM (as Similarity is far more expensive than Generation), making the standard HIDDEN GEM cycle superior in both performance and computational efficiency. Further, the Generation step allows for the exploration of de novo designed compounds that could not be found with Similarity alone, increasing the potential utility of the HIDDEN GEM workflow.

Multiple cycles of HIDDEN GEM

As mentioned above, the HIDDEN GEM cycles can be executed sequentially, potentially resulting in the iterative nomination of progressively better-scoring compounds with each successive cycle. We applied a second cycle of HIDDEN GEM to two randomly chosen targets (4R06 and 5MZJ). We found that only 5MZJ exhibited improvements in compound scores from the Similarity step after multiple cycles of the workflow (Figure S4). Both targets showed significant improvement in scores between cycles 1 and 2 in the Generation step. Notably, the magnitude of improvement in Generation scores was much greater than that observed in the Similarity step for 5MZJ.

These studies suggest that if de novo compound nominations are desired, completing multiple cycles until the generated compounds show no further improvement could prove fruitful, albeit with the added risk of lower synthetic accessibility of the de novo generated compounds. Ultimately, the assessment of synthetic feasibility lies with chemists working on particular targets. For compounds that appear promising, chemists may be able to identify synthetic pathways, or they may discard computational hits as non-makable. Concerning the nomination of purchasable library compounds, the extent to which a second (or subsequent) cycle improves compound scores appears to depend on the target and may not always be necessary or justified. In most cases, we expect that a single cycle is sufficient to achieve significant enrichment in the top set of purchasable compounds.

The choice of initial libraries and HIDDEN GEM performance

HIDDEN GEM starts by docking a small set of compounds during Initialization. As the rest of the workflow leverages results from Initialization, choosing different initial sets has the potential to change the results of VS. To examine this issue, we compared the outcomes of a single cycle of HIDDEN GEM using either the Enamine HLL or a custom diversity set of 1.4 million compounds (Diverse set) representing the 37 billion Enamine REAL Space in the Initialization step. This analysis was conducted for two protein targets (5MZJ and 4R06). For 5MZJ, we observed a significant difference between the final score distributions for the top 1,000 compounds when starting with HLL versus Diverse, while only a minor difference was found for 4R06 (Figure S5). This target-dependent behavior was reminiscent of the findings reported above for the investigation involving multiple cycles of HIDDEN GEM.

Interestingly, for 4R06, we observed a 25% overlap between the top 1,000 scoring compounds (top set) obtained starting with HLL and the top set obtained starting from Diverse. For 5MZJ, this overlap was only 8%. The HLL and Diverse sets share zero compounds. Thus, within a single cycle, HIDDEN GEM converged from different initial compounds to at least a few identical regions of chemical space in the ultra-large library. This suggests that the initial set is likely not the dominant factor in determining performance, though changes in the initial set do appear to have some effect.

HIDDEN GEM may identify purchasable compounds similar to known bioactive molecules

For each of the 16 targets, HIDDEN GEM was used to nominate 1,000 top-scoring compounds from the Enamine REAL Space. These nominations were selected without prior knowledge of any published actives for a given target; no expert-based, manual selection or tuning of the scoring protocol with known actives was used. We compared the top-scoring sets of compounds for each target for structural similarity to the respective known actives identified in ChEMBL32. We found that, in general, the similarity between these sets of compounds was low. Nevertheless, several nominated purchasable compounds showed high structural similarity to high-affinity actives (Figures 4 and S7). These pairs even showed exact matches of major substructures and, in some cases, a general conservation of the base scaffold. For comparison, randomly nominating 1,000 samples from the Enamine REAL Space showed no such similarity to known actives for any target. For all targets except 5L2S, there was significantly higher Tanimoto similarity51 between the top set and known actives than between the random 1,000 compounds and the known actives (Table S3). These results validate HIDDEN GEM's ability to nominate compounds that are likely enriched toward activity against a given target. At the same time, our results show that HIDDEN GEM nominated a diverse set of purchasable compounds that are not all close analogs of any known actives, which is a desirable objective for any VS campaign52.

Figure 4: Examples of HIDDEN GEM nominations similar to known actives for the given target.


Activity data are IC50 values taken from the corresponding ChEMBL32 entry for the UniProt ID associated with the target PDB ID. Further examples can be found in Figure S7.

HIDDEN GEM workflow allows for leveraging prior knowledge

All results discussed above were obtained by treating the proteins as orphan targets with known binding sites; no knowledge of existing binders to the respective targets was utilized during execution of the HIDDEN GEM workflow. However, the workflow can take known binders into account if such information exists. Thus, we explored a version of the workflow that used known active compounds to assemble the Initialization set. This set was generated by finding 500,000 compounds in the Enamine REAL Space with structures similar to known actives, as measured by Tanimoto similarity51. HIDDEN GEM then proceeded as described above. We carried out this approach on 5MZJ, the adenosine A2a receptor, leveraging 844 known actives from DUD-E53. We found that using this biased initial set did not help HIDDEN GEM identify higher-scoring hits compared to the standard Initialization library. However, with the biased Initialization library, the top-scoring hits had much higher structural similarity to known actives than the top hits from the non-biased approach (Figure S8). Interestingly, there was some overlap (9%) between the top 1,000 scored hits from each approach. We conclude that using known actives may help HIDDEN GEM identify high-scoring analogs of these molecules, but the availability of such prior knowledge is not critical for HIDDEN GEM's ability to identify high-scoring hits, which may include structural analogs of known actives.

DISCUSSION

We have introduced HIDDEN GEM, which, to our knowledge, is the first automated workflow combining minimal molecular docking, machine learning, generative modeling, and massive similarity searching to greatly accelerate large scale virtual screening. The workflow requires the actual docking of no more than a million compounds per target in total, allowing for the rapid nomination of top scoring virtual hits selected out of billion-sized libraries. These nominations can be either purchasable or, uniquely to the HIDDEN GEM workflow, de novo designed compounds, providing more flexibility to future hit optimization steps.

We have demonstrated HIDDEN GEM’s generalizability by applying this workflow to 16 different targets, covering a wide range of protein classes, each showing significant enrichment. We intentionally challenged our approach by assuming no prior knowledge except for binding site location, thereby treating these targets as if they were understudied. HIDDEN GEM was still able to perform well under this constraint. However, the design of HIDDEN GEM allows for the possible use of any known information such as known actives. Developing approaches that incorporate prior knowledge into HIDDEN GEM is an area with great potential for future investigation.

We highlight two key results of using HIDDEN GEM that we attribute to the Generation step. First, significant enrichment in the scores of top hit compounds is achieved after only a single cycle of the HIDDEN GEM workflow, minimizing the resources required for screening (Figure S4). Second, we underscore HIDDEN GEM's ability to outperform similarity searching alone (No GEM) when screening ultra-large libraries. Together, these results suggest that Generation is critical to the success of the entire workflow (Figures 3 and S2).

While it is difficult to assess why the Generation step has such an effect, we theorize that this is due to the sparsity of current ultra-large libraries8 and, more specifically, the Generation step's ability to traverse this sparsity by more effectively utilizing patterns in chemical structures. While the Enamine REAL Space library with 37 billion compounds certainly qualifies as ultra-large, it represents a tiny fraction of the estimated complete library of 10^63 chemically feasible, drug-like small molecules54. Furthermore, the combinatorial nature of most ultra-large libraries can exacerbate the sparsity issue, as all compounds are derived from a much smaller pool of fragments and reactions. Navigating such a sparse chemical space by means of similarity searching alone can be volatile, as nearby points may be structurally quite different. The Generation step does not share this constraint, instead being able to explore chemical space unrestricted by known molecules, potentially filling structural gaps found in the library. Some evidence of this notion is provided by the observation that high-scoring compounds from the Generation step formed a smaller number of dense clusters compared to the similarity search alone (Figures S6 and S3).

We also observe that the chemical diversity of the 1,000 top-scoring compounds (top set) generally decreased after each step of the HIDDEN GEM cycle for all targets (Figure S6). Relative to the top hits of Initialization, those of the Generation and Similarity steps clustered around a few distinct regions of chemical space, representing distinct scaffolds. This suggests that HIDDEN GEM is "focusing in" on certain regions of chemical space, likely areas that contain specific chemical groups important for achieving better scores. To explore this observation further, we examined several representative molecules from different structural clusters of top-scoring compounds (Figure 5). We found that HIDDEN GEM was capable of nominating high-scoring molecules with diverse scaffolds, which can also explore alternative binding poses (Fig. 5A). In addition, HIDDEN GEM can potentially emulate hit optimization by progressively identifying a series of analogs, including molecules both generated de novo and found in the purchasable collection, with gradually improving scores (Fig. 5B). We posit that this interesting observation reflects the exploratory nature of HIDDEN GEM enabled by its generative chemistry component, as discussed in the previous paragraph.

Figure 5: Hit distribution and evolution during HIDDEN GEM screening.


A. (Left): Distribution of the top 1,000 compounds in chemical space from the Initialization (blue), Generation (orange) and Similarity (green) steps after one cycle of HIDDEN GEM, calculated using t-SNE79. Two top-scoring compounds from distal structural clusters nominated by HIDDEN GEM are highlighted in purple and yellow. These hits exhibit different chemical scaffolds and different interaction/binding modes with the receptor (Middle and Right). B. Distribution of docking scores for the top 1,000 compounds after the Initialization (blue), Generation (orange) and Similarity (green) steps of HIDDEN GEM. The best-scoring compound from the initial Hit Locator Library is indicated by the blue circle and arrow; its most structurally similar analogs among the top 5 scored compounds from the Generation and Similarity steps of HIDDEN GEM are indicated by orange and green circles and arrows, respectively. (Right): Representation of the binding modes for these three compounds. For some targets, HIDDEN GEM identifies diverse but equally high-scoring hits, as well as better-scoring analogs of compounds in the initial screening library.

Overall, HIDDEN GEM validation showed enrichment of hits by up to 1,000-fold compared to random selection, with an average enrichment factor upwards of 270. This is higher than the enrichment factors reported for other VS methods (Table 1), placing HIDDEN GEM at the top of ultra-fast virtual screening methods. HIDDEN GEM achieved this performance while using a fraction of the resources required by most other approaches and 19,000-fold fewer resources than brute-force docking (Table 1). Further, it did so without requiring either prior knowledge of known binders or extensive, often inaccessible, proprietary knowledge of how the VS library was produced. These features, combined with its open-source nature, make HIDDEN GEM appealing as an effective, publicly available VS resource.

Future directions

In our assessment, HIDDEN GEM is the most computationally efficient and effective VS approach reported in the public literature (cf. Table 1). Nevertheless, there is room for further improvement. The largest limitation of HIDDEN GEM, which, in fact, is shared by all other structure-based VS methods, is its reliance on molecular docking to score compounds. Oftentimes, the correlation between docking score and biological activity is weak55,56. While many campaigns leveraging docking have shown success, they often require fine-tuning and expert-based filters7, something not achievable for all targets. This issue can only be solved through advancements in binding affinity prediction, an area of active research57-60. HIDDEN GEM was designed with this in mind and can easily replace the docking-based scoring step with any other method that scores a compound-target pair.

HIDDEN GEM can benefit from improvements in other areas as well. Roughly 65% of the computational effort stems from the Similarity step, owing to its simplistic, brute-force implementation in this work. Advancements in similarity searching approaches33,61 or the use of commercial software such as Arthor (https://www.nextmovesoftware.com/arthor.html) or InfiniSee (https://www.biosolveit.de/infiniSee) could reduce the computational load of the Similarity step from 3,600 core-hours to below 1 core-hour, rendering its cost negligible. Additionally, the development and use of faster docking methods, such as QuickVina 262 or AI-assisted docking63,64, can further accelerate calculations. With such improvements, it would become feasible to run the entire HIDDEN GEM workflow on Enamine REAL Space on a single, generic 8-core laptop.

Methods

Ultra-large virtual screening library

A library of approximately 37 billion small molecules was curated from the current Enamine REAL space catalog (https://enamine.net/compound-collections/real-compounds). This library included stereoisomers and racemates. No compounds from other sources were included.

Selection of targets

A total of 16 targets were used in this work for method validation. These targets were selected by combining an arbitrary subset of the list previously used to validate the Deep Docking approach17 with a subset of the DISCO database65 that represents targets from different protein families and different docking difficulties (as assigned in the DISCO database). All targets were required to have a known experimental structure and a known ligand to establish the location of the binding site. No other information about the targets was utilized in the HIDDEN GEM approach described in this work.

Initial screening libraries

Two libraries were used for the Initialization step of HIDDEN GEM (cf. Figure 1). In most cases, we employed the Enamine Hit Locator Library (HLL), consisting of 460,600 compounds. In addition, to evaluate the sensitivity of HIDDEN GEM to the choice of initial library, we also considered a 1.4 million compound diverse set (Diverse). This set was generated by greedy min-max clustering of the 37 billion compound Enamine REAL Space library such that no two compounds had a Tanimoto similarity51 above 0.2. To generate this set, a 256-bit, radius-3, hashed Extended Connectivity Fingerprint66 (ECFP) was first calculated for each compound in the library, and the order of the compounds was randomly shuffled. For each compound, its Tanimoto similarity to every compound already in the set was calculated, and the compound was added to the set only if all similarities were below 0.2. This process continued until all compounds were processed, resulting in a set of roughly 1.4 million compounds.
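The greedy min-max selection described above can be sketched as follows. This is an illustrative, pure-Python implementation using toy fingerprints represented as sets of "on" bits; the actual workflow uses 256-bit hashed ECFPs computed with RDKit, and the function names here are our own, not from the HIDDEN GEM codebase.

```python
import random

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def greedy_diverse_subset(fingerprints, threshold=0.2, seed=42):
    """Greedy min-max selection: keep a compound only if its Tanimoto
    similarity to every compound already selected is below `threshold`."""
    order = list(range(len(fingerprints)))
    random.Random(seed).shuffle(order)  # random processing order, as in the text
    selected = []
    for i in order:
        fp = fingerprints[i]
        if all(tanimoto(fp, fingerprints[j]) < threshold for j in selected):
            selected.append(i)
    return selected

# Toy example: three near-duplicate compounds and one distinct compound.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4}, {10, 11, 12}]
picked = greedy_diverse_subset(fps, threshold=0.2)
print(len(picked))  # near-duplicates collapse to a single representative
```

Because the processing order is random, different shuffles can yield different (but similarly diverse) subsets, which is why the final set size of ~1.4 million is approximate.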

Docking

Docking was performed with AutoDock Vina67 for 4R06 and OpenEye's FRED68 for all other targets. For Vina, the target and ligands were prepared using default settings from the ADFR software suite. Initial 3D conformers were calculated for each ligand using RDKit69. The docking box (a 20 Å-side cube) was centered on the known ligand. For FRED, targets were prepared using OpenEye's SPRUCE with default settings. Ligands were prepared using the Pose variant of OpenEye's OMEGA70 with the flipper stereocenter-enumeration utility turned on and otherwise default settings. Docking was carried out on CPUs from UNC's Longleaf and Dogwood high-performance computing clusters. No prior information regarding target binding, aside from the location of a known binding site, was used during the docking process.

Generative modeling

The generative model is a transformer encoder trained with causal masking and positional encoding according to a reconstruction cross-entropy loss71:

\mathcal{L}(S) = -\sum_{i=1}^{k} \log P(s_i \mid s_1, \ldots, s_{i-1})

where S = s_1, ..., s_k represents a SMILES string, padded with a special padding character to a length of k if necessary. We use k = 120 and remove from the pre-training dataset any SMILES strings exceeding this length. The conditional probability is estimated by a neural network, specifically the transformer encoder, consisting of several multi-head attention blocks, whose parameters are learned via the Adam optimizer, a modified form of stochastic gradient descent, applied to a batched version of the above loss function72. This architecture is similar to that of pre-trained language models such as OpenAI's GPT series73. Specifically, we use 6 transformer encoder layers, each with an 8-head multi-head attention block and a hidden dimension of 512, implemented and trained in PyTorch74.

In more detail, a SMILES string is first embedded according to

h_0 = W_e S + W_p

where W_e is an embedding matrix and W_p is a sinusoidal positional encoding71. This initial embedding is passed through n = 6 transformer encoder layers, utilizing a causal attention mask that ensures each character is predicted based only on the preceding characters:

h_i = \mathrm{TransformerEncoderLayer}(h_{i-1}), \quad i = 1, \ldots, n

Finally, the last hidden layer of the network is converted back into a probability distribution over tokens via the softmax function:

P(S) = \mathrm{softmax}(W_e^{T} h_n)
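The causal masking in the forward pass above can be illustrated with a minimal NumPy sketch (this is illustrative only, not the authors' implementation; in practice the mask is supplied to PyTorch's attention layers). An additive mask of -inf above the diagonal guarantees that the attention weights for position i are nonzero only for positions up to and including i:

```python
import numpy as np

def causal_mask(k: int) -> np.ndarray:
    """Additive attention mask: position i may attend only to positions <= i.
    Disallowed entries are -inf so they vanish under the softmax."""
    return np.triu(np.full((k, k), -np.inf), k=1)

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax of attention scores after adding the causal mask."""
    scores = scores + causal_mask(scores.shape[0])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)                                   # exp(-inf) -> 0
    return w / w.sum(axis=1, keepdims=True)

# With uniform (zero) scores, each position attends equally to all
# preceding positions and itself, and to nothing after it.
attn = masked_softmax(np.zeros((4, 4)))
print(attn[0])  # first token attends only to itself: [1, 0, 0, 0]
```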

The model is pre-trained on a selection of approximately 1.4 million SMILES strings derived from ChEMBL, stripped of their chirality and augmented by SMILES enumeration75. This augmentation adds three different random non-canonical SMILES strings for each canonical SMILES already in the dataset. During training and generation, SMILES strings are tokenized at the character level. The model was trained for multiple epochs over the dataset, taking approximately 72 GPU-hours in total. To benchmark the efficacy of pre-training, we quantified the validity, novelty, and uniqueness of generated SMILES strings. Out of 10,000 generated strings, 95% represented valid molecules and 92% were valid, novel, and unique. This compares favorably to generative models not utilizing a transformer architecture, such as those based on recurrent neural networks27.
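Character-level tokenization with padding to k = 120, as described above, can be sketched as follows. This is an illustrative snippet: the padding character, vocabulary construction, and function names are our own assumptions rather than details taken from the paper.

```python
PAD = "_"      # assumed padding character (not specified in the text)
MAX_LEN = 120  # k = 120, as stated above

def build_vocab(smiles_list):
    """Map each character appearing in the corpus (plus PAD) to an integer id."""
    chars = sorted(set("".join(smiles_list)))
    return {c: i for i, c in enumerate([PAD] + chars)}

def encode(smiles, vocab, max_len=MAX_LEN):
    """Character-level tokenization, padded to a fixed length of max_len.
    Strings longer than max_len are dropped (returned as None)."""
    if len(smiles) > max_len:
        return None
    ids = [vocab[c] for c in smiles]
    ids += [vocab[PAD]] * (max_len - len(ids))
    return ids

corpus = ["CCO", "c1ccccc1O"]  # ethanol, phenol
vocab = build_vocab(corpus)
tokens = encode("CCO", vocab)
print(len(tokens))  # 120
```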

During the Generation step of each round, the pre-trained model was fine-tuned on the top 1% of scoring compounds from the previous step. Fine-tuning, the process of biasing a generative model by repeated training on a smaller, refined dataset, was performed over 10 epochs at a reduced learning rate of 0.0001. All model weights were allowed to vary during the fine-tuning step.

Furthermore, all docked compounds from the previous step were used to train a classification model for filtering generated compounds. The top 1% of scoring compounds were assigned the positive class, while the remaining 99% were assigned the negative class. Each compound was featurized using RDKit fingerprint descriptors with default settings69. A random forest model was trained on these data using scikit-learn76 with default parameters, maintaining the inherent class imbalance. To use this model for filtering generated compounds, a classification threshold was needed. To determine an adequate threshold, we separately trained an identical model on a randomly chosen, stratified 80% of the data and used the remaining 20% for validation, focusing on the model's positive predictive value (PPV), a metric well suited to virtual screening and similar early-enrichment tasks. We chose the threshold at which the model achieved a PPV of at least 60%.
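The threshold-selection step can be sketched as follows: given predicted probabilities and labels on the held-out validation split, find the lowest threshold at which precision (PPV) still meets the target. This is an illustrative NumPy implementation, not the authors' code, and the toy data are invented.

```python
import numpy as np

def ppv_threshold(probs, labels, target_ppv=0.6):
    """Return the lowest classification threshold at which the positive
    predictive value (precision) on validation data reaches target_ppv."""
    order = np.argsort(probs)[::-1]           # descending by predicted probability
    labels = np.asarray(labels)[order]
    probs = np.asarray(probs)[order]
    tp = np.cumsum(labels)                    # true positives above each cut
    ppv = tp / np.arange(1, len(labels) + 1)  # precision if we cut after position i
    ok = np.where(ppv >= target_ppv)[0]
    if len(ok) == 0:
        return None                           # target precision never reached
    return probs[ok[-1]]                      # lowest probability kept while PPV holds

# Toy validation split: higher probabilities are mostly true positives.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
thr = ppv_threshold(probs, labels, target_ppv=0.6)
print(thr)  # keeping predictions >= 0.4 gives 3 TP out of 5, i.e. PPV = 0.6
```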

Similarity search

Similarity searching was carried out in an exact, brute-force manner. First, a reference set of 100 compounds, R, is selected along with an ultra-large purchasable chemical library, Q. In this work, R comprised the top 100 scoring compounds from the previous Generation step, and the ultra-large library was the 37 billion compound REAL Space. A 256-bit, radius-3, hashed Extended Connectivity Fingerprint66 (ECFP) is calculated for both R (denoted R_f) and Q (denoted Q_f). For each query fingerprint q_f in Q_f, the Euclidean distance in fingerprint space to each of the reference fingerprints r_f in R_f is calculated. To assign a single "similarity score" relating a query compound q in Q to the whole reference set R, the distances between the query and the reference set are aggregated using the minimum value:

q_{\mathrm{score}} = \min_{r_f \in R_f} \lVert q_f - r_f \rVert_2

After similarity scores are calculated for all queries, a small subset of the queries is isolated. In this work, this subset comprised the 100,000 queries with the best scores and represents the nominated purchasable picks for a cycle. Distance calculations were carried out using the SciPy77 and NumPy78 packages for Python, and fingerprints were generated using RDKit69. Computations were run in parallel across 800 CPU cores on UNC's Longleaf HPC system and took approximately 24 hours, or ~20,000 CPU core-hours, for each reference set.
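A minimal NumPy sketch of this brute-force scoring follows. It is illustrative only: the toy 4-bit fingerprints and the function name are our own, and the real computation runs over 37 billion 256-bit fingerprints in parallel batches.

```python
import numpy as np

def similarity_scores(query_fps: np.ndarray, ref_fps: np.ndarray) -> np.ndarray:
    """For each query fingerprint, the minimum Euclidean distance to any
    reference fingerprint (lower score = more similar to the reference set)."""
    # Pairwise squared distances via ||q - r||^2 = ||q||^2 + ||r||^2 - 2 q.r
    q2 = (query_fps ** 2).sum(axis=1, keepdims=True)
    r2 = (ref_fps ** 2).sum(axis=1)
    d2 = q2 + r2 - 2.0 * query_fps @ ref_fps.T
    d2 = np.maximum(d2, 0.0)        # guard against floating-point rounding
    return np.sqrt(d2).min(axis=1)  # aggregate with the minimum, as in the text

# Toy 4-bit fingerprints: the first query matches a reference exactly.
refs = np.array([[1, 0, 1, 0], [0, 1, 1, 1]], dtype=float)
queries = np.array([[1, 0, 1, 0], [1, 1, 1, 1]], dtype=float)
scores = similarity_scores(queries, refs)
best = np.argsort(scores)[:1]  # keep the top-scoring (closest) queries
print(scores)  # [0. 1.]
```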

Nomination of top sets from HIDDEN GEM

HIDDEN GEM can nominate both de novo and purchasable compounds. De novo compounds are nominated from the docked output of the Generation step, while purchasable compounds come from the docked output of the Similarity step. In this work, no experiments were conducted on nominated compounds. Nominations were based strictly on docking scores, using no expert opinion or filtering methods. We nominated 1,000 picks for each target from the Similarity step only. In the case of experimental follow-up, this set can be reduced to a desired number using any preferred filtering or clustering approach, which has been shown to be an effective way to process virtual nominations in previous works7,14,23.

Enrichment factor calculation

The enrichment factor (EF) is an effective way to assess whether a given method outperforms a random control; higher EF values indicate a more effective method. EF is only meaningful for methods that sample from a population, as the produced sample is not guaranteed to contain the highest-scoring compounds in the population. Methods that score the entire population are guaranteed to find the highest-scoring compounds, so their nominated set cannot be "enriched with high-scoring compounds": it is the exact set of the highest-scoring compounds.

Enrichment factor (EF) is defined in this work as

\mathrm{EF} = \frac{\text{number of top-scoring compounds from a VS method with score} < X}{\text{number of random compounds with score} < X},

where X is a desired docking score threshold. EF1000 sets X to the docking score of the 1,000th top-scoring compound from the HIDDEN GEM nominations. The random set comprised 3.5 million compounds selected uniformly at random from the 37 billion Enamine REAL Space; these were docked against the respective target to score random nominations. EF1000 values for V-SYNTHES14 were estimated from histograms of the data provided in the respective publication.
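As an illustration, the EF calculation can be sketched as below. Note that this sketch normalizes each count by its set size, i.e. it compares hit rates, since the nominated and random sets differ greatly in size; the toy scores assume that lower (more negative) docking scores are better, and all numbers are invented.

```python
def enrichment_factor(method_scores, random_scores, x):
    """EF as defined above, with each count normalized by its set size:
    the fraction of method scores beating threshold x, divided by the
    fraction of random scores beating it (lower score = better, hence '<')."""
    hits = sum(s < x for s in method_scores) / len(method_scores)
    base = sum(s < x for s in random_scores) / len(random_scores)
    return hits / base

# Toy example: the method concentrates good (low) scores versus random.
method = [-12.0, -11.5, -11.0, -10.0]
rand = [-12.0] + [-8.0] * 99   # only 1% of random compounds beat the threshold
x = -9.0                       # e.g. the score of the 1,000th-ranked nomination
ef = enrichment_factor(method, rand, x)
print(ef)  # 100.0
```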

Supplementary Material

1

ACKNOWLEDGEMENTS.

We thank the Research Computing group at the University of North Carolina at Chapel Hill for providing computational resources and support that have contributed to these research results. We also thank Josh Hochuli and Kathryn Kirchoff for helpful discussions.

This work was supported by the National Institutes of Health (Grants U19AI171292 and R01GM140154). JW appreciates the support from the National Institute of General Medical Sciences of the National Institutes of Health under Award Number T32GM135122. TM received support during the early stages of this work from the NIGMS and the NICHD of the NIH under award numbers T32 GM086330 and T32 HD104576.

DATA AVAILABILITY.

Lists of 1,000 top-scoring hits for all 16 targets can be found at https://github.com/molecularmodelinglab/HIDDEN-GEM/tree/main/paper_nominations

CODE AVAILABILITY.

Scripts, source codes and full documentation for HIDDEN GEM can be found at https://github.com/molecularmodelinglab/HIDDEN-GEM.

References.

  • 1. Potlitz F, Link A & Schulig L Advances in the discovery of new chemotypes through ultra-large library docking. Expert Opin. Drug Discov 18, 303–313 (2023).
  • 2. Lyu J. et al. Ultra-large library docking for discovering new chemotypes. Nature 566, 224–229 (2019).
  • 3. Quartararo AJ et al. Ultra-large chemical libraries for the discovery of high-affinity peptide binders. Nat. Commun 11, 3183 (2020).
  • 4. Walters WP Virtual Chemical Libraries. J. Med. Chem 62, 1116–1124 (2019).
  • 5. Stanzione F, Giangreco I & Cole JC Chapter Four - Use of molecular docking computational tools in drug discovery. in Progress in Medicinal Chemistry (eds. Witty DR & Cox B) vol. 60 273–343 (Elsevier, 2021).
  • 6. Gahbauer S. et al. Iterative computational design and crystallographic screening identifies potent inhibitors targeting the Nsp3 macrodomain of SARS-CoV-2. Proc. Natl. Acad. Sci 120, e2212931120 (2023).
  • 7. Bender BJ et al. A practical guide to large-scale docking. Nat. Protoc 16, 4799–4832 (2021).
  • 8. Lyu J, Irwin JJ & Shoichet BK Modeling the expansion of virtual screening libraries. Nat. Chem. Biol 1–7 (2023) doi:10.1038/s41589-022-01234-w.
  • 9. Stein RM et al. Virtual discovery of melatonin receptor ligands to modulate circadian rhythms. Nature 579, 609–614 (2020).
  • 10. Alon A. et al. Structures of the σ2 receptor enable docking for bioactive ligand discovery. Nature 600, 759–764 (2021).
  • 11. Pandey M. et al. The transformational role of GPU computing and deep learning in drug discovery. Nat. Mach. Intell 4, 211–221 (2022).
  • 12. Muhammad NB & Bazzi M Advances in Cloud Computing: Security Issues and Challenges in the Cloud. in 2022 5th International Conference on Information and Computer Technologies (ICICT) 110–116 (2022) doi:10.1109/ICICT55905.2022.00027.
  • 13. Tingle BI & Irwin JJ Large-Scale Docking in the Cloud. J. Chem. Inf. Model 63, 2735–2741 (2023).
  • 14. Sadybekov AA et al. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds. Nature 601, 452–459 (2022).
  • 15. Yu Y. et al. Uni-Dock: GPU-Accelerated Docking Enables Ultralarge Virtual Screening. J. Chem. Theory Comput (2023) doi:10.1021/acs.jctc.2c01145.
  • 16. Gentile F. et al. Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking. Nat. Protoc 17, 672–697 (2022).
  • 17. Gentile F. et al. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. ACS Cent. Sci 6, 939–949 (2020).
  • 18. Gorgulla C. et al. An open-source drug discovery platform enables ultra-large virtual screens. Nature 580, 663–668 (2020).
  • 19. Berenger F, Kumar A, Zhang KYJ & Yamanishi Y Lean-Docking: Exploiting Ligands' Predicted Docking Scores to Accelerate Molecular Docking. J. Chem. Inf. Model 61, 2341–2352 (2021).
  • 20. Choi J & Lee J V-Dock: Fast Generation of Novel Drug-like Molecules Using Machine-Learning-Based Docking Score and Molecular Optimization. Int. J. Mol. Sci 22, 11635 (2021).
  • 21. Gupta A & Zhou H-X Machine Learning-Enabled Pipeline for Large-Scale Virtual Drug Screening. J. Chem. Inf. Model 61, 4236–4244 (2021).
  • 22. Acharya A. et al. Supercomputer-Based Ensemble Docking Drug Discovery Pipeline with Application to Covid-19. J. Chem. Inf. Model 60, 5832–5852 (2020).
  • 23. Gentile F. et al. Automated discovery of noncovalent inhibitors of SARS-CoV-2 main protease by consensus Deep Docking of 40 billion small molecules. Chem. Sci 12, 15960–15974 (2021).
  • 24. Xu Z, Wauchope OR & Frank AT Navigating Chemical Space by Interfacing Generative Artificial Intelligence and Molecular Docking. J. Chem. Inf. Model 61, 5589–5600 (2021).
  • 25. Thomas M, Bender A & de Graaf C Integrating structure-based approaches in generative molecular design. Curr. Opin. Struct. Biol 79, 102559 (2023).
  • 26. Zhavoronkov A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol 37, 1038–1040 (2019).
  • 27. Popova M, Isayev O & Tropsha A Deep reinforcement learning for de novo drug design. Sci. Adv 4, eaap7885 (2018).
  • 28. Gómez-Bombarelli R. et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci 4, 268–276 (2018).
  • 29. Anstine DM & Isayev O Generative Models as an Emerging Paradigm in the Chemical Sciences. J. Am. Chem. Soc 145, 8736–8750 (2023).
  • 30. Cheng Y, Gong Y, Liu Y, Song B & Zou Q Molecular design in drug discovery: a comprehensive review of deep generative models. Brief. Bioinform 22, bbab344 (2021).
  • 31. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci 28, 31–36 (1988).
  • 32. Gaulton A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
  • 33. Bellmann L, Penner P & Rarey M Topological Similarity Search in Large Combinatorial Fragment Spaces. J. Chem. Inf. Model 61, 238–251 (2021).
  • 34. Henke BR et al. A New Series of Estrogen Receptor Modulators That Display Selectivity for Estrogen Receptor β. J. Med. Chem 45, 5492–5505 (2002).
  • 35. Kukimoto-Niino M. et al. Crystal Structure of the Ca2+/Calmodulin-dependent Protein Kinase Kinase in Complex with the Inhibitor STO-609. J. Biol. Chem 286, 22570–22579 (2011).
  • 36. Biggadike K. et al. Design and x-ray crystal structures of high-potency nonsteroidal glucocorticoid agonists exploiting a novel binding site on the receptor. Proc. Natl. Acad. Sci 106, 18114–18119 (2009).
  • 37. Zhu Z. et al. Discovery of Cyclic Acylguanidines as Highly Potent and Selective β-Site Amyloid Cleaving Enzyme (BACE) Inhibitors: Part I—Inhibitor Design and Validation. J. Med. Chem 53, 951–965 (2010).
  • 38. Renaud J. et al. Estrogen Receptor Modulators: Identification and Structure–Activity Relationships of Potent ERα-Selective Tetrahydroisoquinoline Ligands. J. Med. Chem 46, 2945–2957 (2003).
  • 39. Bressi JC et al. Exploration of the HDAC2 foot pocket: Synthesis and SAR of substituted N-(2-aminophenyl)benzamides. Bioorg. Med. Chem. Lett 20, 3142–3145 (2010).
  • 40. Whitehead L. et al. Human HDAC isoform selectivity achieved via exploitation of the acetate release channel with structurally unique small molecule inhibitors. Bioorg. Med. Chem 19, 4626–4634 (2011).
  • 41. Borngraeber S. et al. Ligand selectivity by seeking hydrophobicity in thyroid hormone receptor. Proc. Natl. Acad. Sci 100, 15358–15363 (2003).
  • 42. Brzozowski AM et al. Molecular basis of agonism and antagonism in the oestrogen receptor. Nature 389, 753–758 (1997).
  • 43. McTigue M. et al. Molecular conformations, interactions, and properties associated with drug efficiency and clinical performance among VEGFR TK inhibitors. Proc. Natl. Acad. Sci 109, 18281–18289 (2012).
  • 44. Hur E. et al. Recognition and Accommodation at the Androgen Receptor Coactivator Binding Interface. PLOS Biol. 2, e274 (2004).
  • 45. Chen P. et al. Spectrum and Degree of CDK Drug Interactions Predicts Clinical Performance. Mol. Cancer Ther 15, 2273–2281 (2016).
  • 46. van Marrewijk LM et al. SR2067 Reveals a Unique Kinetic and Structural Signature for PPARγ Partial Agonism. ACS Chem. Biol 11, 273–283 (2016).
  • 47. Fan H. et al. Structural basis for ligand recognition of the human thromboxane A2 receptor. Nat. Chem. Biol 15, 27–33 (2019).
  • 48. Squire CJ, Dickson JM, Ivanovic I & Baker EN Structure and Inhibition of the Human Cell Cycle Checkpoint Kinase, Wee1A Kinase: An Atypical Tyrosine Kinase with a Key Role in CDK1 Regulation. Structure 13, 541–550 (2005).
  • 49. Cheng RKY et al. Structures of Human A1 and A2A Adenosine Receptors with Xanthines Reveal Determinants of Selectivity. Structure 25, 1275–1285.e4 (2017).
  • 50. Huang N, Shoichet BK & Irwin JJ Benchmarking Sets for Molecular Docking. J. Med. Chem 49, 6789–6801 (2006).
  • 51. Tanimoto TT An elementary mathematical theory of classification and prediction. (International Business Machines Corporation, 1958).
  • 52. Phatak SS, Stephan CC & Cavasotto CN High-throughput and in silico screenings in drug discovery. Expert Opin. Drug Discov 4, 947–959 (2009).
  • 53. Mysinger MM, Carchia M, Irwin JJ & Shoichet BK Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem 55, 6582–6594 (2012).
  • 54. Fink T, Bruggesser H & Reymond J-L Virtual Exploration of the Small-Molecule Chemical Universe below 160 Daltons. Angew. Chem. Int. Ed 44, 1504–1508 (2005).
  • 55. Pantsar T & Poso A Binding Affinity via Docking: Fact and Fiction. Molecules 23, 1899 (2018).
  • 56. Fan J, Fu A & Zhang L Progress in molecular docking. Quant. Biol 7, 83–89 (2019).
  • 57. Ballester PJ & Mitchell JBO A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking. Bioinformatics 26, 1169–1175 (2010).
  • 58. Bitencourt-Ferreira G & de Azevedo WF Machine Learning to Predict Binding Affinity. in Docking Screens for Drug Discovery (ed. de Azevedo WF Jr.) 251–273 (Springer, 2019) doi:10.1007/978-1-4939-9752-7_16.
  • 59. Rube HT et al. Prediction of protein–ligand binding affinity from sequencing data with interpretable machine learning. Nat. Biotechnol 40, 1520–1527 (2022).
  • 60. Ragoza M, Hochuli J, Idrobo E, Sunseri J & Koes DR Protein–Ligand Scoring with Convolutional Neural Networks. J. Chem. Inf. Model 57, 942–957 (2017).
  • 61. Bai Y. et al. SimGNN: A Neural Network Approach to Fast Graph Similarity Computation. in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining 384–392 (Association for Computing Machinery, 2019) doi:10.1145/3289600.3290967.
  • 62. Alhossary A, Handoko SD, Mu Y & Kwoh C-K Fast, accurate, and reliable molecular docking with QuickVina 2. Bioinformatics 31, 2214–2216 (2015).
  • 63. Corso G, Stärk H, Jing B, Barzilay R & Jaakkola T DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. Preprint at doi:10.48550/arXiv.2210.01776 (2023).
  • 64. Stärk H, Ganea O-E, Pattanaik L, Barzilay R & Jaakkola T EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. Preprint at doi:10.48550/arXiv.2202.05146 (2022).
  • 65. Wierbowski SD, Wingert BM, Zheng J & Camacho CJ Cross-docking benchmark for automated pose and ranking prediction of ligand binding. Protein Sci. 29, 298–305 (2020).
  • 66. Rogers D & Hahn M Extended-connectivity fingerprints. J. Chem. Inf. Model 50, 742–754 (2010).
  • 67. Eberhardt J, Santos-Martins D, Tillack AF & Forli S AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model 61, 3891–3898 (2021).
  • 68. McGann M. FRED Pose Prediction and Virtual Screening Accuracy. J. Chem. Inf. Model 51, 578–596 (2011).
  • 69. Landrum G. et al. rdkit/rdkit: 2023_03_1 (Q1 2023) Release. (2023) doi:10.5281/zenodo.7880616.
  • 70. Hawkins PCD, Skillman AG, Warren GL, Ellingson BA & Stahl MT Conformer Generation with OMEGA: Algorithm and Validation Using High Quality Structures from the Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model 50, 572–584 (2010).
  • 71. Vaswani A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst 30 (2017).
  • 72. Kingma DP & Ba J Adam: A method for stochastic optimization. ArXiv Prepr. ArXiv14126980 (2014).
  • 73. Radford A, Narasimhan K, Salimans T, Sutskever I, et al. Improving language understanding by generative pre-training. (2018).
  • 74. Paszke A. et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst 32 (2019).
  • 75. Bjerrum EJ SMILES enumeration as data augmentation for neural network modeling of molecules. ArXiv Prepr. ArXiv170307076 (2017).
  • 76. Pedregosa F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res 12, 2825–2830 (2011).
  • 77. Virtanen P. et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 17, 261–272 (2020).
  • 78. Harris CR et al. Array programming with NumPy. Nature 585, 357–362 (2020).
  • 79. Hinton GE & Roweis S Stochastic Neighbor Embedding. in Advances in Neural Information Processing Systems (eds. Becker S, Thrun S & Obermayer K) vol. 15 (MIT Press, 2002).
