Abstract
Pharmacophores are abstractions of essential chemical interaction patterns, holding an irreplaceable position in drug discovery. Despite the availability of many pharmacophore tools, the adoption of deep learning for pharmacophore-guided drug discovery remains relatively rare. We herein propose a knowledge-guided diffusion framework for ‘on-the-fly’ 3D ligand-pharmacophore mapping, named DiffPhore. It leverages ligand-pharmacophore matching knowledge to guide ligand conformation generation, meanwhile utilizing calibrated sampling to mitigate the exposure bias of the iterative conformation search process. By training on two self-established datasets of 3D ligand-pharmacophore pairs, DiffPhore achieves state-of-the-art performance in predicting ligand binding conformations, surpassing traditional pharmacophore tools and several advanced docking methods. It also manifests superior virtual screening power for lead discovery and target fishing. Using DiffPhore, we successfully identify structurally distinct inhibitors for human glutaminyl cyclases, and their binding modes are further validated through co-crystallographic analysis. We believe this work will advance the AI-enabled pharmacophore-guided drug discovery techniques.
Subject terms: Virtual screening, X-ray crystallography, Machine learning
The authors develop a deep learning framework for 3D ligand pharmacophore mapping, enabling binding pose prediction, lead discovery, and target fishing. Using this approach, they identify structurally different inhibitors for human glutaminyl cyclases.
Introduction
Artificial intelligence (AI) is permeating various critical stages of drug discovery and holds immense potential to reshape the field1–3. In recent years, AI has made considerable advancements in lead discovery and optimization, target identification, and pre-clinical/clinical investigations4,5. Especially in structure-guided drug discovery, deep learning (DL) algorithms can efficiently accomplish several core tasks such as binding pose generation, binding affinity prediction, and molecular generation6–12. Currently, there are three main DL methods for binding pose generation with protein structure constraints: predicting ligand translation, rotation, and torsion to recover binding modes (e.g., DiffDock13 and EquiBind14), employing gradient descent to predict protein-ligand distance matrices (e.g., TANKBind15), and adopting E(n)-equivariant graph neural networks to iteratively update the movement and position of ligand atoms (e.g., KarmaDock16 and E3Bind17). For structure-guided molecular generation, several DL algorithms have also been well-established recently, including TargetDiff18, ResGen19, SurfGen20, and PocketFlow21, some of which have undergone successful validations in discovering new hit/lead compounds. The learning capabilities of these DL methods are typically enhanced by incorporating sufficient samples or integrating knowledge such as the complementary principles of protein-ligand recognition.
Pharmacophores, as abstractions of critical chemical interactions, provide alternative means to depict the principles of protein-ligand complementarity. Compared with structure-guided methods, pharmacophores possess unique, concise, and position-inclusive features, along with directional matching patterns. Consequently, they are widely employed in practical drug discovery, especially in the early stages of the process22. Despite the availability of several widely used pharmacophore-based drug discovery tools (e.g., AncPhore23, PhDD24, PHASE25, Catalyst26, Pharao27, and Pharmit28), DL-enabled pharmacophore-guided drug discovery technologies remain relatively rare, with only a few instances reported to date. For example, PGMG implemented pharmacophore-guided molecular generation by establishing latent variables to address many-to-many (non-3D) mapping between pharmacophores and molecules29. PharmacoNet utilized DL techniques to model pharmacophores from protein structures, coupled with a graph-matching algorithm, ultimately achieving effective pharmacophore-based virtual screening30. The sluggish progress in DL technologies for pharmacophore-guided drug discovery stems from multiple factors, including the absence of high-quality datasets and of sophisticated algorithms capable of efficiently capturing sparse pharmacophore features.
By enhancing our previously developed anchor pharmacophore tool, AncPhore23, we created two datasets (CpxPhoreSet and LigPhoreSet) of 3D ligand-pharmacophore pairs, incorporating 10 pharmacophore feature types and exclusion spheres. CpxPhoreSet, derived from experimental protein-ligand complex structures, contains real but biased ligand-pharmacophore mapping (LPM) scenarios. In contrast, LigPhoreSet, generated from energetically favorable ligand conformations by considering both pharmacophore and ligand diversity, covers a broader range of perfectly-matched ligand–pharmacophore pairs. The complementary characteristics of these two datasets enable the development of efficient DL models for LPM and other pharmacophore-based tasks (e.g., de novo design and structural optimization).
Adhering to the ligand-pharmacophore matching principles, we propose DiffPhore, a pioneering knowledge-guided diffusion framework for “on-the-fly” 3D LPM. The main concept behind DiffPhore is to utilize the LPM knowledge to guide the conformation generative process, meanwhile leveraging calibrated sampling to reduce the exposure bias inherent in diffusion models. Specifically, DiffPhore consists of three main modules: knowledge-guided LPM encoder, diffusion-based conformation generator, and calibrated conformation sampler. The LPM encoder extracts the ligand-pharmacophore matching principles based on type and directional alignment, efficiently representing the mapping relationships between 3D ligands and pharmacophores. Based on these LPM representations, the diffusion-based conformation generator processes the ligand-pharmacophore matching information and estimates the directions for conformation denoising. The calibrated sampler adjusts the conformation perturbation strategy to narrow the discrepancy between the training and inference phases, aiming to enhance sample efficiency.
In this work, we evaluate DiffPhore on two independent datasets, PDBBind test set and PoseBusters31 set, and observe that it outperforms traditional pharmacophore tools and several advanced docking methods in predicting binding conformations. Further assessments on the DUD-E database32 and the IFPTarget library33 highlight DiffPhore’s effectiveness in virtual screening for both lead discovery and target fishing. We then apply DiffPhore for virtual screening of human glutaminyl cyclases, promising drug targets for neurodegenerative diseases and cancer immunotherapy34–37, successfully identifying structurally distinct inhibitors. Co-crystallographic analysis reveals consistency between the binding conformations of these inhibitors, as observed in complex crystal structures, and those predicted by DiffPhore.
Results
The complementary datasets of 3D ligand-pharmacophore pairs
To promote the development of pharmacophore-based DL methods, we released two datasets of 3D ligand-pharmacophore pairs, CpxPhoreSet and LigPhoreSet. They were constructed using AncPhore23 by considering 10 types of pharmacophore features (as shown in Supplementary Fig. 1), including hydrogen-bond donor (HD), hydrogen-bond acceptor (HA), metal coordination (MB), aromatic ring (AR), positively-charged center (PO), negatively-charged center (NE), hydrophobic (HY), covalent bond (CV), cation-π interaction (CR), and halogen bond (XB), along with steric constraints represented by exclusion spheres (EX). CpxPhoreSet comprises 15,012 ligand-pharmacophore pairs derived from experimental protein-ligand complex structures, each containing 3–15 pharmacophore features. To better represent a wider space of LPM patterns, we established a sophisticated protocol to derive ligand-pharmacophore pairs from 3D ligand structures (Fig. 1a); it involves Bemis–Murcko scaffold filtering, fingerprint similarity clustering, 3D conformation generation, pharmacophore generation and sampling, and exclusion sphere addition (see details in Methods). Starting with 11.48 million ligands from the In-Stock subset of ZINC20, we ultimately obtained 280,096 representative ligands and 840,288 corresponding ligand-pharmacophore pairs, collectively forming LigPhoreSet.
Fig. 1. The datasets of 3D ligand-pharmacophore pairs.
a The construction protocol for LigPhoreSet (see details in Methods section). b The t-SNE plots of ligands’ ECFP4 (1024-bit) fingerprints and pharmacophore counts reveal that LigPhoreSet covers wider chemical and pharmacophoric spaces compared with CpxPhoreSet. The ECFP4 fingerprints were processed by PCA (with random_state = 2024 and n_components = 50) for dimensionality reduction before the t-SNE analysis. The t-SNE analysis was performed with the following hyperparameters: n_components = 2, perplexity = 30, n_iter = 5000, random_state = 2024. c LigPhoreSet shares a similar occurrence frequency of pharmacophore features with CpxPhoreSet. d Distribution of the fitness scores (see Methods) of ligand-pharmacophore pairs from CpxPhoreSet (n = 15,012) and LigPhoreSet (n = 840,288). The boxes represent data distribution with center lines showing medians, box limits indicating the 25th and 75th percentiles, and whiskers extending to 1.5 times the interquartile range from the lower and upper quartiles. Source data are provided as a Source Data file.
By performing t-SNE analysis on the dimensionality-reduced ECFP4 descriptors of the ligands, we observed that the ligands in LigPhoreSet exhibit a broader chemical diversity compared to those in CpxPhoreSet (Fig. 1b, Supplementary Fig. 2 and Supplementary Table 1). Meanwhile, LigPhoreSet displays greater diversity in pharmacophore features and a roughly comparable occurrence frequency of pharmacophore feature types relative to CpxPhoreSet (Fig. 1c and Supplementary Table 2). These attributes, combined with perfectly matched ligand-pharmacophore pairs, make LigPhoreSet suitable for developing DL algorithms to capture generalizable LPM patterns across a broad chemical and pharmacophoric space. By comparison, CpxPhoreSet contains imperfectly matched ligand-pharmacophore pairs with fitness scores ranging from 0.5 to 1.0, averaging 0.967 (Fig. 1d). It can be used to refine the model for understanding the real-world biased LPMs and recognizing the induced-fit effects of ligand-target interactions. Therefore, we employed LigPhoreSet for the initial warm-up phase of model training and CpxPhoreSet for the subsequent refinement stage.
An overview of the knowledge-guided diffusion framework DiffPhore
DiffPhore is a knowledge-guided diffusion framework designed to generate 3D ligand conformations that maximally map to a given pharmacophore model (Fig. 2a). Essentially, DiffPhore incorporates pharmacophore type and direction matching rules to guide the alignment between ligand conformations and pharmacophore models (Fig. 2b). It comprises three main modules, namely, knowledge-guided LPM encoder, diffusion-based conformation generator, and calibrated conformation sampler.
Fig. 2. The framework of DiffPhore.
a DiffPhore adopts the diffusion-denoising process to predict binding conformations mapping with the pharmacophore from randomly initialized conformations. b DiffPhore incorporates knowledge-guided pharmacophore mapping rules for conformation generation. The LPM representation encoder uses a geometric heterogeneous graph, including a fully connected bipartite graph to represent LPM, in which type and direction matching vectors are introduced to deliver type and direction matching information for ligand conformation update. c The calibrated conformation sampler randomly takes pseudo conformations (i.e., from intermediate prediction) as inputs for learning the conformation denoising process. The probability to select pseudo conformations is controlled by an annealing temperature Pepoch.
The knowledge-guided LPM encoder module encodes the ligand conformation and pharmacophore model as a geometric heterogeneous graph composed of a ligand conformation graph, a pharmacophore graph, and a fully connected bipartite graph that represents ligand conformation-pharmacophore relations. The explicit pharmacophore-ligand mapping knowledge, including rules for pharmacophore type and direction matching, is incorporated into the bipartite graph. This is achieved by integrating the pharmacophore fingerprints, orientations, and reference angles of all ligand atoms, as well as the types and directions of all pharmacophore features (Fig. 2b; see details in Methods and Supplementary Methods). The pharmacophore type matching vectors are obtained by aligning each ligand atom with all pharmacophore features one by one, which is expedited using pharmacophore fingerprints. Similarly, the pharmacophore direction matching vectors are derived by computing the discrepancy between the intrinsic orientation of each ligand atom and the direction of each directional pharmacophore feature (HA, HD, MB, etc.). Leveraging these knowledge-guided encodings, the encoder captures the essence of the alignment between ligand conformations and pharmacophores, resulting in a robust representation of LPM.
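As a rough illustration of the direction-matching idea described above, the sketch below computes the angle between a ligand atom's orientation vector and the direction of a directional pharmacophore feature. The function name `direction_discrepancy` and the choice of a plain angle in degrees are assumptions made for illustration only; the encoding actually used by DiffPhore is specified in the Methods and Supplementary Methods.

```python
import math


def direction_discrepancy(atom_orientation, feature_direction):
    """Angle in degrees between two 3D vectors; 0 means perfectly aligned.

    Hypothetical helper: a stand-in for the direction matching term,
    not the encoding used in the paper.
    """
    dot = sum(a * b for a, b in zip(atom_orientation, feature_direction))
    norm_a = math.sqrt(sum(a * a for a in atom_orientation))
    norm_b = math.sqrt(sum(b * b for b in feature_direction))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))
```

For example, `direction_discrepancy((1, 0, 0), (0, 1, 0))` describes a donor pointing perpendicular to the feature direction, which a matching rule would penalize more heavily than a small angular deviation.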
The diffusion-based conformation generator module takes the LPM representations as input and estimates the translation, rotation, and torsion transformations for the ligand conformation at each step (see details in Methods and Supplementary Fig. 3). Crucially, the generator employs a score-based diffusion model, parameterized by an SE(3)-equivariant graph neural network, to uncover the deep geometric features of ligand conformations, pharmacophores, and most importantly, their mapping relationships. This allows for conformation exploration that is informed by both the 3D chemical structure and the pharmacophore model. During training, the ground truth ligand conformation is perturbed by applying random transformations sampled from the corresponding diffusion kernels at a given diffusion time, and the perturbed conformation is then fed into the network as input. The conformation generator is subsequently tasked with predicting the gradients of the diffusion kernels for translation, rotation, and torsion, which can be used to estimate the actual transformations needed to recover the ligand towards its original conformation (Supplementary Fig. 3). During the conformation generation phase, the generator gradually refines the ligand conformation until it theoretically aligns maximally with the pharmacophore model.
The auto-regressive conformation generation process faces the issue of exposure bias because the computations performed during the inference phase differ from those during the training phase of the conformation generator. During the training phase, the generator takes the perturbed conformation of the ligand as input, whereas during the inference phase, it receives its own predicted conformation as input instead. Since the quality of the predicted conformations is not guaranteed, any prediction error made in an earlier step accumulates, leading to a significant bias in the final generation, especially within large 3D conformation spaces. To address this issue, we proposed a calibrated conformation sampler to narrow the generation discrepancy between the training and inference processes (Fig. 2c; see details in Methods and Supplementary Table 3). In the training process, the calibrated conformation sampler mimics the inference computations and feeds a pseudo conformation into the generator instead of the perturbed ground truth. In this manner, the conformation generator undergoes a consistent training and inference process, reducing the exposure bias and enhancing the generation quality. In practical model training, we employed the calibrated conformation sampler for refinement training on CpxPhoreSet after an initial warm-up training on LigPhoreSet.
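The selection step of the calibrated sampler can be sketched as follows. This is only a minimal illustration under stated assumptions: the linear schedule, the cap `p_max`, and the function names are hypothetical, since the text only states that the probability of choosing a pseudo conformation is controlled by an annealing term Pepoch; the actual schedule is given in Methods.

```python
import random


def pseudo_probability(epoch, total_epochs, p_max=0.5):
    """Hypothetical linear schedule for P_epoch (an assumption, not the paper's)."""
    if total_epochs <= 0:
        raise ValueError("total_epochs must be positive")
    fraction = min(max(epoch / total_epochs, 0.0), 1.0)
    return p_max * fraction


def choose_training_input(perturbed, pseudo, epoch, total_epochs, rng=random):
    """Return the pseudo conformation with probability P_epoch,
    otherwise the perturbed ground-truth conformation."""
    if rng.random() < pseudo_probability(epoch, total_epochs):
        return pseudo
    return perturbed
```

With this kind of schedule the generator sees only perturbed ground-truth conformations early on and is increasingly exposed to its own intermediate predictions later, which is the mechanism the paragraph above credits with reducing exposure bias.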
Our ablation experiment results indicated that removing either feature type matching or direction matching decreases the accuracy of binding conformation prediction (Supplementary Table 4), underscoring the importance of incorporating pharmacophore-specific knowledge for enhancing model performance. In terms of training schemes, skipping the warm-up training with LigPhoreSet led to reduced conformation prediction accuracy (Supplementary Table 5). Notably, excluding refinement training with CpxPhoreSet resulted in a substantial drop in prediction accuracy, demonstrating the critical role of learning from real-world ligand-pharmacophore matches. Additionally, omitting the calibrated conformation sampler impaired the model’s predictive capability (Supplementary Table 5). These results highlight the importance of integrating pharmacophore mapping knowledge, conformation sampling, and the complementary use of LigPhoreSet and CpxPhoreSet datasets.
DiffPhore enables accurate prediction of ligand binding conformations
In this section, we assessed the capability of DiffPhore in generating ligand binding conformations. We selected two traditional pharmacophore programs, AncPhore and MOE, combined with two conformation generation tools, Openbabel (OB) and Conformator (CF), as baseline comparisons. MOE is recognized as one of the state-of-the-art pharmacophore tools, while AncPhore is selected for parallel comparison because it employs the same pharmacophore definitions as those used in DiffPhore. We ranked the generated poses of DiffPhore with a DiffPhore fitness score (see Methods), as it effectively reflects the quality of the generated poses (Supplementary Fig. 4). To ensure a fair comparison between DiffPhore and the baselines, we evaluated them on two independent test sets (PDBBind test set and PoseBusters set), and employed the same number of input initial conformations in each evaluation.
We observed that with 40 initial conformations, DiffPhore achieved a high top-1 success rate (73.82%) in generating conformations with RMSD less than 2 Å on the PDBBind test set, and a 67.13% success rate with only 10 initial conformations, substantially outperforming AncPhore and MOE regardless of the conformation generation method (Fig. 3a and Table 1). The superior predictive capability and the minimal impact of the number (and diversity) of initial conformations indicate, at least partly, that DiffPhore effectively reduces local optimum issues during the conformational space search. Similarly, DiffPhore exhibited a substantially higher success rate in binding pose prediction than AncPhore and MOE on PoseBusters (Fig. 3b and Table 1). Notably, DiffPhore showed comparable performance when evaluated on the subsets of new proteins from the PDBBind test set and PoseBusters set, which were excluded from the training datasets (Fig. 3c). This partly reflects that DiffPhore captures the underlying principles of ligand-pharmacophore mapping, rather than merely memorizing the training samples.
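The success-rate metrics quoted above (percentage of top-1 poses with RMSD below 1 Å or 2 Å, and the median RMSD) can be computed along the following lines. This is a simplified sketch: it assumes the predicted and reference heavy atoms are already in the same order and frame, and it does not apply the symmetry correction that pose-prediction benchmarks normally use. The function names are illustrative.

```python
import math
from statistics import median


def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(pred, ref)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(pred))


def summarize(rmsd_values, thresholds=(1.0, 2.0)):
    """Percentage of poses below each RMSD threshold, plus the median RMSD."""
    n = len(rmsd_values)
    if n == 0:
        raise ValueError("need at least one RMSD value")
    rates = {t: 100.0 * sum(v < t for v in rmsd_values) / n for t in thresholds}
    return rates, median(rmsd_values)
```

Calling `summarize` on the list of top-1 RMSD values of a test set yields the "%<1", "%<2", and "Med." entries of the kind reported in Table 1.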
Fig. 3. The performance of DiffPhore on ligand binding conformation prediction.
Plots of cumulative distribution describing the proportion of observations falling below each RMSD value by different methods on (a) the PDBBind test set and (b) PoseBusters set. c The top-1 success rates for different methods evaluated on the full set (all proteins) or new protein subset (new proteins, not included in the training set) of PDBBind test set and PoseBusters set. Cumulative distribution plots describing the proportion of observations below each energy ratio value for different methods on (d) the PDBBind test set and (e) PoseBusters set. The energy ratio compares the UFF force field energies (from the PoseBusters validity test) of the predicted and ground truth poses. Source data are provided as a Source Data file.
Table 1.
Comparison of DiffPhore and other methods on the time-split PDBBind test set and the PoseBusters set
Dataset | Method^a | Top-1 RMSD %<1 Å | Top-1 RMSD %<2 Å | Top-1 RMSD Med. (Å) | Top-5 RMSD %<1 Å | Top-5 RMSD %<2 Å | Top-5 RMSD Med. (Å) | Runtime (s)^b
---|---|---|---|---|---|---|---|---
Time-split PDBBind test set | AncPhore(10, OB) | 7.71 | 18.46 | 8.47 | 10.47 | 21.21 | 6.40 | 11.61
Time-split PDBBind test set | AncPhore(40, OB) | 11.02 | 24.24 | 5.39 | 14.04 | 31.13 | 4.11 | 48.23
Time-split PDBBind test set | AncPhore(10, CF) | 15.43 | 23.14 | 8.03 | 16.80 | 28.10 | 6.45 | 31.40
Time-split PDBBind test set | AncPhore(40, CF) | 18.73 | 32.51 | 5.56 | 22.58 | 38.84 | 4.04 | 53.51
Time-split PDBBind test set | MOE(10, OB) | 11.29 | 36.36 | 3.03 | 15.70 | 44.90 | 2.21 | 3.88
Time-split PDBBind test set | MOE(40, OB) | 16.25 | 41.05 | 2.47 | 23.42 | 50.96 | 1.93 | 8.06
Time-split PDBBind test set | MOE(10, CF) | 20.39 | 42.42 | 2.51 | 27.00 | 53.72 | 1.76 | 25.20
Time-split PDBBind test set | MOE(40, CF) | 22.31 | 52.62 | 1.77 | 32.78 | 63.09 | 1.44 | 28.14
Time-split PDBBind test set | Uni-dock | 17.63 | 34.16 | 4.00 | 26.17 | 47.93 | 2.14 | 3.97
Time-split PDBBind test set | Glide SP* | 17.36 | 44.63 | 2.27 | 31.13 | 60.06 | 1.54 | -
Time-split PDBBind test set | GNINA | 20.11 | 42.42 | 2.55 | 25.62 | 55.92 | 1.68 | 84.95
Time-split PDBBind test set | SMINA | 17.63 | 29.48 | 4.19 | 23.14 | 45.18 | 2.29 | 122.63
Time-split PDBBind test set | AutoDock Vina | 17.20 | 30.32 | 4.40 | 21.57 | 45.48 | 2.22 | 67.93
Time-split PDBBind test set | KarmaDock* | - | 56.20 | - | - | - | - | -
Time-split PDBBind test set | SurfDock* | 40.96 | 68.41 | 1.18 | 54.18 | 75.11 | 0.94 | -
Time-split PDBBind test set | DiffPhore(10) | 25.35 | 67.13 | 1.48 | 36.77 | 79.11 | 1.21 | 6.97
Time-split PDBBind test set | DiffPhore(40) | 34.82 | 73.82 | 1.26 | 49.3 | 80.78 | 1.01 | 27.51
PoseBusters | AncPhore(10, OB) | 22.20 | 37.38 | 3.14 | 24.53 | 46.03 | 2.27 | 21.50
PoseBusters | AncPhore(40, OB) | 26.17 | 46.73 | 2.12 | 29.44 | 56.07 | 1.59 | 83.69
PoseBusters | AncPhore(10, CF) | 27.57 | 48.60 | 2.16 | 33.41 | 53.97 | 1.67 | 24.55
PoseBusters | AncPhore(40, CF) | 35.28 | 57.01 | 1.54 | 44.16 | 65.19 | 1.25 | 77.82
PoseBusters | MOE(10, OB) | 12.38 | 37.15 | 2.60 | 17.76 | 52.57 | 1.92 | 3.21
PoseBusters | MOE(40, OB) | 18.93 | 43.46 | 2.32 | 25.00 | 57.94 | 1.75 | 10.55
PoseBusters | MOE(10, CF) | 20.79 | 50.47 | 1.99 | 29.21 | 63.08 | 1.55 | 7.47
PoseBusters | MOE(40, CF) | 26.87 | 58.18 | 1.73 | 37.85 | 67.29 | 1.32 | 13.64
PoseBusters | Uni-dock | 16.35 | 34.96 | 3.72 | 23.83 | 45.56 | 2.23 | 3.41
PoseBusters | GNINA | 34.35 | 61.21 | 1.44 | 42.99 | 81.07 | 1.13 | 12.99
PoseBusters | SMINA | 27.10 | 49.30 | 2.03 | 36.45 | 66.36 | 1.35 | 14.09
PoseBusters | AutoDock Vina | 26.87 | 46.03 | 2.47 | 33.64 | 62.38 | 1.49 | 14.61
PoseBusters | DiffPhore(10) | 37.29 | 82.19 | 1.20 | 51.78 | 92.87 | 0.98 | 4.40
PoseBusters | DiffPhore(40) | 44.42 | 87.89 | 1.05 | 67.22 | 96.67 | 0.76 | 19.51
a “*” means data from references16,42. “-” indicates unavailable data. The numbers in parentheses for DiffPhore, AncPhore, and MOE represent the number of initial conformations. The abbreviations following the numbers in the parentheses denote the conformation generation tools used for evaluation, where ‘OB’ refers to Openbabel and ‘CF’ refers to Conformator. The best and the second-best results are highlighted in bold and underlined, respectively.
b Although all methods were tested on the same operating system and hardware, the runtimes presented here are not directly comparable due to differences in processing devices: DiffPhore, GNINA, and Uni-dock utilized GPUs, while the other methods relied on CPUs. The time taken by conformation generation was included for AncPhore and MOE.
Subsequently, PoseBusters validity tests were carried out to assess the chemical and physical plausibility of the predicted binding poses, measured by the “%RMSD < 2 Å & PB-Valid” and “%RMSD < 2 Å & PB-Valid (without protein)” metrics. Compared with the baseline methods, DiffPhore achieved high success rates for plausible poses on the PDBBind test set, with “%RMSD < 2 Å & PB-Valid” at 31.80% and “%RMSD < 2 Å & PB-Valid (without protein)” at 72.48%; similarly, on the PoseBusters set, it recorded 53.68% and 86.70%, respectively (Supplementary Table 6). Analysis of the energy ratios of the top-1 predicted conformations versus the ground truth ones revealed that the DiffPhore-generated conformations have relatively lower conformational energies compared to those predicted by AncPhore and MOE (Fig. 3d, e). Additionally, the bond lengths, angles, and dihedral angles of DiffPhore-generated conformations displayed distributions closely resembling those of the ground truth poses (Supplementary Figs. 5–7). These results indicate that DiffPhore can produce chemically and energetically reasonable conformations while maintaining optimal mapping to the given pharmacophore models.
Next, four open-access molecular docking tools (AutoDock Vina38, Uni-dock39, SMINA40, and GNINA41) were selected for comparison; all of them were evaluated with the binding pocket given. DiffPhore outperformed the tested traditional docking tools and had comparable performance to the recently reported advanced DL-based docking tools KarmaDock16 and SurfDock42 on the PDBBind test set, albeit without a speed advantage (Table 1 and Supplementary Fig. 8). Notably, DiffPhore generated more intramolecularly plausible conformations than the docking tools, as evidenced by the “%RMSD < 2 Å & PB-Valid (without protein)” metric in Supplementary Table 6, while maintaining comparable effectiveness in handling protein clashes.
Given that ligand flexibility and pharmacophore complexity are known to influence the predictive accuracy and speed of pharmacophore tools, we further explored how these factors affect the performance of DiffPhore. Here, AncPhore was chosen for comparison because it uses the same pharmacophore definitions and input files as DiffPhore. It can be observed that as molecular complexity increases, the success rate of conformation prediction decreases for both DiffPhore and AncPhore (Fig. 4). However, DiffPhore shows a notable advantage over AncPhore in handling more flexible compounds. For compounds with fewer than 47 heavy atoms or 19 rotatable bonds, DiffPhore achieved a success rate of ~80% (Fig. 4a, b). Regarding pharmacophore complexity, it manifested superior predictive capability for pharmacophores with 3 to 12 features (Fig. 4c). Unlike AncPhore, whose prediction speed is significantly affected by ligand flexibility and pharmacophore complexity, DiffPhore demonstrates resilience to these factors, only experiencing a modest reduction in speed when handling more flexible ligands (Supplementary Fig. 9). DiffPhore’s robust performance reflects the advancements and sophistication of our established datasets and knowledge-guided diffusion framework.
Fig. 4. The impact of ligand flexibility and pharmacophore complexity on the predictive accuracy of DiffPhore.
Plots of the Top-1 RMSD values (upper) or success rates (lower) versus the number of heavy atoms (a), rotatable bonds (b), and pharmacophore features (c) reveal the impacts of ligand flexibility and pharmacophore complexity on the conformation prediction performance of DiffPhore and AncPhore. The numbers in parentheses represent the number of initial conformations; top-1 success rate means generating conformations with RMSD < 2 Å. Data are presented as mean values ±95% confidence interval. Source data are provided as a Source Data file.
DiffPhore manifests superior screening power for lead discovery and target fishing
The robust performance of DiffPhore in generating binding conformations enables it to serve as a core engine for virtual screening in lead discovery and target fishing tasks. To examine its screening ability for lead discovery, we selected 28 structurally and mechanistically different targets, including metalloenzymes, from the DUD-E dataset32, by considering the specificity of pharmacophore models and the impact of anchor pharmacophore features in practical drug discovery. Among the selected targets, half do not appear in the training set, while the other half overlap with it. For all these targets, the corresponding pharmacophore models (Supplementary Table 7) were established by comparing multiple complex crystal structures; notably, they all exhibited typical anchor pharmacophore features that frequently occur across multiple structures and/or are important for natural substrate binding or catalysis. To fully leverage the advantage of DiffPhore in virtual screening, we proposed four different fitness scorings (Dfscore1 to Dfscore4; see “Methods”) to evaluate the ligand-pharmacophore matching. AncPhore, MOE, and four docking tools (AutoDock Vina, Uni-dock, SMINA, and GNINA) were chosen for comparison. To provide a more comprehensive comparison, we referenced the data for additional docking and re-scoring methods, whose screening power on these targets has been evaluated43.
The four fitness scores of DiffPhore exhibited similar performance across different metrics (Fig. 5a–f). DiffPhore shows a superior ability to distinguish between active and decoy ligands, as evidenced by the AUROC metric, which surpasses traditional pharmacophore tools including AncPhore and MOE, and is comparable to the leading docking tools including RFscore-VS, Glide SP, and EquiScore (Fig. 5a, d). Regarding the BEDROC metric, a similar trend was observed, with the exception that MOE shows performance comparable to DiffPhore (Fig. 5b, e). Notably, all four DiffPhore fitness scores exhibited strengths in terms of enrichment factors at 0.5% and 1%, closely approaching the performance of MOE while surpassing AncPhore and nearly all docking tools (Fig. 5c, f and Supplementary Fig. 10). This highlights DiffPhore’s excellent ability to prioritize active molecules according to the given pharmacophore model, which is intrinsically related to the essence of pharmacophore methods. Importantly, when comparing performance on overlapping and non-overlapping targets from the training set, DiffPhore showed no obvious differences in the metrics such as AUROC, BEDROC, and enrichment factors at 0.5% and 1% (Fig. 5a–f and Supplementary Fig. 10). This partly reflects the advantage of the pharmacophore methods in addressing the challenges of target preferences and the generalizability to unseen proteins that are often faced by DL-based docking tools. In comparison, the success rate of pharmacophore-based virtual screening is closely tied to the quality of defined pharmacophore models (i.e., whether the pharmacophore models accurately represent key target information). Theoretically, incorporating multiple pharmacophore models in virtual screening could potentially increase enrichment factors and enhance overall hit rates and chemical diversity.
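For readers less familiar with the screening metrics used above, the sketch below shows one common way to compute the enrichment factor at a given top fraction and the AUROC from a ranked list of actives and decoys. It assumes that higher scores rank first and that labels are 1 for actives and 0 for decoys; the function names and the rounding of the top-fraction cutoff are illustrative choices, not taken from the paper, and BEDROC is omitted.

```python
def enrichment_factor(scores, labels, fraction):
    """EF at a top fraction: active rate in the top slice over the overall active rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_total = len(ranked)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, int(round(fraction * n_total)))
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)


def auroc(scores, labels):
    """Probability that a random active outscores a random decoy (ties count 0.5)."""
    actives = [s for s, l in zip(scores, labels) if l]
    decoys = [s for s, l in zip(scores, labels) if not l]
    if not actives or not decoys:
        raise ValueError("need both actives and decoys")
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))
```

The pairwise AUROC loop is quadratic in the library size and is written for clarity; a rank-based formulation would be preferable for libraries the size of DUD-E target sets.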
Fig. 5. The DiffPhore screening power for lead discovery and target fishing.
Comparison of different DiffPhore scorings with other methods in virtual screening for lead discovery, evaluated respectively on (a–c) non-overlapping and (d–f) overlapping targets using the metrics AUROC, BEDROC, and EF0.5%. Boxes are ranked based on their mean values, indicated by triangle markers. The “*” symbol denotes a statistically significant difference (unpaired two-sided Student’s t-tests, p-value < 0.05, n = 14) between the baseline and DiffPhore. Exact p values are provided in the Source Data file. The boxes represent data distribution with center lines showing medians, box limits indicating the 25th and 75th percentiles, and whiskers extending to 1.5 times the interquartile range from the lower and upper quartiles. AUROC area under the receiver operating characteristic curve, BEDROC Boltzmann-enhanced discrimination of receiver operating characteristic, EF0.5% enrichment factor at 0.5%. g Comparison of DiffPhore with other baselines in predicting the 12 targets of 4OH-tamoxifen. Percent rank = (rank order/number of total complex structures in IFPTarget) × 100. Source data are provided as a Source Data file.
As reported, the DUD-E dataset poses challenges for evaluating virtual screening methods due to its inherent biases44,45. Analog bias occurs when actives for a given target share similar scaffolds, creating recognizable patterns that DL models can easily exploit. Decoy bias, often introduced by selection criteria that prioritize dissimilarity to actives, can also be leveraged by models, leading to false positives. Although the DUD-E bias cannot be avoided in our evaluations, DiffPhore primarily performs the task of matching ligands to the pharmacophore models, allowing for appropriate feature-matching deviations, which fundamentally distinguishes it from docking methods. We used four different fitness scores to assess the virtual screening performance of DiffPhore, which likely helps mitigate the impact of database bias. Overall, DiffPhore achieves comprehensive and balanced performance across all metrics, indicating its potential as a promising tool for pharmacophore-based virtual screening in lead discovery.
We next evaluated the screening ability of DiffPhore in target fishing using the IFPTarget library, which contains 2842 unique targets and 11,890 complex structures33. To better accommodate the target fishing task, we employed , which is specifically designed to reduce the impact of the number of pharmacophore features on target ranking (see Methods). We selected 4OH-tamoxifen for testing because it is known to bind more than 12 different targets. DiffPhore outperformed AncPhore in target ranking, achieving an average percent rank of 12.03%, and demonstrated comparable or slightly superior performance relative to the tested docking methods (Fig. 5g). We observed a notable limitation of DiffPhore for certain targets such as human fibroblast collagenase, mainly because the derived pharmacophore models cannot adequately represent these targets. These findings indicate that appropriate pharmacophore representations are crucial for DiffPhore to achieve effective virtual screening, whether for lead discovery or target fishing.
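The percent rank reported for target fishing follows directly from the definition given in the caption of Fig. 5. A minimal Python sketch is shown below; the function name and the example rank are illustrative assumptions, and only the formula and the library size (11,890 complex structures) come from the text.

```python
def percent_rank(rank_order: int, n_structures: int = 11_890) -> float:
    """Percent rank = (rank order / number of total complex structures) x 100."""
    if not 1 <= rank_order <= n_structures:
        raise ValueError("rank_order must lie in [1, n_structures]")
    return 100.0 * rank_order / n_structures


# Hypothetical example: a known target ranked 1189th out of 11,890 structures.
example_percent_rank = percent_rank(1189)
```

A lower percent rank means that the known target appears closer to the top of the ranked list.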
DiffPhore identifies lead compounds for human glutaminyl cyclases
Human secretory glutaminyl cyclase (sQC) and Golgi-resident glutaminyl cyclase (gQC), responsible for N-terminal pyroglutamation for multiple protein substrates, are attractive therapeutic targets for various human diseases, including neurodegenerative diseases and cancers34–37. Based on the reaction pathway of sQC-catalyzed pyroglutamation of the tripeptide substrate NH2-Gln-Phe-Ala-CONH2 (QFA), we constructed a pharmacophore model derived from its initial binding mode for virtual screening using DiffPhore against about 1.4 million compounds from the Vitas-M library (see Methods). We picked 15 structurally distinct top-ranked compounds (using ) for experimental verification. Of them, 7 displayed inhibitory activity against sQC and gQC with IC50 values less than 100 μM (Supplementary Table 8 and Supplementary Fig. 11). Compounds 5 and 13 manifested IC50 of 6.94 μM (Ki = 6.71 μM) and 3.44 μM (Ki = 3.33 μM) to sQC, and 15.73 μM (Ki = 15.31 μM) and 3.93 μM (Ki = 3.82 μM) to gQC, respectively. Notably, both compounds exhibited the ability to thermodynamically stabilize both sQC (ΔTm of 4.98 °C for 5 and 7.05 °C for 13) and gQC (ΔTm of 2.06 °C for 5 and 5.50 °C for 13) proteins (Fig. 6a).
Fig. 6. The sQC/gQC inhibitors identified by DiffPhore.
a The chemical structures of compounds 5 and 13, along with their predicted conformations mapping to the pharmacophore model derived from the binding mode of QFA with sQC; the IC50 curves of the two inhibitors of sQC/gQC (all determinations are tested in triplicate; data are presented as mean values ±SEM); the melting curves (first-derivative of dissociation) of sQC (yellow) and gQC (cyan) in the presence or absence of 5 (50 μM) or 13 (50 μM). Views from the (b) sQC:5 (PDB code 9ISD) and (c) sQC:13 (PDB code 9IVV) complex structures, revealing the modes of 5 and 13 inhibiting sQC; the mFo-DFc electron density (OMIT maps, blue mesh, contoured at 3.0σ) around 5 and 13 are calculated from the last refinement models. Superimpositions of (d) sQC:5 and (e) sQC:13, respectively, with sQC:QFA analog (PDB code 6YI1)46, reveal that 5 and 13 have a similar mode as that of QFA analog with sQC.
Through co-crystallization studies, we obtained the crystal structures of sQC in complex with 5 and 13 (Supplementary Tables 9, 10). Clear mFo - DFc electron density was observed in the active site for both structures, enabling confident modeling of 5 and 13 (Fig. 6b, c, Supplementary Fig. 12). Compound 5 is positioned to coordinate with the active site Zn2+ (N3-Zn2+ distance of 2.2 Å), make a hydrogen bond with the catalytic triad Asp248 (N1-OD2 distance of 2.8 Å), two hydrogen bonds with Gln304 (N1’-O2 distance of 2.9 Å, and O3’-NH distance of 3.6 Å), and face-to-face π-π stacking interactions with Trp207 (Fig. 6b). Compound 13 makes a coordination bond with Zn2+ (N3-Zn2+ distance of 2.0 Å), a hydrogen bond with Asp248 (N1-OD2 distance of 2.7 Å), and water-bridging interaction with Glu202 and Gln304 (Fig. 6c). Superimposing these structures with the sQC:QFA analog structure (PDB code 6YI1)46 reveals that both compounds, especially 5, closely resemble QFA analog binding with sQC (and gQC) through similar pharmacophore features, notably in zinc coordination and hydrogen-bonding interactions with catalytically important residues (Fig. 6d, e, Supplementary Fig. 13, and Fig. 14). These results highlight the effectiveness of DiffPhore in pharmacophore-guided lead discovery.
Discussion
Pharmacophore and molecular docking methods are both fundamentally based on the principles of receptor–ligand recognition, but they are implemented in entirely different ways. Molecular docking captures a wide array of receptor-ligand interactions, considering the full spectrum of possible contacts. In contrast, pharmacophores distill these interactions down to their most essential features, focusing on the abstracted representation and precise matching of key interactions, including directional alignment. The pharmacophore approach sidesteps the complexity of less significant interactions, providing a streamlined and efficient route for drug discovery. More importantly, pharmacophores can avoid the target preference issue often encountered in traditional or DL-based molecular docking methods. Therefore, developing DL-enabled pharmacophore technologies is a promising direction. DiffPhore stands as the pioneering DL model for the LPM task, potentially acting as a catalyst for advancing this kind of technology.
Since there are no standardized datasets for constructing DL-based pharmacophore models, we established LigPhoreSet and CpxPhoreSet by considering 10 types of pharmacophore features and exclusion spheres. The combined use of these two datasets has been shown to yield high-quality models for the LPM task. These datasets should also be useful for developing DL models for other tasks, such as molecular generation. Moreover, our proposed protocol for constructing ligand-derived datasets may inspire the creation of more robust and comprehensive datasets, driving progress in pharmacophore-guided drug discovery.
In DiffPhore, pharmacophore principles are fused into the neural networks to search for ligand binding conformations according to a given pharmacophore model. This approach greatly enhances the efficiency of conformation search and reduces the likelihood of combinatorial explosion, which is especially beneficial for highly flexible ligands, as observed in the test results. The knowledge-guided diffusion-based framework, which focuses on 3D transformations rather than the generation of absolute atomic coordinates, can produce more chemically plausible and energetically favorable conformations. The proposed calibrated conformation sampler proved effective in reducing the discrepancy between the training and inference stages and in mitigating the exposure bias of diffusion models. Notably, DiffPhore shows robust performance in generating binding conformations even for unseen proteins, suggesting, at least in part, that it has learned the essential principles of ligand-pharmacophore matching rather than merely memorizing the training data.
In practical application, we utilized DiffPhore to implement a substrate-mimicking strategy (i.e., deriving pharmacophore models from catalytic reaction pathways), and successfully discovered structurally distinct inhibitors for the clinically important metalloenzymes sQC/gQC. Co-crystallographic analysis revealed that the binding modes of the inhibitors closely resemble that of the sQC/gQC substrate, especially concerning anchor pharmacophore features such as zinc coordination and hydrogen bonding with the catalytically important residues. This case study clearly reveals that DiffPhore, equipped with precisely defined pharmacophore models, can efficiently discover high-quality lead compounds. It also highlights the unique advantage of pharmacophore models in identifying metalloenzyme inhibitors involving metal coordination.
The encouraging performance of DiffPhore in predicting binding conformations and conducting virtual screening highlights the advancements of our proposed datasets and knowledge-guided diffusion-based framework. Further development of algorithms is warranted to improve computational efficiency and accuracy, to address differences in bond lengths and dihedral angles arising from conformational changes, to consider intramolecular interactions (e.g., intramolecular hydrogen bonds), as well as to tackle conformational prediction challenges posed by more complex ligands (e.g., macrocyclic structures).
Methods
Dataset construction
We constructed two 3D ligand-pharmacophore pair datasets, CpxPhoreSet and LigPhoreSet, for LPM learning, using the enhanced version of AncPhore23. CpxPhoreSet was established by analyzing a total of 19,443 protein-ligand complex structures collected in PDBBind (version 2020)47,48. We followed a time-split scheme14 and divided PDBBind into training (16,379 entries), validation (968 entries), and test (363 entries) sets. The training and validation sets were used to establish CpxPhoreSet, and the test set was reserved for performance evaluation. For each complex structure, AncPhore was used to generate one pharmacophore model considering 10 pharmacophore feature types (HD, HA, MB, AR, PO, NE, HY, CV, CR, and XB) and exclusion spheres (EX) according to protein-ligand interactions. Models with fewer than 3 or more than 15 features were discarded. The retained pharmacophore models, along with the corresponding ligand conformations, constituted CpxPhoreSet, which encompasses a total of 15,012 ligand–pharmacophore pairs.
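The feature-count filter described above can be sketched in a few lines of Python. The representation of a pharmacophore model as a list of feature-type labels is a hypothetical simplification, and the choice not to count exclusion spheres (EX) toward the 3 to 15 feature limit is an assumption, since the text does not state how EX is treated.

```python
# The 10 pharmacophore feature types named in the text.
FEATURE_TYPES = {"HD", "HA", "MB", "AR", "PO", "NE", "HY", "CV", "CR", "XB"}


def count_features(model):
    """Count pharmacophore features, ignoring exclusion spheres (assumption)."""
    return sum(1 for label in model if label in FEATURE_TYPES)


def keep_model(model, min_features=3, max_features=15):
    """Retain models with at least 3 and at most 15 features."""
    return min_features <= count_features(model) <= max_features
```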
We started LigPhoreSet construction with ~11.48 million ligands (with molecular weights less than 800 and LogP less than 5) obtained from the In-Stock subset of the ZINC20 database49 (downloaded in May 2023). After removing duplicates, multi-component entries, and unparsable SMILES strings using the RDKit software, the remaining ligands were clustered based on Bemis–Murcko scaffold50 rules. From each ligand cluster, a representative ligand was randomly chosen. These selected ligands underwent further filtration based on Morgan fingerprint51 similarity (with a radius of 2) to ensure a broad but nonredundant diversity of ligand chemotypes. Then, for each filtered ligand, an energetically favorable 3D conformation was generated using the RDKit MMFF force field. Next, the pharmacophore models for all ligand conformations were generated as follows: (1) generating initial pharmacophore models by considering 10 pharmacophore feature types (identical to those used for CpxPhoreSet) for all 3D ligand conformations; (2) retaining the pharmacophore models bearing at least 2 features among MB, HA, HD, AR, NE, or PO; (3) sampling three pharmacophore models with different pharmacophore feature combinations for each ligand conformation; (4) generating exclusion spheres as steric constraints in a pseudo-receptor manner, with exclusion spheres introduced around the ligand conformation at distances ranging from 3 Å to 5 Å. We finally obtained a version of LigPhoreSet comprising 840,288 ligand-pharmacophore pairs derived from 280,096 ligands. To facilitate initial model training and hyperparameter search, we randomly extracted a subset from LigPhoreSet, containing 84,030 samples and 28,010 ligands.
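Steps (2) and (3) of this protocol can be illustrated with a short Python sketch. The sampling scheme below (drawing random feature subsets until three distinct combinations pass the step (2) check) is an assumption made for illustration, as are the function names, the minimum subset size, and the representation of a feature as a (type, index) tuple; the actual procedure is described in the Supplementary Methods and source code.

```python
import random

# Feature types required by step (2).
KEY_TYPES = {"MB", "HA", "HD", "AR", "NE", "PO"}


def has_enough_key_features(features, minimum=2):
    """Step (2): at least `minimum` features among MB, HA, HD, AR, NE, PO."""
    return sum(1 for ftype, _ in features if ftype in KEY_TYPES) >= minimum


def sample_models(features, n_models=3, min_size=3, seed=0, max_tries=1000):
    """Step (3), assumed scheme: collect distinct feature subsets that pass step (2)."""
    if len(features) < min_size:
        return []
    rng = random.Random(seed)
    chosen = set()
    for _ in range(max_tries):
        if len(chosen) == n_models:
            break
        size = rng.randint(min_size, len(features))
        subset = frozenset(rng.sample(range(len(features)), size))
        if has_enough_key_features([features[i] for i in subset]):
            chosen.add(subset)
    # Each model is returned as a sorted list of feature indices.
    return [sorted(subset) for subset in chosen]
```

With three models per ligand conformation, 280,096 ligands give 280,096 × 3 = 840,288 pairs, which matches the reported size of LigPhoreSet.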
Problem formulation
Generally, the 3D ligand-pharmacophore mapping (LPM) task is to identify a reasonable ligand conformation that maximally matches with a given pharmacophore model. Given the 3D structure of the pharmacophore and an initial ligand conformation , the LPM model predicts a ligand conformation that satisfies the pharmacophore constraints.
1 |
where , and represent the 3D representations of the input pharmacophore model, the input ligand conformation and the generated ligand conformation, respectively. The generated ligand conformation shares the same chemical structure with and only differs in conformation.
The LPM problem treated by diffusion-based generative modeling
To solve the LPM problem, we need to approximate the conditional probability density of ligand binding conformation . In general, the gradient of the probability density is called the score function. Song and Ermon introduced score-based generative modeling to learn this score function from data and to generate samples with Langevin dynamics52. Given an initial sample from any prior distribution (e.g., Gaussian distribution), Langevin dynamics is incorporated to perform denoising using the following iterative update:
2 |
where is the step size and is the number of iterations. is the sample noise from . is the input ligand conformation and is the predicted ligand conformation. Under certain modest conditions53,54, and with sufficiently small step size and large , the distribution of will approximate the true distribution of ligand binding conformation , where represents an optimal ligand conformation.
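To make the iterative update concrete, the following Python sketch runs unadjusted Langevin dynamics in the standard form used in score-based generative modeling, x ← x + (ε/2)·score(x) + √ε·z with z drawn from a standard normal distribution. It is a toy illustration under stated assumptions: the target is a one-dimensional Gaussian whose score is known in closed form, whereas DiffPhore uses a learned score network conditioned on the pharmacophore; all names and parameter values are hypothetical.

```python
import math
import random


def gaussian_score(x, mean=2.0, std=0.5):
    """Score (gradient of the log density) of a 1D Gaussian."""
    return -(x - mean) / (std * std)


def langevin_sample(score, x0, step=0.01, n_steps=500, rng=None):
    """Run n_steps Langevin updates starting from x0 and return the final sample."""
    rng = rng if rng is not None else random.Random(0)
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * score(x) + math.sqrt(step) * noise
    return x


# Start from a standard normal prior and denoise toward the target distribution.
rng = random.Random(42)
samples = [langevin_sample(gaussian_score, rng.gauss(0.0, 1.0), rng=rng)
           for _ in range(200)]
sample_mean = sum(samples) / len(samples)
```

With a small step size and enough iterations, the samples concentrate around the target mean regardless of the prior they start from, which is the property the conformation search relies on.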
In this paper, we proposed a score-based diffusion model DiffPhore, to iteratively generate the ligand conformation . Specifically, the denoising probability can be formulated as given the ligand structure . stands for the perturbed data point at step . Thus, the perturbed data distribution . We consider a sequence of decreasing noise scales (). DiffPhore introduces a score network52,55 to estimate the score function at each noise level . Thus, the training loss of this score network is
3 |
As for the generation stage, DiffPhore sequentially performs steps of Langevin MCMC to obtain a reasonable ligand conformation mapped with the given pharmacophore.
4 |
Where is the step size at . In the next section, we will explain the detailed implementations of the above generative modeling framework.
DiffPhore architecture
To iteratively generate reasonable ligand conformations given the pharmacophore, DiffPhore comprises three main components: a knowledge-guided LPM representation encoder (), a conformation generator (), and a calibrated conformation sampler (). As the input of the score network, the LPM representation leverages pharmacophore principles to characterize the 3D mapping relationships of ligand conformation-pharmacophore pairs. The conformation generator takes LPM representations as inputs to iteratively search for ligand conformations that fit a pharmacophore model, continuing until the maximum alignment is achieved. The calibrated conformation sampler is designed to mitigate the exposure bias of the iterative conformation search process.
Knowledge-guided LPM representations
Accurately representing LPMs is the prerequisite for successfully performing the task of fitting ligand conformations to pharmacophores. We proposed a heterogeneous geometric graph to characterize LPMs in 3D space,
5 |
where is a ligand graph at the t-th step, is a pharmacophore graph, and is a bipartite graph (Fig. 2b; see details in Supplementary Methods). and stand for the ligand atoms and their 3D coordinates. represents the covalent bonds as well as unbonded edges within 5 Å in the ligand. and represent the pharmacophore points and their 3D coordinates, respectively. denotes the connections between each pair of pharmacophore features, and connections of each exclusion sphere to the nearest pharmacophore point in . and are vectors connecting the neighboring nodes in and , respectively.
The bipartite graph is exploited to describe the ligand-pharmacophore matching relations, where connects each ligand atom to all the pharmacophore feature points, and are the pharmacophore type and direction matching vectors, respectively. We detail the featurization and implementation of , and in Supplementary Methods.
Diffusion-based conformation generator
Given the LPM representations at random time , the generator aims to predict the conformations at the former step, given by
6 |
Since the degrees of freedom in 3D coordinates are significantly higher than needed, the ligand conformation here is represented by a combination of translations, rotations, and changes to torsion angles. The ligand conformation space can be formulated as an -dimensional submanifold, is the number of rotatable bonds, and 6 refers to the roto-translations13. More specifically, the ligand conformation space can be formally defined as:
7 |
8 |
9 |
where is the product space of 3D translation group , 3D rotation group and changes in torsion angles , stands for the total transformation with respect to an element in the product space , and refer to the actual translation, rotation, and torsion angle transformations. To reduce the prediction complexity, we formulate the prediction task as the learning of the change directions (or scores) in the ligand translation (), rotation (), and torsion angles () instead:
10 |
11 |
The scores () correspond to the gradient estimates of the translation (), rotation () and torsion angle () diffusion kernels, which follow Gaussian distribution, IGSO(3) distribution56 and the wrapped normal distribution57, respectively. The corresponding gradients (e.g., can be easily calculated in advance13.
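The rigid-body part of this parameterization can be sketched in plain Python: a conformation is updated by rotating the atoms about the ligand center and then translating them, so that a ligand with m rotatable bonds has m + 6 degrees of freedom. The sketch below is an illustration only; it uses the unweighted centroid as the center of mass, encodes the rotation as a rotation vector (axis times angle) applied with Rodrigues' formula, and omits torsion updates. All function names are hypothetical.

```python
import math


def centroid(coords):
    """Unweighted centroid, used here as a stand-in for the center of mass."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def rotate(v, rotvec):
    """Rotate vector v by rotation vector rotvec using Rodrigues' formula."""
    theta = math.sqrt(sum(r * r for r in rotvec))
    if theta < 1e-12:
        return tuple(v)
    k = tuple(r / theta for r in rotvec)
    k_cross_v = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
    k_dot_v = sum(a * b for a, b in zip(k, v))
    c, s = math.cos(theta), math.sin(theta)
    return tuple(v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
                 for i in range(3))


def rigid_update(coords, translation, rotvec):
    """Rotate all atoms about the centroid, then translate them."""
    center = centroid(coords)
    updated = []
    for p in coords:
        r = rotate(tuple(p[i] - center[i] for i in range(3)), rotvec)
        updated.append(tuple(r[i] + center[i] + translation[i] for i in range(3)))
    return updated


def n_degrees_of_freedom(n_rotatable_bonds):
    """3 translational + 3 rotational + one per rotatable bond."""
    return n_rotatable_bonds + 6
```

Because the update is rigid, all interatomic distances are preserved, which is one reason transformations in this space tend to keep bond lengths and angles chemically plausible.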
The SE(3)-equivariant conformation generator comprises the embedding, update, and output modules (Supplementary Fig. 3a). The update module consists of message-passing layers, each with intra- and inter-graph update layers. The intra-graph layer extracts the topological features of the ligand and the pharmacophore separately; the inter-graph layer performs the feature fusion between two graphs, establishing deep representations of the ligand-pharmacophore interactions. Finally, the output module predicts the change directions in the ligand translation (), rotation (), and torsion angles ().
Initially, the embedding module processes input ligand and pharmacophore graphs as well as cross-edges between them, and integrates the random diffusion time () into the graph features (Supplementary Fig. 3b; see details in Supplementary Methods). This yields the initial embeddings for ligand () pharmacophore () and cross edges (). Since all the computations of the conformation generator correspond to a specific step, we omit the notation for the subsequent features for simplicity.
12 |
Next, the update module iteratively refines the initial embeddings via message passing layers (Supplementary Fig. 3c), with each layer comprising intra-graph and inter-graph updates. Intra-graph updates compute messages ( and , Eqs. 13–15) to incorporate the information of internal topological structure within each graph. The updates are computed as tensor products of the node features and the spherical harmonic representations of neighboring edge vectors, weighted by the edge embedding , the outgoing and the incoming node features .
13 |
14 |
15 |
In the formulas, denotes the tensor products layer, stands for neighbor nodes and is the spherical harmonic. refers to the batch normalization, stands for a MLP layer, and refers to tensor product operation.
The inter-graph layer simulates the ligand-pharmacophore recognition and alignment with the constructed bipartite graph . The pharmacophore type and direction matching vectors are incorporated for the calculation of the inter-graph updates (, ) via similar tensor product layers (Eqs. 16–17). The intra- and inter-graph messages are aggregated to update ligand () and pharmacophore () node embeddings (Eqs. 18–19). All tensor product layers here are implemented using the ‘FullyConnectedTensorProduct’ layer from the E3NN package (see details in Supplementary Methods).
16 |
17 |
18 |
19 |
Finally, the output module receives the updated ligand features to predict the translation, rotation, and torsion scores with respect to their diffusion kernels. Because translation and rotation are rigid transformations acting on the center of mass of the ligand, edges between the ligand atoms and the center of mass are constructed to compute the corresponding scores with a tensor product layer. Here, are MLP layers, and denotes the radial basis embedding of the edge length. The vector stands for the 3D vector from the center of mass to the ligand atom .
20 |
21 |
To estimate the torsion score , the output layer focuses on the rotatable bonds and the adjacent atoms to predict the corresponding changes in torsion angles. For the torsion score of the rotatable , the output layer performs the tensor product operation between the bond features and the adjacent atom features, given by
22 |
23 |
24 |
25 |
where is the set of ligand atoms connected with the rotatable bonds. is a convolutional filter constructed for each rotatable bond calculating the tensor product of the spherical harmonic representation of the bond axis ( here means max level is 2) and the vector from bond center of to the atom . refers to a MLP layer, stands for the feature of the rotatable bond formed by the involved node features of the incoming node and the outgoing node .
The loss function () of DiffPhore consists of three components corresponding to the translation, rotation and torsion diffusion kernels:
26 |
where are the precomputed labels in the training dataset. In this way, the loss function () enforces the conformation generator to estimate the denoising directions at each step.
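A three-component loss of this kind can be sketched as follows. The sketch assumes that each component is a mean squared error between predicted and precomputed scores and that the components are combined with weights defaulting to 1; neither the error form nor the weighting is confirmed by the text, and the dictionary keys are hypothetical names.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences (0.0 if empty)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        return 0.0  # e.g., a ligand without rotatable bonds has no torsion term
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def score_matching_loss(pred, label, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of translation, rotation, and torsion score errors (assumed form)."""
    w_tr, w_rot, w_tor = weights
    return (w_tr * mse(pred["tr"], label["tr"])
            + w_rot * mse(pred["rot"], label["rot"])
            + w_tor * mse(pred["tor"], label["tor"]))
```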
Calibrated conformation sampler
Owing to their iterative, auto-regressive generation process, diffusion models usually suffer from exposure bias, which is caused by the input mismatch between the training and inference phases. In particular, during training the conformation generator takes a perturbed conformation as input and the corresponding scores () as labels. By contrast, during inference the conformation generator is fed with its own predicted conformation. To narrow the discrepancy between the two phases, we proposed a calibrated conformation sampler, which mimics the inference process to construct pseudo ligand conformations () and corresponding scores () for model training (see details in Supplementary Table 3 and Supplementary Methods). The pseudo ligand conformations are estimated from the data points denoised by DiffPhore, thereby alleviating the exposure bias problem.
27 |
28 |
29 |
However, relying solely on these calibrated data points for model training is infeasible, as their quality is inferior to that of the real data points. Therefore, we utilized an epoch-dependent probability scheduler to sample the calibrated data as inputs with the probability and real data points with the probability (). During training, the sampling probability starts from a small value and gradually increases with the training epoch, as shown in Eq. 28, where are hyperparameters (Supplementary Table 11) that balance the utilization of the two types of training data.
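The idea of an epoch-dependent scheduler can be sketched as below. The linear ramp, the parameter names, and the default values are all assumptions chosen for illustration; they are not the functional form of Eq. 28 or the hyperparameters in Supplementary Table 11.

```python
import random


def calibrated_probability(epoch, p_start=0.0, p_max=0.5, ramp_epochs=100):
    """Probability of using a calibrated sample; assumed linear ramp, then constant."""
    frac = min(max(epoch / ramp_epochs, 0.0), 1.0)
    return p_start + (p_max - p_start) * frac


def pick_training_input(real_sample, calibrated_sample, epoch, rng):
    """Use the calibrated sample with probability p(epoch), else the real sample."""
    if rng.random() < calibrated_probability(epoch):
        return calibrated_sample
    return real_sample
```

Early in training, when the model's own denoised conformations are still poor, almost all inputs are real perturbed data; the share of calibrated inputs grows as the predictions become more reliable.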
The detailed algorithms of , , and , and the model training details can be found in Supplementary Methods and the source codes.
Pharmacophore fitness scorings
To leverage the advantages of DiffPhore in different virtual screening scenarios, we introduced four fitness scorings to evaluate the degree of the alignment between the generated ligand conformations and the reference pharmacophore model. is a basic scoring function considering pharmacophore feature alignment and exclusion sphere collision, which is calculated using an in-situ max-matching approach as provided in Eqs. 30–32:
30 |
31 |
32 |
where represents the total overlap volume between the ligand conformation pharmacophore features () and the reference pharmacophore features (). It is calculated as the sum of individual feature overlap volumes, considering scaling factors (), basic weights (), chemical group weights (), directional differences (), tolerance ranges (), and the distance of the matched pharmacophore pair (). is set to 0 (when the number of root atoms equals 1) or (when the number of root atoms is larger than 1). represents the total volume of the reference pharmacophore features. denotes the sum of volumes where the ligand atoms overlap with reference exclusion volumes. is a maximum tolerance for ligand clashing with the pharmacophore, set to 500. is adopted as the default fitness score for DiffPhore.
Building upon , includes a bias factor that accounts for the percentage of matched pharmacophore feature pairs, with the aim of considering the tolerance of pharmacophore features. It is calculated by Eq. 33, where is the number of matched pharmacophore pairs and is the total number of reference pharmacophore features. Here, two pharmacophore features are regarded as a matched pair if the distance between them is less than their tolerance range.
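The matched-pair criterion can be illustrated with a small Python sketch that computes the fraction of reference features matched by a ligand conformation. It is a simplification under stated assumptions: features must share the same type to match, each reference feature carries its own tolerance radius, and no one-to-one (max-matching) assignment is enforced. The data layout and function name are hypothetical, and the sketch does not reproduce Eq. 33.

```python
import math


def matched_fraction(ligand_feats, ref_feats):
    """Fraction of reference features with a same-type ligand feature within tolerance.

    ligand_feats: list of (type, (x, y, z))
    ref_feats:    list of (type, (x, y, z), tolerance)
    """
    if not ref_feats:
        raise ValueError("reference pharmacophore has no features")
    matched = 0
    for ref_type, ref_xyz, tolerance in ref_feats:
        if any(lig_type == ref_type and math.dist(lig_xyz, ref_xyz) < tolerance
               for lig_type, lig_xyz in ligand_feats):
            matched += 1
    return matched / len(ref_feats)
```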
33 |
Recognizing the significance of anchor pharmacophore features in protein-ligand recognition and practical drug discovery23, we introduced to specifically measure the alignment of anchor pharmacophore features:
34 |
where is the total volume of the anchor features in reference pharmacophore model. is the sum of volumes accounting for the ligand pharmacophore features overlapping with the reference anchor features.
Taking all the factors into consideration, we also proposed a comprehensive fitness score :
35 |
In addition, is specifically designed for target fishing. It further considers the extent to which the number of ligand’s pharmacophore features matches the number of pharmacophore features representing the target.
36 |
where is the count of molecule pharmacophore features.
Baseline setup and implementation
The conformation generation tools, including OpenBabel (version 2.4.1) and Conformator (version 1.2.1), were obtained from their official websites. We utilized the official “AutoPH4” plugin to perform pharmacophore modeling in MOE (version 2020.09) and employed the “Compute | Pharmacophore | Search” functionality for ligand-pharmacophore alignment. The aligned poses in MOE are ranked using the default “rmsdx” metric.
The open-access docking programs, including AutoDock Vina, Uni-dock, SMINA, and GNINA, were implemented following their official source code repositories and instructions. We utilized the “prepare_receptor” and “prepare_ligand” scripts in ADFR toolkit for AutoDock Vina, GNINA, and SMINA to prepare the PDBQT files of protein and ligand structures. As for Uni-dock, its official “unidocktools” was employed. To define the conformation search area, we used a box of centered on the ligand in complex crystal structure for AutoDock Vina and Uni-dock, and utilized the “--autobox_ligand” option with default buffer range (4 Å) for SMINA and GNINA. The number of binding conformations was set to 10 for all docking baselines, with other parameters kept at their default settings.
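For the box-based programs, centering the search space on the crystallographic ligand amounts to computing the ligand centroid and writing it into the docking configuration. The Python sketch below generates an AutoDock Vina style configuration with `center_*`, `size_*`, and `num_modes` entries; the function name is hypothetical, and the box edge lengths are left as a required argument rather than given a default.

```python
def vina_box_config(ligand_coords, box_size, num_modes=10):
    """Build a Vina-style config string for a box centered on the ligand centroid.

    ligand_coords: list of (x, y, z) atom coordinates from the crystal ligand
    box_size:      (size_x, size_y, size_z) in angstroms, supplied by the caller
    """
    n = len(ligand_coords)
    cx, cy, cz = (sum(p[i] for p in ligand_coords) / n for i in range(3))
    sx, sy, sz = box_size
    lines = [
        f"center_x = {cx:.3f}",
        f"center_y = {cy:.3f}",
        f"center_z = {cz:.3f}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"num_modes = {num_modes}",  # 10 binding conformations per ligand
    ]
    return "\n".join(lines)
```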
All calculations were performed on Rocky Linux 9.2 with an Intel(R) Xeon(R) Platinum 8378C CPU @ 2.80 GHz and an NVIDIA RTX 4090 GPU.
Evaluation metrics
The metrics including Root Mean Square Deviation (RMSD), PoseBusters test validity (PB-Valid), Boltzmann-Enhanced Discrimination of Receiver Operating Characteristic (BEDROC), Enrichment Factor (EF), and Area Under the Receiver Operating Characteristic Curve (AUROC), were used for performance evaluations (see details in Supplementary Methods).
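Two of these metrics are simple enough to sketch directly. The Python functions below compute a plain coordinate RMSD between two conformations with matched atom order and an enrichment factor at a chosen fraction of the ranked library. They are generic textbook definitions given as an illustration; they assume that higher scores rank first and apply no symmetry correction to the RMSD, and they may differ in detail from the implementations described in the Supplementary Methods.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two conformations with matched atom order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("conformations must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def enrichment_factor(scores, labels, fraction=0.005):
    """EF = (active rate in the top fraction) / (active rate in the whole library)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    actives_top = sum(label for _, label in ranked[:n_top])
    actives_all = sum(labels)
    if actives_all == 0:
        raise ValueError("no actives in the library")
    return (actives_top / n_top) / (actives_all / len(labels))
```

For example, if all of the top 0.5% of a library containing 1% actives are active, the enrichment factor at 0.5% equals 100.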
Case study: human glutaminyl cyclases
We used DiffPhore to conduct virtual screening to identify potential inhibitors of sQC and gQC by employing a substrate-mimicking strategy. First, the binding mode of the tripeptide substrate NH2-Gln-Phe-Ala-CONH2 (QFA) with sQC was obtained via our QM/MM calculations. Then, we constructed the pharmacophore model from the calculated sQC:QFA complex structure to represent the key binding features of QFA with sQC (Supplementary Fig. 13). Given the pharmacophore model, we then utilized DiffPhore to screen potential hit compounds for sQC against the commercial Vitas-M compound library (https://vitasmlab.biz/), which contains about 1.4 million compounds available for quick purchase. To efficiently screen this large compound library, we implemented a pharmacophore fingerprint filtering approach, which evaluates the matching context based on the number and types of pharmacophore features, without accounting for their 3D alignment. A pharmacophore fingerprint similarity cutoff of 0.6 was set. With this filter, we identified 91,229 ligands from the Vitas-M compound library for subsequent screening by DiffPhore (20 poses generated for each ligand). Through manual inspection, we picked 15 structurally distinct compounds from the top-ranked hits with for experimental verification.
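A prefilter of this kind, which compares only the number and types of pharmacophore features, can be sketched as follows. The similarity measure used here (a Tanimoto coefficient on feature-type counts) is an assumption, since the text specifies only the 0.6 cutoff and not the fingerprint definition; the function names and the inclusive threshold are likewise hypothetical.

```python
from collections import Counter


def count_tanimoto(counts_a, counts_b):
    """Tanimoto similarity on count vectors: sum of minima over sum of maxima."""
    keys = set(counts_a) | set(counts_b)
    numerator = sum(min(counts_a[k], counts_b[k]) for k in keys)
    denominator = sum(max(counts_a[k], counts_b[k]) for k in keys)
    return numerator / denominator if denominator else 0.0


def passes_filter(ligand_types, query_types, cutoff=0.6):
    """Keep a ligand whose feature-type profile is similar enough to the query."""
    similarity = count_tanimoto(Counter(ligand_types), Counter(query_types))
    return similarity >= cutoff
```

Because no 3D alignment is involved, such a filter is cheap enough to run on the full library before generating poses for the retained ligands.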
sQC/gQC/PGP-1 protein expression and purification
We followed the protocols from our previous study58 for the expression and purification of human sQC (amino acids 33-361), gQC (amino acids 53-382), and the auxiliary enzyme PGP-1 (amino acids 1-215) (see details in Supplementary Methods).
sQC/gQC/PGP-1 inhibition activity assays
All compounds were tested for their inhibitory activity on sQC, gQC, and PGP-1 in the assay buffer (25 mM Tris-HCl, 150 mM NaCl, 10% glycerol, pH 8.0) as described previously58 (see details in Supplementary Methods). All determinations were tested in triplicate.
Thermal shift assays
The sQC or gQC enzymes (5 μM) were first incubated with test compounds (50 μM) or a vehicle at room temperature for 20 minutes in Tris-HCl buffer (25 mM Tris-HCl, 150 mM NaCl, 10% glycerol, pH 8.0). Then, the SYPRO ORANGE dye (10× concentration) was added, and the fluorescence was promptly quantified using a fluorescence quantitative PCR instrument. The temperature was incrementally increased from 30 to 95 °C, rising by 1 °C per cycle. The resulting fluorescence intensity versus temperature was analyzed using GraphPad Prism to determine the melting temperature (Tm) values.
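As an illustration of how a melting temperature can be read from such a curve, the Python sketch below takes Tm as the temperature at which the first derivative of fluorescence with respect to temperature is largest, using finite differences between consecutive readings. This is a simplified stand-in and not the GraphPad Prism analysis used in this study; the function names are hypothetical.

```python
def melting_temperature(temps, fluorescence):
    """Return the midpoint of the interval with the steepest fluorescence increase."""
    if len(temps) != len(fluorescence) or len(temps) < 2:
        raise ValueError("need at least two paired readings")
    best_slope, best_mid = None, None
    for i in range(len(temps) - 1):
        slope = (fluorescence[i + 1] - fluorescence[i]) / (temps[i + 1] - temps[i])
        if best_slope is None or slope > best_slope:
            best_slope = slope
            best_mid = 0.5 * (temps[i] + temps[i + 1])
    return best_mid


def delta_tm(tm_with_compound, tm_vehicle):
    """Thermal shift: positive values indicate stabilization by the compound."""
    return tm_with_compound - tm_vehicle
```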
Co-crystallization, data collection, and analysis
The hanging-drop vapor diffusion method was employed for co-crystallization experiments. The purified sQC proteins (8 mg/mL) were incubated with 5/13 (3.9 mM) on ice for 2 h and then centrifuged at 15,777 × g for 10 min to remove insoluble materials. Crystals were grown under the condition: 12−16% (v/v) polyethylene glycol 4000, 0.2 M MgCl2 and 0.1 M Tris-HCl at pH 8.5. The protein solution was mixed with the reservoir solution at a 1:1 ratio. The crystals were cryoprotected with the mother liquor supplemented with 30% (v/v) glycerol prior to harvesting. Data collection was performed at the BL18U1 beamline at the Shanghai Synchrotron Radiation Facility. The diffraction data were processed using XDS59 or AutoXP60, followed by structural determination with PHENIX61 and WinCoot62. We utilized the existing crystal structure of sQC (PDB code 3PBB) as the template in the molecular replacement step. The crystal structures of sQC:5 (PDB code 9ISD) and sQC:13 (PDB code 9IVV) are available in the Protein Data Bank.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
This work is financially supported by the National Key R&D Program of China (2023YFF1204901 to G.B.L.), the National Natural Science Foundation of China (82122065, 82473845, and 82073698 to G.B.L.), the Basic Research Foundation of Sichuan University (2023SCUH0073 to G.B.L.), and the Sichuan Science and Technology Program (2025YFHZ0085 to G.B.L.). We thank the staff from beamlines BL18U1 and BL19U1 at Shanghai Synchrotron Radiation Facility of the National Facility for Protein Science (Shanghai, China) for their great support. We also thank Professor Jin-Liang Yang and Dr. Zhi-Xiong Zhang (Sichuan University) for providing MOE for comparison purposes.
Author contributions
G.B.L. conceived, planned, and supervised this study; J.L.Y. and C.Z. collected and processed the datasets; J.L.Y. designed and trained the model supervised by G.B.L. and X.G.L.; J.L.Y. and J.W.W. performed model validation and case study; X.L.N., J.M., F.B.M., and Y.T.C. performed protein purification, enzymatic activity testing, and co-crystallographic studies; J.L.Y., C.Z., X.L.N., J.M., F.B.M., J.W.W., B.D.T., X.G.L., and G.B.L. analyzed the data; J.L.Y., X.L.N., X.G.L. and G.B.L. wrote the manuscript. All authors contributed to the final draft and approved the final version for submission.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Data availability
The LigPhoreSet and CpxPhoreSet datasets for training and evaluation are available on Zenodo63 (10.5281/zenodo.14819917). The PDBBind set is available at http://pdbbind.org.cn. The PoseBusters set is available at https://zenodo.org/record/8278563. The DUD-E set is available at http://dude.docking.org. The ZINC database is available at https://zinc20.docking.org. The crystal structure of sQC used as the template in structure determination is available in the Protein Data Bank under the accession code 3PBB. Crystallographic data for sQC:5 and sQC:13 reported in this study are available in the Protein Data Bank under the accession codes 9ISD and 9IVV. The crystal structure of the sQC:QFA analog used for structure comparison is available in the Protein Data Bank under the accession code 6YI1. Source data are provided with this paper as a Source Data file.
Code availability
The source code is available on Zenodo64 (10.5281/zenodo.14818730), in a GitHub repository65 (https://github.com/VicFisher/DiffPhore), and on our project website (https://diffphore.ddtmlab.org).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Xiang-Gen Liu, Email: liuxianggen@scu.edu.cn.
Guo-Bo Li, Email: liguobo@scu.edu.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-025-57485-3.
References
- 1. Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
- 2. Schneider, P. et al. Rethinking drug design in the artificial intelligence era. Nat. Rev. Drug Discov. 19, 353–364 (2020).
- 3. Mullowney, M. W. et al. Artificial intelligence for natural product drug discovery. Nat. Rev. Drug Discov. 22, 895–916 (2023).
- 4. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
- 5. Ren, F. et al. A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nat. Biotechnol. 43, 63–75 (2024).
- 6. Catacutan, D. B. et al. Machine learning in preclinical drug discovery. Nat. Chem. Biol. 20, 960–973 (2024).
- 7. Pandey, M. et al. The transformational role of GPU computing and deep learning in drug discovery. Nat. Mach. Intell. 4, 211–221 (2022).
- 8. Perez-Lopez, R. et al. A guide to artificial intelligence for cancer researchers. Nat. Rev. Cancer 24, 427–441 (2024).
- 9. Allenspach, S. et al. Neural multi-task learning in drug design. Nat. Mach. Intell. 6, 124–137 (2024).
- 10. Tropsha, A. et al. Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR. Nat. Rev. Drug Discov. 23, 141–155 (2024).
- 11. Du, Y. et al. Machine learning-aided generative molecular design. Nat. Mach. Intell. 6, 589–604 (2024).
- 12. Munson, B. P. et al. De novo generation of multi-target compounds using deep generative chemistry. Nat. Commun. 15, 3636 (2024).
- 13. Corso, G. et al. DiffDock: diffusion steps, twists, and turns for molecular docking. In International Conference on Learning Representations (ICLR, 2023).
- 14. Stärk, H. et al. EquiBind: geometric deep learning for drug binding structure prediction. In Proc. International Conference on Machine Learning 162, 20503–20521 (PMLR, 2022).
- 15. Lu, W. et al. TANKBind: Trigonometry-aware neural networks for drug-protein binding structure prediction. Adv. Neural Inf. Process. Syst. 35, 7236–7249 (2022).
- 16. Zhang, X. et al. Efficient and accurate large library ligand docking with KarmaDock. Nat. Comput. Sci. 3, 789–804 (2023).
- 17. Zhang, Y. et al. E3Bind: an end-to-end equivariant network for protein-ligand docking. In International Conference on Learning Representations (ICLR, 2023).
- 18. Guan, J. et al. 3D Equivariant diffusion for target-aware molecule generation and affinity prediction. In International Conference on Learning Representations (ICLR, 2023).
- 19. Zhang, O. et al. ResGen is a pocket-aware 3D molecular generation model based on parallel multiscale modelling. Nat. Mach. Intell. 5, 1020–1030 (2023).
- 20. Zhang, O. et al. Learning on topological surface and geometric structure for 3D molecular generation. Nat. Comput. Sci. 3, 849–859 (2023).
- 21. Jiang, Y. et al. PocketFlow is a data-and-knowledge-driven structure-based molecular generative model. Nat. Mach. Intell. 6, 326–337 (2024).
- 22. Schaller, D. et al. Next generation 3D pharmacophore modeling. Wiley Interdiscip. Rev. Comput. Mol. Sci. 10, e1468 (2020).
- 23. Dai, Q. et al. AncPhore: a versatile tool for anchor pharmacophore-steered drug discovery with applications in discovery of new inhibitors targeting metallo-β-lactamases and indoleamine/tryptophan 2,3-dioxygenases. Acta Pharm. Sin. B 11, 1931–1946 (2021).
- 24. Huang, Q. et al. PhDD: A new pharmacophore-based de novo design method of drug-like molecules combined with assessment of synthetic accessibility. J. Mol. Graph. Model. 28, 775–787 (2010).
- 25. Dixon, S. L. et al. PHASE: a novel approach to pharmacophore modeling and 3D database searching. Chem. Biol. Drug Des. 67, 370–372 (2006).
- 26. Barnum, D. et al. Identification of common functional configurations among molecules. J. Chem. Inf. Comput. Sci. 36, 563–571 (1996).
- 27. Taminau, J. et al. Pharao: pharmacophore alignment and optimization. J. Mol. Graph. Model. 27, 161–169 (2008).
- 28. Sunseri, J. et al. Pharmit: interactive exploration of chemical space. Nucleic Acids Res. 44, W442–W448 (2016).
- 29. Zhu, H. et al. A pharmacophore-guided deep learning approach for bioactive molecular generation. Nat. Commun. 14, 6234 (2023).
- 30. Seo, S. & Kim, W. Y. PharmacoNet: accelerating large-scale virtual screening by deep pharmacophore modeling. Chem. Sci. 15, 19473–19487 (2024).
- 31. Buttenschoen, M. et al. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem. Sci. 15, 3130–3139 (2024).
- 32. Mysinger, M. M. et al. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J. Med. Chem. 55, 6582–6594 (2012).
- 33. Li, G.-B. et al. IFPTarget: a customized virtual target identification method based on protein–ligand interaction fingerprinting analyses. J. Chem. Inf. Model. 57, 1640–1651 (2017).
- 34. Xu, C. et al. Glutaminyl cyclase, diseases, and development of glutaminyl cyclase inhibitors. J. Med. Chem. 64, 6549–6565 (2021).
- 35. Coimbra, J. R. M. et al. Therapeutic potential of glutaminyl cyclases: Current status and emerging trends. Drug Discov. Today 28, 103644 (2023).
- 36. Logtenberg, M. E. W. et al. Glutaminyl cyclase is an enzymatic modifier of the CD47-SIRPα axis and a target for cancer immunotherapy. Nat. Med. 25, 612–619 (2019).
- 37. Barreira da Silva, R. et al. Loss of the intracellular enzyme QPCTL limits chemokine function and reshapes myeloid infiltration to augment tumor immunity. Nat. Immunol. 23, 568–580 (2022).
- 38. Eberhardt, J. et al. AutoDock Vina 1.2.0: new docking methods, expanded force field, and Python bindings. J. Chem. Inf. Model. 61, 3891–3898 (2021).
- 39. Yu, Y. et al. Uni-Dock: GPU-accelerated docking enables ultralarge virtual screening. J. Chem. Theory Comput. 19, 3336–3345 (2023).
- 40. Koes, D. R. et al. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J. Chem. Inf. Model. 53, 1893–1904 (2013).
- 41. McNutt, A. T. et al. GNINA 1.0: molecular docking with deep learning. J. Cheminform. 13, 43 (2021).
- 42. Cao, D. et al. SurfDock is a surface-informed diffusion generative model for reliable and accurate protein-ligand complex prediction. Nat. Methods 22, 310–322 (2025).
- 43. Cao, D. et al. Generic protein–ligand interaction scoring by integrating physical prior knowledge and data augmentation modelling. Nat. Mach. Intell. 6, 688–700 (2024).
- 44. Chaput, L. et al. Benchmark of four popular virtual screening programs: construction of the active/decoy dataset remains a major determinant of measured performance. J. Cheminform. 8, 56 (2016).
- 45. Chen, L. et al. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS One 14, e0220113 (2019).
- 46. Kupski, O. et al. Hydrazides are potent transition-state analogues for glutaminyl cyclase implicated in the pathogenesis of Alzheimer’s Disease. Biochemistry 59, 2585–2591 (2020).
- 47. Liu, Z. et al. Forging the basis for developing protein–ligand interaction scoring functions. Acc. Chem. Res. 50, 302–309 (2017).
- 48. Su, M. et al. Comparative assessment of scoring functions: the CASF-2016 update. J. Chem. Inf. Model. 59, 895–913 (2019).
- 49. Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).
- 50. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
- 51. Morgan, H. L. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. J. Chem. Doc. 5, 107–113 (1965).
- 52. Song, Y. & Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 32, 11918–11930 (2019).
- 53. Roberts, G. O. & Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2, 341–363 (1996).
- 54. Welling, M. & Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proc. 28th International Conference on Machine Learning (2011).
- 55. Song, Y. et al. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR, 2021).
- 56. Nikolayev, D. et al. Normal Distribution on the Rotation Group SO(3). Texture Stress Microstruct. 29, 201–233 (1997).
- 57. Jing, B. et al. Torsional diffusion for molecular conformer generation. Adv. Neural Inf. Process. Syst. 35, 24240–24253 (2022).
- 58. Mou, J. et al. X-ray structure-guided discovery of a potent benzimidazole glutaminyl cyclase inhibitor that shows activity in a Parkinson’s Disease mouse model. J. Med. Chem. 67, 8730–8756 (2024).
- 59. Brehm, W. et al. XDSGUI: a graphical user interface for XDS, SHELX and ARCIMBOLDO. J. Appl. Crystallogr. 56, 1585–1594 (2023).
- 60. Wang, L. et al. AutoPX: a new software package to process X-ray diffraction data from biomacromolecular crystals. Acta Crystallogr. D 78, 890–902 (2022).
- 61. Liebschner, D. et al. Macromolecular structure determination using X-rays, neutrons and electrons: recent developments in Phenix. Acta Crystallogr. D 75, 861–877 (2019).
- 62. Emsley, P. et al. Features and development of Coot. Acta Crystallogr. D 66, 486–501 (2010).
- 63. Yu, J. et al. LigPhoreSet and CpxPhoreSet for training DiffPhore. Zenodo, 10.5281/zenodo.14819917 (2025).
- 64. Yu, J. et al. DiffPhore:v1.0. Zenodo, 10.5281/zenodo.14818730 (2025).
- 65. Yu, J. et al. DiffPhore:v1.0. GitHub, https://github.com/VicFisher/DiffPhore (2024).