Optimal strategies for virtual screening of induced-fit and flexible target in the 2015 D3R Grand Challenge

Zhaofeng Ye; Matthew P Baumgartner; Bentley M Wingert; Carlos J Camacho

doi:10.1007/s10822-016-9941-0

Author manuscript; available in PMC: 2017 Sep 1.

Published in final edited form as: J Comput Aided Mol Des. 2016 Aug 29;30(9):695–706. doi: 10.1007/s10822-016-9941-0

Zhaofeng Ye ¹,², Matthew P Baumgartner ¹, Bentley M Wingert ¹, Carlos J Camacho ¹

PMCID: PMC5079819 NIHMSID: NIHMS813592 PMID: 27573981

Abstract

Induced fit or protein flexibility can make a given structure less useful for docking and/or scoring. The 2015 Drug Design Data Resource (D3R) Grand Challenge provided a unique opportunity to prospectively test optimal strategies for virtual screening in these types of targets: heat shock protein 90 (HSP90), a protein with multiple ligand-induced binding modes; and mitogen-activated protein kinase kinase kinase kinase 4 (MAP4K4), a kinase with a large flexible pocket. Using previously known co-crystal structures, we tested predictions from fixed-receptor methods that (a) use multiple receptor/ligand co-crystals as binding templates for minimization or docking (“close”), (b) align or dock to a single receptor (“cross”), and (c) choose from multiple bound ligands as initial templates for minimization to a single receptor (“min-cross”, a hybrid approach). Pose prediction using our “close” models resulted in average ligand RMSDs of 0.32 Å and 1.6 Å for HSP90 and MAP4K4, respectively, the most accurate models of the community-wide challenge. On the other hand, affinity ranking using our “cross” methods performed well overall despite the fact that a fixed receptor cannot model ligand-induced structural changes. In addition, “close” methods that leverage the co-crystals of the different binding modes of HSP90 also predicted the best affinity ranking. Our studies suggest that analysis of changes on the receptor structure upon ligand binding can help select an optimal virtual screening strategy.

Keywords: drug discovery, virtual screening, D3R, induced fit, affinity ranking, pose prediction

Introduction

Major challenges in virtual screening are the inadequacy of scoring functions for evaluating the affinity of docked poses, and the difficulty of predicting the ligand-induced flexibility observed in many important therapeutic targets [1-5]. To evaluate improvements in this area, the Drug Design Data Resource (D3R) developed the 2015 Grand Challenge, a community-wide experiment for researchers around the world to prospectively test docking and scoring methodologies against blinded data from two targets: heat shock protein 90 (HSP90), a protein that binds following an induced fit mechanism [6], i.e., the unbound or apo structure undergoes significant structural rearrangements upon ligand binding; and mitogen-activated protein kinase kinase kinase kinase 4 (MAP4K4), a kinase with a large pocket that includes sizable flexible loops [7].

The most commonly used scoring functions can basically be classified into three types as Kitchen et al. summarized [1]: force-field-based scoring (e.g., D-Score [8], G-Score [8], GOLD [9], AutoDock [10], DOCK [11], Glide [12], SIE [13]), empirical scoring (e.g., LUDI [14, 15], F-Score [16], ChemScore [17], SCORE [18], Fresno [19], X-SCORE [20], AutoDock Vina [21]) and knowledge-based scoring (e.g., DrugScore [22], SMoG [23]). In the 2010 Community Structure-Activity Resource (CSAR) Exercise, Carlson and collaborators analyzed the performance of different scoring functions on the CSAR-NRC data set [5, 24]. The results indicated that most of the scoring functions had comparable performance (R²=0.3-0.4) and the best R² were achieved by AutoDock and AutoDock Vina (R²=0.55) [5]. Despite the poor performance of scoring, many docking methods did well in predicting poses within 2.0 Å of the crystal conformation [3, 5].

Over the last few years, the Camacho lab has steadily built novel platforms for drug discovery, from predictions of druggable sites [25], to pharmacophore-based interactive virtual screening technologies that search billion size libraries in seconds [26]. We also developed Smina [27], a version of AutoDock Vina specially optimized to support high-throughput minimization and scoring. Based on our current implementation in AnchorQuery [28], Smina can minimize 10,000 compounds into a fixed receptor in about 10 seconds (details will be published elsewhere), the same time scale required for docking a single compound to a flexible receptor [1]. More recently, we have shifted our attention to improving our virtual screening pipeline [26-28]. We participated in the 2013/14 CSAR challenge that involved rank-ordering compounds to homology models of the receptors with a given protein primary sequence, identifying close-to-native bound conformations out of a set of decoy poses, and rank-ordering the affinity of sets of congeneric compounds to a given protein. Our predictions were among the best in the field [29, 30]. We showed that the most significant contribution to a meaningful enrichment of native-like models was the identification of the best receptor structure for docking and scoring. In particular, we showed that ranking a set of 31 congeneric compounds cross-docked to the tRNA (m1G37) methyltransferase (TRMD) structure with the largest pocket resulted in an impressive R²= 0.67, whereas other receptor structures yielded R² ~ 0.

Here, we report our participation in the 2015 D3R Grand Challenge, where we performed a comprehensive analysis of different strategies for predicting docking poses and ranking affinities for two highly flexible targets: HSP90 and MAP4K4. These strategies included methods that utilize all available receptor/ligand co-crystals (“close”), all available ligands and a single holo-receptor structure (“min-cross”), and only a single receptor/ligand co-crystal (“cross”). As in the 2013/14 CSAR competition [29], we found that the method that predicted the best docking poses was not the same as the one that predicted the best ranking of active compounds. Similarly, different methods were shown to predict the optimal ranking of active compounds for HSP90 and MAP4K4, i.e., “close” and “cross”, respectively. Inspection of the type of flexibility exhibited by each target, i.e., induced fit versus large flexible pocket, suggests guiding principles for selecting the optimal virtual screening strategy for flexible targets. We note that these findings are strongly supported by the fact that our prospective pose predictions and affinity rankings for HSP90 submitted to the 2015 D3R Grand Challenge were the best in the community-wide experiment.

Methods

We tested the performance of five major methods (Fig. 1) on both pose and affinity predictions. Several variants of the methods were also applied to special cases, which will be discussed later in the specific challenges.

The methods used the following applications, which are freely available for academic research. Structure preparation: All receptor structures were superimposed using the “align” command in PyMOL 1.7 [31]. Conformer generation: For structural alignment, 20 conformers were generated using Omega2 [32] with default settings. Chemical similarity: Babel 2.3.2 [33] was used with fingerprint 3 (FP3) to identify the most similar or “closest” compound among known ligands. The co-crystal receptor corresponding to the “closest” compound is referred to as the “closest” receptor. Conformer alignment: Structural alignments were performed using Open3DALIGN 2.282 [34]. Minimization: Aligned conformers were minimized to a given receptor using Smina [27] with default settings. Docking: Compounds were docked using Smina with default parameters and the AutoDock Vina [21] scoring function. A reference compound was used to define the docking box. The Vina-predicted energy was used to select the best-ranked docked pose.
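As an illustration of the “closest” compound selection, the following minimal sketch picks the most similar known ligand by Tanimoto coefficient. It assumes fingerprints are represented as sets of on-bit indices; the ligand names and bit sets are hypothetical, and the code does not call Babel or reproduce the FP3 fingerprint itself.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return len(fp_a & fp_b) / union


def closest_ligand(query_fp, known_fps):
    """Return (ligand_id, similarity) for the known ligand most similar to the query."""
    best_id = max(known_fps, key=lambda lig: tanimoto(query_fp, known_fps[lig]))
    return best_id, tanimoto(query_fp, known_fps[best_id])


# Hypothetical on-bit sets for three co-crystallized ligands (illustration only).
known = {
    "lig_A": frozenset({1, 4, 7, 9}),
    "lig_B": frozenset({2, 4, 8, 11, 15}),
    "lig_C": frozenset({3, 5, 8, 12}),
}
query = frozenset({2, 4, 8, 11})

closest_id, similarity = closest_ligand(query, known)
print(closest_id, round(similarity, 2))
```

The receptor co-crystallized with the returned ligand would then play the role of the “closest” receptor.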

Align-close method

(a) Conformers were generated for each compound in the test set. (b) The “closest” compound among known bound ligands was identified. (c) Conformers were aligned to the “closest” compound. (d) Aligned conformers were minimized to the “closest” receptor. (e) The best Vina score was used to predict affinity for the compound.
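Step (e) can be sketched as follows, assuming each compound already has a list of Vina scores (one per minimized conformer) and that a lower, more negative score means a more favorable predicted affinity. Compound names and score values are hypothetical.

```python
def best_vina_score(conformer_scores):
    """Most favorable (lowest) score among a compound's minimized conformers."""
    return min(conformer_scores)


def rank_by_predicted_affinity(scores_by_compound):
    """Order compounds from strongest to weakest predicted binder."""
    best = {name: best_vina_score(s) for name, s in scores_by_compound.items()}
    return sorted(best, key=best.get)


# Hypothetical per-conformer scores (kcal/mol) after minimization.
minimized = {
    "cmpd_1": [-7.2, -8.1, -6.9],
    "cmpd_2": [-9.4, -8.8],
    "cmpd_3": [-5.0, -6.3, -6.1],
}

ranking = rank_by_predicted_affinity(minimized)
print(ranking)
```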

Dock-close method

(a) The “closest” compound among known bound ligands was identified. (b) Compounds were docked to the “closest” receptor using the “closest” ligand as reference to define the docking box. (c) The best Vina score was used to predict affinity for the compound.

Min-cross method

(a) Conformers were generated for each compound. (b) The “closest” compound was identified. (c) Conformers were aligned to the “closest” compound. (d) The aligned conformers were minimized to all known bound receptors. (e) The best Vina score to each receptor was used to predict affinity. (f) Optimal receptor for virtual screening is selected (see below).

Align-cross method

(a) Conformers were generated for each compound. (b) Conformers were aligned to every known bound ligand. (c) Aligned conformers were minimized to the corresponding bound receptor. (d) The best Vina score among conformers was used to predict affinity to each receptor structure. (e) Optimal receptor for virtual screening is selected (see below).

Dock-cross method

(a) Compounds were docked to every known bound receptor using its bound ligand as reference. (b) The best Vina score to each receptor was used to predict affinity. (c) Optimal receptor for virtual screening is selected (see below).

These five methods can be grouped by receptor selection. The optimal receptor for “cross” methods (min-cross, align-cross and dock-cross) was chosen by comparing the Vina scores for each receptor with experimental data (IC50, see Supplementary Tables 1 and 2). We calculated Spearman's rank correlation coefficient (Spearman ρ) and the coefficient of determination (R²) to select the receptor that performed best for affinity ranking in our training set. Similarly, we compared the best-scored poses for each receptor with the crystal poses to compute the ligand root-mean-square deviation (RMSD), and used the percentage of poses with an RMSD below 2 Å to select the optimal receptor for pose prediction. For testing data, we used the best-performing receptor in the training data set to rank affinity and predict poses. For “close” methods (align-close and dock-close), there is no single optimal receptor; instead, multiple receptor/ligand co-crystals are used for predictions.
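The receptor selection for affinity ranking can be sketched as below. This is a minimal pure-Python illustration, assuming one Vina score per training compound for each candidate receptor; the receptor names, scores and IC50 values are hypothetical. Only Spearman ρ is computed here, with tied values given average ranks.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def select_optimal_receptor(scores_by_receptor, experimental):
    """Pick the receptor whose scores correlate best with the experimental data."""
    rho = {r: spearman_rho(s, experimental) for r, s in scores_by_receptor.items()}
    return max(rho, key=rho.get), rho


# Hypothetical training data: lower IC50 and lower Vina score both mean stronger binding.
ic50_nM = [12.0, 150.0, 900.0, 45.0, 3000.0]
vina = {
    "receptor_A": [-9.8, -8.1, -7.0, -9.0, -6.2],
    "receptor_B": [-7.5, -8.9, -7.9, -6.8, -8.2],
}

best_receptor, rho_by_receptor = select_optimal_receptor(vina, ic50_nM)
print(best_receptor, rho_by_receptor)
```

Because Spearman ρ depends only on ranks, using IC50 or its logarithm gives the same selection.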

Results

HSP90 Challenge

Challenge

(1) Predict binding modes of six HSP90 compounds. (2) Predict the affinity ranking of P=180 HSP90 compounds; 33 unidentified compounds in this set were said to have no inhibition. (3) Predict relative/absolute free energy of three small sets of compounds. Analyses of the 180 compounds show that they all fall into three chemical scaffolds: aminopyrimidines, benzimidazolones and benzophenone-like (Fig. 2C-2E; upper panels show scaffolds, and lower panels show examples). Two unpublished structures, 4YKR and 4YKY, were provided as examples of benzimidazolones and benzophenone-like compound binding.

Fig. 2. A. Four conformations of the HSP90 ligand-induced binding pocket based on the nearby adaptive loop (L2, between H4 and H5 [43]): close (2WI5), helix (4EFU), open (3RLR), half-close (3B28) (white cartoon: HSP90, red cartoon: flexible loop, orange sticks: small molecules). B. Four waters in the binding pocket labeled from 1 to 4 (white cartoon: HSP90, red spheres: water molecules). C. Aminopyrimidine scaffold and compound (2XDX). D. Benzimidazolone scaffold and compound (4YKR). E. Benzophenone-like scaffold and compound (4YKR). F. Histogram of binding modes among the N=181 known co-crystal structures and I=69 structures with IC50 data (N: number of co-crystals, I: number of co-crystals with IC50 data). G. Histogram of conservation frequency of the water molecules in Fig. 2B, showing that three crystal waters are 100% conserved.

Binding Pocket Analysis

There are N=179 PDB plus two unpublished HSP90 structures bound to small molecules, with I=69 of them having known IC50 (from BindingDB [35], BindingMOAD [36] and PDBBind [37], Supplemental Table 1). We superimposed all the known receptors to the receptor structure in 4YKR. Interestingly, a distal loop (L2 between H4 and H5, Fig. 2A) adapts markedly to different ligands. All co-crystal structures can be grouped into four distinct conformations based on this adaptive loop (red cartoon in Fig. 2A): close, helix, open and half-close (a conformation between open and close). The histograms of these binding modes in the whole dataset and in the subset with IC50 data are shown in Fig. 2F. The core binding pocket is quite rigid and stable, and four crystal water molecules are observed to participate in ligand binding (Fig. 2B). Three waters are highly conserved despite the different adaptive loop conformations (Fig. 2G). These analyses suggest that the ligand-binding pocket of HSP90 consists of a rigid core with a conserved water-mediated interaction network and a ligand-dependent adaptive loop. Therefore, when preparing models for docking and alignment/minimization, we kept conserved water molecules as part of the receptors.

Methods

We applied the five methods listed in Fig. 1 (i.e., align-close, dock-close, min-cross, align-cross, and dock-cross) for both pose prediction and affinity ranking. For affinity ranking, we also devised several variations of these methods as potential improvements to ligand alignment, scoring and filtering.

(a) min-cross-scaffold and align-close-scaffold methods: Given the limited set of scaffolds that presumably capture the core ligand interactions, for the min-cross and align-close methods we aligned the test compounds to the three scaffolds shown in Fig. 2C-2E (see, e.g., Fig. 3B) instead of the chemically “closest” compounds as in Fig. 3A.

(b) min-cross-pose and align-close-pose methods: Instead of using ligand structures from co-crystals as templates, we used the poses predicted by “close” methods as templates for alignment in the min-cross and align-close methods (see, e.g., Fig. 3C).

(c) dock-close-filter and align-close-filter: We also used the aforementioned predicted poses to manually select inactive compounds in the testing set. We then overruled the Vina score and moved this set of compounds to the bottom of the affinity ranking for the two methods that performed best in the training set.

(d) HSP90 score 1-4: We used machine learning and forward selection methodologies to develop four HSP90-specific scoring functions from the set of energy terms available in Smina [27] (see Supplemental Table 2 for the selected parameters and weights). A training dataset was constructed by cross-docking the I=69 compounds with published IC50 data to crystal structure 4EFU (optimal receptor for the dock-cross method) with the default Smina settings. The HSP90 score 1 and 2 functions were trained on active compounds (measured by Spearman ρ), while HSP90 score 3 and 4 were trained to maximize the discrimination of active versus decoy compounds (measured by AUC), with decoys obtained from the HSP90 dataset in the DUD-E database [38].

(e) 3DQSAR-align-pose and 3DQSAR-dock-pose: The relatively large amount of binding data made quantitative structure–activity relationship (QSAR) modeling possible. Using Open3DQSAR 2.3 [39], we trained 3DQSAR models with the 69 HSP90 structures with IC50 data. We applied the trained models to the poses predicted for the testing set by “close” methods.
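The “-filter” overrule in (c) amounts to a simple re-ranking step, sketched below. The compound names and scores are hypothetical, and the sketch assumes that demoted compounds keep their relative Vina order at the bottom of the list, which the text does not specify.

```python
def rank_with_filter(vina_scores, flagged_inactive):
    """Rank by Vina score (lowest first), then move flagged compounds to the bottom."""
    ordered = sorted(vina_scores, key=vina_scores.get)
    kept = [c for c in ordered if c not in flagged_inactive]
    demoted = [c for c in ordered if c in flagged_inactive]
    return kept + demoted


# Hypothetical scores; cmpd_2 scores best but was judged inactive on visual inspection.
scores = {"cmpd_1": -8.4, "cmpd_2": -9.6, "cmpd_3": -7.1, "cmpd_4": -9.0}
flagged = {"cmpd_2"}

filtered_ranking = rank_with_filter(scores, flagged)
print(filtered_ranking)
```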

Phase 1: Pose prediction results

A retrospective study of known ligands demonstrated that the dock-close and align-close methods predicted the most accurate poses. For the analysis shown in Fig. 3D, the co-crystal of each ligand was first removed from the dataset, and poses were then predicted based on the remaining co-crystal structures in the training set. Given the large dataset of available co-crystal structures, our results reflect the empirical observation that crystallographic information is superior to any computational model. Hence, we were able to predict high-accuracy poses for all six testing compounds. We took the top five poses predicted by “close” methods (sorted by Vina score), and submitted the best models. The mean RMSDs for the first-ranked and best poses were 0.46 Å and 0.32 Å, respectively. Fig. 3F-3H show an example of the best-predicted poses of each scaffold. The predicted pose for HSP90-44 had a flexible group sticking out of the binding site. We used molecular dynamics to predict the most likely conformation, yet the co-crystal shows that this group is stabilized by Lys58 from the second HSP90 monomer in the dimer structure (Fig. 3E). When structural data are available, our results demonstrate that “close” methods are significantly better in pose prediction than “cross” methods, while dock-cross has an upper limit of about 50% success rate using a single receptor structure.
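For reference, the pose accuracy measures used here can be sketched as follows. This is a minimal illustration that assumes predicted and crystal ligand atoms are listed in the same order, that both are already in the common receptor frame (so no superposition is applied), and that no symmetry correction is needed. The coordinates and RMSD values are hypothetical.

```python
import math


def ligand_rmsd(predicted, reference):
    """RMSD in angstroms between two equally ordered lists of (x, y, z) atom coordinates."""
    if len(predicted) != len(reference):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_p, atom_q in zip(predicted, reference)
        for p, q in zip(atom_p, atom_q)
    )
    return math.sqrt(squared / len(predicted))


def success_rate(rmsds, cutoff=2.0):
    """Fraction of poses with an RMSD below the cutoff (2 angstroms by default)."""
    return sum(r < cutoff for r in rmsds) / len(rmsds)


# Hypothetical three-atom ligand: crystal pose versus predicted pose.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
pose = [(0.3, 0.0, 0.0), (1.5, 0.4, 0.0), (1.5, 1.5, 0.0)]

rmsd = ligand_rmsd(pose, crystal)
rate = success_rate([0.3, 1.9, 2.4, 5.0])
print(round(rmsd, 3), rate)
```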

Phase 2: Affinity prediction results

The results of our predictions are summarized in Table 1. The dock-close (Spearman ρ=0.42, R²=0.26) and align-close (Spearman ρ=0.45, R²=0.24) methods have the best performance. The relative performance of the five methods is consistent between the training set and our submitted predictions (Fig. 4A). An interesting question is whether we were able to predict the optimal receptors for the “cross” methods. Our R² analysis correctly predicted an open structure (Fig. 2A) as the optimal receptor. However, in retrospect, we found that other open structures were marginally better; see Testing (best) in Fig. 4A. Thus, a receptor is only assumed to be “optimal” based on the data available. Overall, the relatively similar outcomes of “close” and “cross” methods suggest that our scoring function cannot account for the change in free energy associated with different receptor structures, and therefore ranking ligands to induced-fit targets is still limited.

Table 1.

Affinity ranking prediction results of HSP90 challenge.

Method	Phase ^a	Spearman ρ	Kendall Tau ^b
align-close	P1, P2	0.45	0.31
dock-close	P1, P2	0.42	0.29
align-cross	P1, P2	0.33	0.22
dock-cross	P1	0.37	0.25
align-close-scaffold	P1, P2	0.42	0.3
min-cross-pose	P2	0.26	0.18
align-close-pose	P2	0.37	0.26
align-close-filter	P1, P2	0.38	0.26
dock-close-filter	P1, P2	0.38	0.26
HSP90 score 1	P1	0.17	0.12
HSP90 score 2	P1	0.23	0.16
HSP90 score 3	P1	−0.01	−0.01
HSP90 score 4	P1	0.09	0.06
3DQSAR-align-pose	P2	0.18	0.13
3DQSAR-dock-pose	P2	0.24	0.16

^a P1 means this method was submitted for evaluation in the HSP90 Phase 1 challenge. P2 stands for Phase 2.

^b Spearman ρ and Kendall Tau are from the D3R result evaluation.
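For readers unfamiliar with the second metric in Table 1, the sketch below computes a Kendall rank correlation from pair counts. It assumes the tau-b variant, which corrects for ties; the exact variant used in the D3R evaluation is not stated here, and the scores and IC50 values are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b from concordant/discordant pair counts, with tie correction."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denominator


# Hypothetical predictions versus experiment for five compounds.
predicted_scores = [-9.8, -9.0, -8.1, -7.0, -6.2]
ic50_nM = [12.0, 150.0, 45.0, 3000.0, 900.0]

tau = kendall_tau_b(predicted_scores, ic50_nM)
print(round(tau, 2))
```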

Fig. 4. A. Prediction rates on: training set, testing set submitted prospectively, and testing (best) set reassessed retrospectively. Optimal receptors for align-cross, min-cross and dock-cross were (prospectively) 3OWD, 4BQJ, 3K98 and (retrospectively) 3T10, 3RLP, 3OWD, respectively. N: number of co-crystals, I: number of co-crystals with IC50 data, P: number of compounds for prediction. B. Results of variant methods: aligning to scaffold, to predicted pose, and using human expertise to eliminate non-binders. C. Distinguishing active from 33 inactive compounds using general methods, human discrimination, 3DQSAR, and special-purpose scoring functions to discriminate HSP90 ligands. The lower panel shows binding/non-binding AUC performances, and the upper panel shows the corresponding affinity ranking. D-F. Examples of binding poses of inactive compounds. D. Co-crystal of inactive compound 176 (4YKY). E. Co-crystal from PDB 3B26 (unknown IC50). F. Prediction for compound 110 (inactive).

Alignment is an area that can be improved particularly for large and/or low similarity compounds. Thus, we developed two variants of the above methods to test different structural alignments. First, we surmised that aligning to the scaffold would lead to better core interactions (Fig. 2C-2E). Although this was the case in our training data set, the opposite was observed for “-scaffold” methods in the testing set (Fig. 4B). In retrospect, we found that our method was able to improve some bad alignments, but it also eliminated some good ones. The latter was particularly true for benzophenone-like compounds whose structures are quite diverse. Second, using our models for the testing set as “predicted closest” templates also failed to improve the affinity ranking, observing only a minor “-pose” improvement for min-cross in the training set (Fig. 4B). The failure may come from the inadequacy of the force field to smoothly remove clashes upon minimization. In summary, neither aligning to scaffolds nor to predicted poses improved affinity ranking relative to aligning to the “closest” compound.

As a control of blind versus human predictions, we visually inspected all dock-close and align-close poses and predicted whether they were binders/active or non-binders/inactive (“-filter” methods in Table 1). Humbling as it is, the blind methods performed better than the subjective human-filtered scores. In hindsight, one problem is that a compound may bind and still be deemed inactive. For instance, HSP90-176 and HSP90-110 are both inactive compounds (Fig. 4D-F), yet compound 176 binds HSP90 (4YKY). The same holds for our binding model of HSP90-110 (Fig. 4F), which is based on a highly similar co-crystal (3B26 in Fig. 4E).

The 3DQSAR models performed poorly in affinity ranking (Table 1 and Fig. 4C). The major reason seems to be that the aminopyrimidine scaffold was not represented among the 69 compounds with IC50 data. Therefore, when the models and functions were applied to the testing set, they did poorly at scoring aminopyrimidine compounds.

HSP90 scores 1 and 2 were trained to better rank active compounds, and their predicted ranking was similar to that of other “cross” methods. However, these scoring functions showed a meaningful improvement in the discrimination between actives and inactives (Fig. 4C). On the other hand, HSP90 scores 3 and 4 were designed solely to distinguish actives from inactives. As expected, these functions performed poorly in affinity ranking. However, training on inactive compounds from the DUD-E database did not improve the discrimination of active compounds. In hindsight, we realized that the inactive compounds in the testing set had different scaffolds than the DUD-E decoy compounds. Thus, in all likelihood the observed discrimination might be close to random. These results show how dangerous it is to evaluate machine learning scoring functions without rigorous benchmarking. Overall, these results indicate that target-specific scoring functions and 3DQSAR models can do better at distinguishing active from inactive compounds than the default Vina scoring function used in the methods in Fig. 1.
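The active/inactive discrimination reported as AUC can be illustrated with the rank-based definition of the area under the ROC curve: the probability that a randomly chosen active receives a better score than a randomly chosen inactive. The sketch below assumes that lower scores are better and counts ties as one half; all score values are hypothetical.

```python
def roc_auc(active_scores, inactive_scores):
    """AUC as the fraction of (active, inactive) pairs in which the active scores lower."""
    wins = 0.0
    for a in active_scores:
        for d in inactive_scores:
            if a < d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Hypothetical scores (kcal/mol) for four actives and three inactives.
actives = [-9.5, -8.7, -8.0, -7.2]
inactives = [-8.9, -7.0, -6.5]

auc = roc_auc(actives, inactives)
print(auc)
```

An AUC near 0.5 corresponds to the close-to-random discrimination discussed above, while 1.0 would mean every active outscores every inactive.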