Author manuscript; available in PMC: 2019 Nov 26.
Published in final edited form as: J Chem Inf Model. 2018 Oct 16;58(11):2343–2354. doi: 10.1021/acs.jcim.8b00309

FINDSITEcomb2.0: A New Approach for Virtual Ligand Screening of Proteins and Virtual Target Screening of Biomolecules

Hongyi Zhou 1, Hongnan Cao 1, Jeffrey Skolnick 1,*
PMCID: PMC6437778  NIHMSID: NIHMS1009560  PMID: 30278128

Abstract

Computational approaches for predicting protein-ligand interactions can facilitate drug lead discovery and drug target determination. We have previously developed a threading/structural-based approach, FINDSITEcomb, for the virtual ligand screening of proteins that has been extensively experimentally validated. Even when low resolution predicted protein structures are employed, FINDSITEcomb has the advantage of being faster and more accurate than traditional high-resolution structure-based docking methods. It also overcomes the limitations of traditional QSAR methods that require a known set of seed ligands that bind to the given protein target. Here, we further improve FINDSITEcomb by enhancing its template ligand selection from the PDB/DrugBank/ChEMBL libraries of known protein-ligand interactions by: (1) parsing the template proteins and their corresponding binding ligands in the DrugBank and ChEMBL libraries into domains so that the ligands with falsely matched domains to the targets will not be selected as template ligands; (2) applying various thresholds to filter out falsely matched template structures in the structure comparison process and thus their corresponding ligands for template ligand selection. With a sequence identity cutoff of 30% of target to templates and modeled target structures, FINDSITEcomb2.0 is shown to significantly improve over FINDSITEcomb on the DUD-E benchmark set by improving the 1% enrichment factor from 16.7 to 22.1 on modeled structures, with a p-value of 4.3×10⁻³ by the Student t-test. With an 80% sequence identity cutoff of target to templates for the DUD-E set and modeled target structures, FINDSITEcomb2.0, with a 1% ROC enrichment factor of 52.39, also outperforms state-of-the-art methods that employ machine learning such as a deep convolutional neural network, CNN, with an enrichment of 29.65. Thus, FINDSITEcomb2.0 represents a significant improvement in the state-of-the-art.
The FINDSITEcomb2.0 web service is freely available for academic users at http://pwp.gatech.edu/cssb/FINDSITE-COMB-2.

INTRODUCTION

The goal of drug discovery is to find effective and safe drugs that bind to a given protein target1. Since drugs often have serious side effects that result in FDA disapproval or drug withdrawal, the ability to de-risk drugs could have enormous economic impact on pharmaceutical companies2. To find a drug and understand its side effects or to avoid serious side effects, the first step is to identify all potential protein targets of the drug. Experimental discovery of drugs binding to a given protein target and all of its off-target human proteins is both costly and time-consuming3. Virtual ligand screening (VLS) is a computational tool for predicting protein-ligand interactions that has been widely employed in modern drug discovery for lead identification4 and is more cost effective than experimental high-throughput ligand screening for proteins or protein target screening for drugs.

To date, there are three broad categories of virtual ligand screening approaches: (1) high-resolution structure-based docking methods5–8, (2) ligand-based QSAR methods9–11, and (3) threading/structure-based methods12–19. The advantage of high-resolution structure-based docking approaches is that they are based on physical principles and can potentially discover novel active binders. Their drawbacks are their dependence on high-resolution structures, which currently cover around 1/3 of all human proteins20, their computational expense, and their lower accuracy compared to ligand-based methods19. Although ligand-based methods have better accuracy and are much faster than docking methods, they require a prior set of ligands known to bind to the protein target of interest. For the majority of human protein targets, such known sets of ligands are unavailable21.

Threading/structure-based methods address both the lack of high-resolution structures required by docking approaches and the absence of known binders required by ligand-based approaches, while having efficiency and accuracy comparable to ligand-based approaches. Our recently developed FINDSITEcomb is a salient example of a threading/structure-based method that has been extensively experimentally validated19, 22. FINDSITEcomb has also been demonstrated to outperform state-of-the-art docking methods such as DOCK65 and AUTODOCK-vina8 and shows comparable accuracy when low resolution predicted or experimental structures are used.

Recently, convolutional neural network (CNN) and deep CNN technology has been applied to virtual ligand screening23, 24. The new CNN-based protein-ligand scoring boosts the accuracy of high-resolution structure-based docking methods to the level of ligand-based and threading/structure-based approaches.

Methods for ligand virtual screening can, in principle, be employed for virtual target screening (VTS) where the goal is to predict for a ligand/small molecule, all of its possible protein targets in a given exome20, 25. For high-resolution structure-based docking approaches, virtual target screening requires high-resolution structures of all proteins in a given exome. At present, this is not feasible even if computational cost were not an issue, which it is20. For ligand-based approaches, a prior set of known binding ligands for all proteins in a genome is also not known. However, both issues disappear in the threading/structure-based method FINDSITEcomb. In a recent work26, we applied FINDSITEcomb to predict possible human targets of all FDA approved drugs as well as experimental drugs from the DrugBank database27. FINDSITEcomb covers 97% of human proteins (i.e., there are modeled structures for at least one domain of 97% of the sequences in the human proteome). The predicted drug targets allowed us to infer drug side effects and to repurpose FDA approved drugs to treat rare diseases26.

To predict the interaction of a given protein target to a ligand, FINDSITEcomb first uses the target protein sequence and the threading method SP328 to find its homologous structures in the Protein Data Bank (PDB)29. If there is no user provided target protein structure, FINDSITEcomb predicts one using TASSER-VMT 30 from threading identified PDB template protein structures (we call the proteins in the structural database “template proteins”). Subsequently, to find potential binders (“template ligands”), the target structure or model is compared to: (a) ligand binding pockets from the subset of its homologous PDB template structures having ligand complexes from the PDB, and (b) pre-computed template modeled holo protein models of the proteins in the ChEMBL31 and DrugBank27 databases to find template ligands for the protein of interest. Then, using a similarity measure defined from a Tanimoto coefficient based fingerprint (mTC)32, the ligand in a screened library is compared to each set of template ligands from the PDB, ChEMBL and DrugBank. This mTC similarity score measures the likelihood that a given ligand binds to a particular protein.

In practice, the accuracy of FINDSITEcomb strongly depends on the accuracy of the selected template ligands, which serve the same role as known binders for the protein target in ligand-based approaches. Here, to improve performance, we focus on the elimination of falsely selected (viz., false positive) template ligands. In the current work, we demonstrate that the improved version, FINDSITEcomb2.0, is significantly better than the original version and outperforms the state-of-the-art deep learning CNN methods23, 24. For the 102-target DUD-E benchmark set33, using predicted protein structures whose allowed templates have a sequence identity to the target protein <30%, the average area under the receiver operating characteristic (ROC) curve (AUC) and top 1% enrichment factor improve from 0.745 and 16.74 for FINDSITEcomb to 0.784 and 22.06 for FINDSITEcomb2.0. On DUD-E, with the same 80% sequence identity cutoff of target to templates/training targets as used in the CNN scoring method23, the average AUC and ROC 1% enrichment factor of FINDSITEcomb2.0 using modeled target structures are 0.876 and 52.39, respectively, compared to 0.868 and 29.65 for the CNN scoring method23 using experimental structures. Also with the same 80% sequence identity cutoff and modeled structures, FINDSITEcomb2.0 has 59/102 (57.8%) targets with an AUC greater than 0.9, the same as the 59/102 (57.8%) targets reported for AtomNet, whose count also includes its training set24. However, in a more realistic comparison restricted to its 30-target testing set, AtomNet’s number is reduced to 14/30 or 46.7%24, whereas for a random subset of 30 targets, FINDSITEcomb2.0 has 16/30 (53.3%) targets with an AUC greater than 0.9. Using experimental structures and an 80% sequence identity cutoff for template ligand selection, FINDSITEcomb2.0 performs even better, with an average AUC and ROC 1% enrichment factor of 0.886 and 56.27, respectively, and 64 targets (62.7%) having an AUC greater than 0.9. Thus, by these measures, FINDSITEcomb2.0 is better than the most recent state-of-the-art alternative methods.

METHODS

Overview

The flowchart of FINDSITEcomb2.0 is shown in Figure 1. The differences between FINDSITEcomb and FINDSITEcomb2.0 are: 1) how the protein-ligand libraries of DrugBank and ChEMBL are utilized, and 2) the procedure for selecting template ligands. Given a protein target sequence or structure, both FINDSITEcomb and FINDSITEcomb2.0 employ the SP3 threading method28 to find homologous PDB template structures. Ligand binding pockets from a subset of these template structures having bound ligands are excised (PDB pockets). If the input is just the protein’s sequence, then TASSER-VMT30 is used to build a structural model of the protein target. Then, template ligands are extracted from template proteins in the PDB/DrugBank/ChEMBL libraries that have similar pockets/structures to the target. These template proteins are obtained by comparing the modeled/input structure of the target to: a) the above excised PDB pockets, b) pre-computed structure models of the domains of the templates in DrugBank, and c) pre-computed structure models of the domains of the templates in ChEMBL. Template ligands from each of the three databases (PDB, DrugBank & ChEMBL) will be compared with each molecule in a screening compound library and a similarity score used to rank the molecules. The three rankings are combined by using the largest score of each compound among the three rankings. Details concerning the selection of template ligands and improvement measures are provided in the next section.

Figure 1. Flowchart of FINDSITEcomb2.0.

Template ligands from the PDB

The selection of template ligands from the PDB and the prediction of binding sites are performed by the previously developed program FINDSITEfilt19. PDB structure templates identified by the SP3 threading algorithm28 as having a threading Z-score (score in standard deviation units relative to the mean of the structure template library) > 4.0 are collected. The threading Z-score cutoff of 4.0 is an empirical value that ensures that there are a sufficient number of ligand bound templates (in practice, about 100) even for hard targets identified by the SP3 method as having a threading Z-score < 4.5. Then, the model or input structure of the target is compared to the ligand binding pockets from a subset of the PDB templates having bound ligands to determine the potential ligand binding sites and bound ligands of the target using a structure-pocket alignment algorithm, which is described in detail in the original FINDSITEcomb method19. A ligand binding pocket structure consists of: (1) the Cα atoms of the template residues, any of whose backbone and/or side chain heavy atoms are within 4.5 Å of the bound ligand’s heavy atoms; (2) the Cα atoms of the template residues that are within 8 Å of the bound ligand’s heavy atoms. These Cα atoms are usually scattered along the protein’s sequence. We then apply the structure-pocket alignment algorithm to compare the target structure to each pocket and rank the pockets according to a similarity score.

In FINDSITEcomb 19, up to 100 ligands from top ranked pockets are selected as template ligands regardless of the alignment coverage of the target to the pockets. In this improved version, the number of template ligands will be optimized to eliminate falsely selected ligands (ligands that are less likely to bind to the targets). To do so, we shall apply a filter to the ligand binding pockets. We require that the alignment length between the structure of the target and template’s pocket must satisfy:

$$\frac{N_{ali}}{N_{ali}^{max}} > p_1 \quad \text{or} \quad \frac{N_{ali}}{N_{pocket}} > p_2 \qquad (1)$$

where Nali is the alignment length of the pockets, Nalimax is the maximal alignment length of all the pockets to the target pocket, Npocket is the number of Cα atoms of the given pocket, p1, p2 are two cutoffs. Here, empirical values p1=0.7, p2=0.5 are used.
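
As a minimal illustration of the filter in Eq. (1), the Python sketch below keeps a pocket when either ratio exceeds its cutoff. The function names and the tuple layout of the input are hypothetical and not taken from FINDSITEfilt; only the cutoffs p1=0.7 and p2=0.5 come from the text.

```python
def passes_pocket_filter(n_ali, n_ali_max, n_pocket, p1=0.7, p2=0.5):
    """Return True if a pocket alignment satisfies Eq. (1)."""
    return (n_ali / n_ali_max > p1) or (n_ali / n_pocket > p2)


def filter_pockets(pockets, p1=0.7, p2=0.5):
    """Keep the pockets that pass Eq. (1).

    pockets: list of (pocket_id, n_ali, n_pocket) tuples (assumed layout),
    where n_ali is the alignment length to the target and n_pocket is the
    number of C-alpha atoms in the pocket (assumed to be > 0).
    """
    if not pockets:
        return []
    # N_ali^max: the longest alignment over all candidate pockets.
    n_ali_max = max(n_ali for _, n_ali, _ in pockets)
    if n_ali_max == 0:
        return []
    return [
        pid
        for pid, n_ali, n_pocket in pockets
        if passes_pocket_filter(n_ali, n_ali_max, n_pocket, p1, p2)
    ]
```

For example, with pockets `("a", 20, 25)`, `("b", 10, 30)` and `("c", 6, 10)`, pocket `b` is dropped because both of its ratios (0.5 and 0.33) fall below the cutoffs, while `c` is kept through the second condition.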

Binding site prediction

To predict the binding sites of the target protein, we superimpose the above selected PDB template ligands onto the target structure using the rotational and translational matrix from the structure-pocket alignment. We then cluster the ligands according to their spatial proximity: (a) starting from the top ranked un-clustered ligand Li, going through all the other un-clustered ligands Lj, if the distance of the center of mass of heavy atoms between Li and Lj is less than 3 Å, then Lj is put into the cluster represented by Li; (b) repeat process (a) until all ligands are clustered.
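
The spatial clustering in steps (a) and (b) can be sketched as follows. This is an illustrative reading of the procedure, assuming each superimposed ligand has already been reduced to the center of mass of its heavy atoms and that the input list is ordered by pocket rank (best first); the function name is hypothetical.

```python
import math


def cluster_ligands_by_position(centers, cutoff=3.0):
    """Greedy spatial clustering of superimposed template ligands.

    centers: list of (x, y, z) centers of mass in angstroms, best rank first.
    Returns a list of clusters; each cluster is a list of ligand indices
    whose first element is the cluster representative.
    """
    unclustered = list(range(len(centers)))
    clusters = []
    while unclustered:
        rep = unclustered[0]  # top ranked ligand that is still un-clustered
        members = [
            j for j in unclustered
            if math.dist(centers[rep], centers[j]) < cutoff
        ]  # includes rep itself, since its distance to itself is 0
        clusters.append(members)
        unclustered = [j for j in unclustered if j not in members]
    return clusters
```

Because membership is decided only by the distance to the representative, two members of one cluster can be farther than 3 Å from each other; that follows from the greedy description rather than from any additional assumption.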

Next, we define a contact between a residue of the target protein and a ligand if the distance between any heavy atom of the residue and the ligand is less than rcut + rvdw(ligand atom) + rvdw(residue atom), where rcut is an empirical cutoff, and rvdw is the van der Waals radius of the heavy atom obtained from the CHARMM force field parameters34. A residue of the target is defined as a binding site specific to a cluster of ligands if the residue is in contact with > pcut fraction of ligands in that cluster. We optimized pcut = 0.34, rcut = 0.7 Å using the binding site prediction targets of CASP9. These values were then tested on the CASP10 targets and gave comparable or better performance to the state-of-the-art methods35, 36. The binding sites of each cluster of PDB template ligands define a predicted pocket of the target.
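
The contact and binding-site definitions above can be written compactly. The sketch below is only an illustration: the data layout (atoms as `(coordinates, radius)` pairs) and function names are assumptions, and any radii passed in are the caller's responsibility; the actual CHARMM values are not reproduced here.

```python
import math


def in_contact(residue_atoms, ligand_atoms, r_cut=0.7):
    """True if any residue/ligand heavy-atom pair is closer than
    r_cut + rvdw(ligand atom) + rvdw(residue atom).

    Each atom is ((x, y, z), rvdw) with distances in angstroms.
    """
    return any(
        math.dist(res_xyz, lig_xyz) < r_cut + res_r + lig_r
        for res_xyz, res_r in residue_atoms
        for lig_xyz, lig_r in ligand_atoms
    )


def binding_site_residues(residues, ligands, p_cut=0.34, r_cut=0.7):
    """Residues in contact with more than p_cut of the ligands in a cluster.

    residues: dict mapping residue id -> list of heavy atoms.
    ligands: list of ligands in one cluster, each a list of heavy atoms.
    """
    if not ligands:
        return []
    site = []
    for res_id, atoms in residues.items():
        n_contacts = sum(in_contact(atoms, lig, r_cut) for lig in ligands)
        if n_contacts / len(ligands) > p_cut:
            site.append(res_id)
    return site
```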

Parsing the template proteins into domains and clustering their ligands

While the PDB database has protein-ligand complex structures that allow us to determine the pocket a ligand binds to, the DrugBank and ChEMBL databases do not have the structures of the protein-ligand complex. In FINDSITEcomb 19, known binders of a protein target in the DrugBank and ChEMBL databases are assumed to bind to all domains of the protein because we did not have the specific binding position/pocket of the ligands. Based on this assumption, if an unknown target has a close similarity score to one domain of a library protein in the DrugBank or ChEMBL database, all ligands of that protein will be transferred to the unknown target as template ligands, even though they might not actually bind to the domain that is similar to the unknown target. Obviously, this assumption will result in some falsely selected template ligands.

Another issue associated with the FINDSITEcomb approach is that some of the protein targets in the ChEMBL database have too many known ligands (exceeding a few thousand). If all of them are used for template ligands, they slow down the algorithm and often provide little added value, as their information is redundant. To increase efficiency, we cluster the ligands using the similarity measure of Tanimoto Coefficient (TC) 37 defined by its Fingerprints38. Throughout this work, we use the FP2 fingerprint generated by the Open Babel (https://openbabel.org/wiki/Tutorial:Fingerprints) program with the default options enabled. We then use a variable number of TC cutoffs TCcut for clustering according to the total number of known ligands of the given target, Nlig:

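$$TC_{cut} = \begin{cases} 0.95 & \text{if } N_{lig} < 200 \\ 0.85 & \text{if } 200 \le N_{lig} \le 1000 \\ 0.80 & \text{if } 1000 < N_{lig} \le 2000 \\ 0.70 & \text{if } 2000 < N_{lig} \le 5000 \\ 0.60 & \text{if } N_{lig} > 5000 \end{cases} \qquad (2)$$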

A greedy algorithm is employed to cluster the ligands: (1) pairwise TCs are calculated; (2) for each ligand, we count the number of neighboring ligands having a TC > TCcut to it; (3) starting from the ligand with the largest number of neighbors, we use it as a cluster representative and its neighbors as members to define a cluster; (4) we remove the ligands belonging to the cluster and repeat steps (1)-(3) until all ligands are assigned to a cluster.
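
A minimal Python sketch of the cutoff schedule in Eq. (2) and the greedy clustering is given below. It assumes, for illustration only, that each fingerprint is represented as a set of on-bit indices; the actual FP2 fingerprints come from Open Babel and are not generated here. Function names and the tie-breaking rule (lowest index wins) are also assumptions.

```python
def tc_cutoff(n_lig):
    """TC cutoff of Eq. (2) as a function of the number of known ligands."""
    if n_lig < 200:
        return 0.95
    if n_lig <= 1000:
        return 0.85
    if n_lig <= 2000:
        return 0.80
    if n_lig <= 5000:
        return 0.70
    return 0.60


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def greedy_cluster(fps, tc_cut=None):
    """Greedy clustering; returns clusters as [representative, members...]."""
    if tc_cut is None:
        tc_cut = tc_cutoff(len(fps))
    n = len(fps)
    # Pairwise TCs, computed once and reused in every round.
    tc = {(i, j): tanimoto(fps[i], fps[j])
          for i in range(n) for j in range(i + 1, n)}
    remaining = set(range(n))
    clusters = []
    while remaining:
        order = sorted(remaining)
        neighbors = {
            i: [j for j in order
                if j != i and tc[(min(i, j), max(i, j))] > tc_cut]
            for i in order
        }
        # Ligand with the most neighbors; ties go to the lowest index.
        rep = max(order, key=lambda i: len(neighbors[i]))
        members = [rep] + neighbors[rep]
        clusters.append(members)
        remaining -= set(members)
    return clusters
```

Only the first element of each cluster (the representative) would be carried forward as a template ligand under this reading.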

To avoid false template ligands, we partition the above known representative ligands of a protein in the DrugBank and ChEMBL databases into their possible binding domains. We use the same DrugBank and ChEMBL datasets as utilized in the FINDSITEcomb approach19. The DrugBank database has 3833 protein targets and ChEMBL has 2453 protein targets. The overview of the ligand partitioning/target protein domain assignment procedure is shown in Figure 2. For each protein template in the DrugBank or ChEMBL database, we use PFAM39 to partition its sequence into domains. The FINDSITEfilt method is used to infer PDB template ligands and predict the binding pockets for each domain. For the construction of the pseudo holo template proteins (that is, template proteins where a ligand from ChEMBL or DrugBank is predicted to bind), 2407 out of the 3833 DrugBank templates and 2061 out of the 2453 ChEMBL templates have multiple domains. We then employ TASSER-VMT30 to model each database protein domain and run FINDSITEfilt on each predicted ChEMBL or DrugBank protein structure to assign where the ChEMBL or DrugBank ligand binds on the basis of the maximal TC (called maxiTC) score between the to-be-assigned ligand and all the PDB template ligands of the given domain. In rare cases (e.g., among the total 13,004 ligand assignments for DrugBank, only 582 or 4.5% have multiple domain assignments), the target has multiple identical or very close domains and the maxiTC scores are tied among these domains; the ligand is then assigned to all of the tied domains. In the end, for each database protein, we have structure models and the binding sites of its domains, with each domain assigned a subset (cluster representatives) of the known binding ligands of the protein.

Figure 2. Overview of ligand partitioning/domain assignment process for proteins in the DrugBank and ChEMBL databases.

Template ligands from DrugBank and ChEMBL

The structure model of an input protein target will be aligned to the pre-computed model of each protein domain in a database (DrugBank or ChEMBL) using a modified version of Fr-TM-align18, 40. First, Fr-TM-align is used to align the two structures, then, instead of using the TM-score41 that is purely based on a structural similarity measure of the two structures, we include a sequence similarity score based on evolutionary similarity. The output score is the summation of the BLOSUM62 substitution matrix 42 value over the aligned residues provided by Fr-TM-align and normalized by the target length. In other words, Fr-TM-align is used to build the equivalent sequence alignment, and then, BLOSUM62 is used to calculate the sequence alignment score (without gap penalties and normalized by target length). The alignment score will be used to rank database domains. We assume that the larger the score is, the closer the domain’s function is to the target.
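
The scoring step described above (sum of substitution-matrix values over structurally aligned residue pairs, no gap penalties, normalized by target length) can be sketched as follows. The alignment itself is assumed to come from an external structural aligner and is passed in as residue pairs; the function name and the dictionary layout of the matrix are hypothetical.

```python
def alignment_score(aligned_pairs, target_length, matrix):
    """Sequence score over a structure-derived alignment.

    aligned_pairs: list of (target_residue, template_residue) one-letter
        codes for the aligned positions only; unaligned positions are
        simply absent, so no gap penalty is applied.
    target_length: number of residues in the target (normalization).
    matrix: dict mapping residue pairs to substitution scores; a pair may
        be stored in either order.
    """
    total = 0
    for a, b in aligned_pairs:
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total / target_length
```

In practice the `matrix` argument would be the full BLOSUM62 table loaded from a trusted source. Normalizing by the target length rather than by the number of aligned pairs means that a short, well-matched alignment on a long target still receives a low score.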

In the original FINDSITEcomb approach, we use the ligands of the top first ranked protein target in the database as the template ligands of the input target, regardless of the closeness of structure similarity to its template. This simple implementation has some drawbacks. First, if the top first ranked template structure and the target structure are distant, their function might not be similar, and thus, the template ligands from the template structure likely do not bind to the target protein. Second, the target structure could have been aligned to a region of the template that does not bind to the selected ligands. Third, if two templates have very close similarity to the target (one of them is ranked first) but have very different known sets of ligands, a slight inaccuracy in the model of one template giving an unpredictable rank order switch could result in quite different template ligands. This gives an unstable result.

In FINDSITEcomb2.0, to improve performance, we made the following modifications to reduce falsely selected template domains and, consequently, template ligands:

  1. A global structure similarity threshold TMcut measured by the TM-score41 of the input target to the domains in DrugBank or ChEMBL is used. Structural alignment is done by the latest version of TM-align called Fr-TM-align40. This avoids the use of template ligands of likely unrelated structures.

  2. To ensure that the global alignment of the template domain to the target covers a reasonable part of the binding site of the template domain (functional part), we apply the following alignment overlap threshold: the structurally aligned region of the domain and the predicted binding region of the domain must have > OVcut overlap. This filter ensures that the target has structure similarity to the region of the template domain that binds to its known set of ligands.

  3. To increase tolerance to structure modeling inaccuracy, we select the top Ndm(≥1) rather than only top first ranked domains for template ligand inference.

The above thresholds of TM-score41 TMcut, overlap OVcut, top domains Ndm as well as the number of PDB pockets selected Npkt will be optimized. The selected template ligands will be employed for virtual ligand screening of the input target using Eq. (3).
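
One way to read the three modifications together is sketched below. Several details are assumptions made only for illustration: that the filters are applied before taking the top Ndm domains, that the TM-score test is inclusive (≥), that the overlap is expressed as a fraction, and the `DomainHit` record itself. The default values are the EF1-optimal parameters reported under "Optimization of thresholds".

```python
from typing import NamedTuple, Tuple


class DomainHit(NamedTuple):
    """Hypothetical record for one database domain aligned to the target."""
    domain_id: str
    score: float      # ranking score (sequence score over the alignment)
    tm_score: float   # global structural similarity to the target
    overlap: float    # fraction of overlap with the predicted binding region
    ligands: Tuple[str, ...]


def select_template_ligands(hits, tm_cut=0.6, ov_cut=0.8, n_dm=20):
    """Filter domains by TM-score and overlap, then pool the ligands of the
    top n_dm domains by ranking score (duplicates removed, order kept)."""
    passed = [h for h in hits if h.tm_score >= tm_cut and h.overlap > ov_cut]
    passed.sort(key=lambda h: h.score, reverse=True)
    ligands = []
    for hit in passed[:n_dm]:
        for lig in hit.ligands:
            if lig not in ligands:
                ligands.append(lig)
    return ligands
```

Pooling ligands from several top domains, rather than only the first, is what makes the result less sensitive to a rank swap between two near-identical templates.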

Ligand similarity comparison

Once template ligands are obtained by the above procedure, they are employed to search for actives of the input target in a compound library by the following similarity score:

$$mTC = w \, \frac{\sum_{l=1}^{N_{lg}} TC(L_l, L_{lib})}{N_{lg}} + (1-w) \max_{l \in (1,\ldots,N_{lg})} TC(L_l, L_{lib}) \qquad (3)$$

where TC is the Tanimoto Coefficient37; Nlg is the number of template ligands from the putative evolutionarily related proteins; Ll and Llib stand for the template ligand and the ligand in the compound library, respectively; w is a weight parameter. w=1 gives the average TC in the original FINDSITE screening score16. The second term is the maximal TC between a given compound and all the template ligands. Here, we empirically choose w=0.1 to give more weight to the second term so that when the template ligands are true ligands of the target, they will be favored.
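
Eq. (3) is simple enough to check by hand. The sketch below computes mTC from a list of precomputed TC values for one library compound, and also shows the combination rule stated in the Overview (take the largest score among the PDB, DrugBank and ChEMBL rankings). Function names are hypothetical, and returning 0.0 when a library yields no template ligands is an assumption.

```python
def mtc(tcs, w=0.1):
    """mTC of Eq. (3) for one library compound.

    tcs: TC values between the compound and each template ligand.
    """
    if not tcs:
        return 0.0  # assumption: no template ligands gives a zero score
    return w * sum(tcs) / len(tcs) + (1 - w) * max(tcs)


def combined_score(tcs_by_library, w=0.1):
    """Largest mTC over the template-ligand sets of the three libraries."""
    return max(mtc(tcs, w) for tcs in tcs_by_library.values())
```

For instance, with TC values of 1.0, 0.2 and 0.3 and w=0.1, the average term contributes 0.1 × 0.5 and the maximum term 0.9 × 1.0, giving mTC = 0.95; a compound identical to any one template ligand therefore scores at least 0.9.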

Assessment criteria

To assess the virtual ligand screening results, we employed the AUC (the area under the ROC curve) and the enrichment factor of the top x% of molecules defined by

$$EF_x = \frac{\text{Number of actives in the top } x\%}{x \, N_{active}/100} \qquad (4)$$

where Nactive is the total number of actives in the entire screened molecule set. Ideally, if all actives are within the top 1% of screened molecules, EF1 should be 100. However, for the DUD-E set33 the ratio of decoys/actives is on average ~60, so the actives make up about 1/61 ≈ 1.6% of all molecules, which is greater than 1%. Thus, the top 1% of screened molecules can only accommodate ~60% of the actives. As a consequence, the maximal possible EF1 will be only, on average, ~60. Other measures we employ are the precision and coverage:

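$$\text{precision} = \frac{\text{Number of actives with } mTC \ge \text{cutoff or within } \Delta mTC}{\text{Total number of molecules with } mTC \ge \text{cutoff or within } \Delta mTC} \qquad (5a)$$

$$\text{coverage} = \frac{\text{Number of actives with } mTC \ge \text{cutoff}}{\text{Total number of actives}} \qquad (5b)$$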
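
The assessment measures can be illustrated with a short Python sketch. The function names and input layouts are hypothetical; taking the top x% as the floor of x% of the list length is an assumption, and only the cutoff form of precision and coverage is implemented (not the ΔmTC window variant).

```python
def enrichment_factor(ranked_is_active, x=1.0):
    """EF_x of Eq. (4).

    ranked_is_active: list of booleans, best-scoring molecule first.
    """
    n_active = sum(ranked_is_active)
    n_top = int(len(ranked_is_active) * x / 100)  # size of the top x%
    hits = sum(ranked_is_active[:n_top])
    return hits / (x * n_active / 100)


def precision_coverage(scored, cutoff=0.9):
    """Precision and coverage at mTC >= cutoff.

    scored: list of (mTC, is_active) pairs for all screened molecules.
    """
    selected = [active for score, active in scored if score >= cutoff]
    n_active = sum(active for _, active in scored)
    precision = sum(selected) / len(selected) if selected else float("nan")
    coverage = sum(selected) / n_active if n_active else float("nan")
    return precision, coverage
```

With 100 actives and 6000 decoys (a 60:1 ratio), the top 1% holds 61 molecules, so even a perfect ranking yields `enrichment_factor(...) == 61.0` rather than 100, which is the ceiling discussed in the text.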

Optimization of thresholds

We performed a grid-based optimization of the four parameters on the DUD-E set33 with a 30% sequence identity cutoff. The grid search spaces are (0.2, 0.4, 0.6, 0.8) for TMcut, (20%, 40%, 60%, 80%) for the overlap cutoff, OVcut, (1, 5, 10, 20, 30) for Ndm and (50, 100, 200, 300) for Npkt, respectively. The parameters that give the best average AUC of 0.7878 are (TMcut, OVcut, Ndm, Npkt)=(0.6, 20%, 20, 100). The parameters for best average EF1 of 22.905 are (0.6, 80%, 20, 100) that will be used as the default for future applications. It is interesting to note that the parameters that have the worst average AUC of 0.7456 and worst EF1 of 13.70 are (0.8, 40%, 1, 300).
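
The search space above contains 4 × 4 × 5 × 4 = 320 parameter combinations. A generic exhaustive search over that grid might look like the sketch below; the `objective` callback (for example, a function returning the average EF1 over the benchmark for a given parameter tuple) is hypothetical and stands in for the full screening pipeline.

```python
from itertools import product

# Grid values as listed in the text (overlap written as a fraction).
TM_CUT = (0.2, 0.4, 0.6, 0.8)
OV_CUT = (0.2, 0.4, 0.6, 0.8)
N_DM = (1, 5, 10, 20, 30)
N_PKT = (50, 100, 200, 300)


def grid_search(objective):
    """Evaluate every (TMcut, OVcut, Ndm, Npkt) tuple and return the best.

    objective: callable mapping a parameter tuple to a score to maximize.
    Returns (best_parameters, number_of_combinations_tried).
    """
    grid = list(product(TM_CUT, OV_CUT, N_DM, N_PKT))
    best = max(grid, key=objective)
    return best, len(grid)
```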

Experimental Materials and Methods

Human erythrocyte carbonic anhydrase 1 hCA1 and porcine heart malate dehydrogenase pMDH were obtained from Sigma, USA. Pseudomonas aeruginosa aminoglycoside O-phosphotransferase APH(3”)-Ib (gb|ABK33456.1|) with an N-terminal His-tag followed by TEV protease cleavable linker was overexpressed in E. coli and purified using standard nickel affinity chromatography. The detailed experimental procedures are documented in the Supplementary Materials.

Thermofluor or thermal shift assays were performed using previously reported protocols22,43. Briefly, thermal denaturation of the target protein was performed in 96-well plates using a RealPlex quantitative PCR instrument from Eppendorf (Eppendorf, NY, USA). The system was precalibrated before each run. 20 μL of protein sample, at a fixed concentration in the range of 1–5 μM, was aliquoted into the bottom of the wells, incubated with the compounds of interest for 10–30 min at room temperature, and mixed with SYPRO orange (serially diluted to a final concentration of 5× from a 5000× stock solution). The same reaction buffer containing 50 mM HEPES pH 7.3, 100 mM NaCl was used to screen different compounds at a final concentration of 0.5 mM unless otherwise stated. A heating ramp of 1 °C/min from 20 °C to 95 °C was used, and one data point was acquired for each degree increment. The excitation and emission wavelengths were 465 and 580 nm, respectively. The compound/buffer control melting curve was subtracted from the protein containing samples. In this study, the compounds themselves typically yielded minimal fluorescence signal compared to the protein’s melting curve. The negative first derivative of relative fluorescence in arbitrary units, −d(FAU)/dt, was plotted against T with Excel software to determine the melting temperature Tm. Only one negative peak in the plot was observed for each sample, indicating a concerted melting of the enzyme following a quasi-two-state model. The thermal shift ΔTm = Tm (protein with drug) – Tm (protein). Each condition was run at least in duplicate and gave a standard deviation <1 °C, unless otherwise stated.
While a ΔTm > 1 °C (considering experimental errors) indicates binding, an appropriate estimation of the absolute thermodynamic properties, such as the traditional binding affinity Kd, requires a “titration” experiment, i.e., the measurement and fitting of a set of Tms from multiple melting curves under conditions of at least 2–3 different ligand concentrations22,44. An example of this is provided in the inset to Figure 5.

Figure 5. Thermal shifts of pMDH at different concentrations of Sunitinib either in the free base form or with malate. A malate control was also tested. The inset is the linear fitting of ln[ligand] vs. 4.184 × 1000 × (1/T0 − 1/Tm)/R, where T0 is the melting temperature of the apo protein in the absence of the ligand of interest and Tm is the melting temperature at a given ligand concentration. The ligand affinity to the protein Kd (μM) ≈ e^(y-intercept), an approximation under conditions of [ligand] >> [protein] during the thermal transition based on previously derived binding thermodynamics equations22,44.
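
The linear fit described in the Figure 5 caption can be sketched with an ordinary least-squares line. The following is an illustration under stated assumptions, not the authors' analysis script: temperatures are taken in kelvin, R is taken as 8.314 J/(mol·K), ligand concentrations are in μM so that the intercept gives Kd in μM, and ln[ligand] is treated as the y variable so that the y-intercept corresponds to Tm = T0. The function name is hypothetical.

```python
import math

R = 8.314  # gas constant in J/(mol K); the unit choice is an assumption


def fit_kd(ligand_conc_uM, tm_kelvin, t0_kelvin):
    """Least-squares fit of ln[ligand] vs 4.184*1000*(1/T0 - 1/Tm)/R.

    Returns (Kd in uM, slope), where Kd ~ exp(y-intercept).
    Requires at least two distinct Tm values.
    """
    xs = [4.184 * 1000 * (1 / t0_kelvin - 1 / tm) / R for tm in tm_kelvin]
    ys = [math.log(c) for c in ligand_conc_uM]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope
```

Under these unit assumptions, the factor 4.184 × 1000 converts kcal to J, so the slope would be read in kcal/mol. As the text notes, the fit needs melting temperatures at several ligand concentrations; a single ΔTm cannot determine Kd.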

A web service of FINDSITEcomb2.0 for VLS and VTS

We have implemented an online web service of FINDSITEcomb2.0 for VLS and VTS that is freely available for academic users at http://pwp.gatech.edu/cssb/FINDSITE-COMB-2. For VLS, the user inputs a protein sequence or PDB structure and selects a screening library of compounds from DrugBank, or the NCI diversity set or user selected molecules. The output will be the rank order of the screened molecules along with their predicted binding precision. For VTS, the user inputs the SMILES string45 of a single molecule. Currently, we only screen the molecule against the human exome. The output will be the rank order of all human proteins (including isoforms) that are predicted to bind the molecule. The precision of the binding prediction is also provided. Since the computation is not instantaneous, the user needs to supply a valid academic email address. Once the screening is completed, the user will receive an email and a link to download the results or the results will be attached to the email.

RESULTS

Comparison to FINDSITEcomb for virtual ligand screening

We tested FINDSITEcomb2.0 on the benchmarking set of the Directory of Useful Decoys, Enhanced (DUD-E) for virtual ligand screening33 and compared it to FINDSITEcomb. Since FINDSITEcomb2.0 has four free parameters to choose, in order to have a fair comparison, we employ a jack-knife test. For each testing target, we optimized its parameters on the DUD-E set by excluding the testing target and used EF1 as the objective function. DUD-E has 102 targets. On average, the ratio of the number of decoys to actives is around 60, with the average numbers of actives of ~224 and decoys of 13,696.

In practice, we merge the screening results of all targets and calculate a single precision and coverage for the whole dataset, whereas AUC and EFx are calculated for each target. In Table 1, we compare the performance of FINDSITEcomb2.0 to FINDSITEcomb with a sequence identity cutoff of 30%, i.e. no template with known binders from the PDB, DrugBank, or ChEMBL having a sequence identity > 30% to the input target is used for template ligand inference. We compare these approaches using both modeled structures generated by using TASSER-VMT with a 30% sequence identity cutoff for modeling templates and experimental structures. For modeled structures, FINDSITEcomb2.0 has a mean AUC and a mean top 1% enrichment factor of 0.785 and 22.06, respectively, as compared to 0.745 and 16.74 for FINDSITEcomb. FINDSITEcomb2.0 has 88 targets with better than random selection (EF1>1) compared to 81 for FINDSITEcomb. FINDSITEcomb2.0 is shown to significantly improve over FINDSITEcomb on the DUD-E benchmark set with a p-value of 6.0×10⁻⁴ for AUC and 4.3×10⁻³ for the 1% enrichment factor by the Student t-test. With an mTC cutoff of 0.9, FINDSITEcomb2.0 has a prediction precision of 83.8% whereas FINDSITEcomb has a prediction precision of 75.0%. In Figure 3, using modeled structures, we show the scatter plot of EF1 for the two compared methods. FINDSITEcomb2.0 has 59 targets having a larger EF1 than FINDSITEcomb, whereas FINDSITEcomb has only 28 targets with a larger EF1 than FINDSITEcomb2.0. Using experimental structures, both methods show a ~15% increase in EF1, but their relative performance does not change.

Table 1. Comparison between FINDSITEcomb2.0 and FINDSITEcomb on the DUD-E set

| Target structures | Method | AUC (a) | EF1 (b) | Precision (mTC≥0.9) | Coverage (mTC≥0.9) |
|---|---|---|---|---|---|
| Modeled | FINDSITEcomb | 0.745 (21) | 16.74 (81) | 75.0% | 3.35% |
| Modeled | FINDSITEcomb2.0 | 0.785 (30) | 22.06 (88) | 83.8% | 6.01% |
| Modeled | FINDSITEcomb2.0-nofilter | 0.755 (24) | 16.76 (89) | 71.7% | 3.06% |
| Experimental | FINDSITEcomb | 0.779 (26) | 19.15 (90) | 74.6% | 3.35% |
| Experimental | FINDSITEcomb2.0 | 0.811 (36) | 25.21 (93) | 83.3% | 6.27% |
| Experimental | FINDSITEcomb2.0-nofilter | 0.779 (30) | 19.73 (91) | 75.1% | 3.35% |

(a) Numbers in parentheses are the number of targets having an AUC > 0.9.

(b) Numbers in parentheses are the number of targets having an EF1 > 1.

Figure 3. For modeled target protein structures, scatter plot of EF1 between FINDSITEcomb and FINDSITEcomb2.0.

To examine the effects of the two major improvements of FINDSITEcomb2.0 over FINDSITEcomb, (1) parsing library templates into domains and (2) applying filters to template ligand selection, we tested an alternative approach called FINDSITEcomb2.0-nofilter that does not apply filters to template ligand selection. With modeled target structures, although on average FINDSITEcomb2.0-nofilter has only slightly better AUC and EF1 and slightly worse precision and coverage than FINDSITEcomb, it has more targets with better than random ligand selection (EF1>1) and more targets with an AUC > 0.9. FINDSITEcomb2.0-nofilter has a better EF1 than FINDSITEcomb for 48 targets, whereas FINDSITEcomb is better for 32 targets. With experimental target structures, FINDSITEcomb2.0-nofilter still performs slightly better than FINDSITEcomb, but it is significantly worse than FINDSITEcomb2.0. We also note that when only the PDB library is used for template ligands with modeled target structures, FINDSITEcomb2.0 (FINDSITEfilt2.0) has an average EF1 of 14.20 compared to 22.06 for the full FINDSITEcomb2.0. The number of targets having an EF1 > 1 is 76 for FINDSITEfilt2.0 and 88 for FINDSITEcomb2.0. This indicates that adding the templates from the DrugBank and ChEMBL libraries significantly improves the performance of FINDSITEcomb2.0.

Here, we show two examples of improved targets. For target Ada, using modeled structures, FINDSITEcomb2.0 has an EF1 of 40.09, whereas FINDSITEcomb has an EF1 of only 3.25 and FINDSITEcomb2.0-nofilter has an EF1 of 7.58. This indicates that the domain parsing of templates in the DrugBank and ChEMBL databases improves the EF1 from 3.25 to 7.58. With the full FINDSITEcomb2.0, which applies filters and increases the number of templates for the DrugBank and ChEMBL libraries, the EF1 greatly improves to 40.09. The performance increase in this case is due to the elimination of false positive ligand templates. Specifically, the number of template ligands is reduced from 100 to 10 in the PDB library, from 1 to 0 in DrugBank, and from 104 to 0 in ChEMBL.

Another example is target rock1. FINDSITEcomb has an EF1 of 0.9995, while FINDSITEcomb2.0-nofilter increases the EF1 to 2.999. More significantly, FINDSITEcomb2.0 further improves EF1 to 43.98. In this case, FINDSITEcomb2.0 does not filter out any PDB template ligands. Besides protein kinase C iota type that is selected by FINDSITEcomb, FINDSITEcomb2.0 selects these additional template targets for template ligands from DrugBank: Beta-adrenergic receptor kinase 2; Beta-adrenergic receptor kinase 1; 3-phosphoinositide-dependent protein kinase 1; MAP kinase-activated protein kinase 2; Serine/threonine-protein kinase 12; Calcium/calmodulin-dependent protein kinase type IV; Calcium/calmodulin-dependent protein kinase type 1D; Serine/threonine-protein kinase 17B. FINDSITEcomb2.0 also selects 9 more protein templates from the ChEMBL library than FINDSITEcomb.

Comparison to deep CNN methods for virtual ligand screening

Recently, two groups23, 24 have developed virtual ligand screening approaches using state-of-the-art deep convolutional neural networks46. Though they still require high resolution protein structures, their performance in virtual ligand screening is much better than that of traditional docking methods such as AUTODOCK-vina8. Thus, it is interesting to know how FINDSITEcomb2.0 compares to these deep CNN methods. AtomNet24 used 72 randomly selected DUD-E targets for training and the remaining 30 targets for testing. The CNN scoring method23 clustered the 102 DUD-E targets with an 80% sequence identity threshold and divided the clusters into 3 sets for a 3-fold cross-validation test. For a fair comparison, we apply an 80% sequence identity cutoff to templates for template ligand selection, i.e., ligands from any template in the PDB, DrugBank and ChEMBL having a sequence identity > 80% to the target are ignored. Furthermore, for each testing target, we only optimize its four free parameters on the DUD-E targets having a sequence identity < 80% to it. We tested FINDSITEcomb2.0 using both modeled structures (with a 30% sequence identity cutoff for modeling) and experimental structures. The CNN scoring approach23 used the ROC enrichment factor (ROC-EF) to assess its results. The ROC-EF is slightly different from Eq. (4) and is defined as

$$\mathrm{ROC\text{-}EF}_x = \frac{\text{Number of actives at } x\% \text{ false positive rate}}{\text{Number of decoys at } x\% \text{ false positive rate}} \qquad (6)$$
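As an illustration of Eq. (6), the sketch below walks down a ranked list until the allowed number of decoys (x% of all decoys) has been seen and compares the actives and decoys recovered up to that point. Eq. (6) is stated in terms of counts; the sketch divides each count by the corresponding total (true positive rate over false positive rate), a common convention that is an assumption here, not something confirmed by the text. The function name `roc_enrichment` is hypothetical.

```python
import math
from typing import Sequence


def roc_enrichment(scores: Sequence[float],
                   is_active: Sequence[bool],
                   fpr: float = 0.01) -> float:
    """ROC enrichment at a given false positive rate.

    Assumption: counts are normalized by the total numbers of actives and
    decoys, so the result is TPR / FPR. Ties in score are not treated
    specially; higher scores mean higher rank.
    """
    n_actives = sum(is_active)
    n_decoys = len(is_active) - n_actives
    if n_actives == 0 or n_decoys == 0:
        raise ValueError("need at least one active and one decoy")
    # Number of decoys that may be accepted at the requested FPR.
    max_decoys = math.floor(fpr * n_decoys + 1e-9)
    if max_decoys < 1:
        raise ValueError("too few decoys to resolve the requested FPR")
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0], reverse=True)
    actives_found = 0
    decoys_found = 0
    for _, active in ranked:
        if active:
            actives_found += 1
        else:
            if decoys_found == max_decoys:
                break  # the next decoy would exceed the allowed FPR
            decoys_found += 1
    true_positive_rate = actives_found / n_actives
    false_positive_rate = decoys_found / n_decoys
    return true_positive_rate / false_positive_rate


if __name__ == "__main__":
    # Toy target: 10 actives and 200 decoys; 4 actives rank before the 3rd decoy.
    labels = [True, True, False, True, False, True, False] + [False] * 197 + [True] * 6
    scores = [float(len(labels) - i) for i in range(len(labels))]
    print("ROC-EF1 =", roc_enrichment(scores, labels, 0.01))
```

Under this normalization the maximum attainable ROC-EF1 is 100, reached when all actives rank ahead of the first 1% of decoys.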

Table 2 compares the performance of FINDSITEcomb2.0 to deep CNN methods along with FINDSITEcomb and the traditional docking method AUTODOCK-vina8 on the DUD-E set. AtomNet has the best mean AUC of 0.895 only when it includes the training targets. Even though the deep CNN scoring method has a better AUC than FINDSITEcomb, it has a worse ROC-EF1. In practice, a better ROC-EF1 or EF1 is more useful because one needs only to test the top few percent of the ranked ligands. FINDSITEcomb2.0 with modeled structures has a better mean AUC of 0.876 and a better mean ROC-EF1 of 52.39 as compared to 0.868 and 29.65 of the CNN scoring method23. FINDSITEcomb2.0 also has more targets, 59(57.8%), with an AUC > 0.9 than the 49(48.0%) of the CNN scoring method23. For a randomly selected 30 target subset, FINDSITEcomb2.0 has a mean AUC of 0.880 and 16(53.3%) targets having an AUC > 0.9 whereas AtomNet has 0.855 and 14(46.7%), respectively24. The traditional docking method AUTODOCK-vina has the worst performance by all measures. With experimental structures, FINDSITEcomb2.0 performs even better, with a mean AUC of 0.886, ROC-EF1 of 56.27, and 64 or 62.7% of targets having an AUC>0.9. The ROC-EF1 difference between modeled and experimental structures is less than 10%. Thus, FINDSITEcomb2.0 outperforms the state-of-the-art deep learning based methods and offers the further advantage that it can use predicted as well as experimental structures.

Table 2. Comparison of FINDSITEcomb2.0 with CNN methods on the DUD-E set

Method | AUC | # of AUC > 0.9 (%) | ROC-EF1
AtomNet24 (DUD-E-30) | 0.855 | 14 (46.7%) | -
AtomNet24 (DUD-E-102) (a) | 0.895 | 59 (57.8%) | -
CNN scoring23 | 0.868 | 49 (48.0%) | 29.65
AUTODOCK-vina23 | 0.716 | 2 (2.0%) | 7.32
FINDSITEcomb | 0.829 | 39 (38.2%) | 37.26
FINDSITEcomb2.0 (modeled structures) | 0.876 | 59 (57.8%) | 52.39
FINDSITEcomb2.0 (modeled structures, DUD-E-30) (b) | 0.880 | 16 (53.3%) | 53.55
FINDSITEcomb2.0 (experimental structures) | 0.886 | 64 (62.7%) | 56.27

(a) Includes results for the 72 training protein targets.

(b) 30 random targets from the DUD-E set.

Comparison to ligand-based methods for virtual ligand screening

FINDSITEcomb2.0 distinguishes itself from ligand-based QSAR methods by not using any information about known binding ligands to the target protein. The only requirement from the target protein is the amino acid sequence. Since the DUD-E dataset favors ligand-based methods47, we employ an unbiased dataset called ULS/UDS (Unbiased Ligand Set / Unbiased Decoy Set), developed in Ref. 47, to compare FINDSITEcomb2.0 to ligand-based methods. The ULS/UDS set was designed not to be biased towards ligand-based virtual screening methods. It has 17 GPCR targets with an average of ~600 decoys and ~15 ligands. With a sequence identity cutoff of 80% (i.e., information from any template having a sequence identity > 80% to the target is not used in ligand homology modeling), FINDSITEcomb2.0 achieves an average ROC AUC of 0.862, compared to 0.675 for the FCFP_6 fingerprint-based method47. Even with a much stricter sequence identity cutoff of 30%, FINDSITEcomb2.0 still has an average AUC of 0.713. Thus, for an unbiased data set like ULS/UDS, we conclude that FINDSITEcomb2.0 performs better than ligand-based methods. Results for individual targets can be found in Table S2 in the Supplementary Materials.

Comparison to docking-based method for ligand diversity

A good performing ligand virtual screening method should not only have a good enrichment factor, but also show significant diversity of the top ranked ligands to increase the chance of finding new classes of ligand hits. Since FINDSITEcomb2.0 uses a similarity search as its last step, it will be informative to know if its recalled ligands within the top 1% of ranked molecules are more diverse than traditional docking methods. Hence, we compare FINDSITEcomb2.0 to AUTODOCK-vina8 on the DUD-E data set. We cluster the active ligands within the top 1% of ranked screened molecules using a TC cutoff of 0.8. The average number of clusters per protein of all of the ligands is 123.9. AUTODOCK-vina recalls on average 13.01 clusters within the top 1% whereas FINDSITEcomb2.0 recalls on average 48.98 clusters with an 80% sequence identity cutoff. Thus, in terms of recalled ligand diversity, FINDSITEcomb2.0 is much better than docking methods such as AUTODOCK-vina.
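The cluster counting described above can be sketched as follows. The text does not state which clustering algorithm or fingerprint was used, so this example assumes a simple greedy leader clustering on fingerprints represented as sets of "on" bit indices. The names `tanimoto` and `leader_clusters` are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_clusters(fingerprints: Sequence[FrozenSet[int]],
                    tc_cutoff: float = 0.8) -> List[List[int]]:
    """Greedy leader clustering (an assumed algorithm, not the paper's).

    Each molecule joins the first cluster whose leader it resembles with
    TC >= tc_cutoff; otherwise it starts a new cluster as its leader.
    Returns clusters as lists of indices into `fingerprints`.
    """
    leaders: List[FrozenSet[int]] = []
    clusters: List[List[int]] = []
    for index, fingerprint in enumerate(fingerprints):
        for k, leader in enumerate(leaders):
            if tanimoto(fingerprint, leader) >= tc_cutoff:
                clusters[k].append(index)
                break
        else:
            leaders.append(fingerprint)
            clusters.append([index])
    return clusters


if __name__ == "__main__":
    # Toy fingerprints for actives recalled in the top 1% of a ranked list.
    recalled = [
        frozenset({1, 2, 3, 4, 5}),
        frozenset({1, 2, 3, 4, 5, 6}),  # TC = 5/6 with the first: same cluster
        frozenset({10, 11, 12}),
        frozenset({1, 2, 3}),           # TC = 3/5 with the first: new cluster
    ]
    print("clusters recalled:", len(leader_clusters(recalled, tc_cutoff=0.8)))
```

Leader clustering depends on input order, so cluster counts from a different algorithm could differ somewhat from those obtained with this sketch.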

Comparison to FINDSITEcomb for virtual target screening

In our earlier work, we created a database and service called DR. PRODIS26 that provides predicted protein targets of drugs, drugs for a given protein target, and associated diseases and side effects of drugs. There we applied FINDSITEcomb to predict the binding targets and possible side effects of all small molecules in DrugBank26. In contrast to virtual ligand screening, virtual target screening screens a given molecule against the entire human genome. We have modeled and generated data for virtual target screening for 97% of all human protein sequences (including isoforms)26. Within the 97% of modeled sequences, 85.6% of sequences have at least one domain with a predicted TM-score to native > 0.4. We have shown in FINDSITEcomb that, on average, a target will have better than random VLS results if its predicted TM-score to native is > 0.4. Those modeled targets are ready to be used by FINDSITEcomb2.0 for virtual target screening. To the best of our knowledge, this is the first academic approach that could screen almost the entire human genome for drug target discovery. Other sequence-based methods for drug-target interaction prediction rely on knowledge of existing drug-target interactions, since their inference approaches require drug-drug and/or protein-protein similarity48–53. In addition, methods that require high resolution protein structures are limited by the availability of such structures as well as their significant computational expense20.

To test FINDSITEcomb2.0 in the context of virtual target screening, i.e., predicting the human targets of a given compound, we compiled a set of 540 small molecule drugs from DrugBank version 5.09 27 that are not present in the earlier version 3.0 employed by FINDSITEcomb2.0 and FINDSITEcomb as the training library. The results as assessed by the AUC and top 1% enrichment factor are compiled in Table 3. FINDSITEcomb2.0 achieves an average AUC=0.881 and EF1=59.96 compared to an AUC=0.851 and EF1=52.34, respectively, for FINDSITEcomb. Thus, FINDSITEcomb2.0 shows a significant improvement over FINDSITEcomb for virtual target screening, as indicated by p-values of 4.2×10^−6 and 1.7×10^−8 with a Student t-test for AUC and EF1, respectively. Again, we examined how FINDSITEcomb2.0-nofilter performs when only domain parsing of the DrugBank and ChEMBL libraries is employed. FINDSITEcomb2.0-nofilter has an AUC of 0.861 and an EF1 of 55.92, both an obvious improvement over FINDSITEcomb.

Table 3. Comparison between FINDSITEcomb2.0 and FINDSITEcomb for drug target prediction

Method | AUC | EF1
FINDSITEcomb | 0.851 | 52.34
FINDSITEcomb2.0 | 0.881 | 59.96
FINDSITEcomb2.0 (53 drugs) (a) | 0.866 | 59.88
FINDSITEcomb2.0-nofilter | 0.861 | 55.92
PROBEx (b) | 0.81 | 31.4

(a) Randomly selected from the 540 drug set.

(b) Tested on 53 drugs whose identities are not available from the literature.

Although there are no similar academic methods published, we did notice a commercially available method called PROBEx by CYCLICA (https://static1.squarespace.com/static/54b9178ae4b09cb81d821314/t/5872da0fc534a5d5acbf55d5/1483921941137/Cyclica_ValidationNote_ROKT.pdf). PROBEx was tested on 53 drugs from DrugBank for target prediction, with a resulting AUC=0.81 and EF1=31.4. Since we do not know which 53 drugs were assessed in their study, we evaluate 53 drugs randomly selected from among our 540 drugs. FINDSITEcomb2.0 has AUC=0.866 and EF1=59.88 for this 53 drug set. Thus, FINDSITEcomb2.0 performs significantly better than PROBEx.

Experimental validation of FINDSITEcomb2.0 predictions

Using the DUD-E set, we examine the dependence of the precision by FINDSITEcomb2.0 on the mTC score cutoff in Figure 4. Under the very stringent condition that no template has a sequence identity > 30% to the input target in the PDB, DrugBank and ChEMBL libraries, with an mTC cutoff of 0.9, the precision is 84%, whereas with an mTC cutoff of 0.8, it drops rapidly to 41%.

Figure 4. Dependence of protein-ligand interaction predicted cumulative precision based on the mTC cutoff (upper) and the precision with a given mTC within a bin size of 0.05 (lower) by FINDSITEcomb2.0 with a 30% sequence identity cutoff and modeled target structures. Data are derived from the 102 targets DUD-E set.
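The two quantities plotted in Figure 4 can be sketched as follows, assuming each prediction is a pair of an mTC score and a flag saying whether the molecule is a known active in the benchmark. This is an illustrative reading of the caption, not the authors' code. The names `cumulative_precision` and `binned_precision` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Prediction = Tuple[float, bool]  # (mTC score, is a true binder?)


def cumulative_precision(preds: List[Prediction], cutoff: float) -> Optional[float]:
    """Precision over all predictions with mTC >= cutoff (None if none qualify)."""
    selected = [is_binder for mtc, is_binder in preds if mtc >= cutoff]
    if not selected:
        return None
    return sum(selected) / len(selected)


def binned_precision(preds: List[Prediction],
                     bin_width: float = 0.05) -> Dict[float, float]:
    """Precision within each mTC bin, keyed by the bin's lower edge.

    Scores are binned in integer thousandths to avoid floating point
    surprises at bin boundaries.
    """
    width = int(round(bin_width * 1000))
    bins: Dict[int, List[bool]] = defaultdict(list)
    for mtc, is_binder in preds:
        bins[int(round(mtc * 1000)) // width].append(is_binder)
    return {
        round(index * bin_width, 3): sum(flags) / len(flags)
        for index, flags in sorted(bins.items())
    }


if __name__ == "__main__":
    # Toy predictions; real input would merge the results of all targets.
    toy = [(0.95, True), (0.92, True), (0.91, False),
           (0.86, True), (0.83, False), (0.81, False)]
    print(cumulative_precision(toy, 0.9))
    print(cumulative_precision(toy, 0.8))
    print(binned_precision(toy))
```

The cumulative curve answers "how reliable is everything above this cutoff", while the binned curve gives the expected precision of an individual prediction at a given mTC.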

Next, we experimentally tested three protein targets to assess the predicted benchmark precision of FINDSITEcomb2.0. The detailed experimental procedures are summarized in the Supplementary Materials. We applied FINDSITEcomb2.0 to ligand binding prediction for three protein targets: human erythrocyte carbonic anhydrase 1 hCA1 (Sigma, USA), porcine heart malate dehydrogenase pMDH (Sigma, USA), and Pseudomonas aeruginosa aminoglycoside O-phosphotransferase APH(3”)-Ib (in-house overexpressed and purified from E. coli), screening them against the molecules from the National Cancer Institute (NCI) diversity set (https://dtp.cancer.gov/organization/dscb/obtaining/available_plates.htm). A sequence identity cutoff of 30% for protein templates was applied for pMDH (sp|P11708|) and APH(3”)-Ib (WP|084929469.1|), and a control test with no cutoff was applied to hCA1 (sp|P00915|). The reason for using NCI molecules is that they are easy to obtain (http://dtp.nci.nih.gov/branches/dscb/repo_open.html). The NCI diversity set consists of 1597 molecules from Diversity Set III, 97 molecules from the Approved Oncology Drugs Set IV, and 118 molecules from the Natural Product Set II (1812 NCI molecules in total). Molecules with an mTC > 0.8 were selected for the Thermofluor assay, a common and sensitive fluorescence-based thermal shift experimental test of ligand binding to the protein target, using our previously reported protocols20,48. We found that the experimentally observed precision (overall hit rate of true binders, defined as compounds displaying a thermal shift over 1 °C at 0.5 mM final concentration) was generally consistent with the average expected precision of FINDSITEcomb2.0 predictions for the DUD-E benchmark set (see Table 4).

Table 4. List of thermal shift ΔTm values versus mTC scores of different proteins and drugs§

pMDH
Drug | ΔTm (°C) | mTC | Precision
Imatinib | 0 | 0.930 | 0.961
Sorafenib | 2* | 0.926 | 0.961
Sunitinib | 10* | 0.925 | 0.961
Lapatinib | n.d. | 0.925 | 0.961
Dasatinib | n.d. | 0.924 | 0.944
Erlotinib | 0 | 0.922 | 0.944
Pazopanib | −1 | 0.922 | 0.944
Fludarabine | 1.5* | 0.854 | 0.518
NSC151252 | 4.5* | 0.804 | 0.187
Gefitinib | 0.5 | 0.801 | 0.187
Precision Observed/Expected: 0.50/0.76

APH(3”)-Ib, empirical parameters
Drug | ΔTm (°C) | mTC | Precision
Lapatinib | n.d. | 0.927 | 0.961
Dasatinib | 1 | 0.926 | 0.961
Sunitinib | n.d. | 0.926 | 0.961
Erlotinib | −3 | 0.924 | 0.944
Pazopanib | −2 | 0.924 | 0.944
NSC36398 | 2* | 0.900 | 0.874
NSC76988 | 3* | 0.861 | 0.596
NSC34875 | 3* | 0.845 | 0.518
NSC26744 | 4* | 0.827 | 0.373
Gefitinib | −1 | 0.804 | 0.187
Precision Observed/Expected: 0.50/0.73

APH(3”)-Ib, optimized parameters
Drug | ΔTm (°C) | mTC | Precision
NSC36398 | 2* | 0.900 | 0.874
NSC34875 | 3* | 0.845 | 0.518
Precision Observed/Expected: 1.0/0.70

hCA1
Drug | ΔTm (°C) | mTC | Precision
Celecoxib | 2.5* | 0.921 | 0.962
NSC107679 | 3.5* | 0.861 | 0.694
NSC263220 | 2.5* | 0.837 | 0.540
NSC17129 | 3.5* | 0.828 | 0.456
NSC133195 | 0 | 0.818 | 0.371
Precision Observed/Expected#: 0.8/0.60

§ Experiment was done using predictions from an empirical choice of the four parameters before they were optimized. The predictions for hCA1 and APH(3”)-Ib are the same by both parameter sets; whereas the empirical parameterization yielded a significant number of false positives that are kinase inhibitors that are absent in the parameter set optimized on DUD-E.

n.d.: the fluorescence signal is significantly dampened or no melting transition is observed. These ligands are ignored in the calculation of the positive hit rate.

* The ΔTm values of true binders (ΔTm > 1 °C) are indicated in bold. Positive hit rate (%) = Ntrue binder / Ntotal tested with observable melting curves. It is 80% for hCA1, 50% for pMDH, and 50% for APH(3”)-Ib, respectively.

# The observed precision is calculated the same way as the experimental positive hit rate described above. The expected precision is calculated based on average precision statistics of FINDSITEcomb2.0 predictions for the 102 DUD-E benchmark targets at a specific template/target sequence identity cutoff and mTC score with a ΔmTC of 0.05. In the experimental validation sets, the mTC score cutoff was 0.8 for all compounds tested. The sequence identity cutoff was set to 30% for pMDH and APH(3”)-Ib, and no sequence identity cutoff was applied for hCA1 as a control set.
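As a worked example of the footnotes above, the sketch below recomputes the observed and expected precision for pMDH from the Table 4 rows. The observed value follows the stated hit-rate definition (n.d. compounds excluded). For the expected value, averaging the per-compound Precision column over all listed compounds reproduces the tabulated 0.76, but that this is exactly how the authors derived it is an inference from the arithmetic. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

# (drug, delta Tm in deg C or None for n.d., per-compound benchmark precision)
Row = Tuple[str, Optional[float], float]

PMDH_ROWS: List[Row] = [
    ("Imatinib", 0.0, 0.961),
    ("Sorafenib", 2.0, 0.961),
    ("Sunitinib", 10.0, 0.961),
    ("Lapatinib", None, 0.961),
    ("Dasatinib", None, 0.944),
    ("Erlotinib", 0.0, 0.944),
    ("Pazopanib", -1.0, 0.944),
    ("Fludarabine", 1.5, 0.518),
    ("NSC151252", 4.5, 0.187),
    ("Gefitinib", 0.5, 0.187),
]


def observed_precision(rows: List[Row], shift_threshold: float = 1.0) -> float:
    """True binders (delta Tm > threshold) over compounds with observable melting curves."""
    shifts = [shift for _, shift, _ in rows if shift is not None]
    if not shifts:
        raise ValueError("no compound has an observable melting curve")
    return sum(1 for shift in shifts if shift > shift_threshold) / len(shifts)


def expected_precision(rows: List[Row]) -> float:
    """Mean of the per-compound benchmark precision (assumed derivation)."""
    return sum(precision for _, _, precision in rows) / len(rows)


if __name__ == "__main__":
    print("observed:", round(observed_precision(PMDH_ROWS), 2))
    print("expected:", round(expected_precision(PMDH_ROWS), 2))
```

For pMDH, 4 of the 8 compounds with observable melting curves shift by more than 1 °C, giving the observed precision of 0.50 listed in the table.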

As shown in Table 4, for pMDH the observed and expected precision values are 0.5 and 0.76, respectively, whereas for APH(3”)-Ib they are 1.0 and 0.7, respectively. In particular, we identified the novel binding of the FDA-approved anticancer drug Sunitinib (Drug Bank ID: DB01268) to pMDH with an estimated Kd of ~5.4 μM based on the thermal shift assay (Figure 5). Further tests confirmed that the thermal shift was due to the receptor tyrosine kinase inhibitor component of Sunitinib (free base form) and not the malate ingredient in the Sunitinib malate drug formulation. The mode of potential off-target interaction of the anticancer drug Sunitinib with malate dehydrogenase needs to be elucidated. Interestingly, it has been recently reported that cytosolic malate dehydrogenase activity supports glycolysis in actively proliferating cells and cancer54. Kinase inhibitors are known to often show promiscuity and are an important repertoire for drug repurposing for the treatment of infectious diseases55,56.

Finally, the average observed precision for hCA1 is 0.8, as compared to the expected precision of 0.6. The two binders identified for APH(3”)-Ib belong to the flavonoid family, with ΔTm in the range of 2–4 °C (Table 4, Figure 6, Table S1). A BLAST search inferred that APH(3”)-Ib is a streptomycin kinase. We confirmed that APH(3”)-Ib shows substrate specificity toward streptomycin and is inactive with kanamycin or gentamycin (Figure S3). This is consistent with a previous study where the flavonol quercetin (NSC36398) inhibits a related kanamycin-active kinase, APH(2”)-IVa, by occupying the ATP binding site, as evidenced by its crystal structure57.

Figure 6. Chemical structures of different compounds/drugs tested in the thermal shift assay. 2D structures were directly obtained from PubChem, and the available CAS number is in parenthesis.

On the other hand, when no sequence identity cutoff was applied, hCA1 shows a relatively high observed precision of 80%, which is roughly consistent with the average expected precision of 70% (Table 4), with 4 out of the 5 compounds tested being true binders, all belonging to the sulfonamide family known to inhibit carbonic anhydrases. These ligands have a positive ΔTm in the range of 2.5–3.5 °C. The relatively high hit rate of the hCA1 control test, where no sequence identity cutoff was applied, is due to the existing crystal structures of hCA1 complexes with sulfonamide inhibitors available from the PDB that provide both protein and ligand templates. For example, Celecoxib (Drug Bank ID: DB00482), an FDA-approved non-steroidal anti-inflammatory drug (NSAID) known to inhibit prostaglandin-endoperoxide synthase COX-2 (but not COX-1) and hCA2 proteins according to Drug Bank (www.drugbank.ca), was identified to be a true binder for hCA1 (sequence identity between hCA1 and hCA2 is 60%). These results corroborate the predictive power of FINDSITEcomb2.0 as a VLS approach and its ability to accelerate experimental drug lead discovery by in silico ligand prescreening.

DISCUSSION

Previously, FINDSITEcomb was demonstrated to perform better than traditional docking methods for virtual ligand screening19 using the DUD benchmarking set58, and was also employed for drug target (virtual target screening, VTS) and side effect predictions for all DrugBank drugs26. To the best of our knowledge, FINDSITEcomb was the only method applied to the entire human proteome. Recently developed deep learning CNN scoring boosts the performance of high-resolution structure-based docking methods in terms of AUC and enrichment factor23, 24. Although CNN methods have a better AUC than FINDSITEcomb, they perform worse than FINDSITEcomb in terms of the enrichment factor. The reason could be that the AUC is mostly determined by actives ranking beyond the top 1–5%, whereas the enrichment factor is determined by actives ranking at the very top. Machine learning-based methods are good at ranking actives beyond the top 1–5% range, but as a practical matter this is not the most relevant region.

Here, based upon FINDSITEcomb, we developed the FINDSITEcomb2.0 approach that significantly improves over FINDSITEcomb in both virtual ligand and virtual target screening. Even when modeled structures are used, FINDSITEcomb2.0 performs better than deep CNN methods that use experimental target structures not only for the enrichment factor but also for their AUC. The improvement is mostly attributable to filters added to the template ligand selection. Parsing the template protein targets and ligands in the DrugBank and ChEMBL databases to domains also contributes to this improvement, especially for the number of targets that show better than random ligand selection. For three proteins and a small set (<10) of predicted binding ligands, we experimentally demonstrated significant agreement between the average observed precision of 60% and the average calculated expected precision of 70% of FINDSITEcomb2.0 prediction for protein-ligand interactions when mTC > 0.80 with a sequence identity cutoff of 30% (Table 4). Using Figure 4, we can predict the likely precision of a given set of ligands, thereby suggesting when it makes sense to experimentally test FINDSITEcomb2.0’s predictions. In other words, it is the first algorithm that can suggest under what conditions it makes sense to experimentally test the VTS or VLS predictions. Since only a handful of ligands (<50) need to be screened to identify novel hits and it can employ predicted as well as experimental protein structures, FINDSITEcomb2.0 is a very powerful VLS and VTS tool that can greatly assist in accelerating drug discovery and side effect predictions.

ACKNOWLEDGEMENTS

This work is supported by grant No. 1R35GM-118039 of the Division of General Medical Sciences of the National Institutes of Health. The authors would like to thank Dr. Bartosz Ilkowski for managing the cluster on which this work was conducted and Jessica Forness for assistance in preparation of the manuscript.

REFERENCES

  • 1.Settleman J; L.Cohen R , Communication in Drug Development: “Translating” Scientific Discovery. Cell 2016, 164, 1101–1104. [DOI] [PubMed] [Google Scholar]
  • 2.DiMasia JA; Hansenb RW; Grabowski HG, The Price of Innovation: New Estimates of Drug Development Costs. Journal of Health Economics 2003, 22, 151–185. [DOI] [PubMed] [Google Scholar]
  • 3.Macarron R; Banks MN; Bojanic D; Burns DJ; Cirovic DA; Garyantes T; Green DVS; Hertzberg RP; Janzen WP; Paslay JW; Schopfer U; Sittampalam GS, Impact of High-throughput Screening in Biomedical Research. Nat Rev Drug Discov. 2011, 10, 188–195. [DOI] [PubMed] [Google Scholar]
  • 4.Reddy AS; Pati SP; Kumar PP; Pradeep HN; Sastry GN, Virtual Screening in Drug Discovery – A Computational Perspective. Curr Protein Pept Sci. 2007, 8, 331–353. [DOI] [PubMed] [Google Scholar]
  • 5.Ewing TJA; Makino S; Skillman AG; Kuntz ID, DOCK 4.0: Search Strategies for Automated Molecular Docking of Flexible Molecule Databases. J. Comput-Aided Molec. Design 2001, 15, 411–428. [DOI] [PubMed] [Google Scholar]
  • 6.Friesner RA; Banks JL; Murphy RB; Halgren TA; Klicic JJ; Mainz DT; Repasky MP; Knoll EH; Shelley M; Perry JK; Shaw DE; Francis P; Shenkin PS, Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem 2004, 47, 1739–1749. [DOI] [PubMed] [Google Scholar]
  • 7.Abagyan R; Totrov M; Kuznetsov D, ICM - A New Method for Protein Modeling and Design: Applications to Docking and Structure Prediction from the Distorted Native Conformation. J. Comput. Chem 1994, 15, 488–506. [Google Scholar]
  • 8.Trott O; Olson AJ, AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization and Multithreading. J. Comput. Chem 2010, 31, 455–461. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Stahura F; Bajorath J, New Methodologies for Ligand-based Virtual Screening. Curr Pharm Des 2005, 11, 189–202. [DOI] [PubMed] [Google Scholar]
  • 10.Nikolova N; Jaworska J, Approaches to Measure Chemical Similarity – a Review. QSAR Comb. Sci 2003, 22, 1006–1026. [Google Scholar]
  • 11.Glen RC; Adams SE, Similarity Metrics and Descriptor Spaces - Which Combinations to Choose? QSAR Comb. Sci 2006, 25, 1133–1142. [Google Scholar]
  • 12.Brylinski M; Skolnick J, Q-Dock: Low-resolution Flexible Ligand docking with Pocket-specific Threading Restraints. J. Comput. Chem 2008, 29, 1574–1588. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Brylinski M; Skolnick J, Q-DockLHM: Low-resolution Refinement for Ligand Comparative modeling. J. Comput. Chem 2010, 31, 1093–1105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Lee HS; Zhang Y, BSP-SLIM: A Blind low-resolution Ligand-protein Docking Approach Using Predicted Protein Structures. Proteins 2011, 80, 93–110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Roy A; Zhang Y, Recognizing Protein-Ligand Binding Sites by Global Structural Alignment and Local Geometry Refinement. Structure 2012, 20, 987–997. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Brylinski M; Skolnick J, FINDSITE: A Threading-Based Method for Ligand-Binding Site Prediction and Functional Annotation. Proc. Natl. Acad. Sci. U.S.A 2008, 105, 129–134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Brylinski M; Skolnick J, FINDSITELHM: A Threading-Based Approach to Ligand Homology Modeling. PLoS Comput. Biol 2009, 5, e1000405. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Zhou H; Skolnick J, FINDSITEX: A Structure-Based, Small Molecule Virtual Screening Approach with Application to All Identified Human GPCRs. Mol. Biopharm 2012, 9, 1775–1784. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Zhou H; Skolnick J, FINDSITEcomb: A Threading/Structure-Based, Proteomic-Scale Virtual Ligand Screening Approach. J. Chem. Inf. Model 2013, 53, 230–240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Reardon S, Project Ranks Billions of Drug Interactions. Nature 2013, 503, 449. [DOI] [PubMed] [Google Scholar]
  • 21.Brylinski M; Skolnick J, Comprehensive Structural and Functional Characterization of the Human Kinome by Protein Structure Modeling and Ligand Virtual Screening. J. Chem. Inf. Model 2010, 50, 1839–1854. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Srinivasan B; Zhou H; Kubanek J; Skolnick J, Experimental Validation of FINDSITEcomb Virtual Ligand Screening Results for Eight Proteins Yields Novel Nanomolar and Picomolar Binders. J. Cheminf 2014, 6, 16–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Ragoza M; Hochuli J; Idrobo E; Sunseri J; Koes DR, Protein–Ligand Scoring with Convolutional Neural Networks. J. Chem. Inf. Model 2017, 57, 942–957. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Wallach I; Dzamba M; Heifets A, AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-Based Drug Discovery. arXiv preprint 2015, 1510.02855.
  • 25.Chen Y; Zhi D, Ligand-Protein Inverse Docking and Its Potential Use in the Computer Search of Protein Targets of a Small Molecule. Proteins 2001, 43, 217–226. [DOI] [PubMed] [Google Scholar]
  • 26.Zhou H; Gao M; Skolnick J, Comprehensive Prediction of Drug-Protein Interactions and Side Effects for the Human Proteome. Sci. Rep 2015, 5, 11090. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Wishart D; Knox C; Guo A; Shrivastava S; Hassanali M; Stothard P; Chang Z; Woolsey J, DrugBank: a Comprehensive Resource for in Silico Drug Discovery and Exploration. Nucl. Acid. Res 2006, 34, D668–D672. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Zhou H; Zhou Y, Fold Recognition by Combining Sequence Profiles Derived from Evolution and from Depth-Dependent Structural Alignment of Fragments. Proteins 2005, 321–328. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Bernstein FC; Koetzle TF; Williams GJB; Meyer EF, Jr. MDB; Rodgers JR; Kennard O; Shimanouchi T; Tasumi M, The Protein Data Bank: A Computer-Based Archival File for Macromolecular Structures. J. Mol. Biol 1977, 112, 535–542. [DOI] [PubMed] [Google Scholar]
  • 30.Zhou H; Skolnick J, Template-Based Protein Structure Modeling Using TASSERVMT. Proteins 2011, 80, 352–361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Gaulton A; Bellis L; Bento A; Chambers J; Davies M; Hersey A; Light Y; McGlinchey S; Michalovich D; Al-Lazikani B; Overington J, ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucl. Acid. Res 2012, 40, D1100–D1107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Willett P, Similarity-Based Virtual Screening Using 2D Fingerprints. Drug Discovery Today 2006, 11, 1046–1053. [DOI] [PubMed] [Google Scholar]
  • 33.Mysinger MM; Carchia M; Irwin JJ; Shoichet BK, Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem 2012, 55, 6582–6594. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Brooks B; Bruccoleri R; Olafson B; States D; Swaminathan S; Karplus M, CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J. Comput. Chem 1983, 4, 187–217. [Google Scholar]
  • 35.Cassarino TG; Bordoli L; Schwede T, Assessment of Ligand Binding Site Predictions in CASP10. Proteins 2014, 82, 154–163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Schmidt T; Haas J; Cassarino TG; Schwede T, Assessment of Ligand Binding Residue Predictions in CASP9. Proteins 2009, 77, 138–146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Tanimoto TT, An Elementary Mathematical Theory of Classification and Prediction. IBM Interanl Report 1958. [Google Scholar]
  • 38.Anonymous, In; Daylight Chemical Information Systems,Inc, Aliso Viejo, CA: 2007. [Google Scholar]
  • 39.Bateman A; Birney E; Cerruti L; Durbin R; Etwiller L; R.Eddy S; Griffiths-Jones S; Howe KL; Marshall M; Sonnhammer ELL, The Pfam Protein Families Database. Nucl. Acids Res 2002, 30, 276–280. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Pandit S; Skolnick J, Fr-TM-align: A New Protein Structural Alignment Method Based on Fragment Alignments and the TM-score. BMC Bioinf. 2008, 9, 531.
  • 41. Zhang Y; Skolnick J, A Scoring Function for the Automated Assessment of Protein Structure Template Quality. Proteins 2004, 57, 702–710.
  • 42. Henikoff S; Henikoff JG, Amino Acid Substitution Matrices from Protein Blocks. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 10915–10919.
  • 43. Cao H; Walton JD; Brumm P; Phillips GN Jr, Structure and Substrate Specificity of a Eukaryotic Fucosidase from Fusarium graminearum. J. Biol. Chem. 2014, 289, 25624–25638.
  • 44. Lo M; Aulabaugh A; Jin G; Cowling R; Bard J; Malamas M; Ellestad G, Evaluation of Fluorescence-Based Thermal Shift Assays for Hit Identification in Drug Discovery. Anal. Biochem. 2004, 332, 153–159.
  • 45. Weininger D; Weininger A; Weininger J, SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Model. 1989, 29, 97–101.
  • 46. LeCun Y; Bengio Y; Hinton G, Deep Learning. Nature 2015, 521, 436–444.
  • 47. Xia J; Jin H; Liu Z; Zhang L; Wang XS, An Unbiased Method To Build Benchmarking Sets for Ligand-Based Virtual Screening and its Application To GPCRs. J. Chem. Inf. Model. 2014, 54, 1433–1450.
  • 48. van Laarhoven T; Marchiori E, Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile. PLoS One 2013, 8, e66952.
  • 49. Lounkine E; Keiser M; Whitebread S; Mikhailov D; Hamon J; Jenkins JL; Lavan P; Weber E; Doak A; Cote S; Shoichet BK; Urban L, Large-Scale Prediction and Testing of Drug Activity on Side-Effect Targets. Nature 2012, 486, 361–368.
  • 50. Cheng F; Liu C; Jiang J; Lu W; Li W; Liu G; Zhou W; Huang J; Tang Y, Prediction of Drug-Target Interactions and Drug Repositioning via Network-Based Inference. PLoS Comput. Biol. 2012, 8, e1002503.
  • 51. Cobanoglu MC; Liu C; Hu F; Oltvai ZN; Bahar I, Predicting Drug–Target Interactions Using Probabilistic Matrix Factorization. J. Chem. Inf. Model. 2013, 53, 3393–3409.
  • 52. Yamanishi Y; Araki M; Gutteridge A; Honda W, Prediction of Drug-Target Interaction Networks from the Integration of Chemical and Genomic Spaces. Bioinformatics 2008, 24, i232–i240.
  • 53. Yamanishi Y; Pauwels E; Kotera M, Drug Side-Effect Prediction Based on the Integration of Chemical and Biological Spaces. J. Chem. Inf. Model. 2012, 52, 3284–3292.
  • 54. Hanse EA; Ruan C; Kachman M; Wang D; Lowman XH; Kelekar A, Cytosolic Malate Dehydrogenase Activity Helps Support Glycolysis in Actively Proliferating Cells and Cancer. Oncogene 2017, 36, 3915–3924.
  • 55. Walsh CT; Fischbach MA, Repurposing Libraries of Eukaryotic Protein Kinase Inhibitors for Antibiotic Discovery. Proc. Natl. Acad. Sci. U.S.A. 2009, 106, 1689–1690.
  • 56. Dichiara M; Marrazzo A; Prezzavento O; Collina S; Rescifina A; Amata E, Repurposing of Human Kinase Inhibitors in Neglected Protozoan Diseases. ChemMedChem 2017, 12, 1235–1253.
  • 57. Shakya T; Stogios PJ; Waglechner N; Evdokimova E; Ejim L; Blanchard JE; McArthur AG; Savchenko A; Wright GD, A Small Molecule Discrimination Map of the Antibiotic Resistance Kinome. Chem. Biol. 2011, 18, 1591–1601.
  • 58. Huang N; Shoichet B; Irwin J, Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801.
