A Novel Scoring Based Distributed Protein Docking Application to Improve Enrichment

Prachi Pradeep; Craig Struble; Terrence Neumann; Daniel S Sem; Stephen J Merrill

doi:10.1109/TCBB.2015.2401020

Author manuscript; available in PMC: 2016 Nov 1.

Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1464–1469. doi: 10.1109/TCBB.2015.2401020


PMCID: PMC4784258 NIHMSID: NIHMS745365 PMID: 26671816

Abstract

Molecular docking is a computational technique which predicts the binding energy and the preferred binding mode of a ligand to a protein target. Virtual screening is a tool which uses docking to investigate large chemical libraries to identify ligands that bind favorably to a protein target.

We have developed a novel scoring-based distributed protein docking application to improve enrichment in virtual screening. The application addresses the time and cost of screening, in contrast to conventional systematic parallel virtual screening methods, in two ways. First, it automates the process of creating and launching multiple independent dockings on a high performance computing cluster. Second, it uses a Naïve Bayes scoring function to calculate the binding energy of un-docked ligands in order to identify and preferentially dock better binders (as predicted by Autodock).

The application was tested on four proteins using a library of 10,573 ligands. In all the experiments, (i) 200 of the 1000 best binders are identified after docking only ∼14% of the chemical library, (ii) 9 or 10 best binders are identified after docking only ∼19% of the chemical library, and (iii) no significant enrichment is observed after docking ∼70% of the chemical library. The results show a significant increase in enrichment of potential drug leads in early rounds of virtual screening.

Index Terms: Virtual Screening, High Performance Computing, Distributed Protein Docking, HTCondor, Naïve Bayes, Scoring Function

1 Introduction

Modern drug discovery is a “lengthy, expensive, difficult, and inefficient process” with a low rate of new therapeutic discovery [1]. Currently, the research and development cost of each new molecular entity is approximately US $1.8 billion [2]. Conventional experimental procedures such as medicinal chemistry and high throughput screening (HTS) are still the most accurate methods for rapid identification of drug leads. However, there has been enormous growth in commercial and publicly available chemical structure libraries of potential drug compounds (ligands), such as the ZINC database [3], [4] (which contains over 21 million compounds), and these libraries require more efficient screening techniques. In this context, computational methods are now being used to enhance the drug development process [5], [6].

Molecular docking is a computational technique which predicts the interaction between a protein and a potential drug compound [7], [8], [9]. Virtual screening, the use of high-performance computing (HPC) clusters to analyze large databases of ligands, is a well established and cost-effective method for identifying possible drug leads against a target protein [10], [11], [12], [13]. Virtual screening utilizes docking to simulate protein-ligand interaction to prioritize potential ligands for experimental validation. There exist standard docking protocols like DOCK [14], AutoDock [15], [16], GOLD [17] and FlexX [18] which predict if a ligand is a good binder and a potential drug lead for a given target protein.

The independent nature of these docking simulations allows for the implementation of distributed protein docking, where a feasible number (N) of these docking processes can be run simultaneously on a computing cluster. Currently, there is a variety of software, including Docking@Home [19], DOVIS [20] and DockFlow [21], which automates the parallelization of the virtual screening process to scale to large chemical libraries. However, the selection of the N ligands docked at a time is systematic or pre-defined in nature. Consequently, the entire chemical library needs to be docked in order to find the best binders, which is time consuming for large chemical libraries.

To further reduce the time and cost of virtual screening, parallelization can be implemented using a mechanism to select potential binders from the remaining chemical library based on the docking results and the chemical nature of previously docked ligands. This set of potential binders needs a thoughtfully compiled sample of ligands for screening. These potential binders can then be queued for the docking process, preferentially over the others. Such an implementation eliminates the necessity of docking all the ligands in a chemical library, thereby optimizing the virtual screening process.

The pharmaceutical industry relies heavily on Christopher Lipinski's rule-of-five analysis for assessing whether a compound is likely to be bioavailable [22], [23]. The rule establishes that certain compound properties (viz. molecular weight, lipophilicity, number of hydrogen bond donors and acceptors), if below threshold values, are highly correlated with a drug having good bioavailability [24]. These properties are familiar to and routinely calculated by pharmaceutical researchers. We have, in our previous work, shown the utility of Lipinski properties as attributes in a neural network-based prediction of binding affinity, with an accuracy of 86% [25]. We therefore propose the use of Lipinski properties of ligands as attributes in a scoring function to predict the binding energy of un-docked ligands much more quickly than via full Autodock calculations, which are not viable for large datasets like ZINC.
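To make the four properties concrete, the following minimal Python sketch counts rule-of-five violations for a compound. The thresholds (molecular weight 500 Da, logP 5, 5 hydrogen bond donors, 10 hydrogen bond acceptors) are the commonly quoted rule-of-five limits; the class name, the function names, and the tolerance of one violation are illustrative assumptions and are not taken from the application described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LipinskiProperties:
    """The four properties used by the rule of five (names are hypothetical)."""
    molecular_weight: float  # Daltons
    logp: float              # lipophilicity, e.g. AlogP
    h_donors: int            # number of hydrogen bond donors
    h_acceptors: int         # number of hydrogen bond acceptors


def rule_of_five_violations(p):
    """Count how many of the commonly quoted thresholds are exceeded."""
    checks = [
        p.molecular_weight > 500,
        p.logp > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ]
    return sum(checks)


def is_likely_bioavailable(p, max_violations=1):
    """Assumed convention: tolerate at most one violation."""
    return rule_of_five_violations(p) <= max_violations


# Hypothetical property values, for illustration only.
small = LipinskiProperties(molecular_weight=320.0, logp=2.1, h_donors=2, h_acceptors=5)
large = LipinskiProperties(molecular_weight=780.0, logp=6.3, h_donors=7, h_acceptors=12)
print(rule_of_five_violations(small), is_likely_bioavailable(small))
print(rule_of_five_violations(large), is_likely_bioavailable(large))
```

In the application itself these properties are not used as a pass/fail filter but as attributes of the scoring function described in Section 2.3.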

In this article, we present an application which performs supervised learning in a Naïve Bayes analysis, using the binding energy of previously docked ligands and their similarity to un-docked ligands in terms of Lipinski properties, to predict the binding energy of un-docked ligands. We present the performance of this application on four receptor proteins using an in-house library of 10,573 ligands which we have used in our previous docking studies [26], [25].

2 Methods

2.1 Setup

2.1.1 System Architecture

The application was originally developed and deployed on Marquette University's Père Cluster. The cluster is composed of 128 nodes, each with 2 × quad-core Intel Xeon X5550 processors, for a total of 1024 cores. Each node has 24 GB of RAM; the nodes are connected by a DDR InfiniBand backbone (20 Gb/s) and run Red Hat Enterprise Linux 5.3.

2.1.2 Tools and Software

The scoring based virtual screening application was developed using several tools, modules, and a docking software. A list of all these resources is as follows:

HTCondor

It is a resource management and scheduling system for executing computation-intensive jobs by harnessing idle compute power [27], [28], [29]. DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for HTCondor jobs [30]. It is specially designed to provide a scheduling mechanism for jobs which depend on each other. DAGMan is a very handy tool for managing jobs that are components of a large workflow. In this work, HTCondor 7.4.4 for X86_64-LINUX RHEL5 is used as the resource scheduler.

AutoDock

It is a widely used open source software package for protein docking, which predicts how ligands bind to a pre-calculated docking area (grid) on the target protein [15]. These grids aid the physical description of the docking site and binding [16]. Lower predicted binding energy implies better ligand affinity. In this work, all dockings were performed using Autodock4.

MGLTools

The proteins and ligands used in this work were processed using MGLTools [31].

Python

Python 2.4.3 was used as the programming language for code development [32].

2.2 Implementation

The HTCondor DAGMan is used to implement incremental docking using an X-DAG structure as shown in Figure 1(a). With this kind of dependency, jobs B, C, D will not start until job A is completed; job E will not start until jobs B, C, and D are completed, and so on. Figure 1(b) shows the submission file for such a DAG. Each job is defined by the keyword JOB and the relationship between jobs is defined by repeated use of the keywords PARENT and CHILD [33]. Each CHILD job in the X-DAG represents a ligand-protein docking simulation and each PARENT job represents the execution of the scoring function.

Fig. 1 — Implementation of X-DAG architecture: (a) The X-DAG workflow: jobs B, C, D are launched when job A is completed. Job E is launched when jobs B, C, D are completed, and so on. (b) A sample HTCondor DAGMan input file for an X-DAG. Each job is defined by the keyword *JOB*, jobs at the nodes are defined by the keyword *PARENT*, and jobs at the leaves are defined by the keyword *CHILD*, illustrating the dependence relationship.
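As a rough illustration of the X-DAG layout described above, the following Python sketch generates DAGMan input text in which a scoring job fans out to the docking jobs of a round, which then fan in to the next scoring job. The JOB, VARS, and PARENT ... CHILD keywords are standard DAGMan syntax; the job names, the submit file names (score.sub, dock.sub, summarize.sub), and the ligand macro are hypothetical and not taken from the application. Unlike the application, which updates the DAG file dynamically after each round, this sketch assumes all rounds are known up front, and it is written for Python 3 rather than the Python 2.4.3 used in this work.

```python
def build_xdag(rounds, score_submit="score.sub", dock_submit="dock.sub",
               summarize_submit="summarize.sub"):
    """Return DAGMan input text for an X-DAG.

    rounds: list of non-empty lists of ligand IDs, one inner list per round.
    """
    jobs, edges = [], []
    for r, ligands in enumerate(rounds, start=1):
        # One scoring (PARENT) job per round.
        score_job = f"score_{r}"
        jobs.append(f"JOB {score_job} {score_submit}")

        # One docking (CHILD) job per ligand; VARS passes the ligand name
        # to a shared submit description file.
        dock_jobs = []
        for ligand in ligands:
            dock_job = f"dock_{r}_{ligand}"
            jobs.append(f"JOB {dock_job} {dock_submit}")
            jobs.append(f'VARS {dock_job} ligand="{ligand}"')
            dock_jobs.append(dock_job)

        children = " ".join(dock_jobs)
        next_job = f"score_{r + 1}" if r < len(rounds) else "summarize"
        edges.append(f"PARENT {score_job} CHILD {children}")  # fan out
        edges.append(f"PARENT {children} CHILD {next_job}")   # fan in

    # Final node that aggregates the results.
    jobs.append(f"JOB summarize {summarize_submit}")
    return "\n".join(jobs + edges) + "\n"


print(build_xdag([["lig1", "lig2", "lig3"], ["lig4", "lig5", "lig6"]]))
```

With three ligands per round, the first round reproduces the A to B, C, D to E pattern of Figure 1(a).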

Figure 2 shows the basic framework of our scoring based virtual screening application. The functionality can be broken down into three distinct steps:

Data Partitioning

The application takes as input a data file (.tar format) which consists of the pre-prepared ligands, proteins and the supporting files in a ready-to-dock format. A DAG submission file is created which contains the job definitions that determine the order in which the ligands are docked by the HTCondor worker nodes. N ligands are selected randomly for the first round of dockings to initiate the cycle.

Job Submission and Control

The result of each round of docking is sent back to HTCondor's central manager. At the end of each round of docking (i.e., at points A, E, I in Figure 1(a) and so on) the scoring function predicts the next N best binders, which are then dynamically written into the DAG file and docked in the next round. This process is repeated for K rounds such that all the ligands are docked. Figure 3 shows the implementation of the scoring function.

Aggregation of Results

Once all the ligands from the given chemical library have been docked and the results have been obtained by the central manager, the last step in the X-DAG is a summarize step implemented by HTCondor's post script. In this step, the ligands are sorted on the basis of their Autodock-predicted binding energy and the final output file is created. This step is useful in deciding how many rounds of docking are essential to discriminate the potential binders from non-binders.

Fig. 2 — System Architecture: Workflow of the scoring based distributed docking application. Multiple docking jobs are created and the dockings are implemented in an incremental fashion. The set of ligands for each round of docking is determined by the scoring function.

Fig. 3 — Implementation of the Naïve Bayes scoring function. N ligands are selected randomly for docking in round 1. The results of docking and the Lipinski properties of ligands are used to make a selection of the ligands to be docked in the next cycle. K rounds of docking are performed such that all the ligands in the library are docked for performance evaluation of the system.
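The incremental cycle summarized in Figures 2 and 3 can be sketched as a simple loop. This is a minimal single-process illustration, not the HTCondor implementation: the function name and the `dock` and `score` callables are hypothetical placeholders, where `dock` stands in for an Autodock run and `score` for the scoring function of Section 2.3 (assumed here to return a value per un-docked ligand in which lower means a better predicted binder).

```python
import math
import random


def run_incremental_docking(library, batch_size, dock, score, seed=0):
    """Dock a library in K = ceil(len(library) / batch_size) rounds.

    dock(ligand) -> binding energy of one ligand.
    score(energies, undocked) -> dict mapping each un-docked ligand to a
        predicted value, lower meaning a better predicted binder.
    Returns (rounds, energies): ligands docked per round, and all energies.
    """
    rng = random.Random(seed)
    undocked = list(library)
    energies = {}
    rounds = []
    n_rounds = math.ceil(len(undocked) / batch_size)

    for r in range(n_rounds):
        if r == 0:
            # Round 1: N ligands are selected randomly.
            batch = rng.sample(undocked, min(batch_size, len(undocked)))
        else:
            # Later rounds: dock the N best predicted binders first.
            predicted = score(energies, undocked)
            batch = sorted(undocked, key=lambda lig: predicted[lig])[:batch_size]

        for lig in batch:
            energies[lig] = dock(lig)

        chosen = set(batch)
        undocked = [lig for lig in undocked if lig not in chosen]
        rounds.append(batch)

    return rounds, energies


# Toy usage with stand-in functions (no real docking is performed).
rounds, energies = run_incremental_docking(
    library=range(10573),
    batch_size=500,
    dock=lambda lig: float(lig),
    score=lambda energies, undocked: {lig: lig for lig in undocked},
)
print(len(rounds), len(rounds[-1]))
```

For a library of 10,573 ligands and N = 500, the loop runs K = 22 rounds, the last of which contains the remaining 73 ligands.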

2.3 Naïve Bayes Scoring Function

Lipinski's Rule-of-Five establishes the importance of four physico-chemical properties that are correlated with a chemical compound having good bioavailability, viz. molecular weight, AlogP, number of hydrogen bond acceptors, and number of hydrogen bond donors [22]. The expected Autodock-predicted binding energy of an un-docked ligand is estimated from the calculated binding energies of the docked ligands, using the Lipinski properties of the ligands as attributes in a Naïve Bayes analysis.

Each Lipinski property is grouped into ten equal-sized bins based on the range of values of that property. For example, if the molecular weight of the ligands in the chemical library varies between 80 and 99 Daltons (range=20), then the bins would be 80-81, 82-83, and so on up to 98-99. Each ligand in the chemical library is assigned to a bin for each property. So a ligand, L, can be represented as a vector of four property bins, one for each Lipinski property, l_i (i = 1-4).
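The binning step can be sketched as follows. This is an illustrative Python version under the assumption that bins have equal width between the observed minimum and maximum of a property, with the maximum clamped into the last bin; the function names are hypothetical. For the integer molecular weights of the example above it reproduces the bins 80-81, 82-83, up to 98-99.

```python
def equal_width_bin(value, lo, hi, n_bins=10):
    """Return the 0-based index of the equal-width bin containing value.

    lo and hi are the minimum and maximum of the property in the library.
    Values at or beyond the ends are clamped into the first or last bin.
    """
    if hi == lo:
        return 0  # degenerate case: every ligand has the same value
    width = (hi - lo) / n_bins
    index = int((value - lo) / width)
    return min(max(index, 0), n_bins - 1)


def property_bins(properties, ranges, n_bins=10):
    """Map a ligand's property values to its vector of bins (l_1, ..., l_4).

    properties: sequence of property values for one ligand.
    ranges: sequence of (lo, hi) pairs, one per property.
    """
    return tuple(
        equal_width_bin(value, lo, hi, n_bins)
        for value, (lo, hi) in zip(properties, ranges)
    )


# Molecular weights from the example in the text: 80 to 99 Daltons.
print([equal_width_bin(mw, 80, 99) for mw in range(80, 100)])
```

The same helper can be reused with `n_bins=5` for the binding energy bins described next.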

Similarly, binding energy is grouped into five equal-sized bins based on the range of binding energy values for the N ligands docked in each round. Each docked ligand is then assigned to one of the five energy bins, E_k (k = 1-5). This information, along with the ligand properties, is then used to find the probability of an un-docked ligand (L) having a binding energy in each of the five bins. The probability of a ligand having a binding energy in a given bin can be calculated using Bayes' theorem:

P (E_{k} | L) = \frac{P (L | E_{k}) P (E_{k})}{P (L)}

(1)

where P(E_k|L) is the probability that the ligand will have a binding energy in bin E_k, P(L|E_k) is the probability of observing the ligand (i.e., its Lipinski property bins) given energy bin E_k, P(E_k) is the probability of energy bin E_k, and P(L) is the probability of occurrence of any ligand. Assuming that each ligand is equally likely to occur in a chemical library, P(L) is assigned a constant value. Also, P(E_k) is the ratio of the number of ligands in each energy bin to the number of ligands docked so far (N), which is a constant. So, P(E_k|L) can be re-written as:

P (E_{k} | L) \propto P (L | E_{k})

(2)

Making the naïve assumption of independence of ligand properties and representing the Lipinski bins for the docked ligands as ld_i (i = 1-4), P(L|E_k) can be written as a product of P_L(ld_i|E_k), the probability of a ligand having Lipinski property bin ld_i given that its binding energy is in energy bin E_k:

P (L | E_{k}) = \prod_{i = 1}^{4} P_{L} (l d_{i} | E_{k})

(3)

P_L(ld_i|E_k) is the ratio of the number of docked ligands in energy bin E_k whose Lipinski property bin ld_i matches that of the un-docked ligand (l_i) to the total number of docked ligands in energy bin E_k:

P_{L} (l d_{i} | E_{k}) = \frac{N_{(l d_{i} = l_{i}, E_{k})}}{N_{E_{k}}}

(4)

If the number of ligands docked or the number of ligands in an energy bin is zero, the above probabilities are calculated by introducing a smoothing factor λ = 0.1 such that the new probabilities are:

P_{L} (l d_{i} | E_{k}) = \frac{N_{(l d_{i} = l_{i}, E_{k})} + λ}{N_{E_{k}} + 4 λ}

(5)

P (E_{k}) = \frac{N_{E_{k}} + λ}{N + 4 λ}

(6)

Finally, the energy bin with the highest probability is the predicted binding energy (E_L) for each un-docked ligand:

E_{L} = \arg\max_{E_{k}} P (E_{k} | L)

(7)

Binding energy bins for all the un-docked ligands are estimated in the same way. The N ligands with the lowest predicted energies are then selected for docking in the next round.
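A compact Python sketch of Equations (3) to (7) is given below. It is an illustration under stated assumptions rather than the application's code: the function names are hypothetical, smoothing with λ = 0.1 is applied to every count rather than only to zero counts, the prior P(E_k) of Equation (6) is kept in the product (the standard Naïve Bayes form of Equation (1), whereas Equation (2) drops it), energy bin 0 is taken to be the lowest-energy bin, and ties are broken in favor of the lower-energy bin and then library order.

```python
from collections import Counter

SMOOTHING = 0.1  # lambda in Equations (5) and (6)


def predict_energy_bins(docked, undocked, n_energy_bins=5, lam=SMOOTHING):
    """Predict an energy bin for every un-docked ligand.

    docked: list of (property_bins, energy_bin) pairs for docked ligands.
    undocked: dict mapping ligand ID -> property_bins (tuple of four bins).
    Returns a dict mapping ligand ID -> predicted energy bin (0 = lowest energy).
    """
    n_docked = len(docked)
    n_in_bin = Counter(energy_bin for _, energy_bin in docked)

    # match[(i, b, k)]: docked ligands in energy bin k whose i-th property is in bin b.
    match = Counter()
    for props, energy_bin in docked:
        for i, b in enumerate(props):
            match[(i, b, energy_bin)] += 1

    predictions = {}
    for ligand, props in undocked.items():
        best_bin, best_score = None, -1.0
        for k in range(n_energy_bins):
            # Smoothed prior, Equation (6).
            score = (n_in_bin[k] + lam) / (n_docked + 4 * lam)
            # Product of smoothed per-property likelihoods, Equations (3) and (5).
            for i, b in enumerate(props):
                score *= (match[(i, b, k)] + lam) / (n_in_bin[k] + 4 * lam)
            if score > best_score:  # strict '>' keeps the lower bin on ties
                best_bin, best_score = k, score
        predictions[ligand] = best_bin  # Equation (7)
    return predictions


def select_next_batch(predictions, batch_size):
    """Pick the ligands with the lowest predicted energy bins (stable on ties)."""
    return sorted(predictions, key=predictions.get)[:batch_size]


# Toy data: two docked ligands in the best bin, two in the worst.
docked = [((0, 0, 0, 0), 0), ((0, 0, 0, 0), 0), ((9, 9, 9, 9), 4), ((9, 9, 9, 9), 4)]
undocked = {"x": (0, 0, 0, 0), "y": (9, 9, 9, 9)}
predictions = predict_energy_bins(docked, undocked)
print(predictions, select_next_batch(predictions, 1))
```

Because the predictions are bins rather than continuous energies, many ligands share the same predicted value, so the tie-breaking rule largely determines which ligands within a bin are docked first; the text does not specify this rule.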

2.4 Datasets

Proteins

Four experiments on three distinct proteins were performed to evaluate the performance of the application. The first set of proteins consists of Dihydrofolate Reductase (DHFR) [PDB:1DF7] and Dihydrodipicolinate reductase (DHPR) [PDB:1C3V], which are targets for the disease Tuberculosis [33] [34]. The other protein drug target is Human Dual Specificity Phosphatase 5 (DUSP5), an enzyme in humans encoded by the DUSP5 gene [35], [36]. The DUSP5 protein has two domains, each of which participates in ligand binding. For this study, each of these domains was tested individually; they are referred to as DUSP5C [PDB:2G6Z] and DUSP5R. The protein structure for DUSP5R was based on a homology model for a related mitogen-activated protein kinase phosphatase, MDP-3. The crystal structures of the proteins were obtained from the Protein Data Bank [37]. The details are included in supplemental file 2.

Chemical Library

An in-house physical collection of 10,573 chemical ligands in the Center for Structure-based Drug Design and Development (CSD3) was used [38], [26], [25]. The library contains drug-like molecules, selected on the basis of their predicted binding to dehydrogenases and kinases, a general compliance with the Lipinski Rule of Five, and other drug-like filters. The ligands were converted into a ready-to-dock format using Autodock tools [39]. The applicability and performance of this chemical library have been demonstrated in our previous docking studies [26], [25]. The steps for the preparation of the files for the experiment are provided in supplemental file 1.

2.5 Performance Metrics

Autodock computes the possible interaction points in the binding site of the protein and then docks each ligand to a protein target, allowing the ligand to adopt many different conformations or poses. The docking simulation output consists of clusters of similar poses and the calculated binding energy for each docking pose within each cluster. For our experiments, we choose the cluster with the highest number of poses and then select the pose with the lowest predicted energy as the most favorable pose. The binding energy of this pose is used as the final predicted binding energy for a particular protein-ligand complex and is defined as the binding energy of the ligand. A lower binding energy of a protein-ligand complex indicates a higher binding affinity: the lower the binding energy, the more stable the complex.
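The pose selection rule above can be written in a few lines of Python. The sketch assumes the cluster energies have already been extracted from the docking output into plain lists; it does not parse Autodock output files, the function name is hypothetical, and when two clusters are equally large it simply takes the first one, which is an assumption not stated in the text.

```python
def select_binding_energy(clusters):
    """Return the binding energy assigned to a ligand.

    clusters: non-empty list of non-empty lists, each inner list holding the
        binding energies of the poses in one cluster.
    Rule: take the most populated cluster, then its lowest-energy pose.
    """
    largest_cluster = max(clusters, key=len)  # first one wins on ties
    return min(largest_cluster)


# Hypothetical energies for three clusters of poses.
example_clusters = [[-5.1, -4.9], [-7.2], [-6.0, -6.3, -5.8]]
print(select_binding_energy(example_clusters))
```

Note that in this toy example the reported energy (-6.3) is not the global minimum (-7.2), because the lowest-energy pose sits in a cluster with a single member.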

The objective of introducing the scoring function in the HPC framework is to enable the identification of better binders, allowing for enrichment in early rounds of docking. To evaluate the performance of the application, all the ligands were docked against the target proteins and the 1000 best binding ligands were identified for each protein based on their binding energy. Since the predictions are not binary in nature (i.e., a strong binder and not a strong binder), we do not measure performance in terms of ROC and AUC. We rank the ligands based on their predicted binding energy and measure the performance of the algorithm in terms of simpler and more commonly used metrics: average energy, ligand enrichment and cumulative ligand enrichment. Average energy is the average binding energy (as described above) of all the ligands docked in a round, ligand enrichment is the concentration of ligands with low binding energies in each round compared to their concentration throughout the docking cycle, and cumulative ligand enrichment is the concentration of ligands with low binding energies up to each round compared to their concentration throughout the docking cycle [9], [40], [41], [42]. We evaluated the performance of the application based on the enrichment observed for the 1000 best binders in each round of incremental docking. The metrics are calculated as:

Average Energy = \frac{\sum E_{i}}{N_{R}}

(8)

Ligand Enrichment = N_{R 1000}

(9)

Cumulative Ligand Enrichment = N_{T 1000}

(10)

where E_i is the binding energy of the i-th ligand docked in a round, N_R is the number of ligands docked in a round, N_{R1000} is the number of the 1000 best binders docked in a round, and N_{T1000} is the number of the 1000 best binders docked so far.
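Equations (8) to (10) can be computed from the per-round docking results as in the following Python sketch. The function name and data layout are illustrative assumptions: `rounds` lists the ligands docked in each round and `energies` maps every ligand to its final binding energy.

```python
def round_metrics(rounds, energies, top_n=1000):
    """Compute the three metrics of Equations (8)-(10) for every round.

    rounds: list of lists of ligand IDs, in the order they were docked.
    energies: dict mapping ligand ID -> binding energy (all ligands docked).
    Returns one (average_energy, ligand_enrichment, cumulative_enrichment)
    tuple per round.
    """
    # The top_n best binders are the ligands with the lowest binding energies.
    ranked = sorted(energies, key=energies.get)
    best_binders = set(ranked[:top_n])

    results = []
    cumulative = 0
    for batch in rounds:
        average_energy = sum(energies[lig] for lig in batch) / len(batch)  # Eq. (8)
        in_round = sum(1 for lig in batch if lig in best_binders)          # Eq. (9)
        cumulative += in_round                                             # Eq. (10)
        results.append((average_energy, in_round, cumulative))
    return results


# Toy example with four ligands, two rounds, and the 2 best binders tracked.
toy_energies = {"a": -9.0, "b": -8.0, "c": -3.0, "d": -2.0}
toy_rounds = [["a", "c"], ["b", "d"]]
print(round_metrics(toy_rounds, toy_energies, top_n=2))
```

In the toy example each round contains one of the two best binders, so the cumulative enrichment rises from 1 to 2.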

3 Results

500 ligands were docked in each round of incremental docking, for a total of 22 rounds. If the ligands docked in each round were selected at random, each round would be expected to contain about 50 of the 1000 best binding ligands (500 × 1000/10,573 ≈ 47).

Figure 4 shows the average energy of all ligands docked in a round plotted against round number. Average energy decreased substantially after the very first round of Bayesian selection of potential binders. For all four proteins, (i) average energy in round 2 is the lowest across all rounds, and (ii) average energy starts to increase after an initial decrease and attains a value higher than in round 1 towards the end of the virtual screening experiments. These results demonstrate that the earlier rounds are enriched with lower binding energy ligands (better binders) and that later rounds have fewer potential binders.

Fig. 4 — Plot between average binding energy of a round and round number. Lower binding energy indicates better protein-ligand complexes. Rounds 2 and 3 show a significant drop in average energy for all four proteins indicating significant enrichment in better binding ligands.

To verify that the binding actually occurs in the naturally occurring binding site, a visual representation of the protein-ligand complex is generated using Pymol [43]. Figure 5 shows the predicted docking complex of the four proteins with the ligand with the lowest binding energy (best binder) in the second round of docking. During docking preparation, co-crystallized ligands were removed from the PDB coordinates, allowing ligands to occupy the known binding sites. It is observed that the ligand (represented by a stick model) is placed in the natural substrate binding site of the protein molecules, as seen in the inset images. For DHFR and DHPR, the inhibitors were predicted to bind in the NADP⁺ pocket of each enzyme (see Supp. Fig. 1). The inhibitor for DUSP5C was predicted to bind adjacent to the known active nucleophile of the protein, Cys-263 (see Supp. Fig. 2). This structure is similar to a group of molecules recently published [44]. As DUSP5R is based on a homology model, further investigation to probe the binding site of this regulatory domain is required.

Fig. 5 — Focused image of the protein-ligand complex at the binding site. The inset shows full image of the protein-ligand complex.

Figures 6(a) and 6(b) show ligand enrichment and cumulative ligand enrichment for each round, respectively. A sharp spike at rounds 2 and 3 in the ligand enrichment plot shows the effectiveness of the scoring function in identifying potential binders. The steep slope of the curve in the cumulative ligand enrichment plot between rounds 2 and 3 is indicative of a higher rate of enrichment in earlier rounds. For all proteins, (i) 200 of the 1000 best binding ligands are identified by docking only 14% of the chemical library, (ii) 500 of the 1000 best binding ligands are identified by docking only 28% of the chemical library, (iii) there is no significant gain in enrichment after docking almost 70% of the chemical library, and (iv) 9 or 10 best binding ligands are identified after docking 19% of the chemical library.

Fig. 6 — Enrichment observed in terms of the 1000 best binders in each round of incremental docking.

In all four experiments, the enrichment of the 1000 best binders docked in rounds 2 and 3 was statistically significant (p-value < 0.001). The results suggest that virtual screening for any protein can be considered complete after round 5 by docking only 30% of the chemical library. These results demonstrate that our approach has a selective preference for better binding ligands and provides better enrichment than a completely random parallel virtual screening application. Additionally, it offers time and cost benefits by reducing the need to dock the entire chemical library to identify potential Autodock-predicted drug leads.

4 Conclusion

Virtual screening is a computational method to identify potential drug leads from a large chemical library. Ligand enrichment thus forms the essence of virtual screening. High performance computing clusters strengthen the capabilities of the virtual screening process through further savings in time and cost. However, the growing size of available chemical libraries and the aspiration for an exhaustive search for a potential drug from the entire virtual chemical pool necessitate a new methodology that allows faster discrimination of binders from non-binders.

We present an optimization to the virtual screening workflow that allows a large throughput of results on a smaller time scale. We have implemented a Naïve Bayes scoring function which performs supervised learning using Lipinski properties and binding energy of ligands for a given target protein to predict the binding energy of unknown ligands. The application harnesses HTCondor's capability as a resource management system to automatically schedule parallel and distributed protein dockings. The results of using this application to isolate binders for four target proteins suggest that potential drug leads can be isolated by examining no more than 30% of the chemicals in a large chemical library, removing the need to investigate the entire chemical library.

The application is compatible with and can be deployed on different computing clusters with slight or no modification. We have tested the performance of the application after porting and integration with other available grids like BOINC [45] and Tera-Grid [46]. The application can also be implemented on a commercial cloud. We have successfully migrated it to the Amazon EC2 cloud to assess the feasibility of such an implementation. A detailed performance analysis and comparison was also done to validate against local high performance computing resources. It was found that the application can be implemented on the cloud as there is no overhead required to set up an in-house cluster or grid, software requirements are inexpensive or free, and computing time is rapid based on the number of resources purchased on the cloud.

However, the performance of this framework at the scale of ZINC is still to be tested. It remains to be seen whether the enrichment and scaling obtained in this study can be maintained at the size of ZINC. Nonetheless, the early results are promising, and further investigations are needed to see if this approach scales appropriately.

Supplementary Material

Supplement 1: NIHMS745365-supplement-Supplement1.pdf (49.2 KB, PDF)

Supplement 2: NIHMS745365-supplement-Supplement2.pdf (4.7 MB, PDF)

Acknowledgments

The experiments were performed on the Père cluster funded by National Science Foundation awards OCI-0923037 “MRI: Acquisition of a Parallel Computing Cluster and Storage for the Marquette University Grid (MUGrid)” and CBET-0521602 “Acquisition of a Linux Cluster to Support College-Wide Research & Teaching Activities.” DS is partly supported by NIH grants AI101975 and HL112639.

Contributor Information

Prachi Pradeep, Department of Mathematics, Statistics, and Computer Science, Marquette University, WI, USA.

Craig Struble, Aria Diagnostics, Inc., San Jose, CA, USA.

Terrence Neumann, Department of Chemistry and Biochemistry, Texas Wesleyan University, TX, USA.

Daniel S. Sem, School of Pharmacy, Concordia University Wisconsin, Mequon, WI, USA.

Stephen J. Merrill, Department of Mathematics, Statistics, and Computer Science, Marquette University, WI, USA.

References

1. Anson BD, Ma J, He JQ. Identifying cardiotoxic compounds. Genetic Engineering & Biotechnology News. 2009;29:34–35.
2. Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, Schacht AL. How to improve r&d productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery. 2010;9:203–214. doi: 10.1038/nrd3078.
3. Zinc homepage. http://zinc.docking.org/
4. Irwin JJ, Shoichet BK. Zinc-a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling. 2005;45(1):177–182. doi: 10.1021/ci049714.
5. Waszkowycz B, Perkins TDJ, Sykes RA, Li J. Large-scale virtual screening for discovering leads in the postgenomic era. IBM Systems Journal - Deep computing for the life sciences. 2001;40:360–376.
6. Alvarez J, Shoichet B. In: Virtual Screening in Drug Discovery. Alvarez J, Shoichet B, editors. CRC Press; 2005.
7. P CE, Leach Andrew R, Shoichet Brian K. Docking and scoring. Journal of Medicinal Chemistry. 2006;49(20) doi: 10.1021/jm060999m.
8. Blaney JDJ. A good ligand is hard to find: Automated docking methods. Perspect Drug Disc Des. 1993;1:301–319.
9. L AR, S BK, P CE. Prediction of protein-ligand interactions. docking and scoring: successes and gaps. Journal of Medicinal Chemistry. 2006;49:5851–5. doi: 10.1021/jm060999m.
10. Walters W, Stahl M, Murcko M. Virtual screening-an overview. Drug Discovery Today. 1998;3:160–178.
11. Reddy AS, Pati SP, Kumar PP, Pradeep H, Sastry GN. Virtual screening in drug discovery - a computational perspective. Current Protein & Peptide Science. 2007;8:329–351. doi: 10.2174/138920307781369427.
12. Muegge I, Oloff S. Advances in virtual screening. Drug Discovery Today: Technologies. 2006;3:405–411. doi: 10.1016/j.ddtec.2006.12.002.
13. Shoichet BK. Virtual screening of chemical libraries. Nature. 2004;432:862–865. doi: 10.1038/nature03197.
14. Kuntz ID, Blaney JM, Oatley SJ, Langridge R, Ferrin TE. A geometric approach to macromolecule-ligand interactions. Journal of Molecular Biology. 1982;161:269–288. doi: 10.1016/0022-2836(82)90153-x.
15. Autodock homepage. http://autodock.scripps.edu/ [Last accessed: 2011.05.07]
16. Morris GM, Goodsell DS, Halliday RS, Huey R, Hart WE, Belew RK, Olson AJ. Automated docking using a lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry. 1998;19:1639–1662.
17. Jones G, Willett P, Glen RC. Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation. Journal of Molecular Biology. 1995;245:43–53. doi: 10.1016/s0022-2836(95)80037-9.
18. Rarey M, Kramer B, Lengauer T, Klebe G. A fast flexible docking method using an incremental construction algorithm. Journal of Molecular Biology. 1996;261:470–489. doi: 10.1006/jmbi.1996.0477.
19. Docking@home. http://docking.cis.udel.edu/ [Last accessed: 2011.05.08]
20. Zhang S, Kumar K, Jiang X, Wallqvist A, Reifman J. Dovis: an implementation for high-throughput virtual screening using autodock. BMC Bioinformatics. 2008;9:126. doi: 10.1186/1471-2105-9-126.
21. Azam N, Ghanem M, Kalaitzopoulos D, Wolf A, Kasam V, Wang Y, Hofmann-Apitius M. Dockflow: Achieving interoperability of protein docking tools across heterogeneous grid middleware. International Journal of Ad Hoc and Ubiquitous Computing. 2010;6:235–251.
22. Lipinski C, Lombardo F, Dominy B, Feeney P. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews. 1997;23:3–25. doi: 10.1016/s0169-409x(00)00129-0.
23. Lipinski CA. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discovery Today: Technologies. 2004;1(4):337–341. doi: 10.1016/j.ddtec.2004.11.007.
24. W J, Ghose AK, Viswanadhan VN. A knowledge-based approach in designing combinatorial or medicinal chemistry libraries for drug discovery. 1. a qualitative and quantitative characterization of known drug databases. J Comb Chem. 1999;1(1):55–68. doi: 10.1021/cc9800071.
25. Bazeley PS, Prithivi S, Struble CA, Povinelli RJ, Sem DS. Synergistic use of compound properties and docking scores in neural network modeling of cyp2d6 binding: predicting affinity and conformational sampling. Journal of chemical information and modeling. 2006;46(6):2698–2708. doi: 10.1021/ci600267k.
26. Boonsri P, Neumann TS, Olson AL, Cai S, Herdendorf TJ, Miziorko HM, Hannongbua S, Sem DS. Molecular docking and nmr binding studies to identify novel inhibitors of human phosphomevalonate kinase. Biochemical and biophysical research communications. 2013;430(1):313–319. doi: 10.1016/j.bbrc.2012.10.130.
27. Condor - A Hunter of Idle Workstations. 1988.
28. Condor project. http://research.cs.wisc.edu/htcondor/ [Last accessed: 2011.05.08]
29.Thain D, Tannenbaum T, Livny M. Distributed computing in practice: The condor experience. Concurrency and Computation: Practice and Experience. 2005;17(2-4):323–356. [Google Scholar]
30. [Last accessed: 2011.05.08];Dagman applications. http://research.cs.wisc.edu/htcondor/dagman/dagman.html.
31. [Last accessed: 2011.05.08];Mgltools website. http://mgltools.scripps.edu/
32.Python software foundation. python language reference, version 2.4.3. available at http://www.python.org.
33. [Last accessed: 2011.05.08];X-dag. http://www.cs.wisc.edu/condor/manual/v7.4/2_10DAGMan_Applications.html.
34. [Last accessed:2011.05.08];Tuberculosis. http://en.wikipedia.org/wiki/-Tuberculosis.
35.Dual specificity protein phosphatase 5. http://en.wikipedia.org/wiki/DUSP5.
36. Kwak SP, Dixon JE. Multiple dual specificity protein tyrosine phosphatases are expressed and regulated differentially in liver cell lines. Journal of Biological Chemistry. 1995;270:1156–60. doi: 10.1074/jbc.270.3.1156.
37. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. The Protein Data Bank. European Journal of Biochemistry. 1977;80(2):319–324. doi: 10.1111/j.1432-1033.1977.tb11885.x.
38. Center for structure-based drug design and development. http://www.csddd.org/
39. Sanner MF. Python: A programming language for software integration and development. J Mol Graphics Mod. 1999;17:57–61.
40. Klebe G. Virtual ligand screening: strategies, perspectives and limitations. Drug Discovery Today. 2006;11(13):580–594. doi: 10.1016/j.drudis.2006.05.012.
41. Kellenberger E, Rodrigo J, Muller P, Rognan D. Comparative evaluation of eight docking tools for docking and virtual screening accuracy. Proteins: Structure, Function, and Bioinformatics. 2004;57(2):225–242. doi: 10.1002/prot.20149.
42. Cummings MD, Des Jarlais RL, Gibbs AC, Mohan V, Jaeger EP. Comparison of automated docking programs as virtual screening tools. Journal of Medicinal Chemistry. 2005;48(4):962–976. doi: 10.1021/jm049798d.
43. The PyMOL Molecular Graphics System, version 1.2r3pre. Schrödinger, LLC; http://www.pymol.org/
44. Neumann T, Span E, Kalous K, Gastonguay A, Kutty R, Nayak J, Bohl C, Lange R, Sarker M, Talipov M, Rathore R, Ramchandran R, Sem D. Identification of polysulfonated inhibitors related to suramin that target dual specificity phosphatase 5 and provide new insights into the binding requirements for dual-phosphate substrate pockets. Proteins: Struct, Funct, Bioinf. Submitted.
45. Anderson DP. BOINC: A system for public-resource computing and storage. Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing. pp. 4–10.
46. Catlett C, et al. TeraGrid: Analysis of organization, system architecture, and middleware enabling new types of applications. In: Grandinetti L, editor. HPC and Grids in Action. IOS Press ‘Advances in Parallel Computing’ series; Amsterdam: 2007.

Associated Data

Supplementary Materials

Supplement1

NIHMS745365-supplement-Supplement1.pdf (49.2 KB, PDF)

Supplement2

NIHMS745365-supplement-Supplement2.pdf (4.7 MB, PDF)
