Author manuscript; available in PMC: 2025 Apr 27.
Published in final edited form as: J Chem Inf Model. 2020 Mar 3;60(3):1509–1527. doi: 10.1021/acs.jcim.9b00686

CANDOCK: Chemical Atomic Network-based Hierarchical Flexible Docking Algorithm using Generalized Statistical Potentials

Jonathan Fine 1,, Janez Konc 2,, Ram Samudrala 3, Gaurav Chopra 1,4,5,6,7,8,*
PMCID: PMC12034428  NIHMSID: NIHMS2073751  PMID: 32069042

Abstract

Small molecule docking has proven to be invaluable for drug design and discovery. However, existing docking methods have several limitations, such as improper treatment of the interactions of essential components in the chemical environment of the binding pocket (e.g., cofactors and metal ions), incomplete sampling of chemically relevant ligand conformational space, and the inability to consistently correlate docking scores of the best binding pose with experimental binding affinities. We present CANDOCK, a novel docking algorithm that uses a hierarchical approach to reconstruct ligands from an atomic grid using graph theory and generalized statistical potential functions to sample biologically relevant ligand conformations. Our algorithm accounts for protein flexibility and for solvent, metal ion, and cofactor interactions in the binding pocket that are traditionally ignored by current methods. We evaluate the algorithm on the PDBbind and Astex proteins to show that its ability to reproduce the binding mode of the ligands in these benchmarks is independent of the initial ligand conformation. Finally, we identify the best selector and ranker potential functions, such that the statistical score of the best selected docked pose correlates with the experimental binding affinities of the ligands for any given protein target. Our results indicate that CANDOCK is a generalized flexible docking method that addresses several limitations of current docking methods by considering all interactions in the chemical environment of a binding pocket for correlating the best docked pose with biological activity. CANDOCK, along with all structures and scripts used for benchmarking, is available at https://github.com/chopralab/candock_benchmark.


1. Introduction

Computational docking provides a means to predict and assess interactions between ligands and proteins with relatively little investment. Docking refers to physical three-dimensional structural interactions between a receptor (typically proteins, DNA, RNA, etc.) and a ligand (small molecules, proteins, peptides, etc.)1–15. Docking methods are evaluated by predicting the correct pose/binding mode (evaluated using RMSD or TMScore of the coordinates of the atoms) or by measuring predicted binding affinities4,8,11,12,16. Application to protein targets involved in disease holds the promise of discovering new therapeutics using traditional single target approaches or by virtually measuring the interactions of a compound with the proteins from multi-organism proteomes17–22. The resulting chemo-proteome interactions can be interrogated to study polypharmacology19 and investigate the effect drugs and agents have on protein classes in a disease-specific context19,22. In previous works, we have used the algorithm presented herein to combat Ebola20, determine the toxicity of potential diabetes therapeutics21, and rank the affinity of kinase inhibitors for the treatment of Acute Myeloid Leukemia23.

More than 20 molecular docking software tools, such as AutoDock Vina24, Gold25, MedusaDock26–28, and Glide3, are currently in use for pharmaceutical research. However, after decades of method development and application, the promise to computationally determine new therapeutics has not been fully realized, and computational methods for drug discovery are still in their infancy29,30. The CANDOCK algorithm confronts several outstanding technical and practical problems in computational docking. For example, one significant problem is assessing goodness-of-fit, or the likelihood that the given pose is the most physically realistic (native-like) pose among many unrealistic binding poses. Another significant limitation is the lack of full protein flexibility in the docking methods used today. Induced fit, where the protein and the ligand undergo conformational changes upon ligand binding, is a widely recognized challenge in computational drug screening26,27,31. Therefore, the traditional treatment of proteins as rigid structures may be insufficient and often misleading for structure-guided drug screening and design, as shown by us and others previously32. Docking ligands to their protein targets is particularly challenging when attempting to reproduce the binding mode of small molecules to ligand-free or alternative ligand-bound protein structures, which invariably occurs in practical applications of any docking method. Specifically, docking with ligand-bound (holo) protein structures typically leads to an accuracy of 60–80%, whereas docking with ligand-free (apo) structures yields an accuracy of merely 20–40%33–37.

Several methods have been implemented to account for protein and ligand flexibility, including multiple experimentally derived structures from X-ray crystallography38 and nuclear magnetic resonance38, rotamer libraries26,39, Monte Carlo24,40, and molecular mechanics41–46. The same principle limits the use of multiple experimentally derived protein structures and of side-chain rotamer libraries: binding a ligand to a protein can cause conformational changes in either molecule that are not captured by these methods47. The sampling problem is compounded by the fact that the protein main chain torsion angles are also frequently altered from their ligand-free conformations, which these methods fail to capture. Molecular mechanics is well suited for capturing fine-detail side-chain and main chain motions and rearrangements through energy minimization. However, molecular mechanics is limited in that adequate sampling of all degrees of freedom between protein and ligand (rotation, translation, and torsion angles) is frequently computationally intractable. Further, the use of unrestrained molecular dynamics has been shown to displace the ligand from its native pose48.

Modern docking methods address these issues by employing algorithms such as the Genetic Algorithm25,31,49,50 to flexibly sample the conformational space. However, it has been shown that these methods do not consistently produce poses that rank the biological activity of the ligand well50,51 and that the ability of these methods to produce a correct pose is dependent on the starting conformation of the ligand52,53. Some methodologies use a fragment-based approach to docking54 to sample the conformational space for a given ligand efficiently. These fragment-based methods have reported a greater ability to rank activity between given ligands55,56. Therefore, we believe that further innovation in fragment-based methods is an appropriate way to improve docking methods.

We have developed the CANDOCK algorithm around a new protocol for hierarchical (atoms to fragments to molecules) docking with iterative dynamics during molecule reconstruction to “grow” the ligand in the binding pocket. The docking protocol is based on two guiding principles: (i) binding sites possess regions of both very high and very low structural stability57 and (ii) a tandem sequence of small protein motions is generally sufficient to predict the correct binding mode of protein-ligand interactions47. The hierarchical nature of this method is derived from an ‘atoms to fragments,’ ‘fragments to ligands’ approach that generates chemically relevant poses given the ligand and any surrounding chemical environment (e.g. protein, RNA, DNA binding sites or interfaces). For any flexible ligand, the expectation is that at least one or a few fragment conformations assembled using ligand-receptor atomic interactions in the binding pocket will bind to a structurally stable region of the receptor. Following identification of such a binding mode, subtle conformational changes of the receptor are necessary for reconstructing the ligand using these fragments as “seeds” to generate accurate receptor-ligand binding modes (poses). We show that CANDOCK can accurately reproduce the binding mode of ligands and rank the activity of these ligands in such poses using a generalized statistically derived forcefield, demonstrating the potential to overcome traditional challenges with induced-fit docking methods.

2. Materials and Methods

We first introduce our generalized statistical scoring function, then provide details of the CANDOCK algorithm, and selection of benchmarking datasets for evaluating pose selection and receptor-ligand affinity ranking.

2.1. Generalized Statistical Scoring Function

A generalized statistical scoring potential is used to account for varying chemical environments, such as metal ions, cofactors, and water molecules; such potentials have shown great promise for selecting correct poses in both small-molecule and protein-protein docking58. The scoring function employed by the CANDOCK algorithm is a pairwise atomic scoring function that is based on our previous work59. Here, we reproduce the fundamental equations59 to clarify the terminology used in our manuscript. The scoring function calculates the potential between two atoms based on the distance between atoms i and j with atom types a and b and takes four input terms that determine the method by which the score is calculated. The possible terms are ‘functional’, ‘reference’, ‘composition’, and ‘cutoff’, which define the probability function P given in Eq. (1):

$$s(r_{ab}^{ij}) = -\sum_{ij} \ln \frac{P(r_{ab}^{ij} \mid C)}{P(r^{ij})}$$ (1)

The ‘functional’ term determines the numerator of Eq. (1) and can be defined either as a ‘normalized frequency’ function f(r) in Eq. (2) or a ‘radial’ distribution function g(r) given in Eq. (3):

$$P(r_{ab}^{ij} \mid C) = f(r_{ab}) = \frac{N_s(r_{ab})}{\sum_{r} N_s(r_{ab})}$$ (2)

where Ns is the number of observed atoms found at a given distance.

$$P(r_{ab}^{ij} \mid C) = g(r_{ab}) = \frac{N_s(r_{ab}) / V_s(r)}{\sum_{r} N_s(r_{ab}) / V_s(r)}$$ (3)

where Ns is divided by the volume of the sphere Vs(r). To distinguish between these two functions, ‘radial’ scoring functions start with ‘R’ while ‘normalized frequency’ functions start with ‘F’.
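The two ‘functional’ options can be illustrated with a minimal sketch that turns a histogram of observed distances for one atom-type pair into f(r) and g(r). The function names and the toy counts are hypothetical, and V_s(r) is assumed here to be the volume of the spherical shell covering each distance bin, which the text does not spell out.

```python
import math

def normalized_frequency(counts):
    """Eq. (2): divide each distance-bin count by the total count over all bins."""
    total = sum(counts)
    return [n / total for n in counts]

def radial_distribution(counts, bin_width):
    """Eq. (3): divide each count by a volume V_s(r) before normalizing over r."""
    densities = []
    for k, n in enumerate(counts):
        r_in, r_out = k * bin_width, (k + 1) * bin_width
        # Assumption: V_s(r) is the volume of the shell between r_in and r_out.
        shell = 4.0 / 3.0 * math.pi * (r_out ** 3 - r_in ** 3)
        densities.append(n / shell)
    total = sum(densities)
    return [d / total for d in densities]

# Toy histogram of observed a-b distances in 1 Å bins (not real data).
counts = [0, 2, 10, 6, 2]
f = normalized_frequency(counts)
g = radial_distribution(counts, bin_width=1.0)
```

Compared with f(r), the radial form gives more weight to short distances, because the same count spread over a smaller shell corresponds to a higher density.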

The ‘reference’ term determines the denominator of the scoring function. It can be defined either as ‘mean’, in which case it is calculated as a sum of all atom type pairs divided by the number of atom types, or as the ‘cumulative’ sum of all atom type pairs. The mean term can be used with either ‘normalized frequency’ (Eq. (4)) or ‘radial’ (Eq. (5)):

$$P(r) = f(r) = \frac{\sum_{ab} f(r_{ab})}{n}$$ (4)
$$P(r) = g(r) = \frac{\sum_{ab} g(r_{ab})}{n}$$ (5)

The ‘cumulative’ option can be used together with ‘normalized frequency’ to give Eq. (6) and with ‘radial’ to give Eq. (7):

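$$P(r) = f(r) = \frac{\sum_{ab} N_s(r_{ab})}{\sum_{r} \sum_{ab} N_s(r_{ab})}$$ (6)
$$P(r) = g(r) = \frac{\sum_{ab} N_s(r_{ab}) / V_s(r)}{\sum_{r} \sum_{ab} N_s(r_{ab}) / V_s(r)}$$ (7)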
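As a rough illustration of how the two reference states differ, the sketch below computes the ‘mean’ reference of Eq. (4) and the ‘cumulative’ reference of Eq. (6) for the ‘normalized frequency’ functional. All names and counts are hypothetical; n is taken to be the number of atom-type pairs in the dictionary, and the per-bin score assumes a negative log-ratio convention so that favorable distances receive negative scores.

```python
import math

def mean_reference(freq_by_pair):
    """Eq. (4): average the per-pair distributions f(r_ab) over the n pairs."""
    n = len(freq_by_pair)
    bins = len(next(iter(freq_by_pair.values())))
    return [sum(f[k] for f in freq_by_pair.values()) / n for k in range(bins)]

def cumulative_reference(counts_by_pair):
    """Eq. (6): pool the raw counts of all pairs, then normalize over r."""
    bins = len(next(iter(counts_by_pair.values())))
    pooled = [sum(c[k] for c in counts_by_pair.values()) for k in range(bins)]
    total = sum(pooled)
    return [p / total for p in pooled]

def pair_score(p_pair, p_ref, k):
    """Score of one atom pair in distance bin k (sign convention is an assumption)."""
    return -math.log(p_pair[k] / p_ref[k])

# Toy counts for two atom-type pairs over three distance bins (not real data).
counts_by_pair = {("C", "O"): [1, 6, 3], ("C", "C"): [20, 40, 40]}
freq_by_pair = {pair: [n / sum(c) for n in c] for pair, c in counts_by_pair.items()}
mean_ref = mean_reference(freq_by_pair)
cum_ref = cumulative_reference(counts_by_pair)
```

In this toy case the rarely observed C-O pair contributes as much as the C-C pair to the mean reference, whereas the cumulative reference is dominated by the pair with more observations.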

Scoring functions compiled with the ‘mean’ option are denoted as ‘M’ while those compiled with the ‘cumulative’ option are denoted as ‘C’. The third term defines the composition of the scoring function. This term controls the number of unique atom pairs used for compiling the scoring function. The ‘complete’ option results in a scoring function compiled from all possible atom type pairs, while the ‘reduced’ option uses only the atom types present in either the protein (including cofactors, waters, and post-translational modifications) or the ligand. The letter ‘C’ denotes a complete scoring function, while ‘R’ denotes a scoring function compiled with the ‘reduced’ option. A total of 8 scoring function families can be created with these three options (RMR, RMC, RCR, RCC, FMR, FMC, FCR, FCC). The fourth and final term used to compile the scoring function is the ‘cutoff’, which controls the maximum distance at which the interactions are calculated, with possible values ranging from 4 Å to 15 Å. With all four options there are a total of 96 possible scoring functions (8 families × 12 cutoff values) to account for generalized parameters for identifying native poses and activity across a diverse set of biomolecular interactions in varying chemical environments (proteins, nucleic acids, interfaces, cofactors, etc.). Example scoring functions are ‘radial-mean-reduced-6’ (RMR6) and ‘normalized frequency-cumulative-complete-8’ (FCC8), as denoted in the manuscript. It should be noted that not all 96 scoring functions are intended to be used for all docking simulations; the selection of the appropriate scoring function for a given goal is discussed in later sections.
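The naming scheme can be checked by enumerating it. The short sketch below assumes that the 12 cutoffs are the integer values from 4 Å to 15 Å, which is consistent with the 8×12 count but is not stated explicitly; the variable names are illustrative only.

```python
from itertools import product

FUNCTIONALS = {"R": "radial", "F": "normalized frequency"}
REFERENCES = {"M": "mean", "C": "cumulative"}
COMPOSITIONS = {"R": "reduced", "C": "complete"}
CUTOFFS = range(4, 16)  # assumption: integer cutoffs 4, 5, ..., 15 Å

# Name = functional letter + reference letter + composition letter + cutoff.
names = [
    f"{func}{ref}{comp}{cutoff}"
    for func, ref, comp, cutoff in product(FUNCTIONALS, REFERENCES, COMPOSITIONS, CUTOFFS)
]
families = sorted({name[:3] for name in names})
```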

2.2. The CANDOCK Algorithm

2.2.1. Phase I: Structure Preparation

The CANDOCK algorithm’s input is a set of compounds to be docked, a query protein structure, and a set of binding sites on the query protein structure. In a three-phase protocol (Figure 1), it performs semi- or fully flexible docking of compounds to the protein and outputs docked and minimized protein-compound complex structures together with their predicted scores.

Figure 1:

Overview of the CANDOCK docking algorithm. Phase I consists of processing the input protein (a) and the ligand (b). During Phase I, an atomic grid is created in the protein binding site, where the scores of all possible atom types are calculated at each point in the binding site grid. Simultaneously, the input ligand(s) are fragmented along the rotatable bonds present in the ligand. The grid is used to recreate the rigid fragments in the binding pocket. Phase II constructs the rigid ligand fragments in the binding site grid, producing ‘seeds’ that can be grown into the full ligand (c). Phase III identifies potential ligand poses using the maximum clique algorithm (d), clusters and links these poses using A* (e), and minimizes the poses in the binding site (f).

Parse Receptor and Compounds.

The inputs to the algorithm are the 3D coordinates and topology of a query receptor (e.g. protein structure) in the PDB format, consisting of single or multiple chains which may also contain cofactors and post-translational modifications, and compounds in the MOL2 format. Compounds are processed in batches of size 10 to enable reading of large molecular files that do not fit in computer memory. An example of a ligand is given in Figure 2a.

Figure 2:

Atom type assignment and fragmentation procedure present in CANDOCK. The procedure begins with the topology and 3D coordinates of the ligand (a). Using these data, the IDATM type is assigned to each atom in the ligand using a previously described algorithm27(b). This yields the hybridization state of all atoms, allowing for the assignment of bond orders for all atoms (c). The bond orders and topologies are used to assign a rotatable flag for each bond in the ligand using rules derived from the DOCK 6 program31. The rigid fragments identified using this method are boxed (d).

Compute Atom Types.

To compute atom types for protein, cofactors, and compounds, we implemented the IDATM algorithm60 (results given in Figure 2b). We also implemented an algorithm61,62 to assign AMBER General Force Field (GAFF) atom types to cofactors, ligands, and post-translational modifications, while GAFF types for proteins are obtained from the AMBER10 topology file available as part of the OpenMM package63. A list of the IDATM atom types used in this work is provided in Table S1.

Assignment of Bond Orders.

Using the hybridization information provided by the newly assigned IDATM atom types, several potential bond order states can be generated so as to fit the expected number of bonds (valence) for each ligand atom. These potential bond order assignments are evaluated by trial and error to determine whether they form a valid molecule using valence state rules derived for all atom types. The bond order set that satisfies the valence states with the lowest sum of atomic penalty scores over all atoms (see Figure 2c) is used to assign the GAFF bond orders of the ligand.

Fragment Compounds.

Rotatable bonds are first identified in each compound using the extended list of rotatable bonds adapted from the UCSF DOCK 6 software64. Next, structurally rigid fragments consisting of atoms between the rotatable bonds are identified. Bond vectors for rotatable bonds are retained for each rigid fragment to be used during reconstruction of docked fragments. Fragments consisting of more than 4 atoms, in which at least two atoms are rigid (connected by a non-rotatable bond) are considered as seed fragments. These are subsequently rigidly docked into the protein binding site. All other non-seed fragments are considered as linking fragments during the compound reconstruction process. This result is shown in Figure 2d.
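The fragmentation step amounts to cutting the molecular graph at every rotatable bond and collecting the connected pieces. The following is a minimal sketch under that reading; it takes the list of rotatable bonds as given rather than applying the DOCK 6 rules, and the function names and toy molecule are hypothetical.

```python
from collections import defaultdict

def rigid_fragments(n_atoms, bonds, rotatable):
    """Return rigid fragments as sorted atom-index lists after cutting rotatable bonds."""
    cut = {frozenset(b) for b in rotatable}
    adjacency = defaultdict(list)
    for a, b in bonds:
        if frozenset((a, b)) not in cut:
            adjacency[a].append(b)
            adjacency[b].append(a)
    seen, fragments = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        # Depth-first search over non-rotatable bonds only.
        stack, component = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            component.append(atom)
            for neighbor in adjacency[atom]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        fragments.append(sorted(component))
    return fragments

def is_seed(fragment):
    """Seed fragments have more than 4 atoms; smaller fragments act as linkers."""
    return len(fragment) > 4

# Toy molecule: a six-membered ring (atoms 0-5) joined to a two-atom tail (6-7)
# through one rotatable bond (5-6).
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (5, 6), (6, 7)]
fragments = rigid_fragments(8, bonds, rotatable=[(5, 6)])
seeds = [frag for frag in fragments if is_seed(frag)]
```

Here the ring becomes the only seed fragment and the two-atom tail is left as a linking fragment.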

Assignment of Force Field Atom Types.

Using the computed GAFF atom types, the bonded forces of the AMBER force field are generated for the protein and the docked compounds. Protein-compound interactions are scored using the knowledge-based Radial Mean Reduced (RMR) discriminatory function defined previously59 with a 6 Å cutoff (see section on Generalized statistical scoring function). This function calculates a fitness score for each compound’s or fragment’s atom in a protein by considering all protein atoms within 6 Å radius of that atom. It is an atomic level radial distribution function with mean reference state that averages over all pairwise atom types from a reduced atom type composition (protein’s and compound’s atom types), using experimentally determined intermolecular complexes in the Cambridge Structural Database (CSD)65 and in the Protein Data Bank (PDB)66 as the information sources. The objective function that is used for the minimization of the protein-compound interactions is computed using the RMC scoring function with a 15 Å cutoff as follows: for each possible pair of atom types present in the protein-ligand complex, the RMC function is sampled at discrete 0.1 Å intervals and is smoothed using B-spline interpolation. Potential energy values and their first derivatives are calculated at 0.01 Å intervals over the [0, 15] Å interval for the smoothed function. The objective function is implemented as a custom knowledge-based force object in OpenMM63 which is used as a library from the CANDOCK source code.
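To make the tabulation step concrete, the sketch below smooths a potential sampled every 0.1 Å with a uniform cubic B-spline and tabulates values and first derivatives every 0.01 Å over [0, 15] Å. It is only an approximation of the idea: the spline degree, the end-point handling, and the use of the samples directly as control points (so the curve smooths rather than strictly interpolates them) are assumptions, and the linear toy potential is not a real RMC function.

```python
def bspline_value_and_slope(samples, h, r):
    """Evaluate a uniform cubic B-spline with control points `samples` spaced h apart."""
    last = len(samples) - 1
    i = min(int(r / h), last - 1)  # index of the segment containing r
    t = r / h - i                  # position inside the segment, between 0 and 1

    def control(k):
        # Assumption: repeat the end samples beyond the tabulated range.
        return samples[min(max(k, 0), last)]

    p0, p1, p2, p3 = control(i - 1), control(i), control(i + 1), control(i + 2)
    value = ((1 - t) ** 3 * p0
             + (3 * t ** 3 - 6 * t ** 2 + 4) * p1
             + (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) * p2
             + t ** 3 * p3) / 6.0
    slope = (-3 * (1 - t) ** 2 * p0
             + (9 * t ** 2 - 12 * t) * p1
             + (-9 * t ** 2 + 6 * t + 3) * p2
             + 3 * t ** 2 * p3) / (6.0 * h)
    return value, slope

# Toy potential sampled every 0.1 Å on [0, 15] Å (151 samples, not a real RMC curve).
samples = [2.0 * (k * 0.1) for k in range(151)]
# Lookup table of (energy, derivative) every 0.01 Å, as described in the text.
table = [bspline_value_and_slope(samples, 0.1, k * 0.01) for k in range(1501)]
```

A table of this kind lets a minimizer look up energies and forces by index instead of re-evaluating the statistical potential at every step.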

Prepare Protein for Molecular Mechanics.

The N- and C-terminal residues are renamed according to the AMBER topology specification (e.g., ALA to NALA or CALA). Disulfide bonds are added to the protein by connecting SG atoms that are closer than 2.5 Å. Inter-residue bonds are also added by connecting main chain C and N atoms that are closer than 1.4 Å.
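The distance rule for disulfide bonds is simple enough to sketch directly. The example below assumes the SG atoms have already been extracted as (residue number, coordinate) pairs; the function name and the coordinates are made up for illustration.

```python
import math
from itertools import combinations

def disulfide_pairs(sg_atoms, cutoff=2.5):
    """Return residue-number pairs whose SG atoms are closer than `cutoff` Å."""
    return [
        (res_a, res_b)
        for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_atoms, 2)
        if math.dist(xyz_a, xyz_b) < cutoff
    ]

# Toy SG coordinates in Å for three cysteines (not taken from a real structure).
sg_atoms = [(22, (0.0, 0.0, 0.0)), (87, (2.04, 0.0, 0.0)), (140, (9.0, 9.0, 9.0))]
bonded = disulfide_pairs(sg_atoms)
```

The same pattern, with a 1.4 Å cutoff between main chain C and N atoms, would cover the inter-residue bonds.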

2.2.2. Phase II: Rigid Fragment Docking

Compute Rotations of Seeds.

For each seed fragment, we compute its rotational transformations about the geometric center which is fixed at the coordinate origin. Accordingly, we first compute 256 uniformly distributed unit vectors around the coordinate origin. Then, the seed fragment is rotated by 10° increments around the axis formed by each unit vector. To speed up the subsequent step of rigid fragment docking, the rotated fragment atoms’ coordinates are mapped on a hexagonal close-packed (HCP) grid of 0.375 Å resolution. This mapping enables efficient docking of fragments to a protein binding site since their rotational transformations need to be computed only once. The fragment’s clashes with the protein and the fragment’s RMR6 scores are determined by translations of the rotational fragment grid over the compatible HCP binding site grid using fast integer arithmetic.
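A minimal sketch of the rotation sampling follows. The text does not say how the 256 uniformly distributed unit vectors are generated, so a Fibonacci spiral is assumed here, and the rotations use Rodrigues' formula; the mapping onto the HCP grid is omitted, and all function names are hypothetical.

```python
import math

def fibonacci_sphere(n=256):
    """Approximately uniform unit vectors (assumption: Fibonacci spiral layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    vectors = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        rho = math.sqrt(1.0 - z * z)
        vectors.append((rho * math.cos(golden_angle * k), rho * math.sin(golden_angle * k), z))
    return vectors

def rotate(p, axis, angle):
    """Rodrigues' formula: rotate point p about a unit axis through the origin."""
    c, s = math.cos(angle), math.sin(angle)
    ux, uy, uz = axis
    dot = ux * p[0] + uy * p[1] + uz * p[2]
    cross = (uy * p[2] - uz * p[1], uz * p[0] - ux * p[2], ux * p[1] - uy * p[0])
    return tuple(p[i] * c + cross[i] * s + axis[i] * dot * (1.0 - c) for i in range(3))

def seed_rotations(coords, step_deg=10):
    """Center the fragment at the origin and rotate it in 10° steps about each axis."""
    center = [sum(a[i] for a in coords) / len(coords) for i in range(3)]
    centered = [tuple(a[i] - center[i] for i in range(3)) for a in coords]
    rotated = []
    for axis in fibonacci_sphere():
        for deg in range(0, 360, step_deg):
            rotated.append([rotate(a, axis, math.radians(deg)) for a in centered])
    return rotated

# Toy three-atom fragment (coordinates in Å, not a real molecule).
rotations = seed_rotations([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0)])
```

With 256 axes and 36 angles per axis this yields 9216 orientations per seed, which is why computing them once and reusing them for every grid translation saves time.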

Generate Binding Site Grid.

A binding site location for docking is specified using one or more centroids, each consisting of the Cartesian coordinate of its center and its radius. We generate a grid that covers the space of all centroids that represent the binding site (Figure 3a). We use an HCP grid that provides maximal packing efficiency, covering the same volumetric space of a simple cubic grid with approximately 40% fewer grid points to achieve the same maximal interstitial spacing. The grid points are in a distance range of 0.8 Å < d < 8 Å from any protein atom. We use a grid spacing of 0.375 Å with a maximal interstitial spacing of 0.22 Å to densely represent the protein binding sites (Figure 3b).
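The grid construction can be sketched with the standard closed-form coordinates of a hexagonal close-packed lattice. The version below is illustrative only: the function name and inputs are hypothetical, and the distance condition is interpreted as a bound on the distance to the nearest protein atom.

```python
import math

def hcp_grid(centroids, protein_atoms, spacing=0.375, d_min=0.8, d_max=8.0):
    """HCP points inside any centroid sphere whose nearest protein atom is in (d_min, d_max)."""
    r = spacing / 2.0
    dx, dy, dz = 2.0 * r, math.sqrt(3.0) * r, 2.0 * math.sqrt(6.0) / 3.0 * r
    lo = [min(c[i] - rad for c, rad in centroids) for i in range(3)]
    hi = [max(c[i] + rad for c, rad in centroids) for i in range(3)]
    points = []
    for k in range(int((hi[2] - lo[2]) / dz) + 1):
        for j in range(int((hi[1] - lo[1]) / dy) + 1):
            for i in range(int((hi[0] - lo[0]) / dx) + 1):
                # Standard HCP lattice coordinates with nearest-neighbor distance 2r.
                p = (lo[0] + r * (2 * i + (j + k) % 2),
                     lo[1] + dy * (j + (k % 2) / 3.0),
                     lo[2] + dz * k)
                if not any(math.dist(p, c) <= rad for c, rad in centroids):
                    continue
                nearest = min(math.dist(p, a) for a in protein_atoms)
                if d_min < nearest < d_max:
                    points.append(p)
    return points

# Toy input: one centroid of radius 1.5 Å and a single protein atom (not real data).
grid = hcp_grid([((0.0, 0.0, 0.0), 1.5)], [(1.0, 0.0, 0.0)])
```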

Figure 3:

Detailed overview of the hierarchical relationship between the atomic grid and ligand fragments. The protein binding site is supplied as a series of centroids that are combined to form a volume of space that defines the binding pocket (a). Regions of this volume that do not clash with the protein, waters, or cofactors are filled with a hexagonal close-packed grid (b). The scores of all atom types present in the ligand are calculated at each grid point using the RMR6 scoring function (c). Ligand fragments from the previous step are translated and rotated within this grid to produce a collection of poses of the same ligand fragment throughout the binding site (d). This collection of ligand fragments is clustered using a greedy clustering algorithm that uses RMSD to determine if two fragments are similar. If two fragments are within 2.0 Å of each other, the fragment with the higher RMR6 score is deleted. Remaining docked fragments are referred to as seeds (e). The score distribution of a typical seed is given in (f) to show the exponential shape of the distribution.

Dock and Cluster Rigid Fragments.

Intermolecular geometric and chemical complementarity between a protein and a ligand is essential for binding. Energetically preferred positions of ligand atom types can be captured using a discriminatory function (Figure 3c). Docking of seed fragments to the binding site grid is performed by moving the seed’s rotational grid over the binding site grid points. Docked fragment poses that are in a steric clash with the protein are rejected (Figure 3d). A steric clash is registered if any interatomic distance between the fragment and the protein is less than nine-tenths of the sum of the two atoms’ van der Waals radii. Each fragment translation and rotation that passes this initial filter is then evaluated with the RMR6 discriminatory function59. Finally, greedy clustering of docked and scored fragment poses is performed in Root Mean Square Deviation (RMSD) space, computed on their heavy atoms with a 2 Å cluster cutoff, resulting in a uniform distribution of locally best-scoring docked seed fragments covering the entire protein binding site (Figure 3e).
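The clash filter and the greedy clustering can be sketched as follows. This is an illustration under simplifying assumptions: poses of one fragment share the same atom order so that RMSD is a plain coordinate comparison, lower scores are better, and the function names and toy poses are hypothetical.

```python
import math

def has_clash(fragment, protein, scale=0.9):
    """Clash if any distance is below `scale` times the sum of the van der Waals radii."""
    return any(
        math.dist(xyz_f, xyz_p) < scale * (rad_f + rad_p)
        for xyz_f, rad_f in fragment
        for xyz_p, rad_p in protein
    )

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two poses with matching atom order."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def greedy_cluster(poses, cutoff=2.0):
    """Visit poses from best to worst score; drop any pose within `cutoff` of a kept one."""
    kept = []
    for score, coords in sorted(poses, key=lambda pose: pose[0]):
        if all(rmsd(coords, kept_coords) >= cutoff for _, kept_coords in kept):
            kept.append((score, coords))
    return kept

# Toy single-atom poses as (score, coordinates); values are made up.
poses = [(-5.0, [(0.0, 0.0, 0.0)]), (-3.0, [(1.0, 0.0, 0.0)]), (-4.0, [(5.0, 0.0, 0.0)])]
representatives = greedy_cluster(poses)
```

Because poses are visited from best to worst, each retained representative is the best-scoring pose in its neighborhood, which matches the locally best-scoring seeds described above.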

2.2.3. Phase III: Flexible Docking with Iterative Minimization

Generate Partial Compound Conformations.

For each compound to be docked, a user-specified percentage of the best-scoring rigidly docked poses of each of its seed fragments is considered. Among these, we search for compatible pairs of docked seeds that are at appropriate distances, that is, pairs whose separation is less than the maximum possible distance between them in the original compound. The maximum possible distance between a pair of seeds is calculated by traversing the path between the fragments in the original compound and summing up the distances between the endpoints of each rigid fragment on the path. We construct an undirected graph in which vertices represent seed fragments, and edges indicate that the corresponding pair of seed fragments is linkable. Using the MaxCliqueDyn algorithm67 we then find all fully connected subgraphs consisting of k vertices (k-cliques) in this graph, where the default value of k is set to three or to the number of seed fragments, whichever value is less. Each k-clique corresponds to a possible partial conformation of the docked seed fragments, in which these fragments are appropriately distanced so that they may be linked into the original compound. The maximal clique algorithm of Bron and Kerbosch68, which was previously used for pose matching69, differs significantly from our maximum clique algorithm67. While a maximal clique search covers all cliques that are not subgraphs of another clique, maximum clique algorithms only search for the clique with the maximum number of vertices. Consequently, although both address an NP-hard problem, finding a maximum clique requires an order of magnitude less computing time. The possible partial conformations are then clustered using a greedy clustering algorithm at an RMSD cutoff of 2 Å, where the best-scored cluster representatives are retained. The partial conformations, sorted by their RMR6 scores from the best- to the worst-scored, are used as input to the next step of compound reconstruction.

Reconstruct Compound with Protein Flexibility.

Each identified partial conformation of the docked seed fragments is gradually grown into the original ligand by addition of non-seed fragments using the A* search algorithm. This can be done at different levels of protein flexibility. Protein minimization may be performed at each step of the linking process or only at the end when the compound has been reconstructed. Each seed fragment is linked to adjoining fragments according to the connectivity of the original compound. Each added non-seed fragment is rotated 360° about the bond vector at 60° increments. If the user has specified full protein flexibility, the resulting conformation of the partial compound and the protein is subjected to knowledge-based energy minimization using the RMC15 scoring function for the intermolecular forces. Simultaneously, bonds, angles and torsions of the partial compound and the protein are minimized using the standard AMBER molecular mechanics energy minimization. This procedure uses the popular OpenMM software package, specifically its implementation of the L-BFGS minimization algorithm70. With each round of minimization, the RMR6 score is calculated for the protein-compound interactions, and the scored conformation is added to the priority queue, which holds the growing compound conformations ordered from the best-scored to the worst-scored.

At each subsequent step of reconstruction, the A* search algorithm chooses the best-scored conformation from this priority queue and attempts to extend it. This conformation must meet an additional condition: its attachment atoms that are to be connected by rotatable bonds to not-yet-added fragments must be at appropriate distances from the attachment atoms on the remaining seed fragments. The algorithm iterates until the priority queue is empty, in which case the compound has been completely reconstructed and is in a local minimum energy state. Alternatively, if the specified maximum number of steps is exceeded (1000 by default), the reconstruction has failed. The A* search is repeated for each partial conformation of docked seed fragments until all have been considered for reconstruction into a different docked conformation of the original compound. A final energy minimization procedure is performed on the protein-ligand complex, treating the protein as fully flexible (side-chain and backbone), to remove steric clashes introduced while growing the ligand into the binding site. In addition to knowledge-based and molecular mechanics energy minimization, the fragment reconstruction process intrinsically accounts for ligand flexibility in the docking process. The described protocol results in a ranked list of docked and minimized protein-compound complexes. These steps are summarized in the flow chart shown in Figure 4.
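The role of the priority queue can be sketched with a generic best-first search. This is a simplified stand-in rather than the CANDOCK implementation: a state is just the tuple of torsion angles chosen so far, the `extend` and `score` callables are hypothetical placeholders for fragment placement and RMR6 scoring, minimization and the attachment-distance check are omitted, and the search stops at the first completed conformation instead of draining the queue.

```python
import heapq
from itertools import count

def grow(start, n_links, extend, score, max_steps=1000):
    """Best-first growth: always extend the best-scored partial conformation."""
    tie = count()  # tie-breaker so the heap never has to compare states directly
    queue = [(score(start), next(tie), start)]
    steps = 0
    while queue and steps < max_steps:
        steps += 1
        best_score, _, state = heapq.heappop(queue)
        if len(state) == n_links:
            return best_score, state  # all linking fragments have been added
        for child in extend(state):
            heapq.heappush(queue, (score(child), next(tie), child))
    return None  # step limit reached or queue exhausted: reconstruction failed

def extend(state):
    """Add the next fragment at each 60° torsion increment (toy model)."""
    return [state + (angle,) for angle in range(0, 360, 60)]

def score(state):
    """Toy score that favors 120° torsions; lower is better."""
    return sum(abs(angle - 120) for angle in state)

result = grow((), 2, extend, score)
```

In this toy setting the search reaches the two-fragment conformation with both torsions at 120° after three queue pops, while a step limit of 2 reports failure.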

Figure 4:

Workflow of the fragment linking procedure. The algorithm begins with a set of ligand fragments docked into the binding site of the protein (termed seeds), which are selected based on their RMR6 score; the number of seeds is determined by the ‘Top Seed Percent’ parameter. These fragments are joined together into ligand templates using the maximum clique algorithm, and the potential ligand templates are clustered using a greedy clustering algorithm which removes ligand fragments within an RMSD of 2.0 Å of each other. Remaining ligand templates are joined using the A* algorithm, which determines whether a seed can be added to the growing ligand template. If the seed cannot be added, the template is rejected, and the pair is added to a list of failed pairs. If the seed can be added, then it is added to the ligand template. Once all seeds have been added to the ligand template, the template is accepted and energy minimized. The algorithm ends once all templates have been added or rejected.

2.3. Benchmarking the CANDOCK Algorithm

Throughout the paper, we evaluate different scoring functions for their ability to ‘select’ the crystal-like ligand pose (i.e. a pose within 2.0 Å of the crystal ligand pose) as the most negatively scored pose (best ranked pose) and term them ‘selectors’ henceforth. Here, we define the selection rate as the fraction of the best-ranked poses (most negatively scored) within 2.0 Å of the crystal ligand pose. We calculated this selection rate for each scoring function at different radius cutoff values (4 Å to 15 Å) to identify the best selectors. This metric should not be confused with the success rate, which is simply the algorithm’s ability to produce a crystal-like pose.
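The distinction between the two metrics is easy to state in code. In this small sketch each complex is a list of (score, RMSD to crystal pose) tuples; the function names and numbers are invented for illustration.

```python
def selection_rate(complexes, cutoff=2.0):
    """Fraction of complexes whose most negatively scored pose is crystal-like."""
    hits = sum(1 for poses in complexes
               if min(poses, key=lambda pose: pose[0])[1] <= cutoff)
    return hits / len(complexes)

def success_rate(complexes, cutoff=2.0):
    """Fraction of complexes for which any generated pose is crystal-like."""
    hits = sum(1 for poses in complexes if any(rmsd <= cutoff for _, rmsd in poses))
    return hits / len(complexes)

# Toy data: (score, RMSD in Å) for the poses of three complexes.
complexes = [
    [(-10.0, 1.2), (-8.0, 5.0)],  # best-scored pose is crystal-like
    [(-9.0, 4.0), (-7.0, 1.5)],   # crystal-like pose exists but is not selected
    [(-6.0, 3.0)],                # no crystal-like pose was generated
]
selected = selection_rate(complexes)
produced = success_rate(complexes)
```

The second complex shows why the selection rate can never exceed the success rate: a crystal-like pose was generated but the scoring function ranked another pose first.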

2.3.1. Benchmarking Set of Choice.

A wide variety of benchmarking sets are available to evaluate docking methods, most of which are derived from the Protein Data Bank71. We evaluated the CANDOCK hierarchical docking algorithm using a benchmarking set (1) to determine whether the algorithm can reproduce the crystal binding pose of the ligand in the binding site of the protein and (2) to correlate the scores of the three-dimensional (3D) docked poses of the ligand to the measured Kd/Ki values of the ligand binding with the protein. The PDBbind benchmark72,73 is very well suited for this analysis because, for each protein in this set, it provides 3D coordinates and corresponding activity values for five protein-ligand complexes. In the CASF-2016 benchmarking set (also referred to as the PDBbind Core set v2016), there are a total of 285 such complexes for 57 proteins of interest to the medicinal chemistry community. This benchmarking set includes decoy poses, which are used to validate our scoring functions independently of the CANDOCK algorithm. The number of fragments present in a given ligand ranges from one to thirteen, enabling an evaluation of our method on both rigid and flexible ligands.

In addition to CASF-2016, we have also benchmarked our method against the Astex Diverse set74 as several protein-ligand complexes in this set include metal ions and other cofactors, allowing us to showcase these examples and assess how our algorithm handles these particular cases. We obtained each structure from the Astex set from the Protein Data Bank directly and only considered the biological assembly used to create the original benchmark. Additionally, to ensure that CANDOCK can generate native-like poses when not given the crystallographic coordinates of a ligand as input, we generated the 3D structure of each ligand from its SMILES string using Molconverter75 and compared these results to those obtained when the original crystallographic coordinates were used.

To evaluate the performance of CANDOCK against non-cognate protein structures, we have included benchmarking examples from the PINC is Not Cognate (PINC) benchmarking set76. From this set, we have chosen 6 target cases to evaluate CANDOCK: beta-secretase 1, carbonic anhydrase II, cyclin-dependent kinase 2, MAP kinase 14, PTP1b, and PPAR gamma. For this benchmarking set, multiple ligands with known crystallographic poses are supplied for a given target along with 5 example proteins crystallized with different ligands. The goal of this benchmark is to obtain the crystal pose of the supplied ligands in these non-cognate protein crystal structures.

2.3.2. Input Preparation.

The binding site for both benchmarking sets is defined by spheres of radius 4.5 Å centered on each atom of the crystal ligand. We did not remove any cofactors, solvent molecules, ions, or glycans when preparing our docking runs. The provided reference ligand was used to generate fragments and seeds for docking. The Astex benchmark was run again using input ligand coordinates generated using only the SMILES representation of the molecule and the Molconverter package from Chemaxon75.

2.3.3. Parameters Chosen for Benchmarking.

The most important parameter present in CANDOCK for linking seeds into ligands is the ‘Top Seed Percent’ parameter as it is crucial to selecting the number of seeds used to generate potential conformations via the maximum clique algorithm67. If this number is too small, then there will not be enough potential conformations generated to sample the conformational space of the ligand properly. In fact, there is a possibility that no conformations are generated during the linking step, causing CANDOCK to fail to produce any conformations. If the ‘Top Seed Percent’ is too large, then the conformational search space is too large, and CANDOCK will become computationally inefficient (especially in the case of fully-flexible protein docking). Therefore, we wanted to sample potential ‘Top Seed Percent’ values to determine how well our method does at various levels of conformational space sampling. The values chosen for this parameter are 0.5%, 1.0%, 2.0%, 5.0%, 10%, 20%, 50%, and 100%. Default values of all the parameters used in the algorithm are listed in Table S10.
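As an illustration of how a ‘Top Seed Percent’ cutoff limits the seeds passed to the linking step, the following sketch keeps the best-scoring fraction of a seed list. It assumes that lower scores are better and that the count is rounded up so that at least one seed survives; both the function name and the rounding rule are assumptions, not CANDOCK's documented behavior.

```python
import math

def top_seeds(scored_seeds, top_seed_percent):
    """Keep the best-scoring fraction of docked seeds.

    scored_seeds: list of (seed_id, score); lower scores are assumed better.
    top_seed_percent: percentage in (0, 100].
    Rounding up so that at least one seed is kept is an assumption.
    """
    if not 0 < top_seed_percent <= 100:
        raise ValueError("top_seed_percent must be in (0, 100]")
    keep = max(1, math.ceil(len(scored_seeds) * top_seed_percent / 100))
    return sorted(scored_seeds, key=lambda seed: seed[1])[:keep]

# Hypothetical list of 200 docked seeds with made-up scores.
seeds = [(f"seed{i}", float(i)) for i in range(200)]

# The 'Top Seed Percent' values sampled in this benchmark.
for percent in (0.5, 1.0, 2.0, 5.0, 10, 20, 50, 100):
    print(percent, len(top_seeds(seeds, percent)))
```

With 200 seeds, a value of 0.5% keeps a single seed, which shows why very small values can leave too few seeds to link into a complete ligand.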

Similar to the conformational space sampled, we also investigated the effect of protein flexibility on the ability of the CANDOCK algorithm to reproduce the binding pose of a ligand. Accordingly, we used the algorithm in three modes: no protein flexibility (no energy minimization performed, maximum final iterations set to zero), with semi-flexible protein (final energy minimization only, default options), and with a fully flexible protein (iterative energy minimization performed, iterative flag turned on). The RMSDs for all poses generated from all ‘Top Seed Percent’ values and all flexibility modes are calculated with respect to the experimental crystal pose using a symmetry independent method.
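The text does not specify which symmetry-independent RMSD method was used. One common approach, sketched here as an assumption, takes the minimum RMSD over all atom mappings that relabel chemically equivalent atoms. The mappings are supplied by hand in this example; enumerating them (for instance through graph automorphisms) is left out, and no superposition is performed because docked poses share the receptor frame.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered coordinate lists (no fitting)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def symmetry_corrected_rmsd(pose, reference, mappings):
    """Minimum RMSD over mappings of symmetry-equivalent atoms.

    Each mapping is an index permutation: reference atom i is compared
    with pose atom mapping[i].
    """
    return min(rmsd([pose[j] for j in m], reference) for m in mappings)

# Hypothetical three-atom example in which atoms 1 and 2 are equivalent.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
pose = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mappings = [(0, 1, 2), (0, 2, 1)]

naive = rmsd(pose, reference)
corrected = symmetry_corrected_rmsd(pose, reference, mappings)
print(naive, corrected)
```

Here the naive RMSD is nonzero only because two equivalent atoms are labeled in the opposite order, while the symmetry-corrected value is zero.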

Finally, we determined the best scoring function to select the pose from all generated poses that best reproduces the crystal ligand pose (the ‘selector’ scoring function) and potentially differentiate it from another scoring function used to rank the activity of a given ligand against the protein target of interest (the ‘ranker’ scoring function). To do this, we calculated the score of all poses generated for CASF-2016 using all scoring functions described in section 2.1. We then evaluated the ability of each scoring function to select the crystal pose of a ligand from all poses as well as the correlation between the score assigned to the selected pose and the experimental binding affinity. As there are 96 scoring functions, there are 9216 (96 ways to select by 96 ways to rank) different methods to rank the affinity of the ligands in CASF-2016. An overview of this benchmarking process for activity prediction is given in Figure 5.

Figure 5:

CANDOCK activity evaluation pipeline. Sampling is performed using the RMR6 scoring function to generate thousands of ligand poses. The best pose is selected with a ‘selector’ scoring function to represent the protein-ligand complex. Only this selected pose is rescored using the ‘ranker’ scoring function, which is used to assign a new score to the complex. The best ranker score on the selected pose is used to rank the protein-ligand complex based on correlation with pKd/pKi data.
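The selector/ranker scheme can be sketched in a few lines: the selector picks one pose per complex, and only that pose's ranker score is kept. The data layout, the function name, and the placeholder scoring-function names below are hypothetical, and lower scores are assumed to be better.

```python
from itertools import product

def select_and_rank(poses, selector, ranker):
    """Pick the pose with the best (lowest) selector score and return
    that pose's ranker score."""
    best = min(poses, key=lambda pose: pose["scores"][selector])
    return best["scores"][ranker]

# Hypothetical scores for two docked poses of one complex.
poses = [
    {"id": "pose1", "scores": {"RMR6": -120.0, "RMC15": -310.0}},
    {"id": "pose2", "scores": {"RMR6": -150.0, "RMC15": -290.0}},
]
value = select_and_rank(poses, selector="RMR6", ranker="RMC15")
print(value)

# With 96 scoring functions, every (selector, ranker) pair can be enumerated.
# The names are placeholders, not the names used in this work.
functions = [f"SF{i:02d}" for i in range(96)]
pairs = list(product(functions, repeat=2))
print(len(pairs))
```

Note that pose2 is selected because of its RMR6 score even though pose1 has the better RMC15 score, which is the point of separating selection from ranking.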

3. Results and Discussion

We discuss the performance of the CANDOCK algorithm in reproducing the crystal pose of a ligand via sampling the conformational space of the ligand in the binding pocket (including the entire chemical environment with cofactors, metal ions, crystal waters, etc.) modeled with different levels of protein flexibility for two benchmarking sets. In addition, we evaluate the ability of the algorithm to discriminate the crystal pose from all poses generated by the algorithm, and the ability to rank the activity of the ligands against the protein targets of interest.

3.1. Knowledge-based Scoring Functions Perform Well on the Decoys Present in the CASF-2016 Benchmark

Before evaluating the ability of the CANDOCK algorithm to reproduce the crystal pose of a ligand in the binding pocket of a protein as measured by success rate, we first show that the scoring functions perform well at selecting a crystal-like pose from the decoy poses provided by the CASF-2016 benchmark set73. First, we evaluated our 96 scoring functions on the ‘docking power’ test provided by the CASF-2016 benchmark. ‘Docking power’ is the ability of a scoring function to select a pose within 2.0 Å of the crystal pose and is synonymous with selection rate, with the exception that docking power is measured on poses not generated by CANDOCK. Our results show that the RMR5 and the RMR6 scoring functions outperform all the others with success rates of 87% and 86%, respectively, when the crystal pose is not included with the decoys (see Table S2 for other scoring functions). When the crystal pose is included, the docking powers increase to 95% and 94%, respectively (see Table S3 for details). These values outperform all other scoring functions in the original CASF-2016 paper73. Moreover, our best performing scoring functions (RMR5 and RMR6) also outperform a recently introduced machine learning-based scoring function77. It should be noted that the performance of those scoring functions is within the statistical error of both RMR5 and RMR6 (compare the first three columns of Table 1 and Table S4 to Tables S8–S9 published for the CASF-2016 benchmark73), suggesting that our scoring functions perform at least as well as the best scoring functions benchmarked in the original work.
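A minimal sketch of how a docking power value can be computed from scored decoys is given below. The data are hypothetical, lower scores are assumed to be better, and treating an RMSD of exactly 2.0 Å as a success is an assumption of this sketch.

```python
def docking_power(complexes, cutoff=2.0):
    """Percentage of complexes whose best-scored (lowest-score) pose has an
    RMSD to the crystal pose of at most `cutoff` Å."""
    hits = 0
    for poses in complexes.values():
        best_score, best_rmsd = min(poses, key=lambda pose: pose[0])
        if best_rmsd <= cutoff:
            hits += 1
    return 100.0 * hits / len(complexes)

# Hypothetical (score, RMSD in Å) pairs for the decoys of three complexes.
complexes = {
    "complexA": [(-95.0, 1.2), (-80.0, 6.3)],
    "complexB": [(-60.0, 4.8), (-55.0, 0.9)],
    "complexC": [(-70.0, 2.0), (-65.0, 3.1)],
}
power = docking_power(complexes)
print(power)
```

For complexB a crystal-like decoy exists but is not the best-scored pose, so it counts as a failure of the scoring function rather than of sampling.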

Table 1.

Statistics shown for the ‘docking power’ (selector only), ‘scoring power’ (as measured by Pearson correlation between the ranker and binding affinity), and ‘ranking power’ (as measured by Spearman correlation between the ranker and binding affinity) tests, where the RMR5 and RMR6 scoring functions are used to select the representative pose from the CASF-2016 decoys and the RMC12, RMC13, RMC14, RMC15, FMC12, FMC13, FMC14, and FMC15 scoring functions are used to rank the pose. The RMSD of the decoy is included as an additional selector to show that knowledge of the RMSD is not required to achieve the best correlation and is within the error of ranking a pose selected by a scoring function.

Selector  Native pose included  Docking power (%)  Ranker  Scoring power  Ranking power
RMR5      No                    84.0–90.0          RMC12   0.5667–0.6766  0.4750–0.6625
          Yes                   92.0–96.0          RMC13   0.5660–0.6779  0.4804–0.6625
                                                   RMC14   0.5642–0.6755  0.4900–0.6750
                                                   RMC15   0.5577–0.6712  0.4696–0.6661
                                                   FMC12   0.5634–0.6748  0.4875–0.6661
                                                   FMC13   0.5637–0.6764  0.4875–0.6679
                                                   FMC14   0.5624–0.6742  0.4857–0.6714
                                                   FMC15   0.5565–0.6692  0.4696–0.6661
RMR6      No                    83.0–90.0          RMC12   0.5621–0.6734  0.4536–0.6393
          Yes                   92.0–96.0          RMC13   0.5637–0.6745  0.4643–0.6446
                                                   RMC14   0.5618–0.6721  0.4732–0.6589
                                                   RMC15   0.5553–0.6678  0.4500–0.6411
                                                   FMC12   0.5617–0.6727  0.4625–0.6482
                                                   FMC13   0.5590–0.6716  0.4696–0.6500
                                                   FMC14   0.5600–0.6713  0.4696–0.6518
                                                   FMC15   0.5533–0.6661  0.4500–0.6429
RMSD                                               RMC12   0.5688–0.6722  0.3446–0.5250
                                                   RMC13   0.5634–0.6680  0.3405–0.5214
                                                   RMC14   0.5560–0.6624  0.3429–0.5179
                                                   RMC15   0.5502–0.6569  0.3357–0.5125
                                                   FMC12   0.5643–0.6685  0.3464–0.5250
                                                   FMC13   0.5622–0.6668  0.3482–0.5232
                                                   FMC14   0.5558–0.6613  0.3393–0.5161
                                                   FMC15   0.5496–0.6557  0.3339–0.5143

Using the selector/ranker methodology described in Figure 5, we used both RMR5 and RMR6 as selectors and 12 other scoring functions (RMC10, RMC11, RMC12, RMC13, RMC14, RMC15, FMC10, FMC11, FMC12, FMC13, FMC14, and FMC15) as rankers for the ‘scoring power’ and ‘ranking power’ tests for binding affinity as described in the original CASF-2016 paper73. Additionally, the RMSD of the provided decoy pose is used as a selector to test whether knowledge of the crystal pose is needed for adequate ranking and scoring. The corresponding Pearson and Spearman correlation coefficients are given in Table 1, and Tables S5 and S6. The best selector/ranker pair for the provided decoys is RMR5/RMC13 with a Pearson correlation of 0.626 (confidence interval of [0.566 – 0.6779]). This result places this correlation within statistical error of the best published non-machine learning scoring functions73,77. For the ranking test, the best combination is RMR5/RMC14 with a Spearman correlation of 0.5964 (confidence interval of [0.49 – 0.675]), a result which places our scoring functions within the top 10 non-machine learning scoring functions and within statistical error of the best scoring function. It should be noted that all the selectors chosen for this analysis (see Table 1) perform within statistical error of each other, indicating that the family of scoring functions with large cutoffs, a mean reference state, and a complete reference for the protein-ligand complex is well suited for ranking ligand affinities.
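The two correlation measures used for the ‘scoring power’ and ‘ranking power’ tests can be written out in plain Python as follows. The data are hypothetical, and negating the ranker scores so that larger values mean stronger predicted binding is an assumption of this sketch; the confidence intervals reported in Table 1 are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: negated ranker scores and pKd/pKi for five ligands.
neg_scores = [310.0, 290.0, 355.0, 270.0, 400.0]
affinities = [6.1, 5.9, 7.4, 4.8, 7.0]
r_pearson = pearson(neg_scores, affinities)
r_spearman = spearman(neg_scores, affinities)
print(r_pearson, r_spearman)
```

The Spearman value depends only on the ordering of the ligands, which is why it is the natural measure for ranking power.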

3.2. Ligand Conformational Sampling is Enhanced by Fragment Docking and Protein Flexibility

An important feature of any receptor-ligand docking methodology is its ability to generate docked crystal-like ligand poses within 2.0 Å RMSD of the experimentally determined pose of the native ligand78,79. Using the CASF-2016 benchmarking set, we validated the ability of CANDOCK to generate crystal-like poses among the docked poses. We plotted the cumulative frequencies of all docked poses with the RMSDs from their corresponding crystal ligands’ poses for all ‘Top Seed Percent’ values and for varying degrees of protein flexibility using the RMR6 scoring function (Figure 6; left-hand panels). As expected, these plots indicate that the use of larger (>20%) ‘Top Seed Percent’ values generated significantly more poses within 2.0 Å than lower (<10%) ‘Top Seed Percent’ values. For the semi-flexible (Figure 6c) method, the ‘Top Seed Percent’ value of 20% yielded the highest number of poses within 2.0 Å of the crystal pose, with a corresponding cumulative frequency of ~91%, compared to an independent benchmark of the best performing methods, which reported a ~80% success rate in generating the pose37. The semi-flexible method thus outperformed the rigid protein (Figure 6a) and the fully flexible (Figure 6e) methods for the larger ‘Top Seed Percent’ values that correlate with higher sampling of the ligand conformational space during fragment docking. However, the fully flexible protein method outperformed the semi-flexible (Figure 6c) and the rigid protein (Figure 6a) methods for smaller ‘Top Seed Percent’ values such as 5% and 10%. In addition, the Boltzmann-like distributions in the RMSD plots (Figure S2) indicate that the CANDOCK algorithm adequately sampled the ligand conformations both far from and close to the crystal ligand pose in CASF-2016. This suggests that the prediction of energetically favorable ligand conformations is dependent on near-native protein flexibility during the linking of docked fragments.
There are only 17 co-crystal structures (out of 285) where the semi-flexible algorithm failed to find a single crystal-like pose for the native ligand (1H22, 1H23, 1NVQ, 1U1B, 1YDT, 2P15, 2QNQ, 3AG9, 3BV9, 3KWA, 3O9I, 3PRS, 3UEU, 3URI, 3ZSO, 4EA2, 5C2H) for any ‘Top Seed Percent’ value. For an additional 9 complexes (2C3I, 2CET, 2W66, 2WCA, 3ARU, 3BGZ, 3OZT, 3RR4, 3UEX), the semi-flexible algorithm failed to find a crystal-like pose when used with a ‘Top Seed Percent’ value of 20%. Two of these complexes (3BV9, 3URI) contain a peptide ligand bound to a protein, a situation generally treated differently in other docking studies37. When fully-flexible docking is considered, CANDOCK fails on a total of 10 complexes out of 285, resulting in an overall success rate of ~96% in generating crystal-like poses. Specifically, CANDOCK generates successful (crystal-like) poses for 7 of the 17 complexes that failed with semi-flexible docking (3O9I, 2QNQ, 1YDT, 3ZSO, 5C2H, 3UEU, and 4EA2), and 2P15 becomes a near hit with an RMSD of 2.04 Å. These results indicate that hierarchical generation of the ligand poses, with protein flexibility considered after fragment docking and ligand reconstruction, is a successful strategy for enhanced sampling of the conformational space of ligands in protein-ligand complexes.

Figure 6:

Cumulative frequency of the best RMSD pose generated by rigid (flexible ligand only with no energy minimization of protein-ligand complex), semi-flexible (energy minimization of protein-ligand complex at the end), and fully-flexible (iterative energy minimization during linking procedure) CANDOCK docking for the 285 protein-ligand complexes in CASF-2016 using the RMR6 scoring function is given in (a), (c), and (e), respectively. The selection rate, i.e., the portion of the best-scored docked poses within 2.0 Å of the crystal pose, is given for the different scoring functions employed in (b), (d), and (f).
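The cumulative frequency plotted in the left-hand panels can be computed from the lowest RMSD obtained for each complex, as in this minimal sketch. The RMSD values are hypothetical and the function name is illustrative only.

```python
def cumulative_frequency(best_rmsds, thresholds):
    """Percentage of complexes whose lowest-RMSD pose is within each threshold."""
    n = len(best_rmsds)
    return [100.0 * sum(r <= t for r in best_rmsds) / n for t in thresholds]

# Hypothetical lowest RMSD (Å) per complex among all generated poses.
best_rmsds = [0.6, 1.1, 1.9, 2.04, 3.5]
curve = cumulative_frequency(best_rmsds, thresholds=[1.0, 2.0, 4.0])
print(curve)
```

The value of the curve at 2.0 Å corresponds to the success rate of generating a crystal-like pose; a complex with a best RMSD of 2.04 Å is a near hit that does not count.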

3.3. Radial Mean Reduced (RMR) Scoring Function Family Generates Best Docked Ligand Poses

The RMR family of scoring functions at the cutoff radius value of 6 Å from each atom of the ligand (RMR6) performed best for the semi-flexible protein method (Figure 6; right-hand panels). The best selector scoring function for the rigid protein method was RMR8, and for the fully-flexible protein method it was RMR5. This shows that the RMR scoring function family is the best selector among the 8 other generalized families of scoring functions. Conversely, the Radial Cumulative Complete (RCC) scoring function family performed the worst in selecting the crystal pose from the generated poses, with the RCC11 scoring function being the overall worst selector.

To elucidate the rationale behind the good performance of RMR6 in selecting a crystal-like pose, we plotted the RMR6 score of the docked ligands with lowest RMSD from the crystal pose against the RMR6 score of the crystal pose (Figure S3). For ‘Top Seed Percent’ values >10%, there is a clear separation between the successful poses within 2.0 Å (blue points) and the failed poses far from the crystal ligand pose (red points). Moreover, these failed poses cluster above the diagonal line, indicating that the RMR scores of failed complexes have a higher energy value (as expected) than the crystal pose during sampling for ‘Top Seed Percent’ values >10% (Figure S3). The number of failed poses decreases with increasing ‘Top Seed Percent’ up to 20% and then rises slightly: 244 for 0.5%, 218 for 1.0%, 178 for 2.0%, 97 for 5.0%, 46 for 10%, 26 for 20%, 30 for 50%, and 32 for 100%. These data suggest that a ‘Top Seed Percent’ of 20% yields the highest number of poses within 2.0 Å of the crystal pose (previous section, Figure 6; left-hand panels) and that the failed cases are rare and clearly discriminated from both the crystal pose and the successful near-native docked poses (blue points) by the RMR6 scores. Therefore, RMR6 can discriminate native and near-native interactions from a set of incorrect conformations generated by our docking method. Furthermore, the RMR6 scoring function is a decent selector, as its top pose selection rate of 41% for semi-flexible docking at a ‘Top Seed Percent’ of 20% (Figure 6; right-hand center panels) is comparable to the state-of-the-art independent benchmarks37. Clearly, for these successful cases, the best (most negative) RMR6 score corresponds to a pose within 2.0 Å RMSD of the crystal pose (Figure S4). However, RMR6 has a bias towards incorrectly scoring a non-crystal-like pose better than the experimental crystal pose for both successful and failed cases (blue and red points, respectively, are below the diagonal in Figure S5).
If we include predicted poses other than the best scored pose, then we get a much higher selection success rate of 55% when the top 2 poses are selected, 69% when the top 5 poses are selected, and 76% when the top 10 poses are selected. While the RMR6 scoring function is a decent selector, more work is needed to enhance the selection success rate, perhaps in combination with other scoring functions at different cut-offs along with machine learning methods80,81. However, it is worth noting that, without any machine learning, our generalized RMR6 scoring function is comparable in successfully selecting a pose to a recently published neural network-based scoring selection82 with a selection rate of ~50% for the top pose and ~65% for the top 5 poses. This suggests that a reduced composition over all pairwise protein- and compound-specific atom types with a mean reference state improves discriminatory accuracy by giving ‘context’ to the specific pose by solely including atom type interactions that are possible between the receptor and the ligand.

3.4. Docking Long Aliphatic Chains Needs Enhanced Sampling

We identified six complexes (1H22, 1H23, 3AG9, 3KWA, 3UEU, and 4EA2) out of the 17 cases that failed with the CANDOCK semi-flexible algorithm whose ligands contain long aliphatic carbon chains (greater than 4 atoms). The remaining 11 complexes that fail are 3URI (8-mer peptide), 3O9I, 1U1B, 2QNQ, 3BV9 (6-mer peptide), 3PRS (14 fragments), 1YDT, 1NVQ, 2P15, 5C2H, and 3ZSO. If fully-flexible protein docking is considered, 4 of the 10 failed cases contain long aliphatic carbon chains (1H22, 1H23, 3AG9, 3KWA). CANDOCK does not consider aliphatic chains consisting of three carbon atoms (sp3 hybridized carbon; C3) as fragments for docking (see Materials and Methods). Instead, the A* search algorithm determines the docked positions by rotating them around the bond vectors of the growing chain at 60° increments. We hypothesize that this discrete sampling of conformational space, and not the potential functions in CANDOCK, is the cause of the poor performance of the algorithm on these compounds with many rotatable bonds. To test our hypothesis for the six failed long aliphatic carbon chain complexes (1H22, 1H23, 3AG9, 3KWA, 3UEU, and 4EA2), we scored the decoys provided by the CASF benchmarking set73 that included at least one pose within 2.0 Å RMSD. In all 6 cases, the RMR6 scoring function selected a pose within 2.0 Å RMSD of the crystal ligand, indicating that our generalized scoring function is not responsible for the failure to identify crystal-like conformations (Figure S6). We plan to address this issue in detail in future versions of the algorithm by implementing a new sampling method or a ligand-class specific scoring function, similar to what was done separately for the support of carbohydrates in AutoDock Vina83.
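To illustrate why sampling at 60° increments becomes limiting for long chains, the sketch below enumerates the discrete torsion grid for a given number of rotatable bonds. It is not the A* search itself, which explores this space selectively; it only shows how the grid grows and that torsion angles between grid points (for example 30°) cannot be represented. The function name is hypothetical.

```python
from itertools import product

def torsion_grid(n_rotatable_bonds, step_degrees=60):
    """Return an iterator over every combination of dihedral angles on a
    regular grid with the given step."""
    angles = range(0, 360, step_degrees)  # 0, 60, ..., 300 for a 60° step
    return product(angles, repeat=n_rotatable_bonds)

# Number of grid conformations for chains with 1 to 6 rotatable bonds.
counts = {n: sum(1 for _ in torsion_grid(n)) for n in range(1, 7)}
print(counts)
```

The count grows as 6 to the power of the number of rotatable bonds, so a finer angular step quickly becomes expensive for long aliphatic chains.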

3.5. Protein Flexibility Improves Docking Ligands with Many Rotatable Bonds

The number of rotatable bonds in a ligand significantly influences the ability of docking algorithms to generate docked crystal-like ligand poses37. To study the effect of rotatable bonds on the performance of the algorithm, we computed the selection rate of the RMR6 scoring function against the number of fragments in a ligand (Figure 7). Due to the hierarchical fragment-based nature of the CANDOCK algorithm, the number of ligand fragments is used instead of the number of rotatable bonds to measure CANDOCK’s performance. By comparing the fully-flexible protein method (Figure 7c) to the rigid protein method (Figure 7a) and to the semi-flexible method (Figure 7b), we show that the selection rate for flexible ligands increases when protein flexibility is included during docking. Here, we define a flexible ligand as one with greater than 4 total fragments, as the average number of fragments is 3.8 and the median is 3 fragments in the CASF-2016 dataset. Specifically, for the 216 ligands with four or fewer fragments, the semi-flexible (Figure 7b) and the fully-flexible (Figure 7c) methods performed equally well. The rigid, semi-flexible and fully flexible methods have respective selection rates of 50%±3.5%, 66%±3.2%, 65%±3.2% for the top pose; 65%±3.3%, 76%±2.9%, 77%±2.9% when the top 2 poses are selected; 74%±3.0%, 83%±2.6%, 86%±2.4% when the top 5 poses are selected; and 79%±2.8%, 88%±2.2%, 91%±2.0% when the top 10 poses are selected. Thus, full protein flexibility is not essential for ligands with fewer than 5 fragments, as there is little difference in selection rate between semi-flexible and fully-flexible docking (Figure 7b,c).
In contrast, for the 69 ligands with greater than 4 fragments, the rigid, semi-flexible and fully flexible methods have respective mean selection rates of 29%±5.6%, 54%±5.9%, 54%±6.0% for the top pose; 35%±5.8%, 64%±5.7%, 68%±5.7% when the top 2 poses are selected; 47%±6.0%, 75%±5.2%, 79%±4.9% when the top 5 poses are selected; and 53%±6.1%, 77%±5.1%, 87%±4.1% when the top 10 poses are selected. Better performance of flexible methods versus the rigid method for larger ligands is most likely caused by the plateauing and even slight decline in the number of poses generated for ligands with >5 fragments for ‘Top Seed Percent’ values >10% (Figure S7). This suggests there is an upper limit to the sampling space possible for a given binding site and for a given ligand, and once this limit is reached, the algorithm is no longer able to produce more docked ligand poses. From the values given, it is clear that the semi-flexible and fully-flexible methods are superior to the rigid method. However, while it is difficult to determine a direct superiority of the fully-flexible method over the semi-flexible method for the top pose through the top 5 poses, the fully-flexible method outperforms the semi-flexible method when considering the top 10 poses. Therefore, we conclude that protein flexibility is an important feature of the CANDOCK algorithm.

Figure 7:

Selection rates for the RMR6 scoring function with rigid (a), semi-flexible (b), and fully-flexible (c) CANDOCK docking arranged by the numbers of ligand fragments in CASF-2016 (see Figure 2 for the definition of a fragment). For fragment counts greater than 13 (3URI, 3AG9, and 3PRS), CANDOCK did not produce any poses within 2.0 Å of the crystal pose.

3.6. Inclusion of Chemical Environment and Cofactor Interaction in Binding Sites Leads to Accurate Crystal-like Ligand Pose Generation

The Astex Diverse Set74 is a widely used benchmarking set for measuring a docking program’s ability to predict the native pose of a ligand. One important feature of this set, compared to CASF-201673, is the inclusion of several cofactors and metal ions, such as zinc ions and heme groups, in the binding sites. Traditionally, docking methods have ignored the cofactors in the binding pockets or treated them as non-physical models with improper representations that affected performance73. As an example, for heme groups, we used a previously published extension to the GAFF force field to ensure proper representation of this cofactor during the minimization procedure84, whereas other methods treat it as a hydrogen bond donor24. We hypothesize that in order to perform well on this benchmarking set, the docking algorithm must properly sample ligand conformations interacting with metal ions, and doing so requires adequate representation of metal-ligand interaction potentials at the atomic scale. A generalized potential function can include all relevant cofactors, metal ions, etc. in the binding pocket as separate interactions (Figure S8), compared to the single metal-ion type used by others24,73. To highlight the ability of our scoring function to characterize such interactions in a pair-wise fashion, we plotted various atom pair interactions of interest to medicinal chemists (Figure S8).

The number of complexes in this benchmarking set where the CANDOCK algorithm produces a ligand pose within 2.0 Å RMSD of the crystal pose is given in Table 2. CANDOCK successfully generates a crystal pose for 97.6% of the Astex benchmarking set (83 out of the 85 complexes). We attribute this success to the ability of our algorithm to properly sample the conformational space of the ligand in the binding pocket while considering all interactions of the ligand within the binding pocket, including cofactors, metal ions, etc. In a recent comparison using the Astex dataset31, the success rates for FlexAID31, AutoDock Vina24, FlexX85, and rDock49 were 66.7%, 81.8%, 78.8%, and 89.4%, respectively, when all 85 complexes were considered. When 16 complexes containing a metal ion were removed (1GKC, 1HP0, 1HQ2, 1HWW, 1JD0, 1JJE, 1LRH, 1MZC, 1OQ5, 1R1H, 1R55, 1R58, 1UML, 1XM6, 1XOQ, 1YQY), the success rates of these methods increased to 72.1%, 83.6%, 79.7%, and 91.3%, respectively31. CANDOCK outperforms these methods without removing metal-ion complexes from the benchmarking set, supporting the hypothesis that adequate sampling and proper representation of interactions within the binding site are required. The two complexes where CANDOCK narrowly failed to generate a crystal pose using the semi-flexible method are 1HP0 (lowest RMSD of 2.08 Å) and 1W1P (lowest RMSD of 2.734 Å). Additionally, when the protein is considered as a rigid body (rigid docking), CANDOCK failed to find crystal poses for 1Y6B and 1MZC as well (81 out of 85 complexes in Table 2). The algorithm also performs well on complexes for which other popular docking methodologies failed on the Astex Diverse set. According to a previous study31, there are four complexes (1G9V, 1GM8, 1JD0, and 1MEH) where AutoDock Vina24, rDock49, FlexX85, and FlexAID31 all have difficulty reproducing the crystal-like pose of the ligand, but CANDOCK successfully generated a crystal-like pose.
CANDOCK is able to select a crystal-like pose 52% of the time for the top scored pose, 60% of the time in the top 2 poses, 66% of the time in the top 3 poses, 75% of the time in the top 5 poses, and 79% of the time in the top 10 poses.

Table 2.

Number of successes in the Astex Diverse set for all values of ‘Top Seed Percent’, a CANDOCK parameter used to create ligand templates for docking. There is a total of 85 protein-ligand complexes in this benchmarking set. The columns for ‘Rigid Docking’ and ‘Semi-Flexible Docking’ show the number of Astex protein-ligand complexes (out of 85) for which CANDOCK produced a pose within 2 Å for each ‘Top Seed Percent’ value. CANDOCK outperforms Vina in generating a pose for the Astex benchmark when a ‘Top Seed Percent’ greater than 20% is used. When ‘All Ligand Poses’ (all docked ligand poses generated using the different ‘Top Seed Percent’ values) are considered, CANDOCK performs best using both rigid and flexible docking methods. Unlike CANDOCK, AutoDock Vina’s ability to generate a pose within 2 Å is dependent on the initial conformation of the ligand, as shown by the difference in success when the ‘Crystal Ligand Pose’ is used compared to a ‘Non-native Ligand Pose’.

CANDOCK Top Seed Percent  Rigid Docking                 Semi-Flexible Docking
0.5%                      7                             7
1.0%                      14                            15
2.0%                      28                            33
5.0%                      57                            60
10%                       67                            74
20%                       77                            79
50%                       79                            82
100%                      78                            81
All Ligand Poses          81                            83
                          Input as Crystal Ligand Pose  Input as Non-native Ligand Pose*
Vina                      79                            68

* OpenBabel91 was used to change the ligand conformation of the crystal pose for AutoDock Vina.

When CANDOCK was given starting coordinates generated from the SMILES string of the Astex ligands using Molconverter75, it produced a crystal-like pose for 77 out of the 85 complexes. Compared to running CANDOCK with 20% of the docked seeds and the crystallographic coordinates as input, there are three additional failures: 1M2Z, 1XM6, and 1XOZ. Conversely, 1MCZ was docked successfully when using coordinates generated from a SMILES string, whereas the best RMSD when using the crystallographic input ligand was a near-hit with a value of 2.15 Å. The three additional failed complexes all have large ring structures, which cause fewer than 100 seeds to be created after fragment docking. Decreasing the clustering radius for the clustering step of the linking phase resulted in crystal-like poses for all three complexes, and a similar strategy yielded a crystal-like pose when applied to 1HP0. Therefore, we conclude that the CANDOCK algorithm performs equally well when given non-crystallographic coordinates, provided that large rings are accommodated in the clustering step of the linking phase. CANDOCK’s performance with non-native ligand inputs is in contrast to that of Vina, where the use of non-native coordinates yields a crystal pose for only 68 out of the 85 complexes as compared to 79 out of 85 when crystallographic coordinates are used (Table 2).

The interactions of the ligand with cofactors in the binding pocket for the four complexes on which the other docking methods had difficulty (1G9V, 1GM8, 1JD0, and 1MEH) are shown in Figure S9. Specifically, 1G9V has a cation-π interaction and 1GM8 has π-π interactions between an aromatic ring and the surrounding protein environment. Similarly, 1MEH contains a π-π stacking interaction between the ligand and a cofactor. 1JD0 has an interaction between the zinc ion and a sulfonyl group. These complexes showcase the success of our hierarchical docking method over previously published works.

We also consider specific cases where CANDOCK successfully reproduced the crystal pose of ligands which interact with a cofactor (Figure 8). Specifically, in Figure 8a,b, for oxygen-zinc interactions in 1HWW and 1R55 during docking, the energy minimization procedure moved the location of the Zn2+ ion in the binding pocket (2.4 Å and 1.5 Å, respectively), as there are no constraints to restrict its movement within the binding pocket. This movement does not prevent the algorithm from generating a ligand pose within 2.0 Å RMSD of the native structure. For 1OQ5 and 1JD0, the docked poses of the ligands interact with a zinc ion through a sulfonyl amide group (Figure 8c,d), and it is interesting to note that the zinc ion moved much less in these cases (0.5 Å and 0.6 Å). For the ligand in 1OQ5 (Figure 8c), the orientation of the sulfonyl amide aligns perfectly with the reference crystal pose, suggesting that the interactions with the sulfonyl amide group caused the zinc ion to stay in place. For the ligand in 1JD0 (Figure 8d), the docked pose of the same group does not align with its reference; however, the overall pose is still within 2.0 Å of this reference. Therefore, the ability of the algorithm to produce a pose within 2.0 Å of the reference is not dependent on correctly predicting the orientation of all functional groups in a given molecule.

Figure 8:

Example docked ligand poses from the Astex Diverse set that show the versatility of the CANDOCK algorithm in handling cofactors. In all panels, the reference pose is given in white and the lowest RMSD pose predicted by CANDOCK with a ‘Top Seed Percent’ value of 20% using the semi-flexible method is given in green. Panels (a) and (b) were selected due to the presence of oxygen-zinc interactions between the native ligand and the protein. The zinc ion before and after energy minimization is given in gray and cyan, respectively, showing that the energy minimization moved the zinc ion considerably. The complexes in (c) and (d) show the interactions between sulfonyl amide groups and a zinc ion. The interaction of a compound with a heme group via a nitrogen lone pair is shown in (e) and the interaction of an aromatic carbon with a heme group is given in (f). Finally, panels (g) and (h) show the interactions of compounds with other cofactors, such as a π-π interaction of a compound with flavin-adenine dinucleotide and the interaction of a compound with zinc and magnesium in a binding pocket.

We selected a larger organic cofactor (heme group) in the binding site of the protein-ligand complexes 1P2Y and 1R9O (Figure 8e–h). The heme group is present in several liver enzymes86–88; therefore, predicting the location of a ligand relative to this group is important for medicinal chemistry. For 1P2Y, CANDOCK predicts the pose of a compound relative to the heme group when the nitrogen of the compound is interacting with the iron atom of this group (Figure 8e). Similarly, for 1R9O, a successful pose is generated including the interaction between an aromatic carbon and the iron atom (Figure 8f), indicating that proper representation of the heme group is essential to capture such interactions to generate the binding pose. We also demonstrate that generating a crystal-like docked ligand pose in the presence of a large cofactor is independent of the size of the cofactor itself. This is shown for the 1SG0 complex containing the flavin-adenine dinucleotide cofactor (Figure 8g), where the dominant interaction between the ligand and the cofactor is π-π stacking. A crystal-like pose was also reproduced when the type of interaction changed dramatically, as shown in 1XM6 for the binuclear metal center formed by zinc and magnesium ions (Figure 8h). These interactions are important for developing phosphodiesterase inhibitors89; therefore, it is encouraging to observe CANDOCK’s ability to reproduce a crystal pose in these cases. We conclude that the algorithm is able to generate a crystal-like docking pose by including interactions with diverse cofactors in the binding pocket.

3.7. Radial Mean Complete (RMC) scoring function at 15 Å cutoff is best for energy minimization

A potential or scoring function used for energy minimization of a protein and a ligand should correlate quantitatively with the RMSD between the docked ligand and the crystal ligand, so that a decrease in score corresponds to a decrease in RMSD. Therefore, to determine the best minimization function, we calculated these correlations, expressed as the average and the median Pearson correlation coefficients, for all the scoring functions evaluated over CASF-2016 (Table S7). Figure S10 shows that the RMC and FMC scoring function families have the largest correlation with RMSD (the average across all cutoffs is 0.30 units greater than the averages for other scoring functions). Moreover, as the cutoff value for the RMC and FMC scoring functions increased, the correlation also increased from an average of 0.36 at 4 Å to an average of 0.56 at 15 Å, suggesting that including long-range interactions is essential. We also show that the median and the average of these correlation values for the RMC and FMC scoring function families are relatively similar, indicating that the distribution of correlation values is not biased towards high or low correlations for any given protein in the CASF-2016 set. In addition, the RMC15 score of the experimental crystal pose has a strong correlation with the RMC15 score of the lowest RMSD pose (Figure S11, r² > 0.99). Finally, the RMC15 score of the lowest-scoring pose correlates well with the RMC15 score of the crystal pose (Figure S12, r² > 0.95). Taken together, we conclude that the RMC15 scoring function, used in the CANDOCK algorithm to calculate intermolecular forces and energies during the energy minimization of the docked protein-ligand complexes, correlates well with the RMSD from the crystal ligand pose (a few example cases of RMSD vs RMC15 score plots are shown in Figure S13).
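The score-versus-RMSD analysis described above can be illustrated with a minimal sketch. It is not the benchmark code used in this work: the function names (`pearson`, `summarize_score_rmsd`) and the toy scores and RMSD values are hypothetical and chosen only to show how a per-complex Pearson correlation might be computed and then summarized by its average and median.

```python
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def summarize_score_rmsd(poses_by_complex):
    """Correlate score with RMSD within each complex, then summarize.

    poses_by_complex maps a complex ID to a list of (score, rmsd) pairs,
    one pair per docked pose. Returns (average r, median r) over complexes.
    """
    per_complex_r = []
    for poses in poses_by_complex.values():
        scores = [score for score, _ in poses]
        rmsds = [rmsd for _, rmsd in poses]
        per_complex_r.append(pearson(scores, rmsds))
    return mean(per_complex_r), median(per_complex_r)


# Hypothetical toy data (NOT taken from CASF-2016): lower score = better pose.
toy = {
    "complex_a": [(-120.0, 0.8), (-110.0, 1.9), (-95.0, 3.5), (-90.0, 5.1)],
    "complex_b": [(-80.0, 1.2), (-82.0, 2.4), (-70.0, 4.0), (-65.0, 6.3)],
}
avg_r, med_r = summarize_score_rmsd(toy)
print(f"average r = {avg_r:.2f}, median r = {med_r:.2f}")
```

A positive coefficient here means that lower (better) scores accompany lower RMSD values, which is the behavior sought from a minimization function.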

3.8. CANDOCK Can Reproduce the Binding Pose of a Ligand in a Non-cognate Crystal Form

To assess CANDOCK’s ability to reproduce the crystal pose of a small molecule in a holo-protein bound to a different ligand (a non-cognate protein form), we benchmarked CANDOCK against the PINC Is Not Cognate benchmarking set. This benchmark is divided into 12 protein targets, each having 5 crystal structures bound to a ligand and an additional set of ligands with known crystal poses in the target protein. The goal of the benchmark is to reproduce the crystal pose of the provided ligands using the 5 non-cognate protein structures. From the 12 protein targets, we focused on the following 6 targets, as they were previously identified as being difficult to dock (76): Beta-secretase 1, Carbonic Anhydrase II, CDK2, MAPK14, PTP1B, and PPARγ. The cumulative distributions of the best pose produced by CANDOCK are provided for each of these targets in Figure S14. To compare with an established docking procedure, we have also produced the same cumulative distribution plots for AutoDock Vina (24) in Figure S15. The quantification for both programs is provided in Table 3. For each individual protein structure of every target, CANDOCK produces at least as many crystal-like poses as Vina, with the exception of proteins 1 and 2 for Beta-secretase 1 and protein 1 for Carbonic Anhydrase II. In each of these exceptions, Vina produces a crystal-like pose for only a single non-cognate ligand more than CANDOCK. When CANDOCK outperforms Vina, it typically produces twice as many crystal-like poses as Vina and, in one case, it produces 5 times as many poses as Vina (see Protein 3 of PTP1B in Table 3). When considering all 5 proteins for each target, CANDOCK reproduces the crystal pose more frequently than AutoDock Vina, with the notable exception of MAPK14, where Vina produces only two more crystal-like poses than CANDOCK.

Table 3.

Number of successes for six targets in the PINC Is Not Cognate benchmarking set obtained for both CANDOCK and AutoDock Vina. With the exception of MAPK14, CANDOCK is able to find a pose within 2.0 Å of the non-cognate ligand with greater frequency than Vina when considering all proteins for each target.

Target    Protein 1   Protein 2   Protein 3   Protein 4   Protein 5   All   Total

CANDOCK
BACE1     42          33          32          31          27          76    103
CAII      73          103         107         103         105         119   128
CDK2      94          35          108         114         96          124   127
MAPK14    43          41          32          34          47          78    92
PPARγ     37          27          32          34          42          53    62
PTP1B     14          15          20          23          12          33    52

AutoDock Vina
BACE1     43          34          31          10          17          66    103
CAII      74          73          83          71          31          108   128
CDK2      42          12          45          43          35          78    127
MAPK14    43          31          32          25          35          80    92
PPARγ     10          9           13          13          10          29    62
PTP1B     11          6           4           12          9           21    52
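
The comparisons drawn from Table 3 can be checked with a few lines of arithmetic. The sketch below transcribes the counts from the table; the helper names are hypothetical, and reading the "All" column as the number of ligands docked successfully with at least one of the five structures (and "Total" as the number of ligands) is our interpretation of the column headers rather than a definition given in the table.

```python
# Counts transcribed from Table 3: (per-protein successes, "All", "Total").
# "PPARg" stands for PPAR-gamma.
TABLE3 = {
    "CANDOCK": {
        "BACE1": ([42, 33, 32, 31, 27], 76, 103),
        "CAII": ([73, 103, 107, 103, 105], 119, 128),
        "CDK2": ([94, 35, 108, 114, 96], 124, 127),
        "MAPK14": ([43, 41, 32, 34, 47], 78, 92),
        "PPARg": ([37, 27, 32, 34, 42], 53, 62),
        "PTP1B": ([14, 15, 20, 23, 12], 33, 52),
    },
    "Vina": {
        "BACE1": ([43, 34, 31, 10, 17], 66, 103),
        "CAII": ([74, 73, 83, 71, 31], 108, 128),
        "CDK2": ([42, 12, 45, 43, 35], 78, 127),
        "MAPK14": ([43, 31, 32, 25, 35], 80, 92),
        "PPARg": ([10, 9, 13, 13, 10], 29, 62),
        "PTP1B": ([11, 6, 4, 12, 9], 21, 52),
    },
}


def success_rate(program, target):
    """Fraction of ligands counted in the 'All' column out of 'Total'."""
    _, n_all, n_total = TABLE3[program][target]
    return n_all / n_total


def per_protein_ratio(target):
    """CANDOCK/Vina ratio of successes for each of the five structures."""
    candock = TABLE3["CANDOCK"][target][0]
    vina = TABLE3["Vina"][target][0]
    return [c / v for c, v in zip(candock, vina)]


# Targets where Vina has the larger "All" count.
vina_ahead = [
    target
    for target in TABLE3["CANDOCK"]
    if TABLE3["Vina"][target][1] > TABLE3["CANDOCK"][target][1]
]

for target in TABLE3["CANDOCK"]:
    print(
        target,
        f"CANDOCK {success_rate('CANDOCK', target):.2f}",
        f"Vina {success_rate('Vina', target):.2f}",
    )
```

With these counts, the largest per-protein ratio is the five-fold difference for the third PTP1B structure (20 versus 4), and MAPK14 is the only target for which Vina has the larger "All" count (80 versus 78).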

A possible explanation for CANDOCK’s ability to outperform Vina on the PINC benchmark is that the poses generated by CANDOCK do not depend on the input conformation of the ligand. As mentioned in section 2.2.3, the input ligand is fragmented and reassembled in the binding pocket, thereby removing any input conformational bias from the ligand. This allows CANDOCK to create a wide variety of ligand poses (see Figure S2). Conversely, Vina is dependent on the starting conformation of the ligand. For example, in the Astex benchmark, Vina produced a crystal pose for 93% of the target ligands when it was provided the binding pose, but for only 80% when the ligand was minimized before being used as input to Vina. Therefore, we conclude that CANDOCK is superior to Vina for generating poses in the binding site.

3.9. Correlation between docking score and binding affinity is not influenced by the deviation of the scored pose from the native pose

Another critical aspect of the scoring function is the ability to accurately rank the relative binding affinities of known binders to the same protein target. A stringent test of the ranking ability of a scoring function is to dock the compounds to the targets and compare the resulting scores to experimental binding affinities, i.e. without knowing the crystal pose of the ligand. CASF-2016 provides experimental binding affinities (pKi/pKd) and three-dimensional coordinates of 57 protein targets with 5 compounds each, for a total of 285 pKi/pKd values for protein-ligand complexes. We determined the overall correlation between the 285 experimental binding affinities (pKi/pKd) and the docking scores for 285 docked poses selected using each of the generalized scoring functions (docking with a 20% ‘Top Seed Percent’ value using CANDOCK). We found that RMR6, our best ‘selector’ scoring function for selecting the crystal-like pose, does not correlate well with the pKi/pKd values supplied by CASF-2016, with an overall Pearson correlation of −0.275 and Spearman correlation of −0.349. When these correlations are calculated separately over the 57 protein targets (each with 5 compounds) and then averaged, we get an average Pearson correlation of −0.38 and an average Spearman correlation of −0.431. This suggests a need for a different scoring function for scoring the selected crystal-like pose. Therefore, we developed a procedure (Figure 5) to first select the representative docked pose of a complex using one scoring function (selector) and then rank it using another scoring function (ranker) to correlate with the pKi/pKd values. The best ‘ranker’ scoring functions are RMC15 and FMC15 (Figure 9a and 9b), which were selected based on both the Pearson and Spearman correlations of all 96×96 combinations of selector and ranker scoring functions with the experimental pKi/pKd data in CASF-2016. The overall Pearson and Spearman correlations for RMR6 as ‘selector’ and RMC15 as ‘ranker’ are −0.343 and −0.464 (correlations are −0.43 and −0.418, respectively, when averaged over 57 protein targets). It is important to note that the RMC15 score of weak binders in CASF-2016 (pKi < 2.5) does not correlate in the same way as that of other binders (Figure 9c and 9d), as removal of these weak binders improved the correlation between the RMC15 score and binding affinity to an overall Pearson and Spearman correlation of −0.584 and −0.593, respectively.
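
The selector/ranker procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CANDOCK implementation: the data layout, the helper names, and all numerical values are hypothetical, and we assume that a lower score indicates a better pose for both scoring functions. Only the labels RMR6 and RMC15 are taken from the text.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def select_then_rank(complexes, selector, ranker):
    """Pick the pose with the lowest selector score, report its ranker score."""
    picked = []
    for entry in complexes:
        best_pose = min(entry["poses"], key=lambda pose: pose[selector])
        picked.append(best_pose[ranker])
    return picked


# Hypothetical toy data (NOT from CASF-2016).
toy = [
    {"pK": 9.1, "poses": [{"RMR6": -50.0, "RMC15": -900.0}, {"RMR6": -45.0, "RMC15": -950.0}]},
    {"pK": 6.4, "poses": [{"RMR6": -30.0, "RMC15": -600.0}, {"RMR6": -35.0, "RMC15": -640.0}]},
    {"pK": 4.2, "poses": [{"RMR6": -20.0, "RMC15": -400.0}, {"RMR6": -10.0, "RMC15": -300.0}]},
    {"pK": 7.8, "poses": [{"RMR6": -40.0, "RMC15": -700.0}, {"RMR6": -38.0, "RMC15": -720.0}]},
]
rank_scores = select_then_rank(toy, selector="RMR6", ranker="RMC15")
affinities = [entry["pK"] for entry in toy]
print(pearson(rank_scores, affinities), spearman(rank_scores, affinities))
```

Because a larger pKi/pKd denotes tighter binding while a lower score denotes a more favorable pose, a well-behaved ranker gives negative coefficients, as in this toy example.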

Figure 9:

The Pearson (a) and Spearman (b) correlation coefficients between all pairs of selector and ranker scoring functions (arranged by family) and the experimental pKi of the complexes in CASF-2016. Note that a negative correlation between score and pKi/pKd is expected, as the ‘p’ operator introduces a negative sign to the affinity (the smaller the Ki, the larger the pKi). The RMC and FMC families (highlighted in yellow) perform best, and there is a general trend where an increase in cutoff (from left to right) results in improved performance in ranking complexes in order of their measured pKi. Plots of pKi vs. RMC15 score are given in (c) and (d) for the worst crystal pose selector (RCC11) and the best crystal pose selector (RMR6), respectively. The lack of major differences between these two selectors with the same ranker indicates that selecting the correct binding pose is of little importance for ranking the pKi of a protein-ligand complex. (e) The distribution of all correlations, regardless of selector, for the RMC15 scoring function. (f) The correlations for other docking methods with RMR6 as the selector and RMC15 as the ranker.

Next, we show that there was little difference between the worst crystal pose selector (RCC11, which selects the top pose 22% of the time, Figure 9c) and the best selector (RMR6, which selects the top pose 43% of the time, Figure 9d) in their correlation with binding affinity. The difference in Pearson correlation between the worst (RCC11) and the best (RMR6) selectors in combination with the best ranker (RMC15) is 0.024. Furthermore, the correlation between the RMC15 score (best ranker) and the pKi/pKd data for all 96 possible selectors (shown in Figure 9e) has a small deviation (standard deviation of 0.0829 for the average Pearson correlation). This suggests that the selection of the pose has a minor impact on ranking the activity of the ligand. This result is further supported by Figures S16–S20 and Tables S8–S9, where the selector is either the best-scored pose using the RMR6 scoring function or the lowest RMSD pose from the crystal ligand. The results in Figures 10b and S16–S18 also show that, on a class-wise basis, there is little difference between the correlations for poses selected by the lowest RMSD and poses selected by RMR6. We find that neither of these selectors improves the ability of the best ranker (RMC15) scoring function to rank the pKi/pKd data of compounds binding to the same protein. Additionally, there is little difference in the overall Pearson and Spearman correlations (0.001 and 0.004, respectively) for the lowest RMSD pose vs the best-scored RMR6 pose (selectors) when rescored with RMC15 (ranker). These findings are encouraging, as they suggest that the burden of finding the crystal pose of the ligand could be removed; however, a more detailed study with additional benchmarking sets, such as the Directory of Useful Decoys, Enhanced (DUD-E) (90), is required to determine the proper choice of scoring function or combination of scoring functions to rank protein-ligand complexes.

Figure 10:

The relationship between the RMSD rank of docked poses and the overall Pearson correlation between the RMR6 (blue) and RMC15 (green) score and the CASF-2016 binding affinity of 285 protein-ligand complexes is shown in (a). An inset is used to highlight the correlation between RMC15 and binding affinity around the 750th pose as ranked by the RMSD between the pose and the native pose. The class-wise correlation between the RMC15 score of a pose selected by the best RMR6 score and by the lowest RMSD is shown in (b). The similarity between the two selectors indicates that knowledge of the crystal pose is not necessary for the RMC15 scoring to perform well on a class-wise basis.

To further illustrate that other docked poses in addition to the crystal-like pose contribute towards binding affinity, we calculated the correlation between the RMC15 score and binding affinity while varying the RMSD rank used to select the pose for scoring. First, only the best RMSD pose for each of the 285 protein-ligand complexes is scored using RMC15, and the correlation between this score and the binding affinity is measured. This is repeated for the second best RMSD pose of each complex and then continued similarly for all docked poses ranked in ascending order of RMSD from the crystal ligand. If fewer docked poses are available for any complex than the RMSD rank, the worst RMSD pose is used. Results of this procedure for the RMR6 and RMC15 scoring functions are given in Figure 10a and indicate that the lowest RMSD rank does not always yield the best correlation with binding affinity for the RMC15 scoring function. In fact, the best correlation is achieved around the 750th pose as ranked by RMSD (Figure 10a, green line and yellow inset), and other RMSD ranks also produce a similar correlation. In contrast, the RMR6 scoring function is dependent on the RMSD of the pose (Figure 10a, blue line) but does not correlate with binding affinity. Finally, as mentioned previously, there is no difference in correlation between the RMC15 score and the binding affinity for different protein classes using either the best RMR6 scored pose or the lowest RMSD selected pose (Figure 10b, Figures S16–S18), suggesting that knowledge of the crystal pose is not necessary for predicting binding affinity. We would like to stress that further investigation into these patterns is required and will be addressed in future work.
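
The RMSD-rank sweep described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the data layout, function names, and toy values are hypothetical; only the fallback rule (use the worst RMSD pose when a complex has fewer poses than the requested rank) comes from the text.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def pose_at_rmsd_rank(poses, k):
    """Return the pose with the k-th lowest RMSD (k is 1-based).

    If fewer than k poses exist, fall back to the worst-RMSD pose.
    """
    ordered = sorted(poses, key=lambda pose: pose["rmsd"])
    return ordered[min(k, len(ordered)) - 1]


def correlation_by_rmsd_rank(complexes, max_rank):
    """Score-affinity Pearson correlation for each RMSD rank 1..max_rank."""
    affinities = [entry["pK"] for entry in complexes]
    results = []
    for k in range(1, max_rank + 1):
        scores = [pose_at_rmsd_rank(entry["poses"], k)["score"] for entry in complexes]
        results.append(pearson(scores, affinities))
    return results


# Hypothetical toy data (NOT from CASF-2016); complexes have different pose counts.
toy = [
    {"pK": 8.0, "poses": [
        {"rmsd": 0.5, "score": -100.0},
        {"rmsd": 2.0, "score": -90.0},
        {"rmsd": 4.0, "score": -95.0},
    ]},
    {"pK": 6.0, "poses": [
        {"rmsd": 1.0, "score": -70.0},
        {"rmsd": 3.0, "score": -75.0},
    ]},
    {"pK": 5.0, "poses": [{"rmsd": 0.8, "score": -60.0}]},
]
sweep = correlation_by_rmsd_rank(toy, max_rank=3)
for rank, r in enumerate(sweep, start=1):
    print(f"RMSD rank {rank}: r = {r:.3f}")
```

Plotting the returned list against the rank gives a curve of the kind shown in Figure 10a; a flat curve indicates that the ranker is insensitive to which pose is scored.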

Similar to the selector used, the flexibility mode (rigid, semi-flexible, fully-flexible) used to generate ligand poses does not have a significant impact on the correlation between score and binding affinity (see Figure 9f). While the fully-flexible methodology has a significant advantage for kinases such as ABL1, JAK2, and CHK1 (comparing Figures S17 and S18), there are many other examples of protein-ligand complexes where the semi-flexible method provides a clear advantage over the fully-flexible and rigid methodologies (Figures S16–S18). This is significant because the semi-flexible method is less computationally demanding than the fully-flexible method and can be used efficiently in a virtual screening pipeline. Moreover, the Pearson and Spearman correlations between the scores and the pKi/pKd data vary widely with the type of protein, ranging from −1.0 (best) to +1.0 (worst), as shown in Figures 10b and S16–S18 and Tables S8–S9. For example, the nuclear hormone receptors ER and AR have positive correlation values instead of the expected negative ones, and the best selector/ranker pair for HIV proteases in CASF-2016 is RMC15/RMR6, which is the opposite of what was found for the other test cases of CASF-2016 in general. Therefore, the use of different scoring functions for different protein classes may be advantageous in ranking the relative binding affinity of the ligands to the protein targets, but extensive benchmarking is needed to identify class-specific biases.
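
The per-target (class-wise) view discussed above amounts to grouping the complexes by target, correlating score with affinity within each group, and then inspecting the sign and the average. The sketch below illustrates that bookkeeping; the target labels "T1" and "T2", the helper names, and all values are hypothetical placeholders rather than CASF-2016 data.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def per_target_correlations(records):
    """records: iterable of (target, score, pK). Returns {target: Pearson r}."""
    grouped = defaultdict(list)
    for target, score, pk in records:
        grouped[target].append((score, pk))
    return {
        target: pearson([s for s, _ in pairs], [p for _, p in pairs])
        for target, pairs in grouped.items()
    }


# Hypothetical toy data: five compounds per target, as in CASF-2016.
records = [
    ("T1", -900.0, 9.0), ("T1", -800.0, 8.0), ("T1", -700.0, 6.5),
    ("T1", -650.0, 6.0), ("T1", -500.0, 4.0),
    ("T2", -900.0, 5.0), ("T2", -800.0, 5.5), ("T2", -700.0, 6.0),
    ("T2", -600.0, 7.5), ("T2", -500.0, 8.0),
]
by_target = per_target_correlations(records)
average_r = mean(by_target.values())

# Targets whose correlation has the unexpected (positive) sign.
unexpected_sign = sorted(t for t, r in by_target.items() if r > 0)
print(by_target, average_r, unexpected_sign)
```

Averaging per-target coefficients, as opposed to pooling all complexes, keeps a single target with an unexpected sign from being hidden by the overall trend, which is why both kinds of values are reported in the text.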

4. Conclusions

We present the CANDOCK algorithm, our hierarchical atomic network-based docking algorithm that accounts for protein flexibility and ligand interactions with all cofactors, metal ions, etc. in the binding pocket using generalized statistical scoring functions. We demonstrated that these scoring functions worked very well to generate a crystal-like pose for ~94% of the CASF-2016 dataset consisting of 285 protein-ligand complexes. There were 17 (of 285) failures in total with semi-flexible docking, which were reduced to 10 failures with fully-flexible docking, including 4 (out of 10) failures that contain long aliphatic chains. We found that the RMR6 scoring function was the best at selecting a crystal-like ligand pose and that the RMC15 scoring function scored the selected poses to rank ligands according to their measured binding affinities. Our algorithm only requires a final energy minimization of the protein and the ligand (semi-flexible) to generate crystal-like ligand poses for ligands consisting of fewer than six fragments, whereas the fully-flexible method is needed for larger ligands. CANDOCK was developed to provide proper representations of the ligand, the receptor, and all cofactors in the binding pocket. It performs well by including ligand and cofactor interactions in the binding pocket using the generalized statistical potential and without the need for parameterization. CANDOCK successfully generates a crystal pose for 97.6% of the Astex benchmarking set (83 out of the 85 complexes), including crystal-like poses for cases that failed with all popular docking methods (e.g. those containing metal-organic interactions). We show that the RMR6 scoring function, using a short distance cutoff and a reduced atom type set, is adequate for selecting the crystal pose of the ligand. However, the longer distance cutoff and complete atom type set used in the RMC15 scoring function are essential to achieve a reasonable correlation between the docking score and the RMSD of a docked ligand from the crystal ligand, which justifies the use of RMC15 as the minimization function. The RMC15 scoring function was also the best at reproducing reasonable correlations between scores and ligand binding affinities. We believe that the release of the CANDOCK algorithm will give the community a valuable, freely available tool for generating chemically relevant ligand poses for use in drug discovery efforts. The hierarchical nature of our method presents a powerful and flexible tool to perform proteome-wide docking studies efficiently, yielding improved drug discovery and design pipelines. We have placed all the scripts and input protein and ligand structures required to reproduce our results at https://github.com/chopralab/candock_benchmark.

Acknowledgments

This work was supported in part by a Purdue University start-up package from the Department of Chemistry at Purdue University, a Ralph W. and Grace M. Showalter Research Trust award, the Integrative Data Science Initiative award, and the Jim and Diann Robbers Cancer Research Grant for New Investigators award to Gaurav Chopra; a Lynn Fellowship to Jonathan Fine; Slovenian Research Agency project number L7-8269 to Janez Konc; an NIH Director’s Pioneer Award (1DP1OD006779) to Ram Samudrala; and NIH NCATS ASPIRE Design Challenge awards to Gaurav Chopra and Ram Samudrala. Additional support, in part, from an NCATS Clinical and Translational Sciences Award (UL1TR001412), an NCATS Clinical and Translational Sciences Award from the Indiana Clinical and Translational Sciences Institute (UL1TR002529), and the Purdue University Center for Cancer Research NIH grant P30 CA023168 is also acknowledged. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

Supporting information: Additional tables relating to atom types used by CANDOCK (Table S1), benchmarking against the decoy poses provided by the CASF-2016 benchmark (Tables S2-S6), time taken by CANDOCK to generate poses (Figure S1), additional figures related to pose generation and selection (Figures S2-S7), additional figures related to the scoring functions (Figures S8-S9), relationship between docking score and RMSD (Table S7 and Figures S10 – S13), relationship between docking score and pKi/pKd (Tables S8-S9 and Figures S14-S18), default parameters used with the program (Table S10).

References

  • (1) Horst JA; Laurenzi A; Bernard B; Samudrala R Computational Multitarget Drug Discovery. Polypharmacology Drug Discov. 2012, 263–301.
  • (2) Jenwitheesuk E; Samudrala R Identification of Potential Multitarget Antimalarial Drugs. JAMA 2005, 294, 1487.
  • (3) Friesner RA; Banks JL; Murphy RB; Halgren TA; Klicic JJ; Mainz DT; Repasky MP; Knoll EH; Shelley M; Perry JK; et al. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem 2004, 47, 1739–1749.
  • (4) Perola E; Walters WP; Charifson PS A Detailed Comparison of Current Docking and Scoring Methods on Systems of Pharmaceutical Relevance. Proteins Struct. Funct. Bioinforma 2004, 56, 235–249.
  • (5) Ou-Yang S; Lu J; Kong X; Liang Z; Luo C; Jiang H Computational Drug Discovery. Acta Pharmacol. Sin 2012, 33, 1131–1140.
  • (6) Deng Z; Chuaqui C; Singh J Structural Interaction Fingerprint (SIFt): A Novel Method for Analyzing Three-Dimensional Protein−Ligand Binding Interactions. J. Med. Chem 2004, 47, 337–344.
  • (7) Pouliot Y; Chiang AP; Butte AJ Predicting Adverse Drug Reactions Using Publicly Available PubChem Bioassay Data. Clin. Pharmacol. Ther 2011, 90, 90–99.
  • (8) Jenwitheesuk E; Horst JA; Rivas KL; Van Voorhis WC; Samudrala R Novel Paradigms for Drug Discovery: Computational Multitarget Screening. Trends Pharmacol. Sci 2008, 29, 62–71.
  • (9) Horst JA; Pieper U; Sali A; Zhan L; Chopra G; Samudrala R; Featherstone JDB Strategic Protein Target Analysis for Developing Drugs to Stop Dental Caries. Adv. Dent. Res 2012, 24, 86–93.
  • (10) Carlson HA; McCammon JA Accommodating Protein Flexibility in Computational Drug Design. Mol. Pharmacol 2000, 57, 213–218.
  • (11) Cross JB; Thompson DC; Rai BK; Baber JC; Fan KY; Hu Y; Humblet C Comparison of Several Molecular Docking Programs: Pose Prediction and Virtual Screening Accuracy. J. Chem. Inf. Model 2009, 49, 1455–1474.
  • (12) Biesiada J; Porollo A; Velayutham P; Kouril M; Meller J Survey of Public Domain Software for Docking Simulations and Virtual Screening. Hum. Genomics 2011, 5, 497–505.
  • (13) Brewerton SC The Use of Protein-Ligand Interaction Fingerprints in Docking. Curr. Opin. Drug Discov. Devel 2008, 11, 356–364.
  • (14) Huang S-Y; Zou X Ensemble Docking of Multiple Protein Structures: Considering Protein Structural Variations in Molecular Docking. Proteins Struct. Funct. Bioinforma 2006, 66, 399–421.
  • (15) Huang S-Y; Zou X Efficient Molecular Docking of NMR Structures: Application to HIV-1 Protease. Protein Sci. 2006, 16, 43–51.
  • (16) Grigoryan AV; Wang H; Cardozo TJ Can the Energy Gap in the Protein-Ligand Binding Energy Landscape Be Used as a Descriptor in Virtual Ligand Screening? PLoS One 2012, 7, e46532.
  • (17) Minie M; Chopra G; Sethi G; Horst J; White G; Roy A; Hatti K; Samudrala R CANDO and the Infinite Drug Discovery Frontier. Drug Discov. Today 2014, 19, 1353–1363.
  • (18) Sethi G; Chopra G; Samudrala R Multiscale Modelling of Relationships between Protein Classes and Drug Behavior Across All Diseases Using the CANDO Platform. Mini Rev. Med. Chem 2015, 15, 705–717.
  • (19) Chopra G; Samudrala R Exploring Polypharmacology in Drug Discovery and Repurposing Using the CANDO Platform. Curr. Pharm. Des 2016, 22, 3109–3123.
  • (20) Chopra G; Kaushik S; Elkin PL; Samudrala R Combating Ebola with Repurposed Therapeutics Using the CANDO Platform. Molecules 2016, 21, 1537.
  • (21) Hernandez-Perez M; Chopra G; Fine J; Conteh AM; Anderson RM; Linnemann AK; Benjamin C; Nelson JB; Benninger KS; Nadler JL; et al. Inhibition of 12/15-Lipoxygenase Protects Against β-Cell Oxidative Stress and Glycemic Deterioration in Mouse Models of Type 1 Diabetes. Diabetes 2017, 66, 2875–2887.
  • (22) Fine J; Lackner R; Samudrala R; Chopra G Computational Chemoproteomics to Understand the Role of Selected Psychoactives in Treating Mental Health Indications. Sci. Rep 2019, 9.
  • (23) Ma X; Zhou J; Wang C; Carter-Cooper B; Yang F; Larocque E; Fine J; Tsuji G; Chopra G; Lapidus RG; et al. Identification of New FLT3 Inhibitors That Potently Inhibit AML Cell Lines via an Azo Click-It/Staple-It Approach. ACS Med. Chem. Lett 2017, 8, 492–497.
  • (24) Trott O; Olson AJ Software News and Update AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem 2010, 31, 455–461.
  • (25) Verdonk ML; Cole JC; Hartshorn MJ; Murray CW; Taylor RD Improved Protein-Ligand Docking Using GOLD. Proteins Struct. Funct. Bioinforma 2003, 52, 609–623.
  • (26) Ding F; Yin S; Dokholyan NV Rapid Flexible Docking Using a Stochastic Rotamer Library of Ligands. J. Chem. Inf. Model 2010, 50, 1623–1632.
  • (27) Wang J; Dokholyan NV MedusaDock 2.0: Efficient and Accurate Protein-Ligand Docking with Constraints. J. Chem. Inf. Model 2019, 59, 2509–2515.
  • (28) Yin S; Biedermannova L; Vondrasek J; Dokholyan NV MedusaScore: An Accurate Force Field-Based Scoring Function for Virtual Drug Screening. J. Chem. Inf. Model 2008, 48, 1656–1662.
  • (29) Yuriev E; Ramsland PA Latest Developments in Molecular Docking: 2010–2011 in Review. J. Mol. Recognit 2013, 26, 215–239.
  • (30) Damm-Ganamet KL; Smith RD; Dunbar JB; Stuckey JA; Carlson HA CSAR Benchmark Exercise 2011–2012: Evaluation of Results from Docking and Relative Ranking of Blinded Congeneric Series. J. Chem. Inf. Model 2013, 53, 1853–1870.
  • (31) Gaudreault F; Najmanovich RJ FlexAID: Revisiting Docking on Non-Native-Complex Structures. J. Chem. Inf. Model 2015, 55, 1323–1336.
  • (32) Jenwitheesuk E; Samudrala R Improved Prediction of HIV-1 Protease-Inhibitor Binding Energies by Molecular Dynamics Simulations. BMC Struct. Biol 2003, 3, 1–9.
  • (33) Warren GL; Andrews CW; Capelli AM; Clarke B; LaLonde J; Lambert MH; Lindvall M; Nevins N; Semus SF; Senger S; et al. A Critical Assessment of Docking Programs and Scoring Functions. J. Med. Chem 2006, 49, 5912–5931.
  • (34) Kellenberger E; Rodrigo J; Muller P; Rognan D Comparative Evaluation of Eight Docking Tools for Docking and Virtual Screening Accuracy. Proteins Struct. Funct. Genet 2004, 57, 225–242.
  • (35) Bissantz C; Bernard P; Hibert M; Rognan D Protein-Based Virtual Screening of Chemical Databases. II. Are Homology Models of g-Protein Coupled Receptors Suitable Targets? Proteins Struct. Funct. Bioinforma 2002, 50, 5–25.
  • (36) Huang N; Shoichet BK; Irwin JJ Benchmarking Sets for Molecular Docking. J. Med. Chem 2006, 49, 6789–6801.
  • (37) Wang Z; Sun H; Yao X; Li D; Xu L; Li Y; Tian S; Hou T Comprehensive Evaluation of Ten Docking Programs on a Diverse Set of Protein-Ligand Complexes: The Prediction Accuracy of Sampling Power and Scoring Power. Phys. Chem. Chem. Phys 2016, 18, 12964–12975.
  • (38) Claußen H; Buning C; Rarey M; Lengauer T FLEXE: Efficient Molecular Docking Considering Protein Structure Variations. J. Mol. Biol 2001, 308, 377–395.
  • (39) Leach AR Ligand Docking to Proteins with Discrete Side-Chain Flexibility. J. Mol. Biol 1994, 235, 345–356.
  • (40) Apostolakis J; Plückthun A; Caflisch A Docking Small Ligands in Flexible Binding Sites. J. Comput. Chem 1998, 19, 21–37.
  • (41) Meng EC; Gschwend DA; Blaney JM; Kuntz ID Orientational Sampling and Rigid-Body Minimization in Molecular Docking. Proteins Struct. Funct. Genet 1993, 17, 266–278.
  • (42) Zhao H; Caflisch A Discovery of ZAP70 Inhibitors by High-Throughput Docking into a Conformation of Its Kinase Domain Generated by Molecular Dynamics. Bioorg. Med. Chem. Lett 2013, 23, 5721–5726.
  • (43) Chopra G; Summa CM; Levitt M Solvent Dramatically Affects Protein Structure Refinement. Proc. Natl. Acad. Sci. U. S. A 2008, 105, 20239–20244.
  • (44) Chopra G; Kalisman N; Levitt M Consistent Refinement of Submitted Models at CASP Using a Knowledge-Based Potential. Proteins Struct. Funct. Bioinforma 2010, 78, n/a–n/a.
  • (45).Rodrigues JPGLM; Levitt M; Chopra G KoBaMIN: A Knowledge-Based Minimization Web Server for Protein Structure Refinement. Nucleic Acids Res. 2012, 40, W323–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (46).Lertkiatmongkol P; Assawamakin A; White G; Chopra G; Rongnoparut P; Samudrala R; Tongsima S Distal Effect of Amino Acid Substitutions in CYP2C9 Polymorphic Variants Causes Differences in Interatomic Interactions against (S)-Warfarin. PLoS One 2013, 8, e74053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (47).Zavodszky MI; Kuhn LA Side-Chain Flexibility in Protein-Ligand Binding: The Minimal Rotation Hypothesis. Protein Sci. 2005, 14, 1104–1114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (48).Chen YC Beware of Docking! Trends Pharmacol. Sci 2015, 36, 78–95. [DOI] [PubMed] [Google Scholar]
  • (49).Ruiz-Carmona S; Alvarez-Garcia D; Foloppe N; Garmendia-Doval AB; Juhos S; Schmidtke P; Barril X; Hubbard RE; Morley SD RDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids. PLoS Comput. Biol 2014, 10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (50).Nedumpully-Govindan P; Jemec DB; Ding F CSAR Benchmark of Flexible MedusaDock in Affinity Prediction and Nativelike Binding Pose Selection. J. Chem. Inf. Model 2016, 56, 1042–1052.
  • (51).Carlson HA; Smith RD; Damm-Ganamet KL; Stuckey JA; Ahmed A; Convery MA; Somers DO; Kranz M; Elkins PA; Cui G; et al. CSAR 2014: A Benchmark Exercise Using Unpublished Data from Pharma. J. Chem. Inf. Model 2016, 56, 1063–1077.
  • (52).Onodera K; Satou K; Hirota H Evaluations of Molecular Docking Programs for Virtual Screening. J. Chem. Inf. Model 2007, 47, 1609–1618.
  • (53).Feher M; Williams CI Effect of Input Differences on the Results of Docking Calculations. J. Chem. Inf. Model 2009, 49, 1704–1714.
  • (54).Pevzner Y; Frugier E; Schalk V; Caflisch A; Woodcock HL Fragment-Based Docking: Development of the CHARMMing Web User Interface as a Platform for Computer-Aided Drug Design. J. Chem. Inf. Model 2014, 54, 2612–2620.
  • (55).Belew RK; Forli S; Goodsell DS; O’Donnell TJ; Olson AJ Fragment-Based Analysis of Ligand Dockings Improves Classification of Actives. J. Chem. Inf. Model 2016, 56, 1597–1607.
  • (56).Vilar S; Cozza G; Moro S Medicinal Chemistry and the Molecular Operating Environment (MOE): Application of QSAR and Molecular Docking to Drug Discovery. Curr. Top. Med. Chem 2008, 8, 1555–1572.
  • (57).Luque I; Freire E Structural Stability of Binding Sites: Consequences for Binding Affinity and Allosteric Effects. Proteins Struct. Funct. Genet 2000, 4, 63–71.
  • (58).Zimmermann MT; Leelananda SP; Kloczkowski A; Jernigan RL Combining Statistical Potentials with Dynamics-Based Entropies Improves Selection from Protein Decoys and Docking Poses. J. Phys. Chem. B 2012, 116, 6725–6731.
  • (59).Bernard B; Samudrala R A Generalized Knowledge-Based Discriminatory Function for Biomolecular Interactions. Proteins Struct. Funct. Bioinforma 2009, 76, 115–128.
  • (60).Meng EC; Lewis RA Determination of Molecular Topology and Atomic Hybridization States from Heavy Atom Coordinates. J. Comput. Chem 1991, 12, 891–898.
  • (61).Wang J; Wolf RM; Caldwell JW; Kollman PA; Case DA Development and Testing of a General Amber Force Field. J. Comput. Chem 2004, 25, 1157–1174.
  • (62).Wang J; Wang W; Kollman PA; Case DA Automatic Atom Type and Bond Type Perception in Molecular Mechanical Calculations. J. Mol. Graph. Model 2006, 25, 247–260.
  • (63).Eastman P; Swails J; Chodera JD; McGibbon RT; Zhao Y; Beauchamp KA; Wang L-P; Simmonett AC; Harrigan MP; Stern CD; et al. OpenMM 7: Rapid Development of High Performance Algorithms for Molecular Dynamics. PLOS Comput. Biol 2017, 13, e1005659.
  • (64).Allen WJ; Balius TE; Mukherjee S; Brozell SR; Moustakas DT; Lang PT; Case DA; Kuntz ID; Rizzo RC DOCK 6: Impact of New Features and Current Docking Performance. J. Comput. Chem 2015, 36, 1132–1156.
  • (65).Groom CR; Bruno IJ; Lightfoot MP; Ward SC The Cambridge Structural Database. Acta Crystallogr. Sect. B Struct. Sci. Cryst. Eng. Mater 2016, 72, 171–179.
  • (66).Rose PW; Prlić A; Bi C; Bluhm WF; Christie CH; Dutta S; Green RK; Goodsell DS; Westbrook JD; Woo J; et al. The RCSB Protein Data Bank: Views of Structural Biology for Basic and Applied Research and Education. Nucleic Acids Res. 2015, 43, D345–D356.
  • (67).Konc J; Janežič D An Improved Branch and Bound Algorithm for the Maximum Clique Problem. MATCH Commun. Math. Comput. Chem 2007, 58, 569–590.
  • (68).Bron C; Kerbosch J Algorithm 457: Finding All Cliques of an Undirected Graph [H]. Commun. ACM 1973, 16, 575–577.
  • (69).Zsoldos Z; Reid D; Simon A; Sadjad SB; Johnson AP EHiTS: A New Fast, Exhaustive Flexible Ligand Docking System. J. Mol. Graph. Model 2007, 26, 198–212.
  • (70).Liu DC; Nocedal J On the Limited Memory BFGS Method for Large Scale Optimization. Math. Program 1989, 45, 503–528.
  • (71).Fine J; Chopra G Lemon: A Framework for Rapidly Mining Structural Information from the Protein Data Bank. Bioinformatics 2019, 35, 4165–4167.
  • (72).Liu Z; Su M; Han L; Liu J; Yang Q; Li Y; Wang R Forging the Basis for Developing Protein-Ligand Interaction Scoring Functions. Acc. Chem. Res 2017, 50, 302–309.
  • (73).Su M; Yang Q; Du Y; Feng G; Liu Z; Li Y; Wang R Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model 2019, 59, 895–913.
  • (74).Hartshorn MJ; Verdonk ML; Chessari G; Brewerton SC; Mooij WTM; Mortenson PN; Murray CW Diverse, High-Quality Test Set for the Validation of Protein-Ligand Docking Performance. J. Med. Chem 2007.
  • (75).ChemAxon. Molecule File Converter, Version 5.10.1, (C) 1999–2012 ChemAxon Ltd.
  • (76).Cleves AE; Jain AN Knowledge-Guided Docking: Accurate Prospective Prediction of Bound Configurations of Novel Ligands Using Surflex-Dock. J. Comput. Aided. Mol. Des 2015, 29, 485–509.
  • (77).Lu J; Hou X; Wang C; Zhang Y Incorporating Explicit Water Molecules and Ligand Conformation Stability in Machine-Learning Scoring Functions. J. Chem. Inf. Model 2019, 59, 4540–4549.
  • (78).Jain AN Scoring Noncovalent Protein-Ligand Interactions: A Continuous Differentiable Function Tuned to Compute Binding Affinities. J. Comput. Aided. Mol. Des 1996, 10, 427–440.
  • (79).Ruppert J; Welch W; Jain AN Automatic Identification and Representation of Protein Binding Sites for Molecular Docking. Protein Sci. 1997, 6, 524–533.
  • (80).Khamis MA; Gomaa W Comparative Assessment of Machine-Learning Scoring Functions on PDBbind 2013. Eng. Appl. Artif. Intell 2015, 45, 136–151.
  • (81).Wang C; Zhang Y Improving Scoring-Docking-Screening Powers of Protein–Ligand Scoring Functions Using Random Forest. J. Comput. Chem 2017, 38, 169–177.
  • (82).Lim J; Ryu S; Park K; Choe YJ; Ham J; Kim WY Predicting Drug–Target Interaction Using a Novel Graph Neural Network with 3D Structure-Embedded Graph Representation. J. Chem. Inf. Model 2019, 59, 3981–3988.
  • (83).Nivedha AK; Thieker DF; Makeneni S; Hu H; Woods RJ Vina-Carb: Improving Glycosidic Angles during Carbohydrate Docking. J. Chem. Theory Comput 2016, 12, 892–901.
  • (84).Shahrokh K; Orendt A; Yost GS; Cheatham TE Quantum Mechanically Derived AMBER-Compatible Heme Parameters for Various States of the Cytochrome P450 Catalytic Cycle. J. Comput. Chem 2012, 33, 119–133.
  • (85).Cross SSJ Improved FlexX Docking Using FlexS-Determined Base Fragment Placement. J. Chem. Inf. Model 2005, 45, 993–1001.
  • (86).Bezhentsev VM; Tarasova OA; Dmitriev AV; Rudik AV; Lagunin AA; Filimonov DA; Poroikov VV Computer-Aided Prediction of Xenobiotic Metabolism in the Human Body. Russ. Chem. Rev 2016, 85, 854–879.
  • (87).Kirchmair J; Göller AH; Lang D; Kunze J; Testa B; Wilson ID; Glen RC; Schneider G Predicting Drug Metabolism: Experiment and/or Computation? Nat. Rev. Drug Discov 2015.
  • (88).Kirchmair J; Williamson MJ; Tyzack JD; Tan L; Bond PJ; Bender A; Glen RC Computational Prediction of Metabolism: Sites, Products, SAR, P450 Enzyme Dynamics, and Mechanisms. J. Chem. Inf. Model 2012.
  • (89).Card GL; England BP; Suzuki Y; Fong D; Powell B; Lee B; Luu C; Tabrizizad M; Gillette S; Ibrahim PN; et al. Structural Basis for the Activity of Drugs That Inhibit Phosphodiesterases. Structure 2004, 12, 2233–2247.
  • (90).Mysinger MM; Carchia M; Irwin JJ; Shoichet BK Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem 2012, 55, 6582–6594.
  • (91).O’Boyle NM; Banck M; James CA; Morley C; Vandermeersch T; Hutchison GR Open Babel: An Open Chemical Toolbox. J. Cheminform 2011, 3, 33.
