Abstract
Water plays a critical role in ligand-protein interactions. However, it is still challenging to predict accurately not only where water molecules prefer to bind, but also which of those water molecules might be displaceable. The latter is often seen as a route to optimizing affinity of potential drug candidates. Using a protocol we call WaterDock, we show that the freely available AutoDock Vina tool can be used to predict accurately the binding sites of water molecules. WaterDock was validated using data from X-ray crystallography, neutron diffraction and molecular dynamics simulations and correctly predicted 97% of the water molecules in the test set. In addition, we combined data-mining, heuristic and machine learning techniques to develop probabilistic water molecule classifiers. When applied to WaterDock predictions in the Astex Diverse Set of protein ligand complexes, we could identify whether a water molecule was conserved or displaced to an accuracy of 75%. A second model predicted whether water molecules were displaced by polar groups or by non-polar groups to an accuracy of 80%. These results should prove useful for anyone wishing to undertake rational design of new compounds where the displacement of water molecules is being considered as a route to improved affinity.
Introduction
Water is a key structural feature of protein-ligand complexes and can form a complex hydrogen-bonding network between ligand and protein [1], [2]. Water-mediated binding is so common that a study of 392 protein-ligand complexes found that 85% had at least one water molecule bridging the interaction between the ligand and the protein [3]. Furthermore, the displacement of an ordered water molecule can drastically affect a ligand's binding affinity [4], [5]. As a result, it is common to include explicit water molecules in computational drug design [6]–[8]. The careful consideration of hydration sites has been shown to aid the predictability of 3D QSAR models [9]–[11], ensure stable simulations with molecular dynamics [12], and improve the accuracy of rigorous free energy calculations [13]. Continuum solvent models have also been reported to improve with the addition of explicit water molecules [14]. Traditionally, ordered water molecules were ignored in ligand docking studies and ligands were docked into desolvated binding sites. There are now a number of docking protocols that include explicit water molecules and claim to improve accuracy in many cases [15]–[20]. However, it has also been reported that including such water molecules may hamper efforts to predict a ligand's correct binding mode [21].
A popular strategy in rational drug design is to modify a ligand so that it displaces an ordered water molecule into the bulk solvent [5], [11], [22], [23]. This is due to the favorable entropic gain that can result from increasing the water molecule's translational and orientational degrees of freedom. However, the targeted displacement of an ordered water molecule may be unsuccessful [24], [25] and can even lead to a decrease in affinity if the ligand is unable to replace the water molecule's hydrogen bonds correctly and fulfill its stabilizing role [4], [26]. This has important implications for lead optimization, and rigorous theoretical studies have investigated how changing a water-displacing functional group affects a ligand's affinity [27], [28]. In addition, water molecules are important pharmacophoric features of a binding site [29], and the chemical diversity of potential inhibitors generated in silico has been reported to be greatly affected by the targeted displacement of ordered water molecules [30]–[32]. Water molecule locations are typically taken from X-ray crystal structures and may be validated by observing the same position in other crystal structures of the same protein. Nevertheless, there are inherent problems with identifying hydration sites with crystallography. Water molecules can be artifactual, may be too mobile to identify, or may not be observed because of low resolution [33]–[35]. In cases such as homology modeling, there will be no structural knowledge of water molecules. Hence, it is necessary to be able to accurately predict water locations within binding sites.
Water sites can be predicted by running molecular dynamics or Monte Carlo simulations with an explicit water model and taking the peaks in water density or averaging over water molecule locations [36]. These techniques have the benefit of including entropic effects in the prediction but can be very time consuming to run, especially with buried cavities, because of the long time it takes for water to permeate the protein. Grand canonical Monte Carlo methods can significantly reduce the length of the simulation [37], although they can still be computationally demanding. The grid-based Monte Carlo method JAWS attempts to strike a balance between rapid solvation techniques and full molecular simulations that explicitly treat entropic effects [28]. It has the added advantage of producing an estimate of the free energy of displacing the water molecule into bulk solvent, although the value may not be well converged [38]. A notable integral theory approach, called the 3D reference interaction site model (3D-RISM), has reported success in predicting the solvation structure within protein cavities [39] and in ligand binding sites [40]. Inhomogeneous fluid solvation theory (IFST), as popularized by Lazaridis [41], [42], uses a short molecular simulation to calculate the thermodynamics of water molecules in protein binding sites. A great advantage of using IFST is that the free energy is broken down into its enthalpic and entropic contributions and these values are then used to understand the thermodynamics of ligand binding [43]–[46]. IFST also forms the basis of WaterMap [47], [48], which calculates the binding thermodynamics of displaced water molecules and has been used to understand the affinity and ligand selectivity in a number of different cases [49], [50].
Fast solvation methods have also been pursued for a number of years. A popular empirical method is GRID, which calculates the interaction energy of a chemical probe around a protein [51]. The water probe is able to make up to 4 hydrogen bonds with the protein. A novel mean field method has been reported by Setny and Zacharias that places potential water sites on a lattice and iteratively solves the solvent distribution using a semi-heuristic cellular automata approach [52]. The fact that water sites form distinctive distributions around amino acids [53] has been exploited by a number of knowledge-based methods. An early example called AQUARIUS predicted solvent sites within a protein by mapping each amino acid to a data set of crystal structures [54]. SuperStar is another knowledge-based method that combines structural data from the Protein Data Bank [55] and the Cambridge Structural Database [56] (CSD) to predict chemical propensity maps within protein cavities [57]. Schymkowitz et al. similarly used water distributions around amino acids to predict buried water molecules [58]. The distributions were clustered and then optimized using the Fold X forcefield. When water molecules that were coordinated by 2 or more polar atoms were considered, Fold X reported a success rate of 76%. Most recently, Rossato et al. developed AcquaAlta, which identified favorable water geometries from the CSD and ab initio calculations to predict the location of water molecules that bridge polar interactions between the ligand and the protein [59]. AcquaAlta predicted 76% of crystallographic water positions in the training set and 66% in the test set.
As the affinities, binding modes and chemical diversity of a series of ligands can be greatly affected by the water molecules in a protein binding site, it is important to predict which water molecules are displaced or conserved during the binding process. Some docking procedures, although different in implementation, involve switching explicit water molecules “on” and “off” [17], [60], [61]. Other approaches have used the structural features of a water molecule's environment to predict whether it will be displaced or not without any prior knowledge of the ligand. Using a K-nearest neighbors genetic algorithm, Consolv reported 75% accuracy in predicting whether a binding site water molecule would be displaced or not [62]. However, as Consolv used crystallographic temperature factors as structural descriptors, it cannot be applied to predicted water sites. Amadasi and co-workers have combined the HINT forcefield [63] with the Rank score [64] to classify water molecules into 2 broad categories: conserved/functionally displaced and sterically displaced/missing [65], [66]. Their first study correctly classified 76% of the water molecules tested while their second study reported a classification accuracy of 87%. Their analysis included weakly bound water molecules, which were a maximum of 4 Å away from the protein. On the other hand, WaterScore used water molecules within 7 Å of the bound ligand in protein-ligand binding sites [67]. Using multivariate logistic statistical regression, WaterScore reported 67% accuracy in classifying displaced and conserved waters, although water molecules that were displaced because of steric clashes with the ligand were not included in their analysis. Barillari et al. used the computationally expensive double-decoupling method to calculate the binding energies of 54 water molecules in protein-ligand complexes [68]. They found that water molecules that could be displaced by a ligand were on average less strongly bound than conserved water molecules by 2.5 kcal/mol.
Despite the positive strides that have been made in understanding the role of ordered waters, no single method is able to answer how displaceable a water molecule is and what it is likely to be displaced by. When there is limited experimental knowledge of a binding site's solvation structure, these questions become even harder to address. In this paper we develop a pipeline that can accurately predict the location of water molecules and predict whether they are likely to be conserved or displaced after ligand binding. We also predict the probability that predicted water molecules will be displaced by polar or non-polar groups.
Using a method we call WaterDock, we show that the freely available AutoDock Vina tool [69] can be used to predict the location of ordered water molecules in ligand binding sites to a very high degree of accuracy. Crucially, a WaterDock prediction only takes a matter of seconds to produce. WaterDock was validated against high-resolution crystal structures, neutron diffraction data and molecular dynamics simulations. Using a validation set of proteins for which high resolution X-ray structures have been determined at least twice, we found that WaterDock was able to predict 88% of “consensus” water sites with a mean error of 0.78 Å. Using 14 structures of OppA bound to lysine-X-lysine tripeptides, WaterDock predicted 97% of the ordered water molecules, with on average 1 false positive per structure.
By combining data mining, heuristic and machine learning techniques, we developed two probabilistic water molecule classifiers that were designed to predict the role of our WaterDock predictions. Water molecules were predicted in the binding sites of the Astex Diverse Set [70] of protein-ligand complexes after the ligands had been removed from the structures. By overlaying the ligands back into the hydrated cavities, we studied the statistics of hypothetically “displaced” water molecules. We could predict whether water molecules were displaced or conserved to an accuracy of 75% and whether water molecules were displaced by a polar ligand group or a non-polar group to 80% accuracy, both after cross validation.
The key advantages of the approaches we present here are that they take only a few seconds to compute yet are able to maintain a very high degree of accuracy. We hope that these techniques will be useful in molecular modeling and rational drug design, especially in cases where there is limited structural information of the protein. Furthermore, they utilize freely available software.
Methods
1. Validation of WaterDock method
Docking is a multidimensional optimization problem, so many docking programs should be well suited to balancing the various energetic needs of a water molecule. The main benefit of using AutoDock Vina (henceforth referred to as Vina) to predict water locations is that the stochastic nature of its algorithm ensures that many possible water sites can be generated in a single docking run. Repeated independent dockings of a water molecule into a cavity produce a diverse ensemble of locations that must be processed in order to produce a single, coherent and reproducible solvation structure. To ensure the prediction method is as fast as possible (Vina only takes a few seconds to dock a water molecule), we chose to experiment with different energetic filtering and clustering procedures. We refer to the docking, filtering and clustering procedure as WaterDock. Other docking programs can in principle be used to predict hydration sites within proteins and can be validated using the methods outlined in this paper.
We used two data sets to validate WaterDock and one independent test set. The first validation set was used to find the minimum score for accepting a docked water site and the second validation set was created to establish the clustering procedure. By using 2 data sets to validate WaterDock, we hoped to minimize over-fitting of the water placement method. The first set comprised 15 high-resolution, pharmacologically relevant protein crystal structures and is shown in Table 1. As there can be some inconsistencies regarding crystallographically observed water molecules, it may be that Vina correctly predicts hydration sites that are not observed experimentally. For this reason, three proteins from Table 1 were chosen for molecular dynamics (MD) simulations. The minimum distances from predicted water molecules to an experimental or MD water molecule were used to investigate the relationship between a prediction's error and its Vina score. In order to assess the magnitude of the errors, the minimum distances were compared to those from a random placement of water molecules (see Figure 1). The energy cutoff was chosen as the Vina score that produced an error distribution that was indistinguishable from the error distribution of the random placement model.
Table 1. The protein structures used to establish a cut-off score that indicates whether or not a prediction is better than random.
Protein | PDB code | Resolution (Å) | Ligand |
BRD4 | 2OSS | 1.35 | None |
BRD4 | 3MXF | 1.6 | JQ1 |
Trypsin | 1S0Q | 1.02 | None |
Trypsin | 1BTY | 1.5 | Benzamidine |
HSP 90^a | 1AH6 | 1.8 | None |
HSP 90 | 1AM1 | 2 | ADP |
Penicillopepsin^a | 3APP | 1.8 | None |
Penicillopepsin | 1BXQ | 1.41 | PPi3 |
Cytochrome P450 2B4 | 1PO5 | 1.6 | None |
Cytochrome P450 2B4 | 1SUO | 1.9 | 4-(4-chlorophenyl) imidazole |
PIM1 kinase^a | 1YWV | 2 | None |
PIM1 kinase | 1XWS | 1.8 | BI1 |
Purine nucleoside phosphorylase | 1V48 | 2.2 | DFPP-G |
GluA2 ligand binding core | 1FTM | 1.7 | AMPA |
HIV-1 protease | 1KZK | 1.09 | JE-2147 |
^a Structures that were selected for molecular dynamics simulations.
Table 1 includes apo and holo crystal structures of some of the same proteins in order to test whether Vina can predict the location of bridging water molecules as well as water molecules in unliganded binding sites. The proteins were also selected to have a diverse number of water molecules in the binding site. For example, trypsin has only one water molecule bridging the interaction between the ligand (benzamidine) and the protein, whereas heat shock protein 90 has 9 bridging water molecules and 6 neighboring waters with its ligand, adenosine diphosphate (ADP). The unliganded structures of heat shock protein 90, penicillopepsin and PIM1 kinase were simulated using unrestrained MD for 10 ns. These proteins were selected as their binding sites vary in their hydrophobicity and are easily accessible to the bulk solvent. One hundred snap-shots were selected at random from each of the 3 simulations and Vina was used to predict the hydration sites in each snap-shot. Because the binding sites differ in hydrophobicity and a total of 300 conformational snap-shots were used for docking, we felt the number of simulations was sufficient to capture the different water structures seen in MD. Details of the MD simulations are provided in Text S1.
For each crystal structure or MD snapshot, Vina was used to dock a single water molecule into the binding site and all the locations and poses were recorded. The ensemble of different binding modes that is generated forms the basis of the water site predictions. In a single run, Vina can generate a maximum of 20 conformations. Vina was used twice on each structure so there were 40 water site predictions for each binding site, with overlap in many of the predicted positions. Using the Python [71] script that accompanies the software package AutoDockTools [72], the structures were stripped of water molecules and converted into the PDBQT file format required by Vina. For holo-proteins, the search space was defined to be a 15 Å cube around the geometric center of the ligand. Apo-proteins were structurally aligned to the corresponding holo structure and the ligand center was again used to define the docking search space (see Text S1 for details).
As mentioned, Vina's predictions were compared to a random distribution of water molecules. Water molecules were placed at random within the sterically allowed volume of each docking search space. AutoGrid (part of the AutoDock 4 package) [73] was used to create oxygen affinity grid maps and favorable points were selected at random on grid locations that had affinities less than or equal to 0 kcal/mol. Five hundred random points were selected for each protein structure.
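The random placement model and the error measure can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names, the toy grid and the choice to sample with replacement are assumptions, and the affinity values stand in for an AutoGrid oxygen map that would normally be read from file.

```python
import math
import random

def sample_random_waters(grid_points, affinities, n_points, seed=0):
    """Pick n_points grid locations at random among those with affinity <= 0 kcal/mol."""
    favorable = [p for p, a in zip(grid_points, affinities) if a <= 0.0]
    rng = random.Random(seed)
    # Assumption: sampling with replacement, so n_points may exceed the number of favorable points.
    return [rng.choice(favorable) for _ in range(n_points)]

def min_distances(predicted, reference):
    """Distance from each predicted site to its nearest reference water (same units as the input)."""
    return [min(math.dist(p, r) for r in reference) for p in predicted]

# Toy data (hypothetical): three grid points, one of them sterically unfavorable.
grid = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
aff = [-0.4, 0.0, 1.2]
random_sites = sample_random_waters(grid, aff, n_points=500)
reference_waters = [(0.0, 0.0, 0.0)]
errors = min_distances(random_sites, reference_waters)
```

The same `min_distances` function can be applied to docked water positions, so the docked and random error distributions are computed in an identical way before they are compared.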
Repeated independent water molecule dockings create many overlapping and similar water predictions even after low energy sites have been removed. A second data set was created in order to test the accuracy of different clustering methods and different docking procedures. An accurate water placement method is one in which many experimental water positions are correctly identified (high true positive rate) with very few predictions that are not experimentally observed (low false positive rate). As discussed in the introduction, the validity of water molecules seen in X-ray crystal structures is often uncertain and many water molecules may be missing from the structure. This complicates the proper assessment of the sensitivity and specificity of a water placement method.
To circumvent these issues, the data set in Table 2 was assembled in which each structure had been determined to a high resolution more than once. Where possible, neutron diffraction data was included because of its ability to resolve proton positions. Each protein in Table 2 was structurally aligned and “consensus” water molecules were determined. A consensus water molecule was defined as one that was within 1 Å of another water molecule seen in at least one other structure. These water molecules were used to assess the true positive rate of WaterDock. The binding site water molecules that were seen in only one structure were retained in order to quantify the false positive rate of WaterDock. By validating WaterDock in this way, its true positive rate was assessed using only trustworthy water sites while its false positive rate was assessed using all water sites for which there is at least some evidence. Note that because of the difficulty in experimentally resolving some water molecules, the false positive rate is likely to be an upper estimate.
Table 2. The proteins and set of structures used to establish the docking and clustering procedures for the water placement method.
Protein | PDB codes | Resolution (Å) | Ligand |
HIV-Protease | 3FX5, 1HPX, 2ZYE^a | 0.93, 2, 1.9 | KNI-272 |
Ribonuclease A | 1KF5, 1FS3, 5RSA^a | 1.2, 1.4, 2 | None |
GluR2 ligand binding core | 1FTM^b, 1MY2^b | 1.7, 1.8 | AMPA |
Trypsin | 1S0Q, 1UTQ, 1TPO | 1.0, 1.2, 1.7 | None |
Concanavalin A | 1NLS, 1GKB, 1JBC, 1QNY^a | 0.9, 1.6, 1.2, 1.8 | None |
Glutathione S-transferase | 1K3Y^b, 1K3L^b | 1.3, 1.5 | S-hexyl glutathione |
Carbonic Anhydrase | 3KS3, 3MWO, 2ILI | 0.9, 1.4, 1.1 | None |
^a Structures that have been determined by neutron diffraction.
^b Structures where multiple chains have been used to validate ordered water molecules.
Each of the proteins in Table 2 was structurally aligned and consensus water sites were identified using the statistical programming language R [74]. Using a 15 Å cube to define each binding site, 185 distinct water molecules were identified. Of these water molecules, only 92 had been identified at least twice by experiment. Observing fewer than half of the experimentally determined water molecules in at least two structures highlights the uncertainty regarding crystallographic water positions and underlines the need for caution when validating a water prediction method.
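The consensus criterion defined above (a water within 1 Å of a water in at least one other aligned structure) can be illustrated with a short Python sketch. The original analysis was carried out in R; the function name and toy coordinates here are hypothetical, and the sketch flags individual observations rather than merging them into distinct sites.

```python
import math

def consensus_waters(structures, cutoff=1.0):
    """Return (structure_index, water_index) pairs for waters that lie within
    `cutoff` Å of a water in at least one other aligned structure."""
    consensus = []
    for i, waters in enumerate(structures):
        for j, w in enumerate(waters):
            for k, other in enumerate(structures):
                if k == i:
                    continue  # a water cannot confirm itself
                if any(math.dist(w, o) <= cutoff for o in other):
                    consensus.append((i, j))
                    break
    return consensus

# Toy data (hypothetical): water oxygen positions from three aligned structures.
structures = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.5, 0.0, 0.0)],
    [(9.0, 9.0, 9.0)],
]
result = consensus_waters(structures)
```

Waters that are not returned by `consensus_waters` correspond to sites seen in only one structure, which the text retains for estimating the false positive rate.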
To test WaterDock on an independent data set, we chose 14 structures of OppA bound to different KXK tri-peptides (see Table S1 and S2). The data set was primarily chosen because the same test set was used for a recent water prediction method called AcquaAlta [59]. Doing so allows a direct comparison of the two methods. In addition, the structures have been determined to a high resolution and the ligands have varied water distributions around the side chain of the central amino acid [2].
2. Investigating water displacement and conservation
When a ligand binds to a protein, water molecules that once occupied the ligand's position can be moved or displaced into the bulk solvent. As discussed in the introduction, the displacement of certain water molecules can have a profound effect on the affinity of a ligand. Hence, for each WaterDock prediction, we created a model to assign the probability that it will be either displaced or conserved during ligand binding. Such a probability effectively acts as a physically meaningful “score” that would help to identify which water sites are structurally important. We developed probabilistic models rather than discrete classifiers because whether a water molecule is displaced or not depends on the size, type and scaffold of a ligand. We felt that classifying a water molecule as either always displaceable or always conserved was an oversimplification.
As described in more detail below, we established three structural descriptors of water molecules in a binding site. Using a data mining protocol outlined below, we found a descriptor that correlates with the binding energy of a water molecule as calculated by thermodynamic integration. The two other descriptors were designed heuristically to encapsulate the hydrophilicity and lipophilicity of a water molecule's protein environment. As we wanted our probabilistic classifier to apply to our WaterDock predictions, we predicted water sites in a high quality data set of protein ligand complexes after the ligands had been removed from the structures. By overlaying the ligands back into the WaterDock solvated cavities and comparing the predicted water sites to crystallographic water molecules, we marked WaterDock predictions as either conserved or displaced. The hypothetically displaced water molecules were also recorded as being displaced by hydrogen-bonding groups or non-polar ligand groups. This approach allowed us to create a classifier that was consistent with our water placement method and circumvented issues relating to the displacement of water by protein side chain movements. Also, since WaterDock was found to be very accurate (see Results and Discussion), we were confident in our predictions of “apo” hydration sites.
Using a tree-based machine-learning algorithm, we created two models. The first assigned the probability that a water molecule will be either displaced or conserved. The second model assigned the probability that a water molecule will be displaced by a hydrogen-bonding group or a non-polar group.
Establishing a water energy score
Using the double decoupling method, Barillari et al. calculated the absolute binding free energies of 54 water molecules from 35 ligand-protein complexes [68]. The data set was made up of 6 proteins and 11 conserved water molecules. They found that conserved water molecules had statistically significantly lower binding energies than displaceable water molecules. We considered this data set to be ideal for deriving the water energy score because of the size of the set, the diverse range of proteins and the consistent manner in which the binding energies were calculated. Each of the 54 water molecules was initially scored using the scoring functions from Vina and AutoDock 4, giving correlations with R² values of 0.01 and 0.31, respectively. We felt these correlations were not strong enough to capture the calculated water energetics, so we used a combination of AutoDock 4's force-field based scoring function and Vina's empirical scoring function as the starting point for a data mining procedure to find a new water energy model. All unique combinations of the terms in the AutoDock 4 and AutoDock Vina scoring functions were fitted to Barillari's calculated binding data, creating 255 linear models. The models omitted terms relating to rotatable bonds, as they are not applicable to a water molecule. In order to avoid over-fitting, to reward model simplicity and hence find the most “meaningful” combination of terms, the models were then ranked by their Akaike information criterion (AIC) [75]. The AIC is a measure of the goodness of fit that penalizes models for the number of parameters they contain, the preferred model being the one that minimizes the AIC. The top 30 models with the lowest AICs were then selected for an extensive cross validation study.
To cross-validate the models, all the calculated binding data for one of the 11 conserved water molecules was partitioned from the training set to form a test set. The top 30 models were then re-fit to the training set and the mean error of the model on the test set was recorded. The process was repeated until each of the 11 conserved water molecules was used as the test set. The model that had the lowest mean error after cross-validation was selected as the final water energy model.
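The exhaustive model search can be sketched as follows. The figure of 255 models is what one obtains from every non-empty subset of 8 terms (2^8 − 1). Everything else in this Python sketch is an assumption made for illustration: the inclusion of an intercept, the least-squares form of the AIC (n·ln(RSS/n) + 2k, valid up to an additive constant), the function names and the toy data.

```python
import itertools
import math

def term_subsets(n_terms):
    """Every non-empty combination of term indices."""
    return [c for size in range(1, n_terms + 1)
            for c in itertools.combinations(range(n_terms), size)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_rss(rows, y, cols):
    """Least-squares fit of y on the chosen columns plus an intercept; return the RSS."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2 for d, t in zip(design, y))

def rank_by_aic(rows, y, n_terms):
    """Fit every subset of terms and sort the models by AIC, lowest first."""
    n = len(y)
    ranked = []
    for cols in term_subsets(n_terms):
        rss = max(fit_rss(rows, y, cols), 1e-12)  # guard against log(0) for a perfect fit
        k = len(cols) + 1  # fitted coefficients, including the intercept
        ranked.append((n * math.log(rss / n) + 2 * k, cols))
    return sorted(ranked)

# Toy data (hypothetical): 12 observations, 3 candidate terms, y driven by term 0.
x0 = list(range(12))
x1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
x2 = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, 0.1, -0.05, 0.12, -0.08]
rows = [list(map(float, r)) for r in zip(x0, x1, x2)]
y = [1.0 + 2.0 * a + e for a, e in zip(x0, noise)]
ranked = rank_by_aic(rows, y, n_terms=3)
best_aic, best_cols = ranked[0]
```

Taking the first 30 entries of the ranked list would give the candidate models for the subsequent cross-validation step.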
Creating heuristic hydrophilic and lipophilic scores
By analyzing 10,837 surface bound water molecules in 56 high resolution crystal structures, Kuhn et al. established the individual hydration propensities for each amino acid atom type [76]. They determined the propensities by dividing the total number of water molecules that hydrated an atom by the number of surface exposed occurrences. Building on their work, we created a hydrophilicity model and a lipophilicity model intended to encapsulate the local chemical environment of a water molecule. This information was intended to be distinct from the water energy model. The hydrophilicity model is a distance weighted sum of the propensities from all the atoms within 4 Å of a water molecule and is given by:
$$\text{Hydrophilicity score} = \sum_{i=1}^{N} h_i \exp\left(-\frac{r_i}{d_0}\right) \qquad (1)$$
where $N$ is the number of protein atoms within 4 Å of the water position, $r_i$ is the distance (in Å) from atom $i$ to the water molecule, $h_i$ is the hydration propensity of atom $i$ and $d_0$ is the distance scale of the interaction, set at 1 Å. We chose this weighting function because previous work has suggested that hydrophobicity decays exponentially with distance [77]. The hydration propensities of cofactor atoms were assigned the same value as the most similar protein atom. Because of the high magnitude of ion hydration free energies, ion hydration propensities were assigned the highest value in the Kuhn data set. For the lipophilic score, we chose the same form as (1) and it is given by
$$\text{Lipophilicity score} = \sum_{i=1}^{N} l_i \exp\left(-\frac{r_i}{d_0}\right) \qquad (2)$$
where the terms are as before except $l_i$, which is the carbon propensity of atom $i$. As atomic carbon propensities have not been established as they have been for hydrophilicity, we assigned, as a working hypothesis, a propensity score of 1 to all carbon atoms and a score of 0 to all other atom types.
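A minimal Python sketch of the two environment scores follows. It assumes the weighting takes the form exp(−r/d0), which is inferred from the description of an exponentially decaying, distance-weighted sum with a 1 Å distance scale. The function names are hypothetical and the hydration propensities shown are placeholders, not the values from Kuhn et al.

```python
import math

def environment_score(water, atoms, propensity, cutoff=4.0, d0=1.0):
    """Sum propensity * exp(-r / d0) over atoms within `cutoff` Å of the water."""
    total = 0.0
    for xyz, atom_type in atoms:
        r = math.dist(water, xyz)
        if r <= cutoff:
            total += propensity(atom_type) * math.exp(-r / d0)
    return total

def carbon_propensity(atom_type):
    """Working hypothesis from the text: 1 for carbon atoms, 0 for all other types."""
    return 1.0 if atom_type == "C" else 0.0

def placeholder_hydration_propensity(atom_type):
    """Placeholder values for illustration only (NOT the published propensities)."""
    return {"O": 1.0, "N": 1.0}.get(atom_type, 0.0)

# Toy environment (hypothetical): one carbon at 1 Å, one oxygen at 2 Å, one carbon beyond the cutoff.
water = (0.0, 0.0, 0.0)
atoms = [((1.0, 0.0, 0.0), "C"), ((0.0, 2.0, 0.0), "O"), ((0.0, 0.0, 5.0), "C")]
lipophilic = environment_score(water, atoms, carbon_propensity)
hydrophilic = environment_score(water, atoms, placeholder_hydration_propensity)
```

Passing the propensity as a function keeps one implementation for both scores, mirroring the statement that the lipophilic score has the same form as the hydrophilic one.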
Finding displaced water molecules retrospectively with WaterDock
The Astex Diverse Set contains 85 high-resolution crystal structures of pharmacologically relevant ligand-protein complexes [70]. The ligands are drug-like and have a diverse range of scaffolds. Importantly, the electron density of the ligands in the crystal structures accounts for all parts of the ligand, leaving little ambiguity over the binding mode. This makes the Astex Diverse Set an appropriate data set to investigate what types of ligand atoms “displace” the WaterDock predictions.
The protein-ligand complexes were prepared for docking as previously described in this article. Ligands and water molecules were removed from the binding sites and cofactors were retained. Water sites were predicted in the binding site using the WaterDock method. A predicted water molecule was classified as conserved if it was seen within 1.5 Å of a water molecule seen in the crystal structure of the protein-ligand complex. Predicted water molecules that were not within 1.5 Å of a crystallographic water molecule but within 1.5 Å of a ligand atom were classified as displaced. The distance cut off was chosen as this represents an acceptable water prediction error and is within the van der Waals radius of a water molecule [78].
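The labeling rule described above can be written as a small Python function. This is an illustrative sketch: the function name and the "unclassified" label for predictions that meet neither criterion are assumptions, since the text does not name that case.

```python
import math

def classify_prediction(pred, crystal_waters, ligand_atoms, cutoff=1.5):
    """Label a predicted water site using the 1.5 Å rule:
    conserved if near a crystallographic water, otherwise displaced if near a ligand atom."""
    if any(math.dist(pred, w) <= cutoff for w in crystal_waters):
        return "conserved"
    if any(math.dist(pred, a) <= cutoff for a in ligand_atoms):
        return "displaced"
    return "unclassified"  # hypothetical label for sites near neither

# Toy data (hypothetical coordinates in Å).
crystal_waters = [(0.0, 0.0, 0.0)]
ligand_atoms = [(4.0, 0.0, 0.0)]
labels = [
    classify_prediction((1.0, 0.0, 0.0), crystal_waters, ligand_atoms),
    classify_prediction((3.0, 0.0, 0.0), crystal_waters, ligand_atoms),
    classify_prediction((0.0, 8.0, 0.0), crystal_waters, ligand_atoms),
]
```

The crystallographic water check comes first, so a prediction that is close to both a crystal water and a ligand atom is treated as conserved, as the wording of the rule implies.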
Creating a probabilistic water classifier
We expected that the displacement probability of a water molecule depended on a nonlinear combination of the 3 structural descriptors (binding energy, hydrophilicity and lipophilicity) and that certain regions of parameter space would generally correspond to different classes of water molecule. Classification trees meet these requirements by recursively partitioning the parameter space such that each region defines a class. Classification trees are particularly well suited to our problem because the proportion of a class in a partitioned region can be readily interpreted as a conditional probability. However, because of a tree's hierarchical nature, small changes in the data can result in a different series of splits, making single classification trees unstable. The method of bootstrap aggregation (known as “bagging”) alleviates this issue by fitting many trees to bootstrapped samples (sampling with replacement) of the data. The probability of a class is found by averaging the class proportions from each classification tree.
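To make the averaging of class proportions concrete, the following Python sketch bags very small trees. It is a deliberate simplification and not the model described here: each tree is a single split ("stump") on one descriptor rather than a full tree grown on three descriptors, and the function names, data and number of trees are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """One-split tree on a single descriptor: choose the threshold with the fewest
    misclassifications and record the fraction of class 1 on each side."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Errors made if each side predicts its own majority class.
        errors = sum(min(sum(s), len(s) - sum(s)) for s in (left, right))
        if best is None or errors < best[0]:
            p_left = sum(left) / len(left)
            p_right = sum(right) / len(right) if right else p_left
            best = (errors, t, p_left, p_right)
    return best[1:]

def bagged_probability(xs, ys, x_new, n_trees=200, seed=0):
    """Average the class-1 proportion predicted for x_new over bootstrapped stumps."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        t, p_left, p_right = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        total += p_left if x_new <= t else p_right
    return total / n_trees

# Toy data (hypothetical): one descriptor value per water, label 1 = displaced, 0 = conserved.
xs = [-3.0, -2.8, -2.5, -2.2, -2.0, -1.0, -0.8, -0.5, -0.3, -0.1]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_low = bagged_probability(xs, ys, x_new=-2.9)
p_high = bagged_probability(xs, ys, x_new=-0.2)
```

Because each stump reports a class proportion rather than a hard label, the bagged output is a probability between 0 and 1, which is the property the text relies on.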
Using the free statistical language R with the package “rpart” [74], a bagged classification tree was written and trained on the predicted water positions in the Astex Diverse Set to classify them as conserved or displaced. In addition, a second model was trained to classify displaced WaterDock predictions as displaced by hydrogen-bonding groups or by non-polar groups. To assess the accuracy of the models, we used “leave-protein-out” cross validation. This involved partitioning the Astex Diverse Set into a training set and a test set, where the test set comprised all the water molecules from a single protein. Each water molecule in the test set was classified by both models and the fraction of correct predictions was recorded. This process was repeated until all 85 proteins had been used as the test set. The accuracies quoted in the results are the mean accuracies from all the partitions. This validation procedure was chosen so that the models were tested on structures that were distinct from the structures in the training set.
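The “leave-protein-out” procedure can be sketched in Python as below. The record layout, the function names and the majority-class stand-in model are hypothetical; any classifier with separate fit and predict steps could be substituted.

```python
from collections import Counter

def leave_protein_out(records):
    """Yield (protein, train, test), where test holds every water from one protein."""
    for protein in sorted({r["protein"] for r in records}):
        test = [r for r in records if r["protein"] == protein]
        train = [r for r in records if r["protein"] != protein]
        yield protein, train, test

def mean_accuracy(records, fit, predict):
    """Mean over held-out proteins of the fraction of correctly classified waters."""
    scores = []
    for _, train, test in leave_protein_out(records):
        model = fit(train)
        correct = sum(predict(model, r) == r["label"] for r in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Stand-in model (hypothetical): always predict the majority label of the training set.
def fit_majority(train):
    return Counter(r["label"] for r in train).most_common(1)[0][0]

def predict_majority(model, record):
    return model

records = [
    {"protein": "A", "label": "conserved"},
    {"protein": "A", "label": "conserved"},
    {"protein": "B", "label": "displaced"},
    {"protein": "C", "label": "conserved"},
    {"protein": "C", "label": "displaced"},
]
accuracy = mean_accuracy(records, fit_majority, predict_majority)
```

Splitting by protein rather than by individual water keeps waters from the same binding site out of the training set when that site is being tested.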
Results and Discussion
1. Validation of WaterDock as a Water Placement Tool
Determining the energetic cutoff
The minimum distance of each docked water molecule from a crystallographic or molecular dynamics (MD) water molecule was computed in order to assess how placement prediction error depended on the water position's Vina score. In particular, we sought to find a score cutoff that identified well-determined sites by comparing the predictions to a random placement of water molecules. Figure 1 shows how each Vina score has an error distribution associated with it and how the median and the range of the error distributions decrease for more negative scores. In particular, as the scores increase, the distributions tend to the error distribution from the random placement model. It is apparent that the lower the Vina score, the closer the agreement with crystallographic water locations.
When predicting water locations in the X-ray crystal structures of Table 1, the error distributions were always better than the error distribution from the random model. During the MD simulations, large numbers of water molecules filled the cavities. This meant that placing a water molecule at random within the cavity has a much greater chance of being near a simulated water molecule. While this meant that the prediction error was also reduced, improving on the random model provided a more stringent test. As a result, a cut-off of 0.6 kcal/mol was chosen by inspection as the minimum acceptable score of a predicted water molecule.
Establishing the docking and clustering method
Using 7 crystal structures that had been resolved multiple times (Table 2), different docking and clustering protocols were experimented with in order to find the method that predicted the largest number of consensus water molecules for the fewest number of false positives. Here, we summarize the most accurate protocol while the results for different docking and clustering regimes are included in Table S3.
We found that independently docking a water molecule 3 times into the binding site was enough to sample the configuration space of the water molecule sufficiently, whereas docking only once was not. The “exhaustiveness” parameter in Vina determines how rigorous the docking search is and is roughly proportional to elapsed docking time. We found that setting this parameter to 20 significantly improved the accuracy of the subsequent clustering methods when compared to an exhaustiveness value of 10. Three independent docking runs with an exhaustiveness value of 20 were also very fast and took no more than 15 seconds to complete on a 2.33 GHz Intel Xeon quad core processor.
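A minimal sketch of how three independent runs might be set up is given below. It only assembles command lines using standard AutoDock Vina options and does not execute them. The file names, the search box, the helper name and the use of a different seed per run are hypothetical, and 20 modes per run is an assumption chosen so that three runs yield at most 60 poses.

```python
def vina_commands(receptor, water, center, size,
                  runs=3, exhaustiveness=20, num_modes=20):
    """Build one Vina command line (as an argument list) per independent run."""
    commands = []
    for run in range(1, runs + 1):
        cmd = ["vina",
               "--receptor", receptor,
               "--ligand", water,
               "--exhaustiveness", str(exhaustiveness),
               "--num_modes", str(num_modes),
               "--seed", str(run),                 # assumed: distinct seed per run
               "--out", f"water_run{run}.pdbqt"]   # hypothetical output name
        # Search box centre and edge lengths in angstroms.
        for axis, c, s in zip("xyz", center, size):
            cmd += [f"--center_{axis}", str(c), f"--size_{axis}", str(s)]
        commands.append(cmd)
    return commands

# Hypothetical inputs: a prepared receptor, a single water molecule and a box.
cmds = vina_commands("protein.pdbqt", "water.pdbqt",
                     center=(10.0, 12.5, -3.0), size=(15.0, 15.0, 15.0))
```

Each argument list could then be passed to `subprocess.run`, with the poses from the three output files pooled before clustering.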
Independently docking a water molecule 3 times with Vina generates a maximum of 60 binding modes. Many of the positions overlapped or were in close proximity to one another. Clustering the water positions is a time efficient way of producing a solvation map of the binding site from an ensemble of water positions. A number of different hierarchical clustering methods were experimented with, including complete linkage, single linkage and Ward's minimum variance method. Distance cutoffs of each clustering method were varied to find the one that gave the best accuracy. The average position of a docked water molecule cluster was used as the predicted water molecule location.
The most accurate clustering method was found to be 2 rounds of single linkage clustering with different distance cutoffs. The results are summarized in Tables 3 and 4. The first clustering round used a distance cutoff of 0.5 Å and was designed to remove the most overlapping sites and to reduce the “chaining” of clusters in the second clustering round. The output was clustered again with a distance cutoff of 1.6 Å. While these distance cutoffs were established empirically so as to maximize accuracy, it is interesting to note that the second clustering cutoff is around the van der Waals radius of a water molecule [78].
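The two-round scheme can be sketched as below. Single linkage with a distance cutoff is treated as finding connected components among points closer than the cutoff. The function names, the toy coordinates and the choice to average the first-round centroids without weighting are assumptions.

```python
import math

def single_linkage(points, cutoff):
    """Cluster points by single linkage at a distance cutoff.

    Two points share a cluster if a chain of neighbours, each within
    `cutoff` of the next, connects them. Returns one centroid per cluster.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    # The average position of each cluster is the predicted site.
    return [tuple(sum(c) / len(members) for c in zip(*members))
            for members in clusters.values()]

def two_round_clustering(points, first_cutoff=0.5, second_cutoff=1.6):
    """Cluster at 0.5 angstroms, then cluster the resulting centroids at 1.6."""
    return single_linkage(single_linkage(points, first_cutoff), second_cutoff)

# Toy docked water oxygen positions in angstroms (hypothetical).
poses = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (1.2, 0.0, 0.0),
         (5.0, 0.0, 0.0), (5.2, 0.0, 0.0)]
sites = sorted(two_round_clustering(poses))
```

In this toy case the first round merges the near-duplicate poses, so the second round works on three centroids rather than five poses, which limits the chaining that a single pass at 1.6 Å could produce.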
Table 3. The performance of the final WaterDock method on the second validation set.
Maximum distance of experimental waters from protein (Å) | Max error = 1.5 Å: consensus water molecules predicted (%) | Max error = 1.5 Å: false positives (%) | Max error = 1.5 Å: mean error (Å) | Max error = 2.0 Å: consensus water molecules predicted (%) | Max error = 2.0 Å: false positives (%) | Max error = 2.0 Å: mean error (Å) |
3 | 88 | 24 | 0.69 | 94 | 16 | 0.77 |
3.3 | 81 | 24 | 0.69 | 88 | 16 | 0.78 |
Table 4. The individual protein results using the final WaterDock method.
Quantity | HIV Protease | Ribonuclease A | GluR2 | Trypsin | Concanavalin A | GST‡ | Carbonic Anhydrase | Total |
Consensus Waters | 9 | 10 | 15 | 14 | 17 | 13 | 15 | 93 |
Predicted Consensus Waters | 9 | 8 | 15 | 13 | 13 | 12 | 12 | 82 |
False Positives | 2 | 3 | 3 | 2 | 4 | 3 | 4 | 21 |
Water Molecules Predicted* | 18 | 20 | 20 | 17 | 21 | 19 | 18 | 133 |
* The number of correctly predicted non-consensus water sites can be calculated by finding the difference between the number of water molecules predicted and the sum of the predicted consensus waters and false positives.
‡ Glutathione S-transferase.
Using a maximum placement error of 2 Å, the final WaterDock method identified 88% of consensus water molecules within 3.3 Å of the protein. The distance of 3.3 Å was chosen from the water-water radial distribution function so as to define the first hydration shell [79]. Out of the 82 consensus water molecules correctly identified (Table 4), only 8 were over 1.5 Å away from the experimental position and 54 were within 1 Å of a consensus water molecule. When only tightly bound water molecules (within 3 Å of the protein) were considered, WaterDock predicted 94% of these consensus water molecules.
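The two headline statistics can be computed along the following lines. The matching rule is an assumption: a consensus water counts as predicted if any prediction lies within the maximum error, and a prediction counts as a false positive if no experimental water of any kind lies within that distance. The function name and toy coordinates are hypothetical; the final two lines simply reproduce the percentages from the totals in Table 4.

```python
import math

def evaluate(predicted, consensus, all_experimental, max_error=2.0):
    """Return (% consensus waters predicted, % predictions that are false positives)."""
    found = [w for w in consensus
             if any(math.dist(w, p) <= max_error for p in predicted)]
    false_positives = [p for p in predicted
                       if all(math.dist(p, w) > max_error for w in all_experimental)]
    return (100 * len(found) / len(consensus),
            100 * len(false_positives) / len(predicted))

# Toy coordinates in angstroms (hypothetical).
predicted = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
consensus = [(0.5, 0.0, 0.0), (6.0, 0.0, 0.0)]
all_experimental = consensus + [(3.5, 0.0, 0.0)]  # includes a non-consensus water
pct_found, pct_false = evaluate(predicted, consensus, all_experimental)

# Totals from Table 4: 82 of 93 consensus waters, 21 false positives in 133 predictions.
table_pct_found = round(100 * 82 / 93)
table_pct_false = round(100 * 21 / 133)
```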
Given that only protein-water interactions and not water-water interactions were used to generate the initial ensemble of positions, it is perhaps surprising that WaterDock was able to predict the vast majority of consensus water sites. Even in examples that contain a complex network of water molecules, such as Ribonuclease A and Carbonic Anhydrase, WaterDock was still able to predict 80% of the consensus sites (see Table 4). It is therefore clear that the protein is the most important factor in determining a water molecule's position. However, the omission of water-water interactions was likely to be responsible for some of the errors. In a few cases, an experimental water site was found to lie between 2 predicted locations (see Figure 2), resulting in a false positive. In examples such as Ribonuclease A, Concanavalin A and Carbonic Anhydrase, it was found that water-water interactions were very subtle and consensus sites were observed to be slightly displaced with respect to the WaterDock predictions, possibly to accommodate and interact with another water molecule.
Water-water interactions could be included in the WaterDock method if a second sampling procedure, akin to the JAWS method [28], could switch the predicted sites “on” and “off”. We also considered sequentially docking a water molecule into a cavity to account for water-water interactions. However, we found that the point at which to stop docking was ambiguous and that subsequent predictions were biased to regions near previous predictions. Importantly, neither of these methods adapts the positions of water molecules to optimize both the protein-water and the water-water interactions. A second energy minimization step would be required to achieve this. Given the high accuracy and speed of the current method, we felt these improvements were unnecessary. Table 4 shows the number of correctly predicted consensus water molecules and the number of mis-predictions for each individual protein.
Applying WaterDock to the test set
We decided to use the same data set used by the water prediction method AcquaAlta [59] as our test set so as to allow a direct comparison of the methods. The test set consisted of fourteen crystal structures of OppA bound to different KXK tri-peptides. The authors of AcquaAlta reported that they could predict 66% of the water molecules that bridged the interaction between the ligand and the protein to a maximum error of 1.4 Å. Using the same maximum error, WaterDock predicted 87% of the crystallographic water molecules. When the results were visually inspected (Figure 3), 11 additional predictions were found to be within 2.0 Å of crystallographic water molecules that made the same interactions with the ligand and protein. When these water molecules were included in the analysis, WaterDock identified 97% of the crystallographic water sites with a mean error of 0.68 Å. On average, WaterDock predicted just under 1 water molecule per structure that was not seen experimentally. The false positive rate was not reported for AcquaAlta.
2. Predicting displaceable water molecules using WaterDock
Water energy model from a data mining procedure
The 54 water molecules for which Barillari et al. calculated the binding energy using the double decoupling method [68] were scored with the AutoDock 4 and the Vina scoring functions. All linear combinations of the scoring functions' energetic terms were used to create 255 energy models. After selecting the top 30 models based on model simplicity and goodness of fit (as denoted by the model's AIC), cross validation was used to find the model that yielded the lowest error. It was found that a single term, the hydrogen bonding term from Vina's scoring function, had the lowest mean error in the cross-validation (CV) study, with an error of 1.7 kcal/mol. The fit had a standard error of 1.6 kcal/mol and an R² value of 0.50. For comparison, if the average calculated energy of the Barillari data set is used to predict each water molecule's energy, the mean error would be 2.5 kcal/mol. The coefficient and intercept of the re-weighted Vina hydrogen bonding term are shown in Table 5.
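The model enumeration can be sketched as below. The count of eight terms is inferred from the 255 models reported (2^8 − 1 non-empty subsets) and the term names are placeholders. The AIC expression is the usual least-squares form up to an additive constant, which may differ from the value reported by a given statistics package but ranks models identically for a fixed data set.

```python
import math
from itertools import combinations

# Placeholder names; the count of eight is inferred from the 255 models reported.
TERMS = [f"term_{i}" for i in range(1, 9)]

def all_models(terms):
    """Every non-empty subset of terms, each defining one linear model."""
    return [subset
            for k in range(1, len(terms) + 1)
            for subset in combinations(terms, k)]

def aic_least_squares(rss, n_obs, n_params):
    """AIC of a Gaussian least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

models = all_models(TERMS)
```

For an equal residual sum of squares, the 2k penalty makes the model with fewer parameters score lower, which is how AIC balances simplicity against goodness of fit.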
Table 5. The gradient and intercept of Vina's hydrogen-bonding term after refitting it to the calculated binding energy of water according to Barillari et al.
Term | Weight (kcal/mol) |
Intercept | 1.77 |
H-bond | −2.58 |
Vina's hydrogen bonding term is the sum over hydrogen bonding pairs [69]. For each pair, the value ranges from 1 to 0 and varies linearly with distance. The significant correlation, despite the simplicity of the model, is likely to be due to a strong enthalpy-entropy compensation effect, where the number and strength of hydrogen bonds correlates with the translational and orientational freedom of the water molecule.
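Applying the refitted model is a one-line calculation, sketched below with the weights from Table 5. The piecewise-linear pair term uses thresholds on the surface distance (interatomic distance minus the two atomic radii) of −0.7 Å and 0 Å, taken from the published description of Vina's scoring function; these thresholds are not stated in this text and should be treated as assumptions, as should the function names.

```python
INTERCEPT = 1.77      # kcal/mol, Table 5
HBOND_WEIGHT = -2.58  # kcal/mol per unit of the hydrogen-bonding term, Table 5

def hbond_pair_term(surface_distance, full=-0.7, zero=0.0):
    """Pair value: 1 at or below `full`, 0 at or above `zero`, linear in between.

    The default thresholds (in angstroms) are assumed from the Vina publication.
    """
    if surface_distance <= full:
        return 1.0
    if surface_distance >= zero:
        return 0.0
    return (zero - surface_distance) / (zero - full)

def water_binding_energy(surface_distances):
    """Predicted binding energy (kcal/mol) from donor-acceptor surface distances."""
    hbond = sum(hbond_pair_term(d) for d in surface_distances)
    return INTERCEPT + HBOND_WEIGHT * hbond

no_hbonds = water_binding_energy([])             # intercept only
two_ideal = water_binding_energy([-0.7, -0.7])   # two fully formed hydrogen bonds
```

With no hydrogen bonds the model returns the positive intercept, and each fully formed hydrogen bond lowers the predicted energy by 2.58 kcal/mol.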
Classifying the role of water
As displaced water molecules can greatly affect a ligand's affinity and specificity, it is of great interest to quantify the probability that a WaterDock prediction will be displaced or conserved. If a water molecule is displaceable, it is useful to know whether it is likely to be displaced by a polar group or a non-polar group. In order to develop a water classifier that is consistent with our water placement method, we used a high quality data set of protein-ligand complexes to predict the locations of water molecules after the ligands had been removed from the structures. By overlaying the ligands back onto the hypothetical “apo” solvation structure, we investigated the displacement statistics of our water predictions (see Figure 2B). In total, 545 predicted apo water molecules were within 1.5 Å of a water molecule seen in the crystal structure of the protein-ligand complex and so were classified as conserved. Also, 459 predicted water molecules were classified as displaced as they were within 1.5 Å of a ligand. Of these displaced water molecules, 216 were displaced by polar groups and 243 were displaced by non-polar groups.
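The labelling step can be sketched as follows, using the 1.5 Å cutoff described above. Two details are assumptions rather than statements from this work: conserved takes precedence when a site is near both a crystallographic water and a ligand atom, and the polar or non-polar label is taken from the nearest ligand atom. The data layout and names are hypothetical.

```python
import math

def classify_site(site, complex_waters, ligand_atoms, cutoff=1.5):
    """Label a predicted apo water site from the protein-ligand complex.

    ligand_atoms is a list of (xyz, is_polar) pairs.
    """
    if any(math.dist(site, w) <= cutoff for w in complex_waters):
        return "conserved"
    close = [(math.dist(site, xyz), is_polar)
             for xyz, is_polar in ligand_atoms
             if math.dist(site, xyz) <= cutoff]
    if not close:
        return "unclassified"
    _, is_polar = min(close)  # nearest ligand atom decides the label (assumed)
    return ("displaced by polar group" if is_polar
            else "displaced by non-polar group")

# Toy complex in angstroms (hypothetical): one water, one polar and one non-polar atom.
waters = [(0.0, 0.0, 0.0)]
ligand = [((5.0, 0.0, 0.0), True), ((8.0, 0.0, 0.0), False)]
labels = [classify_site(s, waters, ligand)
          for s in [(0.4, 0.0, 0.0), (5.5, 0.0, 0.0),
                    (7.2, 0.0, 0.0), (20.0, 0.0, 0.0)]]
```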
Using the re-weighted Vina hydrogen bond term, the hydrophilicity model and the lipophilicity model as descriptors in a probabilistic machine learning classifier, water molecules were predicted to be either displaced or conserved. Using “leave-protein-out” cross validation (as described in Methods), 75% of the WaterDock predictions were correctly classified as either conserved or displaced when the class with the highest probability was used for the prediction. Similarly, when waters predicted to be displaced by WaterDock were classified as being displaced by a polar group or by a non-polar group, 80% of the WaterDock predictions were correctly classified in cross validation. Table 6 shows that there was little bias in predicting each individual class.
Table 6. The results of the models that classify water molecules as displaced or conserved and as displaced by a polar group and displaced by a non-polar group.
Model 1: total correctly classified (%) | Model 1: conserved waters (%) | Model 1: displaced waters (%) | Model 2: total correctly classified (%) | Model 2: waters displaced by polar groups (%) | Model 2: waters displaced by non-polar groups (%) |
75 | 70 | 81 | 80 | 82 | 79 |
One benefit of using a probabilistic classifier is that the certainty of a prediction is naturally quantified. One would therefore expect that the higher the classification probability is, the lower the chance of misclassification. After cross validation, we found that predictions made with classification probabilities of 0.8 or above were correct in 94% and 95% of cases for the two models. This emphasizes the usefulness of the probabilistic approach taken.
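The accuracy of high-confidence predictions can be computed as sketched below. The input layout, a list of pairs holding the probability of the predicted class and whether the prediction was correct, is an assumption, as are the function name and toy values.

```python
def accuracy_above(predictions, threshold=0.8):
    """Fraction of correct predictions among those made at or above `threshold`.

    predictions is a list of (probability_of_predicted_class, is_correct) pairs.
    Returns None if no prediction reaches the threshold.
    """
    confident = [is_correct for prob, is_correct in predictions
                 if prob >= threshold]
    if not confident:
        return None
    return sum(confident) / len(confident)

# Toy cross-validation output (hypothetical).
cv_output = [(0.95, True), (0.85, True), (0.81, False),
             (0.60, True), (0.55, False)]
confident_accuracy = accuracy_above(cv_output)
overall_accuracy = accuracy_above(cv_output, threshold=0.0)
```

Comparing the two values shows how much accuracy is gained by acting only on confident predictions, at the cost of leaving the remaining sites unclassified.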
Figure 4 shows the distributions of the three scores for WaterDock predictions displaced by polar and non-polar groups as well as for conserved and displaced water molecules. While each score could be used individually to distinguish between water classes, we found that the highest accuracy in the cross validation could only be achieved using all three energy scores (Tables S4 and S5).
In Figure 4, it seems counterintuitive that conserved WaterDock predictions are more likely to have a higher lipophilic score than displaced water molecules. This is due to the fact that conserved water molecules tend to be more buried and so have more contacts with the protein, which also explains the higher hydrophilicity scores and the stronger hydrogen bonds. The opposite is true when one compares WaterDock predictions that were displaced by polar groups to water predictions that were displaced by non-polar groups. Water molecules displaced by non-polar groups tend to reside in slightly more lipophilic and less hydrophilic environments and tend to make fewer and weaker hydrogen bonds.
It is interesting to note that even though Vina's hydrogen-bonding term was established using a data mining protocol and the hydrophilicity score was designed heuristically, both scores were strongly correlated with an R² of 0.72. These very different approaches have converged to describe a related property of water. Despite the high correlation, the combination of the two scores in the machine learning algorithm increased the classification accuracy by around 7% compared to when each term was fitted individually (see Table S4). Because the increase in accuracy is seen after cross-validation, it indicates that it is not a result of over-fitting and that, despite the high correlation, the terms are sufficiently distinct so as to improve the classification success rate.
Ligand water displacement propensities
As well as predicting the role that WaterDock predictions play in ligand binding, we also investigated the propensities for ligand chemical groups to occupy predicted water sites. Given the very good agreement between WaterDock's predictions and experimentally determined water sites, we expect these displacement statistics to be similar for water molecules seen in crystal structures.
Figure 5 shows the probability of finding ligand functional groups at various distances from hypothetically displaced water sites. For a given distance cutoff, each point can be considered as the propensity that a ligand atom will displace a water molecule. Hydrogen bond donors and acceptors were equally likely to displace predicted water molecules and were found to be around 9 times more likely to be within 0.5 Å of a water site than aromatic and aliphatic carbons. This indicates that it is important for water-displacing ligand groups to replicate water's hydrogen bonding capacity. Interestingly, when the occupation probabilities were computed for ligand atoms, rather than atom functions, oxygen atoms were over twice as likely as nitrogen atoms to be found within 0.5 Å of a displaced water site. At 1.5 Å (the distance cutoff we previously used to define whether a water molecule was displaced or not) the displacement propensities of oxygen and nitrogen are roughly the same. The higher probability for a ligand oxygen atom to more closely occupy a displaced water site further emphasizes the importance of ligand groups mimicking the water molecule they displace.
The further a ligand atom is from a predicted water site, the less it can be considered to have displaced a water molecule. As a result, the propensities tend to the same value at larger distances. Ligand atoms such as halogens, sulfur and phosphorus were not included in this study due to their small number in the data set.
From Figure 5, it is tempting to conclude that ligand modifications designed to displace a water molecule should always be made with a hydrogen-bonding group. However, in this study we have seen that many water molecules, depending on their local environment, are preferentially displaced by non-polar groups. Nonetheless, because carbon is the most abundant ligand element in the Astex Diverse Set and representative of drug-like ligands, the per-atom displacement probability is significantly lower for carbon than for polar atoms.
Conclusions
Using three data sets, we have shown that by using a method we call WaterDock, the docking software AutoDock Vina can be used to predict the binding positions of water molecules in an accurate manner. Using structures that have been determined more than once by either X-ray crystallography or by neutron diffraction, we found WaterDock could predict 88% of consensus water molecules. In order to understand the structural importance of WaterDock's predictions, we combined data mining, heuristic and machine learning techniques to assess the probability that a prediction is either conserved or displaced. After cross-validation, this model had a classification accuracy of 75%. Similarly, we found we could predict whether WaterDock predictions were displaced by polar or non-polar ligand groups to an accuracy of 80%.
These models allow one to predict not only the location of water molecules, but also if a water is likely to be displaceable by oxygen or nitrogen atoms only or whether in fact there is scope for displacement by something more non-polar, like a methyl group. Such knowledge could be advantageous in the context of lead-optimization. Work is underway to see how this water scoring information can be used to improve the prediction of ligand-protein binding affinities. An example water-placement prediction script is available (Supporting Information S1) and all water classifiers are available on request.
Supporting Information
Footnotes
Competing Interests: The authors have read the journal's policy and have the following conflicts. GMM is an employee of InhibOx. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials, as detailed online in the guide for authors.
Funding: This work was supported by the Engineering and Physical Sciences Research Council (EPSRC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.
References
- 1. Roe SM, Prodromou C, O'Brien R, Ladbury JE, Piper PW, et al. Structural basis for inhibition of the Hsp90 molecular chaperone by the antitumor antibiotics radicicol and geldanamycin. J Med Chem. 1999;42:260–266. doi: 10.1021/jm980403y.
- 2. Sleigh SH, Seavers PR, Wilkinson AJ, Ladbury JE, Tame JR. Crystallographic and calorimetric analysis of peptide binding to OppA protein. J Mol Biol. 1999;291:393–415. doi: 10.1006/jmbi.1999.2929.
- 3. Lu Y, Wang R, Yang C-Y, Wang S. Analysis of ligand-bound water molecules in high resolution crystal structures of protein-ligand complexes. J Chem Inf Model. 2007;47:668–675. doi: 10.1021/ci6003527.
- 4. Clarke C, Woods RJ, Gluska J, Cooper A, Nutley MA, et al. Involvement of water in carbohydrate-protein binding. J Am Chem Soc. 2001;123:12238–12247. doi: 10.1021/ja004315q.
- 5. Lam PY, Jadhav PK, Eyermann CJ, Hodge CN, Ru Y, et al. Rational design of potent, bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science. 1994;263:380–384. doi: 10.1126/science.8278812.
- 6. de Beer SB, Vermeulen NP, Oostenbrink C. The role of water molecules in computational drug design. Curr Top Med Chem. 2010;10:55–66. doi: 10.2174/156802610790232288.
- 7. Mancera RL. Molecular modelling of hydration in drug design. Curr Opin Drug Discov Devel. 2007;10:275–280.
- 8. Wong SE, Lightstone FC. Accounting for water molecules in drug design. Exp Opin Drug Discov. 2011;6:65–74. doi: 10.1517/17460441.2011.534452.
- 9. Hussain A, Melville J, Hirst J. Molecular docking and QSAR of aplyronine A and analogues: potent inhibitors of actin. J Comput Aided Mol Des. 2010;24:1–15. doi: 10.1007/s10822-009-9307-y.
- 10. Pastor M, Cruciani G, Watson KA. A strategy for the incorporation of water molecules present in a ligand binding site into a three-dimensional quantitative structure-activity relationship analysis. J Med Chem. 1997;40:4089–4102. doi: 10.1021/jm970273d.
- 11. Taha MO, Habash M, Al-Hadidi Z, Al-Bakri A, Younis K, et al. Docking-based comparative intermolecular contacts analysis as new 3-D QSAR concept for validating docking studies and in silico screening: NMT and GP inhibitors as case studies. J Chem Inf Model. 2011;51:647–669. doi: 10.1021/ci100368t.
- 12. Wallnoefer HG, Handschuh S, Liedl KR, Fox T. Stabilizing of a globular protein by a highly complex water network: a molecular dynamics simulation study on factor Xa. J Phys Chem B. 2010;114:7405–7412. doi: 10.1021/jp101654g.
- 13. Luccarelli J, Michel J, Tirado-Rives J, Jorgensen WL. Effects of water placement on predictions of binding affinities for p38α MAP kinase inhibitors. J Chem Theory Comput. 2010;6:3850–3856. doi: 10.1021/ct100504h.
- 14. Wallnoefer HG, Liedl KR, Fox T. A challenging system: Free energy prediction for factor Xa. J Comput Chem. 2011;32:1743–1752. doi: 10.1002/jcc.21758.
- 15. de Graaf C, Oostenbrink C, Keizers PH, van der Wijst T, Jongejan A, et al. Catalytic site prediction and virtual screening of cytochrome P450 2D6 substrates by consideration of water and rescoring in automated docking. J Med Chem. 2006;49:2417–2430. doi: 10.1021/jm0508538.
- 16. de Graaf C, Pospisil P, Pos W, Folkers G, Vermeulen NP. Binding mode prediction of cytochrome P450 and thymidine kinase protein-ligand complexes by consideration of water and rescoring in automated docking. J Med Chem. 2005;48:2308–2318. doi: 10.1021/jm049650u.
- 17. Rarey M, Kramer B, Lengauer T. The particle concept: placing discrete water molecules during protein-ligand docking predictions. Proteins. 1999;34:17–28.
- 18. Roberts BC, Mancera RL. Ligand-protein docking with water molecules. J Chem Inf Model. 2008;48:397–408. doi: 10.1021/ci700285e.
- 19. Santos R, Hritz J, Oostenbrink C. Role of water in molecular docking simulations of cytochrome P450 2D6. J Chem Inf Model. 2010;50:146–154. doi: 10.1021/ci900293e.
- 20. Thilagavathi R, Mancera RL. Ligand-protein cross-docking with water molecules. J Chem Inf Model. 2010;50:415–421. doi: 10.1021/ci900345h.
- 21. Bellocchi D, Macchiarulo A, Costantino G, Pellicciari R. Docking studies on PARP-1 inhibitors: insights into the role of a binding pocket water molecule. Bioorg Med Chem. 2005;13:1151–1157. doi: 10.1016/j.bmc.2004.11.024.
- 22. Chen JM, Xu SL, Wawrzak Z, Basarab GS, Jordan DB. Structure-based design of potent inhibitors of scytalone dehydratase: displacement of a water molecule from the active site. Biochemistry. 1998;37:17735–17744. doi: 10.1021/bi981848r.
- 23. Wissner A, Berger DM, Boschelli DH, Floyd MB, Jr, Greenberger LM, et al. 4-Anilino-6,7-dialkoxyquinoline-3-carbonitrile inhibitors of epidermal growth factor receptor kinase and their bioisosteric relationship to the 4-anilino-6,7-dialkoxyquinazoline inhibitors. J Med Chem. 2000;43:3244–3256. doi: 10.1021/jm000206a.
- 24. Clarke C, Woods RJ, Gluska J, Cooper A, Nutley MA, et al. Involvement of water in carbohydrate-protein binding. J Am Chem Soc. 2001;123:12238–12247. doi: 10.1021/ja004315q.
- 25. Kadirvelraj R, Foley BL, Dyekjaer JD, Woods RJ. Involvement of water in carbohydrate-protein binding: Concanavalin A revisited. J Am Chem Soc. 2008;130:16933–16942. doi: 10.1021/ja8039663.
- 26. Mikol V, Papageorgiou C, Borer X. The role of water molecules in the structure-based design of (5-hydroxynorvaline)-2-cyclosporin: synthesis, biological activity, and crystallographic analysis with cyclophilin A. J Med Chem. 1995;38:3361–3367. doi: 10.1021/jm00017a020.
- 27. Garcia-Sosa AT, Mancera RL. Free energy calculations of mutations involving a tightly bound water molecule and ligand substitutions in a ligand-protein complex. Mol Inf. 2010;29:589–600. doi: 10.1002/minf.201000007.
- 28. Michel J, Tirado-Rives J, Jorgensen WL. Energetics of displacing water molecules from protein binding sites: Consequences for ligand optimization. J Am Chem Soc. 2009;131:15403–15411. doi: 10.1021/ja906058w.
- 29. Lloyd DG, Garcia-Sosa AT, Alberts IL, Todorov NP, Mancera RL. The effect of tightly bound water molecules on the structural interpretation of ligand-derived pharmacophore models. J Comp Aided Mol Des. 2004;18:89–100. doi: 10.1023/b:jcam.0000030032.81753.b4.
- 30. Garcia-Sosa AT, Firth-Clark S, Mancera RL. Including Tightly-Bound Water Molecules in de Novo Drug Design. Exemplification through the in Silico Generation of Poly(ADP-ribose)polymerase Ligands. J Chem Inf Model. 2005;45:624–633. doi: 10.1021/ci049694b.
- 31. Garcia-Sosa AT, Mancera RL. The effect of a tightly bound water molecule on scaffold diversity in the computer-aided de novo ligand design of CDK2 inhibitors. J Mol Mod. 2006;12:422–431. doi: 10.1007/s00894-005-0063-1.
- 32. Mancera RL. De novo ligand design with explicit water molecules: an application to bacterial neuraminidase. J Comput Aided Mol Des. 2002;16:479–499. doi: 10.1023/a:1021273501447.
- 33. Carugo O, Bordo D. How many water molecules can be detected by protein crystallography? Acta Crystallogr D Biol Crystallogr. 1999;55:479–483. doi: 10.1107/s0907444998012086.
- 34. Davis AM, Teague SJ, Kleywegt GJ. Application and limitations of X-ray crystallographic data in structure-based ligand and drug design. Angew Chem Int Ed Engl. 2003;42:2718–2736. doi: 10.1002/anie.200200539.
- 35. Ernst JA, Clubb RT, Zhou H-X, Gronenborn AM, Clore GM. Demonstration of positionally disordered water within a protein hydrophobic cavity by NMR. Science. 1995;267:1813–1816. doi: 10.1126/science.7892604.
- 36. Henchman RH, McCammon JA. Extracting hydration sites around proteins from explicit water simulations. J Comput Chem. 2002;23:861–869. doi: 10.1002/jcc.10074.
- 37. Resat H, Mezei M. Grand canonical ensemble Monte Carlo simulation of the dCpG/proflavine crystal hydrate. Biophys J. 1996;71:1179–1190. doi: 10.1016/S0006-3495(96)79322-0.
- 38. Michel J, Essex JW. Prediction of protein-ligand binding affinity by free energy simulations: assumptions, pitfalls and expectations. J Comput Aided Mol Des. 2010;24:639–658. doi: 10.1007/s10822-010-9363-3.
- 39. Imai T, Hiraoka R, Kovalenko A, Hirata F. Locating missing water molecules in protein cavities by the three-dimensional reference interaction site model theory of molecular solvation. Proteins: Struct Func Genet. 2007;66:804–813. doi: 10.1002/prot.21311.
- 40. Imai T, Oda K, Kovalenko A, Hirata F, Kidera A. Ligand mapping on protein surfaces by the 3D-RISM theory: Toward computational fragment-based drug design. J Am Chem Soc. 2009;131:12430–12440. doi: 10.1021/ja905029t.
- 41. Lazaridis T. Inhomogeneous fluid approach to solvation thermodynamics. 1. Theory. J Phys Chem B. 1998;102:3531–3541.
- 42. Lazaridis T. Inhomogeneous fluid approach to solvation thermodynamics. 2. Applications to simple fluids. J Phys Chem B. 1998;102:3542–3550.
- 43. Li Z, Lazaridis T. Thermodynamic contributions of the ordered water molecule in HIV-1 protease. J Am Chem Soc. 2003;125:6636–6637. doi: 10.1021/ja0299203.
- 44. Li Z, Lazaridis T. Thermodynamics of buried water clusters at a protein-ligand binding interface. J Phys Chem B. 2006;110:1464–1475. doi: 10.1021/jp056020a.
- 45. Li Z, Lazaridis T. The effect of water displacement on binding thermodynamics: concanavalin A. J Phys Chem B. 2005;109:662–670. doi: 10.1021/jp0477912.
- 46. Li Z, Lazaridis T. Thermodynamics of buried water clusters at a protein-ligand binding interface. J Phys Chem B. 2006;110:1464–1475. doi: 10.1021/jp056020a.
- 47. Abel R, Young T, Farid R, Berne BJ, Friesner RA. Role of the active-site solvent in the thermodynamics of factor Xa ligand binding. J Am Chem Soc. 2008;130:2817–2831. doi: 10.1021/ja0771033.
- 48. Young RJ, Campbell M, Borthwick AD, Brown D, Burns-Kurtis CL, et al. Structure- and property-based design of factor Xa inhibitors: pyrrolidin-2-ones with acyclic alanyl amides as P4 motifs. Bioorg Med Chem Lett. 2006;16:5953–5957. doi: 10.1016/j.bmcl.2006.09.001.
- 49.Frydenvang K, Pickering DS, Greenwood JR, Krogsgaard-Larsen N, Brehm L, et al. Biostructural and pharmacological studies of bicyclic analogues of the 3-isoxazolol glutamate receptor agonist ibotenic acid. J Med Chem. 2010;53:8354–8361. doi: 10.1021/jm101218a. [DOI] [PubMed] [Google Scholar]
- 50.Robinson DD, Sherman W, Farid R. Understanding kinase selectivity through energetic analysis of binding site waters. ChemMedChem. 2010;5:618–627. doi: 10.1002/cmdc.200900501. [DOI] [PubMed] [Google Scholar]
- 51.Goodford PJ. A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J Med Chem. 1985;28:849–857. doi: 10.1021/jm00145a002. [DOI] [PubMed] [Google Scholar]
- 52.Setny P, Zacharias M. Hydration in discrete water. A mean field, cellular automata based approach to calculating hydration free energies. J Phys Chem B. 2010;114:8667–8675. doi: 10.1021/jp102462s. [DOI] [PubMed] [Google Scholar]
- 53.Thanki N, Thornton JM, Goodfellow JM. Distributions of water around amino acid residues in proteins. J Mol Biol. 1988;202:637–657. doi: 10.1016/0022-2836(88)90292-6. [DOI] [PubMed] [Google Scholar]
- 54.Pitt WR, Goodfellow JM. Modelling of solvent positions around polar groups in proteins. Protein Eng. 1991;4:531–537. doi: 10.1093/protein/4.5.531. [DOI] [PubMed] [Google Scholar]
- 55.Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Allen FH. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallogr B. 2002;58:380–388. doi: 10.1107/s0108768102003890. [DOI] [PubMed] [Google Scholar]
- 57.Verdonk ML, Cole JC, Taylor R. SuperStar: a knowledge-based approach for identifying interaction sites in proteins. J Mol Biol. 1999;289:1093–1108. doi: 10.1006/jmbi.1999.2809. [DOI] [PubMed] [Google Scholar]
- 58.Schymkowitz JW, Rousseau F, Martins IC, Ferkinghoff-Borg J, Stricher F, et al. Prediction of water and metal binding sites and their affinities by using the Fold-X force field. Proc Natl Acad Sci U S A. 2005;102:10147–10152. doi: 10.1073/pnas.0501980102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Rossato G, Ernst B, Vedani A, Smiesko M. AcquaAlta: A directional approach to the solvation of ligand-protein complexes. J Chem Inf Model. 2011;51:1867–1881. doi: 10.1021/ci200150p.
- 60.Huang N, Shoichet BK. Exploiting ordered waters in molecular docking. J Med Chem. 2008;51:4862–4865. doi: 10.1021/jm8006239.
- 61.Verdonk ML, Chessari G, Cole JC, Hartshorn MJ, Murray CW, et al. Modeling water molecules in protein-ligand docking using GOLD. J Med Chem. 2005;48:6504–6515. doi: 10.1021/jm050543p.
- 62.Raymer ML, Sanschagrin PC, Punch WF, Venkataraman S, Goodman ED, et al. Predicting conserved water-mediated and polar ligand interactions in proteins using a K-nearest-neighbors genetic algorithm. J Mol Biol. 1997;265:445–464. doi: 10.1006/jmbi.1996.0746.
- 63.Kellogg GE, Semus SF, Abraham DJ. HINT: a new method of empirical hydrophobic field calculation for CoMFA. J Comput Aided Mol Des. 1991;5:545–552. doi: 10.1007/BF00135313.
- 64.Chen DL, Kellogg GE. A computational tool to optimize ligand selectivity between two similar biomacromolecular targets. J Comput Aided Mol Des. 2005;19:69–82. doi: 10.1007/s10822-005-1485-7.
- 65.Amadasi A, Spyrakis F, Cozzini P, Abraham DJ, Kellogg GE, et al. Mapping the energetics of water-protein and water-ligand interactions with the “natural” HINT forcefield: predictive tools for characterizing the roles of water in biomolecules. J Mol Biol. 2006;358:289–309. doi: 10.1016/j.jmb.2006.01.053.
- 66.Amadasi A, Surface JA, Spyrakis F, Cozzini P, Mozzarelli A, et al. Robust classification of “relevant” water molecules in putative protein binding sites. J Med Chem. 2008;51:1063–1067. doi: 10.1021/jm701023h.
- 67.Garcia-Sosa AT, Mancera RL, Dean PM. WaterScore: a novel method for distinguishing between bound and displaceable water molecules in the crystal structure of the binding site of protein-ligand complexes. J Mol Model. 2003;9:172–182. doi: 10.1007/s00894-003-0129-x.
- 68.Barillari C, Taylor J, Viner R, Essex JW. Classification of water molecules in protein binding sites. J Am Chem Soc. 2007;129:2577–2587. doi: 10.1021/ja066980q.
- 69.Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31:455–461. doi: 10.1002/jcc.21334.
- 70.Hartshorn MJ, Verdonk ML, Chessari G, Brewerton SC, Mooij WT, et al. Diverse, high-quality test set for the validation of protein-ligand docking performance. J Med Chem. 2007;50:726–741. doi: 10.1021/jm061277y.
- 71.van Rossum G. Python tutorial. Technical report CS-R9526. Amsterdam: Centrum voor Wiskunde en Informatica (CWI); 1995.
- 72.Morris GM, Huey R, Lindstrom W, Sanner MF, Belew RK, et al. AutoDock 4 and AutoDockTools 4: Automated docking with selective receptor flexibility. J Comput Chem. 2009;30:2785–2791. doi: 10.1002/jcc.21256.
- 73.Morris GM, Goodsell DS, Halliday RS, Huey R, Hart WE, et al. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J Comput Chem. 1998;19:1639–1662.
- 74.R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: 2011.
- 75.Akaike H. A new look at the statistical model identification. IEEE Trans Automatic Control. 1974;19:716–722.
- 76.Kuhn LA, Swanson CA, Pique ME, Tainer JA, Getzoff ED. Atomic and residue hydrophilicity in the context of folded protein structures. Proteins: Struct Funct Genet. 1995;23:536–547. doi: 10.1002/prot.340230408.
- 77.Israelachvili J, Pashley R. The hydrophobic interaction is long range, decaying exponentially with distance. Nature. 1982;300:341–342. doi: 10.1038/300341a0.
- 78.Li A-J, Nussinov R. A set of van der Waals and coulombic radii of protein atoms for molecular and solvent-accessible surface calculation, packing evaluation, and docking. Proteins: Struct Funct Genet. 1998;32:111–127.
- 79.Narten A, Levy H. Liquid water: Molecular correlation functions from X-ray diffraction. J Chem Phys. 1971;55:2263.
Associated Data