2018 Apr 3;26(4):565–571.e3. doi: 10.1016/j.str.2018.02.009

Ranking Enzyme Structures in the PDB by Bound Ligand Similarity to Biological Substrates

Jonathan D Tyzack 1,2, Laurent Fernando 1, Antonio JM Ribeiro 1, Neera Borkakoti 1, Janet M Thornton 1
PMCID: PMC5890617  PMID: 29551288

Summary

There are numerous applications that use the structures of protein-ligand complexes from the PDB, such as 3D pharmacophore identification, virtual screening, and fragment-based drug design. The structures underlying these applications are potentially much more informative if they contain biologically relevant bound ligands, with high similarity to the cognate ligands. We present a study of ligand-enzyme complexes that compares the similarity of bound and cognate ligands, enabling the best matches to be identified. We calculate the molecular similarity scores using a method called PARITY (proportion of atoms residing in identical topology), which can conveniently be combined to give a similarity score for all cognate reactants or products in the reaction. Thus, we generate a rank-ordered list of related PDB structures, according to the biological similarity of the ligands bound in the structures.

Keywords: enzyme, PDB, similarity, ligand, biological, relevance, bound, cognate, native

Highlights

  • We present PARITY, matching atoms in identical topology to gauge ligand similarity

  • Bound-cognate ligand similarity is a useful metric for ranking PDB structures

  • Only 26% of enzyme structures in the PDB have bound-cognate ligand similarity ≥0.7

  • We provide rank-ordered lists of PDBs with the most biologically relevant ligands


Tyzack et al. present a similarity study, comparing bound ligands in enzyme structures deposited in the PDB with biological substrates in native reactions. They identify structures most likely to contain relevant information regarding enzyme-substrate interactions and binding interfaces.

Introduction

Enzymes are an important group of drug targets where understanding ligand-enzyme binding requires inspection of crystal structures with bound ligands. The ligand-enzyme complex becomes more informative if the bound ligand is similar to the cognate ligand (i.e., the compound expected to bind in vivo), allowing the binding site and ligand-enzyme interactions to be identified more completely. With unmodified target proteins, it is often not possible to bind cognate ligands without the reaction occurring, so compounds with varying degrees of similarity to the cognate ligands are used as surrogates in co-crystallization experiments.

There are many databases of ligand-protein binding sites, which aim to define the binding cavity and identify important protein-ligand interactions (Stuart et al., 2002, Hendlich et al., 2003, Golovin, 2004, Ivanisenko, 2004, Shoemaker et al., 2010, Kufareva et al., 2012, Desaphy et al., 2015), overlay the binding site with known bound ligands (Laskowski, 2004, Lombard et al., 2014), and score structures using binding affinity and resolution data (Hu et al., 2005). One challenge is to identify the best PDB (Gutmanas et al., 2014) entry for analysis, since many proteins have multiple structures available, with many different ligands. “Best” structures are often selected using their resolution, without regard for the nature of the ligands bound. For enzymes, we propose the similarity of bound and cognate small-molecule ligands as another important measure for scoring structures, where pocket identification and description would be enhanced from understanding the biological relevance of the bound substrates.

Bound protein-ligand structures are used in many applications, including virtual screening and 3D pharmacophore identification. For example, ligand-homology modeling (Drwal and Griffith, 2013) uses binding site alignment and ligand transposition (Konc et al., 2015) as the basis to score and validate protein-ligand interactions (Najmanovich et al., 2008, Shin et al., 2011, Evangelidis et al., 2012, Konc et al., 2012, Kurbatova et al., 2013, Zhou and Skolnick, 2013, Heo et al., 2014, Konc and Janežič, 2014, Cleves and Jain, 2015, Roy et al., 2015). Docking methods have also been enhanced by using the location of bound ligands to supplement scoring functions (Stanton et al., 2015, Anighoro and Bajorath, 2016) and to enable false positives to be pruned from virtual screening (Bietz et al., 2016). Large-scale computational methods that identify 3D binding pharmacophores (Meslamani et al., 2012) or represent ligand-protein interactions as networks (Kalinina et al., 2011, Martínez-Jiménez and Marti-Renom, 2015, Kasahara and Kinoshita, 2016) are also likely to be enhanced with knowledge about the biological relevance of the ligands on which they are based, potentially improving the prediction of ligand-protein interactions (Kinnings and Jackson, 2011) and the performance of machine learning methods to classify actives from decoys (Chupakhin et al., 2013).

A further example of the use of bound structures is fragment-based design, which links fragments from bound substrates to design ligands (Desaphy and Rognan, 2014, Tang and Altman, 2014, Wang et al., 2015). In many cases the bound ligands are inhibitors or are molecules with their active fragments modified, so a guide to find those ligands that are most similar to the cognate molecule could help in the design of active molecules.

Herein, we present a method called PARITY (proportion of atoms residing in identical topology) to compare the similarity of bound and cognate ligands and automatically annotate the current content of the PDB. The methodology is described in detail in the online STAR Methods and summarized in the flow chart in Figure 1, with an example of the PARITY method provided in Figure 2. We anticipate that our similarity scores will allow researchers to identify the most representative and biologically relevant PDB structures when collating datasets for the diverse methods applying these data. The similarity scores generated from this analysis are released in the public Mendeley Data Repository.

Figure 1. Flowchart to Summarize the Overall Methodology

Figure 2. PARITY Example

Illustration of the similarity calculation for KEGG R05493 (alpha-ketoglutarate-dependent 2,4-dichlorophenoxyacetate dioxygenase) and PDB: 3AVR (catalytic fragment of UTX/KDM6A bound with histone H3K27me3 peptide, N-oxyalylglycine, and Ni(II)). The cognate reactants C07088 (4-chlorophenoxyacetate), C00026 (2-oxoglutarate), and C00007 (oxygen) are matched to the most similar bound ligands (OGA [N-oxalylglycine], EDO [1,2-ethanediol], and M3L [N-trimethyllysine], respectively) using PARITY by matching atoms of the same type in equivalent topological positions. The 2D ligand graphics were generated using MarvinSketch (ChemAxon, 2016).

Results

In this section, we present the analyses of the similarity of bound ligands to cognate ligands across the dataset (Figure 3). Separate plots are generated finding the most similar PDB-KEGG match for each (1) PDB structure, (2) 100% sequence identity cluster, (3) KEGG reaction, and (4) EC reference represented in the dataset, with a relative frequency comparison in Figure 3E. It would be possible to cluster the sequences using looser clustering criteria, which would produce fewer clusters and would in all likelihood remove some of the poorer matches, but only the 100% sequence identity results are presented here. Summary level data are presented in Table 1 showing the percentage of the best PDB-KEGG matches in different similarity categories.

Figure 3. Bound Ligands versus Cognate Ligands Similarity Frequency Graphs

The graphs show the frequency of the binned similarity scores of bound ligands from PDB structures and cognate ligands from KEGG reactions. Graphs are produced for the most similar PDB-KEGG match for each (A) PDB structure, (B) 100% sequence identity cluster, (C) KEGG reaction, and (D) EC reference, with relative cumulative frequency comparisons provided in (E).

Table 1. Similarity Results

Best PDB-KEGG Match for Each | Num | None Bound, % | Sim ≤ 0.3, % | 0.3 < Sim < 0.7, % | Sim ≥ 0.7, %
(a) PDB | 56,994 | 16.9 | 31.3 | 25.8 | 26.0
(b) Cluster | 14,257 | 17.8 | 23.2 | 23.2 | 36.2
(c) KEGG | 9,308 | 0.0 | 3.1 | 38.0 | 58.9
(d) EC | 5,392 | 0.0 | 3.0 | 34.1 | 62.9

Table shows the number of matches and the percentage with no bound ligands, similarity ≤ 0.3, similarity between 0.3 and 0.7, and similarity ≥ 0.7 for the most similar PDB-KEGG match for each (a) PDB, (b) 100% sequence identity cluster, (c) KEGG reaction, and (d) EC.

A key observation is the high level of PDB structures where there are no bound ligands or that have bound ligands with low similarity to cognate ligands. It can be seen from Table 1 that 16.9% (9,612) of the PDBs in the study do not have bound ligands and, after finding the most similar KEGG reaction for those that do, 31.3% (17,833) have a similarity score of less than or equal to 0.3. Only 26.0% (14,839) have a similarity score of greater than or equal to 0.7.

Selecting the most similar PDB-KEGG match for each cluster of 100% sequence identity improves the situation since many of the poorer matches can be discarded. This improves further by selecting the most similar PDB-KEGG match for each KEGG and EC, where the proportion with a similarity score greater than or equal to 0.7 increases to 58.9% and 62.9%, respectively. This emphasizes the usefulness of the similarity measure and how it can be used to identify PDB structures with bound ligands most similar to the cognate ligands.

To further demonstrate the value of this dataset, we provide two use cases where it is required to find PDB structures with the most similar bound ligands, first for a particular KEGG reaction and second for a particular EC reaction. In these examples, we are able to identify the best matches from a large number of structures with varying degrees of similarity, eliminating the need to manually sift through the data and enabling research effort to be focused on inspecting the ligand binding or other higher value activities.

Use Case 1: Finding PDBs with the Most Similar Bound Ligands to KEGG R01026

The number of PDBs referencing R01026 (acetylcholine acetylhydrolase) via its parent EC references 3.1.1.7 (acetylcholinesterase) and 3.1.1.8 (cholinesterase) is 840, with the frequency distribution of binned similarity scores shown in Figure 4B. By measuring the similarity of bound ligands in each PDB to the reactants and products in R01026, we are able to ascertain that there are only 16 structures (1.9%) with a similarity score greater than or equal to 0.7 and only two exact matches. One of the exact matches, PDB: 2HA4 (crystal structure of mutant S203A of mouse acetylcholinesterase complexed with acetylcholine), is shown in Figure 4C.

Figure 4. Use Case 1

Graphic shows (A) the KEGG reaction for R01026 (acetylcholine acetylhydrolase); (B) the distribution of similarity scores between the cognate ligands and the bound ligands in PDBs referencing R01026; (C) a representation of the binding pocket of one of the best matches, PDB: 2HA4 (crystal structure of mutant S203A of mouse acetylcholinesterase complexed with acetylcholine), with bound acetylcholine; and (D) an extract from the Mendeley Data Repository for R01026. (In D, num_KEGG refers to the number of KEGG reactions potentially matched to the PDB via the EC in SIFTS, r_or_p refers to whether reactants [r] or products [p] have been matched, and cpd_matches details the bound and cognate matched ligands in the format bound; cognate; similarity_score with multiple matches separated by an underscore.)

Use Case 2: Finding PDBs with the Most Similar Bound Ligands to EC 4.2.1.75

The number of PDBs referencing EC 4.2.1.75 (uroporphyrinogen III synthase) is 59, with the frequency distribution of binned similarity scores shown in Figure 5B. By measuring the similarity of bound ligands in each PDB to the reactants and products in R03165 (hydroxymethylbilane hydro-lyase [cyclizing]) we are able to ascertain that there is only one structure (1.7%) with a similarity score greater than or equal to 0.7, which is also an exact match, shown in Figure 5C.

Figure 5. Use Case 2

Graphic shows (A) the KEGG reaction for R03165 (hydroxymethylbilane hydro-lyase [cyclizing]); (B) the distribution of similarity scores between the cognate ligands and the bound ligands in PDBs referencing EC 4.2.1.75 (uroporphyrinogen III synthase); (C) a representation of the binding pocket of the best match, PDB: 3D8N (uroporphyrinogen III synthase-uroporphyrinogen III complex), with bound uroporphyrinogen; and (D) an extract from the Mendeley Data Repository for EC 4.2.1.75. (In D num_KEGG refers to the number of KEGG reactions potentially matched to the PDB via the EC in SIFTS, r_or_p refers to whether reactants [r] or products [p] have been matched, and cpd_matches details the bound and cognate matched ligands in the format bound; cognate; similarity_score with multiple matches separated by an underscore.)

Validation Using Manually Curated M-CSA Dataset

A limitation of the methodology described is the potentially imprecise matching of PDB to KEGG via the EC reference number(s) in SIFTS. In many cases a PDB is uniquely matched to one KEGG reaction via its sole EC reference (38.8% of our dataset are uniquely mapped to one KEGG reaction in this way), but in some cases multiple EC numbers are listed, the EC number is partial, or the EC number maps to multiple KEGG reactions. In these cases, it is not possible to know precisely which KEGG reaction(s) is the “correct” match for that structure; some KEGG reactions may be catalyzed by the enzyme, but with lower efficiency, and some may not occur at all. This is a problem we do not attempt to resolve here; rather we explicitly show in the output data the number of KEGG reactions associated with each PDB structure so it is obvious to the user how many KEGG reactions have been matched, from which one will be selected.

It is interesting to note that 83.2% of the full PDB dataset map to only one EC number, so any related KEGG reactions can be expected to share similar chemistry on similar substrates. A further 10.2% have multiple EC numbers that all share the same EC subsubclass (i.e., the third level of the EC annotation), and so can be expected to share similar chemistry but on more divergent substrates. This leaves 6.6% that map to multiple EC numbers differing more fundamentally, at the EC class or subclass level. It would make an interesting study to measure and compare the similarity of KEGG reactions mapped to PDB structures via EC numbers or within EC subsubclasses, but this is not considered further here, where the focus remains on the similarity of bound and cognate ligands.

The purpose of this work is not to assign function to structure based on bound ligands; rather, it is intended to prioritize searches from a KEGG or EC perspective toward those structures most likely to contain relevant bound ligands. It therefore does not eliminate the need for a user to verify that a particular PDB structure can indeed perform the query reaction. Bearing this in mind, it is still informative to compare some of the PDB to KEGG matches above a predefined similarity threshold with manually curated data, to provide some insight into the relevance of the best matches.

To test the impact of this uncertainty, we examined a manually curated set of PDBs from the M-CSA (Ribeiro et al., 2017). The M-CSA is a database of enzyme mechanisms that selects a representative PDB code for each entry using a set of selection criteria, which include best resolution, lack of mutations, and presence of cofactors and/or ligands. We chose to look only at those M-CSA entries that had an associated KEGG reaction with a similarity score of at least 0.7, giving a dataset of 2,629 PDB to KEGG reaction data points. In 84.2% of this dataset, the most similar KEGG reaction from our analysis agreed with the M-CSA manual assignment, meaning that in 15.8% of cases the bound ligands are more similar to a different KEGG reaction, albeit usually a very similar one. This is not necessarily wrong (the enzyme may be promiscuous or the EC nomenclature may be “hierarchical” in some way so that one EC number subsumes another), but it highlights the need to validate the match when the EC number includes more than one KEGG reaction. It should be noted that the proportion of PDB structures with only one potential KEGG reaction in our M-CSA validation dataset is 45.6%, higher than in the overall dataset (38.8%). Looking further down the rankings, the M-CSA manually assigned reaction was in the top two and top five rankings in our analysis for 95.3% and 98.8% of the dataset, respectively. This shows that the manually assigned reaction usually appeared high in the ranking, if not always in the top rank, giving confidence that when the similarity score is high, relevant KEGG reactions are being matched.
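The top-rank, top-two, and top-five agreement figures quoted above can be computed along the following lines. This is only a minimal sketch: the function name `top_k_agreement`, the dictionary layout, and the placeholder identifiers (`pdbA`, `R1`, and so on) are assumptions made for illustration and are not taken from the published analysis or data files.

```python
def top_k_agreement(rankings, curated, k):
    """Fraction of structures whose curated reaction is within the top k ranked reactions.

    rankings: dict mapping a structure ID to its reactions, most similar first.
    curated:  dict mapping a structure ID to the manually assigned reaction.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for pdb_id, ranked in rankings.items()
        if curated.get(pdb_id) in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical placeholder data, not real PDB or KEGG identifiers.
rankings = {
    "pdbA": ["R1", "R2", "R3"],
    "pdbB": ["R2", "R1"],
    "pdbC": ["R3", "R4", "R5"],
    "pdbD": ["R9", "R8", "R7"],
}
curated = {"pdbA": "R1", "pdbB": "R1", "pdbC": "R5", "pdbD": "R0"}

for k in (1, 2, 5):
    print(k, top_k_agreement(rankings, curated, k))
```

With the placeholder data, agreement rises as k grows, mirroring the pattern reported for the M-CSA comparison.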

Discussion

Our results show that there is high variability in the similarity of bound ligands to cognate ligands across PDB structures, with a high proportion of PDB structures containing no bound ligands or containing ligands with low similarity to the cognate molecule. These results emphasize the importance of a measure of the relevance of the PDB in terms of the similarity of bound and cognate ligands (in addition to other more established metrics such as resolution) to help researchers more easily identify the most biologically relevant PDB structures as starting points for further study.

This variability in similarity of bound and cognate ligands makes some PDB structures much more useful and informative for the diverse virtual screening methods described in the Introduction, which rely on identifying key interactions between ligands and the binding site. Despite the current limitation of low similarity to the cognate ligand in many situations, these virtual screening methods are able to extract relevant information from the bound structure and apply it in a beneficial way in a wide variety of applications. The similarity scores released in the public Mendeley Data Repository and forming the basis of this study will help researchers in the preparation of datasets for these virtual screening methods, circumventing the need to manually review many PDB structures for suitability. We anticipate that this will help in the identification of binding pockets with greater accuracy and specificity and help in the identification of actives when used in virtual screening.

The similarity measure described in this paper finds the proportion of atoms residing in identical topology between molecules, a method we call PARITY. We favor this similarity metric since it avoids the situation where a small change to a crucial connecting atom disrupts the maximum common substructure or molecular paths, having a disproportionately large detrimental effect on the similarity score. It also enables the ligand similarity scores to be conveniently combined into a KEGG reactants/products similarity score favoring more complete matching of reactants or products.

Where there is sufficient similarity between bound and cognate ligands there is potential to use the bound atom coordinates in the matched region from PARITY to constrain the docking of the cognate ligand. The docked cognate-ligand enzyme structures obtained would have the potential to enhance and supplement virtual screening datasets and facilitate the further study of ligand-enzyme interactions. We aim to generate a dataset of docked cognate-ligand enzyme structures to be released alongside the similarity scores in an online database.

STAR★Methods

Key Resources Table

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited Data
ec_pdb_sep17.csv | This paper; Mendeley Data | https://doi.org/10.17632/7c48npgyr8.2#file-d8c87e4a-76b7-4763-bd7d-61a6aa854021
kegg_pdb_sep17.csv | This paper; Mendeley Data | https://doi.org/10.17632/7c48npgyr8.2#file-c3650d9e-36b8-4126-9c57-aa77a5ce92ed
keggCpd_pdb_sep17.csv | This paper; Mendeley Data | https://doi.org/10.17632/7c48npgyr8.2#file-8cd8d174-7b52-4e22-9898-d518c626c798

Software and Algorithms
RDKit | Landrum et al. | http://www.rdkit.org

Other
pdb_chain_enzyme.csv | Protein Data Bank in Europe (PDBe), https://www.ebi.ac.uk/pdbe/ | pdb_chain_enzyme.csv.gz, https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html
components.cif | Protein Data Bank in Europe (PDBe), https://www.ebi.ac.uk/pdbe/ | components.cif, ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif
PDB files | Protein Data Bank in Europe (PDBe), https://www.ebi.ac.uk/pdbe/ | Not applicable
KEGG compound MOL files; KEGG reactions | Kyoto Encyclopedia of Genes and Genomes (KEGG), http://www.genome.jp/kegg/ | KEGG API, http://www.kegg.jp/kegg/rest/keggapi.html

Contact for Reagent and Resource Sharing

Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Jonathan Tyzack (tyzack@ebi.ac.uk).

Method Details

The overall goal of this paper is to generate a ranked list of PDB structures according to biological similarity of their ligands. A flow chart describing the different steps in this analysis (explained in more detail below) is presented in Figure 1. At its heart is a method to compare the similarity of bound and cognate ligands, which we call PARITY (Proportion of Atoms Residing in Identical Topology). This measures the similarity of bound ligands in PDB structures (Rose et al., 2015) with cognate ligands from KEGG reactions (Kanehisa et al., 2016), linking via the EC number published in SIFTS (Velankar et al., 2013).

Data Set Curation

Firstly, it was necessary to identify enzyme structures within the PDB to form the basis for the similarity study. SIFTS (Structure Integration with Function, Taxonomy and Sequence) (Velankar et al., 2013) is a resource for mapping between PDB and UniProt, but also consolidates metadata including up-to-date EC references. Therefore, SIFTS was used to obtain the latest mappings of PDB codes to EC references (using the SIFTS file pdb_chain_enzyme.csv dated 2017/09/23). For structures where only a partial EC reference is listed, the structure was mapped to all downstream leaves in the EC hierarchy.
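The mapping of a partial EC reference to all downstream leaves can be illustrated with a short sketch. The function name `expand_partial_ec`, the use of `-` to mark an unspecified level, and the small set of known EC numbers are assumptions for illustration; the source does not describe how this step was implemented.

```python
def expand_partial_ec(ec, known_ecs):
    """Return the fully specified EC numbers covered by a (possibly partial) EC reference.

    A level written as '-' is treated as unspecified, e.g. '3.1.1.-'.
    """
    # Keep only the levels that are actually specified.
    prefix = [level for level in ec.split(".") if level != "-"]
    if len(prefix) == 4:
        # Already a complete EC number: nothing to expand.
        return [ec]
    return sorted(
        known for known in known_ecs
        if known.split(".")[:len(prefix)] == prefix
    )


# Small illustrative set of EC numbers (all appear elsewhere in the text except 3.1.2.1).
known = {"3.1.1.7", "3.1.1.8", "3.1.2.1", "4.2.1.75"}

print(expand_partial_ec("3.1.1.-", known))
print(expand_partial_ec("3.1.-.-", known))
```

Each leaf returned this way would then be looked up in KEGG to obtain candidate reactions for the structure.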

To be able to carry out the similarity analysis it was necessary to obtain molecular structures for the bound and cognate ligands. Chemical structures for bound PDB ligands were obtained by matching the ligand reference to a database of SMILES strings provided by the PDB (components.cif), but structures for cognate reactants/products are not always uniquely assigned at the EC level. Therefore, the EC references were mapped to KEGG reactions in the KEGG database (Kanehisa, 2000, Kanehisa et al., 2016), enabling PDB structures to be linked to KEGG reaction(s) and chemical structures for cognate ligands to be obtained from MOL file representations within KEGG.

Of the 68,236 PDB structures listed in the SIFTS download, 56,994 could be matched to a KEGG reaction with molecular structures for the reactants/products. Of the remaining 11,242 structures, the majority act on polymer substrates: 6,606 belong to EC 3.4 (peptidases); 1,698 belong to EC 3.1 (acting on ester bonds); 1,050 belong to EC 3.2 (glycosylases); 838 belong to EC 2.3 (acyltransferases); and 1,050 belong to other EC categories. These were excluded from the analysis as similarity calculations are not possible without cognate ligand structures.

KEGG has some generic reactions that contain a Markush structure where the R group represents a position where variation is tolerated and often elucidated in more specific child reactions. The similarity methods described in the next section allow for Markush structures by replacing the R group in the cognate ligand with the matched fragment in the bound ligand, allowing comparable similarity scores to be generated for reactions containing Markush structures.

Ligand Similarity Using PARITY

The RDKit cheminformatics toolkit (Landrum, 2016) was used to read molecules in SMILES or MOL file format and perform the similarity calculations. The molecular similarity between bound molecule B and cognate molecule C was calculated using PARITY (Proportion of Atoms Residing in Identical Topology) by identifying the proportion of atoms of the same type residing in identical topological positions in B and C. This is implemented by identifying the maximum common substructure (MCS) in molecules B and C on the most permissive basis, matching any atom and bond type, and then counting the number of atoms of the same type in equivalent positions in each molecule, generating a similarity score using a Tanimoto based formula expressing the intersection over the union:

$$S = \frac{I}{U} = \frac{N_{\mathrm{sim}}}{N_B + N_C - N_{\mathrm{sim}}}$$

where I represents the intersection of B and C, U represents the union of B and C, N_B is the number of atoms in bound ligand B, N_C is the number of atoms in cognate ligand C, and N_sim is the number of atoms of the same type in equivalent positions in B and C. The ligand similarity score was binned by rounding to the nearest 0.05 to allow plots of molecule similarity against frequency to be made.
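The score and the binning step can be sketched in a few lines, assuming the three atom counts have already been obtained (for example from an RDKit substructure match, which is not shown here). The function names `parity_score` and `bin_score` are illustrative and are not taken from the authors' code.

```python
def parity_score(n_bound, n_cognate, n_sim):
    """Intersection over union of atom counts, as in the formula above."""
    union = n_bound + n_cognate - n_sim
    return n_sim / union if union > 0 else 0.0


def bin_score(score, width=0.05):
    """Round a score to the nearest multiple of `width` for frequency plots."""
    # The outer round() removes floating-point noise such as 0.35000000000000003.
    return round(round(score / width) * width, 2)


score = parity_score(n_bound=10, n_cognate=10, n_sim=9)
print(round(score, 2), bin_score(score))
```

Note that Python's built-in `round` resolves exact ties to the nearest even integer, so a score lying exactly on a bin boundary may be binned differently from a round-half-up implementation.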

The advantage of this method over simply using the size of the MCS, or over path-based fingerprint methods, arises when a small change in the center of a molecule disrupts the MCS and causes a disproportionately large drop in the similarity score. The PARITY method gives a more intuitive and gradual fall in similarity in this situation, as demonstrated by the comparison between C00026 (2-oxoglutarate) and OGA (N-oxalylglycine) in Figure 2, where a difference of just one atom retains a relatively high PARITY similarity score of 9/(10 + 10 − 9) = 0.82, whereas the score would fall more dramatically to 5/(10 + 10 − 5) = 0.33 if matching the MCS.
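The two numbers in this comparison can be reproduced with a toy calculation. The sketch below does not use RDKit: the heavy atoms of each molecule are written out by hand in corresponding chain order, which stands in for the permissive atom mapping, and the longest run of identical positions stands in for a strict MCS. Both simplifications are assumptions that happen to hold for this pair of near-linear molecules and would not generalize.

```python
# Heavy atoms listed in corresponding chain order (hand-written alignment).
# 2-oxoglutarate:  HOOC-C(=O)-CH2-CH2-COOH
# N-oxalylglycine: HOOC-C(=O)-NH-CH2-COOH
cognate = ["O", "O", "C", "C", "O", "C", "C", "C", "O", "O"]
bound = ["O", "O", "C", "C", "O", "N", "C", "C", "O", "O"]


def tanimoto(n_common, n_a, n_b):
    """Intersection over union for atom counts."""
    return n_common / (n_a + n_b - n_common)


def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive positions with equal elements."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


# PARITY-style count: same element at equivalent positions of the mapping.
n_sim = sum(x == y for x, y in zip(cognate, bound))
parity = tanimoto(n_sim, len(cognate), len(bound))

# Strict-matching stand-in: the mismatch at the nitrogen splits the molecule.
n_mcs = longest_identical_run(cognate, bound)
strict = tanimoto(n_mcs, len(cognate), len(bound))

print(n_sim, round(parity, 2))
print(n_mcs, round(strict, 2))
```

The single C-to-N substitution costs one matched atom under the PARITY-style count but removes half the molecule from the strict match, which is the effect described in the text.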

Similarity at the KEGG Reactants/Products Level

For each PDB to KEGG comparison, ligand similarity calculations are carried out on an all by all basis comparing all bound ligands to all cognate reactants (i.e. compounds on the left-hand side of the KEGG reaction) and to all cognate products (i.e. compounds on the right-hand side of the KEGG reaction), taking the best match to either reactants or products. Whether the match has been made to either KEGG reactants or products is recorded and explicitly documented in the output data.

To ensure complete matching of the cognate reaction, it is important to note that all cognate reactants/products must be matched to a bound ligand, and any remaining unmatched cognate molecules will appear as unmatched in the final calculation and be fully reflected in the final similarity score. However, due to the presence of water as a solvent in PDB structures and the difficulty of resolving the positions of hydrogen atoms, compounds C00001 (water), C00080 (proton) and C00282 (dihydrogen) were excluded from the cognate ligands. The cognate ligands are matched to their most similar bound ligand using a greedy matching algorithm, i.e. the next most similar pair of unmatched cognate and bound ligands is always matched.
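The greedy matching step can be sketched as follows, assuming pairwise similarity scores have already been computed. The function name `greedy_match`, the dictionary layout, and all numeric scores in the example are hypothetical; only the three excluded compound identifiers come from the text.

```python
# Cognate compounds excluded from matching: water, proton, dihydrogen.
EXCLUDED = {"C00001", "C00080", "C00282"}


def greedy_match(scores):
    """Greedily pair cognate and bound ligands, best-scoring pair first.

    scores: dict mapping (cognate_id, bound_id) -> similarity score.
    Returns a list of (cognate_id, bound_id, score); each ligand is used once.
    """
    candidates = sorted(
        ((s, c, b) for (c, b), s in scores.items() if c not in EXCLUDED),
        key=lambda item: item[0],
        reverse=True,
    )
    used_cognate, used_bound, matches = set(), set(), []
    for s, c, b in candidates:
        if c not in used_cognate and b not in used_bound:
            matches.append((c, b, s))
            used_cognate.add(c)
            used_bound.add(b)
    return matches


# Hypothetical scores for illustration only.
scores = {
    ("C00026", "OGA"): 0.82,
    ("C00026", "EDO"): 0.20,
    ("C07088", "OGA"): 0.30,
    ("C07088", "EDO"): 0.10,
    ("C00001", "EDO"): 0.50,  # water: ignored
}
print(greedy_match(scores))
```

A greedy strategy does not guarantee the assignment with the highest total score, but it is simple and always pairs the most similar ligands first. Cognate ligands left without a partner would still need to be carried forward as unmatched so that they lower the final reaction score.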

From the resulting list of matched cognate and bound ligands, the similarity scores can conveniently be combined to generate a similarity score on a KEGG reactants/products basis. For example, if there are M cognate molecules C_m (m = 1, ..., M) on one side of a KEGG reaction, and N bound molecules B_n (n = 1, ..., N) in the PDB, ligand similarity scores are calculated as follows:

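$$
\begin{aligned}
S_{C_1B_1} &= \frac{I_{C_1B_1}}{U_{C_1B_1}} = \frac{N_{\mathrm{sim}}^{C_1B_1}}{N_{C_1} + N_{B_1} - N_{\mathrm{sim}}^{C_1B_1}} \\
S_{C_2B_2} &= \frac{I_{C_2B_2}}{U_{C_2B_2}} = \frac{N_{\mathrm{sim}}^{C_2B_2}}{N_{C_2} + N_{B_2} - N_{\mathrm{sim}}^{C_2B_2}} \\
&\;\;\vdots \\
S_{C_MB_M} &= \frac{I_{C_MB_M}}{U_{C_MB_M}} = \frac{N_{\mathrm{sim}}^{C_MB_M}}{N_{C_M} + N_{B_M} - N_{\mathrm{sim}}^{C_MB_M}}
\end{aligned}
$$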

In the case where M < N any remaining unmatched bound ligands are discarded; better matches to the cognate molecules have been found. In the case where M > N there are not enough bound molecules to match all of the cognate molecules, but the similarity score is still calculated comparing to an empty molecule B0 as follows:

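$$S_{C_mB_0} = \frac{I_{C_mB_0}}{U_{C_mB_0}} = \frac{N_{\mathrm{sim}}^{C_mB_0}}{N_{C_m} + N_{B_0} - N_{\mathrm{sim}}^{C_mB_0}} = \frac{0}{N_{C_m} + 0 - 0} = \frac{0}{N_{C_m}}$$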

The similarity scores can then be combined to give a reaction similarity score using:

$$S_{\mathrm{reactants/products}} = \frac{\sum_{m=1}^{M} I_{C_mB_m}}{\sum_{m=1}^{M} U_{C_mB_m}}$$

substituting the expressions from above. In this way, a similarity score is obtained at the KEGG reactants/products level by combining the similarity scores of the best matches at the ligand level.
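The combination step can be written as a short function that sums intersections and unions over the matched pairs and treats every unmatched cognate molecule as paired with an empty molecule. The function name `reaction_score`, its argument layout, and the atom counts in the example are assumptions made for illustration.

```python
def reaction_score(matched, unmatched_cognate_sizes=()):
    """Combine ligand-level matches into one reactants/products score.

    matched: iterable of (n_cognate, n_bound, n_sim) for each matched pair.
    unmatched_cognate_sizes: atom counts of cognate molecules with no bound partner.
    """
    matched = list(matched)  # allow generators, since the data is traversed twice
    intersection = sum(n_sim for _, _, n_sim in matched)
    union = sum(n_c + n_b - n_sim for n_c, n_b, n_sim in matched)
    # An unmatched cognate is compared with an empty molecule: I = 0, U = N_C.
    union += sum(unmatched_cognate_sizes)
    return intersection / union if union else 0.0


# Hypothetical atom counts: two matched pairs plus one unmatched two-atom cognate.
pairs = [(10, 10, 9), (12, 4, 2)]
print(round(reaction_score(pairs, unmatched_cognate_sizes=[2]), 3))
```

Because intersections and unions are summed before dividing, larger molecules carry more weight than they would in a plain average of the ligand-level scores, and unmatched cognate molecules pull the score down in proportion to their size.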

The methodology to calculate similarity at the KEGG reactants/products level is demonstrated for PDB 3AVR (Catalytic fragment of UTX/KDM6A bound with histone H3K27me3 peptide, N-oxyalylglycine, and Ni(II)) and KEGG R05493 (alpha-ketoglutarate-dependent 2,4-dichlorophenoxyacetate dioxygenase) in Figure 2.

Rank Ordered Similarity for Each KEGG and EC

Once the similarity scores had been calculated, the PDB-KEGG comparisons were rank ordered to identify the PDB structures containing the most representative ligands for each EC reference and each KEGG reaction. A Mendeley data repository contains two files (ec_pdb_sep17.csv and kegg_pdb_sep17.csv) with the top-ranked matches for each EC and KEGG, respectively, along with any other PDB-KEGG matches with a similarity score greater than or equal to 0.7. A breakdown of each PDB-KEGG match by pairs of matched bound-cognate ligands is given in the final column. To facilitate easy searching by KEGG compound, we also include an additional file (keggCpd_pdb_sep17.csv) showing, for each KEGG compound in each KEGG reaction, rank-ordered matches to PDB files where similarity to a bound ligand is greater than or equal to 0.7.
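The selection rule described here (keep the top-ranked match for each key, plus any other match at or above 0.7) can be sketched as below. The record fields `pdb`, `kegg`, and `similarity`, the function name `rank_matches`, the placeholder structure IDs, and the scores are all hypothetical and do not reflect the column names of the released files.

```python
def rank_matches(records, key_field, threshold=0.7):
    """Group records by `key_field`, rank by similarity, keep top match plus any >= threshold."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key_field], []).append(rec)

    ranked = {}
    for key, recs in groups.items():
        ordered = sorted(recs, key=lambda r: r["similarity"], reverse=True)
        # The best match is always kept, even if it falls below the threshold.
        ranked[key] = [ordered[0]] + [
            r for r in ordered[1:] if r["similarity"] >= threshold
        ]
    return ranked


# Placeholder structure IDs and invented scores, for illustration only.
records = [
    {"pdb": "pdbA", "kegg": "R01026", "similarity": 0.45},
    {"pdb": "pdbB", "kegg": "R01026", "similarity": 1.00},
    {"pdb": "pdbC", "kegg": "R01026", "similarity": 0.75},
    {"pdb": "pdbD", "kegg": "R03165", "similarity": 0.30},
]
ranked = rank_matches(records, "kegg")
for key, recs in ranked.items():
    print(key, [(r["pdb"], r["similarity"]) for r in recs])
```

The same function could be keyed on an EC field instead of a KEGG field to produce the per-EC ranking.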

The content of the PDB reflects the areas of focus for structural biologists and contains many duplicate structures and homologues, so PDBs were clustered based on identical FASTA sequence to enable the best PDB structure to be identified for each cluster.
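Clustering at 100% sequence identity amounts to grouping structures whose sequences are identical strings and keeping the best-scoring member of each group. The sketch below assumes one sequence per structure, which is a simplification for multi-chain entries; the function name `best_per_cluster`, the placeholder IDs, the toy sequences, and the scores are hypothetical.

```python
def best_per_cluster(entries):
    """Cluster on identical sequence and keep the most similar structure per cluster.

    entries: iterable of (pdb_id, sequence, similarity).
    Returns a dict mapping sequence -> (pdb_id, similarity) of the best member.
    """
    best = {}
    for pdb_id, sequence, similarity in entries:
        current = best.get(sequence)
        if current is None or similarity > current[1]:
            best[sequence] = (pdb_id, similarity)
    return best


# Toy sequences and invented scores, for illustration only.
entries = [
    ("pdbA", "MKTAY", 0.40),
    ("pdbB", "MKTAY", 0.85),
    ("pdbC", "GSHMA", 0.10),
]
print(best_per_cluster(entries))
```

Looser clustering criteria would require a sequence alignment tool rather than exact string comparison.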

Data and Software Availability

The CSV files described in “Rank Ordered Similarity for Each KEGG and EC” in Method Details have been made available in a Mendeley data repository that can be accessed at https://data.mendeley.com/datasets/7c48npgyr8:

ec_pdb_sep17.csv https://doi.org/10.17632/7c48npgyr8.2#file-d8c87e4a-76b7-4763-bd7d-61a6aa854021

kegg_pdb_sep17.csv https://doi.org/10.17632/7c48npgyr8.2#file-c3650d9e-36b8-4126-9c57-aa77a5ce92ed

keggCpd_pdb_sep17.csv https://doi.org/10.17632/7c48npgyr8.2#file-8cd8d174-7b52-4e22-9898-d518c626c798
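The captions of Figures 4 and 5 describe the cpd_matches column as bound; cognate; similarity_score, with multiple matches separated by an underscore. A parser for that description might look like the sketch below. The exact spacing and any special values in the released files have not been checked here, so the function name `parse_cpd_matches` and the example string (including the placeholder codes `XXX` and `C99999`) are assumptions to be verified against the data.

```python
def parse_cpd_matches(field):
    """Split a cpd_matches value into (bound_id, cognate_id, score) tuples.

    Assumes matches are separated by '_' and fields within a match by ';'.
    """
    matches = []
    for chunk in field.split("_"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty segments
        bound, cognate, score = (part.strip() for part in chunk.split(";"))
        matches.append((bound, cognate, float(score)))
    return matches


# Illustrative value only: the second match uses placeholder identifiers.
example = "OGA; C00026; 0.82_XXX; C99999; 0.10"
print(parse_cpd_matches(example))
```

A row whose matches do not split into exactly three fields raises a ValueError, which makes format surprises visible rather than silently skipping them.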

Acknowledgments

The authors would like to acknowledge the European Molecular Biology Laboratory (EMBL) for funding this work.

Author Contributions

Conceptualization, J.D.T., L.F., A.J.M.R., N.B., and J.M.T.; Methodology, J.D.T.; Software, J.D.T. and L.F.; Formal Analysis, J.D.T.; Investigation, J.D.T.; Data Curation, J.D.T. and L.F.; Writing – Original Draft, J.D.T.; Writing – Reviewing and Editing, J.D.T., L.F., A.J.M.R., N.B., and J.M.T.; Visualization, J.D.T.; Supervision, J.M.T.; Funding Acquisition, J.M.T.

Declaration of Interests

The authors declare no competing interests.

Published: March 15, 2018

