Abstract
The sc-PDB database (available at http://bioinfo-pharma.u-strasbg.fr/scPDB/) is a comprehensive and up-to-date selection of ligandable binding sites of the Protein Data Bank. Sites are defined from complexes between a protein and a pharmacological ligand. The database provides the all-atom description of the protein, its ligand, their binding site and their binding mode. Currently, the sc-PDB archive registers 9283 binding sites from 3678 unique proteins and 5608 unique ligands. The sc-PDB database was publicly launched in 2004 with the aim of providing structure files suitable for computational approaches to drug design, such as docking. During the last 10 years we have improved and standardized the processes for (i) identifying binding sites, (ii) correcting structures, (iii) annotating protein function and ligand properties and (iv) characterizing their binding mode. This paper presents the latest enhancements in the database, specifically pertaining to the representation of molecular interaction and to the similarity between ligand/protein binding patterns. The new website puts emphasis in pictorial analysis of data.
INTRODUCTION
The 3D structures of macromolecules, as collected by the Worldwide Protein Data Bank (PDB) organization (http://wwpdb.org, (1)), offer wealth of information for computer-aided approaches to drug design. During the last 30 years, the steady increase of the PDB archive (2) has prompted the development of 3D methods for hit identification by virtual screening of chemical libraries, de novo ligand design and hit to lead. Many success stories have been reported in the literature (3). Besides, some proteins have never been efficiently modulated by chemical compounds despite intense efforts in medicinal chemistry. The concept of ligandability has thus been suggested to qualify the ability of a protein to bind with high affinity a small molecular weight compound (4,5). Recent studies demonstrated that simple geometric and physico-chemical descriptors of protein cavities (principally size, shape and polarity) are sufficient to predict structural ligandability (6–9).
The sc-PDB is a specialized structure database focused on ligand binding site in ligandable proteins (10). We have selected in the PDB all proteins in complex with a small synthetic or natural ligand (140 Da < MW < 800 Da), provided this ligand was well buried and biologically relevant and since 2013 provided the binding site was predicted ligandable according to a machine leaning-based model. The different stages of database design process are detailed in the online documentation and summarized in Figure 1.
The first publicly available version of sc-PDB has been released in 2004. The database has been annually updated with regular additions of new features (See Supplementary Table S1 for a summary of changes since the database creation). Not only the quality and the precision of data improved over the 10 years, but new tools have allowed global analysis of data. A major example is the clustering of sites for proteins present in multiple copies in the database (11). The new functionalities in sc-PDB, introduced after 2011, are discussed in detail in this paper.
sc-PDB CONTENT
The sc-PDB data are directly compatible with computational methods, such as docking, molecular mechanics and electrostatic calculations. Unlike the PDB, which generally does not represent hydrogen atoms nor defines ionization state of titratable groups, the sc-PDB provides an all-atom model of molecules: (i) hydrogen atoms are added to amino acids considering that arginine and lysine are positively charged and aspartic and glutamic acids are negatively charged, (ii) hydrogen atoms are added to other residues according to ionized templates built from HET group dictionary (12), (iii) the intermolecular H-bonding network is optimized using the BioSolveIT Hydescorer program (13). The overall processing of an original PDB entry yields atomic data for a single ligand, the protein chain(s) surrounding this ligand and its binding site (i.e. all protein residues with at least one heavy atom closer than 6.5 Å to any ligand heavy atom). Of note, protein and binding site contain standard amino acids, and may include cofactor(s), metallic ion(s) and covalently bound residue(s), such as carbohydrate. Last, each sc-PDB entry is characterized with functional and chemical annotations.
The current sc-PDB release contains 9283 entries, representing 3678 different UniProt (14) proteins and 5608 different HET ligands. The data set is non-redundant: although about 10% of ligands and almost half of proteins are present more than once in the database, each sc-PDB ligand/protein complex is unique. Less than 5% of proteins are encountered more than 10 times in the database, yet some of them have a very high copy number. The three most frequent proteins are HIV protease (248 entries), cyclin-dependent kinase 2 (180 entries) and beta-secretase 1 (155 entries). More statistics are given in Supplementary Figure S1, and at http://cheminfo.u-strasbg.fr/scPDB/ABOUT.
The total size of compressed database is 1.5 GB. Its downloadable content is summarized in Table 1.
Table 1. Downloadable content of sc-PDB.
filename | Number of entries in file | Data description | Data type |
---|---|---|---|
protein.mol2 | one or list of PDB ID matching search criteria | All-atom description of sc-PDB protein(s) | Atomic data |
ligand.mol2 | one or list of PDB ID matching search criteria | All-atom description of sc-PDB ligand | Atomic data |
site.mol2 | one or list of PDB ID matching search criteria | All-atom description of sc-PDB ligand binding site | Atomic data |
cavity6.mol2 | one or list of PDB ID matching search criteria | The cavity is the negative image of the binding site, described by regularly spaced points colored according to pharmacophoric properties of the site atoms | Atomic data |
ints_M.mol2 | one or list of PDB ID matching search criteria | Non-bonded interactions between sc-PDB ligand and its binding site | Atomic data |
Each interaction is characterized by three points, placed respectively on the protein atom in interaction, the ligand atom in interaction and at the center of the segment defined by these two points | |||
ifp.txt | one or list of PDB ID matching search criteria | Non-bonded interactions between sc-PDB ligand and its binding site | Binary string |
For each residue in site is marked the presence or absence of interaction with ligand (hydrophobic contact, aromatic bond, H-bond, ionic bond, metal-ion bond) | |||
C-clusterID.tar.gz | Clusters of binding sites | The archive classifies all the sc-PDB entries of a UNIPROT protein. It is organized into directories, one for each cluster of sites. Each cluster contains protein.mol2, site.mol2 and ligand.mol2 files of all individual PDB entries, which have been 3D-aligned to the site at cluster center | Atomic data |
Alignment.tar.gz | Pair of PDB ID, 3D-aligned for optimizing site similarity (1) or binding mode similarity (2) | The archive describes the protein, site, ligand and cavity (1) or non-bonded interaction (2) files for the reference entry (original coordinates) and the compared entry (fitted coordinates) | Atomic data |
scPDB_results | list of entries matching search criteria | Annotation and 2D structure of ligands in csv, xlsx or sdf formats | Text |
PDBID_distribution.tsv | list of entries similar to query site (1) or query binding mode (2) | Similarity scores | Text |
All charts and pictures of sc-PDB website are downloadable in png format. The complete database is downloadable as a compressed archive from the database homepage.
NEW FEATURES OF THE sc-PDB DATABASE
Depiction of protein–ligand complexes
The latest sc-PDB release enables the user to depict protein–ligand complexes according to different needs and complexity levels. For example, a medicinal chemist may be primarily interested in the PoseView 2D sketch highlighting the ligand structure, binding site boundaries and main interactions (Supplementary Figure S2A). A cheminformatician may focus on the nearby tabulated list of protein–ligand interactions including involved atoms and a full topological description (distance, angle) of each interaction (Supplementary Figure S2B). Last, a structural biologist can access a 3D picture of the complex embedded in the OpenAstex viewer (Supplementary Figure S2B) (15). The interaction table is graphically linked to the 3D picture: scrolling the mouse over any interaction line in the table interactively displays the corresponding interaction in the neighboring 3D picture.
Water molecules
Water is by essence the biological fluid. The role of water in molecular recognition events is not yet fully understood although it has been extensively studied experimentally and theoretically (see (16) for a comprehensive review). Observations made for water molecules at binding interface between a drug and its protein target demonstrated that ordered solvent molecule(s) can either reinforce or by contrast weaken the stability of the complex depending on the studied system (17). In drug design, interfacial water molecules have a profound impact on calculations, both the inexpensive computational protocols, such as hit finding by high-throughput docking (18), and the more sophisticated algorithms, such as lead optimization using free-energy perturbation calculations (19).
Since 2012, a sc-PDB protein contains all water molecules that establish two or more hydrogen bonds with the binding site (i.e. donor-acceptor distance < 3.5 Å and 120° < donor-H-acceptor angle < 240°). These water molecules are expected to be hardly displaceable by a ligand because of tight binding to the protein (20). Water molecules are present in about two-thirds of sc-PDB complexes (Figure 2). The number of water molecules per site ranges from 1 to 10, but the distribution is largely biased toward smaller values. Although only few of the selected water molecules are in direct interaction with the ligand, using this information is key to structure-based design and drastically influences virtual screening for example.
Query for similar binding sites
The molecular basis of the ligand/protein recognition gives insights into the specificity of a drug for its target protein. For example, structural variations in binding site may explain the permissive binding of different ligands to a single protein. As mentioned in the Introduction, we have previously addressed this issue by analyzing the multiple binding sites in a given protein (11). The sc-PDB clusters of binding sites can reveal differences in location, size, composition or 3D structure. For example, clustering the sc-PDB sites of adenylosuccinate synthetase yields three clusters; two of them that have similar structures and compositions except guanosine diphosphate (GDP) and Mg2+ cofactors; the third one is localized in a different region in the protein (Supplementary Figure S3). Other high quality databases derived from the PDB also facilitate the comparison of the binding sites across a protein family (21–23). The sc-PDB database is, however, the only meta-database enabling to search the PDB using user-defined queries mixing protein, ligand, binding site and binding mode properties. For example, a single query in the sc-PDB enables the selection of all protein–ligand complexes for which (i) the target is a protein kinase, (ii) the ligand is a fragment with a molecular weight between 150 and 300, (iii) the binding site comprises at least one bound water molecule, (iv) the ligand is neutral and contacts its target by one aromatic face-to-face interaction.
Local structural similarity between non-homologous proteins can account for the promiscuity of a ligand, and thus can help explaining the side effects of a drug or suggest its repositioning toward a novel target and therapeutical indication (24,25). The sc-PDB database now enables the identification of similar sites in distinct proteins using a pre-computed all-against-all comparison with the in-house developed Shaper algorithm (8). The sc-PDB website allows to query the matrix of scores for any given sc-PDB site. It displays the distribution of scores and lists the entries whose similarity score is higher than a given threshold (default value is 0.44). For example, the binding site for phosphomethylphosphonic acid-guanylate ester in Escherichia coli adenylosuccinate synthetase (PDB ID: 1HOP, HET: CGP) shares significant 3D similarity with a single site in sc-PDB, that of GTP in a murine homologous protein (Supplementary Figure S4).
Query for similar binding patterns
The non-bonded interactions between a ligand and its protein define a 3D pattern that characterizes the binding mode. We have recently developed a new geometrical method to encode and compare protein–ligand interaction patterns (26). Briefly, each interaction is represented by three points: the interacting ligand atom, the interacting protein atom and a pseudo-atom at the geometric center of the above-cited two atoms. Each interaction is assigned a molecular type according to the type of non-bonded interaction (hydrophobic, aromatic, hydrogen bond, ionic bond, metal-ion bond). The 3D pattern is defined by all the triplets of interaction pseudo-atoms and graph theory is applied to find the maximal common subgraph (clique) between two 3D patterns. The similarity score evaluates the quality of overlap after 3D alignment of the two patterns. Using this approach we recently demonstrated that the protein–ligand binding mode is generally conserved within a family of homologous protein even though bound ligands are dissimilar.
The sc-PDB database now enables the identification of similar 3D pattern in distinct complexes; the all-against-all comparison of sc-PDB complexes was computed using the program Grim (26). The sc-PDB website allows to query the matrix of scores for any given sc-PDB ligand/protein binding mode. It displays the distribution of scores and lists the entries whose similarity score is higher than the threshold selected on the distribution (default value is 0.65). For example, the binding mode of phosphomethylphosphonic acid-guanylate ester to E. coli adenylosuccinate synthetase (PDB ID: 1HOP, HET: CGP) shares significant similarity with 25 complexes in sc-PDB, representing 19 different proteins bound to GDP, GTP or close analogs. The two top scorers are respectively a homologous protein in wheat (PDB ID: 1DJ3) and the functionally unrelated signal recognition particle protein (PDB ID: 1RJ9, Figure 3).
A new interface
The main architecture of database has not been changed, but the sc-PDB website has been completely re-designed to enhance interactivity. For every entry, the user can navigate in the main menu and directly switch views in the same window focusing on either a simple description of the entry, or a full characterization of the ligand or its binding site. Only searches for similar binding sites or binding modes open a new window with the rank-ordered list of sc-PDB hits corresponding to the query. At almost all sections of the web interface, molecules (protein, ligand, binding site), interaction pattern, tabulated results (hit lists, protein–ligand interactions) and charts (ligand and binding site properties, distribution of similar binding sites or binding modes) can be downloaded in the relevant file format (mol2, xlsx, csv, tsv, png, jpg, svg, pdf). In case of binding site/binding mode similarity searches, aligned molecules are also downloadable (Table 1).
The website specifications are detailed at http://cheminfo.u-strasbg.fr/scPDB/ABOUT.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
Acknowledgments
The High-performance Computing Center at the University of Strasbourg (Illkirch, France) and the Calculation Center of the IN2P3 (CNRS, Villeurbanne, France) are acknowledged for allocation of computing time and excellent support.
Footnotes
Present address: Desaphy Jérémy, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285, USA.
FUNDING
Centre National de la Recherche Scientifique (CNRS); University of Strasbourg. LABEX ANR-10-LABX-0034 Medalis; French government managed by ‘Agence Nationale de la Recherche’ under ‘Programme d'investisssement d'avenir’. Funding for open access charge: LABEX ANR-10-LABX-0034 Medalis.
Conflict of interest statement. None declared.
REFERENCES
- 1.Berman H., Henrick K., Nakamura H., Markley J.L. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007;35:D301–D303. doi: 10.1093/nar/gkl971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Gutmanas A., Alhroub Y., Battle G.M., Berrisford J.M., Bochet E., Conroy M.J., Dana J.M., Fernandez Montecelo M.A., van Ginkel G., Gore S.P., et al. PDBe: Protein Data Bank in Europe. Nucleic Acids Res. 2014;42:D285–D291. doi: 10.1093/nar/gkt1180. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ripphausen P., Nisius B., Peltason L., Bajorath J. Quo Vadis, virtual screening? A comprehensive survey of prospective applications. J. Med. Chem. 2010;53:8461–8467. doi: 10.1021/jm101020z. [DOI] [PubMed] [Google Scholar]
- 4.Hopkins A.L., Groom C.R. The druggable genome. Nat. Rev. Drug Discov. 2002;1:727–730. doi: 10.1038/nrd892. [DOI] [PubMed] [Google Scholar]
- 5.Edfeldt F.N., Folmer R.H., Breeze A.L. Fragment screening to predict druggability (ligandability) and lead discovery success. Drug Discov. Today. 2011;16:284–287. doi: 10.1016/j.drudis.2011.02.002. [DOI] [PubMed] [Google Scholar]
- 6.Weisel M., Proschak E., Kriegl J.M., Schneider G. Form follows function: shape analysis of protein cavities for receptor-based drug design. Proteomics. 2009;9:451–459. doi: 10.1002/pmic.200800092. [DOI] [PubMed] [Google Scholar]
- 7.Krasowski A., Muthas D., Sarkar A., Schmitt S., Brenk R. DrugPred: a structure-based approach to predict protein druggability developed using an extensive nonredundant data set. J. Chem. Inf. Model. 2011;51:2829–2842. doi: 10.1021/ci200266d. [DOI] [PubMed] [Google Scholar]
- 8.Desaphy J., Azdimousa K., Kellenberger E., Rognan D. Comparison and druggability prediction of protein-ligand binding sites from pharmacophore-annotated cavity shapes. J. Chem. Inf. Model. 2012;52:2287–2299. doi: 10.1021/ci300184x. [DOI] [PubMed] [Google Scholar]
- 9.Perola E., Herman L., Weiss J. Development of a rule-based method for the assessment of protein druggability. J. Chem. Inf. Model. 2012;52:1027–1038. doi: 10.1021/ci200613b. [DOI] [PubMed] [Google Scholar]
- 10.Kellenberger E., Muller P., Schalon C., Bret G., Foata N., Rognan D. sc-PDB: an annotated database of druggable binding sites from the Protein Data Bank. J. Chem. Inf. Model. 2006;46:717–727. doi: 10.1021/ci050372x. [DOI] [PubMed] [Google Scholar]
- 11.Meslamani J., Rognan D., Kellenberger E. sc-PDB: a database for identifying variations and multiplicity of “druggable” binding sites in proteins. Bioinformatics. 2011;27:1324–1326. doi: 10.1093/bioinformatics/btr120. [DOI] [PubMed] [Google Scholar]
- 12.Bhat T.N., Bourne P., Feng Z., Gilliland G., Jain S., Ravichandran V., Schneider B., Schneider K., Thanki N., Weissig H., et al. The PDB data uniformity project. Nucleic Acids Res. 2001;29:214–218. doi: 10.1093/nar/29.1.214. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Bietz S., Urbaczek S., Schulz B., Rarey M. Protoss: a holistic approach to predict tautomers and protonation states in protein-ligand complexes. J. Cheminform. 2014;6:12. doi: 10.1186/1758-2946-6-12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.The UniProt Consortium. Activities at the Universal Protein Resource (UniProt) Nucleic Acids Res. 2014;42:D191–D198. doi: 10.1093/nar/gkt1140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Hartshorn M.J. AstexViewer: a visualisation aid for structure-based drug design. J. Comput. Aided Mol. Des. 2002;16:871–881. doi: 10.1023/a:1023813504011. [DOI] [PubMed] [Google Scholar]
- 16.Li Z., Lazaridis T. Water at biomolecular binding interfaces. Phys. Chem. Chem. Phys. 2007;9:573–581. doi: 10.1039/b612449f. [DOI] [PubMed] [Google Scholar]
- 17.Ladbury J. Just add water! The effect of water on the specificity of protein-ligand binding sites and its potential application to drug design. ACS Chem. Biol. 1996;3:973–980. doi: 10.1016/s1074-5521(96)90164-7. [DOI] [PubMed] [Google Scholar]
- 18.Barelier S., Boyce S.E., Fish I., Fischer M., Goodin D.B., Shoichet B.K. Roles for ordered and bulk solvent in ligand recognition and docking in two related cavities. PLoS ONE. 2013;8:e69153. doi: 10.1371/journal.pone.0069153. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Michel J., Tirado-Rives J., Jorgensen W.L. Energetics of displacing water molecules from protein binding sites: consequences for ligand optimization. J. Am. Chem. Soc. 2009;131:15403–15411. doi: 10.1021/ja906058w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Bissantz C., Kuhn B., Stahl M. A medicinal chemist's guide to molecular interactions. J. Med. Chem. 2010;53:5061–5084. doi: 10.1021/jm100112j. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Amemiya T., Koike R., Kidera A., Ota M. PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Res. 2012;40:D554–D558. doi: 10.1093/nar/gkr966. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Kufareva I., Ilatovskiy A.V., Abagyan R. Pocketome: an encyclopedia of small-molecule binding sites in 4D. Nucleic Acids Res. 2012;40:D535–D540. doi: 10.1093/nar/gkr825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Maietta P., Lopez G., Carro A., Pingilley B.J., Leon L.G., Valencia A., Tress M.L. FireDB: a compendium of biological and pharmacologically relevant ligands. Nucleic Acids Res. 2013;42:D267–D272. doi: 10.1093/nar/gkt1127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Peters J.-U. Polypharmacology - foe or friend. J. Med. Chem. 2013;56:8955–8971. doi: 10.1021/jm400856t. [DOI] [PubMed] [Google Scholar]
- 25.Anighoro A., Bajorath J., Rastelli G. Polypharmacology: challenges and opportunities in drug discovery. J. Med. Chem. 2014. doi:10.1021/jm5006463. [DOI] [PubMed]
- 26.Desaphy J., Raimbaud E., Ducrot P., Rognan D. Encoding protein-ligand interaction patterns in fingerprints and graphs. J. Chem. Inf. Model. 2013;53:623–637. doi: 10.1021/ci300566n. [DOI] [PubMed] [Google Scholar]