Author manuscript; available in PMC: 2020 Aug 26.
Published in final edited form as: Methods Mol Biol. 2012;910:145–164. doi: 10.1007/978-1-61779-965-5_8

Mapping Between Databases of Compounds and Protein Targets

Sorel Muresan, Markus Sitzmann, Christopher Southan
PMCID: PMC7449375  NIHMSID: NIHMS1618662  PMID: 22821596

Abstract

Databases that provide links between bioactive compounds and their protein targets are increasingly important in drug discovery and chemical biology. They join the expanding universes of cheminformatics via chemical structures on the one hand and bioinformatics via sequences on the other. However, it is difficult to assess the relative utility of databases without the explicit comparison of content. We have exemplified an approach to this by comparing resources that each have a different focus on bioactive chemistry (ChEMBL, DrugBank, Human Metabolome Database, and Therapeutic Target Database) at both the chemical structure and protein levels. We compared the compound sets at different representational stringencies using NCI/CADD Structure Identifiers. The overlap and uniqueness in chemical content can be broadly interpreted in the context of different data capture strategies. However, we recorded apparent anomalies, such as many compounds-in-common between the metabolite and drug databases. We also compared the content of sequences mapped to the compounds via their UniProt protein identifiers. While these were also generally interpretable in the context of individual databases, we discerned differences in coverage and the types of supporting data used. For example, the target concept is applied differently between DrugBank and the Therapeutic Target Database. In ChEMBL it encompasses a broader range of mappings from chemical biology and species orthologue cross-screening in addition to drug targets per se. Our analysis should assist users not only in exploiting the synergies between these four high-value resources but also in assessing the utility of other databases at the interface of chemistry and biology.

Keywords: Bioactive compounds, Small-molecule databases, Chemical structure identifiers, Cheminformatics, Bioinformatics, Drug targets, ChEMBL, DrugBank, Human Metabolome Database, Therapeutic Target Database

1. Introduction

Characterization of the interactions between bioactive compounds and proteins is a central tenet of modern drug discovery. It also underpins biochemistry, structural biology, metabolism, enzymology, toxicology, and chemical biology. Progress in these areas is facilitated by databases that collate compound-to-protein relationships along with their supporting data. An important consequence is an increasing overlap between cheminformatics and bioinformatics. The outer limits can be represented by the Chemical Structure Lookup Service (CSLS) (1) that contains ~80 million chemical structures and the UniProt Knowledgebase (UniProtKB) with ~13 million protein sequence identifiers (2). The inner limits of the join can be defined by those large-scale commercial sources that declare their mapping statistics, such as GVKBIO which includes 2.9 million unique structures (the majority from patents) and 4,500 sequence IDs (an update of the statistics from ref. 3). The corresponding figures for the current largest public database, ChEMBL, are 0.6 million compounds and nearly 5,000 sequences (these figures are an approximate inner join as not all the chemical structures are mapped to proteins).

The data-supported collated relationships between proteins and small molecules are sometimes colloquially described without being rigorously defined. Most readers of this chapter will know that atorvastatin (Lipitor) is not only the world's best-selling drug, but is also an enzyme inhibitor. Thus, we can refer to the implicit “binds-to” and “modulates-the-activity-of” relationships as a “mapping” between (3R,5R)-7-(2-(4-fluorophenyl)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-ylpyrrol-1-yl)-3,5-dihydroxyheptanoic acid (PubChem CID 60823) and 3-hydroxy-3-methylglutaryl-coenzyme A reductase (Swiss-Prot P04035). Three of the sources we look at here conform to this concept, although different evidence types may be used. The important exception is HMDB, where 3-hydroxy-3-methylglutaryl-coenzyme A (PubChem CID 91506) can be mapped in the alternative context of being an endogenous metabolic substrate for the same enzyme that atorvastatin inhibits by competitive binding to the active site.

An expanding number of databases implement a similar concept of compound-to-protein mapping even if they may differ in the details (e.g., how they deal with complex targets containing more than one protein). However, analogous to the proliferation of Web-based bioinformatics resources, they present users with the problems of comparison and choice. Assessing their relative utility and fitness-for-purpose is particularly difficult where the compound and protein entity content may seem similar but the capture strategies, curatorial practices, and depth of data extraction from source documents may be significantly different. This chapter seeks to detail some approaches that can be generally applied to this problem by comparing just four public databases.

2. Materials

  1. Chemical structures: ChEMBL (4), DrugBank (5), the Human Metabolome Database (HMDB) (6), and the Therapeutic Target Database (TTD) (7) are public databases and can be accessed from their Web sites. The structures were downloaded as SD files.

  2. Target protein identifiers: In the case of ChEMBL, these were taken as UniProt IDs from an internal Oracle download of the database. For the other databases, the sequences were downloaded as FASTA format. The Protein Identifier Cross-Reference Service (PICR) (8) was then used to convert these to their corresponding UniProt IDs.

  3. InChI executable, version 1.03, downloadable from IUPAC (9) was used to generate InChI strings and InChIKeys from the SD files.

  4. The Chemical Identifier Resolver (CIR) (10) was used to generate NCI/CADD identifiers from the SD files.

  5. The 4-way Venn diagrams were built with VENNY, an online interactive tool for comparing lists by Venn diagrams (11).

3. Methods

3.1. Databases

These are just brief descriptions of salient features relevant to the comparison exercise because these sources are well documented in publications and on their Web sites included in the references.

ChEMBL (4) data is curated from nearly 40,000 papers that cover a significant fraction of global drug R&D published output. The mappings between targets and assay results include extensive compound sets against kinases and GPCRs as well as a high capture of drugs and clinical candidates. The external chemistry connections of ChEMBL are, in part, mediated by ChEBI (12). Thus, both the ChEMBL-to-ChEBI and ChEBI-to-PubChem links are reciprocal (except for the ChEBI subset not in ChEMBL). This is complicated by the fact that ChEMBL records only link from PubChem via PubChem Bioassay (AIDs) but lack direct links to PubChem substances (SIDs) which are only available for records in ChEBI.

DrugBank (5) collates detailed drug data with target and mechanism-of-action information. The DrugCard data structure contains many fields with approximately half of the information being devoted to drugs. The other half is devoted to target sequences, pharmacological properties, pharmacogenomic data, food–drug interactions, drug–drug interactions, and experimental ADME data.

The Human Metabolome Database (6) collates detailed human metabolite information. It contains chemical, clinical and biochemical data, linked to other public compound and protein sources. Because it has been developed at the same institution, there are linkages between DrugBank and HMDB at the compound, target, and pathway levels. As in the DrugCard data structure, each MetaboCard compound entry contains many fields.

The Therapeutic Target Database (7) is conceptually similar to DrugBank but the compound-to-target mappings are more focussed on primary targets. Another difference is the 3-way split of targets and compounds into marketed, clinical trial, and research phase.

The restriction to just four sources was imposed so we could make use of the informative 4-way Venn display tool (shown in the figures below). While the analysis per se can be extended to more datasets the comparative visualization of results becomes complex. The individual databases were selected because they have common characteristics that include the following:

  1. Their content is predominantly based on (but not restricted to) data-supported mappings between compounds and proteins.

  2. Their sequence content is directly searchable by BLAST and downloadable as FASTA strings thus providing immediate utility for bioinformatics analysis.

  3. They include UniProt IDs.

  4. They use standardized, searchable, and downloadable representations of chemical structures (e.g., SMILES, SD files, IUPAC names, and InChIs).

  5. Their chemical structures have predominantly drug-like or metabolite properties.

  6. They link out extensively to other databases (e.g., PubMed, PubChem, ChEBI, and UniProt).

  7. The evidence supporting the mappings is derived predominantly from PubMed documents.

  8. They use extensive expert source selection and curation of the database records. This is distinct from the open repository model (e.g., PubChem) that collates input from many submitters.

  9. Our impression (supported by the opinions of others and citations) is that these are four high-value databases with wide user uptake.

There are also some high-level divisions and groupings:

  1. DrugBank and ChEMBL (via ChEBI) are both PubChem submission sources (i.e., they have PubChem SIDs assigned to all database records) and therefore have reciprocal linkages from, and to, PubChem Compound Identifiers (CIDs). While TTD and HMDB do not currently submit, they both link-out to PubChem SIDs and CIDs (the SID-to-CID relationships are explained in the PubChem Help documentation).

  2. DrugBank and TTD are drug-centric but include some research and/or clinical candidates. ChEMBL has a wider coverage of the primary research literature but also captures drugs and clinical candidates.

  3. DrugBank and TTD include a proportion of non-small molecule therapeutics (e.g., antibodies).

  4. DrugBank, ChEMBL, and TTD include some compounds with undefined molecular mechanisms, in vivo bioactive readouts, or target complexes that cannot be mapped to unique protein sequences.

  5. HMDB focuses on metabolites.

  6. DrugBank has a database cross-reference in UniProt entries.

As might be expected where primary data sources are continually expanding, updates will change content. Commendably, all the sources above include statistics and release notes. ChEMBL reports approximately monthly updates. In this work we used ChEMBL 7.0. DrugBank has a major update cycle measured in years rather than in increments. Although the release of DrugBank 3.0 has been announced as imminent, we downloaded the available version 2.0. Similarly, HMDB has a long cycle time and we used version 2.5. The timing of TTD releases is not clear but their statistics suggest incremental updating.

3.2. Sequence Identifiers

For ChEMBL we used the target table from an Oracle download of the database to extract the UniProt IDs. For the other databases we downloaded all available sequences in FASTA format. For DrugBank and TTD we took all target sequences, rather than splitting them into selectable subsets, to simplify our comparisons (e.g., the limitation of the Venn to a 4-way visualization).

3.3. Database Processing

For the overlap analysis all databases were downloaded from their respective Web sites as SD files (on 19 Sept, 2010). The following section outlines some approaches for the analysis and comparison of chemical structure sets downloaded from different sources. The structure standardization was performed using public cheminformatics Web services.

3.3.1. Chemical Structure Representation in Databases

The familiar two-dimensional (2D) chemical structure diagram is the “natural language” preferred form of representation for chemists. However, although powerful, it allows different forms or representations, for example, as different tautomers for uracil (Fig. 1) that can create ambiguities when comparing large structure sets.

Fig. 1.


Two tautomeric forms of uracil (lactam and lactim) and their corresponding IUPAC names and SMILES strings.

Uracil can be drawn as lactam and lactim differing in the pattern of double bond locations and hydrogen atom attachments for the same compound. While a chemist can easily recognize such cases, this cannot be done if hundreds of thousands of structure records have to be searched or compared. For this, a unique structure representation in a computer-readable format is needed. Common molecular file formats are SMILES (13, 14) and MDL’s MOL file (15, 16). Essentially all cheminformatics toolkits can read/write them and perform basic structural operations such as editing or substructure searches. From SMILES and MOL files unique structural identifiers can be generated algorithmically or at the point of registration in database systems. The former approach includes canonical SMILES (17) and IUPAC names (9) as well as molecular hashcodes (e.g., CACTVS hashcodes (18), InChIKeys (19)). CAS Registry Numbers™ (20), ChemSpider IDs (21), and PubChem CIDs (22) are examples of unique structure identifiers specific to a particular database/system (see Note 1).

3.3.2. Chemical Structure Identifier

A chemical identifier for comparison of large datasets has to deal with ambiguities such as tautomers, salt forms, and charged resonance structures of the same compound from different sources. Since a definition of a unique (or “canonical”) structure representation cannot be derived from fundamental physical principles, variants such as tautomers are condition-dependent. Thus, because there is no standard normalization, database providers either ignore the problem or establish their own implementation and/or definitions of uniqueness. For instance, a vendor might not include tautomeric forms of the same substance as separate catalogue entries. However, for a dataset dealing explicitly with measured keto-enol equilibrium constants it is essential to distinguish these tautomers. Identifiers can therefore be selected to match the intended use.

3.3.3. NIST/IUPAC International Chemical Identifier

An important effort for the development of a nonproprietary standardized chemical structure identifier is the NIST/IUPAC International Chemical Identifier (InChI) project (19). The current InChI algorithm can generate a series of machine-readable unique string representations from a chemical structure (Fig. 2).

Fig. 2.


InChI, InChIKey, and Standard InChI/InChIKey generated for the lactam and lactim form of uracil.

An InChI describes a chemical structure in terms of layers of information. The main or connectivity layer represents all atoms and their connectivity while subsequent layers may add information about charges, isotopes, stereochemistry, and tautomeric forms. Not all layers have to be included. For instance, while omission of the tautomer layer (in InChI terminology “fixed hydrogen layer”) results in a tautomer-insensitive representation, its inclusion generates distinguishable InChIs for each tautomer. This can be seen in Fig. 2, for the uracil tautomers. Both InChIs have identical connectivity layers (consisting of the sub-layers “chemical formula,” “atom connectivity,” and “hydrogens”), but differ in their “fixed hydrogen” layer beginning with the “f/” delimiter. These layers are selectable options for the InChI calculation.

A full-length InChI string facilitates recovery of the original structure. However, like the IUPAC names, these can get long for large and complex structures. Consequently, to make InChI easier to index for Internet search engines and within databases, a hashed, fixed length version, the InChIKey, was added to the InChI library. The InChIKey cannot be reversed into the original structure except by a database lookup. For better interoperability and compatibility between databases and Web applications, a standard variant of InChI and InChIKey was implemented by defining an immutable set of options to be used for the calculation of InChI and InChIKeys (19). This interoperability restricts structure information in some cases. For instance, the standard variant of InChI/InChIKey cannot be used to separate between tautomers since it is a tautomer-invariant representation of a chemical structure (Fig. 2).
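The layer structure described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not part of the InChI software: the function names are hypothetical, and the uracil string is the Standard InChI as commonly listed in public databases, which should be checked against Fig. 2 before reuse.

```python
# Standard InChI for uracil as commonly listed in public databases (assumption:
# verify against Fig. 2 or the source database before relying on it).
URACIL = "InChI=1S/C4H4N2O2/c7-3-1-2-5-4(8)6-3/h1-2H,(H2,5,6,7,8)"


def split_inchi_layers(inchi):
    """Split an InChI string into (prefix, content) pairs, one per layer."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    # The first two fields have no letter prefix: version and chemical formula.
    layers = [("version", parts[0])]
    if len(parts) > 1:
        layers.append(("formula", parts[1]))
    # Each later layer starts with a one-letter prefix such as c, h, q, t or f.
    for part in parts[2:]:
        if part:
            layers.append((part[0], part[1:]))
    return layers


def is_standard(inchi):
    """Standard InChI strings carry an 'S' after the version number."""
    return split_inchi_layers(inchi)[0][1].endswith("S")


def has_fixed_h_layer(inchi):
    """True if a fixed hydrogen ('f') layer is present (tautomer-sensitive)."""
    return any(prefix == "f" for prefix, _ in split_inchi_layers(inchi))


for prefix, content in split_inchi_layers(URACIL):
    print(prefix, content)
```

A Standard InChI never contains the fixed hydrogen layer, so `has_fixed_h_layer` is a quick check of whether two strings can distinguish tautomers at all.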

3.3.4. NCI/CADD Structure Identifiers (FICTS, FICuS, uuuuu)

The NCI/CADD Structure Identifiers have been developed since 2003 at the Computer-Aided Drug Design (CADD) Group of the National Cancer Institute (NCI) (23). They are based on molecular hashcodes (18) generated by the cheminformatics toolkit CACTVS (24,25) from a chemical structure. Similar to InChIKeys, these hashcodes do not carry any information about the original structure and cannot be converted back to it.

The calculation starts with a structure normalization before generating the hashcode. As in Fig. 2, this normalizes different representations including tautomers, charged resonance structures, misdrawn functional groups, missing hydrogen atoms, missing charges, or incorrect valences. In addition, optional selection of normalization modes (Fig. 3) adjusts the sensitivity to particular features of the input structure (e.g., salts, counter-ions, isotopes, formal charges, tautomerism, and/or stereochemistry).

Fig. 3.


The NCI/CADD Structure Identifiers provide adjustable levels of sensitivity to certain molecular or atomic features. If an identifier is set not to be sensitive to one of the illustrated chemical features, the input structure is transformed by the particular rule shown in each column, e.g., if the identifier is set to disregard “fragments” only the largest organic compound is considered.

The letters “F”, “I”, “C”, “T”, and “S” stand for (input sensitivity to) fragments, isotopic labeling, charges, tautomerism, and stereochemistry information. If any of these is switched off, the corresponding upper-case letter is replaced by a lower-case “u” (standing for “un-sensitive”). The three most important identifier variants created from this scheme are the FICTS, FICuS, and uuuuu identifiers. The name “FICTS identifier” indicates that sensitivity to all features is “on” and is the closest representation of the original structure. The normalization procedure consists typically of unifying different drawing variants of functional groups or the addition of missing hydrogen atoms.

The calculation of the FICuS identifier includes all steps of the FICTS normalization procedure. Additionally, a canonical tautomeric form of the input structure is generated and used for the FICuS identifier. This comes closest to how chemists perceive a structure because it is insensitive to tautomeric representations not usually regarded as different compounds. In contrast, the uuuuu identifier is more general since it only considers the basic molecular connectivity. It disregards fragments other than the largest (e.g., counter ions, water), deletes stereochemistry information and isotope labels, neutralizes the structure to its most reasonable state (charges maintaining aromaticity are kept), and represents the canonical tautomer calculated for the FICuS identifier. Hence, the uuuuu identifier is useful for linking closely related forms of the same compound. An illustration of the behavior of the three identifiers calculated for six structural variants of alpha-methylhistamine is shown in Table 1.

Table 1.

NCI/CADD identifiers calculated for structural variants of alpha-methylhistamine

Structure            FICTS                          FICuS                          uuuuu
[structure image]    F2BD225EDAC391C1-FICTS-01-5B   F2BD225EDAC391C1-FICuS-01-7C   F2BD225EDAC391C1-uuuuu-01-2B
[structure image]    746D9B3FB2CF43D5-FICTS-01-5C   F2BD225EDAC391C1-FICuS-01-7C   F2BD225EDAC391C1-uuuuu-01-2B
Stereoisomer (S)     A286977DB9DCC4E7-FICTS-01-67   A286977DB9DCC4E7-FICuS-01-88   F2BD225EDAC391C1-uuuuu-01-2B
[structure image]    701959160007986A-FICTS-01-FB   701959160007986A-FICuS-01-1C   F2BD225EDAC391C1-uuuuu-01-2B
[structure image]    F3B8CA719A55FA13-FICTS-01-54   F3B8CA719A55FA13-FICuS-01-75   F2BD225EDAC391C1-uuuuu-01-2B
[structure image]    6C2A60C9F23B4A37-FICTS-01-40   6C2A60C9F23B4A37-FICuS-01-61   F2BD225EDAC391C1-uuuuu-01-2B

The first part of an NCI/CADD Identifier is the 16-digit hexadecimal CACTVS hashcode. This is followed by a name tag (FICTS, FICuS, uuuuu), a two-digit version tag, and a two-digit checksum. The FICTS identifier perceives all listed variants of alpha-methylhistamine as different chemical compounds, the FICuS identifier links the two tautomers to each other, while the uuuuu identifier considers all six structures as identical.
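The identifier layout described in the table note can be checked with a short script. The sketch below is illustrative only: the regular expression is an assumption derived from that description (16 hexadecimal digits, name tag, two-digit version, two-character hexadecimal checksum), the function names are hypothetical, and the checksum is parsed but not verified because its algorithm is not given here. The data are the identifiers of Table 1, with the checksum of the fourth FICuS entry read as "1C".

```python
import re
from collections import namedtuple

NciCaddId = namedtuple("NciCaddId", "hashcode tag version checksum")

# Assumed layout: <16 hex digits>-<FICTS|FICuS|uuuuu>-<2 digits>-<2 hex digits>
_PATTERN = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")


def parse_identifier(text):
    """Split an NCI/CADD identifier string into its four parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError("not an NCI/CADD identifier: %r" % text)
    return NciCaddId(*match.groups())


# Identifiers from Table 1 (six variants of alpha-methylhistamine).
TABLE_1 = [
    ("F2BD225EDAC391C1-FICTS-01-5B", "F2BD225EDAC391C1-FICuS-01-7C", "F2BD225EDAC391C1-uuuuu-01-2B"),
    ("746D9B3FB2CF43D5-FICTS-01-5C", "F2BD225EDAC391C1-FICuS-01-7C", "F2BD225EDAC391C1-uuuuu-01-2B"),
    ("A286977DB9DCC4E7-FICTS-01-67", "A286977DB9DCC4E7-FICuS-01-88", "F2BD225EDAC391C1-uuuuu-01-2B"),
    ("701959160007986A-FICTS-01-FB", "701959160007986A-FICuS-01-1C", "F2BD225EDAC391C1-uuuuu-01-2B"),
    ("F3B8CA719A55FA13-FICTS-01-54", "F3B8CA719A55FA13-FICuS-01-75", "F2BD225EDAC391C1-uuuuu-01-2B"),
    ("6C2A60C9F23B4A37-FICTS-01-40", "6C2A60C9F23B4A37-FICuS-01-61", "F2BD225EDAC391C1-uuuuu-01-2B"),
]


def distinct_per_level(rows):
    """Count distinct hashcodes seen under each identifier tag."""
    seen = {}
    for row in rows:
        for text in row:
            parsed = parse_identifier(text)
            seen.setdefault(parsed.tag, set()).add(parsed.hashcode)
    return {tag: len(hashcodes) for tag, hashcodes in seen.items()}


print(distinct_per_level(TABLE_1))
```

Counting distinct hashcodes per tag reproduces the behavior stated in the note: six compounds at the FICTS level, five at FICuS (the two tautomers merge), and one at uuuuu.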

3.3.5. Chemical Identifier Resolver

CIR is a Web service also developed by the CADD Group of the National Cancer Institute (NCI) that converts a given structure identifier (e.g., SMILES, chemical name, Standard InChI/InChIKey, NCI/CADD Identifier) into another representation or structure identifier (10). It can be used either from its Web site (http://cactus.nci.nih.gov/chemical/structure, Fig. 4) or by putting together a URL request applying the general URI scheme:

"http://cactus.nci.nih.gov/chemical/structure/" + identifier + "/" + representation

Fig. 4.

The NCI/CADD Chemical Identifier Resolver. For example, http://cactus.nci.nih.gov/chemical/structure/seroquel/stdinchikey will generate VRHJBWUIWQOFLF-WLHGVMLRSA-N, the Standard InChIKey for seroquel.

The URL interface allows an automated submission of requests by scripting/programming languages or the simple integration of data into other Web service via JavaScript/AJAX.
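As a hedged illustration of such scripted access, the following Python sketch assembles a request URL according to the URI scheme given in this section. The function names `cir_url` and `resolve` are hypothetical, the base URL is the one quoted in the text and may have changed since, and percent-encoding the identifier is an assumption made so that characters such as "#" or "/" in SMILES strings do not break the URL path.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as quoted in the text; the service address may have changed.
CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation):
    """Build a CIR request URL: base + "/" + identifier + "/" + representation."""
    # Percent-encode the identifier so that SMILES characters stay in one path segment.
    return "%s/%s/%s" % (CIR_BASE, quote(identifier, safe=""), representation)


def resolve(identifier, representation, timeout=30):
    """Send the request and return the response body as text (needs network access)."""
    with urlopen(cir_url(identifier, representation), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


print(cir_url("seroquel", "stdinchikey"))
```

Calling `resolve("seroquel", "stdinchikey")` would send the request shown in the Fig. 4 example. The call is left out of the sketch because it depends on the service being reachable.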

For the lookup of hashed identifiers like Standard InChIKeys and the NCI/CADD Structure Identifiers, CIR currently uses the CSLS database (23) aggregated from the currently largest available small-molecule repositories like ChemNavigator iResearch Library (26), PubChem (27) (including ChemSpider (21), ZINC (28), and eMolecules (29)). The set of unique structures has been found by calculating the NCI/CADD Identifiers for all original structure records in these databases (~120 million records). Further development of CIR will extend to a diverse set of cheminformatics methods including generation of tautomers and stereoisomers, or the calculation of physicochemical properties.

3.4. Processing of ChEMBL, DrugBank, HMDB, and TTD

For all structure records in the four databases, FICTS, FICuS, uuuuu, and Standard InChIKey were calculated using the original SD files as input. The calculation of Standard InChIKeys was performed with InChI executable version 1.03 downloadable from IUPAC (9). The NCI/CADD Structure Identifiers are generated as a property of the internal structure representation of CACTVS (24). The calculation of the identifiers automatically includes the corresponding structure normalization procedures. We maintained the ID pointers back to each of the source databases (e.g., DBxxxxx or HMDBxxxxx for DrugBank or HMDB, respectively). TTD splits the IDs into four categories and ChEMBL employed ChEBI IDs as identifiers. For following up individual records, we used a proprietary AstraZeneca internal application, called Chemistry Connect, which merges structures according to in-house chemistry rules and provides Web out-links to these four sources and many others including PubChem and ChemSpider (see Note 2).
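Keeping the ID pointers requires reading the data fields that accompany each structure in an SD file. The sketch below is a minimal, standard-library reader for those fields, not the tooling used in this work. It assumes the usual SD file layout (data headers of the form "> <FIELD>", values ended by a blank line, records ended by "$$$$"). The field name `SOURCE_ID` and the sample record are hypothetical, since each database uses its own field names.

```python
import re

# A data header line looks like "> <FIELD_NAME>"; other text may precede the "<".
_FIELD = re.compile(r"^>.*<([^>]+)>")

# Hypothetical two-field record; the connection table itself is ignored here.
SAMPLE = """\
example-1
  hypothetical record

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SOURCE_ID>
EX0001

> <GENERIC_NAME>
Example compound

$$$$
"""


def iter_sdf_data(lines):
    """Yield one dict of data fields (name -> text) per SD file record."""
    fields, current = {}, None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):          # end of record
            yield {name: "\n".join(values) for name, values in fields.items()}
            fields, current = {}, None
            continue
        header = _FIELD.match(line)
        if header:                            # start of a new data field
            current = header.group(1)
            fields[current] = []
        elif current is not None:
            if line == "":                    # blank line closes the field
                current = None
            else:
                fields[current].append(line)


for record in iter_sdf_data(SAMPLE.splitlines()):
    print(record)
```

For a real download, the same generator can be fed an open file handle, and the source ID from each record can then be stored next to the identifiers calculated for that structure.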

3.5. Mapping Compounds Between Databases

Using the methods described, we have produced three general result sets. The first, summarized in Table 2 and Fig. 5, is the breakdown of the four different identifier types we determined for each of the four sources, with the first column of Table 2 giving the record count as provided by each source download. The second result set is the pairwise overlap (compounds-in-common) at the FICTS and uuuuu levels (the 4×4 matrices in Table 3). The third result set is the 4-way Venn diagram at the uuuuu level (Fig. 6).

Table 2.

The total number of records and unique compounds at different standardization levels for the four databases

Database    Total records   Unique FICTS   Unique FICuS   Unique uuuuu   Unique InChIKey
ChEMBL      600,624         599,900        598,615        558,135        600,004
DRUGBANK    4,664           4,469          4,458          4,328          4,462
HMDB        7,886           7,877          7,859          7,482          7,878
TTD         3,387           2,852          2,828          2,565          2,817

Fig. 5.


The number of unique structures resulting from the various standardization processes, as percentages of the total number of unique structures from the original sources.

Table 3.

The pairwise overlaps between the databases using FICTS and uuuuu identifiers

FICTS       ChEMBL     DRUGBANK   HMDB     TTD
ChEMBL      599,900    1,763      852      1,559
DRUGBANK    -          4,469      351      1,157
HMDB        -          -          7,877    157
TTD         -          -          -        2,852

uuuuu       ChEMBL     DRUGBANK   HMDB     TTD
ChEMBL      558,135    2,571      1,185    1,979
DRUGBANK    -          4,328      626      1,394
HMDB        -          -          7,482    222
TTD         -          -          -        2,565

The main diagonal indicates the number of unique compounds
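A pairwise overlap matrix of this kind can be computed from sets of identifier strings. The following Python sketch uses only the standard library; the function name and the toy identifiers are hypothetical and stand in for the FICTS or uuuuu strings calculated for each source.

```python
from itertools import combinations_with_replacement


def pairwise_overlaps(sources):
    """Return {(name_a, name_b): shared identifier count} for the upper triangle.

    `sources` maps a database name to a set of identifier strings. Pairing a
    database with itself gives its number of unique identifiers (the diagonal).
    """
    names = list(sources)
    return {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations_with_replacement(names, 2)
    }


# Toy identifier sets (hypothetical values, not real hashcodes).
toy = {
    "A": {"x1", "x2", "x3"},
    "B": {"x2", "x3", "x4"},
    "C": {"x3", "x5"},
}

for pair, count in pairwise_overlaps(toy).items():
    print(pair, count)
```

Because each source is held as a set, duplicate records that share an identifier are counted once, which is why the diagonal corresponds to the unique counts rather than the record totals.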

Fig. 6.


The 4-way Venn diagram of compound content comparison at the uuuuu identifier level. The totals for each database are given in the uuuuu section of Table 2.
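The regions of such a Venn diagram can also be derived programmatically. The sketch below, with a hypothetical function name and toy identifiers, assigns every identifier to the exact combination of sources in which it occurs; for four sources this yields up to 15 non-empty regions. It is an illustration of the set logic only and is not the VENNY tool used for the figure.

```python
def venn_regions(sources):
    """Map each combination of source names to the identifiers found in exactly those sources."""
    membership = {}
    for name, identifiers in sources.items():
        for identifier in identifiers:
            membership.setdefault(identifier, set()).add(name)
    regions = {}
    for identifier, names in membership.items():
        regions.setdefault(frozenset(names), set()).add(identifier)
    return regions


# Toy identifier sets (hypothetical values standing in for uuuuu identifiers).
toy = {
    "ChEMBL": {"a", "b", "c", "d"},
    "DrugBank": {"a", "b", "e"},
    "HMDB": {"a", "f"},
    "TTD": {"a", "b", "g"},
}

for names, identifiers in venn_regions(toy).items():
    print(sorted(names), len(identifiers))
```

The region keyed by all four names corresponds to the center of the diagram, and the single-name regions give the content unique to each source.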

The figures in Table 2 show the expected reduction in numbers according to the stringencies of the identifiers discussed in Subheading 3.3.2. While the representational choices made by the sources can be different, the distributions in Fig. 5 are similar, with the exception of TTD. This shows a 26% reduction between records and uuuuus compared to 8, 7, and 5%, for ChEMBL, DrugBank, and HMDB, respectively. Without further detailed analysis we cannot determine the causes behind different record:uuuuu ratios but they could include contributions from salts, mixtures, isomers, and missing charges. The unique InChIKey counts are in close agreement with the number of total records in ChEMBL, DrugBank, and HMDB. This phenomenon of “redundancy collapse” for bioactive chemistry sources has been reported before for a larger range of databases even though different chemistry rules were used for the analysis (3).
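One way to express the record-to-uuuuu reduction is sketched below in Python, using the totals from Table 2. The denominator behind the percentages quoted above is not stated, so the definition used here (reduction relative to total records) is an assumption and its output may differ from those figures by a point or two.

```python
# (total records, unique uuuuu identifiers) taken from Table 2.
TABLE_2 = {
    "ChEMBL": (600_624, 558_135),
    "DrugBank": (4_664, 4_328),
    "HMDB": (7_886, 7_482),
    "TTD": (3_387, 2_565),
}


def percent_reduction(total, unique):
    """Reduction from total records to unique identifiers, as a percentage of total records."""
    return 100.0 * (total - unique) / total


for name, (total, unique) in TABLE_2.items():
    print("%-9s %5.1f%%" % (name, percent_reduction(total, unique)))
```

Under this definition TTD still stands out clearly from the other three sources, which is the point made in the text.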

The pairwise overlaps (Table 3) increase going from FICTS to uuuuu as more compounds merge to unique structural representations. Other features appear less intuitive, such as the many identical structures between the drug (DrugBank) and metabolite (HMDB) databases. Some of these would be expected (e.g., the hormone epinephrine as DB00668) but others are not (e.g., atorvastatin as HMDB05006). The figure for HMDB of over 8% uuuuu structures-in-common with DrugBank exceeds what we might expect from the inclusion of pharmaceutically approved hormone preparations. The explanation lies in the important utility of HMDB for the interpretation of analytical results not only for endogenous metabolite structures but also for the identification of common drugs (Wishart D, personal communication).

The set of overlaps represented in the Venn diagram (Fig. 6) facilitates more detailed comparisons of the databases. A noticeable feature is unique content. Given its approximately 100-fold larger size, the observation that 99% of ChEMBL chemical content is not captured by the other databases is unsurprising. Also predictable is that HMDB is substantially unique because of its metabolite focus. Less expected is the number of structures unique to DrugBank and TTD individually, which, by implication, may not only have been extracted from different sources accessed by one or the other but also not have been subsumed within the primary literature extracted by ChEMBL. Given the declared nesting of Drugstore within ChEMBL, the 1,055 structures-in-common between it, DrugBank, and TTD are likely to be predominantly approved drugs. However, this is well below the individual totals given in DrugBank and TTD as 1,350 and 1,514, respectively. This issue of an anomalously low structure identity overlap between collections that each nominally includes at least all FDA-approved drugs has been noted previously using a different set of databases (3).

In the middle of Fig. 6 we see 185 compounds-in-common between all four sources. One of these, 6-aminohexanoic acid, occurs as source identifiers HMDB01901, DAP000200, and ChEBI227755 in HMDB, TTD, and ChEBI, respectively, but also as duplicate entries DB00513 and DB04134 in DrugBank, where they represent separate records for the synonyms “aminocaproic acid” (DB00513) and “6-aminohexanoic acid” (DB04134). As an example of the utility of the NCI/CADD CIR (Fig. 4), the InChIs or SMILES from any of the five entries will all convert to the uuuuu identifier string “017F65C418085161-uuuuu-01-D4”.

We can inspect some examples that are unique to each of the drug databases. Taking a TTD-only entry, first we find the database record DCL000003 for a compound named “BMS-275291” or Rebimastat (7F87D3454124E6E2-uuuuu-01-FF). While there is an identical PubChem CID 148203 also labeled as “BMS 275291”, we suggest this may be erroneous and the correct structure (EEE06B24B53EA4E4-uuuuu-01-30) was linked by ChEMBL (CHEBI:220194 as PubChem CID 9913881 or SID 85418578). This was corroborated by other sources in AstraZeneca’s Chemistry Connect. As a DrugBank-only structure we found database record DB02724 (015DF44E4FF1D7E2-uuuuu-01-26) as the Delta-2-Albomycin A1 antibiotic (see Note 3).

3.6. Mapping Proteins Between the Databases

The exercise of cross-mapping protein sequence identifiers between databases is analogous to the chemical structure comparison described above. However, the results are very different. These are shown as pairwise overlaps (sequences-in-common) in Table 4 and the complete Venn comparison in Fig. 7. We have also included the average number of compounds-per-protein (Table 5, see also Note 4).

Table 4.

The pairwise overlaps between UniProt identifiers

            ChEMBL   DRUGBANK   HMDB     TTD
ChEMBL      4,862    964        1,349    780
DRUGBANK    -        5,543      971      799
HMDB        -        -          4,251    614
TTD         -        -          -        1,883

The main diagonal indicates the number of unique protein sequence identifiers

Fig. 7.


The 4-way Venn diagram of target content comparison at the UniProt identifier level.

Table 5.

The total number of UniProt IDs for each database and the average number of compounds-per-protein

                        ChEMBL   DRUGBANK   HMDB    TTD
Proteins                4,862    5,543      4,251   1,884
Compounds-per-protein   115      0.8        1.8     1.4

The compound totals used for the calculation of compounds-per-protein were the corresponding uuuuu figures from Table 2
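The ratios in Table 5 can be reproduced from the uuuuu totals in Table 2 and the protein counts in Table 5. The short Python sketch below shows the arithmetic; the variable and function names are illustrative.

```python
# Unique uuuuu compound counts (Table 2) and UniProt ID counts (Table 5).
UUUUU_COMPOUNDS = {"ChEMBL": 558_135, "DrugBank": 4_328, "HMDB": 7_482, "TTD": 2_565}
UNIPROT_IDS = {"ChEMBL": 4_862, "DrugBank": 5_543, "HMDB": 4_251, "TTD": 1_884}


def compounds_per_protein(compounds, proteins):
    """Average number of compounds per protein for each database."""
    return {name: compounds[name] / proteins[name] for name in compounds}


for name, ratio in compounds_per_protein(UUUUU_COMPOUNDS, UNIPROT_IDS).items():
    print("%-9s %.1f" % (name, ratio))
```

These are simple averages over whole databases. They say nothing about how compounds are distributed across individual proteins.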

The first aspect to consider is the concept of “target”. While an extended discussion is outside the scope of this work, some consideration is necessary to interpret our results. Extending the example used in the introduction, we can consider HMG-CoA reductase to be the primary target of atorvastatin in the sense that this 1:1 compound-to-protein mapping is the causal basis for the therapeutic effect. Three examples illustrate the inter-source differences. In TTD, the atorvastatin entry, DAP000553, maps to the single expected “primary” target P04035. In ChEMBL, the same compound maps not only to P04035 but also, via cross-screening data, to the rat orthologue P51639, and to dipeptidyl peptidase IV from pig, P22411. The atorvastatin entry in DrugBank has 17 associated “target” sequences. Another example is the (deliberate) inclusion of non-targets. Thus, we find trypsin (P07477) in ChEMBL, DrugBank, and TTD. As a widely used mechanistic exemplar for serine protease inhibition studies, this is an important target to capture cross-screening data for, but is not a drug target per se. Further complexity is illustrated by TTD’s useful inclusion of 104 antisense protein targets. We have not filtered these from the total target download (as some could also be small-molecule targets) but our chemical structure processing in this work does not encompass antisense reagents.

For reviewing the protein overlaps in Table 4, the statistics provided by TTD are useful because, to a first approximation, they represent the counts of primary targets (verified by selected record inspections). TTD specifies 358 targets of marketed drugs, 251 in clinical trials, and 1,254 in the research phase. The first of these is close to the 324 targets of approved drugs reported in 2006 (30). The fact that TTD has only 799 proteins-in-common with DrugBank (Fig. 7) points to a broader target protein mapping implementation in the latter. This extends beyond primary targets (usually indicated as Target 1 in the record) to any protein that, based on literature mining, has a reported association with the named drug, including metabolizing enzymes. This explains not only the lower compound-to-protein ratio of DrugBank relative to TTD in Table 5 (0.8 for the former and 1.4 for the latter) but also the 4,579 proteins in DrugBank that are not found in the primary literature of direct research compound testing as captured by ChEMBL (N.B. DrugBank 3.0 will have improved mapping stringencies and selectable target subsets, see also Note 5).
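The figures quoted in this paragraph follow directly from Table 4, and the arithmetic can be checked with a short sketch. The variable names are our own; the numbers are copied from the table (the TTD total is the Table 4 diagonal value).

```python
# Figures copied from Table 4: diagonal = total proteins, off-diagonal = shared.
drugbank_total = 5543
chembl_drugbank_shared = 964
ttd_total = 1883
drugbank_ttd_shared = 799

# DrugBank proteins with no counterpart in ChEMBL.
drugbank_not_in_chembl = drugbank_total - chembl_drugbank_shared

# Share of TTD proteins that also appear in DrugBank.
ttd_fraction_in_drugbank = drugbank_ttd_shared / ttd_total

print(drugbank_not_in_chembl)             # 4579
print(f"{ttd_fraction_in_drugbank:.1%}")  # 42.4%
```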

Reviewing Fig. 7 for unique protein content indicates a majority proportion for HMDB, ChEMBL, and DrugBank but less than half for TTD. The unique content from HMDB arises from the fact that the majority of proteins involved in metabolism are not (so far) being pursued as drug targets. An example is UniProt ID O14503, the human Class E basic helix-loop-helix protein 40, included in the metabolite record for Cyclic AMP (HMDB00058). Nevertheless, the diagram shows an HMDB:DrugBank vs. HMDB:TTD ratio of 2,413:1,838. While this suggests that nearly 2,000 proteins involved in metabolism may have been investigated as drug targets, only 644 of these have data in ChEMBL. As was the case for compounds, the unique protein content for ChEMBL is not unexpected considering its broad chemogenomic and structure-activity relationship (SAR) scope. A ChEMBL-unique example is UniProt ID Q9WUL0 for rat DNA topoisomerase 1, captured from cross-screening data (CHEMBL1075164). An example of a DrugBank-unique sequence is UniProt ID P06993, an HTH-type transcriptional regulator from Escherichia coli. This is associated with the fungicide benzoic acid in the record for DB03793. One of the reasons for TTD having unique protein content is the inclusion of possible targets without compound mappings, such as UniProt ID P04324, the Nef protein from HIV, in TTDR00778 (see Note 6).
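The unique and shared segments of a 4-way Venn diagram can be derived by recording, for each identifier, the exact combination of sources that contain it. The sketch below is one possible way to do this and is not the tooling used for Fig. 7; `venn_segments` is a hypothetical helper and the identifiers are placeholders, not real database content.

```python
from collections import Counter


def venn_segments(id_sets):
    """Count identifiers by the exact combination of sources containing them."""
    membership = {}
    for source, ids in id_sets.items():
        for identifier in ids:
            membership.setdefault(identifier, set()).add(source)
    return Counter(frozenset(sources) for sources in membership.values())


# Placeholder identifiers (hypothetical), standing in for UniProt IDs.
toy_sets = {
    "ChEMBL": {"ID1", "ID2", "ID3", "ID4"},
    "DRUGBANK": {"ID2", "ID3", "ID5"},
    "HMDB": {"ID3", "ID6"},
    "TTD": {"ID2", "ID7"},
}

segments = venn_segments(toy_sets)

# Content unique to each source is the segment keyed by that source alone.
unique_counts = {source: segments[frozenset([source])] for source in toy_sets}
print(unique_counts)
```

Because every identifier falls into exactly one segment, the segment counts sum to the size of the union, which gives a quick consistency check on any Venn diagram drawn from them.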

An additional factor contributing to differences in protein content is the way the individual sources handle the target complex problem. This is illustrated by inspecting the records for the approved proteasome inhibitor bortezomib, or Velcade (PubChem CID 387447). In TTD, the drug record (DAP001318) assigns the target name as “26S proteasome” but without any mapped protein identifiers. The record for the same compound in DrugBank (APRD00828) lists the protein IDs for five proteasome constitutive subunits. However, the 26S proteasome is reported to have 11 non-ATPase regulatory subunits and 7 beta-type subunits (31). Thus, even though DrugBank has provided mappings to the individual subunits, there should be at least twice as many if the target is the 26S complex, but fewer subunit protein IDs if the target is assumed to be the 20S core complex. In ChEMBL the same compound (CHEMBL325041) is mapped to 37 targets. These appear to be extracted from cross-screening data against other proteases. While “proteasome” is mentioned in the assay descriptions, there are no subunit protein IDs mapped to the compound (see Note 7).

3.7. Conclusions

We have outlined approaches that we hope will not only be useful to those in a position to execute them but will also provide some insight to users of the Web interfaces who navigate between these and similar sources at the interface between chemical information and bioinformatics. The examples we have chosen only scratch the surface of even just the hundreds of compound and protein entries unique to certain sources, let alone the other Venn diagram segments. The overlap patterns are challenging to interpret because they are just numbers, but inspecting individual records approaches a “standard of truth” for what each source actually contains, regardless of its declared capture strategy and scope. Discerning the reasons behind the observed differences is necessarily more speculative but equally important, as these databases (and any others for that matter) do not typically report comparisons “between themselves” at the level of detail presented here. Following these resources into the future will be of great interest as global chemogenomic data generation increases and efforts continue in the development of new medicines directed against a widening range of drug targets.

As final remarks, we would firstly like to make clear that none of our observations should be interpreted as criticism, particularly since our internal efforts for data integration across internal and external sources make us acutely aware of the challenges associated with compound-to-protein mappings. Indeed, we would emphasize not only the powerful complementarity of this set of interlinked resources, but also that they welcome feedback. Secondly, we appreciate the opportunity to access the public data used in this work. Consequently, any parties interested in obtaining our results for further analysis (particularly perhaps those teams from the databases we have included) are welcome to contact us.

Footnotes

As we have shown above, databases may use different chemistry rules for handling structures. It is thus important when comparing large sets to implement a common structure normalization process to handle salts and mixtures, isotopes, tautomers, and stereochemistry to generate a unique structure identifier. This should be achievable with any cheminformatics toolkit.

We recommend inclusion of the standard versions of InChI and InChIKey in all chemical databases. These can provide both standardized record counts and simple direct content comparisons (for example with Excel). They can also be used to establish direct links to major public cheminformatics resources such as PubChem and ChemSpider.
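As a rough illustration of such a direct comparison, the sketch below intersects two sets of standard InChIKeys twice: once on the full 27-character key and once on the first 14-character block only, which encodes the connectivity and therefore ignores stereochemistry and isotope differences. This is only loosely analogous to the stringency levels of the NCI/CADD identifiers used in this work. The function names are our own and the keys are synthetic strings of the correct shape, not real compounds.

```python
def connectivity_block(inchikey):
    """Return the first 14-character block of a standard InChIKey."""
    first, _, _ = inchikey.partition("-")
    if len(inchikey) != 27 or len(first) != 14:
        raise ValueError(f"not a 27-character InChIKey: {inchikey!r}")
    return first


def compare(keys_a, keys_b):
    """Return (exact matches, connectivity-only matches) between two key sets."""
    exact = set(keys_a) & set(keys_b)
    skeleton = ({connectivity_block(k) for k in keys_a}
                & {connectivity_block(k) for k in keys_b})
    return len(exact), len(skeleton)


# Synthetic keys with the right layout (14-10-1 characters); not real compounds.
db_a = {"AAAAAAAAAAAAAA-BBBBBBBBSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"}
db_b = {"AAAAAAAAAAAAAA-DDDDDDDDSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"}

exact_count, skeleton_count = compare(db_a, db_b)
print(exact_count, skeleton_count)
```

In this toy case the two sets share one exact key but two connectivity blocks, the kind of difference that appears when two sources record the same skeleton with different stereochemical detail.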

The adoption of the PubChem CID as a universal identifier and out-link by these four databases (although technically indirect in the ChEMBL case) is very useful. However, it has complications that cannot be detailed here. Users therefore need to be aware that chemical database entries representing the same compound can point to different CIDs with different structural representations, source (PubChem SID) links, and bioannotations. This “multiplexing” is particularly problematic for approved drugs.

Normalizing protein content to compare databases is nontrivial and can be confounded by the use of different identifiers in records. The provision of FASTA sequence downloads is useful as these can be normalized to UniProt IDs using PICR as described. The vast majority of these were Swiss-Prot IDs (i.e., curated and non-redundant canonical sequences) rather than UniProt IDs automatically assigned by TrEMBL. Nevertheless, there could be small differences between our protein counts generated via the use of PICR and those reported by the sources.
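Once sequences carry UniProt-style FASTA headers (`>sp|ACCESSION|ENTRY_NAME ...` for Swiss-Prot, `>tr|...` for TrEMBL), the accessions and their curation status can be collected with a small parser. This sketch assumes that header layout; the FASTA files supplied by the individual sources may use other layouts, which is why a mapping service was needed here. The function name, the TrEMBL entry, and the residues shown are placeholders.

```python
def uniprot_accessions(fasta_text):
    """Map accession -> 'Swiss-Prot' or 'TrEMBL' from UniProt-style headers."""
    labels = {"sp": "Swiss-Prot", "tr": "TrEMBL"}
    found = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split("|")
        if len(parts) >= 3 and parts[0] in labels:
            found[parts[1]] = labels[parts[0]]
    return found


# Sequence lines are placeholder residues, not the real sequences.
# The second entry is entirely hypothetical.
example = """\
>sp|P04035|HMDH_HUMAN 3-hydroxy-3-methylglutaryl-coenzyme A reductase
MKTMKTMKT
>tr|X0X0X0|EXAMPLE_HUMAN Hypothetical unreviewed entry
MKTMKT
"""

accessions = uniprot_accessions(example)
print(accessions)
```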

Depending on what type of analysis is envisaged, more detail may be discerned by downloading and comparing the individual target subsets from DrugBank and TTD. It would be preferable to be able to select just the UniProt IDs but this needs full data downloads. While we have used UniProt IDs, they unfortunately cannot be retrieved directly from the Web interfaces. Given the vagaries of name searching, the most reliable way to match the target is by using BLAST with a section of FASTA sequence from the UniProt entry.

While we did not implement it here as a separate exercise, it would be possible to merge these four sources into one database system, with the proviso that the individual identifier mappings are maintained. By aggregating the different coverage, this would be particularly efficient and comprehensive for structure searching.
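One simple way to picture such a merge is a lookup keyed on a normalized structure identifier, with each entry retaining the original record IDs from every contributing source. The sketch below is a toy in-memory version under that assumption; `merge_sources`, the structure keys, and the record IDs are all hypothetical.

```python
from collections import defaultdict


def merge_sources(records):
    """Group (source, source_id, structure_key) records by structure key."""
    merged = defaultdict(lambda: defaultdict(list))
    for source, source_id, structure_key in records:
        merged[structure_key][source].append(source_id)
    # Convert to plain dicts so missing keys raise instead of being created.
    return {key: dict(by_source) for key, by_source in merged.items()}


# Hypothetical records: (source, source record ID, normalized structure key).
records = [
    ("ChEMBL", "CHEMBL-X1", "KEY-1"),
    ("DrugBank", "DB-X1", "KEY-1"),
    ("TTD", "TTD-X1", "KEY-1"),
    ("HMDB", "HMDB-X1", "KEY-2"),
    ("ChEMBL", "CHEMBL-X2", "KEY-2"),
    ("ChEMBL", "CHEMBL-X3", "KEY-2"),
]

merged = merge_sources(records)
print(merged["KEY-1"])
```

Keeping a list per source allows for the case where one source holds several records that normalize to the same structure.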

An analogous protein-level merge could be used to generate a sequence-searchable database. This would be valuable not only to pick up identity matches but also, via the sequence similarity scores, to facilitate homology detection (and probable chemical modulation starting points) for any query sequence.

Note added in proof. Between the time of writing and delivery of proofs, three of the databases (DrugBank, TTD, and ChEMBL) have undergone major updates. Consequently, the reported absolute and comparative content statistics are no longer current. Notwithstanding, the general applicability and conclusions of our analysis remain valid. It should be noted that ChEMBL IDs are now the primary chemical structure identifiers for that database, rather than the ChEBI IDs used in the early releases. It should also be noted that, subsequent to the recent additions of TTD and HMDB, all four sources now have SIDs in PubChem.

References
