Abstract
Similarity-based clustering and classification of compounds enable the search for drug leads and the structural and chemogenomic studies that facilitate chemical, biomedical, agricultural, material and other industrial applications. A database that organizes compounds into similarity-based as well as scaffold-based and property-based families is useful for facilitating these tasks. The CFam Chemical Family database (http://bidd2.cse.nus.edu.sg/cfam) was developed to hierarchically cluster drugs, bioactive molecules, human metabolites, natural products, patented agents and other molecules into functional families, superfamilies and classes of structurally similar compounds based on the literature-reported high, intermediate and remote similarity measures. The compounds were represented by molecular fingerprints and molecular similarity was measured by the Tanimoto coefficient. The functional seeds of CFam families were selected from hierarchically clustered drugs, bioactive molecules, human metabolites, natural products and patented agents, respectively, and were used to characterize families and cluster compounds into families, superfamilies and classes. CFam currently contains 11 643 classes, 34 880 superfamilies and 87 136 families of 490 279 compounds (1691 approved drugs, 1228 clinical trial drugs, 12 386 investigative drugs, 262 881 highly active molecules, 15 055 human metabolites, 80 255 ZINC-processed natural products and 116 783 patented agents). Efforts will be made to further expand the CFam database and to add more functional categories and families based on other types of molecular representations.
INTRODUCTION
Similarity-based clustering and classification of compounds have been extensively used in diverse tasks ranging from the search for bioactive agents in drug discovery (1–4) to molecular and chemogenomic studies in such applications as chemspace navigation and analysis (5,6), structure-target relationship investigation (7–12), cross-pharmacology profiling of intra-family and cross-family targets (13,14) and receptor de-orphanization (15). For facilitating these and other tasks and for the orderly management of known compounds and the study of new compounds, it would be advantageous to organize the known compounds into chemical families based on structural similarity (16,17) as well as molecular scaffold classification (5,18,19) and molecular descriptor projection (19,20). This requires a method and resource for defining, generating and maintaining a comprehensive set of chemical families. To the best of our knowledge, such a resource is not yet publicly available. We therefore developed the CFam Chemical Family database (http://bidd2.cse.nus.edu.sg/cfam) both as a database of function-based chemical families and as a resource for facilitating further development of chemical family databases.
Generating a chemical family database would rely heavily on automated algorithms for classifying the large number of known compounds, which exceeds 30 million compounds, 1.4 million bioactive molecules and 760 000 patented agents in the PubChem (21) and ChEMBL (22) databases. This raises two problems. One is the difficulty of strictly using a hierarchical clustering algorithm for grouping such a large number of known compounds, even though the k-means hierarchical clustering algorithm is capable of clustering 800 000 compounds (2,16) and non-hierarchical ones can cluster millions of compounds (23). The second is the difficulty of systematically defining chemical families and selecting family members relevant to both structural and chemical studies and applications in pharmaceutical, biomedical, agricultural and industrial research and development. These problems also arise in generating protein domain families, where they have been resolved by selecting subsets of proteins of known functions as the seeds of protein domain families to both define each family's functional and structural characteristics and select family members by multiple sequence alignment against the seed proteins (24). We employed a similar strategy for generating the CFam chemical families.
To make CFam chemical families more relevant to pharmaceutical, biomedical, agricultural, material and other industrial applications as well as to research in chemistry and related scientific disciplines, the seeds of the CFam families were or are to be iteratively selected from hierarchically clustered approved drugs, clinical trial drugs, investigative drugs, bioactive molecules, human metabolites, food ingredients and additives, flavors and scents, agrochemicals, natural products, patented agents, toxic substances, purchasable compounds and other known compounds based on the literature-reported high-similarity measures (25–28). These families were further clustered into CFam superfamilies and classes by hierarchically clustering the seeds based on the literature-reported intermediate similarity (11,29,30) and remote similarity (3,13,30) measures. Although this iterative hierarchical clustering procedure seems similar to the incremental clustering algorithm used in selecting representative proteins for clustering proteins (31) and representative compounds for clustering large compound libraries (23), there are two significant differences. One is that the seed selection and clustering processes are based on hierarchical clustering algorithms. The second is the preferential selection of compounds of higher functional importance as the seeds in the order of drugs, bioactive molecules, human metabolites, etc.
Currently, the CFam database includes the seeds, members and names of families, superfamilies and classes functionally characterized by the approved drugs, clinical trial drugs, investigative drugs, highly active molecules (IC50 or Ki < 1 μM against molecular target), human metabolites, ZINC-processed natural products and patented agents. Table 1 provides the statistics of CFam seeds, compounds, families, superfamilies and classes with respect to the seven functional categories of compounds.
Table 1. The statistics of CFam seeds, compounds, families, superfamilies and classes with respect to the seven functional categories of compounds: approved drugs, clinical trial drugs, investigative drugs, bioactives (currently highly active molecules), human metabolites, ZINC-processed natural products and patented agents.
| Functional category | Number of seeds | Number of seeds and members | Number of families | Number of superfamilies | Number of classes |
|---|---|---|---|---|---|
| Approved Drugs | 1691 | 95 367 (4121 HM, 19 408 NP) | 1114 | 937 | 813 |
| Clinical Trial Drugs | 1168 | 38 981 (551 HM, 3258 NP) | 863 | 756 | 537 |
| Investigative Drugs | 11 093 | 93 191 (4321 HM, 11 881 NP) | 4226 | 2870 | 1700 |
| Bioactives | 98 523 | 171 162 (833 HM, 24 439 NP) | 29 983 | 15 088 | 4035 |
| Human Metabolites | 5229 | 10 408 (5229 HM, 1820 NP) | 2058 | 1377 | 709 |
| Natural Products | 19 449 | 20 821 | 4017 | 1517 | 394 |
| Patented Agents | 60 349 | 60 349 | 44 875 | 12 335 | 3455 |
| Total | 197 502 | 490 279 | 87 136 | 34 880 | 11 643 |
The numbers of members of these families from the two categories of special interest, human metabolites (HM) and natural products (NP), are also provided.
DATA COLLECTION AND PROCESSING
Because of the high computational cost of clustering large numbers of compounds, the first version of CFam primarily focuses on the following seven categories of compounds of functional significance: 1691 approved drugs from TTD (32) and DrugBank (33), 1228 clinical trial drugs and 12 386 investigative drugs from TTD (32), 262 881 highly active molecules (IC50 or Ki < 1 μM against molecular target) from ChEMBL version 18 (22), 15 055 human metabolites from HMDB (34), 80 255 ZINC-processed natural products from ZINC (35) and 116 783 patented agents from PubChem (21), respectively. For database entries with multiple non-linked components, only the largest component was selected. Hydrogens were added and salt ions were removed by using Open Babel (36). Duplicates were identified and removed by comparative analysis of their InChIKeys; an InChIKey is a hashed version of the InChI (37) designed to be nearly unique for each individual compound, with a collision resistance of 2.2 × 10¹⁵ (38).
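The component selection and duplicate removal described above can be illustrated with a minimal sketch. It is not the pipeline used for CFam: the record layout, the function names and the placeholder keys are hypothetical, and the length of a SMILES fragment is used only as a crude proxy for component size (a real workflow would rely on a toolkit such as Open Babel to count atoms and to generate InChIKeys).

```python
def largest_component(smiles):
    """Return the largest dot-separated component of a SMILES string.

    Assumption: fragment string length stands in for component size.
    """
    return max(smiles.split("."), key=len)


def deduplicate(records):
    """Keep the first record seen for each InChIKey, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        if rec["inchikey"] not in seen:
            seen.add(rec["inchikey"])
            unique.append(rec)
    return unique


# Toy records; the keys are placeholders, not real InChIKeys.
records = [
    {"id": "A", "smiles": "CC(=O)Oc1ccccc1C(=O)O", "inchikey": "KEY-1"},
    {"id": "B", "smiles": "CC(=O)Oc1ccccc1C(=O)O", "inchikey": "KEY-1"},
    {"id": "C", "smiles": "[Na+].CC(=O)[O-]", "inchikey": "KEY-2"},
]

unique_ids = [rec["id"] for rec in deduplicate(records)]
acetate = largest_component("[Na+].CC(=O)[O-]")
```

Keeping the first occurrence means that the order in which source databases are read decides which record survives; that ordering is a design choice of this sketch, not something stated in the text.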
GENERATION OF CFam FAMILIES OF HIGH SIMILARITY COMPOUNDS
Molecular similarity and analysis may be conducted from different structural, physicochemical and functional perspectives by using different types of molecular representations. These include molecular descriptors (19,20,39), molecular scaffolds (5,18,19), molecular fingerprints (3,16,17) and other molecular representations, such as chemical graphs, pharmacophore patterns and molecular fields (40–43). Multiple forms of chemical families can thus be generated from these molecular representations in a similar manner as the multiple forms of protein families generated from multiple-sequence alignment of protein domains (24,44), conserved signature profiling of selected sequence segments (45), structure classification (46,47) and combined analysis of these and other features (48). Due to the high computational cost in clustering large number of compounds, in the first version of CFam, we only used one type of molecular representation, the 2D molecular fingerprints (specifically, the 881-bit PubChem substructure fingerprints computed by using PaDEL (49)), for representing molecules, which was selected because of its computational efficiency, demonstrated effectiveness in similarity searching and extensive applications in drug discovery (3,50–54). The other types of molecular representations will be used in the future version of CFam for generating other forms of chemical families.
The seeds of CFam families were assigned and used to assemble compounds into CFam families by the following iterative hierarchical clustering procedure. In the first iteration, 1691 approved drugs were clustered by a hierarchical clustering algorithm with the 2D fingerprint Tanimoto coefficient (2DF-TC) as the similarity metric and complete linkage as the linkage criterion. The Tanimoto coefficient was used because it is the most popular similarity metric for measuring compound similarity (3). Complete linkage was used because of its relatively good performance in clustering bioactive compounds in a recent comparative study (55). The criterion for grouping compounds into a cluster of high-similarity compounds is 2DF-TC >0.85, which was adopted because it is a widely used criterion for avoiding structural redundancy in selecting compound libraries for screening bioactive compounds (25,26). High-similarity compounds grouped by this criterion typically have a 30–81% chance of having the same activity in the same bioassay (26–28). The drug/drugs in each cluster was/were assigned as the seed/seeds of a CFam-approved drug family with the family name systematically characterized by the target/targets, activity type (e.g. inhibitor), molecular class/classes (e.g. benzisoxazole derivative) and drug name/names of the seed/seeds.
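As an illustration of the first iteration, the following sketch computes the Tanimoto coefficient on fingerprints stored as sets of on-bit indices and performs a naive complete-linkage agglomerative clustering with the 0.85 cut-off. The function names and toy fingerprints are assumptions made for this example; the cubic-time loop is suitable only for small sets and is not the implementation used to build CFam.

```python
def tanimoto(a, b):
    """Tanimoto coefficient c / (a + b - c) for two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def complete_linkage_clusters(fps, threshold=0.85):
    """Agglomerative clustering with complete linkage.

    Two clusters may merge only if their least similar pair of members
    still has a Tanimoto coefficient above the threshold; the eligible
    pair of clusters with the highest linkage value is merged first.
    """
    clusters = [[i] for i in range(len(fps))]
    while True:
        best = None  # (linkage value, index of cluster x, index of cluster y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                link = min(
                    tanimoto(fps[i], fps[j])
                    for i in clusters[x]
                    for j in clusters[y]
                )
                if link > threshold and (best is None or link > best[0]):
                    best = (link, x, y)
        if best is None:
            return clusters
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]


# Toy fingerprints: 0 and 1 share 19 of 21 bits; 2 is similar but below 0.85.
fps = [
    set(range(20)),
    set(range(19)) | {20},
    set(range(18)) | {21, 22},
    set(range(100, 120)),
]
clusters = complete_linkage_clusters(fps)
```

With complete linkage, every pair of compounds inside a cluster exceeds the threshold, which is why each compound in a first-iteration cluster can serve as a seed of the resulting family.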
In the second iteration, the 2DF-TCs of the 1228 clinical trial drugs against the seed/seeds of the existing CFam families were first computed. If the 2DF-TC of a drug is >0.85 with respect to all the seeds/seed of a family, the drug was assigned as a seed of that family. If the 2DF-TC of a drug is >0.85 to some but not all of the seeds of a family, the drug was assigned as a member of that family. If the 2DF-TC of a drug is >0.85 to the seeds of more than one family, the drug was tentatively assigned to the family/families with the largest 2DF-TC and the remaining family/families was/were marked as a cousin family to the assigned family/families and these cousins are indicated in the CFam database (e.g. CFFAD942 Prostaglandin G/H synthase 2 inhibitor diarylsubstituted isoxazole derivative valdecoxib family is a cousin family of CFFAD3 D2 dopamine receptor ligand benzisoxazole derivative risperidone family) so that the cousin families can be subsequently evaluated for possible merger into a combined family. The remaining unassigned clinical trial drugs were subject to the same procedure as that of the first iteration to assign them as the seed/seeds of CFam clinical trial drug families for assembling compounds into the respective families.
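The assignment rule of the second iteration can be sketched as follows, assuming the 2DF-TCs between one compound and every seed of each candidate family have already been computed. The function name, the input layout and the family IDs are hypothetical, and the sketch returns a single best family rather than handling ties.

```python
def assign_to_family(sims_by_family, threshold=0.85):
    """Apply the seed/member/cousin rule to one compound.

    sims_by_family maps a family ID to the list of Tanimoto coefficients
    between the compound and each seed of that family.
    Returns (role, family_id, cousin_ids); role is None if no family matches.
    """
    hits = {}
    for family_id, sims in sims_by_family.items():
        n_above = sum(tc > threshold for tc in sims)
        if n_above == 0:
            continue
        # Above threshold to all seeds -> new seed; to only some -> member.
        role = "seed" if n_above == len(sims) else "member"
        hits[family_id] = (max(sims), role)
    if not hits:
        return None, None, []
    best = max(hits, key=lambda family_id: hits[family_id][0])
    cousins = sorted(f for f in hits if f != best)
    return hits[best][1], best, cousins


result = assign_to_family({
    "FAM_A": [0.91, 0.88],   # above threshold to every seed
    "FAM_B": [0.87, 0.40],   # above threshold to one seed only
    "FAM_C": [0.30],         # no match
})
```

A compound for which the function returns a role of None would go on to the clustering step of the first iteration and could seed a new family.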
In the subsequent iterations, each set of 12 386 investigative drugs, 262 881 highly active molecules, 15 055 human metabolites, 80 255 ZINC-processed natural products and 116 783 patented agents were in turn subject to the same procedure as that of the second iteration to assign compounds into the existing CFam families or as the seed/seeds of the new CFam investigative drug families, bioactive molecule families, human metabolite families, natural product families and patented agent families for assembling compounds into the corresponding families, respectively. If the 2DF-TC of a compound is >0.85 to the seeds of more than one family, it was preferentially assigned in order of priority to approved drug, clinical trial drug, bioactive molecule (currently highly active molecule), human metabolite, natural product and patented agent family, respectively. Certain functional categories, such as human metabolites and natural products, are of special interests beyond one scientific discipline. Therefore, if a compound from these categories (e.g. a natural product) was preferentially assigned to a family of a different category (e.g. approved drug), that family was marked and is displayed as a family containing compound/compounds from this special category (e.g. approved drug family with natural product).
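The category priority and the special-category marking can be expressed as a short sketch. The names below are hypothetical, and the position of investigative drug families in the ranking is an assumption inferred from the iteration order, since the priority list in the text does not mention them explicitly.

```python
# Category ranking, highest priority first (position of
# "investigative drug" is assumed from the iteration order).
PRIORITY = [
    "approved drug",
    "clinical trial drug",
    "investigative drug",
    "bioactive molecule",
    "human metabolite",
    "natural product",
    "patented agent",
]
RANK = {category: i for i, category in enumerate(PRIORITY)}
SPECIAL = {"human metabolite", "natural product"}


def choose_family(compound_category, candidates):
    """Pick the highest-priority family among those the compound matches.

    candidates is a list of (family_id, family_category) pairs whose seeds
    exceed the similarity threshold. Returns (family_id, label), where label
    is a display note when a special-category compound lands in a family of
    a different category, and None otherwise.
    """
    family_id, family_category = min(candidates, key=lambda c: RANK[c[1]])
    label = None
    if compound_category in SPECIAL and family_category != compound_category:
        label = f"{family_category} family with {compound_category}"
    return family_id, label


choice = choose_family(
    "natural product",
    [("NP_FAMILY_X", "natural product"), ("AD_FAMILY_Y", "approved drug")],
)
```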
Where possible, the names of these families were systematically determined in a similar manner to those of the approved drug families. Many clinical trial and investigative drugs have little molecular class information and a large number of bioactive compounds and natural products lack a common name, which makes it difficult to automatically search for their molecular class names. Therefore, where possible, the IUPAC systematic names were used to extract common substructure names as putative molecular class names. Efforts will be made to determine the molecular classes of these families from the structure information of their seed/seeds. For the remaining families, for which we were unable to obtain molecular class information, the family names were tentatively characterized by the name/names or ID/IDs of their seed/seeds.
GENERATION OF CFam SUPERFAMILIES OF INTERMEDIATE TO HIGH SIMILARITY COMPOUNDS, AND CFam CLASSES OF REMOTE TO HIGH SIMILARITY COMPOUNDS
The centroid seeds of the CFam families were further clustered by a hierarchical clustering algorithm with the 2DF-TC as the similarity metric and complete linkage as the linkage criterion, so that the CFam families can be assembled into CFam superfamilies and classes. The criterion for assembling CFam family/families into a superfamily of intermediate to high similarity compounds is 2DF-TC >0.70, which was applied because compounds satisfying this criterion have been regarded as similar to one another (30,56) and those with slightly lower similarity typically have remote similarity (29). Compounds grouped by this intermediate-similarity criterion may have up to a 30% chance of having the same activity in the same bioassay (11). These superfamilies were systematically named from the common target classes, chemical classes and individual family names of the constituent families. A superfamily is typically composed of compounds of the same or highly similar molecular scaffolds targeting the same target, members of the same target subfamilies or target sites accommodating similar molecular scaffolds. For instance, the CFSAD2 cAMP-specific 3′, 5′-cyclic phosphodiesterase, TNF inhibitor xanthine derivative superfamily includes two families of xanthine derivatives against the two targets and three families of structurally similar purine derivatives, N-alkylguanine acyclonucleosides and theobromines.
The criterion for further assembling CFam superfamily/superfamilies into CFam classes of remote to intermediate similarity compounds is 2DF-TC >0.57, which was used because it can reasonably capture similar compounds that have cross-pharmacology relationships but do not necessarily have the same activity (13). A CFam class typically consists of a large number of compounds that bind to multiple members of a target family/subfamily and/or target families/subfamilies with binding sites accommodating similar molecular scaffolds, which makes it difficult to name the class systematically. Therefore, CFam classes were tentatively named by their CFam class IDs only. Efforts will be made to manually determine their names. An example of a CFam class is CFCAD3, which is composed of the binders of GPCR Class A subfamilies A1 (C-C chemokine receptors), A9 (neuropeptide Y receptors), A13 (cannabinoid receptors), A17 (dopamine receptors), A18 (muscarinic acetylcholine receptors) and A19 (5-HT receptors), cholinesterases, tryptases, dopamine transporters and sodium channel proteins, etc.
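The superfamily and class levels are built from one centroid seed per family. The text does not define how the centroid seed is chosen; the sketch below assumes it is the medoid, i.e. the seed with the highest mean 2DF-TC to the other seeds of its family. The function name and the toy similarity matrix are hypothetical.

```python
def centroid_seed(sim):
    """Return the index of the medoid seed of one family.

    sim is a square symmetric matrix (list of lists) of pairwise Tanimoto
    coefficients among the seeds of the family. Assumption: the centroid
    seed is the one with the highest mean similarity to the other seeds.
    """
    n = len(sim)
    if n == 1:
        return 0

    def mean_to_others(i):
        return sum(sim[i][j] for j in range(n) if j != i) / (n - 1)

    return max(range(n), key=mean_to_others)


# Toy family of three seeds; seed 1 is closest to both of the others.
sim = [
    [1.00, 0.90, 0.60],
    [0.90, 1.00, 0.88],
    [0.60, 0.88, 1.00],
]
centroid = centroid_seed(sim)
```

The resulting representatives would then be clustered with the same complete-linkage procedure, using the 0.70 cut-off for superfamilies and the 0.57 cut-off for classes.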
DATABASE STRUCTURE AND ACCESS
CFam can be searched in three different modes (Figure 1). The first mode enables the search of CFam by inputting a compound name or ID (CFam, PubChem, ChEMBL, ZINC and TTD compound IDs are currently supported), a CFam family name or ID, a CFam superfamily name or ID and a CFam class ID, respectively. The relevant information may be obtained by clicking the buttons of ‘Molecule’, ‘Family’, ‘Superfamily’ and ‘Class’, respectively. For instance, inputting ‘aspirin’ and then clicking ‘Molecule’ leads to the CFam molecule CFAMM00072836 page, which shows that aspirin belongs to the CFam CFFAD534 cyclooxygenase inhibitor salicylate derivative aspirin family (Figure 2). The second mode enables the browsing of CFam families, superfamilies and classes of any functional category, which is done by first clicking the ‘Family’, ‘Superfamily’ or ‘Class’ word in the section header titled ‘Browse CFam Family/Superfamily/Class by Functional Category’, and then clicking a specific functional category below the header. For instance, clicking ‘Family’ and then ‘Approved Drug Families’ leads to the page listing the CFam approved drug families (Figure 3). The third mode facilitates the alignment of an input compound in SMILES or molecular fingerprint format against CFam seeds to identify CFam families with high, intermediate and remote similarity to the input compound. The list of up to 30 CFam families with at least one seed having 2DF-TC > 0.85 (high similarity family), 0.85 ≥ 2DF-TC > 0.7 (intermediate similarity family) and 0.7 ≥ 2DF-TC > 0.57 (remote similarity family) to the input compound is provided. Figure 4 shows the result page of the alignment of aspirin with CFam seeds. To facilitate the development of chemical family databases and the structural and functional analysis of molecules, CFam seeds can be downloaded from the CFam main page (Figure 1).
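The similarity tiers reported by the third search mode can be sketched as below, assuming the best 2DF-TC between the input compound and the seeds of each family is already known. The function names and family IDs are hypothetical, and the sketch applies the limit of 30 to the overall ranked list, which is one possible reading of the text.

```python
def similarity_tier(tc):
    """Map a Tanimoto coefficient to the tier used in the result list."""
    if tc > 0.85:
        return "high"
    if tc > 0.70:
        return "intermediate"
    if tc > 0.57:
        return "remote"
    return None


def rank_families(best_tc_by_family, limit=30):
    """Return up to `limit` (family_id, tc, tier) hits, most similar first.

    best_tc_by_family maps a family ID to the highest Tanimoto coefficient
    between the query compound and any seed of that family.
    """
    hits = [
        (family_id, tc, similarity_tier(tc))
        for family_id, tc in best_tc_by_family.items()
        if similarity_tier(tc) is not None
    ]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:limit]


ranked = rank_families({"FAM_A": 0.62, "FAM_B": 0.93, "FAM_C": 0.40, "FAM_D": 0.78})
```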
Figure 1. CFam web interface. CFam is searchable by three modes: compound and family name and ID searching, browsing of CFam families, superfamilies and classes and the alignment of a compound against CFam families.
Figure 2. A CFam molecule page resulting from the name search by inputting ‘aspirin’ and selecting ‘molecule’.
Figure 3. The CFam approved drug families browsing page resulting from the clicking of ‘Family’ in the section header titled ‘Browse CFam Family/Superfamily/Class by Functional Category’ and ‘Approved Drug Families’ in the section.
Figure 4. The CFam result page of the alignment of aspirin with CFam seeds.
REMARKS
Specialized chemical information resources, such as chemical family databases, complement the general chemical databases by facilitating focused studies on the navigation, classification and structural and functional characterization of molecules. Chemical family databases that comprehensively cover the known chemspace and characterize molecules from different molecular representations are increasingly needed given the rapidly expanding pools of molecules from synthetic and natural sources (57–59) and the increasing need to analyze larger numbers and a greater variety of compounds for diverse applications (13–15,19). To meet this need, CFam will be further updated to expand existing functional families and add new families of moderately active molecules (IC50 or Ki 1–10 μM against molecular target), food ingredients and additives, flavors and scents, agrochemicals, natural products beyond the ZINC-processed ones, toxic substances, purchasable compounds and other compounds. Although some of the CFam families are currently composed of seeds only, these seeds are nonetheless useful for facilitating further development of chemical families and function-based classification of compounds.
FUNDING
Funding for open access charge: Major State Basic Research Development Program of China [2013CB967204]. The authors would also like to thank the Singapore Academic Research Fund (R-148-000-181-112).
Conflict of interest statement. None declared.
REFERENCES
- 1. Gruneberg S., Stubbs M.T., Klebe G. Successful virtual screening for novel inhibitors of human carbonic anhydrase: strategy and experimental confirmation. J. Med. Chem. 2002;45:3588–3602. doi: 10.1021/jm011112j.
- 2. Bocker A., Schneider G., Teckentrup A. NIPALSTREE: a new hierarchical clustering approach for large compound libraries and its application to virtual screening. J. Chem. Inf. Model. 2006;46:2220–2229. doi: 10.1021/ci050541d.
- 3. Willett P. Similarity-based virtual screening using 2D fingerprints. Drug Discov. Today. 2006;11:1046–1053. doi: 10.1016/j.drudis.2006.10.005.
- 4. Riniker S., Fechner N., Landrum G.A. Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing. J. Chem. Inf. Model. 2013;53:2829–2836. doi: 10.1021/ci400466r.
- 5. Lipinski C., Hopkins A. Navigating chemical space for biology and medicine. Nature. 2004;432:855–861. doi: 10.1038/nature03193.
- 6. Renner S., van Otterlo W.A., Dominguez Seoane M., Mocklinghoff S., Hofmann B., Wetzel S., Schuffenhauer A., Ertl P., Oprea T.I., Steinhilber D., et al. Bioactivity-guided mapping and navigation of chemical space. Nat. Chem. Biol. 2009;5:585–592. doi: 10.1038/nchembio.188.
- 7. Hu Y., Bajorath J. Rationalizing structure and target relationships between current drugs. AAPS J. 2012;14:764–771. doi: 10.1208/s12248-012-9392-z.
- 8. Eckert H., Bajorath J. Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discov. Today. 2007;12:225–233. doi: 10.1016/j.drudis.2007.01.011.
- 9. Wang Y., Bajorath J. Development of a compound class-directed similarity coefficient that accounts for molecular complexity effects in fingerprint searching. J. Chem. Inf. Model. 2009;49:1369–1376. doi: 10.1021/ci900108d.
- 10. Vogt I., Ahmed H.E., Auer J., Bajorath J. Exploring structure-selectivity relationships of biogenic amine GPCR antagonists using similarity searching and dynamic compound mapping. Mol. Divers. 2008;12:25–40. doi: 10.1007/s11030-008-9071-2.
- 11. Biniashvili T., Schreiber E., Kliger Y. Improving classical substructure-based virtual screening to handle extrapolation challenges. J. Chem. Inf. Model. 2012;52:678–685. doi: 10.1021/ci200472s.
- 12. Hu G., Kuang G., Xiao W., Li W., Liu G., Tang Y. Performance evaluation of 2D fingerprint and 3D shape similarity methods in virtual screening. J. Chem. Inf. Model. 2012;52:1103–1113. doi: 10.1021/ci300030u.
- 13. Brianso F., Carrascosa M.C., Oprea T.I., Mestres J. Cross-pharmacology analysis of G protein-coupled receptors. Curr. Top Med. Chem. 2011;11:1956–1963. doi: 10.2174/156802611796391285.
- 14. Lin H., Sassano M.F., Roth B.L., Shoichet B.K. A pharmacological organization of G protein-coupled receptors. Nat. Methods. 2013;10:140–146. doi: 10.1038/nmeth.2324.
- 15. van der Horst E., Peironcely J.E., Ijzerman A.P., Beukers M.W., Lane J.R., van Vlijmen H.W., Emmerich M.T., Okuno Y., Bender A. A novel chemogenomics analysis of G protein-coupled receptors (GPCRs) and their ligands: a potential strategy for receptor de-orphanization. BMC Bioinformatics. 2010;11:316. doi: 10.1186/1471-2105-11-316.
- 16. Bocker A., Derksen S., Schmidt E., Teckentrup A., Schneider G. A hierarchical clustering approach for large compound libraries. J. Chem. Inf. Model. 2005;45:807–815. doi: 10.1021/ci0500029.
- 17. Engels M.F., Gibbs A.C., Jaeger E.P., Verbinnen D., Lobanov V.S., Agrafiotis D.K. A cluster-based strategy for assessing the overlap between large chemical libraries and its application to a recent acquisition. J. Chem. Inf. Model. 2006;46:2651–2660. doi: 10.1021/ci600219n.
- 18. Wetzel S., Klein K., Renner S., Rauh D., Oprea T.I., Mutzel P., Waldmann H. Interactive exploration of chemical space with Scaffold Hunter. Nat. Chem. Biol. 2009;5:581–583. doi: 10.1038/nchembio.187.
- 19. Lachance H., Wetzel S., Kumar K., Waldmann H. Charting, navigating, and populating natural product chemical space for drug discovery. J. Med. Chem. 2012;55:5989–6001. doi: 10.1021/jm300288g.
- 20. Le Guilloux V., Colliandre L., Bourg S., Guenegou G., Dubois-Chevalier J., Morin-Allory L. Visual characterization and diversity quantification of chemical libraries: 1. Creation of delimited reference chemical subspaces. J. Chem. Inf. Model. 2011;51:1762–1774. doi: 10.1021/ci200051r.
- 21. Bolton E., Wang Y., Thiessen P.A., Bryant S.H. PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem. 2008;4:217–240.
- 22. Bento A.P., Gaulton A., Hersey A., Bellis L.J., Chambers J., Davies M., Kruger F.A., Light Y., Mak L., McGlinchey S., et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014;42:D1083–1090. doi: 10.1093/nar/gkt1031.
- 23. Li W. A fast clustering algorithm for analyzing highly similar compounds of very large libraries. J. Chem. Inf. Model. 2006;46:1919–1923. doi: 10.1021/ci0600859.
- 24. Sonnhammer E.L., Eddy S.R., Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997;28:405–420. doi: 10.1002/(sici)1097-0134(199707)28:3<405::aid-prot10>3.0.co;2-l.
- 25. Matter H. Selecting optimally diverse compounds from structure databases: a validation study of two-dimensional and three-dimensional molecular descriptors. J. Med. Chem. 1997;40:1219–1229. doi: 10.1021/jm960352+.
- 26. Martin Y.C., Kofron J.L., Traphagen L.M. Do structurally similar molecules have similar biological activity? J. Med. Chem. 2002;45:4350–4358. doi: 10.1021/jm020155c.
- 27. Cramer R.D., Jilek R.J., Guessregen S., Clark S.J., Wendt B., Clark R.D. ‘Lead hopping’. Validation of topomer similarity as a superior predictor of similar biological activities. J. Med. Chem. 2004;47:6777–6791. doi: 10.1021/jm049501b.
- 28. Dunkel M., Gunther S., Ahmed J., Wittig B., Preissner R. SuperPred: drug classification and target prediction. Nucleic Acids Res. 2008;36:W55–W59. doi: 10.1093/nar/gkn307.
- 29. Godden J.W., Stahura F.L., Bajorath J. Anatomy of fingerprint search calculations on structurally diverse sets of active compounds. J. Chem. Inf. Model. 2005;45:1812–1819. doi: 10.1021/ci050276w.
- 30. Boehm M., Wu T.Y., Claussen H., Lemmen C. Similarity searching and scaffold hopping in synthetically accessible combinatorial chemistry spaces. J. Med. Chem. 2008;51:2468–2480. doi: 10.1021/jm0707727.
- 31. Li W., Jaroszewski L., Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17:282–283. doi: 10.1093/bioinformatics/17.3.282.
- 32. Qin C., Zhang C., Zhu F., Xu F., Chen S.Y., Zhang P., Li Y.H., Yang S.Y., Wei Y.Q., Tao L., et al. Therapeutic target database update 2014: a resource for targeted therapeutics. Nucleic Acids Res. 2014;42:D1118–D1123. doi: 10.1093/nar/gkt1129.
- 33. Law V., Knox C., Djoumbou Y., Jewison T., Guo A.C., Liu Y., Maciejewski A., Arndt D., Wilson M., Neveu V., et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42:D1091–D1097. doi: 10.1093/nar/gkt1068.
- 34. Wishart D.S., Jewison T., Guo A.C., Wilson M., Knox C., Liu Y., Djoumbou Y., Mandal R., Aziat F., Dong E., et al. HMDB 3.0–The Human Metabolome Database in 2013. Nucleic Acids Res. 2013;41:D801–D807. doi: 10.1093/nar/gks1065.
- 35. Irwin J.J., Sterling T., Mysinger M.M., Bolstad E.S., Coleman R.G. ZINC: a free tool to discover chemistry for biology. J. Chem. Inf. Model. 2012;52:1757–1768. doi: 10.1021/ci3001277.
- 36. O'Boyle N.M., Banck M., James C.A., Morley C., Vandermeersch T., Hutchison G.R. Open Babel: an open chemical toolbox. J. Cheminform. 2011;3:33. doi: 10.1186/1758-2946-3-33.
- 37. International Union of Pure and Applied Chemistry. InChI version 1 (software version 1.04 for Standard and Non-Standard InChI/InChIKey) 2011 http://www.iupac.org/InChI/
- 38. InChI Trust. IUPAC International Chemical Identifier (InChI) Programs InChI version 1, software version 1.04 User's Guide. 2012 http://www.inchi-trust.org/download/104/InChI_UserGuide.pdf
- 39. Bender A., Jenkins J.L., Scheiber J., Sukuru S.C., Glick M., Davies J.W. How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J. Chem. Inf. Model. 2009;49:108–119. doi: 10.1021/ci800249s.
- 40. Dean P.M., editor. Molecular Similarity in Drug Design. London: Chapman and Hall; 1994.
- 41. Willett P., Barnard J.M., Downs G.M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998;38:983–996.
- 42. Nikolova N., Jaworska J. Approaches to measure chemical similarity – a review. QSAR Comb. Sci. 2003;22:1006–1026.
- 43. Bender A., Glen R.C. Molecular similarity: a key technique in molecular informatics. Org. Biomol. Chem. 2004;2:3204–3218. doi: 10.1039/B409813G.
- 44. Finn R.D., Bateman A., Clements J., Coggill P., Eberhardt R.Y., Eddy S.R., Heger A., Hetherington K., Holm L., Mistry J., et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–D230. doi: 10.1093/nar/gkt1223.
- 45. Sigrist C.J., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. New and continuing developments at PROSITE. Nucleic Acids Res. 2013;41:D344–D347. doi: 10.1093/nar/gks1067.
- 46. Andreeva A., Howorth D., Chandonia J.M., Brenner S.E., Hubbard T.J., Chothia C., Murzin A.G. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008;36:D419–D425. doi: 10.1093/nar/gkm993.
- 47. Cuff A.L., Sillitoe I., Lewis T., Clegg A.B., Rentzsch R., Furnham N., Pellegrini-Calace M., Jones D., Thornton J., Orengo C.A. Extending CATH: increasing coverage of the protein structure universe and linking structure with function. Nucleic Acids Res. 2011;39:D420–D426. doi: 10.1093/nar/gkq1001.
- 48. Hunter S., Jones P., Mitchell A., Apweiler R., Attwood T.K., Bateman A., Bernard T., Binns D., Bork P., Burge S., et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012;40:D306–D312. doi: 10.1093/nar/gkr948.
- 49. Yap C.W. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011;32:1466–1474. doi: 10.1002/jcc.21707.
- 50. Brown R., Martin Y. The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. J. Chem. Inf. Comput. Sci. 1997;37:1–9.
- 51. Schuffenhauer A., Gillet V.J., Willett P. Similarity searching in files of three-dimensional chemical structures: analysis of the BIOSTER database using two-dimensional fingerprints and molecular field descriptors. J. Chem. Inf. Comput. Sci. 2000;40:295–307. doi: 10.1021/ci990263g.
- 52. Makara G.M. Measuring molecular similarity and diversity: total pharmacophore diversity. J. Med. Chem. 2001;44:3563–3571. doi: 10.1021/jm010036h.
- 53. Sheridan R.P., Kearsley S.K. Why do we need so many chemical similarity search methods? Drug Discov. Today. 2002;7:903–911. doi: 10.1016/s1359-6446(02)02411-x.
- 54. Cruciani G., Pastor M., Mannhold R. Suitability of molecular descriptors for database mining. A comparative analysis. J. Med. Chem. 2002;45:2685–2694. doi: 10.1021/jm0011326.
- 55. Smieja M., Warszycki D., Tabor J., Bojarski A.J. Asymmetric clustering index in a case study of 5-HT1A receptor ligands. PLoS One. 2014;9:e102069. doi: 10.1371/journal.pone.0102069.
- 56. Xue L., Godden J.W., Bajorath J. Database searching for compounds with similar biological activity using short binary bit string representations of molecules. J. Chem. Inf. Comput. Sci. 1999;39:881–886. doi: 10.1021/ci990308d.
- 57. Thomas G.L., Johannes C.W. Natural product-like synthetic libraries. Curr. Opin. Chem. Biol. 2011;15:516–522. doi: 10.1016/j.cbpa.2011.05.022.
- 58. Lopez-Vallejo F., Giulianotti M.A., Houghten R.A., Medina-Franco J.L. Expanding the medicinally relevant chemical space with compound libraries. Drug Discov. Today. 2012;17:718–726. doi: 10.1016/j.drudis.2012.04.001.
- 59. van Hattum H., Waldmann H. Biology-oriented synthesis: harnessing the power of evolution. J. Am. Chem. Soc. 2014;136:11853–11859. doi: 10.1021/ja505861d.
