Abstract
Motivation: Software applications for structural similarity searching and clustering of small molecules play an important role in drug discovery and chemical genomics. Here, we present the first open-source compound mining framework for the popularstatistical programming environment R. The integration with a powerful statistical environment maximizes the flexibility, expandability and programmability of the provided analysis functions.
Results: We discuss the algorithms and compound mining utilities provided by the R package ChemmineR. It contains functions for structural similarity searching, clustering of compound libraries with a wide spectrum of classification algorithms and various utilities for managing complex compound data. It also offers a wide range of visualization functions for compound clusters and chemical structures. The package is well integrated with the online ChemMine environment and allows bidirectional communications between the two services.
Availability: ChemmineR is freely available as an R package from the ChemMine project site: http://bioweb.ucr.edu/ChemMineV2/chemminer
Contact: thomas.girke@ucr.edu
REFERENCES
- Carhart R, et al. Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comput. Sci. 1985;25:64–73. [Google Scholar]
- Chen J, et al. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics. 2005;21:4133–4139. doi: 10.1093/bioinformatics/bti683. [DOI] [PubMed] [Google Scholar]
- Chen X, Reynolds C. Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J. Chem. Inf. Comput. Sci. 2002;42:1407–1414. doi: 10.1021/ci025531g. [DOI] [PubMed] [Google Scholar]
- Gedeck P, et al. QSAR–how good is it in practice? Comparison of descriptor sets on an unbiased cross section of corporate data sets. J. Chem. Inf. Model. 2006;46:1924–1936. doi: 10.1021/ci050413p. [DOI] [PubMed] [Google Scholar]
- Gentleman R, et al. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer; 2005. [Google Scholar]
- Girke T, et al. ChemMine. A compound mining database for chemical genomics. Plant Physiol. 2005;138:573–577. doi: 10.1104/pp.105.062687. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guha R, et al. The Blue obelisk-interoperability in chemical informatics. J. Chem. Inf. Model. 2006;46:991–998. doi: 10.1021/ci050400b. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Holliday JD, et al. Analysis and display of the size dependence of chemical similarity coefficients. J. Chem. Inf. Comput. Sci. 2003;43:819–828. doi: 10.1021/ci034001x. [DOI] [PubMed] [Google Scholar]
- Irwin JJ, Shoichet BK. ZINC–a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 2005;45:177–182. doi: 10.1021/ci049714. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lang DT, et al. rggobi: interface between R and GGobi. 2007 R package version 2.1.7. [Google Scholar]
- O'Boyle NM, et al. Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem. Cent. J. 2008;2:1–7. doi: 10.1186/1752-153X-2-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2008. ISBN 3-900051-07-0. [Google Scholar]
- Raymond J, et al. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J. Chem. Inf. Comput. Sci. 2002;42:305–316. doi: 10.1021/ci010381f. [DOI] [PubMed] [Google Scholar]
- Seiler KP, et al. ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic Acids Res. 2008;36(Database issue):351–359. doi: 10.1093/nar/gkm843. [DOI] [PMC free article] [PubMed] [Google Scholar]