Abstract
We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequences without needlessly restrictive priors. The ability to quantify the cost in bits of each phrase in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time, through an innovative data structure, and in specificity of motif detection (compression), through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.
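The two-part coding idea summarized above can be illustrated with a minimal sketch. This is not the MDLcompress implementation: the function names, the zeroth-order entropy estimate used as the data cost, the flat 2 bits per nucleotide used as the model cost, and the placeholder symbol `X` are all assumptions made for illustration. The sketch only shows how a candidate phrase can be scored in bits, so that a repeated phrase yields a net saving while a phrase seen once does not.

```python
import math
from collections import Counter

# Assumption: a phrase stored in the model is spelled out over {A, C, G, T},
# so each nucleotide of the phrase costs a flat 2 bits.
BITS_PER_NUCLEOTIDE = 2


def data_cost_bits(symbols):
    """Zeroth-order empirical code length of a symbol sequence, in bits."""
    n = len(symbols)
    return sum(c * math.log2(n / c) for c in Counter(symbols).values())


def substitute(sequence, phrase, new_symbol="X"):
    """Replace non-overlapping, left-to-right occurrences of phrase.

    new_symbol is a hypothetical placeholder and must not already occur
    in the sequence.
    """
    symbols, phrase = list(sequence), list(phrase)
    if not phrase:
        raise ValueError("phrase must be non-empty")
    out, i, k = [], 0, len(phrase)
    while i < len(symbols):
        if symbols[i:i + k] == phrase:
            out.append(new_symbol)
            i += k
        else:
            out.append(symbols[i])
            i += 1
    return out


def savings_bits(sequence, phrase):
    """Bits saved by moving phrase into the model (positive means it pays off)."""
    baseline = data_cost_bits(list(sequence))       # no model, data only
    model = BITS_PER_NUCLEOTIDE * len(phrase)       # part one: the phrase itself
    data = data_cost_bits(substitute(sequence, phrase))  # part two: data given model
    return baseline - (model + data)


if __name__ == "__main__":
    repeated = ("GATTACA" + "CG") * 8
    print(savings_bits(repeated, "GATTACA"))    # positive: the phrase repeats
    print(savings_bits("ACGTTGCAAGCT", "TTGCA"))  # negative: the phrase occurs once
```

Under these assumptions, a phrase earns its place in the model only when the bits it removes from the data part exceed the bits needed to describe it, which is the sense in which a per-phrase cost in bits can be compared across candidate motifs.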
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40]
Contributor Information
Scott C Evans, Email: evans@research.ge.com.
Antonis Kourtidis, Email: akourtidis@albany.edu.
T Stephen Markham, Email: markham@research.ge.com.
Jonathan Miller, Email: jnthnmllr@gmail.com.
Douglas S Conklin, Email: dconklin@albany.edu.
Andrew S Torres, Email: torres@research.ge.com.
References
- Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature. 1998;391(6669):806–811. doi: 10.1038/35888.
- Hannon GJ, Rossi JJ. Unlocking the potential of the human genome with RNA interference. Nature. 2004;431(7006):371–378. doi: 10.1038/nature02870.
- Kourtidis A, Eifert C, Conklin DS. In: Systems Biology, Applications and Perspectives, Ernst Schering Foundation Symposium Proceedings. Bringmann P, Butcher EC, Parry G, Weiss B, editors. Vol. 61. Springer, New York, NY, USA; 2007. RNAi applications in target validation; pp. 1–21.
- Lewis BP, Shih I-H, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of mammalian microRNA targets. Cell. 2003;115(7):787–798. doi: 10.1016/S0092-8674(03)01018-3.
- Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120(1):15–20. doi: 10.1016/j.cell.2004.12.035.
- Rusinov V, Baev V, Minkov IN, Tabler M. MicroInspector: a web tool for detection of miRNA binding sites in an RNA sequence. Nucleic Acids Research. 2005;33(web server):W696–W700. doi: 10.1093/nar/gki364.
- Calin GA, Liu C-G, Sevignani C. et al. MicroRNA profiling reveals distinct signatures in B cell chronic lymphocytic leukemias. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(32):11755–11760. doi: 10.1073/pnas.0404432101.
- Esquela-Kerscher A, Slack FJ. Oncomirs – microRNAs with a role in cancer. Nature Reviews Cancer. 2006;6(4):259–269. doi: 10.1038/nrc1840.
- Grünwald P, Myung IJ, Pitt M, editors. Advances in Minimum Description Length: Theory and Applications. MIT Press, Cambridge, Mass, USA; 2005.
- Evans SC. Kolmogorov complexity estimation and application for information system security, Ph.D. dissertation. 2003.
- Evans SC, Barnett B, Bush SF, Saulnier GJ. Minimum description length principles for detection and classification of FTP exploits. Proceedings of IEEE Military Communications Conference (MILCOM '04), Monterey, Calif, USA, October-November 2004. pp. 473–479.
- Evans SC, Torres A, Miller J. Tech. Rep. GRC223. GE Research, Niskayuna, NY, USA; 2006. MicroRNA target motif detection using OSCR.
- Li M, Vitányi P. Introduction to Kolmogorov Complexity and Applications. Springer, New York, NY, USA; 1997.
- Szpankowski W, Ren W, Szpankowski L. An optimal DNA segmentation based on the MDL principle. International Journal of Bioinformatics Research and Applications. 2005;1(1):3–17. doi: 10.1504/IJBRA.2005.006899.
- Tabus I, Korodi G, Rissanen J. DNA sequence compression using the normalized maximum likelihood model for discrete regression. Proceedings of Data Compression Conference (DCC '03), Snowbird, Utah, USA, March 2003. pp. 253–262.
- Apostolico A, Lonardi S. Some theory and practice of greedy off-line textual substitution. Proceedings of Data Compression Conference (DCC '98), Snowbird, Utah, USA, March 1998. pp. 119–128.
- Nevill-Manning CG, Witten IH. Identifying hierarchical structure in sequences: a linear-time algorithm. Journal of Artificial Intelligence Research. 1997;7:67–82.
- Cherniavsky N, Lander R. Grammar-based compression of DNA sequences. DIMACS Working Group on The Burrows-Wheeler Transform, Piscataway, NJ, USA, August 2004.
- Chen X, Li M, Ma B, Tromp J. DNACompress: fast and effective DNA sequence compression. Bioinformatics. 2002;18(12):1696–1698. doi: 10.1093/bioinformatics/18.12.1696.
- Behzadi B, Le Fessant F. DNA compression challenge revisited: a dynamic programming approach. The 16th Annual Symposium on Combinatorial Pattern Matching (CPM '05), Jeju Island, Korea, 2005, Lecture Notes in Computer Science. pp. 190–200.
- Evans SC, Markham TS, Torres A, Kourtidis A, Conklin D. An improved minimum description length learning algorithm for nucleotide sequence analysis. Proceedings of IEEE 40th Asilomar Conference on Signals, Systems and Computers (ACSSC '06), Pacific Grove, Calif, USA, October-November 2006. pp. 1843–1850.
- Gács P, Tromp JT, Vitányi PMB. Algorithmic statistics. IEEE Transactions on Information Theory. 2001;47(6):2443–2463. doi: 10.1109/18.945257.
- Cover TM, Thomas JA. Elements of Information Theory. Wiley-Interscience, New York, NY, USA; 1991.
- Lai EC. MicroRNAs are complementary to 3′ UTR sequence motifs that mediate negative post-transcriptional regulation. Nature Genetics. 2002;30(4):363–364. doi: 10.1038/ng865.
- Lai EC, Tam B, Rubin GM. Pervasive regulation of Drosophila Notch target genes by GY-box-, Brd-box-, and K-box-class microRNAs. Genes & Development. 2005;19(9):1067–1080. doi: 10.1101/gad.1291905.
- Doench JG, Sharp PA. Specificity of microRNA target selection in translational repression. Genes & Development. 2004;18(5):504–511. doi: 10.1101/gad.1184404.
- Brennecke J, Stark A, Russell RB, Cohen SM. Principles of microRNA-target recognition. PLoS Biology. 2005;3(3):e85. doi: 10.1371/journal.pbio.0030085.
- Evans SC, Saulnier GJ, Bush SF. A new universal two part code for estimation of string Kolmogorov complexity and algorithmic minimum sufficient statistic. DIMACS Workshop on Complexity and Inference, Piscataway, NJ, USA, June 2003.
- Voorhoeve PM, le Sage C, Schrier M. et al. A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in testicular germ cell tumors. Cell. 2006;124(6):1169–1181. doi: 10.1016/j.cell.2006.02.037.
- Mackay A, Jones C, Dexter T. et al. cDNA microarray analysis of genes associated with ERBB2 (HER2/neu) overexpression in human mammary luminal epithelial cells. Oncogene. 2003;22(17):2680–2688. doi: 10.1038/sj.onc.1206349.
- Bertucci F, Borie N, Ginestier C. et al. Identification and validation of an ERBB2 gene expression signature in breast cancers. Oncogene. 2004;23(14):2564–2575. doi: 10.1038/sj.onc.1207361.
- Lim LP, Lau NC, Garrett-Engele P. et al. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. 2005;433(7027):769–773. doi: 10.1038/nature03315.
- Altschul SF, Madden TL, Schäffer AA. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389.
- Mignone F, Grillo G, Licciulli F. et al. UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Research. 2005;33(database):D141–D146. doi: 10.1093/nar/gki021.
- http://microrna.sanger.ac.uk/sequences/index.shtml.
- Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research. 2006;34(database):D140–D144. doi: 10.1093/nar/gkj112.
- Huang X, Hardison RC, Miller W. A space-efficient algorithm for local similarities. Computer Applications in the Biosciences. 1990;6(4):373–381. doi: 10.1093/bioinformatics/6.4.373.
- Paddison PJ, Silva JM, Conklin DS. et al. A resource for large-scale RNA-interference-based screens in mammals. Nature. 2004;428(6981):427–431. doi: 10.1038/nature02370.
- Clop A, Marcq F, Takeda H. et al. A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep. Nature Genetics. 2006;38(7):813–818. doi: 10.1038/ng1810.
- http://snp500cancer.nci.nih.gov/
