EURASIP Journal on Bioinformatics and Systems Biology. 2007 Dec 24;2007(1):13853. doi: 10.1155/2007/13853

Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information

Arvind Rao 1, Alfred O Hero III 1, David J States 2, James Douglas Engel 3
PMCID: PMC3171326  PMID: 18340376

Abstract

Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to the discovery of novel motifs (including transcription factor sites) with previously uncharacterized function in gene expression. Given the complexity of tissue-specific gene expression, several motifs may be putatively responsible for expression in a given cell type. Identifying them has important implications for understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs (not necessarily transcription factor sites) and examine its application to some questions in current bioinformatics research. These motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. There are two main contributions of this work. First, we propose the use of directed information for such classification-constrained motif discovery, and then use the selected features with a support vector machine (SVM) classifier to find the tissue specificity of any sequence of interest. Such analysis yields several novel and interesting motifs that merit further experimental characterization. Second, this approach leads to a principled framework for prospectively examining whether any chosen motif is discriminatory for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize that the discovery of these motifs would enable large-scale investigation of the tissue-specific regulatory role of any conserved sequence element identified from genome-wide studies.
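The feature-selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it ranks k-mers by empirical mutual information between k-mer presence and the class label, which is a simpler stand-in for the directed information criterion, and it omits the SVM step. The function names, the choice of k = 6, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def kmer_presence(seq, k):
    """Return the set of k-mers that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def rank_kmers(seqs, labels, k=6):
    """Score each k-mer by MI between its presence and the label, best first."""
    present = [kmer_presence(s, k) for s in seqs]
    vocab = set().union(*present)
    scores = {
        m: mutual_information([m in p for p in present], labels)
        for m in vocab
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Hypothetical toy data: label 1 = tissue-specific, 0 = not.
    seqs = ["ACGTGATAAGCT", "TTGATAAGGCCA", "ACGTCCCCGGTT", "TTTTGGCCAACC"]
    labels = [1, 1, 0, 0]
    for kmer, score in rank_kmers(seqs, labels)[:3]:
        print(kmer, round(score, 3))
```

The top-ranked k-mers would then serve as the selected features passed to a classifier such as an SVM.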

[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42]

Contributor Information

Arvind Rao, Email: ukarvind@umich.edu.

Alfred O Hero, III, Email: hero@umich.edu.

David J States, Email: dstates@umich.edu.

James Douglas Engel, Email: engel@umich.edu.

References

  1. MacIsaac KD, Fraenkel E. Practical strategies for discovering regulatory DNA sequence motifs. PLoS Computational Biology. 2006;2(4):e36. doi: 10.1371/journal.pcbi.0020036.
  2. Kreiman G. Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Research. 2004;32(9):2889–2900. doi: 10.1093/nar/gkh614.
  3. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology. 1997;268(1):78–94. doi: 10.1006/jmbi.1997.0951.
  4. Li Q, Barkess G, Qian H. Chromatin looping and the probability of transcription. Trends in Genetics. 2006;22(4):197–202. doi: 10.1016/j.tig.2006.02.004.
  5. Kleinjan DA, van Heyningen V. Long-range control of gene expression: emerging mechanisms and disruption in disease. The American Journal of Human Genetics. 2005;76(1):8–32. doi: 10.1086/426833.
  6. Pennacchio LA, Loots GG, Nobrega MA, Ovcharenko I. Predicting tissue-specific enhancers in the human genome. Genome Research. 2007;17(2):201–211. doi: 10.1101/gr.5972507.
  7. King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC. Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Research. 2005;15(8):1051–1060. doi: 10.1101/gr.3642605.
  8. Pennacchio LA, Ahituv N, Moses AM, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006;444(7118):499–502. doi: 10.1038/nature05295.
  9. Kadota K, Ye J, Nakai Y, Terada T, Shimizu K. ROKU: a novel method for identification of tissue-specific genes. BMC Bioinformatics. 2006;7:294. doi: 10.1186/1471-2105-7-294.
  10. Schug J, Schuller W-P, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ Jr. Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biology. 2005;6(4):R33. doi: 10.1186/gb-2005-6-4-r33.
  11. Werner T. Regulatory networks: linking microarray data to systems biology. Mechanisms of Ageing and Development. 2007;128(1):168–172. doi: 10.1016/j.mad.2006.11.022.
  12. Aerts S, Van Loo P, Thijs G, et al. TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Research. 2005;33(Web Server):W393–W396. doi: 10.1093/nar/gki354.
  13. Chan BY, Kibler D. Using hexamers to predict cis-regulatory motifs in Drosophila. BMC Bioinformatics. 2005;6:262. doi: 10.1186/1471-2105-6-262.
  14. Hutchinson GB. The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Computer Applications in the Biosciences. 1996;12(5):391–398. doi: 10.1093/bioinformatics/12.5.391.
  15. Sumazin P, Chen G, Hata N, Smith AD, Zhang T, Zhang MQ. DWE: discriminating word enumerator. Bioinformatics. 2005;21(1):31–38. doi: 10.1093/bioinformatics/bth471.
  16. Lakshmanan G, Lieuw KH, Lim K-C, et al. Localization of distant urogenital system-, central nervous system-, and endocardium-specific transcriptional regulatory elements in the GATA-3 locus. Molecular and Cellular Biology. 1999;19(2):1558–1568. doi: 10.1128/mcb.19.2.1558.
  17. Khandekar M, Suzuki N, Lewton J, Yamamoto M, Engel JD. Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system. Molecular and Cellular Biology. 2004;24(23):10263–10276. doi: 10.1128/MCB.24.23.10263-10276.2004.
  18. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;27(8):1226–1238. doi: 10.1109/TPAMI.2005.159.
  19. Proceedings of NIPS 2006 Workshop on Causality and Feature Selection. http://research.ihost.com/cws2006/
  20. Guyon I, Elisseeff A. An introduction to variable and feature selection. The Journal of Machine Learning Research. 2003;3:1157–1182.
  21. Marko H. The bidirectional communication theory - a generalization of information theory. IEEE Transactions on Communications. 1973;COM-21(12):1345–1351.
  22. Massey J. Causality, feedback and directed information. Proceedings of the International Symposium on Information Theory and Its Applications (ISITA '90), Waikiki, Hawaii, USA, November 1990. pp. 303–305.
  23. Venkataramanan R, Pradhan SS. Source coding with feed-forward: rate-distortion theorems and error exponents for a general source. IEEE Transactions on Information Theory. 2007;53(6):2154–2179.
  24. Cover TM, Thomas JA. Elements of Information Theory. John Wiley & Sons, New York, NY, USA; 1991.
  25. Miller EG. A new class of entropy estimators for multidimensional densities. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, April 2003. pp. 297–300.
  26. Willett RM, Nowak RD. Complexity-regularized multiresolution density estimation. Proceedings of the International Symposium on Information Theory (ISIT '04), Chicago, Ill, USA, June-July 2004. pp. 303–305.
  27. Nemenman I, Shafee F, Bialek W. Entropy and inference, revisited. In: Dietterich TG, Becker S, Ghahramani Z, editors. Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, Mass, USA; 2002.
  28. Paninski L. Estimation of entropy and mutual information. Neural Computation. 2003;15(6):1191–1253. doi: 10.1162/089976603321780272.
  29. Joe H. Relative entropy measures of multivariate dependence. Journal of the American Statistical Association. 1989;84(405):157–164. doi: 10.2307/2289859.
  30. Efron B, Tibshirani RJ. An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, Fla, USA; 1994.
  31. Ramsay JO, Silverman BW. Functional Data Analysis, Springer Series in Statistics. Springer, New York, NY, USA; 1997.
  32. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B. 1995;57(1):289–300.
  33. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer, New York, NY, USA; 2001.
  34. Kendall MG. A new measure of rank correlation. Biometrika. 1938;30(1/2):81–93. doi: 10.2307/2332226.
  35. NCBI PubMed URL. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
  36. Murphy AM, Thompson WR, Peng LF, Jones L II. Regulation of the rat cardiac troponin I gene by the transcription factor GATA-4. Biochemical Journal. 1997;322, part 2:393–401. doi: 10.1042/bj3220393.
  37. Azakie A, Fineman JR, He Y. Myocardial transcription factors are modulated during pathologic cardiac hypertrophy in vivo. The Journal of Thoracic and Cardiovascular Surgery. 2006;132(6):1262–1271.e4. doi: 10.1016/j.jtcvs.2006.08.005.
  38. Vanhoutte P, Nissen JL, Brugg B, et al. Opposing roles of Elk-1 and its brain-specific isoform, short Elk-1, in nerve growth factor-induced PC12 differentiation. Journal of Biological Chemistry. 2001;276(7):5189–5196. doi: 10.1074/jbc.M006678200.
  39. Olson EN. Regulation of muscle transcription by the MyoD family: the heart of the matter. Circulation Research. 1993;72(1):1–6. doi: 10.1161/01.res.72.1.1.
  40. Dressler GR, Douglass EC. Pax-2 is a DNA-binding protein expressed in embryonic kidney and Wilms tumor. Proceedings of the National Academy of Sciences of the United States of America. 1992;89(4):1179–1183. doi: 10.1073/pnas.89.4.1179.
  41. Grote D, Souabni A, Busslinger M, Bouchard M. Pax2/8-regulated Gata3 expression is necessary for morphogenesis and guidance of the nephric duct in the developing kidney. Development. 2006;133(1):53–61. doi: 10.1242/dev.02184.
  42. Rao A, Hero AO, States DJ, Engel JD. Inference of biologically relevant gene influence networks using the directed information criterion. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, May 2006. pp. 1028–1031.

Articles from EURASIP Journal on Bioinformatics and Systems Biology are provided here courtesy of Springer
