WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences

G Pesole; N Prunella; S Liuni; M Attimonelli; C Saccone

doi:10.1093/nar/20.11.2871

. 1992 Jun 11;20(11):2871–2875. doi: 10.1093/nar/20.11.2871

WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences.

G Pesole ¹, N Prunella ¹, S Liuni ¹, M Attimonelli ¹, C Saccone ¹

PMCID: PMC336935 PMID: 1614873

Abstract

We present here a fast and sensitive method designed to isolate short nucleotide sequences which have non-random statistical properties and may thus be biologically active. It is based on a first order Markov analysis and allows us to detect statistically significant sequence motifs from six to ten nucleotides long which are significantly shared (or avoided) in the sequences under investigation. This method has been tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (2). Our results demonstrate the accuracy and the efficiency of the method in that the sequence motifs which are known to act as eukaryotic promoters, such as the TATA-box and the CAAT-box, were clearly identified. In addition we have found other statistically significant motifs, the biological roles of which are yet to be clarified.

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

Beckmann J. S., Brendel V., Trifonov E. N. Intervening sequences exhibit distinct vocabulary. J Biomol Struct Dyn. 1986 Dec;4(3):391–400. doi: 10.1080/07391102.1986.10506357. [DOI] [PubMed] [Google Scholar]
Beutler E., Gelbart T., Han J. H., Koziol J. A., Beutler B. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc Natl Acad Sci U S A. 1989 Jan;86(1):192–196. doi: 10.1073/pnas.86.1.192. [DOI] [PMC free article] [PubMed] [Google Scholar]
Blaisdell B. E. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986 Jul;83(14):5155–5159. doi: 10.1073/pnas.83.14.5155. [DOI] [PMC free article] [PubMed] [Google Scholar]
Borodovsky MYu, Gusein-Zade S. M. A general rule for ranged series of codon frequencies in different genomes. J Biomol Struct Dyn. 1989 Apr;6(5):1001–1012. doi: 10.1080/07391102.1989.10506527. [DOI] [PubMed] [Google Scholar]
Breathnach R., Chambon P. Organization and expression of eucaryotic split genes coding for proteins. Annu Rev Biochem. 1981;50:349–383. doi: 10.1146/annurev.bi.50.070181.002025. [DOI] [PubMed] [Google Scholar]
Brendel V., Beckmann J. S., Trifonov E. N. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn. 1986 Aug;4(1):11–21. doi: 10.1080/07391102.1986.10507643. [DOI] [PubMed] [Google Scholar]
Bucher P., Bryan B. Signal search analysis: a new method to localize and characterize functionally important DNA sequences. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):287–305. doi: 10.1093/nar/12.1part1.287. [DOI] [PMC free article] [PubMed] [Google Scholar]
Casey J. L., Di Jeso B., Rao K. K., Rouault T. A., Klausner R. D., Harford J. B. Deletional analysis of the promoter region of the human transferrin receptor gene. Nucleic Acids Res. 1988 Jan 25;16(2):629–646. doi: 10.1093/nar/16.2.629. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen R. P., Ingraham H. A., Treacy M. N., Albert V. R., Wilson L., Rosenfeld M. G. Autoregulation of pit-1 gene expression mediated by two cis-active promoter elements. Nature. 1990 Aug 9;346(6284):583–586. doi: 10.1038/346583a0. [DOI] [PubMed] [Google Scholar]
Davidson I., Fromental C., Augereau P., Wildeman A., Zenke M., Chambon P. Cell-type specific protein binding to the enhancer of simian virus 40 in nuclear extracts. Nature. 1986 Oct 9;323(6088):544–548. doi: 10.1038/323544a0. [DOI] [PubMed] [Google Scholar]
Devereux J., Haeberli P., Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387–395. doi: 10.1093/nar/12.1part1.387. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dudley J. P. Discrete high molecular weight RNA transcribed from the long interspersed repetitive element L1Md. Nucleic Acids Res. 1987 Mar 25;15(6):2581–2592. doi: 10.1093/nar/15.6.2581. [DOI] [PMC free article] [PubMed] [Google Scholar]
Efstratiadis A., Posakony J. W., Maniatis T., Lawn R. M., O'Connell C., Spritz R. A., DeRiel J. K., Forget B. G., Weissman S. M., Slightom J. L. The structure and evolution of the human beta-globin gene family. Cell. 1980 Oct;21(3):653–668. doi: 10.1016/0092-8674(80)90429-8. [DOI] [PubMed] [Google Scholar]
Everett R. D., Baty D., Chambon P. The repeated GC-rich motifs upstream from the TATA box are important elements of the SV40 early promoter. Nucleic Acids Res. 1983 Apr 25;11(8):2447–2464. doi: 10.1093/nar/11.8.2447. [DOI] [PMC free article] [PubMed] [Google Scholar]
Galas D. J., Eggert M., Waterman M. S. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J Mol Biol. 1985 Nov 5;186(1):117–128. doi: 10.1016/0022-2836(85)90262-1. [DOI] [PubMed] [Google Scholar]
Gartmann C. J., Grob U. SQUIRREL: Sequence QUery, Information Retrieval and REporting Library. A program package for analyzing signals in nucleic acid sequences for the VAX. Nucleic Acids Res. 1991 Nov 11;19(21):6033–6040. doi: 10.1093/nar/19.21.6033. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ghosh D. A relational database of transcription factors. Nucleic Acids Res. 1990 Apr 11;18(7):1749–1756. doi: 10.1093/nar/18.7.1749. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gouy M., Gautier C., Attimonelli M., Lanave C., di Paola G. ACNUC--a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput Appl Biosci. 1985 Sep;1(3):167–172. doi: 10.1093/bioinformatics/1.3.167. [DOI] [PubMed] [Google Scholar]
Hertz G. Z., Hartzell G. W., 3rd, Stormo G. D. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci. 1990 Apr;6(2):81–92. doi: 10.1093/bioinformatics/6.2.81. [DOI] [PubMed] [Google Scholar]
Korn L. J., Queen C. L., Wegman M. N. Computer analysis of nucleic acid regulatory sequences. Proc Natl Acad Sci U S A. 1977 Oct;74(10):4401–4405. doi: 10.1073/pnas.74.10.4401. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kozhukhin C. G., Pevzner P. A. Genome inhomogeneity is determined mainly by WW and SS dinucleotides. Comput Appl Biosci. 1991 Jan;7(1):39–49. doi: 10.1093/bioinformatics/7.1.39. [DOI] [PubMed] [Google Scholar]
Lipman D. J., Wilbur W. J., Smith T. F., Waterman M. S. On the statistical significance of nucleic acid similarities. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):215–226. doi: 10.1093/nar/12.1part1.215. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mengeritsky G., Smith T. F. Recognition of characteristic patterns in sets of functionally equivalent DNA sequences. Comput Appl Biosci. 1987 Sep;3(3):223–227. doi: 10.1093/bioinformatics/3.3.223. [DOI] [PubMed] [Google Scholar]
Nussinov R. Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res. 1984 Feb 10;12(3):1749–1763. doi: 10.1093/nar/12.3.1749. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nussinov R. Eukaryotic dinucleotide preference rules and their implications for degenerate codon usage. J Mol Biol. 1981 Jun 15;149(1):125–131. doi: 10.1016/0022-2836(81)90264-3. [DOI] [PubMed] [Google Scholar]
Pevzner P. A., Borodovsky MYu, Mironov A. A. Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989 Apr;6(5):1013–1026. doi: 10.1080/07391102.1989.10506528. [DOI] [PubMed] [Google Scholar]
Smith T. F., Waterman M. S., Sadler J. R. Statistical characterization of nucleic acid sequence functional domains. Nucleic Acids Res. 1983 Apr 11;11(7):2205–2220. doi: 10.1093/nar/11.7.2205. [DOI] [PMC free article] [PubMed] [Google Scholar]
Staden R. Searching for patterns in protein and nucleic acid sequences. Methods Enzymol. 1990;183:193–211. doi: 10.1016/0076-6879(90)83014-z. [DOI] [PubMed] [Google Scholar]
Stormo G. D. Consensus patterns in DNA. Methods Enzymol. 1990;183:211–221. doi: 10.1016/0076-6879(90)83015-2. [DOI] [PubMed] [Google Scholar]
Stückle E. E., Emmrich C., Grob U., Nielsen P. J. Statistical analysis of nucleotide sequences. Nucleic Acids Res. 1990 Nov 25;18(22):6641–6647. doi: 10.1093/nar/18.22.6641. [DOI] [PMC free article] [PubMed] [Google Scholar]
Waterman M. S., Arratia R., Galas D. J. Pattern recognition in several sequences: consensus and alignment. Bull Math Biol. 1984;46(4):515–527. doi: 10.1007/BF02459500. [DOI] [PubMed] [Google Scholar]

[OCR_00569] Beckmann J. S., Brendel V., Trifonov E. N. Intervening sequences exhibit distinct vocabulary. J Biomol Struct Dyn. 1986 Dec;4(3):391–400. doi: 10.1080/07391102.1986.10506357. [DOI] [PubMed] [Google Scholar]

[OCR_00579] Beutler E., Gelbart T., Han J. H., Koziol J. A., Beutler B. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc Natl Acad Sci U S A. 1989 Jan;86(1):192–196. doi: 10.1073/pnas.86.1.192. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00615] Blaisdell B. E. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986 Jul;83(14):5155–5159. doi: 10.1073/pnas.83.14.5155. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00573] Borodovsky MYu, Gusein-Zade S. M. A general rule for ranged series of codon frequencies in different genomes. J Biomol Struct Dyn. 1989 Apr;6(5):1001–1012. doi: 10.1080/07391102.1989.10506527. [DOI] [PubMed] [Google Scholar]

[OCR_00634] Breathnach R., Chambon P. Organization and expression of eucaryotic split genes coding for proteins. Annu Rev Biochem. 1981;50:349–383. doi: 10.1146/annurev.bi.50.070181.002025. [DOI] [PubMed] [Google Scholar]

[OCR_00561] Brendel V., Beckmann J. S., Trifonov E. N. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn. 1986 Aug;4(1):11–21. doi: 10.1080/07391102.1986.10507643. [DOI] [PubMed] [Google Scholar]

[OCR_00604] Bucher P., Bryan B. Signal search analysis: a new method to localize and characterize functionally important DNA sequences. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):287–305. doi: 10.1093/nar/12.1part1.287. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00669] Casey J. L., Di Jeso B., Rao K. K., Rouault T. A., Klausner R. D., Harford J. B. Deletional analysis of the promoter region of the human transferrin receptor gene. Nucleic Acids Res. 1988 Jan 25;16(2):629–646. doi: 10.1093/nar/16.2.629. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00657] Chen R. P., Ingraham H. A., Treacy M. N., Albert V. R., Wilson L., Rosenfeld M. G. Autoregulation of pit-1 gene expression mediated by two cis-active promoter elements. Nature. 1990 Aug 9;346(6284):583–586. doi: 10.1038/346583a0. [DOI] [PubMed] [Google Scholar]

[OCR_00665] Davidson I., Fromental C., Augereau P., Wildeman A., Zenke M., Chambon P. Cell-type specific protein binding to the enhancer of simian virus 40 in nuclear extracts. Nature. 1986 Oct 9;323(6088):544–548. doi: 10.1038/323544a0. [DOI] [PubMed] [Google Scholar]

[OCR_00649] Devereux J., Haeberli P., Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387–395. doi: 10.1093/nar/12.1part1.387. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00661] Dudley J. P. Discrete high molecular weight RNA transcribed from the long interspersed repetitive element L1Md. Nucleic Acids Res. 1987 Mar 25;15(6):2581–2592. doi: 10.1093/nar/15.6.2581. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00639] Efstratiadis A., Posakony J. W., Maniatis T., Lawn R. M., O'Connell C., Spritz R. A., DeRiel J. K., Forget B. G., Weissman S. M., Slightom J. L. The structure and evolution of the human beta-globin gene family. Cell. 1980 Oct;21(3):653–668. doi: 10.1016/0092-8674(80)90429-8. [DOI] [PubMed] [Google Scholar]

[OCR_00635] Everett R. D., Baty D., Chambon P. The repeated GC-rich motifs upstream from the TATA box are important elements of the SV40 early promoter. Nucleic Acids Res. 1983 Apr 25;11(8):2447–2464. doi: 10.1093/nar/11.8.2447. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00607] Galas D. J., Eggert M., Waterman M. S. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J Mol Biol. 1985 Nov 5;186(1):117–128. doi: 10.1016/0022-2836(85)90262-1. [DOI] [PubMed] [Google Scholar]

[OCR_00606] Gartmann C. J., Grob U. SQUIRREL: Sequence QUery, Information Retrieval and REporting Library. A program package for analyzing signals in nucleic acid sequences for the VAX. Nucleic Acids Res. 1991 Nov 11;19(21):6033–6040. doi: 10.1093/nar/19.21.6033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00645] Ghosh D. A relational database of transcription factors. Nucleic Acids Res. 1990 Apr 11;18(7):1749–1756. doi: 10.1093/nar/18.7.1749. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00630] Gouy M., Gautier C., Attimonelli M., Lanave C., di Paola G. ACNUC--a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput Appl Biosci. 1985 Sep;1(3):167–172. doi: 10.1093/bioinformatics/1.3.167. [DOI] [PubMed] [Google Scholar]

[OCR_00673] Hertz G. Z., Hartzell G. W., 3rd, Stormo G. D. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci. 1990 Apr;6(2):81–92. doi: 10.1093/bioinformatics/6.2.81. [DOI] [PubMed] [Google Scholar]

[OCR_00585] Korn L. J., Queen C. L., Wegman M. N. Computer analysis of nucleic acid regulatory sequences. Proc Natl Acad Sci U S A. 1977 Oct;74(10):4401–4405. doi: 10.1073/pnas.74.10.4401. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00583] Kozhukhin C. G., Pevzner P. A. Genome inhomogeneity is determined mainly by WW and SS dinucleotides. Comput Appl Biosci. 1991 Jan;7(1):39–49. doi: 10.1093/bioinformatics/7.1.39. [DOI] [PubMed] [Google Scholar]

[OCR_00589] Lipman D. J., Wilbur W. J., Smith T. F., Waterman M. S. On the statistical significance of nucleic acid similarities. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):215–226. doi: 10.1093/nar/12.1part1.215. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00592] Mengeritsky G., Smith T. F. Recognition of characteristic patterns in sets of functionally equivalent DNA sequences. Comput Appl Biosci. 1987 Sep;3(3):223–227. doi: 10.1093/bioinformatics/3.3.223. [DOI] [PubMed] [Google Scholar]

[OCR_00647] Nussinov R. Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res. 1984 Feb 10;12(3):1749–1763. doi: 10.1093/nar/12.3.1749. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00577] Nussinov R. Eukaryotic dinucleotide preference rules and their implications for degenerate codon usage. J Mol Biol. 1981 Jun 15;149(1):125–131. doi: 10.1016/0022-2836(81)90264-3. [DOI] [PubMed] [Google Scholar]

[OCR_00565] Pevzner P. A., Borodovsky MYu, Mironov A. A. Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989 Apr;6(5):1013–1026. doi: 10.1080/07391102.1989.10506528. [DOI] [PubMed] [Google Scholar]

[OCR_00617] Smith T. F., Waterman M. S., Sadler J. R. Statistical characterization of nucleic acid sequence functional domains. Nucleic Acids Res. 1983 Apr 11;11(7):2205–2220. doi: 10.1093/nar/11.7.2205. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00602] Staden R. Searching for patterns in protein and nucleic acid sequences. Methods Enzymol. 1990;183:193–211. doi: 10.1016/0076-6879(90)83014-z. [DOI] [PubMed] [Google Scholar]

[OCR_00590] Stormo G. D. Consensus patterns in DNA. Methods Enzymol. 1990;183:211–221. doi: 10.1016/0076-6879(90)83015-2. [DOI] [PubMed] [Google Scholar]

[OCR_00611] Stückle E. E., Emmrich C., Grob U., Nielsen P. J. Statistical analysis of nucleotide sequences. Nucleic Acids Res. 1990 Nov 25;18(22):6641–6647. doi: 10.1093/nar/18.22.6641. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00653] Waterman M. S., Arratia R., Galas D. J. Pattern recognition in several sequences: consensus and alignment. Bull Math Biol. 1984;46(4):515–527. doi: 10.1007/BF02459500. [DOI] [PubMed] [Google Scholar]

PERMALINK

WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences.

G Pesole

N Prunella

S Liuni

M Attimonelli

C Saccone

Abstract

Full text

Selected References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences.

G Pesole

N Prunella

S Liuni

M Attimonelli

C Saccone

Abstract

Full text

Selected References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases