Statistical analysis of nucleotide sequences

E E Stückle; C Emmrich; U Grob; P J Nielsen

doi:10.1093/nar/18.22.6641

. 1990 Nov 25;18(22):6641–6647. doi: 10.1093/nar/18.22.6641

Statistical analysis of nucleotide sequences.

E E Stückle ¹, C Emmrich ¹, U Grob ¹, P J Nielsen ¹

PMCID: PMC332623 PMID: 2251125

Abstract

In order to scan nucleic acid databases for potentially relevant but as yet unknown signals, we have developed an improved statistical model for pattern analysis of nucleic acid sequences by modifying previous methods based on Markov chains. We demonstrate the importance of selecting the appropriate parameters in order for the method to function at all. The model allows the simultaneous analysis of several short sequences with unequal base frequencies and Markov order k not equal to 0 as is usually the case in databases. As a test of these modifications, we show that in E. coli sequences there is a bias against palindromic hexamers which correspond to known restriction enzyme recognition sites.

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

Almagor H. A Markov analysis of DNA sequences. J Theor Biol. 1983 Oct 21;104(4):633–645. doi: 10.1016/0022-5193(83)90251-5. [DOI] [PubMed] [Google Scholar]
Arnold J., Cuticchia A. J., Newsome D. A., Jennings W. W., 3rd, Ivarie R. Mono- through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res. 1988 Jul 25;16(14B):7145–7158. doi: 10.1093/nar/16.14.7145. [DOI] [PMC free article] [PubMed] [Google Scholar]
Barker W. C., George D. G., Hunt L. T. Protein sequence database. Methods Enzymol. 1990;183:31–49. doi: 10.1016/0076-6879(90)83005-t. [DOI] [PubMed] [Google Scholar]
Bilofsky H. S., Burks C. The GenBank genetic sequence data bank. Nucleic Acids Res. 1988 Mar 11;16(5):1861–1863. doi: 10.1093/nar/16.5.1861. [DOI] [PMC free article] [PubMed] [Google Scholar]
Blaisdell B. E. Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J Mol Evol. 1984;21(3):278–288. doi: 10.1007/BF02102360. [DOI] [PubMed] [Google Scholar]
Bodnar J. W., Ward D. C. Highly recurring sequence elements identified in eukaryotic DNAs by computer analysis are often homologous to regulatory sequences or protein binding sites. Nucleic Acids Res. 1987 Feb 25;15(4):1835–1851. doi: 10.1093/nar/15.4.1835. [DOI] [PMC free article] [PubMed] [Google Scholar]
Brendel V., Beckmann J. S., Trifonov E. N. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn. 1986 Aug;4(1):11–21. doi: 10.1080/07391102.1986.10507643. [DOI] [PubMed] [Google Scholar]
Burks C., Cinkosky M. J., Gilna P., Hayden J. E., Abe Y., Atencio E. J., Barnhouse S., Benton D., Buenafe C. A., Cumella K. E. GenBank: current status and future directions. Methods Enzymol. 1990;183:3–22. doi: 10.1016/0076-6879(90)83003-r. [DOI] [PubMed] [Google Scholar]
Cameron G. N. The EMBL data library. Nucleic Acids Res. 1988 Mar 11;16(5):1865–1867. doi: 10.1093/nar/16.5.1865. [DOI] [PMC free article] [PubMed] [Google Scholar]
Claverie J. M., Bougueleret L. Heuristic informational analysis of sequences. Nucleic Acids Res. 1986 Jan 10;14(1):179–196. doi: 10.1093/nar/14.1.179. [DOI] [PMC free article] [PubMed] [Google Scholar]
Claverie J. M., Sauvaget I., Bougueleret L. K-tuple frequency analysis: from intron/exon discrimination to T-cell epitope mapping. Methods Enzymol. 1990;183:237–252. doi: 10.1016/0076-6879(90)83017-4. [DOI] [PubMed] [Google Scholar]
Devereux J., Haeberli P., Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387–395. doi: 10.1093/nar/12.1part1.387. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dumas J. P., Ninio J. Efficient algorithms for folding and comparing nucleic acid sequences. Nucleic Acids Res. 1982 Jan 11;10(1):197–206. doi: 10.1093/nar/10.1.197. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kahn P., Cameron G. EMBL Data Library. Methods Enzymol. 1990;183:23–31. doi: 10.1016/0076-6879(90)83004-s. [DOI] [PubMed] [Google Scholar]
Nussinov R. Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res. 1984 Feb 10;12(3):1749–1763. doi: 10.1093/nar/12.3.1749. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nussinov R. Nucleotide quartets in the vicinity of eukaryotic transcriptional initiation sites: some DNA and chromatin structural implications. DNA. 1987 Feb;6(1):13–22. doi: 10.1089/dna.1987.6.13. [DOI] [PubMed] [Google Scholar]
Nussinov R. Theoretical molecular biology: prospectives and perspectives. J Theor Biol. 1987 Mar 21;125(2):219–235. doi: 10.1016/s0022-5193(87)80043-7. [DOI] [PubMed] [Google Scholar]
Peterson R. C. Prediction of the frequencies of restriction endonuclease recognition sequences using di- and mononucleotide frequencies. Biotechniques. 1988 Jan;6(1):34–40. [PubMed] [Google Scholar]
Pevzner P. A., Borodovsky MYu, Mironov A. A. Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989 Apr;6(5):1013–1026. doi: 10.1080/07391102.1989.10506528. [DOI] [PubMed] [Google Scholar]
Phillips G. J., Arnold J., Ivarie R. Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis. Nucleic Acids Res. 1987 Mar 25;15(6):2611–2626. doi: 10.1093/nar/15.6.2611. [DOI] [PMC free article] [PubMed] [Google Scholar]
Phillips G. J., Arnold J., Ivarie R. The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis. Nucleic Acids Res. 1987 Mar 25;15(6):2627–2638. doi: 10.1093/nar/15.6.2627. [DOI] [PMC free article] [PubMed] [Google Scholar]
Roberts R. J. Restriction and modification enzymes and their recognition sequences. Nucleic Acids Res. 1985;13 (Suppl):r165–r200. doi: 10.1093/nar/13.suppl.r165. [DOI] [PMC free article] [PubMed] [Google Scholar]
Santibánez-Koref M., Reich J. G. Dinucleotide frequencies in different reading frame positions of coding mammalian DNA sequences. Biomed Biochim Acta. 1986;45(6):737–748. [PubMed] [Google Scholar]
Sidman K. E., George D. G., Barker W. C., Hunt L. T. The protein identification resource (PIR). Nucleic Acids Res. 1988 Mar 11;16(5):1869–1871. doi: 10.1093/nar/16.5.1869. [DOI] [PMC free article] [PubMed] [Google Scholar]
Smith T. F., Waterman M. S., Sadler J. R. Statistical characterization of nucleic acid sequence functional domains. Nucleic Acids Res. 1983 Apr 11;11(7):2205–2220. doi: 10.1093/nar/11.7.2205. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stormo G. D., Schneider T. D., Gold L. M. Characterization of translational initiation sites in E. coli. Nucleic Acids Res. 1982 May 11;10(9):2971–2996. doi: 10.1093/nar/10.9.2971. [DOI] [PMC free article] [PubMed] [Google Scholar]
Volinia S., Bernardi F., Gambari R., Barrai I. Co-localization of rare oligonucleotides and regulatory elements in mammalian upstream gene regions. J Mol Biol. 1988 Sep 20;203(2):385–390. doi: 10.1016/0022-2836(88)90006-x. [DOI] [PubMed] [Google Scholar]

[OCR_01193] Almagor H. A Markov analysis of DNA sequences. J Theor Biol. 1983 Oct 21;104(4):633–645. doi: 10.1016/0022-5193(83)90251-5. [DOI] [PubMed] [Google Scholar]

[OCR_01204] Arnold J., Cuticchia A. J., Newsome D. A., Jennings W. W., 3rd, Ivarie R. Mono- through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res. 1988 Jul 25;16(14B):7145–7158. doi: 10.1093/nar/16.14.7145. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01158] Barker W. C., George D. G., Hunt L. T. Protein sequence database. Methods Enzymol. 1990;183:31–49. doi: 10.1016/0076-6879(90)83005-t. [DOI] [PubMed] [Google Scholar]

[OCR_01148] Bilofsky H. S., Burks C. The GenBank genetic sequence data bank. Nucleic Acids Res. 1988 Mar 11;16(5):1861–1863. doi: 10.1093/nar/16.5.1861. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01194] Blaisdell B. E. Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J Mol Evol. 1984;21(3):278–288. doi: 10.1007/BF02102360. [DOI] [PubMed] [Google Scholar]

[OCR_01181] Bodnar J. W., Ward D. C. Highly recurring sequence elements identified in eukaryotic DNAs by computer analysis are often homologous to regulatory sequences or protein binding sites. Nucleic Acids Res. 1987 Feb 25;15(4):1835–1851. doi: 10.1093/nar/15.4.1835. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01208] Brendel V., Beckmann J. S., Trifonov E. N. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn. 1986 Aug;4(1):11–21. doi: 10.1080/07391102.1986.10507643. [DOI] [PubMed] [Google Scholar]

[OCR_01155] Burks C., Cinkosky M. J., Gilna P., Hayden J. E., Abe Y., Atencio E. J., Barnhouse S., Benton D., Buenafe C. A., Cumella K. E. GenBank: current status and future directions. Methods Enzymol. 1990;183:3–22. doi: 10.1016/0076-6879(90)83003-r. [DOI] [PubMed] [Google Scholar]

[OCR_01149] Cameron G. N. The EMBL data library. Nucleic Acids Res. 1988 Mar 11;16(5):1865–1867. doi: 10.1093/nar/16.5.1865. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01184] Claverie J. M., Bougueleret L. Heuristic informational analysis of sequences. Nucleic Acids Res. 1986 Jan 10;14(1):179–196. doi: 10.1093/nar/14.1.179. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01189] Claverie J. M., Sauvaget I., Bougueleret L. K-tuple frequency analysis: from intron/exon discrimination to T-cell epitope mapping. Methods Enzymol. 1990;183:237–252. doi: 10.1016/0076-6879(90)83017-4. [DOI] [PubMed] [Google Scholar]

[OCR_01218] Devereux J., Haeberli P., Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387–395. doi: 10.1093/nar/12.1part1.387. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01162] Dumas J. P., Ninio J. Efficient algorithms for folding and comparing nucleic acid sequences. Nucleic Acids Res. 1982 Jan 11;10(1):197–206. doi: 10.1093/nar/10.1.197. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01157] Kahn P., Cameron G. EMBL Data Library. Methods Enzymol. 1990;183:23–31. doi: 10.1016/0076-6879(90)83004-s. [DOI] [PubMed] [Google Scholar]

[OCR_01168] Nussinov R. Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res. 1984 Feb 10;12(3):1749–1763. doi: 10.1093/nar/12.3.1749. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01175] Nussinov R. Nucleotide quartets in the vicinity of eukaryotic transcriptional initiation sites: some DNA and chromatin structural implications. DNA. 1987 Feb;6(1):13–22. doi: 10.1089/dna.1987.6.13. [DOI] [PubMed] [Google Scholar]

[OCR_01174] Nussinov R. Theoretical molecular biology: prospectives and perspectives. J Theor Biol. 1987 Mar 21;125(2):219–235. doi: 10.1016/s0022-5193(87)80043-7. [DOI] [PubMed] [Google Scholar]

[OCR_01182] Peterson R. C. Prediction of the frequencies of restriction endonuclease recognition sequences using di- and mononucleotide frequencies. Biotechniques. 1988 Jan;6(1):34–40. [PubMed] [Google Scholar]

[OCR_01214] Pevzner P. A., Borodovsky MYu, Mironov A. A. Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989 Apr;6(5):1013–1026. doi: 10.1080/07391102.1989.10506528. [DOI] [PubMed] [Google Scholar]

[OCR_01196] Phillips G. J., Arnold J., Ivarie R. Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis. Nucleic Acids Res. 1987 Mar 25;15(6):2611–2626. doi: 10.1093/nar/15.6.2611. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01200] Phillips G. J., Arnold J., Ivarie R. The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis. Nucleic Acids Res. 1987 Mar 25;15(6):2627–2638. doi: 10.1093/nar/15.6.2627. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01222] Roberts R. J. Restriction and modification enzymes and their recognition sequences. Nucleic Acids Res. 1985;13 (Suppl):r165–r200. doi: 10.1093/nar/13.suppl.r165. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01170] Santibánez-Koref M., Reich J. G. Dinucleotide frequencies in different reading frame positions of coding mammalian DNA sequences. Biomed Biochim Acta. 1986;45(6):737–748. [PubMed] [Google Scholar]

[OCR_01151] Sidman K. E., George D. G., Barker W. C., Hunt L. T. The protein identification resource (PIR). Nucleic Acids Res. 1988 Mar 11;16(5):1869–1871. doi: 10.1093/nar/16.5.1869. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01177] Smith T. F., Waterman M. S., Sadler J. R. Statistical characterization of nucleic acid sequence functional domains. Nucleic Acids Res. 1983 Apr 11;11(7):2205–2220. doi: 10.1093/nar/11.7.2205. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01164] Stormo G. D., Schneider T. D., Gold L. M. Characterization of translational initiation sites in E. coli. Nucleic Acids Res. 1982 May 11;10(9):2971–2996. doi: 10.1093/nar/10.9.2971. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_01185] Volinia S., Bernardi F., Gambari R., Barrai I. Co-localization of rare oligonucleotides and regulatory elements in mammalian upstream gene regions. J Mol Biol. 1988 Sep 20;203(2):385–390. doi: 10.1016/0022-2836(88)90006-x. [DOI] [PubMed] [Google Scholar]

PERMALINK

Statistical analysis of nucleotide sequences.

E E Stückle

C Emmrich

U Grob

P J Nielsen

Abstract

Full text

Selected References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Statistical analysis of nucleotide sequences.

E E Stückle

C Emmrich

U Grob

P J Nielsen

Abstract

Full text

Selected References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases