A hidden Markov model that finds genes in E. coli DNA

A Krogh; I S Mian; D Haussler

doi:10.1093/nar/22.22.4768

1994 Nov 11;22(22):4768–4778. doi: 10.1093/nar/22.22.4768

PMCID: PMC308529 PMID: 7984429

Abstract

A hidden Markov model (HMM) has been developed to find protein-coding genes in E. coli DNA using E. coli genome DNA sequence from the EcoSeq6 database maintained by Kenn Rudd. This HMM includes states that model the codons and their frequencies in E. coli genes, as well as the patterns found in the intergenic region, including repetitive extragenic palindromic sequences and the Shine-Dalgarno motif. To account for potential sequencing errors and/or frameshifts in raw genomic DNA sequence, it allows for the (very unlikely) possibility of insertions and deletions of individual nucleotides within a codon. The parameters of the HMM are estimated using approximately one million nucleotides of annotated DNA in EcoSeq6, and the model is tested on a disjoint set of contigs containing about 325,000 nucleotides. The HMM finds the exact locations of about 80% of the known E. coli genes, and approximate locations for about 10%. It also finds several potentially new genes, and locates several places where insertion or deletion errors and/or frameshifts may be present in the contigs.
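
The general idea of decoding a sequence with codon-aware states can be illustrated with a toy Viterbi decoder. This is only a minimal sketch under stated assumptions and not the model described in the abstract: the four states (`N` for intergenic, `C1` to `C3` for the three codon positions), all transition and emission probabilities, and the function name `viterbi` are hypothetical values chosen for illustration. It has no codon-frequency states, no states for repetitive extragenic palindromic sequences or the Shine-Dalgarno motif, and no insertion or deletion states.

```python
import math

STATES = ["N", "C1", "C2", "C3"]

# Hypothetical, hand-picked probabilities (assumptions for illustration only;
# a real model would estimate them from annotated sequence).
START = {"N": 1.0, "C1": 0.0, "C2": 0.0, "C3": 0.0}
TRANS = {
    "N": {"N": 0.9, "C1": 0.1},    # stay intergenic or start a codon
    "C1": {"C2": 1.0},             # codon positions advance in order
    "C2": {"C3": 1.0},
    "C3": {"C1": 0.9, "N": 0.1},   # next codon or leave the gene
}
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C1": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    "C2": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    "C3": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
}


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a DNA string over ACGT."""
    if not seq:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Missing transitions count as probability 0.
            prev = max(STATES, key=lambda p: score[p] + log(TRANS[p].get(s, 0.0)))
            new_score[s] = (
                score[prev] + log(TRANS[prev].get(s, 0.0)) + log(EMIT[s][ch])
            )
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    toy = "TTTT" + "GCA" * 5 + "TTTT"
    print(" ".join(viterbi(toy)))
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale input. Positions labelled `C1` to `C3` in the returned path form the predicted coding stretch in this toy setting.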

