Abstract
The number of completely sequenced bacterial genomes has been growing fast. Computer methods for finding genes are available, yet there is still a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. We also used a specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets, including 10 complete bacterial genomes. The new algorithm was shown to be significantly more accurate than GeneMark in exact gene prediction. Interestingly, high gene finding accuracy was observed even when Markov models of order zero, one and two were used. We present an analysis of false positive and false negative predictions, with the caution that these categories are not precisely defined if the public database annotation is used as a control.
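The idea of reading gene boundaries off transitions between hidden states can be illustrated with a toy sketch. The code below is not the GeneMark.hmm algorithm: it is a plain two-state hidden Markov model decoded with the standard Viterbi algorithm, and the state names, the probabilities in `START`, `TRANS` and `EMIT`, and the helper names `viterbi` and `boundaries` are all hypothetical values chosen only for illustration. It uses order-zero emissions and ignores codon structure, strand, and the ribosome binding site pattern.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, chosen for illustration only (not from the paper).
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable hidden-state path for seq, computed in log space."""
    if not seq:
        return []
    # Log score of the best path ending in each state at the first position.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path):
    """Return positions i where the state at i differs from the state at i - 1."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


if __name__ == "__main__":
    # Toy sequence: AT-rich flanks around a GC-rich block.
    toy = "AT" * 5 + "GC" * 5 + "AT" * 5
    print(boundaries(viterbi(toy)))
```

In this sketch a predicted boundary is simply a position where the decoded path switches state, which is the sense in which boundaries are "modeled as transitions between hidden states". A realistic gene finder would replace the order-zero emissions with higher-order, frame-aware models and would model both strands.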