Nucleic Acids Research. 1998 Jun 15;26(12):2941–2947. doi: 10.1093/nar/26.12.2941

Combining diverse evidence for gene recognition in completely sequenced bacterial genomes.

D Frishman, A Mironov, H W Mewes, M Gelfand
PMCID: PMC147632  PMID: 9611239

Abstract

Analysis of a newly sequenced bacterial genome starts with the identification of protein-coding genes. Functional assignment of proteins requires exact knowledge of protein N-termini. We present a new program, ORPHEUS, that identifies candidate genes and accurately predicts gene starts. The analysis begins with a database similarity search that identifies reliable gene fragments. These fragments are used to derive statistical characteristics of protein-coding regions and ribosome-binding sites, which in turn are used to predict the complete set of genes in the analyzed genome. In a test on the Bacillus subtilis and Escherichia coli genomes, the program correctly identified 93.3% and 96.3%, respectively, of the experimentally annotated genes longer than 100 codons described in the PIR-International database, and for these genes 96.3% and 83.9% of starts were predicted exactly. Furthermore, 98.9% (B. subtilis) and 99.1% (E. coli) of genes longer than 100 codons annotated in GenBank were found, and 92.9% and 75.7% of predicted starts coincided with the feature table description. Finally, for the complete gene complements of B. subtilis and E. coli, including genes shorter than 100 codons, gene prediction accuracy was 88.9% and 87.1%, respectively, with 94.2% and 76.7% of starts coinciding with the existing annotation.
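
The two-stage idea in the abstract (learn statistics from reliable fragments, then use them to score candidate genes and choose a start) can be illustrated with a minimal sketch. This is not the ORPHEUS implementation: the log-odds codon score against a uniform background, the `AGGAGG` Shine-Dalgarno-like motif, the 20-nt upstream window, the `rbs_weight` value, and all function names are assumptions chosen for illustration only. The seed fragments are taken to be in-frame DNA strings already obtained from a similarity search.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_of(seq):
    """Split an in-frame DNA string into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train_codon_log_odds(seed_fragments, pseudocount=1.0):
    """Log-odds of each codon in trusted fragments vs. a uniform background."""
    counts = Counter(c for frag in seed_fragments for c in codons_of(frag))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {
        c: math.log((counts[c] + pseudocount) / total * len(CODONS))
        for c in CODONS
    }


def coding_score(seq, log_odds):
    """Mean per-codon log-odds; higher means more seed-like codon usage."""
    cs = [c for c in codons_of(seq) if c in log_odds]
    return sum(log_odds[c] for c in cs) / len(cs) if cs else 0.0


def candidate_starts(seq, stop_pos):
    """In-frame start codons between the previous stop and the stop at stop_pos."""
    starts = []
    pos = stop_pos - 3
    while pos >= 0 and seq[pos:pos + 3] not in STOP_CODONS:
        if seq[pos:pos + 3] in START_CODONS:
            starts.append(pos)
        pos -= 3
    return sorted(starts)


def rbs_score(seq, start, motif="AGGAGG", window=20):
    """Best count of motif-matching positions in the window upstream of start."""
    upstream = seq[max(0, start - window):start]
    best = 0
    for i in range(len(upstream) - len(motif) + 1):
        matches = sum(a == b for a, b in zip(upstream[i:i + len(motif)], motif))
        best = max(best, matches)
    return best


def pick_start(seq, stop_pos, log_odds, rbs_weight=0.1):
    """Choose the start maximizing coding score plus a weighted RBS score."""
    starts = candidate_starts(seq, stop_pos)
    if not starts:
        return None
    return max(
        starts,
        key=lambda s: coding_score(seq[s:stop_pos], log_odds)
        + rbs_weight * rbs_score(seq, s),
    )
```

In this sketch, `train_codon_log_odds` plays the role of deriving coding statistics from the reliable fragments, while `pick_start` combines the coding signal with the upstream motif signal to rank alternative starts for one open reading frame. A real gene finder would also handle both strands, overlapping frames, and a trained rather than fixed ribosome-binding-site model.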

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press