Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 1994 Nov 11;22(22):4756–4767. doi: 10.1093/nar/22.22.4756

Intrinsic and extrinsic approaches for detecting genes in a bacterial genome.

M Borodovsky 1, K E Rudd 1, E V Koonin 1
PMCID: PMC308528  PMID: 7984428

Abstract

The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins.

Full text

PDF
4756

Images in this article

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

  1. Alefounder P. R., Perham R. N. Identification, molecular cloning and sequence analysis of a gene cluster encoding the class II fructose 1,6-bisphosphate aldolase, 3-phosphoglycerate kinase and a putative second glyceraldehyde 3-phosphate dehydrogenase of Escherichia coli. Mol Microbiol. 1989 Jun;3(6):723–732. doi: 10.1111/j.1365-2958.1989.tb00221.x. [DOI] [PubMed] [Google Scholar]
  2. Altschul S. F., Boguski M. S., Gish W., Wootton J. C. Issues in searching molecular sequence databases. Nat Genet. 1994 Feb;6(2):119–129. doi: 10.1038/ng0294-119. [DOI] [PubMed] [Google Scholar]
  3. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  4. Bairoch A. The PROSITE dictionary of sites and patterns in proteins, its current status. Nucleic Acids Res. 1993 Jul 1;21(13):3097–3103. doi: 10.1093/nar/21.13.3097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Blattner F. R., Burland V., Plunkett G., 3rd, Sofia H. J., Daniels D. L. Analysis of the Escherichia coli genome. IV. DNA sequence of the region from 89.2 to 92.8 minutes. Nucleic Acids Res. 1993 Nov 25;21(23):5408–5417. doi: 10.1093/nar/21.23.5408. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Borodovsky M., Koonin E. V., Rudd K. E. New genes in old sequence: a strategy for finding genes in the bacterial genome. Trends Biochem Sci. 1994 Aug;19(8):309–313. doi: 10.1016/0968-0004(94)90067-1. [DOI] [PubMed] [Google Scholar]
  7. Burland V., Plunkett G., 3rd, Daniels D. L., Blattner F. R. DNA sequence and analysis of 136 kilobases of the Escherichia coli genome: organizational symmetry around the origin of replication. Genomics. 1993 Jun;16(3):551–561. doi: 10.1006/geno.1993.1230. [DOI] [PubMed] [Google Scholar]
  8. Cain B. D., Norton P. J., Eubanks W., Nick H. S., Allen C. M. Amplification of the bacA gene confers bacitracin resistance to Escherichia coli. J Bacteriol. 1993 Jun;175(12):3784–3789. doi: 10.1128/jb.175.12.3784-3789.1993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Claverie J. M., Bougueleret L. Heuristic informational analysis of sequences. Nucleic Acids Res. 1986 Jan 10;14(1):179–196. doi: 10.1093/nar/14.1.179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Claverie J. M. Database of ancient sequences. Nature. 1993 Jul 1;364(6432):19–20. doi: 10.1038/364019b0. [DOI] [PubMed] [Google Scholar]
  11. Daniels D. L., Plunkett G., 3rd, Burland V., Blattner F. R. Analysis of the Escherichia coli genome: DNA sequence of the region from 84.5 to 86.5 minutes. Science. 1992 Aug 7;257(5071):771–778. doi: 10.1126/science.1379743. [DOI] [PubMed] [Google Scholar]
  12. Fichant G., Gautier C. Statistical method for predicting protein coding regions in nucleic acid sequences. Comput Appl Biosci. 1987 Nov;3(4):287–295. doi: 10.1093/bioinformatics/3.4.287. [DOI] [PubMed] [Google Scholar]
  13. Fickett J. W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982 Sep 11;10(17):5303–5318. doi: 10.1093/nar/10.17.5303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Fickett J. W., Tung C. S. Assessment of protein coding measures. Nucleic Acids Res. 1992 Dec 25;20(24):6441–6450. doi: 10.1093/nar/20.24.6441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Fields C. A., Soderlund C. A. gm: a practical tool for automating DNA sequence analysis. Comput Appl Biosci. 1990 Jul;6(3):263–270. doi: 10.1093/bioinformatics/6.3.263. [DOI] [PubMed] [Google Scholar]
  16. Gerdes K., Poulsen L. K., Thisted T., Nielsen A. K., Martinussen J., Andreasen P. H. The hok killer gene family in gram-negative bacteria. New Biol. 1990 Nov;2(11):946–956. [PubMed] [Google Scholar]
  17. Gibson J. L., Chen J. H., Tower P. A., Tabita F. R. The form II fructose 1,6-bisphosphatase and phosphoribulokinase genes form part of a large operon in Rhodobacter sphaeroides: primary structure and insertional mutagenesis analysis. Biochemistry. 1990 Sep 4;29(35):8085–8093. doi: 10.1021/bi00487a014. [DOI] [PubMed] [Google Scholar]
  18. Gish W., States D. J. Identification of protein coding regions by database similarity search. Nat Genet. 1993 Mar;3(3):266–272. doi: 10.1038/ng0393-266. [DOI] [PubMed] [Google Scholar]
  19. Gorbalenya A. E., Koonin E. V. Superfamily of UvrA-related NTP-binding proteins. Implications for rational classification of recombination/repair systems. J Mol Biol. 1990 Jun 20;213(4):583–591. doi: 10.1016/S0022-2836(05)80243-8. [DOI] [PubMed] [Google Scholar]
  20. Green P., Lipman D., Hillier L., Waterston R., States D., Claverie J. M. Ancient conserved regions in new gene sequences and the protein databases. Science. 1993 Mar 19;259(5102):1711–1716. doi: 10.1126/science.8456298. [DOI] [PubMed] [Google Scholar]
  21. Gribskov M., Devereux J., Burgess R. R. The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 2):539–549. doi: 10.1093/nar/12.1part2.539. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Henikoff S., Henikoff J. G. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15;89(22):10915–10919. doi: 10.1073/pnas.89.22.10915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Henikoff S., Henikoff J. G. Performance evaluation of amino acid substitution matrices. Proteins. 1993 Sep;17(1):49–61. doi: 10.1002/prot.340170108. [DOI] [PubMed] [Google Scholar]
  24. Hutchinson G. B., Hayden M. R. The prediction of exons through an analysis of spliceable open reading frames. Nucleic Acids Res. 1992 Jul 11;20(13):3453–3462. doi: 10.1093/nar/20.13.3453. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Karlin S., Altschul S. F. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci U S A. 1993 Jun 15;90(12):5873–5877. doi: 10.1073/pnas.90.12.5873. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Karlin S., Altschul S. F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 1990 Mar;87(6):2264–2268. doi: 10.1073/pnas.87.6.2264. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Kitagawa M., Wada C., Yoshioka S., Yura T. Expression of ClpB, an analog of the ATP-dependent protease regulatory subunit in Escherichia coli, is controlled by a heat shock sigma factor (sigma 32). J Bacteriol. 1991 Jul;173(14):4247–4253. doi: 10.1128/jb.173.14.4247-4253.1991. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Kleffe J., Borodovsky M. First and second moment of counts of words in random texts generated by Markov chains. Comput Appl Biosci. 1992 Oct;8(5):433–441. doi: 10.1093/bioinformatics/8.5.433. [DOI] [PubMed] [Google Scholar]
  29. Konopka A. K., Owens J. Complexity charts can be used to map functional domains in DNA. Genet Anal Tech Appl. 1990 Apr;7(2):35–38. doi: 10.1016/0735-0651(90)90010-d. [DOI] [PubMed] [Google Scholar]
  30. Koonin E. V., Bork P., Sander C. Yeast chromosome III: new gene functions. EMBO J. 1994 Feb 1;13(3):493–503. doi: 10.1002/j.1460-2075.1994.tb06287.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Koonin E. V., Rudd K. E. A conserved domain in putative bacterial and bacteriophage transglycosylases. Trends Biochem Sci. 1994 Mar;19(3):106–107. doi: 10.1016/0968-0004(94)90201-1. [DOI] [PubMed] [Google Scholar]
  32. Kröger M., Wahl R., Rice P. Compilation of DNA sequences of Escherichia coli (update 1993). Nucleic Acids Res. 1993 Jul 1;21(13):2973–3000. doi: 10.1093/nar/21.13.2973. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Médigue C., Rouxel T., Vigier P., Hénaut A., Danchin A. Evidence for horizontal gene transfer in Escherichia coli speciation. J Mol Biol. 1991 Dec 20;222(4):851–856. doi: 10.1016/0022-2836(91)90575-q. [DOI] [PubMed] [Google Scholar]
  34. Médigue C., Viari A., Hénaut A., Danchin A. Colibri: a functional data base for the Escherichia coli genome. Microbiol Rev. 1993 Sep;57(3):623–654. doi: 10.1128/mr.57.3.623-654.1993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Plunkett G., 3rd, Burland V., Daniels D. L., Blattner F. R. Analysis of the Escherichia coli genome. III. DNA sequence of the region from 87.2 to 89.2 minutes. Nucleic Acids Res. 1993 Jul 25;21(15):3391–3398. doi: 10.1093/nar/21.15.3391. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Rhiel E., Stirewalt V. L., Gasparich G. E., Bryant D. A. The psaC genes of Synechococcus sp. PCC7002 and Cyanophora paradoxa: cloning and sequence analysis. Gene. 1992 Mar 1;112(1):123–128. doi: 10.1016/0378-1119(92)90313-e. [DOI] [PubMed] [Google Scholar]
  37. Robison K., Gilbert W., Church G. M. Large scale bacterial gene discovery by similarity search. Nat Genet. 1994 Jun;7(2):205–214. doi: 10.1038/ng0694-205. [DOI] [PubMed] [Google Scholar]
  38. Sandal N. N., Marcker K. A. Similarities between a soybean nodulin, Neurospora crassa sulphate permease II and a putative human tumour suppressor. Trends Biochem Sci. 1994 Jan;19(1):19–19. doi: 10.1016/0968-0004(94)90168-6. [DOI] [PubMed] [Google Scholar]
  39. Schuh R., Aicher W., Gaul U., Côté S., Preiss A., Maier D., Seifert E., Nauber U., Schröder C., Kemler R. A conserved family of nuclear proteins containing structural elements of the finger protein encoded by Krüppel, a Drosophila segmentation gene. Cell. 1986 Dec 26;47(6):1025–1032. doi: 10.1016/0092-8674(86)90817-2. [DOI] [PubMed] [Google Scholar]
  40. Schuler G. D., Altschul S. F., Lipman D. J. A workbench for multiple alignment construction and analysis. Proteins. 1991;9(3):180–190. doi: 10.1002/prot.340090304. [DOI] [PubMed] [Google Scholar]
  41. Schweinfest C. W., Henderson K. W., Suster S., Kondoh N., Papas T. S. Identification of a colon mucosa gene that is down-regulated in colon adenomas and adenocarcinomas. Proc Natl Acad Sci U S A. 1993 May 1;90(9):4166–4170. doi: 10.1073/pnas.90.9.4166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Senapathy P., Shapiro M. B., Harris N. L. Splice junctions, branch point sites, and exons: sequence statistics, identification, and applications to genome project. Methods Enzymol. 1990;183:252–278. doi: 10.1016/0076-6879(90)83018-5. [DOI] [PubMed] [Google Scholar]
  43. Shao Z., Newman E. B. Sequencing and characterization of the sdaB gene from Escherichia coli K-12. Eur J Biochem. 1993 Mar 15;212(3):777–784. doi: 10.1111/j.1432-1033.1993.tb17718.x. [DOI] [PubMed] [Google Scholar]
  44. Shepherd J. C. Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc Natl Acad Sci U S A. 1981 Mar;78(3):1596–1600. doi: 10.1073/pnas.78.3.1596. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Snyder E. E., Stormo G. D. Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Res. 1993 Feb 11;21(3):607–613. doi: 10.1093/nar/21.3.607. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 2):505–519. doi: 10.1093/nar/12.1part2.505. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Staden R., McLachlan A. D. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Res. 1982 Jan 11;10(1):141–156. doi: 10.1093/nar/10.1.141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Tatusov R. L., Koonin E. V. A simple tool to search for sequence motifs that are conserved in BLAST outputs. Comput Appl Biosci. 1994 Jul;10(4):457–459. doi: 10.1093/bioinformatics/10.4.457. [DOI] [PubMed] [Google Scholar]
  49. Tavaré S., Song B. Codon preference and primary sequence structure in protein-coding regions. Bull Math Biol. 1989;51(1):95–115. doi: 10.1007/BF02458838. [DOI] [PubMed] [Google Scholar]
  50. Uberbacher E. C., Mural R. J. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci U S A. 1991 Dec 15;88(24):11261–11265. doi: 10.1073/pnas.88.24.11261. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Walker J. E., Saraste M., Runswick M. J., Gay N. J. Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold. EMBO J. 1982;1(8):945–951. doi: 10.1002/j.1460-2075.1982.tb01276.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES