Abstract
Cosmids from the 1A3–1A10 region of the complete miniset were individually subcloned by using the vector M13 mp18. Sequences of each cosmid were assembled from about 400 DNA fragments generated from the ends of these phage subclones and merged into one 189-kb contig. About 160 ORFs identified by the CodonUse program were subjected to similarity searches. The biological functions of 80 ORFs could be assigned reliably by using the WIT and Magpie genome investigation tools. Eighty percent of these recognizable ORFs were organized in functional clusters, which simplified assignment decisions and increased the strength of the predictions. A set of 26 genes for cobalamin biosynthesis, genes for polyhydroxyalkanoic acid metabolism, DNA replication and recombination, and DNA gyrase were among those identified. Most of the ORFs lacking significant similarity with reference databases also were grouped. There are two large clusters of these ORFs, one located between 45 and 67 kb of the map, and the other between 150 and 183 kb. Nine of the 15 loosely identified ORFs in the first of these clusters match ORFs from phages or transposons. The other cluster also has four ORFs of possible phage origin.
Rhodobacter capsulatus is a purple, nonsulfur photosynthetic bacterium. The ease of plating and generation of mutations, together with convenient systems for cloning and genetic analysis (1), has made Rhodobacter an established model for many important biological processes, including photosynthesis (2, 3) and nitrogen fixation (4, 5). Mobilization of the chromosome by an integrated R′ factor produced a genetic map of R. capsulatus with more than 30 markers (6). The development of shuttle vectors based on broad host range plasmids made possible efficient random and site-specific mutagenesis with transposons as well as gene cloning by mutant complementation (7). Results from gene inactivation studies, and intensive studies of gene regulation on both the protein and RNA levels, have produced a picture of multistep regulatory response cascades (4, 8–10).
There are about 250 sequenced genes from R. capsulatus resulting from random efforts of different groups. Over the last 10 years, there have been about 1,600 entries in MedLine concerning the molecular biology of R. capsulatus and R. sphaeroides, putting them on a list of the 10 most-studied microorganisms. The ability to choose between phototrophic and heterotrophic growth and the genetic determinants for nitrogen fixation all are packed in a genome that is 3/4 the size of Escherichia coli’s. Recently, cosmid encyclopedias of genomes of three different strains of R. capsulatus were assembled and mapped with high resolution (11). These cosmids were used in global expression studies (12) and for the comparison of chromosome organization in different strains (13).
There has been an explosion in the number of bacterial genome sequencing projects. More than 40 such projects, listed at http://www.mcs.anl.gov/home/gaasterl/magpie.html, are in progress. Much of the development in genomics is due to advances in computation, where different search algorithms have been merged in automated genome investigation environments such as WIT developed by R. Overbeek (http://www.mcs.anl.gov/home/compbio/WIT/wit.html), Magpie (14), or software used for annotation of the Haemophilus genome (15). These systems reduce the time required for raw annotation. Hybridization chip technology promises to provide clues to total genome expression (16, 17). The situation for determination of gene functions is less bright, because half of the genes generated in sequencing projects have unknown, or loosely defined, functions. A systematic functional analysis of the ORFs in sequenced bacteria, like the one started for yeast (18), will help solve this problem. However, many of the industrially or medically important microorganisms targeted for genome sequencing are hard to cultivate and not the best candidates for this work. We are using R. capsulatus for our gene function assignment, starting with the determination of its genome sequence.
The two major approaches to the sequencing of bacterial genomes are piecemeal sequencing of ordered clones, as done for Synechocystis 6803 (19), and assembly from a single shotgun cloning (15). We chose the first strategy for the following reasons: (i) R. capsulatus has a genome size of 3.8 Mb, which we considered to be too large to be assembled by shotgun sequencing without great complications; (ii) the high GC content of this genome (68%) results in many sequencing stops, and it is easier to link contigs interrupted by these stops if the genome already is represented by an ordered clone encyclopedia; and (iii) a cosmid encyclopedia of R. capsulatus already had been constructed and mapped with high resolution (11, 13). The current report describes the results of a pilot project (5% of the genome) aimed at testing the sequencing tactics and computer tools to be applied to the remainder of the Rhodobacter genome.
MATERIALS AND METHODS
Subcloning of the Cosmids.
DNA from cosmids 1A3–1A10 (Fig. 1) was prepared as described (11). Ten micrograms of cosmid DNA were digested using HinPI endonuclease under conditions of partial cleavage. Approximately 1-kb fragments were size-selected with agarose and cloned (20) into vector M13 mp18 (21). Hybrid phages were selected by the lack of α-complementation of the defective gene of β-galactosidase in the E. coli host strain used for transfection. Hybrid phages were individually harvested in 96-well plates, and isolated DNA was stored at −70°C.
Figure 1.
Part of the cosmid encyclopedia of the R. capsulatus genome, showing the positions of the cosmid subset that provided the substrate for sequencing. The region whose sequence is reported here is shaded in gray. Details of the map construction and its use in mapping genes are in ref. 12.
Sequencing Technique.
About 400 phage subclones were prepared for each cosmid. A first round of sequences was generated from the ends of these subclones using the standard “−20” primer. Sequencing reactions were performed with a Pharmacia kit for fluorescent sequencing. Products of the reactions were analyzed by using a Pharmacia ALF sequencer. Average reading lengths were 400 bases, producing about 4.5-fold sequence redundancy. After this step, we expected to have only 20% single-strand sequence with five gaps per 35-kb cosmid insert, which was close to the observed number.
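The redundancy and gap figures above can be checked with a short calculation. The sketch below assumes the idealized Lander-Waterman model of random shotgun coverage, which the text does not name; the function name and the neglect of a minimum overlap length are illustrative simplifications, so the numbers are only a rough sanity check.

```python
import math


def shotgun_estimates(n_reads, read_len, target_len):
    """Return (coverage, expected_gaps, uncovered_fraction).

    Assumes reads start uniformly at random along the target
    (Lander-Waterman idealization) and ignores the minimum overlap
    needed to join two reads.
    """
    coverage = n_reads * read_len / target_len
    # Probability that a given base is covered by no read at all.
    uncovered_fraction = math.exp(-coverage)
    # Expected number of reads whose right end is not overlapped by
    # another read, i.e. the approximate number of contig breaks.
    expected_gaps = n_reads * math.exp(-coverage)
    return coverage, expected_gaps, uncovered_fraction


# Values taken from the text: 400 subclones, 400-base reads, 35-kb insert.
coverage, gaps, uncovered = shotgun_estimates(400, 400, 35_000)
print(f"coverage      = {coverage:.2f}-fold")
print(f"expected gaps = {gaps:.1f}")
print(f"uncovered     = {uncovered:.2%}")
```

Under these assumptions the coverage comes out near 4.6-fold and the expected number of gaps near four, in line with the roughly 4.5-fold redundancy and five gaps per cosmid quoted above.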
Sequence Assembly.
The sequence of each cosmid was assembled with the GeneSkipper software developed at the European Molecular Biology Laboratory, Heidelberg. Gaps between contigs were closed by isolating M13 clones at the ends of the contigs and by limited primer walking. The replicative form of M13 was used to generate second-strand sequence where necessary. Simulated restriction maps of sequenced cosmids were compared with the experimental maps (12), and the few discrepancies were carefully reanalyzed. The cosmid sequences were merged into one 189-kb contig (Fig. 2).
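The comparison of simulated and experimental restriction maps can be sketched as follows. This is a minimal illustration, not the procedure used in the work: the enzyme (EcoRI, G^AATTC), the toy sequence, the 10% size tolerance, and both function names are assumptions chosen only to show the idea.

```python
def digest(sequence, site, cut_offset):
    """Return fragment lengths from a complete digest of a linear sequence.

    site       -- recognition sequence (assumed palindromic, so only the
                  top strand is searched)
    cut_offset -- number of bases into the site at which the strand is cut
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


def unmatched_fragments(simulated, observed, tolerance=0.10):
    """Return observed fragment sizes with no simulated counterpart.

    Each simulated fragment may be matched at most once; two sizes match
    when they differ by no more than `tolerance` of the observed size.
    """
    remaining = sorted(simulated)
    unmatched = []
    for size in sorted(observed):
        match = next(
            (s for s in remaining if abs(s - size) <= tolerance * size), None
        )
        if match is None:
            unmatched.append(size)
        else:
            remaining.remove(match)
    return unmatched


# Toy sequence with two EcoRI sites (illustrative only).
toy = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
simulated = digest(toy, "GAATTC", 1)
print(simulated)
print(unmatched_fragments(simulated, [5, 20, 7]))
```

Any observed fragment left unmatched flags a region of the assembly that deserves a second look, which is the role the map comparison plays in the text.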
Figure 2.
ORF locations and putative functional assignments for the first 5% of the R. capsulatus genome to be sequenced. Degree of reliability of functional predictions is represented by the color of the ORFs as follows: dark blue, high probability; light blue, tentative; green, high similarity with yet unassigned ORFs; yellow, lower similarity with unassigned ORFs; and red, ORFs lacking any reliable similarity with sequences in databases. A color code for possible functions is labeled on the picture.
The DNA sequence and results of the annotations are presented at the web site http://www.mcs.anl.gov/home/compbio/Organisms/Rhodobacter_capsulatus/rhodobacter.html together with some analytical tools.
RESULTS AND DISCUSSION
ORF Calling.
A number of genome elements are important for the biological functions of an organism: sequences regulating expression, structural RNA genes, sequences involved in the replication and stable maintenance of the genome, and sequences triggering recombination. However, about 80% of the bacterial genome is represented by genes coding for proteins (15, 22). Our work is focused on assigning functions to these genes. Some of the tools used for this task, like BlastN, can potentially deal with large unparsed DNA sequences. To use others, like FastA or BlastP, one should analyze individual DNA regions that may encode proteins (ORFs), so the initial task in genome annotation is to find these ORFs. One way to approach this is to collect all DNA fragments larger than 300 bp (a threshold determined by statistical considerations) in all reading frames from each stop codon to the most remote upstream start codon and then to analyze them. This way nothing should be missed. This approach is used in Magpie (14). However, in high GC organisms, such as R. capsulatus, stop codons are rare, and this approach produces almost five times more pseudo-ORFs than real ones. This not only complicates the analysis, but also makes the recognition of a real ORF impossible, if no function can be assigned to any of the overlapping potential ORFs. Alternative approaches rely on biases in codon utilization, which make translated regions different from the rest of the DNA for all known organisms. These regions can be found by using Markov chain analysis (23) or with simpler algorithms, like the probability of belonging to an ORF for a translated DNA segment of a certain size, as in CodonUse developed by Conrad Halling. For organisms with strong codon bias like R. capsulatus, such ORF identification is straightforward for almost 90% of the genome. A different situation was observed for a low-GC prophage, for which the ORF search based on the major type of Rhodobacter codon usage fails. 
These genes evidently follow a different set of codon preferences (23). In such regions, we collected all possible ORFs. Altogether, 200 ORFs were initially selected for the 189-kb DNA fragment, 163 of which remained after annotation.
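The exhaustive approach described above (every stretch longer than 300 bp, in all reading frames, running from a stop codon back to the most remote upstream start codon) can be sketched as follows. This is an illustrative reconstruction rather than the Magpie or CodonUse code; the start codon set (ATG, GTG, TTG) and the function names are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_on_strand(seq, min_len=300):
    """Yield (start, end) for each ORF on the given strand.

    An ORF runs from the most upstream in-frame start codon since the
    previous stop codon up to and including the next stop codon.
    Coordinates are 0-based and half-open.
    """
    for frame in range(3):
        first_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if first_start is not None and i + 3 - first_start >= min_len:
                    yield first_start, i + 3
                first_start = None
            elif codon in STARTS and first_start is None:
                first_start = i


def find_orfs(seq, min_len=300):
    """Return (strand, start, end) for ORFs in all six reading frames.

    Coordinates always refer to the forward strand.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = [("+", s, e) for s, e in orfs_on_strand(seq, min_len)]
    orfs += [
        ("-", n - e, n - s)
        for s, e in orfs_on_strand(reverse_complement(seq), min_len)
    ]
    return orfs


# Toy example: a single 306-bp ORF with an internal in-frame ATG.
toy_orf = "ATG" + "GCC" * 50 + "ATG" + "GCC" * 49 + "TAA"
print(find_orfs(toy_orf))
```

Because stop codons are rare in high-GC DNA, a scan of this kind returns many overlapping candidates on real R. capsulatus sequence, which is why the codon-bias filter discussed above was preferred outside the low-GC prophage regions.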
Functional Annotation of the ORFs.
Annotation of the ORFs was done using the WIT/WIT2 system (http://www.mcs.anl.gov/home/compbio/WIT/wit.html) developed by R. Overbeek in collaboration with E. Selkov and N. Maltsev at Argonne National Laboratory. WIT/WIT2 is an environment for functional annotation of genomes, which can be used in two steps. WIT2 allows rapid initial annotation of the ORFs by broad integration of publicly available tools (e.g., blast, fasta, Prosite, MTpred, Psort, etc.) as well as original tools (pattern recognition, cluster analysis of proteins, signature analysis).
The results of these functional assignments are shown in Fig. 2 and Table 1. Biological functions of 80 of the ORFs (50% of all ORFs) covering about 70% of the sequenced region were reliably assigned. Eighty percent of the assignable ORFs were organized in clusters with related functions. The vast majority of ORFs lacking significant similarity with entries in reference databases also could be grouped in possibly cotranscribed sets.
Table 1.
Functional assignments of Rhodobacter capsulatus ORFs
ORF position | Best database matches | Likely ORF function
---|---|---
A Survey of Gene Functions Found in the DNA Sequence.
Cobalamin biosynthesis. The most cohesive group of ORFs observed is a set of 26, most of which match with high similarity scores to the genes for synthesis of cobalamin (vitamin B12) in Pseudomonas denitrificans and Salmonella typhimurium. Many mysteries accompany the study of B12 synthesis and use. Its corrin nucleus may well be the evolutionary ancestor of siroheme, chlorophyll, and heme (24). The original function of B12 may have been in anaerobic fermentations. Currently, it is required by Salmonella under aerobic conditions but synthesized only under anaerobic conditions. Pseudomonas makes B12 only under aerobic conditions (25). The aerobic and anaerobic pathways to cobalamin contain many of the same enzymatic steps, but two of the 20 reactions that begin with uroporphyrinogen III and end with adenosyl cobinamide differ considerably. In the presence of oxygen, cobalt is inserted at the 15th step in the pathway by an enzyme (cobyric acid a,c-diamide synthase) that adds cobalt to hydrogenobyrinic acid a,c-diamide (24). Note that the ORF starting at position 12,233 (Table 1), the last ORF in the “cobalamin” region, has a good match to this enzyme, implying that R. capsulatus can make B12 aerobically. In addition, there are homologues of cobC, cobN, cobW, cobK, and cobF of P. denitrificans in the cobalamin region of R. capsulatus. All of these similarities point to aerobic synthesis of B12. Can Rhodobacter make B12 anaerobically as well? Under anaerobic conditions, such as those required by Salmonella and Propionibacterium shermanii, cobalt is inserted at step 3 and carried through all of the intermediates from that point on. The enzyme for step 4, the product of the cobG gene of P. denitrificans, is required only in the presence of oxygen. We do not find a homologue of cobG in this region, but it could be elsewhere in the chromosome. In Salmonella, the cbiMNQO genes encode a transport system for cobalt; these genes are present in R. capsulatus.
Other Regions.
The most common gene group observed codes for membrane transport. Eleven genes related to transport were organized in three clusters, and five more single genes were distributed along the 189-kb fragment. One of these gene clusters appears to code for maltose transport, another for ribose transport. The high number of transport genes corresponds to the finding by Riley for E. coli (26) and the experimental result of Zheng, who observed several dozen operons related to ABC transporters by hybridization to a display of 540 restriction fragments covering the entire Rhodobacter chromosome (27).
DNA recombination and replication were represented by eight identified genes. Two unlinked genes code for two gyrase subunits, a pair of possibly cotranscribed ORFs code for an SbcCD enzyme homolog, and other genes encode DnaK, PolC, integrase and transposase homologs. The latter two are in regions that appear to be either transposons or integrated phage genomes.
An operon of five genes codes for polyhydroxyalkanoic acid metabolism. A gene encoding acetyl-CoA synthetase is possibly a part of this putative operon. The latter enzyme enables the cell to use acetate through the glyoxylate cycle.
Genetic determinants for sensing environmental signals, two-component signal transduction, sugar utilization, translation machinery, glycolysis, cold adaptation, and amino acid metabolism also were located within the 189-kb DNA fragment.
About one-third of the ORFs lack detectable similarity to entries in the reference databases when searched with the tools used. There are two large clusters of such ORFs, one located from 45 to 67 kb and the other from 150 to 183 kb. Nine of the 15 loosely identified ORFs in the first, 22-kb cluster match ORFs from phages or transposons. In these cases, the blast searches turn up matches to repressors or recombination enzymes that characterize either phages or transposons. The other, 33-kb unidentified cluster also has four ORFs of possible phage origin. Phages that infect R. capsulatus have been described rarely. The only one for which anything is known is the GTA, a defective transducing phage useful for shuttling genes into and out of R. capsulatus (28, 29). Each of the putative phages discovered in the sequencing project includes a possible repressor gene, so expression and deletion studies can be used to start the biological characterization of these phages.
We conclude that the strategy of constructing an ordered cosmid library followed by shotgun sequencing of subclones and then primer walking is efficient and cost effective, particularly in view of the long-term goal of assigning functions to the ORFs discovered. The sequenced subclones provide the substrates for GTA-mediated deletions, so the phenotypic analysis can proceed hand-in-hand with the sequencing.
Acknowledgments
We are grateful to Conrad Halling for development of the CodonUse program, Ross Overbeek for help with annotation, Terri Gaasterland for introducing us to Magpie, and William Buikema for much critical advice. The work was supported by grants from the Department of Energy (DE-FG02-86ER13546), the Harris and Frances Block Research Fund at the University of Chicago, the National Science Foundation for International Cooperation with the Academy of Sciences of the Czech Republic (INT-9506881), the National Science Foundation for Instrumentation (MCB-9421031), Grant 204/970206 from the Grant Agency of the Czech Republic, and Grants ES009 and VS6074 of the Ministry of Education, Youth and Sports of the Czech Republic.
References
- 1. Donohue T J, Kaplan S. Methods Enzymol. 1991;204:459–485. doi: 10.1016/0076-6879(91)04024-i.
- 2. Klug G. Arch Microbiol. 1993;159:397–404. doi: 10.1007/BF00288584.
- 3. Bauer C, Buggy J, Mosley C. Trends Genet. 1993;9:56–60. doi: 10.1016/0168-9525(93)90188-N.
- 4. Kranz R G, Foster-Hartnett D. Mol Microbiol. 1990;4:1793–1800. doi: 10.1111/j.1365-2958.1990.tb02027.x.
- 5. Oelze J, Klein G. Arch Microbiol. 1996;165:219–225. doi: 10.1007/s002030050319.
- 6. Willison J C. FEMS Microbiol Rev. 1993;10:1–38. doi: 10.1111/j.1574-6968.1993.tb05862.x.
- 7. Johnson J A, Wong W K, Beatty J T. J Bacteriol. 1986;167:604–610. doi: 10.1128/jb.167.2.604-610.1986.
- 8. Kranz R G, Pace V M, Caldicott I M. J Bacteriol. 1990;172:53–62. doi: 10.1128/jb.172.1.53-62.1990.
- 9. Ponnampalam S N, Buggy J J, Bauer C E. J Bacteriol. 1995;177:2990–2997. doi: 10.1128/jb.177.11.2990-2997.1995.
- 10. Hubner P, Masepohl B, Klipp W, Bickle T A. Mol Microbiol. 1993;10:123–132. doi: 10.1111/j.1365-2958.1993.tb00909.x.
- 11. Fonstein M, Haselkorn R. Proc Natl Acad Sci USA. 1993;90:2522–2526. doi: 10.1073/pnas.90.6.2522.
- 12. Fonstein M, Koshy E G, Nikolskaya T, Mourachov P, Haselkorn R. EMBO J. 1995;14:1827–1841. doi: 10.1002/j.1460-2075.1995.tb07171.x.
- 13. Fonstein M, Nikolskaya T, Haselkorn R. J Bacteriol. 1995;177:2368–2372. doi: 10.1128/jb.177.9.2368-2372.1995.
- 14. Gaasterland T, Sensen C W S. Biochimie. 1996;78:302–310. doi: 10.1016/0300-9084(96)84761-4.
- 15. Fleischmann R D, Adams M D, White O, Clayton R A, Kirkness E F, et al. Science. 1995;269:496–512. doi: 10.1126/science.7542800.
- 16. Chee M, Yang R, Hubbell E, Berno A, Huang X C, Stern D, Winkler J, Lockhart D J, Morris M S, Fodor S P. Science. 1996;274:610–614. doi: 10.1126/science.274.5287.610.
- 17. Yershov G, Barsky V, Belgovskiy A, Kirillov E, Kreindlin E, Ivanov I, Parinov S, Guschin D, Drobishev A, Dubiley S, Mirzabekov A. Proc Natl Acad Sci USA. 1996;93:4913–4918. doi: 10.1073/pnas.93.10.4913.
- 18. Oliver G S. Nature (London) 1996;379:597–600. doi: 10.1038/379597a0.
- 19. Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, et al. DNA Res. 1996;3:109–136. doi: 10.1093/dnares/3.3.109.
- 20. Vlcek C, Paces V. Gene. 1995;165:137–138. doi: 10.1016/0378-1119(95)00534-d.
- 21. Messing J, Vieira J. Gene. 1982;19:269–276. doi: 10.1016/0378-1119(82)90016-6.
- 22. Blattner F R, Burland V, Plunkett G, Sofia H J, Daniels D L. Nucleic Acids Res. 1993;21:5408–5417. doi: 10.1093/nar/21.23.5408.
- 23. Borodovsky M, McIninch J D, Koonin E V, Rudd K E, Medigue C, Danchin A. Nucleic Acids Res. 1995;23:3554–3562. doi: 10.1093/nar/23.17.3554.
- 24. Roth J R, Lawrence J G, Bobik T A. Annu Rev Microbiol. 1996;50:137–181. doi: 10.1146/annurev.micro.50.1.137.
- 25. Battersby A R. Science. 1994;264:1551–1557. doi: 10.1126/science.8202709.
- 26. Riley M. Microbiol Rev. 1993;57:862–952. doi: 10.1128/mr.57.4.862-952.1993.
- 27. Zheng S, Haselkorn R. Mol Microbiol. 1996;20:1001–1011. doi: 10.1111/j.1365-2958.1996.tb02541.x.
- 28. Yen H C, Hu N T, Marrs B L. J Mol Biol. 1979;131:157–168. doi: 10.1016/0022-2836(79)90071-8.
- 29. Kumar V, Fonstein M, Haselkorn R. Nature (London) 1996;381:653–654. doi: 10.1038/381653a0.