Proceedings of the National Academy of Sciences of the United States of America. 2004 Jun 21;101(26):9722–9727. doi: 10.1073/pnas.0400975101

Computational inference of scenarios for α-proteobacterial genome evolution

Bastien Boussau, E Olof Karlberg, A Carolin Frank, Boris-Antoine Legault, Siv G E Andersson
PMCID: PMC470742  PMID: 15210995

Abstract

The α-proteobacteria, from which mitochondria are thought to have originated, display a 10-fold genome size variation and provide an excellent model system for studies of genome size evolution in bacteria. Here, we use computational approaches to infer ancestral gene sets and to quantify the flux of genes along the branches of the α-proteobacterial species tree. Our study reveals massive gene expansions at branches diversifying plant-associated bacteria and extreme losses at branches separating intracellular bacteria of animals and humans. Alterations in gene numbers have mostly affected functional categories associated with regulation, transport, and small-molecule metabolism, many of which are encoded by paralogous gene families located on auxiliary chromosomes. The results suggest that the α-proteobacterial ancestor contained 3,000-5,000 genes and was a free-living, aerobic, and motile bacterium with pili and surface proteins for host cell and environmental interactions. Approximately one third of the ancestral gene set has no homologs among the eukaryotes. More than 40% of the genes without eukaryotic counterparts encode proteins that are conserved among the α-proteobacteria but for which no function has yet been identified. These genes that never made it into the eukaryotes but are widely distributed in bacteria may represent bacterial drug targets and should be prime candidates for future functional characterization.


Fundamental questions subjected to much debate concern the extent to which microbial genomes are related by vertical descent versus horizontal gene transfer (1-5). A direct approach to address these questions is to estimate frequencies of deletions/duplications and horizontal gene transfers for closely related species and compare these estimates with estimates of nucleotide substitution rates. The α-proteobacteria provide an excellent model system for such studies because genome size variation in this subdivision spans the entire size range for bacteria, from 1 Mb in Rickettsia spp. to >9 Mb in Bradyrhizobium japonicum (6-12). Furthermore, there is an amazing variation in lifestyle characteristics in this subdivision, including both obligate (Rickettsia and Wolbachia) and facultative (Bartonella and Brucella) intracellular bacteria as well as soil-borne plant symbionts and pathogens (Sinorhizobium, Agrobacterium, and Bradyrhizobium), which enables correlations between gene contents and lifestyle features to be examined.

The α-proteobacterial group has also attracted much interest because one of its descending lineages is thought to be the ancestor of mitochondria (13, 14). The acquisition of mitochondria represents one of the earliest and most extreme cases of horizontal gene transfer events known in the history of life. Phylogenetic studies suggest that ≥630 eukaryotic genes were transferred from the α-proteobacteria to the eukaryotes, including many genes coding for modern mitochondrial protein functions (15). For the majority of mitochondrial proteins, however, no bacterial homologs were identified, indicating that they were derived from nuclear, eukaryotic genomes via intragenomic duplication and sequence divergence (14-16).

Based on results from pairwise genome comparisons, it has been suggested that there is a correlation between genome size alterations, microbial population sizes, and growth habitats (17). For example, it has been shown that free-living bacterial species of large population sizes accumulate insertion/deletion and rearrangement mutations relative to nucleotide substitutions at much higher frequencies than host-dependent bacteria of small population sizes, in which the influence of horizontal gene transfers has been negligible (17). Algorithms for mapping the presence and absence of genes onto inferred species trees in multiple genome comparisons (18, 19) have been used to reconstruct ancestral gene sets and to obtain estimates of the flow of genes along each of the individual branches. By using such approaches, >500 genes have been assigned to the last universal common ancestor (LUCA) (19), and 2,000 genes have been assigned to the ancestor of the Archaea (18).

In this study, we used the α-proteobacteria as a model system to examine the contents of ancestral genomes along with the evolutionary basis for genome size differences. Our results suggest that the α-proteobacterial ancestor contained several thousand genes and was metabolically highly versatile. The flux of genes along the individual branches of the tree highlights the role of the auxiliary chromosomes as mediators of genome size expansions and contractions in response to alterations in environmental conditions.

Materials and Methods

Genome Analysis. The sizes and GenBank accession numbers of α-proteobacterial genomes included in this analysis are given in Table 1. The assignment of functional categories for proteins in Rickettsia prowazekii, Rickettsia conorii, Brucella melitensis, Brucella suis, Caulobacter crescentus, Agrobacterium tumefaciens, Sinorhizobium meliloti, and Mesorhizobium loti was taken from The Institute for Genomic Research (www.tigr.org). Uncategorized proteins and proteins from Bartonella henselae, Bartonella quintana, and B. japonicum were assigned a functional category according to the best hit in similarity searches using blastp (E < 1 × 10⁻¹⁰) against all classified proteins from The Institute for Genomic Research (www.tigr.org). Additional proteobacterial genomes included as outgroups in the analyses were Campylobacter jejuni (NC_002163), Escherichia coli (NC_000913), Helicobacter pylori (NC_000913), Pseudomonas aeruginosa (NC_002516), Ralstonia solanacearum (NC_003296), Salmonella typhimurium (NC_003197 and NC_003277), and Xylella fastidiosa (NC_002490).

Table 1. α-Proteobacterial species included in the reconstruction analysis.

| Species | Total size, Mb | GenBank accession no. (size, Mb) |
|---|---|---|
| R. prowazekii | 1.1 | NC_000963 (1.1) |
| R. conorii | 1.3 | NC_003103 (1.3) |
| W. pipientis | 1.3 | NC_002987 (1.3) |
| B. quintana | 1.6 | BX897700 (1.6) |
| B. henselae | 1.9 | BX897699 (1.9) |
| B. melitensis | 3.3 | NC_003317 (2.1), NC_003318 (1.2) |
| B. suis | 3.3 | NC_004310 (2.1), NC_004311 (1.2) |
| C. crescentus | 4.0 | NC_002696 (4.0) |
| R. palustris | 5.5 | NC_005296 (5.5) |
| A. tumefaciens | 5.6 | NC_003062 (2.8), NC_003063 (2.1), NC_003064 (0.5), NC_003065 (0.2) |
| S. meliloti | 6.7 | NC_003047 (3.6), NC_003037 (1.4), NC_003078 (1.7) |
| M. loti | 7.6 | NC_002678 (7.0), NC_002679 (0.4), NC_002682 (0.2) |
| B. japonicum | 9.1 | NC_004463 (9.1) |

Phylogenetic Inference. The species phylogeny was estimated by using a data set of concatenated proteins that were selected on the basis that they are encoded by genes located in segments with largely conserved gene order structures in B. henselae, B. quintana, B. melitensis, A. tumefaciens, S. meliloti, and M. loti (see Fig. 6, which is published as supporting information on the PNAS web site). Homologs of the selected B. quintana proteins were inferred by blastp (20) searches (E < 1 × 10⁻²⁰) against the protein data set of each α-proteobacterial genome. To exclude paralogs, we included in the analysis only genes without a second blast hit with an E value of <1 × 10⁻²⁰. A further selection criterion was that orthologs should be present in at least 12 of the 20 taxa, resulting in a final set of 38 proteins (Table 3, which is published as supporting information on the PNAS web site).
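The selection rules in this paragraph can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the tuple layout of `hits`, the function name, and the reading that a second significant hit in any single genome disqualifies a query are all introduced here.

```python
from collections import defaultdict

E_CUTOFF = 1e-20  # blastp threshold stated in the text
MIN_TAXA = 12     # orthologs required in at least 12 of the 20 taxa


def select_marker_proteins(hits, e_cutoff=E_CUTOFF, min_taxa=MIN_TAXA):
    """Pick query proteins that look single-copy and widely present.

    hits: iterable of (query, genome, subject, evalue) tuples
    (a hypothetical layout; any tabular blastp output could be mapped to it).
    """
    significant = defaultdict(lambda: defaultdict(set))
    for query, genome, subject, evalue in hits:
        if evalue < e_cutoff:
            significant[query][genome].add(subject)

    selected = []
    for query, genomes in significant.items():
        # Assumption: a second significant hit in any genome marks a paralog.
        if any(len(subjects) > 1 for subjects in genomes.values()):
            continue
        if len(genomes) >= min_taxa:
            selected.append(query)
    return sorted(selected)
```

A query is kept only if it has exactly one hit below the cutoff in each genome where it is found and is found in at least `min_taxa` genomes.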

The alignment was performed by using clustalw (21) on individual protein sequences that were later concatenated. Maximum-likelihood phylogenies were constructed by using phyml (version 2.1 beta) (22) assuming the Jones-Taylor-Thornton model of protein evolution and four γ-distributed rate categories with the α parameter and proportion of invariable sites estimated from the data. To assess the variation in the data, 100 bootstrap replicates were generated from the data set with seqboot from the phylip 3.5c package (J. Felsenstein, Department of Genetics, University of Washington, Seattle). Maximum-likelihood trees were estimated from the bootstrap matrices as described above, and a majority-rule consensus tree was generated from them by using consense, also from the phylip 3.5c package.
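The bootstrap step amounts to resampling alignment columns with replacement. The sketch below illustrates that idea in Python; it is not seqboot itself, and the dictionary representation of the alignment, the function names, and the seed handling are assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    The same resampled column indices are applied to every taxon.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n=100, seed=0):
    """Generate n pseudoreplicate alignments (100 in the text)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]
```

Each replicate would then be passed to the tree-building step, and the resulting trees summarized by a majority-rule consensus.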

Inference of Ancestral Gene Sets. The homologous groups were created by using the Clusters of Orthologous Groups (COGs) database (23) in its 66-genomes version. Proteomes classified in COGs were retrieved from the COGs database. Six unclassified proteomes (B. henselae, B. quintana, B. suis, B. japonicum, Rhodopseudomonas palustris, and Wolbachia pipientis) were assigned COGs according to the following procedure: the proteins in each unclassified proteome were used first as queries and then as databases in separate blast searches with all proteomes in the COGs database. Each unclassified protein was added to the COG to which it had the highest number of symmetric best hits (BeTs), provided that the number of BeTs was >1. Because this procedure expanded the COGs, the same was done for all the unclassified proteins from the other species so as to also include proteins with BeTs to the newly assigned proteins. New clusters were then created from uncategorized proteins forming triangles of BeTs as described in ref. 23. Finally, clusters containing only two proteins were made from linear BeT relations, after which the remaining proteins were included as single genes.
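A minimal sketch of the BeT-based assignment step is given below. It is illustrative only: the `best_hit` lookup keyed by (protein, target genome), the `cog_of` mapping, and the function name are hypothetical, and ties between COGs are broken alphabetically, which the text does not specify.

```python
from collections import Counter


def assign_cog(protein, genome, best_hit, cog_of, min_bets=2):
    """Assign an unclassified protein to the COG with the most BeTs.

    best_hit: dict (protein, target_genome) -> best-scoring protein in that
              genome, covering searches in both directions.
    cog_of:   dict classified protein -> COG identifier.
    Returns a COG identifier, or None if no COG reaches min_bets (BeTs > 1).
    """
    bets_per_cog = Counter()
    for (query, _target_genome), subject in best_hit.items():
        if query != protein or subject not in cog_of:
            continue
        # Symmetric: the subject's best hit in the query's genome is the query.
        if best_hit.get((subject, genome)) == protein:
            bets_per_cog[cog_of[subject]] += 1
    if not bets_per_cog:
        return None
    cog, count = max(sorted(bets_per_cog.items()), key=lambda item: item[1])
    return cog if count >= min_bets else None
```

Proteins that return `None` here would remain candidates for the later triangle, linear, and single-gene clusters described above.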

The most parsimonious scenarios of α-proteobacterial genome evolution and the α-proteobacterial ancestor were reconstructed by character mapping by using generalized parsimony as implemented in paup* (version 4.0b10 for Unix) (24) on a rooted species tree, with acctran (accelerated transformation) (see Fig. 3) and deltran (delayed transformation) (Fig. 7, which is published as supporting information on the PNAS web site) options for parsimony analysis. Fig. 3 shows the results for penalties for duplications, deletions, and gene genesis of 1, 1, and 5, respectively. The selection of penalty values and results obtained for different penalty values are described in Fig. 7.
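To make the role of the penalties concrete, the following sketch applies Sankoff-style generalized parsimony to the copy number of a single gene family. It is an assumption-laden illustration rather than the paup* analysis: the cost scheme in `step_cost` (one penalty per copy gained or lost, with the genesis penalty charged when a family appears from zero copies), the nested-tuple tree, and the taxon names are all introduced here, and the sketch does not resolve ties the way acctran or deltran would.

```python
def step_cost(parent, child, dup=1, loss=1, genesis=5):
    """Cost of changing the copy number from parent to child on one branch."""
    if child == parent:
        return 0
    if child < parent:
        return (parent - child) * loss
    if parent == 0:
        # The family arises de novo, then any extra copies are duplications.
        return genesis + (child - 1) * dup
    return (child - parent) * dup


def sankoff(tree, leaf_counts, max_copies, cost=step_cost):
    """Return costs[s]: minimal cost of the subtree if its root has s copies.

    tree: a taxon name (leaf) or a tuple of subtrees (internal node).
    leaf_counts: dict taxon name -> observed copy number.
    """
    states = range(max_copies + 1)
    if isinstance(tree, str):
        return [0 if s == leaf_counts[tree] else float("inf") for s in states]
    totals = [0] * (max_copies + 1)
    for child in tree:
        child_costs = sankoff(child, leaf_counts, max_copies, cost)
        for s in states:
            totals[s] += min(cost(s, t) + child_costs[t] for t in states)
    return totals


# Toy example: the family is present in A and B but absent from C.
toy_tree = (("A", "B"), "C")
toy_counts = {"A": 1, "B": 1, "C": 0}
root_costs = sankoff(toy_tree, toy_counts, max_copies=2)
best_root_state = min(range(len(root_costs)), key=root_costs.__getitem__)
```

With penalties of 1, 1, and 5, the cheapest scenario in this toy case places one copy in the ancestor and a single loss on the branch to C. Lowering the genesis penalty to 1 makes a later origin equally cheap, which is why the inferred ancestral gene numbers depend on the penalty values.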

Fig. 3.

Inference of deletions/duplications and gene-genesis events based on the α-proteobacterial tree was made by using different clustering levels and penalty values. The inference was based on proteins already classified in COGs (23) to which we added COGs containing proteins in three or more species internally related by best hits (58,171 proteins in total) (a) and the complete set of proteins (73,658 proteins in total) (b). Inference of gene contents was made by using the acctran option for parsimony analysis in paup* with penalties for duplication, deletion, and gene genesis set to 1, 1, and 5, respectively. Numbers along branches refer to the number of duplications/losses/genesis, respectively. Numbers at nodes refer to the putative number of genes in the inferred genome at the node. Outgroup sequences are as described for Fig. 2, but they were pruned from the tree shown here. Abbreviations for species names are as described in the legends to Figs. 1 and 2.

The ancestral proteomes were inferred separately for protein families assigned to auxiliary (mega-COG) and main (main-COG) chromosomes. The criteria for inclusion in the mega-COG family were that ≥30% of the protein members were encoded on auxiliary replichores or symbiosis islands in the Rhizobiales. By using these criteria, 43% of the proteins encoded by the auxiliary replichores and 6% of chromosomally encoded proteins were members of the mega-COG families on average. Because many of the species-specific genes are located on the auxiliary replichores, we used the complete α-proteobacterial proteome for this analysis. The gene content of the inferred α-proteobacterial ancestral genome was compared with the estimated gene content of protomitochondria (15) and the LUCA (19) by using the presence or absence of a COG rather than the absolute numbers of genes.
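The inclusion rule for mega-COG families can be written as a short predicate. This is a sketch assuming each COG is available as a list of protein identifiers and that the set of proteins encoded on auxiliary replichores or symbiosis islands is known; both data structures and the function name are hypothetical.

```python
def is_mega_cog(members, auxiliary_proteins, threshold_percent=30):
    """Return True if at least threshold_percent of members are auxiliary.

    members: protein IDs belonging to one COG.
    auxiliary_proteins: set of protein IDs encoded on auxiliary replichores
    or symbiosis islands.
    """
    if not members:
        return False
    on_auxiliary = sum(1 for protein in members if protein in auxiliary_proteins)
    # Integer comparison avoids floating-point issues at exactly 30%.
    return on_auxiliary * 100 >= threshold_percent * len(members)
```

COGs for which the predicate is false would be treated as main-COG families in the separate reconstruction.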

Results and Discussion

Gene Function of α-Proteobacterial Genomes. To explore expansions in gene function with genome size for the α-proteobacteria (Table 1), we examined gene content statistics for 14 functional categories (Fig. 1). The relationships between gene content and genome size can be approximated with linear functions, with slopes ranging from four genes per megabase for basic information processes such as transcription and translation to >80 genes per megabase for energy metabolism, transport, and regulatory functions. Functional categories associated with environmental interactions (e.g., transport and regulation) were found to be the most variable among bacteria with different lifestyles. For example, the small genomes of obligate and facultative intracellular parasites have only a few regulatory and transport genes, whereas the larger genomes of free-living soil bacteria that alternate between environments of different nutritional quality contain hundreds of such genes. A rapid increase in the number of regulatory genes in relation to gene content has been observed (25, 26) and may be a general feature of all bacterial genomes.

Fig. 1.

Plot of genome size against gene content for each of the functional categories. RP, R. prowazekii; RC, R. conorii; BQ, B. quintana; BH, B. henselae; BM, B. melitensis; BS, B. suis; CC, C. crescentus; AT, A. tumefaciens; SM, S. meliloti; ML, M. loti; and BJ, B. japonicum. See Table 1 for genome sizes. The data were separated into two sections (a and b) to prevent overcrowding.

Extrapolation to the intercept of the y axis provides a measure of the minimal set of genes shared among the α-proteobacteria, which here is estimated at 250 genes (Table 4, which is published as supporting information on the PNAS web site). This set includes ≈200 genes for DNA, RNA, and protein biosynthesis and another 40 genes for nucleotide and cofactor biosynthesis. This is comparable with the minimal set of core genes in endosymbiotic bacteria (27) and with minimal gene numbers inferred by computational approaches (28) and from experimental knockout mutants of Bacillus subtilis (29).
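The extrapolation can be illustrated with an ordinary least-squares fit of gene count against genome size for one functional category. In the sketch below the genome sizes are those of the 11 species plotted in Fig. 1 (from Table 1), but the gene counts are hypothetical values generated from an exact line with a slope of four genes per megabase and an arbitrary intercept of 50. They are not data from the study.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Genome sizes in Mb for RP, RC, BQ, BH, BM, BS, CC, AT, SM, ML, BJ (Table 1).
sizes_mb = [1.1, 1.3, 1.6, 1.9, 3.3, 3.3, 4.0, 5.6, 6.7, 7.6, 9.1]
# Hypothetical gene counts for one category (illustrative only).
toy_counts = [4.0 * size + 50.0 for size in sizes_mb]

slope, intercept = least_squares_line(sizes_mb, toy_counts)
```

Summing the intercepts obtained in this way over all functional categories gives the kind of minimal shared gene set discussed above.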

The Species Tree for α-Proteobacteria. To place the dramatic shifts in genome size in an evolutionary context, we needed an underlying reliable species tree onto which the gene sets could be mapped. Because a few of the divergence nodes were not conclusively resolved in our rRNA tree (data not shown), we inferred the tree topology by using concatenated protein sequences (Fig. 2). To minimize topology inconsistencies caused by horizontal gene transfer and gene paralogy, we selected for this analysis a set of 38 genes sampled from regions with conserved gene order structures in the Rhizobiales (Fig. 6 and Table 3).

Fig. 2.

Phylogenetic relationship of 13 α-proteobacterial species (highlighted by the purple background) with 7 species from other proteobacterial subdivisions as outgroups. The topology, branch lengths, and bootstrap support are according to maximum-likelihood reconstructions with the Jones-Taylor-Thornton + 4ΓI model. Similar results were obtained with the neighbor-joining method and after removal of positions with gaps. A list of genes used for the phylogenetic reconstructions is given in Table 3. Abbreviations for species names are as described in the legend to Fig. 1 with the addition of the following taxa: WP, W. pipientis; RhP, R. palustris; CJ, C. jejuni; EC, E. coli; HP, H. pylori; PA, P. aeruginosa; RS, R. solanacearum; ST, S. typhimurium; and XF, X. fastidiosa.

The phylogenetic tree (Fig. 2), constructed by using the maximum-likelihood method, provided strong support for a clustering of the Rhizobiales to the exclusion of the earlier diverging lineages B. japonicum, C. crescentus, and the Rickettsiales. The two Bartonella species formed a clade with Brucella with high bootstrap support, as did A. tumefaciens and S. meliloti, which formed a separate clade. M. loti was placed with high support (>90%) close to the root of the Bartonella/Brucella clade. However, the branches separating M. loti from its neighboring clades are very short, and the placement of M. loti in the tree was found to be sensitive both to the methods used and to the genes and species sampled (data not shown). For all other divergences, the tree topology was robust. The branching order depicted in Fig. 2 represents our best estimate of the underlying species tree.

Computational Inference of Ancestral Gene Sets. We inferred ancestral α-proteobacterial proteomes and estimated the number of gene losses, duplications, and genesis events along each branch of the topology shown in Fig. 2 with character mapping using generalized parsimony (Figs. 3 and 7). Following the routines of previous work (18, 19), we included in the analysis proteins already classified in the COGs database (23) along with proteins encoded by genomes not yet incorporated in the COGs database but related to existing COGs by BeTs. This process resulted in a first data set of 56,337 proteins, to which we added 384 COGs containing proteins not related to any existing COGs but present in three or more species and internally related by BeTs. With the inclusion of these proteins, the data set amounted to 58,171 proteins, and the α-proteobacterial ancestral proteome was estimated at 3,300 proteins (Fig. 3a). The remaining proteins were assigned into single or linear protein COGs, which resulted in a data set that included all 73,658 proteins and yielded an ancestral proteome of >5,000 proteins (Fig. 3b). Because some of the species-specific genes may be rapidly evolving or incorrectly annotated as genes, their inclusion probably results in an overestimate of the ancestral proteome size (Fig. 3b), just as their exclusion may yield an underestimate (Fig. 3a). Thus, we set the lower and upper boundaries of the ancestral α-proteobacterial proteome at 3,000 and 5,000 proteins, respectively.

Metabolic Expansions and Contractions. The analyses of gene content alterations at the branches of the tree revealed two major trends that are observed irrespective of the different data sets and methods used (Fig. 4). First, massive genome size expansions accompanied the divergence of the plant-associated Rhizobiales, particularly the evolution of M. loti and B. japonicum. There seems to have been a gradual increase of genes encoding transcriptional regulators and proteins involved in the transport and metabolism of amino acids, nucleotides, carbohydrates, coenzymes, lipids, inorganic ions, and secondary metabolites. These expansions argue in favor of ancestral cells being visited by highly dynamic plasmids that introduced novel genes by duplication and/or genesis, some of which were maintained selectively in response to the increased use of soil compounds and the refined interactions with the progenitors of modern plant cells.

Fig. 4.

Net gene loss or gain throughout the evolution of the α-proteobacterial species. Arrows pointing upward indicate net gains of genes (G), and arrows pointing downward indicate net losses of genes (L). Colors and sizes of arrows refer to the net number of genes gained or lost at each branch. Colors of circles refer to the relative fraction of genes assigned to the different functional groups in the modern and inferred genome at the node. Yellow, information storage and processing; green, metabolism; red, cellular processes; blue, poorly characterized. Clustering groups and estimated frequencies are as described for Fig. 3a. Abbreviations for species names are as described in the legends to Figs. 1 and 2.

Extreme reductions of size occurred twice independently: in the ancestor of the obligate intracellular lineages Rickettsia and Wolbachia and in the ancestor of the facultative intracellular lineages Bartonella and Brucella. These losses have largely affected protein families for transcription regulation, transport, and metabolism of amino acids, nucleotides, carbohydrates, lipids, and other small molecules. Particularly notable is the independent loss of genes involved in secretory pathways, pilus assembly, and flagellar biosynthesis. The loss of genes associated with the transition from interactions with plants to animals in the ancestor of Bartonella and Brucella was not balanced by a corresponding gain of genes; no genes have homologs solely in Bartonella and Brucella (E < 0.001).

The number of genes eliminated before the split of Rickettsia and Wolbachia was estimated at 2,300-3,800 genes, as compared with ≈200-700 lost genes per lineage after the split (Fig. 3). The inverse correlation between gene loss and branch lengths for this part of the tree (compare Figs. 2 and 3) makes the lower frequency of gene-elimination events in recent times all the more striking. On average, the ratio of deletions to nucleotide substitutions was 25-fold higher before the split of Rickettsia and Wolbachia. A high frequency of gene loss relative to nucleotide substitutions was also observed immediately before the emergence of the intracellular lineages Bartonella and Brucella, which is reminiscent of the more rapid loss of genes at an early stage of genome reduction in aphid endosymbiont lineages, followed by genomic stasis (17). Overall, we observed no correlation between frequencies of amino acid substitutions and gene loss (r² = 0.14), gene duplication (r² = 0.02), or gene genesis (r² = 0.05), indicating dramatically different fixation rates for these mutations in the different lineages over time.
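The r² values quoted above are squared correlation coefficients between per-branch quantities. A minimal sketch of that calculation follows, assuming r² denotes the squared Pearson coefficient; the function name is introduced here, and no branch data from the study are reproduced.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)
```

Here `x` would hold the branch lengths in amino acid substitutions per site and `y` the number of losses, duplications, or genesis events inferred on the same branches.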

Gene Flux on Chromosomes and Auxiliary Replicons. Many species in the Rhizobiales contain auxiliary chromosomes (Table 1) that are characterized by less gene synteny than the main chromosomes (Fig. 6). To quantify the differences in mutational rates and patterns for genes located on different replicons, we inferred ancestral proteomes separately for COGs assigned to the auxiliary replicons (mega-COG) versus those assigned to the main chromosomes (main-COG). We classified a COG as a mega-COG if >30% of its protein members were encoded on an auxiliary replicon in A. tumefaciens, Brucella spp., S. meliloti, or on the symbiosis islands in M. loti and B. japonicum. In total, we classified 13% of the COGs as mega-COGs, which corresponds to 2,349 COGs (8,662 proteins) out of the complete set of 17,669 COGs (73,658 proteins) included in the analysis.

The results showed that 20-24% of the losses that occurred immediately before the Bartonella/Brucella divergence were associated with mega-COGs (Fig. 8, which is published as supporting information on the PNAS web site). Likewise, a substantial fraction of the identified duplications involved proteins in mega-COG families, as observed for example on the branch leading to the Rhizobiales (23%) and also on the branch separating these from R. palustris and B. japonicum (55%). In the terminal branches for S. meliloti and A. tumefaciens, all three types of mutational events were frequent for proteins classified in the mega-COG family, including 30% of duplications, 25% of losses, and 60% of gene-genesis events. Overall, mega-COGs accounted for 21% of changes below the α-proteobacterial ancestor. Considering that the mega-COGs only account for 13% of all COGs, the relative frequencies of deletions, duplications, and gene genesis were considerably higher for proteins classified in these families. We speculate that the auxiliary replicons were derived from plasmids that expanded by reiterative processes of duplication/deletion and horizontal gene-transfer events in the Rhizobiales.
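The proportions behind this argument can be checked with a few lines of arithmetic. The counts are those reported in the text; the variable names and the simple ratio used as an "enrichment" measure are assumptions of this sketch, not a statistic defined by the authors.

```python
mega_cogs, all_cogs = 2349, 17669            # COG counts reported in the text
mega_proteins, all_proteins = 8662, 73658    # protein counts reported in the text

cog_fraction = mega_cogs / all_cogs                  # about 0.133, i.e. 13%
protein_fraction = mega_proteins / all_proteins      # about 0.118

# Share of inferred changes (21%) relative to the share of COGs (13%).
enrichment = 0.21 / 0.13                             # roughly 1.6-fold
```

On this rough measure, changes fall in mega-COG families about 1.6 times more often than expected from their share of all COGs.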

Inferred Metabolism of the α-Proteobacterial Ancestor. Our pathway analysis of the core ancestral gene set identified in all the analyses (Table 5, which is published as supporting information on the PNAS web site) suggests that it contained genes for glycolysis and a complete system for aerobic respiration, as expected for a unicellular organism that was well adapted to the aerobic environment. Notable was its broad biosynthetic capability and the presence of multiple genes for regulatory and transport functions. The analysis further identified genes for flagellar biosynthesis and type III and type IV secretion systems. Thus, the ancestor was probably a free-living, aerobic, and motile bacterium that had evolved elaborate communication mechanisms with other cells. Also present in the ancestor were genes for phage-related functions; however, these genes may incorrectly have been assigned to the ancestor because of multiple independent acquisitions of phage genes by horizontal gene transfer in some of the derived lineages.

A comparison of the α-proteobacterial ancestral genome with the gene content of the LUCA identified a small set of genes inferred to be present in the LUCA (19) but absent from our ancestral set. The number and identity of such genes depend on penalty values, but even for the highest penalty values it was observed that a set of genes, including those for homoserine kinase, uridine kinase, endonuclease IV, and glutamyl-tRNA reductase, were predicted to be present in the LUCA but were absent from the α-proteobacterial ancestor. These might have been lost before the divergence of the α-proteobacterial ancestor or, alternatively, been incorrectly assigned to the LUCA.

Comparing the α-Proteobacterial Ancestor with the Mitochondrial Ancestor. The endosymbiotic theory postulates that mitochondria evolved by massive gene loss and transfer of genes from the common ancestor to the nuclear genome of the host cell. A total of 630 orthologous groups display a close phylogenetic relationship between eukaryotes and α-proteobacteria (15). These represent a minimal estimate of the protomitochondrial proteome, because some gene transfers may have been missed because of weak phylogenetic signals and others may have been lost from the eukaryotic genomes included in the analysis. We compared the 630 α-proteobacterial gene groups with the set of COGs inferred to be putatively present in the α-proteobacterial ancestor. The protomitochondrial set includes 487 genes in 412 COG-associated groups (15), all of which belong to the 3,300 genes in the >3,100 COGs of our ancestor (Fig. 3a). Of the 143 protomitochondrial groups not associated with a COG, 92 are represented in the ancestral gene pool. Most of the 51 groups missing from our data set consist of hypothetical proteins or proteins with unknown functions.
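The bookkeeping in this paragraph can be verified directly. The sketch below reads the 487 COG-associated entries as orthologous groups, which is the reading under which the 487 and the 143 groups without a COG add up to the 630 groups of ref. 15; that reading and the variable names are assumptions of this example.

```python
total_groups = 630        # protomitochondrial orthologous groups (ref. 15)
cog_associated = 487      # entries associated with a COG, all in the ancestor
found_without_cog = 92    # groups without a COG but present in the ancestor

without_cog = total_groups - cog_associated           # 143
missing = without_cog - found_without_cog             # 51
represented = cog_associated + found_without_cog      # 579
represented_fraction = represented / total_groups     # about 0.92
```

Under this reading, roughly 92% of the minimal protomitochondrial set is recovered in the inferred α-proteobacterial ancestor.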

Phylogenetic analyses of rRNA sequences, protein subunits of the respiratory chain complexes, and concatenated protein alignments suggest that mitochondria evolved from the α-proteobacteria, with no evidence for multiple independent acquisitions (12, 13, 30-32). Although several studies have placed mitochondria as a deeply diverging sister clade near the Rickettsiales (30-32), the exact position is still debated. Here, we consider the gene set of the reconstructed α-proteobacterial ancestor as an upper limit of the protomitochondrial proteome. To estimate how many of these ancestral genes may, at the most, have been transferred to the host nuclear genome, we selected the complete set of COGs present in the α-proteobacterial ancestor and used them as queries in sequence-similarity searches against eukaryotic genomes. As expected, the number of COGs showing significant sequence similarity to eukaryotic genes decreased with increasing blast scores from ≈1,700 (score ≥50) to 850 (score ≥150) (Fig. 5). The remaining 1,144 ancestral COGs without eukaryotic homologs (score ≤40) represent putative gene losses. The genes in these COGs display a broad taxonomic distribution in bacteria (data not shown), and surprisingly many (>45%) encode proteins of unknown or poorly characterized function (Table 2). Future functional analyses of these genes may provide the answers as to why these genes were not transferred to the eukaryotes.
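The score-based partition of ancestral COGs can be sketched as follows. The thresholds (≥50 for a homolog, ≤40 for no homolog) come from the text; the dictionary of best eukaryotic blast scores per COG, the function names, and the "intermediate" label for scores between the two thresholds are assumptions made for illustration.

```python
def classify_by_score(best_scores, hom_min=50, no_hom_max=40):
    """Partition COGs by their best blast score against eukaryotic genomes.

    best_scores: dict COG identifier -> best score (0 if no hit at all).
    """
    groups = {"with_homolog": [], "without_homolog": [], "intermediate": []}
    for cog, score in sorted(best_scores.items()):
        if score >= hom_min:
            groups["with_homolog"].append(cog)
        elif score <= no_hom_max:
            groups["without_homolog"].append(cog)
        else:
            groups["intermediate"].append(cog)
    return groups


def count_at_threshold(best_scores, threshold):
    """Number of COGs whose best score reaches the threshold (as in Fig. 5)."""
    return sum(1 for score in best_scores.values() if score >= threshold)
```

Calling `count_at_threshold` over a range of thresholds from 50 to 150 would trace the kind of decreasing curve described for Fig. 5.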

Fig. 5.

Number of COGs in the α-proteobacterial ancestor (Fig. 3a) with sequence similarity to eukaryotic genes for different blast score values. Estimated number of COGs that show similarity to eukaryotic genes in the inferred proteomes of the α-proteobacterial ancestor (upper curve) and the minimal protomitochondrial ancestor (lower curve) (15).

Table 2. Relative fraction of COGs in the α-proteobacterial ancestor (Fig. 3b) sorted according to broad functional categories

| Functional category | +Hom* | Min† | −Hom* |
|---|---|---|---|
| Cellular processes | 17 | 15 | 12 |
| Information processes | 15 | 15 | 6 |
| Metabolism | 45 | 53 | 14 |
| Poorly characterized | 20 | 17 | 45 |
| New clusters‡ | 3 | 0 | 23 |

*Values are percentages of COGs in the α-proteobacterial ancestor with homologs (score ≥50) (+Hom) and without homologs (score ≤40) (−Hom) in eukaryotic genomes.

†Values are percentages of COGs in the minimal (Min) protomitochondrial genome (15) with homologs in eukaryotic genomes (score ≥50).

‡Uncategorized clusters created in this analysis.

Concluding Remarks

This study represents an attempt to quantify the different mutational changes that underlie genome size alterations in the α-proteobacteria. We observed no correlation between nucleotide substitution rates and fixation rates for mutations that affect genome contents. On the contrary, our results strongly suggest that the inferred frequencies of deletions, duplications, and horizontal gene transfers depend on population sizes and bacterial lifestyle features. In particular, the data support the suggested correlation between transitions to intracellular growth habitats and genome size reductions, with the highest frequencies of gene loss at early stages of the transition (17).

The stability of the main chromosomes of the Rhizobiales, displayed as segments with conserved gene synteny, contrasts with otherwise high substitution rates and extensive gene-content differences. Expansions and contractions in the genomic repertoire have mostly affected genes involved in environmental interactions; these typically are located on the auxiliary replichores and evolve with very high turnover rates. It is possible that we have underestimated these rates at the internal branches of the tree because of multiple insertion/deletion events. High intrinsic rates for duplications/deletions and horizontal gene transfers may serve as an efficient mutational engine that enables rapid responses to alterations in the environmental conditions when subjected to strong selective pressures.

Although the estimated frequencies of duplication and gene-genesis events depend on the penalties assigned to these events, our study clearly demonstrates the importance of gene duplications for expanding and diversifying the metabolic and regulatory capacities of the bacterial cell. A consequence of high duplication and deletion rates is that the number of paralogous proteins may be much larger than previously anticipated. In effect, the many different protein variants do not necessarily trace back to one ancestral giant gene pool but may have arisen throughout evolution via reiterative processes of duplication and loss. The continuous generation of novel paralogs may provide one explanation for the difficulty of obtaining congruent single-gene trees in phylogenomic surveys (1-5).

Computational inference of ancestral genomes with refined models that account for the relative frequencies of the different types of mutational events in the different lineages will provide more detailed scenarios of genome size evolution in the α-proteobacteria and other bacterial subdivisions.

Acknowledgments

This research was supported by grants from the Swedish Research Council, the Swedish Foundation for Strategic Research, and the Wallenberg Foundation.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: LUCA, last universal common ancestor; COGs, Clusters of Orthologous Groups; BeT, symmetric best hit.

References

  • 1. Doolittle, W. F. (1999) Science 284, 2124-2128.
  • 2. Snel, B., Bork, P. & Huynen, M. (1999) Nat. Genet. 21, 108-110.
  • 3. Sicheritz-Ponten, T. & Andersson, S. G. E. (2001) Nucleic Acids Res. 29, 545-552.
  • 4. Kurland, C. G., Canback, B. & Berg, O. G. (2003) Proc. Natl. Acad. Sci. USA 100, 9658-9662.
  • 5. Daubin, V., Moran, N. A. & Ochman, H. (2003) Science 301, 829-832.
  • 6. Andersson, S. G. E., Zomorodipour, A., Andersson, J. O., Sicheritz-Ponten, T., Alsmark, U. C. M., Podowski, R. M., Näslund, K., Eriksson, A.-S., Winkler, H. H. & Kurland, C. G. (1998) Nature 396, 133-140.
  • 7. Ogata, H., Audic, S., Renesto-Audiffren, P., Fournier, P. E., Barbe, V., Samson, D., Roux, V., Cossart, P., Weissenbach, J., Claverie, J. M. & Raoult, D. (2001) Science 293, 2093-2098.
  • 8. Goodner, B., Hinkle, G., Gattung, S., Miller, N., Blanchard, M., Qurollo, B., Goldman, B. S., Cao, Y., Askenazi, M., Halling, C., et al. (2001) Science 294, 2323-2328.
  • 9. Wood, D. W., Setubal, J. C., Kaul, R., Monks, D. E., Kitajima, J. P., Okura, V. K., Zhou, Y., Chen, L., Wood, G. E., Almeida, N. F., Jr., et al. (2001) Science 294, 2317-2322.
  • 10. Galibert, F., Finan, T. M., Long, S. R., Puhler, A., Abola, P., Ampe, F., Barloy-Hubler, F., Barnett, M. J., Becker, A., Boistard, P., et al. (2001) Science 293, 668-672.
  • 11. Kaneko, T., Nakamura, Y., Sato, S., Minamisawa, K., Uchiumi, T., Sasamoto, S., Watanabe, A., Idesawa, K., Iriguchi, M., Kawashima, K., et al. (2002) DNA Res. 9, 189-197.
  • 12. Wu, M., Sun, L. V., Vamathevan, J., Riegler, M., Deboy, R., Brownlie, J. C., McGraw, E. A., Martin, W., Esser, C., Ahmadinejad, N., et al. (2004) PLoS Biol. 2, 327-341.
  • 13. Gray, M., Burger, G. & Lang, B. F. (1999) Science 283, 1476-1481.
  • 14. Karlberg, O. & Andersson, S. G. E. (2003) Nat. Rev. Genet. 4, 391-397.
  • 15. Gabaldon, T. & Huynen, M. A. (2003) Science 301, 609.
  • 16. Karlberg, E. O., Canbäck, B., Kurland, C. G. & Andersson, S. G. E. (2000) Yeast 17, 170-187.
  • 17. Tamas, I., Klasson, L., Canbäck, B., Näslund, A. K., Eriksson, A.-S., Wernegreen, J. J., Sandström, J. P., Moran, N. A. & Andersson, S. G. E. (2002) Science 296, 2376-2379.
  • 18. Snel, B., Bork, P. & Huynen, M. (2002) Genome Res. 12, 17-25.
  • 19. Mirkin, B. G., Fenner, T. I., Galperin, M. Y. & Koonin, E. V. (2003) BMC Evol. Biol. 3, 2.
  • 20. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389-3402.
  • 21. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673-4680.
  • 22. Guindon, S. & Gascuel, O. (2003) Syst. Biol. 52, 696-704.
  • 23. Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997) Science 278, 631-637.
  • 24. Swofford, D. L. (1998) Phylogenetic Analysis Using Parsimony (PAUP) (Sinauer, Sunderland, MA), Version 4.0b10.
  • 25. Nimwegen, E. (2003) Trends Genet. 19, 479-484.
  • 26. Konstantinidis, K. T. & Tiedje, J. M. (2004) Proc. Natl. Acad. Sci. USA 101, 3160-3165.
  • 27. Klasson, L. & Andersson, S. G. E. (2004) Trends Microbiol. 12, 37-43.
  • 28. Koonin, E. V. (2000) Annu. Rev. Genomics Hum. Genet. 1, 99-116.
  • 29. Kobayashi, K. (2003) Proc. Natl. Acad. Sci. USA 100, 4678-4683.
  • 30. Olsen, G. J., Woese, C. R. & Overbeek, R. (1994) J. Bacteriol. 176, 1-6.
  • 31. Viale, A. & Arakaki, A. K. (1994) FEBS Lett. 341, 146-151.
  • 32. Emelyanov, V. (2003) Arch. Biochem. Biophys. 420, 130-141.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
