Abstract
Background
Orthologous genes are frequently presumed to perform similar functions. However, outside of model organisms, this is rarely tested. One means of inferring changes in function is to look for changes in the level of gene conservation and selective constraint. Here we compare levels of gene conservation across three bacterial groups to test for changes in gene functionality.
Findings
The level of gene conservation for different orthologous genes is highly correlated across clades, even for highly divergent groups of bacteria. These correlations do not arise from broad differences in gene functionality (e.g. informational genes vs. metabolic genes), but instead seem to result from very specific differences in gene function. Furthermore, these functional differences appear to be maintained over very long periods of time.
Conclusion
These results suggest that even over broad time scales, most bacterial genes are under a nearly constant level of purifying selection, and that bacterial evolution is thus dominated by selective and functional stasis.
Background
We are interested in whether the functional importance of orthologous genes changes across bacterial taxa. We pose a simple question: if a gene plays an important role for the functioning of a bacterium, is the orthologue of that gene in a distantly related bacterium also particularly important? To measure functional importance, we look at how strongly genes are conserved over time. If a group of orthologous genes are well conserved across different taxa, this implies that strong purifying selection is acting to maintain these genes. Similarly, if a group of orthologous genes are lost quickly across different taxa, this implies that purifying selection is acting only weakly to maintain these genes. If the strength of purifying selection for individual genes does not change across bacterial groups (i.e. if there is a correlation in the level of orthologue conservation), this implies that the functional importance of most orthologous genes does not change quickly. On the other hand, if there is little correlation in the level of gene conservation across bacterial groups, this implies that the functional importance of many orthologues changes often, perhaps because of differences in the genetic backgrounds of organisms (e.g. compensating mechanisms at other loci).
Here we show that between three bacterial clades, the levels of conservation for specific orthologues are highly correlated. This correlation remains even when examining subgroups of functionally similar genes, such as genes involved in ribosome function. This suggests that despite large differences in genetic background, the strength of selection acting to maintain any specific orthologue remains approximately constant, and that most genes maintain their specific functionality over long periods of time.
We used stochastic character mapping [1] to calculate a measure of gene conservation that accounts for phylogenetic relatedness between taxa. Briefly, for each protein coding gene in E. coli K12 W3110, we determined whether an orthologous gene was present or absent for all other bacteria with fully sequenced genomes (447 other genomes in total). Together with information on the phylogenetic history, these data were used to calculate a parameter for each orthologue that reflects the rate (probability per unit time) at which the orthologue will be lost or gained along a branch (see Additional file 1). Because this parameter value is mostly determined by how quickly an orthologue is lost over time, we term this parameter the rate of orthologue loss (ROL). Low ROL values imply that along a branch, there is a low probability of that orthologue being lost. High ROL values imply that along a branch, there is a high probability that the orthologue will be lost. For each orthologue, one ROL was calculated for all branches in a clade.
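To make the notion of a per-branch rate concrete, the following is a minimal sketch, assuming a symmetric two-state (present/absent) continuous-time Markov model with a single rate per orthologue. This is an illustrative simplification and not the SIMMAP implementation; the function name and the example rate and branch lengths are hypothetical.

```python
import math


def prob_state_change(rate: float, branch_length: float) -> float:
    """Probability that presence/absence differs between the two ends of a branch.

    Assumes a symmetric two-state continuous-time Markov model in which loss and
    gain share one rate (an assumption made for this sketch only).
    """
    if rate < 0 or branch_length < 0:
        raise ValueError("rate and branch_length must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * rate * branch_length))


# Hypothetical values: a low-rate and a high-rate orthologue on the same branch.
low_rate_change = prob_state_change(rate=0.1, branch_length=0.5)
high_rate_change = prob_state_change(rate=10.0, branch_length=0.5)
```

Under this simplified model, a higher rate always gives a higher probability of a state change along a branch of fixed length, and the probability saturates at 0.5 for very long branches or very high rates.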
Methods
All genomes were downloaded from the NCBI database in May of 2007 (Additional file 2), and a phylogeny was constructed using a concatenated set of 73 conserved orthologues (Additional files 3 and 4). The program SIMMAP [2] was used to calculate all ROL values (Additional file 5), and gene functional classes were divided using MultiFun [3]. For detailed materials and methods, see Additional file 1.
Results
Constancy of ROL across different bacterial clades
We tested whether the ROL values for specific orthologues change between different clades of bacteria. We calculated the ROL values for orthologous genes in three bacterial groups: the γ- and β-proteobacteria; the α-proteobacteria, which are the sister clade of the γ-β-proteobacteria and diverged approximately 2.5 billion years ago [4]; and the Bacilli-Mollicutes clade, which diverged from the γ-β-proteobacteria just over three billion years ago [4]. For all of these clades, we found that the ROL values for orthologous genes were highly correlated (Fig. 1; r2 = 0.673, Spearman's ρ = 0.756 for γ-β-proteobacteria versus α-proteobacteria; r2 = 0.488, Spearman's ρ = 0.628 for γ-β-proteobacteria versus Bacilli-Mollicutes; all data are listed in Additional file 5).
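Correlations of this kind can be sketched as follows. The example computes a coefficient of determination (the squared Pearson coefficient of a simple linear fit) and a rank-based correlation coefficient for paired ROL values. The ROL values are invented for illustration, the helper names are hypothetical, and whether the original analysis transformed the values first is not specified here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Rank correlation: the Pearson coefficient of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical ROL values for the same five orthologues in two clades.
rol_clade_a = [0.1, 0.5, 2.0, 8.0, 60.0]
rol_clade_b = [0.2, 0.4, 3.0, 5.0, 90.0]

r_squared = pearson_r(rol_clade_a, rol_clade_b) ** 2
rho = spearman_rho(rol_clade_a, rol_clade_b)
```

Because ROL values span several orders of magnitude, a rank-based coefficient is less sensitive to a few very large values than a linear fit on the raw values.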
A simple explanation for the high correlation between ROL values is that it is driven by differences in gene essentiality: essential genes will be strongly conserved, while nonessential genes will be weakly conserved. To test this, we divided the genes into essential and nonessential groups, based on the experimental results from two recent studies that used either E. coli K12 MG1655 [5] or E. coli K12 BW25113 [6]. We disregarded any discrepancies in essentiality annotation between the two studies and focused only on those genes for which they agree on the classification of essentiality. We found that even when excluding orthologues that are classified as essential in E. coli, the correlations remained very high (Fig. 1; r2 = 0.539 and r2 = 0.332, respectively).
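The restriction to genes on which both studies agree can be sketched as a simple filter. This is a minimal illustration, assuming each study is represented as a mapping from gene name to a boolean essentiality call; the gene names and calls below are invented.

```python
def consensus_calls(study_a, study_b):
    """Keep only genes annotated in both studies with the same call.

    Each argument maps gene name -> True (essential) or False (nonessential).
    """
    shared = study_a.keys() & study_b.keys()
    return {gene: study_a[gene] for gene in shared if study_a[gene] == study_b[gene]}


# Hypothetical annotations; gene names and calls are invented for illustration.
study_a = {"gene1": True, "gene2": False, "gene3": True, "gene4": False}
study_b = {"gene1": True, "gene2": False, "gene3": False, "gene5": False}

consensus = consensus_calls(study_a, study_b)
# Genes classified as nonessential by both studies.
nonessential = sorted(gene for gene, essential in consensus.items() if not essential)
```

Genes with conflicting calls (gene3 here) and genes annotated in only one study (gene4, gene5) drop out of the consensus set.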
A second explanation for the high correlation between ROL values is that it is driven by differences between functional classes of genes. For example, informational orthologues may be highly conserved, whereas genes involved in metabolic functions may be less conserved. To test this hypothesis, we calculated the correlation coefficients for ROL values of orthologues within single functional classes of genes as delineated by MultiFun [3] (see Additional file 1). We found that within MultiFun classes, ROL values between bacterial groups were again highly correlated, even when considering only nonessential genes. The r2 values between γ-β-proteobacteria and α-proteobacteria varied from 0.740 (for information transfer genes related to DNA, MultiFun class 2.1) to 0.260 (for carbon utilization genes, class 1.1) (Fig. 2). The r2 values between γ-β-proteobacteria and Bacilli-Mollicutes varied from 0.600 (for structural genes in the ribosome, MultiFun class 6.6) to 0.020 (for structural genes responsible for surface antigens, class 6.3) (Fig. 2). Together, these data suggest that ROL values remain constant over long stretches of time, on the order of billions of years, and that this constancy is driven neither by broad differences in gene functionality, nor differences in gene essentiality, but by specific differences in gene function.
The relationship between ROL and gene essentiality
We have assumed above that the level of gene conservation reflects the strength of purifying selection acting on a gene: well-conserved genes are under strong purifying selection, while less conserved genes experience only weak purifying selection. Here we test this assumption by asking how well our measure of gene conservation, ROL, corresponds to growth phenotypes, which we know to be under selection. Specifically, if the deletion of a gene causes lethality even under benign laboratory conditions, then the loss of this gene is almost certainly lethal in the natural environment, and the gene is thus under strong purifying selection. We first ask, then, how well ROL values correlate with annotations of gene essentiality. The ROL values for essential and nonessential genes are shown in Fig. 3A. On average, genes that have been classified as essential in E. coli K12 have a dramatically lower ROL than non-essential genes.
To quantify the relationship between ROL and essentiality, we used a receiver operator characteristic (ROC) curve. This curve describes the relationship between the fraction of false positives and the fraction of true positives when using ROL to discriminate between essential and nonessential genes. One means of quantifying this relationship is by calculating the area under the ROC curve (the AUC), which is equivalent to the probability that a randomly chosen essential gene will have a lower ROL than a randomly chosen nonessential gene [7]. If ROL were perfectly predictive of gene essentiality, the AUC would be 1.0; the AUC for this analysis was 0.947 (Fig. 3B), which strongly suggests that ROL values do reflect the strength of purifying selection acting on a gene.
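The probabilistic interpretation of the AUC given above can be computed directly by comparing every essential gene with every nonessential gene. The following is a minimal sketch of that pairwise definition, with ties counted as one half; the function name and the ROL values are hypothetical.

```python
def auc_lower_is_essential(essential_rol, nonessential_rol):
    """Fraction of (essential, nonessential) pairs in which the essential gene
    has the lower ROL; tied pairs contribute one half."""
    if not essential_rol or not nonessential_rol:
        raise ValueError("both groups must be non-empty")
    score = 0.0
    for e in essential_rol:
        for n in nonessential_rol:
            if e < n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(essential_rol) * len(nonessential_rol))


# Hypothetical ROL values for three essential and four nonessential genes.
essential = [0.1, 0.3, 2.5]
nonessential = [0.2, 1.0, 4.0, 9.0]

auc = auc_lower_is_essential(essential, nonessential)  # 9 of 12 pairs -> 0.75
```

The pairwise loop is quadratic in the number of genes, which is acceptable for a few thousand genes; a rank-based formula gives the same value more efficiently for larger datasets.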
We also asked whether ROL values and the quantitative effects of gene deletions are correlated. Using data on growth yield in rich media of deletion mutants [6], we found a small but highly significant relationship between a gene's ROL value and the growth yield of that deletion strain (Fig. S2; r2 = 0.0628, p < 0.0001; Spearman's ρ = 0.127, p < 0.0001). Again, this suggests that ROL values reflect the strength of purifying selection acting on a gene.
Conclusion
We have shown that ROL values for specific orthologues are correlated over broad evolutionary distances, and that these correlations remain strong even within specific functional classes of genes and for genes that are not essential for cellular viability. In other words, the constancy of the level of gene conservation across bacterial orders seems to result from specific differences in gene function. The strength of the correlations we find here is of similar magnitude to that found in a previous study of correlations between protein evolutionary rates within the Chlamydiaceae [8]. Notably, the Chlamydiaceae are far more closely related than the clades considered here, so a high correlation within that group should not be surprising. However, we have also considered selection on a more general scale (gene presence versus gene absence), which likely increases the strength of the correlations. Interestingly, for some orthologues, ROL values have changed considerably across taxonomic groups (we show three examples in Figs. 4 and 5). We propose that these genes have changed in functional importance, resulting in either increased or decreased purifying selection.
Some essential E. coli genes have orthologues that are consistently lost at high rates among other γ-β-proteobacteria, α-proteobacteria, and Bacilli-Mollicutes, contrary to the high level of conservation expected for essential genes. This is not due to these genes only being essential in E. coli and nonessential in other taxa. In Table 1 we show a list of genes that are essential in E. coli K12 and which have high ROL values (greater than 2.4 in all three bacterial groups studied; Fig. 1), together with data from an empirical study of gene essentiality in the γ-proteobacterium Acinetobacter baylyi [9]. Of nine genes with an orthologue in Acinetobacter, eight are also essential in Acinetobacter. This suggests, surprisingly, that some genes, despite being essential, are lost frequently, and is consistent with the view that compensation at other sites in the genome may occur even for "essential" functions.
Table 1.
K12 gene | Gene function | ROL γ-β | ROL α | ROL B-M | Essential in Acinetobacter
asd | ASA dehydrogenase | 3.07 | n.a. | n.a. | DB |
can | carbonic anhydrase | 4.07 | 11.67 | n.a. | - |
degS | serine endoprotease | 8.63 | 9.54 | 7.77 | NON |
dnaC | DNA biosynthesis protein | 62.30 | n.a. | 94.06 | - |
fabA | HD thioester dehydrase | 2.78 | 2.42 | n.a. | - |
fabB | 3-oxoacyl-[acp] synthase | 3.55 | 2.75 | n.a. | DB |
fbaA | FBP aldolase | 3.21 | n.a. | n.a. | - |
fldA | flavodoxin 1 | 4.73 | 22.92 | 16.86 | - |
ftsE | transporter subunit | 3.05 | 2.48 | 8.79 | - |
ftsK | chromosome partitioning | 7.20 | 2.55 | 10.41 | ESS |
ftsL | cell division protein | 4.59 | n.a. | n.a. | - |
ftsN | cell division protein | 12.64 | n.a. | n.a. | - |
ftsX | transporter subunit | 2.87 | 13.32 | 9.10 | - |
hemD | uroporphyrinogen synthase | 2.75 | n.a. | n.a. | DB |
hemG | protoporphyrin oxidase | 8.47 | n.a. | n.a. | - |
holB | DNA pol III subunit | 5.08 | 19.02 | 32.24 | -
lolB | chaperone for lipoproteins | 2.67 | n.a. | n.a. | ESS |
lolE | lipoprotein transporter | 7.92 | 5.89 | 55.12 | - |
mreD | cell wall structural complex | 3.53 | n.a. | n.a. | - |
mukB | chromosome partitioning | 13.42 | n.a. | 94.78 | - |
mukE | chromosome partitioning | 7.41 | n.a. | n.a. | - |
mukF | chromosome partitioning | 7.16 | n.a. | n.a. | - |
nrdA | RDP reductase subunit | 4.91 | n.a. | n.a. | - |
nrdB | RDP reductase subunit | 3.03 | 5.75 | n.a. | ESS |
plsB | O-acyltransferase | 5.09 | n.a. | n.a. | ESS |
plsC | acyltransferase | 4.06 | 5.79 | 32.81 | - |
psd | decarboxylase | 3.28 | 36.74 | 13.10 | ESS |
pssA | phosphatidylserine synthase | 7.12 | n.a. | n.a. | - |
rlpB | minor lipoprotein | 8.13 | n.a. | n.a. | - |
secM | regulator of secA | 10.80 | n.a. | n.a. | - |
yejM | hydrolase, inner membrane | 6.76 | n.a. | n.a. | - |
yrbK | conserved protein | 8.94 | n.a. | n.a. | - |
yrfF | inner membrane protein | 7.93 | n.a. | n.a. | - |
zipA | cell division protein | 3.33 | n.a. | n.a. | - |
Only those E. coli K12 genes with high ROL values in γ-β-proteobacteria, α-proteobacteria, and Bacilli-Mollicutes (ROL values greater than 2.4 in all three groups; Fig. 1) are shown here. Of these 34 genes, nine have an orthologue in Acinetobacter, and eight of these have an essential function in Acinetobacter. Many of the genes lost at high rates from γ-β-proteobacteria have orthologues in fewer than 10% of the α-proteobacteria or Bacilli-Mollicutes; these are indicated by n.a. Essentiality in Acinetobacter is taken from [9]: ESS, genes for which no deletion mutants were isolated in Acinetobacter; DB, genes in which only "double band" mutants containing chromosomal duplications were isolated in Acinetobacter (and which are thus likely to be essential); NON, genes for which deletion mutants were isolated in Acinetobacter, and which are thus considered to be non-essential; '-', genes without an orthologue in Acinetobacter.
Many of the essential genes that are lost at high rates are recent innovations. Considering those genes that are essential in E. coli K12 but are lost at high rates from other γ-β-proteobacteria, 44% (18 out of 41) have a distribution restricted to the γ-β-proteobacteria and are thus likely to be relatively recent additions to the genomic repertoire. In contrast, of the essential genes with low ROL values (less than 2.4), only 0.9% (2 out of 222) are restricted to the γ-β-proteobacteria. Previous work has shown that recently acquired genes tend to be incorporated at the edge of the cellular network [10]. Such peripheral genes may thus be more easily removed from the genome, with fewer interactions to compensate.
These results confirm and extend previous studies that have investigated the relationship between essentiality and gene conservation [11-13]. However, here we have used a phylogenetically corrected measure of gene conservation (ROL). Additionally, we have found that the ability of orthologue conservation to predict gene essentiality is far higher than has previously been realized [11], most likely due to the lower accuracy of earlier datasets. Finally, we have shown for the first time a correlation between gene conservation and quantitative measures of deletion phenotypes (growth yield, Fig. S2).
Our metric of gene conservation, which takes into account phylogenetic history, provides a considerable improvement over simpler measures such as the fraction of taxa that retain a specific orthologue (retention). Using retention to predict essentiality yields an AUC of 0.937, meaning that essential genes are incorrectly ranked higher than nonessential genes 6.3% of the time. Using ROL, the misclassified fraction is reduced to 5.3%, a reduction of 16% in the error rate. ROL has the additional advantage of being based on a specific evolutionary model, which itself may provide biological insights, for example into the relative rates of gene loss versus horizontal transfer (i.e. the ratio of gene loss versus gene gain in lineages).
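The arithmetic behind this comparison can be checked with a short sketch. It treats 1 − AUC as the fraction of essential/nonessential pairs that are ranked in the wrong order and uses the two AUC values reported in this article (0.937 for retention, 0.947 for ROL); the function names are hypothetical.

```python
def misranking_rate(auc: float) -> float:
    """Fraction of essential/nonessential pairs ranked in the wrong order."""
    return 1.0 - auc


def relative_error_reduction(auc_baseline: float, auc_improved: float) -> float:
    """Relative drop in the misranking rate when moving from baseline to improved."""
    baseline_error = misranking_rate(auc_baseline)
    improved_error = misranking_rate(auc_improved)
    return (baseline_error - improved_error) / baseline_error


retention_error = misranking_rate(0.937)            # about 0.063
rol_error = misranking_rate(0.947)                  # about 0.053
reduction = relative_error_reduction(0.937, 0.947)  # about 0.16
```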
Finally, we note that high-throughput experimental assessments of gene essentiality are prone to both false positive and false negative results (i.e. annotating a non-essential gene as essential and vice versa). The level of agreement on essentiality between the two most recent studies of gene essentiality [5,6] is similar to the level of agreement between both studies and ROL (all are between 94% and 95%), and far greater than between the first experimental study of gene essentiality [14] and the latter two experimental studies. This suggests that ROL may be a valid and useful means of cross-validating experimental studies in order to find genes likely to be false positives or false negatives, which could then be reexamined.
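Such a cross-validation could be sketched as follows. This is a minimal illustration, assuming essentiality calls are stored as mappings from gene name to a boolean and that ROL is converted to a call with a fixed cutoff. The cutoff of 2.4 is borrowed from the boundary between high and low ROL values used elsewhere in this article, and its use for this purpose is an assumption; gene names and values are invented.

```python
def agreement(calls_a, calls_b):
    """Fraction of shared genes on which two sets of essentiality calls agree."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("no genes in common")
    return sum(calls_a[g] == calls_b[g] for g in shared) / len(shared)


def calls_from_rol(rol_values, threshold):
    """Call a gene essential when its ROL is below the threshold (an assumption)."""
    return {gene: rol < threshold for gene, rol in rol_values.items()}


def candidates_for_reexamination(experimental_calls, rol_calls):
    """Genes whose experimental call disagrees with the ROL-based call."""
    shared = experimental_calls.keys() & rol_calls.keys()
    return sorted(g for g in shared if experimental_calls[g] != rol_calls[g])


# Hypothetical data; gene names, calls, and ROL values are invented.
experimental = {"gene1": True, "gene2": False, "gene3": True, "gene4": False}
rol = {"gene1": 0.3, "gene2": 12.0, "gene3": 7.5, "gene4": 1.1}

predicted = calls_from_rol(rol, threshold=2.4)
fraction_agreeing = agreement(experimental, predicted)
to_recheck = candidates_for_reexamination(experimental, predicted)
```

In this toy example, gene3 (called essential experimentally but with a high ROL) and gene4 (called nonessential but with a low ROL) would be flagged for reexamination.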
Abbreviations
ROC: receiver operator characteristic; AUC: area under the ROC curve; ROL: rate of orthologue loss; W3110: E. coli K12 W3110.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
OKS and MA conceived of the study. OKS performed the bioinformatic and phylogenetic analyses and drafted the manuscript. MA edited the manuscript.
Acknowledgements
We thank the Theoretical Biology group at ETH Zurich for discussions. Funding was provided by the Roche Research Foundation and the Novartis Foundation (to OKS), and the Swiss National Science Foundation (to MA).
Contributor Information
Olin K Silander, Email: olin.silander@env.ethz.ch.
Martin Ackermann, Email: martin.ackermann@env.ethz.ch.
References
1. Huelsenbeck JP, Nielsen R, Bollback JP. Stochastic mapping of morphological characters. Systematic Biology. 2003;52:131–158. doi: 10.1080/10635150390192780.
2. Bollback JP. SIMMAP: Stochastic character mapping of discrete traits on phylogenies. BMC Bioinformatics. 2006;7:88. doi: 10.1186/1471-2105-7-88.
3. Serres MH, Riley M. MultiFun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microbial & Comparative Genomics. 2000;5:205–222. doi: 10.1089/omi.1.2000.5.205.
4. Battistuzzi FU, Feijao A, Hedges SB. A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land. BMC Evolutionary Biology. 2004;4:44. doi: 10.1186/1471-2148-4-44.
5. Kato JI, Hashimoto M. Construction of consecutive deletions of the Escherichia coli chromosome. Molecular Systems Biology. 2007;3:132. doi: 10.1038/msb4100174.
6. Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA, Tomita M, Wanner BL, Mori H. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular Systems Biology. 2006;2:2006.0008. doi: 10.1038/msb4100050.
7. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27:861–874. doi: 10.1016/j.patrec.2005.10.010.
8. Jordan IK, Kondrashov FA, Rogozin IB, Tatusov RL, Wolf YI, Koonin EV. Constant relative rate of protein evolution and detection of functional diversification among bacterial, archaeal and eukaryotic proteins. Genome Biology. 2001;2:research0053.0051–0053.0059. doi: 10.1186/gb-2001-2-12-research0053.
9. de Berardinis V, Vallenet D, Castelli V, Besnard M, Pinet A, Cruaud C, Samair S, Lechaplais C, Gyapay G, Richez C, Durot M, Kreimeyer A, Le Fèvre F, Schächter V, Pezo V, Döring V, Scarpelli C, Médigue C, Cohen GN, Marlière P, Salanoubat M, Weissenbach J. A complete collection of single-gene deletion mutants of Acinetobacter baylyi ADP1. Molecular Systems Biology. 2008;4:174. doi: 10.1038/msb.2008.10.
10. Pal C, Papp B, Lercher MJ. Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nature Genetics. 2005;37:1372–1375. doi: 10.1038/ng1686.
11. Gustafson AM, Snitkin ES, Parker SCJ, DeLisi C, Kasif S. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics. 2006;7:265. doi: 10.1186/1471-2164-7-265.
12. Jordan IK, Rogozin IB, Wolf YI, Koonin EV. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Research. 2002;12:962–968. doi: 10.1101/gr.87702.
13. Fang G, Rocha E, Danchin A. How essential are nonessential genes? Molecular Biology and Evolution. 2005;22:2147–2156. doi: 10.1093/molbev/msi211.
14. Gerdes SY, Scholle MD, Campbell JW, Balazsi G, Ravasz E, Daugherty MD, Somera AL, Kyrpides NC, Anderson I, Gelfand MS, Bhattacharya A, Kapatral V, D'Souza M, Baev MV, Grechkin Y, Mseeh F, Fonstein MY, Overbeek R, Barabasi AL, Oltvai ZN, Osterman AL. Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. Journal of Bacteriology. 2003;185:5673–5684. doi: 10.1128/JB.185.19.5673-5684.2003.
15. Efron B, Tibshirani RJ. An introduction to the Bootstrap. New York: Chapman & Hall; 1993.