Significance
Evolution of microbes is dominated by horizontal gene transfer and the incessant host–parasite arms race that promotes the evolution of diverse antiparasite defense systems. The evolutionary factors governing these processes are complex and difficult to disentangle, but rapidly growing genome databases provide ample material for testing evolutionary models. Rigorous mathematical modeling of evolutionary processes, combined with computer simulation and comparative genomics, allowed us to elucidate the evolutionary regimes of different classes of microbial genes. Only genes involved in key informational and metabolic pathways are subject to strong selection, whereas most of the others are effectively neutral or even burdensome. Mobile genetic elements and defense systems are costly, supporting the understanding that their evolution is governed by the same factors.
Keywords: mobile genetic elements, selection, gene loss, horizontal gene transfer, antiparasite defense
Abstract
We combine mathematical modeling of genome evolution with comparative analysis of prokaryotic genomes to estimate the relative contributions of selection and intrinsic loss bias to the evolution of different functional classes of genes and mobile genetic elements (MGE). An exact solution for the dynamics of gene family size was obtained under a linear duplication–transfer–loss model with selection. With the exception of genes involved in information processing, particularly translation, which are maintained by strong selection, the average selection coefficient for most nonparasitic genes is low albeit positive, compatible with observed positive correlation between genome size and effective population size. Free-living microbes evolve under stronger selection for gene retention than parasites. Different classes of MGE show a broad range of fitness effects, from the nearly neutral transposons to prophages, which are actively eliminated by selection. Genes involved in antiparasite defense, on average, incur a fitness cost to the host that is at least as high as the cost of plasmids. This cost is probably due to the adverse effects of autoimmunity and curtailment of horizontal gene transfer caused by the defense systems and selfish behavior of some of these systems, such as toxin–antitoxin and restriction modification modules. Transposons follow a biphasic dynamics, with bursts of gene proliferation followed by decay in the copy number that is quantitatively captured by the model. The horizontal gene transfer to loss ratio, but not duplication to loss ratio, correlates with genome size, potentially explaining increased abundance of neutral and costly elements in larger genomes.
In the wake of the genomic revolution, quantitative understanding of the roles that ecological and genetic factors play in determining the size, composition, and architecture of genomes has become a central goal in biology (1–3). The vast number of prokaryotic genomes sequenced to date reveals a great diversity of sizes, which range from about 110 kb and 140 protein coding genes in the smallest intracellular symbionts (4) to almost 15 Mb and more than 10,000 genes in the largest myxobacteria (5). Beyond a core of ∼100 nearly universal genes, the gene complements of bacteria and archaea are highly heterogeneous (6–8). Remarkably, 10–20% of the genes in most microbial genomes are ORFans, that is, genes that have no detectable homologs in other species and are replaced at extremely high rates in the course of microbial evolution (9, 10). Furthermore, all but the most reduced genomes host multiple and diverse parasitic genetic elements, such as transposons and prophages that collectively compose the so-called microbial mobilome (11).
The evolution of microbial genomes is generally interpreted in terms of the interplay between three factors: (i) gene gain, via horizontal gene transfer (HGT) and gene duplication; (ii) gene loss, via deletion; and (iii) natural selection that affects the fixation and maintenance of genes (8, 12). The intrinsic bias toward DNA deletion (and hence gene loss) that characterizes mutational processes in prokaryotes (as well as eukaryotes) results in nonadaptive genome reduction (13), whereas selection contributes to maintaining slightly beneficial genes (14). In agreement with this model, the strength of purifying selection, as measured by the ratio of nonsynonymous to synonymous variation, positively correlates with the genome size (15, 16). However, when it comes to interpreting the genome composition, the picture is complicated by the fact that selection can also lead to adaptive genome reduction by removing pseudogenes (17), costly genetic parasites, and accessory genes, which are dispensable under stable environmental conditions (18, 19). Conversely, the increased propensity of some gene families to be horizontally transferred might suffice to ensure their persistence beyond the effects of selection and intrinsic loss bias (20). Rather than being minor deviations from a general trend, nonuniform levels of selection and horizontal gene transfer affecting different families and classes of genes appear to be essential to explain the abundance distributions and evolutionary persistence times of genes (10, 12). Accordingly, a quantitative assessment of the fitness costs and benefits for different classes of genes is essential to attain an adequate understanding of the evolutionary forces that shape genomes.
The magnitude and even the sign with which the presence (or absence) of a gene contributes to the fitness of an organism are not constant in time. For example, the metabolic cost incurred by the replication, transcription, and translation of a gene strongly depends on the cell growth rate and the gene expression level (21). A recent study on the effects of different types of mutations in Salmonella enterica has shown that up to 25% of large deletions could result in a fitness increase, although the benefit of losing a particular gene critically depends on the environment (19). These findings emphasize the importance of averaging across multiple environmental conditions when it comes to estimating the fitness contribution of a gene. For the purpose of evolutionary analyses, a meaningful proxy for such an average can be obtained by inferring selection coefficients directly from the gene family abundances observed in large collections of genomes. The main difficulty in this case is disentangling the effects of selection from the effects of intrinsic loss bias, which normally requires a priori knowledge of the effective population size or the gene gain and loss rates (14, 22, 23).
Here we combine mathematical modeling, comparative genomics, and data compiled from mutation accumulation experiments to infer the characteristic contributions of selection and intrinsic DNA loss for different gene categories. To disentangle selection and loss bias, we first obtained an exact, time-dependent solution of the linear duplication–transfer–loss model with selection that governs the dynamics of gene copy numbers in a population of genomes (24–28). When applied to a large genomic data set, the model provides maximum likelihood estimates of the neutral equivalent (effective) loss bias, a composite parameter that amalgamates the effects of intrinsic loss bias (the loss bias before the action of selection) and selection. The selection coefficient can be extracted from the effective loss bias as long as the rate of gene loss is known, for which we used estimates from mutation accumulation experiments.
Our results show that with the exception of genes involved in core informational processes, most gene families are neutral or only slightly beneficial in the long term. Among the genetic elements that are typically considered parasitic, prophages show the highest fitness cost, followed by conjugative plasmids and transposons, which are only weakly deleterious in the long term. Notably, genes involved in antiparasite defense do not seem to provide long-term benefits on average but rather are slightly deleterious, almost to the same extent as transposons. We complete our analysis with an evaluation of the causes that make transposon dynamics qualitatively different from those of other gene classes and explore the effect of genome size on the rates of HGT, gene duplication, and gene loss.
Results
Duplication–Transfer–Loss Model of Gene Family Evolution.
To describe the dynamics of gene family size (gene copy number) in a population of genomes, we used a linear duplication–transfer–loss model with selection. Within a genome, the gene copy number can increase via duplication of the extant copies, which occurs at rate d per copy, or through the arrival of a new copy via HGT, at rate h independent of the copy number. Likewise, gene loss at rate l per copy leads to a decrease in the copy number. Duplication, HGT, and gene loss define a classical birth–death–transfer model at the genome level (24–27, 29). Selection is introduced through a contribution s to the fitness of a genome (s is positive for beneficial genes and negative for costly genes), which is multiplied by the gene copy number k. Specifically, we assume that fitness is additive, there is no epistasis, and the fitness contributions of all genes from the same family are the same. At the cell population level, the number of genomes carrying k copies, nk, obeys the following system of differential equations:
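$$\frac{dn_k}{dt} = (g + sk)\,n_k + l\left[(k+1)\,n_{k+1} - k\,n_k\right] + d\left[(k-1)\,n_{k-1} - k\,n_k\right] + h\left(n_{k-1} - n_k\right) \tag{1}$$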
The basal growth rate g was included for completeness, although it does not affect the copy number distribution. Moreover, the entire system can be restated in terms of the ratios of each of the parameters to the loss rate (see SI Appendix for more details). The linear duplication–transfer–loss model with selection can be exactly solved for arbitrary initial conditions by formulating Eq. 1 as a first-order partial differential equation for the generating function and applying the method of characteristics (SI Appendix) (30, 31). The result is the copy number distribution, i.e., the fraction pk of hosts with an arbitrary number of copies k at any time. In the case of a population where the gene family is initially absent, we obtain
[2] |
with
and
[3] |
In these expressions, time is measured in units of loss events, and C(t) is a normalization factor that ensures that the sum of pk over all k is equal to 1. A notable property of this solution is that as the system approaches the stationary state, selection, duplication, and loss merge into the composite parameter , which, in the absence of selection, coincides with the inverse of the duplication/loss ratio (see SI Appendix for more details). Therefore, we refer to as the “neutral equivalent” (henceforth “effective”) duplication/loss ratio (d/le). It is also possible to define the effective HGT/loss ratio such that gene families with the same effective ratios have the same stationary distributions. The fitness contribution of a gene (i.e., selection to loss ratio) can be expressed in terms of the gene’s effective duplication/loss ratio and the actual (intrinsic) duplication/loss ratio as
$$\frac{s}{l} = \left(1 - \frac{d}{l_e}\right)\left(1 - \frac{d/l}{d/l_e}\right) \tag{4}$$
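The merging of selection, duplication, and loss into a single effective ratio can be checked numerically. The sketch below is not the exact time-dependent solution from the SI Appendix. It assumes that the model has the standard linear form described above (growth term g + s·k, loss l·k, duplication d·k, constant HGT rate h), truncates the copy number at a hypothetical `k_max`, and takes the dominant eigenvector of the resulting matrix as the long-time copy number distribution. It also assumes that the effective duplication/loss ratio x is the smaller root of l·x² − (l + d − s)·x + d = 0, which is what a negative-binomial-type ansatz gives for this form of the model. All function names and parameter values are illustrative.

```python
import numpy as np

def stationary_copy_number_distribution(d, h, s, l=1.0, g=0.0, k_max=40):
    """Long-time copy number distribution of the truncated linear
    duplication-transfer-loss model with selection (assumed form)."""
    k = np.arange(k_max + 1)
    M = np.zeros((k_max + 1, k_max + 1))
    # Diagonal: growth (g + s*k) minus outflow by loss, duplication, and HGT.
    M[k, k] = g + s * k - l * k - d * k - h
    # Loss moves genomes from k+1 copies to k copies at rate l*(k+1).
    M[k[:-1], k[:-1] + 1] = l * (k[:-1] + 1)
    # Duplication and HGT move genomes from k-1 copies to k copies.
    M[k[1:], k[1:] - 1] = d * (k[1:] - 1) + h
    eigvals, eigvecs = np.linalg.eig(M)
    # The eigenvector of the largest eigenvalue dominates at long times.
    v = np.real(eigvecs[:, np.argmax(eigvals.real)])
    return v / v.sum()

def effective_dup_loss_ratio(d, s, l=1.0):
    """Smaller root x of l*x**2 - (l + d - s)*x + d = 0 (assumed relation)."""
    b = l + d - s
    return (b - np.sqrt(b * b - 4.0 * l * d)) / (2.0 * l)

if __name__ == "__main__":
    d, h, s = 0.125, 0.1, -0.5   # illustrative values, in units of the loss rate
    p = stationary_copy_number_distribution(d, h, s)
    x = effective_dup_loss_ratio(d, s)
    # For a negative-binomial-type distribution, p[k+1]/p[k] = x*(k + h/d)/(k + 1).
    print("p[1]/p[0] from eigenvector:", p[1] / p[0])
    print("x*h/d from effective ratio:", x * h / d)
```

With s = 0 the function `effective_dup_loss_ratio` returns d/l, so a costly family with a given d/l is indistinguishable, at stationarity, from a neutral family with a smaller d/l.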
Duplication, Loss, and Selection in Different Functional Categories of Genes.
We used the COUNT method (24) to estimate the effective duplication/loss ratio (d/le) associated to different gene families [defined as clusters of orthologous groups (COGs)] in 35 sets of closely related genomes [alignable tight genomic clusters (ATGCs)], which jointly encompass 678 bacterial and archaeal genomes (32, 33). As shown in the preceding section, the effective duplication/loss ratio (d/le) is a composite parameter that results from selection on gene copy number affecting the fixation of gene duplications and gene losses. For a neutral gene family, the effective duplication/loss ratio is simply the same as the ratio between the rates of gene duplication and gene loss. Because selection prevents the loss of beneficial genes, the effective duplication/loss ratios associated with beneficial genes are greater than their intrinsic duplication/loss ratios, whereas the opposite holds for genes (e.g., parasitic elements) that are costly to the host and tend to be eliminated by selection. Technically, the duplication term includes not only bona fide duplications but any process that causes an increase in copy number that is proportional to the preexisting copy number. Thus, HGT can also contribute to the duplication term in clonal populations, where the copy numbers of donors and recipients are highly correlated. Fig. 1A shows the effective duplication/loss ratios for gene families that belong to different functional categories [as defined under the COG classification (34)], as well as genes of transposons, conjugative plasmids, and prophages. For the majority of the gene families, the effective duplication/loss ratios are below 1, which is compatible with the pervasive bias toward gene loss combined with (near) neutrality of numerous genes. 
In agreement with the notion that selection affects the effective duplication/loss ratios, their values decrease from the essential functional categories, such as translation and nucleotide metabolism, to the nonessential and parasitic gene classes. The apparent bimodality of the distributions for some functional categories (Fig. 1A) is likely due to their biological heterogeneity. For example, category N (secretion and motility) sharply splits into two major groups of gene families: (i) components of the flagellum and (ii) proteins involved in cellulose production and glycosyltransferases, with high d/le values for the former and much lower values for the latter (SI Appendix, Table S1).
The average fitness contribution of a gene can be inferred from its effective duplication/loss ratio provided that the intrinsic duplication/loss ratio is known (see preceding section). To estimate the intrinsic duplication/loss ratio (d/l), we used two independent approaches. The first approach was based on the assumption that a substantial fraction of genes from nonessential, but not parasitic, functional categories are effectively neutral. Considering that gene families in those categories are relatively well represented across taxa (we required them to be present in at least three different ATGCs) and are not regarded as part of the mobilome (11), we would expect that, if not neutral, they are slightly beneficial and provide an upper bound for the intrinsic duplication/loss ratio. After sorting nonparasitic functional categories by their effective duplication/loss ratios (Fig. 1A), category K (transcription) was selected as the last category whose members arguably exert a positive average fitness effect. The intrinsic duplication/loss ratio was then calculated as the median of the effective duplication/loss ratios among the pool of gene families involved in poorly understood functions (R and S), carbohydrate metabolism (G), secretion (U), secondary metabolism (Q), and defense (V). In the second approach, we identified genes that are represented by one or more copies in a single genome, while absent in all other genomes of the same ATGC. Such genes [henceforth ORFans (35, 36)] are likely of recent acquisition and can be assumed neutral, if not slightly deleterious. The maximum likelihood estimate of the duplication/loss ratio obtained for ORFans provides, therefore, a lower bound for the intrinsic duplication/loss ratio (Methods and SI Appendix). The ratios obtained with both approaches were 0.124 [95% confidence interval (CI) 0.117–0.131] and 0.126 (95% CI 0.115–0.137). 
The two independent estimates are strikingly consistent with each other and robust to small changes in the methodology (SI Appendix). Accordingly, we took the average d/l = 0.125 as the intrinsic duplication/loss ratio. This value quantifies the intrinsic bias toward gene loss once the effect of selection is removed.
Quantitative estimates of the ratio between the selection coefficient and the loss rate (s/l) for each functional category are readily obtained by applying Eq. 4 to the effective duplication/loss ratios (Table 1). In the case of costly gene families, the ratio s/l quantifies the relative contributions of selection and loss in controlling the gene copy number. However, quantitative assessment of the selection coefficients from the s/l ratio requires knowledge of the intrinsic rates of gene loss in prokaryotic genomes. A compilation of published data from mutation accumulation experiments shows that disruption of gene coding regions due to small indels and/or large deletions occurs at rates between and per gene per generation (37–45), which yields the ranges for the selection coefficients listed in Table 1. Assuming the effective size of typical microbial populations to fall between 10^8 and 10^9 (21, 46, 47), the selection coefficients yielded by these estimates indicate evolution determined by positive fitness contribution () for information processing categories (translation and replication) as well as some metabolic categories (especially nucleotide metabolism) and cellular functions (cell division and chaperones), an effectively neutral evolutionary regime for several categories including transcription, and evolution driven by negative fitness contribution () for defense genes and mobile genetic elements.
Table 1.
Functional category | d/le | s/l | s () lower | s () upper |
F, nucleotide metabolism and transport | 0.273 | 0.39 | 0.20 | 1.58 |
J, translation | 0.273 | 0.39 | 0.20 | 1.58 |
D, cell division | 0.266 | 0.39 | 0.19 | 1.56 |
H, coenzyme metabolism | 0.260 | 0.38 | 0.19 | 1.54 |
N, secretion and motility | 0.247 | 0.37 | 0.19 | 1.49 |
O, posttranslational modification, protein turnover, and chaperone functions | 0.223 | 0.34 | 0.17 | 1.37 |
C, energy production and conversion | 0.197 | 0.29 | 0.15 | 1.18 |
E, amino acid metabolism and transport | 0.187 | 0.27 | 0.14 | 1.08 |
L, replication and repair | 0.172 | 0.23 | 0.11 | 0.91 |
I, lipid metabolism | 0.166 | 0.20 | 0.10 | 0.82 |
T, signal transduction | 0.159 | 0.18 | 0.09 | 0.72 |
P, inorganic ion transport and metabolism | 0.150 | 0.14 | 0.07 | 0.57 |
M, membrane and cell wall structure and biogenesis | 0.140 | 0.09 | 0.05 | 0.36 |
K, transcription | 0.140 | 0.09 | 0.05 | 0.36 |
R, general functional prediction only | 0.140 | 0.09 | 0.04 | 0.36 |
S, function unknown | 0.128 | 0.02 | 0.01 | 0.09 |
G, carbohydrate metabolism and transport | 0.123 | −0.02 | −0.01 | −0.07 |
U, intracellular trafficking and secretion | 0.122 | −0.02 | −0.01 | −0.09 |
Q, biosynthesis, transport, and catabolism of secondary metabolites | 0.112 | −0.10 | −0.05 | −0.40 |
V, defense | 0.106 | −0.16 | −0.08 | −0.62 |
V(i), antibiotic/drug resistance | 0.135 | 0.06 | 0.03 | 0.25 |
V(ii), antipathogen defense | 0.059 | −1.05 | −0.52 | −4.18 |
Tr, transposon | 0.104 | −0.18 | −0.09 | −0.74 |
Pl, conjugative plasmid | 0.079 | −0.53 | −0.27 | −2.12 |
Ph, (pro)phage | 0.047 | −1.56 | −0.78 | −6.23 |
The table shows the estimated values of the effective duplication/loss ratio (d/le), selection to loss ratio (s/l), and selection coefficient (s) for different functional categories of genes. The s/l values were calculated assuming an intrinsic duplication/loss ratio d/l = 0.125. Loss rates equal to and per gene per generation were used to obtain the lower and upper estimates of s, respectively.
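The s/l column of Table 1 can be recomputed from the d/le column. The sketch below assumes that the relation between the two takes the form s/l = (1 − d/le)(1 − (d/l)/(d/le)) with d/l = 0.125. This form reproduces the tabulated values to within the rounding of d/le, but the function name and the choice of rows are illustrative only.

```python
def selection_to_loss_ratio(d_le, d_l=0.125):
    """s/l implied by an effective ratio d/le and an intrinsic ratio d/l
    (assumed form of the relation; see the lead-in)."""
    return (1.0 - d_le) * (1.0 - d_l / d_le)

# (d/le, reported s/l) pairs copied from Table 1.
TABLE_1_ROWS = {
    "J, translation": (0.273, 0.39),
    "K, transcription": (0.140, 0.09),
    "S, function unknown": (0.128, 0.02),
    "V(ii), antipathogen defense": (0.059, -1.05),
    "Tr, transposon": (0.104, -0.18),
    "Pl, conjugative plasmid": (0.079, -0.53),
    "Ph, (pro)phage": (0.047, -1.56),
}

if __name__ == "__main__":
    for name, (d_le, reported) in TABLE_1_ROWS.items():
        computed = selection_to_loss_ratio(d_le)
        print(f"{name:30s} computed {computed:+.2f}  reported {reported:+.2f}")
```

Because d/le is tabulated to three decimals, the recomputed values can differ from the reported ones in the second decimal, most visibly for prophages, where d/le is smallest.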
To shed light on the causes that make the defense genes slightly deleterious, we split the gene families in this category into two subcategories: (i) drug and/or antibiotic resistance and detoxification and (ii) restriction modification, CRISPR-Cas, and toxin–antitoxin. The median fitness effect substantially and significantly differs in sign and magnitude between both groups, with for genes involved in detoxification and drug resistance and for genes involved in antiparasite defense (Mann–Whitney test, ). Thus, the drug resistance machinery is close to neutral whereas the antiparasite defense systems are about as deleterious as plasmids and somewhat more so than transposons. Among the latter, toxin–antitoxins are the most deleterious, followed by CRISPR-Cas and restriction modification, although the pairwise differences are only significant between toxin–antitoxins and restriction modification [ and , respectively; Mann–Whitney test, ].
Long-Term Gene Dynamics and Bursts of Transposon Proliferation.
The loss biases and selection coefficients in Table 1 describe the dynamics of genes in groups of closely related genomes, with evolutionary distances of ∼0.01–0.1 fixed substitutions per base pair. To investigate whether the same values apply at larger phylogenetic scales, we pooled data from all ATGCs and compared the global abundances of genes from different categories with the long-term equilibrium abundances expected from the model (Fig. 1 B and C). In most categories, the observed copy number agrees with the predicted value, and the same holds for the fraction of genomes that harbor a given gene family.
Two notable exceptions are the genes involved in translation (category J) and the transposons. In the case of translation-related genes, the observed copy number is ∼40% greater than expected (median observed 0.50, median expected 0.36, Wilcoxon test ), and the fraction of genomes with at least one copy is ∼80% greater than expected (median observed 0.48, median expected 0.27, Wilcoxon test ). Such deviations reflect the inability of the model to reproduce a scenario in which selection acts to maintain a single member of most of the gene families in almost every genome, as is the case for translation. In the case of transposons, there is a dramatic excess of ∼213% in the mean copy number (median observed 0.25, median expected 0.08, Wilcoxon test ) but no significant deviation in the fraction of genomes that carry transposons. Such excess of copies apparently results from occasional proliferation bursts that offset the prevailing loss-biased dynamics. Indeed, ∼12% of the lineage-specific families of transposons show evidence of recent expansions, as indicated by effective duplication/loss ratios greater than 1, whereas the fraction of such families drops below 4% in other functional categories (Fig. 2A, orange bars). Analysis of the typical burst sizes also reveals differences between transposons, with a mean burst size close to 4, and the rest of genes, with mean burst sizes around 2 (Fig. 2A, gray line). Episodes of transposon proliferation are not evenly distributed among taxa but rather concentrate in a few groups, such as Sulfolobus, Xanthomonas, Francisella, and Rickettsia (Fig. 2B). The high prophage burst rate in Xanthomonas is due to the presence of a duplicated prophage related to P2-like viruses in Xanthomonas citri.
To test whether the burst dynamics observed for transposons could explain the deviation in their global abundance, we analyzed a modified version of the model in which long phases of genome decay are punctuated by proliferative bursts of size K. Specifically, each decay phase was modeled as a duplication–transfer–loss process with selection, with initial condition , . Bursts occur at exponentially distributed intervals with the rate (note that is the characteristic interval between two consecutive bursts). When a burst occurs, the duplication–transfer–loss process is reset to its initial condition. In this model, the time-extended average for the mean copy number, , becomes . Using this expression it is possible to evaluate the expected mean copy number for any given value of the burst rate and the burst size (see SI Appendix for details). In the case of transposons, the fraction of families with signs of recent expansions leads to the estimate (i.e., one burst for every 25 losses; Methods). For this burst rate, the modified model recovers the observed mean copy number if the burst size is set to K = 4.2, which is notably close to the value K = 3.9 estimated from the data.
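The effect of bursts on the mean copy number can be illustrated with a simplified calculation. This is not the expression derived in the SI Appendix. It assumes that (i) time is measured in units of loss events, (ii) each burst resets the mean copy number to the burst size K, (iii) between bursts the mean relaxes exponentially toward its equilibrium value at rate 1 − d/le, as it would in a neutral model with the effective duplication/loss ratio, and (iv) bursts arrive as a Poisson process. Under these assumptions a renewal argument gives a time-averaged mean of m_eq + (K − m_eq)·β/(β + 1 − d/le), where β is the burst rate. Function and argument names are hypothetical.

```python
def time_averaged_mean_copy_number(burst_size, burst_rate, d_le, m_eq):
    """Time-averaged mean copy number under the simplified burst-decay
    picture described in the lead-in (an approximation, not the SI formula)."""
    relax = 1.0 - d_le                      # relaxation rate, in units of the loss rate
    weight = burst_rate / (burst_rate + relax)
    return m_eq + (burst_size - m_eq) * weight

def burst_size_for_mean(target_mean, burst_rate, d_le, m_eq):
    """Burst size needed to reach a target time-averaged mean (inverse of the above)."""
    relax = 1.0 - d_le
    return m_eq + (target_mean - m_eq) * (burst_rate + relax) / burst_rate

if __name__ == "__main__":
    beta = 1.0 / 25.0   # one burst per 25 losses, as quoted in the text
    d_le = 0.104        # effective duplication/loss ratio of transposons (Table 1)
    m_eq = 0.08         # expected copy number without bursts, as quoted in the text
    print(time_averaged_mean_copy_number(4.2, beta, d_le, m_eq))
    print(burst_size_for_mean(0.25, beta, d_le, m_eq))
```

With the values quoted in the text, this approximation gives a mean of roughly 0.26 for K = 4.2 and a burst size of about 4.1 for a mean of 0.25. The quoted abundances are medians rather than means, so the agreement should be read as qualitative.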
Relationships Between Genome Size and Gene Duplication, Horizontal Transfer, and Loss Rates.
We further investigated the relationships between the genome size and the factors that determine gene abundances. For each set of related genomes, we estimated the intrinsic duplication/loss ratio (d/l) and the total HGT/loss ratio (h/l) for genes from neutral categories and compared those to the mean genome size, quantified as the number of ORFs in the genome. As shown in Fig. 3, d/l is independent of the genome size, whereas h/l positively correlates with the genome size.
The same trends are confirmed by the analysis of ORFan abundances. Provided that the duplication rate is small compared with the loss rate, the number of ORFan families per genome constitutes a proxy for the ratio h/l. On the other hand, the fraction of ORFan families with more than one copy is a quantity that only depends on the ratio d/l (SI Appendix). As in the case of neutral gene families, the study of ORFans reveals a strong positive correlation between genome size and h/l but lack of significant correlation with d/l.
Because in prokaryotes genome size positively correlates with the effective population size (Ne) (14), we also explored the correlations between Ne and the ratios h/l and d/l (SI Appendix). The same qualitative correlations were detected; that is, h/l positively correlates with Ne, whereas d/l shows no correlation. However, the association between h/l and Ne becomes nonsignificant when genome size and Ne are jointly considered in an analysis of partial correlations. Therefore, it seems that the association between Ne and h/l is a by-product of the intrinsic correlation between effective population size and genome size.
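The logic of the partial correlation analysis can be sketched on synthetic data. This section does not state whether the partial correlations were rank-based, so the use of Spearman correlations here is an assumption, and the data below are simulated rather than taken from the ATGCs. In the simulation, both the Ne proxy and the h/l proxy depend on genome size but not directly on each other, so their raw correlation is strong while the partial correlation given genome size is close to zero.

```python
import numpy as np

def ranks(v):
    """Ranks 0..n-1 of a 1-D array (assumes no ties)."""
    return np.argsort(np.argsort(v)).astype(float)

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

def partial_spearman(x, y, z):
    """First-order partial rank correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / np.sqrt((1.0 - r_xz**2) * (1.0 - r_yz**2))

# Synthetic illustration only: both variables are driven by genome size.
rng = np.random.default_rng(0)
n = 5000
genome_size = rng.normal(size=n)
ne_proxy = genome_size + rng.normal(size=n)
h_over_l = genome_size + rng.normal(size=n)

raw = spearman(ne_proxy, h_over_l)
partial = partial_spearman(ne_proxy, h_over_l, genome_size)

if __name__ == "__main__":
    print("raw correlation:", raw)
    print("partial correlation given genome size:", partial)
```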
Disentangling Environmental and Intrinsic Contributions to Fitness.
Because our estimates of the selection coefficients constitute ecological and temporal averages, a low selection coefficient might result not only from a genuine lack of adaptive value but, perhaps more likely, from the limited range of environmental conditions in which the given gene becomes useful. To disentangle the two scenarios, we compared the nonsynonymous to synonymous nucleotide substitution ratios (dN/dS) for different gene categories. The expectation is that genes that perform an important function in a rare environment would be characterized by low average selection coefficients (frequent loss) combined with intense purifying selection at the sequence level (low dN/dS) in those genomes that harbor the gene. Gene sequence analysis shows that in most cases, the dN/dS of a gene is primarily determined by the ATGC rather than by the functional category (SI Appendix, Fig. S1). These observations are compatible with the results of a previous analysis indicating that the median dN/dS value is a robust ATGC-specific feature (15). Notable exceptions are transposons and prophages, which show a high dN/dS in most taxa.
After accounting for the ATGC-related variability, we found a significant negative correlation between the selection coefficient of a functional class and the dN/dS (Fig. 4; Spearman’s ρ = −0.58, P = 0.004). Such a connection between the selection pressures on gene dynamics and sequence evolution is to be expected under the straightforward assumption that genes that are more important for organism survival are subject to stronger selection on the sequence level and has been observed previously (48). However, genes involved in metabolic processes, especially carbohydrate metabolism, have lower dN/dS values than predicted from the overall trend (Fig. 4), suggesting that the effective neutrality of such genes results from the heterogeneity of environmental conditions. Among the gene categories with low selection coefficients, the dN/dS values of transposons, prophages, and gene families with poorly characterized functions are significantly greater than expected from the general trend, which is consistent with the notion that these genes provide little or no benefit to the cells that harbor them.
Gene Dynamics and Microbial Lifestyles.
In an effort to clarify the biological underpinnings of the gene dynamics, we compared the effective duplication to loss ratios in microbes with three lifestyles: free-living, facultative host-associated, and obligate intracellular parasite (Fig. 5). In the first two groups, d/le drops from essential functional categories to nonessential categories and genetic parasites, with significantly higher values in free-living microbes than in facultative host-associated bacteria. Obligate intracellular parasites have remarkably low d/le values, as could be expected from their strong genomic degeneration. Notably, genetic parasites and genes from the defense category show the highest d/le among the genes of intracellular parasites, although due to the small number of intracellular parasites in our dataset (only three ATGCs, with most genetic parasites restricted to the ATGC044 encompassing Rickettsia), this result must be taken with caution. We estimated the selection coefficients for free-living and facultative host-associated microbes, under the assumption that the intrinsic d/l is universally the same across the microbial diversity. The significant difference in d/le between the two lifestyles translates into consistently higher s values for most functional categories of genes in free-living microbes (SI Appendix, Fig. S2). Thus, the beneficial effects of most genes appear to be significantly greater in free-living compared with facultative host-associated bacteria, and in both these categories of microbes, selection for gene retention is dramatically stronger than it is in obligate, intracellular parasites.
Discussion
Multiple variants of the duplication–transfer–loss model and related multitype branching processes have been widely used to study the evolution of gene copy numbers (24, 25, 28, 49), especially in the context of transposons and other genetic parasites (22, 23, 26, 27, 50). To make the models tractable, most studies make simplifying assumptions, such as stationary state, absence of duplication, or lack of selection, and obtain the model parameters from the copy number distributions observed in large genomic datasets, relying on the assumption that model parameters are homogeneous across taxa. Here we derived an exact solution for the time-dependent duplication–transfer–loss model with additive selection and found that in general, it is impossible to distinguish neutral and costly elements solely based on the copy number distributions. This is the case because the effects of selection and loss bias blend into a composite parameter that is equivalent to an effective loss bias in a neutral scenario. Using the solution of the complete model, we investigated the copy number dynamics of a large number of gene families in groups of related genomes, without the need to assume homogeneity of the HGT, duplication, and loss rates across taxa (8). We then used the expression that relates the parameter values under selection with their neutral equivalents to estimate the selection coefficients for different classes of genes.
The results of this analysis rely on several assumptions. First, the duplication–transfer–loss model was solved in a regime of linear selection that assumes that the benefit or cost of a gene family linearly grows with the gene copy number. This choice of the cost function, which is arguably suitable for genetic parasites, might be violated by ensembles of genes involved in processes that require tight dosage balance among the respective proteins, such as the translation system (51). For such genes, the fitness benefit will be underestimated because the observed number of family members is lower than predicted by the model. Second, to calculate the intrinsic loss bias (d/l), we assumed that certain classes of genes are effectively neutral. In that regard, two independent approaches were explored: (i) using ORFans as the neutral class and (ii) inferring the neutral categories based on plausible dispensability and a low position in the effective loss bias ranking. Notably, nearly identical values were obtained through both approaches, indicating that our estimates are robust to the choice of the neutral reference group. Third, the model assumes that duplication and deletion rates, as well as selection coefficients, are constant in time. It has been proposed that recently duplicated genes are subject to significantly higher loss rates and lower selection coefficients than older paralogs (52, 53). Should that be the case, recently duplicated gene copies would be short-lived, and their existence would not affect the generality of our results, provided that the duplication to loss ratio is understood as an effective parameter that accounts for the survival probability of a paralog beyond the initial phase. Finally, to convert the selection to loss ratios (s/l) to selection coefficients (s), we used two estimates of the loss rate l. 
A conservative estimate was taken from the experimental study of medium to large deletions (in the range of 1 to 202 kb) in Salmonella enterica (37). Because small indels also contribute to the loss of genes via pseudogenization, we additionally considered a second, upper-bound estimate, which is the geometric mean of the indel rates collected from multiple mutation accumulation experiments (38–45) multiplied by an average target size of 1 kb per ORF.
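The conversion from s/l to s can be illustrated with a minimal sketch. It assumes that the upper-bound loss rate is the geometric mean of per-base indel rates multiplied by a 1-kb target, and that s is obtained by multiplying s/l by l. The function names and the numerical rates below are hypothetical placeholders, not values from the cited experiments.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def loss_rate_upper_bound(indel_rates_per_bp, target_size_bp=1000):
    """Per-gene loss rate: geometric mean indel rate times the target size per ORF."""
    return geometric_mean(indel_rates_per_bp) * target_size_bp


def selection_coefficient(s_over_l, loss_rate):
    """Convert a selection-to-loss ratio (s/l) into a selection coefficient s."""
    return s_over_l * loss_rate


# Hypothetical indel rates (per base pair per generation), for illustration only.
example_rates = [1e-11, 4e-11]
example_l = loss_rate_upper_bound(example_rates)
example_s = selection_coefficient(-0.5, example_l)
```

A geometric rather than arithmetic mean keeps a single unusually high rate from dominating the estimate when the rates span orders of magnitude.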
Our estimates yielded a broad range of selection coefficients that reflects positive, near zero (neutral) or negative fitness contributions of the respective genes. Notably, the ranking of the gene categories by fitness contribution is closely similar to the ranking by evolutionary mobility (gene gain and loss rates) (8) such that genes with positive fitness contributions are the least mobile. In accordance with the intuitive expectation, gene families involved in essential functions, in particular nucleotide metabolism and translation, occupy the highest ranks in the list of genes maintained by selection (highest positive s values; Table 1). The middle of the range of selection coefficients is occupied by functional categories of genes that are beneficial, sometimes strongly so, for microbes under specific conditions but otherwise could be burdensome, such as carbohydrate metabolism and ion transport. This inference was supported by analysis of selection on the protein sequence level that is reflected in the dN/dS ratio. Overall, we observed the expected significant negative correlation between the selection coefficient estimated from gene dynamics and dN/dS, indicating that functionally important genes are, on average, subject to strong constraints on the sequence level. However, for genes involved in metabolic processes, in particular carbohydrate metabolism, the dN/dS values are lower than expected given their average selection coefficients, which is consistent with relatively strong sequence-level selection in the subsets of microbes that have these genes. In agreement with this interpretation, when the s values for these categories were estimated separately for free-living and host-associated microbes, they turned out to be slightly beneficial in the former but costly in the latter.
In contrast, genetic parasites that negatively contribute to the fitness of the cell are at the bottom of the list of s values (Table 1). Among those, prophages are the most costly class, whereas plasmids and especially transposons evolve under regimes closer to neutrality. Prophages, plasmids, and transposons differ substantially in the magnitude of the associated selection coefficients: selection is strong and effective against prophages (Nes ∼ −10) and moderate against transposons and plasmids (Nes ∼ −1). These differences are consistent with the differences in the lifestyles between these selfish elements whereby transposons and plasmids are relatively harmless to the host cell, apart from being an energetic burden, whereas prophages have the potential to kill the host upon lysogenization (20, 54). Accordingly, genetic parasites also differ in the relative importance that selection and deletions play in keeping them under control. Both selection and deletions contribute to the removal of prophages (the contribution of selection being ∼1.6 times greater), whereas deletion is the main cause of plasmid and transposon loss (roughly twice as important as selection for plasmids and five times as important in the case of transposons). The demonstration that transposons are only weakly selected against and are lost primarily due to the intrinsic deletion bias is compatible with the wealth of degenerated insertion sequences found in many bacterial genomes (55–57). Conversely, deleterious elements, such as prophages, whose spread is limited by selection against high copy numbers, present fewer degenerated copies than lower cost elements, such as transposons.
One of the most interesting and, at least at first glance, unexpected observations made in the course of this work is that genes encoding components of antipathogen defense systems are on average deleterious, with an average cost similar to or even greater than the cost of plasmids (Table 1). In part, this is likely to be the case because some of the most abundant defense systems, such as toxin–antitoxins and restriction modification modules, clearly display properties of selfish genetic elements and, moreover, are addictive to host cells (58–61). Indeed, in agreement with the partially selfish character of such defense modules, we found that toxin–antitoxins are the most deleterious category of genetic elements in microbes, apart from prophages. More generally, the patchy distribution of defense systems in prokaryotic genomes, together with theoretical and experimental evidence, suggests that defense systems incur nonnegligible fitness costs that are thought to stem primarily from autoimmunity and abrogation of HGT and, therefore, are rapidly eliminated when not needed (62–64).
Long-term transposon dynamics is well described by a model that combines long phases of decay, during which transposons behave as inactive genetic material, punctuated by small proliferation bursts that produce on average four new copies. Despite the simplicity of this model, it captures, at least qualitatively, the heterogeneity of transposition rates among transposon families (65) and environmental conditions (66, 67). Unlike large expansions, which are rare events typically associated with ecological transitions affecting the entire genome (68–71), small bursts occur frequently and affect a sizable fraction of transposon families. Some well-known instances of large transposon expansions become apparent in our analysis that identified taxa with unusually high burst rates, such as Xanthomonas, Burkholderia, and Francisella, in accord with previous observations (70, 71). In most other taxa, transposon decay is the dominant process, which is the expected trend, given that transposition is tightly regulated and a large fraction of transposon copies are inactive (72, 73). The small fitness cost of transposons in the decay phase is also consistent with a nonproliferative scenario, where the fitness effect is reduced to the energetic cost of replication and expression (21). Due to the rapidity of bursts, our methodology cannot be used to assess the cost of a transposon during the burst phase. Because active transposons likely impose a larger burden on the host (74), variation in burst sizes is likely to reflect differences in the intensity of selection and the duration of proliferative episodes.
Apart from the transposons, the only notable case of burst-driven dynamics corresponds to genes from the defense category in Sulfolobus. A closer inspection of this group reveals multiple instances of duplications, gains, and losses of CRISPR-Cas systems as also observed previously (75). In the case of prophages, the low burst rate is likely to reflect genuine lack of bursts or our inability to detect them due to the dominant, selection-driven fast decay dynamics. Indeed, given the fitness cost that we estimated for prophages, a burst of prophages would decay almost three times faster than a burst of transposons of similar size.
The effective size of microbial populations positively correlates with the genome size, which led to the hypothesis that the genome dynamics is dominated by selection acting to maintain slightly beneficial genes (14, 16). In the present analysis, when gene families from all functional categories are pooled, the median fitness contribution per gene is Nes ∼ 0.1, which provides independent support for this weak selection-driven concept of microbial genome evolution. In that framework, the fact that genetic parasites are more abundant in large genomes, as reported previously (76–78) and confirmed by our data, seemingly raises a paradox: the same genomes where selection works more efficiently to maintain beneficial genes also harbor more parasites. A possible solution comes from our observation that the HGT to loss ratio (where the HGT rate is measured per genome and the loss rate is measured per gene) grows with the genome size. Such behavior, which had already been noted for transposons (23) and agrees with the recently derived genome-average scaling law (14), is likely to result, at least in part, from larger genomes providing more nonessential regions where a parasite can integrate without incurring major costs to the cell. Alternatively or additionally, the observed dependence could emerge if duplication and loss rates per gene decreased with genome size, whereas the HGT rate remained constant. Indeed, an inverse correlation between the genome size and the duplication and loss rates could be expected, given that mutation rates appear to have evolved to lower values in populations with larger Ne (41, 79).
Taken together, the results of this analysis reveal the relative contributions of selection and intrinsic deletion bias to the evolution of different classes of microbial genes and selfish genetic elements. Among other findings, we showed that the genome-averaged selection coefficients are low, and evolution is driven by strong selection only for a small set of essential genes. In addition, we detected substantial, systematic differences between the evolutionary regimes of bacteria with different lifestyles, with much stronger selection for gene retention in free-living microbes compared with parasites, especially obligate, intracellular ones. This difference appears to be fully biologically plausible in that diversification of the metabolic, transport, and signaling capabilities is beneficial for free-living microbes but not for parasites that therefore follow the evolutionary route of genome degradation.
Counterintuitive as this might be, we show that antiparasite defense systems are generally deleterious for microbes, roughly to the same extent as mobile elements. These results are compatible with the previously observed highly dynamic evolution of such systems that are kept by microbes either when they are essential to counteract aggressive parasites or due to their own selfish and addictive properties. These findings can be expected to foster further exploration of the interplay between genome size; effective population size; the rates of horizontal transfer; duplication and loss of genes; and the dynamics of mobile elements in the evolution of prokaryotic populations and, eventually, the entire microbial biosphere.
Methods
Gene Copy Number Dynamics.
Let $n_k(t)$ be the number of genomes that carry k copies of the gene of interest at time t. We define the generating function $G(z,t) = \sum_{k \ge 0} n_k(t)\, z^k$. In terms of the generating function, Eq. 1 becomes a first-order partial differential equation for $G(z,t)$, which can be solved for any initial condition by applying the method of characteristics (SI Appendix). The generating function for the copy number distribution is then obtained by normalization, $p(z,t) = G(z,t)/G(1,t)$. The explicit values of $p_k(t)$ are recovered as the coefficients of the series expansion of $p(z,t)$ with respect to z.
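The relation between genome counts, the generating function, and the copy number distribution can be sketched numerically. The example below assumes that the generating function is the power series whose coefficients are the genome counts, and that the copy number distribution is obtained by dividing by its value at z = 1. The function names and the counts are hypothetical, and the sketch does not solve the time-dependent model itself.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def copy_number_distribution(genome_counts):
    """Normalize counts (index = copy number k) into probabilities p_k."""
    counts = np.asarray(genome_counts, dtype=float)
    total = P.polyval(1.0, counts)  # value of the generating function at z = 1
    if total <= 0:
        raise ValueError("at least one genome is required")
    return counts / total


def mean_copy_number(probabilities):
    """Mean copy number, i.e. the derivative of p(z) evaluated at z = 1."""
    return P.polyval(1.0, P.polyder(probabilities))


# Hypothetical counts: 40 genomes with 0 copies, 30 with 1, 20 with 2, 10 with 3.
p_k = copy_number_distribution([40, 30, 20, 10])
mean_k = mean_copy_number(p_k)
```

Here the coefficients of the normalized series are the probabilities themselves, which mirrors how the explicit distribution is read off from the series expansion in z.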
Estimation of the Effective Ratios d/le and h/le from Genomic Data.
Genomic data were obtained from an updated version of the ATGC database that clusters genomes from bacteria and archaea into closely related groups (33). We analyzed 35 of the largest ATGCs (34 bacterial and 1 archaeal group) that included 10 or more genomes each. For each of those ATGCs, clusters of orthologous genes shared among genomes of the same ATGC (ATGC-COGs) were identified (33, 80), and rooted species trees were generated as described previously (8).
The effective duplication/loss ratio (d/le) and transfer/loss ratio (h/le) for each ATGC-COG were estimated with the software COUNT (24), which optimizes the parameters of a duplication–transfer–loss model analogous to the model described above under the assumption of neutrality (81). The output of the program was postprocessed to obtain ATGC-COG-specific rates as described in ref. 20. ATGC-COGs were assigned to families based on their COG and Pfam annotations. COG and Pfam annotations were also used to classify families into functional categories. At the family level, the representative ratios d/le and h/le of a family were obtained as the median d/le and the sum of h/le, respectively, among its constituent ATGC-COGs. The mean copy number of a family was calculated as the average, across all ATGCs, of the ATGC-specific mean abundances (ATGC-COGs belonging to the same family in the same ATGC were pooled to obtain the ATGC-specific mean abundance, whereas the ATGC-specific mean for absent families was set to zero). The fraction of genomes that contain a family was calculated in a similar manner. This approach minimizes the bias associated with nonuniform ATGC sizes. To minimize inference artifacts associated with small families, only those families encompassing at least five ATGC-COGs from at least three ATGCs were considered for further analyses.
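The family-level aggregation described above can be sketched as follows. This is a minimal illustration, assuming each ATGC-COG is a record with a family label, an ATGC label, and its two effective ratios. The record layout, field names, and function names are hypothetical and do not correspond to the COUNT output format.

```python
from collections import defaultdict
from statistics import median


def summarize_families(records, min_cogs=5, min_atgcs=3):
    """Median d/le and summed h/le per family, keeping only well-sampled families."""
    by_family = defaultdict(list)
    for record in records:
        by_family[record["family"]].append(record)

    summary = {}
    for family, members in by_family.items():
        atgcs = {m["atgc"] for m in members}
        # Discard families with too few ATGC-COGs or too few ATGCs.
        if len(members) < min_cogs or len(atgcs) < min_atgcs:
            continue
        summary[family] = {
            "d_over_le": median(m["d_over_le"] for m in members),
            "h_over_le": sum(m["h_over_le"] for m in members),
        }
    return summary


def mean_copy_number(per_atgc_mean, all_atgcs):
    """Average of ATGC-specific means; ATGCs lacking the family count as zero."""
    return sum(per_atgc_mean.get(atgc, 0.0) for atgc in all_atgcs) / len(all_atgcs)


# Hypothetical records: family "A" passes the filter, family "B" does not.
example_records = [
    {"family": "A", "atgc": f"ATGC{i % 3}", "d_over_le": 0.1 * (i + 1), "h_over_le": 0.25}
    for i in range(5)
] + [{"family": "B", "atgc": "ATGC0", "d_over_le": 2.0, "h_over_le": 1.0}]
example_summary = summarize_families(example_records)
```

Averaging ATGC-specific means, rather than pooling all genomes, gives each ATGC equal weight regardless of how many genomes it contains.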
Estimation of the Intrinsic Duplication/Loss Ratio.
Two approaches were used to estimate the intrinsic duplication/loss ratio d/l. In the first approach, putative neutral families from categories R, S, G, U, Q, and V were pooled, and the median d/le was chosen to serve as the estimate of d/l. The 95% confidence interval was calculated from the interquartile range (IQR) and the number of families N (82). In the second approach, the copy numbers of ATGC-COGs that are specific to a single genome were used to infer the ratio d/l under the assumption that such genes were recently acquired and are effectively neutral. To that end, we used the solution of the duplication–transfer–loss model to derive a maximum likelihood estimate of d/l given a list of single-genome ATGC-COGs, their copy numbers, and the time since the last branching event in the genome tree (in units of loss events, as provided by COUNT). Explicit formulas and their derivation are discussed in SI Appendix. Likelihood maximization was carried out using the Nelder–Mead simplex method as implemented in MATLAB R2016b. The 95% confidence interval was determined by the values of d/l whose log-likelihood was 1.92 units smaller than the maximum log-likelihood (83).
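The likelihood-based confidence interval can be illustrated with a minimal sketch. Because the actual likelihood is given only in SI Appendix, the example substitutes a toy exponential-rate likelihood, and it uses SciPy instead of MATLAB. The function names and the data are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq, minimize


def log_likelihood(rate, data):
    """Toy stand-in: log-likelihood of an exponential rate given waiting times."""
    if rate <= 0:
        return -np.inf
    return len(data) * np.log(rate) - rate * np.sum(data)


def mle_with_interval(data, drop=1.92):
    """Nelder-Mead maximum likelihood estimate plus the interval where logL >= max - drop."""
    result = minimize(lambda x: -log_likelihood(x[0], data), x0=[1.0],
                      method="Nelder-Mead")
    rate_hat = result.x[0]
    max_ll = -result.fun

    def gap(rate):
        # Positive inside the interval, negative outside it.
        return log_likelihood(rate, data) - (max_ll - drop)

    lower = brentq(gap, rate_hat * 1e-6, rate_hat)
    upper = brentq(gap, rate_hat, rate_hat * 1e3)
    return rate_hat, lower, upper


# Hypothetical waiting times, for illustration only.
example_data = [0.5, 1.5, 1.0, 2.0, 1.0]
rate_hat, lower, upper = mle_with_interval(example_data)
```

The drop of 1.92 log-likelihood units is half of 3.84, the 95% quantile of a chi-square distribution with one degree of freedom, which is why it defines a 95% interval for a single parameter.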
Burst Frequency, Rate, and Size.
The frequency of bursts was calculated as the fraction of ATGC-COGs in which d/le > 1. The burst rate ϕ was estimated by maximum likelihood, assuming that bursts occur randomly at exponentially distributed intervals, such that the probability of observing a burst in a tree of phylogenetic depth t is equal to $1 - e^{-\phi t}$. Accordingly, the log-likelihood of observing $n_{atgc}$ bursts in an ATGC with $N_{atgc}$ ATGC-COGs is, up to a constant that does not depend on ϕ, $n_{atgc} \ln\left(1 - e^{-\phi t_{atgc}}\right) - \left(N_{atgc} - n_{atgc}\right) \phi\, t_{atgc}$, where $t_{atgc}$ is the depth of the ATGC tree in units of loss events (SI Appendix). The global log-likelihood is the sum of the contributions from all ATGCs. As a proxy for the burst size we used the maximum copy number observed in each ATGC-COG. For each category, the characteristic burst size was calculated as the quotient between the mean burst size in ATGC-COGs with d/le > 1 and the baseline defined by the mean of the maxima in the rest of ATGC-COGs.
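The burst rate estimate can be sketched numerically. The example assumes that, with exponentially distributed waiting times, the probability of at least one burst in a tree of depth t is 1 − exp(−ϕt), and that ATGC-COGs within an ATGC are independent, so each ATGC contributes a binomial log-likelihood. The function names, the search bounds, and the input numbers are hypothetical.

```python
import math

from scipy.optimize import minimize_scalar


def burst_log_likelihood(phi, groups):
    """Sum over ATGCs of n*ln(1 - exp(-phi*t)) - (N - n)*phi*t (constants dropped)."""
    total = 0.0
    for n_bursts, n_families, depth in groups:
        p_burst = -math.expm1(-phi * depth)  # equals 1 - exp(-phi * depth)
        total += n_bursts * math.log(p_burst) - (n_families - n_bursts) * phi * depth
    return total


def estimate_burst_rate(groups, upper=50.0):
    """Maximize the global log-likelihood over phi within (1e-9, upper)."""
    result = minimize_scalar(lambda phi: -burst_log_likelihood(phi, groups),
                             bounds=(1e-9, upper), method="bounded",
                             options={"xatol": 1e-10})
    return result.x


# Hypothetical input: (bursts observed, ATGC-COGs, tree depth in loss events).
single_group = [(10, 100, 2.0)]
phi_hat = estimate_burst_rate(single_group)
```

For a single ATGC the maximum has the closed form ϕ = −ln(1 − n/N)/t, which offers a quick check of the numerical optimizer before it is applied to many ATGCs with different depths.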
Estimation of the Characteristic dN/dS Ratios.
The dN/dS of every ATGC-COG was calculated as follows. Starting from the multiple sequence alignment, the program codeml from the PAML (phylogenetic analysis by maximum likelihood) package (84) was used to obtain the dN/dS for each pair of sequences in the ATGC-COG. The conditions 0.01 < dN < 3 and 0.01 < dS < 3 were used to select informative gene pairs. The representative dN/dS for the ATGC-COG was obtained as the median dN/dS among the informative pairs. In the next step, ATGC-COGs from the same ATGC that belong to the same functional category were pooled, and the median of their dN/dS was taken as the representative dN/dS. To account for ATGC-related effects, the dN/dS values of all categories within an ATGC were converted into ranks. The null hypothesis that all categories are equal in terms of their dN/dS was rejected by a Skillings–Mack test (t = 939.7, df = 22, $P < 10^{-20}$). To identify which categories significantly deviate from the null hypothesis, the mean rank of each category was compared with the theoretical 95% CI for the mean of 35 samples taken from a discrete uniform distribution in the interval from 1 to 23.
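The reference interval for the mean rank can be approximated with a short calculation. The sketch below assumes independent ranks and a normal approximation for the mean of 35 draws from a discrete uniform distribution on 1 to 23. The interval used in the original analysis may have been derived differently, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def uniform_mean_rank_interval(n_categories, n_samples, level=0.95):
    """Normal-approximation interval for the mean of n_samples uniform ranks."""
    mean = (n_categories + 1) / 2            # mean of a discrete uniform on 1..K
    variance = (n_categories ** 2 - 1) / 12  # variance of a discrete uniform on 1..K
    std_err = math.sqrt(variance / n_samples)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_err, mean + z * std_err


low, high = uniform_mean_rank_interval(23, 35)
```

Under these assumptions, a category whose mean rank falls outside the interval from low to high would be flagged as deviating from the null hypothesis of equal dN/dS across categories.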
Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1704925114/-/DCSupplemental.
References
- 1. Koonin EV. The Logic of Chance: The Nature and Origin of Biological Evolution. FT Press; Upper Saddle River, NJ: 2011.
- 2. Lynch M. The Origins of Genome Architecture. Sinauer Associates; Sunderland, MA: 2007.
- 3. Koonin EV, Wolf YI. Evolution of microbes and viruses: A paradigm shift in evolutionary biology? Front Cell Infect Microbiol. 2012;2:119. doi: 10.3389/fcimb.2012.00119.
- 4. Moran NA, Bennett GM. The tiniest tiny genomes. Annu Rev Microbiol. 2014;68:195–215. doi: 10.1146/annurev-micro-091213-112901.
- 5. Han K, et al. Extraordinary expansion of a Sorangium cellulosum genome from an alkaline milieu. Sci Rep. 2013;3:2101. doi: 10.1038/srep02101.
- 6. Koonin EV. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol. 2003;1:127–136. doi: 10.1038/nrmicro751.
- 7. Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: The bacterial pan-genome. Curr Opin Microbiol. 2008;11:472–477. doi: 10.1016/j.mib.2008.09.006.
- 8. Puigbò P, Lobkovsky AE, Kristensen DM, Wolf YI, Koonin EV. Genomes in turmoil: Quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 2014;12:66. doi: 10.1186/s12915-014-0066-4.
- 9. Koonin EV, Wolf YI. Genomics of bacteria and archaea: The emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008;36:6688–6719. doi: 10.1093/nar/gkn668.
- 10. Wolf YI, Makarova KS, Lobkovsky AE, Koonin EV. Two fundamentally different classes of microbial genes. Nat Microbiol. 2016;2:16208. doi: 10.1038/nmicrobiol.2016.208.
- 11. Frost LS, Leplae R, Summers AO, Toussaint A. Mobile genetic elements: The agents of open source evolution. Nat Rev Microbiol. 2005;3:722–732. doi: 10.1038/nrmicro1235.
- 12. Lobkovsky AE, Wolf YI, Koonin EV. Gene frequency distributions reject a neutral model of genome evolution. Genome Biol Evol. 2013;5:233–242. doi: 10.1093/gbe/evt002.
- 13. Kuo CH, Ochman H. Deletional bias across the three domains of life. Genome Biol Evol. 2009;1:145–152. doi: 10.1093/gbe/evp016.
- 14. Sela I, Wolf YI, Koonin EV. Theory of prokaryotic genome evolution. Proc Natl Acad Sci USA. 2016;113:11399–11407. doi: 10.1073/pnas.1614083113.
- 15. Novichkov PS, Wolf YI, Dubchak I, Koonin EV. Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol. 2009;191:65–73. doi: 10.1128/JB.01237-08.
- 16. Kuo CH, Moran NA, Ochman H. The consequences of genetic drift for bacterial genome complexity. Genome Res. 2009;19:1450–1454. doi: 10.1101/gr.091785.109.
- 17. Kuo CH, Ochman H. The extinction dynamics of bacterial pseudogenes. PLoS Genet. 2010;6:e1001050. doi: 10.1371/journal.pgen.1001050.
- 18. Lee M-C, Marx CJ. Repeated, selection-driven genome reduction of accessory genes in experimental populations. PLoS Genet. 2012;8:e1002651. doi: 10.1371/journal.pgen.1002651.
- 19. Koskiniemi S, Sun S, Berg OG, Andersson DI. Selection-driven gene loss in bacteria. PLoS Genet. 2012;8:e1002787. doi: 10.1371/journal.pgen.1002787.
- 20. Iranzo J, Puigbò P, Lobkovsky AE, Wolf YI, Koonin EV. Inevitability of genetic parasites. Genome Biol Evol. 2016;8:2856–2869. doi: 10.1093/gbe/evw193.
- 21. Lynch M, Marinov GK. The bioenergetic costs of a gene. Proc Natl Acad Sci USA. 2015;112:15690–15695. doi: 10.1073/pnas.1514974112.
- 22. Bichsel M, Barbour AD, Wagner A. Estimating the fitness effect of an insertion sequence. J Math Biol. 2013;66:95–114. doi: 10.1007/s00285-012-0504-2.
- 23. Iranzo J, Gómez MJ, López de Saro FJ, Manrubia S. Large-scale genomic analysis suggests a neutral punctuated dynamics of transposable elements in bacterial genomes. PLOS Comput Biol. 2014;10:e1003680. doi: 10.1371/journal.pcbi.1003680.
- 24. Csűrös M. Count: Evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics. 2010;26:1910–1912. doi: 10.1093/bioinformatics/btq315.
- 25. Karev GP, Wolf YI, Berezovskaya FS, Koonin EV. Gene family evolution: An in-depth theoretical and simulation analysis of non-linear birth-death-innovation models. BMC Evol Biol. 2004;4:32. doi: 10.1186/1471-2148-4-32.
- 26. van Passel MWJ, Nijveen H, Wahl LM. Birth, death, and diversification of mobile promoters in prokaryotes. Genetics. 2014;197:291–299. doi: 10.1534/genetics.114.162883.
- 27. Basten CJ, Moody ME. A branching-process model for the evolution of transposable elements incorporating selection. J Math Biol. 1991;29:743–761. doi: 10.1007/BF00160190.
- 28. Huynen MA, van Nimwegen E. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol. 1998;15:583–589. doi: 10.1093/oxfordjournals.molbev.a025959.
- 29. Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS, Koonin EV. Birth and death of protein domains: A simple model of evolution explains power law behavior. BMC Evol Biol. 2002;2:18. doi: 10.1186/1471-2148-2-18.
- 30. Gardiner CW. Handbook of Stochastic Methods. Springer-Verlag; Berlin: 2004.
- 31. van Kampen NG. Stochastic Processes in Physics and Chemistry. North-Holland; Amsterdam: 2001.
- 32. Novichkov PS, Ratnere I, Wolf YI, Koonin EV, Dubchak I. ATGC: A database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res. 2009;37:D448–D454. doi: 10.1093/nar/gkn684.
- 33. Kristensen DM, Wolf YI, Koonin EV. ATGC database and ATGC-COGs: An updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation. Nucleic Acids Res. 2017;45:D210–D218. doi: 10.1093/nar/gkw934.
- 34. Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2015;43:D261–D269. doi: 10.1093/nar/gku1223.
- 35. Siew N, Fischer D. Unravelling the ORFan puzzle. Comp Funct Genomics. 2003;4:432–441. doi: 10.1002/cfg.311.
- 36. Yu G, Stoltzfus A. Population diversity of ORFan genes in Escherichia coli. Genome Biol Evol. 2012;4:1176–1187. doi: 10.1093/gbe/evs081.
- 37. Nilsson AI, et al. Bacterial genome size reduction by experimental evolution. Proc Natl Acad Sci USA. 2005;102:12112–12116. doi: 10.1073/pnas.0503654102.
- 38. Dillon MM, Sung W, Lynch M, Cooper VS. The rate and molecular spectrum of spontaneous mutations in the GC-rich multichromosome genome of Burkholderia cenocepacia. Genetics. 2015;200:935–946. doi: 10.1534/genetics.115.176834.
- 39. Dillon MM, Sung W, Sebra R, Lynch M, Cooper VS. Genome-wide biases in the rate and molecular spectrum of spontaneous mutations in Vibrio cholerae and Vibrio fischeri. Mol Biol Evol. 2017;34:93–109. doi: 10.1093/molbev/msw224.
- 40. Long H, et al. Background mutational features of the radiation-resistant bacterium Deinococcus radiodurans. Mol Biol Evol. 2015;32:2383–2392. doi: 10.1093/molbev/msv119.
- 41. Sung W, et al. Evolution of the insertion-deletion mutation rate across the tree of life. G3 (Bethesda). 2016;6:2583–2591. doi: 10.1534/g3.116.030890.
- 42. Sung W, et al. Asymmetric context-dependent mutation patterns revealed through mutation-accumulation experiments. Mol Biol Evol. 2015;32:1672–1683. doi: 10.1093/molbev/msv055.
- 43. Sung W, Ackerman MS, Miller SF, Doak TG, Lynch M. Drift-barrier hypothesis and mutation-rate evolution. Proc Natl Acad Sci USA. 2012;109:18488–18492. doi: 10.1073/pnas.1216223109.
- 44. Sung W, et al. Extraordinary genome stability in the ciliate Paramecium tetraurelia. Proc Natl Acad Sci USA. 2012;109:19339–19344. doi: 10.1073/pnas.1210663109.
- 45. Dettman JR, Sztepanacz JL, Kassen R. The properties of spontaneous mutations in the opportunistic pathogen Pseudomonas aeruginosa. BMC Genomics. 2016;17:27. doi: 10.1186/s12864-015-2244-3.
- 46. Lynch M. Streamlining and simplification of microbial genome architecture. Annu Rev Microbiol. 2006;60:327–349. doi: 10.1146/annurev.micro.60.080805.142300.
- 47. Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302:1401–1404. doi: 10.1126/science.1089370.
- 48. Krylov DM, Wolf YI, Rogozin IB, Koonin EV. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 2003;13:2229–2235. doi: 10.1101/gr.1589103.
- 49. Novozhilov AS, Karev GP, Koonin EV. Biological applications of the theory of birth-and-death processes. Brief Bioinform. 2006;7:70–85. doi: 10.1093/bib/bbk006.
- 50. Moody ME. A branching process model for the evolution of transposable elements. J Math Biol. 1988;26:347–357. doi: 10.1007/BF00277395.
- 51. Veitia RA, Potier MC. Gene dosage imbalances: Action, reaction, and models. Trends Biochem Sci. 2015;40:309–317. doi: 10.1016/j.tibs.2015.03.011.
- 52. Axelsen JB, Yan KK, Maslov S. Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins. Biol Direct. 2007;2:32. doi: 10.1186/1745-6150-2-32.
- 53. Innan H, Kondrashov F. The evolution of gene duplications: Classifying and distinguishing between models. Nat Rev Genet. 2010;11:97–108. doi: 10.1038/nrg2689.
- 54. Jalasvuori M, Koonin EV. Classification of prokaryotic genetic replicators: Between selfishness and altruism. Ann N Y Acad Sci. 2015;1341:96–105. doi: 10.1111/nyas.12696.
- 55. Cerveau N, Leclercq S, Leroy E, Bouchon D, Cordaux R. Short- and long-term evolutionary dynamics of bacterial insertion sequences: Insights from Wolbachia endosymbionts. Genome Biol Evol. 2011;3:1175–1186. doi: 10.1093/gbe/evr096.
- 56. Nelson WC, Wollerman L, Bhaya D, Heidelberg JF. Analysis of insertion sequences in thermophilic cyanobacteria: Exploring the mechanisms of establishing, maintaining, and withstanding high insertion sequence abundance. Appl Environ Microbiol. 2011;77:5458–5466. doi: 10.1128/AEM.05090-11.
- 57. Brügger K, et al. Mobile elements in archaeal genomes. FEMS Microbiol Lett. 2002;206:131–141. doi: 10.1111/j.1574-6968.2002.tb10999.x.
- 58. Van Melderen L. Toxin-antitoxin systems: Why so many, what for? Curr Opin Microbiol. 2010;13:781–785. doi: 10.1016/j.mib.2010.10.006.
- 59. Van Melderen L, Saavedra De Bast M. Bacterial toxin-antitoxin systems: More than selfish entities? PLoS Genet. 2009;5:e1000437. doi: 10.1371/journal.pgen.1000437.
- 60. Kobayashi I. Behavior of restriction-modification systems as selfish mobile elements and their impact on genome evolution. Nucleic Acids Res. 2001;29:3742–3756. doi: 10.1093/nar/29.18.3742.
- 61. Furuta Y, Kobayashi I. Restriction-modification systems as mobile epigenetic elements. In: Roberts AP, Mullany P, editors. Bacterial Integrative Mobile Genetic Elements. Landes Bioscience; Austin, TX: 2011.
- 62. Weinberger AD, Wolf YI, Lobkovsky AE, Gilmore MS, Koonin EV. Viral diversity threshold for adaptive immunity in prokaryotes. MBio. 2012;3:e00456-12. doi: 10.1128/mBio.00456-12.
- 63. Koonin EV, Zhang F. Coupling immunity and programmed cell suicide in prokaryotes: Life-or-death choices. BioEssays. 2017;39:1–9. doi: 10.1002/bies.201600186.
- 64. Iranzo J, Lobkovsky AE, Wolf YI, Koonin EV. Immunity, suicide or both? Ecological determinants for the combined evolution of anti-pathogen defense systems. BMC Evol Biol. 2015;15:43. doi: 10.1186/s12862-015-0324-2.
- 65. Sousa A, Bourgard C, Wahl LM, Gordo I. Rates of transposition in Escherichia coli. Biol Lett. 2013;9:20130838. doi: 10.1098/rsbl.2013.0838.
- 66. Ohtsubo Y, Genka H, Komatsu H, Nagata Y, Tsuda M. High-temperature-induced transposition of insertion elements in Burkholderia multivorans ATCC 17616. Appl Environ Microbiol. 2005;71:1822–1828. doi: 10.1128/AEM.71.4.1822-1828.2005.
- 67. Naas T, Blot M, Fitch WM, Arber W. Insertion sequence-related genetic variation in resting Escherichia coli K-12. Genetics. 1994;136:721–730. doi: 10.1093/genetics/136.3.721.
- 68. Beare PA, et al. Comparative genomics reveal extensive transposon-mediated genomic plasticity and diversity among potential effector proteins within the genus Coxiella. Infect Immun. 2009;77:642–656. doi: 10.1128/IAI.01141-08.
- 69. Moran NA, Plague GR. Genomic changes following host restriction in bacteria. Curr Opin Genet Dev. 2004;14:627–633. doi: 10.1016/j.gde.2004.09.003.
- 70. Mira A, Pushker R, Rodríguez-Valera F. The Neolithic revolution of bacterial genomes. Trends Microbiol. 2006;14:200–206. doi: 10.1016/j.tim.2006.03.001.
- 71. Rohmer L, et al. Comparison of Francisella tularensis genomes reveals evolutionary events associated with the emergence of human pathogenic strains. Genome Biol. 2007;8:R102. doi: 10.1186/gb-2007-8-6-r102.
- 72. Nagy Z, Chandler M. Regulation of transposition in bacteria. Res Microbiol. 2004;155:387–398. doi: 10.1016/j.resmic.2004.01.008.
- 73. Filée J, Siguier P, Chandler M. Insertion sequence diversity in archaea. Microbiol Mol Biol Rev. 2007;71:121–157. doi: 10.1128/MMBR.00031-06.
- 74. Elena SF, Ekunwe L, Hajela N, Oden SA, Lenski RE. Distribution of fitness effects caused by random insertion mutations in Escherichia coli. Genetica. 1998;102-103:349–358.
- 75. Garrett RA, et al. CRISPR-based immune systems of the Sulfolobales: Complexity and diversity. Biochem Soc Trans. 2011;39:51–57. doi: 10.1042/BST0390051.
- 76. Zhou F, Olman V, Xu Y. Insertion sequences show diverse recent activities in Cyanobacteria and Archaea. BMC Genomics. 2008;9:36. doi: 10.1186/1471-2164-9-36.
- 77. Touchon M, Rocha EP. Causes of insertion sequences abundance in prokaryotic genomes. Mol Biol Evol. 2007;24:969–981. doi: 10.1093/molbev/msm014.
- 78. Touchon M, Bernheim A, Rocha EP. Genetic and life-history traits associated with the distribution of prophages in bacteria. ISME J. 2016;10:2744–2754. doi: 10.1038/ismej.2016.47.
- 79. Lynch M, et al. Genetic drift, selection and the evolution of the mutation rate. Nat Rev Genet. 2016;17:704–714. doi: 10.1038/nrg.2016.104.
- 80. Kristensen DM, et al. A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics. 2010;26:1481–1487. doi: 10.1093/bioinformatics/btq229.
- 81. Csűrös M, Miklós I. A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Research in Computational Molecular Biology. RECOMB 2006. Lecture Notes in Computer Science. Vol 3909. Springer; Berlin: 2006.
- 82.McGill R, Tukey JW, Larsen WA. Variations of box plots. Am Stat. 1978;32:12–16. [Google Scholar]
- 83.Hudson DJ. Interval estimation from likelihood function. J R Stat Soc B. 1971;33:256–262. [Google Scholar]
- 84.Yang Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]