Proceedings of the National Academy of Sciences of the United States of America. 2016 Oct 4;113(41):11399–11407. doi: 10.1073/pnas.1614083113

Theory of prokaryotic genome evolution

Itamar Sela, Yuri I Wolf, Eugene V Koonin
PMCID: PMC5068321  PMID: 27702904

Significance

Bacteria and archaea have small genomes with tightly packed protein-coding genes. Typically, this genome architecture is explained by “genome streamlining” (minimization) under selection for high replication rate. We developed a mathematical model of microbial evolution and tested it against extensive data from multiple genome comparisons to identify the key evolutionary forces. The results indicate that genome evolution is not governed by streamlining but rather, reflects the balance between the benefit of additional genes that diminishes with the genome size and the intrinsic preference for DNA deletion over acquisition. These results explain the observation that, in an apparent contradiction with the population genetic theory, microbes with large genomes reach higher abundance and are subject to stronger selection than small “streamlined” genomes.

Keywords: evolutionary genomics, prokaryotic genome size, genome streamlining, positive selection, deletion bias

Abstract

Bacteria and archaea typically possess small genomes that are tightly packed with protein-coding genes. The compactness of prokaryotic genomes is commonly perceived as evidence of adaptive genome streamlining caused by strong purifying selection in large microbial populations. In such populations, even the small cost incurred by nonfunctional DNA because of extra energy and time expenditure is thought to be sufficient for this extra genetic material to be eliminated by selection. However, contrary to the predictions of this model, there exists a consistent, positive correlation between the strength of selection at the protein sequence level, measured as the ratio of nonsynonymous to synonymous substitution rates, and microbial genome size. Here, by fitting the genome size distributions in multiple groups of prokaryotes to predictions of mathematical models of population evolution, we show that only models in which acquisition of additional genes is, on average, slightly beneficial yield a good fit to genomic data. These results suggest that the number of genes in prokaryotic genomes reflects the equilibrium between the benefit of additional genes that diminishes as the genome grows and deletion bias (i.e., the rate of deletion of genetic material being slightly greater than the rate of acquisition). Thus, new genes acquired by microbial genomes, on average, appear to be adaptive. The tight spacing of protein-coding genes likely results from a combination of the deletion bias and purifying selection that efficiently eliminates nonfunctional, noncoding sequences.


The majority of bacterial and archaeal genomes are small, at least compared with the genomes of multicellular and many unicellular eukaryotes (1, 2). Also, with the exception of deteriorating genomes of some parasitic bacteria, the prokaryotic genomes are highly compact, with densely packed protein-coding genes and a low fraction of noncoding sequences (3). The small genome size is thought to be selected for fast replication, whereas the high gene density additionally facilitates coregulation of gene expression via the operon organization (4, 5). Across the full range of cellular life forms, a significant negative correlation has been shown to exist between genome size and the product Ne·u, where Ne is the effective population size, and u is the mutation rate per nucleotide (6–9). Accordingly, a simple and appealing population genetic theory has been developed, under which selection strength controls genome size and complexity (6, 9). Prokaryotes, with the exception of some parasites, have large effective population sizes on the order of 10^9 or even higher, which implies strong selection enabling prokaryotes to maintain compact genomes (10). Under this strong selection regime, even short nonfunctional sequences incur cost that is “visible” to selection, conceivably through a combination of increasing energy expenditure and reducing the replication rate, and are efficiently weeded out (11). In eukaryotes, at least the multicellular forms, the effective population size is substantially (by orders of magnitude) smaller, and consequently, selection is not strong enough to eliminate superfluous genetic material, which results in “bloated” genomes but also provides the raw material for the evolution of complex features (6–8). 
It is often assumed, implicitly or explicitly, that any extra genetic material arising from duplication or acquisition is, on average, slightly deleterious for the host, because the new DNA does not perform any immediately beneficial function but incurs the cost right from the beginning. This theory, which is steeped in the well-established principles of population genetics, provides a simple, unified framework for understanding evolution of genomic complexity without invoking widespread adaptation. This nonadaptive theory can be reasonably assumed as the null hypothesis of genome evolution, the predictions of which have to be falsified to claim adaptive phenomena (8, 12).

The population genetic theory clearly predicts an inverse correlation between the strength of selection at different levels and genome size: small genomes are predicted to be subject to stronger selection than large genomes (9). However, when protein sequence-level selection was measured for multiple groups of closely related bacteria using the ratio of the nonsynonymous to synonymous substitution rates (dN/dS) as a proxy, the opposite effect, namely a significant negative correlation between dN/dS and genome size, was observed, indicating that larger prokaryotic genomes typically evolve under a stronger selection than small ones (13).

Here, we sought to further investigate the evolutionary factors that control genome evolution in prokaryotes. We reproduced the negative correlation between dN/dS and genome size on an expanded genome collection and then, developed a mathematical model of genome evolution by gene gain and loss in prokaryotic populations. By fitting the distribution of genome sizes predicted by the model to the empirical distribution for many groups of prokaryotes, we found that a good fit between theory and the data could be obtained only for models that included a positive mean fitness contribution of the gained genes countered by a deletion bias. These results imply that, at the level of gene gain and loss, selection for genome compactness does not play a major role in determining the genome size in prokaryotes. Rather, the relatively small number of genes in prokaryotes is explained by the diminishing return associated with the acquisition of new genes as the genome grows combined with the intrinsic deletion bias.

Results

Protein-Level Selection and Microbial Genome Size: Model Prediction and Genomic Data.

It has been shown previously that, in prokaryotes, genome size is negatively and significantly correlated with dN/dS, suggestive of a stronger protein-level selection in larger genomes (13). For the purpose of this work, we reproduced these findings on a substantially expanded set of groups of closely related bacterial and archaeal genomes compiled in the latest version of the Alignable Tight Genomic Cluster (ATGC) database (14) (Fig. 1). Specifically, genome size (measured by the number of genes) and dN/dS values are correlated with Spearman’s rank correlation coefficient ρ = −0.397. These findings imply that, in addition to local influences of different environments and lifestyles, there are underlying universal components of gene gain and loss probabilities with respect to genome size that are common across all microbes.
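For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the rank vectors. The sketch below is a minimal pure-Python version with average ranks for ties; the five genome sizes and dN/dS values are made-up toy numbers chosen only to show a negative rank correlation, not ATGC data, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values (hypothetical): larger genomes paired with mostly lower dN/dS.
toy_genome_sizes = [1500, 2200, 3100, 4000, 5200]
toy_dn_ds = [0.12, 0.09, 0.10, 0.06, 0.05]
toy_rho = spearman_rho(toy_genome_sizes, toy_dn_ds)
print(toy_rho)
```

A negative value here means that the ranks of genome size and dN/dS run in opposite directions, which is the sense in which the text reads a negative ρ as stronger protein-level selection in larger genomes.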

Fig. 1.

Genome size and selection in prokaryotes. Mean observed genome sizes (number of genes) are plotted against (A) the selection strength and (B) the estimated effective population size for each ATGC. Error bars correspond to 1 SD. ATGCs with more than 20 species are indicated in orange. The mean genome size, calculated using the mathematical model, is indicated in B by a solid red line. The model was applied with a constant selection coefficient, the gain and loss rates of Eq. 11, and optimized parameters shown in Table 1.

To quantitatively analyze the relations between selection strength and genome size, a mathematical model was implemented using the Moran process framework (15). In the model, a genome is represented as a collection of x genes, and the variation in genome size is attributed to stochastic gain and loss of genes (Fig. 2). New genes are assumed to be acquired at rate α, which in principle, depends on x and accounts for the probability that the acquired genetic material is inserted in a nondeleterious locus and expressed. Genes can be acquired by either horizontal gene transfer (in the model, assumed to draw genes from an infinite pool) or duplication, and α represents a weighted average over all possible evolutionary processes that lead to gene gain. Reconstructions of genome evolution performed on the ATGCs indicate that the contribution of horizontal gene transfer by far exceeds that of duplication (14). To model gene loss, the deleted gene is picked from the genome at random with the deletion rate β, which, analogous to α, in principle depends on x. Acquisitions and deletions of genetic material are either fixed or eliminated stochastically, with a fixation probability that depends on the selection coefficient s, which is itself a function of genome size, s = s(x). Specifically, s(x) denotes the mean selective advantage of an individual with x+1 genes over an individual with x genes. Conveniently, selection coefficients associated with gene acquisition and gene deletion are related as (Methods)

$$s_{\mathrm{deletion}} = -s_{\mathrm{acquisition}}. \qquad [1]$$

The fixation probability F can be approximated by

$$F(x) \approx \frac{s(x)}{1 - e^{-N_e s(x)}}, \qquad [2]$$

where Ne is the effective population size (16). The effective population size is estimated for each prokaryotic group from the dN/dS ratio (Methods). Fixed acquisition and deletion events, hereinafter “gene gain” and “gene loss,” respectively, occur with probabilities given by the multiplication of the event rate and the fixation probability:

$$P^{+}(x) = \alpha(x)\,\frac{s(x)}{1 - e^{-N_e s(x)}} \qquad [3]$$

and

$$P^{-}(x) = \beta(x)\,\frac{s(x)\,e^{-N_e s(x)}}{1 - e^{-N_e s(x)}}, \qquad [4]$$

where Eq. 2 is used together with the relation of Eq. 1.
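Eqs. 2–4 translate directly into a few lines of code. The following is a minimal sketch rather than the authors' implementation; the function names and the numerical values in the usage lines (α = 1, β = 1.005, Ne = 10^9) are assumptions made for illustration, with s taken to be of the order reported later in the text.

```python
import math


def fixation_probability(s, n_e):
    """Eq. 2: F = s / (1 - exp(-Ne*s)); the neutral limit s -> 0 is 1/Ne."""
    if s == 0.0:
        return 1.0 / n_e
    # expm1 keeps precision when Ne*s is small.
    return s / -math.expm1(-n_e * s)


def gain_probability(alpha, s, n_e):
    """Eq. 3: acquisition rate times the fixation probability of a gain (+s)."""
    return alpha * fixation_probability(s, n_e)


def loss_probability(beta, s, n_e):
    """Eq. 4: deletion rate times the fixation probability of a loss (-s, Eq. 1)."""
    return beta * fixation_probability(-s, n_e)


# Illustrative (assumed) values, not fitted parameters.
s, n_e = 6e-12, 1e9
p_plus = gain_probability(1.0, s, n_e)
p_minus = loss_probability(1.005, s, n_e)
print(p_plus, p_minus, p_minus / p_plus)
```

Calling `fixation_probability(-s, n_e)` gives s·e^(−Ne·s)/(1 − e^(−Ne·s)), which is the fraction appearing in Eq. 4, so the ratio of loss to gain probabilities reduces to (β/α)·e^(−Ne·s).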

Fig. 2.

The model of genome evolution. (Upper) Illustration of the mathematical model of genome size evolution. The number of genes changes stochastically via gene gains and losses, which occur with probabilities P+ and P−, respectively. (Lower) Gene gain (solid purple curve) and loss (dashed purple curve) probabilities. Gain and loss probabilities are equal at x0, indicated by a vertical solid line. For values of x smaller than x0, the gain probability is larger than the loss probability, and therefore, the extremum of the steady-state genome size distribution (orange curve) at x0 is a maximum. The distribution is moderately skewed, and the mean value of x, indicated by a vertical dashed line, is close to the value of x0 (indicated by the solid vertical line).

Gain and loss probabilities govern the stochastic genome size dynamics, with the intuitive relation (derivation is given in Methods)

$$\dot{x} = P^{+}(x) - P^{-}(x), \qquad [5]$$

meaning that, at each time step, there is a probability P+ that one gene is gained and a probability P− that one gene is lost, where time is measured in fixation time units (a unit is the time that it takes, on average, for an acquired gene to be fixed in the population) rather than generations. Because of the stochastic fluctuations, genome sizes form a distribution. If, for a certain genome size value x0, the gain and loss probabilities are equal, there is a steady-state genome size distribution with an extremum point at x0 as illustrated in Fig. 2 and SI Appendix, Fig. S1. The extremum point depends on the deletion–acquisition rates ratio r(x),

$$r(x) = \frac{\beta(x)}{\alpha(x)}, \qquad [6]$$

and the equality of gain and loss probabilities implies

$$s(x_0) = \frac{\ln r(x_0)}{N_e}. \qquad [7]$$

As reflected in the above equation (and intuitively), steady state is possible only when the more frequent event, gene acquisition or gene deletion, is counterselected at genome size x0 (Fig. 3). For example, positive s(x0) implies selective advantage for larger genomes. In this case, steady state is possible only when deletion rates are higher than acquisition rates [i.e., r(x0) > 1]. Formally, the sign of s(x0) in Eq. 7 is determined by the sign of ln r(x0). In the special case of acquisition and deletion rates being equal at x0, s(x0) = 0, which implies that either there is no selection with respect to genome size or the fitness function has an extremum at x0 (Eq. 16). The genome size distribution extremum point at x0 is a maximum only if an additional condition is satisfied: for x < x0, gain and loss probabilities must satisfy P+ > P− (SI Appendix, Fig. S1). This condition is met when

$$\left.\frac{dP^{+}}{dx}\right|_{x_0} < \left.\frac{dP^{-}}{dx}\right|_{x_0}. \qquad [8]$$

The case where x0 is a minimum of the steady-state genome size distribution is biologically irrelevant, corresponding to genome sizes tending toward either zero or infinity (SI Appendix, Fig. S1).
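The step from the gain and loss probabilities (Eqs. 3 and 4) to the equilibrium condition (Eq. 7) can be written out explicitly. The short derivation below uses only the symbols defined above and assumes s(x0) ≠ 0 so that the common factor can be cancelled; for s(x0) = 0 both fixation probabilities reduce to 1/Ne and the condition becomes r(x0) = 1, in agreement with Eq. 7.

```latex
\begin{align}
P^{+}(x_0) &= P^{-}(x_0) \\
\alpha(x_0)\,\frac{s(x_0)}{1-e^{-N_e s(x_0)}}
  &= \beta(x_0)\,\frac{s(x_0)\,e^{-N_e s(x_0)}}{1-e^{-N_e s(x_0)}} \\
e^{N_e s(x_0)} &= \frac{\beta(x_0)}{\alpha(x_0)} = r(x_0) \\
s(x_0) &= \frac{\ln r(x_0)}{N_e}
\end{align}
```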

Fig. 3.

Different regimes for the selection and the gain/loss rates ratio (Eq. 7). In the shaded area, the genome size steady state is achieved for s(x0) > 0, and accordingly, r(x0) > 1. In this regime, gene loss is selected against but occurs at higher rates than gene acquisition and is therefore denoted drift. Genome sizes xmax and x1 (shown by vertical lines) denote values for which s(x0) = 0 and r(x0) = 1, respectively. These values are not necessarily the same, and the resulting value of x0 depends on the functional forms of s(x) and r(x), which are shown as straight blue lines for convenience.

For the calculation of the steady-state distribution of genome size, it is useful to present the population model of the genome size evolution as a random walk, where the probability of a step up is P+, and the probability of a step down is P−. The equation for the genome size distribution (Methods) has a steady-state solution:

$$f(x) \propto \left[P^{+}(x) + P^{-}(x)\right]^{-1} \exp\!\left(2\int \frac{P^{+}(x) - P^{-}(x)}{P^{+}(x) + P^{-}(x)}\,dx\right). \qquad [9]$$

If α depends on x, f(x) depends on the functional forms of both α(x) and β(x) and unlike the equation for x0 (Eq. 7), cannot be written using r(x) only. The genome size distribution allows one to compare different functional forms for P±(x) in terms of compatibility with observed genome sizes under the assumption that, whatever the gene gain and loss probabilities might be, they are similar in all prokaryotes. Specifically, maximizing the log likelihood of the data given a specific model allows optimization of model parameters and comparison of different cases (details are in Methods). To account for the most general case, where both the selection coefficient and the acquisition–deletion rates ratio vary with genome size, s is taken as linear in x, and r is taken as a power law:

s(x)=a+bx [10]

and

$$r(x) = r'\,x^{\lambda}. \qquad [11]$$

These functional forms were chosen to minimize the number of optimized parameters. The linear selection coefficient can be regarded as a first-order expansion, and the power law functional form for r(x) was chosen, because it includes two extreme cases, those with constant and linear rates ratio, as well as all intermediates. The selection coefficient sign is not assumed a priori but is an outcome of the fitting process. Furthermore, it is in principle possible that the selection coefficient sign will be different in different ATGCs because of their different typical genome sizes.
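To make the steady-state solution (Eq. 9) concrete, the sketch below evaluates it numerically on a grid with a cumulative trapezoidal rule. It is an illustration under stated assumptions, not the authors' fitting code: the selection coefficient is held constant, the acquisition rate is set to α(x) = 1 (whereas the source lets α depend on x), r(x) follows the power law of Eq. 11, and the parameter values are the rounded ones from Table 1 together with an assumed Ne = 10^9. The function name `steady_state_density` is hypothetical.

```python
import math


def steady_state_density(xs, p_gain, p_loss):
    """Evaluate Eq. 9 on the increasing grid xs and normalize numerically."""
    drift = [(p_gain(x) - p_loss(x)) / (p_gain(x) + p_loss(x)) for x in xs]
    # Cumulative trapezoidal integral of (P+ - P-)/(P+ + P-) from xs[0] to x.
    integral = [0.0]
    for i in range(1, len(xs)):
        step = 0.5 * (drift[i] + drift[i - 1]) * (xs[i] - xs[i - 1])
        integral.append(integral[-1] + step)
    # Work with logarithms and subtract the maximum to avoid overflow.
    log_f = [2.0 * acc - math.log(p_gain(x) + p_loss(x))
             for x, acc in zip(xs, integral)]
    top = max(log_f)
    f = [math.exp(v - top) for v in log_f]
    norm = sum(0.5 * (f[i] + f[i - 1]) * (xs[i] - xs[i - 1])
               for i in range(1, len(xs)))
    return [v / norm for v in f]


N_E, S = 1.0e9, 6.0e-12        # assumed Ne; s of the order quoted in the text
R_PRIME, LAM = 0.99, 1.7e-3    # rounded Table 1 values, indicative only


def p_gain(x):
    # Eq. 3 with alpha(x) = 1 (an assumption made for this sketch).
    return S / -math.expm1(-N_E * S)


def p_loss(x):
    # Eq. 4 with beta(x) = r(x) * alpha(x) and r(x) from Eq. 11.
    r = R_PRIME * x ** LAM
    return r * S * math.exp(-N_E * S) / -math.expm1(-N_E * S)


xs = [100.0 + 50.0 * i for i in range(1200)]
density = steady_state_density(xs, p_gain, p_loss)
mode = xs[density.index(max(density))]
x0 = math.exp((N_E * S - math.log(R_PRIME)) / LAM)  # Eq. 7 solved for x0
print(mode, x0)
```

Because the rate parameters are rounded and x0 depends on them exponentially, the absolute genome sizes produced here should not be read as predictions; the point is only that the mode of the computed density sits next to the genome size at which gain and loss probabilities balance.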

Even for s(x) and r(x), which are approximated by low-order functions, maximizing the log likelihood requires fitting five parameters (Methods), and therefore, in principle, it is not evident that the resulting fit corresponds to a global maximum of the log likelihood rather than a local peak. We, therefore, performed the fitting with different starting points in parameter space chosen as explained in Methods. In brief, Eq. 7 is used to estimate the mean genome size for different effective population sizes, and parameters are optimized, such that goodness of fit R^2 with respect to the mean genome sizes of the ATGCs and effective population sizes is maximized. This procedure resulted in five different sets of parameters (SI Appendix, Table S1) that were then used as starting points for additional optimization by log-likelihood maximization (Methods). Three of five starting points converged to similar values (SI Appendix, Table S2), whereas the remaining two converged to local maxima associated with significantly lower likelihood. In all optimized sets of parameters (from both stages), the selection coefficient is positive for all ATGCs, indicating that additional genes are, on average, beneficial as expected given the positive correlation between genome size and effective population size (SI Appendix, Fig. S2).

In all three log-likelihood optimized parameter sets, the selection coefficient is strictly positive for all ATGCs and weakly depends on x. Formally, the variation of the fixation probability with genome size is an order of magnitude smaller than the variation of acquisition and deletion rates with genome size (Methods). Accordingly, additional fittings were performed using constant selection coefficient (namely, independent of the genome size). In this case, the log-likelihood optimization always converged to the same set of parameters, with s ≈ 6×10^−12 and r values between 1.0034 and 1.0072, corresponding to smallest and largest genomes, respectively (fitted parameters are summarized in SI Appendix, Table S4). This value of s implies x0·s·Ne ≈ 1 (i.e., the “resolution of selection” is on the order of a single gene). Notably, the dependence of r on x is weak. The mean steady-state genome size is a function of effective population size for this fit (Fig. 1).
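With a constant selection coefficient and the power-law rates ratio of Eq. 11, the equilibrium condition of Eq. 7 can be solved for x0 in closed form, which shows how the equilibrium genome size grows with the effective population size. The sketch below assumes the rounded Table 1 values (r' = 0.99, λ = 1.7×10^−3, s = 6×10^−12); the Ne values in the loop are hypothetical, and because x0 depends exponentially on the rounded parameters, the printed sizes are indicative only.

```python
import math

S = 6.0e-12                    # constant selection coefficient (rounded)
R_PRIME, LAM = 0.99, 1.7e-3    # rounded parameters of r(x) = r' * x**lam


def rate_ratio(x):
    """Eq. 11: deletion-acquisition rates ratio."""
    return R_PRIME * x ** LAM


def equilibrium_size(n_e):
    """Solve Eq. 7, ln r(x0) = Ne * s, for x0."""
    return math.exp((n_e * S - math.log(R_PRIME)) / LAM)


# Hypothetical effective population sizes, for illustration only.
for n_e in (4e8, 6e8, 8e8, 1e9):
    x0 = equilibrium_size(n_e)
    print(f"Ne = {n_e:.0e}  x0 = {x0:.0f}  r(x0) = {rate_ratio(x0):.4f}")
```

At equilibrium r(x0) equals e^(Ne·s), so the r values quoted in the text (1.0034 to 1.0072) correspond, under these assumptions, to Ne·s between roughly 0.0034 and 0.0072.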

For complementarity, fitting with constant r together with an x-dependent selection coefficient was performed as well (SI Appendix), allowing inference of the selection landscape beyond the first-order expansion for a constant acquisition–deletion rates ratio. In this case, the best fit is achieved for positive and saturating selection landscape, indicating that, on average, additional genes are beneficial but that the benefit decreases with the growth of the genome size.

As an approximation for nonuniversal, ATGC-specific factors that affect the genome size, optimization of the gene acquisition and deletion rates was performed separately for each ATGC. The selection coefficient was taken to be the same for all ATGCs and set to the value obtained in the global fitting at the previous stage. For each ATGC, r' and λ were optimized, where the fitting of 120 parameters (compared with 5 parameters at the previous stage) is justified by the Akaike Information Criterion (AIC) (Table 1). The fitted values of r' and λ are shown in SI Appendix, Fig. S3, and the resulting genome size distributions for ATGCs with 20 or more species are shown in Fig. 4.

Table 1.

Parameter values, log likelihood, and AIC for the selection landscape and deletion–acquisition rates ratio of Eqs. 10 and 11

| Selection landscape | No. of parameters | Parameter values | LL | AIC |
| Linear s and power law r (Eqs. 10 and 11) | 5 | a = 6×10^−12; b = 2×10^−16; r' = 0.99; λ = 2×10^−3; λ+ = 7×10^−4 | −6.04×10^3 | 1.21×10^4 |
| Power law r (Eq. 11) and constant s | 4 | s = 6×10^−12; r' = 0.99; λ = 1.7×10^−3; λ+ = 1×10^−3 | −6.02×10^3 | 1.21×10^4 |
| Power law r (Eq. 11) and constant s, individual ATGC fit | 120 | s = 6×10^−12; λ+ = 1×10^−2; λ and r' are shown in SI Appendix, Fig. S3 | −4.55×10^3 | 9.34×10^3 |
| SI Appendix, Eq. S1 with constant r | 2 | γ = 2.92; r = 1.0006 | −6.21×10^3 | 1.24×10^4 |

The selection landscape of SI Appendix, Eq. S1 with constant deletion–acquisition rates ratio is also shown for comparison. Log-likelihood (LL) calculation details are given in Methods.
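The AIC column of Table 1 can be cross-checked against the log-likelihood column with the standard definition AIC = 2k − 2·LL, where k is the number of fitted parameters. The sketch below assumes that the LL values are negative (as the −6,022 value quoted in the text for complete genomes suggests) and that the exponents lost in extraction are 10^3 and 10^4; the row labels are shorthand introduced here.

```python
def aic(n_params, log_likelihood):
    """Akaike Information Criterion: AIC = 2k - 2*LL."""
    return 2 * n_params - 2 * log_likelihood


# (number of parameters, LL, reported AIC) as read from Table 1.
table_1 = {
    "linear s, power-law r": (5, -6.04e3, 1.21e4),
    "constant s, power-law r": (4, -6.02e3, 1.21e4),
    "constant s, per-ATGC r' and lambda": (120, -4.55e3, 9.34e3),
    "SI Eq. S1, constant r": (2, -6.21e3, 1.24e4),
}

for name, (k, ll, reported) in table_1.items():
    print(f"{name}: computed {aic(k, ll):.0f}, reported {reported:.0f}")

best = min(table_1, key=lambda name: aic(*table_1[name][:2]))
print("lowest AIC:", best)
```

The computed values agree with the reported ones to within the rounding of the LL column, and the per-ATGC fit has the lowest AIC despite its 120 parameters, which is the sense in which the text says the additional parameters are justified.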

Fig. 4.

Comparison of the model predictions with the empirical genome size distributions. The observed genome size distributions are shown by bars for six ATGCs that consist of 20 species or more each. Genome size distributions predicted by the population evolution model are shown by red lines using the selection landscape and deletion–acquisition rate of Eqs. 10 and 11 and optimized r' and λ parameter values for each ATGC separately (optimized values are shown in SI Appendix, Fig. S3). The goodness of fit R^2 is indicated for each ATGC. The ATGCs are as follows (the numbers of genomes for each ATGC are indicated in parentheses): (A) ATGC0001 (109), (B) ATGC0003 (22), (C) ATGC0004 (22), (D) ATGC0014 (31), (E) ATGC0021 (45), and (F) ATGC0050 (51). All ATGC genomes are listed in Dataset S1.

Evolution of Distinct Functional Classes of Genes.

In our model, the gene acquisition and deletion rates, α(x) and β(x), respectively, are general characteristics of the organism that do not depend on the content of the acquired or lost genetic material. In contrast, the selection coefficient inferred above represents a local average with respect to the gene content of the organism and the available genetic material in the assumed infinite gene pool. The model can be extended to account for different classes of genes that evolve under distinct selection landscapes. Specifically, the number of class i genes, xi, is determined by the stochastic equation (Methods)

$$\dot{x}_i = k_i\,\alpha(x)\,F^{+}\!\left(s_i(x_i)\right) - \frac{x_i}{x}\,\beta(x)\,F^{-}\!\left(s_i(x_i)\right), \qquad [12]$$

where x is the total number of genes, ki is the probability of gain of a class i gene, and the selection landscape si is assumed to be a function of xi only. The width of the steady-state distribution is determined primarily by the linear term xi/x in the loss probability. To further test the model consistency with the empirical data, steady-state distributions were calculated for subsets of genes. The subsets were chosen based on the functional classes of genes as classified in the COG (Clusters of Orthologous Genes) database (17, 18). The selection landscape and ki were optimized for the best log-likelihood fit of the distribution predicted by the model to the genomic data (SI Appendix, Table S5). The distributions were calculated using the values of α(x) and β(x) that were obtained by fitting the distributions for complete gene sets (Table 1). The mean value of xi can be approximated by the equilibrium value xi0, for which the gain and loss probabilities are equal (analogous to x0 for the complete genomes as described above):

$$x_{i0}\,e^{-N_e s_i(x_{i0})} = \frac{k_i\,x}{r(x)}. \qquad [13]$$

This expression can be regarded as a generalization of previously reported scaling laws for different functional classes of genes with the genome size (19–21), where the ATGC-specific effective population sizes are taken into account (the full implications of this extension of the scaling analysis will be discussed elsewhere). Comparison of the empirical data and the model predictions for the number of genes in most of the functional classes shows a good fit between the model predictions and the genomic data (SI Appendix, Figs. S4 A and B and S5 A and B). For the translation system components and the genes involved in energy transformation, the log-likelihood values were −3,334 and −6,115, respectively, compared with the −6,022 value for complete genomes. Thus, notably, the genes for translation system components, the most conserved, universal functional class (22), are described by the model much better than a random subset of genes.
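The equilibrium condition for a gene class (Eq. 13) follows from the class-specific dynamics (Eq. 12) in the same way that Eq. 7 follows from Eqs. 3 and 4. A brief sketch of the steps, writing F+ and F− for the fixation probabilities of a gain and a loss under the selection coefficient si and using the ratio F−/F+ = e^(−Ne·si) implied by Eqs. 1 and 2:

```latex
\begin{align}
0 &= k_i\,\alpha(x)\,F^{+} - \frac{x_{i0}}{x}\,\beta(x)\,F^{-} \\
k_i &= \frac{x_{i0}}{x}\,\frac{\beta(x)}{\alpha(x)}\,\frac{F^{-}}{F^{+}} \\
k_i &= \frac{x_{i0}}{x}\,r(x)\,e^{-N_e s_i(x_{i0})} \\
x_{i0}\,e^{-N_e s_i(x_{i0})} &= \frac{k_i\,x}{r(x)}
\end{align}
```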

Finally, analogous to the whole-genome fitting procedure, to account for ATGC-specific effects, model distributions were further optimized by fitting ATGC-specific ki values. The resulting distributions for most of the functional classes showed good fits between the model and the genomic data as illustrated in SI Appendix, Fig. S6 A and B for the translation system components and genes involved in energy transformation of the largest ATGC, ATGC0001 (complete results are in SI Appendix, Figs. S7–S10). However, two classes of genes, namely the components of the “mobilome,” such as prophage genes and transposons, as well as the singletons (genes with no detectable orthologs within the given ATGC), dramatically deviate from model predictions (SI Appendix, Figs. S6 C and D, S9, and S10). For these gene classes, the observed distributions are significantly wider than those predicted by the model, regardless of the model parameters or the selection landscape. Accordingly, the log-likelihood values reflect the disagreement and are −18,870 and −49,384 for mobilome and singletons, respectively. Such a poor fit effectively indicates that these classes of genes evolve under evolutionary regimes qualitatively different from that of the rest of the genomes. Many of the mobilome components, in particular transposons, are prone to active duplication within the genome and therefore, cannot be described with the gene gain rate inferred for the complete gene sets. The singletons dramatically differ from the evolutionarily conserved genes shared by multiple microbes with respect to the tempo and mode of evolution. Most of the singletons encode small proteins and evolve fast, suggesting that they are associated with little (if any) benefit (23, 24). Thus, the key result of this analysis, namely the positive sign of the mean selection coefficient, does not apply to the singletons.

Discussion

The notion of strong purifying selection that favors small genomes (or more precisely, a small number of genes) in prokaryotes (10) seemed incompatible with the observed significant positive correlation between the genome sizes of bacteria and archaea and the inferred selection strength on the protein level reflected in the dN/dS ratio (13) (this work). These observations indicate that, on average, the larger the genome of a bacterium or an archaeon, the stronger selection under which the protein-coding genes evolve. This apparent discrepancy between the comparative genomic observations and the predictions of the population genetic theory motivated us to further investigate the selection regimes of microbial genomes. To this end, we compared the predictions of a mathematical model of genome evolution with the genome size distributions in 60 clusters of closely related bacteria and archaea.

To infer the selection landscape with regard to the genome size, the effective population size was estimated for each ATGC using the dN/dS ratio and assuming the same selection coefficient sc for the core genes in all ATGCs. This assumption is reasonable, because 51–56 core genes used for the dN/dS calculation are nearly universal and encode central biological functions, such as translation, that are functionally highly similar across the entire bacterial domain of cellular life (22). Small variations in sc across the different ATGCs might slightly affect the inferred gain and loss probabilities through the changes in the estimated effective population size associated with each ATGC. However, to affect our conclusions, namely that additional genes are, on average, beneficial, the variations in sc between ATGCs would have to be dramatic, such that the correlation between the genome size and the dN/dS ratio (Fig. 1A) would be abolished or reversed.

Inference of gain and loss probabilities requires estimation of three terms, namely gene gain and loss rates and the selection landscape. In principle, all three values depend on the genome size, such that it is impossible to infer all terms without assumptions on the functional forms of the respective dependencies. However, it has to be emphasized that our conclusions do not depend on specific modeling assumptions and hold as long as s(x) and r(x) are monotonic functions for typical prokaryotic genome sizes. The steady-state genome size distribution reflects the selection–drift balance: acquisition of new genes is, on average, beneficial, albeit with a small estimated selection coefficient (Table 1), and balanced by a deletion rate that is slightly greater than the acquisition rate. Under this regime, the selection on the gain or loss of an individual gene is weak, allowing substantial variation in genome size, but sufficient to produce the correlation between Ne and genome size. The fitted values of the deletion–acquisition rates ratio are very close to, albeit greater than, unity. This slight but consistent excess of gene loss over gain is likely to reflect the deletion bias that has been identified as an intrinsic feature of genome evolution in both bacteria and eukaryotes (25–27).

Extension of the model to account for subsets of genes allowed additional validation of the model consistency and assessment of the selection affecting any class of genes. In particular, we analyzed genes that are associated with specific cellular functions as classified in the COG database (17, 18) under the assumption that functionally similar genes evolve under similar selection landscapes. We obtained good fits to the model for all functional classes, indicating that the conclusion on the typical beneficial effect of gene acquisition applies to functionally diverse classes of genes. However, there were two notable exceptions to this consistency, namely the mobilome and the singletons. The distributions of the sizes of these classes in all ATGCs are much wider than predicted by the model, and the parameters could not be optimized to obtain a good fit. These observations imply that the evolutionary regimes of the mobilome and the singletons qualitatively differ from the genes in the other functional classes that possess “normal” cellular functions. There are indications of the nature of these differences. Many components of the mobilome, such as transposons, propagate within a genome, so that the dynamics of this class is dominated by duplication rather than gain from an external gene pool. The singletons are fast evolving genes that, on average, do not confer any benefit on the organism. Indeed, in a separate recent analysis of microbial genome evolution models, we have shown that the replacement rate of the singletons is effectively infinite compared with the replacement rates of the rest of the genes (28).

The model analysis described here was performed assuming a steady state with respect to the genome size, and two points have to be addressed in this regard. Each ATGC consists of phylogenetically close genomes (29) that cannot be considered independent samples from the genome size distribution without additional substantiation. We, therefore, verified that, for each ATGC, the divergence between the genomes, defined in this case as the number of gains and losses, was greater than the genome size distribution width, ensuring that a sufficient number of evolutionary events occurred so that the genomes represent an independent set of samples from the size distribution. A lower bound for the number of gene gain and loss events that occurred since the divergence of individual genomes can be estimated from the number of singletons divided by the probability of the acquisition of such genes. The required number of gain and loss events for sufficient sampling of the distribution can be estimated from the SD of all genome sizes in an ATGC. The estimated numbers of gain events are greater than the SDs for all ATGCs (SI Appendix, Fig. S11), indicating that the genome size distribution was sampled sufficiently.

In our previous work, a comprehensive analysis of gain and loss events was performed for the same groups of microbial genomes (ATGCs) that were analyzed here (14). The gene loss to gain ratios obtained through a maximum likelihood reconstruction of genome evolution formed a broad distribution, with the mean value of about two. The evolution models analyzed here (Eqs. 10 and 11) yield a skewed distribution of genome sizes, which implies a distribution of loss/gain ratios with the mean slightly greater than unity. The distributions and parameters fitted here cannot explain such a large difference between the model predictions and inferences from genome comparison. Thus, it seems likely that, most of the time, the majority of the genomes are somewhat smaller than the long-term equilibrium size. A biologically plausible scenario is that prokaryotes are exposed to beneficial genetic material only for short periods of time, resulting in brief intervals of fast growth followed by slow genome shrinking (30). Steady state is possible under this scenario as well but only as the average over multiple cycles of gain and loss, which probably occur on a timescale much longer than the scale of the ATGC evolution. The strict steady-state analysis presented here can be regarded as a coarse-grained description of more complex evolutionary scenarios; however, our key finding, that acquired genes are, on average, beneficial, is expected to hold also for higher-order analyses.

The results of this analysis indicate that elimination of genes under the pressure of purifying selection is not the dominant factor of microbial evolution. On the contrary, acquisition of genes by microbes seems to be largely an adaptive process, although the positive selection that governs the genome dynamics, on average, is likely to be weak. This conclusion by no means contradicts the population genetic theory as such (9) but is incompatible with the assumption that newly acquired (and fixed in the population) genes are, on average, neutral. In other words, all of the estimates of the cost of the genetic material (11) can be valid in themselves, but the positive selection coefficient that is, on average, associated with new genes offsets these costs. Given that new genes are, on average, beneficial, microbes with larger Ne values that evolve under strong purifying selection typically accrue a greater number of genes than microbes with smaller populations. This reasoning explains the negative correlation between dN/dS and the number of genes (Fig. 1) that, at first glance, seemed paradoxical and to contradict the theory. Conceivably, the evolution of genuinely neutral, noncoding sequences is governed by the cost combined with the deletion bias, resulting in the purge of such sequences predicted by the theory and hence, the “wall to wall” architecture of the prokaryotic genomes (3).

To summarize, the analysis described here presents the formal theory of the evolution of prokaryotic gene content. Perhaps unexpectedly, comparison of the theory predictions with the genomic data shows that gene gain by prokaryotes, leading to genome growth, is largely an adaptive process, with the exception of “nonfunctional” gene classes, the mobilome and the singletons. From the biological standpoint, it seems plausible that the apparent beneficial effect of gene gain is a combined result of the capture of metabolic enzymes that can expand the biochemical capacity of microbes (31), regulators and signaling proteins that enhance regulatory circuits (32), and defense genes (33). However, much more research is required to reconstruct the full functional landscape of microbial evolution.

Methods

Prokaryotic Genome Size Data and Nonsynonymous to Synonymous Nucleotide Substitution Ratio.

Genomes of 707 prokaryotic species grouped into 60 ATGCs (14, 29) were analyzed. In addition to the number of genes (x), selection forces acting on the protein level were inferred for each ATGC. Nonsynonymous to synonymous nucleotide substitution ratio (dN/dS) was evaluated for each pair of species that belongs to the same ATGC using concatenated sequences of all core genes. The indicated dN/dS value for each ATGC is the median across all species pairs in the ATGC. Based on the COG database annotations (17, 18), singleton genes were identified and counted for each species. The effective population size was estimated for each ATGC from the calculated dN/dS value as explained in the following section.

Inference of Effective Population Size.

Synonymous mutations are assumed to be neutral and, therefore, fixed at a rate 1/Ne. Together with the fixation probability given by Eq. 2, for nonsynonymous mutations, we have (34)

$$\frac{dN}{dS} \approx \frac{N_e s_c}{1 - e^{-N_e s_c}},$$ [14]

where sc denotes the selection coefficient acting on the core genes for which the dN/dS values are calculated. It is assumed that sc is similar for all ATGCs and that the variation in Ne within an ATGC is significantly smaller than the differences in Ne between different ATGCs. The value of sc is set such that the effective population size for Escherichia coli is $10^9$, and Eq. 14 then allows estimation of the effective population size for each ATGC.
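The calibration and inversion steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that Eq. 14 has the form dN/dS ≈ Ne·sc/(1 − exp(−Ne·sc)) with sc < 0 for purifying selection, the reference dN/dS value for E. coli (`DNDS_ECOLI`) is a hypothetical placeholder, and bisection is one of several possible root-finding choices.

```python
import math

def dnds_ratio(z):
    """Assumed form of Eq. 14 in terms of z = Ne * sc: z / (1 - exp(-z))."""
    if abs(z) < 1e-12:
        return 1.0  # neutral limit
    return z / -math.expm1(-z)

def solve_z(ratio, lo=-700.0, hi=-1e-9, iters=200):
    """Bisection for z < 0 (purifying selection) with dnds_ratio(z) == ratio."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("expects 0 < dN/dS < 1")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dnds_ratio(mid) < ratio:
            lo = mid  # the ratio grows with z, so the root lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def calibrate_sc(dnds_reference, ne_reference):
    """Selection coefficient that reproduces the reference Ne."""
    return solve_z(dnds_reference) / ne_reference

def estimate_ne(dnds, sc):
    """Effective population size for an ATGC with median dN/dS = dnds."""
    return solve_z(dnds) / sc

DNDS_ECOLI = 0.05  # hypothetical placeholder, not a value from the paper
sc = calibrate_sc(DNDS_ECOLI, 1e9)
ne_other = estimate_ne(0.10, sc)  # a higher dN/dS maps to a smaller Ne
```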

Selection Function Symmetry with Respect to Gene Acquisition and Deletion.

The relation given by Eq. 1 is derived as follows. Denoting by s the selection advantage of gene acquisition, the reproduction rate for genome size x is 1, and for genome size x+1, it is, therefore, 1+s. For gene deletion, the reproduction rate is 1 for genome size x+1, and for consistency, the reproduction rate for genome size x is given by 1−s, corresponding to a selection advantage of −s for gene deletion.

Selection and Fitness Relations.

The selection coefficient s is related to the fitness ϕ by

$$s_{ab} = \frac{\phi_a}{\phi_b} - 1$$ [15]

when considering the selective advantage of individual a over individual b (16, 35). For the inference of the selection coefficient that is associated with gene acquisition at genome size x, it is useful to assign genome sizes of x+1 and x to individuals a and b, respectively. The selection coefficient–fitness relation is, therefore,

$$s(x) = \frac{\phi(x+1) - \phi(x)}{\phi(x)} \approx \frac{\partial}{\partial x} \ln \phi(x).$$ [16]
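A quick numerical check of the approximation in Eq. 16 is sketched below. The quadratic log-fitness landscape and its coefficients are hypothetical and serve only to compare the finite-difference and derivative forms of s(x).

```python
import math

A, B = 1e-6, 1e-10  # hypothetical coefficients: ln(phi) = A*x - B*x**2

def log_fitness(x):
    return A * x - B * x * x

def log_fitness_slope(x):
    """Analytic derivative of ln(phi) with respect to x."""
    return A - 2.0 * B * x

def selection_coefficient(x):
    """Left-hand form of Eq. 16: phi(x+1)/phi(x) - 1."""
    return math.expm1(log_fitness(x + 1) - log_fitness(x))

x = 3000
s_exact = selection_coefficient(x)
s_approx = log_fitness_slope(x)
```

For a landscape that changes slowly on the scale of a single gene, the two values agree closely, which is the condition under which the approximation holds.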

Mean Genome Size Dynamics.

For a uniform population invaded by a mutation that is associated with fitness ϕ1, the population fitness dynamics is given by

$$\dot{\phi} = \int_0^{\infty} d\phi_1\, (\phi_1 - \phi)\, P(\phi_1)\, F(\phi, \phi_1),$$ [17]

where P(ϕ1) is the probability density of mutations, F(ϕ,ϕ1) is the fixation probability, and time is measured in fixation time units (rather than in Moran generations) (36). Eq. 17 is general, the only assumption being that the mutation rate is low enough that the weak mutation limit condition is satisfied. However, for the model analysis, it is more practical to derive the equation for the genome size dynamics rather than for the fitness. The integral over ϕ1 is a sum over all possible mutations and contains two terms corresponding to gene acquisition and gene deletion. Accordingly, P(ϕ1) is given by the gene acquisition and deletion rates:

$$P(\phi_1^{+}) = \alpha(x)$$ [18]

and

$$P(\phi_1^{-}) = \beta(x),$$ [19]

where the + and − superscripts indicate acquisition and deletion events, respectively. The fitness derivative with respect to the number of genes can be approximated as

$$\phi' = \frac{\Delta\phi}{\Delta x} \approx \frac{\phi_1^{+} - \phi_1^{-}}{2},$$ [20]

such that $\phi_1^{\pm} - \phi = \pm\phi'$. The fitness time derivative can be calculated using the chain rule:

$$\dot{\phi} = \phi' \dot{x}$$ [21]

and

$$P^{\pm} = P(\phi_1^{\pm}) F^{\pm}.$$ [22]

Substituting Eqs. 18–22 into Eq. 17, we get Eq. 5 for the genome size dynamics.
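As an illustration of the resulting dynamics, the sketch below evaluates the net rate of genome size change as the gain probability minus the loss probability and locates the size at which the two balance. Several ingredients are assumptions made for this example only: the fixation probability is taken as F(s) = s/(1 − exp(−Ne·s)), deletion is assigned the selection coefficient −s(x) (a continuum approximation), the selection landscape is linear, and all numerical values are hypothetical.

```python
import math

NE = 1e9                # hypothetical effective population size
ALPHA, BETA = 1.0, 1.2  # hypothetical acquisition and deletion rates (deletion bias 1.2)
A, B = 1e-9, 2e-13      # hypothetical linear landscape s(x) = A - B*x

def selection(x):
    return A - B * x

def fixation(s, ne=NE):
    """Assumed fixation probability s / (1 - exp(-Ne*s)); neutral limit 1/Ne."""
    z = ne * s
    if abs(z) < 1e-12:
        return 1.0 / ne
    return s / -math.expm1(-z)

def genome_size_drift(x):
    """Net rate of change of genome size: P+ minus P-."""
    s = selection(x)
    return ALPHA * fixation(s) - BETA * fixation(-s)

def equilibrium_size(lo=0.0, hi=10000.0, iters=100):
    """Bisection for the genome size at which gain and loss balance."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if genome_size_drift(mid) > 0:
            lo = mid  # net growth below the equilibrium
        else:
            hi = mid
    return 0.5 * (lo + hi)

x0 = equilibrium_size()
```

Under the assumed fixation probability, F(s)/F(−s) = exp(Ne·s), so the balance condition reduces to Ne·s(x0) = ln(β/α); the numerical root can be compared against this closed form.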

Steady-State Genome Size Distribution.

The genome size distribution satisfies the difference equation

$$f(x, t+\Delta t) = f(x,t)\left(1 - P^{+}(x) - P^{-}(x)\right) + f(x-\Delta x, t)\, P^{+}(x-\Delta x) + f(x+\Delta x, t)\, P^{-}(x+\Delta x).$$ [23]

This equation can be approximated by a second-order differential equation (37). The left-hand side is expanded to first order in Δt, and the right-hand side is expanded to second order in Δx, giving the following expression:

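$$\dot{f} \approx -\frac{\Delta x}{\Delta t} \frac{\partial}{\partial x}\left[(P^{+} - P^{-}) f\right] + \frac{(\Delta x)^2}{\Delta t} \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[(P^{+} + P^{-}) f\right],$$ [24]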

where in the weak mutation limit, Δx=1 for Δt=1 in fixation time units. For the steady-state distribution, f˙=0. The resulting differential equation has a solution in the form of Eq. 9.
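The steady state of the difference equation (Eq. 23) can also be obtained without the diffusion approximation. The sketch below takes that alternative route: for single-gene steps (Δx = 1) on a finite range with reflecting ends, the stationary distribution satisfies the balance f(x)P+(x) = f(x+1)P−(x+1), which is accumulated in log space. This is not the closed-form solution referred to in the text; the fixation probability F(s) = s/(1 − exp(−Ne·s)), the linear landscape, and all numerical values are assumptions for illustration.

```python
import math

NE = 1e9                # hypothetical effective population size
ALPHA, BETA = 1.0, 1.2  # hypothetical acquisition and deletion rates
A, B = 1e-9, 2e-13      # hypothetical linear landscape s(x) = A - B*x

def selection(x):
    return A - B * x

def fixation(s, ne=NE):
    """Assumed fixation probability s / (1 - exp(-Ne*s)); neutral limit 1/Ne."""
    z = ne * s
    if abs(z) < 1e-12:
        return 1.0 / ne
    return s / -math.expm1(-z)

def p_gain(x):
    return ALPHA * fixation(selection(x))      # step x -> x + 1

def p_loss(x):
    return BETA * fixation(-selection(x - 1))  # step x -> x - 1

def stationary_distribution(x_min, x_max):
    """Normalized f(x) on x_min..x_max from f(x) P+(x) = f(x+1) P-(x+1)."""
    log_f = [0.0]
    for x in range(x_min, x_max):
        log_f.append(log_f[-1] + math.log(p_gain(x)) - math.log(p_loss(x + 1)))
    peak = max(log_f)
    weights = [math.exp(v - peak) for v in log_f]
    total = sum(weights)
    return [w / total for w in weights]

def net_change(f, x_min):
    """f(x, t+dt) - f(x, t) from Eq. 23 with dx = 1 and reflecting ends."""
    n = len(f)
    change = []
    for i, fx in enumerate(f):
        x = x_min + i
        out = (p_gain(x) if i < n - 1 else 0.0) + (p_loss(x) if i > 0 else 0.0)
        inflow = (f[i - 1] * p_gain(x - 1) if i > 0 else 0.0) \
            + (f[i + 1] * p_loss(x + 1) if i < n - 1 else 0.0)
        change.append(inflow - fx * out)
    return change

X_MIN, X_MAX = 3000, 5200
f = stationary_distribution(X_MIN, X_MAX)
mean = sum((X_MIN + i) * p for i, p in enumerate(f))
sd = math.sqrt(sum((X_MIN + i - mean) ** 2 * p for i, p in enumerate(f)))
```

With these placeholder values the distribution is close to a Gaussian whose width is set by Ne and the slope of s(x); `net_change` can be used to confirm that one application of Eq. 23 leaves the distribution unchanged.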

Optimization of the Goodness of Fit $R^2$ for Mean Genome Size Vs. Effective Population Size Dependency.

To search the parameter space more efficiently during the log-likelihood optimization, a preliminary optimization stage was implemented. Eq. 7 determines the relation between x0 and Ne. For given s(x) and r(x), the genome size dependence on effective population size can be compared with the dependence observed in ATGCs, where the mean genome size is taken as an approximation for x0. This approximation does not introduce large errors for modestly skewed genome size distributions (SI Appendix, Fig. S12), and the ATGC genome size distributions are only slightly skewed (Fig. 4). Specifically, it is possible to optimize the parameters of the selection landscape and of the deletion–acquisition rates ratio (Eqs. 10 and 11) to maximize the goodness of fit $R^2$. At the next stage, the parameters that gave the highest $R^2$ values are used as starting points for the log-likelihood optimization. Note that the log-likelihood scheme requires one additional parameter: the genome size distribution depends in principle on α(x) and β(x), whereas in Eq. 7, only the ratio r(x) appears. These values are given for the deletion–acquisition rates ratio of Eq. 11 by $\alpha(x) = x^{\lambda^{+}}$ and $\beta(x) = r x^{\lambda^{-}}$, where $\lambda = \lambda^{-} - \lambda^{+}$. The selection landscape and deletion–acquisition rates ratio of Eqs. 10 and 11 require optimization of five parameters: a, b, r, $\lambda^{+}$, and $\lambda^{-}$.

Maximum Likelihood Optimization.

The log likelihood (LL) of the model given the data is estimated as

$$LL = \sum_i \ln f(x_i),$$ [25]

where $x_i$ is the observed genome size of species i in an ATGC, and f(x) is the predicted steady-state distribution of Eq. 9. Specifically, for the log-likelihood estimation of a model, the parameters were optimized to maximize the log-likelihood LL(Z):

$$LL(Z) = \sum_i \ln f(x_i; N_e^{i}, Z),$$ [26]

where the sum is over all 707 species, the components of Z are all of the optimized parameters, and $N_e^{i}$ is the effective population size corresponding to the ATGC that contains species i. For the constant fitness coefficient, individual fitting of r and λ was performed separately for each ATGC, forming a set of 60 pairs $\{Z\} = \{r, \lambda\}$. The log likelihood was calculated as follows:

$$LL(\{Z\}) = \sum_i \sum_{j \in \mathrm{ATGC}_i} \ln f(x_j; N_e^{j}, \{Z\}_i),$$ [27]

where the inner sum is over all species that belong to ATGC i, and the outer sum is over all ATGCs.
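The two summation schemes (Eqs. 26 and 27) can be sketched as follows. The functions take the density as an argument, so the model distribution of Eq. 9 can be plugged in; the `toy_density` used here is a hypothetical stand-in (a normal density whose mean grows with log10 of Ne), and the ATGC names, genome sizes, and parameter values are invented for illustration.

```python
import math

def log_likelihood(species, density, params):
    """Eq. 26: sum of ln f(x_i; Ne_i, Z) over (genome_size, ne) pairs."""
    return sum(math.log(density(x, ne, params)) for x, ne in species)

def log_likelihood_by_atgc(atgcs, density, params_by_atgc):
    """Eq. 27: outer sum over ATGCs, inner sum over member species,
    with a separate parameter set for each ATGC."""
    return sum(
        log_likelihood(members, density, params_by_atgc[name])
        for name, members in atgcs.items()
    )

def toy_density(x, ne, params):
    """Hypothetical stand-in for Eq. 9: normal density with mean slope*log10(Ne)."""
    slope, sd = params
    mean = slope * math.log10(ne)
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Invented data: each entry is (number of genes, Ne of the containing ATGC)
atgcs = {
    "ATGC_A": [(4300, 1e9), (4450, 1e9), (4380, 1e9)],
    "ATGC_B": [(2100, 1e5), (2250, 1e5)],
}
shared = (480.0, 150.0)
all_species = [pair for members in atgcs.values() for pair in members]
ll_global = log_likelihood(all_species, toy_density, shared)
ll_split = log_likelihood_by_atgc(
    atgcs, toy_density, {"ATGC_A": (485.0, 120.0), "ATGC_B": (440.0, 130.0)}
)
```

In practice either function would be wrapped in a numerical optimizer over the parameters; that step is omitted here.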

Genome Size Dependence of Gain and Loss Probabilities.

The gene gain (loss) probability depends on genome size through the selection coefficient and the acquisition (deletion) rate. The variation in gain and loss probability ΔP± for genome size variation Δx is given by

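$$\Delta P^{+} = F^{+}\, \partial_x \alpha\, \Delta x + \alpha\, \partial_x F^{+}\, \Delta x$$ [28]

and

$$\Delta P^{-} = F^{-}\, \partial_x \beta\, \Delta x + \beta\, \partial_x F^{-}\, \Delta x,$$ [29]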

where all quantities are calculated using the mean ATGC genome size and effective population size. If, say, the second term in the above equations is significantly smaller than the first term, the variation in $P^{+}$ and $P^{-}$ with genome size is mainly caused by the variation in α and β, whereas s can be taken as constant with respect to x.

For the parameters fitted using the linear selection landscape and summarized in SI Appendix, Table S2, the terms involving the derivatives of $F^{+}$ and $F^{-}$ are an order of magnitude smaller than the α and β derivative terms.
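The decomposition in Eq. 28 can be evaluated numerically as sketched below. The fixation probability F(s) = s/(1 − exp(−Ne·s)), the power-law acquisition rate, the linear landscape, and every numerical value are assumptions chosen for illustration; they are not the fitted parameters of SI Appendix, Table S2, so the relative size of the two terms in this example should not be read as a result.

```python
import math

NE = 1e9            # hypothetical effective population size
A, B = 1e-9, 2e-13  # hypothetical linear landscape s(x) = A - B*x
LAMBDA_PLUS = 0.5   # hypothetical exponent of alpha(x) = x**LAMBDA_PLUS

def selection(x):
    return A - B * x

def fixation(s, ne=NE):
    """Assumed fixation probability s / (1 - exp(-Ne*s)); neutral limit 1/Ne."""
    z = ne * s
    if abs(z) < 1e-12:
        return 1.0 / ne
    return s / -math.expm1(-z)

def alpha(x):
    return x ** LAMBDA_PLUS

def p_gain(x):
    return alpha(x) * fixation(selection(x))

def derivative(func, x, h=1.0):
    """Central finite difference."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

def gain_variation_terms(x, dx=1.0):
    """Two terms of Eq. 28: (F+ * d(alpha)/dx * dx, alpha * d(F+)/dx * dx)."""
    f_plus = fixation(selection(x))
    rate_term = f_plus * derivative(alpha, x) * dx
    fixation_term = alpha(x) * derivative(lambda y: fixation(selection(y)), x) * dx
    return rate_term, fixation_term

x_mean = 4000.0
rate_term, fixation_term = gain_variation_terms(x_mean)
direct = p_gain(x_mean + 0.5) - p_gain(x_mean - 0.5)  # change in P+ over dx = 1
```

The sum of the two terms should reproduce the directly computed change in the gain probability; the analogous check for Eq. 29 replaces α by β and F+ by F−.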

The Model with Two Types of Genes.

The model can be extended to account for distinct classes of genes evolving under different selection landscapes. For two classes, the numbers of genes in each class, x1 and x2, are governed by two coupled stochastic equations:

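$$\dot{x}_1 = k_1\, \alpha(x_1 + x_2)\, F^{+}(s_1(x_1)) - \frac{x_1}{x_1 + x_2}\, \beta(x_1 + x_2)\, F^{-}(s_1(x_1))$$ [30]

and

$$\dot{x}_2 = k_2\, \alpha(x_1 + x_2)\, F^{+}(s_2(x_2)) - \frac{x_2}{x_1 + x_2}\, \beta(x_1 + x_2)\, F^{-}(s_2(x_2)),$$ [31]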

where the interpretation is as follows. The probability to acquire a gene of class i is ki and is a property of the gene pool. In addition, the associated selection landscape si is assumed to be a function of xi only. The loss rate for class i is given by the product of β, which is defined per genome, and the fraction of type i genes. The derivation of Eqs. 30 and 31 follows the same steps as for the complete genome. The integral of Eq. 17 in this case is a sum with four terms, namely, acquisition or deletion of either class 1 or class 2 genes. The fitness time derivative includes two terms:

$$\dot{\phi} = \dot{x}_1\, \partial_1 \phi + \dot{x}_2\, \partial_2 \phi,$$ [32]

where $\partial_i$ stands for differentiation with respect to $x_i$. The last stage in the derivation is to split terms associated with x1 or x2 dynamics into two separate equations. This operation is possible, because the number of genes in each class is determined exclusively by gene gain and loss events: there is no process in the model that allows switching the gene type.

To calculate the steady-state distribution for a subset of genes, x1 is set to the subset size, and x2 represents the remaining genes. In this case, we have $x_1 \ll x_2$, and accordingly, $x_1 + x_2 \approx x_2 \approx x$; therefore, Eq. 30 is decoupled from Eq. 31.
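The coupled two-class dynamics (Eqs. 30 and 31) can be explored with a simple forward-Euler relaxation, sketched below. Everything numerical here is hypothetical: constant per-genome rates α and β, class probabilities k1 and k2, and a separate linear landscape for each class. The fixation probability is assumed to be F(s) = s/(1 − exp(−Ne·s)), with F− evaluated at −s, and the time step is expressed in the model's arbitrary time units.

```python
import math

NE = 1e9                # hypothetical effective population size
ALPHA, BETA = 1.0, 1.2  # hypothetical constant per-genome rates
K = (0.3, 0.7)          # hypothetical class probabilities in the gene pool
LANDSCAPES = ((1e-9, 5e-13), (1e-9, 3e-13))  # hypothetical (a_i, b_i): s_i = a_i - b_i*x_i

def fixation(s, ne=NE):
    """Assumed fixation probability s / (1 - exp(-Ne*s)); neutral limit 1/Ne."""
    z = ne * s
    if abs(z) < 1e-12:
        return 1.0 / ne
    return s / -math.expm1(-z)

def rates(x1, x2):
    """Right-hand sides of Eqs. 30 and 31 for constant alpha and beta."""
    total = x1 + x2
    out = []
    for k, (a, b), xi in zip(K, LANDSCAPES, (x1, x2)):
        s = a - b * xi
        gain = k * ALPHA * fixation(s)
        loss = (xi / total) * BETA * fixation(-s)  # loss scales with the class fraction
        out.append(gain - loss)
    return tuple(out)

def relax(x1, x2, dt=1e11, steps=20000):
    """Forward-Euler integration toward the joint equilibrium."""
    for _ in range(steps):
        d1, d2 = rates(x1, x2)
        x1 += dt * d1
        x2 += dt * d2
    return x1, x2

x1_eq, x2_eq = relax(1000.0, 2000.0)
```

At the equilibrium, gain and loss balance separately for each class, which under the assumed fixation probability corresponds to exp(Ne·s_i) = (x_i/(x1+x2))·β/(k_i·α).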

Supplementary Material

Supplementary File
Supplementary File

Acknowledgments

We thank members of the group of E.V.K. for helpful discussions. The authors’ research is supported by intramural funds of the US Department of Health and Human Services (to the National Library of Medicine).

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1614083113/-/DCSupplemental.

References

  • 1. Koonin EV, Wolf YI. Genomics of bacteria and archaea: The emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008;36(21):6688–6719. doi: 10.1093/nar/gkn668.
  • 2. Reddy TB, et al. The Genomes OnLine Database (GOLD) v.5: A metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 2015;43(Database issue):D1099–D1106. doi: 10.1093/nar/gku950.
  • 3. Koonin EV. Evolution of genome architecture. Int J Biochem Cell Biol. 2009;41(2):298–306. doi: 10.1016/j.biocel.2008.09.015.
  • 4. Price MN, Huang KH, Arkin AP, Alm EJ. Operon formation is driven by co-regulation and not by horizontal gene transfer. Genome Res. 2005;15(6):809–819. doi: 10.1101/gr.3368805.
  • 5. Nuñez PA, Romero H, Farber MD, Rocha EP. Natural selection for operons depends on genome size. Genome Biol Evol. 2013;5(11):2242–2254. doi: 10.1093/gbe/evt174.
  • 6. Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302(5649):1401–1404. doi: 10.1126/science.1089370.
  • 7. Lynch M. The origins of eukaryotic gene structure. Mol Biol Evol. 2006;23(2):450–468. doi: 10.1093/molbev/msj050.
  • 8. Lynch M. The frailty of adaptive hypotheses for the origins of organismal complexity. Proc Natl Acad Sci USA. 2007;104(Suppl 1):8597–8604. doi: 10.1073/pnas.0702207104.
  • 9. Lynch M. The Origins of Genome Architecture. Sinauer; Sunderland, MA: 2007.
  • 10. Lynch M. Streamlining and simplification of microbial genome architecture. Annu Rev Microbiol. 2006;60:327–349. doi: 10.1146/annurev.micro.60.080805.142300.
  • 11. Lynch M, Marinov GK. The bioenergetic costs of a gene. Proc Natl Acad Sci USA. 2015;112(51):15690–15695. doi: 10.1073/pnas.1514974112.
  • 12. Koonin EV. A non-adaptationist perspective on evolution of genomic complexity or the continued dethroning of man. Cell Cycle. 2004;3(3):280–285.
  • 13. Novichkov PS, Wolf YI, Dubchak I, Koonin EV. Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol. 2009;191(1):65–73. doi: 10.1128/JB.01237-08.
  • 14. Puigbò P, Lobkovsky AE, Kristensen DM, Wolf YI, Koonin EV. Genomes in turmoil: Quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 2014;12:66. doi: 10.1186/s12915-014-0066-4.
  • 15. Moran PA. Random processes in genetics. Proc Philos Soc Math Phys Sci. 1958;54:60–71.
  • 16. McCandlish DM, Epstein CL, Plotkin JB. Formal properties of the probability of fixation: Identities, inequalities and approximations. Theor Popul Biol. 2015;99:98–113. doi: 10.1016/j.tpb.2014.11.004.
  • 17. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278(5338):631–637. doi: 10.1126/science.278.5338.631.
  • 18. Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2015;43(Database issue):D261–D269. doi: 10.1093/nar/gku1223.
  • 19. van Nimwegen E. Scaling laws in the functional content of genomes. Trends Genet. 2003;19(9):479–484. doi: 10.1016/S0168-9525(03)00203-8.
  • 20. Konstantinidis KT, Tiedje JM. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc Natl Acad Sci USA. 2004;101(9):3160–3165. doi: 10.1073/pnas.0308653100.
  • 21. Molina N, van Nimwegen E. Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet. 2009;25(6):243–247. doi: 10.1016/j.tig.2009.04.004.
  • 22. Koonin EV. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol. 2003;1(2):127–136. doi: 10.1038/nrmicro751.
  • 23. Daubin V, Ochman H. Bacterial genomes as new gene homes: The genealogy of ORFans in E. coli. Genome Res. 2004;14(6):1036–1042. doi: 10.1101/gr.2231904.
  • 24. Yu G, Stoltzfus A. Population diversity of ORFan genes in Escherichia coli. Genome Biol Evol. 2012;4(11):1176–1187. doi: 10.1093/gbe/evs081.
  • 25. Petrov DA, Sangster TA, Johnston JS, Hartl DL, Shaw KL. Evidence for DNA loss as a determinant of genome size. Science. 2000;287(5455):1060–1062. doi: 10.1126/science.287.5455.1060.
  • 26. Petrov DA. DNA loss and evolution of genome size in Drosophila. Genetica. 2002;115(1):81–91. doi: 10.1023/a:1016076215168.
  • 27. Kuo CH, Ochman H. Deletional bias across the three domains of life. Genome Biol Evol. 2009;1:145–152. doi: 10.1093/gbe/evp016.
  • 28. Wolf YI, Makarova KS, Lobkovsky AE, Koonin EV. Two fundamentally different classes of microbial genes. Nat Microbiol. 2016 doi: 10.1038/nmicrobiol.2016.208. in press.
  • 29. Novichkov PS, Ratnere I, Wolf YI, Koonin EV, Dubchak I. ATGC: A database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res. 2009;37(Database issue):D448–D454. doi: 10.1093/nar/gkn684.
  • 30. Wolf YI, Koonin EV. Genome reduction as the dominant mode of evolution. BioEssays. 2013;35(9):829–837. doi: 10.1002/bies.201300037.
  • 31. Maslov S, Krishna S, Pang TY, Sneppen K. Toolbox model of evolution of prokaryotic metabolic networks and their regulation. Proc Natl Acad Sci USA. 2009;106(24):9743–9748. doi: 10.1073/pnas.0903206106.
  • 32. Galperin MY, Higdon R, Kolker E. Interplay of heritage and habitat in the distribution of bacterial signal transduction systems. Mol Biosyst. 2010;6(4):721–728. doi: 10.1039/b908047c.
  • 33. Makarova KS, Wolf YI, Koonin EV. Comparative genomics of defense systems in archaea and bacteria. Nucleic Acids Res. 2013;41(8):4360–4377. doi: 10.1093/nar/gkt157.
  • 34. Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genet. 2008;4(12):e1000304. doi: 10.1371/journal.pgen.1000304.
  • 35. Sella G, Hirsh AE. The application of statistical physics to evolutionary biology. Proc Natl Acad Sci USA. 2005;102(27):9541–9546. doi: 10.1073/pnas.0501865102.
  • 36. Kryazhimskiy S, Tkacik G, Plotkin JB. The dynamics of adaptation on correlated fitness landscapes. Proc Natl Acad Sci USA. 2009;106(44):18638–18643. doi: 10.1073/pnas.0905497106.
  • 37. Codling EA, Plank MJ, Benhamou S. Random walk models in biology. J R Soc Interface. 2008;5(25):813–834. doi: 10.1098/rsif.2008.0014.
