Significance
Bacteria and archaea have small genomes with tightly packed protein-coding genes. Typically, this genome architecture is explained by “genome streamlining” (minimization) under selection for high replication rate. We developed a mathematical model of microbial evolution and tested it against extensive data from multiple genome comparisons to identify the key evolutionary forces. The results indicate that genome evolution is not governed by streamlining but rather reflects the balance between the benefit of additional genes, which diminishes with genome size, and the intrinsic preference for DNA deletion over acquisition. These results explain the observation that, in apparent contradiction with the population genetic theory, microbes with large genomes reach higher abundance and are subject to stronger selection than microbes with small “streamlined” genomes.
Keywords: evolutionary genomics, prokaryotic genome size, genome streamlining, positive selection, deletion bias
Abstract
Bacteria and archaea typically possess small genomes that are tightly packed with protein-coding genes. The compactness of prokaryotic genomes is commonly perceived as evidence of adaptive genome streamlining caused by strong purifying selection in large microbial populations. In such populations, even the small cost incurred by nonfunctional DNA because of extra energy and time expenditure is thought to be sufficient for this extra genetic material to be eliminated by selection. However, contrary to the predictions of this model, there exists a consistent, positive correlation between the strength of selection at the protein sequence level, measured as the ratio of nonsynonymous to synonymous substitution rates, and microbial genome size. Here, by fitting the genome size distributions in multiple groups of prokaryotes to predictions of mathematical models of population evolution, we show that only models in which acquisition of additional genes is, on average, slightly beneficial yield a good fit to genomic data. These results suggest that the number of genes in prokaryotic genomes reflects the equilibrium between the benefit of additional genes that diminishes as the genome grows and deletion bias (i.e., the rate of deletion of genetic material being slightly greater than the rate of acquisition). Thus, new genes acquired by microbial genomes, on average, appear to be adaptive. The tight spacing of protein-coding genes likely results from a combination of the deletion bias and purifying selection that efficiently eliminates nonfunctional, noncoding sequences.
The majority of bacterial and archaeal genomes are small, at least compared with the genomes of multicellular and many unicellular eukaryotes (1, 2). Also, with the exception of deteriorating genomes of some parasitic bacteria, the prokaryotic genomes are highly compact, with densely packed protein-coding genes and a low fraction of noncoding sequences (3). The small genome size is thought to be selected for fast replication, whereas the high gene density additionally facilitates coregulation of gene expression via the operon organization (4, 5). Across the full range of cellular life forms, a significant positive correlation has been shown to exist between genome size and $1/(N_e \mu)$, where $N_e$ is the effective population size, and $\mu$ is the mutation rate per nucleotide (6–9). Accordingly, a simple and appealing population genetic theory has been developed, under which selection strength controls genome size and complexity (6, 9). Prokaryotes, with the exception of some parasites, have large effective population sizes on the order of or even higher, which implies strong selection enabling prokaryotes to maintain compact genomes (10). Under this strong selection regime, even short nonfunctional sequences incur a cost that is “visible” to selection, conceivably through a combination of increasing energy expenditure and reducing the replication rate, and are efficiently weeded out (11). In eukaryotes, at least the multicellular forms, the effective population size is substantially (by orders of magnitude) smaller, and consequently, selection is not strong enough to eliminate superfluous genetic material, which results in “bloated” genomes but also provides the raw material for the evolution of complex features (6–8).
It is often assumed, implicitly or explicitly, that any extra genetic material arising from duplication or acquisition is, on average, slightly deleterious for the host, because the new DNA does not perform any immediately beneficial function but incurs the cost right from the beginning. This theory, which is steeped in the well-established principles of population genetics, provides a simple, unified framework for understanding evolution of genomic complexity without invoking widespread adaptation. This nonadaptive theory can be reasonably assumed as the null hypothesis of genome evolution, the predictions of which have to be falsified to claim adaptive phenomena (8, 12).
The population genetic theory clearly predicts an inverse correlation between the strength of selection at different levels and genome size: small genomes are predicted to be subject to stronger selection than large genomes (9). However, when protein sequence-level selection was measured for multiple groups of closely related bacteria using the ratio of the nonsynonymous to synonymous substitution rates (dN/dS) as a proxy, the opposite effect, namely a significant negative correlation between dN/dS and genome size, was observed, indicating that larger prokaryotic genomes typically evolve under a stronger selection than small ones (13).
Here, we sought to further investigate the evolutionary factors that control genome evolution in prokaryotes. We reproduced the negative correlation between dN/dS and genome size on an expanded genome collection and then, developed a mathematical model of genome evolution by gene gain and loss in prokaryotic populations. By fitting the distribution of genome sizes predicted by the model to the empirical distribution for many groups of prokaryotes, we found that a good fit between theory and the data could be obtained only for models that included a positive mean fitness contribution of the gained genes countered by a deletion bias. These results imply that, at the level of gene gain and loss, selection for genome compactness does not play a major role in determining the genome size in prokaryotes. Rather, the relatively small number of genes in prokaryotes is explained by the diminishing return associated with the acquisition of new genes as the genome grows combined with the intrinsic deletion bias.
Results
Protein-Level Selection and Microbial Genome Size: Model Prediction and Genomic Data.
It has been shown previously that, in prokaryotes, genome size is negatively and significantly correlated with dN/dS, suggestive of a stronger protein-level selection in larger genomes (13). For the purpose of this work, we reproduced these findings on a substantially expanded set of groups of closely related bacterial and archaeal genomes compiled in the latest version of the Alignable Tight Genomic Cluster (ATGC) database (14) (Fig. 1). Specifically, genome size (measured by the number of genes) and dN/dS values are correlated with a Spearman’s rank correlation coefficient of −0.397. These findings imply that, in addition to local influences of different environments and lifestyles, there are underlying universal components of gene gain and loss probabilities with respect to genome size that are common across all microbes.
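The rank correlation quoted above can be illustrated with a short sketch. This is a minimal, standard-library implementation of Spearman's coefficient (Pearson correlation of average ranks); the function names and the toy data are hypothetical and are not values from the ATGC database.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: gene counts and dN/dS for five made-up groups.
gene_counts = [1500, 2200, 3100, 4000, 5200]
dn_ds = [0.12, 0.10, 0.11, 0.07, 0.05]
rho = spearman_rho(gene_counts, dn_ds)  # negative: larger genomes, lower dN/dS
```

A negative coefficient here corresponds to the pattern described in the text: groups with more genes tend to have lower dN/dS.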
To quantitatively analyze the relations between selection strength and genome size, a mathematical model was implemented using the Moran process framework (15). In the model, a genome is represented as a collection of $x$ genes, and the variation in genome size is attributed to stochastic gain and loss of genes (Fig. 2). New genes are assumed to be acquired at rate $\alpha$, which, in principle, depends on $x$ and accounts for the probability that the acquired genetic material is inserted in a nondeleterious locus and expressed. Genes can be acquired by either horizontal gene transfer (in the model, assumed to draw genes from an infinite pool) or duplication, and $\alpha$ represents a weighted average over all possible evolutionary processes that lead to gene gain. Reconstructions of genome evolution performed on the ATGCs indicate that the contribution of horizontal gene transfer by far exceeds that of duplication (14). To model gene loss, the deleted gene is picked from the genome at random with the deletion rate $\beta$, which, analogous to $\alpha$, in principle depends on $x$. Acquisitions and deletions of genetic material are either fixed or eliminated stochastically, with a fixation probability $F$ that depends on the selection coefficient $s$, which is a function of genome size as well. Specifically, $s(x)$ denotes the mean selective advantage of an individual with $x + 1$ genes over an individual with $x$ genes. Conveniently, selection coefficients associated with gene acquisition and gene deletion are related as (Methods)
$$s_{\mathrm{deletion}} = -s_{\mathrm{acquisition}}$$ [1]
The fixation probability can be approximated by
$$F(s) = \frac{s}{1 - e^{-N_e s}}$$ [2]
where $N_e$ is the effective population size (16). The effective population size is estimated for each prokaryotic group from the dN/dS ratio (Methods). Fixed acquisition and deletion events, hereinafter “gene gain” and “gene loss,” respectively, occur with probabilities given by the product of the event rate and the fixation probability:
$$P^{+}(x) = \alpha(x)\,F\big(s(x)\big) = \alpha(x)\,\frac{s(x)}{1 - e^{-N_e s(x)}}$$ [3]
and
$$P^{-}(x) = \beta(x)\,F\big(-s(x)\big) = \beta(x)\,\frac{s(x)}{e^{N_e s(x)} - 1}$$ [4]
where Eq. 2 is used together with the relation of Eq. 1.
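The fixation probability and the gain and loss probabilities described above can be sketched numerically. The snippet assumes the common approximation F(s) = s / (1 − exp(−N_e·s)) for the fixation probability, with acquisition rate `alpha`, deletion rate `beta`, and selection coefficient `s` for gain (and `-s` for loss); the function names and the numerical values are hypothetical choices for illustration.

```python
import math


def fixation_probability(s, n_e):
    """Approximate fixation probability F(s) = s / (1 - exp(-n_e * s)).

    In the neutral limit (s -> 0) this tends to 1 / n_e.
    """
    z = n_e * s
    if abs(z) < 1e-12:   # neutral limit, avoids 0/0
        return 1.0 / n_e
    if z < -700.0:       # strongly deleterious: exp(-z) would overflow
        return 0.0
    return s / -math.expm1(-z)   # -expm1(-z) == 1 - exp(-z)


def gain_loss_probabilities(alpha, beta, s, n_e):
    """Return (P_gain, P_loss): event rate times fixation probability."""
    p_gain = alpha * fixation_probability(s, n_e)   # gain is selected with +s
    p_loss = beta * fixation_probability(-s, n_e)   # loss is selected with -s
    return p_gain, p_loss


# Hypothetical values chosen so that n_e * s = 1.
p_gain, p_loss = gain_loss_probabilities(alpha=1.0, beta=1.0, s=1e-9, n_e=1e9)
ratio = p_gain / p_loss   # equals exp(n_e * s) when alpha == beta
```

With equal acquisition and deletion rates, the ratio of gain to loss probability is exp(N_e·s), so even a selection coefficient of order 1/N_e noticeably tilts the balance toward gain.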
Gain and loss probabilities govern the stochastic genome size dynamics, with the intuitive relation (derivation is given in Methods)
$$\frac{dx}{dt} = P^{+}(x) - P^{-}(x)$$ [5]
meaning that, at each time step, there is a probability $P^{+}$ that one gene is gained and a probability $P^{-}$ that one gene is lost, where time is measured in fixation time units (a unit is the time that it takes, on average, for an acquired gene to be fixed in the population) rather than generations. Because of the stochastic fluctuations, genome sizes form a distribution. If, for a certain genome size value $x_0$, the gain and loss probabilities are equal, there is a steady-state genome size distribution with an extremum point at $x_0$ as illustrated in Fig. 2 and SI Appendix, Fig. S1. The extremum point depends on the deletion–acquisition rates ratio $r$,
$$r(x) = \frac{\beta(x)}{\alpha(x)}$$ [6]
and the equality of gain and loss probabilities implies
$$r(x_0) = e^{N_e s(x_0)}$$ [7]
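The step from equal gain and loss probabilities to the equilibrium condition can be written out explicitly. The derivation below is a sketch that assumes the fixation probability $F(s) = s/(1 - e^{-N_e s})$, acquisition and deletion rates $\alpha(x)$ and $\beta(x)$, and the ratio $r(x) = \beta(x)/\alpha(x)$; this notation is assumed here for illustration.

```latex
\begin{align}
P^{+}(x) &= \alpha(x)\,\frac{s(x)}{1 - e^{-N_e s(x)}}, \qquad
P^{-}(x) = \beta(x)\,\frac{-s(x)}{1 - e^{N_e s(x)}}
         = \beta(x)\,\frac{s(x)}{e^{N_e s(x)} - 1}, \\
\frac{P^{+}(x)}{P^{-}(x)} &= \frac{\alpha(x)}{\beta(x)} \cdot
  \frac{e^{N_e s(x)} - 1}{1 - e^{-N_e s(x)}}
  = \frac{e^{N_e s(x)}}{r(x)}, \\
P^{+}(x_0) = P^{-}(x_0) \;&\Longrightarrow\; r(x_0) = e^{N_e s(x_0)}
  \;\Longleftrightarrow\; s(x_0) = \frac{\ln r(x_0)}{N_e}.
\end{align}
```

The second line uses the identity $(e^{z} - 1)/(1 - e^{-z}) = e^{z}$. The last form shows that a deletion bias slightly above unity is balanced by a selection coefficient of order $1/N_e$.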
As reflected in the above equation (and intuitively), steady state is possible only when the more frequent event, gene acquisition or gene deletion, is counterselected at genome size $x_0$ (Fig. 3). For example, positive $s$ implies selective advantage for larger genomes. In this case, steady state is possible only when deletion rates are higher than acquisition rates [i.e., $r(x_0) > 1$]. Formally, the sign of $s(x_0)$ in Eq. 7 is determined by the sign of $\ln r(x_0)$. In the special case of acquisition and deletion rates being equal at $x_0$, $s(x_0) = 0$, which implies that either there is no selection with respect to genome size or the fitness function has an extremum at $x_0$ (Eq. 16). The genome size distribution extremum point at $x_0$ is a maximum only if an additional condition is satisfied: for $x > x_0$, gain and loss probabilities must satisfy $P^{-}(x) > P^{+}(x)$ (SI Appendix, Fig. S1). This condition is met when
$$\left.\frac{d}{dx}\Big[\ln r(x) - N_e\,s(x)\Big]\right|_{x = x_0} > 0$$ [8]
The case where $x_0$ is a minimum of the steady-state genome size distribution is biologically irrelevant, corresponding to genome sizes tending toward either zero or infinity (SI Appendix, Fig. S1).
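Whether the equilibrium genome size is a maximum or a minimum of the distribution can be checked numerically. The sketch below assumes a constant selection coefficient `s`, a power-law deletion–acquisition ratio `r(x) = r_prime * x**lam` with nonzero `lam`, a constant acquisition rate, and the fixation probability F(s) = s / (1 − exp(−N_e·s)). All names and parameter values are hypothetical.

```python
import math


def fixation_probability(s, n_e):
    """F(s) = s / (1 - exp(-n_e * s)), with the neutral limit 1 / n_e."""
    z = n_e * s
    if abs(z) < 1e-12:
        return 1.0 / n_e
    return s / -math.expm1(-z)


def net_gain(x, n_e, s, r_prime, lam, alpha=1.0):
    """P_gain(x) - P_loss(x) for constant s and r(x) = r_prime * x**lam."""
    beta = alpha * r_prime * x ** lam
    return alpha * fixation_probability(s, n_e) - beta * fixation_probability(-s, n_e)


def extremum(n_e, s, r_prime, lam):
    """Return (x0, kind): x0 solves r(x0) = exp(n_e * s); lam must be nonzero."""
    x0 = (math.exp(n_e * s) / r_prime) ** (1.0 / lam)
    below = net_gain(0.9 * x0, n_e, s, r_prime, lam)
    above = net_gain(1.1 * x0, n_e, s, r_prime, lam)
    # Maximum: net gain below x0 and net loss above x0 pull genomes toward x0.
    kind = "maximum" if below > 0.0 > above else "minimum"
    return x0, kind


# Ratio increasing with genome size: stable equilibrium (distribution maximum).
x0_up, kind_up = extremum(n_e=1000.0, s=1e-4, r_prime=0.5, lam=0.1)
# Ratio decreasing with genome size: unstable equilibrium (distribution minimum).
x0_down, kind_down = extremum(n_e=1000.0, s=1e-4, r_prime=2.0, lam=-0.1)
```

In this toy setting, a ratio that grows with genome size yields a stable equilibrium, whereas a ratio that shrinks with genome size yields the runaway case described in the text.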
For the calculation of the steady-state distribution of genome size, it is useful to present the population model of the genome size evolution as a random walk, where the probability of a step up is $P^{+}$, and the probability of a step down is $P^{-}$. The equation for the genome size distribution (Methods) has a steady-state solution:
[9] |
If depends on , depends on the functional forms of both and and unlike the equation for $x_0$ (Eq. 7), cannot be written using $r$ only. The genome size distribution allows one to compare different functional forms for in terms of compatibility with observed genome sizes under the assumption that, whatever the gene gain and loss probabilities might be, they are similar in all prokaryotes. Specifically, maximizing the log likelihood of the data given a specific model allows optimization of model parameters and comparison of different cases (details are in Methods). To account for the most general case, where both the selection coefficient and the acquisition–deletion rates ratio vary with genome size, $s$ is taken as linear in $x$, and $r$ is taken as a power law:
$$s(x) = a + b\,x$$ [10]
and
$$r(x) = r'\,x^{\lambda}$$ [11]
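A steady-state genome size distribution for a gain and loss random walk can be computed numerically. The sketch below is not the closed-form solution referred to in the text; it uses the standard detailed-balance recursion for a one-step random walk, f(x+1)·P_loss(x+1) = f(x)·P_gain(x). It assumes a linear selection coefficient `s(x) = a + b*x`, a power-law ratio `r(x) = r_prime * x**lam`, a constant acquisition rate, and the convention that deleting a gene from a genome of size x+1 reverses `s(x)`. Names and parameter values are hypothetical.

```python
import math


def fixation_probability(s, n_e):
    """F(s) = s / (1 - exp(-n_e * s)), with the neutral limit 1 / n_e."""
    z = n_e * s
    if abs(z) < 1e-12:
        return 1.0 / n_e
    return s / -math.expm1(-z)


def steady_state(n_e, a, b, r_prime, lam, x_min, x_max, alpha=1.0):
    """Return {x: probability} from f(x+1) * P_loss(x+1) = f(x) * P_gain(x)."""
    log_f = [0.0]  # unnormalized log-probability at x_min
    for x in range(x_min, x_max):
        s = a + b * x                                   # advantage of x+1 over x genes
        p_gain = alpha * fixation_probability(s, n_e)   # step x -> x+1
        beta = alpha * r_prime * (x + 1) ** lam         # deletion rate at size x+1
        p_loss = beta * fixation_probability(-s, n_e)   # step x+1 -> x
        log_f.append(log_f[-1] + math.log(p_gain / p_loss))
    peak = max(log_f)
    weights = [math.exp(v - peak) for v in log_f]       # rescaled to avoid overflow
    total = sum(weights)
    return {x_min + i: w / total for i, w in enumerate(weights)}


# Hypothetical parameters: constant s (b = 0) and a weakly increasing ratio.
dist = steady_state(n_e=1000.0, a=1e-4, b=0.0, r_prime=0.5, lam=0.1,
                    x_min=1, x_max=10000)
mode = max(dist, key=dist.get)
x0 = (math.exp(1000.0 * 1e-4) / 0.5) ** (1.0 / 0.1)     # solves r(x0) = exp(n_e * s)
```

In this toy case the mode of the distribution falls within one gene of the equilibrium size at which the rates ratio equals exp(N_e·s).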
These functional forms were chosen to minimize the number of optimized parameters. The linear selection coefficient can be regarded as a first-order expansion, and the power law functional form for $r$ was chosen, because it includes two extreme cases, those with constant and linear rates ratio, as well as all intermediates. The selection coefficient sign is not assumed a priori but is an outcome of the fitting process. Furthermore, it is in principle possible that the selection coefficient sign will be different in different ATGCs because of their different typical genome sizes.
Even for $s$ and $r$, which are approximated by low-order functions, maximizing the log likelihood requires fitting five parameters (Methods), and therefore, in principle, it is not evident that the resulting fit corresponds to a global maximum of the log likelihood rather than a local peak. We, therefore, performed the fitting with different starting points in parameter space chosen as explained in Methods. In brief, Eq. 7 is used to estimate the mean genome size for different effective population sizes, and parameters are optimized, such that goodness of fit with respect to the mean genome sizes of the ATGCs and effective population sizes is maximized. This procedure resulted in five different sets of parameters (SI Appendix, Table S1) that were then used as starting points for additional optimization by log-likelihood maximization (Methods). Three of five starting points converged to similar values (SI Appendix, Table S2), whereas the remaining two converged to local maxima associated with significantly lower likelihood. In all optimized sets of parameters (from both stages), the selection coefficient is positive for all ATGCs, indicating that additional genes are, on average, beneficial as expected given the positive correlation between genome size and effective population size (SI Appendix, Fig. S2).
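The idea of restarting a local optimizer from several starting points, and keeping the best result, can be sketched generically. The code below is not the optimization procedure used in the study; it is a minimal coordinate pattern search applied to a made-up two-peak objective, and every name and value in it is hypothetical.

```python
import math


def pattern_search(objective, start, step=1.0, tol=1e-6):
    """Maximize objective by coordinate-wise moves, halving the step when stuck."""
    point = list(start)
    best = objective(point)
    while step > tol:
        improved = False
        for i in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                trial[i] += delta
                value = objective(trial)
                if value > best:
                    point, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return point, best


def multi_start(objective, starts):
    """Run the local search from every start; return (best result, all results)."""
    results = [pattern_search(objective, s) for s in starts]
    return max(results, key=lambda r: r[1]), results


def toy_objective(p):
    """Made-up surface: local peak near (-2, 0), global peak near (3, 1)."""
    x, y = p
    local_peak = 0.3 * math.exp(-((x + 2) ** 2 + y ** 2))
    global_peak = math.exp(-((x - 3) ** 2 + (y - 1) ** 2))
    return local_peak + global_peak


best, results = multi_start(toy_objective, [(-3.0, 0.0), (2.0, 2.0)])
best_point, best_value = best
```

Here the first start converges to the lower local peak and the second to the global one, which mirrors the situation described above where some starting points end in local maxima with lower likelihood.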
In all three log-likelihood optimized parameter sets, the selection coefficient is strictly positive for all ATGCs and weakly depends on $x$. Formally, the variation of the fixation probability with genome size is an order of magnitude smaller than the variation of acquisition and deletion rates with genome size (Methods). Accordingly, additional fittings were performed using a constant selection coefficient (namely, independent of the genome size). In this case, the log-likelihood optimization always converged to the same set of parameters, with and values between and , corresponding to smallest and largest genomes, respectively (fitted parameters are summarized in SI Appendix, Table S4). This value of implies (i.e., the “resolution of selection” is on the order of a single gene). Notably, the dependence of $r$ on $x$ is weak. The mean steady-state genome size is a function of effective population size for this fit (Fig. 1).
For completeness, fitting with constant $r$ together with an $x$-dependent selection coefficient was performed as well (SI Appendix), allowing inference of the selection landscape beyond the first-order expansion for a constant acquisition–deletion rates ratio. In this case, the best fit is achieved for a positive and saturating selection landscape, indicating that, on average, additional genes are beneficial but that the benefit decreases with the growth of the genome size.
As an approximation for nonuniversal, ATGC-specific factors that affect the genome size, optimization of the gene acquisition and deletion rates was performed separately for each ATGC. The selection coefficient was taken to be the same for all ATGCs and set to the value obtained in the global fitting at the previous stage. For each ATGC, the two parameters of the power law (Eq. 11) were optimized, where the fitting of 120 parameters (compared with 5 parameters at the previous stage) is justified by the Akaike Information Criterion (AIC) (Table 1). The fitted values of these parameters are shown in SI Appendix, Fig. S3, and the resulting genome size distributions for ATGCs with 20 or more species are shown in Fig. 4.
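The model comparison by AIC mentioned above follows the standard definition AIC = 2k − 2·LL, where k is the number of fitted parameters and LL is the maximized log likelihood; the model with the lower AIC is preferred. The sketch below uses hypothetical log-likelihood values, not the values reported in Table 1.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 * LL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical log-likelihood values, for illustration only.
aic_global = aic(log_likelihood=-6000.0, n_params=4)        # one shared parameter set
aic_per_group = aic(log_likelihood=-5500.0, n_params=120)   # two parameters per group

# Extra parameters are justified only if they lower the AIC.
per_group_justified = aic_per_group < aic_global
```

In this toy case, 116 extra parameters cost 232 AIC units, so they are justified because the log likelihood improves by more than 116.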
Table 1.
Selection landscape | No. of parameters | Parameter values | LL | AIC |
Linear $s$ and power law $r$ (Eqs. 10 and 11) | 5 | | | |
Power law $r$ (Eq. 11) and constant $s$ | 4 | | | |
Power law $r$ (Eq. 11) and constant $s$, individual ATGC fit | 120 | Shown in SI Appendix, Fig. S3 | | |
$s$ of SI Appendix, Eq. S1 with constant $r$ | 2 | | | |
The selection landscape of SI Appendix, Eq. S1 with constant deletion–acquisition rates ratio is also shown for comparison. Log-likelihood (LL) calculation details are given in Methods.
Evolution of Distinct Functional Classes of Genes.
In our model, the gene acquisition and deletion rates, $\alpha$ and $\beta$, respectively, are general characteristics of the organism that do not depend on the content of the acquired or lost genetic material. In contrast, the selection coefficient inferred above represents a local average with respect to the gene content of the organism and the available genetic material in the assumed infinite gene pool. The model can be extended to account for different classes of genes that evolve under distinct selection landscapes. Specifically, the number of class genes, , is determined by the stochastic equation (Methods)
[12] |
where is the total number of genes, is the probability of gain of a class gene, and the selection landscape is assumed to be a function of only. The width of the steady-state distribution is determined primarily by the linear term in the loss probability. To further test the model consistency with the empirical data, steady-state distributions were calculated for subsets of genes. The subsets were chosen based on the functional classes of genes as classified in the COG (Clusters of Orthologous Genes) database (17, 18). The selection landscape and were optimized for the best log-likelihood fit of the distribution predicted by the model to the genomic data (SI Appendix, Table S5). The distributions were calculated using the values of and that were obtained by fitting the distributions for complete gene sets (Table 1). The mean value of can be approximated by the equilibrium value , for which the gain and loss probabilities are equal (analogous to for the complete genomes as described above):
[13] |
This expression can be regarded as a generalization of previously reported scaling laws for different functional classes of genes with the genome size (19–21), where the ATGC-specific effective population sizes are taken into account (the full implications of this extension of the scaling analysis will be discussed elsewhere). Comparison of the empirical data and the model predictions for the number of genes in most of the functional classes shows a good fit between the model predictions and the genomic data (SI Appendix, Figs. S4 A and B and S5 A and B). For the translation system components and the genes involved in energy transformation, the log-likelihood values were and , respectively, compared with the −6,022 value for complete genomes. Thus, notably, the genes for translation system components, the most conserved, universal functional class (22), are described by the model much better than a random subset of genes.
Finally, analogous to the whole-genome fitting procedure, to account for ATGC-specific effects, model distributions were further optimized by fitting ATGC-specific values. The resulting distributions for most of the functional classes showed good fits between the model and the genomic data as illustrated in SI Appendix, Fig. S6 A and B for the translation system components and genes involved in energy transformation of the largest ATGC001 (complete results are in SI Appendix, Figs. S7–S10). However, two classes of genes, namely the components of the “mobilome,” such as prophage genes and transposons, as well as the singletons (genes with no detectable orthologs within the given ATGC), dramatically deviate from model predictions (SI Appendix, Figs. S6 C and D, S9, and S10). For these gene classes, the observed distributions are significantly wider than those predicted by the model, regardless of the model parameters or the selection landscape. Accordingly, the log-likelihood values reflect the disagreement and are and for mobilome and singletons, respectively. Such a poor fit effectively indicates that these classes of genes evolve under evolutionary regimes qualitatively different from that of the rest of the genomes. Many of the mobilome components, in particular transposons, are prone to active duplication within the genome and therefore, cannot be described with the gene gain rate inferred for the complete gene sets. The singletons dramatically differ from the evolutionarily conserved genes shared by multiple microbes with respect to the tempo and mode of evolution. Most of the singletons encode small proteins and evolve fast, suggesting that they are associated with little (if any) benefit (23, 24). Thus, the key result of this analysis, namely the positive sign of the mean selection coefficient, does not apply to the singletons.
Discussion
The notion of strong purifying selection that favors small genomes (or more precisely, a small number of genes) in prokaryotes (10) seemed incompatible with the observed significant positive correlation between the genome sizes of bacteria and archaea and the inferred selection strength on the protein level reflected in the dN/dS ratio (13) (this work). These observations indicate that, on average, the larger the genome of a bacterium or an archaeon, the stronger selection under which the protein-coding genes evolve. This apparent discrepancy between the comparative genomic observations and the predictions of the population genetic theory motivated us to further investigate the selection regimes of microbial genomes. To this end, we compared the predictions of a mathematical model of genome evolution with the genome size distributions in 60 clusters of closely related bacteria and archaea.
To infer the selection landscape with regard to the genome size, the effective population size was estimated for each ATGC using the dN/dS ratio and assuming the same selection coefficient for the core genes in all ATGCs. This assumption is reasonable, because the 51–56 core genes used for the dN/dS calculation are nearly universal and encode central biological functions, such as translation, that are functionally highly similar across the entire bacterial domain of cellular life (22). Small variations in this selection coefficient across the different ATGCs might slightly affect the inferred gain and loss probabilities through the changes in the estimated effective population size associated with each ATGC. However, to affect our conclusions, namely that additional genes are, on average, beneficial, the variations in the selection coefficient between ATGCs would have to be dramatic, such that the correlation between the genome size and the dN/dS ratio (Fig. 1A) would be abolished or reversed.
Inference of gain and loss probabilities requires estimation of three terms, namely gene gain and loss rates and the selection landscape. In principle, all three values depend on the genome size, such that it is impossible to infer all terms without assumptions on the functional forms of the respective dependencies. However, it has to be emphasized that our conclusions do not depend on specific modeling assumptions and hold as long as the rates ratio and the selection coefficient are monotonic functions of genome size for typical prokaryotic genome sizes. The steady-state genome size distribution reflects the selection–drift balance: acquisition of new genes is, on average, beneficial, albeit with a small estimated selection coefficient (Table 1), and balanced by a deletion rate that is slightly greater than the acquisition rate. Under this regime, the selection on the gain or loss of an individual gene is weak, allowing substantial variation in genome size, but sufficient to produce the correlation between $N_e$ and genome size. The fitted values of the deletion–acquisition rates ratio are very close to, albeit greater than, unity. This slight but consistent excess of gene loss over gain is likely to reflect the deletion bias that has been identified as an intrinsic feature of genome evolution in both bacteria and eukaryotes (25–27).
Extension of the model to account for subsets of genes allowed additional validation of the model consistency and assessment of the selection affecting any class of genes. In particular, we analyzed genes that are associated with specific cellular functions as classified in the COG database (17, 18) under the assumption that functionally similar genes evolve under similar selection landscapes. We obtained good fits to the model for all functional classes, indicating that the conclusion on the typical beneficial effect of gene acquisition applies to functionally diverse classes of genes. However, there were two notable exceptions to this consistency, namely the mobilome and the singletons. The distributions of the sizes of these classes in all ATGCs are much wider than predicted by the model, and the parameters could not be optimized to obtain a good fit. These observations imply that the evolutionary regimes of the mobilome and the singletons qualitatively differ from the genes in the other functional classes that possess “normal” cellular functions. There are indications of the nature of these differences. Many components of the mobilome, such as transposons, propagate within a genome, so that the dynamics of this class is dominated by duplication rather than gain from an external gene pool. The singletons are fast evolving genes that, on average, do not confer any benefit on the organism. Indeed, in a separate recent analysis of microbial genome evolution models, we have shown that the replacement rate of the singletons is effectively infinite compared with the replacement rates of the rest of the genes (28).
The model analysis described here was performed assuming a steady state with respect to the genome size, and two points have to be addressed in this regard. Each ATGC consists of phylogenetically close genomes (29) that cannot be considered independent samples from the genome size distribution without additional substantiation. We, therefore, verified that, for each ATGC, the divergence between the genomes, defined in this case as the number of gains and losses, was greater than the genome size distribution width, ensuring that a sufficient number of evolutionary events occurred so that the genomes represent an independent set of samples from the size distribution. A lower bound for the number of gene gain and loss events that occurred since the divergence of individual genomes can be estimated from the number of singletons divided by the probability of the acquisition of such genes. The required number of gain and loss events for sufficient sampling of the distribution can be estimated from the SD of all genome sizes in an ATGC. The estimated numbers of gain events are greater than the SDs for all ATGCs (SI Appendix, Fig. S11), indicating that the genome size distribution was sampled sufficiently.
In our previous work, a comprehensive analysis of gain and loss events was performed for the same groups of microbial genomes (ATGCs) that were analyzed here (14). The gene loss to gain ratios obtained through a maximum likelihood reconstruction of genome evolution formed a broad distribution, with the mean value of about two. The evolution models analyzed here (Eqs. 10 and 11) yield a skewed distribution of genome sizes, which implies a distribution of loss/gain ratios with the mean slightly greater than unity. The distributions and parameters fitted here cannot explain such a large difference between the model predictions and inferences from genome comparison. Thus, it seems likely that, most of the time, the majority of the genomes are somewhat smaller than the long-term equilibrium size. A biologically plausible scenario is that prokaryotes are exposed to beneficial genetic material only for short periods of time, resulting in brief intervals of fast growth followed by slow genome shrinking (30). Steady state is possible under this scenario as well but only as the average over multiple cycles of gain and loss, which probably occur on a timescale much longer than the scale of the ATGC evolution. The strict steady-state analysis presented here can be regarded as a coarse-grained description of more complex evolutionary scenarios; however, our key finding, that acquired genes are, on average, beneficial, is expected to hold also for higher-order analyses.
The results of this analysis indicate that elimination of genes under the pressure of purifying selection is not the dominant factor of microbial evolution. On the contrary, acquisition of genes by microbes seems to be largely an adaptive process, although the positive selection that governs the genome dynamics, on average, is likely to be weak. This conclusion by no means contradicts the population genetic theory as such (9) but is incompatible with the assumption that newly acquired (and fixed in the population) genes are, on average, neutral. In other words, all of the estimates of the cost of the genetic material (11) can be valid in themselves, but the positive selection coefficient that is, on average, associated with new genes offsets these costs. Given that new genes are, on average, beneficial, microbes with larger $N_e$ values that evolve under strong purifying selection typically accrue a greater number of genes than microbes with smaller populations. This reasoning explains the negative correlation between dN/dS and the number of genes (Fig. 1) that, at first glance, seemed paradoxical and to contradict the theory. Conceivably, the evolution of genuinely neutral, noncoding sequences is governed by the cost combined with the deletion bias, resulting in the purge of such sequences predicted by the theory and hence, the “wall to wall” architecture of the prokaryotic genomes (3).
To summarize, the analysis described here presents the formal theory of the evolution of prokaryotic gene content. Perhaps unexpectedly, comparison of the theory predictions with the genomic data shows that gene gain by prokaryotes, leading to genome growth, is largely an adaptive process, with the exception of “nonfunctional” gene classes, the mobilome and the singletons. From the biological standpoint, it seems plausible that the apparent beneficial effect of gene gain is a combined result of the capture of metabolic enzymes that can expand the biochemical capacity of microbes (31), regulators and signaling proteins that enhance regulatory circuits (32), and defense genes (33). However, much more research is required to reconstruct the full functional landscape of microbial evolution.
Methods
Prokaryotic Genome Size Data and Nonsynonymous to Synonymous Nucleotide Substitution Ratio.
Genomes of 707 prokaryotic species grouped into 60 ATGCs (14, 29) were analyzed. In addition to the number of genes ($x$), selection forces acting on the protein level were inferred for each ATGC. The nonsynonymous to synonymous nucleotide substitution ratio (dN/dS) was evaluated for each pair of species that belongs to the same ATGC using concatenated sequences of all core genes. The indicated dN/dS value for each ATGC is the median across all species pairs in the ATGC. Based on the COG database annotations (17, 18), singleton genes were identified and counted for each species. The effective population size was estimated for each ATGC from the calculated dN/dS value as explained in the following section.
Inference of Effective Population Size.
Synonymous mutations are assumed to be neutral and are, therefore, fixed at a rate of 1/Ne, where Ne is the effective population size. Together with the fixation probability given by Eq. 2 for nonsynonymous mutations, we have (34)
[14] |
where the selection coefficient is the one acting on the core genes for which the dN/dS values are calculated. It is assumed that this coefficient is similar for all ATGCs and that the variation in effective population size within an ATGC is significantly smaller than the differences between ATGCs. The value of the coefficient is set such that the effective population size for Escherichia coli takes a prescribed reference value, and Eq. 14 then allows estimation of the effective population size for each ATGC.
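The estimation procedure described above can be sketched numerically. The sketch below is not the authors' code. It assumes that the fixation probability of Eq. 2 has the standard weak-selection form s/(1 − exp(−Ne·s)), so that with a neutral fixation rate of 1/Ne the ratio becomes dN/dS = Ne·s/(1 − exp(−Ne·s)). Under purifying selection (s < 0) this equals u/(exp(u) − 1) with u = Ne·|s|. All function names, group labels, and numerical values are hypothetical.

```python
import math

def dnds_from_scaled_selection(u):
    """dN/dS for scaled purifying selection u = Ne*|s| >= 0.

    Assumed form (not confirmed by the text): u / (exp(u) - 1).
    """
    if u < 1e-12:
        return 1.0  # neutral limit
    return u / math.expm1(u)

def scaled_selection_from_dnds(omega, upper=700.0, tol=1e-12):
    """Invert dnds_from_scaled_selection by bisection; needs 0 < omega < 1."""
    if not 0.0 < omega < 1.0:
        raise ValueError("dN/dS must lie strictly between 0 and 1")
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The model dN/dS decreases monotonically with u, so the root lies
        # above mid while the model value still exceeds the observed one.
        if dnds_from_scaled_selection(mid) > omega:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def effective_population_sizes(dnds_by_group, reference_group, reference_ne):
    """Calibrate |s| on one reference group, then estimate Ne for all groups."""
    u_ref = scaled_selection_from_dnds(dnds_by_group[reference_group])
    abs_s = u_ref / reference_ne  # shared selection coefficient magnitude
    return {g: scaled_selection_from_dnds(w) / abs_s
            for g, w in dnds_by_group.items()}

# Hypothetical median dN/dS values and a hypothetical reference Ne.
example = effective_population_sizes(
    {"reference_ATGC": 0.10, "other_ATGC": 0.05},
    reference_group="reference_ATGC",
    reference_ne=1e9,
)
```

Because the assumed dN/dS expression depends only on the product Ne·|s|, fixing Ne for one reference group determines |s|, and every other group's Ne follows from its own dN/dS. Groups with lower dN/dS receive larger Ne estimates.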
Selection Function Symmetry with Respect to Gene Acquisition and Deletion.
The relation given by Eq. 1 is derived as follows. Denoting by s the selection advantage of gene acquisition, the reproduction rate for genome size x is 1, and for genome size x + 1, it is, therefore, 1 + s. For gene deletion, the reproduction rate is 1 for genome size of x + 1, and for consistency, the reproduction rate for genome size x is given by 1/(1 + s), corresponding to a selection advantage of −s/(1 + s) for gene deletion.
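The consistency argument above can be written out step by step. The notation here is assumed for illustration and may differ from that of Eq. 1: x is the genome size, s is the selection advantage of acquiring a gene at size x, r and r' are reproduction rates normalized to the smaller and the larger genome, respectively, and s_del is the selection advantage of deleting a gene at size x + 1.

```latex
\begin{align}
r(x) &= 1, & r(x+1) &= 1 + s
  && \text{(acquisition, normalized to size } x\text{)} \\
r'(x+1) &= 1, & r'(x) &= \frac{r(x)}{r(x+1)} = \frac{1}{1+s}
  && \text{(deletion, normalized to size } x+1\text{)} \\
s_{\mathrm{del}} &= r'(x) - 1 = -\frac{s}{1+s} \approx -s
  && (|s| \ll 1)
\end{align}
```

The ratio of reproduction rates between the two genome sizes is the same in both normalizations, which is the consistency requirement. To first order in s, deletion at size x + 1 carries the opposite selection advantage to acquisition at size x.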
Selection and Fitness Relations.
The selection coefficient is related to the fitness by
[15] |
when considering the selective advantage of one individual over another (16, 35). For the inference of the selection coefficient that is associated with gene acquisition at genome size x, it is useful to assign genome sizes of x + 1 and x to the two individuals, respectively. The selection coefficient–fitness relation is, therefore,
[16] |
Mean Genome Size Dynamics.
For a uniform population invaded by a mutation that is associated with a given fitness, the population fitness dynamics are given by
[17] |
where the two factors in the integrand are the probability density of mutations and the fixation probability, and time is measured in fixation time units (rather than in Moran generations) (36). Eq. 17 is general; the only assumption is that the mutation rate is low enough for the weak mutation limit condition to be satisfied. However, for the model analysis, it is more practical to derive the equation for the dynamics of the genome size rather than of the fitness. The integral is a sum over all possible mutations and contains two terms corresponding to gene acquisition and gene deletion. Accordingly, the mutation probability density is given by the gene acquisition and deletion rates:
[18] |
and
[19] |
where the + and − superscripts indicate acquisition and deletion events, respectively. The fitness derivative with respect to the number of genes can be approximated as
[20] |
such that . The fitness time derivative can be calculated using the chain rule:
[21] |
and
[22] |
Substituting Eqs. 18–22 into Eq. 17, we obtain Eq. 5 for the genome size dynamics.
Steady-State Genome Size Distribution.
The genome size distribution satisfies the difference equation
[23] |
This equation can be approximated by a second-order differential equation (37). The left-hand side is expanded to first order in time, and the right-hand side is expanded to second order in genome size, giving the following expression:
[24] |
where in the weak mutation limit, for in fixation time units. For the steady-state distribution, . The resulting differential equation has a solution in the form of Eq. 9.
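The steady state of a one-step gain and loss process can also be computed directly on the discrete genome size grid, which gives a check on the continuous approximation. The sketch below is not the authors' code. It assumes that Eq. 23 is a standard one-step master equation in which a genome of size x gains a gene with probability gain(x) and loses a gene with probability loss(x) per time step. The rate functions and the genome size range are hypothetical and are not fitted values.

```python
import math

def stationary_distribution(gain, loss, x_min, x_max):
    """Stationary distribution of a one-step gain/loss process on x_min..x_max.

    gain(x): probability per time step of x -> x + 1
    loss(x): probability per time step of x -> x - 1
    Uses detailed balance, f(x + 1) * loss(x + 1) = f(x) * gain(x), accumulated
    in log space so that wide genome size ranges do not overflow.
    """
    log_f = [0.0]
    for x in range(x_min, x_max):
        log_f.append(log_f[-1] + math.log(gain(x)) - math.log(loss(x + 1)))
    peak = max(log_f)
    weights = [math.exp(v - peak) for v in log_f]
    total = sum(weights)
    return [w / total for w in weights]

def master_equation_residual(f, gain, loss, x_min):
    """Largest absolute net probability flow into any state in one time step."""
    n = len(f)
    worst = 0.0
    for i in range(n):
        x = x_min + i
        inflow = 0.0
        if i > 0:
            inflow += gain(x - 1) * f[i - 1]
        if i < n - 1:
            inflow += loss(x + 1) * f[i + 1]
        # Reflecting boundaries: no gain out of x_max, no loss out of x_min.
        outflow = (gain(x) if i < n - 1 else 0.0) * f[i]
        outflow += (loss(x) if i > 0 else 0.0) * f[i]
        worst = max(worst, abs(inflow - outflow))
    return worst

# Hypothetical rates: constant gain, loss increasing with genome size.
def example_gain(x):
    return 0.010

def example_loss(x):
    return 0.010 * x / 3000.0

f = stationary_distribution(example_gain, example_loss, 2000, 4000)
residual = master_equation_residual(f, example_gain, example_loss, 2000)
```

With these illustrative rates, gain and loss balance near a genome size of 3,000 genes, and the stationary distribution peaks there. The residual measures how closely the computed distribution satisfies the steady-state condition of the assumed master equation.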
Optimization of the Goodness of Fit for Mean Genome Size Vs. Effective Population Size Dependency.
To search the parameter space more efficiently during the log-likelihood optimization, a preliminary optimization stage was implemented. Eq. 7 determines the relation between and . For given and , the genome size dependence on effective population size can be compared with the dependence observed in ATGCs, where mean genome size is taken as an approximation for . This approximation does not introduce large errors for modestly skewed genome size distributions (SI Appendix, Fig. S12), and ATGCs genome size distributions are only slightly skewed (Fig. 4). Specifically, it is possible to optimize the selection and deletion–acquisition rates ratio parameters (Eqs. 10 and 11) to maximize the goodness of fit . At the next stage, the parameters that gave the highest values are used as starting points for the log-likelihood optimization. Note that the log-likelihood scheme requires one additional parameter: the genome size distribution depends in principle on and , whereas in Eq. 7, only the ratio appears. These values are given for the deletion–acquisition rates ratio of Eq. 11 by and , where . The selection landscape and deletion–acquisition rates ratio of Eqs. 10 and 11 require optimization of five parameters: , , , , and .
Maximum Likelihood Optimization.
The log likelihood (LL) of the model given the data is estimated as
[25] |
where is the observed genome size in ATGC species, and is the predicted steady-state distributions of Eq. 9. Specifically, for the log-likelihood estimation of a model, the parameters were optimized to maximize the log-likelihood :
[26] |
where the sum is over all 707 species, components are all optimized parameters, and is the effective population size corresponding to the ATGC that contains species . For the constant fitness coefficient, individual fitting of and was performed separately for each ATGC, forming a set of 60 pairs. The log likelihood was calculated as follows:
[27] |
where the inner sum is over all species that belong to ATGC , and the outer sum is over all ATGCs.
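The two likelihood calculations described above can be sketched as follows. The sketch assumes that Eqs. 25–27 sum the logarithm of the predicted steady-state probability at each observed genome size, which is the usual form for independent observations. The function names and the `predicted_pmf` callable are hypothetical stand-ins for the predicted distribution of Eq. 9.

```python
import math
from collections import defaultdict

def log_likelihood(observations, predicted_pmf):
    """Total log likelihood over all species.

    observations: iterable of (group, genome_size) pairs
    predicted_pmf: callable (group, genome_size) -> probability in (0, 1]
    """
    return sum(math.log(predicted_pmf(g, x)) for g, x in observations)

def log_likelihood_by_group(observations, predicted_pmf):
    """Inner sums, one per group; their total equals log_likelihood."""
    totals = defaultdict(float)
    for g, x in observations:
        totals[g] += math.log(predicted_pmf(g, x))
    return dict(totals)
```

When each ATGC is fitted separately, `predicted_pmf` would use that group's own parameters, and the outer sum over the per-group totals gives the overall log likelihood of the model.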
Genome Size Dependence of Gain and Loss Probabilities.
The gene gain (loss) probability depends on genome size through the selection coefficient and the acquisition (deletion) rate. The variation in gain and loss probability for genome size variation is given by
[28] |
and
[29] |
where all quantities are calculated using the mean ATGCs genome size and effective population size. If, say, the second term in the above equations is significantly smaller than the first term, the variation in and with genome size is mainly caused by the variation in and , whereas can be taken as constant with respect to .
For parameters fitted using linear selection landscape and summarized in SI Appendix, Table S2, the terms involving the derivatives of and are order of magnitude smaller than the and derivative terms.
The Model with Two Types of Genes.
The model can be extended to account for distinct classes of genes evolving under different selection landscapes. For two classes, the numbers of genes in each class, and , are governed by two coupled stochastic equations:
[30] |
and
[31] |
where the interpretation is as follows. The probability to acquire a gene of class is and is a property of the gene pool. In addition, the associated selection landscape is assumed to be a function of only. The loss rate for class is given by the product of , which is defined per genome, and the fraction of type genes. The derivation of Eqs. 30 and 31 follows the same steps as for the complete genome. The integral of Eq. 17 in this case is a sum with four terms, namely, acquisition or deletion of either class or class genes. The fitness time derivative includes two terms:
[32] |
where stands for differentiation with respect to . The last stage in the derivation is to split terms associated with or dynamics into two separate equations. This operation is possible, because the number of genes in each class is determined exclusively by gene gain and loss events: there is no process in the model that allows switching the gene type.
To calculate the steady-state distribution for a subset of genes, is set to the subset size, and represents the remaining genes. In this case, we have , and accordingly, ; therefore, Eq. 30 is decoupled from Eq. 31.
Acknowledgments
We thank members of the group of E.V.K. for helpful discussions. The authors’ research is supported by intramural funds of the US Department of Health and Human Services (to the National Library of Medicine).
Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1614083113/-/DCSupplemental.
References
- 1. Koonin EV, Wolf YI. Genomics of bacteria and archaea: The emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008;36(21):6688–6719. doi: 10.1093/nar/gkn668.
- 2. Reddy TB, et al. The Genomes OnLine Database (GOLD) v.5: A metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 2015;43(Database issue):D1099–D1106. doi: 10.1093/nar/gku950.
- 3. Koonin EV. Evolution of genome architecture. Int J Biochem Cell Biol. 2009;41(2):298–306. doi: 10.1016/j.biocel.2008.09.015.
- 4. Price MN, Huang KH, Arkin AP, Alm EJ. Operon formation is driven by co-regulation and not by horizontal gene transfer. Genome Res. 2005;15(6):809–819. doi: 10.1101/gr.3368805.
- 5. Nuñez PA, Romero H, Farber MD, Rocha EP. Natural selection for operons depends on genome size. Genome Biol Evol. 2013;5(11):2242–2254. doi: 10.1093/gbe/evt174.
- 6. Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302(5649):1401–1404. doi: 10.1126/science.1089370.
- 7. Lynch M. The origins of eukaryotic gene structure. Mol Biol Evol. 2006;23(2):450–468. doi: 10.1093/molbev/msj050.
- 8. Lynch M. The frailty of adaptive hypotheses for the origins of organismal complexity. Proc Natl Acad Sci USA. 2007;104(Suppl 1):8597–8604. doi: 10.1073/pnas.0702207104.
- 9. Lynch M. The Origins of Genome Architecture. Sinauer; Sunderland, MA: 2007.
- 10. Lynch M. Streamlining and simplification of microbial genome architecture. Annu Rev Microbiol. 2006;60:327–349. doi: 10.1146/annurev.micro.60.080805.142300.
- 11. Lynch M, Marinov GK. The bioenergetic costs of a gene. Proc Natl Acad Sci USA. 2015;112(51):15690–15695. doi: 10.1073/pnas.1514974112.
- 12. Koonin EV. A non-adaptationist perspective on evolution of genomic complexity or the continued dethroning of man. Cell Cycle. 2004;3(3):280–285.
- 13. Novichkov PS, Wolf YI, Dubchak I, Koonin EV. Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol. 2009;191(1):65–73. doi: 10.1128/JB.01237-08.
- 14. Puigbò P, Lobkovsky AE, Kristensen DM, Wolf YI, Koonin EV. Genomes in turmoil: Quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 2014;12:66. doi: 10.1186/s12915-014-0066-4.
- 15. Moran PA. Random processes in genetics. Proc Philos Soc Math Phys Sci. 1958;54:60–71.
- 16. McCandlish DM, Epstein CL, Plotkin JB. Formal properties of the probability of fixation: Identities, inequalities and approximations. Theor Popul Biol. 2015;99:98–113. doi: 10.1016/j.tpb.2014.11.004.
- 17. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278(5338):631–637. doi: 10.1126/science.278.5338.631.
- 18. Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2015;43(Database issue):D261–D269. doi: 10.1093/nar/gku1223.
- 19. van Nimwegen E. Scaling laws in the functional content of genomes. Trends Genet. 2003;19(9):479–484. doi: 10.1016/S0168-9525(03)00203-8.
- 20. Konstantinidis KT, Tiedje JM. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc Natl Acad Sci USA. 2004;101(9):3160–3165. doi: 10.1073/pnas.0308653100.
- 21. Molina N, van Nimwegen E. Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet. 2009;25(6):243–247. doi: 10.1016/j.tig.2009.04.004.
- 22. Koonin EV. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol. 2003;1(2):127–136. doi: 10.1038/nrmicro751.
- 23. Daubin V, Ochman H. Bacterial genomes as new gene homes: The genealogy of ORFans in E. coli. Genome Res. 2004;14(6):1036–1042. doi: 10.1101/gr.2231904.
- 24. Yu G, Stoltzfus A. Population diversity of ORFan genes in Escherichia coli. Genome Biol Evol. 2012;4(11):1176–1187. doi: 10.1093/gbe/evs081.
- 25. Petrov DA, Sangster TA, Johnston JS, Hartl DL, Shaw KL. Evidence for DNA loss as a determinant of genome size. Science. 2000;287(5455):1060–1062. doi: 10.1126/science.287.5455.1060.
- 26. Petrov DA. DNA loss and evolution of genome size in Drosophila. Genetica. 2002;115(1):81–91. doi: 10.1023/a:1016076215168.
- 27. Kuo CH, Ochman H. Deletional bias across the three domains of life. Genome Biol Evol. 2009;1:145–152. doi: 10.1093/gbe/evp016.
- 28. Wolf YI, Makarova KS, Lobkovsky AE, Koonin EV. Two fundamentally different classes of microbial genes. Nat Microbiol. 2016; in press. doi: 10.1038/nmicrobiol.2016.208.
- 29. Novichkov PS, Ratnere I, Wolf YI, Koonin EV, Dubchak I. ATGC: A database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res. 2009;37(Database issue):D448–D454. doi: 10.1093/nar/gkn684.
- 30. Wolf YI, Koonin EV. Genome reduction as the dominant mode of evolution. BioEssays. 2013;35(9):829–837. doi: 10.1002/bies.201300037.
- 31. Maslov S, Krishna S, Pang TY, Sneppen K. Toolbox model of evolution of prokaryotic metabolic networks and their regulation. Proc Natl Acad Sci USA. 2009;106(24):9743–9748. doi: 10.1073/pnas.0903206106.
- 32. Galperin MY, Higdon R, Kolker E. Interplay of heritage and habitat in the distribution of bacterial signal transduction systems. Mol Biosyst. 2010;6(4):721–728. doi: 10.1039/b908047c.
- 33. Makarova KS, Wolf YI, Koonin EV. Comparative genomics of defense systems in archaea and bacteria. Nucleic Acids Res. 2013;41(8):4360–4377. doi: 10.1093/nar/gkt157.
- 34. Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genet. 2008;4(12):e1000304. doi: 10.1371/journal.pgen.1000304.
- 35. Sella G, Hirsh AE. The application of statistical physics to evolutionary biology. Proc Natl Acad Sci USA. 2005;102(27):9541–9546. doi: 10.1073/pnas.0501865102.
- 36. Kryazhimskiy S, Tkacik G, Plotkin JB. The dynamics of adaptation on correlated fitness landscapes. Proc Natl Acad Sci USA. 2009;106(44):18638–18643. doi: 10.1073/pnas.0905497106.
- 37. Codling EA, Plank MJ, Benhamou S. Random walk models in biology. J R Soc Interface. 2008;5(25):813–834. doi: 10.1098/rsif.2008.0014.