Abstract
Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman’s coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico, using simulations of a Wright–Fisher model for a realistic range of mutation rates and selection strength. Our inferred fitness ranking is based on a linear discriminator that identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with false discovery rate of 0.1–0.3, depending on the mutation/selection parameters. The ranking also enables us to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks.
Keywords: evolution, population genetics, genealogy, fitness inference
Most mutations are believed to have minimal effects on the fitness of the organism, and much of the analysis of genomic data on populations (see Excoffier and Heckel 2006 for a review of methods) has been based on the neutral hypothesis, according to which the dynamics of genetic polymorphisms and the overall genetic diversity of the population are governed by neutral drift, i.e., stochastic fluctuations in allele frequency arising from the intrinsic stochasticity in offspring number. The neutral model assumes that deleterious mutations are eliminated by selection fast enough to not significantly contribute to population diversity and beneficial mutations are rare enough to produce only occasional adaptive sweeps, where the population is taken over by the offspring of the adaptive genotype, transiently suppressing neutral genetic diversity. Statistical properties of genealogies generated by neutral dynamics in asexual populations are understood in great detail (Hein et al. 2005; Wakeley 2008) in terms of Kingman’s coalescent process (Kingman 1982), which follows the ancestors of the present population back in time as far as the most recent common ancestor (MRCA). The neutral coalescent (Hein et al. 2005; Wakeley 2008) forms the basis for estimating mutation and recombination rates and provides the null hypothesis in tests for the presence of selection (Tajima 1989; Fu and Li 1993).
Yet, as advances in sequencing have made it possible to obtain quantitative data on genetic diversity, numerous studies have reached the conclusion that nonneutral polymorphisms are ubiquitous in populations across the spectrum of life: from viruses (Coffin et al. 1995; Novella et al. 1995; Moya et al. 2004; Neher and Leitner 2010; Batorsky et al. 2011) and bacteria (Barrick et al. 2009) to flies (Sella et al. 2009) to mitochondria (Seger et al. 2010) and cells in cancerous tumors (Merlo et al. 2006). In addition, laboratory evolution experiments in bacteria (Lenski et al. 1991; Miralles et al. 1999) and yeast (Kao and Sherlock 2008; Lang et al. 2011) have demonstrated directly that large asexual populations contain numerous subclones that are continuously generated by mutation and compete for fixation. Thus, large asexual populations cannot be assumed selectively neutral.
The presence of selection affects the shape of genealogical trees, often giving them an asymmetric and “comb-like” appearance that is strikingly different from that of the neutral trees generated by Kingman’s coalescent (Hein et al. 2005; Wakeley 2008; Seger et al. 2010; Trevor et al. 2011). An example of such “genealogical anomalies”—i.e., large deviations from neutral genealogical structure (Maia et al. 2004)—is provided by the recent study (Seger et al. 2010) of mitochondrial diversity in three distinct populations of whale lice, Cyamus ovalis, where the authors demonstrate that the observed genealogies are statistically consistent with a nonneutral model with frequent mutations of small selective effect.
Our analysis is based on a similar model of asexual evolutionary dynamics driven by small deleterious and beneficial mutations. In Figure 1 we show schematically a sample of continuous genealogy for a fixed-size population governed by Wright–Fisher dynamics (Hein et al. 2005; Wakeley 2008), incorporating genetic drift, mutation, and natural selection. The example in Figure 1 covers the period over which the offspring of one of the genomes (Figure 1, top) spread over the whole population (Figure 1, bottom). We ask, given a sample of genomes from the “present time” population (Figure 1, red circles), can one predict the genetic future of the population? Or, more specifically, can one identify, within the present sample, the closest relatives of the future population, i.e., individuals that are on, or closest to, the genealogical backbone of the future population? Since long-term survival is correlated with fitness, this task is closely related to the problem of identifying the fitter fraction of the present-day sample.
Figure 1.
Schematic example of a genealogical trajectory, from past into the future, of an asexual population with fixed size (N = 9) and nonoverlapping generations. Nodes represent individual genomes, each linked to its ancestor in the previous generation. The example illustrates coalescence of the lineages of the bottom population toward its MRCA within the top population. The genealogical tree of a random sample (red) from the “current time” population partially overlaps the genealogy of the future population (blue). While actual ancestors of the future population (shown in blue) may or may not fall into the current sample, one can still define sample members that are closest to the surviving lineages. Identifying close relatives of future populations is the goal of our study.
Here, we demonstrate that the anomalous structure of the genealogical tree reconstructed for a sample of genomes can serve not only as the evidence of selection, but also as the basis for inferring the relative fitness ranking of sampled individuals and their proximity in sequence space to the fittest genomes. Information pertinent to this inference is contained in the pattern of coalescence for different lineages: in a nutshell, lineages that undergo several coalescence events much before others are relatively fit, while the less fit lineages do not merge with the rest (going backward in time) until later. Below we provide the simulation-based evidence supporting this scenario.
Our study builds upon considerable recent progress in the theoretical understanding of natural selection and drift dynamics in fitness-diverse asexual populations (Tsimring et al. 1996; Rouzine et al. 2003, 2008; Desai and Fisher 2007; O’Fallon et al. 2010; Sniegowski and Gerrish 2010; Good et al. 2012; Goyal et al. 2012; Walczak et al. 2012) and the emerging description of corresponding genealogies (Bolthausen and Sznitman 1998; Brunet et al. 2007; Berestycki 2009; O’Fallon et al. 2010; Seger et al. 2010; Desai et al. 2013; Walczak et al. 2012; Neher and Hallatschek 2013; Neher 2013). We focus on the asexual case and address how this approach might be extended to the analysis of recombining populations in the Discussion.
We focus on the regime where numerous beneficial or deleterious mutations segregate simultaneously and the population is formed by many clones with different fitness values. In this regime, sometimes referred to as clonal interference (Miralles et al. 1999; Desai and Fisher 2007), competition between clones and the linkage between mutations play a key role in evolutionary dynamics. This regime is realized in large populations with high mutation rates. Precise conditions depend on the distribution of fitness effects of mutations and have been discussed in many recent articles (Rouzine et al. 2003, 2008; Desai and Fisher 2007; Brunet et al. 2008; Sniegowski and Gerrish 2010). For example, in the case where only beneficial mutations are present, the condition for being in the interference regime is given by Nμb > 1/log(Ns), where μb is the beneficial mutation rate (Desai and Fisher 2007; Brunet et al. 2008; Rouzine et al. 2008). This is basically the condition that new beneficial mutations get established in the population at a rate faster than they can “sweep” the population (see supporting information, File S1, for additional discussion). In the case of purifying selection where only deleterious mutations are present, it can be shown (Rouzine et al. 2003, 2008; Walczak et al. 2012) that the required condition is where μd is the deleterious mutation rate, s is the deleterious effect of mutations, and N is the population size.
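As a quick numerical illustration of the beneficial-only condition Nμb > 1/log(Ns) quoted above, the following minimal sketch evaluates it for a given parameter set. The function name is hypothetical, and the use of the natural logarithm and the restriction to Ns > 1 are assumptions of this sketch.

```python
import math

def in_interference_regime(n_pop, mu_b, s):
    """Check N * mu_b > 1 / log(N * s).

    Assumptions of this sketch: natural logarithm, and N * s > 1 so that
    the right-hand side is positive and finite.
    """
    if n_pop * s <= 1.0:
        raise ValueError("the condition is only evaluated here for N * s > 1")
    return n_pop * mu_b > 1.0 / math.log(n_pop * s)

# Example with values from the simulated range: N = 64,000, s = 2e-3,
# and a beneficial rate mu_b = eps * mu with eps = 0.1 and mu = 1e-3.
print(in_interference_regime(64_000, 0.1 * 1e-3, 2e-3))
```

With these values Nμb = 6.4 while 1/log(Ns) is roughly 0.2, so the condition holds.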
Quite generally, when the population is formed by several clones with different fitness values, the fate of any new mutation depends not only on its own selective effect, but also on the fitness of the genotype on which it occurs (Good et al. 2012). As a result, the MRCA of such a fitness-diverse population is with high probability among the very fittest of its generation (O’Fallon et al. 2010). In return, the pattern of genealogical coalescence is controlled by the time it takes for surviving lineages to converge, as they are tracked back in time, on the leading edge of the fitness distribution at previous times.
This article is organized as follows. After formulating the model, we provide examples of genealogies, illustrating their anomalous shape compared to the neutral coalescent, and demonstrate the correlation between the ancestral weight, defined as the fraction of the present-day sample constituted by the descendants of the ancestor, and the mean fitness of those descendants. We then define a fitness-ranking score based on the suitably integrated ancestral weights along the reconstructed lineage of each individual in the sample. Applying the ranking to numerous samples (for populations with the same and with different mutation/selection parameters) and comparing each realization to the true fitness known from the forward simulation, we demonstrate the ability of the proposed algorithm to infer the relative fitness of sampled genomes and to identify genotypes that are likely to survive into the future. The Discussion addresses possible applications and generalizations of the proposed inference method.
Model and Methods
Model of evolutionary dynamics
Consider an asexual population of size N that evolves with nonoverlapping generations under the influx of deleterious and beneficial mutations. New mutations arise at the rate μ + μ0 (per genome per generation), with εμ being the rate of beneficial mutations, (1 − ε)μ the rate of deleterious mutations, and the remainder μ0 the rate of neutral mutations. For simplicity we assume both beneficial and deleterious mutations to have the same effect s ≪ 1 and to change the fitness of individual i carrying that mutation additively: Fi → Fi ± s. As in the Wright–Fisher model, natural selection acts by biasing the probability of an individual genome to appear in the next generation, which is taken to be proportional to exp(fi), with fi = Fi − ⟨F⟩ being the individual fitness relative to the mean fitness of the population, ⟨F⟩, which in general is a function of time.
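The update rule described above can be sketched as a single generation of a Wright–Fisher step. This is a minimal illustration, not the simulation code used in the study: the function name is hypothetical, neutral mutations are omitted because they do not change fitness, and at most one nonneutral mutation per genome per generation is allowed (a simplification that is reasonable only for μ ≪ 1).

```python
import math
import random

def wright_fisher_step(fitness, mu, eps, s, rng):
    """Advance one generation; `fitness` holds the values F_i of all genomes.

    Returns the offspring fitness list and the index of each offspring's parent.
    """
    n = len(fitness)
    mean_f = sum(fitness) / n
    # Selection: a parent is chosen with probability proportional to exp(f_i),
    # where f_i = F_i - mean fitness. Sampling with replacement keeps N fixed.
    weights = [math.exp(f - mean_f) for f in fitness]
    parents = rng.choices(range(n), weights=weights, k=n)
    offspring = []
    for p in parents:
        f = fitness[p]
        # Mutation (simplified): with probability mu the genome gains one
        # nonneutral mutation, beneficial with probability eps.
        if rng.random() < mu:
            f += s if rng.random() < eps else -s
        offspring.append(f)
    return offspring, parents

rng = random.Random(0)
population = [0.0] * 200
for _ in range(50):
    population, parents = wright_fisher_step(population, mu=1e-2, eps=0.1, s=1e-2, rng=rng)
```

Recording `parents` at every generation is what makes the exact genealogy of any later sample available from the forward simulation.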
We carried out 103 simulations of 2 × 10⁵ generations for several plausible parameter combinations in the range of μ = 10⁻⁴ to 10⁻², s = 10⁻³ to 10⁻², with ε taking values 0, 0.1, and 1, and μ0 = 10μ and N = 64,000. In File S1, we study the degree of clonal diversity and interference for the set of parameters that we have simulated and show that it explores a broad range in the clonal interference regime.
The genealogical trees were constructed in two ways. We recorded the genealogies in the course of the forward simulation, providing exact ancestries of any sample in the population. In addition, an inferred genealogy of random samples (between 30 and 500 genomes) was constructed using standard neighbor-joining/UPGMA-derived methods (Durbin 1998), as detailed in File S1. In File S1, we present the performance of the tree reconstruction method for different parameter values and show that it satisfactorily reconstructs the genealogical trees. For higher mutation rates (e.g., μ = 5 × 10⁻³ and μ = 10⁻²), where there are tens to hundreds of differences between a typical pair of genomes, even setting the neutral mutation rate equal to μ would be sufficient for an accurate reconstruction of the trees.
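For readers unfamiliar with distance-based tree building, the following is a minimal textbook UPGMA sketch operating on a matrix of pairwise distances (for example, Hamming distances between sampled genomes). It is a generic illustration only: the exact reconstruction procedure used in the study is described in File S1 and may differ, and the function name and return format are assumptions of this sketch.

```python
def upgma(dist):
    """Textbook UPGMA on a symmetric distance matrix (list of lists).

    Returns the tree as nested tuples of leaf indices and a list of merges
    (cluster_a, cluster_b, node_height), with height = distance / 2.
    """
    n = len(dist)
    clusters = {i: (i, 1) for i in range(n)}          # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average distance to the merged cluster.
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        merges.append((a, b, dab / 2))
        next_id += 1
    return clusters[next_id - 1][0], merges

tree, merges = upgma([[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

UPGMA assumes a roughly constant rate of divergence, which is why an abundance of neutral mutations helps the reconstructed node heights track coalescence times.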
Fitness distribution and distortion in the shape of genealogical trees
In the parameter range considered, simulated populations exhibit substantial fitness diversity with fitness variance in the order of arising from ∼10 to 10³ simultaneously segregating nonneutral polymorphisms. Figure 2, A and B, shows examples of the population-wide fitness distribution for two different mutation rates (see File S1 for additional examples). In general, genetic diversity in the population is an increasing function of μ/s. For the highest mutation rate and lowest selection coefficients considered, μ = 10⁻² and s = 10⁻³, the population exhibits extensive genetic diversity and is formed by many small clones (Figure 2B), whereas for the lower mutation rates, as in Figure 2A, the population typically includes larger clones.
Figure 2.
Fitness distributions and examples of genealogical trees. (A) Fitness distribution at one time point for a population with μ = 10⁻³, s = 2 × 10⁻³, and σ ≃ 2.2 × 10⁻³. Each bin corresponds to a fitness class and each class is composed of multiple clones delineated by horizontal lines within each bar, with larger clones stacked on the bottom. (Here, clones are defined using only the nonneutral mutations.) Also shown is the color code used in D. (B) Same as A but for a higher mutation rate μ = 10⁻² and σ ≃ 5 × 10⁻³. (C) Same as A but for a neutral population. (D) A typical genealogical tree for a random sample of size n = 30 from the same population as in A. Each circle corresponds to one sampled genome and the color represents its fitness. Branch lengths are drawn in linear proportion to the corresponding time interval. Numbers next to internal nodes are the weights of the corresponding ancestors (only weights >10 are shown). Note the striking asymmetry of branching, i.e., uneven distribution of weight among the two lineages descending from each internal node. (E) Same as D but for the population shown in B. Note that the colors (gray and plum) corresponding to the extremes of the distribution (B) are absent from the small sample shown. (F) Same as D but for a neutral population. To focus on the shape of the genealogy, we have normalized the “height” of the trees in D–F to the time to the MRCA, which makes the neutral tree appear as tall as the trees for the populations under selection (whereas the coalescence time is really much longer in the neutral case). Note the short terminal legs and more symmetric branching. N = 64,000 and ε = 0.1 for A–F.
Figure 2, D and E, shows typical examples of genealogical trees constructed for random samples of size n = 30 drawn from the populations corresponding to Figure 2, A and B, respectively. The fitness of sampled genomes, which we know from the forward simulation, is visualized using color. Also shown are ancestral weights along some of the lineages. This weight, wi, is defined as the number of genomes in the present time sample that are direct descendants of lineage i. For example, each leaf at the bottom has weight w = 1, while the lineage at the root has the full weight of the sample n = 30. For the sake of comparison, we also show a typical genealogical tree for a neutrally evolving population in Figure 2F.
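The ancestral weight defined above is simply a count of sampled leaves below each node. A minimal sketch, assuming the tree is stored as a dictionary mapping each internal node to its list of children (the representation and the function name are choices made for this illustration):

```python
def ancestral_weights(children, root):
    """Return {node: number of sampled leaves descending from that node}.

    `children` maps each internal node to a list of its child nodes;
    nodes absent from the mapping are treated as leaves (weight 1).
    """
    weights = {}
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            weights[node] = 1                      # a sampled genome
        elif children_done:
            weights[node] = sum(weights[k] for k in kids)
        else:
            # Revisit this node after all of its children have been counted.
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return weights

example_tree = {"root": ["A", "x"], "x": ["B", "C"]}
example_weights = ancestral_weights(example_tree, "root")
```

Here the leaves A, B, and C each receive weight 1, the internal node x receives weight 2, and the root carries the full sample size, 3.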
One immediately notes two well-known differences distinguishing Figure 2D and 2E from Figure 2F. Genealogies from fitness-diverse populations (i) have long terminal legs and are compressed toward the MRCA root of the tree and (ii) exhibit strong asymmetry of branching. These anomalies are quantified in Figure 3. Figure 3A presents distributions of pairwise coalescent times in the population, τij, for {i, j} genome pairs for several parameter sets. In Kingman’s coalescent, τij has an exponential distribution (with mean N) (Hein et al. 2005; Wakeley 2008) and most lineages in a genealogical tree coalesce at early times (looking backward). In contrast, the bulk of coalescence in a population under selection is significantly delayed compared to the total coalescent time—an effect corresponding to the comb-like appearance of the trees.
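The neutral expectation quoted above (pairwise coalescent time with mean N) can be checked with a few lines of simulation. This sketch assumes a haploid neutral Wright–Fisher population in which two lineages pick the same parent with probability 1/N in each generation back; the function names are hypothetical.

```python
import random

def pairwise_coalescence_time(n_pop, rng):
    """Generations back until two lineages share a parent (neutral case)."""
    t = 1
    # In each generation back, the two lineages coalesce with probability 1/N.
    while rng.random() >= 1.0 / n_pop:
        t += 1
    return t

def mean_coalescence_time(n_pop, replicas, seed=0):
    """Average pairwise coalescence time over independent replicas."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(n_pop, rng) for _ in range(replicas))
    return total / replicas

estimate = mean_coalescence_time(n_pop=50, replicas=5000)
```

The waiting time is geometric, which approaches the exponential distribution for large N; the estimate should therefore land close to N = 50. Populations under selection deviate from this baseline in the way Figure 3A quantifies.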
Figure 3.
Distortion in the shape of genealogies in the presence of selection. (A) Distribution of pairwise coalescent time, scaled with its mean, T2. (B) Probability of an ancestor to have weight w when there are a = 2 lineages left in the genealogical tree of n = 100 samples. Distributions are based on 8000 random samples and population replicas. N = 64,000 and ε = 0.1 in both A and B.
The asymmetry of branching is quantified in Figure 3B, which presents the distribution of weights at the level just below the root, where there are only two ancestral lineages left in the tree. The strong bias toward extreme values of w in populations under selection contrasts with the w-independent distribution predicted and observed in the neutral case (see File S1 for additional characteristics that quantify differences between the shapes of trees).
Results
Correlation between ancestral weight and offspring fitness
Let us consider the whole population and trace the surviving lineages back in time, identifying all ancestors of the present-day population t generations in the past. Figure 4A shows the distribution of the ancestral fitness (relative to the mean for that generation) at several time points in the past. This distribution becomes progressively shifted toward higher fitness values compared to the distribution for the whole population (O’Fallon et al. 2010). In the limit of large times, this distribution converges to the nonextinction probability as a function of the fitness in the ancestral population (Neher et al. 2010; O’Fallon et al. 2010; Neher and Hallatschek 2013).
Figure 4.
Correlation between fitness, weight, and coalescent time. (A) Fitness distribution of the ancestors of the whole population, for several time points in the past. (B) Scatter plot of weight vs. ancestral fitness t = 100 generations back. (C) Solid lines show average fitness of offspring as a function of the ancestral weight in a sample of size n = 100 at two different time slices in the past. Dashed lines represent the standard deviation, above and below the mean. The time slices (t15% and t40%) were chosen to be the first time (looking backward) when genealogy contained a lineage with weight >15% or 40% of n, respectively. (D) Heat map of mean pairwise coalescent time as a function of the fitness of the involved genomes, f1,2, normalized by the mean pairwise coalescent time for the whole population: t2(f1, f2)/T2. Evolutionary model parameters are N = 64,000, ε = 0.1, and s = 2 × 10⁻³ in A–D; μ = 10⁻³ in A, B, and D.
Let us consider the time in the past when there are still a large number of ancestors (e.g., ∼10³ in the population of N = 64,000, which under conditions corresponding to simulations in Figure 4A occurs at t ≃ 100). Figure 4B shows the scatter plot of the weight of ancestors vs. their fitness advantage. Note that, by collapsing the points on the fitness axis, one gets the histogram shown in Figure 4A. We observe a strong positive correlation between the weight and the fitness of an ancestor. Higher-fitness individuals in the past generations are not only more likely to survive, but, conditioned on survival, they also leave more offspring. Thus the weight of the ancestor, which can be determined from a reconstructed genealogical tree, can be used as a proxy for ancestral fitness: a quantity that one does not expect to know directly, except in the case of computer simulations. In File S1 we provide plots of average ancestral fitness conditioned on its weight for various time points and parameter sets and confirm that the positive correlation between the weight and the fitness of ancestors holds quite generally. This correlation decreases as the time shifts farther into the past.
Next, we examine the correlation between the weight of an ancestor and the fitness of its surviving progeny. Consider a sample of genomes with size n and the corresponding genealogical tree. One expects genomes that are derived from relatively high-fitness ancestors to belong to higher-fitness classes at the present time. Since ancestral fitness correlates with weight, we expect higher-weight ancestors to produce, on average, higher-fitness descendants. To see this, let us consider an ancestor i, with weight wi, that existed some t generations in the past. We examine the fitness of the wi offspring in the sample descending from that ancestor. In particular, we focus on the mean and the variance of fitness over the wi descendants, each averaged over random samples of genomes and over population replicas. Note that both averages depend on the time t, namely, how far back in the genealogy one is considering.
In Figure 4C, we show the mean descendant fitness and its standard deviation at two different time points in the past for samples of size n = 100 (see File S1 for other parameter sets). In both cases, the mean fitness of the derived genomes is an increasing function of the weight of their ancestor. Consider a time close to the root of a tree such that a lineage can have a weight that is a significant portion of the sample size (e.g., right plot in Figure 4C). As expected, the mean descendant fitness for such high-weight ancestors is close to zero (recall that fi was defined relative to the population mean, so that the average of fi over the whole sample is zero). At the same time, the spread of descendant fitness approaches that of the whole sample for ancestors with w approaching n. Interestingly, for the lineages that still have a small weight late in the coalescence process, the mean descendant fitness is clearly negative.
High-fitness genomes typically merge first in a tree and form high-weight ancestors. To make this point clear, consider the distribution of the pairwise coalescent time, τij, shown in Figure 3A. Averaging τij over all {i, j} pairs of genomes in a population gives the mean coalescent time T2. Now, consider the average of τij conditioned on the fitness of the two genomes and denote it by t2(fi, fj). Figure 4D shows a heat map of t2(fi, fj)/T2. For two genomes both with high fitness, the average coalescent time is <T2. The reason is that such genomes are likely to be relatively recent lineages emanating from the “nose” of the distribution. In other words, the chance of sampling identical or similar sequences is greater for fitter samples than for less fit samples, since fitter samples have shorter average pairwise coalescent time. This observation is the key to the proposed fitness inference method.
Relative fitness inference based on the reconstructed genealogy
Above we have reviewed different ways in which the shape of the genealogical trees for populations under selection differs from the expectation of neutral theory. We have also demonstrated the correlation between ancestral weights and the fitness of the descendants. We showed that sampled genomes that belong to high-fitness classes typically have shorter coalescent time compared to unfit genomes. We now show that this insight can be converted into a method for inferring relative fitness of genomes within the sample.
To that end, let us consider a randomly chosen set of n genomes from a population and use standard phylogenetic tree-building methods (see File S1) to approximately reconstruct the genealogy of the sample. The accuracy of the reconstructed genealogy compared to the actual genealogy, known exactly from the forward simulation of population dynamics, is discussed in File S1. It increases with the neutral mutation rate μ0: in the biologically plausible regime of μ0/μ ≈ 10 considered here, it proves more than adequate to enable meaningful inference.
Next, based on the reconstructed tree, we associate with each leaf $i = 1, \ldots, n$ a fitness-proxy score (FPS), $\varphi_i$, defined by its lineage within the tree. Specifically, we define $\varphi_i$ as a linear discriminator in the form
$$\varphi_i = \sum_{k=1}^{m_i} \left[ w_{a_k(i)} - w_{a_{k-1}(i)} \right] \Theta\!\left( \frac{t_{a_k(i)}}{T_2} \right), \qquad (1)$$
where $\{a_k(i)\}$ is the lineage of genome $i$, starting with the genome itself as $a_0(i)$ and running the length, $m_i$, of the lineage (i.e., the number of nodes) until the root of the tree. When the ancestral lineage $a_{k-1}(i)$ with weight $w_{a_{k-1}(i)}$ coalesces at internal node $k$, it forms a new ancestral lineage $a_k(i)$ with weight $w_{a_k(i)}$ (see File S1 for an illustration of this notation on the example of a particular tree). The time of formation of the corresponding internal node is denoted by $t_{a_k(i)}$. The parameter $T_2$ is the estimate of the average pairwise coalescent time, obtained from the sampled genomes. Finally, $\Theta(x)$ is a “soft step” function (a.k.a. Fermi function), $\Theta(x) = \left(1 + \exp[\beta(x/x^* - 1)]\right)^{-1}$, parameterized by the position of the step $x^*$ and its sharpness $\beta$. If $\beta \gg 1$, the function $\Theta(x)$ changes abruptly from one to zero as $x$ exceeds $x^*$, so that $\varphi_i = w_{a^*} - 1$, where $a^*$ is the oldest ancestor in the lineage with $t_{a^*} < x^* T_2$. For $\beta \sim 1$ the FPS is defined by a weighted sum of ancestral weights (see File S1 for details).
The logic behind our heuristic choice of the specific form of φi is to exploit the correlation between the offspring fitness and ancestral weights. Note that, at least on the high-fitness/high-weight end of the distribution, this correlation decreases as ta becomes large compared to T2. The reason for this is that for long times in the past, even the lineages originating from high-fitness ancestors spread all over the fitness distribution at the present time. Hence, we choose x∗ < 1: specifically the results below were obtained with x∗ = 0.5 and β = 5, but in File S1 we examine the performance of the ranking algorithm as a function of the parameters and demonstrate that nearly optimal performance for the present form of the FPS is achieved for a broad range of x∗ and β. Critically, normalization of ta to the characteristic time of coalescence for the sample, T2, eliminates the need to know the evolutionary parameters of the population, such as μ or N.
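To make the scoring idea concrete, here is a minimal sketch of a fitness-proxy score computed along one lineage. It assumes a specific form that should be treated as an assumption of this sketch rather than the definitive definition: the score sums the weight gained at each coalescence along the lineage, each term damped by the soft step Θ evaluated at the node time divided by T2. The function names and the lineage representation (a list of (weight, time) pairs from leaf to root) are hypothetical.

```python
import math

def soft_step(x, x_star=0.5, beta=5.0):
    """Fermi function: close to 1 for x well below x_star, close to 0 above it."""
    return 1.0 / (1.0 + math.exp(beta * (x / x_star - 1.0)))

def fitness_proxy_score(lineage, t2, x_star=0.5, beta=5.0):
    """Score one leaf from its lineage.

    `lineage` lists (weight, node_time) pairs starting at the leaf itself,
    (1, 0), and ending at the root. Assumed form: sum of weight increments,
    each multiplied by soft_step(node_time / t2).
    """
    score = 0.0
    for (w_prev, _), (w, t) in zip(lineage, lineage[1:]):
        score += (w - w_prev) * soft_step(t / t2, x_star, beta)
    return score

# A lineage that gathers weight early (relative to T2 = 100) versus one that
# stays isolated until late in the coalescence process.
early = [(1, 0), (5, 10), (20, 200), (30, 400)]
late = [(1, 0), (2, 150), (30, 400)]
score_early = fitness_proxy_score(early, t2=100)
score_late = fitness_proxy_score(late, t2=100)
```

Because node times enter only through the ratio to T2, the same x∗ and β can be reused across samples without knowing μ or N, which is the point made in the text.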
We rank genomes according to their φi score and compare this ranking with the actual fitness of each genome. In addition to inferring relative fitness, it is useful to know how genetically close a genome with a given rank is to the fittest one in the sample. Hence, for each genome we define di as the average of its Hamming distance to the fittest 10% of genomes in the sample. Figure 5, A and B, shows the results of the ranking for two n = 200 samples from the populations that already appeared in Figure 2, A and B. We observe a correlation between FPS ranking and the actual fitness in general and the tendency (quantified below) for the fittest genomes of the sample to show in the top ranks. In addition, high-ranked genomes that do not belong to high-fitness classes still have small di values, indicating that they are genetically close to the fittest genotypes. In other words, even if a high-ranked genome is not fit, typically it has only recently branched off from a fit clone and, compared to a randomly chosen unfit sequence, shares greater sequence similarity with fittest genotypes.
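The distance di introduced above can be computed directly from the sampled sequences. A minimal sketch, assuming equal-length genomes represented as strings and assuming that a genome belonging to the fittest set contributes its own zero self-distance to the average (a detail the text does not specify); function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_to_fittest(genomes, fitness, top_fraction=0.1):
    """Average Hamming distance of each genome to the fittest `top_fraction`."""
    n = len(genomes)
    k = max(1, round(top_fraction * n))
    # Indices of the k fittest genomes in the sample.
    top = sorted(range(n), key=lambda i: fitness[i], reverse=True)[:k]
    return [sum(hamming(g, genomes[j]) for j in top) / k for g in genomes]

sample = ["0000", "0001", "0011", "1111"]
sample_fitness = [3.0, 2.0, 1.0, 0.0]
d_values = distance_to_fittest(sample, sample_fitness, top_fraction=0.25)
```

In this toy sample the fittest set is the single genome "0000", so the distances are 0, 1, 2, and 4.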
Figure 5.
Examples of performance of the ranking algorithm. (A) Heat map of rank as a function of fitness and average distance to the top 10% of fittest genomes. Distance d is normalized by its mean. Left and right panels correspond to two samples of size n = 200 drawn from the same populations as in Figure 2A (μ = 10⁻³) and Figure 2B (μ = 10⁻²), respectively. To avoid overlap of points, a small random number has been added to the fitness coordinate of each point. (B) Scatter plot of rank vs. distance to the top 10% of fittest genomes (color map represents f/σ). The plots correspond to the same trees as in A. N = 64,000 and ε = 0.1 in A and B.
The above observations are confirmed and quantified by repeating and averaging the analysis for 8000 independent population samples and different sets of parameters. Specifically, Figure 6 shows the fitness distribution of the top-ranked genomes for the two parameter sets used in Figure 5. The results clearly indicate that the top-ranked genomes tend to be among fitter genotypes in the population. In addition, Figure 7A shows mean fitness conditional on the FPS ranking and Figure 7B shows the mean rank conditional on actual fitness (normalized by σ) for two different values of μ. Figure 7C shows mean distance from the fittest conditional on the FPS ranking (for four different values of μ), with distance normalized to Δ10% defined as the average di among the fittest 10%. Remarkably, we observe that d/Δ10% for the highest-ranked genomes approaches one, indicating good convergence, in the sense of Hamming distance, of the top-ranked genomes to the fittest set. Further analysis of the algorithm’s performance, as well as additional parameter sets including the case of purifying selection (ε = 0), can be found in File S1.
Figure 6.
Fitness distribution of the top 10% ranked genomes (green) compared to the fitness distribution of the whole population (blue). A and B correspond, respectively, to μ = 10⁻³ and μ = 10⁻²; other population parameters are the same as in Figure 5. The distributions are obtained by averaging over 8000 population replicas.
Figure 7.
Performance of the fitness-ranking algorithm. (A) Solid lines show mean fitness as a function of rank. Dashed lines show standard deviation above and below the mean (μ = 5 × 10⁻⁴ and 10⁻²; see inset in C). (B) Same as A for mean rank as a function of fitness. (C) Mean Hamming distance to the top 10% fitness set, normalized by Δ10% (see text) as a function of rank. (D) Mean Hamming distance to ancestors of the generation at one turnover time in the future, normalized by the average of its smallest 10% of values (see main text), as a function of rank. (E) Probability for the fitness of a genome within the top 10% ranked to belong to the top 50% of fitness values of sampled genomes for a range of mutation rates and selection coefficients. (F) Probability for the fitness of a genome within the top 10% ranked to belong to the top 20% of fitness values shown using solid lines. The dashed lines show this probability for a randomly chosen genome (see main text). Sample size n = 200, N = 64,000, and ε = 0.1 in all cases; s = 2 × 10⁻³ in A–D.
As already mentioned, we are interested in the set of evolutionary parameters for which many mutations segregate simultaneously and the population is formed by numerous clones with different fitness values. The opposing limit, which occurs for small population size, N, or mutation rate, μ, corresponds to the regime of selective sweeps/successive mutations. In this latter regime, the population is typically formed by only a few clones and the fitness diversity is relatively low. Moreover, for smaller values of the parameters N and μ, the inference of genealogical trees becomes less accurate as the genetic diversity between sampled sequences decreases. Therefore, we expect the performance of the algorithm to deteriorate for small population size, N, or mutation rate, μ. In File S1, we show that for smaller values of the quantity θ = Nμ, particularly for θ < 1, the performance of the fitness inference algorithm deteriorates. We also show that the performance of the algorithm deteriorates as the fitness diversity in the population, represented by σ/s, decreases. Note that the quantity σ/s provides a measure for the number of different fitness classes in the population.
As we see in Figure 5A, high-ranked genomes that do not belong to the fittest classes still tend to have a small genetic distance to the fittest individuals (also note in Figure 2, D and E, the genomes with blue color located close to the mostly orange/red clusters on the right side of the trees). This is because the Hamming distance is dominated by neutral mutations (μ₀ ≫ μ) and is less susceptible to fluctuations than fitness, which is defined by a much smaller number of nonneutral mutations. To the extent that genetic relatedness is defined by the distance, the latter is essential for identifying within the sample the closest relatives of future populations. Taking advantage of the ready accessibility of the evolutionary future within our simulations, we have directly tested the ability of our approach to identify, within the sample, the genotypes that are closer to those of future populations. For each sampled genome, we define d′ as the average of its Hamming distance to all of the genomes in the current population that are ancestors of the population in a generation about one genetic turnover time in the future (we know these ancestors from the forward simulation). We choose this turnover time to be the first time in the future when <1% of individuals from the current population have any descendant left. In each case we normalized the distances by the average of the smallest 10% of the values of d′. Figure 7D shows the normalized d′ conditional on the FPS ranking. We again observe that for the highest-ranked genomes the normalized d′ gets close to one, indicating that the top-ranked genomes are indeed close to the ancestors of future generations. This means that the FPS ranking makes it possible to identify the genetic elements (common among the high-rank genomes) that future populations inherit from the present one.
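The distance to future ancestors described above can be illustrated with a minimal Python sketch. It assumes genomes are stored as equal-length integer arrays and that the ancestor set and the per-generation surviving fractions have already been extracted from a forward simulation; the function names and data layout are hypothetical and are not taken from the original study.

```python
import numpy as np

def mean_hamming_to_ancestors(sample, ancestors):
    """Mean Hamming distance from each sampled genome to a set of ancestor genomes.

    sample:    array of shape (n, L), one genome per row (assumed encoding)
    ancestors: array of shape (m, L), genomes that leave descendants in the future
    """
    sample = np.asarray(sample)
    ancestors = np.asarray(ancestors)
    # Broadcast to shape (n, m, L) and count mismatching sites per pair.
    mismatches = sample[:, None, :] != ancestors[None, :, :]
    return mismatches.sum(axis=2).mean(axis=1)

def normalize_by_smallest_tenth(d_prime):
    """Divide distances by the mean of their smallest 10% of values.

    Assumes that mean is nonzero; otherwise the result contains inf or nan.
    """
    d_prime = np.asarray(d_prime, dtype=float)
    k = max(1, len(d_prime) // 10)  # at least one value enters the average
    return d_prime / np.sort(d_prime)[:k].mean()

def turnover_time(surviving_fraction, threshold=0.01):
    """First generation index at which the fraction of current individuals
    with any surviving descendant drops below the threshold (None if never)."""
    for t, frac in enumerate(surviving_fraction):
        if frac < threshold:
            return t
    return None
```

With this normalization, values close to one mark the sampled genomes that are about as close to the future ancestors as the closest tenth of the sample.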
Finally, we examine the fitness of the genomes with the 10% highest rank. Consider the sorted vector F = [f₁, …, fₙ] that contains the actual fitness values for all the sampled genomes. In Figure 7E, we show that the probability for the fitness of a genome within the top 10% rank to be above the median fitness is ∼0.9 for the broad range of parameters considered. The probability for the fitness of a top 10%-ranked genome to belong to the top 20% fitness class is given by the solid lines in Figure 7F and is >0.7. Note that some of the sampled genomes can have equal fitness (i.e., F contains duplicate values), which is more common for lower mutation rates, where the fitness diversity in the populations is limited. Hence, to provide a meaningful comparison for this probability, in Figure 7F we show, using dashed lines, the probability for a random genome to be in the top 20% fittest.
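The comparison between top-ranked genomes and a random genome can be sketched as follows. This is only an illustration under an assumed tie convention: every genome whose fitness equals that of the genome at the 20% cutoff is counted as belonging to the top class, which is why the random baseline can exceed 0.2 when fitness values are duplicated. The function name and the convention that a higher score means a better rank are hypothetical.

```python
import numpy as np

def top_rank_precision(rank_score, fitness, top_rank=0.1, top_fit=0.2):
    """Fraction of top-ranked genomes that fall in the top fitness class,
    together with the same probability for a randomly chosen genome.

    rank_score: inferred score per genome, higher = ranked fitter (assumed)
    fitness:    actual fitness per genome
    """
    rank_score = np.asarray(rank_score, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    n = len(fitness)
    k_rank = max(1, int(round(top_rank * n)))
    k_fit = max(1, int(round(top_fit * n)))
    # Fitness of the k_fit-th fittest genome; ties at this value are included.
    threshold = np.sort(fitness)[::-1][k_fit - 1]
    in_top_fit = fitness >= threshold
    # Indices of the k_rank genomes with the highest inferred score.
    top_ranked = np.argsort(-rank_score, kind="stable")[:k_rank]
    precision = float(in_top_fit[top_ranked].mean())
    baseline = float(in_top_fit.mean())  # probability for a random genome
    return precision, baseline
```

One minus the returned precision corresponds to a false discovery rate for the chosen cutoffs.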
In summary, the above results clearly indicate the power of the proposed inference method. The performance of the method improves monotonically with increasing sample size (see File S1): it degrades significantly, compared to the results presented above, for n < 100 but approaches saturation for n > 200.
As we discussed earlier, we are interested in the set of evolutionary parameters for which several mutations segregate simultaneously and the population is formed by several clones with varying fitness values. In the opposing limit corresponding to the regime of selective sweeps/successive mutations, we expect the performance of the algorithm to deteriorate, as some fundamental aspects of the dynamics (such as the dependence of the fate of mutations on the genetic background) are different. To make this point clear, we calculated the Pearson correlation coefficient between the rank and the distance d′. Figure 8A shows this correlation as a function of the parameter Nμ. As we see, for smaller values of Nμ, particularly for Nμ < 1, the correlation coefficient drops significantly.
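For concreteness, the Pearson correlation between rank and d′ could be computed as in the short sketch below. The sign of the result depends on the rank convention, which is an assumption here: if rank 1 denotes the genome inferred to be fittest, an informative ranking should give a positive correlation with the distance d′.

```python
import numpy as np

def rank_distance_correlation(rank, d_prime):
    """Pearson correlation coefficient between inferred rank and distance d'."""
    rank = np.asarray(rank, dtype=float)
    d_prime = np.asarray(d_prime, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
    return float(np.corrcoef(rank, d_prime)[0, 1])
```

Repeating this calculation for simulations at different values of Nμ would give one point per parameter set.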
Figure 8.
As the fitness diversity in the population decreases, the performance of the fitness-ranking algorithm deteriorates. (A) Correlation between the rank and the distance d′ (Hamming distance to the ancestors of the future generations in the current population) as a function of N × μ. (B) Same as A but the x-axis represents the participation fraction, defined as the probability that two randomly chosen genomes belong to the same clone. ε = 0.1.
Similarly, in Figure 8B, we show the above correlation as a function of the participation fraction, defined as the probability that two randomly chosen genomes belong to the same clone [i.e., Σᵢ (nᵢ/N)², where nᵢ is the size of the ith clone]. Note that the participation fraction gives a measure for the fitness diversity in the population. Figure 8B shows again that as the genetic diversity in the population decreases, the correlation coefficient between the rank and the distance d′ drops as well.
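A participation fraction of this kind can be computed from clone labels as in the sketch below. It assumes that the two genomes are drawn independently (with replacement), so that the probability of picking the same clone is the sum of squared clone frequencies; the function name and the label representation are hypothetical.

```python
from collections import Counter

import numpy as np

def participation_fraction(clone_labels):
    """Probability that two genomes drawn at random (with replacement)
    belong to the same clone, given one clone label per genome."""
    counts = np.array(list(Counter(clone_labels).values()), dtype=float)
    freqs = counts / counts.sum()  # clone frequencies n_i / N
    return float(np.sum(freqs ** 2))
```

The value ranges from 1 for a population made of a single clone down to 1/N when every genome is its own clone, so larger values indicate lower diversity.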
Discussion
Whereas one often thinks of evolution as occurring on geological timescales, evolutionary dynamics can also unfold swiftly, as they do in bacteria acquiring antibiotic resistance, in human immunodeficiency virus evading the cytotoxic T-cell response in the course of infection, or in the progression of an aggressive cancer. Recent advances in sequencing (Smith et al. 2010; Navin et al. 2011) have made it possible to extensively sample such rapidly evolving populations. The amount and quality of genomic data on populations will only continue to increase, accentuating the challenge of extracting more information from sampled genomes. Here, we have demonstrated that the shape of genealogical trees contains much more information than merely the evidence for (or against) selection within a population. As a proof of principle, we have formulated a method for ranking the relative fitness of individual genomes sampled from a fitness-diverse but otherwise unstructured population, in the absence of any information other than genomic sequence. This provides the possibility of forecasting the common genotype of the future on the timescale of genetic turnover.
Our demonstration was based on a vast simplification of biological and ecological reality. Our model assumed fixed population size and constant environment; it neglected epistasis and assumed all nonneutral mutations (both deleterious and beneficial) to have the same selective strength. While we have explored a biologically interesting range of parameters within the considered model, it would be useful to extend the study to a broader class of models. Yet, we expect the proposed method to be quite robust, because it is based on the very fundamental aspect of evolutionary dynamics, realized when the population size and the mutation rate are sufficiently large. Under such a condition, the population harbors substantial nonneutral diversity, and fitness differentials between individuals are formed by the contributions of numerous weakly selected loci rather than a small number of strong ones. In this multilocus weak selection regime, surviving lineages in the course of time move from the nose of the fitness distribution toward the center, in a biased diffusion fashion. The correlation between early coalescence and rapid increase of ancestral weight along the lineages with high relative fitness derives from the continuous genetic turnover of the population described above. This turnover occurs in traveling-waves models corresponding to the continuous adaptation scenario (Tsimring et al. 1996; Rouzine et al. 2003), in the dynamic mutation–selection balance (Goyal et al. 2012) that involves both deleterious and compensating beneficial mutations, and in the case of purifying selection (ε = 0) (Gordo and Charlesworth 2000; Rouzine et al. 2003, 2008; Walczak et al. 2012).
A detailed statistical analysis of the way lineages propagate along the fitness axis could allow us to improve FPS by optimizing the trade-off between gaining more information about a particular lineage by tracking it farther back in time and the loss of predictive power due to the fact that beyond the genetic turnover time even lineages of the fittest ancestors spread all over the fitness distribution. Presently we have dealt with the problem heuristically by focusing on the coalescence sequence for each lineage up to ∼0.5T₂. The advantage of our simple heuristic approach is that it is more likely to be model independent than the more fine-tuned methods. Building on the recent progress in understanding of genealogies in the presence of multilocus selection (O’Fallon et al. 2010; Walczak et al. 2012; Neher and Hallatschek 2013), it should be possible to replace our heuristic approach by a more systematic one.
It would be interesting to extend the fitness inference method to recombining populations. This should be relatively straightforward as long as the genetic turnover time is short compared to the inverse recombination rate. For a chromosome with an approximately uniform crossover probability, this condition defines a characteristic length below which loci coalesce in essentially recombination-free genealogies (Neher et al. 2013). Roughly, the asexual coalescent considerations would apply to a 1-cM size locus provided that it harbors σ > 10⁻². More careful analysis is, however, necessary to deal with the Hill–Robertson effect or genetic draft (Hill and Robertson 1966; Neher and Shraiman 2011) caused by the transient linkage of the locus to the rest of the genome, which effectively adds noise, reducing the effectiveness of selection on the individual loci.
The highest priority for the future would be to test the method on experimental or epidemiological data. Applications are possible wherever genomic data are available for fitness-diverse, but otherwise unstructured populations. Genomic data from single-cell sequencing of tumors (Navin et al. 2011) or from localized influenza outbreaks (Squires et al. 2011) are among the interesting possibilities to be considered. For example, it would be interesting to compare the proposed method with the clustering-based approach of Plotkin et al. (2002) to predict antigenic evolution of influenza A. A challenge in applying our approach to the existing influenza virus data is posed by the possibility of strong geographical/temporal biases in the sampling patterns. In addition, there is a bias due to preferential sequencing of antigenically distinct genomes on the basis of HI assays (Bush 2001). For certain analyses, such as measuring the average substitution rate over a long period, such biases are less important (Russell et al. 2008; Bhatt et al. 2011; Strelkowa and Lässig 2012), but a more principled method of addressing the sampling bias may be necessary to achieve the full potential of our method of fitness inference.
In addition to predicting which genotypes are more likely to appear in future generations, the fitness inference method could be used for QTL mapping (Broman and Sen 2009) with FPS-based ranking being the quantitative phenotype that could be used to identify highly adaptive or deleterious alleles.
Acknowledgments
We thank Richard Neher, Daniel Balick, and Sidhartha Goyal for many useful discussions. A.D. was supported by HFSP grant RFG0045/2010 and National Science Foundation grant PHY11-25915 while B.I.S. acknowledges support from National Institute of General Medical Sciences grant R01 GM086793.
Footnotes
Communicating editor: J. Hermisson
Literature Cited
- Barrick J., Yu D., Yoon S., Jeong H., Oh T., et al., 2009. Genome evolution and adaptation in a long-term experiment with Escherichia coli. Nature 461: 1243–1247.
- Batorsky R., Kearney M. F., Palmer S. E., Maldarelli F., Rouzine I. M., et al., 2011. Estimate of effective recombination rate and average selection coefficient for HIV in chronic infection. Proc. Natl. Acad. Sci. USA 108: 5661–5666.
- Bedford T., Cobey S., Pascual M., 2011. Strength and tempo of selection revealed in viral gene genealogies. BMC Evol. Biol. 11: 220.
- Berestycki N., 2009. Recent progress in coalescent theory. Ensaios Matematicos 16: 1–193.
- Bhatt S., Holmes E., Pybus O., 2011. The genomic rate of molecular adaptation of the human influenza A virus. Mol. Biol. Evol. 28: 2443–2451.
- Bolthausen E., Sznitman A., 1998. On Ruelle’s probability cascades and an abstract cavity method. Commun. Math. Phys. 197: 247–276.
- Broman K., Sen S., 2009. A Guide to QTL Mapping with R/QTL (Statistics for Biology and Health). Springer-Verlag, New York.
- Brunet E., Derrida B., Mueller A., Munier S., 2007. Effect of selection on ancestry: an exactly soluble case and its phenomenological generalization. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 76: 041104.
- Brunet É., Rouzine I. M., Wilke C. O., 2008. The stochastic edge in adaptive evolution. Genetics 179: 603–620.
- Bush R. M., 2001. Predicting adaptive evolution. Nat. Rev. Genet. 2: 387–392.
- Coffin J. M., 1995. HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267: 483–489.
- Desai M., Fisher D., 2007. Beneficial mutation–selection balance and the effect of linkage on positive selection. Genetics 176: 1759–1798.
- Desai M., Walczak A., Fisher D., 2013. Genetic diversity and the structure of genealogies in rapidly adapting populations. Genetics 193: 565–585.
- Durbin R., 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge/London/New York.
- Excoffier L., Heckel G., 2006. Computer programs for population genetics data analysis: a survival guide. Nat. Rev. Genet. 7: 745–758.
- Fu Y., Li W., 1993. Statistical tests of neutrality of mutations. Genetics 133: 693–709.
- Good B., Rouzine I., Balick D., Hallatschek O., Desai M., 2012. Distribution of fixed beneficial mutations and the rate of adaptation in asexual populations. Proc. Natl. Acad. Sci. USA 109: 4950–4955.
- Gordo I., Charlesworth B., 2000. The degeneration of asexual haploid populations and the speed of Muller’s ratchet. Genetics 154: 1379–1387.
- Goyal S., Balick D. J., Jerison E. R., Neher R. A., Shraiman B. I., et al., 2012. Dynamic mutation–selection balance as an evolutionary attractor. Genetics 191: 1309–1319.
- Hein J., Schierup M., Wiuf C., 2005. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, New York.
- Hill W. G., Robertson A., 1966. The effect of linkage on limits to artificial selection. Genet. Res. 8: 269–294.
- Kao K., Sherlock G., 2008. Molecular characterization of clonal interference during adaptive evolution in asexual populations of Saccharomyces cerevisiae. Nat. Genet. 40: 1499–1504.
- Kingman J., 1982. The coalescent. Stoch. Proc. Appl. 13: 235–248.
- Lang G., Botstein D., Desai M., 2011. Genetic variation and the fate of beneficial mutations in asexual populations. Genetics 188: 647–661.
- Lenski R., Rose M., Simpson S., Tadler S., 1991. Long-term experimental evolution in Escherichia coli. I. Adaptation and divergence during 2,000 generations. Am. Nat. 138: 1315–1341.
- Maia L., Colato A., Fontanari J., 2004. Effect of selection on the topology of genealogical trees. J. Theor. Biol. 226: 315–320.
- Merlo L., Pepper J., Reid B., Maley C., 2006. Cancer as an evolutionary and ecological process. Nat. Rev. Cancer 6: 924–935.
- Miralles R., Gerrish P., Moya A., Elena S., 1999. Clonal interference and the evolution of RNA viruses. Science 285: 1745–1747.
- Moya A., Holmes E., González-Candelas F., 2004. The population genetics and evolutionary epidemiology of RNA viruses. Nat. Rev. Microbiol. 2: 279–288.
- Navin N., Kendall J., Troge J., Andrews P., Rodgers L., et al., 2011. Tumour evolution inferred by single-cell sequencing. Nature 472: 90–94.
- Neher R., Leitner T., 2010. Recombination rate and selection strength in HIV intra-patient evolution. PLoS Comput. Biol. 6: e1000660.
- Neher R., Shraiman B., 2011. Genetic draft and quasi-neutrality in large facultatively sexual populations. Genetics 188: 975–996.
- Neher R., Shraiman B., Fisher D., 2010. Rate of adaptation in large sexual populations. Genetics 184: 467–481.
- Neher R. A., 2013. Genetic draft, selective interference, and population genetics of rapid adaptation. arXiv: 1302.1148.
- Neher R. A., Hallatschek O., 2013. Genealogies of rapidly adapting populations. Proc. Natl. Acad. Sci. USA 110: 437–442.
- Neher R. A., Kessinger T. A., Shraiman B. I., 2013. Coalescence and genetic diversity in sexual populations under selection. Proc. Natl. Acad. Sci. USA 110: 15836–15841.
- Novella I. S., Clarke D. K., Quer J., Duarte E. A., Lee C. H., et al., 1995. Extreme fitness differences in mammalian and insect hosts after continuous replication of vesicular stomatitis virus in sandfly cells. J. Virol. 69: 6805–6809.
- O’Fallon B., Seger J., Adler F., 2010. A continuous-state coalescent and the impact of weak selection on the structure of gene genealogies. Mol. Biol. Evol. 27: 1162–1172.
- Plotkin J., Dushoff J., Levin S., 2002. Hemagglutinin sequence clusters and the antigenic evolution of influenza A virus. Proc. Natl. Acad. Sci. USA 99: 6263–6268.
- Rouzine I., Wakeley J., Coffin J., 2003. The solitary wave of asexual evolution. Proc. Natl. Acad. Sci. USA 100: 587–592.
- Rouzine I., Brunet É., Wilke C., 2008. The traveling-wave approach to asexual evolution: Muller’s ratchet and speed of adaptation. Theor. Popul. Biol. 73: 24–46.
- Russell C., Jones T., Barr I., Cox N., Garten R., et al., 2008. The global circulation of seasonal influenza A (H3N2) viruses. Science 320: 340–346.
- Seger J., Smith W., Perry J., Hunn J., Kaliszewska Z., et al., 2010. Gene genealogies strongly distorted by weakly interfering mutations in constant environments. Genetics 184: 529–545.
- Sella G., Petrov D., Przeworski M., Andolfatto P., 2009. Pervasive natural selection in the Drosophila genome? PLoS Genet. 5: e1000495.
- Smith A., Heisler L., Onge R., Farias-Hesson E., Wallace I., et al., 2010. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res. 38: e142.
- Sniegowski P., Gerrish P., 2010. Beneficial mutations and the dynamics of adaptation in asexual populations. Philos. Trans. R. Soc. Lond. B Biol. Sci. 365: 1255–1263.
- Squires R., Noronha J., Hunt V., García-Sastre A., Macken C., et al., 2011. Influenza research database: an integrated bioinformatics resource for influenza research and surveillance. Influenza Other Respir. Viruses 6: 404–416.
- Strelkowa N., Lässig M., 2012. Clonal interference in the evolution of influenza. Genetics 192: 671–682.
- Tajima F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.
- Tsimring L., Levine H., Kessler D., 1996. RNA virus evolution via a fitness-space model. Phys. Rev. Lett. 76: 4440–4443.
- Wakeley J., 2008. Coalescent Theory. Roberts & Co., Greenwood Village, CO.
- Walczak A., Nicolaisen L., Plotkin J., Desai M., 2012. The structure of genealogies in the presence of purifying selection: a fitness-class coalescent. Genetics 190: 753–779.