PLOS One. 2013 Dec 10;8(12):e81738. doi: 10.1371/journal.pone.0081738

The Yule Approximation for the Site Frequency Spectrum after a Selective Sweep

Sebastian Bossert 1,*, Peter Pfaffelhuber 1
Editor: William J Etges
PMCID: PMC3858263  PMID: 24339959

Abstract

In the area of evolutionary theory, a key question is which portions of the genome of a species are targets of natural selection. Genetic hitchhiking is a theoretical concept that has helped to identify various such targets in natural populations. In the presence of recombination, a severe reduction in sequence diversity is expected around a strongly beneficial allele. The site frequency spectrum is an important tool in genome scans for selection and is composed of the numbers Inline graphic, where Inline graphic is the number of single nucleotide polymorphisms (SNPs) present in Inline graphic from Inline graphic individuals. Previous work has shown that both the number of low- and high-frequency variants are elevated relative to neutral evolution when a strongly beneficial allele fixes. Here, we follow a recent investigation of genetic hitchhiking using a marked Yule process to obtain an analytical prediction of the site frequency spectrum in a panmictic population at the time of fixation of a highly beneficial mutation. We combine standard results from the neutral case with the effects of a selective sweep. As simulations show, the resulting formula produces predictions that are more accurate than previous approaches for the whole frequency spectrum. In particular, the formula correctly predicts the elevation of low- and high-frequency variants and is significantly more accurate than previously derived formulas for intermediate frequency variants.

Introduction

Genetic hitchhiking is the cause of a severe reduction of sequence diversity in a population due to recent strong positive selection [1]. Several statistical methods are available to detect these selective sweeps. The most successful approaches include various aspects of the available data, such as the site frequency spectrum and linkage disequilibrium patterns. See e.g., [2] for a likelihood ratio test based on the site frequency spectrum, [3], [4] for tests based on linkage disequilibrium and [5], who use a combination of both. The most challenging issue today is to disentangle population demography from signatures of selection.

One of the most successful approaches for detecting selective sweeps is called SweepFinder. Here, the site frequency spectrum for a selective and a neutral model is compared for each SNP available in the data [6]. This approach highlights the necessity of making analytical predictions for site frequency spectra under strong positive selection, which is the main goal of the current manuscript. While SweepFinder uses a selective model with the star-like method (see e.g., [7]), here, we use a refined model.

Current theoretical investigations and predictions of the signature of strong positive selection are mostly based on a genealogical perspective. The resulting genealogy is termed coalescent in a random background and was studied by [8] and [9]. The simplest approximation for large selection coefficients is the star-like approximation from [10] and [7]. The star-like approximation assumes that all individuals from a sample taken at the time of fixation are direct descendants of the founder of the selective sweep. In addition, recombination events may have split the history of the target of selection from a linked neutral variant. [7], [11], and [12] used a marked Yule process, which has been shown to be a finer approximation by [7]. Rather than using a star-like approximation of the genealogy at the target of selection, [12] used the idea put forward by [13], which states that in the early phase of a selective sweep, the beneficial allele behaves similarly to a supercritical branching process. As a consequence, the genealogy also resembles a supercritical branching process, which turns out to be a Yule process [14].

In this manuscript, we go beyond approximating the genealogy by a marked Yule process and provide an analytical expression for the site frequency spectrum after a selective sweep. Two features of the spectrum are the most important for data analysis: an excess of singletons (which might also arise due to population expansion) and an excess of high-frequency variants (which appears to be a unique feature of sweeps; [15]). [16] already gave an approximation of the site frequency spectrum and used the excess of high-frequency variants to develop a statistical test for positive selection. Using our analytical approximations, we will see that such classical approaches slightly overestimate the number of high-frequency variants, while our Yule-approximation is more accurate. In addition, intermediate-frequency variants are predicted accurately only by the marked Yule-approximation. These features of the Yule-approximation can be used to construct conservative tests for selective sweeps.

Model and Results

Consider a (diploid) population of size Inline graphic which evolves under the neutral Wright-Fisher model. We will study two loci (called Inline graphic- and Inline graphic-locus) within this population, which recombine with probability Inline graphic per generation. (We neglect recombination within loci.) At the Inline graphic-locus, the population is fixed for the wild-type Inline graphic before time Inline graphic. The Inline graphic-locus is modeled using an infinite sites model of mutation with mutation probability Inline graphic per generation (see [17]). At time Inline graphic, a beneficial mutation Inline graphic with fitness Inline graphic appears at the Inline graphic-locus and is conditioned on eventual fixation in the whole population. Our main interest is the site frequency spectrum of the Inline graphic-locus at the fixation time Inline graphic of the Inline graphic-allele, which we also refer to as the end of the sweep. Consider a sample of size Inline graphic taken at time Inline graphic, and let Inline graphic be the number of SNPs at the Inline graphic-locus where the derived variant is present in exactly Inline graphic individuals. The time before Inline graphic is called the neutral phase, while the time between Inline graphic and Inline graphic is the selective phase.
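The selective phase of this model can be illustrated with a small forward simulation. The following is a minimal sketch, not the authors' code: it assumes genic selection with relative fitness 1 + s, a population of 2N gene copies, a single initial copy of the beneficial allele, and conditioning on fixation by simply discarding paths that go extinct. The function name `sweep_path` and all parameter values are hypothetical.

```python
import random

def sweep_path(N, s, rng, max_tries=10_000):
    """Frequency path of a beneficial allele in a Wright-Fisher population
    of 2N gene copies, started from one copy and conditioned on fixation.

    Assumption: genic selection, so a carrier has relative fitness 1 + s.
    Conditioning is done by rejection: extinct paths are thrown away.
    """
    copies = 2 * N
    for _ in range(max_tries):
        count = 1
        path = [count / copies]
        while 0 < count < copies:
            x = count / copies
            # Expected frequency in the next generation under selection.
            p = (1 + s) * x / (1 + s * x)
            # Binomial resampling of all 2N gene copies.
            count = sum(rng.random() < p for _ in range(copies))
            path.append(count / copies)
        if count == copies:
            return path
    raise RuntimeError("no fixation observed within max_tries attempts")

rng = random.Random(1)
path = sweep_path(N=100, s=0.2, rng=rng)
print(len(path) - 1, "generations until fixation")
```

The length of the returned path is one realization of the duration of the selective phase in generations; repeating the call gives an empirical distribution of fixation times.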

Diffusion approximation and structured coalescent

To derive an approximation of the expected site frequency spectrum, we rely on a diffusion approximation for the frequency of the beneficial Inline graphic-allele (see e.g., [18]) and a coalescent process in a random background as described in [9] (see also [8]). Recall (e.g., from [19]) that the frequency of the Inline graphic-allele after Inline graphic, when time is rescaled by a factor of Inline graphic, is approximately given by the solution Inline graphic of the stochastic differential equation

graphic file with name pone.0081738.e034.jpg (1)

where Inline graphic is the rescaled (genic) selection intensity, and Inline graphic is defined by saying that Inline graphic is the expected number of Inline graphic-alleles in the next generation if the current frequency is Inline graphic. Observe that Inline graphic after some random time Inline graphic, which we call the fixation time of Inline graphic. In the background of the path Inline graphic, we consider a structured coalescent that evolves as follows (see Figure 1 for an illustration, where a sample of size Inline graphic is used): Set Inline graphic and start with Inline graphic lines at time Inline graphic (i.e., Inline graphic and the end of the sweep) in the Inline graphic-background. The following four transitions can occur between times Inline graphic and Inline graphic, i.e., during the selective phase:

Figure 1. The structured coalescent.

In the given example of the structured coalescent, the right side shows the selective phase with a sample of size 9 at the moment of fixation and the frequency path of the beneficial allele. At time Inline graphic, there are 3 late recombinant families (labeled with Inline graphic), which all have a size of 1, one early recombinant family (labeled with Inline graphic) of size 2 and one nonrecombinant family (labeled with Inline graphic) of size 4. These lines then start a standard coalescent in the neutral phase. The crosses illustrate SNPs in the sample.

  1. Coalescence of a pair of lines in the Inline graphic-background: at rate Inline graphic, any pair of lines in the Inline graphic-background coalesces.

  2. Switching of background from Inline graphic to Inline graphic by recombination: at rate Inline graphic with Inline graphic (Inline graphic is the recombination fraction between the selective and neutral locus within a single generation), any line in the Inline graphic-background changes to the Inline graphic-background.

  3. Coalescence of a pair of lines in the Inline graphic-background: at rate Inline graphic, any pair of lines in the Inline graphic-background coalesces.

  4. Switching of background from Inline graphic to Inline graphic by recombination: at rate Inline graphic, any line in the Inline graphic-background changes to the Inline graphic-background.

Due to these transitions, there is a random number Inline graphic of lines in the Inline graphic-background at time Inline graphic and Inline graphic lines in the Inline graphic-background. (If there were two or more lines in the Inline graphic-background, their coalescence rate would be arbitrarily large by the coalescence rate Inline graphic.) The resulting Inline graphic lines follow a standard neutral coalescent after time Inline graphic, i.e., every pair of lines coalesces at rate 1 until only a single line is left, at which point the process is stopped.

After having constructed the random tree from the coalescing lines, every line is hit by mutation events at the rate Inline graphic, with Inline graphic. We call an event a mutation of size Inline graphic if it falls on a branch leading to exactly Inline graphic leaves of the tree. The number of size Inline graphic mutations is called Inline graphic, and Inline graphic is called the site frequency spectrum, which we will approximate for large Inline graphic below.
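For the purely neutral part of this construction, the expected spectrum is a classical result: in a standard coalescent where every pair of lines merges at rate 1, the expected total length of branches subtending exactly k leaves is 2/k, so a per-line mutation rate of θ/2 gives an expected θ/k mutations of size k. The sketch below checks this numerically. It is an illustration only; the function names are hypothetical, and the rate θ/2 is an assumed convention.

```python
import random

def branch_length_by_size(n, rng):
    """Simulate one neutral coalescent tree for n lines (each pair merges
    at rate 1). Returns a list where entry k is the total length of all
    branches that subtend exactly k of the n leaves."""
    lengths = [0.0] * (n + 1)
    sizes = [1] * n  # number of leaves below each current line
    while len(sizes) > 1:
        m = len(sizes)
        wait = rng.expovariate(m * (m - 1) / 2)  # time to next merger
        for size in sizes:
            lengths[size] += wait
        i, j = rng.sample(range(m), 2)  # uniformly chosen pair merges
        merged = sizes[i] + sizes[j]
        sizes = [sz for idx, sz in enumerate(sizes) if idx not in (i, j)]
        sizes.append(merged)
    return lengths

def mean_branch_lengths(n, reps, seed=0):
    """Average branch_length_by_size over independent trees."""
    rng = random.Random(seed)
    total = [0.0] * (n + 1)
    for _ in range(reps):
        for k, value in enumerate(branch_length_by_size(n, rng)):
            total[k] += value
    return [t / reps for t in total]

means = mean_branch_lengths(n=6, reps=5000, seed=1)
for k in range(1, 6):
    print(k, round(means[k], 3), "theory:", round(2 / k, 3))
```

Multiplying the simulated mean lengths by θ/2 gives a Monte Carlo estimate of the neutral expected site frequency spectrum.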

Yule approximation of the genealogy in the selective phase

In [19] and [11], the following approximation of the structured coalescent during the selective phase was developed in the limit of large Inline graphic and for Inline graphic: As was shown there, events 3 and 4 from the structured coalescent can be ignored because their probability becomes small for large Inline graphic. Thus, each line undergoes at most one recombination event during the selective phase. Two lines of the genealogy at time Inline graphic belong to the same family if they coalesce between time Inline graphic and Inline graphic. The following families are distinguished:

  1. Nonrecombinant family: The set of individuals whose ancestral lineages never left background Inline graphic.

  2. Early recombinant families: The set of individuals whose ancestral lines have not left background Inline graphic before (according to the backward time Inline graphic) the first coalescence in the sample occurs, but the ancestor at time Inline graphic (equivalent to Inline graphic) is in background Inline graphic.

  3. Late recombinant families: The families consisting of a single individual whose ancestral line has left background Inline graphic before the first coalescence in the sample, and the ancestor at time Inline graphic is in background Inline graphic.

Note that late recombinant families are of size 1 by definition, and there can be at most one nonrecombinant family, whose members have inherited their Inline graphic-allele from the founder of the sweep.

To get an approximation formula for the genealogy at time Inline graphic, we first need the distribution for the number and size of the different families. Recall from Theorem 1 in [19] that the genealogy consists (up to an error of probability of order Inline graphic) of

  • Inline graphic late recombinant families of size 1,

  • one early recombinant family of size Inline graphic and

  • one nonrecombinant family of size Inline graphic.

For the joint distribution of Inline graphic and Inline graphic, define a random variable Inline graphic, distributed according to

graphic file with name pone.0081738.e122.jpg (2)

Given Inline graphic, Inline graphic is a binomial random variable with Inline graphic trials and success probability Inline graphic, where

graphic file with name pone.0081738.e127.jpg (3)

The distribution of Inline graphic depends on Inline graphic and on another variable Inline graphic, which gives the number of lines that are affected by the early recombination at time Inline graphic according to

graphic file with name pone.0081738.e132.jpg (4)

(Note that the case Inline graphic requires a different definition of the distribution of Inline graphic, which we give in Section A of the SI.) As one or more of these Inline graphic lines could experience a late recombination event, they could be kicked out of the family of early recombinants. This explains the hypergeometric distribution of Inline graphic, i.e., given Inline graphic and Inline graphic, the variable Inline graphic is hypergeometric with

graphic file with name pone.0081738.e140.jpg (5)

Combining these equations, a straightforward calculation (see Corollary 2.7 in [19]) leads to

graphic file with name pone.0081738.e141.jpg (6)

Note that this equation corrects an error (in the case of Inline graphic) of the equation of [19]; see SI, Sections A and B. Moreover, there is a factor of 2 difference here because we assume a diffusion constant of 1 in (1).

Yule approximation of the site frequency spectrum

Our goal is to obtain an expectation of the site frequency spectrum, Inline graphic, at the end of a selective sweep using the approximation from (6). We will assume that Inline graphic is large and that no new mutations accumulate in the sample during the selective phase. Moreover, recombination between the Inline graphic- and Inline graphic-locus has to be in a certain range to see a non-trivial frequency spectrum. (Here, trivial would either mean that there is no variation at all if Inline graphic is too small or a neutral site frequency spectrum if Inline graphic is too large.) Recalling that the duration of the sweep is approximately Inline graphic (see [19]), Inline graphic must be on the order of Inline graphic. In other words, Inline graphic is on the order of Inline graphic and hence small if Inline graphic is large.

To get an approximation formula for the frequency spectra, the events and probabilities of the selective phase must be joined with the neutral phase. In the neutral phase, Kingman's coalescent describes the genealogy of the Inline graphic remaining lines. The crucial point is how to combine the approximation of the genealogy of the Inline graphic-locus during the selective phase with a neutral coalescent before the onset of the sweep. A critical quantity is the number Inline graphic of ancestors of the sample at the onset of the sweep. Because a mutation can only influence at most Inline graphic of these ancestors, the descendants in the selective phase depend on this number of lines. Recall that the sample size is Inline graphic, Inline graphic is large, Inline graphic is the mutation rate and Inline graphic is the recombination rate, with Inline graphic being small. Therefore, the expected number of mutations of size Inline graphic is (see SI, Section C for the proof)

graphic file with name pone.0081738.e165.jpg (7)

for Inline graphic, where the probabilities of Inline graphic are given by (6). We note that the term Inline graphic is due to the use of the approximation formula for the selective phase.

To get an idea of how this formula is computed, consider again Figure 1. There are 3 late recombinant families, one early recombinant family of size 2 and one nonrecombinant family (labeled Inline graphic) of size 4. Given these values, there are two different ways for a mutation to reach a size of 2. Either it had a size of 2 at time Inline graphic and these two lines were two late recombinant families, or it had size 1 at time Inline graphic and was carried by the founder of the early recombinant family, which has a size of 2 at the end of the sweep. Taking into account all possibilities, (7) arises.
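The counting argument for the Figure 1 example can be made explicit by enumeration. The sketch below conditions on a fixed family configuration and is not a reproduction of (7): it assumes that no mutations arise during the selective phase, that the ancestral lines at the onset of the sweep follow a standard neutral coalescent with expected θ/j mutations carried by exactly j of them, and that the j carriers form a uniformly chosen subset of the lines (which holds by exchangeability). The function name and the choice θ = 1 are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def sfs_given_families(family_sizes, theta=Fraction(1)):
    """Expected site frequency spectrum at the end of the sweep, given the
    sizes of the families founded by the ancestral lines at sweep onset.

    A neutral mutation carried by j of the m ancestral lines ends up in as
    many sampled individuals as the corresponding families contain.
    """
    m = len(family_sizes)
    n = sum(family_sizes)
    spectrum = {k: Fraction(0) for k in range(1, n)}
    for j in range(1, m):  # j = m would be fixed in the sample
        subsets = list(combinations(range(m), j))
        weight = theta / j / len(subsets)  # theta/j spread over all subsets
        for subset in subsets:
            k = sum(family_sizes[i] for i in subset)
            spectrum[k] += weight
    return spectrum

# Figure 1 example: 3 late recombinant families (size 1 each),
# one early recombinant family (size 2), one nonrecombinant family (size 4).
spectrum = sfs_given_families([1, 1, 1, 2, 4])
print(spectrum[2])  # 7/20
```

The value 7/20 splits into 1/5 from a size-1 mutation on the founder of the early recombinant family and 3/20 from size-2 mutations that hit two of the three late recombinant lines, matching the two cases described in the text. Averaging such conditional spectra over the distribution of family configurations is the step that the full formula carries out.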

Previous approximation formulas

Using simulations, we compared the Yule approximation formula (7) to two other approximation formulas for the frequency spectra. The first approximation is from [16] and will be called the deterministic formula because a deterministic development of the frequency of allele Inline graphic is assumed in this approach. The second approximation is the star-like approximation (see [7] or chapter 6 in [20]).

Deterministic approximation

Building on the ideas of [1], Fay and Wu [16] obtained the following approximation for the site frequency spectrum after a selective sweep:

graphic file with name pone.0081738.e177.jpg (8)

with

graphic file with name pone.0081738.e178.jpg

where Inline graphic is the starting frequency of the beneficial allele. For the numerical comparison, we use Inline graphic because, in this situation, the length of the selective phase is Inline graphic, which is close to the expectation of the stochastic model.

Star-like approximation

For the classical star-like approximation, every line in the selective phase has the same independent chance to recombine and be in background Inline graphic at time Inline graphic. Therefore, Inline graphic, and Inline graphic is binomially distributed with parameters Inline graphic and Inline graphic, which is the probability that a single line recombines. Combining this insight with (7) leads to the equation

graphic file with name pone.0081738.e188.jpg (9)

Note that for small Inline graphic, the approximation error is much larger than in (7).
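The logic of the star-like approximation can be sketched numerically. The code below is an illustration under stated assumptions and is not claimed to agree term by term with (9): each of the n sampled lines recombines independently with probability `p` (treated here as an input), recombinant lines become singleton families, all remaining lines form one nonrecombinant family, and the ancestral lines then follow a standard neutral coalescent with expected θ/j mutations of size j. The function name is hypothetical.

```python
from math import comb

def star_like_sfs(n, p, theta=1.0):
    """Expected spectrum under a star-like sweep followed by a neutral
    coalescent. Returns a list whose entry k (1 <= k <= n-1) is the
    expected number of mutations of size k; entry 0 is unused."""
    spectrum = [0.0] * n
    for l in range(n + 1):  # l = number of recombinant lines
        prob_l = comb(n, l) * p**l * (1 - p) ** (n - l)
        rest = n - l  # size of the nonrecombinant family
        m = l + (1 if rest > 0 else 0)  # ancestral lines at sweep onset
        for j in range(1, m):
            if rest == 0:
                # Everyone recombined: plain neutral spectrum.
                spectrum[j] += prob_l * theta / j
                continue
            denom = comb(m, j)
            # Carriers are j recombinant lines only: final size j.
            spectrum[j] += prob_l * theta / j * comb(l, j) / denom
            # Carriers include the nonrecombinant family: size j - 1 + rest.
            spectrum[j - 1 + rest] += prob_l * theta / j * comb(l, j - 1) / denom
    return spectrum

print(star_like_sfs(n=10, p=0.3))
```

For p = 1 the function returns the neutral spectrum θ/k, and for p = 0 it returns no variation at all, which are the two trivial limits of large and small recombination distance.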

Numerical comparison

Our goal is to compare the performance of Formulas (7), (8) and (9) to simulations from the Wright-Fisher model. For the Wright-Fisher model, the simulation tool msms was used (which stands for make sample mit selection, see [21] or http://www.mabs.at/ewing/msms/index.shtml). To compare the different formulas for the expected frequency spectra, the average of Inline graphic iterations was taken as a reference. Figure 2 shows the case of a high selective advantage Inline graphic in a sample of size Inline graphic. Theoretically, the Yule and star-like approximations converge for large Inline graphic. However, while the deterministic and star-like approximations perform about equally well, the (absolute and relative) error of (7) is smaller.
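A reference spectrum of this kind can be obtained by averaging over simulated replicates. The following sketch assumes the simulator writes plain ms-style text, where each replicate starts with a `//` line, followed by a `segsites:` line, a `positions:` line (absent when there are no segregating sites) and one 0/1 string per sampled sequence with 1 marking the derived allele. The function name `average_sfs` is hypothetical, and the small input text is made up for illustration.

```python
def average_sfs(ms_text, n):
    """Average site frequency spectrum over all replicates in ms-style
    output for samples of size n. Entry k of the result is the mean number
    of sites at which exactly k sequences carry the derived allele."""
    totals = [0] * n
    replicates = 0
    lines = [line.strip() for line in ms_text.splitlines()]
    i = 0
    while i < len(lines):
        if lines[i] != "//":
            i += 1
            continue
        replicates += 1
        segsites = int(lines[i + 1].split()[1])
        if segsites > 0:
            # Skip the "positions:" line, then read n haplotype strings.
            haplotypes = lines[i + 3 : i + 3 + n]
            for site in range(segsites):
                k = sum(h[site] == "1" for h in haplotypes)
                if 0 < k < n:
                    totals[k] += 1
        i += 2
    if replicates == 0:
        raise ValueError("no replicates found in input")
    return [t / replicates for t in totals]

example = "\n".join([
    "//", "segsites: 3", "positions: 0.1 0.5 0.9", "101", "110", "010", "",
    "//", "segsites: 1", "positions: 0.3", "0", "1", "1",
])
print(average_sfs(example, n=3))  # [0.0, 0.5, 1.5]
```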

Figure 2. Comparison of the expected frequency spectra I.

Comparison of the 3 approximation formulas and the results from msms for the parameters Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic. In A, the whole frequency spectrum is illustrated, while in B, the counts for mutation sizes between Inline graphic and Inline graphic are shown enlarged.

In Figure 3, we used a smaller selective coefficient Inline graphic and a sample of size Inline graphic. Here, the relative error of the star-like and deterministic approximations exceeds 0.6. Again, the Yule approximation (7) gives the best results, with the relative error never exceeding 0.2. Reassuringly, all approximations give good results for low- and high-frequency variants that are known to be fundamental in detecting selective sweeps in data.

Figure 3. Comparison of the expected frequency spectra II.

Comparison of the 3 approximation formulas and the results from msms for the parameters Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic. In A, we see the expected frequency spectra, and in B, we see the relative errors compared to the reference Inline graphic.
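The relative errors reported in panel B can be computed per frequency class from a predicted and a reference spectrum. The sketch below assumes the relative error is the absolute deviation divided by the reference value; the function name and the numbers are hypothetical and are not taken from the figures.

```python
def relative_errors(approx, reference):
    """Per-class relative error |approx - reference| / reference.
    Both inputs list the expected counts for sizes 1, 2, ... in order."""
    return [abs(a - r) / r for a, r in zip(approx, reference)]

# Made-up spectra for illustration only.
reference = [4.0, 1.0, 0.5, 2.0]
predicted = [4.4, 0.9, 0.6, 2.0]
print([round(e, 3) for e in relative_errors(predicted, reference)])
```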

In applications, the case of a high recombination rate is of particular importance. Here, (7) needs to be corrected as described in Appendix A. Because the error of all approximation formulas increases with recombination rate, it is no surprise that the errors in Figure 4 are larger than those in Figures 2 and 3. Still, the Yule approximation works best for most of the frequency classes.

Figure 4. Comparison of the expected frequency spectra III.

Comparison for the parameters Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic, where the adjusted formula for the joint distribution according to Appendix A is needed. In A, the expected frequency spectra are depicted, and in B, the relative errors compared to the reference Inline graphic are illustrated.

Discussion

The site frequency spectrum is a basic summary statistic used for the analysis of SNP data. Theoretical predictions of the shape of the frequency spectrum are most important in order to understand the evolutionary forces that have shaped the genomic data at hand. In the present paper, we have demonstrated how a recently developed approximation for selective sweeps from [7], [19], [11], [12], based on a marked Yule process, leads to such a prediction (at least for the expected site frequency spectrum). For the analytical formula, two cases have to be taken into account. If Inline graphic, the marked Yule process can be applied directly, but if Inline graphic, we have to use some normalization procedure. The latter case arises if the neutral locus has a large recombinational distance to the target of selection. In the parameter constellation of Figure 4, none of the approximations works particularly well, with relative errors up to 20% for the Yule and deterministic approximations and over 140% for the star-like approximation. However, theoretical predictions become worse for larger Inline graphic and errors are less predictable in this setting.

For smaller recombinational distances, we find that the Yule approximation outperforms the star-like approximation, especially for intermediate frequency variants (relative error up to 20% for the Yule approximation versus up to 80% for the star-like approximation, see Figure 3). In a comparison between the Yule and star-like approximations, a basic difference is that the star-like approximation forbids what we called early recombinant families. Such families lead to a decrease in the number of singleton mutations, which is shown in our simulations and has the greatest impact on the relative errors we reported above.

Altogether, the combination of (7) and (11) gives our analytical formula. Most importantly, compared to other approaches, such as the deterministic approach of [16] and the star-like approximation derived in [10], [7] and used e.g., in [3], the Yule process approximation has a smaller error in nearly all cases. Although the formulas derived in the Yule approximation are more involved, they can still be easily implemented for data applications to obtain a higher accuracy. Above all, such accuracy is desirable in genome scans for selective sweeps, which are frequently carried out by software such as SweepFinder [6].

Supporting Information

Appendix S1

Supporting Information for the article.

(PDF)

Acknowledgments

We thank Joachim Hermisson for fruitful discussions and two anonymous referees for their helpful comments.

Funding Statement

This work was funded by the project PP672/3-1 and Hu1889/1-1 of the Deutsche Forschungsgemeinschaft (DFG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Smith JM, Haigh J (1974) The hitch-hiking effect of a favorable gene. Genetical Research 23: 23–35.
  • 2. Kim Y, Stephan W (2002) Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160: 765–777.
  • 3. Kim Y, Nielsen R (2004) Linkage disequilibrium as a signature of selective sweeps. Genetics 167: 1513–1524.
  • 4. Jensen JD, Thornton KR, Bustamante CD, Aquadro CF (2007) On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in nonequilibrium populations. Genetics 176: 2371–2379.
  • 5. Pavlidis P, Jensen JD, Stephan W (2010) Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations. Genetics 185: 907–922.
  • 6. Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, et al. (2005) Genomic scans for selective sweeps using SNP data. Genome Research 15: 1566–1575.
  • 7. Durrett R, Schweinsberg J (2004) Approximating selective sweeps. Theoretical Population Biology 66: 129–138.
  • 8. Kaplan NL, Hudson RR, Langley CH (1989) The 'hitchhiking effect' revisited. Genetics 123: 887–899.
  • 9. Barton NH, Etheridge AM, Sturm AK (2004) Coalescence in a random background. Ann Appl Probab 14: 754–785.
  • 10. Barton NH (1998) The effect of hitch-hiking on neutral genealogies. Genetical Research 72: 123–133.
  • 11. Pfaffelhuber P, Haubold B, Wakolbinger A (2006) Approximate genealogies under genetic hitch-hiking. Genetics 174: 1995–2008.
  • 12. Pfaffelhuber P, Studeny A (2007) Approximating genealogies for partially linked neutral loci under a selective sweep. J Math Biol 55: 299–330.
  • 13. Fisher R (1930) The Genetical Theory of Natural Selection. Second edition. Oxford: Clarendon Press.
  • 14. Evans S, O'Connell N (1994) Weighted occupation time for branching particle systems and a representation for the supercritical superprocess. Canad Math Bull 37: 187–196.
  • 15. Stephan W (2010) Genetic hitchhiking versus background selection: the controversy and its implications. Phil Trans R Soc B 365: 1245–1253.
  • 16. Fay JC, Wu CI (2000) Hitchhiking under positive Darwinian selection. Genetics 155: 1405–1413.
  • 17. Kimura M (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893–903.
  • 18. Ewens WJ (2004) Mathematical population genetics. I, volume 27 of Interdisciplinary Applied Mathematics. New York: Springer-Verlag, second edition. Theoretical introduction.
  • 19. Etheridge A, Pfaffelhuber P, Wakolbinger A (2006) An approximate sampling formula under genetic hitchhiking. Ann Appl Probab 16: 685–729.
  • 20. Durrett R (2008) Probability models for DNA sequence evolution. Probability and its Applications (New York). New York: Springer, second edition.
  • 21. Ewing G, Hermisson J (2010) Msms: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26: 2064–2065.
