Abstract
Recurrent mutation produces multiple copies of the same allele which may be co-segregating in a population. Yet, most analyses of allele-frequency or site-frequency spectra assume that all observed copies of an allele trace back to a single mutation. We develop a sampling theory for the number of latent mutations in the ancestry of a rare variant, specifically a variant observed in relatively small count in a large sample. Our results follow from the statistical independence of low-count mutations, which we show to hold for the standard neutral coalescent or diffusion model of population genetics as well as for more general coalescent trees. For populations of constant size, these counts are distributed like the number of alleles in the Ewens sampling formula. We develop a Poisson sampling model for populations of varying size and illustrate it using new results for site-frequency spectra in an exponentially growing population. We apply our model to a large data set of human SNPs and use it to explain dramatic differences in site-frequency spectra across the range of mutation rates in the human genome.
Keywords: recurrent mutation, Ewens sampling formula, coalescent theory, human SNPs
Recurrent mutation has long been recognized as an important factor of evolution (Fisher 1928; Haldane 1933; Wright 1938). This is emphasized by recent analyses of single-nucleotide polymorphism (SNP) frequencies and variation of mutation rates across the human genome (Aggarwala and Voight 2016; Harpak et al. 2016; Seplyarskiy et al. 2021) describing how patterns of variation depend on the mutation rate, particularly for rare variants. By a rare variant we mean an allele, such as an alternate base at a SNP, which is observed a relatively small number of times in a large sample. Unless the mutation rate is very small, indistinguishable copies of the same allele may descend from multiple mutations. Here, we present a sampling theory for the numbers and associated frequencies of these unobserved or latent mutations in the ancestry of a rare variant.
Humans are on the low end of polymorphism levels among species (Leffler et al. 2012). On average, multiple mutations should be rare. In the 1000 Genomes Project data, about 1 in 1300 sites differ when two (haploid) genomes are compared, and SNPs with more than two bases segregating comprise only about % of the total SNPs observed (The 1000 Genomes Project Consortium 2015). But polymorphism rates vary by two or three orders of magnitude depending on local sequence context (Aggarwala and Voight 2016; Harpak et al. 2016; Seplyarskiy et al. 2021). Recurrent mutation is an important phenomenon for fast-mutating sites. Evidence for this can be found in the haplotype structure surrounding rare mutations (Johnson et al. 2022) and in the distribution of their frequencies among sites in large samples (Harpak et al. 2016; Seplyarskiy et al. 2021).
Here we focus on the latter, in particular on the site-frequency spectrum (Tajima 1989; Braverman et al. 1995; Fu 1995). Deviations in site-frequency spectra compared to standard predictions may be due to selection (Bustamante et al. 2001; Achaz 2009; Ferretti et al. 2017), changes in population size over time (Eldon et al. 2015; Liu and Fu 2015; Gao and Keinan 2016) or population structure (Gutenkunst et al. 2009; Städler et al. 2009; Kern and Hey 2017). But they may also be due to multiple mutations, i.e. to violations of the infinite-sites model assumption that each polymorphism is due to a unique mutation (Fisher 1930a; Kimura 1969, 1971; Ewens 1974; Watterson 1975).
The standard site-frequency prediction, which holds for a well-mixed population of constant large size N and neutral mutation rate u at a locus, is that the number of SNPs where a variant is found in i copies in a sample of size n should be proportional to $\theta/i$, where $\theta = 4Nu$ (Tajima 1989; Fu 1995). This dramatically underpredicts the abundance of rare variants in data from humans, largely because of our recent explosive population growth (Keinan and Clark 2012; Gazave et al. 2014; Gao and Keinan 2016), but the standard neutral model is a useful starting point for modeling recurrent mutation.
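As a small illustration of this baseline, the sketch below computes the relative site-frequency spectrum under the standard neutral model, assuming only that expected counts are proportional to 1/i for i = 1, ..., n-1. The function name `neutral_sfs` is illustrative and does not come from any package.

```python
def neutral_sfs(n):
    """Relative site-frequency spectrum under the standard neutral model.

    Entry i-1 of the returned list is the proportion of segregating sites at
    which the variant is found in i copies (i = 1, ..., n-1), assuming expected
    counts proportional to 1/i.
    """
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


sfs = neutral_sfs(10)
# Singletons are expected to be twice as frequent as doubletons.
print(sfs[0] / sfs[1])
```

Because the spectrum is normalized, the mutation parameter cancels; only the 1/i shape matters for comparisons with data.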
Jenkins and Song (2011) studied the occurrence of one or two mutations at a single site under the standard neutral coalescent model (Kingman 1982; Hudson 1983; Tajima 1983). They showed that if two mutations occur and are non-nested (meaning that all descendants of both mutations can be observed) there will be a shift away from rare variants and toward common ones. An earlier work focusing on the nested case is Hobolth and Wiuf (2009). Bhaskar et al. (2012) used a similar approach as Jenkins and Song (2011) to obtain results for one, two or three mutations, up to leading order in the mutation parameter θ. Sargsyan (2006, 2015) considered two mutations occurring at two different sites, and Jenkins et al. (2014) assumed that two mutations are distinguishable and yield a tri-allelic polymorphism. These latter works (Sargsyan 2006, 2015; Jenkins et al. 2014) allowed for variable population size following the general coalescent approach of Griffiths and Tavaré (1998). None of these works considered rare variants in particular but their predictions, especially those for non-nested mutations (Jenkins and Song 2011; Bhaskar et al. 2012) are helpful for understanding recurrent mutation.
Two recent large studies of human SNPs observed this predicted shift away from rare variants and toward common ones at fast-mutating sites. Harpak et al. (2016) surveyed about 8 million SNPs in a sample of nearly 61 000 people in version of the Exome Aggregation Consortium database (Lek et al. 2016) for which data were available from other primate species. Among these, about % were bi-allelic, % were tri-allelic and % were quad-allelic. Harpak et al. (2016) took the presence of identical segregating variants in different species, ranging from chimpanzees to baboons, as indicative of a higher mutation rate at a site. Consistent with the hypothesis of multiple latent mutations at fast-mutating sites, they found fewer rare variants at bi-allelic SNPs for which the minor allele was segregating in another species, and found that this effect was stronger when the other species was closer to humans.
The work we present here builds upon the second of these studies. Seplyarskiy et al. (2021) looked at rare variants in two datasets, one containing about 292 million variants among nearly 43 thousand individuals in TOPMed freeze 5 (Taliun et al. 2021) and the other containing about 182 million variants among 15 thousand individuals in gnomAD version r2.0.2 (Karczewski et al. 2020). Variants were divided into 192 types: each of the 3 possible base substitutions at the middle site of all 64 possible trinucleotides. A classic example of a fast-mutating site in this context would be ACG, which readily changes to ATG via a C to T transition at the CpG dinucleotide (Bird 1980; Goldman 1993). The main goals in Seplyarskiy et al. (2021) were to quantify how the rates of each kind of mutation vary across the genome and to partition this variation into distinct components correlated with different mutational processes.
Another aim, taken up in the Supplementary Materials of Seplyarskiy et al. (2021), was to correct for multiple mutations contributing to rare variants. Recurrent mutation was modeled as a multi-type Poisson process where mutations with lower sample counts occur independently at a locus to generate the appearance of higher count mutations (Desai and Plotkin 2008). The expected counts in the absence of recurrence were taken from the site-frequency spectrum at slow-mutating sites. The loss of rare variants due to recurrent mutation at fast-mutating sites was quantified for sites with up to 70 copies of a rare variant. These were considered to have descended from up to 5 mutations. Slow-mutating sites, even with rates up to the genome average in humans, should conform fairly well to the infinite-sites assumption. Resampling from these as in Seplyarskiy et al. (2021) is a way of controlling for the myriad unknown factors affecting the site-frequency spectrum, including growth.
In this work, we present a sampling theory for latent mutations of rare variants at each given site-frequency count in a large sample. We describe a mathematical population genetic framework for the Poisson-resampling method in Seplyarskiy et al. (2021) and provide closed-form analytical expressions for several quantities of interest. In short, the distributions of latent mutations and counts of rare variants depend on the expected total length of the gene genealogy of the sample, the expected lengths of branches with few descendants in the sample, and of course the mutation rate. We obtain new large-sample results for exponential growth and use these to illustrate the theory. We apply our results to a different subset of the gnomAD data than Seplyarskiy et al. (2021), synonymous variants observed in non-Finnish European individuals in v2.1.1, containing about 834 thousand variants at about million sites among 57 K individuals, presorted into 97 bins based on estimates of mutation rate by the method of Seplyarskiy et al. (2022).
We develop and present these results in the next three sections. In “Theory for constant-size large populations,” we begin with the standard neutral coalescent or diffusion model of population genetics (Ewens 2004) and demonstrate a close connection between the Ewens sampling formula (Ewens 1972) and distributions of latent mutations. In “Theory for nonconstant populations,” we extend the results to populations which have changed in size, using the Poisson-sampling models of Watterson (1974b) and Arratia et al. (1992). In “Theoretical example and data application,” we compare predictions for constant size to those for exponential growth and show how the new theory can be applied to understand the effects of recurrent mutation on counts of rare variants across the range of human per-site mutation rates.
Theory for constant-size large populations
In this section, we begin with a description of recurrent mutation via the well known predictions for allele frequencies in a population and in a sample at stationarity. We then use conditional ancestral processes to demonstrate independence of latent mutations of rare variants in a large sample and show that their numbers are distributed like the numbers of alleles in the Ewens sampling formula.
Stationary distributions and sampling probabilities
Consider a single locus with parent-independent mutation among K possible alleles in a population which obeys the Wright–Fisher diffusion (Fisher 1930b; Wright 1931; Ewens 2004). Thus, the population is very large, well mixed, constant in size over time, and there is no selection. One unit of time in the diffusion process corresponds to $2N_e$ generations ($N_e$ generations for haploid species), where $N_e$ is the effective population size. Each gene copy or genetic lineage experiences mutations at rate $\theta/2$ and each mutation produces an allele of type $i$ with probability $\pi_i$, with $\sum_{i=1}^{K}\pi_i = 1$, independent of the allelic state of the parent. At stationarity, the joint distribution of the relative frequencies $x_1,\ldots,x_K$ of alleles is given by
$$\phi(x_1,\ldots,x_{K-1}) = \frac{\Gamma(\theta)}{\prod_{i=1}^{K}\Gamma(\theta\pi_i)}\prod_{i=1}^{K}x_i^{\theta\pi_i-1} \tag{1}$$
in which $\Gamma(\cdot)$ is the Gamma function, and where necessarily $x_K = 1-\sum_{i=1}^{K-1}x_i$ (Wright 1931, 1949).
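Conditional on the population frequencies $(x_1,\ldots,x_K)$, the sample counts of alleles $(N_1,\ldots,N_K)$ are multinomially distributed. A sample of size n taken from the population contains $n_1,\ldots,n_{K-1}$ copies of alleles 1 through $K-1$, and necessarily $n_K = n-\sum_{i=1}^{K-1}n_i$ copies of allele K, with probability
$$p(\mathbf{n}) \equiv P(N_1=n_1,\ldots,N_K=n_K) \tag{2}$$
$$= \frac{n!}{n_1!\cdots n_K!}\,\frac{\prod_{i=1}^{K}(\theta\pi_i)_{(n_i)}}{\theta_{(n)}} \tag{3}$$
for $\mathbf{n}=(n_1,\ldots,n_K)$ constrained by $\sum_{i=1}^{K}n_i = n$ and where $a_{(m)} = a(a+1)\cdots(a+m-1)$ denotes the Pochhammer function or rising factorial with $a_{(0)}=1$. The shorthand defined in (2) is used extensively in what follows.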
In applications to DNA, $K=4$ and a sample at a given site would contain counts $n_1$, $n_2$, $n_3$, $n_4$ of each of the four nucleotides. The assumption of parent-independent mutation which leads to the relatively simple expressions (1) and (3) is unrealistic for DNA, but its results are useful in the case of rare variants in very large samples. In this case, it is likely that the common variant, allele 4 say, represents the ancestral state of the entire sample and that rare variants (alleles 1, 2 and 3) are due to recent mutations from the common variant. Then the mutation parameter $\theta\pi_i$ for $i\in\{1,2,3\}$ captures the production of type-i rare alleles in a specific ancestral background (allele 4).
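An instructive special case is $K=2$, where we have
$$\phi(x) = \frac{\Gamma(\theta)}{\Gamma(\theta\pi_1)\Gamma(\theta\pi_2)}\,x^{\theta\pi_1-1}(1-x)^{\theta\pi_2-1} \tag{4}$$
for the stationary distribution of the frequency $x$ of type 1 in the population (Wright 1931), and
$$p(n_1,n-n_1) = \binom{n}{n_1}\frac{(\theta\pi_1)_{(n_1)}(\theta\pi_2)_{(n-n_1)}}{\theta_{(n)}} \tag{5}$$
for the sampling probability, i.e. that a sample of size n contains $n_1$ copies of allele 1 and $n-n_1$ copies of allele 2. Any two-allele mutation model can be described as a parent-independent model, but this is not so in general for $K>2$.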
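As a minimal numerical sketch, the two-allele sampling probability can be evaluated on the log scale. The code assumes that (5) has the standard beta-binomial form, $\binom{n}{n_1}(\theta\pi_1)_{(n_1)}(\theta\pi_2)_{(n-n_1)}/\theta_{(n)}$, which is what a binomial sample compounded with the beta density in (4) gives. The names `log_rising` and `sample_prob` are illustrative only.

```python
from math import exp, lgamma


def log_rising(a, m):
    """Log of the rising factorial a(a+1)...(a+m-1); equals 0 when m = 0."""
    return lgamma(a + m) - lgamma(a)


def sample_prob(n1, n, theta, pi1):
    """Probability of n1 copies of allele 1 and n - n1 copies of allele 2.

    Assumes two-allele parent-independent mutation at stationarity, with
    theta > 0 and 0 < pi1 < 1.
    """
    theta1 = theta * pi1
    theta2 = theta * (1.0 - pi1)
    log_binom = lgamma(n + 1) - lgamma(n1 + 1) - lgamma(n - n1 + 1)
    return exp(log_binom + log_rising(theta1, n1)
               + log_rising(theta2, n - n1) - log_rising(theta, n))


# The probabilities over n1 = 0, ..., n should sum to one.
print(sum(sample_prob(k, 50, 0.5, 0.3) for k in range(51)))
```

Working with `lgamma` avoids overflow in the factorials and rising factorials when n is in the tens of thousands, as in the data sets discussed here.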
Figure 1 shows how the sample frequency distribution in (5) depends on the mutation rate for a pair of alleles which differ by an order of magnitude in mutation rate. Three values of θ are shown, with the small value chosen so that the mutation rate for allele 2 ($\theta\pi_2$) is equal to the human average of about (The 1000 Genomes Project Consortium 2015) and the mutation rate for allele 1 ($\theta\pi_1$) is ten times that. When θ is small, the distribution is U-shaped and nearly symmetric, given that the sample is polymorphic. When θ is around one, the distribution becomes J-shaped (or L-shaped if ). When θ is large, the distribution has a peak around . Graphs of (not shown) display these same shapes, and will be very close to when n is large.
Relationship to infinite-sites frequency spectra
We use θ for the per-site mutation parameter. In a collection of L total sites at which (5) holds, the finite-sites version of the site-frequency spectrum (i.e. the expected number of sites with copies of allele 1 and copies of allele 2) is given by the product . Note, these expected numbers of sites do not depend on the rate of recombination, whereas the variances among sites and covariances between sites do (Kaplan and Hudson 1985).
Infinite-sites mutation models may be obtained as limits of finite-sites models as L tends to infinity with the total mutation parameter remaining finite. So when θ is small, we expect finite-sites results to be close to the usual (infinite-sites) predictions from the diffusion model (Ewens 1979, 2004) or the coalescent model (Fu 1995). Finite-sites models distinguish between kinds of mutations, subject to different mutation pressures, whereas infinite-sites models implicitly treat all mutations the same.
From Ewens (1979) equation (8.18) or Ewens (2004) equation (9.18)—see also Wright (1938) equation (16)—the expected number of sites segregating in the population with frequencies between x and under the infinite-sites model is proportional to . For comparison to (4) we may write
(6) |
for a single site (θ small) approximately under the standard infinite-sites mutation model. For comparison with (5), we have
(7) |
for the approximate single-site probability that there are type-1 alleles in a sample of size n. Equation (7) has the same form as the usual infinite-sites site-frequency spectrum (Fu 1995) but here it is for a specific mutant (allele 1) with a specific ancestral type (allele 2 in the two-allele model).
From (4) and (5) with θ small we have
$$\phi(x) \approx \theta\pi_1\pi_2\left(\frac{1}{x}+\frac{1}{1-x}\right) \tag{8}$$
and
$$p(n_1,n-n_1) \approx \theta\pi_1\pi_2\left(\frac{1}{n_1}+\frac{1}{n-n_1}\right) \tag{9}$$
for $1\le n_1\le n-1$. The diffusion result (4) does not admit atoms of probability at $x=0$ or $x=1$ (see section 10.7 of Ewens (2004) for discussion) but we can interpret (8) intuitively as follows. If θ is close to zero, most of the time the population will be fixed, containing only allele 1 with probability $\pi_1$ and only allele 2 with probability $\pi_2$. Mutants of type 2 and type 1 are introduced at rates proportional to $\theta\pi_2$ and $\theta\pi_1$ in these two backgrounds, respectively. Then the leading terms in (8) represent a mixture of two infinite-sites models like (6) with the constants of proportionality specified. Equation (9) has an identical interpretation, as a mixture of two infinite-sites site-frequency spectra. These are the key principles of the boundary mutation model (Vogl and Clemente 2012; Vogl et al. 2020).
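The small-θ limit of the sampling probability can be checked in a few lines. This is a sketch which assumes that (5) is the beta-binomial probability $\binom{n}{n_1}(\theta\pi_1)_{(n_1)}(\theta\pi_2)_{(n-n_1)}/\theta_{(n)}$, and which uses $a_{(m)} = a(a+1)\cdots(a+m-1) = a\,(m-1)!\,[1+O(a)]$ for $m\ge 1$:

```latex
\begin{align*}
\binom{n}{n_1}\frac{(\theta\pi_1)_{(n_1)}(\theta\pi_2)_{(n-n_1)}}{\theta_{(n)}}
&\approx \frac{n!}{n_1!\,(n-n_1)!}\cdot
  \frac{\theta\pi_1\,(n_1-1)!\;\theta\pi_2\,(n-n_1-1)!}{\theta\,(n-1)!} \\
&= \theta\pi_1\pi_2\,\frac{n}{n_1(n-n_1)} \\
&= \pi_2\,\frac{\theta\pi_1}{n_1} + \pi_1\,\frac{\theta\pi_2}{n-n_1},
\qquad 1\le n_1\le n-1 .
\end{align*}
```

Written this way, the first term is an infinite-sites spectrum for type-1 mutants weighted by the probability $\pi_2$ that the background is allele 2, and the second term is the mirror image for type-2 mutants in an allele-1 background.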
Although no closed-form expression like (1) is available except under parent-independent mutation, Burden and Tang (2016, 2017) have shown that the stationary densities for pairs of alleles under general mutation models take forms identical to (8) when θ is small; see equation (21) in Burden and Tang (2017). See also Schrempf and Hobolth (2017). Similarly from a coalescent analysis of general K-alleles mutation, Bhaskar et al. (2012) obtained leading order terms for sampling probabilities with forms identical to (9) when θ is small and samples contain just two alleles. For , the result from Theorem 1 of Bhaskar et al. (2012) is identical to (9).
Mutation and the frequencies of rare sample variants
Our goal here is to understand how the frequency spectra of rare variants depend on θ and on the number of mutation events in the ancestry of the sample under the standard neutral coalescent or diffusion model of population genetics which assumes constant population size (Ewens 2004). We first describe an ancestral process for the sample, then focus on rare variants in a large sample to obtain predictions about latent mutations.
A conditional ancestral process for rare variants
Here, we focus on ordered samples because the calculations are more intuitively related to the familiar rates of events in the ancestral coalescent process. The results do not depend on the order and so apply equally to ordered and unordered samples. Using the subscript “o” for ordered and writing $\theta_i$ in place of $\theta\pi_i$ to facilitate the calculations, we have
$$p_o(\mathbf{n}) = \frac{\prod_{i=1}^{K}(\theta_i)_{(n_i)}}{\theta_{(n)}} \tag{10}$$
which differs from the sampling probability in (3) only by the multinomial coefficient, or the number of ways a sample containing allele counts $\mathbf{n}=(n_1,\ldots,n_K)$ can be ordered.
Equation (10) is suggestive, as are (3) and (5), that the sampling structure of the $n_i$ copies of allele i may be related to the Ewens sampling formula (Ewens 1972). Specifically, from the fact that, for $n_i\ge 1$,
$$(\theta_i)_{(n_i)} = \sum_{k_i=1}^{n_i}\left|S_{n_i}^{(k_i)}\right|\theta_i^{k_i} \tag{11}$$
where $\theta_i=\theta\pi_i$ and $|S_{n_i}^{(k_i)}|$ is an (unsigned) Stirling number of the first kind, we might guess that there is a latent variable $k_i$ which is the number of mutations giving rise to the $n_i$ copies of allele i. As in the usual application of the Ewens sampling formula, in contrast to the total possible number of type-i mutations in the ancestry of the sample, these latent mutations are just those most recent ones which produced the observed alleles.
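That is, based on (10) and (11), we suppose that the joint probability of the sample counts $\mathbf{n}=(n_1,\ldots,n_K)$ and their numbers of latent mutations $\mathbf{k}=(k_1,\ldots,k_K)$ is given by
$$p_o(\mathbf{n},\mathbf{k}) = \frac{\prod_{i=1}^{K}\left|S_{n_i}^{(k_i)}\right|\theta_i^{k_i}}{\theta_{(n)}} \tag{12}$$
and therefore that the probability of $\mathbf{k}$ conditional on $\mathbf{n}$ is given by
$$P(\mathbf{k}\mid\mathbf{n}) = \prod_{i=1}^{K}\frac{\left|S_{n_i}^{(k_i)}\right|\theta_i^{k_i}}{(\theta_i)_{(n_i)}} \tag{13}$$
which applies to both ordered and unordered samples.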
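To make the proposed distribution concrete, the sketch below evaluates one factor of (13) for a single allele, assuming it has the Ewens form $|S_{n_i}^{(k_i)}|\,\theta_i^{k_i}/(\theta_i)_{(n_i)}$. Unsigned Stirling numbers of the first kind are built from the standard recurrence $c(m,k) = c(m-1,k-1) + (m-1)\,c(m-1,k)$. The function names are illustrative.

```python
def unsigned_stirling_row(n):
    """Return [c(n, 0), ..., c(n, n)], unsigned Stirling numbers of the first kind."""
    row = [1]  # c(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            keep = row[k] if k < m else 0  # c(m-1, m) = 0
            new[k] = row[k - 1] + (m - 1) * keep
        row = new
    return row


def latent_mutation_pmf(n_i, theta_i):
    """P(k latent mutations | n_i copies), k = 0..n_i, under the Ewens form."""
    c = unsigned_stirling_row(n_i)
    rising = 1.0
    for j in range(n_i):
        rising *= theta_i + j  # theta_i (theta_i + 1) ... (theta_i + n_i - 1)
    return [c[k] * theta_i ** k / rising for k in range(n_i + 1)]


pmf = latent_mutation_pmf(10, 0.3)
print(sum(pmf))   # probabilities over k sum to one
print(pmf[1])     # probability that all 10 copies trace to a single mutation
```

Exact integer Stirling numbers are fine for counts in the tens; for much larger counts a log-scale recursion would be the safer choice.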
We show that (13) is true using the ancestral-process approach of Griffiths and Tavaré (1994a, 1994b). If sampling probabilities like (3) or (10) are known, this approach can be used to describe the conditional ancestral process of a sample given its allelic types (Slade 2000a, 2000b; Fearnhead 2001, 2002; Stephens and Donnelly 2003; Baake and Bialowons 2008). Following our analysis of (13) for arbitrary , we describe a large-n approximation in which allele K is the overwhelmingly common type and 1 through are the rare variants.
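The conditional ancestral process has the same total rate of mutation and coalescence as the unconditional process, $n(\theta+n-1)/2$. Lineages which must be of type i in the sample experience type-i mutations at rate $n_i\theta_i/2$ and type-i coalescent events at rate $n_i(n_i-1)/2$, but with additional weights proportional to the probability of $\mathbf{n}$ given each event. All other events have rates equal to zero because the sample could not be $\mathbf{n}$ if they occurred. To obtain (13), we follow ancestral lineages only back to the first mutation event they experience. The probability of a type-i mutation event is
$$\frac{n_i\theta_i}{n(\theta+n-1)}\,\frac{p_o(\mathbf{n}-\mathbf{e}_i)}{p_o(\mathbf{n})} = \frac{n_i}{n}\,\frac{\theta_i}{\theta_i+n_i-1} \tag{14a}$$
and the probability of a type-i coalescent event is
$$\frac{n_i(n_i-1)}{n(\theta+n-1)}\,\frac{p_o(\mathbf{n}-\mathbf{e}_i)}{p_o(\mathbf{n})} = \frac{n_i}{n}\,\frac{n_i-1}{\theta_i+n_i-1} \tag{14b}$$
where $\mathbf{e}_i$ is the unit vector with a one in position i and we have used (10) to obtain the results on the right. Whether mutation or coalescence occurs, the number of type i lineages decreases by one: $n_i\to n_i-1$. This ancestral process continues until there are no un-mutated ancestral lineages, that is until $n_i=0$ for all $i$.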
To this we add a mutation counting process which starts with $k_i=0$ for all $i$ then has $k_i\to k_i+1$ whenever a mutation occurs on a type-i ancestral lineage. Equations (14a) and (14b) show that each event in the ancestral process includes two sub-events: a choice of the allelic type involved then a choice between mutation and coalescence. Depending on $\mathbf{n}$, the choices of allelic type will result in a random ordering of events among types. But for every ordering, the series of choices between mutation and coalescence within allelic type i depends only on $n_i$ (and $\theta_i$) and is independent of what happens in the ancestry of any other allele. The number of mutations of type i is the sum of Bernoulli random variables with success probabilities $\theta_i/(\theta_i+j-1)$ for j from $n_i$ down to 1. The number of latent mutations counted in this way will be distributed like the number of alleles in the Ewens sampling formula (see Arratia et al. (1992) and Arratia and Tavaré (1992)) with mutation parameter $\theta_i$ for allele i, and these counts will be independent among alleles as in (13).
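The Bernoulli representation can be turned into an exact calculation by convolving the individual trials. The sketch below assumes success probabilities $\theta_i/(\theta_i+j-1)$ for j from $n_i$ down to 1, as in the standard sequential construction of the Ewens sampling formula. The function name is illustrative.

```python
def latent_pmf_from_bernoullis(n_i, theta_i):
    """Distribution of the number of latent mutations among n_i type-i lineages.

    Going back in time with j un-mutated lineages remaining, the next event on
    them is a mutation with probability theta_i / (theta_i + j - 1) and a
    coalescence otherwise. Entry k of the result is P(k mutations).
    """
    pmf = [1.0]  # before any event: zero mutations with probability one
    for j in range(n_i, 0, -1):
        p_mut = theta_i / (theta_i + j - 1)
        new = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            new[k] += prob * (1.0 - p_mut)  # coalescence: count unchanged
            new[k + 1] += prob * p_mut      # mutation: count increases by one
        pmf = new
    return pmf


pmf = latent_pmf_from_bernoullis(3, 0.5)
print(pmf)  # pmf[0] is zero because the last lineage (j = 1) always mutates
```

The mean of this distribution is the sum of the success probabilities, so for small $\theta_i$ it is only slightly greater than one even when $n_i$ is in the tens.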
We use this conditional ancestral process below but here note its close relationship to models of lines of descent (Griffiths 1980; Watterson 1984). In particular, (13) is included in equation (3.3) and Theorem 4 of Donnelly (1986), who extended Watterson’s lines-of-descent model to the case of K-allele, parent-independent mutation. See also Donnelly and Tavaré (1987). Equation (3.3) in Donnelly (1986) in fact shows that if we were to keep track of the numbers of descendants of each latent mutation, the full Ewens sampling formula would give their distribution in the sample.
Before describing a large-n approximation for rare variants, we also note that latent mutations reckoned as in (13) include what Donnelly (1986) called ‘spurious mutations to one’s own type’ and Baake and Bialowons (2008) called ‘empty mutations’. These are a modeling artifact not only of parent-independent mutation models but of general mutation models as they are typically implemented (Jenkins and Song 2011; Bhaskar et al. 2012; Jenkins et al. 2014; Burden and Tang 2017; Burden and Griffiths 2019). Empty mutations have no empirical significance and should not be counted as mutations. To deal with them, we must keep track of the ancestral types of lineages when they experience mutations. We can do so using the identity
(15) |
which decomposes our previously generic type-i mutations according to their ancestral types . A mutation is empty when .
In our large-n approximation, we take K to be the overwhelmingly common allelic type in the sample and 1 through to be the rare variants. Our goal is to model latent mutations in the ancestry of the rare variants, so we use (15) only for . For the common allele K, we instead lump (14a) and (14b) together and record both mutation and coalescence as . Making these changes to (14a) and (14b), and again using (10) to simplify ratios of sampling probabilities, the conditional ancestral process for a sample with state jumps to state for with probability
(16a) |
to state for with probability
(16b) |
to state for with probability
(16c) |
and to state with probability
(16d) |
where we have used Kronecker’s delta to accommodate empty mutations, in (16a). Equation (16a) includes both empty and nonempty mutations, but only ones where the ancestral type is also rare. Nonempty mutations where the ancestral type is the common type K are in (16b). This classification of mutations by ancestral type does not change the probabilities of coalescence, so (16c) only differs from (14b) by the absence of type-K coalescent events which are now in (16d).
If is large compared to through , then . The probabilities in (16a) will be , those in (16b) and (16c) will be , and the one in (16d) will be . Empty mutations and other mutations with rare-variant ancestors will become negligible as grows for fixed through . Keeping only terms of and larger gives an approximate, large-n ancestral process with total rate and jumps, for , from state to state with probability
(17a) |
to state with probability
(17b) |
and to state with probability
(17c) |
This process is dominated by (17c), that is by events on lineages ancestral to the common allele K, which decrease the number of these but leave the counts of rare-allele lineages unchanged. Although we are not tracing the details of common-allele ancestry, we note that the overwhelming majority of these events will be coalescent events, since their rate is approximately equal to the total rate . The next most frequent will be empty mutation events at rate , followed by common-allele mutation events with rare-allele ancestors at rate .
When one of the rarer events occurs in the ancestral process, it involves allele i with probability , then is either a mutation event from a common allele as in (17a) or a coalescent event as in (17b). This process for the rare variants has the same form as that found for all variants and all mutations in (14a) and (14b). Then by the same logic as before, the number of (now nonempty) latent mutations in the ancestry of the rare variants will be distributed like the number of alleles in the Ewens sampling formula, independently and with mutation parameter $\theta_i$ for allele $i$. In addition, if we were to keep track of the counts of each mutation’s descendants among the $n_i$ copies of rare variant i in the sample, then because every pair of type-i lineages is equally likely to be the one which coalesces when a type-i coalescent event occurs, the distribution of these counts should be given by the full Ewens sampling formula (Ewens 1972; Kingman 1982; Donnelly 1986; Arratia and Tavaré 1992; Arratia et al. 1992, 2016).
The events involving the common allele in (17c) occur very quickly. But since only a fixed number of events involving rare alleles are required to resolve the ancestry of latent mutation and coalescence, the approximation remains accurate until all the rare-allele events have happened, if is large enough. In Appendix section “Time-dependent conditional ancestral process,” we study the joint distribution of the times of events among the rare alleles and the numbers of common-allele ancestors when these rare-allele events occur. Focusing on the case of two alleles for simplicity, if is the time back to the ith event involving the rare allele 1, we have
(18) |
which in either case tends to zero as tends to infinity. Further, if is the random number of type-2 ancestral lineages left at the ith event involving the rare allele 1, we have
(19) |
suggesting that, despite the rapid decrease of common-variant lineages, the approximation can hold until the entire ancestry of latent mutation and coalescence is resolved.
Even for the largest rare-variant site-frequency count considered in Seplyarskiy et al. (2021), there will still be >1200 common-variant lineages left on average at for the TOPMed data () and >400 left for the gnomAD data (). In section “Application to human SNP data,” we consider site-frequency counts up to 40 for synonymous exonic sites in gnomAD with many fewer SNPs but a larger sample size () and in this case there should be about 2780 common-variant lineages left at when the entire ancestry of latent mutation and coalescence among the rare variants is resolved.
In sum, rare alleles in a large sample will quickly coalesce and mutate. Their ancestors will be common alleles. If $K_i$ is the number of these latent mutations in the ancestry of allele $i\in\{1,\ldots,K-1\}$, then from the rates of mutation and coalescence in (17a) and (17b) we have
$$P(K_1=k_1,\ldots,K_{K-1}=k_{K-1}\mid\mathbf{n}) = \prod_{i=1}^{K-1}\frac{\left|S_{n_i}^{(k_i)}\right|\theta_i^{k_i}}{(\theta_i)_{(n_i)}} \tag{20}$$
Latent mutations of different rare variants are independent and distributed like the numbers of alleles in the Ewens sampling formula, each with its own mutation parameter.
Latent mutations and sample counts of rare alleles
Our goal in this section is to understand how predictions about the counts of rare variants, and hence about their site-frequency spectra, depend on the number of latent mutations and the mutation rate. In anticipation of “Application to human SNP data,” we focus on the marginal count of just one rare variant, which we arbitrarily call allele 1. From (20) we have
$$P(K_1=k_1\mid N_1=n_1) = \frac{\left|S_{n_1}^{(k_1)}\right|\theta_1^{k_1}}{(\theta_1)_{(n_1)}} \tag{21}$$
which we note holds for any K. Here we let $K=2$ for simplicity.
To understand how the mutation rate influences the count of a rare variant, we apply the result for ratios of gamma functions with a common large parameter, 6.1.47 in Abramowitz and Stegun (1964) or equation (1) in Tricomi and Erdélyi (1951), to the terms involving n in (5) to obtain
$$P(N_1=n_1) \approx \frac{\Gamma(\theta)}{\Gamma(\theta_2)}\,\frac{(\theta_1)_{(n_1)}}{n_1!}\,e^{-\theta_1\log n} \tag{22}$$
in which $\theta_i=\theta\pi_i$ and we have used $n^{-\theta_1}=e^{-\theta_1\log n}$ to make a connection with the underlying coalescent tree or gene genealogy. Specifically, $\theta_1\sum_{i=1}^{n-1}1/i$ is the expected number of type-1 mutations on the gene genealogy of a sample of size n, and for large n this is approximately equal to $\theta_1(\log n+\gamma)$ where $\gamma\approx 0.5772$ is Euler’s constant. In “Theory for nonconstant populations” we explore this connection in detail and explain the additional constants of proportionality in (22) after finding an analogous result for the general coalescent trees of Griffiths and Tavaré (1998).
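The gamma-ratio step can be checked numerically. The sketch below assumes (5) is the beta-binomial probability, so that the terms involving n are $\Gamma(n+1)\,\Gamma(n-n_1+\theta_2)\,/\,[\Gamma(n-n_1+1)\,\Gamma(n+\theta)]$ with $\theta=\theta_1+\theta_2$, and compares this factor with the power law $n^{-\theta_1}$. The function names and the parameter values are illustrative assumptions.

```python
from math import exp, lgamma, log


def n_dependent_factor(n, n1, theta1, theta2):
    """Exact n-dependent part of the two-allele sampling probability."""
    theta = theta1 + theta2
    return exp(lgamma(n + 1) + lgamma(n - n1 + theta2)
               - lgamma(n - n1 + 1) - lgamma(n + theta))


def power_law_approx(n, theta1):
    """Large-n approximation n**(-theta1), written as exp(-theta1 * log n)."""
    return exp(-theta1 * log(n))


for n in (20, 1000, 100000):
    ratio = n_dependent_factor(n, 3, 0.1, 0.01) / power_law_approx(n, 0.1)
    print(n, ratio)  # the ratio approaches one as n grows with n1 fixed
```

The relative error is of order $n_1/n$, which is why the approximation is aimed at variants in small counts in very large samples.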
Site-frequency spectra are typically defined as the proportion of segregating sites in each possible count in the sample (Braverman et al. 1995) or equivalently as the probability that a single mutation is in each possible count given that it is polymorphic in the sample (Griffiths and Tavaré 1998; Nielsen 2000). So, to understand how $P(N_1=n_1)$ depends on $\theta_1$, we may ignore the constants of proportionality in (22) and focus on
$$P(N_1=n_1) \propto \frac{(\theta_1)_{(n_1)}}{n_1!} \tag{23}$$
Then using (23) together with (21), we have
$$P(N_1=n_1\mid K_1=k_1) \propto \frac{\left|S_{n_1}^{(k_1)}\right|}{n_1!} \tag{24}$$
for the dependence of the rare-variant count, $N_1$, on the number of latent mutations, $K_1$, relevant to the site-frequency spectrum. Figure 2 shows site-frequency spectra computed using (23) and (24), and conditioning on the event that .
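A spectrum conditional on the number of latent mutations can be sketched as follows, assuming the weight for count $n_1$ given $k_1$ latent mutations is proportional to $|S_{n_1}^{(k_1)}|/n_1!$. That ratio is the probability that a uniformly random permutation of $n_1$ items has $k_1$ cycles, so it can be computed by a stable recursion without large integers. The truncation at `max_count` is an assumption made here for illustration, and the function name is hypothetical.

```python
def sfs_given_latent(k, max_count):
    """Relative weights of counts n1 = 1..max_count given k latent mutations.

    Uses q(m, j) = c(m, j) / m!, with c the unsigned Stirling number of the
    first kind, via q(m, j) = (q(m-1, j-1) + (m-1) * q(m-1, j)) / m.
    Requires max_count >= k so that at least one weight is positive.
    """
    q = [1.0] + [0.0] * k  # row m = 0: q(0, 0) = 1
    weights = []
    for m in range(1, max_count + 1):
        new = [0.0] * (k + 1)
        for j in range(1, k + 1):
            new[j] = (q[j - 1] + (m - 1) * q[j]) / m
        q = new
        weights.append(q[k])  # zero whenever m < k
    total = sum(weights)
    return [w / total for w in weights]


one = sfs_given_latent(1, 40)
two = sfs_given_latent(2, 40)
print(one[0] / one[1])  # with one mutation the weights fall off as 1/n1
print(two[0])           # with two mutations a count of one is impossible
```

With k = 1 the usual 1/n1 spectrum is recovered, and larger k shifts weight away from the lowest counts.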
Figure 2a shows the dependence on the number of latent mutations. When all copies descend from a single mutation ($K_1=1$), the usual predictions from the infinite-sites model hold. Thus if we put $k_1=1$ in (24), then because $|S_{n_1}^{(1)}|=(n_1-1)!$, consistent with (7) we have
$$P(N_1=n_1\mid K_1=1) \propto \frac{1}{n_1}$$
The total number of such sites will depend on , and in general on the factor in (21) for larger numbers of latent mutations. But conditional on , the site-frequency counts for a rare variant do not depend on θ, at least to leading order in the sample size n. If there are mutations in the ancestry of the rare variant, then cannot be less than . This is shown in Fig. 2a for to . A key effect of recurrent mutation is to give relatively less weight to low site-frequency counts, as found previously by Jenkins and Song (2011).
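Using (21) and (23) the joint distribution of $K_1$ and $N_1$ obeys
$$P(N_1=n_1,K_1=k_1) \propto \frac{\left|S_{n_1}^{(k_1)}\right|\theta_1^{k_1}}{n_1!} \tag{25}$$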
which can be compared to the results of Jenkins and Song (2011). With fixed and large n in our model, all mutations in the ancestry of the rare variant will be non-nested mutations; note this also follows from (18) in Jenkins and Song (2011). Adapting the notation of Jenkins and Song (2011) in which is the event that the copies of allele 1 are due to two non-nested mutations, both from allele to allele 1, their (21) becomes
for large (and small θ), which is identical to (25) if .
Numerical computations (not shown) using the unnumbered equation below (10) in Jenkins and Song (2011), which holds for any θ, reproduce the case of shown in Fig. 2a when n is large. This is evident in Figure 3 of Jenkins and Song (2011) for the quantity . These computations are difficult for samples beyond the hundreds. Our results for could potentially also be compared to the results of Bhaskar et al. (2012) using their Theorem 3 and summing appropriately.
Figure 2b shows how the site-frequency counts of the rare variant depend on the mutation parameter of that variant, . Although Fig. 2a shows a dramatic effect of on the site-frequency counts, Fig. 2b suggests that large values of are unlikely. This is evident from (21) and (25) in that each additional mutation results in an additional factor of . Note that the smallest value of in Fig. 2b is already more than twice the human average. From (23), we have
which is consistent with (9) in the case where allele 1 is rare in a large sample. Thus, when is small ( and in Fig. 2b) the site-frequency spectrum under recurrent mutation is very close to the standard infinite-sites model predictions. When is large ( in Fig. 2b) the site-frequency spectrum under recurrent mutation is noticeably different, with a dearth of low-frequency variants and corresponding excesses at higher frequencies. Figure 2b plots site frequencies on a scale to better illustrate differences, especially at higher frequencies.
Theory for nonconstant populations
Here we extend our analysis to populations which deviate from the standard neutral site-frequency predictions. We have in mind populations which have changed in size, although other applications may be possible. Here gene genealogies are the general coalescent trees of Griffiths and Tavaré (1998), which have the same branching structure of standard coalescent trees but may have different distributions of coalescence times.
Equation (21) suggests another way to model both the number of copies () of a variant of interest and the corresponding count of latent mutations () when the variant is rare in a large sample. Arratia et al. (1992) proved that when the sample size tends to infinity, the numbers of alleles in small counts in the Ewens distribution converge to independent Poisson random variables with expected values . Note that is the usual expected site-frequency count of mutants in i copies in the sample under the standard neutral model of a large constant-size population. A seminal result of Watterson (1974b) is that the numbers and counts of mutations in a sample from such a multi-type Poisson distribution conform to the Ewens sampling formula when conditioned on their total size. So we may interpret (21) and other findings in the previous section within this independent-Poissons sampling framework.
This is exactly the approach in the Supplementary Materials of Seplyarskiy et al. (2021). Again, human SNP data strongly reject the standard neutral model with site-frequencies , owing largely to the great excess of singletons and other rare variants due to our recent growth (Keinan and Clark 2012; Gazave et al. 2014). So we replace with , where is the total length of branches with i descendants in the gene genealogy of a sample. For an extension of independent-Poissons sampling to variants under selection, see Desai and Plotkin (2008). Our notation is different than in Seplyarskiy et al. (2021) because here we use the coalescent or diffusion time scale.
Under the standard neutral coalescent model, . For the general coalescent trees of Griffiths and Tavaré (1998), can be expressed in terms of the coalescent intervals, , which are the lengths of time when there were lineages in the ancestry of the sample. In particular,
(26) |
(Fu 1995; Griffiths and Tavaré 1998).
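The relation cited for (26) can be sketched numerically. The block below is a minimal sketch, not the authors' code: it assumes the standard form of the Fu (1995) and Griffiths and Tavaré (1998) result, in which the expected total length of branches with i descendants is the sum over k of k times the probability that a branch present when there are k lineages has i descendants, times the expected coalescent interval. The function name, the dictionary layout and the time scale (pairwise coalescence rate equal to 1) are assumptions made for illustration.

```python
from math import comb

def expected_branch_lengths(n, expected_intervals):
    """Expected total length of branches with i descendants, for i = 1..n-1.

    expected_intervals[k] is the expected time during which the sample has
    k ancestral lineages (k = 2..n). The weight
    comb(n-i-1, k-2) / comb(n-1, k-1) is the probability that one of the
    k lineages has exactly i descendants in a sample of size n.
    """
    lengths = {}
    for i in range(1, n):
        lengths[i] = sum(
            k * comb(n - i - 1, k - 2) / comb(n - 1, k - 1) * expected_intervals[k]
            for k in range(2, n - i + 2)
        )
    return lengths

# Standard neutral coalescent, assumed time scale: E[T_k] = 2 / (k (k - 1)).
n = 20
standard = {k: 2.0 / (k * (k - 1)) for k in range(2, n + 1)}
lengths = expected_branch_lengths(n, standard)
```

With these assumed intervals the result reduces to 2/i for every i, which is the familiar constant-size site-frequency prediction up to the factor of the mutation parameter. Any other set of expected intervals, for example from a growth model, can be passed in the same way.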
Watterson (1974b) studied three models. In Model 1, using our notation, mutations arise from a constant source at rate θ, then propagate or go extinct independently according to a critical branching process, i.e. with birth rate equal to death rate as for a neutral mutation. The number of mutations in count i has expected value , for a constant which converges to 1 as the duration of the process increases. Watterson (1974b) proved that the numbers and counts of mutations follow the Ewens sampling formula when conditioned on their total size, which for Watterson (1974b) was equivalent to the population size. Models 2 and 3 are the Moran model and the Wright-Fisher model (Fisher 1930b; Wright 1931; Moran 1958, 1962) and Watterson (1974b) proved that these have the same limit as Model 1 when the population size is large.
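Watterson's conditioning result can be checked by brute force for small totals. The following is an illustrative sketch only: it takes independent Poisson counts with means θ/i, conditions on the total size, and compares the result with the Ewens sampling formula over all partitions. The function names and the parameter values are hypothetical choices for the example.

```python
from math import exp, factorial, prod

def partitions(n, max_part=None):
    """Yield partitions of n as dicts {part size: multiplicity}."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield {}
        return
    for size in range(min(n, max_part), 0, -1):
        for mult in range(n // size, 0, -1):
            for rest in partitions(n - size * mult, size - 1):
                yield {size: mult, **rest}

def poisson_pmf(a, mean):
    return exp(-mean) * mean**a / factorial(a)

def conditioned_poissons(n, theta):
    """P(configuration | total size n) for independent K_i ~ Poisson(theta / i)."""
    weights = {}
    for cfg in partitions(n):
        key = tuple(sorted(cfg.items()))
        weights[key] = prod(
            poisson_pmf(cfg.get(i, 0), theta / i) for i in range(1, n + 1)
        )
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}

def ewens(cfg, n, theta):
    """Ewens sampling formula for cfg given as ((size, multiplicity), ...)."""
    rising = prod(theta + j for j in range(n))
    return factorial(n) / rising * prod(
        (theta / i) ** a / factorial(a) for i, a in cfg
    )

def max_abs_difference(n, theta):
    cond = conditioned_poissons(n, theta)
    return max(abs(p - ewens(cfg, n, theta)) for cfg, p in cond.items())
```

For example, `max_abs_difference(6, 0.7)` is zero up to floating-point error. Enumerating partitions is only practical for small totals, which is all this check requires.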
Model 1 is an example of a logarithmic species distribution (Fisher 1943; Watterson 1974a; Arratia et al. 2003; Lambert 2011). Branching processes have also been used to describe and infer the ages of rare alleles (Rannala and Slatkin 1997; Slatkin and Rannala 2000; Wiuf 2000); for recent developments and a review, see Crespo et al. (2021). Slatkin (2000) used this approach and an extension of Griffiths and Tavaré (1998) to model the ages of rare alleles in a large sample. Champagnat and Lambert (2012, 2013) studied the convergence of population frequencies of alleles for supercritical, subcritical or critical branching processes. All of these works assume that each allele traces back to a single mutation, as under the infinite-alleles mutation model.
Our approach to modeling recurrent mutation follows Watterson's (1974b) treatment of Model 1. Whereas Watterson (1974b) did not specify the source of mutations, here we take it to be the production of rare variants by mutation from a common variant on the gene genealogy of a large sample. What for Watterson (1974b) was the total population size is for us the total count of a rare variant. Allele 1 is our nominal variant of interest, but for simplicity for the moment, we use n, k and θ in place of , and . As a further notational convenience, we define
so that is the expected number of mutations with count i in this independent-Poissons sampling model.
Let be the numbers of latent mutations of the variant of interest with counts . We assume that and that and are independent for . Their joint distribution is then
(27) |
with . The total sample size is what would set the upper limits of the product and the sum above, but we leave these unspecified for now, only imagining that the total sample size is much larger than the sample count of the variant of interest, so we can model the latter without restriction.
We are only concerned with for , where b is the largest rare-variant count. Thus, the assumption of independence in (27), which is equivalent to there being no nested mutations in the ancestry of a rare variant, will only need to be true for with . In Appendix section “Low-count branches of general coalescent trees” we prove that this holds for the trees of Griffiths and Tavaré (1998) for fixed b in the limit as the total sample size tends to infinity, and that the counts converge to independent Poisson random variables as with expected values . A condition is that the total height of the genealogy is finite, which is a mild assumption ruling out pathological situations such as populations whose sizes increase too quickly backward in time.
The count of the variant of interest is and its number of latent mutations is . Following Watterson (1974b), we consider the probability generating function of n and k, which in the present case simplifies to
For the details of this derivation, see (A29) in the Appendix. The coefficient of (and ) can be found using
(28) |
where the sum is over
for , and with
Returning to our notation in which is the number of copies of a variant of interest, its number of latent mutations, its mutation parameter, and n is the total sample size, and further using τ to show the new dependence on the vector of expected times , we have
(29) |
which is nonzero for and . The sum over here is the same as in (28). It is equivalent to summing over partitions of the integers 1 through into subsets, where the sizes of the subsets are .
It is convenient to decompose (29) as follows. The number of type-1 mutations is Poisson distributed
(30) |
with parameter equal to the expected number of type-1 mutations on the gene genealogy of the sample. Conditional on this, the distribution of the number of times allele 1 appears in the sample is given by
(31) |
which depends on the relative expected branch lengths but does not depend on θ or .
Alternatively, can be computed by summing (29) appropriately, over . Then
(32) |
can be used to estimate the number of independent mutations which produced the observed copies of a rare allele.
The sum over in (31) and (29) is straightforward to evaluate but will become impractical if and become too large. In what follows, we consider mutations at each site. Equation (30) suggests that this will be accurate up to about three expected mutations per site, because the probability of greater than 7 is just over 1% when . As in Fig. 2, the largest value of we consider is 40. These are not the upper limits of feasibility; it takes two minutes to evaluate (31) for all and in Mathematica version 11.2 (Wolfram Research, Inc. 2017) on a mid-2015 MacBook Pro.
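The truncation argument can be checked directly from the Poisson distribution. This sketch assumes, from the surrounding sentence, that the cutoff is 7 mutations per site and that the elided parameter value is a mean of 3; the function name is hypothetical.

```python
from math import exp, factorial

def poisson_tail(mean, k_max):
    """P(K > k_max) for K ~ Poisson(mean)."""
    head = sum(exp(-mean) * mean**k / factorial(k) for k in range(k_max + 1))
    return 1.0 - head

# Assumed values: mean of 3 expected mutations per site, cutoff at 7.
tail = poisson_tail(3.0, 7)
```

Under these assumptions the neglected probability is a little above 0.01, in line with the statement that truncating at 7 mutations is accurate up to about three expected mutations per site.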
Considering the first three possible values of in (31),
(33) |
(34) |
(35) |
Equation (33) says simply that if there are no type-1 mutations on the gene genealogy then no copies of allele-1 will be observed. Equation (34) is the familiar result for the site-frequency spectrum, that it is given by the proportion of branches in the tree that have descendants. Equation (35) extends this to two mutations and emphasizes that mutations in the ancestry of a rare allele will be non-nested when n is large.
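One way to compute these conditional probabilities, equivalent in the independent-Poissons model to summing over partitions, is by repeated convolution: conditional on k latent mutations, each mutation independently falls on a branch with i descendants with probability equal to the relative expected length of such branches, and the total count is the sum of the k branch sizes. The sketch below is an illustration under that reading of (30) to (32), not the Mathematica code mentioned in the text. The function names are hypothetical, and the example values (relative lengths 0.64, 0.13, 0.06 for counts 1 to 3 and an expected 2.4 mutations) are borrowed from different parts of the text purely for illustration.

```python
from math import exp, factorial

def count_given_k(p, k_max, n_max):
    """Return cond[k][n] = P(total count n | k latent mutations).

    p[i-1] is the assumed fraction of total expected branch length with
    i descendants. Only counts up to n_max are tracked, so p may be a
    truncated list that does not sum to 1.
    """
    step = [0.0] * (n_max + 1)
    for i, frac in enumerate(p[:n_max], start=1):
        step[i] = frac
    cond = [[0.0] * (n_max + 1) for _ in range(k_max + 1)]
    cond[0][0] = 1.0  # no mutations means no copies, as in (33)
    for k in range(1, k_max + 1):
        for a in range(n_max + 1):
            if cond[k - 1][a] == 0.0:
                continue
            for i in range(1, n_max - a + 1):
                cond[k][a + i] += cond[k - 1][a] * step[i]
    return cond

def posterior_k_given_n(p, mutrate, n_max):
    """Return post[n][k] = P(k latent mutations | count n) for n = 1..n_max."""
    cond = count_given_k(p, n_max, n_max)  # k cannot exceed n
    prior = [exp(-mutrate) * mutrate**k / factorial(k) for k in range(n_max + 1)]
    post = {}
    for n in range(1, n_max + 1):
        joint = [prior[k] * cond[k][n] for k in range(n_max + 1)]
        total = sum(joint)
        post[n] = [x / total for x in joint] if total > 0 else joint
    return post

p_example = [0.64, 0.13, 0.06]  # illustrative relative branch lengths
cond = count_given_k(p_example, 3, 5)
post = posterior_k_given_n(p_example, 2.4, 5)
```

In this example `cond[1][2]` is simply the doubleton fraction, `cond[2][2]` is the square of the singleton fraction (two non-nested singleton mutations), and `post[1][1]` equals 1 because a variant seen once must come from one mutation.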
For the constant-size model, we find new approximations
(36) |
(37) |
(38) |
corresponding to (23), (24) and (25), respectively, in which the condition should be taken to hold for all . Figure 2 is unchanged if (36) and (37) are used instead of (23) and (24). Also, the conditional probability of given from (36) and (38) is identical to (21).
Relation to K-alleles diffusion results
From a gene-genealogical point of view, (36) is the probability of seeing total copies of a rare variant when a random number of type-1 mutations occurs on the low-count branches of a standard neutral coalescent tree. However, the type of the common variant and the ancestral states of these mutations are not specified in the independent-Poissons model. Of course these should be allele K, as in “A conditional ancestral process for rare variants,” but (36) does not include this event. In contrast, the sampling probabilities (3) and (5) from the equilibrium diffusion model specify the types of the entire sample. Implicitly, they average over the ancestral states of the sample. Here we focus on and show how (36) is related to (5) when n is large, in particular to the leading order term in the expansion (22).
The type of the common ancestor of the entire sample, at the root of the coalescent tree, is allele 2 with probability . If this were the case, allele 2 would be the ancestral source of the low-count type-1 mutations. But if θ is not very small, it is possible for allele 2 to be the ancestral source of these mutations even if the common ancestor is type 1. To illustrate, dividing either (5) or (22) by (36) and letting gives
(39) |
Indeed when θ is small, (22) is close to (36) times . But the error of this, even as n tends to infinity, may be appreciable for larger values of θ. The additional probability of order in (39) is consistent with the possibility that the root of the coalescent tree is type 1 and there are two type-2 mutations, one on each of the two branches descending from the root.
A better guarantee that allele 2 is the ancestral source of low-count mutations would be to specify it not as the type of the single most recent common ancestor but rather as the type of the pair of ancestors at the first time in the past when there were two ancestral lineages. Equation (5), with sample size equal to two, gives the relevant probability. This accounts for both possible states at the root of the tree as well as for mutation during the deepest coalescent interval, in (26). Then the independent-Poissons model could be applied to the remainder of the tree, i.e. to coalescent intervals through .
Because latent mutations of rare variants tend to be very recent, cf. (18) and (19), we may extend this logic to the first time in the past when there were r ancestral lines of the sample, for an arbitrary . The probability that these are all of type 2 is given by the diffusion result (5) with sample size r. The probability of seeing copies of the rare variant is given by an appropriately adjusted independent-Poissons model, covering coalescent intervals through . By summing (26) only over it can be shown that the total length of branches with i descendants in this more recent part of the gene genealogy differs only by from the full result . The product of these two probabilities is
(40) |
which can be compared to the leading order term in (22).
As expected from (39), if (40) reduces to (36) times . Now dividing (5) or (22) by (40) and letting gives
(41) |
as a measure of how well this augmented independent-Poissons model approximates the equilibrium diffusion result, depending on r and θ. Expanding (41) around , because we do not in fact expect the per-site mutation parameter to be large, gives
(42) |
where the π in is the usual constant (not our ). The parenthetic term in (42) tends to zero quickly as r increases. It is equal to the trigamma function for ; see 6.4.2 and 6.4.3 in Abramowitz and Stegun (1964). Even just taking instead of cuts the error by about 60%.
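The size of this reduction can be illustrated numerically. The sketch below assumes that the parenthetic term has the form π²/6 minus the sum of 1/j² for j from 1 to r − 1, which is the series form of the trigamma function at integer argument r given in Abramowitz and Stegun (1964), and that the comparison in the last sentence is between r = 2 and r = 1. Both are assumptions, since the expressions themselves are not reproduced here; the function name is hypothetical.

```python
from math import pi

def parenthetic_term(r):
    """Assumed form: pi^2/6 minus the partial sum of 1/j^2 for j < r.

    For integer r >= 1 this equals the trigamma function evaluated at r.
    """
    return pi**2 / 6 - sum(1.0 / j**2 for j in range(1, r))

# Relative reduction of the term when going from r = 1 to r = 2.
reduction = 1.0 - parenthetic_term(2) / parenthetic_term(1)
```

Under these assumptions the reduction is 6/π², or roughly 61%, and the term continues to shrink approximately like 1/r for larger r.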
Similar conclusions may be drawn from the large-r expansion of (41), which gives . Again is the largest-order effect of mutation. The event that a pair of mutations occurs on the two lineages descending from the root of the coalescent tree is non-negligible in the constant-size population model, even as and even for the entire population, because ancient coalescence times tend to be long. But the chance of this event will be small for most eukaryote species as θ ranges from about to with typical values around (Leffler et al. 2012). Based on our estimates in the next section, even the fastest-mutating sites in the human genome have . Note that this event will be even less likely in growing populations, because in this case the deepest coalescence times will be relatively short, but it could be an important phenomenon for populations which were much larger in the past.
Theoretical example and data application
Here we illustrate the theoretical and empirical use of (30) and (31). First we describe the consequences of recurrent mutation in an exponentially growing population compared to those in a population of constant size. Second we explore an entirely empirical application to human SNP data, which suggests that disparate site-frequency spectra may be explained by differences in mutation rate (and thus recurrent mutation).
Note that if estimates of the expected fraction of the gene genealogy comprised of branches with i descendants, that is
(43) |
are available, then can be computed using (31). In addition, for any estimated or supposed values of the expected number of mutations on the gene genealogy,
(44) |
the joint distribution of the number of latent mutations, , and their total count, , is the product of (30) and (31).
An exponentially growing population
Consider the simple model of pure exponential growth which has been the subject of a number of studies (Slatkin and Hudson 1991; Griffiths and Tavaré 1998; Polanski and Kimmel 2003; Chen and Chen 2013; Polanski et al. 2017): a population which has reached its current (haploid) size by exponential growth at rate r per generation. On the coalescent time scale of generations, looking backward in time and setting ,
(45) |
gives the population size at time t in the past. This model is unrealistic because the past population size approaches zero, but it can be taken as a rough approximation for recent dramatic growth. For instance, a population of current size with a generation time of 30 years and , would have . About 40,000 years ago, it would have had size , and using equation (7) in Slatkin and Hudson (1991) the pairwise coalescence time would be about 57,000 years.
The expectation can be computed from (26) if the expected coalescent intervals are known. We use the large-n results of Chen and Chen (2013) for (our notation) to obtain a simple approximation for . With the time scale and notation here, equation (11) in Chen and Chen (2013) gives
(46) |
as a large-n approximation for the cumulative expected time for the number of ancestral lineages of the sample to decrease from n to k. Writing (46) as a continuous function of ,
(47) |
we approximate the expected coalescent interval as
(48) |
Note that while (48) is a large-n approximation, it allows that β might be of the same order of magnitude as n. Applying the same approximation to the combinatorial coefficient in (26) gives
(49) |
Finally, we approximate the sum in (26) with the integral
(50) |
(51) |
which can be evaluated efficiently either as (51), in terms of the hypergeometric function, or as the integral (50). Slatkin and Hudson (1991) and others have observed that gene genealogies under very fast exponential growth are close to star trees. Using either (50) or (51) we have
(52) |
as increases. From the term in (52), we confirm the star-tree prediction that under extreme growth essentially all variants will be singletons.
These results for exponentially growing populations, derived here using a coalescent approach, are identical in form to some results for “Luria-Delbrück distributions,” especially in application to cancer, derived using forward-time birth-death or branching processes (Luria and Delbrück 1943; Lea and Coulson 1949; Durrett 2013, 2015; Kessler and Levine 2013; Ohtsuki and Innan 2017; Cheek and Antal 2018; Gunnarsson et al. 2021; Poon et al. 2021). In particular, (50) has the same form as the approximation in equation (4) of Ohtsuki and Innan (2017) and as equation (33) in Gunnarsson et al. (2021). Equation (52) has the same form as the expression in Theorem 2 in Durrett (2013) if only the leading-order term is kept in (52) in the case .
Figure 3 shows the same quantities as Fig. 2 but for the pure exponential growth model with and . The value was chosen to roughly reproduce the ratio of singletons to doubletons observed for low-rate sites in the gnomAD data in section “Application to human SNP data.” Figure 3a is directly comparable to Fig. 2a, the only difference being whether or comes from (51). As Fig. 3a shows, recent rapid growth produces a single-mutation (, blue line) site-frequency spectrum with an excess of rare variants and a deficit of common variants. So, compared to the constant-size case in Fig. 2a, there is a diminished tendency to observe high-frequency variants when the number of latent mutations is larger, and a stronger tendency for the site-frequency count () to be equal to or close to the number of latent mutations.
To make Fig. 3b comparable to Fig. 2b, we used (44) with and to compute the corresponding expected numbers of mutations on the gene genealogy for the three values of in Fig. 2. The resulting expected numbers of mutations were , and , the last being about equal to the value for the highest-rate sites in the gnomAD data in section “Application to human SNP data.” We then computed by averaging (31) over the distribution (30). Similar to Fig. 2b, the two smaller values of the mutation rate give nearly indistinguishable results for the total count . But there is a dramatic difference for the largest mutation rate. In Fig. 2b the prediction is distinctly L-shaped and thus similar to that for the lowest mutation rate, which again is 100-fold lower. In contrast, in Fig. 3b singletons have a much lower chance of being observed. In fact, doubletons are slightly more likely than singletons. This relative excess of doubletons is due to the fact that, when there are two latent mutations, these are much more likely to produce two copies of the variant under growth (Fig. 3a) than under constant size (Fig. 2a).
It is also of interest to know how the number of latent mutations in the ancestry of a rare variant depends on its count. Figure 4 depicts this for a series of increasing counts , from 1 to 16. Figure 4a shows the results for constant size, Fig. 4b the corresponding results for pure exponential growth. The expected number of mutations on the gene genealogy is 2.4 in both cases. Regardless of the demography, if only one copy of the variant is observed, it must be due to one mutation. Otherwise, the results differ greatly for constant size versus growth. Under constant size, a variant observed multiple times in the sample can easily be due to a single mutation. Under growth, higher variant counts are more likely due to multiple mutations.
Application to human SNP data
We also used (30) and (31) to account for latent mutations in the ancestry of rare variants in a subset of the gnomAD data (Karczewski et al. 2020). We took the approach described in the Supplementary Materials of Seplyarskiy et al. (2021), specifically obtaining estimates of relative branch lengths (43) from the data at low-rate sites, then using our new analytical result (31) to average over mutation counts. Rather than categorizing variants by trinucleotide context as in Seplyarskiy et al. (2021), we analyzed data from gnomAD version v2.1.1, presorted into 109 bins based on estimates of mutation rate by the Roulette method of Seplyarskiy et al. (2022) which incorporates information from the six flanking bases on either side of a SNP, strand asymmetry, expression level, methylation and promoter status. We did not use this information but simply assumed that variants within a bin all have the same mutation rate.
The data consist of variant counts for synonymous mutations in the exomes of about 57K non-Finnish Europeans. Thus K, although this varied by about 2% among sites because we required that sites were successfully genotyped in a minimum of 112K chromosomes. Importantly for our application, the data include monomorphic sites, i.e. sites with variant count equal to zero. Because gnomAD only provides n for polymorphic sites, we imputed n for monomorphic sites using the nearest value at a polymorphic site within 100 bp on either side of the focal site. After filtering for sequencing quality and coverage as well as removing mutation rate bins with fewer than 100 observed mutations, there are a total of 12,338,176 sites in 97 bins and 834,486 of these are polymorphic.
Figure 5a shows the total numbers of sites and the numbers of monomorphic sites in each bin. The great majority of sites are in bins 1 through roughly 20. These have low mutation rates, as indicated by their nearly equal numbers of total sites and monomorphic sites. The widening gap between the total number of sites and the number of monomorphic sites reflects the fact that higher-number bins have larger mutation rates.
For each bin, the data are the numbers of sites where a variant is observed in each possible count in the sample. As in “Latent mutations and sample counts of rare alleles,” these are marginal with respect to other possible variants at the site. Sites with two (resp. three) rare variants appear twice (resp. three times) in the data, once for each rare variant. These will likely be in different bins given the fine substructure of mutation rate variation (Seplyarskiy et al. 2021, 2022). Although bins contain mixtures of different sequence contexts and different nucleotide substitutions, for our purposes sites within a bin are all of the same type because they all have the same mutation rate.
Let be the number of sites in a given bin where i copies of the variant are observed in the sample. If a bin contains L total sites, then with reference to the notation in (2) we may write
(53) |
Thus we use a simplified notation here, with i in place of to avoid the additional subscript when we apply the results of the previous sections. In addition we use “mutrate” to refer to the estimate of the expected number of latent mutations per site for a given bin, i.e. for sites in that bin, as this is the rate parameter in the Poisson distribution (30).
We used (30) and the proportion of monomorphic sites, , to estimate this “mutrate” for each bin, specifically as . Figure 5b plots these estimates across bins, on a log scale. They range from for bin 1 to for bin 97, with a mean of , taking the proportion of sites in each bin into account. Most sites have mutation rates on the lower side: bins 1 through 5 contain about 47% of all sites, bins 1 through 19 about 95%, and bins 60 through 97 contain only about 2% of sites. Overall, rates vary 230-fold from lowest to highest. Assuming that the average estimated mutrate of corresponds to the genome average mutation rate per site, for which the usual estimate of θ from pairwise differences is about , we can infer that the expected number of mutations between a pair of (haploid) genomes is about for the slowest sites and about for the fastest sites.
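The estimator described here follows from the Poisson form of (30): the probability of zero latent mutations at a site is the exponential of minus the mutrate, so the mutrate can be estimated as minus the logarithm of the proportion of monomorphic sites. The sketch below assumes that reading. The function name is hypothetical, and the pooled example uses the total site counts quoted earlier across all bins, so its value is an illustration and not one of the per-bin estimates reported in the text.

```python
from math import log

def mutrate_from_monomorphic(total_sites, monomorphic_sites):
    """Estimate the expected number of latent mutations per site.

    Assumes P(no mutation) = exp(-mutrate), so
    mutrate = -log(monomorphic_sites / total_sites).
    """
    if not 0 < monomorphic_sites <= total_sites:
        raise ValueError("need 0 < monomorphic_sites <= total_sites")
    return -log(monomorphic_sites / total_sites)

# Pooled illustration from the totals quoted in the text:
# 12,338,176 sites, of which 834,486 are polymorphic.
pooled = mutrate_from_monomorphic(12_338_176, 12_338_176 - 834_486)
```

Applied bin by bin, the same function would give one estimate per mutation-rate bin. The pooled value mixes bins with very different rates and is shown only to make the arithmetic concrete.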
We compared observed and expected site-frequency counts for each bin based on an empirical fit of our model. First, we used (30) with the estimated mutrate for each bin to compute probabilities of latent mutations. Then from (34) and the fact that polymorphisms at sites with very low mutation rates likely have just one latent mutation, we used the combined data for bins 1 through 5 to estimate directly as for . Our estimates of the mutrate for bins 1–5 range from to with an average of , which we note is somewhat less than the smallest mutation rate in Figures 2 and 3. We assumed that this estimated from bins 1–5 holds for all bins. Finally, we computed the expectations , for in each bin, multiplying the probabilities of counts obtained using (30) and (31) by the total number of sites in the bin, cf. (53).
The upper three panels of Fig. 6 show the observed and expected variant counts, for , for bins 9, 50 and 92, chosen to represent a low-rate bin, a middle-rate bin and a high-rate bin. Figure A2 in the Appendix gives the plots for all 97 bins. In making these plots, we grouped variant counts for which . For bin 50 for example, this was true of variant counts as depicted in Fig. 6b and in the 50th panel of Fig. A2. The mutrate values in these plots are again the estimates of the expected number of mutations per site on the gene genealogy, , for each bin.
The broad pattern from these plots is clear. For smaller mutation rates (e.g. Fig. 6a) the site-frequency spectrum is heavily weighted toward the rarest variants. For large mutation rates (e.g. Fig. 6c), that is when multiple latent mutations are likely, the site-frequency spectrum is shifted toward higher counts. Again from Fig. 5a, the data contain fewer sites with intermediate mutation rates. In this case (e.g. Fig. 6b), the site-frequency spectrum does show the expected intermediate pattern, but subject to considerable sampling error. Across the range of mutation rates, the empirical model, which uses low-rate sites to estimate relative branch lengths and assumes these hold for all sites, fits the data well.
As can be seen in Fig. 6a and the first 20 or so panels of Fig. A2, the empirical estimates of include fluctuations due to sampling error for higher-count variants. The combined data for the first five bins have ranging from 71 to 38 for . The presence of these fluctuations helps illustrate a subtler phenomenon, namely the smoothing which occurs at larger mutation rates (e.g. Fig. 6c). For reference, the combined data for the first five bins have in the thousands for the low-count variants. From these, the estimated chance that a latent mutation is a singleton is about 64%, followed by 13% for doubletons and 6% for tripletons. By comparison, the chance is less than % for each variant with count . The predictions are smoothed for higher-count variants at larger mutation rates because they are mixtures. For example, two latent mutations will come in counts 1 and , 2 and , or 3 and with approximate relative proportions 64:13:6.
The lower three panels of Fig. 6 show estimates of the probability that a variant in count descends from latent mutations, computed using (32). All singletons descend from single mutations. Variants in larger counts can have multiple latent mutations, and the probabilities of these increase very quickly then settle down to stable values. This suggestion of a limiting distribution was also seen for exponential growth in Fig. 4b, only there depicted differently. For very large counts of the variant, the distribution of is well approximated by a Poisson with mean equal to the expected number of mutations per site on the gene genealogy, . This shifted-Poisson result is known already for the constant-size case (Arratia et al. 2000; Yamato 2017). In “A remark on the total number of mutations for large ” in the Appendix we argue that it should hold more generally. The accuracy of this shifted-Poisson result for the gnomAD data and is shown by the black dots on the right axes of Figs. 6d–f.
For low-rate sites (e.g. Fig. 6d) there is a relatively small chance of multiple latent mutations. But the chance of two or more latent mutations is not negligible, owing to the very large sample size. Note that the mutrate for bin 9 is less than the genome average, which is for this sample of K. Thus in a very large sample even low-rate sites are affected by recurrent mutation. For the middle-rate sites (e.g. Fig. 6e) in the trough in Fig. 5a the chance of there being only one latent mutation is still considerable. However, for high-rate sites (e.g. Fig. 6f) it can be more likely that there are two or three mutations in the ancestry of a rare variant than the single unique mutation which is typically supposed.
Finally, we explored the extent to which rare variants might be observed less frequently than would be expected if there were no recurrent mutation. Figure 7a shows the expected frequency of singletons, doubletons, etc., up to variants found in five copies in the sample, across the range of mutrates in the binned gnomAD data. The standard infinite-sites prediction is that the frequency will increase linearly with the mutation rate. Figure 7a is largely consistent with this but shows marked deviations when the mutrate becomes too large. The point at which the linear prediction fails depends on the count of the rare variant. Singletons are the first to deviate, which they do as soon as there is an appreciable chance of two or more mutations at a site. For rare variants in five copies, linearity holds even close to the upper limit of mutation rates in the human genome.
Figure 7b shows the extent to which the infinite-sites model over-predicts the frequency of singletons across the 97 bins. The infinite-sites prediction for a bin is its mutrate times the proportion of singleton branches estimated from the first five bins. The corresponding independent-Poissons predictions are the same as those for in the 97 panels of Fig. A2. The infinite-sites model makes reasonable predictions for the twenty lowest-rate bins, which contain 96% of all sites and have mutation rates less than twice the genome average. But it predicts the impossible for the seven highest-rate bins: more singletons than there are sites to mutate. For bins 21 through 97, which contain 4% of all sites, the infinite-sites model predicts a total of 269,222 singletons compared to the 83,002 which are actually observed.
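The contrast between the two singleton predictions can be made concrete. In the independent-Poissons model, a site shows a singleton only if it carries exactly one latent mutation and that mutation is on a singleton branch, which under the Poisson form of (30) has probability mutrate times the singleton proportion times the exponential of minus the mutrate. The sketch below assumes that reading; the function names are hypothetical and the singleton proportion 0.64 is the approximate value quoted for the first five bins.

```python
from math import exp

def singleton_fraction_infinite_sites(mutrate, p1):
    """Linear prediction: mutrate times the proportion of singleton branches."""
    return mutrate * p1

def singleton_fraction_independent_poissons(mutrate, p1):
    """Exactly one latent mutation at the site, and it is a singleton."""
    return mutrate * p1 * exp(-mutrate)

p1 = 0.64  # approximate singleton proportion from the low-rate bins
for mutrate in (0.01, 0.1, 1.0, 2.0):
    linear = singleton_fraction_infinite_sites(mutrate, p1)
    poissons = singleton_fraction_independent_poissons(mutrate, p1)
    print(f"mutrate={mutrate}: linear={linear:.4f}, poissons={poissons:.4f}")
```

The ratio of the two predictions is the exponential of minus the mutrate, so they agree for small rates and diverge for large ones. With this value of the singleton proportion the linear prediction exceeds 1 once the mutrate is above about 1.56, which is the sense in which it predicts more singletons than there are sites.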
We emphasize that the results in Fig. 7 depend on the sample size. The expected number of mutations at a single site, , is proportional to the total length of the gene genealogy, which is an increasing function of the sample size. Already for the sample size K considered here, singletons start to be affected by recurrent mutation at around the genome average mutation rate (Figs. 7 and 6d). For variants in any fixed count i there will be a sample size above which the infinite-sites, linear prediction starts to fail.
Discussion
In this work, we modeled the mutational ancestry of a rare variant in a large sample. Under the standard neutral model of population genetics with K-allele parent-independent mutation, we found that co-segregating rare variants may be treated independently and that the Ewens sampling formula gives the probabilistic structure of latent mutations in their ancestries. In particular, the number of latent mutations is distributed like the number of alleles in the Ewens sampling formula. We obtained more general results, for changing population size, by modeling latent mutations as independent Poisson random variates.
Our aim was to describe how the site-frequency spectra of rare variants in large samples are affected by recurrent mutation. The key parameters for a variant in count i are its expected total rate of mutation on the gene genealogy of the sample (here denoted and called “mutrate” in the previous section) and the expected relative lengths of branches in the gene genealogy which have i descendants in the sample (). Under the standard neutral model .
We obtained new results for under exponential population growth and used these to illustrate how recurrent mutation affects the site-frequency spectrum differently than under constant size. Lastly, we showed that our general results provide a good fit to synonymous variation among a large number of (non-Finnish European) individuals in the human Genome Aggregation Database (Karczewski et al. 2020), suggesting that, whatever the causes of deviations from might be for this sample, differences in mutation rate can explain differences in site-frequency spectra among sites.
Our application was empirical. We did not fit a demographic model, but following Seplyarskiy et al. (2021) used low-mutation-rate sites to estimate relative branch lengths and assumed these hold for all sites. Site-frequency spectra are a rich source of information about population-genetic phenomena but are of somewhat limited use in disentangling their effects (Myers et al. 2008; Bhaskar and Song 2014; Terhorst and Song 2015; Lapierre et al. 2017; Rosen et al. 2018). When low-mutation-rate sites are plentiful enough to provide stable estimates of relative branch lengths, this empirical method offers a way to control for myriad factors and isolate the effects of variation in mutation rate.
We began with a K-allele model with parent-independent mutation, and used its sampling probabilities in our computations for constant-size populations. We conjecture that our findings will hold for general mutation models because conditioning on a rare variant in a large sample means that the common allele will be the ancestral source of mutations with very high probability. Then the relevant mutation rate in any model will be the rate of the production of the rare allele from the common allele.
We described our general results as being for populations which may have changed in size. This is appropriate for the general coalescent model (Griffiths and Tavaré 1998) which we assumed for some proofs in the Appendix. Strictly speaking, the general coalescent does not require a generative model for the times between coalescent events. Thus our results can be applied more broadly. The case of a fixed tree with arbitrary considered in the Appendix is one example. The independent-Poissons model, with results (27) to (35), does not even require interpretation in terms of coalescence times. These results hold if we replace with an arbitrary rate parameter for the production of mutants in count i. Rates of production of mutants have been obtained for under a range of demographies and some types of selection (Lange and Fan 1997; Dorman et al. 2004; Lambert 2011; Kaj and Mugal 2016; Torres et al. 2020; Müller et al. 2022). Applications to selection will likely require free recombination between sites. Desai and Plotkin (2008) applied the independent-Poissons model (for all variant counts in the sample) for example under a version of the Poisson Random Field model (Sawyer and Hartl 1992).
Acknowledgements
We thank Vladimir Seplyarskiy for helpful discussions and assistance with the mutation-rate binned SFS data from gnomAD, and three anonymous reviewers for helpful comments.
Appendix
Time-dependent conditional ancestral process
Here we study the conditional ancestral process in detail and provide the justification for (18) and (19).
Let and be the numbers of rare alleles and common alleles respectively at time t. From (17a), (17b) and (17c), the stochastic process is a continuous-time Markov chain on with total rate of events and one-step transitions
(A1) |
Let be the probability measure for this process starting at , and define the random times
(A2) |
to be the times at which the first coordinate of the process decreases to for , with . We have almost surely under , and the process visits the following points in order .
In Theorem A1 we describe the joint distribution of the hitting times and the locations as .
Theorem A1
As , the random vector
(A3) in converges in distribution under to the random vector
where , and are independent random variables with probability density functions
Remark A1 Mean of —
Note that
Hence for , Theorem A1 implies that is of order and gives the second part of (18) in the main text. In contrast, when , and is no longer of order . Indeed, when , for by (A6). Hence by (A4) and Fubini’s theorem,
These give (18) in the main text.
Remark A2 Mean of —
for . This gives (19) in the main text.
Proof of Theorem A1
To explain the key idea we first establish weak convergence of , i.e. of the marginal distribution for in (A3). By definition, is given by
(A4) |
where is the number of downward jumps in the second coordinate of the process starting at up to the first decrease in the first coordinate. The variables are the times between these downward jumps, with being the time to the final jump starting at . This last jump is the one which decreases the first coordinate. Observe that is either or . Given , is equal to
(A5) |
which correspond to a non-empty mutation event and a coalescent event of type 1 respectively. These follow from (A1).
The probability mass function of is given by and, for ,
(A6) |
(A7) |
as and . Hence and, for ,
(A8) |
as and .
Lemma A1
As , we have convergence in distribution
with and as defined in Theorem A1.
Proof of Lemma A1.
It suffices to show that the moment generating function of the -valued random variable on the left converges pointwise to that on the right; that is, to show that
(A9) for and . See, for instance, Section 30 of Billingsley (2008). Since Exp,
(A10)
(A11) where
(A12) if and . Putting (A12) and (A7) into (A11), we obtain the desired (A9) and thus Lemma A1. □
We now return to the proof of Theorem A1. Lemma A1 implies that converges in distribution to as . Since almost surely, we have in the sense that
(A13) |
As in (A4), by definition, is given by
where is the number of downward jumps starting in state up to the second decrease in the first coordinate, i.e. to . As before, are the times between these jumps, with being the time for the first coordinate to hit starting at the penultimate states . As in (A5), is either or .
As , in the sense of (A13). Hence the same argument that leads to Lemma A1 can be applied again, starting at the new location . More precisely, by computing moment generating functions as before, and applying the strong Markov property of the random walk at the stopping time , we obtain the joint convergence
under as , where are independent variables defined in Theorem A1. This implies the convergence in distribution
under , as . Continuing this way, by letting be the number of downward jumps starting at before hitting the vertical line for , we obtain the desired convergence in Theorem A1. □
Low-count branches of general coalescent trees
Here we prove the non-nestedness and Poisson-independence of low-count mutations, which we assumed in section “Theory for nonconstant populations.” We do this first for fixed trees then for the random, general coalescent trees of Griffiths and Tavaré (1998). We also present the computation of the probability generating function, , of the count of the variant of interest and its number of latent mutations. Our definition of nested differs from some previous ones (Saunders et al. 1984; Wiuf and Donnelly 1999; Hobolth and Wiuf 2009); here nested mutations may occur on the same branch of the gene genealogy.
Nested mutation on a fixed tree
Let be a fixed (non random) tree with n leaves. We suppose the tree is ultrametric, that is the leaves have the same distance from the root. We call the height of . Consistent with the main text, we adopt the following notation for some relevant properties of , for the most part suppressing the dependence on n for simplicity:
is the length of the time during which there are exactly k lineages ancestral to the sample, for .
for , is the total length of branches in that have j descendants. We suppose there are such branches with lengths . Then .
is the total branch length, the sum of all the branches in , which is equal to .
For a positive integer b, we define a collection of disjoint connected subtrees of the coalescent tree as follows: Each of the branches with b descendants in the sample (say the ith one) subtends b leaves in the coalescent tree and gives rise to a subtree which contains that branch. We say nested mutation up to count b occurs on if there exist two mutations on for some . Fig. A1 illustrates this for .
We assume that mutations arise as a Poisson point process on the tree with constant rate per unit length. Theorem A2 below holds for any fixed ultrametric tree (it can be binary or have multiple mergers, or even be a star tree).
Theorem A2 Nested mutation on fixed trees —
Let be a fixed ultrametric tree with n leaves. For any positive integer b and for any , the probability that nested mutation up to count b occurs is bounded above by
(A14) In particular, the probability that nested mutation up to count b occurs tends to 0, as , if for .
Remark A3
There is good evidence that the upper bound is actually small for humans. For the gnomAD data we analyze in the main text, the expected number of mutations per site () is between about and . So is not big with high probability. The rest of the upper bound, , should be proportional to the average pairwise difference per site (very nearly equal to this for random Kingman coalescent trees and large n) and this ranges from about to about for these same data. See section “Application to human SNP data.”
Remark A4
The simpler bound can be weaker than the other bound in (A14) for large n. For the Kingman coalescent, is larger than since the latter tends to 0 as , by (A17). For a star tree, however, both bounds are approximately (up to a multiplicative constant).
Proof.
The total number of mutations on is a Poisson variable with mean . Given the tree and , the k mutations are uniformly distributed on the tree. Hence the conditional probability that two given mutations are on the same subtree for some i is equal to
where is the total branch lengths of the subtree . Since there are ways to choose two mutations out of k,
(A15) Note that for all , and that since the subtrees are disjoint. Hence
Putting this into (A15), we obtain the first bound in (A14). To get the second bound in (A14), note that for all , where is the height of the subtree .
Furthermore, is the sum of at most b branch lengths, one from for , and these branches are pairwise disjoint for different i’s (for ). Hence
where we used the general inequality . The bound in (A14) now follows by putting these into (A15). □
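The union-bound step of this proof can be checked numerically with a short sketch. It assumes only what the proof uses: the total number of mutations is Poisson, and, given that number, each mutation falls in a given subtree with probability equal to that subtree's share of the total branch length. The inputs `mean_mutations` and `subtree_fractions` are hypothetical stand-ins for the Poisson mean and those shares; the subtrees are disjoint, so their mutation counts are independent Poisson variables and the exact probability is available in closed form.

```python
import math


def nested_probability_exact(mean_mutations, subtree_fractions):
    """P(some subtree carries at least two mutations).

    Counts in disjoint subtrees are independent Poisson variables with
    means mean_mutations * fraction, and P(Poisson(m) <= 1) = exp(-m)(1 + m).
    """
    log_p_none = 0.0
    for f in subtree_fractions:
        m = mean_mutations * f
        log_p_none += -m + math.log1p(m)
    return 1.0 - math.exp(log_p_none)


def nested_probability_union_bound(mean_mutations, subtree_fractions):
    """Expected number of pairs of mutations, mean^2 / 2, times the
    probability that a given pair lands in the same subtree."""
    return 0.5 * mean_mutations ** 2 * sum(f * f for f in subtree_fractions)


fractions = [0.01, 0.02, 0.005]  # hypothetical shares of the total length
print(nested_probability_exact(0.5, fractions))
print(nested_probability_union_bound(0.5, fractions))
```

When the Poisson mean and the subtree shares are both small, the two quantities nearly coincide, which is the regime described in Remark A3.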
A mutation on a tree (called a latent mutation in the main text) is said to have count j if the mutation is the most recent mutation in the lineages of exactly j individuals at the leaves of the tree; see Fig. A1.
Theorem A3 Poisson approximation for counts on a fixed tree —
Let be a fixed coalescent tree with n leaves for . Let be the number of mutations on with counts j. If the probability that nested mutation up to count b occurs tends to 0 as , then for any positive integer b and any , the variables are asymptotically independent and for .
Proof.
If there is no nested mutation up to count b, then is also equal to the number of mutations on the branches in that have j descendants, for . Since these branches have total length and they are disjoint for different j’s, the result follows from the assumption that mutations occur as a Poisson point process on the tree with rate . □
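In the absence of nested mutation, the Poisson means in Theorem A3 are obtained by summing branch lengths by number of descendants and multiplying by the mutation rate per unit length. The following sketch does this for a small ultrametric tree; the tree, the value of `rate`, and the function name are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def poisson_means_by_count(branches, rate):
    """Map each count j to rate * (total length of branches with j descendants).

    branches: iterable of (length, number_of_descendant_leaves) pairs.
    rate: mutation rate per unit branch length.
    """
    totals = defaultdict(float)
    for length, descendants in branches:
        totals[descendants] += length
    return {j: rate * total for j, total in sorted(totals.items())}


# Ultrametric tree on leaves A, B, C, D: A and B join at time 1,
# C joins at time 2, D joins at time 3.
branches = [
    (1.0, 1),  # leaf A
    (1.0, 1),  # leaf B
    (1.0, 2),  # ancestor of {A, B}
    (2.0, 1),  # leaf C
    (1.0, 3),  # ancestor of {A, B, C}
    (3.0, 1),  # leaf D
]
print(poisson_means_by_count(branches, rate=0.1))
```

Because the branch classes are disjoint, the resulting counts are independent, and the probability of no mutation of count j is the exponential of minus the corresponding mean.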
Nested mutation on random trees
We now suppose the tree is a random binary tree (for ), in particular the general coalescent tree of Griffiths and Tavaré (1998). For each , is a sequence of positive random variables representing the times during which there are k lineages in . The branching structure of is independent of the times . Looking forward in time, whenever there is a branching event, an existing lineage is chosen uniformly at random to split into two.
Following Griffiths and Tavaré (1998, eqn. (2.2)) we let be the population size at time t in the past divided by the current population size. As in (45), with corresponds to an exponentially growing population.
Theorem A4 Nested mutation on random trees for fixed θ —
Let . Suppose for ,
(A16) where the expectation averages over all realizations of . Then the probability that nested mutation up to count b occurs is bounded above by , where are constants that tend to 0 as . Furthermore, (A16) holds for the generalized coalescent trees of Griffiths and Tavaré (1998) when (which includes any growing population).
Proof.
The first statement follows directly from Theorem A2. By the fact and the Cauchy-Schwarz inequality, we have
(A17) Hence assumption (A16) is satisfied if
(A18)
(A19) for . The second statement now follows from Lemmas A2, A3, and Proposition A1 below. □
Lemma A2 concerns assumption (A18). For reference, we note that it is satisfied, and hence (A18) is satisfied, if are exponential variables with parameter where . This is true for the Kingman coalescent which has .
Lemma A2
Suppose has finite pth moment, where . Then in , as .
Proof.
Consider the random tree and recall that is the length of the time during which there are exactly k lineages ancestral to the sample in . These k lineages are segments of length of the branches of the genealogy, and each of them is called a line of state k.
We construct the infinite sequence sequentially in the same probability space, by constructing a coupling of the two independent families and , where is the index of the lineage that branches into two going from to .
Let be the number of descendants in of the ℓth line of state k. Note that for , and . By exchangeability—in particular see Bertoin (2006, Proposition 2.8)—the random vector converges almost surely to a random vector that has the symmetric Dirichlet distribution on the simplex . Therefore, with probability one,
(A20) Since is finite almost surely, the trees have uniformly bounded height almost surely. So (A20) implies that with probability one,
Since , by the assumption on and the Dominated Convergence Theorem, in as . □
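The descendant counts used in this proof can be simulated directly from the forward-in-time description of the branching structure, in which a uniformly chosen lineage splits at each branching event. The sketch below is a minimal illustration; the function name and the seed are hypothetical.

```python
import random


def descendants_per_line(k, n, rng):
    """Start from k lines and split a uniformly chosen lineage until there
    are n lineages. Return the number of leaves descending from each of
    the k original lines.
    """
    counts = [1] * k
    for total in range(k, n):
        # A uniformly chosen lineage belongs to line l with
        # probability counts[l] / total.
        r = rng.randrange(total)
        acc = 0
        for l, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[l] += 1
                break
    return counts


rng = random.Random(1)
counts = descendants_per_line(k=3, n=10_000, rng=rng)
print([c / sum(counts) for c in counts])  # one draw of the descendant fractions
```

Repeating the simulation with different seeds gives an impression of the limiting distribution of the fractions on the simplex.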
Next consider assumption (A19). For the Kingman coalescent, is close to its mean because for n large enough,
(A21) |
where is defined in Fu (1995, eqns. (1)-(2)). This follows from the fact that Fu’s as for each (Fu 1995, eqn. (5)). Hence
Lemma A3
Suppose there exists a constant such that
(A22) Then for all and .
Proof.
For realized values of , the argument in Fu (1995, p. 181) gives
where is the indicator variable, where is the number of descendants in of the ℓth line of state k defined in the proof of Lemma A2.
Using the independence between and the branching structure, and following the notation in Fu (1995, eqns. (18)-(19)), the conditional expectation of , given , is
(A23) and that of , given , is
(A24) where the deterministic functions do not depend on . From Fu (1995),
and for ,
where the sum is taken over .
The first and the second moments of are obtained averaging over in (A23) and (A24). The bound follows from the same calculation in Fu (1995, eqn. (22)). By (A24), the fact and assumption (A22), holds also for our random trees. □
Remark A5
As in Theorem A2, we can use an alternative to assumption (A16). For any positive integer b, the probability that nested mutation up to count b occurs is bounded above by which tends to 0 if . For Kingman coalescent trees, this would require that .
We now check that the assumption (A16) in Theorem A4 holds for the generalized coalescent tree of Griffiths and Tavaré (1998).
Proposition A1
Suppose . Then satisfy the conditions in both Lemma A2 (with ) and Lemma A3. In particular, (A16) is satisfied and so the conclusion of Theorem A4 holds.
Proof.
The joint distribution of is determined by the function λ; see Griffiths and Tavaré (1994b). We can construct in terms of λ as follows: let be a pure death process with rate at state , starting at , and let
(A25) be a time-changed pure death process. Then
for , where are the jump times of (by convention ).
By (A25), the jump times of the pure death process , denoted by , are given by for . Hence, with the convention , for we have
These give for all .
Since is equal in distribution to the analog of for the Kingman coalescent, is stochastically dominated by times an exponential variable with parameter for all . The desired statement now follows since (A18) and (A19) are satisfied. □
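The time-change construction can be sketched for the case of exponential growth. The sketch assumes the standard form in which the relative population size at time t in the past is exp(-beta t), so that the cumulative coalescence intensity is (exp(beta t) - 1)/beta and its inverse is log(1 + beta s)/beta; these explicit expressions, the parameter name `beta`, and the function names are assumptions made for illustration.

```python
import math
import random


def kingman_jump_times(n, rng):
    """Cumulative times at which the number of lineages drops from
    n to n-1, ..., 2 to 1 in the constant-size (Kingman) death process,
    which leaves state k at rate k(k-1)/2.
    """
    t, times = 0.0, []
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2.0)
        times.append(t)
    return times


def time_change_exponential_growth(times, beta):
    """Map constant-size jump times s to log(1 + beta * s) / beta, the
    inverse of the assumed cumulative intensity (exp(beta t) - 1) / beta.
    """
    return [math.log1p(beta * s) / beta for s in times]


rng = random.Random(0)
constant = kingman_jump_times(10, rng)
growing = time_change_exponential_growth(constant, beta=5.0)
print(constant[-1], growing[-1])  # tree heights without and with growth
```

Every transformed jump time is no larger than the original one, which is the sense in which the times in a growing population are dominated by their constant-size counterparts.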
Replacing by its mean
By using the expected coalescence times denoted in the main text, we implicitly assumed that different sites have different trees and that these are all drawn from the same distribution. Theorem A5 below asserts that even though the mutant counts at each site are conditional on the realization of the tree at that site, we can replace by its expectation in Theorem A3 when the trees are random and satisfy suitable assumptions. The key reason is that is close to its mean, as made precise in Lemma A4.
Lemma A4
Suppose (A22) holds and that the covariance
(A26) for and , where is a sequence that tends to 0 as . Then for each , the variance as . In particular, in as .
Proof.
By further taking expectations in (A23) and (A24) with respect to , we obtain the variance
(A27) up to an term. This follows from Fu (1995, eqns. (24)-(25)) and assumption (A22) in Lemma A3. This also leads to (A21).
By assumptions (A22) and (A26), the double sum in (A27) is bounded above by
(A28) By Fu (1995, eqns. (29) and (22)), the first and second terms of (A28) are of order and , respectively, as for each . This completes the proof of . The latter implies, by Chebyshev’s inequality, that in as . □
Theorem A5 Poisson approximation for counts across loci —
Let be a sequence of random coalescent trees which are the generalized coalescent trees of Griffiths and Tavaré (1998). Suppose and assumption (A26) holds. Let be the number of mutations on with counts j. Then for any positive integer b and any , the variables are asymptotically independent and for , as .
Proof.
By Theorem A4, the probability that nested mutation up to count b occurs tends to 0 as . The result then follows from Lemma A4 and Theorem A3. □
It can be checked that exponentially growing populations satisfy and also assumption (A26). The conclusions of Theorems A4 and A5 then hold for the generalized coalescent trees of Griffiths and Tavaré (1998) when for for some .
Equipped with Theorem A5, we write as in the main text and compute the probability generating function of the count of the variant of interest and its number of latent mutations. The count of the variant of interest is and its number of latent mutations is . Hence
(A29) |
as declared in the main text.
A remark on the total number of mutations for large
The stable probabilities observed in the lower three panels of Fig. 6 and in Fig. 4b suggest that the conditional distribution of given and a very large n will approach a distribution as gets larger. That this is the case under constant population size follows from the fact, here as in “A conditional ancestral process for rare variants,” that the number of alleles in the Ewens sampling formula is the sum of independent Bernoulli trials (Arratia et al. 1992, 2000). The limiting large- distribution is Poisson, but shifted because there must be at least one mutation to produce copies, so it is that is Poisson. See Proposition 3.1 of Yamato (2017).
By the following heuristic argument, we suggest that this result holds more broadly, in particular for growing populations or ones in which decreases at least as fast with i as in the constant-size model, whatever the reason. In this case, when a mutation occurs it will very likely produce a low-count variant because for small i will be much greater than for large . A large- variant which is due for example to latent mutations will very likely have a count pattern such as or and very unlikely to have one such as for some small j.
Then we expect the probability (31) of seeing copies given latent mutations to be close to
because each of the mutations has a small chance of producing an appropriately large number of copies, the other mutations being inconsequential to the total count. Multiplying by the Poisson distribution of in (30) and rearranging gives
for large . To the left of the · is a probability like (7). To the right of the · is the shifted Poisson distribution, which implicitly averages over the (small) sizes of the additional mutations.
Contributor Information
John Wakeley, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
Wai-Tong (Louis) Fan, Department of Mathematics, Indiana University, Bloomington, IN 47405, USA; Center of Mathematical Sciences and Applications, Harvard University, Cambridge, MA 02138, USA.
Evan Koch, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA.
Shamil Sunyaev, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA.
Data availability
The data application to low-frequency synonymous polymorphisms used allele frequencies from exome sequencing data compiled in gnomAD v2.1.1, available here: https://gnomad.broadinstitute.org/downloads and basepair-resolution mutation rates (Seplyarskiy et al. 2022), available here: http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/. The mutation rate model specifies the rate for all three possible alternative nucleotides, and different nucleotide mutations were counted separately when generating the site-frequency spectra. The pipeline used to compile and annotate all potential synonymous mutations in the human genome is available at: https://github.com/vseplyarskiy/Roulette. The site-frequency spectra in different mutation rate bins are available at: https://doi.org/10.6084/m9.figshare.3426251.v1.
Funding
This work was supported in part by National Science Foundation grants DMS-1855417 and DMS-2152103, and Office of Naval Research grant N00014-20-1-2411 to W.-T.(L.)F.; and by National Institutes of Health grants R35-GM127131, R01-MH101244, U01-HG012009 and R01-HG010372 to S.S.
Literature cited
- Abramowitz M, Stegun IA. Handbook of Mathematical Functions. New York: Dover; 1964. [Google Scholar]
- Achaz G. Frequency spectrum neutrality tests: one for all and all for one. Genetics. 2009;183:249–258. doi: 10.1534/genetics.109.104042 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Aggarwala V, Voight BF. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat Genet. 2016;48:349–355. doi: 10.1038/ng.3511 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arratia R, Barbour AD, Tavaré S. Poisson process approximations for the Ewens sampling formula. Ann Appl Probab. 1992;2:519–535. doi: 10.1214/aoap/1177005647 [DOI] [Google Scholar]
- Arratia R, Barbour AD, Tavaré S. The number of components in a logarithmic combinatorial structure. Ann Appl Probab. 2000;10:331–361. doi: 10.1214/aoap/1019487347 [DOI] [Google Scholar]
- Arratia R, Barbour AD, Tavaré S. Logarithmic Combinatorial Structures: A Probabilistic Approach. Zürich: European Mathematical Society; 2003. (EMS monographs in mathematics). [Google Scholar]
- Arratia R, Barbour AD, Tavaré S. Exploiting the Feller coupling for the Ewens sampling formula. Stat Sci. 2016;31:27–29. doi: 10.1214/15-STS537 [DOI] [Google Scholar]
- Arratia R, Tavaré S. Limit theorems for combinatorial structures via discrete process approximations. Random Struct Algorithms. 1992;3:321–345. doi: 10.1002/rsa.3240030310 [DOI] [Google Scholar]
- Baake E, Bialowons R. Ancestral processes with selection: branching and Moran models. Banach Cent Publ. 2008;80:33–52. [Google Scholar]
- Bertoin J. Random Fragmentation and Coagulation Processes. Cambridge: Cambridge University Press; 2006. (Cambridge Studies in Advanced Mathematics. [Google Scholar]
- Bhaskar A, Kamm JA, Song YS. Approximate sampling formulae for general finite-alleles models of mutation. Adv Appl Probab. 2012;44:408–428. doi: 10.1239/aap/1339878718 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhaskar A, Song YS. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat. 2014;42:2469–2493. doi: 10.1214/14-AOS1264 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Billingsley P. Probability and Measure. New York: John Wiley & Sons; 2008. [Google Scholar]
- Bird AP. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res. 1980;8:1499–1504. doi: 10.1093/nar/8.7.1499 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics. 1995;140:783–796. doi: 10.1093/genetics/140.2.783 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Burden CJ, Griffiths RC. The stationary distribution of a sample from the Wright-Fisher diffusion model with general small mutation rates. J Math Biol. 2019;78:1211–1224. doi: 10.1007/s00285-018-1306-y [DOI] [PubMed] [Google Scholar]
- Burden CJ, Tang Y. An approximate stationary solution for multi-allele neutral diffusion with low mutation rates. Theor Popul Biol. 2016;112:22–32. doi: 10.1016/j.tpb.2016.07.005 [DOI] [PubMed] [Google Scholar]
- Burden CJ, Tang Y. Rate matrix estimation from site frequency data. Theor Popul Biol. 2017;113:23–33. doi: 10.1016/j.tpb.2016.10.001 [DOI] [PubMed] [Google Scholar]
- Bustamante CD, Wakeley J, Sawyer S, Hartl DL. Directional selection and the site-frequency spectrum. Genetics. 2001;159:1779–1788. doi: 10.1093/genetics/159.4.1779 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Champagnat N, Lambert A. Splitting trees with neutral Poissonian mutations I: small families. Stoch Process Their Appl. 2012;122:1003–1033. doi: 10.1016/j.spa.2011.11.002 [DOI] [Google Scholar]
- Champagnat N, Lambert A. Splitting trees with neutral Poissonian mutations II: largest and oldest families. Stoch Process Their Appl. 2013;123:1368–1414. doi: 10.1016/j.spa.2012.11.013 [DOI] [Google Scholar]
- Cheek D, Antal T. Mutation frequencies in a birth-death branching process. Ann Appl Probab. 2018;28:3922–3947. doi: 10.1214/18-AAP1413 [DOI] [Google Scholar]
- Chen H, Chen K. Asymptotic distributions of coalescence times and ancestral lineage numbers for populations with temporally varying size. Genetics. 2013;194:721–736. doi: 10.1534/genetics.113.151522 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crespo FF, Posada D, Wiuf C. Coalescent models derived from birth-death processes. Theor Popul Biol. 2021;142:1–11. doi: 10.1016/j.tpb.2021.09.003 [DOI] [PubMed] [Google Scholar]
- Desai MM, Plotkin JB. The polymorphism frequency spectrum of finitely many sites under selection. Genetics. 2008;180:2175–2191. doi: 10.1534/genetics.108.087361 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Donnelly P. Dual processes in population genetics. In: Tautu P, editor. Stochastic Spatial Processes. Berlin: Springer Berlin Heidelberg; 1986. p. 94–105.
- Donnelly P, Tavaré S. The population genealogy of the infinitely-many neutral alleles model. J Math Biol. 1987;25:381–391. doi: 10.1007/BF00277163 [DOI] [PubMed] [Google Scholar]
- Dorman KS, Sinsheimer JS, Lange K. In the garden of branching processes. SIAM Rev. 2004;46:202–229. doi: 10.1137/S0036144502417843 [DOI] [Google Scholar]
- Durrett R. Population genetics of neutral mutations in exponentially growing cancer cell populations. Ann Appl Probab. 2013;23:230–250. doi: 10.1214/11-AAP824 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Durrett R. Branching Process Models of Cancer. Cham: Springer; 2015. (Mathematical Biosciences Institute Lecture Series, 1.1). [Google Scholar]
- Eldon B, Birkner M, Blath J, Freund F. Can the site-frequency spectrum distinguish exponential population growth from multiple-merger coalescents? Genetics. 2015;199:841–856. doi: 10.1534/genetics.114.173807 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ewens WJ. The sampling theory of selectively neutral alleles. Theor Popul Biol. 1972;3:87–112. doi: 10.1016/0040-5809(72)90035-4 [DOI] [PubMed] [Google Scholar]
- Ewens WJ. A note on the sampling theory for infinite alleles and infinite sites models. Theor Popul Biol. 1974;6:143–148. doi: 10.1016/0040-5809(74)90020-3 [DOI] [PubMed] [Google Scholar]
- Ewens WJ. Mathematical Population Genetics. Berlin: Springer-Verlag; 1979. [Google Scholar]
- Ewens WJ. Mathematical Population Genetics, Volume I: Theoretical Foundations. Berlin: Springer-Verlag; 2004. [Google Scholar]
- Fearnhead P. Perfect simulation from population genetic models with selection. Theor Popul Biol. 2001;59:263–279. doi: 10.1006/tpbi.2001.1514 [DOI] [PubMed] [Google Scholar]
- Fearnhead P. The common ancestor at a nonneutral locus. J Appl Probab. 2002;39:38–54. doi: 10.1017/S0021900200021495 [DOI] [Google Scholar]
- Ferretti L, Ledda A, Wiehe T, Achaz G, Ramos-Onsins SE. Decomposing the site frequency spectrum: the impact of tree topology on neutrality tests. Genetics. 2017;207:229–240. doi: 10.1534/genetics.116.188763 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fisher RA. The possible modification of the response of the wild type to recurrent mutations. Am Nat. 1928;62:115–126. doi: 10.1086/280193 [DOI] [Google Scholar]
- Fisher RA. The distribution of gene ratios for rare mutations. Proc R Soc Edinb. 1930a;50:205–220. [Google Scholar]
- Fisher RA. The Genetical Theory of Natural Selection. Oxford: Clarendon; 1930b. [Google Scholar]
- Fisher RA. A theoretical distribution for the apparent abundance of different species. J Anim Ecol. 1943;12:54–57. [Google Scholar]
- Fu Y. Statistical properties of segregating sites. Theor Popul Biol. 1995;48:172–197. doi: 10.1006/tpbi.1995.1025 [DOI] [PubMed] [Google Scholar]
- Gao F, Keinan A. Inference of super-exponential human population growth via efficient computation of the site frequency spectrum for generalized models. Genetics. 2016;202:235–245. doi: 10.1534/genetics.115.180570 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gazave E, Ma L, Chang D, Coventry A, Gao F, Muzny D, Boerwinkle E, Gibbs R, Sing CF, Clark AG, et al. Neutral genomic regions refine models of recent rapid human population growth. Proc Natl Acad Sci USA. 2014;111:757–762. doi: 10.1073/pnas.1310398110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goldman N. Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res. 1993;21:2487–2491. doi: 10.1093/nar/21.10.2487
- Griffiths RC. Lines of descent in the diffusion approximation of neutral Wright-Fisher models. Theor Popul Biol. 1980;17:37–50. doi: 10.1016/0040-5809(80)90013-1
- Griffiths RC, Tavaré S. Ancestral inference in population genetics. Stat Sci. 1994a;9:307–319. doi: 10.1214/ss/1177010378
- Griffiths RC, Tavaré S. Sampling theory for neutral alleles in a varying environment. Philos Trans R Soc Lond B: Biol Sci. 1994b;344:403–410. doi: 10.1098/rstb.1994.0079
- Griffiths RC, Tavaré S. The age of a mutation in a general coalescent tree. Commun Stat Stoch Models. 1998;14:273–295. doi: 10.1080/15326349808807471
- Gunnarsson EB, Leder K, Foo J. Exact site frequency spectra of neutrally evolving tumors: a transition between power laws reveals a signature of cell viability. Theor Popul Biol. 2021;142:67–90. doi: 10.1016/j.tpb.2021.09.004
- Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:1–11. doi: 10.1371/journal.pgen.1000695
- Haldane JBS. The part played by recurrent mutation in evolution. Am Nat. 1933;67:5–19. doi: 10.1086/280465
- Harpak A, Bhaskar A, Pritchard JK. Mutation rate variation is a primary determinant of the distribution of allele frequencies in humans. PLoS Genet. 2016;12:e1006489. doi: 10.1371/journal.pgen.1006489
- Hobolth A, Wiuf C. The genealogy, site frequency spectrum and ages of two nested mutant alleles. Theor Popul Biol. 2009;75:260–265. Sam Karlin: Special Issue. doi: 10.1016/j.tpb.2009.02.001
- Hudson RR. Testing the constant-rate neutral allele model with protein sequence data. Evolution. 1983;37:203–217. doi: 10.2307/2408186
- Jenkins PA, Mueller JW, Song YS. General triallelic frequency spectrum under demographic models with variable population size. Genetics. 2014;196:295–311. doi: 10.1534/genetics.113.158584
- Jenkins PA, Song YS. The effect of recurrent mutation on the frequency spectrum of a segregating site and the age of an allele. Theor Popul Biol. 2011;80:158–173. doi: 10.1016/j.tpb.2011.04.001
- Johnson KE, Adams CJ, Voight BF. Identifying rare variants inconsistent with identity-by-descent in population-scale whole-genome sequencing data. Methods Ecol Evol. 2022;13:2429–2442. doi: 10.1111/2041-210X.13991
- Kaj I, Mugal CF. The non-equilibrium allele frequency spectrum in a Poisson random field framework. Theor Popul Biol. 2016;111:51–64. doi: 10.1016/j.tpb.2016.06.003
- Kaplan N, Hudson RR. The use of sample genealogies for studying a selectively neutral m-loci model with recombination. Theor Popul Biol. 1985;28:382–396. doi: 10.1016/0040-5809(85)90036-X
- Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7
- Keinan A, Clark AG. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science. 2012;336:740–743. doi: 10.1126/science.1217283
- Kern AD, Hey J. Exact calculation of the joint allele frequency spectrum for isolation with migration models. Genetics. 2017;207:241–253. doi: 10.1534/genetics.116.194019
- Kessler DA, Levine H. Large population solution of the stochastic Luria & Delbrück evolution model. Proc Natl Acad Sci USA. 2013;110:11682–11687. doi: 10.1073/pnas.1309667110
- Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to a steady flux of mutations. Genetics. 1969;61:893–903. doi: 10.1093/genetics/61.4.893
- Kimura M. Theoretical foundation of population genetics at the molecular level. Theor Popul Biol. 1971;2:174–208. doi: 10.1016/0040-5809(71)90014-1
- Kingman JFC. On the genealogy of large populations. J Appl Probab. 1982;19:27–43. doi: 10.1017/S0021900200034446
- Lambert A. Species abundance distributions in neutral models with immigration or mutation and general lifetimes. J Math Biol. 2011;63:57–72. doi: 10.1007/s00285-010-0361-9
- Lange K, Fan RZ. Branching process models for mutant genes in nonstationary populations. Theor Popul Biol. 1997;51:118–133. doi: 10.1006/tpbi.1997.1297
- Lapierre M, Lambert A, Achaz G. Accuracy of demographic inferences from the site frequency spectrum: the case of the Yoruba population. Genetics. 2017;206:439–449. doi: 10.1534/genetics.116.192708
- Lea DE, Coulson A. The distribution of the numbers of mutants in bacterial populations. J Genet. 1949;49:264–285. doi: 10.1007/BF02986080
- Leffler EM, Bullaughey K, Matute DR, Meyer WK, Ségurel L, Venkat A, Andolfatto P, Przeworski M. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 2012;10:1–9. doi: 10.1371/journal.pbio.1001388
- Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057
- Liu X, Fu YX. Exploring population size changes using SNP frequency spectra. Nat Genet. 2015;47:555–559. doi: 10.1038/ng.3254
- Luria SE, Delbrück M. Mutations of bacteria from virus sensitivity to virus resistance. Genetics. 1943;28:491–511. doi: 10.1093/genetics/28.6.491
- Moran PAP. Random processes in genetics. Proc Camb Phil Soc. 1958;54:60–71. doi: 10.1017/S0305004100033193
- Moran PAP. Statistical Processes of Evolutionary Theory. Oxford: Clarendon Press; 1962.
- Müller R, Kaj I, Mugal CF. A nearly neutral model of molecular signatures of natural selection after change in population size. Genome Biol Evol. 2022;14:evac058.
- Myers S, Fefferman C, Patterson N. Can one learn history from the allelic spectrum? Theor Popul Biol. 2008;73:342–348. doi: 10.1016/j.tpb.2008.01.001
- Nielsen R. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics. 2000;154:931–942. doi: 10.1093/genetics/154.2.931
- Ohtsuki H, Innan H. Forward and backward evolutionary processes and allele frequency spectrum in a cancer cell population. Theor Popul Biol. 2017;117:43–50. doi: 10.1016/j.tpb.2017.08.006
- Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165:427–436. doi: 10.1093/genetics/165.1.427
- Polanski A, Szczesna A, Garbulowski M, Kimmel M. Coalescence computations for large samples drawn from populations of time-varying sizes. PLoS ONE. 2017;12:1–22. doi: 10.1371/journal.pone.0170701
- Poon GYP, Watson CJ, Fisher DS, Blundell JR. Synonymous mutations reveal genome-wide levels of positive selection in healthy tissues. Nat Genet. 2021;53:1597–1605. doi: 10.1038/s41588-021-00957-1
- Rannala B, Slatkin M. Estimating the age of alleles by use of intraallelic variability. Am J Human Genet. 1997;60:447–458.
- Rosen Z, Bhaskar A, Roch S, Song YS. Geometry of the sample frequency spectrum and the perils of demographic inference. Genetics. 2018;210:665–682. doi: 10.1534/genetics.118.300733
- Sargsyan O. Analytical and simulation results for the general coalescent [PhD thesis]. Los Angeles: University of Southern California; 2006.
- Sargsyan O. An analytical framework in the general coalescent tree setting for analyzing polymorphisms created by two mutations. J Math Biol. 2015;70:913–956. doi: 10.1007/s00285-014-0785-8
- Saunders IW, Tavaré S, Watterson GA. On the genealogy of nested subsamples from a haploid population. Adv Appl Probab. 1984;16:471–491. doi: 10.2307/1427285
- Sawyer SA, Hartl DL. Population genetics of polymorphism and divergence. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161
- Schrempf D, Hobolth A. An alternative derivation of the stationary distribution of the multivariate neutral Wright-Fisher model for low mutation rates with a view to mutation rate estimation from site frequency data. Theor Popul Biol. 2017;114:88–94. doi: 10.1016/j.tpb.2016.12.001
- Seplyarskiy V, Lee DJ, Koch EM, Lichtman JS, Luan HH, Sunyaev SR. A mutation rate model at the basepair resolution identifies the mutagenic effect of Polymerase III transcription. bioRxiv. 2022. doi: 10.1101/2022.08.20.504670. Preprint: not peer reviewed.
- Seplyarskiy VB, Soldatov RA, Koch E, McGinty RJ, Goldmann JM, Hernandez RD, Barnes K, Correa A, Burchard EG, Ellinor PT, et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science. 2021;373:1030–1035. doi: 10.1126/science.aba7408
- Slade PF. Most recent common ancestor probability distributions in gene genealogies under selection. Theor Popul Biol. 2000a;58:291–305. doi: 10.1006/tpbi.2000.1488
- Slade PF. Simulation of selected genealogies. Theor Popul Biol. 2000b;57:35–49. doi: 10.1006/tpbi.1999.1438
- Slatkin M. Allele age and a test for selection on rare alleles. Philos Trans R Soc Lond B: Biol Sci. 2000;355:1663–1668. doi: 10.1098/rstb.2000.0729
- Slatkin M, Hudson RR. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics. 1991;129:555–562. doi: 10.1093/genetics/129.2.555
- Slatkin M, Rannala B. Estimating allele age. Annu Rev Genomics Hum Genet. 2000;1:225–249. doi: 10.1146/annurev.genom.1.1.225
- Städler T, Haubold B, Merino C, Stephan W, Pfaffelhuber P. The impact of sampling schemes on the site frequency spectrum in nonequilibrium subdivided populations. Genetics. 2009;182:205–216.
- Stephens M, Donnelly P. Ancestral inference in population genetics models with selection (with discussion). Aust N Z J Stat. 2003;45:395–430. doi: 10.1111/1467-842X.00295
- Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–460. doi: 10.1093/genetics/105.2.437
- Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585–595. doi: 10.1093/genetics/123.3.585
- Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y
- Terhorst J, Song YS. Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proc Natl Acad Sci USA. 2015;112:7677–7682. doi: 10.1073/pnas.1503717112
- The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.
- Torres R, Stetter MG, Hernandez RD, Ross-Ibarra J. The temporal dynamics of background selection in nonequilibrium populations. Genetics. 2020;214:1019–1030. doi: 10.1534/genetics.119.302892
- Tricomi F, Erdélyi A. The asymptotic expansion of a ratio of gamma functions. Pac J Math. 1951;1:133–142. doi: 10.2140/pjm.1951.1.133
- Vogl C, Clemente F. The allele-frequency spectrum in a decoupled Moran model with mutation, drift, and directional selection, assuming small mutation rates. Theor Popul Biol. 2012;81:197–209. doi: 10.1016/j.tpb.2012.01.001
- Vogl C, Mikula LC, Burden CJ. Maximum likelihood estimators for scaled mutation rates in an equilibrium mutation-drift model. Theor Popul Biol. 2020;134:106–118. doi: 10.1016/j.tpb.2020.06.001
- Watterson GA. Models for the logarithmic species abundance distributions. Theor Popul Biol. 1974a;6:217–250. doi: 10.1016/0040-5809(74)90025-2
- Watterson GA. The sampling theory of selectively neutral alleles. Adv Appl Probab. 1974b;6:463–488. doi: 10.2307/1426228
- Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 1975;7:256–276. doi: 10.1016/0040-5809(75)90020-9
- Watterson GA. Lines of descent and the coalescent. Theor Popul Biol. 1984;26:77–92. doi: 10.1016/0040-5809(84)90025-X
- Wiuf C. On the genealogy of a sample of neutral rare alleles. Theor Popul Biol. 2000;58:61–75. doi: 10.1006/tpbi.2000.1469
- Wiuf C, Donnelly P. Conditional genealogies and the age of a neutral mutant. Theor Popul Biol. 1999;56:183–201. doi: 10.1006/tpbi.1998.1411
- Wolfram Research, Inc. Mathematica, Version 11.2. Champaign, IL: Wolfram Research; 2017.
- Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97
- Wright S. The distribution of gene frequencies under irreversible mutation. Proc Natl Acad Sci USA. 1938;24:253–259. doi: 10.1073/pnas.24.7.253
- Wright S. Adaptation and selection. In: Jepsen GL, Simpson GG, Mayr E, editors. Genetics, Paleontology and Evolution. Princeton (NJ): Princeton University Press; 1949.
- Yamato H. Poisson approximations for sum of Bernoulli random variables and its application to Ewens sampling formula. J Jpn Stat Soc. 2017;47:187–195. doi: 10.14490/jjss.47.187
Data Availability Statement
The data application to low-frequency synonymous polymorphisms used allele frequencies from exome sequencing data compiled in gnomAD v2.1.1, available at: https://gnomad.broadinstitute.org/downloads and basepair-resolution mutation rates (Seplyarskiy et al. 2022), available at: http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/. The mutation rate model specifies the rate for each of the three possible alternative nucleotides, and different nucleotide mutations were counted separately when generating the site-frequency spectra. The pipeline used to compile and annotate all potential synonymous mutations in the human genome is available at: https://github.com/vseplyarskiy/Roulette. The site-frequency spectra in different mutation rate bins are available at: https://doi.org/10.6084/m9.figshare.3426251.v1.