Author manuscript; available in PMC: 2012 Sep 1.
Published in final edited form as: Theor Popul Biol. 2011 Apr 28;80(2):158–173. doi: 10.1016/j.tpb.2011.04.001

The effect of recurrent mutation on the frequency spectrum of a segregating site and the age of an allele

Paul A Jenkins, Yun S Song
PMCID: PMC3143209  NIHMSID: NIHMS297996  PMID: 21550359

Abstract

The sample frequency spectrum of a segregating site is the probability distribution of a sample of alleles from a genetic locus, conditional on observing the sample to be polymorphic. This distribution is widely used in population genetic inferences, including statistical tests of neutrality in which a skew in the observed frequency spectrum across independent sites is taken as a signature of departure from neutral evolution. Theoretical aspects of the frequency spectrum have been well studied and several interesting results are available, but they are usually under the assumption that a site has undergone at most one mutation event in the history of the sample. Here, we extend previous theoretical results by allowing for at most two mutation events per site, under a general finite alleles model in which the mutation rate is independent of current allelic state but the transition matrix is otherwise completely arbitrary. Our results apply to both nested and nonnested mutations. Only the former has been addressed previously, whereas here we show it is the latter that is more likely to be observed except for very small sample sizes. Further, for any mutation transition matrix, we obtain the joint sample frequency spectrum of the two mutant alleles at a triallelic site, and derive a closed-form formula for the expected age of the younger of the two mutations given their frequencies in the population. Several large-scale resequencing projects for various species are presently under way and the resulting data will include some triallelic polymorphisms. The theoretical results described in this paper should prove useful in population genomic analyses of such data.

Keywords: Frequency spectrum, Coalescent, Triallelic site, Genealogy, Allele age

1. Introduction

The frequency spectrum for a sample of genetic data taken from a population is a useful statistic, containing more information than single-value summaries like the number of segregating sites, yet remaining more tractable than working with the full data configuration. The sample frequency spectrum for a polymorphic site is defined as the probability distribution of the number of copies of the derived, or mutant, allele in a sample of size n. For a sample with many polymorphic sites, a histogram of the number of sites with i copies of the mutant allele present in the sample, for each i = 1, …, n − 1, can be compared to the sample frequency spectrum. In this manner, one can test the applicability of a given reproductive model by comparing departures of the observed frequency spectrum from its expectation. Most of the widely-used tests of neutrality are either directly or indirectly based on this observation (Achaz, 2009). Under a standard, neutral, coalescent model, and assuming the infinite sites model of mutation, the sample frequency spectrum is known in closed-form:

$$\phi(i) = \frac{i^{-1}}{\sum_{j=1}^{n-1} j^{-1}}, \tag{1}$$

where ϕ(i) is the probability that a mutant allele is present in exactly i copies of the sample [Watterson (1975); see Fu (1995), Griffiths and Tavaré (1998) for a coalescent approach]. This appealing result has been generalized to a number of further settings, including variable population size (Griffiths and Tavaré, 1998; Polanski and Kimmel, 2003; Evans et al., 2007), and genic selection (Griffiths, 2003). Bustamante et al. (2001) obtain a number of results related to the frequency spectrum for mutant sites under selection in the Poisson random field model of Sawyer and Hartl (1992).
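As a small illustration of equation (1), the following sketch evaluates the infinite-sites sample frequency spectrum exactly. The function name `infinite_sites_sfs` is hypothetical and chosen only for this example.

```python
from fractions import Fraction


def infinite_sites_sfs(n):
    """Return [phi(1), ..., phi(n-1)] from equation (1) as exact fractions."""
    # Normalizing constant: sum_{j=1}^{n-1} 1/j
    norm = sum(Fraction(1, j) for j in range(1, n))
    return [Fraction(1, i) / norm for i in range(1, n)]


if __name__ == "__main__":
    for i, prob in enumerate(infinite_sites_sfs(5), start=1):
        print(i, prob)
```

The spectrum is proportional to 1/i, so singletons are always the most probable class under this model.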

All of this work assumes the infinite sites model of mutation. In particular, the mutation giving rise to the new allele is assumed to have occurred at most once in the genealogy relating the sample. Since the per-site mutation parameter θ is small (typically 0.001 ≤ θ ≤ 0.01 for humans, where θ = 4Nu, N is the diploid effective population size, and u is the probability of a mutation event per individual per generation), this assumption is usually reasonable. Occasionally, however, one might observe a site that must have undergone more than one mutation: it may be triallelic, or may be incompatible with the gene genealogy inferred from completely linked sites. Moreover, recurrent mutations can affect sites that still appear to conform to the infinite sites assumption. Thus far there have been no clear theoretical grounds for how to deal with nonconforming sites when working with the frequency spectrum. For example, one simple approach is to bin the two mutant alleles of a triallelic site and then treat the site as if it were diallelic (e.g. Johnson and Slatkin, 2006), but this is clearly not ideal.

In this work we obtain a more general distribution for the number of copies of mutant alleles at a site, by allowing at most two mutation events in the genealogy relating the sample. We employ a general finite sites model in which a fixed but arbitrary number of alleles, K, may be observed at the site of interest, and mutations between alleles occur according to some transition matrix P. The sample frequency spectrum is then more generally defined to be the joint probability distribution of the number of copies of each of the mutant alleles, conditional on at least one mutant allele. We assume the standard coalescent (Kingman, 1982), and derive our results by arguments using topological constraints induced on the genealogy by the two mutations. This approach is most closely related to the work of Wiuf and Donnelly (1999) and Hobolth and Wiuf (2009), who studied genealogies with one mutation and genealogies with two nested mutations, respectively. Among other results, Wiuf and Donnelly (1999) obtain the density of the age of a single mutant allele given its population frequency, and Hobolth and Wiuf (2009) obtain the joint and marginal sample frequency spectra of two mutant alleles when the mutations are genealogically nested, and the age of the younger of the two nested mutants. In this paper we extend these results to nonnested mutations, which, as we show below, is the more important of the two cases: With increasing sample size, the probability that two mutations are nonnested approaches one, and it is even the most probable outcome for all sample sizes greater than four. With results for both cases in hand, by averaging over whether or not the mutations are nested we obtain the sample frequency spectra of two mutant alleles regardless of their topological placement in the genealogy. 
Furthermore, Hobolth and Wiuf (2009) treat the two mutants as having occurred at two completely linked but distinct sites, so that the younger and older of the two mutants are always identifiable. In this work we model the two mutations as occurring at the same site, allowing for the more general possibility of parallel mutations or back mutations. Particular choices of P in our model allow one to include the setting of Hobolth and Wiuf (2009) as a special case.

When introducing a model for mutation there are two cases to consider:

  1. The allele of the most recent common ancestor (MRCA) of the sample is known, usually by comparison with an outgroup that is related by a suitable evolutionary distance.

  2. The type of the MRCA is unknown.

In this work we largely restrict ourselves to the first case. In principle it has more power, since mutant alleles observed $i$ times and $n - i$ times are distinguishable. When we have no prior assumptions regarding which of the alleles is the mutant, we must resort to the folded frequency spectrum, in which the two categories are binned together. In any case, when the type of the MRCA is unknown and the mutation transition matrix takes on a special parent-independent form (that is, $P_{ij}$ is independent of $i$ for each pair of alleles $i$ and $j$), a closed-form sampling distribution for each site is available, which applies for any number of mutations in the history of the sample. This formula is essentially due to Wright (1949). Use of Wright's formula for making inferences regarding the site frequency spectrum is considered by Desai and Plotkin (2008). Note that when we assume the allele of the MRCA is known, Wright's formula does not apply even when mutation is parent-independent. For larger mutation rates, the assumptions that a genealogy has undergone at most two mutations and that the allele of the MRCA is known each become less justifiable, and without prior information about which allele is mutant one should revert to using a folded site frequency spectrum.

In the special case of parent-independent mutation with the type of the MRCA unknown one can use Wright’s formula as described above. It also applies to a diallelic model (K = 2), which can always be transformed into an equivalent parent-independent one. Aside from these cases, there are no classical results for the sample frequency spectrum under more general transition matrices. In this work we allow P to remain a general transition matrix apart from the restriction that the mutation rate at the locus is independent of its current allelic state. This is equivalent to ensuring Pii = 0 for each i = 1, …, K, since the effective rate at which an allele mutates to another distinct allele is (θ/2)(1 − Pii); in more general mutation models, Pii can vary for different i to allow different rates of transition out of different allelic states. It should be possible to modify our results to relax this assumption, albeit with a noticeable cost in bookkeeping, and so we do not attempt it in this work. An exception can be made when we study triallelic sample configurations later in the paper: genealogies associated with such configurations must have undergone at least two nontrivial mutation events, and we can allow Pii > 0 without any additional effect. Essentially, having observed a triallelic sample together with the assumption that there were at most two mutation events means we condition on such trivial mutations not having occurred, even if we allow them back into the model. When studying triallelic configurations, we also find that our results simplify substantially with the following additional assumption:

$$P_{ab} = P_{cb}, \quad\text{and}\quad P_{ac} = P_{bc}, \tag{2}$$

where the three observed alleles are a, b, and c, and a is the ancestral allele. The condition (2) is satisfied by, and is weaker than, parent-independent mutation. It requires only that parent-independence holds in relevant entries of P, namely, in the rates of transition to each of the observed mutant alleles from the ancestral allele and from the other observed mutant allele.

Our paper is structured as follows. In Section 2 we introduce the recursion relation for the distribution of the sample configuration, which is well-known and is based on coalescent arguments. We utilize this recursion to obtain results for coalescent trees with one mutation event (Section 3) and two mutation events (Section 4). These results are made tractable in Section 5 by letting the mutation parameter go to zero, from which we can obtain useful expressions when we condition on certain observed patterns (e.g. that the site is triallelic) in Section 6. In Section 7 we also investigate the mean age in the population of a mutant allele at a triallelic site, and in Section 8 we investigate the accuracy of our expressions when the mutation parameter is in fact nonzero. We conclude with some brief discussion in Section 9.

2. Sample recursion

Denote an unordered sample configuration at a particular site by $\mathbf{n} = (n_1, n_2, \ldots, n_K)$, where $K$ is the fixed and known number of alleles which could be observed at this site, and denote the sample size by $n = \sum_{i=1}^{K} n_i$. Members of the sample are referred to as gametes, so that $n_i$ denotes the number of gametes in the sample with allele $i$. We fix the ancestral allele and denote it as $a \in \{1, \ldots, K\}$. Denote by $E_s$ the event that there are exactly $s$ mutation events in the history of the sample. We will write the probability of observing the configuration $\mathbf{n}$ as $p(\mathbf{n})$, and the joint probability of this configuration together with $E_s$ as $p(\mathbf{n}, E_s)$. It is implicit in these expressions that we condition on the ancestral allele being $a$. We can now obtain the sample frequency spectrum from these probabilities. For example, suppose we have two possible alleles, denoted 1 and 2 where 1 is ancestral, so that $P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. As $\theta \to 0$, we may assume the mutant allele arose as the result of precisely one mutation event, and the distribution (1) is recovered as

$$\phi(i) = \lim_{\theta \to 0} p((n-i, i) \mid E_1) = \lim_{\theta \to 0} \frac{p((n-i, i), E_1)}{\sum_{j=1}^{n-1} p((n-j, j), E_1)},$$

$i = 1, \ldots, n-1$, since $p((n-i, i), E_1) = \theta i^{-1} + O(\theta^2)$ under this model (see (17) below).

Define a history to be the sequence of configurations $\mathbf{n} \mapsto \mathbf{n}' \mapsto \cdots \mapsto \mathbf{e}_a$ as we trace the ancestry of the sample back in time. Here, $\mathbf{e}_i$ denotes a sample comprised of a single gamete whose allele is $i$, so $\mathbf{e}_a$ denotes the sample comprising only the MRCA. A history can be regarded as an equivalence class in the space of genealogies with mutations that relate the sample. At each step in the sequence, at which the configuration is modified, we do not record which lineages are involved in each event, so there are many possible genealogies associated with any given history.

The probability p(n) satisfies the following recursion relation for n ≥ 2:

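$$p(\mathbf{n}) = \sum_{j=1}^{K} \frac{n_j - 1}{n - 1 + \theta}\, p(\mathbf{n} - \mathbf{e}_j) + \frac{\theta}{n - 1 + \theta} \sum_{i=1}^{K} \sum_{j=1}^{K} P_{ij}\, \frac{n_i + 1 - \delta_{ij}}{n}\, p(\mathbf{n} - \mathbf{e}_j + \mathbf{e}_i), \tag{3}$$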

with boundary condition p(ei) = δia for each i = 1, …, K, where δij is the Kronecker delta. Similarly, the probability p(n, Es) satisfies, for n ≥ 2,

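$$p(\mathbf{n}, E_s) = \sum_{j=1}^{K} \frac{n_j - 1}{n - 1 + \theta}\, p(\mathbf{n} - \mathbf{e}_j, E_s) + \frac{\theta}{n - 1 + \theta} \sum_{i=1}^{K} \sum_{j=1}^{K} P_{ij}\, \frac{n_i + 1 - \delta_{ij}}{n}\, p(\mathbf{n} - \mathbf{e}_j + \mathbf{e}_i, E_{s-1}), \tag{4}$$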

with boundary conditions p(ei, Es) = δiaδs0. Equations (3) and (4) are obtained from similar ones in Griffiths and Tavaré (1994), with slight modifications to the boundary conditions to reflect that we consider the allele of the MRCA to be known. Each path back through the recursions (3) and (4) is associated with a particular history, and we use this observation to obtain our results for the sample frequency spectrum. To illustrate the method, in Section 3 we first consider histories with precisely one mutation event. Henceforth we assume that Pii = 0 for i = 1, …, K, so that mutation events always result in a change of allele, and subsequent configurations in any history are always distinct. Finally, we also use the following notation: for a nonnegative real number x and a positive integer k,

$$(x)_k := x(x+1)\cdots(x+k-1)$$

denotes the kth ascending factorial of x.
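Recursion (4) can be evaluated directly for small samples. The sketch below is one possible memoized implementation, not the authors' code; the names `make_solver` and `p` are hypothetical, alleles are indexed from 0, and configurations with a negative count are treated as having probability zero.

```python
from functools import lru_cache


def make_solver(P, a, theta):
    """Return p(config, s) = p(n, E_s) from recursion (4), ancestral allele a."""
    K = len(P)

    @lru_cache(maxsize=None)
    def p(config, s):
        n = sum(config)
        if s < 0:
            return 0.0
        if n == 1:
            # Boundary condition: p(e_i, E_s) = delta_{ia} * delta_{s0}
            return 1.0 if (config[a] == 1 and s == 0) else 0.0
        total = 0.0
        # Coalescence terms: (n_j - 1)/(n - 1 + theta) * p(n - e_j, E_s)
        for j in range(K):
            if config[j] >= 2:
                child = list(config)
                child[j] -= 1
                total += (config[j] - 1) / (n - 1 + theta) * p(tuple(child), s)
        # Mutation terms: an allele-j gamete had a parent of allele i
        if s >= 1:
            for j in range(K):
                if config[j] == 0:
                    continue
                for i in range(K):
                    if P[i][j] == 0:
                        continue
                    parent = list(config)
                    parent[j] -= 1
                    parent[i] += 1
                    weight = (config[i] + 1 - (1 if i == j else 0)) / n
                    total += (theta / (n - 1 + theta)) * P[i][j] * weight * p(tuple(parent), s - 1)
        return total

    return p


if __name__ == "__main__":
    theta = 0.1
    p = make_solver([[0.0, 1.0], [1.0, 0.0]], a=0, theta=theta)
    print(p((1, 1), 1), theta / (1 + theta) ** 2)
```

For the diallelic matrix and the configuration with one gamete of each allele, the recursion reduces to θ/(1 + θ)², which the usage lines print alongside the recursive value.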

3. One mutation event

Refer to the intervals back in time while there existed n, n − 1, …, 2 ancestors to the sample as levels. Wiuf and Donnelly (1999) proceed by conditioning on the level at which the unique mutation event occurred, and then considering the distribution of the number of offspring of each lineage from that level. Here, we take a related approach but instead argue directly from (4).

Suppose we observe the sample configuration $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b$, where $b \neq a$ is some mutant allele, and we have $n_a > 0$, $n_b > 0$, and $n_a + n_b = n$. Denote the sample configuration immediately before (more recently than) the time of the unique mutation event by $\mathbf{l}$. Conditional on the event $E_1$, that exactly one mutation event occurred in the history of the sample, $\mathbf{l}$ must be of the form $\mathbf{l} = l_a\mathbf{e}_a + \mathbf{e}_b$ for some $l_a$ with $1 \le l_a \le n_a$ (Figure 1). We refer to a history that passes through the state $\mathbf{l}$ as compatible with $\mathbf{l}$. Thus, a history $\mathcal{H}_1$ that gives rise to $\mathbf{n}$, is compatible with $\mathbf{l}$, and is consistent with $E_1$, must be a sequence of configurations of the form $\mathbf{n} \mapsto \cdots \mapsto \mathbf{l} \mapsto (l_a + 1)\mathbf{e}_a \mapsto \cdots \mapsto \mathbf{e}_a$. We make use of the following simple but useful lemma.

Figure 1. A coalescent tree with one mutation. The allele of each leaf is annotated. Also annotated is the variable $l_a$ (here, $l_a = 3$), which determines the number of each type at the time of the mutation event.

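Lemma 3.1. Conditional on $E_1$, on the observed sample configuration $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b$, and on the configuration $\mathbf{l} = l_a\mathbf{e}_a + \mathbf{e}_b$ at the time of the mutation event, the distribution of compatible histories $\mathcal{H}_1$ is uniform. It is given by

$$p(\mathcal{H}_1 \mid E_1, \mathbf{n}, \mathbf{l}) = \binom{n - l_a - 1}{n_b - 1}^{-1}.$$

Moreover, the (unconditional) probability of such histories is

$$p(\mathcal{H}_1) = p(\mathcal{H}_1, E_1, \mathbf{n}, \mathbf{l}) = \frac{\theta P_{ab}\,(n_a - 1)!\,(n_b - 1)!}{(1+\theta)_{n-1}} \cdot \frac{l_a}{l_a + \theta}. \tag{5}$$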

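A proof of Lemma 3.1, along with our other results, is given in Appendix A. Summing over the $\binom{n - l_a - 1}{n_b - 1}$ histories and over $\mathbf{l}$, we obtain

$$p(\mathbf{n}, E_1) = \sum_{l_a=1}^{n_a} \binom{n - l_a - 1}{n_b - 1}\, p(\mathcal{H}_1, E_1, \mathbf{n}, \mathbf{l}) = \begin{cases} \dfrac{\theta P_{ab}\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{l_a=1}^{n_a} \frac{\binom{n_a - 1}{l_a - 1}}{\binom{n-1}{l_a}} \cdot \frac{1}{l_a + \theta}, & \text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b, \text{ where } b \neq a \text{ and } 1 \le n_a, n_b \le n, \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$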
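Equation (6) is a finite sum and is straightforward to evaluate numerically. The following is a minimal sketch under that reading of (6); the names `ascending_factorial` and `p_one_mutation` are hypothetical, and `P_ab` defaults to 1 as in the diallelic model.

```python
from math import comb, factorial


def ascending_factorial(x, k):
    """(x)_k = x (x + 1) ... (x + k - 1)."""
    out = 1.0
    for r in range(k):
        out *= x + r
    return out


def p_one_mutation(na, nb, theta, P_ab=1.0):
    """Evaluate p(n, E_1) from equation (6) for n = na*e_a + nb*e_b."""
    n = na + nb
    prefactor = theta * P_ab * factorial(n - 1) / ascending_factorial(1 + theta, n - 1)
    total = sum(
        comb(na - 1, la - 1) / comb(n - 1, la) / (la + theta)
        for la in range(1, na + 1)
    )
    return prefactor * total


if __name__ == "__main__":
    theta = 1e-6
    # For small theta the ratio p / theta should be close to P_ab / nb.
    print(p_one_mutation(7, 3, theta) / theta)
```

For small θ the leading behaviour is θ P_ab / n_b, so dividing by θ gives a quick sanity check against the one-mutation spectrum (1).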

We now extend this approach to histories with precisely two mutation events.

4. Two mutation events

There are four cases to consider (Figure 2). The two mutation events are either nested (denoted $E_2^{\mathcal{N}}$), nonnested ($E_2^{\mathcal{NN}}$), on the same edge ($E_2^{\mathcal{S}}$), or basal ($E_2^{\mathcal{B}}$). We define each of these in further detail below; for now note that nested excludes the case that the mutations occurred on the same edge, and nonnested excludes the case that the mutations reside on the two basal (innermost) edges of the tree. We use superscript notation to further specify the alleles to which the two age-ordered mutation events gave rise, so for example $E_2^{\mathcal{N}(b,c)}$ ($\subseteq E_2^{\mathcal{N}}$) denotes the event that there were precisely two mutation events, that the mutations were nested, that the older mutation gave rise to a $b$ allele, and that the younger mutation gave rise to a $c$ allele. Note that in this example we must have $a \neq b$ and $b \neq c$ but it may or may not be the case that $a = c$. The $a = c$ case will be dealt with separately. Similar special cases arise for $E_2^{\mathcal{NN}}$, $E_2^{\mathcal{S}}$, and $E_2^{\mathcal{B}}$. We now consider each of the four events in further detail.

Figure 2. Coalescent trees with two mutations. (a) Two nested mutations. (b) Two nonnested mutations. (c) Two mutations on the same branch. (d) Two mutations on the basal branches. The allele of each leaf is annotated. Also annotated are variables determining the number of each type at the times of the mutation events; for example, in (a) we have $m = 2$, $l_y = 4$, and $l_o = 3$.

4.1. Two nested mutations

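In this case the clade subtended by one mutation is a proper subclade of the other [Figure 2(a)]. The genealogy of two nested mutations was also studied by Hobolth and Wiuf (2009), though using a different model of mutation. Consider the event $E_2^{\mathcal{N}(b,c)}$, and assume for now that $a \neq c$, so our observation is of the form $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c$, with $n_a, n_b, n_c > 0$, and $n_a + n_b + n_c = n$. The sample configuration immediately before the younger mutation event is of the form $\mathbf{l}_y = l_y\mathbf{e}_a + m\mathbf{e}_b + \mathbf{e}_c$, and immediately before the older mutation event it is of the form $\mathbf{l}_o = l_o\mathbf{e}_a + \mathbf{e}_b$. For the mutations to be nested we must have $1 \le m \le n_b$, and $1 \le l_o \le l_y \le n_a$. Denote a history compatible with these requirements by $\mathcal{H}_2^{\mathcal{N}}$; this is a sequence of configurations of the form $\mathbf{n} \mapsto \cdots \mapsto \mathbf{l}_y \mapsto (\mathbf{l}_y - \mathbf{e}_c + \mathbf{e}_b) \mapsto \cdots \mapsto \mathbf{l}_o \mapsto (\mathbf{l}_o - \mathbf{e}_b + \mathbf{e}_a) \mapsto \cdots \mapsto \mathbf{e}_a$. Using $\binom{n}{n_a,\, n_b,\, n_c}$ to denote the trinomial coefficient, we have the following lemma:

Lemma 4.1. Conditional on $E_2^{\mathcal{N}(b,c)}$, on the observed sample configuration $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c$, and on the sample configurations $\mathbf{l}_y = l_y\mathbf{e}_a + m\mathbf{e}_b + \mathbf{e}_c$ and $\mathbf{l}_o = l_o\mathbf{e}_a + \mathbf{e}_b$ immediately before the times of the two nested mutation events, the distribution of compatible histories $\mathcal{H}_2^{\mathcal{N}}$ is uniform. It is given by

$$p(\mathcal{H}_2^{\mathcal{N}} \mid E_2^{\mathcal{N}(b,c)}, \mathbf{n}, \mathbf{l}_y, \mathbf{l}_o) = \left[\binom{n - l_y - m - 1}{n_a - l_y,\; n_b - m,\; n_c - 1} \binom{m + l_y - l_o}{m}\right]^{-1}.$$

Moreover, the (unconditional) probability of such histories is

$$p(\mathcal{H}_2^{\mathcal{N}}) = p(\mathcal{H}_2^{\mathcal{N}}, E_2^{\mathcal{N}(b,c)}, \mathbf{n}, \mathbf{l}_y, \mathbf{l}_o) = \frac{\theta^2 P_{ab} P_{bc}\,(n_a - 1)!\,(n_b - 1)!\,(n_c - 1)!}{(1+\theta)_{n-1}} \times \frac{m\, l_o}{(m + l_y + \theta)(l_o + \theta)} \cdot \frac{m + 1}{m + l_y + 1}. \tag{7}$$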

Proof. See Appendix A.

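Summing over compatible histories and over valid combinations of $m$, $l_y$, and $l_o$, we obtain

$$p(\mathbf{n}, E_2^{\mathcal{N}(b,c)}) = \sum_{l_y=1}^{n_a} \sum_{l_o=1}^{l_y} \sum_{m=1}^{n_b} \binom{n - l_y - m - 1}{n_a - l_y,\; n_b - m,\; n_c - 1} \binom{m + l_y - l_o}{m}\, p(\mathcal{H}_2^{\mathcal{N}}) = \begin{cases} \dfrac{\theta^2 P_{ab} P_{bc}\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{l_y=1}^{n_a} \sum_{l_o=1}^{l_y} \sum_{m=1}^{n_b} \frac{\binom{n_a - 1}{l_y - 1} \binom{n_b - 1}{m - 1} \binom{m + l_y - l_o}{m}}{\binom{n-1}{m + l_y} \binom{m + l_y}{m + 1}} \times \frac{1}{m + l_y + 1} \cdot \frac{l_o}{l_o + \theta} \cdot \frac{1}{m + l_y + \theta}, & \text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c, \text{ where } a, b, c \text{ are all distinct, and } 1 \le n_a, n_b, n_c \le n; \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$

One can relax the age-ordering on mutations by noting that, for $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c$,

$$p(\mathbf{n}, E_2^{\mathcal{N}}) = p(\mathbf{n}, E_2^{\mathcal{N}(b,c)}) + p(\mathbf{n}, E_2^{\mathcal{N}(c,b)}).$$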

Similar relaxations apply to the remaining cases described below.

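Finally, we return to the possibility that $a = c$. We denote this by $E_2^{\mathcal{N}(b,a)}$, and have a sample of the form $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b$, with $n_a, n_b > 0$, and $n_a + n_b = n$. Now, $n_a$ comprises gametes whose alleles are truly ancestral and gametes whose alleles are atavistic. We can still however apply the previous result [equation (8)], first by treating $a$ and $c$ as if they were distinct, then summing over all possible values for the number of observed $a$ alleles and the number of observed $c$ alleles such that their sum is held fixed, and finally setting $c = a$ in the resulting expression:

$$p(\mathbf{n}, E_2^{\mathcal{N}(b,a)}) = \sum_{k=1}^{n_a - 1} p\big(k\mathbf{e}_a + n_b\mathbf{e}_b + (n - n_b - k)\mathbf{e}_c,\, E_2^{\mathcal{N}(b,c)}\big)\Big|_{c=a} = \begin{cases} \dfrac{\theta^2 P_{ab} P_{ba}\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{l_y=1}^{n_a - 1} \sum_{l_o=1}^{l_y} \sum_{m=1}^{n_b} \frac{\binom{n_a - 1}{l_y} \binom{n_b - 1}{m - 1} \binom{m + l_y - l_o}{m}}{\binom{n-1}{m + l_y} \binom{m + l_y}{m + 1}} \times \frac{1}{m + l_y + 1} \cdot \frac{l_o}{l_o + \theta} \cdot \frac{1}{m + l_y + \theta}, & \text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b, \text{ where } b \neq a, \text{ and } 1 \le n_a, n_b \le n; \\ 0, & \text{otherwise.} \end{cases}$$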

4.2. Two nonnested mutations

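In this case the clades subtended by the two mutations are disjoint [Figure 2(b)]. We also exclude the possibility that the two mutations reside on the basal (innermost) branches of the coalescent tree, which could result in a monomorphic sample. Consider the event $E_2^{\mathcal{NN}(b,c)}$, where $b \neq c$, $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c$, with $n_a, n_b, n_c > 0$, and $n_a + n_b + n_c = n$. Suppose that, immediately before the younger mutation, the sample configuration is $\mathbf{l}_y = l_y\mathbf{e}_a + m\mathbf{e}_b + \mathbf{e}_c$, and immediately before the older mutation it is $\mathbf{l}_o = l_o\mathbf{e}_a + \mathbf{e}_b$. Consideration of the genealogy [Figure 2(b)] leads to the following restrictions: $1 \le m \le n_b$, $1 \le l_y \le n_a$, and $1 \le l_o \le l_y + 1$. One can argue as in the previous subsection to obtain the following:

Lemma 4.2. Conditional on $E_2^{\mathcal{NN}(b,c)}$, on the observed sample configuration $\mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c$, and on the sample configurations $\mathbf{l}_y = l_y\mathbf{e}_a + m\mathbf{e}_b + \mathbf{e}_c$ and $\mathbf{l}_o = l_o\mathbf{e}_a + \mathbf{e}_b$ immediately before the times of the two nonnested mutation events, the distribution of compatible histories $\mathcal{H}_2^{\mathcal{NN}}$ is uniform. It is given by

$$p(\mathcal{H}_2^{\mathcal{NN}} \mid E_2^{\mathcal{NN}(b,c)}, \mathbf{n}, \mathbf{l}_y, \mathbf{l}_o) = \left[\binom{n - m - l_y - 1}{n_a - l_y,\; n_b - m,\; n_c - 1} \binom{m + l_y - l_o}{m - 1}\right]^{-1}.$$

Moreover, the (unconditional) probability of such histories is

$$p(\mathcal{H}_2^{\mathcal{NN}}) = p(\mathcal{H}_2^{\mathcal{NN}}, E_2^{\mathcal{NN}(b,c)}, \mathbf{n}, \mathbf{l}_y, \mathbf{l}_o) = \frac{\theta^2 P_{ab} P_{ac}\,(n_a - 1)!\,(n_b - 1)!\,(n_c - 1)!}{(1+\theta)_{n-1}} \times \frac{l_y\, l_o}{(l_y + m + \theta)(l_o + \theta)} \cdot \frac{l_y + 1}{l_y + m + 1}. \tag{9}$$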

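Using this lemma to sum over compatible histories and over $m$, $l_y$, and $l_o$, after some simplification we obtain

$$p(\mathbf{n}, E_2^{\mathcal{NN}(b,c)}) = \sum_{m=1}^{n_b} \sum_{l_y=1}^{n_a} \sum_{l_o=1}^{l_y + 1} \binom{n - m - l_y - 1}{n_a - l_y,\; n_b - m,\; n_c - 1} \binom{m + l_y - l_o}{m - 1}\, p(\mathcal{H}_2^{\mathcal{NN}}) = \begin{cases} \dfrac{\theta^2 P_{ab} P_{ac}\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{m=1}^{n_b} \sum_{l_y=1}^{n_a} \sum_{l_o=1}^{l_y + 1} \frac{\binom{n_a - 1}{l_y - 1} \binom{n_b - 1}{m - 1} \binom{m + l_y - l_o}{m - 1}}{\binom{n-1}{m + l_y} \binom{m + l_y}{l_y + 1}} \times \frac{l_o}{l_o + \theta} \cdot \frac{1}{m + l_y + 1} \cdot \frac{1}{m + l_y + \theta}, & \text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c, \text{ where } a, b, c \text{ are all distinct, and } 1 \le n_a, n_b, n_c \le n; \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

As before, the result (10) can also be used to handle the special case $b = c$:

$$p(\mathbf{n}, E_2^{\mathcal{NN}(b,b)}) = \sum_{k=1}^{n_b - 1} p\big(n_a\mathbf{e}_a + k\mathbf{e}_b + (n - n_a - k)\mathbf{e}_c,\, E_2^{\mathcal{NN}(b,c)}\big)\Big|_{b=c} = \begin{cases} \dfrac{\theta^2 P_{ab}^2\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{m=1}^{n_b - 1} \sum_{l_y=1}^{n_a} \sum_{l_o=1}^{l_y + 1} \frac{\binom{n_a - 1}{l_y - 1} \binom{n_b - 1}{m} \binom{m + l_y - l_o}{m - 1}}{\binom{n-1}{m + l_y} \binom{m + l_y}{l_y + 1}} \times \frac{l_o}{l_o + \theta} \cdot \frac{1}{m + l_y + 1} \cdot \frac{1}{m + l_y + \theta}, & \text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b, \text{ where } b \neq a, \text{ and } 1 \le n_a, n_b \le n; \\ 0, & \text{otherwise.} \end{cases}$$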

4.3. Two mutations on the same branch

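Here, two mutation events reside on the same edge of the coalescent tree, as illustrated in Figure 2(c). Consider first the subevent $E_2^{\mathcal{S}(b,c)}$, with $a \neq c$, so that $\mathbf{n} = n_a\mathbf{e}_a + n_c\mathbf{e}_c$, with $n_a, n_c > 0$, and $n_a + n_c = n$. Suppose the configuration immediately prior to the younger mutation event is $\mathbf{l}_y = l_y\mathbf{e}_a + \mathbf{e}_c$, and immediately prior to the older mutation is $\mathbf{l}_o = l_o\mathbf{e}_a + \mathbf{e}_b$. We argue as before to yield the following:

Lemma 4.3. Conditional on $E_2^{\mathcal{S}(b,c)}$, on the observed sample configuration $\mathbf{n} = n_a\mathbf{e}_a + n_c\mathbf{e}_c$, and on the sample configurations $\mathbf{l}_y = l_y\mathbf{e}_a + \mathbf{e}_c$ and $\mathbf{l}_o = l_o\mathbf{e}_a + \mathbf{e}_b$ immediately before the times of the two mutation events, the distribution of compatible histories $\mathcal{H}_2^{\mathcal{S}}$ is uniform. It is given by

$$p(\mathcal{H}_2^{\mathcal{S}} \mid E_2^{\mathcal{S}(b,c)}, \mathbf{n}, \mathbf{l}_y, \mathbf{l}_o) = \binom{n - l_y - 1}{n_c - 1}^{-1}.$$

Moreover, the (unconditional) probability of such histories is

$$p(\mathcal{H}_2^{\mathcal{S}}) = p(\mathcal{H}_2^{\mathcal{S}}, E_2^{\mathcal{S}(b,c)}, \mathbf{n}, \mathbf{l}_y, \mathbf{l}_o) = \frac{\theta^2 P_{ab} P_{bc}\,(n_a - 1)!\,(n_c - 1)!}{(1+\theta)_{n-1}} \cdot \frac{l_o}{l_o + \theta} \cdot \frac{1}{(l_y + 1)(l_y + \theta)}. \tag{11}$$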

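Hence, summing over compatible histories and then over $l_o$ and $l_y$, we obtain

$$p(\mathbf{n}, E_2^{\mathcal{S}(b,c)}) = \sum_{l_y=1}^{n_a} \sum_{l_o=1}^{l_y} \binom{n - l_y - 1}{n_c - 1}\, p(\mathcal{H}_2^{\mathcal{S}}) = \begin{cases} \dfrac{\theta^2 P_{ab} P_{bc}\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{l_y=1}^{n_a} \sum_{l_o=1}^{l_y} \frac{\binom{n_a - 1}{l_y - 1}}{\binom{n-1}{l_y}} \cdot \frac{l_o}{l_o + \theta} \cdot \frac{1}{l_y (l_y + 1)(l_y + \theta)}, & \text{if } \mathbf{n} = n_a\mathbf{e}_a + n_c\mathbf{e}_c, \text{ where } a, b, c \text{ are all distinct, and } 1 \le n_a, n_c \le n; \\ 0, & \text{otherwise.} \end{cases} \tag{12}$$

One can relax the restriction on the unobserved allele being $b$ by summing over each possible $b$. This is achieved simply by replacing $P_{ab}P_{bc}$ in (12) with $(P^2)_{ac}$.

When $a = c$, the sample must be $\mathbf{n} = n\mathbf{e}_a$, and repeating earlier arguments we obtain

$$p(\mathbf{n}, E_2^{\mathcal{S}(b,a)}) = \sum_{k=1}^{n-1} p\big(k\mathbf{e}_a + (n - k)\mathbf{e}_c,\, E_2^{\mathcal{S}(b,c)}\big)\Big|_{a=c} = \begin{cases} \dfrac{\theta^2 P_{ab} P_{ba}\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{l_y=1}^{n-1} \sum_{l_o=1}^{l_y} \frac{l_o}{l_o + \theta} \cdot \frac{1}{l_y (l_y + 1)(l_y + \theta)}, & \text{if } \mathbf{n} = n\mathbf{e}_a, \\ 0, & \text{otherwise.} \end{cases}$$

Again, the stipulation on the unobserved allele being $b$ can be relaxed by replacing $P_{ab}P_{ba}$ with $(P^2)_{aa}$.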

4.4. Two mutations on the basal branches

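Here, the mutations reside on the last two edges in the coalescent tree to exist going back in time, as illustrated in Figure 2(d). Consider first the subevent $E_2^{\mathcal{B}(b,c)}$, with $b \neq c$, so that $\mathbf{n} = n_b\mathbf{e}_b + n_c\mathbf{e}_c$, $n_b, n_c > 0$, and $n_b + n_c = n$. Suppose the configuration immediately prior to the younger mutation event is $\mathbf{l}_y = m\mathbf{e}_b + \mathbf{e}_c$. The configuration at the older mutation event must be $\mathbf{e}_a + \mathbf{e}_b$. Arguing as in previous subsections yields the following:

Lemma 4.4. Conditional on $E_2^{\mathcal{B}(b,c)}$, on the sample configuration $\mathbf{n} = n_b\mathbf{e}_b + n_c\mathbf{e}_c$, and on the sample configuration $\mathbf{l}_y = m\mathbf{e}_b + \mathbf{e}_c$ immediately before the time of the younger mutation event, the distribution of compatible histories $\mathcal{H}_2^{\mathcal{B}}$ is uniform. It is given by

$$p(\mathcal{H}_2^{\mathcal{B}} \mid E_2^{\mathcal{B}(b,c)}, \mathbf{n}, \mathbf{l}_y) = \binom{n - m - 1}{n_c - 1}^{-1}.$$

Moreover, the (unconditional) probability of such histories is

$$p(\mathcal{H}_2^{\mathcal{B}}) = p(\mathcal{H}_2^{\mathcal{B}}, E_2^{\mathcal{B}(b,c)}, \mathbf{n}, \mathbf{l}_y) = \frac{\theta^2}{1+\theta} \cdot \frac{P_{ab} P_{ac}\,(n_b - 1)!\,(n_c - 1)!}{(1+\theta)_{n-1}} \cdot \frac{1}{(m + 1)(m + \theta)}. \tag{13}$$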

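Hence, summing over compatible histories and over $m$, we obtain

$$p(\mathbf{n}, E_2^{\mathcal{B}(b,c)}) = \sum_{m=1}^{n_b} \binom{n - m - 1}{n_c - 1}\, p(\mathcal{H}_2^{\mathcal{B}}) = \begin{cases} \dfrac{\theta^2}{1+\theta} \cdot \dfrac{P_{ab} P_{ac}\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{m=1}^{n_b} \frac{\binom{n_b - 1}{m - 1}}{\binom{n-1}{m}} \cdot \frac{1}{m (m + 1)(m + \theta)}, & \text{if } \mathbf{n} = n_b\mathbf{e}_b + n_c\mathbf{e}_c, \text{ where } a, b, c \text{ are all distinct, and } 1 \le n_b, n_c \le n; \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$

Finally, when $b = c$ the sample must be of the form $\mathbf{n} = n\mathbf{e}_b$, and the probability of observing such a configuration as a result of two mutation events on the basal branches is given by

$$p(\mathbf{n}, E_2^{\mathcal{B}(b,b)}) = \sum_{k=1}^{n-1} p\big(k\mathbf{e}_b + (n - k)\mathbf{e}_c,\, E_2^{\mathcal{B}(b,c)}\big)\Big|_{b=c} = \begin{cases} \dfrac{\theta^2}{1+\theta} \cdot \dfrac{(P_{ab})^2\,(n-1)!}{(1+\theta)_{n-1}} \displaystyle\sum_{m=1}^{n-1} \frac{1}{m (m + 1)(m + \theta)}, & \text{if } \mathbf{n} = n\mathbf{e}_b, \\ 0, & \text{otherwise.} \end{cases}$$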

5. The limit θ → 0

To make further progress, we derive expressions in the limit as θ → 0. Results are therefore approximate for nonzero θ, but should still exhibit good accuracy when applied to human single-nucleotide polymorphism data for example, for which θ is small, as noted in the introduction.

Our results will be expressed in terms of harmonic numbers, for which we use the following notation:

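$$H_n = \sum_{j=1}^{n} \frac{1}{j}, \quad\text{and}\quad H_n^{(2)} = \sum_{j=1}^{n} \frac{1}{j^2}.$$

Further, let $c_n^{(s)}$ denote the $s$th order generalized harmonic number (Roman, 1993), defined for $s \ge 0$ and $n \ge 1$ by

$$c_n^{(s)} = \begin{cases} 1, & \text{if } s = 0, \\ \displaystyle\sum_{j=1}^{n} \frac{c_j^{(s-1)}}{j}, & \text{if } s > 0. \end{cases}$$

In particular,

$$c_n^{(1)} = H_n, \quad\text{and}\quad c_n^{(2)} = \sum_{j=1}^{n} \frac{H_j}{j} = \frac{1}{2}\left[(H_n)^2 + H_n^{(2)}\right]. \tag{15}$$

This last identity is easily verified by induction on $n$. To simplify notation, we also introduce the function

$$d(n_a, n_b, n_c) = \frac{1}{(n_a + n_b)(n_a + n_b - 1)} \left[1 + \frac{n}{n_c} - \frac{2n\,(H_n - H_{n_c - 1})}{n_a + n_b + 1}\right], \tag{16}$$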

where n = na + nb + nc.
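The function d in (16) can be compared against the triple sum of equation (8) evaluated at θ = 0, where the prefactor (n − 1)!/(1 + θ)_{n−1} equals 1. The sketch below does this with exact rational arithmetic; the helper names `H`, `d`, and `nested_sum` are hypothetical, and reading (8) at θ = 0 in this way is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb


def H(k):
    """Harmonic number H_k, with H_0 = 0."""
    return sum(Fraction(1, j) for j in range(1, k + 1))


def d(na, nb, nc):
    """The function d(n_a, n_b, n_c) of equation (16)."""
    n = na + nb + nc
    s = na + nb
    bracket = 1 + Fraction(n, nc) - 2 * n * (H(n) - H(nc - 1)) / (s + 1)
    return Fraction(1, s * (s - 1)) * bracket


def nested_sum(na, nb, nc):
    """Triple sum of equation (8) at theta = 0, without theta^2 P_ab P_bc."""
    n = na + nb + nc
    total = Fraction(0)
    for ly in range(1, na + 1):
        for lo in range(1, ly + 1):
            for m in range(1, nb + 1):
                num = comb(na - 1, ly - 1) * comb(nb - 1, m - 1) * comb(m + ly - lo, m)
                den = comb(n - 1, m + ly) * comb(m + ly, m + 1) * (m + ly + 1) * (m + ly)
                total += Fraction(num, den)
    return total


if __name__ == "__main__":
    print(d(1, 1, 1), nested_sum(1, 1, 1))
```

Agreement of the two quantities for small configurations is a useful check when transcribing either formula.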

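Theorem 5.1. As $\theta \to 0$, the joint probability of observing a particular sample configuration $\mathbf{n}$ together with the topological characterization of the genealogy satisfies:

$$p(\mathbf{n}, E_1) = \frac{\theta P_{ab}}{n_b} - \theta^2 P_{ab} \left[\frac{H_{n-1}}{n_b} + \frac{1}{n_a}\left(H_{n-1} - H_{n_b - 1}\right)\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b \text{ and } 0 \text{ otherwise}, \tag{17}$$

$$p(\mathbf{n}, E_2^{\mathcal{N}(b,c)}) = \theta^2 P_{ab} P_{bc}\, d(n_a, n_b, n_c) + O(\theta^3), \quad\text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c \text{ and } 0 \text{ otherwise}, \tag{18}$$

$$p(\mathbf{n}, E_2^{\mathcal{N}(b,a)}) = \theta^2 P_{ab} P_{ba} \left[\frac{1}{n_b + 1}\left[1 - \frac{n}{n_b}\left(H_{n-1} - H_{n_a - 1}\right)\right] + \frac{H_n - 1}{n - 1}\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b \text{ and } 0 \text{ otherwise}, \tag{19}$$

$$p(\mathbf{n}, E_2^{\mathcal{NN}(b,c)}) = \theta^2 P_{ab} P_{ac} \left[\frac{1}{n_c (n_b + n_c)} - d(n_a, n_b, n_c)\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b + n_c\mathbf{e}_c \text{ and } 0 \text{ otherwise}, \tag{20}$$

$$p(\mathbf{n}, E_2^{\mathcal{NN}(b,b)}) = \theta^2 P_{ab}^2 \left[\frac{H_{n_b - 1}}{n_b} - \frac{1}{n_a + 1}\left[1 - \frac{n}{n_a}\left(H_{n-1} - H_{n_b - 1}\right)\right] - \frac{H_n - 1}{n - 1}\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n_a\mathbf{e}_a + n_b\mathbf{e}_b \text{ and } 0 \text{ otherwise}, \tag{21}$$

$$p(\mathbf{n}, E_2^{\mathcal{S}(b,c)}) = \frac{\theta^2 P_{ab} P_{bc}}{n_a} \left[\frac{n}{n_a + 1}\left(H_n - H_{n_c - 1}\right) - 1\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n_a\mathbf{e}_a + n_c\mathbf{e}_c \text{ and } 0 \text{ otherwise}, \tag{22}$$

$$p(\mathbf{n}, E_2^{\mathcal{S}(b,a)}) = \theta^2 P_{ab} P_{ba} \left[1 - \frac{1}{n}\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n\mathbf{e}_a \text{ and } 0 \text{ otherwise}, \tag{23}$$

$$p(\mathbf{n}, E_2^{\mathcal{B}(b,c)}) = \frac{\theta^2 P_{ab} P_{ac}}{n_b + 1} \left[1 - \frac{n_c - 1}{n_b}\left(H_{n-1} - H_{n_c - 1}\right)\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n_b\mathbf{e}_b + n_c\mathbf{e}_c \text{ and } 0 \text{ otherwise}, \tag{24}$$

$$p(\mathbf{n}, E_2^{\mathcal{B}(b,b)}) = \theta^2 P_{ab}^2 \left[H_{n-1}^{(2)} - 1 + \frac{1}{n}\right] + O(\theta^3), \quad\text{if } \mathbf{n} = n\mathbf{e}_b \text{ and } 0 \text{ otherwise}, \tag{25}$$

where $a$, $b$, $c$ are all distinct.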

Proof. See Appendix A.
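Two of the limits in Theorem 5.1 can be checked against the finite sums of Section 4 evaluated at θ = 0. The sketch below compares (22) with the sum in (12), and (24) with the sum in (14), omitting the common factors θ² P_ab P_bc and θ² P_ab P_ac. All function names are hypothetical, and the θ = 0 reading of the sums is an assumption of this example.

```python
from fractions import Fraction
from math import comb


def H(k):
    """Harmonic number H_k, with H_0 = 0."""
    return sum(Fraction(1, j) for j in range(1, k + 1))


def same_branch_sum(na, nc):
    """Sum in (12) at theta = 0; the inner sum over l_o contributes l_y."""
    n = na + nc
    return sum(
        Fraction(comb(na - 1, ly - 1), comb(n - 1, ly) * ly * (ly + 1))
        for ly in range(1, na + 1)
    )


def same_branch_closed(na, nc):
    """Bracketed closed form of (22), divided by n_a."""
    n = na + nc
    return (Fraction(n, na + 1) * (H(n) - H(nc - 1)) - 1) / na


def basal_sum(nb, nc):
    """Sum in (14) at theta = 0."""
    n = nb + nc
    return sum(
        Fraction(comb(nb - 1, m - 1), comb(n - 1, m) * m * m * (m + 1))
        for m in range(1, nb + 1)
    )


def basal_closed(nb, nc):
    """Closed form of (24) without theta^2 P_ab P_ac."""
    n = nb + nc
    return (1 - Fraction(nc - 1, nb) * (H(n - 1) - H(nc - 1))) / (nb + 1)


if __name__ == "__main__":
    print(same_branch_sum(2, 2), same_branch_closed(2, 2))
    print(basal_sum(2, 2), basal_closed(2, 2))
```

Exact fractions are used so that any mismatch reflects a transcription error rather than rounding.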

We illustrate the use of Theorem 5.1 with two simple applications. First, it can be used to calculate the sample frequency spectrum of a site that has undergone two mutations. Figure 3 shows the sample frequency spectrum conditional on $E_2$, for a diallelic model with alleles 1 and 2, $a = 1$ being ancestral [so $P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$]. Suppose we have a large quantity of SNP data, which we incorrectly assume to satisfy the infinite sites model. The occasional second mutation will distort the frequency spectrum. Figure 3 shows that if we had used equation (1) to predict the sample frequency spectrum at a site that had in fact undergone two mutation events, then the main effect would be to slightly underestimate the probability that a mutant is seen in moderate to high frequency, and to grossly overestimate the probability that a mutant is at very low frequency. This does not mean, however, that we should expect the sample frequency spectrum of a site undergoing at most two mutation events to be strongly affected. An upper bound on this effect can be found by considering the distribution of the number of mutation events (Tavaré, 1984):

$$p(E_s) = \frac{n - 1}{\theta} \sum_{j=1}^{n-1} (-1)^{j-1} \binom{n - 2}{j - 1} \left(\frac{\theta}{j + \theta}\right)^{s+1}. \tag{26}$$

Figure 3. The sample frequency spectrum for a diallelic model with $n = 40$, conditional on one or two mutation events having occurred (line and stacked bars, respectively), as $\theta \to 0$. Monomorphic samples that are a result of two mutations have been included ($E_2^{\mathcal{S}}$ and $E_2^{\mathcal{B}}$); hence, to be directly comparable the plot for one mutation has been scaled down to sum to $p(E_2^{\mathcal{N}} \mid E_2) + p(E_2^{\mathcal{NN}} \mid E_2)$.

For example, suppose we compare a histogram of allele frequencies from SNP data with the distribution (1). When θ = 0.01, equation (26) tells us that polymorphic sites resulting from more than one mutation will make up at most 2.3% of the area under the histogram. In other words, we expect the effect of recurrent mutation on the sample frequency spectrum of a randomly chosen polymorphic site to be small. For recurrent mutation to have an appreciable effect, it requires a higher mutation rate (see Table 1).

Table 1.

The probability that the genealogy for a sample of n = 40 gametes contains 0, 1, 2, or more than two mutations, and the probability of more than one mutation conditional on at least one.

θ       p(E0)    p(E1)    p(E2)    p(E≥3)   p(E≥2 | E≥1)
0.001   0.9958   0.0042   0.0000   0.0000   0.0023
0.01    0.9584   0.0406   0.0009   0.0000   0.0229
0.05    0.8100   0.1691   0.0192   0.0017   0.1099
0.1     0.6586   0.2702   0.0601   0.0111   0.2085
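The entries of Table 1 follow from equation (26). Because (26) is an alternating sum of large binomial coefficients, the sketch below uses exact rational arithmetic rather than floating point; the function name `prob_num_mutations` is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_num_mutations(n, s, theta):
    """Evaluate p(E_s) from equation (26) exactly; theta should be a Fraction."""
    theta = Fraction(theta)
    total = sum(
        (-1) ** (j - 1) * comb(n - 2, j - 1) * (theta / (j + theta)) ** (s + 1)
        for j in range(1, n)
    )
    return (n - 1) / theta * total


if __name__ == "__main__":
    theta = Fraction(1, 100)
    p0 = prob_num_mutations(40, 0, theta)
    p1 = prob_num_mutations(40, 1, theta)
    # Probability of more than one mutation, conditional on at least one.
    print(float(p0), float(p1), float((1 - p0 - p1) / (1 - p0)))
```

With n = 40 and θ = 0.01 this corresponds to the second row of Table 1.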

The diallelic model considered above can be seen as a crude approximation of the evolution of a single nucleotide, in which say the transversion rate is zero. A second application of Theorem 5.1 is to investigate more realistic models of sequence evolution, also incorporating uncertainty over the allele of the MRCA. Now let K = 4, representing the nucleotides A, G, C, and T. One way to set a general matrix P is to match its entries to empirical estimation of rates of mutation; for illustration we use those reported in Table 1 of Tamura and Nei (1993) for a human mtDNA sequence dataset (normalizing so that each nucleotide has the same overall mutation rate). Suppose our prior probabilities for the identity of the ancestral nucleotide are (πA, πG, πC, πT); they may be based on, for example, an outgroup sequence allowing for some probability of error, on the stationary distribution of P, or on empirical base frequencies. Here we use the last option (Tamura and Nei, 1993): (πA, πG, πC, πT) = (0.321, 0.132, 0.314, 0.233). Equation (17) makes explicit how we should use prior information on the distribution of the MRCA. For example, if we observe an A-G polymorphism, then:

$$p_u(n_A\mathbf{e}_A + n_G\mathbf{e}_G \mid E_1, \{n_A, n_G > 0\}) = \frac{\sum_{a \in \{A, G\}} \pi_a\, p(n_a\mathbf{e}_a + n_{\bar a}\mathbf{e}_{\bar a}, E_1)}{p(E_1, \{n_A, n_G > 0\})} = \frac{\pi_A P_{AG}\, n_G^{-1} + \pi_G P_{GA}\, n_A^{-1}}{H_{n-1}\,(\pi_A P_{AG} + \pi_G P_{GA})} + O(\theta). \tag{27}$$

The subscript $u$ is used to denote that the allele of the MRCA is unknown here. We sum $a$ over possible ancestral alleles; the mutant allele is denoted $\bar a$. The sample frequency spectrum according to (27), ignoring terms of $O(\theta)$, is shown in Figure 4. As is clear from the figure, uncertainty over the MRCA introduces modes at both $n_G = 1$ and $n_G = n - 1$, with relative heights determined by $\pi_A P_{AG}$ and $\pi_G P_{GA}$. By contrast, a classical approach is equivalent to assuming $\pi_A P_{AG} = \pi_G P_{GA}$, leading to a symmetric frequency spectrum; for this reason it is common to report a folded spectrum by binning values for each $(n_G, n - n_G)$ pair, $n_G = 1, \ldots, n - 1$. Figure 4 demonstrates how additional prior information on the ancestral allele can be incorporated properly, in this example leading to a shift in favour of the mode at $n_G = 1$ so that the spectrum is no longer symmetric. We also repeated this analysis using equations (19), (21), (22), and (24) to allow for up to two mutations to explain a diallelic site, with the sample frequency spectrum left almost unchanged, as discussed above.
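The leading term of (27) is simple to tabulate. In the sketch below the prior probabilities 0.321 and 0.132 are the empirical base frequencies quoted above, but the transition probabilities `P_AG` and `P_GA` are placeholder values chosen only for illustration (they are not the Tamura and Nei estimates), and the function name `unknown_ancestor_sfs` is hypothetical.

```python
from fractions import Fraction


def unknown_ancestor_sfs(n, pi_A, pi_G, P_AG, P_GA):
    """Leading term of (27): probability of n_G copies of G, n_G = 1..n-1."""
    h = sum(Fraction(1, j) for j in range(1, n))  # H_{n-1}
    w_A = pi_A * P_AG  # weight of "A ancestral, G mutant"
    w_G = pi_G * P_GA  # weight of "G ancestral, A mutant"
    return [
        (w_A / n_G + w_G / (n - n_G)) / (h * (w_A + w_G))
        for n_G in range(1, n)
    ]


if __name__ == "__main__":
    # Placeholder transition probabilities (an assumption for this example).
    P_AG = Fraction(1, 2)
    P_GA = Fraction(1, 2)
    sfs = unknown_ancestor_sfs(40, Fraction("0.321"), Fraction("0.132"), P_AG, P_GA)
    print(float(sfs[0]), float(sfs[-1]))
```

When the two weights are equal the list is symmetric, which corresponds to the folded spectrum described above; unequal weights tilt it towards one of the two modes.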

Figure 4.

The sample frequency spectrum of a sample of size n = 40 under realistic assumptions about rates of mutation. The rate matrix P and a distribution over the ancestral allele are based on empirical data described in the main text [see equation (27)]. Also shown is the usual, infinite sites, frequency spectrum when the ancestral allele is unknown (i.e. the folded frequency spectrum, unfolded here for comparison).

We can sum over all possible outcomes for n, b, and c in Theorem 5.1 in order to find the probabilities of the four events illustrated in Figure 2.

Theorem 5.2. The distribution of the number of mutation events satisfies

$$p(E_s)=\sum_{j=0}^{\infty}\theta^{s+j}\,c_{n-1}^{(s+j)}\,(-1)^j\binom{s+j}{j}. \tag{28}$$

The series converges for θ < 1. Further, as θ → 0,

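$$p(E_{2\mathcal{N}}^{(b,c)})=P_{ab}P_{bc}\,p(E_{2\mathcal{N}})=\theta^2P_{ab}P_{bc}\left[H_n+\frac{1}{n}-2\right]+O(\theta^3), \tag{29}$$

$$p(E_{2\mathcal{NN}}^{(b,c)})=P_{ab}P_{ac}\,p(E_{2\mathcal{NN}})=\theta^2P_{ab}P_{ac}\left[\frac{(H_{n-1})^2}{2}-\frac{H_{n-1}^{(2)}}{2}-H_n-\frac{1}{n}+2\right]+O(\theta^3), \tag{30}$$

$$p(E_{2\mathcal{S}}^{(b,c)})=P_{ab}P_{bc}\,p(E_{2\mathcal{S}})=\theta^2P_{ab}P_{bc}\left[1-\frac{1}{n}\right]+O(\theta^3), \tag{31}$$

$$p(E_{2\mathcal{B}}^{(b,c)})=P_{ab}P_{ac}\,p(E_{2\mathcal{B}})=\theta^2P_{ab}P_{ac}\left[H_{n-1}^{(2)}-1+\frac{1}{n}\right]+O(\theta^3). \tag{32}$$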

Proof. See Appendix A.
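As a numerical sanity check on the series in (28), one can look at the case s = 0. The sketch below assumes the standard coalescent fact (not stated in this section) that the probability of no mutation in the history of n lineages is the product of k/(k + θ) over k = 1, …, n − 1, and uses the sum formula for c_n^(s) quoted in the proof in Appendix A. Function names are ours.

```python
from fractions import Fraction
from math import comb


def roman_harmonic(n, s):
    """c_n^{(s)} = sum_{j=1}^{n} C(n, j) (-1)^(j-1) / j^s."""
    return sum(Fraction(comb(n, j) * (-1) ** (j - 1), j ** s)
               for j in range(1, n + 1))


def p_no_mutation_series(n, theta, terms=60):
    """Right-hand side of (28) with s = 0, truncated after `terms` terms."""
    return sum((-theta) ** j * roman_harmonic(n - 1, j) for j in range(terms))


def p_no_mutation_product(n, theta):
    """Assumed closed form: prod_{k=1}^{n-1} k / (k + theta)."""
    out = Fraction(1)
    for k in range(1, n):
        out *= Fraction(k) / (k + theta)
    return out


theta = Fraction(1, 10)
print(float(p_no_mutation_series(6, theta)), float(p_no_mutation_product(6, theta)))
```

The two numbers agree to within truncation error for θ < 1, which is the convergence condition stated for the series.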

As an application of these results, one can ask: Conditional on precisely two mutation events, what are the probabilities of the four possible outcomes illustrated in Figure 2? Using equations (28) and (29)–(32) and letting θ → 0 yields:

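$$p(E_{2\mathcal{N}}\mid E_2)=\frac{H_n+\frac{1}{n}-2}{c_{n-1}^{(2)}},\qquad p(E_{2\mathcal{NN}}\mid E_2)=1-\frac{H_{n-1}^{(2)}+H_n+\frac{1}{n}-2}{c_{n-1}^{(2)}},$$

$$p(E_{2\mathcal{S}}\mid E_2)=\frac{1-\frac{1}{n}}{c_{n-1}^{(2)}},\qquad p(E_{2\mathcal{B}}\mid E_2)=\frac{H_{n-1}^{(2)}-1+\frac{1}{n}}{c_{n-1}^{(2)}}.$$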

Since $c_{n-1}^{(2)}$ grows like (log n)² with increasing n, we have that p(E2𝒩 | E2) declines to zero like 1/log n, p(E2𝒮 | E2) and p(E2ℬ | E2) decline to zero like 1/(log n)², while p(E2𝒩𝒩 | E2) slowly approaches 1 (see Figure 5). Using Theorem 5.1, in a similar manner one could find the relative probabilities of these topologies conditional on the data.
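The four conditional probabilities above can be tabulated exactly for any n. This is a small sketch with our own function names; it uses the closed form c_{n−1}^(2) = [(H_{n−1})² + H_{n−1}^(2)]/2, which follows from the definition of c_n^(s).

```python
from fractions import Fraction


def H(m, power=1):
    """Generalized harmonic number H_m^{(power)} as an exact fraction."""
    return sum(Fraction(1, j ** power) for j in range(1, m + 1))


def topology_probs(n):
    """Probabilities of the four topologies given exactly two mutations (theta -> 0)."""
    h1, h2 = H(n - 1), H(n - 1, 2)
    c2 = (h1 ** 2 + h2) / 2                                      # c_{n-1}^{(2)}
    weights = {
        "nested": H(n) + Fraction(1, n) - 2,                     # bracket in (29)
        "nonnested": h1 ** 2 / 2 - h2 / 2 - H(n) - Fraction(1, n) + 2,  # (30)
        "same_branch": 1 - Fraction(1, n),                       # (31)
        "basal": h2 - 1 + Fraction(1, n),                        # (32)
    }
    return {name: w / c2 for name, w in weights.items()}


for n in (3, 10, 100):
    print(n, {k: round(float(v), 4) for k, v in topology_probs(n).items()})
```

Because the arithmetic is exact, the four values can be checked to sum to one for every n, which is a useful guard against sign errors when transcribing (29)–(32).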

Figure 5.

The probability of each of the four possible topologies conditional on two mutation events as θ → 0. Plots are for two nonnested mutations (E2𝒩𝒩), two nested mutations (E2𝒩), two mutations on the same branch (E2𝒮), and two mutations on the basal branches (E2ℬ).

6. Observed patterns of polymorphism

We can partition the space of coalescent trees in two ways: either by the topology of the tree, as in Sections 3 and 4, or by the observed pattern of polymorphism. In practice, it is the latter that is important since only these are known. Our next goal is therefore to find expressions for the sample frequency spectrum conditional on observed events. Assuming at most two mutations, the only possible observed outcomes are:

  • O1: No variation, with all alleles ancestral. The sample is of the form n = nea.

  • O1𝒮: No variation, but the observed allele differs from that of the MRCA. The sample is of the form n = neb, where b ≠ a.

  • O2: A regular diallelic polymorphism, with both the ancestral allele and a mutant allele observed. The sample is of the form n = naea + nbeb, where b ≠ a.

  • O2𝒮: A diallelic polymorphism in which the ancestral allele is not observed; instead, we see two mutant alleles. The sample is of the form n = nbeb + ncec, where a, b, c are all distinct.

  • O3: A triallelic polymorphism, with one observed allele ancestral. The sample is of the form n = naea + nbeb + ncec, where a, b, c are all distinct.

Note that we assume that the allele of the MRCA is known without error. However, if one observed O1𝒮 or O2𝒮 in practice, then another explanation is that the allele of the MRCA inferred from the outgroup is incorrect, and that a substitution has occurred on the lineage between the MRCA of the sample and the outgroup.

Using the superscript T to denote matrix transpose, we have the following theorem:

Theorem 6.1. As θ → 0, the joint probability of two mutations occurring and the observed pattern of polymorphism is given by:

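$$p(O_1,E_2)=\theta^2(P^2)_{aa}\left(1-\frac{1}{n}\right)+O(\theta^3), \tag{33}$$

$$p(O_{1\mathcal{S}},E_2)=\theta^2(PP^T)_{aa}\left(H_{n-1}^{(2)}-1+\frac{1}{n}\right)+O(\theta^3), \tag{34}$$

$$p(O_2,E_2)=\theta^2\left[(P^2)_{aa}\left(H_n+\frac{1}{n}-2\right)+\left(1-(P^2)_{aa}\right)\left(1-\frac{1}{n}\right)+(PP^T)_{aa}\left(\frac{(H_{n-1})^2}{2}-\frac{H_{n-1}^{(2)}}{2}-H_n-\frac{1}{n}+2\right)\right]+O(\theta^3), \tag{35}$$

$$p(O_{2\mathcal{S}},E_2)=\theta^2\left(1-(PP^T)_{aa}\right)\left(H_{n-1}^{(2)}-1+\frac{1}{n}\right)+O(\theta^3), \tag{36}$$

$$p(O_3,E_2)=\theta^2\left[\left(1-(P^2)_{aa}\right)\left(H_n+\frac{1}{n}-2\right)+\left(1-(PP^T)_{aa}\right)\left(\frac{(H_{n-1})^2}{2}-\frac{H_{n-1}^{(2)}}{2}-H_n-\frac{1}{n}+2\right)\right]+O(\theta^3). \tag{37}$$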

Proof. See Appendix A.

As above, we may also consider the relative probability of these events conditional on precisely two mutations having occurred. Again, this entails normalizing equations (33)–(37) by dividing by $p(E_2)=\theta^2c_{n-1}^{(2)}+O(\theta^3)$, and then letting θ → 0. The relative probabilities of these outcomes are illustrated in Figure 6, for a simple 4 × 4 mutation model with each nondiagonal entry in P equal to 1/3.
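The normalization just described can be sketched in a few lines. The code below is illustrative only (function and variable names are ours); it takes P as a nested list with zero diagonal, evaluates the leading coefficients of (33)–(37), and divides by c_{n−1}^(2) = [(H_{n−1})² + H_{n−1}^(2)]/2.

```python
from fractions import Fraction


def H(m, power=1):
    """Generalized harmonic number H_m^{(power)} as an exact fraction."""
    return sum(Fraction(1, j ** power) for j in range(1, m + 1))


def outcome_probs_given_two_mutations(n, P, a=0):
    """Probabilities of O1, O1S, O2, O2S, O3 given E2, as theta -> 0."""
    K = len(P)
    p2 = sum(P[a][b] * P[b][a] for b in range(K))    # (P^2)_aa
    ppt = sum(P[a][b] ** 2 for b in range(K))        # (P P^T)_aa
    h1, h2 = H(n - 1), H(n - 1, 2)
    nested = H(n) + Fraction(1, n) - 2
    nonnested = h1 ** 2 / 2 - h2 / 2 - H(n) - Fraction(1, n) + 2
    same = 1 - Fraction(1, n)
    basal = h2 - 1 + Fraction(1, n)
    c2 = (h1 ** 2 + h2) / 2
    coeffs = {
        "O1": p2 * same,
        "O1S": ppt * basal,
        "O2": p2 * nested + (1 - p2) * same + ppt * nonnested,
        "O2S": (1 - ppt) * basal,
        "O3": (1 - p2) * nested + (1 - ppt) * nonnested,
    }
    return {name: value / c2 for name, value in coeffs.items()}


# Four alleles, each mutation equally likely to land on any other allele.
jc = [[Fraction(0) if i == j else Fraction(1, 3) for j in range(4)] for i in range(4)]
print({k: round(float(v), 4) for k, v in outcome_probs_given_two_mutations(40, jc).items()})
```

For n = 2 the triallelic outcome has probability zero, as it must, since two sampled alleles cannot show three types.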

Figure 6.

The probability of each possible observable outcome given two mutation events, as θ → 0: a triallelic polymorphism (O3), a regular polymorphism with one allele ancestral and one mutant (O2), a polymorphism in which both observed alleles are mutant (O2𝒮), the entire sample is ancestral (O1), and the entire sample has a mutant allele (O1𝒮). Here we take a mutation model of four alleles, in which any mutation is to one of the other alleles with probability 1/3.

Further variations on the arguments of Theorem 6.1 are possible. For example, one can proceed in a similar vein by summing over the relevant probabilities in Theorem 5.1, in order to find closed-form expressions for the joint probability of the above events together with a particular sample configuration, n. The resulting expressions are easy to obtain but do not simplify very much, so we do not give them explicitly. There is however one important exception which we now consider in further detail. If we observe a triallelic site then we know that at least two mutations must have taken place, and we would like to know the joint sample frequency spectrum for the number of copies of each of the two mutant alleles.

Theorem 6.2. As θ → 0,

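$$p(\mathbf{n}\mid O_3)=\frac{1}{C}\left[P_{ab}P_{bc}\,d(n_a,n_b,n_c)+P_{ac}P_{cb}\,d(n_a,n_c,n_b)+P_{ab}P_{ac}\left(\frac{1}{n_bn_c}-d(n_a,n_b,n_c)-d(n_a,n_c,n_b)\right)\right], \tag{38}$$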

if n = naea + nbeb + ncec, and 0 otherwise, where d(na, nb, nc) is given by equation (16), and

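$$C=\left[\left(1-(P^2)_{aa}\right)\left(H_n+\frac{1}{n}-2\right)+\left(1-(PP^T)_{aa}\right)\left(\frac{(H_{n-1})^2}{2}-\frac{H_{n-1}^{(2)}}{2}-H_n-\frac{1}{n}+2\right)\right]. \tag{39}$$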

Proof. See Appendix A.

We remark that there is no need to condition on E2 in Theorem 6.2, since O3 requires at least two mutations, and more than two mutations occur with probability O(θ³). Furthermore, as we noted in the introduction, Theorem 6.2 is unchanged when we allow Pii > 0 for any i, since such a “self”-mutation could not lead to O3 without additional mutations. This is true of all our results for which we condition on O3, and so for the remainder of this section and for Section 7 we can drop the constraint that the diagonal of P is zero.

Equation (38) simplifies a great deal when the mutation transition matrix takes on a particular form, which we state in the following corollary.

Corollary 6.1. Suppose the mutation transition matrix P satisfies (2). Then, as θ → 0,

$$p(\mathbf{n}\mid O_3)=\frac{P_{ab}P_{ac}}{C\,n_bn_c}, \tag{40}$$

if n = naea + nbeb + ncec, and 0 otherwise, where the normalizing constant C is given by (39).

For the remainder of this section we continue to assume (2) holds. Employing similar arguments, we have that, conditional on observing two particular alleles b and c, the sample frequency spectrum is

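$$\lim_{\theta\to0}p(\mathbf{n}\mid O_3,\{n_b,n_c>0\})=\frac{(n_bn_c)^{-1}}{(H_{n-1})^2-H_{n-1}^{(2)}},\qquad \mathbf{n}=n_a\mathbf{e}_a+n_b\mathbf{e}_b+n_c\mathbf{e}_c, \tag{41}$$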

and the marginal spectrum for a particular mutant allele, given that we have observed it, is

$$\lim_{\theta\to0}p(n_b\mid O_3,\{n_b>0\})=\frac{H_{n-n_b-1}\,(n_b)^{-1}}{(H_{n-1})^2-H_{n-1}^{(2)}},\qquad 1\le n_b\le n-2. \tag{42}$$

One can check that the normalizing constant is correct by summing (42) over nb, and using the identity

$$\sum_{j=1}^{n-1}\frac{H_{n-j}}{j}=(H_n)^2-H_n^{(2)},$$

for n ≥ 2, which is easily verified by induction on n.

Similarly, the marginal spectrum for a particular allele, given that we have observed it and conditional on the number of copies of the other mutant allele, is

$$\lim_{\theta\to0}p(n_b\mid O_3,\{n_b>0\},n_c)=\frac{(n_b)^{-1}}{H_{n-n_c-1}},\qquad 1\le n_b\le n-n_c-1,\quad n_c\ge1. \tag{43}$$

Interestingly, the distribution for nb is proportional to $n_b^{-1}$ regardless of nc. More generally, Corollary 6.1 tells us: as θ → 0, and given that we observe three particular alleles, one of which is the ancestral allele, the sample frequency spectrum for the two mutant alleles is proportional to the inverse of the product of the number of observed copies of each of the two mutant alleles (scaled by the relative rate PabPac of appearance of these two alleles). This result is a straightforward generalization of the classical result (1), with some mild conditions on P.
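Equations (41)–(43) can be checked against each other by brute-force enumeration of the sample configurations. The following sketch (our own helper names, exact rational arithmetic) builds the joint spectrum (41) and compares its marginal and conditional distributions with (42) and (43).

```python
from fractions import Fraction


def H(m, power=1):
    """Generalized harmonic number H_m^{(power)} as an exact fraction."""
    return sum(Fraction(1, j ** power) for j in range(1, m + 1))


def triallelic_spectrum(n):
    """Joint spectrum (41): {(n_b, n_c): probability}, with n_a = n - n_b - n_c >= 1."""
    norm = H(n - 1) ** 2 - H(n - 1, 2)
    return {(nb, nc): Fraction(1, nb * nc) / norm
            for nb in range(1, n - 1) for nc in range(1, n - nb)}


def marginal_nb(n, nb):
    """Marginal spectrum (42) for one mutant allele."""
    return H(n - nb - 1) / nb / (H(n - 1) ** 2 - H(n - 1, 2))


def conditional_nb(n, nb, nc):
    """Conditional spectrum (43) given n_c copies of the other mutant allele."""
    return Fraction(1, nb) / H(n - nc - 1)


n = 12
spec = triallelic_spectrum(n)
print(sum(spec.values()))                      # total probability
print(float(spec[(1, 1)]), float(spec[(5, 5)]))
```

Summing the joint spectrum over nc reproduces (42), and dividing a joint entry by the corresponding marginal over nb reproduces (43), which also confirms the harmonic-number identity used for the normalizing constant.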

There is another way to arrive at Corollary 6.1 when P satisfies (2), and which immediately generalizes to any number of observed alleles, Ol, 2 ≤ l ≤ K. An l-allele version of (2) is to suppose that Pij = Pj whenever j is an observed mutant allele and i is a different observed mutant allele or the observed ancestral allele. We refer to this condition as parent-independence among observed alleles.

Theorem 6.3. Suppose that we observe l distinct alleles, one of which is the ancestral allele a and the rest are labelled by the set Λ ⊆ {1, …, K}\{a}. Further suppose that P is parent-independent among observed alleles. Then, as θ → 0,

$$p(\mathbf{n}\mid O_l)\propto\prod_{j\in\Lambda}\frac{P_j}{n_j}.$$

Proof. See Appendix A.

7. The age of a mutant allele

In this section we will be interested in the population limits na/n → fa, nb/n → fb, and nc/n → fc as n → ∞, and it is implicit throughout that we let θ → 0 and that a, b, c are all distinct. We assume that there have been no more than two mutation events in the history of the population, so fa + fb + fc = 1. Let f = (fa, fb, fc), and Ab, Ac denote the ages at which mutations occurred that gave rise to alleles b and c respectively. Kimura and Ohta (1973) showed that the expected age of a single mutant allele at frequency f in the population is

$$-\frac{2f}{1-f}\ln f.$$

Griffiths and Tavaré (2003) found the expected age of the younger of two nested mutant alleles to be

$$\mathbb{E}[A_c\mid E_{2\mathcal{N}}^{(b,c)},f]=-2f_c\,\frac{(1+f_c)\ln f_c+2(1-f_c)}{2f_c\ln f_c+(1+f_c)(1-f_c)}. \tag{44}$$

Here, we extend this result to find the expected age of the younger of two nonnested mutant alleles. From this it is straightforward to obtain the expected age of the younger allele at a triallelic site, regardless of whether the mutations are nested or nonnested—and indeed regardless of whether we know which of the two mutant alleles is younger.

Theorem 7.1. When it is known which of two nonnested mutant alleles is the younger, the expected age of the younger allele is

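$$\mathbb{E}[A_c\mid E_{2\mathcal{NN}}^{(b,c)},f]=2f_c\,\frac{\left[1+f_c-\dfrac{(1-f_c)^2}{f_b}\right]\ln f_c+2(1-f_c)+\dfrac{(1-f_c)^3\ln(f_b+f_c)}{f_b(1-f_b-f_c)}}{\dfrac{(1-f_c)^3}{f_b+f_c}-2f_c\ln f_c-(1+f_c)(1-f_c)}. \tag{45}$$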

Proof. See Appendix A.

Let A denote the age of the younger of two mutant alleles at a triallelic site, when we do not know which of the two is younger. Our goal now is to compute its expectation given the frequencies of the two mutant alleles in the population. This can be achieved by averaging over the possible topologies that could have given rise to a triallelic site:

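$$\begin{aligned}\mathbb{E}[A\mid O_3,f]={}&\mathbb{E}[A_c\mid E_{2\mathcal{N}}^{(b,c)},f]\,p(E_{2\mathcal{N}}^{(b,c)}\mid O_3,f)+\mathbb{E}[A_c\mid E_{2\mathcal{NN}}^{(b,c)},f]\,p(E_{2\mathcal{NN}}^{(b,c)}\mid O_3,f)\\&+\mathbb{E}[A_b\mid E_{2\mathcal{N}}^{(c,b)},f]\,p(E_{2\mathcal{N}}^{(c,b)}\mid O_3,f)+\mathbb{E}[A_b\mid E_{2\mathcal{NN}}^{(c,b)},f]\,p(E_{2\mathcal{NN}}^{(c,b)}\mid O_3,f).\end{aligned} \tag{46}$$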

The expectation in each term on the right-hand side is given by equations (44), (45) and their analogues (interchanging the roles of b and c). The probabilities on the right-hand side are also known, since

$$p(E_{2\mathcal{N}}^{(b,c)}\mid O_3,f)=\frac{p(f,E_{2\mathcal{N}}^{(b,c)})}{p(f,O_3)},$$

and these terms are found by letting nb/n → fb and nc/n → fc while n → ∞ in equations (18), (37), and (38). A similar argument applies for the other three terms. We obtain

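$$p(E_{2\mathcal{N}}^{(b,c)}\mid O_3,f)=\frac{P_{ab}P_{bc}}{D(1-f_c)^2}\left(1+\frac{1}{f_c}+\frac{2\ln f_c}{1-f_c}\right),$$

$$p(E_{2\mathcal{NN}}^{(b,c)}\mid O_3,f)=\frac{P_{ab}P_{ac}}{D}\left[\frac{1}{f_c(f_b+f_c)}-\frac{1}{(1-f_c)^2}\left(1+\frac{1}{f_c}+\frac{2\ln f_c}{1-f_c}\right)\right],$$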

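with similar expressions for $p(E_{2\mathcal{N}}^{(c,b)}\mid O_3,f)$ and $p(E_{2\mathcal{NN}}^{(c,b)}\mid O_3,f)$, where D is the normalizing constant that makes these four probabilities sum to one:

$$D=\frac{P_{ab}(P_{bc}-P_{ac})}{(1-f_c)^2}\left(1+\frac{1}{f_c}+\frac{2\ln f_c}{1-f_c}\right)+\frac{P_{ac}(P_{cb}-P_{ab})}{(1-f_b)^2}\left(1+\frac{1}{f_b}+\frac{2\ln f_b}{1-f_b}\right)+\frac{P_{ab}P_{ac}}{f_bf_c}.$$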

Substituting these expressions, along with (44) and (45), into (46), we obtain the following:

Theorem 7.2. The expected age of the younger of two mutant alleles at a triallelic site is

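$$\begin{aligned}\mathbb{E}[A\mid O_3,f]={}&\frac{2P_{ab}P_{ac}}{D(1-f_b-f_c)}\left(\frac{1}{f_b}+\frac{1}{f_c}\right)\ln(f_b+f_c)\\&+\frac{2P_{ab}}{D}\left[(P_{ac}-P_{bc})\frac{1+f_c}{(1-f_c)^3}-\frac{P_{ac}}{f_b(1-f_c)}\right]\ln f_c\\&+\frac{2P_{ac}}{D}\left[(P_{ab}-P_{cb})\frac{1+f_b}{(1-f_b)^3}-\frac{P_{ab}}{f_c(1-f_b)}\right]\ln f_b\\&+\frac{4P_{ab}(P_{ac}-P_{bc})}{D(1-f_c)^2}+\frac{4P_{ac}(P_{ab}-P_{cb})}{D(1-f_b)^2}.\end{aligned}$$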

When P additionally satisfies (2) for the observed alleles a, b, and c, this expression simplifies to

$$\mathbb{E}[A\mid O_3,f]=-\frac{2f_b}{1-f_b}\ln f_b-\frac{2f_c}{1-f_c}\ln f_c+\frac{2(f_b+f_c)}{1-(f_b+f_c)}\ln(f_b+f_c). \tag{47}$$

This curious result, for mutation models satisfying (2), tells us that the mean age of the younger of two mutant alleles at a triallelic site is equal to the sum of the mean ages of two independent mutations at frequencies fb and fc, minus the mean age of a single mutant at frequency fb + fc. Equation (47) is plotted in Figure 8, for various values of fb and fc.
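The reduction from Theorem 7.2 to (47) can be checked numerically. The sketch below is ours, not the authors' program: it codes the Kimura and Ohta single-mutant age, equation (47), the nested-case result (44), and the general expression of Theorem 7.2 with its constant D. Parent-independence among a, b, c is taken here to mean P_bc = P_ac and P_cb = P_ab.

```python
from math import log


def age_single(f):
    """Expected age of a single mutant at frequency f (Kimura and Ohta form)."""
    return -2 * f / (1 - f) * log(f)


def age_younger_pim(fb, fc):
    """Equation (47): parent-independent case."""
    return age_single(fb) + age_single(fc) - age_single(fb + fc)


def age_nested(fc):
    """Equation (44): younger of two nested mutations."""
    return (-2 * fc * ((1 + fc) * log(fc) + 2 * (1 - fc))
            / (2 * fc * log(fc) + (1 + fc) * (1 - fc)))


def _z(f):
    return (1 + 1 / f + 2 * log(f) / (1 - f)) / (1 - f) ** 2


def age_younger_general(fb, fc, Pab, Pac, Pbc, Pcb):
    """Theorem 7.2 for an arbitrary transition matrix."""
    D = Pab * (Pbc - Pac) * _z(fc) + Pac * (Pcb - Pab) * _z(fb) + Pab * Pac / (fb * fc)
    s = fb + fc
    t1 = 2 * Pab * Pac / (1 - s) * (1 / fb + 1 / fc) * log(s)
    t2 = 2 * Pab * ((Pac - Pbc) * (1 + fc) / (1 - fc) ** 3 - Pac / (fb * (1 - fc))) * log(fc)
    t3 = 2 * Pac * ((Pab - Pcb) * (1 + fb) / (1 - fb) ** 3 - Pab / (fc * (1 - fb))) * log(fb)
    t4 = (4 * Pab * (Pac - Pbc) / (1 - fc) ** 2
          + 4 * Pac * (Pab - Pcb) / (1 - fb) ** 2)
    return (t1 + t2 + t3 + t4) / D


print(age_younger_pim(0.25, 0.25))
print(age_younger_general(0.25, 0.25, 0.4, 0.6, 0.6, 0.4))
```

Setting P_ac = 0 removes every topology except the nested one with c younger, and the general expression then collapses to (44), which gives a second consistency check.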

Figure 8.

The expected age of the younger of two mutant alleles at a triallelic site [equation (47)]. The mutant alleles are at frequencies fb (annotated) and fc (x-axis). The expected age of a single mutant at a diallelic site and at frequency fc is shown for comparison (dotted line).

Unfortunately, a corresponding expression for the expected age of the older mutation is not analytically tractable, even if we restrict our attention to nested mutations. Hobolth and Wiuf (2009) outline a method of numerical approximation which could also be adapted for nonnested mutations, but we do not pursue this here.

8. Accuracy

It would be interesting to investigate the accuracy of the expressions given in the previous section. For simplicity, we focus on equation (41), and we assume a simple Jukes-Cantor model of mutation in which K = 4, and the daughter allele of each mutation is equally likely, so that off-diagonal entries of P are all 1/3. We wrote a program to solve numerically the system of equations defined by (3), in order to obtain exact results for the sample frequency spectrum for nonzero θ. By solving this system for all sample configurations of a given size, we could calculate exact numerical values of p(n | {nb, nc > 0}) for each n. We measured the accuracy of equation (41) by its unsigned relative error:

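$$\left|\frac{\left[\lim_{\theta\to0}p(\mathbf{n}\mid O_3,\{n_b,n_c>0\})\right]-p(\mathbf{n}\mid\{n_b,n_c>0\})}{p(\mathbf{n}\mid\{n_b,n_c>0\})}\right|\times100\%,$$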

for each n. Errors in (41) are a consequence of the fact that in reality θ is nonzero and that there may have been more than two mutation events giving rise to the triallelic sample.

For a given sample size and mutation rate, we summarize the discrepancy between the estimated and actual sample frequency spectrum by the largest relative error across all configurations. Results are summarized in Table 2.

Table 2.

Maximum unsigned relative error (%) across configurations, of equation (41) (top) and equation (48) (bottom), for various sample sizes, n, and mutation rates, θ.

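Triallelic, equation (41)

| n | θ = 0.001 | θ = 0.01 | θ = 0.05 | θ = 0.1 | θ = 0.5 | θ = 1.0 |
|---|---|---|---|---|---|---|
| 10 | 0.07 | 0.72 | 3.52 | 6.89 | 30.21 | 57.02 |
| 20 | 0.13 | 1.25 | 6.15 | 12.04 | 53.74 | 112.16 |
| 30 | 0.17 | 1.61 | 7.66 | 14.89 | 64.78 | 143.49 |
| 40 | 0.25 | 2.45 | 10.57 | 17.97 | 70.05 | 162.82 |
| 50 | 0.35 | 3.30 | 13.81 | 22.81 | 72.20 | 175.36 |
| 60 | 0.44 | 4.16 | 16.87 | 27.17 | 72.52 | 187.79 |

Quadrallelic, equation (48)

| n | θ = 0.001 | θ = 0.01 | θ = 0.05 | θ = 0.1 | θ = 0.5 | θ = 1.0 |
|---|---|---|---|---|---|---|
| 10 | 0.11 | 1.10 | 5.24 | 9.92 | 35.93 | 82.38 |
| 20 | 0.25 | 2.45 | 11.16 | 20.05 | 67.95 | 181.83 |
| 30 | 0.39 | 3.75 | 16.30 | 28.01 | 90.50 | 266.17 |
| 40 | 0.52 | 5.01 | 20.86 | 34.51 | 108.49 | 342.24 |
| 50 | 0.66 | 6.22 | 24.92 | 39.90 | 123.70 | 412.73 |
| 60 | 0.79 | 7.40 | 28.58 | 44.47 | 136.98 | 479.09 |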

As is clear from the table, accuracy diminishes with increasing θ and also diminishes modestly with increasing n. For application to human SNPs, in which generally 0.001 ≤ θ ≤ 0.01, equation (41) provides an excellent approximation to the sample frequency spectrum. For comparison, Table 2 also shows the maximum relative error incurred when we use the analogous result from Theorem 6.3 for a quadrallelic polymorphism under parent-independence among observed alleles:

$$\lim_{\theta\to0}p(\mathbf{n}\mid O_4,\{n_b,n_c,n_d>0\})\propto\frac{1}{n_bn_cn_d}, \tag{48}$$

where n = naea + nbeb + ncec + nded, and a, b, c, d, are all distinct. It should be noted that the relative error is dependent on the sample configuration. Figure 7 shows the relative error incurred by using (48) for a representative slice of quadrallelic sample configurations, and, as is evident, the size of the unsigned relative error approaches its maximum across configurations near the boundary na = 1. When n = 20 and θ = 0.01 the maximum unsigned relative error of 2.45% is attained at (na, nb, nc, nd) = (1, 6, 6, 7). For samples containing more than one copy of the ancestral allele, the size of the relative error can be substantially less than its maximum, as Figure 7 confirms. We obtained qualitatively similar patterns for other slices, not shown.

Figure 7.

The (signed) relative error from assuming the sample frequency spectrum of a quadrallelic site is proportional to $(n_bn_cn_d)^{-1}$ [equation (48)]. Here, the sample size is n = 20 and the mutation parameter is θ = 0.01. Shown are relative errors for a representative slice through the simplex of configurations, defined by fixing nd = 2. Positive relative errors are shown in dark grey, negative relative errors in light grey.

Errors in the sample frequency spectrum will also cause errors in its application, such as in the estimation of mutation rates and in tests of neutrality. We explore this issue by examining the use of the frequency spectrum to estimate θ. Here we assume a diallelic model (K = 2) in which each mutation toggles an allele between its ancestral and mutant state, as we did for Figure 3. Suppose we have L sites, and record the counts of the number of sites with zero, one, …, n mutant alleles. There are several ways to combine these counts to define an estimator of θ (Achaz, 2009); we focus on Sn, the number of sites observed to be segregating, and ξ1, the number of sites observed to be singleton mutants [having configuration n = (n − 1, 1)]. Assuming at most one mutation event per site, moment estimates using these statistics follow from Watterson (1975) and Fu (1995):

$$\mathbb{E}(S_n)=L\theta H_{n-1}, \tag{49}$$

$$\mathbb{E}(\xi_1)=L\theta, \tag{50}$$

where θ is the population-scaled mutation rate per site. These moments can be corrected to allow for up to two mutation events per site. We have

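$$\mathbb{E}(S_n)=L\sum_{i=1}^{n-1}p((n-i,i)), \tag{51}$$

$$\begin{aligned}&=L\left[1-p(E_0)-p(E_{2\mathcal{S}})-p(E_{2\mathcal{B}})\right]+O(\theta^3),\\&=L\left[\theta H_{n-1}-\frac{\theta^2}{2}\left[(H_{n-1})^2+3H_{n-1}^{(2)}\right]\right]+O(\theta^3),\end{aligned} \tag{52}$$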

using (28), (31) and (32). If in (52) we drop terms of O(θ³), then we obtain a quadratic equation which can be solved to yield θ in terms of 𝔼(Sn)/L. Replacing this expectation with the observed quantity provides a point estimate of θ. A comparison of this method of estimation with the classical estimate from (49) is given in Figure 9. For comparison, we found the true value of 𝔼(Sn)/L as a function of θ, allowing any number of mutation events, by using our program again to solve the system (3) numerically over a grid of θ-values and plugging the results into (51). This numerical estimate of the relationship between θ and 𝔼(Sn)/L is also plotted on Figure 9. Comparison of this curve with (49) and (52) shows that when few sites are segregating, say sn/L < 0.1, it is reasonable to assume at most one mutation event per site, whereas for a higher fraction of segregating sites, say sn/L < 0.25, assuming at most two mutation events per site is still reasonable. When a substantial fraction of sites are segregating, neither assumption is accurate. SNP densities in humans typically have sn/L < 0.05, so errors from using (49) or (52) will not usually be serious. However, this fraction will only grow in the near future with increasing sample sizes, and regions of the genome of high diversity in cutting-edge datasets already exceed this level (The 1000 Genomes Project Consortium, 2010, Supplementary Figure 3). A similar calculation was performed for 𝔼(ξ1):

$$\mathbb{E}(\xi_1)=L\,p((n-1,1))=L\left[\theta-\theta^2\left(H_{n-1}+\frac{3}{2(n-1)}\right)\right]+O(\theta^3), \tag{53}$$

with qualitatively very similar results (Figure 10).
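The two estimators of θ based on the number of segregating sites can be sketched as follows. This is an illustrative implementation under the diallelic toggling model of this section; the function names are ours. The two-mutation estimator drops the O(θ³) term in (52) and takes the root of the resulting quadratic that tends to the classical estimate as the segregating fraction tends to zero.

```python
from math import sqrt


def harmonics(n):
    """Return H_{n-1} and H_{n-1}^{(2)} as floats."""
    h1 = sum(1 / j for j in range(1, n))
    h2 = sum(1 / j ** 2 for j in range(1, n))
    return h1, h2


def expected_seg_fraction(theta, n):
    """E(S_n)/L from (52), ignoring terms of order theta^3."""
    h1, h2 = harmonics(n)
    return theta * h1 - 0.5 * theta ** 2 * (h1 ** 2 + 3 * h2)


def theta_one_mutation(seg_fraction, n):
    """Classical moment estimate from (49)."""
    return seg_fraction / harmonics(n)[0]


def theta_two_mutations(seg_fraction, n):
    """Smaller root of A*theta^2 - H_{n-1}*theta + s/L = 0, from (52)."""
    h1, h2 = harmonics(n)
    A = 0.5 * (h1 ** 2 + 3 * h2)
    disc = h1 ** 2 - 4 * A * seg_fraction
    if disc < 0:
        raise ValueError("segregating fraction too large for the two-mutation approximation")
    return (h1 - sqrt(disc)) / (2 * A)


s_over_L = 0.04   # hypothetical observed fraction of segregating sites
print(theta_one_mutation(s_over_L, 40), theta_two_mutations(s_over_L, 40))
```

The corrected estimate is always at least as large as the classical one, because the quadratic term in (52) accounts for sites where a second mutation has hidden the first.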

Figure 9.

Estimating θ per site by the number s of segregating sites, out of L sites in total. The sample size is n = 40. Estimation of θ assumes at most one mutation event per site [dotted line, (49)], or at most two mutation events per site [dashed line, (52)]. The true relationship between θ and 𝔼(Sn)/L is plotted as a solid line.

Figure 10.

Estimating θ per site by the number ξ1 of singleton segregating sites, out of L sites in total. The sample size is n = 40. Estimation of θ assumes at most one mutation event per site [dotted line, (50)], or at most two mutation events per site [dashed line, (53)]. The true relationship between θ and 𝔼(ξ1)/L is plotted as a solid line.

9. Discussion

We have studied the effect of a second mutation on the sample frequency spectrum of a segregating site, under a model of mutation in which the mutation rate is independent of the current allele but the transitions between alleles are otherwise arbitrary. The problem is made tractable by conditioning on whether or not the two mutations are nested in the genealogy, and as a bonus we also obtain the relative probabilities of these topological events. Other key results include the joint sample frequency spectrum of the two mutant alleles at a triallelic site, and the mean age of the younger of the two alleles in the population. These results take on a particularly simple form when we impose mild additional conditions on P, namely (2): Then, the sample frequency spectrum (1) generalizes to $\propto(n_bn_c)^{-1}$, and the expected age of the younger mutant is a linear combination of the result for single mutants [equation (47)]. It would be interesting to obtain a more intuitive argument for this formula.

At present, several large-scale projects for various species are under way to resequence the genomes of many individuals (hundreds to thousands) in a population. Hence, it may soon become possible to include triallelic polymorphisms in population genomic studies. Indeed, triallelic sites are becoming interesting objects of study in their own right (Hodgkinson and Eyre-Walker, 2010). We believe that the theoretical results presented in this paper should prove useful in that regard.

Highlights.

  • We model the frequency spectrum of a site that has undergone two mutations.

  • Our results apply both to genealogically nested and nonnested mutations.

  • We find a closed-form expression for the frequency spectrum of a triallelic site.

  • We find in closed-form the expected age of the younger mutant at a triallelic site.

Acknowledgements

This research is supported in part by an NIH grant R01-GM094402, an Alfred P. Sloan Research Fellowship, and a Packard Fellowship for Science and Engineering.

Appendix A

Proofs of main results

Proof of Lemma 3.1. We argue by writing down the probability of a compatible history using (4), and observe that it depends only on n and l. It is clear that for any polymorphic sample whose history is explained by precisely one mutation event, only configurations of the form laea + eb are possible at the time of this event. So as we trace the history back in time, we must observe nb − 1 coalescent events of type b alleles and na − la coalescent events of type a alleles (in some interspersed order), followed by a mutation event taking laea + eb ↦ (la + 1)ea, followed by la coalescent events of the remaining type a alleles. Think of the history as unwrapping a particular path back through the recursion (4). By multiplying together the coefficients accumulated at each transition, we obtain the probability of this history. Regardless of the order of the interspersed events, this product is

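$$\frac{(n_a-1)(n_a-2)\cdots(l_a)\,(n_b-1)!}{(n-1+\theta)(n-2+\theta)\cdots(l_a+1+\theta)}\times\frac{\theta P_{ab}}{l_a+\theta}\,\frac{l_a+1}{l_a+1}\times\frac{l_a!}{(l_a+\theta)(l_a-1+\theta)\cdots(1+\theta)}.$$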

Simplifying, we get (5), which indeed depends only on na, nb, and la. There are $\binom{n_a-l_a+n_b-1}{n_b-1}$ ways to arrange the first na − la + nb − 1 events, and thus $\binom{n_a-l_a+n_b-1}{n_b-1}=\binom{n-l_a-1}{n_b-1}$ such histories.
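The counting argument in this proof translates directly into code. The sketch below (function names are ours) multiplies the transition coefficients for a single history, weights by the number of histories, and sums over la to obtain the joint probability of a diallelic configuration and exactly one mutation event, for any θ. It uses exact rational arithmetic so that identities can be checked without rounding.

```python
from fractions import Fraction
from math import comb, factorial


def history_prob(n_a, n_b, l_a, theta, P_ab):
    """Product of transition coefficients for one compatible history."""
    n = n_a + n_b
    # Coalescences before the mutation: (n_a-1)...(l_a) * (n_b-1)!
    first = Fraction(factorial(n_a - 1), factorial(l_a - 1)) * factorial(n_b - 1)
    for k in range(l_a + 1, n):          # denominators (l_a+1+theta)...(n-1+theta)
        first /= k + theta
    # Mutation event; the factor (l_a+1)/(l_a+1) equals one.
    mutation = theta * P_ab / (l_a + theta)
    # Remaining l_a coalescences of ancestral alleles.
    tail = Fraction(factorial(l_a))
    for k in range(1, l_a + 1):
        tail /= k + theta
    return first * mutation * tail


def p_one_mutation(n_a, n_b, theta, P_ab):
    """p(n, E1): sum over l_a of (number of histories) x (history probability)."""
    n = n_a + n_b
    return sum(comb(n - l_a - 1, n_b - 1) * history_prob(n_a, n_b, l_a, theta, P_ab)
               for l_a in range(1, n_a + 1))


theta = Fraction(1, 1000)
print(float(p_one_mutation(5, 3, theta, 1) / theta))   # close to 1/n_b for small theta
```

For small θ the ratio printed above approaches 1/nb, which is the classical 1/nb spectrum at leading order.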

Proof of Lemma 4.1. We argue in a similar fashion to Lemma 3.1. Any compatible history must exhibit the following order of events:

  • an interspersed collection of na − ly coalescence events of type a alleles, nb − m coalescence events of type b alleles, and nc − 1 coalescence events of type c alleles,

  • a mutation event taking ly ↦ lyea + (m + 1)eb,

  • an interspersed collection of ly − lo coalescence events of type a alleles and m coalescence events of type b alleles,

  • a mutation event taking lo ↦ (lo + 1)ea, followed by

  • lo coalescence events of type a alleles.

Regardless of the relative ordering of events within each of these collections, the product of transition probabilities from (4) is

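$$\begin{aligned}&\frac{(n_a-1)(n_a-2)\cdots(l_y)\,(n_b-1)(n_b-2)\cdots(m)\,(n_c-1)!}{(n-1+\theta)(n-2+\theta)\cdots(m+l_y+1+\theta)}\times\frac{\theta P_{bc}}{m+l_y+\theta}\,\frac{m+1}{m+l_y+1}\\&\quad\times\frac{(l_y-1)(l_y-2)\cdots(l_o)\,m!}{(m+l_y+\theta)(m+l_y-1+\theta)\cdots(l_o+1+\theta)}\times\frac{\theta P_{ab}}{l_o+\theta}\,\frac{l_o+1}{l_o+1}\times\frac{l_o!}{(l_o+\theta)(l_o-1+\theta)\cdots(1+\theta)}.\end{aligned}$$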

Simplifying, we get (7), which is independent of the history except through n, ly, and lo. There are $\binom{n-l_y-m-1}{n_a-l_y,\;n_b-m,\;n_c-1}$ ways to arrange the first collection of coalescence events, and $\binom{m+l_y-l_o}{m}$ ways to arrange the second collection of coalescence events, so there are $\binom{n-l_y-m-1}{n_a-l_y,\;n_b-m,\;n_c-1}\binom{m+l_y-l_o}{m}$ such histories.

Proof of Lemmas 4.2, 4.3, and 4.4. These are very similar to the proof of Lemma 4.1 and so are omitted.

Proof of Theorem 5.1. The proof for each of the expressions is the same; we expand the denominator and collect the dominant terms in θ. We make use of the following identity:

$$(1+\theta)_{n-1}=\frac{(\theta)_n}{\theta}=\sum_{k=1}^{n}s(n,k)\,\theta^{k-1},$$

where s(n, k) are the unsigned Stirling numbers of the first kind. Note also that

$$s(n,1)=(n-1)!,\qquad s(n,2)=(n-1)!\,H_{n-1},\qquad s(n,3)=\frac{1}{2}(n-1)!\left[(H_{n-1})^2-H_{n-1}^{(2)}\right].$$

We will also make use of standard identities for summing over binomial co-efficients, and one nonstandard one:

$$\sum_{k=1}^{n}\frac{1}{k}\binom{n-k}{i-1}=\binom{n}{i-1}\left(H_n-H_{i-1}\right), \tag{A.1}$$

for 1 ≤ i ≤ n. Equation (A.1) is proven by induction by Fu (1995, equation (33)), and using another method by Griffiths (2003, Appendix B).
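Both the Stirling-number values and identity (A.1) are easy to confirm exactly for small arguments. The following check is a sketch with our own helper names; it expands the rising factorial θ(θ + 1)⋯(θ + n − 1) to read off s(n, k).

```python
from fractions import Fraction
from math import comb, factorial


def H(m, power=1):
    """Generalized harmonic number H_m^{(power)} as an exact fraction."""
    return sum(Fraction(1, j ** power) for j in range(1, m + 1))


def stirling_unsigned(n):
    """Coefficients s(n, k), k = 0..n, of theta (theta+1) ... (theta+n-1)."""
    coeffs = [1]
    for r in range(n):
        new = [0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            new[k] += r * c        # constant part of the factor (theta + r)
            new[k + 1] += c        # theta part of the factor (theta + r)
        coeffs = new
    return coeffs


def a1_lhs(n, i):
    return sum(Fraction(comb(n - k, i - 1), k) for k in range(1, n + 1))


def a1_rhs(n, i):
    return comb(n, i - 1) * (H(n) - H(i - 1))


print(stirling_unsigned(4))            # coefficients of theta^0 .. theta^4
print(a1_lhs(7, 3), a1_rhs(7, 3))
```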

Expanding (6):

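$$\begin{aligned}p(\mathbf{n},E_1)&=\theta P_{ab}\left[1-\theta\frac{s(n,2)}{s(n,1)}+O(\theta^2)\right]\sum_{l_a=1}^{n_a}\frac{\binom{n_a-1}{l_a-1}}{\binom{n-1}{l_a}}\cdot\frac{1}{l_a}\left[1-\frac{\theta}{l_a}+O(\theta^2)\right],\\&=\frac{\theta-\theta^2H_{n-1}}{n_a}P_{ab}\sum_{l_a=1}^{n_a}\frac{\binom{n_a}{l_a}}{\binom{n-1}{l_a}}-\frac{\theta^2}{n_a}P_{ab}\sum_{l_a=1}^{n_a}\frac{\binom{n_a}{l_a}}{\binom{n-1}{l_a}}\cdot\frac{1}{l_a}+O(\theta^3),\\&=\frac{\theta-\theta^2H_{n-1}}{n_a}P_{ab}\sum_{l_a=1}^{n_a}\frac{\binom{n-1-l_a}{n_b-1}}{\binom{n-1}{n_a}}-\frac{\theta^2}{n_a}P_{ab}\sum_{l_a=1}^{n_a}\frac{\binom{n-1-l_a}{n_b-1}}{\binom{n-1}{n_a}}\cdot\frac{1}{l_a}+O(\theta^3),\\&=\frac{\theta-\theta^2H_{n-1}}{n_b}P_{ab}-\frac{\theta^2}{n_a}P_{ab}\frac{\binom{n-1}{n_b-1}}{\binom{n-1}{n_a}}\left(H_n-H_{n_b-1}\right)+O(\theta^3),\quad\text{by (A.1)},\end{aligned}$$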

which simplifies to (17), as required. Next, expanding (8):

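$$\begin{aligned}p(\mathbf{n},E_{2\mathcal{N}}^{(b,c)})&=\theta^2P_{ab}P_{bc}\sum_{l_y=1}^{n_a}\sum_{l_o=1}^{l_y}\sum_{m=1}^{n_b}\frac{\binom{n_a-1}{l_y-1}\binom{n_b-1}{m-1}\binom{m+l_y-l_o}{m}}{\binom{n-1}{m+l_y}\binom{m+l_y}{m+1}}\cdot\frac{1}{(m+l_y)(m+l_y+1)}+O(\theta^3),\\&=\theta^2P_{ab}P_{bc}\sum_{l_y=1}^{n_a}\sum_{m=1}^{n_b}\frac{\binom{n_a-1}{l_y-1}\binom{n_b-1}{m-1}}{\binom{n-1}{m+l_y}}\cdot\frac{1}{(m+l_y)(m+l_y+1)}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\sum_{k=3}^{n_a+n_b+1}\sum_{l_y=1}^{k-2}\frac{1}{k}\,\frac{\binom{n_a-1}{l_y-1}\binom{n_b-1}{k-l_y-2}}{\binom{n-2}{k-2}}+O(\theta^3),\qquad(k=m+l_y+1),\\&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\sum_{k=3}^{n_a+n_b+1}\frac{1}{k}\binom{n_a+n_b-2}{k-3}\binom{n-2}{k-2}^{-1}+O(\theta^3),\end{aligned} \tag{A.2}$$

$$\begin{aligned}&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\sum_{k=3}^{n_a+n_b+1}\frac{k-2}{k}\binom{n-k}{n_c-1}\binom{n-2}{n_c}^{-1}\frac{1}{n_c}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\binom{n-2}{n_c}^{-1}\frac{1}{n_c}\left[\binom{n-2}{n_c}-2\sum_{k=1}^{n}\frac{1}{k}\binom{n-k}{n_c-1}+2\binom{n-1}{n_c-1}+\binom{n-2}{n_c-1}\right]+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\left[\frac{1}{n_c}-2\binom{n}{n_c-1}\binom{n-2}{n_c}^{-1}\frac{1}{n_c}\left(H_n-H_{n_c-1}\right)+\frac{2(n-1)}{(n_a+n_b)(n_a+n_b-1)}+\frac{1}{n_a+n_b-1}\right]+O(\theta^3),\end{aligned} \tag{A.3}$$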

where the last equality uses (A.1). On rearranging, we recover (18) as required. Note that equation (A.2) is consistent with a result of Hobolth and Wiuf (2009, equation (23)). (Their expression (24) seems to contain an error; the summation should be over 3, …, n rather than 3, …, n − nb + 1.) Next, expanding (10):

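$$\begin{aligned}p(\mathbf{n},E_{2\mathcal{NN}}^{(b,c)})&=\theta^2P_{ab}P_{ac}\sum_{m=1}^{n_b}\sum_{l_y=1}^{n_a}\sum_{l_o=1}^{l_y+1}\frac{\binom{n_a-1}{l_y-1}\binom{n_b-1}{m-1}\binom{m+l_y-l_o}{m-1}}{\binom{n-1}{m+l_y}\binom{m+l_y}{l_y+1}}\cdot\frac{1}{(m+l_y)(m+l_y+1)}+O(\theta^3),\\&=\theta^2P_{ab}P_{ac}\sum_{m=1}^{n_b}\sum_{l_y=1}^{n_a}\frac{\binom{n_a-1}{l_y-1}\binom{n_b-1}{m-1}}{\binom{n-1}{m+l_y}}\cdot\frac{l_y+1}{m}\cdot\frac{1}{(m+l_y)(m+l_y+1)}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\sum_{k=3}^{n_a+n_b+1}\sum_{m=1}^{k-2}\frac{\binom{n_a-1}{k-m-2}\binom{n_b-1}{m-1}}{\binom{n-2}{k-2}}\left(\frac{1}{m}-\frac{1}{k}\right)+O(\theta^3),\end{aligned} \tag{A.4}$$

$$\begin{aligned}&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\sum_{k=3}^{n_a+n_b+1}\sum_{m=1}^{k-2}\frac{\binom{n_a-1}{k-2-m}}{\binom{n-2}{k-2}}\left[\binom{n_b}{m}\frac{1}{n_b}-\binom{n_b-1}{m-1}\frac{1}{k}\right]+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\sum_{k=3}^{n_a+n_b+1}\left[\binom{n_a+n_b-1}{k-2}\frac{1}{n_b}-\binom{n_a-1}{k-2}\frac{1}{n_b}-\binom{n_a+n_b-2}{k-3}\frac{1}{k}\right]\binom{n-2}{k-2}^{-1}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\sum_{k=3}^{n_a+n_b+1}\left[\frac{\binom{n-k}{n_c-1}}{\binom{n-2}{n_a+n_b-1}}\frac{1}{n_b}-\frac{\binom{n-k}{n_b+n_c-1}}{\binom{n-2}{n_a-1}}\frac{1}{n_b}-\binom{n_a+n_b-2}{k-3}\binom{n-2}{k-2}^{-1}\frac{1}{k}\right]+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\left[\frac{n-1}{n_c(n_b+n_c)}-\sum_{k=3}^{n_a+n_b+1}\binom{n_a+n_b-2}{k-3}\binom{n-2}{k-2}^{-1}\frac{1}{k}\right]+O(\theta^3).\end{aligned} \tag{A.5}$$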

Finally, apply the same equality relating equations (A.2) and (A.3) to recover (20) as required. Next, expanding (12) and summing over lo:

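$$\begin{aligned}p(\mathbf{n},E_{2\mathcal{S}}^{(b,c)})&=\theta^2P_{ab}P_{bc}\sum_{l_y=1}^{n_a}\binom{n_a-1}{l_y-1}\binom{n-1}{l_y}^{-1}\frac{1}{l_y(l_y+1)}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{bc}}{n_a}\binom{n-1}{n_a}^{-1}\sum_{l_y=2}^{n_a+1}\frac{1}{l_y}\binom{n-l_y}{n_c-1}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{bc}}{n_a}\binom{n-1}{n_a}^{-1}\left[\binom{n}{n_c-1}\left(H_n-H_{n_c-1}\right)-\binom{n-1}{n_c-1}\right]+O(\theta^3),\end{aligned} \tag{A.6}$$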

where the last equality uses (A.1). This simplifies to (22). Finally, expanding (14):

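$$\begin{aligned}p(\mathbf{n},E_{2\mathcal{B}}^{(b,c)})&=\theta^2P_{ab}P_{ac}\sum_{m=1}^{n_b}\binom{n_b-1}{m-1}\binom{n-1}{m}^{-1}\frac{1}{m^2(m+1)}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n_c}\binom{n-1}{n_b-1}^{-1}\sum_{m=1}^{n_b}\binom{n-1-m}{n_c-1}\left(\frac{1}{m}-\frac{1}{m+1}\right)+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n_c}\binom{n-1}{n_b-1}^{-1}\left[\sum_{m=1}^{n_b}\binom{n-1-m}{n_c-1}\frac{1}{m}-\sum_{m=2}^{n_b+1}\binom{n-m}{n_c-1}\frac{1}{m}\right]+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n_c}\binom{n-1}{n_b-1}^{-1}\left[\binom{n-1}{n_c-1}\left(H_{n-1}-H_{n_c-1}\right)-\binom{n}{n_c-1}\left(H_n-H_{n_c-1}\right)+\binom{n-1}{n_c-1}\right]+O(\theta^3),\end{aligned} \tag{A.7}$$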

applying (A.1) to each sum in the penultimate expression. This then simplifies to (24). Expressions for $p(\mathbf{n},E_{2\mathcal{N}}^{(b,a)})$, $p(\mathbf{n},E_{2\mathcal{NN}}^{(b,b)})$, $p(\mathbf{n},E_{2\mathcal{S}}^{(b,a)})$, and $p(\mathbf{n},E_{2\mathcal{B}}^{(b,b)})$ are obtained in a very similar manner and we omit the details.

Proof of Theorem 5.2. By expanding the denominator of (26) for θ < 1 and applying the following identity (Roman, 1993):

$$c_n^{(s)}=\sum_{j=1}^{n}\binom{n}{j}(-1)^{j-1}j^{-s},$$

we obtain (28). For the remaining results, we sum over all possible observations n consistent with the event of interest:

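$$\begin{aligned}p(E_{2\mathcal{N}}^{(b,c)})&=\sum_{n_c=1}^{n-2}\sum_{n_b=1}^{n-n_c-1}p\big((n-n_b-n_c)\mathbf{e}_a+n_b\mathbf{e}_b+n_c\mathbf{e}_c,\;E_{2\mathcal{N}}^{(b,c)}\big),\\&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\sum_{n_c=1}^{n-2}\sum_{n_b=1}^{n-n_c-1}\sum_{k=3}^{n-n_c+1}\frac{1}{k}\binom{n-n_c-2}{k-3}\binom{n-2}{k-2}^{-1}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\sum_{n_c=1}^{n-2}(n-n_c-1)\sum_{k=3}^{n-n_c+1}\frac{1}{k}\binom{n-n_c-2}{k-3}\binom{n-2}{k-2}^{-1}+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{bc}}{n-1}\sum_{k=3}^{n}\sum_{n_c=1}^{n+1-k}\frac{k-2}{k}\binom{n-n_c-1}{k-2}\binom{n-2}{k-2}^{-1}+O(\theta^3),\\&=\theta^2P_{ab}P_{bc}\sum_{k=3}^{n}\frac{k-2}{k(k-1)}+O(\theta^3),\end{aligned}$$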

which simplifies to (29). The second equality above uses (A.2). Continue in this way for the remaining events. Using (A.4):

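$$\begin{aligned}p(E_{2\mathcal{NN}}^{(b,c)})&=\sum_{n_c=1}^{n-2}\sum_{n_b=1}^{n-n_c-1}p\big((n-n_b-n_c)\mathbf{e}_a+n_b\mathbf{e}_b+n_c\mathbf{e}_c,\;E_{2\mathcal{NN}}^{(b,c)}\big),\\&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\sum_{n_c=1}^{n-2}\sum_{n_b=1}^{n-n_c-1}\sum_{k=3}^{n-n_c+1}\sum_{m=1}^{k-2}\frac{\binom{n-n_b-n_c-1}{k-m-2}\binom{n_b-1}{m-1}}{\binom{n-2}{k-2}}\left(\frac{1}{m}-\frac{1}{k}\right)+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\sum_{n_c=1}^{n-2}\sum_{k=3}^{n-n_c+1}\sum_{m=1}^{k-2}\frac{\binom{n-n_c-1}{k-2}}{\binom{n-2}{k-2}}\left(\frac{1}{m}-\frac{1}{k}\right)+O(\theta^3),\\&=\frac{\theta^2P_{ab}P_{ac}}{n-1}\sum_{k=3}^{n}\sum_{m=1}^{k-2}\sum_{n_c=1}^{n+1-k}\frac{\binom{n-n_c-1}{k-2}}{\binom{n-2}{k-2}}\left(\frac{1}{m}-\frac{1}{k}\right)+O(\theta^3),\\&=\theta^2P_{ab}P_{ac}\sum_{k=3}^{n}\sum_{m=1}^{k-2}\left(\frac{1}{m}-\frac{1}{k}\right)\frac{1}{k-1}+O(\theta^3),\\&=\theta^2P_{ab}P_{ac}\sum_{k=3}^{n}\left(H_{k-2}-\frac{k-2}{k}\right)\frac{1}{k-1}+O(\theta^3),\\&=\theta^2P_{ab}P_{ac}\sum_{k=2}^{n-1}\left[\frac{H_k}{k}-\frac{1}{k^2}+\frac{1}{k}-\frac{2}{k+1}\right]+O(\theta^3),\end{aligned}$$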

which simplifies to (30) using (15). Using (A.6):

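$$\begin{aligned}p(E_{2\mathcal{S}}^{(b,c)})&=\sum_{n_a=1}^{n-1}p\big(n_a\mathbf{e}_a+(n-n_a)\mathbf{e}_c,\;E_{2\mathcal{S}}^{(b,c)}\big),\\&=\theta^2P_{ab}P_{bc}\sum_{n_a=1}^{n-1}\sum_{l_y=1}^{n_a}\binom{n_a-1}{l_y-1}\binom{n-1}{l_y}^{-1}\frac{1}{l_y(l_y+1)}+O(\theta^3),\\&=\theta^2P_{ab}P_{bc}\sum_{l_y=1}^{n-1}\sum_{n_a=l_y}^{n-1}\binom{n_a-1}{l_y-1}\binom{n-1}{l_y}^{-1}\frac{1}{l_y(l_y+1)}+O(\theta^3),\\&=\theta^2P_{ab}P_{bc}\sum_{l_y=1}^{n-1}\frac{1}{l_y(l_y+1)}+O(\theta^3),\end{aligned}$$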

which gives (31). Using (A.7):

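$$\begin{aligned}p(E_{2\mathcal{B}}^{(b,c)})&=\sum_{n_b=1}^{n-1}p\big(n_b\mathbf{e}_b+(n-n_b)\mathbf{e}_c,\;E_{2\mathcal{B}}^{(b,c)}\big),\\&=\theta^2P_{ab}P_{ac}\sum_{n_b=1}^{n-1}\sum_{m=1}^{n_b}\binom{n_b-1}{m-1}\binom{n-1}{m}^{-1}\frac{1}{m^2(m+1)}+O(\theta^3),\\&=\theta^2P_{ab}P_{ac}\sum_{m=1}^{n-1}\sum_{n_b=m}^{n-1}\binom{n_b-1}{m-1}\binom{n-1}{m}^{-1}\frac{1}{m^2(m+1)}+O(\theta^3),\\&=\theta^2P_{ab}P_{ac}\sum_{m=1}^{n-1}\frac{1}{m^2(m+1)}+O(\theta^3),\end{aligned}$$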

which gives (32). To obtain p(E2𝒩), p(E2𝒩𝒩), p(E2𝒮), and p(E2ℬ) from each of these results we simply sum b and c over 1, …, K.
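The final sums in each of the four derivations above can be compared with the closed forms stated in (29)–(32). The sketch below does this in exact arithmetic; the function names are ours.

```python
from fractions import Fraction


def H(m, power=1):
    """Generalized harmonic number H_m^{(power)} as an exact fraction."""
    return sum(Fraction(1, j ** power) for j in range(1, m + 1))


def two_mutation_sums(n):
    """Coefficients of theta^2 P P as finite sums, from the proof of Theorem 5.2."""
    return {
        "nested": sum(Fraction(k - 2, k * (k - 1)) for k in range(3, n + 1)),
        "nonnested": sum((H(k - 2) - Fraction(k - 2, k)) / (k - 1)
                         for k in range(3, n + 1)),
        "same_branch": sum(Fraction(1, l * (l + 1)) for l in range(1, n)),
        "basal": sum(Fraction(1, m * m * (m + 1)) for m in range(1, n)),
    }


def two_mutation_closed_forms(n):
    """Bracketed terms of equations (29)-(32)."""
    h1, h2 = H(n - 1), H(n - 1, 2)
    return {
        "nested": H(n) + Fraction(1, n) - 2,
        "nonnested": h1 ** 2 / 2 - h2 / 2 - H(n) - Fraction(1, n) + 2,
        "same_branch": 1 - Fraction(1, n),
        "basal": h2 - 1 + Fraction(1, n),
    }


print(two_mutation_sums(10) == two_mutation_closed_forms(10))
```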

Proof of Theorem 6.1. The given expressions are obtained immediately from Theorem 5.2 and the following observations:

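$$\begin{aligned}O_1\cap E_2&=\bigcup_{b}E_{2\mathcal{S}}^{(b,a)},\\O_{1\mathcal{S}}\cap E_2&=\bigcup_{b}E_{2\mathcal{B}}^{(b,b)},\\O_2\cap E_2&=\Big[\bigcup_{b}E_{2\mathcal{N}}^{(b,a)}\Big]\cup\Big[\bigcup_{b}E_{2\mathcal{NN}}^{(b,b)}\Big]\cup\Big[\bigcup_{b}\bigcup_{c\neq a}E_{2\mathcal{S}}^{(b,c)}\Big],\\O_{2\mathcal{S}}\cap E_2&=\bigcup_{b}\bigcup_{c\neq b}E_{2\mathcal{B}}^{(b,c)},\\O_3\cap E_2&=\Big[\bigcup_{b}\bigcup_{c\neq a}E_{2\mathcal{N}}^{(b,c)}\Big]\cup\Big[\bigcup_{b}\bigcup_{c\neq b}E_{2\mathcal{NN}}^{(b,c)}\Big].\end{aligned}$$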

Notice that each of these unions is over disjoint sets, so we can simply sum over the relevant probabilities. For example,

$$p(O_1,E_2)=\sum_{b=1}^{K}p(E_{2\mathcal{S}}^{(b,a)})=\theta^2\sum_{b=1}^{K}P_{ab}P_{ba}\left[1-\frac{1}{n}\right]+O(\theta^3),$$

which equals (33). The others follow similarly.

Proof of Theorem 6.2. For n = naea + nbeb + ncec, the expression (38) for p(n | O3) follows from

$$p(\mathbf{n}\mid O_3)=\frac{p(\mathbf{n},O_3,E_2)}{p(O_3,E_2)}+O(\theta),$$

where p(O3, E2) is given by (37), and p(n, O3, E2) is obtained from

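$$p(\mathbf{n},O_3,E_2)=p(\mathbf{n},E_{2\mathcal{N}}^{(b,c)})+p(\mathbf{n},E_{2\mathcal{N}}^{(c,b)})+p(\mathbf{n},E_{2\mathcal{NN}}^{(b,c)})+p(\mathbf{n},E_{2\mathcal{NN}}^{(c,b)}),$$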

with the right-hand side given by equations (18) and (20).

Proof of Theorem 6.3. We use induction on what is sometimes referred to as the sample complexity, n + l. We show that, for a sample configuration n with l observed alleles including the ancestral allele:

$$p(\mathbf{n},O_l)\propto\theta^{l-1}\prod_{k\in\Lambda}\frac{P_k}{n_k}+O(\theta^{l}), \tag{A.8}$$

the result following by dividing by p(Ol) = O(θ^(l−1)) and letting θ → 0. We have already seen (A.8) to hold for l = 2 and l = 3, and all n [equations (17) and (40)], and we suppose inductively that it holds for all samples with complexity less than n + l. Let n be a suitable sample with complexity n + l.

Substituting the inductive hypothesis into the first (coalescence) term on the right-hand side of (3) and simplifying, we obtain

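$$\frac{\theta^{l-1}}{n-1+\theta}\left(\prod_{k\in\Lambda}\frac{P_k}{n_k}\right)\left[n_a-1+\sum_{j:\,j\neq a,\;n_j\geq2}n_j\right].$$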

For the mutation term on the right of (3), the key here is to observe that contributions of O(θ^(l−1)) come only from configurations n − ej + ei with l − 1 observed alleles and including the ancestral allele; that is, from configurations in which the loss of one gamete with allele j and gain of one gamete with allele i reduces the number of observed mutant alleles in n by one. Thus, we must have had nj = 1, j ≠ a, ni ≥ 1, and i ≠ j, with any other possibilities requiring additional mutation events and thus contributing only higher order terms. The mutation term in (3) is therefore proportional to

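$$\frac{\theta}{n-1+\theta}\sum_{j:\,j\neq a,\;n_j=1}\;\sum_{i:\,i\neq j,\;n_i\geq1}P_j\left(\frac{n_i+1}{n}\right)\theta^{l-2}\left(\prod_{k\in\Lambda\setminus\{j\}}\frac{P_k}{n_k+\delta_{ik}}\right)+O(\theta^{l})=\frac{\theta^{l-1}}{n-1+\theta}\sum_{j:\,j\neq a,\;n_j=1}\left(\prod_{k\in\Lambda}\frac{P_k}{n_k}\right)+O(\theta^{l}).$$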

Summing contributions from both coalescence and mutation, (3) tells us

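$$p(\mathbf{n})\propto\frac{\theta^{l-1}}{n-1+\theta}\left(\prod_{k\in\Lambda}\frac{P_k}{n_k}\right)(n-1)+O(\theta^{l})=\theta^{l-1}\left(\prod_{k\in\Lambda}\frac{P_k}{n_k}\right)+O(\theta^{l}),$$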

as required.

Proof of Theorem 7.1. The argument parallels that of Hobolth and Wiuf (2009), who obtained the corresponding result for two nested mutations. We first condition on the number k of lineages at the time of the younger mutation. Inspection of (A.5) yields

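$$p(\mathbf{n},k,E_{2\mathcal{NN}}^{(b,c)})=\frac{\theta^2P_{ab}P_{ac}}{n-1}\left[\binom{n_a+n_b-1}{k-2}\frac{1}{n_b}-\binom{n_a-1}{k-2}\frac{1}{n_b}-\binom{n_a+n_b-2}{k-3}\frac{1}{k}\right]\binom{n-2}{k-2}^{-1}+O(\theta^3).$$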

Letting θ → 0, nb/n → fb and nc/n → fc while n → ∞, we find

p(k|E2𝒩𝒩(b,c),f)=1F[(1fc)k2fb(1fbfc)k2fbk2k(1fc)k3], (A.9)

with normalizing constant

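$$\begin{aligned} F &= \sum_{k=3}^{\infty} \left[ \frac{(1-f_c)^{k-2}}{f_b} - \frac{(1-f_b-f_c)^{k-2}}{f_b} - \frac{k-2}{k} (1-f_c)^{k-3} \right] \\ &= \frac{1}{f_c (f_b + f_c)} - \frac{2 \ln f_c}{(1-f_c)^{3}} - \frac{1+f_c}{f_c (1-f_c)^{2}}. \end{aligned}$$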
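The closed form for F can be checked numerically against a truncated version of the series. The sketch below is illustrative only: the function names, the truncation point `kmax` and the example frequencies are assumptions, not taken from the paper, and the truncation is adequate only when f_c is not too close to zero, since the terms decay like (1 − f_c)^k.

```python
import math


def bracket(k, fb, fc):
    """Unnormalised weight of k lineages: the bracketed term of (A.9)."""
    return ((1 - fc) ** (k - 2) - (1 - fb - fc) ** (k - 2)) / fb \
        - (k - 2) / k * (1 - fc) ** (k - 3)


def F_closed(fb, fc):
    """Closed-form normalising constant F."""
    return (1 / (fc * (fb + fc))
            - 2 * math.log(fc) / (1 - fc) ** 3
            - (1 + fc) / (fc * (1 - fc) ** 2))


def F_series(fb, fc, kmax=2000):
    """Series for F truncated at kmax (an arbitrary cut-off chosen for this sketch)."""
    return math.fsum(bracket(k, fb, fc) for k in range(3, kmax + 1))


# Hypothetical population frequencies of the two mutant alleles
fb, fc = 0.3, 0.2
print(F_closed(fb, fc), F_series(fb, fc))
```

The two printed values should agree to within floating-point error for frequencies such as these; for very small f_c a larger `kmax` would be needed.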

Under the standard, neutral coalescent model, the mean age of the younger mutation when it occurred during the time that there existed k ancestral lineages is 2/(k − 1) (Hobolth and Wiuf, 2009). Hence

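$$\mathbb{E}\big[A_c \mid E_{2\mathcal{NN}}^{(b,c)}, \mathbf{f}\big] = \sum_{k=3}^{\infty} \frac{2}{k-1} \, p\big(k \mid E_{2\mathcal{NN}}^{(b,c)}, \mathbf{f}\big).$$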

Substituting in (A.9), summing over k and simplifying recovers (45).
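For readers who want numbers rather than the closed form, the sum over k can also be evaluated directly. This is a rough numerical sketch under stated assumptions: it truncates the series at a hypothetical `kmax`, normalises the weights of (A.9) by their truncated sum instead of using F, and uses made-up function names and example frequencies. It does not reproduce (45), which is not restated in this appendix.

```python
import math


def bracket(k, fb, fc):
    """Unnormalised weight of k lineages: the bracketed term of (A.9)."""
    return ((1 - fc) ** (k - 2) - (1 - fb - fc) ** (k - 2)) / fb \
        - (k - 2) / k * (1 - fc) ** (k - 3)


def lineage_distribution(fb, fc, kmax=2000):
    """Approximate p(k | nonnested, f) for k = 3..kmax, normalised by the truncated sum."""
    ks = range(3, kmax + 1)
    weights = [bracket(k, fb, fc) for k in ks]
    total = math.fsum(weights)
    return {k: w / total for k, w in zip(ks, weights)}


def expected_age_younger(fb, fc, kmax=2000):
    """Approximate expected age of the younger mutation: sum of 2/(k-1) * p(k | ...)."""
    dist = lineage_distribution(fb, fc, kmax)
    return math.fsum(2.0 / (k - 1) * p for k, p in dist.items())


# Hypothetical population frequencies of the older (b) and younger (c) mutant alleles
print(expected_age_younger(0.3, 0.2))
```

Because 2/(k − 1) is at most 1 for k ≥ 3, the value returned always lies between 0 and 1 in the time units of the coalescent model used above.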


References

  1. Achaz G. Frequency spectrum neutrality tests: one for all and all for one. Genetics. 2009;183:249–258. doi: 10.1534/genetics.109.104042.
  2. Bustamante CD, Wakeley J, Sawyer S, Hartl DL. Directional selection and the site-frequency spectrum. Genetics. 2001;159:1779–1788. doi: 10.1093/genetics/159.4.1779.
  3. Desai MM, Plotkin JB. The polymorphism frequency spectrum of finitely many sites under selection. Genetics. 2008;180:2175–2191. doi: 10.1534/genetics.108.087361.
  4. Evans SN, Shvets Y, Slatkin M. Non-equilibrium theory of the allele frequency spectrum. Theor. Popul. Biol. 2007;71:109–119. doi: 10.1016/j.tpb.2006.06.005.
  5. Fu Y-X. Statistical properties of segregating sites. Theor. Popul. Biol. 1995;48:172–197. doi: 10.1006/tpbi.1995.1025.
  6. Griffiths RC. The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor. Popul. Biol. 2003;64:241–251. doi: 10.1016/s0040-5809(03)00075-3.
  7. Griffiths RC, Tavaré S. Simulating probability distributions in the coalescent. Theor. Popul. Biol. 1994;46:131–159.
  8. Griffiths RC, Tavaré S. The age of a mutation in a general coalescent tree. Stoch. Models. 1998;14:273–295.
  9. Griffiths RC, Tavaré S. The genealogy of a neutral mutation. In: Green P, Hjort N, Richardson S, editors. Highly structured stochastic systems. Oxford University Press; 2003. pp. 393–412.
  10. Hobolth A, Wiuf C. The genealogy, site frequency spectrum and ages of two nested mutant alleles. Theor. Popul. Biol. 2009;75:260–265. doi: 10.1016/j.tpb.2009.02.001.
  11. Hodgkinson A, Eyre-Walker A. Human triallelic sites: evidence for a new mutational mechanism? Genetics. 2010;184:233–241. doi: 10.1534/genetics.109.110510.
  12. Johnson PLF, Slatkin M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res. 2006;16:1320–1327. doi: 10.1101/gr.5431206.
  13. Kimura M, Ohta T. The age of a neutral mutation persisting in a finite population. Genetics. 1973;75:199–212. doi: 10.1093/genetics/75.1.199.
  14. Kingman JFC. The coalescent. Stoch. Proc. Appl. 1982;13(3):235–248.
  15. Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165:427–436. doi: 10.1093/genetics/165.1.427.
  16. Roman S. The harmonic logarithms and the binomial formula. J. Comb. Theory A. 1993;63:143–163.
  17. Sawyer SA, Hartl DL. Population genetics of polymorphism and divergence. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161.
  18. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 1993;10(3):512–526. doi: 10.1093/oxfordjournals.molbev.a040023.
  19. Tavaré S. Line-of-descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 1984;26:119–164. doi: 10.1016/0040-5809(84)90027-3.
  20. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
  21. Watterson GA. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 1975;7:256–276. doi: 10.1016/0040-5809(75)90020-9.
  22. Wiuf C, Donnelly P. Conditional genealogies and the age of a neutral mutant. Theor. Popul. Biol. 1999;56:183–201. doi: 10.1006/tpbi.1998.1411.
  23. Wright S. In: Genetics, Paleontology and Evolution. Jepson GL, Mayr E, Simpson GG, editors. Princeton University Press; 1949. pp. 365–389.
