Genetics. 2007 Oct;177(2):1249–1254. doi: 10.1534/genetics.107.077644

An Exact Sampling Formula for the Wright–Fisher Model and a Solution to a Conjecture About the Finite-Island Model

Sabin Lessard
PMCID: PMC2034630  PMID: 17660540

Abstract

An exact sampling formula for a Wright–Fisher population of fixed size N under the infinitely many neutral alleles model is deduced. This extends the Ewens formula for the configuration of a random sample to the case where the sample is drawn from a population of small size, that is, without the usual large-N and small-mutation-rate assumption. The formula is used to prove a conjecture ascertaining the validity of a diffusion approximation for the frequency of a mutant-type allele under weak selection in segregation with a wild-type allele in the limit finite-island model, namely, a population that is subdivided into a finite number of demes of size N and that receives an expected fraction m of migrants from a common migrant pool each generation, as the number of demes goes to infinity. This is done by applying the formula to the migrant ancestors of a single deme and sampling their types at random. The proof of the conjecture confirms an analogy between the island model and a random-mating population, but with a different timescale that has implications for estimation procedures.


Wakeley (2003) has provided a theoretical framework for statistical inference about mutation, selection, and divergence time made from molecular data at unlinked nucleotide sites as in Sawyer and Hartl (1992) but in the case of a population subdivided into many subpopulations or demes. Assuming an island model of migration (Wright 1931; Moran 1959) but with a finite number of finite demes, it has been argued that the frequency of a mutant allele segregating with a wild-type allele at the same locus in the whole population should be governed in the limit of a large number of demes by a diffusion process that is identical to the standard diffusion approximation used for a panmictic population (see, e.g., Ewens 2004, Chap. 4), with the exception that it occurs on a longer timescale.

More precisely, consider a haploid population subdivided into D demes with N individuals in each deme and suppose discrete, nonoverlapping generations. At the beginning of each generation, every individual in every deme produces the same large number of offspring, which then disperse independently and randomly among all the demes with probability m (0 < m ≤ 1) or stay in their original deme with probability 1 − m. In other words, m is the fraction of offspring in each deme that come from a deme chosen at random. Two alleles at a single locus are segregating in the population, a mutant allele A and a wild-type allele B, and viability selection takes place among the offspring within each deme (what is known as soft selection) such that the mutant type has fitness 1 + γ/(ND) compared to 1 for the wild type. The population structure is restored before the beginning of the next generation by sampling N survivors within each deme according to a classical Wright–Fisher model (Fisher 1930; Wright 1931). The frequency of A in all the demes is then described by a multidimensional discrete-time Markov chain. The same chain is obtained in the case of a diploid population with gametic migration followed by random union of gametes and additive selection.
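As a rough illustration of the life cycle just described, the following Python sketch advances the numbers of A genes in all demes by one generation. The function name `next_generation`, the list `counts`, and the deterministic treatment of dispersal and selection (an assumption made here because the number of offspring is taken to be very large) are illustrative choices, not part of the original article.

```python
import random


def next_generation(counts, N, m, gamma, rng):
    """Advance the finite-island model by one generation (illustrative sketch).

    counts[d] is the number of A genes among the N genes of deme d.
    """
    D = len(counts)
    s = gamma / (N * D)                # selective advantage of A: fitness 1 + s
    x_bar = sum(counts) / (N * D)      # frequency of A in the common migrant pool
    new_counts = []
    for c in counts:
        # Frequency of A among the offspring present in the deme after dispersal:
        # a fraction 1 - m stays at home, a fraction m comes from the migrant pool.
        y = (1 - m) * c / N + m * x_bar
        # Soft selection: viability selection acts within the deme.
        y_sel = y * (1 + s) / (1 + s * y)
        # Wright-Fisher sampling of the N survivors.
        new_counts.append(sum(rng.random() < y_sel for _ in range(N)))
    return new_counts
```

For instance, `next_generation([0, 3, 5], 5, 0.1, 2.0, random.Random(1))` returns one random realization for three demes of size 5. Since there is no mutation in this sketch, a population fixed for either allele stays fixed.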

Measuring time in units of ND/(1 − F) generations, where F is the fixation index given by

$$F = \frac{(1-m)^2}{Nm(2-m) + (1-m)^2}, \tag{1}$$

it has been shown that the frequency of A in the whole population, in the limit as D goes to infinity, should be described by a continuous-time diffusion process on the interval [0, 1] having drift and diffusion coefficients given by

$$\mu(x) = \gamma x(1-x) \tag{2}$$

and

$$\sigma^2(x) = x(1-x), \tag{3}$$

respectively. This is exactly what is obtained for a panmictic population of size ND with ND generations taken as unit of time (see, e.g., Ewens 2004, Chap. 5). Therefore, in the limit of a large D, the only difference between the two models is the timescale, the unit of time in the island model being longer by a factor 1/(1 − F). Note that the parameter F represents the probability under neutrality and assuming a large number of demes that the lineages of two individuals chosen at random in the same deme coalesce backward in time before one of them migrates to another deme.
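The interpretation of F given above suggests a one-generation recursion: looking backward, two lineages in the same deme both avoid migration with probability (1 − m)², and then either coalesce with probability 1/N or remain distinct in the same deme. The sketch below, with hypothetical function names, computes the closed form obtained by solving that recursion under this assumption, together with the resulting unit of time.

```python
from fractions import Fraction


def fixation_index(N, m):
    """Solution of F = (1 - m)^2 * (1/N + (1 - 1/N) * F), assumed from the text."""
    m = Fraction(m)
    return (1 - m) ** 2 / (N * m * (2 - m) + (1 - m) ** 2)


def time_unit(N, D, m):
    """Generations per unit of diffusion time, N * D / (1 - F)."""
    return N * D / (1 - fixation_index(N, m))
```

With complete dispersal (m = 1), F vanishes and the unit of time reduces to ND generations, as in a panmictic population of size ND.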

The limit diffusion for the island model results from a separation of timescales as in Ethier and Nagylaki (1980), drift within demes occurring on a faster timescale than drift between demes and selection pressure. Moreover, a rigorous proof relies on the following assumption:

Conjecture (Wakeley 2003): If ν = (ν0, ν1, …, νN) is the probability distribution satisfying

$$\nu_j = \sum_{i=0}^{N} \nu_i P_{ij}, \qquad j = 0, 1, \ldots, N, \tag{4}$$

where

$$P_{ij} = \binom{N}{j}\left[(1-m)\frac{i}{N} + mx\right]^{j}\left[1 - (1-m)\frac{i}{N} - mx\right]^{N-j} \tag{5}$$

and x represents the frequency of A in the current generation, then there exist some coefficients r1, …, rN such that from one generation to the next

$$\Delta\nu_j = \sum_{i=1}^{N} r_i\,(\Delta x)^i. \tag{6}$$

Moreover, these coefficients may depend on N, j, m, and x.

It has been noted that the conjecture is true when ν is approximated with a hypergeometric distribution Inline graphic exhibiting the same mean and variance; namely,

graphic file with name M8.gif (7)

where

graphic file with name M9.gif (8)

and

graphic file with name M10.gif (9)

with Γ(a + 1) = aΓ(a) for a > 0 (see, e.g., Feller 1968, p. 66, for properties of the gamma function). Such an approximation can be justified by exchangeability properties (Rothman et al. 1974). The accuracy of the approximation has been illustrated by numerical calculations for a deme of size as small as 10 (see Wakeley 2003 for more details). Moreover, numerical simulations in the case of a large deme size have shown little discrepancy with the stochastic dynamics predicted from the diffusion approximation (Cherry and Wakeley 2003). This is consistent with analytical results for a large deme size with Nm kept constant, in which case both Nν and Inline graphic approach the density of a beta distribution evaluated at y = j/N; namely,

$$\frac{\Gamma(M)}{\Gamma(Mx)\,\Gamma(M(1-x))}\; y^{Mx-1}(1-y)^{M(1-x)-1}, \tag{10}$$

where M takes its limit value 2Nm (see, e.g., Moran 1962, Chap. 6). This distribution corresponds to the stationary distribution in a deme of large size that receives an expected fraction m of migrants each generation from an infinite population, possibly subdivided into an infinite number of demes, in which the frequencies of A and B are kept constant and equal to x and 1 − x, respectively (Wright 1931).

On the other hand, the hypergeometric distribution Inline graphic where

graphic file with name M14.gif (11)

is the solution of the system of equations

graphic file with name M15.gif (12)

where

graphic file with name M16.gif (13)
graphic file with name M17.gif (14)
graphic file with name M18.gif (15)

and Inline graphic otherwise. This distribution comes into play when a Moran-type model (Moran 1958) is used for reproduction, one individual at a time being replaced with an offspring of an individual either in the same deme with probability 1 − m or in the entire population with probability m (Wakeley and Takahashi 2004). This is also the distribution of allele types in an island population of size N generated by a birth-and-death process with immigration (BDI) from a large mainland population in which the frequencies of alleles A and B are x and 1 − x, respectively, if M is defined as the ratio of the immigration rate over the individual birth rate (Rannala 1996).

In the case of infinitely many alleles in the mainland population of a BDI process (Joyce and Tavaré 1987; Tavaré 1989), the distribution of allele types in an island or a subpopulation is related to the Ewens sampling formula. This formula gives the likelihood of the configuration of alleles at a single locus in a small sample drawn at random from a large population assuming that every mutation leads to a novel type. It was conjectured by Ewens (1972) from the stationary frequency spectrum based on a diffusion approximation and proved formally by induction by Karlin and McGregor (1972). The formula holds in the limit of a wide range of exchangeable models, including the Wright–Fisher model and the Moran model, and is a basic tool for estimating the mutation rate and testing neutrality (see, e.g., Ewens 1990). An analogy with a Polya urn model has been pointed out (Hoppe 1984, 1987). The formula has been proved by a simple combinatorial argument (Griffiths and Lessard 2005) based on the coalescent approach as the population size goes to infinity (Kingman 1982).

The Ewens sampling formula or any related asymptotic formula cannot be used, however, to get the distribution ν since the deme size is fixed and finite. What is needed is an exact formula for the neutral Wright–Fisher model. More precisely, the distribution ν that is to be determined corresponds to the stationary distribution in a Wright–Fisher population of fixed size N that receives an expected fraction m of migrants each generation from an infinite population in which the frequencies of A and B are kept constant and equal to x and 1 − x, respectively. Equivalently, this is the stationary distribution in a Wright–Fisher population in which mutation occurs with probability m per gene per generation and leads either to an allele A with probability x or to an allele B with probability 1 − x.
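The stationary distribution ν described in this paragraph can be approximated numerically for small N. The sketch below is an illustration only: it assumes that, given i genes of class A in the current generation, each of the N genes of the next generation is independently of class A with probability (1 − m)i/N + mx, and it iterates the resulting transition matrix until the distribution stops changing. The function names and the convergence tolerance are hypothetical choices.

```python
from math import comb


def transition_matrix(N, m, x):
    """Binomial transition probabilities for the number of A genes in the deme."""
    P = []
    for i in range(N + 1):
        p = (1 - m) * i / N + m * x   # chance that a given gene is of class A
        P.append([comb(N, j) * p ** j * (1 - p) ** (N - j) for j in range(N + 1)])
    return P


def stationary_distribution(N, m, x, tol=1e-14, max_iter=10000):
    """Approximate nu with nu = nu P by repeated multiplication (power iteration)."""
    P = transition_matrix(N, m, x)
    nu = [1.0 / (N + 1)] * (N + 1)    # start from the uniform distribution
    for _ in range(max_iter):
        new = [sum(nu[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]
        if max(abs(a - b) for a, b in zip(new, nu)) < tol:
            return new
        nu = new
    return nu
```

Under these assumptions the mean of ν is Nx, and the probability that two distinct genes of the deme are both of class A is Fx + (1 − F)x², where F solves F = (1 − m)²[1/N + (1 − 1/N)F]. Both properties give convenient numerical checks.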

In this note, the stationary distribution ν is obtained from an exact sampling formula for the Wright–Fisher model under infinitely many alleles mutation that is deduced in the next section (see Fu 2006 for a study of the exact coalescent in this model without mutation). The proof of the conjecture and some concluding remarks follow.

EXACT SAMPLING FORMULA FOR THE NEUTRAL WRIGHT–FISHER INFINITELY MANY ALLELES MODEL

We consider the neutral Wright–Fisher model for a population of N genes at a single locus with discrete, nonoverlapping generations and a probability of mutation to a novel allelic type u per gene per generation. We are interested in the probability of having k different types, labeled arbitrarily from 1 to k and represented n1,…, nk times, respectively, in a sample of n = n1 + … + nk genes drawn at random without replacement in a given generation (say, generation g0). The sample size satisfies 1 ≤ n ≤ N.

Let us assign the labeled types to the sampled genes. There are

$$\frac{n!}{n_1!\cdots n_k!} \tag{16}$$

possibilities. The probability of the sample configuration is obtained by reconstructing the history of the sample genes.

Backward in time, the sampled genes are lost by mutation or coalescence. We consider first only the generations with such mutation or coalescence events that define ordered steps of gene loss backward in time. Let step t correspond to generation gt for t = 1,…, T, where T represents the total number of steps. We introduce the notation ri,t for the number of genes of type i remaining at the beginning of step t. This number decreases from ni to 1 as t increases from 1 to ti, this step corresponding to the loss of type i by mutation, so that ri,t = 1 for t = ti and ri,t = 0 for all t > ti. Moreover, define rt = r1,t + … + rk,t, the total number of genes remaining at the beginning of step t, and mt, the number of mutation events occurring at step t, that is, the number of i such that ti = t. (See Table 1.)

TABLE 1.

Notation and inequalities for the numbers of genes of the different types (i = 1,…, k) at the beginning of the successive steps of gene loss (t = 1,…, T)

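| Step 1 | ⋯ | Steps t, t + 1 | ⋯ | Steps t_i, t_i + 1 | ⋯ | Steps T, T + 1 |
|---|---|---|---|---|---|---|
| n_1 | ⋯ | r_{1,t} ≥ r_{1,t+1} | ⋯ | r_{1,t_i} ≥ r_{1,t_i+1} | ⋯ | r_{1,T} ≥ 0 |
| ⋮ | | ⋮ | | ⋮ | | ⋮ |
| n_i | ⋯ | r_{i,t} ≥ r_{i,t+1} | ⋯ | 1 > 0 | ⋯ | 0 = 0 |
| ⋮ | | ⋮ | | ⋮ | | ⋮ |
| n_k | ⋯ | r_{k,t} ≥ r_{k,t+1} | ⋯ | r_{k,t_i} ≥ r_{k,t_i+1} | ⋯ | r_{k,T} ≥ 0 |
| n | ⋯ | r_t > r_{t+1} | ⋯ | r_{t_i} > r_{t_i+1} | ⋯ | r_T > 0 |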

The number of mutations at step t, represented by mt, is the number of types i such that ri,t = 1 and ri,t+1 = 0 and satisfies 0 ≤ mt ≤ rt − rt+1.

Assume that the parent of each gene in any given generation is a gene chosen at random in the previous generation independently of all others. Step t is defined by the numbers of genes of each type remaining at the beginning of steps t and t + 1, that is, by ri,t+1 for all i given ri,t for all i, with 0 ≤ ri,t+1 ≤ ri,t for all i, following at least one, and possibly multiple, mutation or coalescence events (see Figure 1 for an example). This step has probability

$$u^{m_t}\left(\frac{1-u}{N}\right)^{r_t-m_t} N_{[r_{t+1}]} \prod_{i=1}^{k} S_{r_{i,t}}^{(r_{i,t+1})}, \tag{17}$$

where N[r] = N(N − 1) … (N − r + 1), while S_r^(s) represents the number of ways that r distinct elements can be partitioned into s nonvoid subsets. This is a Stirling number of the second kind given by the formula

$$S_r^{(s)} = \frac{1}{s!}\sum_{j=0}^{s} (-1)^{s-j}\binom{s}{j}\, j^{r} \tag{18}$$

(see, e.g., Abramowitz and Stegun 1965). Note that, in Equation 17, we use the equalities N[0] = 1 and S_0^(0) = S_1^(0) = 1.
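The explicit alternating-sum formula for Stirling numbers of the second kind is easy to evaluate with exact integer arithmetic. The following sketch uses the hypothetical name `stirling2`. It returns the standard values, so the convention used in Equation 17 for a single gene lost by mutation (value 1 for r = 1, s = 0) is not reproduced here and would have to be handled separately.

```python
from math import comb, factorial


def stirling2(r, s):
    """Number of partitions of r distinct elements into s non-empty subsets."""
    total = sum((-1) ** (s - j) * comb(s, j) * j ** r for j in range(s + 1))
    return total // factorial(s)   # the alternating sum is always a multiple of s!
```

For example, `stirling2(4, 2)` counts the 7 ways of splitting four genes between two parents, and `stirling2(r, r - 1)` equals r(r − 1)/2, the number of ways in which exactly one pair among r genes coalesces.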

Figure 1. Example of types 1, 2, 3, 4, 5 represented 3, 1, 2, 1, 4 times at step t and 1, 0, 2, 0, 3 times at step t + 1 as a result of mutation events, represented by dots, and coalescence events. Note that the coalescence event between the gene of type 4, prior to the mutation event, and one of the genes of type 5 has no effect on the sample configuration.

Considering all steps of gene loss from 1 to T = max{t1, … , tk} and using the identity m1 + … + mT = k, we get the probability

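$$u^{k}\prod_{t=1}^{T}\left[\left(\frac{1-u}{N}\right)^{r_t-m_t} N_{[r_{t+1}]} \prod_{i=1}^{k} S_{r_{i,t}}^{(r_{i,t+1})}\right]. \tag{19}$$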

On the other hand, the probability of neither mutation nor coalescence in all intermediate generations separating the generations of gene loss is

$$\prod_{t=1}^{T}\ \sum_{l_t \ge 0}\left[\frac{(1-u)^{r_t} N_{[r_t]}}{N^{r_t}}\right]^{l_t}, \tag{20}$$

which reduces to

$$\prod_{t=1}^{T}\left[1 - \frac{(1-u)^{r_t} N_{[r_t]}}{N^{r_t}}\right]^{-1}. \tag{21}$$

Multiplying (16) with (19) and (21) and dividing by k!, which is the number of ways that the mutant types can be labeled, the probability of the sample configuration for the labeled types, denoted by p(n; n1, … , nk), is found to be

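$$p(n; n_1, \ldots, n_k) = \frac{n!\,u^{k}}{k!\,n_1!\cdots n_k!}\sum_{\mathbf{r}}\ \prod_{t=1}^{T}\frac{\left(\dfrac{1-u}{N}\right)^{r_t-m_t} N_{[r_{t+1}]}\displaystyle\prod_{i=1}^{k} S_{r_{i,t}}^{(r_{i,t+1})}}{1 - (1-u)^{r_t} N_{[r_t]}/N^{r_t}}, \tag{22}$$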

where r = (ri,t) is an array of nonnegative integers satisfying ri,t ≥ ri,t+1 for i = 1,…, k and t ≥ 1 with ri,1 = ni, and with ri,t = 1 at t = ti and ri,t = 0 at t = ti + 1 for some ti for all i, and rt > rt+1 as long as rt ≥ 1, which occurs for t = 1, … , T.
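The sum over arrays of gene losses can be organized as a recursion on the current numbers of genes of each type. The sketch below is an illustration under stated assumptions rather than the author's implementation: it assumes that a one-generation transition with m mutations, from r genes in total to r′ genes in total, has probability u^m [(1 − u)/N]^(r − m) N[r′] times the product over types of the number of ways the genes of a type can be grouped among their parents, and that waiting through generations without any event contributes a geometric-series factor. All function names are hypothetical, and exact rational arithmetic is used so that the result can be checked exactly.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb, factorial


def stirling2(r, s):
    total = sum((-1) ** (s - j) * comb(s, j) * j ** r for j in range(s + 1))
    return total // factorial(s)


def falling(N, r):
    """Descending factorial N (N - 1) ... (N - r + 1), equal to 1 when r = 0."""
    out = 1
    for i in range(r):
        out *= N - i
    return out


def compositions(n):
    """All ordered tuples of positive integers summing to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


def exact_sampling_probability(config, N, u):
    """Probability of the labeled configuration (n_1, ..., n_k), with 0 < u <= 1."""
    u = Fraction(u)
    n, k = sum(config), len(config)

    def choices(r_i):
        if r_i == 0:
            return [0]                 # type already lost
        if r_i == 1:
            return [0, 1]              # lost by mutation, or kept
        return range(1, r_i + 1)       # copies can only be lost by coalescence

    @lru_cache(maxsize=None)
    def q(state):
        r = sum(state)
        if r == 0:
            return Fraction(1)
        stay = (1 - u) ** r * Fraction(falling(N, r), N ** r)   # no event at all
        total = Fraction(0)
        for nxt in product(*(choices(r_i) for r_i in state)):
            if nxt == state:
                continue
            mut = sum(1 for a, b in zip(state, nxt) if a == 1 and b == 0)
            ways = 1
            for a, b in zip(state, nxt):
                if b >= 1:
                    ways *= stirling2(a, b)
            step = u ** mut * ((1 - u) / N) ** (r - mut) * falling(N, sum(nxt)) * ways
            total += step * q(nxt)
        return total / (1 - stay)      # geometric waiting time until an event

    labelings = Fraction(factorial(n), factorial(k))
    for n_i in config:
        labelings /= factorial(n_i)
    return labelings * q(tuple(config))
```

As a consistency check, the probabilities of all labeled configurations of a sample of size n, enumerated with `compositions(n)`, add up to 1. For n = 2 the probability of the configuration `(2,)` is (1 − u)²/[Nu(2 − u) + (1 − u)²], the probability that two genes coalesce before either mutates. The number of states grows quickly with n, so this direct enumeration is practical only for small samples.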

PROOF OF THE CONJECTURE

Suppose a neutral Wright–Fisher model for a population of N genes at a single locus without mutation but in which each gene in each generation, with probability m and independently of all others, comes from a large surrounding population where two alleles A and B are segregating and maintained at equilibrium frequencies x and 1 − x, respectively. In the population of size N, it is as if a mutation event to a novel type occurred with probability u = m per gene per generation and the novel type were either in class A with probability x or in class B with the complementary probability 1 − x.

In the Wright–Fisher population at stationarity, the number of genes in class A can be expressed as

$$\sum_{i=1}^{K} I_i N_i, \tag{23}$$

where K is the number of types in the population, N1, … , NK are the numbers of times that these types once labeled from 1 to K are represented in the population, and I1,…, IK are independent random variables that take the value 1 with probability x and 0 otherwise. Given that K = k, the numbers N1,…, Nk are exchangeable random variables and the sum I1 + … + Ik follows a binomial distribution with parameters k and x. Therefore, the probability of having j genes in class A in the Wright–Fisher population at stationarity is

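$$\nu_j(x) = \sum_{k=1}^{N}\sum_{l=0}^{k}\binom{k}{l}x^{l}(1-x)^{k-l}\,\Pr(N_1+\cdots+N_l = j,\ K = k), \tag{24}$$

where, using the sampling formula of the previous section for the whole population (n = N), we have

$$\Pr(N_1+\cdots+N_l = j,\ K = k) = \sum_{\substack{n_1+\cdots+n_l = j \\ n_{l+1}+\cdots+n_k = N-j}} p(N; n_1, \ldots, n_k). \tag{25}$$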

We conclude that

$$\nu_j(x) = \sum_{i=0}^{N} a_i x^{i}, \tag{26}$$

where the coefficients a0, a1,…, aN depend on m, j, and N.

Let us write

$$\nu_j(x') = \sum_{i=0}^{N} a_i (x')^{i} \tag{27}$$

for the probability of having j genes in class A in the Wright–Fisher population at stationarity when the frequency of A in the surrounding population is x′ different from x. Then, the binomial formula

$$(x')^{i} = \left[x + (x'-x)\right]^{i} = \sum_{l=0}^{i}\binom{i}{l}x^{i-l}(x'-x)^{l} \tag{28}$$

and a rearrangement of terms yield the expression

$$\nu_j(x') = \nu_j(x) + \sum_{i=1}^{N} r_i\,(x'-x)^{i}, \tag{29}$$

where

$$r_i = \sum_{l=i}^{N} a_l\binom{l}{i}x^{l-i}. \tag{30}$$

This coefficient for i = 1, … , N depends on N, j, m, and x and the stated conjecture is established.
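The last step of the proof is a re-expansion of a polynomial in x′ around x by the binomial formula. The sketch below checks this numerically for an arbitrary polynomial: the coefficient list `a` is a made-up example rather than the coefficients of the stationary distribution, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def evaluate(a, y):
    """Value of the polynomial a[0] + a[1] y + ... + a[n] y^n."""
    return sum(c * y ** i for i, c in enumerate(a))


def shift_coefficients(a, x):
    """Coefficients r_1, ..., r_n such that
    evaluate(a, x2) == evaluate(a, x) + sum_i r_i * (x2 - x)**i for every x2."""
    n = len(a) - 1
    return [sum(a[l] * comb(l, i) * x ** (l - i) for l in range(i, n + 1))
            for i in range(1, n + 1)]
```

With exact fractions, for example `a = [Fraction(1, 2), Fraction(-3), Fraction(2), Fraction(5, 7)]` and `x = Fraction(1, 3)`, the identity holds exactly for any new frequency. The returned coefficients depend on x and on the polynomial but not on the new frequency, which is the property the conjecture requires.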

CONCLUDING REMARKS

The proof of the conjecture relies on the fact that the probability of having a given number of genes in class A in a haploid Wright–Fisher population of size N with a backward migration probability of m per individual per generation is a polynomial of degree N with respect to the frequency of A among migrants that is kept constant. Note that the proof does not require an exact expression for p(N; n1,…, nk), which represents the probability for the population at stationarity to have k migrant ancestors labeled from 1 to k and having n1, … , nk descendants, respectively. However, the analogy with an infinitely many alleles mutation model with a mutation rate u = m per gene per generation and the importance of sampling formulas for drawing inferences about genetic and demographic parameters make such an expression of interest for any sample size n, in which case the probability is represented by p(n; n1, … , nk). The number of terms to sum up to compute this probability corresponds to the number of ways that the sampled genes can be lost backward in time by mutation or coalescence, and this number increases rapidly with the sample size. Unless the population size N is large, all these terms have to be considered.

If we let N go to infinity and u to zero and keep θ = 2Nu constant, then only the arrays r = (ri,τ) for the gene losses with rτ+1 = rτ − 1 for τ = 1,…, n will contribute to the probability of the sample configuration since

graphic file with name M50.gif (31)

There are n!/(n1! ⋯ nk!) such arrays with ri,τ+1 = ri,τ − 1 for one and only one type i, denoted by iτ, for each τ = 1, … , n. Then, using the identity S_r^(r−1) = r(r − 1)/2 for r ≥ 2 and S_1^(0) = 1, we find that the probability is the same for each array and that the probability for all arrays is

$$\frac{n!}{k!\,n_1\cdots n_k}\cdot\frac{\theta^{k}}{\theta(\theta+1)\cdots(\theta+n-1)}. \tag{32}$$

This is the Ewens sampling formula for labeled types (Ewens 1972; Karlin and McGregor 1972; see Ewens 2004, Section 3.6, for more details on its deduction from a diffusion approximation, and Griffiths and Lessard 2005, for a combinatorial proof based on a coalescent approximation).
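For comparison with the exact formula, the Ewens sampling formula for labeled types, n! θ^k / [k! n1 ⋯ nk θ(θ + 1) ⋯ (θ + n − 1)], can be evaluated directly. The sketch below uses hypothetical function names and exact rational arithmetic, and enumerates all labeled configurations of a sample to confirm that the probabilities add up to 1.

```python
from fractions import Fraction
from math import factorial


def ewens_labeled(config, theta):
    """Ewens probability of the labeled configuration (n_1, ..., n_k)."""
    theta = Fraction(theta)
    n, k = sum(config), len(config)
    rising = Fraction(1)
    for i in range(n):
        rising *= theta + i            # theta (theta + 1) ... (theta + n - 1)
    prob = Fraction(factorial(n), factorial(k)) * theta ** k / rising
    for n_i in config:
        prob /= n_i
    return prob


def compositions(n):
    """All ordered tuples of positive integers summing to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest
```

For a sample of size 2, `ewens_labeled((2,), theta)` reduces to 1/(1 + θ), the familiar probability that two genes are of the same type.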

The difference between the exact sampling formula for a fixed population size and a fixed mutation rate and the large-N, small-u approximation comes mainly from the occurrence of simultaneous mutation or coalescence events that cannot be neglected in the general case. This is particularly relevant when N corresponds to a deme size in a subdivided population, which can be quite small (<10), and u to a migration rate from one deme to any other, which can be very high (>10⁻¹).

The exact formula has been deduced by considering all generations backward in time and unconditional transition probabilities for the sample configuration from one generation to the previous one. An alternative approach would be to consider only the generations with a change by mutation or coalescence and conditional transition probabilities given such a change, but this approach would be equivalent and not simpler.

Acknowledgments

The author is grateful to John Wakeley for his comments on a first draft of this article. This research began during the LMS Durham Symposium on Mathematical Genetics 5–15 July 2004, organized by Robert C. Griffiths and Gilean McVean. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada.

References

  1. Abramowitz, M., and I. A. Stegun, 1965. Handbook of Mathematical Functions. Dover, New York.
  2. Cherry, J. L., and J. Wakeley, 2003. A diffusion approximation for selection and drift in a subdivided population. Genetics 163: 421–428.
  3. Ethier, S. N., and T. Nagylaki, 1980. Diffusion approximations of Markov chains with two timescales and applications to population genetics. Adv. Appl. Probab. 12: 14–49.
  4. Ewens, W. J., 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112.
  5. Ewens, W. J., 1990. Population genetics theory—the past and the future, pp. 177–227 in Mathematical and Statistical Developments of Evolutionary Theory, edited by S. Lessard. Kluwer Academic Publishers, Amsterdam.
  6. Ewens, W. J., 2004. Mathematical Population Genetics, Ed. 2. Springer, New York.
  7. Feller, W., 1968. An Introduction to Probability Theory and Its Applications, Ed. 3. John Wiley & Sons, New York.
  8. Fisher, R. A., 1930. The Genetical Theory of Natural Selection. Clarendon Press, Oxford.
  9. Fu, Y. X., 2006. Exact coalescent for the Wright-Fisher model. Theor. Popul. Biol. 69: 385–394.
  10. Griffiths, R. C., and S. Lessard, 2005. Ewens' sampling formula and related formulae: combinatorial proofs, extensions to variable population size and applications to ages of alleles. Theor. Popul. Biol. 68: 167–177.
  11. Hoppe, F. M., 1984. Polya-like urns and the Ewens' sampling formula. J. Math. Biol. 20: 91–94.
  12. Hoppe, F. M., 1987. The sampling theory of neutral alleles and an urn model in population genetics. J. Math. Biol. 25: 123–159.
  13. Joyce, P., and S. Tavaré, 1987. Cycles, permutations and the structure of the Yule process with immigration. Stoch. Proc. Appl. 25: 309–314.
  14. Karlin, S., and J. L. McGregor, 1972. Addendum to a paper of W. Ewens. Theor. Popul. Biol. 3: 113–116.
  15. Kingman, J. F. C., 1982. The coalescent. Stoch. Proc. Appl. 13: 235–248.
  16. Moran, P. A. P., 1958. Random processes in genetics. Proc. Camb. Philos. Soc. 54: 60–71.
  17. Moran, P. A. P., 1959. The theory of some genetical effects of population subdivision. Aust. J. Biol. Sci. 12: 109–116.
  18. Moran, P. A. P., 1962. Statistical Processes of Evolutionary Theory. Clarendon Press, Oxford.
  19. Rannala, B., 1996. The sampling theory of neutral alleles in an island population of fluctuating size. Theor. Popul. Biol. 50: 91–104.
  20. Rothman, E. D., C. F. Sing and A. R. Templeton, 1974. A model for the analysis of population structure. Genetics 78: 934–960.
  21. Sawyer, S. A., and D. L. Hartl, 1992. Population genetics of polymorphism and divergence. Genetics 132: 1161–1176.
  22. Tavaré, S., 1989. The genealogy of the birth, death, and immigration process, pp. 41–56 in Mathematical Evolutionary Theory, edited by M. W. Feldman. Princeton University Press, Princeton, NJ.
  23. Wakeley, J., 2003. Polymorphism and divergence for island-model species. Genetics 163: 411–420.
  24. Wakeley, J., and T. Takahashi, 2004. The many-demes limit for selection and drift in a subdivided population. Theor. Popul. Biol. 66: 83–91.
  25. Wright, S., 1931. Evolution in Mendelian populations. Genetics 16: 97–159.

Articles from Genetics are provided here courtesy of Oxford University Press
