Genetics. 2013 Jul;194(3):721–736. doi: 10.1534/genetics.113.151522

Asymptotic Distributions of Coalescence Times and Ancestral Lineage Numbers for Populations with Temporally Varying Size

Hua Chen *,1, Kun Chen
PMCID: PMC3697976  PMID: 23666939

Abstract

The distributions of coalescence times and ancestral lineage numbers play an essential role in coalescent modeling and ancestral inference. The exact distributions of both quantities are expressed as sums of alternating series, and the terms in the series become numerically intractable for large samples. More computationally attractive are their asymptotic distributions, which were derived in Griffiths (1984) for populations with constant size. In this article, we derive the asymptotic distributions of coalescence times and ancestral lineage numbers for populations with temporally varying size. For a sample of size $n$, denote by $T_m$ the $m$th coalescence time, when $m + 1$ lineages coalesce into $m$ lineages, and by $A_n(t)$ the number of ancestral lineages at time $t$ back from the current generation. Similar to the results in Griffiths (1984), the number of ancestral lineages, $A_n(t)$, and the coalescence times, $T_m$, are asymptotically normal, with the mean and variance of these distributions depending on the population size function, $N(t)$. At the very early stage of the coalescent, when $t \to 0$, the number of coalesced lineages $n - A_n(t)$ follows a Poisson distribution, and as $m \to n$, $n(n-1)T_m/2N(0)$ follows a gamma distribution. We demonstrate the accuracy of the asymptotic approximations by comparing to both exact distributions and coalescent simulations. Several applications of the theoretical results are also shown: deriving statistics related to the properties of gene genealogies, such as the time to the most recent common ancestor (TMRCA) and the total branch length (TBL) of the genealogy, and deriving the allele frequency spectrum for large genealogies. With the advent of genomic-level sequencing data for large samples, the asymptotic distributions are expected to have wide applications in theoretical and methodological development for population genetic inference.

Keywords: coalescent theory, gene genealogy, coalescence time, ancestral lineage, ancestral inference, variable population size


COALESCENT theory provides a fundamental framework for stochastic modeling and likelihood inference in population genetic studies (Griffiths 1980; Kingman 1982a; Hudson 1990; Nordborg 2001). A coalescent process can be decomposed into two independent processes: the topology of the gene genealogy and the sequential process of intercoalescence times (Kingman 1982a). In this article, we aim to investigate the latter process and two important random quantities associated with this process: the coalescence times and the number of ancestral lineages (Kingman 1982a). Studying the two quantities is both biologically and theoretically meaningful. First, inferring the coalescence times and the number of ancient lineages of a contemporary sample or population helps to elucidate ancient demographic history, including population admixture, migration, and founder effect. It can also provide insights into medical studies regarding the origin and genetic architecture of inherited diseases in different populations, as well as to ecological studies, for example, on investigating the process of species invasion (Risch et al. 2003; Anderson and Slatkin 2007; Dlugosch and Parker 2007). Second, the distributions of coalescence times and ancestral lineage numbers are the essential components needed to construct a coalescent likelihood, for example, in the allele frequency spectrum-based approaches (Tavaré 1984; Griffiths and Tavaré 1998; Polanski and Kimmel 2003; Chen 2012).

The exact distribution of the number of ancestral lineages at $t$ generations ago for $n$ haplotypes randomly collected at present, $A_n(t)$, $t \ge 0$, was derived in Tavaré (1984) under the coalescent for constant populations (Equation 15 under Asymptotics of ancestral lineage numbers; see also Griffiths 1980, Donnelly 1984, Watterson 1984, and Takahata and Nei 1985). The exact distribution has connections to Ewens’ sampling formula under the infinitely many-alleles model (Ewens 1972). In a later study, the equation was extended to populations with temporally varying size (Griffiths and Tavaré 1998). The seminal equations in Tavaré (1984) and Griffiths and Tavaré (1998) are very useful in methodology development. However, both exact distributions are expressed as sums of series with alternating signs, and the coefficients of the series become numerically unstable when $n > 50$.

As another important quantity in the coalescent process, the coalescence time, $T_m$, defined as the time when $m + 1$ lineages merge into $m$ lineages, is well known as a sum of $n - m$ intercoalescence times. These $n - m$ intercoalescence times are distributed as independent exponential variables with distinct respective rates $k(k-1)/2$, $k = n, \ldots, m + 1$, under a constant population size model. The analytical expressions of many statistics are derived on the basis of this fact. For populations with time-varying size, the intercoalescence times are no longer independent. Griffiths and Tavaré (1998) and Polanski et al. (2003) derived the distribution of coalescence times under a temporally variable population size model, again as a sum of series, and the evaluation of the coefficients suffers from the same numerical issue when the sample size is large.

The numerical problem caused by large sample size becomes an indispensable question with the rapid emergence of large-scale sequencing data for samples of thousands of individuals (Mardis 2008; Altshuler et al. 2010; Coventry et al. 2010), which, on the other hand, provides an unprecedented opportunity for population genetic study. Great endeavors are pursued to develop computationally efficient approaches for the analysis of genomic data with large sample size. Most existing coalescent-based inference methods in population genetics rely on sampling approaches with intensive computation, such as importance sampling and Markov chain Monte Carlo, to integrate over the space of gene genealogies (Griffiths and Tavaré 1994b; Felsenstein et al. 1999), and thus are applicable only for analyzing local genomic regions in small samples. A recently developed method, centered on a coalescent-based joint allele frequency spectrum (JAFS) (Chen 2012), gains computational efficiency for the analysis of genomic data from multiple populations, as the author used the derived analytical form of the coalescent-based JAFS instead of the sampling approaches. One of the limitations is that the author derived the JAFS on the basis of Tavaré (1984) and Griffiths and Tavaré (1998) equations, and the numerical issues of these equations limit the use of the JAFS to small gene genealogies.

Griffiths (2006) simplified the computation of the exact lineage distribution by replacing the sum of alternating series with the hypergeometric function, which has a representation in terms of a complex integral and can be evaluated by numerical integration or simulation. As the distribution is not in simple form, it may intimidate its use for theory and methodology development. Polanski and Kimmel (2003) used the methods of hypergeometric summation to avoid the numerical issue of large n when using the exact distribution of coalescence times to obtain the allele frequency spectrum (AFS) under a time-varying population size model. Their method avoids the calculation of the coefficients in the alternating series that will explode when gene genealogy size increases. However, this approach is designed specifically for calculation of the AFS for some demographic scenarios and is not a general solution for the numerical instability in the calculation of the distributions of coalescence times and the number of ancestral lineages. Another way to avoid the calculation of the series with alternating signs is to use the asymptotic approximation instead of the exact distribution. The asymptotic distributions have an additional advantage that they are often in simpler form and are easier for theory establishment.

The asymptotic theories of the coalescence times and the number of ancestral lineages for large gene genealogies in constant populations have been derived by Griffiths (1984). He demonstrated that as $t \to 0$ and the sample size $n \to \infty$, the distributions of $A_n(t)$ and $T_m$ converge asymptotically to normal distributions. The essential ingredient in Griffiths’ proof is to apply Lyapunov’s theorem to independently distributed intercoalescence times. For populations with temporally varying size, the validity of Griffiths’ theorems is yet to be addressed, as the intercoalescence times are dependent variables in this case, violating the independence assumption of Lyapunov’s theorem (Billingsley 2012). However, if we scale the time to account for the fluctuation in population size by $\int_0^t ds/N(s)$, $t \ge 0$, where $N(\cdot)$ is the population size function over time, the coalescent process on the new time scale is equivalent to the standard coalescent (Kingman 1982b; Griffiths and Tavaré 1994b). The theorems for the standard coalescent in Griffiths (1984) can then be borrowed to obtain asymptotic distributions for populations with temporally varying size. Extension of Griffiths’ theorems to populations with time-varying size is very important for population genetic inference, since most ancestral inference is based on the nonequilibrium genetic polymorphism patterns in populations with temporally varying size. Also, the population size and growth rate are themselves demographic parameters of great interest.

In the following sections, we first derive in Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers the asymptotic distributions of coalescent times and the number of ancestral lineages for populations with temporally varying size, specifically, for populations under exponential growth. In Numerical Results we then compare the asymptotic distributions to exact distributions or coalescent simulations if the exact distributions are difficult to evaluate. We demonstrate that the asymptotic distributions of coalescence times and lineage numbers coincide with both the simulated and exact distributions surprisingly well for a wide range of parameters and for samples with even moderate size. Last, in Applications, we apply the asymptotic distributions to deriving statistics related to the properties of gene genealogies, such as the expected time to the most recent common ancestor (TMRCA) and the total branch lengths (TBL), and deriving the AFS for large samples in simpler analytical form. The article closes with a discussion.

Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers

Notations and summary

Consider a sample of $n$ lineages (haplotypes) randomly drawn from the contemporary population. Let $N(t)$ be the deterministic haploid population size at $t$ generations ago. The historical population size $N(t)$ is assumed to be large enough to satisfy Kingman’s coalescent assumption ($N(t) \gg n$). For simplicity of notation, let $N_0 \equiv N(0)$ be the size of the contemporary population. Following Griffiths and Tavaré (1994a), the relative size function $\lambda(t)$ is defined as

$$\lambda(t) = \frac{N(t)}{N_0}. \tag{1}$$

Two random quantities we investigate are the coalescence times and ancestral lineage numbers. Denote by $T_m$, $1 \le m \le n$, the coalescence time when $m + 1$ lineages merge into $m$ lineages, with $T_n \equiv 0$ (Figure 1). It is known that the intercoalescence time, $W_m = T_{m-1} - T_m$, that is, the length of time during which the gene genealogy has $m$ lineages, is distributed as an exponential variable with rate $m(m-1)/2N_0$ for populations with constant size $N_0$ (Fu 1995). The coalescence time $T_m$ can also be written as $T_m = \sum_{k=m+1}^{n} W_k$.

Figure 1. An illustration of the gene genealogy and coalescence times of five lineages at present. The coalescence time $T_m$ is defined as the time when $m + 1$ lineages coalesce into $m$ lineages.

Denote by $A_n(t)$ the number of ancestral lineages at $t$ generations back in the past. In an ancestral process where both coalescent and mutation events can reduce the number of lineages, $A_n(t)$ is referred to as the number of nonmutant lineages at $t$ generations back from the present (Griffiths 1984). In this context, we consider the genealogical history of only coalescent events, in which mutations are treated separately and assumed to occur independently following a Poisson process along the branches of a given gene genealogy. The random process $\{A_n(t), t \ge 0\}$ is a pure-death process that jumps from state $m$ to $m - 1$ with rate $m(m-1)/2N_0$, $2 \le m \le n$ (Kingman 1982a). All the random variables, $T_m$, $W_m$, and $A_n(t)$, are defined in a coalescent process with time in units of generations. Let $\tau = g(t) = \int_0^t (1/N(u))\,du$ be the time scaled at rate $N(t)$; specifically, for populations with constant size, $\tau = t/N_0$. We use the notations $\dot{T}_m$, $\dot{W}_m$, and $\dot{A}_n(\tau)$ to denote the coalescence time, intercoalescence time, and number of ancestral lineages in the standard coalescent process with time scaled by the constant population size $N_0$, which is also referred to as Kingman’s $n$-coalescent process in this context.

In the following, we aim to develop the asymptotic distributions of coalescence times, Tm, and ancestral lineage numbers, An(t), for populations with temporally varying size N(t). The main results include:

  1. for a sample of size $n$, as $n \to \infty$, $n/m \to a$, $1 < a < \infty$, the $m$th coalescence time, $T_m$, is asymptotically normal, with the mean and variance depending on the historical population size (Equations 8 and 9);

  2. as $g(t) \to 0$, $n \to \infty$, $\frac{1}{2}ng(t) \to \alpha$, $0 < \alpha < \infty$, the number of ancestral lineages at time $t$, $A_n(t)$, is asymptotically normal, with the mean and variance provided in Equations 19 and 20; and

  3. at the very early stage of the coalescent, when $t \to 0$ more rapidly, $n \to \infty$, the number of coalesced lineages $n - A_n(t)$ follows a Poisson distribution, and when $m \to n$ with $n - m$ bounded above, $n(n-1)T_m/2N_0$ follows a gamma distribution.

Asymptotics of coalescence times

Under a constant population-size model, the intercoalescence times, $\dot{W}_m$, $2 \le m \le n$, are independent exponential variables with respective rates $\binom{m}{2}$. Griffiths (1984) proved the asymptotic distribution of coalescence times by applying Lyapunov’s version of the central limit theorem to the sum of independent intercoalescence times. Denote by $\dot{T}_m$ the $m$th coalescence time in a standard $n$-coalescent process, scaled by population size $N_0$ (Ewens 2004). By Theorem 1 of Griffiths (1984), under the conditions $n \to \infty$, $m \to \infty$, while $n/m \to a$, $1 < a \le \infty$, $\dot{T}_m$ is asymptotically normal, and the mean and the variance of $\dot{T}_m$ are

$$\mu_m = \sum_{k=m+1}^{n} \frac{2}{k(k-1)} = 2\left(\frac{1}{m} - \frac{1}{n}\right), \tag{2}$$

and

$$\sigma_m^2 = \sum_{k=m+1}^{n} \frac{4}{k^2(k-1)^2} = 4\sum_{k=m+1}^{n}\left[\frac{1}{(k-1)^2} + \frac{1}{k^2} - \frac{2}{k(k-1)}\right] = 4\left\{\psi_1(m) - \psi_1(n) + \psi_1(m+1) - \psi_1(n+1) - 2\left(m^{-1} - n^{-1}\right)\right\}, \tag{3}$$

where the trigamma function is $\psi_1(z) = (d^2/dz^2)\ln\Gamma(z)$ and $\Gamma(z) = \int_0^\infty e^{-t}t^{z-1}\,dt$. Note that the result above is a little different from what was originally shown in Griffiths (1984) in that the effect of mutations on lines of descent is not considered in Equations 2 and 3. We take the strategy of constructing the genealogy first and then, given the branch lengths of the genealogy, modeling mutations as a Poisson process with the rate proportional to the specific branch length.

Under a variable population-size model, the intercoalescence times are no longer mutually independent. We assume that the population evolves according to the Wright–Fisher model, and its size changes over time deterministically; that is, the population size is different but known at each generation. The joint distribution of coalescence times $(T_m, \ldots, T_{n-1})$ for populations with temporally varying size is (Griffiths and Tavaré 1998):

$$f_{T_m,\ldots,T_{n-1}}(t_m,\ldots,t_{n-1}) = \prod_{k=m}^{n-1} \frac{\binom{k+1}{2}}{N_0\lambda(t_k)} \exp\left(-\frac{\binom{k+1}{2}}{N_0}\int_{t_{k+1}}^{t_k}\frac{1}{\lambda(u)}\,du\right). \tag{4}$$

The marginal probability density function (p.d.f.) of coalescence times $f_{T_m}$ was derived explicitly by Polanski et al. (2003) through expanding an integral transform of the marginal p.d.f. into partial fractions. An equivalent equation in different form can be derived on the basis of the definition of the $n$-coalescent and a pure-death process (see Griffiths 2006 and Chen 2012, Appendix A, for details of derivation):

$$f_{T_m}(t) = \mathbb{P}(A_n(t) = m+1)\,\frac{(m+1)m}{2N(t)}. \tag{5}$$

As ℙ(An(t) = m + 1) was derived from the expansion of the transition function (see Equations 15 and 16), which involves the sum of alternating series and is numerically unstable for large n, the exact distribution may be practically difficult to use in the ancestral inference for large samples. In the following, we aim to derive the asymptotic distribution of coalescence times for temporally varying populations.

It is known in the coalescent literature that the coalescence time rescaled at rate $1/N(t)$, $g(T_m) = \int_0^{T_m} (1/N(u))\,du$, follows the distribution of coalescence time in the standard Kingman’s $n$-coalescent, and the scaled intercoalescence times,

$$g(T_{m-1}) - g(T_m) = \int_{T_m}^{T_{m-1}} \frac{1}{N(u)}\,du, \quad 2 \le m \le n,$$

are mutually independent exponential variables with rate $\binom{m}{2}$. The coalescent process under a variable population size model, as sample size $n \to \infty$, is still Kingman’s coalescent since we assume the population size tends to infinity more quickly than the sample size, in other words, $n/N(t) \to 0$, so that the condition of Kingman’s coalescent, $n \ll N(t)$, is still satisfied. More detailed discussions on time scaling in a variable population size coalescent model can be found in Kingman (1982b), Griffiths and Tavaré (1994a), Donnelly and Tavaré (1995), and Nordborg (2001).
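The time-rescaling argument above can be sketched as a small simulation: draw independent exponential waiting times on the rescaled clock, accumulate them, and map each cumulative time back to generations through the inverse of $g$. This is only an illustrative sketch, not the modified ms simulator used in the article. It assumes the exponential growth model $N(t) = N_0 e^{-\gamma t}$ (looking backward in time), for which $g^{-1}(\tau) = \ln(N_0\gamma\tau + 1)/\gamma$; the function names and the seed are hypothetical.

```python
import math
import random

def g_inv_exponential(tau, n0, gamma):
    """Inverse of g(t) = (exp(gamma*t) - 1) / (n0*gamma), assuming N(t) = n0*exp(-gamma*t)."""
    return math.log(n0 * gamma * tau + 1.0) / gamma

def simulate_coalescence_times(n, n0, gamma, rng):
    """Return {m: T_m} in generations for one genealogy of n sampled lineages."""
    tau = 0.0
    times = {}
    for k in range(n, 1, -1):
        # On the rescaled clock, k lineages wait an Exponential(k choose 2) time.
        tau += rng.expovariate(k * (k - 1) / 2.0)
        # After this event k - 1 lineages remain, so this is T_{k-1}.
        times[k - 1] = g_inv_exponential(tau, n0, gamma)
    return times

rng = random.Random(1)  # arbitrary seed, for reproducibility only
draws = [simulate_coalescence_times(50, 2.0e6, 0.001, rng)[10] for _ in range(4000)]
mean_t10 = sum(draws) / len(draws)
print(round(mean_t10, 1))
```

With $n = 50$, $N_0 = 2 \times 10^6$, and $\gamma = 0.001$, the simulated mean of $T_{10}$ should fall near the simulated value of about 5746 generations reported in Table 1, up to Monte Carlo error.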

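We start with a Taylor expansion of $g(T_m)$ at $g^{-1}(\mu_m)$,

$$g(T_m) = \mu_m + g'(g^{-1}(\mu_m))\left(T_m - g^{-1}(\mu_m)\right) + \frac{g''(g^{-1}(\mu_m))}{2}\left(T_m - g^{-1}(\mu_m)\right)^2 + O\left(\left(T_m - g^{-1}(\mu_m)\right)^3\right), \tag{6}$$

where $g^{-1}(\cdot)$, $g'(\cdot)$, and $g''(\cdot)$ represent the inverse function, the first derivative, and the second derivative of function $g$, respectively. The remainder term

$$\frac{g''(g^{-1}(\mu_m))}{2}\left(T_m - g^{-1}(\mu_m)\right)^2 + O\left(\left(T_m - g^{-1}(\mu_m)\right)^3\right),$$

or $O((T_m - g^{-1}(\mu_m))^2)$, in Equation 6 is negligible as $n/m \to a$, $n \to \infty$, because $g(T_m)$ follows the same distribution as $\dot{T}_m$ and $\dot{T}_m \xrightarrow{p} \mu_m$ by the asymptotic properties of $\dot{T}_m$ shown in Griffiths (1984, Theorem 1).

Next, by Equation 6 and ignoring the remainder term, we have

$$\frac{T_m - g^{-1}(\mu_m)}{\sigma_m / g'(g^{-1}(\mu_m))} \approx \frac{g(T_m) - \mu_m}{\sigma_m}. \tag{7}$$

As $(g(T_m) - \mu_m)/\sigma_m \xrightarrow{d} \mathcal{N}(0, 1)$, the limiting distribution of $T_m$ can then be approximated by a normal distribution with the mean

$$\mathrm{E}(T_m) \approx g^{-1}(\mu_m) \tag{8}$$

and variance

$$\mathrm{Var}(T_m) = \frac{\sigma_m^2}{\left(g'(g^{-1}(\mu_m))\right)^2}. \tag{9}$$

Substituting Equations 2 and 3 into Equations 8 and 9 yields the mean and variance for the asymptotic distribution of coalescence times, $T_m$, $1 \le m \le n - 1$.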

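When the population is under exponential growth with rate $\gamma$, that is, $N(t) = N_0 e^{-\gamma t}$ with $t$ measured backward in time, it is straightforward to write the scaling function as

$$g(t) = \frac{e^{\gamma t} - 1}{N_0\gamma}. \tag{10}$$

The inverse and first derivative of $g$ are $g^{-1}(\tau) = \ln(N_0\gamma\tau + 1)/\gamma$ and $g'(t) = e^{\gamma t}/N_0$, respectively. Using Equations 8 and 9, we have the mean

$$\mathrm{E}(T_m) \approx g^{-1}(\mu_m) = \frac{1}{\gamma}\ln\left(2N_0\gamma\left(m^{-1} - n^{-1}\right) + 1\right) \tag{11}$$

and variance

$$\mathrm{Var}(T_m) = \frac{\sigma_m^2}{\left(g'(g^{-1}(\mu_m))\right)^2} = \frac{\sigma_m^2}{\left(\exp\left\{\gamma \cdot \frac{1}{\gamma}\ln\left(2N_0\gamma(m^{-1} - n^{-1}) + 1\right)\right\}/N_0\right)^2} = \left(\frac{N_0}{2N_0\gamma(m^{-1} - n^{-1}) + 1}\right)^2 \sigma_m^2 = \frac{4N_0^2\left(\psi_1(m) - \psi_1(n) + \psi_1(m+1) - \psi_1(n+1) - 2(m^{-1} - n^{-1})\right)}{\left(2N_0\gamma(m^{-1} - n^{-1}) + 1\right)^2}. \tag{12}$$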
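As a quick numerical check of the exponential growth formulas, the following sketch evaluates the asymptotic mean and standard deviation of $T_m$. It assumes the backward-time model $N(t) = N_0 e^{-\gamma t}$, and it computes $\sigma_m^2$ from the direct sum in Equation 3 rather than from trigamma functions so that only the standard library is needed. The function names are hypothetical.

```python
import math

def mu_sigma2(n, m):
    """Mean and variance of the scaled coalescence time (Equations 2 and 3)."""
    mu = 2.0 * (1.0 / m - 1.0 / n)
    sigma2 = 4.0 * sum(1.0 / (k * k * (k - 1) ** 2) for k in range(m + 1, n + 1))
    return mu, sigma2

def asymptotic_tm_exponential(n, m, n0, gamma):
    """Asymptotic mean and standard deviation of T_m under exponential growth."""
    mu, sigma2 = mu_sigma2(n, m)
    growth = n0 * gamma * mu + 1.0          # equals N0 * g'(g^{-1}(mu_m))
    mean = math.log(growth) / gamma         # g^{-1}(mu_m)
    sd = math.sqrt(sigma2) * n0 / growth    # sigma_m / g'(g^{-1}(mu_m))
    return mean, sd

mean_t5, sd_t5 = asymptotic_tm_exponential(50, 5, 2.0e6, 0.001)
print(round(mean_t5, 3), round(sd_t5, 3))
```

For $n = 50$, $m = 5$, $N_0 = 2 \times 10^6$, and $\gamma = 0.001$, the result should agree closely with the asymptotic entries for that row in Table 1 (mean 6580.639, standard deviation 285.229).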

Since the linear approximation is used in the above proof, there exists a bias between the derived asymptotic mean and the true mean of $T_m$. Here we quantify the magnitude of the bias specifically for the exponential growth model as an example. Using a Taylor expansion,

$$T_m = g^{-1}(g(T_m)) = g^{-1}(\mu_m) + (g^{-1})'(\mu_m)\left(g(T_m) - \mu_m\right) + \frac{(g^{-1})''(\mu_m)}{2}\left(g(T_m) - \mu_m\right)^2 + O\left(\left(g(T_m) - \mu_m\right)^3\right),$$

and taking expectation on both sides, it can be seen that

$$g^{-1}(\mu_m) - \mathrm{E}(T_m) = -\frac{(g^{-1})''(\mu_m)}{2}\,\sigma_m^2 + o(1) \tag{13}$$

$$= \frac{\sigma_m^2}{2\gamma\mu_m^2} + o(1). \tag{14}$$

By Theorem 1 in Griffiths (1984), $\mu_m$ is of order $m^{-1}$ and $\sigma_m^2$ is of order $m^{-3}$. Therefore, the bias of the asymptotic mean is of order $m^{-1}$, which shrinks to zero as $n/m \to a$, $n \to \infty$.
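The size of this second-order term can be evaluated directly. The sketch below assumes the exponential growth model $N(t) = N_0 e^{-\gamma t}$, for which $(g^{-1})''(\tau) = -N_0^2\gamma/(N_0\gamma\tau + 1)^2$, and returns both the full second-order term and its simplified form $\sigma_m^2/(2\gamma\mu_m^2)$, taken here as the asymptotic mean minus the true mean. The function name is hypothetical.

```python
def second_order_bias(n, m, n0, gamma):
    """Second-order Taylor term: asymptotic mean minus true mean of T_m (approximate)."""
    mu = 2.0 * (1.0 / m - 1.0 / n)
    sigma2 = 4.0 * sum(1.0 / (k * k * (k - 1) ** 2) for k in range(m + 1, n + 1))
    # -(g^{-1})''(mu)/2 * sigma^2 with (g^{-1})''(tau) = -n0^2*gamma/(n0*gamma*tau + 1)^2
    full = n0 ** 2 * gamma * sigma2 / (2.0 * (n0 * gamma * mu + 1.0) ** 2)
    # Simplified form, valid when n0*gamma*mu is much larger than 1
    simplified = sigma2 / (2.0 * gamma * mu ** 2)
    return full, simplified

full, simplified = second_order_bias(50, 5, 2.0e6, 0.001)
print(round(full, 2), round(simplified, 2))
```

For $n = 50$, $m = 5$, and $\gamma = 0.001$, both values come out at roughly 41 generations, the same magnitude as the bias of 42.020 reported for that row in Table 1 (which also carries simulation noise).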

Asymptotics of ancestral lineage numbers

Tavaré (1984) derived the exact distribution of the number of ancestral lineages at time $t$ in the past for the coalescent with constant population size, by using a spectral expansion of the transition function that is associated with the death process $\{A_n(t), t \ge 0\}$ (see also Griffiths 1980; Donnelly 1984; Watterson 1984; Takahata and Nei 1985),

$$\mathbb{P}(A_n(t) = m) = \sum_{i=m}^{n} \frac{(-1)^{i-m}(2i-1)\,m_{(i-1)}\,n_{[i]}}{m!\,(i-m)!\,n_{(i)}}\,e^{-i(i-1)t/2N}, \quad 0 < m \le n, \tag{15}$$

where $n_{(i)} = n(n+1)\cdots(n+i-1)$, $i \ge 1$; $n_{(0)} = 1$, and $n_{[i]} = n(n-1)\cdots(n-i+1)$, $i \ge 1$; $n_{[0]} = 1$ are the rising and falling factorial functions. Griffiths and Tavaré (1998) further generalized Equation 15 to populations with variable size, for which the distribution of the number of ancestral lineages becomes

$$\mathbb{P}(A_n(t) = m) = \sum_{i=m}^{n} \frac{(-1)^{i-m}(2i-1)\,m_{(i-1)}\,n_{[i]}}{m!\,(i-m)!\,n_{(i)}}\,\exp\left(-\frac{i(i-1)}{2N_0}\int_0^t \frac{1}{\lambda(u)}\,du\right). \tag{16}$$
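To make the alternating series concrete, here is a minimal sketch that evaluates the exact distribution for a small sample and reports how large the series coefficients become for a larger one. It takes the scaled time $\tau$ as input ($\tau = t/N$ for constant size, or $\tau = g(t)$ for variable size), builds each coefficient exactly with integer arithmetic, and only then converts to floating point. The function names are hypothetical, and the sketch is meant to illustrate the cancellation problem, not to provide a stable algorithm.

```python
import math
from fractions import Fraction

def rising(x, i):
    """Rising factorial x(x+1)...(x+i-1), with value 1 when i = 0."""
    out = 1
    for j in range(i):
        out *= x + j
    return out

def falling(x, i):
    """Falling factorial x(x-1)...(x-i+1), with value 1 when i = 0."""
    out = 1
    for j in range(i):
        out *= x - j
    return out

def coefficient(n, m, i):
    """Exact coefficient of the i-th exponential term in P(A_n = m)."""
    num = (-1) ** (i - m) * (2 * i - 1) * rising(m, i - 1) * falling(n, i)
    den = math.factorial(m) * math.factorial(i - m) * rising(n, i)
    return Fraction(num, den)

def lineage_pmf(n, m, tau):
    """P(A_n = m) at scaled time tau, summed naively in floating point."""
    return sum(float(coefficient(n, m, i)) * math.exp(-i * (i - 1) * tau / 2.0)
               for i in range(m, n + 1))

def largest_coefficient(n, m):
    """Largest absolute coefficient in the series for P(A_n = m)."""
    return max(abs(float(coefficient(n, m, i))) for i in range(m, n + 1))

total = sum(lineage_pmf(10, m, 0.3) for m in range(1, 11))
print(total, largest_coefficient(100, 50))
```

For $n = 10$ the probabilities sum to 1 within rounding error, whereas for $n = 100$ and $m = 50$ the coefficients are many orders of magnitude larger than any probability, so the alternating terms must cancel almost exactly; this is the numerical instability described in the text.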

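In addition to the coalescence times, Griffiths (1984) investigated the asymptotics of ancestral lineage numbers for populations with constant size. Omitting mutation for the same reason as in Asymptotics of coalescence times, by Griffiths’ Theorem 2, $\dot{A}_n(\tau)$ has an asymptotic normal distribution as $\tau \to 0$, $n \to \infty$, $\frac{1}{2}n\tau \to \alpha$, $0 < \alpha < \infty$, with a mean of

$$\mu(\tau) = \frac{2\eta}{\tau}, \tag{17}$$

and variance

$$\sigma^2(\tau) = 2\eta\tau^{-1}(\eta + \beta)^2\left\{1 + \frac{\eta}{\eta + \beta} - \frac{\eta}{\alpha} - \frac{\eta}{\alpha + \beta} - 2\eta\right\}\beta^{-2}, \tag{18}$$

where $\beta = -\frac{1}{2}\tau$ and $\eta = \alpha\beta/\{\alpha(e^{\beta} - 1) + \beta e^{\beta}\}$. In the following, we extend Griffiths’ conclusion of the asymptotic distribution for ancestral lineage numbers to populations with temporally varying size.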

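We observe that $\mathbb{P}(A_n(t) \le m) = \mathbb{P}(T_m \le t)$. As $g$ is a monotone continuous function, we have $\mathbb{P}(T_m \le t) = \mathbb{P}(g(T_m) \le g(t)) = \mathbb{P}(\dot{A}_n(\tau) \le m)$, where $\tau \equiv g(t)$ as defined in the last section. By Griffiths’ Theorem 2, as $n \to \infty$, $\tau \to 0$, and $\frac{1}{2}n\tau \to \alpha$, the ancestral lineage number of the coalescent process satisfies $\dot{A}_n(\tau) \xrightarrow{d} \mathcal{N}(\mu(\tau), \sigma^2(\tau))$, where $\mu(\tau)$ and $\sigma^2(\tau)$ are given in Equations 17 and 18. Therefore, $\mathbb{P}(A_n(t) \le m) \approx \mathbb{P}(Z \le (m - \mu(\tau))/\sigma(\tau))$, where $Z$ is a standard normal variable.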

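The mean and variance of the limiting distribution of $A_n(t)$ can be obtained by mapping back to the original time scale, as

$$\mathrm{E}(A_n(t)) \approx \mu(\tau) = \mu(g(t)) = \frac{2\eta}{g(t)} \tag{19}$$

and

$$\mathrm{Var}(A_n(t)) \approx \sigma^2(\tau) = \sigma^2(g(t)) = 2\eta\,(g(t))^{-1}(\eta + \beta)^2\left\{1 + \frac{\eta}{\eta + \beta} - \frac{\eta}{\alpha} - \frac{\eta}{\alpha + \beta} - 2\eta\right\}\beta^{-2}, \tag{20}$$

where $\alpha = \lim_{n \to \infty,\, t \to 0} \frac{1}{2}ng(t)$, $\beta = -\frac{1}{2}g(t)$, and $\eta = \alpha\beta/\{\alpha(e^{\beta} - 1) + \beta e^{\beta}\}$.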

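When the population is under exponential growth with rate $\gamma$ and the scaling function is as in Equation 10, plugging Equation 10 into Equations 19 and 20 shows that $A_n(t)$ has an asymptotic mean

$$\mu(g(t)) = \frac{2\eta N_0\gamma}{e^{\gamma t} - 1}, \tag{21}$$

and variance

$$\sigma^2(g(t)) = \frac{2\eta N_0\gamma}{e^{\gamma t} - 1}(\eta + \beta)^2\left\{1 + \frac{\eta}{\eta + \beta} - \frac{\eta}{\alpha} - \frac{\eta}{\alpha + \beta} - 2\eta\right\}\beta^{-2}, \tag{22}$$

where $\alpha = \lim_{n \to \infty,\, t \to 0} \frac{1}{2}[n(e^{\gamma t} - 1)/N_0\gamma]$, $\beta = -\frac{1}{2}[(e^{\gamma t} - 1)/N_0\gamma]$, and $\eta = \alpha\beta/\{\alpha(e^{\beta} - 1) + \beta e^{\beta}\}$.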
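The asymptotic mean and variance of the lineage number can be evaluated with a few lines of code. The sketch below works on the scaled time $\tau = g(t)$ and plugs in $\alpha = \frac{1}{2}n\tau$ for a finite sample. It takes $\beta = -\frac{1}{2}\tau$; the negative sign is an assumption made here, chosen because it makes the mean coincide with the deterministic lineage-count curve $ne^{\tau/2}/\{n(e^{\tau/2} - 1) + 1\}$. The exponential growth scaling assumes $N(t) = N_0 e^{-\gamma t}$, and the function names are hypothetical.

```python
import math

def lineage_moments(n, tau):
    """Asymptotic mean and variance of the lineage number at scaled time tau."""
    alpha = 0.5 * n * tau
    beta = -0.5 * tau  # sign is an assumption, see the lead-in
    eta = alpha * beta / (alpha * math.expm1(beta) + beta * math.exp(beta))
    mean = 2.0 * eta / tau
    # The bracket is of order beta^2, so it loses precision for extremely small tau.
    bracket = (1.0 + eta / (eta + beta) - eta / alpha
               - eta / (alpha + beta) - 2.0 * eta)
    var = mean * (eta + beta) ** 2 * bracket / beta ** 2
    return mean, var

def g_exponential(t, n0, gamma):
    """Scaled time g(t) under exponential growth (Equation 10)."""
    return math.expm1(gamma * t) / (n0 * gamma)

tau = g_exponential(1000.0, 2.0e6, 0.001)
mean_a, var_a = lineage_moments(200, tau)
print(round(mean_a, 2), round(var_a, 2))
```

Here $n = 200$ lineages are followed back $t = 1000$ generations with $N_0 = 2 \times 10^6$ and $\gamma = 0.001$; about 184 lineages are expected to remain, with a variance well below the expected number of coalesced lineages.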

As no linear approximation is used to derive the asymptotic distribution of $A_n(t)$, if $\mu(\tau)$ and $\sigma^2(\tau)$ were the exact mean and variance of $\dot{A}_n(\tau)$, then $\mu(g(t))$ and $\sigma^2(g(t))$ would be the exact mean and variance of $A_n(t)$. Here we use the asymptotic mean and variance in Griffiths (1984) for $\mu(\tau)$ and $\sigma^2(\tau)$, and this is the source of the bias in our derived asymptotic mean and variance of $A_n(t)$.

Asymptotics of coalescence times and ancestral lineage numbers at the early stage of the coalescent

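At the early stage of the coalescent, that is, as $m \to n$ and $t \to 0$, the above normal distributions may not approximate the exact distributions of coalescence times and ancestral lineage numbers well. We derive their asymptotics in this section.

As the scaled coalescence time $g(T_m)$ follows the distribution of coalescence times in a standard coalescent process, $g(T_m) - g(T_{m+1}) \sim \mathrm{Exponential}\left(\binom{m+1}{2}\right)$. Multiplying $g(T_m) - g(T_{m+1})$ by $n(n-1)/2$, we have $[n(n-1)/2](g(T_m) - g(T_{m+1})) \sim \mathrm{Exponential}\left(\binom{m+1}{2}/\binom{n}{2}\right) \to \mathrm{Exponential}(1)$ as $m \to n$ and $n \to \infty$. By the mean-value theorem, $g(T_m) - g(T_{m+1}) = (T_m - T_{m+1})\,g'(\xi_{m+1})$, where $T_{m+1} \le \xi_{m+1} \le T_m$. As $m \to n$ and $n \to \infty$, we have $T_m \to 0$ and $g'(\xi_{m+1}) = 1/N(\xi_{m+1}) \to 1/N_0$. Subsequently,

$$\frac{n(n-1)}{2}\sum_{k=m+1}^{n}\left(g(T_{k-1}) - g(T_k)\right) = \frac{n(n-1)}{2}\sum_{k=m+1}^{n}(T_{k-1} - T_k)\,g'(\xi_k). \tag{23}$$

Taking limits $n \to \infty$, $m \to n$ on both sides of Equation 23, we obtain $[n(n-1)/2N_0]\,T_m \sim \mathrm{Gamma}(n - m, 1)$, the gamma distribution with shape $n - m$ and unit rate.

Next, we derive the asymptotic distribution for the number of coalesced lineages, $n - A_n(t)$, for a population with time-varying size $N(t)$. Given that $\mathbb{P}(n - A_n(t) \le n - j) = \mathbb{P}(n - \dot{A}_n(g(t)) \le n - j) = \mathbb{P}(n - \dot{A}_n(\tau) \le n - j)$, and Theorem 6 in Griffiths (1984), $n - A_n(t)$ asymptotically follows the Poisson distribution with mean $\nu = \frac{1}{2}n(n-1)g(t) = \frac{1}{2}n(n-1)\int_0^t [1/N(u)]\,du$.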
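The two early-stage approximations reduce to simple closed forms, sketched below. The gamma limit gives $T_m$ a mean of $2N_0(n - m)/\{n(n-1)\}$ and a standard deviation of $2N_0\sqrt{n - m}/\{n(n-1)\}$; the Poisson limit gives the number of coalesced lineages a mean of $\nu$. The scaling function assumes exponential growth with $N(t) = N_0 e^{-\gamma t}$, and the function names are hypothetical.

```python
import math

def gamma_limit_tm(n, m, n0):
    """Mean and SD of T_m when n(n-1)T_m/(2*N0) is treated as Gamma(n - m, 1)."""
    scale = 2.0 * n0 / (n * (n - 1))
    return (n - m) * scale, math.sqrt(n - m) * scale

def poisson_mean_coalesced(n, t, n0, gamma):
    """Poisson mean nu = n(n-1)g(t)/2 for the number of coalesced lineages."""
    g_t = math.expm1(gamma * t) / (n0 * gamma)
    return 0.5 * n * (n - 1) * g_t

mean_tm, sd_tm = gamma_limit_tm(800, 795, 2.0e6)
nu = poisson_mean_coalesced(800, 10.0, 2.0e6, 0.001)
print(round(mean_tm, 3), round(sd_tm, 3), round(nu, 3))
```

For $n = 800$ and $m = 795$, the gamma limit gives a mean of about 31.3 generations and a standard deviation of about 14.0, close to the simulated values for $\gamma = 0.001$ in Table 1 (30.645 and 13.477). The limit ignores population growth, which is a reasonable simplification only this close to the present.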

Numerical Results

The accuracy of the asymptotic distributions of $T_m$ and $A_n(t)$ derived in Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers is examined by comparing their distributional properties to both exact distributions and coalescent simulations. If the analytical formulas of the exact distributions for $T_m$ and $A_n(t)$ are available and can be computed without numerical issues, for example, the mean and variance of the ancestral lineage number distribution, we use the analytical results instead of coalescent simulations in the comparison. Otherwise, we compare the asymptotic distributions with simulated distributions. The coalescent simulator ms is modified to output coalescence times and the number of ancestral lineages for each simulated gene genealogy (Hudson 2002). In the simulation, the contemporary population size $N_0$ is chosen to be $2 \times 10^6$, and a wide range of exponential growth rates $\gamma$, times $t$, and sample sizes $n$ is investigated.

Coalescence times

We first examine the numerical accuracy of the asymptotic p.d.f. of coalescence times as provided in Asymptotics of coalescence times. We show the asymptotic p.d.f.’s of several coalescence times for n = 500 and γ = 0.001 and 0.005 in Figure 2, and then present in Table 1 the mean and standard deviation of coalescence times for a wider range of parameters (more simulation results can be seen in Supporting Information, Table S1). Since evaluating the p.d.f. of coalescence times for large samples using the exact formulas is subject to severe numerical instability (Polanski et al. 2003), we use only coalescent simulations in this section.

Figure 2. The asymptotic probability density functions of coalescence times in the history for two parameter settings. (A and B) The asymptotic probability density functions of coalescence times in a sample of size 500, collected from a population with $N_0 = 2 \times 10^6$. The growth rate is (A) $\gamma = 0.001$ and (B) $\gamma = 0.005$. The spikes from left to right correspond to the p.d.f.’s of $T_2$, $T_{10}$, $T_{25}$, $T_{50}$, $T_{100}$, $T_{200}$, $T_{300}$, and $T_{400}$. (C and D) Comparison of the asymptotic probability density function and simulated distribution of coalescence time $T_{10}$ for growth rate (C) 0.001 and (D) 0.005. The histograms were generated by 500 coalescent simulations. (E and F) Comparison of the asymptotic probability density function and simulated distribution of coalescence time $T_{100}$ for growth rate (E) 0.001 and (F) 0.005. The histograms were generated by 500 coalescent simulations.

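Table 1. Comparison of the asymptotic approximation and simulated results for the mean and standard deviation of the coalescence time $T_m$ ($N_0 = 2.0 \times 10^6$). Each row lists $\gamma$ and $m$, followed by the simulated value, the asymptotic value, and the bias (relative bias in parentheses), first for the mean of $T_m$ and then for the standard deviation of $T_m$.

γ m Mean: Simulation Asymptotic Bias (%) SD: Simulation Asymptotic Bias (%)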
Sample size n = 50
0.001 5 6538.619 6580.639 42.020 (0.643%) 278.445 285.229 6.784 (2.436%)
0.001 10 5746.244 5771.441 25.197 (0.438%) 223.672 226.368 2.696 (1.206%)
0.001 20 4775.353 4795.791 20.438 (0.428%) 209.097 206.392 2.705 (1.294%)
0.001 45 2210.726 2291.412 80.685 (3.650%) 402.151 402.703 0.552 (0.137%)
0.005 5 1629.978 1637.793 7.816 (0.480%) 56.458 57.109 0.651 (1.154%)
0.005 10 1470.136 1475.677 5.541 (0.377%) 44.957 45.387 0.429 (0.955%)
0.005 20 1274.626 1279.719 5.093 (0.400%) 41.512 41.553 0.041 (0.100%)
0.005 45 742.147 763.298 21.151 (2.850%) 90.637 87.630 3.008 (3.318%)
0.010 5 884.172 888.198 4.026 (0.455%) 28.072 28.559 0.486 (1.731%)
0.010 10 804.606 807.122 2.515 (0.313%) 22.611 22.700 0.089 (0.395%)
0.010 20 706.889 709.091 2.202 (0.312%) 20.902 20.794 0.108 (0.517%)
0.010 45 439.281 449.857 10.577 (2.408%) 46.789 44.302 2.486 (5.314%)
Sample size n = 200
0.001 5 6626.114 6660.575 34.462 (0.520%) 254.189 263.447 9.258 (3.642%)
0.001 20 5187.985 5198.497 10.512 (0.203%) 141.164 142.544 1.380 (0.978%)
0.001 50 4104.733 4110.874 6.141 (0.150%) 105.829 106.237 0.408 (0.386%)
0.001 100 3038.662 3044.522 5.861 (0.193%) 103.844 102.868 0.977 (0.940%)
0.001 150 2028.941 2036.882 7.941 (0.391%) 126.889 124.671 2.219 (1.749%)
0.001 195 405.591 413.976 8.385 (2.067%) 147.663 151.613 3.950 (2.675%)
0.005 5 1647.798 1653.798 6.000 (0.364%) 51.114 52.743 1.629 (3.187%)
0.005 20 1359.062 1360.701 1.639 (0.121%) 28.242 28.635 0.393 (1.392%)
0.005 50 1140.012 1141.422 1.410 (0.124%) 21.275 21.530 0.255 (1.197%)
0.005 100 921.898 923.024 1.126 (0.122%) 21.349 21.388 0.039 (0.184%)
0.005 150 704.509 707.223 2.715 (0.385%) 27.911 27.839 0.071 (0.256%)
0.005 195 243.791 254.182 10.391 (4.262%) 62.681 64.354 1.672 (2.668%)
0.010 5 893.062 896.201 3.139 (0.352%) 25.688 26.375 0.687 (2.676%)
0.010 20 748.544 749.610 1.066 (0.142%) 14.058 14.326 0.268 (1.905%)
0.010 50 639.188 639.859 0.671 (0.105%) 11.055 10.783 0.273 (2.466%)
0.010 100 529.684 530.330 0.646 (0.122%) 10.860 10.747 0.112 (1.033%)
0.010 150 420.709 421.459 0.751 (0.178%) 14.020 14.125 0.105 (0.749%)
0.010 195 174.898 181.290 6.392 (3.655%) 37.343 37.428 0.085 (0.227%)
Sample size n = 800
0.001 5 6645.512 6679.599 34.087 (0.513%) 247.504 258.485 10.980 (4.436%)
0.001 10 5958.249 5981.414 23.165 (0.389%) 181.338 184.235 2.897 (1.598%)
0.001 50 4328.359 4330.733 2.374 (0.055%) 85.882 85.933 0.051 (0.059%)
0.001 200 2771.704 2772.589 0.885 (0.032%) 50.952 50.631 0.321 (0.630%)
0.001 400 1789.988 1791.759 1.771 (0.099%) 44.957 45.005 0.048 (0.106%)
0.001 795 30.645 30.962 0.317 (1.036%) 13.477 13.635 0.157 (1.167%)
0.005 5 1651.462 1657.606 6.144 (0.372%) 50.473 51.749 1.276 (2.528%)
0.005 10 1515.254 1517.766 2.512 (0.166%) 36.394 36.922 0.528 (1.451%)
0.005 50 1185.082 1185.918 0.835 (0.070%) 17.362 17.369 0.007 (0.042%)
0.005 200 865.872 866.147 0.275 (0.032%) 10.690 10.659 0.031 (0.291%)
0.005 400 651.296 651.619 0.323 (0.050%) 10.458 10.386 0.072 (0.689%)
0.005 795 28.726 29.206 0.481 (1.673%) 12.000 12.153 0.153 (1.274%)
0.010 5 894.511 898.105 3.594 (0.402%) 25.293 25.878 0.585 (2.314%)
0.010 10 825.938 828.172 2.234 (0.270%) 18.054 18.465 0.412 (2.281%)
0.010 50 661.756 662.141 0.385 (0.058%) 8.640 8.696 0.056 (0.648%)
0.010 200 501.546 501.728 0.182 (0.036%) 5.500 5.365 0.135 (2.449%)
0.010 400 393.125 393.183 0.058 (0.015%) 5.358 5.295 0.064 (1.188%)
0.010 795 26.667 27.343 0.676 (2.534%) 10.423 10.699 0.276 (2.652%)

In Figure 2, A and B, the asymptotic p.d.f.’s of coalescence times, T2, T10, T25, T50, T100, T200, T300, and T400, are displayed for two different exponential growth rates, 0.001 and 0.005. It is obvious that coalescence times on average are longer in a slowly growing population, so that the gene genealogy of a faster-growing population tends to have relatively shorter internal branches, showing a “star” shape. In Figure 2, C and E, the p.d.f.’s of T10 and T100 under growth model γ = 0.001 are presented together with the respective histograms generated from 500 coalescent simulations. A similar comparison was shown for γ = 0.005 in Figure 2, D and F. The asymptotic p.d.f is reasonably close to the simulated distribution for the two selected Tm’s and the two growth rates.

In Table 1 and Table S1, a more thorough simulation study is presented. The simulation is carried out for three different growth rates, γ = 0.001, 0.005, and 0.01, and five sample sizes, n = 50, 100, 200, 500, and 800. For each combination of parameter values, several coalescence times are examined to cover as much of the time span as possible. For example, for γ = 0.001, n = 200, we examine T5, T20, T50, T100, T150, and T195. For most of the Tm’s examined here, the asymptotic mean and variance approximate those of the simulated distributions very well.

Note that in Table 1 and Table S1 the asymptotic mean of $T_m$ is always larger than the simulated mean. This is consistent with our quantified bias in Equation 14. It can also be observed that both the bias and the relative bias (the bias divided by the simulated mean) of the asymptotic mean of $T_m$ are larger when $m$ is close to $n$. This can be explained by the inflated second derivative of the inverse scaling function $g^{-1}(\tau)$ evaluated at $\mu_m$ close to 0 (approximately $1/(\gamma\mu_m^2)$ in magnitude; see also Equation 14), which appears in the bias term. The detailed derivation of the quantified bias can be seen in Asymptotics of coalescence times.

Another trend worth noting in Table 1 and Table S1 is that the relative bias of the mean decreases with increasing sample size n. For example, for γ = 0.001 and the coalescence time T50, the relative bias is 0.353, 0.150, 0.049, 0.055, and 0.025% for n = 100, 200, 500, 800, and 5000, respectively (the last data point is not shown in Table 1 and Table S1). For a coalescence time Tm with a smaller m, the relative bias is reduced more slowly. In Table 1 and Table S1, the relative bias for T5 shows no obvious decreasing trend for sample sizes up to 800 (for γ = 0.001, the relative bias of T5 decreases from 0.642% for n = 50 to 0.513% for n = 800). When the sample size is increased to 5000, the relative bias becomes 0.405% (data not shown in Table 1 and Table S1). Although the convergence rates differ for the above two Tm’s, the bias of both shrinks toward zero as n → ∞.

Number of ancestral lineages

In this subsection, we aim to evaluate how the asymptotic distribution of An(t) performs as an approximation to the true distribution. The exact formulas of the first two moments of the ancestral lineage number distribution under a varying population size model were derived using the probability generating function in Tavaré (1984),

graphic file with name 721equ5.jpg (24)

and

graphic file with name 721equ6.jpg (25)

Unlike the entire exact distribution of the ancestral lineage number, the exact mean and variance of ancestral lineage numbers can be accurately calculated from Equations 24 and 25 even for quite large samples, and they are taken as the gold standard in the comparison. Three different exponential growth rates, γ = 0.001, 0.003, and 0.01, and five sample sizes, n = 50, 100, 200, 400, and 800, are considered here. For each combination of parameter values, the number of lineages is recorded at four time points (t = 100, 500, 1000, and 2000 for γ = 0.001 and 0.003; t = 100, 300, 600, and 800 for γ = 0.01).

Table 2 and Table S2 present the mean and standard deviation of the number of lineages calculated from the exact formulas and from the asymptotic results for the above chosen parameter values, as well as the bias and relative bias (the bias divided by the exact mean or standard deviation) of the proposed asymptotic mean and standard deviation. As shown in Table 2 and Table S2, the mean and variance obtained from the proposed asymptotic distribution are close to the exact mean and variance for a wide range of time points, growth rates, and sample sizes. Note that the asymptotic results are accurate even for a relatively small sample size, e.g., n = 50. The robustness of the asymptotic results across demographic parameters and small sample sizes supports the use of the asymptotic distribution in theory and methodology development for population genetic inference.

Table 2. Comparison of the asymptotic approximation and exact results for the mean and standard deviation of An(t) (population size N0 = 2.0 × 10^6).

Columns: γ; t; mean of An(t) (exact, asymptotic, bias with relative bias in parentheses); standard deviation of An(t) (exact, asymptotic, bias with relative bias in parentheses).
Sample size n = 50
0.001 100 49.936 49.934 0.002 (0.004%) 0.253 0.256 0.003 (1.001%)
0.001 500 49.606 49.598 0.008 (0.016%) 0.623 0.629 0.006 (0.981%)
0.001 1000 48.969 48.949 0.020 (0.042%) 0.994 1.004 0.009 (0.944%)
0.001 2000 46.371 46.302 0.069 (0.148%) 1.768 1.783 0.014 (0.797%)
0.003 100 49.929 49.927 0.001 (0.003%) 0.267 0.269 0.003 (1.001%)
0.003 500 49.299 49.285 0.014 (0.029%) 0.825 0.834 0.008 (0.963%)
0.003 1000 46.385 46.317 0.068 (0.148%) 1.765 1.780 0.014 (0.798%)
0.003 2000 18.999 18.679 0.319 (1.710%) 2.432 2.429 0.002 (0.097%)
0.010 100 49.895 49.893 0.002 (0.004%) 0.323 0.327 0.003 (0.999%)
0.010 300 48.858 48.835 0.023 (0.047%) 1.044 1.054 0.010 (0.937%)
0.010 600 33.502 33.266 0.236 (0.711%) 2.790 2.797 0.007 (0.248%)
0.010 800 10.919 10.582 0.337 (3.181%) 1.874 1.869 0.005 (0.259%)
Sample size n = 200
0.001 100 198.959 198.954 0.005 (0.003%) 1.015 1.017 0.003 (0.246%)
0.001 500 193.747 193.717 0.030 (0.016%) 2.423 2.428 0.006 (0.227%)
0.001 1000 184.250 184.177 0.073 (0.040%) 3.660 3.667 0.007 (0.195%)
0.001 2000 151.766 151.578 0.188 (0.124%) 5.336 5.341 0.005 (0.102%)
0.003 100 198.846 198.841 0.006 (0.003%) 1.068 1.071 0.003 (0.246%)
0.003 500 189.083 189.031 0.052 (0.027%) 3.125 3.132 0.007 (0.211%)
0.003 1000 151.922 151.734 0.188 (0.124%) 5.332 5.338 0.005 (0.102%)
0.003 2000 26.285 25.950 0.335 (1.292%) 2.941 2.938 0.004 (0.121%)
0.010 100 198.305 198.296 0.008 (0.004%) 1.291 1.294 0.003 (0.244%)
0.010 300 182.657 182.577 0.080 (0.044%) 3.808 3.816 0.007 (0.189%)
0.010 600 66.720 66.398 0.322 (0.485%) 4.619 4.618 0.002 (0.037%)
0.010 800 12.917 12.579 0.339 (2.692%) 2.052 2.047 0.005 (0.236%)
Sample size n = 800
0.001 100 783.539 783.519 0.020 (0.003%) 3.974 3.976 0.002 (0.059%)
0.001 500 708.227 708.125 0.102 (0.014%) 8.502 8.505 0.004 (0.043%)
0.001 1000 595.586 595.390 0.196 (0.033%) 10.798 10.801 0.003 (0.024%)
0.001 2000 351.520 351.214 0.305 (0.087%) 10.352 10.352 0.000 (0.003%)
0.003 100 781.788 781.766 0.022 (0.003%) 4.171 4.173 0.002 (0.058%)
0.003 500 649.446 649.291 0.155 (0.024%) 10.033 10.036 0.003 (0.032%)
0.003 1000 352.361 352.055 0.305 (0.087%) 10.361 10.361 0.000 (0.003%)
0.003 2000 29.083 28.747 0.336 (1.168%) 3.099 3.095 0.003 (0.111%)
0.010 100 773.453 773.421 0.032 (0.004%) 4.982 4.985 0.003 (0.056%)
0.010 300 579.199 578.992 0.207 (0.036%) 10.944 10.947 0.002 (0.021%)
0.010 600 88.745 88.412 0.334 (0.377%) 5.427 5.425 0.002 (0.037%)
0.010 800 13.540 13.202 0.338 (2.564%) 2.102 2.098 0.005 (0.227%)

Next, to examine the performance of the asymptotic approximation over the entire time scale, we plot the mean and variance of the number of lineages as a function of time for two parameter settings in Figure 3. In both settings, the contemporary population size and the sample size are chosen to be N0 = 2 × 10^6 and n = 500, but the growth rates are different: γ = 0.003 (Figure 3, A and B) and γ = 0.01 (Figure 3, C and D). We compare in Figure 3, A and C, three approaches that can be used to obtain the mean of the number of lineages: the sample mean from the simulated data, representing a close estimate of the true value; the exact mean given in Equation 24; and our proposed asymptotic mean. As shown in Figure 3, A and C, the asymptotic results approximate the true mean of lineage numbers well.

Figure 3.

The mean and variance of the number of ancestral lineages in the history for two parameter settings. (A) The mean of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The contemporary population size is assumed to be 2 × 10^6 and the growth rate 0.003. The x-axis corresponds to generations back in time, and the y-axis is the expectation of the number of ancestral lineages. Green open circles represent the average of lineage numbers over 500 gene genealogies generated from coalescent simulations; the blue X symbols represent the exact mean in Tavaré (1984); the red solid line represents the asymptotic mean derived in the main text. (B) The variance of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The parameters used in simulation and symbols are the same as in A and the sample variance is estimated from 500 simulations. (C) The mean of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The contemporary population size is assumed to be 2 × 10^6 and the growth rate 0.01. The simulation setting and the representative symbols are the same as in A. (D) The variance of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The simulation setting and the representative symbols are the same as in A. The sample variance is estimated from 500 simulations.

Recently, another approach to obtaining the expectation of the number of lineages was developed by Maruvka et al. (2011) for populations with constant or exponentially growing size. Based on the equation for $\mathbb{E}[A_n(t + 1) \mid A_n(t) = i]$ in Watterson (1975), Maruvka et al. (2011) constructed a differential equation and gave the solution for the number of lineages as a function of time (NLFT), referred to as the expectation of the ancestral lineage number in this article, as follows:

graphic file with name 721equ7.jpg (26)

where n0 and N0 are the current sample size and population size, and γ is the population growth rate. Since Maruvka et al. (2011) assumed that An(t) was deterministic rather than a random variable, no formula for the variance of An(t) was given in their article. It can easily be shown that Equation 26 is close to Equation 21 when An(t) is large. Letting $\beta = -\tfrac{1}{2}g(t) \to 0$, where $g(t) = (e^{\gamma t} - 1)/(N_0\gamma)$, the asymptotic mean of An(t) in Equation 21 tends to

$$\frac{n_0}{1 + n_0\left(e^{\gamma t} - 1\right)/(2N_0\gamma)}. \tag{27}$$

When g(t) is small, the denominator in Equation 26 can be approximated by $n_0 - (n_0 - 1)\left(1 - (e^{\gamma t} - 1)/(2N_0\gamma)\right) = 1 + (n_0 - 1)\left[(e^{\gamma t} - 1)/(2N_0\gamma)\right]$, which is approximately the denominator in Equation 27 when n0 is large. This confirms the validity of our asymptotic approximation for exponentially growing populations.
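As a quick numerical illustration, the limiting mean in Equation 27 can be evaluated directly at some of the parameter settings of Table 2. The sketch below is not taken from the article: the function name is hypothetical, and the four rows were chosen only for illustration. For these rows, the value of Equation 27 agrees with the asymptotic mean printed in Table 2 to about three decimal places.

```python
import math


def asymptotic_mean_lineages(t, n0, N0, gamma):
    """Limiting asymptotic mean of A_n(t) under exponential growth (Equation 27).

    t     : time in generations before the present
    n0    : current sample size
    N0    : current population size
    gamma : exponential growth rate per generation
    """
    # (e^{gamma t} - 1) / (2 N0 gamma); expm1 keeps precision for small gamma * t
    half_g = math.expm1(gamma * t) / (2.0 * N0 * gamma)
    return n0 / (1.0 + n0 * half_g)


N0 = 2.0e6  # population size given in the Table 2 caption

# (n0, gamma, t, asymptotic mean as printed in Table 2); rows picked for illustration
rows = [
    (50, 0.001, 100, 49.934),
    (50, 0.003, 2000, 18.679),
    (200, 0.010, 600, 66.398),
    (800, 0.001, 2000, 351.214),
]

for n0, gamma, t, printed in rows:
    value = asymptotic_mean_lineages(t, n0, N0, gamma)
    print(f"n={n0:4d} gamma={gamma:.3f} t={t:5d}  Eq. 27={value:9.3f}  Table 2={printed:9.3f}")
```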

In Figure 3, B and D, it is evident that the asymptotic variance of An(t) is close to both the sample variance of the simulated data and the exact variance at any time t. We also note that the variance of ancestral lineage numbers is relatively small in magnitude compared with the expectation. This was exploited using simulation by Maruvka et al. (2011) when they assumed that the number of ancestral lineages was nearly deterministic in large-sample genealogies. However, the randomness of ancestral lineage numbers is still quite significant even for large-sample genealogies. For example, for n = 800 and γ = 0.003, the variance of An(t) is ∼100 at t = 500. If the randomness of ancestral lineages is not taken into account, inference based on the coalescent likelihood will likely be biased. Our asymptotic results provide both the distribution and the analytical expressions of the first two moments, rather than only the mean, and thus can be used to build statistically rigorous methods for parameter inference.
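To give a sense of scale for this randomness, the following sketch plugs the asymptotic mean and standard deviation printed in Table 2 for n = 800, γ = 0.003, and t = 500 into a normal distribution and reports the variance and a central 95% range. It uses only Python's standard `statistics.NormalDist`; the choice of a 95% range is an illustrative assumption, not something reported in the article.

```python
from statistics import NormalDist

# Asymptotic mean and standard deviation of A_n(t) as printed in Table 2
# for n = 800, gamma = 0.003, t = 500.
mean_a = 649.291
sd_a = 10.036

# Normal approximation to the distribution of A_n(t).
approx = NormalDist(mu=mean_a, sigma=sd_a)

# Central 95% range of the number of ancestral lineages (illustrative choice).
low = approx.inv_cdf(0.025)
high = approx.inv_cdf(0.975)

print(f"variance ~ {sd_a ** 2:.1f}")
print(f"central 95% range ~ [{low:.1f}, {high:.1f}] lineages")
```

Under this approximation the number of lineages at t = 500 plausibly varies by a few tens of lineages around its mean, which is the variability that a deterministic treatment ignores.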

Finally, in addition to the mean and the variance, we check how well the normal distribution approximates the exact distribution in shape using coalescent simulations. We examine the distribution of ancestral lineage numbers at several time points for the same two parameter settings as in Figure 3. We show snapshots at three time points for each setting in Figure 4 as an illustration: for γ = 0.003, t = 100, 1000, and 2000 generations ago representing the early, middle, and late stages of the ancestral process; and for γ = 0.01, t = 100, 400, and 800 generations ago. As can be seen from Figure 4, the normal distribution provides a reasonable approximation to the true distribution of An(t) for a wide range of time points.

Figure 4.

The asymptotic probability density functions of lineage numbers in the history for two parameter settings. (A, C, and E) Comparison of the asymptotic probability density function and simulated distribution of ancestral lineage number An(t) for a population with N0 = 2 × 10^6 and growth rate 0.003. The times t are at (A) 100 generations ago, (C) 1000 generations ago, and (E) 2000 generations ago. The histograms were generated by 500 coalescent simulations. (B, D, and F) Comparison of the asymptotic probability density function and simulated distribution of ancestral lineage number An(t) for a population with N0 = 2 × 10^6 and growth rate 0.01. The times t are at (B) 100 generations ago, (D) 400 generations ago, and (F) 800 generations ago. The histograms were generated by 500 coalescent simulations.

We also examine how well the Poisson distribution and the gamma distribution approximate the distributions of coalesced lineages and coalescence times at the early stage of the coalescent process. The two asymptotic distributions provide accurate approximations to the true distributions only when the sample size n is sufficiently large, t is close to 0, and the growth rate γ is small (see Table S3, Table S4, Figure S1, and Figure S2 for details).
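A rough consistency check of the Poisson behavior is possible with the exact values already printed in Table 2. A Poisson variable has equal mean and variance, so at an early time point the mean number of coalesced lineages, n − E[An(t)], should be close to the variance of An(t). The sketch below uses the exact mean and standard deviation at t = 100 and γ = 0.001; the helper name is hypothetical, and because the table entries are rounded to three decimals the ratios are only approximate.

```python
# (sample size n, exact mean of A_n(t), exact SD of A_n(t)) at t = 100, gamma = 0.001,
# copied from Table 2.
rows = [
    (50, 49.936, 0.253),
    (200, 198.959, 1.015),
    (800, 783.539, 3.974),
]


def coalesced_mean_and_variance(n, mean_lineages, sd_lineages):
    """Return (mean, variance) of the number of coalesced lineages n - A_n(t).

    Subtracting A_n(t) from the constant n leaves the variance unchanged.
    """
    return n - mean_lineages, sd_lineages ** 2


for n, mean_a, sd_a in rows:
    m, v = coalesced_mean_and_variance(n, mean_a, sd_a)
    # A ratio near 1 is what a Poisson distribution would give.
    print(f"n={n:4d}  mean coalesced={m:7.3f}  variance={v:7.3f}  variance/mean={v / m:5.2f}")
```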

Applications

When the sample size n is large, numerical issues in evaluating the exact distributions of coalescence times and ancestral lineage numbers exist (Tavaré 1984; Griffiths and Tavaré 1998), and the asymptotic distributions derived above were shown to be a good approximation for finite sample sizes. Here we illustrate that the asymptotic distributions of coalescence times and the number of ancestral lineages can be applied to derive some fundamental statistics that summarize the properties of gene genealogies. We also show that the allele frequency spectrum of large-size samples can be derived through the asymptotic distribution of coalescence times for a population under exponential growth. These asymptotic statistics provide valid approximations and are in simple form without numerical issues for large samples.

Properties of large gene genealogies

Many statistics that summarize the properties of gene genealogies are informative for population genetic inference. Some of them can be derived as a function of coalescence times. We show the derivation of two important statistics of gene genealogies, the expected time to the most recent common ancestor (ETMRCA) and the expected total branch length of the genealogy (ETBL). Using ETBL, we can easily estimate other summary statistics, such as Watterson’s diversity measure θW (Watterson 1975) and Tajima’s D (Tajima 1989).

The TMRCA of a sample is defined as the time when the ancestors of all lineages coalesce into a single ancient lineage, i.e., the coalescence time T1. Inferring the TMRCA from genetic polymorphism data is of great interest in population genetic studies (Tavaré et al. 1997). The ETMRCA by definition is simply the expectation of the coalescence time T1:

graphic file with name 721equ8.jpg (28)

The ETBL can be obtained by summing over the expectations of all branches of the genealogy as

graphic file with name 721equ9.jpg (29)

Specifically for populations under exponential growth, the ETMRCA can be approximated by

$$\mathrm{ETMRCA} \approx \frac{1}{\gamma}\ln\left(2N_0\gamma\left(1 - n^{-1}\right) + 1\right), \tag{30}$$

and by substituting Equation 27 into Equation 29, we have

$$\mathrm{ETBL} \approx \frac{2nN_0\ln(2N_0\gamma/n)}{2N_0\gamma - n}. \tag{31}$$
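One way to check the closed form in Equation 31 numerically is to integrate the limiting mean number of lineages in Equation 27 over time, since the total branch length of a genealogy accumulates at a rate equal to the number of lineages present. Reading the substitution as this integral is an assumption of the sketch below, as are the function names, the integration cutoff of 40/γ generations, and the use of Simpson's rule. Note that Equation 31 is undefined when 2N0γ = n.

```python
import math


def mean_lineages_eq27(t, n, N0, gamma):
    """Limiting asymptotic mean of A_n(t) under exponential growth (Equation 27)."""
    return n / (1.0 + n * math.expm1(gamma * t) / (2.0 * N0 * gamma))


def etbl_eq31(n, N0, gamma):
    """Closed-form approximation of the expected total branch length (Equation 31)."""
    x = 2.0 * N0 * gamma
    return 2.0 * n * N0 * math.log(x / n) / (x - n)


def etbl_by_integration(n, N0, gamma, steps=8000):
    """Integrate Equation 27 over t in [0, 40 / gamma] with composite Simpson's rule.

    The cutoff 40 / gamma is an illustrative choice; beyond it the integrand
    is negligible for the parameter values used here.
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    t_max = 40.0 / gamma
    h = t_max / steps
    total = mean_lineages_eq27(0.0, n, N0, gamma) + mean_lineages_eq27(t_max, n, N0, gamma)
    for i in range(1, steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight * mean_lineages_eq27(i * h, n, N0, gamma)
    return total * h / 3.0


n, N0, gamma = 500, 2.0e6, 0.005
closed = etbl_eq31(n, N0, gamma)
numeric = etbl_by_integration(n, N0, gamma)
print(f"Equation 31: {closed:.1f} generations")
print(f"Integral of Equation 27: {numeric:.1f} generations")
```

For these parameter values the two numbers agree closely, which supports the reconstruction of Equation 31 as the time integral of Equation 27.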

In Figure 5A, we show the ETMRCA as a function of sample size n for three different growth rates (γ = 0.001, 0.005, and 0.01) in a population with N0 = 2 × 10^6. The curves are theoretical predictions based on Equation 30, and each point is the TMRCA averaged over 200 coalescent simulations for the sample size on the x-axis. As the figure shows, the asymptotic ETMRCA is close to the simulated results, although biased toward larger values for slow growth rates (see Asymptotics of coalescence times for the quantified bias). Given the large variance of the TMRCA for genealogies under neutrality, the approximation is reasonably accurate. When the growth rate increases, the theoretical approximation becomes more accurate. We can also see that the ETMRCA curve is nearly flat, which means that there is a limit to how much the ETMRCA can increase with the sample size. This is consistent with former conclusions that when the sample size is beyond a moderate level, adding more samples mainly changes the shape of the lower parts of the gene genealogy, and the increase in the height of the entire genealogy (TMRCA) is very minor (Hein et al. 2005).
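The flatness of the ETMRCA curve can be seen directly from Equation 30: as n grows, the factor (1 − 1/n) tends to 1, so the ETMRCA approaches ln(2N0γ + 1)/γ. The sketch below evaluates Equation 30 for a few sample sizes; the function name and the particular sample sizes are illustrative choices, not values taken from Figure 5.

```python
import math


def etmrca_eq30(n, N0, gamma):
    """Asymptotic expected TMRCA under exponential growth (Equation 30)."""
    return math.log(2.0 * N0 * gamma * (1.0 - 1.0 / n) + 1.0) / gamma


N0 = 2.0e6
sample_sizes = (50, 200, 1000, 5000)  # illustrative sample sizes

for gamma in (0.001, 0.005, 0.01):
    # Limit of Equation 30 as n -> infinity.
    limit = math.log(2.0 * N0 * gamma + 1.0) / gamma
    values = [etmrca_eq30(n, N0, gamma) for n in sample_sizes]
    formatted = ", ".join(f"{v:.1f}" for v in values)
    print(f"gamma={gamma}: ETMRCA for n={sample_sizes} -> {formatted}; limit {limit:.1f}")
```

For each growth rate the values change by well under 1% between n = 50 and n = 5000, in line with the nearly flat curves described above.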

Figure 5.

The comparison of the expected TMRCA and total branch length obtained from the asymptotic distribution and the coalescent-simulated or exact distribution under three different exponential growth rates, assuming the contemporary population size to be N0 = 2 × 10^6. (A) The expectation of time to the most recent common ancestor (ETMRCA) for different sample sizes. The curves correspond to the asymptotic ETMRCA for three growth rates (γ = 0.001, 0.005, 0.01). The open circles, squares, and diamonds are the averages of TMRCA over 200 coalescent simulations with the respective growth rates. (B) The expectation of total branch lengths of gene genealogies (ETBL) for different sample sizes. The curves correspond to the asymptotic ETBL for three growth rates (γ = 0.001, 0.005, 0.01). The open circles, squares, and diamonds are the exact ETBLs estimated using Equation 35 of Polanski et al. (2003) with the respective growth rates.

Estimating the ETBL directly from the exact intercoalescence time distributions suffers from severe numerical instability when the sample size is large. As pointed out by Polanski et al. (2003), however, the estimation can be simplified by interchanging summations and canceling large coefficient terms with each other in the alternating series, so that the resulting equations are computationally feasible for large samples (see Polanski et al. 2003, Equation 35). We thus compare the asymptotic formula to the exact ETBL estimated from their equation. In application, we found that even with the technique of Polanski et al. (2003), the numerical instability issue still exists when the sample size is sufficiently large. For example, when n > 1600 for γ = 0.001, a high-precision arithmetic library is needed to estimate the exact ETBL (Polanski et al. 2003, Equation 35). The expected TBLs for the three growth rates are shown in Figure 5B. The asymptotic results fit the exact results very well. Furthermore, Equation 31 is in simple analytical form, making the computation for large samples much easier and faster than with the exact equation in Polanski et al. (2003).

The allele frequency spectrum

The AFS is defined as the sampling distribution of the frequency of mutant alleles in a randomly collected finite sample (Chen 2012). The AFS is informative for the inference of demographic history and natural selection (Sawyer and Hartl 1992; Williamson et al. 2005; Evans et al. 2007; Chen et al. 2007; Gutenkunst et al. 2009; Lukić et al. 2011; Živković and Stephan 2011; Song and Steinrücken 2012; Chen 2012). It has been well studied using diffusion theory since the foundation of population genetics (Kimura 1955). The AFS for any nonequilibrium population can also be obtained in the coalescent framework using the expected time lengths of gene genealogies (Griffiths and Tavaré 1998; Polanski et al. 2003; Marth et al. 2004; Chen 2012, 2013). The coalescent-based AFS has been used extensively in studies of population growth, bottlenecks, and other aspects of demographic history (Wooding and Rogers 2002; Polanski and Kimmel 2003; Marth et al. 2004). The AFS under the coalescent model is derived in analytical form and is computationally efficient for small samples. However, deriving the AFS requires the exact distribution of lineages, which involves sums of alternating series and is difficult to evaluate for large samples. As a result, for large samples, either a high-precision arithmetic library must be adopted (Wooding and Rogers 2002; Marth et al. 2004), or the problem must be transformed into a hypergeometric summation (Polanski and Kimmel 2003). A high-precision arithmetic library requires tedious programming and can significantly increase the computational time. The hypergeometric summation allows efficient estimation of the AFS for large samples from any single population with temporally varying size, but it is not a general solution for the joint AFS (JAFS) of two or more populations, and it is very challenging to extend to more complicated scenarios, such as migration and selection.
Here we use the asymptotic distributions of coalescence times derived in Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers to get the approximation of the AFS for large samples. The derived AFS is in simple and analytical form and the approximation is accurate.

With the expectation of coalescence times derived in Equation 8, we can easily get the expectation of intercoalescence times, $\mathbb{E}W_m = \mathbb{E}T_{m-1} - \mathbb{E}T_m$. Let $\{\mathbb{E}S_j(n), 0 < j < n\}$ denote the AFS of SNPs from a sample of size n, with the jth entry $\mathbb{E}S_j(n)$ being the expected number of segregating sites having j copies of the derived allele. The AFS can then be analytically estimated as in former studies (Fu 1995; Griffiths and Tavaré 1998):

graphic file with name 721equ10.jpg (32)

where μ is the mutation rate per generation.

In Equation 32, we assume an infinitely-many-sites model for the mutation, so that mutations occurring on any of the k branches spanning the time interval (Tk, Tk−1) follow a Poisson process with mean $\mu k\,\mathbb{E}(W_k)$. During the subsequent bifurcation process in which the number of lineages increases from k to n, the count of the mutant increases from a single copy to j among the n lineages at present with probability

$$\frac{\binom{n-j-1}{k-2}}{\binom{n-1}{k-1}},$$

which follows from the exchangeability of lineages.
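The two ingredients just described (the Poisson mean μk E(Wk) for mutations arising while k lineages exist, and the binomial ratio for a mutation ending up in j of the n sampled lineages) can be combined into a short computation of the expected AFS. The sketch below is an illustration under stated assumptions rather than a transcription of Equation 32: the function names are hypothetical, the sum over k is taken from 2 to n − j + 1 as in the standard form of Fu (1995) and Griffiths and Tavaré (1998), and the expected intercoalescence times are passed in as a plain dictionary. As a sanity check it uses a constant-size population with E(Wk) = 2N/(k(k − 1)) generations, an assumed time scaling, for which the expected AFS should equal 2Nμ/j.

```python
import math


def expected_afs(expected_w, n, mu):
    """Expected allele frequency spectrum from expected intercoalescence times.

    expected_w : dict mapping k (2..n) to E(W_k), the expected time during
                 which the genealogy has k lineages
    n          : sample size
    mu         : mutation rate per generation for the region considered
    Returns a dict mapping j (1..n-1) to the expected number of sites with
    j copies of the derived allele.
    """
    afs = {}
    for j in range(1, n):
        total = 0.0
        # A branch existing when there are k lineages can have j descendants
        # only if k <= n - j + 1.
        for k in range(2, n - j + 2):
            prob = math.comb(n - j - 1, k - 2) / math.comb(n - 1, k - 1)
            total += k * expected_w[k] * prob
        afs[j] = mu * total
    return afs


# Sanity check with a constant-size population (assumed scaling, see lead-in).
n, N, mu = 10, 1.0e4, 1.0e-4
constant_w = {k: 2.0 * N / (k * (k - 1)) for k in range(2, n + 1)}
afs = expected_afs(constant_w, n, mu)
theta = 2.0 * N * mu
for j in range(1, n):
    print(f"j={j}: computed {afs[j]:.6f}, theta/j {theta / j:.6f}")
```

To obtain the asymptotic AFS for a growing population, the dictionary of expected intercoalescence times would instead be filled with differences of the asymptotic expected coalescence times, as described in the text.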

The asymptotic AFS for large samples under a variable population size model is estimated through the expected intercoalescence times by the above approach, with $\mathbb{E}(T_m)$ derived in Asymptotics of coalescence times. We present three estimates of the AFS in Figure 6 to demonstrate the accuracy of the asymptotic approximation: the sample AFS based on coalescent simulations, the exact formula of Polanski and Kimmel (2003), and the asymptotic AFS. The simulated samples of size 500 are assumed to come from a contemporary population with N0 = 2 × 10^6 and exponential growth rates of 0.001, 0.005, and 0.01, respectively. The dark blue bars of the histograms in Figure 6 are the AFSs of a simulated 10-Mb region, concatenated from 1000 regions of 10 kb, with the mutation rate and recombination rate both being 1 × 10^−8 per nucleotide. The light blue bars represent the exact AFS estimated using the exact formulas of Polanski and Kimmel (2003) (Equations 8–15 in their article), and the white bars represent the asymptotic AFS calculated from Equation 32 for the above chosen parameters. For a zoomed-in view, we present only the first 25 entries of the AFSs in Figure 6. The AFS based on the asymptotic distribution of intercoalescence times closely matches both the exact result and the simulation result. The asymptotic result derived in this article has advantages over the existing methods. It is in simple analytical form, and the calculation is fast, free of numerical instability, and does not involve numerical integration or sampling-based methods. Also, the method can be flexibly generalized to various population history models other than the simple exponential growth model. The extension of the asymptotic AFS to various demographic history models, especially the joint AFS for multiple populations, will be investigated in future studies.

Figure 6.

The theoretically predicted and simulated allele frequency spectrum for a sample of 500 lineages collected from exponentially growing populations with different growth rates. The contemporary population size is 2 × 10^6 and the growth rates are (A) 0.001, (B) 0.005, and (C) 0.01. The three color bars in the histograms correspond to the simulated AFS, the AFS estimated using the exact formulas of Polanski and Kimmel (2003), and the AFS from the asymptotic formula derived in this article, respectively.

Discussion

The distributions of coalescence times and the number of ancestral lineages play an essential role in coalescent modeling and population genetic inference. The exact distributions of both ancestral lineage numbers and coalescence times have been studied and expressed as sums of alternating series, the terms of which are difficult to evaluate when the sample size is large (Tavaré 1984; Griffiths and Tavaré 1998; Polanski et al. 2003). With the rapid advancement of sequencing technology, large-sample genomic sequencing data are accumulating, calling for new coalescent theories and methods for population genetic analysis. This article extends the asymptotic distributions of ancestral lineage numbers and coalescence times in constant populations (Griffiths 1984) to populations with temporally varying size. The asymptotic distributions provide a computationally fast and reliable alternative to the exact distributions in large samples. We have also shown that the asymptotic distributions are useful in obtaining statistics describing the properties of large genealogies and in analytically constructing the large-sample allele frequency spectrum. We expect the theoretical results derived in this article, together with the results in Griffiths (1984), to be useful for coalescent-based methodology development in the age of population-level sequencing data.

Supplementary Material

Supporting Information

Acknowledgments

We are grateful to Dr. Robert Griffiths for insightful comments on an earlier version of the manuscript, which greatly improved the work. We are grateful to Dr. Joachim Hermisson and the two anonymous reviewers for their helpful comments. We are also grateful to Drs. Li Jin, Bing Su, and Hong Shi for motivating and encouraging the work.

Footnotes

Communicating editor: J. Hermisson

Literature Cited

  1. Altshuler D., Lander E., Ambrogio L., Bloom T., Cibulskis K., et al., 2010. A map of human genome variation from population scale sequencing. Nature 467: 1061–1073.
  2. Anderson E., Slatkin M., 2007. Estimation of the number of individuals founding colonized populations. Evolution 61: 972–983.
  3. Billingsley P., 2012. Probability and Measure. Wiley, New York.
  4. Chen H., 2012. The joint allele frequency spectrum of multiple populations: a coalescent theory approach. Theor. Popul. Biol. 81: 179–195.
  5. Chen H., 2013. Intercoalescence time distribution of incomplete genealogies in temporally varying populations, and applications in population genetic inference. Ann. Hum. Genet. 77: 158–173.
  6. Chen H., Green R. E., Pääbo S., Slatkin M., 2007. The joint allele-frequency spectrum in closely related species. Genetics 177: 387–398.
  7. Coventry A., Bull-Otterson L. M., Liu X., Clark A. G., Maxwell T. J., et al., 2010. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat. Commun. 1: 131.
  8. Dlugosch K., Parker I., 2007. Founding events in species invasions: genetic variation, adaptive evolution, and the role of multiple introductions. Mol. Ecol. 17: 431–449.
  9. Donnelly P., 1984. The transient behaviour of the Moran model in population genetics. Math. Proc. Camb. Philos. Soc. 95: 349–358.
  10. Donnelly P., Tavaré S., 1995. Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29: 401–421.
  11. Evans S., Shvets Y., Slatkin M., 2007. Non-equilibrium theory of the allele frequency spectrum. Theor. Popul. Biol. 71: 109–119.
  12. Ewens W., 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112.
  13. Ewens W., 2004. Mathematical Population Genetics: Theoretical Introduction, Vol. 1. Springer Verlag, New York.
  14. Felsenstein J., Kuhner M., Yamato J., Beerli P., 1999. Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data. Lect. Notes Monogr. Ser., 163–185.
  15. Fu Y. X., 1995. Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197.
  16. Griffiths R. C., 1980. Lines of descent in the diffusion approximation of neutral Wright–Fisher models. Theor. Popul. Biol. 17: 37–50.
  17. Griffiths R. C., 1984. Asymptotic line-of-descent distributions. J. Math. Biol. 21: 67–75.
  18. Griffiths R. C., 2006. Coalescent lineage distributions. Adv. Appl. Probab. 38: 405–429.
  19. Griffiths R. C., Tavaré S., 1994a. Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B 344: 403–410.
  20. Griffiths R. C., Tavaré S., 1994b. Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131–159.
  21. Griffiths R. C., Tavaré S., 1998. The age of a mutation in a general coalescent tree. Stoch. Models 14: 273–295.
  22. Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5: e1000695.
  23. Hein J., Schierup M., Wiuf C., 2005. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, New York.
  24. Hudson R. R., 1990. Gene genealogies and the coalescent process. Oxford Surv. Evol. Biol. 7: 44.
  25. Hudson R. R., 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.
  26. Kimura M., 1955. Solution of a process of random genetic drift with a continuous model. Proc. Natl. Acad. Sci. USA 41: 144–150.
  27. Kingman J., 1982a. The coalescent. Stochastic Process. Appl. 13: 235–248.
  28. Kingman J., 1982b. Exchangeability and the evolution of large populations, pp. 97–112 in Exchangeability in Probability and Statistics, edited by G. Koch and F. Spizzichino. North-Holland, Amsterdam.
  29. Lukić S., Hey J., Chen K., 2011. Non-equilibrium allele frequency spectra via spectral methods. Theor. Popul. Biol. 79: 203–219.
  30. Mardis E. R., 2008. The impact of next-generation sequencing technology on genetics. Trends Genet. 24: 133–140.
  31. Marth G. T., Czabarka E., Murvai J., Sherry S. T., 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351–372.
  32. Maruvka Y., Shnerb N., Bar-Yam Y., Wakeley J., 2011. Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Mol. Biol. Evol. 28: 1617–1631.
  33. Nordborg M., 2001. Coalescent theory, pp. 179–212 in Handbook of Statistical Genetics, edited by Balding D. J., Bishop M., Cannings C. Wiley, Chichester, UK.
  34. Polanski A., Kimmel M., 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165: 427–436.
  35. Polanski A., Bobrowski A., Kimmel M., 2003. A note on distributions of times to coalescence, under time-dependent population size. Theor. Popul. Biol. 63: 33–40.
  36. Risch N., Tang H., Katzenstein H., Ekstein J., 2003. Geographic distribution of disease mutations in the Ashkenazi Jewish population supports genetic drift over selection. Am. J. Hum. Genet. 72: 812–822.
  37. Sawyer S. A., Hartl D. L., 1992. Population genetics of polymorphism and divergence. Genetics 132: 1161–1176.
  38. Song Y. S., Steinrücken M., 2012. A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics 190: 1117–1129.
  39. Tajima F., 1989. Statistical methods for testing the neutral mutations hypothesis by DNA polymorphism. Genetics 123: 585–595.
  40. Takahata N., Nei M., 1985. Gene genealogy and variance of interpopulational nucleotide differences. Genetics 110: 325–344.
  41. Tavaré S., 1984. Line-of-descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 26: 119–164.
  42. Tavaré S., Balding D. J., Griffiths R. C., Donnelly P., 1997. Inferring coalescence times from DNA sequence data. Genetics 145: 505–518.
  43. Watterson G., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276.
  44. Watterson G., 1984. Lines of descent and the coalescent. Theor. Popul. Biol. 26: 77–92.
  45. Williamson S. H., Hernandez R., Fledel-Alon A., Zhu L., Nielsen R., et al., 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA 102: 7882–7887.
  46. Wooding S., Rogers A., 2002. The matrix coalescent and an application to human single-nucleotide polymorphisms. Genetics 161: 1641–1650.
  47. Živković D., Stephan W., 2011. Analytical results on the neutral non-equilibrium allele frequency spectrum based on diffusion theory. Theor. Popul. Biol. 79: 184–191.

Articles from Genetics are provided here courtesy of Oxford University Press
