Asymptotic Distributions of Coalescence Times and Ancestral Lineage Numbers for Populations with Temporally Varying Size

Hua Chen; Kun Chen

doi:10.1534/genetics.113.151522

Genetics. 2013 Jul;194(3):721–736.

PMCID: PMC3697976 PMID: 23666939

Abstract

The distributions of coalescence times and ancestral lineage numbers play an essential role in coalescent modeling and ancestral inference. The exact distributions of both quantities are expressed as sums of alternating series, and the terms in the series become numerically intractable for large samples. More computationally attractive are their asymptotic distributions, which were derived in Griffiths (1984) for populations with constant size. In this article, we derive the asymptotic distributions of coalescence times and ancestral lineage numbers for populations with temporally varying size. For a sample of size n, denote by T_m the mth coalescence time, when m + 1 lineages coalesce into m lineages, and by A_n(t) the number of ancestral lineages at time t back from the current generation. Similar to the results in Griffiths (1984), the number of ancestral lineages, A_n(t), and the coalescence times, T_m, are asymptotically normal, with the mean and variance of these distributions depending on the population size function, N(t). At the very early stage of the coalescent, when t → 0, the number of coalesced lineages n − A_n(t) follows a Poisson distribution, and as m → n, $n (n - 1) T_{m} / 2 N (0)$ follows a gamma distribution. We demonstrate the accuracy of the asymptotic approximations by comparing them to both exact distributions and coalescent simulations. Several applications of the theoretical results are also shown: deriving statistics related to the properties of gene genealogies, such as the time to the most recent common ancestor (TMRCA) and the total branch length (TBL) of the genealogy, and deriving the allele frequency spectrum for large genealogies. With the advent of genomic-level sequencing data for large samples, the asymptotic distributions are expected to have wide applications in theoretical and methodological development for population genetic inference.

Keywords: coalescent theory, gene genealogy, coalescence time, ancestral lineage, ancestral inference, variable population size

COALESCENT theory provides a fundamental framework for stochastic modeling and likelihood inference in population genetic studies (Griffiths 1980; Kingman 1982a; Hudson 1990; Nordborg 2001). A coalescent process can be decomposed into two independent processes: the topology of the gene genealogy and the sequential process of intercoalescence times (Kingman 1982a). In this article, we investigate the latter process and two important random quantities associated with it: the coalescence times and the number of ancestral lineages (Kingman 1982a). Studying the two quantities is both biologically and theoretically meaningful. First, inferring the coalescence times and the number of ancestral lineages of a contemporary sample or population helps to elucidate ancient demographic history, including population admixture, migration, and founder effect. It can also inform medical studies of the origin and genetic architecture of inherited diseases in different populations, as well as ecological studies, for example, of the process of species invasion (Risch et al. 2003; Anderson and Slatkin 2007; Dlugosch and Parker 2007). Second, the distributions of coalescence times and ancestral lineage numbers are the essential components needed to construct a coalescent likelihood, for example, in the allele frequency spectrum-based approaches (Tavaré 1984; Griffiths and Tavaré 1998; Polanski and Kimmel 2003; Chen 2012).

The exact distribution of the number of ancestral lineages at t generations ago for n haplotypes randomly collected at present, A_n(t), t ≥ 0, was derived in Tavaré (1984) under the coalescent for populations of constant size (Equation 15 under Asymptotics of ancestral lineage numbers; see also Griffiths 1980, Donnelly 1984, Watterson 1984, and Takahata and Nei 1985). The exact distribution has connections to Ewens’ sampling formula under the infinitely many-alleles model (Ewens 1972). In a later study, the equation was extended to populations with temporally varying size (Griffiths and Tavaré 1998). The seminal equations in Tavaré (1984) and Griffiths and Tavaré (1998) are very useful in methodology development. However, both exact distributions are expressed as sums of series with alternating signs, and the coefficients of the series become numerically unstable when n > 50.

As another important quantity in the coalescent process, the coalescence time, T_m, defined as the time when m + 1 lineages merge into m lineages, is well known as a sum of n − m intercoalescence times. These n − m intercoalescence times are distributed as independent exponential variables with distinct respective rates k(k − 1)/2, k = n, …, m + 1 under a constant population size model. The analytical expressions of many statistics are derived on the basis of this fact. For populations with time-varying size, the intercoalescence times are no longer independent. Griffiths and Tavaré (1998) and Polanski et al. (2003) derived the distribution of coalescence times under a temporally variable population size model still as a sum of series, and the evaluation of the coefficients also suffers from the numerical issue when sample size is large.

The numerical problem caused by large sample size has become an unavoidable problem with the rapid emergence of large-scale sequencing data for samples of thousands of individuals (Mardis 2008; Altshuler et al. 2010; Coventry et al. 2010), which, on the other hand, provides an unprecedented opportunity for population genetic study. Considerable effort is being devoted to developing computationally efficient approaches for the analysis of genomic data with large sample size. Most existing coalescent-based inference methods in population genetics rely on sampling approaches with intensive computation, such as importance sampling and Markov chain Monte Carlo, to integrate over the space of gene genealogies (Griffiths and Tavaré 1994b; Felsenstein et al. 1999), and thus are applicable only for analyzing local genomic regions in small samples. A recently developed method, centered on a coalescent-based joint allele frequency spectrum (JAFS) (Chen 2012), gains computational efficiency for the analysis of genomic data from multiple populations, as the author used the derived analytical form of the coalescent-based JAFS instead of the sampling approaches. One of the limitations is that the author derived the JAFS on the basis of the equations of Tavaré (1984) and Griffiths and Tavaré (1998), and the numerical issues of these equations limit the use of the JAFS to small gene genealogies.

Griffiths (2006) simplified the computation of the exact lineage distribution by replacing the sum of alternating series with the hypergeometric function, which has a representation in terms of a complex integral and can be evaluated by numerical integration or simulation. Because the distribution is not in a simple form, this may discourage its use for theory and methodology development. Polanski and Kimmel (2003) used the methods of hypergeometric summation to avoid the numerical issue of large n when using the exact distribution of coalescence times to obtain the allele frequency spectrum (AFS) under a time-varying population size model. Their method avoids the calculation of the coefficients in the alternating series, which grow explosively as the gene genealogy size increases. However, this approach is designed specifically for calculation of the AFS for some demographic scenarios and is not a general solution for the numerical instability in the calculation of the distributions of coalescence times and the number of ancestral lineages. Another way to avoid the calculation of the series with alternating signs is to use the asymptotic approximation instead of the exact distribution. The asymptotic distributions have the additional advantage that they are often in simpler form and are easier to use in theoretical development.

The asymptotic theories of the coalescence times and the number of ancestral lineages for large gene genealogies in constant populations have been derived by Griffiths (1984). He demonstrated that as t → 0 and the sample size n → ∞, the distributions of A_n(t) and T_m converge asymptotically to normal distributions. The essential ingredient in Griffiths’ proof is to apply Lyapunov’s theorem to independently distributed intercoalescence times. For populations with temporally varying size, the validity of Griffiths’ theorems is yet to be addressed, as the intercoalescence times are dependent variables in this case, violating the independence assumption of Lyapunov’s theorem (Billingsley 2012). However, if we scale the time to account for the fluctuation in population size by $\int_{0}^{t} (d s / N (s)), t \geq 0$ , where N(⋅) is the population size function over time, the coalescent process on the new time scale is equivalent to the standard coalescent (Kingman 1982b; Griffiths and Tavaré 1994b). The theorems for the standard coalescent in Griffiths (1984) can then be borrowed to obtain asymptotic distributions for populations with temporally varying size. Extension of Griffiths’ theorems to populations with time-varying size is very important for population genetic inference, since most ancestral inference is based on the nonequilibrium genetic polymorphism patterns in populations with temporally varying size. Also, the population size and growth rate are themselves demographic parameters of great interest.

In the following sections, we first derive in Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers the asymptotic distributions of coalescence times and the number of ancestral lineages for populations with temporally varying size, specifically, for populations under exponential growth. In Numerical Results we then compare the asymptotic distributions to exact distributions or coalescent simulations if the exact distributions are difficult to evaluate. We demonstrate that the asymptotic distributions of coalescence times and lineage numbers coincide with both the simulated and exact distributions surprisingly well for a wide range of parameters and for samples with even moderate size. Last, in Applications, we apply the asymptotic distributions to deriving statistics related to the properties of gene genealogies, such as the expected time to the most recent common ancestor (TMRCA) and the total branch length (TBL), and deriving the AFS for large samples in simpler analytical form. The article closes with a discussion.

Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers

Notations and summary

Consider a sample of n lineages (haplotypes) randomly drawn from the contemporary population. Let N(t) be the deterministic haploid population size at t generations ago. The historical population size N(t) is assumed to be large enough to satisfy Kingman’s coalescent assumption (N(t) ≫ n). For simplicity of notation, let N₀ ≡ N(0) be the size of the contemporary population. Following Griffiths and Tavaré (1994a), the relative size function λ(t) is defined as

λ (t) = \frac{N (t)}{N_{0}} .

(1)

Two random quantities we investigate are the coalescence times and ancestral lineage numbers. Denote by T_m, 1 ≤ m ≤ n, the coalescence time when m + 1 lineages merge into m lineages, with T_n ≡ 0 (Figure 1). It is known that the intercoalescence time, W_m = T_{m−1} − T_m, or the time length of gene genealogies during which there are m lineages, is distributed as an exponential variable with rate $m (m - 1) / 2 N_{0}$ for populations with constant size N₀ (Fu 1995). The coalescence time T_m can also be written as $T_{m} = \sum_{k = m + 1}^{n} W_{k}$ .

Figure 1. An illustration of the gene genealogy and coalescence times of five lineages at present. The coalescence time T_m is defined as the time when m + 1 lineages coalesce into m lineages.

Denote by A_n(t) the number of ancestral lineages at t generations back in the past. In an ancestral process where both coalescent and mutation events can reduce the number of lineages, A_n(t) is referred to as the number of nonmutant lineages at t generations back from the present (Griffiths 1984). In this article, we consider the genealogical history of only coalescent events; mutations are treated separately and assumed to occur independently following a Poisson process along the branches of a given gene genealogy. For populations with constant size N₀, the random process {A_n(t), t ≥ 0} is a pure-death process that jumps from state m to m − 1 with rate m(m − 1)/2N₀, 2 ≤ m ≤ n (Kingman 1982a). All the random variables, T_m, W_m, and A_n(t), are defined in a coalescent process with time in units of generations. Let $τ = g (t) = \int_{0}^{t} (1 / N (u)) d u$ be the time rescaled by the population size N(t); for populations with constant size, $τ = t / N_{0}$ . We use the notations ${\dot{T}}_{m}$ , ${\dot{W}}_{m}$ , and ${\dot{A}}_{n} (τ)$ to denote the coalescence time, intercoalescence time, and number of ancestral lineages in the standard coalescent process with time scaled by the constant population size N₀, which is also referred to as Kingman’s n-coalescent process in this article.

In the following, we aim to develop the asymptotic distributions of coalescence times, T_m, and ancestral lineage numbers, A_n(t), for populations with temporally varying size N(t). The main results include:

for a sample of size n, as n → ∞, n/m → a, 1 < a < ∞, the mth coalescence time, T_m, is asymptotically normal, with the mean and variance depending on the historical population size (Equations 8 and 9);
as $g (t) \to 0, n \to \infty, \frac{1}{2} n g (t) \to α, 0 < α \leq \infty$ , the number of ancestral lineages at time t, A_n(t), is asymptotically normal, with the mean and variance provided in Equations 19 and 20; and
at the very early stage of the coalescent, when t → 0 more rapidly, n → ∞, the number of coalesced lineages n − A_n(t) follows a Poisson distribution, and when m → n, n − m is bounded above, $(n (n - 1) T_{m}) / 2 N_{0}$ follows a gamma distribution.

Asymptotics of coalescence times

Under a constant population-size model, the intercoalescence times, ${\dot{W}}_{m}, 2 \leq m \leq n$ , are independent exponential variables with respective rates $(\begin{matrix} m \\ 2 \end{matrix})$ . Griffiths (1984) proved the asymptotic distribution of coalescence times by applying Lyapunov’s version of the central limit theorem to the sum of independent intercoalescence times. Denote by ${\dot{T}}_{m}$ the mth coalescence time in a standard n-coalescent process, scaled by population size N₀ (Ewens 2004). By Theorem 1 of Griffiths (1984), under the conditions n → ∞, m → ∞, while n/m → a, 1 < a ≤ ∞, ${\dot{T}}_{m}$ is asymptotically normal, and the mean and the variance of ${\dot{T}}_{m}$ are

μ_{m} = \sum_{k = m + 1}^{n} \frac{2}{k (k - 1)} = 2 (\frac{1}{m} - \frac{1}{n}),

(2)

and

\begin{matrix} σ_{m}^{2} = \sum_{k = m + 1}^{n} \frac{4}{k^{2} {(k - 1)}^{2}} = 4 \sum_{k = m + 1}^{n} [\frac{1}{{(k - 1)}^{2}} + \frac{1}{k^{2}} - \frac{2}{k (k - 1)}] \\ = 4 \{ψ_{1} (m) - ψ_{1} (n) + ψ_{1} (m + 1) - ψ_{1} (n + 1) - 2 (m^{- 1} - n^{- 1})\}, \end{matrix}

(3)

where $ψ_{1} (z) = (d^{2} / d z^{2}) \ln Γ (z)$ is the trigamma function and $Γ (z) = \int_{0}^{\infty} e^{- t} t^{z - 1} d t$ is the gamma function. Note that the result above differs slightly from the original result in Griffiths (1984) in that the effect of mutations on lines of descent is not considered in Equations 2 and 3. We take the strategy of constructing the genealogy first and then, given the branch lengths of the genealogy, modeling mutations as a Poisson process with the rate proportional to the specific branch length.
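As a numerical check of Equations 2 and 3, the following Python sketch compares the closed forms with the direct sums. It is illustrative only: the function names are hypothetical, and the trigamma function is evaluated with a simple recurrence and asymptotic series (an assumption made here to keep the example dependency-free, not a method taken from this article).

```python
import math


def trigamma(x: float) -> float:
    """Trigamma function psi_1(x) for x > 0 (recurrence plus asymptotic series)."""
    total = 0.0
    # psi_1(x) = psi_1(x + 1) + 1/x^2: shift x upward until the series is accurate.
    while x < 10.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    series = inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0))
    )
    return total + series


def mu_sum(n: int, m: int) -> float:
    """Mean of the scaled coalescence time as the direct sum in Equation 2."""
    return sum(2.0 / (k * (k - 1)) for k in range(m + 1, n + 1))


def mu_closed(n: int, m: int) -> float:
    """Closed form of Equation 2."""
    return 2.0 * (1.0 / m - 1.0 / n)


def sigma2_sum(n: int, m: int) -> float:
    """Variance of the scaled coalescence time as the direct sum in Equation 3."""
    return sum(4.0 / (k * k * (k - 1) ** 2) for k in range(m + 1, n + 1))


def sigma2_closed(n: int, m: int) -> float:
    """Closed form of Equation 3 in terms of the trigamma function."""
    return 4.0 * (
        trigamma(m) - trigamma(n) + trigamma(m + 1) - trigamma(n + 1)
        - 2.0 * (1.0 / m - 1.0 / n)
    )


if __name__ == "__main__":
    n, m = 50, 5
    print(mu_sum(n, m), mu_closed(n, m))
    print(sigma2_sum(n, m), sigma2_closed(n, m))
```

The direct sums are cheap for moderate n; the closed forms avoid the O(n − m) loop when many values of m are needed.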

Under a variable population-size model, the intercoalescence times are no longer mutually independent. We assume that the population evolves according to the Wright–Fisher model, and its size changes over time deterministically; that is, the population size is different but known at each generation. The joint distribution of coalescence times (T_m, …, T_{n−1}) for populations with temporally varying size is (Griffiths and Tavaré 1998):

f_{T_{m}, \dots, T_{n - 1}} (t_{m}, \dots, t_{n - 1}) = \prod_{k = m}^{n - 1} \frac{(\begin{matrix} k + 1 \\ 2 \end{matrix})}{N_{0} λ (t_{k})} exp (- \frac{(\begin{matrix} k + 1 \\ 2 \end{matrix})}{N_{0}} \int_{t_{k + 1}}^{t_{k}} \frac{1}{λ (u)} d u) .

(4)

The marginal probability density function (p.d.f.) of coalescence times $f_{T_{m}}$ was derived explicitly by Polanski et al. (2003) through expanding an integral transform of the marginal p.d.f. into partial fractions. An equivalent equation in different form can be derived on the basis of the definition of the n-coalescent and a pure-death process (see Griffiths 2006 and Chen 2012, Appendix A, for details of derivation):

f_{T_{m}} (t) = ℙ (A_{n} (t) = m + 1) \frac{(m + 1) m}{2 N (t)} .

(5)

As ℙ(A_n(t) = m + 1) was derived from the expansion of the transition function (see Equations 15 and 16), which involves the sum of alternating series and is numerically unstable for large n, the exact distribution may be practically difficult to use in the ancestral inference for large samples. In the following, we aim to derive the asymptotic distribution of coalescence times for temporally varying populations.

It is known in the coalescent literature that the coalescence time rescaled at rate $1 / N (t)$ , $g (T_{m}) = \int_{0}^{T_{m}} (1 / N (u)) d u$ , follows the distribution of coalescence time in the standard Kingman’s n-coalescent, and the scaled intercoalescence times,

g (T_{m - 1}) - g (T_{m}) = \int_{T_{m}}^{T_{m - 1}} \frac{1}{N (u)} d u, 2 \leq m \leq n,

are mutually independent exponential variables with the rate of $(\begin{matrix} m \\ 2 \end{matrix})$ . The coalescent process under a variable population size model, as sample size n → ∞, is still Kingman’s coalescent since we assume the population size tends to infinity more quickly than the sample size, in other words, n/N(t) → 0, which makes the condition of Kingman’s coalescent n ≪ N(t) still satisfied. More detailed discussions on time scaling in a variable population size coalescent model can be found in Kingman (1982b), Griffiths and Tavaré (1994a), Donnelly and Tavaré (1995), and Nordborg (2001).
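The time rescaling described above also gives a direct way to simulate coalescence times under a variable population size: draw independent exponential intercoalescence times on the rescaled axis, sum them to obtain g(T_m), and map back with g⁻¹. The Python sketch below is illustrative only and is not the modified ms simulator used in this article. It assumes exponential growth, N(t) = N₀e^{−γt}, for which the inverse is g⁻¹(τ) = ln(N₀γτ + 1)/γ; the function names are hypothetical.

```python
import math
import random


def g_inverse_exponential(tau: float, n0: float, gamma: float) -> float:
    """Inverse of g(t) = (exp(gamma*t) - 1) / (n0*gamma), valid for gamma > 0."""
    return math.log1p(n0 * gamma * tau) / gamma


def simulate_coalescence_time(n: int, m: int, n0: float, gamma: float,
                              rng: random.Random) -> float:
    """Draw one coalescence time T_m (in generations) for a sample of size n."""
    # On the rescaled axis, the time spent with k lineages is
    # Exponential with rate k(k-1)/2, independently for k = n, ..., m+1.
    tau = sum(rng.expovariate(k * (k - 1) / 2.0) for k in range(m + 1, n + 1))
    # Map the rescaled time g(T_m) back to generations.
    return g_inverse_exponential(tau, n0, gamma)


if __name__ == "__main__":
    rng = random.Random(2013)  # fixed seed so the run is repeatable
    draws = [simulate_coalescence_time(50, 5, 2.0e6, 0.001, rng)
             for _ in range(4000)]
    mean = sum(draws) / len(draws)
    sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / (len(draws) - 1))
    print(f"simulated mean of T_5: {mean:.1f}, sd: {sd:.1f}")
```

For other size functions N(t), only the inverse map changes; when no closed form exists, g⁻¹ could be evaluated by numerical root finding on the monotone function g.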

We start with a Taylor expansion of g(T_m) at g⁻¹(μ_m),

\begin{matrix} g (T_{m}) = μ_{m} + g^{'} (g^{- 1} (μ_{m})) (T_{m} - g^{- 1} (μ_{m})) \\ + \frac{g^{″} (g^{- 1} (μ_{m}))}{2} {(T_{m} - g^{- 1} (μ_{m}))}^{2} + O ({(T_{m} - g^{- 1} (μ_{m}))}^{3}), \end{matrix}

(6)

where g⁻¹(⋅), g′(⋅), and g″(⋅) represent the inverse function, the first derivative, and the second derivative of function g, respectively. The remainder term

\frac{g^{″} (g^{- 1} (μ_{m}))}{2} {(T_{m} - g^{- 1} (μ_{m}))}^{2} + O ({(T_{m} - g^{- 1} (μ_{m}))}^{3}),

or O((T_m − g⁻¹(μ_m))²) in Equation 6 is negligible as n/m → a, n → ∞, because g(T_m) follows the same distribution as ${\dot{T}}_{m}$ and ${\dot{T}}_{m} \to μ_{m}$ by the asymptotic properties of ${\dot{T}}_{m}$ shown in Griffiths (1984, Theorem 1).

Next, by Equation 6 and ignoring the remainder term, we have

\frac{T_{m} - g^{- 1} (μ_{m})}{σ_{m} / g^{'} (g^{- 1} (μ_{m}))} \to \frac{g (T_{m}) - μ_{m}}{σ_{m}} .

(7)

As $(g (T_{m}) - μ_{m}) / σ_{m} \to N (0, 1)$ , the limiting distribution of T_m can then be approximated by a normal distribution with the mean

E (T_{m}) = g^{- 1} (μ_{m})

(8)

and variance

Var (T_{m}) = \frac{σ_{m}^{2}}{{(g^{'} (g^{- 1} (μ_{m})))}^{2}} .

(9)

Substituting Equations 2 and 3 into Equations 8 and 9 yields the mean and variance for the asymptotic distribution of coalescence times, T_m, 1 ≤ m ≤ n − 1.

When the population is under exponential growth with rate γ, that is, N(t) = N₀e^{−γt}, it is straightforward to write the scaling function as

g (t) = \frac{e^{γ t} - 1}{N_{0} γ} .

(10)

The inverse and first derivative of g are $g^{- 1} (τ) = (ln (N_{0} γ τ + 1)) / γ$ and $g^{'} (t) = e^{γ t} / N_{0}$ , respectively. Using Equations 8 and 9, we have the mean

E (T_{m}) = g^{- 1} (μ_{m}) = \frac{1}{γ} ln (2 N_{0} γ (m^{- 1} - n^{- 1}) + 1)

(11)

and variance

\begin{matrix} Var (T_{m}) = \frac{σ_{m}^{2}}{{(g^{'} (g^{- 1} (μ_{m})))}^{2}} \\ = \frac{σ_{m}^{2}}{exp \{2 γ \frac{1}{γ} ln (2 N_{0} γ (m^{- 1} - n^{- 1}) + 1)\} / N_{0}^{2}} \\ = {(\frac{N_{0}}{2 N_{0} γ (m^{- 1} - n^{- 1}) + 1})}^{2} σ_{m}^{2} \\ = \frac{4 N_{0}^{2} (ψ_{1} (m) - ψ_{1} (n) + ψ_{1} (m + 1) - ψ_{1} (n + 1) - 2 (m^{- 1} - n^{- 1}))}{{(2 N_{0} γ (m^{- 1} - n^{- 1}) + 1)}^{2}} . \end{matrix}

(12)
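The mean and variance under exponential growth are simple to evaluate numerically. The following Python sketch (function names are hypothetical, and the variance of the scaled coalescence time is computed by the direct sum in Equation 3 rather than through the trigamma function) computes the asymptotic mean and standard deviation of T_m. With N₀ = 2 × 10⁶, γ = 0.001, n = 50, and m = 5 it should give values close to the asymptotic entries 6580.639 and 285.229 listed in Table 1.

```python
import math


def sigma2_scaled(n: int, m: int) -> float:
    """Variance of the scaled coalescence time (direct sum in Equation 3)."""
    return sum(4.0 / (k * k * (k - 1) ** 2) for k in range(m + 1, n + 1))


def asymptotic_tm_exponential(n: int, m: int, n0: float, gamma: float):
    """Return (mean, sd) of the normal approximation to T_m for N(t) = n0*exp(-gamma*t)."""
    mu = 2.0 * (1.0 / m - 1.0 / n)            # Equation 2
    scale = n0 * gamma * mu + 1.0             # equals n0 * g'(g^{-1}(mu_m))
    mean = math.log(scale) / gamma            # asymptotic mean g^{-1}(mu_m)
    sd = n0 * math.sqrt(sigma2_scaled(n, m)) / scale  # square root of Equation 12
    return mean, sd


if __name__ == "__main__":
    for n, m, gamma in [(50, 5, 0.001), (200, 100, 0.005), (800, 400, 0.010)]:
        mean, sd = asymptotic_tm_exponential(n, m, 2.0e6, gamma)
        print(f"n={n} m={m} gamma={gamma}: mean={mean:.3f} sd={sd:.3f}")
```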

Since the linear approximation is used in the above proof, there exists a bias between the derived asymptotic mean and the true mean of T_m. Here we quantify the magnitude of the bias specifically for the exponential growth model as an example. Using a Taylor expansion,

\begin{matrix} T_{m} = g^{- 1} (g (T_{m})) \\ = g^{- 1} (μ_{m}) + {(g^{- 1})}^{'} (μ_{m}) (g (T_{m}) - μ_{m}) \\ + \frac{{(g^{- 1})}^{''} (μ_{m})}{2} {(g (T_{m}) - μ_{m})}^{2} + O ({(g (T_{m}) - μ_{m})}^{3}), \end{matrix}

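and taking the expectation of both sides (the first-order term vanishes because g(T_m) has mean μ_m, and its variance is $σ_{m}^{2}$ ), it can be seen that

E (T_{m}) - g^{- 1} (μ_{m}) = \frac{{(g^{- 1})}^{''} (μ_{m})}{2} σ_{m}^{2} + O (E {(g (T_{m}) - μ_{m})}^{3})

(13)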

= - \frac{σ_{m}^{2}}{2 γ μ_{m}^{2}} + o (1) .

(14)

By Theorem 1 in Griffiths (1984), μ_m is of order m⁻¹ and $σ_{m}^{2}$ is of order m⁻³. Therefore, the bias of the asymptotic mean is of order m⁻¹, which shrinks to zero as n/m → a, n → ∞.
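To give a sense of the size of this bias, the Python sketch below evaluates the second-order term for the exponential growth model, using $(g^{-1})''(τ) = -γ (N_{0} / (N_{0} γ τ + 1))^{2}$, together with the approximation in Equation 14. The function name is hypothetical and the code is only an illustration. For N₀ = 2 × 10⁶, γ = 0.001, n = 50, and m = 5 the second-order term is roughly −41 generations, which has the same sign and a similar size as the difference of 42.020 between the asymptotic and simulated means reported in Table 1.

```python
import math


def second_order_bias(n: int, m: int, n0: float, gamma: float):
    """Second-order bias E(T_m) - g^{-1}(mu_m) for N(t) = n0 * exp(-gamma * t).

    Returns (second_order_term, equation_14_approx), where
      second_order_term  = 0.5 * (g^{-1})''(mu_m) * sigma_m^2
      equation_14_approx = -sigma_m^2 / (2 * gamma * mu_m^2)
    """
    mu = 2.0 * (1.0 / m - 1.0 / n)
    sigma2 = sum(4.0 / (k * k * (k - 1) ** 2) for k in range(m + 1, n + 1))
    # For g^{-1}(tau) = ln(n0*gamma*tau + 1) / gamma, the second derivative is
    # -gamma * (n0 / (n0*gamma*tau + 1))**2.
    second_derivative = -gamma * (n0 / (n0 * gamma * mu + 1.0)) ** 2
    second_order_term = 0.5 * second_derivative * sigma2
    # Equation 14 drops the "+ 1" in the denominator (n0*gamma*mu_m >> 1).
    equation_14_approx = -sigma2 / (2.0 * gamma * mu * mu)
    return second_order_term, equation_14_approx


if __name__ == "__main__":
    term, approx = second_order_bias(50, 5, 2.0e6, 0.001)
    print(f"second-order term: {term:.2f}, Equation 14: {approx:.2f}")
```

The two returned values agree closely only when N₀γμ_m is much larger than 1; for m close to n, μ_m is small and the simplified form in Equation 14 overstates the second-order term.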

Asymptotics of ancestral lineage numbers

Tavaré (1984) derived the exact distribution of the number of ancestral lineages at time t in the past for the coalescent with constant population size, by using a spectral expansion of the transition function that is associated with the death process {A_n(t), t ≥ 0} (see also Griffiths 1980; Donnelly 1984; Watterson 1984; Takahata and Nei 1985),

ℙ (A_{n} (t) = m) = \sum_{i = m}^{n} \frac{{(- 1)}^{i - m} (2 i - 1) m_{(i - 1)} n_{[i]}}{m! (i - m)! n_{(i)}} e^{- i (i - 1) t / 2 N_{0}}, 0 < m \leq n,

(15)

where $n_{(i)} = n (n + 1) \dots (n + i - 1), i \geq 1; n_{(0)} = 1$ , and $n_{[i]} = n (n - 1) \dots (n - i + 1), i \geq 1; n_{[0]} = 1$ are the rising and falling factorial functions, respectively. Griffiths and Tavaré (1998) generalized Equation 15 to populations with temporally varying size, for which the distribution of the number of ancestral lineages becomes

ℙ (A_{n} (t) = m) = \sum_{i = m}^{n} \frac{{(- 1)}^{i - m} (2 i - 1) m_{(i - 1)} n_{[i]}}{m! (i - m)! n_{(i)}} e^{- \frac{i (i - 1)}{2 N_{0}} \int_{0}^{t} \frac{1}{λ (u)} d u} .

(16)
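For small samples, Equations 15 and 16 can be evaluated directly. The Python sketch below is an illustration with hypothetical function names: it takes the rescaled time τ = g(t) as input, so the same code covers constant and variable population sizes, and it forms each coefficient exactly with rational arithmetic before converting to floating point. The final summation is still done in floating point, so the cancellation between alternating terms remains; this sketch is therefore intended only for small n.

```python
import math
from fractions import Fraction


def rising(x: int, i: int) -> int:
    """Rising factorial x_(i) = x (x+1) ... (x+i-1), with x_(0) = 1."""
    out = 1
    for j in range(i):
        out *= x + j
    return out


def falling(x: int, i: int) -> int:
    """Falling factorial x_[i] = x (x-1) ... (x-i+1), with x_[0] = 1."""
    out = 1
    for j in range(i):
        out *= x - j
    return out


def lineage_pmf(n: int, m: int, tau: float) -> float:
    """P(A_n = m) at rescaled time tau = g(t), following Equations 15 and 16."""
    total = 0.0
    for i in range(m, n + 1):
        # Exact rational coefficient of the i-th term.
        coeff = Fraction(
            (-1) ** (i - m) * (2 * i - 1) * rising(m, i - 1) * falling(n, i),
            math.factorial(m) * math.factorial(i - m) * rising(n, i),
        )
        total += float(coeff) * math.exp(-i * (i - 1) * tau / 2.0)
    return total


if __name__ == "__main__":
    n, tau = 10, 0.3
    pmf = [lineage_pmf(n, m, tau) for m in range(1, n + 1)]
    print("sum of probabilities:", sum(pmf))
    print("mean number of lineages:", sum(m * p for m, p in zip(range(1, n + 1), pmf)))
```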

In addition to the coalescence times, Griffiths (1984) investigated the asymptotics of ancestral lineage numbers for populations with constant size. Omitting mutation for the same reason as in Asymptotics of coalescence times, Griffiths’ Theorem 2 implies that ${\dot{A}}_{n} (τ)$ has an asymptotic normal distribution as $τ \to 0, n \to ∞, \frac{1}{2} n τ \to α, 0 < α < ∞$ , with a mean of

μ (τ) = \frac{2 η}{τ},

(17)

and variance

σ^{2} (τ) = 2 η τ^{- 1} {(η + β)}^{2} \{1 + η / (η + β) - η / α - η / (α + β) - 2 η\} β^{- 2},

(18)

where $β = - \frac{1}{2} τ$ and $η = α β / \{α (e^{β} - 1) + β e^{β}\}$ . In the following, we extend Griffiths’ conclusion of the asymptotic distribution for ancestral lineage numbers to populations with temporally varying size.

We observe that ℙ(A_n(t) ≤ m) = ℙ(T_m ≤ t). As g is a monotonically increasing continuous function, we have $ℙ (T_{m} \leq t) = ℙ (g (T_{m}) \leq g (t)) = ℙ ({\dot{A}}_{n} (τ) \leq m)$ , where τ ≡ g(t) as defined in the previous section. By Griffiths’ Theorem 2, as n → ∞, τ → 0, and $\frac{1}{2} n τ \to α$ , the ancestral lineage number of the standard coalescent process satisfies ${\dot{A}}_{n} (τ) \to_{d} N (μ (τ), σ^{2} (τ))$ , where μ(τ) and σ²(τ) are given in Equations 17 and 18. Therefore, $ℙ (A_{n} (t) \leq m) \to ℙ (Z \leq (m - μ (τ)) / σ (τ))$ , where Z is a standard normal variable.

The mean and variance of the limiting distribution of A_n(t) can be obtained by mapping back to the original time scale, as

E (A_{n} (t)) \to μ (τ) = μ (g (t)) = \frac{2 η}{g (t)}

(19)

and

\begin{matrix} Var (A_{n} (t)) \to σ^{2} (τ) = σ^{2} (g (t)) \\ = 2 η {(g (t))}^{- 1} {(η + β)}^{2} \{1 + η / (η + β) - η / α - η / (α + β) - 2 η\} β^{- 2}, \end{matrix}

(20)

where $α = {lim}_{n \to ∞, t \to 0} \frac{1}{2} n g (t)$ , $β = - \frac{1}{2} g (t)$ , and $η = α β / \{α (e^{β} - 1) + β e^{β}\}$ .

When the population is under exponential growth with rate γ, substituting the scaling function of Equation 10 into Equations 19 and 20 gives the asymptotic mean of A_n(t),

μ (g (t)) = \frac{2 η N_{0} γ}{e^{γ t} - 1},

(21)

and variance

σ^{2} (g (t)) = \frac{2 η N_{0} γ}{e^{γ t} - 1} {(η + β)}^{2} \{1 + η / (η + β) - η / α - η / (α + β) - 2 η\} β^{- 2},

(22)

where $α = {lim}_{n \to ∞, t \to 0} \frac{1}{2} [n (e^{γ t} - 1) / N_{0} γ]$ , $β = - \frac{1}{2} [(e^{γ t} - 1) / N_{0} γ]$ , and $η = α β / \{α (e^{β} - 1) + β e^{β}\}$ .
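The asymptotic mean and variance of the lineage number can be evaluated as in the Python sketch below. This is an illustration under stated assumptions: the function names are hypothetical, and the limit α is replaced by its finite-sample value ½ n g(t), which is a plug-in choice made here rather than a prescription from this article. The bracketed factor in the variance is a small difference of terms of order 1, so `math.expm1` is used to limit rounding error.

```python
import math


def g_exponential(t: float, n0: float, gamma: float) -> float:
    """Scaling function g(t) = (exp(gamma*t) - 1) / (n0*gamma), Equation 10."""
    return math.expm1(gamma * t) / (n0 * gamma)


def lineage_normal_approx(n: int, tau: float):
    """Asymptotic (mean, variance) of the lineage number at rescaled time tau = g(t)."""
    alpha = 0.5 * n * tau   # plug-in value for the limit alpha (assumption)
    beta = -0.5 * tau
    eta = alpha * beta / (alpha * math.expm1(beta) + beta * math.exp(beta))
    mean = 2.0 * eta / tau
    bracket = (1.0 + eta / (eta + beta) - eta / alpha
               - eta / (alpha + beta) - 2.0 * eta)
    variance = mean * (eta + beta) ** 2 * bracket / beta ** 2
    return mean, variance


if __name__ == "__main__":
    n0, gamma = 2.0e6, 0.005
    for n, t in [(500, 100.0), (500, 500.0), (1000, 1000.0)]:
        tau = g_exponential(t, n0, gamma)
        mean, var = lineage_normal_approx(n, tau)
        print(f"n={n} t={t}: mean={mean:.2f} sd={math.sqrt(var):.2f}")
```

For small τ the mean returned by this sketch is close to n/(1 + nτ/2), and for very large n the variance approaches 2/(3τ); both limits provide a quick sanity check on an implementation.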

As no linear approximation is used to derive the asymptotic distribution of A_n(t), if μ(τ) and σ²(τ) were the exact mean and variance of ${\dot{A}}_{n} (τ)$ , μ(g(t)) and σ²(g(t)) would be the exact mean and variance of A_n(t). Here we use the asymptotic mean and variance in Griffiths (1984) for μ(τ) and σ²(τ), which introduces a bias in our derived asymptotic mean and variance of A_n(t).

Asymptotics of coalescence times and ancestral lineage numbers at the early stage of the coalescent

At the early stage of the coalescent, or m → n and t → 0, the above normal distributions may not well approximate the exact distributions of coalescence times and ancestral lineage numbers. We derive their asymptotics in this section.

As the scaled coalescence time g(T_m) follows the distribution of coalescence times in a standard coalescent process, $g (T_{m}) - g (T_{m + 1}) \sim Exponential ((\begin{matrix} m + 1 \\ 2 \end{matrix}))$ . Multiplying g(T_m) − g(T_m₊₁) by $n (n - 1) / 2$ , $[n (n - 1) / 2] (g (T_{m}) - g (T_{m + 1})) \sim Exponential ((\begin{matrix} m + 1 \\ 2 \end{matrix}) / (\begin{matrix} n \\ 2 \end{matrix})) \to Exponential (1)$ as m → n and n → ∞. By the mean-value theorem, $g (T_{m}) - g (T_{m + 1}) = (T_{m} - T_{m + 1}) g^{'} (ξ_{m + 1})$ , where $T_{m + 1} \leq ξ_{m + 1} \leq T_{m}$ . As m → n and n → ∞, we have T_m → 0 and $g^{'} (ξ_{m + 1}) = 1 / N (ξ_{m + 1}) \to 1 / N_{0}$ . Subsequently,

\frac{n (n - 1)}{2} \sum_{k = m + 1}^{n} [g (T_{k - 1}) - g (T_{k})] = \frac{n (n - 1)}{2} \sum_{k = m + 1}^{n} (T_{k - 1} - T_{k}) g^{'} (ξ_{k}) .

(23)

Taking limits n → ∞, m → n on both sides of Equation 23, we obtain $[n (n - 1) / 2 N_{0}] T_{m} \to_{d} Gamma (n - m, 1)$ , a gamma distribution with shape parameter n − m and unit rate.

Next, we derive the asymptotic distribution for the number of coalesced lineages, n − A_n(t), for a population with time-varying size N(t). Because $ℙ (n - A_{n} (t) \geq n - j) = ℙ (n - {\dot{A}}_{n} (g (t)) \geq n - j) = ℙ (n - {\dot{A}}_{n} (τ) \geq n - j)$ , Theorem 6 in Griffiths (1984) implies that n − A_n(t) asymptotically follows the Poisson distribution with mean $ν = \frac{1}{2} n (n - 1) g (t) = \frac{1}{2} n (n - 1) \int_{0}^{t} [1 / N (u)] d u$ .
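The two early-stage approximations reduce to one-line formulas. The Python sketch below (hypothetical function names, exponential growth assumed for the Poisson mean) computes ν and the mean and standard deviation of T_m implied by the gamma limit. For N₀ = 2 × 10⁶, n = 800, and m = 795, the gamma limit gives a mean of about 31.3 generations, in the same range as the simulated means for T₇₉₅ in Table 1; the limit ignores population growth over this short interval, so it does not depend on γ.

```python
import math


def poisson_mean_coalesced(n: int, t: float, n0: float, gamma: float) -> float:
    """Poisson mean nu = n(n-1)/2 * g(t) for N(t) = n0 * exp(-gamma * t), gamma > 0."""
    g_t = math.expm1(gamma * t) / (n0 * gamma)
    return 0.5 * n * (n - 1) * g_t


def gamma_approx_tm(n: int, m: int, n0: float):
    """Mean and sd of T_m implied by n(n-1) T_m / (2 n0) ~ Gamma(n - m, 1)."""
    scale = 2.0 * n0 / (n * (n - 1))   # generations per unit of the gamma variable
    shape = n - m                      # number of coalescent events before T_m
    return shape * scale, math.sqrt(shape) * scale


if __name__ == "__main__":
    nu = poisson_mean_coalesced(100, 10.0, 2.0e6, 0.001)
    print(f"expected number of coalesced lineages: {nu:.6f}")
    print(f"P(no coalescence yet): {math.exp(-nu):.6f}")
    mean, sd = gamma_approx_tm(800, 795, 2.0e6)
    print(f"gamma approximation for T_795: mean={mean:.3f} sd={sd:.3f}")
```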

Numerical Results

The accuracy of the asymptotic distributions of T_m and A_n(t) derived in Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers is examined by comparing their distributional properties to both exact distributions and coalescent simulations. If the analytical formulas of the exact distributions for T_m and A_n(t) are available and can be computed without numerical issues, for example, the mean and variance of the ancestral lineage number distribution, we use the analytical results instead of coalescent simulations in the comparison. Otherwise, we compare the asymptotic distributions with simulated distributions. The coalescent simulator ms (Hudson 2002) is modified to output coalescence times and the number of ancestral lineages for each simulated gene genealogy. In the simulation, the contemporary population size N₀ is chosen to be 2 × 10⁶, and wide ranges of the exponential growth rate γ, time t, and sample size n are investigated.

Coalescence times

We first examine the numerical accuracy of the asymptotic p.d.f. of coalescence times as provided in Asymptotics of coalescence times. We show the asymptotic p.d.f.’s of several coalescence times for n = 500 and γ = 0.001 and 0.005 in Figure 2, and then present in Table 1 the mean and standard deviation of coalescence times for a wider range of parameters (more simulation results can be seen in Supporting Information, Table S1). Since evaluating the p.d.f. of coalescence times for large samples using the exact formulas is subject to severe numerical instability (Polanski et al. 2003), we use only coalescent simulations in this section.

Table 1. Comparison of the asymptotic approximation and simulated results for the mean and standard deviation of the coalescence time T_m (N₀ = 2.0 × 10⁶).

		Mean (T_m)			Standard deviation (T_m)
γ	m	Simulation	Asymptotic	Bias (%)	Simulation	Asymptotic	Bias (%)
Sample size n = 50
0.001	5	6538.619	6580.639	42.020 (0.643%)	278.445	285.229	6.784 (2.436%)
0.001	10	5746.244	5771.441	25.197 (0.438%)	223.672	226.368	2.696 (1.206%)
0.001	20	4775.353	4795.791	20.438 (0.428%)	209.097	206.392	2.705 (1.294%)
0.001	45	2210.726	2291.412	80.685 (3.650%)	402.151	402.703	0.552 (0.137%)
0.005	5	1629.978	1637.793	7.816 (0.480%)	56.458	57.109	0.651 (1.154%)
0.005	10	1470.136	1475.677	5.541 (0.377%)	44.957	45.387	0.429 (0.955%)
0.005	20	1274.626	1279.719	5.093 (0.400%)	41.512	41.553	0.041 (0.100%)
0.005	45	742.147	763.298	21.151 (2.850%)	90.637	87.630	3.008 (3.318%)
0.010	5	884.172	888.198	4.026 (0.455%)	28.072	28.559	0.486 (1.731%)
0.010	10	804.606	807.122	2.515 (0.313%)	22.611	22.700	0.089 (0.395%)
0.010	20	706.889	709.091	2.202 (0.312%)	20.902	20.794	0.108 (0.517%)
0.010	45	439.281	449.857	10.577 (2.408%)	46.789	44.302	2.486 (5.314%)
Sample size n = 200
0.001	5	6626.114	6660.575	34.462 (0.520%)	254.189	263.447	9.258 (3.642%)
0.001	20	5187.985	5198.497	10.512 (0.203%)	141.164	142.544	1.380 (0.978%)
0.001	50	4104.733	4110.874	6.141 (0.150%)	105.829	106.237	0.408 (0.386%)
0.001	100	3038.662	3044.522	5.861 (0.193%)	103.844	102.868	0.977 (0.940%)
0.001	150	2028.941	2036.882	7.941 (0.391%)	126.889	124.671	2.219 (1.749%)
0.001	195	405.591	413.976	8.385 (2.067%)	147.663	151.613	3.950 (2.675%)
0.005	5	1647.798	1653.798	6.000 (0.364%)	51.114	52.743	1.629 (3.187%)
0.005	20	1359.062	1360.701	1.639 (0.121%)	28.242	28.635	0.393 (1.392%)
0.005	50	1140.012	1141.422	1.410 (0.124%)	21.275	21.530	0.255 (1.197%)
0.005	100	921.898	923.024	1.126 (0.122%)	21.349	21.388	0.039 (0.184%)
0.005	150	704.509	707.223	2.715 (0.385%)	27.911	27.839	0.071 (0.256%)
0.005	195	243.791	254.182	10.391 (4.262%)	62.681	64.354	1.672 (2.668%)
0.010	5	893.062	896.201	3.139 (0.352%)	25.688	26.375	0.687 (2.676%)
0.010	20	748.544	749.610	1.066 (0.142%)	14.058	14.326	0.268 (1.905%)
0.010	50	639.188	639.859	0.671 (0.105%)	11.055	10.783	0.273 (2.466%)
0.010	100	529.684	530.330	0.646 (0.122%)	10.860	10.747	0.112 (1.033%)
0.010	150	420.709	421.459	0.751 (0.178%)	14.020	14.125	0.105 (0.749%)
0.010	195	174.898	181.290	6.392 (3.655%)	37.343	37.428	0.085 (0.227%)
Sample size n = 800
0.001	5	6645.512	6679.599	34.087 (0.513%)	247.504	258.485	10.980 (4.436%)
0.001	10	5958.249	5981.414	23.165 (0.389%)	181.338	184.235	2.897 (1.598%)
0.001	50	4328.359	4330.733	2.374 (0.055%)	85.882	85.933	0.051 (0.059%)
0.001	200	2771.704	2772.589	0.885 (0.032%)	50.952	50.631	0.321 (0.630%)
0.001	400	1789.988	1791.759	1.771 (0.099%)	44.957	45.005	0.048 (0.106%)
0.001	795	30.645	30.962	0.317 (1.036%)	13.477	13.635	0.157 (1.167%)
0.005	5	1651.462	1657.606	6.144 (0.372%)	50.473	51.749	1.276 (2.528%)
0.005	10	1515.254	1517.766	2.512 (0.166%)	36.394	36.922	0.528 (1.451%)
0.005	50	1185.082	1185.918	0.835 (0.070%)	17.362	17.369	0.007 (0.042%)
0.005	200	865.872	866.147	0.275 (0.032%)	10.690	10.659	0.031 (0.291%)
0.005	400	651.296	651.619	0.323 (0.050%)	10.458	10.386	0.072 (0.689%)
0.005	795	28.726	29.206	0.481 (1.673%)	12.000	12.153	0.153 (1.274%)
0.010	5	894.511	898.105	3.594 (0.402%)	25.293	25.878	0.585 (2.314%)
0.010	10	825.938	828.172	2.234 (0.270%)	18.054	18.465	0.412 (2.281%)
0.010	50	661.756	662.141	0.385 (0.058%)	8.640	8.696	0.056 (0.648%)
0.010	200	501.546	501.728	0.182 (0.036%)	5.500	5.365	0.135 (2.449%)
0.010	400	393.125	393.183	0.058 (0.015%)	5.358	5.295	0.064 (1.188%)
0.010	795	26.667	27.343	0.676 (2.534%)	10.423	10.699	0.276 (2.652%)

In Figure 2, A and B, the asymptotic p.d.f.’s of the coalescence times T₂, T₁₀, T₂₅, T₅₀, T₁₀₀, T₂₀₀, T₃₀₀, and T₄₀₀ are displayed for two exponential growth rates, 0.001 and 0.005. Coalescence times are on average longer in the more slowly growing population, so the gene genealogy of a faster-growing population tends to have relatively shorter internal branches, giving it a “star” shape. In Figure 2, C and E, the p.d.f.’s of T₁₀ and T₁₀₀ under the growth model γ = 0.001 are presented together with the respective histograms generated from 500 coalescent simulations. A similar comparison is shown for γ = 0.005 in Figure 2, D and F. The asymptotic p.d.f. is reasonably close to the simulated distribution for the two selected T_m’s and the two growth rates.

In Table 1 and Table S1, a more thorough simulation study is presented. The simulation is carried out for three different growth rates, γ = 0.001, 0.005, and 0.01, and five sample sizes, n = 50, 100, 200, 500, and 800. For each combination of parameter values, several coalescence times are examined to cover as much of the time span as possible. For example, for γ = 0.001, n = 200, we examine T₅, T₂₀, T₅₀, T₁₀₀, T₁₅₀, and T₁₉₅. For most of the T_m’s examined here, the asymptotic mean and variance approximate those of the simulated distributions very well.

Note that in Table 1 and Table S1 the asymptotic mean of T_m is always larger than the simulated mean. This is consistent with the bias quantified in Equation 14. Both the bias and the relative bias (the bias divided by the simulated mean) of the asymptotic mean of T_m are also larger when m is close to n. This can be explained by the inflated second derivative of the scaling function g⁻¹(τ) evaluated at μ_m close to 0 (approximately $-1/(\gamma \mu_m^2)$; see also Equation 14), which appears in the bias term. The detailed derivation of the quantified bias is given in Asymptotics of coalescence times.
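The direction and size of this bias can be illustrated numerically. The sketch below assumes that time is rescaled by g(t) = (e^{γt} − 1)/(N₀γ), that in rescaled time the wait while k lineages remain is exponential with rate k(k − 1)/2, and that the asymptotic mean is g⁻¹(μ_m) with μ_m = 2(1/m − 1/n). These forms are a reading of the text rather than a transcription of Equations 8 and 14, which are not reproduced in this section; the reading does reproduce the asymptotic means listed in Table 1 for n = 200 and γ = 0.001 (for example, 3044.522 for T₁₀₀). The bias term is a generic second-order Taylor (delta-method) approximation and may differ in detail from Equation 14. All function names are hypothetical.

```python
import math


def g_inv(tau, pop_size, gamma):
    """Inverse of g(t) = (exp(gamma*t) - 1) / (pop_size*gamma): generations for scaled time tau."""
    return math.log1p(pop_size * gamma * tau) / gamma


def g_inv_second_derivative(tau, pop_size, gamma):
    """Second derivative of g_inv with respect to tau (negative, so g_inv is concave)."""
    return -(pop_size ** 2) * gamma / (1.0 + pop_size * gamma * tau) ** 2


def scaled_mean(m, n):
    """Mean of the scaled T_m: sum_{k=m+1}^{n} 2/(k(k-1)), which telescopes to 2(1/m - 1/n)."""
    return 2.0 * (1.0 / m - 1.0 / n)


def scaled_variance(m, n):
    """Variance of the scaled T_m as a sum of independent exponential waiting times."""
    return sum(4.0 / (k * k * (k - 1) ** 2) for k in range(m + 1, n + 1))


def asymptotic_mean(m, n, pop_size, gamma):
    """Assumed asymptotic mean of T_m in generations: g_inv evaluated at the scaled mean."""
    return g_inv(scaled_mean(m, n), pop_size, gamma)


def approx_bias(m, n, pop_size, gamma):
    """Second-order Taylor estimate of (asymptotic mean) - (true mean); positive by concavity."""
    mu_m = scaled_mean(m, n)
    return -0.5 * g_inv_second_derivative(mu_m, pop_size, gamma) * scaled_variance(m, n)


if __name__ == "__main__":
    pop_size, gamma, n = 2.0e6, 0.001, 200
    for m in (5, 20, 50, 100, 150, 195):
        mean = asymptotic_mean(m, n, pop_size, gamma)
        bias = approx_bias(m, n, pop_size, gamma)
        print(f"T_{m}: asymptotic mean {mean:.3f}, approx. bias {bias:.3f} "
              f"({100.0 * bias / mean:.3f}% of the asymptotic mean)")
```

Because g⁻¹ is concave, the approximate bias is positive for every m, and under these assumptions its size relative to the mean grows as m approaches n, in line with the pattern described above. The simulated biases in Table 1 rest on a finite number of replicates, so only qualitative agreement should be expected.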

Another trend worth noting in Table 1 and Table S1 is that the relative bias of the mean decreases with increasing sample size n. For example, for γ = 0.001 and the coalescence time T₅₀, the relative bias is 0.353, 0.150, 0.049, 0.055, and 0.025% for n = 100, 200, 500, 800, and 5000, respectively (the last data point is not shown in Table 1 and Table S1). For a coalescence time T_m with a smaller m, the relative bias is reduced more slowly. In Table 1 and Table S1, the relative bias for T₅ shows no obvious decreasing trend for sample sizes up to 800 (for γ = 0.001, the relative bias of T₅ decreases only from 0.642% for n = 50 to 0.513% for n = 800). When the sample size is increased to 5000, the relative bias becomes 0.405% (data not shown in Table 1 and Table S1). Although the convergence rates differ for these two T_m’s, the bias of both shrinks toward zero as n → ∞.

Number of ancestral lineages

In this subsection, we aim to evaluate how the asymptotic distribution of A_n(t) performs as an approximation to the true distribution. The exact formulas of the first two moments of the ancestral lineage number distribution under a varying population size model were derived using the probability generating function in Tavaré (1984),

(24)

and

(25)

Unlike the entire exact distribution of the ancestral lineage number, the exact mean and variance of ancestral lineage numbers can be accurately calculated from Equations 24 and 25 even for quite large samples and are taken as the gold standard in the comparison. Three exponential growth rates, γ = 0.001, 0.003, and 0.01, and five sample sizes, n = 50, 100, 200, 400, and 800, are considered here. For each combination of parameter values, the number of lineages is recorded at four time points (t = 100, 500, 1000, and 2000 for γ = 0.001 and 0.003; t = 100, 300, 600, and 800 for γ = 0.01).

Table 2 and Table S2 present the mean and standard deviation of the number of lineages calculated from the exact formulas and from the asymptotic results for the parameter values chosen above, as well as the bias and relative bias (the bias divided by the exact mean or standard deviation) of the proposed asymptotic mean and standard deviation. As Table 2 and Table S2 show, the mean and standard deviation obtained from the proposed asymptotic distribution are close to their exact counterparts for a wide range of time points, growth rates, and sample sizes. Note that the asymptotic results are accurate even for a relatively small sample size, e.g., n = 50. The robustness of the asymptotic results across demographic parameters and small sample sizes supports the application of the asymptotic distribution in theory and methodology development for population genetic inference.

Table 2. Comparison of the asymptotic approximation and exact results for the mean and standard deviation of A_n(t) (population size N₀ = 2.0 × 10⁶).

		Mean (A_n(t))			Standard deviation (A_n(t))
γ	t	Exact	Asymptotic	Bias (%)	Exact	Asymptotic	Bias (%)
Sample size n = 50
0.001	100	49.936	49.934	0.002 (0.004%)	0.253	0.256	0.003 (1.001%)
0.001	500	49.606	49.598	0.008 (0.016%)	0.623	0.629	0.006 (0.981%)
0.001	1000	48.969	48.949	0.020 (0.042%)	0.994	1.004	0.009 (0.944%)
0.001	2000	46.371	46.302	0.069 (0.148%)	1.768	1.783	0.014 (0.797%)
0.003	100	49.929	49.927	0.001 (0.003%)	0.267	0.269	0.003 (1.001%)
0.003	500	49.299	49.285	0.014 (0.029%)	0.825	0.834	0.008 (0.963%)
0.003	1000	46.385	46.317	0.068 (0.148%)	1.765	1.780	0.014 (0.798%)
0.003	2000	18.999	18.679	0.319 (1.710%)	2.432	2.429	0.002 (0.097%)
0.010	100	49.895	49.893	0.002 (0.004%)	0.323	0.327	0.003 (0.999%)
0.010	300	48.858	48.835	0.023 (0.047%)	1.044	1.054	0.010 (0.937%)
0.010	600	33.502	33.266	0.236 (0.711%)	2.790	2.797	0.007 (0.248%)
0.010	800	10.919	10.582	0.337 (3.181%)	1.874	1.869	0.005 (0.259%)
Sample size n = 200
0.001	100	198.959	198.954	0.005 (0.003%)	1.015	1.017	0.003 (0.246%)
0.001	500	193.747	193.717	0.030 (0.016%)	2.423	2.428	0.006 (0.227%)
0.001	1000	184.250	184.177	0.073 (0.040%)	3.660	3.667	0.007 (0.195%)
0.001	2000	151.766	151.578	0.188 (0.124%)	5.336	5.341	0.005 (0.102%)
0.003	100	198.846	198.841	0.006 (0.003%)	1.068	1.071	0.003 (0.246%)
0.003	500	189.083	189.031	0.052 (0.027%)	3.125	3.132	0.007 (0.211%)
0.003	1000	151.922	151.734	0.188 (0.124%)	5.332	5.338	0.005 (0.102%)
0.003	2000	26.285	25.950	0.335 (1.292%)	2.941	2.938	0.004 (0.121%)
0.010	100	198.305	198.296	0.008 (0.004%)	1.291	1.294	0.003 (0.244%)
0.010	300	182.657	182.577	0.080 (0.044%)	3.808	3.816	0.007 (0.189%)
0.010	600	66.720	66.398	0.322 (0.485%)	4.619	4.618	0.002 (0.037%)
0.010	800	12.917	12.579	0.339 (2.692%)	2.052	2.047	0.005 (0.236%)
Sample size n = 800
0.001	100	783.539	783.519	0.020 (0.003%)	3.974	3.976	0.002 (0.059%)
0.001	500	708.227	708.125	0.102 (0.014%)	8.502	8.505	0.004 (0.043%)
0.001	1000	595.586	595.390	0.196 (0.033%)	10.798	10.801	0.003 (0.024%)
0.001	2000	351.520	351.214	0.305 (0.087%)	10.352	10.352	0.000 (0.003%)
0.003	100	781.788	781.766	0.022 (0.003%)	4.171	4.173	0.002 (0.058%)
0.003	500	649.446	649.291	0.155 (0.024%)	10.033	10.036	0.003 (0.032%)
0.003	1000	352.361	352.055	0.305 (0.087%)	10.361	10.361	0.000 (0.003%)
0.003	2000	29.083	28.747	0.336 (1.168%)	3.099	3.095	0.003 (0.111%)
0.010	100	773.453	773.421	0.032 (0.004%)	4.982	4.985	0.003 (0.056%)
0.010	300	579.199	578.992	0.207 (0.036%)	10.944	10.947	0.002 (0.021%)
0.010	600	88.745	88.412	0.334 (0.377%)	5.427	5.425	0.002 (0.037%)
0.010	800	13.540	13.202	0.338 (2.564%)	2.102	2.098	0.005 (0.227%)

Next, to examine the performance of the asymptotic approximation over the entire time scale, we plot the mean and variance of the number of lineages as a function of time for two parameter settings in Figure 3. In both settings, the contemporary population size and the sample size are chosen to be N₀ = 2 × 10⁶ and n = 500, but the growth rates are different: γ = 0.003 (Figure 3, A–B) and γ = 0.01 (Figure 3, C–D). We compare in Figure 3, A and C, three approaches that can be used to obtain the mean of the number of lineages: the sample mean from the simulated data, representing a close estimate of the true value; the exact mean given in Equation 24; and our proposed asymptotic mean. As shown in Figure 3, A and C, the asymptotic results approximate the true mean of lineage numbers well.

Figure 3. The mean and variance of the number of ancestral lineages in the history for two parameter settings. (A) The mean of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The contemporary population size is assumed to be 2 × 10⁶ and the growth rate 0.003. The x-axis corresponds to generations back in time, and the y-axis is the expected number of ancestral lineages. Green open circles represent the average of lineage numbers over 500 gene genealogies generated from coalescent simulations; the blue X symbols represent the exact mean in Tavaré (1984); the red solid line represents the asymptotic mean derived in the main text. (B) The variance of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The parameters used in simulation and symbols are the same as in A and the sample variance is estimated from 500 simulations. (C) The mean of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The contemporary population size is assumed to be 2 × 10⁶ and the growth rate 0.01. The simulation setting and the representative symbols are the same as in A. (D) The variance of ancestral lineages as a function of time t for n = 500 lineages sampled from the contemporary population. The simulation setting and the representative symbols are the same as in A. The sample variance is estimated from 500 simulations.

Recently, another approach to obtaining the expectation of the number of lineages was developed by Maruvka et al. (2011) for populations with constant or exponentially growing size. Based on the equation for $\mathbb{E}[A_n(t + 1) \mid A_n(t) = i]$ in Watterson (1975), Maruvka et al. (2011) constructed a differential equation and gave the solution for the number of lineages as a function of time (NLFT), referred to in this article as the expectation of the ancestral lineage number, as follows

\frac{n_{0}}{n_{0} - (n_{0} - 1) \exp [- (e^{\gamma t} - 1) / (2 N_{0} \gamma)]},

(26)

where n₀ and N₀ are the current sample size and population size, and γ is the population growth rate. Since Maruvka et al. (2011) assumed that A_n(t) was deterministic rather than a random variable, no formula for the variance of A_n(t) was given in their article. It can easily be shown that Equation 26 is close to Equation 21 when A_n(t) is large. Letting $\beta = -\frac{1}{2} g(t) \to 0$, where $g(t) = (e^{\gamma t} - 1)/(N_{0} \gamma)$, the asymptotic mean of A_n(t) in Equation 21 tends to

\frac{n_{0}}{1 + n_{0} (e^{\gamma t} - 1) / (2 N_{0} \gamma)} .

(27)

When g(t) is small, the denominator in Equation 26 can be approximated by $n_{0} - (n_{0} - 1)[1 - (e^{\gamma t} - 1)/(2 N_{0} \gamma)] = 1 + (n_{0} - 1)(e^{\gamma t} - 1)/(2 N_{0} \gamma)$, which is approximately the denominator in Equation 27 when n₀ is large. This confirms the validity of our asymptotic approximation for exponentially growing populations.
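The closeness of the two expressions can be checked numerically. The sketch below evaluates Equation 27 alongside an NLFT of the form n₀ / {n₀ − (n₀ − 1) exp[−(e^{γt} − 1)/(2N₀γ)]}; this form is assumed from the denominator approximation described above, and the function names are hypothetical. For the parameter values of Table 2, Equation 27 evaluated this way agrees with the tabulated asymptotic means to about three decimals, although the table itself is based on Equation 21.

```python
import math


def half_scaled_time(t, pop_size, gamma):
    """(exp(gamma*t) - 1) / (2*pop_size*gamma), the quantity shared by Equations 26 and 27."""
    return math.expm1(gamma * t) / (2.0 * pop_size * gamma)


def nlft(n0, t, pop_size, gamma):
    """Assumed form of the deterministic NLFT solution (Equation 26)."""
    x = half_scaled_time(t, pop_size, gamma)
    return n0 / (n0 - (n0 - 1) * math.exp(-x))


def asymptotic_limit(n0, t, pop_size, gamma):
    """Limiting form of the asymptotic mean of A_n(t) (Equation 27)."""
    x = half_scaled_time(t, pop_size, gamma)
    return n0 / (1.0 + n0 * x)


if __name__ == "__main__":
    pop_size = 2.0e6
    for n0, gamma, t in [(50, 0.001, 2000), (200, 0.001, 2000),
                         (800, 0.001, 100), (800, 0.01, 800)]:
        a = nlft(n0, t, pop_size, gamma)
        b = asymptotic_limit(n0, t, pop_size, gamma)
        print(f"n0={n0}, gamma={gamma}, t={t}: NLFT {a:.3f}, Equation 27 {b:.3f}, "
              f"relative difference {abs(a - b) / b:.2e}")
```

The two values nearly coincide while many lineages remain and drift apart once only a few remain, which is the regime where the small-g(t) argument no longer applies.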

In Figure 3, B and D, the asymptotic variance of A_n(t) is evidently close to both the sample variance of the simulated data and the exact variance at any time t. We also note that the variance of ancestral lineage numbers is small relative to the expectation. This was exploited, using simulation, by Maruvka et al. (2011) when they assumed that the number of ancestral lineages was nearly deterministic in large-sample genealogies. However, the randomness of ancestral lineage numbers is still substantial even for large-sample genealogies. For example, for n = 800 and γ = 0.003, the variance of A_n(t) is ∼100 at t = 500. If the randomness of ancestral lineages is not taken into account, inference based on the coalescent likelihood will likely be biased. Our asymptotic results provide both the distribution and analytical expressions for the first two moments, rather than the mean alone, and thus can be used to build statistically rigorous methods for parameter inference.

Finally, in addition to the mean and the variance, we check how well the normal distribution approximates the exact distribution in shape using coalescent simulations. We examine the distribution of ancestral lineage numbers at several time points for the same two parameter settings as in Figure 3. We show snapshots at three time points for each setting in Figure 4 as an illustration: for γ = 0.003, t = 100, 1000, and 2000 generations ago representing the early, middle, and late stages of the ancestral process; and for γ = 0.01, t = 100, 400, and 800 generations ago. As can be seen from Figure 4, the normal distribution provides a reasonable approximation to the true distribution of A_n(t) for a wide range of time points.

Figure 4. The asymptotic probability density functions of lineage numbers in the history for two parameter settings. (A, C, and E) Comparison of the asymptotic probability density function and simulated distribution of ancestral lineage number A_n(t) for a population with N₀ = 2 × 10⁶ and growth rate 0.003. The times t are at (A) 100 generations ago, (C) 1000 generations ago, and (E) 2000 generations ago. The histograms were generated by 500 coalescent simulations. (B, D, and F) Comparison of the asymptotic probability density function and simulated distribution of ancestral lineage number A_n(t) for a population with N₀ = 2 × 10⁶ and growth rate 0.01. The times t are at (B) 100 generations ago, (D) 400 generations ago, and (F) 800 generations ago. The histograms were generated by 500 coalescent simulations.

We also examine how well the Poisson distribution and the gamma distribution approximate the distributions of coalesced lineages and coalescence times at the early stage of the coalescent process. The two asymptotic distributions provide accurate approximations to the true distributions only when the sample size n is sufficiently large, t is close to 0, and the growth rate γ is small (see Table S3, Table S4, Figure S1, and Figure S2 for details).

Applications

When the sample size n is large, numerical issues in evaluating the exact distributions of coalescence times and ancestral lineage numbers exist (Tavaré 1984; Griffiths and Tavaré 1998), and the asymptotic distributions derived above were shown to be a good approximation for finite sample sizes. Here we illustrate that the asymptotic distributions of coalescence times and the number of ancestral lineages can be applied to derive some fundamental statistics that summarize the properties of gene genealogies. We also show that the allele frequency spectrum of large-size samples can be derived through the asymptotic distribution of coalescence times for a population under exponential growth. These asymptotic statistics provide valid approximations and are in simple form without numerical issues for large samples.

Properties of large gene genealogies

Many statistics that summarize the properties of gene genealogies are informative for population genetic inference. Some of them can be derived as a function of coalescence times. We show the derivation of two important statistics of gene genealogies, the expected time to the most recent common ancestor (ETMRCA) and the expected total branch length of the genealogy (ETBL). Using ETBL, we can easily estimate other summary statistics, such as Watterson’s diversity measure θ_W (Watterson 1975) and Tajima’s D (Tajima 1989).

The TMRCA of a sample is defined as the time when the ancestors of all lineages coalesce into a single ancient lineage, that is, the coalescence time T₁. Inferring the TMRCA from genetic polymorphism data is of great interest in population genetic studies (Tavaré et al. 1997). The ETMRCA is by definition simply the expectation of the coalescence time T₁:

(28)

The ETBL can be obtained by summing over the expectations of all branches of the genealogy as

(29)

Specifically for populations under exponential growth, the ETMRCA can be approximated by

\mathrm{ETMRCA} \approx \frac{1}{\gamma} \ln \left( 2 N_{0} \gamma (1 - n^{-1}) + 1 \right),

(30)

and by substituting Equation 27 into Equation 29, we have

\mathrm{ETBL} \approx \frac{2 n N_{0} \ln (2 N_{0} \gamma / n)}{2 N_{0} \gamma - n} .

(31)
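Both closed forms can be checked numerically against Equation 27. The sketch below rests on two assumptions that are not spelled out in this section: that Equation 30 corresponds to the time at which the mean in Equation 27 falls to a single lineage, and that Equation 31 corresponds to integrating Equation 27 over all t ≥ 0. The function names and the choice of Simpson's rule are illustrative only.

```python
import math


def mean_lineages(t, n, pop_size, gamma):
    """Equation 27: limiting asymptotic mean number of ancestral lineages at time t."""
    return n / (1.0 + n * math.expm1(gamma * t) / (2.0 * pop_size * gamma))


def etmrca(n, pop_size, gamma):
    """Equation 30."""
    a = 2.0 * pop_size * gamma
    return math.log(a * (1.0 - 1.0 / n) + 1.0) / gamma


def etbl(n, pop_size, gamma):
    """Equation 31 (undefined when 2*pop_size*gamma equals n)."""
    a = 2.0 * pop_size * gamma
    return 2.0 * n * pop_size * math.log(a / n) / (a - n)


def etbl_numeric(n, pop_size, gamma, t_max, steps):
    """Composite Simpson integral of Equation 27 over [0, t_max]; steps must be even."""
    if steps % 2:
        steps += 1
    h = t_max / steps
    total = mean_lineages(0.0, n, pop_size, gamma) + mean_lineages(t_max, n, pop_size, gamma)
    for i in range(1, steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight * mean_lineages(i * h, n, pop_size, gamma)
    return total * h / 3.0


if __name__ == "__main__":
    n, pop_size = 500, 2.0e6
    for gamma in (0.001, 0.005, 0.01):
        t1 = etmrca(n, pop_size, gamma)
        closed = etbl(n, pop_size, gamma)
        # The integrand decays like exp(-gamma*t), so 40/gamma generations leaves a negligible tail.
        numeric = etbl_numeric(n, pop_size, gamma, t_max=40.0 / gamma, steps=40000)
        print(f"gamma={gamma}: ETMRCA {t1:.1f}, lineages at ETMRCA "
              f"{mean_lineages(t1, n, pop_size, gamma):.6f}, "
              f"ETBL closed form {closed:.1f}, numeric {numeric:.1f}")
```

Equation 31 has a removable singularity at 2N₀γ = n, so an implementation would need to treat that case separately.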

In Figure 5A, we show the ETMRCA as a function of sample size n for three different growth rates (γ = 0.001, 0.005, and 0.01) in a population with N₀ = 2 × 10⁶. The curves are theoretical predictions based on Equation 30, and each point is the TMRCA averaged over 200 coalescent simulations for the sample size shown on the x-axis. As we can see from the figure, the asymptotic ETMRCA is close to the simulated results, although biased toward larger values for slow growth rates (see Asymptotics of coalescence times for the quantified bias). Given the large variance of TMRCA for genealogies under neutrality, the approximation is reasonably accurate. When the growth rate increases, the theoretical approximation becomes more accurate. We can also see that the ETMRCA curve is nearly flat, which means that there is a limit to how much the ETMRCA can increase with the sample size. This is consistent with former conclusions that when the sample size is beyond a moderate level, adding more samples mainly changes the shape of the lower parts of the gene genealogy, and the increase in the height of the entire genealogy (TMRCA) is very minor (Hein et al. 2005).

Figure 5. The comparison of the expected TMRCA and total branch length obtained from the asymptotic distribution and the coalescent-simulated or exact distribution under three different exponential growth rates, assuming the contemporary population size to be N₀ = 2 × 10⁶. (A) The expectation of time to the most recent common ancestor (ETMRCA) for different sample sizes. The curves correspond to the asymptotic ETMRCA for three growth rates (γ = 0.001, 0.005, 0.01). The open circles, squares, and diamonds are the averages of TMRCA over 200 coalescent simulations with the respective growth rates. (B) The expectation of total branch lengths of gene genealogies (ETBL) for different sample sizes. The curves correspond to the asymptotic ETBL for three growth rates (γ = 0.001, 0.005, 0.01). The open circles, squares, and diamonds are the exact ETBLs estimated using Equation 35 of Polanski et al. (2003) with the respective growth rates.

Estimating the ETBL directly from the exact intercoalescence time distributions suffers from severe numerical instability when the sample size is large. But as pointed out by Polanski et al. (2003), the estimation can be simplified by interchanging summations and canceling large coefficient terms with each other in the alternating series, so that the resulting equations are computationally feasible for large samples (see Polanski et al. 2003, Equation 35). We thus compare the asymptotic formula to the exact ETBL estimated from their equation. In application, we found that even with the technique of Polanski et al. (2003), the numerical instability issue still exists when the sample size is sufficiently large. For example, when n > 1600 for γ = 0.001, a high-precision arithmetic library is needed to estimate the exact ETBL (Polanski et al. 2003, Equation 35). The expected TBLs for the three growth rates are shown in Figure 5B. The asymptotic results fit the exact results very well. Furthermore, Equation 31 has a simple analytical form, making the computation for large samples much easier and faster than with the exact equation in Polanski et al. (2003).

The allele frequency spectrum

The allele frequency spectrum (AFS) is defined as the sampling distribution of the frequency of mutant alleles in a randomly collected finite sample (Chen 2012). The AFS is informative for the inference of demographic history and natural selection (Sawyer and Hartl 1992; Williamson et al. 2005; Evans et al. 2007; Chen et al. 2007; Gutenkunst et al. 2009; Lukić et al. 2011; Živković and Stephan 2011; Song and Steinrücken 2012; Chen 2012). It has been well studied using diffusion processes since the foundation of population genetics (Kimura 1955). The AFS for any nonequilibrium population can also be obtained in the coalescent framework using the expected time lengths of gene genealogies (Griffiths and Tavaré 1998; Polanski et al. 2003; Marth et al. 2004; Chen 2012, 2013). The coalescent-based AFS has been used extensively in studies of population growth, bottlenecks, and other aspects of demographic history (Wooding and Rogers 2002; Polanski and Kimmel 2003; Marth et al. 2004). The AFS under the coalescent model is derived in analytical form and is computationally efficient for small samples. However, deriving the AFS requires the exact distribution of lineages, which involves sums of alternating series and is difficult to evaluate for large samples. As a result, for large samples, either a high-precision arithmetic library must be adopted (Wooding and Rogers 2002; Marth et al. 2004), or the problem is transformed into a hypergeometric summation (Polanski and Kimmel 2003). A high-precision arithmetic library requires tedious programming and can significantly increase the computational time. The hypergeometric summation is a technique that allows efficient estimation of the AFS for large samples for any single population with temporally varying size. But it is not a general solution for the joint AFS (JAFS) of two or more populations, and it is very challenging to extend the solution to more complicated scenarios, such as migration and selection.
Here we use the asymptotic distributions of coalescence times derived in Asymptotic Distributions for Coalescence Times and Ancestral Lineage Numbers to approximate the AFS for large samples. The derived AFS has a simple analytical form, and the approximation is accurate.

With the expectation of coalescence times derived in Equation 8, we can easily get the expectation of the intercoalescence times, $\mathbb{E}(W_m) = \mathbb{E}(T_{m-1}) - \mathbb{E}(T_m)$, where $W_m = T_{m-1} - T_m$. Let {S_j(n), 0 < j < n} denote the AFS of SNPs from a sample of size n, with the jth entry S_j(n) being the expected number of segregating sites having j copies of derived alleles. The AFS can then be analytically estimated as in former studies (Fu 1995; Griffiths and Tavaré 1998):

(32)

where μ is the mutation rate per generation.

In Equation 32, we assume an infinitely-many-sites model for the mutation, so that mutations occurring on any of the k branches spanning the time interval (T_k, T_k₋₁) follow a Poisson process with mean $\mu k \mathbb{E}(W_k)$. During the subsequent bifurcation process in which the number of lineages increases from k to n, the count of the mutant increases from a single copy to j among the n lineages at present with probability

\frac{\binom{n - j - 1}{k - 2}}{\binom{n - 1}{k - 1}},

which comes from the exchangeability property among lineages.
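The construction just described can be sketched in a few lines. The code below assumes that the jth entry is the sum over k = 2, …, n − j + 1 of the Poisson mean μ k 𝔼(W_k) multiplied by the probability above; this is assembled from the description in the text, and Equation 32 itself may be written differently. The function name and the input layout are hypothetical. As a sanity check, the usage example feeds in the standard constant-size expectations 𝔼(W_k) = 2N/[k(k − 1)] (with time in generations and a pairwise coalescence rate of 1/N), for which the spectrum should reduce to 2Nμ/j.

```python
from math import comb


def afs_from_intercoalescence(expected_w, mu):
    """Expected site counts S_1..S_{n-1} from expected intercoalescence times.

    expected_w[k] holds E(W_k), the expected time during which k lineages exist,
    for k = 2..n; entries 0 and 1 are unused placeholders.
    mu is the mutation rate per generation for the region considered.
    """
    n = len(expected_w) - 1
    spectrum = []
    for j in range(1, n):
        total = 0.0
        # A branch present while k lineages exist can have j descendants only if k <= n - j + 1.
        for k in range(2, n - j + 2):
            prob = comb(n - j - 1, k - 2) / comb(n - 1, k - 1)
            total += k * expected_w[k] * prob
        spectrum.append(mu * total)
    return spectrum


if __name__ == "__main__":
    pop_size, mu, n = 1000.0, 1.0e-3, 10
    # Constant-size check (not the exponential-growth expectations of the paper).
    expected_w = [0.0, 0.0] + [2.0 * pop_size / (k * (k - 1)) for k in range(2, n + 1)]
    for j, s_j in enumerate(afs_from_intercoalescence(expected_w, mu), start=1):
        print(f"S_{j} = {s_j:.6f}   (2*N*mu/j = {2.0 * pop_size * mu / j:.6f})")
```

For a population with varying size, only the vector of 𝔼(W_k) changes; the combinatorial weights depend on n, j, and k alone.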

The asymptotic AFS for large samples under the variable population size model is estimated through the expected intercoalescence times by the above approach, with $\mathbb{E}(T_m)$ derived in Asymptotics of coalescence times. We present three estimates of the AFS in Figure 6 to demonstrate the accuracy of the asymptotic approximation: the sample AFS based on coalescent simulations, the exact formula of Polanski and Kimmel (2003), and the asymptotic AFS. The simulated samples of size 500 are assumed to come from a contemporary population with N₀ = 2 × 10⁶ and exponential growth rates of 0.001, 0.005, and 0.01, respectively. The dark blue bars of the histograms in Figure 6 are the AFSs of a simulated 10-Mb region, formed by concatenating 1000 regions of 10 kb, with the mutation rate and recombination rate both being 1 × 10⁻⁸/nucleotide. The light blue bars represent the exact AFS estimated using the exact formulas of Polanski and Kimmel (2003) (Equations 8–15 in their article), and the white bars represent the asymptotic AFS calculated based on Equation 32 for the above chosen parameters. For a zoomed-in view, we present only the first 25 entries of the AFSs in Figure 6. The AFS based on the asymptotic distribution of intercoalescence times closely matches both the exact result and the simulation result. The asymptotic result derived in this article has advantages over the existing methods. It has a simple analytical form, and the calculation is fast, free of numerical instability, and does not involve numerical integration or sampling-based methods. Also, the method can be flexibly generalized to various population history models other than the simple exponential growth model. The extension of the asymptotic AFS to various demographic history models, especially the joint AFS for multiple populations, will be investigated in future studies.

Figure 6. The theoretically predicted and simulated allele frequency spectrum for a sample of 500 lineages collected from exponentially growing populations with different growth rates. The contemporary population size is 2 × 10⁶ and the growth rates are (A) 0.001, (B) 0.005, and (C) 0.01. The three color bars in the histograms correspond to the simulated AFS and the AFS estimated using the exact formulas of Polanski and Kimmel (2003) and the asymptotic formula derived in this article, respectively.

Discussion

The distributions of coalescence times and the number of ancestral lineages play an essential role in coalescent modeling and population genetic inference. The exact distributions of both ancestral lineage numbers and coalescence times have been studied and expressed as sums of alternating series, the terms of which are difficult to evaluate when the sample size is large (Tavaré 1984; Griffiths and Tavaré 1998; Polanski et al. 2003). With the rapid advancement of sequencing technology, large-sample genomic sequencing data are piling up, calling for new coalescent theories and methods for population genetic analysis. This article extends the asymptotic distributions of ancestral lineage numbers and coalescence times in constant populations (Griffiths 1984) to populations with temporally varying size. The asymptotic distributions provide a computationally fast and reliable alternative to the exact distributions in large samples. We have also shown that the asymptotic distributions are useful in obtaining statistics describing the properties of large genealogies and in analytically constructing the large-sample allele frequency spectrum. We expect the theoretical results derived in this article, together with the results in Griffiths (1984), to be useful for coalescent-based methodology development in the age of population-level sequencing data.

Acknowledgments

We are grateful to Dr. Robert Griffiths for insightful comments on an earlier version of the manuscript, which greatly improved the work. We are grateful to Dr. Joachim Hermisson and the two anonymous reviewers for their helpful comments. We are also grateful to Drs. Li Jin, Bing Su, and Hong Shi for motivating and encouraging the work.

Footnotes

Communicating editor: J. Hermisson

Literature Cited

Altshuler D., Lander E., Ambrogio L., Bloom T., Cibulskis K., et al. , 2010. A map of human genome variation from population scale sequencing. Nature 467: 1061–1073. [DOI] [PMC free article] [PubMed] [Google Scholar]
Anderson E., Slatkin M., 2007. Estimation of the number of individuals founding colonized populations. Evolution 61: 972–983 [DOI] [PubMed] [Google Scholar]
Billingsley P., 2012. Probability and Measure. Wiley, New York [Google Scholar]
Chen H., 2012. The joint allele frequency spectrum of multiple populations: a coalescent theory approach. Theor. Popul. Biol. 81: 179–195 [DOI] [PubMed] [Google Scholar]
Chen H., 2013. Intercoalescence time distribtution of incomplete genealogies in temporally varying populations, and applications in population genetic inference. Ann. Hum. Genet. 77: 158–173 [DOI] [PubMed] [Google Scholar]
Chen H., Green R. E., Pääbo S., Slatkin M., 2007. The joint allele-frequency spectrum in closely related species. Genetics 177: 387–398 [DOI] [PMC free article] [PubMed] [Google Scholar]
Coventry A., Bull-Otterson L. M., Liu X., Clark A. G., Maxwell T. J., et al. , 2010. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat. Commun. 1: 131. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dlugosch K., Parker I., 2007. Founding events in species invasions: genetic variation, adaptive evolution, and the role of multiple introductions. Mol. Ecol. 17: 431–449 [DOI] [PubMed] [Google Scholar]
Donnelly P., 1984. The transient behaviour of the moran model in population genetics. Math. Proc. Camb. Philos. Soc. 95: 349–358 [Google Scholar]
Donnelly P., Tavaré S., 1995. Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29: 401–421 [DOI] [PubMed] [Google Scholar]
Evans S., Shvets Y., Slatkin M., 2007. Non-equilibrium theory of the allele frequency spectrum. Theor. Popul. Biol. 71: 109–119 [DOI] [PubMed] [Google Scholar]
Ewens W., 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112 [DOI] [PubMed] [Google Scholar]
Ewens W., 2004. Mathematical Population Genetics: Theoretical Introduction, Vol. 1 Springer Verlag, New York [Google Scholar]
Felsenstein, J., M. Kuhner, J. Yamato, and P. Beerli, 1999 Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data. Lect. Notes Monogr. Ser., 163–185.
Fu Y. X., 1995. Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197 [DOI] [PubMed] [Google Scholar]
Griffiths R. C., 1980. Lines of descent in the diffusion approximation of neutral Wright–Fisher models. Theor. Popul. Biol. 17: 37–50 [DOI] [PubMed] [Google Scholar]
Griffiths R. C., 1984. Asymptotic line-of-descent distributions. J. Math. Biol. 21: 67–75 [Google Scholar]
Griffiths R. C., 2006. Coalescent lineage distributions. Adv. Appl. Probab. 38: 405–429 [Google Scholar]
Griffiths R. C., Tavaré S., 1994a Sampling theory for neutral alleles in a varying enviroment. Philos. Trans. R. Soc. Lond. B 344: 403–410 [DOI] [PubMed] [Google Scholar]
Griffiths R. C., Tavaré S., 1994b Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131–159
Griffiths R. C., Tavaré S., 1998. The age of a mutation in a general coalescent tree. Stoch. Models 14: 273–295
Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5: e1000695.
Hein J., Schierup M., Wiuf C., 2005. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, New York
Hudson R. R., 1990. Gene genealogies and the coalescent process. Oxford Surv. Evol. Biol. 7: 44.
Hudson R. R., 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338
Kimura M., 1955. Solution of a process of random genetic drift with a continuous model. Proc. Natl. Acad. Sci. USA 41: 144–150
Kingman J., 1982a The coalescent. Stochastic Process. Appl. 13: 235–248
Kingman J., 1982b Exchangeability and the Evolution of Large Populations, pp. 97–112 in Exchangeability in Probability and Statistics, edited by G. Koch and F. Spizzichino. North-Holland, Amsterdam
Lukić S., Hey J., Chen K., 2011. Non-equilibrium allele frequency spectra via spectral methods. Theor. Popul. Biol. 79: 203–219
Mardis E. R., 2008. The impact of next-generation sequencing technology on genetics. Trends Genet. 24: 133–140
Marth G. T., Czabarka E., Murvai J., Sherry S. T., 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351–372
Maruvka Y., Shnerb N., Bar-Yam Y., Wakeley J., 2011. Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Mol. Biol. Evol. 28: 1617–1631
Nordborg M., 2001. Coalescent theory, pp. 179–212 in Handbook of Statistical Genetics, edited by Balding D. J., Bishop M., Cannings C. Wiley, Chichester, UK
Polanski A., Kimmel M., 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165: 427–436
Polanski A., Bobrowski A., Kimmel M., 2003. A note on distributions of times to coalescence, under time-dependent population size. Theor. Popul. Biol. 63: 33–40
Risch N., Tang H., Katzenstein H., Ekstein J., 2003. Geographic distribution of disease mutations in the Ashkenazi Jewish population supports genetic drift over selection. Am. J. Hum. Genet. 72: 812–822
Sawyer S. A., Hartl D. L., 1992. Population genetics of polymorphism and divergence. Genetics 132: 1161–1176
Song Y. S., Steinrücken M., 2012. A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics 190: 1117–1129
Tajima F., 1989. Statistical methods for testing the neutral mutations hypothesis by DNA polymorphism. Genetics 123: 585–595
Takahata N., Nei M., 1985. Gene genealogy and variance of interpopulational nucleotide differences. Genetics 110: 325–344
Tavaré S., 1984. Line-of-descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 26: 119–164
Tavaré S., Balding D. J., Griffiths R. C., Donnelly P., 1997. Inferring coalescence times from DNA sequence data. Genetics 145: 505–518
Watterson G., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276
Watterson G., 1984. Lines of descent and the coalescent. Theor. Popul. Biol. 26: 77–92
Williamson S. H., Hernandez R., Fledel-Alon A., Zhu L., Nielsen R., et al., 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA 102: 7882–7887
Wooding S., Rogers A., 2002. The matrix coalescent and an application to human single-nucleotide polymorphisms. Genetics 161: 1641–1650
Živković D., Stephan W., 2011. Analytical results on the neutral non-equilibrium allele frequency spectrum based on diffusion theory. Theor. Popul. Biol. 79: 184–191

Associated Data

Supplementary Materials

Supporting Information

supp_194_3_721__index.html (2.6 KB, html)

71064779eb5f91ee97128c58fbe5aca3_genetics.113.151522-1.pdf (261.4 KB, pdf)

7206c8c44aee35ccaf60ad617030666f_genetics.113.151522-5.pdf (52 KB, pdf)

26ab1b33087d9e8510c59b3ae2158218_genetics.113.151522-6.pdf (51.8 KB, pdf)

cd1268229da16bfe002cfe624849795e_genetics.113.151522-7.pdf (50.9 KB, pdf)

64c990a3228cc8d668dee6329d05b140_genetics.113.151522-2.pdf (52.4 KB, pdf)

588df1c917e13b412242390f610a51b7_genetics.113.151522-4.pdf (146.5 KB, pdf)

a141452c5d3928d98779df83755f0187_genetics.113.151522-3.pdf (149.4 KB, pdf)

Table 1. Comparison of the asymptotic approximation and simulated results for the mean and standard deviation of the coalescence time T_m (N₀ = 2.0 × 10⁶).

Table 2. Comparison of the asymptotic approximation and exact results for the mean and standard deviation of A_n(t) (population size N₀ = 2.0 × 10⁶).