Author manuscript; available in PMC: 2018 Feb 16.
Published in final edited form as: J Comput Graph Stat. 2017 Feb 16;26(1):182–194. doi: 10.1080/10618600.2016.1159212

Efficient computation of the joint sample frequency spectra for multiple populations

John A Kamm 1, Jonathan Terhorst 2, Yun S Song 3
PMCID: PMC5319604  NIHMSID: NIHMS777150  PMID: 28239248

Abstract

A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity.

Keywords: coalescent, population genetics, demographic inference, sum-product algorithm

1 Introduction

Massive increases in the availability of DNA sequence data are creating exciting new opportunities for discovery in evolutionary genetics. With those opportunities come new challenges, as algorithms and statistical models originally designed to analyze small amounts of DNA from a limited number of individuals are now called on to study thousands to tens of thousands of individuals at a time. As in other branches of statistics, the arrival of “big data” has led to a new focus on scalability, accuracy, speed, and related computational issues within statistical genetics.

In this paper, we tackle this challenge by presenting a new method for computing the expected joint sample frequency spectrum (SFS), a summary statistic on DNA sequences which lies at the core of a large number of empirical investigations and inference procedures in population genetics (Bhaskar et al., 2015; Coventry et al., 2010; Excoffier et al., 2013; Gazave et al., 2014; Gravel et al., 2011; Griffiths and Tavaré, 1998; Gutenkunst et al., 2009; Jenkins et al., 2014; Nelson et al., 2012; Nielsen, 2000; Wakeley and Hey, 1997). As we discuss in greater technical detail below, the joint SFS is of interest because it maps complex demographic models involving population size changes, population splits, migration, and admixture to a low-dimensional vector. This dimensionality reduction is useful for performing inference using sampled DNA sequences. Inferring population demographic histories is not only innately interesting, for example in dating events such as the out-of-Africa migration of modern humans (Gutenkunst et al., 2009; Schaffner et al., 2005), but is also important for biological applications, such as distinguishing between the effects of natural selection and demography (Beaumont and Nichols, 1996; Boyko et al., 2008).

We argue both theoretically and by simulation that existing methods in population genetics cannot scale to the problem sizes encountered in modern genetic analyses. Our primary contribution is a novel algorithm (and accompanying open-source software package) which is several orders of magnitude faster (in the number of sampled individuals n) than existing approaches. Moreover, by careful algorithmic design we are able to mitigate certain numerical issues (e.g., underflow and catastrophic cancellation) that often arise when applying these models to data. The combined effect of these innovations is to permit the analysis of much larger data sets, which will lead to improved inference in population genetics.

The rest of the paper is organized as follows: In Section 2, we describe the problem more precisely, survey related work, and summarize our main results. Section 3 presents the theoretical results that lead to the improved algorithm described in Section 4. Runtime and numerical accuracy results are discussed in Section 5. Mathematical proofs of our theoretical results are deferred to Section 6.

Software availability

The algorithms presented in this paper are implemented in a new software package called momi (MOran Models for Inference), which is freely available at https://github.com/popgenmethods/momi.

2 Background and summary

In this section, we briefly review some terminology and concepts from population genetics that are relevant to our problem. We then discuss existing work and summarize our main results.

2.1 Motivation

To simplify the exposition, suppose that the genomes of n individuals have been randomly sampled from a single population. (We will shortly consider the general case where genomes are sampled from multiple related populations, which is substantially more challenging to analyze and is the main focus of this paper.) The positions in the genome where at least one, but fewer than all, of the sampled individuals bears a mutant allele (or genetic type) are called segregating sites. Assume that all such sites are dimorphic, meaning that we can label each sampled individual at each segregating site as either “ancestral” or “derived”. The sample frequency spectrum is defined to be the count vector $\hat{\xi} \in \mathbb{Z}_{+}^{n-1}$, whose $i$th entry $\hat{\xi}_i$ records the number of segregating sites that have exactly $i$ copies of the mutant allele in the sample. (Thus, $\|\hat{\xi}\|_1$ equals the total number of segregating sites found in the sampled genomes.)
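The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_frequency_spectrum` and the 0/1 matrix layout (one row per site, one column per sampled individual, 1 meaning derived) are assumptions made for the example, not part of momi.

```python
from collections import Counter

def sample_frequency_spectrum(genotypes):
    """Return [xi_1, ..., xi_{n-1}] for a sites-by-individuals 0/1 matrix.

    genotypes[s][j] is 1 if individual j carries the derived allele at site s.
    Sites where nobody or everybody is derived are not segregating, so they
    do not contribute to any entry.
    """
    if not genotypes:
        return []
    n = len(genotypes[0])
    # Number of derived copies at each site, tallied over sites.
    counts = Counter(sum(site) for site in genotypes)
    return [counts.get(i, 0) for i in range(1, n)]

# Hypothetical toy data: 5 sites, n = 4 individuals.
sites = [
    [0, 1, 0, 0],  # one derived copy
    [1, 1, 0, 0],  # two derived copies
    [0, 0, 0, 1],  # one derived copy
    [1, 1, 1, 1],  # everybody derived: not segregating
    [0, 0, 0, 0],  # nobody derived: not segregating
]
sfs = sample_frequency_spectrum(sites)
```

The sum of the returned entries is the number of segregating sites, matching the remark about the 1-norm of the count vector.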

The SFS is of interest because its distribution encodes past demographic events experienced by the population(s) from which the samples were obtained. In particular, suppose we are analyzing a single population and let $\alpha^{-1}(t) : \mathbb{R}_{\ge 0} \to \mathbb{R}_{>0}$ be a strictly positive function such that $\alpha^{-1}(t)$ equals the effective size (Ewens, 2004) of the population at t time units back in the past. (The inverse exponent is included because the function α(t) is used below to describe the rate at which a pair of individuals in the population find a common ancestor or coalesce; intuitively, this is more probable when the population is small.) There is a well-defined map

$$\alpha^{-1} \mapsto \xi(\alpha^{-1}) = \mathbb{E}_{\alpha^{-1}} \hat{\xi} \tag{1}$$

which takes population size histories to their expected frequency spectra. Here, the notation $\mathbb{E}_{\alpha^{-1}}$ denotes expectation with respect to the distribution on genomes induced by $\alpha^{-1}$. (For illustrative purposes we prefer to define this map abstractly for now; a concrete instantiation is given in Section 3.3.)

Although the above map is not injective for an arbitrary function (Myers et al., 2008), a general sufficient condition for injectivity has been found recently (Bhaskar and Song, 2014). For most classes of size functions of interest to population geneticists (for example, piecewise-exponential), the map is injective provided that the sample size n is sufficiently large. Hence, this leads to the inverse problem of estimating a size function $\hat{\alpha}^{-1}(\cdot)$ given the sample statistic $\hat{\xi}$. This machinery may be generalized to more complex demographic scenarios involving multiple populations which split or merged from one another, migration between populations, and other features, although we note that identifiability for these more general models has yet to be established.

The map (1) cannot in general be analytically inverted, so numerical methods are required to recover $\alpha^{-1}$. In order to solve this inverse problem numerically, it is necessary to be able to evaluate (1) rapidly and accurately.

2.2 Existing work

Two approaches exist for computing the forward mapping (1). One is based on the so-called Wright-Fisher diffusion (Ewens, 2004), a stochastic process which describes the frequency trajectory of a mutant allele segregating in a population as time moves forward. The diffusion framework has the advantage of being applicable to arbitrary demographic models, but its computational complexity grows exponentially with the number of populations. Also, it requires numerically solving a system of partial differential equations, which can be difficult in practice. For these reasons, current implementations (Gravel et al., 2011; Gutenkunst et al., 2009; Lukić and Hey, 2012) of the diffusion approach are limited to analyzing no more than four populations at once.

An alternative class of methods, which includes ours, relies on a backward-in-time stochastic process called the coalescent (Kingman, 1982a,b,c), which is dual to the Wright-Fisher diffusion. Intuitively, the coalescent describes the genealogical structure of a sample of chromosomes; below we provide a more detailed description of the coalescent since it is an integral part of our method. Using the coalescent, one computes the expected SFS by integrating over all genealogies underlying the sample. This can be done either via Monte Carlo or analytically. Monte Carlo integration (Nielsen, 2000) can effectively handle arbitrary demographic histories with a large number of populations, and Excoffier et al. (2013) have recently developed a useful implementation. However, when the number 𝒟 of populations (or demes) is moderate to large, most of the $O(n^{\mathcal{D}})$ SFS entries will be unobserved in simulations, and thus the Monte Carlo integral may naively assign a probability of 0 to observed SNPs. Monte Carlo computation of the expected SFS thus requires careful regularization techniques to avoid degeneracy.

An alternative to the Monte Carlo approach is to compute the expected SFS exactly via analytic integration over coalescent genealogies (Griffiths and Tavaré, 1998; Wakeley and Hey, 1997). For a demography involving multiple populations, this can be done by a dynamic program (Chen, 2012, 2013). This algorithm is more complicated and less general than both the Monte Carlo and diffusion approaches: while it can handle population splits, merges, size changes, and instantaneous gene flow, it is difficult to include continuous gene flow between populations. However, it scales well to a large number 𝒟 of populations, since it only computes entries of the SFS that are observed in the data, and ignores the $O(n^{\mathcal{D}})$ SFS entries that are not observed.

2.3 Summary of our main results

Unfortunately, existing coalescent-based algorithms (Chen, 2012, 2013; Wakeley and Hey, 1997) do not scale well to large sample size n, either in terms of running time or numerical stability. These algorithms rely on large alternating sums that explode with n and exhibit catastrophic cancellation. To circumvent these problems, we obtain new results for computing the truncated SFS, a key quantity needed to compute the joint SFS for multiple populations. For a fixed time τ, the truncated SFS gives the expected number of mutations arising in the time interval [0, τ) that are found in k = 1, 2, …, n individuals sampled at time 0. We provide an algorithm for computing this quantity efficiently and in a numerically stable manner.

For general demographic histories, the complexity of the dynamic program devised by Chen (2012, 2013) is $O(n^5 V + WL)$, where V is the number of populations (vertices) throughout the history, L is the number of distinct SFS entries to be computed, and W is a term that depends on n and the graph structure of the demography; in this paper, we improve this to $O(n^2 V + WL)$. For the special case of a tree-shaped demography without migration or admixture, Chen’s algorithm gives $W = O(n^4 V)$, for a total cost of $O(n^5 V + n^4 VL)$. In this special case (without admixture), we introduce an additional speedup that improves the runtime to $O(n^3 V + n^2 VL)$ (see Section 4.2.2). We show through an empirical study that our algorithm is not only orders of magnitude faster, but also more numerically stable.

Lastly, we note that our algorithm relies on ideas similar to those found in Bryant et al. (2012) and De Maio et al. (2013), but with a different focus. Those methods aim to compute a “phylogenetic SFS”, in which mutations are allowed to be recurrent and time scales are sufficiently long that population size can be assumed to be constant. In contrast, our method considers an infinite sites model (Kimura, 1969) without recurrent mutation, and can handle arbitrary population size change functions. These features make it more appropriate for use in population genetics.

3 Theoretical results on the truncated SFS

In this section we study the truncated sample frequency spectrum, which is the key object needed to compute the joint SFS in our algorithm. For readers who are unfamiliar with this area, we first give a brief introduction to the key concepts in order to make our results more understandable.

3.1 Background on the coalescent and the SFS

Kingman’s coalescent (Kingman, 1982a,b,c) is a partition-valued stochastic process which describes the shared ancestry of a group of individuals sampled randomly from a population. More precisely, the coalescent $\{\mathcal{C}_t^n\}_{t \ge 0}$ on n leaves is the backward-in-time Markov jump process whose value at time t is a partition of {1, …, n}, and at time t, each pair of blocks in $\mathcal{C}_t^n$ merges at rate α(t). We also call $1/\alpha(t)$ the population size history function. We often drop the dependence on n, and write $\mathcal{C}_t = \mathcal{C}_t^n$. We prefer to denote a dependence on n through the probability $\mathbb{P}_n$ and the expectation $\mathbb{E}_n$. So if $X(\mathcal{C}^n)$ denotes a random variable of the process $\mathcal{C}^n$, we usually write $\mathbb{E}_n[X]$ instead of $\mathbb{E}[X(\mathcal{C}^n)]$.

A sample path of $\{\mathcal{C}_t^n\}_{t \ge 0}$ can be viewed as a rooted ultrametric binary tree with n leaves, labeled 1, …, n, corresponding to sampled individuals. The tree extends backwards in time and each branch represents a partition block of the process such that $\mathcal{C}_t^n$ corresponds to the partition induced on {1, …, n} by cutting the tree at height t. Merging a pair of blocks after a random amount of time results in a corresponding merger of tree lineages. This process repeats until all individuals have “coalesced” into a single common ancestor.

The coalescent may be used to model the mutation patterns arising in sampled DNA. The basic idea is as follows: each branch in the coalescent tree represents a succession of individuals who are ancestral to the present-day individuals (leaves) they subtend. Now suppose one of those ancestral individuals experiences a mutation at a certain position in her genome. Then, barring mutations at that position in all other parts of the tree, all of her present-day descendants will bear the mutant allele at that position, while all other members of the sample will not. If we suppose that the probability of a mutation occurring on a particular branch of a coalescent tree is proportional to the length of that branch, then this creates a link between observed mutation data and the distribution of the coalescent tree, and hence between the observed data and α(t).

It turns out that this assumption is accurate for selectively neutral mutations: conditional on the underlying coalescent tree, they are distributed roughly as a Poisson point process occurring at some rate θ/2 ≪ 1 on the tree. (The scaling factor 1/2 is a convention in the literature.) This implies that, given that a mutation is segregating within the sample, it falls uniformly with respect to branch lengths on the coalescent tree of the sampled individuals. Now let ℳ ⊊ {1, …, n} be the set of leaves carrying the mutant allele (we only consider mutations beneath the root, so by assumption ℳ ≠ {1, …, n}). Then we define the sample frequency spectrum $f_n(k)$, for 0 < k < n, as the first-order Taylor series coefficient of $\mathbb{P}_n(|\mathcal{M}| = k)$ in the mutation rate,

$$\mathbb{P}_n(|\mathcal{M}| = k) = \frac{\theta}{2} f_n(k) + o(\theta).$$

We will generally refer to $f_n(k)$ as the sample frequency spectrum (SFS). We also note two alternative definitions of the SFS. First, $f_n(k)$ is the expected number of mutations with k descendants when $\theta/2 = 1$. Second, $\binom{n}{|K|}^{-1} f_n(|K|)$ is the expected length of the branch whose leaf set is K ⊂ {1, …, n}. More specifically, let 𝕀 denote the indicator function, and define $\mathcal{L}_K := \int_0^\infty \mathbb{I}_{K \in \mathcal{C}_t}\, dt$. Then

$$\frac{1}{\binom{n}{|K|}} f_n(|K|) = \mathbb{E}_n[\mathcal{L}_K].$$

The equivalence of these alternate definitions follows from previous results in Bhaskar et al. (2012); Griffiths and Tavaré (1998); Jenkins and Song (2011).

We now consider truncating the coalescent with mutation at time τ, as illustrated in Figure 1. Let $\mathcal{M}^\tau$ denote the set of leaves under mutations occurring in the time interval [0, τ). We define the truncated SFS $f_n^\tau(k)$ according to

$$\mathbb{P}_n(|\mathcal{M}^\tau| = k) = \frac{\theta}{2} f_n^\tau(k) + o(\theta),$$

where $f_n^\tau(k)$ corresponds to the total expected length of all branches in the time interval [0, τ) each of which subtends k leaves. Using the truncated SFS $f_n^\tau(k)$ for each population appearing in a demographic history, where τ denotes the length of time a particular population exists, it is possible to compute the joint SFS for multiple related populations (Chen, 2012). In Section 4.1, we describe a dynamic program algorithm for computing the joint SFS for multiple populations related by a complex demography, and the way in which this algorithm uses the truncated SFS $f_n^\tau(k)$.
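As a small worked case that is not part of the original text, take n = 2 and assume a constant coalescence rate α. The two lineages coalesce at a time T that is exponentially distributed with rate α. Before T there are two branches, each subtending one leaf; after T there is a single branch subtending both leaves. Reading the definition of the truncated SFS as an expected branch length then gives:

```latex
\begin{align*}
f_2^\tau(1) &= 2\,\mathbb{E}[\min(T,\tau)]
             = 2\int_0^\tau e^{-\alpha t}\,dt
             = \frac{2}{\alpha}\left(1 - e^{-\alpha\tau}\right), \\
f_2^\tau(2) &= \mathbb{E}\big[(\tau - T)^{+}\big]
             = \tau - \mathbb{E}[\min(T,\tau)]
             = \tau - \frac{1}{\alpha}\left(1 - e^{-\alpha\tau}\right).
\end{align*}
```

As τ grows, the first entry tends to 2/α, the untruncated value for a sample of size two, while the second grows without bound because the root branch keeps extending.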

Figure 1.

A sample path of the coalescent truncated at time τ. Star symbols denote mutations, while $\mathcal{M}^\tau$ denotes the set of leaves under those mutations. $T_k^\tau$ denotes the waiting time in the interval [0, τ) while there are k lineages.

3.2 Previous work on the truncated SFS

The key challenge is to compute the truncated SFS $f_n^\tau(k)$. Let $A_t^{\mathcal{C}}$ be the random variable which equals the number of coalescent lineages at time t ancestral to the sample. This is a pure death process which decreases by one each time a coalescence event occurs, until finally reaching the absorbing state 1 when all individuals have found a common ancestor. More precisely, $A_t^{\mathcal{C}} = |\mathcal{C}_t|$ and the rate of transition from m to m − 1 is given by $\lambda^{\mathcal{C}}_{m,m-1}(t) = \binom{m}{2}\alpha(t)$. We define a conditional version of the SFS via $f_n^\tau(k \mid A_\tau^{\mathcal{C}} = m)$ according to

$$\mathbb{P}_n(|\mathcal{M}^\tau| = k \mid A_\tau^{\mathcal{C}} = m) = \frac{\theta}{2} f_n^\tau(k \mid A_\tau^{\mathcal{C}} = m) + o(\theta). \tag{2}$$

Here, $f_n^\tau(k \mid A_\tau^{\mathcal{C}} = m)$ is the total expected length of all branches each subtending k leaves given that there are m ancestors at time τ.

In what follows we will introduce a recursive method for calculating the truncated SFS. The recursion is in terms of the sample size n so we let m ≤ ν ≤ n denote a generic sample size from this point forward. The method of Chen (2012, eq. 5) computes the truncated SFS by summing over the number m of ancestors remaining at time τ:

$$f_\nu^\tau(k) = \sum_{m=1}^{n-k+1} \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m)\, f_\nu^\tau(k \mid A_\tau^{\mathcal{C}} = m). \tag{3}$$

The first term in the summand, $\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m)$, can be computed in at least three ways: by numerically exponentiating the rate matrix of $A^{\mathcal{C}}$, by computing an alternating sum with O(ν) terms (Tavaré, 1984), or by solving a recursion described in Section 6.1. We note that the recursion described in Section 6.1 has the advantage of computing all values of $\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m)$, m ≤ ν ≤ n, in $O(n^2)$ time.

The second term $f_\nu^\tau(k \mid A_\tau^{\mathcal{C}} = m)$ in the summand of (3) is computed in Chen (2012, eq. 4) as

$$f_\nu^\tau(k \mid A_\tau^{\mathcal{C}} = m) = \sum_{i=m}^{\nu} i\, p_{\nu,i}^{k,1}\, \mathbb{E}_\nu[T_i^\tau \mid A_\tau^{\mathcal{C}} = m], \tag{4}$$

where

$$p_{\nu,i}^{k,j} := \begin{cases} \dfrac{\binom{k-1}{j-1}\binom{\nu-k-1}{i-j-1}}{\binom{\nu-1}{i-1}}, & \text{if } k \ge j > 0 \text{ and } \nu - k \ge i - j > 0, \\ 1, & \text{if } j = k = 0 \text{ or } i - j = \nu - k = 0, \\ 0, & \text{else,} \end{cases}$$

is the transition probability of the Pólya urn model, starting with i − j white balls and j black balls, and ending with ν − k white balls and k black balls (Johnson and Kotz, 1977), and

$$T_i^\tau := \int_0^\tau \mathbb{I}_{A_t^{\mathcal{C}} = i}\, dt$$

is the length of time in [0, τ) where there are i ancestral lineages to the sample, as illustrated in Figure 1. Chen (2012, eq. 3) provides a formula for the conditional expectation $\mathbb{E}_\nu[T_i^\tau \mid A_\tau^{\mathcal{C}} = m]$ for the case of constant population size, which he later extends (Chen, 2013) to the case of an exponentially growing population. However, these formulas involve an alternating sum with $O(n^2)$ terms. Thus, computing $\mathbb{E}_\nu[T_i^\tau \mid A_\tau^{\mathcal{C}} = m]$ for every value of i, m, ν, as required to compute $\{f_\nu^\tau(k)\}_{k \le \nu \le n}$ with (3) and (4), takes $O(n^5)$ time with these formulas. In addition, large alternating sums are numerically unstable due to catastrophic cancellation (Higham, 2002), and so these formulas require the use of high-precision numerical libraries, further increasing runtime.
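The Pólya urn probability used in (4) is simple to evaluate directly. The sketch below transcribes its three cases; the function name `polya_urn_prob` and its argument order are hypothetical, and it assumes 0 ≤ j ≤ i ≤ ν.

```python
from math import comb

def polya_urn_prob(nu, i, k, j):
    """Probability that an urn started with i - j white and j black balls
    holds nu - k white and k black balls once it contains nu balls.

    Mirrors the three cases of p_{nu,i}^{k,j}; assumes 0 <= j <= i <= nu.
    """
    if k >= j > 0 and nu - k >= i - j > 0:
        return comb(k - 1, j - 1) * comb(nu - k - 1, i - j - 1) / comb(nu - 1, i - 1)
    if (j == 0 and k == 0) or (i - j == 0 and nu - k == 0):
        return 1.0
    return 0.0

# For fixed (nu, i, j) the values over k form a probability distribution.
total = sum(polya_urn_prob(10, 4, k, 1) for k in range(0, 11))
```

Every term here is non-negative, so this factor is not a source of cancellation; the instability discussed above comes from the alternating sums for the conditional expected times.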

3.3 A fast, stable algorithm for computing the truncated SFS

Here, we present a numerically stable algorithm to compute $\{f_\nu^\tau(k) \mid 1 \le k \le \nu \le n\}$ in $O(n^2)$ time instead of $O(n^5)$ time. Our approach utilizes the following two lemmas:

Lemma 1

The entry $f_n^\tau(n)$ of the truncated SFS is given by

$$f_n^\tau(n) = \tau - \sum_{k=1}^{n-1} \frac{k}{n} f_n^\tau(k). \tag{5}$$

Lemma 2

For all 1 ≤ k ≤ ν, the truncated SFS $f_\nu^\tau(k)$ satisfies the linear recurrence

$$f_\nu^\tau(k) = \frac{\nu - k + 1}{\nu + 1} f_{\nu+1}^\tau(k) + \frac{k + 1}{\nu + 1} f_{\nu+1}^\tau(k + 1). \tag{6}$$

We prove Lemma 1 in Section 6.2. We note here that our proof also yields the identity $\mathbb{E}[T_{\mathrm{MRCA}}] = \sum_{k=1}^{n-1} \frac{k}{n} f_n(k)$, where $T_{\mathrm{MRCA}}$ denotes the time to the most recent common ancestor of the sample; to our knowledge, this formula is new. A proof of Lemma 2 is provided in Section 6.3.
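One way to build intuition for the identity for the time to the most recent common ancestor is that it already holds on every individual tree, before taking expectations: each leaf's path to the root has length equal to the tree height, and a branch subtending k leaves lies on exactly k of those paths. The simulation sketch below checks this for a constant coalescence rate; the function name `simulate_branch_lengths` and the constant-rate assumption are illustrative choices, not taken from the paper.

```python
import random

def simulate_branch_lengths(n, alpha=1.0, rng=None):
    """Simulate one coalescent tree on n leaves with constant rate alpha.

    Returns (lengths, t_mrca) where lengths[k] is the total length of the
    branches that subtend exactly k leaves, for k = 1, ..., n - 1.
    """
    if rng is None:
        rng = random.Random()
    blocks = [1] * n          # leaves under each current lineage
    lengths = [0.0] * n       # index 0 is unused
    t = 0.0
    while len(blocks) > 1:
        m = len(blocks)
        # Each of the binom(m, 2) pairs merges at rate alpha.
        wait = rng.expovariate(alpha * m * (m - 1) / 2)
        for size in blocks:
            lengths[size] += wait
        t += wait
        a, b = rng.sample(range(m), 2)   # uniformly chosen pair coalesces
        merged = blocks[a] + blocks[b]
        blocks = [s for idx, s in enumerate(blocks) if idx not in (a, b)]
        blocks.append(merged)
    return lengths, t

rng = random.Random(0)
n = 8
gaps = []
for _ in range(20):
    lengths, t_mrca = simulate_branch_lengths(n, rng=rng)
    weighted = sum(k * lengths[k] for k in range(1, n)) / n
    gaps.append(abs(weighted - t_mrca))
```

Each entry of `gaps` is zero up to floating-point rounding; averaging `lengths[k]` over many trees would give a Monte Carlo estimate of the untruncated SFS.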

We now sketch our algorithm. For a given n, we show below that all values of $f_n^\tau(k)$, for 1 ≤ k < n, can be computed in $O(n^2)$ time. We then compute $f_n^\tau(n)$ using Lemma 1 in O(n) time. Finally, using $f_n^\tau(k)$ for 1 ≤ k ≤ n as boundary conditions, Lemma 2 allows us to compute all $f_\nu^\tau(k)$, for ν = n − 1, n − 2, …, 2 and k = 1, …, ν, in $O(n^2)$ time.
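The last two steps of this outline (Lemma 1, then the downward recurrence of Lemma 2) can be sketched as follows. The function name `fill_truncated_sfs` and the dictionary layout are hypothetical. The sketch continues the recurrence one step further than the text, down to ν = 1, purely as a sanity check, since a single lineage must have a branch of length τ. The n = 3 input values in the usage lines are closed forms worked out by hand for constant α = 1 and are an assumption of this example.

```python
import math

def fill_truncated_sfs(f_n, tau):
    """Given f_n[k-1] = f_n^tau(k) for k = 1..n-1, return table[nu][k] =
    f_nu^tau(k) for all 1 <= k <= nu <= n."""
    n = len(f_n) + 1
    top = {k: f_n[k - 1] for k in range(1, n)}
    # Lemma 1: the entry for k = n follows from the other entries.
    top[n] = tau - sum(k * top[k] for k in range(1, n)) / n
    table = {n: top}
    # Lemma 2: step from sample size nu + 1 down to nu.
    for nu in range(n - 1, 0, -1):
        up = table[nu + 1]
        table[nu] = {
            k: ((nu - k + 1) * up[k] + (k + 1) * up[k + 1]) / (nu + 1)
            for k in range(1, nu + 1)
        }
    return table

# Usage with n = 3, constant alpha = 1 (hand-derived inputs, see lead-in).
tau = 0.7
f3 = [
    2 - 1.5 * math.exp(-tau) - 0.5 * math.exp(-3 * tau),   # k = 1
    1 - 1.5 * math.exp(-tau) + 0.5 * math.exp(-3 * tau),   # k = 2
]
table = fill_truncated_sfs(f3, tau)
```

The double loop touches each pair (ν, k) once, which is where the quadratic cost of this stage comes from.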

We now describe how to compute the aforementioned terms $f_n^\tau(k)$, for all k < n, in $O(n^2)$ time. We first recall the result of Polanski and Kimmel (2003) which represents the untruncated SFS $f_n(k)$, for 1 ≤ k ≤ n − 1, as

$$f_n(k) = \sum_{m=2}^{n} W_{n,k,m}\, c_m, \tag{7}$$

where

$$c_m := \mathbb{E}_m[T_m] = \int_0^\infty t \binom{m}{2} \alpha(t) \exp\left[-\binom{m}{2} \int_0^t \alpha(x)\, dx\right] dt = \int_0^\infty \exp\left[-\binom{m}{2} \int_0^t \alpha(x)\, dx\right] dt \tag{8}$$

denotes the waiting time to the first coalescence for a sample of size m, and $W_{n,k,m}$ are universal constants that are efficiently computable using the following recursions (Polanski and Kimmel, 2003):

$$\begin{aligned}
W_{n,k,2} &= \frac{6}{n+1}, \\
W_{n,k,3} &= \frac{30(n-2k)}{(n+1)(n+2)}, \\
W_{n,k,m+2} &= -\frac{(1+m)(3+2m)(n-m)}{m(2m-1)(n+m+1)} W_{n,k,m} + \frac{(3+2m)(n-2k)}{m(n+m+1)} W_{n,k,m+1},
\end{aligned} \tag{9}$$

for 2 ≤ m ≤ n − 2. The key observation is to note that, in a similar vein as (7), we have:

Lemma 3

The truncated SFS $f_n^\tau(k)$, for 1 ≤ k ≤ n − 1, can be written as

$$f_n^\tau(k) = \sum_{m=2}^{n} W_{n,k,m}\, c_m^\tau, \tag{10}$$

where $c_m^\tau$ is a truncated version of (8):

$$c_m^\tau := \mathbb{E}_m[T_m^\tau] = \int_0^\tau \exp\left[-\binom{m}{2} \int_0^t \alpha(x)\, dx\right] dt. \tag{11}$$

We prove Lemma 3 in Section 6.4. For piecewise-exponential α(t), $c_m^\tau$ can be computed explicitly using formulas from Bhaskar et al. (2015). Using (9), we can compute all values of $W_{n,k,m}$, for 1 ≤ k ≤ n and 2 ≤ m ≤ n, in $O(n^2)$ time. Then, using (10), all values of $f_n^\tau(k)$, for 1 ≤ k ≤ n − 1, can be computed in $O(n^2)$ time.
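The recursion (9) and the sum (10) can be sketched as below. To keep the example short it assumes a constant coalescence rate α, for which the integral in (11) has the elementary closed form $(1 - e^{-\binom{m}{2}\alpha\tau}) / (\binom{m}{2}\alpha)$; the piecewise-exponential formulas cited above are not reproduced. The names `polanski_kimmel_W`, `c_truncated`, and `truncated_sfs` are hypothetical.

```python
import math

def polanski_kimmel_W(n):
    """W[k][m] for 1 <= k <= n - 1 and 2 <= m <= n, via recursion (9)."""
    W = {}
    for k in range(1, n):
        row = {2: 6.0 / (n + 1)}
        if n >= 3:
            row[3] = 30.0 * (n - 2 * k) / ((n + 1) * (n + 2))
        for m in range(2, n - 1):      # fills row[m + 2], up to m + 2 = n
            row[m + 2] = (
                -(1 + m) * (3 + 2 * m) * (n - m)
                / (m * (2 * m - 1) * (n + m + 1)) * row[m]
                + (3 + 2 * m) * (n - 2 * k) / (m * (n + m + 1)) * row[m + 1]
            )
        W[k] = row
    return W

def c_truncated(m, tau, alpha=1.0):
    """c_m^tau from (11), specialised to constant alpha (an assumption)."""
    rate = alpha * m * (m - 1) / 2
    return -math.expm1(-rate * tau) / rate

def truncated_sfs(n, tau, alpha=1.0):
    """[f_n^tau(1), ..., f_n^tau(n - 1)] via (10)."""
    W = polanski_kimmel_W(n)
    c = {m: c_truncated(m, tau, alpha) for m in range(2, n + 1)}
    return [sum(W[k][m] * c[m] for m in range(2, n + 1)) for k in range(1, n)]

f_small_tau = truncated_sfs(3, 0.7)
f_large_tau = truncated_sfs(10, 1e3)   # effectively untruncated
```

For very large τ the result approaches 2/(αk), the classical constant-size spectrum under the θ/2 scaling used here, which gives a quick check on the signs in (9).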

Note that the above algorithm not only significantly improves computational complexity, but also resolves numerical issues, since it allows us to avoid computing the expected times $\mathbb{E}_\nu[T_i^\tau \mid A_\tau^{\mathcal{C}} = m]$, which are alternating sums of $O(n^2)$ terms and are numerically unstable to evaluate for large values of n (say, n > 100).

3.4 An alternative formula for piecewise-constant subpopulation sizes

For demographic scenarios with piecewise-constant subpopulation sizes, we present an alternative formula for computing the truncated SFS within a constant piece. This formula has the same computational complexity as that described in the previous section.

Let $\mathcal{K}_t$ denote the coalescent with killing, a stochastic process that is closely related to the Chinese restaurant process, Hoppe’s urn, and Ewens’ sampling formula (Aldous, 1985; Hoppe, 1984). In particular, the coalescent with killing $\{\mathcal{K}_t\}_{t \ge 0}$ is a stochastic process whose value at time t is a marked partition of {1, …, n}, where each partition block is marked as “killed” or “unkilled”. We obtain the partition for $\mathcal{K}_t$ by dropping mutations onto the coalescent tree as a Poisson point process with rate $\theta/2$, and then defining an equivalence relation on {1, …, n}, where i ~ j if and only if i, j have coalesced by time t and there are no mutations on the branches between i and j (i.e., i and j are identical by descent). We furthermore mark the equivalence classes (i.e., partition blocks) of $\mathcal{K}_t$ that are descended from a mutation in [0, t) as “killed”. See Figure 2 for an illustration. The process $\mathcal{K}_\tau$ can also be obtained by running Hoppe’s urn, or equivalently the Chinese restaurant process, forward in time (Durrett, 2008, Theorem 1.9).

Figure 2.

The coalescent with killing for the genealogy in Figure 1. Note that $\mathcal{K}_\tau$ is a marked partition, with the blocks killed by mutations in [0, τ) being specially marked.

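Let $A_t^{\mathcal{K}}$ be the number of unkilled blocks in $\mathcal{K}_t$, so that $A_t^{\mathcal{K}}$ is a pure death process with transition rate $\lambda^{\mathcal{K}}_{i,i-1}(t) = \binom{i}{2}\alpha(t) + i\frac{\theta}{2}$ (the rate of coalescence is the number of unkilled pairs, $\binom{i}{2}$, times α(t), and the rate of killing due to mutation is $i\frac{\theta}{2}$). Our next proposition gives a formula for the truncated conditional sample frequency spectrum given $A_\tau^{\mathcal{K}}$, i.e., $f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m)$.

Proposition 1

Consider the constant population size history $1/\alpha(t) = 1/\alpha$ for t ∈ [0, τ), and let m > 0 and 0 < k ≤ n − m. The joint probability that the number of derived mutants is k and the number of unkilled ancestral lineages is m, when truncating at time τ, is given by

$$\mathbb{P}_n(|\mathcal{M}^\tau| = k, A_\tau^{\mathcal{K}} = m) = \frac{\theta}{2} f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m)\, \mathbb{P}_n(A_\tau^{\mathcal{C}} = m) + o(\theta),$$

where

$$f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m) = \frac{2}{\alpha k} \cdot \frac{\binom{n-m}{k}}{\binom{n-1}{k}}. \tag{12}$$

We prove Proposition 1 in Section 6.5. Note that this equation does not hold for the case k = n, m = 0, but fortunately we do not need to consider that case in what follows below.

We can use Proposition 1 to stably and efficiently compute the terms $f_\nu^\tau(k)$, for k ≤ ν ≤ n, as follows. We first compute the case k < ν = n. Note that $\mathbb{P}_n(|\mathcal{M}^\tau| = k) = \sum_m \mathbb{P}_n(|\mathcal{M}^\tau| = k, A_\tau^{\mathcal{K}} = m)$. So for k < n, by Proposition 1

$$f_n^\tau(k) = \sum_{m=1}^{n} f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m)\, \mathbb{P}_n(A_\tau^{\mathcal{C}} = m) = \sum_{m=1}^{n} \frac{2}{\alpha k} \cdot \frac{\binom{n-m}{k}}{\binom{n-1}{k}}\, \mathbb{P}_n(A_\tau^{\mathcal{C}} = m). \tag{13}$$

The sum in (13) contains O(n) terms, so it costs $O(n^2)$ to compute $f_n^\tau(k)$ for all k < n. After this, we use Lemma 1 to compute $f_n^\tau(n)$, and then use Lemma 2 to compute $f_\nu^\tau(k)$ for all 1 ≤ k ≤ ν < n. Since there are $O(n^2)$ such terms, this also takes $O(n^2)$ time.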
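Formula (13) can be sketched for small n as follows. The lineage-count probabilities are obtained here from the textbook closed form for a pure death process with distinct rates, which is itself an alternating sum; that choice is made only to keep the example dependency-free and is suitable for small n, whereas the text points to the recursion of Section 6.1 for practical use. The function names are hypothetical.

```python
import math

def lineage_count_probs(n, tau, alpha=1.0):
    """P_n(A_tau = m) for m = 1..n under constant coalescence rate alpha.

    Closed form for a pure death process with distinct rates
    lam_i = binom(i, 2) * alpha. Alternating sum: small n only.
    """
    lam = {i: alpha * i * (i - 1) / 2 for i in range(1, n + 1)}
    probs = {}
    for m in range(1, n + 1):
        rate_prod = math.prod(lam[i] for i in range(m + 1, n + 1))
        total = 0.0
        for j in range(m, n + 1):
            denom = math.prod(lam[i] - lam[j] for i in range(m, n + 1) if i != j)
            total += math.exp(-lam[j] * tau) * rate_prod / denom
        probs[m] = total
    return probs

def truncated_sfs_constant(n, tau, alpha=1.0):
    """[f_n^tau(1), ..., f_n^tau(n - 1)] via (13)."""
    p = lineage_count_probs(n, tau, alpha)
    return [
        sum(2.0 / (alpha * k) * math.comb(n - m, k) / math.comb(n - 1, k) * p[m]
            for m in range(1, n + 1))
        for k in range(1, n)
    ]

probs3 = lineage_count_probs(3, 0.7)
f3 = truncated_sfs_constant(3, 0.7)
```

The weights multiplying the probabilities in (13) are all non-negative, so once the lineage-count probabilities are available the final sum involves no cancellation.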

4 The joint SFS for multiple populations

In this section we discuss an algorithm for computing the multi-population SFS (Chen, 2012, 2013; Wakeley and Hey, 1997). We describe the algorithm in Section 4.1, and note how the results from Section 3 improve the time complexity of this algorithm. In Section 4.2, we focus on the special case of tree-shaped demographies, and introduce a further algorithmic speedup by replacing the coalescent with a Moran model.

Let V be the number of subpopulations in the demographic history, n the total sample size, and L the number of SFS entries to compute. Then the results from Section 3 improve the computational complexity of the SFS from $O(n^5 V + WL)$ to $O(n^2 V + WL)$, where W is a term that depends on n and the structure of the demographic history. In the special case of tree-shaped demographies, the algorithm from Chen (2012) gives $WL = O(n^4 VL)$. The Moran-based speedup introduced in Section 4.2 improves the runtime of tree-shaped demographies from $O(n^4 VL)$ to $O(n^3 V + n^2 VL)$.

The Moran-based speedup can be generalized to non-tree demographies, but the notation, implementation, and analysis of computational complexity becomes substantially more complicated. We thus leave its generalization to future work.

4.1 A coalescent-based dynamic program

Suppose at the present we have 𝒟 populations, and in the ith population we observe $n_i$ alleles. For a single point mutation, let $x = (x_1, \ldots, x_{\mathcal{D}})$ denote the number of alleles that are derived in each population. We wish to compute f(x), where $\frac{\theta}{2} f(x)$ is the expected number of point mutations with derived counts x.
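On the data side, the observed counterpart of f(x) is a tally of configurations x over segregating sites. The sketch below stores only configurations that actually occur, which is the sparsity the dynamic program exploits. The function name `joint_sfs` and the input layout (a 0/1 site-by-allele matrix plus a population index per allele) are assumptions made for this illustration.

```python
from collections import Counter

def joint_sfs(genotypes, populations):
    """Count observed derived-allele configurations x = (x_1, ..., x_D).

    genotypes[s][j] is 1 if allele j is derived at site s; populations[j]
    is the index (0..D-1) of the population that allele j was sampled from.
    Only configurations that occur in the data are stored.
    """
    n_pops = max(populations) + 1
    total = len(populations)
    counts = Counter()
    for site in genotypes:
        derived = sum(site)
        if derived == 0 or derived == total:
            continue                      # not segregating in the sample
        x = [0] * n_pops
        for allele, pop in zip(site, populations):
            x[pop] += allele
        counts[tuple(x)] += 1
    return counts

# Hypothetical toy data: 2 alleles from population 0, 3 from population 1.
populations = [0, 0, 1, 1, 1]
sites = [
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
spectrum = joint_sfs(sites, populations)
```

The number of keys in `spectrum` plays the role of L, the number of distinct SFS entries for which the expected value must be computed.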

For demographic histories consisting of population size changes, population splits, population mergers, and pulse admixture events, Chen (2012) gave an algorithm to compute f(x) using the truncated SFS $f_n^\tau(k)$ that we defined in Section 3.

We now describe this algorithm. We start by representing the population history as a directed acyclic graph (DAG), where each vertex υ represents a subpopulation (Figure 3). We draw a directed edge from υ to υ′ if there is gene flow from the bottom-most part of υ to the top-most part of υ′, where “down” is the present and “up” is the ancient past. Thus, the leaf vertices correspond to the subpopulations at the present. For a vertex υ in the population history graph, let $\tau_\upsilon \in (0, \infty)$ denote the length of time the corresponding population persists, and let $\alpha_\upsilon : [0, \tau_\upsilon) \to \mathbb{R}_{+}$ denote the inverse population size history of υ. So going backwards in time from the present, $\alpha_\upsilon(t)$ gives the instantaneous rate at which two particular lineages in υ coalesce, after υ has existed for time t. We use $f_n^\upsilon(k)$ to denote the truncated SFS for the coalescent embedded in υ, i.e., $f_n^\upsilon(k) = f_n^{\tau_\upsilon}(k)$ for a coalescent with coalescence rate $\alpha_\upsilon(t)$. Then we have

$$f(x) = \sum_{\upsilon} \sum_{m_0^\upsilon, k_0^\upsilon} f^\upsilon_{m_0^\upsilon}(k_0^\upsilon)\, \mathbb{P}(x \mid k_0^\upsilon, m_0^\upsilon)\, \mathbb{P}(m_0^\upsilon) \tag{14}$$

where $m_0^\upsilon$ denotes the number of lineages at the bottom of υ that are ancestral to the initial sample, and $k_0^\upsilon$ denotes the number of these lineages with a derived allele.

Figure 3.

A demographic history with a pulse migration event (left), and its corresponding directed graph (right).

In order to use (14), we must compute $f^\upsilon_{m_0^\upsilon}(k_0^\upsilon)$ for every population υ, and every value of $m_0^\upsilon$ and $k_0^\upsilon$. If n is the total sample size and V the total number of vertices, then this takes $O(n^5 V)$ time using the formulas of Chen (2012). Our results from Section 3 improve this to $O(n^2 V)$.

To use (14), we must also compute the terms $\mathbb{P}(x \mid k_0^\upsilon, m_0^\upsilon)\, \mathbb{P}(m_0^\upsilon)$, for which Chen (2012) constructs a dynamic program, starting at the leaf vertices and moving up the graph. This dynamic program essentially consists of setting up a Bayesian graphical model with random variables $m_0^\upsilon, k_0^\upsilon$ and performing belief propagation, which can be done via the sum-product algorithm (“tree-peeling”) if the population graph is a tree (Felsenstein, 1981; Pearl, 1982), or via a junction tree algorithm if not (Lauritzen and Spiegelhalter, 1988).

The time complexity of the algorithm thus depends on the topological structure of the population graph. In the special case where the demographic history is a binary tree, the tree-peeling algorithm computes the values $\mathbb{P}(x \mid k_0^\upsilon, m_0^\upsilon)\, \mathbb{P}(m_0^\upsilon)$ in $O(n^4 V)$ time, since the vertex υ has $O(n^2)$ possible states $(k_0^\upsilon, m_0^\upsilon)$, so summing over the transitions between every pair of states costs $O(n^4)$. Note that Chen (2012) mistakenly states that the computation takes $O(n^3 V)$ time.

To summarize, let W be the time it takes to compute (14) after the terms $f_m^\upsilon(k)$ have been precomputed, and let L be the number of distinct entries x for which we wish to compute f(x). Then our results from Section 3 improve the computational complexity from $O(n^5 V + WL)$ to $O(n^2 V + WL)$. In the case of a binary tree, the original algorithm of Chen (2012) gives $WL = O(n^4 VL)$. In the following section, we improve this to $O(n^3 V + n^2 VL)$.

4.2 A Moran-based dynamic program

Here, we describe a new dynamic program that improves the computational complexity of computing f(x) for tree-shaped demographies. The main idea is to replace the backwards-in-time coalescent with a forwards-in-time Moran model.

4.2.1 Algorithm description

We assume the 𝒟 populations at the present are related by a binary rooted tree with 𝒟 leaves, where each leaf represents a population at the present, and at each internal vertex, a parent population splits into two child populations. (Note that a non-binary tree can be represented as a binary tree, with additional vertices of height 0.)

Instead of working with the multi-population coalescent directly, we will consider a multi-population Moran model, in which the coalescent is embedded (Moran, 1958). In particular, let 𝔏(υ) denote the leaf populations descended from the population υ, and let $n_\upsilon = \sum_{i \in \mathfrak{L}(\upsilon)} n_i$ be the number of present-day alleles with ancestry in υ. For each population υ (except the root), we construct a Moran model going forward in time, i.e., starting at $\tau_\upsilon$ and ending at 0. The Moran model consists of $n_\upsilon$ lineages, each with either an ancestral or derived allele. Going forward in time, every lineage copies itself onto every other lineage at rate $\frac{1}{2}\alpha_\upsilon(t)$. Thus, the total rate of copying events is $\binom{n_\upsilon}{2}\alpha_\upsilon(t)$. Let $\mu_t^\upsilon$ denote the number of derived alleles at time t in population υ. Then the transition rate of $\mu_t^\upsilon$ when $\mu_t^\upsilon = x$ is $\lambda_{x \to x+1}(t) = \lambda_{x \to x-1}(t) = \frac{x(n_\upsilon - x)}{2}\alpha_\upsilon(t)$, since there are $x(n_\upsilon - x)$ pairs of lineages with different alleles.

The coalescent is embedded within the Moran model, because if we trace the ancestry of genetic material backwards in time in the Moran model, we obtain a genealogy with the same distribution as under the coalescent (Durrett, 2008, Theorem 1.30). Thus, we can obtain the expected number of mutations with derived counts x, by summing over the population υ in which the mutation occurred:

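$$f(x) = \sum_{\upsilon} \sum_{k=1}^{n_\upsilon} f^{\upsilon}_{n_\upsilon}(k)\, \mathbb{P}(x \mid \mu_0^\upsilon = k,\ \mu_{\tau_\upsilon}^\upsilon = 0). \tag{15}$$

Let $x_\upsilon = \{x_i : i \in \mathfrak{L}(\upsilon)\}$ denote the subsample of derived allele counts in the populations descended from $\upsilon$. Similarly, let $x_\upsilon^c = \{x_i : i \notin \mathfrak{L}(\upsilon)\}$. Then for $k \ge 1$,

$$\mathbb{P}(x \mid \mu_0^\upsilon = k,\ \mu_{\tau_\upsilon}^\upsilon = 0) = \begin{cases} \mathbb{P}(x_\upsilon \mid \mu_0^\upsilon = k), & \text{if } x_\upsilon^c = 0, \\ 0, & \text{if } x_\upsilon^c \ne 0. \end{cases} \tag{16}$$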

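So it suffices to compute $\mathbb{P}(x_\upsilon \mid \mu_0^\upsilon = k)$ for all $\upsilon$ and $k$. If $\upsilon$ is the $i$th leaf population, then $\mathbb{P}(x_\upsilon \mid \mu_0^\upsilon = k) = \mathbb{I}_{k = x_i}$. On the other hand, if $\upsilon$ is an interior vertex with children $\upsilon_1$ and $\upsilon_2$, then

$$\mathbb{P}(x_\upsilon \mid \mu_0^\upsilon = k) = \sum_{k_1=0}^{n_{\upsilon_1}} \frac{\binom{n_{\upsilon_1}}{k_1}\binom{n_{\upsilon_2}}{k-k_1}}{\binom{n_\upsilon}{k}}\, \mathbb{P}(x_{\upsilon_1} \mid \mu_{\tau_{\upsilon_1}}^{\upsilon_1} = k_1)\, \mathbb{P}(x_{\upsilon_2} \mid \mu_{\tau_{\upsilon_2}}^{\upsilon_2} = k-k_1), \tag{17}$$

where $\mathbb{P}(x_{\upsilon_i} \mid \mu_{\tau_{\upsilon_i}}^{\upsilon_i})$ can be computed from

$$\mathbb{P}(x_\upsilon \mid \mu_{\tau_\upsilon}^\upsilon = k) = \sum_{j=0}^{n_\upsilon} \mathbb{P}(x_\upsilon \mid \mu_0^\upsilon = j)\, \mathbb{P}(\mu_0^\upsilon = j \mid \mu_{\tau_\upsilon}^\upsilon = k). \tag{18}$$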

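To compute the transition probability $\mathbb{P}(\mu_0^\upsilon = j \mid \mu_{\tau_\upsilon}^\upsilon = k)$, note that the transition rate matrix of $\mu_t^\upsilon$ can be written as $Q^{(n_\upsilon)}\alpha_\upsilon(t)$, where $Q^{(n_\upsilon)} = (q_{ij}^{(n_\upsilon)})_{0 \le i,j \le n_\upsilon}$ is an $(n_\upsilon + 1) \times (n_\upsilon + 1)$ matrix with

$$q_{ij}^{(n_\upsilon)} = \begin{cases} -i(n_\upsilon - i), & \text{if } i = j, \\ \frac{1}{2} i(n_\upsilon - i), & \text{if } |j - i| = 1, \\ 0, & \text{else,} \end{cases}$$

so then the transition probability is given by the matrix exponential

$$\mathbb{P}(\mu_0^\upsilon = j \mid \mu_{\tau_\upsilon}^\upsilon = k) = \left[ e^{Q^{(n_\upsilon)} \int_0^{\tau_\upsilon} \alpha_\upsilon(t)\,dt} \right]_{k,j}. \tag{19}$$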

Thus, the joint SFS $f(x)$ can be computed using (15) and (16), with $\mathbb{P}(x_\upsilon \mid \mu_0^\upsilon = k)$ given by recursively computing (17), (18), and (19), in a depth-first search on the population tree (i.e., Felsenstein’s tree-peeling algorithm, or the sum-product algorithm for belief propagation).
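As a rough illustration of steps (17) and (18), the following Python sketch peels a single split between two leaf populations. The function names (`leaf_likelihood`, `propagate`, `merge_children`, `identity`) are hypothetical and are not taken from momi. The transition matrices of (19) are passed in as plain nested lists, and identity matrices (branches of length zero) are assumed here only to keep the example small.

```python
from math import comb

def leaf_likelihood(n_leaf, x_leaf):
    """P(x_v | mu_0^v = k) at a leaf: indicator that k equals the observed count."""
    return [1.0 if k == x_leaf else 0.0 for k in range(n_leaf + 1)]

def propagate(bottom, transition):
    """Equation (18); transition[k][j] stands for P(mu_0 = j | mu_tau = k)."""
    return [sum(row[j] * bottom[j] for j in range(len(bottom)))
            for row in transition]

def merge_children(top1, top2):
    """Equation (17): split k derived alleles hypergeometrically between children."""
    n1, n2 = len(top1) - 1, len(top2) - 1
    n = n1 + n2
    merged = []
    for k in range(n + 1):
        total = 0.0
        # Only k1 with 0 <= k1 <= n1 and 0 <= k - k1 <= n2 contribute.
        for k1 in range(max(0, k - n2), min(n1, k) + 1):
            weight = comb(n1, k1) * comb(n2, k - k1) / comb(n, k)
            total += weight * top1[k1] * top2[k - k1]
        merged.append(total)
    return merged

def identity(n):
    """Transition matrix of a zero-length branch (assumed for illustration)."""
    return [[1.0 if i == j else 0.0 for j in range(n + 1)] for i in range(n + 1)]

# Two leaves with 3 and 2 sampled alleles and observed derived counts 2 and 1.
top_a = propagate(leaf_likelihood(3, 2), identity(3))
top_b = propagate(leaf_likelihood(2, 1), identity(2))
parent = merge_children(top_a, top_b)
```

With zero-length branches the parent likelihood is nonzero only at $k = 3$, where it equals the hypergeometric weight $\binom{3}{2}\binom{2}{1}/\binom{5}{3}$. For branches of positive length, `identity` would be replaced by the matrix exponential of (19).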

4.2.2 Computational complexity of Moran approach

We now consider the computational complexity associated with (17), (18), and (19) for each vertex $\upsilon$. For a fixed configuration $x$, (17) and (18) must be computed for $O(n_\upsilon)$ values, and are each sums with $O(n_\upsilon)$ terms; they thus contribute $O(n_\upsilon^2 L)$ to the total runtime, where $L$ is the number of distinct values of $x$. The matrix exponential $e^{Q^{(n_\upsilon)} \int_0^{\tau_\upsilon} \alpha_\upsilon(t)\,dt}$ in (19) can be computed in several ways, including spectral decomposition or scaling-and-squaring, and costs $O(n_\upsilon^3)$ time (Moler and Van Loan, 2003). The computational complexity associated with a single vertex $\upsilon$ is thus $O(n_\upsilon^3 + n_\upsilon^2 L)$. Therefore, for a binary population tree with $V$ nodes, arbitrary population size functions, and no migration, the total cost of computing the observed SFS entries is $O(n^3 V + n^2 VL)$.
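To make the matrix exponential in (19) concrete, here is a minimal pure-Python sketch that builds the Moran rate matrix and exponentiates it by scaling-and-squaring with a truncated Taylor series. This is an illustration under simplifying assumptions, not the routine used in momi: the helper names and the fixed choices `scale_power=10` and `terms=20` are hypothetical, and a production implementation would more likely call a library routine.

```python
def moran_rate_matrix(n):
    """Q^(n): q_ii = -i(n-i), q_{i,i-1} = q_{i,i+1} = i(n-i)/2, zero elsewhere."""
    q = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        rate = i * (n - i)
        q[i][i] = -float(rate)
        if i > 0:
            q[i][i - 1] = rate / 2
        if i < n:
            q[i][i + 1] = rate / 2
    return q

def matmul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def expm(matrix, scale_power=10, terms=20):
    """exp(matrix): Taylor series of matrix / 2^s, then s repeated squarings."""
    size = len(matrix)
    scaled = [[v / 2 ** scale_power for v in row] for row in matrix]
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    term = [row[:] for row in result]
    for order in range(1, terms + 1):
        # term becomes scaled^order / order!
        term = [[v / order for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(size)]
                  for i in range(size)]
    for _ in range(scale_power):
        result = matmul(result, result)
    return result

def moran_transition(n, integrated_rate):
    """Equation (19); integrated_rate is the integral of alpha(t) over [0, tau]."""
    q = moran_rate_matrix(n)
    return expm([[v * integrated_rate for v in row] for row in q])

transition = moran_transition(4, 0.7)
```

Each row of `transition` should sum to one, the states 0 and $n_\upsilon$ are absorbing, and the expected derived count is preserved because the up and down rates are equal. Each matrix product costs $O(n_\upsilon^3)$, in line with the cost quoted above.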

The time complexity can be further improved with techniques applied by Bryant et al. (2012), but in practice we found that the decreased running time was offset by other problems such as numerical instability or large hidden time costs. In particular, let $\ell_t^\upsilon(k) = \mathbb{P}(x_\upsilon \mid \mu_t^\upsilon = k)$, and $\tilde{\ell}_t^\upsilon(k) = \binom{n_\upsilon}{k}\ell_t^\upsilon(k)$. Then (17) can be written as a convolution

$$\tilde{\ell}_0^\upsilon = \tilde{\ell}_{\tau_{\upsilon_1}}^{\upsilon_1} * \tilde{\ell}_{\tau_{\upsilon_2}}^{\upsilon_2}, \tag{20}$$

which can be computed via the FFT (Cooley and Tukey, 1965), reducing the complexity of (17) from $O(n_\upsilon^2 L)$ to $O(n_\upsilon \log(n_\upsilon) L)$. Another potential speedup is to rewrite (18) as

$$\ell_{\tau_\upsilon}^\upsilon = e^{Q^{(n_\upsilon)} \int_0^{\tau_\upsilon} \alpha_\upsilon(t)\,dt}\, \ell_0^\upsilon \tag{21}$$

and utilize the sparsity of $Q^{(n_\upsilon)}$ (Al-Mohy and Higham, 2011). In particular, (21) can be computed by $\mathcal{T}$ sparse matrix-vector products, where $\mathcal{T}$ depends on $Q^{(n_\upsilon)}$ and the desired level of precision. This reduces the cost of (18) and (19) from $O(n_\upsilon^3 + n_\upsilon^2 L)$ to $O(n_\upsilon \mathcal{T} L)$.

These two speedups were applied by Bryant et al. (2012) to reduce the complexity of their coalescent-based approach from $O(n_\upsilon^6 + n_\upsilon^4 L)$ to $O(n_\upsilon^2 L(\log(n_\upsilon) + \mathcal{T}))$. When applied in our Moran-based approach, they reduce the complexity of (17), (18), and (19) from $O(n_\upsilon^3 + n_\upsilon^2 L)$ to $O(n_\upsilon L(\log(n_\upsilon) + \mathcal{T}))$. In practice, however, we found $\mathcal{T}$ to be quite large, and it was faster to use the naive approach to compute and multiply $e^{Q^{(n_\upsilon)} \int_0^{\tau_\upsilon} \alpha_\upsilon(t)\,dt}$. Furthermore, computing (20) via the FFT can be very numerically unstable. Taking the Fourier transform introduces cancellation errors, due to multiplying and adding terms like $e^{ix}$, and we found that converting from $\tilde{\ell}_0^\upsilon$ back to $\ell_0^\upsilon$ can cause these errors to blow up, due to the combinatorial factors.
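The equivalence between (17) and the convolution (20) can be checked with a short sketch. The code below is illustrative only: the function names and the input vectors are made up, and the convolution is written as a plain double loop rather than an FFT. The final division by $\binom{n_\upsilon}{k}$ is the step where, as noted above, errors in the convolved values can be amplified.

```python
from math import comb

def merge_direct(top1, top2):
    """Equation (17), summed term by term."""
    n1, n2 = len(top1) - 1, len(top2) - 1
    out = []
    for k in range(n1 + n2 + 1):
        s = 0.0
        for k1 in range(max(0, k - n2), min(n1, k) + 1):
            s += comb(n1, k1) * comb(n2, k - k1) * top1[k1] * top2[k - k1]
        out.append(s / comb(n1 + n2, k))
    return out

def merge_by_convolution(top1, top2):
    """Equation (20): convolve binomially weighted vectors, then unweight."""
    n1, n2 = len(top1) - 1, len(top2) - 1
    w1 = [comb(n1, k) * v for k, v in enumerate(top1)]
    w2 = [comb(n2, k) * v for k, v in enumerate(top2)]
    conv = [0.0] * (n1 + n2 + 1)
    for a, va in enumerate(w1):
        for b, vb in enumerate(w2):
            conv[a + b] += va * vb
    # Dividing by the binomial coefficient recovers the unweighted likelihoods.
    return [conv[k] / comb(n1 + n2, k) for k in range(n1 + n2 + 1)]

# Arbitrary illustrative likelihood vectors for children with 3 and 2 lineages.
top1 = [0.1, 0.4, 0.3, 0.2]
top2 = [0.5, 0.25, 0.25]
direct = merge_direct(top1, top2)
via_conv = merge_by_convolution(top1, top2)
```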

5 Runtime and accuracy results

5.1 Comparison with Chen (2012)

We implemented our formulas and algorithm in Python, using the Python packages numpy and scipy. We also implemented the formulas from Chen (2012), and compared the performance of the two algorithms on simulated data.

We simulated datasets with n ∈ {2, 4, 8, …, 256} lineages and 𝒟 ∈ {2, 4, 8, …, n} populations at present, each containing n/𝒟 lineages. For each value of n, 𝒟, we used the program scrm (Staab et al., 2015) to generate 20 random datasets, each with a demographic history that is a random binary tree.

In Figure 4, we compare the running time of the original algorithm of Chen (2012) against our new algorithm that utilizes the formulas for $f_n^\tau(k)$ presented in Section 3 and our new Moran-based approach described in Section 4.2. We find our algorithm to be orders of magnitude faster; the difference is especially pronounced as the number n of lineages grows. Note that, due to the increased running time of Chen’s algorithm, we did not finish running his method for n = 256 and 𝒟 ≥ 32.

Figure 4.


Average computation time of the joint SFS. For each combination of the sample size n and the number 𝒟 of populations (with n/𝒟 samples per population), we generated 20 random datasets, each under a demographic history that is a random binary tree. The expected joint SFS for the resulting segregating sites were then computed using our method (momi) and that of Chen (2012). In the top row, we plot the average runtime per joint SFS entry, and in the bottom row, the average amount of time needed to precompute the truncated SFS for every subpopulation within each demographic history. (a) Runtime results plotted separately for each method in a linear scale. Note the y-axis is on a different scale for each row. (b) Runtime results with the axes on a log-log scale, so that shorter runtimes are visible.

In Figure 5, we compare the accuracy of the two algorithms. The figure compares the SFS entries returned by the two methods across a subset of the simulations depicted in Figure 4. The line y = x is also plotted; points falling on the line depict the SFS entries where both methods agreed. All negative return values represent numerical errors. For n ≤ 64 the two methods generally agree, but for larger n Chen’s algorithm displays considerable numerical instability, returning extremely large positive and negative numbers.

Figure 5.


Numerical stability of the two algorithms. The plot compares the numerical values returned by our method (momi) and Chen’s method, for the simulations described in Figure 4. The dashed red line represents the identity y = x. To adequately illustrate the full numerical range (both positive and negative) of values encountered in the simulations, we applied the transformation z ↦ sign(z) log(1 + |z|) to the values of each method in order to produce the scatter plot. The two methods agree for n ≤ 64, while Chen’s method is extremely unstable for larger n.
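The plotting transformation z ↦ sign(z) log(1 + |z|) mentioned in the caption can be written in a few lines. The helper name `signed_log` is hypothetical and is only meant to show how the map compresses large magnitudes while keeping the sign.

```python
import math

def signed_log(z):
    """Map z to sign(z) * log(1 + |z|); odd, monotone, and zero at zero."""
    return math.copysign(math.log1p(abs(z)), z)

# Large positive and negative values land on a comparable scale.
examples = [signed_log(v) for v in (-1e12, -1.0, 0.0, 1.0, 1e12)]
```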

5.2 Comparison with ∂ai

We also compared our method with the popular program ∂ai (Gutenkunst et al., 2009). We note that our method has several key differences from ∂ai, and the two methods have strengths in distinct use cases. ∂ai computes the joint SFS by numerically integrating a PDE. This PDE can be easily modified to include effects such as natural selection and continuous migration, which gives ∂ai more flexibility than the coalescent or Moran-based approaches discussed in this paper. However, numerical integration of multidimensional PDEs is a challenging problem; thus ∂ai can handle only a small number of populations (up to 3), and may occasionally encounter numerical instability.

We compared our method (momi) with ∂ai on a modified version of the three-population out-of-Africa demography inferred by Gutenkunst et al. (2009). In this history, the Eurasian population initially splits off from the African population, and then splits into separate European and Asian populations. The populations experience several piecewise constant size changes, and the European and Asian populations experience exponential growth in the recent past. The original, unmodified history also contains continuous migration between the separated populations; since our implementation does not currently support migration, we modified the demographic history to have no migration. (Support for migration will be added to our method in the future.)

We consider sample sizes of n = 16, 32, 64, 128 per population; for ∂ai, we consider discretization grid sizes of G = 16, 32, …, 512, 1024 points per population, with G ≥ n. We compare the values returned by momi and ∂ai in Figure 6. momi and ∂ai mostly agree, but some entries of ∂ai are off by a factor of 10 or 100, especially when G/n is small. In general the value computed by ∂ai converges to that of momi as G increases. However, for G = 1024 and n = 64, 128, ∂ai appears to have some numerical instability. In particular, ∂ai diverges in some of the smaller entries, and returns some negative numbers.

Figure 6.


Comparison of SFS values computed by ∂ai and momi, varying the sample size n per population and the number G of grid points per population in ∂ai. We used the 3-population out-of-Africa demography inferred in Gutenkunst et al. (2009), but modified to have no gene flow (migration). The x-axis is $f_{\text{momi}}(x)$, the value computed by momi; the y-axis is the absolute value of the ratio $|f_{\partial a \partial i}(x) / f_{\text{momi}}(x)|$; and the color gives the sign of $f_{\partial a \partial i}(x)$ (all values returned by momi were positive). $f_{\partial a \partial i}(x)$ generally converges to $f_{\text{momi}}(x)$ as G increases, but for large values of n and G, ∂ai appears to diverge, and returns some negative values.

We compare the runtime of momi and ∂ai in Figure 7, which shows the time to compute all entries of the SFS. The two methods are roughly comparable, depending on the number G of grid points in ∂ai. We note that ∂ai computes all entries of the SFS together, whereas momi and the method of Chen (2012, 2013) can easily compute a subset of the SFS. This is a key advantage for the latter methods when the number of populations 𝒟 is large, as the size of the SFS grows exponentially with 𝒟.

Figure 7.


Runtime of momi and ∂ai to compute the results in Figure 6.

6 Proofs

In this section, we provide proofs of the mathematical results presented in earlier sections.

6.1 A recursion for efficiently computing $\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m)$

We describe how to compute $\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m)$, for all values of $m \le \nu \le n$, in $O(n^2)$ time. First, note that

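$$\begin{aligned} \mathbb{P}_{\nu-1}(A_\tau^{\mathcal{C}} = m) &= \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m+1,\ \{\nu\} \in \mathcal{C}_\tau) + \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m,\ \{\nu\} \notin \mathcal{C}_\tau) \\ &= (m+1)\frac{p_{\nu,m+1}^{1,1}}{\binom{\nu}{1}}\, \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m+1) + \left(1 - m\frac{p_{\nu,m}^{1,1}}{\binom{\nu}{1}}\right) \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m) \\ &= \frac{(m+1)m}{\nu(\nu-1)}\, \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m+1) + \left(1 - \frac{m(m-1)}{\nu(\nu-1)}\right) \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m). \end{aligned}$$

Rearranging, we get the recursion

$$\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m) = \frac{1}{1 - \frac{m(m-1)}{\nu(\nu-1)}} \left[ \mathbb{P}_{\nu-1}(A_\tau^{\mathcal{C}} = m) - \frac{(m+1)m}{\nu(\nu-1)}\, \mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m+1) \right] \tag{22}$$

with base cases

$$\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = \nu) = e^{-\binom{\nu}{2} \int_0^\tau \alpha(t)\,dt}.$$

So after evaluating $\int_0^\tau \alpha(t)\,dt$, we can use the recursion and memoization to solve for all of the $O(n^2)$ terms $\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = m)$ in $O(n^2)$ time. In particular, in the case of constant population size, $\alpha(t) = \alpha$, the base case is given by

$$\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = \nu) = e^{-\binom{\nu}{2} \alpha \tau},$$

and in the case of an exponentially growing population size, $\alpha(t) = \alpha(\tau) e^{\beta(\tau - t)}$, the base case is given by

$$\mathbb{P}_\nu(A_\tau^{\mathcal{C}} = \nu) = e^{-\binom{\nu}{2} \alpha(\tau) \left( \frac{e^{\beta\tau} - 1}{\beta} \right)}.$$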
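The recursion (22) and its base case translate directly into a small table-filling routine. The sketch below is a plain-Python illustration with hypothetical names; it takes the integrated rate $\int_0^\tau \alpha(t)\,dt$ as a single number, so the constant-size and exponential-growth cases differ only in how that number is obtained.

```python
from math import comb, exp

def ancestor_count_probs(n, integrated_rate):
    """Return p with p[v][m] = P_v(A_tau = m) for 1 <= m <= v <= n.

    integrated_rate is the integral of alpha(t) over [0, tau].
    """
    p = [[0.0] * (n + 1) for _ in range(n + 1)]
    for v in range(1, n + 1):
        # Base case: no coalescence among v lineages before tau.
        p[v][v] = exp(-comb(v, 2) * integrated_rate)
        # Recursion (22), filled from m = v - 1 down to m = 1.
        for m in range(v - 1, 0, -1):
            keep = 1.0 - m * (m - 1) / (v * (v - 1))
            drop = (m + 1) * m / (v * (v - 1))
            p[v][m] = (p[v - 1][m] - drop * p[v][m + 1]) / keep
    return p

# Constant size with alpha * tau = 0.5 (an arbitrary illustrative value).
table = ancestor_count_probs(8, 0.5)
```

Each entry is computed once from two previously stored entries, which gives the $O(n^2)$ running time. Because the update subtracts two terms, this direct form is only meant for modest $n$.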

6.2 Proof of Lemma 1

Let $T_{\mathrm{MRCA}}$ denote the time to the most recent common ancestor of the sample. We first note that

$$f_n^\tau(n) = \tau - \mathbb{E}_n[T_{\mathrm{MRCA}} \wedge \tau],$$

since the branch length subtending the whole sample is the time between $\tau$ and $T_{\mathrm{MRCA}}$.

Next, note that $\frac{\theta}{2}\mathbb{E}_n[T_{\mathrm{MRCA}} \wedge \tau]$ is equal to the expected number of polymorphic mutations in $[0, \tau)$ where the individual “1” is derived. This is because, as we trace the ancestry of “1” backwards in time, all mutations hitting the lineage below $T_{\mathrm{MRCA}}$ are polymorphic, while all mutations hitting above $T_{\mathrm{MRCA}}$ are monomorphic.

The expected number of polymorphic mutations with “1” derived is also equal to $\frac{\theta}{2} \sum_{k=1}^{n-1} \frac{k}{n} f_n^\tau(k)$, since if a mutation has $k$ derived leaves, the chance that “1” is in the derived set is $\frac{k}{n}$. Thus

$$\mathbb{E}_n[T_{\mathrm{MRCA}} \wedge \tau] = \sum_{k=1}^{n-1} \frac{k}{n} f_n^\tau(k),$$

which completes the proof.

6.3 Proof of Lemma 2

We first note that

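$$\mathbb{P}_n(\mathcal{M}_\tau = \{1, \ldots, k\}) = \mathbb{P}_{n+1}(\mathcal{M}_\tau = \{1, \ldots, k\}) + \mathbb{P}_{n+1}(\mathcal{M}_\tau = \{1, \ldots, k, n+1\}).$$

By exchangeability, we have $\mathbb{P}_n(\mathcal{M}_\tau = K) = \frac{\theta}{2} \frac{f_n^\tau(|K|)}{\binom{n}{|K|}} + o(\theta)$ for all $K \subseteq \{1, \ldots, n\}$, so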

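$$\frac{1}{\binom{n}{k}} f_n^\tau(k) = \frac{1}{\binom{n+1}{k}} f_{n+1}^\tau(k) + \frac{1}{\binom{n+1}{k+1}} f_{n+1}^\tau(k+1).$$

Multiplying both sides by $\binom{n}{k}$ gives

$$f_n^\tau(k) = \frac{n-k+1}{n+1} f_{n+1}^\tau(k) + \frac{k+1}{n+1} f_{n+1}^\tau(k+1).$$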
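The recursion of Lemma 2 can be exercised numerically with exact rational arithmetic. The sketch below uses a hypothetical helper `downsample_sfs`. As a sanity check it relies on the classical constant-size result that, without truncation and in units where α = 1, the expected branch length subtending k of n leaves is 2/k for k < n; that result is brought in here only to have values to test against.

```python
from fractions import Fraction

def downsample_sfs(f_next, n):
    """Given f_next[k] = f_{n+1}(k) for k = 1..n, return f_n(k) for k = 1..n-1."""
    return {
        k: Fraction(n - k + 1, n + 1) * f_next[k]
           + Fraction(k + 1, n + 1) * f_next[k + 1]
        for k in range(1, n)
    }

# Untruncated constant-size SFS for a sample of size n + 1 = 10.
n = 9
f_next = {k: Fraction(2, k) for k in range(1, n + 1)}
f_n = downsample_sfs(f_next, n)
```

Applying the recursion to the size-10 spectrum returns 2/k again for k = 1, …, 8, as expected for a sample of size 9.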

6.4 Proof of Lemma 3

Let $\alpha^*(t)$ denote the inverse population size history given by

$$\alpha^*(t) = \begin{cases} \alpha(t), & \text{if } t < \tau, \\ \infty, & \text{if } t \ge \tau. \end{cases}$$

So the demographic history with population size $\frac{1}{\alpha^*(t)}$ agrees with the original history up to time $\tau$, at which point the population size drops to 0, and all lineages instantly coalesce into a single lineage with probability 1.

Let $T_{m,*}$ denote the amount of time there are $m$ ancestral lineages for the coalescent with size history $\frac{1}{\alpha^*(t)}$. Similarly, let $f_{n,*}(k)$ denote the SFS under the size history $\frac{1}{\alpha^*(t)}$. Then from the result of Polanski and Kimmel (2003),

$$f_{n,*}(k) = \sum_{m=2}^{n} W_{n,k,m}\, \mathbb{E}_m[T_{m,*}].$$

Note that for $m > 1$, we almost surely have $T_{m,*} = T_{m,*}^\tau$, i.e. the intercoalescence time equals its truncated version, since all lineages coalesce instantly at $\tau$ with probability 1. Thus, $\mathbb{E}_m[T_{m,*}] = \mathbb{E}_m[T_{m,*}^\tau]$. Similarly, for $k < n$, $f_{n,*}(k) = f_{n,*}^\tau(k)$, i.e. the SFS equals the truncated SFS, because the probability of a polymorphic mutation occurring in $[\tau, \infty)$ is 0.

Finally, note that $\mathbb{E}_m[T_{m,*}^\tau] = \mathbb{E}_m[T_m^\tau]$ and $f_{n,*}^\tau(k) = f_n^\tau(k)$, because $\alpha(t)$ and $\alpha^*(t)$ are identical on $[0, \tau)$.

6.5 Proof of Proposition 1

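We start by showing that $\mathbb{P}_n(A_\tau^{\mathcal{K}} = m) = \mathbb{P}_n(A_\tau^{\mathcal{C}} = m) + O(\theta)$. Let $T_i^\tau(\mathcal{K}) = \int_0^\tau \mathbb{I}_{A_t^{\mathcal{K}} = i}\,dt$ denote the amount of time where $\mathcal{K}$ has $i$ unkilled lineages. Let $p$ denote the probability density function. For $(t_n, \ldots, t_m)$ with $\sum t_i = \tau$, we have

$$\begin{aligned} p(T_n^\tau(\mathcal{K}) = t_n, \ldots, T_m^\tau(\mathcal{K}) = t_m) &= e^{-\lambda_{m,m-1}^{\mathcal{K}} t_m} \prod_{i=m+1}^{n} \lambda_{i,i-1}^{\mathcal{K}}\, e^{-\lambda_{i,i-1}^{\mathcal{K}} t_i} \\ &= e^{-\left(\binom{m}{2}\alpha + \frac{m\theta}{2}\right) t_m} \prod_{i=m+1}^{n} \left(\binom{i}{2}\alpha + \frac{i\theta}{2}\right) e^{-\left(\binom{i}{2}\alpha + \frac{i\theta}{2}\right) t_i} \\ &= e^{-\binom{m}{2}\alpha t_m} \prod_{i=m+1}^{n} \binom{i}{2}\alpha\, e^{-\binom{i}{2}\alpha t_i} + O(\theta) \\ &= p(T_n^\tau = t_n, \ldots, T_m^\tau = t_m) + O(\theta), \end{aligned}$$

and so

$$\begin{aligned} \lim_{\theta \to 0} \mathbb{P}_n(A_\tau^{\mathcal{K}} = m) &= \lim_{\theta \to 0} \int_{\sum t_i = \tau} p(T_n^\tau(\mathcal{K}) = t_n, \ldots, T_m^\tau(\mathcal{K}) = t_m)\,dt \\ &= \int_{\sum t_i = \tau} p(T_n^\tau = t_n, \ldots, T_m^\tau = t_m)\,dt = \mathbb{P}_n(A_\tau^{\mathcal{C}} = m), \end{aligned}$$

where we can exchange the limit and the integral by the Bounded Convergence Theorem, because $p(T_n^\tau(\mathcal{K}) = t_n, \ldots, T_m^\tau(\mathcal{K}) = t_m) \le \prod_{i=m+1}^{n} \left(\binom{i}{2}\alpha + \frac{i}{2}\right)$.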

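Thus we have

$$\begin{aligned} \mathbb{P}_n(|\mathcal{M}_\tau| = k,\ A_\tau^{\mathcal{K}} = m) &= \mathbb{P}_n(|\mathcal{M}_\tau| = k \mid A_\tau^{\mathcal{K}} = m)\, \mathbb{P}_n(A_\tau^{\mathcal{K}} = m) \\ &= \left( \frac{\theta}{2} f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m) + o(\theta) \right) \left( \mathbb{P}_n(A_\tau^{\mathcal{C}} = m) + O(\theta) \right) \\ &= \frac{\theta}{2} f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m)\, \mathbb{P}_n(A_\tau^{\mathcal{C}} = m) + o(\theta), \end{aligned}$$

which proves the first part of the proposition.

We next solve for $f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m)$, the first order Taylor series coefficient for $\mathbb{P}_n(|\mathcal{M}_\tau| = k \mid A_\tau^{\mathcal{K}} = m)$ in the mutation rate $\frac{\theta}{2}$.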

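When there are $i$ unkilled lineages, the probability that the next event is a killing event is $\frac{\theta}{\alpha(i-1) + \theta} = \frac{\theta}{\alpha(i-1)} + O(\theta^2)$. Given that the event is a killing, the chance that the killed lineage has $k$ leaf descendants is $p_{n,i}^{k,1}$. So summing over $i$, and dividing out the mutation rate $\frac{\theta}{2}$, we get

$$\begin{aligned} f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m) &= \frac{2}{\alpha} \sum_{i=m+1}^{n-k+1} \frac{1}{i-1}\, p_{n,i}^{k,1} \\ &= \frac{2}{\alpha} \sum_{i=m+1}^{n-k+1} \frac{1}{i-1}\, \frac{\binom{n-k-1}{i-2}}{\binom{n-1}{i-1}} \\ &= \frac{2}{\alpha} \sum_{i=m+1}^{n-k+1} \frac{1}{i-1}\, \frac{(n-k-1)!\,(i-1)!\,(n-i)!}{(i-2)!\,(n-k-i+1)!\,(n-1)!} \\ &= \frac{2(n-k-1)!}{\alpha (n-1)!} \sum_{i=m+1}^{n-k+1} \frac{(n-i)!}{(n-k-i+1)!} \\ &= \frac{2(n-k-1)!}{\alpha (n-1)!} \sum_{j=0}^{n-k-m} \frac{(j+k-1)!}{j!} \\ &= \frac{2}{\alpha k \binom{n-1}{k}} \sum_{j=0}^{n-k-m} \binom{j+k-1}{j} \\ &= \frac{2}{\alpha k}\, \frac{\binom{n-m}{k}}{\binom{n-1}{k}}, \end{aligned}$$

where we made the change of variables $j = n - k - i + 1$, and where the final line follows from repeated application of the combinatorial identity $\binom{a}{b} = \binom{a-1}{b} + \binom{a-1}{b-1}$.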
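The chain of equalities above can be spot-checked by comparing its first and last lines for small $n$. The following sketch uses exact fractions; the function names are hypothetical, and $p_{n,i}^{k,1}$ is taken to be the ratio of binomial coefficients appearing in the second line of the derivation.

```python
from fractions import Fraction
from math import comb

def conditional_sfs_sum(n, k, m, alpha=Fraction(1)):
    """First line of the derivation: (2/alpha) * sum_i p_{n,i}^{k,1} / (i - 1)."""
    total = Fraction(0)
    for i in range(m + 1, n - k + 2):
        p = Fraction(comb(n - k - 1, i - 2), comb(n - 1, i - 1))
        total += Fraction(1, i - 1) * p
    return 2 / alpha * total

def conditional_sfs_closed(n, k, m, alpha=Fraction(1)):
    """Last line: 2 / (alpha * k) * C(n - m, k) / C(n - 1, k)."""
    return 2 / alpha / k * Fraction(comb(n - m, k), comb(n - 1, k))

# Collect every (n, k, m) with 1 <= k, m <= n - 1 where the two forms disagree.
mismatches = [
    (n, k, m)
    for n in range(2, 10)
    for k in range(1, n)
    for m in range(1, n)
    if conditional_sfs_sum(n, k, m) != conditional_sfs_closed(n, k, m)
]
```

For every triple in this range the two expressions agree exactly, including the cases $k > n - m$ where both are zero.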

6.5.1 Alternative proof for $f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m)$ via the Chinese Restaurant Process

We sketch an alternative proof of the expression for $f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m)$, using the Chinese Restaurant Process.

Consider the coalescent with killing going forward in time (towards the present), and only looking at it when the number of individuals increases. Then when there are $i$ lineages, a new mutation occurs with probability $\frac{\theta}{\alpha i + \theta} = \frac{\theta/\alpha}{i + \theta/\alpha}$, and each lineage branches with probability $\frac{\alpha}{\alpha i + \theta} = \frac{1}{i + \theta/\alpha}$. Thus, conditional on $A_\tau^{\mathcal{K}} = m$, the distribution on $\mathcal{K}_\tau$ is given by a Chinese Restaurant Process (Aldous, 1985), starting with $m$ tables each with 1 person, and with new tables founded with parameter $\theta/\alpha$.

Let $(x)_i = x(x+1) \cdots (x+i-1)$ denote the rising factorial. If there is a single mutation with $k$ descendants, then there are $\binom{n-m}{k}$ ways to pick which of the $n - m$ events involve mutant lineages. The probability of a particular such ordering is

$$\frac{\theta}{\alpha}\, \frac{(1)_{k-1}\, (m)_{n-k-m}}{(m + \theta/\alpha)_{n-m}} = \frac{\theta}{\alpha}\, \frac{(k-1)!\,(n-k-1)!/(m-1)!}{(n-1)!/(m-1)!} + o(\theta).$$

Summing over all $\binom{n-m}{k}$ orderings, and dividing by $\frac{\theta}{2}$, yields

$$f_n^\tau(k \mid A_\tau^{\mathcal{K}} = m) = \frac{2}{\alpha} \binom{n-m}{k} \frac{(k-1)!\,(n-k-1)!/(m-1)!}{(n-1)!/(m-1)!}.$$

Acknowledgments

This research is supported in part by NIH grants R01-GM109454 and R01-GM108805, a Packard Fellowship for Science and Engineering, a Miller Research Professorship, and a Citadel Graduate Fellowship.

Contributor Information

John A. Kamm, Department of Statistics, University of California, Berkeley

Jonathan Terhorst, Department of Statistics, University of California, Berkeley.

Yun S. Song, Departments of EECS, Statistics, and Integrative Biology, University of California, Berkeley

References

  1. Al-Mohy AH, Higham NJ. Computing the action of the matrix exponential, with an application to exponential integrators. SIAM Journal on Scientific Computing. 2011;33(2):488–511.
  2. Aldous DJ. Exchangeability and related topics. In: Hennequin P, editor. École d' Été de Probabilités de Saint-Flour XIII — 1983, volume 1117 of Lecture Notes in Mathematics. Berlin Heidelberg: Springer; 1985. pp. 1–198.
  3. Beaumont MA, Nichols RA. Evaluating loci for use in the genetic analysis of population structure. Proceedings of the Royal Society of London. Series B: Biological Sciences. 1996;263(1377):1619–1626.
  4. Bhaskar A, Kamm JA, Song YS. Approximate sampling formulae for general finite-alleles models of mutation. Advances in Applied Probability. 2012;44:408–428. doi: 10.1239/aap/1339878718. (PMC3953561)
  5. Bhaskar A, Song YS. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Annals of Statistics. 2014;42(6):2469–2493. doi: 10.1214/14-AOS1264.
  6. Bhaskar A, Wang YXR, Song YS. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Research. 2015;25(2):268–279. doi: 10.1101/gr.178756.114.
  7. Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genetics. 2008;4(5):e1000083. doi: 10.1371/journal.pgen.1000083.
  8. Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution. 2012;29(8):1917–1932. doi: 10.1093/molbev/mss086.
  9. Chen H. The joint allele frequency spectrum of multiple populations: A coalescent theory approach. Theoretical Population Biology. 2012;81(2):179–195. doi: 10.1016/j.tpb.2011.11.004.
  10. Chen H. Intercoalescence time distribution of incomplete gene genealogies in temporally varying populations, and applications in population genetic inference. Annals of Human Genetics. 2013;77(2):158–173. doi: 10.1111/ahg.12007.
  11. Cooley JW, Tukey JW. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation. 1965;19(90):297–301.
  12. Coventry A, Bull-Otterson LM, Liu X, Clark AG, Maxwell TJ, Crosby J, Hixson JE, Rea TJ, Muzny DM, Lewis LR, et al. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nature Communications. 2010;1:131. doi: 10.1038/ncomms1130.
  13. De Maio N, Schlötterer C, Kosiol C. Linking great apes genome evolution across time scales using polymorphism-aware phylogenetic models. Molecular Biology and Evolution. 2013;30(10):2249–2262. doi: 10.1093/molbev/mst131.
  14. Durrett R. Probability Models for DNA Sequence Evolution. 2nd ed. New York: Springer; 2008.
  15. Ewens WJ. Mathematical Population Genetics: I. Theoretical Introduction. New York: Springer Science+Business Media, Inc.; 2004.
  16. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genetics. 2013;9(10):e1003905. doi: 10.1371/journal.pgen.1003905.
  17. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981;17:368–376. doi: 10.1007/BF01734359.
  18. Gazave E, Ma L, Chang D, Coventry A, Gao F, Muzny D, Boerwinkle E, Gibbs RA, Sing CF, Clark AG, et al. Neutral genomic regions refine models of recent rapid human population growth. Proceedings of the National Academy of Sciences. 2014;111(2):757–762. doi: 10.1073/pnas.1310398110.
  19. Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, Yu F, Gibbs RA, Bustamante CD, Altshuler DL, et al. Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences. 2011;108(29):11983–11988. doi: 10.1073/pnas.1019276108.
  20. Griffiths R, Tavaré S. The age of a mutation in a general coalescent tree. Communications in Statistics. Stochastic Models. 1998;14(1–2):273–295.
  21. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics. 2009;5(10):e1000695. doi: 10.1371/journal.pgen.1000695.
  22. Higham NJ. Accuracy and Stability of Numerical Algorithms. 2nd ed. SIAM: Society for Industrial and Applied Mathematics; 2002.
  23. Hoppe F. Pólya-like urns and the Ewens’ sampling formula. J. Math. Biol. 1984;20:91–94.
  24. Jenkins PA, Mueller JW, Song YS. General triallelic frequency spectrum under demographic models with variable population size. Genetics. 2014;196(1):295–311. doi: 10.1534/genetics.113.158584. (PMC3872192)
  25. Jenkins PA, Song YS. The effect of recurrent mutation on the frequency spectrum of a segregating site and the age of an allele. Theor. Popul. Biol. 2011;80(2):158–173. doi: 10.1016/j.tpb.2011.04.001. (PMC3143209)
  26. Johnson NL, Kotz S. Urn Models and Their Application: An Approach to Modern Discrete Probability Theory. New York: Wiley; 1977.
  27. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893. doi: 10.1093/genetics/61.4.893.
  28. Kingman JFC. The coalescent. Stoch. Process. Appl. 1982a;13:235–248.
  29. Kingman JFC. Exchangeability and the evolution of large populations. In: Koch G, Spizzichino F, editors. Exchangeability in Probability and Statistics. North-Holland Publishing Company; 1982b. pp. 97–112.
  30. Kingman JFC. On the genealogy of large populations. J. Appl. Prob. 1982c;19A:27–43.
  31. Lauritzen SL, Spiegelhalter DJ. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological) 1988;50(2):157–224.
  32. Lukić S, Hey J. Demographic inference using spectral methods on SNP data, with an analysis of the human out-of-Africa expansion. Genetics. 2012;192(2):619–639. doi: 10.1534/genetics.112.141846.
  33. Moler C, Van Loan C. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review. 2003;45(1):3–49.
  34. Moran P. Random processes in genetics. Mathematical Proceedings of the Cambridge Philosophical Society. 1958;54:60–71.
  35. Myers S, Fefferman C, Patterson N. Can one learn history from the allelic spectrum? Theoretical Population Biology. 2008;73(3):342–348. doi: 10.1016/j.tpb.2008.01.001.
  36. Nelson MR, Wegmann D, Ehm MG, Kessner D, Jean PS, Verzilli C, Shen J, Tang Z, Bacanu S-A, Fraser D, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337(6090):100–104. doi: 10.1126/science.1217876.
  37. Nielsen R. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics. 2000;154(2):931–942. doi: 10.1093/genetics/154.2.931.
  38. Pearl J. Reverend Bayes on inference engines: a distributed hierarchical approach; Proceedings of the National Conference on Artificial Intelligence; 1982. pp. 133–136.
  39. Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165(1):427–436. doi: 10.1093/genetics/165.1.427.
  40. Schaffner SF, Foo C, Gabriel S, Reich D, Daly WJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583. doi: 10.1101/gr.3709305.
  41. Staab PR, Zhu S, Metzler D, Lunter G. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics. 2015;31(10):1680–1682. doi: 10.1093/bioinformatics/btu861.
  42. Tavaré S. Line-of-descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 1984;26:119–164. doi: 10.1016/0040-5809(84)90027-3.
  43. Wakeley J, Hey J. Estimating ancestral population parameters. Genetics. 1997;145(3):847–855. doi: 10.1093/genetics/145.3.847.
