Skip to main content
Genetics logoLink to Genetics
. 2017 Nov 10;208(1):323–338. doi: 10.1534/genetics.117.300499

Coalescent Processes with Skewed Offspring Distributions and Nonequilibrium Demography

Sebastian Matuszewski *,†,‡,1, Marcel E Hildebrandt *,‡,1, Guillaume Achaz §,**, Jeffrey D Jensen *,†,††,2
PMCID: PMC5753866  PMID: 29127263

Abstract

Nonequilibrium demography impacts coalescent genealogies leaving detectable, well-studied signatures of variation. However, similar genomic footprints are also expected under models of large reproductive skew, posing a serious problem when trying to make inference. Furthermore, current approaches consider only one of the two processes at a time, neglecting any genomic signal that could arise from their simultaneous effects, preventing the possibility of jointly inferring parameters relating to both offspring distribution and population history. Here, we develop an extended Moran model with exponential population growth, and demonstrate that the underlying ancestral process converges to a time-inhomogeneous psi-coalescent. However, by applying a nonlinear change of time scale—analogous to the Kingman coalescent—we find that the ancestral process can be rescaled to its time-homogeneous analog, allowing the process to be simulated quickly and efficiently. Furthermore, we derive analytical expressions for the expected site-frequency spectrum under the time-inhomogeneous psi-coalescent, and develop an approximate-likelihood framework for the joint estimation of the coalescent and growth parameters. By means of extensive simulation, we demonstrate that both can be estimated accurately from whole-genome data. In addition, not accounting for demography can lead to serious biases in the inferred coalescent model, with broad implications for genomic studies ranging from ecology to conservation biology. Finally, we use our method to analyze sequence data from Japanese sardine populations, and find evidence of high variation in individual reproductive success, but few signs of a recent demographic expansion.

Keywords: coalescent theory, multiple mergers, population growth, maximum likelihood, site-frequency spectrum


THE origins of the coalescent in the early 1970s mark a milestone for evolutionary theory (Kingman 2000). More than 45 years after Kingman formally proved the existence of the “n-coalescent” (Kingman 1982a,b,c), the so-called Kingman-n-coalescent has gradually become the key theoretical tool to study the complex interplay of mutation, genetic drift, gene flow, and selection. Closely linked to its underlying forward-in-time population model, e.g., the Wright-Fisher (WF; Fisher 1930; Wright 1931) and the Moran model (Moran 1958, 1962), the Kingman coalescent has been used to derive expected levels of neutral variation, including the number of segregating sites, the average number of pairwise differences, and the expectation of the allele frequencies in a population sample (i.e., the site-frequency spectrum; SFS). In fact, these predictions apply not only to the WF and Moran model, but extend to a large class of Cannings exchangeable population models (Cannings 1974) that all converge to the Kingman coalescent in the ancestral limit (Möhle and Sagitov 2001). Furthermore, the Kingman coalescent forms the basis for many population genetic statistics—such as Tajima’s D (Tajima 1989), Fay and Wu’s H (Fay and Wu 2000), or, more generally, any SFS-based test statistic (Achaz 2009; Ferretti et al. 2010)—and subsequent inferences (Irwin et al. 2016) to detect deviations from the assumption of a neutrally evolving, constant-sized, panmictic population (Wakeley 2009).

While the Kingman coalescent has been shown to be robust to violations of its assumptions (Möhle 1998, 1999), such as constant population size, random mating, and nonoverlapping generations, and has been extended to accommodate selection, migration, and population structure (Neuhauser and Krone 1997; Nordborg 1997; Wilkinson-Herbots 1998), it breaks down in the presence of skewed offspring distributions (Eldon and Wakeley 2006), strong positive selection (Neher and Hallatschek 2013; Schweinsberg 2017), recurrent selective sweeps (Durrett and Schweinsberg 2004, 2005), and large sample sizes (Wakeley and Takahashi 2003; Bhaskar et al. 2014). In particular, all of these effects can cause more than two lineages to coalesce at a time, resulting in so-called multiple mergers. Hence, the underlying coalescent topology (i.e., the gene genealogy) is no longer represented by a bifurcating tree as in the “standard” Kingman case, but can take more complex tree shapes that can also feature several simultaneous mergers. Taking these points into account, a more general class of models, so-called multiple-merger coalescent (MMC) models, have been developed (e.g., Bolthausen and Sznitman 1998; Pitman 1999; Sagitov 1999; Schweinsberg 2000; Möhle and Sagitov 2001; reviewed in Tellier and Lemaire 2014), aiming to generalize the Kingman coalescent model (Wakeley 2013). As for the latter, these MMC models can often be derived from Moran models, generalized to allow multiple offspring per individual (Eldon and Wakeley 2006; Huillet and Möhle 2013; see also review of Irwin et al. 2016).

Starting from such an extended Moran model, Eldon and Wakeley (2006) proved that the underlying ancestral process converges to a psi-coalescent (sometimes also called Dirac coalescent; Eldon et al. 2015), and that population genetic parameters inferred from genetic data from Pacific oysters (Crassostrea gigas) under this model differ vastly from those inferred assuming the Kingman coalescent. Their study—being the first to link MMC models to actual biological questions, molecular data and population genetic inferences—highlighted that high variation in individual reproductive success drastically affect both genealogical history and subsequent analyses; this has been observed in many marine organisms such Atlantic cod (Gadus morhua) and Japanese sardines (Sardinops melanostictus), but should also occur more generally in any species with type III survivorship curves that undergo so-called sweepstake-reproductive events (Hedgecock 1994; Hedgecock and Pudovkin 2011). Fundamentally, the problem is that an excess of low-frequency alleles (i.e., singletons), a ubiquitous characteristic of many marine species (Niwa et al. 2016), could be explained by either models of recent population growth or skewed offspring distributions when analyzed under the Kingman coalescent, assuming neutrality, which can result in serious mis-inference (e.g., a vast overestimation of population growth).

In developing a SFS-based maximum likelihood framework, Eldon et al. (2015) demonstrated that multiple merger coalescents and population growth can be distinguished from their genomic footprints in the higher-frequency classes of the SFS with high statistical power (see also Spence et al. 2016). However, there is currently neither a modeling framework that considers the genomic signal arising from the joint action of both reproductive skew and population growth, nor is there any a priori reason to believe that the two could not act simultaneously.

Here, we develop an extension of the standard Moran model that accounts for both reproductive skewness and exponential population growth, and prove that its underlying ancestral process converges to a time-inhomogeneous psi-coalescent. By (nonlinearly) rescaling branch lengths this process can—analogous to the Kingman coalescent (Griffiths and Tavaré 1998)—be transformed into its time-homogeneous analog, allowing efficient large-scale simulations. Furthermore, we derive analytical formulae for the expected site-frequency spectrum under the time-inhomogeneous psi-coalescent and develop an approximate-likelihood framework for the joint estimation of the coalescent and growth parameters. We then perform extensive validation of our inference framework on simulated data, and show that both the coalescent parameter and the growth rate can be estimated accurately from whole-genome data. In addition, we demonstrate that, when demography is not accounted for, the inferred coalescent model can be seriously biased, with broad implications for genomic studies ranging from ecology to conservation biology (e.g., due to its effects on effective population size or diversity estimates). Finally, using our joint estimation method, we reanalyze mtDNA from Japanese sardine (Sardinops melanostictus) populations, and find evidence for considerable reproductive skew, but only limited support for a recent demographic expansion.

Methods

Here, we will first present an extended, discrete-time Moran model (Moran 1958, 1962; Eldon and Wakeley 2006) with exponential population growth that will serve as the forward-in-time population genetic model underlying the ancestral limit process. We will then give a brief overview of coalescent models, with special focus on the psi-coalescent (Eldon and Wakeley 2006), before revisiting SFS-based maximum likelihood methods to infer coalescent parameters and population growth rates.

An extended Moran model with exponential growth

We consider the idealized, discrete-time model with variable population size shown generally in Figure 1. Furthermore, let Nn be the deterministic and time-dependent population size n time steps in the past, where, by definition, N=N0 denotes the present population size. In particular, defining ν(n) as the exchangeable vector of family sizes—with components νi(n) indicating the number of descendants of the ith individual—the (variable) population size can be expressed as

Table 1. Summary of notation and definitions.

Notation Definition
UN Number of offspring of a reproductive event in an extended Moran model with population size N
ν Vector of family sizes
Λ Probability measure on [0,1]
λi,x Coalescent rate for x out of i active lineages
Gi,x Probability of an xmerger among i active lineages
cN(n) Coalescence probability
(An,kψ,ρ)nPk Ancestral process of the extended Moran model sweepstake parameter ψ (ψ=0 implying Kingman’s coalescent), and exponential population growth at rate ρ for a sample of size k defined on Pk, i.e., the collection of partitions of the set [k]={1,,k}.
(Πt,kψ,ρ)t0Pk ψcoalescent (ψ=0 implying Kingman’s coalescent) with exponential growth at rate ρ and sample of size k defined on Pk, i.e., the collection of partitions of the set [k]={1,,k}.
G() Time-change function
TMRCA(k) Time until the MRCA for a sample of size k
Ti Sum of the length of all branches with i descendants
Ttot Total branch length of the coalescent tree
η(k)=(η1(k),,ηk1(k)) SFS for a sample of size k
ϕ(k)=(ϕ1(k),,ϕk1(k)) Normalized expected SFS for a sample of size k
ck=(c2,2,,ck,k) Expected time to the first coalescence for a sample of size i{2,,k}
Nn1=i=1Nnνi(n)with(ν1(n),ν2(n),,νN(n))Nn. (1)

Furthermore, we assume that the reproductive mechanism follows that of an extended Moran model (Eldon and Wakeley 2006; Huillet and Möhle 2013). In particular, as in the original Moran model, at any given point in time n, only a single individual reproduces and leaves UN(n) offspring (including itself). Formally, the number of offspring can be written as a sequence of random variables (UN(n))n [where each UN(n) is supported on {0,1,,Nn1}], such that ν(n) – up to reordering – is given by

νi(n){0ifi<UN(n)UN(n)ifi=UN(n)1otherwise. (2)

However, since population size varies over time, the sequence (UN(n))n is generally not identically distributed. On a technical note though, we require that the (Un) are independently distributed, which ensures that the corresponding backward process satisfies the Markov property.

Figure 1.

Figure 1

Illustration of the extend Moran model with exponential growth. Shown are the four different scenarios of population transition within a single discrete time step. (A) The population size remains constant and a single individual produces exactly two offspring (“Moran-type” reproductive event). (B) The population size remains constant and a single individual produces ψNn offspring (“sweepstake” reproductive event). (C) The population size increases by ΔN(n) individuals and a single individual produces exactly max[ΔN(n)+1,2] offspring. (D) The population size increases by ΔN(n) individuals and a single individual produces exactly max[ΔN(n)+1,ψNn] offspring. Note that n denotes the number of steps in the past, such that n=0 denotes the present. An overview of the notation used in this model is given in Table 1.

An illustration of our model, and the four different scenarios for forming the next generation (i.e., within a single discrete time step), is shown in Figure 1. Generally, we differentiate between two possible reproductive events: a classic “Moran-type” reproductive event (Figure 1, A and C), and a “sweepstake” reproductive event (Figure 1, B and D) occurring with probabilities 1Nnγ and Nnγ, respectively. If the population size remains constant between consecutive generations (Figure 1, A and B), we reobtain the extended Moran model introduced by Eldon and Wakeley (2006), in which a single randomly chosen individual either leaves exactly two offspring and replaces one randomly chosen individual (Moran-type), or replaces a fixed proportion ψ(0,1] of the population (of size Nn). Note that, throughout, without loss of generality, we assume that Nnψ is integer-valued. In both reproductive scenarios, the remaining individuals persist. However, if the population size increases between consecutive generations (Figure 1, C and D), the reproductive mechanism needs to be adjusted accordingly. Let

ΔN(n)Nn1Nn (3)

denote the increment in population size between two consecutive time points. Then, the number of offspring at time n is given by

UN(n)max[ΔN(n)+1,UN(n)] (4)

where UN(n) denotes number of offspring for the constant-size population. Thus, independent of the type of reproductive event, i.e., Moran-type or sweepstake, and, in the spirit of the original Moran model, additional individuals are always assigned to be offspring of the single reproducing individual of the previous generation.

Following Eldon and Wakeley (2006), the distribution of the number of offspring (UN(n)=u) can be written as

(UN(n)=u)={Nnγifu=max[ΔN(n)+1,Nnψ]1Nnγifu=max[ΔN(n)+1,2]0otherwise, (5)

for some γ>0 that—for a given fixed population size—determines the probability of a sweepstake reproductive event. Here, we will consider only the case where 1<γ<2, such that sweepstake events happen frequently enough that the ancestral process will be characterized by multiple mergers, and that all coalescent events are due to sweepstake reproductive events, but not so frequently that the population is devoid of genetic variation (Eldon and Wakeley 2006). Note that, while the numbers of offspring and replaced individuals are no longer (necessarily) equal when the population size increases, the general reproductive mechanism remains unaltered.

Throughout the paper, and following Griffiths and Tavaré (1994), we will assume that the population is growing exponentially over time at rate ϱ, and, in particular, that the population size, n steps in the past, is given by

NnN(1ϱ)n, (6a)

with

ϱ=ρψ2Nγ (6b)

if the ancestral process is dominated by sweepstake events (i.e., if 1<γ<2), or

ϱ=ρ1N2 (6c)

if Moran-type reproductive events dominate (i.e., if γ>2), and the growth rate ρ is measured in units of the corresponding coalescent time. A discussion and details about the derivation of the coalescent-time scaling are given below in the Derivation of the ancestral limit process section.

Multiple merger coalescents: the Psi-coalescent

The most general class of coalescent processes that allows for multiple lineages to coalesce per coalescent event (but not for multiple coalescent events at the same time) is the so-called Λ-coalescent. These processes are partition-valued exchangeable stochastic processes defined by a finite measure Λ on the [0,1] interval (Donnelly and Kurtz 1999; Pitman 1999; Sagitov 1999). In particular, the rate at which x out of i active lineages merge is given by

λi,x=(ix)01yx2(1y)ixΛ(dy). (7)

Special instances of the Λ-coalescent are Kingman’s coalescent (Kingman 1982a,b) with

Λ(dy)=δ0(dy) (8)

and the psi-coalescent (Eldon and Wakeley 2006) with

Λ(dy)=δψ(dy), (9)

where the measure Λ is entirely concentrated at 0 and ψ, respectively.

Under a (constant-size) extended Moran model, as proposed by Eldon and Wakeley (2006) (corresponding to Figure 1, A and B), the scaled coalescence rates of the ancestral process become

λi,x={(ix)ψx2(1ψ)ixif0<γ<2(ix)(22+ψ21x=2+ψx2+ψ2(1ψ)ix)ifγ=2(ix)1x=2otherwise, (10)

where 1x=2 denotes the indicator function, which is 1 if x=2 and 0 otherwise. Accordingly, the corresponding rate matrix of the ancestral process Qψk×k with sample size k is given by

Qψ(i,j)={λi,jifi>j1ψ2(1(1ψ)iiψ(1ψ)i1)ifi=j0otherwise, (11)

where j=ix1. Note that the diagonal entries of Qψ(i,j) (i.e., when i=j) is given by the (negative) sum over all coalescent rates, i.e., m=2iλi,m, which evaluates to the closed-form representation given in the second line of Equation 11.

In particular, in the boundary case ψ=0, we recover the rate matrix under the Kingman coalescent as

Q0(i,j)={(i2)ifj=i1(i2)ifi=j0otherwise. (12)

Note that, in the infinite population size limit, γ defines the time scale of the ancestral process. In particular, if 0<γ<2, all coalescence events are due to sweepstake reproductive events, whereas sweepstake events do not happen frequently enough if γ>2, such that all (2-) mergers are due to Moran-type reproductive events. Moreover, in the latter case, the ancestral process of the Moran model can be described accurately by the Kingman coalescent (when scaled appropriately). Note that, for the special case γ=2, both reproductive events happen on the same time scale (Eldon and Wakeley 2006).

SFS-based maximum likelihood inference

In order to infer the coalescent model and its associated coalescent parameter, and to (separately) estimate the demographic history of the population, Eldon et al. (2015) recently derived an (approximate) maximum likelihood framework based on the SFS [see also Birkner and Blath (2008) and Koskela et al. (2015) for alternative inference approaches based on a full likelihood framework and approximate conditional sampling distributions, respectively].

In the following, we will give a concise overview of their approach, which forms the basis for the joint inference of coalescent parameters and population growth rates.

First, let k denote the number of sampled (haploid) individuals (i.e., the number of leaves in the coalescent tree). Furthermore, let η(k)=(η1(k),,ηk1(k)) denote the number of segregating sites with derived allele count of i=1,,k1 of all sampled individuals (i.e., the SFS), and let s=i=1k1ηi be the total number of segregating sites. Provided that s>0, we define the normalized expected SFS ϕ(k)=(ϕ1(k),,ϕk1(k)) as

ϕi(k)=E[ηi(k)]i=1k1E[ηi(k)], (13)

which, given a coalescent model (Πt,kψ,ρ)t0, and, assuming the infinite-sites model (Watterson 1975), can be interpreted as the probability that a mutation appears i times in a sample of size k (Eldon et al. 2015). Furthermore, note that ϕi(k) is a function of (Πt,kψ,ρ)t0 (i.e., of the coalescent process and the demographic population history), but, unlike E[η(k)], is not a function of the mutation rate, and should provide a good first-order approximation of the expected SFS as long as the sample size and the mutation rate are not too small (Eldon et al. 2015).

Then, the likelihood function ((Πt,kψ,ρ)t0,η(k),s) for the observed frequency spectrum η(k) and given coalescent model (Πt,kψ,ρ)t0 is given by

((Πt,kψ,ρ)t0,η(k),s)=(Πt,kψ,ρ)t0,s(ηi(k)=ηi(k),i[k1])=EΠ[s!η1(k)!ηk1(k)!i=1k1(Ti(k)Ttot(k))ηi(k)]s!η1(k)!ηk1(k)!i=1k1(ϕi(k))ηi(k)i=1k1exp[sϕi(k)](sϕi(k))ηi(k)ηi(k)! (14)

(Eldon et al. 2015). Note that, in the third line, we approximated E[(Ti(k)/Ttot(k))]E[Ti(k)]/E[Ttot(k)]=ϕi(k). In fact, Bhaskar et al. (2015) recently used a Poisson random field approximation to derive an analogous, structurally identical likelihood function for estimating demographic parameters under the Kingman coalescent. Notably though, their approximation assumes that the underlying coalescent tree is independent at each site, under which condition Equation 14 is exact.

As an alternative to the likelihood approach, we followed Eldon et al. (2015) and also implemented a minimal-distance statistic approach where

ψ^,ρ^=argminψ,ρdp(η(k),E[η(k)]), (15)

where dp is some metric on p1 calculated between the observed and the expected SFS under the generating coalescent process.

Note, though, that both the likelihood and the distance-based approach require expressions for the normalized expected SFS ϕ(k). Instead of performing Monte Carlo simulations to obtain these quantities, we adapted an approach recently proposed by Spence et al. (2016), who derived analytical formulas for the expected SFS under a given (general) coalescent model (Πt,kψ,ρ)t0, and an intensity measure ξ(t):R0R>0. In particular, the authors showed that

E[η(k)]=θ2BCLck, (16)

where Bk1×k1 and Ck1×k1 are both Λindependent (and thus easy to calculate) matrices, Lk1×k1 is a Λdependent lower triangular matrix that depends on the rate matrix Q and its spectral decomposition, θ is the population-scaled mutation rate, and ck=(c2,2,,ck,k) denotes the expected time to the first coalescence for a sample of size i{2,,k}. Importantly, the time-inhomogeneity of the underlying coalescent process only enters through the first coalescence times ck. For example, the first coalescence times for the Kingman coalescent with an exponentially growing population are given by

ci,i=1ρexp[(i2)ρ]Ei((i2)ρ), (17)

where Ei(x)x(exp[t]/t)dt denotes the exponential integral (Polanski et al. 2003; Polanski and Kimmel 2003; Bhaskar et al. 2015). Finally, plugging Equation 16 into Equation 13 leads to

ϕi(k)=(BCLck)ii=1k1(BCLck)i, (18)

highlighting that θ cancels, and that the likelihood function (Equation 14) is independent of the mutation rate.

To obtain the coalescent parameter ψ and population growth rate ρ that maximize the likelihood function (Equation 14) or, respectively, minimize the distance function (Equation 15), we used a grid search procedure over an equally spaced two-dimensional grid with ψgrid={0,0.01,,1} and ρgrid={0,1,,1024}, and evaluated the value of the likelihood, respectively, distance function, at each grid point.

Data availability

The empirical raw data used have been downloaded from GenBank (accession numbers LC031518–LC031623; data from Niwa et al. (2016)). The empirical SFS can be downloaded from Supplemental Material, File S5. The simulation program and the inference program were written in C++ and can be downloaded from GitHub under https://github.com/Matu2083/MultipleMergers-PopulationGrowth.

Results and Discussion

The aim of this work was to derive the ancestral process for an exponentially expanding population that undergoes sweepstake reproductive events. We first derive the time-inhomogeneous Markovian ancestral process that underlies the extended Moran model, and show that, analogous to the Kingman coalescent, it can be described by a time-homogeneous Markov chain on a nonlinear time scale. In particular, we derive the coalescent rates and the time-change function, and prove convergence to a Λcoalescent with Dirac measure at ψ. Detailed derivations of the results, which in the main text have been abbreviated to keep formulas concise, can be found in File S1. On the basis of these results, we derive a maximum likelihood inference framework for the joint inference of the coalescent parameter and the population growth rate, and assess its accuracy and performance through large-scale simulations. Furthermore, we quantify the bias of coalescent and population growth parameter estimates when mistakenly neglecting population demography or reproductive skew. Finally, we apply our approach to mtDNA from Japanese sardine (S. melanostictus) populations. where patterns of sequence variation were shown to be more consistent with sole influence from sweepstake reproductive events, again highlighting the potential mis-inference of growth if reproductive skew is not properly accounted for (Grant et al. 2016; Niwa et al. 2016).

Derivation of the ancestral limit process

Unlike in the case of a constant-size population, the sequence of the number of offspring (UN(n))n changes along with the (time-dependent) population size. Thus, the ancestral process is characterized by an inhomogeneous Markov chain with transition probabilities

Gi,x(n)=(ix)u=2NnN(UN(n)=u)(u)x(Nnu)ix(Nn)i (19a)

where (z)j is the descending factorial, z(z1)(zj+1) with (z)0=1, and N denotes the rescaled distribution of UN given by

N(UN=u)(UN=u)u(u1)/(N(N1))cN(n). (19b)

Note that N is scaled by the time-dependent coalescence probability cN(n), which scales the unit of time in the limit process such that it is equal to G2,2 steps in the discrete-time model, and thus serves as the “natural” time scale for the corresponding ancestral process—defined as

cN(n)G2,2(n)=E[UN(n)2UN(n)]Nn(Nn1) (20)

for all n.

Plugging Equation 5 into Equations 19a and 20, and using Equation 4 then yields

Gi,x(n)=(ix)1Nn(Nn1)((1Nnγ)(max[2,ΔN(n)+1])x×(Nnmax[2,ΔN(n)+1])ix+Nnγ(max[ψNn,ΔN(n)+1])x×(Nnmax[ψNn,ΔN(n)+1])ix) (21)

and

cN(n)=(1Nnγ)(max[2,ΔN(n)+1])2+Nnγ(max[ψNn,ΔN(n)+1])2Nn(Nn1), (22)

respectively. Note that Equation 22 is the weighted sum of the number of offspring for the two different reproductive events. Furthermore, taking the limit N in Equation 3

limNΔN(n)=limN(NnNn1)limNNρψ2Nγ=0 (23)

shows that ΔN(n) is bounded for all n under the exponential growth model, and thus allows dropping of the maxima condition in Equations 21 and 22. Furthermore, for sufficiently large N, Equation 22 becomes

cN(n)=(1Nnγ)2+ψNn1γ(ψNn1)Nn(Nn1)ψ2Nnγ. (24)

To prove that the time-scaled ancestral process of the underlying extended Moran model converges to a continuous-time Markov chain as the initial population size approaches infinity, we apply Theorem 2.2 in Möhle (2002), which requires the following definitions: First, consider a step function FN:[0,)[0,) given by

FN(s)n=1scN(n). (25)

Furthermore, let GN1 denote a modification of the right-continuous inverse of FN

GN1(t)inf{s>0|FN(s)>t}1 (26)

which will constitute the time-change function in the following. Since by assumption limsFN(s)= it follows that GN1(t) is finite for all t[0,). Finally, Theorem 2.2 (Möhle 2002) requires that for all t[0,)

limNinf1nGN1(t)Nn= (27)

and

limNsup1nGN1(t)cN(n)=0 (28)

holds, i.e., that—on the new time scale—the population size remains large while the coalescent probabilities become small.

Then, let (An,kψ,ρ)n denote the ancestral process of the extended Moran model with exponential growth (see Model and Methods), and let ϕ and ξ denote two partitions of [k] with ξφ of size a and b=b1++ba (where b1b2ba1), respectively. The transition probability of (An,kψ,ρ)n at time n is given by

Φa(N)(n;b1,b2,,ba)1(Nn1)balldistincti1,,1a=1NnE((νi1(n))b1(νia(n))ba). (29)

Thus, for the extended Moran model with exponential growth Theorem 2.2 in Möhle (2002) states:

Theorem 1. (Theorem 2.2; Möhle 2002). Assume that Equations 27 and 28 hold, and for all t>0 the limit

πa((0,t];b1,,ba)limNn=1GN1(t)Φa(N)(n;b1,b2,,ba) (30)

exists. Then, for each sample size, k, the ancestral process (AGN1(t),kψ,ρ)t0 converges as N tends to infinity to a time-continuous, and, in general, a time-inhomogeneous Markov chain (Πt,kψ,ρ)t0.

Note, though, that in its general form Theorem 1 was derived for any generic Cannings model as well as any kind of population size change (Möhle 2002).

We will now derive our first main result, and show that the ancestral limiting process limN(An,kψ,ρ)t0 converges to a Λkcoalescent on a nonlinear time scale. First, we derive the time-change function GN1(t) for the ancestral process by considering the step function (Equation 25)

FN(s)=n=1scN(n)n=1sψ2Nnγ=ψ2Nγ(1ρψ2Nγ)γ((1ρψ2Nγ)γs1)(1ρψ2Nγ)γ1. (31)

Solving for s then gives

GN1(t)=inf{s>0:FN(s>t)}[log[1+ργt]ργNγψ2], (32)

where we have used log[1ρψ2/Nγ]ρψ2/Nγ for sufficiently large N. In particular, we have

G1(t)limNGN1(t)cN(0)=log[1+ργt]ργ. (33)

Furthermore, Equations 27 and 28 hold, since

limNinf1nGN1(t)N(1ρψ2Nγ)n=limNNinf1nGN1(t)exp(ρψ2Nγn)=limNN(1+ργt)1γ=t(0,), (34)

and, by the same reasoning

limNsup1nGN1(t)ψ2N(1ρψ2Nγ)n=0,t(0,). (35)

Finally, to show that Equation 30 holds, we first note that

Φa(N)(n;b1,,ba)=0, (36)

and for a2, and there are two indices 1i<ja with bi,bj2

πa((0,t];b1,,ba)=0, (37)

since the extended Moran model does not allow for more than one reproductive event at a time.

Thus, limN(AGN1(t),kψ,ρ)t0 is well defined and does not feature any simultaneous coalescent events, implying that the limiting process must be a (possibly time-inhomogeneous) Λk coalescent. Further, for a=1,

limNn=1GN1(t)Φ1(N)(n;b)=limNn=1GN1(t)Gb,b=ψb2t. (38)

Hence, Theorem 1 implies that, for each sample size k the limit of the time-scaled ancestral process (AGN1(t),kψ,ρ)t0 exists, and, from Equation 38 it follows that limN(AGN1(t),kψ,ρ)t0 is a time-homogeneous Λkcoalescent. Further,

tψb2=t01xb2Λ(dx) (39)

holds for all b, if, and only if, Λ is the Dirac measure at ψ. Thus, in the large population-size limit, the ancestral process, limN(AGN1(t),kψ,ρ)t0, converges to a psi-k-coalescent – (Πt,kψ,ρ)t0limN(At/cN(0),kψ,ρ)t0 , which is equal (in distribution) to a regular psi-k-coalescent (ΠG(t),kψ,0) with time (nonlinearly) rescaled by

G(t)=exp(ργt)1ργ. (40)

Put differently, analogous to the results obtained for the Kingman coalescent (Griffiths and Tavaré 1994, 1998; Kaj and Krone 2003), the time-inhomogeneous ancestral limiting process of the extended Moran model with exponential growth can be transformed into a time-homogeneous psi-coalescent with coalescent rates given by Equation 10, with branches rescaled by Equation 33, allowing it to be simulated easily and efficiently. Intuitively, the transformation sums over the coalescence intensities of the time-inhomogeneous process, and weighs them by the time they were effective, such that, on the new time-scale coalescent intensities are constant across time, and the (rescaled) process is time-homogeneous (see also Kaj and Krone 2003). Thus, changing the time-scale by Equation 40 compensates for the shrinking population sizes (going backward in time) and the effect of increasing (total) coalescent rates.

To highlight the duality between the two processes, i.e., the (forward in time) extended Moran model and the corresponding coalescent, key properties (e.g., the summed length of all branches with i descendants Ti and the total tree length Ttot) are compared in the File S2. Finally, note that Equation 33 is—except for the additional factor γ that is proportional to the coalescent time scale—structurally identical to the time-change function in the Kingman case (see Equation 2.7 in Griffiths and Tavaré 1998). However, since G1(t) depends on the product ργ, it is impossible to obtain a direct estimate of ρ (or γ) without additional information, and thus—analogous to the case of the population scaled mutation rate θ—only the compound parameter can be estimated. To keep notation simple, though, we will refer to ρ (instead of the compound parameter ργ) when referring to growth rate estimates.

Joint inference of coalescent parameters and population growth rates

In this paragraph we modify the likelihood function

((Πt,kψ,ρ)t0,η(k),s)i=1k1exp[sϕi(k)](sϕi(k))ηi(k)ηi(k)! (41)

derived in the Methods section to jointly infer the coalescent parameter ψ and the population growth rate ρ. Note that, while the general form of the likelihood function (Equation 14) is independent of the generating coalescent process, changes in ψ and ρ affect the normalized expected SFS, as given by

ϕi(k)=(BCLck)ii=1k1(BCLck)i. (42)

Recall that B and C depend neither on ψ nor ρ, and that L does depend on ψ but not on ρ, and that the time-inhomogeneity of the underlying coalescent process enters only through the first coalescence times ck, which are given by

ci,i=0(timeoffirstcoalescenceforiindividuals>t)dt=0exp((Q)i,i0t(1/ξ(s))ds)dt, (43)

where ξ(s) denotes the intensity measure (Polanski and Kimmel 2003; Bhaskar et al. 2015; Spence et al. 2016). For the psi-coalescent with exponential growth, ξ(t)=eργt, such that Equation 43 becomes

ci,i=0exp((Q)i,iexp(ργt)1ργ)dt=exp((Q)i,iργ)ργEi((Q)i,iργ), (44)

where Ei(x)x(exp[t]/t)dt denotes the exponential integral. Thus, when growth rates are measured on their corresponding coalescent scale, i.e., ργ under the psi-coalescent vs. ρ under the Kingman coalescent, Equation 44 is a generalization of the Kingman coalescent result (Equation 17) derived by Polanski and Kimmel (2003). Finally, combining Equation 44 with Equation 42 allows for the exact computation of the normalized expected SFS ϕ(k), avoiding the simulation error that would be introduced by Monte Carlo simulations.

Figure 2 shows the normalized expected SFS obtained from Equation 42, where higher frequency classes have been aggregated (i.e., lumped) for different values of ψ and ρ. In line with previous findings, both multiple mergers and population growth lead to an excess in singletons (Durrett and Schweinsberg 2005; Eldon et al. 2015; Niwa et al. 2016). Furthermore, this excess increases as sample size increases under the psi-coalescent (Figure S1 in File S3), while it decreases for the Kingman coalescent independent of the presence or absence of exponential growth. These qualitative differences stem from the different footprints reproductive skew and exponential growth leave on a genealogy. While the latter is a simple rescaling of branch lengths, leaving the topology unchanged, multiple-merger coalescents by definition affect the topology of the genealogical tree (Eldon et al. 2015). In particular, when ψ is large, adding samples will disproportionally increase the number of external branches T1, such that the genealogy will become more star-like, rendering disproportionately more singletons.

Figure 2.

Figure 2

The normalized expected (lumped) SFS for the psi-coalescent for an exponentially growing population (Equation 18) with sample size k=20 (A) for different values of ρ and fixed ψ=0.15, and (B) for different values of ψ and fixed ρ=1. The sixth entry in the SFS contains the aggregate of the higher frequency classes.

Though the excess in singletons characterizes either process, their higher frequency classes will typically differ (Eldon et al. 2015). When both processes—reproductive skew and exponential growth—act simultaneously though, their joint effects on the SFS (nontrivially) combine. As expected, increasing growth under the psi-coalescent further exacerbates the excess in singletons. More generally, exponential growth leads to a systematic left shift in the SFS toward lower frequency classes that is independent of ψ. Increasing ψ, on the other hand, changes the SFS—and in particular the higher frequency classes—nonmonotonically even if there is no population growth (Figure S2 in File S3). Interestingly, for ρ=0, the last entry of the normalized expected SFS E[ηk1] initially increases with ψ, and takes an intermediate maximum, decreases monotonically until ψ0.85, peaks again, and then quickly reduces to 0 as ψ approaches 1. This effect prevails as sample size increases (Figure S2 in File S3), even though the intermediate maximum shifts slightly toward lower ψ. However, this intermediate maximum is effectively washed out by increasing ρ, such that the second peak becomes the maximum. Furthermore, the shape of the peak becomes more pronounced as the sample size increases. Thus, reproductive skew and exponential growth leave complex and distinct genomic footprints on the SFS. While, in theory, population growth and reproductive skew should be identifiable, in practice this strongly depends on sample size (Spence et al. 2016). In the next section, we will assess the accuracy of our joint estimation framework, and perform extensive validation (Equation 14) on large-scale simulated data.

Simulated coalescent and demographic models

To test our inference framework, we followed two different simulation approaches, each corresponding to two biological limiting cases. In both, data were simulated for the Cartesian product set over ψ={0,0.15,0.3,0.45,0.6,0.75,0.9}, ρ={0,1,10,100}, k={20,50,100,200}, and s={100,1000,10000} per locus over 10,000 replicates each. In order to make results comparable across different coalescent models, and, thus, across different values of ψ and ρ, we calculated the population-scaled mutation rate θ based on Watterson’s estimator (Watterson 1975),

θ=2sE[Ttotψ,ρ] (45)

for a fixed number of segregating sites s over the expected total tree length under the generating coalescent model (given by the denominator in Equation 42). Note that Ttot decreases with both increasing ψ and ρ. Thus, keeping s constant implies that θ effectively increases with ψ and ρ. We will discuss the latter point in more detail in light of the results below. Data were simulated for the following two underlying genetic architectures:

  • Case 1 (Independent-sites simulations): Under the Poisson random field assumption, the underlying coalescent tree at each site is independent (Sawyer and Hartl 1992; Bhaskar et al. 2015). Thus, by averaging over independent realizations of the (shared) underlying coalescent process, the SFS can be obtained by randomly drawing from a multinomial distribution such that ηMultinomial(s,ϕ).

  • Case 2 (Whole-genome simulations): In this scenario, we consider a genome of =100 independent loci, where sites within each locus share the same genealogy (i.e., coalescent tree). Thus, for each locus, we draw a random genealogy according to Equations 10 and 33, superimpose sPoisson(θ/2) random mutations onto the ancestral tree by multinomial sampling, and aggregate the individual locus SFS into a single genome-wide SFS.

Finally, data sets where s=η1 (i.e., where all segregating sites were singletons) were discarded, and simulated again since these do not allow the underlying coalescent parameter and demographic history to be identified. Note that both types of simulations are merely for checking the robustness and accuracy of the inference framework (and might not always necessarily be biologically realistic). A discussion of the independence assumption between loci and its biological implications is given below.

Accuracy of the joint estimation framework

Next, we evaluated the accuracy of the joint estimation framework by means of the mean absolute deviation (MAD) MAD=1/ni=1n|xix^i|, the mean deviation (MD) MD=1/ni=1nxix^i, the mean squared error (MSE) MSE=1/ni=1n(xix^i)2, and the median deviation (MDD), where x and x^ denote the true and the estimated parameter, respectively. If not stated otherwise, results in the main text are shown for the default parameters k=100 and θ (Equation 45), with s=10,000. More results are given in File S3 and File S4. Recall that, for notational simplicity, we will refer to ρ (instead of the compound parameter ργ) when referring to growth rate estimates.

Inference under the independent-sites assumption:

First, for a consistency check, we applied our grid-search algorithm to estimate ψ and ρ from an idealized SFS (i.e., where the SFS accurately reflects the expected branch length under the generating coalescent and demographic model ϕ, except for distortions due to rounding). An exemplary likelihood surface (Equation 14) for such an idealized SFS is depicted in Figure 3, which shows that the likelihood surface—up to the resolution of the grid point—is smooth, and generally unimodal, and that the true parameters can be estimated accurately. Furthermore, Figure 3 shows that there is generally a negative correlation between ψ and ρ, and that the likelihood surface tends to be steeper and more concentrated along the ψ direction, which suggests that growth rate estimates might show a larger variance, and could, in general, be more difficult to estimate. The steepness of the likelihood surface along the ψ axis tends to increase with ψ, and sample size k, suggesting that the accuracy for estimating ψ should increase as well, while it should become more difficult to estimate ρ accurately.

Figure 3.

Figure 3

Likelihood surface (Equation 14) of the idealized SFS with k=100, ψ=0.3, ρ=10, and s=10,000. Contours show the 0.95,0.9675,0.975,0.99, 0.99225,0.9945, 0.99675,0.999,0.99945, and 0.9999 quantiles. Likelihoods below the 0.95 quantile are uniformly colored in gray. The green square shows the true ψ and ρ. The black star shows the maximum likelihood estimates ψ and ρ^.

An exemplary distribution of the jointly inferred maximum likelihood estimates (ψ,ρ^) assuming independent sites is shown in Figure 4. The shape of this distribution resembles that of the likelihood surface (Figure 3), indicating that there is some variance—in particular along the ρ-axis—in the maximum likelihood estimates. However, the median and the mean of the distribution match the true underlying coalescent and growth rate parameters (i.e., ψ and ρ) very well, implying that, if sites are independent, ψ and ρ^ are unbiased estimators.

Figure 4.

Figure 4

Heatplot of the frequency of the maximum likelihood estimates for 10,000 data sets, assuming independent sites with k=100, ψ=0.3, ρ=10, and θ (Equation 45) with s=10,000. Counts increase from blue to red with gray squares showing zero counts. The green square shows the true ψ and ρ. The black star shows the median (and mean) of the maximum likelihood estimates ψ and ρ^.

Generally, as expected from the shape of the likelihood surface, ψ is estimated with high accuracy and precision, even for large sample sizes (k=200) with only a few segregating sites (s=100) and (nearly) independent of ρ (Figure 5A, Figure S3A, Figure S4A in File S3, and Table S1 in File S4). Growth rate estimates ρ^, however, show a larger variance, and, for some parameters—namely large k and small s—might be slightly upwardly biased when both the coalescent parameter and the growth rate are large (Figure 5B, Figure S3B, Figure S4B in File S3, and Table S2 in File S4). Though, as the number of segregating sites increases, this bias vanishes and the variance decreases (Figure S5 in File S3), highlighting that the joint estimation procedure gives asymptotically unbiased estimators.

Figure 5.

Figure 5

Boxplot of the deviation of the maximum likelihood estimate from the true (A) ψ and (B) ρ for 10,000 data , assuming independent sites with k=100 and θ (Equation 45) with s=10,000. Boxes represent the interquartile range (i.e., the 50% C.I.) and whiskers extend to the highest/lowest data point within the box ±1.5 times the interquartile range.

For a given s, increasing sample size k increases the signal-to-noise ratio, and, thus, the error in both ψ and ρ^ (Table S1, Table S2, Table S3, and Table S4 in File S4) which is most noticeable in growth rate estimates, in particular when ρ is large (Figure S6 in File S3). This increase in estimation error can (partially) be compensated by increasing the number of segregating sites s (Figure S7 in File S3 and Table S5 in File S4). Specifically, if the true underlying ψ is large (i.e., if the offspring distribution is heavily skewed), an increasing number of segregating sites is needed to accurately infer ρ. However, the total tree length T¯tot—and thus the number of segregating sites s—is expected to decrease sharply with ψ (Eldon and Wakeley 2006), implying that trees tend to become shorter under heavily skewed offspring distributions. This effect could (again, partially) be overcome by increasing sample size since T¯tot—unlike the Kingman coalescent—scales linearly with k as ψ approaches 1 (Eldon and Wakeley 2006). However, population growth will reduce T¯tot and the number of segregating sites even further.

Calculating θ based on a fixed and constant (expected) number of segregating sites for the assessment of the accuracy of the estimation method evades this problem to some extent. However, by making this assumption, we effectively increase θ in our simulations as ψ and ρ increases. Our results suggest, though, that even more segregating sites than considered in this study (i.e., an even larger θ) would be necessary to infer population growth accurately. Thus, unless (effective) population sizes and/or genome-wide mutation rates are large, it might be very difficult to infer population growth if the offspring distribution is heavily skewed (i.e., if ψ is large). On the other hand, the few studies that have estimated ψ generally found it to be small (Eldon and Wakeley 2006; Birkner et al. 2013; Árnason and Halldórsdóttir 2015), leaving it unresolved whether this problem is of any practical importance when studying natural populations.

Inference from genome-wide data:

We next tested the accuracy of our joint estimation framework when applied to genome-wide data obtained from =100 independent loci. An exemplary distribution of the jointly inferred maximum likelihood estimates (ψ,ρ^) is depicted in Figure 6, and Figure 7 shows the overall performance of the joint estimation method when applied to genome-wide data. While the whole-genome simulations are designed such that each site in a given locus shares the same underlying genealogy, and, thus, violate the assumption of (statistical) independence between sites, we find that coalescent and growth rate parameters (i.e., ψ and ρ) can be estimated robustly and accurately. In concordance with the independent-sites simulations, the variance in ψ is typically small, whereas ρ^ spreads considerably, and increasingly so if ψ is large. The mean and median of the coalescent parameter and growth rate estimates are again centered around the true value, implying that ψ and ρ^ are unbiased estimators (see also Table S6, Table S7, Table S8, and Table S9 in File S4).

Figure 6.

Figure 6

Heatplot of the frequency of the maximum likelihood estimates for 10,000 whole-genome data sets assuming with =100,k=100, ψ=0.3, ρ=10,γ=1.5 and θ (Equation 45) with s=1000. Counts increase from blue to red with gray squares showing zero counts. The green square shows the true ψ and ρ. The black star shows the median (and mean) of the maximum likelihood estimates ψ and ρ^

Figure 7.

Figure 7

Boxplot of the deviation of the maximum likelihood estimate from the true (A) ψ and (B) ρ for 10,000 whole-genome data sets with =100,k=100, γ=1.5, and θ (Equation 45) with s=1000. Boxes represent the interquartile range (i.e., the 50% C.I.), and whiskers extend to the highest/lowest data point within the box ±1.5 times the interquartile range.

Next, we assessed how the precision of the coalescent and growth rate parameter estimates depends on the number of (independent) loci (i.e., the number of independent coalescent realizations), while keeping the number of segregating sites constant. We find that coalescent and growth rate estimates obtained from a single locus display a huge variance, in particular, when the true underlying growth rate is large (Figure S8 in File S3), warranting caution when interpolating population trends from a single statistical realization as is common practice in studies fitting a multiple-merger coalescent models. Expectedly, the precision of the coalescent and growth rate parameter estimates increases (Figure S9 in File S3 and Table S10 in File S4) when considering estimates obtained from =1000 independent loci (i.e., from independent coalescent realizations), suggesting that sequencing efforts should be put on covering the genome in its entirety rather than on increasing coverage of individual genomic regions.

Distance-based inference and the effect of lumping:

As an alternative to the likelihood-based method, Eldon et al. (2015) proposed an ABC approach based on a minimum-distance statistic (Equation 15). In this section, we assess the accuracy of ψd and ρ^d when estimated from d1 and d2 distances (i.e., the l1 and l2 distance). A surface plot of the l1 and the l2 distance is shown in Figure S10 in File S3. We find that, for the l1 and the l2 distance, results are comparable to those of the likelihood-based estimates, but generally display a larger variance (Figure S11, Figure S12, and Figure S13 in File S3). Likelihood-based estimates, ψML, tend to be more accurate across the entire parameter space, though differences between the two are marginal.

Over the majority of the parameter space, the same holds true for ρ^ML. Particularly for small-to-intermediate ψ, the likelihood-based approach outperforms both distance-based approaches considerably (Table S11 and Table S12 in File S4). Interestingly though, for large ψ and ρ (i.e., in the part of the parameter space, where estimating ρ is generally difficult) the l1 distance approach gives more accurate estimates. When increasing the number of segregating sites, though, the likelihood approach becomes more accurate again, suggesting that the l1 distance-based approach only outperforms the likelihood-based approach when there is insufficient data (data not shown). These general findings are also upheld when considering genome-wide data (Figure S14 and Figure S15 in File S3). Despite the slightly reduced power as compared to the maximum likelihood approach, our results indicate that, given the asymptotic properties, both the l1 and the l2 distance should perform reasonably well when used in a rejection-based ABC analysis.

Finally, we investigated the effect of lumping (i.e., aggregating the higher-frequency classes of the SFS into a single entry after a given threshold i) on the performance of our estimator. In contrast to Eldon et al. (2015), who found that lumping can improve the power to distinguish between multiple-merger coalescent models and models of population growth, we find that estimates based on the lumped SFS (using i=5 and i=15) show considerably more error (Table S13 and Table S14 in File S4). While ψ can again be reasonably well estimated, ρ^—in particular when ψ and/or ρ are large—is orders of magnitude more inaccurate when higher frequency classes are lumped. The reason is that, when trying to differentiate between different coalescent or growth models, lumping can reduce the noise associated with the individual higher frequency classes, and, thus, increases the power, provided that the different candidate models show different mean behaviors in the lumped classes (Eldon et al. 2015). While this seems to hold true when considering “pure” coalescent or growth models, the joint footprints of skewed offspring distributions and (exponential) population growth are more subtle. In particular, since growth induces a systematic left shift in the SFS toward lower frequency classes, most of the information to distinguish between a psi-coalescent, with or without growth, is lost when aggregated.

Mis-inference of coalescent parameters when neglecting demography

As argued above, both reproductive skew and population growth result in an excess of singletons (i.e., low-frequency mutations) in the SFS. However, topological differences between the two generating processes in the right tail of the SFS allows distinguishing between the two. In particular, fitting an exponential growth model and not accounting for reproductive skewness results in a vastly (and often unrealistically) overestimated growth rate (Eldon et al. 2015).

Here, we investigate how coalescent parameter estimates (i.e., ψ) are affected when not accounting for (exponential) population growth (i.e., assuming ρ=0) when both processes act simultaneously. As expected, we find that ψ is consistently overestimated (Figure 8) and that the estimation error—independent of ψ—increases with larger (unaccounted for) growth rates. This is because, unless the underlying genealogy is star-shaped (e.g., when ψ=1), growth will always left-shift the SFS, and, hence, increase the singleton class. Thus, when assuming ρ=0, increasing ψ compensates for the “missing” singletons.

Figure 8.

Figure 8

Boxplot of the deviation of the maximum likelihood estimate from the true ψ for 10,000 data sets, assuming independent sites with k=100 and θ (Equation 45) withs=10,000 when not accounting for population growth. Boxes represent the interquartile range (i.e., the 50% C.I.), and whiskers extend to the highest/lowest data point within the box ±1.5 times the interquartile range.

Interestingly though, the estimation error changes nonmonotonically with ψ, and, for large ρ, can be as great as twice the value of the true underlying coalescent parameter. Furthermore, for low-to-intermediate ψ, even small growth rates can result in a relative error of up to 23%. Overall, not accounting for demography can lead to serious biases in ψ with broad ecological implications when trying to understand the variation in reproductive success.

Application to sardine data

Finally, we applied our joint inference framework to a derived SFS for the control region of mtDNA in Japanese sardine (S. melanostictus; File S5). Niwa et al. (2016) recently analyzed this data to test whether the observed excess in singletons was more likely caused by a recent population expansion or by sweepstake reproductive events, and found that the latter is the more likely explanation. However, there is of course no a priori reason to believe that both reproductive skew and population growth could not have acted simultaneously.

When estimated jointly, the maximum likelihood estimate is (ψ,ρ^)=(0.46,0), which implies considerable reproductive skew, but no (exponential) population growth (Figure 9; see Figure S16 in File S3 for the corresponding l1 and l2 distance estimates). While our analysis confirms their results at first glance, there are two points that warrant caution with this interpretation. First, as indicated by the contour lines in the plot, there is some probability that the Japanese sardine population underwent a recent population expansion, though, if it did, it only grew at a very low rate. Second, our inference is based on a single nonrecombining locus (i.e., mtDNA), implying that there is correlation between sites. Our approximation, though, is exact only if there is independence between sites. While violations of the independence assumption seem to be robust on the genome-wide scale (see above; Figure 7), per-locus estimates can vary drastically, and might not be representative for the true underlying coalescent process (Figure S8 in File S3).

Figure 9.

Figure 9

Likelihood surface (Equation 14) of the unfolded SFS (given the ML rooted tree) of the sardine mtDNA sequences with k=106 and s=78. Contours show the 0.95,0.9675,0.975,0.99, 0.99225,0.9945,0.99675, 0.999,0.99945, and 0.9999 quantiles. Likelihoods below the 0.95 quantile are uniformly colored in gray. The black star shows the maximum likelihood estimates ψ=0.46 and ρ^=0.

Concluding remarks

This study marks the first multiple-merger coalescent with time-varying population sizes derived from a discrete time random mating model, and provides the first in-depth analyses of the joint inference of coalescent and demographic parameters. Since the Kingman coalescent represents a special case of the general class of multiple-merger coalescents (Donnelly and Kurtz 1999; Pitman 1999; Sagitov 1999; Schweinsberg 2000; Spence et al. 2016), it is interesting and encouraging to see that our analytical results—i.e., the time-change function (Equation 33) and the first expected coalescence times (Equation 44)—are generalizations of results derived for the Kingman coalescent (Griffiths and Tavaré 1998; Polanski and Kimmel 2003). In fact, when growth rates are measured within the corresponding coalescent framework (e.g., as ργ for the psi-coalescent or ρ for the Kingman coalescent), these formulas should extend to other, more general multiple-merger coalescents. This also holds true for the challenges arising when calculating the normalized expected SFS (Equation 13), which is central to estimating coalescent parameters and growth rates: Because of catastrophic cancellation errors—due mainly to summing terms involving large binomial coefficients and numerical representations of the exponential integral Ei(x) with alternating signs—computations have to be carried out using multi-precision libraries (Spence et al. 2016).

While both ψ and ρ can generally be estimated precisely, accurate estimation of the latter requires sufficient information (i.e., a large number of segregating sites), especially when offspring distributions are heavily skewed (i.e., if ψ is large). However, since strong recurrent sweepstake reproductive events—analogous to recurrent selective sweeps—constantly erase genetic variation (i.e., reduce the number of segregating sites), there might be little power to accurately infer ρ in natural populations in these cases. In accordance with previous findings derived for the Kingman coalescent (Terhorst and Song 2015), increasing sample size does not improve the accuracy of demographic inference (i.e., estimating ρ) for a fixed (expected) number of segregating sites s. However, unlike in the Kingman coalescent, where s increases logarithmically with sample size, genetic variation in ψ increases linearly for large ψ, which could offset—or at least hamper—this effect.

More importantly, these results have proven to be robust to violations of the assumptions underlying the approximate likelihood framework (Equation 14), namely, that the expectation of a ratio can be approximated by the ratio of two expectations (i.e., ϕi(k)), allowing ψ and ρ to be estimated accurately on a genome-wide scale. Interestingly, the performance of the estimators seemed to improve when considering more independent loci (while keeping the number of segregating sites constant; see also Figure S8, Figure S9 in File S3, and Table S10 in File S4). Note, though, that we have used a very simplistic genetic architecture, in particular one where sites within each locus are maximally dependent, and there is no correlation among genealogies across different loci (i.e., where loci are independent). While these assumptions might be met for some loci and sites, they generally mark the endpoint of a continuum of correlations, and might not always be biologically realistic. Importantly, these linkage disequilibria (i.e., the extent of statistical independence between sites) depend not only on the rate of recombination, but also on the specifics of the reproduction parameters (i.e., ψ)—and can potentially be elevated, despite frequent recombination, or largely absent, despite infrequent recombination in the MMC setting (Eldon and Wakeley 2008; Birkner et al. 2012), potentially biasing results. For instance, when trying to estimate the duration and the rate of exponential growth under the Kingman coalescent, Bhaskar et al. (2015) found that linkage equilibria cause the approximate likelihood approach (Equation 14) to become increasingly inaccurate, and, thus, bias estimates. Likewise, Schrider et al. (2016) recently found that linked positive selection can severely bias demographic estimates. While their analyses assumed a Kingman framework, positive selection and recurrent selective sweeps typically result in multiple merger events (Durrett and Schweinsberg 2004, 2005; Neher and Hallatschek 2013; Schweinsberg 2017). Thus, if neutral regions are tightly linked to a selected site they will—at least partially—share the genealogical relationship with the selected region, and potentially skew inference. Similarly, large reproductive skew (i.e., large ψ) will induce correlations between coalescent trees across loci (i.e., linkage), which will reduce the number of “effective independent loci,” suggesting an increased variance in both coalescent parameter and growth rate estimates. However, due to the lack of explicit coalescent simulators that allow for multiple-mergers, nonconstant population sizes, and varying recombination rates, the effects of linkage on the joint estimation of coalescent and demographic parameters cannot directly be assessed, and remain open for future research.

Despite the fact that our model here considers organisms with skewed offspring distributions under neutrality owing to the specifics of their reproductive biology, increasing ψ is tantamount to increasing the strength of positive selection under a non-neutral model, which is thus relevant to a very broad class of organisms indeed. It is important to note that while both processes—selection and sweepstake reproductive events—have a similar effect on the SFS (i.e., an excess of low-frequency alleles and a slight increase in high-frequency alleles), there are of course vast qualitative differences in the underlying processes and their causes. First, in the presence of selection, offspring no longer choose their parents at random, such that selected alleles need to be tracked along the genealogy (e.g., see the ancestral selection graph under the Kingman coalescent (Krone and Neuhauser 1997), or under the Λ-coalescent (Etheridge et al. 2010)). Second, similar to the effects of demography, sweepstake reproductive events should have a genome-wide impact, whereas traces of selection should remain local, unless selection is very strong, such that only a single individual gives rise to the entire next generation. Thus, it should in principle be possible to discriminate between the two processes, though, also analogous to demography, it will be important to investigate the conditions under which positively selected loci will be expected to reside in the tails of genomic distributions under such models (see Thornton and Jensen 2007).

Overall, our analyses emphasize the importance of accounting for demography and illuminates the serious biases that can arise in the inferred coalescent model if ignored. Such bias can have broad implications on inferred patterns of genetic variation (Eldon and Wakeley 2006; Tellier and Lemaire 2014; Niwa et al. 2016), including misguiding conservation efforts (Montano 2016), and obscuring the extent of reproductive skew.

Finally, most of the current analytical and computational tools have been derived and developed under the Kingman coalescent. In order to achieve the overall aim of generalizing the Kingman coalescent model (Wakeley 2013), these tools, though often computationally challenging, need to be extended. Great efforts have recently been undertaken toward developing a statistical inference framework, allowing for model selection (Birkner and Blath 2008; Eldon 2011; Birkner et al. 2011, 2012, 2013; Steinrücken et al. 2013; Eldon et al. 2015; Spence et al. 2016). By setting up a discrete-time random mating model, and deriving the ancestral process, along with providing the analytical tools necessary to enable the joint inference of offspring distribution and demography, this study makes an important contribution toward this goal.

Supplementary Material

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.117.300499/-/DC1.

Acknowledgments

We thank Kristen Irwin, Adamandia Kapopoulou, Martin Möhle, Sylvain Mousset, and Meike Wittmann for helpful discussion and comments on earlier versions of this manuscript, Hiro-Sato Niwa for providing the sardine data, and two anonymous reviewers for their insightful comments and constructive criticism, which greatly helped to improve this manuscript. This project was funded by grants from the Swiss National Science Foundation (FNS) and a European Research Council (ERC) Starting Grant to J.D.J.

Footnotes

Communicating editor: R. Nielsen

Literature Cited

  1. Achaz G., 2009.  Frequency spectrum neutrality tests: one for all and all for one. Genetics 183: 249–258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Árnason E., Halldórsdóttir K., 2015.  Nucleotide variation and balancing selection at the Ckma gene in Atlantic cod: analysis with multiple merger coalescent models. PeerJ 3: e786. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bhaskar A., Clark A. G., Song Y. S., 2014.  Distortion of genealogical properties when the sample is very large. Proc. Natl. Acad. Sci. USA 111: 2385–2390. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bhaskar A., Wang Y. R., Song Y. S., 2015.  Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 25: 268–279. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Birkner M., Blath J., 2008.  Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model. J. Math. Biol. 57: 435–465. [DOI] [PubMed] [Google Scholar]
  6. Birkner M., Blath J., Steinrücken M., 2011.  Importance sampling for lambda-coalescents in the infinitely many sites model. Theor. Popul. Biol. 79: 155–173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Birkner M., Blath J., Eldon B., 2012.  An ancestral recombination graph for diploid populations with skewed offspring distribution. Genetics 193: 255–290. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Birkner M., Blath J., Eldon B., 2013.  Statistical properties of the site-frequency spectrum associated with λ-coalescents. Genetics 195: 1037–1053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Bolthausen E., Sznitman A.-S., 1998.  On Ruelle’s probability cascades and an abstract cavity method. Commun. Math. Phys. 197: 247–276. [Google Scholar]
  10. Cannings C., 1974.  The latent roots of certain Markov chains arising in genetics: a new approach, i. haploid models. Adv. Appl. Probab. 6: 260–290. [Google Scholar]
  11. Donnelly P., Kurtz T. G., 1999.  Particle representations for measure-valued population models. Ann. Probab. 27: 166–205. [Google Scholar]
  12. Durrett R., Schweinsberg J., 2004.  Approximating selective sweeps. Theor. Popul. Biol. 66: 129–138. [DOI] [PubMed] [Google Scholar]
  13. Durrett R., Schweinsberg J., 2005.  A coalescent model for the effect of advantageous mutations on the genealogy of a population. Stoch. Proc. Appl. 115: 1628–1657. [Google Scholar]
  14. Eldon B., 2011.  Estimation of parameters in large offspring number models and ratios of coalescence times. Theor. Popul. Biol. 80: 16–28. [DOI] [PubMed] [Google Scholar]
  15. Eldon B., Wakeley J., 2006.  Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics 172: 2621–2633. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Eldon B., Wakeley J., 2008.  Linkage disequilibrium under skewed offspring distribution among individuals in a population. Genetics 178: 1517–1532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Eldon B., Birkner M., Blath J., Freund F., 2015.  Can the site-frequency spectrum distinguish exponential population growth from multiple-merger coalescents? Genetics 199: 841–856. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Etheridge A. M., Griffiths R. C., Taylor J. E., 2010.  A coalescent dual process in a Moran model with genic selection, and the lambda coalescent limit. Theor. Popul. Biol. 78: 77–92. [DOI] [PubMed] [Google Scholar]
  19. Fay J. C., Wu C.-I., 2000.  Hitchhiking under positive Darwinian selection. Genetics 155: 1405–1413. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Ferretti L., Pérez-Enciso M., Ramos-Onsins S. E., 2010.  Optimal neutrality tests based on the frequency spectrum. Genetics 186: 353–365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Fisher R., 1930.  The Genetical Theory of Natural Selection. Clarendon Press, Oxford. [Google Scholar]
  22. Grant W. S., Árnason E., Eldon B., 2016.  New DNA coalescent models and old population genetics software. ICES J. Mar. Sci. 73: 2178–2180. [Google Scholar]
  23. Griffiths R., Tavaré S., 1998.  The age of a mutation in a general coalescent tree. Commun. Stat. Stoch. Models 14: 273–295. [Google Scholar]
  24. Griffiths R. C., Tavaré S., 1994.  Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B Biol. Sci. 344: 403–410. [DOI] [PubMed] [Google Scholar]
  25. Hedgecock D., 1994.  Does variance in reproductive success limit effective population sizes of marine organisms? pp. 1222–1344 in Genetics and Evolution of Aquatic Organisms, edited by Beaumont A. Chapman and Hall, London. [Google Scholar]
  26. Hedgecock D., Pudovkin A. I., 2011.  Sweepstakes reproductive success in highly fecund marine fish and shellfish: a review and commentary. Bull. Mar. Sci. 87: 971–1002. [Google Scholar]
  27. Huillet T., Möhle M., 2013.  On the extended Moran model and its relation to coalescents with multiple collisions. Theor. Popul. Biol. 87: 5–14. [DOI] [PubMed] [Google Scholar]
  28. Irwin K. K., Laurent S., Matuszewski S., Vuilleumier S., Ormond L., et al. , 2016.  On the importance of skewed offspring distributions and background selection in virus population genetics. Heredity 117: 393–399. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Kaj I., Krone S. M., 2003.  The coalescent process in a population with stochastically varying size. J. Appl. Probab. 40: 33–48. [Google Scholar]
  30. Kingman J. F. C., 1982a The coalescent. Stoch. Proc. Appl. 13: 235–248. [Google Scholar]
  31. Kingman J. F. C., 1982b On the genealogy of large populations. J. Appl. Probab. 19: 27–43. [Google Scholar]
  32. Kingman J. F. C., 1982c Exchangeability and the evolution of large populations, pp. 97–112 in Exchangeability in Probability and Statistics, edited by Koch G., Spizzichino F. North-Holland, Amsterdam. [Google Scholar]
  33. Kingman J. F. C., 2000.  Origins of the coalescent: 1974–1982. Genetics 156: 1461–1463. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Koskela J., Jenkins P., Spanò D., 2015.  Computational inference beyond Kingman’s coalescent. J. Appl. Probab. 52: 519–537. [Google Scholar]
  35. Krone S. M., Neuhauser C., 1997.  Ancestral processes with selection. Theor. Popul. Biol. 51: 210–237. [DOI] [PubMed] [Google Scholar]
  36. Möhle M., 1998.  Robustness results for the coalescent. J. Appl. Probab. 35: 438–447. [Google Scholar]
  37. Möhle M., 1999.  Weak convergence to the coalescent in neutral population models. J. Appl. Probab. 36: 446–460. [Google Scholar]
  38. Möhle M., 2002.  The coalescent in population models with time-inhomogeneous environment. Stoch. Proc. Appl. 97: 199–227. [Google Scholar]
  39. Möhle M., Sagitov S., 2001.  A classification of coalescent processes for haploid exchangeable population models. Ann. Probab. 29: 1547–1562. [Google Scholar]
  40. Montano V., 2016.  Coalescent inferences in conservation genetics: should the exception become the rule? Biol. Lett. 12: 20160211. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Moran P. A. P., 1958.  Random processes in genetics. Proc. Camb. Philos. Soc. 54: 60. [Google Scholar]
  42. Moran P. A. P., 1962.  The Statistical Processes of Evolutionary Theory. Clarendon Press, Oxford. [Google Scholar]
  43. Neher R. A., Hallatschek O., 2013.  Genealogies of rapidly adapting populations. Proc. Natl. Acad. Sci. USA 110: 437–442. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Neuhauser C., Krone S. M., 1997.  The genealogy of samples in models with selection. Genetics 145: 519–534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Niwa H.-S., Nashida K., Yanagimoto T., 2016.  Reproductive skew in Japanese sardine inferred from DNA sequences. ICES J. Mar. Sci. 73: 2181–2189. [Google Scholar]
  46. Nordborg M., 1997.  Structured coalescent processes on different time scales. Genetics 146: 1501–1514. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Pitman J., 1999.  Coalescents with multiple collisions. Ann. Probab. 27: 1870–1902. [Google Scholar]
  48. Polanski A., Kimmel M., 2003.  New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165: 427–436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Polanski A., Bobrowski A., Kimmel M., 2003.  A note on distributions of times to coalescence, under time-dependent population size. Theor. Popul. Biol. 63: 33–40. [DOI] [PubMed] [Google Scholar]
  50. Sagitov S., 1999.  The general coalescent with asynchronous mergers of ancestral lines. J. Appl. Probab. 36: 1116–1125. [Google Scholar]
  51. Sawyer S. A., Hartl D. L., 1992.  Population genetics of polymorphism and divergence. Genetics 132: 1161–1176. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Schrider D. R., Shanku A. G., Kern A. D., 2016.  Effects of linked selective sweeps on demographic inference and model selection. Genetics 204: 1207–1223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Schweinsberg J., 2000.  Coalescents with simultaneous multiple collisions. Electron. J. Probab. 5: 1–50. [Google Scholar]
  54. Schweinsberg J., 2017.  Rigorous results for a population model with selection II: genealogy of the population. Electron. J. Probab. 22: 54. [Google Scholar]
  55. Spence J. P., Kamm J. A., Song Y. S., 2016.  The site frequency spectrum for general coalescents. Genetics 202: 1549–1561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Steinrücken M., Birkner M., Blath J., 2013.  Analysis of DNA sequence variation within marine species using beta-coalescents. Theor. Popul. Biol. 87: 15–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Tajima F., 1989.  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Tellier A., Lemaire C., 2014.  Coalescence 2.0: a multiple branching of recent theoretical developments and their applications. Mol. Ecol. 23: 2637–2652. [DOI] [PubMed] [Google Scholar]
  59. Terhorst J., Song Y. S., 2015.  Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proc. Natl. Acad. Sci. USA 112: 7677–7682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Thornton K. R., Jensen J. D., 2007.  Controlling the false-positive rate in multilocus genome scans for selection. Genetics 175: 737–750. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Wakeley J., 2009.  Coalescent Theory: An Introduction. Roberts & Company Publishers, Greenwood Village, CO. [Google Scholar]
  62. Wakeley J., 2013.  Coalescent theory has many new branches. Theor. Popul. Biol. 87: 1–4. [DOI] [PubMed] [Google Scholar]
  63. Wakeley J., Takahashi T., 2003.  Gene genealogies when the sample size exceeds the effective size of the population. Mol. Biol. Evol. 20: 208–213. [DOI] [PubMed] [Google Scholar]
  64. Watterson G., 1975.  On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276. [DOI] [PubMed] [Google Scholar]
  65. Wilkinson-Herbots H. M., 1998.  Genealogy and subpopulation differentiation under various models of population structure. J. Math. Biol. 37: 535–585. [Google Scholar]
  66. Wright S., 1931.  Evolution in Mendelian populations. Genetics 16: 97–159. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

The empirical raw data used have been downloaded from GenBank (accession numbers LC031518–LC031623; data from Niwa et al. (2016)). The empirical SFS can be downloaded from Supplemental Material, File S5. The simulation program and the inference program were written in C++ and can be downloaded from GitHub under https://github.com/Matu2083/MultipleMergers-PopulationGrowth.


Articles from Genetics are provided here courtesy of Oxford University Press

RESOURCES