Genetics. 2010 Oct;186(2):601–611. doi: 10.1534/genetics.110.117135

The Total Branch Length of Sample Genealogies in Populations of Variable Size

A Eriksson, B Mehlig, M Rafajlovic, S Sagitov
PMCID: PMC2954466  PMID: 20660649

Abstract

We consider neutral evolution of a large population subject to changes in its population size. For a population with a time-variable carrying capacity we study the distribution of the total branch lengths of its sample genealogies. Within the coalescent approximation we have obtained a general expression—Equation 20—for the moments of this distribution with a given arbitrary dependence of the population size on time. We investigate how the frequency of population-size variations alters the total branch length.


MODELS for gene genealogies of biological populations often assume a constant, time-independent population size N. This is the case for the Wright–Fisher model (Fisher 1930; Wright 1931), for the Moran model (Moran 1958), and for their representation in terms of the coalescent (Kingman 1982). In real biological populations, by contrast, the population size changes over time. Such fluctuations may be due to catastrophic events (bottlenecks) and subsequent population expansions or just reflect the randomness in the factors determining the population dynamics. Many authors have argued that genetic variation in a population subject to size fluctuations may nevertheless be described by the Wright–Fisher model, if one replaces the constant population size in this model by an effective population size of the form

graphic file with name M1.gif (1)

where Nl stands for the population size in generation l. The harmonic average in Equation 1 is argued to capture the significant effect of catastrophic events on patterns of genetic variation in a population: if, for example, a population went through a recent bottleneck, a large fraction of individuals in a given sample would originate from few parents. This in turn would lead to significantly reduced genetic variation, parameterized by a small value of Neff. (See, e.g., Ewens 1982 for a review of different measures of the effective population size and Sjödin et al. 2005 and Wakeley and Sargsyan 2009 for recent developments of this concept.)
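As a quick numerical illustration of the harmonic average, the following sketch assumes that the effective size is taken as the harmonic mean of the sizes Nl over a finite window of generations. The function name `harmonic_mean_size` and the bottleneck history are hypothetical and chosen only for illustration.

```python
import numpy as np

def harmonic_mean_size(sizes):
    """Harmonic mean of per-generation population sizes N_l (finite window assumed)."""
    sizes = np.asarray(sizes, dtype=float)
    return len(sizes) / np.sum(1.0 / sizes)

# Hypothetical history: 99 generations at N = 10,000 and one bottleneck generation at N = 100.
history = [10_000] * 99 + [100]
n_eff = harmonic_mean_size(history)     # dominated by the single small generation
arithmetic = float(np.mean(history))    # barely affected by the bottleneck
print(round(n_eff, 1), arithmetic)
```

For this made-up history a single bottleneck generation roughly halves the harmonic mean (about 5,025) while the arithmetic mean stays at 9,901, which is the qualitative effect described in the text.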

The concept of an effective population size has been frequently used in the literature, implicitly assuming that the distribution of neutral mutations in a large population of fluctuating size is identical to the distribution in a Wright–Fisher model with the corresponding constant effective population size given by Equation 1. However, recently it was shown that this is true only under certain circumstances (Kaj and Krone 2003; Nordborg and Krone 2003; Jagers and Sagitov 2004). It is argued by Sjödin et al. (2005) that the concept of an effective population size is appropriate when the timescale of fluctuations of Nl is either much smaller or much larger than the typical time between coalescent events in the sample genealogy. In these limits it can be proved that the distribution of the sample genealogies is exactly given by that of the coalescent with a constant, effective population size.

More importantly, it follows from these results that, in populations with variable size, the coalescent with a constant effective population size is not always a valid approximation for the sample genealogies. Deviations between the predictions of the standard coalescent model and empirical data are frequently observed, and there are a number of different statistical tests quantifying the corresponding discrepancies (see, for example, Tajima 1989, Fu and Li 1993, and Zeng et al. 2006). The analysis of such deviations is of crucial importance in understanding, for example, human genetic history (Garrigan and Hammer 2006). But while there is a substantial amount of work numerically quantifying deviations, often in terms of a single number, little is known about their qualitative origins and their effect upon summary statistics in the population in question.

The question is thus to understand the effect of population-size fluctuations on the patterns of genetic variation, in particular for the case where the scale of the population-size fluctuations is comparable to the time between coalescent events in the ancestral tree. As is well known, many empirical measures of genetic variation can be computed from the total branch length of the sample genealogy (the expected number of single-nucleotide polymorphisms, for example, is proportional to the average total branch length).

The aim of this article is to analyze the distribution of the scaled total branch length Tn for a sample genealogy in a population of fluctuating size, as illustrated in Figure 1. For the genealogy of n ≥ 2 lineages sampled at the present time, the expression ⌊NTn⌋ gives the total branch length in terms of generations. Here ⌊Nt⌋ is the largest integer ≤Nt, and the scaling factor N is a suitable measure of the number of genes in the population and serves as a counterpart of the constant generation size of the standard Wright–Fisher model.

Figure 1.—

The effect of population-size oscillations on the genealogy of a sample of size n = 17 (schematic). Left, genealogy described by Kingman's coalescent for a large population of constant size, illustrated by the light blue rectangle; right, sinusoidally varying population size. Coalescence is accelerated in regions of small population sizes and vice versa. This significantly alters the tree and gives rise to changes in the distribution of the number of mutations and of the population homozygosity.

A motivating example is given in Figure 2, which shows numerically computed distributions ρ(Tn) of the total branch lengths Tn for a particular population model with a time-dependent carrying capacity. The model is described briefly in the Figure 2 legend and in detail in the section "A model for a population with time-dependent carrying capacity." As Figure 2 shows, the distributions depend in a complex manner on the form of the size changes. We observe that when the frequency of the population-size fluctuations is very small (Figure 2a), the distribution is well described by the standard coalescent result

$$\rho(T_n = t) = \frac{n-1}{2}\, e^{-t/2} \left(1 - e^{-t/2}\right)^{n-2} \qquad (2)$$

(Hein et al. 2005). When the frequency is very large (Figure 2e), Equation 2 also applies, but with a different time scaling reflecting an effective population size: t on the right-hand side (rhs) in Equation 2 is replaced by t/c with c = N/Neff. Apart from these special limits, however, the form of the distributions appears to depend in a complicated manner upon the frequency of the population-size variation. The observed behavior is caused by the fact that coalescence proceeds faster for smaller population sizes and more slowly for larger population sizes, as illustrated in Figure 1. But the question is how to quantitatively account for the changes shown in Figure 2.
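The constant-size limit can be checked by direct sampling. The sketch below assumes that Equation 2 denotes the textbook constant-size result, in which time is measured in units of N generations, the time τj spent with j lines is exponential with rate bj = j(j − 1)/2, and Tn = Σ j τj has density ((n − 1)/2) e^(−t/2) (1 − e^(−t/2))^(n−2). The function names are hypothetical.

```python
import numpy as np

def sample_total_branch_length(n, size, rng):
    """Draw T_n = sum_j j * tau_j with tau_j ~ Exp(rate b_j), b_j = j(j-1)/2."""
    j = np.arange(2, n + 1)
    rates = j * (j - 1) / 2.0
    tau = rng.exponential(1.0 / rates, size=(size, n - 1))  # one column per j
    return tau @ j

def standard_density(t, n):
    """Assumed constant-size density of T_n (textbook form)."""
    return 0.5 * (n - 1) * np.exp(-t / 2) * (1 - np.exp(-t / 2)) ** (n - 2)

rng = np.random.default_rng(0)
n = 10
samples = sample_total_branch_length(n, 200_000, rng)
expected_mean = 2 * sum(1.0 / k for k in range(1, n))  # 2 * (1 + 1/2 + ... + 1/(n-1))
print(samples.mean(), expected_mean)
```

A histogram of `samples` can be compared with `standard_density` to reproduce the kind of agreement described for the low-frequency limit.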

Figure 2.—

Numerically computed distributions ρ(Tn) of the scaled total branch lengths Tn in genealogies of samples of size n = 10. The model employed in the simulations is outlined in the section "A model for a population with time-dependent carrying capacity." It describes a population subject to a time-varying carrying capacity, Kl = K0(1 + ɛ sin(2πνl)). The frequency of the time changes is determined by ν, and l = 1, 2, 3, … labels discrete generations forward in time. The parameter N = K0 describes the typical population size, which is taken here to be equal to the time-averaged carrying capacity. a–e show ρ(Tn) for populations with increasingly rapidly oscillating carrying capacity. The dashed red line in a shows that in the limit of low frequencies the standard coalescent result, Equation 2, is obtained. The dashed red line in e shows that also in the limit of large frequencies the standard coalescent result is obtained, but now with an effective population size. The dashed red line in d is a two-parameter distribution, Equation 41, derived in the section "Comparison between numerical simulations and coalescent predictions." Further numerical and analytical results on the frequency dependence of the moments of these distributions are shown in Figure 4. Parameter values used: K0 = 10,000, ɛ = 0.9, and r = 1 (see the section "A model for a population with time-dependent carrying capacity" for the exact meaning of the intrinsic growth rate r) and (a) νN = 0.001, (b) νN = 0.1, (c) νN = 0.316, (d) νN = 1, and (e) νN = 100.

We show in this article that the results of the simulations displayed in Figure 2 are explained by a general expression (Equation 20) for the moments of the distributions shown in Figure 2. Our general result is obtained within the coalescent approximation valid in the limit of large population size. But we find that in most cases, the coalescent approximation works very well down to small population sizes (a few hundred individuals). Our result enables us to understand and quantitatively describe how the distributions shown in Figure 2 depend upon the frequency of the population-size oscillations. It makes it possible to determine, for example, how the variance, skewness, and kurtosis of these distributions depend upon the frequency of demographic fluctuations. This in turn allows us to compute the population homozygosity and to characterize genetic variation in populations with size fluctuations.

The remainder of this article is organized as follows. The next section summarizes our analytical results for the moments of the total branch length. Following that, we describe the model employed in the computer simulations. Then, corresponding numerical results are compared to the analytical predictions. And finally, we summarize how population-size fluctuations influence the distribution of total branch lengths and conclude with an outlook.

COALESCENT APPROXIMATION FORMULAS FOR THE MOMENTS $\langle T_n^k \rangle$

For the purpose of coalescent approximation it is convenient to introduce a “scaled time” t and a “scaled population size” x(t) by writing

graphic file with name M6.gif (3)

Here N is a suitable counterpart of the constant generation size of the standard Wright–Fisher model assumed to be large. The population is sampled in generation ls corresponding to t = 0, and the time t is now counted backward in units of N generations, as is common in the coalescent picture. Note that Equation 1 translates into

graphic file with name M7.gif (4)

In this section we show how to calculate the moments $\langle T_n^k \rangle$ for the total (scaled) branch length Tn for a given realization of the curve x(t), making use of results obtained by Tavaré (1984).

The starting point is the obvious expression for the total time:

$$T_n = \sum_{j=2}^{n} j\, \tau_j \qquad (5)$$

Here τj denotes the time during which the genealogy has j ancestral lines. For the population with variable size the times τn,…, τ2 all depend upon the sample size n; however, this dependence is not made explicit, either here or in the following. As shown by Griffiths and Tavaré (1994) and Tavaré (2004), the joint distribution of the times τj can be written in terms of the variables $s_j = \tau_n + \tau_{n-1} + \cdots + \tau_j$ for j ≤ n (sj = 0 for j > n):

$$f(\tau_2, \ldots, \tau_n) = \prod_{j=2}^{n} \frac{b_j}{x(s_j)} \exp\left\{-b_j \left[\Lambda(s_j) - \Lambda(s_{j+1})\right]\right\} \qquad (6)$$

Here bj = j(j − 1)/2 and $\Lambda(t) = \int_0^t \mathrm{d}s\, x(s)^{-1}$ is the “population-size intensity function” defined by Griffiths and Tavaré (1994). In a population of constant size, the variables τj are mutually independent. In general this is not the case: Zivkovic and Wiehe (2008), for example, calculated Inline graphic for a time-varying population (Equations 2 and 3 in their article), using Equation 6.
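One practical reading of the Griffiths and Tavaré description is a time change: coalescence times for variable size can be drawn by sampling exponential waiting times in the rescaled time Λ and mapping them back through the inverse of Λ. The sketch below assumes Λ(t) is the integral of 1/x(s) from 0 to t and that the increments of Λ between successive coalescence events are independent exponentials with rates bj. The grid-based inversion, the function names, and the sinusoidal example parameters are illustrative choices, not the authors' procedure.

```python
import numpy as np

def intensity_grid(x, t_max, steps):
    """Tabulate Lambda(t) = integral_0^t ds / x(s) with the trapezoid rule."""
    t = np.linspace(0.0, t_max, steps + 1)
    inv = 1.0 / x(t)
    increments = 0.5 * (inv[1:] + inv[:-1]) * np.diff(t)
    lam = np.concatenate(([0.0], np.cumsum(increments)))
    return t, lam

def sample_coalescence_times(n, t, lam, rng):
    """Return s_n, s_{n-1}, ..., s_2 (increasing), assuming the increments
    Lambda(s_j) - Lambda(s_{j+1}) are independent Exp(b_j) variables."""
    j = np.arange(n, 1, -1)                      # n, n-1, ..., 2
    rates = j * (j - 1) / 2.0                    # b_j
    lam_targets = np.cumsum(rng.exponential(1.0 / rates))
    return np.interp(lam_targets, lam, t)        # invert the monotone map Lambda

def total_branch_length(s, n):
    """T_n = sum_j j * tau_j, where tau_j is the time spent with j lines."""
    j = np.arange(n, 1, -1)
    tau = np.diff(np.concatenate(([0.0], s)))
    return float(np.sum(j * tau))

# Hypothetical example: sinusoidal scaled size with eps = 0.9 and omega = 1.
eps, omega = 0.9, 1.0
t, lam = intensity_grid(lambda u: 1.0 + eps * np.sin(omega * u), t_max=200.0, steps=200_000)
rng = np.random.default_rng(1)
s = sample_coalescence_times(10, t, lam, rng)
T10 = total_branch_length(s, 10)
print(s, T10)
```

The grid must extend well beyond the time to the most recent common ancestor, because `np.interp` clamps targets that exceed the tabulated range.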

Given Equation 5, the kth moment of the distribution of Tn is simply

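$$\langle T_n^k \rangle = k! \sum_{\nu_2 + \cdots + \nu_n = k} \frac{2^{\nu_2}\, 3^{\nu_3} \cdots n^{\nu_n}}{\nu_2!\, \nu_3! \cdots \nu_n!} \left\langle \tau_2^{\nu_2} \tau_3^{\nu_3} \cdots \tau_n^{\nu_n} \right\rangle \qquad (7)$$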

where the variables νj can assume values between 0 and k (subject to the constraint ν2 + ν3 + · · · + νn = k). In the following we show how the correlation functions of arbitrary order appearing in (7) can be calculated in a very simple manner. Consider first the case k = 1. We have

$$\tau_j = \int_0^\infty \mathrm{d}t\; \mathbb{1}_{\{\ell(t) = j\}} \qquad (8)$$

Here ℓ(t) denotes the number of lines for a particular realization of the coalescent process at time t in a sample of size n = ℓ(0). The indicator function in Equation 8 is unity when ℓ(t) = j and zero otherwise. Averaging over realizations gives

$$\langle \tau_j \rangle = \int_0^\infty \mathrm{d}t\; f_{nj}(0, t) \qquad (9)$$

Here fnm(t1, t2) is the conditional probability that n ancestral lines at t1 coalesce to m lines at time t2 > t1.

For a constant population size, the coalescent is invariant under time translations, fnm(t1, t2) = gnm(t2 − t1)H(t2 − t1). Here H(t) = 1 if t > 0 and zero otherwise. The conditional probability gnm(t) was derived by Tavaré (1984). For m ≥ 2 the result is

graphic file with name M17.gif (10)
graphic file with name M18.gif (11)

In the general case of a variable population size, as shown by Griffiths and Tavaré (1994), the conditional probability depends only on the intensity Λ(t2) − Λ(t1) during the time interval [t1, t2]:

$$f_{nm}(t_1, t_2) = g_{nm}\bigl(\Lambda(t_2) - \Lambda(t_1)\bigr)\, H(t_2 - t_1) \qquad (12)$$
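To make the role of the transition probabilities concrete, the sketch below evaluates gnm for constant size and uses the time change t → Λ(t) to obtain the mean total branch length as the sum over j of j times the integral of gnj(Λ(t)) over t. Two assumptions are made: gnm is computed from the general formula for a pure-death chain with distinct rates bj (this is not necessarily the form in which Equations 10 and 11 are written), and the integrals are approximated on a finite grid. All function names and parameter values are hypothetical.

```python
import numpy as np

def b(j):
    """Pair-coalescence rate b_j = j(j-1)/2."""
    return j * (j - 1) / 2.0

def g(n, m, t):
    """P(m lines after scaled time t | n lines initially), constant size.

    General pure-death-chain formula with distinct rates b_n > ... > b_1 = 0."""
    t = np.asarray(t, dtype=float)
    if not 1 <= m <= n:
        return np.zeros_like(t)
    states = range(m, n + 1)
    prefactor = np.prod([b(i) for i in range(m + 1, n + 1)])
    total = 0.0
    for k in states:
        denom = np.prod([b(i) - b(k) for i in states if i != k])
        total = total + np.exp(-b(k) * t) / denom
    return prefactor * total

def mean_total_branch_length(n, t, lam):
    """<T_n> = sum_j j * int dt g_nj(Lambda(t)), trapezoid rule on the grid t."""
    dt = np.diff(t)
    mean = 0.0
    for j in range(2, n + 1):
        f = g(n, j, lam)
        mean += j * np.sum(0.5 * (f[1:] + f[:-1]) * dt)
    return mean

t = np.linspace(0.0, 100.0, 200_001)
eps, omega = 0.9, 10.0                     # hypothetical oscillation parameters
inv_x = 1.0 / (1.0 + eps * np.sin(omega * t))
lam = np.concatenate(([0.0], np.cumsum(0.5 * (inv_x[1:] + inv_x[:-1]) * np.diff(t))))
mean_constant = mean_total_branch_length(10, t, t)       # Lambda(t) = t
mean_sinusoidal = mean_total_branch_length(10, t, lam)
print(mean_constant, mean_sinusoidal)
```

For constant size the result should agree with the standard value 2(1 + 1/2 + … + 1/(n − 1)), which provides a simple consistency check of the grid resolution.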

Now consider the case k = 2. For i > j we have simply

graphic file with name M20.gif (13)

because the second indicator function vanishes when t2 < t1. Averaging over realizations we find

graphic file with name M21.gif (14)

In deriving this result we used the multiplicative rule

graphic file with name M22.gif (15)

For i = j, by contrast, we find

graphic file with name M23.gif (16)

which upon averaging yields

graphic file with name M24.gif (17)

More general correlation functions are readily obtained in terms of multiple integrals over the functions fnm. Inserting into (7) we see that the combinatorial factors $(\nu_2!)^{-1} \cdots (\nu_n!)^{-1}$ cancel, and we obtain

graphic file with name M25.gif (18)

Equation 18 provides an explicit expression for the moments of the total branch lengths Tn in populations with population-size variations. The results can be written in a recursive form, particularly convenient for numerical computations,

graphic file with name M26.gif (19)

with initial conditions Inline graphic for m ≥ 2 and Inline graphic for k ≥ 1. Here Tn(t) is the total time corresponding to the genealogy of n sequences sampled at time t in the past given a population-size curve x(t). Note that t = 0 corresponds to the present time, so that Tn(0) ≡ Tn. In a population of constant size, Tn(t) is independent of t.

Equation 18 or 19 expresses the kth moment of Tn in terms of a 2k-fold sum [according to (10) each factor of Inline graphic contains a sum over ji]. Equation 18 can be further simplified by explicitly performing the sums over m1, …, mk. This results in

graphic file with name M30.gif (20)

The coefficients are determined by recursion:

graphic file with name M31.gif (21)
graphic file with name M32.gif (22)

For the particular case k = 1 our result corresponds to an expression derived by Austerlitz et al. (1997) and Slatkin (1996) and also to the result obtained by summing Equation 1 in Zivkovic and Wiehe (2008). For k = 2, the coefficients Inline graphic are tabulated in Figure A1 in the appendix for small values of n. In general, the nested integrals in Equation 20 cannot be simplified further; their form expresses the correlations of the times τj due to population-size variations.

Figure A1.—

Coefficients Inline graphic occurring in Equation 20 for n = 2,…, 10. Coefficients for odd values of j2 vanish.

Finally note that for n = 2, Equation 18 can be evaluated as follows:

graphic file with name M34.gif (23)

This representation demonstrates how the expression (18) simplifies when k > n.
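The case n = 2 is simple enough to evaluate directly. With T2 = 2τ2 and, following the Griffiths and Tavaré description, P(τ2 > t) = exp[−Λ(t)], the kth moment of τ2 is k times the integral of t^(k−1) exp[−Λ(t)]. The sketch below assumes this representation (Equation 23 may be written differently) and a finite integration range; `moments_T2` and the example parameters are hypothetical.

```python
import numpy as np

def moments_T2(x, k_max, t_max=60.0, steps=600_000):
    """Moments <T_2^k>, k = 1..k_max, assuming P(tau_2 > t) = exp(-Lambda(t))."""
    t = np.linspace(0.0, t_max, steps + 1)
    dt = t[1] - t[0]
    inv = 1.0 / x(t)
    lam = np.concatenate(([0.0], np.cumsum(0.5 * (inv[1:] + inv[:-1]) * dt)))
    survival = np.exp(-lam)
    out = []
    for k in range(1, k_max + 1):
        integrand = k * t ** (k - 1) * survival
        # trapezoid rule for <tau_2^k>; T_2 = 2 * tau_2 gives the factor 2^k
        integral = dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
        out.append(2.0 ** k * integral)
    return out

constant = moments_T2(lambda u: np.ones_like(u), k_max=3)
eps, omega = 0.9, 500.0   # hypothetical rapidly oscillating size
fast = moments_T2(lambda u: 1.0 + eps * np.sin(omega * u), k_max=1)
print(constant, fast)
```

For constant size this reproduces the exponential-distribution moments, and for rapid oscillations the mean approaches the constant-size value divided by the time average of 1/x.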

We conclude this section by briefly describing three different scenarios where our main result (Equation 18) is applicable. First, Sjödin et al. (2005) discussed a model where the scaled population size x(t) defined by Equation 3 may assume two values, 1 and x. The population size randomly jumps from 1 to x at rate λ and back at rate λx. Initially the population size is x(0) = 1. Our result (Equation 18) is directly applicable to a given realization of the random process x(t). We denote the ensemble average over realizations of x(t) by Inline graphic. By averaging Equation 18 over the corresponding distribution of Λ we find

graphic file with name M36.gif (24)

Higher moments can be obtained in a similar fashion. This provides explicit expressions for the fluctuations of Tn in the case of slow, fast, and intermediate population-size changes. This model is particularly suited to examine the limit of fast population-size fluctuations (λ → ∞). As expected, the standard Kingman coalescent, Equation 2, is recovered but now with an effective population size Neff = N/c with $c = (1 + x^{-1})/2$.

Second, intermediate population-size variations over many generations give rise to deviations from the standard Kingman behavior. The deviations are expected to be most significant when the timescale of the size variations is comparable to the times between coalescent events. Such intermediate population-size variations are commonly interpreted as due to a changing environment. In this case it is inappropriate to average over an ensemble of random population size curves x(t). The task is instead to describe the fluctuations of Tn conditional on a particular, externally imposed form of x(t). An example is the question: How does a recent bottleneck influence the distribution of Tn? To compute the kth moment of Tn, a k-fold integration is required. In general this must be performed numerically. However, in the case of piecewise constant functions x(t) the multiple integrals are straightforward to evaluate. If, on the other hand, the function x(t) is sufficiently “smooth,” the multiple integrals can be evaluated in closed form in the limits of slowly and rapidly varying population sizes as demonstrated below.

Third, in general stochastic population dynamics subject to a slowly changing environment may exhibit both slow changes due to an externally imposed change of the environment (in the form of a time-changing carrying capacity, for example) and “fast” (generation-to-generation) changes due to the random population dynamics. In the next two sections such a model is introduced and analyzed by means of Equation 18. The analysis is simplified by the observation that the fast size variations are irrelevant when their amplitude remains small. In this case Equation 18 may be evaluated using a deterministic population-size curve that is averaged over the fast changes. In the model discussed in the next two sections this curve is given by the deterministic time dependence of the carrying capacity.

A MODEL FOR A POPULATION WITH TIME-DEPENDENT CARRYING CAPACITY

The purpose of this section is to describe a modified Wright–Fisher model with a fluctuating carrying capacity. This model is used in the numerical simulations of sample genealogies described in the next section. Recall the three key assumptions of the Wright–Fisher model: (a) constant population size, (b) discrete, nonoverlapping generations, and (c) a symmetric multinomial distribution of family sizes. We have adopted the following approach: in our simulations, assumptions b and c are still satisfied, but assumption a is relaxed.

We study a large but finite population of fluctuating size Nl, where l = 1, 2,… labels the discrete, nonoverlapping generations forward in time. The model we have adopted is the following: consider a generation l consisting of Nl individuals. The number of individuals in generation l + 1 is then given by

$$N_{l+1} = \sum_{j=1}^{N_l} \xi_j \qquad (25)$$

where the random family sizes ξj are independent and identically distributed random variables having a Poisson distribution with parameter λl (specified below). Consequently the number Nl+1 is Poisson distributed with mean Nlλl.

This model exhibits a fluctuating population size Nl, rapidly changing from generation to generation. As pointed out in the Introduction, in large populations such fluctuations are averaged over by the ancestral coalescent process and can be captured in terms of an effective population size. The resulting genealogies are simply described by Kingman's coalescent for a constant effective population size of the form (1) or (4).

Interesting population-size fluctuations occur on larger timescales, corresponding to slow variations of the population size over several generations. Such slow changes are most commonly interpreted as consequences of a changing environment. A natural model for such changes is to impose a finite carrying capacity Kl that varies as a function of generation index l. This is the approach adopted in the following, and we choose

graphic file with name M39.gif (26)

for a certain parameter value r > 0. Here Kl+1 is the carrying capacity in generation l + 1. If the environmental changes affected the population through fertility variations, Kl+1 would be replaced by Kl in Equation 26. Equation 26 is chosen so that the population ceases to grow on average when the carrying capacity is reached (λl = 1 for Nl = Kl+1). When the population size is small and Inline graphic, the population growth follows the logistic law, λl = 1 + r(1 − Nl/Kl+1), where r is the intrinsic growth rate. The particular form of Equation 26 ensures that λl > 0.

Note that fluctuations of Nl in this model are due to two different sources: rapid fluctuations are caused by the randomness of the family sizes, and slow fluctuations are caused by the time dependence of the carrying capacity. Our choice for the time dependence of Kl is dictated by the following considerations. The aim is to describe the influence of a fluctuating population size upon the statistics of genetic variation. To this end we need to consider the functional form of Kl. A simple choice for Kl is a periodically varying function, such as

$$K_l = K_0 \left[1 + \varepsilon \sin(2\pi\nu l)\right] \qquad (27)$$

Note that a more complex dependence of Kl upon l can be obtained from superpositions of such functions with different amplitudes ɛ and frequencies ν. Here we use simply (27) and investigate how the statistics of genetic variation in a sample depend upon frequency of the fluctuations in Kl.

Figure 3 shows a realization of a curve Nl obtained in this manner (the choice of parameters is given in the Figure 3 legend). Figure 3 clearly exhibits fluctuations in Nl on two timescales, fast and slow. As explained above, the fast fluctuations are irrelevant provided their amplitude is sufficiently small. In this case we expect that the distribution of Tn can be described by a population-size curve that is averaged over the fast fluctuations. In the present model, averaging over the fast fluctuations results in a deterministic population-size curve determined by the carrying capacity (27). This curve is shown in Figure 3 as a dashed line.
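A forward simulation of the population sizes can be sketched as follows. The Poisson update and the sinusoidal carrying capacity follow the description in the text, but the growth factor λl is an assumption: the form (1 + r)/(1 + r Nl/Kl+1) is used here only because it satisfies the stated properties (λl = 1 at Nl = Kl+1, λl > 0, and approximately logistic growth for small Nl), and it may differ from the authors' Equation 26. All function names are hypothetical.

```python
import numpy as np

def carrying_capacity(l, k0, eps, nu):
    """K_l = K_0 [1 + eps sin(2 pi nu l)], as quoted in the Figure 2 legend."""
    return k0 * (1.0 + eps * np.sin(2.0 * np.pi * nu * l))

def growth_factor(n_l, k_next, r):
    """ASSUMED form of lambda_l: equals 1 at N_l = K_{l+1} and is always positive."""
    return (1.0 + r) / (1.0 + r * n_l / k_next)

def simulate_sizes(generations, k0, eps, nu, r, rng):
    """N_{l+1} ~ Poisson(N_l * lambda_l), i.e. a sum of N_l Poisson family sizes."""
    sizes = np.empty(generations + 1, dtype=np.int64)
    sizes[0] = int(k0)
    for l in range(generations):
        k_next = carrying_capacity(l + 1, k0, eps, nu)
        lam = growth_factor(sizes[l], k_next, r)
        sizes[l + 1] = rng.poisson(sizes[l] * lam)
    return sizes

rng = np.random.default_rng(2)
k0, eps, r = 10_000, 0.9, 1.0
nu = 1.0 / k0                      # N * nu = 1 with N = K0
sizes = simulate_sizes(20_000, k0, eps, nu, r, rng)
print(sizes.min(), sizes.max())
```

Plotting `sizes` against `carrying_capacity(np.arange(20_001), k0, eps, nu)` shows the two timescales discussed in the text: small generation-to-generation noise on top of the slow oscillation.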

Figure 3.—

One realization of the curve Nl obtained from simulations of the model described in the section "A model for a population with time-dependent carrying capacity" (black solid line). Choice of parameters: r = 1, K0 = 100, ɛ = 0.9, and Nν = 1, with N = K0. Also shown is an average over the fast fluctuations. The upper horizontal axis illustrates where the population is sampled and how time is counted backward in the coalescent approximation.

Note that conditional on the sequence of population sizes, the genealogy of a set of individuals sampled in generation l can be determined recursively by randomly choosing ancestors in the preceding generations. This is ensured by the assumption that, conditioned on the values of Nl and Nl+1, the family sizes follow a symmetric multinomial distribution Inline graphic. The resulting correspondence with the Wright–Fisher rule of reproduction ensures that the genealogies can be determined recursively in the way suggested above.
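The recursive construction of the genealogy can be sketched in a few lines: given the population sizes going backward in time, every lineage picks a parent uniformly at random in the preceding generation, and lineages that pick the same parent merge. This is a minimal illustration under the Wright–Fisher rule described in the text; the function name and the constant example history are hypothetical.

```python
import numpy as np

def branch_length_in_generations(sizes_backward, n, rng):
    """Total branch length (in generations) of a sample of n individuals.

    sizes_backward[0] is the size of the sampled generation and
    sizes_backward[i] the size i generations earlier."""
    lineages = n
    length = 0
    for parents in sizes_backward[1:]:
        if lineages == 1:
            break
        length += lineages                                  # one generation per lineage
        chosen = rng.integers(0, parents, size=lineages)    # uniform choice of parents
        lineages = len(np.unique(chosen))                   # shared parents coalesce
    return length, lineages

rng = np.random.default_rng(3)
sizes_backward = np.full(5_000, 100)      # hypothetical constant history, N = 100
length, remaining = branch_length_in_generations(sizes_backward, 5, rng)
scaled_T = length / 100                   # total branch length in units of N generations
print(length, remaining, scaled_T)
```

Replacing the constant history by a simulated sequence of sizes Nl, reversed so that the sampled generation comes first, gives one realization of the total branch length for the fluctuating population.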

In the next section we analyze results of three sets of 10,000 computer simulations for this population model with parameters r = 1 and ɛ = 0.9 and for a range of values for ν. The three sets differ in the values for the average carrying capacity that are chosen to be K0 = 100, K0 = 1000, and K0 = 10,000. The population is sampled in generation ls (see Figure 3), which is chosen so that 2νls becomes an odd natural number. This implies that $K_{l_s} = K_0$ and that the population size was declining toward $K_0$ in the most recent past.

COMPARISON BETWEEN NUMERICAL SIMULATIONS AND COALESCENT PREDICTIONS

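In this section we discuss the numerically computed distributions shown in Figure 2 in terms of Equation 18. The shapes observed in Figure 2 are conveniently characterized in terms of their mean $\langle T_n \rangle$, variance, skewness, and kurtosis:

$$\mathrm{var}(T_n) = \langle T_n^2 \rangle - \langle T_n \rangle^2$$
$$\mathrm{skewness} = \frac{\left\langle (T_n - \langle T_n \rangle)^3 \right\rangle}{\mathrm{var}(T_n)^{3/2}} \qquad (28)$$
$$\mathrm{kurtosis} = \frac{\left\langle (T_n - \langle T_n \rangle)^4 \right\rangle}{\mathrm{var}(T_n)^{2}} \qquad (29)$$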

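Recall that for a normal distribution the skewness vanishes, and the kurtosis equals three. We can write the skewness and kurtosis in terms of the moments $\langle T_n^k \rangle$ using

$$\left\langle (T_n - \langle T_n \rangle)^3 \right\rangle = \langle T_n^3 \rangle - 3 \langle T_n \rangle \langle T_n^2 \rangle + 2 \langle T_n \rangle^3, \qquad \left\langle (T_n - \langle T_n \rangle)^4 \right\rangle = \langle T_n^4 \rangle - 4 \langle T_n \rangle \langle T_n^3 \rangle + 6 \langle T_n \rangle^2 \langle T_n^2 \rangle - 3 \langle T_n \rangle^4$$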
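The conversion from raw moments to the four shape statistics is a standard identity and can be sketched as below. The function name `shape_statistics` is hypothetical, and the kurtosis is the non-excess version, so that a normal distribution gives three.

```python
def shape_statistics(m1, m2, m3, m4):
    """Mean, variance, skewness, and (non-excess) kurtosis from raw moments m_k = <T^k>."""
    var = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3                        # third central moment
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4     # fourth central moment
    return m1, var, mu3 / var ** 1.5, mu4 / var ** 2

# Check on two textbook cases: a unit exponential (raw moments k!) and a standard normal.
exponential = shape_statistics(1, 2, 6, 24)
normal = shape_statistics(0, 1, 0, 3)
print(exponential, normal)
```

The unit exponential gives variance 1, skewness 2, and kurtosis 9, and the standard normal gives skewness 0 and kurtosis 3, which serves as a sanity check before the function is applied to moments of Tn.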

The moments $\langle T_n^k \rangle$ are evaluated by means of Equations 18 and 19 as functionals of the scaled population size x(t) followed backward in time. With stochastically fluctuating sizes the scaled population size x(t) also becomes a random process. As Figure 3 indicates, the random fluctuations around the deterministic carrying capacity function are relatively small and we expect that such generation-to-generation fluctuations are irrelevant for the distribution of Tn. We therefore disregard the fast (random) fluctuations of the population sizes and define the function x(t) deterministically by

$$x(t) = 1 + \varepsilon \sin(\omega t) \qquad (30)$$

This is obtained from an analog of Equation 3 when the population is sampled in generation ls (as indicated in Figure 3):

graphic file with name M53.gif (31)

Here (27) was used with ω = 2πνN, and N = K0. Note that the particular form (30) of x(t) depends upon when the population is sampled. Had the population been sampled at a different time, a different curve x(t) could have resulted, leading in turn to a different distribution ρ(Tn) of Tn, since the distribution depends, for example, upon whether most recently the population was expanding or declining.
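The dependence on the sampling time can be made explicit with a short numerical check. The sketch assumes the sinusoidal carrying capacity quoted in the Figure 2 legend, Kl = K0(1 + ɛ sin(2πνl)), reads the scaled size backward in time as x(t) = K(ls − Nt)/K0, and ignores the rounding of Nt to an integer generation. The function name and parameter values are hypothetical.

```python
import numpy as np

def scaled_size_from_capacity(t, n_scale, eps, nu, l_s):
    """x(t) = K_{l_s - N t} / K_0 for a sinusoidal carrying capacity (no rounding)."""
    l = l_s - n_scale * np.asarray(t, dtype=float)
    return 1.0 + eps * np.sin(2.0 * np.pi * nu * l)

n_scale, eps = 10_000, 0.9
nu = 1.0 / n_scale                      # N * nu = 1
omega = 2.0 * np.pi * nu * n_scale      # scaled frequency
t = np.linspace(0.0, 5.0, 1001)

l_s_odd = 3 * n_scale // 2              # 2 * nu * l_s = 3 (odd): sampled during a decline
l_s_even = n_scale                      # 2 * nu * l_s = 2 (even): sampled during an expansion
x_odd = scaled_size_from_capacity(t, n_scale, eps, nu, l_s_odd)
x_even = scaled_size_from_capacity(t, n_scale, eps, nu, l_s_even)

print(np.allclose(x_odd, 1.0 + eps * np.sin(omega * t)))
print(np.allclose(x_even, 1.0 - eps * np.sin(omega * t)))
```

Sampling half a period later flips the sign of the sine term, so the backward-in-time curve starts by decreasing instead of increasing, which is one concrete way in which the sampling time changes the distribution of Tn.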

Our results are summarized in Figure 4. It shows how the mean, variance, skewness, and kurtosis of the distribution of Tn depend on the scaled frequency ω of the population size variation, Equation 30. Shown are results of numerical simulations of the model described in the previous section (symbols) and results obtained within the coalescent approximation using Equation 19. We observe that the coalescent approximation describes the results of the numerical simulations well, even for small population sizes.

Figure 4.—

Mean (a), variance (b), skewness (c), and kurtosis (d) of the distribution of Tn for samples of size n = 10, as a function of the frequency of the population-size fluctuations. Shown are results of numerical simulations (10,000 simulations with N = K0, K0 = 100, triangles; K0 = 1000, diamonds; and K0 = 10,000, circles) as well as results computed within the coalescent approximation described in the section "Coalescent approximation formulas for the moments" (red solid lines). Black dashed-dotted and dashed lines show the approximations for small frequencies (Equations 33 and 34) and for large frequencies (Equations 39 and 40). The expressions for the limiting behaviors of the skewness and the kurtosis are shown in c and d, but are not given in the text. The remaining parameter values are r = 1 and ɛ = 0.9, as in Figure 2.

In the numerical simulations we have found that, for very small population sizes, random fluctuations of Nl around the time-dependent carrying capacity Kl become increasingly important. Since we suspected that the small deviations observed in Figure 4a for K0 = 100 were due to such fluctuations, we performed slightly modified simulations imposing a deterministic law upon Nl by forcing Nl = Kl in every generation [where Kl is given by (27)]. Comparison of the corresponding results (not shown) with Figure 4a indicates that the deviations for K0 = 100 at large frequencies are indeed caused by the stochastic fluctuations in the population dynamics underlying Figure 4a. A different interpretation of this effect is the following: when the population size is very small, and when ɛ is close to unity, the population may exhibit a nonnegligible probability of becoming extinct during the expected time to the most recent common ancestor for a sample of size n. In this case we have conditioned on the existence of the population during 100K0 generations using rejection sampling. In practice this avoids extinction, but it leads to a biased size distribution.

Consider now the frequency dependence of the moments shown in Figure 4. It can be qualitatively and quantitatively understood using Equation 20 together with the following expression for Λ(t):

graphic file with name M54.gif (32)

Here ⌈z⌉ is the smallest integer larger than z. Next we discuss the asymptotic formulas for small and large frequencies ω.

In the limit of Inline graphic, Equation 32 simplifies to Inline graphic. Inserting this into (20) we find approximately

graphic file with name M57.gif (33)

Here Inline graphic and Inline graphic. Equation 33 is shown in Figure 4a as a dashed-dotted line. For the variance we find the approximate expression

graphic file with name M60.gif (34)

with Inline graphic. The limiting value for zero frequency is that of the standard coalescent with constant population size. Equation 34 is shown in Figure 4b as a dashed-dotted line. Similarly the standard results for the constant-size coalescent are obtained for the skewness and for the kurtosis in the limit of Inline graphic. This limiting behavior is illustrated in Figure 2a, which shows that the distribution of Tn approaches that for Kingman's coalescent for a constant population size in the limit of small frequencies. We note that for Inline graphic, the population-size dependence is essentially that of a declining population, because the time to the most recent common ancestor is reached before the first maximum in x(t) going backward in time (see Figure 3 and Equation 30).

Of particular interest is the limit of large frequencies, as we now show. As ω→∞, one expects that the coalescent process averages over the population-size oscillations, and the standard coalescent process with a constant effective population size should be obtained. For large but finite frequencies, by contrast, Figure 4a exhibits deviations from the standard coalescent behavior. In the following we analyze the behavior of the moments in this regime. In the limit of large frequencies, Equation 32 simplifies to

graphic file with name M64.gif (35)

For large frequencies, the function Λ(t) is well approximated by a shifted linear function

graphic file with name M65.gif (36)

Here c determines the effective population size according to Equation 4: Neff = N/c with

$$c = \frac{1}{\sqrt{1 - \varepsilon^2}} \qquad (37)$$
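The value of c can be checked numerically. The sketch assumes the sinusoidal scaled size x(t) = 1 + ɛ sin(ωt), for which the average of 1/x over one period is the standard integral 1/√(1 − ɛ²), and it takes c to be this time average, consistent with Neff = N/c being a harmonic mean. The function name is hypothetical.

```python
import numpy as np

def c_numeric(eps, points=200_000):
    """Average of 1 / (1 + eps sin(theta)) over one full period (uniform grid)."""
    theta = np.linspace(0.0, 2.0 * np.pi, points, endpoint=False)
    return float(np.mean(1.0 / (1.0 + eps * np.sin(theta))))

eps = 0.9
c = c_numeric(eps)
c_closed_form = 1.0 / np.sqrt(1.0 - eps ** 2)
n_eff = 10_000 / c          # effective size for N = K0 = 10,000
print(c, c_closed_form, n_eff)
```

For ɛ = 0.9 this gives c close to 2.29, so the effective population size is less than half of the time-averaged carrying capacity.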

The parameter c describes the influence of the demographic fluctuations upon the part of the genealogy in the distant past. The small offset

graphic file with name M67.gif (38)

describes the influence of demographic changes on the most recent part of the genealogy. Inserting the approximation (36) into (20) we find for large frequencies (and when the amplitude ɛ is not too close to unity)

graphic file with name M68.gif (39)

The first term in (39) is the expected time of Kingman's coalescent for a constant effective population size Neff = N/c. The curve corresponding to (39) is shown as a dashed line in Figure 4a. We infer that corrections to the standard coalescent result are significant when the sample size is large, the amplitude of the size oscillations is not too small, and the frequency ω is of order unity. This is consistent with the results of Sjödin et al. (2005).

We now discuss the behavior of the variance shown in Figure 4b. For the second moment we find

graphic file with name M69.gif (40)

The first term in Equation 40 corresponds to the second moment of Tn in Kingman's coalescent with a constant effective population size. The second term in (40) represents a correction due to finite but large frequencies; it depends in a simple fashion on the effective population-size parameter c and on the sample size n.

Comparing Equations 39 and 40, we conclude that the corresponding correction to the variance var(Tn) vanishes: to the order considered, the finite-frequency correction to the second moment is canceled by the correction to the squared mean. This is consistent with the fact that, at large frequencies, the variance of Tn is surprisingly insensitive to changes in frequency (as opposed to the behavior of the mean of Tn; see Figure 4, a and b). In fact, the limiting value (shown in Figure 4b as a dashed line) is a very good approximation to var(Tn) down to ω ≈ 3.

Consider now the skewness and the kurtosis shown in Figure 4, c and d. Their behavior is similar to that of the variance: for ω > 3, the skewness and the kurtosis are essentially independent of ω. The results shown in Figure 4 imply that over a large range of frequencies, the distribution of the total branch lengths Tn can be approximated as follows: the distribution is essentially that of the standard Kingman coalescent with an effective population size, but the distribution is shifted such that its mean is given by Equation 39, rather than by 2hn/c.

One may wonder when this “rigid shift” occurs. Given Equation 18 it is straightforward to work out the fluctuations of the times τj within the approximation (36). We find that for j < n, the expected value of τj is exactly that of the standard Kingman coalescent with effective population size. But for j = n it is rigidly shifted by −Λ0/c. This indicates that the genealogies are essentially those of the standard coalescent, but modified by an initial rigid shift. In the parameter regime discussed here, the distribution of times is expected to be well approximated by a two-parameter family of distributions,

graphic file with name M71.gif (41)

when zc > −nΛ0, and P(Tn < z) ≈ 0 for smaller values of z. The first parameter, c, determines the effective population size. It parameterizes the slope of the function Λ(t) at large times and describes the demographic effect on the distant past of the genealogy. The second parameter, Λ0, describes the influence of the demographic fluctuations on the initial part of the sample genealogy. This parameter can be negative (recent population decline; this is the case shown in Figure 3) or positive (recent population expansion). When Λ0 > 0, the distribution ρ(Tn) is rigidly shifted to the left. In this case the approximation (36) is expected to break down when the body of the distribution reaches Tn = 0.

Note that the distribution (41) cannot be described by a single parameter (a “generalized effective population size”). The approximation (41) was used to generate the red dashed curve in Figure 2d.
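The rigid shift can be checked by a small Monte Carlo experiment. The sketch below is a hedged illustration, not the simulation used for Figure 2. It assumes x(t) = 1 + ɛ sin(ωt) going backward in time, draws coalescence times on the Λ scale as sums of exponential waiting times with rates j(j − 1)/2, and maps them to real time by numerically inverting Λ(t). The same random numbers are reused for a constant-size coalescent with Neff = N/c, so that the mean difference isolates the shift, which is compared with −nΛ0/c (our reading of the approximation discussed above). Function names and parameter values are hypothetical.

```python
import numpy as np


def lambda_on_grid(eps, omega, t_max, pts_per_period=200):
    """Tabulate Lambda(t) for the ASSUMED x(t) = 1 + eps*sin(omega*t)."""
    period = 2.0 * np.pi / omega
    n_periods = int(np.ceil(t_max / period))
    t = np.linspace(0.0, n_periods * period, n_periods * pts_per_period + 1)
    inv_x = 1.0 / (1.0 + eps * np.sin(omega * t))
    steps = 0.5 * (inv_x[1:] + inv_x[:-1]) * np.diff(t)   # trapezoid rule
    lam = np.concatenate(([0.0], np.cumsum(steps)))
    # Slope c and offset lam0 estimated from the first full period.
    k = pts_per_period
    c = lam[k] / period
    resid = lam[: k + 1] - c * t[: k + 1]
    lam0 = np.sum(0.5 * (resid[1:] + resid[:-1]) * np.diff(t[: k + 1])) / period
    return t, lam, c, lam0


def compare_shift(n, eps, omega, n_samples=100_000, seed=1, t_max=40.0):
    """Return (observed mean shift, predicted shift -n*lam0/c, variance ratio)."""
    rng = np.random.default_rng(seed)
    lines = np.arange(n, 1, -1)                 # j = n, n-1, ..., 2
    rates = lines * (lines - 1) / 2.0           # coalescence rate with j lines
    e = rng.exponential(1.0 / rates, size=(n_samples, n - 1))
    s = np.cumsum(e, axis=1)                    # coalescence times on the Lambda scale
    t, lam, c, lam0 = lambda_on_grid(eps, omega, t_max)
    t_coal = np.interp(s, lam, t)               # invert Lambda(t) by interpolation
    tau = np.diff(t_coal, axis=1, prepend=0.0)  # times tau_j spent with j lines
    total_exact = tau @ lines                   # T_n = sum_j j * tau_j
    total_kingman = (e @ lines) / c             # constant size N_eff = N / c
    observed = np.mean(total_exact - total_kingman)
    predicted = -n * lam0 / c
    ratio = np.var(total_exact) / np.var(total_kingman)
    return observed, predicted, ratio


if __name__ == "__main__":
    obs, pred, ratio = compare_shift(n=4, eps=0.5, omega=100.0)
    print("observed shift:", obs, " predicted shift:", pred)
    print("variance ratio:", ratio)
```

In this sketch the frequency is chosen much larger than the largest coalescence rate n(n − 1)/2, which is the regime where the shifted linear form is expected to hold. The variance ratio stays close to one, in line with the observation that the variance is insensitive to the shift.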

CONCLUSIONS

The aim of this article was to investigate how the frequency of population-size fluctuations determines the shape of the distribution of total branch lengths of sample genealogies and thus of statistical measures of genetic variation.

We performed simulations for a modified Wright–Fisher model of a population subject to a time-periodically varying carrying capacity and determined the distribution of the total branch lengths, shown in Figure 2. We characterized how the shapes of the distributions depend upon the frequency of the population-size fluctuations by computing the frequency dependence of the moments of these distributions. We could explain these dependencies in terms of coalescent approximations. In particular, we derived a general expression, Equation 20, for the moments of Tn in populations subject to smooth population changes of otherwise arbitrary form.

Our results show how quickly (or slowly) the standard coalescent result for a constant (effective) population size is recovered in the limits of large and small frequencies. More importantly, our coalescent results allow us to determine how significant the deviations are at large but finite frequencies. In this regime we argued that the distribution of Tn is essentially that of the standard Kingman coalescent with an effective population size Neff = N/c, but with a shifted mean value

graphic file with name M73.gif (42)

The first term on the rhs corresponds to the result of the standard Kingman coalescent with a constant effective population size. The second term on the rhs is the correction term resulting from the population-size variations (ɛ is the amplitude of the population-size oscillations, ω is its frequency, and n is the sample size). We infer that corrections to the standard coalescent result are largest when the sample size is large and the amplitude ɛ of the size fluctuations is not very small. This is consistent with the results of Sjödin et al. (2005).

Last but not least, we found that the coalescent approximation yields a reliable description of the numerical data, even for very small populations.

We close with a number of remarks. First, Equation 20 is easily generalized to describe the moments of observables that are polynomial functions of the times τj. Particularly simple is the case of observables A that are linear functions of the times τj, An = Σj aj τj. In this case the kth moment of An is given by Equation 20, but with modified coefficients: the factors m in Equations 21 and 22 are replaced by am.

Second, some observables [such as the F-statistic (Fu and Li 1993)] can be written as linear functions of τj, but with random coefficients. In this case too it is possible to explicitly compute the moments of the distribution of the observable. These two questions are addressed in a separate article (S. Sagitov, M. Rafajlovic, B. Mehlig, and A. Eriksson, unpublished results).

Third, Equation 19 allows us to determine in a transparent fashion how the fluctuations of Tn depend upon the time at which the population is sampled. This will make it possible to discuss, for example, how Tajima's D-statistic or the F-statistic depends upon the time of sampling after a bottleneck, a population expansion, or a decline.

Fourth, population-size fluctuations are sampled nonuniformly by the genealogies: initial coalescent events occur at faster rates and are thus more sensitive to recent size fluctuations. Remote coalescent events, by contrast, occur at slower rates, thus damping the effect of size fluctuations in the distant past. We therefore expect significant deviations from the standard coalescent behavior arising from the most recent history for large sample sizes n. It would be interesting to quantify this expectation by computing the covariances and higher moments of the times τj during which the sample genealogy has j lines. First, for large i and j (close to n), we expect to observe strong correlations between τi and τj and thus deviations from the standard Kingman coalescent. Second, for small values of i and j, we expect the times τi and τj to decorrelate and to follow the distribution of the standard coalescent (with an effective population size).
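The difference in rates between recent and remote coalescent events can be made explicit for the standard coalescent with constant population size. The short sketch below uses the textbook values (time measured in units of N generations, pair coalescence rate equal to one): with j lines the coalescence rate is j(j − 1)/2, the expected duration is 2/(j(j − 1)), and the expected total branch length is 2hn with hn = 1 + 1/2 + … + 1/(n − 1). The function names are hypothetical.

```python
from fractions import Fraction


def coalescence_rate(j):
    """Rate at which j lines coalesce to j - 1 lines (standard coalescent)."""
    return Fraction(j * (j - 1), 2)


def expected_tau(j):
    """Expected time during which the genealogy has j lines."""
    return 1 / coalescence_rate(j)


def expected_total_length(n):
    """Expected total branch length, sum_j j * E[tau_j] = 2 * h_n."""
    return sum(j * expected_tau(j) for j in range(2, n + 1))


if __name__ == "__main__":
    n = 10
    for j in range(n, 1, -1):
        # Fast, short stages come first (large j); slow, long stages last (small j).
        print(j, "lines: rate", coalescence_rate(j), " expected duration", expected_tau(j))
    print("expected total branch length:", expected_total_length(n))
```

For n = 10 the first stage lasts on average 1/45 of a time unit, whereas the last stage (two lines) lasts on average one full time unit. Size fluctuations are therefore resolved on very different timescales in the recent and the remote parts of the genealogy.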

Fifth, the model introduced in the section “A model for a population with time-dependent carrying capacity” assumes a carrying capacity that varies sinusoidally, with a single frequency. It turns out, however, that our findings (summarized in Equation 41) are valid for arbitrary time-dependent fluctuations with a sufficiently strong and narrow mode at high frequencies. Examples are linear combinations of high-frequency oscillations or stochastic fluctuations around a constant population size, with a sufficiently narrow frequency spectrum. In this case, too, we expect that Λ(t) is well approximated by (36). If this is the case, the distribution of times is of the form (41) when Λ0 is small.

Taken together, the results derived in this article give a rather complete understanding of the fluctuations of empirical observables due to population size variations. These results will be significant when attempting to disentangle the effects of population-size variations from other factors influencing genetic variation.

Our results raise the question of the circumstances under which the deviations from standard coalescent behavior due to population-size fluctuations (Figures 2 and 4) are most likely to strongly affect the interpretation of empirical data. As our analysis indicates, the deviations become substantial when the frequency ω = 2πνN is of order unity. Here ν is the frequency of the population-size variations, Equation 30, and N is a suitable measure of the population size. In other words, rapid population-size fluctuations will have the strongest effect (other than simply determining the effective population size) in small local subpopulations with restricted gene flow between subpopulations with different fluctuations. The deviations are expected to be smaller at larger spatial scales, because the ancestral process averages over the spatial fluctuations. More generally, we conclude that deviations from standard coalescent behavior are expected for populations subject to an environment that changes as a function of space and time on length and timescales that are neither too small nor too large. An example of such a population is the marine snail Littorina saxatilis. Its habitat on the northern coast of Bohuslän (Sweden) is fragmented into subpopulations with strongly restricted gene flow between them, and effective population sizes of subpopulations have been found to be very small (K. Johannesson, personal communication). Starting from the results derived in this article, we hope to determine gene genealogies in such fragmented populations subject to variations of population size in space and time.
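As a rough guide to when ω is of order unity, the relation ω = 2πνN can be evaluated directly. The sketch below assumes that ν is measured in cycles per generation, so that ν is the inverse of the oscillation period in generations; this interpretation, the function name, and the example numbers are assumptions made for illustration.

```python
import math


def dimensionless_frequency(period_generations, population_size):
    """omega = 2*pi*nu*N, with nu ASSUMED to be 1 / (period in generations)."""
    nu = 1.0 / period_generations
    return 2.0 * math.pi * nu * population_size


if __name__ == "__main__":
    # Hypothetical examples: the same 500-generation cycle in populations of different size.
    for n_pop in (50, 500, 5000):
        print(n_pop, dimensionless_frequency(500, n_pop))
```

With these assumptions, a cycle whose period is comparable to N generations gives ω of order 2π, whereas the same cycle in a much larger population corresponds to the high-frequency regime in which only the effective population size and the small offset Λ0 matter.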

Acknowledgments

We thank two anonymous reviewers for helpful comments and suggestions. Support from Vetenskapsrådet, The Bank of Sweden Tercentenary Foundation, and the Centre for Theoretical Biology at the University of Gothenburg is gratefully acknowledged.

APPENDIX: COEFFICIENTS FOR n = 2,…, 10

In Figure A1 we give the coefficients determining the second moment of Tn according to Equation 20 for n = 2,…, 10. Note that the coefficient for n = 2 is consistent with Equation 23.

References

  1. Austerlitz, B., B. Jung-Muller, B. Godelle and P. Gouyon, 1997. Evolution of coalescence times, genetic diversity and structure during colonization. Theor. Popul. Biol. 51 148–164.
  2. Ewens, W., 1982. The concept of the effective population size. Theor. Popul. Biol. 21 373–378.
  3. Fisher, R. A., 1930. The Genetical Theory of Natural Selection. Clarendon Press, Oxford.
  4. Fu, Y., and W. Li, 1993. Statistical tests of neutrality of mutations. Genetics 133 693–709.
  5. Garrigan, D., and M. F. Hammer, 2006. Reconstructing human origins in the genomic era. Nat. Rev. Genet. 7 669–680.
  6. Griffiths, R., and S. Tavaré, 1994. Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B 344 403–410.
  7. Hein, J., M. H. Schierup and C. Wiuf, 2005. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, London/New York/Oxford.
  8. Jagers, P., and S. Sagitov, 2004. Convergence to the coalescent in populations of substantially varying size. J. Appl. Probab. 41 368–378.
  9. Kaj, I., and S. Krone, 2003. The coalescent process in a population with stochastically varying size. J. Appl. Probab. 40 33–48.
  10. Kingman, J., 1982. The coalescent. Stoch. Proc. Appl. 13 235–248.
  11. Moran, P., 1958. Random processes in genetics. Proc. Camb. Philos. Soc. 54 60–71.
  12. Nordborg, M., and S. Krone, 2003. Modern Developments in Population Genetics: The Legacy of Gustave Malécot, pp. 194–232. Oxford University Press, Oxford.
  13. Sjödin, P., I. Kaj, S. Krone, M. Lascoux and M. Nordborg, 2005. On the meaning and existence of an effective population size. Genetics 169 1061–1070.
  14. Slatkin, M., 1996. Gene genealogies within mutant allelic classes. Genetics 143 579–587.
  15. Tajima, F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123 585–595.
  16. Tavaré, S., 1984. Lines of descent and genealogical processes, and their application in population genetics models. Theor. Popul. Biol. 26 119–164.
  17. Tavaré, S., 2004. Ancestral Inference in Population Genetics, pp. 1–188. Springer, Berlin.
  18. Wakeley, J., and O. Sargsyan, 2009. Extensions of the coalescent effective population size. Genetics 181 341–345.
  19. Wright, S., 1931. Evolution in Mendelian populations. Genetics 16 97–159.
  20. Zeng, K., Y. Fu, S. Shi and C. Wu, 2006. Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics 174 1431–1439.
  21. Zivkovic, D., and T. Wiehe, 2008. Second-order moments of segregating sites under variable population size. Genetics 180 341–357.

Articles from Genetics are provided here courtesy of Oxford University Press
