An efficient algorithm for generating the internal branches of a Kingman coalescent

M Reppell; S Zöllner

doi:10.1016/j.tpb.2017.05.002

Author manuscript; available in PMC: 2019 Jul 1.

Published in final edited form as: Theor Popul Biol. 2017 Jul 11;122:57–66. doi: 10.1016/j.tpb.2017.05.002


PMCID: PMC5764821 NIHMSID: NIHMS892215 PMID: 28709926

Abstract

Coalescent simulations are a widely used approach for simulating sample genealogies, but can become computationally burdensome in large samples. Methods exist to analytically calculate a sample’s expected frequency spectrum without simulating full genealogies. However, statistics that rely on the distribution of the length of internal coalescent branches, such as the probability that two mutations of equal size arose on the same genealogical branch, have previously required full coalescent simulations to estimate. Here, we present a sampling method capable of efficiently generating limited portions of sample genealogies using a series of analytic equations that give probabilities for the number, start, and end of internal branches conditional on the number of final samples they subtend. These equations are independent of the coalescent waiting times and need only be calculated a single time, lending themselves to efficient computation. We compare our method with full coalescent simulations to show the resulting distribution of branch lengths and summary statistics are equivalent, but that for many conditions our method is at least 10 times faster.

Keywords: coalescent, genealogical topology, coalescent simulations

1. Introduction

In recent years the declining costs of human sequencing and genotyping have facilitated increasingly large studies. Sequencing experiments with tens of thousands of samples (Coventry et al., 2010; Nelson et al., 2012; Tennessen et al., 2012), and genotyping projects combining hundreds of thousands of samples (Teslovich et al., 2010; Morris et al., 2012; Berndt et al., 2013) are now common. The data observed in such large studies are frequently compared with simulated data generated according to theoretical models to test hypotheses about demography or disease architecture. Coalescent simulations are a widely used approach for generating such simulated data (Nelson et al., 2012; Ferreira et al., 2013; Gazave et al., 2014). The coalescent (Kingman, 1982) is a model that traces the ancestry of a present day sample backwards through time until reaching the most recent common ancestor of the entire sample. Researchers have expanded the coalescent to model a range of population histories and conditions (Kaplan et al., 1988; Takahata and Slatkin, 1990; Griffiths and Tavaré, 1994; Neuhauser and Krone, 1997). However, for large samples, coalescent simulations can become computationally burdensome, especially in Monte Carlo applications when many datasets have to be generated. Here, we propose a sampling method that allows us to generate individual branch lengths and configurations in Kingman coalescent genealogies, and to selectively generate limited portions of sample genealogies. This approach is particularly effective for research questions that need to consider only limited portions of the full genealogies generated by coalescent simulations. For example, the study of the very rare variants found to be abundant in human populations (Nelson et al., 2012; Tennessen et al., 2012) concerns only external or nearly external genealogical branches.

Our method relies on a set of equations that give the probability of a genealogical branch starting and ending at specific coalescent events. To derive these probabilities we first define the length and size of a branch. The structure of a coalescent genealogy is a bifurcating tree, with internal nodes that represent coalescent events where two lineages merge at a common ancestor. Therefore, a genealogical branch begins either at an external node along the tips of the tree or at a subsequent internal coalescent event, and then ends at a coalescent event closer to the root of the tree. The time between its beginning and ending events is the length of a branch. The size of a branch is a count of the number of external nodes in the final sample that it subtends (Fu, 1995). Branch size corresponds to the number of derived alleles that would appear in the final sample were a mutation event to occur along the branch’s length. With a constant size population, where waiting times between coalescent events are independent, the combination of these equations provides an explicit probability distribution function for individual branch lengths. Directly sampling from this explicit formula is computationally challenging. Here we introduce a recursive calculation that gives the distribution of the number of branches with a given size in a genealogy. We show that these recursive computations combined with storing reusable intermediate results and sampling from simple exponential distributions facilitate a rapid method for sampling selected portions of genealogies.

Our approach of developing an algorithm targeted at a specific feature of the coalescent has previously been taken in many contexts. Simulation methods grounded in coalescent theory and designed to efficiently handle recombination (McVean and Cardin, 2005; Marjoram and Wall, 2006), selection (Fearnhead, 2006), or the number of ancestral lineages remaining (Blum and Rosenberg, 2007; Jewett and Rosenberg, 2014) have all been proposed. Previous work on the distribution of internal branches of the Kingman coalescent was focused on their summed length or the proportion of a genealogy with a given size (Fu and Li, 1993; Fu, 1995; Griffiths and Tavaré, 1998; Wooding and Rogers, 2002; Polanski and Kimmel, 2003; Dahmer and Kersting, 2015). The proportion of a tree with a given size was of interest because under the infinite sites mutation model, the number of segregating sites observed in a sample with a given number of derived alleles is a function of the total length of branches with a size equal to the number of derived alleles. Fu and Li (1993) presented the expectation and variance of the total summed length of both external and internal branches along a Kingman coalescent without recombination. Griffiths and Tavaré (1998) expanded on this work to derive an expression for the probability of a mutation having a specific number of descendants in the final sample, even in samples from populations with variable past sizes. Jenkins and Song (2011) built on Griffiths and Tavaré’s work by considering allele configurations with two separate mutation events, and they extended their work to variable size populations in 2014 (Jenkins et al., 2014). In related work Ferretti et al. (2016) were able to derive closed expressions for the joint frequency spectrum of two linked sites. Fu (1995) gave expressions for the expectations, variances, and covariances for a sample’s frequency spectrum.

Efficient methods for modeling the total time in a sample’s genealogy with a given size have been developed (Wooding and Rogers, 2002; Polanski and Kimmel, 2003; Polanski et al., 2003). However, the methods of Wooding and Rogers (2002) and Polanski et al. (2003) fail to model individual branch lengths and their topology. The topology of a genealogy is where the correlation between observed mutations arises. These correlations can contain information about demography lacking from the frequency spectrum (Gutenkunst et al., 2009) and influence the outcome of tests for neutral evolution (Ledda et al., 2015). In the absence of recombination, this correlation is equivalent to the linkage disequilibrium between mutations, and it has been shown that patterns of linkage disequilibrium between very rare variants can provide information about departures from Wright-Fisher neutrality (Wall, 1999), including recent population growth rates (Reppell et al., 2014). With the focus in our work on individual genealogical branches rather than their summed length, we more closely build on the findings of Rosenberg (2006), who derived the expectation and variance for the number of internal branches with a specific size.

Here our calculations build a sampling framework that can quickly generate portions of a genealogy with a specific size. Considering all coalescent events on a tree, we integrate over all possible starts and ends for a branch of a given size. Conditional on the start and end of the branch we then calculate the probability that the branch has a given length. For a constant size population, we show that our work gives rise to an explicit probability distribution function for branch lengths. As this formula becomes computationally intractable as sample size grows, we introduce a computationally more efficient algorithm that recursively calculates all probabilities of start and end points and evaluates the conditional probability of branch length by Monte Carlo sampling. We compare our sampling method with full coalescent simulations for a range of sample sizes and demonstrate it performs up to 10 times faster, and show that as long as the ratio of branch size to sample size is moderate (< 0.15) it produces branches with an equivalent length distribution and summary statistics.

2. Methods

In this section we first provide the full probability distribution function for genealogical branches under a model of constant population size (section 2.1), and then derive its components in sections 2.2 and 2.3. In section 2.4 we derive the distribution of the number of branches with a given size in a genealogy, which we combine with the preceding work to propose a sampling method that can efficiently generate selected portions of genealogies. In section 2.5 we combine the elements of the preceding sections into our proposed algorithm, which we label topology free sampling. Section 2.6 gives the summary statistics we use to evaluate our method, and section 2.7 gives details of the open source software implementation of our method and the simulations we use in this text.

2.1. A probability distribution function for coalescent branch lengths in a model with constant population size

The probability that a coalescent tree branch of size j has length ℓ is the product of three probabilities: the probability that the branch begins at a specific coalescent event, then, conditional on its starting event, the probability that it ends at a specific coalescent event, and finally, conditional on its starting and ending events, the probability that the intervening coalescent times sum to ℓ. For the random variable L_j, the length of a branch with size j:

P (L_{j} = ℓ) = \sum_{\text{Start}} \sum_{\text{End}} P (\text{Length } ℓ | \text{Start}, \text{End}) P (\text{End} | \text{Start}) P (\text{Start} | \text{Size} = j) .

(1)

P (End|Start) and P (Start|Size = j) are given in sections 2.3 and 2.2, respectively. For a constant size population, the length of a branch follows a hypoexponential distribution: it is a sum of coalescent waiting times, each an exponential random variable with a unique rate. The rates that define the hypoexponential distribution are conditional on a branch’s starting and ending coalescent events which define the number of ancestral lines remaining during the branch’s duration. If we label coalescent events k ∈ 1, 2, …, n − 1 such that at event k, n − k + 1 ancestral branches are reduced by 1 to n − k ancestral branches we can write the exact probability distribution of branch lengths with size j as

P (L_{j} = ℓ) = \sum_{k = 1}^{n - 2} P_{Start} (k | j) \sum_{b = k + 1}^{n - 1} \left[ P_{End} (b | k, j) \sum_{z = k + 1}^{b} \frac{e^{- \binom{n - z + 1}{2} ℓ} \prod_{v = k + 1}^{b} \binom{n - v + 1}{2}}{\prod_{v = k + 1, v \neq z}^{b} \left( \binom{n - v + 1}{2} - \binom{n - z + 1}{2} \right)} \right]

(2)

Where P_Start(k|j) is the probability a branch with size j begins at coalescent event k (Equation 4), P_End(b|k, j) is the conditional probability a branch with size j that began at event k, ends at event b (Equation 9), and the inner sum of equation 2 is the probability that the branches between events k and b have exactly summed length ℓ_j; this probability follows a hypoexponential distribution defined by the exponential waiting times of the coalescent. Figure 1 shows the density curve for branches with sizes between 2 and 10. The expected value of this distribution is the sum of the independent expected exponential waiting times, weighted by the probability of starting and ending at particular coalescent events, and can be written

E (L_{j}) = \sum_{k = 1}^{n - 2} P_{Start} (k | j) \sum_{b = k + 1}^{n - 1} [P_{End} (b | k, j) \sum_{z = k + 1}^{b} \frac{2}{(n - z + 1) (n - z)}]

(3)
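The inner sums of equations 2 and 3 can be illustrated with a short numerical sketch. The Python below is not the authors' C++ implementation. The function names are hypothetical, and the start-event probabilities P_Start(k|j) are not computed, because the full formula is given only in appendix 6.1. Instead the sketch evaluates the hypoexponential density for a fixed pair of start and end events, and evaluates equation 3 for a single fixed start event k. Applying the formula at k = 0 (the event at which external branches originate in this notation) is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb, exp


def interval_rates(n, k, b):
    """Coalescence rates for the intervals that end at events k+1, ..., b."""
    return [comb(n - z + 1, 2) for z in range(k + 1, b + 1)]


def hypoexp_density(length, n, k, b):
    """Inner sum of equation 2: density of the summed waiting times."""
    rates = interval_rates(n, k, b)
    density = 0.0
    for z, rate_z in enumerate(rates):
        weight = 1.0
        for v, rate_v in enumerate(rates):
            if v != z:
                weight *= rate_v / (rate_v - rate_z)
        density += weight * rate_z * exp(-rate_z * length)
    return density


def p_end(n, k, b):
    """Closed form of equation 9."""
    return Fraction(2 * (n - b), (n - k) * (n - k - 1))


def expected_length_from_start(n, k):
    """Equation 3 restricted to a single start event k, in exact arithmetic."""
    total = Fraction(0)
    for b in range(k + 1, n):
        mean_time = sum(Fraction(1, r) for r in interval_rates(n, k, b))
        total += p_end(n, k, b) * mean_time
    return total


# External branches (size 1) start at event 0 in the paper's notation.
print(expected_length_from_start(10, 0))
print(hypoexp_density(0.05, 10, 2, 6))
```

Exact fractions are used for the expectation because it involves only rational numbers. The density uses floating point and, as the text notes for equation 2, its alternating terms would need extra care in larger samples.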

Figure 1. Density functions for the distribution of genealogical branch lengths with sizes between 2 and 10 for a sample of size 50.

2.2. The probability that a branch with size j originated at coalescent event k

Given a branch with size j and a sample size n, we are interested in the probability that it originated at a specific coalescent event along the sample’s ancestry. We label this value as P_start(k|j), the probability a branch begins at event k conditional on having size j. In Appendix 6.2 we provide a combinatorial method for directly calculating this value. Although simpler to write, the combinatorial method scales poorly with sample size, so we have derived an alternate approach. The approach presented here is easily implemented and efficiently calculates P_start(k|j). Using the rules of conditional probability

P_{Start} (k | j) = \frac{P (\text{branch has size } j | \text{branch began at event } k)}{\sum_{i = 0}^{n - 1} P (\text{branch has size } j | \text{branch began at event } i)} = \frac{P (j | k, n)}{\sum_{i = 0}^{n - 1} P (j | i, n)}

(4)

where P (j|k, n) is defined as the probability that a branch arising at coalescent event k in a sample of size n has size j. As part of the coalescent process a new branch arises at every coalescent event, so in equation 4 we can drop the unconditional probabilities of a branch beginning at an event, as P(a branch began at k) = P(a branch began at i) = 1 for all events k and i.

Based on equation 4, we use the following components to construct a recursive equation for P (j|k, n): At an arbitrary coalescent event, where two ancestral lines are randomly chosen to coalesce, and n − k + 1 ancestral lines become n − k lines (event k), a branch with size j is created if the sizes of the two lines coalescing sum to j. Each of these two lines coalescing at event k must have either formed at an earlier coalescent event or be an external branch. Additionally, they must have avoided coalescing at any event occurring between their formation and event k.

Allow y < k and z < k to be the coalescent events giving rise to two branches with sizes m and j − m that eventually coalesce at event k. In our notation, external branches originate at coalescent event 0. The joint probability that branches with sizes m and j − m arose at two specific coalescent events can be calculated by conditioning on the earlier of the two events. Arbitrarily specifying that y < z, we are interested in

P (j | k, n) = \sum_{m, y, z} P (j - m | z, y, n) P (m | y, n) P (lineages from y and z persist until k) .

(5)

Setting up our recursion, the probability that a branch with size m arises at y is simply P (m|y, n). We also need an expression for the probability P (j − m|z, y, n), which is conditional on both y and z. The properties of the Kingman coalescent allow us to simplify this expression by substituting a probability that only depends on z. Before coalescing at event k, the branches formed at z and y share no common descendants in the final sample. Because m of the final samples descend from the branch formed at y, the total pool of available descendants at event z is not n but n − m. Similarly, the branch formed at y cannot coalesce at event z, so the pool of ancestral branches available to coalesce at event z is one smaller than the total number of branches remaining. Essentially, we split off the portion of the genealogy corresponding to event y, and allow event z to involve only what remains. The conditional probability of a branch with size j − m arising at event z conditional on event y is then

P (j - m | z, y, n) = P (j - m | z - (m - 1), n - m) .

(6)

Next, we need the probability that the branches that arise at events y or z persist without coalescing until event k. For event y this can be written as

Q (k | y, n) = P (No coalescent at events y + 1, y + 2, \dots, k - 1) = \prod_{s = y + 1}^{k - 1} (1 - \frac{2}{n - s + 1}) = \frac{(n - k) (n - k + 1)}{(n - y) (n - y - 1)}

(7)

and a simple substitution of z for y gives the equation for event z. The two branches must then coalesce at event k; the probability of a specific pair of lines coalescing when n − k + 1 lines remain is simply $C (n - k + 1) = \frac{2}{(n - k) (n - k + 1)}$ .
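The telescoping step in equation 7 is easy to check numerically. The following Python sketch (function names are illustrative and not taken from the authors' software) multiplies the per-event survival probabilities directly and compares the result with the closed form, using exact rational arithmetic so that no rounding is involved.

```python
from fractions import Fraction


def q_product(k, y, n):
    """Equation 7 as a product: a line formed at event y avoids events y+1..k-1."""
    result = Fraction(1)
    for s in range(y + 1, k):
        result *= 1 - Fraction(2, n - s + 1)
    return result


def q_closed(k, y, n):
    """Closed form on the right-hand side of equation 7."""
    return Fraction((n - k) * (n - k + 1), (n - y) * (n - y - 1))


def pair_coalescence(k, n):
    """C(n - k + 1): probability that one specific pair coalesces at event k."""
    return Fraction(2, (n - k) * (n - k + 1))


n = 12
mismatches = [(k, y) for k in range(1, n) for y in range(k)
              if q_product(k, y, n) != q_closed(k, y, n)]
print(mismatches)
print(pair_coalescence(1, n))
```

When k = y + 1 the product is empty and both forms equal 1, which reflects that a branch formed at event y cannot be lost before the very next event.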

Together these pieces allow us to construct a general outline for a recursive equation when y < z:

P (j | k, n) = \sum_{m, y, z} [C (n - k + 1) Q (k | y, n) Q (k | z, n - 1) P (m | y, n) P (j - m | z - (m - 1), n - m)]

(8)

The final form of P (j|k, n) accounts for both the fact that z may be larger than y and that the external branches are exchangeable. The full formula is presented in appendix 6.1, and once calculated can be readily substituted into equation 4 to provide P_Start(k|j).

2.3. The conditional probability that a branch with size j ends at event b

At each coalescence the probability that the branch ends is a function of the number of remaining ancestral lines:

P_{End} (b | k, j) = P (No coalescence at k + 1, \dots, b - 1, AND coalescence at b) = \frac{2}{n - b + 1} \prod_{a = k + 1}^{b - 1} (1 - \frac{2}{n - a + 1}) = \frac{2}{n - b + 1} \prod_{a = k + 1}^{b - 1} (\frac{n - a - 1}{n - a + 1}) = \frac{2 (n - b)}{(n - k) (n - k - 1)}

(9)

Note that this value is independent of branch size j.
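As a small illustration of equation 9, the Python sketch below tabulates the end-event distribution for a branch that starts at event k and checks it against the step-by-step product. The function names are hypothetical and the code is only a sketch of the formula, not the authors' implementation.

```python
from fractions import Fraction


def end_distribution(n, k):
    """Map each possible end event b to P_End(b | k) using the closed form."""
    denom = (n - k) * (n - k - 1)
    return {b: Fraction(2 * (n - b), denom) for b in range(k + 1, n)}


def end_probability_stepwise(n, k, b):
    """Same probability built event by event, as in the first form of eq. 9."""
    prob = Fraction(2, n - b + 1)
    for a in range(k + 1, b):
        prob *= 1 - Fraction(2, n - a + 1)
    return prob


dist = end_distribution(25, 4)
print(sum(dist.values()))
print(dist[5], end_probability_stepwise(25, 4, 5))
```

Because the probabilities are proportional to n − b, a branch is most likely to end at the event immediately after it starts, and the distribution for each start event sums to one.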

2.4. The number of branches with size j in a genealogy

With the components in place to calculate the length of a single genealogical branch, we turn our attention to additional calculations necessary to generate larger portions of a genealogy. Notably, the distribution of the number of branches with size j in a sample’s ancestry can be calculated recursively. If we halted the coalescent process for a genealogy with n external nodes immediately before the final coalescent event, we would have two independent sub-genealogies. If the n external nodes of the full sample are treated as exchangeable, there are ⌊n/2⌋ ways in which they may be divided between the two sub-genealogies, where ⌊⌋ denotes the floor function. Every possible partition of the n nodes between the sub-genealogies is equally likely, with the exception of finding exactly n/2 nodes in each, which is half as likely owing to the exchangeability assumption. Now, note that observing x branches with size j in the full genealogy is equivalent to observing x − y and y branches with size j in the divided sub-genealogies. Defining G(x|j, n) as the probability of observing x branches with size j in a sample of size n, this leads us to the recursive equation

G (x | j, n) = \begin{cases} 0 & \text{for } x > 0, j > n \\ 1 & \text{for } x = 0, j > n \\ 0 & \text{for } x = 0, n = j \\ 1 & \text{for } x = 1, n = j \\ \sum_{i \leq \frac{n}{2}} \frac{2 - I [i = \frac{n}{2}]}{n - 1} \sum_{y = 0}^{x} G (x - y | j, i) G (y | j, n - i) & \text{for } n > j \end{cases}

(10)

The value in the outer sum of equation 10 is the probability of a partition of sizes i and n − i. Figure 2 displays the exact probabilities for the number of branches with different sizes in a sample of size 50. In > 99.5% of genealogies we would expect to see between 12 and 20 branches with size 2, between 4 and 13 branches with size 3, and fewer than 8 branches with size 5.
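The recursion of equation 10 lends itself to memoization, since each G(· | j, i) is reused by every larger sample size. The sketch below is one possible Python rendering, with a hypothetical function name and exact fractions. It returns the whole probability mass function over x at once, so the inner sum over y becomes a convolution of the two sub-genealogy distributions.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def branch_count_pmf(j, n):
    """Return a tuple pmf with pmf[x] = G(x | j, n) from equation 10."""
    if j > n:
        return (Fraction(1),)               # no branch of size j can exist
    if j == n:
        return (Fraction(0), Fraction(1))   # exactly one such branch
    pmf = {}
    for i in range(1, n // 2 + 1):
        # Probability of splitting the n nodes into sub-genealogies i and n - i.
        weight = Fraction(1 if 2 * i == n else 2, n - 1)
        left = branch_count_pmf(j, i)
        right = branch_count_pmf(j, n - i)
        for a, pa in enumerate(left):
            for c, pc in enumerate(right):
                pmf[a + c] = pmf.get(a + c, 0) + weight * pa * pc
    return tuple(pmf.get(x, Fraction(0)) for x in range(max(pmf) + 1))


pmf = branch_count_pmf(2, 50)
print(float(sum(pmf[12:21])))  # probability of 12 to 20 branches of size 2
```

The cached results play the role of the stored tables described in the text: once the values for a given j have been computed up to some n, further queries for smaller samples return immediately.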

Figure 2. The exact probabilities of observing a given number of branches of a given size in the genealogy of a sample with size 50.

2.5. Topology free sampling

Equation 2 can be computationally burdensome, in particular the alternating hypoexponential sums require extra care to maintain precision for sample sizes above ≈ 70. Moreover, it cannot be easily extended to populations with changing population size. We therefore developed a fast, sampling based method that yields accurate results and can be expanded to larger sample sizes (figure 3).

Figure 3. Topology free sampling. After calculating and storing intermediate topological probabilities, it is possible to efficiently generate segments of genealogies, and relevant summary statistics, using Monte Carlo integration.

Assume our goal is to calculate the probability of a genealogical summary statistic η which depends on L_j = (ℓ₁, …, ℓ_x), a vector containing the lengths of all x branches with size j in a genealogy. As an example, η could be the probability of two mutations of size j arising on the same branch. For this purpose we first use equations 4 and 9 to calculate the joint probability distribution of all possible starting and ending coalescent events for branches with size j. These values need only be calculated a single time, and can then be stored and reused across all further analyses. Along with our software we provide pre-calculated values for all sample sizes up to 1000. Next we can calculate $P (η) = \sum_{x} \sum_{L_{j}} P (η | x, L_{j}) P (x) \sum_{i = 1}^{x} P (ℓ_{i})$ using Monte Carlo integration. First, we sample x from equation 10. Next, to sample x branch lengths for $\sum_{i = 1}^{x} P (ℓ_{i})$ , we generate sets of start events (s₁, …, s_x) and ending events (e₁, …, e_x) from the joint distribution previously calculated, and generate a set of coalescent waiting times (T_n, T_{n−1}, …, T_2) under a demographic model of choice. The generated set of branch lengths is then $L_{j} = (\sum_{i = s_{1}}^{e_{1}} T_{i}, \sum_{i = s_{2}}^{e_{2}} T_{i}, \dots, \sum_{i = s_{x}}^{e_{x}} T_{i})$ . Note that this approach maintains an interdependence of branch lengths within the same genealogy. Repeating this sampling procedure we generate a random sample $((x^{(1)}, L_{j}^{(1)}), (x^{(2)}, L_{j}^{(2)}), \dots, (x^{(r)}, L_{j}^{(r)}))$ such that $\lim_{r \to \infty} \frac{1}{r} \sum_{i = 1}^{r} P (η | x^{(i)}, L_{j}^{(i)}) = P (η)$ and we can calculate the desired probability. This approach is particularly efficient for demographic inference, as the joint distribution of start times and end times is independent of population history, and can be reused when calculating P(η) for multiple demographic histories.

We also test our approach with expected values substituted for random coalescent waiting times. Specifically, we substitute $(E [T_{n}], E [T_{(n - 1)}], \dots, E [T_{2}])$ for (T_n, T_{n−1}, …, T_2) in the method described above, where $E [T_{i}]$ is the expected time between coalescent events i − 1 and i, with value $\frac{2}{(n - i) (n - i + 1)}$ .
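The sampling step described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' C++ implementation. The start events are passed in by the caller, since the stored P_Start table (equation 4, appendix 6.1) is not reproduced here. Waiting times are drawn for a constant size population only. All function and parameter names are hypothetical.

```python
import random
from math import comb


def sample_waiting_times(n, rng):
    """times[z] is the time between events z-1 and z (n-z+1 lineages remain)."""
    return {z: rng.expovariate(comb(n - z + 1, 2)) for z in range(1, n)}


def expected_waiting_times(n):
    """Expected values of the same waiting times, 2 / ((n-z+1)(n-z))."""
    return {z: 1.0 / comb(n - z + 1, 2) for z in range(1, n)}


def sample_end_event(n, start, rng):
    """Draw b from P_End(b | start), which is proportional to n - b (eq. 9)."""
    ends = list(range(start + 1, n))
    return rng.choices(ends, weights=[n - b for b in ends])[0]


def sample_branch_lengths(n, start_events, rng, use_expected=False):
    """Lengths of the branches of one genealogy, sharing one set of times."""
    if use_expected:
        times = expected_waiting_times(n)
    else:
        times = sample_waiting_times(n, rng)
    lengths = []
    for start in start_events:
        end = sample_end_event(n, start, rng)
        lengths.append(sum(times[z] for z in range(start + 1, end + 1)))
    return lengths


rng = random.Random(2017)
# Hypothetical start events for three branches in a sample of size 50.
print(sample_branch_lengths(50, [3, 10, 10], rng))
print(sample_branch_lengths(50, [3, 10, 10], rng, use_expected=True))
```

Sharing one dictionary of waiting times across all branches of a genealogy is what preserves the interdependence of branch lengths mentioned above. A different demographic model would only require replacing `sample_waiting_times`.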

2.6. Summary statistics that rely on individual branch lengths

With the approach of section 2.5 we can calculate the probability of observing summary statistics based on the individual branches from selected portions of a genealogy. Here we address two such examples. First, with individual branches we can calculate the variance among the lengths of branches with the same size, which we refer to as inter-branch variance. If there are k branches with size j in a genealogy and their average length is $\bar{ℓ_{j}}$ , then the inter-branch variance is

s_{j}^{2} = \frac{\sum_{i = 1}^{k} {(ℓ_{i, j} - \bar{ℓ_{j}})}^{2}}{k - 1}

(11)

A second statistic we can calculate is how often a pair of mutations with the same size arise from events on the same genealogical branch. Under the infinite-sites model and in the absence of recombination these mutations would be perfectly linked, thus this becomes equivalent to the probability of the variant pair having r² = 1.

Suppose there are x branches of size j in a genealogy, each with length ℓ_i for i ∈ 1, 2, …, x. Conditional on observing a mutation with size j, the probability it occurs on branch i with length ℓ_i is P(mutation on branch i | mutation occurred on one of branches 1, 2, …, x) = $ℓ_{i} / \sum_{h = 1}^{x} ℓ_{h}$ . So, if two independent mutations with size j are observed, the probability they occur on the same branch is

P (Same branch | Two mutations size j) = \frac{\sum_{i = 1}^{x} ℓ_{i}^{2}}{{(\sum_{i = 1}^{x} ℓ_{i})}^{2}}

(12)
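Both statistics are simple functions of the vector of branch lengths. A minimal Python sketch of equations 11 and 12 follows, with hypothetical function names. It assumes the lengths of all branches of one size in one genealogy are supplied as a list.

```python
def inter_branch_variance(lengths):
    """Equation 11: sample variance of the lengths of same-size branches."""
    k = len(lengths)
    if k < 2:
        raise ValueError("at least two branches are needed for a variance")
    mean = sum(lengths) / k
    return sum((length - mean) ** 2 for length in lengths) / (k - 1)


def same_branch_probability(lengths):
    """Equation 12: chance that two mutations of this size share a branch."""
    total = sum(lengths)
    return sum(length * length for length in lengths) / (total * total)


branch_lengths = [1.0, 2.0, 3.0]
print(inter_branch_variance(branch_lengths))
print(same_branch_probability(branch_lengths))
```

When all x branches have equal length the second statistic reduces to 1/x, and it rises toward 1 as a single branch comes to dominate the total length.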

2.7. Implementation and sampling

To compare our topology free approach with coalescent simulations we evaluated branches with sizes 3, 5, and 7 in samples of size 250, 500, and 1,000 for 2.5 million genealogies and timed their generation. We simulated branches with our topology free sampling approach using both random waiting times and with expected waiting times as described in section 2.5. For comparison, we performed full coalescent simulations using a version of FTEC (Reppell et al., 2012) modified to output individual branch lengths. We also compared the distribution of branch lengths in smaller samples of 25, 50, and 75 chromosomes and the results of these analyses appear in appendix 6.3.

Additionally, using topology free sampling with random waiting times and full coalescent simulations we generated branches with sizes 2, 3, 4, and 5 in 1,000,000 genealogies with sample sizes between 10 and 75, and between 750 and 1,000 in order to calculate the inter-branch variance and probability that two mutations of the same size arose on the same branch.

Code for all analyses was written in C++ and is available for download via GitHub (https://github.com/mreppell/Coalescent_Internal_Branches). We first calculated tables of values for equations 13, 9, and 10, stored the results, and used them for all subsequent analysis.

With stored values our topology free sampling algorithm scales quadratically with sample size, as does the calculation of P_End(b|k, j) (equation 9). The calculation of P_Start(k|j) (equation 4) relies on P(j|k, n) (equation 8), which scales at rate O(N⁴). With an Intel Core i5-4210 processor, for a sample of size 1,000, equation 8 required 21.8 hours for size 8 branches, 39.9 hours for size 10 branches, and 108.7 hours for size 15 branches. Equation 4 required less than 4 seconds for branches of size 8, 10, or 15 and a sample of size 1,000. For comparison, the full coalescent simulator FTEC used in this work, as well as the popular coalescent program ms (Hudson, 2002), also scale quadratically with sample size.

3. Results

3.1. Performance of topology free sampling relative to full coalescent simulations

To compare the speed and accuracy of our method we generated summary statistics using 3 methods: coalescent simulations, our topology free sampling approach with random waiting times, and the topology free approach with expected waiting times. We evaluated branches with sizes 3, 5, and 7 in samples with sizes 250, 500, and 1,000 for 2.5 million genealogies each. The resulting distributions of branch lengths are indistinguishable between methods (Figure 4). However, as table 1 demonstrates, when we use precomputed starting probabilities our topology free approach is much faster than coalescent simulations, and using random waiting times makes it only marginally slower than using expected waiting times.

Figure 4. (A) Cumulative distribution of branch lengths for branches of size 3, 5, and 7 in a genealogy from a sample with size 1,000. The curves give the probability of observing a branch shorter in length than the value given along the x-axis, with each color representing a different branch size. The double-dashed lines represent lengths drawn from 50,000 full coalescent simulations, the dashed lines from 50,000 genealogies sampled according to the topology free approach presented in the methods section, and the solid lines from 50,000 genealogies sampled using expected rather than random waiting times. The results of the three approaches are so similar they are indistinguishable in the plots. (B) The number of branches with a size of 3, 5, or 7 in a genealogy from a sample with size 1,000. The solid black lines give the analytical values from formula (10) and the red bars are observed values from 50,000 genealogies realized with full coalescent simulations.

Table 1.

Average time required by different methods to generate summary statistics for all branches of sizes 3, 5, and 7 in 50,000 genealogies with a sample size of 1,000. With the topology free and expected value approach we used precomputed start probabilities.

Method	Time (Min)
Topology Free	2.35
Coalescent	24.16
Expected Value	2.19


With our genealogies we could compare the average total summed branch length generated by each method with its theoretical value. Table 2 provides the root mean square error results for these simulations, and shows the approaches again give nearly identical results.

Table 2.

Root mean square error (RMSE) for the summed length of branches in a genealogy with 3, 5, or 7 descendants. The RMSE was calculated from the average value in 50,000 genealogies, independently realized 50 times, generated using either sampling from our equations with multiple random waiting times, labeled Topology Free, with expected waiting times, labeled Expected Times, or full coalescent simulations, labeled Coalescent. The expected theoretical values of the summed length of branches with sizes 3, 5, and 7 are $\frac{2}{3}$ , $\frac{2}{5}$ , and $\frac{2}{7}$ respectively.

Sample Size	Method	RMSE (×10⁻³), Branch Size 3	RMSE (×10⁻³), Branch Size 5	RMSE (×10⁻³), Branch Size 7
250	Topology Free	1.40	1.57	1.25
250	Coalescent	1.48	1.10	1.24
250	Expected Times	1.42	1.26	1.13
500	Topology Free	1.27	0.98	1.13
500	Coalescent	1.03	1.30	1.04
500	Expected Times	1.04	0.99	0.94
1000	Topology Free	0.73	0.98	0.83
1000	Coalescent	0.99	0.87	0.70
1000	Expected Times	0.79	0.83	0.74


We also compared our method with coalescent simulations at smaller sample sizes. We provide the results regarding branch lengths in appendix 6.3. Generally, as with larger samples our topology free approach with random waiting times and full coalescent simulations generate branches and summed lengths that cannot be distinguished from one another. However, in smaller sample sizes when expected waiting times are substituted for random waiting times the absence of very long or very short branches creates significant differences from coalescent values and makes for inaccurate approximation.

3.2. Individual branch summary statistics

The ability to directly sample individual internal branch lengths opens up the possibility of calculating statistics unavailable from summed branch lengths. For example, the inter-branch variance and the probability that two variants with equal minor allele counts arose on the same genealogical branch can be computed. For samples with sizes between 10 and 75 and between 750 and 1,000 we realized branches with sizes 2, 3, 4, and 5 in 1,000,000 independent genealogies, and for each we calculated both the inter-branch variance and the probability that two mutations with the same number of derived alleles in the final sample arose along the same genealogical branch, conditional on observing two such variants.

Overall, as figure 5 shows, topology free sampling and coalescent simulations gave nearly identical results. Differences emerged between the methods only in smaller samples where branch size began to approach sample size. At the root of these differences is the fact that the lengths of branches with the same size are slightly negatively correlated in coalescent simulations, with the exception of branches with a size that is exactly half the sample size (Dhersin and Möhle (2013) and Supplementary Figure S1), but the same lengths are slightly positively correlated in topology free sampling. The positive correlation is a result of non-exclusive coalescent starting times and shared waiting times in the topology free method. The impact of this difference is limited because the correlations are generally of negligible magnitude; for example, in 50,000 realized coalescent genealogies of size 1,000 the correlation between two randomly chosen branches with size 3 is 0.002, and in topology free sampling it is −0.003. However, with both methods the correlation between branches grows as branch size approaches sample size. In 50,000 realized genealogies with a sample size of 10, size 3 branches have a correlation of −0.045 in coalescent simulations and 0.082 in topology free sampling. These larger correlations in small samples result in underestimates of the inter-branch variance in topology free sampling, as seen on the left of figure 5B. The difference is much smaller when looking at the probability of two mutations arising on the same branch (figure 5D), where topology free sampling results in slightly smaller values across sample sizes. This difference between statistics occurs because in coalescent simulations, as the number of branches with a given size increases, the variance also increases but the probability of observing two variants decreases, resulting in values very close to those that topology free sampling produces.

Figure 5. The sampling of individual branches allows us to calculate summary statistics with our topology free approach that are not available from the summed total length. (A) For sample sizes between 750 and 1,000 the variance between branch lengths was very similar between our topology free approach and full coalescent simulations. (B) In smaller sample sizes, larger size branches had higher inter-branch variance in coalescent simulations, due to the negative correlation between their lengths. (C) In larger samples the probability that two mutations of the same size arose on the same branch is nearly identical between coalescent simulations and topology free sampling. (D) In smaller sample sizes, the use of topology free sampling resulted in very similar, yet slightly lower, probability estimates than full coalescent simulations. This reflects the same underlying causes as seen in panel B: our assumption that the number of branches is independent of their lengths, and our allowing of shared coalescent events, lead to lower inter-branch variance.

4. Discussion

Here we have presented a novel series of methods that allow us to realize selected portions of a genealogy without full coalescent simulations. The formulas presented create an efficient sampling method that can generate branch lengths identical to those generated by full coalescent simulations, without requiring the simulation of entire genealogies. By allowing the generation of individual branch lengths we are able to calculate summary statistics unknowable from previous methods designed to estimate the total summed length of all branches with a given size. The equations also make possible an explicit expression for the probability distribution function of internal Kingman coalescent branches in a constant size population.

While equations 4, 9, and 10 can easily be calculated once, stored, and repeatedly reused for sample sizes > 1,000, sampling directly from our probability distribution (equation 2) is computationally intensive and scales poorly with sample size, becoming prohibitive for sizes > 75. To counter this we implemented the sampling approach presented in section 2.5, which still leverages the single calculation of starting and ending probabilities and number of branches, but scales much more easily to larger sample sizes. We were able to show that, where branch size did not approach sample size, the resulting distributions of branch lengths, the number of branches with a given size in a genealogy, and summary statistics that combined the length and number of branches all matched results generated by traditional coalescent simulations, and we were able to calculate these values more efficiently. We also investigated substituting expected waiting times for random waiting times to further accelerate computations. Using expected waiting times was marginally faster than using random waiting times; however, the results deviated even more from full coalescent simulations in smaller sample sizes, as shown in appendix 6.3. We conclude that our sampling method with random waiting times provides a better balance of accuracy and speed, and should almost always be preferred to expected waiting times.
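
The trade-off between random and expected waiting times can be sketched in a few lines of Python. This is an illustrative sketch only, for a constant-size population in coalescent units: the function names and the start/end convention are hypothetical, and the branch's starting and ending events are fixed by hand rather than drawn from equations 4, 9, and 10, which are not reproduced here.

```python
import random
from statistics import mean, variance


def random_waiting_times(n, rng):
    """Draw T_n, ..., T_2 for a constant-size Kingman coalescent.

    T_k is the time during which k lineages exist, in coalescent units,
    and is exponential with rate k * (k - 1) / 2.
    """
    return {k: rng.expovariate(k * (k - 1) / 2) for k in range(n, 1, -1)}


def expected_waiting_times(n):
    """Replace every T_k by its mean, 2 / (k * (k - 1))."""
    return {k: 2 / (k * (k - 1)) for k in range(n, 1, -1)}


def branch_length(times, k_start, k_end):
    """Sum the waiting times of the intervals a branch spans.

    Hypothetical convention for this sketch: the branch exists while the
    number of lineages runs from k_start down to k_end (k_start >= k_end >= 2).
    """
    return sum(times[k] for k in range(k_end, k_start + 1))


rng = random.Random(1)
n, k_start, k_end = 20, 15, 10  # start and end fixed by hand for illustration
random_lengths = [
    branch_length(random_waiting_times(n, rng), k_start, k_end)
    for _ in range(20000)
]
fixed_length = branch_length(expected_waiting_times(n), k_start, k_end)

print("mean with random waiting times:    ", mean(random_lengths))
print("length with expected waiting times:", fixed_length)
print("variance with random waiting times:", variance(random_lengths))
```

For a fixed start and end, the expected-time version returns a single constant, so it reproduces the mean length but none of the variability contributed by the waiting times themselves.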

Excitingly, expanding our findings to populations with variable past sizes should not be problematic within the sampling framework. Following the results of Donnelly and Tavaré (1995), in a population with variable past sizes the waiting times between coalescent events are no longer independent of each other. This means that while our topological equations remain valid, the sum of waiting times defined by each branch’s starting and ending coalescent events is no longer hypoexponential, and equation 2 no longer holds. However, the strategy of generating a sequence of waiting times appropriate to a given demography, without generating a corresponding topology, and then repeatedly sampling a number of branches and their starting and ending events using the probabilities from formulas 4, 9, and 10, should be capable of generating individual branch lengths from any model of changing population size. For demographic models with variable size, using the expected waiting times to calculate branch lengths would result in substantially underestimating the variance of branch lengths and lead to an inaccurate distribution of lengths (results not shown).
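
One way to generate waiting times appropriate to a given demography, without a topology, is the standard time-rescaling of coalescent event times. The Python sketch below is a minimal example under an assumed demography that is not taken from the paper: exponential growth with backward-in-time size N(t) = N(0) exp(−αt), with t in coalescent units of the present-day size. The function name and the parameter `alpha` are hypothetical.

```python
import math
import random


def growth_waiting_times(n, alpha, rng):
    """Waiting times T_n, ..., T_2 under exponential growth.

    Assumed model (for this sketch only): N(t) = N(0) * exp(-alpha * t)
    looking backward in time. Cumulative event times s from a constant-size
    coalescent are mapped to real time by t = log(1 + alpha * s) / alpha;
    alpha == 0 recovers the constant-size case.
    """
    times = {}
    s = 0.0       # cumulative time on the constant-size scale
    t_prev = 0.0  # cumulative time on the real (rescaled) scale
    for k in range(n, 1, -1):
        s += rng.expovariate(k * (k - 1) / 2)
        t = s if alpha == 0 else math.log1p(alpha * s) / alpha
        times[k] = t - t_prev
        t_prev = t
    return times


# Same random draws, with and without growth, for a sample of size 10.
constant = growth_waiting_times(10, 0.0, random.Random(7))
growing = growth_waiting_times(10, 5.0, random.Random(7))
for k in (10, 5, 2):
    print(k, constant[k], growing[k])
```

Because each rescaled waiting time depends on the cumulative time already elapsed, the resulting times are no longer independent, which is why a sum of them is not hypoexponential.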

Our method assumes that the number of branches in a genealogy is independent of their length, and also does not constrain branches from starting or ending at the same coalescent event. This contrasts with a true genealogy, where correlations exist between the number of branches and their lengths and branches cannot share events. As a result, where coalescent simulations have negative correlations between the lengths of branches with the same size, topology free sampling results in a positive correlation. The impact of this can be of concern in small samples, where our approach will give different results than coalescent simulations. However, where the ratio of branch size to sample size is even moderately small (< 0.15 in figure 5) the correlation is negligible, and does not meaningfully alter results relative to full coalescent simulations.

Efficient methods for evaluating only portions of a genealogy, and for calculating summary statistics beyond the total summed length, are important for addressing questions of recent human demography and the abundance of rare variation found in the human genome. Our equations allow us to calculate the probabilities of tree topology independently of demography, in an efficient and stable manner, so that they can be stored and reused across models. For many applications the work presented here provides a superior alternative to full coalescent simulations.

Acknowledgments

This work was supported by NIH grants HG000376, HG005855, and GM108805.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

Berndt S, Gustafsson S, Mägi R, Ganna A, Wheeler E, Feitosa M, Justice A, Monda K, Croteau-Chonka D, Day F, Esko T, Fall T, Ferreira T, Gentilini D, Jackson A, Luan J, Randall J, Vedantam S, Willer C, Winkler T, Wood A, Workalemahu T, Hu Y, Lee S, Liang L, Lin D, Min J, Neale B, Thorleifsson G, Yang J, Albrecht E, Amin N, Bragg-Gresham J, Cadby G, den Heijer M, Eklund N, Fischer K, Goel A, Hottenga J, Huffman J, Jarick I, Johansson A, Johnson T, Kanoni S, Kleber M, König I, Kristiansson K, Kutalik Z, Lamina C, Lecoeur C, Li G, Mangino M, McArdle W, Medina-Gomez C, Müller-Nurasyid M, Ngwa J, Nolte I, Paternoster L, Pechlivanis S, Perola M, Peters M, Preuss M, Rose L, Shi J, Shungin D, Smith A, Strawbridge R, Surakka I, Teumer A, Trip M, Tyrer J, Van Vliet-Ostaptchouk J, Vandenput L, Waite L, Zhao J, Absher D, Asselbergs F, Atalay M, Attwood A, Balmforth A, Basart H, Beilby J, Bonnycastle L, Brambilla P, Bruinenberg M, Campbell H, Chasman D, Chines P, Collins F, Connell J, Cookson W, de Faire U, de Vegt F, Dei M, Dimitriou M, Edkins S, Estrada K, Evans D, Farrall M, Ferrario M, Ferriéres J, Franke L, Frau F, Gejman P, Grallert H, Grönberg H, Gudnason V, Hall A, Hall P, Hartikainen A, Hayward C, Heard-Costa N, Heath A, Hebebrand J, Homuth G, Hu F, Hunt S, Hyppönen E, Iribarren C, Jacobs K, Jansson J, Jula A, Kähönen M, Kathiresan S, Kee F, Khaw K, Kivimäki M, Koenig W, Kraja A, Kumari M, Kuulasmaa K, Kuusisto J, Laitinen J, Lakka T, Langenberg C, Launer L, Lind L, Lindström J, Liu J, Liuzzi A, Lokki M, Lorentzon M, Madden P, Magnusson P, Manunta P, Marek D, März W, Mateo Leach I, McKnight B, Medland S, Mihailov E, Milani L, Montgomery GVM, Mühleisen T, Munroe P, Musk A, Narisu N, Navis G, Nicholson G, Nohr E, Ong K, Oostra B, Palmer C, Palotie A, Peden J, Pedersen N, Peters A, Polasek O, Pouta A, Pramstaller P, Prokopenko I, Pütter C, Radhakrishnan A, Raitakari O, Rendon A, Rivadeneira F, Rudan I, Saaristo T, Sambrook J, Sanders A, Sanna S, Saramies J, Schipf S, Schreiber S, Schunkert H, Shin S, Signorini S, Sinisalo J, Skrobek B, Soranzo N, Stančáková A, Stark K, Stephens J, Stirrups K, Stolk R, Stumvoll M, Swift A, Theodoraki E, Thorand B, Tregouet D, Tremoli E, Van der Klauw M, van Meurs J, Vermeulen S, Viikari J, Virtamo J, Vitart V, Waeber G, Wang Z, Widèn E, Wild S, Willemsen G, Winkelmann B, Witteman J, Wolffenbuttel B, Wong A, Wright A, Zillikens M, Amouyel P, Boehm B, Boerwinkle E, Boomsma D, Caulfield M, Chanock S, Cupples L, Cusi D, Dedoussis G, Erdmann J, Eriksson J, Franks P, Froguel P, Gieger C, Gyllensten U, Hamsten A, Harris T, Hengstenberg C, Hicks A, Hingorani A, Hinney A, Hofman A, Hovingh K, Hveem K, Illig T, Jarvelin M, Jöckel K, Keinanen-Kiukaanniemi S, Kiemeney L, Kuh D, Laakso M, Lehtimäki T, Levinson D, Martin N, Metspalu A, Morris A, Nieminen M, Njølstad I, Ohlsson C, Oldehinkel A, Ouwehand W, Palmer L, Penninx B, Power C, Province M, Psaty B, Qi L, Rauramaa R, Ridker P, Ripatti S, Salomaa V, Samani N, Snieder H, Sørensen T, Spector T, Stefansson K, Tönjes A, Tuomilehto J, Uitterlinden A, Uusitupa M, van der Harst P, Vollenweider P, Wallaschofski H, Wareham N, Watkins H, Wichmann H, Wilson J, Abecasis G, Assimes T, Barroso I, Boehnke M, Borecki I, Deloukas P, Fox C, Frayling T, Groop L, Haritunian T, Heid I, Hunter D, Kaplan R, Karpe F, Moffatt M, Mohlke K, O’Connell J, Pawitan Y, Schadt E, Schlessinger D, Steinthorsdottir V, Strachan D, Thorsteinsdottir U, Visscher P, Di Blasio A, Hirschhorn J, Lindgren C, Morris A, Meyre D, Scherag A, McCarthy M, Speliotes E, North K, Loos R, Ingelsson E. Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet. 2013;45:501–12. doi: 10.1038/ng.2606.
Blum MG, Rosenberg NA. Estimating the number of ancestral lineages using a maximum-likelihood method based on rejection sampling. Genetics. 2007;176:1741–1757. doi: 10.1534/genetics.106.066233.
Coventry A, Bull-Otterson L, Liu X, Clark A, Maxwell T, Crosby J, Hixson J, Rea T, Muzny D, Lewis L, Wheeler D, Sabo A, Lusk C, Weiss K, Akbar H, Cree A, Hawes A, Newsham I, Varghese R, Villasana D, Gross S, Joshi V, Santibanez J, Morgan M, Chang K, Hale W, IV, Templeton A, Boerwinkle E, Gibbs R, Sing C. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nature Communications. 2010;1:131. doi: 10.1038/ncomms1130.
Dahmer I, Kersting G. The internal branch lengths of the Kingman coalescent. Ann Appl Probab. 2015:1325–1348.
Dhersin J, Möhle M. On the external branches of coalescents with multiple collisions. Electron J Probab. 2013;18(40):11.
Donnelly P, Tavaré S. Coalescents and genealogical structure under neutrality. Annu Rev Genet. 1995;29:401–21. doi: 10.1146/annurev.ge.29.120195.002153.
Fearnhead P. Perfect simulation from nonneutral population genetic models: variable population size and population subdivision. Genetics. 2006;174:1397–1406. doi: 10.1534/genetics.106.060681.
Ferreira Z, Seixas S, Andrès A, Kretzschmar W, Mullikin J, Cherukuri P, Cruz P, Swanson W, NISC Comparative Sequencing Program. Clark A, Green E, Hurle B. Reproduction and immunity-driven natural selection in the human WFDC locus. Mol Biol Evol. 2013;30:938–50. doi: 10.1093/molbev/mss329.
Ferretti L, Klassmann A, Wiehe T, Ramos-Onsins S, Achaz G. The expected neutral frequency spectrum of two linked sites. arXiv. 2016:1604.06713. doi: 10.1016/j.tpb.2018.06.001. q-bio.PE.
Fu Y. Statistical properties of segregating sites. Theor Popul Biol. 1995;48:172–97. doi: 10.1006/tpbi.1995.1025.
Fu Y, Li W. Statistical tests of neutrality of mutations. Genetics. 1993;133:693–709. doi: 10.1093/genetics/133.3.693.
Gazave E, Ma L, Chang D, Coventry A, Gao F, Muzny D, Boerwinkle E, Gibbs R, Sing C, Clark A, Keinan A. Neutral genomic regions refine models of recent rapid human population growth. Proc Natl Acad Sci USA. 2014;111:757–62. doi: 10.1073/pnas.1310398110.
Griffiths R, Tavaré S. Sampling theory for neutral alleles in a varying environment. Philos Trans R Soc Lond B Biol Sci. 1994;29:403–10. doi: 10.1098/rstb.1994.0079.
Griffiths R, Tavaré S. The age of a mutation in a general coalescent tree. Stoch Models. 1998;14:273–95.
Gutenkunst R, Hernandez R, Williamson S, Bustamante C. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:e1000695. doi: 10.1371/journal.pgen.1000695.
Hudson R. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–8. doi: 10.1093/bioinformatics/18.2.337.
Jenkins P, Song Y. The effect of recurrent mutation on the frequency spectrum of a segregating site and the age of an allele. Theor Popul Biol. 2011;80:158–73. doi: 10.1016/j.tpb.2011.04.001.
Jenkins P, Mueller J, Song Y. General triallelic frequency spectrum under demographic models with variable population size. Genetics. 2014;196:295–311. doi: 10.1534/genetics.113.158584.
Jewett EM, Rosenberg NA. Theory and applications of a deterministic approximation to the coalescent model. Theor Popul Biol. 2014;93:14–29. doi: 10.1016/j.tpb.2013.12.007.
Kaplan N, Darden T, Hudson R. The coalescent process in models with selection. Genetics. 1988;120:819–29. doi: 10.1093/genetics/120.3.819.
Kingman J. The coalescent. Stoch Process Appl. 1982;13:235–48.
Ledda A, Achaz G, Wiehe T, Ferretti L. Decomposing the site frequency spectrum: the impact of tree topology on neutrality tests. arXiv. 2015:1510.06748. doi: 10.1534/genetics.116.188763. q-bio.PE.
Marjoram P, Wall JD. Fast “coalescent” simulation. BMC Genet. 2006;7:16. doi: 10.1186/1471-2156-7-16.
McVean GA, Cardin NJ. Approximating the coalescent with recombination. Philos Trans R Soc Lond, B, Biol Sci. 2005;360(1459):1387–1393. doi: 10.1098/rstb.2005.1673.
Morris A, Voight B, Teslovich T, Ferreira T, Segrè A, Steinthorsdottir V, Strawbridge R, Khan H, Grallert H, Mahajan A, Prokopenko I, Kang H, Dina C, Esko T, Fraser R, Kanoni S, Kumar A, Lagou V, Langenberg C, Luan J, Lindgren C, Müller-Nurasyid M, Pechlivanis S, Rayner N, Scott L, Wiltshire S, Yengo L, Kinnunen L, Rossin E, Raychaudhuri S, Johnson A, Dimas A, Loos R, Vedantam S, Chen H, Florez J, Fox C, Liu C, Rybin D, Couper D, Kao W, Li M, Cornelis M, Kraft P, Sun Q, van Dam R, Stringham H, Chines P, Fischer K, Fontanillas P, Holmen O, Hunt S, Jackson A, Kong A, Lawrence R, Meyer J, Perry J, Platou C, Potter S, Rehnberg E, Robertson N, Sivapalaratnam S, Stančáková A, Stirrups K, Thorleifsson G, Tikkanen E, Wood A, Almgren P, Atalay M, Benediktsson R, Bonnycastle L, Burtt N, Carey J, Charpentier G, Crenshaw A, Doney A, Dorkhan M, Edkins S, Emilsson V, Eury E, Forsen T, Gertow K, Gigante B, Grant G, Groves C, Guiducci C, Herder C, Hreidarsson A, Hui J, James A, Jonsson A, Rathmann W, Klopp N, Kravic J, Krjutškov K, Langford C, Leander K, Lindholm E, Lobbens S, Männistö S, Mirza G, Mühleisen T, Musk B, Parkin M, Rallidis L, Saramies J, Sennblad B, Shah S, Sigurðsson G, Silveira A, Steinbach G, Thorand B, Trakalo J, Veglia F, Wennauer R, Winckler W, Zabaneh D, Campbell H, van Duijn C, Uitterlinden A, Hofman A, Sijbrands E, Abecasis G, Owen K, Zeggini E, Trip M, Forouhi N, Syvänen A, Eriksson J, Peltonen L, Nöthen M, Balkau B, Palmer C, Lyssenko V, Tuomi T, Isomaa B, Hunter D, Qi L, Wellcome Trust Case Control Consortium. Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC) Investigators. Genetic Investigation of ANthropometric Traits (GIANT) Consortium. Asian Genetic Epidemiology Network; Type 2 Diabetes (AGEN-T2D) Consortium. South Asian Type 2 Diabetes (SAT2D) Consortium. Shuldiner A, Roden M, Barroso I, Wilsgaard T, Beilby J, Hovingh K, Price J, Wilson J, Rauramaa R, Lakka T, Lind L, Dedoussis G, Njølstad I, Pedersen N, Khaw K, Wareham N, Keinanen-Kiukaanniemi S, Saaristo T, Korpi-Hyövälti E, Saltevo J, Laakso M, Kuusisto J, Metspalu A, Collins F, Mohlke K, Bergman R, Tuomilehto J, Boehm B, Gieger C, Hveem K, Cauchi S, Froguel P, Baldassarre D, Tremoli E, Humphries S, Saleheen D, Danesh J, Ingelsson E, Ripatti S, Salomaa V, Erbel R, Jöckel K, Moebus S, Peters A, Illig T, de Faire U, Hamsten A, Morris A, Donnelly P, Frayling T, Hattersley A, Boerwinkle E, Melander O, Kathiresan S, Nilsson P, Deloukas P, Thorsteinsdottir U, Groop L, Stefansson K, Hu F, Pankow J, Dupuis J, Meigs J, Altshuler D, Boehnke M, McCarthy M, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet. 2012;44:981–90. doi: 10.1038/ng.2383.
Nelson M, Wegmann D, Ehm M, Kessner D, St Jean P, Verzilli C, Shen J, Tang Z, Bacanu S, Fraser D, Warren L, Aponte J, Zawistowski M, Liu X, Zhang H, Zhang Y, Li J, Li Y, Li L, Woollard P, Topp S, Hall M, Nangle K, Wang J, Abecasis G, Cardon L, Zöllner S, Whittaker J, Chissoe S, Novembre J, Mooser V. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337(6090):100–4. doi: 10.1126/science.1217876.
Neuhauser C, Krone S. The genealogy of samples in models with selection. Genetics. 1997;145:519–34. doi: 10.1093/genetics/145.2.519.
Polanski A, Kimmel M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics. 2003;165:427–36. doi: 10.1093/genetics/165.1.427.
Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence under time-dependent population size. Theor Popul Biol. 2003;63:33–40. doi: 10.1016/s0040-5809(02)00010-2.
Reppell M, Boehnke M, Zöllner S. FTEC: a coalescent simulator for modeling faster than exponential growth. Bioinformatics. 2012;28:1282–3. doi: 10.1093/bioinformatics/bts135.
Reppell M, Boehnke M, Zöllner S. The impact of accelerating faster than exponential population growth on genetic variation. Genetics. 2014;196:819–28. doi: 10.1534/genetics.113.158675.
Rosenberg N. The mean and variance of the numbers of r-pronged nodes and r-caterpillars in Yule-generated genealogical trees. Ann Combinatorics. 2006;10:129–46.
Spouge J. Within a sample from a population, the distribution of the number of descendants of a subsample’s most recent common ancestor. Theor Popul Biol. 2014;92:51–4. doi: 10.1016/j.tpb.2013.11.004.
Takahata N, Slatkin M. Genealogy of neutral genes in two partially isolated populations. Theor Popul Biol. 1990;38:331–50. doi: 10.1016/0040-5809(90)90018-q.
Tennessen J, Bigham A, O’Connor T, Fu W, Kenny E, Gravel S, McGee S, Do R, Liu X, Jun G, Kang H, Jordan D, Leal S, Gabriel S, Rieder M, Abecasis G, Altshuler D, Nickerson D, Boerwinkle E, Sunyaev S, Bustamante C, Bamshad M, Akey J, Broad GO, Seattle GO, NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–9. doi: 10.1126/science.1219240.
Teslovich T, Musunuru K, Smith A, Edmondson A, Stylianou I, Koseki M, Pirruccello J, Ripatti S, Chasman D, Willer C, Johansen C, Fouchier S, Isaacs A, Peloso G, Barbalic M, Ricketts S, Bis J, Aulchenko Y, Thorleifsson G, Feitosa M, Chambers J, Orho-Melander M, Melander O, Johnson T, Li X, Guo X, Li M, Shin Cho Y, Jin Go M, Jin Kim Y, Lee J, Park T, Kim K, Sim X, Twee-Hee Ong R, Croteau-Chonka D, Lange L, Smith J, Song K, Hua Zhao J, Yuan X, Luan J, Lamina C, Ziegler A, Zhang W, Zee R, Wright A, Witteman J, Wilson J, Willemsen G, Wichmann H, Whitfield J, Waterworth D, Wareham N, Waeber G, Vollenweider P, Voight B, Vitart V, Uitterlinden A, Uda M, Tuomilehto J, Thompson J, Tanaka T, Surakka I, Stringham H, Spector T, Soranzo N, Smit J, Sinisalo J, Silander K, Sijbrands E, Scuteri A, Scott J, Schlessinger D, Sanna S, Salomaa V, Saharinen J, Sabatti C, Ruokonen A, Rudan I, Rose L, Roberts R, Rieder M, Psaty B, Pramstaller P, Pichler I, Perola M, Penninx B, Pedersen N, Pattaro C, Parker A, Pare G, Oostra B, O’Donnell C, Nieminen M, Nickerson D, Montgomery G, Meitinger T, McPherson R, McCarthy M, McArdle W, Masson D, Martin N, Marroni F, Mangino M, Magnusson P, Lucas G, Luben R, Loos R, Lokki M, Lettre G, Langenberg C, Launer L, Lakatta E, Laaksonen R, Kyvik K, Kronenberg F, König I, Khaw K, Kaprio J, Kaplan L, Johansson A, Jarvelin M, Janssens A, Ingelsson E, Igl W, Kees Hovingh G, Hottenga J, Hofman A, Hicks A, Hengstenberg C, Heid I, Hayward C, Havulinna A, Hastie N, Harris T, Haritunians T, Hall A, Gyllensten U, Guiducci C, Groop L, Gonzalez E, Gieger C, Freimer N, Ferrucci L, Erdmann J, Elliott P, Ejebe K, Döring A, Dominiczak A, Demissie S, Deloukas P, de Geus E, de Faire U, Crawford G, Collins F, Chen Y, Caulfield M, Campbell H, Burtt N, Bonnycastle L, Boomsma D, Boekholdt S, Bergman R, Barroso I, Bandinelli S, Ballantyne C, Assimes T, Quertermous T, Altshuler D, Seielstad M, Wong T, Tai E, Feranil A, Kuzawa C, Adair L, Taylor H, Borecki I, Gabriel S, Wilson J, Holm H, Thorsteinsdottir U, Gudnason V, Krauss R, Mohlke K, Ordovas J, Munroe P, Kooner J, Tall A, Hegele R, Kastelein J, Schadt E, Rotter J, Boerwinkle E, Strachan D, Mooser V, Stefansson K, Reilly M, Samani N, Schunkert H, Cupples L, Sandhu M, Ridker P, Rader D, van Duijn C, Peltonen L, Abecasis G, Boehnke M, Kathiresan S. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466:707–13. doi: 10.1038/nature09270.
Wall J. Recombination and the power of statistical tests of neutrality. Genet Res. 1999;74:65–79.
Wooding S, Rogers A. The matrix coalescent and an application to human single-nucleotide polymorphisms. Genetics. 2002;161:1641–50. doi: 10.1093/genetics/161.4.1641.
