Abstract
In recent years, a number of methods have been developed to infer complex demographic histories, especially historical population size changes, from genomic sequence data. Coalescent Hidden Markov Models have proven to be particularly useful for this type of inference. Due to the Markovian structure of these models, an essential building block is the joint distribution of local genealogical trees, or statistics of these genealogies, at two neighboring loci in populations of variable size. Here, we present a novel method to compute the marginal and the joint distribution of the total length of the genealogical trees at two loci separated by at most one recombination event for samples of arbitrary size. To our knowledge, no method to compute these distributions has been presented in the literature to date. We show that they can be obtained from the solution of certain hyperbolic systems of partial differential equations. We present a numerical algorithm, based on the method of characteristics, that can be used to efficiently and accurately solve these systems and compute the marginal and the joint distributions. We demonstrate its utility to study properties of the joint distribution. Our flexible method can be straightforwardly extended to handle an arbitrary fixed number of recombination events, to include the distributions of other statistics of the genealogies as well, and can also be applied in structured populations.
Keywords: coalescent theory, variable population size, hyperbolic systems of PDEs
AMS subject classification: 92D10, 60J27, 60J28, 35L40
1 Introduction
Unravelling the complex demographic histories of humans or other species and understanding their effects on contemporary genetic variation is a central goal of population genetics. In addition to advancing our knowledge of the evolutionary processes that shape genomic variation, demographic inference is also an important step towards understanding disease related genetic variation. Recent rapid population growth, for example, severely affects the distribution of rare genetic variants (Keinan and Clark, 2012), which have been linked to complex genetic diseases. Moreover, ancient and contemporary population structure can lead to the accumulation of private genetic variation in certain sub-populations.
Methods to study genetic variation, or perform inference, in populations with varying size or more complex demographic histories have been developed based on the Wright-Fisher diffusion, describing the evolution of population allele frequencies forward in time (Griffiths, 2003; Živković et al., 2015; Gutenkunst et al., 2009; Excoffier et al., 2013), or the Coalescent process, a model for the genealogical relationship in a sample of individuals (Griffiths and Tavaré, 1994; Griffiths and Marjoram, 1996; Griffiths and Tavaré, 1998; Živković and Wiehe, 2008; Bhaskar et al., 2015; Kamm et al., 2017). A powerful representation of genetic variation data that has been used in this context is the Site-Frequency-Spectrum. In this representation, however, any linkage information present in the genetic data is ignored. With the increasing availability of full-genomic sequence data, linkage information is more readily available, and approaches based on Coalescent Hidden Markov Models (HMM) that use this linkage information have proven to be particularly successful for demographic inference and other population genetic applications.
In a population-sample of genomic sequences, the genealogical relationships vary along the genome, due to intra-chromosomal recombination. The Coalescent-HMMs approximate the intricate correlation structure between these local genealogical trees by a Markov chain, the Sequentially Markovian Coalescent (SMC) (Wiuf and Hein, 1999; McVean and Cardin, 2005). Due to the Markovian structure of the SMC-approximation, an essential building block is thus the transition or joint distribution of these local genealogies at two neighboring loci. In a sample of size two, the local genealogies are simple trees with two leaves, that is, one-dimensional objects at each locus. The transitions can be readily computed, and Li and Durbin (2011) employed this framework to develop a powerful approach to infer population size history. Moreover, Dutheil et al. (2009) used Coalescent-HMMs to explore the divergence patterns between humans and great apes, using up to 4 genomic sequences, one for each species. However, due to the increase in complexity of the local genealogies with increasing sample size, these approaches cannot be generalized efficiently to larger sample sizes.
For large sample sizes, approaches that use Markov chain Monte Carlo techniques (Rasmussen et al., 2014), suitable composite likelihood frameworks (Sheehan et al., 2013; Steinrücken et al., 2015), or representations of the local genealogical trees by lower-dimensional summaries (Schiffels and Durbin, 2014; Terhorst et al., 2017) have been developed. In the latter, the choice of how to represent the local genealogical trees affects the performance of the inference procedure. Li and Durbin (2011) observed that using the coalescence time between two lineages lacks information in the more recent past, whereas using the first coalescence time in a large sample is less accurate for ancient times (Schiffels and Durbin, 2014). A promising low-dimensional representation is the total branch length of the genealogical tree at each locus. In expectation, this quantity grows without bound as the sample size increases, thus retaining not only information about ancient events, but also about the more recent dynamics. However, to implement a Coalescent-HMM inference framework using the tree length, it is crucial to efficiently compute the joint distribution of the total tree length at two neighboring loci.
Thus, in this paper, we present a novel efficient and accurate method to numerically compute the joint distribution of the total branch length of the genealogical trees at two neighboring loci for a sample of arbitrary size n in populations of varying size, as well as the single-locus marginal distribution. To our knowledge, no method to compute these distributions has been presented in the literature to date that can be applied to arbitrary sample sizes. Moreover, even computing the marginal distribution of the total tree length at a single locus has only received limited attention (Pfaffelhuber et al., 2011; Wiuf and Hein, 1999). We present analytical details and numerical results for the case of at most one recombination event separating the two loci, but our methodology can be readily extended to handle an arbitrary, but fixed, maximal number of recombination events, by suitably augmenting the underlying process.
The inter-coalescent times $T_k$, that is, the time periods during which k lineages persist in the genealogical tree for a sample of size n, can be used to compute the total branch length at a single locus as
$$\mathcal{L} = \sum_{k=2}^{n} k\, T_k, \tag{1.1}$$
since in the period $T_k$, k lineages contribute towards the total length. In the case of a panmictic population of constant size, formulas for the first two moments of the total tree length can be readily obtained using standard arguments for sums of the independent exponentially distributed random variables $T_k$. Furthermore, ℒ is distributed like the maximum of n − 1 independent exponential variables with intensity 1/2 (Wiuf and Hein, 1999, p. 255). However, non-constant population size histories introduce intricate dependencies among the inter-coalescent times, and thus it is not straightforward to generalize this approach. Polanski et al. (2003) introduced a method to compute the expected inter-coalescence times under variable population size. However, the coalescence rates of ancestral lineages in the genealogical process depend on past population sizes, whereas the rate for ancestral recombination is constant along each ancestral lineage. The approach of Polanski et al. (2003) depends on the fact that all rates of the process are rescaled uniformly with the same factor, and thus it cannot be extended to the case when ancestral recombination between two linked loci is taken into account.
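The constant-size statements above can be checked numerically. The following is a minimal sketch, assuming coalescent time units in which each pair of lineages coalesces at rate one, so that the time with k lineages is exponential with rate k(k − 1)/2; all function names are illustrative and not taken from the paper.

```python
import math
import random

def expected_tree_length(n):
    """E[L] = sum_{k=2}^n k * E[T_k], with E[T_k] = 2 / (k (k - 1))."""
    return sum(k * 2.0 / (k * (k - 1)) for k in range(2, n + 1))

def tree_length_cdf(x, n):
    """CDF of the maximum of n - 1 independent Exp(1/2) variables."""
    return (1.0 - math.exp(-x / 2.0)) ** (n - 1)

def sample_tree_length(n, rng):
    """Draw L = sum_k k * T_k with independent T_k ~ Exp(k (k - 1) / 2)."""
    return sum(k * rng.expovariate(k * (k - 1) / 2.0) for k in range(2, n + 1))

# Compare the empirical CDF of simulated tree lengths with the closed form.
rng = random.Random(1)
n, x = 10, 5.0
draws = [sample_tree_length(n, rng) for _ in range(20000)]
empirical = sum(d <= x for d in draws) / len(draws)
print(empirical, tree_length_cdf(x, n), expected_tree_length(n))
```

The comparison only covers the constant-size case; under variable population size the inter-coalescent times are no longer independent, which is what the remainder of the paper addresses.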
Ferretti et al. (2013) used another approach to investigate the correlation between the times to the most recent common ancestor at two neighboring loci. The authors approached the problem using coalescent arguments to quantify the changes recombination induces on the local trees, but it is unclear how to generalize their approach efficiently to the total length of the genealogical trees. Furthermore, Li and Durbin (2011) presented analytic formulas for the joint distribution of the local genealogies for a sample of size two under variable population size, but these cannot readily be extended to an arbitrary sample size n. Eriksson et al. (2009) presented similar analytic formulas for a population of constant size and explored more complex demographic scenarios using simulations. Introducing suitable Markov chains, Hobolth and Jensen (2014) investigated the transitional distribution of the local genealogies for samples of size 4, and discussed approximations for larger sample sizes. These Markov chains are closely related to our methodology, but our focus is on exact computations for large sample sizes.
Although we focus on the total tree length under variable population size in a single panmictic population in this paper, our approach can be extended to compute the transition densities for the coalescence time in a sample of size two (Li and Durbin, 2011), the coalescence time of two distinguished lineages (Terhorst et al., 2017), and the time of the first coalescent event amongst the sampled sequences (Schiffels and Durbin, 2014). Furthermore, our method can be generalized to multiple sub-populations related by a complex demographic history (see discussion in Section 5).
This article is structured as follows. In Section 2, we introduce the requisite notation and the stochastic processes that are involved in computing the marginal and joint distributions. We further introduce a hyperbolic system of partial differential equations (PDEs) in Section 3 that can be solved to compute the distributions of interest. We provide a proof of the main proposition used to derive these equations in Appendix A. In Section 3, we also provide details of our novel numerical algorithm based on the method of characteristics that can be used to efficiently compute the solutions to these PDEs. We demonstrate the accuracy of the method, and study properties of the joint distribution function in Section 4. Finally, we discuss future applications and extensions of this method in Section 5.
2 Background and Notation
In this section, we introduce the necessary background and notation for the stochastic processes that we employ to compute the marginal and joint distribution of the length of the genealogical trees. We also provide some details about computing the distribution of these processes, since our main result builds upon the underlying ideas.
2.1 Ancestral Process at a Single Locus
The genealogical relationship of a sample of n haploid individuals in a panmictic population of constant size is commonly modeled using Kingman’s coalescent (Kingman, 1982; Wakeley, 2008), and this process and its extensions have found widespread applications. It is a Markov process that describes the dynamics of the ancestral lineages of the sample backwards in time. Here we focus on the ancestral process A(t) (Tavaré and Zeitouni, 2004, Chapter 4.1). This coarser process records only the number of ancestral lineages in the coalescent process at time t before present, which is sufficient to compute the total branch length of the coalescent tree. The initial number of lineages is equal to the sample size n. Furthermore, at time t, each pair of lineages coalesces at rate one, thus if there are A(t) = k lineages at time t, then coalescence of any two lineages happens at rate $\binom{k}{2}$. These dynamics are followed until all lineages have coalesced into a single lineage, and this time is denoted by TMRCA, the time to the most recent common ancestor.
Variable population size is commonly modeled by a positive, real-valued function λ(t), which provides the coalescent rate for each pair of ancestral lineages at time t in the past (Tavaré and Zeitouni, 2004, Chapter 4.1). If the size of the population changes at different points in the past, the rate of coalescence at a given time is inversely proportional to the relative population size at that time. Intuitively, for two lineages to coalesce, they have to find a common ancestor. If the population consists of a large number of individuals, this happens at a lower rate, whereas in small populations, the ancestral lineages coalesce more quickly. In the remainder of this paper, we assume that λ(t) is continuous. If λ(t) is piece-wise continuous, we can obtain the same results by considering each continuous piece separately. For convenience, we further introduce the cumulative coalescent rate at time t as
$$\Lambda(t) := \int_0^t \lambda(s)\, ds.$$
These considerations yield the following definition.
Definition 2.1 (Ancestral Process with variable population size)
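The ancestral process with variable population size {A(t)}t∈ℝ+ is a time-inhomogeneous Markov chain on {1, …, n} with initial state A(0) = n, and the transition rates at time t are given by the infinitesimal generator matrix
$$Q(t) = \big(q_{k,\ell}(t)\big)_{k,\ell \in \{1,\dots,n\}}$$
with
$$q_{k,\ell}(t) := \begin{cases} \lambda(t)\binom{k}{2}, & \ell = k-1,\\[2pt] -\lambda(t)\binom{k}{2}, & \ell = k,\\[2pt] 0, & \text{otherwise.} \end{cases} \tag{2.1}$$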
Remark 2.2
Note that we do require A(0) = n, and thus this definition of the ancestral process does depend on the sample size n. However, for different sample sizes n′, the rates of the process are given by equation (2.1) as well, only the initial state changes. The dynamics of the process is essentially the same, independent of the sample size, and we therefore do not include the dependence on the sample size explicitly in the notation for the remainder of this article.
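The ancestral process can be used to formally define the time to the most recent common ancestor as
$$T_{MRCA} := \inf\{t \in \mathbb{R}_+ : A(t) = 1\},$$
the time when the number of lineages reaches one. Furthermore, with
$$p_k(t) := \mathbb{P}\{A(t) = k\}$$
for k ∈ {1, …, n}, and the row-vector $p(t) := (p_1(t), \dots, p_n(t))$, the distribution of the ancestral process can be obtained by solving the Kolmogorov-forward-equation (Stroock, 2008, Chapter 5), a system of ordinary differential equations (ODEs) given by
$$\frac{d}{dt}\, p(t) = p(t)\, Q(t). \tag{2.2}$$
Equivalently, perhaps more familiar to the reader, this system can be expressed as
$$\frac{d}{dt}\, p_k(t) = \lambda(t) \left[ \binom{k+1}{2}\, p_{k+1}(t) - \binom{k}{2}\, p_k(t) \right] \tag{2.3}$$
for all k ∈ {1, …, n}, with the convention $p_{n+1} \equiv 0$. The latter version is more explicit about the influence of the number of ancestral lineages and the coalescent-speed function on the dynamics of the ODEs. The relevant solution is given by
$$p(0) = e_n := (0, \dots, 0, 1)$$
and
$$p(t) = \left( \exp\!\left( \int_0^t Q(s)\, ds \right) \right)_{n,\cdot} \tag{2.4}$$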
for t ∈ ℝ+, where (·)n,· refers to the n-th row of the matrix. In Tavaré and Zeitouni (2004), the authors provide an analytic expression for these probabilities using the spectral decomposition of the rate matrix Q(t). However, the resulting formulas are numerically unstable, so for practical purposes it can be more efficient to solve the system of ODEs numerically using step-wise solution schemes. Furthermore, note that
$$\mathbb{P}\{T_{MRCA} \le t^*\} = \mathbb{P}\{A(t^*) = 1\} \tag{2.5}$$
holds for t* ∈ ℝ+, thus equation (2.4) can also be used to compute the cumulative distribution function of the time to the most recent common ancestor.
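A step-wise solution of the componentwise forward equation can be sketched as follows. This is an illustrative fourth-order Runge-Kutta integrator and not the scheme used in the paper; the function name, its arguments, and the fixed step count are assumptions. The rate function `lam` is assumed continuous on the integration interval; for a piece-wise continuous rate, each piece would be integrated separately.

```python
import math

def solve_ancestral_ode(n, lam, t_end, steps=2000):
    """Integrate dp_k/dt = lam(t) * (C(k+1,2) p_{k+1} - C(k,2) p_k) with RK4.

    Returns a list p with p[k] = P{A(t_end) = k}; index 0 is unused.
    """
    def rhs(t, p):
        out = [0.0] * (n + 2)
        for k in range(1, n + 1):
            out[k] = lam(t) * (math.comb(k + 1, 2) * p[k + 1]
                               - math.comb(k, 2) * p[k])
        return out

    p = [0.0] * (n + 2)   # p[n + 1] stays 0 and only serves as padding
    p[n] = 1.0            # initial state A(0) = n
    h = t_end / steps
    t = 0.0
    for _ in range(steps):
        k1 = rhs(t, p)
        k2 = rhs(t + h / 2, [a + h / 2 * b for a, b in zip(p, k1)])
        k3 = rhs(t + h / 2, [a + h / 2 * b for a, b in zip(p, k2)])
        k4 = rhs(t + h, [a + h * b for a, b in zip(p, k3)])
        p = [a + h / 6 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(p, k1, k2, k3, k4)]
        t += h
    return p[: n + 1]

# Example: sample of size 5 with a linearly increasing coalescent rate.
p = solve_ancestral_ode(5, lambda t: 1.0 + t, 1.0)
print(p[1])  # probability that the MRCA has been reached by time 1
```

Because the system is triangular, the entry `p[1]` at time t* is also the value of the CDF of the time to the most recent common ancestor at t*.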
We can employ the ancestral process to compute the total tree length as follows. If at a given time t there are k ancestral lineages or branches in the coalescent tree, each branch extends further into the past. Thus, we can say that the total sum of branch lengths in the coalescent tree grows at a rate of k. Once all lineages have coalesced into a single common ancestral lineage, the most recent common ancestor is reached, and the coalescent tree stops growing. This motivates the following definition.
Definition 2.3
The accumulated tree length L(t) ∈ ℝ+ by time t ∈ ℝ+ is given by
$$L(t) := \int_0^t A(s)\, \mathbb{1}_{\{A(s) \ge 2\}}\, ds. \tag{2.6}$$
With this definition, the total tree length or the total sum of the branch lengths at a single locus is given by
$$\mathcal{L} := \lim_{t \to \infty} L(t) = \int_0^{T_{MRCA}} A(s)\, ds. \tag{2.7}$$
Note that
$$\mathcal{L} = \sum_{k=2}^{n} k\, T_k$$
holds, which is equal to equation (1.1). Here $T_k$ is the period of time for which k lineages persist in the ancestral process, the inter-coalescent time. The main goal of this paper is to study the distribution of ℒ for populations with arbitrary coalescent-rate function λ(t) marginally at a single locus and jointly at two loci, which can be computed using a system of hyperbolic PDEs that is closely related to the ODE (2.3). For the two-locus case, we will now introduce the joint ancestral process at two linked loci.
2.2 Ancestral Process with Recombination
The joint genealogy of the ancestral lineages for two loci, locus a and b, separated by a recombination distance ρ is commonly modeled by the coalescent with recombination (Hudson, 1990). The initial state in the coalescent with recombination for a sample of size n is comprised of n lineages, each ancestral to both loci of one sampled haplotype. As in the single-locus coalescent with variable population size, at time t, each pair of lineages can coalesce at rate λ(t). In addition, ancestral recombination events happen at rate ρ/2 along each active lineage. At a recombination event, the lineage splits into two new lineages, each ancestral to the respective haplotype of the original lineage at only one of the two loci. Note that recombination happens along each lineage at a constant rate and, unlike the coalescent rate, is not affected by the population size, and thus it does not scale with λ(t).
Again, we do not focus on the exact genealogical relationships, but only on the number of lineages at time t that are ancestral to a certain locus, given by the ancestral process with recombination Aρ(t). The process Aρ for a sample of size two under constant population size is described in detail by Simonsen and Churchill (1997). Here we use an extension of this process to samples of arbitrary size n and variable population size. A similar model has also been introduced by Hobolth and Jensen (2014).
Definition 2.4 (Ancestral Process with Recombination)
For a sample of size n ∈ ℕ and t ∈ ℝ+, the ancestral process with recombination in a population of variable size
$$A^{\rho}(t) := \big(K_{ab}(t),\, K_a(t),\, K_b(t)\big)$$
is a time-inhomogeneous Markov chain with state space
The component Kab(t) gives the number of lineages that are ancestral to both loci, Ka(t) is the number ancestral to locus a only, and Kb(t) is the number ancestral to locus b only. The initial state is
$$A^{\rho}(0) = (n, 0, 0),$$
all n lineages ancestral to both loci. The transition rates are given by the infinitesimal generator matrix
where all off-diagonal entries of Qc (coalescence) are zero, except
| (2.8) |
and all off-diagonal entries of Qρ (recombination) are zero, except
The state (1, 0, 0) is defined to be the absorbing state, so all rates leaving this state are set to zero. Furthermore, the diagonal entries of both matrices are set to minus the sum of the off-diagonal entries in the corresponding row.
Remark 2.5
Two versions of the coalescent with recombination are commonly used in the literature, one version for the infinitely-many-sites (IMS) model (Hudson, 1990; Griffiths and Marjoram, 1997), and another version for the finitely-many-sites (FMS) model (Paul et al., 2011; Steinrücken et al., 2015). In the IMS version, the chromosome is modeled as the interval [0, 1], and whenever recombination occurs, it occurs at a uniformly chosen point in this interval. As a result, recombination always occurs at a novel site, and two neighboring local genealogies are separated by at most one recombination event. In the FMS version, multiple recombination events can occur between two loci. It can be obtained from the IMS version by considering the local genealogies at two fixed loci along the continuous chromosome that are separated by a certain fixed recombination distance. Our definition of the ancestral process with recombination is in line with the FMS version for two loci.
The ancestral process with recombination can be defined for an arbitrary number of loci. However, in the remainder of the paper, we will only use the process for two loci.
In the literature, some authors use the ‘full’ coalescent with recombination and others the ‘reduced’ coalescent with recombination. The difference between the two is that the ‘full’ version always keeps track of both ancestral lineages that branch off at a recombination event, whereas in the ‘reduced’ version, lineages that don’t leave any descendant ancestral material in the contemporary sample are not traced. Our definition of the ancestral process with recombination is compatible with the ‘reduced’ version. Thus, the number of ancestral lineages is bounded, which is not the case in the ‘full’ version.
Following the ideas of Wiuf and Hein (1999), the correlation structure between all local genealogies along a chromosome can be approximated using the Sequentially Markovian Coalescent (SMC) (McVean and Cardin, 2005), or the modified version SMC’ (Marjoram and Wall, 2006). In the SMC, if a lineage has been hit by a recombination event and branches into two, subsequently, the two resulting branches are not allowed to coalesce with each other, whereas such events are permitted under the SMC’. Thus, under the SMC’, the rates for coalescence of lineages with no overlapping ancestral material (equation (2.8)) are as given in Definition 2.4, whereas under the SMC, these rates have to be set to zero.
Again, the Kolmogorov-forward-equation can be used to compute the distribution of the ancestral process Aρ(t) as the solution of
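$$\frac{d}{dt}\, p(t) = p(t)\, \tilde{Q}(t), \tag{2.9}$$
where the row-vector p(t) is defined by
$$p(t) := \big( \mathbb{P}\{A^{\rho}(t) = s\} \big)_{s},$$
with s ranging over the state space of Aρ.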
Note that the rate matrix Q(t) in the ODE (2.2) for the ancestral process at a single locus is triangular for all t. This simplifies approaches to compute solutions substantially, as the solutions can be obtained sequentially for each state of the corresponding Markov chain. In the ancestral process with recombination for two loci on the other hand, with a positive probability, the underlying Markov chain can transition back to a state it already visited before. Consequently, the rate matrix Q̃(t) in the ODE (2.9) is not triangular, and it is also not possible to transform it into a triangular matrix by permuting the rows and columns.
Since a triangular rate matrix simplifies analytical and numerical approaches significantly, we introduce an approximation to the full ancestral process with recombination that exhibits this property and compute the distributions of the tree lengths under this approximation. To achieve this, we explicitly account for the number of recombination events that have occurred up to a certain time t. For ease of exposition, we further limit the maximal number of recombination events to one. Since in most organisms the per-generation recombination probability is very small between loci that are physically close, this approximation is justified. Furthermore, numerical experiments supporting this approximation are provided in Section 4. Note that limiting the number of recombination events to one effectively yields a first-order approximation to the full ancestral process.
Definition 2.6 (Ancestral Process with Limited Recombination)
For a sample of size n ∈ ℕ and t ∈ ℝ+, the ancestral process with limited recombination
$$\bar{A}^{\rho}(t) := \big(\bar{K}_{ab}(t),\, \bar{K}_a(t),\, \bar{K}_b(t),\, \bar{R}(t)\big)$$
is a time-inhomogeneous Markov chain with state space
| (2.10) |
The components K̄ab(t), K̄a(t), K̄b(t) have the same interpretation as before, and R̄(t) is the number of recombination events that have happened by time t. The first line in equation (2.10) corresponds to the states that can be reached without recombination, and the second line to those that require one recombination event. The initial state is
$$\bar{A}^{\rho}(0) = (n, 0, 0, 0),$$
and the transition rates are given by the infinitesimal generator matrix
| (2.11) |
where the entries of Q̄c (coalescence) are given by
and all off-diagonal entries of Qρ (recombination) are zero, except
allowing at most one recombination event. The diagonal entries are set to minus the sum of the off-diagonal entries in the corresponding row. The states (1, 0, 0, 0) and (1, 0, 0, 1) are absorbing states, so all rates leaving these states are set to zero.
For later convenience, define the relation ≺ on 𝒮̄ρ as
| (2.12) |
that is, s ≺ s′ holds if s can be reached from s′ in one step. Note that embedded into the ancestral process with recombination (limited or not) is a single-locus ancestral process for locus a and for locus b. Thus, we can define the branch length of the genealogical tree at locus a and b similar to the one-locus case as follows, and study their joint distribution.
Definition 2.7
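For a given time t ∈ ℝ+, the accumulated tree lengths La(t) ∈ ℝ+ at locus a and Lb(t) ∈ ℝ+ at locus b are given by
$$L^a(t) := \int_0^t \big(K_{ab}(s) + K_a(s)\big)\, \mathbb{1}_{\{K_{ab}(s) + K_a(s) \ge 2\}}\, ds$$
and
$$L^b(t) := \int_0^t \big(K_{ab}(s) + K_b(s)\big)\, \mathbb{1}_{\{K_{ab}(s) + K_b(s) \ge 2\}}\, ds.$$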
Remark 2.8
This definition of the accumulated tree length can be applied to Āρ, as well as Aρ. We will not distinguish these cases in our notation, since in the remainder of the paper, we will use Āρ.
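The total tree length at locus a is thus given by
$$\mathcal{L}^a := \int_0^{T^a_{MRCA}} \big(K_{ab}(s) + K_a(s)\big)\, ds, \tag{2.13}$$
and at locus b by
$$\mathcal{L}^b := \int_0^{T^b_{MRCA}} \big(K_{ab}(s) + K_b(s)\big)\, ds. \tag{2.14}$$
Here, $T^a_{MRCA}$ is the time to the most recent common ancestor at locus a,
$$T^a_{MRCA} := \inf\{t \in \mathbb{R}_+ : K_{ab}(t) + K_a(t) = 1\},$$
and thus its distribution is given by
$$\mathbb{P}\{T^a_{MRCA} \le t^*\} = \mathbb{P}\{K_{ab}(t^*) + K_a(t^*) = 1\}$$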
for t* ∈ ℝ+. Similar relations hold for locus b. We will now study the joint distribution of ℒa and ℒb, and also the marginal ℒ. Note that these quantities are computed under the ancestral process with limited recombination, but we will demonstrate in Section 4 that they give an accurate approximation to the respective quantities under the true ancestral process.
3 Marginal and Joint Distribution of the Total Tree Length
The main goal of this paper is to present a method to compute the marginal and joint cumulative distribution function (CDF) of the total tree length at two linked loci. Thus, we aim at computing
$$\mathbb{P}\{\mathcal{L} \le x\} \tag{3.1}$$
and
$$\mathbb{P}\{\mathcal{L}^a \le x,\; \mathcal{L}^b \le y\} \tag{3.2}$$
for x, y ∈ ℝ+.
Note that equation (2.5) can be used to compute the distribution of the time until the ancestral process reaches the absorbing state, which yields the marginal distribution of the TMRCA. The latter is equal to the sum of the inter-coalescence times, and in a population of constant size, equation (2.5) can also be obtained by convolving the densities of independent exponential variables. The total branch length is a more general linear combination of the inter-coalescence times, but Wiuf and Hein (1999) used a similar convolution approach to derive its marginal density. In a population with variable population size, the inter-coalescence times are not mutually independent. However, Polanski et al. (2003) derived formulas for the density of TMRCA using a uniform rescaling of time by the coalescent-rate function λ(t).
The two main difficulties in extending these considerations to the total tree length in a two-locus model with variable population size are as follows: Firstly, in a model that includes recombination, only the coalescence rates scale with λ(t), while the recombination rate is constant along each lineage. The approach of Polanski et al. (2003), however, relies on a uniform rescaling of all rates, and therefore it cannot be applied. Secondly, note that, similar to equation (2.6), we can define
$$T(t) := \int_0^t \mathbb{1}_{\{A(s) \ge 2\}}\, ds = \min\{t,\, T_{MRCA}\}.$$
With this definition, the quantity t is not only the time elapsed in the ancestral process, but it can also be interpreted as the amount accumulated towards TMRCA. The absorption time of the ancestral process can thus be used in equation (2.5) to compute the distribution of TMRCA. However, when the accumulated tree length L(t) defined in equation (2.6) is considered, the quantity t only gives the elapsed time, and it cannot be used as the amount accumulated towards ℒ. Thus, our approach to compute the distribution of ℒ, and the joint distribution of ℒa and ℒb has to explicitly account for both, the time that has elapsed in the ancestral process, as well as the amount accumulated towards the total tree length.
To this end, with t ∈ ℝ+, we introduce the time-dependent cumulative distribution functions
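$$F_k(t, x) := \mathbb{P}\{A(t) = k,\; L(t) \le x\} \tag{3.3}$$
for k ∈ {1, …, n} and
$$F_s(t, x, y) := \mathbb{P}\{\bar{A}^{\rho}(t) = s,\; L^a(t) \le x,\; L^b(t) \le y\} \tag{3.4}$$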
for s ∈ 𝒮̄ρ.
We will show that the CDFs (3.1) and (3.2) can be computed from the time-dependent CDFs (3.3) and (3.4). Furthermore, we will present numerical schemes to efficiently and accurately compute the time-dependent CDFs (3.3) and (3.4).
3.1 Distribution of the Total Tree Length at a Single Locus
The following lemma shows that the CDF (3.1) can be computed from the time-dependent CDF (3.3).
Lemma 3.1
With definition (3.3), the relation
$$\mathbb{P}\{\mathcal{L} \le x\} = F_1(\bar{t}, x)$$
holds for x ∈ ℝ+ and t̄ ≥ x/2.
Proof
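First, observe that
$$\mathcal{L} = \int_0^{T_{MRCA}} A(s)\, ds \;\ge\; 2\, T_{MRCA},$$
since A(s) ≥ 2 holds for s < TMRCA. Thus, on the event {ℒ ≤ x}, the relation TMRCA ≤ x/2 ≤ t̄ holds, which implies A(t̄) = 1, and therefore
$$\mathbb{P}\{\mathcal{L} \le x\} = \mathbb{P}\{A(\bar{t}) = 1,\; \mathcal{L} \le x\}.$$
On the event {A(t̄) = 1}, t̄ ≥ TMRCA and L(t̄) = ℒ hold, and thus
$$\mathbb{P}\{A(\bar{t}) = 1,\; \mathcal{L} \le x\} = \mathbb{P}\{A(\bar{t}) = 1,\; L(\bar{t}) \le x\} = F_1(\bar{t}, x),$$
which proves the statement of the lemma.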
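As a quick sanity check of the lemma, consider a sample of size n = 2. The display below is a worked example added for illustration and is not part of the original derivation; it writes Λ(t) for the integral of λ over [0, t] and takes F_1(t, x) to be the probability of the event {A(t) = 1, L(t) ≤ x}. For n = 2, the tree consists of two branches of length T_MRCA each, so the accumulated length on the event {A(t̄) = 1} is 2 T_MRCA.

```latex
\begin{aligned}
F_1(\bar{t}, x)
  &= \mathbb{P}\{T_{MRCA} \le \bar{t},\; 2\,T_{MRCA} \le x\}
   = \mathbb{P}\{T_{MRCA} \le \min(\bar{t},\, x/2)\}
   = 1 - e^{-\Lambda(\min(\bar{t},\, x/2))}, \\
\mathbb{P}\{\mathcal{L} \le x\}
  &= \mathbb{P}\{T_{MRCA} \le x/2\}
   = 1 - e^{-\Lambda(x/2)} .
\end{aligned}
```

The two expressions agree exactly when t̄ ≥ x/2, which is the condition in the lemma; for smaller t̄ the time-dependent CDF still depends on t̄.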
Lemma 3.1 shows that the CDF of ℒ can be computed from the time-dependent CDF F1(t, x). Due to the structure of the underlying Markov chain, it is necessary to compute the time-dependent CDFs for all states in order to compute it for the absorbing state. Thus, in the remainder of this section, we focus on computing the time-dependent CDFs for all k ∈ {1, …, n}. Proposition A.14 derived in Appendix A can be applied to show that the time-dependent CDFs solve a certain system of linear hyperbolic PDEs. This yields the following corollary.
Corollary 3.2
The row-vector
$$F(t, x) := \big(F_1(t, x), \dots, F_n(t, x)\big)$$
can be obtained for all points in 𝒰 = {(x, t): 0 < x < nt, t > 0} as the strong solution of
| (3.5) |
with
boundary conditions
| (3.6) |
and matrix Q(t) as defined in equation (2.1).
Proof
Define the function
$$v(k) := k\, \mathbb{1}_{\{k \ge 2\}}$$
on the state space 𝒮 = {1, …, n} of the ancestral process. This function and the generator Q(t) satisfy the requirements of Proposition A.14, and thus, the statement of the corollary follows from Proposition A.14 and Remark A.16.
Remark 3.3
The n-th component of the boundary condition (3.6) is equal to 0 and not ℙ{A(t) = n}. This holds for technical reasons that will be detailed in the proof of Proposition A.14.
Note that the process (A(t), L(t))t∈ℝ+ is a piecewise-deterministic Markov process (see Remark A.17). The right-hand side of equation (3.5) is essentially equal to the right-hand side of equation (2.3), because the only stochastic element in the underlying dynamics is the ancestral process A(t). Given a certain number of lineages {A(t) = k}, the accumulation towards the total tree length happens deterministically at rate k, and is captured by the term V ∂xF(t, x).
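The piecewise-deterministic structure can be made concrete with a small simulation, which also gives a Monte Carlo estimate of the time-dependent CDFs to compare numerical solutions against. This is a sketch under stated assumptions: the function name is hypothetical, and the caller is assumed to supply both the cumulative coalescent rate `Lam` and its inverse `Lam_inv`, so that jump times can be drawn on the rescaled clock.

```python
import math
import random

def simulate_state(n, t, Lam, Lam_inv, rng):
    """Simulate (A(t), L(t)) for a sample of size n.

    Lam is the cumulative coalescent rate and Lam_inv its inverse.
    """
    k, now, length = n, 0.0, 0.0
    while k > 1:
        # With k lineages, the next coalescence occurs after an Exp(C(k,2))
        # waiting time measured on the rescaled clock Lam.
        wait = rng.expovariate(math.comb(k, 2))
        nxt = Lam_inv(Lam(now) + wait)
        if nxt >= t:
            return k, length + k * (t - now)
        length += k * (nxt - now)   # deterministic growth at rate k
        now, k = nxt, k - 1
    return 1, length                # absorbed: the tree no longer grows

# Monte Carlo estimate of F_1(t, x) for n = 5 under constant size.
rng = random.Random(7)
identity = lambda u: u
draws = [simulate_state(5, 10.0, identity, identity, rng) for _ in range(20000)]
estimate = sum(1 for k, length in draws if k == 1 and length <= 6.0) / len(draws)
print(estimate)
```

Between jumps of the lineage count nothing is random: the accumulated length simply grows linearly, which is the transport part of the PDE, while the jumps themselves supply the right-hand side.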
To derive a numerical scheme for the efficient and accurate computation of the time-dependent CDF F(t, x), note that the system of PDEs introduced in Corollary 3.2 can be solved using the method of characteristics (Renardy and Rogers, 2004, Chapter 3). Due to the triangular structure of the matrix Q(t), for a given component with k ∈ {1, …, n}, the right-hand side of equation (3.5) depends only on Fℓ with ℓ ≥ k. Thus, the system of PDEs (3.5) can be solved separately for each k, starting at k = n, and decreasing it step-by-step.
Furthermore, note that for k ∈ {1, …, n},
$$F_k(t, x) = 0 \quad \text{for } x < v(k)\, t, \tag{3.7}$$
since if the ancestral process has k lineages at time t, it must have accumulated at least v(k)t towards the total tree length. It can be shown that the solution to equation (3.5) exhibits this property. Moreover,
$$F_k(t, x) = \mathbb{P}\{A(t) = k\} \quad \text{for } x \ge n\, t,$$
since the process can have accumulated at most nt. Thus, we only have to use equation (3.5) to compute the solution $F_k(t, x)$ when v(k)t ≤ x < n · t. Note that v(1) = 0. Moreover, for k = n, the region v(k)t ≤ x < n · t is empty, and thus Fn(t, x) has a discontinuity along the line x = n · t. See Figure 1 for a visualization of the different regions for different values of k. To devise an accurate and efficient numerical scheme for computing the time-dependent CDFs in the interior region, we use the method of characteristics to solve the respective PDE
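$$\partial_t F_k(t, x) + v(k)\, \partial_x F_k(t, x) = \lambda(t) \left[ \binom{k+1}{2} F_{k+1}(t, x) - \binom{k}{2} F_k(t, x) \right]. \tag{3.8}$$
Figure 1. The different regions and characteristics of Fk(t, x) (defined in equation (3.3)) for different values of k. In (c), according to Lemma 3.1, Fk(t, x) does not depend on t beyond the dashed line x = 2t.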
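Since for k = n, the interior region is empty, we consider k ≠ n and introduce the family of characteristics
$$\tau \mapsto \big(t_0 + \tau,\; x_0 + v(k)\, \tau\big)^{\top}, \qquad \tau \ge 0,$$
starting at the boundary points $(t_0, x_0)$ with $x_0 = n\, t_0$. Taking the derivative of Fk(t, x) along such a characteristic yields
$$\begin{aligned}
&\frac{d}{d\tau} F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big) \\
&\quad= \partial_t F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big)\, \frac{d}{d\tau}(t_0 + \tau) + \partial_x F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big)\, \frac{d}{d\tau}\big(x_0 + v(k)\tau\big) \\
&\quad= \partial_t F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big) + v(k)\, \partial_x F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big) \\
&\quad= \lambda(t_0 + \tau) \left[ \binom{k+1}{2} F_{k+1}\big(t_0 + \tau,\, x_0 + v(k)\tau\big) - \binom{k}{2} F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big) \right].
\end{aligned} \tag{3.9}$$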
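Here we used the chain rule and the fact that the third line is equal to the left-hand side of equation (3.8). Formally, the derivations (3.9) do not hold for all τ. It can be shown, however, that the equality holds for almost all τ; we omit the technical details here for readability. Thus, for given x0, the function τ ↦ Fk(t0 + τ, x0 + v(k)τ) solves the equation
$$\frac{d}{d\tau} F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big) = -a(\tau)\, F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big) + b(\tau)$$
with
$$a(\tau) := \binom{k}{2}\, \lambda(t_0 + \tau)$$
and
$$b(\tau) := \binom{k+1}{2}\, \lambda(t_0 + \tau)\, F_{k+1}\big(t_0 + \tau,\, x_0 + v(k)\tau\big).$$
Since this is a non-homogeneous linear first-order ODE, the solution can be readily obtained as
$$F_k\big(t_0 + \tau,\, x_0 + v(k)\tau\big) = e^{-H(\tau)} \left[ F_k(t_0, x_0) + \int_0^{\tau} e^{H(u)}\, b(u)\, du \right] \tag{3.10}$$
with
$$H(\tau) := \int_0^{\tau} a(u)\, du. \tag{3.11}$$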
The initial conditions for τ = 0 are given by the boundary values of the associated PDE as
Now, to obtain the value of the function Fk(t, x), for given t and x, one just needs to identify the right characteristic and the parameters x0 and τ such that (t0+τ, x0+v(k)τ)⊤ = (t, x)⊤. Since the characteristics are parallel, exactly one of them passes through (t, x), so this characteristic is uniquely determined. Using these values of x0 and τ in the solution (3.10) yields Fk(t, x). However, we will not pursue this strategy to compute the requisite values of Fk(t, x). Instead, we present a numerical upstream scheme in Appendix B.1 that can be used to compute Fk(t, x) efficiently on a suitable grid to ultimately obtain values for the CDF ℙ{ℒ ≤ x}.
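Solving a non-homogeneous linear first-order ODE along a single characteristic can be made concrete with the standard integrating-factor formula. The following Python sketch treats a generic equation g′(τ) = a(τ) g(τ) + b(τ) with g(0) = g0. The coefficient functions `a` and `b`, the function name, and the use of the trapezoidal rule for both integrals are assumptions made for illustration and are not taken from equations (3.10) and (3.11).

```python
import math


def solve_along_characteristic(a, b, g0, tau_end, steps=1000):
    """Solve g'(tau) = a(tau) * g(tau) + b(tau), g(0) = g0, at tau_end.

    Uses the integrating-factor (variation of constants) formula
        g(tau) = exp(I(tau)) * (g0 + int_0^tau exp(-I(s)) * b(s) ds),
    where I(tau) = int_0^tau a(s) ds. Both integrals are approximated
    with the trapezoidal rule on a regular grid.
    """
    h = tau_end / steps
    big_i = 0.0        # running value of I(tau)
    inner = 0.0        # running value of int_0^tau exp(-I(s)) b(s) ds
    prev_a = a(0.0)
    prev_w = b(0.0)    # exp(-I(0)) * b(0), since I(0) = 0
    for j in range(1, steps + 1):
        tau = j * h
        cur_a = a(tau)
        big_i += 0.5 * h * (prev_a + cur_a)
        cur_w = math.exp(-big_i) * b(tau)
        inner += 0.5 * h * (prev_w + cur_w)
        prev_a, prev_w = cur_a, cur_w
    return math.exp(big_i) * (g0 + inner)


if __name__ == "__main__":
    # Constant coefficients a = -3, b = 1, g0 = 0 have the exact
    # solution (1 - exp(-3 tau)) / 3.
    approx = solve_along_characteristic(lambda s: -3.0, lambda s: 1.0, 0.0, 0.5)
    print(approx, (1.0 - math.exp(-1.5)) / 3.0)
```

With constant coefficients the result can be compared against the closed-form solution, which is a convenient check before plugging in time-dependent coefficients.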
3.2 Joint Distribution of the Total Tree Length
In this section we present a method to compute the joint CDF of the total tree length
at two loci a and b separated by a given recombination distance ρ. Again, we approach this problem by first computing the time-dependent joint CDF
We will follow closely along the lines of the method presented in Section 3.1, where we replace the ancestral process A by the ancestral process with limited recombination Āρ, and compute the integrals (2.13) and (2.14), to ultimately compute the joint CDF.
The analog to Lemma 3.1 is as follows.
Lemma 3.4
With definition (3.4), the relation
holds for x, y ∈ ℝ+, t̄ ≥ max{x, y}/2, and Δ = {(1, 0, 0, 0), (1, 0, 0, 1)}, the absorbing states of Āρ.
Proof
The proof is similar to the proof of Lemma 3.1. With
note that
and similarly . Thus, on the event {ℒa ≤ x,ℒb ≤ y}, the relations and hold. This implies K̄ab(t̄) + K̄a(t̄) = 1 and K̄ab(t̄) + K̄b(t̄) = 1, which in turn implies Āρ(t̄) ∈ Δ = {(1, 0, 0, 0), (1, 0, 0, 1)}, since these two states are the only admissible states that can satisfy these conditions. Incidentally, these are also the absorbing states of Āρ. Thus,
holds. Furthermore, on the event {Āρ(t̄) ∈ Δ}, and hold, which imply La(t̄) = ℒa and Lb(t̄) = ℒb. This in turn implies
Finally, note that
which proves the statement of the lemma.
Again, Lemma 3.4 shows that the joint CDF of ℒa and ℒb can be computed from the time-dependent CDFs F(1,0,0,0)(t, x, y), and F(1,0,0,1)(t, x, y). In order to derive a system of PDEs like (3.5) for the time-dependent joint CDF of the tree length at two loci, we again apply Proposition A.14, for dimension d = 2. To this end, define the functions
and
that yield for (kab, ka, kb, r) ∈ 𝒮̄ρ the number of lineages ancestral to locus a and b, respectively, and define
We then have the following corollary.
Corollary 3.5
The time-dependent joint CDF of the tree lengths
can be obtained for all points in U = {(t, x, y) : 0 < x < nt, 0 < y < nt, t > 0} as the strong solution of
| (3.12) |
with boundary conditions
for (t, x, y) ∈ ∂U and Q̄ (t) as defined in (2.11).
Proof
Define the function v: 𝒮̄ρ → ℝ2 as
This function and the generator Q̄ (t) satisfy the requirements of Proposition A.14, and thus, the statement of the corollary follows from Proposition A.14 and Remark A.16.
Remark 3.6
Note that due to symmetry of Āρ,
holds.
The process (A(t), La(t), Lb(t))t∈ℝ+ is a piecewise-deterministic Markov process as well (see Remark A.17), where Q̄ (t) captures the stochastic dynamics, and ∂x and ∂y the deterministic dynamics. The numerical scheme to compute the time-dependent joint CDF is again an upstream scheme based on the method of characteristics and follows essentially along the lines of the scheme presented for the marginal case. The relation ≺ defined in (2.12) implies a partial ordering on the state space 𝒮̄ρ, and the matrix Q̄ (t) is triangular with respect to this ordering. Thus, again, the values of Fs depend only on Fs′ with s ≺ s′, and they can be computed for each s separately. For given s ∈ 𝒮̄ρ,
| (3.13) |
holds. Figure 2 shows the different regions of Fs(t, x, y) for a fixed t. Moreover, for each s ∈ 𝒮̄ρ, the PDE that has to be satisfied in the region va(s) · t ≤ x < n· t and vb(s) · t ≤ y < n· t can be re-written as
Figure 2.
The different regions and (projected) characteristics of Fs(t, x, y) (defined in equation (3.4)) for an intermediate state s ∈ 𝒮̄ρ at a given time t. The characteristics also extend in the t-direction at unit speed. Note that for the states s with va(s) = n or vb(s) = n the interior region is empty.
| (3.14) |
where ∇f = (∂xf, ∂yf)⊤. Again, taking the derivative of Fs(t, x, y) along the characteristics
with , x0 := (x0, y0), and v(s) := (va(s), vb(s)), yields the right-hand side of equation (3.14). Thus, Fs(·, ·, ·) satisfies the ODE
with
and
The characteristics for Fs(t, x, y) are depicted in Figure 2. Like in the marginal case, this is a non-homogeneous linear first-order ODE and can be readily solved. The solution involves integrating , which leads to
| (3.15) |
with
| (3.16) |
We provide the details of our numerical upstream scheme to efficiently and accurately compute solutions to equation (3.15) in Appendix B.2.
4 Empirical evaluation
In this section, we demonstrate that the numerical algorithms presented in Appendices B.1 and B.2 can be used to accurately and efficiently compute the time-dependent marginal CDF (3.3) and joint CDF (3.4), as well as the regular marginal CDF (3.1) and joint CDF (3.2), for different population size histories and different recombination rates. Furthermore, we show how our method can be used to study properties of the marginal and joint distributions, and compute their moments. We implemented the numerical algorithms in Matlab, and the code is available upon request.
For ease of exposition, we use a sample size of n = 10 in the remainder of this paper, unless mentioned otherwise. We mainly focus on three population size histories, depicted in Figure 3. The first is a history of constant size 1, and we refer to the corresponding rate function as λc. Secondly, we consider a history with an ancient bottleneck, followed by exponential growth up to the present. Specifically, for t > 0.15, the relative population size is set to 2, and for 0.025 < t < 0.15, it is set to 0.25. Then, the population grows exponentially from size 0.25 at t = 0.025 up to t = 0 (the present), at an exponential rate of g. We refer to this population size history by λe, and if not mentioned otherwise, the growth rate is set to g = 200. This size history is a rough sketch of the human population size history, with an out-of-Africa bottleneck, followed by recent exponential growth at a rate of 1% per generation. In addition, we consider a pure bottleneck, where the relative ancestral size is 2 until time t = 0.05, and NB from t = 0.05 until the present. We refer to this size history by λb, and if not otherwise mentioned, we set NB = 0.2.
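The three size histories can be written down as small functions of time. The following Python sketch encodes the relative population sizes as described above. The function names are hypothetical, the behavior exactly at the change points t = 0.025, t = 0.05, and t = 0.15 is a convention chosen here, and taking the coalescent rate as the inverse of the relative size is a standard assumption that the text above does not spell out.

```python
import math


def size_constant(t):
    """Relative population size for the constant history (lambda_c)."""
    return 1.0


def size_growth(t, g=200.0):
    """Relative size for the bottleneck-plus-growth history (lambda_e).

    Time t is measured backwards from the present (t = 0).
    """
    if t > 0.15:
        return 2.0            # ancestral size
    if t > 0.025:
        return 0.25           # ancient bottleneck
    # Exponential growth from 0.25 at t = 0.025 up to the present.
    return 0.25 * math.exp(g * (0.025 - t))


def size_bottleneck(t, n_b=0.2):
    """Relative size for the pure bottleneck history (lambda_b)."""
    return 2.0 if t > 0.05 else n_b


def coalescent_rate(size, t):
    """Assumed relation: coalescent rate = 1 / relative population size."""
    return 1.0 / size(t)


if __name__ == "__main__":
    for t in (0.0, 0.01, 0.1, 0.2):
        print(t, size_growth(t), coalescent_rate(size_bottleneck, t))
```

With g = 200, the contemporary relative size under this encoding is 0.25 · e⁵, since the growth phase lasts 0.025 time units.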
Figure 3.

The three population size histories we will mainly consider in this paper: A constant population size (λc), an ancient bottleneck followed by exponential growth (λe), and a recent bottleneck (λb).
4.1 Accuracy
In this section we demonstrate that the numerical algorithms presented in this paper can be used to compute the requisite CDFs accurately. Naturally, the accuracy will depend on the exact choice of the grid for the numerical algorithm. We will present results for a particular grid here, and discuss the issues for choosing an adequate grid in Section 5. We set n = 5 and compute the time-dependent marginal CDF
for k = 5, 3, and the absorbing state 1, and show the respective surfaces as functions of t and x in Figure 4. Here we used the population size history with exponential growth λe. These surfaces exhibit the properties sketched in Figure 1, and the different regions can be observed. Below the line x = nt, the functions are independent of x. Furthermore, the functions are zero above the line x = kt, except for k = 1, where the function is independent of t above the line x = 2t.
Figure 4.
Heatmaps of ℙ{A(t) = k,L(t) ≤ x} (defined in equation (3.3)) as a function of t and x, for different k, computed using our numerical algorithm.
As shown in Section 3, the marginal CDF of the total tree length
and the joint CDF
can be computed from the respective time-dependent CDFs. To demonstrate the accuracy of our numerical algorithm, we compared the numerical values from the algorithm to simulations under the respective ancestral processes A and Āρ. To this end, we simulated a certain number N of trajectories from these processes, and estimated the respective probabilities. Figure 5 shows the marginal CDFs for n = 10 under exponential growth (λe) and the bottleneck scenario (λb). The simulations can also be used to bound the difference d(Ppde, PT) between the values computed using the numerical scheme Ppde and the true value PT. These bounds are indicated in Figure 5 for different values of N and decrease as N gets larger, as expected. For the joint CDF, we present numerical values for different x and y, and compare them to the respective estimates from the simulations, including confidence bounds for these estimates. We set n = 10, and used ρ = 0.001. The values for the model with exponential growth (λe) are shown in Table 1, and for the bottleneck scenario (λb) in Table 2. The values computed using the numeric algorithm always fall into the confidence bounds, demonstrating that our algorithm computes the respective values accurately. In these tables, it becomes particularly apparent that in order to guarantee a high accuracy using simulations, a very large number of trajectories needs to be simulated, which is time-consuming. Our numerical scheme yields a high accuracy, and does not suffer from these issues.
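The simulation-based comparison can be illustrated in the simplest setting. The following Python sketch estimates the marginal CDF ℙ{ℒ ≤ x} by simulating trajectories for a constant population size only, assuming Kingman-type coalescence at rate k(k − 1)/2 while k lineages remain and a tree length that grows at speed k during that time. It is not the simulation code used for Figure 5, and the function names are hypothetical. Variable population sizes and recombination would require additional machinery.

```python
import random


def simulate_tree_length(n, rng):
    """Draw one total tree length for a sample of size n (constant size).

    While k >= 2 lineages remain, the waiting time to the next
    coalescence is exponential with rate k * (k - 1) / 2, and the tree
    length grows at speed k.
    """
    total = 0.0
    for k in range(n, 1, -1):
        wait = rng.expovariate(k * (k - 1) / 2.0)
        total += k * wait
    return total


def estimate_cdf(n, x, num_trajectories, seed=0):
    """Monte Carlo estimate of P{tree length <= x} from simulated trees."""
    rng = random.Random(seed)
    hits = sum(
        simulate_tree_length(n, rng) <= x for _ in range(num_trajectories)
    )
    return hits / num_trajectories


if __name__ == "__main__":
    # For n = 2 the tree length is twice an Exp(1) waiting time, so the
    # exact CDF is 1 - exp(-x / 2).
    print(estimate_cdf(2, 2.0, 20000))
```

For n = 2 the estimate can be compared with the exact value 1 − exp(−x/2), and for general n the sample mean can be compared with the classical expectation 2(1 + 1/2 + … + 1/(n − 1)).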
Figure 5.
The CDF ℙ{ℒ ≤ x} (defined in equation (3.1)) as a function of x is depicted by the red line. Additionally, the green bars indicate the bound on the distance between the numerical value Ppde and the true value PT for different N, thus the true value is guaranteed to fall within these bounds.
Table 1.
The CDF ℙ{ℒa ≤ x,ℒb ≤ y} (defined in equation (3.2)) for different values of x and y under λe, with n = 10 and ρ = 0.001. p is computed using the numerical algorithm, and p̂ is estimated from simulations for different N. The confidence bounds are indicated in parentheses.
| x | y | p | p̂ (N = 256,000) | p̂ (N = 16,384,000) |
|---|---|---|---|---|
| 1.5 | 3.0 | 0.075326 | 0.074914 (± 0.002) | 0.075030 (± 0.0002) |
| 3.0 | 6.0 | 0.213703 | 0.213324 (± 0.002) | 0.213565 (± 0.0002) |
| 6.0 | 6.0 | 0.522821 | 0.521578 (± 0.002) | 0.522707 (± 0.0003) |
| 12.0 | 18.0 | 0.873357 | 0.872840 (± 0.002) | 0.873319 (± 0.0002) |
| 30.0 | 30.0 | 0.998499 | 0.998504 (± 0.0002) | 0.998516 (± 0.00002) |
Table 2.
The CDF ℙ{ℒa ≤ x,ℒb ≤ y} (defined in equation (3.2)) for different values of x and y under λb, with n = 10 and ρ = 0.001. p is computed using the numerical algorithm, and p̂ is estimated from simulations for different N. The confidence bounds are indicated in parentheses.
| x | y | p | p̂ (N = 256,000) | p̂ (N = 16,384,000) |
|---|---|---|---|---|
| 1.5 | 3.0 | 0.019794 | 0.019238 (± 0.0006) | 0.019579 (± 0.00007) |
| 3.0 | 6.0 | 0.094393 | 0.094414 (± 0.002) | 0.094172 (± 0.0002) |
| 6.0 | 6.0 | 0.369581 | 0.369059 (± 0.002) | 0.369544 (± 0.0003) |
| 12.0 | 18.0 | 0.812236 | 0.812328 (± 0.002) | 0.812109 (± 0.0002) |
| 30.0 | 30.0 | 0.997696 | 0.997922 (± 0.0002) | 0.997721 (± 0.00003) |
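The dependence of the confidence bounds in Tables 1 and 2 on N can be illustrated with a short calculation. The construction of the bounds is not stated here, so the following Python sketch assumes a normal-approximation 95% interval for a binomial proportion, with the hypothetical helper `normal_half_width`. Under that assumption, the half-widths, rounded up to one significant digit, appear to be consistent with the tabulated values.

```python
import math


def normal_half_width(p_hat, num_trajectories, z=1.96):
    """Half-width of a normal-approximation confidence interval.

    Assumes the estimate p_hat is a binomial proportion from
    num_trajectories independent simulations; z = 1.96 corresponds to
    a 95% interval. Both choices are assumptions of this sketch.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / num_trajectories)


if __name__ == "__main__":
    # Third row of Table 1: the half-width shrinks by a factor of 8
    # when N grows by a factor of 64.
    print(normal_half_width(0.521578, 256_000))
    print(normal_half_width(0.522707, 16_384_000))
```

Since the half-width scales like 1/√N, gaining one additional decimal digit of accuracy requires roughly 100 times as many simulated trajectories.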
4.2 Properties of the Distributions
The results provided in the previous section show that our numerical algorithm can be used to accurately and efficiently compute the marginal and joint CDF of the total tree length in populations with variable size. We will now demonstrate the utility of our numerical method to study properties of the respective distributions.
The numerical values of the marginal CDF ℙ{ℒ ≤ x} can be readily used to compute approximations of the expected value and the variance of the total tree length ℒ. Figure 6 shows the expectation and the variance under exponential growth (λe) for varying growth rates g. Recall that a rate of g = 0 corresponds to no growth. Figure 7 shows the expected value and the variance under the bottleneck model (λb) for different values of the bottleneck size NB. In both scenarios, the expected value and the variance are smallest in the models with the smallest contemporary population size, corresponding to the largest recent coalescent rate. They increase as g, respectively NB, increases, but level off, indicating that increasing the population size has diminishing effects for large values. The expectation is larger overall in the bottleneck scenario because, independent of the growth parameter, the growth scenario contains a substantial bottleneck.
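One way to obtain moments from CDF values is through the tail formulas 𝔼[ℒ] = ∫ (1 − F(x)) dx and 𝔼[ℒ²] = ∫ 2x (1 − F(x)) dx, which hold for non-negative random variables. The following Python sketch applies them with the trapezoidal rule on a regular grid. How the moments shown in Figures 6 and 7 were actually approximated is not specified here, so this is one possible approach, and the function names and the truncation point `x_max` are assumptions.

```python
import math


def trapezoid(values, h):
    """Trapezoidal rule for equally spaced samples with spacing h."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def moments_from_cdf(cdf, x_max, num_points=4001):
    """Approximate mean and variance of a non-negative random variable.

    Uses E[X] = int (1 - F(x)) dx and E[X^2] = int 2 x (1 - F(x)) dx,
    truncated at x_max, where the CDF should be numerically close to 1.
    """
    h = x_max / (num_points - 1)
    xs = [i * h for i in range(num_points)]
    survival = [1.0 - cdf(x) for x in xs]
    mean = trapezoid(survival, h)
    second = trapezoid([2.0 * x * s for x, s in zip(xs, survival)], h)
    return mean, second - mean * mean


if __name__ == "__main__":
    # For n = 2 and constant size 1, the tree length is twice an Exp(1)
    # waiting time, so the mean is 2 and the variance is 4.
    print(moments_from_cdf(lambda x: 1.0 - math.exp(-x / 2.0), 80.0))
```

The truncation point matters: if the CDF has not yet reached 1 at `x_max`, both moments are underestimated.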
Figure 6.
Approximations to the expected value and the variance of the total tree length ℒ (defined in equation (2.7)) computed using our numerical procedure, under the model for exponential growth (λe), with different values for the growth-rate g.
Figure 7.
Approximations to the expected value and the variance of the total tree length ℒ (defined in equation (2.7)), under the bottleneck model (λb), with different values for the bottleneck size NB.
Figure 8 shows the joint CDF ℙ{ℒa ≤ x,ℒb ≤ y} as a function of x and y for different population size scenarios and different recombination rates ρ, computed on a suitable grid using our numerical algorithm. Naturally, the CDF converges towards 1 as x and y increase, and due to the symmetry of the ancestral process Āρ the CDF is symmetric when interchanging x and y. Furthermore, note that the isolines in the plots for ρ = 0.0001 show pronounced right angles along the line x = y, because for small ρ the trees at the two loci are highly correlated. As the recombination rate increases, the two tree lengths become increasingly uncorrelated, and these angles soften. In all plots, the isoline for 0.2 is around x = y = 5, for the case λe even lower. Thus, under λe, there is an elevated probability for very short trees, likely due to the strong bottleneck, which favors short trees. Under the constant population size model λc, the CDF increases rapidly as x and y increase, whereas the function is less steep for λe and λb. This behavior seems to be dominated by the ancient population sizes.
Figure 8.
The joint CDF ℙ{ℒa ≤ x,ℒb ≤ y} (defined in equation (3.2)) for the three population models (rows), with different recombination rates ρ (columns). Again, we use n = 10.
Finally, we employ our numerical values of the joint CDF to compute approximations to the correlation coefficient between the tree lengths
where cov(·, ·) denotes the covariance. Figure 9 shows this correlation coefficient under the population size history λe for different values of ρ, and sample sizes n = 5 and n = 20, respectively. Recall that our numerical procedure was derived using the approximate ancestral process Āρ for computational efficiency, where we limited the number of recombination events to 1. To compare the correlation under the process Āρ with the correlation under the regular ancestral process with recombination Aρ, we estimated the latter from repeated simulations using the widely applied coalescent-simulation tool ms (Hudson, 2002), which is based on the regular coalescent with recombination (using N = 10⁷ repetitions). Naturally, the correlation is close to 1 for small recombination rates, and it decreases with increasing recombination rate. The values are essentially indistinguishable until they start separating around ρ = 0.05. This is to be expected, since the approximation we introduced limits the number of recombination events to 1, and thus, as the recombination rate increases, the approximation error also increases.
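A covariance can be computed directly from a joint CDF through Hoeffding's identity, cov(X, Y) = ∫∫ [F(x, y) − F_X(x) F_Y(y)] dx dy. The following Python sketch uses this identity on a regular grid with the trapezoidal rule and reads the marginals off the joint CDF at a large truncation value. Whether the correlations in Figure 9 were obtained this way is not stated here, so the approach, the function names, and the truncation parameter `upper` are assumptions made for illustration.

```python
import math


def correlation_from_joint_cdf(joint_cdf, upper, num_points=401):
    """Approximate corr(X, Y) for non-negative X, Y from their joint CDF.

    Covariance: Hoeffding's identity on [0, upper]^2 with the
    trapezoidal rule. Variances: tail formulas applied to the marginals,
    which are approximated by F(x, upper) and F(upper, y).
    """
    h = upper / (num_points - 1)
    xs = [i * h for i in range(num_points)]
    w = [h] * num_points
    w[0] = w[-1] = 0.5 * h  # trapezoidal weights
    fx = [joint_cdf(x, upper) for x in xs]
    fy = [joint_cdf(upper, y) for y in xs]

    cov = 0.0
    for i, x in enumerate(xs):
        row = 0.0
        for j, y in enumerate(xs):
            row += w[j] * (joint_cdf(x, y) - fx[i] * fy[j])
        cov += w[i] * row

    def variance(marginal):
        survival = [1.0 - m for m in marginal]
        mean = sum(wi * s for wi, s in zip(w, survival))
        second = sum(wi * 2.0 * x * s for wi, x, s in zip(w, xs, survival))
        return second - mean * mean

    return cov / math.sqrt(variance(fx) * variance(fy))


if __name__ == "__main__":
    # Two identical Exp(1) variables (correlation 1) and two independent
    # Exp(1) variables (correlation 0) serve as test cases.
    same = lambda x, y: 1.0 - math.exp(-min(x, y))
    indep = lambda x, y: (1.0 - math.exp(-x)) * (1.0 - math.exp(-y))
    print(correlation_from_joint_cdf(same, 30.0))
    print(correlation_from_joint_cdf(indep, 30.0))
```

The two test cases mirror the limiting behavior described above: perfectly correlated tree lengths for very small ρ and increasingly uncorrelated tree lengths as ρ grows.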
Figure 9.
Correlation between ℒa and ℒb (defined in equations (2.13) and (2.14)) under the exponential growth model (λe) for different sample sizes n and different recombination rates ρ. The black lines show the values computed using our method under Āρ, and the blue lines show values estimated from coalescent simulations under Aρ using the popular tool ms (using N = 10⁷ repetitions).
To further investigate how restrictive the assumption of at most one recombination event is, we also used the simulated trajectories to estimate the probability that two or more recombination events occur under the regular ancestral process Aρ. The results are shown in Figure 10. These probabilities increase with increasing ρ and n. However, they remain small for ρ ≤ 0.05, which is in good agreement with the observation that the correlation is well approximated for ρ up to 0.05. In conclusion, Figure 9 and Figure 10 show that the approximate process can be used without loss of accuracy for a large range of recombination rates relevant for human genetics, where recombination rates between neighboring sites are on the order of 10⁻³.
Figure 10.

Probability of two or more recombination events R in the regular ancestral process Aρ (Definition 2.4), under exponential growth (λe), for different sample sizes n and different recombination rates ρ. These values were estimated using the coalescent simulation tool ms (using N = 10⁷ repetitions).
5 Discussion
In this paper, we presented a novel computational framework to compute the marginal and joint CDF of the total tree length in populations with variable size. To our knowledge, these distributions have not been addressed in the literature before, especially in populations of variable size. We introduced a system of linear hyperbolic PDEs and showed that the requisite CDFs can be obtained from the solution of this system. We introduced a numerical algorithm to compute the solution of this system based on the method of characteristics and demonstrated its accuracy in a wide range of biologically relevant scenarios.
The numerical algorithm that we introduced is an upstream-method that computes the requisite solutions step-wise on a grid. We presented the algorithm for a regular, equidistantly spaced grid. We used the trapezoidal rule for the integration steps in the method, and also used linear interpolation to interpolate values that do not fall onto the specified grid. We used these basic approaches for ease of exposition. Using higher order interpolation and integration schemes, combined with adaptive grids that have more points in regions where the coalescent-rate function is large will most certainly increase the accuracy. However, such higher order schemes come with additional computational cost. This opens numerous avenues for future research to optimize the balance between accuracy and efficiency that is required in the respective applications.
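The interplay of characteristics, the trapezoidal rule, and linear interpolation on a regular grid can be illustrated on a scalar toy problem. The following Python sketch is not the scheme of Appendices B.1 and B.2. It solves ∂t u + v ∂x u = −c u with constant speed v and decay rate c, both chosen here for illustration, by following each grid point back along its characteristic for one time step, interpolating the previous solution linearly at the foot point, and applying a trapezoidal update for the decay term. All names are hypothetical.

```python
import math


def interpolate(values, h, x):
    """Linear interpolation on the regular grid 0, h, 2h, ...

    Points outside the grid are clamped to the nearest boundary value.
    """
    if x <= 0.0:
        return values[0]
    pos = x / h
    j = int(pos)
    if j >= len(values) - 1:
        return values[-1]
    frac = pos - j
    return (1.0 - frac) * values[j] + frac * values[j + 1]


def upstream_solve(u0, speed, decay, x_max, t_end, nx, nt):
    """Upstream scheme for u_t + speed * u_x = -decay * u on [0, x_max]."""
    hx = x_max / nx
    ht = t_end / nt
    u = [u0(j * hx) for j in range(nx + 1)]
    # Trapezoidal rule for g' = -decay * g over one step of length ht.
    factor = (1.0 - 0.5 * ht * decay) / (1.0 + 0.5 * ht * decay)
    for _ in range(nt):
        # The characteristic through (t + ht, x_j) starts at x_j - speed * ht.
        u = [
            factor * interpolate(u, hx, j * hx - speed * ht)
            for j in range(nx + 1)
        ]
    return u


if __name__ == "__main__":
    u0 = lambda x: math.exp(-(x - 5.0) ** 2)
    u = upstream_solve(u0, 1.0, 2.0, 12.0, 1.0, 1200, 80)
    # Exact solution: exp(-decay * t) * u0(x - speed * t).
    print(u[600], math.exp(-2.0) * u0(6.0 - 1.0))
```

In this toy problem the linear interpolation smears the profile slightly at every step, which is the kind of error that higher-order interpolation would reduce at additional cost.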
Moreover, for reasons of computational efficiency, we introduced the first-order approximation Āρ to the regular ancestral process with recombination Aρ, and computed the joint CDF under this approximate process. We demonstrated that this approximation is accurate for a large range of relevant recombination rates. It is straightforward to use higher order approximations, including more recombination events, to gain additional accuracy, but computing the joint CDF under the regular ancestral process is desirable. Proposition A.14 guarantees that we can use our numerical procedure to compute the requisite CDF under Āρ, but it is conceivable that it can be extended to more general processes like Aρ in future work.
Another research direction is to use our novel framework to study higher order correlations between trees at multiple loci. On the one hand, this could again be correlations between the total tree lengths, but the distribution of other summary statistics of the genealogical trees could be included, for example, the length of the external branches or the length of all branches subtending k leaves. Statistics that have been successfully used in the literature, like the coalescence time between two lineages (Li and Durbin, 2011; Terhorst et al., 2017) or the time of the first coalescent event (Schiffels and Durbin, 2014) could be used as well. Our framework is flexible enough to compute the distribution of multiple path integrals along the trajectories of a given Markov chain. Thus, to implement these additions, one needs to define and implement an appropriate ancestral process and compute suitable integrals along the trajectories.
In this paper, we studied the ancestral process in a single panmictic population. However, in recent years, researchers have gathered an increasing amount of genomic datasets that contain individuals from multiple sub-populations, and studied historical events like migration or population subdivision using these datasets. In light of these studies, it is important to augment our framework to compute joint CDFs of the total tree length in structured populations with complex migration histories. Again, this can be done by introducing suitable ancestral processes and suitable integrals along their trajectories.
Acknowledgments
We thank Yun S. Song for numerous helpful discussions that sparked many of the ideas presented in this paper. This research is supported in part by a National Institutes of Health grant R01-GM094402 (M.S.). We also thank Robin Young for helpful suggestions and fruitful discussions relevant to the design of the numerical scheme.
Appendix
A Path-integrals of Markov chains
Since the marginal and joint distributions of the tree length can be obtained by integrating a certain function of the ancestral processes, we now consider distributions of path integrals for Markov chains. We will introduce these distributions assuming Lipschitz continuity in Section A.1, and then show in Section A.2 that these assumptions can be relaxed if the state space is monotone.
A.1 Path-integrals under regularity assumptions
Let the Markov chain X be defined on a probability space (Ω, ℱ, ℙ). We will be using the following assumptions throughout the section.
(A1) {X(t, ω), t ∈ ℝ+, ω ∈ Ω} is a regular jump Markov process with values in a finite state space 𝒮n, for convenience labeled 𝒮n = {1, 2, … , n}, satisfying
We assume that the trajectories t → X(t, ω) are right-continuous.
(A2) The infinitesimal generator Q(t) = {qij(t)}i,j=1,…,n is conservative, that is
and satisfies Q ∈ C(ℝ+;Mn×n)∩L∞(ℝ+;Mn×n). In addition, for each i ∈ 𝒮n either qi(t) = 0 for all t ≥ 0 or qi(t) > 0 for all t ≥ 0. In the latter case, we require .
Definition A.1
Let X(t) satisfy (A1)–(A2). Given a function
we define a vector-valued path-integral over the interval [0, t] by
Definition A.2
Let X(t) satisfy (A1)–(A2) and v : 𝒮n → ℝd be some real-valued function defined on the state space. We define a distribution vector-function associated with (X(t), Lv(t)) by
| (A.1) |
with
k ∈ {1, … , n}, x ∈ ℝd, and the comparison is understood componentwise.
Definition A.3
Let v : 𝒮n → ℝd be some real-valued function defined on the state space. Define
and
as the componentwise minima and maxima.
Remark A.4
Since X(t) is a regular jump process, it is separable. Thus, for each t0 ∈ ℝ+ the random variable Lv(t0, ·) is well-defined and ℱ-measurable, and for each ω0 ∈ Ω the map t → Lv(t, ω0) is Lipschitz continuous. This in turn implies that the process Lv(t) is measurable and separable (see Chapter 12 of Koralov and Sinay (2007) and Chapter 2 of Doob (1953)).
Proposition A.5 (Locally Lipschitz)
Let (A1)–(A2) hold. Let v : 𝒮n → ℝd and Fv be defined by (A.1). Suppose that Fv is Lipschitz continuous in an open set 𝒰 ⊂ ℝ+ × ℝd. Then
| (A.2) |
where Vj = diag (vj(1), … , vj(n)).
To prove this statement, we first need to introduce the following lemmas.
Lemma A.6
Let X(t) satisfy (A1)–(A2), v : 𝒮n → ℝd. Then, for any i, k ∈ 𝒮n, any ε > 0, and each x ∈ ℝd, we have
and the comparison is understood componentwise.
Proof
We suppress the superscript v in our calculations. Take any ε > 0. Observe that
and therefore
Let y ∈ ℝd. Suppose that 0 < Fi(t, y). Observe that for any t0 > 0 the path-integral L(t0) is fully determined by (X(t), t ∈ [0, t0)). This fact (and the separability of the process) enables us to use the Markov property and we obtain
This together with the previous inequality finishes the proof.
Lemma A.7
Let X(t) satisfy (A1)–(A2) and v : 𝒮n → ℝd. For every (t, x) ∈ ℝ+ × ℝd and each k ∈ 𝒮n, we have
Proof
We suppress the superscript v in what follows. Due to (A1) the process X(t) is separable and therefore (see Karlin, 1981, p. 146)
| (A.3) |
Suppose now that Fk(t, x) > 0. Then, using the Markov property, we obtain
If Fk(t, x) = 0, then the first and the last term in the above identity are zero. Thus, employing (A.3) and recalling that Q is continuous, we conclude
Since , the statement of the lemma follows.
We can now return to the proof of Proposition A.5.
Proof of Proposition A.5
Let us suppress the superscript v in our calculations. Let 𝒰̃ denote the set of all points in 𝒰 at which F is differentiable. Since F is Lipschitz continuous in 𝒰, Rademacher’s Theorem (Federer, 1969) implies that F is differentiable Lebesgue almost everywhere in 𝒰 and therefore 𝒰\𝒰̃ is of Lebesgue measure zero. Take any k ∈ 𝒮n. Fix any (t, x) ∈ 𝒰̃. For any ε > 0 we have
| (A.4) |
Consider first the terms with i ≠ k. By Lemma A.6, we have
where m and M are as in Definition A.3. Since Q is continuous we must have
and hence, employing the continuity of F, we conclude
| (A.5) |
Next consider the case i = k. By Lemma A.7 we have
| (A.6) |
Combining (A.4) with (A.5) and (A.6) we obtain
| (A.7) |
Since F is differentiable at (t, x) ∈ 𝒰̃ and the map ε → (t+ε, x+ε v(k)) is differentiable with the image contained in 𝒰 for sufficiently small ε, the chain rule is applicable (see Rudin (1976, Theorem 9.15)) and we conclude
Since both k ∈ 𝒮n and (t, x) ∈ 𝒰̃ were arbitrary, (A.7) implies (A.2).
We next show that Fv as t → 0+ has certain continuity properties.
Proposition A.8
Let (A1)–(A2) hold. Let v : 𝒮n → ℝd and Fv be defined by (A.1). Then
Proof
First, we note that L(0) = 0 for all ω ∈ Ω and hence .
Observe that
where mv and Mv are as in Definition A.3. Fix δ > 0. Then for all 0 < t < δ(1 + max(||mv||∞, ||Mv||∞))⁻¹, and every x ∈ ℝd such that ||x − y|| > δ for all , we have
Remark A.9
From Proposition A.8 it follows that the ‘initial values’ of Fv are discontinuous. Since the system (A.2) is a linear hyperbolic system, discontinuities present at time t = 0 will travel in space as time t increases and therefore Fv is not C1 or even continuous. Nevertheless, one can show that (A.2) holds in a weaker sense. To do that one needs to employ the notion of weak solutions, and we will pursue this avenue in an upcoming paper. However, for certain types of state space functions v, relevant to our application, one can show additional regularity properties of Fv. We provide more details in Appendix A.2.
A.2 Path-integrals for monotone state space functions
Hyperbolic systems of partial differential equations admit in general solutions that are not classical even if the initial (or boundary) data is smooth. Typically there are two distinct classes of solutions: strong solutions, which are Lipschitz continuous (see Dafermos (2010)), and weak solutions, which allow for discontinuities. Here, we will be using the first type of solutions.
Definition A.10
Let 𝒰 ⊂ ℝ+ × ℝd be open and let A1, … , Ad, B ∈ L∞(𝒰; ℝn×n). We say that u(t, x) : ℝ+ × ℝd → ℝn is a strong solution of
| (A.8) |
if u is Lipschitz continuous in 𝒰, and the equation (A.8) holds for Lebesgue almost all points (t, x) in 𝒰.
Remark A.11
By Rademacher’s Theorem (Federer, 1969) a function u(t, x) that is Lipschitz continuous in an open domain 𝒰 is differentiable Lebesgue almost everywhere in 𝒰. In fact, its pointwise partial derivatives, which exist almost everywhere, coincide with its corresponding weak partial derivatives (see Evans (2010)).
Regularity of solutions to hyperbolic problems depends on both the initial (or boundary) data and the domain itself. For linear hyperbolic problems as long as the initial data is smooth and the domain has a smooth boundary one may expect a solution to be (locally) smooth. Typically one studies solutions to hyperbolic problems on the domain 𝒰 = ℝ+ × ℝd with initial data u0(x) at t = 0 (Cauchy problem). The initial data for the vector of probabilities Fv (studied in 𝒰) are unfortunately discontinuous (which is shown below). To avoid unnecessary difficulties, in Proposition A.14 we split the space-time domain into two regions 𝒰I and 𝒰E. In 𝒰E the values of Fv admit a simpler form while in 𝒰I the vector Fv is obtained via solving a linear hyperbolic system with smooth initial data. We note that components of Fv are in general merely Lipschitz continuous in 𝒰I. This is not surprising for two reasons. Firstly, the domain is singular because it has a ‘corner’ and the discontinuities of the derivatives of Fv originating at points travel along the corresponding characteristics. Secondly, the vector Fv solves the same system of equations in the domain 𝒰 with discontinuous initial data and hence it is in general not smooth.
Definition A.12
Let (A1)–(A2) hold. Let v : 𝒮n → ℝd. We say that v is monotone along the process X(t) if the map t → v(X(t)) is either non-increasing ℙ-almost surely or non-decreasing ℙ-almost surely.
Definition A.13
Define the following regions in ℝ+ × ℝd:
and
where the comparison is understood componentwise.
Proposition A.14
Let X(t) satisfy (A1)–(A2), X(0) = n, and Q(t) be lower triangular. Let v : 𝒮n → ℝd be monotone along X(t). Suppose that t → v(X(t)) is non-increasing on Ω, and mv < Mv. For x ∈ ℝd, define . Then Fv defined by (A.1) has the following properties:
(i) For each i ∈ 𝒮n, with v(i) < Mv, is Lipschitz continuous on ℝ+ × ℝd.
(ii) For each i ∈ 𝒮n, with v(i) ≮ Mv, .
(iii) Fv is a strong solution of
| (A.9) |
where Vj = diag(vj(1), … , vj(n)), in the open region 𝒰I. Furthermore, let (t, x) ∈ ∂𝒰I. Then
| (A.10) |
and
| (A.11) |
Remark A.15
Computing the solution in Proposition A.14 for a given number d of path-integrals requires computing solutions for d̄ integrals with d̄ < d on the boundary. These can be obtained by straightforwardly applying the Proposition in lower dimensions. Note that for d = 1, the values on the boundary can be directly obtained from the distribution of X(t).
Remark A.16
Note that for each i ∈ 𝒮n we have
for x ≤ v(i)t.
Remark A.17
The process (X(t), Lv(t)) t∈ℝ+ is a time-inhomogeneous piecewise-deterministic strong Markov process (Davis, 1993, Chapter 2), and Proposition A.14 essentially shows that the generator is given by
for suitably defined functions H(x). The stochastic transitions of X(t) are described by Q(t) and the deterministic evolution of Lv(t) in each dimension is governed by the terms Vj ∂xj. However, in addition, Proposition A.14 establishes regularity of Fv(t, x), which is important for numerical computations.
Remark A.18
The ancestral process with limited recombination satisfies assumptions (A1)–(A2), and thus, we focus on this case here. It is conceivable that these assumptions could be relaxed and Proposition A.14 could be extended to more general Markov chains X(t) with a (countably) infinite state space, and more general dynamics, for example, a non-triangular rate matrix Q(t), or . However, the approach presented here in the proof of Proposition A.14 to show the necessary regularity of Fv(t, x) uses the fact that X(t) has absorbing states, and reaches them in finite time, after a finite number of jumps. For a more general version, this strategy would need to be adapted, or a different strategy used.
Proof of Proposition A.14
Let Δ denote the set of absorbing states of the process X(t). Since Q is lower triangular, 1 ∈ Δ and thus Δ is not empty.
Take any i ∈ 𝒮n with v(i) ≮ Mv. Since v is monotone along the process we conclude that
and this yields (ii).
Recall next that for a time-inhomogeneous Markov process X(t) (under the assumptions (A1)–(A2)) the jumping times T1, T2, T3, … of X(t) satisfy and for k ≥ 2
| (A.12) |
Take any i ∈ 𝒮n with v(i) < Mv, in which case i < n. Since Q(t) is lower triangular, each trajectory of the process has at most n − 1 jumps before it enters into the absorbing set Δ. Thus we obtain
We next denote T0 = 0, s0 = n, , with k ≥ 1, and
First, suppose that i ∉ Δ. For (t, x) ∈ ℝ+ × ℝd, using the above partitioning, we write
| (A.13) |
We now show that is Lipschitz continuous. To this end, consider the function
Observe that is well-defined for (t, x) ∈ ℝ1+d. Moreover, since i ∉ Δ, the assumption (A2) implies that the process after entering the state i leaves this state in finite time ℙ-almost surely. Thus and therefore
Now, using (A.12) and induction, one can show that for each r ∈ 𝒮n and k ≥ 1
| (A.14) |
where is a globally bounded function. Thus, we conclude that the map
is globally Lipschitz for each k ≥ 1.
Since v is non-increasing along the process, for each we have
By assumption v(i) < Mv and hence for each l ∈ {1, . . . , d} there exists kl ∈ {1, . . . , k} such that vl(skl− 1) − vl(skl) > 0, which guarantees that not all terms in the nonnegative sum vanish. Then, in view of the fact that the event does not depend on the (t, x)-variable, we can use (A.14) and induction to conclude that
for some globally bounded function f̃kl. It can be shown that this implies that
| (A.15) |
is globally Lipschitz. Combining (A.14) with (A.15) and using the definition of Lipschitz continuity, we conclude that and are globally Lipschitz and hence is as well.
Furthermore, any Lipschitz continuous function composed with a linear map is also Lipschitz continuous. Thus , where B(t, x) = (t, x − v(i)t), is globally Lipschitz. In (A.13) each of the terms in the sum is one of the functions . Hence Fi, restricted to (t, x) ∈ [0, ∞) × ℝd, is globally Lipschitz on this domain.
Lastly, if i ∈ Δ, observe that
and therefore
Using an analogous approach (to the one in the case i ∉ Δ) one can show that each term in the above expression is globally Lipschitz continuous. This yields (i).
From (i) and (ii) it follows that Fv is Lipschitz continuous in the open region 𝒰I. Then, by Proposition A.8 we conclude that Fv is a strong solution of (A.9) in 𝒰I. The boundary conditions (A.10) and equation (A.11) follow directly from the definition of Fv. This proves (iii).
B Numerical Schemes
B.1 Upstream Numerical Scheme for Single-Locus Case
Here we present a numerical algorithm for computing solutions to the system (3.5). The numerical scheme is an upstream scheme based on the method of characteristics. In particular, the numerical scheme we develop makes use of the integral representation formulas (3.10), (3.11).
To define a grid in the (t, x)-space suitable for computation, choose xmax, the maximum value that the CDF ℙ{ℒ ≤ xmax} should be computed for. Due to Lemma 3.1, the relation ℙ{ℒ ≤ x} = F1(tmax, x) holds for all x ≤ xmax, with . Thus tmax is set as the maximal gridpoint for t. In addition to the maximum gridpoints, choose small step sizes Δt and Δx. The number of gridpoints in the t dimension is then given by , and the set of gridpoints is given as
| (B.1) |
For each point Ti, define a grid in the x-dimension as
| (B.2) |
with and X̄i−1 := max(Xi−1). Furthermore, set Ui := |Xi|. The same grid will be used for all k ∈ {1, . . . , n}. The points kΔt + X̄i−1 and kTi are added for numerical stability reasons, to improve the accuracy of the interpolation we will perform in subsequent steps.
Now fix i ∈ {0, . . . , M} and k ∈ {1, . . . , n}, and assume that Fℓ(Ti−1, Xi−1,j) has been computed for all ℓ ∈ {1, . . . , n} and Xi−1,j ∈ Xi−1. Furthermore, assume that Fℓ(Ti, Xi,j) has been computed for all ℓ ∈ {k + 1, . . . , n} and Xi,j ∈ Xi. Under these assumptions, Fk(Ti, Xi,j) can be computed for all Xi,j ∈ Xi as follows. If Xi,j < v(k)Ti, then
| (B.3) |
If Xi,j = nTi, the maximal value of Xi, then
| (B.4) |
The values on the right-hand side can be pre-computed for all k and Ti ∈ T by solving the ODE (2.3) numerically. In the general case, note that the characteristic of Fk that goes through the point (Ti, Xi,j)⊤ and the boundary x = nt intersect at the point (Tx, nTx)⊤, with . Thus, define
the projection of (Ti, Xi,j)⊤ back along the corresponding characteristic to the previous time-slice Ti−1, or onto the boundary x = nt, whichever has the larger t-component. This backward projection step is illustrated in Figure 11(a). Then, according to equation (3.10),
| (B.5) |
holds.
Figure 11. The back-tracing and propagation step of the upstream numerical scheme to compute Fk at all points of the grid.
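The backward projection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the characteristic through (Ti, Xi,j) is taken to be the straight line with slope `speed` (playing the role of v(k)), and the expression used for Tx is derived here from that geometry by intersecting the line with x = nt. The function and argument names are hypothetical.

```python
def back_project(t_i, x_ij, t_prev, speed, n):
    """Project (t_i, x_ij) back along a characteristic of slope `speed`.

    Returns (t_star, x_star, on_boundary): the foot of the characteristic on
    the previous time slice t_prev or on the boundary x = n*t, whichever has
    the larger t-component. All names are illustrative assumptions.
    """
    if not speed < n:
        # A characteristic with slope n runs parallel to the boundary x = n*t.
        raise ValueError("speed must be strictly smaller than n")
    # Characteristic: x(t) = x_ij - speed * (t_i - t). Solve x(t) = n * t.
    t_x = (x_ij - speed * t_i) / (n - speed)
    if t_x >= t_prev:
        # The characteristic reaches the boundary before the previous slice.
        return t_x, n * t_x, True
    # Otherwise the foot lies on the previous time slice.
    return t_prev, x_ij - speed * (t_i - t_prev), False
```

For a uniform time grid, `t_i - t_prev` equals Δt, so the second branch returns the point (Ti−1, Xi,j − v(k)Δt) used in the interpolation step.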
The right-hand side of the equation (B.5) can now be computed using two approximations. Note that the point is in general not on the grid Xi, and thus has not been pre-computed. If Tx < Ti−1, the point is equal to (Ti−1, Xi,j − v(k)Δt)⊤. In that case, identify the two grid points in Xi−1 that are closest to Xi,j − v(k)Δt to the right and left. Then let be the linear interpolation between the values of Fk at those gridpoints. If Tx ≥ Ti−1, then the point is given by and is located on the boundary. Thus
which is also not pre-computed. However, ℙ{A(Ti−1) = k} and ℙ{A(Ti) = k} have been pre-computed, and thus set as the linear interpolation between these two values.
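As a minimal sketch of the first approximation, the following Python function interpolates linearly between the two neighbouring gridpoints of a sorted, possibly non-uniform grid. The names `grid`, `values` and `interp_on_slice` are hypothetical; `values[j]` stands for a pre-computed value of Fk at the gridpoint `grid[j]` of the previous time slice.

```python
from bisect import bisect_left


def interp_on_slice(grid, values, x):
    """Linearly interpolate tabulated `values` on the sorted `grid` at `x`."""
    if not grid[0] <= x <= grid[-1]:
        raise ValueError("x lies outside the grid")
    j = bisect_left(grid, x)  # first index with grid[j] >= x
    if grid[j] == x:
        return values[j]  # x is itself a gridpoint, nothing to interpolate
    x_lo, x_hi = grid[j - 1], grid[j]  # left and right neighbours of x
    w = (x - x_lo) / (x_hi - x_lo)
    return (1.0 - w) * values[j - 1] + w * values[j]
```

The same routine, applied in the t-direction to two pre-computed boundary values, gives the interpolation used when the projected point lies on the boundary.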
The second approximation is to compute the integral on the right-hand side of equation (B.5) using the trapezoidal rule. Thus, the values of Fk on the grid can be computed using
| (B.6) |
The terms gk(·) depend on values of Fℓ with k < ℓ ≤ n that might not have been pre-computed on the grid either. However, the same interpolation schemes as for F̄k can be applied. At this stage it is important, though, to set Fℓ strictly to 0 wherever it should be 0 according to equation (3.7). Lastly, the values for can either be obtained using the analytic formulas in equation (3.11) for certain classes of coalescent-speed functions we will consider (e.g. piecewise constant), or by computing the requisite integrals using the trapezoidal rule, which can be done incrementally. The integration step of our numerical scheme is illustrated in Figure 11(b).
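The incremental use of the trapezoidal rule can be illustrated as follows. This is a generic sketch, not the specific integrand of equation (3.11): `rate` is a hypothetical placeholder for whatever time-dependent function has to be integrated, and `t_grid` for the time grid T. Each new value reuses the previous running total, so the full table costs one function evaluation per gridpoint.

```python
def cumulative_trapezoid(rate, t_grid):
    """Return I with I[i] approximating the integral of `rate` from
    t_grid[0] to t_grid[i], accumulated one time step at a time."""
    integrals = [0.0]
    prev = rate(t_grid[0])
    for t_lo, t_hi in zip(t_grid, t_grid[1:]):
        cur = rate(t_hi)
        # Trapezoid on [t_lo, t_hi], added to the running total.
        integrals.append(integrals[-1] + 0.5 * (t_hi - t_lo) * (prev + cur))
        prev = cur
    return integrals
```

Integrals between consecutive time slices are then differences of neighbouring entries, `I[i] - I[i - 1]`.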
These equations lead naturally to a dynamic programming algorithm to compute Fk on the specified grid. To this end, iterate through the values Ti ∈ T in increasing order. For each Ti, iterate through k ∈ {1, 2, . . . , n} in decreasing order, starting with k = n. Then, for each fixed Ti and k, Fk(Ti, Xi,j) can be computed for every Xi,j ∈ Xi using equations (B.3), (B.4), and (B.6). The order of iteration guarantees that all necessary quantities have been pre-computed. This dynamic program can be employed to compute Fk on the specified grid for all k. Due to Lemma 3.1, the relation
holds, which yields the values of the CDF ℙ {ℒ ≤ x} on the specified grid XM.
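The iteration order of the dynamic program can be checked with a small Python sketch. It only models the bookkeeping described above: the update of (Ti, k) is assumed to need all states at Ti−1 and the states ℓ > k at Ti. The function names are hypothetical, and the actual update formulas (B.3), (B.4) and (B.6) are not reproduced.

```python
def sweep_order(num_time_points, n):
    """Visit time slices in increasing order and, within each slice,
    the states k = n, n - 1, ..., 1."""
    return [(i, k) for i in range(num_time_points) for k in range(n, 0, -1)]


def dependencies_ready(order, n):
    """Check that every (i, k) is visited only after the entries it needs."""
    done = set()
    for i, k in order:
        needed = {(i, l) for l in range(k + 1, n + 1)}  # same slice, l > k
        if i > 0:
            needed |= {(i - 1, l) for l in range(1, n + 1)}  # previous slice
        if not needed <= done:
            return False
        done.add((i, k))
    return True
```

Visiting k in increasing order within a slice would violate the second requirement, which `dependencies_ready` reports by returning `False`.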
B.2 Upstream Numerical Scheme for Two-Locus Case
In the two-locus case, we can compute Fs efficiently on a chosen grid similar to the marginal case. To this end, we again choose xmax = ymax, set , and choose step sizes Δt and Δx = Δy. Then, define the grid T as in definition (B.1), M = |T| and for each Ti, define Xi as in definition (B.2). Furthermore, set Yi := Xi and Ui := |Yi|. Thus, we use the regular grid Xi × Yi in the (x, y)-space.
Now fix Ti and s ∈ 𝒮̄ρ, and assume that Fs′(Ti−1, Xi−1,j, Yi−1,ℓ) has been computed for all s′ ∈ 𝒮̄ρ, Xi−1,j ∈ Xi−1, and Yi−1,ℓ ∈ Yi−1. Furthermore, assume that Fs′(Ti, Xi,j, Yi,ℓ) has been computed for all s′ with s ≺ s′, Xi,j ∈ Xi, and Yi,ℓ ∈ Yi. To compute Fs(Ti, Xi,j, Yi,ℓ), first check using equation (3.13) whether the requisite point lies on the boundary, or in the zero region. The values on the boundary according to equation (3.13) are computed as time-dependent CDFs of marginal integrals along trajectories of the process Āρ , and thus they can be computed using exactly the same procedure as detailed in Section B.1, replacing A by Āρ. In the interior region, applying the trapezoidal rule to the solution of the first-order ODE, for all Xi,j ∈ Xi, and Yi,ℓ ∈ Yi the value of Fs(Ti, Xi,j, Yi,ℓ) can be computed using
Here
with
being the t-coordinate of the point of intersection between the characteristic through the point (Ti, Xi,j, Yi,ℓ)⊤ and the boundary x = nt, and
likewise for the boundary y = nt.
The points will in general not be on the grid of pre-computed values, and thus the approximation has to be used. In the case max(Tx, Ty) < Ti−1, this value can be obtained by identifying the four points in Xi−1 × Yi−1 surrounding ( ), and interpolating the respective values of Fs(Ti−1, ·, ·) linearly. In the case max(Ti−1, Ty) ≤ Tx, the point is on the boundary x = nt, and
holds. The value of the time-dependent CDF on the right-hand side can be obtained as the linear interpolation between the values ℙ{Āρ(Ti−1) = s, Lb(Ti−1) ≤ Yi,ℓ − vb(s)Δt} and ℙ{Āρ(Ti) = s, Lb(Ti) ≤ Yi,ℓ}, which we pre-compute (or approximations thereof) on the boundary using the numerical scheme for the marginal case (see Appendix B.1). By symmetry, the case max(Tx, Ti−1) ≤ Ty can be handled in the same way. Computing will require some Fs′ with s ≺ s′, which can be obtained by similar interpolation procedures, or by setting it to zero in the appropriate regions. The values of can be computed according to equation (3.16) analytically or numerically, as before.
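For the interior case, interpolating linearly between the four surrounding gridpoints can be read as bilinear interpolation on the rectangular grid Xi−1 × Yi−1. The following Python sketch makes that reading explicit; it is an assumption about the interpolation, and all names are hypothetical. `values[j][l]` stands for a pre-computed value of Fs at (`x_grid[j]`, `y_grid[l]`) on the previous time slice.

```python
from bisect import bisect_right


def _bracket(grid, x):
    """Return (j, w) with grid[j] <= x <= grid[j + 1] and relative weight w."""
    if not grid[0] <= x <= grid[-1]:
        raise ValueError("point lies outside the grid")
    j = min(max(bisect_right(grid, x) - 1, 0), len(grid) - 2)
    return j, (x - grid[j]) / (grid[j + 1] - grid[j])


def bilinear(x_grid, y_grid, values, x, y):
    """Interpolate `values` at (x, y) from the four surrounding gridpoints."""
    j, wx = _bracket(x_grid, x)
    l, wy = _bracket(y_grid, y)
    # Interpolate in x along the lower and upper y-lines, then in y.
    lower = (1.0 - wx) * values[j][l] + wx * values[j + 1][l]
    upper = (1.0 - wx) * values[j][l + 1] + wx * values[j + 1][l + 1]
    return (1.0 - wy) * lower + wy * upper
```

Because the grids need not be uniform, the bracketing uses a binary search rather than index arithmetic.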
Again, we can implement these formulas in an efficient dynamic programming algorithm to compute the values of Fs(t, x, y) on the specified grid for all s ∈ 𝒮̄ρ, and thus compute
the joint CDF of the total tree length at two linked loci evaluated on the specified grid.
References
- Bhaskar A, Wang YR, Song YS. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 2015;25(2):268–279. doi: 10.1101/gr.178756.114.
- Dafermos CM. Hyperbolic conservation laws in continuum physics. Springer; New York: 2010.
- Davis MHA. Markov Models & Optimization (Chapman & Hall/CRC Monographs on Statistics & Applied Probability) Chapman and Hall/CRC; 1993.
- Doob J. Stochastic processes. Wiley; New York: 1953.
- Dutheil JY, Ganapathy G, Hobolth A, Mailund T, Uyenoyama MK, Schierup MH. Ancestral population genomics: the coalescent hidden markov model approach. Genetics. 2009;183(1):259–274. doi: 10.1534/genetics.109.103010.
- Eriksson A, Mahjani B, Mehlig B. Sequential markov coalescent algorithms for population models with demographic structure. Theor Popul Biol. 2009;76(2):84–91. doi: 10.1016/j.tpb.2009.05.002.
- Evans L. Graduate studies in mathematics. American Mathematical Society; 2010. Partial Differential Equations.
- Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and snp data. PLoS Genet. 2013;9(10):e1003905. doi: 10.1371/journal.pgen.1003905.
- Federer H. Grundlehren der mathematischen Wissenschaften. Springer; 1969. Geometric measure theory.
- Ferretti L, Disanto F, Wiehe T. The effect of single recombination events on coalescent tree height and shape. PLoS ONE. 2013;8(4):1–15. doi: 10.1371/journal.pone.0060123.
- Griffiths RC. The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor Popul Biol. 2003;64(2):241–251. doi: 10.1016/s0040-5809(03)00075-3.
- Griffiths RC, Marjoram P. Ancestral inference from samples of dna sequences with recombination. J Comp Biol. 1996;3(4):479–502. doi: 10.1089/cmb.1996.3.479.
- Griffiths RC, Tavaré S. Simulating probability distributions in the coalescent. Theor Popul Biol. 1994;46(2):131–159.
- Griffiths RC, Tavaré S. The age of a mutation in a general coalescent tree. Comm Statist Stochastic Models. 1998;14(1–2):273–295.
- Griffiths RC, Marjoram P. An ancestral recombination graph. In: Donnelly P, Tavaré S, editors. Progress in Population Genetics and Human Evolution. Vol. 87. Springer; Berlin: 1997. pp. 257–270.
- Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:e1000695. doi: 10.1371/journal.pgen.1000695.
- Hobolth A, Jensen JL. Markovian approximation to the finite loci coalescent with recombination along multiple sequences. Theor Popul Biol. 2014;98:48–58. doi: 10.1016/j.tpb.2014.01.002.
- Hudson RR. Gene genealogies and the coalescent process. J Evol Biol. 1990;7:1–44.
- Hudson RR. Generating samples under a wright–fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337.
- Kamm JA, Terhorst J, Song YS. Efficient computation of the joint sample frequency spectra for multiple populations. J Comp Graph Stat. 2017;26(1):182–194. doi: 10.1080/10618600.2016.1159212.
- Karlin S. A second course in stochastic processes. Academic Press; New York: 1981.
- Keinan A, Clark AG. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science. 2012;336(6082):740–743. doi: 10.1126/science.1217283.
- Kingman JF. The coalescent. J Evol Biol. 1982;13(3):235–248.
- Koralov L, Sinay Y. Theory of probability and random processes. Springer; Berlin: 2007.
- Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231.
- Marjoram P, Wall JD. Fast “coalescent” simulation. BMC Genet. 2006;7:16. doi: 10.1186/1471-2156-7-16.
- McVean GA, Cardin NJ. Approximating the coalescent with recombination. Philos Trans R Soc Lond B Biol Sci. 2005;360:1387–1393. doi: 10.1098/rstb.2005.1673.
- Paul JS, Steinrücken M, Song YS. An accurate sequentially markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187(4):1115–1128. doi: 10.1534/genetics.110.125534.
- Pfaffelhuber P, Wakolbinger A, Weisshaupt H. The tree length of an evolving coalescent. Probab Theory Related Fields. 2011;151(3):529–557.
- Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63(1):33–40. doi: 10.1016/s0040-5809(02)00010-2.
- Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 2014;10(5):1–27. doi: 10.1371/journal.pgen.1004342.
- Renardy M, Rogers RC. An Introduction to Partial Differential Equations. 2 Springer; 2004.
- Rudin W. International series in pure and applied mathematics. McGraw-Hill; 1976. Principles of Mathematical Analysis.
- Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–25. doi: 10.1038/ng.3015.
- Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: A sequentially markov conditional sampling distribution approach. Genetics. 2013;194(3):647–662. doi: 10.1534/genetics.112.149096.
- Simonsen KL, Churchill GA. A markov chain model of coalescence with recombination. Theor Popul Biol. 1997;52(1):43–59. doi: 10.1006/tpbi.1997.1307.
- Steinrücken M, Kamm JA, Song YS. Inference of complex population histories using whole-genome sequences from multiple populations. 2015. doi: 10.1073/pnas.1905060116. bioRxiv preprint at: http://dx.doi.org/10.1101/026591.
- Stroock DW. An Introduction to Markov Processes (Graduate Texts in Mathematics) Springer; 2008.
- Tavaré S, Zeitouni O. Lectures on Probability Theory and Statistics: Ecole d’Eté de Probabilités de Saint-Flour XXXI - 2001 (Lecture Notes in Mathematics) Springer; 2004.
- Terhorst J, Kamm JA, Song YS. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet. 2017;49(2):303–309. doi: 10.1038/ng.3748.
- Wakeley J. Coalescent Theory: An Introduction. W. H. Freeman; 2008.
- Wiuf C, Hein J. Recombination as a point process along sequences. Theor Popul Biol. 1999;55:248–259. doi: 10.1006/tpbi.1998.1403.
- Zivković D, Wiehe T. Second-order moments of segregating sites under variable population size. Genetics. 2008;180(1):341–357. doi: 10.1534/genetics.108.091231.
- Zivković D, Steinrücken M, Song YS, Stephan W. Transition densities and sample frequency spectra of diffusion processes with selection and variable population size. Genetics. 2015;200(2):601. doi: 10.1534/genetics.115.175265.









