Computing the joint distribution of the total tree length across loci in populations with variable size

Alexey Miroshnikov; Matthias Steinrücken

doi:10.1016/j.tpb.2017.09.002

Author manuscript; available in PMC: 2018 Dec 1.

Published in final edited form as: Theor Popul Biol. 2017 Sep 21;118:1–19. doi: 10.1016/j.tpb.2017.09.002

PMCID: PMC5705476 NIHMSID: NIHMS907787 PMID: 28943126

Abstract

In recent years, a number of methods have been developed to infer complex demographic histories, especially historical population size changes, from genomic sequence data. Coalescent Hidden Markov Models have proven to be particularly useful for this type of inference. Due to the Markovian structure of these models, an essential building block is the joint distribution of local genealogical trees, or statistics of these genealogies, at two neighboring loci in populations of variable size. Here, we present a novel method to compute the marginal and the joint distribution of the total length of the genealogical trees at two loci separated by at most one recombination event for samples of arbitrary size. To our knowledge, no method to compute these distributions has been presented in the literature to date. We show that they can be obtained from the solution of certain hyperbolic systems of partial differential equations. We present a numerical algorithm, based on the method of characteristics, that can be used to efficiently and accurately solve these systems and compute the marginal and the joint distributions. We demonstrate its utility to study properties of the joint distribution. Our flexible method can be straightforwardly extended to handle an arbitrary fixed number of recombination events, to include the distributions of other statistics of the genealogies as well, and can also be applied in structured populations.

Keywords: coalescent theory, variable population size, hyperbolic systems of PDEs

AMS subject classification: 92D10, 60J27, 60J28, 35L40

1 Introduction

Unravelling the complex demographic histories of humans or other species and understanding their effects on contemporary genetic variation is a central goal of population genetics. In addition to advancing our knowledge of the evolutionary processes that shape genomic variation, demographic inference is also an important step towards understanding disease related genetic variation. Recent rapid population growth, for example, severely affects the distribution of rare genetic variants (Keinan and Clark, 2012), which have been linked to complex genetic diseases. Moreover, ancient and contemporary population structure can lead to the accumulation of private genetic variation in certain sub-populations.

Methods to study genetic variation, or perform inference, in populations with varying size or more complex demographic histories have been developed based on the Wright-Fisher diffusion, describing the evolution of population allele frequencies forward in time (Griffiths, 2003; Živković et al., 2015; Gutenkunst et al., 2009; Excoffier et al., 2013), or the Coalescent process, a model for the genealogical relationship in a sample of individuals (Griffiths and Tavaré, 1994; Griffiths and Marjoram, 1996; Griffiths and Tavaré, 1998; Živković and Wiehe, 2008; Bhaskar et al., 2015; Kamm et al., 2017). A powerful representation of genetic variation data that has been used in this context is the Site-Frequency-Spectrum. In this representation, however, any linkage information present in the genetic data is ignored. With the increasing availability of full-genomic sequence data, linkage information is more readily available, and approaches based on Coalescent Hidden Markov Models (HMM) that use this linkage information have proven to be particularly successful for demographic inference and other population genetic applications.

In a population-sample of genomic sequences, the genealogical relationships vary along the genome, due to intra-chromosomal recombination. The Coalescent-HMMs approximate the intricate correlation structure between these local genealogical trees by a Markov chain, the Sequentially Markovian Coalescent (Wiuf and Hein, 1999; McVean and Cardin, 2005). Due to the Markovian structure of the SMC-approximation, an essential building block is thus the transition or joint distribution of these local genealogies at two neighboring loci. In a sample of size two, the local genealogies are simple trees with two leaves, that is, one-dimensional objects at each locus. The transitions can be readily computed, and Li and Durbin (2011) employed this framework to develop a powerful approach to infer population size history. Moreover, Dutheil et al. (2009) used Coalescent-HMMs to explore the divergence patterns between humans and great apes, using up to 4 genomic sequences, one for each species. However, due to the increase in complexity of the local genealogies with increasing sample size, these approaches cannot be generalized efficiently to larger sample sizes.

For large sample sizes, approaches that use Markov chain Monte Carlo techniques (Rasmussen et al., 2014), suitable composite likelihood frameworks (Sheehan et al., 2013; Steinrücken et al., 2015), or representations of the local genealogical trees by lower-dimensional summaries (Schiffels and Durbin, 2014; Terhorst et al., 2017) have been developed. In the latter, the choice of how to represent the local genealogical trees affects the performance of the inference procedure. Li and Durbin (2011) observed that using the coalescence time between two lineages lacks information in the more recent past, whereas using the first coalescence time in a large sample is less accurate for ancient times (Schiffels and Durbin, 2014). A promising low-dimensional representation is the total branch length of the genealogical tree at each locus. In expectation, this quantity grows without bound as the sample size increases, thus retaining not only information about ancient events, but also about the more recent dynamics. However, to implement a Coalescent-HMM inference framework using the tree length, it is crucial to efficiently compute the joint distribution of the total tree length at two neighboring loci.

Thus, in this paper, we present a novel efficient and accurate method to numerically compute the joint distribution of the total branch length of the genealogical trees at two neighboring loci for a sample of arbitrary size n in populations of varying size, as well as the single-locus marginal distribution. To our knowledge, no method to compute these distributions has been presented in the literature to date that can be applied to arbitrary sample sizes. Moreover, even computing the marginal distribution of the total tree length at a single locus has only received limited attention (Pfaffelhuber et al., 2011; Wiuf and Hein, 1999). We present analytical details and numerical results for the case of at most one recombination event separating the two loci, but our methodology can be readily extended to handle an arbitrary, but fixed, maximal number of recombination events, by suitably augmenting the underlying process.

The inter-coalescent times $T_{k}^{(n)}$, that is, the time periods during which k lineages persist in the genealogical tree for a sample of size n, can be used to compute the total branch length at a single locus as

ℒ = \sum_{k = 2}^{n} k T_{k}^{(n)},

(1.1)

since in the period $T_{k}^{(n)}$, k lineages contribute towards the total length. In the case of a panmictic population of constant size, formulas for the first two moments of the total tree length can be readily obtained using standard arguments for sums of the independently exponentially distributed random variables $T_{k}^{(n)}$. Furthermore, ℒ is distributed like the maximum of n − 1 independent exponential variables with intensity $\frac{1}{2}$ (Wiuf and Hein, 1999, p. 255). However, non-constant population size histories introduce intricate dependencies among the inter-coalescent times, and thus it is not straightforward to generalize this approach. Polanski et al. (2003) introduced a method to compute the expected inter-coalescence times under variable population size. However, the coalescence rates of ancestral lineages in the genealogical process depend on past population sizes, whereas the rate for ancestral recombination is constant along each ancestral lineage. The approach of Polanski et al. (2003) depends on the fact that all rates of the process are rescaled uniformly with the same factor, and thus it cannot be extended to the case when ancestral recombination between two linked loci is taken into account.
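As a small sanity check of the constant-size statements above, the following sketch compares the expected total tree length obtained from equation (1.1) with the mean of the maximum of n − 1 independent exponential variables with intensity 1/2. The function names are illustrative and not taken from the paper, and the check covers only the first moment, not the full distributional identity.

```python
from fractions import Fraction
from math import comb

def expected_tree_length(n):
    """E[L] = sum_k k * E[T_k], where T_k ~ Exp(binom(k, 2)) under constant size."""
    return sum(Fraction(k, comb(k, 2)) for k in range(2, n + 1))

def expected_max_exponentials(m, rate=Fraction(1, 2)):
    """Mean of the maximum of m i.i.d. Exp(rate) variables: (1 / rate) * H_m."""
    return sum(Fraction(1, j) for j in range(1, m + 1)) / rate

n = 10
# Both expressions reduce to 2 * (1 + 1/2 + ... + 1/(n - 1)).
print(expected_tree_length(n), expected_max_exponentials(n - 1))
```

Exact rational arithmetic is used so that the two expressions can be compared without rounding error.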

Ferretti et al. (2013) used another approach to investigate the correlation between the times to the most recent common ancestor at two neighboring loci. The authors approached the problem using coalescent arguments to quantify the changes recombination induces on the local trees, but it is unclear how to generalize their approach efficiently to the total length of the genealogical trees. Furthermore, Li and Durbin (2011) presented analytic formulas for the joint distribution of the local genealogies for a sample of size two under variable population size, but these cannot readily be extended to an arbitrary sample size n. Eriksson et al. (2009) presented similar analytic formulas for a population of constant size and explored more complex demographic scenarios using simulations. Introducing suitable Markov chains, Hobolth and Jensen (2014) investigated the transitional distribution of the local genealogies for samples of size 4, and discussed approximations for larger sample sizes. These Markov chains are closely related to our methodology, but our focus is on exact computations for large sample sizes.

Although we focus on the total tree length under variable population size in a single panmictic population in this paper, our approach can be extended to compute the transition densities for the coalescence time in a sample of size two (Li and Durbin, 2011), the coalescence time of two distinguished lineages (Terhorst et al., 2017), and the time of the first coalescent event amongst the sampled sequences (Schiffels and Durbin, 2014). Furthermore, our method can be generalized to multiple sub-populations related by a complex demographic history (see discussion in Section 5).

This article is structured as follows. In Section 2, we introduce the requisite notation and the stochastic processes that are involved in computing the marginal and joint distributions. We further introduce a hyperbolic system of partial differential equations (PDEs) in Section 3 that can be solved to compute the distributions of interest. We provide a proof of the main proposition used to derive these equations in Appendix A. In Section 3, we also provide details of our novel numerical algorithm based on the method of characteristics that can be used to efficiently compute the solutions to these PDEs. We demonstrate the accuracy of the method, and study properties of the joint distribution function in Section 4. Finally, we discuss future applications and extensions of this method in Section 5.

2 Background and Notation

In this section, we will introduce the necessary background and notation for the stochastic processes that we employ to compute the marginal and joint distribution of the length of the genealogical trees. We will also provide some details about computing the distribution of these processes, since our main result builds upon the underlying ideas.

2.1 Ancestral Process at a Single Locus

The genealogical relationship of a sample of n haploid individuals in a panmictic population of constant size is commonly modeled using Kingman’s coalescent (Kingman, 1982; Wakeley, 2008), and this process and its extensions have found widespread applications. It is a Markov process that describes the dynamics of the ancestral lineages of the sample backwards in time. Here we focus on the ancestral process A(t) (Tavaré and Zeitouni, 2004, Chapter 4.1). This coarser process records only the number of ancestral lineages in the coalescent process at time t before present, which is sufficient to compute the total branch length of the coalescent tree. The initial number of lineages is equal to the sample size n. Furthermore, at time t, each pair of lineages coalesces at rate one, thus if there are A(t) = k lineages at time t, then coalescence of any two lineages happens at rate $\binom{k}{2}$. These dynamics are followed until all lineages have coalesced into a single lineage, and this time is denoted by T_MRCA, the time to the most recent common ancestor.

Variable population size is commonly modeled by a positive, real-valued function λ(t), which provides the coalescent rate for each pair of ancestral lineages at time t in the past (Tavaré and Zeitouni, 2004, Chapter 4.1). If the size of the population changes at different points in the past, the rate of coalescence at a given time is inversely proportional to the relative population size at that time. Intuitively, for two lineages to coalesce, they have to find a common ancestor. If the population consists of a large number of individuals, this happens at a lower rate, whereas in small populations, the ancestral lineages coalesce more quickly. In the remainder of this paper, we assume that λ(t) is continuous. If λ(t) is piece-wise continuous, we can obtain the same results by considering each continuous piece separately. For convenience, we further introduce the cumulative coalescent rate at time t as

Λ (t) = \int_{0}^{t} λ (s) d s .
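For illustration, the cumulative coalescent rate can be evaluated in closed form when λ(t) is piecewise constant. The sketch below assumes such a piecewise-constant history, which is a special case of the piece-wise continuous setting mentioned above; the function name and the epoch representation are hypothetical choices made for this example.

```python
from bisect import bisect_right

def cumulative_rate(t, breakpoints, rates):
    """Lambda(t) = int_0^t lambda(s) ds for a piecewise-constant lambda.

    breakpoints: increasing epoch start times, beginning with 0.0
    rates: value of lambda on [breakpoints[i], breakpoints[i + 1]);
           the last epoch extends to infinity
    """
    last = bisect_right(breakpoints, t) - 1  # index of the epoch containing t
    total = 0.0
    for i in range(last):  # epochs that lie completely before t
        total += rates[i] * (breakpoints[i + 1] - breakpoints[i])
    return total + rates[last] * (t - breakpoints[last])

# Example history: rate 1 until time 1, rate 0.5 until time 3, rate 2 afterwards.
print(cumulative_rate(2.0, [0.0, 1.0, 3.0], [1.0, 0.5, 2.0]))
```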

These considerations yield the following definition.

Definition 2.1 (Ancestral Process with variable population size)

The ancestral process with variable population size {A(t)}_{t ∈ ℝ₊} is a time-inhomogeneous Markov chain on {1, …, n} with initial state A(0) = n, and the transition rates at time t are given by the infinitesimal generator matrix

Q (t) = λ (t) Q,

with

Q_{k, j} := \begin{cases} - \binom{k}{2}, & \text{if } j = k, \\ \binom{k}{2}, & \text{if } j = k - 1, \\ 0, & \text{otherwise}. \end{cases}

(2.1)
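The matrix in equation (2.1) can be written down directly. The following minimal sketch builds Q as a nested list, with state k stored at index k − 1; the function name is an illustrative choice, and the time-dependent generator is then obtained by multiplying every entry by λ(t).

```python
from math import comb

def coalescent_generator(n):
    """Return Q from equation (2.1); row and column k - 1 correspond to state k."""
    Q = [[0] * n for _ in range(n)]
    for k in range(2, n + 1):
        Q[k - 1][k - 1] = -comb(k, 2)  # total rate of leaving state k
        Q[k - 1][k - 2] = comb(k, 2)   # coalescence: k -> k - 1
    return Q

for row in coalescent_generator(4):
    print(row)
```

The row for state 1 is identically zero, since a single lineage cannot coalesce, and all non-zero entries lie on or directly below the diagonal.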

Remark 2.2

Note that we do require A(0) = n, and thus this definition of the ancestral process does depend on the sample size n. However, for different sample sizes n′, the rates of the process are given by equation (2.1) as well, only the initial state changes. The dynamics of the process is essentially the same, independent of the sample size, and we therefore do not include the dependence on the sample size explicitly in the notation for the remainder of this article.

The ancestral process can be used to formally define the time to the most recent common ancestor as

T_{MRCA} : = inf {t \in ℝ_{+} : A (t) \leq 1},

the time when the number of lineages reaches one. Furthermore, with

p_{k} (t) : = ℙ {A (t) = k},

for k ∈ {1, …, n}, the distribution of the ancestral process can be obtained by solving the Kolmogorov-forward-equation (Stroock, 2008, Chapter 5), a system of ordinary differential equations (ODEs) given by

\frac{d}{d t} (p_{1} (t), \dots, p_{n} (t)) = (p_{1} (t), \dots, p_{n} (t)) Q (t) .

(2.2)

Equivalently, perhaps more familiar to the reader, this system can be expressed as

\frac{d}{d t} p_{k} (t) = λ (t) \binom{k + 1}{2} p_{k + 1} (t) - λ (t) \binom{k}{2} p_{k} (t),

(2.3)

for all k ∈ {1, …, n}, with the convention p_{n+1}(t) ≡ 0. The latter version is more explicit about the influence of the number of ancestral lineages and the coalescent-speed function on the dynamics of the ODEs. The initial condition

(p_{1} (0), \dots, p_{n} (0)) = (0, \dots, 0, 1)

encodes A(0) = n, and the corresponding solution is given by

(p_{1} (t), \dots, p_{n} (t)) = {(e^{Λ (t) \cdot Q})}_{n, \cdot}

(2.4)

for t ∈ ℝ₊, where (·)_{n,·} refers to the n-th row of the matrix. In Tavaré and Zeitouni (2004), the authors provide an analytic expression for these probabilities using the spectral decomposition of the rate matrix Q(t). However, the resulting formulas are numerically unstable, so for practical purposes it can be more efficient to solve the system of ODEs numerically using step-wise solution schemes. Furthermore, note that

ℙ {T_{MRCA} \leq t^{*}} = {(e^{Λ (t^{*}) \cdot Q})}_{n, 1}

(2.5)

holds for t^* ∈ ℝ₊, thus equation (2.4) can also be used to compute the cumulative distribution function of the time to the most recent common ancestor.
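To make the step-wise solution of the forward equation concrete, the sketch below integrates equation (2.3) with a classical fourth-order Runge-Kutta scheme. The choice of RK4, the step count, the example rate function, and all function names are assumptions made for this illustration; the text does not prescribe a particular ODE solver.

```python
from math import comb, exp

def forward_rhs(p, lam):
    """Right-hand side of equation (2.3); p[k - 1] = P{A(t) = k}, with p_{n+1} = 0."""
    n = len(p)
    rhs = []
    for k in range(1, n + 1):
        gain = comb(k + 1, 2) * p[k] if k < n else 0.0  # inflow from state k + 1
        rhs.append(lam * (gain - comb(k, 2) * p[k - 1]))
    return rhs

def solve_forward(n, lam_fn, t_end, steps=2000):
    """Classical RK4 for the forward equation, started from A(0) = n."""
    p = [0.0] * (n - 1) + [1.0]
    h = t_end / steps
    for i in range(steps):
        t = i * h
        k1 = forward_rhs(p, lam_fn(t))
        k2 = forward_rhs([x + 0.5 * h * d for x, d in zip(p, k1)], lam_fn(t + 0.5 * h))
        k3 = forward_rhs([x + 0.5 * h * d for x, d in zip(p, k2)], lam_fn(t + 0.5 * h))
        k4 = forward_rhs([x + h * d for x, d in zip(p, k3)], lam_fn(t + h))
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p

# Hypothetical rate function lambda(t) = 1 + t / 2, so that Lambda(1) = 1.25.
p = solve_forward(5, lambda t: 1.0 + 0.5 * t, t_end=1.0)
tmrca_cdf = p[0]  # P{T_MRCA <= 1}, cf. equation (2.5)

# Sanity check for n = 2, where P{T_MRCA <= t} = 1 - exp(-Lambda(t)).
p2 = solve_forward(2, lambda t: 1.0 + 0.5 * t, t_end=1.0)
print(p2[0], 1.0 - exp(-1.25))
```

The first component of the returned vector is the probability of having reached a single lineage, which is the cumulative distribution function of T_MRCA evaluated at the final time.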

We can employ the ancestral process to compute the total tree length as follows. If at a given time t there are k ancestral lineages or branches in the coalescent tree, each branch extends further into the past. Thus, we can say that the total sum of branch lengths in the coalescent tree grows at a rate of k. Once all lineages have coalesced into a single common ancestral lineage, the most recent common ancestor is reached, and the coalescent tree stops growing. This motivates the following definition.

Definition 2.3

The accumulated tree length L(t) ∈ ℝ₊ by time t ∈ ℝ₊ is given by

L (t) : = \int_{0}^{t} 𝟙_{{A (s) > 1}} A (s) d s .

(2.6)

With this definition, the total tree length or the total sum of the branch lengths at a single locus is given by

ℒ : = L (T_{MRCA}) .

(2.7)
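Definition 2.3 can be illustrated on a single realization of the ancestral process. The sketch below evaluates the integral (2.6) for a given list of coalescence times; the function name and the representation of a realization as a list of event times are assumptions made for this example.

```python
def accumulated_tree_length(t, coalescence_times):
    """L(t) from Definition 2.3 for one realization of the ancestral process.

    coalescence_times: increasing times of the n - 1 coalescence events,
    so the sample size is n = len(coalescence_times) + 1.
    """
    k = len(coalescence_times) + 1  # number of lineages at time 0
    length, previous = 0.0, 0.0
    for event in coalescence_times:
        if event >= t:
            break
        length += k * (event - previous)  # k lineages persist on [previous, event)
        previous = event
        k -= 1
    if k > 1:  # indicator in (2.6): nothing accrues once A(s) = 1
        length += k * (t - previous)
    return length

times = [0.5, 1.5]  # n = 3: three lineages until 0.5, two lineages until T_MRCA = 1.5
print(accumulated_tree_length(1.0, times), accumulated_tree_length(times[-1], times))
```

Evaluating the function at the last coalescence time gives the total tree length of equation (2.7), which for this realization agrees with the weighted sum of inter-coalescent times in equation (1.1).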

Note that

ℒ = \sum_{k = 2}^{n} k T_{k}^{(n)}

holds, which coincides with equation (1.1). Here $T_{k}^{(n)}$ is the period of time for which k lineages persist in the ancestral process, the inter-coalescent time. The main goal of this paper is to study the distribution of ℒ for populations with arbitrary coalescent-rate function λ(t) marginally at a single locus and jointly at two loci, which can be computed using a system of hyperbolic PDEs that is closely related to the ODE (2.3). For the two-locus case, we will now introduce the joint ancestral process at two linked loci.

2.2 Ancestral Process with Recombination

The joint genealogy of the ancestral lineages for two loci, locus a and b, separated by a recombination distance ρ is commonly modeled by the coalescent with recombination (Hudson, 1990). The initial state in the coalescent with recombination for a sample of size n is comprised of n lineages, each ancestral to both loci of one sampled haplotype. As in the single-locus coalescent with variable population size, at time t, each pair of lineages can coalesce at rate λ(t). In addition, ancestral recombination events happen at rate ρ/2 along each active lineage. At a recombination event, the lineage splits into two new lineages, each ancestral to the respective haplotype of the original lineage at only one of the two loci. Note that recombination happens along each lineage at a constant rate and, unlike the coalescent rate, is not affected by the population size, and thus it does not scale with λ(t).

Again, we do not focus on the exact genealogical relationships, but only on the number of lineages at time t that are ancestral to a certain locus, given by the ancestral process with recombination A^ρ(t). The process A^ρ for a sample of size two under constant population size is described in detail by Simonsen and Churchill (1997). Here we use an extension of this process to samples of arbitrary size n and variable population size. A similar model has also been introduced by Hobolth and Jensen (2014).

Definition 2.4 (Ancestral Process with Recombination)

For a sample of size n ∈ ℕ and t ∈ ℝ₊, the ancestral process with recombination in a population of variable size

A^{ρ} (t) = (K_{a b} (t), K_{a} (t), K_{b} (t))

is a time-inhomogeneous Markov chain with state space

S^{ρ} : = {s \in ℕ_{0}^{3} ∣ s_{1} + max {s_{2}, s_{3}} \leq n} \ {(0, 0, 0), (0, 1, 0), (0, 0, 1)} .

The component K_ab(t) gives the number of lineages that are ancestral to both loci, K_a(t) is the number ancestral to locus a only, and K_b(t) is the number ancestral to locus b only. The initial state is

A^{ρ} (0) = (n, 0, 0),

all n lineages ancestral to both loci. The transition rates are given by the infinitesimal generator matrix

\tilde{Q} (t) = λ (t) Q^{c} + Q^{ρ},

where all off-diagonal entries of Q^c (coalescence) are zero, except

\begin{aligned} Q_{(k_{a b}, k_{a}, k_{b}), (k_{a b} - 1, k_{a}, k_{b})}^{c} &= \binom{k_{a b}}{2}, \\ Q_{(k_{a b}, k_{a}, k_{b}), (k_{a b}, k_{a} - 1, k_{b})}^{c} &= \binom{k_{a}}{2} + k_{a b} k_{a}, \\ Q_{(k_{a b}, k_{a}, k_{b}), (k_{a b}, k_{a}, k_{b} - 1)}^{c} &= \binom{k_{b}}{2} + k_{a b} k_{b}, \\ Q_{(k_{a b}, k_{a}, k_{b}), (k_{a b} + 1, k_{a} - 1, k_{b} - 1)}^{c} &= k_{a} k_{b}, \end{aligned}

(2.8)

and all off-diagonal entries of Q^ρ (recombination) are zero, except

Q_{(k_{a b}, k_{a}, k_{b}), (k_{a b} - 1, k_{a} + 1, k_{b} + 1)}^{ρ} = \frac{ρ}{2} k_{a b} .

The state (1, 0, 0) is defined to be the absorbing state, so all rates leaving this state are set to zero. Furthermore, the diagonal entries of both matrices are set to minus the sum of the off-diagonal entries in the corresponding row.
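The rates of Definition 2.4 can be assembled programmatically. The following sketch is a minimal illustration under one stated assumption: instead of enumerating the full state space S^ρ, it collects only the states that are reachable from the initial state (n, 0, 0) through transitions with positive rate. All function names are hypothetical.

```python
from math import comb

def transitions(state, rho):
    """Off-diagonal rates of Definition 2.4 as (target, coalescence part, recombination part)."""
    kab, ka, kb = state
    if state == (1, 0, 0):  # absorbing state
        return []
    moves = [
        ((kab - 1, ka, kb), comb(kab, 2), 0.0),
        ((kab, ka - 1, kb), comb(ka, 2) + kab * ka, 0.0),
        ((kab, ka, kb - 1), comb(kb, 2) + kab * kb, 0.0),
        ((kab + 1, ka - 1, kb - 1), ka * kb, 0.0),
        ((kab - 1, ka + 1, kb + 1), 0.0, rho / 2.0 * kab),
    ]
    return [m for m in moves if m[1] > 0 or m[2] > 0]

def reachable_states(n, rho):
    """States reachable from (n, 0, 0), found by graph search."""
    seen, stack = {(n, 0, 0)}, [(n, 0, 0)]
    while stack:
        for target, _, _ in transitions(stack.pop(), rho):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return sorted(seen, reverse=True)

def generator(n, rho, lam):
    """lam * Q^c + Q^rho restricted to the reachable states, for lam = lambda(t)."""
    states = reachable_states(n, rho)
    index = {s: i for i, s in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for s in states:
        for target, coal, rec in transitions(s, rho):
            rate = lam * coal + rec
            Q[index[s]][index[target]] += rate
            Q[index[s]][index[s]] -= rate  # diagonal: minus the row sum
    return states, Q

states, Q = generator(2, rho=1.0, lam=1.0)
print(len(states), states)
```

Only the coalescence part of each rate is multiplied by λ(t), whereas the recombination part is left unscaled, mirroring the structure of the generator in the definition.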

Remark 2.5

Two versions of the coalescent with recombination are commonly used in the literature, one version for the infinitely-many-sites (IMS) model (Hudson, 1990; Griffiths and Marjoram, 1997), and another version for the finitely-many-sites (FMS) model (Paul et al., 2011; Steinrücken et al., 2015). In the IMS version, the chromosome is modeled as the interval [0, 1], and whenever recombination occurs, it occurs at a uniformly chosen point in this interval. As a result, recombination always occurs at a novel site, and two neighboring local genealogies are separated by at most one recombination event. In the FMS version, multiple recombination events can occur between two loci. It can be obtained from the IMS version by considering the local genealogies at two fixed loci along the continuous chromosome that are separated by a certain fixed recombination distance. Our definition of the ancestral process with recombination is in line with the FMS version for two loci.
The ancestral process with recombination can be defined for an arbitrary number of loci. However, in the remainder of the paper, we will only use the process for two loci.
In the literature, some authors use the ‘full’ coalescent with recombination and others the ‘reduced’ coalescent with recombination. The difference between the two is that the ‘full’ version always keeps track of both ancestral lineages that branch off at a recombination event, whereas in the ‘reduced’ version, lineages that don’t leave any descendant ancestral material in the contemporary sample are not traced. Our definition of the ancestral process with recombination is compatible with the ‘reduced’ version. Thus, the number of ancestral lineages is bounded, which is not the case in the ‘full’ version.
Following the ideas of Wiuf and Hein (1999), the correlation structure between all local genealogies along a chromosome can be approximated using the Sequentially Markovian Coalescent (SMC) (McVean and Cardin, 2005), or the modified version SMC’ (Marjoram and Wall, 2006). In the SMC, if a lineage has been hit by a recombination event and branches into two, subsequently, the two resulting branches are not allowed to coalesce with each other, whereas such events are permitted under the SMC’. Thus, under the SMC’, the rates for coalescence of lineages with no overlapping ancestral material (equation (2.8)) are as given in Definition 2.4, whereas under the SMC, these rates have to be set to zero.

Again, the Kolmogorov-forward-equation can be used to compute the distribution of the ancestral process A^ρ(t) as the solution of

\frac{d}{d t} p (t) = p (t) \tilde{Q} (t),

(2.9)

where the row-vector p(t) is defined by

p (t) : = {(ℙ {A^{ρ} (t) = s})}_{s \in S^{ρ}} .

Note that the rate matrix Q(t) in the ODE (2.2) for the ancestral process at a single locus is triangular for all t. This simplifies approaches to compute solutions substantially, as the solutions can be obtained sequentially for each state of the corresponding Markov chain. In the ancestral process with recombination for two loci on the other hand, with a positive probability, the underlying Markov chain can transition back to a state it already visited before. Consequently, the rate matrix Q̃(t) in the ODE (2.9) is not triangular, and it is also not possible to transform it into a triangular matrix by permuting the rows and columns.

Since a triangular rate matrix simplifies analytical and numerical approaches significantly, we introduce an approximation to the full ancestral process with recombination that exhibits this property and compute the distributions of the tree lengths under this approximation. To achieve this, we explicitly account for the number of recombination events that have occurred up to a certain time t. For ease of exposition, we further limit the maximal number of recombination events to one. Since in most organisms the per generation recombination probability is very small between loci that are physically close, this approximation is justified. Furthermore, numerical experiments supporting this approximation are provided in Section 4. Note that limiting the number of recombination events to one effectively yields a first-order approximation to the full ancestral process.

Definition 2.6 (Ancestral Process with Limited Recombination)

For a sample of size n ∈ ℕ and t ∈ ℝ₊, the ancestral process with limited recombination

{\bar{A}}^{ρ} (t) = ({\bar{K}}_{a b} (t), {\bar{K}}_{a} (t), {\bar{K}}_{b} (t), \bar{R} (t))

is a time-inhomogeneous Markov chain with state space

\begin{aligned} {\bar{S}}^{ρ} := \big( & \{1, \dots, n\} \times \{(0, 0, 0)\} \\ & \cup \{1, \dots, n\} \times \{0, 1\} \times \{0, 1\} \times \{1\} \big) \setminus \{(n, 1, 1, 1), (n, 1, 0, 1), (n, 0, 1, 1)\} . \end{aligned}

(2.10)

The components K̄_ab(t), K̄_a(t), K̄_b(t) have the same interpretation as before, and R̄(t) is the number of recombination events that have happened by time t. The first line in equation (2.10) corresponds to the states that can be reached without recombination, and the second line to those that require one recombination event. The initial state is

{\bar{A}}^{ρ} (0) = (n, 0, 0, 0),

and the transition rates are given by the infinitesimal generator matrix

\bar{Q} (t) = λ (t) {\bar{Q}}^{c} + {\bar{Q}}^{ρ},

(2.11)

where the entries of Q̄^c (coalescence) between states with the same number r of recombination events are given by

{\bar{Q}}_{(k_{a b}, k_{a}, k_{b}, r), (k_{a b}^{'}, k_{a}^{'}, k_{b}^{'}, r)}^{c} = Q_{(k_{a b}, k_{a}, k_{b}), (k_{a b}^{'}, k_{a}^{'}, k_{b}^{'})}^{c},

and all off-diagonal entries of Q̄^ρ (recombination) are zero, except

{\bar{Q}}_{(k_{a b}, k_{a}, k_{b}, 0), (k_{a b} - 1, k_{a} + 1, k_{b} + 1, 1)}^{ρ} = \frac{ρ}{2} k_{a b},

allowing at most one recombination event. The diagonal entries are set to minus the sum of the off-diagonal entries in the corresponding row. The states (1, 0, 0, 0) and (1, 0, 0, 1) are absorbing states, so all rates leaving these states are set to zero.
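The triangular structure that motivates Definition 2.6 can be checked mechanically. The sketch below enumerates the state space (2.10) and verifies that every transition with positive rate strictly increases the key (r, minus the number of lineages), so that sorting the states by this key yields a triangular rate matrix. The key and the function names are choices made for this illustration, not notation from the paper.

```python
from math import comb

def limited_states(n):
    """State space (2.10): no recombination yet, or exactly one recombination."""
    no_rec = [(k, 0, 0, 0) for k in range(1, n + 1)]
    one_rec = [(k, a, b, 1) for k in range(1, n + 1) for a in (0, 1) for b in (0, 1)]
    excluded = {(n, 1, 1, 1), (n, 1, 0, 1), (n, 0, 1, 1)}
    return no_rec + [s for s in one_rec if s not in excluded]

def limited_transitions(state, rho):
    """Positive off-diagonal rates as (target, coalescence part, recombination part)."""
    kab, ka, kb, r = state
    if (kab, ka, kb) == (1, 0, 0):  # the two absorbing states
        return []
    moves = [
        ((kab - 1, ka, kb, r), comb(kab, 2), 0.0),
        ((kab, ka - 1, kb, r), comb(ka, 2) + kab * ka, 0.0),
        ((kab, ka, kb - 1, r), comb(kb, 2) + kab * kb, 0.0),
        ((kab + 1, ka - 1, kb - 1, r), ka * kb, 0.0),
    ]
    if r == 0:  # recombination is only possible if none has happened yet
        moves.append(((kab - 1, ka + 1, kb + 1, 1), 0.0, rho / 2.0 * kab))
    return [m for m in moves if m[1] > 0 or m[2] > 0]

def is_acyclic(n, rho):
    """True if every transition strictly increases the key (r, -number of lineages)."""
    key = lambda s: (s[3], -(s[0] + s[1] + s[2]))
    states = set(limited_states(n))
    for s in states:
        for target, _, _ in limited_transitions(s, rho):
            if target not in states or key(target) <= key(s):
                return False
    return True

print(len(limited_states(3)), is_acyclic(3, rho=0.1))
```

Each coalescence removes one lineage and each recombination raises r from 0 to 1, so no state can be revisited. The same check also confirms that no transition leaves the state space.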

For later convenience, define the relation ≺ on 𝒮̄^ρ as

s ≺ s^{'} : \Leftrightarrow {\bar{Q}}_{s^{'}, s} (t) > 0,

(2.12)

that is, s ≺ s′ holds if s can be reached from s′ in one step. Note that embedded into the ancestral process with recombination (limited or not) is a single-locus ancestral process for locus a and for locus b. Thus, we can define the branch length of the genealogical tree at locus a and b similar to the one-locus case as follows, and study their joint distribution.

Definition 2.7

For a given time t ∈ ℝ₊, the accumulated tree lengths L^a(t) ∈ ℝ₊ at locus a and L^b(t) ∈ ℝ₊ at locus b are given by

L^{a} (t) : = \int_{0}^{t} 𝟙_{{{\bar{K}}_{a b} (s) + {\bar{K}}_{a} (s) > 1}} ({\bar{K}}_{a b} (s) + {\bar{K}}_{a} (s)) d s,

and

L^{b} (t) : = \int_{0}^{t} 𝟙_{{{\bar{K}}_{a b} (s) + {\bar{K}}_{b} (s) > 1}} ({\bar{K}}_{a b} (s) + {\bar{K}}_{b} (s)) d s .

Remark 2.8

This definition of the accumulated tree length can be applied to Ā^ρ, as well as A^ρ. We will not distinguish these cases in our notation, since in the remainder of the paper, we will use Ā^ρ.

The total tree length at locus a is thus given by

ℒ^{a} : = L^{a} (T_{MRCA}^{a}),

(2.13)

and at locus b by

ℒ^{b} : = L^{b} (T_{MRCA}^{b}) .

(2.14)

Here, $T_{MRCA}^{a}$ is the time to the most recent common ancestor at locus a

T_{MRCA}^{a} : = inf {t \in ℝ_{+} : {\bar{K}}_{a b} (t) + {\bar{K}}_{a} (t) \leq 1},

and thus its distribution is given by

ℙ {T_{MRCA}^{a} \leq t^{*}} = ℙ {{\bar{K}}_{a b} (t^{*}) + {\bar{K}}_{a} (t^{*}) \leq 1}

for t^* ∈ ℝ₊. Similar relations hold for locus b. We will now study the joint distribution of ℒ^a and ℒ^b, and also the marginal distribution of ℒ. Note that these quantities are computed under the ancestral process with limited recombination, but we will demonstrate in Section 4 that they give an accurate approximation to the respective quantities under the true ancestral process.

3 Marginal and Joint Distribution of the Total Tree Length

The main goal of this paper is to present a method to compute the marginal and joint cumulative distribution function (CDF) of the total tree length at two linked loci. Thus, we aim at computing

ℙ {ℒ \leq x}

(3.1)

and

ℙ {ℒ^{a} \leq x, ℒ^{b} \leq y}

(3.2)

for x, y ∈ ℝ₊.

Note that equation (2.5) can be used to compute the distribution of the time until the ancestral process reaches the absorbing state, which yields the marginal distribution of the T_MRCA. The latter is equal to the sum of the inter-coalescence times, and in a population of constant size, equation (2.5) can also be obtained by convolving the densities of independent exponential variables. The total branch length is a more general linear combination of the inter-coalescence times, but Wiuf and Hein (1999) used a similar convolution approach to derive its marginal density. In a population with variable population size, the inter-coalescence times are not mutually independent. However, Polanski et al. (2003) derived formulas for the density of T_MRCA using a uniform rescaling of time by the coalescent-rate function λ(t).

The two main difficulties in extending these considerations to the total tree length in a two-locus model with variable population size are as follows: Firstly, in a model that includes recombination, only the coalescence rates scale with λ(t), while the recombination rate is constant along each lineage. The approach of Polanski et al. (2003), however, relies on a uniform rescaling of all rates, and therefore it cannot be applied. Secondly, note that, similar to equation (2.6), we can define

T (t) := \int_{0}^{t} 𝟙_{\{A (s) > 1\}} d s = \begin{cases} t, & \text{if } t < T_{MRCA}, \\ T_{MRCA}, & \text{otherwise}. \end{cases}

With this definition, the quantity t is not only the time elapsed in the ancestral process, but it can also be interpreted as the amount accumulated towards T_MRCA. The absorption time of the ancestral process can thus be used in equation (2.5) to compute the distribution of T_MRCA. However, when the accumulated tree length L(t) defined in equation (2.6) is considered, the quantity t only gives the elapsed time, and it cannot be used as the amount accumulated towards ℒ. Thus, our approach to compute the distribution of ℒ, and the joint distribution of ℒ^a and ℒ^b, has to explicitly account for both the time that has elapsed in the ancestral process and the amount accumulated towards the total tree length.

To this end, with t ∈ ℝ₊, we introduce the time-dependent cumulative distribution functions

F_{k} (t, x) : = ℙ {A (t) = k, L (t) \leq x}

(3.3)

for k ∈ {1, …, n} and

F_{s} (t, x, y) : = ℙ {{\bar{A}}^{ρ} (t) = s, L^{a} (t) \leq x, L^{b} (t) \leq y}

(3.4)

for s ∈ 𝒮̄^ρ.

We will show that the CDFs (3.1) and (3.2) can be computed from the time-dependent CDFs (3.3) and (3.4). Furthermore, we will present numerical schemes to efficiently and accurately compute the time-dependent CDFs (3.3) and (3.4).

3.1 Distribution of the Total Tree Length at a Single Locus

The following lemma shows that the CDF (3.1) can be computed from the time-dependent CDF (3.3).

Lemma 3.1

With definition (3.3), the relation

ℙ {ℒ \leq x} = ℙ {A (\bar{t}) = 1, L (\bar{t}) \leq x} = F_{1} (\bar{t}, x)

holds for x ∈ ℝ₊ and t̄ ≥ x/2.

Proof

First, observe that

2 T_{MRCA} \leq \int_{0}^{T_{MRCA}} 𝟙_{{A (s) > 1}} A (s) d s = ℒ,

since A(s) ≥ 2 holds for s < T_MRCA. Thus, on the event {ℒ ≤ x}, the relation T_MRCA ≤ x/2 ≤ t̄ holds, which implies A(t̄) = 1, and therefore

{ℒ \leq x} = {A (\bar{t}) = 1, ℒ \leq x} .

On the event {A(t̄) = 1}, the relations t̄ ≥ T_MRCA and L(t̄) = ℒ hold, and thus

{A (\bar{t}) = 1, ℒ \leq x} = {A (\bar{t}) = 1, L (\bar{t}) \leq x},

which proves the statement of the lemma.
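The two facts used in this proof, 2 T_MRCA ≤ ℒ and L(t̄) = ℒ once t̄ ≥ T_MRCA, can be checked on a concrete trajectory. The following Python sketch is purely illustrative: the helper `accumulated_length` and the coalescence times are hypothetical and are not part of the authors' implementation.

```python
def accumulated_length(n, coal_times, t):
    """L(t) = int_0^t 1{A(s) > 1} A(s) ds for a trajectory that starts with
    n lineages and loses one lineage at each (increasing) time in coal_times."""
    total, prev, k = 0.0, 0.0, n
    for c in coal_times:
        if c >= t:
            break
        total += k * (c - prev)  # k lineages are present on [prev, c)
        prev, k = c, k - 1
    if k > 1:
        total += k * (t - prev)  # the process is still accumulating at time t
    return total


# Hypothetical trajectory for n = 4 with coalescences at 0.1, 0.4 and 1.0.
n = 4
coal_times = [0.1, 0.4, 1.0]
t_mrca = coal_times[-1]
tree_length = accumulated_length(n, coal_times, t_mrca)  # 4*0.1 + 3*0.3 + 2*0.6
```

For this trajectory the total tree length is 2.5, so any t̄ ≥ 1.25 satisfies t̄ ≥ ℒ/2 ≥ T_MRCA, and evaluating `accumulated_length` at such a t̄ returns the same value.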

Lemma 3.1 shows that the CDF of ℒ can be computed from the time-dependent CDF F₁(t, x). Due to the structure of the underlying Markov chain, it is necessary to compute the time-dependent CDFs for all states in order to compute it for the absorbing state. Thus, in the remainder of this section, we focus on computing the time-dependent CDFs for all k ∈ {1, …, n}. Proposition A.14 derived in Appendix A can be applied to show that the time-dependent CDFs solve a certain system of linear hyperbolic PDEs. This yields the following corollary.

Corollary 3.2

The row-vector

F (t, x) : = (F_{1} (t, x), \dots, F_{n} (t, x))

can be obtained for all points in 𝒰 = {(t, x) : 0 < x < nt, t > 0} as the strong solution of

\partial_{t} F (t, x) + \partial_{x} F (t, x) V = F (t, x) Q (t),

(3.5)

with

V = diag (0, 2, 3, \dots, n),

boundary conditions

\begin{aligned} F (t, x) & = (ℙ {A (t) = 1}, \dots, ℙ {A (t) = n - 1}, 0), & x = n t, \\ F (t, 0) & = (0, 0, \dots, 0), & t > 0, \end{aligned}

(3.6)

and matrix Q(t) as defined in equation (2.1).

Proof

Define the function

v (k) : = k \cdot 𝟙_{{k > 1}}

on the state space 𝒮 = {1, …, n} of the ancestral process. This function and the generator Q(t) satisfy the requirements of Proposition A.14, and thus, the statement of the corollary follows from Proposition A.14 and Remark A.16.

Remark 3.3

The n-th component of the boundary condition (3.6) is equal to 0 and not ℙ{A(t) = n}. This holds for technical reasons that will be detailed in the proof of Proposition A.14.

Note that the process (A(t), L(t))_{t ∈ ℝ₊} is a piecewise-deterministic Markov process (see Remark A.17). The right-hand side of equation (3.5) is essentially equal to the right-hand side of equation (2.3), because the only stochastic element in the underlying dynamics is the ancestral process A(t). Given a certain number of lineages {A(t) = k}, the accumulation towards the total tree length happens deterministically at rate v(k), and is captured by the term ∂_xF(t, x)V.

To derive a numerical scheme for the efficient and accurate computation of the time-dependent CDF F(t, x), note that the system of PDEs introduced in Corollary 3.2 can be solved using the method of characteristics (Renardy and Rogers, 2004, Chapter 3). Due to the triangular structure of the matrix Q(t), for a given component with k ∈ {1, …, n}, the right-hand side of equation (3.5) depends only on F_ℓ with ℓ ≥ k. Thus, the system of PDEs (3.5) can be solved separately for each k, starting at k = n and decreasing k step by step.
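The triangular structure can be made concrete with a small sketch. It assumes the generator has the entries Q_{k,k} = −k(k−1)λ/2 and Q_{k,k−1} = k(k−1)λ/2 that appear in the expressions for q_k^{(1)} and g_k^{(1)} in this section; the function names are hypothetical, and λ is frozen at a single value for illustration.

```python
def generator(n, lam):
    """Generator Q as a list of rows; state k is stored at index k - 1.
    Assumed entries: Q[k][k] = -k(k-1)/2 * lam, Q[k][k-1] = +k(k-1)/2 * lam."""
    Q = [[0.0] * n for _ in range(n)]
    for k in range(2, n + 1):
        rate = 0.5 * k * (k - 1) * lam
        Q[k - 1][k - 1] = -rate
        Q[k - 1][k - 2] = rate
    return Q


def speeds(n):
    """Diagonal of V: v(k) = k for k > 1 and v(1) = 0."""
    return [0 if k == 1 else k for k in range(1, n + 1)]


def rhs_dependencies(Q, k):
    """States ell whose F_ell enters component k of the row vector F Q."""
    return [ell + 1 for ell in range(len(Q)) if Q[ell][k - 1] != 0.0]


n = 5
Q = generator(n, lam=1.0)
solve_order = list(range(n, 0, -1))  # k = n first, then n - 1, ..., 1
```

In this sketch, component k of F Q involves only F_k and F_{k+1}, which is why the components can be processed in the order given by `solve_order`.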

Furthermore, note that for k ∈ {1, …, n},

ℙ {A (t) = k, L (t) \leq x} = 0, if x < v (k) t,

(3.7)

since if the ancestral process has k lineages at time t, it must have accumulated at least v(k)t towards the total tree length. It can be shown that the solution to equation (3.5) exhibits this property. Moreover,

ℙ {A (t) = k, L (t) \leq x} = ℙ {A (t) = k}, if x \geq n \cdot t,

since the process can have accumulated at most nt. Thus, we only have to use equation (3.5) to compute the solution

F_{k} (t, x) = ℙ {A (t) = k, L (t) \leq x}

when v(k)t ≤ x < n · t. Note that v(1) = 0. Moreover, for k = n, the region v(k)t ≤ x < n · t is empty, and thus F_n(t, x) has a discontinuity along the line x = n · t. See Figure 1 for a visualization of the different regions for different values of k. To devise an accurate and efficient numerical scheme for computing the time-dependent CDFs in the interior region, we use the method of characteristics to solve the respective PDE

\partial_{t} F_{k} (t, x) + v (k) \partial_{x} F_{k} (t, x) = F_{k} (t, x) Q_{k, k} (t) + F_{k + 1} (t, x) Q_{k + 1, k} (t) .

(3.8)

Figure 1. The different regions and characteristics of F_k(t, x) (defined in equation (3.3)) for different values of k. In (c), according to Lemma 3.1, F_k(t, x) does not depend on t beyond the dashed line x = 2t.
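The case distinction for F_k(t, x) (zero for x < v(k)t, equal to ℙ{A(t) = k} for x ≥ n · t, and given by the PDE in between) can be written as a small lookup. This is only a sketch; the function `region` and its string labels are hypothetical names introduced for illustration, and t > 0 is assumed.

```python
def region(n, k, t, x):
    """Classify the point (t, x), t > 0, for component k of the CDF vector."""
    v = 0 if k == 1 else k  # v(k) = k * 1{k > 1}
    if x < v * t:
        return "zero"        # fewer than v(k) * t cannot have accumulated
    if x >= n * t:
        return "marginal"    # F_k(t, x) = P{A(t) = k}
    return "interior"        # F_k(t, x) is determined by the PDE (3.8)


# For k = n the interior region is empty: every point is "zero" or "marginal".
labels_for_k_equal_n = {region(5, 5, 1.0, x / 10.0) for x in range(0, 101)}
```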

Since for k = n, the interior region is empty, we consider k ≠ n and introduce the family of characteristics

τ \to {(t_{0} + τ, x_{0} + v (k) τ)}^{⊤} with t_{0} = \frac{x_{0}}{n}

Taking the derivative of F_k(t, x) along such a characteristic yields

\begin{aligned} & \frac{d}{d τ} F_{k} (t_{0} + τ, x_{0} + v (k) τ) \\ & \quad = {(\frac{d}{d τ} [t_{0} + τ] \cdot \partial_{t} F_{k} (t, x) + \frac{d}{d τ} [x_{0} + v (k) τ] \cdot \partial_{x} F_{k} (t, x)) |}_{(t, x) = (\frac{x_{0}}{n} + τ, x_{0} + v (k) τ)} \\ & \quad = {(\partial_{t} F_{k} (t, x) + v (k) \partial_{x} F_{k} (t, x)) |}_{(t, x) = (\frac{x_{0}}{n} + τ, x_{0} + v (k) τ)} \\ & \quad = F_{k} (t_{0} + τ, x_{0} + v (k) τ) Q_{k, k} (t_{0} + τ) + F_{k + 1} (t_{0} + τ, x_{0} + v (k) τ) Q_{k + 1, k} (t_{0} + τ) . \end{aligned}

(3.9)

Here we used the chain rule and the fact that the third line is equal to the left-hand side of equation (3.8). Formally, the derivations (3.9) do not hold for all τ. It can be shown, however, that the equality holds for almost all τ; we omit the technical details here for readability. Thus, for given x₀, the function τ → F_k(t₀ + τ, x₀ + v(k)τ) solves the equation

\frac{d}{d τ} F_{k} (t_{0} + τ, x_{0} + v (k) τ) = - q_{k}^{(1)} (τ) F_{k} (t_{0} + τ, x_{0} + v (k) τ) + g_{k}^{(1)} (τ),

with

q_{k}^{(1)} (τ) : = - Q_{k, k} (t_{0} + τ) = \frac{k (k - 1)}{2} λ (t_{0} + τ)

and

g_{k}^{(1)} (τ) : = F_{k + 1} (t_{0} + τ, x_{0} + v (k) τ) Q_{k + 1, k} (t_{0} + τ) = F_{k + 1} (t_{0} + τ, x_{0} + v (k) τ) \frac{(k + 1) k}{2} λ (t_{0} + τ) .

Since this is a non-homogeneous linear first-order ODE, the solution can be readily obtained as

F_{k} (t_{0} + τ, x_{0} + v (k) τ) = e^{- H_{k}^{(1)} (τ)} (\int_{0}^{τ} g_{k}^{(1)} (α) e^{H_{k}^{(1)} (α)} d α + F_{k} (t_{0}, x_{0})),

(3.10)

with

H_{k}^{(1)} (τ) : = \int_{0}^{τ} q_{k}^{(1)} (α) d α = \frac{k (k - 1)}{2} (Λ (t_{0} + τ) - Λ (t_{0})) .

(3.11)
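The structure of the solution formula (3.10) can be illustrated numerically. The sketch below is not the upstream scheme of Appendix B.1; it evaluates the variation-of-constants formula for a generic linear ODE y′ = −q(τ)y + g(τ) with the trapezoidal rule. The function name and the constant test coefficients are assumptions made for illustration.

```python
import math


def solve_along_characteristic(q, g, y0, tau, steps=2000):
    """Evaluate y(tau) = exp(-H(tau)) * (int_0^tau g(a) exp(H(a)) da + y0),
    where H(a) = int_0^a q(b) db; both integrals use the trapezoidal rule."""
    h = tau / steps
    H = 0.0         # running value of H(a)
    integral = 0.0  # running value of int_0^a g exp(H)
    prev_q = q(0.0)
    prev_w = g(0.0) * math.exp(H)
    for i in range(1, steps + 1):
        a = i * h
        cur_q = q(a)
        H += 0.5 * h * (prev_q + cur_q)
        cur_w = g(a) * math.exp(H)
        integral += 0.5 * h * (prev_w + cur_w)
        prev_q, prev_w = cur_q, cur_w
    return math.exp(-H) * (integral + y0)


# Constant coefficients admit a closed form to compare against.
a, b, y0, tau = 3.0, 2.0, 0.5, 1.2
numeric = solve_along_characteristic(lambda s: a, lambda s: b, y0, tau)
exact = b / a + (y0 - b / a) * math.exp(-a * tau)
```

In the setting of equation (3.10), q would play the role of q_k^{(1)} and g that of g_k^{(1)}, with y0 = F_k(t₀, x₀).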

The initial conditions for τ = 0 are given by the boundary values of the associated PDE as

F_{k} (t_{0}, x_{0}) = ℙ {A (t_{0}) = k} .

Now, to obtain the value of the function F_k(t, x), for given t and x, one just needs to identify the right characteristic and the parameters x₀ and τ such that (t₀+τ, x₀+v(k)τ)^⊤ = (t, x)^⊤. Since the characteristics are parallel, it can be uniquely identified. Using these values of x₀ and τ in the solution (3.10) yields F_k(t, x). However, we will not pursue this strategy to compute the requisite values of F_k(t, x). Instead, we present a numerical upstream scheme in Appendix B.1 that can be used to compute F_k(t, x) efficiently on a suitable grid to ultimately obtain values for the CDF ℙ{ℒ ≤ x}.
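Identifying the characteristic through a given point amounts to solving two linear equations. Substituting t₀ = x₀/n into (t₀ + τ, x₀ + v(k)τ) = (t, x) gives τ = (nt − x)/(n − v(k)) and x₀ = x − v(k)τ. The sketch below implements this under the assumptions k < n and v(k)t ≤ x ≤ n · t; the function name is hypothetical.

```python
def locate_characteristic(n, k, t, x):
    """Return (x0, t0, tau) with t0 = x0 / n such that
    (t0 + tau, x0 + v(k) * tau) == (t, x).
    Assumes k < n and v(k) * t <= x <= n * t."""
    v = 0 if k == 1 else k  # v(k) = k * 1{k > 1}
    tau = (n * t - x) / (n - v)
    x0 = x - v * tau
    return x0, x0 / n, tau


# Example: n = 5, k = 3 and the interior point (t, x) = (1.0, 4.0).
x0, t0, tau = locate_characteristic(5, 3, 1.0, 4.0)
```

For the example point this yields x₀ = 2.5, t₀ = 0.5 and τ = 0.5.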

3.2 Joint Distribution of the Total Tree Length

In this section we present a method to compute the joint CDF of the total tree length

ℙ {ℒ^{a} \leq x, ℒ^{b} \leq y}

at two loci a and b separated by a given recombination distance ρ. Again, we approach this problem by first computing the time-dependent joint CDF

F_{s} (t, x, y) = ℙ {{\bar{A}}^{ρ} (t) = s, L^{a} (t) \leq x, L^{b} (t) \leq y} .

We will follow closely along the lines of the method presented in Section 3.1, where we replace the ancestral process A by the ancestral process with limited recombination Ā^ρ, and compute the integrals (2.13) and (2.14), to ultimately compute the joint CDF.

The analog to Lemma 3.1 is as follows.

Lemma 3.4

With definition (3.4), the relation

ℙ {ℒ^{a} \leq x, ℒ^{b} \leq y} = ℙ {{\bar{A}}^{ρ} (\bar{t}) \in Δ, L^{a} (\bar{t}) \leq x, L^{b} (\bar{t}) \leq y} = F_{(1, 0, 0, 0)} (\bar{t}, x, y) + F_{(1, 0, 0, 1)} (\bar{t}, x, y)

holds for x, y ∈ ℝ₊, t̄ ≥ max{x, y}/2, and Δ = {(1, 0, 0, 0), (1, 0, 0, 1)}, the absorbing states of Ā^ρ.

Proof

The proof is similar to the proof of Lemma 3.1. With

{\bar{A}}^{ρ} (t) = ({\bar{K}}_{a b} (t), {\bar{K}}_{a} (t), {\bar{K}}_{b} (t), \bar{R} (t)),

note that

2 T_{MRCA}^{a} \leq \int_{0}^{T_{MRCA}^{a}} 𝟙_{{{\bar{K}}_{a b} (s) + {\bar{K}}_{a} (s) > 1}} ({\bar{K}}_{a b} (s) + {\bar{K}}_{a} (s)) d s = ℒ^{a},

and similarly $2 T_{MRCA}^{b} \leq ℒ^{b}$. Thus, on the event {ℒ^a ≤ x, ℒ^b ≤ y}, the relations $T_{MRCA}^{a} \leq max {x, y} / 2 \leq \bar{t}$ and $T_{MRCA}^{b} \leq \bar{t}$ hold. This implies K̄_ab(t̄) + K̄_a(t̄) = 1 and K̄_ab(t̄) + K̄_b(t̄) = 1, which in turn implies Ā^ρ(t̄) ∈ Δ = {(1, 0, 0, 0), (1, 0, 0, 1)}, since these two states are the only admissible states that can satisfy these conditions. Incidentally, these are also the absorbing states of Ā^ρ. Thus,

{ℒ^{a} \leq x, ℒ^{b} \leq y} = {{\bar{A}}^{ρ} (\bar{t}) \in Δ, ℒ^{a} \leq x, ℒ^{b} \leq y}

holds. Furthermore, on the event {Ā^ρ(t̄) ∈ Δ}, $T_{MRCA}^{a} \leq \bar{t}$ and $T_{MRCA}^{b} \leq \bar{t}$ hold, which imply L^a(t̄) = ℒ^a and L^b(t̄) = ℒ^b. This in turn implies

{{\bar{A}}^{ρ} (\bar{t}) \in Δ, ℒ^{a} \leq x, ℒ^{b} \leq y} = {{\bar{A}}^{ρ} (\bar{t}) \in Δ, L^{a} (\bar{t}) \leq x, L^{b} (\bar{t}) \leq y} .

Finally, note that

{{\bar{A}}^{ρ} (\bar{t}) = (1, 0, 0, 1)} \cap {{\bar{A}}^{ρ} (\bar{t}) = (1, 0, 0, 0)} = \emptyset,

which proves the statement of the lemma.

Again, Lemma 3.4 shows that the joint CDF of ℒ^a and ℒ^b can be computed from the time-dependent CDFs F_(1,0,0,0)(t, x, y), and F_(1,0,0,1)(t, x, y). In order to derive a system of PDEs like (3.5) for the time-dependent joint CDF of the tree length at two loci, we again apply Proposition A.14, for dimension d = 2. To this end, define the functions

v^{a} (k_{a b}, k_{a}, k_{b}, r) : = 𝟙_{{k_{a b} + k_{a} > 1}} (k_{a b} + k_{a})

and

v^{b} (k_{a b}, k_{a}, k_{b}, r) : = 𝟙_{{k_{a b} + k_{b} > 1}} (k_{a b} + k_{b})

that yield for (k_ab, k_a, k_b, r) ∈ 𝒮̄^ρ the number of lineages ancestral to locus a and b, respectively, and define

V^{a} : = diag ({(v^{a} (s))}_{s \in {\bar{S}}^{ρ}}) and V^{b} : = diag ({(v^{b} (s))}_{s \in {\bar{S}}^{ρ}}) .
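As a minimal sketch, the two speed functions and the diagonals of V^a and V^b can be written as follows. The example states are hypothetical tuples (k_ab, k_a, k_b, r) chosen for illustration; the full state space 𝒮̄^ρ and its ordering are not reproduced here.

```python
def v_a(state):
    """Number of lineages ancestral to locus a, or 0 once locus a has one lineage."""
    k_ab, k_a, k_b, r = state
    m = k_ab + k_a
    return m if m > 1 else 0


def v_b(state):
    """Number of lineages ancestral to locus b, or 0 once locus b has one lineage."""
    k_ab, k_a, k_b, r = state
    m = k_ab + k_b
    return m if m > 1 else 0


# Hypothetical list of states; a real implementation would enumerate the state space.
states = [(3, 0, 0, 0), (2, 1, 1, 1), (1, 0, 1, 1), (1, 0, 0, 1), (1, 0, 0, 0)]
diag_V_a = [v_a(s) for s in states]
diag_V_b = [v_b(s) for s in states]
```

The state (1, 0, 1, 1) shows the asymmetric case in which locus a has already reached its most recent common ancestor while locus b still accumulates length at rate 2.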

We then have the following corollary.

Corollary 3.5

The time-dependent joint CDF of the tree lengths

F (t, x, y) = {(F_{s} (t, x, y))}_{s \in {\bar{S}}^{ρ}}

can be obtained for all points in U = {(t, x, y) : 0 < x < nt, 0 < y < nt, t > 0} as the strong solution of

\partial_{t} F (t, x, y) + \partial_{x} F (t, x, y) V^{a} + \partial_{y} F (t, x, y) V^{b} = F (t, x, y) \bar{Q} (t),

(3.12)

with boundary conditions

F (t, x, y) = \begin{cases} {(ℙ {{\bar{A}}^{ρ} (t) = s, L^{b} (t) \leq y})}_{s \in {\bar{S}}^{ρ}} \cdot 𝟙_{{v^{a} (s) \neq n}}, & if x = n t, \\ {(ℙ {{\bar{A}}^{ρ} (t) = s, L^{a} (t) \leq x})}_{s \in {\bar{S}}^{ρ}} \cdot 𝟙_{{v^{b} (s) \neq n}}, & if y = n t, \\ 0, & if x = 0 or y = 0, \end{cases}

for (t, x, y) ∈ ∂U and Q̄ (t) as defined in (2.11).

Proof

Define the function v: 𝒮̄^ρ → ℝ² as

v (k_{a b}, k_{a}, k_{b}, r) : = (𝟙_{{k_{a b} + k_{a} > 1}} (k_{a b} + k_{a}), 𝟙_{{k_{a b} + k_{b} > 1}} (k_{a b} + k_{b})) .

This function and the generator Q̄ (t) satisfy the requirements of Proposition A.14, and thus, the statement of the corollary follows from Proposition A.14 and Remark A.16.

Remark 3.6

Note that due to symmetry of Ā^ρ,

ℙ {{\bar{A}}^{ρ} (t) = s, L^{a} (t) \leq x} = ℙ {{\bar{A}}^{ρ} (t) = s, L^{b} (t) \leq x}

holds.

The process (Ā^ρ(t), L^a(t), L^b(t))_{t ∈ ℝ₊} is a piecewise-deterministic Markov process as well (see Remark A.17), where Q̄ (t) captures the stochastic dynamics, and ∂_x and ∂_y the deterministic dynamics. The numerical scheme to compute the time-dependent joint CDF is again an upstream scheme based on the method of characteristics and follows essentially along the lines of the scheme presented for the marginal case. The relation ≺ defined in (2.12) implies a partial ordering on the state space 𝒮̄^ρ, and the matrix Q̄ (t) is triangular with respect to this ordering. Thus, again, the values of F_s only depend on F_s′ with s ≺ s′, and they can be computed for each s separately. For given s ∈ 𝒮̄^ρ,

F_{s} (t, x, y) = ℙ {{\bar{A}}^{ρ} (t) = s, L^{a} (t) \leq x, L^{b} (t) \leq y} = \begin{cases} 0, & if x < v^{a} (s) \cdot t or y < v^{b} (s) \cdot t, \\ solution to (3.12), & if v^{a} (s) \cdot t \leq x < n \cdot t and v^{b} (s) \cdot t \leq y < n \cdot t, \\ ℙ {{\bar{A}}^{ρ} (t) = s, L^{a} (t) \leq x}, & if v^{a} (s) \cdot t \leq x < n \cdot t and n \cdot t \leq y, \\ ℙ {{\bar{A}}^{ρ} (t) = s, L^{b} (t) \leq y}, & if n \cdot t \leq x and v^{b} (s) \cdot t \leq y < n \cdot t, \\ ℙ {{\bar{A}}^{ρ} (t) = s}, & if n \cdot t \leq x and n \cdot t \leq y \end{cases}

(3.13)

holds. Figure 2 shows the different regions of F_s(t, x, y) for a fixed t. Moreover, for each s ∈ 𝒮̄^ρ, the PDE that has to be satisfied in the region v^a(s) · t ≤ x < n · t and v^b(s) · t ≤ y < n · t can be rewritten as

\partial_{t} F_{s} (t, x, y) + (v^{a} (s), v^{b} (s)) \nabla F_{s} (t, x, y) = F_{s} (t, x, y) {\bar{Q}}_{s, s} (t) + \sum_{s ≺ s^{'}} F_{s^{'}} (t, x, y) {\bar{Q}}_{s^{'}, s} (t),

(3.14)

Figure 2. The different regions and (projected) characteristics of F_s(t, x, y) (defined in equation (3.4)) for an intermediate state s ∈ 𝒮̄^ρ at a given time t. The characteristics also extend in the t-direction at unit speed. Note that for the states s with v^a(s) = n or v^b(s) = n the interior region is empty.

where ∇f = (∂_xf, ∂_yf)^⊤. Again, taking the derivative of F_s(t, x, y) along the characteristics

τ \to {(t_{0} + τ, x_{0} + τ v (s))}^{⊤},

with $t_{0} : = \frac{1}{n} max {x_{0}, y_{0}}$ , x₀ := (x₀, y₀), and v(s) := (v^a(s), v^b(s)), yields the right-hand side of equation (3.14). Thus, F_s(·, ·, ·) satisfies the ODE

\frac{d}{d τ} F_{s} (t_{0} + τ, x_{0} + τ v (s)) = - q_{s}^{(2)} (τ) F_{s} (t_{0} + τ, x_{0} + τ v (s)) + g_{s}^{(2)} (τ),

with

q_{s}^{(2)} (τ) = - {\bar{Q}}_{s, s} (t_{0} + τ)

and

g_{s}^{(2)} (τ) = \sum_{s ≺ s^{'}} F_{s^{'}} (t_{0} + τ, x_{0} + τ v (s)) {\bar{Q}}_{s^{'}, s} (t_{0} + τ) .

The characteristics for F_s(t, x, y) are depicted in Figure 2. Like in the marginal case, this is a non-homogeneous linear first-order ODE and can be readily solved. The solution involves integrating $q_{s}^{(2)} (τ)$ , which leads to

F_{s} (t_{0} + τ, x_{0} + v (s) τ) = e^{- H_{s}^{(2)} (τ)} (\int_{0}^{τ} g_{s}^{(2)} (α) e^{H_{s}^{(2)} (α)} d α + F_{s} (t_{0}, x_{0})),

(3.15)

with

H_{s}^{(2)} (τ) = \int_{0}^{τ} q_{s}^{(2)} (α) d α = - {\bar{Q}}_{s, s}^{ρ} τ - {\bar{Q}}_{s, s}^{c} (Λ (t_{0} + τ) - Λ (t_{0})) .

(3.16)

We provide the details of our numerical upstream scheme to efficiently and accurately compute solutions to equation (3.15) in Appendix B.2.

4 Empirical evaluation

In this section, we demonstrate that the numerical algorithms presented in Appendices B.1 and B.2 can be used to accurately and efficiently compute the time-dependent marginal CDF (3.3) and joint CDF (3.4), as well as the regular marginal CDF (3.1) and joint CDF (3.2), for different population size histories and different recombination rates. Furthermore, we show how our method can be used to study properties of the marginal and joint distributions, and compute their moments. We implemented the numerical algorithms in MATLAB, and the code is available upon request.

For ease of exposition, we use a sample size of n = 10 in the remainder of this paper, unless mentioned otherwise. We mainly focus on three population size histories, depicted in Figure 3. The first is a history of constant size 1, and we refer to the corresponding rate function as λ_c. Secondly, we consider a history with an ancient bottleneck, followed by exponential growth up to the present. Specifically, for t > 0.15, the relative population size is set to 2, and for 0.025 < t < 0.15, it is set to 0.25. Then, the population grows exponentially from size 0.25 at t = 0.025 up to t = 0 (the present), at an exponential rate of g. We refer to this population size history by λ_e, and if not mentioned otherwise, the growth rate is set to g = 200. This size history is a rough sketch of the human population size history, with an out-of-Africa bottleneck, followed by recent exponential growth at a rate of 1% per generation. In addition, we consider a pure bottleneck, where the relative ancestral size is 2 until time t = 0.05, and N_B from t = 0.05 until the present. We refer to this size history by λ_b, and if not otherwise mentioned, we set N_B = 0.2.

Figure 3. The three population size histories we will mainly consider in this paper: a constant population size (λ_c), an ancient bottleneck followed by exponential growth (λ_e), and a recent bottleneck (λ_b).
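The three size histories can be sketched as plain functions of time. The code below is an illustration only and rests on two assumptions that are not spelled out in this section: the coalescence-rate function is taken to be the reciprocal of the relative population size, and the values exactly at the breakpoints t = 0.025, 0.05 and 0.15 follow an arbitrary convention. All function names are hypothetical.

```python
import math


def size_constant(t):
    """Constant relative population size 1 (history lambda_c)."""
    return 1.0


def size_growth(t, g=200.0):
    """Ancient bottleneck followed by exponential growth (history lambda_e)."""
    if t > 0.15:
        return 2.0
    if t >= 0.025:
        return 0.25
    return 0.25 * math.exp(g * (0.025 - t))  # grows towards the present (t = 0)


def size_bottleneck(t, n_b=0.2):
    """Recent bottleneck of relative size n_b (history lambda_b)."""
    return 2.0 if t >= 0.05 else n_b


def coalescence_rate(size_fn, t):
    """Assumed relation: rate = 1 / relative population size."""
    return 1.0 / size_fn(t)


present_size_growth = size_growth(0.0)  # 0.25 * exp(200 * 0.025)
```

With g = 200, the relative size at the present is 0.25 · e⁵, roughly 37 times the constant reference size.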

4.1 Accuracy

In this section we demonstrate that the numerical algorithms presented in this paper can be used to compute the requisite CDFs accurately. Naturally, the accuracy will depend on the exact choice of the grid for the numerical algorithm. We will present results for a particular grid here, and discuss the issues for choosing an adequate grid in Section 5. We set n = 5 and compute the time-dependent marginal CDF

ℙ {A (t) = k, L (t) \leq x}

for k = 5, 3, and the absorbing state 1, and show the respective surfaces as functions of t and x in Figure 4. Here we used the population size history with exponential growth λ_e. These surfaces exhibit the properties sketched in Figure 1, and the different regions can be observed. Below the line x = nt, the functions are independent of x. Furthermore, the functions are zero above the line x = kt, except for k = 1, where the function is independent of t above the line x = 2t.

Figure 4. Heatmaps of ℙ{A(t) = k, L(t) ≤ x} (defined in equation (3.3)) as a function of t and x, for different k, computed using our numerical algorithm.

As shown in Section 3, the marginal CDF of the total tree length

ℙ {ℒ \leq x}

and the joint CDF

ℙ {ℒ^{a} \leq x, ℒ^{b} \leq y}

can be computed from the respective time-dependent CDFs. To demonstrate the accuracy of our numerical algorithm, we compared the numerical values from the algorithm to simulations under the respective ancestral processes A and Ā^ρ. To this end, we simulated a certain number N of trajectories from these processes, and estimated the respective probabilities. Figure 5 shows the marginal CDFs for n = 10 under exponential growth (λ_e) and the bottleneck scenario (λ_b). The simulations can also be used to bound the difference d(P_pde, P_T) between the values computed using the numerical scheme P_pde and the true value P_T. These bounds are indicated in Figure 5 for different values of N and decrease as N gets larger, as expected. For the joint CDF, we present numerical values for different x and y, and compare them to the respective estimates from the simulations, including confidence bounds for these estimates. We set n = 10, and used ρ = 0.001. The values for the model with exponential growth (λ_e) are shown in Table 1, and for the bottleneck scenario (λ_b) in Table 2. The values computed using the numeric algorithm always fall into the confidence bounds, demonstrating that our algorithm computes the respective values accurately. In these tables, it becomes particularly apparent that in order to guarantee a high accuracy using simulations, a very large number of trajectories needs to be simulated, which is time-consuming. Our numerical scheme yields a high accuracy, and does not suffer from these issues.

Figure 5. The CDF ℙ{ℒ ≤ x} (defined in equation (3.1)) as a function of x is depicted by the red line. Additionally, the green bars indicate the bound on the distance between the numerical value P_pde and the true value P_T for different N; the true value is thus guaranteed to fall within these bounds.
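The comparison against simulations can be sketched for the simplest case of a constant population size, where the total tree length is distributed as the maximum of n − 1 independent exponential variables with rate 1/2, so that ℙ{ℒ ≤ x} = (1 − e^{−x/2})^{n−1}. The code below is an illustration, not the simulation procedure used for the figures and tables: the function names, the seed, the sample size and the normal-approximation 95% half-width are all assumptions.

```python
import math
import random


def simulate_tree_length(n, rng):
    """One draw of the total tree length under the constant-size coalescent:
    while k lineages remain, the waiting time is Exp(k (k - 1) / 2)."""
    return sum(k * rng.expovariate(0.5 * k * (k - 1)) for k in range(n, 1, -1))


def estimate_cdf(n, x, num_samples, seed=0):
    """Monte Carlo estimate of P{L <= x} with a normal-approximation half-width."""
    rng = random.Random(seed)
    hits = sum(simulate_tree_length(n, rng) <= x for _ in range(num_samples))
    p_hat = hits / num_samples
    half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / num_samples)
    return p_hat, half_width


def exact_cdf_constant(n, x):
    """Closed form for constant size: maximum of n - 1 independent Exp(1/2)."""
    return (1.0 - math.exp(-0.5 * x)) ** (n - 1)


p_hat, half_width = estimate_cdf(10, 6.0, 20000)
p_exact = exact_cdf_constant(10, 6.0)
```

Because the half-width shrinks only like 1/√N, each additional digit of accuracy requires roughly a hundredfold increase in the number of simulated trajectories.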

Table 1.

The CDF ℙ{ℒ^a ≤ x, ℒ^b ≤ y} (defined in equation (3.2)) for different values of x and y under λ_e, with n = 10 and ρ = 0.001. p is computed using the numerical algorithm, and p̂ is estimated from simulations for different N. The confidence bounds are indicated in parentheses.

x	y	p	p̂ (N = 256,000)	p̂ (N = 16,384,000)
1.5	3.0	0.075326	0.074914 (± 0.002)	0.075030 (± 0.0002)
3.0	6.0	0.213703	0.213324 (± 0.002)	0.213565 (± 0.0002)
6.0	6.0	0.522821	0.521578 (± 0.002)	0.522707 (± 0.0003)
12.0	18.0	0.873357	0.872840 (± 0.002)	0.873319 (± 0.0002)
30.0	30.0	0.998499	0.998504 (± 0.0002)	0.998516 (± 0.00002)


Table 2.

The CDF ℙ{ℒ^a ≤ x, ℒ^b ≤ y} (defined in equation (3.2)) for different values of x and y under λ_b, with n = 10 and ρ = 0.001. p is computed using the numerical algorithm, and p̂ is estimated from simulations for different N. The confidence bounds are indicated in parentheses.

x	y	p	p̂ (N = 256,000)	p̂ (N = 16,384,000)
1.5	3.0	0.019794	0.019238 (± 0.0006)	0.019579 (± 0.00007)
3.0	6.0	0.094393	0.094414 (± 0.002)	0.094172 (± 0.0002)
6.0	6.0	0.369581	0.369059 (± 0.002)	0.369544 (± 0.0003)
12.0	18.0	0.812236	0.812328 (± 0.002)	0.812109 (± 0.0002)
30.0	30.0	0.997696	0.997922 (± 0.0002)	0.997721 (± 0.00003)


4.2 Properties of the Distributions

The results provided in the previous section show that our numerical algorithm can be used to accurately and efficiently compute the marginal and joint CDF of the total tree length in populations with variable size. We will now demonstrate the utility of our numerical method to study properties of the respective distributions.

The numerical values of the marginal CDF ℙ{ℒ ≤ x} can be readily applied to compute approximations of the expected value and the variance of the total tree length ℒ. Figure 6 shows the different values of the expectation and the variance under exponential growth (λ_e), with varying growth-rates g. Recall that a rate of g = 0 corresponds to no growth. Figure 7 shows the expected value and the variance under the bottleneck model (λ_b) for different values of the bottleneck size N_B. In both scenarios, the expected value and the variance are smallest in the models with the smallest contemporary population size, corresponding to the largest recent coalescent rate. They increase as g, respectively N_B, increases, but level off, indicating that increasing the population size has diminishing effects for large values. The absolute value of the expectation is higher in the bottleneck scenario, because, independent of the growth parameter, there is a substantial bottleneck in the growth-scenario.

Figure 6. Approximations to the expected value and the variance of the total tree length ℒ (defined in equation (2.7)) computed using our numerical procedure, under the model for exponential growth (λ_e), with different values for the growth-rate g.

Figure 7. Approximations to the expected value and the variance of the total tree length ℒ (defined in equation (2.7)), under the bottleneck model (λ_b), with different values for the bottleneck size N_B.
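One way to obtain moments from CDF values on a grid uses the identities E[ℒ] = ∫₀^∞ (1 − F(x)) dx and E[ℒ²] = ∫₀^∞ 2x (1 − F(x)) dx for a nonnegative random variable. The sketch below applies the trapezoidal rule to these integrals; it is not necessarily how the values in the figures were computed. The function name, the grid and the truncation point are assumptions, and the constant-size closed form (1 − e^{−x/2})^{n−1} serves as a test input.

```python
import math


def moments_from_cdf(xs, cdf_values):
    """Mean and variance of a nonnegative variable from CDF values on the
    increasing grid xs (starting at 0), using the trapezoidal rule."""
    mean = 0.0
    second = 0.0
    for i in range(1, len(xs)):
        h = xs[i] - xs[i - 1]
        s0 = 1.0 - cdf_values[i - 1]  # survival function at the left node
        s1 = 1.0 - cdf_values[i]      # survival function at the right node
        mean += 0.5 * h * (s0 + s1)
        second += 0.5 * h * (2.0 * xs[i - 1] * s0 + 2.0 * xs[i] * s1)
    return mean, second - mean ** 2


# Test input: constant population size with n = 10, truncated at x = 80.
n = 10
xs = [0.01 * i for i in range(8001)]
cdf_values = [(1.0 - math.exp(-0.5 * x)) ** (n - 1) for x in xs]
mean, variance = moments_from_cdf(xs, cdf_values)
```

For constant size the exact values are E[ℒ] = 2 Σ_{k=1}^{n−1} 1/k and Var[ℒ] = 4 Σ_{k=1}^{n−1} 1/k², which the sketch reproduces up to discretization error.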

Figure 8 shows the joint CDF ℙ{ℒ^a ≤ x,ℒ^b ≤ y} as a function of x and y for different population size scenarios and different recombination rates ρ, computed on a suitable grid using our numerical algorithm. Naturally, the CDF converges towards 1 as x and y increase, and due to the symmetry of the ancestral process Ā^ρ the CDF is symmetric when interchanging x and y. Furthermore, note that the isolines in the plots for ρ = 0.0001 show pronounced right angles along the line x = y, because for small ρ the trees at the two loci are highly correlated. As the recombination rate increases, the two tree lengths become increasingly uncorrelated, and these angles soften. In all plots, the isoline for 0.2 is around x = y = 5, for the case λ_e even lower. Thus, under λ_e, there is an elevated probability for very short trees, likely due to the strong bottleneck, which favors short trees. Under the constant population size model λ_c, the CDF increases rapidly as x and y increase, whereas the function is less steep for λ_e and λ_b. This behavior seems to be dominated by the ancient population sizes.

Figure 8. The joint CDF ℙ{ℒ^a ≤ x, ℒ^b ≤ y} (defined in equation (3.2)) for the three population models (rows), with different recombination rates ρ (columns). Again, we use n = 10.

Finally, we employ our numerical values of the joint CDF to compute approximations to the correlation coefficient between the tree lengths

corr (ℒ^{a}, ℒ^{b}) : = \frac{cov (ℒ^{a}, ℒ^{b})}{\sqrt{𝕍 [ℒ^{a}]} \sqrt{𝕍 [ℒ^{b}]}},

where cov(·, ·) denotes the covariance. Figure 9 shows this correlation coefficient under the population size history λ_e for different values of ρ, and sample sizes n = 5 and n = 20, respectively. Recall that our numerical procedure was derived using the approximate ancestral process Ā^ρ for computational efficiency, where we limited the number of recombination events to 1. To compare the correlation under the process Ā^ρ with the correlation under the regular ancestral process with recombination A^ρ, we estimated the latter from repeated simulations using the widely applied coalescent-simulation tool ms (Hudson, 2002), which is based on the regular coalescent with recombination (using N = 10⁷ repetitions). Naturally, the correlation is close to 1 for small recombination rates, and it decreases with increasing recombination rate. The values are basically indistinguishable until they start separating around ρ = 0.05. This is to be expected, since the approximation we introduced limits the number of recombination events to 1, and thus, as the recombination rate increases, the approximation error also increases.

Figure 9. Correlation between ℒ^a and ℒ^b (defined in equations (2.13) and (2.14)) under the exponential growth model (λ_e) for different sample sizes n and different recombination rates ρ. The black lines show the values computed using our method under Ā^ρ, and the blue lines show values estimated from coalescent simulations under A^ρ using the popular tool `ms` (using N = 10⁷ repetitions).
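A covariance can be recovered from CDF values through Hoeffding's covariance identity, cov(X, Y) = ∫∫ [F_{XY}(x, y) − F_X(x) F_Y(y)] dx dy. The sketch below applies a midpoint rule on a square grid; it is one possible route and not necessarily the one used for Figure 9. The function names, grid parameters and the exponential test distributions are assumptions chosen because their covariances are known exactly.

```python
import math


def covariance_from_joint_cdf(joint_cdf, marg_a, marg_b, upper, steps):
    """Midpoint-rule approximation of the double integral of
    F(x, y) - F_a(x) F_b(y) over the square [0, upper]^2."""
    h = upper / steps
    mids = [(i + 0.5) * h for i in range(steps)]
    fa = [marg_a(x) for x in mids]
    fb = [marg_b(y) for y in mids]
    total = 0.0
    for i, x in enumerate(mids):
        for j, y in enumerate(mids):
            total += joint_cdf(x, y) - fa[i] * fb[j]
    return total * h * h


def exp_cdf(x):
    """CDF of an Exp(1) variable, used only as a test distribution."""
    return 1.0 - math.exp(-x)


# Two extreme test cases: identical variables and independent variables.
cov_identical = covariance_from_joint_cdf(
    lambda x, y: exp_cdf(min(x, y)), exp_cdf, exp_cdf, upper=25.0, steps=250)
cov_independent = covariance_from_joint_cdf(
    lambda x, y: exp_cdf(x) * exp_cdf(y), exp_cdf, exp_cdf, upper=25.0, steps=250)
```

Dividing such a covariance by the square roots of the two marginal variances gives the correlation coefficient; for the identical case above the covariance equals the variance of an Exp(1) variable, so the correlation is 1.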

To further investigate how restrictive the assumption of at most one recombination event is, we also used the simulated trajectories to estimate the probability that two or more recombination events occur under the regular ancestral process A^ρ. The results are shown in Figure 10. These probabilities increase with increasing ρ and n. However, they remain small for ρ ≤ 0.05, which is in good agreement with the observation that the correlation is well approximated for ρ up to 0.05. In conclusion, Figure 9 and Figure 10 show that the approximate process can be used without loss of accuracy for a large range of recombination rates relevant for human genetics, where recombination rates between neighboring sites are on the order of 10⁻³.

Figure 10. Probability of two or more recombination events R in the regular ancestral process A^ρ (Definition 2.4), under exponential growth (λ_e), for different sample sizes n and different recombination rates ρ. These values were estimated using the coalescent simulation tool `ms` (using N = 10⁷ repetitions).

5 Discussion

In this paper, we presented a novel computational framework to compute the marginal and joint CDF of the total tree length in populations with variable size. To our knowledge, these distributions have not been addressed in the literature before, especially in populations of variable size. We introduced a system of linear hyperbolic PDEs and showed that the requisite CDFs can be obtained from the solution of this system. We introduced a numerical algorithm to compute the solution of this system based on the method of characteristics and demonstrated its accuracy in a wide range of biologically relevant scenarios.

The numerical algorithm that we introduced is an upstream-method that computes the requisite solutions step-wise on a grid. We presented the algorithm for a regular, equidistantly spaced grid. We used the trapezoidal rule for the integration steps in the method, and also used linear interpolation to interpolate values that do not fall onto the specified grid. We used these basic approaches for ease of exposition. Using higher order interpolation and integration schemes, combined with adaptive grids that have more points in regions where the coalescent-rate function is large will most certainly increase the accuracy. However, such higher order schemes come with additional computational cost. This opens numerous avenues for future research to optimize the balance between accuracy and efficiency that is required in the respective applications.

Moreover, for reasons of computational efficiency, we introduced the first-order approximation Ā^ρ to the regular ancestral process with recombination A^ρ, and computed the joint CDF under this approximate process. We demonstrated that this approximation is accurate for a large range of relevant recombination rates. It is straightforward to use higher order approximations, including more recombination events, to gain additional accuracy, but computing the joint CDF under the regular ancestral process is desirable. Proposition A.14 guarantees that we can use our numerical procedure to compute the requisite CDF under Ā^ρ, but it is conceivable that it can be extended to more general processes like A^ρ in future work.

Another research direction is to use our novel framework to study higher order correlations between trees at multiple loci. On the one hand, this could again be correlations between the total tree lengths, but the distribution of other summary statistics of the genealogical trees could be included, for example, the length of the external branches or the length of all branches subtending k leaves. Statistics that have been successfully used in the literature, like the coalescence time between two lineages (Li and Durbin, 2011; Terhorst et al., 2017) or the time of the first coalescent event (Schiffels and Durbin, 2014) could be used as well. Our framework is flexible enough to compute the distribution of multiple path integrals along the trajectories of a given Markov chain. Thus, to implement these additions, one needs to define and implement an appropriate ancestral process and compute suitable integrals along the trajectories.

In this paper, we studied the ancestral process in a single panmictic population. However, in recent years, researchers have gathered an increasing amount of genomic datasets that contain individuals from multiple sub-populations, and studied historical events like migration or population subdivision using these datasets. In light of these studies, it is important to augment our framework to compute joint CDFs of the total tree length in structured populations with complex migration histories. Again, this can be done by introducing suitable ancestral processes and suitable integrals along their trajectories.

Acknowledgments

We thank Yun S. Song for numerous helpful discussions that sparked many of the ideas presented in this paper. This research is supported in part by a National Institutes of Health grant R01-GM094402 (M.S.). We also thank Robin Young for helpful suggestions and fruitful discussions relevant to the design of the numerical scheme.

Appendix

A Path-integrals of Markov chains

Since the marginal and joint distributions of the tree length can be obtained by integrating a certain function of the ancestral processes, we now consider distributions of path integrals for Markov chains. We will introduce these distributions assuming Lipschitz continuity in Section A.1, and then show in Section A.2 that these assumptions can be relaxed if the state space is monotone.

A.1 Path-integrals under regularity assumptions

Let the Markov chain X be defined on a probability space (Ω, ℱ, ℙ). We will be using the following assumptions throughout the section.

(A1)
{X(t, ω), t ∈ ℝ₊, ω ∈ Ω} is a regular jump Markov process with values in a finite state space 𝒮_n, for convenience labeled 𝒮_n = {1, 2, … , n}, satisfying
$ℙ {X (t + h) = j ∣ X (t) = i} = q_{i j} (t) h + o (h), i, j \in S_{n} as h \to 0^{+} .$

We assume that the trajectories t → X(t, ω) are right-continuous.
(A2)
The infinitesimal generator Q(t) = {q_ij(t)}_{i,j=1,…,n} is conservative, that is
$q_{i} (t) : = - q_{i i} (t) = \sum_{i \neq j} q_{i j} (t),$

and satisfies Q ∈ C(ℝ₊; M^{n×n}) ∩ L^∞(ℝ₊; M^{n×n}). In addition, for each i ∈ 𝒮_n either q_i(t) = 0 for all t ≥ 0 or q_i(t) > 0 for all t ≥ 0. In the latter case, we require $\int_{0}^{\infty} q_{i} (s) d s = \infty$ .

Definition A.1

Let X(t) satisfy (A1)–(A2). Given a function

v = (v_{1}, v_{2}, \dots, v_{d}) : S_{n} \to ℝ^{d}

we define a vector-valued path-integral over the interval [0, t] by

L^{v} (t, ω) : = \int_{0}^{t} v (X (s, ω)) d s \in ℝ^{d}, t \in ℝ_{+} .
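To make Definition A.1 concrete, the following minimal Python sketch evaluates the path-integral for a single right-continuous step trajectory. The list-based representation of a trajectory (`jump_times`, `states`), the function name `path_integral`, and the example choice v(k) = k for k ≥ 2 with v(1) = 0 are illustrative assumptions, not notation taken from the paper.

```python
def path_integral(jump_times, states, v, t):
    """Integrate v(X(s)) over [0, t] for one step trajectory.

    jump_times[0] must be 0.0, and states[i] is held on the half-open
    interval [jump_times[i], jump_times[i + 1]); the last state is held forever.
    v maps each state to a tuple of d real numbers.
    """
    d = len(v[states[0]])
    total = [0.0] * d
    for idx, state in enumerate(states):
        start = jump_times[idx]
        if start >= t:
            break
        end = jump_times[idx + 1] if idx + 1 < len(jump_times) else float("inf")
        dwell = min(end, t) - start  # time spent in this state before t
        for c in range(d):
            total[c] += v[state][c] * dwell
    return total


# Hypothetical trajectory 3 -> 2 -> 1 with jumps at times 0.5 and 1.5.
v_example = {3: (3.0,), 2: (2.0,), 1: (0.0,)}
length_at_2 = path_integral([0.0, 0.5, 1.5], [3, 2, 1], v_example, 2.0)
```

Because the trajectory is piecewise constant, the integral reduces to a finite sum of dwell times weighted by v, which is also why t → L^v(t, ω) is Lipschitz with constant at most max(|m^v|, |M^v|) in each component.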

Definition A.2

Let X(t) satisfy (A1)–(A2) and v : 𝒮_n → ℝ^d be some real-valued function defined on the state space. We define a distribution vector-function associated with (X(t), L^v(t)) by

F^{v} = (F_{1}^{v}, \dots, F_{n}^{v}) : ℝ_{+} \times ℝ^{d} \to ℝ_{+}^{n}

(A.1)

with

F_{k}^{v} (t, x) : = ℙ {X (t) = k, L^{v} (t) \leq x},

k ∈ {1, … , n}, x ∈ ℝ^d, and the comparison is understood componentwise.
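As a rough illustration of Definition A.2, the sketch below estimates F_k^v(t, x) by plain Monte Carlo simulation for d = 1. It assumes a setting that this appendix does not restate: Kingman's coalescent with constant population size (jump k → k − 1 at rate k(k − 1)/2), with v(k) = k for k ≥ 2 and v(1) = 0. Under these assumptions the total tree length has the standard closed-form CDF (1 − e^{−x/2})^{n−1}, which is used only as a sanity check. All function names are hypothetical.

```python
import math
import random


def simulate_once(n, t, rng):
    """Return (X(t), L(t)) for one simulated trajectory started at X(0) = n."""
    k, now, length = n, 0.0, 0.0
    while k > 1:
        wait = rng.expovariate(k * (k - 1) / 2.0)  # holding time in state k
        if now + wait >= t:
            return k, length + k * (t - now)
        length += k * wait
        now += wait
        k -= 1
    return 1, length  # absorbed in state 1, where v(1) = 0


def estimate_F(n, t, x, k, runs=20000, seed=1):
    """Monte Carlo estimate of P{X(t) = k, L(t) <= x}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        state, length = simulate_once(n, t, rng)
        hits += (state == k and length <= x)
    return hits / runs


def exact_length_cdf(n, x):
    """Closed-form CDF of the total tree length (constant population size)."""
    return (1.0 - math.exp(-x / 2.0)) ** (n - 1)
```

For t large relative to x, the event {L(t) ≤ x} forces absorption before time t, so `estimate_F(n, t, x, 1)` should be close to `exact_length_cdf(n, x)`. Simulation is far less accurate than the numerical scheme developed in the paper and is shown only to clarify the definition.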

Definition A.3

Let v : 𝒮_n → ℝ^d be some real-valued function defined on the state space. Define

m^{v} : = (min_{j \in S_{n}} v_{1} (j), \dots, min_{j \in S_{n}} v_{d} (j))

and

M^{v} : = (max_{j \in S_{n}} v_{1} (j), \dots, max_{j \in S_{n}} v_{d} (j))

as the componentwise minima and maxima.

Remark A.4

Since X(t) is a regular jump process, it is separable. Thus, for each t₀ ∈ ℝ₊ the random variable L^v(t₀, ·) is well-defined and ℱ-measurable, and for each ω₀ ∈ Ω the map t → L^v(t, ω₀) is Lipschitz continuous. This in turn implies that the process L^v(t) is measurable and separable (see Chapter 12 of Koralov and Sinay (2007) and Chapter 2 of Doob (1953)).

Proposition A.5 (Locally Lipschitz)

Let (A1)–(A2) hold. Let v : 𝒮_n → ℝ^d and F^v be defined by (A.1). Suppose that F^v is Lipschitz continuous in an open set 𝒰 ⊂ ℝ⁺ × ℝ^d. Then

\partial_{t} F^{v} (t, x) + \sum_{j = 1}^{d} \partial_{x_{j}} F^{v} (t, x) V_{j} = F^{v} (t, x) Q (t) \quad \text{a.e. in } 𝒰,

(A.2)

where V_j = diag (v_j(1), … , v_j(n)).

To prove this statement, we first need to introduce the following lemmas.

Lemma A.6

Let X(t) satisfy (A1)–(A2), v : 𝒮_n → ℝ^d. Then, for any i, k ∈ 𝒮_n, any ε > 0, and each x ∈ ℝ^d, we have

ℙ {X (t) = i, X (t + ε) = k} F_{i}^{v} (t, x - ε M^{v}) \leq ℙ {X (t) = i, X (t + ε) = k, L^{v} (t + ε) \leq x} ℙ {X (t) = i} \leq ℙ {X (t) = i, X (t + ε) = k} F_{i}^{v} (t, x - ε m^{v}),

and the comparison is understood componentwise.

Proof

We suppress the superscript v in our calculations. Take any ε > 0. Observe that

L (t) + ε m \leq L (t + ε) = L (t) + \int_{t}^{t + ε} v (X (s)) d s \leq L (t) + ε M

and therefore

ℙ {X (t) = i, X (t + ε) = k, L (t) \leq x - ε M} \leq ℙ {X (t) = i, X (t + ε) = k, L (t + ε) \leq x} \leq ℙ {X (t) = i, X (t + ε) = k, L (t) \leq x - ε m} .

Let y ∈ ℝ^d. Suppose that 0 < F_i(t, y). Observe that for any t₀ > 0 the path-integral L(t₀) is fully determined by (X(t), t ∈ [0, t₀)). This fact (and the separability of the process) enables us to use the Markov property and we obtain

ℙ {X (t) = i, X (t + ε) = k, L (t) \leq y} = ℙ {X (t + ε) = k ∣ X (t) = i, L (t) \leq y} ℙ {X (t) = i, L (t) \leq y} = ℙ {X (t + ε) = k ∣ X (t) = i} F_{i} (t, y) .

This together with the previous inequality finishes the proof.

Lemma A.7

Let X(t) satisfy (A1)–(A2) and v : 𝒮_n → ℝ^d. For every (t, x) ∈ ℝ₊ × ℝ^d and each k ∈ 𝒮_n, we have

lim_{ε \to 0^{+}} \frac{1}{ε} | ℙ {X (t) = k, X (t + ε) = k, L^{v} (t + ε) \leq x + ε v (k)} - (1 - q_{k} (t) ε) F_{k} (t, x) | = 0.

Proof

We suppress the superscript v in what follows. Due to (A1) the process X(t) is separable and therefore (see (Karlin, 1981, p. 146))

ℙ {X (s) = k, s \in [t, t + ε]} = exp {- \int_{t}^{t + ε} q_{k} (s) d s} ℙ {X (t) = k} .

(A.3)

Suppose now that F_k(t, x) > 0. Then, using the Markov property, we obtain

ℙ {X (s) = k, s \in [t, t + ε], L (t + ε) \leq x + ε v (k)} = ℙ {X (s) = k, s \in [t, t + ε] ∣ X (t) = k, L (t) \leq x} ℙ {X (t) = k, L (t) \leq x} = ℙ {X (s) = k, s \in [t, t + ε] ∣ X (t) = k} F_{k} (t, x) = exp (- \int_{t}^{t + ε} q_{k} (s) d s) F_{k} (t, x) .

If F_k(t, x) = 0, then the first and the last term in the above identity are zero. Thus, employing (A.3) and recalling that Q is continuous, we conclude

\frac{1}{ε} | ℙ {X (t) = k, X (t + ε) = k, L (t + ε) \leq x + ε v (k)} - exp (- \int_{t}^{t + ε} q_{k} (s) d s) F_{k} (t, x) | = \frac{1}{ε} (ℙ {X (t) = k, X (t + ε) = k, L (t + ε) \leq x + ε v (k)} - ℙ {X (s) = k, s \in [t, t + ε], L (t + ε) \leq x + ε v (k)}) \leq \frac{1}{ε} (ℙ {X (t) = k, X (t + ε) = k} - ℙ {X (s) = k, s \in [t, t + ε]}) \leq \frac{1}{ε} (ℙ {X (t) = k, X (t + ε) = k} - ℙ {X (t) = k}) - \frac{1}{ε} [exp (- \int_{t}^{t + ε} q_{k} (s) d s) - 1] ℙ {X (t) = k} \to 0 as ε \to 0^{+} .

Since Q is continuous, $(1 - q_{k} (t) ε) F_{k} (t, x) = exp (- \int_{t}^{t + ε} q_{k} (s) d s) F_{k} (t, x) + o (ε)$ , and the statement of the lemma follows.

We can now turn back to the proof of Proposition A.5.

Proof of Proposition A.5

Let us suppress the superscript v in our calculations. Let 𝒰̃ denote the set of all points in 𝒰 at which F is differentiable. Since F is Lipschitz continuous in 𝒰, Rademacher’s Theorem (Federer, 1969) implies that F is Lebesgue almost surely differentiable in 𝒰 and therefore 𝒰\𝒰̃ is of Lebesgue measure zero. Take any k ∈ 𝒮_n. Fix any (t, x) ∈ 𝒰̃. For any ε > 0 we have

F_{k} (t + ε, x + ε v (k)) - F_{k} (t, x) = (\sum_{i = 1}^{n} ℙ {X (t) = i, X (t + ε) = k, L (t + ε) \leq x + ε v (k)}) - F_{k} (t, x) .

(A.4)

Consider first the terms with i ≠ k. By Lemma A.6, we have

\frac{1}{ε} ℙ {X (t) = i, X (t + ε) = k} F_{i} (t, x + ε v (k) - ε M) \leq \frac{1}{ε} ℙ {X (t) = i, X (t + ε) = k, L (t + ε) \leq x + ε v (k)} ℙ {X (t) = i} \leq \frac{1}{ε} ℙ {X (t) = i, X (t + ε) = k} F_{i} (t, x + ε v (k) - ε m)

where m and M are as in Definition A.3. Since Q is continuous we must have

lim_{ε \to 0^{+}} \frac{1}{ε} ℙ {X (t) = i, X (t + ε) = k} = q_{i k} (t) ℙ {X (t) = i}

and hence, employing the continuity of F, we conclude

lim_{ε \to 0^{+}} \frac{1}{ε} ℙ {X (t) = i, X (t + ε) = k, L (t + ε) \leq x + ε v (k)} = q_{i k} (t) F_{i} (t, x) .

(A.5)

Next consider the case i = k. By Lemma A.7 we have

lim_{ε \to 0^{+}} \frac{1}{ε} (ℙ {X (t) = k, X (t + ε) = k, L (t + ε) \leq x + ε v (k)} - F_{k} (t, x)) = - q_{k} (t) F_{k} (t, x) .

(A.6)

Combining (A.4) with (A.5) and (A.6) we obtain

lim_{ε \to 0^{+}} \frac{1}{ε} (F_{k} (t + ε, x + ε v (k)) - F_{k} (t, x)) = - q_{k} (t) F_{k} (t, x) + \sum_{i \neq k} q_{i k} (t) F_{i} (t, x) = {(F (t, x) Q (t))}_{k} .

(A.7)

Since F is differentiable at (t, x) ∈ 𝒰̃ and the map ε → (t+ε, x+ε v(k)) is differentiable with the image contained in 𝒰 for sufficiently small ε, the chain rule is applicable (see Rudin (1976), Theorem 9.15) and we conclude

lim_{ε \to 0^{+}} \frac{1}{ε} (F_{k} (t + ε, x + ε v (k)) - F_{k} (t, x)) = {\frac{d}{d ε} F_{k} (t + ε, x + ε v (k)) |}_{ε = 0} = \partial_{t} F_{k} (t, x) + \sum_{j = 1}^{d} v_{j} (k) \partial_{x_{j}} F_{k} (t, x) .

Since both k ∈ 𝒮_n and (t, x) ∈ 𝒰̃ were arbitrary, (A.7) implies (A.2).
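As a sanity check of (A.2), consider a hypothetical two-state example with d = 1: X(0) = 2, a single jump 2 → 1 at constant rate q, v(2) = 2 and v(1) = 0. Writing T for the jump time, L(t) = 2 min(T, t), so that F_2(t, x) = e^{−qt} 𝟙{x ≥ 2t} and F_1(t, x) = 1 − e^{−q min(t, x/2)} for x ≥ 0. The Python sketch below evaluates the residual of (A.2) by central finite differences at points away from the line x = 2t, where F_2 is discontinuous. The rate value and all names are assumptions made for this illustration.

```python
import math

Q_RATE = 1.5  # hypothetical constant jump rate from state 2 to state 1


def F(t, x):
    """Closed-form (F_1, F_2) for the two-state example described above."""
    f2 = math.exp(-Q_RATE * t) if x >= 2.0 * t else 0.0
    f1 = 1.0 - math.exp(-Q_RATE * min(t, x / 2.0)) if x >= 0.0 else 0.0
    return f1, f2


def residual(t, x, h=1e-5):
    """Finite-difference value of dF/dt + (dF/dx) V - F Q, per component."""
    v = (0.0, 2.0)  # v(1), v(2)
    dt = [(a - b) / (2 * h) for a, b in zip(F(t + h, x), F(t - h, x))]
    dx = [(a - b) / (2 * h) for a, b in zip(F(t, x + h), F(t, x - h))]
    f1, f2 = F(t, x)
    # Q has rows (0, 0) and (q, -q), so (F Q)_1 = q F_2 and (F Q)_2 = -q F_2.
    fq = (Q_RATE * f2, -Q_RATE * f2)
    return [dt[k] + v[k] * dx[k] - fq[k] for k in range(2)]
```

Both components of the residual vanish up to discretization error on either side of the characteristic x = 2t, in line with the statement that (A.2) holds almost everywhere on open sets where F^v is Lipschitz continuous.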

We next show that F^v as t → 0⁺ has certain continuity properties.

Proposition A.8

Let (A1)–(A2) hold. Let v : 𝒮_n → ℝ^d and F^v as defined by (A.1). Then

lim_{t \to 0^{+}} F_{k}^{v} (t, x) = 𝟙_{ℝ_{+}^{d}} (x) ℙ {X (0) = k} = F_{k}^{v} (0, x) for x \notin \partial ℝ_{+}^{d} .

Proof

First, we note that L(0) = 0 for all ω ∈ Ω and hence $F_{k}^{v} (0, x) = 𝟙_{ℝ_{+}^{d}} (x) ℙ {X (0) = k}$ .

Observe that

m^{v} t \leq L (t) = \int_{0}^{t} v (X (s)) d s \leq M^{v} t,

where m^v and M^v are as in Definition A.3. Fix δ > 0. Then for all 0 < t < δ(1 + max(||m^v||_∞, ||M^v||_∞))⁻¹, and every x ∈ ℝ^d such that ||x − y|| > δ for all $y \in \partial ℝ_{+}^{d}$ , we have

F_{k} (t, x) = 𝟙_{ℝ_{+}^{d}} (x) ℙ {X (t) = k, L (t) \leq x} = 𝟙_{ℝ_{+}^{d}} (x) ℙ {X (t) = k} \to 𝟙_{ℝ_{+}^{d}} (x) ℙ {X (0) = k} as t \to 0^{+} .

Remark A.9

From Proposition A.8 it follows that the ‘initial values’ of F^v are discontinuous. Since the system (A.2) is a linear hyperbolic system, discontinuities present at time t = 0 will travel in space as time t increases and therefore F^v is not C¹ or even continuous. Nevertheless, one can show that (A.2) holds in a weaker sense. To do that one needs to employ the notion of weak solutions, and we will pursue this avenue in an upcoming paper. However, for certain types of state space functions v, relevant to our application, one can show additional regularity properties of F^v. We provide more details in Appendix A.2.

A.2 Path-integrals for monotone state space functions

Hyperbolic systems of partial differential equations admit in general solutions that are not classical even if the initial (or boundary) data is smooth. Typically there are two distinct classes of solutions: strong solutions, which are Lipschitz continuous (see Dafermos (2010)), and weak solutions, which allow for discontinuities. Here, we will be using the first type of solutions.

Definition A.10

Let 𝒰 ⊂ ℝ₊ × ℝ^d be open and let A₁, … , A_d, B ∈ L^∞(𝒰; ℝ^n×n). We say that u(t, x) : ℝ⁺ × ℝ^d → ℝⁿ is a strong solution of

\partial_{t} u (t, x) + \sum_{j = 1}^{d} \partial_{x_{j}} {u (t, x)} A_{j} (t, x) = u (t, x) B (t, x) \quad \text{in } 𝒰 \subset ℝ_{+} \times ℝ^{d},

(A.8)

if u is Lipschitz continuous in 𝒰, and the equation (A.8) holds for Lebesgue almost all points (t, x) in 𝒰.

Remark A.11

By Rademacher’s Theorem (Federer, 1969), a function u(t, x) that is Lipschitz continuous in an open domain 𝒰 is differentiable Lebesgue almost everywhere in 𝒰. In fact, its pointwise partial derivatives, which exist almost everywhere, coincide with its corresponding weak partial derivatives (see Evans (2010)).

Regularity of solutions to hyperbolic problems depends on both the initial (or boundary) data and the domain itself. For linear hyperbolic problems as long as the initial data is smooth and the domain has a smooth boundary one may expect a solution to be (locally) smooth. Typically one studies solutions to hyperbolic problems on the domain 𝒰 = ℝ₊ × ℝ^d with initial data u₀(x) at t = 0 (Cauchy problem). The initial data for the vector of probabilities F^v (studied in 𝒰) are unfortunately discontinuous (which is shown below). To avoid unnecessary difficulties, in Proposition A.14 we split the space-time domain into two regions 𝒰_I and 𝒰_E. In 𝒰_E the values of F^v admit a simpler form while in 𝒰_I the vector F^v is obtained via solving a linear hyperbolic system with smooth initial data. We note that components of F^v are in general merely Lipschitz continuous in 𝒰_I. This is not surprising for two reasons. Firstly, the domain is singular because it has a ‘corner’ and the discontinuities of the derivatives of F^v originating at points $\partial ℝ_{+}^{d}$ travel along the corresponding characteristics. Secondly, the vector F^v solves the same system of equations in the domain 𝒰 with discontinuous initial data and hence it is in general not smooth.

Definition A.12

Let (A1)–(A2) hold. Let v : 𝒮_n → ℝ^d. We say that v is monotone along the process X(t) if the map t → v(X(t)) is either non-increasing ℙ-almost surely or non-decreasing ℙ-almost surely.

Definition A.13

Define the following regions in ℝ₊ × ℝ^d:

U_{I} : = {(t, x) : t > 0, x < M^{v} t}

and

U_{E} : = {(U_{I} \cup \partial U_{I})}^{c},

where the comparison is understood componentwise.
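The regions of Definition A.13, together with the index set J(t, x) used in Proposition A.14, can be sketched in a few lines of Python. The function name `classify`, the string labels, and the zero-based component indices are assumptions of this illustration. Exact floating-point comparison is used for the boundary, which a practical implementation would likely replace by a tolerance.

```python
def classify(t, x, M):
    """Locate the point (t, x) relative to U_I and return the index set J(t, x).

    M holds the componentwise maxima M^v. J lists the (zero-based) indices j
    with x_j < M_j * t.
    """
    J = [j for j, (xj, Mj) in enumerate(zip(x, M)) if xj < Mj * t]
    if t > 0 and len(J) == len(x):
        return "interior", J   # U_I: every component strictly below M^v t
    if t >= 0 and all(xj <= Mj * t for xj, Mj in zip(x, M)):
        return "boundary", J   # closure of U_I without U_I itself
    return "exterior", J       # U_E: some component exceeds M^v t


# Example with d = 2 and M^v = (3, 3):
region, J = classify(1.0, (3.0, 2.0), (3.0, 3.0))
```

On the boundary and in the exterior, only the components listed in J still carry information, since the remaining path-integrals are bounded by M_j^v t and hence automatically satisfy the constraint.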

Proposition A.14

Let X(t) satisfy (A1)–(A2), X(0) = n, and Q(t) be lower triangular. Let v : 𝒮_n → ℝ^d be monotone along X(t). Suppose that t → v(X(t)) is non-increasing on Ω, and m^v < M^v. For x ∈ ℝ^d, define $J (t, x) = {j : x_{j} < M_{j}^{v} t}$ . Then F^v defined by (A.1) has the following properties:

(i) For each i ∈ 𝒮_n with v(i) < M^v, $F_{i}^{v} (t, x)$ is Lipschitz continuous on ℝ₊ × ℝ^d.
(ii) For each i ∈ 𝒮_n with v(i) ≮ M^v, $F_{i}^{v} (t, x) = 𝟙_{\bar{U_{E}}} (x) ℙ {X (t) = i, L_{j}^{v} (t) \leq x_{j}, j \in J (t, x)}$ .
(iii) F^v is a strong solution of
$\partial_{t} F^{v} (t, x) + \sum_{j = 1}^{d} \partial_{x_{j}} F^{v} (t, x) V_{j} = F^{v} (t, x) Q (t),$ (A.9)

where V_j = diag(v_j(1), … , v_j(n)), in the open region 𝒰_I. Furthermore, let (t, x) ∈ ∂𝒰_I. Then
$lim_{(\bar{t}, \bar{x}) \in U_{I} \to (t, x)} F_{i}^{v} (\bar{t}, \bar{x}) = \begin{cases} ℙ {X (t) = i, L_{j}^{v} (t) \leq x_{j}, j \in J (t, x)}, & \text{if } v (i) < M^{v}, \\ 0, & \text{otherwise}, \end{cases}$ (A.10)

and
$F_{i}^{v} (\bar{t}, \bar{x}) = ℙ {X (\bar{t}) = i, L_{j}^{v} (\bar{t}) \leq {\bar{x}}_{j}, j \in J (t, \bar{x})} \text{ for all } (\bar{t}, \bar{x}) \in U_{E} .$ (A.11)

Remark A.15

Computing the solution in Proposition A.14 for a given number d of path-integrals requires computing solutions for d̄ integrals with d̄ < d on the boundary. These can be obtained by straightforwardly applying the Proposition in lower dimensions. Note that for d = 1, the values on the boundary can be directly obtained from the distribution of X(t).

Remark A.16

Note that for each i ∈ 𝒮_n we have

F_{i}^{v} (t, x) = 0,

for x ≤ v(i)t.

Remark A.17

The process (X(t), L^v(t))_{t∈ℝ₊} is a time-inhomogeneous piecewise-deterministic strong Markov process (Davis, 1993, Chapter 2), and Proposition A.14 essentially shows that the generator is given by

G_{t} H (x) = - \sum_{j = 1}^{d} \partial_{x_{j}} H (x) V_{j} + H (x) Q (t),

for suitably defined functions H(x). The stochastic transitions of X(t) are described by Q(t) and the deterministic evolution of L^v(t) in each dimension is governed by the terms V_j ∂_{x_j}. However, in addition, Proposition A.14 establishes regularity of F^v(t, x), which is important for numerical computations.

Remark A.18

The ancestral process with limited recombination satisfies assumptions (A1)–(A2), and thus, we focus on this case here. It is conceivable that these assumptions could be relaxed and Proposition A.14 could be extended to more general Markov chains X(t) with a (countably) infinite state space, and more general dynamics, for example, a non-triangular rate matrix Q(t), or $\int_{0}^{\infty} q_{i} (s) d s < \infty$ . However, the approach presented here in the proof of Proposition A.14 to show the necessary regularity of F^v(t, x) uses the fact that X(t) has absorbing states, and reaches them in finite time, after a finite number of jumps. For a more general version, this strategy would need to be adapted, or a different strategy used.

Proof of Proposition A.14

Let Δ denote the set of absorbing states of the process X(t). Since Q is lower triangular, 1 ∈ Δ and thus Δ is not empty.

Take any i ∈ 𝒮_n with v(i) ≮ M^v. Since v is monotone along the process we conclude that

L_{j} (t, ω) = \int_{0}^{t} v_{j} (X (s, ω)) d s = M_{j}^{v} t for all j \notin J (1, v (i)), ω \in {\tilde{ω} : X (t, \tilde{ω}) = i}

and this yields (ii).

Recall next that for a time-inhomogeneous Markov process X(t) (under the assumptions (A1)–(A2)) with X(0) = n, the jumping times T₁, T₂, T₃, … of X(t) satisfy $ℙ {T_{1} > α} = exp (- \int_{0}^{α} q_{n} (s) d s)$ and for k ≥ 2

ℙ {T_{k} > t + α ∣ T_{k - 1} = t, X (T_{k - 1}) = i} = exp (- \int_{t}^{t + α} q_{i} (s) d s) .

(A.12)
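Equation (A.12) says that the cumulative rate accumulated in state i since the previous jump is exponentially distributed with mean one, so jump times can be sampled by inverting the integrated rate. The Python sketch below does this under the additional assumption, made only for illustration, that q_i is piecewise constant and strictly positive. The function names and the breakpoint representation are hypothetical.

```python
import math
import random


def next_jump_time(t, hazard, breakpoints, rates):
    """Return the time t_next >= t at which the integral of q from t to t_next
    equals hazard.

    q(s) = rates[j] on [breakpoints[j], breakpoints[j + 1]); the last piece
    extends to infinity. breakpoints[0] must be 0.0 and all rates positive.
    """
    j = max(idx for idx, b in enumerate(breakpoints) if b <= t)
    current, remaining = t, hazard
    while True:
        end = breakpoints[j + 1] if j + 1 < len(breakpoints) else math.inf
        capacity = rates[j] * (end - current)  # hazard available on this piece
        if capacity >= remaining:
            return current + remaining / rates[j]
        remaining -= capacity
        current, j = end, j + 1


def sample_jump(t, breakpoints, rates, rng):
    """Draw the next jump time after t according to (A.12)."""
    return next_jump_time(t, rng.expovariate(1.0), breakpoints, rates)


rng = random.Random(0)
t_next = sample_jump(0.0, [0.0, 1.0], [1.0, 2.0], rng)
```

The requirement in (A2) that the integrated rate diverges for non-absorbing states corresponds here to the last rate being positive, which guarantees that the loop terminates.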

Take any i ∈ 𝒮_n with v(i) < M^v, in which case i < n. Since Q(t) is lower triangular, each trajectory of the process has at most n − 1 jumps before it enters into the absorbing set Δ. Thus we obtain

{ω : X (\cdot, ω) enters the state i} = \cup_{k = 1}^{n - 1} Ω_{k}^{(i)}, Ω_{k}^{(i)} = {ω : X (\cdot, ω) enters the state i on the k - th jump} .

We next denote T₀ = 0, s₀ = n, $s_{i}^{(k)} = (s_{1}, s_{2}, \dots, s_{k - 1}, s_{k} = i) \in {(S_{n})}^{k}$ , with k ≥ 1, and

A (s_{i}^{(k)}) = {ω : X (T_{1}) = s_{1}, \dots, X (T_{k - 1}) = s_{k - 1}, X (T_{k}) = s_{k} = i} \subset Ω_{k}^{(i)} .

First, suppose that i ∉ Δ. For (t, x) ∈ ℝ₊ × ℝ^d, using the above partitioning, we write

F_{i}^{v} (t, x) = ℙ {X (t) = i, L (t) \leq x} = \sum_{k = 1}^{n - 1} \sum_{s_{i}^{(k)} \in S^{k}} ℙ {A (s_{i}^{(k)}), T_{k} < t < T_{k + 1}, L (t) \leq x} = \sum_{k = 1}^{n - 1} \sum_{s_{i}^{(k)} \in S^{k}} ℙ {A (s_{i}^{(k)}), T_{k} < t < T_{k + 1}, \sum_{j = 1}^{k} T_{j} (v (s_{j - 1}) - v (s_{j})) \leq x - v (i) t} .

(A.13)

We now show that $F_{i}^{v}$ is Lipschitz continuous. To this end, consider the function

G (t, x; s_{i}^{(k)}) = ℙ {A (s_{i}^{(k)}), T_{k} < t < T_{k + 1}, \sum_{j = 1}^{k} T_{j} (v (s_{j - 1}) - v (s_{j})) \leq x} .

Observe that $G (t, x; s_{i}^{(k)})$ is well-defined for (t, x) ∈ ℝ^{1+d}. Moreover, since i ∉ Δ, the assumption (A2) implies that the process after entering the state i leaves this state in finite time ℙ-almost surely. Thus $Ω_{k}^{(i)} \subset {T_{k} < \infty} \subset {T_{k + 1} < \infty}$ and therefore

G (t, x; s_{i}^{(k)}) = ℙ {A (s_{i}^{(k)}), T_{k} < t, \sum_{j = 1}^{k} T_{j} (v (s_{j - 1}) - v (s_{j})) \leq x} - ℙ {A (s_{i}^{(k)}), T_{k + 1} < t, \sum_{j = 1}^{k} T_{j} (v (s_{j - 1}) - v (s_{j})) \leq x} = : G_{1} (t, x; s_{i}^{(k)}) - G_{2} (t, x; s_{i}^{(k)}) .

Now, using (A.12) and induction, one can show that for each r ∈ 𝒮_n and k ≥ 1

ℙ {A (s_{r}^{(k)}), T_{k + 1} \leq z} = \int_{- \infty}^{z} f_{k + 1} (α; s_{r}^{(k)}) d α

(A.14)

where $f_{k + 1} (\cdot; s_{r}^{(k)})$ is a globally bounded function. Thus, we conclude that the map

z \to ℙ {A (s_{r}^{(k)}), T_{k + 1} \leq z}

is globally Lipschitz for each k ≥ 1.

Since v is non-increasing along the process, for each $s_{i}^{(k)}$ we have

v (n) = M^{v} \geq v (s_{1}) \dots \geq v (s_{k}) = v (i) .

By assumption v(i) < M^v and hence for each l ∈ {1, . . . , d} there exists k_l ∈ {1, . . . , k} such that v_l(s_{k_l− 1}) − v_l(s_{k_l}) > 0, which guarantees that not all terms in the nonnegative sum $\sum_{j = 1}^{k} T_{j} (v_{l} (s_{j - 1}) - v_{l} (s_{j}))$ vanish. Then, in view of the fact that the event $A (s_{i}^{(k)})$ does not depend on the (t, x)-variable, we can use (A.14) and induction to conclude that

ℙ {A (s_{i}^{(k)}), \sum_{j = 1}^{k} T_{j} (v_{l} (s_{j - 1}) - v_{l} (s_{j})) \leq x_{l}} = \int_{- \infty}^{x_{l}} {\tilde{f}}_{k l} (s) d s, k \geq 1, l \in {1, \dots, d}

for some globally bounded function f̃_kl. It can be shown that this implies that

x \to ℙ {A (s_{i}^{(k)}), \sum_{j = 1}^{k} T_{j} (v (s_{j - 1}) - v (s_{j})) \leq x}

(A.15)

is globally Lipschitz. Combining (A.14) with (A.15) and using the definition of Lipschitz continuity we conclude that $G_{1} (t, x; s_{i}^{(k)})$ and $G_{2} (t, x; s_{i}^{(k)})$ are globally Lipschitz and hence $G (t, x; s_{i}^{(k)})$ is as well.

Furthermore, any Lipschitz continuous function composed with a linear map is also Lipschitz continuous. Thus $\bar{G} (t, x; s_{i}^{(k)}) : = G (B (t, x); s_{i}^{(k)})$ , where B(t, x) = (t, x − v(i)t), is globally Lipschitz. In (A.13) each of the terms in the sum is one of the functions $\bar{G} (t, x; s_{i}^{(k)})$ . Hence F_i, restricted to (t, x) ∈ [0, ∞) × ℝ^d, is globally Lipschitz on this domain.

Lastly, if i ∈ Δ, observe that

{T_{k} < \infty, X (T_{k}) = i} \subset {T_{k + 1} = \infty}

and therefore

F_{i}^{v} (t, x) = ℙ {X (t) = i, L (t) \leq x} = \sum_{k = 1}^{n - 1} \sum_{s_{i}^{(k)} \in S^{k}} ℙ {A (s_{i}^{(k)}), T_{k} < t, \sum_{j = 1}^{k} T_{j} (v (s_{j - 1}) - v (s_{j})) \leq x - v (i) t} .

Using an analogous approach (to the one in the case i ∉ Δ) one can show that each term in the above expression is globally Lipschitz continuous. This yields (i).

From (i) and (ii) it follows that F^v is Lipschitz continuous in the open region 𝒰_I. Then, by Proposition A.5 we conclude that F^v is a strong solution of (A.9) in 𝒰_I. The boundary conditions (A.10) and equation (A.11) follow directly from the definition of F^v. This proves (iii).

B Numerical Schemes

B.1 Upstream Numerical Scheme for Single-Locus Case

Here we present a numerical algorithm for computing solutions to the system (3.5). The numerical scheme is an upstream scheme based on the method of characteristics. In particular, the numerical scheme we develop makes use of the integral representation formulas (3.10), (3.11).

To define a grid in the (t, x)-space suitable for computation, choose x_max, the maximum value that the CDF ℙ{ℒ ≤ x_max} should be computed for. Due to Lemma 3.1, the relation ℙ{ℒ ≤ x} = F₁(t_max, x) holds for all x ≤ x_max, with $t_{max} : = \frac{x_{max}}{2}$ . Thus t_max is set as the maximal gridpoint for t. In addition to the maximum gridpoints, choose small step sizes Δt and Δx. The number of gridpoints in the t dimension is then given by $M : = ⌈ \frac{t_{max}}{Δ t} ⌉ + 1$ , and the set of gridpoints is given as

T : = {0, Δ t, 2 Δ t, \dots, (M - 1) Δ t, min (M Δ t, t_{max})} .

(B.1)

For each point T_i, define a grid in the x-dimension as

X_{i} : = {0, Δ x, \dots, min (U Δ x, n T_{i})} \cup {2 Δ t + {\bar{X}}_{i - 1}, 3 Δ t + {\bar{X}}_{i - 1}, \dots, n Δ t + {\bar{X}}_{i - 1}} \cup {2 T_{i}, 3 T_{i}, \dots, n T_{i}},

(B.2)

with $U = ⌈ \frac{n T_{i}}{Δ x} ⌉$ and X̄_{i−1} := max(X_{i−1}). Furthermore, set U_i := |X_i|. The same grid will be used for all k ∈ {1, . . . , n}. The points kΔt + X̄_{i−1} and kT_i are added for numerical stability reasons, to improve the accuracy of the interpolation we will perform in subsequent steps.

Now fix i ∈ {0, . . . , M} and k ∈ {1, . . . , n}, and assume that F_ℓ(T_{i−1}, X_{i−1,j}) has been computed for all ℓ ∈ {1, . . . , n} and X_{i−1,j} ∈ X_{i−1}. Furthermore, assume that F_ℓ(T_i, X_{i,j}) has been computed for all ℓ ∈ {k + 1, . . . , n} and X_{i,j} ∈ X_i. Under these assumptions, F_k(T_i, X_{i,j}) can be computed for all X_{i,j} ∈ X_i as follows. If X_{i,j} < v(k)T_i, then
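The grid construction in (B.1) and (B.2) can be sketched as follows. This is a minimal Python illustration under several assumptions: the indexing of the time grid starts at zero with the last point clipped to t_max, the actual spacing T_i − T_{i−1} is used in place of Δt for the shifted points (so that the last, possibly shorter, step stays within the boundary x = nt), and points that coincide up to floating-point noise are merged by rounding. The function names are hypothetical.

```python
import math


def time_grid(t_max, dt):
    """Gridpoints 0, dt, 2 dt, ... with the last point clipped to t_max."""
    steps = math.ceil(t_max / dt)
    return [min(i * dt, t_max) for i in range(steps + 1)]


def x_grid(i, T, dx, n):
    """Gridpoints in x for the time slice T[i], in increasing order."""
    top = n * T[i]  # largest attainable value of the path-integral at T[i]
    regular = [min(u * dx, top) for u in range(math.ceil(top / dx) + 1)]
    # Points on the rays x = k t.
    extra = [k * T[i] for k in range(2, n + 1)]
    if i > 0:
        step = T[i] - T[i - 1]
        # Images of the previous maximal gridpoint n T[i-1] under speed k.
        extra += [k * step + n * T[i - 1] for k in range(2, n + 1)]
    return sorted({round(p, 12) for p in regular + extra})


T = time_grid(1.0, 0.25)
X1 = x_grid(1, T, 0.1, 3)
```

The additional points lie exactly where characteristics emanating from the boundary arrive, so values there can be obtained without interpolating across a kink of F_k.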

F_{k} (T_{i}, X_{i, j}) = 0.

(B.3)

If X_i,j = nT_i, the maximal value of X_i, then

F_{k} (T_{i}, X_{i, j}) = ℙ {A (T_{i}) = k} .

(B.4)

The values on the right-hand side can be pre-computed for all k and T_i ∈ T by solving the ODE (2.3) numerically. In the general case, note that the characteristic of F_k that goes through the point (T_i, X_i,j)^⊤ and the boundary x = nt intersect at the point (T_x, nT_x)^⊤, with $T_{x} : = \frac{X_{i, j} - v (k) T_{i}}{n - v (k)}$ . Thus, define

{(T_{i, j}^{↓}, X_{i, j}^{↓})}^{⊤} : = \begin{cases} {(T_{i - 1}, X_{i, j} - v (k) Δ t)}^{⊤}, & \text{if } T_{x} < T_{i - 1}, \\ {(T_{x}, n T_{x})}^{⊤}, & \text{otherwise}, \end{cases}

the projection of (T_i, X_i,j)^⊤ back along the corresponding characteristic to the previous time-slice T_i₋₁, or onto the boundary x = nt, whichever has the larger t-component. This backward projection step is illustrated in Figure 11(a).

Figure 11: The back-tracing and propagation step of the upstream numerical scheme to compute F_k at all points of the grid.

Then, according to equation (3.10),

F_{k} (T_{i}, X_{i, j}) = e^{- (H_{k}^{(1)} (T_{i}) - H_{k}^{(1)} (T_{i, j}^{↓}))} (\int_{T_{i, j}^{↓}}^{T_{i}} g_{k}^{(1)} (α) e^{(H_{k}^{(1)} (α) - H_{k}^{(1)} (T_{i, j}^{↓}))} d α + F_{k} (T_{i, j}^{↓}, X_{i, j}^{↓}))

(B.5)

holds.
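The backward projection along a characteristic can be written compactly. The following Python sketch follows the case distinction above; the treatment of v(k) = n, where the characteristic is parallel to the boundary x = nt and the formula for T_x is undefined, is an assumption of this sketch, as are the function and argument names.

```python
def back_project(t_i, x_ij, t_prev, v_k, n):
    """Trace the characteristic of F_k through (t_i, x_ij) backwards in time.

    Returns (t_down, x_down): the point on the previous time slice t_prev, or
    the point on the boundary x = n t if the boundary is reached first.
    """
    on_previous_slice = (t_prev, x_ij - v_k * (t_i - t_prev))
    if v_k == n:
        # Parallel to the boundary: the characteristic never crosses x = n t.
        return on_previous_slice
    t_x = (x_ij - v_k * t_i) / (n - v_k)  # intersection time with x = n t
    if t_x < t_prev:
        return on_previous_slice
    return t_x, n * t_x


# Example with n = 3 and v(k) = 2, stepping back from t = 0.5 to t = 0.25.
t_down, x_down = back_project(0.5, 1.4, 0.25, 2, 3)
```

In the example, the characteristic hits the boundary at t = 0.4, which is later than the previous time slice, so the boundary point is returned.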

The right-hand side of the equation (B.5) can now be computed using two approximations. Note that the point ${(T_{i, j}^{↓}, X_{i, j}^{↓})}^{⊤}$ is in general not on the grid X_i, and thus $F_{k} (T_{i, j}^{↓}, X_{i, j}^{↓})$ has not been pre-computed. If T_x < T_i₋₁, the point is equal to (T_i₋₁, X_i,j − v(k)Δt)^⊤. In that case, identify the two grid points in X_i₋₁ that are closest to X_i,j − v(k)Δt to the right and left. Then let ${\bar{F}}_{k} (T_{i, j}^{↓}, X_{i, j}^{↓})$ be the linear interpolation between the values of F_k at those gridpoints. If T_x ≥ T_i₋₁, then the point is given by ${(T_{i, j}^{↓}, X_{i, j}^{↓})}^{⊤} = {(T_{x}, n T_{x})}^{⊤}$ and is located on the boundary. Thus

F_{k} (T_{i, j}^{↓}, X_{i, j}^{↓}) = ℙ {A (T_{x}) = k},

which is also not pre-computed. However, ℙ{A(T_i₋₁) = k} and ℙ{A(T_i) = k} have been pre-computed, and thus set ${\bar{F}}_{k} (T_{i, j}^{↓}, X_{i, j}^{↓})$ as the linear interpolation between these two values.
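Both interpolation steps in this first approximation are one-dimensional linear interpolations on a sorted grid, either in x on the previous time slice or in t between two boundary values. A minimal Python sketch is given below; the function name and the clamping behaviour outside the grid range are assumptions of this illustration.

```python
from bisect import bisect_left


def interpolate(grid, values, x):
    """Linearly interpolate tabulated values at x; grid is sorted ascending."""
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    hi = bisect_left(grid, x)  # first index with grid[hi] >= x
    lo = hi - 1
    weight = (x - grid[lo]) / (grid[hi] - grid[lo])
    return (1.0 - weight) * values[lo] + weight * values[hi]


# Interpolating in x on a (non-uniform) slice of gridpoints:
value = interpolate([0.0, 0.1, 0.25], [0.0, 0.2, 0.8], 0.15)
```

Using binary search keeps the lookup logarithmic in the number of gridpoints, which matters because the grids X_i are non-uniform.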

The second approximation is to compute the integral on the right-hand side of equation (B.5) using the trapezoidal rule. Thus, the values of F_k on the grid can be computed using

F_{k} (T_{i}, X_{i, j}) = \frac{Δ t}{2} (g_{k}^{(1)} (T_{i}) + e^{- (H_{k}^{(1)} (T_{i}) - H_{k}^{(1)} (T_{i, j}^{↓}))} g_{k}^{(1)} (T_{i, j}^{↓})) + e^{- (H_{k}^{(1)} (T_{i}) - H_{k}^{(1)} (T_{i, j}^{↓}))} {\bar{F}}_{k} (T_{i, j}^{↓}, X_{i, j}^{↓}) + o (Δ t^{3}) + o (Δ x^{2})

(B.6)
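The update (B.6), without the error terms, amounts to a single arithmetic expression once g, H and the interpolated value at the back-projected point are available. The Python sketch below takes these as plain numbers; the argument `step` stands for the elapsed time between the back-projected point and T_i, which equals Δt when the back-projected point lies on the previous time slice. The function name and signature are assumptions.

```python
import math


def trapezoid_update(g_now, g_down, H_now, H_down, F_down, step):
    """One propagation step along a characteristic, following the form of (B.6).

    g_now, H_now are evaluated at T_i; g_down, H_down, F_down at the
    back-projected point.
    """
    decay = math.exp(-(H_now - H_down))
    return 0.5 * step * (g_now + decay * g_down) + decay * F_down


# Check against the exact solution of F' = -q F + g with constant q and g.
q, g, F0, h = 1.0, 1.0, 0.2, 0.1
approx = trapezoid_update(g, g, q * h, 0.0, F0, h)
exact = math.exp(-q * h) * F0 + g * (1.0 - math.exp(-q * h)) / q
```

For constant coefficients the two values agree up to the local error of the trapezoidal rule, which is cubic in the step size.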

The terms g_k(·) depend on values of F_ℓ with k < ℓ ≤ n that might not have been pre-computed on the grid either. However, the same interpolation schemes as for F̄_k can be applied. At this stage it is important though to strictly set F_ℓ to 0 if it should be 0 according to equation (3.7). Lastly, the values for $H_{k}^{(1)} (\cdot)$ can either be obtained using analytic formulas in equation (3.11) for certain classes of coalescent-speed functions we will consider (e.g. piece-wise constant), or by computing the requisite integrals using the trapezoidal rule, which can be done incrementally. The integration step of our numerical scheme is illustrated in Figure 11(b).

These equations lead naturally to a dynamic programming algorithm to compute F_k on the specified grid. To this end, iterate through the values T_i ∈ T in increasing order. For each T_i, iterate through k ∈ {1, 2, . . . , n} in decreasing order, starting with k = n. Then, for each fixed T_i and k, F_k(T_i, X_i,j) can be computed for every X_i,j ∈ X_i using equations (B.3), (B.4), and (B.6). The order of iteration guarantees that all necessary quantities have been pre-computed. This dynamic program can be employed to compute F_k on the specified grid for all k. Due to Lemma 3.1, the relation

ℙ {ℒ \leq X_{M, j}} = F_{1} (T_{M}, X_{M, j})

holds, which yields the values of the CDF ℙ {ℒ ≤ x} on the specified grid X_M.

B.2 Upstream Numerical Scheme for Two-Locus Case

In the two-locus case, we can compute F_s efficiently on a chosen grid similar to the marginal case. To this end, we again choose x_max = y_max, set $t_{max} : = \frac{1}{n} x_{max}$ , and choose step sizes Δt and Δx = Δy. Then, define the grid T as in definition (B.1), M = |T| and for each T_i, define X_i as in definition (B.2). Furthermore, set Y_i := X_i and U_i := |Y_i|. Thus, we use the regular grid X_i × Y_i in the (x, y)-space.

Now fix T_i and s ∈ 𝒮̄^ρ, and assume that F_{s′}(T_{i−1}, X_{i−1,j}, Y_{i−1,ℓ}) has been computed for all s′ ∈ 𝒮̄^ρ, X_{i−1,j} ∈ X_{i−1}, and Y_{i−1,ℓ} ∈ Y_{i−1}. Furthermore, assume that F_{s′}(T_i, X_{i,j}, Y_{i,ℓ}) has been computed for all s′ with s ≺ s′, X_{i,j} ∈ X_i, and Y_{i,ℓ} ∈ Y_i. To compute F_s(T_i, X_{i,j}, Y_{i,ℓ}), first check using equation (3.13) whether the requisite point lies on the boundary, or in the zero region. The values on the boundary according to equation (3.13) are computed as time-dependent CDFs of marginal integrals along trajectories of the process Ā^ρ, and thus they can be computed using exactly the same procedure as detailed in Section B.1, replacing A by Ā^ρ. In the interior region, applying the trapezoidal rule to the solution of the first-order ODE, for all X_{i,j} ∈ X_i and Y_{i,ℓ} ∈ Y_i the value of F_s(T_i, X_{i,j}, Y_{i,ℓ}) can be computed using

F_{s} (T_{i}, X_{i, j}, Y_{i, ℓ}) = \frac{Δ t}{2} (g_{s}^{(2)} (T_{i}) + e^{- (H_{s}^{(2)} (T_{i}) - H_{s}^{(2)} (T_{i, j, ℓ}^{↓}))} g_{s}^{(2)} (T_{i, j, ℓ}^{↓})) + e^{- (H_{s}^{(2)} (T_{i}) - H_{s}^{(2)} (T_{i, j, ℓ}^{↓}))} {\bar{F}}_{s} (T_{i, j, ℓ}^{↓}, X_{i, j}^{↓}, Y_{i, ℓ}^{↓}) + o (Δ t^{3}) + o (Δ x^{2}) + o (Δ y^{2}) .

Here

{(T_{i, j, ℓ}^{↓}, X_{i, j}^{↓}, Y_{i, ℓ}^{↓})}^{⊤} : = \begin{cases} {(T_{i - 1}, X_{i, j} - v^{a} (s) Δ t, Y_{i, ℓ} - v^{b} (s) Δ t)}^{⊤}, & \text{if } max (T_{x}, T_{y}) < T_{i - 1}, \\ {(T_{x}, n T_{x}, Y_{i, ℓ} - v^{b} (s) \cdot (T_{i} - T_{x}))}^{⊤}, & \text{if } max (T_{i - 1}, T_{y}) \leq T_{x}, \\ {(T_{y}, X_{i, j} - v^{a} (s) \cdot (T_{i} - T_{y}), n T_{y})}^{⊤}, & \text{if } max (T_{x}, T_{i - 1}) \leq T_{y}, \end{cases}

with

T_{x} : = \frac{X_{i, j} - v^{a} (s) T_{i}}{n - v^{a} (s)}

being the t-coordinate of the point of intersection between the characteristic through the point (T_i, X_i,j, Y_i,ℓ)^⊤ and the boundary x = nt, and

T_{y} : = \frac{Y_{i, ℓ} - v^{b} (s) T_{i}}{n - v^{b} (s)}

likewise for the boundary y = nt.
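The three-way case distinction for the two-locus back-projection can be sketched in Python as follows. Here `va` and `vb` stand for v^a(s) and v^b(s). Treating a characteristic that runs parallel to a boundary (v^a(s) = n or v^b(s) = n) as never reaching that boundary is an assumption of this sketch, as are all names.

```python
import math


def back_project_2d(t_i, x, y, t_prev, va, vb, n):
    """Back-trace the characteristic of F_s through the point (t_i, x, y).

    Returns (t_down, x_down, y_down) on the previous time slice, on the
    boundary x = n t, or on the boundary y = n t, whichever is reached first
    when moving backwards in time.
    """
    # -inf marks a boundary that the characteristic never intersects.
    t_x = (x - va * t_i) / (n - va) if va != n else -math.inf
    t_y = (y - vb * t_i) / (n - vb) if vb != n else -math.inf
    if max(t_x, t_y) < t_prev:
        dt = t_i - t_prev
        return t_prev, x - va * dt, y - vb * dt
    if t_x >= t_y:
        return t_x, n * t_x, y - vb * (t_i - t_x)
    return t_y, x - va * (t_i - t_y), n * t_y


# Example with n = 3, v^a(s) = 2, v^b(s) = 1:
point = back_project_2d(0.5, 1.4, 0.6, 0.25, 2, 1, 3)
```

When T_x = T_y the second and third cases of the definition yield the same point, so the order in which they are tested does not matter.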

The points ${(T_{i, j, ℓ}^{↓}, X_{i, j}^{↓}, Y_{i, ℓ}^{↓})}^{⊤}$ will in general not be on the grid of pre-computed values, and thus the approximation ${\bar{F}}_{s} (T_{i, j, ℓ}^{↓}, X_{i, j}^{↓}, Y_{i, ℓ}^{↓})$ has to be used. In the case max(T_x, T_y) < T_i₋₁, this value can be obtained by identifying the four points in X_i₋₁ × Y_i₋₁ surrounding ( $X_{i, j}^{↓}, Y_{i, ℓ}^{↓}$ ), and interpolating the respective values of F_s(T_i₋₁, ·, ·) linearly. In the case max(T_i₋₁, T_y) ≤ T_x, the point ${(T_{i, j, ℓ}^{↓}, X_{i, j}^{↓}, Y_{i, ℓ}^{↓})}^{⊤}$ is on the boundary x = nt, and

F_{s} (T_{i, j, ℓ}^{↓}, X_{i, j}^{↓}, Y_{i, ℓ}^{↓}) = ℙ {{\bar{A}}^{ρ} (T_{x}) = s, L^{b} (T_{x}) \leq Y_{i, ℓ} - v^{b} (s) \cdot (T_{i} - T_{x})}

holds. The value of the time-dependent CDF on the right-hand side can be obtained as the linear interpolation between the values ℙ{Ā^ρ(T_{i−1}) = s, L^b(T_{i−1}) ≤ Y_{i,ℓ} − v^b(s)Δt} and ℙ{Ā^ρ(T_i) = s, L^b(T_i) ≤ Y_{i,ℓ}}, which we pre-compute (or approximations thereof) using the numerical scheme for the marginal case (see Appendix B.1) on the boundary. By symmetry, the case max(T_x, T_{i−1}) ≤ T_y can be handled in the same way. Computing $g_{s}^{(2)} (\cdot)$ will require some F_{s′} with s ≺ s′, which can be obtained by similar interpolation procedures, or setting it to zero in the appropriate regions. The values of $H_{s}^{(2)} (\cdot)$ can be computed according to equation (3.16) analytically or numerically, as before.
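The interpolation over the four surrounding gridpoints in the interior case is a standard bilinear interpolation on a rectangular, possibly non-uniform, grid. A minimal Python sketch is shown below; the table layout `table[j][l]`, the function names, and the linear extrapolation for points outside the grid range are assumptions of this illustration.

```python
from bisect import bisect_right


def bracket(grid, p):
    """Index lo such that grid[lo] <= p <= grid[lo + 1], clamped to the grid."""
    lo = bisect_right(grid, p) - 1
    return max(0, min(lo, len(grid) - 2))


def bilinear(xs, ys, table, x, y):
    """Interpolate table[j][l], tabulated at (xs[j], ys[l]), at the point (x, y).

    xs and ys must be sorted ascending and contain at least two points each.
    """
    j, l = bracket(xs, x), bracket(ys, y)
    wx = (x - xs[j]) / (xs[j + 1] - xs[j])
    wy = (y - ys[l]) / (ys[l + 1] - ys[l])
    return ((1 - wx) * (1 - wy) * table[j][l]
            + wx * (1 - wy) * table[j + 1][l]
            + (1 - wx) * wy * table[j][l + 1]
            + wx * wy * table[j + 1][l + 1])


# A function that is linear in x and y is reproduced exactly.
xs, ys = [0.0, 1.0, 3.0], [0.0, 2.0]
table = [[xv + 2.0 * yv for yv in ys] for xv in xs]
value = bilinear(xs, ys, table, 2.0, 0.5)
```

In the numerical scheme, `table` would hold the previously computed values F_s(T_{i−1}, ·, ·) on X_{i−1} × Y_{i−1}.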

Again, we can implement these formulas in an efficient dynamic programming algorithm to compute the values of F_s(t, x, y) on the specified grid for all s ∈ 𝒮̄^ρ, and thus compute

ℙ {ℒ^{a} \leq X_{M, j}, ℒ^{b} \leq Y_{M, ℓ}} = F_{(1, 0, 0, 0)} (t_{max}, X_{M, j}, Y_{M, ℓ}) + F_{(1, 0, 0, 1)} (t_{max}, X_{M, j}, Y_{M, ℓ}),

the joint CDF of the total tree length at two linked loci evaluated on the specified grid.

References

Bhaskar A, Wang YR, Song YS. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 2015;25(2):268–279. doi: 10.1101/gr.178756.114.
Dafermos CM. Hyperbolic conservation laws in continuum physics. Springer; New York: 2010.
Davis MHA. Markov Models & Optimization (Chapman & Hall/CRC Monographs on Statistics & Applied Probability). Chapman and Hall/CRC; 1993.
Doob J. Stochastic processes. Wiley; New York: 1953.
Dutheil JY, Ganapathy G, Hobolth A, Mailund T, Uyenoyama MK, Schierup MH. Ancestral population genomics: the coalescent hidden Markov model approach. Genetics. 2009;183(1):259–274. doi: 10.1534/genetics.109.103010.
Eriksson A, Mahjani B, Mehlig B. Sequential Markov coalescent algorithms for population models with demographic structure. Theor Popul Biol. 2009;76(2):84–91. doi: 10.1016/j.tpb.2009.05.002.
Evans L. Partial Differential Equations. Graduate Studies in Mathematics. American Mathematical Society; 2010.
Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9(10):e1003905. doi: 10.1371/journal.pgen.1003905.
Federer H. Geometric measure theory. Grundlehren der mathematischen Wissenschaften. Springer; 1969.
Ferretti L, Disanto F, Wiehe T. The effect of single recombination events on coalescent tree height and shape. PLoS ONE. 2013;8(4):1–15. doi: 10.1371/journal.pone.0060123.
Griffiths RC. The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor Popul Biol. 2003;64(2):241–251. doi: 10.1016/s0040-5809(03)00075-3.
Griffiths RC, Marjoram P. Ancestral inference from samples of DNA sequences with recombination. J Comp Biol. 1996;3(4):479–502. doi: 10.1089/cmb.1996.3.479.
Griffiths RC, Tavaré S. Simulating probability distributions in the coalescent. Theor Popul Biol. 1994;46(2):131–159.
Griffiths RC, Tavaré S. The age of a mutation in a general coalescent tree. Comm Statist Stochastic Models. 1998;14(1–2):273–295.
Griffiths RC, Marjoram P. An ancestral recombination graph. In: Donnelly P, Tavaré S, editors. Progress in Population Genetics and Human Evolution. Vol. 87. Springer; Berlin: 1997. pp. 257–270.
Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:e1000695. doi: 10.1371/journal.pgen.1000695.
Hobolth A, Jensen JL. Markovian approximation to the finite loci coalescent with recombination along multiple sequences. Theor Popul Biol. 2014;98:48–58. doi: 10.1016/j.tpb.2014.01.002.
Hudson RR. Gene genealogies and the coalescent process. Oxf Surv Evol Biol. 1990;7:1–44.
Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337.
Kamm JA, Terhorst J, Song YS. Efficient computation of the joint sample frequency spectra for multiple populations. J Comp Graph Stat. 2017;26(1):182–194. doi: 10.1080/10618600.2016.1159212.
Karlin S. A second course in stochastic processes. Academic Press; New York: 1981.
Keinan A, Clark AG. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science. 2012;336(6082):740–743. doi: 10.1126/science.1217283.
Kingman JF. The coalescent. Stochastic Process Appl. 1982;13(3):235–248.
Koralov L, Sinay Y. Theory of probability and random processes. Springer; Berlin: 2007.
Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231.
Marjoram P, Wall JD. Fast “coalescent” simulation. BMC Genet. 2006;7:16. doi: 10.1186/1471-2156-7-16.
McVean GA, Cardin NJ. Approximating the coalescent with recombination. Philos Trans R Soc Lond B Biol Sci. 2005;360:1387–1393. doi: 10.1098/rstb.2005.1673.
Paul JS, Steinrücken M, Song YS. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics. 2011;187(4):1115–1128. doi: 10.1534/genetics.110.125534.
Pfaffelhuber P, Wakolbinger A, Weisshaupt H. The tree length of an evolving coalescent. Probab Theory Related Fields. 2011;151(3):529–557.
Polanski A, Bobrowski A, Kimmel M. A note on distributions of times to coalescence, under time-dependent population size. Theor Popul Biol. 2003;63(1):33–40. doi: 10.1016/s0040-5809(02)00010-2.
Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 2014;10(5):1–27. doi: 10.1371/journal.pgen.1004342.
Renardy M, Rogers RC. An Introduction to Partial Differential Equations. 2nd ed. Springer; 2004.
Rudin W. Principles of Mathematical Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill; 1976.
Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–925. doi: 10.1038/ng.3015.
Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics. 2013;194(3):647–662. doi: 10.1534/genetics.112.149096.
Simonsen KL, Churchill GA. A Markov chain model of coalescence with recombination. Theor Popul Biol. 1997;52(1):43–59. doi: 10.1006/tpbi.1997.1307.
Steinrücken M, Kamm JA, Song YS. Inference of complex population histories using whole-genome sequences from multiple populations. 2015. doi: 10.1073/pnas.1905060116. bioRxiv preprint at: http://dx.doi.org/10.1101/026591.
Stroock DW. An Introduction to Markov Processes (Graduate Texts in Mathematics). Springer; 2008.
Tavaré S, Zeitouni O. Lectures on Probability Theory and Statistics: Ecole d’Eté de Probabilités de Saint-Flour XXXI - 2001 (Lecture Notes in Mathematics). Springer; 2004.
Terhorst J, Kamm JA, Song YS. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet. 2017;49(2):303–309. doi: 10.1038/ng.3748.
Wakeley J. Coalescent Theory: An Introduction. W. H. Freeman; 2008.
Wiuf C, Hein J. Recombination as a point process along sequences. Theor Popul Biol. 1999;55:248–259. doi: 10.1006/tpbi.1998.1403.
Zivković D, Wiehe T. Second-order moments of segregating sites under variable population size. Genetics. 2008;180(1):341–357. doi: 10.1534/genetics.108.091231.
Zivković D, Steinrücken M, Song YS, Stephan W. Transition densities and sample frequency spectra of diffusion processes with selection and variable population size. Genetics. 2015;200(2):601. doi: 10.1534/genetics.115.175265.
