Author manuscript; available in PMC: 2016 Jul 1.
Published in final edited form as: Electron J Probab. 2016 Jun 4;20:58. doi: 10.1214/ejp.v20-3564

TRACTABLE DIFFUSION AND COALESCENT PROCESSES FOR WEAKLY CORRELATED LOCI

Paul A Jenkins 1, Paul Fearnhead 2, Yun S Song 3
PMCID: PMC4929886  NIHMSID: NIHMS777154  PMID: 27375350

Abstract

Widely used models in genetics include the Wright-Fisher diffusion and its moment dual, Kingman’s coalescent. Each has a multilocus extension but under neither extension is the sampling distribution available in closed-form, and their computation is extremely difficult. In this paper we derive two new multilocus population genetic models, one a diffusion and the other a coalescent process, which are much simpler than the standard models, but which capture their key properties for large recombination rates. The diffusion model is based on a central limit theorem for density dependent population processes, and we show that the sampling distribution is a linear combination of moments of Gaussian distributions and hence available in closed-form. The coalescent process is based on a probabilistic coupling of the ancestral recombination graph to a simpler genealogical process which exposes the leading dynamics of the former. We further demonstrate that when we consider the sampling distribution as an asymptotic expansion in inverse powers of the recombination parameter, the sampling distributions of the new models agree with the standard ones up to the first two orders.

Keywords and phrases: population genetics, recombination, sampling distribution, diffusion, coupling

1. Introduction

The basis of many important problems in genetics is to find an expression for a sampling distribution or likelihood. Valuable tools in this endeavour are stochastic models of allele frequency evolution forwards in time, and their dual genealogical processes backwards in time. In particular, the numerous variants of the Wright-Fisher diffusion and Kingman’s coalescent, respectively, have focused attention on the scaling limit as the population size goes to infinity, leading from a (complicated) finite-population model of reproduction to a (simpler) infinite-population limit. At a single genetic locus, the problem of computing sampling distributions in these models is well studied, with even some closed-form formulas available (Wright, 1949; Ewens, 1972; Jenkins and Song, 2011; Bhaskar, Kamm and Song, 2012). However, with ongoing technological developments in high-throughput DNA sequencing, large genomic datasets are becoming available and it is necessary to consider multilocus models. Inter-locus recombination quickly makes such models intractable; for neither the Wright-Fisher diffusion with recombination nor the coalescent with recombination—or ancestral recombination graph (ARG)—is it possible to obtain a closed-form expression for the sampling distribution. This has remained a notoriously difficult problem, and to make progress using these models it has usually been necessary to resort to computationally-intensive techniques such as importance sampling (Griffiths and Marjoram, 1996; Fearnhead and Donnelly, 2001; Griffiths, Jenkins and Song, 2008; Jenkins and Griffiths, 2011), Markov chain Monte Carlo (Kuhner, Yamato and Felsenstein, 2000; Nielsen, 2000; Wang and Rannala, 2008; Rasmussen et al., 2014), or other numerical approximations (Boitard and Loisel, 2007; Miura, 2011). 
Denoting the population-scaled recombination parameter by ρ, only in the special cases of ρ = 0 or ρ = ∞ is it possible to make progress analytically, since then we are back to a single locus, or to many independent single loci, respectively.

In another direction, we have considered an analytic approach to the problem, as follows. Denote the observed sample configuration at two loci by n and its sampling probability by q(n; ρ) (to be defined precisely below). Consider the asymptotic expansion in inverse powers of ρ:

\[
q(\mathbf{n};\rho) = q_0(\mathbf{n}) + \frac{q_1(\mathbf{n})}{\rho} + \frac{q_2(\mathbf{n})}{\rho^2} + \cdots, \tag{1}
\]

where for convenience we suppress the dependence of these terms on other parameters of the model. Under an infinite-alleles type of mutation, we obtained closed-form formulas for q0(n) and q1(n) in terms of the marginal one-locus sampling probabilities, and a decomposition of q2(n) into a closed-form term plus a second part which is evaluated easily by dynamic programming (Jenkins and Song, 2010). (The result is stated more precisely in Theorem 2.1 below.) This provides the first closed-form extension of Ewens’ Sampling Formula (Ewens, 1972) to handle finite amounts of recombination. It has been extended subsequently to include more general models of mutation (Jenkins and Song, 2009), natural selection (Jenkins and Song, 2012), higher-order terms (Jenkins and Song, 2012), and more than two loci (Bhaskar and Song, 2012), and has had practical implications for genomic inference (Chan, Jenkins and Song, 2012). One particularly appealing conclusion of these works is that both q0(n) and q1(n) are universal; that is, their functional form is invariant to our assumptions about mutation and selection acting marginally at each locus. The effects of these marginal processes are entirely subsumed into the relevant one-locus sampling distributions.

The simple and universal forms for q0(n) and q1(n) provide strong circumstantial evidence that there exists an underlying stochastic process which is much simpler than the standard models for finite amounts of recombination. In particular, we previously conjectured (Jenkins and Song, 2010) the existence of a process which is both much simpler than the standard models based on the Wright-Fisher diffusion or on the ARG, and is in agreement with the sampling distribution (1) up to $O(\rho^{-2})$. The goal of this paper is to describe such a process. In fact, using different arguments we describe two such processes, obtaining both a limiting diffusion and a coalescent process with these properties. In the diffusion approximation, the key idea is to suppose that the probability r of a recombination per individual per generation scales as $N^{-\beta}$ as the population size N → ∞, for 0 < β < 1, rather than the usual choice of β = 1. Interest in asymptotically large recombination rates is reasonable because of extensive recombination rate heterogeneity along chromosomes in e.g. humans, strong recombination rates in some species such as Drosophila melanogaster (Chan, Jenkins and Song, 2012), and because of the need to understand the long-range dependencies between well-separated loci. Our diffusion in this scaling is intimately related to the central limit theorem for density dependent population processes (see Ethier and Kurtz, 1986, Theorem 11.2.3), which has been analyzed in genetics (for models of strong mutation rather than strong recombination) by Feller (1951) and Norman (1975a). A closely related scaling in the context of Ξ-coalescent processes was also recently explored by Birkner, Blath and Eldon (2013) (in that paper β = 1 but with timescale $N^2$). The coalescent approach, meanwhile, uses a coupling argument. Intuitively, we would like to couple the ARG to the limiting case of two independent coalescent trees (ρ = ∞).
To account for contributions to the sampling distribution of $O(\rho^{-1})$, we must quantify the “leading order reasons” for such a coupling to fail. When ρ is large but finite, lineages in the ARG ancestral to both loci undergo recombination backwards in time very rapidly, until the first time U that no such lineage survives. In this paper we show that, roughly speaking, in order to recover the sampling distribution up to $O(\rho^{-1})$ we need consider only the following type of exceptional event: a coalescence occurs more recently than time U in the ARG, and the coalescence is between two lineages each of which is ancestral to both of the two loci. This observation enables us to define a simple coalescent process which allows for at most one of these events but is otherwise very similar to the easy limiting process corresponding to ρ = ∞.

The paper is organized as follows. In Section 2 we specify our notation and summarize previous research. Novel diffusion and coalescent processes are introduced in Sections 3 and 4, respectively, and we conclude in Section 5 with a brief discussion.

2. Notation and previous results

For M ∈ ℕ = {0, 1, 2, …}, let [M] ≔ {1, 2, …, M}. The complement of a set J is written $J^{\mathsf{C}}$. Denote the Kronecker delta by $\delta_{ij}$, which takes the value 1 if i = j and 0 otherwise. Let $e_i$ denote a unit vector whose jth entry is $\delta_{ij}$, and let $e_{ij}$ denote a matrix with (k, l)th entry equal to $\delta_{ik}\delta_{jl}$. For a vector $\upsilon \in \mathbb{R}^d$ we denote by $|\upsilon|$ the usual Euclidean norm. Denote the k × l zero matrix by $0_{k\times l}$ and the k × k identity matrix by $I_k$. We will replace a subscript with a “·” to denote summation over that index. A prime symbol ′ will denote vector or matrix transpose. For $z \in \mathbb{R}_{\ge 0}$ and n ∈ ℕ, $(z)_n := z(z+1)\cdots(z+n-1)$ denotes the nth ascending factorial of z. Finally, for a matrix R of processes we let $[R]_t = ([R_i, R_j]_t)_{i,j}$ denote the matrix of corresponding covariation processes.

Consider the usual diffusion limit of an exchangeable model of random mating with constant population size of N haplotypes. Our interest will be in a sample from this population at two loci, which we call A and B, with the probability of mutation per haplotype per generation denoted by $u_A$ and $u_B$ respectively. In the diffusion limit we let N → ∞ and $u_A, u_B \to 0$ while the population-scaled parameters $\theta_A = 2Nu_A$ and $\theta_B = 2Nu_B$ remain fixed. In this paper we will suppose a finite-alleles model of mutation such that a mutation to an allele i in type space $E_A = [K]$, K ∈ ℕ, takes it to allele k ∈ [K] with probability $P^A_{ik}$, with $E_B = [L]$ and $P^B_{jl}$, j, l ∈ [L], defined analogously. (As we discover below, the mutation model is not important and we could pose something more complicated with little extra effort.) The probability of a recombination between the two loci per haplotype per generation is denoted by r, and we assume that $\rho_\beta = 2N^\beta r$ is fixed as N → ∞, for some fixed β ∈ (0, 1]. Previous work has focused on the case β = 1 with time measured in units of N generations. For consistency with the usual notation we write $\rho = \rho_1$.

A sample from this model comprises a haplotypes observed only at locus A, b haplotypes observed only at locus B, and c haplotypes observed at both loci. The sample configuration is denoted by $\mathbf{n} = (\mathbf{a}, \mathbf{b}, \mathbf{c})$, where $\mathbf{a} = (a_i)_{i\in[K]}$ and $a_i$ is the number of haplotypes observed to exhibit allele i at locus A; $\mathbf{b} = (b_j)_{j\in[L]}$, where $b_j$ is the number of haplotypes observed to exhibit allele j at locus B; and $\mathbf{c} = (c_{ij})_{i\in[K],j\in[L]}$, where $c_{ij}$ is the number of haplotypes with allele i at locus A and allele j at locus B. Thus,

\[
a = \sum_{i=1}^{K} a_i, \qquad b = \sum_{j=1}^{L} b_j, \qquad c = \sum_{i=1}^{K}\sum_{j=1}^{L} c_{ij},
\]

and we let n = a + b + c. We further write $\mathbf{c}_A = (c_{i\cdot})_{i\in[K]}$ and $\mathbf{c}_B = (c_{\cdot j})_{j\in[L]}$ to denote the marginal sample configurations of $\mathbf{c}$ restricted to locus A and locus B respectively. Finally, we use $q(\mathbf{a}, \mathbf{b}, \mathbf{c})$ to denote the probability that when we sample n haplotypes in some order from the population at stationarity we obtain the unordered configuration $(\mathbf{a}, \mathbf{b}, \mathbf{c})$; by sampling exchangeability this is indeed a function only of the unordered configuration $(\mathbf{a}, \mathbf{b}, \mathbf{c})$. For convenience we suppress the dependence of this quantity on the model parameters and on β. The main result motivating this work is an expansion for $q(\mathbf{a}, \mathbf{b}, \mathbf{c})$ for the case of β = 1, and later we will show that this expansion holds for all β ∈ (0, 1].

Theorem 2.1 (See Jenkins and Song (2009))

Consider the following asymptotic expansion for q(a, b, c) under the diffusion limit with β = 1:

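\[
q(\mathbf{a},\mathbf{b},\mathbf{c}) = q_0(\mathbf{a},\mathbf{b},\mathbf{c}) + \frac{q_1(\mathbf{a},\mathbf{b},\mathbf{c})}{\rho} + O\!\left(\frac{1}{\rho^2}\right), \quad \text{as } \rho \to \infty,
\]

with $q_0, q_1, \ldots$ independent of ρ. Then the zeroth order term is given by

\[
q_0(\mathbf{a},\mathbf{b},\mathbf{c}) = q_A(\mathbf{a}+\mathbf{c}_A)\, q_B(\mathbf{b}+\mathbf{c}_B), \tag{2}
\]

and the first order term is given by

\[
\begin{aligned}
q_1(\mathbf{a},\mathbf{b},\mathbf{c}) ={}& \binom{c}{2} q_A(\mathbf{a}+\mathbf{c}_A)\, q_B(\mathbf{b}+\mathbf{c}_B)
 - q_B(\mathbf{b}+\mathbf{c}_B) \sum_{i=1}^{K} \binom{c_{i\cdot}}{2} q_A(\mathbf{a}+\mathbf{c}_A-e_i) \\
& - q_A(\mathbf{a}+\mathbf{c}_A) \sum_{j=1}^{L} \binom{c_{\cdot j}}{2} q_B(\mathbf{b}+\mathbf{c}_B-e_j)
 + \sum_{i=1}^{K}\sum_{j=1}^{L} \binom{c_{ij}}{2} q_A(\mathbf{a}+\mathbf{c}_A-e_i)\, q_B(\mathbf{b}+\mathbf{c}_B-e_j),
\end{aligned} \tag{3}
\]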

where qA, qB are the marginal sampling distributions at locus A and locus B, respectively.
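The two leading terms of Theorem 2.1 are straightforward to evaluate once the one-locus sampling probabilities are available. The following is a minimal sketch, not the authors' implementation; the function name `q0_q1` and the convention that `qA` and `qB` are passed in as callables taking a list of allele counts are assumptions made for illustration.

```python
from math import comb

def q0_q1(a, b, c, qA, qB):
    """Return (q0, q1) of equations (2) and (3).

    a, b   : allele counts observed only at locus A / only at locus B
    c      : K x L nested list of two-locus haplotype counts
    qA, qB : hypothetical callables giving one-locus sampling probabilities
    """
    K, L = len(a), len(b)
    cA = [sum(c[i]) for i in range(K)]                        # c_{i.}
    cB = [sum(c[i][j] for i in range(K)) for j in range(L)]   # c_{.j}
    ctot = sum(cA)
    nA = [a[i] + cA[i] for i in range(K)]                     # a + c_A
    nB = [b[j] + cB[j] for j in range(L)]                     # b + c_B

    def drop(n, i):
        """Configuration n - e_i."""
        m = list(n)
        m[i] -= 1
        return m

    # Evaluate q(n - e_i) only where a binomial weight can be nonzero, so the
    # callables are never handed a negative count.
    qA_minus = [qA(drop(nA, i)) if cA[i] >= 2 else 0.0 for i in range(K)]
    qB_minus = [qB(drop(nB, j)) if cB[j] >= 2 else 0.0 for j in range(L)]

    qa, qb = qA(nA), qB(nB)
    q0 = qa * qb
    q1 = (comb(ctot, 2) * q0
          - qb * sum(comb(cA[i], 2) * qA_minus[i] for i in range(K))
          - qa * sum(comb(cB[j], 2) * qB_minus[j] for j in range(L))
          + sum(comb(c[i][j], 2) * qA_minus[i] * qB_minus[j]
                for i in range(K) for j in range(L)))
    return q0, q1
```

Because the sampling probabilities of all ordered samples of a given size sum to one for every ρ, the values of q1 summed over such samples should vanish, which gives a convenient sanity check.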

Remark 2.1

Under a neutral, finite-alleles model of mutation, if mutation is parent independent (that is, $P^A_{ki} = P^A_i$, i, k ∈ [K], and $P^B_{lj} = P^B_j$, j, l ∈ [L]), then $q_A(\mathbf{a})$ and $q_B(\mathbf{b})$ are known in closed form:

\[
q_A(\mathbf{a}) = \frac{1}{(\theta_A)_a}\prod_{i=1}^{K}(\theta_A P^A_i)_{a_i}, \quad\text{and}\quad q_B(\mathbf{b}) = \frac{1}{(\theta_B)_b}\prod_{j=1}^{L}(\theta_B P^B_j)_{b_j}.
\]

These expressions follow, for example, from the moments of the Wright-Fisher diffusion with parent-independent mutation, whose stationary distribution at locus A is Dirichlet$(\theta_A P^A_1, \ldots, \theta_A P^A_K)$ (Wright, 1949), and similarly at locus B.
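As an illustration of Remark 2.1, the closed-form one-locus probability under parent-independent mutation can be coded directly from the ascending factorials. This is a small sketch; the helper names `rising` and `q_pim` are chosen here for illustration and do not come from the paper.

```python
from math import prod

def rising(z, n):
    """Ascending factorial (z)_n = z (z + 1) ... (z + n - 1), with (z)_0 = 1."""
    out = 1.0
    for k in range(n):
        out *= z + k
    return out

def q_pim(counts, theta, P):
    """Probability of an ordered one-locus sample with allele counts `counts`
    under parent-independent mutation with rate theta and allele law P."""
    n = sum(counts)
    return prod(rising(theta * p, a) for p, a in zip(P, counts)) / rising(theta, n)
```

For instance, with two equally likely alleles and theta = 1, the three unordered configurations of a sample of size two have ordered-sample probabilities 0.375, 0.125 and 0.375, and weighting the middle one by its two orderings gives a total of one.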

Remark 2.2

The zeroth-order decomposition is well known (e.g. Ethier, 1979) and also intuitive, since the two loci become independent as ρ → ∞.

Theorem 2.1 can be obtained by diffusion (Jenkins and Song, 2012) or by coalescent (Jenkins and Song, 2009, 2010) arguments. In this paper we address both approaches in further detail.

3. Diffusion model

In this section we extend the above results by obtaining a full description of a simple diffusion process such that its sampling distribution is known exactly and has a Taylor expansion about ρ = ∞ consistent with (2) and (3). For simplicity we will obtain our diffusion as the limit of an appropriately rescaled Moran model, although we expect our results to hold for a more general class of discrete models of reproduction within the domain of convergence of the Wright-Fisher diffusion.

3.1. Neutral Moran model

A population of N haploid, monoecious individuals evolves as a multitype birth-and-death process in continuous time. Each individual carries a haplotype comprising a pair of alleles (i, j) ∈ [K] × [L], one at locus A and one at locus B. Let $Z_{ij}(\tau) \in \{0, 1, \ldots, N\}$ denote the number of (i, j) haplotypes in the population at time $\tau \in \mathbb{R}_{\ge 0}$, and $Z(\tau) = (Z_{ij}(\tau))_{i\in[K],j\in[L]}$. The population evolves as follows. At rate $N^2/2$ a reproduction event occurs, in which an individual is chosen uniformly at random from the population to die. It is replaced by a copy of another individual also chosen uniformly at random (the same individual could be chosen; whether sampling is with or without replacement does not affect the diffusion limit). Independently, each locus of each haplotype undergoes mutation: any locus A mutates at rate $\theta_A/2$ and its allele is updated according to the transition matrix $P^A = (P^A_{ik})_{i,k\in[K]}$; similarly any locus B mutates at rate $\theta_B/2$ and its allele is updated according to $P^B = (P^B_{jl})_{j,l\in[L]}$. Finally, each haplotype independently undergoes recombination at rate ρ/2: at such an event, it is replaced by a haplotype formed by sampling two alleles (one for each locus) independently from the population. Putting all this together, the rate at which a haplotype (i, j) dies and is replaced by a haplotype (k, l) when Z(τ) = z is given by

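\[
\lambda^{(N)}_{ij,kl}(z) = \frac{z_{ij}}{N}\left[\frac{N^2}{2}\frac{z_{kl}}{N} + N\left(\frac{\theta_A}{2}P^A_{ik}\delta_{jl} + \frac{\theta_B}{2}P^B_{jl}\delta_{ik} + \frac{\rho}{2}\frac{z_{k\cdot}}{N}\frac{z_{\cdot l}}{N}\right)\right], \qquad (i,j),(k,l)\in[K]\times[L].
\]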

Notice that, as is standard (e.g. Baake and Herms, 2008), we decouple the mutation and recombination mechanisms from reproduction (and from each other). This simplifies the analysis without unduly affecting the diffusion limit.

We will change variables by introducing the collection

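\[
M^{(N)}(\tau) := \{X^{(N)}(\tau),\, Y^{(N)}(\tau),\, D^{(N)}(\tau)\},
\]

where

\[
\begin{aligned}
X^{(N)}(\tau) &= (X^{(N)}_i(\tau))_{i\in[K]} = \left(\frac{Z_{i\cdot}(\tau)}{N} : i\in[K]\right), \\
Y^{(N)}(\tau) &= (Y^{(N)}_j(\tau))_{j\in[L]} = \left(\frac{Z_{\cdot j}(\tau)}{N} : j\in[L]\right), \\
D^{(N)}(\tau) &= (D^{(N)}_{ij}(\tau))_{i\in[K],j\in[L]} = \left(\frac{Z_{ij}(\tau)}{N} - \frac{Z_{i\cdot}(\tau)}{N}\frac{Z_{\cdot j}(\tau)}{N} : i\in[K],\, j\in[L]\right).
\end{aligned}
\]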

That is, we describe the state of the Moran model at time τ by the marginal allele frequencies and the coefficients of linkage disequilibrium (see, e.g. Ewens, 2004, p69, p227). We will write this succinctly by arranging the variables in a linear order:

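\[
(X^{(N)}_1, \ldots, X^{(N)}_K,\; Y^{(N)}_1, \ldots, Y^{(N)}_L,\; D^{(N)}_{11}, \ldots, D^{(N)}_{KL}),
\]

and thinking of $M^{(N)}(\tau)$ as a vector of length Λ ≔ K + L + KL. The process $(M^{(N)}(\tau) : \tau = 0, 1, \ldots)$ is then Markov on a state space we denote by $\Delta^{(N)}_{KL-1}$, which is a rational subset (those points consistent with $\sum_{i=1}^{K}\sum_{j=1}^{L} Z_{ij} = N$) of the (KL − 1)-dimensional shifted simplex

\[
\Delta_{KL-1} = \left\{(x,y,d)\in[0,1]^K\times[0,1]^L\times[-1,1]^{KL} : \sum_{i=1}^{K}x_i = 1 = \sum_{j=1}^{L}y_j,\ \sum_{i=1}^{K}d_{ij} = 0 = \sum_{j=1}^{L}d_{ij}\right\}.
\]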
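The change of variables from haplotype counts to marginal frequencies and linkage disequilibria is easy to make concrete. The sketch below assumes NumPy and a K × L array of counts; the function name `haplotype_counts_to_xyd` is hypothetical.

```python
import numpy as np

def haplotype_counts_to_xyd(Z):
    """Map a K x L matrix of haplotype counts to (x, y, d): marginal allele
    frequencies at each locus and the matrix of LD coefficients."""
    Z = np.asarray(Z, dtype=float)
    freq = Z / Z.sum()              # z_ij / N
    x = freq.sum(axis=1)            # z_i. / N
    y = freq.sum(axis=0)            # z_.j / N
    d = freq - np.outer(x, y)       # z_ij / N - x_i y_j
    return x, y, d
```

The constraints defining the shifted simplex hold automatically: x and y each sum to one, and every row and column of d sums to zero.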

To find the diffusion limit we first need the conditional means and covariances of the increments

\[
\Delta M^{(N)}(\tau) := M^{(N)}(\tau+d\tau) - M^{(N)}(\tau).
\]

From these, and under the assumption that $\theta_A$, $\theta_B$, and ρ are fixed as N → ∞, it is possible to show that the model converges to a (Wright-Fisher) diffusion limit (Ethier and Kurtz, 1986, Example 10.3.9, p433). Recall however that our interest is when $\rho_\beta$, rather than ρ, is fixed, so below we write these increments in terms of $\rho_\beta$ using $\rho = \rho_\beta N^{1-\beta}$.

In the following, for convenience we drop the dependence on τ.

Proposition 3.1

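In the neutral two-locus Moran model with mutation and recombination, the conditional means and covariances of increments of $M^{(N)}$ are given by

\begin{align}
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{E}[\Delta X^{(N)}_i \mid M^{(N)}] &= \frac{\theta_A}{2}\sum_{k=1}^{K}(P^A_{ki}-\delta_{ik})X^{(N)}_k, \tag{4}\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{E}[\Delta Y^{(N)}_j \mid M^{(N)}] &= \frac{\theta_B}{2}\sum_{l=1}^{L}(P^B_{lj}-\delta_{jl})Y^{(N)}_l, \tag{5}\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{E}[\Delta D^{(N)}_{ij} \mid M^{(N)}] &= -\frac{\rho_\beta}{2N^{\beta-1}}D^{(N)}_{ij} - D^{(N)}_{ij} + \frac{\theta_A}{2}\sum_{k=1}^{K}(P^A_{ki}-\delta_{ik})D^{(N)}_{kj} + \frac{\theta_B}{2}\sum_{l=1}^{L}(P^B_{lj}-\delta_{jl})D^{(N)}_{il} + O\!\left(\frac{1}{N^\beta}\right), \tag{6}\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta X^{(N)}_i, \Delta X^{(N)}_k \mid M^{(N)}] &= X^{(N)}_i(\delta_{ik}-X^{(N)}_k) + O\!\left(\frac{1}{N^\beta}\right), \notag\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta Y^{(N)}_j, \Delta Y^{(N)}_l \mid M^{(N)}] &= Y^{(N)}_j(\delta_{jl}-Y^{(N)}_l) + O\!\left(\frac{1}{N^\beta}\right), \notag\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta X^{(N)}_i, \Delta Y^{(N)}_j \mid M^{(N)}] &= D^{(N)}_{ij} + O\!\left(\frac{1}{N^\beta}\right), \notag\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta X^{(N)}_i, \Delta D^{(N)}_{kl} \mid M^{(N)}] &= D^{(N)}_{kl}(\delta_{ik}-X^{(N)}_i) - X^{(N)}_k D^{(N)}_{il} + O\!\left(\frac{1}{N^\beta}\right), \notag\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta Y^{(N)}_j, \Delta D^{(N)}_{kl} \mid M^{(N)}] &= D^{(N)}_{kl}(\delta_{jl}-Y^{(N)}_j) - Y^{(N)}_l D^{(N)}_{kj} + O\!\left(\frac{1}{N^\beta}\right), \notag\\
\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta D^{(N)}_{ij}, \Delta D^{(N)}_{kl} \mid M^{(N)}] &= X^{(N)}_i Y^{(N)}_j(\delta_{ik}-X^{(N)}_k)(\delta_{jl}-Y^{(N)}_l) + D^{(N)}_{kj}X^{(N)}_iY^{(N)}_l + D^{(N)}_{il}X^{(N)}_kY^{(N)}_j \notag\\
&\quad + D^{(N)}_{ij}(X^{(N)}_kY^{(N)}_l - \delta_{ik}Y^{(N)}_l - \delta_{jl}X^{(N)}_k) + D^{(N)}_{kl}(X^{(N)}_iY^{(N)}_j - \delta_{ik}Y^{(N)}_j - \delta_{jl}X^{(N)}_i) \notag\\
&\quad + D^{(N)}_{ij}(\delta_{ik}\delta_{jl} - D^{(N)}_{kl}) + O\!\left(\frac{1}{N^\beta}\right). \notag
\end{align}

Higher order moments of order m ≥ 2 are $O(N^{-(m-2)})$.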

Proof

These expressions follow directly from the first four moments of Z(τ + dτ) | Z(τ), which are easily computed by noting that

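\[
\mathbb{E}[f(Z(\tau+d\tau)) \mid Z(\tau)=z] = \sum_{(i,j)}\sum_{(k,l)} f(z-e_{ij}+e_{kl})\,\lambda^{(N)}_{ij,kl}(z)\,d\tau + f(z)\left[1-\frac{N}{2}(N+\theta_A+\theta_B+\rho)\,d\tau\right] + o(d\tau).
\]

For example, choosing $f(z) = z_{u\upsilon}$ we find

\[
\mathbb{E}[Z_{u\upsilon}(\tau+d\tau) \mid Z(\tau)=z] = z_{u\upsilon} + N\left[\frac{\theta_A}{2}\sum_{k=1}^{K}(P^A_{ku}-\delta_{ku})\frac{z_{k\upsilon}}{N} + \frac{\theta_B}{2}\sum_{l=1}^{L}(P^B_{l\upsilon}-\delta_{l\upsilon})\frac{z_{ul}}{N} + \frac{\rho}{2}\left(\frac{z_{u\cdot}}{N}\frac{z_{\cdot\upsilon}}{N}-\frac{z_{u\upsilon}}{N}\right)\right]d\tau + o(d\tau),
\]

and hence we recover (4) via

\[
\mathbb{E}[\Delta X_u \mid M^{(N)}] = \frac{1}{N}\sum_{\upsilon=1}^{L}\left(\mathbb{E}[Z_{u\upsilon}(\tau+d\tau) \mid Z(\tau)] - Z_{u\upsilon}\right) = \frac{\theta_A}{2}\sum_{k=1}^{K}(P^A_{ku}-\delta_{ku})X^{(N)}_k\,d\tau + o(d\tau).
\]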

The remaining terms follow similarly; we omit the straightforward but lengthy algebraic details.
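The calculation leading to (4) can be checked numerically by building the transition rates of Section 3.1 and summing the effect of every possible jump. The following is a sketch under stated assumptions: it uses NumPy, takes the locus-B mutation term to act only when the A allele is unchanged (i = k), and the function names are hypothetical.

```python
import numpy as np

def moran_rates(z, theta_a, theta_b, rho, PA, PB):
    """Rate tensor lam[i, j, k, l] at which an (i, j) haplotype dies and is
    replaced by a (k, l) haplotype, given the K x L count matrix z."""
    z = np.asarray(z, dtype=float)
    PA, PB = np.asarray(PA, dtype=float), np.asarray(PB, dtype=float)
    K, L = z.shape
    N = z.sum()
    f = z / N                                   # haplotype frequencies z_ij / N
    x, y = f.sum(axis=1), f.sum(axis=0)         # marginal allele frequencies
    ones = np.ones((K, L, 1, 1))
    repro = (N**2 / 2) * f[None, None, :, :] * ones
    # Locus-A mutation i -> k leaves the B allele unchanged (j == l).
    mut_a = (theta_a / 2) * PA[:, None, :, None] * np.eye(L)[None, :, None, :]
    # Locus-B mutation j -> l leaves the A allele unchanged (i == k).
    mut_b = (theta_b / 2) * PB[None, :, None, :] * np.eye(K)[:, None, :, None]
    recomb = (rho / 2) * np.outer(x, y)[None, None, :, :] * ones
    return f[:, :, None, None] * (repro + N * (mut_a + mut_b + recomb))

def x_drift_from_rates(lam, N):
    """E[dX_u]/dtau summed over jumps: a jump (i, j) -> (k, l) changes X_u
    by (delta_ku - delta_iu) / N."""
    gain = lam.sum(axis=(0, 1, 3))   # total rate of jumps into A allele k
    loss = lam.sum(axis=(1, 2, 3))   # total rate of jumps out of A allele i
    return (gain - loss) / N

def x_drift_formula(x, theta_a, PA):
    """Right-hand side of (4): (theta_A / 2) sum_k (P_ki - delta_ik) x_k."""
    x, PA = np.asarray(x, dtype=float), np.asarray(PA, dtype=float)
    return (theta_a / 2) * (PA.T @ x - x)
```

Reproduction, locus-B mutation and recombination all cancel in the drift of X, so the two functions should agree exactly, and the entries of the rate tensor should sum to N(N + θ_A + θ_B + ρ)/2.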

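To prepare for our diffusion limit, we must rescale time; from (6) it is clear that to obtain a nontrivial limit we should let $t = N^{1-\beta}\tau$. Now introduce the conditional mean vector $w^{(N)}$ and conditional covariance matrix $s^{(N)}$ on this timescale, defined by

\begin{align}
\lim_{dt\to0}(dt)^{-1}\,\mathbb{E}[\Delta M^{(N)} \mid M^{(N)}(t)=m] &= N^{\beta-1}\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{E}[\Delta M^{(N)} \mid M^{(N)}(\tau)=m] =: w^{(N)}(m), \tag{7}\\
\lim_{dt\to0}(dt)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta M^{(N)} \mid M^{(N)}(t)=m] &= N^{\beta-1}\lim_{d\tau\to0}(d\tau)^{-1}\,\mathbb{C}\mathrm{ov}[\Delta M^{(N)} \mid M^{(N)}(\tau)=m] =: N^{\beta-1}s^{(N)}(m), \tag{8}
\end{align}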

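with entries determined by Proposition 3.1. Thus, with $m = (x_1, \ldots, x_K, y_1, \ldots, y_L, d_{11}, \ldots, d_{KL})$, equations (4)–(6) show that

\[
w^{(N)}(m) = w(m) + O(N^{\beta-1}),
\]
\[
\text{where}\quad w(m) = \Big(\underbrace{0,\ldots,0}_{K},\ \underbrace{0,\ldots,0}_{L},\ \underbrace{-\tfrac{\rho_\beta}{2}d_{11},\ldots,-\tfrac{\rho_\beta}{2}d_{KL}}_{K\times L}\Big), \tag{9}
\]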

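with $s^{(N)}(m) = s(m) + O(N^{-\beta})$ determined in a similar fashion:

\[
s(m) = \begin{bmatrix} s_{XX}(m) & s_{XY}(m) & s_{XD}(m) \\ s_{XY}(m)' & s_{YY}(m) & s_{YD}(m) \\ s_{XD}(m)' & s_{YD}(m)' & s_{DD}(m) \end{bmatrix},
\]

where

\[
\begin{aligned}
[s_{XX}(m)]_{ik} &= x_i(\delta_{ik}-x_k), \\
[s_{YY}(m)]_{jl} &= y_j(\delta_{jl}-y_l), \\
[s_{XY}(m)]_{ij} &= d_{ij}, \\
[s_{XD}(m)]_{i,kl} &= d_{kl}(\delta_{ik}-x_i) - x_k d_{il}, \\
[s_{YD}(m)]_{j,kl} &= d_{kl}(\delta_{jl}-y_j) - y_l d_{kj}, \\
[s_{DD}(m)]_{ij,kl} &= x_iy_j(\delta_{ik}-x_k)(\delta_{jl}-y_l) + d_{kj}x_iy_l + d_{il}x_ky_j + d_{ij}(x_ky_l - \delta_{ik}y_l - \delta_{jl}x_k) \\
&\quad + d_{kl}(x_iy_j - \delta_{ik}y_j - \delta_{jl}x_i) + d_{ij}(\delta_{ik}\delta_{jl} - d_{kl}).
\end{aligned}
\]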

Notice in particular the different leading orders of the two quantities in (7) and (8): the mean increments are of O(1) on this timescale while the covariances are of $O(N^{\beta-1})$. It is this difference, which is a consequence of our assumption that the recombination probability r is $O(N^{-\beta})$ for β < 1, that leads to a novel diffusion limit. Under the usual choice of β = 1 it is well known that we see convergence to a diffusion process after a linear rescaling of time. In the special case of a Wright-Fisher model and K = L = 2, the diffusion limit for $M^{(N)}(\lfloor N\tau\rfloor)$ as N → ∞ was obtained by Ohta and Kimura (1969a,b). Our interest is however in β ∈ (0, 1), for which r is larger, and the loss of linkage disequilibrium (LD) is consequently much faster. Intuitively, we should expect such loss to resemble the exponential decay predicted in an infinitely large population, but with small fluctuations about this deterministic behaviour. The diffusion process we define below quantifies these fluctuations precisely.

3.2. Gaussian diffusion limit of fluctuations in linkage disequilibrium

We first provide a heuristic description of the diffusion limit. First, observe from (7) and (8) that, provided $M^{(N)}(0) \to M(0)$ as N → ∞ and that β ∈ [0, 1), then

\[
M^{(N)} \xrightarrow{d} M := \{(X(0),\, Y(0),\, D(0)e^{-\rho_\beta t/2}) : t \ge 0\}, \qquad N\to\infty, \tag{10}
\]

the deterministic exponential decay in LD typical of an infinitely large population. See Baake and Herms (2008) for a formal statement of this law-of-large-numbers type result for the Moran model with recombination. For the corresponding central limit theorem, we seek a diffusion limit for

\[
U^{(N)}(t) := r_N\left[M^{(N)}(t) - M(t)\right], \tag{11}
\]

for some rescaling $r_N \to \infty$. In our application the appropriate choice is

\[
r_N := N^{(1-\beta)/2},
\]

which can be regarded as the one on which both recombination and genetic drift are observable on the fastest timescale (Jenkins and Song, 2012). We will assume this scaling henceforward. To find the limit $U = \lim_{N\to\infty} U^{(N)}$, write

\[
U^{(N)}(t) = r_N\left[\left[M^{(N)}(0) - M(0)\right] + \int_0^t\left[w^{(N)}(M^{(N)}(s)) - w(M(s))\right]ds + R^{(N)}(t)\right], \tag{12}
\]

where

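\[
R^{(N)}(t) := M^{(N)}(t) - M^{(N)}(0) - \int_0^t w^{(N)}(M^{(N)}(s))\,ds
\]

describes the deviations of $M^{(N)}(t)$ from its expected behaviour and is a martingale. It suffices to characterize the limits of each of the three grouped terms on the right of (12). For the first term we assume that it converges to a limit, $U^{(N)}(0) \xrightarrow{d} U(0)$ as N → ∞. For the second term, from (9) we should expect

\[
\begin{aligned}
r_N\int_0^t\left[w^{(N)}(M^{(N)}(s)) - w(M(s))\right]ds &= r_N\int_0^t\left[-\frac{\rho_\beta}{2}\left[M^{(N)}(s) - M(s)\right] + O(N^{\beta-1})\right]ds \\
&= \int_0^t\left[-\frac{\rho_\beta}{2}U^{(N)}(s) + O(N^{(\beta-1)/2})\right]ds \\
&\xrightarrow{d} -\frac{\rho_\beta}{2}\int_0^t U(s)\,ds, \qquad N\to\infty.
\end{aligned} \tag{13}
\]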

Finally, we obtain a complete description of the limit $r_N R^{(N)} \xrightarrow{d} R$ as N → ∞ by an application of the martingale central limit theorem (Ethier and Kurtz, 1986, Theorem 7.1.4); we find

\[
R(t) = \int_0^t \sigma(M(s))\,dW(s),
\]

where σσ′ = s, and W is a (KL − 1)-dimensional Brownian motion. In summary then, we expect U to satisfy

\[
U(t) = U(0) - \frac{\rho_\beta}{2}\int_0^t U(s)\,ds + \int_0^t \sigma(M(s))\,dW(s). \tag{14}
\]

Our main result formalizes this argument, as follows.

Theorem 3.1

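Suppose that $U^{(N)}(0) \xrightarrow{d} U(0)$ as N → ∞. Then for each t > 0, as N → ∞,

\[
\sup_{s\le t}\left|M^{(N)}(s) - M(s)\right| \xrightarrow{d} 0;
\]

$N^{(1-\beta)/2}R^{(N)} \xrightarrow{d} R$, where R has Gaussian, independent increments with mean zero, and with

\[
\mathbb{E}[R(t)R(t)'] = \int_0^t s(M(s))\,ds; \tag{15}
\]

and $U^{(N)} \xrightarrow{d} U$, satisfying (14).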

Proof of Theorem 3.1

This is an application of a central limit theorem for density dependent population processes; for textbook coverage see Ethier and Kurtz (1986, Chapter 11) and for a recent treatment see Kang, Kurtz and Popovic (2014). We apply Theorem 2.11 of Kang, Kurtz and Popovic (2014). To do so we need to validate each of the assertions that led to (14) above by checking the following sufficient conditions (i)–(iv). (Kang, Kurtz and Popovic (2014, Theorem 2.11) is rather more general than is required here: it permits the state space of M(N) to be unbounded, and for M(N) to depend on other processes that evolve on faster timescales than that of the diffusion. We omit those conditions which are not needed.)

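  1. The Moran process converges to an identifiable, deterministic limit. This is guaranteed by the following: the infinitesimal generator $\mathcal{A}_N$ of $M^{(N)}$ satisfies
    \[ \lim_{N\to\infty}\sup_{m\in\Delta^{(N)}_{KL-1}}\left|\mathcal{A}_N f(m) - \mathcal{A}f(m)\right| = 0, \qquad f\in\mathcal{D}(\mathcal{A}), \]
    for a generator $\mathcal{A}$ with domain $\mathcal{D}(\mathcal{A})$.
  2. Fluctuations about the deterministic limit are well behaved. More precisely, $R^{(N)}$ is a local martingale and the covariation process $[M^{(N)}]_t \xrightarrow{d} 0$.

  3. Contributions of $O(r_N^{-1})$ to the error $w^{(N)} - w$ can be identified. These would contribute to the limiting drift of U(t), and a sufficient condition to identify them is: there exists a continuous function $G_0 : \Delta_{KL-1} \to \mathbb{R}^{\Lambda}$ (recall Λ = K + L + KL) such that
    \[ \lim_{N\to\infty}\sup_{m\in\Delta^{(N)}_{KL-1}}\left|r_N\left[w^{(N)}(m) - w(m)\right] - G_0(m)\right| = 0. \]
  4. The martingale central limit theorem applies to $r_N R^{(N)}$. This is guaranteed by the following:
    \[ \lim_{N\to\infty}\mathbb{E}\left[\sup_{s\le t} r_N\left|M^{(N)}(s) - M^{(N)}(s-)\right|\right] = 0, \tag{16} \]
    and there exists a continuous $G : \Delta_{KL-1} \to \mathbb{R}^{\Lambda\times\Lambda}$ such that for each t > 0,
    \[ r_N^2\,[M^{(N)}]_t - \int_0^t G(M^{(N)}(s))\,ds \xrightarrow{d} 0. \tag{17} \]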

We address each of these requirements in turn.

  1. Convergence of $\mathcal{A}_N f(m)$ to $\mathcal{A}f(m) := w\cdot\nabla f(m)$, the generator of M [see (10)], is immediate from Proposition 3.1. Convergence is uniform in m because the $O(N^{-\beta})$ terms in Proposition 3.1 have coefficients that are polynomials in $M^{(N)}$ on a compact space.

  2. Since the state space is bounded, for $R^{(N)}$ to be a martingale it suffices that the jump rate is uniformly bounded (Kurtz, 1971, Proposition 2.1), as is the case for the Moran process. The covariation process $[M^{(N)}]_t \xrightarrow{d} 0$ as a consequence of (17), verified below.

  3. From (9), $r_N[w^{(N)}(m) - w(m)] = O(N^{(\beta-1)/2})$, again uniformly in $m \in \Delta^{(N)}_{KL-1}$, so here the appropriate choice is $G_0 \equiv 0$. Thus, the only relevant contribution to the limit (13) is from the error $w(M^{(N)}(s)) - w(M(s))$ rather than from $w^{(N)}(M^{(N)}(s)) - w(M^{(N)}(s))$.

  4. Jumps of any component of $M^{(N)}$ are bounded in magnitude by 2/N, so
    \[ \sup_{s\le t} r_N\left|M^{(N)}(s) - M^{(N)}(s-)\right| \le N^{(1-\beta)/2}\cdot\frac{2\Lambda^{1/2}}{N} \to 0, \qquad N\to\infty, \]
    and (16) holds. To identify the asymptotic behaviour of $r_N^2[M^{(N)}]_t$, let
    \[ \mathcal{N}^{(N)}_m(t) = \mathcal{Y}_m\!\left(\int_0^t \lambda^{(N)}_m(M^{(N)}(s))\,ds\right) \]
    denote the total number of jumps of the Moran process into state $m \in \Delta^{(N)}_{KL-1}$ by time t, where $(\mathcal{Y}_m : m \in \Delta^{(N)}_{KL-1})$ is a collection of independent Poisson processes of unit rate and $\lambda^{(N)}_m(M^{(N)}(s))$ denotes the rate of transition of the process from current state $M^{(N)}(s)$ to m. Then
    \[
    \begin{aligned}
    r_N^2\,[M^{(N)}]_t &= N^{1-\beta}\int_0^t \sum_{m\in\Delta^{(N)}_{KL-1}} [\Delta M^{(N)}(s)][\Delta M^{(N)}(s)]'\,d\mathcal{N}^{(N)}_m(s) \\
    &\sim N^{1-\beta}\int_0^t \sum_{m\in\Delta^{(N)}_{KL-1}} [m - M^{(N)}(s)][m - M^{(N)}(s)]'\,\lambda^{(N)}_m(M^{(N)}(s))\,ds \\
    &\sim \int_0^t s^{(N)}(M^{(N)}(s))\,ds,
    \end{aligned}
    \]
    by (8). Thus we may take G = s in (17) [G identifies the moments appearing in (15)].

Remark 3.1

One could obtain the same diffusion limit starting from a Wright-Fisher model rather than a Moran model, since the means and covariances of its increments are identical to leading order, up to a rescaling of time. This alternative approach is in some respects less appealing since the Wright-Fisher model, when expressed in continuous time, is non-Markovian. The additional complications raised by this approach have been addressed by Norman (1975a) (see also Ethier and Nagylaki, 1980, 1988), and we have checked that the conditions of his theorems still apply when we introduce recombination to the Wright-Fisher model. The theory of Norman (1975a) has been used to study strong mutation and selection (Norman, 1972, 1975a; Kaplan, Darden and Hudson, 1988; Nagylaki, 1986, 1990; Wakeley and Sargsyan, 2009), and a Gaussian diffusion approximation of a Moran model with strong selection is developed by Feder, Kryazhimskiy and Plotkin (2014), but to the best of our knowledge this is the first time a central limit theorem has been obtained for strong recombination.

Remark 3.2

The exponential decay of linkage disequilibrium implied by M [equation (10)] is a classical result; the above theorem further quantifies the fluctuations about this deterministic behaviour in a fully time-dependent manner. In particular, the definition of U [equation (11)] shows that fluctuations of $M^{(N)}$ about M are of order $N^{-(1-\beta)/2}$ on a timescale of $N^{\beta-1}$ units of the Moran process. If we designate the expected lifetime of an individual, 2/N, as one generation, then these fluctuations can be said to occur on a timescale of order $N^{\beta}$ generations. (This definition of “generation” is consistent with $u_A$, $u_B$, and r in Section 2 provided we replace N with the effective population size of the Moran model, N/2, in the definitions of $\theta_A$, $\theta_B$, and ρ (Ewens, 2004, p121).)

3.3. Stationary distribution

Although U is described completely by (14), the volatility term σ(M(t)) is neither simple nor time-independent. On the other hand, our main interest is in stationary behaviour, and σ(M(∞)) takes on a much simpler form. First note that the components of U(t) corresponding to each $X_i$ and $Y_j$ undergo Brownian motions (with nonunit volatility), so we restrict our attention to the stationary distribution of the component corresponding to D, which we denote $U_D$. Conditions of Norman (1975b) confirm convergence of $U_D(t)$ to its stationary distribution. Setting σ(M(s)) = σ(M(∞)) in (14), we find

\[
dU_D = -\frac{\rho_\beta}{2}U_D\,dt + \sigma\,dW(t), \tag{18}
\]

where σ is a constant defined by

\[
\sigma\sigma' = s := s_{DD}(M(\infty)) = \left[X_i(0)Y_j(0)(\delta_{ik}-X_k(0))(\delta_{jl}-Y_l(0))\right]_{ij,kl}.
\]

The process (18) is much simpler to describe. Marginally, $U_{D_{ij}}$ is an Ornstein-Uhlenbeck process with damping towards linkage equilibrium at rate $\rho_\beta/2$ and constant volatility $[\sigma]_{ij,ij}$. $U_D$ has stationary distribution

\[
\text{Normal}\left(0_{KL\times1},\ \frac{s}{\rho_\beta}\right).
\]

This is a slightly different idea of stationarity than usual, since it depends on X(0) and Y (0). An immediate question is: what should be the distributions for X(0) and Y (0)? We address this by reconsidering the usual two-locus Wright-Fisher diffusion limit operating on a slower timescale. We can exploit (18) to obtain a simple approximation of this diffusion limit, as follows. First, we have derived the Gaussian diffusion approximation

\[
D(0)e^{-\rho_\beta t/2} + N^{(\beta-1)/2}U_D(t)
\]

for $D^{(N)}(t)$. Thus the stationary distribution of this approximation is

\[
\text{Normal}\left(0_{KL\times1},\ \frac{s}{\rho}\right). \tag{19}
\]

Notice that this description does not depend on the particular choice of β. Under the usual “Wright-Fisher” regime we treat ρ as fixed. It remains to specify the stationary distributions for the marginal allele frequencies X and Y, which we suppose to have reached their usual (independent) stationary distributions in the Wright-Fisher diffusion limit, which we refer to as πA and πB, respectively (and whose respective sampling distributions are qA and qB). Then we can complete the picture for (19) by specifying (X(0), Y (0)) ~ πA ⊗ πB.

The distribution (19) therefore provides a simple, explicit method for the approximate simulation of haplotype frequencies under a stationary, two-locus Wright-Fisher diffusion, which we summarize in the algorithm below. (When mutation is parent independent, as in Remark 2.1, πA and πB take on a particularly simple form, but we note that these distributions are not known in general.)

Algorithm to simulate from a Gaussian approximation to
the stationary Wright-Fisher diffusion with recombination.
  1. Simulate marginal allele frequencies at locus A, X(0) ~ πA.

  2. Independently simulate marginal allele frequencies at locus B, Y (0) ~ πB.

  3. Conditionally simulate D from (19) given X(0) and Y (0).

  4. Calculate two-locus haplotype frequencies via
    \[ X_{ij} = D_{ij} + X_i(0)Y_j(0), \qquad \text{for each } i\in[K],\ j\in[L]. \]
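The four steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: it assumes parent-independent mutation, so that the marginal stationary laws are the Dirichlet distributions of Remark 2.1 with parameters `alphaA` = θ_A P^A and `alphaB` = θ_B P^B, and the function names are hypothetical.

```python
import numpy as np

def ld_covariance(x, y):
    """Matrix s with entries x_i y_j (delta_ik - x_k)(delta_jl - y_l),
    indexed by haplotypes (i, j) and (k, l) flattened in row-major order."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sx = np.diag(x) - np.outer(x, x)     # x_i (delta_ik - x_k)
    sy = np.diag(y) - np.outer(y, y)     # y_j (delta_jl - y_l)
    return np.kron(sx, sy)

def simulate_haplotype_frequencies(alphaA, alphaB, rho, rng):
    """One draw of (x, y, haplotype frequencies) from the Gaussian approximation."""
    x = rng.dirichlet(alphaA)                        # step 1: X(0) ~ pi_A
    y = rng.dirichlet(alphaB)                        # step 2: Y(0) ~ pi_B
    cov = ld_covariance(x, y) / rho                  # covariance in (19)
    # The covariance is singular (rows and columns of D sum to zero); NumPy's
    # default SVD-based factorisation accepts positive semidefinite input.
    d = rng.multivariate_normal(np.zeros(cov.shape[0]), cov)   # step 3
    d = d.reshape(len(x), len(y))
    return x, y, d + np.outer(x, y)                  # step 4
```

A call might look like `simulate_haplotype_frequencies([0.5, 0.5], [0.2, 0.3, 0.5], rho=50.0, rng=np.random.default_rng(0))`. Because the Gaussian has unbounded support, individual haplotype frequencies produced this way can fall slightly outside [0, 1], particularly when ρ is small.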

3.4. Sampling distribution

The significance of the Gaussian diffusion approximation UD is further evident from the following theorem. First we need some further notation. Let

\[
\mathcal{P}_m = \left\{r\in\mathbb{N}^{K\times L} : \sum_{i=1}^{K}\sum_{j=1}^{L} r_{ij} = m\right\},
\]

for m ∈ ℕ, and let $l(r) \in ([K]\times[L])^m$ denote a sequence of m haplotypes (in some arbitrary, fixed order) with multiplicities specified by $r \in \mathcal{P}_m$. Further let $l(r)^A \in [K]^m$ denote the corresponding list of alleles obtained by looking at the first entry of each element of l(r), and define $l(r)^B$ similarly. For λ ∈ ℕ denote by $\mathcal{Q}_{2\lambda}$ the set of partitions of [2λ] with precisely λ blocks of size 2, and write a representative element as $\xi_{\mu\nu} = \{\{\mu_k, \nu_k\} : k = 1, \ldots, \lambda\} \in \mathcal{Q}_{2\lambda}$; $\mu = (\mu_k)$ and $\nu = (\nu_k)$ are sequences of length λ. For J ⊆ [λ], denote by $\mu_J$, $\nu_J$ the subsequences obtained by looking only at the indices in J, and denote by $l_\mu(r)$ the subsequence of l(r) obtained by looking only at the indices in μ. The matrix of multiplicities of $l_\mu(r)$ is denoted by $r^{(\mu)}$, so that $r^{(\mu)} + r^{(\nu)} = r$. For example, if $r = \begin{bmatrix}1 & 2\\ 0 & 1\end{bmatrix}$ then a representative list of haplotypes is l(r) = ((1, 1), (1, 2), (1, 2), (2, 2)) with marginal allele lists $l(r)^A = (1, 1, 1, 2)$ and $l(r)^B = (1, 2, 2, 2)$. Here, m = 2λ = 4, and $\mathcal{Q}_4 = \{\{\{1, 2\}, \{3, 4\}\}, \{\{1, 3\}, \{2, 4\}\}, \{\{1, 4\}, \{2, 3\}\}\}$. Then for example the first element in $\mathcal{Q}_4$ is the partition $\xi_{\mu\nu}$ constructed from μ = (1, 3) and ν = (2, 4), and so $l_\mu(r) = ((1,1),(1,2))$ and $l_\nu(r) = ((1,2),(2,2))$.
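The set of pair partitions is small enough to enumerate recursively, which may help in following the notation. The sketch below is purely illustrative (the function names are hypothetical); it pairs the first remaining index with each possible partner in turn, so for [4] it lists the three partitions in the same order as the example above.

```python
def pair_partitions(items):
    """Yield every partition of `items` into blocks of size 2, each partition
    given as a list of (mu_k, nu_k) pairs."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in pair_partitions(remaining):
            yield [(first, partner)] + tail

def split_mu_nu(partition):
    """Return the sequences mu = (mu_k) and nu = (nu_k) of a pair partition."""
    mu = tuple(block[0] for block in partition)
    nu = tuple(block[1] for block in partition)
    return mu, nu
```

The number of partitions of [2λ] produced this way is (2λ − 1)(2λ − 3)⋯1, for example 3 when λ = 2 and 15 when λ = 3.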

Theorem 3.2

Suppose that X ~ πA, Y ~ πB independently, and conditional on X and Y, D is distributed according to the Gaussian distribution in (19). Then the sampling distribution is given exactly by

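\[
\begin{aligned}
q_G(\mathbf{a},\mathbf{b},\mathbf{c}) ={}& \sum_{\lambda=0}^{\lfloor c/2\rfloor}\frac{1}{\rho^{\lambda}}\sum_{r\in\mathcal{P}_{2\lambda}}\sum_{\xi\in\mathcal{Q}_{2\lambda}}\left[\prod_{i=1}^{K}\prod_{j=1}^{L}\binom{c_{ij}}{r_{ij}}\right] \\
&\times\left[\sum_{I\subseteq[\lambda]\,:\; l_{\mu_I}(r)^A = l_{\nu_I}(r)^A}(-1)^{|I^{\mathsf{C}}|}\,q_A\!\left(\mathbf{a}+\mathbf{c}_A-r_A^{(\nu_I)}\right)\right] \\
&\times\left[\sum_{J\subseteq[\lambda]\,:\; l_{\mu_J}(r)^B = l_{\nu_J}(r)^B}(-1)^{|J^{\mathsf{C}}|}\,q_B\!\left(\mathbf{b}+\mathbf{c}_B-r_B^{(\nu_J)}\right)\right] \\
={}& q_0(\mathbf{a},\mathbf{b},\mathbf{c}) + \frac{q_1(\mathbf{a},\mathbf{b},\mathbf{c})}{\rho} + O\!\left(\frac{1}{\rho^2}\right),
\end{aligned} \tag{20}
\]

with $q_0$ and $q_1$ given by (2) and (3) respectively (and we impose the convention that the empty summations for λ = 0 have a single term, with $(-1)^{|\emptyset\setminus\emptyset|} = 1$).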

Proof

With respect to the diffusion in the transformed co-ordinate system, the sampling distribution is

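$$
\begin{aligned}
q_G(a,b,c)
&= \mathbb{E}\Biggl[\Bigl(\prod_{i=1}^{K} X_i^{a_i}\Bigr)\Bigl(\prod_{j=1}^{L} Y_j^{b_j}\Bigr)\Bigl(\prod_{i=1}^{K}\prod_{j=1}^{L} [D_{ij} + X_i Y_j]^{c_{ij}}\Bigr)\Biggr] \\
&= \sum_{m=0}^{c} \sum_{r \in \mathcal{P}_m} \Biggl[\prod_{i=1}^{K}\prod_{j=1}^{L} \binom{c_{ij}}{r_{ij}}\Biggr] \mathbb{E}\Biggl[\Bigl(\prod_{i=1}^{K} X_i^{a_i + c_{i\cdot} - r_{i\cdot}}\Bigr) \Bigl(\prod_{j=1}^{L} Y_j^{b_j + c_{\cdot j} - r_{\cdot j}}\Bigr)\, \mathbb{E}\Bigl[\prod_{i=1}^{K}\prod_{j=1}^{L} D_{ij}^{r_{ij}} \,\Big|\, X, Y\Bigr]\Biggr] \\
&= \sum_{\lambda=0}^{\lfloor c/2 \rfloor} \sum_{r \in \mathcal{P}_{2\lambda}} \sum_{\xi_{\mu\nu} \in \mathcal{Q}_{2\lambda}} \Biggl[\prod_{i=1}^{K}\prod_{j=1}^{L} \binom{c_{ij}}{r_{ij}}\Biggr] \mathbb{E}\Biggl[\Bigl(\prod_{i=1}^{K} X_i^{a_i + c_{i\cdot} - r_{i\cdot}}\Bigr) \Bigl(\prod_{j=1}^{L} Y_j^{b_j + c_{\cdot j} - r_{\cdot j}}\Bigr) \prod_{k=1}^{\lambda} \mathbb{E}\bigl[D_{l_{\mu_k}(r)} D_{l_{\nu_k}(r)} \,\big|\, X, Y\bigr]\Biggr] \\
&= \sum_{\lambda=0}^{\lfloor c/2 \rfloor} \frac{1}{\rho^{\lambda}} \sum_{r \in \mathcal{P}_{2\lambda}} \sum_{\xi_{\mu\nu} \in \mathcal{Q}_{2\lambda}} \Biggl[\prod_{i=1}^{K}\prod_{j=1}^{L} \binom{c_{ij}}{r_{ij}}\Biggr] \\
&\qquad \times \mathbb{E}\Biggl[\Bigl(\prod_{i=1}^{K} X_i^{a_i + c_{i\cdot} - r_{i\cdot}}\Bigr) \Bigl(\prod_{j=1}^{L} Y_j^{b_j + c_{\cdot j} - r_{\cdot j}}\Bigr) \\
&\qquad\qquad \times \prod_{k=1}^{\lambda} X_{l_{\mu_k}(r)_A} Y_{l_{\mu_k}(r)_B} \bigl(\delta_{l_{\mu_k}(r)_A\, l_{\nu_k}(r)_A} - X_{l_{\nu_k}(r)_A}\bigr)\bigl(\delta_{l_{\mu_k}(r)_B\, l_{\nu_k}(r)_B} - Y_{l_{\nu_k}(r)_B}\bigr)\Biggr] \\
&= \sum_{\lambda=0}^{\lfloor c/2 \rfloor} \frac{1}{\rho^{\lambda}} \sum_{r \in \mathcal{P}_{2\lambda}} \sum_{\xi_{\mu\nu} \in \mathcal{Q}_{2\lambda}} \Biggl[\prod_{i=1}^{K}\prod_{j=1}^{L} \binom{c_{ij}}{r_{ij}}\Biggr] \\
&\qquad \times \sum_{I \subseteq [\lambda]} (-1)^{|I^{\mathsf{C}}|}\, \delta_{l_{\mu_I}(r)_A\, l_{\nu_I}(r)_A} \sum_{J \subseteq [\lambda]} (-1)^{|J^{\mathsf{C}}|}\, \delta_{l_{\mu_J}(r)_B\, l_{\nu_J}(r)_B} \\
&\qquad \times \mathbb{E}\Biggl[\Bigl(\prod_{i=1}^{K} X_i^{a_i + c_{i\cdot} - r_{i\cdot}^{(\nu_I)}}\Bigr) \Bigl(\prod_{j=1}^{L} Y_j^{b_j + c_{\cdot j} - r_{\cdot j}^{(\nu_J)}}\Bigr)\Biggr].
\end{aligned}
$$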

The second equality follows from the multinomial theorem and the tower property, the third equality follows from Isserlis’ theorem (Michalowicz et al., 2011), and the fourth equality follows from (19):

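$$\mathbb{E}[D_{ij} D_{kl} \mid X, Y] = \frac{1}{\rho} X_i Y_j (\delta_{ik} - X_k)(\delta_{jl} - Y_l).$$

The fifth equality follows from expanding the final product (using the convention $\delta_{\emptyset\emptyset} = 1$), while (20) follows from $(X, Y) \sim \pi_A \otimes \pi_B$. The equalities still hold for $\lambda = 0$ provided we take the empty product $\prod_{k=1}^{0}$ to be 1.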
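The role of Isserlis' theorem in the third equality can be illustrated numerically. The sketch below assumes a zero-mean Gaussian vector with a given covariance matrix and sums products of covariances over all pairings of the factors, which is the sum over ξ ∈ 𝒬2λ in the proof. The function names and the 2 × 2 covariance matrix are hypothetical, chosen only for illustration; they are not the covariance in (19).

```python
def pairings(indices):
    """Yield every partition of `indices` into blocks of size 2."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def isserlis_moment(factors, cov):
    """E[prod_k Z[factors[k]]] for a zero-mean Gaussian vector Z with covariance `cov`.

    `factors` lists the component index of each factor in the product,
    with repeats allowed. Odd-order moments vanish.
    """
    if len(factors) % 2 == 1:
        return 0.0
    total = 0.0
    for partition in pairings(list(range(len(factors)))):
        term = 1.0
        for p, q in partition:
            term *= cov[factors[p]][factors[q]]
        total += term
    return total


# Illustrative covariance matrix (an assumption, not taken from the paper).
cov = [[2.0, 0.5],
       [0.5, 3.0]]

fourth = isserlis_moment([0, 0, 0, 0], cov)   # E[Z0^4] = 3 * Var(Z0)^2
mixed = isserlis_moment([0, 0, 1, 1], cov)    # Var(Z0) Var(Z1) + 2 Cov(Z0, Z1)^2
```

An empty list of factors returns 1, which mirrors the empty-product convention used for λ = 0.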

Extracting the two leading order terms λ = 0 and λ = 1, the expression simplifies to

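$$
\begin{aligned}
q_G(a,b,c) &= \mathbb{E}\Biggl[\Bigl(\prod_{i=1}^{K} X_i^{a_i + c_{i\cdot}}\Bigr)\Bigl(\prod_{j=1}^{L} Y_j^{b_j + c_{\cdot j}}\Bigr)\Biggr] \\
&\quad + \frac{1}{\rho} \sum_{k,u=1}^{K} \sum_{l,v=1}^{L} \frac{c_{kl}(c_{uv} - \delta_{ku}\delta_{lv})}{2}\, \mathbb{E}\Biggl[\Bigl(\prod_{i=1}^{K} X_i^{a_i + c_{i\cdot} - \delta_{iu}}\Bigr) \Bigl(\prod_{j=1}^{L} Y_j^{b_j + c_{\cdot j} - \delta_{jv}}\Bigr)(\delta_{ku} - X_u)(\delta_{lv} - Y_v)\Biggr] + O\Bigl(\frac{1}{\rho^2}\Bigr) \\
&= q_0(a,b,c) + \frac{q_1(a,b,c)}{\rho} + O\Bigl(\frac{1}{\rho^2}\Bigr),
\end{aligned}
$$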

as required.

3.5. Accuracy of the diffusion process

A natural question to ask is: to what extent does the process of Theorem 3.2 capture the dynamics of the full process? To address this we consider the accuracy of the sampling distribution (20) as an approximation to the “true” distribution, $q(a, b, c)$. For moderate sample sizes it is possible to compute the latter as the solution to a system of recursive equations (Golding, 1984; Ethier and Griffiths, 1990; Jenkins and Song, 2009). The number of summands in (20) grows rapidly with $\lambda$ (as long as $\lambda \le c/2$), so we define an approximate sampling distribution $q_G^{(\lambda)}(a,b,c)$ by truncating the outer sum in (20) at a fixed index $\lambda$. This is analogous to the asymptotic sampling formulae for the full model which are obtained by truncating equation (1) (Jenkins and Song, 2012). As our measure of accuracy we define the relative error,

$$\mathring{e}^{(\lambda)}_{\mathrm{Gaussian}} = \left| \frac{Q_G^{(\lambda)}(0,0,c) - q(0,0,c)}{q(0,0,c)} \right| \times 100\%, \tag{21}$$

where $Q_G^{(\lambda)}(0,0,c)$ is the staircase Padé approximant to $q_G^{(\lambda)}(0,0,c)$. (The former is used for its superior convergence properties; see Jenkins and Song, 2012, for details.) We define $\mathring{e}^{(\lambda)}_{\mathrm{True}}$ analogously, replacing $Q_G^{(\lambda)}(0,0,c)$ in (21) with the Padé approximant to the partial sum of (1), computed up to $O(\rho^{-(\lambda+1)})$ by the method of Jenkins and Song (2012).

We computed the distribution of $\mathring{e}^{(\lambda)}_{\mathrm{Gaussian}}$ and of $\mathring{e}^{(\lambda)}_{\mathrm{True}}$ across all sample configurations of size $c = 20$ for which both alleles are observed at each locus; results are shown in Table 1. For a collection of this size it was straightforward to compute up to $\lambda = 6$ for every possible sample configuration. Using a partial sum to approximate (1) contributes to both errors; $\mathring{e}^{(\lambda)}_{\mathrm{Gaussian}}$ has additional contributions reflecting its use of an approximate model. Of course, the two errors agree up to $\lambda = 1$. However, Table 1 shows that they are comparable more broadly, particularly for large recombination rates. As $\lambda$ increases, $Q_G^{(\lambda)}(0,0,c)$ converges rapidly (even without Padé summation; not shown), and becomes a reasonable approximation to $q(a, b, c)$. For example, for $\rho = 50$, $Q_G^{(6)}(0,0,c)$ is within 10% of $q(a, b, c)$ with probability 0.79, though it is within 1% only with probability 0.50. When we consider the highest levels of accuracy, as in Φ(1) in Table 1, $\mathring{e}^{(\lambda)}_{\mathrm{Gaussian}}$ actually increases with $\lambda$ when $\lambda > 1$. This suggests that the Gaussian model typically cannot approximate the true model to the same level of precision as a first order asymptotic approximation of the true model, though its behaviour as a coarser approximation (as reflected in the columns for Φ(100), for example) is comparable.
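As a minimal sketch of how the accuracy measure in (21) and the cumulative distribution Φ(x) reported in Table 1 could be tabulated, consider the following. The function names are hypothetical, the input pairs are made-up placeholder numbers (not values from the paper), and giving every sample configuration equal weight is an assumption about how Φ is formed.

```python
def relative_error_percent(approx, exact):
    """Relative error of `approx` against `exact`, in percent, as in (21)."""
    return abs((approx - exact) / exact) * 100.0


def cumulative_fraction(errors, x):
    """Phi(x): fraction of configurations whose error is strictly below x percent.

    Assumes equal weight per configuration (an illustrative assumption).
    """
    return sum(1 for e in errors if e < x) / len(errors)


# Placeholder (approximation, exact) pairs, one per sample configuration.
# These numbers are invented purely to exercise the functions.
pairs = [(0.0102, 0.0100), (0.0300, 0.0200), (0.02001, 0.0200), (0.09, 0.03)]
errors = [relative_error_percent(approx, exact) for approx, exact in pairs]

phi = {x: cumulative_fraction(errors, x) for x in (1, 10, 100)}
```

In an actual comparison the approximation would be the Padé approximant and the exact value would come from the recursive equations cited in the text.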

Table 1.

Cumulative distribution Φ(x) = ℙ(e̊(λ) < x%) (where e̊(λ) denotes either e˚Gaussian(λ) or e˚True(λ) as defined in the main text), for all samples of size 20 dimorphic at both loci.

λ   Type of sum    ρ = 25                    ρ = 50
                   Φ(1)    Φ(10)   Φ(100)    Φ(1)    Φ(10)   Φ(100)
0   True           0.39    0.58    1.00      0.49    0.63    1.00
    Gaussian       0.39    0.58    1.00      0.49    0.63    1.00
1   True           0.51    0.75    0.96      0.59    0.84    0.99
    Gaussian       0.51    0.75    0.96      0.59    0.84    0.99
2   True           0.59    0.91    0.97      0.77    0.98    1.00
    Gaussian       0.50    0.73    0.97      0.50    0.86    1.00
4   True           0.83    0.99    1.00      0.95    1.00    1.00
    Gaussian       0.51    0.72    1.00      0.50    0.80    1.00
6   True           0.89    0.99    1.00      0.99    1.00    1.00
    Gaussian       0.49    0.71    0.99      0.50    0.79    1.00

λ   Type of sum    ρ = 100                   ρ = 200
                   Φ(1)    Φ(10)   Φ(100)    Φ(1)    Φ(10)   Φ(100)
0   True           0.50    0.72    1.00      0.54    0.95    1.00
    Gaussian       0.50    0.72    1.00      0.54    0.95    1.00
1   True           0.74    0.95    1.00      0.90    0.99    1.00
    Gaussian       0.74    0.95    1.00      0.90    0.99    1.00
2   True           0.95    1.00    1.00      1.00    1.00    1.00
    Gaussian       0.64    0.99    1.00      0.85    1.00    1.00
4   True           1.00    1.00    1.00      1.00    1.00    1.00
    Gaussian       0.64    0.99    1.00      0.83    1.00    1.00
6   True           1.00    1.00    1.00      1.00    1.00    1.00
    Gaussian       0.64    0.99    1.00      0.83    1.00    1.00

4. Coalescent process

4.1. A coupling argument

In this section we derive a coalescent process which is much simpler than the ARG but whose sampling distribution agrees with (2) and (3). We first provide an informal description. Let $\mathcal{C}^{(\rho)}_{a,b,c}(t)$ denote the standard, neutral, two-locus coalescent process a time $t$ back from a sample taken at time $t = 0$, with $a$, $b$, and $c$ counting the three types of sample as defined in Section 2. Recombination occurs at the usual rate of $\rho c/2$, where $\rho = 2Nr$. Lineages ancestral to the three types are sometimes referred to as representing left half-fragments, right half-fragments, and full fragments, respectively. Our strategy is to define a coupling on a joint probability space for the pair of processes $\bigl(\mathcal{C}^{(\rho)} = (\mathcal{C}^{(\rho)}_{a,b,c}(t) : t \ge 0),\ \mathcal{D}^{(\infty)} = (\mathcal{D}^{(\infty)}_{a,b,c}(t) : t \ge 0)\bigr)$, where $\mathcal{D}^{(\infty)}$ is a simple process closely related to $\mathcal{C}^{(\infty)}$ and defined below. $\mathcal{C}^{(\rho)}(\omega)$ is said to be coupled to $\mathcal{D}^{(\infty)}(\omega)$ if the two realizations have the same marginal coalescent tree at locus A and the same marginal coalescent tree at locus B. Since it is the marginal trees which govern the mutation process at each locus, coupled processes therefore have the same sampling distribution. (There should be no ambiguity arising from the fact that our coupling is not on pairs of realizations but on pairs of equivalence classes, where an equivalence class of $\mathcal{C}^{(\rho)}$ or of $\mathcal{D}^{(\infty)}$ is a set of realizations with the same marginal tree at locus A and the same marginal tree at locus B.)

A complete description of a coalescent process is one taking values in partitions of [n], as introduced by Kingman (1982), with natural extensions to incorporate recombination. We opt instead to represent 𝒞(ρ) only by its ancestral process; that is, as a birth-death process on the number of each type of lineage. Such a process is studied in depth by Ethier and Griffiths (1990) and Griffiths (1991). In what follows it is understood implicitly that for any given realization of the ancestral process one could reconstruct a complete coalescent process—an ARG—given some additional independent randomness. Provided the ancestral processes of 𝒞(ρ) and 𝒟(∞) remain coupled, then it is also always possible to couple their respective coalescent processes. For example, a decrease by one in the ancestral process corresponds to a coalescence event in the coalescent process, which can be realized by merging two uniformly chosen blocks in the partition of [n]. A coupling of two ancestral processes lets us couple the corresponding coalescent processes if we always pick the same pair of blocks to merge in the two processes. With this kept in mind, it is sufficient for the argument developed below to consider the simpler ancestral process representation.

Recall the two-locus ancestral process for the coalescent with recombination: going backwards in time, each pair of lineages coalesces independently at rate 1, and each lineage ancestral at both loci recombines at rate ρ/2. When two lineages coalesce, they are replaced with a single lineage, and this lineage is ancestral at a given locus if either of its two progenitors was ancestral at this locus. Thus for example, with a, b, and c defined as above, the total rate of coalescence involving one left half-fragment and one right half-fragment is ab, resulting in a transition of the form (a, b, c) ↦ (a − 1, b − 1, c + 1). The remaining transitions are given in Table 2. We can now make the following concise definition.

Table 2.

Transition rates of events in the two ancestral processes 𝒞(ρ) and 𝒟(∞).

Type   Transition (a, b, c, d) ↦                                   Rate in 𝒞(ρ)         Rate in 𝒟(∞)
I      (a, b, c − 1, d − 1)                                         c(c − 1)/2 *         0
II     (a, b, c − 1, d)                                             0                    c(c − 1)/2
III    (a, b, c, d − 1)                                             0                    d(d − 1)/2
IV     (a − 1, b, c, d)                                             a(a + 2c − 1)/2      a(a + 2c − 1)/2
V      (a, b − 1, c, d)                                             b(b + 2d − 1)/2      b(b + 2d − 1)/2
VI     (a − 1, b − 1, c + 1, d + 1)                                 ab                   ab
VII    (a + 𝕀{c > 0}, b + 𝕀{d > 0}, c − 𝕀{c > 0}, d − 𝕀{d > 0})     ρc/2 *               ρ max{c, d}/2

* Defined only when c = d.

Definition 4.1

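The ancestral process $\mathcal{C}^{(\rho)} = (\mathcal{C}^{(\rho)}_{a,b,c}(t) : t \ge 0)$ is a continuous-time Markov process on $\mathbb{N}^4$ such that $\mathcal{C}^{(\rho)}_{a,b,c}(0) = (a,b,c,c)$ a.s., and with infinitesimal generator

$$
\begin{aligned}
\mathcal{L} f(a,b,c,c) &= \frac{\rho c}{2} f(a+1,b+1,c-1,c-1) + \binom{c}{2} f(a,b,c-1,c-1) + R_{a,b,c,c}\, \mathcal{G} f(a,b,c,c) \\
&\quad - \Bigl[\frac{\rho c}{2} + \binom{c}{2} + R_{a,b,c,c}\Bigr] f(a,b,c,c),
\end{aligned}
\tag{22}
$$

where

$$R_{a,b,c,d} = ab + ac + bd + \binom{a}{2} + \binom{b}{2},$$

$$\mathcal{G} f(a,b,c,d) = \frac{1}{2 R_{a,b,c,d}} \bigl[2ab\, f(a-1,b-1,c+1,d+1) + a(a+2c-1)\, f(a-1,b,c,d) + b(b+2d-1)\, f(a,b-1,c,d)\bigr],$$

and $f : \mathbb{N}^4 \to \mathbb{R}$ is an appropriate test function.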

Regard the third and fourth entries in $f$ as the number of left- and right-halves of full fragments; these entries are always equal. This representation is seemingly redundant, but it will make the coupling with the corresponding process $\mathcal{D}^{(\infty)}$ (for which we allow $c \ne d$) transparent. We will define $\mathcal{D}^{(\infty)}$ via the following recipe. First, take $\mathcal{C}^{(\rho)}$ and let $\rho \to \infty$. Ordinarily, $\mathcal{C}^{(\infty)}_{a,b,c}(0)$ moves instantaneously to the state $\mathcal{C}^{(\infty)}_{a+c,b+c,0}(0+)$ and evolves thereafter according to $\mathcal{L} f(a + c, b + c, 0, 0)$. However, our second step is to make a notational change: we reuse the third and fourth entries of $f$ by separately tracking the half-fragment lineages that originated as full fragments: we write it as a process initiated at $(a, b, c, c)$ and evolving according to the generator

$$
\begin{aligned}
\mathcal{L}^{(\infty)} f(a,b,c,d) &= \binom{c}{2} f(a,b,c-1,d) + \binom{d}{2} f(a,b,c,d-1) + R_{a,b,c,d}\, \mathcal{G} f(a,b,c,d) \\
&\quad - \Bigl[\binom{c}{2} + \binom{d}{2} + R_{a,b,c,d}\Bigr] f(a,b,c,d).
\end{aligned}
\tag{23}
$$

Third, we introduce an artificial recombination process which induces transitions of the form (a, b, c, c) ↦ (a + 1, b + 1, c − 1, c − 1) at rate ρc/2. This does not reflect any concrete evolutionary dynamic but merely acts as a mathematical device to facilitate a coupling between the two processes. (As a minor technical detail, we should like to allow the process ultimately to reach a state of the form (a, b, 0, 0). We therefore make a minor adjustment, below, to this artificial process to allow for it to act even if one of c or d is 0.) We therefore have the following definition.

Definition 4.2

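The ancestral process $\mathcal{D}^{(\infty)} = (\mathcal{D}^{(\infty)}_{a,b,c}(t) : t \ge 0)$ is a continuous-time Markov process on $\mathbb{N}^4$ such that $\mathcal{D}^{(\infty)}_{a,b,c}(0) = (a,b,c,c)$ a.s., and with infinitesimal generator

$$
\mathcal{H}^{(\infty)} f(a,b,c,d) := \mathcal{L}^{(\infty)} f(a,b,c,d) + \frac{\rho \max\{c,d\}}{2} \bigl[f(a + \mathbb{I}\{c>0\},\, b + \mathbb{I}\{d>0\},\, c - \mathbb{I}\{c>0\},\, d - \mathbb{I}\{d>0\}) - f(a,b,c,d)\bigr],
\tag{24}
$$

where $\mathcal{L}^{(\infty)}$ is the generator in (23) and $f : \mathbb{N}^4 \to \mathbb{R}$ is an appropriate test function.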

Transitions of this process are also summarized in Table 2, and henceforth we will refer to the numberings of each type of transition given in the table. It is important to keep in mind that although ρ appears as a parameter in (24), the process 𝒟(∞) acts as if the two loci are independent. The process with rate depending on ρ is simply an artificial relabelling of lineages. A key observation is that this artificial process does not affect the distribution of the marginal coalescent trees, so 𝒞(∞) and 𝒟(∞) have the same sampling distribution.

To summarize, we have defined two Markov processes on $\mathbb{N}^4$, $\mathcal{C}^{(\rho)}$ and $\mathcal{D}^{(\infty)}$, which describe two-locus ancestral processes going backwards in time and with respective generators $\mathcal{L}$ and $\mathcal{H}^{(\infty)}$. $\mathcal{L}$ is the generator of a standard process with recombination parameter $\rho$. $\mathcal{H}^{(\infty)}$ is the generator of a standard process with recombination parameter $\infty$ and with the additional properties that left half-fragments are recorded in two categories (of multiplicity $a$ and $c$), right half-fragments are recorded in two categories (of multiplicity $b$ and $d$), and there is an artificial movement of pairs from the latter to the former as if they were still full fragments. This somewhat contrived definition has an important advantage: it is a simple matter to attempt to couple the two processes by matching each kind of event in the two generators whenever possible. A recombination event in $\mathcal{C}^{(\rho)}_{a,b,c}(t)$ can be matched by an artificial recombination event in $\mathcal{D}^{(\infty)}_{a,b,c}(t)$, a coalescence of type IV in $\mathcal{C}^{(\rho)}_{a,b,c}(t)$ can be matched by a coalescence of type IV in $\mathcal{D}^{(\infty)}_{a,b,c}(t)$, and so on.
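The event matching described above can be sketched by writing the rates of Table 2 as two small functions and comparing them at a common state. The names `rates_C`, `rates_D`, and `unmatched_types` are hypothetical, and the sketch covers only the ancestral (lineage-count) processes, not the full partition-valued coalescent.

```python
from math import comb


def rates_C(a, b, c, rho):
    """Event type -> (rate, next state) for C^(rho) at the state (a, b, c, c)."""
    d = c                       # third and fourth entries are always equal
    ic = int(c > 0)
    return {
        "I":   (comb(c, 2),              (a, b, c - 1, d - 1)),
        "IV":  (a * (a + 2 * c - 1) / 2, (a - 1, b, c, d)),
        "V":   (b * (b + 2 * d - 1) / 2, (a, b - 1, c, d)),
        "VI":  (a * b,                   (a - 1, b - 1, c + 1, d + 1)),
        "VII": (rho * c / 2,             (a + ic, b + ic, c - ic, d - ic)),
    }


def rates_D(a, b, c, d, rho):
    """Event type -> (rate, next state) for D^(infinity) at the state (a, b, c, d)."""
    ic, id_ = int(c > 0), int(d > 0)
    return {
        "II":  (comb(c, 2),              (a, b, c - 1, d)),
        "III": (comb(d, 2),              (a, b, c, d - 1)),
        "IV":  (a * (a + 2 * c - 1) / 2, (a - 1, b, c, d)),
        "V":   (b * (b + 2 * d - 1) / 2, (a, b - 1, c, d)),
        "VI":  (a * b,                   (a - 1, b - 1, c + 1, d + 1)),
        "VII": (rho * max(c, d) / 2,     (a + ic, b + id_, c - ic, d - id_)),
    }


def unmatched_types(a, b, c, rho):
    """Event types whose rates differ between the two processes at (a, b, c, c)."""
    rc, rd = rates_C(a, b, c, rho), rates_D(a, b, c, c, rho)
    return sorted(t for t in set(rc) | set(rd)
                  if rc.get(t, (0,))[0] != rd.get(t, (0,))[0])


mismatch_c2 = unmatched_types(1, 1, 2, rho=10.0)   # types that can break the coupling
mismatch_c1 = unmatched_types(1, 1, 1, rho=10.0)   # nothing to break when c = 1
```

At a state with c ≥ 2 the only unmatched types are I, II, and III; with c ≤ 1 every rate agrees, because the binomial coefficient vanishes.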

The aforementioned description is a probabilistic coupling, which may or may not succeed since not all events can be paired off in this way. Comparing (22) and (24), we see that a coupling will fail if there is a type I transition in 𝒞(ρ) or if there is a type II or type III transition in 𝒟(∞). Define the failure times

$$T^{(1)}_{a,b,c} := \inf\{t \ge 0 : \mathcal{C}^{(\rho)}_{a,b,c}(t) = \mathcal{C}^{(\rho)}_{a,b,c}(t-) - (0,0,1,1)\},$$
$$T^{(2)}_{a,b,c} := \inf\{t \ge 0 : \mathcal{D}^{(\infty)}_{a,b,c}(t) = \mathcal{D}^{(\infty)}_{a,b,c}(t-) - (0,0,1,0)\},$$
$$T^{(3)}_{a,b,c} := \inf\{t \ge 0 : \mathcal{D}^{(\infty)}_{a,b,c}(t) = \mathcal{D}^{(\infty)}_{a,b,c}(t-) - (0,0,0,1)\},$$

and

$$T^{\mathrm{MRCA}}_{a,b,c} := \inf\bigl\{t \ge 0 : \mathcal{C}^{(\rho)}_{a,b,c}(s) = \mathcal{D}^{(\infty)}_{a,b,c}(s)\ \forall s \le t,\ \mathcal{C}^{(\rho)}_{a,b,c}(t) \in \{(1,1,0,0),(0,0,1,1)\}\bigr\},$$

the first time that both loci find a most recent common ancestor in the coupled processes (with the convention $\inf \emptyset = \infty$). If $T^{\mathrm{MRCA}}_{a,b,c} < \min\{T^{(1)}_{a,b,c}, T^{(2)}_{a,b,c}, T^{(3)}_{a,b,c}\}$, we say that the coupling has been successful. We are now in a position to verify the observation made in Section 1: that we need consider whether or not a coupling has been successful only as far back as the first time that no lineages ancestral to both loci survive. For if we reach this point then, even further back in time, jointly ancestral lineages may arise again temporarily (with $c \ge 1$), but the coupling can fail only in the unlikely [i.e. $O(\rho^{-2})$] event that $c \ge 2$. We formalize this argument in the following lemma.

Lemma 4.1

If $c \in \{0, 1\}$, the coupling between $\mathcal{C}^{(\rho)}$ and $\mathcal{D}^{(\infty)}$ fails with probability $O(\rho^{-2})$, as $\rho \to \infty$.

Proof

The three events causing the coupling to fail occur at rates proportional to $\binom{c}{2}$ and thus require $c \ge 2$. For the pair $(\mathcal{C}^{(\rho)}_{a,b,1}, \mathcal{D}^{(\infty)}_{a,b,1})$, we therefore first need to see a transition of the form $(a', b', 1, 1) \mapsto (a' - 1, b' - 1, 2, 2)$ for some $a'$, $b'$, followed by one of the transitions causing the coupling to fail. Reading off the rates from the generators, each of these transitions occurs with probability $O(\rho^{-1})$. The case $c = 0$ is similar, first needing a transition of the form $(a', b', 0, 0) \mapsto (a' - 1, b' - 1, 1, 1)$ whose probability is $O(1)$.

Lemma 4.2

The coupling between 𝒞(ρ) and 𝒟(∞) fails with the following probabilities:

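$$\mathbb{P}(I^{(k)}) = \frac{1}{\rho}\binom{c}{2} + O\Bigl(\frac{1}{\rho^2}\Bigr) \quad \text{as } \rho \to \infty, \quad k = 1, 2, 3, \tag{25}$$

where $I^{(k)} := \{T^{(k)}_{a,b,c} < T^{\mathrm{MRCA}}_{a,b,c}\}$. Moreover, $\mathbb{P}(I^{(k_1)} \cap I^{(k_2)}) = O(\rho^{-2})$ for $k_1 \ne k_2$.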

Proof

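For $k = 1$, by Lemma 4.1 it is enough to show that

$$\mathbb{P}\bigl(T^{(1)}_{a,b,c} < U^{(1)}_{a,b,c}\bigr) = \frac{1}{\rho}\binom{c}{2} + O\Bigl(\frac{1}{\rho^2}\Bigr),$$

where

$$U^{(1)}_{a,b,c} := \inf\bigl\{t \ge 0 : \mathcal{C}^{(\rho)}_{a,b,c}(t) \in \{(a', b', 0, 0) : a', b' \in \mathbb{N}\}\bigr\}$$

is the first time $\mathcal{C}^{(\rho)}$ reaches $c = 0$. We proceed by induction on $c$; Lemma 4.1 provides the base cases $c \in \{0, 1\}$. First note that for any $c \ge 1$,

$$\mathbb{P}\bigl(T^{(1)}_{a,b,c} < U^{(1)}_{a,b,c}\bigr) = O\Bigl(\frac{1}{\rho}\Bigr), \tag{26}$$

since this event requires at least one transition that is not a recombination. Reading off the relevant probabilities from (22), we have for $c \ge 2$:

$$
\begin{aligned}
\mathbb{P}\bigl(T^{(1)}_{a,b,c} < U^{(1)}_{a,b,c}\bigr)
&= \frac{\frac{\rho c}{2}}{\frac{\rho c}{2} + \binom{c}{2} + R_{a,b,c,c}} \cdot \mathbb{P}\bigl(T^{(1)}_{a+1,b+1,c-1} < U^{(1)}_{a+1,b+1,c-1}\bigr) \\
&\quad + \frac{ab}{\frac{\rho c}{2} + \binom{c}{2} + R_{a,b,c,c}} \cdot \mathbb{P}\bigl(T^{(1)}_{a-1,b-1,c+1} < U^{(1)}_{a-1,b-1,c+1}\bigr) \\
&\quad + \frac{\binom{c}{2}}{\frac{\rho c}{2} + \binom{c}{2} + R_{a,b,c,c}} \cdot 1 + O\Bigl(\frac{1}{\rho^2}\Bigr) \\
&= \frac{1}{\rho}\binom{c}{2} + O\Bigl(\frac{1}{\rho^2}\Bigr),
\end{aligned}
$$

by the inductive hypothesis for the first term on the right and using (26) for the second term. By considering

$$U^{(k)}_{a,b,c} := \inf\bigl\{t \ge 0 : \mathcal{D}^{(\infty)}_{a,b,c}(t) \in \{(a', b', 0, 0) : a', b' \in \mathbb{N}\}\bigr\}, \quad k = 2, 3,$$

the cases $k = 2, 3$ are similar. $\mathbb{P}(I^{(k_1)} \cap I^{(k_2)}) = O(\rho^{-2})$ also follows from the fact that this event requires at least two transitions which are not recombinations during the time that $c > 0$.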
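The leading-order claim for k = 1 can be checked numerically. The sketch below computes the probability that the jump chain of 𝒞(ρ) makes a type I transition before reaching c = 0, using the transition probabilities implied by (22), and compares ρ times that probability with the binomial coefficient. The function name, the iterative solver, and the tolerances are illustrative assumptions; mutation plays no part, as in (22).

```python
from math import comb


def failure_probability(a, b, c, rho, tol=1e-15, max_sweeps=100_000):
    """P(type I coalescence occurs before c hits 0), started from (a, b, c, c)."""
    if c == 0:
        return 0.0
    start = (a, b, c)
    transitions = {}          # state -> list of (jump probability, next state or "fail")
    stack = [start]
    while stack:              # enumerate the finite set of reachable states with c >= 1
        state = stack.pop()
        if state in transitions:
            continue
        x, y, z = state
        moves = [
            (rho * z / 2, (x + 1, y + 1, z - 1)),          # recombination (VII)
            (comb(z, 2), "fail"),                          # type I coalescence
            (x * y, (x - 1, y - 1, z + 1)),                # type VI
            (x * (x + 2 * z - 1) / 2, (x - 1, y, z)),      # type IV
            (y * (y + 2 * z - 1) / 2, (x, y - 1, z)),      # type V
        ]
        total = sum(rate for rate, _ in moves)
        transitions[state] = [(rate / total, nxt) for rate, nxt in moves if rate > 0]
        for _, nxt in transitions[state]:
            if nxt != "fail" and nxt[2] > 0:
                stack.append(nxt)

    # Fixed-point sweeps for the hitting probabilities; states with c = 0 count as 0.
    prob = {state: 0.0 for state in transitions}
    for _ in range(max_sweeps):
        change = 0.0
        for state, moves in transitions.items():
            new = sum(p * (1.0 if nxt == "fail" else prob.get(nxt, 0.0))
                      for p, nxt in moves)
            change = max(change, abs(new - prob[state]))
            prob[state] = new
        if change < tol:
            break
    return prob[start]


rho = 1e5
scaled = {start: rho * failure_probability(*start, rho)
          for start in [(0, 0, 2), (0, 0, 4), (2, 1, 3)]}
```

For large ρ the scaled values should sit close to 1, 6, and 3 respectively, the binomial coefficients for c = 2, 4, and 3, with a discrepancy of order 1/ρ.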

Should the coupling fail, we can say much about the sequence of events prior to $U^{(k)}_{a,b,c}$. Intuitively, the probability that more than one transition other than recombinations occurs is $O(\rho^{-2})$. To make this precise we denote by $\mathcal{S}^{(k)}_{a,b,c}(t)$ the jump chain up to time $t$ of $\mathcal{C}^{(\rho)}$ if $k = 1$ and of $\mathcal{D}^{(\infty)}$ if $k = 2, 3$.

Lemma 4.3

Let $\boldsymbol{\mathcal{S}}_{a,b,c}$ denote the set of jump chains comprising sequences which start at $(a, b, c, c)$, end at the first entry of the form $(a', b', 0, 0)$, $a', b' \in \mathbb{N}$, and with all transitions corresponding to recombination events, except for possibly one transition. Then

$$\mathbb{P}\bigl(\mathcal{S}^{(k)}_{a,b,c}(U^{(k)}_{a,b,c}) \in \boldsymbol{\mathcal{S}}_{a,b,c} \,\big|\, I^{(k)}\bigr) = 1 - O\Bigl(\frac{1}{\rho}\Bigr) \quad \text{as } \rho \to \infty, \quad k = 1, 2, 3.$$

Proof

The non-recombination event causing $I^{(k)}$ occurs at time $T^{(k)}_{a,b,c}$. Inspection of the generators (22) and (24) shows that any further transition other than a recombination occurs with probability $O(\rho^{-1})$ during the time that $c > 0$.

Recall that our purpose is to obtain the sampling distribution for $\mathcal{C}^{(\rho)}$. For successful couplings, this is easy to obtain since it is the same as that of $\mathcal{D}^{(\infty)}$ and hence $\mathcal{C}^{(\infty)}$; thus $\mathcal{C}^{(\rho)} \mid I^{(1)\mathsf{C}}$ has the same sampling distribution as $\mathcal{D}^{(\infty)} \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}$. Even if the coupling fails, Lemmata 4.1 and 4.3 demonstrate that the behaviour of $\mathcal{C}^{(\rho)}$ is still predictable enough to recover its sampling distribution up to $O(\rho^{-2})$. Roughly [up to $O(\rho^{-2})$], Lemma 4.3 says: if there is an event that causes the coupling to fail then this is the only non-recombination event in the failing process before $U^{(k)}_{a,b,c}$; by Lemma 4.1, if it has not failed by $U^{(k)}_{a,b,c}$ then the coupling will not fail after $U^{(k)}_{a,b,c}$.

The following theorem is proven in Jenkins and Song (2009); however, the following proof gives a coherent, process-level explanation for the result.

Theorem 4.1

Expressing the sampling distribution for $(\mathcal{C}^{(\rho)}_{a,b,c}(t) : t \ge 0)$ as in (1), the first two terms are given by (2) and (3).

Proof

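Denote by $q_{\mathcal{C}^{(\rho)} \mid I^{(1)}}(a, b, c)$ the sampling distribution of the process $\mathcal{C}^{(\rho)} \mid I^{(1)}$. By Lemmata 4.1 and 4.3, this sampling distribution is obtained up to $O(\rho^{-1})$ by picking a pair of full fragments at random to coalesce, with the remaining $c - 1$ fragments all undergoing recombination, and subsequently running the process as $\mathcal{D}^{(\infty)}_{a+c-1,b+c-1,0}$ $(\overset{\text{a.s.}}{=} \mathcal{C}^{(\infty)}_{a+c-1,b+c-1,0})$. Hence,

$$
\begin{aligned}
q_{\mathcal{C}^{(\rho)} \mid I^{(1)}}(a,b,c) &= \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{\binom{c_{ij}}{2}}{\binom{c}{2}}\, q_{\mathcal{C}^{(\infty)}}(a, b, c - e_{ij}) + O\Bigl(\frac{1}{\rho}\Bigr) \\
&= \sum_{i=1}^{K}\sum_{j=1}^{L} \frac{\binom{c_{ij}}{2}}{\binom{c}{2}}\, q^A(a + c_A - e_i)\, q^B(b + c_B - e_j) + O\Bigl(\frac{1}{\rho}\Bigr).
\end{aligned}
\tag{27}
$$

(We can also ignore the possibility of mutation prior to $U^{(1)}_{a,b,c}$ since, by the same argument as in Lemma 4.3, a mutation occurs during this phase with probability $O(\rho^{-1})$.) Similarly,

$$
\begin{aligned}
q_{\mathcal{D}^{(\infty)} \mid I^{(2)}}(a,b,c) &= \sum_{i=1}^{K} \frac{\binom{c_{i\cdot}}{2}}{\binom{c}{2}}\, q_{\mathcal{C}^{(\infty)}}(a + c_A - e_i, b + c_B, 0) + O\Bigl(\frac{1}{\rho}\Bigr) \\
&= \sum_{i=1}^{K} \frac{\binom{c_{i\cdot}}{2}}{\binom{c}{2}}\, q^A(a + c_A - e_i)\, q^B(b + c_B) + O\Bigl(\frac{1}{\rho}\Bigr),
\end{aligned}
\tag{28}
$$

$$
\begin{aligned}
q_{\mathcal{D}^{(\infty)} \mid I^{(3)}}(a,b,c) &= \sum_{j=1}^{L} \frac{\binom{c_{\cdot j}}{2}}{\binom{c}{2}}\, q_{\mathcal{C}^{(\infty)}}(a + c_A, b + c_B - e_j, 0) + O\Bigl(\frac{1}{\rho}\Bigr) \\
&= \sum_{j=1}^{L} \frac{\binom{c_{\cdot j}}{2}}{\binom{c}{2}}\, q^A(a + c_A)\, q^B(b + c_B - e_j) + O\Bigl(\frac{1}{\rho}\Bigr),
\end{aligned}
\tag{29}
$$

and so, together with Lemma 4.2 and the observation that

$$
\mathbb{P}\bigl([I^{(2)} \cup I^{(3)}]^{\mathsf{C}}\bigr)\, q_{\mathcal{D}^{(\infty)} \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}}(a,b,c) = q_{\mathcal{D}^{(\infty)}}(a,b,c) - \mathbb{P}(I^{(2)})\, q_{\mathcal{D}^{(\infty)} \mid I^{(2)}}(a,b,c) - \mathbb{P}(I^{(3)})\, q_{\mathcal{D}^{(\infty)} \mid I^{(3)}}(a,b,c) + O(\rho^{-2}),
$$

we obtain

$$
q_{\mathcal{D}^{(\infty)} \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}}(a,b,c) = \Bigl[1 + \frac{2}{\rho}\binom{c}{2}\Bigr] \Bigl[q_{\mathcal{D}^{(\infty)}}(a,b,c) - \frac{1}{\rho}\binom{c}{2}\, q_{\mathcal{D}^{(\infty)} \mid I^{(2)}}(a,b,c) - \frac{1}{\rho}\binom{c}{2}\, q_{\mathcal{D}^{(\infty)} \mid I^{(3)}}(a,b,c)\Bigr] + O\Bigl(\frac{1}{\rho^2}\Bigr).
\tag{30}
$$

The key decomposition is then

$$
\begin{aligned}
q(a,b,c) &= \mathbb{P}(I^{(1)})\, q_{\mathcal{C}^{(\rho)} \mid I^{(1)}}(a,b,c) + \mathbb{P}(I^{(1)\mathsf{C}})\, q_{\mathcal{C}^{(\rho)} \mid I^{(1)\mathsf{C}}}(a,b,c) \\
&= \mathbb{P}(I^{(1)})\, q_{\mathcal{C}^{(\rho)} \mid I^{(1)}}(a,b,c) + \mathbb{P}(I^{(1)\mathsf{C}})\, q_{\mathcal{D}^{(\infty)} \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}}(a,b,c) \\
&= q_0(a,b,c) + \frac{1}{\rho} q_1(a,b,c) + O\Bigl(\frac{1}{\rho^2}\Bigr),
\end{aligned}
\tag{31}
$$

using (25), (27), (28), (29), and (30), with $q_0$, $q_1$ given by (2) and (3), respectively.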
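For readers who want the intermediate step behind (31), the substitution can be sketched as follows. This is a worked expansion under the assumption that the sampling distribution of 𝒟(∞) factorizes as the product of the two single-locus sampling distributions (because 𝒟(∞) and 𝒞(∞) share a sampling distribution and the loci are independent at ρ = ∞); the arguments (a, b, c) are suppressed on the conditional distributions for brevity.

```latex
\begin{align*}
q(a,b,c)
&= \frac{1}{\rho}\binom{c}{2}\, q_{\mathcal{C}^{(\rho)} \mid I^{(1)}}
 + \Bigl[1 - \frac{1}{\rho}\binom{c}{2}\Bigr]\Bigl[1 + \frac{2}{\rho}\binom{c}{2}\Bigr]
   \Bigl[q_{\mathcal{D}^{(\infty)}}
        - \frac{1}{\rho}\binom{c}{2}\, q_{\mathcal{D}^{(\infty)} \mid I^{(2)}}
        - \frac{1}{\rho}\binom{c}{2}\, q_{\mathcal{D}^{(\infty)} \mid I^{(3)}}\Bigr]
 + O(\rho^{-2}) \\
&= q^A(a + c_A)\, q^B(b + c_B) \\
&\quad + \frac{1}{\rho}\Biggl[\binom{c}{2}\, q^A(a + c_A)\, q^B(b + c_B)
   - q^B(b + c_B) \sum_{i=1}^{K} \binom{c_{i\cdot}}{2}\, q^A(a + c_A - e_i) \\
&\qquad\quad - q^A(a + c_A) \sum_{j=1}^{L} \binom{c_{\cdot j}}{2}\, q^B(b + c_B - e_j)
   + \sum_{i=1}^{K}\sum_{j=1}^{L} \binom{c_{ij}}{2}\, q^A(a + c_A - e_i)\, q^B(b + c_B - e_j)\Biggr]
 + O(\rho^{-2}).
\end{align*}
```

The first line uses (25) and (30); the second uses (27), (28), and (29) and discards terms of order ρ⁻². The zeroth-order term and the bracketed coefficient of 1/ρ are the quantities the theorem identifies with q0 and q1.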

Remark 4.1

It may be possible to use similar arguments to obtain a genealogical interpretation of the second-order term, q2 in (1); for example, genealogies with two events that cause the coupling to fail would surely contribute. However, as is clear from the expression for q2 given in Jenkins and Song (2009, 2010), this is not a simple endeavour and it seems difficult to interpret some of the components of q2.

4.2. A new “loose-linkage” coalescent process

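Equation (31) tells us that, up to $O(\rho^{-2})$, we can obtain the correct sampling distribution using the mixture

$$\alpha \bigl[\mathcal{C}^{(\rho)} \mid I^{(1)}\bigr] + (1 - \alpha) \bigl[\mathcal{D}^{(\infty)} \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}\bigr], \qquad \alpha = \frac{1}{\rho}\binom{c}{2},$$

provided $\alpha < 1$. The coupling used to prove Theorem 4.1 demonstrates that we can define a simple stochastic process for weakly correlated loci, $\mathcal{E}^{(\rho)}$, as follows, whose sampling distribution agrees with (2) and (3) up to $O(\rho^{-2})$.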

Algorithm to simulate ℰ(ρ), the loose-linkage coalescent.
  1. With probability $\alpha$, choose a pair uniformly at random from the $c$ full fragments to coalesce, and then choose uniformly from the chains in $\boldsymbol{\mathcal{S}}_{a,b,c}$ compatible with $I^{(1)}$. Such chains are some permutation of a sequence corresponding to this sole coalescence and $c - 1$ recombinations. Inter-event times up to $U^{(1)}_{a,b,c}$ can be sampled according to the rates specified in (22). Go to step 3.

  2. Otherwise (w.p. $1 - \alpha$), sample from $\mathcal{D}^{(\infty)} \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}$ up to time $U^{(2)}_{a,b,c}$ $(= U^{(3)}_{a,b,c})$, which can be achieved by running $\mathcal{D}^{(\infty)}$ as usual according to (24) but banning transitions of the form $(a, b, c, d) \mapsto (a, b, c - 1, d)$ and $(a, b, c, d) \mapsto (a, b, c, d - 1)$. (The rates of these transitions still contribute to the overall rate governing inter-event times, however.) Go to step 3.

  3. Beyond time $U^{(k)}_{a,b,c}$ ($k = 1$ in the first case above and $k = 2$ in the second), construct the remainder of the process independently using $\bigl(\mathcal{C}^{(\infty)}(t - U^{(k)}_{a,b,c}) : t \ge U^{(k)}_{a,b,c}\bigr)$ (with the appropriate starting configuration) back to the first time both loci have found a most recent common ancestor.

An example is shown in Figure 1. Simulation and inference under $\mathcal{E}^{(\rho)}$ should be straightforward, since its dynamics are little more complicated than those of a coalescent process with $\rho = \infty$. Unlike our diffusion process of Section 3, it does not seem easy to write down its sampling distribution to all orders in closed-form, since that of $\mathcal{D}^{(\infty)} \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}$ is not so obvious.
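The jump chain drawn in step 1 can be sketched at the level of lineage counts. The function names below are hypothetical. The position weights are an inference rather than something stated in the text: if "uniformly" is read as uniform over orderings of labelled lineage events, then a coalescence preceded by p recombinations is compatible with a number of orderings proportional to c − 1 − p, which also matches the ratio of the type I rate to the recombination rate in (22). Inter-event times, allelic types, and steps 2 and 3 are omitted.

```python
import random
from math import comb


def mixture_weight(c, rho):
    """alpha = C(c, 2) / rho, the probability of entering step 1."""
    alpha = comb(c, 2) / rho
    if not alpha < 1:
        raise ValueError("the mixture requires alpha < 1")
    return alpha


def step1_jump_chain(a, b, c, coalescence_index):
    """Jump chain from (a, b, c, c) down to c = 0 with exactly one type I event.

    The event at position `coalescence_index` (0-based) is the coalescence of two
    full fragments; every other event is a recombination. The coalescence needs
    at least two full fragments, so the index is at most c - 2.
    """
    if not 0 <= coalescence_index <= c - 2:
        raise ValueError("coalescence needs at least two full fragments")
    state = (a, b, c, c)
    chain = [state]
    for event in range(c):
        x, y, z, w = state
        if event == coalescence_index:
            state = (x, y, z - 1, w - 1)              # type I coalescence
        else:
            state = (x + 1, y + 1, z - 1, w - 1)      # recombination
        chain.append(state)
    return chain


def coalescence_weights(c):
    """Weight of each coalescence position p = 0, ..., c - 2 (assumed form)."""
    return [c - 1 - p for p in range(c - 1)]


def sample_step1_chain(a, b, c, rng=random):
    """Draw a step 1 jump chain with position weights from `coalescence_weights`."""
    position = rng.choices(range(c - 1), weights=coalescence_weights(c))[0]
    return step1_jump_chain(a, b, c, position)


figure_chain = step1_jump_chain(0, 0, 4, coalescence_index=1)
```

With the coalescence placed second, the sketch reproduces the jump chain shown in the left panel of Figure 1 for the initial configuration (0, 0, 4).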

Fig 1.

Sampling from the loose-linkage coalescent, $\mathcal{E}^{(\rho)}$, from an initial configuration (0, 0, 4). Steps of the algorithm in the main text are denoted by circled numbers. Left: Commence from step 1 (probability $\alpha$). Step 1 samples from an approximation to $\mathcal{C}^{(\rho)} \mid I^{(1)}$ which is correct to $O(\rho^{-2})$, back as far as time $U^{(1)}_{0,0,4}$. The jump chain sampled here is $\mathcal{S}^{(1)}_{0,0,4}(U^{(1)}_{0,0,4}) = ((0,0,4,4),(1,1,3,3),(1,1,2,2),(2,2,1,1),(3,3,0,0))$. Thereafter (step 3) the sample is constructed from $\mathcal{C}^{(\infty)}_{3,3,0}(t - U^{(1)}_{0,0,4})$. Right: Commence from step 2 (probability $1 - \alpha$). Step 2 samples from $\mathcal{D}^{(\infty)}_{0,0,4}(t) \mid (I^{(2)} \cup I^{(3)})^{\mathsf{C}}$; a transition which would cause $I^{(2)}$ is banned. Thereafter (step 3) the sample is constructed from $\mathcal{C}^{(\infty)}_{4,4,0}(t - U^{(2)}_{0,0,4})$.

5. Discussion

We have described two novel stochastic models of evolution for loosely linked, or weakly correlated, loci, using both diffusion- and coalescent-based arguments. As a consequence we have obtained deep insight into the simple form of the asymptotic sampling formula given by (2) and (3). Our diffusion model is based on a central limit theorem for density dependent population processes, which may be viewed as a separation of the timescales $N^{\beta}$ and $N$ (in generations), for $0 < \beta < 1$, and pioneered in population genetics by Norman (1975a). This contrasts with most research in this area, which focuses on separating the timescales $N^0 = 1$ and $N$. Indeed, both diffusion (Ethier and Nagylaki, 1980, 1988) and coalescent (Möhle, 1998; Wakeley, 2008) limits of this latter regime have been studied in detail. It is also the setting of the “loose linkage” limit of Ethier and Nagylaki (1989). Our usage of “loose linkage” therefore refers to a scaling intermediate between the usual Wright-Fisher diffusion and that of Ethier and Nagylaki (1989). That the pioneering approach of Norman (1975a) to investigate recombination does not seem to have been considered until now supports the observation that his work is “somewhat neglected” (Wakeley, 2005). It would also be of interest to find a coalescent-based analogue of these results along the lines of Möhle (1998), or even a duality relationship in the manner of Etheridge and Griffiths (2009).

For simplicity we have focused on a two-locus, finite-alleles, neutral model. Most of this article does not hinge heavily on these assumptions, and it should be relatively straightforward to extend our results to incorporate things like natural selection and more sophisticated models of mutation.

Acknowledgments

We gratefully acknowledge the support of the Isaac Newton Institute. Part of this work stemmed from discussions P.F. and Y.S.S. had during the 2010 program on “Statistical Challenges Arising from Genome Resequencing.” We also thank the generous support of the Simons Institute for the Theory of Computing. This work was completed while P.A.J. and Y.S.S. were participating in the 2014 program on “Evolutionary Biology and the Theory of Computing.”

Footnotes

*

Supported in part by EPSRC Research Grant EP/L018497/1 and an NIH Grant R01-GM094402.

Supported in part by an NIH Grant R01-GM094402, and a Packard Fellowship for Science and Engineering.

Contributor Information

Paul A. Jenkins, Department of Statistics, University of Warwick, Coventry CV4 7AL, UK, p.jenkins@warwick.ac.uk.

Paul Fearnhead, Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, UK, p.fearnhead@lancaster.ac.uk.

Yun S. Song, Department of Statistics and Computer Science Division, University of California, Berkeley, Berkeley, CA 94720, USA, yss@stat.berkeley.edu.

References

  1. Baake E, Herms I. Single-crossover dynamics: finite versus infinite populations. Bulletin of Mathematical Biology. 2008;70:603–624. doi: 10.1007/s11538-007-9270-5.
  2. Bhaskar A, Kamm JA, Song YS. Approximate sampling formulae for general finite-alleles models of mutation. Advances in Applied Probability. 2012;44:408–428. doi: 10.1239/aap/1339878718.
  3. Bhaskar A, Song YS. Closed-form asymptotic sampling distributions under the coalescent with recombination for an arbitrary number of loci. Advances in Applied Probability. 2012;44:391–407. doi: 10.1239/aap/1339878717.
  4. Birkner M, Blath J, Eldon B. An ancestral recombination graph for diploid populations with skewed offspring distribution. Genetics. 2013;193:255–290. doi: 10.1534/genetics.112.144329.
  5. Boitard S, Loisel P. Probability distribution of haplotype frequencies under the two-locus Wright-Fisher model by diffusion approximation. Theoretical Population Biology. 2007;71:380–391. doi: 10.1016/j.tpb.2006.12.007.
  6. Chan AH, Jenkins PA, Song YS. Genome-wide fine-scale recombination rate variation in Drosophila melanogaster. PLoS Genetics. 2012;8:e1003090. doi: 10.1371/journal.pgen.1003090.
  7. Etheridge AM, Griffiths RC. A coalescent dual process in a Moran model with genic selection. Theoretical Population Biology. 2009;75:320–330. doi: 10.1016/j.tpb.2009.03.004.
  8. Ethier SN. A limit theorem for two-locus diffusion models in population genetics. Journal of Applied Probability. 1979;16:402–408.
  9. Ethier SN, Griffiths RC. On the two-locus sampling distribution. Journal of Mathematical Biology. 1990;29:131–159.
  10. Ethier SN, Kurtz TG. Markov processes: characterization and convergence. New York: Wiley; 1986.
  11. Ethier SN, Nagylaki T. Diffusion approximations of Markov chains with two time scales and applications to population genetics. Advances in Applied Probability. 1980;12:14–49.
  12. Ethier SN, Nagylaki T. Diffusion approximations of Markov chains with two time scales and applications to population genetics, II. Advances in Applied Probability. 1988;20:525–545.
  13. Ethier SN, Nagylaki T. Diffusion approximations of the two-locus Wright-Fisher model. Journal of Mathematical Biology. 1989;27:17–28. doi: 10.1007/BF00276078.
  14. Ewens WJ. The sampling theory of selectively neutral alleles. Theoretical Population Biology. 1972;3:87–112. doi: 10.1016/0040-5809(72)90035-4.
  15. Ewens WJ. Mathematical Population Genetics. 2nd. New York: Springer-Verlag; 2004.
  16. Fearnhead P, Donnelly P. Estimating recombination rates from population genetic data. Genetics. 2001;159:1299–1318. doi: 10.1093/genetics/159.3.1299.
  17. Feder AF, Kryazhimskiy S, Plotkin JB. Identifying signatures of selection in genetic time series. Genetics. 2014;196:509–522. doi: 10.1534/genetics.113.158220.
  18. Feller W. Diffusion processes in genetics. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability; University of California Press; Berkeley, Calif. 1951. pp. 227–246.
  19. Golding GB. The sampling distribution of linkage disequilibrium. Genetics. 1984;108:257–274. doi: 10.1093/genetics/108.1.257.
  20. Griffiths RC. The two-locus ancestral graph. In: Basawa IV, Taylor RL, editors. Selected proceedings of the Sheffield symposium on applied probability: 18. IMS Lecture Notes—Monograph series. Vol. 18. 1991. pp. 100–117.
  21. Griffiths RC, Jenkins PA, Song YS. Importance sampling and the two-locus model with subdivided population structure. Advances in Applied Probability. 2008;40:473–500. doi: 10.1239/aap/1214950213.
  22. Griffiths RC, Marjoram P. Ancestral inference from samples of DNA sequences with recombination. Journal of Computational Biology. 1996;3:479–502. doi: 10.1089/cmb.1996.3.479.
  23. Jenkins PA, Griffiths RC. Inference from samples of DNA sequences using a two-locus model. Journal of Computational Biology. 2011;18:109–127. doi: 10.1089/cmb.2009.0231.
  24. Jenkins PA, Song YS. Closed-form two-locus sampling distributions: accuracy and universality. Genetics. 2009;183:1087–1103. doi: 10.1534/genetics.109.107995.
  25. Jenkins PA, Song YS. An asymptotic sampling formula for the coalescent with recombination. Annals of Applied Probability. 2010;20:1005–1028. doi: 10.1214/09-AAP646.
  26. Jenkins PA, Song YS. The effect of recurrent mutation on the frequency spectrum of a segregating site and the age of an allele. Theoretical Population Biology. 2011;80:158–173. doi: 10.1016/j.tpb.2011.04.001.
  27. Jenkins PA, Song YS. Padé approximants and exact two-locus sampling distributions. Annals of Applied Probability. 2012;22:576–607. doi: 10.1214/11-AAP780.
  28. Kang HW, Kurtz TG, Popovic L. Central limit theorems and diffusion approximations for multiscale Markov chain models. Annals of Applied Probability. 2014;24:721–759.
  29. Kaplan N, Darden T, Hudson RR. The coalescent process in models with selection. Genetics. 1988;120:819–829. doi: 10.1093/genetics/120.3.819.
  30. Kingman JFC. The coalescent. Stochastic Processes and their Applications. 1982;13:235–248.
  31. Kuhner MK, Yamato J, Felsenstein J. Maximum likelihood estimation of recombination rates from population data. Genetics. 2000;156:1393–1401. doi: 10.1093/genetics/156.3.1393.
  32. Kurtz TG. Limit theorems for sequences of jump Markov processes approximating ordinary differential processes. Journal of Applied Probability. 1971;8:344–356.
  33. Michalowicz JV, Nichols JM, Bucholtz F, Olson CC. A general Isserlis theorem for mixed-Gaussian random variables. Statistics and Probability Letters. 2011;81:1233–1240.
  34. Miura C. On an approximate formula for the distribution of 2-locus 2-allele model with mutual mutations. Genes and Genetic Systems. 2011;86:207–214. doi: 10.1266/ggs.86.207.
  35. Möhle M. A convergence theorem for Markov chains arising in population genetics and the coalescent with selfing. Advances in Applied Probability. 1998;30:493–512.
  36. Nagylaki T. The Gaussian approximation for random genetic drift. In: Karlin S, Nevo E, editors. Evolutionary processes and theory. New York: Academic Press; 1986. pp. 629–642.
  37. Nagylaki T. Models and approximations for random genetic drift. Theoretical Population Biology. 1990;37:192–212.
  38. Nielsen R. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics. 2000;154:931–942. doi: 10.1093/genetics/154.2.931.
  39. Norman MF. Markov processes and learning models. Mathematics in science and engineering. Vol. 84. New York: Academic Press; 1972.
  40. Norman MF. Approximation of stochastic processes by Gaussian diffusions, and applications to Wright-Fisher genetic models. SIAM Journal on Applied Mathematics. 1975a;29:225–242.
  41. Norman MF. Limit theorems for stationary distributions. Advances in Applied Probability. 1975b;7:561–575.
  42. Ohta T, Kimura M. Linkage disequilibrium due to random genetic drift. Genetical Research. 1969a;13:47–55.
  43. Ohta T, Kimura M. Linkage disequilibrium at steady state determined by random genetic drift and recurrent mutations. Genetics. 1969b;63:229–238. doi: 10.1093/genetics/63.1.229.
  44. Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLOS Genetics. 2014;10:e1004342. doi: 10.1371/journal.pgen.1004342.
  45. Wakeley J. The limits of theoretical population genetics. Genetics. 2005;169:1–7. doi: 10.1093/genetics/169.1.1.
  46. Wakeley J. Coalescent theory: an introduction. Greenwood Village, Colorado: Roberts & Company Publishers; 2008.
  47. Wakeley J, Sargsyan O. The conditional ancestral selection graph with strong balancing selection. Theoretical Population Biology. 2009;75:355–364. doi: 10.1016/j.tpb.2009.04.002. [DOI] [PubMed] [Google Scholar]
  48. Wang Y, Rannala B. Bayesian inference of fine-scale recombination rates using population genomic data. Philosophical Transactions of the Royal Society B. 2008;363:3921–3930. doi: 10.1098/rstb.2008.0172. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Wright S. Adaptation and selection. In: Jepson GL, Mayr E, Simpson GG, editors. Genetics, Paleontology and Evolution. Princeton: Princeton University Press; 1949. pp. 365–389. [Google Scholar]
