Genetics. 2015 Mar 17;200(1):343–355. doi: 10.1534/genetics.114.173898

The SMC′ Is a Highly Accurate Approximation to the Ancestral Recombination Graph

Peter R Wilton *,1, Shai Carmi †,2, Asger Hobolth ‡,2
PMCID: PMC4423375  PMID: 25786855

Abstract

Two sequentially Markov coalescent models (SMC and SMC′) are available as tractable approximations to the ancestral recombination graph (ARG). We present a Markov process describing coalescence at two fixed points along a pair of sequences evolving under the SMC′. Using our Markov process, we derive a number of new quantities related to the pairwise SMC′, thereby analytically quantifying for the first time the similarity between the SMC′ and the ARG. We use our process to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC′ is the same as it is marginally under the ARG, which demonstrates that the SMC′ is, in a particular well-defined, intuitive sense, the most appropriate first-order sequentially Markov approximation to the ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC′ they are approximately asymptotically unbiased.

Keywords: sequentially Markov coalescent, ancestral recombination graph, consistency, ergodicity, Markov approximation


Of the many models of genetic variation in the field of population genetics, few have as much relevance in the era of genomics as the ancestral recombination graph (ARG). The ancestral recombination graph models patterns of ancestry and genetic variation within sequences experiencing recombination under neutral conditions (Hudson 1991; Griffiths and Marjoram 1997). Under the formulation of Griffiths and Marjoram (1997), lineages recombine apart and coalesce together back in time to produce a graph structure describing the ancestral genealogy at each point along a continuous chromosome. While only a few simple rules govern the process, many aspects of the model are analytically intractable.

Wiuf and Hein (1999) provided a formulation of the ARG that proceeds across the chromosome (rather than back in time), producing the genealogy at each point sequentially. As with the back-in-time formulation of the ARG, at each point along the chromosome there is a local genealogy describing the ancestry of the sample at that point, and changes in the genealogy occur at recombination sites. In this sequential formulation of the ARG, a new lineage is produced wherever an ancestral recombination event is encountered along the chromosome. To produce a new genealogy at the recombination site, the new lineage is coalesced to the ARG representing the ancestry of all previous points along the chromosome. This dependence on all previous points makes the process non-Markovian along the chromosome and, together with a large state space, often makes calculations intractable.

Approximations to the ARG have been suggested with the goal of modeling coalescence with recombination in a way that is analytically tractable. McVean and Cardin (2005) introduced the sequentially Markov coalescent (SMC). The original formulation of the SMC was sequential, generating genealogies along the chromosome such that each new genealogy depends only on the previous genealogy. Like the ARG, the SMC has both a back-in-time formulation and a sequential formulation. The back-in-time formulation of the SMC is equivalent to that of the ARG except that coalescence is allowed only between lineages containing overlapping ancestral material. As a consequence, in the sequential formulation of the pairwise (n=2 chromosomes) SMC, each recombination event produces a new pairwise coalescence time.

Marjoram and Wall (2006) introduced a slight modification to the SMC, termed the SMC′, which retains the Markov behavior along the chromosome but models additional coalescence events that make it a closer approximation to the ARG. Specifically, in the back-in-time formulation of the SMC′, coalescence is allowed between lineages containing either overlapping or adjacent ancestral material. In the sequential formulation of the pairwise SMC′, this means that not every recombination event necessarily produces a new pairwise coalescence time, since two lineages created by a recombination event can coalesce back together. Figure 1 shows the transitions that are permitted under the back-in-time and sequential formulations of the pairwise ARG, SMC, and SMC′. The sequentially Markov coalescent models have been used in many recently introduced population-genetic, model-based inference procedures, including the pairwise SMC (PSMC) model (Li and Durbin 2011), the multiple SMC (MSMC) model (Schiffels and Durbin 2014), diCal (Sheehan et al. 2013), coalHMM (Hobolth et al. 2007; Dutheil et al. 2009), and ARGWeaver (Rasmussen et al. 2014), and in a study of past human demography based on tracts of identity by state (Harris and Nielsen 2013).

Figure 1.

Transitions permitted under the pairwise ARG, SMC′, and SMC models. Under “sequential transitions,” a transition occurs left to right across the chromosome at the rightmost recombination event (marked with a red line). The ith coalescence time is labeled ti. Under “back-in-time transitions,” the arrow indicates a coalescence event that occurs between two aligned ancestral chromosomes, each carrying a combination of ancestral (solid black lines) and nonancestral material (dashed gray lines). Ancestral material is defined as a portion of a chromosome that is ancestral to the sample.

The SMC′ was shown by simulation to produce measurements of linkage disequilibrium more similar to the ARG than those produced by the SMC (Marjoram and Wall 2006) and was hence used as the preferred model by some recent studies (Harris and Nielsen 2013; Schiffels and Durbin 2014; Zheng et al. 2014). Additionally, a number of recent studies have explored the theoretical properties of the SMC′ (Eriksson et al. 2009; Harris and Nielsen 2013; Carmi et al. 2014; Schiffels and Durbin 2014; Zheng et al. 2014). However, few direct comparisons between the SMC′ and the ARG have been made, and a number of open questions remain. Here, we show how the joint distribution of pairwise coalescence times at two fixed points along a chromosome evolving under the SMC′ can be described by a continuous-time Markov chain. Through analysis of this Markov chain, we calculate many statistical properties of the pairwise SMC′ and compare them to those of the ARG and the SMC. Specifically, for each model of coalescence with recombination, we compare the following: the joint density fT1,T2(t1,t2) (Joint probability density functions), the conditional density fT2|T1(t2|t1) (Conditional distribution of coalescence times), and the covariance between T1 and T2, which we show to be equal to the probability that T1 and T2 are the same (Covariance of coalescence times). These quantities are readily related to measures of linkage disequilibrium in real sequence data.

Using our Markov process for the two-locus, pairwise SMC′, we also show that the joint distribution of coalescence times immediately to the left and right of a recombination event is the same under the SMC′ and ARG. This allows us to calculate the asymptotic bias of the pairwise SMC- and SMC′-based population-size estimators, which we confirm by simulation. We show that the SMC′ estimator is approximately asymptotically unbiased.

Results

Two-locus Markov chain model for the SMC and SMC′

Here, we present back-in-time Markov processes for the two-locus SMC and SMC′. Previous work has developed analogous two-locus, back-in-time Markov processes for the ARG. Kaplan and Hudson (1985) first described how the process of generating coalescence times at two linked loci modeled by the ARG can be represented as a continuous-time Markov chain, with coalescence and recombination events causing transitions between states. Simonsen and Churchill (1997) explored this process further for the case where the sample size is n=2 and derived for the ARG many of the quantities we compare against the SMC′ in this article. Subsequent work has extended this approach to study two-locus coalescence distributions in the presence of population structure (Eriksson and Mehlig 2004) and recurrent bottlenecks (Schaper et al. 2012) and to study species-tree concordance at linked loci (Slatkin and Pollack 2006) and coalescence histories at one locus conditional on the history at a nearby locus (Hobolth and Jensen 2014).

We begin by presenting the simpler SMC model, which provides context for the more complex SMC′ model. If time is scaled such that the rate of coalescence is 1 and the total rate of recombination between the two linked loci is ρ/2, then the two-locus ancestral process under the SMC is the process depicted in Figure 2. The process starts in state R0 with two lineages, each containing linked copies of the two loci. From R0, the process transitions with rate ρ to state R1, in which one of the two chromosomes has experienced a recombination event, or with rate 1 to state CB, an absorbing state in which both loci have coalesced. Under the SMC, lineages can coalesce only if they contain overlapping ancestral material, so after entering R1, the process cannot return to the fully linked state R0, and each locus coalesces independently with rate 1 from that time onward. Thus, under the SMC, the joint distribution of coalescence times at two loci is that of

$$(T_1, T_2) \sim (X_0 + R X_L,\; X_0 + R X_R), \tag{1}$$

where X0 ∼ Exp(1 + ρ) is the amount of time to leave R0, R ∼ Bernoulli(ρ/(1 + ρ)) indicates whether the first event is a recombination event, and XL, XR ∼ Exp(1) are the exponential waiting times until coalescence after the first recombination event. All of these random variables are independent in the SMC model, so it is straightforward to calculate many of the quantities we compare in this article, using this representation.
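Representation (1) translates directly into a sampler. The following is a minimal sketch, assuming only the Python standard library; the function names `sample_smc_pair` and `fraction_equal` are illustrative and do not come from the article. It draws (T1, T2) under the two-locus SMC and estimates the fraction of draws with T1 = T2, which should be close to 1/(1 + ρ), the probability that the first event out of R0 is a coalescence.

```python
import random

def sample_smc_pair(rho, rng):
    """Draw (T1, T2) under the two-locus SMC using representation (1)."""
    x0 = rng.expovariate(1.0 + rho)                 # X0: time to leave state R0
    recombined = rng.random() < rho / (1.0 + rho)   # R: first event is a recombination
    if not recombined:
        return x0, x0                               # R0 -> CB: both loci coalesce together
    # After the recombination the two loci coalesce independently at rate 1.
    return x0 + rng.expovariate(1.0), x0 + rng.expovariate(1.0)

def fraction_equal(rho, n, seed=0):
    """Monte Carlo estimate of P(T1 = T2) from n independent draws."""
    rng = random.Random(seed)
    draws = (sample_smc_pair(rho, rng) for _ in range(n))
    return sum(t1 == t2 for t1, t2 in draws) / n

if __name__ == "__main__":
    rho = 1.0
    print("simulated:", fraction_equal(rho, 100_000), "expected:", 1.0 / (1.0 + rho))
```

Exact floating-point equality is safe here because equal times are returned as the same value, while unequal times differ by independent continuous draws.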

Figure 2.

Schematic of the SMC back-in-time Markov process for two loci. The process starts in state R0, and transitions to other states occur with the rates indicated by arrows between states.

The defining rule of the SMC′ model of coalescence with recombination is that only ancestral lineages containing overlapping or contiguous ancestral material can coalesce (Marjoram and Wall 2006). The back-in-time process of coalescence at two fixed loci under this model is the continuous-time Markov chain shown in Figure 3. Under the SMC′, it is necessary to model the number of recombination events that have occurred between the two loci at each point in time. To see that this is the case, consider the state R2 in Figure 3. In this state, two recombination events have occurred between the focal loci, and neither focal locus has coalesced. Because lineages can coalesce only to lineages containing overlapping or adjacent ancestral material, two particular coalescence events would need to occur before the process returns to state R0, regardless of the placement of the recombination events on the two chromosomes.

Figure 3.

Schematic of the SMC′ back-in-time Markov process for two loci. Dashed arrows show transition rates that apply for all Ri. State I is the state in which some portion of the chromosome between the two focal loci has coalesced but neither focal locus has coalesced. The red lines in states R2 and R3 show the coalescence events that take the process to state I.

The SMC′ two-locus Markov process also features an additional state I, which is entered when some portion of the chromosome between the focal loci coalesces before either of the focal loci. Upon entering I it becomes impossible for the process to reenter the initial, fully linked state (R0), so the remaining times until coalescence at the focal loci become independent exponential random variables with mean 1. If Ri is the state in which neither focal locus has coalesced and i recombination events have occurred between the focal loci, the transition rate into I is i − 1. This is due to the fact that each recombination event after the first produces an additional pair of lineages that can coalesce to take the process to I. For each state Ri, i ≥ 1, the number of lineages that can coalesce to take the process to Ri−1 is i, and the rate of transitioning to Ri+1 through recombination is ρ. Transitions to CL and CR occur at rate 1 whenever the process is in state Ri, i ≥ 1. Following Eriksson and Mehlig (2004), we disregard any information about linkage between the two loci after one locus has coalesced, since the rate of coalescence at the uncoalesced locus is 1 regardless of the state of linkage with the coalesced locus.
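The transition rates described above can be turned into a small event-driven simulation. The sketch below is one possible reading of the chain in Figure 3, not the authors' code: it assumes rate 1 from R0 to CB, rate ρ from Ri to Ri+1, rate i from Ri to Ri−1, rate i − 1 from Ri to I, and rate 1 each from Ri to CL and to CR, with independent rate-1 waiting times after one locus has coalesced or after entering I. All function and event names are illustrative.

```python
import random

def sample_smc_prime_pair(rho, rng):
    """Simulate the two-locus SMC' chain once and return (T1, T2)."""
    t, i = 0.0, 0                      # current time; i indexes the state R_i
    while True:
        if i == 0:
            t += rng.expovariate(1.0 + rho)
            if rng.random() < 1.0 / (1.0 + rho):
                return t, t            # R0 -> CB: both loci coalesce together
            i = 1                      # R0 -> R1 by recombination
            continue
        events = ["recombine", "relink", "enter_I", "left", "right"]
        rates = [rho, float(i), float(i - 1), 1.0, 1.0]
        t += rng.expovariate(sum(rates))
        event = rng.choices(events, weights=rates)[0]
        if event == "recombine":
            i += 1                     # R_i -> R_{i+1}
        elif event == "relink":
            i -= 1                     # R_i -> R_{i-1}
        elif event == "enter_I":       # loci now coalesce independently at rate 1
            return t + rng.expovariate(1.0), t + rng.expovariate(1.0)
        elif event == "left":          # CL: left locus coalesces now
            return t, t + rng.expovariate(1.0)
        else:                          # CR: right locus coalesces now
            return t + rng.expovariate(1.0), t

def summarize(rho, n, seed=0):
    """Return (fraction of draws with T1 == T2, sample mean of T1)."""
    rng = random.Random(seed)
    pairs = [sample_smc_prime_pair(rho, rng) for _ in range(n)]
    frac_equal = sum(a == b for a, b in pairs) / n
    mean_t1 = sum(a for a, _ in pairs) / n
    return frac_equal, mean_t1

if __name__ == "__main__":
    print(summarize(rho=1.0, n=50_000))
```

Because the left locus coalesces at total rate 1 from every state, the sample mean of T1 should stay near 1 for any ρ, which gives a quick check that the rates have been entered consistently.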

For comparison, an analogous two-locus continuous-time Markov chain for the ARG is presented in Supporting Information, Figure S1. An equivalent process was studied by Simonsen and Churchill (1997) and others. In this model, state R1 is reached when the first event is a recombination event, and state R2 is reached only after a subsequent recombination event occurs on the ancestral lineage that did not experience the first recombination event, making all ancestral copies of the two loci unlinked.

Joint probability density functions

Considering the SMC′ model above, let R0(t) represent the probability that the two-locus ancestral coalescent process is in state R0 at time t, and let R+(t) represent the probability that the process is in any state Ri, i ≥ 1, or state I, at time t. The joint density of coalescence times at the two focal loci is then

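$$f_{T_1,T_2}(t_1,t_2) = \begin{cases} R_0(t_1) & t_1 = t_2 \\ R_+(t_1)\, e^{-(t_2 - t_1)} & t_1 < t_2 \\ R_+(t_2)\, e^{-(t_1 - t_2)} & t_1 > t_2, \end{cases} \tag{2}$$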

since R0(t) is the rate of entering state CB at time t, and R+(t) is the rate of entering either CL or CR at time t. The joint density for the ARG and SMC is analogously defined, with R+(t) representing R1 and R2 under the ARG and R1 under the SMC. For the ARG and the SMC, the number of states is finite and R0(t) and R+(t) can be solved using matrix exponentiation. For the SMC′, there is an infinite number of states, representing the possibility of an infinite number of recombination events occurring between the two focal loci. In the Appendix, we derive closed-form expressions for R0(t), R+(t), and fT1,T2(t1,t2). The main idea in these derivations is to recognize the structure of the SMC′ in Figure 3 as a birth–death process with killing. In this formulation the states are Ri, i ≥ 0; a birth corresponds to a recombination event (and the birth rate is constant), a death corresponds to a coalescence event (and the death rate is linear), and killing corresponds to leaving the Ri states.
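The closed-form expressions are in the Appendix, but R0(t) and R+(t) can also be approximated numerically by truncating the infinite state space. The sketch below is not the authors' derivation: it integrates the forward equations of the chain with a classical fourth-order Runge-Kutta step, using the rates described for Figure 3 and an assumed cutoff `i_max` on the number of recombination events. The names `smc_prime_occupancy` and `trapezoid` and the default step sizes are illustrative choices.

```python
def smc_prime_occupancy(rho, t_max=15.0, dt=0.005, i_max=40):
    """Approximate R0(t) and R+(t) on a grid; returns (times, r0, r_plus)."""
    idx_I = i_max + 1                      # indices 0..i_max are R_i; the last is state I

    def deriv(p):
        d = [0.0] * (i_max + 2)
        for i in range(i_max + 1):
            out = (1.0 + rho) if i == 0 else (rho + 2 * i + 1)
            d[i] -= out * p[i]             # total rate of leaving R_i
            if i < i_max:
                d[i + 1] += rho * p[i]     # recombination (mass is dropped at i_max)
            if i >= 1:
                d[i - 1] += i * p[i]       # coalescence back to R_{i-1}
                d[idx_I] += (i - 1) * p[i] # coalescence into state I
        d[idx_I] -= 2.0 * p[idx_I]         # I -> CL or CR, rate 1 each
        return d

    def shifted(p, k, h):
        return [x + h * y for x, y in zip(p, k)]

    p = [0.0] * (i_max + 2)
    p[0] = 1.0                             # the process starts in R0
    times, r0, r_plus = [0.0], [1.0], [0.0]
    for step in range(1, int(round(t_max / dt)) + 1):
        k1 = deriv(p)
        k2 = deriv(shifted(p, k1, 0.5 * dt))
        k3 = deriv(shifted(p, k2, 0.5 * dt))
        k4 = deriv(shifted(p, k3, dt))
        p = [x + dt * (a + 2 * b + 2 * c + e) / 6.0
             for x, a, b, c, e in zip(p, k1, k2, k3, k4)]
        times.append(step * dt)
        r0.append(p[0])
        r_plus.append(sum(p[1:]))          # all R_i with i >= 1, plus I
    return times, r0, r_plus

def trapezoid(ys, dt):
    """Trapezoidal rule on an evenly spaced grid."""
    return dt * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

if __name__ == "__main__":
    times, r0, r_plus = smc_prime_occupancy(rho=1.0)
    p_equal = trapezoid(r0, 0.005)               # mass on the diagonal t1 = t2
    p_unequal = 2.0 * trapezoid(r_plus, 0.005)   # mass off the diagonal
    print(p_equal, p_unequal, p_equal + p_unequal)
```

With these curves, the off-diagonal part of (2) follows by multiplying R+ at the smaller of the two times by the exponential of minus the absolute difference. The two printed masses should sum to approximately 1 when the truncation and the time horizon are adequate.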

Figure 4 compares the joint coalescence time distributions under the SMC and SMC′, displaying the differences of these joint distributions with the joint distribution of the ARG. The SMC′ provides a much better fit to the ARG for the range of recombination rates compared. Both the SMC and the SMC′ underestimate the density of outcomes where T1=T2, but this underestimation is substantially less under the SMC′.

Figure 4.

Comparison of the difference in the joint density of coalescence times fT1,T2(t1,t2) between the SMC and ARG (top row) and SMC′ and ARG (bottom row). Comparisons are made for three different recombination rates (ρ=0.1,1.0,5.0).

To summarize the difference between the joint distributions more succinctly, we calculated the total variation distance between the SMC and the ARG and between the SMC′ and the ARG across a range of recombination rates. The total variation distance between the SMC and the ARG is defined as

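$$\mathrm{TV}(\mathrm{SMC},\mathrm{ARG}) = \frac{1}{2}\int_0^\infty\!\!\int_0^\infty \left| f_{\mathrm{SMC}}(t_1,t_2) - f_{\mathrm{ARG}}(t_1,t_2) \right| dt_2\, dt_1, \tag{3}$$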

where fSMC(t1,t2) and fARG(t1,t2) are the joint densities fT1,T2(t1,t2) defined under the SMC and ARG, respectively. The total variation distance between the SMC′ and the ARG is similarly defined. Figure 5 shows the total variation distance from the ARG for the SMC and SMC′ over a range of recombination rates. Total variation distances were calculated numerically. For both the SMC and SMC′, the total variation distance was maximized at some intermediate recombination rate, ρ=1.1 for the SMC and ρ=3.2 for the SMC′.

Figure 5.

Total variation distance between the SMC and ARG (solid line) and the SMC′ and ARG (dashed line) as a function of recombination rate. Total variation distances were calculated numerically.

Conditional distribution of coalescence times

In this section we consider the distribution of coalescence times at one locus given the coalescence time at the other. The conditional density of T2 given T1, fT2|T1(t2|t1), can be calculated by dividing the joint density by the marginal distribution of coalescence times at the left locus:

$$f_{T_2|T_1}(t_2 \mid t_1) = \frac{f_{T_1,T_2}(t_1,t_2)}{e^{-t_1}}. \tag{4}$$

Hobolth and Jensen (2014) introduced a framework for modeling the distribution of T2 given T1, using a time-inhomogeneous continuous-time Markov chain. [Note that the model called SMC′ in Hobolth and Jensen (2014) is an SMC′-like model of two loci that is not based on the continuous-chromosome SMC′. It is different from the SMC′ model we consider here.] This framework can be extended to the SMC′, producing the continuous-time Markov chain shown in Figure S2. Figure 6 compares the conditional density fT2|T1(t2|t1) of coalescence times t2 at the right locus conditioned upon the coalescence times t1 at the left locus for different values of t1 and ρ.

Figure 6.

Comparison of densities of coalescence times t2 at the right locus conditioned upon coalescence times t1 at the left locus. Conditional densities fT2|T1(t2|t1) are shown for the ARG, SMC, and SMC′ models for three different rates of recombination between the two loci (ρ=0.1,1.0,5.0) and three different conditioned-upon coalescence times t1 at the left locus (t1=0.1,1.0,4.0). The area under each curve is P(T2 ≠ t1 | T1 = t1); the conditional probabilities P(T2 = t1 | T1 = t1) are not shown.

We note that recently it was proposed that the mutation rate could be estimated by simulation-based calibration of the increase in mean heterozygosity when moving away from a site of known, low heterozygosity (Lipson et al. 2015). Our expressions for the conditional distribution of coalescence times could provide theoretical expectations for such a statistic.

Covariance of coalescence times

In the two-locus, back-in-time Markov processes for the SMC, SMC′, and ARG, T1 and T2 are equal when the state CB is entered through R0 rather than CL or CR. For the ARG, Simonsen and Churchill (1997) showed that the probability that T1 is equal to T2 is

$$P_{\mathrm{ARG}}(T_1 = T_2) = \frac{\rho + 18}{\rho^2 + 13\rho + 18}. \tag{5}$$

Under the SMC (McVean and Cardin 2005),

$$P_{\mathrm{SMC}}(T_1 = T_2) = \frac{1}{1 + \rho}. \tag{6}$$

Eriksson et al. (2009) used the sequential formulation of the SMC′ to show that

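$$P_{\mathrm{SMC}'}(T_1 = T_2) = \int_0^\infty e^{-t} e^{-\rho\lambda(t)}\, dt = 2^{\rho/2}\, e^{-\rho/4}\, (-\rho)^{-(1/2)-(\rho/4)} \left[ \Gamma\!\left(\frac{2+\rho}{4}\right) - \Gamma\!\left(\frac{2+\rho}{4},\, -\frac{\rho}{4}\right) \right], \tag{7}$$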

where $\lambda(t) = (1 - e^{-2t} + 2t)/4$ is the exponential rate of encountering a change in coalescence time when the local coalescence time is t and $\Gamma(a,b) = \int_b^\infty x^{a-1} e^{-x}\, dx$ is the incomplete gamma function.

For the ARG and SMC, the covariance Cov[T1,T2] is equal to P(T1=T2). Eriksson et al. (2009) showed by simulation that this is also true of the SMC′. Here we present a short proof that this is the case for any two-locus model of coalescence where the marginal distribution of coalescence times is exponential with rate 1.

The expectation E[T1T2] can be derived using the fact that (a − b)² = a² + b² − 2ab:

$$2E[T_1 T_2] = E[T_1^2] + E[T_2^2] - E[(T_1 - T_2)^2] = 2 + 2 - E[(T_1 - T_2)^2 \mid T_1 \neq T_2]\, P(T_1 \neq T_2) = 4 - 2P(T_1 \neq T_2). \tag{8}$$

The final equality in (8) follows from the fact that |T1 − T2| has an exponential distribution with rate 1 when T1 ≠ T2, so that E[(T1 − T2)² | T1 ≠ T2] = 2. Therefore E[T1T2] = 2 − P(T1 ≠ T2) and

$$\mathrm{Cov}[T_1,T_2] = E[T_1 T_2] - E[T_1]E[T_2] = E[T_1 T_2] - 1 = P(T_1 = T_2). \tag{9}$$

This result holds in other situations with exponential coalescence times, for example in the context of the population-divergence model considered by Eriksson et al. (2009) (in which case the marginal distribution is exponential plus a constant) and for the various covariances used by McVean (2002) to calculate $\sigma_d^2$, the approximation to the linkage disequilibrium measure $r^2$.
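As a quick sanity check of the identity Cov[T1,T2] = P(T1 = T2), the SMC representation in (1) can be worked through directly. This is a sketch that uses only the independence of X0, R, XL, and XR stated there, with p = ρ/(1 + ρ) introduced here as shorthand for the success probability of R (so that R² = R and E[R] = p):

$$\begin{aligned} \mathrm{Cov}[T_1,T_2] &= \mathrm{Var}[X_0] + \mathrm{Cov}[R X_L,\, R X_R] \\ &= \frac{1}{(1+\rho)^2} + E[R^2]\,E[X_L]\,E[X_R] - \big(E[R]\,E[X_L]\big)\big(E[R]\,E[X_R]\big) \\ &= \frac{1}{(1+\rho)^2} + p - p^2 = \frac{1}{(1+\rho)^2} + \frac{\rho}{(1+\rho)^2} = \frac{1}{1+\rho}, \end{aligned}$$

which agrees with the value of P(T1 = T2) under the SMC given in (6).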

It is interesting to consider Cov[T1,T2] = P(T1 = T2) when ρ is small. For the ARG, consideration of (5) shows that Cov[T1,T2] = PARG(T1 = T2) = 1 − 2ρ/3 + O(ρ²). Likewise, for the SMC, (6) shows that Cov[T1,T2] = PSMC(T1 = T2) = 1 − ρ + O(ρ²).

For the SMC′, the integral representation of PSMC′(T1 = T2) in (7) allows for the calculation of this quantity as a first-order expansion in ρ:

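$$\mathrm{Cov}[T_1,T_2] = \int_0^\infty e^{-t} e^{-\rho\lambda(t)}\, dt = 1 - \rho\int_0^\infty e^{-t}\lambda(t)\, dt + O(\rho^2) = 1 - \frac{2\rho}{3} + O(\rho^2). \tag{10}$$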

Thus, Cov[T1,T2] [or P(T1 = T2)] is the same up to terms of order ρ² under the ARG and SMC′.
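The three expressions for P(T1 = T2) are easy to compare numerically. The sketch below evaluates (5) and (6) in closed form and approximates the integral in (7) with composite Simpson's rule rather than the incomplete gamma expression; the truncation point `t_max`, the number of intervals `n`, and all function names are assumptions made for illustration.

```python
import math

def p_equal_arg(rho):
    """Equation (5): P(T1 = T2) under the ARG."""
    return (rho + 18.0) / (rho ** 2 + 13.0 * rho + 18.0)

def p_equal_smc(rho):
    """Equation (6): P(T1 = T2) under the SMC."""
    return 1.0 / (1.0 + rho)

def lam(t):
    """Rate of encountering a change in coalescence time, given local time t."""
    return (1.0 - math.exp(-2.0 * t) + 2.0 * t) / 4.0

def p_equal_smc_prime(rho, t_max=50.0, n=20_000):
    """Integral form of (7), by composite Simpson's rule on [0, t_max] (n even)."""
    h = t_max / n

    def f(t):
        return math.exp(-t - rho * lam(t))

    total = f(0.0) + f(t_max)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(k * h)
    return total * h / 3.0

if __name__ == "__main__":
    for rho in (0.01, 0.1, 1.0, 5.0):
        print(rho, p_equal_arg(rho), p_equal_smc_prime(rho), p_equal_smc(rho),
              1.0 - 2.0 * rho / 3.0)
```

For small ρ the ARG and SMC′ columns should track the first-order value 1 − 2ρ/3 closely, while the SMC column follows 1 − ρ.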

Coalescence times at recombination sites

In this section, we show that the joint distribution of coalescence times on either side of a recombination event is the same under the SMC′ and marginally under the ARG, and we derive this distribution. Consider the continuous-time Markov chains representing the two-locus SMC′ and ARG models (Figure 3 and Figure S1, respectively) in the limit of ρ → 0 and conditioning on the first event being a recombination event. These processes represent the joint distribution of coalescence times on either side of a recombination event under the ARG and SMC′. In both of these processes, the waiting time until the first event, conditional on that event being a recombination event, has an exponential distribution with rate 1 + ρ, which converges to 1 as ρ → 0. After that first recombination event, the rate of all additional recombination events converges to zero in the ρ → 0 limit, so all of the remaining events must be coalescence events, each of which occurs with rate 1. Under the ARG and the SMC′, the coalescence events that are possible from state R1 are the same. Thus, the joint distribution of coalescence times at recombination sites is the same under the SMC′ and the ARG.

Figure 7A shows the two-locus continuous-time Markov chain representing this conditional process. This Markov chain starts in a special initial state R0*, out of which the first event is always a recombination event, which happens with rate 1, as described above. In previous sections, we used T1 and T2 to represent the coalescence times at two loci some fixed distance apart. To avoid confusion, in this section we use S and T to represent the coalescence times on the left and right sides of a recombination event, respectively.

Figure 7.

Two-locus continuous-time Markov chains representing the ARG and SMC′ models in the ρ → 0 limit, conditional on the first event being a recombination event. These processes represent the joint distribution of coalescence times on either side of a recombination site under the ARG and SMC′. The state R0* is a special starting state out of which the first event is always a recombination event. A shows the process unconditional on whether S = T, and B shows the process conditional on S ≠ T. The model representing the joint distribution of coalescence times at recombination sites under the SMC is equivalent to the model in B with the transition rates from R1 to CL and CR equal to 1 instead of 3/2.

Recombination events are visible in sequence data only if they change the local coalescence time. Thus, it is of special interest to condition on S ≠ T in the above model to derive the joint distribution of coalescence times on either side of a change in coalescence times under the ARG and SMC′. Conditioning on S ≠ T, the transition out of R1 must be into either CL or CR. These transitions occur with conditional rate 3/2, since the total rate of leaving R1 is 3 in the unconditional model, and two of the ways of leaving R1 result in the coalescence times being different.

The continuous-time Markov chain representing coalescence times on either side of a change in coalescence times (i.e., at recombination sites where S ≠ T) is shown in Figure 7B. Under this model, the joint distribution of S and T is that of

$$(S, T) \sim (X_1 + X_2 + R X_3,\; X_1 + X_2 + (1 - R) X_3), \tag{11}$$

where X1 ∼ Exp(1), X2 ∼ Exp(3), R ∼ Bernoulli(1/2), X3 ∼ Exp(1), and the random variables are independently distributed.

Under the SMC, the continuous-time Markov chain representing the joint distribution of coalescence times at recombination sites is equivalent to the model in Figure 7B with the transition rates from R1 to CL and CR equal to 1 instead of 3/2. Under this model for the SMC, the joint distribution of coalescence times on either side of a recombination event is that of

$$(S, T) \sim (X_1 + X_2,\; X_1 + X_3), \tag{12}$$

where X1, X2, and X3 are mutually independent exponential random variables with rate 1.
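Representations (11) and (12) can be sampled directly, which makes the difference between the models at recombination sites easy to see. The following is a minimal sketch with illustrative names (`sample_site_pair`, `mean_left_time`, and the `model` argument are not from the article). By linearity of expectation, the mean of S is 1 + 1/3 + 1/2 = 11/6 under (11) and 1 + 1 = 2 under (12).

```python
import random

def sample_site_pair(rng, model="smc_prime"):
    """Draw (S, T), the coalescence times flanking a recombination site.

    model="smc_prime" uses representation (11), which also applies to the
    ARG conditional on S != T; model="smc" uses representation (12).
    """
    x1 = rng.expovariate(1.0)
    if model == "smc":
        return x1 + rng.expovariate(1.0), x1 + rng.expovariate(1.0)
    x2 = rng.expovariate(3.0)          # time spent in R1 (total exit rate 3)
    x3 = rng.expovariate(1.0)          # extra time at the locus that coalesces later
    if rng.random() < 0.5:             # R = 1: the left time S is the larger one
        return x1 + x2 + x3, x1 + x2
    return x1 + x2, x1 + x2 + x3       # R = 0: the right time T is the larger one

def mean_left_time(model, n, seed=0):
    """Sample mean of S over n draws."""
    rng = random.Random(seed)
    return sum(sample_site_pair(rng, model)[0] for _ in range(n)) / n

if __name__ == "__main__":
    print("SMC'/ARG:", mean_left_time("smc_prime", 100_000), "vs", 11.0 / 6.0)
    print("SMC:     ", mean_left_time("smc", 100_000), "vs", 2.0)
```

The smaller mean under (11) reflects the faster exit from R1 once the option of the recombined lineage coalescing back onto its own branch is taken into account.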

In File S1, we use these Markov processes to derive the joint, marginal, and conditional distributions of coalescence times at recombination sites under the ARG, SMC′, and SMC. These calculations confirm previous derivations of Carmi et al. (2014) for the SMC′ and Li and Durbin (2011) for the SMC.

SMC′ as the canonical first-order Markov approximation to ARG

Under the sequential formulation of the continuous-chromosome ARG, SMC, and SMC′ models, the infinitesimal probability of a recombination event occurring in the interval (x, x + dx) given the coalescence time s at x is s dx. This fact, together with the fact that the joint distribution of coalescence times at recombination sites is the same under the ARG and SMC′ (whether or not the coalescence time changes), implies that the conditional distribution of coalescence times at point x + dx given the coalescence time at point x is the same under the SMC′ and ARG.

This demonstrates that the pairwise SMC′ is the canonical first-order Markov approximation for the pairwise ARG. Given an infinite-order Markov chain {Xi, i = 0, 1, 2, …}, where the distribution of each Xj depends on all previous Xi, i < j, the canonical k-order Markov approximation to {Xi} is the Markov chain {Xi[k]} satisfying

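$$P\big(X_n^{[k]} \mid X_{n-1}^{[k]} = x_{n-1}, \ldots, X_{n-k}^{[k]} = x_{n-k}\big) = P\big(X_n \mid X_{n-1} = x_{n-1}, \ldots, X_{n-k} = x_{n-k}\big). \tag{13}$$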

That is, the transition probabilities under the k-order canonical Markov approximation are equal to the transition probabilities conditional on the previous k states under the infinite-order chain. See Schwarz (1976), Fernández and Galves (2002), and Gallo et al. (2013) for examples of mathematical studies of canonical Markov approximations of infinite-order Markov chains.

Here we informally extend the terminology of canonical Markov approximations to continuous processes. The SMC′ is the canonical first-order Markov approximation to the ARG because the distribution of coalescence times at x+dx conditional on the coalescence time at x is the same under the ARG (an infinite-order, sequentially non-Markovian continuous process) and the SMC′ (a first-order sequentially Markov continuous process). In this sense, the SMC′ is the most natural first-order sequentially Markov approximation to the ARG.

Asymptotic bias of the population-size estimators under SMC and SMC′

Given the joint density of pairwise coalescence times at recombination sites under the ARG, it is possible to determine the asymptotic bias of maximum-likelihood population size estimators derived from the pairwise SMC and SMC′ likelihood functions. These likelihood functions give the probability of observing a sequence of pairwise coalescence times and corresponding segment lengths across a chromosome under the SMC and SMC′ models. Related likelihood functions (allowing for variable historical population size) are implicitly maximized in the PSMC and MSMC inference procedures (Li and Durbin 2011; Schiffels and Durbin 2014, respectively). These inference procedures are hidden Markov model (HMM) methods in which the local coalescence times (or genealogies) and segment lengths are hidden states inferred from sequence data.

Here, we consider the estimators that would be obtained if the hidden states in these models were actually observable (see also Kim et al. 2015). We are motivated by the fact that any biases of the estimators we investigate are likely to be inherent in the full HMM-based inference procedures, since these biases would be present even with perfect knowledge of an infinite number of coalescence times. Furthermore, by analyzing estimators derived from the hidden coalescence states, we isolate the bias that is due to choice of coalescent algorithm (SMC vs. SMC′) from the bias due to the mutation model or discretization of the continuous hidden states in a full HMM approach to inference.

To investigate the asymptotic properties of these estimators, we assume that data are generated under the ARG, such that at a fixed point the distribution of pairwise coalescence times is exponential with rate 1 and an ancestral segment of length l recombines back in time at rate ρl/2. Segment lengths are measured in units of the true scaled recombination parameter ρ. Data generated under this model can be represented as a sequence of pairwise coalescence times and corresponding segment lengths: {(ti, li): 1 ≤ i ≤ k}.

We are interested in estimating a single relative population size η (defined relative to the true population size, N). If the data are modeled by the SMC or SMC′, the likelihood of a particular value of η is

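$$L(\eta \mid \{(t_i, l_i)\}) = \frac{1}{\eta}\, e^{-t_1/\eta} \prod_{i=2}^{k} q(t_i \mid t_{i-1}; \eta) \prod_{i=1}^{k} \lambda(t_i; \eta)\, e^{-\lambda(t_i;\eta)\, l_i}, \tag{14}$$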

where q(t|s) is the transition function and λ(t;η) is the rate of encountering the end of a segment given t, with both quantities pertaining to the sequentially Markov coalescent model being used to calculate the likelihood.

In the Appendix, we show that if the SMC is used, the maximum-likelihood estimate of η converges to ∼0.95 as the chromosome gets infinitely long. If the SMC′ is used, the estimate is approximately unbiased in the same limit. If the data are reduced to just the segment ages, the likelihood equation is

$$L(\eta \mid \{(t_i, l_i)\}) = \frac{1}{\eta}\, e^{-t_1/\eta} \prod_{i=2}^{k} q(t_i \mid t_{i-1}; \eta). \tag{15}$$

Using this reduced likelihood, the maximum-likelihood estimate is asymptotically unbiased under the SMC′. Under the SMC, the reduced likelihood and the full likelihood produce the same maximum-likelihood estimate (see Appendix).

We confirm the asymptotic bias of the SMC estimator and the apparent lack of asymptotic bias of the SMC′ estimators by simulation. Figure 8 shows 100 simulated estimates calculated using the SMC, SMC′, and reduced SMC′ likelihood functions. Each estimate was calculated using 100 independent pairs of chromosomes simulated under the ARG, with each chromosome of total length 4Nr=1000, where N is the diploid size and r is the per-generation probability of recombination. Likelihood functions were multiplied across independent pairs of chromosomes, and the same set of simulations was used to produce the estimates for all three likelihood functions.

Figure 8.

Maximum-likelihood estimates of relative population size with three different Markov chain likelihood functions. For each simulation, the segment lengths and coalescence times were taken from 100 independent pairs of chromosomes, with each chromosome being of length ρ=4Nr=1000 simulated under the ARG. A maximum-likelihood estimate was calculated using the SMC, SMC′, and times-only SMC′ likelihood functions (Equations A11, A12, and 15, respectively). The true scaled population size is η=1, shown with the dashed blue line. The predicted asymptotic bias of the SMC likelihood function (η̂ ≈ 0.95) is shown with a solid blue line. The sample mean of the estimates calculated with each likelihood function is shown with a solid red line. A total of 100 simulated data sets were analyzed.

Discussion

We have presented a continuous-time Markov chain that describes the pairwise coalescence times at two fixed loci evolving under the SMC′ model of coalescence with recombination. We analyzed this Markov chain to derive the joint distribution of coalescence times at the two loci and the conditional distribution of coalescence times at one locus given the coalescence time at the other. We compared these distributions to those of the ARG and SMC models and found that the difference between the ARG and the SMC′ was much less than the difference between the ARG and the SMC.

We showed that the conditional distribution of coalescence times at point x+dx given the coalescence time at x is the same under the ARG and SMC′. This implies that the SMC′ is the canonical first-order approximation to the pairwise ARG. However, this correspondence is true only of the continuous-chromosome models. If instead the ARG is a model of the genealogies at a sequence of discrete loci, then the first-order canonical Markov approximation is the Markov approximation obtained by modeling a conditional ARG between every successive pair of loci. This model was studied by Hobolth and Jensen (2014), who referred to the model as a “natural” Markov approximation to the ARG. Conceptually similar sequentially Markov coalescent models have been used in the so-called “coalescent hidden Markov model” family of methods (Hobolth et al. 2007; Dutheil et al. 2009; Mailund et al. 2011).

Chen et al. (2009) presented a method of simulating data under higher-order sequentially Markov approximations to the ARG, where the ARG of some number of preceding loci is retained in the process of generating the marginal genealogy at a given locus. They showed by simulation that higher-order approximations generate times to the most recent common ancestor that are more consistent with the ARG than those generated by lower-order approximations, but little theoretical work on these higher-order Markov approximations has been done.

The two-locus Markov chains analyzed in this article assume a single well-mixed population, but natural populations often have more complex demographic histories, featuring, for example, variable historical population sizes, migration between subpopulations, and/or past divergence from other populations. The theoretical properties of the sequential, across-the-chromosome formulations of the pairwise SMC and SMC′ with variable population sizes have been studied previously (Li and Durbin 2011; Schiffels and Durbin 2014). Eriksson et al. (2009) used simulation to study two-locus properties of the SMC′ with population bottlenecks, migration between subpopulations, and divergence between populations. They found that the SMC′ generally performs well in these contexts. The two-locus Markov chains we study here could be extended to include these features (as was done for the ARG by Lessard and Wakeley 2003 and Eriksson and Mehlig 2004), which would provide a framework for calculating exact quantities for the two-locus SMC and SMC′ in the context of structured populations. We leave this for future work.

We calculated that a population size estimator based on the pairwise SMC converges asymptotically to ∼95% of the true population size. This is not a very large bias, but given the continued use of the SMC model in population-genomic inference methods (Palamara et al. 2012; Sheehan et al. 2013; Rasmussen et al. 2014), there is an apparent need to reexamine the consequences of using the simpler SMC model instead of the slightly more complicated SMC′ model. For example, it will be important to consider whether including the possibility of varying population sizes will increase or decrease asymptotic bias. In this context, using the SMC as a basis for a likelihood function may also bias the estimates of the magnitude and timing of population size changes, since the longer segments produced by the ARG will seem younger when they are modeled under the SMC.

Depending on the particular application, it may sometimes be mathematically difficult to employ the SMC′ instead of the SMC. Nevertheless, the SMC′ is the model underlying two recently introduced population-genetic inference methods: the MSMC method of Schiffels and Durbin (2014) (which simplifies to a PSMC′ inference procedure when the number of haplotypes is two) and a procedure based on the distribution of distances between heterozygous bases, introduced by Harris and Nielsen (2013). In each case it was acknowledged that the SMC′ provided more accurate results than the SMC.

From the arguments that led to the development of the continuous-time Markov chains representing the joint distribution of coalescence times at recombination sites (Figure 7), it seems that the joint distribution of coalescence times on either side of a recombination event will be the same under a variety of demographic scenarios. If one were to allow the historical population size to vary, the waiting time until the conditioned-upon recombination event would still be the same under the SMC′ and ARG, and the remaining coalescence events would also be distributed identically. Likewise, when there is population substructure with migration between subpopulations, the distribution of events occurring at recombination sites should be the same under the SMC′ and ARG. Finally, when there are more than two haplotypes sampled, it seems that the joint distributions of genealogies on either side of a recombination event would be the same between the SMC′ and the ARG marginally. These ideas need to be properly explored in future studies, but they suggest that asymptotic bias due to using the SMC′ in place of the ARG will be minimal under a variety of demographic scenarios.

Supplementary Material

Supporting Information

Acknowledgments

We thank Erik van Doorn and Søren Asmussen for identifying the correspondence to birth–death models with killing (see Appendix). We are grateful to John Wakeley, Paul Marjoram, and two anonymous reviewers for comments that helped improve this article. S.C. thanks the Human Frontier Science Program for financial support.

Appendix

Derivation of Joint Density of Pairwise Coalescence Times at Two Loci

To calculate the joint density of coalescence times, it is necessary to calculate $R_j(t)$, the probability that the SMC′ two-locus Markov process (Figure 3) is in state $R_j$ at time t, and I(t), the probability that the SMC′ process is in state I at time t. To solve for $R_j(t)$, one can use the forward Kolmogorov equation (for $j \ge 1$)

$$R_j'(t) = \rho R_{j-1}(t) + (j+1)R_{j+1}(t) - (2j+1+\rho)R_j(t). \tag{A1}$$

Through substitution, the solution to (A1) can be shown to be

$$R_j(t) = R_0(t)\,\frac{\left[(\rho/2)\left(1-e^{-2t}\right)\right]^{j}}{j!}. \tag{A2}$$

To find $R_0(t)$, we note that it is equal to $f_{T_1,T_2}(t,t)$ (see Equation 2). In turn,

$$f_{T_1,T_2}(t,t) = f_{T_1}(t)\,P(T_2=t \mid T_1=t), \tag{A3}$$

where $f_{T_1}(t)=e^{-t}$ is the marginal distribution of coalescence times at the first (or second) locus and $P(T_2=t \mid T_1=t)=e^{-\rho\lambda(t)}$ is the probability of no change in coalescence times given the coalescence time t at the first locus. Here $\lambda(t)=(1-e^{-2t}+2t)/4$ is the exponential rate of encountering a change in coalescence time along the chromosome given that the local coalescence time is t (Eriksson et al. 2009; Carmi et al. 2014). Thus $R_0(t)$ is given by

$$R_0(t) = e^{-t}e^{-\rho\lambda(t)}. \tag{A4}$$

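This completes the solution of $R_j(t)$. Using Figure 3,

$$R_+(t) = I(t) + \sum_{j=1}^{\infty} R_j(t), \tag{A5}$$

where I(t) is the probability that the process is in state I at time t. Using (A2) and (A4), we get

$$\sum_{j=1}^{\infty} R_j(t) = R_0(t)\sum_{j=1}^{\infty}\frac{\left[(\rho/2)\left(1-e^{-2t}\right)\right]^{j}}{j!} = e^{-t}e^{-(\rho/4)\left(1+2t-e^{-2t}\right)}\left[e^{(\rho/2)\left(1-e^{-2t}\right)}-1\right]. \tag{A6}$$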
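As a numerical sanity check on (A2), (A4), and (A6), the following Python sketch evaluates $R_j(t)$ from the closed form and tests, by central finite differences, whether it satisfies the forward equation (A1). The function names, the step size, and any parameter values passed in are illustrative assumptions and not part of the original analysis.

```python
import math

def lam(t):
    """Rate lambda(t) = (1 - e^{-2t} + 2t) / 4 of coalescence-time changes."""
    return (1.0 - math.exp(-2.0 * t) + 2.0 * t) / 4.0

def r0(t, rho):
    """R_0(t) = e^{-t} e^{-rho * lambda(t)}, Equation (A4)."""
    return math.exp(-t - rho * lam(t))

def rj(j, t, rho):
    """R_j(t) = R_0(t) [(rho/2)(1 - e^{-2t})]^j / j!, Equation (A2)."""
    x = 0.5 * rho * (1.0 - math.exp(-2.0 * t))
    return r0(t, rho) * x ** j / math.factorial(j)

def a1_residual(j, t, rho, h=1e-5):
    """Finite-difference dR_j/dt minus the right-hand side of (A1), j >= 1."""
    lhs = (rj(j, t + h, rho) - rj(j, t - h, rho)) / (2.0 * h)
    rhs = (rho * rj(j - 1, t, rho)
           + (j + 1) * rj(j + 1, t, rho)
           - (2 * j + 1 + rho) * rj(j, t, rho))
    return lhs - rhs

def sum_rj_closed(t, rho):
    """Closed form of sum_{j>=1} R_j(t), Equation (A6)."""
    x = 0.5 * rho * (1.0 - math.exp(-2.0 * t))
    return r0(t, rho) * math.expm1(x)
```

For example, `a1_residual(2, 0.7, 2.0)` should be zero up to finite-difference error, and summing `rj(j, 0.7, 2.0)` over j = 1, 2, ... should agree with `sum_rj_closed(0.7, 2.0)`.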

Next, I(t) satisfies the forward Kolmogorov equation

$$I'(t) = \sum_{j=2}^{\infty}(j-1)R_j(t) - 2I(t), \tag{A7}$$

the solution to which is

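$$\begin{aligned} I(t) &= e^{-2t}\int_0^t e^{2u}\sum_{j=2}^{\infty}(j-1)R_j(u)\,du \\ &= \frac{e^{-2t}}{2}\int_0^t R_0(u)\left\{2e^{2u}+e^{(\rho/2)\left(1-e^{-2u}\right)}\left[(\rho-2)e^{2u}-\rho\right]\right\}du \\ &= e^{-2t}\left\{1-e^{-\left(2t(\rho-2)+\rho e^{-2t}-\rho\right)/4}-e^{-\rho/4}\,2^{(\rho-4)/2}(-\rho)^{-(\rho-2)/4}\left[\Gamma\!\left(\frac{\rho-2}{4},-\frac{\rho}{4}\right)-\Gamma\!\left(\frac{\rho-2}{4},-\frac{e^{-2t}\rho}{4}\right)\right]\right\}. \end{aligned} \tag{A8}$$

Here, $\Gamma(a,b)=\int_b^{\infty}x^{a-1}e^{-x}\,dx$ is the incomplete gamma function.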

Together (A4), (A5), (A6), and (A8) give the joint distribution (2) for the SMC′. For the ARG and SMC, the quantities analogous to $R_0(t)$ and $R_+(t)$ can be obtained by exponentiating the rate matrices implicit in Figure 2 and Figure S1. For the SMC, the joint distribution can also be derived using the representation (1).
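The integral form of I(t) can be evaluated without the incomplete gamma function. The Python sketch below assumes the sum in (A7) is computed either as a truncated series or through the closed form $R_0(t)[(x-1)e^{x}+1]$ with $x=(\rho/2)(1-e^{-2t})$, which follows from (A2) and the exponential series. It then obtains I(t) by midpoint quadrature. The function names, the truncation point, and the number of quadrature steps are illustrative assumptions.

```python
import math

def r0(t, rho):
    """R_0(t) from Equation (A4)."""
    return math.exp(-t - rho * (1.0 - math.exp(-2.0 * t) + 2.0 * t) / 4.0)

def inflow_series(t, rho, jmax=100):
    """Truncated series sum_{j=2}^{jmax} (j - 1) R_j(t), with R_j from (A2)."""
    x = 0.5 * rho * (1.0 - math.exp(-2.0 * t))
    term = x  # holds x^j / j!, starting at j = 1
    total = 0.0
    for j in range(2, jmax + 1):
        term *= x / j
        total += (j - 1) * term
    return r0(t, rho) * total

def inflow_closed(t, rho):
    """Closed form R_0(t) [(x - 1) e^x + 1] of the same sum."""
    x = 0.5 * rho * (1.0 - math.exp(-2.0 * t))
    return r0(t, rho) * ((x - 1.0) * math.exp(x) + 1.0)

def i_prob(t, rho, n=4000):
    """I(t) = e^{-2t} * integral_0^t e^{2u} inflow(u) du, midpoint rule."""
    h = t / n
    acc = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        acc += math.exp(2.0 * u) * inflow_closed(u, rho)
    return math.exp(-2.0 * t) * acc * h
```

A finite-difference derivative of `i_prob` can be compared with `inflow_closed(t, rho) - 2 * i_prob(t, rho)` to confirm that the quadrature result satisfies (A7).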

The walk on the states $R_0, R_1, R_2, \ldots$ constitutes a birth–death process with killing, where birth events correspond to additional recombination events taking the process from $R_i$ to $R_{i+1}$; death events correspond to coalescence events that take the process from $R_i$ to $R_{i-1}$; and killing events, which take the process to an absorbing state, here correspond to coalescence events that take the process to $C_L$, $C_R$, or I. Under this formulation, the birth rate is constant, $\lambda_i=\rho$; the death rate is linear, $\mu_i=i$; and the killing rate is linear, $\gamma_i=i+1$. This class of processes was studied by van Doorn and Zeifman (2005), who demonstrated a different approach for calculating $R_i(t)$. This alternative approach (not shown) confirms our derivation of (A4).

Derivation of Asymptotic Bias

We are interested in estimating a single relative population size η (defined relative to the true population size, N), which must be incorporated into the transition density function q(t|s) at recombination sites under the SMC and SMC′. Under the SMC, this transition density function is

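$$q_{\mathrm{SMC}}(t \mid s;\eta)=\begin{cases}\dfrac{1}{s}\left(1-e^{-t/\eta}\right), & t<s,\\[6pt] \dfrac{1}{s}\,e^{-(t-s)/\eta}\left(1-e^{-s/\eta}\right), & t>s.\end{cases} \tag{A9}$$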

This is equivalent to the conditional density (S6 in File S1) with the addition of a relative population size parameter. Under the SMC′, the transition function is

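$$q_{\mathrm{SMC}'}(t \mid s;\eta)=\begin{cases}\dfrac{(2/\eta)\left(1-e^{-2t/\eta}\right)}{1+2s/\eta-e^{-2s/\eta}}, & t<s,\\[8pt] \dfrac{(2/\eta)\,e^{-(t-s)/\eta}\left(1-e^{-2s/\eta}\right)}{1+2s/\eta-e^{-2s/\eta}}, & t>s,\end{cases} \tag{A10}$$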

which is equivalent to the conditional density (S3 in File S1) with a relative population size parameter included.
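The two transition densities can be written down directly and checked for normalization. The Python sketch below assumes the SMC′ normalizing factor is $1+2s/\eta-e^{-2s/\eta}$, the choice under which the density integrates to one over t. The function names and the simple midpoint integrator are illustrative assumptions.

```python
import math

def q_smc(t, s, eta=1.0):
    """Transition density (A9): new time t given old time s under the SMC."""
    lo = min(t, s)
    decay = math.exp(-max(t - s, 0.0) / eta)  # equals 1 when t < s
    return decay * (1.0 - math.exp(-lo / eta)) / s

def q_smc_prime(t, s, eta=1.0):
    """Transition density (A10): new time t given old time s under the SMC'."""
    lo = min(t, s)
    decay = math.exp(-max(t - s, 0.0) / eta)  # equals 1 when t < s
    norm = 1.0 + 2.0 * s / eta - math.exp(-2.0 * s / eta)
    return (2.0 / eta) * decay * (1.0 - math.exp(-2.0 * lo / eta)) / norm

def integrate(f, upper=60.0, n=60000):
    """Midpoint-rule integral of f over (0, upper)."""
    h = upper / n
    return h * sum(f((k + 0.5) * h) for k in range(n))
```

For instance, `integrate(lambda t: q_smc_prime(t, 1.3, 0.8))` should be close to 1, and likewise for `q_smc`.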

Under the SMC, given the local coalescence time t, the distance along the chromosome until the nearest recombination event (measured in units of ρ) is exponentially distributed with rate t (McVean and Cardin 2005). The likelihood function for a single relative population size η under the SMC is thus

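$$L_{\mathrm{SMC}}\left(\eta \mid \{(t_i,l_i)\}\right)=\frac{1}{\eta}e^{-t_1/\eta}\prod_{i=2}^{k}q_{\mathrm{SMC}}(t_i \mid t_{i-1};\eta)\prod_{i=1}^{k}t_i e^{-t_i l_i}\;\propto\;\frac{1}{\eta}e^{-t_1/\eta}\prod_{i=2}^{k}q_{\mathrm{SMC}}(t_i \mid t_{i-1};\eta). \tag{A11}$$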

Under the SMC′, the likelihood function for a relative population size η is

$$L_{\mathrm{SMC}'}\left(\eta \mid \{(t_i,l_i)\}\right)=\frac{1}{\eta}e^{-t_1/\eta}\prod_{i=2}^{k}q_{\mathrm{SMC}'}(t_i \mid t_{i-1};\eta)\prod_{i=1}^{k}\lambda(t_i,\eta)\,e^{-\lambda(t_i,\eta)\,l_i}, \tag{A12}$$

where $\lambda(t,\eta)=\left[\eta\left(1-e^{-2t/\eta}\right)+2t\right]/4$ is the exponential rate of encountering recombination events that change the coalescence time when the local coalescence time is t (Eriksson et al. 2009). Note that under the SMC, the length $l_i$ of a segment is independent of the relative population size η given the local coalescence time $t_i$. This is not true for the SMC′, since the probability that the coalescence time changes at a recombination site depends on the population size.
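The likelihoods (A11) and (A12) translate into log-likelihood functions of η given observed coalescence times and segment lengths. The Python sketch below is a minimal illustration under stated assumptions: the function names, the list-based inputs `times` and `lengths`, and the SMC′ normalizing factor $1+2s/\eta-e^{-2s/\eta}$ are choices made here, not a description of the authors' implementation.

```python
import math

def log_q_smc(t, s, eta):
    """log of the SMC transition density (A9)."""
    lo = min(t, s)
    return (math.log(1.0 - math.exp(-lo / eta))
            - max(t - s, 0.0) / eta - math.log(s))

def log_q_smc_prime(t, s, eta):
    """log of the SMC' transition density (A10)."""
    lo = min(t, s)
    norm = 1.0 + 2.0 * s / eta - math.exp(-2.0 * s / eta)
    return (math.log(2.0 / eta) + math.log(1.0 - math.exp(-2.0 * lo / eta))
            - max(t - s, 0.0) / eta - math.log(norm))

def lam(t, eta):
    """Rate lambda(t, eta) of time-changing recombination events."""
    return (eta * (1.0 - math.exp(-2.0 * t / eta)) + 2.0 * t) / 4.0

def loglik_smc(eta, times, lengths):
    """log of (A11), including the eta-free segment-length factor."""
    ll = -math.log(eta) - times[0] / eta
    ll += sum(log_q_smc(t, s, eta) for s, t in zip(times, times[1:]))
    ll += sum(math.log(t) - t * l for t, l in zip(times, lengths))
    return ll

def loglik_smc_prime(eta, times, lengths):
    """log of (A12); here the segment lengths do carry information on eta."""
    ll = -math.log(eta) - times[0] / eta
    ll += sum(log_q_smc_prime(t, s, eta) for s, t in zip(times, times[1:]))
    ll += sum(math.log(lam(t, eta)) - lam(t, eta) * l
              for t, l in zip(times, lengths))
    return ll
```

Changing `lengths` shifts `loglik_smc` by a constant that does not depend on η, so the SMC estimate depends only on the coalescence times, whereas under `loglik_smc_prime` the lengths affect the location of the maximum.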

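As the length of the chromosome increases and the number of coalescence-time changes goes to infinity, the asymptotic maximum-likelihood estimate $\hat{\eta}$ of the relative population size under the SMC is

$$\begin{aligned} \hat{\eta} &= \lim_{k\to\infty}\operatorname*{argmax}_{\eta}\;\frac{1}{\eta}e^{-t_1/\eta}\prod_{i=2}^{k}q_{\mathrm{SMC}}(t_i \mid t_{i-1};\eta) \\ &= \lim_{k\to\infty}\operatorname*{argmax}_{\eta}\left\{\log\!\left(\frac{1}{\eta}e^{-t_1/\eta}\right)+\sum_{i=2}^{k}\log\left[q_{\mathrm{SMC}}(t_i \mid t_{i-1};\eta)\right]\right\} \\ &= \lim_{k\to\infty}\operatorname*{argmax}_{\eta}\sum_{i=2}^{k}\log\left[q_{\mathrm{SMC}}(t_i \mid t_{i-1};\eta)\right] \\ &= \operatorname*{argmax}_{\eta}\;E_{\mathrm{ARG}}\left[\log\left(q_{\mathrm{SMC}}(T \mid S;\eta)\right)\right] \\ &= \operatorname*{argmax}_{\eta}\int_0^{\infty}\!\!\int_0^{\infty}\pi_{\mathrm{SMC}'}(s)\,q_{\mathrm{SMC}'}(t \mid s;1)\,\log\left(q_{\mathrm{SMC}}(t \mid s;\eta)\right)dt\,ds \\ &\approx 0.95. \end{aligned} \tag{A13}$$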

Here the penultimate equality holds only if there is ergodic (i.e., law-of-large-numbers-like) convergence of the sequence of pairs of coalescence times on either side of a recombination site under the ARG. In File S1, we show that the continuous-chromosome pairwise ARG is ergodic. That is, the mean coalescence time across a long chromosome converges to the mean coalescence time at a single point along the chromosome. We are unable to prove the ergodicity of the sequence of pairs of coalescence times at recombination sites where the coalescence time changes; instead, we note that (A13) is supported by simulation (see above). We also note that Wiuf (2006) proved the ergodicity of the discrete-locus ARG under a variety of neutral demographic models. A similarly in-depth proof may also apply for continuous-chromosome models, but we do not explore the point further.

In (A13), the ultimate equality follows from the fact that the joint distribution of coalescence times is marginally the same at recombination sites under the ARG and the SMC′. Numerical maximization of the double integral shows that the maximum-likelihood estimate of a single population size N under the pairwise SMC is asymptotically biased, with the asymptotic estimate being ∼0.95N.
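The numerical maximization in (A13) can be sketched in Python as follows. This is an illustration under explicit assumptions rather than the authors' computation: the stationary density of coalescence times at change points is taken to be proportional to $e^{-s}\lambda(s)$ (left unnormalized, which does not affect the argmax), the double integral is approximated by a midpoint rule on a truncated grid, and the maximum is located by golden-section search. All function names, the grid size, and the search interval are hypothetical choices.

```python
import math

def lam(s):
    """lambda(s) = (1 - e^{-2s} + 2s) / 4 at true relative size eta = 1."""
    return (1.0 - math.exp(-2.0 * s) + 2.0 * s) / 4.0

def q_smc_prime(t, s):
    """SMC' transition density (A10) at eta = 1."""
    lo = min(t, s)
    norm = 1.0 + 2.0 * s - math.exp(-2.0 * s)
    return 2.0 * math.exp(-max(t - s, 0.0)) * (1.0 - math.exp(-2.0 * lo)) / norm

def log_q_smc(t, s, eta):
    """log of the SMC transition density (A9)."""
    lo = min(t, s)
    return (math.log(1.0 - math.exp(-lo / eta))
            - max(t - s, 0.0) / eta - math.log(s))

def build_grid(upper=12.0, n=240):
    """Midpoint grid with weights e^{-s} lambda(s) q_SMC'(t|s;1).

    Assumption: pi_SMC'(s) is proportional to e^{-s} lambda(s).
    """
    h = upper / n
    pts = [(k + 0.5) * h for k in range(n)]
    return [(math.exp(-s) * lam(s) * q_smc_prime(t, s), s, t)
            for s in pts for t in pts]

def objective(eta, grid):
    """Unnormalized approximation of the double integral in (A13)."""
    return sum(w * log_q_smc(t, s, eta) for w, s, t in grid)

def argmax_eta(grid, lo=0.5, hi=1.5, tol=1e-3):
    """Golden-section search for the eta that maximizes the objective."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = objective(c, grid), objective(d, grid)
    while b - a > tol:
        if fc > fd:  # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = objective(c, grid)
        else:        # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = objective(d, grid)
    return 0.5 * (a + b)
```

Calling `argmax_eta(build_grid())` should return a value near 0.95, in line with (A13). The grid is built once and reused for every η, so the search costs only a few dozen weighted sums.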

Under the ARG, the stationary distribution of lengths between recombination events that change the local coalescence time (i.e., the identity-by-descent segment length distribution) is slightly different from that of the SMC′. (They are different because subsequent recombination events “heal” with slightly different probabilities under the ARG, while under the SMC′, each subsequent recombination event heals with the same probability.) Under the ARG, the identity-by-descent (IBD) length distribution is not currently known. Given that under the SMC′ the maximum-likelihood estimator for a relative population size involves the observed lengths, it is not currently possible to calculate the asymptotic bias of the pairwise SMC′ maximum-likelihood estimator of a single population size. However, the IBD length distribution under the ARG is approximated very closely by the SMC′ IBD length distribution (Carmi et al. 2014), so the SMC′ estimator is likely to be nearly asymptotically unbiased.

Footnotes

Communicating editor: R. Nielsen

Literature Cited

  1. Carmi S., Wilton P. R., Wakeley J., Pe’er I., 2014. A renewal theory approach to IBD sharing. Theor. Popul. Biol. 97: 35–48.
  2. Chen G. K., Marjoram P., Wall J. D., 2009. Fast and flexible simulation of DNA sequence data. Genome Res. 19: 136–142.
  3. Dutheil J. Y., Ganapathy G., Hobolth A., Mailund T., Uyenoyama M. K., et al., 2009. Ancestral population genomics: the coalescent hidden Markov model approach. Genetics 183: 259–274.
  4. Eriksson A., Mehlig B., 2004. Gene-history correlation and population structure. Phys. Biol. 1: 220.
  5. Eriksson A., Mahjani B., Mehlig B., 2009. Sequential Markov coalescent algorithms for population models with demographic structure. Theor. Popul. Biol. 76: 84–91.
  6. Fernández R., Galves A., 2002. Markov approximations of chains of infinite order. Bull. Braz. Math. Soc. 33: 1–12.
  7. Gallo S., Lerasle M., Takahashi D., 2013. Markov approximation of chains of infinite order in the d¯-metric. Markov Processes and Related Fields 19: 51–82.
  8. Griffiths R., Marjoram P., 1997. An ancestral recombination graph, pp. 257–270 in Progress in Population Genetics and Human Evolution (IMA Volumes in Mathematics and Its Applications, Vol. 87), edited by Donnelly P., Tavaré S. Springer-Verlag, New York.
  9. Harris K., Nielsen R., 2013. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 9: e1003521.
  10. Hobolth A., Jensen J. L., 2014. Markovian approximation to the finite loci coalescent with recombination along multiple sequences. Theor. Popul. Biol. 48: 48–58.
  11. Hobolth A., Christensen O. F., Mailund T., Schierup M. H., 2007. Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet. 3: e7.
  12. Hudson R., 1991. Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, Vol. 7, edited by Futuyma D., Antonovics J. Oxford University Press, Oxford, UK.
  13. Kaplan N., Hudson R. R., 1985. The use of sample genealogies for studying a selectively neutral m-loci model with recombination. Theor. Popul. Biol. 28: 382–396.
  14. Kim J., Mossel E., Rácz M. Z., Ross N., 2015. Can one hear the shape of a population history? Theor. Popul. Biol. 100: 26–38.
  15. Lessard S., Wakeley J., 2003. The two-locus ancestral graph in a subdivided population: convergence as the number of demes grows in the island model. J. Math. Biol. 48: 275–292.
  16. Li H., Durbin R., 2011. Inference of human population history from individual whole-genome sequences. Nature 475: 493–496.
  17. Lipson M., Loh P.-R., Sankararaman S., Patterson N., Berger B., et al., 2015. Calibrating the human mutation rate via ancestral recombination density in diploid genomes. bioRxiv. DOI: http://dx.doi.org/10.1101/015560
  18. Mailund T., Dutheil J. Y., Hobolth A., Lunter G., Schierup M. H., 2011. Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genet. 7: e1001319.
  19. Marjoram P., Wall J. D., 2006. Fast “coalescent” simulation. BMC Genet. 7: 16.
  20. McVean G. A. T., 2002. A genealogical interpretation of linkage disequilibrium. Genetics 162: 987–991.
  21. McVean G. A. T., Cardin N. J., 2005. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360: 1387–1393.
  22. Palamara P. F., Lencz T., Darvasi A., Pe’er I., 2012. Length distributions of identity by descent reveal fine-scale demographic history. Am. J. Hum. Genet. 91: 809–822.
  23. Rasmussen M. D., Hubisz M. J., Gronau I., Siepel A., 2014. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 10: e1004342.
  24. Schaper E., Eriksson A., Rafajlovic M., Sagitov S., Mehlig B., 2012. Linkage disequilibrium under recurrent bottlenecks. Genetics 190: 217–229.
  25. Schiffels S., Durbin R., 2014. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46: 919–925.
  26. Schwarz G., 1976. Noninvariance of d¯-convergence of k-step Markov approximations. Ann. Probab. 4: 1033–1035.
  27. Sheehan S., Harris K., Song Y. S., 2013. Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics 194: 647–662.
  28. Simonsen K. L., Churchill G. A., 1997. A Markov chain model of coalescence with recombination. Theor. Popul. Biol. 52: 43–59.
  29. Slatkin M., Pollack J. L., 2006. The concordance of gene trees and species trees at two linked loci. Genetics 172: 1979–1984.
  30. van Doorn E. A., Zeifman A. I., 2005. Birth-death processes with killing. Stat. Probab. Lett. 72: 33–42.
  31. Wiuf C., 2006. Consistency of estimators of population scaled parameters using composite likelihood. J. Math. Biol. 53: 821–841.
  32. Wiuf C., Hein J., 1999. Recombination as a point process along sequences. Theor. Popul. Biol. 55: 248–259.
  33. Zheng C., Kuhner M. K., Thompson E. A., 2014. Bayesian inference of local trees along chromosomes by the sequential Markov coalescent. J. Mol. Evol. 78: 279–292.

Articles from Genetics are provided here courtesy of Oxford University Press
