Genetics. 2013 Jan;193(1):255–290. doi: 10.1534/genetics.112.144329

An Ancestral Recombination Graph for Diploid Populations with Skewed Offspring Distribution

Matthias Birkner *, Jochen Blath , Bjarki Eldon †,1
PMCID: PMC3527250  PMID: 23150600

Abstract

A large offspring-number diploid biparental multilocus population model of Moran type is our object of study. At each time step, a pair of diploid individuals drawn uniformly at random contributes offspring to the population. The number of offspring can be large relative to the total population size. Similar “heavily skewed” reproduction mechanisms have been recently considered by various authors (cf. e.g., Eldon and Wakeley 2006, 2008) and reviewed by Hedgecock and Pudovkin (2011). Each diploid parental individual contributes exactly one chromosome to each diploid offspring, and hence ancestral lineages can coalesce only when in distinct individuals. A separation-of-timescales phenomenon is thus observed. A result of Möhle (1998) is extended to obtain convergence of the ancestral process to an ancestral recombination graph necessarily admitting simultaneous multiple mergers of ancestral lineages. The usual ancestral recombination graph is obtained as a special case of our model when the parents contribute only one offspring to the population each time. Due to diploidy and large offspring numbers, novel effects appear. For example, the marginal genealogy at each locus admits simultaneous multiple mergers in up to four groups, and different loci remain substantially correlated even as the recombination rate grows large. Thus, genealogies for loci far apart on the same chromosome remain correlated. Correlation in coalescence times for two loci is derived and shown to be a function of the coalescence parameters of our model. Extending the observations by Eldon and Wakeley (2008), predictions of linkage disequilibrium are shown to be functions of the reproduction parameters of our model, in addition to the recombination rate. 
Correlations in ratios of coalescence times between loci can be high, even when the recombination rate is high and sample size is large, in large offspring-number populations, as suggested by simulations, hinting at how to distinguish between different population models.

Keywords: ancestral recombination graph (ARG), diploidy, skewed offspring distribution, simultaneous multiple merger coalescent processes, correlation in coalescence times, linkage disequilibrium (LD), ratios of coalescence times


DIPLOIDY, in which each offspring receives two sets of chromosomes, one from each of two distinct diploid parents, is fairly common among natural populations. Mathematical models in population genetics tend to assume, however, that all individuals in a population are haploid, simplifying the mathematics. Mendel’s laws describe the mechanism of inheritance as composed of two main steps, equal segregation (first law) and independent assortment (second law). The first law proclaims gametes are haploid, i.e., carry only one of each pair of homologous chromosomes. Most models in population genetics are thus models of chromosomes or gene copies. Mendel’s second law proclaims independent assortment of alleles at different genes, or loci, into gametes. Linkage of alleles on chromosomes, resulting in nonrandom association of alleles at different loci into gametes, is of course an important exception to the second law.

Coalescent processes (Kingman 1982a,b; Hudson 1983b; Tajima 1983) describe the ancestral relations of chromosomes (or gene copies) drawn from a natural population. The coalescent was initially derived from a Cannings (1974) haploid exchangeable population model. Related ancestral processes take into account population structure (Notohara 1990; Herbots 1997), selection (Krone and Neuhauser 1997; Neuhauser and Krone 1997; Etheridge et al. 2010), and recombination between linked loci (Hudson 1983a; Griffiths 1991; Griffiths and Marjoram 1997). The coalescent has proved to be an important advance in theoretical population genetics and a valuable tool for inference of evolutionary histories of populations.

Ancestral recombination graphs (ARGs) (Hudson 1983a; Griffiths 1991; Griffiths and Marjoram 1997) trace ancestral lineages of gene copies at linked loci, in which linkage is broken up by recombination. An ARG is a branching–coalescing graph, in which recombination leads to branching of ancestral chromosomes and coalescence to segments rejoining. Coalescence events in an ARG may not lead to coalescence of gene copies at individual loci. An example ARG for two linked loci is given below, labeled as ARG(1), with notation borrowed from Durrett (2002). The labels a and b refer to the two alleles (types) at loci 1 and 2, respectively. A single chromosome with two linked alleles is denoted by (ab), while chromosomes carrying ancestral alleles at only one locus are denoted (a) and (b). When coalescence occurs at either locus, the number of alleles at the corresponding locus is reduced by one. The absorbing state, either (ab) or (a)(b), is reached when alleles at both loci have coalesced:

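$$
\begin{aligned}
\text{ARG}(1)&:\ (ab)(ab) \xrightarrow{r} (a)(b)(ab) \xrightarrow{c} (ab)(b) \xrightarrow{r} (a)(b)(b) \xrightarrow{c} (a)(b) \\
\text{ARG}(2)&:\ (ab)(ab) \xrightarrow{r} (a)(b)(ab) \xrightarrow{r} (a)(b)(a)(b) \xrightarrow{c} (a)(b).
\end{aligned}
$$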

In ARG(1), the first transition is a recombination, denoted by r, followed by a coalescence (c), in which the two alleles at locus 1 coalesce. Graph ARG(1) serves to illustrate two important concepts we are concerned with, namely correlation in coalescence times between alleles at different loci and the restriction to binary mergers of ancestral lineages.

Correlation in coalescence times between types at different loci follows from linkage. Alleles at different loci can become associated due to a variety of factors, including changes in population size, natural selection, and population structure. Within-generation fecundity variance polymorphism induces correlation between a neutral locus and the locus associated with the fecundity variance (Taylor 2009). Sweepstake-style reproduction (Hedgecock et al. 1982; Avise et al. 1988; Palumbi and Wilson 1990; Beckenbach 1994; Hedgecock 1994; Árnason 2004; Hedgecock and Pudovkin 2011), in which few individuals produce most of the offspring, has also been shown to induce correlation in coalescence times between loci (Eldon and Wakeley 2008). Understanding genome-wide correlations in coalescence times becomes ever more important as multilocus genetic data become ubiquitous.

The ARG exemplified by ARG(1) is characterized by admitting only binary mergers of ancestral lineages; i.e., exactly two lineages coalesce in each coalescence event. The restriction to binary mergers follows from bounds on the underlying offspring distribution, in which the probability of large offspring numbers becomes negligible in a large population (Kingman 1982a,b). Sweepstake-style reproduction, in which few individuals contribute very many offspring to the population, has been suggested to explain the “shallow” gene genealogy observed for many marine organisms (Hedgecock et al. 1982; Avise et al. 1988; Palumbi and Wilson 1990; Beckenbach 1994; Hedgecock 1994; Árnason 2004; Hedgecock and Pudovkin 2011). Large offspring-number models are models of extremely high variance in individual reproductive output. Namely, individuals can have very many offspring or up to the order of the population size with nonnegligible probability (Sagitov 2003; Schweinsberg 2003; Eldon and Wakeley 2006; Sargsyan and Wakeley 2008; Birkner and Blath 2009). Such models do predict shallow gene genealogies and can be shown to give better fit to genetic data obtained from Atlantic cod (Árnason 2004) than the Kingman coalescent (Birkner and Blath 2008; Birkner et al. 2011; Eldon 2011; Steinrücken et al. 2012). Different large offspring-number models will no doubt be appropriate for different populations, and the identification of large offspring-number population models for each population is an open problem. For the sake of simplicity and mathematical tractability, the simple large offspring-number model considered by Eldon and Wakeley (2006) is adapted to our situation.

The coalescent processes derived from large offspring-number models belong to a large class of multiple-merger coalescent processes introduced by Donnelly and Kurtz (1999), Pitman (1999), and Sagitov (1999). Multiple-merger coalescent processes (Λ-coalescents), as the name implies, admit multiple mergers of ancestral lineages in each coalescence event, in which any number of active ancestral lineages can coalesce, and at most one such merger occurs each time. In simultaneous multiple-merger coalescent processes (Schweinsberg 2000a; Möhle and Sagitov 2001), any number of multiple mergers can occur each time; i.e., distinct groups of active ancestral lineages can coalesce each time. The ancestral recombination graph derived from our diploid large offspring-number model admits simultaneous multiple mergers of ancestral lineages, as exemplified in ARG(2). The last transition in ARG(2) is a simultaneous multiple merger, in which the two types at each locus coalesce to separate ancestral chromosomes.

To investigate correlations in coalescence times among loci due to skewed offspring distribution, we formally derive an ancestral recombination graph, or a coalescent process for many linked loci, from our diploid large offspring-number model. The key to the proof of convergence to an ancestral recombination graph from our diploid model lies in resolving the separation-of-timescales phenomenon we observe. Following Mendel’s laws, the two chromosomes of an offspring come from distinct diploid parents. Chromosomes can therefore coalesce only when in distinct individuals. The ancestral process consists of two phases, a dispersion phase occurring on a “fast” timescale and a coalescence and recombination phase occurring on a “slow” timescale. In the dispersion phase, chromosomes paired together in diploid individuals disperse into distinct individuals. Coalescence and recombination occur only on the slow timescale. Similar separation-of-timescales issues arise in models of populations structured into infinitely many subpopulations (demes) (Taylor and Véber 2009). When viewing the diploid individuals in our model as “demes,” our scenario departs from those describing structured populations by allowing only active ancestral lineages residing in separate demes to coalesce. A simple extension of a result of Möhle (1998) yields convergence in our case.

The limiting process we formally obtain is an ancestral recombination graph for many loci admitting simultaneous multiple mergers of ancestral chromosomes (lineages). In simultaneous multiple-merger coalescent processes, so-called Ξ-coalescents, different groups of active ancestral lineages can coalesce to different ancestors at the same time. Such coalescent processes were first studied as more abstract mathematical objects by Schweinsberg (2000a) and derived from general single-locus population models by several authors (Möhle and Sagitov 2001; Sagitov 2003; Sargsyan and Wakeley 2008; Birkner et al. 2009). A Ξ-coalescent with necessarily up to quadruple simultaneous multiple mergers arises at each marginal locus (i.e., considering each locus separately) in our model, since four parental chromosomes are involved in each reproduction event. This structure is intrinsically owed to our diploidy assumptions.

Formulas for the correlation in coalescence times between two alleles at two loci are obtained using our ARG. As predicted by J. E. Taylor (personal communication), these correlations will not necessarily be small even for loci separated by a high recombination rate. This is a novel effect not visible in classical models. The correlation structure will of course depend on the underlying coalescent parameters introduced by the large offspring-number model we adopt. An approximation of the expected value of the statistic $r^2$, commonly used to quantify linkage disequilibrium, is also investigated using our ARG. In addition, we employ our ARG to investigate correlations in ratios of coalescence times between loci for samples larger than two at each locus, using simulations.

A Diploid Population Model with Multilocus Recombination and Skewed Offspring Distribution

The forward population model

Consider a population consisting of N ∈ ℕ ≡ {1, 2, …} diploid individuals, meaning that each individual contains two chromosomes. Each chromosome is structured into L ∈ ℕ loci. We assume Moran-type dynamics: At each time step (“generation”), either a small or a large reproduction event occurs. In a small reproduction event, a single individual chosen uniformly at random from the population dies, and two other distinct individuals are chosen as parents. A diploid offspring is then formed by choosing one chromosome from each parent (see Figure 1). The parents always persist. A small reproduction event occurs with probability $1 - \varepsilon_N$, in which $\varepsilon_N \in (0, 1)$ depends on N. In a large reproduction event, which occurs with the complementary probability $\varepsilon_N$, a fraction ψ ∈ (0, 1) of the population perishes, meaning that ⌊ψN⌋ individuals die (⌊x⌋ for x ≥ 0 denotes the largest integer not exceeding x). Two distinct individuals are then chosen uniformly from the remaining N − ⌊ψN⌋ individuals to act as parents of ⌊ψN⌋ offspring, and each offspring is formed independently by choosing one (potentially recombined) chromosome from each parent (see Figure 1). The population size always stays constant at N diploid individuals. Individuals that neither reproduce nor die simply persist.
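The bookkeeping of a single time step can be sketched in code. The following is a minimal illustration, not the authors' implementation: the function name `reproduction_step`, the labeling of individuals by 0, …, N − 1, and the returned tuple are assumptions made here, and chromosomes and recombination are left out entirely. It requires N − ⌊ψN⌋ ≥ 2 so that two parents can be drawn.

```python
import math
import random


def reproduction_step(N, psi, eps_N, rng):
    """Return (kind, dead, parents) for one time step of the Moran-type model.

    Individuals are labeled 0..N-1 (a labeling assumed for this sketch).
    Every dead individual is replaced by one offspring of the two parents,
    so the population size stays at N.
    """
    if rng.random() < eps_N:
        # Large event: floor(psi * N) individuals die and the two parents
        # are drawn from the survivors.
        n_dead = math.floor(psi * N)
        dead = set(rng.sample(range(N), n_dead))
        survivors = [i for i in range(N) if i not in dead]
        parents = tuple(rng.sample(survivors, 2))
        return "large", dead, parents
    # Small event: one individual dies and two other distinct individuals
    # act as parents of its single replacement.
    dead_one = rng.randrange(N)
    others = [i for i in range(N) if i != dead_one]
    parents = tuple(rng.sample(others, 2))
    return "small", {dead_one}, parents
```

Calling `reproduction_step(10, 0.5, 0.01, random.Random(0))` returns the event type, the set of individuals that die, and the pair of parents. In both event types the parents are never among the dead, matching the statement that parents always persist.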

Figure 1. Illustration of “small” and “large” reproduction events without recombination. The dotted arrows indicate the copying of parental chromosomes into offspring chromosomes. The solid arrows indicate individuals that persist.

Given the two parents, genetic types of the offspring individuals are then obtained as follows. Each parent generates a large number of potential offspring chromosomes, of which a fraction $1 - r_N$ are exact copies of the original parental chromosomes and a fraction $r_N$ are recombinants. Recall that each chromosome is structured into L loci. Recombination occurs only between loci and never within. If recombination between a pair of chromosomes in a parent occurs between loci ℓ and ℓ + 1 ∈ {1, … , L} (where we say that X ∈ {1, … , L − 1} is the crossover point), the two chromosomes exchange types at all loci from ℓ + 1 to L. Only one crossover point is allowed in each recombination event. Let $r_N^{(\ell)}$ denote the probability of recombination between loci ℓ and ℓ + 1 (i.e., the probability that the potential crossover point X equals ℓ). An offspring chromosome is a recombinant with probability $r_N = r_N^{(1)} + \cdots + r_N^{(L-1)}$. Given that recombination happens, we thus have

$$\mathbb{P}\{X = \ell\} = \frac{r_N^{(\ell)}}{r_N^{(1)} + \cdots + r_N^{(L-1)}}, \quad 1 \le \ell \le L - 1.$$

Each pair of recombined chromosomes is formed independently of all other pairs. From this large pool of chromosomes, each new offspring is randomly assigned (independently of all other offspring in the case of a large reproduction event) one potentially recombined chromosome generated by each parent. In addition, the reproduction mechanism in different generations is assumed to be independent.
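As a hedged sketch of this mechanism, the code below samples a crossover point with the conditional probabilities given above and forms the two recombined chromosomes. The function names, the representation of a chromosome as a tuple of L types, and the uniform choice between the two resulting chromosomes are assumptions of this illustration rather than details stated in the text.

```python
import random


def sample_crossover(rates, rng):
    """Sample X in {1, ..., L-1} with P(X = l) proportional to rates[l - 1]."""
    points = list(range(1, len(rates) + 1))
    return rng.choices(points, weights=rates, k=1)[0]


def recombine(chrom_a, chrom_b, x):
    """Exchange the types at loci x+1, ..., L between two chromosomes."""
    return chrom_a[:x] + chrom_b[x:], chrom_b[:x] + chrom_a[x:]


def offspring_chromosome(chrom_a, chrom_b, rates, rng):
    """Generate one potential offspring chromosome from one diploid parent.

    With probability sum(rates) the parental pair recombines at a single
    crossover point; one of the two resulting chromosomes is then passed on
    (chosen uniformly here, which is an assumption of this sketch).
    """
    if rng.random() < sum(rates):
        x = sample_crossover(rates, rng)
        chrom_a, chrom_b = recombine(chrom_a, chrom_b, x)
    return rng.choice([chrom_a, chrom_b])
```

For L = 3 and crossover point 2, `recombine(("a1", "a2", "a3"), ("b1", "b2", "b3"), 2)` swaps only the types at locus 3.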

Ancestral relationships—notation

Now we switch from the forward population model to its ancestral process, running backward in time. Our sample consists of n ∈ {1, … , 2N} chromosomes, each subdivided into L loci. Hence, we need to keep track of the ancestry of nL segments (types/alleles). This implies that the different segments could end up on up to nL distinct chromosomes in nL distinct ancestral individuals. The required notation is now introduced, and our discourse will therefore necessarily become a little bit technical. However, we believe that a precise description of the objects we are working with is essential. The key to understanding our notation is that we are working with enumerated chromosomes and ordered loci on chromosomes.

At present (that is, time step m = 0), assume that we consider an even number n of chromosomes carried by n/2 individuals. The chromosomes are enumerated from 1 to n, attaching consecutive numbers to chromosomes found in the same individual. Our ancestral process keeps track of the chromosomal ancestral information, that is, which locus is ancestral to which set of sampled chromosomes. More precisely, in each generation $m \in \mathbb{N}_0$ (backward in time), we record all chromosomes that are active in the sense that they carry at least one locus that is ancestral to the same locus of at least one chromosome in generation 0. Denote the number of active chromosomes in generation $m \in \mathbb{N}_0$ by β(m) ∈ ℕ. The number β(m) of active chromosomes can both increase, due to recombination, and decrease, due to coalescence, going back in time.

Now we explain our notation for the loci. For each chromosome j ∈ [n] := {1, … , n}, denote by $L_\ell^{(j)}(m)$ locus ℓ ∈ [L] on chromosome j at time m. The subsets $L_\ell^{(j)}(m)$ of [n] contain all the numbers of chromosomes at present (time step 0) to which locus ℓ on active chromosome number j at time step m is ancestral. With this convention, and for each m ∈ ℕ and ℓ ∈ [L], the collection

$$\{L_\ell^{(j)}(m),\ j = 1, \dots, \beta(m)\},$$

which describes the configuration of segments (i.e., which ones have coalesced and which ones have not) at locus ℓ at time m, is a partition of [n]; i.e.,

$$L_\ell^{(j)}(m) \cap L_\ell^{(\hat{j})}(m) = \emptyset \quad \text{for } j \ne \hat{j};$$

and

$$\bigcup_{j=1}^{\beta(m)} L_\ell^{(j)}(m) = [n].$$

Thus, with our notation we can correctly describe the configuration of segments among chromosomes at any given time. By $C^{(j)}(m)$ we denote chromosome number j at time m. At time m = 0,

$$C^{(j)}(0) := \{L_1^{(j)}(0), \dots, L_L^{(j)}(0)\} := \{\{j\}, \dots, \{j\}\}.$$

For m > 0, consider the jth active chromosome at generation m, where j ∈ [β(m)]. The corresponding ancestral information at generation m is encoded via an ordered list of subsets of [n], setting

$$C^{(j)}(m) := \{L_1^{(j)}(m), \dots, L_L^{(j)}(m)\}, \quad L_\ell^{(j)}(m) \subset [n],\ \ell \in [L]. \tag{1}$$
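One possible encoding of this notation, offered only as an illustrative sketch, stores a chromosome as a tuple of L frozensets, one per locus, with the empty set marking a locus that carries no ancestral material. The helper names below are assumptions of this sketch. The second function checks the partition property at each locus, ignoring empty entries.

```python
def initial_chromosomes(n, L):
    """Build C^(j)(0) = ({j}, ..., {j}) for j = 1, ..., n."""
    return [tuple(frozenset({j}) for _ in range(L)) for j in range(1, n + 1)]


def is_partition_at_each_locus(chromosomes, n, L):
    """Check that, at every locus, the non-empty sets are pairwise disjoint
    and that their union is [n] = {1, ..., n}."""
    full = frozenset(range(1, n + 1))
    for locus in range(L):
        blocks = [c[locus] for c in chromosomes if c[locus]]
        union = frozenset().union(*blocks)
        # Equal total size and equal union together imply disjointness.
        if union != full or sum(len(b) for b in blocks) != n:
            return False
    return True
```

The sample at time 0 passes the check by construction, and the check continues to hold after any recombination or coalescence step that only moves or unites the sets locus by locus.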

Chromosomes are carried by diploid individuals. Keeping track of the grouping of active chromosomes into individuals is important, since by our diploid reproduction mechanism, chromosomal lineages can coalesce only when in distinct individuals (see Example B below). In analogy with our previous nomenclature for our ancestral process, an active individual will carry at least one (and at most two) active chromosome(s). Let b(m) denote the number of active individuals at generation m, where β(m)/2 ≤ b(m) ≤ β(m) for all m. The ordered list of active chromosomes and the number of active individuals (called a “configuration”) at time m ≥ 0 are denoted by

$$\xi^{n,N}(m) := \{C^{(1)}(m), \dots, C^{(\beta(m))}(m);\ b(m)\}. \tag{2}$$

Individual number i at generation m is denoted by $I_i(m)$, for i ∈ [b(m)]. An active individual is single marked if it carries one active chromosome and double marked if it carries two active chromosomes. Specifying the arrangement of chromosomes in individuals completes our description of the (prelimiting) ancestral process. However, since all active individuals are single marked in the limiting process, the full description of the arrangement of chromosomes in individuals is deferred to section A1.1 in the Appendix. In brief, each configuration $\xi^{n,N}(m)$ begins with the 2(β(m) − b(m)) ordered consecutive chromosomes of the β(m) − b(m) double-marked individuals, followed by the 2b(m) − β(m) chromosomes contained in single-marked individuals. With this convention, the set of single- and double-marked individuals and the grouping of chromosomes into individuals at generation m are uniquely determined by a configuration $\xi^{n,N}(m)$ of form (2). For notational convenience, the time index m is omitted if there is no ambiguity.

For a given sample size n, the set of all possible ancestral configurations $\xi^{n,N}$ is denoted by $\mathcal{A}_n$. The subset $\mathcal{A}_n^{\mathrm{sm}} \subset \mathcal{A}_n$ of all configurations $\xi^{n,N} = \{C^{(1)}, \dots, C^{(\beta)}; b\}$ with b = β, i.e., configurations consisting only of single-marked individuals, will play an important role later on. Indeed, all configurations in the limiting model will be confined to the set $\mathcal{A}_n^{\mathrm{sm}}$, and the pairing of chromosomes in individuals will become irrelevant.

The mapping “complete dispersion” (cd),

$$\mathrm{cd}: \mathcal{A}_n \to \mathcal{A}_n^{\mathrm{sm}}$$

breaks up the pairing of chromosomes into diploid double-marked individuals. More precisely, we define

$$\mathrm{cd}(\{C^{(1)}, \dots, C^{(\beta)};\ b\}) := \{C^{(1)}, \dots, C^{(\beta)};\ \beta\}. \tag{3}$$

Configurations in $\mathcal{A}_n^{\mathrm{sm}}$ describe configurations in which all active individuals are single marked, i.e., carry only one active chromosome.
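In code, the map cd amounts to resetting the number of active individuals to the number of active chromosomes. The sketch below assumes, purely for illustration, that a configuration is stored as a pair `(chromosomes, b)` with `chromosomes` an ordered list; the function names are hypothetical.

```python
def complete_dispersion(config):
    """Equation (3): keep the ordered chromosomes, set b equal to beta."""
    chromosomes, _b = config
    return (list(chromosomes), len(chromosomes))


def is_single_marked(config):
    """True if the configuration lies in the single-marked subset (b == beta)."""
    chromosomes, b = config
    return b == len(chromosomes)


def num_double_marked(config):
    """Number of double-marked individuals, beta - b."""
    chromosomes, b = config
    return len(chromosomes) - b
```

Applying `complete_dispersion` twice gives the same result as applying it once, which reflects that cd is a projection onto the single-marked configurations.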

The effects of recombination and coalescence on the ancestral configurations in the case of two typical situations are now illustrated. Example A illustrates recombination, and Example B illustrates coalescence of two chromosomes.

Example A.

Suppose the most recent previous event in the history of a given configuration $\xi^{n,N}(m)$ was a small reproduction event (at time m + 1), and suppose that the resulting offspring individual is currently part of our configuration at time m, but neither of its parents is, and that the offspring individual is single marked, i.e., carries one active chromosome. We obtain $\xi^{n,N}(m+1)$ as follows:

  • If there is no recombination during the reproduction event, then the configuration in the previous generation remains unchanged; i.e., $\xi^{n,N}(m+1) = \xi^{n,N}(m)$.

  • If there is recombination, say at a crossover point X ∈ {1, … , L − 1}, suppose the (single) offspring chromosome is
    $$C^{(j)}(m) = \{L_1^{(j)}(m), \dots, L_L^{(j)}(m)\}.$$

Necessarily, the two parental chromosomes will be part of the configuration $\xi^{n,N}(m+1)$, residing in the same double-marked individual. More precisely, the two parental chromosomes, say $C^{(j)}(m+1)$ and $C^{(j+1)}(m+1)$, are determined by (for ℓ ∈ [L])

$$L_\ell^{(j)}(m+1) = \begin{cases} L_\ell^{(j)}(m) & : 1 \le \ell \le X, \\ \emptyset & : X + 1 \le \ell \le L, \end{cases}$$

and

$$L_\ell^{(j+1)}(m+1) = \begin{cases} \emptyset & : 1 \le \ell \le X, \\ L_\ell^{(j)}(m) & : X + 1 \le \ell \le L, \end{cases}$$

in which Ø denotes loci not carrying any ancestral segments. The offspring chromosome is of course not part of $\xi^{n,N}(m+1)$. This transition can be partially trivial (a “silent recombination” event), if the crossover point is not in an “active” area, i.e., if $L_\ell^{(j)} = \emptyset$ for all X + 1 ≤ ℓ ≤ L (or for all 1 ≤ ℓ ≤ X). By way of example, with L = 3, if chromosome $C^{(j)} = \{\{j\}, \{j\}, \{j\}\}$ was a recombinant, and the crossover point occurred between loci 2 and 3, the two parental chromosomes are given by $C^{(j)} = \{\{j\}, \{j\}, \emptyset\}$ and $C^{(j+1)} = \{\emptyset, \emptyset, \{j\}\}$.
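The backward-in-time split in Example A can be reproduced with a few lines of code. This is a sketch under an assumed encoding (a chromosome is a tuple of frozensets, one per locus, with the empty frozenset playing the role of Ø); the function names are illustrative only.

```python
EMPTY = frozenset()


def split_at_crossover(chrom, X):
    """Split an offspring chromosome into its two parental chromosomes.

    Loci 1..X stay on the first parental chromosome and loci X+1..L go to
    the second one. Python indices are 0-based, so locus l is chrom[l - 1].
    """
    L = len(chrom)
    first = tuple(chrom[i] if i < X else EMPTY for i in range(L))
    second = tuple(EMPTY if i < X else chrom[i] for i in range(L))
    return first, second


def is_silent(chrom, X):
    """True if one of the two parental chromosomes carries no ancestral material."""
    first, second = split_at_crossover(chrom, X)
    return all(not s for s in first) or all(not s for s in second)
```

With L = 3, j = 1 and the crossover between loci 2 and 3, `split_at_crossover((frozenset({1}),) * 3, 2)` returns the pair ({1}, {1}, Ø) and (Ø, Ø, {1}), as in the example above. Splitting the first of these again at X = 2 is a silent recombination.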

Example B.

Suppose the most recent previous event in the history of a given configuration $\xi^{n,N}(m)$ of chromosomes at generation m is a small reproduction event at time m + 1, leading to a coalescence of lineages. This is the case, e.g., if a single-marked offspring individual with active chromosome $C^{(\hat{j})}(m)$ is in our configuration $\xi^{n,N}(m)$, as well as its single-marked parent [say with currently active chromosome $C^{(j)}(m)$], from which it actually obtained its active chromosome. Then, to obtain the configuration $\xi^{n,N}(m+1)$, the offspring chromosome $C^{(\hat{j})}(m)$ is deleted, and the resulting ancestral chromosome $C^{(j)}(m+1)$ is given by the family of the unions of the sets $L_\ell^{(j)}$ and $L_\ell^{(\hat{j})}$,

$$C^{(j)}(m+1) = \{L_1^{(j)}(m) \cup L_1^{(\hat{j})}(m), \dots, L_L^{(j)}(m) \cup L_L^{(\hat{j})}(m)\}. \tag{4}$$

All other chromosomes in $\xi^{n,N}(m+1)$ are copied from $\xi^{n,N}(m)$. Again, taking L = 3, if chromosomes $C^{(j)} = \{\{j\}, \{j\}, \{j\}\}$ and $C^{(k)} = \{\{k\}, \{k\}, \{k\}\}$ coalesce, the resulting ancestral chromosome is given by $C^{(j)} = \{\{j, k\}, \{j, k\}, \{j, k\}\}$.

Scaling and classification of transitions

To obtain a nontrivial scaling limit for $\{\xi^{n,N}(m)\}$ as N → ∞, the limit theorem of Möhle and Sagitov (2001) (cf. also the special case considered in Eldon and Wakeley 2006) suggests one should, for some constant c > 0, choose probability $1 - c/N^2$ for the small reproduction events and probability $c/N^2$ for the large reproduction events, i.e., set

$$\varepsilon_N = \frac{c}{N^2}, \tag{5}$$

and speed up time by $N^2$. For the recombination rate to be nontrivial in the limit (i.e., neither 0 nor infinitely large), we require that all recombination values $r_N^{(\ell)}$ scale in units of N; i.e., for each crossover point ℓ ∈ [L]\{L},

$$r_N^{(\ell)} := \frac{r^{(\ell)}}{N}, \quad 0 < r^{(\ell)} < \infty. \tag{6}$$

Thus, even though our timescale is in units of $N^2$ time steps, recombination is scaled in units of N time steps. On the level of single lineages the probability of recombination is of the order $O(N^{-2})$. Indeed, after a small reproduction event, the probability of drawing an offspring is 1/N. The probability that the offspring carries a recombined chromosome is of order O(1/N).
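The orders of magnitude in (5) and (6) can be checked numerically. The helper below is only an illustration with assumed names and arbitrary parameter values. It multiplies the three factors named above: the probability of a small event, the probability 1/N that a given individual is the offspring, and the recombination probability r/N.

```python
def scaled_parameters(N, c, r):
    """Return (eps_N, r_N, p_recomb) under the scalings (5) and (6).

    p_recomb is the per-generation probability that a given single-marked
    individual is the offspring of a small reproduction event and carries
    a recombined chromosome.
    """
    eps_N = c / N**2
    r_N = r / N
    p_recomb = (1.0 - eps_N) * (1.0 / N) * r_N
    return eps_N, r_N, p_recomb


if __name__ == "__main__":
    # c = 1.0 and r = 0.5 are arbitrary illustrative values.
    for N in (10, 100, 1000, 10000):
        eps_N, r_N, p = scaled_parameters(N, c=1.0, r=0.5)
        print(N, eps_N, r_N, N**2 * p)
```

After multiplication by $N^2$, the per-generation probability equals $r(1 - c/N^2)$, which approaches r as N grows. This is why recombination survives in the limit at a finite positive rate.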

Given the cornucopia of possible transitions from ξn,N(m) to ξn,N(m + 1), it is important to identify those transitions that are expected to be visible in the limiting process.

All possible transitions fall into the following three regimes:

  • Those transitions that happen at probability of order $O(N^{-2})$ per generation, which will be visible in the limit (since time is scaled by $N^2$): They are called effective transitions and will appear at a finite positive rate in the limit.

  • Further, there are transitions that happen less frequently, typically with probability of order $O(N^{-3})$ or smaller per generation, which will thus become negligible as N → ∞ and hence be invisible in the limit. These are called negligible transitions.

  • Finally, there are transitions that happen much more frequently [with probability of order $O(N^{-1})$ or even O(1) per generation]. At first sight, one might think that their presence might lead to chaotic behavior in the limit. However, this is not the case. Instead, these transitions will happen “instantaneously” in the limit and result in a projection of the states of our process from $\mathcal{A}_n$ into the subspace $\mathcal{A}_n^{\mathrm{sm}}$, which will be the limiting state space. This is proved below. Such transitions are called projective or instantaneous transitions. The identity transition is a special case of a projective transition.

In the Appendix (section A1), a full classification of all transitions into the above groups is provided.

Instantaneous and effective transitions

The most important transitions and their effect for the limiting process are now described in detail. Consider the following most recent events in the history of a set of lineages, i.e., events occurring at time m + 1, from the perspective of the ancestral process $\xi^{n,N}(m)$ at time m:

  • Event 1 (silent): A small reproduction event occurs, but the offspring is not active. This is the most likely event and is of the order O(1), but does not affect our ancestral configuration process $\xi^{n,N}(m)$; i.e., $\xi^{n,N}(m+1) = \xi^{n,N}(m)$. This event leads to an identity transition (a trivial instantaneous transition).

  • Event 2 (dispersion): A small reproduction event occurs, the offspring is active in our sample but neither parent is, and recombination does not occur. This is a relatively frequent event that occurs with a probability of the order $O(N^{-1})$ per generation [since the probability that the offspring is in the sample is b(m)/N]. If the offspring carries only one active chromosome, we again see an identity transition; i.e., $\xi^{n,N}(m+1) = \xi^{n,N}(m)$. If the offspring carries two active chromosomes, i.e., is a double-marked individual, the two active chromosomes will disperse to two separate individuals, who will then become single-marked individuals. Formally, for $\xi = \{C^{(1)}, \dots, C^{(\beta)};\ b\} \in \mathcal{A}_n$ with at least one double-marked individual (b < β), define the map $\mathrm{disp}_i(\cdot): \mathcal{A}_n \to \mathcal{A}_n$ dispersing the chromosomes paired in individual i,
    $$\mathrm{disp}_i(\xi) = \{C^{(1)}, \dots, C^{(2i-2)}, C^{(2i+1)}, C^{(2i+2)}, \dots, C^{(2(\beta-b))}, C^{(2i-1)}, C^{(2i)}, C^{(2(\beta-b)+1)}, \dots, C^{(\beta)};\ b+1\} \tag{7}$$
    if 1 ≤ i ≤ β − b and $\mathrm{disp}_i(\xi) := \xi$ otherwise. Recall that the ith double-marked individual has chromosomes labeled 2i − 1 and 2i. For $\xi^{n,N}(m)$, if the ith double-marked individual is affected, we have the transition $\xi^{n,N}(m+1) = \mathrm{disp}_i(\xi^{n,N}(m))$.

The dispersion events will happen instantaneously as N → ∞ (recall we are speeding time up by $N^2$) and thus will, in the limit, lead to an immediate complete dispersion of all chromosomes paired in double-marked individuals. If in the course of events, a new double-marked individual emerges due to pairing of active chromosomes in the same diploid individual, a dispersion of the chromosomes will occur immediately. Event 2 will hence result in a permanent instantaneous transition, mapping our current state $\xi \in \mathcal{A}_n$ into the subspace $\mathcal{A}_n^{\mathrm{sm}}$ by means of the map cd defined in (3). Our limiting process will thus live, with probability one for each given t > 0, in $\mathcal{A}_n^{\mathrm{sm}}$, even if we start with a configuration from $\mathcal{A}_n \setminus \mathcal{A}_n^{\mathrm{sm}}$ at time t = 0.

  • Event 3 (recombination): A small reproduction event occurs, a single-marked offspring but neither parent is in our sample, and recombination affects the active chromosome at a crossover point x. This event has probability of the order $O(N^{-2})$ per generation and will thus be visible with finite positive rate in the limit. It is an effective transition, which can be described formally as follows. Define the recombination operation recomb acting on chromosome j and crossover point x for a configuration $\xi \in \mathcal{A}_n^{\mathrm{sm}}$ as
    $$\mathrm{recomb}_{j,x}(\xi) := \{C^{(1)}, \dots, C^{(j-1)}, \tilde{C}^{(j,1)}, \tilde{C}^{(j,2)}, C^{(j+1)}, \dots, C^{(\beta)};\ \beta+1\}, \tag{8}$$
    where
    $$\tilde{C}^{(j,1)} = \{\tilde{L}_1^{(j,1)}, \dots, \tilde{L}_L^{(j,1)}\}$$
    with
    $$\tilde{L}_\ell^{(j,1)} = \begin{cases} L_\ell^{(j)} & : 1 \le \ell \le x - 1, \\ \emptyset & : x \le \ell \le L, \end{cases}$$
    and
    $$\tilde{C}^{(j,2)} = \{\tilde{L}_1^{(j,2)}, \dots, \tilde{L}_L^{(j,2)}\}$$
    with
    $$\tilde{L}_\ell^{(j,2)} = \begin{cases} \emptyset & : 1 \le \ell \le x - 1, \\ L_\ell^{(j)} & : x \le \ell \le L \end{cases}$$
    (if one of $\tilde{C}^{(j,1)}$, $\tilde{C}^{(j,2)}$ equals {Ø, … , Ø}, we define $\mathrm{recomb}_{j,x}(\xi) := \xi$, giving rise to a silent recombination event).
  • Event 4 (pairwise coalescence): A small reproduction event occurs, one single-marked parent and a single-marked offspring are in the sample, the active chromosome is inherited from the parent in the sample, and recombination does not occur. This event occurs with probability of order $O(N^{-2})$ and will therefore be visible in the limit with finite positive rate and hence gives rise to an effective transition. It will lead to a binary coalescence of lineages and can formally be described as follows. The ancestral chromosome $\tilde{C}^{(j_1)}$ formed by the coalescence of chromosomes $j_1$ and $j_2$ is given by
    $$\tilde{C}^{(j_1)} = \{L_1^{(j_1)} \cup L_1^{(j_2)}, \dots, L_L^{(j_1)} \cup L_L^{(j_2)}\} \tag{9}$$
    if $1 \le j_1 < j_2 \le \beta$. Define the binary coalescence operation pairmerge acting on chromosomes $j_1$ and $j_2$ ($1 \le j_1 < j_2$) in a configuration $\xi \in \mathcal{A}_n^{\mathrm{sm}}$ as
    $$\mathrm{pairmerge}_{j_1,j_2}(\xi) := \{C^{(1)}, \dots, \tilde{C}^{(j_1)}, \dots, C^{(j_2-1)}, C^{(j_2+1)}, \dots, C^{(\beta)};\ \beta-1\} \tag{10}$$
    if $1 \le j_1 < j_2 \le \beta$ (otherwise, we put $\mathrm{pairmerge}_{j_1,j_2}(\xi) := \xi$).
  • Event 5 (multiple-merger coalescence): A large reproduction event occurs, neither parent but (possibly several) single-marked offspring are in our sample, and recombination does not occur. This is again an event with probability of order $O(N^{-2})$ per generation and therefore will be visible in the limit with finite positive rate and hence gives rise to an effective transition. The offspring chromosomes are assigned their parental chromosomes independently and uniformly at random, since due to an immediate complete dispersion via Event 2 each offspring individual will carry precisely one active chromosome. Now we formally define the multiple-coalescence operation groupmerge for $\xi \in \mathcal{A}_n^{\mathrm{sm}}$ and pairwise disjoint subsets $J_1, J_2, J_3, J_4 \subset [\beta]$ in which either at least one $|J_i| \ge 3$ or at least two of the $|J_i| \ge 2$. This transition is, thus, really different from a pairmerge transition. Let $J_j$ denote the set of offspring chromosomes derived from parental chromosome j. Then
    $$\mathrm{groupmerge}_{J_1,J_2,J_3,J_4}(\xi) := \{\tilde{C}^{(1)}, \tilde{C}^{(2)}, \tilde{C}^{(3)}, \tilde{C}^{(4)}, C^{(j)}, j \in [\beta] \setminus (J_1 \cup J_2 \cup J_3 \cup J_4);\ \tilde{\beta}\} \tag{11}$$
    with [$(x)^+ := \max(x, 0)$]
    $$\tilde{\beta} = \beta - \sum_{j=1}^{4} (|J_j| - 1)^+$$
    and the four parental chromosomes, at least one of which is involved in a merger, are given by (1 ≤ i ≤ 4),
    $$\tilde{C}^{(i)} = \Big\{\bigcup_{j \in J_i} L_1^{(j)}, \dots, \bigcup_{j \in J_i} L_L^{(j)}\Big\}.$$
    The chromosome(s) $C^{(j)}$ appearing in $\mathrm{groupmerge}_{J_1,J_2,J_3,J_4}(\xi)$ denote the chromosomes in ξ that are not involved in a merger.
  • All other events: These will either not affect our ancestral process or have a probability of order smaller than $N^{-2}$ so that they will be absent in the limit after rescaling. A complete classification of these events is given in the Appendix (section A1).
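The dispersion and coalescence operations in (7), (10), and (11) can be sketched as list manipulations. This is an illustrative encoding and not code from the paper: a configuration is assumed to be a pair `(chromosomes, b)`, a chromosome is a tuple of frozensets (one per locus), and all indices are 1-based to match the text. In `groupmerge`, empty groups are simply dropped, which is one way to realize the count β̃.

```python
def disp(config, i):
    """Equation (7): disperse the i-th double-marked individual."""
    chroms, b = config
    n_double = len(chroms) - b
    if not 1 <= i <= n_double:
        return config
    pair = chroms[2 * i - 2: 2 * i]  # chromosomes 2i-1 and 2i
    rest_double = chroms[:2 * i - 2] + chroms[2 * i: 2 * n_double]
    singles = chroms[2 * n_double:]
    # The dispersed pair is placed first among the single-marked chromosomes.
    return (rest_double + pair + singles, b + 1)


def merge(chroms):
    """Locus-wise union of the ancestral sets of several chromosomes."""
    L = len(chroms[0])
    return tuple(frozenset().union(*(c[l] for c in chroms)) for l in range(L))


def pairmerge(config, j1, j2):
    """Equation (10) for a single-marked configuration."""
    chroms, _b = config
    beta = len(chroms)
    if not 1 <= j1 < j2 <= beta:
        return config
    merged = merge([chroms[j1 - 1], chroms[j2 - 1]])
    new = [merged if k == j1 else c
           for k, c in enumerate(chroms, start=1) if k != j2]
    return (new, beta - 1)


def groupmerge(config, groups):
    """Equation (11): groups holds up to four pairwise disjoint index sets."""
    chroms, _b = config
    involved = set().union(*groups)
    merged = [merge([chroms[j - 1] for j in sorted(J)]) for J in groups if J]
    untouched = [c for k, c in enumerate(chroms, start=1) if k not in involved]
    new = merged + untouched
    return (new, len(new))
```

For four single-marked chromosomes, `groupmerge(config, [{1, 2}, {3, 4}])` produces two ancestral chromosomes in one step, a simultaneous multiple merger that no sequence of single `pairmerge` calls at one time point can represent.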

The limiting dynamics and state space

The expected dynamics of the limiting continuous-time Markov chain {ξ(t), t ≥ 0}, taking values in An, as N → ∞, is now briefly discussed:

  • Complete dispersion (Event 2) of the sampled chromosomes is the first event to occur (between times t = 0 and t = 0+). By $I_i$ we denote individual number i (see section A1.1 in Appendix). At time t = 0, when $\xi(0) \in A_n$, we assume all n sampled chromosomes are paired in double-marked individuals (n even):
    $$\xi(0) = \{I_i : I_i = \{C_0^{(2i-1)}, C_0^{(2i)}\},\ 1 \le i \le n/2\} \tag{12}$$
    Immediately (at time 0+), the chromosomes disperse into single-marked individuals,
    $$\xi(0+) = \mathrm{cd}(\xi(0)) = \{I_i : I_i = \{C_0^{(i)}, \emptyset\},\ 1 \le i \le n\} = \{C_{0+}^{(1)}, \ldots, C_{0+}^{(n)}; n\} \in A_n^{sm}. \tag{13}$$
  • Throughout the evolution of the process, whenever double-marked individuals appear (e.g., from a coalescence-of-lineages event), Event 2 will immediately change our configuration to the corresponding “all dispersed” configuration; i.e., for each t > 0,
    $$\xi(t+) = \mathrm{cd}(\xi(t)) \in A_n^{sm}.$$
    Such “flickering” states will not affect any quantities of interest of our genealogy, so we can assume that they are removed from the limit by choosing the càdlàg modification of {ξ(t), t ≥ 0}, taking only values in $A_n^{sm}$ for all t > 0 (this modification does not affect the finite-dimensional distributions of {ξ(t), t ≥ 0}).
  • Recombination (Event 3) appears in the limiting process at total rate $r = r^{(1)} + \cdots + r^{(L-1)}$, where a recombination involving a given crossover point ℓ appears with rate $r^{(\ell)}$ on any lineage. Indeed, from our scaling considerations, the probability of not seeing a recombination at ℓ in a small resampling event for more than $N^2 t$ time steps for a given single-marked individual satisfies ($r_N^{(\ell)} = r^{(\ell)}/N$)
    $$\Big(1 - \Big(1 - \frac{c}{N^2}\Big)\, r_N^{(\ell)}\, \frac{1}{N}\Big)^{N^2 t} \to e^{-r^{(\ell)} t},$$
    as N → ∞ [recall (6); the probability for any given individual to be the child in a small reproduction event is 1/N]; hence the waiting time for this event to happen is exponential with rate $r^{(\ell)}$.
  • Coalescences appear according to the effective transitions described by Events 4 and 5. From the point of view of a given pair of active chromosomes in different individuals, a single pairwise coalescence will occur at rate $1 + c(\psi^2/4)\,C_{\beta;2;\beta-2}$ with $C_{\beta;2;\beta-2}$ from (15) (with r = 1, s = β − 2), where the 1 comes from a pairwise coalescence according to a small reproduction event and the $c(\psi^2/4)\,C_{\beta;2;\beta-2}$ from a large merger event (the rates can be easily derived from considerations similar to those for the recombination rate above), recalling that both coalescing chromosomes have to “successfully flip a ψ-coin” to take part in the large coalescence event and then are uniformly distributed into four groups according to the choice of any of the four potential parental chromosomes.
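
The recombination limit in the list above can be checked numerically. The following is a minimal sketch, not part of the original derivation; the function name and the parameter values are illustrative assumptions.

```python
import math

def no_recombination_prob(N, t, r_ell, c):
    """Probability that one lineage sees no recombination at crossover
    point ell during N^2 * t time steps, with r_N = r_ell / N."""
    # Per-step probability: small event (1 - c/N^2), the lineage is the
    # child (1/N), and recombination occurs at ell (r_ell / N).
    per_step = (1.0 - c / N**2) * (r_ell / N) * (1.0 / N)
    # log1p keeps precision when per_step is tiny.
    return math.exp(N**2 * t * math.log1p(-per_step))

# Illustrative values (assumptions): t = 1.5, r_ell = 0.8, c = 2.
for N in (10, 100, 10_000):
    print(N, no_recombination_prob(N, 1.5, 0.8, 2.0), math.exp(-0.8 * 1.5))
```

As N grows, the first printed value approaches the exponential survival probability in the second column.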

Large coalescence events (involving at least three individuals or at least two simultaneous pairwise mergers) happen with overall rate $c(\psi^2/4)$ times the corresponding coalescence rate of a Ξ-coalescent: individuals take part in the merger independently with probability ψ, and the participating individuals are then distributed uniformly into four groups according to the chosen parental chromosome. The corresponding rate is given in the third line of (14) [cf. also (15)].

The limiting ancestral process

According to the above consideration, it is now plausible to consider the following limiting Markov chain as the ancestral limiting process. This fact is proved below, with most computations provided in the Appendix. The mth falling factorial is given by $(a)_m := a(a - 1) \cdots (a - m + 1)$, $(a)_0 := 1$. The operations pairmerge, recomb, and groupmerge for elements of $A_n^{sm}$ were defined above in the section on scaling. Now we define the generator of the continuous-time ancestral recombination graph derived from our model.

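Definition 1.1 (limiting multilocus diploid ancestral recombination graph). The limiting ancestral process is the continuous-time Markov chain {ξ(t), t ≥ 0} with values in $A_n^{sm}$, initial condition ξ(0) := cd(ξ) for $\xi \in A_n$, and transition matrix G whose entries, for elements $\xi, \xi' \in A_n^{sm}$, ξ′ ≠ ξ, are given by ($J := (J_1, \ldots, J_4)$)

$$G(\xi,\xi') = \begin{cases} 1 + c\,\frac{\psi^2}{4}\, C_{\beta;2;\beta-2} & \text{if } \xi' = \mathrm{pairmerge}_{j_1,j_2}(\xi) \\ r^{(\ell)} & \text{if } \xi' = \mathrm{recomb}_{j,\ell}(\xi) \\ c\,\frac{\psi^2}{4}\, C_{\beta;|J|} & \text{if } \xi' = \mathrm{groupmerge}_{J}(\xi) \\ 0 & \text{for all other } \xi' \ne \xi \end{cases} \tag{14}$$

(where in the penultimate line we consider only cases where either at least one $|J_i| \ge 3$ or at least two of the $|J_i| \ge 2$), with

$$C_{\beta;|J|} := C_{\beta;|J_1|,|J_2|,|J_3|,|J_4|;\,\beta-(|J_1|+|J_2|+|J_3|+|J_4|)}$$

and ($s = b - k_1 - \cdots - k_r \ge 0$, $x \wedge y := \min(x, y)$),

$$C_{b;k_1,\ldots,k_r;s} = \frac{4}{\psi^2} \sum_{l=0}^{s \wedge (4-r)} \binom{s}{l} \frac{(4)_{r+l}}{4^{k_1+\cdots+k_r+l}} (1-\psi)^{s-l}\, \psi^{k_1+\cdots+k_r+l} \tag{15}$$

For the diagonal elements, one has of course

$$G(\xi,\xi) = -\sum_{\xi' \ne \xi,\ \xi' \in A_n^{sm}} G(\xi,\xi'). \tag{16}$$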

The rates in (15) are the transition rates of the Ξ-coalescent (a simultaneous multiple-merger coalescent) with

$$\Xi = \delta_{(\psi/4,\,\psi/4,\,\psi/4,\,\psi/4,\,0,\,0,\,\ldots)},$$

when r distinct groups of ancestral lineages merge. The number of lineages in each group is given by $k_1, \ldots, k_r$, given β active ancestral lineages. The number $s = \beta - (k_1 + \cdots + k_r) \ge 0$ gives the number of lineages (ancestral chromosomes) unaffected by the merger (cf. Schweinsberg 2000a, Theorem 2). The particular form of Ξ given above follows from the fraction ψ of the population replaced by the offspring of the two parents in a large reproduction event and our assumption that each parent contributes exactly one chromosome to each offspring. We have the following convergence result:

Theorem 1.2. Let $\{\xi^{n,N}(m), m \ge 0\}$ be the ancestral process of a sample of n chromosomes in a population of size N and assume the scaling relations (5) and (6). Then, starting from $\xi^{n,N}(0) \in A_n$, we have that

$$\{\xi^{n,N}(\lfloor N^2 t \rfloor)\} \to \{\xi(t)\}, \quad \text{as } N \to \infty,$$

in the sense of the finite-dimensional distributions on the interval (0, ∞). The initial value of the limiting process is given by

$$\xi(0) = \mathrm{cd}(\xi^{n,N}(0)) \in A_n^{sm}.$$

A proof can be found in the Appendix. If c = 0, the classical ancestral recombination graph for a diploid population with recombination in the spirit of Griffiths and Marjoram (1997) results.
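
The rate constants (15) that enter the generator (14) are easy to evaluate directly. The sketch below is illustrative only; the names `falling` and `xi_rate` are assumptions, and the argument `ks` holds the group sizes $k_1, \ldots, k_r$ while `s` is the number of unaffected lineages.

```python
from math import comb, prod

def falling(a, m):
    """Falling factorial (a)_m = a (a-1) ... (a-m+1), with (a)_0 = 1."""
    return prod(a - i for i in range(m))

def xi_rate(ks, s, psi):
    """C_{b;k_1,...,k_r;s} from (15), with b = sum(ks) + s and psi in (0, 1]."""
    r, k = len(ks), sum(ks)
    total = 0.0
    # l counts unaffected-looking lineages that joined a large event alone;
    # at most 4 - r further parental chromosomes are available.
    for l in range(min(s, 4 - r) + 1):
        total += (comb(s, l) * falling(4, r + l) / 4 ** (k + l)
                  * (1 - psi) ** (s - l) * psi ** (k + l))
    return 4.0 / psi**2 * total

# A pair among two lineages: the constant equals 1 for every psi.
print(xi_rate((2,), 0, 0.3))
# Among three lineages: one pair merging, or all three merging.
print(xi_rate((2,), 1, 0.3), xi_rate((3,), 0, 0.3))
```

With three lineages, the rate for a given pair (third lineage unaffected) plus the rate for the triple merger again sums to 1, as consistency of the rates under adding a lineage requires.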

General Diploid Moran-Type Models: “Random” ψ

One of the aims of the present work is to understand the genome-wide correlations in gene genealogies induced by sweepstake-style reproduction. So far, we have discussed this for a very simple example of a sweepstake mechanism (analog to the one considered in Eldon and Wakeley 2006). More precisely, the fraction ψ ∈ (0, 1) of the population replaced by the offspring of a single pair of individuals in a large offspring-number event has hitherto been assumed to be (approximately) constant. Along the lines of the previous discussion, an ancestral recombination graph with a randomized offspring distribution can be derived (a comprehensive discussion of single-locus haploid Moran models in the domain of attraction of Λ-coalescents can be found in a recent article by Huillet and Möhle 2011). Even though ψ is now considered a random variable, the population size stays constant at N diploid individuals. Allowing ψ to be random may be biologically more realistic than taking ψ to be a constant. On the other hand, the problem of identifying suitable classes of probability distributions for ψ, reflecting the specific biology of given natural populations, is still open and an area of active research.

To explain the convergence arguments when ψ is random, let the random variable $\Psi_N$, taking values in [N − 2], denote the random number of diploid offspring contributed by the single reproducing pair of parents at each time step; a new realization of $\Psi_N$ is drawn before each reproduction event. Again, we consider the effect of such a reproduction mechanism on coalescence events in a sample. The probability that two given chromosomes residing in two single-marked individuals in the sample coalesce in the previous time step, given the value of $\Psi_N$, is

$$\mathbb{P}(\{\text{pair coalescence}\} \mid \Psi_N = k) = \frac{1}{4}\,\delta_{\{k=1\}}\,\frac{4}{N(N-1)} + \frac{1}{4}\,\delta_{\{k>1\}} \Big(\frac{4k}{N(N-1)} + \frac{k(k-1)}{N(N-1)}\Big), \tag{17}$$

where the first and second terms on the right-hand side describe the case where one parent and one offspring are drawn, the third term covers the case where two offspring are drawn, and the factor 1/4 is the probability that the two chromosomes in question descend from the same parental chromosome. Define

$$c_N := 4\,\mathbb{P}(\{\text{pair coalescence}\}) \tag{18}$$
$$= 4 \sum_{k=1}^{N-2} \mathbb{P}(\{\text{pair coalescence}\} \mid \Psi_N = k)\,\mathbb{P}(\Psi_N = k)$$
$$= \mathbb{E}\Big[\frac{\Psi_N(\Psi_N+3)}{N(N-1)}\Big] \tag{19}$$

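(the factor 4 facilitates comparison with the haploid case). The sequence of laws $\mathcal{L}(\Psi_N)$, N ∈ ℕ, is assumed to satisfy the following three conditions,

$$c_N \to 0 \quad \text{as } N \to \infty, \tag{20}$$
$$\frac{c_N}{\mathbb{E}[\Psi_N/N]} = \frac{1/\mathbb{E}[\Psi_N/N]}{1/c_N} = \frac{\mathbb{E}[\Psi_N(\Psi_N+3)]}{(N-1)\,\mathbb{E}[\Psi_N]} \to 0 \quad \text{as } N \to \infty, \tag{21}$$

and there exists a probability measure F on [0, 1] such that

$$\frac{1}{c_N}\,\mathbb{P}(\Psi_N > Nx) \xrightarrow[N \to \infty]{} \int_x^1 \frac{1}{y^2}\,F(dy) \tag{22}$$

for all continuity points x ∈ (0, 1] of F.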

Condition (20) is necessary for any limit process of the genealogies to be a continuous-time Markov chain, condition (21) ensures that a separation-of-timescales phenomenon occurs, and (22) fixes the limit dynamics of the large merging events [it is analogous to Sagitov 1999, necessary condition (13) in the haploid case]. In the proof of convergence to a limit process we recall equivalent conditions to (22) (see Appendix, section A4). Condition (20) implies (see section A4 in the Appendix)

$$\mathbb{E}\Big[\frac{\Psi_N}{N}\Big] \to 0 \quad \text{as } N \to \infty; \tag{23}$$

i.e., the probability for a given individual to be an offspring in a given reproduction event becomes small. Hence, (23) and (21) together show that there will be two diverging timescales: the “short” timescale $1/\mathbb{E}[\Psi_N/N]$ on which chromosomes paired in double-marked individuals disperse into single-marked individuals and the “long” timescale $1/c_N$ over which we observe nontrivial ancestral coalescences.

To obtain a nontrivial genealogical limit process, we then speed up time by a factor of $4/c_N$; i.e., $4/c_N$ reproduction events correspond to one coalescent time unit (see Theorem 1.3 below). This time rescaling is chosen for two chromosomes to coalesce at rate 1 in the limit. The required scaling relation for the recombination rates is now

$$r_N^{(\ell)} \sim \frac{c_N}{4\,\mathbb{E}[\Psi_N/N]}\, r^{(\ell)} \quad \text{as } N \to \infty \tag{24}$$

with $r^{(\ell)} \in [0, \infty)$ fixed for ℓ = 1, … , L − 1 [where f(N) ∼ g(N) means $\lim_{N\to\infty} f(N)/g(N) = 1$]. An intuitive explanation for the requirement (24) is that since the probability for a given individual to be an offspring in a given reproduction event is $\mathbb{E}[\Psi_N/N]$, after speeding up time by $4/c_N$, on any lineage recombination events between loci ℓ and ℓ + 1 occur as a Poisson process with rate $r^{(\ell)}$.

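A simple sufficient condition for (21) is the following: For any ɛ > 0,

$$N\,\mathbb{P}(\Psi_N > \varepsilon N) \to 0 \quad \text{as } N \to \infty. \tag{25}$$

Indeed, we have, by assuming N > ɛN,

$$\mathbb{E}[\Psi_N^2] = \sum_{k=1}^{\varepsilon N} k^2\,\mathbb{P}(\Psi_N = k) + \sum_{k=\varepsilon N+1}^{N} k^2\,\mathbb{P}(\Psi_N = k) \le \sum_{k=1}^{\varepsilon N} k\,\varepsilon N\,\mathbb{P}(\Psi_N = k) + \sum_{k=\varepsilon N+1}^{N} N^2\,\mathbb{P}(\Psi_N = k) \le \varepsilon N\,\mathbb{E}[\Psi_N] + N^2\,\mathbb{P}(\Psi_N > \varepsilon N).$$

Dividing by $N\,\mathbb{E}[\Psi_N]$ gives

$$\frac{\mathbb{E}[\Psi_N^2]}{N\,\mathbb{E}[\Psi_N]} \le \varepsilon + \frac{N\,\mathbb{P}(\Psi_N > \varepsilon N)}{\mathbb{E}[\Psi_N]},$$

and, since $\mathbb{E}[\Psi_N] > 1$,

$$\limsup_{N \to \infty} \frac{\mathbb{E}[\Psi_N^2]}{N\,\mathbb{E}[\Psi_N]} < \varepsilon + \limsup_{N \to \infty} N\,\mathbb{P}(\Psi_N > \varepsilon N) = \varepsilon.$$

Condition (21) is now obtained since we can choose ɛ to be as small as we like.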
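
The second-moment bound used in this argument can be checked on a concrete law. The sketch below is only an illustration; the two-point law (mass on 1 and on ψN) and all parameter values are assumptions chosen for the example.

```python
def second_moment_bound(pmf, N, eps):
    """pmf maps k to P(Psi_N = k).
    Returns (E[Psi_N^2], eps * N * E[Psi_N] + N^2 * P(Psi_N > eps * N))."""
    mean = sum(k * p for k, p in pmf.items())
    second = sum(k * k * p for k, p in pmf.items())
    tail = sum(p for k, p in pmf.items() if k > eps * N)
    return second, eps * N * mean + N**2 * tail

# Hypothetical example: one offspring with probability 1 - c/N^2,
# floor(psi * N) offspring with probability c/N^2.
N, psi, c = 1000, 0.5, 2.0
pmf = {1: 1.0 - c / N**2, int(psi * N): c / N**2}
lhs, rhs = second_moment_bound(pmf, N, eps=0.1)
print(lhs, rhs, lhs <= rhs)
```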

The limiting genealogical process will then be a continuous-time Markov chain on $A_n^{sm}$ with generator matrix G whose off-diagonal elements are given by [for the values on the diagonal we again have (16)]

$$G(\xi,\xi') = \begin{cases} C_{\beta;2} & \text{if } \xi' = \mathrm{pairmerge}_{j_1,j_2}(\xi) \\ r^{(\ell)} & \text{if } \xi' = \mathrm{recomb}_{j,\ell}(\xi) \\ C_{\beta;|J|} & \text{if } \xi' = \mathrm{groupmerge}_{J_1,J_2,J_3,J_4}(\xi) \\ 0 & \text{for all other } \xi' \ne \xi, \end{cases} \tag{26}$$

where

$$C_{\beta;|J|} := C_{\beta;|J_1|,|J_2|,|J_3|,|J_4|;\,\beta-(|J_1|+|J_2|+|J_3|+|J_4|)},$$

$k = (k_1, \ldots, k_r)$, $|k| = k_1 + \cdots + k_r$, and

$$C_{b;k;s} = 4 \sum_{l=0}^{s \wedge (4-r)} \binom{s}{l} \frac{(4)_{r+l}}{4^{|k|+l}} \int_{[0,1]} x^{|k|+l}(1-x)^{s-l}\,\frac{1}{x^2}\,F(dx) = F(\{0\})\,\delta_{\{r=1,\,k_1=2\}} + 4 \sum_{l=0}^{s \wedge (4-r)} \binom{s}{l} \frac{(4)_{r+l}}{4^{|k|+l}} \int_{(0,1]} x^{|k|+l}(1-x)^{s-l}\,\frac{1}{x^2}\,F(dx) \tag{27}$$

with F from (22). As in the case of constant ψ, the third line in (26) gives the transition rates for a given merger into r (≤4) groups of sizes $k_1, \ldots, k_r$ when β active ancestral lineages are present, with s = β − |k| ≥ 0 lineages unaffected by a given merger of the Ξ-coalescent with

$$\Xi = \int_{[0,1]} \delta_{(x/4,\,x/4,\,x/4,\,x/4,\,0,\,0,\,\ldots)}\,F(dx)$$

(cf. Schweinsberg 2000a, Theorem 2). By way of example, $C_{2;2;0} = 1$. Now we can state the convergence of our ancestral recombination graph process with random ψ. The analog of Theorem 1.2 is the following:

Theorem 1.3. Let $\{\xi^{n,N}(m), m \ge 0\}$ be the ancestral process of a sample of n chromosomes in a population of size N with offspring laws $\mathcal{L}(\Psi_N)$ that satisfy (20), (21), and (22), and assume the scaling relation (24) for the recombination rates. Then, starting from $\xi^{n,N}(0) \in A_n$, we have that

$$\{\xi^{n,N}(\lfloor 4t/c_N \rfloor)\} \to \{\xi(t)\}, \quad \text{as } N \to \infty,$$

in the sense of the finite-dimensional distributions on the interval (0, ∞). The process {ξ(t)} is the Markov chain with generator matrix (26) and initial value ξ(0) given by

$$\xi(0) = \mathrm{cd}(\xi^{n,N}(0)) \in A_n^{sm}.$$

The proof is given in section A4 in the Appendix.

While $c_N \ge 1/N^2$ by definition, in principle any decay behavior of $c_N$ that is consistent with $\liminf_{N\to\infty} N^2 c_N \ge 1$, and hence any scaling relation between coalescent timescale and model census population size derived from it, is possible via a suitable choice of the family $\mathcal{L}(\Psi_N)$, N ∈ ℕ.

For an extreme example, let $\Psi_N = \lfloor N^{\gamma} \rfloor$ for some γ ∈ (0, 1); then $c_N \sim N^{-2(1-\gamma)}$ and (22) is satisfied with $F = \delta_0$.
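
The decay rate in this extreme example follows directly from $c_N = \mathbb{E}[\Psi_N(\Psi_N+3)]/(N(N-1))$ and can be seen numerically. A minimal sketch, with an assumed function name and illustrative values of N and γ:

```python
def c_N_deterministic(N, gamma):
    """c_N when Psi_N equals floor(N**gamma) with probability 1."""
    k = int(N ** gamma)
    return k * (k + 3) / (N * (N - 1))

gamma = 0.5
for N in (10**4, 10**6, 10**8):
    # The rescaled value should approach 1 as N grows.
    print(N, c_N_deterministic(N, gamma) * N ** (2 * (1 - gamma)))
```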

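The relation with the “fixed-ψ” model is as follows: For Theorem 1.2, we used the simple mixture distribution

$$\mathbb{P}(\Psi_N = \lfloor \psi N \rfloor) = 1 - \mathbb{P}(\Psi_N = 1) = \frac{c}{N^2} \tag{28}$$

for $\Psi_N$, in which ψ ∈ (0, 1) and c > 0 are both constants. Our choice (28) of law for $\Psi_N$ gives, using (17),

$$c_N = \mathbb{E}\Big[\frac{\Psi_N(\Psi_N+3)}{N(N-1)}\Big] = \Big(1 - \frac{c}{N^2}\Big)\frac{4}{N(N-1)} + \frac{c}{N^2}\,\frac{\lfloor \psi N \rfloor(\lfloor \psi N \rfloor+3)}{N(N-1)} \sim \frac{1}{N^2}\,(4 + c\psi^2).$$

Define $1_{(0,\psi)}(x) = 1$ if x ∈ (0, ψ) and $1_{(0,\psi)}(x) = 0$ otherwise. Our choice (28) further gives

$$\mathbb{P}(\Psi_N > Nx) = 1_{(0,\psi)}(x)\,\mathbb{P}(\Psi_N > Nx) = 1_{(0,\psi)}(x)\,\frac{c}{N^2},$$

and therefore

$$\frac{1}{c_N}\,\mathbb{P}(\Psi_N > Nx) \to 1_{(0,\psi)}(x)\,\frac{c}{4 + c\psi^2} = \int_{(x,1]} y^{-2}\,F(dy)$$

with

$$F = \frac{4}{4 + c\psi^2}\,\delta_0 + \frac{c\psi^2}{4 + c\psi^2}\,\delta_\psi.$$

Furthermore, $\mathbb{E}[\Psi_N/N] = 1/N + O(1/N^2)$; thus

$$\frac{c_N}{4\,\mathbb{E}[\Psi_N/N]} \sim \frac{1}{N}\,\frac{4 + c\psi^2}{4}$$

and Theorem 1.2 follows from Theorem 1.3 [after rescaling time in the limit process {ξ(t)} by a factor of $(4 + c\psi^2)/4$].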
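
The asymptotics for the fixed-ψ law can be reproduced with a few lines of arithmetic. The sketch below is illustrative; the function names and the values N = 10^6, ψ = 0.5, c = 2 are assumptions.

```python
def c_N_mixture(N, psi, c):
    """c_N = E[Psi_N (Psi_N + 3)] / (N (N - 1)) when Psi_N = floor(psi N)
    with probability c / N^2 and Psi_N = 1 otherwise."""
    k = int(psi * N)
    p_large = c / N**2
    return ((1 - p_large) * 4 + p_large * k * (k + 3)) / (N * (N - 1))

def mean_fraction(N, psi, c):
    """E[Psi_N / N] under the same two-point law."""
    k = int(psi * N)
    p_large = c / N**2
    return ((1 - p_large) + p_large * k) / N

def F_weights(psi, c):
    """Masses of the limiting measure F at 0 and at psi."""
    denom = 4 + c * psi**2
    return 4 / denom, c * psi**2 / denom

N, psi, c = 10**6, 0.5, 2.0
print(N**2 * c_N_mixture(N, psi, c), 4 + c * psi**2)
print(N * c_N_mixture(N, psi, c) / (4 * mean_fraction(N, psi, c)),
      (4 + c * psi**2) / 4)
print(F_weights(psi, c))
```

The first two lines print the finite-N quantity next to its stated limit; the two masses of F sum to 1.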

The constant $C_{b;k} := C_{b;k_1,\ldots,k_r;s}$ (27) depends on the probability measure F. The form of F will no doubt be different for different populations. We reiterate that resolving the mechanism of sweepstake-style reproduction will require detailed knowledge of the reproductive behavior and the ecology of the organism in question, along with comparison of model predictions to multilocus genetic data. A candidate for F may be the beta distribution with parameters ϑ > 0 and γ > 0, in which case the constant $C_{b;k}$ in (26) takes the form ($|k| := k_1 + \cdots + k_r$)

$$C_{b;k} = 4 \sum_{\ell=0}^{s \wedge (4-r)} \binom{s}{\ell} (4)_{r+\ell} \Big(\frac{1}{4}\Big)^{|k|+\ell} \frac{B(|k|+\ell+\vartheta-2,\ s-\ell+\gamma)}{B(\vartheta,\gamma)}, \tag{29}$$

B(⋅,⋅) being the Beta function.
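
For a Beta(ϑ, γ) measure F the constants can be computed from log-gamma values. The following sketch assumes the sum in (29) runs over ℓ from 0 to s ∧ (4 − r), as in (27), and that every group size is at least 2; the names `beta_fn`, `falling`, and `C_beta` are illustrative.

```python
from math import comb, exp, lgamma, prod

def beta_fn(a, b):
    """Beta function B(a, b) for a, b > 0."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def falling(a, m):
    """Falling factorial (a)_m, with (a)_0 = 1."""
    return prod(a - i for i in range(m))

def C_beta(ks, s, theta, gamma):
    """C_{b;k} for F = Beta(theta, gamma); ks are the group sizes (each >= 2)."""
    r, k = len(ks), sum(ks)
    total = 0.0
    for l in range(min(s, 4 - r) + 1):
        total += (comb(s, l) * falling(4, r + l) * 0.25 ** (k + l)
                  * beta_fn(k + l + theta - 2, s - l + gamma))
    return 4.0 * total / beta_fn(theta, gamma)

print(C_beta((2,), 0, 1.0, 3.0))                               # pair among two lineages
print(C_beta((2,), 1, 1.0, 3.0) + C_beta((3,), 0, 1.0, 3.0))   # three lineages
```

Both printed values should be 1 up to rounding, in line with the normalization $C_{2;2;0} = 1$.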

Different Scaling Regimes

The mechanism of sweepstake-style reproduction may be different for different populations, and the frequency of large offspring-number events may also be different. The particular timescale of the large reproduction events (we chose $\varepsilon_N = c/N^2$) results in a separation of timescales of the limit process. Resolving the separation-of-timescales problem results in the ARG with generator (14). Different scalings of $\varepsilon_N$ result in different limit processes. By way of example, if $N^2\varepsilon_N \to 0$, large offspring-number events are negligible in a large population, and we obtain the ARG associated with the usual Wright–Fisher reproduction, which can be read off Equation 14 by taking c = 0. One other scaling regime may seem reasonable, namely taking large offspring-number events to be more frequent than in assumption (5), but not too frequent. In mathematical notation, $N^2\varepsilon_N \to \infty$ and $N\varepsilon_N \to 0$. The ancestral process in this regime is again characterized by instantaneous separation of marked chromosomes into single-marked individuals, followed by coalescence and recombination occurring on the slow timescale. The probability of recombination is proportional to $N\varepsilon_N$ since the slow timescale must be in units proportional to $1/\varepsilon_N$. Hence, small reproduction events become negligible in the limit, and the generator of the limit process is given by

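$$G(\xi,\xi') = \begin{cases} \frac{\psi^2}{4}\, C_{\beta;2;\beta-2} & \text{if } \xi' = \mathrm{pairmerge}_{j_1,j_2}(\xi) \\ r^{(\ell)}/r & \text{if } \xi' = \mathrm{recomb}_{j,\alpha}(\xi) \\ \frac{\psi^2}{4}\, C_{\beta;|J|} & \text{if } \xi' = \mathrm{groupmerge}_{J}(\xi) \\ 0 & \text{for all other } \xi' \ne \xi \end{cases} \tag{30}$$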

in which $C_{\cdot;\cdot;\cdot}$ is given by Equation 15. The requirement $N\varepsilon_N \to 0$ is needed to prevent an unreasonably high rate of recombination.

Haploid Analogs

A haploid version of the above model, where only one parent contributes offspring at each time step, is a specific example of a Λ-coalescent, where

$$\Lambda(dx) = \delta_0(dx) + c\psi^2\,\delta_\psi(dx), \quad \psi \in (0, 1),\ c \in [0, \infty)$$

(see, e.g., Eldon and Wakeley 2006 and Birkner and Blath 2009). More precisely, as the population size N tends to infinity, assume probability $1 - c/N^2$ for the small reproduction events and $c/N^2$ for the large reproduction events (i.e., choose $\varepsilon_N = c/N^2$), and speed up generation time by $N^2$. Again, by randomizing ψ and/or switching to different scaling regimes, it is possible to obtain any given Λ-coalescent as limiting genealogy.
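
For this Λ the merger rates take a closed form. The sketch below uses the standard Λ-coalescent rate formula $\lambda_{b,k} = \int_{[0,1]} x^{k-2}(1-x)^{b-k}\,\Lambda(dx)$ for a given group of k out of b lineages, which is not spelled out in the text above; the function name is an assumption.

```python
def lambda_rate(b, k, psi, c):
    """Rate at which a given group of k out of b lineages merges when
    Lambda = delta_0 + c * psi^2 * delta_psi (2 <= k <= b)."""
    kingman = 1.0 if k == 2 else 0.0           # atom at 0: pairwise mergers only
    large = c * psi**k * (1 - psi) ** (b - k)  # atom at psi: all k join, others do not
    return kingman + large

print(lambda_rate(2, 2, 0.5, 4.0))
print(lambda_rate(3, 2, 0.5, 4.0) + lambda_rate(3, 3, 0.5, 4.0))
```

Both lines print the same value, reflecting the consistency relation $\lambda_{b,k} = \lambda_{b+1,k} + \lambda_{b+1,k+1}$.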

Two-Sex Extensions

Recent studies of the spawning behavior of Atlantic cod indicate that cod adopts a lekking behavior, in which males compete for females, and females exercise mate choice (Nordeide and Folstad 2000). Direct microsatellite DNA analysis indicates that although multiple paternity is sometimes detected, the reproductive success is highly skewed among the males; i.e., most of the successfully fertilized eggs can be attributed to a single male (Hutchings et al. 1999). Our model thus seems a good approximation to the actual reproduction mechanism of cod. Modifications to allow two distinct genders, and multiple paternity, are in principle straightforward.

More General Recombination Models

Our model can easily be enriched to allow also more general recombination events involving more than one crossover point at a time. Furthermore, by letting the number L of loci tend to infinity, a continuous model, where [0, 1] represents a whole chromosome (as in Griffiths and Marjoram 1997), can be accommodated into our framework.

Correlations in Coalescence Times

The marginal process

Every marginal process (marginal with respect to one fixed locus under consideration) of our ancestral recombination graph is a Ξ-coalescent (see Schweinsberg 2000a for notation and details) with

$$\Xi = \delta_0 + c\,\frac{\psi^2}{4}\,\delta_{(\psi/4,\,\psi/4,\,\psi/4,\,\psi/4,\,0,\,0,\,\ldots)}.$$

For r = 0, all marginals are identical (realization-wise), in particular times to the most recent common ancestor for different loci have correlation 1. However, in contrast to the classical setting, for r → ∞ one expects that the loci will not completely decorrelate, but instead keep positive correlations, as pointed out to us by J. E. Taylor (personal communication). In particular, one will not obtain the product distribution. This observation is a potential starting point for designing tests for the presence of large reproduction events, by comparing correlations for loci at large distance (hence with high recombination rate) under a Kingman- and a Ξ-coalescent–based ARG.

Correlation in coalescence times at two loci

Correlations in coalescence times between two loci have been considered in the context of quantifying association between loci (McVean 2002). Eldon and Wakeley (2008) consider correlations in coalescence times for a haploid population model, admitting large offspring numbers, in which the ancestral process admits only asynchronous multiple mergers of ancestral lineages. To illustrate the effects of the reproduction parameters on the coalescence times, we also consider the probability that coalescence occurs at the same time at the two loci, as well as the expected time until coalescence.

The calculations to obtain the correlations for a sample of size two at two loci (following the approach and notation of Durrett 2002) are shown in the Appendix, section A5. As we are now considering the gene genealogy of unlabeled lineages, let us briefly state the sample space. Let a and b denote the types at loci 1 and 2, respectively. The three sample states before coalescence at either locus has occurred can be denoted as (ab)(ab), (ab)(a)(b), and (a)(a)(b)(b). By (ab)(ab) we denote the state of two chromosomes, each carrying ancestral material at both loci. By (ab)(a)(b) we denote the state of one (ab) chromosome in addition to two chromosomes (a) and (b) carrying ancestral types at loci 1 and 2 only, respectively. The notation (a)(a)(b)(b) denotes the state of four chromosomes, each carrying ancestral types at only one locus. Let

$$h(i) := \mathbb{P}(\{T_a = T_b\} \mid i), \quad i \in \{0, 1, 2\}$$

denote the probability that coalescence at the two loci occurs at the same time, given that the process starts in state i, in which i refers to the number of double-marked chromosomes (2, 1, or 0). As we are working with the limiting model, all marked individuals are effectively single marked. Under the usual (Kingman-coalescent–based) ARG, $\lim_{r\to\infty} h(i) = 0$ as one would expect. Our model yields

$$\lim_{r\to\infty} h(i) = \frac{c\psi^4}{32 + 8c\psi^2 - c\psi^4}, \quad i \in \{0, 1, 2\}, \tag{31}$$

indicating that even unlinked loci remain correlated due to sweepstake-style reproduction. Figure 2 shows graphs of h(i) as a function of ψ for different values of c and r. As expected, h(i) increases with ψ, at a rate that increases with c.
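
Equation 31 can be evaluated directly and compared with a competing-rates reading. That reading is a heuristic of this sketch, not the Appendix derivation: it assumes that for r → ∞ the two loci sit on four separate lineages, that each locus coalesces at marginal rate 1 + cψ²/4, and that a single large event merges both pairs at rate c(ψ²/4)². Function names are illustrative.

```python
def h_limit(c, psi):
    """Right-hand side of (31)."""
    return c * psi**4 / (32 + 8 * c * psi**2 - c * psi**4)

def h_limit_from_rates(c, psi):
    """Probability that the first coalescence event hits both loci at once."""
    both = c * psi**4 / 16            # one large event merges both pairs
    marginal = 1 + c * psi**2 / 4     # coalescence rate at a single locus
    return both / (2 * marginal - both)

for c, psi in ((0.0, 0.5), (10.0, 0.5), (1000.0, 0.9)):
    print(c, psi, h_limit(c, psi), h_limit_from_rates(c, psi))
```

For ψ = 1 and very large c the value approaches 1/7, so even in the most extreme case the two unlinked loci coalesce simultaneously only a minority of the time.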

Figure 2. (A and B) The probabilities h(2), h(1), and h(0) as functions of ψ (lines) for different values of r and c. Values of h(⋅) obtained from the usual Moran model are shown for reference (symbols).

Under the usual ARG, the expected time $\mathbb{E}_i[T_s]$ until coalescence at either locus, starting from state i, is given by $\mathbb{E}_i[T_s] = (1 + h(i))/2$. The random variable $T_s$ can be viewed as the minimum of the times until coalescence occurs at the two loci. As r → ∞, the times $T_1$ and $T_2$ until coalescence at the two loci, respectively, become independent and identically distributed exponentials (i.i.d.e.) with rate 1, whose minimum has expected value 1/2. Under our model, the mean of $T_s$ is not the mean of the minimum of two i.i.d.e. with rate $1 + c\psi^2/4$, another reflection of the correlation in gene genealogies induced by sweepstake-style reproduction. Indeed, our model gives

$$\lim_{r\to\infty} \mathbb{E}_i[T_s] = \frac{1}{2}\Big(\frac{1}{1 + \chi c\psi^2/4}\Big), \quad i \in \{0, 1, 2\},$$

in which $\chi = 1 - \psi^2/8$.

Under our model, $\mathbb{E}_i[T_s]$ decreases with ψ, and the rate of decrease increases with c (Figure 3). The same pattern holds for the expected time $\mathbb{E}_i[T_l]$ until coalescence has occurred at both loci (Figure 4). As r → ∞, $\mathbb{E}_i[T_l]$ associated with the usual ARG approaches the expected value (3/2) of the maximum of two i.i.d.e. with rate 1. Under our model,

$$\lim_{r\to\infty} \mathbb{E}_i[T_l] = \frac{3}{2}\,\frac{1}{1 + c\psi^2/4}\,\frac{1}{1 + c\psi^2/4 - c\psi^4/32} + \frac{c\psi^2(6 - \psi^2)}{(c\psi^2 + 4)(4 + c\psi^2 - c\psi^4/8)}$$

while the maximum of two i.i.d.e. with rate λ has expected value 3/(2λ).
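
The two limits can be cross-checked against each other. Since the maximum and minimum of $T_1$ and $T_2$ always sum to $T_1 + T_2$, and each locus coalesces at marginal rate 1 + cψ²/4, one expects the two limits to add up to 2/(1 + cψ²/4). The sketch below (function names are assumptions) evaluates both formulas and prints that sum next to the expected value.

```python
def ts_limit(c, psi):
    """Limit of E_i[T_s] as r -> infinity."""
    chi = 1 - psi**2 / 8
    return 0.5 / (1 + chi * c * psi**2 / 4)

def tl_limit(c, psi):
    """Limit of E_i[T_l] as r -> infinity."""
    a = c * psi**2
    b = c * psi**4
    first = 1.5 / ((1 + a / 4) * (1 + a / 4 - b / 32))
    second = a * (6 - psi**2) / ((a + 4) * (4 + a - b / 8))
    return first + second

c, psi = 10.0, 0.5
print(ts_limit(c, psi), tl_limit(c, psi))
print(ts_limit(c, psi) + tl_limit(c, psi), 2 / (1 + c * psi**2 / 4))
```

With c = 0 the functions return the Kingman values 1/2 and 3/2.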

Figure 3. The expected time $\mathbb{E}_i[T_s]$ as a function of ψ for different values of c and r. Values of $\mathbb{E}_i[T_s]$ associated with the case c = 0 are shown for reference (symbols).

Figure 4. The expected time $\mathbb{E}_i[T_l]$ as a function of ψ for different values of c and r. For explanation of symbols, see Figure 3.

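The correlation $\mathrm{cor}_i(T_1, T_2)$ between $T_1$ and $T_2$ when starting from one of the three possible sample states i ∈ {0, 1, 2} (see Appendix) increases with ψ and more so if c is large (Figure 5). One obtains the following limit relations between h(i) and $\mathrm{cor}_i(T_1, T_2)$ for i ∈ {0, 1, 2}:

$$\lim_{r\to\infty} \mathrm{cor}_i(T_1, T_2) = \lim_{r\to\infty} h(i) \quad \text{(see Equation 31)};$$
$$\lim_{r\to 0} \mathrm{cor}_i(T_1, T_2) = \lim_{r\to 0} h(i) \quad \text{(see Equation A35)};$$
$$\lim_{c\to\infty} \mathrm{cor}_i(T_1, T_2) = \lim_{c\to\infty} h(i) \quad \text{(see Equation A34)}.$$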

Figure 5. Correlation of the time to coalescence at two loci as a function of ψ, for different values of c and r. For explanation of symbols, see Figure 3.

Quantifying the association between alleles at different loci can give insight into the evolutionary history of populations. Let $f_a$ and $f_b$ denote the frequencies of alleles a at locus 1 and b at locus 2, and let $f_{ab}$ denote the frequency of chromosome ab in the total population. The statistic $D_{ab} := f_{ab} - f_a f_b$ measures the deviation from independence, since if the two loci were evolving independently, $f_{ab} = f_a f_b$. A related quantity is the $r^2$ statistic, defined as

$$r^2 := \frac{D^2}{f_a(1 - f_a)\,f_b(1 - f_b)}$$

(Hill and Robertson 1968), assuming $f_a, f_b \notin \{0, 1\}$. In applications, one wants to compare observed values of $r^2$ calculated from data to the expected value $\mathbb{E}[r^2]$, obtained under an appropriate population model. Calculating the expected value of $r^2$ is not straightforward, since $r^2$ is a ratio of correlated random variables. The expected value of $r^2$ is, instead, approximated by the ratio $\mathbb{E}[D^2]/\mathbb{E}[f_a(1 - f_a)f_b(1 - f_b)]$ (Ohta and Kimura 1971).
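
As a small worked example, the two statistics can be computed from haplotype and allele frequencies. The function name and the sample frequencies below are illustrative assumptions.

```python
def ld_stats(f_ab, f_a, f_b):
    """Return (D, r^2) for haplotype frequency f_ab and allele
    frequencies f_a, f_b, which must lie strictly between 0 and 1."""
    if not (0 < f_a < 1 and 0 < f_b < 1):
        raise ValueError("allele frequencies must lie strictly in (0, 1)")
    D = f_ab - f_a * f_b
    r2 = D**2 / (f_a * (1 - f_a) * f_b * (1 - f_b))
    return D, r2

print(ld_stats(0.25, 0.5, 0.5))   # independent loci
print(ld_stats(0.5, 0.5, 0.5))    # complete association
```

Independent loci give D = 0 and r² = 0, while complete association at equal allele frequencies gives r² = 1.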

A prediction of linkage disequilibrium in the population can be framed in terms of correlations in coalescence times between two loci for a sample of size two, assuming a small mutation rate (McVean 2002). The prediction rests on approximating the expected value $\mathbb{E}[r^2]$ of the squared correlation statistic $r^2$ (Hill and Robertson 1968) of association between alleles at two loci by the ratio of expected values (Ohta and Kimura 1971). Following, e.g., Durrett (2002) one can obtain expressions for correlations in coalescence times between two loci for a sample of size two (see Appendix). Under our model, one obtains the limit results

graphic file with name 255equ1.jpg

When ψ is small but c large, one obtains

graphic file with name 255equ2.jpg

Under the usual ARG, the prediction tends to 5/11 as r → 0. Thus, even in the presence of a high recombination rate, if large offspring-number events are frequent enough, one may see only evidence of low recombination rate in data. Further, the prediction can be substantially higher than Kingman-coalescent–based predictions if c is large and the recombination rate is not too small (Figure 6).

Figure 6. The estimate of the expected value $\mathbb{E}[r^2]$ as a function of ψ for different values of c and r. The solid lines represent the value of the estimate associated with the usual Wright–Fisher model.

For particular examples of probability measures F from Equation 27 associated with the generator derived from our random offspring distribution model one can compute the quantities considered above in relation to fixed ψ. One such example distribution can be the Beta(ϑ, γ) distribution (see Figure 7 for the corresponding prediction of linkage disequilibrium). One obtains for i ∈ {0, 1, 2},

limrh(i)=4γ(1+2ϑ+γ)8γ(1+γ)+10γϑ+7ϑ(1+ϑ).

Define $\tilde h(i) := \lim_{r\to\infty} h(i)$. For i ∈ {0, 1, 2} one obtains

limrEi[Ts]=4h~(i)+4γ(1+2ϑ+γ)8γ(1+γ)+10γϑ+7ϑ(1+ϑ),limrEi[Tl]=3212h~(i)+3γ2(8γ(1+γ)+10γϑ+7ϑ(1+ϑ)). (32)

The form of the relation shown in (32) between h(i) and Ei[Ts] and Ei[Tl] resembles the one obtained for the Kingman-coalescent–based ARG, with the addition of a “correction” term due to simultaneous multiple mergers.

Figure 7. The prediction of linkage disequilibrium obtained from the ARG associated with the Beta(ϑ, γ) distribution. The different lines represent different values of γ (top panels) or ϑ (bottom panels). The broken horizontal line represents the prediction obtained from the usual ARG.

Variance of pairwise differences

The expected variance of pairwise differences was employed by Wakeley (1997) to estimate the recombination rate in low offspring-number (Wright–Fisher) populations, under the usual ancestral recombination graph. Let the random variable $K_{ij}$ denote the number of differences between sequences i and j, with $K_{ii} = 0$. The average number π of pairwise differences for n sequences is

$$\pi = \frac{2}{n(n-1)} \sum_{i<j} K_{ij}.$$

The (empirical) variance $S_\pi^2$ of pairwise differences is defined as

$$S_\pi^2 = \frac{2}{n(n-1)} \sum_{i<j} (K_{ij} - \pi)^2.$$

In the Appendix we derive the expected variance of pairwise differences $\mathbb{E}[S_\pi^2]$ under the ancestral recombination graph described by the generator G (Equation 14) derived from our large offspring-number model. Under our model, $\mathbb{E}[S_\pi^2]$ is a function of the parameters c and ψ, in addition to being a function of r and θ (Figure 8 and Figure 9). In Figure 8, $\mathbb{E}[S_\pi^2]$, when only two loci are considered, is graphed as a function of the recombination rate and in Figure 9 as a function of sample size. Figures 8 and 9 show that $\mathbb{E}[S_\pi^2]$ is primarily influenced by the mutation rate (θ), when the values of c and ψ are fairly modest. However, $\mathbb{E}[S_\pi^2]$ can be quite low when both c and ψ are large, even when θ is also large (Figure 9). When c and ψ are both large, two sequences are more likely to coalesce before a mutation separates them.
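
The empirical statistics π and the variance of pairwise differences are straightforward to compute from aligned sequences. A minimal sketch, assuming equal-length sequences given as strings; the function name and the toy data are illustrative.

```python
from itertools import combinations

def pairwise_stats(seqs):
    """Return (pi, S_pi^2) for a list of at least two aligned sequences."""
    n = len(seqs)
    # K_ij: number of sites at which sequences i and j differ.
    diffs = [sum(x != y for x, y in zip(a, b))
             for a, b in combinations(seqs, 2)]
    scale = 2.0 / (n * (n - 1))
    pi = scale * sum(diffs)
    s2 = scale * sum((k - pi) ** 2 for k in diffs)
    return pi, s2

print(pairwise_stats(["AAAA", "AAAT", "AATT"]))
```

For the three toy sequences the pairwise differences are 1, 2, and 1, so π = 4/3 and the variance is 2/9.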

Figure 8. (A–D) The expected variance of pairwise differences for sample size 50 as a function of the recombination rate r for different values of the parameters c, ψ, and θ as shown.

Figure 9. (A–D) The expected variance of pairwise differences as a function of sample size for different values of the parameters c, ψ, r, and θ as shown.

The variance of pairwise differences alone will not suffice to yield estimates of r if both c and ψ are unknown. To jointly estimate the four parameters (c, ψ, r, θ) of our model one probably needs to employ computationally heavy likelihood and importance sampling methods in the spirit of Fearnhead and Donnelly (2001). However, given knowledge of c and ψ, one can, in principle, use the variance of pairwise differences to quickly obtain estimates of the recombination rate.

Correlations in ratios of coalescence times

The behavior of the correlations in ratios of coalescence times for sample sizes larger than two is investigated using Monte Carlo simulations.

Let $L_i$ denote the total length of branches ancestral to i sequences at one locus, let L denote the total length of the genealogy at the same locus, and define $R_i := L_i/L$. Thus, $R_1$ is the ratio of the total length of external branches to the total length of the genealogy. The idea behind estimating the expected value $\mathbb{E}[R_i]$ is as follows. Assuming the infinitely many sites mutation model, let $S_i$ denote the total number of mutations in i copies and S denote the total number of segregating sites, and define $V_i := S_i/S$. The key idea behind deriving the coalescent was to separate the (neutral) mutation process from the genealogical process. The same principle also applies to predicting patterns of genetic variation using the coalescent: First, one constructs the genealogy and then superimposes mutations on the genealogy. The shape of the genealogy is thus a deciding factor in the genetic patterns one predicts. The relative lengths $R_i$ of the different types of branches should therefore predict the relative number $V_i$ of mutations of each class. This idea is exploited by Eldon (2011) to estimate coalescence parameters in the large offspring-number models introduced by Schweinsberg (2003) and Eldon and Wakeley (2006). Namely, the claim is

$$\lim_{n\to\infty} \mathbb{E}[R_i] = \lim_{n\to\infty} \mathbb{E}[V_i] = f(\varpi, i), \tag{33}$$

where n denotes the sample size, and ϖ denotes the coalescence (reproduction) parameters. Indeed, it follows from the results of Berestycki et al. (2007, 2008) that (1 < α < 2)

$$\lim_{n\to\infty} \mathbb{E}[R_i] = \lim_{n\to\infty} \mathbb{E}[V_i] = \frac{\Gamma(i + \alpha - 2)\,(\alpha - 1)(2 - \alpha)}{\Gamma(\alpha)\, i!}$$

when associated with the Beta(2 − α, α) coalescent derived by Schweinsberg (2003) from a population model in which the offspring law is stable with index α. A key feature of expression (33) is the absence of mutation rate in the function f(ϖ, i); thus given a large number of DNA sequences (possibly in the thousands), one hopes to be able to obtain estimates of the coalescence parameters ϖ without having to jointly estimate the mutation rate. In our model, there are four parameters to estimate, namely mutation and recombination rates, along with the coalescence parameters c and ψ. Even though full-likelihood methods exist (Birkner and Blath 2008; Birkner et al. 2011), applying them to large data sets consisting of thousands of sequences may represent a challenge.
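
The limiting fractions for the Beta(2 − α, α) coalescent can be evaluated with log-gamma values, which avoids overflow for large i. The sketch below is illustrative; the function name is an assumption.

```python
from math import exp, lgamma

def beta_coalescent_fraction(i, alpha):
    """Gamma(i + alpha - 2) (alpha - 1) (2 - alpha) / (Gamma(alpha) i!)
    for integer i >= 1 and 1 < alpha < 2."""
    log_ratio = lgamma(i + alpha - 2) - lgamma(alpha) - lgamma(i + 1)
    return (alpha - 1) * (2 - alpha) * exp(log_ratio)

alpha = 1.5
for i in range(1, 5):
    print(i, beta_coalescent_fraction(i, alpha))
```

For i = 1 the expression reduces to 2 − α, so the predicted share of external branch length (and of singleton mutations) grows as α decreases toward 1.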

Estimates of $\mathbb{E}[R_i]$ as functions of the sample size n and the coalescence parameters c and ψ are shown in Table 1. In nearly all cases the estimates of $R_i$ decreased as sample size increased; the exception was $R_1$ when (c, ψ) = (1000, 0.5) (Table 1). When both c and ψ are large enough, we observe a nonmonotonic behavior in $R_1$ as sample size increases (results not shown). The nonmonotonic behavior may be related to the property of the marginal haploid process (the point-mass part obtained as c → ∞) of a single locus of not coming down from infinity (Schweinsberg 2000b); i.e., when one starts with an infinite number of lineages (sample size), the number of lineages stays infinite. For such processes that do not come down from infinity, the ratio $R_1$ should go to one; i.e., the gene genealogy should become completely star-shaped (see, e.g., Eldon 2011). As both c and ψ increase, one expects the deviation from Kingman-coalescent–based predictions to increase. By way of example, for sample size 50 the vector $(\mathbb{E}[R_1], \ldots, \mathbb{E}[R_4])$ is estimated to be ∼(0.24, 0.12, 0.08, 0.06) when associated with the Kingman coalescent (c = 0), while being ∼(0.58, 0.20, 0.09, 0.05) when (c, ψ) = (1000, 0.5). In all cases the estimate $\hat R_i$ of the standard deviation of $R_i$ decreases as sample size increases, indicating convergence.

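Table 1. Estimates of the expected values E[Ri] of the ratios Ri := Li/L for 1 ≤ i ≤ 4 at one marginal locus (columns R1 to R4), along with estimates R^i of the standard deviations of Ri (columns R^1 to R^4)

| ψ | c | n | R1 | R2 | R3 | R4 | R^1 | R^2 | R^3 | R^4 |
|---|---|---|---|---|---|---|---|---|---|---|
|  | 0 | 6 | 0.466 | 0.219 | 0.138 | 0.100 | 0.183 | 0.167 | 0.198 | 0.124 |
|  | 0 | 10 | 0.378 | 0.180 | 0.117 | 0.085 | 0.156 | 0.132 | 0.120 | 0.110 |
|  | 0 | 20 | 0.300 | 0.146 | 0.096 | 0.070 | 0.119 | 0.097 | 0.088 | 0.081 |
|  | 0 | 50 | 0.235 | 0.116 | 0.077 | 0.057 | 0.080 | 0.063 | 0.058 | 0.055 |
| 0.005 | 1 | 6 | 0.466 | 0.219 | 0.138 | 0.100 | 0.183 | 0.167 | 0.198 | 0.124 |
| 0.005 | 1 | 10 | 0.377 | 0.181 | 0.117 | 0.085 | 0.156 | 0.133 | 0.120 | 0.111 |
| 0.005 | 1 | 20 | 0.299 | 0.146 | 0.095 | 0.071 | 0.118 | 0.097 | 0.088 | 0.082 |
| 0.005 | 1 | 50 | 0.234 | 0.116 | 0.076 | 0.057 | 0.080 | 0.064 | 0.057 | 0.054 |
| 0.005 | 1000 | 6 | 0.467 | 0.219 | 0.137 | 0.100 | 0.182 | 0.167 | 0.198 | 0.124 |
| 0.005 | 1000 | 10 | 0.377 | 0.181 | 0.117 | 0.085 | 0.156 | 0.133 | 0.120 | 0.110 |
| 0.005 | 1000 | 20 | 0.299 | 0.146 | 0.095 | 0.071 | 0.119 | 0.097 | 0.088 | 0.082 |
| 0.005 | 1000 | 50 | 0.235 | 0.116 | 0.077 | 0.057 | 0.080 | 0.064 | 0.058 | 0.054 |
| 0.5 | 1 | 6 | 0.468 | 0.217 | 0.138 | 0.099 | 0.184 | 0.166 | 0.199 | 0.124 |
| 0.5 | 1 | 10 | 0.381 | 0.179 | 0.115 | 0.085 | 0.157 | 0.132 | 0.120 | 0.110 |
| 0.5 | 1 | 20 | 0.304 | 0.145 | 0.095 | 0.070 | 0.120 | 0.097 | 0.088 | 0.081 |
| 0.5 | 1 | 50 | 0.242 | 0.117 | 0.077 | 0.056 | 0.081 | 0.064 | 0.058 | 0.054 |
| 0.5 | 1000 | 6 | 0.541 | 0.173 | 0.116 | 0.089 | 0.184 | 0.152 | 0.177 | 0.116 |
| 0.5 | 1000 | 10 | 0.566 | 0.117 | 0.078 | 0.058 | 0.159 | 0.101 | 0.090 | 0.082 |
| 0.5 | 1000 | 20 | 0.743 | 0.101 | 0.035 | 0.022 | 0.084 | 0.053 | 0.033 | 0.027 |
| 0.5 | 1000 | 50 | 0.576 | 0.195 | 0.089 | 0.046 | 0.058 | 0.051 | 0.037 | 0.026 |

Estimates are obtained from $10^5$ simulated gene genealogies.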
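The way such estimates are produced can be illustrated for the Kingman case (the c = 0 rows) only. The following is a minimal sketch, not the authors' C program: it assumes the standard Kingman coalescent with each pair of lineages coalescing at rate 1, and the function names `branch_lengths` and `estimate_ratios` are hypothetical.

```python
import random

def branch_lengths(n, rng):
    """Simulate one Kingman genealogy for n sampled sequences.

    Returns a list L where L[i] is the total length of branches ancestral
    to exactly i sampled sequences (index 0 is unused). Time is measured
    in units where each pair of lineages coalesces at rate 1 (assumption).
    """
    sizes = [1] * n          # number of sampled sequences below each lineage
    L = [0.0] * n
    while len(sizes) > 1:
        k = len(sizes)
        wait = rng.expovariate(k * (k - 1) / 2)   # time to next coalescence
        for s in sizes:
            L[s] += wait
        i, j = sorted(rng.sample(range(k), 2))    # uniformly chosen pair
        merged = sizes[i] + sizes[j]
        del sizes[j]                              # delete larger index first
        del sizes[i]
        sizes.append(merged)
    return L

def estimate_ratios(n, reps, seed=1):
    """Monte Carlo estimates of E[R_i], R_i = L_i / L, for i = 1..4 (needs n >= 5)."""
    rng = random.Random(seed)
    sums = [0.0] * 4
    for _ in range(reps):
        L = branch_lengths(n, rng)
        total = sum(L)
        for i in range(1, 5):
            sums[i - 1] += L[i] / total
    return [s / reps for s in sums]

if __name__ == "__main__":
    print(estimate_ratios(6, 20000))
```

Running it for n = 6 gives values that can be compared with the first row of Table 1; extending the sketch to c > 0 would require the multiple-merger rates of the model, which are not reproduced here.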

The rationale behind comparing the statistics in Tables 2 and 3 is as follows. As sequencing technologies advance, and the genomic sequences of more organisms become available, a case in point being the recently published genomic sequence of Atlantic cod (Star et al. 2011), genomic scans of thousands of individuals will become more common. Given DNA sequence data for many loci, one could calculate correlations for counts and ratios of counts of mutations and compare them to predictions based on different ancestral recombination graphs. Similarly for the single-locus statistics (Table 1), the idea is that the correlations of the coalescence time statistics (Li and Ri) should reflect correlations of mutation counts (Si). In particular, under the usual ARG one expects (see Tables 2 and 3)

$$\lim_{r\to\infty}\mathrm{cor}\bigl(L_i^{(1)},L_j^{(2)}\bigr)=\lim_{r\to\infty}\mathrm{cor}\bigl(R_i^{(1)},R_j^{(2)}\bigr)=0,$$

where the superscript refers to locus numbers 1 and 2, respectively, while under an ARG admitting simultaneous multiple mergers one expects

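$$\lim_{r\to\infty}\mathrm{cor}\bigl(L_i^{(1)},L_j^{(2)}\bigr)=f(i,j,\varpi), \qquad \lim_{r\to\infty}\mathrm{cor}\bigl(R_i^{(1)},R_j^{(2)}\bigr)=g(i,j,\varpi),$$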

where f and g are functions of the particular statistics indicated by i and j as well as the vector ϖ of coalescence (reproduction) parameters.

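Table 2. Estimates of the correlation cor(X(1), Y(2)) between X(1) and Y(2), where X(1) represents a statistic for locus 1 and Y(2) that for locus 2, as follows: the time T until the most recent common ancestor at a locus; L, the total length of the gene genealogy at a locus; and Ri := Li/L, in which Li denotes the total length of branches ancestral to i sequences

| c | ψ | r | $\mathrm{cor}(T^{(1)},T^{(2)})$ | $\mathrm{cor}(L^{(1)},L^{(2)})$ | $\mathrm{cor}(L_1^{(1)},L_1^{(2)})$ | $\mathrm{cor}(L_2^{(1)},L_2^{(2)})$ | $\mathrm{cor}(L_3^{(1)},L_3^{(2)})$ | $\mathrm{cor}(L_4^{(1)},L_4^{(2)})$ |
|---|---|---|---|---|---|---|---|---|
| 0 |  | 1 | 0.311 | 0.418 | 0.586 | 0.501 | 0.434 | 0.378 |
| 0 |  | 10 | 0.016 | 0.058 | 0.169 | 0.089 | 0.047 | 0.036 |
| 1 | 0.005 | 1 | 0.306 | 0.415 | 0.588 | 0.508 | 0.431 | 0.380 |
| 1 | 0.005 | 10 | 0.015 | 0.055 | 0.171 | 0.090 | 0.049 | 0.034 |
| 1000 | 0.005 | 1 | 0.308 | 0.419 | 0.585 | 0.509 | 0.438 | 0.376 |
| 1000 | 0.005 | 10 | 0.013 | 0.051 | 0.168 | 0.093 | 0.052 | 0.030 |
| 1 | 0.5 | 1 | 0.328 | 0.447 | 0.601 | 0.516 | 0.449 | 0.389 |
| 1 | 0.5 | 10 | 0.024 | 0.085 | 0.193 | 0.107 | 0.064 | 0.036 |
| 1000 | 0.5 | 1 | 0.982 | 0.995 | 0.976 | 0.950 | 0.918 | 0.879 |
| 1000 | 0.5 | 10 | 0.924 | 0.947 | 0.763 | 0.623 | 0.503 | 0.396 |

| c | ψ | r | $\mathrm{cor}(L_1^{(1)},L_2^{(2)})$ | $\mathrm{cor}(L_1^{(1)},L_3^{(2)})$ | $\mathrm{cor}(L_1^{(1)},L_4^{(2)})$ | $\mathrm{cor}(L_2^{(1)},L_3^{(2)})$ | $\mathrm{cor}(L_2^{(1)},L_4^{(2)})$ | $\mathrm{cor}(L_3^{(1)},L_4^{(2)})$ |
|---|---|---|---|---|---|---|---|---|
| 0 |  | 1 | −0.031 | −0.031 | −0.021 | −0.005 | −0.018 | 0.009 |
| 0 |  | 10 | 0.005 | −0.006 | −0.001 | 0.012 | 0.005 | 0.013 |
| 1 | 0.005 | 1 | −0.035 | −0.025 | −0.021 | −0.001 | −0.019 | 0.009 |
| 1 | 0.005 | 10 | 0.000 | −0.002 | 0.008 | 0.009 | 0.005 | 0.014 |
| 1000 | 0.005 | 1 | −0.036 | −0.029 | −0.021 | −0.006 | −0.018 | 0.010 |
| 1000 | 0.005 | 10 | −0.002 | −0.003 | 0.003 | 0.014 | 0.004 | 0.005 |
| 1 | 0.5 | 1 | −0.022 | −0.014 | −0.007 | 0.004 | −0.004 | 0.023 |
| 1 | 0.5 | 10 | 0.009 | 0.006 | 0.010 | 0.022 | 0.014 | 0.025 |
| 1000 | 0.5 | 1 | 0.326 | 0.314 | 0.305 | 0.238 | 0.218 | 0.176 |
| 1000 | 0.5 | 10 | 0.311 | 0.284 | 0.266 | 0.289 | 0.239 | 0.262 |

Estimates are based on $10^5$ simulated ancestral recombination graphs each for a sample of size 50.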

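Table 3. Estimates of the correlation cor(X(1), Y(2)) between X(1) and Y(2), where X(1) represents a statistic for locus 1 and Y(2) that for locus 2, as follows: the time T until most recent common ancestor at a locus; L, the total length of the gene genealogy at a locus; and Ri := Li/L, in which Li denotes the total length of branches ancestral to i sequences

| c | ψ | r | $\mathrm{cor}(R_1^{(1)},R_1^{(2)})$ | $\mathrm{cor}(R_2^{(1)},R_2^{(2)})$ | $\mathrm{cor}(R_3^{(1)},R_3^{(2)})$ | $\mathrm{cor}(R_4^{(1)},R_4^{(2)})$ |
|---|---|---|---|---|---|---|
| 0 |  | 1 | 0.570 | 0.548 | 0.486 | 0.431 |
| 0 |  | 10 | 0.116 | 0.089 | 0.052 | 0.042 |
| 1 | 0.005 | 1 | 0.566 | 0.552 | 0.487 | 0.435 |
| 1 | 0.005 | 10 | 0.115 | 0.091 | 0.054 | 0.035 |
| 1000 | 0.005 | 1 | 0.570 | 0.551 | 0.491 | 0.434 |
| 1000 | 0.005 | 10 | 0.115 | 0.095 | 0.059 | 0.031 |
| 1 | 0.5 | 1 | 0.583 | 0.557 | 0.504 | 0.447 |
| 1 | 0.5 | 10 | 0.135 | 0.102 | 0.063 | 0.038 |
| 1000 | 0.5 | 1 | 0.955 | 0.927 | 0.900 | 0.866 |
| 1000 | 0.5 | 10 | 0.679 | 0.469 | 0.384 | 0.304 |

| c | ψ | r | $\mathrm{cor}(R_1^{(1)},R_2^{(2)})$ | $\mathrm{cor}(R_1^{(1)},R_3^{(2)})$ | $\mathrm{cor}(R_1^{(1)},R_4^{(2)})$ | $\mathrm{cor}(R_2^{(1)},R_3^{(2)})$ | $\mathrm{cor}(R_2^{(1)},R_4^{(2)})$ | $\mathrm{cor}(R_3^{(1)},R_4^{(2)})$ |
|---|---|---|---|---|---|---|---|---|
| 0 |  | 1 | −0.023 | −0.040 | −0.042 | −0.026 | −0.042 | −0.014 |
| 0 |  | 10 | −0.022 | −0.023 | −0.020 | 0.003 | −0.005 | 0.005 |
| 1 | 0.005 | 1 | −0.024 | −0.038 | −0.042 | −0.023 | −0.046 | −0.014 |
| 1 | 0.005 | 10 | −0.027 | −0.018 | −0.015 | 0.001 | −0.007 | 0.011 |
| 1000 | 0.005 | 1 | −0.028 | −0.038 | −0.038 | −0.031 | −0.043 | −0.012 |
| 1000 | 0.005 | 10 | −0.030 | −0.024 | −0.016 | 0.003 | −0.008 | −0.001 |
| 1 | 0.5 | 1 | −0.023 | −0.035 | −0.035 | −0.028 | −0.034 | −0.007 |
| 1 | 0.5 | 10 | −0.029 | −0.023 | −0.015 | 0.004 | 0.000 | 0.016 |
| 1000 | 0.5 | 1 | −0.622 | −0.348 | −0.112 | −0.100 | −0.038 | −0.016 |
| 1000 | 0.5 | 10 | −0.330 | −0.255 | −0.135 | 0.009 | 0.004 | 0.096 |

Estimates are based on $10^5$ simulated ancestral recombination graphs each for a sample of size 50.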

In general, the results reported in Tables 2 and 3 indicate that high values of both ψ and c are required for high correlations when recombination rate is high, when associated with our model. In particular, the correlations between Ri(1) and Ri(2) (i.e., between corresponding Ri’s at different loci) can be quite high, even when recombination is high, when both c and ψ are large enough, another indicator of the genome-wide correlations induced by sweepstake-like reproduction.

A different question concerns the limit behavior as sample size n increases. Fix the recombination rate and consider the limits

$$\lim_{n\to\infty}\mathrm{cor}\bigl(R_i^{(1)},R_j^{(2)}\bigr), \qquad \lim_{n\to\infty}\mathrm{cor}\bigl(V_i^{(1)},V_j^{(2)}\bigr). \tag{34}$$

Under the usual ARG, one expects the limits in (34) to be only functions of the recombination rate (and i and j). If the ARG also admits simultaneous multiple mergers, one expects the limits in (34) also to be functions of ϖ. Considering unlinked loci, one would be interested in the limits

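$$\lim_{r\to\infty}\lim_{n\to\infty}\mathrm{cor}\bigl(R_i^{(1)},R_j^{(2)}\bigr), \qquad \lim_{r\to\infty}\lim_{n\to\infty}\mathrm{cor}\bigl(V_i^{(1)},V_j^{(2)}\bigr). \tag{35}$$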

Resolving the limits (35) for different ARGs promises not only to yield insights into genome-wide correlations, but also to provide tools for inference, e.g., to distinguish between different population models.

The C program written to perform the simulations was checked by comparing correlation in coalescence times for sample size two at two loci to analytical results. The program is available upon request.
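A check of this kind can be sketched for the usual (Kingman-based) ARG, where the two-locus correlation for sample size two is known in closed form. The sketch below is not the authors' program. It assumes the standard parameterization in which each pair of lineages coalesces at rate 1 and each lineage carrying both loci recombines at rate ρ/2, for which the classical correlation is (ρ + 18)/(ρ² + 13ρ + 18); this ρ need not coincide with the scaling of r used in the tables above. All function names are hypothetical.

```python
import math
import random

def simulate_two_locus_times(rho, rng):
    """Coalescence times (T1, T2) at two loci for a sample of size two.

    States: 'A' = two lineages, each ancestral at both loci;
            'B' = three lineages (one two-locus, two single-locus);
            'C' = four single-locus lineages.
    """
    state, t = "A", 0.0
    while True:
        if state == "A":
            rate = 1.0 + rho
            t += rng.expovariate(rate)
            if rng.random() * rate < 1.0:
                return t, t                      # both loci coalesce together
            state = "B"
        elif state == "B":
            rate = 3.0 + rho / 2.0
            t += rng.expovariate(rate)
            u = rng.random() * rate
            if u < 2.0:
                break                            # exactly one locus coalesces
            state = "A" if u < 3.0 else "C"
        else:
            t += rng.expovariate(6.0)
            if rng.random() * 6.0 < 2.0:
                break                            # exactly one locus coalesces
            state = "B"
    later = t + rng.expovariate(1.0)             # remaining locus: two lineages
    return (t, later) if rng.random() < 0.5 else (later, t)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def estimate_correlation(rho, reps, seed=1):
    rng = random.Random(seed)
    pairs = [simulate_two_locus_times(rho, rng) for _ in range(reps)]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])

def exact_correlation(rho):
    return (rho + 18.0) / (rho * rho + 13.0 * rho + 18.0)

if __name__ == "__main__":
    for rho in (0.1, 1.0, 10.0):
        print(rho, estimate_correlation(rho, 100000), exact_correlation(rho))
```

Agreement between the simulated and closed-form values at several recombination rates is the kind of consistency check described above; for the multiple-merger ARG the analytical expression would be the one derived in the main text.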

Comparison with Eldon and Wakeley (2008)

Eldon and Wakeley (2008) consider correlations in coalescence times and the associated prediction of linkage disequilibrium, under a modified Wright–Fisher sweepstake-style reproduction model, and observe correlations in coalescence times between loci despite high recombination rate. Our work differs from theirs in important ways. To begin with, we treat diploidy in detail, in which each offspring receives its two chromosomes from two distinct diploid parents. This leads to a separation of timescales of the ancestral process. We formally derive an ancestral recombination graph that admits simultaneous multiple mergers of ancestral lineages, which naturally arise in diploid models. Eldon and Wakeley observed correlations in coalescence times when considering only sample size two at each locus in a model that contains diploid individuals only implicitly; it is not a priori obvious that the correlations would still hold for large sample sizes. We confirm this using our formally obtained ARG that allows us also to investigate correlations in coalescence times and in ratios of coalescence times, for sample sizes larger than two at each locus. In addition, one can apply our ARG to inference problems. Indeed, we show how the variance of pairwise differences can, in principle, be used to obtain estimates of the recombination rate. Finally, we obtain a large class of ARGs by randomizing the offspring distribution; thus one is not restricted to the simple case of fixed ψ.

Furthermore, since the estimate of the expected value of $r^2$ can be expressed in terms of correlations in coalescence times, Eldon and Wakeley consider this estimate under their modified Wright–Fisher model. However, the estimate is based on approximating an expected value of a ratio of correlated random variables by the ratio of expected values of the corresponding random variables and is also derived for a sample of size two at two loci. Thus, it may not be the ideal quantity to quantify association between loci for large sample sizes. A more natural way may be to investigate correlations in coalescence times for samples larger than two the way we do.

Discussion

Understanding the genome-wide effects of sweepstake-like reproduction on gene genealogies was our main aim. To this end, we derived ancestral recombination graphs for many loci arising from population models admitting large offspring numbers. High variance in individual reproductive success, or sweepstake-style reproduction, has been suggested to explain the low genetic diversity observed in many marine populations (Hedgecock et al. 1982; Avise et al. 1988; Palumbi and Wilson 1990; Beckenbach 1994; Hedgecock 1994; Árnason 2004). Hedgecock and Pudovkin (2011) review the sweepstake-style reproduction hypothesis and conclude that it provides the correct framework in which to investigate many natural marine populations.

Multiple-merger (Donnelly and Kurtz 1999; Pitman 1999; Sagitov 1999) and simultaneous (Schweinsberg 2000a; Möhle and Sagitov 2001) multiple-merger coalescent models arise from population models incorporating sweepstakes reproduction by admitting large offspring numbers (Sagitov 2003; Eldon and Wakeley 2006; Sargsyan and Wakeley 2008). While multiple-merger coalescent processes describing the ancestral relations of alleles at a single locus have received the most attention from mathematicians, ancestral processes for multiple linked loci have hitherto remained unexplored. We derive an ancestral recombination graph for many loci from a diploid biparental population model, in which one pair of diploid individuals (parents) contributes offspring to the population at each time step. Thus, each offspring necessarily receives its chromosomes from distinct individuals, as diploid individuals tend to do. Incorporating diploidy into our model the way we do leads to a separation-of-timescales problem. Our limiting object is essentially a “haploid” process, in which chromosomes either coalesce or recombine. By extending a result of Möhle (1998), we show that diploidy, a fundamental characteristic of many natural populations, can thus be treated as a “black box,” since the limiting object does not depend on the location of chromosomes in individuals.

By adopting a Moran-type model, in which only a single pair of individuals gives rise to offspring at each reproduction event, we chose mathematical tractability over more biologically realistic scenarios, in which, for example, many individuals contribute offspring at each time step. It should be straightforward to extend our model in many ways, for example by allowing a random number of parents or introducing population structure. Indeed, we do extend our model in one way, by taking a random offspring distribution. These extensions still leave open the question of distinguishing among different large offspring-number models. Our work on ancestral recombination graphs incorporating information from many loci is a step in this direction.

Sweepstake-style reproduction induces correlation in coalescence times even between loci separated by a high rate of recombination. The correlation follows from the multiple-merger property of our ancestral recombination graph, since many chromosomes coalesce at the same time in a multiple-merger event. The correlation remains a function of the coalescence parameters (c and ψ) of our population model. An immediate question is the effect on predictions of linkage disequilibrium (LD). The approximation of expected LD by McVean (2002) predicts low LD when the recombination rate is high. However, when the rate of large reproduction events is high (c → ∞), the approximation remains a function of the coalescence parameters. This dependence on coalescence parameters has implications for the use of LD in inference for populations exhibiting sweepstake-style reproduction. Using simulations, Davies et al. (2007) found little effect of multiple mergers on the prediction $r^2$ of linkage disequilibrium, when comparing the exact Wright–Fisher model with recombination to the usual (continuous-time) ARG. However, by directly incorporating large offspring-number events the way we do, we can show that large offspring-number events do induce correlation in coalescence times and hence influence predictions of linkage disequilibrium.

The genome-wide correlation in coalescence times (Tables 2 and 3) induced by sweepstake-style reproduction offers hints about how to distinguish between large offspring number and ordinary Wright–Fisher reproduction. We are unaware of any published multilocus methods derived to distinguish among different population models. Full-likelihood methods may be preferable to the simple moment-based methods we consider. However, likelihood-based inference tends to be computationally intensive and more so for large samples. For large samples, one should be able to quickly obtain a good idea of the underlying processes by comparing correlations in ratios of mutation counts with predictions based on different population models.

In conclusion, ancestral recombination graphs admitting simultaneous multiple mergers of ancestral lineages are derived from a diploid population model of sweepstake-style reproduction, suggested to be common in many diverse marine populations. Our calculations show that sweepstake-style reproduction results in genome-wide correlation of gene genealogies, even for large sample sizes. Estimates of linkage disequilibrium and of recombination rates are confounded by the coalescence parameters of our population model. The genome-wide correlation in gene genealogies induced by sweepstake-style reproduction implies that examining correlations between loci should provide a means of distinguishing between ordinary Wright–Fisher and sweepstake-style reproduction.

Acknowledgments

We acknowledge the comments of two anonymous referees, which helped to improve the presentation; one referee also spotted an error in our original proof of Theorem 1.3. J.B. and B.E. thank Institut für Mathematik, Johannes-Gutenberg-Universität Mainz, for hospitality. M.B. thanks Mathematisch Instituut, Universiteit Leiden, for hospitality. B.E. was supported in part by Engineering and Physical Sciences Research Council grant EP/G052026/1 and by a Junior Research Fellowship at Lady Margaret Hall, Oxford University. J.B. and B.E. were supported in part by Deutsche Forschungsgemeinschaft (DFG) grant BL 1105/3-1. M.B. was in part supported by DFG grant BI 1058/2-1 and through European Research Council advanced grant 267356 VARIS (Variational Approach to Random Interacting Systems).

Appendix

A1: Overview of Transitions and Their Probabilities in the Finite Population Model

A1.1: Basic setup and notation

We now classify all transitions and their probabilities of our population model relevant for the ancestral process under the scaling $\varepsilon_N = c/N^2$, in which $N$ denotes the population size. Fix a sample size $n$ for this section. Usually we suppress the dependence on the sample size in the notation below. Recall the state space $\mathcal{A}_n$ of our ancestral process (respectively $\mathcal{A}_n^{sm}$ for the “effective” limiting model).

Let $\Pi_N$ be the transition matrix of the Markov chain $\{\xi^{n,N}(m)\}_{m=0,1,\ldots}$ on $\mathcal{A}_n$ describing the ancestral states of an $n$-sample in a population of size $N$. Our aim is to decompose $\Pi_N$ into

$$\Pi_N = A_N + \frac{1}{N^2}B_N + R_N, \tag{A1}$$

where the matrix $A_N$ contains all transitions whose probability is $O(1)$ or $O(N^{-1})$ per generation, so that they will happen “instantaneously” in the limit and are either identity transitions or projections from $\mathcal{A}_n$ to $\mathcal{A}_n^{sm}$ by means of dispersing chromosomes paired in double-marked individuals. The matrix $B_N$ contains all transition probabilities that are positive and finite after multiplication with $N^2$ and $N \to \infty$, that is, our “effective transitions.” The remainder matrix $R_N$ carries only transition probabilities that are of order $O(N^{-3})$ or smaller that will thus vanish after scaling.

Once we have established this decomposition, we can apply Lemma 1.7 below in a suitable way to identify the limit given in Definition 1.1 and establish the convergence result, i.e., Theorem 1.2.

In Table A1, Table A2, and Table A3 we schematically deal with all possible transitions that can happen to a current sample over one time step.

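Analogous to the notation and convention of Möhle and Sagitov (2003), we assume that in every configuration $\xi^{n,N}(m)$ from (2), for the order of chromosomes in individuals $I_i$ for $i \in [b(m)]$ we have

$$I_i(m) = \begin{cases} \{C^{(2i-1)}(m),\, C^{(2i)}(m)\} & \text{if } 1 \le i \le \beta(m) - b(m); \\ \{C^{(\beta(m)-b(m)+i)}(m),\, \emptyset\} & \text{if } \beta(m) - b(m) + 1 \le i \le b(m). \end{cases} \tag{A2}$$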

Table A1. Transitions of type 2

Columns give the parent with marked chromosome(s) (Ø means no parent in sample); rows give the offspring.

| Offspring | Parent: Ø | Parent: I |
|---|---|---|
| I″ | {I′, I′}, (A), $O(N^{-1})$ (a) | {I′, I′}, {I″, I′}, $O(N^{-2})$, (B) |
| I′ | {I′}, (A), $O(N^{-1})$ (b) | {I′}, {I″}, {I′, I′}, (B), $O(N^{-2})$ (c) |

| Offspring | Parent: I | Parent: I^ |
|---|---|---|
| I″ | $O(N^{-2})$, (B) | $O(N^{-2})$, (B) (d) |
| I′ | {I″}, {I′, I′}, (B), $O(N^{-2})$ | {I′, I″}, {I′}, (B), $O(N^{-2})$ |

a. Offspring double-marked, no parent in sample.
b. Offspring single marked, no parent in sample.
c. Offspring single marked, one single-marked parent in sample.
d. Offspring double-marked, one double-marked parent in sample.

Table A2. Transitions of type 4, neither parent in sample

| Offspring | Parent: Ø |
|---|---|
| I″ | {I″, I′}, $O(N^{-2})$, (B) |
| I″ | {I″, I″}, $O(N^{-3})$, (R) |
| I′ | I″, $O(N^{-2})$, (B) |

Table A3. Transitions of type 5

| Offspring | Parent: Ø |
|---|---|
| $k_1$ I′, $k_2$ I″ | {I″, I″}, $O(N^{-2})$, (B) |
|  | {I″, I′}, $O(N^{-2})$, (B) |
|  | {I′, I′}, $O(N^{-2})$, (B) |
|  | I″, $O(N^{-2})$, (B) |
|  | I′, $O(N^{-2})$, (B) |

For ease of presentation, we denote by I′ a single-marked individual carrying one active chromosome, by I″ a double-marked individual carrying two active chromosomes, by I a single-marked individual (parent) whose marked chromosome is not passed on in the sample during a given reproduction event, and by I^ a double-marked individual (parent) where one marked chromosome is passed on and the other is not during a given reproduction event.

The symbols (A), (B), and (R) in the tables denote whether the corresponding transitions belong to $A_N$ (A), to $B_N$ (B), or to the “remainder term” (R) in (A1) according to the decomposition mentioned above. After that, we compute all the important probabilities explicitly. The order of the probability of each transition is also noted in Tables A1–A3.

A1.2: Transition type 1: Small or large reproduction event, no offspring in the sample

If a reproduction event takes place, say at generation $m$, which does not affect our sample, this will not affect the state of our ancestral process at $m + 1$, and we have $\xi^{n,N}(m) = \xi^{n,N}(m+1)$. Hence, we see an identity transformation. We now compute the probability that our sample is not affected. Given the current state $\xi \in \mathcal{A}_n$ with $b$ individuals and $\beta$ chromosomes (hence $\beta - b$ double-marked and $2b - \beta$ single-marked individuals), the probability that no child is in the sample is

$$(1-\varepsilon_N)\frac{N-b}{N} + \varepsilon_N\frac{\binom{N-b}{\psi N}}{\binom{N}{\psi N}} = 1 - O(N^{-1}).$$
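The order of this probability can be checked numerically. The sketch below evaluates the expression exactly with rational arithmetic, assuming $\varepsilon_N = c/N^2$ and taking the number of offspring in a large event to be the integer part of $\psi N$ (the rounding is an assumption); the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_no_offspring_in_sample(N, b, c, psi):
    """Exact probability that none of the b sampled individuals is an offspring."""
    eps = Fraction(c) / (N * N)           # probability of a large event (assumed c / N^2)
    k = int(psi * N)                      # offspring in a large event (floor is an assumption)
    small = Fraction(N - b, N)            # small event: the single offspring misses the sample
    large = Fraction(comb(N - b, k), comb(N, k))  # large event: all k offspring miss the sample
    return (1 - eps) * small + eps * large

if __name__ == "__main__":
    for N in (100, 1000, 10000):
        p = prob_no_offspring_in_sample(N, b=5, c=1, psi=0.5)
        # N * (1 - p) should settle near b, i.e. 1 - p is of order 1/N
        print(N, float(N * (1 - p)))
```

The rescaled quantity $N(1 - p)$ approaching $b$ reflects that small events dominate at order $N^{-1}$, while large events contribute only at order $N^{-2}$.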

A1.3: Transition type 2: Small reproduction event, offspring in the sample, at most one parent in the sample, no recombination

Here, we need to distinguish only whether the offspring is single or double marked and whether there is a parent in the sample. For example, it is immediate to see that the probability of a transition from a double-marked (I″) offspring to two single-marked ({I′, I′}) individuals is of order $O(N^{-1})$ when no parent is in the sample and no recombination happens. Table A1 lists all corresponding events. By way of example, the state labeled {I′, I′} denotes that two single-marked individuals, each carrying one active chromosome, are reached from the sample configuration. One such configuration arises when the sample contains one offspring, but neither parent (Ø), and the offspring carries two active chromosomes (I″).

A1.4: Transition type 3: Small reproduction event, offspring in the sample, both parents in the sample

If both parents and offspring are in the sample in a small event, this immediately gives a transition probability of order $O(N^{-3})$ or smaller (depending on the presence of recombination) and hence will be irrelevant and be part of $R_N$. We omit a detailed table listing the different single- and double-marked individuals.

A1.5: Transition type 4: Small reproduction event, offspring and at most one parent in the sample, recombination occurs

Table A2 lists transitions due to recombination and when neither parent is in the sample. The probability of the presence of both an offspring and at least one parent in a sample, when recombination occurs, is of order $O(N^{-3})$ and so will vanish in the limit.

A1.6: Transition type 5: Large reproduction event, offspring in the sample, no parent in the sample, no recombination

Table A3 lists all possible transitions when a large reproduction event occurs, no parent is in the sample, and recombination does not occur. The probabilities of the events listed in Table 1 in the main text are of order $O(N^{-2})$ and so will appear as effective transitions in the limit.

A1.7: Transition type 6: Large reproduction event, offspring in the sample, recombination occurs, and/or at least one parent is in the sample

The probability that a large reproduction event takes place and at least one child and at least one parent are in the sample is $O(N^{-3})$. In addition, the probability that a large reproduction event takes place, at least one child is in the sample, and also a recombination event happens in the sample is $O(N^{-3})$. Hence all such events are negligible.

A2: The Convergence Result

A2.1: The limit of the projection matrix AN

Some care is needed to make sure $A_N$ converges in the right sense to the desired projection matrix. The only relevant transitions of order $O(1)$ or $O(N^{-1})$ are transitions of types 1 and 2. The only one that is not an identity transition is the first dispersion event of Table A1. For $\xi \in \mathcal{A}_n$ with $b < \beta$ (i.e., at least one marked individual is double marked), that is

$$\xi \to \mathrm{disp}_i(\xi).$$

This event will become part of $A_N$ and has probability

$$A_N(\xi,\mathrm{disp}_i(\xi)) = (1-\varepsilon_N)\,\frac{1}{N}\,\frac{\binom{N-b-1}{2}}{\binom{N}{2}}\Bigl(1-\frac{r}{N}\Bigr)^2, \qquad 1 \le i \le \beta - b \tag{A3}$$

(this is the probability of the event a listed in Table A1; event b in Table A1 leads to an identity transition). Otherwise, we have

$$A_N(\xi,\xi) = 1 - (1-\varepsilon_N)\,\frac{\beta-b}{N}\,\frac{\binom{N-b-1}{2}}{\binom{N}{2}}\Bigl(1-\frac{r}{N}\Bigr)^2.$$

Of course, $A_N$ has to leave elements of the subspace $\mathcal{A}_n^{sm}$ invariant; hence we set, for $\xi$ with $b = \beta$,

$$A_N(\xi,\xi') := 1_{\{\xi = \xi'\}}.$$

Proposition 1.4. With the above settings, $A_N$ is a stochastic matrix for each $N$ and

$$\lim_{C\to\infty}\lim_{N\to\infty}\sup_{r \ge CN}\bigl\|A_N^r - P\bigr\| = 0 \tag{A4}$$

for all $C > 0$ large enough, where $P$ is the canonical projection from $\mathcal{A}_n$ to $\mathcal{A}_n^{sm}$; i.e.,

$$P(\xi,\xi') = 1_{\{\xi' = \mathrm{cd}(\xi)\}}.$$

Proof of Proposition 1.4. The Markov chain with transition matrix $A_N$ can change state only by dispersing the chromosomes paired in a double-marked individual. We see from (A3) that

$$A_N(\xi,\mathrm{disp}_i(\xi)) \ge \frac{K(n,r,c)}{N}$$

for some suitable constant $K(n,r,c)$, uniformly in $b$ and $i \le \beta - b$ and $N$ (for all $N$ large enough). Hence, starting from $\xi$ with $\beta - b$ double-marked individuals, the number of $A_N$ steps required until complete dispersion has occurred is dominated by the sum of $\beta - b$ independent geometric random variables $\gamma_1^{(N)} + \cdots + \gamma_{\beta-b}^{(N)}$, with success probability $K(n,r,c)/N$. By Markov’s inequality,

$$\sup_{N\in\mathbb{N}}\mathbb{P}\bigl\{\gamma_1^{(N)} + \cdots + \gamma_{\beta-b}^{(N)} \ge CN\bigr\} \le \frac{1}{CN}\,\mathbb{E}\bigl[\gamma_1^{(N)} + \cdots + \gamma_{\beta-b}^{(N)}\bigr] = \frac{N(\beta-b)}{CN\,K(n,r,c)} \to 0 \quad \text{as } C \to \infty.$$

The proof can now be completed with a coupling argument, noting that two Markov chains run according to $A_N$ resp. $P$, started in $\xi \in \mathcal{A}_n$, both get stuck in $\mathrm{cd}(\xi)$, and this happens after at most $CN$ steps with high probability (for $C$ large).

A2.2: Proof of the convergence result

With the definition of $A_N$ from the previous section, put

$$B_N^* := N^2(\Pi_N - A_N) \tag{A5}$$

and let $P$ be the canonical projection from $\mathcal{A}_n$ to $\mathcal{A}_n^{sm}$ defined in Proposition 1.4. The following lemma identifies $G$ as the limit containing all the effective transitions of $B_N^*$ when projecting on the subspace $\mathcal{A}_n^{sm}$:

Lemma 1.5. We have

$$\hat{B}_N := P B_N^* P \to G \quad \text{as } N \to \infty \tag{A6}$$

with $G$ from (14).

Remark 1.6. We do believe that in fact the sequence of (formally larger) matrices $B_N^*$ on $\mathcal{A}_n$ converges as well, but the statement about $\hat{B}_N$ is sufficient for our purposes below [see (A13) in Lemma 1.7] and simpler to prove since it allows us to restrict to the “completely dispersed” configurations in $\mathcal{A}_n^{sm}$.

Proof of Lemma 1.5. We inspect the types of events listed in Tables A1–A3 that are marked with (B). Events that are marked with (R) have probability of order at most $O(N^{-3})$; hence their total contribution to any entry of $\hat{B}_N$ is at most $O(N^{-1})$ (since we are following a finite sample, there are only finitely many possible one-step events altogether). It suffices to consider $\hat{B}_N(\xi,\mathrm{cd}(\eta))$ for $\xi = \{C^{(1)},\ldots,C^{(\beta)};\beta\} \in \mathcal{A}_n^{sm}$, $\eta \in \mathcal{A}_n$ (because $P$ projects to $\mathcal{A}_n^{sm}$).

Regarding $\xi' = \mathrm{pairmerge}_{j_1,j_2}(\xi)$, this transition can happen in a small reproduction event (these events are listed at c in Table A1; note that events listed at d in Table A1 lead to a trivial transition once $P$ is applied) or in a large reproduction event as in Table A3 if the grouping is suitable. Up to four parental chromosomes are involved in any reproduction event. Hence, a large reproduction event can lead to a given pair merger in the sample if up to five individuals in the sample are children. Thus

$$\hat{B}_N(\xi,\xi') = N^2(1-\varepsilon_N)\Bigl(1-\frac{r}{N}\Bigr)\,2 \times \frac{1}{N}\,\frac{1\cdot(N-b)}{\binom{N-1}{2}}\,\frac{1}{2}\,\frac{1}{2} + N^2\varepsilon_N\sum_{c=2}^{5}\Bigl(1-\frac{r}{N}\Bigr)^{c}\binom{\beta-2}{c-2}\frac{\binom{N-\beta}{N\psi-c}}{\binom{N}{N\psi}}\,(4)_{c-1}\Bigl(\frac{1}{4}\Bigr)^{c} + O(N^{-1}). \tag{A7}$$

For the first term on the right note that either $j_1$ or $j_2$ can be the child, and the two factors of $\frac{1}{2}$ come from the requirement that the chromosome in the child we are following is the one from the parent in the sample and is also the one we are following in the parent. For the second term on the right note that once we decide on $c$ children in the sample [$\binom{\beta-2}{c-2}$ choices because $j_1$ and $j_2$ are already chosen], there are $(4)_{c-1}$ ways to assign them to the four parental chromosomes. For comparison with (15) and the first line in (14) observe

$$\frac{\binom{N-\beta}{N\psi-c}}{\binom{N}{N\psi}} = \frac{(N-\beta)!\,(N\psi)!\,(N-N\psi)!}{(N\psi-c)!\,(N-\beta-N\psi+c)!\,N!} \sim \frac{(N\psi)^{c}\,(N(1-\psi))^{\beta-c}}{N^{\beta}} = \psi^{c}(1-\psi)^{\beta-c}.$$
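Both ingredients of the large-event term can be checked by brute force. The sketch below is illustrative only: it counts the assignments of $c$ children to four parental chromosomes in which the two merging children share a chromosome and no other coincidence occurs (this is our reading of the combinatorial factor), and it compares the exact ratio of binomial coefficients with its limit. The function names are hypothetical, and $N\psi$ is assumed to be an integer.

```python
from fractions import Fraction
from itertools import product
from math import comb

def falling(x, k):
    """Falling factorial (x)_k = x (x-1) ... (x-k+1), with (x)_0 = 1."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

def count_pair_merger_assignments(c):
    """Count placements of c children on 4 parental chromosomes such that
    children 0 and 1 share a chromosome and all other children sit alone."""
    count = 0
    for a in product(range(4), repeat=c):
        if a[0] != a[1]:
            continue
        # children 1, ..., c-1 must occupy pairwise distinct chromosomes
        if len(set(a[1:])) == c - 1:
            count += 1
    return count

def offspring_pattern_prob(N, psi, beta, c):
    """Exact binom(N - beta, N psi - c) / binom(N, N psi)."""
    k = int(N * psi)
    return Fraction(comb(N - beta, k - c), comb(N, k))

if __name__ == "__main__":
    for c in range(2, 6):
        print(c, count_pair_merger_assignments(c), falling(4, c - 1))
    for N in (100, 1000, 10000):
        exact = offspring_pattern_prob(N, 0.5, beta=6, c=2)
        print(N, float(exact), 0.5 ** 2 * 0.5 ** 4)
```

The enumeration agrees with the falling factorial $(4)_{c-1}$ for $c = 2,\ldots,5$, and the exact ratio approaches $\psi^{c}(1-\psi)^{\beta-c}$ as $N$ grows.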

Regarding $\xi' = \mathrm{recomb}_{j,\ell}(\xi)$ [assuming that $\alpha$ is such that $C^{(j)}$ can be nontrivially cut into two by a recombination event between loci $\ell - 1$ and $\ell$], this transition can happen in a small reproduction event as listed at b in Table A1 or in another event that has probability $O(N^{-3})$. Hence

$$\hat{B}_N(\xi,\xi') = N^2(1-\varepsilon_N) \times \frac{1}{N}\,\frac{\binom{N-b}{2}}{\binom{N-1}{2}}\,\frac{r^{(\ell)}}{N} + O(N^{-1}) = r^{(\ell)} + O(N^{-1}). \tag{A8}$$

Regarding $\xi' = \mathrm{groupmerge}_{J_1,J_2,J_3,J_4}(\xi)$, this can occur only through a large reproduction event as listed in section A1.6. Write $k_i := |J_i|$; we assume $k_1 \ge \cdots \ge k_a \ge 2$ for some $a \in [4]$, $k_{a+1} = \cdots = k_4 = 0$ (if $a = 1$, $k_1 \ge 3$), and $s := \beta - (k_1 + \cdots + k_a)$ is the number of singletons (nonparticipating chromosomes) in the merger. Note that by the structure of the diploid model, with $a$ groups merging there can be up to $k_1 + \cdots + k_a + (4-a)^+$ children in the sample [put differently, up to $(4-a)^+$ “nonmerging children”]. Then

$$\hat{B}_N(\xi,\xi') = N^2\varepsilon_N\sum_{c=0}^{(4-a)^+}\binom{\beta-k_1-\cdots-k_a}{c}\Bigl(1-\frac{r}{N}\Bigr)^{k_1+\cdots+k_a+c} \times \frac{\binom{N-\beta}{N\psi-(k_1+\cdots+k_a+c)}}{\binom{N}{N\psi}}\,(4)_{a+c}\Bigl(\frac{1}{4}\Bigr)^{k_1+\cdots+k_a+c} + O(N^{-1}).$$

It remains to check that the diagonal terms behave correctly, i.e., that as $N \to \infty$,

$$\hat{B}_N(\xi,\xi) \to G(\xi,\xi) = -\sum_{\xi' \ne \xi,\ \xi' \in \mathcal{A}_n^{sm}} G(\xi,\xi'). \tag{A9}$$

Because $\Pi_N$ and $A_N$ are both stochastic matrices (as is $P$), we have

$$\hat{B}_N(\xi,\xi) = -\sum_{\xi' \ne \xi,\ \xi' \in \mathcal{A}_n^{sm}} \hat{B}_N(\xi,\xi') \tag{A10}$$

for each $N$. By inspection and the discussion above, all terms in $\Pi_N$ with decay rate $1/N$ are accounted for in $A_N$, and all nondiagonal terms in $\Pi_N - A_N$ with decay rate $1/N^2$ appear after multiplication with $N^2$ in $\hat{B}_N$ with their correct limits, namely the corresponding terms in $G$, while terms with a faster decay rate disappear in the limit. Hence (A10) implies (A9).

A3: Markov Chains with Two Timescales—A Variation on a Lemma of Möhle

Conceptually, our convergence result rests on a separation-of-timescales phenomenon. It can be established with the help of a variant of a well-known result; see Lemma 1 from Möhle (1998).

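Let $E$ be a finite set. We equip matrices $A = (A(x,y))_{x,y\in E}$ on $E$ with the matrix norm $\|A\| := \max_{x\in E}\sum_{y\in E}|A(x,y)|$. Note that then $\|AB\| \le \|A\|\,\|B\|$ and $\|A\| = 1$ if $A$ is a stochastic matrix.

Lemma 1.7. Assume that for $N \in \mathbb{N}$, $A_N$ is a stochastic matrix on $E$ such that

$$\lim_{C\to\infty}\lim_{N\to\infty}\sup_{r \ge CN}\bigl\|A_N^r - P\bigr\| = 0 \tag{A11}$$

for some matrix $P$. Then we have for any $0 < c, K, t < \infty$

$$\lim_{N\to\infty}\sup_{\|B\|\le K}\Bigl\|\Bigl(A_N + \frac{c}{N^2}B\Bigr)^{[tN^2]} - \Bigl(P + \frac{c}{N^2}B\Bigr)^{[tN^2]}\Bigr\| = 0. \tag{A12}$$

Furthermore, if $(B_N)_{N\in\mathbb{N}}$ is a sequence of matrices on $E$ such that

$$G := \lim_{N\to\infty} P B_N P \quad \text{exists}, \tag{A13}$$

then

$$\lim_{N\to\infty}\Bigl(A_N + \frac{c}{N^2}B_N\Bigr)^{[tN^2]} = P e^{ctG} \quad \text{for all } t > 0. \tag{A14}$$

Remark 1.8. Instead of timescales $N$ and $N^2$ one can allow more generally any $a_N, b_N \to \infty$ with $b_N/a_N \to \infty$, with only notational modifications in the proof.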
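The statement of the lemma can be illustrated on a toy chain. The following sketch is not taken from the paper: the three-state matrices `A`, `P`, and `B` are hypothetical choices in which state 0 plays the role of a configuration that disperses quickly (probability $1/N$ per step) into state 1, while the slow part moves state 1 to state 0 or to an absorbing state 2. It compares the matrix power with $P e^{ctG}$, $G = PBP$, in the row-sum norm defined above.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matpow(M, m):
    """M raised to the integer power m by repeated squaring."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    while m > 0:
        if m & 1:
            result = matmul(result, M)
        M = matmul(M, M)
        m >>= 1
    return result

def expm_taylor(M, terms=40):
    """Matrix exponential by a truncated Taylor series (adequate for small norms)."""
    n = len(M)
    out = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in out]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, M)]
        out = [[out[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return out

def row_sum_norm(M):
    """Matrix norm max_x sum_y |M(x, y)|."""
    return max(sum(abs(v) for v in row) for row in M)

def two_timescale_gap(N, c=1.0, t=1.0):
    # Fast part: state 0 moves to state 1 with probability 1/N per step.
    A = [[1 - 1 / N, 1 / N, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Projection approached by powers A^r once r is much larger than N.
    P = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Slow part (rows sum to zero): from state 1 go to state 0 or to state 2.
    B = [[0.0, 0.0, 0.0], [1.0, -2.0, 1.0], [0.0, 0.0, 0.0]]
    G = matmul(matmul(P, B), P)
    step = [[A[i][j] + c / N**2 * B[i][j] for j in range(3)] for i in range(3)]
    lhs = matpow(step, int(t * N * N))
    rhs = matmul(P, expm_taylor([[c * t * g for g in row] for row in G]))
    diff = [[lhs[i][j] - rhs[i][j] for j in range(3)] for i in range(3)]
    return row_sum_norm(diff)

if __name__ == "__main__":
    for N in (25, 50, 100, 200, 400):
        print(N, two_timescale_gap(N))
```

In this toy example the excursions into state 0 become invisible in the limit, and the printed gap shrinks as $N$ grows, in line with (A14).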

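Proof of Lemma 1.7. We begin with (A12). Without loss of generality assume $K = 1$; otherwise replace $B$ by $B/K$ and $c$ by $cK$. Fix $c, t > 0$ and a matrix $B$ with $\|B\| \le 1$, and abbreviate $m := \lfloor tN^2\rfloor$. Let $\varepsilon > 0$ and choose $C_0 < \infty$ and $N_0 \in \mathbb{N}$ such that

$$\bigl\|A_N^r - P\bigr\| \le \varepsilon \quad \text{for } N \ge N_0,\ r \ge C_0 N \tag{A15}$$

[as guaranteed by (A11)]. Note that

$$\Bigl\|\Bigl(A_N + \frac{c}{N^2}B\Bigr)^m - \Bigl(P + \frac{c}{N^2}B\Bigr)^m\Bigr\| \le \bigl\|A_N^m - P\bigr\| + \sum_{k=1}^{m}\Bigl(\frac{c}{N^2}\Bigr)^k \sum_{\substack{m_1,\ldots,m_{k+1}\in\mathbb{N}_0 \\ m_1+\cdots+m_{k+1}=m-k}} \Bigl\|A_N^{m_1}\prod_{j=2}^{k+1}\bigl(BA_N^{m_j}\bigr) - P^{m_1}\prod_{j=2}^{k+1}\bigl(BP^{m_j}\bigr)\Bigr\|.$$

Mimicking the proof in Möhle (1998), we split the second summand into (the ellipses refer to the term inside the large-norm brackets on the right of the last line of the previous formula)

$$S_1 := \sum_{k=1}^{m}\Bigl(\frac{c}{N^2}\Bigr)^k \sum_{\substack{m_1,\ldots,m_{k+1}\ge C_0N \\ m_1+\cdots+m_{k+1}=m-k}} \|\cdots\| \quad \text{and} \quad S_2 := \sum_{k=1}^{m}\Bigl(\frac{c}{N^2}\Bigr)^k \sum_{\substack{m_1,\ldots,m_{k+1}\in\mathbb{N}_0 \\ m_1+\cdots+m_{k+1}=m-k \\ \exists j:\, m_j<C_0N}} \|\cdots\|.$$

As in Möhle (1998, p. 509) we have $S_1 \le 2e^{t}(t+1)\varepsilon$ for all $N$ large enough, and our estimate for $S_2$ is a small variation of the corresponding estimate in Möhle (1998). Each of the matrix norms appearing in the big sum in $S_2$ is at most 2, and hence (with $x \wedge y := \min(x,y)$)

$$\begin{aligned} S_2 &\le 2\sum_{k=1}^{m}\Bigl(\frac{c}{N^2}\Bigr)^k \#\bigl\{(m_1,\ldots,m_{k+1})\in\mathbb{N}_0^{k+1} : m_1+\cdots+m_{k+1}=m-k,\ \exists j: m_j<C_0N\bigr\} \\ &\le 2\sum_{k=1}^{m}\Bigl(\frac{c}{N^2}\Bigr)^k (k+1)\sum_{m_1=0}^{C_0N\wedge(m-k)}\binom{m-m_1-1}{k-1} \\ &\le 2\sum_{k=1}^{m}\Bigl(\frac{c}{N^2}\Bigr)^k (k+1)\,C_0N\binom{m-1}{k-1} = 2C_0N\,\frac{c}{N^2}\sum_{k=0}^{m-1}\Bigl(\frac{c}{N^2}\Bigr)^k (k+2)\binom{m-1}{k} \le \frac{C_1}{N}. \end{aligned}$$

[We use in the last estimate that for $|x| < 1$, $n \in \mathbb{N}$, $\sum_{k=0}^{n}\binom{n}{k}x^k = (1+x)^n$, and $\sum_{k=0}^{n}k\binom{n}{k}x^k = nx(1+x)^{n-1}$.]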
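
The final bound for $S_2$ can be spelled out as follows; this is a sketch of one way to obtain an explicit constant, and the particular value of $C_1$ below is our own bookkeeping rather than a statement from the source. With $x = c/N^2$ and $n = m - 1 \le tN^2$, the two binomial identities give

$$\sum_{k=0}^{m-1}x^k(k+2)\binom{m-1}{k} = 2(1+x)^{m-1} + (m-1)\,x\,(1+x)^{m-2} \le (2+ct)\,e^{ct},$$

since $(1+x)^{m-1} \le e^{xm} \le e^{ct}$ and $(m-1)x \le ct$. Multiplying by the prefactor $2C_0N\cdot c/N^2 = 2C_0c/N$ shows that one may take $C_1 = 2C_0\,c\,(2+ct)\,e^{ct}$.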

The derivation of (A14) from (A12) is literally the same as in Möhle (1998, pp. 509–511) (read $c_N = c/N^2$ there).

A4: The Convergence Result with General Random ΨN

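In this section we briefly indicate how the proof of Theorem 1.2 can be modified to yield Theorem 1.3. In each reproduction event, a random number $\Psi_N$ of individuals die and are replaced by the same number of offspring, and we recall assumptions (20), (22) and (24). By short timescale we refer to the scaling $a_N$ given by

$$a_N = \frac{N}{\mathbb{E}[\Psi_N]}$$

and by long timescale the scaling $b_N$ given by

$$b_N = \frac{1}{c_N} = \frac{N(N-1)}{\mathbb{E}[\Psi_N(\Psi_N+3)]}.$$

Assumption (20) yields $b_N \to \infty$ as $N \to \infty$, and $b_N/a_N \to \infty$ by assumption (21). To check (23), i.e., that indeed $a_N \to \infty$, observe that $\Psi_N/N$ is a positive random variable, bounded by 1. Condition (20) is equivalent to $\mathbb{E}[(\Psi_N/N)^2] \to 0$ as $N \to \infty$, which implies $\Psi_N/N \to 0$ in probability and $\mathbb{E}[\Psi_N/N] \to 0$ and hence (23).

For use below, we recall implications of (22) provided that (20) holds (cf. Sagitov 1999):

$$\text{For all } j \ge 3: \quad \frac{1}{c_N}\,\mathbb{E}\Bigl[\Bigl(\frac{\Psi_N}{N}\Bigr)^{j}\Bigr] \to \int_{[0,1]} x^{j-2}\,F(dx). \tag{A16}$$

Indeed, integration by parts yields

$$\begin{aligned} \frac{1}{c_N}\,\mathbb{E}\Bigl[\Bigl(\frac{\Psi_N}{N}\Bigr)^{j}\Bigr] &= \frac{1}{c_N}\int_{(0,1]} j x^{j-1}\,\mathbb{P}\Bigl(\frac{\Psi_N}{N} > x\Bigr)\,dx \;\xrightarrow[N\to\infty]{}\; \int_{(0,1]} j x^{j-1}\int_{(x,1]} y^{-2}\,F(dy)\,dx \\ &= \int_{(0,1]}\Bigl(\int_{(0,1]} 1_{\{x<y\}}\, j x^{j-1}\,dx\Bigr)\,y^{-2}\,F(dy) = \int_{(0,1]} y^{j-2}\,F(dy). \end{aligned} \tag{A17}$$

Furthermore for the case $j = 2$ one obtains

$$\limsup_{N\to\infty}\frac{1}{c_N}\,\mathbb{E}\Bigl[\Bigl(\frac{\Psi_N}{N}\Bigr)^{2}\Bigr] = \limsup_{N\to\infty}\frac{\mathbb{E}[\Psi_N^2]}{\mathbb{E}[\Psi_N(\Psi_N+3)]} \le 1 < \infty. \tag{A18}$$

Let $\tilde{\Psi}_N$ have the following reweighted distribution (relative to $\Psi_N$):

$$\mathbb{P}(\tilde{\Psi}_N = k) = \frac{k(k+3)}{\mathbb{E}[\Psi_N(\Psi_N+3)]}\,\mathbb{P}(\Psi_N = k), \qquad k = 1,\ldots,N-2 \tag{A19}$$

and then

$$\frac{\tilde{\Psi}_N}{N} \xrightarrow{d} F \quad \text{as } N \to \infty. \tag{A20}$$

Indeed, for any $\ell \in \mathbb{N}$

$$\begin{aligned} \mathbb{E}\Bigl[\Bigl(\frac{\tilde{\Psi}_N}{N}\Bigr)^{\ell}\Bigr] &= \frac{N(N-1)}{\mathbb{E}[\Psi_N(\Psi_N+3)]}\,\mathbb{E}\Bigl[\Bigl(\frac{\Psi_N}{N}\Bigr)^{\ell+1}\frac{\Psi_N+3}{N-1}\Bigr] \\ &= \frac{1}{c_N}\,\mathbb{E}\Bigl[\Bigl(\frac{\Psi_N}{N}\Bigr)^{\ell+2}\Bigr]\frac{N}{N-1} + \frac{3}{N-1}\,\frac{1}{c_N}\,\mathbb{E}\Bigl[\Bigl(\frac{\Psi_N}{N}\Bigr)^{\ell+1}\Bigr] \;\xrightarrow[N\to\infty]{}\; \int_{(0,1]} y^{\ell}\,F(dy) \end{aligned} \tag{A21}$$

by (A17) and (A18), so (A20) follows because the moments characterize a probability law on [0, 1]. One can check (along the lines of Sagitov 1999) that under assumption (20), both (A17) and (A18) are in fact equivalent to (22).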
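The reweighting in (A19) and the first equality in (A21) can be verified exactly for any finite law of $\Psi_N$. In the sketch below, the law with weights proportional to $1/k^2$ on $\{1,\ldots,N-2\}$ is an arbitrary illustrative choice, not a distribution used in the paper, and the function names are hypothetical.

```python
from fractions import Fraction

def reweight(pmf):
    """Law of the reweighted variable from (A19): weights k (k + 3) P(Psi = k), normalised.

    Returns the reweighted pmf together with the normalising constant E[Psi (Psi + 3)].
    """
    norm = sum(k * (k + 3) * p for k, p in pmf.items())
    return {k: k * (k + 3) * p / norm for k, p in pmf.items()}, norm

def moment_identity_sides(pmf, N, ell):
    """Left and right sides of the first equality in (A21), computed exactly."""
    tilde, norm = reweight(pmf)
    lhs = sum(Fraction(k, N) ** ell * p for k, p in tilde.items())
    rhs = Fraction(N * (N - 1)) / norm * sum(
        Fraction(k, N) ** (ell + 1) * Fraction(k + 3, N - 1) * p
        for k, p in pmf.items())
    return lhs, rhs

def example_pmf(N):
    """Illustrative law on {1, ..., N - 2} with P(Psi = k) proportional to 1 / k^2."""
    raw = {k: Fraction(1, k * k) for k in range(1, N - 1)}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()}

if __name__ == "__main__":
    pmf = example_pmf(20)
    for ell in (1, 2, 3):
        lhs, rhs = moment_identity_sides(pmf, 20, ell)
        print(ell, lhs == rhs, float(lhs))
```

Because the computation uses rational arithmetic, the two sides agree exactly, which makes this a convenient sanity check before specializing to a particular family of offspring laws.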

The proof of Theorem 1.3 is now a relatively straightforward adaptation of the proof of Theorem 1.2 discussed in sections A1 and A2 above. Scaling by $N$ is throughout replaced by scaling with $a_N = N/\mathbb{E}[\Psi_N]$ and scaling by $N^2$ becomes scaling with $b_N = N(N-1)/\mathbb{E}[\Psi_N(\Psi_N+3)]$:

  1. When currently following b ≥ 1 individuals, the probability that none of them is an offspring in the previous reproduction event (and hence the sample configuration remains unchanged) is
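    $$E\left[\frac{\binom{N-b}{\Psi_N}}{\binom{N}{\Psi_N}}\right] = E\left[\prod_{j=0}^{\Psi_N-1} \frac{N-b-j}{N-j}\right] = E\left[\prod_{j=0}^{\Psi_N-1} \left(1 - \frac{b}{N-j}\right)\right] = 1 - O\left(\frac{b\,E[\Psi_N]}{N}\right) = 1 - O(a_N^{-1}).$$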
    This is analogous to transitions discussed in section A1.2 and happens “all the time” (leading to the projecting transitions part in the limit).
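  2. When currently following b ≥ 1 individuals, say the kth of which is double marked, the probability that the ith individual is the only offspring in the sample, and that the sample also does not contain a parent, is [we write $(x)_k = x(x-1)\cdots(x-k+1)$ for the kth falling factorial]
    $$E\left[\frac{\Psi_N (N-\Psi_N-2)_{b-1}}{(N)_b}\right] \sim E\left[\frac{\Psi_N}{N}\left(1 - \frac{\Psi_N}{N}\right)^{b-1}\right] = a_N^{-1}(1+o(1)).$$
    The projection matrix $A_N$ now becomes
    $$A_N(\xi, \mathrm{disp}_i(\xi)) = E\left[\frac{\Psi_N (N-\Psi_N-2)_{b-1}}{(N)_b}\right](1-r)^2, \quad 1 \le i \le \beta - b \tag{A22}$$
    and $A_N(\xi,\xi) = 1 - (\beta-b)\, E[\Psi_N (N-\Psi_N-2)_{b-1}/(N)_b]\,(1-r)^2$; the analogue of Proposition 1.4 is then
    $$\lim_{C\to\infty} \lim_{N\to\infty} \sup_{r \ge C a_N} \lVert A_N^r - P \rVert = 0. \tag{A23}$$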
  3. From now on we can work on the “projected” space Ansm. The distinction between small and large reproduction events is irrelevant in the general case. Hence, it is more suitable to distinguish whether a parent and an offspring are in the sample or whether several offspring (but no parent) are in the sample. In analogy with (A5) and (A6), we split ΠN into fast and slow parts and define
    $$B_N^* := b_N(\Pi_N - A_N), \qquad \hat B_N := P B_N^* P. \tag{A24}$$
    It then remains to check that
    $$\hat B_N \to G \quad \text{with } G \text{ defined in (26)}, \tag{A25}$$
    whence Theorem 1.3 follows from Lemma 1.7 together with Remark 1.8.

    We now verify (A25):

  4. Recombination events give the correct limit; see the discussion below (24).

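  5. “Large”: the probability that exactly k ≥ 2 given individuals among the b in the sample are offspring (and that the sample contains no parent) is, using (A19),
    $$E\left[\frac{(\Psi_N)_k (N-\Psi_N-2)_{b-k}}{(N)_b}\right] = E[\Psi_N(\Psi_N+3)]\, E\left[\frac{(\tilde\Psi_N)_k (N-2-\tilde\Psi_N)_{b-k}}{\tilde\Psi_N(\tilde\Psi_N+3)(N)_b}\right]; \tag{A26}$$
    thus $1/c_N$ times this probability is
    $$N(N-1)\, E\left[\frac{(\tilde\Psi_N)_k (N-2-\tilde\Psi_N)_{b-k}}{\tilde\Psi_N(\tilde\Psi_N+3)(N)_b}\right] = \frac{1}{(N-2)_{b-2}} E\left[(\tilde\Psi_N-2)_{k-2} (N-2-\tilde\Psi_N)_{b-k}\right] + O\left(\frac{1}{N}\right) \underset{N\to\infty}{\longrightarrow} \int_{(0,1]} y^{k-2}(1-y)^{b-k}\, F(dy) \tag{A27}$$
    by (A20). Furthermore, the probability that at least two offspring and at least one parent are in the sample is at most
    $$b \binom{b-1}{2} E\left[\frac{2(\Psi_N)_2}{(N)_3}\right] = O(c_N/N); \tag{A28}$$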
    hence such events become negligible in the limit.
  6. “Small” is a merger of a single pair, which can result either from one offspring and one parent in the sample or from two offspring but no parent in the sample: Here, the weight of F({0}) plays a role.

The probability that exactly two given single-marked individuals in a sample of size b are offspring (and none are parents) is

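$$E\left[\frac{(\Psi_N)_2 (N-2-\Psi_N)_{b-2}}{(N)_b}\right]; \tag{A29}$$

and the probability that among a pair of two given single-marked individuals, one is a parent, the other is an offspring, and no other element of the sample is affected by the reproduction event is

$$E\left[\frac{2\,(2)_1 (\Psi_N)_1 (N-\Psi_N-2)_{b-2}}{(N)_b}\right]; \tag{A30}$$

thus, $1/c_N$ times the probability that exactly one given pair (of single-marked individuals) is involved in a reproduction event is

$$\frac{1}{c_N} E\left[\frac{\Psi_N(\Psi_N+3)(N-\Psi_N-2)_{b-2}}{(N)_b}\right] = E\left[\frac{(N-2-\tilde\Psi_N)_{b-2}}{(N-2)_{b-2}}\right] \underset{N\to\infty}{\longrightarrow} \int_{[0,1]} (1-y)^{b-2}\, F(dy) = F(\{0\}) + \int_{(0,1]} (1-y)^{b-2}\, F(dy) \tag{A31}$$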

by (A20).

  7. (Combinatorial connections between participation in reproduction events and merging of ancestral chromosomes) The rest of the argument to replace (15) by (27) is purely combinatorial; it is concerned only with possible groupings of the k single-marked offspring into up to four groups depending on which of the four parental chromosomes they descend from.

In both cases considered in (6) the probability that the chromosomes actually coalesce is 1/4 because they must descend from the same chromosome in the same parent or from the particular chromosome in the particular parent we are following, respectively.
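The first equality in (A31) combines (A29) and (A30) with the reweighting (A19), and it holds exactly for finite N. The sketch below checks it with exact fractions; the toy law `pmf`, the sample size `b`, and the helper names `ff` and `single_pair_check` are assumptions made for illustration only.

```python
from fractions import Fraction

def ff(x, k):
    """Falling factorial (x)_k = x (x-1) ... (x-k+1), with (x)_0 = 1."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

def single_pair_check(N, b, pmf):
    """Return (1/c_N) * [(A29) + (A30)] and the middle expression of (A31)."""
    def E(f):
        return sum(p * f(k) for k, p in pmf.items())

    norm = E(lambda k: k * (k + 3))                        # E[Psi (Psi + 3)]
    inv_cN = Fraction(N * (N - 1)) / norm                  # 1/c_N
    # (A29): both given individuals are offspring, no parent in the sample.
    p29 = E(lambda k: Fraction(ff(k, 2) * ff(N - 2 - k, b - 2), ff(N, b)))
    # (A30): one of the pair is a parent, the other an offspring.
    p30 = E(lambda k: Fraction(2 * ff(2, 1) * ff(k, 1) * ff(N - k - 2, b - 2), ff(N, b)))
    # Reweighted law (A19) and the expectation on the right of (A31).
    tilde = {k: k * (k + 3) * p / norm for k, p in pmf.items()}
    rhs = sum(p * Fraction(ff(N - 2 - k, b - 2), ff(N - 2, b - 2))
              for k, p in tilde.items())
    return inv_cN * (p29 + p30), rhs

# Toy law on {1, ..., N-2}; an assumption for illustration only.
pmf = {1: Fraction(7, 10), 2: Fraction(2, 10), 6: Fraction(1, 10)}
lhs, rhs = single_pair_check(N=10, b=4, pmf=pmf)
print(lhs, rhs, lhs == rhs)
```

The agreement reflects the identity $(\Psi_N)_2 + 2\,(2)_1(\Psi_N)_1 = \Psi_N(\Psi_N+3)$ together with $(N)_b = N(N-1)(N-2)_{b-2}$.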

A5: Correlation in Coalescence Times

In this section we outline the calculations to obtain the correlation in coalescence times T1 and T2 of types at two loci (1 and 2). As our sample consists of two unlabeled chromosomes typed at two loci, we sometimes find it convenient to denote an unlabeled chromosome carrying ancestral segments at both loci with the symbol ⊢⊣ and chromosomes carrying ancestral segments at only one locus with the symbols ⊢ and ⊣. Loci at which types have coalesced are denoted by •− or •⊣. The states S of the unlabeled process for a sample of size two at two loci are also numbered as

  • State S: in symbols

  • 2: (⊢⊣)(⊢⊣)

  • 1: (⊢⊣)(⊢)(⊣)

  • 0: (⊢)(⊢)(⊣)(⊣)

  • −1: (⊣)(⊣)

  • −2: (⊢)(⊢)

in which states {0, 1, 2} denote the three possible sample states before coalescence at either locus has occurred. States {−1, −2} will be needed when deriving the variance of pairwise differences.

Let h(i) := ℙ({T1 = T2} | i) denote the probability of the event T1 = T2, when B is in state i. Excluding large offspring numbers, one readily obtains (h(i) = 0 for i ∉ {0, 1, 2})

$$h(2) = \frac{r+9}{2r^2+13r+9}, \qquad h(1) = \frac{3}{2r^2+13r+9}, \qquad h(0) = \frac{2}{2r^2+13r+9}. \tag{A32}$$

For each i ∈ {0, 1, 2}, the expression for h(i) is the same as the one for the correlation between T1 and T2 when in state i, excluding large offspring numbers. The expected value w(i) = E[Ts] of the time Ts until a coalescence event at either locus starting from state i ∈ {0, 1, 2} is, again excluding large offspring numbers,

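$$\begin{aligned}
w(2) &= \frac{r+9}{2(2r^2+13r+9)} + \frac{1}{2} = \frac{1}{2}(1+h(2)), \\
w(1) &= \frac{3}{2(2r^2+13r+9)} + \frac{1}{2} = \frac{1}{2}(1+h(1)), \\
w(0) &= \frac{1}{2r^2+13r+9} + \frac{1}{2} = \frac{1}{2}(1+h(0)),
\end{aligned}$$

obtained by solving the recursions

$$w(2) = \frac{1+2r\,w(1)}{1+2r}, \qquad w(1) = \frac{1+w(2)+r\,w(0)}{r+3}, \qquad w(0) = \frac{1+4w(1)}{6}.$$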
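The three recursions for w(2), w(1), and w(0) form a small linear system, so the closed forms can be checked directly. The following is a minimal sketch using exact rational arithmetic; the helper names `solve`, `w_values`, and `h_values` are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions (small dense systems only)."""
    n = len(b)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for i in range(n):
            if i != col:
                f = M[i][col]
                M[i] = [x - f * y for x, y in zip(M[i], M[col])]
    return [row[n] for row in M]

def w_values(r):
    """Solve the recursions for [w(2), w(1), w(0)] at recombination rate r."""
    r = Fraction(r)
    A = [[1 + 2 * r, -2 * r, 0],     # (1+2r) w(2) - 2r w(1)        = 1
         [-1, r + 3, -r],            # -w(2) + (r+3) w(1) - r w(0)  = 1
         [0, -4, 6]]                 # -4 w(1) + 6 w(0)             = 1
    return solve(A, [1, 1, 1])

def h_values(r):
    """Closed forms (A32) for [h(2), h(1), h(0)]."""
    r = Fraction(r)
    D = 2 * r * r + 13 * r + 9
    return [(r + 9) / D, 3 / D, 2 / D]

for r in (0, Fraction(1, 2), 5):
    w = w_values(r)
    print(r, w, [(1 + h) / 2 for h in h_values(r)])
```

For every r tried, the solved values coincide with (1 + h(i))/2, as stated above.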

Let $v(i) := E_i[T_s^2]$ denote the expected value of $T_s^2$ when starting from state i ∈ {0, 1, 2}. One can follow Durrett (2002) to obtain the recursions

$$v(i) = \frac{2}{q_i^2} + \frac{2}{q_i} \sum_{k \ne i} \frac{q_{ik}}{q_i}\, w(k) + \sum_{k \ne i} \frac{q_{ik}}{q_i}\, v(k) \tag{A33}$$

in which $q_i = \sum_{k \ne i} q_{ik}$ is the sum of the transition rates out of state i. To obtain (A33) let J denote the exponential waiting time until the first transition and $X_J$ be the state of the process immediately after the first transition. The random variables J and $X_J$ are independent. One can write

$$\begin{aligned}
E[T_s^2 \mid J, X_J] &= E[(T_s - J + J)(T_s - J + J) \mid J, X_J] \\
&= E[(T_s - J)^2 + 2J(T_s - J) + J^2 \mid J, X_J] \\
&= E[(T_s - J)^2 \mid J, X_J] + 2J\, E[T_s - J \mid X_J] + E[J^2].
\end{aligned}$$

Taking expectations gives (A33).

The variance Vi[Ts] of Ts when starting in state i is given by

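$$\begin{aligned}
V_2[T_s] &= \frac{r^3 + 31r^2/2 + 153r/2 + 81}{(2r+1)(r+6)(2r^2+13r+9)} + \frac{1}{2} - \frac{1}{4}(1+h(2))^2, \\
V_1[T_s] &= \frac{r+9}{(r+6)(2r^2+13r+9)} + \frac{1}{2} - \frac{1}{4}(1+h(1))^2, \\
V_0[T_s] &= \frac{r+8}{(r+6)(2r^2+13r+9)} + \frac{1}{2} - \frac{1}{4}(1+h(0))^2.
\end{aligned}$$

Hence, $\lim_{r\to\infty} V_i[T_s] = 1/4$ for i ∈ {2, 1, 0}, and

$$\lim_{r\to 0} V_2[T_s] = 1, \qquad \lim_{r\to 0} V_1[T_s] = \frac{2}{9}, \qquad \lim_{r\to 0} V_0[T_s] = \frac{89}{324}.$$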

Denote by Tl the time until coalescence has occurred at both loci. The marginal coalescence times are exponential with rate 1, when excluding large offspring numbers. Solving the recursions

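$$E_2[T_l] = \frac{1 + 2r\,E_1[T_l]}{1+2r}, \qquad E_1[T_l] = \frac{1 + E_2[T_l] + r\,E_0[T_l] + 2}{r+3}, \qquad E_0[T_l] = \frac{1 + 4E_1[T_l] + 2}{6}$$

yields

$$\begin{aligned}
E_2[T_l] &= \frac{3}{2} - \frac{r+9}{2(2r^2+13r+9)} = \frac{1}{2}(3 - h(2)), \\
E_1[T_l] &= \frac{3}{2} - \frac{3}{2(2r^2+13r+9)} = \frac{1}{2}(3 - h(1)), \\
E_0[T_l] &= \frac{3}{2} - \frac{1}{2r^2+13r+9} = \frac{1}{2}(3 - h(0)).
\end{aligned}$$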

Applying the recursions (A33) yields the variances Vi[Tl],

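$$\begin{aligned}
V_2[T_l] &= \frac{2r^3 + 111r^2/4 + 171r/2 - 81/4}{(2r^2+13r+9)^2} + \frac{5}{4}, \\
V_1[T_l] &= \frac{4r^2 + 17r - 45/4}{(2r^2+13r+9)^2} + \frac{5}{4}, \\
V_0[T_l] &= \frac{2r^2 + 7r - 10}{(2r^2+13r+9)^2} + \frac{5}{4}
\end{aligned}$$

with $\lim_{r\to\infty} V_i[T_l] = 5/4$ for i ∈ {0, 1, 2}, and

$$\lim_{r\to 0} V_2[T_l] = 1, \qquad \lim_{r\to 0} V_1[T_l] = \frac{10}{9}, \qquad \lim_{r\to 0} V_0[T_l] = \frac{365}{324}.$$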
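The means and variances of T_l can be reproduced numerically from the transition rates used in the recursions above. The sketch below is one possible way to do so, under two assumptions made for illustration: states −1 and −2 are lumped into a single state "one" (one locus already coalesced, remaining time exponential with rate 1), and (A33) is used in the equivalent rearranged form $q_i v(i) = 2w(i) + \sum_{k \ne i} q_{ik} v(k)$. The function names are hypothetical.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions (small dense systems only)."""
    n = len(b)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for i in range(n):
            if i != col:
                f = M[i][col]
                M[i] = [x - f * y for x, y in zip(M[i], M[col])]
    return [row[n] for row in M]

def tl_moments(r):
    """Means and variances of T_l from states 2, 1, 0 (no large offspring numbers)."""
    r = Fraction(r)
    # Transient states in order: 2, 1, 0, "one". Rows hold minus the generator.
    negQ = [[1 + 2 * r, -2 * r, 0, 0],
            [-1, r + 3, -r, -2],
            [0, -4, 6, -2],
            [0, 0, 0, 1]]
    m = solve(negQ, [1, 1, 1, 1])              # first moments
    u = solve(negQ, [2 * x for x in m])        # second moments, rearranged (A33)
    return m[:3], [ui - mi * mi for ui, mi in zip(u[:3], m[:3])]

def tl_closed_forms(r):
    """Closed forms for E_i[T_l] and V_i[T_l], i = 2, 1, 0, as given in the text."""
    r = Fraction(r)
    D = 2 * r * r + 13 * r + 9
    h = [(r + 9) / D, 3 / D, 2 / D]
    means = [(3 - x) / 2 for x in h]
    nums = [2 * r**3 + Fraction(111, 4) * r**2 + Fraction(171, 2) * r - Fraction(81, 4),
            4 * r**2 + 17 * r - Fraction(45, 4),
            2 * r**2 + 7 * r - 10]
    return means, [n / D**2 + Fraction(5, 4) for n in nums]

for r in (0, 1, Fraction(7, 2)):
    print(r, tl_moments(r) == tl_closed_forms(r))
```

At r = 0 the computed variances are 1, 10/9, and 365/324, matching the limits listed above.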

Now we admit large offspring numbers, taking $\varepsilon_N = c/N^2$ and $r_N = r/N$. Ignoring the labeling of the chromosomes, the limit process has three effective sample states, depending on the number of double-marked chromosomes (⊢⊣). Denote the three sample states by

(⊢⊣)(⊢⊣),
(⊢⊣)(⊢)(⊣),

and

(⊢)(⊢)(⊣)(⊣),

in which ⊢ and ⊣ denote single-marked chromosomes. The states of the limit process are composed of single-marked individuals only and are therefore the same as those of the haploid Wright–Fisher process. By •− denote a chromosome carrying a common ancestor at one locus, and (•−•) denotes the absorbing states. The transition rates are summarized in the following table:

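$$\begin{array}{l|cccccc}
 & (\vdash\dashv)(\vdash\dashv) & (\vdash\dashv)(\vdash)(\dashv) & (\vdash)(\vdash)(\dashv)(\dashv) & (\bullet{-})(\dashv)(\dashv) & (\bullet\dashv)(\dashv) & (\bullet{-}\bullet) \\ \hline
(\vdash\dashv)(\vdash\dashv) & & 2r & & & & 1 + \frac{c\psi^2}{4} \\
(\vdash\dashv)(\vdash)(\dashv) & 1 + \frac{c\psi^2}{4}\left(1 - \frac{\psi}{4}\right) & & r & & 2 + \frac{c\psi^2}{2}\left(1 - \frac{\psi}{4}\right) & \frac{c\psi^3}{16} \\
(\vdash)(\vdash)(\dashv)(\dashv) & \frac{3c\psi^4}{32} & 4 + c\left(\psi^2 - \frac{\psi^3}{2} - \frac{\psi^4}{8}\right) & & 2 + c\left(\frac{\psi^2}{2} - \frac{\psi^3}{4} - \frac{\psi^4}{16}\right) & \frac{c\psi^3}{4}\left(1 - \frac{\psi}{4}\right) & \frac{c\psi^4}{16} \\
(\bullet{-})(\dashv)(\dashv) & & & & & 2 + \frac{c\psi^2}{2}\left(1 - \frac{\psi}{4}\right) & 1 + \frac{c\psi^2}{4} \\
(\bullet\dashv)(\dashv) & & & & r & & 1 + \frac{c\psi^2}{4}
\end{array}$$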

By way of example, the rate of the transition from 1 to 2 by coalescence of the chromosomes ⊢ and ⊣ is $1 + cC_{3;2;1}$, the transition rate from 0 to 1 is $4(1 + cC_{4;2;2})$, and the transition rate from 0 to the absorbing state [(•−•) or (•−)(−•)] is $c(C_{4;4;0} + C_{4;2;2;0})$.
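The ψ-dependent parts of the rates out of state 0 can be recovered by brute-force enumeration. The sketch below rests on an assumption made for this illustration: in a large reproduction event each of the four sampled chromosomes is independently an offspring with probability ψ and, if so, descends from one of the four parental chromosomes with probability 1/4 each. The function and label names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def classify(labels):
    """labels[i] = 0 if lineage i is not an offspring, else its parental chromosome (1..4).
    Lineages 0 and 1 carry locus 1; lineages 2 and 3 carry locus 2."""
    groups = {}
    for idx, g in enumerate(labels):
        if g:
            groups.setdefault(g, []).append(idx)
    merged = sorted(tuple(v) for v in groups.values() if len(v) > 1)
    sizes = [len(m) for m in merged]
    if not merged:
        return "none"
    if sizes == [4] or merged == [(0, 1), (2, 3)]:
        return "absorbing"          # both loci coalesce in the same event
    if sizes == [3]:
        return "triple"             # one locus coalesces, one double-marked lineage forms
    if sizes == [2, 2]:
        return "to_state_2"         # two simultaneous mergers of a locus-1 with a locus-2 lineage
    return "one_locus" if merged[0] in ((0, 1), (2, 3)) else "to_state_1"

def event_probabilities(psi):
    """Probability of each outcome for the four lineages of state 0."""
    psi = Fraction(psi)
    out = {}
    for labels in product(range(5), repeat=4):
        weight = Fraction(1)
        for g in labels:
            weight *= psi / 4 if g else 1 - psi
        key = classify(labels)
        out[key] = out.get(key, 0) + weight
    return out

def table_coefficients(psi):
    """Coefficients of c in the row of state 0 of the transition-rate table."""
    psi = Fraction(psi)
    return {"to_state_2": 3 * psi**4 / 32,
            "to_state_1": psi**2 - psi**3 / 2 - psi**4 / 8,
            "one_locus": psi**2 / 2 - psi**3 / 4 - psi**4 / 16,
            "triple": psi**3 / 4 * (1 - psi / 4),
            "absorbing": psi**4 / 16}

probs = event_probabilities(Fraction(1, 2))
for key, value in table_coefficients(Fraction(1, 2)).items():
    print(key, probs[key], value)
```

Under the stated assumption, each enumerated probability equals the corresponding coefficient of c in the table.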

As before, let h(i) denote the probability the two loci coalesce at the same time. One obtains limit results

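$$\begin{aligned}
\lim_{r\to\infty} h(i) &= \frac{c\psi^4}{32 + 8c\psi^2 - c\psi^4}, \quad i \in \{0, 1, 2\}, \\
\lim_{c\to\infty} h(2) &= 1, \\
\lim_{c\to\infty} h(1) &= \frac{2}{6 - \psi}, \\
\lim_{c\to\infty} h(0) &= \frac{56\psi^2/3 - 272\psi + 544}{(\psi - 6)(3\psi^2 + 16\psi - 48)} - \frac{5}{3}.
\end{aligned} \tag{A34}$$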

The first equation in (A34) tells us that the loci remain correlated due to multiple mergers even when they are far apart on a chromosome. When the recombination rate r is quite small, one obtains

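$$\begin{aligned}
\lim_{r\to 0} h(2) &= 1, \\
\lim_{r\to 0} h(1) &= \frac{2(c\psi^2 + 4)}{-c\psi^3 + 6c\psi^2 + 24}, \\
\lim_{r\to 0} h(0) &= \frac{1}{3}\left( \frac{8c\psi^2 + 32}{-c\psi^3 + 6c\psi^2 + 24} + \frac{-80c\psi^3 + 208c\psi^2 + 832}{-3c\psi^4 - 16c\psi^3 + 48c\psi^2 + 192} - 5 \right).
\end{aligned}$$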
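The limits for h(i) can be checked against a first-step analysis on the transition rates tabulated above. The following sketch assumes that h(i) satisfies the usual hitting-probability equations, with the absorbing state (•−•) counted as success and the states in which only one locus has coalesced counted as failure; this recursion and the helper names are assumptions made for illustration and are not quoted from the text.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions (small dense systems only)."""
    n = len(b)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for i in range(n):
            if i != col:
                f = M[i][col]
                M[i] = [x - f * y for x, y in zip(M[i], M[col])]
    return [row[n] for row in M]

def h_values(r, c, psi):
    """[h(2), h(1), h(0)] by first-step analysis on the tabulated rates."""
    r, c, psi = Fraction(r), Fraction(c), Fraction(psi)
    p2, p3, p4 = psi**2, psi**3, psi**4
    pair = 1 + c * p2 / 4                           # state 2 -> absorbing
    s1_to_2 = 1 + c * p2 / 4 * (1 - psi / 4)
    s1_to_abs = c * p3 / 16
    q1 = r + 3 + 3 * c * p2 / 4 * (1 - psi / 4) + c * p3 / 16
    s0_to_2 = 3 * c * p4 / 32
    s0_to_1 = 4 + c * (p2 - p3 / 2 - p4 / 8)
    s0_to_abs = c * p4 / 16
    q0 = 6 + c * (3 * p2 / 2 - p3 / 2 - 3 * p4 / 32)   # total rate out of state 0
    A = [[2 * r + pair, -2 * r, 0],
         [-s1_to_2, q1, -r],
         [-s0_to_2, -s0_to_1, q0]]
    return solve(A, [pair, s1_to_abs, s0_to_abs])

def h_limits_r0(c, psi):
    """Closed-form limits as r -> 0 for [h(2), h(1), h(0)]."""
    c, psi = Fraction(c), Fraction(psi)
    p2, p3, p4 = psi**2, psi**3, psi**4
    R = -c * p3 + 6 * c * p2 + 24
    Q = -3 * c * p4 - 16 * c * p3 + 48 * c * p2 + 192
    h1 = 2 * (c * p2 + 4) / R
    h0 = ((8 * c * p2 + 32) / R + (-80 * c * p3 + 208 * c * p2 + 832) / Q - 5) / 3
    return [Fraction(1), h1, h0]

c, psi = 3, Fraction(1, 2)
print(h_values(0, c, psi) == h_limits_r0(c, psi))
print(float(h_values(10**9, c, psi)[2]),
      float(c * psi**4 / (32 + 8 * c * psi**2 - c * psi**4)))
```

Setting c = 0 reduces the system to the one behind (A32), which gives a second consistency check.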

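Let $E_i[T_s]$, as before, denote the expected time until coalescence at either locus, starting from state i. Admitting large offspring numbers, one obtains

$$\begin{aligned}
\lim_{r\to\infty} E_i[T_s] &= \frac{16}{32 + 8c\psi^2 - c\psi^4}, \quad i \in \{0, 1, 2\}, \\
\lim_{c\to\infty} E_i[T_s] &= 0, \quad i \in \{0, 1, 2\}, \\
\lim_{r\to 0} E_2[T_s] &= \frac{4}{c\psi^2 + 4}, \\
\lim_{r\to 0} E_1[T_s] &= \frac{c(16\psi^2 - 2\psi^3) + 64}{-c^2\psi^5 + 6c^2\psi^4 - 4c\psi^3 + 48c\psi^2 + 96}, \\
\lim_{r\to 0} E_0[T_s] &= \frac{16}{3(c(6\psi^2 - \psi^3) + 24)} - \frac{4(\psi - 8)}{(3\psi + 16)(c\psi^2 + 4)} - \frac{32(39\psi - 32)}{3(c(3\psi^4 + 16\psi^3 - 48\psi^2) - 192)(3\psi + 16)}.
\end{aligned}$$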

Let Ei[Tl], as before, denote the expected value of the time Tl until coalescence has occurred at both loci, when starting from state i. Admitting large offspring numbers, one obtains the limits

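$$\begin{aligned}
\lim_{r\to\infty} E_i[T_l] &= \frac{c(48\psi^2 - 8\psi^4) + 192}{(c\psi^2 + 4)(-c\psi^4 + 8c\psi^2 + 32)}, \quad i \in \{0, 1, 2\}, \\
\lim_{c\to\infty} E_i[T_l] &= 0, \quad i \in \{0, 1, 2\}, \\
\lim_{r\to 0} E_2[T_l] &= \frac{4}{c\psi^2 + 4}, \\
\lim_{r\to 0} E_1[T_l] &= \frac{c(32\psi^2 - 6\psi^3) + 128}{-c^2\psi^5 + 6c^2\psi^4 - 4c\psi^3 + 48c\psi^2 + 96}, \\
\lim_{r\to 0} E_0[T_l] &= \frac{(28\psi^7 - 56\psi^6 - 800\psi^5 + 1600\psi^4)c^2 + (-608\psi^4 - 3200\psi^3 + 12{,}800\psi^2)c + 25{,}600}{a}
\end{aligned}$$

in which

$$a = 3c^3\psi^9 - 2c^3\psi^8 - 144c^3\psi^7 + 288c^3\psi^6 + 12c^2\psi^7 - 80c^2\psi^6 - 1152c^2\psi^5 + 3456c^2\psi^4 - 288c\psi^4 - 2304c\psi^3 + 13{,}824c\psi^2 + 18{,}432.$$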
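The small-r limits of the expected times can be cross-checked numerically. The sketch below assumes the transition rates of the table above, computes E_i[T_s] by first-step analysis, and then uses the identity T_l = T1 + T2 − T_s together with E[T1] = E[T2] = 1/(1 + cψ²/4); the function names are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions (small dense systems only)."""
    n = len(b)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for i in range(n):
            if i != col:
                f = M[i][col]
                M[i] = [x - f * y for x, y in zip(M[i], M[col])]
    return [row[n] for row in M]

def expected_times(r, c, psi):
    """Return ([E_2, E_1, E_0] of T_s, [E_2, E_1, E_0] of T_l)."""
    r, c, psi = Fraction(r), Fraction(c), Fraction(psi)
    p2, p3, p4 = psi**2, psi**3, psi**4
    lam = 1 + c * p2 / 4                               # pairwise coalescence rate
    q1 = r + 3 + 3 * c * p2 / 4 * (1 - psi / 4) + c * p3 / 16
    q0 = 6 + c * (3 * p2 / 2 - p3 / 2 - 3 * p4 / 32)
    A = [[2 * r + lam, -2 * r, 0],
         [-(1 + c * p2 / 4 * (1 - psi / 4)), q1, -r],
         [-(3 * c * p4 / 32), -(4 + c * (p2 - p3 / 2 - p4 / 8)), q0]]
    ts = solve(A, [1, 1, 1])
    return ts, [2 / lam - x for x in ts]              # E[T_l] = 2 E[T] - E[T_s]

def tl_limits_r0(c, psi):
    """Closed-form limits as r -> 0 for [E_2, E_1, E_0] of T_l."""
    c, y = Fraction(c), Fraction(psi)
    e2 = 4 / (c * y**2 + 4)
    e1 = ((c * (32 * y**2 - 6 * y**3) + 128)
          / (-c**2 * y**5 + 6 * c**2 * y**4 - 4 * c * y**3 + 48 * c * y**2 + 96))
    a = (3 * c**3 * y**9 - 2 * c**3 * y**8 - 144 * c**3 * y**7 + 288 * c**3 * y**6
         + 12 * c**2 * y**7 - 80 * c**2 * y**6 - 1152 * c**2 * y**5 + 3456 * c**2 * y**4
         - 288 * c * y**4 - 2304 * c * y**3 + 13824 * c * y**2 + 18432)
    num = ((28 * y**7 - 56 * y**6 - 800 * y**5 + 1600 * y**4) * c**2
           + (-608 * y**4 - 3200 * y**3 + 12800 * y**2) * c + 25600)
    return [e2, e1, num / a]

ts, tl = expected_times(0, 2, Fraction(1, 2))
print([float(x) for x in ts])
print(tl == tl_limits_r0(2, Fraction(1, 2)))
```

With c = 0 the same routine returns the values (3 − h(i))/2 obtained without large offspring numbers.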

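Considering the variance $V_i[T_s]$ of the time $T_s$ when starting from state i ∈ {0, 1, 2}, and admitting large offspring numbers, one obtains

$$\begin{aligned}
\lim_{r\to\infty} V_i[T_s] &= \frac{256}{(c(8\psi^2 - \psi^4) + 32)^2}, \quad i \in \{0, 1, 2\}, \\
\lim_{c\to\infty} V_i[T_s] &= 0, \quad i \in \{0, 1, 2\}, \\
\lim_{r\to 0} V_2[T_s] &= \frac{16}{(c\psi^2 + 4)^2}, \\
\lim_{r\to 0} V_1[T_s] &= \frac{(12\psi^6 - 128\psi^5 + 384\psi^4)c^2 + (3072\psi^2 - 512\psi^3)c + 6144}{(c\psi^2 + 4)^2(-c\psi^3 + 6c\psi^2 + 24)^2}.
\end{aligned}$$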

Correlations in coalescence times have been employed to quantify LD (McVean 2002), in which LD is quantified as the square of the correlation coefficient of types at two loci (Hill and Robertson 1968). A description of how one can quantify linkage disequilibrium as the square of the correlation coefficient of types at two loci can be found in Hartl and Clark (1989). Assuming a very small mutation rate, McVean (2002) related this squared correlation coefficient to covariances in coalescence times. Writing $\mathrm{Cov}_i(T_1, T_2)$ as the covariance of T1 and T2 when starting from state i ∈ {0, 1, 2}, McVean (2002) obtained

[Equation graphic 255equ3.jpg]

in which T1 and T2 denote the times until coalescence at the two loci, respectively, and the covariances are conditional on the sample configurations, as indicated. Following, e.g., Durrett (2002) one can obtain the covariances under any population model. Under our population model, Inline graphic, in which

[Equation graphic 255equ4.jpg]

One obtains the limit results

[Equation graphic 255equ6.jpg]

A6: Correlations in Coalescence Times for Random ψ

In this section we consider the simple example of the probability measure F, evoked in relation to a random offspring distribution, taking the beta distribution with parameters ϑ and γ. The following transition rates for a sample of size two at two loci are obtained:

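$$\begin{array}{l|cccccc}
 & (\vdash\dashv)(\vdash\dashv) & (\vdash\dashv)(\vdash)(\dashv) & (\vdash)(\vdash)(\dashv)(\dashv) & (\bullet{-})(\dashv)(\dashv) & (\bullet\dashv)(\dashv) & (\bullet{-}\bullet) \\ \hline
(\vdash\dashv)(\vdash\dashv) & & 2r & & & & 1 \\
(\vdash\dashv)(\vdash)(\dashv) & \frac{\gamma + 3\vartheta/4}{\vartheta + \gamma} & & r & & \frac{2\gamma + 3\vartheta/2}{\vartheta + \gamma} & \frac{\vartheta}{4(\vartheta + \gamma)} \\
(\vdash)(\vdash)(\dashv)(\dashv) & \frac{3}{8}\frac{(1+\vartheta)\vartheta}{(1+\vartheta+\gamma)(\vartheta+\gamma)} & \frac{4(1+\gamma)\gamma + 3\vartheta\gamma + (3/2)(1+\vartheta)\vartheta}{(1+\vartheta+\gamma)(\vartheta+\gamma)} & & \frac{2(1+\gamma)\gamma + (3/2)\vartheta\gamma + (3/4)(1+\vartheta)\vartheta}{(1+\vartheta+\gamma)(\vartheta+\gamma)} & \frac{\vartheta\gamma + (3/4)(1+\vartheta)\vartheta}{(1+\vartheta+\gamma)(\vartheta+\gamma)} & \frac{(\vartheta+1)\vartheta}{4(\vartheta+\gamma+1)(\vartheta+\gamma)} \\
(\bullet{-})(\dashv)(\dashv) & & & & & \frac{2\gamma + 3\vartheta/2}{\vartheta + \gamma} & 1 \\
(\bullet\dashv)(\dashv) & & & & r & & 1
\end{array}$$

As before, the transition rates given above can be employed to derive correlations in coalescence times. Here we consider only the probability h(i). One obtains $\lim_{\vartheta\to 0} h(i) = \lim_{\gamma\to\infty} h(i)$ and the limit results are those obtained from the usual ARG (A32).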

A7: Variance of Pairwise Differences

The variance of pairwise differences between DNA sequences has been employed to estimate recombination rates in low offspring number populations (Wakeley 1997). Let the random variable $K_{ij}$ denote the number of differences between sequences i and j, with $K_{ii} = 0$. The average number π of pairwise differences for n sequences is

$$\pi = \frac{2}{n(n-1)} \sum_{i<j} K_{ij}.$$

Under the infinitely many sites mutation model, E[π] = θE[T], in which T is the time until coalescence of two sequences. Under our model, $E[T] = 1/(1 + c\psi^2/4)$. Define the variance $S_\pi^2$ of pairwise differences as

$$S_\pi^2 = \frac{2}{n(n-1)} \sum_{i<j} (K_{ij} - \pi)^2.$$

To obtain an estimate of the recombination rate, one needs to compute the expected value $E[S_\pi^2]$,

$$E[S_\pi^2] = \frac{2}{n(n-1)} \sum_{i<j} E[(K_{ij} - \pi)^2] = E[(K_{12} - \pi)^2].$$

Thus, it suffices to consider $E[(K_{12} - \pi)^2]$. Expanding, one obtains

$$E[(K_{12} - \pi)^2] = E\left[\left(\frac{2}{n(n-1)} \sum_{i<j} (K_{12} - K_{ij})\right)^2\right] = \frac{4}{n^2(n-1)^2} \sum_{i<j} \sum_{\hat\imath<\hat\jmath} E[(K_{12} - K_{ij})(K_{12} - K_{\hat\imath\hat\jmath})].$$
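For a concrete data set, π and the variance of pairwise differences are simple to compute from the matrix of pairwise differences. The following is a minimal sketch; the four short sequences are made-up toy data and the function name `pairwise_stats` is hypothetical. It also checks the rewriting of K12 − π as an average of differences K12 − Kij used in the expansion above.

```python
from fractions import Fraction
from itertools import combinations

def pairwise_stats(seqs):
    """Return the pairwise differences K_ij (i < j), their mean pi, and their variance."""
    n = len(seqs)
    K = {(i, j): sum(a != b for a, b in zip(seqs[i], seqs[j]))
         for i, j in combinations(range(n), 2)}
    pairs = Fraction(n * (n - 1), 2)                 # number of pairs i < j
    pi = sum(K.values()) / pairs
    s2 = sum((k - pi) ** 2 for k in K.values()) / pairs
    return K, pi, s2

# Toy aligned sequences (illustration only).
seqs = ["ACGT", "ACGA", "TCGA", "ACCT"]
K, pi, s2 = pairwise_stats(seqs)
print(K)
print(pi, s2)

# K_12 - pi equals the average over pairs of (K_12 - K_ij); indices are 0-based here.
k12 = K[(0, 1)]
average_of_differences = sum(k12 - k for k in K.values()) / Fraction(len(K))
print(k12 - pi == average_of_differences)
```

Exact fractions are used so that the identity can be compared without rounding error.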

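Define the event $A_{ij}^{(\ell)}$ by

$$A_{ij}^{(\ell)} := \{\text{sequences } i \text{ and } j \text{ differ at locus } \ell\}.$$

Assuming each sequence consists of L loci, and $\mathbf{1}_{A_{ij}^{(\ell)}}$ are indicator functions,

$$K_{12} - K_{ij} = \sum_{\ell=1}^{L} \left( \mathbf{1}_{A_{12}^{(\ell)}} - \mathbf{1}_{A_{ij}^{(\ell)}} \right)$$

yielding, in case $i = \hat\imath = 1$, and $j = \hat\jmath = 3$,

$$E[(K_{12} - K_{13})^2] = \sum_{\ell=1}^{L} \sum_{\hat\ell=1}^{L} E\left[ \left( \mathbf{1}_{A_{12}^{(\ell)}} - \mathbf{1}_{A_{13}^{(\ell)}} \right) \left( \mathbf{1}_{A_{12}^{(\hat\ell)}} - \mathbf{1}_{A_{13}^{(\hat\ell)}} \right) \right] = 2 \sum_{\ell=1}^{L} \sum_{\hat\ell=1}^{L} \left[ \mathbb{P}\left(A_{12}^{(\ell)} \cap A_{12}^{(\hat\ell)}\right) - \mathbb{P}\left(A_{12}^{(\ell)} \cap A_{13}^{(\hat\ell)}\right) \right].$$

In general,

$$\begin{aligned}
E[(K_{12} - K_{ij})(K_{12} - K_{\hat\imath\hat\jmath})] &= \sum_{\ell=1}^{L} \sum_{\hat\ell=1}^{L} E\left[ \left( \mathbf{1}_{A_{12}^{(\ell)}} - \mathbf{1}_{A_{ij}^{(\ell)}} \right) \left( \mathbf{1}_{A_{12}^{(\hat\ell)}} - \mathbf{1}_{A_{\hat\imath\hat\jmath}^{(\hat\ell)}} \right) \right] \\
&= \sum_{\ell=1}^{L} \sum_{\hat\ell=1}^{L} \left[ \mathbb{P}\left(A_{12}^{(\ell)} \cap A_{12}^{(\hat\ell)}\right) - \mathbb{P}\left(A_{12}^{(\ell)} \cap A_{\hat\imath\hat\jmath}^{(\hat\ell)}\right) - \mathbb{P}\left(A_{12}^{(\hat\ell)} \cap A_{ij}^{(\ell)}\right) + \mathbb{P}\left(A_{ij}^{(\ell)} \cap A_{\hat\imath\hat\jmath}^{(\hat\ell)}\right) \right].
\end{aligned} \tag{A36}$$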

Now consider the probability $\mathbb{P}(A_{12}^{(\ell)} \cap A_{12}^{(\hat\ell)})$ of the event that sequences 1 and 2 differ at both loci $\ell$ and $\hat\ell$. Admitting mutation introduces two new states, namely the states

(⊣)(⊣)

and

(⊢)(⊢).

Define

[Equation graphic 255equ5.jpg]

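Thus, $\mathbb{P}(A_{12}^{(\ell)} \cap A_{12}^{(\hat\ell)}) = g(2)$, $\mathbb{P}(A_{12}^{(\ell)} \cap A_{13}^{(\hat\ell)}) = g(1)$, and $\mathbb{P}(A_{12}^{(\ell)} \cap A_{34}^{(\hat\ell)}) = g(0)$, for $\ell \ne \hat\ell$. Now,

$$\begin{aligned}
g(2) &= \frac{\theta_1 g(-1) + \theta_2 g(-2) + 2r\,g(1)}{\theta_1 + \theta_2 + 1 + c\psi^2/4 + 2r}, \\
g(-1) &= \frac{\theta_2}{\theta_2 + 1 + c\psi^2/4}, \\
g(-2) &= \frac{\theta_1}{\theta_1 + 1 + c\psi^2/4}, \\
g(1) &= \frac{\theta_1 g(-1) + \theta_2 g(-2) + r\,g(0) + (1 + c\psi^2/4)\, g(2)}{\theta_1 + \theta_2 + r + 3 + 3c(\psi^2/4)(1 - \psi/4) + c(\psi^3/16)}, \\
g(0) &= \frac{\theta_1 g(-1) + \theta_2 g(-2) + c(3\psi^4/32)\, g(2) + \left(c(\psi^2 - \psi^3/2 - \psi^4/8) + 4\right) g(1)}{c(3\psi^4/32) + c(\psi^2 - \psi^3/2 - \psi^4/8) + c(\psi^2/2 - \psi^3/4 - \psi^4/16) + 6 + c(\psi^3/4)(1 - \psi/4) + c(\psi^4/16) + \theta_1 + \theta_2}.
\end{aligned}$$

In view of expression (A36), one obtains

$$\begin{aligned}
\mathbb{P}\left(A_{12}^{(\ell)} \cap A_{12}^{(\ell)}\right) &= \mathbb{P}\left(A_{12}^{(\ell)}\right) = \frac{\theta}{\theta + 1 + c\psi^2/4}, \\
\mathbb{P}\left(A_{12}^{(\ell)} \cap A_{13}^{(\ell)}\right) &= \frac{\theta}{3\theta/2 + \lambda_3} + \frac{\lambda_2}{3\theta/2 + \lambda_3} \cdot \frac{\theta}{\theta + \lambda_2}, \\
\mathbb{P}\left(A_{12}^{(\ell)} \cap A_{34}^{(\ell)}\right) &= \frac{2\theta}{2\theta + \lambda_4} \cdot \frac{\theta}{2\theta + \lambda_4} + \frac{\lambda_{4;2}}{2\theta + \lambda_4} \left( \frac{\theta/2}{3\theta/2 + \lambda_3} + \left( \frac{\theta/2}{3\theta/2 + \lambda_3} \right)^2 \right).
\end{aligned} \tag{A37}$$

The event $A_{12}^{(\ell)} \cap A_{34}^{(\ell)}$ (Equation A37) occurs if the first two events in the history of the four sequences are mutations on appropriate ancestral lineages or if lineages labeled 2 and 3 coalesce, followed by appropriately placed mutations.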

Footnotes

Communicating editor: Y. S. Song

Literature Cited

  1. Árnason E., 2004. Mitochondrial cytochrome b variation in the high-fecundity Atlantic cod: trans-Atlantic clines and shallow gene genealogy. Genetics 166: 1871–1885
  2. Avise J. C., Ball R. M., Arnold J., 1988. Current versus historical population sizes in vertebrate species with high gene flow: a comparison based on mitochondrial DNA lineages and inbreeding theory for neutral mutations. Mol. Biol. Evol. 5: 331–344
  3. Beckenbach A. T., 1994. Mitochondrial haplotype frequencies in oysters: neutral alternatives to selection models, pp. 188–198 in Non-Neutral Evolution, edited by B. Golding. Chapman & Hall, New York
  4. Berestycki J., Berestycki N., Schweinsberg J., 2007. Beta-coalescents and continuous stable random trees. Ann. Probab. 35: 1835–1887
  5. Berestycki J., Berestycki N., Schweinsberg J., 2008. Small-time behavior of beta coalescents. Ann. Inst. H. Poincaré. Probab. Stat. 44: 214–238
  6. Birkner M., Blath J., 2008. Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model. J. Math. Biol. 57: 435–465
  7. Birkner M., Blath J., 2009. Measure-valued diffusions, general coalescents and population genetic inference, pp. 329–363 in Trends in Stochastic Analysis, edited by J. Blath, P. Mörters, and M. Scheutzow. Cambridge University Press, Cambridge/London/New York
  8. Birkner M., Blath J., Möhle M., Steinrücken M., Tams J., 2009. A modified lookdown construction for the Xi-Fleming-Viot process with mutation and populations with recurrent bottlenecks. ALEA Lat. Am. J. Probab. Math. Stat. 6: 25–61
  9. Birkner M., Blath J., Steinrücken M., 2011. Importance sampling for Lambda-coalescents in the infinitely many sites model. Theor. Popul. Biol. 79: 155–173
  10. Cannings C., 1974. The latent roots of certain Markov chains arising in genetics: a new approach, I. Haploid models. Adv. Appl. Probab. 6: 260–290
  11. Davies J. L., Simančík F., Lyngsø R., Mailund T., Hein J., 2007. On recombination-induced multiple and simultaneous coalescent events. Genetics 177: 2151–2160
  12. Donnelly P., Kurtz T. G., 1999. Particle representations for measure-valued population models. Ann. Probab. 27: 166–205
  13. Durrett R., 2002. Probability Models for DNA Sequence Evolution. Springer-Verlag, New York
  14. Eldon B., 2011. Estimation of parameters in large offspring number models and ratios of coalescence times. Theor. Popul. Biol. 80: 16–28
  15. Eldon B., Wakeley J., 2006. Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics 172: 2621–2633
  16. Eldon B., Wakeley J., 2008. Linkage disequilibrium under skewed offspring distribution among individuals in a population. Genetics 178: 1517–1532
  17. Etheridge A. M., Griffiths R. C., Taylor J. E., 2010. A coalescent dual process in a Moran model with genic selection, and the Lambda coalescent limit. Theor. Popul. Biol. 78: 77–92
  18. Fearnhead P., Donnelly P., 2001. Estimating recombination rates from population genetic data. Genetics 159: 1299–1318
  19. Griffiths R. C., 1991. The two-locus ancestral graph, pp. 100–117 in Selected Proceedings of the Symposium on Applied Probability, edited by I. V. Basawa and R. L. Taylor. Institute of Mathematical Statistics, Hayward, CA
  20. Griffiths R. C., Marjoram P., 1997. An ancestral recombination graph, pp. 257–270 in Progress in Population Genetics and Human Evolution (IMA Volumes in Mathematics and Its Applications 87), edited by P. Donnelly and S. Tavaré. Springer-Verlag, New York
  21. Hartl D. L., Clark A. G., 1989. Principles of Population Genetics, Ed. 2. Sinauer Associates, Sunderland, MA
  22. Hedgecock D., 1994. Does variance in reproductive success limit effective population sizes of marine organisms? pp. 1222–1344 in Genetics and Evolution of Aquatic Organisms, edited by A. Beaumont. Chapman & Hall, London
  23. Hedgecock D., Pudovkin A. I., 2011. Sweepstakes reproductive success in highly fecund marine fish and shellfish: a review and commentary. Bull. Mar. Sci. 87: 971–1002
  24. Hedgecock D., Tracey M., Nelson K., 1982. Genetics, pp. 297–403 in The Biology of Crustacea, Vol. 2, edited by L. G. Abele. Academic Press, New York
  25. Herbots H. M., 1997. The structured coalescent, pp. 231–255 in Progress of Population Genetics and Human Evolution, edited by P. Donnelly and S. Tavaré. Springer-Verlag, Berlin/Heidelberg, Germany/New York
  26. Hill W. G., Robertson A., 1968. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231
  27. Hudson R. R., 1983a. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183–201
  28. Hudson R. R., 1983b. Testing the constant-rate neutral allele model with protein sequence data. Evolution 37: 203–217
  29. Huillet T., Möhle M., 2011. On the extended Moran model and its relation to coalescents with multiple collisions. Theor. Popul. Biol. (in press)
  30. Hutchings J. A., Bishop T. D., McGregor-Shaw C. R., 1999. Spawning behaviour of Atlantic cod, Gadus morhua: evidence of mate competition and mate choice in a broadcast spawning. Can. J. Fish. Aquat. Sci. 56: 97–104
  31. Kingman J. F. C., 1982a. The coalescent. Stoch. Proc. Appl. 13: 235–248
  32. Kingman J. F. C., 1982b. On the genealogy of large populations. J. Appl. Probab. 19A: 27–43
  33. Krone S. M., Neuhauser C., 1997. Ancestral processes with selection. Theor. Popul. Biol. 51: 210–237
  34. McVean G. A., 2002. A genealogical interpretation of linkage disequilibrium. Genetics 162: 987–991
  35. Möhle M., 1998. A convergence theorem for Markov chains arising in population genetics and the coalescent with selfing. Adv. Appl. Probab. 30: 493–512
  36. Möhle M., Sagitov S., 2001. A classification of coalescent processes for haploid exchangeable population models. Ann. Probab. 29: 1547–1562
  37. Möhle M., Sagitov S., 2003. Coalescent patterns in diploid exchangeable population models. J. Math. Biol. 47: 337–352
  38. Neuhauser C., Krone S. M., 1997. The genealogy of samples in models with selection. Genetics 145: 519–534
  39. Nordeide J. T., Folstad I., 2000. Is cod lekking or a promiscuous group spawner? Fish Fish. 1: 90–93
  40. Notohara M., 1990. The coalescent and the genealogical process in geographically structured population. J. Math. Biol. 29: 59–75
  41. Ohta T., Kimura M., 1971. Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics 68: 571–580
  42. Palumbi S. R., Wilson A. C., 1990. Mitochondrial DNA diversity in the sea-urchins Strongylocentrotus purpuratus and Strongylocentrotus droebachiensis. Evolution 44: 403–415
  43. Pitman J., 1999. Coalescents with multiple collisions. Ann. Probab. 27: 1870–1902
  44. Sagitov S., 1999. The general coalescent with asynchronous mergers of ancestral lines. J. Appl. Probab. 36: 1116–1125
  45. Sagitov S., 2003. Convergence to the coalescent with simultaneous mergers. J. Appl. Probab. 40: 839–854
  46. Sargsyan O., Wakeley J., 2008. A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organisms. Theor. Popul. Biol. 74: 104–114
  47. Schweinsberg J., 2000a. Coalescents with simultaneous multiple collisions. Electron. J. Probab. 5: 1–50
  48. Schweinsberg J., 2000b. A necessary and sufficient condition for the λ-coalescent to come down from infinity. Electron. Comm. Probab. 5: 1–11
  49. Schweinsberg J., 2003. Coalescent processes obtained from supercritical Galton-Watson processes. Stoch. Proc. Appl. 106: 107–139
  50. Star B., Nederbragt A. J., Jentoft S., Grimholt U., Malstrøm M., et al., 2011. The genomic sequence of Atlantic cod reveals a unique immune system. Nature 477: 207–210
  51. Steinrücken M., Birkner M., Blath J., 2013. Analysis of DNA sequence variation within marine species using Beta-coalescents. Theor. Popul. Biol. 83: 20–29
  52. Tajima F., 1983. Evolutionary relationships of DNA sequences in finite populations. Genetics 105: 437–460
  53. Taylor J., Véber A., 2009. Coalescent processes in subdivided populations subject to recurrent mass extinctions. Electron. J. Probab. 14: 242–288
  54. Taylor J. E., 2009. The genealogical consequences of fecundity variance polymorphism. Genetics 182: 813–837
  55. Wakeley J., 1997. Using the variance of pairwise differences to estimate the recombination rate. Genet. Res. 69: 45–48

Articles from Genetics are provided here courtesy of Oxford University Press
