Abstract
This study describes a compact method for determining joint probabilities of identity-by-state (IBS) within and between loci in populations evolving under genetic drift, crossing-over, mutation, and regular inbreeding (partial self-fertilization). Analogues of classical indices of associations among loci arise as functions of these joint identities. This coalescence-based analysis indicates that multi-locus associations reflect simultaneous coalescence events across loci. Measures of association depend on genetic diversity rather than allelic frequencies, as do linkage disequilibrium and its relatives. Scaled indices designed to show monotonic dependence on rates of crossing-over, inbreeding, and mutation may prove useful for interpreting patterns of genome-scale variation.
Keywords: Linkage disequilibrium, Effective number, Selfing, Two-locus identity, Mating system
1. Introduction
This article explores the nature of associations between a pair of loci evolving in populations subject to genetic drift, crossing-over, mutation, and regular inbreeding. Here, associations are expressed as probabilities of joint identity by state in evolving populations, rather than probabilities of identity by descent in pedigrees.
Correlations and probabilities.
Wright’s (1921) foundational work on single- and multi-locus descent measures was developed in the context of pedigrees, with associations expressed as correlations among genes in the presence of genetic drift but absence of mutation. Wright (1933) addressed the association between alleles at a pair of loci (A and B) borne by a single gamete. For a random variable $X_A$ that assumes value 1 if the gene at locus A corresponds to the upper case allele and 0 otherwise, and a random variable $X_B$ that assumes value 1 if the gene at locus B corresponds to the upper case allele and 0 otherwise, the correlation between $X_A$ and $X_B$ is
$$r = \frac{p_{AB} - p_A p_B}{\sqrt{p_A (1 - p_A)\, p_B (1 - p_B)}}, \tag{1}$$
in which the expectations of $X_A$ and $X_B$ correspond to the frequencies in the population of the upper case alleles ($p_A$ and $p_B$), and $p_{AB}$ to the frequency in the population of the indicated two-locus haplotype. Numerous studies (e.g., Hill and Robertson, 1968; Ohta and Kimura, 1969) have addressed the expected steady-state value of this correlation or its square, evolving under selective neutrality and crossing-over.
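As a small worked illustration of the correlation (1), the sketch below computes it from the frequency of the upper case haplotype and the two allele frequencies. The function and argument names (`wright_correlation`, `p_ab`, `p_a`, `p_b`) are hypothetical labels chosen here, not notation from the original.

```python
from math import sqrt


def wright_correlation(p_ab: float, p_a: float, p_b: float) -> float:
    """Correlation between the indicator variables at loci A and B.

    p_ab: frequency of the haplotype bearing both upper case alleles
    p_a, p_b: frequencies of the upper case alleles (assumed strictly
              between 0 and 1, so that the variances are positive)
    """
    covariance = p_ab - p_a * p_b
    return covariance / sqrt(p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


if __name__ == "__main__":
    # Independent loci: the haplotype frequency equals the product.
    print(wright_correlation(0.25, 0.5, 0.5))
    # Complete association: every A gamete also carries B.
    print(wright_correlation(0.5, 0.5, 0.5))
```

The example makes the dependence on allele frequencies explicit: the same covariance yields different correlations as `p_a` and `p_b` change.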
While preserving the pedigree context in the absence of mutation, later work (e.g., Cockerham and Weir, 1968) reinterpreted the concept of multi-locus associations in terms of probabilities of joint identity by descent (IBD). In addition, probabilities of identity-by-state (IBS) evolving under mutation as well as genetic drift and crossing-over have been studied (e.g., Strobeck and Morgan, 1978; Golding and Strobeck, 1980). Building on Golding (1984), Ethier and Griffiths (1990) undertook to develop a two-locus version of the Ewens Sampling Formula (Ewens, 1972) for any number of alleles at each locus. Hudson (2001) studied the determination of the probability of collections of the four possible haplotypes comprising a pair of biallelic loci through a recursive method. Pluzhnikov and Donnelly (1996) addressed the estimation of molecular diversity at multiple sites in the presence of mutation and recombination. McVean (2002) reformulated Wright’s correlation (1) in terms of the correlation between coalescence times at a pair of loci.
Scaling of associations.
Wright (1952) explicitly noted that he intended the hierarchical $F$-statistics to serve as summaries of the effects of population structure on the genome, largely free from locus-specific aspects, including allele frequencies and mutation rates. Similarly, the dependence on population allele frequencies ($p_A$, $p_B$) of the correlation (1), its numerator, and several variants has received much attention (see Hedrick, 1987; Lewontin, 1988). A number of researchers have sought to reduce this dependence through various scalings. For example, Lewontin’s (1964) $D'$ is scaled by the maximal value of $D$, the numerator of (1), possible under current allele frequencies in the population. Even so, allele frequency constrains the range of values such unscaled or scaled measures can attain (VanLiere and Rosenberg, 2008).
Interpretation.
The very term “linkage disequilibrium” (LD, Lewontin and Kojima, 1960) signals an expectation that associations among neutral variants are transient, declining over time at geometric rates set by the rate of crossing-over. Under this view, empirical observation of pervasive LD would indicate some departure from the standard neutral model. Slatkin (2008) has reviewed a number of perspectives on the interpretation of LD.
In humans, Tishkoff et al. (1998) appears to have been the first to document differences between African and non-African populations with respect to genetic associations between tightly-linked loci and to interpret the nature of the associations in terms of demographic history. These striking patterns, which are robust across the genome and planet (Jakobsson et al., 2008), may reflect historical population bottlenecks that accompanied the spread of humans out of Africa (Tishkoff and Williams, 2002). In fact, the decline in LD with geographical distance from Africa has been used as a bioinformatic index to suggest that humanity originated on the southwest coast of Africa (Henn et al., 2011).
Joint IBS probabilities.
This study presents a simple method for determining joint identity-by-state (IBS) probabilities within and between a pair of neutral loci in a population evolving under mutation, genetic drift, inbreeding, and crossing-over. Joint identity differs conceptually and quantitatively from linkage disequilibrium as classically defined (1). In particular, estimation of joint identity does not require phasing of haplotypes (compare Sabatti and Risch, 2002). The introduction of mutation and shift from IBD to IBS compel consideration of the effects of genetic diversity on relationships within and between loci. Accordingly, genetic diversity replaces population frequencies of alleles (1) as a central determinant of those relationships. Similar conceptual and quantitative implications arise in a coalescence-based reformulation of Wright’s $F$-statistics (Uyenoyama, 2024).
2. Model structure
Each generation derives from reproduction by $N$ diploid individuals. Genetic lineages borne by distinct individuals derive from a common parent at a per-generation rate of the order of $1/N$, with changes of the order $1/N^2$ treated as negligible.
2.1. Reproduction
Regular inbreeding.
Reproduction entails the production of an egg cell, which is fertilized either by a gamete from an independent meiosis in the maternal parent, with probability , or by a gamete sampled from the population gamete pool, with probability . The assumption that gamete production is not limiting implies that self-fertilization by an individual has no effect on its contribution to the gamete pool. Accordingly, a pair of genes randomly sampled from distinct individuals derive from the same reproductive in the parental generation with probability , irrespective of the rate of selfing (Appendix A).
Let denote the probability that a random reproductive individual is uniparental (the product of the fusion of gametes derived from independent meioses in a single individual) and the probability that the individual is biparental (the product of the fusion of gametes derived from distinct individuals):
| (2) |
Random union of gametes corresponds to or .
Time since the most recent outcross event.
For a random reproductive individual, random variable denotes the number of generations since the most recent outcross event in its ancestry. In particular, implies that the most recent biparental ancestor of the individual existed generations ago, with indicating that the individual is itself biparental. The probability mass function of corresponds to
| (3) |
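A minimal sketch of the distribution of the time since the most recent outcross event, assuming (as an inference from the verbal description, not a transcription of (3)) that each ancestor is independently uniparental with probability `s`, so that the number of generations is geometric on 0, 1, 2, and so on. The names `s`, `outcross_time_pmf`, and `mean_outcross_time` are hypothetical.

```python
def outcross_time_pmf(t: int, s: float) -> float:
    """P(T = t): t uniparental generations, then a biparental ancestor."""
    if t < 0:
        raise ValueError("t must be a non-negative integer")
    return (s ** t) * (1.0 - s)


def mean_outcross_time(s: float) -> float:
    """Mean of the geometric distribution on {0, 1, 2, ...}; requires s < 1."""
    return s / (1.0 - s)


if __name__ == "__main__":
    s = 0.5  # hypothetical uniparental proportion
    probabilities = [outcross_time_pmf(t, s) for t in range(200)]
    print(sum(probabilities))       # total mass, close to 1
    print(probabilities[0])         # the individual is itself biparental
    print(mean_outcross_time(s))    # expected generations since outcrossing
```

Under this assumption, random union of gametes (`s = 0`) places all mass on `t = 0`.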
Mutation.
Neutral mutations arise at locus A (B) at the per-gene, per-generation rate of , comparable to the rate of coalescence :
| (4) |
Unless explicitly noted, mutation and coalescence occur on a much longer time scale than reproduction:
| (5) |
2.2. Crossing-over
Here, the term “haplotype” refers to the set of genes that derive from a single gamete that united to form an individual, whether or not the loci of interest lie on the same chromosome. This usage appears to correspond to “gamete” in the sense of Cockerham and Weir (1968). For a pair of loci, A and B, represent by AB and ab the chromosomes of the parental individual that generated the haplotypes, in which the labels indicate locus and origin (upper case for maternally-derived, lower case for paternally-derived), and not allelic state.
As in Cockerham and Weir (1968), $\lambda$ denotes the per-meiosis probability that crossing-over is not observed (an even number of template-switching events have occurred between the loci). The probability that a random gamete represents a recombinant haplotype corresponds to
$$r = \frac{1 - \lambda}{2}.$$
This expression reflects that the sampling of a recombinant haplotype entails the occurrence of a crossover event, with half the meiotic products representing a recombinant type (Ab or aB).
For genes that trace back to a pair of haplotypes that derive from independent meioses in the same individual, coalescence may occur at both loci, exactly one locus, or neither locus. Coalescence at one but not the other locus entails that one haplotype be parental (AB or ab) and the other recombinant (Ab or aB). Coalescence occurs at locus A but not B, or likewise at B but not A, with probability
$$c_1 = r(1 - r) = \frac{1 - \lambda^2}{4}, \tag{6}$$
which implies that the probability of coalescence at exactly one locus is twice this value.
Coalescence occurs at both loci simultaneously with probability
$$c_2 = \frac{(1 - r)^2 + r^2}{2} = \frac{1 + \lambda^2}{4}, \tag{7}$$
reflecting that this event requires that the pair of haplotypes be both parental or both recombinant. Coalescence occurs at neither locus with the same probability.
A coalescence event of some kind (at one or both loci) occurs with probability
$$c = 2 c_1 + c_2 = \frac{3 - \lambda^2}{4}. \tag{8a}$$
Under complete linkage ($\lambda = 1$), $c$ reduces to the value expected for a single locus:
$$c = \frac{1}{2}. \tag{8b}$$
In the absence of linkage ($\lambda = 0$), the rate of coalescence at one or both loci increases to 3/4.
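The coalescence probabilities of this section can be checked by brute-force enumeration. The sketch below assumes that a gamete is recombinant with probability (1 - `lam`)/2, where `lam` is the per-meiosis probability that crossing-over is not observed, and that the two haplotypes derive from independent meioses in the same parent. The names `lam`, `gamete_distribution`, and `coalescence_probabilities` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def gamete_distribution(lam: Fraction) -> dict:
    """Distribution of (origin at A, origin at B); 0 = maternal, 1 = paternal."""
    r = (1 - lam) / 2  # probability that a gamete is recombinant
    return {
        (a, b): ((1 - r) / 2 if a == b else r / 2)
        for a, b in product((0, 1), repeat=2)
    }


def coalescence_probabilities(lam) -> dict:
    """Classify every ordered pair of gametes by the loci at which they coalesce."""
    dist = gamete_distribution(Fraction(lam))
    out = {key: Fraction(0) for key in ("both", "A_only", "B_only", "neither")}
    for (g1, p1), (g2, p2) in product(dist.items(), repeat=2):
        same_a = g1[0] == g2[0]  # same parental gene copied at locus A
        same_b = g1[1] == g2[1]  # same parental gene copied at locus B
        if same_a and same_b:
            key = "both"
        elif same_a:
            key = "A_only"
        elif same_b:
            key = "B_only"
        else:
            key = "neither"
        out[key] += p1 * p2
    return out


if __name__ == "__main__":
    for lam in (Fraction(0), Fraction(1, 2), Fraction(1)):
        probs = coalescence_probabilities(lam)
        some_coalescence = 1 - probs["neither"]
        print(lam, probs, some_coalescence)
```

With `lam = 1` the probability of some coalescence is 1/2, the single-locus value, and with `lam = 0` it is 3/4, in agreement with the limits stated above.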
2.3. Relative rates
This analysis addresses two cases: crossing-over occurring either on a time scale much shorter than mutation and parent-sharing,
| (9a) |
or on a comparable time scale,
| (9b) |
In the latter case, goes to zero at the same rate as , with
| (10) |
3. States of the process
Genes that represent the same allelic class are identical by state (IBS). For a sample of 4 genes, 2 from each of a pair of loci, the joint IBS probabilities described here may be regarded as likelihoods: probabilities of the sample under the model specified in Section 2.
3.1. Descriptors of identity by state
Let represent the probability of observing IBS between the genes held at locus A in a random individual and the IBS probability between a pair of genes sampled from locus A, one from each of two distinct random individuals. The corresponding probabilities for locus B are and . While the notation evokes the Malécot (1969) coefficients of inbreeding and kinship, these indices refer to IBS in an evolving population and not to IBD in a pedigree.
For a pair of loci, Table 1 presents joint IBS probabilities between the maternal and paternal complements held by a random individual.
Table 1.
Haplotypes within individuals.
IBS is observed at both loci with probability , at locus A but not B with probability , at B but not A with probability , and at neither locus with probability .
The vector of probabilities
| (11) |
corresponds to an analogue of Weir and Cockerham’s (1969) , the two-locus IBD measure for individual A. The single-locus measures (, ) correspond to marginal sums, for which Pollak (1987) determined the steady-state values:
| (12) |
with the scaled mutation rates given in (4).
Table 2 presents the corresponding joint IBS probabilities for 4 genes, 2 from each of a pair of loci, with no more than a single haplotype sampled from a single individual. As discussed in the next section, the elements of
| (13) |
depend on the sampling configuration (Fig. 2).
Table 2.
Haplotypes from distinct individuals.
Fig. 2.
Initial sample configurations of 4 genes (vertical bars), a pair from each locus (A on the right, B on the left). Vector describes the initial state of the sample, which comprises 0, 1, or 2 Type 3 haplotypes. These configurations are respectively denoted 1122, 123, 33.
As is the case for the within-individual measures (11), the marginal sums correspond to one-locus measures of IBS between genes sampled from distinct individuals derived by Pollak (1987):
| (14) |
Indices analogous to traditional disequilibrium measures correspond to
| (15) |
for within-individual comparisons and to
| (16) |
for between-individual comparisons. Together with the single-locus IBS indices (12) and (14), the disequilibrium measures and fully determine the within-individual and between-individual IBS probabilities.
3.2. Tracing descent
Determination of the likelihood vectors (11) and (13) at stationarity under mutation, drift, crossing-over, and regular inbreeding entails tracing the genealogical history of four genes, two at each locus. At any point in this history, each two-locus haplotype is classified into one of four types according to whether the constituent genes have direct descendants in the sample (compare Takebayashi et al., 2004). A gene is in state 1 only if it is known to have at least one direct descendant in the sample, with state 0 indicating agnosticism. In particular, a focal gene with designated state 0 may nevertheless be identical by descent (IBD) to a gene in the sample if both the focal gene and the gene in the sample descend from a more ancient common ancestor, with no mutations along the genealogical path connecting the three genes. In such cases, the designation of the focal gene as state 0 reflects its own direct descendants, even though it may constitute “ancestral material” in the sense of Wiuf and Hein (1999). An example is discussed in Section 6.2.2 and depicted in Fig. 10.
Fig. 10.
Ancestry of genetic material in a Type 3 haplotype generated by crossing-over. A Type 3 haplotype that is a recombinant traces back to an individual (base) that bears genomic material on either side of the breakpoint on distinct complements. As in the case of a single locus (Fig. 3), the process resolves either to fusion of the complements, with probability (17), or to their separation into distinct individuals, with the complement probability.
Fig. 1 illustrates the terminology for haplotypes comprising loci A and B. Haplotypes that bear ancestral lineages at both loci, represented as (1, 1), are denoted Type 3. Type 1 (0, 1) haplotypes bear an ancestral lineage at locus A but not necessarily at B, and Type 2 (1, 0) haplotypes bear an ancestral lineage at locus B but not necessarily at A. Type 0 (0, 0) haplotypes are not known to bear an ancestral lineage at either locus. The label of each haplotype corresponds to the number indicated by its constitution viewed in base 2, and the genotype of a diploid individual to a pair of numbers, indicating the haplotypes it carries (e.g., 03, 12).
Fig. 1.
Two-locus haplotypes. Locus A (on right) and locus B (on left) are separated by a region corresponding to a recombination fraction of . Vertical bars indicate genes in descent state 1 (those that have a direct descendant in the sample).
As one might expect, the between-individual IBS probabilities (13) depend on the configuration of the initial sample. Four genes, a pair derived from each locus, may reside on 4, 3, or 2 haplotypes (Fig. 2). For initial samples with these configurations, the vector of between-individual IBS probabilities (13) respectively corresponds to the analogue of Weir and Cockerham’s (1969) quadrigametic , trigametic , and digametic measures.
4. Identity within and between individuals
In the course of the genealogical history of the sample, transitions occur among states of the process. The initial states of the sample (Fig. 2) correspond to transient states, among which the process may move as a consequence of breakage of haplotypes (crossing-over) or fusion of haplotypes without coalescence of genetic lineages at either locus. Coalescence or mutation events induce transitions from transient to absorbing states. This section addresses the relative rates at which these evolutionary events occur.
4.1. Identity at a single locus
Of interest is the relationship between the probability of IBS between the genes held by a random individual and the IBS probability of genes sampled from distinct individuals . While this relationship is well-known ((12), (14)), framing one-locus identity in a graphical context facilitates the analysis of two-locus identity.
Fig. 3 illustrates the ancestry of the complements held at a unitary (non-recombining) locus by a random reproductive individual. The most recent evolutionary event in the ancestry of the individual corresponds either to coalescence of the lineages or to the appearance of a biparental ancestor. Because reproduction occurs much more rapidly than mutation (5), mutation is excluded as the most recent event. With the probability given in (2), the individual is biparental, implying the separation of the lineages into distinct reproductive individuals in the preceding generation. Alternatively, if the focal individual is uniparental, the lineages either derive from the same complement in the parent (fusion) or derive from distinct complements, which returns the process to the state of a pair of lineages held by a single individual. Virtually instantaneously relative to the process of mutation, the genealogical history resolves either to Separation of the lineages in distinct individuals or to Fusion (coalescence), the latter with probability
| (17) |
Haldane (1924) derived this expression for , Wright’s (1921) correlation between uniting gametes, in populations undergoing partial selfing (compare Fu, 1997; Nordborg and Donnelly, 1997).
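The resolution to Fusion or Separation can be written as a one-step recursion. The sketch below assumes a uniparental proportion `s` (a hypothetical name): with probability 1 - `s` the lineages separate; with probability `s`/2 they coalesce at once; with probability `s`/2 they return to the starting state. Iterating f = `s`/2 + (`s`/2) f recovers Haldane's closed form `s`/(2 - `s`).

```python
def fusion_probability(s: float, tol: float = 1e-15) -> float:
    """Probability of fusion before separation, by fixed-point iteration."""
    f = 0.0
    while True:
        # Uniparental parent: immediate coalescence, or a return to the
        # state of two lineages held by a single individual.
        f_next = s * (0.5 + 0.5 * f)
        if abs(f_next - f) < tol:
            return f_next
        f = f_next


def fusion_closed_form(s: float) -> float:
    """Haldane's expression for the correlation between uniting gametes."""
    return s / (2.0 - s)


if __name__ == "__main__":
    for s in (0.0, 0.25, 0.5, 0.9, 1.0):
        print(s, fusion_probability(s), fusion_closed_form(s))
```

The iteration contracts by a factor of at most 1/2 per step, so it converges for every `s` in the unit interval.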
Fig. 3.
Ancestry of non-recombining genetic material (black dots) held by a random reproductive individual. A random individual is biparental with probability , with the ancestral lineages deriving from distinct parental individuals. Alternatively, the individual is uniparental, with (8b) the probability of coalescence (fusion) of the lineage pair in the parent.
Appealing to one of the many definitions of effective number (Ewens, 1982; Crow and Denniston, 1988), let the rate of coalescence between a pair of genes sampled from distinct diploid individuals correspond to . Because the genes share a parent with probability under any level of inbreeding (Appendix A), the rate of coalescence corresponds to
| (18) |
in agreement with Pollak (1987, p. 354). Regular inbreeding at rate reduces effective number by a factor of .
Fig. 4 illustrates that the probability that a random individual holds genes that are not IBS reflects the most recent event in its ancestry. With the fusion probability (17), its lineages coalesce more recently than the formation of a biparental individual, forming an ancestor that cannot generate an individual with non-IBS genes. With the complementary probability, the lineages represent genes sampled from distinct individuals, which are non-IBS with the between-individual probability of non-identity. Accordingly,
| (19) |
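A numerical sketch of the one-locus relationships (17) to (19), under assumptions stated here rather than taken from the displayed equations: `n` diploid individuals, uniparental proportion `s`, per-gene mutation rate `u`, every mutation generating a novel allelic class, and coalescence competing with mutation in either of two lineages. All names are hypothetical.

```python
def one_locus_identities(n: int, s: float, u: float) -> dict:
    """One-locus steady-state quantities under partial selfing (sketch)."""
    f = s / (2.0 - s)  # fusion probability (Haldane's expression)
    # Genes from distinct individuals share a parent with probability 1/n,
    # then coalesce at once (1/2) or after fusion within that parent (f/2).
    coalescence_rate = (1.0 + f) / (2.0 * n)
    effective_number = 1.0 / (2.0 * coalescence_rate)
    # Between individuals: coalescence precedes mutation in either lineage.
    ibs_between = coalescence_rate / (coalescence_rate + 2.0 * u)
    # Within individuals: fusion, or separation followed by IBS.
    ibs_within = f + (1.0 - f) * ibs_between
    return {
        "f": f,
        "effective_number": effective_number,
        "ibs_between": ibs_between,
        "ibs_within": ibs_within,
    }


if __name__ == "__main__":
    result = one_locus_identities(n=1000, s=0.5, u=1e-4)
    print(result)
    # Partition analogous to Wright's: (1 - within) = (1 - f)(1 - between)
    left = 1.0 - result["ibs_within"]
    right = (1.0 - result["f"]) * (1.0 - result["ibs_between"])
    print(left, right)
```

In this sketch the effective number equals `n` (1 - `s`/2), so that selfing lowers it relative to random union of gametes.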
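Pollak (1987) interpreted (19) as an analogue of Wright’s iconic partitioning of the hierarchical $F$-statistics:
$$(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$$
(see, for example, Wright, 1969, p. 295), with the within-individual IBS probability corresponding to $F_{IT}$, the fusion probability (17) to $F_{IS}$, and the between-individual IBS probability to $F_{ST}$. The right side of Fig. 4 depicts a similar interpretation of the probability that the random individual holds genes that are IBS.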
Fig. 4.
One-locus coalescent. The left panel indicates that lineages held by a random individual are not IBS only if both (a) they resolve to separation in distinct individuals more recently than they coalesce and (b) those lineages are not IBS . Similarly (right panel), lineages held by a random individual are IBS if they coalesce more recently than the formation of a biparental ancestor or resolve to IBS lineages in separate individuals.
4.2. Joint identity
This section determines the steady-state relationship between the within-individual measures (Table 1) and the between-individual measures (Table 2). It uses a labeled coalescent argument (see Karlin and McGregor, 1972; Uyenoyama et al., 2019), extending to two loci the argument presented in Fig. 4.
The probability of observing depends on the recent history of the focal individual. As for a single locus (Fig. 3), its maternal and paternal complements may derive from distinct parents or from a single parent. If the haplotype pair derive from distinct parents, the focal individual represents the fusion of a pair of Type 3 haplotypes (Fig. 1), one from each parent. If the haplotype pair derive from a single parent, coalescence may occur at both loci, exactly one locus, or neither locus, with probabilities given in Section 2.2. Mutation is not considered because it occurs on a much longer time scale than reproduction (5). The observed IBS profile implies a restricted set of possible ancestors, some of which may have zero probability of generating the IBS profile of the focal individual. While the process may return any number of times to distinct haplotypes borne by a single individual, the genealogical history resolves rapidly either to separation or to fusion of some kind.
To terms larger than , Fig. 5 presents a graphical derivation of , the probability that a random individual bears genes IBS at locus A but not at locus B. A more traditional coalescent argument appears in Appendix B. Formation of a biparental ancestor implies descent from a pair of haplotypes derived from distinct individuals. The haplotype pair has the IBS profile of the focal individual with probability , in which the hat denotes the sampling of two Type 3 haplotypes (Fig. 2). Coalescence events involving locus B generate ancestors that have zero probability of generating the descendant. Coalescence at locus A but not at locus B implies identity at locus A and reduces the system to a single locus, with the genes borne by the ancestor at locus B not IBS with probability .
Fig. 5.
Ancestry of an individual bearing genes IBS at locus A but not at locus B. To terms larger than , the most recent event in the ancestry of the focal individual at the base of the diagram corresponds to the separation of its maternal and paternal complements into distinct ancestors, with probability . The sample configuration consistent with the IBS profile of the focal individual occurs with probability , in which the hat denotes the sampling configuration , a pair of Type 3 haplotypes (Fig. 2). With probability , the most recent event is coalescence at locus A but not at locus B; the ancestor in this case bears non-IBS genes at locus B with probability . All other scenarios, including coalescence at locus B, imply ancestors that have zero probability of generating the descendant state.
Similar considerations determine the other relationships between the within-individual and between-individual measures:
| (20) |
in which the two-locus rates of coalescence are given in Section 2.2 and the hats denote the between-individual IBS measures for a sample comprising a pair of Type 3 haplotypes (Fig. 2). These expressions confirm the marginal sums of Table 1, including
5. Joint identity under loose linkage
For the case in which crossing-over occurs much more rapidly than mutation or parent-sharing , any Type 3 haplotype resolves to a Type 1 haplotype and a Type 2 haplotype nearly instantaneously compared to mutation or the formation of a new Type 3 haplotype. Accordingly, the between-individual measures converge rapidly to
| (21) |
irrespective of the configuration of the initial sample. This independence across loci implies the absence of between-individual disequilibrium (16),
| (22) |
in agreement with the results of Bennett and Binet (1956), Karlin (1968, Section 1.5) and Christiansen (2000, Theorem 4.1, p. 85), who addressed the evolution of multi-locus associations in the absence of genetic drift and mutation.
Within-individual disequilibrium (15),
is analogous to identity disequilibrium as defined by Weir and Cockerham (1973). An equivalent expression corresponds at steady state to
using (19) for each locus as well as (20) and (21). Substitution of expressions for (8a) and (17) produces a scaled measure of within-individual disequilibrium:
| (23) |
for
| (24) |
corresponding to identity disequilibrium under partial self-fertilization (Equation (37) of Weir and Cockerham, 1973). In populations undergoing mixed random mating and inbreeding, the time since the most recent random outcross event in the genealogical history of an individual organism is shared across loci throughout the genome. Consequently, identity disequilibrium arises even among unlinked loci , for which
Identity disequilibrium increases monotonically as linkage tightens (Fig. 6). However, (23) shows a non-monotonic dependence on the inbreeding rate , declining as approaches the limits of its range (Fig. 7).
Fig. 6.
Within-individual disequilibrium scaled by between-individual measures of diversity (23) as a function of the rate of crossing-over under three assignments of the inbreeding rate. The scaled index assumes its highest values for inbreeding rates near 3/4.
Fig. 7.
Within-individual disequilibrium scaled by between-individual measures of diversity (23) as a function of the rates of crossing-over and selfing .
An alternative index of association corresponds to within-individual disequilibrium scaled by within-individual measures of diversity:
| (25) |
These measures are clearly closely related:
for the correlation between uniting gametes (17). Under any positive rate of inbreeding , exceeds . Unlike , increases monotonically with , increasing without bound as approaches unity (Fig. 8).
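The two scaled indices of this section can be sketched numerically. This is a reconstruction from the verbal description, not a transcription of (23) to (25). It assumes a uniparental proportion `s`, a probability `lam` that crossing-over is not observed, and one-locus fusion probability f = `s`/(2 - `s`). Writing g for the probability that both loci coalesce before the most recent outcross event, the recursion g = `s` [c2 + 2 c1 f + c2 g] holds, with c1 and c2 the per-generation probabilities of coalescence at one specified locus only and at both loci. The index scaled by between-individual diversity is then g - f², and the index scaled by within-individual diversity divides this by (1 - f)². All names are hypothetical.

```python
from fractions import Fraction


def scaled_within_disequilibrium(s, lam):
    """Return (index scaled by between-individual diversity,
               index scaled by within-individual diversity or None if s == 1)."""
    s, lam = Fraction(s), Fraction(lam)
    f = s / (2 - s)                 # fusion probability at a single locus
    c_one = (1 - lam ** 2) / 4      # coalescence at one specified locus only
    c_both = (1 + lam ** 2) / 4     # simultaneous coalescence at both loci
    # Both loci coalesce before outcrossing: at once, or at one locus first
    # (then the other fuses with probability f), or after returning to the
    # starting state without coalescence.
    g = s * (c_both + 2 * c_one * f) / (1 - s * c_both)
    by_between = g - f ** 2
    by_within = by_between / (1 - f) ** 2 if s != 1 else None
    return by_between, by_within


if __name__ == "__main__":
    for s in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(9, 10)):
        unlinked = scaled_within_disequilibrium(s, 0)
        linked = scaled_within_disequilibrium(s, 1)
        print(s, [float(v) for v in unlinked], [float(v) for v in linked])
```

For unlinked loci the recursion gives 4`s`(1 - `s`)/((4 - `s`)(2 - `s`)²) for the first index, which is positive for intermediate `s` and vanishes at both limits, while the second index grows without bound as `s` approaches 1.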
Fig. 8.
Within-individual disequilibrium scaled by within-individual measures of diversity (25) as a function of the rates of crossing-over and selfing .
6. Joint identity under tight linkage
This section addresses the case (9b) in which crossing-over proceeds at a rate comparable to mutation and parent-sharing:
Determination of the joint identity vectors (11) and (13) entails analysis of the full two-locus coalescent process.
6.1. IBS within and between individuals
Because the most recent outcross event in the lineage of an individual occurs virtually instantaneously relative to the time scale of crossing-over, the within-individual IBS probabilities (11) correspond to (20) with set to zero:
| (26) |
for given by (17). These relationships imply an expression for (23):
| (27a) |
in which the hat denotes samples comprising two Type 3 haplotypes (Fig. 2). Alternatively, (25), within-individual disequilibrium scaled by within-individual diversity, corresponds to:
| (27b) |
As in the case of loose linkage (Fig. 8), increases without bound as and approach 1.
For any configuration of the initial sample,
| (28) |
appears to be the analogue of Ohta’s (1980) “standardized identity excess” expression. Because is independent of the scaled mutation rates (, ), (27b) implies that either both and or neither depend on the scaled mutation rates.
6.2. Genealogical paths
This section describes a recursion that tracks the configuration (Fig. 2) of lineages ancestral to the initial sample at the point of successive coalescence or mutation events. The most recent ancestral state in which the IBS status at both loci is determined entails coalescence at both loci, coalescence at locus B and mutation at locus A, coalescence at locus A and mutation at locus B, or mutation at both loci. The probabilities of these outcomes correspond respectively to the four elements of (13).
For haplotypes comprising an arbitrary number of loci, Fig. 9 depicts a given state of the system as a point on one of several multidimensional planes. The original sample, described by its initial configuration (Fig. 2), corresponds to a single point on the lowest plane. Movement on a given plane represents transitions among transient states, with transitions to absorbing states corresponding to jumps to higher planes (fewer undetermined loci). Coalescence at a single locus induces a jump to the next higher plane, and simultaneous coalescence at multiple loci induces a jump beyond the adjacent plane. In the case of two loci, the genealogical history of the sample need be followed only until the most recent mutation or coalescence event. Beyond that event, one-locus theory ((12), (14)) provides the IBS probabilities for the remaining locus.
Fig. 9.
A genealogical path. Planes represent states comprising a given number of genetic lineages. State represents the entry state on a plane, the state from which the jump to a higher plane occurs, and the entry state on that plane. This process terminates in the most recent common ancestor (MRCA) for which the IBS status at both loci is determined.
6.2.1. Complete linkage
The boundary case of complete linkage provides a useful baseline for the case of tight linkage . Mutations occur independently at locus A and locus B, which represent disjunct sets of sites within a non-recombining region. A pair of haplotypes drawn from distinct individuals coalesce at rate (18).
For initial samples comprising two Type 3 haplotypes drawn from distinct individuals , the steady-state joint IBS probabilities correspond to
| (29) |
These expressions indicate that the process corresponds to a two-locus coalescent up to the most recent event and to a one-locus coalescent more anciently. Fusion more recently than any mutation induces the simultaneous coalescence of lineages at A and B. For the limits defined in (4), this event occurs with probability
Observation of IBS at locus B but not A, with probability , reflects mutation at A as the most recent event,
together with a more ancient coalescence at the remaining locus . The probability of IBS at locus A but not B has a similar form. Finally, IBS at neither locus reflects mutations at both loci, in either order, more recently than any coalescence.
Using (29), the index of disequilibrium scaled by between-individual diversity corresponds to
| (30) |
That this scaled index corresponds to implies that disequilibrium entails simultaneous coalescence at both loci.
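The complete-linkage argument can be verified with exact arithmetic. The sketch below assumes time is measured in units of the pairwise coalescence rate, with `a` and `b` (hypothetical names) denoting the total mutation rates at loci A and B for the pair of lineages in those units, so that the one-locus IBS probabilities are 1/(1 + `a`) and 1/(1 + `b`). The exact scaling in (4) is not reproduced here.

```python
from fractions import Fraction


def joint_ibs_complete_linkage(a, b):
    """Joint IBS probabilities for two Type 3 haplotypes without crossing-over.

    Returned in the order: both loci, B only, A only, neither.
    """
    a, b = Fraction(a), Fraction(b)
    total = 1 + a + b                      # coalescence plus both mutation rates
    both = 1 / total                       # fusion precedes any mutation
    b_only = (a / total) * (1 / (1 + b))   # mutation at A, then coalescence at B
    a_only = (b / total) * (1 / (1 + a))   # mutation at B, then coalescence at A
    neither = (a / total) * (b / (1 + b)) + (b / total) * (a / (1 + a))
    return both, b_only, a_only, neither


def scaled_disequilibrium(a, b):
    """Disequilibrium scaled by between-individual diversity."""
    both, b_only, a_only, _ = joint_ibs_complete_linkage(a, b)
    ibs_a = both + a_only  # marginal IBS probability at locus A
    ibs_b = both + b_only  # marginal IBS probability at locus B
    return (both - ibs_a * ibs_b) / ((1 - ibs_a) * (1 - ibs_b))


if __name__ == "__main__":
    a, b = Fraction(1), Fraction(2)
    print(joint_ibs_complete_linkage(a, b))
    print(scaled_disequilibrium(a, b), 1 / (1 + a + b))
```

The last line prints two equal values: under these assumptions the scaled index coincides with the probability that fusion, and hence simultaneous coalescence at both loci, precedes any mutation.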
For other sample configurations (Fig. 2), the haplotypes may fuse with or without coalescence at a given locus, even in the absence of crossing-over. Such events correspond to transitions within a plane in Fig. 9. In the presence of crossing-over, within-plane transitions may occur for all sample configurations. The remainder of this section addresses transitions both within and between planes under low but positive rates of crossing-over (9b).
6.2.2. Incomplete linkage
Tight but incomplete linkage introduces the possibility of the formation and dissolution by crossing-over of a Type 3 haplotype, bearing ancestral lineages at both loci.
Fig. 10 traces the immediate ancestry of the parent that generated a recombinant Type 3 haplotype. The maternal and paternal complements of that parent are Type 1 and Type 2, with the state of genes other than those directly ancestral to the recombinant Type 3 haplotype depicted as 0. As discussed in Section 3.2, genes in descent state 0 may in fact be IBD with those in descent state 1 through a more ancient common ancestor. Their 0 designation for descent state indicates that they themselves failed to leave direct descendants in the sample. If those genes had left direct descendants in the sample, the ancestor at the base of Fig. 10 would have generated at least two haplotypes with descendants in the sample: the focal Type 3 recombinant together with a haplotype with designation other than Type 0. As each event occurs at rate , the probability of the combination is treated as negligible.
Just as in Fig. 3, the Type 1 and Type 2 haplotypes in the parent that generated the recombinant Type 3 either resolve to separation in distinct individuals, with probability , or to fusion into a Type 3 haplotype, with probability (17). As the latter event restores the Type 3 haplotype, a crossover event in its immediate ancestry induces a transition in state at rate
| (31) |
A Type 3 haplotype may also form through fusion between Type 1 and Type 2 haplotypes carried by distinct individuals. Those haplotypes derive from the same parent with probability , irrespective of whether their carriers are uniparental or biparental (Appendix A). Because the rates of crossing-over and mutation are also in this case of tight linkage (9b), the possibility that either has in addition experienced a crossing-over or mutation event in the immediately ancestral generation is excluded to the order of approximation considered here. Accordingly, the unitary nature of haplotypes implies that the rate of fusion of a pair of haplotypes residing in distinct individuals corresponds to (18).
6.2.3. Hitting probabilities
This section addresses the derivation of the between-individual IBS probabilities (13). The elements of correspond to the probabilities of the four possible states of the most recent common ancestor in Fig. 9: signals IBS at both loci, at locus B alone, at locus A alone, and at neither locus. Appendix D summarizes the phase-type argument (Neuts, 1995) that determines these probabilities.
At any point in the genealogical history ancestral to the most recent mutation or coalescent event, the process is in one of the three configurations indicated in Fig. 2, with each haplotype borne by a distinct individual. The total instantaneous rates of change for these configurations correspond to
| (32) |
in which the subscripts denote the location of the 4 lineages: 1122 for (on 4 haplotypes), 123 for (on 3 haplotypes), and 33 for (on 2 haplotypes).
Breakage or fusion of haplotypes without coalescence of genetic lineages induces transitions among transient states on the two-locus level (Fig. 9). For rows and columns respectively representing configurations 1122, 123, and 33, matrix provides instantaneous rates of transition:
| (33) |
(Appendix C). Each element on the diagonal indicates the total rate of change from the corresponding transient state (32).
Matrix converts the description from instantaneous rates (change per generation) to probabilities of transition on the time scale of evolutionary events (D.7):
for
| (34) |
Taking limits as described in Section 2 (see (D.8)),
| (35a) |
for
| (35b) |
Mutation at either locus establishes that the sampled genes at that locus are non-IBS. Similarly, fusion events that induce coalescence at one or both loci determine genes as IBS. Upon a coalescence or mutation event, the process jumps to a higher plane in Fig. 9. Table 3 summarizes the designations of end states of a jump between planes, any of which is absorbing with respect to the level of origin.
Table 3.
Absorbing states.
| Designation | Configuration |
|---|---|
| Mutation at locus B, locus A undetermined | |
| Coalescence at locus B, locus A undetermined | |
| Mutation at locus A, locus B undetermined | |
| Coalescence at locus A, locus B undetermined | |
| Coalescence at both loci | |
Matrix provides rates of transition to the higher-level states listed in Table 3. The probability that a process that enters a level at transient state ( in Fig. 9) jumps to state on a higher level ( in Fig. 9) corresponds to the th element of (D.5). Instantaneous rates of transition from configurations 1122, 123, and 33 (rows) to exit states , , , , and (columns) are given by
| (36) |
Probabilities of transition are given by
| (37) |
| (38) |
with the one-locus identity measures given in (14). Matrix provides the transition probabilities from the five exit states (, , , , and ) to the four MRCA configurations represented by the elements of (13):
| (39) |
The th element of ,
| (40) |
is proportional to the probability that a process presently in transient state on the two-locus level ultimately terminates in the th MRCA state , , , or .
For the probability distribution of the state of the initial sample, the vector of likelihoods (13) corresponds to
| (41) |
(Appendix D), for
| (42) |
That each row of (39) sums to unity ensures that for each transient state on the two-locus level (row), the hitting probabilities across terminal states sum to unity:
in which denotes a vector of appropriate length in which all elements equal unity (D.10).
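The hitting-probability calculation described above can be sketched numerically. The example below is a minimal illustration of the standard absorbing-chain argument, not the matrices of (33) and (36): the numerical rates (`fusion`, `cross`, `mut_a`, `mut_b`) and all variable names are assumptions chosen for illustration. It builds a rate matrix among the transient configurations 1122, 123, and 33, a matrix of exit rates to the five exit states of Table 3, and solves for the probability of leaving the two-locus level through each exit state.

```python
from fractions import Fraction as F

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination (a is n x n, b is n x m)."""
    n = len(a)
    # Augmented matrix [a | b]; copying leaves the inputs unmodified.
    m = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Swap a row with a non-zero pivot into position.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

# Illustrative rates only (assumed values, not taken from the paper).
fusion, cross, mut_a, mut_b = F(1), F(2), F(1, 2), F(1, 4)

# Rows and columns: configurations 1122, 123, 33 (off-diagonal rates).
within = [
    [F(0), 4 * fusion, F(0)],   # 1122 -> 123 via any of 4 Type 1/Type 2 pairs
    [cross, F(0), fusion],      # 123 -> 1122 (breakage), 123 -> 33 (fusion)
    [F(0), 2 * cross, F(0)],    # 33 -> 123 (either Type 3 haplotype breaks)
]
# Columns: mutation at B, coalescence at B, mutation at A,
# coalescence at A, coalescence at both loci.
exits = [
    [mut_b, fusion, mut_a, fusion, F(0)],
    [mut_b, fusion, mut_a, fusion, F(0)],
    [mut_b, F(0), mut_a, F(0), fusion],
]

# Total rate of change for each transient state.
totals = [sum(within[i]) + sum(exits[i]) for i in range(3)]
# Negated generator restricted to the transient states.
neg_q = [[totals[i] if i == j else -within[i][j] for j in range(3)]
         for i in range(3)]

# hitting[i][j]: probability that a process starting in transient state i
# leaves the two-locus level through exit state j.
hitting = solve(neg_q, exits)

for name, row in zip(["1122", "123", "33"], hitting):
    print(name, [float(p) for p in row], "row sum:", sum(row))
```

Under these assumed rates each row sums to unity, and the probability of exit through simultaneous coalescence (last column) is largest for a process that begins in configuration 33.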
6.3. Measures of disequilibrium
Determination of the between-individual joint identities (41) permits determination of measures of within- as well as between-individual disequilibrium (27).
As noted previously, matrix (40) provides the distribution of MRCA states for each transient state from which the jump departing the two-locus level occurs. For each row of , the index of between-individual disequilibrium,
is proportional to the difference between the product of the first and fourth elements and the product of the second and third elements. Jumps from transient states 1122 or 123 (first two rows) generate zero two-locus disequilibrium:
This absence of disequilibrium reflects that processes that leave the two-locus level from 1122 or 123 proceed to the MRCA through two independent evolutionary events: mutation or coalescence at a single locus and then at the remaining locus. A notational shift clarifies the structure of the process. Matrix (40) may be expressed in terms of instantaneous rates of change:
for given in (34) and given in Box I. The matrix in Box I is equivalent to
| (43) |
in which denotes the Kronecker product. That the first two rows of (43) have the form of a Kronecker product between probabilities of events at locus B and events at locus A indicates that the process proceeds to the MRCA through independent transitions. In contrast, the last row indicates that processes that jump from state 33 (two Type 3 haplotypes) may attain the MRCA not only through successive coalescence or mutational events, but also through coalescence at both loci simultaneously. Simultaneous coalescence introduces a departure from independence that manifests as disequilibrium.
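A small numerical check may clarify why a Kronecker-product form implies zero disequilibrium. In the sketch below, the function names and the probabilities are hypothetical; the ordering of the joint vector (IBS at both loci, at locus B alone, at locus A alone, at neither locus) follows the ordering described for the MRCA states. A vector formed as a Kronecker product of locus B and locus A probabilities gives a zero difference of cross-products, while shifting some probability mass onto simultaneous coalescence gives a positive value.

```python
from fractions import Fraction as F

def kron(u, v):
    """Kronecker product of two vectors, returned as a flat list."""
    return [a * b for a in u for b in v]

def disequilibrium(p):
    """Difference of cross-products for p = [both, B alone, A alone, neither]."""
    both, b_only, a_only, neither = p
    return both * neither - b_only * a_only

# Hypothetical single-locus probabilities: [IBS, non-IBS].
locus_b = [F(1, 3), F(2, 3)]
locus_a = [F(1, 4), F(3, 4)]

# Independent resolution of the two loci.
independent = kron(locus_b, locus_a)

# With assumed probability w, the two loci coalesce simultaneously.
w = F(1, 5)
simultaneous = [F(1), F(0), F(0), F(0)]
mixed = [(1 - w) * x + w * y for x, y in zip(independent, simultaneous)]

print(disequilibrium(independent))  # zero under independence
print(disequilibrium(mixed))        # positive with simultaneous coalescence
```

In this toy mixture the disequilibrium equals w(1 - w) times the probability of non-identity at both loci under independence, so it vanishes only when simultaneous coalescence has probability zero or one.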
Box I.
Only a jump from state 33 generates between-individual disequilibrium:
| (44) |
for given in (35b). Proportionality factor depends on the configuration of the initial sample, with configurations for which jumps from state 33 (two Type 3 haplotypes) are more likely generating higher disequilibrium. In (41),
provides the probability distribution of states from which the process departs the two-locus level. For initial sample configuration , gives the last row of (42). Because simultaneous coalescence occurs only through jumps from the 33 configuration, (44) corresponds to the last element of that row:
| (45a) |
In the absence of crossing-over , this proportionality factor is unity, confirming (30). Similarly, for an initial sample that includes only one Type 3 haplotype , as in Fig. 2 of McVean (2002), proportionality factor (44) corresponds to the last element of the second row of (42):
| (45b) |
and for an initial sample comprising no Type 3 haplotypes , corresponds to the last element of the first row:
| (45c) |
The relation
| (46) |
implies that the scaled two-locus disequilibrium measure (44) declines as the number of Type 3 haplotypes in the initial sample decreases.
Crossing-over, at the per-generation rate of , promotes the separation of Type 3 haplotypes into Type 1 and Type 2, while the formation of a Type 3 haplotype through fusion of Type 1 and Type 2 occurs at rate (18), independent of rates of crossing-over. Under all initial sample configurations, looser linkage (higher ) reduces the probability of Type 3 haplotypes. Accordingly, both (16) and (44) uniformly increase as linkage tightens.
Relatively simple expressions arise in the boundary case of complete linkage . The scaled measure of disequilibrium for a sample comprising two Type 3 haplotypes (33) corresponds to
(30). It declines to
for a sample comprising a single Type 3 haplotype (123) and to
for a sample lacking a Type 3 haplotype (1122).
6.4. Qualitative trends in disequilibrium
To facilitate exploration of the properties of within- and between-individual disequilibrium under tight linkage (9b), this section restricts consideration to uniform rates of mutation across loci:
Major qualitative features include the dependence of measures of disequilibrium on the configuration of the initial sample and on rates of inbreeding and mutation.
While tighter linkage (lower ) uniformly increases (16), this unscaled measure of between-individual disequilibrium in general shows non-monotone dependence on the rate of inbreeding and mutation : whether higher increases or reduces depends on the values of and . In contrast, (28), which is scaled by between-individual diversity, increases uniformly with higher and lower mutation . That the scaled within-individual measure (27b) differs from only by terms that depend on (17) and not on or ensures that also declines monotonically with and .
Fig. 11 illustrates the nature of the effect of crossing-over and mutation on (44) in the absence of regular inbreeding for an initial sample comprising a pair of Type 3 haplotypes. This form of association uniformly increases with the rate of inbreeding . Across all levels of inbreeding and initial sample configurations, higher mutation rates induce monotonic declines in , reflecting a higher probability that the most recent event corresponds to mutation rather than simultaneous coalescence.
Fig. 11.
For an initial sample comprising a pair of Type 3 haplotypes from a population without regular inbreeding , scaled between-individual disequilibrium (28) declines with increasing rates of recombination and mutation .
Fig. 12 illustrates that initial sample configurations comprising greater numbers of Type 3 haplotypes give rise to higher , in accordance with (46). Within-individual disequilibrium scaled by within-individual diversity varies over numerically greater values: declining from 1.80 to 0.81 as progresses from 0 to 10.
Fig. 12.
Between-individual disequilibrium scaled by between-individual diversity (27b) under partial selfing with . From bottom to top, the curves correspond to initial sample configurations of , , and (1122, 123, and 33, compare Fig. 2). The maximum values for these sample configurations correspond respectively to 2∕9, 1∕3, and 1 (Table 4).
Table 4 summarizes the maximal values of , approached as and decline.
Table 4.
Maximum values of scaled indices of disequilibrium.
| Index of disequilibrium | Sample configuration | ||
|---|---|---|---|
|
| |||
|
| |||
| 1 | |||
| 1 | |||
Also shown are the maximal values of , in the absence and presence of regular inbreeding. Within-individual joint identity is defined with respect to the pair of haplotypes held by an individual, but requires the solution for between-individual joint identity with initial sample configuration of a pair of Type 3 haplotypes (26). While the maximum value of between-individual disequilibrium does not depend on the rate of inbreeding , within-individual disequilibrium (25) increases without bound as approaches unity (Section 6.1). This behavior mirrors that shown by under loose linkage (Fig. 8).
Fig. 13 illustrates that higher rates of inbreeding promote greater scaled between-individual disequilibrium (28), corresponding to an initial sample comprising two Type 3 haplotypes. As approaches zero, each curve approaches a maximum of
which in turn approaches unity for very small . The qualitative form of the increase in disequilibrium with greater population structure, represented here by the rate of selfing , appears consistent with the empirical findings of Jakobsson et al. (2008, compare their Figure 2).
Fig. 13.
Between-individual disequilibrium scaled by between-individual diversity measures (28) for a sample comprising a pair of Type 3 haplotypes with . From bottom to top, the curves represent rates of selfing of 0.1, 0.3, and 0.5.
7. Discussion
The classical single- and multi-locus descent measures were developed in the context of a pedigree, representing correlations among alleles in the presence of genetic drift but absence of mutation (e.g., Wright, 1921). Wright (e.g., 1942) and others extended this conceptual framework in many directions.
7.1. Significance of
Wright (1921) defined as the correlation between uniting gametes. In a one-locus context (Fig. 3), it reflects the probability of coalescence of lineages. In a multi-locus context (Fig. 10), emerges as the key determinant in the process of breakage and fusion of haplotypes.
For populations undergoing self-fertilization at rate , Golding and Strobeck (1980) and Nordborg (2000) have suggested that effective number declines by a factor of with respect to mutation but by a factor of with respect to crossing-over. Recognition of fusion and separation as outcomes of a single process offers an alternative to proposing separate effective numbers for mutation and crossing-over.
A random multi-locus haplotype represents a recombinant with probability proportional to , the per-generation rate of crossing-over. This event entails that chromosomal segments flanking the cross-over point derive from separate complements in the parental individual that produced the haplotype. Fig. 10 illustrates that the most recent event in the genealogical history of complements borne by a single individual corresponds either to their separation into distinct individuals, with probability , or to their derivation from the same haplotype, with probability . Accordingly, the rate of a change in state through crossing-over corresponds to
(31). Otherwise, the state of residence of the lineages on the same haplotype is restored, reflecting the near-instantaneous “healing” of the chromosomal breakage relative to the time scale of mutation, drift, and parent-sharing.
The role of (17) in joint identity across loci is evident in the analysis of Golding and Strobeck (1980), who developed linear recursions in 16 inbreeding coefficients (see their Table 1). Their Equation (4a) indicates that at steady state, many of these coefficients can be expressed as weighted averages of basic measures of diversity. Golding and Strobeck’s (1980) , for example, corresponds to IBS at both loci for a Type 3 haplotype sampled from one individual and a Type 1 and a Type 2 from a distinct individual (Fig. 1). Fig. 10 indicates that the pair of haplotypes (Type 1 and Type 2) within an individual resolve to fusion (descent from a Type 3 haplotype) with probability or resolve to separation in distinct individuals with probability . Accordingly, in the notation used here corresponds to
In particular, for a sample comprising a pair of Type 3 haplotypes corresponds to the first element of
(41), an expression that agrees with of Golding and Strobeck (1980, p. 780, Equation (4a)). All relations in their Equation (4a) correspond to similar averages of one-locus diversity measures (, , , ) and two-locus diversity measures (for the three initial configurations), weighted by and .
7.2. Mutation and identity
The present analysis addresses probabilities of identity by state (IBS) among 4 genes, 2 at each of a pair of loci, A and B, in evolving populations subject to drift, mutation, and regular inbreeding. Elements of the IBS vectors (11) and (13) are not properties of populations, but rather likelihoods: probabilities of observed samples under a model.
The two-locus descent measures of Weir and Cockerham (1973) address probabilities of identity by descent (IBD) in the presence of drift but absence of mutation. Viewing identity in the present context as a Boolean random variable, the expectation of Weir and Cockerham’s (1973) corresponds to within-individual probabilities (11). Similarly, the expectations of their digametic , trigametic , and quadrigametic identity measures correspond to (13) for the three possible configurations of the initial sample (, Fig. 2).
Despite this close relationship, the introduction of mutation and shift to IBS in a coalescence context compel consideration of the effects of genetic diversity on relationships within and between loci. Weir and Cockerham (1973) showed that in cases in which crossing-over occurs on a much shorter time scale than mutation or coalescence (9a), identity disequilibrium is independent of rates of mutation (23). In contrast, under tight linkage (9b), all forms of multi-locus disequilibrium studied here depend on measures of diversity (, ) and consequently the scaled rates of mutation (, ), even after conditioning on heterozygosity at both loci. In essence, genetic diversity in a sample replaces population frequencies of alleles in classical measures of disequilibrium (e.g. Wright, 1933; Hill and Robertson, 1968).
That the level of diversity segregating at loci used to estimate may strongly influence the inferred magnitude of (Heller and Siegismund, 2009) is a sobering prospect. Perhaps an even greater concern is that associations among loci depend not only on observed levels of diversity at the loci but also sample configuration (44).
7.3. Joint identity and linkage disequilibrium
Joint identity (e.g., Cockerham and Weir, 1968; Sved, 1971) among loci differs conceptually and quantitatively from linkage disequilibrium (LD) as classically defined (1). In particular, joint identity does not refer to haplotype phase.
7.3.1. Indices of association
Measures of disequilibrium in the sense discussed here are similar to joint identity in the sense discussed by Sved (1971) and Sved and Feldman (1973). They defined joint identity between locus A and locus B in terms of the probability of identity at B conditional on identity at A:
for and random variables that take the value 1 if the haplotype pair show IBS at the corresponding locus.
A simple index of association based on this notion of joint identity might compare the conditioned and unconditioned probabilities of IBS at locus B:
Restatement in terms of (13) suggests an analogy between this index and
Conditioning on locus B rather than A would generate
Also closely related is the analogue of “standardized identity excess” (Ohta, 1980), which corresponds to
(28). This symmetric expression resembles Wright’s (1933) correlation between allele frequencies across loci (1).
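The relation between the conditional index and the joint identities can be illustrated with a short calculation. The sketch below assumes that the comparison of conditioned and unconditioned probabilities is taken as a simple difference (other contrasts are possible); the function names and the example vector are hypothetical. For a joint vector ordered as (IBS at both loci, at locus B alone, at locus A alone, at neither locus), the difference equals the cross-product disequilibrium divided by the probability of identity at the conditioning locus.

```python
from fractions import Fraction as F

def marginals(p):
    """Return (P[IBS at A], P[IBS at B]) for p = [both, B alone, A alone, neither]."""
    both, b_only, a_only, _ = p
    return both + a_only, both + b_only

def cross_product(p):
    """Joint identity minus the product of marginals, in cross-product form."""
    both, b_only, a_only, neither = p
    return both * neither - b_only * a_only

def excess_given_a(p):
    """P[IBS at B | IBS at A] - P[IBS at B] (assumed form of the index)."""
    ibs_a, ibs_b = marginals(p)
    return p[0] / ibs_a - ibs_b

def excess_given_b(p):
    """P[IBS at A | IBS at B] - P[IBS at A] (conditioning on locus B)."""
    ibs_a, ibs_b = marginals(p)
    return p[0] / ibs_b - ibs_a

# Hypothetical joint identity probabilities (they sum to one).
p = [F(4, 15), F(1, 5), F(2, 15), F(2, 5)]
ibs_a, ibs_b = marginals(p)

print(excess_given_a(p), cross_product(p) / ibs_a)
print(excess_given_b(p), cross_product(p) / ibs_b)
```

The two conditional indices share the numerator and differ only in which single-locus identity scales it, which is why a symmetric scaling by both single-locus quantities is a natural companion.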
7.3.2. Scaling of measures of disequilibrium
Weir and Cockerham (1973, unnumbered equation between their (25) and (26)) defined identity disequilibrium as
for the probability that a random individual bears genes IBD at both loci and the probability that the genes at either locus held by a random individual are IBD. In evolving populations subject to mutation, with identity corresponding to IBS rather than IBD, within-individual disequilibrium is defined as
(15). By interpreting as the analogue of the expectation of and allowing the probabilities of identity at locus A and locus B to differ, one may regard as analogous to the expectation of .
For a pair of loosely-linked loci (9a) evolving under partial self-fertilization, the steady-state value of (24) in fact corresponds to within-individual disequilibrium scaled by between-individual diversity:
(23). That identity disequilibrium should be scaled at all is not immediately obvious from its derivation (Weir and Cockerham, 1973). Hill and Robertson (1968) used a similar inference framework (see Figure 1.5 of Weir, 1990) in their study of linkage disequilibrium , the numerator of Wright’s between-locus correlation (1). In this context, the observed sample derives from a descendant of a replicate population established at some point in the past from a base population and quantities to be inferred from the sample are properties of the population and not the sample. Hill and Robertson (1968) noted that for a model in which represents recombinational distance, their expressions for should be scaled to reflect conditioning on variation at both loci:
We have developed the analysis so far in terms of the average values of computed over all replicate lines. But in many replicates one or other locus will become fixed after a few generations and in these is zero. If we observe linkage disequilibrium between a pair of loci in natural populations, it can only be among those still segregating at both loci. We therefore need to describe the behaviour of the linkage disequilibrium within such lines. When , the average value of within lines still segregating at both loci, denoted , can be obtained by dividing from Eq. (3) by the proportion of lines still segregating.
— Hill and Robertson (1968, p. 228)
The apparent independence of identity disequilibrium (24) from allelic frequencies reflects this implicit conditioning, which does depend on genetic diversity. Its non-monotone relationship with the level of inbreeding (Fig. 6) reflects dual consequences of inbreeding: higher concordance between genealogical histories across loci but lower genetic diversity.
That the analogue of identity disequilibrium (23) corresponds to a scaled measure of association prompts an exploration of alternative scalings. In particular, the scaling of within-individual disequilibrium (15) by within- rather than between-individual diversity gives
(25). Under loose linkage (9a), this index preserves the independence from rates of mutation of identity disequilibrium (24), but increases uniformly with the rate of inbreeding (Fig. 8).
While between-individual disequilibrium (16) is zero at steady state under loose linkage (22), this index assumes positive values under tight linkage (44). Section 6.4 indicates that under tight linkage (9b), the between-individual index is not monotone in the rate of selfing . As in the case of rapid crossing-over (Fig. 7), the decline of as nears zero or unity reflects a tradeoff between levels of segregating diversity and rates of simultaneous coalescence across loci. In contrast, between-individual disequilibrium scaled by between-individual diversity,
(28), declines uniformly with increases in rates of crossing-over and mutation ( and in Fig. 11). The scaled within-individual index (25) shares these properties. Under tight linkage (9b),
(27b), for , between-individual disequilibrium scaled by between-individual diversity in a sample comprising a pair of Type 3 haplotypes (Fig. 2). Wright’s (17) again assumes a central role in the relationship between these scaled measures of disequilibrium.
As noted in the Introduction, various researchers have explored scalings of disequilibrium with a view to mitigating the dependence on allele frequencies (reviewed in Hedrick, 1987). In the present sample-based analysis, which does not address allele frequencies in the population, mutation in combination with genetic drift and inbreeding determines associations between loci as well as genetic diversity within loci. Proposal of the scaled measures of within-individual disequilibrium (25) and between-individual disequilibrium (28) is not intended to reduce the fundamental dependence on diversity, but rather to suggest measures that exhibit the useful property of monotonic dependence on rates of crossing-over, mutation, and inbreeding. Expressions for the maximal values of these measures (Table 4) may be useful if further scaling to restrict disequilibrium to the range [0, 1] is desired.
7.4. Interpretation of disequilibrium
7.4.1. Effect of sample configuration
Previous coalescence-based studies have characterized multi-locus associations in terms of correlations or covariances among loci in total tree length or MRCA age (Pluzhnikov and Donnelly, 1996; McVean, 2002). Here, the analysis in Section 6.3 indicates that non-zero between-individual disequilibrium (16) reflects the probability that attainment of the two-locus MRCA involves the fusion of a pair of Type 3 haplotypes, which entails coalescence of lineages at both loci at exactly the same time. Genealogical histories involving successive mutation or coalescence events generate no disequilibrium.
An implication is that the magnitude of multi-locus disequilibria in the sense discussed here depends on the configuration of the sample. This dependence is explicit in Weir and Cockerham’s (1973) distinction among di-, tri-, and quadri-gametic measures, and in Hudson’s (2001) and McVean’s (2002) specification of sample configuration. Under tight linkage (9b), samples comprising more Type 3 haplotypes show greater (46), between-individual disequilibrium scaled by between-individual diversity.
In spite of the long-standing recognition in the theoretical literature of the importance of sample configuration, explicit specification is generally absent from empirical studies of linkage disequilibrium (LD). Among the contributing factors to this trend may be the perception of LD as a property of the population.
7.4.2. Effect of population structure
Tishkoff et al. (1998) reported a number of trends that have become iconic in human evolutionary genomics including higher LD between tightly-linked sites in non-African than African populations. A genome- and planet-wide study of single nucleotide polymorphism (SNP) loci demonstrated that in all human populations surveyed, estimates of a form of identity disequilibrium (Sabatti and Risch, 2002) decay at a roughly exponential rate with physical distance across tens of kilobases (KB), with non-African populations showing higher disequilibrium between sites separated by a given physical interval (Jakobsson et al., 2008). Further, genetic diversity (heterozygosity) in non-African populations declines with distance from northeast Africa (Ramachandran et al., 2005). These striking correlations with geographical distance from Africa may reflect a history of successive founder effects during the spread of humans out of Africa (e.g., Tishkoff and Williams, 2002; Ramachandran et al., 2005).
For tightly linked sites, between which multiple crossover events in a single meiosis occur at negligible rates, the recombination fraction corresponds to the interval length in Morgans separating the sites (Haldane, 1919). The 3000 × 10⁶ base pairs constituting the human genome comprise 3000 to 3500 cM (Broman et al., 1998), with the lower value suggesting that 1 cM corresponds roughly to a physical distance of 1 megabase (MB). This figure suggests a recombination fraction of about 10⁻⁴ (or a bit more) between sites separated by 10 KB (10⁻² MB). Assuming an effective number on the order of would then suggest between sites separated by 10 KB. For a range of values of this magnitude, between-individual disequilibrium scaled by between-individual diversity (44) shows behaviour similar to empirical findings: greater disequilibrium as diversity declines (Fig. 11) and as inbreeding increases. These trends suggest that both lower diversity and greater population structure may contribute to the higher multi-locus associations observed in non-African populations.
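The conversion from physical distance to recombination fraction used here is simple arithmetic, sketched below. The constant and function names are hypothetical, and the calculation assumes a uniform genome-wide average of map length per megabase, which ignores local variation in recombination rate.

```python
GENOME_BP = 3000e6      # base pairs in the human genome, as quoted in the text
GENOME_CM_LOW = 3000.0  # lower genetic map length quoted in the text (cM)
GENOME_CM_HIGH = 3500.0 # upper genetic map length quoted in the text (cM)

def recombination_fraction(distance_bp, genome_cm):
    """Map distance in Morgans for a short interval.

    For tightly linked sites the map distance approximates the
    recombination fraction (multiple crossovers are negligible).
    """
    cm_per_mb = genome_cm / (GENOME_BP / 1e6)  # average cM per megabase
    distance_mb = distance_bp / 1e6
    return distance_mb * cm_per_mb / 100.0     # 100 cM per Morgan

r_low = recombination_fraction(10_000, GENOME_CM_LOW)
r_high = recombination_fraction(10_000, GENOME_CM_HIGH)
print(r_low, r_high)
```

With the lower map length, 10 KB corresponds to a recombination fraction near 10⁻⁴; the upper map length raises this by a factor of 3500/3000.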
Acknowledgments
I am grateful to have had the opportunity to become acquainted with Freddy Bugge Christiansen and with some of his extensive contributions to multi-locus theory in particular. I thank Sudhir Kumar and the Institute of Genomics and Evolutionary Medicine (iGEM) of Temple University for their gracious hospitality and support. Public Health Service, United States grant GM 37841 provided partial funding for this research.
Appendix A. Parent-sharing
A random individual is uniparental with probability and biparental with probability (2). A pair of uniparental individuals share their parent with probability
A uniparental individual and a biparental individual share a parent with probability
A pair of biparental individuals may have 2, 1, or 0 parents in common. They share both parents with probability
and share exactly one parent with probability
To order , the expected fraction of random pairs of offspring that share exactly one parent corresponds to
Given parent-sharing between a pair of individuals, the probability that two gametes, one sampled from each individual, both derive from the shared parent is 1 for two uniparental individuals, 1/4 for two biparental individuals, and 1/2 for the remaining case. To , the probability that two gametes sampled from distinct individuals derive from the same parent is then
| (A.1) |
irrespective of the rate of selfing.
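The independence of (A.1) from the rate of selfing can be checked with exact arithmetic. The sketch below rests on assumptions that should be read as such: parents are taken to be drawn uniformly from `n_parents` reproductive individuals, so that to leading order two uniparental individuals share their parent with probability 1/N, a uniparental and a biparental individual share a parent with probability 2/N, and two biparental individuals share exactly one parent with probability 4/N. The conditional probabilities 1, 1/2, and 1/4 are those stated in the text. The function and argument names are hypothetical.

```python
from fractions import Fraction

def same_parent_probability(uniparental, n_parents):
    """Leading-order probability that two gametes, one from each of two
    distinct individuals, derive from the same parent."""
    u = Fraction(uniparental)        # probability an individual is uniparental
    inv_n = Fraction(1, n_parents)

    # Assumed leading-order probabilities of parent-sharing.
    share_uu = inv_n                 # two uniparental individuals
    share_ub = 2 * inv_n             # one uniparental, one biparental
    share_bb = 4 * inv_n             # two biparental, exactly one shared parent

    # Probability that both sampled gametes derive from the shared parent.
    both_uu, both_ub, both_bb = Fraction(1), Fraction(1, 2), Fraction(1, 4)

    return (u * u * share_uu * both_uu
            + 2 * u * (1 - u) * share_ub * both_ub
            + (1 - u) ** 2 * share_bb * both_bb)

for u in (Fraction(0), Fraction(1, 3), Fraction(9, 10)):
    print(u, same_parent_probability(u, 1000))
```

Each term reduces to the same factor of 1/N, and the weights u², 2u(1 - u), and (1 - u)² sum to unity, so the result does not depend on the uniparental proportion.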
Appendix B. Within- and between-individual identities
This section derives steady-state expressions for the within-individual IBS probabilities (11) in terms of the between-individual IBS probabilities (13) for a sample comprising two Type 3 haplotypes. Fig. 5 provides an alternative graphical argument.
Random variable denotes the number of generations in the ancestry of a random individual since its most recent biparental ancestor, with probability mass function given by (3). The individual bears non-IBS genes at both loci only if its most recent biparental ancestor also bore non-IBS genes at both loci, with no intervening coalescence events:
for the hat denoting sampling configuration (Fig. 2) and (8a) the per-generation probability of a coalescence event of any kind between haplotypes derived from the same parental individual. Using the probability mass function of the number of generations since the most recent outcross event (3) to marginalize out the time since this event produces, to terms larger than ,
A similar argument determines (01), the probability that a random individual bears genes IBS at locus A but not locus B. Conditioning on the occurrence of the most recent biparental ancestor generations ago,
A coalescence event at locus B more recently than the most recent outcross event would imply an ancestral state with zero probability of generating the descendant. With probability (8a), coalescence occurs at neither locus in any generation since the most recent biparental ancestor. In this case, the gamete pair that united to form that ancestor must bear genes IBS at locus A but not locus B, an event with probability . Alternatively, the most recent coalescence event≤ occurs at locus A alone generations in the past with probability . The term reflects that any history involving coalescence at locus B in the remaining generations until the most recent biparental ancestor would have zero probability of generating the descendant. The gametes that united to form the most recent biparental ancestor derive from distinct reproductive individuals, bearing genes IBS at neither locus (with probability ) or IBS at locus A but not B (with probability ). Using that
and marginalizing out the time since the most recent outcross as for produces
for the one- and two-locus rates given in (8) and given in (17).
A parallel argument addressing identity by state at locus B but not A gives
Because the elements of sum to unity, these expressions determine , the probability that a random individual bears genes IBS at both loci:
Using that
(19) for each locus produces (20):
Appendix C. Instantaneous rates of transition
This section derives transition probabilities for each of the three possible configurations of a sample of four genes, a pair from each locus (Fig. 2). Breakage or fusion of chromosomes without coalescence at either locus induces within-plane transitions (Fig. 9), contributing to matrix (33). Mutation or coalescence induces between-plane jumps, signifying (fewer undetermined loci), contributing to matrix (36).
C.1. Lineages borne by four individuals
Each genetic lineage resides in a distinct individual in configuration 1122 (Fig. 2). The total instantaneous rate of change corresponds to
(32), in which the first two terms represent mutation and the last parent-sharing between any pair of the four haplotypes. As each event has an exponentially distributed waiting time that is independent of any other event, the probability that a given event is the most recent corresponds to the rate of occurrence of that event divided by the total rate of change.
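The competing-events argument can be made concrete with a short sketch. The rates below are illustrative assumptions, not values from the paper, and the function name is hypothetical; the only principle used is that, for independent exponentially distributed waiting times, each event is the most recent with probability equal to its rate divided by the total rate.

```python
from fractions import Fraction as F

def most_recent_event_probabilities(rates):
    """Probability that each competing event is the most recent one,
    given independent exponential waiting times with the supplied rates."""
    total = sum(rates.values())
    return {event: rate / total for event, rate in rates.items()}

# Hypothetical rates for configuration 1122 (four haplotypes).
rates_1122 = {
    "mutation at locus A": F(1, 2),
    "mutation at locus B": F(1, 4),
    "parent-sharing (any of 6 pairs)": F(6),
}

probs_1122 = most_recent_event_probabilities(rates_1122)
for event, prob in probs_1122.items():
    print(event, prob)
```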
At rate , a mutation occurs in one of the A locus lineages, inducing a jump to state (Table 3), signifying that the sampled A genes are non-IBS with the B genes remaining undetermined. Taking limits as described in Section 2, the probability of transition from state 1122 to state is
| (C.1a) |
A similar argument applies to a mutation at locus B. The probability of transition from state 1122 to state through mutation at locus B is
| (C.1b) |
Given parent-sharing with fusion as the most recent evolutionary event, the fusion involves a Type 1 and a Type 2 haplotype with probability
implying an 03 ancestor. Accordingly, the transition from configuration 1122 to configuration 123 occurs with probability
| (C.1c) |
Similarly, the fusion event involves the pair of type 1 haplotypes with probability
implying coalescence at locus A. Transition from configuration 1122 to state (Table 3) occurs with probability
| (C.1d) |
The probability of transition from 1122 to state through fusion of the pair of type 2 haplotypes corresponds to (C.1d) as well.
C.2. Lineages borne by three individuals
For configuration 123, the total instantaneous rate of change corresponds to (32):
A cross-over event in the Type 3 haplotype induces transition from configuration 123 to configuration 1122 with probability
for given in (10).
Configuration 123 transitions through mutation at locus A to state (Table 3) with probability
| (C.2a) |
and to state with probability
| (C.2b) |
Fusion between any pair of the three haplotypes occurs with probability
| (C.2c) |
Fusions involving the Type 3 haplotype imply coalescence at exactly one locus, inducing transition to state (IBS at locus A) or to state (IBS at locus B).
Fusion between the Type 1 and Type 2 haplotypes implies coalescence at neither locus, inducing transition from configuration 123 to 33. Note that in this case, the Type 1 and Type 2 haplotypes both derive without crossing-over from a Type 3 haplotype. In the Type 1 haplotype (01), for example, the gene at the B locus is labeled 0 even though it is identical by descent (IBD) to a gene in the sample. As discussed in Section 3.2, this designation indicates that that particular B locus gene may fail to leave descendants in the sample even though it is derived from an ancestor that succeeds in doing so.
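The conditional probabilities of the kinds of fusion in configurations 1122 and 123 can be checked by enumerating pairs of haplotypes. This is a minimal sketch that assumes every pair of haplotypes is equally likely to be the pair that fuses; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def fusion_pair_probabilities(haplotypes):
    """Given a list of haplotype types (1, 2, or 3), return the probability
    that a fusion involves each unordered pair of types, assuming all
    pairs of haplotypes are equally likely to fuse."""
    pairs = list(combinations(range(len(haplotypes)), 2))
    counts = Counter(
        tuple(sorted((haplotypes[i], haplotypes[j]))) for i, j in pairs
    )
    return {types: Fraction(n, len(pairs)) for types, n in counts.items()}


# Configuration 1122: two Type 1 and two Type 2 haplotypes.
p_1122 = fusion_pair_probabilities([1, 1, 2, 2])
# Configuration 123: one haplotype of each type.
p_123 = fusion_pair_probabilities([1, 2, 3])

print(p_1122)
print(p_123)
```

Under this assumption, a fusion in configuration 1122 joins a Type 1 with a Type 2 haplotype in four of the six possible pairs, and each of the three possible fusions in configuration 123 is equally likely.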
C.3. Lineages borne by two individuals
The total rate of change from configuration 33, comprising two Type 3 haplotypes, is (32), reflecting mutation, crossing-over, or fusion.
Mutation at locus A induces transition to state (Table 3) with probability
| (C.3a) |
Similarly, transition to state through mutation at locus B occurs with probability
| (C.3b) |
Transition from state 33 to state 123, through separation of a Type 3 into a Type 1 and a Type 2 haplotype, corresponds to the most recent event with probability
| (C.3c) |
Fusion of the Type 3 haplotypes occurs at rate (18), implying simultaneous coalescence at both loci (state ). This transition occurs with probability
| (C.3d) |
Appendix D. Hitting probabilities
Fig. 9 depicts a genealogical path from the initial sample (lowest level) to the multi-locus MRCA, which specifies the IBS status at all loci. For the two-locus model considered here, the lowest level corresponds to 4 genes, 2 sampled from each locus, in one of the configurations shown in Fig. 2: two Type 3 haplotypes (33), a single Type 3 together with a Type 1 and a Type 2 (123), or a pair of Type 1 and a pair of Type 2 (1122). Within-level transitions of the process correspond to separation of a Type 3 haplotype into a Type 1 and a Type 2 through crossing-over or to formation of a Type 3 haplotype through fusion between a Type 1 and a Type 2. Determination of IBS status through coalescence of lineages at a locus (IBS) or mutation (non-IBS) constitutes a jump to a higher level (fewer undetermined loci). From the perspective of a given level, within-level moves represent transitions among transient states and jumps to a higher level represent transitions to absorbing states.
Under the assumptions of the coalescent model addressed here, parent-sharing between genetic material borne by distinct individuals, mutation, and crossing-over in Type 3 haplotypes occur independently, each with an exponentially distributed waiting time. Within the framework developed by Neuts (1995), the waiting time to absorption has a phase-type distribution. Kumagai and Uyenoyama (2015) used this approach to address population subdivision in a coalescent context. For completeness, this section summarizes that description and adapts the framework to the two-locus coalescent process depicted in Fig. 9.
D.1. Continuous time scale
For a given entrance state on a level in Fig. 9, the process may access transient states on the same level. A total of absorbing states on higher levels are accessible from these transient states. Let denote the transition probability matrix, of which the element in the th row and th column represents the probability that the process resides in state at time given its residence in transient state at time 0, for any . Under the Markov property, satisfies the Chapman–Kolmogorov equations:
| (D.1) |
Transitions in the model considered here reflect parent-sharing between genetic material borne by distinct individuals, mutation, and crossing-over in Type 3 haplotypes, events that occur on the time scale of generations . For , (D.1) implies
in which denotes the identity matrix. Dividing by and taking limits, we obtain
which has solution
| (D.2) |
for matrix providing the instantaneous rate of change between any pair of states:
| (D.3) |
(see, for example, Taylor and Karlin, 1998, Chapter VI, section 6).
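The standard argument leading from the Chapman–Kolmogorov equations to the matrix exponential can be written out as follows. The symbols $\mathbf{P}(t)$ for the transition probability matrix, $\mathbf{Q}$ for the matrix of instantaneous rates, and $h$ for a short time increment are assumed here for illustration and need not match the notation of (D.1)–(D.3).

```latex
\begin{align*}
\mathbf{P}(t+h) &= \mathbf{P}(t)\,\mathbf{P}(h)
  && \text{(Chapman--Kolmogorov)} \\
\mathbf{P}(h) &= \mathbf{I} + \mathbf{Q}\,h + o(h)
  && \text{(small } h\text{)} \\
\frac{\mathbf{P}(t+h) - \mathbf{P}(t)}{h}
  &= \mathbf{P}(t)\,\frac{\mathbf{P}(h) - \mathbf{I}}{h}
  \;\longrightarrow\; \mathbf{P}(t)\,\mathbf{Q}
  && (h \to 0) \\
\frac{d\mathbf{P}(t)}{dt} &= \mathbf{P}(t)\,\mathbf{Q},
  \quad \mathbf{P}(0) = \mathbf{I}
  \;\;\Longrightarrow\;\;
  \mathbf{P}(t) = e^{\mathbf{Q}t} = \sum_{k \ge 0} \frac{t^k}{k!}\,\mathbf{Q}^k .
\end{align*}
```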
For a level with transient states and absorbing states, provides the instantaneous rate of change between any pair of those states:
in which gives the instantaneous rates of within-level moves and between-level moves. Moves from the absorbing states occur at rates given by the lower blocks, which correspond to matrices of zeros of appropriate size ( and ). Powers of have the form
From (D.2), we obtain the transition probability matrix
| (D.4) |
Because represents rates of transition among transient states,
Accordingly, (D.4) implies that the probability that a process initiated at transient state ultimately exits to state corresponds to the th element of
| (D.5) |
To confirm that each row of this matrix sums to unity, we note that the constraint that all rows of sum to zero implies
in which denotes a vector of appropriate length in which all elements are equal to unity. We then obtain from (D.5)
| (D.6) |
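The steps of this section can be summarized in a single worked derivation. The notation is assumed for illustration: $\mathbf{Q}$ denotes the full rate matrix, $\mathbf{T}$ the block of within-level rates among transient states, $\mathbf{R}$ the block of between-level rates to absorbing states, and $\mathbf{1}$ a vector of ones. This is the standard phase-type result, and it may differ in notation or sign convention from (D.4)–(D.6).

```latex
\begin{align*}
\mathbf{Q} &=
  \begin{pmatrix} \mathbf{T} & \mathbf{R} \\ \mathbf{0} & \mathbf{0} \end{pmatrix},
\qquad
\mathbf{Q}^k =
  \begin{pmatrix} \mathbf{T}^k & \mathbf{T}^{k-1}\mathbf{R} \\ \mathbf{0} & \mathbf{0} \end{pmatrix}
  \quad (k \ge 1), \\
e^{\mathbf{Q}t} &= \mathbf{I} + \sum_{k \ge 1} \frac{t^k}{k!}\,\mathbf{Q}^k
  = \begin{pmatrix}
      e^{\mathbf{T}t} & \mathbf{T}^{-1}\bigl(e^{\mathbf{T}t} - \mathbf{I}\bigr)\mathbf{R} \\
      \mathbf{0} & \mathbf{I}
    \end{pmatrix}, \\
\lim_{t \to \infty} e^{\mathbf{T}t} &= \mathbf{0}
  \;\;\Longrightarrow\;\;
  \text{hitting probabilities} = -\mathbf{T}^{-1}\mathbf{R}, \\
\mathbf{T}\mathbf{1} + \mathbf{R}\mathbf{1} &= \mathbf{0}
  \;\;\Longrightarrow\;\;
  -\mathbf{T}^{-1}\mathbf{R}\,\mathbf{1} = \mathbf{T}^{-1}\mathbf{T}\,\mathbf{1} = \mathbf{1}.
\end{align*}
```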
D.2. Evolutionary event time scale
This section rephrases the key expression for the hitting probabilities (D.5) in terms of the time scale of the most recent evolutionary event among parent-sharing, crossing-over, and mutation.
Let denote the total instantaneous rate of change from transient state (see (32), for example). As the rows of must sum to zero, the diagonal elements of correspond to
Matrices and , which give the probability of each transition induced by the most recent evolutionary event (parent-sharing, crossing-over, or mutation), correspond to
| (D.7) |
in which
| (D.8) |
and the −1 superscript denotes the matrix inverse. In this notation, the matrix of hitting probabilities (D.5) corresponds to
| (D.9) |
As each row of the matrix on the left side of the first equation sums to unity (D.6), the matrix on the right has this property as well. Another demonstration involves noting that the rows of the matrix
sum to unity, implying
| (D.10) |
Explicit expressions for (35) and (37) in the case of two loci appear in Section 6.2.3.
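The agreement between the hitting probabilities computed on the continuous time scale and on the evolutionary event time scale can be checked numerically. The following is a minimal sketch under stated assumptions: the rates `theta_A`, `theta_B`, `rho`, and `c` are hypothetical values, the rate structure is a simplified one that ignores any dependence on the rate of inbreeding, and the labels of the absorbing states are invented for illustration. It does not reproduce the explicit matrices of Section 6.2.3.

```python
from fractions import Fraction


def solve(A, B):
    """Solve the linear system A X = B exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(x) for x in A[i]] + [Fraction(x) for x in B[i]] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [x / scale for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]


# Hypothetical rates (assumptions for illustration only).
theta_A, theta_B = Fraction(1), Fraction(1)  # mutation, total per locus
rho = Fraction(2)                            # crossing-over per Type 3 haplotype
c = Fraction(1)                              # parent-sharing per pair of haplotypes

transient = ["1122", "123", "33"]
absorbing = ["A_nonIBS", "B_nonIBS", "A_IBS", "B_IBS", "AB_IBS"]  # hypothetical labels

# Within-level rates (rows and columns ordered as in `transient`).
within = [
    [0, 4 * c, 0],    # 1122 -> 123: fusion of a Type 1 with a Type 2
    [rho, 0, c],      # 123 -> 1122: crossing-over; 123 -> 33: fusion of Types 1 and 2
    [0, 2 * rho, 0],  # 33 -> 123: crossing-over in either Type 3
]
# Between-level rates (columns ordered as in `absorbing`).
between = [
    [theta_A, theta_B, c, c, 0],
    [theta_A, theta_B, c, c, 0],
    [theta_A, theta_B, 0, 0, c],
]

n = len(transient)
total = [sum(within[i]) + sum(between[i]) for i in range(n)]  # total rate per state

# Continuous time scale: solve (-T) H = R, with diagonal of T equal to -total.
neg_T = [[total[i] if i == j else -within[i][j] for j in range(n)] for i in range(n)]
H_rate = solve(neg_T, between)

# Event time scale: jump probabilities U (within) and V (between); solve (I - U) H = V.
I_minus_U = [[(1 if i == j else 0) - within[i][j] / total[i] for j in range(n)]
             for i in range(n)]
V = [[x / total[i] for x in between[i]] for i in range(n)]
H_event = solve(I_minus_U, V)

for state, row in zip(transient, H_event):
    print(state, [str(p) for p in row])
```

Because the rate blocks equal the jump-probability blocks scaled row by row by the total rates, the two systems have the same exact solution, and each row of the result sums to unity.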
Footnotes
CRediT authorship contribution statement
Marcy K. Uyenoyama: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing.
Data availability
No data was used for the research described in the article.
References
- Bennett JH, Binet FE, 1956. Association between mendelian factors with mixed selfing and random mating. Heredity 10, 51–55.
- Broman KW, Murray JC, Sheffield VC, White RL, Weber JL, 1998. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet 63, 861–869.
- Christiansen FB, 2000. Population Genetics of Multiple Loci. John Wiley & Sons, New York.
- Cockerham CC, Weir BS, 1968. Sib mating with two linked loci. Genetics 60, 629–640.
- Crow JF, Denniston C, 1988. Inbreeding and variance effective population numbers. Evolution 42, 482–495.
- Ethier SN, Griffiths RC, 1990. On the two-locus sampling distribution. J. Math. Biol 29, 131–159.
- Ewens WJ, 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol 3, 87–112.
- Ewens WJ, 1982. On the concept of effective population size. Theor. Popul. Biol 21, 373–378.
- Fu Y-X, 1997. Coalescent theory for a partially selfing population. Genetics 146, 1489–1499.
- Golding GB, 1984. The sampling distribution of linkage disequilibrium. Genetics 108, 257–274.
- Golding GB, Strobeck C, 1980. Linkage disequilibrium in a finite population that is partially selfing. Genetics 94, 777–789.
- Haldane JBS, 1919. The combination of linkage values, and the calculation of distances between the loci of linked factors. J. Genet 8, 299–309.
- Haldane JBS, 1924. A mathematical theory of natural and artificial selection. Part II The influence of partial self-fertilisation, inbreeding, assortative mating, and selective fertilization on the composition of Mendelian populations, and on natural selection. Biol. Rev 1, 158–163.
- Hedrick PW, 1987. Gametic disequilibrium measures: Proceed with caution. Genetics 117, 331–341.
- Heller R, Siegismund HR, 2009. Relationship between three measures of genetic differentiation GST, DEST and G'ST: How wrong have we been? Mol. Ecol 18, 2080–2083.
- Henn BM, Gignoux CR, Jobin M, Granka JM, Macpherson JM, Kidd JM, Rodríguez-Botigué L, Ramachandran S, Hon L, Brisbin A, Lin AA, Underhill PA, Comas D, Kidd KK, Norman PJ, Parham P, Bustamante CD, Mountain JL, Feldman MW, 2011. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proc. Natl. Acad. Sci. (USA) 108, 5154–5162.
- Hill WG, Robertson A, 1968. Linkage disequilibrium in finite populations. Theor. Appl. Genet 38, 226–231.
- Hudson RR, 2001. Two-locus sampling distributions and their application. Genetics 159, 1805–1817.
- Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung H-C, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, Bras JM, Schymick JC, Hernandez DG, Traynor BJ, Simon-Sanchez J, Matarin M, Britton A, van de Leemput J, Rafferty I, Bucan M, Cann HM, Hardy JA, Rosenberg NA, Singleton AB, 2008. Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451, 998–1003.
- Karlin S, 1968. Equilibrium behavior of population genetic models with non-random mating, part I: Preliminaries and special mating systems. J. Appl. Probab 5, 231–313.
- Karlin S, McGregor J, 1972. Addendum to a paper of W. Ewens. Theor. Popul. Biol 3, 113–116.
- Kumagai S, Uyenoyama MK, 2015. Genealogical histories in structured populations. Theor. Popul. Biol 102, 3–15.
- Lewontin RC, 1964. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49, 49–67.
- Lewontin RC, 1988. On measures of gametic disequilibrium. Genetics 120, 849–852.
- Lewontin RC, Kojima K. i., 1960. The evolutionary dynamics of complex polymorphisms. Evolution 14, 458–472.
- Malécot G, 1969. The mathematics of heredity. In: English translation by Yermanos DM of Malécot G. 1948. Les Mathématiques de l’Hérédité. Masson, Paris, W. H. Freeman & Co., San Francisco.
- McVean GAT, 2002. A genealogical interpretation of linkage disequilibrium. Genetics 162, 987–991.
- Neuts MF, 1995. Algorithmic Probability: A Collection of Problems. Chapman & Hall, London.
- Nordborg M, 2000. Linkage disequilibrium, gene trees and selfing: An ancestral recombination graph with partial self-fertilization. Genetics 154, 923–929.
- Nordborg M, Donnelly P, 1997. The coalescent process with selfing. Genetics 146, 1185–1195.
- Ohta T, 1980. Linkage disequilibrium between amino acid sites in immunoglobulin genes and other multigene families. Genet. Res 36, 181–197.
- Ohta T, Kimura M, 1969. Linkage disequilibrium due to random genetic drift. Genet. Res 13, 47–55.
- Pluzhnikov A, Donnelly P, 1996. Optimal sequencing strategies for surveying molecular genetic diversity. Genetics 144, 1247–1262.
- Pollak E, 1987. On the theory of partially inbreeding finite populations. I. Partial selfing. Genetics 117, 353–360.
- Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL, 2005. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc. Natl. Acad. Sci. (USA) 102, 15942–15947.
- Sabatti C, Risch N, 2002. Homozygosity and linkage disequilibrium. Genetics 160, 1707–1719.
- Slatkin M, 2008. Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet 9, 477–485.
- Strobeck C, Morgan K, 1978. Effect of intragenic recombination on the number of alleles in a finite population. Genetics 88, 829–844.
- Sved JA, 1971. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor. Popul. Biol 2, 125–141.
- Sved JA, Feldman MW, 1973. Correlation and probability methods for one and two loci. Theor. Popul. Biol 4, 129–132.
- Takebayashi N, Newbigin E, Uyenoyama MK, 2004. Maximum-likelihood estimation of rates of recombination within mating type regions. Genetics 167, 2097–2109.
- Taylor HM, Karlin S, 1998. An Introduction to Stochastic Modeling, third ed. Academic Press, New York.
- Tishkoff SA, Goldman A, Calafell F, Speed WC, Deinard AS, Bonne-Tamir B, Kidd JR, Pakstis AJ, Jenkins T, Kidd KK, 1998. A global haplotype analysis of the myotonic dystrophy locus: Implications for the evolution of modern humans and for the origin of myotonic dystrophy mutations. Am. J. Hum. Genet 62, 1389–1402.
- Tishkoff SA, Williams SM, 2002. Genetic analysis of African populations: Human evolution and complex disease. Nat. Rev. Genet 3, 611–621.
- Uyenoyama MK, 2024. Wright’s hierarchical F-statistics. Mol. Biol. Evol 41, msae083.
- Uyenoyama MK, Takebayashi N, Kumagai S, 2019. Inductive determination of allele frequency spectrum probabilities in structured populations. Theor. Popul. Biol 129, 148–159.
- VanLiere JM, Rosenberg NA, 2008. Mathematical properties of the r² measure of linkage disequilibrium. Theor. Popul. Biol 74, 130–137.
- Weir BS, 1990. Genetic Data Analysis. Sinauer Assoc. Inc., Sunderland, MA.
- Weir BS, Cockerham CC, 1969. Pedigree mating with two linked loci. Genetics 61, 923–940.
- Weir BS, Cockerham CC, 1973. Mixed self and random mating at two loci. Genet. Res 21, 247–262.
- Wiuf C, Hein J, 1999. The ancestry of a sample of sequences subject to recombination. Genetics 151, 1217–1228.
- Wright S, 1921. Systems of mating I, II, III, IV, V. Genetics 6, 111–178.
- Wright S, 1933. Inbreeding and homozygosis. Proc. Natl. Acad. Sci. (USA) 19, 411–420.
- Wright S, 1942. Statistical genetics and evolution. Bull. Amer. Math. Soc 48, 223–246.
- Wright S, 1952. The theoretical variance within and among subdivisions of a population that is in a steady state. Genetics 37, 312–321.
- Wright S, 1969. Evolution and the Genetics of Populations, 2, The Theory of Gene Frequencies. Univ. Chicago Press, Chicago.