G3: Genes | Genomes | Genetics. 2024 Feb 27;14(4):jkad236. doi: 10.1093/g3journal/jkad236

Relatedness coefficients and their applications for triplets and quartets of genetic markers

Kermit Ritland
Editor: P Ingvarsson
PMCID: PMC10989858  PMID: 38411620

Abstract

Relatedness coefficients which seek the identity-by-descent of genetic markers are described. The markers are in groups of two, three or four, and if four, can consist of two pairs. It is essential to use cumulants (not moments) for four-marker-gene probabilities, as the covariance of homozygosity, used in four-marker applications, can only be described with cumulants. A covariance of homozygosity between pairs of markers arises when populations follow a mixture distribution. Also, the probability of four markers all identical-by-descent equals the normalized fourth cumulant. In this article, a “genetic marker” generally represents either a gene locus or an allele at a locus. Applications of three marker coefficients mainly involve conditional regression, and applications of four marker coefficients can involve identity disequilibrium. Estimation of relatedness using genetic marker data is discussed. However, three- and four-marker estimators suffer from statistical and numerical problems, including higher statistical variance, complexity of estimation formula, and singularity at some intermediate allele frequencies.

Keywords: relationship, inbreeding, genetic markers, cumulants, moments, population genetics, quantitative genetics

Introduction

Relatedness is a general term for the level of genetic similarity between individuals and is measured by the sharing of alleles identical-by-descent (Malécot 1948; Pamilo and Crozier 1982). Relatedness is quantified with gene identity coefficients, which characterize both the pattern and the frequency of identity-by-descent. The unit of observation is normally a pair of marker loci, and the object of estimation is the kinship coefficient or the coefficient of relationship. In this article, the unit of observation is extended to triplets and quartets of genes, providing the opportunity to characterize additional parameters of population structure.

Relatedness may be estimated with genetic markers (Morton et al. 1971) and for pairs of marker genes, many computer programs are available for estimation of relatedness (Wang 2014), in particular for “pairwise relationship”, such as the r of Queller and Goodnight (1989). However, the equations for pairwise relationship are not extendable to three or four genes, as the covariances and higher moments need to be defined in new ways.

With more than two markers the situation becomes much more complex. Estimators for three- and four-marker-gene measures of relatedness have recently been proposed. Samanta et al. (2009) provided the first estimator for three genes. Ackerman et al. (2017) examined all four genes and described the estimation of seven of eight coefficients of relatedness. Multiallelic data have information about all eight coefficients, but they used a biallelic model which provides just seven degrees of freedom, constraining their space of estimates. In addition, they did not use cumulants, and their moments are normalized differently than they are here.

Cumulants are of use in certain problems in quantitative genetics (Burger 1991; Turelli and Barton 1994) and, compared to moments, have more useful theoretical properties (Kendall et al. 1977), but ordinary moments are sufficient for two- and three-gene fixation indices. Ritland (1987) found that cumulants, rather than moments, were an essential component of four-gene fixation indices. Fourth-order cumulants are needed to specify the probability of gene identity for all four genes and also to describe identity disequilibrium as the “covariance of covariances”. As an example of the necessity of cumulants, for population gene frequency $p$, the fourth central moment for four genes, denoted ($X_1$, $X_2$, $X_3$, and $X_4$), is $E[(X_1-p)(X_2-p)(X_3-p)(X_4-p)] = \sigma_{12}^2\sigma_{34}^2 + \sigma_{13}^2\sigma_{24}^2 + \sigma_{14}^2\sigma_{23}^2 + \kappa_4$, where $\sigma_{ij}^2$ is the covariance of $X_i$ and $X_j$ and $\kappa_4$ is a fourth-order cumulant which does not appear in moment-based treatments.

The probabilities developed in the article are used to estimate relatedness in a population with polymorphic genetic markers. Any number of alleles at a single locus is allowed in the equations, but in practice one has two and at most three alleles with SNP types of data, and for tractability, we use a three-allele model to examine a special situation with four genes. These models and estimation procedures are readily applicable to the emerging mountains of genome data.

Applications of markers

Definitions

Between pairs of genetic markers, the coefficient of relationship measures the degree of consanguinity (i.e. the probability that markers are identical by descent, termed ibd). The coefficient of relationship equals twice the kinship coefficient. The inbreeding coefficient is the probability that a pair of markers within one individual are ibd. With more than two markers, the coefficient of relationship is more broadly defined with groups of markers (two, three, or four). With four markers, there are nine modes of Jacquard's gene identity, with ibd genes connected by lines. The normalized central moment gives the probability of ibd of all markers. At the level of four markers, cumulants are necessary to describe identity disequilibrium (the excess of identity between marker pairs). The covariance of cumulants forms the machinery of higher order interactions.

Relatedness and two markers

The two-marker coefficient of relatedness is used for many inferences with genetic markers, mainly involving pairs of genes sampled between two individuals (“coefficient of relationship”) or pairs of genes sampled within one individual (the “inbreeding coefficient”). Analyses of data with two-marker measures are ubiquitous (Wang 2014), and the two-marker probabilities are often incorporated into probabilities of groups of three and four genes.

Regression and three markers

The three-marker relationship coefficient is the probability that the three marker loci have alleles all identical-by-descent (Fig. 1a). This coefficient, $G$, is usually combined with two-allele coefficients to give biologically interpretable parameters, useful at least for problems involving mating systems or kin selection. In the theory of mating system estimation, the “effective selfing rate” is the genetically equivalent rate of selfing caused by all types of biparental inbreeding (Ritland 1985); the effective selfing rate of individual A equals $2R-G$, where $R$ is the relatedness between mates and $G$ the third moment involving the two maternal alleles and the single paternal allele (Fig. 1b). In the theory of kin selection, the regression coefficient of relatedness is used and, properly, a three-gene model is needed (Fig. 1c), as shown by Michod and Hamilton (1980), where their Equation 18 depends upon whether the reference genotype is homozygous (18a) vs heterozygous (18b). Note that both the effective selfing rate and the regression coefficients of relationship can be asymmetrical when inbreeding coefficients differ between the two relatives.

Fig. 1.


Three cases where three-gene modes of gene identity are used. a) Effective selfing model involves two genes of one parent, b) progeny-pair model, and c) Altruist-recipient. Identical genes are linked by lines. In the effective selfing model, two of the genes are from the maternal parent and the other gene is the paternal contribution.

Identity disequilibrium and four markers

Between two diploid individuals, there are 15 patterns of gene identity (Liu and Weir 2005). A pair of individuals can share two, three, or four markers, and at each level, allelic similarities can describe aspects of relatedness. After Jacquard (1966, 2012), for four marker genes and two individuals, there are nine condensed identity modes, denoted as Δi (Fig. 2a). There are eight independent parameters of relatedness: three pairwise measures (FA, FB, R), two three-way measures (GA, GB), and three four-marker measures (FAB, RAB, H). At the highest level, the measures are much different as the four-marker measures FAB, RAB are covariances (not identities). The four-marker parameter, H, is the probability that four markers are identical. As well, the quantities must be defined as cumulants. Cumulants equal moments up to order three, but fourth-order moments do not equal fourth-order cumulants.

Fig. 2.


Two cases where four-gene modes of identity are used. a) The general case where each of the nine identity modes are inferred. b) A model to fit progeny pairs to mating system parameters. c) The inference of heritability in the field, where M is the marker and Q the quantitative trait.

The fourth central moment equals $\kappa_4 + F_{AB} + 2R_{AB}$ (note that the covariances between second moments enter this expression), and the normalized cumulant $\kappa_4$ (equivalent to $H$) equals the probability of identity of all four marker genes. While the variance is a measure of the spread of the distribution, kurtosis is a measure of the “peakedness” of the distribution of random variables, with infrequent extreme deviations contributing excessively to this statistic (de la Rosa and Moreno Muñoz 2008).

While applications of three and four gene measures are in their infancy, at least, the skew and kurtosis as measured by higher moments can help remove bias in DNA forensics caused by genotyping error (Weir 1994).

In the four-marker model, many possibilities exist for attaching meaning to each of the $\Delta_i$. One example is the progeny-pair model (Ritland and Leblanc 2004), where A and B are two progeny of the same mother plant (Fig. 2b). At another, more abstract level, two of the genes can be markers and two can be quantitative trait loci (Fig. 2c). If identity disequilibrium is present, the regression of phenotypic similarity (QTL) on estimated relationship (markers) gives an estimate of heritability “in the field” (Ritland 2000).

Two, three, and four marker probabilities

At any level of comparison, associations are measured as the frequency of a given configuration (allele “state”) divided by the denominators in Table 2. These denominators are termed “normalization constants” and are the maximum possible value of the numerator. Some of these normalized measures of association arise naturally in the derivations below.

Table 1.

Examples of the statistical variances of relationship coefficients when estimates are based upon a single locus and when the true values are zero (“x” denotes not estimable). See Equations 3 and 6 for definitions of G and H.

Array of p         | F    | GA   | GB   | ϕXY   | H
0.6, 0.3, 0.1      | 2.93 | 73.5 | 74.3 | 12.96 | 7.63
0.5, 0.3, 0.2      | 2.32 | x    | x    | 11.66 | 3.66
0.4, 0.3, 0.2, 0.1 | 1.58 | 0.19 | 7.79 | 13.05 | 2.18

Probabilities of two-marker relationship

From Equation (7) of Ritland (1987), which follows Kendall et al. (1977, eq. 13.36), the frequency of gametes with allele i and with allele j is

$$f_{ij} = \kappa_i\kappa_j + \kappa_{ij}.$$

The two-marker coefficient of relationship can be estimated from the frequencies of each allele in a sample. For any given allele, say $A_i$, it is derived by equating the observed frequency of homozygotes to that expected from the above equation:

$$f_{ii} = E[A_iA_i] = p_i^2 + p_i(1-p_i)R$$
$$f_{ij} = E[A_iA_j] = 2p_ip_j(1-R).$$

The likelihood of the data, given $R$, is $L(R) = \prod_{ij} f_{ij}^{X_{ij}}$. Solving for $R$ gives estimators based upon pairs of alleles $A_i$,

$$\hat{R}_{ii} = \frac{f_{ii} - p_i^2}{p_i(1-p_i)}.$$

These allele-specific estimates of $R$ are combined across alleles as

$$\hat{R} = \sum_i w_i\hat{R}_{ii}, \tag{1}$$

where the weights $w_i$ sum to unity.

These weights are found by finding the $w_i$ that minimize $w^TVw$, where $w$ is an $n$-element vector of weights and $V$ is the $n \times n$ variance–covariance matrix of allele-specific estimates (for details, see Ritland (1996)). The weights require prior specification of true relatedness. With zero prior $R$, the weight for allele $A_i$ is $w_i = \frac{1-p_i}{n-1}$. An $m$-allele locus receives the weight $(n_m - 1)$, giving the estimator for $r$ given by Equation 5 in Ritland (1996).
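As an illustration of the allele-specific estimates and their weighted combination in Equation (1), the following is a minimal sketch rather than the author's implementation. It assumes that allele frequencies are known without error and that the prior value of $R$ is zero, so that $w_i = (1-p_i)/(n-1)$. The function name `pairwise_relatedness` and the example frequencies are hypothetical.

```python
from fractions import Fraction

def pairwise_relatedness(p, f_hom):
    """Weighted two-marker estimate of R at one locus.

    p     -- allele frequencies p_i (assumed known, summing to 1)
    f_hom -- observed frequencies f_ii of the paired state A_i A_i
    """
    n = len(p)
    # Allele-specific estimates: R_ii = (f_ii - p_i^2) / (p_i (1 - p_i))
    r_ii = [(f - pi**2) / (pi * (1 - pi)) for pi, f in zip(p, f_hom)]
    # Weights for a prior of R = 0: w_i = (1 - p_i) / (n - 1); they sum to one
    w = [(1 - pi) / (n - 1) for pi in p]
    return sum(wi * ri for wi, ri in zip(w, r_ii))

# Hypothetical three-allele locus; expected frequencies generated at R = 1/4
p = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]
true_r = Fraction(1, 4)
f_hom = [pi**2 + pi * (1 - pi) * true_r for pi in p]
r_hat = pairwise_relatedness(p, f_hom)
print(r_hat)  # 1/4
```

Because the input frequencies are exact expectations, every allele returns the same value and the weighting has no effect; with sampled data the weights determine how much each allele contributes.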

Probabilities of three-marker relationship

The three-marker relationship coefficient is the probability that three sampled marker genes are all identical-by-descent. From Equation (7) of Ritland (1987), which follows Kendall et al. (1977, eq. 13.36), the joint frequency of markers $i$, $j$, and $k$ is

$$f_{ijk} = \kappa_i\kappa_j\kappa_k + \kappa_i\kappa_{jk} + \kappa_j\kappa_{ik} + \kappa_k\kappa_{ij} + \kappa_{ijk}.$$

This is written in conventional population genetic terms as

$$f_{ijk} = p_ip_jp_k + p_ip_jv_f + (p_ip_k + p_jp_k)v_r + \omega_{ijk},$$

where alleles $i$ and $j$ are from one individual and allele $k$ is from a second individual. The cumulants are written in bold face to emphasize they have a random component that may covary.

From Equation (7) of Ritland (1987), there are three primary patterns

$$f_{iii} = E[A_iA_iA_i] = p_i^3(1-F-2R+2G) + p_i^2(F+2R-3G) + p_iG$$
$$f_{iij} = E[A_iA_iA_j] = p_i^2p_j(1-F-2R+2G) + p_ip_j(F-G)$$
$$f_{ijk} = E[A_iA_jA_k] = p_ip_jp_k(1-F-2R+2G), \tag{2}$$

where the order is irrelevant ($A_iA_iA_j$, $A_jA_iA_j$, and $A_iA_jA_j$ are equivalent). The genotype frequencies are mixtures of marker gene identity: $G$ is the probability that all three markers are ibd, $R-G$ is the probability of ibd of one pair of markers, $F+2R-3G$ is the probability of ibd for two pairs, and $1-F-2R+2G$ is the probability of no ibd among the three markers.

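Solving for $G$ in Equation (2) gives three expressions involving $G$,

$$\hat{G}_{iii} = \frac{f_{iii} - p_i^3 - p_i^2(1-p_i)(F+2R)}{p_i(1-p_i)(1-2p_i)}$$
$$\hat{G}_{iij} = \frac{f_{iij} - p_i^2p_j(1-F-2R) - p_ip_jF}{p_ip_j(2p_i-1)}$$
$$\hat{G}_{ijk} = \frac{f_{ijk} - p_ip_jp_k(1-F-2R)}{2p_ip_jp_k}. \tag{3}$$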

$G$ is a normalized third central moment, and the normalization constant depends upon the pattern of subscripts.
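The relationship between the three-marker frequencies and $G$ can be checked numerically. The sketch below is not taken from the article's software: it generates the expected frequencies of Equation (2) from chosen values of $F$, $R$, and $G$, and then solves the $f_{iii}$ and $f_{ijk}$ expressions for $G$. The function names and the parameter values are hypothetical, and $F$ and $R$ are assumed known.

```python
from fractions import Fraction

def three_marker_freqs(pi, pj, pk, F, R, G):
    """Expected frequencies of the three patterns in Equation (2)."""
    none = 1 - F - 2 * R + 2 * G  # probability of no identity by descent
    f_iii = pi**3 * none + pi**2 * (F + 2 * R - 3 * G) + pi * G
    f_iij = pi**2 * pj * none + pi * pj * (F - G)
    f_ijk = pi * pj * pk * none
    return f_iii, f_iij, f_ijk

def g_from_iii(f_iii, pi, F, R):
    """Solve the f_iii expression for G (undefined at p_i = 1/2)."""
    denom = pi * (1 - pi) * (1 - 2 * pi)
    return (f_iii - pi**3 - pi**2 * (1 - pi) * (F + 2 * R)) / denom

def g_from_ijk(f_ijk, pi, pj, pk, F, R):
    """Solve the f_ijk expression for G."""
    ppp = pi * pj * pk
    return (f_ijk - ppp * (1 - F - 2 * R)) / (2 * ppp)

# Hypothetical values, chosen so that all identity probabilities are valid
pi, pj, pk = Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)
F, R, G = Fraction(1, 10), Fraction(1, 4), Fraction(1, 20)
f_iii, f_iij, f_ijk = three_marker_freqs(pi, pj, pk, F, R, G)
g1 = g_from_iii(f_iii, pi, F, R)
g2 = g_from_ijk(f_ijk, pi, pj, pk, F, R)
print(g1, g2)  # 1/20 1/20
```

The denominator of `g_from_iii` vanishes at $p_i = 1/2$, which is one way to see why three-marker estimates become unstable near intermediate allele frequencies.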

Each allele can provide an estimate of G, denoted G^i, and its weighted estimate across possible alleles i is

$$\hat{G} = \sum_i w_i\hat{G}_i. \tag{4}$$

This represents a “linear estimator” of G. The weights are derived in the Appendix. The best alternative to linear estimation is maximum likelihood.

Probabilities of four-marker relationship

From Equation (7) of Ritland (1987), which follows Kendall et al. (1977, eq. 13.36),

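$$
\begin{aligned}
f_{ijkl} ={}& \kappa_i\kappa_j\kappa_k\kappa_l + \kappa_i\kappa_j\kappa_{kl} + \kappa_i\kappa_k\kappa_{jl} + \kappa_i\kappa_l\kappa_{jk} + \kappa_j\kappa_k\kappa_{il} + \kappa_j\kappa_l\kappa_{ik} + \kappa_k\kappa_l\kappa_{ij} \\
&+ \kappa_i\kappa_{jkl} + \kappa_j\kappa_{ikl} + \kappa_k\kappa_{ijl} + \kappa_l\kappa_{ijk} + \kappa_{ij}\kappa_{kl} + \kappa_{ik}\kappa_{jl} + \kappa_{il}\kappa_{jk} + \kappa_{ijkl}
\end{aligned}
$$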

The cumulants κi are similar to moments and covariances and may have a random component that may covary with other cumulants. The subscripts indicate alleles. The recursion equation is

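$$
\begin{aligned}
f_{ijkl} ={}& p_ip_jp_kp_l + (p_ip_j + p_kp_l)v_f + (p_ip_k + p_ip_l + p_jp_k + p_jp_l)v_r \\
&+ p_i\omega_{jkl} + p_j\omega_{ikl} + p_k\omega_{ijl} + p_l\omega_{ijk} + v_{ij}^2v_{kl}^2 + v_{ik}^2v_{jl}^2 + v_{il}^2v_{jk}^2 \\
&+ \mathrm{Cov}(v_{ij}, v_{kl}) + \mathrm{Cov}(v_{ik}, v_{jl}) + \mathrm{Cov}(v_{il}, v_{jk}) + \kappa_{ijkl},
\end{aligned}
\tag{5}
$$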

where the $v$ terms are second-order covariances. Under a mixture model (which creates the covariances), each subpopulation $m$ contributes to the mean cumulant pooled across $m$. The term $\kappa_{i,m}\kappa_{j,m}\kappa_{k,m}\kappa_{l,m}$ contributes to all 18 population-level moments, the term $\kappa_{i,m}\kappa_{j,m}\kappa_{kl,m}$ contributes to six population-level moments, and so on. However, the quantitative extent of these contributions is complex and beyond treatment here. Regardless, the subpopulation cumulants “distill” to the same assortment of cumulants, albeit with perhaps slightly different values, where the covariance terms are taken across the mixture components $m$. This is a finite mixture model, needed when a single component distribution is inadequate (McLachlan et al. 2019). These expressions can get complex, but Withers et al. (2015) provide the first known expressions for cumulants, obtained with computer technology (an equation solver) that was not available to KR in 1987. Cumulants are allowed to vary across the mixture, and this results in an effective covariance between second-order cumulants. The expectations taken across $m$ change the above expression because of associations among $i$, $j$, $k$, and $l$ across $m$, which cause the cumulants to be associated in a certain way since, for example, $E[\kappa_{i,m}\kappa_{j,m}] \neq \kappa_i\kappa_j$.

The association between pairs of markers is termed identity disequilibrium. If one pair of alleles is heterozygous, it is more likely the second pair is also heterozygous. This is a four-gene marker measure that has been neglected due to inordinate attention to linkage disequilibrium. Identity disequilibrium has classically been characterized as the excess of homozygosity above that expected from the squared gene frequencies (Hill 1975; Ohta 1980). The identity excess is closely correlated to the expectation of the total squared linkage disequilibrium (Takahata 1982). Some of the problem is that haploid gametes are not directly assayed but rather imputed (Vitalis and Couvet 2001b).

We can add a cumulant to the equation for the probability of identity-by-state. From Equation 3.78 in Kendall et al. (1977), the moments about the mean for two squared random variables (the alleles present at each locus) equals

$$E[p_i^2p_j^2] = \kappa_{22} + \kappa_{20}\kappa_{02} + 2\kappa_{11}^2,$$

whose form corresponds to $3\sigma_r^4$ for the fourth central moment, with the difference that the cumulant $\kappa_{22}$ is added. Vitalis and Couvet (2001a) and others have given estimators for identity disequilibrium that omit this cumulant.

The four-allele case introduces higher-order associations and brings with it new statistical problems. Among four alleles, three new measures arise. The first is termed $H$ and is the probability that all four alleles are identical-by-descent. The other two have not been recognized in the literature, perhaps because they invoke the existence of cumulants, which differ from the corresponding moments for products of four or more variates.

Of these two, the first, termed $R_{AB}$, is the probability that both alleles in the first relative are identical-by-descent to both alleles in the second relative. The second, termed $F_{AB}$, is the probability that both individuals have both marker genes identical-by-descent. Thus, the three unique four-allele measures are

$$H$$
$$F_{AB} = F_AF_B + \mathrm{Cov}(F_A, F_B)$$
$$R_{AB} = R^2 + \mathrm{Cov}(R, R_c). \tag{6}$$

The covariances between second moments, $\mathrm{Cov}(F_A, F_B)$ and $\mathrm{Cov}(R, R_c)$, exist only when the distribution of gene frequency follows a mixture distribution in which subpopulations vary for $F$ and $R$.

We can rewrite Equation (5) as

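$$
\begin{aligned}
f_{ijkl} = (2-\delta_{ij})(2-\delta_{kl})\big[\,& p_ip_jp_kp_l \\
&+ (\delta_{jl}p_ip_kp_j + \delta_{jk}p_ip_lp_j + \delta_{il}p_jp_kp_i + \delta_{ik}p_jp_lp_i - 4p_ip_jp_kp_l)\,R \\
&+ (\delta_{ik}\delta_{jl}p_ip_j + \delta_{il}\delta_{jk}p_ip_j - \delta_{jl}p_ip_kp_j - \delta_{jk}p_ip_lp_j - \delta_{il}p_jp_kp_i - \delta_{ik}p_jp_lp_i + 2p_ip_jp_kp_l)\,R_{AB} \\
&+ p_kp_l(\delta_{ij}p_i - p_ip_j)\,F_A + p_ip_j(\delta_{kl}p_k - p_kp_l)\,F_B \\
&+ (\delta_{ij}\delta_{kl}p_ip_k - \delta_{kl}p_ip_jp_k - \delta_{ij}p_ip_kp_l + p_ip_jp_kp_l)\,F_{AB} \\
&+ 2(\delta_{ijl}p_kp_i + \delta_{ijk}p_lp_i - p_ip_kp_{jl} - p_ip_lp_{jk} - p_jp_kp_{il} - p_jp_lp_{ik} - 2p_kp_lp_{ij} + 4p_ip_jp_kp_l)\,G_A \\
&+ 2(\delta_{jkl}p_ip_j + \delta_{ikl}p_ip_j - p_ip_kp_{jl} - p_ip_lp_{jk} - p_jp_kp_{il} - p_jp_lp_{ik} - 2p_ip_jp_{kl} + 4p_ip_jp_kp_l)\,G_B \\
&+ (p_{ijkl} - p_ip_{jkl} - p_jp_{ikl} - p_kp_{ijl} - p_lp_{ijk} + 2p_ip_jp_{kl} + 2p_ip_kp_{jl} + 2p_ip_lp_{jk} \\
&\qquad + 2p_jp_kp_{il} + 2p_jp_lp_{ik} + 2p_kp_lp_{ij} - p_{ij}p_{kl} - p_{ik}p_{jl} - p_{il}p_{jk} - 6p_ip_jp_kp_l)\,H\,\big],
\end{aligned}
\tag{7}
$$

where, for shorthand, $p_{ij} = \delta_{ij}p_i$.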

In this expression, there are eight relationship coefficients ($R$, $R_{AB}$, $F_A$, $F_B$, $F_{AB}$, $G_A$, $G_B$, $H$), which in principle will specify eight different classes of marker genotypes. This probability of four alleles, $f_{ijkl}$, is then fitted to the observed frequencies in a sample.

For equations that solve for all eight parameters, the choice is somewhat arbitrary, but a natural set of eight classes, in which identity-by-state mirrors the identity-by-descent, is: $A_iA_iA_iA_i$ (all identical by state, or “ibs”); $A_iA_kA_iA_k$ and $A_iA_iA_kA_k$ (two pairs ibs); $A_iA_iA_iA_k$ and $A_iA_kA_kA_k$ (one triplet ibs); and $A_iA_iA_jA_k$, $A_iA_jA_kA_k$, and $A_iA_jA_jA_k$ (one pair ibs between A and B). Thus, we seek the expected frequencies in the vector ($f_{iiii}$, $f_{ijij}$, $f_{iijj}$, $f_{iiij}$, $f_{iijk}$, $f_{ijik}$).

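The frequency of $A_iA_iA_iA_i$ is obtained from Equation (7), where all $\delta = 1$ and all marker frequencies are $p_i$:

$$f_{iiii} = p_i\big[\,p_i^3 + p_i^2q_i(F_A + F_B + 4R) + p_iq_i^2(F_{AB} + 2R_{AB}) + 2p_iq_i(1-2p_i)(G_A + G_B) + q_i(1-6p_iq_i)H\,\big].$$

Likewise, the frequency with which A and B are both heterozygous for $A_i$ and $A_j$ is, irrespective of order or phase,

$$f_{ijij} = 4p_ip_j\big[\,p_ip_j(1 - F_A - F_B + F_{AB}) + (p_i + p_j - 4p_ip_j)R + (1 - p_i - p_j + 2p_ip_j)R_{AB} + (-p_i - p_j + 4p_ip_j)(G_A + G_B) - (1 - 2p_i - 2p_j + 6p_ip_j)H\,\big],$$

and the frequency with which they are homozygous for alternative alleles $A_i$ and $A_j$ is

$$f_{iijj} = 2p_ip_j\big[\,p_ip_j(1 - 4R + 2R_{AB}) + q_ip_jF_A + p_iq_jF_B + (1 - p_i - p_j + p_ip_j)F_{AB} - 2p_j(1-2p_i)G_A - 2p_i(1-2p_j)G_B - (1 - 2p_i - 2p_j + 6p_ip_j)H\,\big].$$

For triplets of identity-by-state, $f_{iiij}$ is

$$f_{iiij} = 2p_ip_j\big[\,p_i^2 + p_iq_iF_A - p_i^2F_B - p_iq_iF_{AB} + (p_i + p_j - 4p_ip_j)R - (p_i + p_j - 2p_ip_j)R_{AB} + (1 - 4p_i + 4p_ip_j)G_A - 2p_i(1-2p_j)G_B - (1-6p_iq_i)H\,\big].$$

Finally, a single allele pair $ii$ can be shared only within individual A,

$$f_{iijk} = 2p_ip_jp_k\big[\,p_i(1 - 4R + 2R_{AB} - F_B + F_{AB}) + q_iF_A - 2(1-2p_i)G_A + 4p_iG_B + 2(1-3p_i)H\,\big],$$

or shared only once between A and B:

$$f_{ijik} = 4p_ip_jp_k\big[\,p_i(1 - F_A - F_B + F_{AB}) + (1-4p_i)R - (1-2p_i)R_{AB} - (1-4p_i)(G_A + G_B) + 2(1-3p_i)H\,\big].$$

The expressions for $f_{ijjj}$ and $f_{ijkk}$ are obtained by symmetry, and the expression for $f_{ijjk}$ is summed over all four pairings of $j$ between A and B: $f_{ijjk}$, $f_{ijkj}$, $f_{jijk}$, and $f_{jikj}$.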
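A quick way to check an expression of this kind is to evaluate it at its two extremes. The sketch below codes the $f_{iiii}$ expression as it reads above, under the assumption that $q_i = 1 - p_i$; the function name and the allele frequency are hypothetical. With all coefficients set to zero the frequency should reduce to $p_i^4$ (four independent draws), and with all coefficients set to one it should reduce to $p_i$ (a single draw copied four times).

```python
from fractions import Fraction

def f_iiii(p, FA, FB, R, FAB, RAB, GA, GB, H):
    """Expected frequency of A_i A_i A_i A_i, assuming q = 1 - p."""
    q = 1 - p
    return p * (p**3
                + p**2 * q * (FA + FB + 4 * R)
                + p * q**2 * (FAB + 2 * RAB)
                + 2 * p * q * (1 - 2 * p) * (GA + GB)
                + q * (1 - 6 * p * q) * H)

p = Fraction(3, 10)  # hypothetical allele frequency
unrelated = f_iiii(p, 0, 0, 0, 0, 0, 0, 0, 0)
identical = f_iiii(p, 1, 1, 1, 1, 1, 1, 1, 1)
print(unrelated, identical)  # 81/10000 3/10
```

Both limits hold for any $p_i$, which gives some reassurance that the coefficients of the eight parameters in this expression are mutually consistent.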

The appendix gives the 8 × 8 matrix of probabilities of observing the marker gene frequencies f given the relatedness coefficients. Of course, this depends upon the particular array of f's used (there are others than the above). In this case, the determinant of the matrix is

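$$
\begin{aligned}
512\,p_i^8p_j^7p_k^3(p_i-1)\big(&96p_i^7 - 160p_i^6 - 4p_i^5(24p_j^2 - 16p_j - 9) + 3p_i^4(32p_j^2 - 80p_j + 47) \\
&- p_i^3(132p_j^2 - 302p_j + 147) + p_i^2(89p_j^2 - 154p_j + 59) + p_i(1-p_j)(23p_j - 11) + (p_j-1)(2p_j-1)\big)
\end{aligned}
$$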

That it is nonzero indicates all 8 parameters are jointly estimable, but a linear approach which uses residuals to simplify things is needed at this point.

Joint values of H and identity disequilibrium

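$$
\begin{bmatrix} f_{AAAA} \\ f_{AAaa} \end{bmatrix}
=
\begin{bmatrix}
p(1-p)(1-6p(1-p)) & p^2(1-p)^2 \\
-2pq(1-2p-2q+6pq) & 2p^2q^2
\end{bmatrix}
\begin{bmatrix} H \\ F_{ab} \end{bmatrix}
$$

whose solution is

$$\hat{H} = \frac{q^2f_{AAAA} - (1-p)^2f_{AAaa}}{pq(1-p)(1-2p)(1-p-q)}$$
$$\hat{F}_{ab} = \frac{q(1-2p-2q+6pq)f_{AAAA} - (1-p)(1-6p(1-p))f_{AAaa}}{pq(1-p)(1-2p)(1-p-q)}.$$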

Joint estimates of H and joint identity disequilibrium

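$$
\begin{bmatrix} f_{AAAA} \\ f_{AaAa} \\ f_{AAaa} \end{bmatrix}
=
\begin{bmatrix}
p(1-p)(1-6p(1-p)) & p^2(1-p)^2 & 2p^2(1-p)^2 \\
-4pq(1-2p-2q+6pq) & 4p^2q^2 & 4pq((1-p)(1-q)+pq) \\
-2pq(1-2p-2q+6pq) & 2pq(1-p-q+pq) & 4p^2q^2
\end{bmatrix}
\begin{bmatrix} H \\ F \\ R \end{bmatrix}
$$

$$\hat{H} = \frac{2q(1-p-q+3pq)f_{AAAA} - p(1-p)^2(f_{AAaa} + f_{AaAa})}{2pq(1-p)(1-p-q)(1-3p)}$$
$$\hat{F}_{ab} = \frac{2q(1-2p-2q+6pq)f_{AAAA} + (1-p)(2p^2-4p+1)f_{AAaa} + p(1-2p)f_{AaAa}}{2pq(1-p)(1-p-q)(1-3p)}$$
$$\hat{R}_{abab} = \frac{4q(1-2p-2q+6pq)f_{AAAA} + (1-p)p(1-2p)f_{AAaa} - (4p^2-5p+1)f_{AaAa}}{2pq(1-p)(1-p-q)(1-3p)}.$$

The denominator shows that at least three alleles are required in the population, and that marker frequencies of $p_i = 1/3$ are noninformative.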
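The two singularities noted here can be seen directly from the coefficient matrix. The sketch below is an illustration under stated assumptions, not the article's code: the matrix entries are transcribed from the three-equation system above, with the $H$ entries of the two lower rows taken as negative (as in the $f_{ijij}$ and $f_{iijj}$ expressions), and the right-hand side is only the part of each frequency attributable to $H$, $F_{AB}$, and $R_{AB}$. The helper names and the parameter values are hypothetical, and exact rational arithmetic is used so that a zero determinant is exactly zero.

```python
from fractions import Fraction

def coefficient_matrix(p, q):
    """Rows: f_AAAA, f_AaAa, f_AAaa; columns: H, F_AB, R_AB."""
    h = 1 - 2 * p - 2 * q + 6 * p * q
    return [
        [p * (1 - p) * (1 - 6 * p * (1 - p)), p**2 * (1 - p)**2, 2 * p**2 * (1 - p)**2],
        [-4 * p * q * h, 4 * p**2 * q**2, 4 * p * q * ((1 - p) * (1 - q) + p * q)],
        [-2 * p * q * h, 2 * p * q * (1 - p - q + p * q), 4 * p**2 * q**2],
    ]

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(m, rhs):
    """Cramer's rule; raises ZeroDivisionError when the matrix is singular."""
    d = det3(m)
    solution = []
    for j in range(3):
        mj = [row[:] for row in m]
        for r in range(3):
            mj[r][j] = rhs[r]  # replace column j with the right-hand side
        solution.append(det3(mj) / d)
    return solution

# Hypothetical three-allele locus: p and q are two of the allele frequencies
p, q = Fraction(1, 2), Fraction(3, 10)
truth = [Fraction(1, 50), Fraction(1, 10), Fraction(1, 5)]  # H, F_AB, R_AB
m = coefficient_matrix(p, q)
rhs = [sum(m[r][c] * truth[c] for c in range(3)) for r in range(3)]
estimate = solve3(m, rhs)

det_third = det3(coefficient_matrix(Fraction(1, 3), Fraction(3, 10)))
det_biallelic = det3(coefficient_matrix(Fraction(1, 2), Fraction(1, 2)))
print(estimate == truth, det_third, det_biallelic)  # True 0 0
```

At $p = 1/3$, and when only two alleles are present ($p + q = 1$), the determinant is exactly zero, so the three coefficients cannot be separated at such a locus.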

Identity disequilibrium has classically been characterized as the excess of homozygosity above that expected from the squared gene frequencies, as proposed by Hill (1975) and Ohta (1980). The identity excess is closely correlated to the expectation of the total squared linkage disequilibrium (Takahata 1982). Some of the problem is that haploid gametes are not directly assayed but rather imputed (Vitalis and Couvet 2001b).

The “classic” procedure for estimating identity disequilibrium involves comparing observed vs expected double homozygotes. From Vitalis and Couvet (2001a), one such estimator can be written in the form,

$$\hat{R}_{AB} = \frac{a f_{AAAA} - b f_{AA,aa}}{c},$$

where $a = 6(6pq - 2p - 2q + 1)$, $b = (1 - 6pq)$, and $c = 6pq(1-p-q)(1-3p)$. The numerator is positive as $a$ is always larger than $b$, but the denominator can be either positive or negative, and negative values occur when $b$ is larger than $a$. Such is life with four genes. It is also amazing that the four-gene frequencies $f_{AAAA}$ and $f_{AA,aa}$ are directly used in an effective two-gene estimator, as $f_{AAAA}$ is the observed frequency of double homozygotes and $f_{AA,aa}$ represents the homozygotes expected with no zygotic association. Equally amazing is that Vitalis and Couvet (2001a) and others have given estimators for identity disequilibrium that omit this cumulant.

Discussion and Conclusion

A main feature of higher-order relatedness is the covariance of homozygosity between pairs of marker genes; this is effectively a covariance of second moments. Such a “covariance of covariance” arises when pedigrees occur in a mixture distribution (McLachlan et al. 2019). Such a distribution generates the genomic variation of homozygosity necessary for the existence of covariance of homozygosity between individuals at specific loci. The simplest mixture distribution is that of two populations with gene frequencies $p + a$ and $p - a$; in this case, the covariance of heterozygosity, after mixing in equal proportions, equals $a^4 + 6a^2p^2$.
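An expression of this form can be reproduced with a short calculation. The sketch below rests on an assumption about what is being computed: it takes the mean of the fourth power of the gene frequency over two equally weighted subpopulations and subtracts the fourth power of the pooled frequency $p$. Whether this is exactly the quantity the text calls the covariance of heterozygosity is not confirmed by the source; the function name and the values of $p$ and $a$ are hypothetical.

```python
from fractions import Fraction

def fourth_moment_excess(p, a):
    """Mean of P^4 over two equally weighted subpopulations with gene
    frequencies p + a and p - a, minus the pooled value p^4."""
    mixed = ((p + a)**4 + (p - a)**4) / 2
    return mixed - p**4

p, a = Fraction(3, 10), Fraction(1, 10)
excess = fourth_moment_excess(p, a)
closed_form = a**4 + 6 * a**2 * p**2
print(excess, closed_form)  # 11/2000 11/2000
```

The odd powers of $a$ cancel when the two subpopulations are mixed in equal proportions, which is why only $a^2$ and $a^4$ terms remain.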

Another feature of higher-order relatedness is that the four-marker coefficient of gene identity must be described with cumulants and not moments. As an example of the necessity of cumulants, for population gene frequency $p$, the fourth central moment for four markers, denoted ($X_1$, $X_2$, $X_3$, and $X_4$), is $\sigma_{12}^2\sigma_{34}^2 + \sigma_{13}^2\sigma_{24}^2 + \sigma_{14}^2\sigma_{23}^2 + \kappa_4$, where $\sigma_{ij}^2$ is the covariance of $X_i$ and $X_j$ and $\kappa_4$ is a fourth-order cumulant which does not appear in the moment. Some type of term (not involving the product of variances) is needed for $\kappa_4$, and it could be any rational number. In summary, incorporating cumulants into four-marker measures only requires some value $X$ in the expansion of the fourth central moment, $\sigma_{12}^2\sigma_{34}^2 + \sigma_{13}^2\sigma_{24}^2 + \sigma_{14}^2\sigma_{23}^2 + X$, and this $X$ is numerically estimated in the same way as the lower order cumulant terms.

Cumulants do have useful properties for models of quantitative traits. The most important is that cumulants are additive over sums of independent random variables: for independent $X$ and $Y$, $M(X + Y) = M(X) + M(Y)$. As a result, differential equations for models of selection on quantitative traits that involve cumulants are simpler than models involving moments (Burger 1991; Turelli and Barton 1994). This cumulant will also be key in deriving a marker-based estimator for Qst (Ritland in prep) and for a portrayal of higher order population structure that separately accounts for both the correlation of relationship and the squared linkage disequilibrium (Ritland in prep).
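The additivity property can be verified exactly for small discrete distributions. The sketch below assumes that $X$ and $Y$ are independent and computes the first four cumulants from central moments, using the standard relation $\kappa_4 = \mu_4 - 3\mu_2^2$. The helper names and the two example distributions are hypothetical.

```python
from fractions import Fraction
from itertools import product

def cumulants(dist):
    """First four cumulants of a discrete distribution {value: probability}."""
    mean = sum(x * w for x, w in dist.items())
    mu = {k: sum((x - mean)**k * w for x, w in dist.items()) for k in (2, 3, 4)}
    return mean, mu[2], mu[3], mu[4] - 3 * mu[2]**2

def sum_of_independent(d1, d2):
    """Distribution of X + Y when X and Y are independent."""
    out = {}
    for (x, wx), (y, wy) in product(d1.items(), d2.items()):
        out[x + y] = out.get(x + y, 0) + wx * wy
    return out

# Two hypothetical allele indicators with frequencies 3/10 and 1/5
x = {0: Fraction(7, 10), 1: Fraction(3, 10)}
y = {0: Fraction(4, 5), 1: Fraction(1, 5)}
lhs = cumulants(sum_of_independent(x, y))
rhs = tuple(cx + cy for cx, cy in zip(cumulants(x), cumulants(y)))
print(lhs == rhs)  # True
```

The fourth central moment does not share this property, because the moment of the sum picks up a cross term in the two variances; subtracting $3\mu_2^2$ is what removes it.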

We give probabilities of relationship for a homogeneous population of just one generation. Such populations are most commonly assayed in genomics; however, it should be noted that the level of nucleotide variation (for SNPs) is not high and loci with more than two alleles are uncommon; in fact, only about 5% of human SNPs are triallelic (Cao et al. 2015), although microsatellites and other types of repeat markers show greater variation. Reconstruction of pedigree relationship has traditionally involved cumbersome graph-tracing algorithms, and simpler recursive methods which require at least two generations of records (Karigl 1981; Thompson 1988; Whittemore and Halpern 1994). Also, current recursive methods (Zheng et al. 2018) assume a known pedigree (Kirkpatrick et al. 2019). This is somewhat like estimating Qst with current methods, where aspects of the pedigree must be known.

Normalization constants

Relatedness coefficients are obtained by calculating the pairwise covariance of relatives and dividing it by a normalization constant that converts the covariance into a correlation. This constant is the maximum possible value that the covariance can take. For cases where pairwise comparisons involve the frequency of identical genotypes, it is simple to calculate as a binomial variance. For the two-marker relationship coefficient as described in Equation (1), the maximum covariance between two genotypes, conditioned upon observing allele $i$, is $E[A_iA_i] - E[A_i]^2 = p_i(1-p_i)$ when $R = 1$. Likewise, the three-marker coefficient has a normalization constant of $p_i(1-p_i)(1-2p_i)$ for $A_iA_iA_i$, and the four-marker coefficients are normalized by $p_i(1-p_i)(1-6p_i(1-p_i))$ for $A_iA_iA_iA_i$. The normalization constants for combinations of alleles fall out of the analyses.
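These three constants coincide with the second, third, and fourth cumulants of a single 0/1 indicator with mean $p_i$, which is what the markers reduce to when all of them are identical-by-descent. The sketch below checks this from the central moments of the indicator; it is an illustration of that standard result and not a derivation taken from the article, and the function name and example frequency are hypothetical.

```python
from fractions import Fraction

def bernoulli_cumulants(p):
    """Second, third, and fourth cumulants of a 0/1 indicator with mean p,
    computed from its central moments."""
    q = 1 - p
    # Central moments: deviation (1 - p) with probability p, (-p) with probability q
    mu2 = p * q**2 + q * p**2
    mu3 = p * q**3 - q * p**3
    mu4 = p * q**4 + q * p**4
    return mu2, mu3, mu4 - 3 * mu2**2

p = Fraction(3, 10)
k2, k3, k4 = bernoulli_cumulants(p)
print(k2 == p * (1 - p))                            # True
print(k3 == p * (1 - p) * (1 - 2 * p))              # True
print(k4 == p * (1 - p) * (1 - 6 * p * (1 - p)))    # True
```

The third cumulant is zero at $p = 1/2$, so a coefficient normalized by it is undefined at that frequency.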

Ackerman et al. (2017) provided a different set of normalization constants for three and four marker measures than given here. In their Equation 6, they normalized the third central moment by the geometric mean gene frequency of the three central moments (rather than by p(1 − p)(1 − 2p) as done here). Their justification was a similarity of this “third moment correlation” to a bivariate correlation formula. Their normalization constant for the four-marker coefficient (Equation 8) involves a parameter α that mixes the unknown proportions of the two types of higher order identity (all identical vs 2 pairs identical), resulting in an inference that may be subject to biases.

Estimation

Table 1 gives example estimates of $R$, $G$, and $H$ at a single locus. As expected, the variances decrease with the number of alleles. While the variances for $R$ are reasonable, those for $H$ and for the identity disequilibrium $R_{AB}$ are quite large. Interestingly, and not remarkably, the variances for $G$ are all over the place, and $G$ is not even estimable in one case ($p = 0.5$).

Table 2.

Allele states and denominators of cumulants.

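Allele states | Expectation | Denominator
Two alleles | |
$i=j$ | $p_i^2 + p_i(1-p_i)F$ | $p_i(1-p_i)$
$i \neq j$ | $2p_ip_j(1-F)$ | $2p_ip_j$
Three alleles | |
$i=j=k$ | $p_i^3 + p_i^2(1-p_i)(F+2R) + p_iG$ | $p_i(1-p_i)(1-2p_i)$
$i=j \neq k$ | $p_i^2p_k(2R-2G)$ | $p_ip_k(1-2p_i)$
$i \neq j \neq k$ | $p_ip_jp_k(1-F-2R+2G)$ | $2p_ip_jp_k$
Four alleles | |
$i=j=k=l$ | $p_i^2q_i^2(F_{AB}+2R_{AB}) + q_i(1-6p_iq_i)H$ | $p_i(1-p_i)(1-6p_i(1-p_i))$
$i=j=k \neq l$ | $2p_ip_j(F_{AB} + (p_i+p_j-2p_ip_j)R_{AB} - (1-6p_iq_i)H)$ | $p_ip_k(1-2p_i-2p_k+6p_ip_k)$
$i=j \neq k=l$ | $2p_ip_j(2p_ip_jR_{AB} + (1-p_i-p_j+p_ip_j)F_{AB} - (1-2p_i-2p_j+6p_ip_j)H)$ | $p_ip_k(1-2p_i-2p_k+6p_ip_k)$
$i=j \neq k \neq l$ | $2p_ip_jp_k(p_i(2R_{AB}+F_{AB}) + 2(1-3p_i)H)$ | $2p_ip_jp_k(1-3p_i)$
$i \neq j \neq k \neq l$ | $p_ip_jp_kp_l(1-4R-2F+4G-8H)$ | $6p_ip_jp_kp_l$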

Likelihood requires numerical solutions, which introduce complications, as the numerical solution is normally iterated until convergence. I used the expectation–maximization method, which has slow convergence. When using likelihood, I found the number of marker loci needed for adequate convergence was about 20 loci for three-gene coefficients and roughly double that, at 30–50 loci, for four-gene coefficients. Interestingly, loci with fewer alleles were more likely to give convergent estimates, because the problem of nonconvergence arises when relatives do not share the same marker allele.

Calculating the probabilities of higher-order relationship poses an interesting set of obstacles. Equation solvers such as Derive or Mathematica help with the derivation and interpretation of complicated formulae. The coefficients can also be undefined at certain intermediate frequencies ($p = 1/2$ or $1/3$) and show high statistical variance about those frequencies (Ritland 1987).

Possible future approaches

The complications of correctly estimating population structure are discussed and treated by Weir and Goudet (2017). They developed two-marker moment estimators that can describe the “relativity”, which requires an explicit reference population. They develop their estimator in a multilevel approach (within individuals, between individuals within populations, and between populations), which promotes a unified treatment of relatedness and population structure. Clearly, further progress will depend upon adequate definitions and applications of models.

“Relatedness mapping” (Albrechtsen et al. 2009) uses relatedness to identify causative mutations, using the principle that affected individuals share higher relatedness about the mutation. A somewhat related activity is “IBD mapping”, in which segments of identity-by-descent (IBD) present in high-density genomic data are used to map causal variants (Browning and Browning 2012). However, the data by themselves only reveal the presence of the variant.

Other fields have adopted the use of cumulants, which may show new approaches that population genetics can undertake. The central 4th cumulant has been used to detect early stages of termite infestation, as it can separate termite alarm signals from background noise (de la Rosa and Moreno Muñoz 2008). Advances in electrophysiological and imaging techniques are used to study the synchrony of neuron cell firings in the brain (Staude et al. 2010b) and have highlighted the need for correlation measures that go beyond simple pairwise analyses, taking advantage of the “interaction property” of higher order cumulants as measures of correlation (Staude et al. 2010a). In information systems, a “covariance of covariance” approach for individual pixels has been developed for image description and classification (Serra et al. 2009).

Acknowledgments

Joe Felsenstein provided many helpful comments on my first and only sabbatical.

Data availability

All data necessary for confirming the conclusions of the article are present within the article's text, figures, and tables.

Funding

This work was supported by NSERC discovery grants to KR.

Literature cited

  1. Ackerman MS, Johri P, Spitze K, Xu S, Doak TG, Young K, Lynch M. 2017. Estimating seven coefficients of pairwise relatedness using population-genomic data. Genetics. 206(1):105–118. doi: 10.1534/genetics.116.190660.
  2. Albrechtsen A, Sand Korneliussen T, Moltke I, van Overseem Hansen T, Nielsen FC, Nielsen R. 2009. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet Epidemiol. 33(3):266–274. doi: 10.1002/gepi.20378.
  3. Browning SR, Browning BL. 2012. Identity by descent between distant relatives: detection and applications. Annu Rev Genet. 46(1):617–633. doi: 10.1146/annurev-genet-110711-155534.
  4. Burger R. 1991. Moments, cumulants, and polygenic dynamics. J Math Biol. 30(2):199–213. doi: 10.1007/BF00160336.
  5. Cao M, Shi J, Wang J, Hong J, Cui B, Ning G. 2015. Analysis of human triallelic SNPs by next-generation sequencing. Ann Hum Genet. 79(4):275–281. doi: 10.1111/ahg.12114.
  6. de la Rosa JJG, Moreno Muñoz A. 2008. Higher-order cumulants and spectral kurtosis for early detection of subterranean termites. Mech Syst Signal Process. 22(2):279–294. doi: 10.1016/j.ymssp.2007.08.009.
  7. Hill WG. 1975. Linkage disequilibrium among multiple neutral alleles produced by mutation in finite population. Theor Popul Biol. 8(2):117–126. doi: 10.1016/0040-5809(75)90028-3.
  8. Jacquard A. 1966. Logique du calcul des coefficients d'identité entre deux individus. Population (French edition). 21(4):751–776. doi: 10.2307/1527654.
  9. Jacquard A. 2012. The Genetic Structure of Populations. Springer Science & Business Media.
  10. Karigl G. 1981. A recursive algorithm for the calculation of identity coefficients. Ann Hum Genet. 45(3):299–305. doi: 10.1111/j.1469-1809.1981.tb00341.x.
  11. Kendall MG, Stuart A, Ord JK. 1977. The Advanced Theory of Statistics. London: Griffin.
  12. Kirkpatrick B, Ge S, Wang L. 2019. Efficient computation of the kinship coefficients. Bioinformatics. 35(6):1002–1008. doi: 10.1093/bioinformatics/bty725.
  13. Liu W, Weir BS. 2005. Genotypic probabilities for pairs of inbred individuals. Philos Trans R Soc Lond B Biol Sci. 360(1459):1379–1385. doi: 10.1098/rstb.2005.1677.
  14. Malécot G. 1948. Mathématiques de L'hérédité. Paris: Masson.
  15. McLachlan GJ, Lee SX, Rathnayake SI. 2019. Finite mixture models. Annu Rev Stat Appl. 6(1):355–378. doi: 10.1146/annurev-statistics-031017-100325.
  16. Michod RE, Hamilton WD. 1980. Coefficients of relatedness in sociobiology. Nature. 288(5792):694–697. doi: 10.1038/288694a0.
  17. Morton NE, Yee S, Harris DE, Lew R. 1971. Bioassay of kinship. Theor Popul Biol. 2(4):507–524. doi: 10.1016/0040-5809(71)90038-4.
  18. Ohta T. 1980. Linkage disequilibrium between amino acid sites in immunoglobulin genes and other multigene families. Genet Res. 36(2):181–197. doi: 10.1017/S0016672300019790.
  19. Pamilo P, Crozier RH. 1982. Measuring genetic relatedness in natural populations: methodology. Theor Popul Biol. 21(2):171–193. doi: 10.1016/0040-5809(82)90012-0.
  20. Queller DC, Goodnight KF. 1989. Estimating relatedness using genetic markers. Evolution. 43(2):258–275. doi: 10.2307/2409206.
  21. Ritland K. 1985. The genetic mating structure of subdivided populations I. Open-mating model. Theor Popul Biol. 27(1):51–74. doi: 10.1016/0040-5809(85)90015-2.
  22. Ritland K. 1987. Definition and estimation of higher-order gene fixation indices. Genetics. 117(4):783–793. doi: 10.1093/genetics/117.4.783.
  23. Ritland K. 1996. Estimators for pairwise relatedness and individual inbreeding coefficients. Genet Res (Camb). 67(2):175–185. doi: 10.1017/S0016672300033620.
  24. Ritland K. 2000. Marker-inferred relatedness as a tool for detecting heritability in nature. Mol Ecol. 9(9):1195–1204. doi: 10.1046/j.1365-294x.2000.00971.x.
  25. Ritland K, Leblanc M. 2004. Mating system of four inbreeding monkeyflower (Mimulus) species revealed using ‘progeny-pair’ analysis of highly informative microsatellite markers. Plant Species Biol. 19(3):149–157. doi: 10.1111/j.1442-1984.2004.00111.x.
  26. Samanta S, Li YJ, Weir BS. 2009. Drawing inferences about the coancestry coefficient. Theor Popul Biol. 75(4):312–319. doi: 10.1016/j.tpb.2009.03.005.
  27. Serra G, Grana C, Manfredi M, Cucchiara R. 2009. Proceedings of International Conference on Multimedia Retrieval. New York (NY): Association for Computing Machinery. p. 411–414.
  28. Staude B, Grün S, Rotter S. 2010a. Higher-order correlations and cumulants. In: Analysis of Parallel Spike Trains. Boston (MA): Springer. p. 253–280.
  29. Staude B, Grün S, Rotter S. 2010b. Higher-order correlations in non-stationary parallel spike trains: statistical modeling and inference. Front Comput Neurosci. 4:16. doi: 10.3389/fncom.2010.00016.
  30. Takahata N. 1982. Linkage disequilibrium, genetic distance and evolutionary distance under a general model of linked genes or a part of the genome. Genet Res. 39(1):63–77. doi: 10.1017/S0016672300020747.
  31. Thompson EA. 1988. Two-locus and three-locus gene identity by descent in pedigrees. IMA J Math Appl Med Biol. 5(4):261–279. doi: 10.1093/imammb/5.4.261.
  32. Turelli M, Barton NH. 1994. Genetic and statistical analyses of strong selection on polygenic traits: what, me normal? Genetics. 138(3):913–941. doi: 10.1093/genetics/138.3.913.
  33. Vitalis R, Couvet D. 2001a. Estimation of effective population size and migration rate from one- and two-locus identity measures. Genetics. 157(2):911–925. doi: 10.1093/genetics/157.2.911.
  34. Vitalis R, Couvet D. 2001b. Two-locus identity probabilities and identity disequilibrium in a partially selfing subdivided population. Genet Res (Camb). 77(1):67–81. doi: 10.1017/S0016672300004833.
  35. Wang J. 2014. Marker-based estimates of relatedness and inbreeding coefficients: an assessment of current methods. J Evol Biol. 27(3):518–530. doi: 10.1111/jeb.12315.
  36. Weir BS. 1994. The effects of inbreeding on forensic calculations. Annu Rev Genet. 28(1):597–621. doi: 10.1146/annurev.ge.28.120194.003121.
  37. Weir BS, Goudet J. 2017. A unified characterization of population structure and relatedness. Genetics. 206(4):2085–2103. doi: 10.1534/genetics.116.198424.
  38. Whittemore AS, Halpern J. 1994. Probability of gene identity by descent: computation and applications. Biometrics. 50(1):109–117. doi: 10.2307/2533201.
  39. Withers CS, Nadarajah S, Shih SH. 2015. Moments and cumulants of a mixture. Methodol Comput Appl Probab. 17(3):541–564. doi: 10.1007/s11009-013-9379-y.
  40. Zheng C, Boer MP, van Eeuwijk FA. 2018. Recursive algorithms for modeling genomic ancestral origins in a fixed pedigree. G3 (Bethesda). 8(10):3231–3245. doi: 10.1534/g3.118.200340.


Articles from G3: Genes|Genomes|Genetics are provided here courtesy of Oxford University Press
