Genetics. 2021 Jul 15;220(1):iyab090. doi: 10.1093/genetics/iyab090

Properties and unbiased estimation of F- and D-statistics in samples containing related and inbred individuals

Mehreen R Mughal, Michael DeGiorgio
Editor: S Browning
PMCID: PMC8733448  PMID: 34849832

Abstract

The Patterson F- and D-statistics are commonly used measures for quantifying population relationships and for testing hypotheses about demographic history. These statistics make use of allele frequency information across populations to infer different aspects of population history, such as population structure and introgression events. Inclusion of related or inbred individuals can bias such statistics, which often leads to the filtering of such individuals. Here, we derive statistical properties of the F- and D-statistics, including their biases due to the inclusion of related or inbred individuals, their variances, and their corresponding mean squared errors. Moreover, for those statistics that are biased, we develop unbiased estimators and evaluate the variances of these new quantities. Comparisons of the new unbiased statistics to the originals demonstrate that our newly derived statistics often have lower error across a wide population parameter space. Furthermore, we apply these unbiased estimators to several global human populations that include related individuals to highlight their application on an empirical dataset. Finally, we implement these unbiased estimators in the open-source software package funbiased for easy application by the scientific community.

Keywords: inbreeding, relatedness, introgression, demography, population genetics, gene flow

Introduction

The recently introduced F- and D-statistics (Huson et al. 2005; Kulathinal et al. 2009; Reich et al. 2009; Green et al. 2010; Patterson et al. 2012) have transformed the way geneticists measure population differentiation. These statistics have been instrumental in many major recent discoveries, including testing which Neanderthal populations are closest to the populations that admixed with modern humans (Hajdinjak et al. 2018), and detecting which population is likely the admixing source for European admixture in modern Ethiopian populations (Molinaro et al. 2019). Iterating through different combinations of populations using the F4- and D-statistics has allowed reconstruction of population histories in diverse groups such as Native Americans and South Asians (Reich et al. 2012; Moorjani et al. 2013). In addition, the D-statistics have been used extensively to provide evidence of introgression and hybridization among species of Drosophila fruit flies and Heliconius butterflies (Martin et al. 2015; Turissini and Matute 2017).

In many cases, however, the populations tested by these statistics are small, and proper random sampling may include data from related individuals. It is common to remove one or more of the relatives from a group of related individuals because including them might bias the value of the statistic being measured (Rosenberg 2006; DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a; Waples and Anderson 2017). For this reason, we explore whether the current estimators for these statistics are biased by the inclusion of related or inbred individuals and, if so, develop unbiased estimators under such scenarios.

These statistics are flexible and relatively simple to compute, as they measure genetic drift along branches of a population tree by contrasting allele frequencies between different combinations of populations. Using allele frequency data from two, three, or four populations, these statistics measure shared variation along specific branches of the tree relating the populations. We begin by providing intuitive descriptions and formal definitions of each of the F- and D-statistics that we evaluate. Specifically, consider that we have allele frequency data at J biallelic loci from each of four populations, denoted A, B, C, and D. We denote the parametric population frequencies of the reference allele at locus j as $a_j$, $b_j$, $c_j$, and $d_j$ in populations A, B, C, and D, respectively, which are unknown quantities that we will ultimately need to estimate prior to computing the F- and D-statistics from data. We begin by defining the true parametric population F- and D-statistics (Reich et al. 2009; Patterson et al. 2012) and then proceed in the Theory section to define sample estimators of these quantities, and show that our new estimators that account for related and inbred individuals reduce to the unbiased estimators in Appendix A of Patterson et al. (2012) when samples contain unrelated and noninbred individuals.

We first examine the F2 statistic, which measures the amount of genetic drift separating a pair of populations, and is thus a test for differentiation between them, and is akin to the widely used fixation index FST (Weir and Cockerham 1984). For a pair of populations A and B, we define the F2 statistic as

$$F_2(A,B)=\frac{1}{J}\sum_{j=1}^{J}F_2(A_j,B_j), \tag{1}$$

where for locus j

$$F_2(A_j,B_j)=(a_j-b_j)^2. \tag{2}$$

It is clear from this definition that F2 takes values between zero, when the populations have identical allele frequencies, and one, when the populations are fixed for different alleles (Figure 1).

Figure 1.


Trees showing the different relationships the F- and D-statistics are designed to test. F2(A,B) can test the differentiation of two populations A and B. F3(A;B,C) can test for introgression or relatedness between populations A and B or populations A and C. The placement of population A as an ingroup branch is arbitrary, because the F3(A;B,C) statistic measures the amount of genetic drift along branch A, and thus assumes the three-population tree is unrooted. However, because we are also showing how F3(A;B,C) can be employed to detect admixture in population A, we depict population A as an ingroup for visual convenience. F4(A,B;C,D) can test the hypothesis of whether two populations are closer to each other than they are to two other populations, in this case whether A and B are closer to each other than they are to C and D. The D(A,B,C,D) statistic can test whether there has been admixture between population C and either population A or population B.

The F3 statistic employs allele frequencies from three populations, and measures the amount of genetic drift along the branch leading to a target population, given allele frequency data from two reference populations. For a target population A and two reference populations B and C, we define the F3 statistic as

$$F_3(A;B,C)=\frac{1}{J}\sum_{j=1}^{J}F_3(A_j;B_j,C_j), \tag{3}$$

where for locus j

$$F_3(A_j;B_j,C_j)=(a_j-b_j)(a_j-c_j). \tag{4}$$

Because it measures genetic drift along a branch leading to a target population, its value is expected to be nonnegative. However, an interesting property of the F3 statistic is that it can be negative if the target population experienced admixture, and therefore a negative value directly indicates admixture in the history of the target population (Reich et al. 2009; Patterson et al. 2012). However, though F3 can detect admixture if its value is negative, admixture is not guaranteed to lead to negative values (Reich et al. 2009; Patterson et al. 2012), and it is therefore an inconclusive test for admixture if F3 is nonnegative. Moreover, because loci with higher minor allele frequencies may affect F3 more than loci with lower minor allele frequencies, the F3 statistic is sometimes normalized (Reich et al. 2009; Patterson et al. 2012) based on levels of diversity of the target population. Formally, this normalized F3 statistic has definition

$$F_3(A;B,C\mid A)=\frac{F_3(A;B,C)}{2G(A)}, \tag{5}$$

where we define for population P (here P = A)

$$G(P)=\frac{1}{J}\sum_{j=1}^{J}G(P_j) \tag{6}$$

such that for locus j

$$G(P_j)=p_j(1-p_j). \tag{7}$$

The F4 statistic, on the other hand, is a test of “treeness” among a set of four populations, examining whether the unrooted tree relating four populations is supported by the allele frequencies within the set of populations. For a pair of sister populations A and B and a pair of sister populations C and D, we define the F4 statistic as

$$F_4(A,B;C,D)=\frac{1}{J}\sum_{j=1}^{J}F_4(A_j,B_j;C_j,D_j), \tag{8}$$

where for locus j

$$F_4(A_j,B_j;C_j,D_j)=(a_j-b_j)(c_j-d_j). \tag{9}$$

If the unrooted relationship is true, then F4 takes the value of zero, and is nonzero otherwise. If it is known a priori that the unrooted relationship should be true, then a nonzero F4 statistic can be indicative of admixture, and the sign of the statistic will suggest which set of populations may be violating the assumed unrooted tree topology (Figure 1). As with the F3 statistic, a normalized version (Reich et al. 2009; Patterson et al. 2012) of the F4 statistic is sometimes used, with normalization based on the diversity of one of the four populations. Formally, this normalized F4 statistic has definition

$$F_4(A,B;C,D\mid P)=\frac{F_4(A,B;C,D)}{G(P)} \tag{10}$$

where we normalize by diversity in population $P\in\{A,B,C,D\}$.

Finally, the D-statistic is a special version of the F4 statistic that is a test of treeness for a particular asymmetric rooted tree relating four populations, with the tree topology containing a pair of sister populations, together with a close and a distant outgroup population (Figure 1). For sister populations A and B, close outgroup population C, and distant outgroup population D, we define the D statistic as

$$D(A,B,C,D)=\frac{F_4(A,B;C,D)}{H(A,B,C,D)}, \tag{11}$$

where

$$H(A,B,C,D)=\frac{1}{J}\sum_{j=1}^{J}H(A_j,B_j,C_j,D_j) \tag{12}$$

is a normalizing factor to constrain the D statistic to take values between negative one and one, such that for locus j

$$H(A_j,B_j,C_j,D_j)=(a_j+b_j-2a_jb_j)(c_j+d_j-2c_jd_j). \tag{13}$$

If the rooted relationship is true, then D takes the value of zero, and is nonzero otherwise. A nonzero D value can be used to detect admixture between the close outgroup population and one of the two sister populations based on its sign (Figure 1).
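The parametric definitions in Eqs. (1)-(13) translate directly into a few lines of code. The following is a minimal sketch, assuming each population is given as a plain list of reference allele frequencies with one entry per locus; the function names are chosen here for illustration and are not taken from the funbiased package.

```python
from typing import Sequence


def f2(a: Sequence[float], b: Sequence[float]) -> float:
    """Eqs. (1)-(2): mean over loci of (a_j - b_j)^2."""
    return sum((aj - bj) ** 2 for aj, bj in zip(a, b)) / len(a)


def f3(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Eqs. (3)-(4): target population a, reference populations b and c."""
    return sum((aj - bj) * (aj - cj) for aj, bj, cj in zip(a, b, c)) / len(a)


def f4(a, b, c, d) -> float:
    """Eqs. (8)-(9): mean over loci of (a_j - b_j)(c_j - d_j)."""
    return sum((aj - bj) * (cj - dj)
               for aj, bj, cj, dj in zip(a, b, c, d)) / len(a)


def g(p: Sequence[float]) -> float:
    """Eqs. (6)-(7): mean over loci of p_j (1 - p_j)."""
    return sum(pj * (1.0 - pj) for pj in p) / len(p)


def h(a, b, c, d) -> float:
    """Eqs. (12)-(13): normalizing factor of the D statistic."""
    return sum((aj + bj - 2 * aj * bj) * (cj + dj - 2 * cj * dj)
               for aj, bj, cj, dj in zip(a, b, c, d)) / len(a)


def f3_normalized(a, b, c) -> float:
    """Eq. (5): F3 divided by twice the diversity of the target population."""
    return f3(a, b, c) / (2.0 * g(a))


def f4_normalized(a, b, c, d, p) -> float:
    """Eq. (10): F4 divided by the diversity of a chosen population p."""
    return f4(a, b, c, d) / g(p)


def d_statistic(a, b, c, d) -> float:
    """Eq. (11): F4 divided by H."""
    return f4(a, b, c, d) / h(a, b, c, d)
```

For instance, a target population with frequency 0.5 that sits between references at 0 and 1 gives `f3([0.5], [0.0], [1.0])` equal to -0.25, the kind of negative value the text associates with admixture.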

Theory

The F- and D-statistic equations presented in the Introduction employ population allele frequencies, and are thus parameters of the set of populations. To estimate them, we first need to build an estimator of allele frequencies based on samples. We denote estimates of the reference allele frequencies at locus $j$, $j=1,2,\ldots,J$, in populations A, B, C, and D by $\hat{a}_j$, $\hat{b}_j$, $\hat{c}_j$, and $\hat{d}_j$, respectively.

As used previously (e.g., McPeek et al. 2004; DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a), a linear unbiased estimator of population reference allele frequency p at a biallelic locus can be defined as

$$\hat{p}=\sum_{k=1}^{N(P)}\phi_k(P)X_k, \tag{14}$$

where $N(P)$ is the number of individuals sampled at the locus, $X_k$ is the frequency of the reference allele in individual k at the locus, and $\phi_k(P)$ is the weight of individual k in population P at the locus. McPeek et al. (2004) discussed the impact of various weighting schemes on allele frequency estimation, and Harris and DeGiorgio (2017a) examined the effects of weighting scheme on estimation of expected heterozygosity.
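As a small illustration of Eq. (14), the sketch below computes the weighted allele frequency estimate for one locus. It assumes the sample-proportion weighting used later in the corollaries, in which each individual is weighted by its share of the sampled alleles; the helper names are hypothetical, and other weighting schemes would simply supply a different weight vector that sums to one.

```python
from typing import Sequence, List


def sample_proportion_weights(ploidies: Sequence[int]) -> List[float]:
    """Weight each individual by its ploidy divided by the total allele count."""
    total = sum(ploidies)
    return [m / total for m in ploidies]


def allele_frequency(x: Sequence[float], weights: Sequence[float]) -> float:
    """Eq. (14): weighted sum of within-individual reference allele frequencies."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(w * xk for w, xk in zip(weights, x))


# Two diploids and one haploid carrying 1, 2, and 0 reference alleles.
ploidies = [2, 2, 1]
ref_counts = [1, 2, 0]
x = [count / m for count, m in zip(ref_counts, ploidies)]
p_hat = allele_frequency(x, sample_proportion_weights(ploidies))
```

With these weights the estimate is simply the pooled proportion of reference alleles, here 3 out of 5.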

Plugging the sample estimate of allele frequencies in place of parametric population allele frequencies, estimators of the F- and D-statistics can be computed as

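$$\hat{F}_2(A,B)=\frac{1}{J}\sum_{j=1}^{J}\hat{F}_2(A_j,B_j) \tag{15}$$

$$\hat{F}_3(A;B,C)=\frac{1}{J}\sum_{j=1}^{J}\hat{F}_3(A_j;B_j,C_j) \tag{16}$$

$$\hat{F}_4(A,B;C,D)=\frac{1}{J}\sum_{j=1}^{J}\hat{F}_4(A_j,B_j;C_j,D_j) \tag{17}$$

$$\hat{F}_3(A;B,C\mid A)=\frac{\hat{F}_3(A;B,C)}{2\hat{G}(A)} \tag{18}$$

$$\hat{F}_4(A,B;C,D\mid P)=\frac{\hat{F}_4(A,B;C,D)}{\hat{G}(P)} \tag{19}$$

$$\hat{D}(A,B,C,D)=\frac{\hat{F}_4(A,B;C,D)}{\hat{H}(A,B,C,D)}, \tag{20}$$

where

$$\hat{F}_2(A_j,B_j)=(\hat{a}_j-\hat{b}_j)^2 \tag{21}$$

$$\hat{F}_3(A_j;B_j,C_j)=(\hat{a}_j-\hat{b}_j)(\hat{a}_j-\hat{c}_j) \tag{22}$$

$$\hat{F}_4(A_j,B_j;C_j,D_j)=(\hat{a}_j-\hat{b}_j)(\hat{c}_j-\hat{d}_j), \tag{23}$$

and where

$$\hat{G}(P)=\frac{1}{J}\sum_{j=1}^{J}\hat{G}(P_j) \tag{24}$$

$$\hat{H}(A,B,C,D)=\frac{1}{J}\sum_{j=1}^{J}\hat{H}(A_j,B_j,C_j,D_j) \tag{25}$$

with

$$\hat{G}(P_j)=\hat{p}_j(1-\hat{p}_j) \tag{26}$$

$$\hat{H}(A_j,B_j,C_j,D_j)=(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j). \tag{27}$$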

In the following, we discuss properties of these estimators, and where appropriate, develop unbiased estimators for the statistics that are biased due to the inclusion of related or inbred individuals in the sample.

To begin, we define the kinship coefficient $\Phi_{xy}$ between individuals x and y as the probability that a pair of sampled alleles, one from x and one from y, are identical by descent if $x\neq y$, and as the probability that a pair of alleles sampled with replacement from individual x are identical by descent if $x=y$ (Lange 2002). A pair of unrelated individuals x and y have kinship coefficient $\Phi_{xy}=0$ (Lange 2002). Moreover, an individual x with ploidy $m_x$ has kinship coefficient $\Phi_{xx}=1/m_x+(1-1/m_x)f_x=(1/m_x)[1+(m_x-1)f_x]$, where $f_x$ is the inbreeding coefficient of individual x, and is defined as the probability that a pair of alleles sampled without replacement in individual x are identical by descent (DeGiorgio et al. 2010). A noninbred individual x has inbreeding coefficient $f_x=0$, and so if x is noninbred, then their kinship coefficient is $\Phi_{xx}=1/m_x$. If an individual is haploid for regions of their genome, then by definition at those loci $\Phi_{xx}=1$. Directly accounting for the ploidy of individuals is especially pertinent for X-linked loci, as sampled male individuals will be haploid and sampled females will be diploid, thereby leading noninbred male individuals to have self-kinship coefficients mirroring completely inbred individuals ($\Phi_{xx}=1$), whereas noninbred females at the same loci will have self-kinship coefficients consistent with noninbred diploids ($\Phi_{xx}=1/2$). As in DeGiorgio et al. (2010) and Harris and DeGiorgio (2017a), we define the weighted mean kinship coefficients across sets of individuals sampled in population $P\in\{A,B,C,D\}$ at locus j as

$$\Phi_2(P_j)=\sum_{w=1}^{N(P_j)}\sum_{x=1}^{N(P_j)}\phi_w(P_j)\phi_x(P_j)\Phi_{wx} \tag{28}$$

$$\Phi_3(P_j)=\sum_{w=1}^{N(P_j)}\sum_{x=1}^{N(P_j)}\sum_{y=1}^{N(P_j)}\phi_w(P_j)\phi_x(P_j)\phi_y(P_j)\Phi_{wxy} \tag{29}$$

$$\Phi_4(P_j)=\sum_{w=1}^{N(P_j)}\sum_{x=1}^{N(P_j)}\sum_{y=1}^{N(P_j)}\sum_{z=1}^{N(P_j)}\phi_w(P_j)\phi_x(P_j)\phi_y(P_j)\phi_z(P_j)\Phi_{wxyz} \tag{30}$$

$$\Phi_{2,2}(P_j)=\sum_{w=1}^{N(P_j)}\sum_{x=1}^{N(P_j)}\sum_{y=1}^{N(P_j)}\sum_{z=1}^{N(P_j)}\phi_w(P_j)\phi_x(P_j)\phi_y(P_j)\phi_z(P_j)\Phi_{wx,yz}, \tag{31}$$

which are the weighted mean kinship coefficients for the $N(P_j)$ individuals sampled at locus j in population P for pairs, triples, quadruples, and pairs of pairs of individuals, respectively. Here, $\Phi_{wxy}$, $\Phi_{wxyz}$, and $\Phi_{wx,yz}$ are kinship coefficients respectively defining the probabilities that a trio of alleles from individuals w, x, and y, a quadruple of alleles from individuals w, x, y, and z, and a pair of alleles from w and x and a pair of alleles from y and z are identical by descent (Harris 1964; Cockerham 1971). In particular, $\Phi_{wx}$, $\Phi_{wxy}$, $\Phi_{wxyz}$, and $\Phi_{wx,yz}$ are identical to the Cockerham (1971) coefficients denoted by θ, γ, δ, and Δ, respectively.
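The pairwise quantity in Eq. (28) is the one that drives every bias correction that follows, so a small sketch may help. It assumes a full kinship matrix is available as a nested list and uses the self-kinship formula from the text for the diagonal; the function names and the four-person example (two of whom are full siblings with pairwise kinship 1/4) are illustrative only.

```python
from typing import Sequence


def self_kinship(ploidy: int, inbreeding: float = 0.0) -> float:
    """Phi_xx = (1/m_x) [1 + (m_x - 1) f_x] for an individual of ploidy m_x."""
    return (1.0 + (ploidy - 1) * inbreeding) / ploidy


def mean_pairwise_kinship(weights: Sequence[float],
                          kinship: Sequence[Sequence[float]]) -> float:
    """Eq. (28): double sum of phi_w * phi_x * Phi_wx over all ordered pairs."""
    n = len(weights)
    return sum(weights[w] * weights[x] * kinship[w][x]
               for w in range(n) for x in range(n))


# Four noninbred diploids with equal weights.
weights = [0.25] * 4
unrelated = [[self_kinship(2) if w == x else 0.0 for x in range(4)]
             for w in range(4)]

# Same sample, but individuals 0 and 1 are full siblings.
with_sibs = [row[:] for row in unrelated]
with_sibs[0][1] = with_sibs[1][0] = 0.25

phi2_unrelated = mean_pairwise_kinship(weights, unrelated)
phi2_with_sibs = mean_pairwise_kinship(weights, with_sibs)
```

In the unrelated case the result is the reciprocal of the number of sampled alleles (1/8 here), and adding a related pair raises it, which is the sense in which relatedness behaves like a reduction in effective sample size.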

These Cockerham (1971) pairwise and extended kinship coefficients for measuring identity-by-descent probabilities among individuals are computed from sets of alleles sampled from two, three, or four individuals within the same population, whereas the F2, F3, and F4 statistics respectively are computed from sets of alleles sampled from two, three, or four distinct populations. Harris (1964) derived the genotypic covariances between individuals within a population, which are related to these kinship coefficients. In an interesting connection, F2(A,B) measures the covariance (variance) in the frequency difference between alleles sampled from populations A and B, F3(A;B,C) the covariance in the frequency difference between alleles sampled from populations A and B and the difference between alleles sampled from populations A and C, and F4(A,B;C,D) the covariance in the frequency difference between alleles sampled from populations A and B and the difference between alleles sampled from populations C and D (Peter 2016). For this reason, we may expect that these covariances (F-statistics) will depend on the identity-by-descent probabilities defined by the Cockerham (1971) kinship coefficients, which we show is the case based on our derivations of the theoretical properties of the F-statistics.

From our definitions of kinship, we know that unrelated individuals have kinship coefficients of zero, but noninbred individuals still have positive values of their self-kinship coefficient, thereby causing the mean kinship coefficients to necessarily be positive quantities. It is for this reason that some F-statistic estimators will be biased even without related or inbred individuals, and this bias would be due to finite sample size. The estimators presented in Appendix A of Patterson et al. (2012) correct this bias due to finite sample sizes, and our goal is to further correct for the biases induced by related and inbred individuals. For accurate estimates of the drift quantities, it is therefore important to obtain unbiased estimators.

A number of quantities (particularly variances and covariances involving the F- and D-statistics) will be mathematically complex, as they will involve linear combinations of higher order mean kinship coefficients. For this reason, we follow prior studies (DeGiorgio et al. 2010; Harris and DeGiorgio 2017a) and make the simplifying assumption that no individual in a sample from population P is related to more than one other individual in the sample, such that the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ are negligible relative to $\Phi_2(P_j)$. Moreover, we assume that individuals sampled in different populations are unrelated to each other, and that the different populations in general are independent, so that alleles sampled in different populations cannot be identical by descent. Furthermore, in the cases referred to in this article, sampling is defined as statistical sampling, where the expectation is an average over repeated sampling. Under these assumptions, we approximate a few key results from prior studies (DeGiorgio et al. 2010; Harris and DeGiorgio 2017a) that will ultimately make derivations easier. Given that $\hat{p}_j$ is an estimate of the frequency of a reference allele at locus j in population P, we have the following expectations (approximate notation when not exact)

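$$E[\hat{p}_j]=p_j \tag{32}$$

$$E[\hat{p}_j^2]=p_j^2+\Phi_2(P_j)p_j(1-p_j) \tag{33}$$

$$E[\hat{p}_j^3]\approx p_j^3+3\Phi_2(P_j)p_j^2(1-p_j) \tag{34}$$

$$E[\hat{p}_j^4]\approx p_j^4+6\Phi_2(P_j)p_j^3(1-p_j). \tag{35}$$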
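These moment expressions can be checked numerically in the simplest setting. The sketch below assumes unrelated, noninbred individuals and sample-proportion weights, so that the reference allele count is binomial and the mean pairwise kinship equals one over the number of sampled alleles; under that assumption the first two moments should match exactly and the third and fourth only approximately. The function names are hypothetical.

```python
from math import comb


def binomial_moment(order: int, n_alleles: int, p: float) -> float:
    """Exact E[p_hat**order] when the allele count is Binomial(n_alleles, p)."""
    return sum(comb(n_alleles, k) * p ** k * (1.0 - p) ** (n_alleles - k)
               * (k / n_alleles) ** order
               for k in range(n_alleles + 1))


def approx_moment(order: int, phi2: float, p: float) -> float:
    """Right-hand sides of Eqs. (32)-(35) for order = 1, 2, 3, 4."""
    coefficient = {1: 0, 2: 1, 3: 3, 4: 6}[order]
    return p ** order + coefficient * phi2 * p ** (order - 1) * (1.0 - p)


n_alleles, p = 20, 0.3        # e.g. ten diploid individuals
phi2 = 1.0 / n_alleles        # unrelated, noninbred, sample-proportion weights
gaps = {order: abs(binomial_moment(order, n_alleles, p)
                   - approx_moment(order, phi2, p))
        for order in (1, 2, 3, 4)}
```

The entries of `gaps` for orders 1 and 2 are zero up to rounding, while those for orders 3 and 4 are small but nonzero, consistent with the approximate notation used for Eqs. (34) and (35).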

From prior studies (Nei and Roychoudhury 1974; Weir 1989; DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a), we know that $2\hat{p}(1-\hat{p})$ is a downwardly biased estimator of expected heterozygosity at a locus, with the bias due to finite sample size (Nei and Roychoudhury 1974) and exacerbated by the inclusion of inbred (Weir 1989) and related (DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a) individuals in the sample. Based on this definition, $2G(P)=2p(1-p)$ is expected heterozygosity, and its estimator $2\hat{G}(P)=2\hat{p}(1-\hat{p})$ is therefore biased. We begin by developing an unbiased estimator for $G(P)$, as it is a key normalization quantity in the F3 and F4 statistics.

Lemma 1.

Consider J polymorphic loci in a population P with parametric reference allele frequencies $p_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j, some of which may be related or inbred. The estimator $\hat{G}(P)$ has downward bias

$$\mathrm{Bias}[\hat{G}(P)]=-\frac{1}{J}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j) \tag{36}$$

and an unbiased estimator of $G(P)$ is

$$\tilde{G}(P)=\frac{1}{J}\sum_{j=1}^{J}\tilde{G}(P_j), \tag{37}$$

where

$$\tilde{G}(P_j)=\frac{\hat{G}(P_j)}{1-\Phi_2(P_j)} \tag{38}$$

is an unbiased estimator of $G(P_j)$.

Though this result also follows from work on page 153 of Weir (1996), we provide the proof of Lemma 1 in the Appendix for completeness. Intuitively though, because $\hat{G}(P)$ involves the product of frequencies for two alleles drawn from population P, there is a chance that the two alleles are identical by descent because the same allele is sampled twice, and $\hat{G}(P)$ is therefore a biased estimator with and without related or inbred individuals. As a corollary, we next provide the bias of $\hat{G}(P)$ due to finite sample size, and use it to construct the unbiased estimator $\breve{G}(P)$ of $G(P)$ in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). Note that this estimator is identical to the one provided in Appendix A of Patterson et al. (2012).

Corollary 2.

Consider J polymorphic loci in a population P with parametric reference allele frequencies $p_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ unrelated and noninbred individuals at locus j where individual $k\in\{1,2,\ldots,N(P_j)\}$ has ploidy $m_k$. Assuming allele frequencies are estimated using the sample proportion, the estimator $\hat{G}(P)$ has downward bias

$$\mathrm{Bias}[\hat{G}(P)]=-\frac{1}{J}\sum_{j=1}^{J}\frac{1}{\sum_{k=1}^{N(P_j)}m_k}G(P_j) \tag{39}$$

and an unbiased estimator of $G(P)$ is

$$\breve{G}(P)=\frac{1}{J}\sum_{j=1}^{J}\breve{G}(P_j), \tag{40}$$

where

$$\breve{G}(P_j)=\frac{\sum_{k=1}^{N(P_j)}m_k}{\sum_{k=1}^{N(P_j)}m_k-1}\hat{G}(P_j) \tag{41}$$

is an unbiased estimator of $G(P_j)$. Here, $\breve{G}(P_j)$ is equivalent to the unbiased estimator termed $\hat{h}_A$ in Appendix A of Patterson et al. (2012).
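To make the relationship between Lemma 1 and Corollary 2 concrete, the sketch below implements both corrections for the diversity term and confirms that they coincide when the mean pairwise kinship at each locus is one over the number of sampled alleles, which is the value it takes for unrelated, noninbred individuals under sample-proportion weights. The function names are chosen for illustration and are not part of any published interface.

```python
from typing import Sequence


def g_hat_locus(p_hat: float) -> float:
    """Plug-in diversity term at one locus, Eq. (26)."""
    return p_hat * (1.0 - p_hat)


def g_kinship_corrected(p_hats: Sequence[float],
                        phi2s: Sequence[float]) -> float:
    """Lemma 1, Eqs. (37)-(38): divide each locus term by 1 - Phi_2."""
    terms = [g_hat_locus(p) / (1.0 - phi2) for p, phi2 in zip(p_hats, phi2s)]
    return sum(terms) / len(terms)


def g_finite_sample_corrected(p_hats: Sequence[float],
                              allele_counts: Sequence[int]) -> float:
    """Corollary 2, Eqs. (40)-(41): multiply each locus term by n / (n - 1)."""
    terms = [n / (n - 1) * g_hat_locus(p)
             for p, n in zip(p_hats, allele_counts)]
    return sum(terms) / len(terms)


p_hats = [0.25, 0.5]
allele_counts = [8, 10]   # total sampled alleles at each locus
unrelated = g_kinship_corrected(p_hats, [1.0 / n for n in allele_counts])
patterson = g_finite_sample_corrected(p_hats, allele_counts)
related = g_kinship_corrected(p_hats, [0.15625, 0.12])
```

Here `unrelated` and `patterson` agree, while `related` is larger because a higher mean kinship implies that the plug-in value underestimates diversity by more.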

We next examine the bias of the estimator $\hat{F}_2(A,B)$. As with $\hat{G}(P)$, because $\hat{F}_2(A,B)$ requires sampling two alleles from population A and two alleles from population B, we find it is biased due to the inclusion of related or inbred individuals. We present the formal result for F2 next (Proposition 3), and prove the result in the Appendix.

Proposition 3.

Consider J polymorphic loci in populations A and B with respective parametric reference allele frequencies $a_j,b_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B\}$, some of which may be related or inbred. The estimate $\hat{F}_2(A,B)$ has upward bias

$$\mathrm{Bias}[\hat{F}_2(A,B)]=\frac{1}{J}\sum_{j=1}^{J}\left[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)\right] \tag{42}$$

and an unbiased estimator of $F_2(A,B)$ is

$$\tilde{F}_2(A,B)=\frac{1}{J}\sum_{j=1}^{J}\left[\hat{F}_2(A_j,B_j)-\Phi_2(A_j)\tilde{G}(A_j)-\Phi_2(B_j)\tilde{G}(B_j)\right]. \tag{43}$$

As one can see, the estimator $\hat{F}_2(A,B)$ is upwardly biased due to relatedness and inbreeding, and sampling within both populations A and B contributes proportionally to this bias. The new unbiased estimator $\tilde{F}_2(A,B)$ corrects this bias by adjusting the computation to account for the kinship coefficients and diversity within each population, with the adjustment of diversity using the unbiased estimator $\tilde{G}(P)$ presented in Lemma 1. As a corollary, we next provide the bias of $\hat{F}_2(A,B)$ due to finite sample size, and use it to construct the unbiased estimator $\breve{F}_2(A,B)$ of $F_2(A,B)$ in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). Note that this estimator is identical to the one derived in Appendix A of Patterson et al. (2012), and we provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.

Corollary 4.

Consider J polymorphic loci in populations A and B with respective parametric reference allele frequencies $a_j,b_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ unrelated and noninbred individuals at locus j in population $P\in\{A,B\}$ where individual $k\in\{1,2,\ldots,N(P_j)\}$ has ploidy $m_k$. Assuming allele frequencies are estimated using the sample proportion, the estimate $\hat{F}_2(A,B)$ has upward bias

$$\mathrm{Bias}[\hat{F}_2(A,B)]=\frac{1}{J}\sum_{j=1}^{J}\left[\frac{1}{\sum_{k=1}^{N(A_j)}m_k}G(A_j)+\frac{1}{\sum_{k=1}^{N(B_j)}m_k}G(B_j)\right] \tag{44}$$

and an unbiased estimator of $F_2(A,B)$ is

$$\breve{F}_2(A,B)=\frac{1}{J}\sum_{j=1}^{J}\left[\hat{F}_2(A_j,B_j)-\frac{1}{\sum_{k=1}^{N(A_j)}m_k-1}\hat{G}(A_j)-\frac{1}{\sum_{k=1}^{N(B_j)}m_k-1}\hat{G}(B_j)\right]. \tag{45}$$

Here, $\breve{F}_2(A_j,B_j)$ is equivalent to the unbiased estimator termed $\hat{F}_2(A,B)$ in Appendix A of Patterson et al. (2012).

Corollary 5.

Consider J polymorphic loci in populations A and B with respective parametric reference allele frequencies $a_j,b_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B\}$, some of which may be related or inbred. The estimate $\breve{F}_2(A,B)$ described in Corollary 4 [also in Appendix A of Patterson et al. (2012)] has upward bias

$$\mathrm{Bias}[\breve{F}_2(A,B)]=\frac{1}{J}\sum_{j=1}^{J}\left[\frac{\Phi_2(A_j)\sum_{k=1}^{N(A_j)}m_k-1}{\sum_{k=1}^{N(A_j)}m_k-1}G(A_j)+\frac{\Phi_2(B_j)\sum_{k=1}^{N(B_j)}m_k-1}{\sum_{k=1}^{N(B_j)}m_k-1}G(B_j)\right]. \tag{46}$$
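The kinship-corrected F2 estimator of Proposition 3 can be sketched and sanity-checked as follows. The check covers only the unrelated, noninbred case, where allele counts are assumed binomial and the mean pairwise kinship is one over the number of sampled alleles; in that setting the exact expectation of the corrected per-locus statistic should equal the squared frequency difference. Function names are illustrative, not those of the funbiased package.

```python
from math import comb
from typing import Sequence


def f2_corrected_locus(a_hat: float, b_hat: float,
                       phi2_a: float, phi2_b: float) -> float:
    """Per-locus summand of Eq. (43), using the Lemma 1 diversity estimator."""
    g_a = a_hat * (1.0 - a_hat) / (1.0 - phi2_a)
    g_b = b_hat * (1.0 - b_hat) / (1.0 - phi2_b)
    return (a_hat - b_hat) ** 2 - phi2_a * g_a - phi2_b * g_b


def f2_corrected(a_hats: Sequence[float], b_hats: Sequence[float],
                 phi2_as: Sequence[float], phi2_bs: Sequence[float]) -> float:
    """Eq. (43): average of the corrected per-locus terms."""
    terms = [f2_corrected_locus(a, b, fa, fb)
             for a, b, fa, fb in zip(a_hats, b_hats, phi2_as, phi2_bs)]
    return sum(terms) / len(terms)


def pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k reference alleles among n sampled alleles."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def expected_f2(a: float, b: float, n_a: int, n_b: int,
                corrected: bool) -> float:
    """Exact expectation over binomial allele counts (unrelated samples only)."""
    phi2_a, phi2_b = (1.0 / n_a, 1.0 / n_b) if corrected else (0.0, 0.0)
    return sum(pmf(ka, n_a, a) * pmf(kb, n_b, b)
               * f2_corrected_locus(ka / n_a, kb / n_b, phi2_a, phi2_b)
               for ka in range(n_a + 1) for kb in range(n_b + 1))


truth = (0.3 - 0.6) ** 2
naive_mean = expected_f2(0.3, 0.6, 6, 4, corrected=False)
corrected_mean = expected_f2(0.3, 0.6, 6, 4, corrected=True)
```

With three diploids in A and two in B, `naive_mean` exceeds the true value of 0.09 while `corrected_mean` recovers it, mirroring the upward bias and its removal described above.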

Similarly to $\hat{F}_2(A,B)$, the original estimator $\hat{F}_3(A;B,C)$ is also upwardly biased because it requires the sampling of two alleles from the target population A. We show the formal results for F3 next (Proposition 6), and prove the result in the Appendix.

Proposition 6.

Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j,b_j,c_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B,C\}$, some of which may be related or inbred. The estimate $\hat{F}_3(A;B,C)$ has upward bias

$$\mathrm{Bias}[\hat{F}_3(A;B,C)]=\frac{1}{J}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j) \tag{47}$$

and an unbiased estimator of $F_3(A;B,C)$ is

$$\tilde{F}_3(A;B,C)=\frac{1}{J}\sum_{j=1}^{J}\left[\hat{F}_3(A_j;B_j,C_j)-\Phi_2(A_j)\tilde{G}(A_j)\right]. \tag{48}$$

The bias of the original estimator is proportional to the relatedness and diversity within the target population A. The new unbiased estimator $\tilde{F}_3(A;B,C)$ corrects the bias by adjusting the computation to account for the kinship and diversity within the target population, with the adjustment of diversity using the unbiased estimator $\tilde{G}(A)$. Moreover, it is important to note that the reference populations B and C do not contribute to bias, as only a single allele is sampled from each of these populations. As a corollary, we next provide the bias of $\hat{F}_3(A;B,C)$ due to finite sample size, and use it to construct the unbiased estimator $\breve{F}_3(A;B,C)$ of $F_3(A;B,C)$ in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). Note that this estimator is identical to the one derived in Appendix A of Patterson et al. (2012), and we provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.

Corollary 7.

Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j,b_j,c_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ unrelated and noninbred individuals at locus j in population $P\in\{A,B,C\}$ where individual $k\in\{1,2,\ldots,N(P_j)\}$ has ploidy $m_k$. Assuming allele frequencies are estimated using the sample proportion, the estimate $\hat{F}_3(A;B,C)$ has upward bias

$$\mathrm{Bias}[\hat{F}_3(A;B,C)]=\frac{1}{J}\sum_{j=1}^{J}\frac{1}{\sum_{k=1}^{N(A_j)}m_k}G(A_j) \tag{49}$$

and an unbiased estimator of $F_3(A;B,C)$ is

$$\breve{F}_3(A;B,C)=\frac{1}{J}\sum_{j=1}^{J}\left[\hat{F}_3(A_j;B_j,C_j)-\frac{1}{\sum_{k=1}^{N(A_j)}m_k-1}\hat{G}(A_j)\right]. \tag{50}$$

Here, $\breve{F}_3(A_j;B_j,C_j)$ is equivalent to the unbiased estimator termed $\hat{F}_3(A;B,C)$ in Appendix A of Patterson et al. (2012).

Corollary 8.

Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j,b_j,c_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B,C\}$, some of which may be related or inbred. The estimate $\breve{F}_3(A;B,C)$ described in Corollary 7 [also in Appendix A of Patterson et al. (2012)] has upward bias

$$\mathrm{Bias}[\breve{F}_3(A;B,C)]=\frac{1}{J}\sum_{j=1}^{J}\frac{\Phi_2(A_j)\sum_{k=1}^{N(A_j)}m_k-1}{\sum_{k=1}^{N(A_j)}m_k-1}G(A_j). \tag{51}$$
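A short sketch of the F3 results may be useful here. The first function applies the kinship correction of Proposition 6, which involves only the target population; the second evaluates the per-locus term of the residual bias in Eq. (51) that remains when the finite-sample correction is applied to a sample that actually contains relatives. Names and example values are hypothetical.

```python
from typing import Sequence


def f3_corrected(a_hats: Sequence[float], b_hats: Sequence[float],
                 c_hats: Sequence[float], phi2_as: Sequence[float]) -> float:
    """Eq. (48): plug-in F3 minus Phi_2(A_j) times the Lemma 1 diversity term."""
    total = 0.0
    for a, b, c, phi2 in zip(a_hats, b_hats, c_hats, phi2_as):
        g_corrected = a * (1.0 - a) / (1.0 - phi2)
        total += (a - b) * (a - c) - phi2 * g_corrected
    return total / len(a_hats)


def residual_bias_locus(a: float, phi2_a: float, n_alleles: int) -> float:
    """Per-locus summand of Eq. (51): (Phi_2 * n - 1) / (n - 1) * a (1 - a)."""
    return (phi2_a * n_alleles - 1.0) / (n_alleles - 1.0) * a * (1.0 - a)


# One locus, target frequency 0.5, references at 0 and 1, Phi_2(A) = 1/8.
plug_in = (0.5 - 0.0) * (0.5 - 1.0)
corrected = f3_corrected([0.5], [0.0], [1.0], [0.125])

# Four diploids (8 alleles): unrelated versus one full-sibling pair.
bias_unrelated = residual_bias_locus(0.5, 1.0 / 8, 8)
bias_with_sibs = residual_bias_locus(0.5, 0.15625, 8)
```

The correction pushes the statistic further below zero, and the residual bias vanishes for unrelated samples but becomes positive once the mean kinship exceeds one over the allele count.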

Given that $\hat{F}_3(A;B,C\mid A)$ uses the biased estimators $\hat{F}_3(A;B,C)$ and $\hat{G}(A)$ in its definition, we can expect that it would be biased as its component estimators are biased, and these components have different biases that are also in different directions. However, $\hat{F}_3(A;B,C\mid A)$ is a ratio estimator, and we therefore cannot directly take its expectation to evaluate bias. Instead, we will make some simplifying assumptions and compute the approximate bias of $\hat{F}_3(A;B,C\mid A)$. We show the formal results next (Proposition 9), and prove the result in the Appendix.

Proposition 9.

Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j,b_j,c_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B,C\}$, some of which may be related or inbred. The ratio estimator $\hat{F}_3(A;B,C\mid A)$ is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of $\hat{F}_3(A;B,C)$ and $2\hat{G}(A)$ that it uses in its definition, with its upward approximate bias

$$\mathrm{Bias}[\hat{F}_3(A;B,C\mid A)]\approx\frac{(1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)}{G(A)-(1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)}\left[F_3(A;B,C\mid A)+\frac{1}{2}\right]. \tag{52}$$

Moreover, an approximately unbiased estimator of $F_3(A;B,C\mid A)$ is

$$\tilde{F}_3(A;B,C\mid A)=\frac{\tilde{F}_3(A;B,C)}{2\tilde{G}(A)}. \tag{53}$$
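The form of the approximate bias can be traced in a few lines under the ratio-of-means assumption stated in the proposition, using the bias of the plug-in F3 estimator from Proposition 6 and the bias of the plug-in diversity estimator from Lemma 1. This is only a sketch, not the proof given in the Appendix, and the shorthand $\beta_A$ is introduced here for illustration; it does not appear in the original text.

```latex
% Shorthand (assumption of this sketch): the common bias term.
\beta_A = \frac{1}{J}\sum_{j=1}^{J}\Phi_2(A_j)\,G(A_j),
\qquad
E[\hat{F}_3(A;B,C)] = F_3(A;B,C) + \beta_A,
\qquad
E[\hat{G}(A)] = G(A) - \beta_A .

% Ratio of means minus the true normalized statistic.
\begin{aligned}
\mathrm{Bias}[\hat{F}_3(A;B,C\mid A)]
  &\approx \frac{F_3(A;B,C)+\beta_A}{2\,[G(A)-\beta_A]} - \frac{F_3(A;B,C)}{2\,G(A)} \\
  &= \frac{\beta_A\,[G(A)+F_3(A;B,C)]}{2\,G(A)\,[G(A)-\beta_A]} \\
  &= \frac{\beta_A}{G(A)-\beta_A}\left[F_3(A;B,C\mid A)+\frac{1}{2}\right].
\end{aligned}
```

The numerator and denominator biases act in the same direction on the ratio: the numerator is inflated by $\beta_A$ while the denominator is deflated by it.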

There is an upward approximate bias of the original normalized F3 estimator, and the bias is, as with the standard estimator of F3, due partially to the diversity and sampling in the target population. The new approximately unbiased estimator $\tilde{F}_3(A;B,C\mid A)$ is based simply on the ratio of unbiased estimators of its components $\tilde{F}_3(A;B,C)$ and $\tilde{G}(A)$. As a corollary, we next provide the approximate bias of $\hat{F}_3(A;B,C\mid A)$ due to finite sample size, and use it to construct the approximately unbiased estimator $\breve{F}_3(A;B,C\mid A)$ of $F_3(A;B,C\mid A)$ in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). We provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.

Corollary 10.

Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j,b_j,c_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ unrelated and noninbred individuals at locus j in population $P\in\{A,B,C\}$ where individual $k\in\{1,2,\ldots,N(P_j)\}$ has ploidy $m_k$. The ratio estimator $\hat{F}_3(A;B,C\mid A)$ is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of $\hat{F}_3(A;B,C)$ and $2\hat{G}(A)$ that it uses in its definition, with its upward approximate bias

$$\mathrm{Bias}[\hat{F}_3(A;B,C\mid A)]\approx\frac{(1/J)\sum_{j=1}^{J}\left[1/\sum_{k=1}^{N(A_j)}m_k\right]G(A_j)}{G(A)-(1/J)\sum_{j=1}^{J}\left[1/\sum_{k=1}^{N(A_j)}m_k\right]G(A_j)}\left[F_3(A;B,C\mid A)+\frac{1}{2}\right]. \tag{54}$$

Moreover, an approximately unbiased estimator of $F_3(A;B,C\mid A)$ is

$$\breve{F}_3(A;B,C\mid A)=\frac{\breve{F}_3(A;B,C)}{2\breve{G}(A)}. \tag{55}$$

Corollary 11.

Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j,b_j,c_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B,C\}$, some of which may be related or inbred. The ratio estimate $\breve{F}_3(A;B,C\mid A)$ described in Corollary 10 is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of $\breve{F}_3(A;B,C)$ and $2\breve{G}(A)$ that it uses in its definition, with its upward approximate bias

$$\mathrm{Bias}[\breve{F}_3(A;B,C\mid A)]\approx\frac{1}{2}\left[\frac{1}{Y(A)}-\frac{1}{G(A)}\right]F_3(A;B,C)+\frac{X(A)}{2Y(A)}, \tag{56}$$

where

$$X(A)=\frac{1}{J}\sum_{j=1}^{J}\frac{\Phi_2(A_j)\sum_{k=1}^{N(A_j)}m_k-1}{\sum_{k=1}^{N(A_j)}m_k-1}G(A_j) \tag{57}$$

$$Y(A)=\frac{1}{J}\sum_{j=1}^{J}\left[1-\Phi_2(A_j)\right]\frac{\sum_{k=1}^{N(A_j)}m_k}{\sum_{k=1}^{N(A_j)}m_k-1}G(A_j). \tag{58}$$

Finally, we move to the four-population statistics F4 and D. Note that the F4 statistic by definition only samples a single allele per population, and therefore the original estimator $\hat{F}_4(A,B;C,D)$ is intuitively unbiased. We show the formal results next (Proposition 12), and prove the result in the Appendix.

Proposition 12.

Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies $a_j,b_j,c_j,d_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B,C,D\}$, some of which may be related or inbred. The estimator $\hat{F}_4(A,B;C,D)$ is unbiased.

Though the original F4 estimator is unbiased, the normalized F4 and D statistics are more complicated as they are ratio estimators, meaning their biases cannot be directly assessed. However, intuitively, because both estimators have $\hat{F}_4(A,B;C,D)$ as their numerator, bias would seemingly derive from their denominator component. Next, we show formally in Proposition 13 that the normalized $\hat{F}_4(A,B;C,D\mid P)$ estimator is approximately upwardly biased, and prove the result in the Appendix.

Proposition 13.

Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies $a_j,b_j,c_j,d_j\in(0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus j in population $P\in\{A,B,C,D\}$, some of which may be related or inbred. The ratio estimator $\hat{F}_4(A,B;C,D\mid P)$ is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of $\hat{F}_4(A,B;C,D)$ and $\hat{G}(P)$ for any population $P\in\{A,B,C,D\}$ that it uses in its definition, with its upward approximate bias

$$\mathrm{Bias}[\hat{F}_4(A,B;C,D\mid P)]\approx\frac{(1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)}{G(P)-(1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)}F_4(A,B;C,D\mid P). \tag{59}$$

Moreover, an approximately unbiased estimator of $F_4(A,B;C,D\mid P)$ is

$$\tilde{F}_4(A,B;C,D\mid P)=\frac{\hat{F}_4(A,B;C,D)}{\tilde{G}(P)}. \tag{60}$$

The reason that the $\hat{F}_4(A,B;C,D\mid P)$ estimator has upward approximate bias is that the estimator $\hat{G}(P)$ used in its denominator is downwardly biased. By using the unbiased estimator $\tilde{G}(P)$ in its place within the denominator, we find that the new estimator $\tilde{F}_4(A,B;C,D\mid P)$ is approximately unbiased. As a corollary, we next provide the approximate bias of $\hat{F}_4(A,B;C,D\mid P)$ due to finite sample size, and use it to construct the approximately unbiased estimator $\breve{F}_4(A,B;C,D\mid P)$ of $F_4(A,B;C,D\mid P)$ in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). We provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.

Corollary 14.

Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies aj, bj, cj, dj ∈ (0,1), and suppose we take a random sample of N(Pj) unrelated and noninbred individuals at locus j in population P ∈ {A,B,C,D}, where individual k ∈ {1,2,…,N(Pj)} has ploidy mk. The ratio estimator F^4(A,B;C,D|P) is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of F^4(A,B;C,D) and G^(P) that it uses in its definition, with its upward approximate bias

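\[
\mathrm{Bias}\left[\hat{F}_4(A,B;C,D \mid P)\right] \approx \frac{\frac{1}{J}\sum_{j=1}^{J}\left[1\Big/\sum_{k=1}^{N(P_j)} m_k\right]G(P_j)}{G(P) - \frac{1}{J}\sum_{j=1}^{J}\left[1\Big/\sum_{k=1}^{N(P_j)} m_k\right]G(P_j)}\,F_4(A,B;C,D \mid P). \tag{61}
\]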

Moreover, an approximately unbiased estimator of F4(A,B;C,D|P) is

\[
F_4(A,B;C,D \mid P) = \frac{\hat{F}_4(A,B;C,D)}{G(P)}. \tag{62}
\]

Corollary 15.

Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies aj, bj, cj, dj ∈ (0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P ∈ {A,B,C,D}, some of which may be related or inbred. The ratio estimator F4(A,B;C,D|P) described in Corollary 14 is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of F^4(A,B;C,D) and G(P) that it uses in its definition, with its upward approximate bias

\[
\mathrm{Bias}\left[F_4(A,B;C,D \mid P)\right] \approx \left[\frac{1}{Y(P)} - \frac{1}{G(P)}\right]F_4(A,B;C,D), \tag{63}
\]

where

\[
Y(P) = \frac{1}{J}\sum_{j=1}^{J}\left[1-\Phi_2(P_j)\right]\frac{\sum_{k=1}^{N(P_j)} m_k}{\sum_{k=1}^{N(P_j)} m_k - 1}\,G(P_j). \tag{64}
\]
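As a rough numerical check of Equations 63 and 64, the sketch below evaluates Y(P) and the resulting approximate bias from per-locus inputs. The function names are hypothetical, G(P) is assumed to be the average of G(Pj) across loci, and the total ploidy per locus (the sum of mk over sampled individuals) is supplied directly.

```python
def y_of_p(phi2, g, total_ploidy):
    """Y(P) from Equation 64, given per-locus inputs of equal length.

    phi2         -- per-locus values of Phi2(P_j)
    g            -- per-locus values of G(P_j)
    total_ploidy -- per-locus sum of ploidies m_k over sampled individuals
    """
    n_loci = len(g)
    return sum(
        (1.0 - p) * m / (m - 1.0) * gj
        for p, gj, m in zip(phi2, g, total_ploidy)
    ) / n_loci


def approx_bias_eq63(phi2, g, total_ploidy, f4):
    """Approximate bias from Equation 63 for an unnormalized parameter f4."""
    big_g = sum(g) / len(g)  # assumption: G(P) is the mean of G(P_j) over loci
    return (1.0 / y_of_p(phi2, g, total_ploidy) - 1.0 / big_g) * f4


# 20 unrelated, noninbred diploids: Phi2(P_j) = 1 / (sum of ploidies) = 1/40
unrelated = approx_bias_eq63([1 / 40, 1 / 40], [0.4, 0.6], [40, 40], 0.05)
# Hypothetical sample with relatives: Phi2(P_j) is larger than 1/40
related = approx_bias_eq63([0.03, 0.03], [0.4, 0.6], [40, 40], 0.05)
```

When Φ2(Pj) equals the reciprocal of the total ploidy, as in Corollary 14, the factor multiplying G(Pj) in Equation 64 equals one, so Y(P) = G(P) and the approximate bias vanishes; any excess kinship makes Y(P) smaller than G(P) and the bias positive for positive F4.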

The bias property of the D statistic differs from that of the normalized F4 statistic, as the estimator H^(A,B,C,D) of its denominator is unbiased (Lemma 17 of the Appendix). Intuitively, this result is due to the denominator not containing a product of frequencies for two alleles sampled from the same population. Because both its numerator and denominator are unbiased, we next show that the ratio estimator D^(A,B,C,D) is approximately unbiased in Proposition 16, and prove the result in the Appendix.

Proposition 16.

Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies aj, bj, cj, dj ∈ (0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P ∈ {A,B,C,D}, some of which may be related or inbred. The ratio estimator D^(A,B,C,D) is approximately unbiased, assuming that its mean is well-approximated by the ratio of means of F^4(A,B;C,D) and H^(A,B,C,D) that it uses in its definition.

In addition to bias, variance is an important property of an estimator, as both bias and variance are components of mean squared error (MSE). Because the formulas and derivations for the variances of the F- and D-statistics are not particularly insightful, we relegate these results to the Appendix. Specifically, we provide the variances for F^2(A,B),F˜2(A,B), F2(A,B), F^3(A;B,C),F˜3(A;B,C), F3(A;B,C), F^3(A;B,C|A),F˜3(A;B,C|A), F3(A;B,C|A), F^4(A,B;C,D),F^4(A,B;C,D|P),F˜4(A,B;C,D|P), F4(A,B;C,D|P), and D^(A,B,C,D) in Propositions 20, 21, 22, 23, 25, 26, 28, 31, 32, 27, 34, 37, 38, and 41 of the Appendix, respectively.

Results

In the Theory and Appendix, we introduced new unbiased estimators of F2 and F3 statistics, and derived biases and variances (and hence MSEs) for the original and new estimators of F- and D-statistics. In this section, we theoretically evaluate the relative performances of the old biased estimators and the new unbiased estimators under an array of settings, including different mixtures of relatedness, inbreeding, sample sizes, and population parameters.

For all of our results we require the kinship coefficients for each pair of individuals. To acquire these values, we need to know whether each individual is related to any other sampled individual from the population and whether they are inbred; both properties are quantified through kinship coefficients (Φxy). To summarize the relatedness of an entire sample from a population P at a locus, we use

\[
\Phi_2(P) = \sum_{w=1}^{N(P)}\sum_{x=1}^{N(P)} \phi_w(P)\,\phi_x(P)\,\Phi_{wx}, \tag{65}
\]

where ϕw(P) and ϕx(P) are weights of individuals w and x in population P, and in this study we use weights corresponding to the proportion of alleles contributed by individual x to the sample from population P, which is computed as

\[
\phi_x(P) = \frac{m_x}{\sum_{k=1}^{N(P)} m_k}. \tag{66}
\]

Here mx is the ploidy of individual x. Moreover, using this weighting scheme, we also estimate the frequency of the reference allele at a biallelic locus as the sample proportion (McPeek et al. 2004; DeGiorgio et al. 2010; Harris and DeGiorgio, 2017a)

\[
\hat{p} = \sum_{k=1}^{N(P)} \phi_k(P)\,X_k = \sum_{k=1}^{N(P)} \frac{m_k}{\sum_{j=1}^{N(P)} m_j}\,X_k. \tag{67}
\]
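Equations 65 to 67 translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the kinship matrix is assumed to include self-kinship on its diagonal (1/2 for a noninbred diploid), and Xk is assumed to be the fraction of individual k's alleles that are the reference allele.

```python
def allele_weights(ploidies):
    """Weights phi_x(P): share of sampled alleles contributed by each individual."""
    total = sum(ploidies)
    return [m / total for m in ploidies]


def phi2(kinship, ploidies):
    """Phi2(P): weighted double sum of kinship coefficients, diagonal included."""
    w = allele_weights(ploidies)
    n = len(ploidies)
    return sum(w[a] * w[b] * kinship[a][b] for a in range(n) for b in range(n))


def sample_frequency(ref_fraction, ploidies):
    """Weighted sample proportion of the reference allele."""
    w = allele_weights(ploidies)
    return sum(wk * xk for wk, xk in zip(w, ref_fraction))


# Two outbred diploid full siblings: self-kinship 1/2, pairwise kinship 1/4
sib_kinship = [[0.5, 0.25], [0.25, 0.5]]
phi2_sibs = phi2(sib_kinship, [2, 2])

# One haploid carrying the reference allele and one heterozygous diploid
p_hat_mixed = sample_frequency([1.0, 0.5], [1, 2])
```

With these weights, a sample of unrelated, noninbred individuals gives Φ2(P) equal to the reciprocal of the total number of sampled alleles, and relatives raise it above that baseline.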

Effect of population F-statistic value on mean squared error

The relationship between the population parameter for a statistic and the estimate based on a sample from the population is important to evaluate. We compare the difference in the MSE between the biased F^ estimators, the F estimators, which are unbiased in samples not containing related or inbred individuals, and our unbiased F˜ estimators to the true value of each statistic in the cases for which both estimators exist. The F2, F3, and F4 statistics require allele frequency information from either two, three, or four populations, respectively.

For our F2 comparisons, we use the sample allele frequencies from the YRI (sub-Saharan African) and CEU (central European) populations from the 1000 Genomes Project (The 1000 Genomes Project Consortium 2015) as the true population allele frequencies to obtain the true F2(A,B) statistic by using the population definition from the Introduction, with populations A=CEU and B=YRI. We use populations from the 1000 Genomes Project, as the released dataset of genotype calls across the 2504 worldwide samples does not include related individuals. To evaluate the relative performances of F2 estimators over a range of true F2 values, we randomly sample 20 independent loci from both populations for 1000 independent replicates of J = 20 loci, yielding 1000 independent draws of the true F2 statistic, which ranged across the set of values F2 ∈ [0.02, 0.12]. Using these allele frequencies, along with the sample size and relatedness information, we also calculate the difference in MSE between the F^2 and F˜2 estimators by using Propositions 3, 20, and 21 and the difference in MSE between the F^ and F estimators by using Corollary 5 and Proposition 22. We calculate the MSE by summing the variance and squared bias. We note that the MSEs of the unbiased estimators are equal to their variances. We repeat this process for F3(A;B,C), F3(A;B,C|A), and F4(A,B;C,D|A) as these are the estimators that are biased in their F^ forms.
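The replicate-drawing step and the MSE decomposition described above might be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function names are hypothetical, F2 is assumed to be the mean squared difference in allele frequency across loci (the definition from the Introduction is not reproduced here), and loci are drawn without replacement with no check for physical independence.

```python
import random


def true_f2(a, b):
    """Assumed population F2(A,B): mean squared frequency difference over loci."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def draw_true_f2(freq_a, freq_b, n_loci=20, n_reps=1000, seed=0):
    """Draw n_reps sets of n_loci loci and return the true F2 of each set."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_reps):
        loci = rng.sample(range(len(freq_a)), n_loci)
        values.append(true_f2([freq_a[i] for i in loci], [freq_b[i] for i in loci]))
    return values


def mse(bias, variance):
    """Mean squared error as variance plus squared bias."""
    return variance + bias ** 2
```

For an unbiased estimator the bias argument is zero, so the returned MSE is simply the variance, matching the note in the text.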

We use Propositions 6, 23, 25, and 26 and Corollary 8 to determine the MSE for all three F3 estimators by including allele frequency information from the JPT (Japanese) population where A=JPT, B=CEU, and C=YRI, with true range for F3 ∈ [0.00, 0.08]. For the normalized F3(A;B,C|A) estimators, we compare MSE between the biased and unbiased versions by using bias and variances derived in Propositions 9, 28, 31, and 32 and Corollary 11 with true range for normalized F3 ∈ [0.00, 0.30]. Finally, we estimate the MSE for the normalized F4(A,B;C,D|A) estimators by including GIH (Gujarati Indian) allele frequency data and using the derivations in Propositions 13, 34, 37, and 38 and Corollary 15. In this case, we set A=YRI, B=CEU, C=JPT, and D=GIH for a true range of normalized F4 ∈ [−0.3, 0.2].

For each analysis, we estimate MSE for instances when samples of 60 diploid individuals from each population include 30 relative pairs, including 10 avuncular relationships, 10 inbred full siblings, and 10 outbred full siblings. We also assumed every individual was related to exactly one other individual. In these estimates, all populations contain the same composition of related individuals.

The differences in log10(MSE) between F^ and F˜ estimators for F2(A,B), F3(A;B,C), F3(A;B,C|A), and F4(A,B;C,D|A) show similar trends with respect to the true F-statistic values. Similar trends emerge when examining the difference between F^ and F estimators. Specifically, the difference in log10(MSE) decreases as the true F-statistic value approaches zero (Figure 2). In our evaluation of F4(A,B;C,D|A), we considered both positive and negative values for its true value, which shows that the difference in log10(MSE) of F^4 and F˜4 exhibits a quadratic-shaped trend as a function of true F4. Overall, we notice that the difference in MSE between biased and unbiased estimators depends on the true value of the F-statistic, with the smallest difference occurring when the true F-statistic is closest to zero.

Figure 2.

Difference in theoretically calculated log10(MSE) of F^ and F˜ (A–D) and F^ and F (E–H) estimators when including relatives or inbred individuals. The MSE is estimated for instances when samples of 60 individuals include individuals related to exactly one other in the sample, with 10 pairs of avuncular relationships, 10 pairs of inbred full siblings, and 10 pairs of outbred full siblings. Each point represents calculations from J = 20 randomly sampled loci from the 1000 Genomes Project dataset for CEU European, YRI African, JPT Japanese, and GIH Indian populations. For F2(A,B) we use A=CEU and B=YRI, while for F3(A;B,C) and F3(A;B,C|A) we use A=JPT, B=CEU, and C=YRI, and for F4(A,B;C,D|A) we assign A=YRI, B=CEU, C=JPT, and D=GIH.

Effect of sample size on mean squared error

To probe how sample size within each population affects the difference in estimator error rate, we theoretically computed the MSE for both F^ and F˜ estimators when different numbers of individuals are sampled, with the constraint that every sampled individual is related to exactly one other individual in the sample from that population. Specifically, we evaluate the impact on these estimators when sampling from one pair to 50 pairs of related individuals, with relationships of inbred full sibling pairs (Φxy=3/8), outbred full sibling pairs (Φxy=1/4), parent-offspring pairs (Φxy=1/4), and avuncular pairs (Φxy=1/8). We compute the MSE as in the Effect of population F-statistic value on mean squared error section above.

In almost all cases, the biased F^ estimators and the F estimators (which are unbiased only in samples without related or inbred individuals) displayed elevated MSE compared to their corresponding unbiased F˜ estimators (Figure 3 and Supplementary Figures S1–S3), with the F estimators always having values between the F^ and F˜ estimators, and usually being closer to F˜ than to F^. For all estimators, we see a clear decrease in the MSE as the number of sampled individuals increases, with the greatest error observed when two individuals are sampled. As expected, a greater sample size allows one to better estimate allele frequencies, and ultimately reduces the mean pairwise kinship coefficient within the sample, as the number of pairs in the sample grows quadratically but the number of relative pairs grows linearly. We also find that the difference in the MSE at larger sample sizes is not as pronounced for normalized F4 as it is for F2, F3, and normalized F3, as the difference in bias among the biased and unbiased estimators is much smaller for normalized F4 (Supplementary Figure S3).
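The dilution of kinship with increasing sample size can be seen in a small calculation. The sketch below is a simplified illustration with a hypothetical function name; it assumes equal-ploidy diploid samples (so all weights are equal), a self-kinship of (1 + f)/2 for an individual with inbreeding coefficient f, and that each individual is related to exactly one other individual.

```python
def phi2_pairs(n_pairs, pair_kinship, self_kinship=0.5):
    """Phi2(P) for 2 * n_pairs diploids, each related to exactly one other.

    The double sum contains one self term per individual and two ordered
    terms per relative pair; all other pairs contribute zero.
    """
    n = 2 * n_pairs
    weight = 1.0 / n  # equal weights because every individual has the same ploidy
    return weight * weight * (n * self_kinship + 2 * n_pairs * pair_kinship)


# Outbred full siblings (pairwise kinship 1/4), from 1 to 50 pairs
outbred = [phi2_pairs(n, 0.25) for n in range(1, 51)]
# Inbred full siblings (pairwise kinship 3/8, f = 1/4, so self-kinship 5/8)
inbred = [phi2_pairs(n, 0.375, self_kinship=0.625) for n in range(1, 51)]
```

Under these assumptions the expression simplifies to (self-kinship + pairwise kinship) divided by the number of individuals, so the excess over an unrelated sample shrinks in proportion to one over the sample size.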

Figure 3.

Mean squared error theoretically calculated for F^2(A,B), F2(A,B), and F˜2(A,B) across different sample sizes or related pairs of individuals, including avuncular relationships (A), parent-offspring relationships (B), outbred full siblings (C), and inbred full siblings (D). The number of sampled individuals ranges from 2 to 100 with the number of relative pairs equaling half the total sampled, all computed using J =20 loci. The true value of F2(A,B) is 0.071.

Effect of sample composition on mean squared error

Different types of relatives have different proportions of their alleles shared identical by descent, and thus have different pairwise kinship coefficients. Because we have demonstrated that bias and variance (and hence MSE) of estimators are influenced by within-population mean pairwise kinship coefficient across sampled individuals, the distribution of relative types within a sample will impact overall F-statistic estimation error. For this reason, it is important to examine how our F-statistics are affected by samples containing diverse mixtures of relative types. Specifically, to accurately assess the impact of relative composition, we hold sample sizes, number of relative pairs, and true population F-statistic values constant.

We computed the theoretical MSE when samples of 50 pairs of relatives (100 diploid individuals sampled) contain relative pairs of three different types as in Harris and DeGiorgio (2017a). In addition, each individual is related to exactly one other individual in the sample from the same population. For each statistic we vary the number of pairs related by each of three types of relationships between zero and 50, with 1326 combinations for each. We repeat this process for three configurations of relationships to probe estimator error as a function of the mixture of relative types. We also provide comparisons among inclusion of male–male full siblings (Φxy=1/2), female–female full siblings (Φxy=3/8), and male–female full siblings (Φxy=1/4) at mixed-ploidy loci such as on the X chromosome (DeGiorgio et al. 2010), with results showing elevated MSE for both estimators for higher male–male sibling proportions, when compared to male–female or female–female full siblings. To investigate the effects of inbred individuals, we also provide a comparison between inbred full siblings (Φxy=3/8) with inbreeding coefficient fx=fy=1/4, and outbred full siblings (Φxy=1/4) at autosomal diploid loci. We see that MSE is higher for inbred full siblings than for outbred full siblings in all cases examined (Figure 4 and Supplementary Figures S4–S6).
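The count of 1326 mixtures follows from enumerating every way to split 50 relative pairs among three relationship types. The sketch below, with hypothetical function names, lists these mixtures and computes the mean kinship across relative pairs for each one, using the autosomal kinship coefficients for parent-offspring, outbred full sibling, and avuncular pairs.

```python
def three_type_mixtures(n_pairs):
    """All ways to split n_pairs relative pairs among three relationship types."""
    return [
        (a, b, n_pairs - a - b)
        for a in range(n_pairs + 1)
        for b in range(n_pairs - a + 1)
    ]


def mean_pair_kinship(counts, kinships):
    """Average kinship coefficient across the relative pairs in one mixture."""
    return sum(c * k for c, k in zip(counts, kinships)) / sum(counts)


mixtures = three_type_mixtures(50)
# Kinship coefficients for parent-offspring, outbred full sibling, and avuncular pairs
po_fs_av = (0.25, 0.25, 0.125)
lowest = min(mixtures, key=lambda m: mean_pair_kinship(m, po_fs_av))
```

The same enumeration with 10 pairs gives 66 mixtures.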

Figure 4.

Theoretically calculated MSE of F^2(A,B) (A), F2(A,B) (B), and F˜2(A,B) (C) when including relatives or inbred individuals for J = 20 loci. The MSE is estimated for instances when samples of 100 individuals include individuals related to exactly one other in the sample. The first column shows MSE for samples with different combinations of parent-offspring (PO), full sibling (FS), and avuncular (AV) relationships; the second includes full siblings that are male–male (MM), male–female (MF), and female–female (FF). The last column includes AV relationships as well as inbred (FSi) and outbred (FSo) full siblings. The true value of F2(A,B) is 0.071.

We also note that the value of MSE for the biased F^ estimators is always greater than the value for their corresponding unbiased F˜ statistics, which is true in part due to the values of the true F-statistics for the loci we chose to use. In addition, the values for the F estimators are also higher than the corresponding F˜ estimators, but are lower than respective F^ estimators for all combinations of individuals. Though the MSE is higher for the biased estimators, the variation in MSE values is similar for all estimators. For example, the data point with the highest proportion of avuncular relatives has the lowest MSE when compared to parent-offspring relationships and outbred full siblings. In all tested settings (Figure 4 and Supplementary Figures S4–S6), we notice similar patterns of MSE variation when comparing F^ estimators with F and F˜ estimators. This pattern is again shared when comparing MSE variation among the estimators for F2, F3, normalized F3, and normalized F4. We can conclude that in all of these cases, the value of the mean kinship coefficient is most important in determining MSE when sample size and true F-statistic value are fixed.

Simulations to evaluate theoretical MSE approximations

To verify that our theoretical approximations for MSE are reasonable, we simulate samples containing related individuals and use them to compute the biased F^- and unbiased F˜-statistics as well as calculate their biases, variances, and MSEs. For each population (CEU, YRI, GIH, and JPT), we simulate 10 noninbred parent-offspring pairs with each individual related to exactly one other individual in the sample. Genotypes for each individual are simulated by first sampling two alleles with replacement according to their respective population allele frequencies from each of the populations (CEU, YRI, GIH, and JPT) to create a set of 20 unrelated individuals per population. Individuals x and y that form one of the 10 relative pairs have the genotype of individual y modified according to their relationship type. Specifically, for each relative type, there are probabilities Δ0,Δ1, and Δ2 that the two individuals will share zero, one, or two alleles identical by descent, respectively. The first allele of individual y is copied from the first allele from individual x with probability Δ1 and the entire genotype of individual x is copied over to individual y with probability Δ2. This process is repeated across 20 independent loci to generate a sample of 20 individuals with 10 relative pairs in each population with genotypes taken at J =20 independent loci. To generate 20 independent loci from the four 1000 Genomes Project populations, we used loci either on separate chromosomes, or at least one megabase away from each other.
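The genotype-copying scheme described above might look as follows in code. This is a sketch under stated assumptions rather than the authors' simulation code: the function names are hypothetical, alleles are coded 1 for the reference allele and 0 otherwise, the two copying events are treated as mutually exclusive outcomes with probabilities Δ1 and Δ2, and the identity-by-descent draw is made independently at each locus.

```python
import random

# (Delta0, Delta1, Delta2) for outbred diploid relative pairs
IBD_PROBS = {
    "parent_offspring": (0.0, 1.0, 0.0),
    "full_sibling": (0.25, 0.5, 0.25),
    "avuncular": (0.5, 0.5, 0.0),
}


def draw_genotype(p, rng):
    """Sample two alleles independently, each the reference allele with probability p."""
    return [int(rng.random() < p), int(rng.random() < p)]


def simulate_relative_pair(p, deltas, rng):
    """Simulate genotypes of relatives x and y at one locus."""
    _, d1, d2 = deltas
    x = draw_genotype(p, rng)
    y = draw_genotype(p, rng)  # starts as an unrelated individual
    u = rng.random()
    if u < d2:
        y = list(x)  # share both alleles: copy the entire genotype
    elif u < d2 + d1:
        y[0] = x[0]  # share one allele: copy the first allele only
    return x, y


def simulate_sample(freqs, deltas, n_pairs, seed=0):
    """Return sample[locus][pair] = (x, y) for every locus frequency in freqs."""
    rng = random.Random(seed)
    return [
        [simulate_relative_pair(p, deltas, rng) for _ in range(n_pairs)]
        for p in freqs
    ]


# Hypothetical frequencies at 20 loci, 10 parent-offspring pairs
sample = simulate_sample([0.3] * 20, IBD_PROBS["parent_offspring"], n_pairs=10)
```

Sample allele frequencies and the F-statistic estimators can then be computed from the returned genotypes for each replicate.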

For each of our new unbiased estimators we compute the bias, variance, and MSE along with the same values for the original estimators (Figure 5 and Supplementary Figures S7–S9). Comparing the bias measurements in these figures, we observe a clear reduction in bias when applying the F˜ and F estimators as opposed to the F^ estimators. Importantly, the bias measurements for the F estimators are always higher than the bias measures for F˜ estimates, as the F estimators do not account for relatives. However, the variances are highly similar for F^, F, and F˜ in all cases. As the value of variance is much larger than the magnitude of the bias (by an order of magnitude) and hence the squared bias, the resulting MSE is consequently similar as well. Because F4 is quantifying the relationship among four populations, more simulations may be required to converge to the pattern seen in the theoretical calculations. For this reason, we increased the number of simulations used to compute the bias, variance, and MSE to 10⁴ for each data point in Supplementary Figure S9, whereas 10³ simulation replicates were used for F2, and both versions of F3.

Figure 5.

Comparison of squared bias (A, D, and G), variance (B, E, and H), and MSE (C, F, and I), for F^2(A,B), F2(A,B), and F˜2(A,B) from simulated data including 60 parent offspring relative pairs. Each estimate was computed from J =20 randomly sampled loci using A=CEU and B=YRI.

To compare the accuracy of our theoretical approximations to simulation results across a spectrum of relatedness between individuals in a sample, we simulate combinations of parent-offspring, outbred full sibling, and avuncular relationships. In a manner similar to that described above (first paragraph of Simulations to evaluate theoretical MSE approximations), we simulate a total of 10 relative pairs made up of a combination of each of the three relative types, with the number of each relative type ranging from 0 to 10. We simulate each of these 66 distinct settings of relative type combinations with genotypes sampled at J = 20 independent loci, and completed 1000 independent replicates of each setting to obtain accurate measurements of bias, variance, and MSE for each simulation setting, with each simulation using true F-statistic values specified in Figure 5 and Supplementary Figures S7–S9. We compute the bias, variance, and MSE for simulations, and compare these values to theoretically calculated computations for each relative combination (Supplementary Figures S11–S14). We find that although noisier, the bias, variance, and MSE patterns in our simulation results match theoretical calculations, suggesting that our theoretical computations are accurate. For all cases, the simulated bias measurements for the F˜ estimators are close to zero, whereas the F^ estimators display bias measurements matching the theoretically calculated F^ bias values.

Utility and applications of unbiased estimators

In previous sections, we have shown through simulations that our theoretical results produced expected patterns and evaluated the performance of our unbiased estimators under varying combinations of relatives, true F-statistic values, and sample sizes. In this section, we show some potential applications of these estimators, using both simulated and empirical data. As discussed previously in the Introduction, the value of F3(A;B,C) can be used to identify whether population A is the result of admixture between populations related to B and C (Figure 1). A negative value of F3 indicates the presence of this process, whereas a nonnegative value is inconclusive and means that further tests may be required to verify a history of admixture. However, because F^3 is upwardly biased and because F˜3 corrects for this bias, F˜3 might allow us to detect admixture in cases where F^3 would be inconclusive, even without the presence of related or inbred individuals.

To explore this hypothesis, we first examine an admixture scenario in which F3(A;B,C) might provide marginally negative values. We simulate two populations (B and C) with effective population size of 10⁴ diploid individuals (Takahata 1993) that diverged 2000 generations prior to sampling using SLiM (Haller and Messer, 2019). This simple divergence model has parameters inspired by the history relating African and non-African human populations (Gravel et al. 2011). These populations then merge with admixture proportions 0.4 and 0.6 for B and C, respectively, to form population A 400 generations prior to sampling. Using these parameters, the expected value is F3(A;B,C)=0.0568. To generate genetic data from this model, we evolved sequences with a per-site per-generation mutation rate of μ=1.25×10⁻⁸ (Scally and Durbin 2012) and a uniform per-site per-generation recombination rate of r=10⁻⁸ (Payseur and Nachman 2000). We output 20 two-megabase chromosomal regions containing allele frequency information for all three populations. Using allele frequency information from the three populations (A, B, and C) we generate 50 individuals for each population, in which there are 25 parent-offspring pairs. We then compute F˜3(A;B,C), F3(A;B,C), and F^3(A;B,C) across J = 20 loci, either on separate chromosomes or at least one megabase away from each other to ensure independence.

Figure 6 illustrates that F˜3 values are lower than F^3, with F3 values falling in between F^3 and F˜3. F˜3 values are almost always negative while F^3 and F3 values are almost always positive. Because this statistic is used to test for admixture and a negative result indicates the presence of admixture, the use of biased estimators when related individuals are included in the sample leads to a different conclusion than when using the unbiased estimator. However, we notice that there is almost no correlation between the true F3 values and the estimates of F3 in Figure 6. This lack of correlation arises because the F3 values we simulated for this experiment are drawn from a particularly small range, so that any correlation between estimated and true F3 is obscured by estimation noise. To ensure that this is indeed the case, we conduct simulations across a larger range of true F3 values and estimate F˜3, F^3, and F3. We notice that the expected trend appears when the range of F3 values is expanded (Supplementary Figure S15).

Figure 6.

F^3(A;B,C), F3(A;B,C), and F˜3(A;B,C) calculated for simulations where populations B and C merge with admixture proportions of 0.4 and 0.6, respectively, 400 generations ago (0.02 coalescent units) to form population A (tree shown in panel A). Panel (B) shows results for a sample containing 25 parent-offspring pairs.

Finally, we test the performance of our statistics on empirical data. We use populations from the HGDP SNP dataset (Li et al. 2008) that include related individuals (Rosenberg 2006). Specifically, we use genotype information from Colombian, Lahu, Melanesian, Mandenka, San, and Druze populations, and we sample 20 independent loci that are at least one megabase apart from all populations for 1000 independent replicates of J = 20 loci, yielding 1000 independent draws. Each of these populations contains between two and 14 pairs of inferred related individuals, according to Rosenberg (2006). Using distinct pairs for F2, triples for F3, and quadruples for F4 of these populations and the relationships from Rosenberg (2006), we estimate F^2(A,B), F˜2(A,B), F2(A,B), F^3(A;B,C|A), F˜3(A;B,C|A), F3(A;B,C|A), F^4(A,B;C,D|A), F˜4(A,B;C,D|A), and F4(A,B;C,D|A), and compare the mean and standard deviation of the biased and unbiased estimators (Figure 7). In all cases shown, the biased estimator has higher mean than the unbiased estimator, although the standard deviations are similar for both. This indicates that correcting the bias generated by related individuals yields more accurate F-statistic estimates with minimal cost in precision of the estimates. In addition, the unbiased F estimators presented in Appendix A of Patterson et al. (2012) all have lower means than the biased F^ estimators, and are more similar to our unbiased F˜ estimators, while having standard deviation measures that are lower than those of both other estimators. This could indicate that Patterson’s unbiased estimators have slightly higher precision, while having slightly lower accuracy than the unbiased estimators introduced in this article when samples contain related individuals.

Figure 7.

The difference between the means and standard deviations of biased, unbiased Patterson’s, and our new unbiased estimators of F2, normalized and un-normalized F3, and normalized F4 when estimated with genotype information from four different combinations of Colombian, Lahu, Melanesian, Mandenka, San, and Druze populations. All of these populations include between 2 and 14 relative pairs. The black dots represent the values for the biased estimators, while the white dots show the value for the unbiased estimators. Each mean and standard deviation was calculated for a combination of two, three, or four populations, for F2, F3, and F4, respectively, and consists of 1000 estimates of the statistic, each calculated from J = 20 randomly sampled single nucleotide polymorphisms from the genome. Panel (A) has values for F2(A,B), panel (B) has values for F3(A;B,C), panel (C) shows results for normalized F3(A;B,C|A), and panel (D) has results for normalized F4(A,B;C,D|A).

Discussion

We have introduced the unbiased estimators F˜2(A,B),F˜3(A;B,C),F˜3(A;B,C|A), and F˜4(A,B;C,D|P) as well as shown that the estimators F^4(A,B;C,D) and D^(A,B,C,D) are unbiased with the inclusion of related and inbred individuals. In addition, we have demonstrated that the variance of F˜2(A,B) is similar to that of F^2(A,B), as are the variances of F˜3(A;B,C) and F^3(A;B,C). We have also provided variance calculations for all other F- and D-statistic estimators included in this study. Using these calculations, we have compared the performance of the biased and newly derived unbiased estimators, and shown that in most cases the unbiased estimators have lower MSE values than the biased estimators of the same statistic.

Interestingly, the two statistics that sample from each analyzed population only once per locus, F^4(A,B;C,D) and D^(A,B,C,D), are unbiased with the inclusion of related or inbred individuals, whereas F^2(A,B), which samples from each population A and B twice, and F^3(A;B,C), which samples from population A twice, are biased. This process of sampling more than once from a single population per locus is responsible for creating bias due to the inclusion of related or inbred individuals within the twice-sampled population.

The development of these unbiased statistics, and the proofs showing that other statistics are unbiased, are beneficial for anthropologists interested in populations such as hunter-gatherers, some of which are often small and widely dispersed yet retain high genetic diversity (Kim et al. 2014). Small population sizes may necessitate the sampling of close relatives, such as parents and offspring, or siblings. Along with small human populations, these statistics are often applied to nonhuman species. Some, such as elephants, rhinoceroses, and cheetahs, are close to extinction or have extremely small and inbred populations due to human activity. The F- and D-statistics may prove important in conservation efforts to test how (and whether) different populations of these animals are interacting. For these reasons, having estimators that are unbiased under such conditions is imperative in making accurate inferences about the relationships of such small populations with others. Although it may not be possible to identify relatives through the sampling process, especially in the case of wild animals, there are methods available to identify related individuals and estimate their likely degree of relatedness once the samples have been sequenced (Epstein et al. 2000). The inferences from these methods will allow users to identify pairwise kinship coefficients necessary to apply the unbiased statistics of this study.

A key consideration when evaluating the importance of unbiased estimators of F- and D-statistics is their potential use. Specifically, a number of applications of these statistics do not employ the raw estimates, but instead standardized estimates (Soraggi et al. 2018; Zheng and Janke 2018), where a particular F- or D-statistic has its genomewide mean subtracted, and is normalized by the standard error using a genomic block jackknife procedure (Reich et al. 2009). Indeed, subtracting out this genomewide mean may circumvent bias issues. However, this assumes that all genomic blocks have similar sample properties, yet blocks with reduced sample size (e.g., in regions with difficult-to-call genotypes) may still deviate from the genomewide expectation. In contrast, accounting for this bias due to relatedness would provide estimates closer to the genomewide mean. Because the variance for these biased and unbiased estimators is approximately the same (compare Propositions 20 vs 21 and 23 vs 25), the standard errors used for normalizing these statistics are expected to be comparable, and thus, the unbiased estimators of the F-statistics derived here represent a more robust alternative to the original biased estimators, regardless of whether the raw or standardized values of the statistics are used. Furthermore, the raw value of some statistics, such as using the F3 statistic to detect population admixture, is important, and without correcting the bias of such statistics (Figure 6), key historical events relating populations could be missed. In addition, applying these statistics to genomic regions that are likely to be evolving non-neutrally (e.g., protein-coding regions) may lead to skewed estimates due to selection. For this reason, it is recommended that these statistics be applied to intergenic regions.
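As a rough illustration of the standardization step mentioned above, the following sketch computes a delete-one jackknife standard error across genomic blocks and then standardizes an estimate. It is a simplified assumption-laden example: the function names are hypothetical, blocks are treated as equally weighted, and published pipelines may instead weight blocks by their size.

```python
import math


def jackknife_se(blocks, statistic):
    """Delete-one jackknife standard error of statistic over a list of blocks."""
    n = len(blocks)
    if n < 2:
        raise ValueError("need at least two blocks")
    # Recompute the statistic with each block left out in turn
    leave_one_out = [statistic(blocks[:i] + blocks[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((v - center) ** 2 for v in leave_one_out))


def mean(values):
    return sum(values) / len(values)


def standardize(estimate, genomewide_mean, standard_error):
    """Subtract the genomewide mean and scale by the standard error."""
    return (estimate - genomewide_mean) / standard_error


# Hypothetical per-block estimates of a statistic
block_estimates = [1.0, 2.0, 3.0, 4.0]
se = jackknife_se(block_estimates, mean)
z = standardize(3.0, mean(block_estimates), se)
```

Because the biased and unbiased estimators have similar variances, the standard error in the denominator is expected to be comparable between them, so differences in the standardized value come mainly from the numerator.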

The F- and D-statistics evaluated here are the most commonly used. However, since their development by Reich et al. (2009) and Patterson et al. (2012), other D-statistic type tests have been formulated to not only detect admixture, but also to identify the direction of gene flow—namely the partitioned D-statistics of Eaton and Ree (2013) and the DFOIL statistics of Pease and Hahn (2015). Specifically, the DFOIL statistics as originally formulated by Pease and Hahn (2015) sampled a single lineage (or allele) from each of a set of five populations A, B, C, D, and O, with a symmetric rooted topology ((AB)(CD)) relating populations A, B, C, and D, and with O an outgroup to these populations used to polarize the ancestral allelic state. Subsequently, Harris and DeGiorgio (2017b) derived allele frequency formulas for the DFOIL statistics, and showed that allele frequency information for the outgroup population O is not needed for computation. The DFOIL statistics are a set of four quantities (Harris and DeGiorgio 2017b)

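$$
D_{FO}(A,B;C,D) = \frac{\sum_{j=1}^{J} (1-2a_j)(d_j-c_j)}{\sum_{j=1}^{J} (c_j+d_j-2c_jd_j)} \tag{68}
$$

$$
D_{IL}(A,B;C,D) = \frac{\sum_{j=1}^{J} (1-2b_j)(d_j-c_j)}{\sum_{j=1}^{J} (c_j+d_j-2c_jd_j)} \tag{69}
$$

$$
D_{FI}(A,B;C,D) = \frac{\sum_{j=1}^{J} (1-2c_j)(b_j-a_j)}{\sum_{j=1}^{J} (a_j+b_j-2a_jb_j)} \tag{70}
$$

$$
D_{OL}(A,B;C,D) = \frac{\sum_{j=1}^{J} (1-2d_j)(b_j-a_j)}{\sum_{j=1}^{J} (a_j+b_j-2a_jb_j)}, \tag{71}
$$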

each of which does not have the frequencies for two alleles sampled from a single population multiplying each other. Hence, using sample allele frequencies in place of the population quantities would still yield approximately unbiased estimators of the DFOIL statistics, regardless of whether related or inbred individuals were included in the sample. Though we chose to focus on the more classic F- and D-statistics, variance quantities for these partitioned D and DFOIL statistics can be readily computed as we have done for other ratio estimators in this study.
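To make the structure of these four quantities concrete, here is a minimal Python sketch that evaluates them from per-locus allele frequencies. It assumes the numerators take the form $(1-2a_j)(d_j-c_j)$ and so on, with denominators $c_j+d_j-2c_jd_j$ and $a_j+b_j-2a_jb_j$; the function name `dfoil` is hypothetical and is not taken from any published software.

```python
def dfoil(a, b, c, d):
    """Return (D_FO, D_IL, D_FI, D_OL) from per-locus allele frequencies.

    a, b, c, d are equal-length sequences of frequencies for populations
    A, B, C, and D. No products of two frequencies from the same population
    appear, which is why relatedness does not bias these quantities.
    """
    den_cd = sum(cj + dj - 2 * cj * dj for cj, dj in zip(c, d))
    den_ab = sum(aj + bj - 2 * aj * bj for aj, bj in zip(a, b))
    d_fo = sum((1 - 2 * aj) * (dj - cj) for aj, cj, dj in zip(a, c, d)) / den_cd
    d_il = sum((1 - 2 * bj) * (dj - cj) for bj, cj, dj in zip(b, c, d)) / den_cd
    d_fi = sum((1 - 2 * cj) * (bj - aj) for aj, bj, cj in zip(a, b, c)) / den_ab
    d_ol = sum((1 - 2 * dj) * (bj - aj) for aj, bj, dj in zip(a, b, d)) / den_ab
    return d_fo, d_il, d_fi, d_ol
```

When populations C and D share identical frequencies at every locus, the first two statistics are zero, and likewise for the last two when A and B match.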

Though we have only shown results when all populations contain samples with the same relative pair composition, it is trivial to include different relative types in different populations within these statistics. In addition, a key assumption of our theoretical formulas is that pairwise relative contributions to the bias and variance are so much larger in magnitude than higher-order relative contributions, that the inclusion of kinship terms for trios, quadruples, or pairs of pairs of relatives would minimally affect results. For this reason, and to avoid highly unwieldy formulas for variance calculations, we made approximations to the variance formulas using this assumption. We briefly explore the accuracy of such approximations under simulations with 20 full sibling trios, and calculate the bias, variance, and MSE over a range of F2 values. We compare these simulated results to theoretically calculated bias, variance, and MSE for F˜2 across identical F2 values (Supplementary Figure S16), where the theoretical variance (and hence MSE) formulas are approximations that only consider pairwise kinship coefficients. We see that though the trends in variance (and hence MSE) values are similar between the simulated and theoretical quantities, there is a slight difference between the simulated and theoretical variance (and hence MSE), where the theoretical values are consistently lower than the simulated. Therefore, our theoretical approximate variance calculations have underestimated the true variance values, which can be expected as we drop a number of terms that would be in the exact variance calculation, while still being useful for understanding the overall trends of the variances (and hence MSEs) across estimators.

In addition, it is also possible to apply our new unbiased estimators when only some or none of the populations contain related or inbred individuals. Moreover, though we have demonstrated results for allele frequencies estimated as the sample proportion, we could have instead used the best linear unbiased estimator (BLUE) of McPeek et al. (2004), as all derivations in this article are based on a general form of a linear unbiased estimator. The BLUE allele frequency estimator would have superior properties to the sample proportion discussed here, as it has smallest variance (McPeek et al. 2004), and this reduction in variance translates to functions of the allele frequency as highlighted by improvements in both expected heterozygosity and FST by Harris and DeGiorgio (2017a). To apply the BLUE estimator, we would simply alter the weight ϕx(P) of an individual x in population P at a particular locus with the equation

$$
\phi_x(P) = \frac{\sum_{k=1}^{N(P)} (K^{-1})_{kx}}{\mathbf{1}^{T} K^{-1} \mathbf{1}}, \tag{72}
$$

where $K \in \mathbb{R}^{N(P) \times N(P)}$ is the matrix of pairwise kinship coefficients, with element in row $j$ and column $k$ given by $K_{jk} = \Phi_{jk}$, $\mathbf{1} \in \mathbb{R}^{N(P)}$ is a column vector of ones, and superscript $T$ indicates transpose. To facilitate easy application of these statistics, we have developed open-source software funbiased for use by the scientific community, which is available at https://github.com/MehreenRuhi/funbiased.
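The BLUE weights can be sketched in a few lines of Python. Because the kinship matrix is symmetric, the column sums of $K^{-1}$ equal the entries of $K^{-1}\mathbf{1}$, so it suffices to solve $K y = \mathbf{1}$ and normalize $y$ to sum to one. The function name `blue_weights` and the hand-written Gauss-Jordan solver are illustrative assumptions, not the funbiased implementation.

```python
from fractions import Fraction


def blue_weights(K):
    """Return BLUE allele-frequency weights for a symmetric kinship matrix K.

    Solves K y = 1 by Gauss-Jordan elimination with partial pivoting and
    rescales y so the weights sum to one.
    """
    n = len(K)
    # Augmented system [K | 1].
    aug = [list(row) + [1] for row in K]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("kinship matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    y = [aug[i][n] / aug[i][i] for i in range(n)]
    total = sum(y)
    return [yi / total for yi in y]


# Two non-inbred diploid first-degree relatives plus one unrelated individual.
half, quarter = Fraction(1, 2), Fraction(1, 4)
K = [[half, quarter, 0], [quarter, half, 0], [0, 0, half]]
weights = blue_weights(K)  # the unrelated individual receives the largest weight
```

In this example the two relatives are each down-weighted relative to the unrelated individual, reflecting the partially redundant information they carry.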

Data availability

Supplementary data are available at Genetics online. The 1000 Genomes Project data used in this publication is available at http://www.1000genomes.org/. Relatedness information used to generate Figure 7 is available online within Supplementary Tables S7–S15 of Rosenberg (2006).

Supplementary Material

iyab090_Supplementary_Figures

Acknowledgments

The authors thank three anonymous reviewers for their comments that improved our article. They also thank Nick Patterson for pointing us to the unbiased F-statistics in samples containing unrelated and noninbred individuals within Appendix A of Patterson et al. (2012). The authors thank George (PJ) Perry for the helpful comments on an early draft of this article and Alexandre Harris for fruitful discussions about our simulation protocol.

Funding

This research was funded by the National Institutes of Health (R35GM128590), the National Science Foundation (DEB-1949268 and BCS-2001063), a NIGMS funded training grant on Computation, Bioinformatics, and Statistics (Predoctoral Training Program T32GM102057), and the NASA Pennsylvania Space Grant Graduate Fellowship. Computations for this research were performed on the Pennsylvania State University’s Institute for Computational and Data Sciences Advanced CyberInfrastructure (ICDS-ACI).

Conflicts of interest

The authors declare that there is no conflict of interest.

Appendix

In this section, we provide proofs of key lemmas and propositions from the Theory section, and also develop and prove other important results.

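Proof of Lemma 1. We first calculate (result in Weir 1996, page 153)

$$
\begin{aligned}
\mathbb{E}[\hat{G}(P_j)] &= \mathbb{E}[\hat{p}_j(1-\hat{p}_j)] = \mathbb{E}[\hat{p}_j] - \mathbb{E}[\hat{p}_j^2] \\
&= p_j - [1-\Phi_2(P_j)]p_j^2 - \Phi_2(P_j)p_j \\
&= [1-\Phi_2(P_j)]p_j(1-p_j) = [1-\Phi_2(P_j)]G(P_j),
\end{aligned}
$$

which gives

$$
\mathbb{E}[\hat{G}(P)] = \frac{1}{J}\sum_{j=1}^{J}\mathbb{E}[\hat{G}(P_j)] = \frac{1}{J}\sum_{j=1}^{J}[1-\Phi_2(P_j)]G(P_j) = G(P) + \Delta(P),
$$

where we define the downward bias of $\hat{G}(P)$ as

$$
\Delta(P) = \mathbb{E}[\hat{G}(P)] - G(P) = -\frac{1}{J}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j).
$$

It follows that $\tilde{G}(P)$ is an unbiased estimator of $G(P)$ because

$$
\mathbb{E}[\tilde{G}(P)] = \frac{1}{J}\sum_{j=1}^{J}\frac{1}{1-\Phi_2(P_j)}\mathbb{E}[\hat{G}(P_j)] = \frac{1}{J}\sum_{j=1}^{J}\frac{1}{1-\Phi_2(P_j)}[1-\Phi_2(P_j)]G(P_j) = \frac{1}{J}\sum_{j=1}^{J}G(P_j) = G(P).
$$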

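Proof of Corollary 2. When using the sample proportion to estimate allele frequencies, the weight of individual x in population P at locus j is (Harris and DeGiorgio 2017a)

$$
\phi_x(P_j) = \frac{m_x}{\sum_{k=1}^{N(P_j)} m_k}.
$$

Plugging into the definition of $\Phi_2(P_j)$, we have

$$
\Phi_2(P_j) = \sum_{w=1}^{N(P_j)}\sum_{x=1}^{N(P_j)} \phi_w(P_j)\phi_x(P_j)\Phi_{wx} = \frac{1}{\left(\sum_{k=1}^{N(P_j)} m_k\right)^2}\sum_{w=1}^{N(P_j)}\sum_{x=1}^{N(P_j)} m_w m_x \Phi_{wx}.
$$

Because the sample contains unrelated individuals, $\Phi_{wx} = 0$ for $w \neq x$ and $\Phi_{xx} = 1/m_x$, resulting in

$$
\Phi_2(P_j) = \frac{1}{\left(\sum_{k=1}^{N(P_j)} m_k\right)^2}\sum_{w=1}^{N(P_j)} m_w^2 \frac{1}{m_w} = \frac{1}{\sum_{k=1}^{N(P_j)} m_k}.
$$

Plugging $\Phi_2(P_j)$ into the definitions of $\mathrm{Bias}[\hat{G}(P)]$ and $\tilde{G}(P_j)$ within Lemma 1 gives the desired result. □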
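The double sum defining $\Phi_2(P_j)$ under sample-proportion weights is straightforward to evaluate numerically. The sketch below, with hypothetical function names, computes it from ploidies and a kinship matrix and applies the correction $\tilde{G}(P_j) = \hat{G}(P_j)/[1-\Phi_2(P_j)]$ from Lemma 1. Self-kinship entries are assumed to be $1/m_x$ for non-inbred individuals, as in the proof above.

```python
from fractions import Fraction


def phi2_sample_proportion(ploidies, kinship):
    """Phi_2 at one locus when allele frequencies are sample proportions.

    ploidies[x] is m_x and kinship[w][x] is the kinship coefficient Phi_wx.
    """
    n = len(ploidies)
    total = sum(ploidies)
    weighted = sum(
        ploidies[w] * ploidies[x] * kinship[w][x]
        for w in range(n)
        for x in range(n)
    )
    return weighted / (total * total)


def unbiased_g(p_hat, phi2):
    """G-tilde at one locus: p_hat (1 - p_hat) / (1 - Phi_2)."""
    return p_hat * (1 - p_hat) / (1 - phi2)


# Three unrelated, non-inbred diploids: Phi_2 reduces to 1 / (sum of ploidies).
unrelated = [[Fraction(1, 2), 0, 0], [0, Fraction(1, 2), 0], [0, 0, Fraction(1, 2)]]
phi2_unrelated = phi2_sample_proportion([2, 2, 2], unrelated)

# A pair of non-inbred diploid first-degree relatives (kinship 1/4).
sibs = [[Fraction(1, 2), Fraction(1, 4)], [Fraction(1, 4), Fraction(1, 2)]]
phi2_sibs = phi2_sample_proportion([2, 2], sibs)
```

For the related pair, $\Phi_2$ exceeds $1/\sum_k m_k$, which is exactly the excess that the unrelated-sample correction fails to remove.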

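Proof of Proposition 3. We first calculate

$$
\begin{aligned}
\mathbb{E}[\hat{F}_2(A_j,B_j)] &= \mathbb{E}[(\hat{a}_j-\hat{b}_j)^2] = \mathbb{E}[\hat{a}_j^2] + \mathbb{E}[\hat{b}_j^2] - 2\mathbb{E}[\hat{a}_j]\mathbb{E}[\hat{b}_j] \\
&= [1-\Phi_2(A_j)]a_j^2 + \Phi_2(A_j)a_j + [1-\Phi_2(B_j)]b_j^2 + \Phi_2(B_j)b_j - 2a_jb_j \\
&= (a_j-b_j)^2 + \Phi_2(A_j)a_j(1-a_j) + \Phi_2(B_j)b_j(1-b_j) \\
&= F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j),
\end{aligned}
$$

which gives

$$
\mathbb{E}[\hat{F}_2(A,B)] = \frac{1}{J}\sum_{j=1}^{J}\mathbb{E}[\hat{F}_2(A_j,B_j)] = \frac{1}{J}\sum_{j=1}^{J}\left[F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j)\right] = F_2(A,B) + \Delta(A,B),
$$

where we define the upward bias of $\hat{F}_2(A,B)$ as

$$
\Delta(A,B) = \mathbb{E}[\hat{F}_2(A,B)] - F_2(A,B) = \frac{1}{J}\sum_{j=1}^{J}\left[\Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j)\right].
$$

It follows that $\tilde{F}_2(A,B)$ is an unbiased estimator of $F_2(A,B)$ because

$$
\begin{aligned}
\mathbb{E}[\tilde{F}_2(A,B)] &= \frac{1}{J}\sum_{j=1}^{J}\left(\mathbb{E}[\hat{F}_2(A_j,B_j)] - \Phi_2(A_j)\mathbb{E}[\tilde{G}(A_j)] - \Phi_2(B_j)\mathbb{E}[\tilde{G}(B_j)]\right) \\
&= \frac{1}{J}\sum_{j=1}^{J}\left[F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j) - \Phi_2(A_j)G(A_j) - \Phi_2(B_j)G(B_j)\right] \\
&= \frac{1}{J}\sum_{j=1}^{J}F_2(A_j,B_j) = F_2(A,B).
\end{aligned}
$$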
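The correction in Proposition 3 amounts to subtracting, at each locus, the terms $\Phi_2(A_j)\tilde{G}(A_j)$ and $\Phi_2(B_j)\tilde{G}(B_j)$ from the squared frequency difference. A minimal Python sketch follows; the function name `unbiased_f2` and its argument layout are assumptions made for illustration and do not mirror the funbiased interface.

```python
from fractions import Fraction


def unbiased_f2(a_hat, b_hat, phi2_a, phi2_b):
    """F2-tilde averaged over loci.

    a_hat, b_hat: per-locus sample allele frequencies in populations A and B.
    phi2_a, phi2_b: per-locus Phi_2 values for the two samples.
    """
    total = 0
    for aj, bj, fa, fb in zip(a_hat, b_hat, phi2_a, phi2_b):
        f2_hat = (aj - bj) ** 2
        g_a = aj * (1 - aj) / (1 - fa)  # G-tilde(A_j)
        g_b = bj * (1 - bj) / (1 - fb)  # G-tilde(B_j)
        total += f2_hat - fa * g_a - fb * g_b
    return total / len(a_hat)


# Identical sample frequencies: the raw statistic is 0, and the corrected value
# is negative because the expected upward bias has been removed.
example = unbiased_f2(
    [Fraction(1, 2)], [Fraction(1, 2)], [Fraction(1, 4)], [Fraction(1, 4)]
)
```

Negative values of the corrected statistic are expected for individual samples when the true $F_2$ is near zero; unbiasedness is a statement about the average over samples.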

Proof of Corollary 4. Plugging the definition of Φ2(Pj),P{A,B}, derived in the proof of Corollary 2 as well as

G(Pj)=k=1N(Pj)mkk=1N(Pj)mk1G^(Pj)

of Corollary 2 into the definitions of Bias[F^2(A,B)] and F˜2(A,B) within Proposition 3 gives the desired result. □

Proof of Corollary 5. From Corollary 4, we have

F2(A,B)=1Jj=1J[F^2(Aj,Bj)1k=1N(Aj)mk1G^(Aj)1k=1N(Bj)mk1G^(Bj)].

Because populations A or B may have related or inbred individuals within them, from the proofs of Lemma 1 and Proposition 3, we have

E[F^2(Aj,Bj)]=F2(Aj,Bj)+Φ2(Aj)G(Aj)+Φ2(Bj)G(Bj)E[G^(Pj)]=[1Φ2(Pj)]G(Pj)

yielding

E[F2(A,B)]=1Jj=1J[F2(Aj,Bj)+Φ2(Aj)G(Aj)+Φ2(Bj)G(Bj)1Φ2(Aj)k=1N(Aj)mk1G(Aj)1Φ2(Bj)k=1N(Bj)mk1G(Bj)]=1Jj=1J[F2(Aj,Bj)+Φ2(Aj)k=1N(Aj)mk1k=1N(Aj)mk1G(Aj)+Φ2(Bj)k=1N(Bj)mk1k=1N(Bj)mk1G(Bj)]=F2(A,B)+1Jj=1J[Φ2(Aj)k=1N(Aj)mk1k=1N(Aj)mk1G(Aj)+Φ2(Bj)k=1N(Bj)mk1k=1N(Bj)mk1G(Bj)].

It follows that the bias is

Bias[F2(A,B)]=E[F˜2(A,B)]F2(A,B)=1Jj=1J[Φ2(Aj)k=1N(Aj)mk1k=1N(Aj)mk1G(Aj)+Φ2(Bj)k=1N(Bj)mk1k=1N(Bj)mk1G(Bj)],

and because $\Phi_2(P_j) \ge 1/\sum_{k=1}^{N(P_j)} m_k$, this is an upward bias. □

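Proof of Proposition 6. We first calculate

$$
\begin{aligned}
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)] &= \mathbb{E}[(\hat{a}_j-\hat{b}_j)(\hat{a}_j-\hat{c}_j)] \\
&= \mathbb{E}[\hat{a}_j^2] - \mathbb{E}[\hat{a}_j]\mathbb{E}[\hat{c}_j] - \mathbb{E}[\hat{a}_j]\mathbb{E}[\hat{b}_j] + \mathbb{E}[\hat{b}_j]\mathbb{E}[\hat{c}_j] \\
&= [1-\Phi_2(A_j)]a_j^2 + \Phi_2(A_j)a_j - a_jc_j - a_jb_j + b_jc_j \\
&= (a_j^2 - a_jc_j - a_jb_j + b_jc_j) + \Phi_2(A_j)a_j(1-a_j) \\
&= F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j),
\end{aligned}
$$

which gives

$$
\mathbb{E}[\hat{F}_3(A;B,C)] = \frac{1}{J}\sum_{j=1}^{J}\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)] = \frac{1}{J}\sum_{j=1}^{J}\left[F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j)\right] = F_3(A;B,C) + \Delta(A;B,C),
$$

where we define the upward bias of $\hat{F}_3(A;B,C)$ as

$$
\Delta(A;B,C) = \mathbb{E}[\hat{F}_3(A;B,C)] - F_3(A;B,C) = \frac{1}{J}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j).
$$

It follows that $\tilde{F}_3(A;B,C)$ is an unbiased estimator of $F_3(A;B,C)$ because

$$
\begin{aligned}
\mathbb{E}[\tilde{F}_3(A;B,C)] &= \frac{1}{J}\sum_{j=1}^{J}\left(\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)] - \Phi_2(A_j)\mathbb{E}[\tilde{G}(A_j)]\right) \\
&= \frac{1}{J}\sum_{j=1}^{J}\left[F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j) - \Phi_2(A_j)G(A_j)\right] \\
&= \frac{1}{J}\sum_{j=1}^{J}F_3(A_j;B_j,C_j) = F_3(A;B,C).
\end{aligned}
$$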

Proof of Corollary 7. Plugging the definition of Φ2(Aj) derived in the proof of Corollary 2 as well as

G(Aj)=k=1N(Aj)mkk=1N(Aj)mk1G^(Aj)

of Corollary 2 into the definitions of Bias[F^3(A;B,C)] and F˜3(A;B,C) within Proposition 6 gives the desired result. □

Proof of Corollary 8. From Corollary 7, we have

F3(A;B,C)=1Jj=1J[F^3(Aj;Bj,Cj)1k=1N(Aj)mk1G^(Aj)].

Because population A may have related or inbred individuals within it, from the proofs of Lemma 1 and Proposition 6, we have

E[F^3(Aj;Bj,Cj)]=F3(Aj;Bj,Cj)+Φ2(Aj)G(Aj)E[G^(Aj)]=[1Φ2(Aj)]G(Aj)

yielding

E[F3(A;B,C)]=1Jj=1J[F3(Aj;Bj,Cj)+Φ2(Aj)G(Aj)1Φ2(Aj)k=1N(Aj)mk1G(Aj)]=1Jj=1J[F3(Aj;Bj,Cj)+Φ2(Aj)k=1N(Aj)mk1k=1N(Aj)mk1G(Aj)]=F3(A;B,C)+1Jj=1JΦ2(Aj)k=1N(Aj)mk1k=1N(Aj)mk1G(Aj).

It follows that the bias is

Bias[F3(A;B,C)]=E[F˜3(A;B,C)]F3(A;B,C)=1Jj=1JΦ2(Aj)k=1N(Aj)mk1k=1N(Aj)mk1G(Aj),

and because $\Phi_2(A_j) \ge 1/\sum_{k=1}^{N(A_j)} m_k$, this is an upward bias. □

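Proof of Proposition 9. Assuming that the expectation of $\hat{F}_3(A;B,C \mid A)$ is approximately equal to the ratio of expectations of $\hat{F}_3(A;B,C)$ and $2\hat{G}(A)$, we find that

$$
\begin{aligned}
\mathbb{E}[\hat{F}_3(A;B,C \mid A)] &= \mathbb{E}\left[\frac{\hat{F}_3(A;B,C)}{2\hat{G}(A)}\right] \approx \frac{\mathbb{E}[\hat{F}_3(A;B,C)]}{2\mathbb{E}[\hat{G}(A)]} \\
&= \frac{(1/J)\sum_{j=1}^{J}\left[F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j)\right]}{(2/J)\sum_{j=1}^{J}[1-\Phi_2(A_j)]G(A_j)} \\
&= \frac{F_3(A;B,C) + (1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)}{2G(A) - (2/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)} \\
&= \frac{F_3(A;B,C)}{2G(A)} + \Delta(A;B,C \mid A) = F_3(A;B,C \mid A) + \Delta(A;B,C \mid A),
\end{aligned}
$$

where we define the upward bias of $\hat{F}_3(A;B,C \mid A)$ as

$$
\begin{aligned}
\Delta(A;B,C \mid A) &= \mathbb{E}[\hat{F}_3(A;B,C \mid A)] - F_3(A;B,C \mid A) \\
&= \frac{F_3(A;B,C) + (1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)}{2G(A) - (2/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)} - \frac{F_3(A;B,C)}{2G(A)} \\
&= \frac{[F_3(A;B,C) + G(A)]\,(1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)}{2G(A)\left[G(A) - (1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)\right]} \\
&= \frac{(1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)}{G(A) - (1/J)\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)}\left[F_3(A;B,C \mid A) + \frac{1}{2}\right].
\end{aligned}
$$

We can also see that $\tilde{F}_3(A;B,C \mid A)$ is an approximately unbiased estimator of $F_3(A;B,C \mid A)$ because

$$
\mathbb{E}[\tilde{F}_3(A;B,C \mid A)] = \mathbb{E}\left[\frac{\tilde{F}_3(A;B,C)}{2\tilde{G}(A)}\right] \approx \frac{\mathbb{E}[\tilde{F}_3(A;B,C)]}{2\mathbb{E}[\tilde{G}(A)]} = \frac{F_3(A;B,C)}{2G(A)} = F_3(A;B,C \mid A).
$$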
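The corrected $F_3$ statistic and its normalized form can be sketched together, since both rely on the per-locus quantity $\tilde{G}(A_j) = \hat{G}(A_j)/[1-\Phi_2(A_j)]$. The function below is an illustrative assumption about how one might organize the calculation; its name and return convention are hypothetical.

```python
from fractions import Fraction


def unbiased_f3(a_hat, b_hat, c_hat, phi2_a):
    """Return (F3-tilde, normalized F3-tilde) for target A and sources B, C.

    F3-tilde subtracts Phi_2(A_j) * G-tilde(A_j) from each per-locus product;
    the normalized form divides by twice the locus-averaged G-tilde(A).
    """
    n_loci = len(a_hat)
    f3_sum = 0
    g_sum = 0
    for aj, bj, cj, fa in zip(a_hat, b_hat, c_hat, phi2_a):
        g_tilde = aj * (1 - aj) / (1 - fa)
        f3_sum += (aj - bj) * (aj - cj) - fa * g_tilde
        g_sum += g_tilde
    f3_tilde = f3_sum / n_loci
    return f3_tilde, f3_tilde / (2 * (g_sum / n_loci))


# One locus where the target frequency lies between the two sources.
raw_f3 = (Fraction(1, 2) - Fraction(2, 5)) * (Fraction(1, 2) - Fraction(3, 5))
corrected, normalized = unbiased_f3(
    [Fraction(1, 2)], [Fraction(2, 5)], [Fraction(3, 5)], [Fraction(1, 10)]
)
```

In this toy example the correction makes an already negative value more negative, which illustrates why an uncorrected upward bias can weaken or mask an admixture signal.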

Proof of Corollary 10. Plugging the definition of Φ2(Aj) derived in the proof of Corollary 2 as well as

G(Aj)=k=1N(Aj)mkk=1N(Aj)mk1G^(Aj)

of Corollary 2 into the definitions of Bias[F^3(A;B,C|A)] and F˜3(A;B,C|A) within Proposition 9 gives the desired result. □

Proof of Corollary 11. From Corollaries 2, 7, and 10, we have

F3(A;B,C|A)=F3(A;B,C)2G(A),

where

G(A)=1Jj=1JG(Aj)=1Jj=1Jk=1N(Aj)mkk=1N(Aj)mk1G^(Aj)F3(A;B,C)=1Jj=1J[F^3(Aj;Bj,Cj)1k=1N(Aj)mk1G^(Aj)].

Because population A may have related or inbred individuals within it, from the proofs of Lemma 1 and Proposition 6, we have

E[F^3(Aj;Bj,Cj)]=F3(Aj;Bj,Cj)+Φ2(Aj)G(Aj)E[G^(Aj)]=[1Φ2(Aj)]G(Aj)

yielding

E[F3(A;B,C|A)]E[F3(A;B,C)]2E[G(A)]=F3(A;B,C)+(1/J)j=1JG(Aj)[Φ2(Aj)k=1N(Aj)mk1]/[k=1N(Aj)mk1](2/J)j=1J[1Φ2(Aj)][k=1N(Aj)mk]k=1N(Aj)mk1G(Aj)=F3(A;B,C)+X(A)2Y(A).

It follows that the approximate bias is

Bias[F3(A;B,C|A)]=E[F˜3(A;B,C|A)]F3(A;B,C|A)12[1Y(A)1G(A)]F3(A;B,C)+X(A)2Y(A),

and because $\Phi_2(A_j) \ge 1/\sum_{k=1}^{N(A_j)} m_k$, then $0 < Y(A) \le G(A)$ and $X(A) \ge 0$, making this an upward approximate bias. □

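Proof of Proposition 12. We first calculate

$$
\begin{aligned}
\mathbb{E}[\hat{F}_4(A_j,B_j;C_j,D_j)] &= \mathbb{E}[(\hat{a}_j-\hat{b}_j)(\hat{c}_j-\hat{d}_j)] = \mathbb{E}[\hat{a}_j-\hat{b}_j]\,\mathbb{E}[\hat{c}_j-\hat{d}_j] \\
&= (\mathbb{E}[\hat{a}_j]-\mathbb{E}[\hat{b}_j])(\mathbb{E}[\hat{c}_j]-\mathbb{E}[\hat{d}_j]) \\
&= (a_j-b_j)(c_j-d_j) = F_4(A_j,B_j;C_j,D_j).
\end{aligned}
$$

We show that $\hat{F}_4(A,B;C,D)$ is an unbiased estimator of $F_4(A,B;C,D)$ because

$$
\mathbb{E}[\hat{F}_4(A,B;C,D)] = \frac{1}{J}\sum_{j=1}^{J}\mathbb{E}[\hat{F}_4(A_j,B_j;C_j,D_j)] = \frac{1}{J}\sum_{j=1}^{J}F_4(A_j,B_j;C_j,D_j) = F_4(A,B;C,D).
$$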

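Proof of Proposition 13. Assuming that the expectation of $\hat{F}_4(A,B;C,D \mid P)$ is approximately equal to the ratio of expectations of $\hat{F}_4(A,B;C,D)$ and $\hat{G}(P)$ for some $P \in \{A,B,C,D\}$, we find that

$$
\begin{aligned}
\mathbb{E}[\hat{F}_4(A,B;C,D \mid P)] &= \mathbb{E}\left[\frac{\hat{F}_4(A,B;C,D)}{\hat{G}(P)}\right] \approx \frac{\mathbb{E}[\hat{F}_4(A,B;C,D)]}{\mathbb{E}[\hat{G}(P)]} \\
&= \frac{F_4(A,B;C,D)}{(1/J)\sum_{j=1}^{J}[1-\Phi_2(P_j)]G(P_j)} = \frac{F_4(A,B;C,D)}{G(P) - (1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)} \\
&= \frac{F_4(A,B;C,D)}{G(P)} + \Delta(A,B;C,D \mid P) = F_4(A,B;C,D \mid P) + \Delta(A,B;C,D \mid P),
\end{aligned}
$$

where we define the approximate upward bias of $\hat{F}_4(A,B;C,D \mid P)$ as

$$
\begin{aligned}
\Delta(A,B;C,D \mid P) &= \mathbb{E}[\hat{F}_4(A,B;C,D \mid P)] - F_4(A,B;C,D \mid P) \\
&= \frac{F_4(A,B;C,D)}{G(P) - (1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)} - \frac{F_4(A,B;C,D)}{G(P)} \\
&= \frac{F_4(A,B;C,D)\,(1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)}{G(P)\left[G(P) - (1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)\right]} \\
&= \frac{(1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)}{G(P) - (1/J)\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)}\,F_4(A,B;C,D \mid P).
\end{aligned}
$$

We can also see that $\tilde{F}_4(A,B;C,D \mid P)$ is an approximately unbiased estimator of $F_4(A,B;C,D \mid P)$ because

$$
\mathbb{E}[\tilde{F}_4(A,B;C,D \mid P)] = \mathbb{E}\left[\frac{\hat{F}_4(A,B;C,D)}{\tilde{G}(P)}\right] \approx \frac{\mathbb{E}[\hat{F}_4(A,B;C,D)]}{\mathbb{E}[\tilde{G}(P)]} = \frac{F_4(A,B;C,D)}{G(P)} = F_4(A,B;C,D \mid P).
$$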

Proof of Corollary 14. Plugging the definition of Φ2(Pj),P{A,B,C,D}, derived in the proof of Corollary 2 as well as

G(Pj)=k=1N(Aj)mkk=1N(Aj)mk1G^(Pj)

of Corollary 2 into the definitions of Bias[F^4(A,B;C,D|P)] and F˜4(A,B;C,D|P) within Proposition 13 gives the desired result. □

Proof of Corollary 15. From Corollaries 2 and 14, we have

F4(A,B;C,D|P)=F^4(A,B;C,D)G(P),

where

G(P)=1Jj=1JG(Pj)=1Jj=1Jk=1N(Pj)mkk=1N(Pj)mk1G^(Pj)F^4(A,B;C,D)=1Jj=1JF^4(Aj,Bj;Cj,Dj).

Because population P may have related or inbred individuals within it, from the proofs of Lemma 1 and Proposition 12, we have

E[F^4(A,B;C,D]=F4(A,B;B,D)E[G^(Pj)]=[1Φ2(Pj)]G(Pj)

yielding

E[F4(A,B;C,D|P)]E[F^4(A,B;C,D)]E[G(P)]=F4(A,B,C,D)(1/J)j=1J[1Φ2(Pj)][k=1N(Pj)mk]k=1N(Pj)mk1G(Pj)=F4(A,B,C,D)Y(P).

It follows that the approximate bias is

Bias[F4(A,B;C,D|P)]=E[F4(A,B;C,D|P)]F4(A,B;C,D|P)[1Y(P)1G(P)]F4(A,B;C,D),

and because $\Phi_2(P_j) \ge 1/\sum_{k=1}^{N(P_j)} m_k$, then $0 < Y(P) \le G(P)$, making this an upward approximate bias. □

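Lemma 17. Consider $J$ polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. The estimator $\hat{H}(A,B,C,D)$ is unbiased.

Proof. We first calculate

$$
\begin{aligned}
\mathbb{E}[\hat{H}(A_j,B_j,C_j,D_j)] &= \mathbb{E}[(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)] \\
&= \mathbb{E}[\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j]\,\mathbb{E}[\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j] \\
&= (\mathbb{E}[\hat{a}_j]+\mathbb{E}[\hat{b}_j]-2\mathbb{E}[\hat{a}_j]\mathbb{E}[\hat{b}_j])(\mathbb{E}[\hat{c}_j]+\mathbb{E}[\hat{d}_j]-2\mathbb{E}[\hat{c}_j]\mathbb{E}[\hat{d}_j]) \\
&= (a_j+b_j-2a_jb_j)(c_j+d_j-2c_jd_j) = H(A_j,B_j,C_j,D_j).
\end{aligned}
$$

We show that $\hat{H}(A,B,C,D)$ is an unbiased estimator of $H(A,B,C,D)$ because

$$
\mathbb{E}[\hat{H}(A,B,C,D)] = \frac{1}{J}\sum_{j=1}^{J}\mathbb{E}[\hat{H}(A_j,B_j,C_j,D_j)] = \frac{1}{J}\sum_{j=1}^{J}H(A_j,B_j,C_j,D_j) = H(A,B,C,D).
$$

Proof of Proposition 16. Assuming that the expectation of $\hat{D}(A,B,C,D)$ is approximately equal to the ratio of expectations of $\hat{F}_4(A,B;C,D)$ and $\hat{H}(A,B,C,D)$, $\hat{D}(A,B,C,D)$ is an approximately unbiased estimator of $D(A,B,C,D)$ because

$$
\mathbb{E}[\hat{D}(A,B,C,D)] = \mathbb{E}\left[\frac{\hat{F}_4(A,B;C,D)}{\hat{H}(A,B,C,D)}\right] \approx \frac{\mathbb{E}[\hat{F}_4(A,B;C,D)]}{\mathbb{E}[\hat{H}(A,B,C,D)]} = \frac{F_4(A,B;C,D)}{H(A,B,C,D)} = D(A,B,C,D).
$$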
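Because neither the numerator nor the denominator of the D-statistic multiplies two frequencies drawn from the same population, it can be computed directly from sample allele frequencies without a relatedness correction. The following sketch, with the hypothetical function name `d_statistic`, forms the ratio of the locus-averaged $F_4$ numerator $(a_j-b_j)(c_j-d_j)$ to the locus-averaged denominator $(a_j+b_j-2a_jb_j)(c_j+d_j-2c_jd_j)$ used in the proofs above.

```python
from fractions import Fraction


def d_statistic(a, b, c, d):
    """Ratio of locus-averaged F4 to locus-averaged H for populations A, B, C, D."""
    n_loci = len(a)
    f4 = sum(
        (aj - bj) * (cj - dj) for aj, bj, cj, dj in zip(a, b, c, d)
    ) / n_loci
    h = sum(
        (aj + bj - 2 * aj * bj) * (cj + dj - 2 * cj * dj)
        for aj, bj, cj, dj in zip(a, b, c, d)
    ) / n_loci
    return f4 / h


# Single-locus example with exact rational arithmetic.
example_d = d_statistic(
    [Fraction(1, 2)], [Fraction(1, 4)], [Fraction(3, 4)], [Fraction(1, 4)]
)
```

The ratio is still only approximately unbiased, since the expectation of a ratio is approximated by the ratio of expectations.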

Lemma 18. Consider $J$ independent polymorphic loci in a population P with parametric reference allele frequencies $p_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$, some of which may be related or inbred, where individual $k \in \{1,2,\ldots,N(P_j)\}$ has ploidy $m_k$. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the estimator $\hat{G}(P)$ has an approximate variance

$$
\mathrm{Var}[\hat{G}(P)] \approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j) - \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)^2.
$$

Moreover, the respective approximate variance for the unbiased estimator G˜(P) and the estimator G(P) are

Var[G˜(P)]1J2j=1JΦ2(Pj)12Φ2(Pj)G(Pj)4J2j=1JΦ2(Pj)12Φ2(Pj)G(Pj)2Var[G(P)]1J2j=1JΦ2(Pj)[k=1N(Pj)mkk=1N(Pj)mk1]2G(Pj)4J2j=1JΦ2(Pj)[k=1N(Pj)mkk=1N(Pj)mk1]2G(Pj)2.

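Proof. From the proof of Lemma 1, we have

$$
\mathbb{E}[\hat{G}(P_j)] = [1-\Phi_2(P_j)]G(P_j)
$$

and we calculate

$$
\begin{aligned}
\mathbb{E}[\hat{G}(P_j)^2] &= \mathbb{E}[\hat{p}_j^2(1-\hat{p}_j)^2] = \mathbb{E}[\hat{p}_j^2] - 2\mathbb{E}[\hat{p}_j^3] + \mathbb{E}[\hat{p}_j^4] \\
&\approx p_j^2 + \Phi_2(P_j)p_j(1-p_j) - 2\left[p_j^3 + 3\Phi_2(P_j)p_j^2(1-p_j)\right] + p_j^4 + 6\Phi_2(P_j)p_j^3(1-p_j) \\
&= p_j^2 - 2p_j^3 + p_j^4 + \Phi_2(P_j)p_j(1-p_j)\left[1 - 6p_j + 6p_j^2\right] \\
&= p_j^2(1-p_j)^2 + \Phi_2(P_j)p_j(1-p_j)\left[1 - 6p_j(1-p_j)\right] \\
&= G(P_j)^2 + \Phi_2(P_j)G(P_j)\left[1 - 6G(P_j)\right] \\
&= \Phi_2(P_j)G(P_j) + [1-6\Phi_2(P_j)]G(P_j)^2.
\end{aligned}
$$

Therefore, we have that

$$
\begin{aligned}
\mathrm{Var}[\hat{G}(P_j)] &= \mathbb{E}[\hat{G}(P_j)^2] - \mathbb{E}[\hat{G}(P_j)]^2 \\
&\approx \Phi_2(P_j)G(P_j) + [1-6\Phi_2(P_j)]G(P_j)^2 - [1-\Phi_2(P_j)]^2G(P_j)^2 \\
&= \Phi_2(P_j)G(P_j) + [1-6\Phi_2(P_j)]G(P_j)^2 - \left[1 - 2\Phi_2(P_j) + \Phi_2(P_j)^2\right]G(P_j)^2 \\
&\approx \Phi_2(P_j)G(P_j) + [1-6\Phi_2(P_j)]G(P_j)^2 - [1-2\Phi_2(P_j)]G(P_j)^2 \\
&= \Phi_2(P_j)G(P_j) - 4\Phi_2(P_j)G(P_j)^2,
\end{aligned}
$$

which gives

$$
\mathrm{Var}[\hat{G}(P)] = \mathrm{Var}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{G}(P_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Var}[\hat{G}(P_j)] \approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j) - \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)^2.
$$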
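The approximate variance of $\hat{G}(P)$ depends only on the per-locus parametric frequencies and $\Phi_2$ values, so it is simple to evaluate. The sketch below is a direct transcription of the final expression, $J^{-2}\sum_j [\Phi_2(P_j)G(P_j) - 4\Phi_2(P_j)G(P_j)^2]$, under the pairwise-kinship approximation; the function name is hypothetical.

```python
from fractions import Fraction


def approx_var_g_hat(p, phi2):
    """Approximate Var[G-hat(P)] from per-locus frequencies p and Phi_2 values."""
    n_loci = len(p)
    total = 0
    for pj, fj in zip(p, phi2):
        g = pj * (1 - pj)  # parametric G(P_j)
        total += fj * g - 4 * fj * g * g
    return total / (n_loci * n_loci)


# Single locus with p = 1/5 and Phi_2 = 1/10.
example_var = approx_var_g_hat([Fraction(1, 5)], [Fraction(1, 10)])
```

The per-locus term factors as $\Phi_2(P_j)G(P_j)[1-4G(P_j)]$, so under this approximation it vanishes at $p_j = 1/2$, where $G(P_j) = 1/4$.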

Recall that

G˜(P)=1Jj=1JG˜(Pj)G(P)=1Jj=1JG(Pj),

where

G˜(Pj)=11Φ2(Pj)G^(Pj)G(Pj)=k=1N(Pj)mkk=1N(Pj)mk1G^(Pj).

It follows that

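$$
\begin{aligned}
\mathrm{Var}[\tilde{G}(P)] &= \mathrm{Var}\left[\frac{1}{J}\sum_{j=1}^{J}\tilde{G}(P_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Var}[\tilde{G}(P_j)] = \frac{1}{J^2}\sum_{j=1}^{J}\frac{1}{[1-\Phi_2(P_j)]^2}\mathrm{Var}[\hat{G}(P_j)] \\
&\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(P_j)}{1-2\Phi_2(P_j)}G(P_j) - \frac{4}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(P_j)}{1-2\Phi_2(P_j)}G(P_j)^2
\end{aligned}
$$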

and similarly

Var[G(P)]=Var[1Jj=1JG(Pj)]=1J2j=1JVar[G(Pj)]=1J2j=1J[k=1N(Pj)mkk=1N(Pj)mk1]2Var[G^(Pj)]1J2j=1JΦ2(Pj)[k=1N(Pj)mkk=1N(Pj)mk1]2G(Pj)4J2j=1JΦ2(Pj)[k=1N(Pj)mkk=1N(Pj)mk1]2G(Pj)2.

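Lemma 19. Consider $J$ independent polymorphic loci in populations A and B with respective parametric reference allele frequencies $a_j, b_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the estimators $\hat{F}_2(A,B)$ and $\hat{G}(B)$ have an approximate covariance

$$
\mathrm{Cov}[\hat{F}_2(A,B),\hat{G}(B)] \approx \frac{2}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)^2 - \frac{2}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(A_j)G(B_j) - \frac{2}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,B_j)G(B_j).
$$

Proof. From the proofs of Lemma 1 and Proposition 3, we have

$$
\mathbb{E}[\hat{G}(B_j)] = [1-\Phi_2(B_j)]G(B_j)
$$

and

$$
\mathbb{E}[\hat{F}_2(A_j,B_j)] = F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j),
$$

yielding

$$
\begin{aligned}
\mathbb{E}[\hat{F}_2(A_j,B_j)]\,\mathbb{E}[\hat{G}(B_j)] &= \left[F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j)\right][1-\Phi_2(B_j)]G(B_j) \\
&= F_2(A_j,B_j)G(B_j) + \Phi_2(A_j)G(A_j)G(B_j) + \Phi_2(B_j)G(B_j)^2 \\
&\quad - \Phi_2(B_j)F_2(A_j,B_j)G(B_j) - \Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j) - \Phi_2(B_j)^2G(B_j)^2 \\
&\approx F_2(A_j,B_j)G(B_j) + \Phi_2(A_j)G(A_j)G(B_j) + \Phi_2(B_j)G(B_j)^2 \\
&\quad - \Phi_2(B_j)F_2(A_j,B_j)G(B_j) - \Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j),
\end{aligned}
$$

where we used the fact that $\Phi_2(B_j)^2$ is negligible compared to $\Phi_2(B_j)$ as an approximation. We also calculate

$$
\begin{aligned}
\mathbb{E}[\hat{F}_2(A_j,B_j)\hat{G}(B_j)] &= \mathbb{E}[(\hat{a}_j-\hat{b}_j)^2\hat{b}_j(1-\hat{b}_j)] = \mathbb{E}[(\hat{a}_j^2 - 2\hat{a}_j\hat{b}_j + \hat{b}_j^2)(\hat{b}_j - \hat{b}_j^2)] \\
&= \mathbb{E}[\hat{a}_j^2(\hat{b}_j - \hat{b}_j^2) - 2\hat{a}_j(\hat{b}_j^2 - \hat{b}_j^3) + \hat{b}_j^3 - \hat{b}_j^4] \\
&= \mathbb{E}[\hat{a}_j^2](\mathbb{E}[\hat{b}_j] - \mathbb{E}[\hat{b}_j^2]) - 2\mathbb{E}[\hat{a}_j](\mathbb{E}[\hat{b}_j^2] - \mathbb{E}[\hat{b}_j^3]) + \mathbb{E}[\hat{b}_j^3] - \mathbb{E}[\hat{b}_j^4] \\
&\approx \left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]\left[b_j - b_j^2 - \Phi_2(B_j)b_j(1-b_j)\right] \\
&\quad - 2a_j\left[b_j^2 + \Phi_2(B_j)b_j(1-b_j) - b_j^3 - 3\Phi_2(B_j)b_j^2(1-b_j)\right] \\
&\quad + b_j^3 + 3\Phi_2(B_j)b_j^2(1-b_j) - b_j^4 - 6\Phi_2(B_j)b_j^3(1-b_j).
\end{aligned}
$$

Recognizing that $G(A_j) = a_j(1-a_j)$, $G(B_j) = b_j(1-b_j)$, and $F_2(A_j,B_j) = a_j^2 - 2a_jb_j + b_j^2$, we have

$$
\begin{aligned}
\mathbb{E}[\hat{F}_2(A_j,B_j)\hat{G}(B_j)] &\approx \left[a_j^2 + \Phi_2(A_j)G(A_j)\right][1-\Phi_2(B_j)]G(B_j) - 2a_j\left[\Phi_2(B_j) + [1-3\Phi_2(B_j)]b_j\right]G(B_j) \\
&\quad + \left[3\Phi_2(B_j)b_j + [1-6\Phi_2(B_j)]b_j^2\right]G(B_j) \\
&= G(B_j)\Big[a_j^2 - \Phi_2(B_j)a_j^2 + \Phi_2(A_j)G(A_j) - \Phi_2(A_j)\Phi_2(B_j)G(A_j) - 2\Phi_2(B_j)a_j - 2a_jb_j \\
&\qquad + 2(3)\Phi_2(B_j)a_jb_j + 3\Phi_2(B_j)b_j + b_j^2 - 6\Phi_2(B_j)b_j^2\Big] \\
&= G(B_j)\Big[F_2(A_j,B_j) - \Phi_2(B_j)\left[3a_j^2 - 2(3)a_jb_j + 3b_j^2 - 2a_j^2 + 3b_j^2\right] - 2\Phi_2(B_j)a_j + 3\Phi_2(B_j)b_j \\
&\qquad + \left[\Phi_2(A_j) - \Phi_2(A_j)\Phi_2(B_j)\right]G(A_j)\Big] \\
&= G(B_j)\Big[F_2(A_j,B_j) - 3\Phi_2(B_j)\left[a_j^2 - 2a_jb_j + b_j^2\right] - 2\Phi_2(B_j)\left[a_j - a_j^2\right] + 3\Phi_2(B_j)\left[b_j - b_j^2\right] \\
&\qquad + \left[\Phi_2(A_j) - \Phi_2(A_j)\Phi_2(B_j)\right]G(A_j)\Big] \\
&= G(B_j)\Big[F_2(A_j,B_j) - 3\Phi_2(B_j)F_2(A_j,B_j) - 2\Phi_2(B_j)G(A_j) + 3\Phi_2(B_j)G(B_j) \\
&\qquad + \left[\Phi_2(A_j) - \Phi_2(A_j)\Phi_2(B_j)\right]G(A_j)\Big] \\
&= [1-3\Phi_2(B_j)]F_2(A_j,B_j)G(B_j) + 3\Phi_2(B_j)G(B_j)^2 + \left[\Phi_2(A_j) - 2\Phi_2(B_j) - \Phi_2(A_j)\Phi_2(B_j)\right]G(A_j)G(B_j).
\end{aligned}
$$

Therefore, we have that

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_2(A_j,B_j),\hat{G}(B_j)] &= \mathbb{E}[\hat{F}_2(A_j,B_j)\hat{G}(B_j)] - \mathbb{E}[\hat{F}_2(A_j,B_j)]\,\mathbb{E}[\hat{G}(B_j)] \\
&\approx [1-3\Phi_2(B_j)]F_2(A_j,B_j)G(B_j) + 3\Phi_2(B_j)G(B_j)^2 + \left[\Phi_2(A_j) - 2\Phi_2(B_j) - \Phi_2(A_j)\Phi_2(B_j)\right]G(A_j)G(B_j) \\
&\quad - \Big[F_2(A_j,B_j)G(B_j) + \Phi_2(A_j)G(A_j)G(B_j) + \Phi_2(B_j)G(B_j)^2 \\
&\qquad - \Phi_2(B_j)F_2(A_j,B_j)G(B_j) - \Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j)\Big] \\
&= 2\Phi_2(B_j)G(B_j)^2 - 2\Phi_2(B_j)G(A_j)G(B_j) - 2\Phi_2(B_j)F_2(A_j,B_j)G(B_j) \\
&= 2\Phi_2(B_j)G(B_j)\left[G(B_j) - G(A_j) - F_2(A_j,B_j)\right],
\end{aligned}
$$

which gives

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_2(A,B),\hat{G}(B)] &= \mathrm{Cov}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{F}_2(A_j,B_j),\frac{1}{J}\sum_{j=1}^{J}\hat{G}(B_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Cov}[\hat{F}_2(A_j,B_j),\hat{G}(B_j)] \\
&\approx \frac{2}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)^2 - \frac{2}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(A_j)G(B_j) - \frac{2}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,B_j)G(B_j).
\end{aligned}
$$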

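Proposition 20. Consider $J$ independent polymorphic loci in populations A and B with respective parametric reference allele frequencies $a_j, b_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, and $\Phi_2(A_j)\Phi_2(B_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the estimator $\hat{F}_2(A,B)$ has approximate variance

$$
\mathrm{Var}[\hat{F}_2(A,B)] \approx \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,B_j)G(B_j).
$$

Proof. From the proof of Proposition 3, we have

$$
\mathbb{E}[\hat{F}_2(A_j,B_j)] = F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j),
$$

which gives

$$
\begin{aligned}
\mathbb{E}[\hat{F}_2(A_j,B_j)]^2 &= \left[F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j)\right]^2 \\
&= F_2(A_j,B_j)^2 + 2\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 2\Phi_2(B_j)F_2(A_j,B_j)G(B_j) \\
&\quad + 2\Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j) + \Phi_2(A_j)^2G(A_j)^2 + \Phi_2(B_j)^2G(B_j)^2 \\
&\approx F_2(A_j,B_j)^2 + 2\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 2\Phi_2(B_j)F_2(A_j,B_j)G(B_j),
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)^2$, $\Phi_2(B_j)^2$, and $\Phi_2(A_j)\Phi_2(B_j)$ are negligible compared to $\Phi_2(A_j)$ and $\Phi_2(B_j)$ as an approximation. We also calculate

$$
\begin{aligned}
\mathbb{E}[\hat{F}_2(A_j,B_j)^2] &= \mathbb{E}[(\hat{a}_j-\hat{b}_j)^4] = \mathbb{E}[\hat{a}_j^4] - 4\mathbb{E}[\hat{a}_j^3]\mathbb{E}[\hat{b}_j] + 6\mathbb{E}[\hat{a}_j^2]\mathbb{E}[\hat{b}_j^2] - 4\mathbb{E}[\hat{a}_j]\mathbb{E}[\hat{b}_j^3] + \mathbb{E}[\hat{b}_j^4] \\
&\approx a_j^4 + 6\Phi_2(A_j)a_j^3(1-a_j) - 4\left[a_j^3 + 3\Phi_2(A_j)a_j^2(1-a_j)\right]b_j \\
&\quad + 6\left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]\left[b_j^2 + \Phi_2(B_j)b_j(1-b_j)\right] \\
&\quad - 4a_j\left[b_j^3 + 3\Phi_2(B_j)b_j^2(1-b_j)\right] + b_j^4 + 6\Phi_2(B_j)b_j^3(1-b_j) \\
&= a_j^4 - 4a_j^3b_j + 6a_j^2b_j^2 - 4a_jb_j^3 + b_j^4 + 6\Phi_2(A_j)a_j(1-a_j)\left[a_j^2 - 2a_jb_j + b_j^2\right] \\
&\quad + 6\Phi_2(B_j)b_j(1-b_j)\left[a_j^2 - 2a_jb_j + b_j^2\right] + 6\Phi_2(A_j)\Phi_2(B_j)a_j(1-a_j)b_j(1-b_j) \\
&= (a_j-b_j)^4 + 6\Phi_2(A_j)a_j(1-a_j)(a_j-b_j)^2 + 6\Phi_2(B_j)b_j(1-b_j)(a_j-b_j)^2 \\
&\quad + 6\Phi_2(A_j)\Phi_2(B_j)a_j(1-a_j)b_j(1-b_j) \\
&= F_2(A_j,B_j)^2 + 6\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 6\Phi_2(B_j)F_2(A_j,B_j)G(B_j) + 6\Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j) \\
&\approx F_2(A_j,B_j)^2 + 6\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 6\Phi_2(B_j)F_2(A_j,B_j)G(B_j).
\end{aligned}
$$

Therefore, we have that

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_2(A_j,B_j)] &= \mathbb{E}[\hat{F}_2(A_j,B_j)^2] - \mathbb{E}[\hat{F}_2(A_j,B_j)]^2 \\
&\approx F_2(A_j,B_j)^2 + 6\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 6\Phi_2(B_j)F_2(A_j,B_j)G(B_j) \\
&\quad - \left[F_2(A_j,B_j)^2 + 2\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 2\Phi_2(B_j)F_2(A_j,B_j)G(B_j)\right] \\
&= 4\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 4\Phi_2(B_j)F_2(A_j,B_j)G(B_j),
\end{aligned}
$$

which gives

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_2(A,B)] &= \mathrm{Var}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{F}_2(A_j,B_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Var}[\hat{F}_2(A_j,B_j)] \\
&\approx \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,B_j)G(B_j).
\end{aligned}
$$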

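Proposition 21. Consider $J$ independent polymorphic loci in populations A and B with respective parametric reference allele frequencies $a_j, b_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, and $\Phi_2(A_j)\Phi_2(B_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the unbiased estimator $\tilde{F}_2(A,B)$ has approximate variance

$$
\mathrm{Var}[\tilde{F}_2(A,B)] \approx \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,B_j)G(B_j).
$$

Proof. Recall that

$$
\tilde{F}_2(A_j,B_j) = \hat{F}_2(A_j,B_j) - \Phi_2(A_j)\tilde{G}(A_j) - \Phi_2(B_j)\tilde{G}(B_j),
$$

where $\tilde{F}_2(A_j,B_j)$ is an unbiased estimator for $F_2(A_j,B_j)$ and $\tilde{G}(P_j)$ is an unbiased estimator of $G(P_j)$ for $P \in \{A,B\}$ at locus $j \in \{1,2,\ldots,J\}$. Also, from the proof of Proposition 3, we have

$$
\mathbb{E}[\hat{F}_2(A_j,B_j)] = F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j).
$$

Therefore, we have that

$$
\begin{aligned}
\mathrm{Var}[\tilde{F}_2(A_j,B_j)] &= \mathrm{Var}[\hat{F}_2(A_j,B_j) - \Phi_2(A_j)\tilde{G}(A_j) - \Phi_2(B_j)\tilde{G}(B_j)] \\
&= \mathrm{Var}[\hat{F}_2(A_j,B_j)] + \Phi_2(A_j)^2\mathrm{Var}[\tilde{G}(A_j)] + \Phi_2(B_j)^2\mathrm{Var}[\tilde{G}(B_j)] \\
&\quad - 2\Phi_2(A_j)\mathrm{Cov}[\hat{F}_2(A_j,B_j),\tilde{G}(A_j)] - 2\Phi_2(B_j)\mathrm{Cov}[\hat{F}_2(A_j,B_j),\tilde{G}(B_j)] \\
&\quad + 2\Phi_2(A_j)\Phi_2(B_j)\mathrm{Cov}[\tilde{G}(A_j),\tilde{G}(B_j)] \\
&\approx \mathrm{Var}[\hat{F}_2(A_j,B_j)] - 2\Phi_2(A_j)\mathrm{Cov}[\hat{F}_2(A_j,B_j),\tilde{G}(A_j)] - 2\Phi_2(B_j)\mathrm{Cov}[\hat{F}_2(A_j,B_j),\tilde{G}(B_j)],
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)^2$ and $\Phi_2(B_j)^2$ are negligible compared to $\Phi_2(A_j)$ and $\Phi_2(B_j)$ as an approximation, and where $\mathrm{Cov}[\tilde{G}(A_j),\tilde{G}(B_j)] = 0$ because drawing alleles in population A is independent of population B. Moreover, because $\tilde{G}(P_j) = \hat{G}(P_j)/[1-\Phi_2(P_j)]$, we have

$$
\begin{aligned}
\mathrm{Var}[\tilde{F}_2(A_j,B_j)] &\approx \mathrm{Var}[\hat{F}_2(A_j,B_j)] - \frac{2\Phi_2(A_j)}{1-\Phi_2(A_j)}\mathrm{Cov}[\hat{F}_2(A_j,B_j),\hat{G}(A_j)] - \frac{2\Phi_2(B_j)}{1-\Phi_2(B_j)}\mathrm{Cov}[\hat{F}_2(A_j,B_j),\hat{G}(B_j)] \\
&\approx 4\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 4\Phi_2(B_j)F_2(A_j,B_j)G(B_j) + 4\Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j) \\
&\quad - \frac{2\Phi_2(A_j)}{1-\Phi_2(A_j)}\left[2\Phi_2(A_j)G(A_j)^2 - 2\Phi_2(A_j)G(A_j)G(B_j) - 2\Phi_2(A_j)F_2(A_j,B_j)G(A_j)\right] \\
&\quad - \frac{2\Phi_2(B_j)}{1-\Phi_2(B_j)}\left[2\Phi_2(B_j)G(B_j)^2 - 2\Phi_2(B_j)G(A_j)G(B_j) - 2\Phi_2(B_j)F_2(A_j,B_j)G(B_j)\right].
\end{aligned}
$$

Recalling the assumption that $\Phi_2(A_j)^2$, $\Phi_2(B_j)^2$, and $\Phi_2(A_j)\Phi_2(B_j)$ are negligible compared to $\Phi_2(A_j)$ and $\Phi_2(B_j)$, we have

$$
\mathrm{Var}[\tilde{F}_2(A_j,B_j)] \approx 4\Phi_2(A_j)F_2(A_j,B_j)G(A_j) + 4\Phi_2(B_j)F_2(A_j,B_j)G(B_j) \approx \mathrm{Var}[\hat{F}_2(A_j,B_j)],
$$

and it follows that

$$
\mathrm{Var}[\tilde{F}_2(A,B)] \approx \mathrm{Var}[\hat{F}_2(A,B)].
$$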

Proposition 22. Consider J independent polymorphic loci in a populations A and B with respective parametric reference allele frequencies aj,bj(0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P{A,B}, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms Φ3(Pj),Φ4(Pj),Φ2,2(Pj),Φ2(Pj)2,Φ2(Aj)Φ2(Bj), and other terms of similar order negligible to Φ2(Pj). Based on this simplifying assumption, the estimator F2(A,B) has approximate variance

Var[F2(A,B)]4J2j=1JΦ2(Aj)F2(Aj,Bj)G(Aj)+4J2j=1JΦ2(Bj)F2(Aj,Bj)G(Bj).

Proof. Recall that

F2(Aj,Bj)=F^2(Aj,Bj)1k=1N(Aj)mk1G^(Aj)1k=1N(Aj)mk1G^(Aj),

where F2(Aj,Bj) and F^2(Aj,Bj) are estimators for F2(Aj,Bj) and G^(Pj) is an estimator of G(Pj) for P{A,B} at locus j{1,2,,J}. Also, from the proof of Proposition 3, we have

E[F^2(Aj,Bj)]=F2(Aj,Bj)+Φ2(Aj)G(Aj)+Φ2(Bj)G(Bj).

Therefore, we have that

Var[F2(Aj,Bj)]=Var[F^2(Aj,Bj)1k=1N(Aj)mk1G^(Aj)1k=1N(Aj)mk1G^(Aj)]=Var[F^2(Aj,Bj)]+[1k=1N(Aj)mk1]2Var[G^(Aj)]+[1k=1N(Bj)mk1]2Var[G^(Bj)]2[1k=1N(Aj)mk1]Cov[F^2(Aj,Bj),G^(Aj)]2[1k=1N(Bj)mk1]Cov[F^2(Aj,Bj),G^(Bj)]+2[1k=1N(Aj)mk1][1k=1N(Bj)mk1]Cov[G^(Aj),G^(Bj)]Var[F^2(Aj,Bj)]2[1k=1N(Aj)mk1]Cov[F^2(Aj,Bj),G^(Aj)]2[1k=1N(Bj)mk1]Cov[F^2(Aj,Bj),G^(Bj)],

where we used the fact that Φ2(Aj)2 and Φ2(Bj)2 are negligible compared to Φ2(Aj) and Φ2(Bj) as an approximation (and hence 1/[k=1N(Pj)mk1] assuming N(Pj) is large enough), and where Cov[G^(Aj),G^(Bj)]=0 because drawing alleles in population A is independent of population B. Moreover, because

Cov[F^2(Aj,Bj),G^(Aj)]Φ2(Aj)F2(Aj,Bj)G(Aj)Cov[F^2(Aj,Bj),G^(Bj)]Φ2(Bj)F2(Aj,Bj)G(Bj)

we have

[1k=1N(Aj)mk1]Cov[F^2(Aj,Bj),G^(Aj)]0[1k=1N(Bj)mk1]Cov[F^2(Aj,Bj),G^(Bj)]0

by recalling the assumption that Φ2(Aj)2,Φ2(Bj)2, or any other term of similar magnitude (such as the Φ2(Pj)/[k=1N(Pj)mk1] terms that appear for populations P{A,B}) and Φ2(Aj)Φ2(Bj) are negligible compared to Φ2(Aj) and Φ2(Bj). Because of this, we have

Var[F2(Aj,Bj)]Var[F^2(Aj,Bj)],

and it follows that

Var[F2(A,B)]Var[F^2(A,B)].

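Proposition 23. Consider $J$ independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j, b_j, c_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, $\Phi_2(A_j)\Phi_2(B_j)$, $\Phi_2(A_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(C_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the estimator $\hat{F}_3(A;B,C)$ has approximate variance

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_3(A;B,C)] &\approx \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)F_2(A_j,B_j)G(C_j).
\end{aligned}
$$

Proof. From the proof of Proposition 6, we have

$$
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)] = F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j),
$$

which gives

$$
\begin{aligned}
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)]^2 &= \left[F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j)\right]^2 \\
&= F_3(A_j;B_j,C_j)^2 + 2\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)^2G(A_j)^2 \\
&\approx F_3(A_j;B_j,C_j)^2 + 2\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j),
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)^2$ is negligible compared to $\Phi_2(A_j)$ as an approximation. We also calculate

$$
\begin{aligned}
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)^2] &= \mathbb{E}[(\hat{a}_j-\hat{b}_j)^2(\hat{a}_j-\hat{c}_j)^2] \\
&= \mathbb{E}[\hat{a}_j^4 - 2\hat{a}_j^3\hat{c}_j + \hat{a}_j^2\hat{c}_j^2 - 2\hat{a}_j^3\hat{b}_j + 4\hat{a}_j^2\hat{b}_j\hat{c}_j - 2\hat{a}_j\hat{b}_j\hat{c}_j^2 + \hat{a}_j^2\hat{b}_j^2 - 2\hat{a}_j\hat{b}_j^2\hat{c}_j + \hat{b}_j^2\hat{c}_j^2] \\
&\approx a_j^4 + 6\Phi_2(A_j)a_j^3(1-a_j) - 2\left[a_j^3 + 3\Phi_2(A_j)a_j^2(1-a_j)\right]c_j \\
&\quad + \left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]\left[c_j^2 + \Phi_2(C_j)c_j(1-c_j)\right] - 2\left[a_j^3 + 3\Phi_2(A_j)a_j^2(1-a_j)\right]b_j \\
&\quad + 4\left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]b_jc_j - 2a_jb_j\left[c_j^2 + \Phi_2(C_j)c_j(1-c_j)\right] \\
&\quad + \left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]\left[b_j^2 + \Phi_2(B_j)b_j(1-b_j)\right] - 2a_j\left[b_j^2 + \Phi_2(B_j)b_j(1-b_j)\right]c_j \\
&\quad + \left[b_j^2 + \Phi_2(B_j)b_j(1-b_j)\right]\left[c_j^2 + \Phi_2(C_j)c_j(1-c_j)\right] \\
&= (a_j-b_j)^2(a_j-c_j)^2 + \Phi_2(A_j)\left[6(a_j-b_j)(a_j-c_j) + (b_j-c_j)^2\right]a_j(1-a_j) \\
&\quad + \Phi_2(B_j)(a_j-c_j)^2b_j(1-b_j) + \Phi_2(C_j)(a_j-b_j)^2c_j(1-c_j) \\
&\quad + \Phi_2(A_j)\Phi_2(B_j)a_j(1-a_j)b_j(1-b_j) + \Phi_2(A_j)\Phi_2(C_j)a_j(1-a_j)c_j(1-c_j) \\
&\quad + \Phi_2(B_j)\Phi_2(C_j)b_j(1-b_j)c_j(1-c_j) \\
&= F_3(A_j;B_j,C_j)^2 + 6\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \Phi_2(C_j)F_2(A_j,B_j)G(C_j) \\
&\quad + \Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j) + \Phi_2(A_j)\Phi_2(C_j)G(A_j)G(C_j) + \Phi_2(B_j)\Phi_2(C_j)G(B_j)G(C_j) \\
&\approx F_3(A_j;B_j,C_j)^2 + 6\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \Phi_2(C_j)F_2(A_j,B_j)G(C_j),
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)\Phi_2(B_j)$, $\Phi_2(A_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(C_j)$ are negligible compared to $\Phi_2(A_j)$, $\Phi_2(B_j)$, and $\Phi_2(C_j)$ as an approximation. Therefore, we have that

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_3(A_j;B_j,C_j)] &= \mathbb{E}[\hat{F}_3(A_j;B_j,C_j)^2] - \mathbb{E}[\hat{F}_3(A_j;B_j,C_j)]^2 \\
&\approx F_3(A_j;B_j,C_j)^2 + 6\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \Phi_2(C_j)F_2(A_j,B_j)G(C_j) \\
&\quad - \left[F_3(A_j;B_j,C_j)^2 + 2\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j)\right] \\
&= 4\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \Phi_2(C_j)F_2(A_j,B_j)G(C_j),
\end{aligned}
$$

which gives

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_3(A;B,C)] &= \mathrm{Var}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{F}_3(A_j;B_j,C_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Var}[\hat{F}_3(A_j;B_j,C_j)] \\
&\approx \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)F_2(A_j,B_j)G(C_j).
\end{aligned}
$$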

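Lemma 24. Consider $J$ independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies $a_j, b_j, c_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the estimators $\hat{F}_3(A;B,C)$ and $\hat{G}(A)$ have an approximate covariance

$$
\mathrm{Cov}[\hat{F}_3(A;B,C),\hat{G}(A)] \approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)\left[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j\right].
$$

Proof. From the proofs of Lemma 1 and Proposition 6, we have

$$
\mathbb{E}[\hat{G}(A_j)] = [1-\Phi_2(A_j)]G(A_j)
$$

and

$$
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)] = F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j),
$$

yielding

$$
\begin{aligned}
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)]\,\mathbb{E}[\hat{G}(A_j)] &= \left[F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j)\right][1-\Phi_2(A_j)]G(A_j) \\
&= F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)G(A_j)^2 - \Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) - \Phi_2(A_j)^2G(A_j)^2 \\
&\approx F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)G(A_j)^2 - \Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j),
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)^2$ is negligible compared to $\Phi_2(A_j)$ as an approximation. We also calculate

$$
\begin{aligned}
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)\hat{G}(A_j)] &= \mathbb{E}[(\hat{a}_j-\hat{b}_j)(\hat{a}_j-\hat{c}_j)\hat{a}_j(1-\hat{a}_j)] = \mathbb{E}[(\hat{a}_j^2 - \hat{a}_j\hat{b}_j - \hat{a}_j\hat{c}_j + \hat{b}_j\hat{c}_j)(\hat{a}_j - \hat{a}_j^2)] \\
&= \mathbb{E}[\hat{a}_j^3] - \mathbb{E}[\hat{a}_j^2]\mathbb{E}[\hat{b}_j] - \mathbb{E}[\hat{a}_j^2]\mathbb{E}[\hat{c}_j] + \mathbb{E}[\hat{a}_j]\mathbb{E}[\hat{b}_j]\mathbb{E}[\hat{c}_j] \\
&\quad - \mathbb{E}[\hat{a}_j^4] + \mathbb{E}[\hat{a}_j^3]\mathbb{E}[\hat{b}_j] + \mathbb{E}[\hat{a}_j^3]\mathbb{E}[\hat{c}_j] - \mathbb{E}[\hat{a}_j^2]\mathbb{E}[\hat{b}_j]\mathbb{E}[\hat{c}_j] \\
&\approx a_j^3 + 3\Phi_2(A_j)a_j^2(1-a_j) - \left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]b_j - \left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]c_j + a_jb_jc_j \\
&\quad - \left[a_j^4 + 6\Phi_2(A_j)a_j^3(1-a_j)\right] + \left[a_j^3 + 3\Phi_2(A_j)a_j^2(1-a_j)\right]b_j \\
&\quad + \left[a_j^3 + 3\Phi_2(A_j)a_j^2(1-a_j)\right]c_j - \left[a_j^2 + \Phi_2(A_j)a_j(1-a_j)\right]b_jc_j.
\end{aligned}
$$

Recognizing that $G(A_j) = a_j(1-a_j)$ and $F_3(A_j;B_j,C_j) = a_j^2 - a_jb_j - a_jc_j + b_jc_j$, we have

$$
\begin{aligned}
\mathbb{E}[\hat{F}_3(A_j;B_j,C_j)\hat{G}(A_j)] &\approx a_j^3 - a_j^2b_j - a_j^2c_j + a_jb_jc_j - a_j^4 + a_j^3b_j + a_j^3c_j - a_j^2b_jc_j \\
&\quad + \Phi_2(A_j)\left[3a_j - b_j - c_j - 6a_j^2 + 3a_jb_j + 3a_jc_j - b_jc_j\right]G(A_j) \\
&= (a_j^2 - a_jb_j - a_jc_j + b_jc_j)(a_j - a_j^2) \\
&\quad + \Phi_2(A_j)\left[3(a_j - a_j^2) - 3(a_j^2 - a_jb_j - a_jc_j + b_jc_j) - b_j(1-c_j) - (1-b_j)c_j\right]G(A_j) \\
&= F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)\left[3G(A_j) - 3F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j\right]G(A_j) \\
&= F_3(A_j;B_j,C_j)G(A_j) + 3\Phi_2(A_j)G(A_j)^2 - 3\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) \\
&\quad - \Phi_2(A_j)G(A_j)b_j(1-c_j) - \Phi_2(A_j)G(A_j)(1-b_j)c_j \\
&= [1-3\Phi_2(A_j)]F_3(A_j;B_j,C_j)G(A_j) + 3\Phi_2(A_j)G(A_j)^2 \\
&\quad - \Phi_2(A_j)G(A_j)b_j(1-c_j) - \Phi_2(A_j)G(A_j)(1-b_j)c_j.
\end{aligned}
$$

Therefore, we have that

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)] &= \mathbb{E}[\hat{F}_3(A_j;B_j,C_j)\hat{G}(A_j)] - \mathbb{E}[\hat{F}_3(A_j;B_j,C_j)]\,\mathbb{E}[\hat{G}(A_j)] \\
&\approx [1-3\Phi_2(A_j)]F_3(A_j;B_j,C_j)G(A_j) + 3\Phi_2(A_j)G(A_j)^2 \\
&\quad - \Phi_2(A_j)G(A_j)b_j(1-c_j) - \Phi_2(A_j)G(A_j)(1-b_j)c_j \\
&\quad - \left[F_3(A_j;B_j,C_j)G(A_j) + \Phi_2(A_j)G(A_j)^2 - \Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j)\right] \\
&= 2\Phi_2(A_j)G(A_j)^2 - 2\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) - \Phi_2(A_j)G(A_j)b_j(1-c_j) - \Phi_2(A_j)G(A_j)(1-b_j)c_j \\
&= \Phi_2(A_j)G(A_j)\left[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j\right],
\end{aligned}
$$

which gives

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_3(A;B,C),\hat{G}(A)] &= \mathrm{Cov}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{F}_3(A_j;B_j,C_j),\frac{1}{J}\sum_{j=1}^{J}\hat{G}(A_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)] \\
&\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)\left[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j\right].
\end{aligned}
$$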

Proposition 25. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies aj,bj,cj(0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P{A,B,C}, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms Φ3(Pj),Φ4(Pj),Φ2,2(Pj),Φ2(Pj)2,Φ2(Aj)Φ2(Bj),Φ2(Aj)Φ2(Cj), and Φ2(Bj)Φ2(Cj) negligible to Φ2(Pj). Based on this simplifying assumption, the unbiased estimator F˜3(A;B,C) has approximate variance

Var[F˜3(A;B,C)]4J2j=1JΦ2(Aj)F3(Aj;Bj,Cj)G(Aj)+1J2j=1JΦ2(Aj)F2(Bj,Cj)G(Aj)+1J2j=1JΦ2(Bj)F2(Aj,Cj)G(Bj)+1J2j=1JΦ2(Cj)F2(Aj,Bj)G(Cj).

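Proof. Recall that

$$
\tilde{F}_3(A_j;B_j,C_j) = \hat{F}_3(A_j;B_j,C_j) - \Phi_2(A_j)\tilde{G}(A_j),
$$

where $\tilde{F}_3(A_j;B_j,C_j)$ is an unbiased estimator for $F_3(A_j;B_j,C_j)$ and $\tilde{G}(A_j) = \hat{G}(A_j)/[1-\Phi_2(A_j)]$ is an unbiased estimator of $G(A_j)$ at locus $j \in \{1,2,\ldots,J\}$. Also, from the proof of Proposition 6, we have

$$
E[\hat{F}_3(A_j;B_j,C_j)] = F_3(A_j;B_j,C_j) + \Phi_2(A_j)G(A_j).
$$

Therefore, we have that

$$
\begin{aligned}
\mathrm{Var}[\tilde{F}_3(A_j;B_j,C_j)] &= \mathrm{Var}[\hat{F}_3(A_j;B_j,C_j) - \Phi_2(A_j)\tilde{G}(A_j)] \\
&= \mathrm{Var}[\hat{F}_3(A_j;B_j,C_j)] + \Phi_2(A_j)^2\mathrm{Var}[\tilde{G}(A_j)] - 2\Phi_2(A_j)\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\tilde{G}(A_j)] \\
&\approx \mathrm{Var}[\hat{F}_3(A_j;B_j,C_j)] - \frac{2\Phi_2(A_j)}{1-\Phi_2(A_j)}\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)],
\end{aligned}
$$

where we used the facts that $\tilde{G}(A_j) = \hat{G}(A_j)/[1-\Phi_2(A_j)]$ and that $\Phi_2(A_j)^2$ is negligible compared to $\Phi_2(A_j)$ as an approximation. Substituting the covariance from the proof of Lemma 24, we have

$$
\begin{aligned}
\mathrm{Var}[\tilde{F}_3(A_j;B_j,C_j)] &\approx \mathrm{Var}[\hat{F}_3(A_j;B_j,C_j)] - \frac{2\Phi_2(A_j)}{1-\Phi_2(A_j)}\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)] \\
&= \mathrm{Var}[\hat{F}_3(A_j;B_j,C_j)] - \frac{2\Phi_2(A_j)^2}{1-\Phi_2(A_j)}G(A_j)[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j].
\end{aligned}
$$

Recalling the assumption that $\Phi_2(A_j)^2$ is negligible compared to $\Phi_2(A_j)$, we have

$$
\mathrm{Var}[\tilde{F}_3(A_j;B_j,C_j)] \approx \mathrm{Var}[\hat{F}_3(A_j;B_j,C_j)],
$$

and it follows by independence of loci that

$$
\mathrm{Var}[\tilde{F}_3(A;B,C)] \approx \mathrm{Var}[\hat{F}_3(A;B,C)],
$$

whose approximation is given in Proposition 23.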
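The four-term variance above can be read as a first-order (delta-method) variance of $(\hat{a}_j-\hat{b}_j)(\hat{a}_j-\hat{c}_j)$, because $4F_3(A_j;B_j,C_j)+F_2(B_j,C_j)=(2a_j-b_j-c_j)^2$. The following minimal sketch checks that identity with exact rational arithmetic. The function names and the input layout (one tuple of frequencies and one tuple of $\Phi_2$ values per locus) are assumptions made for illustration and are not taken from the funbiased package.

```python
from fractions import Fraction

def g(p):
    """Per-locus G(P_j) = p(1 - p)."""
    return p * (1 - p)

def f2(x, y):
    """Per-locus F2 = (x - y)^2."""
    return (x - y) ** 2

def f3(a, b, c):
    """Per-locus F3(A; B, C) = (a - b)(a - c)."""
    return (a - b) * (a - c)

def var_f3_per_locus(a, b, c, phi_a, phi_b, phi_c):
    """Four-term approximation of the single-locus F3 variance."""
    return (4 * phi_a * f3(a, b, c) * g(a)
            + phi_a * f2(b, c) * g(a)
            + phi_b * f2(a, c) * g(b)
            + phi_c * f2(a, b) * g(c))

def var_f3(freqs, phis):
    """Sum of per-locus variances over J independent loci, divided by J^2."""
    J = len(freqs)
    total = sum(var_f3_per_locus(*f, *p) for f, p in zip(freqs, phis))
    return total / J ** 2

def delta_method_var(a, b, c, phi_a, phi_b, phi_c):
    """Squared partial derivatives of (a-b)(a-c) times Phi2 * G for each population."""
    return (phi_a * g(a) * (2 * a - b - c) ** 2
            + phi_b * g(b) * (a - c) ** 2
            + phi_c * g(c) * (a - b) ** 2)
```

Passing `Fraction` inputs keeps the comparison exact, so the two expressions can be tested for equality rather than closeness.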

Proposition 26. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies aj,bj,cj(0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P{A,B,C}, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms Φ3(Pj),Φ4(Pj),Φ2,2(Pj),Φ2(Pj)2,Φ2(Aj)Φ2(Bj),Φ2(Aj)Φ2(Cj),Φ2(Bj)Φ2(Cj) and any other terms of similar order negligible to Φ2(Pj). Based on this simplifying assumption, the estimator F3(A;B,C) has approximate variance

Var[F3(A;B,C)]4J2j=1JΦ2(Aj)F3(Aj;Bj,Cj)G(Aj)+1J2j=1JΦ2(Aj)F2(Bj,Cj)G(Aj)+1J2j=1JΦ2(Bj)F2(Aj,Cj)G(Bj)+1J2j=1JΦ2(Cj)F2(Aj,Bj)G(Cj).

Proof. Recall that

F3(Aj;Bj,Cj)=F^3(Aj;Bj,Cj)1k=1N(Aj)mk1G^(Aj),

where F3(Aj;Bj,Cj) and F^3(Aj;Bj,Cj) are estimators for F3(Aj;Bj,Cj) and G^(Aj) is an estimator of G(Aj) at locus j{1,2,,J}. Also, from the proof of Proposition 6, we have

E[F^3(Aj;Bj,Cj)]=F3(Aj;Bj,Cj)+Φ2(Aj)G(Aj).

Therefore, we have that

Var[F3(Aj;Bj;Cj)]=Var[F^3(Aj;Bj,Cj)1k=1N(Aj)mk1G^(Aj)]=Var[F^3(Aj;Bj,Cj)]+[1k=1N(Aj)mk1]2Var[G^(Aj)]2[1k=1N(Aj)mk1]Cov[F^3(Aj;Bj,Cj),G^(Aj)]Var[F^3(Aj;Bj,Cj)][2k=1N(Aj)mk1]Cov[F^3(Aj;Bj,Cj),G^(Aj)],

where we used the fact that Φ2(Aj)2 is negligible compared to Φ2(Aj) as an approximation (and hence 1/[k=1N(Aj)mk1] assuming N(Aj) is large enough). Moreover, because

Cov[F^3(Aj;Bj,Cj),G^(Aj)]=Φ2(Aj)G(Aj)[2G(Aj)2F3(Aj;Bj,Cj)bj(1cj)(1bj)cj]

we have

[2k=1N(Aj)mk1]Cov[F^3(Aj;Bj,Cj),G^(Aj)]0

by recalling the assumption that Φ2(Aj)2 or any other term of similar magnitude (such as the Φ2(Aj)/[k=1N(Aj)mk1] term) are negligible compared to Φ2(Aj). Because of this, we have

Var[F3(Aj;Bj,Cj)]Var[F^3(Aj;Bj,Cj)]

and it follows that

Var[F3(A;B,C)]Var[F^3(A;B,C)].

Proposition 27. Consider $J$ independent polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, $\Phi_2(A_j)\Phi_2(C_j)$, $\Phi_2(A_j)\Phi_2(D_j)$, $\Phi_2(B_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(D_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the unbiased estimator $\hat{F}_4(A,B;C,D)$ has approximate variance

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_4(A,B;C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)]F_2(A_j,B_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)]F_2(C_j,D_j).
\end{aligned}
$$

Proof. From the proofs of Propositions 3 and 12, we have

$$
E[\hat{F}_2(A_j,B_j)] = F_2(A_j,B_j) + \Phi_2(A_j)G(A_j) + \Phi_2(B_j)G(B_j)
$$

and

$$
E[\hat{F}_4(A_j,B_j;C_j,D_j)] = F_4(A_j,B_j;C_j,D_j).
$$

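We calculate

$$
\begin{aligned}
E[\hat{F}_4(A_j,B_j;C_j,D_j)^2] &= E[(\hat{a}_j-\hat{b}_j)^2(\hat{c}_j-\hat{d}_j)^2] \\
&= E[\hat{F}_2(A_j,B_j)\hat{F}_2(C_j,D_j)] \\
&= E[\hat{F}_2(A_j,B_j)]E[\hat{F}_2(C_j,D_j)] \\
&= [F_2(A_j,B_j)+\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)] \times [F_2(C_j,D_j)+\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)] \\
&= F_4(A_j,B_j;C_j,D_j)^2 + F_2(A_j,B_j)[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)] \\
&\quad + F_2(C_j,D_j)[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)] \\
&\quad + [\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)][\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)],
\end{aligned}
$$

where we use the identity that

$$
F_4(A_j,B_j;C_j,D_j)^2 = (a_j-b_j)^2(c_j-d_j)^2 = F_2(A_j,B_j)F_2(C_j,D_j).
$$

Therefore, we have that

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_4(A_j,B_j;C_j,D_j)] &= E[\hat{F}_4(A_j,B_j;C_j,D_j)^2] - E[\hat{F}_4(A_j,B_j;C_j,D_j)]^2 \\
&= F_4(A_j,B_j;C_j,D_j)^2 + F_2(A_j,B_j)[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)] \\
&\quad + F_2(C_j,D_j)[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)] \\
&\quad + [\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)][\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)] - F_4(A_j,B_j;C_j,D_j)^2 \\
&= F_2(A_j,B_j)[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)] + F_2(C_j,D_j)[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)] \\
&\quad + [\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)][\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)] \\
&\approx F_2(A_j,B_j)[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)] + F_2(C_j,D_j)[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)],
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)\Phi_2(C_j)$, $\Phi_2(A_j)\Phi_2(D_j)$, $\Phi_2(B_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(D_j)$ are negligible compared to $\Phi_2(A_j)$, $\Phi_2(B_j)$, $\Phi_2(C_j)$, and $\Phi_2(D_j)$ as an approximation. It follows that

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_4(A,B;C,D)] &= \mathrm{Var}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{F}_4(A_j,B_j;C_j,D_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Var}[\hat{F}_4(A_j,B_j;C_j,D_j)] \\
&\approx \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)]F_2(A_j,B_j) + \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)]F_2(C_j,D_j).
\end{aligned}
$$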
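To make the size of the discarded term concrete, the sketch below evaluates the single-locus $\hat{F}_4$ variance both before and after dropping the product of $\Phi_2$ terms, using exact rational arithmetic. The function name and argument order are assumptions made for illustration.

```python
from fractions import Fraction

def g(p):
    """Per-locus G(P_j) = p(1 - p)."""
    return p * (1 - p)

def var_f4_terms(a, b, c, d, phi_a, phi_b, phi_c, phi_d):
    """Return (unapproximated, approximate, dropped) single-locus variance terms."""
    s_ab = phi_a * g(a) + phi_b * g(b)   # Phi2(A)G(A) + Phi2(B)G(B)
    s_cd = phi_c * g(c) + phi_d * g(d)   # Phi2(C)G(C) + Phi2(D)G(D)
    f2_ab = (a - b) ** 2
    f2_cd = (c - d) ** 2
    # E[F2_hat(A,B)] E[F2_hat(C,D)] - F4^2, before any term is dropped.
    full = (f2_ab + s_ab) * (f2_cd + s_cd) - f2_ab * f2_cd
    # First-order approximation kept in the proposition.
    approx = f2_ab * s_cd + f2_cd * s_ab
    # Product of Phi2 terms that the approximation discards.
    dropped = s_ab * s_cd
    return full, approx, dropped
```

For small $\Phi_2$ values the dropped term is second order, so `full` and `approx` differ by exactly `dropped`.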

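Following Wolter (2007), we have that an approximation to the variance of the ratio estimator $X/Y$ is

$$
\mathrm{Var}\left[\frac{X}{Y}\right] \approx \frac{E[X]^2}{E[Y]^2}\left[\frac{\mathrm{Var}[X]}{E[X]^2} + \frac{\mathrm{Var}[Y]}{E[Y]^2} - \frac{2\mathrm{Cov}[X,Y]}{E[X]E[Y]}\right].
$$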

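Proposition 28. Consider $J$ polymorphic loci in populations $A$, $B$, and $C$ with respective parametric reference allele frequencies $a_j, b_j, c_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, $\Phi_2(A_j)\Phi_2(B_j)$, $\Phi_2(A_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(C_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the ratio estimator $\hat{F}_3(A;B,C \mid A)$ has approximate variance

$$
\mathrm{Var}[\hat{F}_3(A;B,C \mid A)] \approx \frac{E[\hat{F}_3(A;B,C)]^2}{4E[\hat{G}(A)]^2}\left[\frac{\mathrm{Var}[\hat{F}_3(A;B,C)]}{E[\hat{F}_3(A;B,C)]^2} + \frac{\mathrm{Var}[\hat{G}(A)]}{E[\hat{G}(A)]^2} - \frac{2\mathrm{Cov}[\hat{F}_3(A;B,C),\hat{G}(A)]}{E[\hat{F}_3(A;B,C)]E[\hat{G}(A)]}\right],
$$

where the expectations are

$$
\begin{aligned}
E[\hat{F}_3(A;B,C)] &= F_3(A;B,C) + \frac{1}{J}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j) \\
E[\hat{G}(A)] &= G(A) - \frac{1}{J}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j),
\end{aligned}
$$

the variances are

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_3(A;B,C)] &\approx \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)F_2(A_j,B_j)G(C_j) \\
\mathrm{Var}[\hat{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j) - \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)^2,
\end{aligned}
$$

and the covariance is

$$
\mathrm{Cov}[\hat{F}_3(A;B,C),\hat{G}(A)] \approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j].
$$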

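Proof. Recall that

$$
\hat{F}_3(A;B,C \mid A) = \frac{\hat{F}_3(A;B,C)}{2\hat{G}(A)}.
$$

Setting $X = \hat{F}_3(A;B,C)$ and $Y = 2\hat{G}(A)$, and following the approximation in Wolter (2007), we have

$$
\mathrm{Var}[\hat{F}_3(A;B,C \mid A)] \approx \frac{E[\hat{F}_3(A;B,C)]^2}{4E[\hat{G}(A)]^2}\left[\frac{\mathrm{Var}[\hat{F}_3(A;B,C)]}{E[\hat{F}_3(A;B,C)]^2} + \frac{\mathrm{Var}[\hat{G}(A)]}{E[\hat{G}(A)]^2} - \frac{2\mathrm{Cov}[\hat{F}_3(A;B,C),\hat{G}(A)]}{E[\hat{F}_3(A;B,C)]E[\hat{G}(A)]}\right],
$$

where $E[\hat{F}_3(A;B,C)]$ is given in Proposition 6, $E[\hat{G}(A)]$ in Lemma 1, $\mathrm{Var}[\hat{F}_3(A;B,C)]$ in Proposition 23, $\mathrm{Var}[\hat{G}(A)]$ in Lemma 18, and $\mathrm{Cov}[\hat{F}_3(A;B,C),\hat{G}(A)]$ in Lemma 24. □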
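The ratio approximation used here is simple to evaluate once the five moments are known. The sketch below is a direct transcription of the Wolter (2007) expression; the function name `ratio_variance` and its argument names are assumptions made for illustration. It also shows numerically why setting $Y = 2\hat{G}(A)$ yields the factor of 4 in the denominator.

```python
import math

def ratio_variance(ex, ey, var_x, var_y, cov_xy):
    """First-order approximation to Var[X / Y].

    ex, ey    : E[X] and E[Y] (both must be nonzero)
    var_x     : Var[X]
    var_y     : Var[Y]
    cov_xy    : Cov[X, Y]
    """
    return (ex ** 2 / ey ** 2) * (
        var_x / ex ** 2
        + var_y / ey ** 2
        - 2.0 * cov_xy / (ex * ey)
    )

# Doubling the denominator (Y -> 2Y) scales E[Y] by 2, Var[Y] by 4 and
# Cov[X, Y] by 2, which divides the approximate ratio variance by 4.
base = ratio_variance(2.0, 4.0, 0.1, 0.2, 0.05)
doubled = ratio_variance(2.0, 8.0, 0.1, 0.8, 0.1)
```

Because the expression divides by `ex` and `ey`, it is undefined when either expectation is zero, for example when the numerator statistic has expectation exactly zero.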

Lemma 29. Consider $J$ independent polymorphic loci in populations $A$, $B$, and $C$ with respective parametric reference allele frequencies $a_j, b_j, c_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the unbiased estimators $\tilde{F}_3(A;B,C)$ and $\tilde{G}(A)$ have an approximate covariance

$$
\mathrm{Cov}[\tilde{F}_3(A;B,C),\tilde{G}(A)] \approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-\Phi_2(A_j)}G(A_j)[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j].
$$

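Proof. Recall that

$$
\tilde{F}_3(A_j;B_j,C_j) = \hat{F}_3(A_j;B_j,C_j) - \Phi_2(A_j)\tilde{G}(A_j),
$$

where $\tilde{G}(A_j) = \hat{G}(A_j)/[1-\Phi_2(A_j)]$. It follows that

$$
\begin{aligned}
\mathrm{Cov}[\tilde{F}_3(A_j;B_j,C_j),\tilde{G}(A_j)] &= \mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j) - \Phi_2(A_j)\tilde{G}(A_j),\tilde{G}(A_j)] \\
&= \mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\tilde{G}(A_j)] - \Phi_2(A_j)\mathrm{Var}[\tilde{G}(A_j)] \\
&= \frac{1}{1-\Phi_2(A_j)}\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)] - \frac{\Phi_2(A_j)}{[1-\Phi_2(A_j)]^2}\mathrm{Var}[\hat{G}(A_j)] \\
&\approx \frac{1}{1-\Phi_2(A_j)}\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)] - \frac{\Phi_2(A_j)}{1-2\Phi_2(A_j)}\mathrm{Var}[\hat{G}(A_j)],
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)^2$ is negligible compared to $\Phi_2(A_j)$ as an approximation. From the proofs of Lemmas 18 and 24, we have

$$
\mathrm{Var}[\hat{G}(A_j)] \approx \Phi_2(A_j)G(A_j) - 4\Phi_2(A_j)G(A_j)^2
$$

and

$$
\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)] = \Phi_2(A_j)G(A_j)[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j].
$$

Assuming that $\Phi_2(A_j)^2$ is negligible compared to $\Phi_2(A_j)$, we have that

$$
\Phi_2(A_j)\mathrm{Var}[\hat{G}(A_j)] \approx 0.
$$

We therefore have that

$$
\mathrm{Cov}[\tilde{F}_3(A_j;B_j,C_j),\tilde{G}(A_j)] \approx \frac{1}{1-\Phi_2(A_j)}\mathrm{Cov}[\hat{F}_3(A_j;B_j,C_j),\hat{G}(A_j)],
$$

and thus by independence of loci we have

$$
\begin{aligned}
\mathrm{Cov}[\tilde{F}_3(A;B,C),\tilde{G}(A)] &= \mathrm{Cov}\left[\frac{1}{J}\sum_{j=1}^{J}\tilde{F}_3(A_j;B_j,C_j),\frac{1}{J}\sum_{j=1}^{J}\tilde{G}(A_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Cov}[\tilde{F}_3(A_j;B_j,C_j),\tilde{G}(A_j)] \\
&\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-\Phi_2(A_j)}G(A_j)[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j].
\end{aligned}
$$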

Lemma 30. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies aj,bj,cj(0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P{A,B,C}, some of which may be related or inbred, where individual k{1,2,,N(Pj)} has ploidy mk. Moreover, assume that no individual is related to more than one other individual, which makes the terms Φ3(Pj),Φ4(Pj),Φ2,2(Pj),Φ2(Pj)2, and any other terms of similar order negligible to Φ2(Pj). Based on this simplifying assumption, the estimators F3(A;B,C) and G(A) have an approximate covariance

Cov[F3(A;B,C),G(A)]1J2j=1JΦ2(Aj)k=1N(Aj)mkk=1N(Aj)mk1G(Aj)[2G(Aj)2F3(Aj;Bj,Cj)bj(1cj)(1bj)cj].

Proof. Recall that

F3(Aj;Bj,Cj)=F^3(Aj;Bj,Cj)1k=1N(Aj)mk1G^(Aj),

where

G(Aj)=k=1N(Aj)mkk=1N(Aj)mk1G(Aj).

It follows that

Cov[F3(Aj;Bj,Cj),G(Aj)]=Cov[F^3(Aj;Bj,Cj)1k=1N(Aj)mk1G^(Aj),G(Aj)]=Cov[F^3(Aj;Bj,Cj),G(Aj)]1k=1N(Aj)mk1Cov[G^(Aj),G(Aj)]=k=1N(Aj)mkk=1N(Aj)mk1Cov[F^3(Aj;Bj,Cj),G^(Aj)]k=1N(Aj)mk[k=1N(Aj)mk1]2Var[G^(Aj)].

From the proofs of Lemmas 18 and 24, we have

Var[G^(Aj)]Φ2(Aj)G(Aj)4Φ2(Aj)G(Aj)2

and

Cov[F^3(Aj;Bj,Cj),G^(Aj)]=Φ2(Aj)G(Aj)[2G(Aj)2F3(Aj;Bj,Cj)bj(1cj)(1bj)cj].

By recalling the assumption that Φ2(Aj)2 or any other term of similar magnitude (such as the Φ2(Aj)/[k=1N(Aj)mk1] term for N(Aj) large enough) is negligible compared to Φ2(Aj), we have that

1k=1N(Aj)mk1Var[G^(Aj)]0.

We therefore have that

Cov[F3(Aj;Bj,Cj),G(Aj)]k=1N(Aj)mkk=1N(Aj)mk1Cov[F^3(Aj;Bj,Cj),G^(Aj)],

and thus by independence of loci we have

Cov[F3(A;B,C),G(A)]=Cov[1Jj=1JF3(Aj;Bj,Cj),1Jj=1JG(Aj)]=1J2j=1JCov[F3(Aj;Bj,Cj),G(Aj)]1J2j=1JΦ2(Aj)k=1N(Aj)mkk=1N(Aj)mk1G(Aj)[2G(Aj)2F3(Aj;Bj,Cj)bj(1cj)(1bj)cj]. 

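Proposition 31. Consider $J$ polymorphic loci in populations $A$, $B$, and $C$ with respective parametric reference allele frequencies $a_j, b_j, c_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, $\Phi_2(A_j)\Phi_2(B_j)$, $\Phi_2(A_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(C_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the approximately unbiased ratio estimator $\tilde{F}_3(A;B,C \mid A)$ has approximate variance

$$
\mathrm{Var}[\tilde{F}_3(A;B,C \mid A)] \approx \frac{F_3(A;B,C)^2}{4G(A)^2}\left[\frac{\mathrm{Var}[\tilde{F}_3(A;B,C)]}{F_3(A;B,C)^2} + \frac{\mathrm{Var}[\tilde{G}(A)]}{G(A)^2} - \frac{2\mathrm{Cov}[\tilde{F}_3(A;B,C),\tilde{G}(A)]}{F_3(A;B,C)G(A)}\right],
$$

where the variances are

$$
\begin{aligned}
\mathrm{Var}[\tilde{F}_3(A;B,C)] &\approx \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_3(A_j;B_j,C_j)G(A_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)F_2(B_j,C_j)G(A_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)F_2(A_j,C_j)G(B_j) + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)F_2(A_j,B_j)G(C_j) \\
\mathrm{Var}[\tilde{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-2\Phi_2(A_j)}G(A_j) - \frac{4}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-2\Phi_2(A_j)}G(A_j)^2,
\end{aligned}
$$

and the covariance is

$$
\mathrm{Cov}[\tilde{F}_3(A;B,C),\tilde{G}(A)] \approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-\Phi_2(A_j)}G(A_j)[2G(A_j) - 2F_3(A_j;B_j,C_j) - b_j(1-c_j) - (1-b_j)c_j].
$$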

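Proof. Recall that

$$
\tilde{F}_3(A;B,C \mid A) = \frac{\tilde{F}_3(A;B,C)}{2\tilde{G}(A)},
$$

where $\tilde{F}_3(A;B,C)$ is an unbiased estimator for $F_3(A;B,C)$ and $\tilde{G}(A)$ is an unbiased estimator of $G(A)$. Setting $X = \tilde{F}_3(A;B,C)$ and $Y = 2\tilde{G}(A)$, and following the approximation in Wolter (2007), we have

$$
\mathrm{Var}[\tilde{F}_3(A;B,C \mid A)] \approx \frac{F_3(A;B,C)^2}{4G(A)^2}\left[\frac{\mathrm{Var}[\tilde{F}_3(A;B,C)]}{F_3(A;B,C)^2} + \frac{\mathrm{Var}[\tilde{G}(A)]}{G(A)^2} - \frac{2\mathrm{Cov}[\tilde{F}_3(A;B,C),\tilde{G}(A)]}{F_3(A;B,C)G(A)}\right],
$$

where $\mathrm{Var}[\tilde{F}_3(A;B,C)]$ is given in Proposition 25, $\mathrm{Var}[\tilde{G}(A)]$ in Lemma 18, and $\mathrm{Cov}[\tilde{F}_3(A;B,C),\tilde{G}(A)]$ in Lemma 29. □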

Proposition 32. Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies aj,bj,cj(0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P{A,B,C}, some of which may be related or inbred, where individual k{1,2,,N(Pj)} has ploidy mk. Moreover, assume that no individual is related to more than one other individual, which makes the terms Φ3(Pj),Φ4(Pj),Φ2,2(Pj),Φ2(Pj)2,Φ2(Aj)Φ2(Bj),Φ2(Aj)Φ2(Cj),Φ2(Bj)Φ2(Cj), and any other terms of similar order negligible to Φ2(Pj). Based on this simplifying assumption, the ratio estimator F3(A;B,C|A) has approximate variance

Var[F3(A;B,C|A)]X(A;B,C)24Y(A)2[Var[F3(A;B,C)]X(A;B,C)2+Var[G(A)]Y(A)22Cov[F3(A;B,C),G(A)]X(A;B,C)Y(A)],

where the variances are

Var[F3(A;B,C)]4J2j=1JΦ2(Aj)F3(Aj;Bj,Cj)G(Aj)+1J2j=1JΦ2(Aj)F2(Bj,Cj)G(Aj)+1J2j=1JΦ2(Bj)F2(Aj,Cj)G(Bj)+1J2j=1JΦ2(Cj)F2(Aj,Bj)G(Cj)Var[G(A)]1J2j=1JΦ2(Aj)[k=1N(Aj)mkk=1N(Aj)mk1]2G(Aj)4J2j=1JΦ2(Aj)[k=1N(Aj)mkk=1N(Aj)mk1]2G(Aj)2

the covariance is

Cov[F3(A;B,C),G(A)]1J2j=1JΦ2(Aj)k=1N(Aj)mkk=1N(Aj)mk1G(Aj)[2G(Aj)2F3(Aj;Bj,Cj)bj(1cj)(1bj)cj],

and where

X(A;B,C)=F3(A;B,C)+1Jj=1JΦ2(Aj)k=1N(Aj)mk1k=1N(Aj)mk1G(Aj)Y(A)=1Jj=1J[1Φ2(Aj)]k=1N(Aj)mkk=1N(Aj)mk1G(Aj).

Proof. Recall that

F3(A;B,C,|A)=F3(A;B,C)2G(A),

where F3(A;B,C) is an estimator for F3(A;B,C) and G(A) is an estimator of G(A). Assuming that X=F3(A;B,C) and Y=2G(A), following the approximation in Wolter (2007) we have

Var[F3(A;B,C|A)]X(A;B,C)24Y(A)2[Var[F3(A;B,C)]X(A;B,C)2+Var[G(A)]Y(A)22Cov[F3(A;B,C),G(A)]X(A;B,C)Y(A)],

where Var[F3(A;B,C)] is given in Proposition 26, Var[G(A)] in Lemma 18, Cov[F3(A;B,C),G(A)] in Lemma 30, X(A;B,C)=E[F3(A;B,C)] in proof of Corollary 8, and Y(A)=E[G(A)] in proof of Corollary 11. □

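Lemma 33. Consider $J$ independent polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the estimators $\hat{F}_4(A,B;C,D)$ and $\hat{G}(P)$, $P \in \{A,B,C,D\}$, have approximate covariances

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)(1-2a_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(B)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)(1-2b_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(C)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)G(C_j)(1-2c_j)(a_j-b_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(D)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(D_j)G(D_j)(1-2d_j)(a_j-b_j).
\end{aligned}
$$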

Proof. From the proofs of Lemma 1 and Proposition 12, we have that

$$
E[\hat{G}(P_j)] = [1-\Phi_2(P_j)]G(P_j)
$$

and

$$
E[\hat{F}_4(A_j,B_j;C_j,D_j)] = F_4(A_j,B_j;C_j,D_j),
$$

yielding

$$
E[\hat{F}_4(A_j,B_j;C_j,D_j)]E[\hat{G}(P_j)] = [1-\Phi_2(P_j)]F_4(A_j,B_j;C_j,D_j)G(P_j).
$$

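We first calculate

$$
\begin{aligned}
E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{G}(A_j)] &= E[(\hat{a}_j-\hat{b}_j)(\hat{c}_j-\hat{d}_j)\hat{a}_j(1-\hat{a}_j)] \\
&= E[(\hat{a}_j-\hat{b}_j)(\hat{a}_j-\hat{a}_j^2)]E[\hat{c}_j-\hat{d}_j] \\
&= \left[E[\hat{a}_j^2] - E[\hat{a}_j^3] - E[\hat{a}_j]E[\hat{b}_j] + E[\hat{a}_j^2]E[\hat{b}_j]\right]\left[E[\hat{c}_j] - E[\hat{d}_j]\right] \\
&\approx \left[a_j^2 + \Phi_2(A_j)a_j(1-a_j) - [a_j^3 + 3\Phi_2(A_j)a_j^2(1-a_j)] - a_jb_j + [a_j^2 + \Phi_2(A_j)a_j(1-a_j)]b_j\right](c_j-d_j) \\
&= (a_j^2 - a_j^3 - a_jb_j + a_j^2b_j)(c_j-d_j) + \Phi_2(A_j)a_j(1-a_j)(1-3a_j+b_j)(c_j-d_j) \\
&= (a_j-b_j)(c_j-d_j)a_j(1-a_j) + \Phi_2(A_j)a_j(1-a_j)[1-2a_j-(a_j-b_j)](c_j-d_j) \\
&= F_4(A_j,B_j;C_j,D_j)G(A_j) - \Phi_2(A_j)F_4(A_j,B_j;C_j,D_j)G(A_j) + \Phi_2(A_j)G(A_j)(1-2a_j)(c_j-d_j) \\
&= [1-\Phi_2(A_j)]F_4(A_j,B_j;C_j,D_j)G(A_j) + \Phi_2(A_j)G(A_j)(1-2a_j)(c_j-d_j).
\end{aligned}
$$

Hence, we have that

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{G}(A_j)] &= E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{G}(A_j)] - E[\hat{F}_4(A_j,B_j;C_j,D_j)]E[\hat{G}(A_j)] \\
&\approx [1-\Phi_2(A_j)]F_4(A_j,B_j;C_j,D_j)G(A_j) + \Phi_2(A_j)G(A_j)(1-2a_j)(c_j-d_j) \\
&\quad - [1-\Phi_2(A_j)]F_4(A_j,B_j;C_j,D_j)G(A_j) \\
&= \Phi_2(A_j)G(A_j)(1-2a_j)(c_j-d_j).
\end{aligned}
$$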

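Similarly, we have that

$$
\begin{aligned}
E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{G}(B_j)] &= E[(\hat{a}_j-\hat{b}_j)(\hat{c}_j-\hat{d}_j)\hat{b}_j(1-\hat{b}_j)] \\
&= E[(\hat{a}_j-\hat{b}_j)(\hat{b}_j-\hat{b}_j^2)]E[\hat{c}_j-\hat{d}_j] \\
&= -E[(\hat{b}_j-\hat{a}_j)(\hat{b}_j-\hat{b}_j^2)]E[\hat{c}_j-\hat{d}_j] \\
&= -\left[E[\hat{b}_j^2] - E[\hat{b}_j^3] - E[\hat{a}_j]E[\hat{b}_j] + E[\hat{a}_j]E[\hat{b}_j^2]\right]\left[E[\hat{c}_j] - E[\hat{d}_j]\right] \\
&\approx -\left[b_j^2 + \Phi_2(B_j)b_j(1-b_j) - [b_j^3 + 3\Phi_2(B_j)b_j^2(1-b_j)] - a_jb_j + a_j[b_j^2 + \Phi_2(B_j)b_j(1-b_j)]\right](c_j-d_j) \\
&= -(b_j^2 - b_j^3 - a_jb_j + a_jb_j^2)(c_j-d_j) - \Phi_2(B_j)b_j(1-b_j)(1-3b_j+a_j)(c_j-d_j) \\
&= (a_j-b_j)(c_j-d_j)b_j(1-b_j) - \Phi_2(B_j)b_j(1-b_j)[1-2b_j+(a_j-b_j)](c_j-d_j) \\
&= F_4(A_j,B_j;C_j,D_j)G(B_j) - \Phi_2(B_j)F_4(A_j,B_j;C_j,D_j)G(B_j) - \Phi_2(B_j)G(B_j)(1-2b_j)(c_j-d_j) \\
&= [1-\Phi_2(B_j)]F_4(A_j,B_j;C_j,D_j)G(B_j) - \Phi_2(B_j)G(B_j)(1-2b_j)(c_j-d_j).
\end{aligned}
$$

Hence, we have that

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{G}(B_j)] &= E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{G}(B_j)] - E[\hat{F}_4(A_j,B_j;C_j,D_j)]E[\hat{G}(B_j)] \\
&\approx [1-\Phi_2(B_j)]F_4(A_j,B_j;C_j,D_j)G(B_j) - \Phi_2(B_j)G(B_j)(1-2b_j)(c_j-d_j) \\
&\quad - [1-\Phi_2(B_j)]F_4(A_j,B_j;C_j,D_j)G(B_j) \\
&= -\Phi_2(B_j)G(B_j)(1-2b_j)(c_j-d_j).
\end{aligned}
$$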

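Parallel to the derivation for $P = A$, we have

$$
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{G}(C_j)] = \Phi_2(C_j)G(C_j)(1-2c_j)(a_j-b_j)
$$

and parallel to the derivation for $P = B$, we have

$$
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{G}(D_j)] = -\Phi_2(D_j)G(D_j)(1-2d_j)(a_j-b_j).
$$

We know that by independence of loci we have

$$
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(P)] = \mathrm{Cov}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{F}_4(A_j,B_j;C_j,D_j),\frac{1}{J}\sum_{j=1}^{J}\hat{G}(P_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{G}(P_j)],
$$

which gives

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)(1-2a_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(B)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)(1-2b_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(C)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)G(C_j)(1-2c_j)(a_j-b_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(D)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(D_j)G(D_j)(1-2d_j)(a_j-b_j).
\end{aligned}
$$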
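As a sanity check on the signs and on what the approximation discards, the sketch below compares the approximate single-locus covariances for $P = A$ and $P = B$ with exact values in one special case. The special case is an assumption made purely for illustration: the sample from population $P$ consists of $n$ independent allele draws, so $\hat{p}_j$ is a scaled binomial and $1/n$ plays the role of $\Phi_2(P_j)$. All function names are hypothetical.

```python
from fractions import Fraction

def binom_raw_moments(p, n):
    """Exact E[x], E[x^2], E[x^3] for x = Binomial(n, p) / n."""
    q = 1 - p
    m1 = p
    m2 = p ** 2 + p * q / n
    m3 = p ** 3 + 3 * p ** 2 * q / n + p * q * (1 - 2 * p) / n ** 2
    return m1, m2, m3

def exact_cov_f4_g(a, b, c, d, n, which):
    """Exact Cov[F4_hat, G_hat(P)] at one locus for P in {'A', 'B'}.

    Only population P needs the binomial assumption; the other
    frequency estimators enter through their means alone.
    """
    if which == "A":
        m1, m2, m3 = binom_raw_moments(a, n)
        # E[(a_hat - b_hat)(a_hat - a_hat^2)] E[c_hat - d_hat]
        e_prod = (m2 - m3 - b * m1 + b * m2) * (c - d)
    else:
        m1, m2, m3 = binom_raw_moments(b, n)
        # E[(a_hat - b_hat)(b_hat - b_hat^2)] E[c_hat - d_hat]
        e_prod = (a * (m1 - m2) - (m2 - m3)) * (c - d)
    e_f4 = (a - b) * (c - d)
    e_g = m1 - m2
    return e_prod - e_f4 * e_g

def approx_cov_f4_g(a, b, c, d, phi, which):
    """First-order covariance: positive sign for A, negative sign for B."""
    if which == "A":
        return phi * a * (1 - a) * (1 - 2 * a) * (c - d)
    return -phi * b * (1 - b) * (1 - 2 * b) * (c - d)
```

In this binomial special case the exact covariance equals the approximation multiplied by $1 - 1/n$, so the two agree up to the second-order term that the lemma treats as negligible.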

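Proposition 34. Consider $J$ polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the ratio estimator $\hat{F}_4(A,B;C,D \mid P)$ has approximate variance

$$
\mathrm{Var}[\hat{F}_4(A,B;C,D \mid P)] \approx \frac{F_4(A,B;C,D)^2}{E[\hat{G}(P)]^2}\left[\frac{\mathrm{Var}[\hat{F}_4(A,B;C,D)]}{F_4(A,B;C,D)^2} + \frac{\mathrm{Var}[\hat{G}(P)]}{E[\hat{G}(P)]^2} - \frac{2\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(P)]}{F_4(A,B;C,D)E[\hat{G}(P)]}\right],
$$

where the expectation is

$$
E[\hat{G}(P)] = G(P) - \frac{1}{J}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j),
$$

the variances are

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_4(A,B;C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)]F_2(A_j,B_j) + \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)]F_2(C_j,D_j) \\
\mathrm{Var}[\hat{G}(P)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j) - \frac{4}{J^2}\sum_{j=1}^{J}\Phi_2(P_j)G(P_j)^2,
\end{aligned}
$$

and the covariances are

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)(1-2a_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(B)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)(1-2b_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(C)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)G(C_j)(1-2c_j)(a_j-b_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(D)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(D_j)G(D_j)(1-2d_j)(a_j-b_j).
\end{aligned}
$$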

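Proof. Recall that

$$
\hat{F}_4(A,B;C,D \mid P) = \frac{\hat{F}_4(A,B;C,D)}{\hat{G}(P)},
$$

where $\hat{F}_4(A,B;C,D)$ is an unbiased estimator for $F_4(A,B;C,D)$. Setting $X = \hat{F}_4(A,B;C,D)$ and $Y = \hat{G}(P)$, and following the approximation in Wolter (2007), we have

$$
\mathrm{Var}[\hat{F}_4(A,B;C,D \mid P)] \approx \frac{F_4(A,B;C,D)^2}{E[\hat{G}(P)]^2}\left[\frac{\mathrm{Var}[\hat{F}_4(A,B;C,D)]}{F_4(A,B;C,D)^2} + \frac{\mathrm{Var}[\hat{G}(P)]}{E[\hat{G}(P)]^2} - \frac{2\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(P)]}{F_4(A,B;C,D)E[\hat{G}(P)]}\right],
$$

where $E[\hat{G}(P)]$ is given in Lemma 1, $\mathrm{Var}[\hat{F}_4(A,B;C,D)]$ in Proposition 27, $\mathrm{Var}[\hat{G}(P)]$ in Lemma 18, and $\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{G}(P)]$ in Lemma 33 for each population $P \in \{A,B,C,D\}$. □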

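Lemma 35. Consider $J$ independent polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the unbiased estimators $\hat{F}_4(A,B;C,D)$ and $\tilde{G}(P)$, $P \in \{A,B,C,D\}$, have approximate covariances

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-\Phi_2(A_j)}G(A_j)(1-2a_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(B)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(B_j)}{1-\Phi_2(B_j)}G(B_j)(1-2b_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(C)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(C_j)}{1-\Phi_2(C_j)}G(C_j)(1-2c_j)(a_j-b_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(D)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(D_j)}{1-\Phi_2(D_j)}G(D_j)(1-2d_j)(a_j-b_j).
\end{aligned}
$$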

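Proof. Recall that $\tilde{G}(P_j) = \hat{G}(P_j)/[1-\Phi_2(P_j)]$. It follows that

$$
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\tilde{G}(P_j)] = \frac{1}{1-\Phi_2(P_j)}\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{G}(P_j)].
$$

From the proof of Lemma 33, we have

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\tilde{G}(A_j)] &\approx \frac{\Phi_2(A_j)}{1-\Phi_2(A_j)}G(A_j)(1-2a_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\tilde{G}(B_j)] &\approx -\frac{\Phi_2(B_j)}{1-\Phi_2(B_j)}G(B_j)(1-2b_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\tilde{G}(C_j)] &\approx \frac{\Phi_2(C_j)}{1-\Phi_2(C_j)}G(C_j)(1-2c_j)(a_j-b_j) \\
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\tilde{G}(D_j)] &\approx -\frac{\Phi_2(D_j)}{1-\Phi_2(D_j)}G(D_j)(1-2d_j)(a_j-b_j),
\end{aligned}
$$

yielding

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-\Phi_2(A_j)}G(A_j)(1-2a_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(B)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(B_j)}{1-\Phi_2(B_j)}G(B_j)(1-2b_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(C)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(C_j)}{1-\Phi_2(C_j)}G(C_j)(1-2c_j)(a_j-b_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(D)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(D_j)}{1-\Phi_2(D_j)}G(D_j)(1-2d_j)(a_j-b_j).
\end{aligned}
$$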

Lemma 36. Consider J independent polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies aj,bj,cj,dj(0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P{A,B,C,D}, some of which may be related or inbred, where individual k{1,2,,N(Pj)} has ploidy mk. Moreover, assume that no individual is related to more than one other individual, which makes the terms Φ3(Pj),Φ4(Pj),Φ2,2(Pj), and Φ2(Pj)2 negligible to Φ2(Pj). Based on this simplifying assumption, the estimators F^4(A,B;C,D) and G(P),P{A,B,C,D}, have approximate covariances

Cov[F^4(A,B;C,D),G(A)]1J2j=1JΦ2(Aj)k=1N(Aj)mkk=1N(Aj)mk1G(Aj)(12aj)(cjdj)Cov[F^4(A,B;C,D),G(B)]1J2j=1JΦ2(Bj)k=1N(Bj)mkk=1N(Bj)mk1G(Bj)(12bj)(cjdj)Cov[F^4(A,B;C,D),G(C)]1J2j=1JΦ2(Cj)k=1N(Cj)mkk=1N(Cj)mk1G(Cj)(12cj)(ajbj)Cov[F^4(A,B;C,D),G(D)]1J2j=1JΦ2(Dj)k=1N(Dj)mkk=1N(Dj)mk1G(Dj)(12dj)(ajbj).

Proof. Recall that

G(Pj)=k=1N(Pj)mkk=1N(Pj)mk1G^(Pj).

It follows that

Cov[F^4(Aj,Bj;Cj,Dj),G(Pj)]=k=1N(Pj)mkk=1N(Pj)mk1Cov[F^4(Aj,Bj;Cj,Dj),G^(Pj)].

From the proof of Lemma 33, we have

Cov[F^4(Aj,Bj;Cj,Dj),G(Aj)]Φ2(Aj)k=1N(Aj)mkk=1N(Aj)mk1G(Aj)(12aj)(cjdj)Cov[F^4(Aj,Bj;Cj,Dj),G(Bj)]Φ2(Gj)k=1N(Gj)mkk=1N(Gj)mk1G(Bj)(12bj)(cjdj)Cov[F^4(Aj,Bj;Cj,Dj),G(Cj)]Φ2(Cj)k=1N(Cj)mkk=1N(Cj)mk1G(Cj)(12cj)(ajbj)Cov[F^4(Aj,Bj;Cj,Dj),G(Dj)]Φ2(Dj)k=1N(Dj)mkk=1N(Dj)mk1G(Dj)(12dj)(ajbj),

yielding

Cov[F^4(A,B;C,D),G(A)]1J2j=1JΦ2(Aj)k=1N(Aj)mkk=1N(Aj)mk1G(Aj)(12aj)(cjdj)Cov[F^4(A,B;C,D),G(B)]1J2j=1JΦ2(Bj)k=1N(Bj)mkk=1N(Bj)mk1G(Bj)(12bj)(cjdj)Cov[F^4(A,B;C,D),G(C)]1J2j=1JΦ2(Cj)k=1N(Cj)mkk=1N(Cj)mk1G(Cj)(12cj)(ajbj)Cov[F^4(A,B;C,D),G(D)]1J2j=1JΦ2(Dj)k=1N(Dj)mkk=1N(Dj)mk1G(Dj)(12dj)(ajbj).

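Proposition 37. Consider $J$ polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, and $\Phi_2(P_j)^2$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the approximately unbiased ratio estimator $\tilde{F}_4(A,B;C,D \mid P)$ has approximate variance

$$
\mathrm{Var}[\tilde{F}_4(A,B;C,D \mid P)] \approx \frac{F_4(A,B;C,D)^2}{G(P)^2}\left[\frac{\mathrm{Var}[\hat{F}_4(A,B;C,D)]}{F_4(A,B;C,D)^2} + \frac{\mathrm{Var}[\tilde{G}(P)]}{G(P)^2} - \frac{2\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(P)]}{F_4(A,B;C,D)G(P)}\right],
$$

where the variances are

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_4(A,B;C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)]F_2(A_j,B_j) + \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)]F_2(C_j,D_j) \\
\mathrm{Var}[\tilde{G}(P)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(P_j)}{1-2\Phi_2(P_j)}G(P_j) - \frac{4}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(P_j)}{1-2\Phi_2(P_j)}G(P_j)^2,
\end{aligned}
$$

and the covariances are

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(A)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(A_j)}{1-\Phi_2(A_j)}G(A_j)(1-2a_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(B)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(B_j)}{1-\Phi_2(B_j)}G(B_j)(1-2b_j)(c_j-d_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(C)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(C_j)}{1-\Phi_2(C_j)}G(C_j)(1-2c_j)(a_j-b_j) \\
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(D)] &\approx -\frac{1}{J^2}\sum_{j=1}^{J}\frac{\Phi_2(D_j)}{1-\Phi_2(D_j)}G(D_j)(1-2d_j)(a_j-b_j).
\end{aligned}
$$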

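Proof. Recall that

$$
\tilde{F}_4(A,B;C,D \mid P) = \frac{\hat{F}_4(A,B;C,D)}{\tilde{G}(P)},
$$

where $\hat{F}_4(A,B;C,D)$ is an unbiased estimator for $F_4(A,B;C,D)$ and $\tilde{G}(P)$ is an unbiased estimator of $G(P)$. Setting $X = \hat{F}_4(A,B;C,D)$ and $Y = \tilde{G}(P)$, and following the approximation in Wolter (2007), we have

$$
\mathrm{Var}[\tilde{F}_4(A,B;C,D \mid P)] \approx \frac{F_4(A,B;C,D)^2}{G(P)^2}\left[\frac{\mathrm{Var}[\hat{F}_4(A,B;C,D)]}{F_4(A,B;C,D)^2} + \frac{\mathrm{Var}[\tilde{G}(P)]}{G(P)^2} - \frac{2\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(P)]}{F_4(A,B;C,D)G(P)}\right],
$$

where $\mathrm{Var}[\hat{F}_4(A,B;C,D)]$ is given in Proposition 27, $\mathrm{Var}[\tilde{G}(P)]$ in Lemma 18, and $\mathrm{Cov}[\hat{F}_4(A,B;C,D),\tilde{G}(P)]$ in Lemma 35 for each population $P \in \{A,B,C,D\}$. □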

Proposition 38. Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies aj,bj,cj,dj(0,1), and suppose we take a random sample of N(Pj) individuals at locus j in population P{A,B,C,D}, some of which may be related or inbred, where individual k{1,2,,N(Pj)} has ploidy mk. Moreover, assume that no individual is related to more than one other individual, which makes the terms Φ3(Pj),Φ4(Pj),Φ2,2(Pj), and Φ2(Pj)2 negligible to Φ2(Pj). Based on this simplifying assumption, the ratio estimator F4(A,B;C,D|P) has approximate variance

Var[F4(A,B;C,D|P)]F4(A,B;C,D)2Y(P)2[Var[F^4(A,B;C,D)]F4(A,B;C,D)2+Var[G(A)]Y(P)22Cov[F^4(A,B;C,D),G(P)]F4(A,B;C,D)Y(P)],

where the variances are

Var[F^4(A,B;C,D)]1J2j=1J[Φ2(Cj)G(Cj)+Φ2(Dj)G(Dj)]F2(Aj,Bj)+1J2j=1J[Φ2(Aj)G(Aj)+Φ2(Bj)G(Bj)]F2(Cj,Dj)Var[G(P)]1J2j=1JΦ2(Pj)[k=1N(Pj)mkk=1N(Pj)mk1]2G(Pj)4J2j=1JΦ2(Pj)[k=1N(Pj)mkk=1N(Pj)mk1]2G(Pj)2,

the covariances are

Cov[F^4(A,B;C,D),G(A)]1J2j=1JΦ2(Aj)k=1N(Aj)mkk=1N(Aj)mk1G(Aj)(12aj)(cjdj)Cov[F^4(A,B;C,D),G(B)]1J2j=1JΦ2(Bj)k=1N(Bj)mkk=1N(Bj)mk1G(Bj)(12bj)(cjdj)Cov[F^4(A,B;C,D),G(C)]1J2j=1JΦ2(Cj)k=1N(Cj)mkk=1N(Cj)mk1G(Cj)(12cj)(ajbj)Cov[F^4(A,B;C,D),G(D)]1J2j=1JΦ2(Dj)k=1N(Dj)mkk=1N(Dj)mk1G(Dj)(12dj)(ajbj),

and where

Y(P)=1Jj=1J[1Φ2(Pj)]k=1N(Pj)mkk=1N(Pj)mk1G(Pj).

Proof. Recall that

F4(A,B;C,D|P)=F^4(A,B;C,D)G(P),

where F^4(A,B;C,D) is an unbiased estimator for F4(A,B;C,D) and G(P) is an estimator of G(P). Assuming that X=F^4(A,B;C,D) and Y=G(P), following the approximation in Wolter (2007) we have

Var[F4(A,B;C,D|P)]F4(A,B;C,D)2Y(P)2[Var[F^4(A,B;C,D)]F4(A,B;C,D)2+Var[G(A)]Y(P)22Cov[F^4(A,B;C,D),G(P)]F4(A,B;C,D)Y(P)],

where Var[F^4(A,B;C,D)] is given in Proposition 27, Var[G(P)] in Lemma 18, Cov[F^4(A,B;C,D),G(P)] in Lemma 36, and Y(P)=E[G(P)] in proof of Corollary 11 for each population P{A,B,C,D}. □

Lemma 39. Consider $J$ independent polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, $\Phi_2(A_j)\Phi_2(B_j)$, $\Phi_2(A_j)\Phi_2(C_j)$, $\Phi_2(A_j)\Phi_2(D_j)$, $\Phi_2(B_j)\Phi_2(C_j)$, $\Phi_2(B_j)\Phi_2(D_j)$, and $\Phi_2(C_j)\Phi_2(D_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the unbiased estimator $\hat{H}(A,B,C,D)$ has approximate variance

$$
\begin{aligned}
\mathrm{Var}[\hat{H}(A,B,C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}(a_j+b_j-2a_jb_j)^2[\Phi_2(C_j)G(C_j)(1-2d_j)^2 + \Phi_2(D_j)G(D_j)(1-2c_j)^2] \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}(c_j+d_j-2c_jd_j)^2[\Phi_2(A_j)G(A_j)(1-2b_j)^2 + \Phi_2(B_j)G(B_j)(1-2a_j)^2].
\end{aligned}
$$

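Proof. From the proof of Lemma 17, we have that

$$
E[\hat{H}(A_j,B_j,C_j,D_j)] = H(A_j,B_j,C_j,D_j),
$$

yielding

$$
E[\hat{H}(A_j,B_j,C_j,D_j)]^2 = H(A_j,B_j,C_j,D_j)^2.
$$

We first calculate

$$
\begin{aligned}
E[\hat{H}(A_j,B_j,C_j,D_j)^2] &= E[(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)^2(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)^2] \\
&= E[(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)^2]E[(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)^2].
\end{aligned}
$$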

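We compute the first term as

$$
\begin{aligned}
E[(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)^2] &= E[\hat{a}_j^2] + E[\hat{b}_j^2] + 4E[\hat{a}_j^2]E[\hat{b}_j^2] + 2E[\hat{a}_j]E[\hat{b}_j] - 4E[\hat{a}_j^2]E[\hat{b}_j] - 4E[\hat{a}_j]E[\hat{b}_j^2] \\
&= E[\hat{a}_j^2] + E[\hat{b}_j^2] + 4E[\hat{a}_j^2]E[\hat{b}_j^2] + 2a_jb_j - 4E[\hat{a}_j^2]b_j - 4a_jE[\hat{b}_j^2] \\
&\approx a_j^2 + \Phi_2(A_j)a_j(1-a_j) + b_j^2 + \Phi_2(B_j)b_j(1-b_j) \\
&\quad + 4[a_j^2 + \Phi_2(A_j)a_j(1-a_j)][b_j^2 + \Phi_2(B_j)b_j(1-b_j)] + 2a_jb_j \\
&\quad - 4[a_j^2 + \Phi_2(A_j)a_j(1-a_j)]b_j - 4a_j[b_j^2 + \Phi_2(B_j)b_j(1-b_j)] \\
&= a_j^2 + b_j^2 + 4a_j^2b_j^2 + 2a_jb_j - 4a_j^2b_j - 4a_jb_j^2 + \Phi_2(A_j)a_j(1-a_j) + \Phi_2(B_j)b_j(1-b_j) \\
&\quad + 4\Phi_2(A_j)a_j(1-a_j)b_j^2 + 4\Phi_2(B_j)a_j^2b_j(1-b_j) + 4\Phi_2(A_j)\Phi_2(B_j)a_j(1-a_j)b_j(1-b_j) \\
&\quad - 4\Phi_2(A_j)a_j(1-a_j)b_j - 4\Phi_2(B_j)a_jb_j(1-b_j) \\
&= (a_j+b_j-2a_jb_j)^2 + \Phi_2(A_j)G(A_j)[1-4b_j+4b_j^2] + \Phi_2(B_j)G(B_j)[1-4a_j+4a_j^2] \\
&\quad + 4\Phi_2(A_j)\Phi_2(B_j)G(A_j)G(B_j) \\
&\approx (a_j+b_j-2a_jb_j)^2 + \Phi_2(A_j)G(A_j)(1-2b_j)^2 + \Phi_2(B_j)G(B_j)(1-2a_j)^2,
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)\Phi_2(B_j)$ is negligible compared to $\Phi_2(A_j)$ and $\Phi_2(B_j)$ as an approximation. Using a similar argument, we have that

$$
E[(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)^2] \approx (c_j+d_j-2c_jd_j)^2 + \Phi_2(C_j)G(C_j)(1-2d_j)^2 + \Phi_2(D_j)G(D_j)(1-2c_j)^2.
$$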

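Hence, we have that

$$
\begin{aligned}
E[\hat{H}(A_j,B_j,C_j,D_j)^2] &= E[(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)^2]E[(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)^2] \\
&\approx [(a_j+b_j-2a_jb_j)^2 + \Phi_2(A_j)G(A_j)(1-2b_j)^2 + \Phi_2(B_j)G(B_j)(1-2a_j)^2] \\
&\quad \times [(c_j+d_j-2c_jd_j)^2 + \Phi_2(C_j)G(C_j)(1-2d_j)^2 + \Phi_2(D_j)G(D_j)(1-2c_j)^2] \\
&\approx H(A_j,B_j,C_j,D_j)^2 + (a_j+b_j-2a_jb_j)^2[\Phi_2(C_j)G(C_j)(1-2d_j)^2 + \Phi_2(D_j)G(D_j)(1-2c_j)^2] \\
&\quad + (c_j+d_j-2c_jd_j)^2[\Phi_2(A_j)G(A_j)(1-2b_j)^2 + \Phi_2(B_j)G(B_j)(1-2a_j)^2],
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)\Phi_2(C_j)$, $\Phi_2(A_j)\Phi_2(D_j)$, $\Phi_2(B_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(D_j)$ are negligible compared to $\Phi_2(A_j)$, $\Phi_2(B_j)$, $\Phi_2(C_j)$, and $\Phi_2(D_j)$ as an approximation. Putting it together, we have

$$
\begin{aligned}
\mathrm{Var}[\hat{H}(A_j,B_j,C_j,D_j)] &= E[\hat{H}(A_j,B_j,C_j,D_j)^2] - E[\hat{H}(A_j,B_j,C_j,D_j)]^2 \\
&= E[\hat{H}(A_j,B_j,C_j,D_j)^2] - H(A_j,B_j,C_j,D_j)^2 \\
&\approx (a_j+b_j-2a_jb_j)^2[\Phi_2(C_j)G(C_j)(1-2d_j)^2 + \Phi_2(D_j)G(D_j)(1-2c_j)^2] \\
&\quad + (c_j+d_j-2c_jd_j)^2[\Phi_2(A_j)G(A_j)(1-2b_j)^2 + \Phi_2(B_j)G(B_j)(1-2a_j)^2].
\end{aligned}
$$

Given the assumption of independent loci, we have

$$
\begin{aligned}
\mathrm{Var}[\hat{H}(A,B,C,D)] &= \mathrm{Var}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{H}(A_j,B_j,C_j,D_j)\right] = \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Var}[\hat{H}(A_j,B_j,C_j,D_j)] \\
&\approx \frac{1}{J^2}\sum_{j=1}^{J}(a_j+b_j-2a_jb_j)^2[\Phi_2(C_j)G(C_j)(1-2d_j)^2 + \Phi_2(D_j)G(D_j)(1-2c_j)^2] \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}(c_j+d_j-2c_jd_j)^2[\Phi_2(A_j)G(A_j)(1-2b_j)^2 + \Phi_2(B_j)G(B_j)(1-2a_j)^2].
\end{aligned}
$$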
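The expansion of $E[(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)^2]$ in this proof can be checked mechanically. The sketch below takes the second-moment inflations $s_x = \Phi_2(X_j)G(X_j)$ as given inputs and confirms that the only term removed by the approximation is $4 s_a s_b$. Function and argument names are assumptions made for illustration.

```python
from fractions import Fraction

def second_moment_factor(x, y, s_x, s_y):
    """Return (unapproximated, approximate) E[(x_hat + y_hat - 2 x_hat y_hat)^2].

    Assumes independent unbiased estimators with
    E[x_hat^2] = x^2 + s_x and E[y_hat^2] = y^2 + s_y.
    """
    ex2 = x ** 2 + s_x
    ey2 = y ** 2 + s_y
    full = ex2 + ey2 + 4 * ex2 * ey2 + 2 * x * y - 4 * ex2 * y - 4 * x * ey2
    approx = ((x + y - 2 * x * y) ** 2
              + s_x * (1 - 2 * y) ** 2
              + s_y * (1 - 2 * x) ** 2)
    return full, approx

def var_h_per_locus(a, b, c, d, phi_a, phi_b, phi_c, phi_d):
    """First-order single-locus variance of H_hat from the lemma."""
    s_a, s_b = phi_a * a * (1 - a), phi_b * b * (1 - b)
    s_c, s_d = phi_c * c * (1 - c), phi_d * d * (1 - d)
    h_ab = a + b - 2 * a * b
    h_cd = c + d - 2 * c * d
    return (h_ab ** 2 * (s_c * (1 - 2 * d) ** 2 + s_d * (1 - 2 * c) ** 2)
            + h_cd ** 2 * (s_a * (1 - 2 * b) ** 2 + s_b * (1 - 2 * a) ** 2))
```

Each bracketed term vanishes when the partner frequency equals 1/2, which is visible directly in the `(1 - 2 * y) ** 2` factors.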

Lemma 40. Consider $J$ independent polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, $\Phi_2(A_j)\Phi_2(C_j)$, $\Phi_2(A_j)\Phi_2(D_j)$, $\Phi_2(B_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(D_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the unbiased estimators $\hat{F}_4(A,B;C,D)$ and $\hat{H}(A,B,C,D)$ have approximate covariance

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{H}(A,B,C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2b_j) \\
&\quad - \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2a_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)G(C_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2d_j) \\
&\quad - \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(D_j)G(D_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2c_j).
\end{aligned}
$$

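Proof. From the proofs of Proposition 12 and Lemma 17, we have that

$$
E[\hat{F}_4(A_j,B_j;C_j,D_j)] = F_4(A_j,B_j;C_j,D_j)
$$

and

$$
E[\hat{H}(A_j,B_j,C_j,D_j)] = H(A_j,B_j,C_j,D_j),
$$

yielding

$$
E[\hat{F}_4(A_j,B_j;C_j,D_j)]E[\hat{H}(A_j,B_j,C_j,D_j)] = F_4(A_j,B_j;C_j,D_j)H(A_j,B_j,C_j,D_j).
$$

We first calculate

$$
\begin{aligned}
E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{H}(A_j,B_j,C_j,D_j)] &= E[(\hat{a}_j-\hat{b}_j)(\hat{c}_j-\hat{d}_j)(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)] \\
&= E[(\hat{a}_j-\hat{b}_j)(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)]E[(\hat{c}_j-\hat{d}_j)(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)].
\end{aligned}
$$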

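We compute the first term as

$$
\begin{aligned}
E[(\hat{a}_j-\hat{b}_j)(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)] &= E[\hat{a}_j^2] - 2E[\hat{a}_j^2]E[\hat{b}_j] - E[\hat{b}_j^2] + 2E[\hat{a}_j]E[\hat{b}_j^2] \\
&= E[\hat{a}_j^2](1-2E[\hat{b}_j]) - E[\hat{b}_j^2](1-2E[\hat{a}_j]) \\
&\approx [a_j^2 + \Phi_2(A_j)a_j(1-a_j)][1-2b_j] - [b_j^2 + \Phi_2(B_j)b_j(1-b_j)][1-2a_j] \\
&= (a_j-b_j)(a_j+b_j-2a_jb_j) + \Phi_2(A_j)G(A_j)(1-2b_j) - \Phi_2(B_j)G(B_j)(1-2a_j).
\end{aligned}
$$

Using a similar argument, we have that

$$
E[(\hat{c}_j-\hat{d}_j)(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)] \approx (c_j-d_j)(c_j+d_j-2c_jd_j) + \Phi_2(C_j)G(C_j)(1-2d_j) - \Phi_2(D_j)G(D_j)(1-2c_j).
$$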

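Hence, we have that

$$
\begin{aligned}
E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{H}(A_j,B_j,C_j,D_j)] &= E[(\hat{a}_j-\hat{b}_j)(\hat{a}_j+\hat{b}_j-2\hat{a}_j\hat{b}_j)]E[(\hat{c}_j-\hat{d}_j)(\hat{c}_j+\hat{d}_j-2\hat{c}_j\hat{d}_j)] \\
&\approx [(a_j-b_j)(a_j+b_j-2a_jb_j) + \Phi_2(A_j)G(A_j)(1-2b_j) - \Phi_2(B_j)G(B_j)(1-2a_j)] \\
&\quad \times [(c_j-d_j)(c_j+d_j-2c_jd_j) + \Phi_2(C_j)G(C_j)(1-2d_j) - \Phi_2(D_j)G(D_j)(1-2c_j)] \\
&\approx F_4(A_j,B_j;C_j,D_j)H(A_j,B_j,C_j,D_j) + \Phi_2(A_j)G(A_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2b_j) \\
&\quad - \Phi_2(B_j)G(B_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2a_j) \\
&\quad + \Phi_2(C_j)G(C_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2d_j) \\
&\quad - \Phi_2(D_j)G(D_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2c_j),
\end{aligned}
$$

where we used the fact that $\Phi_2(A_j)\Phi_2(C_j)$, $\Phi_2(A_j)\Phi_2(D_j)$, $\Phi_2(B_j)\Phi_2(C_j)$, and $\Phi_2(B_j)\Phi_2(D_j)$ are negligible compared to $\Phi_2(A_j)$, $\Phi_2(B_j)$, $\Phi_2(C_j)$, and $\Phi_2(D_j)$ as an approximation. Putting it together, we have

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{H}(A_j,B_j,C_j,D_j)] &= E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{H}(A_j,B_j,C_j,D_j)] - E[\hat{F}_4(A_j,B_j;C_j,D_j)]E[\hat{H}(A_j,B_j,C_j,D_j)] \\
&= E[\hat{F}_4(A_j,B_j;C_j,D_j)\hat{H}(A_j,B_j,C_j,D_j)] - F_4(A_j,B_j;C_j,D_j)H(A_j,B_j,C_j,D_j) \\
&\approx \Phi_2(A_j)G(A_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2b_j) - \Phi_2(B_j)G(B_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2a_j) \\
&\quad + \Phi_2(C_j)G(C_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2d_j) - \Phi_2(D_j)G(D_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2c_j).
\end{aligned}
$$

Given the assumption of independent loci, we have

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{H}(A,B,C,D)] &= \mathrm{Cov}\left[\frac{1}{J}\sum_{j=1}^{J}\hat{F}_4(A_j,B_j;C_j,D_j),\frac{1}{J}\sum_{j=1}^{J}\hat{H}(A_j,B_j,C_j,D_j)\right] \\
&= \frac{1}{J^2}\sum_{j=1}^{J}\mathrm{Cov}[\hat{F}_4(A_j,B_j;C_j,D_j),\hat{H}(A_j,B_j,C_j,D_j)] \\
&\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2b_j) \\
&\quad - \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2a_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)G(C_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2d_j) \\
&\quad - \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(D_j)G(D_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2c_j).
\end{aligned}
$$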

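Proposition 41. Consider $J$ polymorphic loci in populations $A$, $B$, $C$, and $D$ with respective parametric reference allele frequencies $a_j, b_j, c_j, d_j \in (0,1)$, and suppose we take a random sample of $N(P_j)$ individuals at locus $j$ in population $P \in \{A,B,C,D\}$, some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms $\Phi_3(P_j)$, $\Phi_4(P_j)$, $\Phi_{2,2}(P_j)$, $\Phi_2(P_j)^2$, $\Phi_2(A_j)\Phi_2(B_j)$, $\Phi_2(A_j)\Phi_2(C_j)$, $\Phi_2(A_j)\Phi_2(D_j)$, $\Phi_2(B_j)\Phi_2(C_j)$, $\Phi_2(B_j)\Phi_2(D_j)$, and $\Phi_2(C_j)\Phi_2(D_j)$ negligible relative to $\Phi_2(P_j)$. Based on this simplifying assumption, the approximately unbiased ratio estimator $\hat{D}(A,B,C,D)$ has approximate variance

$$
\mathrm{Var}[\hat{D}(A,B,C,D)] \approx \frac{F_4(A,B;C,D)^2}{H(A,B,C,D)^2}\left[\frac{\mathrm{Var}[\hat{F}_4(A,B;C,D)]}{F_4(A,B;C,D)^2} + \frac{\mathrm{Var}[\hat{H}(A,B,C,D)]}{H(A,B,C,D)^2} - \frac{2\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{H}(A,B,C,D)]}{F_4(A,B;C,D)H(A,B,C,D)}\right],
$$

where the variances are

$$
\begin{aligned}
\mathrm{Var}[\hat{F}_4(A,B;C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(C_j)G(C_j)+\Phi_2(D_j)G(D_j)]F_2(A_j,B_j) + \frac{1}{J^2}\sum_{j=1}^{J}[\Phi_2(A_j)G(A_j)+\Phi_2(B_j)G(B_j)]F_2(C_j,D_j) \\
\mathrm{Var}[\hat{H}(A,B,C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}(a_j+b_j-2a_jb_j)^2[\Phi_2(C_j)G(C_j)(1-2d_j)^2 + \Phi_2(D_j)G(D_j)(1-2c_j)^2] \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}(c_j+d_j-2c_jd_j)^2[\Phi_2(A_j)G(A_j)(1-2b_j)^2 + \Phi_2(B_j)G(B_j)(1-2a_j)^2],
\end{aligned}
$$

and the covariance is

$$
\begin{aligned}
\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{H}(A,B,C,D)] &\approx \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(A_j)G(A_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2b_j) \\
&\quad - \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(B_j)G(B_j)(c_j-d_j)(c_j+d_j-2c_jd_j)(1-2a_j) \\
&\quad + \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(C_j)G(C_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2d_j) \\
&\quad - \frac{1}{J^2}\sum_{j=1}^{J}\Phi_2(D_j)G(D_j)(a_j-b_j)(a_j+b_j-2a_jb_j)(1-2c_j).
\end{aligned}
$$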

Proof. Recall that

$$
\hat{D}(A,B,C,D) = \frac{\hat{F}_4(A,B;C,D)}{\hat{H}(A,B,C,D)},
$$

where $\hat{F}_4(A,B;C,D)$ is an unbiased estimator for $F_4(A,B;C,D)$ and $\hat{H}(A,B,C,D)$ is an unbiased estimator of $H(A,B,C,D)$. Setting $X = \hat{F}_4(A,B;C,D)$ and $Y = \hat{H}(A,B,C,D)$, and following the approximation in Wolter (2007), we have

$$
\mathrm{Var}[\hat{D}(A,B,C,D)] \approx \frac{F_4(A,B;C,D)^2}{H(A,B,C,D)^2}\left[\frac{\mathrm{Var}[\hat{F}_4(A,B;C,D)]}{F_4(A,B;C,D)^2} + \frac{\mathrm{Var}[\hat{H}(A,B,C,D)]}{H(A,B,C,D)^2} - \frac{2\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{H}(A,B,C,D)]}{F_4(A,B;C,D)H(A,B,C,D)}\right],
$$

where $\mathrm{Var}[\hat{F}_4(A,B;C,D)]$ is given in Proposition 27, $\mathrm{Var}[\hat{H}(A,B,C,D)]$ in Lemma 39, and $\mathrm{Cov}[\hat{F}_4(A,B;C,D),\hat{H}(A,B,C,D)]$ in Lemma 40. □
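The pieces of this proposition can be assembled into a single function. The sketch below is one possible arrangement, not the funbiased implementation: the function name, the input layout (one `(a, b, c, d)` tuple and one tuple of the four $\Phi_2$ values per locus), and the use of parametric rather than estimated frequencies are all assumptions made for illustration. The ratio expression is multiplied out so that it does not divide by $F_4$, which is algebraically equivalent to the bracketed form whenever $F_4 \neq 0$.

```python
import math

def g(p):
    """Per-locus G(P_j) = p(1 - p)."""
    return p * (1.0 - p)

def d_stat_variance(freqs, phis):
    """Approximate Var[D_hat] from per-locus frequencies and Phi2 values."""
    J = len(freqs)
    f4 = h = var_f4 = var_h = cov = 0.0
    for (a, b, c, d), (pa, pb, pc, pd) in zip(freqs, phis):
        sa, sb, sc, sd = pa * g(a), pb * g(b), pc * g(c), pd * g(d)
        h_ab = a + b - 2.0 * a * b
        h_cd = c + d - 2.0 * c * d
        f4 += (a - b) * (c - d)
        h += h_ab * h_cd
        # Variance of F4_hat (Proposition 27).
        var_f4 += (sc + sd) * (a - b) ** 2 + (sa + sb) * (c - d) ** 2
        # Variance of H_hat (Lemma 39).
        var_h += (h_ab ** 2 * (sc * (1 - 2 * d) ** 2 + sd * (1 - 2 * c) ** 2)
                  + h_cd ** 2 * (sa * (1 - 2 * b) ** 2 + sb * (1 - 2 * a) ** 2))
        # Covariance of F4_hat and H_hat (Lemma 40).
        cov += ((sa * (1 - 2 * b) - sb * (1 - 2 * a)) * (c - d) * h_cd
                + (sc * (1 - 2 * d) - sd * (1 - 2 * c)) * (a - b) * h_ab)
    f4 /= J
    h /= J
    var_f4 /= J ** 2
    var_h /= J ** 2
    cov /= J ** 2
    # Ratio approximation, multiplied out to avoid dividing by F4.
    return var_f4 / h ** 2 + f4 ** 2 * var_h / h ** 4 - 2.0 * f4 * cov / h ** 3
```

Because every variance and covariance term is linear in the $\Phi_2$ values, doubling all of them doubles the returned approximation, and setting them all to zero returns zero.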

Literature cited

  1. Cockerham CC. 1971. Higher order probability functions of identity of allelles by descent. Genetics. 69:235–246.
  2. DeGiorgio M, Jankovic I, Rosenberg NA. 2010. Unbiased estimation of gene diversity in samples containing related individuals: exact variance and arbitrary ploidy. Genetics. 186:1367–1387.
  3. DeGiorgio M, Rosenberg NA. 2009. An unbiased estimator of gene diversity in samples containing related individuals. Mol Biol Evol. 26:501–512.
  4. Eaton DAR, Ree RH. 2013. Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae). Syst Biol. 62:689–706.
  5. Epstein MP, Duren WL, Boehnke M. 2000. Improved inference of relationship for pairs of individuals. Am J Hum Genet. 67:1219–1231.
  6. Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, et al.; The 1000 Genomes Project. 2011. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci USA. 108:11983–11988.
  7. Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, et al. 2010. A draft sequence of the Neandertal genome. Science. 328:710–722.
  8. Hajdinjak M, Fu Q, Hübner A, Petr M, Mafessoni F, et al. 2018. Reconstructing the genetic history of late Neanderthals. Nature. 555:652–656.
  9. Haller BC, Messer PW. 2019. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Mol Biol Evol. 36:632–637.
  10. Harris AM, DeGiorgio M. 2017a. An unbiased estimator of gene diversity with improved variance for samples containing related and inbred individuals of any ploidy. G3 (Bethesda). 7:671–691.
  11. Harris AM, DeGiorgio M. 2017b. Admixture and ancestry inference from ancient and modern samples through measures of population genetic drift. Hum Biol. 89:21–46.
  12. Harris DL. 1964. Genotypic covariances between inbred relatives. Genetics. 50:1319–1348.
  13. Huson D, Klöpper T, Lockhart P, Steel M. 2005. Reconstruction of reticulate networks from gene trees. RECOMB 2005. 3500:233–249.
  14. Kim HL, Ratan A, Perry GH, Montenegro A, Miller W, et al. 2014. Khoisan hunter-gatherers have been the largest population throughout most of modern-human demographic history. Nat Commun. 5:5692.
  15. Kulathinal RJ, Stevison LS, Noor MAF. 2009. The genomics of speciation in Drosophila: diversity, divergence, and introgression estimated using low-coverage genome sequencing. PLoS Genet. 5:e1000550.
  16. Lange K. 2002. Mathematical and Statistical Methods for Genetic Analysis. New York, NY: Springer.
  17. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, et al. 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 319:1100–1104.
  18. Martin SH, Davey JW, Jiggins CD. 2015. Evaluating the use of ABBA–BABA statistics to locate introgressed loci. Mol Biol Evol. 32:244–257.
  19. McPeek MS, Wu X, Ober C. 2004. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 60:359–367.
  20. Molinaro L, Montinaro F, Yelmen B, Marnetto D, Behar DM, et al. 2019. West Asian sources of the Eurasian component in Ethiopians: a reassessment. Sci Rep. 9:18811.
  21. Moorjani P, Thangaraj K, Patterson N, Lipson M, Loh P-R, et al. 2013. Genetic evidence for recent population mixture in India. Am J Hum Genet. 93:422–438.
  22. Nei M, Roychoudhury AK. 1974. Sampling variances of heterozygosity and genetic distance. Genetics. 76:379–390.
  23. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, et al. 2012. Ancient admixture in human history. Genetics. 192:1065–1093.
  24. Payseur BA, Nachman MW. 2000. Microsatellite variation and recombination rate in the human genome. Genetics. 156:1285–1298.
  25. Pease JB, Hahn MW. 2015. Detection and polarization of introgression in a five-taxon phylogeny. Syst Biol. 64:651–662.
  26. Peter BM. 2016. Admixture, population structure, and f-statistics. Genetics. 202:1485–1501.
  27. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, et al. 2012. Reconstructing Native American population history. Nature. 488:370–374.
  28. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. 2009. Reconstructing Indian population history. Nature. 461:489–494.
  29. Rosenberg NA. 2006. Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann Hum Genet. 70:841–847.
  30. Scally A, Durbin R. 2012. Revising the human mutation rate: implications for understanding human evolution. Nat Rev Genet. 13:745–753.
  31. Soraggi S, Wiuf C, Albrechtsen A. 2018. Powerful inference with the D-statistic on low-coverage whole-genome data. G3 (Bethesda). 8:551–566.
  32. Takahata N. 1993. Allelic genealogy and human evolution. Mol Biol Evol. 10:2–22.
  33. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature. 526:68–74.
  34. Turissini DA, Matute DR. 2017. Fine scale mapping of genomic introgressions within the Drosophila yakuba clade. PLoS Genet. 13:e1006971.
  35. Waples RS, Anderson EC. 2017. Purging putative siblings from population genetic data sets: a cautionary view. Mol Ecol. 26:1211–1224.
  36. Weir B. 1996. Genetic Data Analysis II. Sunderland, Massachusetts: Sinauer Associates.
  37. Weir BS. 1989. Sampling properties of gene diversity. In: Plant Population Genetics, Breeding and Genetic Resources. p. 23–42.
  38. Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution. 38:1358–1370.
  39. Wolter KM. 2007. Introduction to Variance Estimation. 2nd ed. New York, NY: Springer.
  40. Zheng Y, Janke A. 2018. Gene flow analysis method, the D-statistic, is robust in a wide parameter space. BMC Bioinformatics. 19:10.

Data Availability Statement

Supplementary data are available at Genetics online. The 1000 Genomes Project data used in this publication are available at http://www.1000genomes.org/. The relatedness information used to generate Figure 7 is available online in Supplementary Tables S7–S15 of Rosenberg (2006).


Articles from Genetics are provided here courtesy of Oxford University Press
