Abstract
The Patterson F- and D-statistics are commonly used measures for quantifying population relationships and for testing hypotheses about demographic history. These statistics make use of allele frequency information across populations to infer different aspects of population history, such as population structure and introgression events. Inclusion of related or inbred individuals can bias such statistics, which often leads to the filtering of such individuals. Here, we derive statistical properties of the F- and D-statistics, including their biases due to the inclusion of related or inbred individuals, their variances, and their corresponding mean squared errors. Moreover, for those statistics that are biased, we develop unbiased estimators and evaluate the variances of these new quantities. Comparisons of the new unbiased statistics to the originals demonstrate that our newly derived statistics often have lower error across a wide population parameter space. Furthermore, we apply these unbiased estimators using several global human populations with the inclusion of related individuals to highlight their application on an empirical dataset. Finally, we implement these unbiased estimators in the open-source software package funbiased for easy application by the scientific community.
Keywords: inbreeding, relatedness, introgression, demography, population genetics, gene flow
Introduction
The recently introduced F- and D-statistics (Huson et al. 2005; Kulathinal et al. 2009; Reich et al. 2009; Green et al. 2010; Patterson et al. 2012) have transformed the way geneticists measure population differentiation. These statistics have been instrumental in many major recent discoveries, including testing which Neanderthal populations are closest to the populations that admixed with modern humans (Hajdinjak et al. 2018), and detecting which population is likely the admixing source for European admixture in modern Ethiopian populations (Molinaro et al. 2019). Iterating through different combinations of populations using the F4- and D-statistics has allowed reconstruction of population histories in diverse groups such as Native Americans and South Asians (Reich et al. 2012; Moorjani et al. 2013). In addition, the D-statistics have been used extensively to provide evidence of introgression and hybridization among species of Drosophila fruit flies and Heliconius butterflies (Martin et al. 2015; Turissini and Matute 2017).
In many cases, however, the populations tested by these statistics are small, and proper random sampling may include data from related individuals. It is common to remove one or more of the relatives from a group of related individuals because including them might bias the value of the statistic being measured (Rosenberg 2006; DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a; Waples and Anderson 2017). For this reason, we explore whether the current estimators for these statistics are biased by the inclusion of related or inbred individuals and, if so, develop unbiased estimators for such scenarios.
These statistics are flexible and relatively simple to compute, as they measure genetic drift along branches of a population tree by contrasting allele frequencies between different combinations of populations. Using allele frequency data from two, three, or four populations, these statistics measure shared variation along specific branches of the tree relating the populations. We begin by providing intuitive descriptions and formal definitions of each of the F- and D-statistics that we evaluate. Specifically, consider that we have allele frequency data at J biallelic loci from each of four populations, denoted A, B, C, and D. We denote the parametric population frequencies of the reference allele at locus j as aj, bj, cj, and dj in populations A, B, C, and D, respectively, which are unknown quantities that we will ultimately need to estimate prior to computing the F- and D-statistics from data. We begin by defining the true parametric population F- and D-statistics (Reich et al. 2009; Patterson et al. 2012) and then proceed in the Theory section to define sample estimators of these quantities, and show that our new estimators that account for related and inbred individuals reduce to the unbiased estimators in Appendix A of Patterson et al. (2012) when samples contain unrelated and noninbred individuals.
We first examine the F2 statistic, which measures the amount of genetic drift separating a pair of populations. It is thus a test for differentiation between them, akin to the widely used fixation index (Weir and Cockerham 1984). For a pair of populations A and B, we define the F2 statistic as
| (1) |
where for locus j
| (2) |
It is clear from this definition that F2 takes values between zero, when the populations have identical allele frequencies, and one, when the populations are fixed for different alleles (Figure 1).
Figure 1.
Trees showing the different relationships the F- and D-statistics are designed to test. F2 can test the differentiation of two populations A and B. F3 can test for introgression or relatedness between populations A and B or populations A and C. The placement of population A as an ingroup branch is arbitrary, because the statistic measures the amount of genetic drift along branch A, and thus assumes the three-population tree is unrooted. However, because we are also showing how F3 can be employed to detect admixture in population A, we depict population A as an ingroup for visual convenience. F4 can test the hypothesis of whether two populations are closer to each other than they are to two other populations, in this case whether A and B are closer to each other than they are to C and D. The D statistic can test whether there has been admixture between population C and either population A or B.
The F3 statistic employs allele frequencies from three populations, and measures the amount of genetic drift along the branch leading to a target population, given allele frequency data from two reference populations. For a target population A and two reference populations B and C, we define the F3 statistic as
| (3) |
where for locus j
| (4) |
Because it measures genetic drift along a branch leading to a target population, its value is expected to be nonnegative. However, an interesting property of the F3 statistic is that it can be negative if the target population experienced admixture, and therefore a negative value directly indicates admixture in the history of the target population (Reich et al. 2009; Patterson et al. 2012). Admixture is not guaranteed to produce negative values, though (Reich et al. 2009; Patterson et al. 2012), so a nonnegative F3 is inconclusive as a test for admixture. Moreover, because loci with higher minor allele frequencies may affect F3 more than loci with lower minor allele frequencies, the F3 statistic is sometimes normalized (Reich et al. 2009; Patterson et al. 2012) by the level of diversity in the target population. Formally, this normalized F3 statistic is defined as
| (5) |
where we define for population P (here P = A)
| (6) |
such that for locus j
| (7) |
The F4 statistic, on the other hand, is a test of “treeness” among a set of four populations, examining whether the unrooted tree relating four populations is supported by the allele frequencies within the set of populations. For a pair of sister populations A and B and a pair of sister populations C and D, we define the F4 statistic as
| (8) |
where for locus j
| (9) |
If the unrooted relationship is true, then F4 takes the value of zero, and is nonzero otherwise. If it is known a priori that the unrooted relationship should be true, then a nonzero F4 statistic can be indicative of admixture, and the sign of the statistic will suggest which set of populations may be violating the assumed unrooted tree topology (Figure 1). As with the F3 statistic, a normalized version (Reich et al. 2009; Patterson et al. 2012) of the F4 statistic is sometimes used, with normalization based on the diversity of one of the four populations. Formally, this normalized F4 statistic has definition
| (10) |
where we normalize by diversity in one of the four populations, P ∈ {A, B, C, D}.
Finally, the D-statistic is a special version of the F4 statistic that is a test of treeness for a particular asymmetric rooted tree relating four populations, with the tree topology containing a pair of sister populations, together with a close and a distant outgroup population (Figure 1). For sister populations A and B, close outgroup population C, and distant outgroup population D, we define the D statistic as
| (11) |
where
| (12) |
is a normalizing factor to constrain the D statistic to take values between negative one and one, such that for locus j
| (13) |
If the rooted relationship is true, then D takes the value of zero, and is nonzero otherwise. A nonzero D value can be used to detect admixture between the close outgroup population and one of the two sister populations based on its sign (Figure 1).
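As a concrete illustration of the parametric definitions in this section, the following minimal Python sketch computes each statistic from lists of population allele frequencies. It assumes the standard forms of Patterson et al. (2012): per-locus terms (a − b)², (a − b)(a − c), and (a − b)(c − d) averaged over loci, diversity 2p(1 − p), and the D normalizer (a + b − 2ab)(c + d − 2cd). All function names are illustrative and are not taken from the funbiased package.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    """Arithmetic mean over loci."""
    return sum(values) / len(values)


def f2(a: Sequence[float], b: Sequence[float]) -> float:
    """F2(A, B): mean squared allele frequency difference."""
    return _mean([(x - y) ** 2 for x, y in zip(a, b)])


def f3(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """F3(A; B, C): drift on the branch to target A, given references B and C."""
    return _mean([(x - y) * (x - z) for x, y, z in zip(a, b, c)])


def f4(a: Sequence[float], b: Sequence[float],
       c: Sequence[float], d: Sequence[float]) -> float:
    """F4(A, B; C, D): covariance of the A-B and C-D frequency differences."""
    return _mean([(w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)])


def diversity(p: Sequence[float]) -> float:
    """G(P): mean expected heterozygosity 2 p (1 - p) over loci (assumed form)."""
    return _mean([2 * x * (1 - x) for x in p])


def f3_normalized(a: Sequence[float], b: Sequence[float],
                  c: Sequence[float]) -> float:
    """F3 divided by the diversity of the target population A."""
    return f3(a, b, c) / diversity(a)


def f4_normalized(a: Sequence[float], b: Sequence[float], c: Sequence[float],
                  d: Sequence[float], p: Sequence[float]) -> float:
    """F4 divided by the diversity of a chosen normalizing population P."""
    return f4(a, b, c, d) / diversity(p)


def d_statistic(a: Sequence[float], b: Sequence[float],
                c: Sequence[float], d: Sequence[float]) -> float:
    """D: F4 divided by a normalizer that keeps the ratio within [-1, 1]."""
    norm = _mean([(w + x - 2 * w * x) * (y + z - 2 * y * z)
                  for w, x, y, z in zip(a, b, c, d)])
    return f4(a, b, c, d) / norm


# Populations fixed for different alleles at every locus give F2 = 1.
print(f2([1.0, 0.0], [0.0, 1.0]))  # 1.0
# A target whose frequency lies midway between B and C gives a negative F3.
print(f3([0.5], [0.0], [1.0]))  # -0.25
```

The second call mirrors the admixture signal described above: when the target frequency lies between the two reference frequencies, the per-locus product is negative.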
Theory
The F- and D-statistics presented in the Introduction are defined in terms of population allele frequencies, and are thus parameters of the set of populations. To estimate them, we first need to build an estimator of allele frequencies based on samples. We denote estimates of the reference allele frequencies at locus j, j = 1, 2, …, J, in populations A, B, C, and D by âj, b̂j, ĉj, and d̂j, respectively.
As used previously (e.g., McPeek et al. 2004; DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a), a linear unbiased estimator of population reference allele frequency p at a biallelic locus can be defined as
| (14) |
where N(P) is the number of individuals sampled at the locus, Xk is the frequency of the reference allele in individual k at the locus, and is the weight of individual k in population P at the locus. McPeek et al. (2004) discussed the impact of various weighting schemes on allele frequency estimation, and Harris and DeGiorgio (2017a) examined the effects of weighting scheme on estimation of expected heterozygosity.
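The linear estimator described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the estimator is the weighted sum of per-individual reference allele frequencies and that the weights sum to one (a requirement for unbiasedness that is assumed here rather than quoted from the text). The function name is hypothetical.

```python
import math
from typing import Sequence


def linear_frequency_estimate(x: Sequence[float], w: Sequence[float]) -> float:
    """Weighted estimate sum_k w_k * X_k of the reference allele frequency.

    x[k] is the reference allele frequency within individual k
    (0, 0.5, or 1 for a diploid); w[k] is that individual's weight.
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if not math.isclose(sum(w), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to one")
    return sum(wk * xk for wk, xk in zip(w, x))


# Four diploids with 0, 1, 2, and 1 reference alleles, weighted equally.
estimate = linear_frequency_estimate([0.0, 0.5, 1.0, 0.5], [0.25] * 4)
print(estimate)  # 0.5
```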
Plugging the sample estimate of allele frequencies in place of parametric population allele frequencies, estimators of the F- and D-statistics can be computed as
| (15) |
| (16) |
| (17) |
| (18) |
| (19) |
| (20) |
where
| (21) |
| (22) |
| (23) |
and where
| (24) |
| (25) |
with
| (26) |
| (27) |
In the following, we discuss properties of these estimators, and where appropriate, develop unbiased estimators for the statistics that are biased due to the inclusion of related or inbred individuals in the sample.
To begin, we define the kinship coefficient Φxy between individuals x and y as the probability that a pair of sampled alleles, one from x and one from y, are identical by descent if x ≠ y, and as the probability that a pair of alleles sampled with replacement from individual x are identical by descent if x = y (Lange 2002). A pair of unrelated individuals x and y have kinship coefficient Φxy = 0 (Lange 2002). Moreover, an individual x with ploidy mx has kinship coefficient Φxx = [1 + (mx − 1)fx]/mx, where fx is the inbreeding coefficient of individual x, and is defined as the probability that a pair of alleles sampled without replacement in individual x are identical by descent (DeGiorgio et al. 2010). A noninbred individual x has inbreeding coefficient fx = 0, and so if x is noninbred, then their kinship coefficient is Φxx = 1/mx. If an individual is haploid for regions of their genome, then by definition Φxx = 1 at those loci. Directly accounting for the ploidy of individuals is especially pertinent for X-linked loci, as sampled male individuals will be haploid and sampled females will be diploid, thereby leading noninbred male individuals to have self-kinship coefficients mirroring completely inbred individuals (Φxx = 1), whereas noninbred females at the same loci will have self-kinship coefficients consistent with noninbred diploids (Φxx = 1/2). As in DeGiorgio et al. (2010) and Harris and DeGiorgio (2017a), we define the weighted mean kinship coefficients across sets of individuals sampled in population P at locus j as
| (28) |
| (29) |
| (30) |
| (31) |
which are the weighted mean kinship coefficients for the individuals sampled at locus j in population P for pairs, triples, quadruples, and pairs of pairs of individuals, respectively. Here, , and are kinship coefficients respectively defining the probabilities that a trio of alleles from individuals w, x, and y, a quadruple of alleles from individuals w, x, y, and z, and a pair of alleles from w and x and a pair of alleles from y and z are identical by descent (Harris 1964; Cockerham 1971). In particular, , and are identical to the Cockerham (1971) coefficients denoted by θ, γ, δ, and Δ, respectively.
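To make the pairwise quantity concrete, the sketch below computes a self-kinship coefficient from ploidy and inbreeding, and then the weighted mean pairwise kinship over a sample. It assumes the self-kinship form [1 + (m − 1)f]/m, which follows from sampling two alleles with replacement (the same allele is drawn with probability 1/m; otherwise the two alleles are identical by descent with probability f), and it assumes the mean is taken over all ordered pairs including self-pairs. Only the pairwise coefficient is covered; the triple, quadruple, and pair-of-pairs coefficients are not attempted. Function names are illustrative.

```python
from typing import Sequence


def self_kinship(ploidy: int, inbreeding: float = 0.0) -> float:
    """Probability that two alleles drawn with replacement from one
    individual are identical by descent."""
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    if not 0.0 <= inbreeding <= 1.0:
        raise ValueError("inbreeding coefficient must lie in [0, 1]")
    return (1.0 + (ploidy - 1) * inbreeding) / ploidy


def mean_pairwise_kinship(weights: Sequence[float],
                          kinship: Sequence[Sequence[float]]) -> float:
    """Sum over ordered pairs (x, y), self-pairs included, of
    w_x * w_y * Phi_xy."""
    n = len(weights)
    return sum(weights[i] * weights[j] * kinship[i][j]
               for i in range(n) for j in range(n))


# Two noninbred diploid full siblings (kinship 1/4) plus one unrelated
# noninbred diploid, all weighted equally.
diag = self_kinship(2)
phi = [[diag, 0.25, 0.0],
       [0.25, diag, 0.0],
       [0.0, 0.0, diag]]
phi_bar = mean_pairwise_kinship([1 / 3] * 3, phi)
print(phi_bar)
```

With three unrelated noninbred diploids the same calculation gives 1/6, so the sibling pair raises the mean kinship to 2/9 in this example.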
These Cockerham (1971) pairwise and extended kinship coefficients for measuring identity-by-descent probabilities among individuals are computed from sets of alleles sampled from two, three, or four individuals within the same population, whereas the F2, F3, and F4 statistics respectively are computed from sets of alleles sampled from two, three, or four distinct populations. Harris (1964) derived the genotypic covariances between individuals within a population, which are related to these kinship coefficients. In an interesting connection, F2 measures the covariance (variance) in the frequency difference between alleles sampled from populations A and B, F3 the covariance in the frequency difference between alleles sampled from populations A and B and the difference between alleles sampled from populations A and C, and F4 the covariance in the frequency difference between alleles sampled from populations A and B and the difference between alleles sampled from populations C and D (Peter 2016). For this reason, we may expect that these covariances (F-statistics) will depend on the identity-by-descent probabilities defined by the Cockerham (1971) kinship coefficients, which we show is the case based on our derivations of the theoretical properties of the F-statistics.
From our definitions of kinship, we know that unrelated individuals have kinship coefficients of zero, but noninbred individuals still have positive values of their self-kinship coefficient, thereby causing the mean kinship coefficients to necessarily be positive quantities. It is for this reason that some F-statistic estimators will be biased even without related or inbred individuals, and this bias would be due to finite sample size. The estimators presented in Appendix A of Patterson et al. (2012) correct this bias due to finite sample sizes, and our goal is to further correct for the biases induced by related and inbred individuals. For accurate estimates of the drift quantities, it is therefore important to obtain unbiased estimators.
A number of quantities (particularly variances and covariances involving the F- and D-statistics) will be mathematically complex, as they will involve linear combinations of higher order mean kinship coefficients. For this reason, we follow prior studies (DeGiorgio et al. 2010; Harris and DeGiorgio 2017a) and make the simplifying assumption that no individual in a sample from population P is related to more than one other individual in the sample, such that terms , and negligible to . Moreover, we assume that individuals sampled in different populations are unrelated to each other, as well as assume that the different populations in general are independent so that alleles sampled in different populations cannot be identical by descent. Furthermore, in the cases referred to in this article, sampling is defined as statistical sampling, where the expectation is averaging over repeated sampling. Under these assumptions, we approximate a few key results from prior studies (DeGiorgio et al. 2010; Harris and DeGiorgio 2017a) that will ultimately make derivations easier. Given that is an estimate of the frequency of a reference allele at locus j in population P, we have the following expectations (approximate notation when not exact)
| (32) |
| (33) |
| (34) |
| (35) |
From prior studies (Nei and Roychoudhury 1974; Weir 1989; DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a), we know that is a downwardly biased estimator of expected heterozygosity at a locus, with the bias due to finite sample size (Nei and Roychoudhury 1974) and exacerbated by the inclusion of inbred (Weir 1989) and related (DeGiorgio and Rosenberg 2009; DeGiorgio et al. 2010; Harris and DeGiorgio 2017a) individuals in the sample. Based on this definition, is expected heterozygosity, and its estimator therefore biased. We begin by developing an unbiased estimator for G(P), as it is a key normalization quantity in the F3 and F4 statistics.
Lemma 1.
Consider J polymorphic loci in a population P with parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j, some of which may be related or inbred. The estimator has downward bias
(36) and an unbiased estimator of G(P) is
(37) where
(38) is an unbiased estimator of .
Though this result is also given based on work in page 153 of Weir (1996), we provide the proof of Lemma 1 in the Appendix for completeness. Intuitively though, because involves the product of frequencies for two alleles drawn from population P, there is a chance of having the two alleles being identical by descent by sampling the same allele twice, and is therefore a biased estimator with and without related or inbred individuals. As a corollary, we next provide the bias of due to finite sample size, and use it to construct the unbiased estimator of G(P) in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). Note that this estimator is identical to the one provided in Appendix A of Patterson et al. (2012).
Corollary 2.
Consider J polymorphic loci in a population P with parametric reference allele frequencies , and suppose we take a random sample of unrelated and noninbred individuals at locus j where individual k has ploidy mk. Assuming allele frequencies are estimated using the sample proportion, the estimator has downward bias
(39) and an unbiased estimator of G(P) is
(40) where
(41) is an unbiased estimator of . Here, is equivalent to the unbiased estimator termed in Appendix A of Patterson et al. (2012).
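The correction for the diversity estimator can be sketched per locus as follows. This is a hedged illustration rather than the exact expressions of Lemma 1 and Corollary 2: it assumes the second-moment identity E[p̂²] = p² + Φ̄p(1 − p), where Φ̄ (a name chosen here) is the weighted mean pairwise kinship including self-pairs. Under that assumption E[2p̂(1 − p̂)] = (1 − Φ̄)·2p(1 − p), so dividing by 1 − Φ̄ removes the bias. For unrelated, noninbred individuals weighted by ploidy, Φ̄ equals one over the number of sampled alleles, and the correction becomes the familiar n/(n − 1) factor.

```python
def naive_heterozygosity(p_hat: float) -> float:
    """Plug-in estimate 2 * p_hat * (1 - p_hat) at one locus."""
    return 2.0 * p_hat * (1.0 - p_hat)


def unbiased_heterozygosity(p_hat: float, phi_bar: float) -> float:
    """Plug-in estimate rescaled by 1 / (1 - phi_bar), where phi_bar is the
    weighted mean pairwise kinship of the sample (assumed known)."""
    if not 0.0 <= phi_bar < 1.0:
        raise ValueError("phi_bar must lie in [0, 1)")
    return naive_heterozygosity(p_hat) / (1.0 - phi_bar)


# Ten sampled alleles from unrelated, noninbred individuals, three of
# them the reference allele: phi_bar = 1 / 10.
count, n_alleles = 3, 10
corrected = unbiased_heterozygosity(count / n_alleles, 1.0 / n_alleles)

# Classical finite-sample form: 2 * count * (n - count) / (n * (n - 1)).
classical = 2.0 * count * (n_alleles - count) / (n_alleles * (n_alleles - 1))
print(corrected, classical)
```

The two printed values agree up to floating-point rounding, which is the reduction to the unrelated, noninbred case described in the corollary.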
We next consider examining the bias of the estimator . As with , because requires sampling two alleles from population A and two alleles from population B, we find it is biased due to the inclusion of related or inbred individuals. We present the formal result for F2 next (Proposition 3), and prove the result in the Appendix.
Proposition 3.
Consider J polymorphic loci in populations A and B with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The estimate has upward bias
(42) and an unbiased estimator of is
(43)
As one can see, the original F2 estimator is upwardly biased due to relatedness and inbreeding, and sampling within both populations A and B contributes proportionally to this bias. The new unbiased estimator corrects this bias by adjusting the computation to account for the kinship coefficients and diversity within each population, with the adjustment of diversity using the unbiased estimator presented in Lemma 1. As a corollary, we next provide the bias of the original F2 estimator due to finite sample size, and use it to construct the unbiased estimator of F2 in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). Note that this estimator is identical to the one derived in Appendix A of Patterson et al. (2012), and we provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.
Corollary 4.
Consider J polymorphic loci in populations A and B with respective parametric reference allele frequencies , and suppose we take a random sample of unrelated and noninbred individuals at locus j in population where individual k has ploidy mk. Assuming allele frequencies are estimated using the sample proportion, the estimate has upward bias
(44) and an unbiased estimator of is
(45) Here, is equivalent to the unbiased estimator termed in Appendix A of Patterson et al. (2012).
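A per-locus sketch of the F2 correction is given below. It is an illustration under a stated assumption, not a transcription of the estimators in Proposition 3 or Corollary 4: taking Var(p̂) = Φ̄p(1 − p) for each population, with independent samples across populations, gives E[(â − b̂)²] = (a − b)² + Φ̄_A a(1 − a) + Φ̄_B b(1 − b), and each excess term can be removed with an unbiased estimate of p(1 − p). The names `phi_a` and `phi_b` are hypothetical labels for the mean pairwise kinship in each sample. The second function is the finite-sample form from Appendix A of Patterson et al. (2012), included only to check that the two agree when individuals are unrelated and noninbred.

```python
def f2_naive(a_hat: float, b_hat: float) -> float:
    """Plug-in per-locus F2: squared difference of estimated frequencies."""
    return (a_hat - b_hat) ** 2


def f2_corrected(a_hat: float, b_hat: float,
                 phi_a: float, phi_b: float) -> float:
    """Plug-in F2 minus kinship-weighted diversity terms for A and B."""
    corr_a = phi_a * a_hat * (1.0 - a_hat) / (1.0 - phi_a)
    corr_b = phi_b * b_hat * (1.0 - b_hat) / (1.0 - phi_b)
    return f2_naive(a_hat, b_hat) - corr_a - corr_b


def f2_finite_sample(count_a: int, n_a: int, count_b: int, n_b: int) -> float:
    """Finite-sample form for unrelated, noninbred samples, with n_a and
    n_b counting sampled alleles."""
    h_a = count_a * (n_a - count_a) / (n_a * (n_a - 1))
    h_b = count_b * (n_b - count_b) / (n_b * (n_b - 1))
    return (count_a / n_a - count_b / n_b) ** 2 - h_a / n_a - h_b / n_b


# With unrelated, noninbred individuals, phi = 1 / (number of alleles).
general = f2_corrected(3 / 10, 6 / 8, 1 / 10, 1 / 8)
classical = f2_finite_sample(3, 10, 6, 8)
print(general, classical)
```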
Corollary 5.
Consider J polymorphic loci in populations A and B with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The estimate described in Corollary 4 [also in Appendix A of Patterson et al. (2012)] has upward bias
(46)
Similarly to the original F2 estimator, the original F3 estimator is also upwardly biased because it requires the sampling of two alleles from the target population A. We show the formal results for F3 next (Proposition 6), and prove the result in the Appendix.
Proposition 6.
Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The estimate has upward bias
(47) and an unbiased estimator of is
(48)
The bias of the original estimator is proportional to the relatedness and diversity within the target population A. The new unbiased estimator corrects the bias by adjusting the computation to account for the kinship and diversity within the target population, with the adjustment of diversity using the unbiased estimator . Moreover, it is important to note that the reference populations B and C do not contribute to bias, as only a single allele is sampled from each of these populations. As a corollary, we next provide the bias of due to finite sample size, and use it to construct the unbiased estimator of in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). Note that this estimator is identical to the one derived in Appendix A of Patterson et al. (2012), and we provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.
Corollary 7.
Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of unrelated and noninbred individuals at locus j in population where individual k has ploidy mk. Assuming allele frequencies are estimated using the sample proportion, the estimate has upward bias
(49) and an unbiased estimator of is
(50) Here, is equivalent to the unbiased estimator termed in Appendix A of Patterson et al. (2012).
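The same reasoning gives a per-locus sketch for F3. Because only the target population A contributes two alleles, the assumed identity Var(â) = Φ̄_A a(1 − a) implies E[(â − b̂)(â − ĉ)] = (a − b)(a − c) + Φ̄_A a(1 − a), so a single correction term for A suffices. As before, this is an illustration under that assumption rather than the exact form of Proposition 6, and `phi_a` is a hypothetical name for the mean pairwise kinship in the sample from A.

```python
def f3_naive(a_hat: float, b_hat: float, c_hat: float) -> float:
    """Plug-in per-locus F3 with target A and references B and C."""
    return (a_hat - b_hat) * (a_hat - c_hat)


def f3_corrected(a_hat: float, b_hat: float, c_hat: float,
                 phi_a: float) -> float:
    """Plug-in F3 minus a kinship-weighted diversity term for the target."""
    corr_a = phi_a * a_hat * (1.0 - a_hat) / (1.0 - phi_a)
    return f3_naive(a_hat, b_hat, c_hat) - corr_a


def f3_finite_sample(count_a: int, n_a: int,
                     b_hat: float, c_hat: float) -> float:
    """Finite-sample form for an unrelated, noninbred target sample of
    n_a alleles, following Appendix A of Patterson et al. (2012)."""
    a_hat = count_a / n_a
    h_a = count_a * (n_a - count_a) / (n_a * (n_a - 1))
    return (a_hat - b_hat) * (a_hat - c_hat) - h_a / n_a


general = f3_corrected(3 / 10, 0.5, 0.9, 1 / 10)
classical = f3_finite_sample(3, 10, 0.5, 0.9)
print(general, classical)
```

The reference frequencies enter without any correction, which matches the observation that populations B and C do not contribute to the bias.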
Corollary 8.
Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The estimate described in Corollary 7 [also in Appendix A of Patterson et al. (2012)] has upward bias
(51)
Given that uses the biased estimators and in its definition, we can expect that it would be biased as its component estimators are biased, and these components have different biases that are also in different directions. However, is a ratio estimator, and we can therefore not directly take its expectation to evaluate bias. Instead, we will make some simplifying assumptions and compute the approximate bias of . We show the formal results next (Proposition 9), and prove the result in the Appendix.
Proposition 9.
Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The ratio estimator is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of and that it uses in its definition, with its upward approximate bias
(52) Moreover, an approximately unbiased estimator of is
(53)
There is an upward approximate bias of the original normalized F3 estimator, and the bias is, as with the standard estimator of F3, due partially to the diversity and sampling in the target population. The new approximately unbiased estimator is based simply on the ratio of unbiased estimators of its components and . As a corollary, we next provide the approximate bias of due to finite sample size, and use it to construct the approximately unbiased estimator of in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). We provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.
Corollary 10.
Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of unrelated and noninbred individuals at locus j in population where individual k has ploidy mk. The ratio estimator is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of and that it uses in its definition, with its upward approximate bias
(54) Moreover, an approximately unbiased estimator of is
(55)
Corollary 11.
Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The ratio estimate described in Corollary 10 is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of and that it uses in its definition, with its upward approximate bias
(56) where
(57)
(58)
Finally, we move to the four population statistics F4 and D. Note that the F4 statistic by definition only samples a single allele per population, and therefore the original estimator is intuitively unbiased. We show the formal results next (Proposition 12), and prove the result in the Appendix.
Proposition 12.
Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The estimator is unbiased.
Though the original F4 estimator is unbiased, the normalized F4 and D statistics are more complicated as they are ratio estimators, meaning their biases cannot be directly assessed. However, intuitively, because both estimators have as their numerator, bias would seemingly derive from their denominator component. Next, we show formally in Proposition 13 that the normalized estimator is approximately upwardly biased, and prove the result in the Appendix.
Proposition 13.
Consider J polymorphic loci in populations A, B, C and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The ratio estimator is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of and for any population that it uses in its definition, with its upward approximate bias
(59) Moreover, an approximately unbiased estimator of is
(60)
The reasoning that the estimator has upward approximate bias is that its estimator used in its denominator is downwardly biased. By using the unbiased estimator in its place within the denominator, we find a new estimator is approximately unbiased. As a corollary, we next provide the approximate bias of due to finite sample size, and use it to construct the approximately unbiased estimator of in samples of unrelated and noninbred individuals (proof of this corollary given in the Appendix). We provide an additional corollary highlighting its bias in samples containing related or inbred individuals, which we prove in the Appendix.
Corollary 14.
Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of unrelated and noninbred individuals at locus j in population where individual k has ploidy mk. The ratio estimator is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of and that it uses in its definition, with its upward approximate bias
(61) Moreover, an approximately unbiased estimator of is
(62)
Corollary 15.
Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The ratio estimate described in Corollary 14 is approximately upwardly biased, assuming that its mean is well-approximated by the ratio of means of and that it uses in its definition, with its upward approximate bias
(63) where
(64)
The bias property of the D statistic differs from that of the normalized F4 statistic, as the estimator of its denominator is unbiased (Lemma 17 of the Appendix). Intuitively, this result is due to the denominator not having a product of frequencies for two alleles sampled from the same population. Because both its numerator and denominator are unbiased, we next show that the ratio estimator of D is approximately unbiased in Proposition 16, and prove the result in the Appendix.
Proposition 16.
Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The ratio estimator is approximately unbiased, assuming that its mean is well-approximated by the ratio of means of and that it uses in its definition.
In addition to bias, variance is an important property of an estimator, as both bias and variance are components of mean squared error (MSE). Because the formulas and derivations for the variances of the F- and D-statistics are not particularly insightful, we relegate these results to the Appendix. Specifically, we provide the variances for , , , , , , , , and in Propositions 20, 21, 22, 23, 25, 26, 28, 31, 32, 27, 34, 37, 38, and 41 of the Appendix, respectively.
Results
In the Theory and Appendix, we introduced new unbiased estimators of F2 and F3 statistics, and derived biases and variances (and hence MSEs) for the original and new estimators of F- and D-statistics. In this section, we theoretically evaluate the relative performances of the old biased estimators and the new unbiased estimators under an array of settings, including different mixtures of relatedness, inbreeding, sample sizes, and population parameters.
For all of our results, we require the kinship coefficient for each pair of individuals. To acquire these values, we need to know whether each individual is related to any other in the population, whether they are inbred, and, if so, how these relationships are quantified as kinship coefficients. To summarize relatedness across an entire sample from population P at a locus, we use
| (65) |
where and are weights of individuals w and x in population P, and in this study we use weights corresponding to the proportion of alleles contributed by individual x to the sample from population P, which is computed as
| (66) |
Here mx is the ploidy of individual x. Moreover, using this weighting scheme, we also estimate the frequency of the reference allele at a biallelic locus as the sample proportion (McPeek et al. 2004; DeGiorgio et al. 2010; Harris and DeGiorgio, 2017a)
| (67) |
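The weighting scheme described here can be sketched as follows, assuming each individual's weight is its ploidy divided by the total number of sampled alleles. Under that assumption the weighted frequency estimate equals the plain sample proportion, and for unrelated, noninbred individuals the mean kinship collapses to one over the total number of alleles, since only self-pairs contribute (each with self-kinship 1/m). Function names are illustrative.

```python
from typing import List, Sequence


def ploidy_weights(ploidies: Sequence[int]) -> List[float]:
    """Weight of each individual: its share of the sampled alleles."""
    total = sum(ploidies)
    return [m / total for m in ploidies]


def sample_proportion(ref_counts: Sequence[int],
                      ploidies: Sequence[int]) -> float:
    """Weighted frequency estimate using ploidy weights; algebraically the
    total reference allele count over the total allele count."""
    w = ploidy_weights(ploidies)
    return sum(wk * (ck / mk) for wk, ck, mk in zip(w, ref_counts, ploidies))


def mean_kinship_unrelated(ploidies: Sequence[int]) -> float:
    """Mean kinship when all individuals are unrelated and noninbred:
    only self-pairs contribute, each with self-kinship 1 / m_k."""
    w = ploidy_weights(ploidies)
    return sum(wk * wk / mk for wk, mk in zip(w, ploidies))


# X-linked locus: two diploid females and two haploid males (six alleles).
ploidies = [2, 2, 1, 1]
ref_counts = [1, 2, 0, 1]
p_hat = sample_proportion(ref_counts, ploidies)
phi_bar = mean_kinship_unrelated(ploidies)
print(p_hat, phi_bar)
```

In this example the estimate is 4/6 and the mean kinship is 1/6, up to floating-point rounding.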
Effect of population F-statistic value on mean squared error
The relationship between the population parameter for a statistic and the estimate based on a sample from the population is important to evaluate. We compare the difference in the MSE between the biased estimators, the estimators, which are unbiased in samples not containing related or inbred individuals, and our unbiased estimators to the true value of each statistic in the cases for which both estimators exist. The F2, F3, and F4 statistics require allele frequency information from either two, three, or four populations, respectively.
For our F2 comparisons, we use the sample allele frequencies from the YRI (sub-Saharan African) and CEU (central Europeans) from the 1000 Genomes Project (The 1000 Genomes Project Consortium 2015) as the true population allele frequencies to obtain the true statistic by using the population definition from the Introduction, with populations and . We use populations from the 1000 Genomes Project, as the released dataset of genotype calls across the 2504 worldwide samples does not include related individuals. To evaluate the relative performances of F2 estimators over a range of true F2 values, we randomly sample 20 independent loci from both populations for 1000 independent replicates of J = 20 loci, yielding 1000 independent draws of the true F2 statistic, which ranged across the set of values . Using these allele frequencies, along with the sample size and relatedness information, we also calculate the difference in MSE between the and estimators by using Propositions 3, 20, and 21 and the difference in MSE between the and estimators by using Corollary 5 and Proposition 22. We calculate the MSE by summing the variance and squared bias. We note that the MSEs of the unbiased estimators are equal to their variances. We repeat this process for , and as these are the estimators that are biased in their forms.
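The decomposition used here, MSE equal to variance plus squared bias, can be illustrated with a short Python sketch. The helper names are hypothetical, and the numerical values are arbitrary placeholders rather than results from this study.

```python
from statistics import mean, pvariance


def mse_from_components(bias, variance):
    """MSE assembled from its two components: variance plus squared bias."""
    return variance + bias ** 2


def mse_direct(estimates, true_value):
    """MSE computed directly as the mean squared deviation from the truth."""
    return mean((e - true_value) ** 2 for e in estimates)


# Arbitrary placeholder estimates of a statistic whose true value is 0.15.
estimates = [0.10, 0.20, 0.30]
true_value = 0.15

bias = mean(estimates) - true_value
variance = pvariance(estimates)  # population variance, matching the identity
mse_a = mse_from_components(bias, variance)
mse_b = mse_direct(estimates, true_value)
```

Both routes give the same value, and for an unbiased estimator the first reduces to the variance alone, as noted above.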
We use Propositions 6, 23, 25, and 26 and Corollary 8 to determine the MSE for all three estimators by including allele frequency information from the JPT (Japanese) population where , and , with true range for . For the normalized estimators, we compare MSE between the biased and unbiased versions by using bias and variances derived in Propositions 9, 28, 31, and 32 and Corollary 11 with true range for normalized . Finally, we estimate the MSE for the normalized estimators by including GIH (Gujarati Indian) allele frequency data and using the derivations in Propositions 13, 34, 37, and 38 and Corollary 15. In this case, we set , and for a true range of normalized .
For each analysis, we estimate MSE for instances when samples of 60 diploid individuals from each population include 30 relative pairs, including 10 avuncular relationships, 10 inbred full siblings, and 10 outbred full siblings. We also assumed every individual was related to exactly one other individual. In these estimates, all populations contain the same composition of related individuals.
The difference in between and estimators for , and show similar trends with respect to the true F-statistic values. Similar trends emerge when examining the difference between and estimators. Specifically, the difference in decreases as the true F-statistic value approaches zero (Figure 2). In our evaluation of , we considered both positive and negative values for its true value, which shows that the difference in of and exhibits a quadratic shaped trend as a function of true F4. Overall, we notice that the difference in MSE between biased and unbiased estimators is dependent on the true value of the F-statistic, with the least difference occurring when the true F-statistic is closest to zero.
Figure 2.
Difference in theoretically calculated of and (A–D) and and (E–H) estimators when including relatives or inbred individuals. The MSE is estimated for instances when samples of 60 individuals include individuals related to exactly one other in the sample, with 10 pairs of avuncular relationships, 10 pairs of inbred full siblings, and 10 pairs of outbred full siblings. Each point represents calculations from J = 20 randomly sampled loci from the 1000 Genomes Project dataset for CEU European, YRI African, JPT Japanese, and GIH Indian populations. For we use and , while for and we use , and and for we assign , and .
Effect of sample size on mean squared error
To probe how sample size within each population affects the difference in estimator error rate, we theoretically computed the MSE for both and estimators when different numbers of individuals are sampled, with the constraint that every sampled individual is related to exactly one other individual in the sample from that population. Specifically, we evaluate the impact on these estimators when sampling from one pair to 50 pairs of related individuals, with relationships of inbred full sibling pairs (), outbred full sibling pairs (), parent-offspring pairs (), and avuncular pairs (). We compute the MSE as in the Effect of population F-statistic value on mean squared error section above.
In almost all cases, the biased and the unbiased estimators displayed elevated MSE compared to their corresponding unbiased estimators (Figure 3 and Supplementary Figures S1–S3), with the estimators always having values between the and estimators, and usually being closer to than to . For all estimators, we see a clear decrease in the MSE as the number of sampled individuals increases, with the greatest error observed when two individuals are sampled. As expected, a greater sample size allows one to better estimate allele frequencies, and ultimately reduces the mean pairwise kinship coefficient within the sample, as the number of pairs in the sample grows quadratically but the number of relative pairs grows linearly. We also find that the difference in the MSE at larger sample sizes is not as pronounced for normalized F4 as it is for F2, F3, and normalized F3, as the difference in bias between the biased and unbiased estimators is much smaller for normalized F4 (Supplementary Figure S3).
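The argument that mean kinship shrinks with sample size can be sketched numerically. The Python snippet below assumes outbred diploid individuals with equal weights, a self-kinship of 1/2, and a mean kinship computed over all ordered pairs including self pairs; the function names are hypothetical.

```python
def pairwise_kinship(k1, k2):
    """Kinship of two outbred diploid relatives who share one or two alleles
    identical by descent with probabilities k1 and k2."""
    return k1 / 4 + k2 / 2


def mean_kinship_paired_sample(n_pairs, pair_kinship, self_kinship=0.5):
    """Mean kinship when each individual has exactly one relative in the sample.

    Assumption: equal weights 1/n and a double sum over all ordered pairs,
    so the numerator grows linearly in n while the denominator is n squared.
    """
    n = 2 * n_pairs
    total = n * self_kinship + 2 * n_pairs * pair_kinship
    return total / n ** 2


full_sib = pairwise_kinship(k1=0.5, k2=0.25)      # 0.25
parent_offspring = pairwise_kinship(k1=1.0, k2=0.0)  # 0.25
avuncular = pairwise_kinship(k1=0.5, k2=0.0)      # 0.125

# Mean kinship for 1 to 50 outbred full sibling pairs.
trajectory = [mean_kinship_paired_sample(n, full_sib) for n in range(1, 51)]
```

Under these assumptions the mean kinship for outbred full sibling pairs falls from 0.375 with one pair to 0.0075 with 50 pairs.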
Figure 3.
Mean squared error theoretically calculated for , , and across different sample sizes or related pairs of individuals, including avuncular relationships (A), parent-offspring relationships (B), outbred full siblings (C), and inbred full siblings (D). The number of sampled individuals ranges from 2 to 100 with the number of relative pairs equaling half the total sampled, all computed using J = 20 loci. The true value of is 0.071.
Effect of sample composition on mean squared error
Different types of relatives have different proportions of their alleles shared identical by descent, and thus have different pairwise kinship coefficients. Because we have demonstrated that bias and variance (and hence MSE) of estimators are influenced by within-population mean pairwise kinship coefficient across sampled individuals, the distribution of relative types within a sample will impact overall F-statistic estimation error. For this reason, it is important to examine how our F-statistics are affected by samples containing diverse mixtures of relative types. Specifically, to accurately assess the impact of relative composition, we hold sample sizes, number of relative pairs, and true population F-statistic values constant.
We computed the theoretical MSE when samples of 50 pairs of relatives (100 diploid individuals sampled) contain relative pairs of three different types as in Harris and DeGiorgio (2017a). In addition, each individual is related to exactly one other individual in the sample from the same population. For each statistic we vary the number of pairs related by each of three types of relationships between zero and 50, with 1326 combinations for each. We repeat this process for three configurations of relationships to probe estimator error as a function of the mixture of relative types. We also provide comparisons among inclusion of male–male full siblings (), male–female full siblings (), and female–female full siblings () at mixed-ploidy loci such as on the X chromosome (DeGiorgio et al. 2010), with results showing elevated MSE for both estimators for higher male–male sibling proportions, when compared to male–female or female–female full siblings. To investigate the effects of inbred individuals, we also provide a comparison between inbred full siblings () with inbreeding coefficient , and outbred full siblings () at autosomal diploid loci. We see that MSE is higher for inbred full siblings than for outbred full siblings in all cases examined (Figure 4 and Supplementary Figures S4–S6).
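The count of 1326 mixtures follows from enumerating every way to split 50 relative pairs among three relationship types. A brief Python sketch, using a hypothetical helper name, makes the enumeration explicit.

```python
from math import comb


def relative_type_mixtures(total_pairs):
    """All (type1, type2, type3) counts of relative pairs summing to total_pairs."""
    return [(i, j, total_pairs - i - j)
            for i in range(total_pairs + 1)
            for j in range(total_pairs - i + 1)]


mixtures_50 = relative_type_mixtures(50)
mixtures_10 = relative_type_mixtures(10)

# The number of mixtures is the number of weak compositions into three parts.
count_50 = comb(50 + 2, 2)
count_10 = comb(10 + 2, 2)
```

The same enumeration with 10 pairs gives the 66 settings used in the simulation comparisons.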
Figure 4.
Theoretically calculated MSE of (A), (B), and (C) when including relatives or inbred individuals for J = 20 loci. The MSE is estimated for instances when samples of 100 individuals include individuals related to exactly one other in the sample. The first column shows MSE for samples with different combinations of parent-offspring (PO), full sibling (FS), and avuncular (AV) relationships, the second includes full siblings that are male–male (MM), male–female (MF), and female–female (FF). The last column includes AV relationships as well as inbred (FSi) and outbred (FSo) full siblings. The true value of is 0.071.
We also note that the value of MSE for the biased estimators is always greater than the value for their corresponding unbiased statistics, which is true in part due to the values of the true F-statistics for the loci we chose to use. In addition, the values for the estimators are also higher than the corresponding estimators, but are lower than respective estimators for all combinations of individuals. Though the MSE is higher for the biased estimators, the variation in MSE values is similar for all estimators. For example, the data point with the highest proportion of avuncular relatives has the lowest MSE when compared to parent-offspring relationships and outbred full siblings. In all tested settings (Figure 4 and Supplementary Figures S4–S6), we notice similar patterns of MSE variation when comparing estimators with and estimators. This pattern is again shared when comparing MSE variation among the estimators for F2, F3, normalized F3, and normalized F4. We can conclude that in all of these cases, the value of the mean kinship coefficient is most important in determining MSE when sample size and true F-statistic value are fixed.
Simulations to evaluate theoretical MSE approximations
To verify that our theoretical approximations for MSE are reasonable, we simulate samples containing related individuals and use them to compute the biased - and unbiased -statistics as well as calculate their biases, variances, and MSEs. For each population (CEU, YRI, GIH, and JPT), we simulate 10 noninbred parent-offspring pairs with each individual related to exactly one other individual in the sample. Genotypes for each individual are simulated by first sampling two alleles with replacement according to their respective population allele frequencies from each of the populations (CEU, YRI, GIH, and JPT) to create a set of 20 unrelated individuals per population. Individuals x and y that form one of the 10 relative pairs have the genotype of individual y modified according to their relationship type. Specifically, for each relative type, there are probabilities , and that the two individuals will share zero, one, or two alleles identical by descent, respectively. The first allele of individual y is copied from the first allele from individual x with probability and the entire genotype of individual x is copied over to individual y with probability . This process is repeated across 20 independent loci to generate a sample of 20 individuals with 10 relative pairs in each population with genotypes taken at J = 20 independent loci. To generate 20 independent loci from the four 1000 Genomes Project populations, we used loci either on separate chromosomes, or at least one megabase away from each other.
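The genotype-copying step described above can be sketched in Python for a single biallelic locus. This is an illustrative reading of the procedure rather than the simulation code used in the study: the names are hypothetical, alleles are coded 1 for the reference allele and 0 otherwise, and the identity-by-descent probabilities listed are the standard values for outbred relatives.

```python
import random

# Probabilities of sharing zero, one, or two alleles identical by descent.
IBD_PROBS = {
    "PO": (0.0, 1.0, 0.0),    # parent-offspring
    "FS": (0.25, 0.5, 0.25),  # outbred full siblings
    "AV": (0.5, 0.5, 0.0),    # avuncular
}


def draw_genotype(freq, rng):
    """Sample two alleles with replacement given the reference allele frequency."""
    return [1 if rng.random() < freq else 0 for _ in range(2)]


def simulate_pair(freq, relationship, rng):
    """Draw two unrelated genotypes, then overwrite part of y to make it a relative of x."""
    _, k1, k2 = IBD_PROBS[relationship]
    x = draw_genotype(freq, rng)
    y = draw_genotype(freq, rng)
    u = rng.random()
    if u < k2:
        y = list(x)      # copy the entire genotype
    elif u < k2 + k1:
        y[0] = x[0]      # copy only the first allele
    return x, y


rng = random.Random(2021)
po_pairs = [simulate_pair(0.3, "PO", rng) for _ in range(200)]
```

Repeating the call across independent loci and across the pairs in each population would yield a sample of the kind described above.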
For each of our new unbiased estimators we compute the bias, variance, and MSE along with the same values for the original estimators (Figure 5 and Supplementary Figures S7–S9). Comparing the bias measurements in these figures, we observe a clear reduction in bias when applying the and estimators as opposed to the estimators. Importantly, the bias measurements for the estimators are always higher than the bias measures for estimates, as the estimators do not account for relatives. However, the variances are highly similar for , , and in all cases. As the value of variance is much larger than the magnitude of the bias (by an order of magnitude) and hence the squared bias, the resulting MSE is consequently similar as well. Because F4 is quantifying the relationship among four populations, more simulations may be required to converge to the pattern seen in the theoretical calculations. For this reason, we increased the number of simulations used to compute the bias, variance, and MSE to 10^4 for each data point in Supplementary Figure S9, whereas 10^3 simulation replicates were used for F2 and both versions of F3.
Figure 5.
Comparison of squared bias (A, D, and G), variance (B, E, and H), and MSE (C, F, and I), for , , and from simulated data including 60 parent-offspring relative pairs. Each estimate was computed from J = 20 randomly sampled loci using and .
To compare the accuracy of our theoretical approximations to simulation results across a spectrum of relatedness between individuals in a sample, we simulate combinations of parent-offspring, outbred full sibling, and avuncular relationships. In a manner similar to that described above (first paragraph of Simulations to evaluate theoretical MSE approximations), we simulate a total of 10 relative pairs made up of a combination of each of the three relative types, with the number of each relative type ranging from 0 to 10. We simulate each of these 66 distinct settings of relative type combinations with genotypes sampled at J = 20 independent loci, and complete 1000 independent replicates of each setting to obtain accurate measurements of bias, variance, and MSE for each simulation setting, with each simulation using the true F-statistic values specified in Figure 5 and Supplementary Figures S7–S9. We compute the bias, variance, and MSE for simulations, and compare these values to the theoretical calculations for each relative combination (Supplementary Figures S11–S14). We find that although noisier, the bias, variance, and MSE patterns in our simulation results match theoretical calculations, suggesting that our theoretical computations are accurate. For all cases, the simulated bias measurements for the estimators are close to zero, whereas the estimators display bias measurements matching the theoretically calculated bias values.
Utility and applications of unbiased estimators
In previous sections, we have shown through simulations that our theoretical results produced expected patterns and evaluated the performance of our unbiased estimators under varying combinations of relatives, true F-statistic values, and sample sizes. In this section, we show some potential applications of these estimators, using both simulated and empirical data. As discussed previously in the Introduction, the value of can be used to identify whether population A is the result of admixture between populations related to B and C (Figure 1). A negative value of F3 indicates the presence of this process, whereas a nonnegative value is inconclusive and means that further tests may be required to verify a history of admixture. However, because is upwardly biased and because corrects for this bias, might allow us to detect admixture in cases where would be inconclusive, even without the presence of related or inbred individuals.
To explore this hypothesis, we first examine an admixture scenario in which might provide marginally negative values. We simulate two populations (B and C) with an effective population size of 10^4 diploid individuals (Takahata 1993) that diverged 2000 generations prior to sampling using SLiM (Haller and Messer, 2019). This simple divergence model has parameters inspired by the history relating African and non-African human populations (Gravel et al. 2011). These populations then merge with admixture proportions 0.4 and 0.6 for B and C, respectively, to form population A 400 generations prior to sampling. Using these parameters, the expected value is . To generate genetic data from this model, we evolved sequences with a per-site per-generation mutation rate of (Scally and Durbin 2012) and a uniform per-site per-generation recombination rate of (Payseur and Nachman 2000). We output 20 two-megabase chromosomal regions containing allele frequency information for all three populations. Using allele frequency information from the three populations (A, B, and C) we generate 50 individuals for each population, in which there are 25 parent-offspring pairs. We then compute , , and across J = 20 loci, either on separate chromosomes or at least one megabase away from each other to ensure independence.
Figure 6 illustrates that values are lower than , with values falling in between and . values are almost always negative while and values are almost always positive. Because this statistic is used to test for admixture and a negative result indicates the presence of admixture, the use of biased estimators when related individuals are included in the sample leads to a different conclusion than when using the unbiased estimator. However, we notice that there is almost no correlation between the true F3 values and the estimates of F3 in Figure 6. This lack of correlation is due to the fact that the F3 values we simulated for this experiment are drawn from a particularly small range, and correlations in estimated versus true F3 are unable to be observed over the estimation noise. To ensure that this is indeed the case, we conduct simulations across a larger range of true F3 values and estimate , and . We notice that the expected trend appears when the range of F3 values is expanded (Supplementary Figure S15).
Figure 6.
, , and calculated for simulations where populations B and C merge with admixture proportions of 0.4 and 0.6, respectively, 400 generations ago (0.02 coalescent units) to form population A (tree shown in panel A). Panel (B) shows results for a sample containing 25 parent-offspring pairs.
Finally, we test the performance of our statistics on empirical data. We use populations from the HGDP SNP dataset (Li et al. 2008) that include related individuals (Rosenberg 2006). Specifically, we use genotype information from Colombian, Lahu, Melanesian, Mandenka, San, and Druze populations, and we sample 20 independent loci that are at least one megabase apart from all populations for 1000 independent replicates of J = 20 loci, yielding 1000 independent draws. Each of these populations contains between two and 14 pairs of inferred related individuals, according to Rosenberg (2006). Using distinct pairs for F2, triples for F3, and quadruples for F4 of these populations and the relationships from Rosenberg (2006), we estimate , , , , , and , and compare the mean and standard deviation of the biased and unbiased estimators (Figure 7). In all cases shown, the biased estimator has a higher mean than the unbiased estimator, although the standard deviations are similar for both. This indicates that correcting the bias generated by related individuals yields more accurate F-statistic estimates with minimal cost in precision of the estimates. In addition, the unbiased estimators presented in Appendix A of Patterson et al. (2012) all have lower means than the biased estimators, and are more similar to our unbiased estimators, while having standard deviation measures that are lower than those of both other estimators. This could indicate that Patterson’s unbiased estimators have slightly higher precision, while having slightly lower accuracy than the unbiased estimators introduced in this article when samples contain related individuals.
Figure 7.
The difference between the means and standard deviations of biased, unbiased Patterson’s, and our new unbiased estimators of F2, normalized and un-normalized F3 and normalized F4 when estimated with genotype information from four different combinations of Colombian, Lahu, Melanesian, Mandenka, San, and Druze populations. All of these populations include between 2 and 14 relative pairs. The black dots represent the values for the biased estimators, while the white dots show the value for the unbiased estimators. Each mean and standard deviation was calculated for a combination of two, three, or four populations, for F2, F3, and F4, respectively, and consists of 1000 estimates of the statistic, each calculated from J = 20 randomly sampled single nucleotide polymorphisms from the genome. Panel (A) has values for , panel (B) has values for , panel (C) shows results for normalized , and panel (D) has results for normalized .
Discussion
We have introduced the unbiased estimators , and as well as shown that the estimators and are unbiased with the inclusion of related and inbred individuals. In addition, we have demonstrated that the variance of is similar to that of , as are the variances of and . We have also provided variance calculations for all other F- and D-statistic estimators included in this study. Using these calculations, we have compared the performance of the biased and newly derived unbiased estimators, and shown that in most cases the unbiased estimators have lower MSE values than the biased estimators of the same statistic.
Interestingly, the two statistics that sample from each analyzed population only once per locus ( and ) are unbiased with the inclusion of related or inbred individuals, whereas , which samples from each population A and B twice, and , which samples from population A twice, are biased. This process of sampling more than once from a single population per locus is responsible for creating bias due to the inclusion of related or inbred individuals within the twice-sampled population.
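To illustrate why sampling a population twice creates a bias that depends on kinship, the following Python sketch works under one stated assumption: that the variance of a weighted allele frequency estimate equals p(1 − p) times the mean kinship coefficient of the sample (with self pairs included). The correction terms below follow from that assumption and are not copied from the propositions of this article; all names are hypothetical.

```python
from math import comb


def kinship_correction(p_hat, phi_bar):
    """Estimate of p(1 - p) * phi_bar, the inflation from sampling one population twice."""
    return p_hat * (1 - p_hat) * phi_bar / (1 - phi_bar)


def f2_corrected(a_hat, b_hat, phi_a, phi_b):
    """Plug-in F2 minus the estimated inflation from populations A and B."""
    return ((a_hat - b_hat) ** 2
            - kinship_correction(a_hat, phi_a)
            - kinship_correction(b_hat, phi_b))


def f3_corrected(a_hat, b_hat, c_hat, phi_a):
    """Plug-in F3(A; B, C) minus the estimated inflation from population A only."""
    return (a_hat - b_hat) * (a_hat - c_hat) - kinship_correction(a_hat, phi_a)


def expected_f2_corrected_unrelated(p_a, p_b, n_diploid):
    """Exact expectation over binomial samples of unrelated, noninbred diploids.

    In this special case the mean kinship is 1 / (2 * n_diploid).
    """
    n = 2 * n_diploid
    phi = 1 / n

    def pmf(k, p):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    return sum(pmf(i, p_a) * pmf(j, p_b) * f2_corrected(i / n, j / n, phi, phi)
               for i in range(n + 1) for j in range(n + 1))


expectation = expected_f2_corrected_unrelated(0.3, 0.6, n_diploid=3)
```

In the unrelated, noninbred case the correction reduces to the estimated heterozygosity divided by the number of sampled alleles, which is the form of the correction in Appendix A of Patterson et al. (2012). Statistics that never multiply two allele frequencies from the same population have no such term to remove.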
The development of these unbiased statistics, and the proofs showing that other statistics are unbiased, are beneficial for anthropologists interested in populations such as hunter-gatherers, some of which are often small and widely dispersed yet retain high genetic diversity (Kim et al. 2014). Small population sizes may necessitate the sampling of close relatives, such as parents and offspring, or siblings. Along with small human populations, these statistics are often applied to nonhuman species. Some, such as elephants, rhinoceroses, and cheetahs, are close to extinction or have extremely small and inbred populations due to human activity. The F- and D-statistics may prove important in conservation efforts to test how (and whether) different populations of these animals are interacting. For these reasons, having estimators that are unbiased under such conditions is imperative in making accurate inferences about the relationships of such small populations with others. Although it may not be possible to identify relatives through the sampling process, especially in the case of wild animals, there are methods available to identify related individuals and estimate their likely degree of relatedness once the samples have been sequenced (Epstein et al. 2000). The inferences from these methods will allow users to identify pairwise kinship coefficients necessary to apply the unbiased statistics of this study.
A key consideration when evaluating the importance of unbiased estimators of F- and D-statistics is their potential use. Specifically, a number of applications of these statistics do not employ the raw estimates, but instead standardized estimates (Soraggi et al. 2018; Zheng and Janke 2018), where a particular F- or D-statistic has its genomewide mean subtracted, and is normalized by the standard error using a genomic block jackknife procedure (Reich et al. 2009). Indeed, subtracting out this genomewide mean may circumvent bias issues. However, this assumes that all genomic blocks have similar sample properties, yet blocks with reduced sample size (e.g., in regions with difficult-to-call genotypes) may still deviate from the genomewide expectation. In contrast, accounting for this bias due to relatedness would provide estimates closer to the genomewide mean. Because the variance for these biased and unbiased estimators is approximately the same (compare Propositions 20 vs 21 and 23 vs 25), the standard errors used for normalizing these statistics are expected to be comparable, and thus, the unbiased estimators of the F-statistics derived here represent a more robust alternative to the original biased estimators, regardless of whether the raw or standardized values of the statistics are used. Furthermore, the raw value of some statistics, such as using the F3 statistic to detect population admixture, is important, and without correcting the bias of such statistics (Figure 6), key historical events relating populations could be missed. In addition, applying these statistics to genomic regions that are likely to be evolving non-neutrally (e.g., protein-coding regions) may lead to skewed estimates due to selection. For this reason, it is recommended that these statistics be applied to intergenic regions.
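The standardization described above can be sketched with a delete-one block jackknife in Python. This is a simplified, equal-weight version given for illustration; the weighted procedure of Reich et al. (2009) may differ in detail, and the function names and block values are hypothetical.

```python
from math import sqrt


def block_mean(values):
    """Genomewide estimate taken as the mean of per-block statistic values."""
    return sum(values) / len(values)


def jackknife_standard_error(block_values, estimator=block_mean):
    """Delete-one block jackknife standard error (equal block weights assumed)."""
    g = len(block_values)
    leave_one_out = [estimator(block_values[:i] + block_values[i + 1:])
                     for i in range(g)]
    center = sum(leave_one_out) / g
    return sqrt((g - 1) / g * sum((t - center) ** 2 for t in leave_one_out))


def standardize(value, block_values):
    """Subtract the genomewide mean and divide by the jackknife standard error."""
    return (value - block_mean(block_values)) / jackknife_standard_error(block_values)


# Placeholder per-block values of a statistic; not data from this study.
blocks = [1.0, 2.0, 3.0, 4.0, 5.0]
se = jackknife_standard_error(blocks)
z = standardize(4.0, blocks)
```

For the mean, this standard error coincides with the usual sample standard deviation divided by the square root of the number of blocks, which is why similar variances for the biased and unbiased estimators imply comparable normalizing constants.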
The F- and D-statistics evaluated here are the most commonly used. However, since their development by Reich et al. (2009) and Patterson et al. (2012), other D-statistic type tests have been formulated to not only detect admixture, but also to identify the direction of gene flow, namely the partitioned D-statistics of Eaton and Ree (2013) and the DFOIL statistics of Pease and Hahn (2015). Specifically, the DFOIL statistics as originally formulated by Pease and Hahn (2015) sampled a single lineage (or allele) from each of a set of five populations A, B, C, D, and O, with a symmetric rooted topology relating populations A, B, C, and D, and with O an outgroup to these populations used to polarize the ancestral allelic state. Subsequently, Harris and DeGiorgio (2017b) derived allele frequency formulas for the DFOIL statistics, and showed that allele frequency information for the outgroup population O is not needed for computation. The DFOIL statistics are a set of four quantities (Harris and DeGiorgio 2017b)
| (68) |
| (69) |
| (70) |
| (71) |
none of which has the frequencies for two alleles sampled from a single population multiplying each other. Hence, using sample allele frequencies in place of the population quantities would still yield approximately unbiased estimators of the DFOIL statistics, regardless of whether related or inbred individuals were included in the sample. Though we chose to focus on the more classic F- and D-statistics, variance quantities for these partitioned D and DFOIL statistics can be readily computed as we have done for other ratio estimators in this study.
Though we have only shown results when all populations contain samples with the same relative pair composition, it is trivial to include different relative types in different populations within these statistics. In addition, a key assumption of our theoretical formulas is that pairwise relative contributions to the bias and variance are so much larger in magnitude than higher-order relative contributions, that the inclusion of kinship terms for trios, quadruples, or pairs of pairs of relatives would minimally affect results. For this reason, and to avoid highly unwieldy formulas for variance calculations, we made approximations to the variance formulas using this assumption. We briefly explore the accuracy of such approximations under simulations with 20 full sibling trios, and calculate the bias, variance, and MSE over a range of F2 values. We compare these simulated results to theoretically calculated bias, variance, and MSE for across identical F2 values (Supplementary Figure S16), where the theoretical variance (and hence MSE) formulas are approximations that only consider pairwise kinship coefficients. We see that though the trends in variance (and hence MSE) values are similar between the simulated and theoretical quantities, there is a slight difference between the simulated and theoretical variance (and hence MSE), where the theoretical values are consistently lower than the simulated. Therefore, our theoretical approximate variance calculations have underestimated the true variance values, which can be expected as we drop a number of terms that would be in the exact variance calculation, while still being useful for understanding the overall trends of the variances (and hence MSEs) across estimators.
In addition, it is also possible to apply our new unbiased estimators when only some or none of the populations contain related or inbred individuals. Moreover, though we have demonstrated results for allele frequencies estimated as the sample proportion, we could have instead used the best linear unbiased estimator (BLUE) of McPeek et al. (2004), as all derivations in this article are based on a general form of a linear unbiased estimator. The BLUE allele frequency estimator would have superior properties to the sample proportion discussed here, as it has smallest variance (McPeek et al. 2004), and this reduction in variance translates to functions of the allele frequency as highlighted by improvements in both expected heterozygosity and by Harris and DeGiorgio (2017a). To apply the BLUE estimator, we would simply alter the weight of an individual x in population P at a particular locus with the equation
| (72) |
where is the matrix of pairwise kinship coefficients, with element in row j and column k given by is a column vector of ones, and superscript indicates transpose. To facilitate easy application of these statistics, we have developed open-source software funbiased for use by the scientific community, which is available at https://github.com/MehreenRuhi/funbiased.
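As a rough sketch of the BLUE weighting of McPeek et al. (2004), the weights are proportional to the inverse kinship matrix multiplied by a vector of ones, normalized to sum to one. The Python snippet below uses a small hand-written linear solver to stay dependency-free; the function names are hypothetical and this is not the funbiased implementation.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ u = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def blue_weights(kinship):
    """Weights proportional to K^{-1} 1, rescaled so that they sum to one."""
    u = solve_linear_system(kinship, [1.0] * len(kinship))
    total = sum(u)
    return [v / total for v in u]


# Two outbred full siblings and one unrelated individual (all diploid).
kinship = [[0.50, 0.25, 0.00],
           [0.25, 0.50, 0.00],
           [0.00, 0.00, 0.50]]
weights = blue_weights(kinship)
```

In this toy sample each sibling receives weight 2/7 and the unrelated individual 3/7, so related individuals are down-weighted relative to the sample proportion, which would assign 1/3 to each.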
Data availability
Supplementary data are available at Genetics online. The 1000 Genomes Project data used in this publication is available at http://www.1000genomes.org/. Relatedness information used to generate Figure 7 is available online within Supplementary Tables S7–S15 of Rosenberg (2006).
Supplementary Material
Acknowledgments
The authors thank three anonymous reviewers for their comments that improved our article. They also thank Nick Patterson for pointing us to the unbiased F-statistics in samples containing unrelated and noninbred individuals within Appendix A of Patterson et al. (2012). The authors thank George (PJ) Perry for the helpful comments on an early draft of this article and Alexandre Harris for fruitful discussions about our simulation protocol.
Funding
This research was funded by the National Institutes of Health (R35GM128590), the National Science Foundation (DEB-1949268 and BCS-2001063), a NIGMS funded training grant on Computation, Bioinformatics, and Statistics (Predoctoral Training Program T32GM102057), and the NASA Pennsylvania Space Grant Graduate Fellowship. Computations for this research were performed on the Pennsylvania State University’s Institute for Computational and Data Sciences Advanced CyberInfrastructure (ICDS-ACI).
Conflicts of interest
The authors declare that there is no conflict of interest.
Appendix
In this section, we provide proofs of key lemmas and propositions from the Theory section, and also develop and prove other important results.
Proof of Lemma 1. We first calculate (result in Weir 1996, page 153)
which gives
where we define the downward bias of as
It follows that is an unbiased estimator of G(P) because
Proof of Corollary 2. When using the sample proportion to estimate allele frequencies, the weight of individual x in population P at locus j is (Harris and DeGiorgio 2017a)
Plugging into the definition of , we have
Because the sample contains unrelated individuals, then for and resulting in
Plugging into the definitions of and within Lemma 1 gives the desired result. □
Proof of Proposition 3. We first calculate
which gives
where we define the upward bias of as
It follows that is an unbiased estimator of because
Proof of Corollary 4. Plugging the definition of , derived in the proof of Corollary 2 as well as
of Corollary 2 into the definitions of and within Proposition 3 gives the desired result. □
Proof of Corollary 5. From Corollary 4, we have
Because populations A or B may have related or inbred individuals within them, from the proofs of Lemma 1 and Proposition 3, we have
yielding
It follows that the bias is
and because , this is an upward bias. □
Proof of Proposition 6. We first calculate
which gives
where we define the upward bias of as
It follows that is an unbiased estimator of because
Proof of Corollary 7. Plugging the definition of derived in the proof of Corollary 2 as well as
of Corollary 2 into the definitions of and within Proposition 6 gives the desired result. □
Proof of Corollary 8. From Corollary 7, we have
Because population A may have related or inbred individuals within it, from the proofs of Lemma 1 and Proposition 6, we have
yielding
It follows that the bias is
and because , this is an upward bias. □
Proof of Proposition 9. Assuming that the expectation of is approximately equal to the ratio of expectations of and , we find that
where we define the upward bias of as
We can also see that is an approximately unbiased estimator of because
Proof of Corollary 10. Plugging the definition of derived in the proof of Corollary 2 as well as
of Corollary 2 into the definitions of and within Proposition 9 gives the desired result. □
Proof of Corollary 11. From Corollaries 2, 7, and 10, we have
where
Because population A may have related or inbred individuals within it, from the proofs of Lemma 1 and Proposition 6, we have
yielding
It follows that the approximate bias is
and because , then and , making this an upward approximate bias. □
Proof of Proposition 12. We first calculate
We show that is an unbiased estimator of because
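The reasoning behind this unbiasedness can be sketched in one step. We assume here that Proposition 12 concerns the per-locus product $(\hat{a}_j-\hat{b}_j)(\hat{c}_j-\hat{d}_j)$, that each estimated frequency is an unbiased weighted estimate of its parametric frequency, and that samples from distinct populations are drawn independently. The hatted symbols are notation introduced for this illustration.

```latex
\begin{aligned}
\mathbb{E}\big[(\hat{a}_j-\hat{b}_j)(\hat{c}_j-\hat{d}_j)\big]
  &= \mathbb{E}\big[\hat{a}_j-\hat{b}_j\big]\,
     \mathbb{E}\big[\hat{c}_j-\hat{d}_j\big] \\
  &= (a_j-b_j)(c_j-d_j).
\end{aligned}
```

Because no population appears in both factors, no within-population variance term arises, so relatedness or inbreeding within a population does not bias this product under the stated assumptions.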
Proof of Proposition 13. Assuming that the expectation of is approximately equal to the ratio of expectations of and for some , we find that
where we define the approximate upward bias of as
We can also see that is an approximately unbiased estimator of because
Proof of Corollary 14. Plugging the definition of , derived in the proof of Corollary 2 as well as
of Corollary 2 into the definitions of and within Proposition 13 gives the desired result. □
Proof of Corollary 15. From Corollaries 2 and 14, we have
where
Because population P may have related or inbred individuals within it, from the proofs of Lemma 1 and Proposition 12, we have
yielding
It follows that the approximate bias is
and because , then , making this an upward approximate bias. □
Lemma 17. Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. The estimator is unbiased.
Proof. We first calculate
We show that is an unbiased estimator of because
Proof of Proposition 16. Assuming that the expectation of is approximately equal to the ratio of expectations of and , is an approximately unbiased estimator of because
Lemma 18. Consider J independent polymorphic loci in a population P with parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j, some of which may be related or inbred, where individual k has ploidy m_k. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the estimator has an approximate variance
Moreover, the respective approximate variance for the unbiased estimator and the estimator are
Proof. From the proof of Lemma 1, we have
and we calculate
Therefore, we have that
which gives
Recall that
where
It follows that
and similarly
Lemma 19. Consider J independent polymorphic loci in populations A and B with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the estimators and have an approximate covariance
Proof. From the proofs of Lemma 1 and Proposition 3, we have
and
yielding
where we used the fact that is negligible compared to as an approximation. We also calculate
Recognizing that , and , we have
Therefore, we have that
which gives
Proposition 20. Consider J independent polymorphic loci in populations A and B with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the estimator has approximate variance
Proof. From the proof of Proposition 3, we have
which gives
where we used the fact that , and are negligible compared to and as an approximation. We also calculate
Therefore, we have that
which gives
Proposition 21. Consider J independent polymorphic loci in populations A and B with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the unbiased estimator has approximate variance
Proof. Recall that
where is an unbiased estimator for and is an unbiased estimator of for at locus . Also, from the proof of Proposition 3, we have
Therefore, we have that
where we used the fact that and are negligible compared to and as an approximation, and where because drawing alleles in population A is independent of population B. Moreover, because , we have
Recalling the assumption that , and are negligible compared to and , we have
and it follows that
Proposition 22. Consider J independent polymorphic loci in populations A and B with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and other terms of similar order negligible to . Based on this simplifying assumption, the estimator has approximate variance
Proof. Recall that
where and are estimators for and is an estimator of for at locus . Also, from the proof of Proposition 3, we have
Therefore, we have that
where we used the fact that and are negligible compared to and as an approximation (and hence assuming is large enough), and where because drawing alleles in population A is independent of population B. Moreover, because
we have
by recalling the assumption that , or any other term of similar magnitude (such as the terms that appear for populations ) and are negligible compared to and . Because of this, we have
and it follows that
Proposition 23. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , , and negligible to . Based on this simplifying assumption, the estimator has approximate variance
Proof. From the proof of Proposition 6, we have
which gives
where we used the fact that is negligible compared to as an approximation. We also calculate
where we used the fact that , and are negligible compared to , and as an approximation. Therefore, we have that
which gives
Lemma 24. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the estimators and have an approximate covariance
Proof. From the proofs of Lemma 1 and Proposition 6, we have
and
yielding
where we used the fact that is negligible compared to as an approximation. We also calculate
Recognizing that and , we have
Therefore, we have that
which gives
Proposition 25. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the unbiased estimator has approximate variance
Proof. Recall that
where is an unbiased estimator for and is an unbiased estimator of at locus . Also, from the proof of Proposition 6, we have
Therefore, we have that
where we used the fact that is negligible compared to as an approximation. Moreover, because , we have
Recalling the assumption that is negligible compared to , we have
Proposition 26. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms and any other terms of similar order negligible to . Based on this simplifying assumption, the estimator has approximate variance
Proof. Recall that
where and are estimators for and is an estimator of at locus . Also, from the proof of Proposition 6, we have
Therefore, we have that
where we used the fact that is negligible compared to as an approximation (and hence assuming is large enough). Moreover, because
we have
by recalling the assumption that or any other term of similar magnitude (such as the term) are negligible compared to . Because of this, we have
and it follows that
Proposition 27. Consider J independent polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the unbiased estimator has approximate variance
Proof. From the proofs of Propositions 3 and 12, we have
and
We calculate
where we use the identity that
Therefore, we have that
where we used the fact that , and are negligible compared to and as an approximation. It follows that
Following Wolter (2007), we have that an approximation to the variance of the ratio estimator X/Y is
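For readers who wish to evaluate this approximation numerically, the sketch below assumes the standard first-order Taylor series (delta-method) form, Var(X/Y) ≈ Var(X)/E[Y]² − 2 E[X] Cov(X, Y)/E[Y]³ + E[X]² Var(Y)/E[Y]⁴, which requires E[Y] ≠ 0 and is most accurate when Y varies little relative to its mean. The function name and the input values are hypothetical and serve only to show how the expectations, variances, and covariance combine.

```python
def ratio_variance(mean_x, mean_y, var_x, var_y, cov_xy):
    """First-order Taylor (delta-method) approximation to Var(X / Y)."""
    if mean_y == 0:
        raise ValueError("the approximation requires a nonzero denominator mean")
    return (var_x / mean_y**2
            - 2 * mean_x * cov_xy / mean_y**3
            + mean_x**2 * var_y / mean_y**4)


# Illustrative inputs only; in practice these would come from the
# expectation, variance, and covariance expressions for the two estimators.
approx = ratio_variance(mean_x=2.0, mean_y=4.0, var_x=0.04, var_y=0.16, cov_xy=0.02)
print(approx)  # approximately 0.00375
```

A positive covariance between numerator and denominator lowers the approximate variance of the ratio, which is why the covariance lemmas are needed alongside the variance results.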
Proposition 28. Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the ratio estimator has approximate variance
where the expectations are
the variances are
and the covariance is
Proof. Recall that
Assuming that and , following the approximation in Wolter (2007) we have
where is given in Proposition 6, in Lemma 1, in Proposition 23, in Lemma 18, and in Lemma 24. □
Lemma 29. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the unbiased estimators and have an approximate covariance
Proof. Recall that
where . It follows that
where we used the fact that is negligible compared to as an approximation. From the proofs of Lemmas 18 and 24, we have
and
Assuming that is negligible compared to , we have that
We therefore have that
and thus by independence of loci we have
Lemma 30. Consider J independent polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred, where individual k has ploidy m_k. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and any other terms of similar order negligible to . Based on this simplifying assumption, the estimators and have an approximate covariance
Proof. Recall that
where
It follows that
From the proofs of Lemmas 18 and 24, we have
and
By recalling the assumption that or any other term of similar magnitude (such as the term for large enough) is negligible compared to , we have that
We therefore have that
and thus by independence of loci we have
Proposition 31. Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the approximately unbiased ratio estimator has approximate variance
where the variances are
and the covariance is
Proof. Recall that
where is an unbiased estimator for and is an unbiased estimator of G(A). Assuming that and , following the approximation in Wolter (2007) we have
where is given in Proposition 25, in Lemma 18, and in Lemma 29. □
Proposition 32. Consider J polymorphic loci in populations A, B, and C with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred, where individual k has ploidy m_k. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and any other terms of similar order negligible to . Based on this simplifying assumption, the ratio estimator has approximate variance
where the variances are
the covariance is
and where
Proof. Recall that
where is an estimator for and is an estimator of G(A). Assuming that and , following the approximation in Wolter (2007) we have
where is given in Proposition 26, in Lemma 18, in Lemma 30, in proof of Corollary 8, and in proof of Corollary 11. □
Lemma 33. Consider J independent polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the estimators and have approximate covariances
Proof. From the proofs of Lemma 1 and Proposition 12, we have that
and
yielding
We first calculate
Hence, we have that
Similarly, we have that
Hence, we have that
Parallel to the derivation for P = A, we have
and parallel to the derivation for P = B, we have
We know that by independence of loci we have
which gives
Proposition 34. Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the ratio estimator has approximate variance
where the expectation is
the variances are
and the covariances are
Proof. Recall that
where is an unbiased estimator for . Assuming that and , following the approximation in Wolter (2007) we have
where is given in Lemma 1, in Proposition 27, in Lemma 18, and in Lemma 33 for each population . □
Lemma 35. Consider J independent polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the unbiased estimators and have approximate covariances
Proof. Recall that . It follows that
From the proof of Lemma 33, we have
yielding
Lemma 36. Consider J independent polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred, where individual k has ploidy m_k. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the estimators and have approximate covariances
Proof. Recall that
It follows that
From the proof of Lemma 33, we have
yielding
Proposition 37. Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the approximately unbiased ratio estimator has approximate variance
where the variances are
and the covariances are
Proof. Recall that
where is an unbiased estimator for and is an unbiased estimator of G(P). Assuming that and , following the approximation in Wolter (2007) we have
where is given in Proposition 27, in Lemma 18, and in Lemma 35 for each population . □
Proposition 38. Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred, where individual k has ploidy m_k. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the ratio estimator has approximate variance
where the variances are
the covariances are
and where
Proof. Recall that
where is an unbiased estimator for and is an estimator of G(P). Assuming that and , following the approximation in Wolter (2007) we have
where is given in Proposition 27, in Lemma 18, in Lemma 36, and in proof of Corollary 11 for each population . □
Lemma 39. Consider J independent polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the unbiased estimator has approximate variance
Proof. From the proof of Lemma 17, we have that
yielding
We first calculate
We compute the first term as
where we used the fact that is negligible compared to and as an approximation. Using a similar argument, we have that
Hence, we have that
where we used the fact that , and are negligible compared to , and as an approximation. Putting it together, we have
Given the assumption of independent loci, we have
Lemma 40. Consider J independent polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the unbiased estimators and have approximate covariance
Proof. From the proofs of Proposition 12 and Lemma 17, we have that
and
yielding
We first calculate
We compute the first term as
Using a similar argument, we have that
Hence, we have that
where we used the fact that , and are negligible compared to , and as an approximation. Putting it together, we have
Given the assumption of independent loci, we have
Proposition 41. Consider J polymorphic loci in populations A, B, C, and D with respective parametric reference allele frequencies , and suppose we take a random sample of individuals at locus j in population , some of which may be related or inbred. Moreover, assume that no individual is related to more than one other individual, which makes the terms , and negligible to . Based on this simplifying assumption, the approximately unbiased ratio estimator has approximate variance
where the variances are
and the covariance is
Proof. Recall that
where is an unbiased estimator for and is an unbiased estimator of . Assuming that and , following the approximation in Wolter (2007) we have
where is given in Proposition 27, in Lemma 39, and in Lemma 40. □
Literature cited
- Cockerham CC. 1971. Higher order probability functions of identity of allelles by descent. Genetics. 69:235–246.
- DeGiorgio M, Jankovic I, Rosenberg NA. 2010. Unbiased estimation of gene diversity in samples containing related individuals: exact variance and arbitrary ploidy. Genetics. 186:1367–1387.
- DeGiorgio M, Rosenberg NA. 2009. An unbiased estimator of gene diversity in samples containing related individuals. Mol Biol Evol. 26:501–512.
- Eaton DAR, Ree RH. 2013. Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae). Syst Biol. 62:689–706.
- Epstein MP, Duren WL, Boehnke M. 2000. Improved inference of relationship for pairs of individuals. Am J Hum Genet. 67:1219–1231.
- Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, et al.; The 1000 Genomes Project. 2011. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci USA. 108:11983–11988.
- Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, et al. 2010. A draft sequence of the Neandertal genome. Science. 328:710–722.
- Hajdinjak M, Fu Q, Hübner A, Petr M, Mafessoni F, et al. 2018. Reconstructing the genetic history of late Neanderthals. Nature. 555:652–656.
- Haller BC, Messer PW. 2019. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Mol Biol Evol. 36:632–637.
- Harris AM, DeGiorgio M. 2017a. An unbiased estimator of gene diversity with improved variance for samples containing related and inbred individuals of any ploidy. G3 (Bethesda). 7:671–691.
- Harris AM, DeGiorgio M. 2017b. Admixture and ancestry inference from ancient and modern samples through measures of population genetic drift. Hum Biol. 89:21–46.
- Harris DL. 1964. Genotypic covariances between inbred relatives. Genetics. 50:1319–1348.
- Huson D, Klöpper T, Lockhart P, Steel M. 2005. Reconstruction of reticulate networks from gene trees. RECOMB. 2005. 3500:233–249.
- Kim HL, Ratan A, Perry GH, Montenegro A, Miller W, et al. 2014. Khoisan hunter-gatherers have been the largest population throughout most of modern-human demographic history. Nat Commun. 5:5692.
- Kulathinal RJ, Stevison LS, Noor MAF. 2009. The genomics of speciation in Drosophila: diversity, divergence, and introgression estimated using low-coverage genome sequencing. PLoS Genet. 5:e1000550.
- Lange K. 2002. Mathematical and Statistical Methods for Genetic Analysis. New York, NY: Springer.
- Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, et al. 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 319:1100–1104.
- Martin SH, Davey JW, Jiggins CD. 2015. Evaluating the use of ABBA–BABA statistics to locate introgressed loci. Mol Biol Evol. 32:244–257.
- McPeek MS, Wu X, Ober C. 2004. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 60:359–367.
- Molinaro L, Montinaro F, Yelmen B, Marnetto D, Behar DM, et al. 2019. West Asian sources of the Eurasian component in Ethiopians: a reassessment. Sci Rep. 9:18811.
- Moorjani P, Thangaraj K, Patterson N, Lipson M, Loh P-R, et al. 2013. Genetic evidence for recent population mixture in India. Am J Hum Genet. 93:422–438.
- Nei M, Roychoudhury AK. 1974. Sampling variances of heterozygosity and genetic distance. Genetics. 76:379–390.
- Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, et al. 2012. Ancient admixture in human history. Genetics. 192:1065–1093.
- Payseur BA, Nachman MW. 2000. Microsatellite variation and recombination rate in the human genome. Genetics. 156:1285–1298.
- Pease JB, Hahn MW. 2015. Detection and polarization of introgression in a five-taxon phylogeny. Syst Biol. 64:651–662.
- Peter BM. 2016. Admixture, population structure, and f-statistics. Genetics. 202:1485–1501.
- Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, et al. 2012. Reconstructing Native American population history. Nature. 488:370–374.
- Reich D, Thangaraj K, Patterson N, Price AL, Singh L. 2009. Reconstructing Indian population history. Nature. 461:489–494.
- Rosenberg NA. 2006. Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann Hum Genet. 70:841–847.
- Scally A, Durbin R. 2012. Revising the human mutation rate: implications for understanding human evolution. Nat Rev Genet. 13:745–753.
- Soraggi S, Wiuf C, Albrechtsen A. 2018. Powerful inference with the D-statistic on low-coverage whole-genome data. G3 (Bethesda). 8:551–566.
- Takahata N. 1993. Allelic genealogy and human evolution. Mol Biol Evol. 10:2–22.
- The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature. 526:68–74.
- Turissini DA, Matute DR. 2017. Fine scale mapping of genomic introgressions within the Drosophila yakuba clade. PLoS Genet. 13:e1006971.
- Waples RS, Anderson EC. 2017. Purging putative siblings from population genetic data sets: a cautionary view. Mol Ecol. 26:1211–1224.
- Weir B. 1996. Genetic Data Analysis II. Sunderland, Massachusetts: Sinauer Associates.
- Weir BS. 1989. Sampling properties of gene diversity. In: Plant Population Genetics, Breeding and Genetic Resources. p. 23–42.
- Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution. 38:1358–1370.
- Wolter KM. 2007. Introduction to Variance Estimation. 2nd ed. New York, NY: Springer.
- Zheng Y, Janke A. 2018. Gene flow analysis method, the D-statistic, is robust in a wide parameter space. BMC Bioinformatics. 19:10.
Data Availability Statement
Supplementary data are available at Genetics online. The 1000 Genomes Project data used in this publication are available at http://www.1000genomes.org/. Relatedness information used to generate Figure 7 is available online within Supplementary Tables S7–S15 of Rosenberg (2006).







