Skip to main content
G3: Genes | Genomes | Genetics logoLink to G3: Genes | Genomes | Genetics
. 2016 Dec 30;7(2):671–691. doi: 10.1534/g3.116.037168

An Unbiased Estimator of Gene Diversity with Improved Variance for Samples Containing Related and Inbred Individuals of any Ploidy

Alexandre M Harris *,, Michael DeGiorgio *,‡,1
PMCID: PMC5295611  PMID: 28040781

Abstract

Gene diversity, or expected heterozygosity (H), is a common statistic for assessing genetic variation within populations. Estimation of this statistic decreases in accuracy and precision when individuals are related or inbred, due to increased dependence among allele copies in the sample. The original unbiased estimator of expected heterozygosity underestimates true population diversity in samples containing relatives, as it only accounts for sample size. More recently, a general unbiased estimator of expected heterozygosity was developed that explicitly accounts for related and inbred individuals in samples. Though unbiased, this estimator’s variance is greater than that of the original estimator. To address this issue, we introduce a general unbiased estimator of gene diversity for samples containing related or inbred individuals, which employs the best linear unbiased estimator of allele frequencies, rather than the commonly used sample proportion. We examine the properties of this estimator, HBLUE, relative to alternative estimators using simulations and theoretical predictions, and show that it predominantly has the smallest mean squared error relative to others. Further, we empirically assess the performance of HBLUE on a global human microsatellite dataset of 5795 individuals, from 267 populations, genotyped at 645 loci. Additionally, we show that the improved variance of HBLUE leads to improved estimates of the population differentiation statistic, FST, which employs measures of gene diversity within its calculation. Finally, we provide an R script, BestHet, to compute this estimator from genomic and pedigree data.

Keywords: expected heterozygosity, identity state, inbreeding, locus-specific branch length, relatedness


The gene diversity of a locus, also known as its expected heterozygosity (H), is a fundamental measure of genetic variation in a population, and describes the proportion of heterozygous genotypes expected under Hardy-Weinberg equilibrium (Nei 1973). Formally, gene diversity is the probability that a pair of randomly sampled allele copies from a population are different, and is computed as

H=1i=1Ipi2, (1)

where I is the number of distinct alleles at a locus, and pi (i=1,2,,I) is the frequency of allele i in the population.

For a sample without related or inbred individuals composed of n allele copies, an unbiased estimator of expected heterozygosity is (Nei and Roychoudhury 1974)

H^=nn1(1i=1Ip^i2), (2)

where p^i is the sample proportion of allele i. H^ is a biased estimator when inbred or related individuals are included in the sample (DeGiorgio and Rosenberg 2009). This result is based on the idea that, as the proportion of related individuals in the sample increases, the number of independent allele observations decreases.

When two alleles are drawn from a sample, one each from a pair of related individuals, there is a nonzero probability that they will be identical by descent (IBD), rather than just identical by state (Lange 2002). This IBD probability is known as the kinship coefficient, and is denoted by Φjk for a pair of individuals j and k. Thus, the observed diversity will be lower than the true value because a greater proportion of identical alleles are observed than for a sample in which there are no related individuals. DeGiorgio et al. (2010) developed an estimator of expected heterozygosity,

H=11Φ¯2(1i=1Ip^i2), (3)

which is unbiased for samples containing related and inbred individuals of any ploidy, and employs a weighted mean kinship coefficient Φ¯2 as a bias correction factor. Φ¯2 is the average of all kinship coefficients Φjk for every pair of individuals within the sample (see Methods). Further, DeGiorgio et al. (2010) derived the theoretical variance of H, as well as its approximate value for samples wherein individuals are related to no more than one other sampled individual.

As an alternative to the sample proportion (p^i), McPeek et al. (2004) introduced the best linear unbiased estimator (BLUE, denoted as pi) of population allele frequency, which is an unbiased linear estimator with smaller variance than the unbiased linear estimator p^i. The BLUE incorporates the relatedness of individuals in the sample as a covariance matrix to define the weight of each observation. Simulations and analytical evaluation corroborating their result suggest that the mean squared error (MSE) of pi is always smaller than that of p^i, and this difference is especially evident for samples with complex pedigrees.

Because pi has the smallest variance of any unbiased linear estimator of allele frequencies, we expect its low variance to translate to smaller variance of gene diversity statistics that use pi. We developed such a statistic, termed HBLUE, that is an unbiased estimator of expected heterozygosity in samples containing related and inbred individuals of arbitrary ploidy. Through simulations, analytical predictions, and empirical assessments, we compare the performance of HBLUE to that of H and H^ for samples containing related individuals of various types across different ploidy and inbreeding status. Additionally, we derive the variance of any measure of expected heterozygosity that uses unbiased linear estimators of allele frequencies. We find that the increased precision of allele frequency estimates transfers to our unbiased estimator, yielding values for MSE invariably equal to or smaller than those of H, while occasionally exceeding the precision of H^. The improved properties of HBLUE translate to its applications as well, which we demonstrate in the calculation of the population differentiation statistic, FST (Wright 1951). FST can be written in terms of intrapopulation and interpopulation gene diversity as (Hudson et al. 1992)

FST=H1212(H1+H2)H12, (4)

where H1 and H2 are the values of expected heterozygosity within each of two compared populations, and H12 is the expected heterozygosity between them.

Methods

Consider a locus with I distinct alleles in a sample of n individuals. Let Xk(i) denote the fraction of alleles at the locus in individual k that are of type i, i=1,2,,I. An unbiased linear estimator of population allele frequencies pi, denoted by p˘i, is defined as

p˘i=k=1nwkXk(i), (5)

where wk, 0wk1, is the weight of individual k, k=1,2,,n, and k=1nwk=1. Formally, we have that

Xk(i)=1mkt=1mkAkt(i),

where Akt(i) is an indicator random variable whose value is 1 if allele t of individual k is of type i, and zero otherwise, and where mk is the ploidy of individual k. As an example, if individual k were diploid at the locus, then mk=2. Taking the expectation of p˘i,

E[p˘i]=k=1nwkmkt=1mkE[Akt(i)]=k=1nwkmkt=1mkpi=pi,

shows that it is an unbiased estimator of pi.

Unbiased estimation of gene diversity using unbiased linear estimators of allele frequencies

In this section, we construct an unbiased estimator, H˘, of expected heterozygosity that uses a general unbiased linear estimator, p˘i, of allele frequency pi (Proposition 1). We then show that the unbiased estimator, H, of DeGiorgio et al. (2010) follows as a corollary, assuming that p˘i=p^i, the sample proportion allele frequency estimator (Corollary 2). We then derive a new estimator, HBLUE, also as a corollary, assuming that p˘i=pi, the BLUE of allele frequency (Corollary 3).

Proposition 1:

Consider a locus with I distinct alleles and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n individuals of any ploidy, inbreeding status, and relatedness,

H˘=11ρ2(1i=1Ip˘i2) (6)

is an unbiased estimator of expected heterozygosity, where

ρ2=j=1nk=1nwjwkΦjk

is a weighted mean kinship coefficient of the sample for all pairs of individuals in the sample, and where wk, k=1,2,,n, is the weight for individual k. The proof of Proposition 1 is found in the Appendix.

From p˘i, the sample proportion estimator p^i of allele frequency i, i=1,2,,I, is recovered when wk=mk/j=1nmj for individual k, k=1,2,,n, leading to

p^i=k=1nmkj=1nmjXk(i).

Here, each individual is weighted by its contribution to the number of allele copies in the sample.

Corollary 2:

Consider a locus with I distinct alleles and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n individuals of any ploidy, inbreeding status, and relatedness,

H=11Φ¯2(1i=1Ip^i2) (7)

is an unbiased estimator of expected heterozygosity, where

p^i=k=1nmkj=1nmjXk(i)

is the sample proportion estimator of allele frequency i, where

Φ¯2=j=1nk=1nmjx=1nmxmky=1nmyΦjk

is a weighted mean kinship coefficient of the sample for all pairs of individuals, and where mk, k=1,2,,n, is the ploidy for individual k. The proof of Corollary 2 is found in the Appendix.

It may be beneficial to apply an unbiased linear estimator of allele frequencies that has minimum variance. McPeek et al. (2004) introduced the BLUE of allele frequencies, which we formally define here. We will use the BLUE of allele frequencies to construct a new unbiased estimator of gene diversity that would ideally have improved variance over other estimators. Let K be an n×n symmetric matrix of kinship coefficients, with Kjk=Φjk. The BLUE (pi) of allele frequency is obtained when wk=j=1n(K1)jk1TK11, yielding

pi=k=1nj=1n(K1)jk1TK11Xk(i),

where K1 denotes the inverse matrix of K, 1 is a column vector of n elements with all entries equal to 1, and 1T is the transpose of 1.

Corollary 3:

Consider a locus with I distinct alleles, and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n individuals of any ploidy, inbreeding status, and relatedness,

HBLUE=11κ2(1i=1Ipi2) (8)

is an unbiased estimator of expected heterozygosity, where

pi=k=1nj=1n(K1)jk1TK11Xk(i)

is the BLUE of allele frequencies, and where

κ2=j=1nk=1nx=1n(K1)xj1TK11y=1n(K1)yk1TK11Φjk

is a weighted mean kinship coefficient of the sample for all pairs of individuals. The proof of Corollary 3 is found in the Appendix.

Variance of H estimators using unbiased linear estimators of allele frequencies

We now derive the equation (Proposition 4) describing the variance of the unbiased estimator H˘, which takes p˘i as the unbiased linear estimate of population allele frequency pi. This value depends on the weighted mean kinship coefficients of the sample for all pairs, trios, quartets, and pairs of pairs of individuals in the sample, defined as

ρ2=j=1nk=1nwjwkΦjkρ3=j=1nk=1nj=1nwjwkwjΦjkjρ4=j=1nk=1nj=1nk=1nwjwkwjwkΦjkjkρ2,2=j=1nk=1nj=1nk=1nwjwkwjwkΦjk,jk.

Here, Φjkj is the probability that three randomly sampled alleles, one each from individuals j, k, and j, are IBD. Φjkjk is the probability that four randomly sampled alleles, one each from individuals j, k, j, and k, are IBD. Finally, Φjk,jk is the joint probability that two randomly sampled alleles, one each from individuals j and k are IBD, and two randomly sampled alleles, one each from individuals j and k, are IBD. Note that individuals j, k, j, and k are not necessarily distinct. The variances of H and of HBLUE follow as Corollaries 7 and 8, once again differing only in the weight of a sampled individual in the mean kinship coefficient calculation.

Proposition 4:

Consider a locus with I distinct alleles and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n individuals of any ploidy, inbreeding status, and relatedness,

Var[H˘]=1(1ρ2)2Var[1i=1Ip˘i2] (9)

is the variance of the unbiased estimator of expected heterozygosity H˘, where ρ2=j=1nk=1nwjwkΦjk is a weighted mean kinship coefficient of the sample, and where wk for k=1,2,,n is the weight of individual k. Further, we have

Var[1i=1Ip˘i2]=ρ2,2ρ22+2(ρ22ρ4)i=1Ipi2+4(2ρ4+ρ22ρ3ρ2,2)i=1Ipi3+(3ρ2,2+8ρ36ρ44ρ2ρ22)(i=1Ipi2)2. (10)

The proof of Equation 10 is presented for the specific case of Var[1i=1Ip^i2] in Appendix B of DeGiorgio et al. (2010), where p^i is substituted for p˘i, and Φ¯2, Φ¯3, and Φ¯4, and Φ¯2,2 coefficients are substituted for ρ2, ρ3, ρ4, and ρ2,2 coefficients, respectively. We provide an abbreviated version of this proof for the general case in the Appendix. Further, the approximate value of Equation 10 for samples wherein no individual is related to more than one other is

Var[1i=1Ip˘i2]4ρ2[i=1Ipi3(i=1Ipi2)2]. (11)

For this simplifying case, the terms ρ3, ρ4, ρ2,2, and ρ22 are negligible compared to ρ2.

In the Appendix, we reintroduce the definition of Var[H] from DeGiorgio et al. (2010) (Corollary 7), and then define Var[HBLUE] (Corollary 8), both of which take the form illustrated in Proposition 4. As demonstrated by DeGiorgio et al. (2010), the mean kinship coefficients composing Equation 10 derive from the relationship between the 15 identity states available to four alleles (Gillois 1965; Cockerham 1971), and the coefficients of kinship between pairs, trios, quartets, and pairs of pairs of alleles within those four.

Bias of H^ for samples containing related or inbred individuals

Here, we briefly derive an equation (Equation 12) within Proposition 5 that describes the bias of H^, which we display in the left panels of Supplemental Material, Figure S1A and Figure S2A. We include Corollaries 9 and 10 to Proposition 5 within the Appendix for specific cases of bias derived from p^i-based and pi-based estimations, respectively. We also note that Equation A10 of Corollary 9 represents the form of the bias typically encountered in applications of H^, as well as in all of our experimental scenarios.

Proposition 5:

Consider a locus with I distinct alleles and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of n possibly related or inbred individuals, the bias of the estimator of expected heterozygosity H^ changes with the true locus expected heterozygosity such that

Bias[H^(p˘i)]=1nρ2n1H, (12)

where

H^(p˘i)=nn1(1i=1Ip˘i2). (13)

Proof:

We begin by substituting Equation 6 into Equation 13 such that

H^(p˘i)=n(1ρ2)n1H˘,

and

E[H^(p˘i)]=n(1ρ2)n1H.

From the definition of bias,

Bias[H^(p˘i)]=E[H^(p˘i)]H=1nρ2n1H.

Variance of FST estimators using unbiased linear estimators of allele frequencies

Because the population differentiation statistic FST (Wright 1951) can be defined in terms of expected heterozygosities, it is possible to theoretically evaluate its approximate variance. A general estimator of FST can be written as

F˘ST=H˘1212(H˘1+H˘2)H˘12, (14)

where H˘12 is an unbiased estimator for the expected heterozygosity between a pair of sampled populations, numbered 1 and 2, defined as H˘12=1i=1Ip˘iq˘i (where q˘i is a linear unbiased estimator of the frequency of allele i in population 2, analogous to p˘i in population 1), while H˘1 and H˘2 are the within-population expected heterozygosities for populations 1 and 2, respectively. Referring to the numerator as x, and the denominator as y, we can write the expression for an approximation of the variance of a ratio as

Var[xy](E[x])2(E[y])2[Var[x](E[x])2+Var[y](E[y])22Cov[x,y]E[x]E[y]], (15)

following the definition for the approximate variance of a ratio (Wolter 2007).

Proposition 6:

Consider a locus with I distinct alleles across two populations and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1 for population 1, and qi[0,1], i=1,2,,I, and i=1Iqi=1 for population 2. For samples of size n1 and n2 individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness, the variance of the population differentiation statistic calculated from their respective expected heterozygosities is approximated as

Var[F˘ST][H1212(H1+H2)]2H122×[Var[H˘1212(H˘1+H˘2)][H1212(H1+H2)]2+Var[H˘12]H1222Cov[H˘1212(H˘1+H˘2),H˘12][H1212(H1+H2)]H12], (16)

where

Var[H˘1212(H˘1+H˘2)]=Var[H˘12]+14Var[H˘1]+14Var[H˘2](Cov[H˘12,H˘1]+Cov[H˘12,H˘2]). (17)

In the Appendix, we provide a derivation of the variance and covariance components of Equations 16 and 17. For each of these equations, the result and proof are fairly long, and do not simplify when arranged into Equation 16.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Results

Analytical validation of HBLUE

We tested the performance of HBLUE using both theory and simulations against that of the unbiased estimator H (DeGiorgio et al. 2010), and of H^ (Nei and Roychoudhury 1974). Here, we applied the estimators to samples of individuals wherein each individual was related to exactly one other. Thus, for samples of size n individuals, the number of relative pairs was n/2. When inbred or closely related individuals are included in a sample, H^ is a biased estimator of gene diversity for which we use the symbol H^full. To construct an unbiased estimator with H^, we also applied H^ to a reduced sample in which one member of each relative pair was removed randomly for samples containing only diploid individuals, and the haploid member was removed for each haploid-diploid (i.e., male-female) pair (reduced sample size of n/2), and we denote this estimator by H^red. To evaluate the performance of the four estimators (H^full, H^red, H, and HBLUE), we modified the factors upon which their variance depends: true locus expected heterozygosity (H), sample size n, and relatedness of individuals within the sample (Φ).

Effect of true locus expected heterozygosity, H, on estimators

We first evaluated the theoretical bias, variance, and mean squared error (MSE) of each estimator across the 645 human microsatellite loci from across the genome in the composite dataset MS5795 of Pemberton et al. (2013), where MSE is the sum of the squared bias and variance. The data used in our analyses is freely available online within File S1 of Pemberton et al. (2013) (http://www.g3journal.org/content/early/2013/03/27/g3.113.005728/suppl/DC1). We took the sample allele frequencies calculated from all individuals in the MS5795 dataset as the true population allele frequencies for the variance calculations, and, from these, determined the true expected heterozygosity at each locus using Equation 1 (see File S1; incorporated into Equation A10). Here, each sample contained 60 diploid individuals composed of 10 inbred full-sibling, 10 outbred full-sibling, and 10 outbred avuncular pairs. Each point in Figure 1 and Figure S1 represents a single analytical computation for a sample of 60 (or 30 for H^red) individuals at a microsatellite locus. We report the approximate variance and MSE because each individual is related to exactly one other in the sample, satisfying the assumption of Equation 11. Further, under this scenario DeGiorgio et al. (2010) showed that this was a reasonable approximation of the exact variance.

Figure 1.

Figure 1

Theoretical difference in MSE between the unbiased estimator H^red (left), H (center), or HBLUE (right), and the biased estimator H^full calculated at each of 645 microsatellite loci (0.5212H0.9301) in the MS5795 dataset for samples of 60 diploid individuals containing some inbred relative pairs. Each sampled individual was related to exactly one other, and samples contained 10 pairs of inbred full-siblings (Φ=3/8), 10 pairs of outbred full-siblings (Φ=1/4), and 10 outbred avuncular pairs (Φ=1/8). Dotted lines in each plot correspond to a difference in MSE of zero with H^full. See File S1 for the true expected heterozygosity values incorporated into analytical calculations.

We begin by demonstrating the relative performance of the unbiased estimators H^red, H, and HBLUE, measured in terms of MSE, against the biased estimator H^full (Figure 1). While the variance of H^full is invariably smaller than that of the other estimators, and the MSE and variance of each estimator decrease with increasing locus expected heterozygosity (0.5212H0.9301), H^full accumulates bias quadratically with increasing H, and thus yields an increasingly unreliable estimate with increasing site diversity (Figure S1A, left). However, the effect of this trend differs for each comparison. The MSE of H^red always exceeds that of H^full, because the removal of relatives to create the reduced sample causes a substantial increase in estimator variance, though, for high diversity markers, the MSE values of H^full and H^red converge (Figure 1, left). In contrast, H outperforms H^full for most loci, demonstrating that the rate of decrease in MSE with increasing H is greater for H than for H^full (Figure 1, center). Interestingly, the comparison of HBLUE with H^full shows an opposite trend to the preceding two. Despite the impact of bias, the decrease in variance of H^full over the analyzed range outpaces that of HBLUE. Even so, HBLUE uniformly yields a smaller MSE for the analyzed diploid samples (which contain a proportion of inbred individuals) across all loci (Figure 1, right).

To validate these theoretical predictions, we simulated 30 independent genotypes for each locus, and, for each independent genotype, simulated a single relative’s genotype (inbred full-sibling, outbred full-sibling, or avuncular). Briefly, we generated the independent genotypes by sampling alleles uniformly at random from the distribution of allele frequencies at each microsatellite locus, and generated relatives by copying zero, one, or two alleles from the relative according to the probability the pair would share zero, one, or two alleles IBD [see Lange (2002), Chapter 5]. The patterns observed for the simulated data accord with those of the theoretical predictions (Figure S2, each point is based on 104 simulations). It is clear from these results that locus expected heterozygosity is heavily influential on estimator MSE. However, we also find that the observed value of expected heterozygosity for a locus normalized to its range of expected heterozygosity values has an impact on estimator MSE. The maximum and minimum values of expected heterozygosity for a locus depend on the number of distinct alleles (I), and the frequency of the most frequent allele (M), at that locus [see Theorem 2 of Reddy and Rosenberg (2012)]. We quantify proximity of H for a locus to its maximum possible value as B=D/R, where D is the observed value of expected heterozygosity for a locus minus its minimum possible value given I and M, and R is the maximum minus the minimum value of expected heterozygosity, given I and M, such that B[0,1]. Loci with a smaller value of B yield a smaller MSE for all estimators (Figure S3).

Effect of sample size, n, on estimators

We next examined the properties of each estimator as a function of sample size. All estimators perform increasingly well for samples of increasing size. We demonstrate this property by measuring estimator MSE for samples containing 2–100 relative pairs of various type and ploidy at the D3S2427 locus, selected to highlight the improved performance of HBLUE as the bias of H^full increases (H=0.9301; Figure 2). For these tests, we considered only a single relative pair type at a time. The unbiased estimators H and HBLUE perform identically for diploid samples of first- and second-degree relative pairs regardless of inbreeding (Figure 2, A–D). Additionally, estimator MSE is uniformly smaller for samples containing only second-degree relative pairs than it is for samples containing only first-degree pairs (cf. Figure 2, A and B, and Figure 2, C and D; see also, Figure S4A). However, HBLUE unambiguously outperforms the other estimators with relative pairs of varying ploidy (in this case, male-female full-sibling pairs at an X-linked locus). In this scenario, H^red provides a more accurate estimate of expected heterozygosity than does H when the reduced set is created by removing only males from the original while retaining females (Figure 2E). When all females are removed instead, and males retained (Figure 2F), the MSE of H^red is markedly the largest of the four estimators because 2/3 of the alleles in the sample are discarded, rather than 1/3. For samples with inbred full-siblings whose parents are brother and sister (Figure 2, C and D), the trend of MSE with sample size mirrors that of outbred diploid samples (Figure 2, A and B), but with larger MSE. However, the relative performance of H^full is notably worse for samples containing inbred diploid avuncular pairs (Figure 2D) than for samples containing outbred diploid avuncular pairs (Figure 2B). That is, its MSE remains greater than, or equal to, that of the other estimators over the range of sample sizes considered for the inbred diploid avuncular pair scenario (Figure 2D), but consistently has smaller MSE than H^red for the outbred diploid avuncular pair scenario (Figure 2B). Generally, increasing the sample size is most effective for samples of <20 individuals, and it is over this range that the difference in performance of the estimators is most apparent.

Figure 2.

Figure 2

Theoretical MSE as a function of sample size for samples of outbred diploid full-siblings (A), outbred diploid avuncular pairs (B), inbred diploid full-siblings (C), inbred diploid avuncular pairs (D), male-female full siblings at an X-linked locus with the reduced set omitting males and retaining females (E), and male-female full siblings at an X-linked locus with the reduced set omitting females and retaining males (F). The samples were evaluated for the D3S2427 locus (H=0.9301), and sample size was always twice the number of relative pairs included in the sample for samples containing 2–100 relative pairs. Each individual in the sample was related to exactly one other.

Effect of varying sample relative pair composition on estimators

Finally, we calculated the MSE of each estimator for all 1326 combinations of one to three relative pair types for samples of 100 individuals fixed at 50 relative pairs, which we represent as triangular heat maps, across samples containing outbred diploids, males and females at an X-linked locus, or inbred diploids (each individual related to exactly one other; Figure 3, Figure S4, Figure S5, Figure S6, Figure S7, and Figure S8). The kinship coefficients (Φ) for each relative pair type considered across our tests are defined in Lange (2002, Chapter 5) and DeGiorgio et al. (2010, see Table 2), and modeled on the D3S2427 locus (H=0.9301).

Figure 3.

Figure 3

Theoretical difference in MSE between H^full (left), H^red (center), or H (right), and HBLUE, for samples of 100 (A) outbred diploid individuals, (B) male and female individuals at an X-linked locus, or (C) diploid individuals wherein some full siblings are inbred with brother-sister parents. The samples and MSE values considered for each subtraction were modeled on the D3S2427 locus (H=0.9301). Each sample contained 50 relative pairs, such that each individual was related to exactly one other. Each sample configuration is a single point in the space of a heat map defined by three coordinates (each representing the count of a relative pair type). For each configuration, the MSE of HBLUE is subtracted from that of the other estimators, yielding a value >0. Samples were composed of one to three relative pair types where the vertex of each heat map represents a sample with only a single relative pair type. The relative pair types were (A) parent-offspring (PO), second-degree avuncular (AV), and full-sibling (FS), (B) male-male (MM), male-female (MF), and female-female (FF) full-sibling such that the number of males and females in each sample is not fixed, or (C) inbred full-sibling (FSi), second-degree avuncular (AV), and outbred full-sibling (FSo). Blue and black points indicate the smallest and largest values, respectively, on each map. Threshold values for coloration are indicated in the scales to the right of each heat map, with smaller values colored lighter. Note that the scales are not identical across heat maps. The values upon which these subtractions are based are represented as heat maps in (A) Figure S4A, (B) Figure S4B, or (C) Figure S4C.

Table 2. Wilcoxon signed-rank test for weighted mean across all loci of F^ST,red with F^ST and FST,BLUE for the French population with the 92 other populations whose samples contained related individuals.

Comparison P-Value for Wilcoxon Signed-Rank Test
F^ST,red with F^ST 5.25×1015
F^ST,red with FST,BLUE 0.967

The outbred diploid samples included parent-offspring (Φ=1/4), avuncular (Φ=1/8), and full-sibling (Φ=1/4) relative pairs. Because parent-offspring and full-sibling pairs have the same kinship coefficient, the heat maps in Figure 3, Figure S4A, Figure S5A, Figure S6A, and Figure S7A are symmetrical with parent-offspring and full-sibling pairs on the bottom vertices, and avuncular pairs on the top vertex. H^red yielded the largest MSE of the four estimators, and this value was constant throughout the space of the heat map (Figure S4A, second triangle), because all reduced sets are identical for outbred diploid samples. HBLUE consistently yielded the smallest MSE across configurations (Figure S4A, fourth triangle). As was the case in Figure 2, the MSE of the estimators H^full, H, and HBLUE was smallest for samples with only avuncular pairs, because these contain fewer dependent allele observations on average. We observed these features in simulated data as well (Figure S8A).

Although HBLUE performed best overall for samples including outbred diploid relative pairs at D3S2427, the estimator with the smallest variance in all situations is the biased estimator H^full (Figure S6A). However, because its squared bias increases with the number of first-degree pairs (Figure S5A), its relative performance declines compared to HBLUE as more of these pairs are sampled (Figure 3A, left triangle). The relative performance of H^red is highest when the number of first degree pairs is maximized, but this is due to the decreasing performance of HBLUE as more dependent observations are included (Figure 3A, center triangle). While the difference in MSE between H and HBLUE is always slight for samples of noninbred diploids, these values diverge as the complexity of the sample increases (Figure 3A, right triangle). That is, as the numbers of first- and second-degree pairs approach each other, HBLUE emerges decisively as the more accurate estimator, with the maximum value of this difference reached at 23 second-degree and 27 first-degree pairs. Thus, while the performance of the estimators for a sample containing relatives follows the same general trend, HBLUE provides the greatest accuracy for heterogeneous samples of outbred diploid individuals.

We also considered the relative performance of each estimator when using either the BLUE (pi) or the sample proportion (p^i) to estimate allele frequencies. Notably, all estimators perform best when the BLUE (pi) of allele frequency rather than the sample proportion (p^i) is used to infer population allele frequencies. We calculated the theoretical MSE for each estimator once with p^i, and once with pi, across all combinations of relative pairs for diploid individuals at the D3S2427 locus and mapped its value for the estimate with p^i minus the estimate with pi (Figure S7A). Because both frequency estimations yield the same values in samples of unrelated individuals, H^red performs identically for p^i and pi, and is not included. The MSE of an estimator calculated with pi is always smaller than that of the estimator calculated with p^i, and the pattern of divergence between their MSEs follows a similar trend across all estimators, resembling the rightmost panel in Figure 3A. This result suggests that the difference in MSE between H and HBLUE is driven primarily by the difference in performance between p^i and pi. Both the p^i and pi estimators yield the same value at the vertices of the triangles, and the difference in their MSEs reaches a maximum at 22 second-degree pairs for H^full and 24 second-degree pairs for H and HBLUE (Figure S7A, center and right triangles). The MSE of HBLUE calculated with p^i is, at most, on the order of 109 greater than that of HBLUE calculated with pi, indicating its robustness to variance in allele frequency determination (Figure S7A, right triangle). In contrast, the other estimators return a maximum difference in MSE on the order of 107. The estimation of expected heterozygosity with H^full, H, or HBLUE will always yield a smaller MSE for samples of outbred, diploid individuals when pi rather than p^i is taken as the estimator of population allele frequency.

We repeated these tests in samples of mixed ploidy (Figure 3B, Figure S4B, Figure S5B, Figure S6B, Figure S7B, and Figure S8B), and HBLUE emerged similarly superior to the other estimators, once again yielding the smallest MSE. We analyzed the D3S2427 locus as X-linked for these tests, counting males as haploid and females as diploid, and observed full-sibling pairs [similarly to DeGiorgio et al. (2010), Φ=1/2 for male-male pairs, Φ=1/4 for male-female pairs, and Φ=3/8 for female-female pairs] for samples of 100 individuals and 50 relative pairs. All estimators reach their maximum MSE in samples containing only male-male pairs (Figure S4B). This is because the number of independent observations (indicated by a larger mean kinship coefficient) is smallest when there are no females in the sample. Correspondingly, the estimators yield smaller MSE values with increasing incorporation of male-female pairs. The minimum MSE of H^full is reached at 50 male-female pairs, as with H and HBLUE because its squared bias (Figure S5B) decreases with increasing male-female pairs, though its variance is smallest at 50 female-female pairs, due to the greater number of alleles in the sample (Figure S6B). To create the reduced sets, males were removed from male-female pairs to minimize the subsequent increase in MSE. That is, the removal of males removes 1/3 of the allele copies from the sample, rather than 2/3 if females are removed, or 1/2 for a pair of same-ploidy individuals, and so H^red has the same value across samples with the same number of male-male pairs (Figure S4B, second triangle).

The direct comparison of HBLUE with the other estimators once again yielded different signatures for each subtraction for mixed-ploidy samples (Figure 3B). The point of greatest difference in MSE between H^full and HBLUE occurs when all relative pairs are male-male, while the point of least difference occurs for samples of only male-female pairs (Figure 3B, left triangle). This pattern broadly resembles the squared bias of H^full (Figure S5B, first triangle), underscoring the effect of bias on estimator performance. The pattern of difference in performance between H^red and HBLUE differs markedly, and the two estimators perform most similarly as the number of male-male pairs decreases, reaching a minimum at 33 male-female pairs plus 17 female-female pairs (Figure 3B, middle triangle). H yields the closest MSE to that of HBLUE for all relative pair configurations, and their difference is, at most, on the order of 106 (Figure 3B, right triangle). The pattern here mainly reflects the difference in performance between p^i and pi estimates of population allele frequency, as in Figure S7B, where pi estimators yield increasingly smaller comparative MSE values as the numbers of relative pairs in the sample approach each other.

We repeated the preceding tests once more for a sample in which full-siblings resulting from a brother-sister mating were included alongside second-degree and outbred full-sibling pairs (Figure 3C, Figure S4C, Figure S5C, Figure S6C, Figure S7C, and Figure S8C). Here, the kinship of inbred individuals with each other was 3/8 rather than 1/4. For all estimators, the inclusion of inbred full-siblings increased the MSE of the estimator, with a maximum MSE at 50 inbred full-sibling pairs, and a minimum at 50 second-degree pairs. For H^red, this minimum was also reached for any sample in which there were no inbred individuals, because the reduced sample is identical for these (Figure S4C, second triangle). Again, HBLUE was the least errant estimator across the space of sample configurations (Figure S4C, fourth triangle), and its advantage over the other estimators differs for each estimator (Figure 3C). Because the bias of H^full is largest at 50 inbred full-sibling pairs, the greatest difference in performance between it and HBLUE is at this point (Figure 3C, left triangle). Meanwhile, the largest differences in MSE between H^red and HBLUE are near the top vertex, where second-degree relative pairs predominate, while the smallest are toward the bottom vertices (Figure 3C, center triangle). The difference in MSE between H and HBLUE is at least an order of magnitude less than for the other comparisons, and increases for increasing sample complexity, but reaches its maximum for samples of 28 inbred full-sibling plus 22 second-degree pairs (Figure 3C, right triangle). This pattern reflects the decreased MSE for the estimators when calculated with pi compared to their calculation with p^i (Figure S7C). Ultimately, the performance of the estimators of expected heterozygosity across varying sample compositions depends on the estimator of allele frequency incorporated into the expected heterozygosity calculation. No matter the sample type, estimators based on pi outperform estimators based on p^i, and HBLUE outperforms H^full, H^red, and H.

Tests of HBLUE on single-nucleotide polymorphism (SNP) loci

Because SNP datasets are more common in recent studies, we performed analyses equivalent to our microsatellite analyses for 50 hypothetical SNP loci. These loci were biallelic with minor allele frequency (MAF) between 0.01 and 0.5, with increments of 0.01, corresponding to expected heterozygosity values ranging from 0.0198 to 0.5. We first measured the difference in MSE of H^full with that of H^red, H, or HBLUE as a function of true locus expected heterozygosity (H), as we did in Figure 1 (Figure S9). For each locus, the MSE of HBLUE was smallest, while that of H^full was generally second-smallest, following the trend for microsatellite loci visible in Figure 1, wherein less diverse loci yielded a smaller MSE for H^full than for H. However, unlike for microsatellite loci, estimator MSE peaks midway through the range of evaluated SNP loci, such that the smallest MSE values lie at either extreme of the range and the largest MSE value, as well as the largest difference in MSE values for all comparisons, is at the locus with MAF=0.15 (H=0.255). Additionally, H^full performs comparatively better than H^red (Figure S9, left) and H (Figure S9, center) as H approaches 0.255, but is outperformed by these unbiased estimators as H approaches 0.5. Once more, the trend is opposite for the comparison between H^full and HBLUE, showing the greatest comparative performance by HBLUE at the same locus (MAF =0.15, H=0.255). Thus, considering the results presented in Figure 1 and Figure S9, the greatest relative performance of HBLUE for inbred samples is achieved at loci for which estimator MSE is largest.

We next examined the effect of sample size on estimator performance for hypothetical samples of outbred diploid, inbred diploid, and outbred male-female relative pairs at the simulated locus with MAF =0.05 (H=0.095). As we varied the sample size from two relative pairs to 100 (each individual related to exactly one other, one relative pair type per sample), we found that HBLUE yielded the smallest MSE of all estimators only for samples containing male-female full-sibling pairs modeled at an X-linked locus (Figure S10, E and F). This observation mirrors the trend seen in Figure 2, wherein HBLUE outperformed the other estimators across all sample sizes. However, H^full yielded the smallest MSE across all sample sizes for outbred and inbred diploid full-siblings and avuncular pairs (Figure S10, A–D). This result is because the samples modeled here are minimally complex, with only one relative pair type, and modeled for a highly homozygous marker—two conditions under which the low bias and variance of H^full result in favorable performance.

Finally, we analyzed estimator performance once more for the locus with MAF =0.05 (H=0.095), for a sample of 50 individuals across changing outbred diploid, inbred diploid, and male-female full-sibling relative pair compositions (Figure S11, A–C). We display these results as heat maps, and find that our results here are broadly concordant with those for the D3S2427 human microsatellite locus (H=0.9301). As with the experiments displayed in Figure S10, the least complex samples yielded a smaller MSE for H^full estimates than for HBLUE estimates. Correspondingly, samples whose relative pair compositions resulted in fewer independent allele observations were more accurately and precisely evaluated with HBLUE. Thus, while sampling lower-diversity markers may occasionally favor the use of H^full, the inclusion of two or more relative pair types in the sample is likely to bias H^full, and require the use of HBLUE to yield accurate inferences.

Empirical application of HBLUE

To conclude our investigation into the performance of HBLUE, we applied it to empirical data from the MS5795 dataset. We retrieved human microsatellite data from 5795 individuals (11,590 allele copies) across 645 autosomal loci sampled genome wide. We assumed the mean value across loci for H^red in each of 267 populations to be the true expected heterozygosity value for these populations, as it is an unbiased estimate. We additionally chose to compare the other estimators with H^red, because an important basis for their evaluation is their agreement with this unbiased estimator, irrespective of the data to which they are applied.

To emphasize this, we performed three Wilcoxon signed-rank tests to compare the ranking of populations by their mean expected heterozygosity across all loci calculated with H^red, and either H^full, H, or HBLUE (Table 1). At the α<0.01 significance level, the comparisons showed that the inclusion of relatives for H^full was highly significant on the rankings it yielded, indicating that not correcting for relatedness among samples can significantly alter the estimates of expected heterozygosity. However, both H and, especially, HBLUE, yielded P-values greater than the threshold for the test against H^red. These results indicate that the estimates of expected heterozygosity are not significantly affected by the inclusion of related individuals in the sample when relatedness is taken into account. Furthermore, a test between H and HBLUE yielded a P-value of 3.44×102, suggesting no significant difference in the ranking of populations by mean expected heterozygosity with these two estimators.

Table 1. Wilcoxon signed-rank test for mean across loci of H^red with H^full, H, and HBLUE for the 93 populations whose samples contained related individuals.

Comparison P-Value for Wilcoxon Signed-Rank Test
H^red with H^full 4.39×1015
H^red with H 1.00×102
H^red with HBLUE 0.255

Although the unbiased estimators H and HBLUE have smaller MSE than H^full for samples with related individuals, their variance tends to be larger than that of H^full. DeGiorgio et al. (2010) previously showed that the difference in SD of H with H^full was small, while the mean values of H and H^red were much more similar to each other than either of them was to the mean of H^full. We again show this to be the case, and find as well that HBLUE not only repeats, or improves upon, the concordance of H with H^red, but, in some cases, HBLUE has a smaller SD than does H^full (Figure 4, left and center panels). A direct comparison of the performance of H against that of HBLUE (Figure 4, right panel) shows that HBLUE has a generally improved SD, and similarity to the H^red estimate over H. For some samples (primarily those from the Americas), this is not the case, possibly because all close relatives were not identified in the original dataset, resulting in an incorrect kinship matrix for calculation of the statistic.

Figure 4.

Figure 4

Application of the estimators to dataset MS5795. Here, we show a comparison of two estimators at a time (H^full, H, or HBLUE) by the difference in their mean with that of H^red across the 645 sampled microsatellite loci of MS5795 (vertical axis), and by their SDs (horizontal axis). The horizontal dotted line corresponds to no difference between the mean of the estimator and the mean of the unbiased estimator H^red. Solid lines connect calculations made for the same population with different estimators. Points are colored by geographic division defined in the dataset. Only the 93 populations with relatives in their samples were included because H^full, H, and HBLUE return the same value for samples of unrelated individuals. In the leftmost plot, open points are estimates for H^full, while closed points are for H. In the center plot, open points are estimates for H^full, while closed points are for HBLUE. In the rightmost plot, open points are estimates for H, while closed points are for HBLUE.

Improving estimates of FST by application of HBLUE

We predicted that the smaller MSE of HBLUE would translate to improved accuracy for estimators that are summaries of expected heterozygosity when samples contain related individuals. To test this hypothesis, we calculated the population differentiation statistic, FST (Equation 4), for pairs of populations whose samples in the MS5795 dataset contained related individuals. Our intent was to compare the MSE and bias of the commonly used FST estimator of Reynolds et al. (1983), which is based on H^full, and which we label as F^ST, to an estimate of FST calculated from HBLUE, which we label FST,BLUE. The formulas for these estimators follow the form of the general estimator of FST (Equation 14). We first measured the MSE of both methods (and an estimate using H, FST) on simulated data, where the FST of pairs of populations with samples of size 60 diploids each (30 relative pairs, 10 inbred full-sibling, 10 outbred full-sibling, and 10 avuncular pairs; Figure 5) was averaged across 104 simulated replicates. The calculations included here were performed for simulated Gujarati and Maya (left), Gujarati and Japanese (center), or Gujarati and Hadza (right) samples for the least diverse (TCTA015M_22), median diverse (D10S2327), and most diverse (D3S2427) loci of the MS5795 dataset, following their allele frequency distribution in MS5795. FST,BLUE consistently has a smaller MSE than the others, and the MSE of all estimators of FST decreases with increasing locus diversity, as the MSE of the estimator of expected heterozygosity decreases.

Figure 5.

Figure 5

Application of the estimators H^full, H, and HBLUE to the calculation of FST as F^ST, FST, and FST,BLUE, respectively, using simulated data for the Gujarati sample, with either the Maya (left), Japanese (center), or Hadza (right) samples, showing MSE on the vertical axis. The Reynolds et al. (1983) estimator is equivalent to the application of H^full in calculating population differentiation. The simulated samples contained 60 individuals and 30 relative pairs, of which 10 were inbred full-siblings, 10 were outbred full-siblings, and 10 were outbred avuncular pairs. Each individual was related to exactly one other, and the data were simulated following the same probabilistic method as employed to generate Figure S2. The three loci displayed on the horizontal axis are the least diverse, median diverse, and most diverse loci of the 645 MS5795 human microsatellites.

We additionally find that F^ST has an upward bias compared with F^ST,red (calculated with H^red), as well as a larger SD in general than FST,BLUE (Figure 6). Furthermore, all values of FST,BLUE are smaller than the paired value of F^ST calculated for the same population. The difference in the mean of F^ST and of FST,BLUE across all loci with the mean of F^ST,red, an estimator which serves as a proxy for the true value of FST, is displayed on the vertical axis, while the horizontal axis measures the SD of F^ST and of FST,BLUE (Figure 6). Supporting our observations indicating the improved accuracy of FST,BLUE over F^ST, Wilcoxon signed-rank tests (Table 2) between F^ST,red and either F^ST or FST,BLUE indicate that the inclusion of relatives significantly affects the estimate of population differentiation at the α<0.01 significance level. Meanwhile, F^ST,red and FST,BLUE are not significantly different in their estimates. These results suggest that the improved properties of HBLUE transfer to the summaries that include it in their calculations.

Figure 6.

Figure 6

Application of the estimators HBLUE and H^full to the estimation of FST as F^ST and FST,BLUE, respectively, from empirical data. Similarly to Figure 4, the difference between the mean of the estimator of FST (either derived from HBLUE or H^full) and an unbiased estimator (derived from H^red), is displayed on the vertical axis, while the SD of the estimator is displayed on the horizontal axis. The empty circles represent the Reynolds et al. (1983) estimator (identical to the H^full-derived estimation), while the filled circles represent the estimation derived from HBLUE. Here, the FST values for the French sample with each of the 92 other samples containing related individuals in the dataset MS5795 are plotted, colored by the region of the changing sample.

Discussion

We have introduced HBLUE, an extension to the estimator (H) of expected heterozygosity developed by DeGiorgio et al. (2010) that yields a smaller mean squared error in samples containing related individuals, while maintaining unbiasedness. Conveniently, the derivations of HBLUE, and its variance, are parallel in form to those of H, and we were therefore able to analytically evaluate the performance of the new estimator simultaneously with that of its predecessor. Our updated estimator, HBLUE, is based on results from McPeek et al. (2004), who characterized the BLUE (pi) of allele frequency. The BLUE improves the precision of allele frequency estimation in complex pedigrees, for which the sample proportion (p^i, the estimator of allele frequency used in H^ and H) is unbiased, but increases in variance with inclusion of related and inbred individuals. Because the properties of the estimator of allele frequency transfer to the estimator of expected heterozygosity, HBLUE is likely to outperform H in situations where pi has a smaller variance than p^i. This trend is true for genome-wide data as well (Figure 4 and Table 1).

Overall, HBLUE yields identical results to H in samples containing only one relative pair type, but the two diverge in performance as sample complexity increases (see heat maps in Figure 3, Figure S4, Figure S5, Figure S6, Figure S7, and Figure S8). While both estimators are unbiased, H experiences a larger increase in variance for each additional relative pair type introduced into a sample after the first. This holds true for all sample types regardless of ploidy and inbreeding, suggesting that HBLUE will outperform H in practice, where datasets are often complex. Furthermore, the results of our empirical analysis provide an equally important complement to this observation. Of the 93 populations from the MS5795 dataset we considered that contained relative pairs in their samples, each contained sampled individuals that were not related to any other in the sample. Thus, these samples were more complex than those in which each individual was part of a relative pair of the same type. For most of these cases, except for some American populations (discussed below), HBLUE outperformed H. This is corroborated by the Wilcoxon signed-rank test (Table 1). We expect therefore that any scenario in which there is heterogeneity in relative pair type among sampled individuals, as is observed in many human population-genetic datasets (Pemberton et al. 2010, 2013), should favor the application of HBLUE over other estimators.

In addition, random sampling of small isolated populations yields an increased chance that related individuals will be included with large enough sample sizes. Further, inbreeding may confound estimates of diversity, and mislead H^full to underreport true population expected heterozygosity. Populations of interest that may display these attributes include geographically isolated human settlements in remote alpine (Coia et al. 2012; Capocasa et al. 2013), South American rainforest (Wang et al. 2007), and Siberian taiga and steppe habitats (Dulik et al. 2012), and groups such as the Old Order Amish (Van Hout et al. 2010), Hutterites (Abney et al. 2002; Chong et al. 2011), and Mennonites (Payne et al. 2011). Further, though our analysis did not directly consider polyploid organisms, the applicability of HBLUE to samples containing individuals of any, and varying, ploidy highlights its usefulness for such data. Prominently, analysis on polyploid organisms such as plants including tetraploid Arabidopsis thaliana (Hollister et al. 2012), and hexaploid bread wheat (Nielsen et al. 2014), both of which self-fertilize, and may therefore be inbred, as well as commercially and ecologically significant Hymenopteran insects, including honeybees (Solignac et al. 2003; Harpur et al. 2014), bumblebees (Lye et al. 2011), and ants (Butler et al. 2014), whose males are haploid at all loci, while females are diploid, is likely to benefit from the improved accuracy and precision of HBLUE.

We additionally believe that continued investigations into the diversity at single sites in organisms as diverse as dogs (Sutter et al. 2007), gray wolves (Zhang et al. 2014), humans living at high altitude (Simonson et al. 2010; Huerta-Sánchez et al. 2013), and rice (Huang et al. 2012), in addition to host-microbiome studies (Blekhman et al. 2015), will benefit from the advances provided by HBLUE. These studies, as well as many others, have performed scans for positive selection using genomic outliers of population differentiation-based statistics (e.g., FST, locus-specific branch length, and the population branch statistic), where the calculation is performed per-site, rather than averaged across a large number of sites. Such studies would benefit from estimators of genetic diversity, such as HBLUE and FST,BLUE, with improved variance.

It is pertinent at this point to revisit a pair of potential limitations in our method and examine their implications. First, in Figure 4 (rightmost panel), the mean of H is either closer to that of H^red than to HBLUE, has smaller SD than HBLUE, or both for certain samples (predominantly from the Americas). These observations indicate that the accuracy and precision of HBLUE may be impacted by the accuracy of the kinship information incorporated into the calculation. The pedigrees of smaller, more remotely located, populations may be more complex compared to those of larger groups. Further, with a greater proportion of relative pairs in each sample, the effect of relative pair type misidentification may be larger. For RELPAIR (Epstein et al. 2000), which was the software chosen to identify relative pairs in MS5795 samples, second-degree pairs cannot be identified as confidently as first-degree pairs (Pemberton et al. 2010). Even so, although H may exhibit a somewhat greater robustness to relative pair misclassification, it is still generally outperformed by HBLUE.

The second point we address is the smaller MSE of H^full at less diverse loci in the dataset, especially for samples with fewer relative pairs. While the variance of H^full is always smaller than that of the other estimators, its bias increases with increasing locus allelic diversity. It is for this reason that the unbiasedness of HBLUE is its most desirable property. In practice, the mean of expected heterozygosity is often taken across loci. Based on such an approach, HBLUE (and H as well) will return the mean expected heterozygosity, and the variance of this estimation (as with all estimators taking the mean across loci) approaches zero as more loci are sampled. An interesting property of all estimators is that their variance (and therefore MSE) is larger for loci whose value for B is closer to 1, where B=D/R (B[0,1]; see Results and Figure S3). Because this effect is greatest for loci with lower true values of H, we expect H^full to have the smallest MSE of all estimators at less diverse loci that are close to their maximum expected heterozygosity, and for which the sample mean kinship coefficient is insufficiently large to appreciably bias the estimator (Equation 12). It is thus important to note that no estimator is uniformly superior to the others. Accordingly, the unique limitation of HBLUE is that the sample kinship matrix must be invertible for the calculation to proceed.

HBLUE additionally confers its improved MSE over H^full downstream to calculations that incorporate estimates of expected heterozygosity. To illustrate this point, we computed FST as a function of three estimators: H^full, H, and HBLUE. For simulated data, we found that FST,BLUE, yielded an estimate with smaller MSE for the three tested loci than did F^ST (Figure 5) or FST, and a much smaller mean distance from the true FST value than F^ST. For empirical data (Figure 6), we observed a consistent upward bias for F^ST compared to F^ST,red in samples containing relatives that followed much the same pattern as the downward bias of H^full for such samples. This trend is clear when we consider the formula for FST, which can be written as 1(H1+H2)/(2H12). Taking H^1,full and H^2,full as H1 and H2, this expression yields a larger value than if H^1,red and H^2,red were used, because the ratio (H1+H2)/(2H12) is smaller for downwardly biased estimators. Interestingly, the SD of FST,BLUE is, in most cases, smaller than that of F^ST for the dataset, while the SD of HBLUE was frequently (though not consistently) larger than that of H^full (Figure 4, center panel).

It is thus noteworthy to consider that the performance of HBLUE and H^full may diverge further in their applications, where any improvement in MSE for HBLUE may be magnified downstream. This is highlighted by the increased concordance between FST,BLUE and F^ST,red compared to HBLUE and H^red (cf. P-values between Table 1 and Table 2). With this in mind, applications of FST,BLUE can also be considered. Two such examples are the locus-specific branch length (LSBL; Shriver et al. 2004) and the similar population branch statistic (PBS; Yi et al. 2010). These statistics incorporate FST values between three populations as measures of branch length to detect positive selection at a locus. Loci for which the unrooted three-taxon tree indicates a significantly longer branch length in a particular lineage may represent regions possibly under selection. To allow for the easy application of HBLUE, we have written an R script, BestHet, that computes HBLUE, FST,BLUE, and LSBLBLUE, given matrices of genotype and kinship data for a sample (download available at http://www.personal.psu.edu/mxd60/best_het.html).

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.116.037168/-/DC1.

Acknowledgments

We thank two anonymous reviewers for their insightful comments. This work was supported by Pennsylvania State University startup funds from the Eberly College of Science.

Appendix

Derivations of unbiased estimators of expected heterozygosity

In this section, we derive the general unbiased estimator of expected heterozygosity H˘ for any unbiased linear estimator of population allele frequencies, defined in Proposition 1, and show how the formulas for H (DeGiorgio et al. 2010), and HBLUE (Corollaries 2 and 3), emerge from specific cases of H˘.

Proof of Proposition 1:

We need to show that E[H˘]=H. Note that

p˘i2=j=1nk=1nwjwkXj(i)Xk(i)=j=1nk=1nwjwkmjmk=1mjt=1mkAj(i)Akt(i).

Taking the expectation, we obtain

E[p˘i2]=j=1nk=1nwjwkmjmk=1mjt=1mkE[Aj(i)Akt(i)]=j=1nk=1nwjwkmjmk=1mjt=1mk[Aj(i)=1,Akt(i)=1]=j=1nk=1nwjwkmjmk=1mjt=1mk[(1Φjk)pi2+Φjkpi]=pi2+ρ2pi(1pi). (A1)

Therefore

E[H˘]=11ρ2(1i=1IE[p˘i2])=11ρ2(1i=1I[pi2+ρ2pi(1pi)])=1i=1Ipi2=H.
Proof of Corollary 2:

We show that defining the weight of each individual in the calculation of ρ2 in terms of an individual’s relative allele copy contribution yields H from H˘. Letting wk=mk/x=1nmx, we have that

p˘i=k=1nmkj=1nmjXk(i)=p^i

and

ρ2=j=1nk=1nmjx=1nmxmky=1nmyΦjk=Φ¯2.

Plugging in yields

H˘=11Φ¯2(1i=1Ip^i2)=H.
Proof of Corollary 3:

We show that defining the weight each individual according to their relative contribution to the inverted kinship matrix of the sample yields HBLUE from H˘. Letting wk=j=1n(K1)jk/1TK11, we have that

p˘i=k=1nj=1n(K1)jk1TK11Xk(i)=pi

and

ρ2=j=1nk=1nx=1n(K1)xj1TK11y=1n(K1)yk1TK11Φjk=κ2.

Plugging in yields

H˘=11κ2(1i=1Ipi2)=HBLUE.

Derivations of variances of expected heterozygosity estimators

In this section, we summarize the procedure by which DeGiorgio et al. (2010) derived the equation for the variance of H, illustrating the variance of the general case, H˘. For the full derivation, see Appendix B of DeGiorgio et al. (2010). We then provide the specific formulation for the variance of H (Corollary 7) and HBLUE (Corollary 8).

Abbreviated proof of Proposition 4:

The variance of H˘ (Equation 9) is defined as

Var[H˘]=1(1ρ2)2Var[1i=1Ip˘i2].

By definition of variance, we get

Var[1i=1Ip˘i2]=i=1IVar[p˘i2]+2i=1I1i=i+1ICov[p˘i2,p˘i2],

with

Var[p˘i2]=E[p˘i4](E[p˘i2])2

and

Cov[p˘i2,p˘i2]=E[p˘i2p˘i2]E[p˘i2]E[p˘i2].

Recalling that p˘i=j=1n=1mjwjmjAj(i) for the th allele copy of individual j, whose ploidy is mj, we have that

E[p˘i4]=j=1nk=1nj=1nk=1n=1mjt=1mk=1mjt=1mkwjwkwjwkmjmkmjmkE[Aj(i)Akt(i)Aj(i)Akt(i)],

and

E[p˘i2p˘i2]=j=1nk=1nj=1nk=1n=1mjt=1mk=1mjt=1mkwjwkwjwkmjmkmjmkE[Aj(i)Akt(i)Aj(i)Akt(i)]

for the case ii. We have previously shown in Equation A1 that E[p˘i2]=pi2+ρ2pi(1pi), and the value of E[p˘i2] similarly follows. Thus, we need to calculate E[Aj(i)Akt(i)Aj(i)Akt(i)] and E[Aj(i)Akt(i)Aj(i)Akt(i)] for the ii case. These are

E[Aj(i)Akt(i)Aj(i)Akt(i)]=Φjkjkpi+[Φjkj+Φjkk+Φjjk+Φkjk+Φjk,jk+Φjj,kk+Φjk,kj7Φjkjk]pi2+[12Φjkjk+Φjk+Φjj+Φjk+Φkj+Φkk+Φjk3(Φjkj+Φjkk+Φjjk+Φkjk)2(Φjk,jk+Φjj,kk+Φjk,kj)]pi3+[1+Φjk,jk+Φjj,kk+Φjk,kj+2(Φjkj+Φjkk+Φjjk+Φkjk)6Φjkjk(Φjk+Φjj+Φjk+Φkj+Φkk+Φjk)]pi4 (A2)

and

E[Aj(i)Akt(i)Aj(i)Akt(i)]=[Φjk,jkΦjkjk]pipi+[2Φjkjk+Φjk(Φjkj+Φjkk)Φjk,jk]pipi2+[2Φjkjk+Φjk(Φjjk+Φkjk)Φjk,jk]pi2pi+[1+Φjk,jk+Φjj,kk+Φjk,kj+2(Φjkj+Φjkk+Φjjk+Φkjk)6Φjkjk(Φjk+Φjj+Φjk+Φkj+Φkk+Φjk)]pi2pi2. (A3)

Substituting Equation A2 into E[p˘i4] and solving for Var[p˘i2], we obtain

Var[p˘i2]=ρ4pi+(4ρ3+3ρ2,27ρ4ρ22)pi2+(12ρ4+4ρ2+2ρ2212ρ36ρ2,2)pi3+(3ρ2,2+8ρ36ρ44ρ2ρ22)pi4,

and, substituting Equation A3 into E[p˘i2p˘i2], and solving for Cov[p˘i2,p˘i2], we obtain

Cov[p˘i2,p˘i2]=(ρ2,2ρ4ρ22)pipi+(2ρ4+ρ222ρ3ρ2,2)pipi2+(2ρ4+ρ222ρ3ρ2,2)pi2pi+(3ρ2,2+8ρ36ρ44ρ2ρ22)pi2pi2.

Thus, substituting the values for variance and covariance into the definition of variance, we have

Var[H˘]=1(1ρ2)2[ρ2,2ρ22+2(ρ22ρ4)i=1Ipi2+4(2ρ4+ρ22ρ3ρ2,2)i=1Ipi3+(3ρ2,2+8ρ36ρ44ρ2ρ22)(i=1Ipi2)2].
Corollary 7:

Consider a locus with I distinct alleles, and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n individuals of any ploidy, inbreeding status, and relatedness,

Var[H]=1(1Φ¯2)2Var[1i=1Ip^i2] (A4)

and

Var[1i=1Ip^i2]=Φ¯2,2Φ¯22+2(Φ¯22Φ¯4)i=1Ipi2+4(2Φ¯4+Φ¯22Φ¯3Φ¯2,2)i=1Ipi3+(3Φ¯2,2+8Φ¯36Φ¯44Φ¯2Φ¯22)(i=1Ipi2)2, (A5)

where Φ¯2, Φ¯3, Φ¯4, and Φ¯2,2 are mean kinship coefficients, weighted by the contribution of individuals to the number of allele copies in the sample, with subscripts corresponding to the number of individuals considered for the calculation. Additionally,

Var[H]4Φ¯2[i=1Ipi3(i=1Ipi2)2]. (A6)

The proof of Corollary 7 follows from the proof of Proposition 4, where p^i is substituted for p˘i, and Φ¯2, Φ¯3, Φ¯4, and Φ¯2,2 are substituted for ρ2, ρ3, ρ4, and ρ2,2, respectively.

Corollary 8:

Consider a locus with I distinct alleles and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n individuals of any ploidy, inbreeding status, and relatedness,

Var[HBLUE]=1(1κ2)2Var[1i=1Ipi2] (A7)

and

Var[1i=1Ipi2]=κ2,2κ22+2(κ22κ4)i=1Ipi2+4(2κ4+κ22κ3κ2,2)i=1Ipi3+(3κ2,2+8κ36κ44κ2κ22)(i=1Ipi2)2, (A8)

where κ2, κ3, κ4, and κ2,2 are mean kinship coefficients, weighted by the contribution of individuals to the inverted kinship matrix, with subscripts corresponding to the number of individuals considered for the calculation. Additionally,

Var[HBLUE]4κ2[i=1Ipi3(i=1Ipi2)2]. (A9)

The proof of Corollary 8 follows from the proof of Proposition 4, where pi is substituted for p˘i, and κ2, κ3, κ4, and κ2,2 are substituted for ρ2, ρ3, ρ4, and ρ2,2, respectively.

Derivations of bias measurements in the application of H^

For samples containing related and inbred individuals, H^ has a downward bias, which is defined in Equation 12 for the general estimator of population allele frequency p˘i. Here, we present Corollaries 9 and 10 for the specific estimators of population allele frequency p^i and pi, respectively.

Corollary 9:

Consider a locus with I distinct alleles and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n possibly related or inbred individuals, the bias of the estimator of expected heterozygosity H^ changes with the true locus expected heterozygosity such that

Bias[H^(p^i)]=1nΦ¯2n1H, (A10)

where

H^(p^i)=nn1(1i=1Ip^i2).

As this is the standard application of H^ (Equation 2), Equation A10 describes the bias of H^ in the Results. However, H^ is biased with any unbiased linear estimator of allele frequency for samples containing related or inbred individuals. The proof of Corollary 9 follows from the proof of Proposition 5, where Φ¯2 is substituted for ρ2.

Corollary 10:

Consider a locus with I distinct alleles and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1. For a sample of size n possibly related or inbred individuals, the bias of the estimator of expected heterozygosity H^ changes with the true locus expected heterozygosity such that

Bias[H^(pi)]=1nκ2n1H, (A11)

where

H^(pi)=nn1(1i=1Ipi2). (A12)

The proof of Corollary 10 follows from the proof of Proposition 5, where κ2 is substituted for ρ2.

Derivations of components for the variance of FST estimators

In this final section of the Appendix, we provide derivations for the components of Equations 16 and 17, which describe the variance of F˘ST. We derive the variance of H˘12, as well as the covariances of H˘12 with H˘1 (and interchangeably, H˘12 with H˘2), and of [H˘1212H˘112H˘2] with H˘12. Because the complete expression for Var[F˘ST] is unwieldy, we stop at the derivation of the final component.

Lemma 11:

Consider a locus with I distinct alleles across two independent populations and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1 for population 1, and qi[0,1], i=1,2,,I, and i=1Iqi=1 for population 2. For two samples of size n1 and n2, individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness,

Var[H˘12]=ρ2(1)(1ρ2(2))i=1Ipiqi2+ρ2(2)(1ρ2(1))i=1Ipi2qi+ρ2(1)ρ2(2)i=1Ipiqi+(ρ2(1)ρ2(2)ρ2(1)ρ2(2))(i=1Ipiqi)2, (A13)

where the superscript of the mean kinship coefficient ρ2 corresponds to the population for which it is calculated. The equations for the variance of H12 and H12,BLUE are obtained by substituting Φ¯2 and κ2, respectively, into Equation A13 as the mean kinship coefficients in place of ρ2.

Proof:

By definition of variance,

Var[H˘12]=i=1IVar[p˘iq˘i]+2i=1I1i=i+1ICov[p˘iq˘i,p˘iq˘i]

where

Var[p˘iq˘i]=E[p˘i2q˘i2](E[p˘iq˘i])2

and

Cov[p˘iq˘i,p˘iq˘i]=E[p˘iq˘ip˘iq˘i]E[p˘iq˘i]E[p˘iq˘i].

Because p˘i and q˘i are unbiased estimators of population allele frequency, and populations 1 and 2 are independent,

E[p˘iq˘i]=piqi

Similarly, E[p˘iq˘i]=piqi. Next, we have

E[p˘i2q˘i2]=E[p˘i2]E[q˘i2]=[pi2+ρ2(1)pi(1pi)][qi2+ρ2(2)qi(1qi)]=piqi[pi+ρ2(1)(1pi)][qi+ρ2(2)(1qi)], (A14)

where E[q˘i2] takes the same form as E[p˘i2] (Equation A1), except that the resulting weighted mean kinship coefficient ρ2 is for population 2, indicated by the superscript. By substituting Equation A14 into Var[p˘iq˘i], we have

Var[p˘iq˘i]=piqi[pi+ρ2(1)(1pi)][qi+ρ2(2)(1qi)](piqi)2=piqi{[pi+ρ2(1)(1pi)][qi+ρ2(2)(1qi)]piqi}=piqi[ρ2(1)(1pi)qi+ρ2(2)pi(1qi)+ρ2(1)ρ2(2)(1pi)(1qi)]. (A15)

We now derive an expression for Cov[p˘iq˘i,p˘iq˘i]. Let Bkt(i) be an indicator random variable in population 2 analogous to the indicator random variable Aj(i), which we have previously defined for population 1.

E[p˘iq˘ip˘iq˘i]=E[p˘ip˘i]E[q˘iq˘i]=(j=1n1j=1n1=1mj=1mjwjwjmjmjE[Aj(i)Aj(i)])(k=1n2k=1n2t=1mkt=1mkwkwkmkmkE[Bkt(i)Bkt(i)]),

where

E[Aj(i)Aj(i)]=[Aj(i)=1,Aj(i)=1],

and

E[Bkt(i)Bkt(i)]=[Bkt(i)=1,Bkt(i)=1].

Consider a scenario in which we have two allele copies. Let s1 be the identity state with probability Δ1, in which two randomly drawn alleles are not IBD, and s2 be the identity state occurring with probability Δ2=1Δ1, in which the two alleles are IBD.

[Aj(i)=1,Aj(i)=1]=[Aj(i)=1,Aj(i)=1|s1][s1]+[Aj(i)=1,Aj(i)=1|s2][s2]=pipiΔ1+0×Δ2=Δ1pipi.

Note that, because Δ1+Δ2=1 and Φjj(1)=Δ2 (same with Φkk(2)), we have Δ1=1Φjj(1). Thus,

[Aj(i)=1,Aj(i)=1]=(1Φjj(1))pipi

and

[Bkt(i)=1,Bkt(i)=1]=(1Φkk(2))qiqi.

Substituting, we now have

E[p˘ip˘iq˘iq˘i]=(j=1n1j=1n1=1mj=1mjwjwjmjmj(1Φjj(1))pipi)(k=1n2k=1n2t=1mkt=1mkwkwkmkmk(1Φkk(2))qiqi)=(1ρ2(1))(1ρ2(2))pipiqiqi, (A16)

and substituting Equation A16 into Cov[p˘iq˘i,p˘iq˘i] yields

Cov[p˘iq˘i,p˘iq˘i]=(1ρ2(1))(1ρ2(2))pipiqiqipiqipiqi.=(ρ2(1)ρ2(2)ρ2(1)ρ2(2))pipiqiqi. (A17)

Therefore, using Equations A15 and A17,

Var[H˘12]=i=1Ipiqi[ρ2(1)(1pi)qi+ρ2(2)pi(1qi)+ρ2(1)ρ2(2)(1pi)(1qi)]+2i=1I1i=i+1I(ρ2(1)ρ2(2)ρ2(1)ρ2(2))pipiqiqi=ρ2(1)i=1Ipi(1pi)qi2+ρ2(2)i=1Ipi2qi(1qi)+ρ2(1)ρ2(2)i=1Ipi(1pi)qi(1qi)+2i=1I1i=i+1I(ρ2(1)ρ2(2)ρ2(1)ρ2(2))pipiqiqi=ρ2(1)(1ρ2(2))i=1Ipiqi2+ρ2(2)(1ρ2(1))i=1Ipi2qi+(ρ2(1)ρ2(2)ρ2(1)ρ2(2))i=1Ipi2qi2+ρ2(1)ρ2(2)i=1Ipiqi+2(ρ2(1)ρ2(2)ρ2(1)ρ2(2))i=1Ii=1Ipipiqiqi=ρ2(1)(1ρ2(2))i=1Ipiqi2+ρ2(2)(1ρ2(1))i=1Ipi2qi+ρ2(1)ρ2(2)i=1Ipiqi+(ρ2(1)ρ2(2)ρ2(1)ρ2(2))(i=1Ipiqi)2.
Lemma 12:

Consider a locus with I distinct alleles across two independent populations and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1 for population 1, or qi[0,1], i=1,2,,I, and i=1Iqi=1 for population 2. For two samples of size n1 and n2 individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness,

Cov[H˘12,H˘1]=11ρ2(1)[2(ρ3(1)ρ2(1))i=1Ipi3qi+(2ρ2(1)3ρ3(1))i=1Ipi2qi+ρ3(1)i=1Ipiqi] (A18)

and

Cov[H˘12,H˘2]=11ρ2(2)[2(ρ3(2)ρ2(2))i=1Ipiqi3+(2ρ2(2)3ρ3(2))i=1Ipiqi2+ρ3(2)i=1Ipiqi], (A19)

where the superscript of the mean kinship coefficients ρ2 and ρ3 corresponds to the population for which these are calculated. The formulas for Cov[H12,H1], Cov[H12,H2], Cov[H12,BLUE,H1,BLUE], and Cov[H12,BLUE,H2,BLUE] are obtained by substituting Φ¯2 and Φ¯3 (for H), or κ2 and κ3 (for HBLUE) into Equations A18 and A19, respectively.

Proof:

The covariance between H˘12 and H˘1 is

Cov[H˘12,H˘1]=11ρ2(1)Cov[(1i=1Ip˘iq˘i),(1i=1Ip˘i2)]=11ρ2(1)i=1Ii=1ICov[p˘iq˘i,p˘i2]=11ρ2(1)(i=1ICov[p˘iq˘i,p˘i2]+i=1Ii=1iiICov[p˘iq˘i,p˘i2]).

The value of the covariance calculated for the case where i=i can be written as

Cov[p˘iq˘i,p˘i2]=E[p˘i3q˘i]E[p˘iq˘i]E[p˘i2].

From the proof of Lemma 11, we have derived the value of E[p˘iq˘i], and, from the proof of Proposition 1 we know the value for E[p˘i2]. We therefore only need to compute

E[p˘i3q˘i]=E[p˘i3]E[q˘i],

where E[q˘i]=qi. Solving for E[p˘i3], we have

E[p˘i3]=j=1n1v=1n1v=1n1=1mjz=1mvz=1mvwjwvwvmjmvmvE[Aj(i)Avz(i)Avz(i)]=j=1n1v=1n1v=1n1=1mjz=1mvz=1mvwjwvwvmjmvmv[Aj(i)=1,Avz(i)=1,Avz(i)=1].

The value of [Aj(i)=1,Avz(i)=1,Avz(i)=1] depends on the probabilities of distinct identity states in which three alleles are drawn from the sample (one each from individuals j, v, and v). We define state 1 as no IBD alleles drawn (probability δ1), state 2 as IBD alleles drawn from j and v (probability δ2), state 3 as IBD alleles drawn from v and v IBD (probability δ3), state 4 as IBD alleles drawn from j and v IBD (probability δ4), and state 5 as all three IBD (probability δ5), with s=15δs=1. Thus, the probabilities for the relevant kinship coefficients are

Φjvv(1)=δ5
Φjv(1)=δ5+δ2
Φvv(1)=δ5+δ3
Φjv(1)=δ5+δ4,

which yields

[Aj(i)=1,Avz(i)=1,Avz(i)=1]=δ5pi+(δ2+δ3+δ4)pi2+δ1pi3=Φjvv(1)pi+(Φjv(1)+Φvv(1)+Φjv(1)3Φjvv(1))pi2+(1+2Φjvv(1)Φjv(1)Φvv(1)Φjv(1))pi3.

Thus, E[p˘i3q˘i] is

E[p˘i3q˘i]=ρ3(1)piqi+3(ρ2(1)ρ3(1))pi2qi+(1+2ρ3(1)3ρ2(1))pi3qi, (A20)

and from Equations A20 and A1, and the definition of E[p˘iq˘i],

Cov[p˘iq˘i,p˘i2]=ρ3(1)piqi+3(ρ2(1)ρ3(1))pi2qi+(1+2ρ3(1)3ρ2(1))pi3qipiqi[pi2+ρ2(1)pi(1pi)]=2(ρ3(1)ρ2(1))pi3qi+(2ρ2(1)3ρ3(1))pi2qi+ρ3(1)piqi. (A21)

Meanwhile, for the Cov[p˘iq˘i,p˘i2] case of ii, Cov[p˘iq˘i,p˘i2]=0. This is intuitively sensible because the products p˘iq˘i and p˘i2 are independent, describing different alleles, and should not covary.

Finally, we can see that, when the two populations considered are independent from one another, the value of Cov[H˘12,H˘1] (or equivalently of Cov[H˘12,H˘2]) is driven entirely by the case in which i=i, such that

Cov[H˘12,H˘1]=11ρ2(1)i=1ICov[p˘iq˘i,p˘i2]=11ρ2(1)[2(ρ3(1)ρ2(1))i=1Ipi3qi+(2ρ2(1)3ρ3(1))i=1Ipi2qi+ρ3(1)i=1Ipiqi].

We now need to derive Cov[H˘1212(H˘1+H˘2),H˘12], the final term required to compute Var[F˘ST].

Lemma 13:

Consider a locus with I distinct alleles across two independent populations and parametric allele frequencies pi[0,1], i=1,2,,I, and i=1Ipi=1 for population 1, or qi[0,1], i=1,2,,I, and i=1Iqi=1 for population 2. For two samples of size n1 and n2 individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness,

Cov[H˘1212H˘112H˘2,H˘12]=[ρ2(1)(1ρ2(2))+3ρ3(2)2ρ2(2)2(1ρ2(2))]i=1Ipiqi2+[ρ2(2)(1ρ2(1))+3ρ3(1)2ρ2(1)2(1ρ2(1))]i=1Ipi2qi+[ρ2(1)ρ2(2)ρ3(1)2(1ρ2(1))ρ3(2)2(1ρ2(2))]i=1Ipiqi+(ρ2(1)ρ2(2)ρ2(1)ρ2(2))(i=1Ipiqi)2+ρ2(1)ρ3(1)1ρ2(1)i=1Ipi3qi+ρ2(2)ρ3(2)1ρ2(2)i=1Ipiqi3, (A22)

where the superscript of the mean kinship coefficients ρ2 and ρ3 corresponds to the population for which these quantities are calculated. The formulas for Cov[H12(1/2)H1(1/2)H2,H12] and Cov[H12,BLUE(1/2)H1,BLUE(1/2)H2,BLUE,H12,BLUE] are obtained by substituting Φ¯2 and Φ¯3 (for H), or κ2 and κ3 (for HBLUE) into Equation A22.

Proof:

We begin by breaking up the covariance into its components,

Cov[H˘1212H˘112H˘2,H˘12]=Var[H˘12]12Cov[H˘1,H˘12]12Cov[H˘2,H˘12].

This equation is composed of terms that we previously derived (Equations A13, A18, and A19). Therefore,

Cov[H˘1212H˘112H˘2,H˘12]=ρ2(1)(1ρ2(2))i=1Ipiqi2+ρ2(2)(1ρ2(1))i=1Ipi2qi+ρ2(1)ρ2(2)i=1Ipiqi+(ρ2(1)ρ2(2)ρ2(1)ρ2(2))(i=1Ipiqi)212(1ρ2(1))[2(ρ3(1)ρ2(1))i=1Ipi3qi+(2ρ2(1)3ρ3(1))i=1Ipi2qi+ρ3(1)i=1Ipiqi]12(1ρ2(2))[2(ρ3(2)ρ2(2))i=1Ipiqi3+(2ρ2(2)3ρ3(2))i=1Ipiqi2+ρ3(2)i=1Ipiqi]=[ρ2(1)(1ρ2(2))+3ρ3(2)2ρ2(2)2(1ρ2(2))]i=1Ipiqi2+[ρ2(2)(1ρ2(1))+3ρ3(1)2ρ2(1)2(1ρ2(1))]i=1Ipi2qi+[ρ2(1)ρ2(2)ρ3(1)2(1ρ2(1))ρ3(2)2(1ρ2(2))]i=1Ipiqi+(ρ2(1)ρ2(2)ρ2(1)ρ2(2))(i=1Ipiqi)2+ρ2(1)ρ3(1)1ρ2(1)i=1Ipi3qi+ρ2(2)ρ3(2)1ρ2(2)i=1Ipiqi3.

Footnotes

Communicating editor: B. J. Andrews

Literature Cited

  1. Abney M., Ober C., McPeek M. S., 2002.  Quantitative-trait homozygosity and association mapping and empirical genomewide significance in large, complex pedigrees: fasting serum-insulin level in the Hutterites. Am. J. Hum. Genet. 70: 920–934. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Blekhman R., Goodrich J. K., Huang K., Sun Q., Bukowski R., et al. , 2015.  Host genetic variation impacts microbiome composition across human body sites. Genome Biol. 16: 191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Butler I. A., Siletti K., Oxley P. R., Kronauer D. J. C., 2014.  Conserved microsatellites in ants enable population genetic and colony pedigree studies across a wide range of species. PLoS One 9: e107334. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Capocasa M., Battagia C., Anagnostou P., Montinaro F., Boschi I., et al. , 2013.  Detecting genetic isolation in human populations: a study of European language minorities. PLoS One 8: e56371. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chong J. X., Oktay A. A., Dai Z., Swoboda K. J., Prior T. W., et al. , 2011.  A common spinal muscular atrophy deletion mutation is present on a single founder haplotype in the US Hutterites. Eur. J. Hum. Genet. 19: 1045–1051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Cockerham C. C., 1971.  Higher order probability functions of identity of alleles by descent. Genetics 69: 235–246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Coia V., Boschi I., Trombetta F., Cavulli F., Montinaro F., et al. , 2012.  Evidence of high genetic variation among linguistically diverse populations on a micro-geographic scale: a case study of the Italian Alps. J. Hum. Genet. 57: 254–260. [DOI] [PubMed] [Google Scholar]
  8. DeGiorgio M., Rosenberg N. A., 2009.  An unbiased estimator of gene diversity in samples containing related individuals. Mol. Biol. Evol. 26: 501–512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. DeGiorgio M., Jankovic I., Rosenberg N. A., 2010.  Unbiased estimation of gene diversity in samples containing related individuals: exact variance and arbitrary ploidy. Genetics 186: 1367–1387. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Dulik M. C., Zhadanov S. I., Osipova L. P., Askapuli A., Gau L., et al. , 2012.  Mitochondrial DNA and Y chromosome variation provides evidence for a recent common ancestry between Native Americans and Indigenous Altaians. Am. J. Hum. Genet. 90: 229–246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Epstein M. P., Duren W. L., Boehnke M., 2000.  Improved inference of relationships for pairs of individuals. Am. J. Hum. Genet. 67: 1219–1231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Gillois M., 1965.  Relation d’identité en génétique. Ann. Inst. Henri Poincaré B 2: 1–94. [Google Scholar]
  13. Harpur B. A., Kent C. F., Molodtsova D., Lebon J. M. D., Alqarni A. S., et al. , 2014.  Population genomics of the honey bee reveals strong signatures of positive selection on worker traits. Proc. Natl. Acad. Sci. USA 111: 2614–2619. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Hollister J. D., Arnold B. J., Svedin E., Xue K. S., Dilkes B. P., et al. , 2012.  Genetic adaptation associated with genome-doubling in autotetraploid Arabidopsis arenosa. PLoS Genet. 8: e1003093. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Huang X., Kurata N., Wei X., Wang Z., Wang A., et al. , 2012.  A map of rice genome variation reveals the origin of cultivated rice. Nature 490: 497–501. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Hudson R. R., Slatkin M., Maddison W. P., 1992.  Estimation of levels of gene flow from DNA sequence data. Genetics 132: 583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Huerta-Sánchez E., DeGiorgio M., Pagani L., Tarekegn A., Ekong R., et al. , 2013.  Genetic signatures reveal high-altitude adaptation in a set of Ethiopian populations. Mol. Biol. Evol. 30: 1877–1888. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Lange K., 2002.  Mathematical and Statistical Methods for Genetic Analysis, Ed. 2 Springer, New York. [Google Scholar]
  19. Lye G. C., Lepais O., Goulson D., 2011.  Reconstructing demographic events from population genetic data: the introduction of bumblebees to New Zealand. Mol. Ecol. 20: 2888–2900. [DOI] [PubMed] [Google Scholar]
  20. McPeek M. S., Wu X., Ober C., 2004.  Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics 60: 359–367. [DOI] [PubMed] [Google Scholar]
  21. Nei M., 1973.  Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70: 3321–3323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Nei M., Roychoudhury A. K., 1974.  Sampling variances of heterozygosity and genetic distance. Genetics 76: 379–390. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Nielsen N. H., Backes G., Stougaard J., Andersen S. U., Jahoor A., 2014.  Genetic diversity and population structure analysis of European hexaploid bread wheat (Triticum aestivum L.) varieties. PLoS One 9: e94000. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Payne M., Rupar C. A., Siu G. M., Siu V. M., 2011.  Amish, Mennonite, and Hutterite genetic disorder database. Paediatr. Child Health 16: e23–e24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Pemberton T. J., Wang C., Li J. Z., Rosenberg N. A., 2010.  Inference of unexpected genetic relatedness among individuals in HapMap Phase III. Am. J. Hum. Genet. 87: 457–464. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Pemberton T. J., DeGiorgio M., Rosenberg N. A., 2013.  Population structure in a comprehensive data set on human microsatellite variation. G3 3: 909–916. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Reddy S. B., Rosenberg N. A., 2012.  Refining the relationship between homozygosity and the frequency of the most frequent allele. J. Math. Biol. 64: 87–108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Reynolds J., Weir B. S., Cockerham C. C., 1983.  Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105: 767–779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Shriver M. D., Kennedy G. C., Parra E. J., Lawson H. A., Sonpar V., et al. , 2004.  The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Hum. Genomics 1: 274–286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Simonson T. S., Yang Y., Huff C. D., Yun H., Qin G., et al. , 2010.  Genetic evidence for high-altitude adaptation in Tibet. Science 329: 72–75. [DOI] [PubMed] [Google Scholar]
  31. Solignac M., Vautrin D., Loiseau A., Mougel F., Baudry E., et al. , 2003.  Five hundred and fifty microsatellite markers for the study of the honeybee (Apis mellifera L.) genome. Mol. Ecol. Notes 3: 307–311. [Google Scholar]
  32. Sutter N. B., Bustamante C. D., Chase K., Gray M. M., Zhao K., et al. , 2007.  A single IGF1 allele is a major determinant of small size in dogs. Science 316: 112–115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Van Hout C. V., Levin A. M., Rampersaud E., Shen H., O’Connell J. R., et al. , 2010.  Extent and distribution of linkage disequilibrium in the Old Order Amish. Genet. Epidemiol. 34: 146–150. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Wang S., Lewis C. M., Jr, Jakobsson M., Ramachandran S., Ray N., et al. , 2007.  Genetic variation and population structure in Native Americans. PLoS Genet. 3: 2049–2067. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Wolter K. M., 2007.  Introduction to Variance Estimation, Ed. 2 Springer, New York, NY. [Google Scholar]
  36. Wright S., 1951.  The genetical structure of populations. Ann. Eugen. 15: 323–354. [DOI] [PubMed] [Google Scholar]
  37. Yi X., Liang Y., Huerta-S’anchez E., Jin X., Cuo Z. X. P., et al. , 2010.  Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329: 75–78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Zhang W., Fan Z., Han E., Hou R., Zhang L., et al. , 2014.  Hypoxia adaptations in the grey wolf (Canis lupus chanco) from Qinghai-Tibet Plateau. PLoS Genet. 10: e1004466. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.


Articles from G3: Genes|Genomes|Genetics are provided here courtesy of Oxford University Press

RESOURCES