Genome Research. 2013 Sep;23(9):1514–1521. doi: 10.1101/gr.154831.113

Estimating and interpreting FST: The impact of rare variants

Gaurav Bhatia 1,2,6,7, Nick Patterson 2,6,7, Sriram Sankararaman 2,3, Alkes L Price 2,4,5,7
PMCID: PMC3759727  PMID: 23861382

Abstract

In a pair of seminal papers, Sewall Wright and Gustave Malécot introduced FST as a measure of structure in natural populations. In the decades that followed, a number of papers provided differing definitions, estimation methods, and interpretations beyond Wright's. While this diversity in methods has enabled many studies in genetics, it has also introduced confusion regarding how to estimate FST from available data. Considering this confusion, wide variation in published estimates of FST for pairs of HapMap populations is a cause for concern. These estimates changed—in some cases more than twofold—when comparing estimates from genotyping arrays to those from sequence data. Indeed, changes in FST from sequencing data might be expected due to population genetic factors affecting rare variants. While rare variants do influence the result, we show that this is largely through differences in estimation methods. Correcting for this yields estimates of FST that are much more concordant between sequence and genotype data. These differences relate to three specific issues: (1) estimating FST for a single SNP, (2) combining estimates of FST across multiple SNPs, and (3) selecting the set of SNPs used in the computation. Changes in each of these aspects of estimation may result in FST estimates that are highly divergent from one another. Here, we clarify these issues and propose solutions.


Since its introduction by Sewall Wright (1949) and Gustave Malécot (1948), FST estimation (Weir and Cockerham 1984; Holsinger and Weir 2009) has become a key component of studies of population structure in humans (International HapMap Consortium 2007; Li et al. 2008; The 1000 Genomes Project Consortium 2010; International HapMap 3 Consortium 2010) and other species (Malécot 1948; Wright 1949; Selander and Hudson 1976; Guries and Ledig 1982; Ellstrand and Elam 1993; Palumbi and Baker 1994). Though the utility of FST and related measures has been subject to recent debate (Jost 2008; Ryman and Leimar 2009), FST continues to be widely used by population geneticists (Xu et al. 2009; Edelaar et al. 2012; Hangartner et al. 2012).

Despite this widespread use in genetic studies, confusion remains about what FST is and how to estimate it. Beyond Wright's original description of FST as a ratio of variances, FST has been conceptually defined in many ways (Wright 1949; Cockerham 1969; Cavalli-Sforza and Bodmer 1971; Nei 1973; Slatkin 1991; Hudson et al. 1992). Additionally, multiple estimators for FST have been described in the literature (Nei 1973, 1986; Weir and Cockerham 1984; Hudson et al. 1992; Holsinger 1999; Weir and Hill 2002), often making the correct choice of estimator unclear.

With this diversity of definition and estimation in mind, we consider estimates of FST published by The 1000 Genomes Project Consortium (2010) of 0.052 for European and East Asian populations and 0.071 for European and West African populations. These are less than half of the published estimates, 0.111 and 0.156, from HapMap3 data (International HapMap 3 Consortium 2010) and may be the result of demography that differentially impacts FST at rare variants. These estimates have subsequently been used to simulate properties of recent rare variants (Mathieson and McVean 2012), making it imperative to know whether this reduction in FST is a meaningful result of the inclusion of rare variants or merely an artifact of estimation.

To answer these questions, we examine the issues surrounding FST estimated on data containing rare variants. We focus our attention on FST estimation in the context of comparing two populations—potentially with differing amounts of drift since the populations split—using a series of bi-allelic SNPs. We use the definition of Weir and Hill (2002), which allows for population-specific FST. Using this definition, we divide the issues surrounding estimation into three categories and examine them using both simulated and 1000 Genomes data:

  1. Choice of FST estimator.

  2. Combining estimates of FST across multiple SNPs.

  3. Dependence of FST on the set of SNPs analyzed.

We conclude that the lower FST estimates reported by The 1000 Genomes Project Consortium (2010) are a consequence of the estimation method that was applied and are not informative for human demographic history. Correcting for differences in estimation method yields FST estimates of 0.106 for Europeans and East Asians and 0.139 for Europeans and West Africans—much closer to HapMap3 estimates. Overall, our results contradict a recent statement “among human populations, FST is typically estimated to be <0.1” by Mathieson and McVean (2012), which was based on results from The 1000 Genomes Project Consortium (2010).

Altogether, in the setting of rare variants, a careful protocol for producing FST estimates is warranted. We provide such a protocol.

Results

Theory

Defining FST

We use the definition of Weir and Hill (2002) (WH) throughout our manuscript to analyze estimators in the context of comparing two populations at a series of bi-allelic SNPs. In this context, WH define FST as the correlation between randomly drawn alleles from a single population relative to the most recent common ancestral population:

$$E\left[p_i^s \mid p_{anc}^s\right] = p_{anc}^s, \qquad \mathrm{Var}\left(p_i^s \mid p_{anc}^s\right) = F_{ST}^i \, p_{anc}^s \left(1 - p_{anc}^s\right)$$

where $p_i^s$ is the allele frequency of the derived allele in population i, at SNP s, $p_{anc}^s$ is the allele frequency of the derived allele in the ancestral population at SNP s, and $F_{ST}^i$ is the population-specific FST for population i. For a pair of populations, FST is

$$F_{ST} = \frac{F_{ST}^1 + F_{ST}^2}{2}$$
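The moment conditions of this definition can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that drift is modeled by a beta distribution (the choice of distribution, the function names, and the example values are assumptions, not part of the WH definition); the shape parameters are chosen so that the mean equals the ancestral frequency and the variance equals the population-specific FST times p(1 − p).

```python
import random


def beta_shape(p_anc, fst):
    """Beta shape parameters with mean p_anc and variance fst * p_anc * (1 - p_anc)."""
    if not (0.0 < p_anc < 1.0 and 0.0 < fst < 1.0):
        raise ValueError("p_anc and fst must lie strictly between 0 and 1")
    scale = (1.0 - fst) / fst
    return p_anc * scale, (1.0 - p_anc) * scale


def beta_mean_var(a, b):
    """Closed-form mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


def draw_population_freqs(p_anc, fst, n_draws, seed=0):
    """Draw descendant-population allele frequencies for one SNP (illustrative only)."""
    rng = random.Random(seed)
    a, b = beta_shape(p_anc, fst)
    return [rng.betavariate(a, b) for _ in range(n_draws)]
```

For example, `beta_mean_var(*beta_shape(0.3, 0.1))` returns a mean of 0.3 and a variance of 0.1 × 0.3 × 0.7, matching the two moment conditions.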

Although we use the WH definition of FST to compare estimation methods, numerous alternate definitions exist in the literature (see Supplemental Material), in part because of confusion regarding Wright's original description of FST.

Wright (1949) defined FST as the correlation of randomly drawn gametes from the same population, relative to the total population. However, he did not clearly specify the “total population,” leaving subsequent investigators to interpret its meaning. For Nei (1973) the “total population” is the combination of the two population samples. This means that FST quantifies drift relative to an average of the two population samples. For Cockerham (1969) and WH, the “total” population is the most recent common ancestral population to the two populations being considered. Consistent with those investigators, we view FST as a parameter of the evolutionary process and not a statistic from observed samples as Nei has described.

To view FST as a parameter of the evolutionary process, the Cockerham and WH definitions assume that studied SNPs were polymorphic in the ancestral population. This is clear from Equation 1 as $p_{anc}^s = 0$ for SNPs arising from recent mutations. While this assumption does not always hold, we believe that the WH definition provides a valid basis for comparing estimation methods, and we also assess the performance of estimators when this assumption is violated.

By defining only one FST for both populations in a comparison, Cockerham (1969) and Weir and Cockerham (1984) also assumed that the two populations have experienced identical amounts of drift since splitting. This assumption, which may be unrealistic in many real data sets, was generalized by WH, and motivates our use of the WH definition. In this study, we focus on cases without migration and admixture, though these cases were considered in WH and are the subject of future work (B Weir, pers. comm.).

In addition to the definitions described above, FST has been related to divergence time, coalescent times, and migration rates. Additionally, likelihood-based definitions view FST as a parameter of the distribution of allele frequencies in current populations (Balding and Nichols 1995; Nicholson et al. 2002; Balding 2003). Further details are provided in the Supplemental Material.

Choice of FST estimator

While estimators of FST handle issues related to finite sample size, we are interested in their behavior in the limit of large sample sizes, or the “quantity being estimated.” Most published estimates of FST are produced using the Weir and Cockerham (WC) (Weir and Cockerham 1984) (>8000 citations) or Nei (Nei 1973) (>5500 citations) estimators. However, we recommend a different estimator motivated by Hudson et al. (1992).

The WC estimator was developed for the case of populations with identical FST, and if it is used when FST is not identical for both populations, we demonstrate that the WC quantity being estimated becomes dependent on the ratio of sample sizes M according to (see Methods):

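$$\hat{F}_{ST}^{WC} \to \frac{(M + 1)\left(F_{ST}^1 + F_{ST}^2\right)}{2(M + 1) + (M - 1)\left(F_{ST}^2 - F_{ST}^1\right)}, \qquad M = \frac{n_1}{n_2}$$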

We note that this variation with sample size is not due to any flaw in the WC estimator, but rather due to the use of the WC estimator for a purpose different from what was intended. We also note that the WC estimator is often used to produce single SNP estimates of FST to detect selection. We caution that when sample sizes are very different, the WC estimator can give inflated single SNP estimates of FST, resulting in false-positive signals of selection (see Supplemental Material).

In the context of the WH definition, the Nei estimator will consistently overestimate FST, and the degree of overestimation will depend upon the magnitude of FST values (see Methods):

$$\hat{F}_{ST}^{Nei} \to \frac{F_{ST}^1 + F_{ST}^2}{2 - \frac{F_{ST}^1 + F_{ST}^2}{2}}$$

We note that this result, with a maximum value of 2, makes it impossible to view FST as a correlation.

The Hudson estimator (Hudson et al. 1992; Keinan et al. 2007) produces estimates that are the simple average of FST according to the WH definition. These estimates are independent of sample sizes even when FST is not identical across populations. We note that while Hudson did not explicitly provide an estimator of FST, he did describe a method of estimation that corresponds to the estimator that we explicitly provide here (see Supplemental Material). Thus, we refer to this estimator as the Hudson estimator. Hudson estimates correspond to a simple average of the population specific FST estimates as given by (see Methods):

$$\hat{F}_{ST}^{Hudson} \to \frac{F_{ST}^1 + F_{ST}^2}{2}$$

We note that the Hudson estimator is a simple average of the population-specific estimators proposed by Weir and Hill (2002). We provide comparisons of this estimator to the WC and Nei estimators when applied to simulated data (see Supplemental Material) and empirical data (see below).
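The behavior of the three "quantities being estimated" can be compared numerically. The following Python sketch is illustrative only: the closed forms are obtained by taking expectations of the numerator and denominator of each estimator under the WH moment conditions, the orientation M = n1/n2 is an assumption, and the function names and example values are hypothetical.

```python
def wc_limit(f1, f2, m):
    """Large-sample WC quantity for population-specific FST f1, f2 and m = n1 / n2."""
    return (m + 1.0) * (f1 + f2) / (2.0 * (m + 1.0) + (m - 1.0) * (f2 - f1))


def nei_limit(f1, f2):
    """Large-sample Nei quantity; reaches 2 when f1 = f2 = 1."""
    return (f1 + f2) / (2.0 - (f1 + f2) / 2.0)


def hudson_limit(f1, f2):
    """Large-sample Hudson quantity: the simple average of f1 and f2."""
    return (f1 + f2) / 2.0
```

With hypothetical values f1 = 0.08 and f2 = 0.14, `hudson_limit` gives 0.11 regardless of sample sizes, `wc_limit` moves above or below 0.11 as m moves away from 1, and `nei_limit` is larger than 0.11.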

Combining estimates of FST across multiple SNPs

We investigate two approaches for combining estimates of FST across multiple SNPs. In the first approach, variance components—the numerator and denominator—are averaged separately and the genome-wide estimate of FST is a “ratio of averages” (Weir and Cockerham 1984; International HapMap 3 Consortium 2010). In the second approach, single SNP estimates of FST are averaged across SNPs. The resulting “average of ratios” is reported as the genome-wide estimate (The 1000 Genomes Project Consortium 2010) (see Methods).

In the context of the WH definition, the numerator of the Hudson FST estimator (see Methods) is an unbiased estimator of the variance between populations. The denominator is an unbiased estimator of the total variance in the ancestral population. However, this does not mean that the ratio of the estimators is itself an unbiased estimator of FST. We are not aware of any unbiased estimator.

While an unbiased estimator is not available, FST estimates produced using a ratio of these two unbiased estimates will be asymptotically consistent, in the sense that they will converge to the correct underlying value as the number of independent SNPs increases. This is the basis of our recommendation that FST be estimated as a ratio of averages.
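The two combination approaches can be contrasted in a few lines of Python. This is a minimal sketch with hypothetical function names and made-up per-SNP numerators and denominators; it is not the code used for the analyses reported here.

```python
def ratio_of_averages(numerators, denominators):
    """Average numerators and denominators separately, then take the ratio."""
    if not numerators or len(numerators) != len(denominators):
        raise ValueError("need equally long, non-empty lists")
    # The 1/n factors cancel, so the ratio of means equals the ratio of sums.
    return sum(numerators) / sum(denominators)


def average_of_ratios(numerators, denominators):
    """Compute a single-SNP ratio for every SNP, then average the ratios."""
    if not numerators or len(numerators) != len(denominators):
        raise ValueError("need equally long, non-empty lists")
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


# Two common SNPs with single-SNP ratio 0.1 and one rare SNP with ratio 0.005.
nums = [0.05, 0.04, 0.0001]
dens = [0.5, 0.4, 0.02]
roa = ratio_of_averages(nums, dens)
aor = average_of_ratios(nums, dens)
```

In this toy example the rare SNP contributes little to the summed numerator and denominator, so the ratio of averages stays near 0.1, whereas the average of ratios gives the rare SNP the same weight as each common SNP and is pulled downward.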

We analyze the effects of choosing an average of ratios in coalescent simulations detailed in the Supplemental Material.

Dependence of FST on the set of SNPs analyzed

It is well known that population genetic factors can cause variation in FST estimates, and that ascertainment schemes can alter the properties of studied SNPs (Ramírez-Soriano and Calafell 2008; Albrechtsen et al. 2010). For example, selection can result in differences between FST estimated on genic and nongenic SNPs (Clark et al. 2005; Barreiro et al. 2008; Hernandez et al. 2011); complex demography can cause FST to vary with SNP allele frequency (Schaffner et al. 2005) (see below). Indeed, variation in FST estimates between ascertained classes of SNPs can be used to test a variety of hypotheses about population history (Weir et al. 2005; McVicker et al. 2009). This usage of FST demonstrates that there is no single correct ascertainment scheme, as FST is a parameter of both the populations and the set of SNPs that are used in the computation.

Though there is no single correct ascertainment scheme, ascertainment in an outgroup may have desirable properties. Outgroup ascertainment guarantees that studied SNPs were polymorphic in the most recent common ancestral population (ignoring recurrent mutation), satisfying an assumption made in the Weir and Hill definition. This leads estimates of FST to be independent of allele frequency and depend upon time since divergence according to a simple equation (see Supplemental Material, Equation s1).

While we view these as desirable properties, if no reasonable outgroup sample is available, it may become necessary to choose SNPs that are polymorphic in one, both, or either of the populations studied. These choices will affect the estimate of FST produced and may explain discrepancies in FST estimates across studies of the same populations.

We explore the effects of various ascertainment schemes on FST estimates across the allele frequency spectrum in a variety of simulated demographic scenarios (see Supplemental Material).
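The ascertainment schemes discussed above amount to simple filters on sample allele frequencies. The sketch below is a hypothetical illustration: the record layout (dictionaries keyed by "pop1", "pop2", and "outgroup"), the scheme names, and the per-sample definition of "polymorphic" are assumptions made for the example.

```python
def is_polymorphic(freq):
    """A SNP is treated as polymorphic in a sample if 0 < frequency < 1."""
    return 0.0 < freq < 1.0


def ascertain(snps, scheme):
    """Keep SNPs that satisfy the named ascertainment scheme."""
    rules = {
        "pop1": lambda s: is_polymorphic(s["pop1"]),
        "pop2": lambda s: is_polymorphic(s["pop2"]),
        "both": lambda s: is_polymorphic(s["pop1"]) and is_polymorphic(s["pop2"]),
        "either": lambda s: is_polymorphic(s["pop1"]) or is_polymorphic(s["pop2"]),
        "outgroup": lambda s: is_polymorphic(s["outgroup"]),
    }
    if scheme not in rules:
        raise ValueError(f"unknown ascertainment scheme: {scheme}")
    keep = rules[scheme]
    return [s for s in snps if keep(s)]


example_snps = [
    {"pop1": 0.2, "pop2": 0.0, "outgroup": 0.1},
    {"pop1": 0.0, "pop2": 0.3, "outgroup": 0.0},
    {"pop1": 0.5, "pop2": 0.4, "outgroup": 1.0},
    {"pop1": 1.0, "pop2": 0.0, "outgroup": 0.5},
]
```

Because each scheme retains a different subset of SNPs, the same estimator applied to the same two populations can return different values depending on the scheme.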

Other FST estimators

In addition to the WC, Nei, and Hudson estimators that we analyzed above, we have also analyzed several additional estimators. Our results on each of these estimators are described in detail in the Supplemental Material.

The moment-based estimator of Weir and Hill (2002) (WH) introduced population-specific estimates of FST. Weir and Hill recommend a sample size weighted average of these estimates, which may result in a wide variation with sample size. However, one could also report these estimates independently or perform a simple average of these estimates.

A separate maximum-likelihood estimator of Weir and Hill (2002) (WH-ML) is based upon a normal approximation to genetic drift. However, the equations provided for the WH-ML estimator are not applicable to the general case of unequal sample size, and the investigators recommend that estimates be “simply averaged across loci,” causing WH-ML estimates to vary widely with the inclusion of rare variants.

We evaluated two max-likelihood estimators based on the beta-binomial likelihood using point estimates for the allele frequency in the ancestral population (D Balding, pers. comm.). These estimates perform well for small values of FST, but do poorly as FST increases. It may be possible to improve on these methods by integrating over the distribution of ancestral allele frequencies, an interesting direction for future research.

We also considered the beta-binomial MCMC method of Holsinger (1999). However, our simulations suggest that Holsinger estimates increase dramatically if rare SNPs are analyzed. Additionally, the MCMC-based approach imposes a significant computational burden, making the method difficult to apply to modern data sets.

Analysis of 1000 Genomes data

We analyzed data from 1000 Genomes populations (The 1000 Genomes Project Consortium 2010) to illustrate the effects of changes in each of the aspects of estimation described above. We focus largely on the comparison of Utah residents of European ancestry (CEU) and Chinese individuals from Beijing (CHB), as the Yoruba in Ibadan, Nigeria (YRI) sample functions as a natural outgroup for ascertainment of SNPs. This ascertainment has desirable properties (see above).

Choice of FST estimator

Estimates of FST for CEU and CHB are 0.106 (s.e. 0.0006), 0.112 (s.e. 0.0006), and 0.107 (s.e. 0.0006) for the WC, Nei, and Hudson estimators, respectively. These estimates were produced over SNPs ascertained as polymorphic in YRI. The higher Nei estimate is expected. In addition, sample sizes for CEU (85 individuals) and CHB (97 individuals) are similar, so we do not expect WC and Hudson estimates to differ.

In order to investigate the effects of sample size variation we selected 14 individuals—the size of the smallest sample (Iberian populations in Spain; IBS) in the 1000 Genomes Consortium data—from both CEU and CHB to produce populations CEU14 and CHB14. Hudson FST estimates for CEU14 and CHB are similar to those for CHB14 and CEU (see Table 1). However, WC estimates are 0.114 (s.e. 0.0006) and 0.107 (s.e. 0.0006) for CEU14 vs. CHB and CHB14 vs. CEU, respectively. The difference between these estimates is statistically significant (greater than eight standard errors). To verify that this difference is not due to different sets of polymorphic SNPs, we re-estimated FST restricting to SNPs that were polymorphic in YRI and at least one of CEU14 or CHB14. Re-estimated values of FST were similar to those above and WC estimates remained discordant (data not shown).
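The "greater than eight standard errors" statement can be checked with simple arithmetic. The sketch below assumes that the two WC estimates are independent, so that their standard errors combine in quadrature; this independence is an assumption made for the illustration, and the function name is hypothetical.

```python
import math


def standard_errors_apart(est_a, se_a, est_b, se_b):
    """Difference between two estimates in units of the standard error of the difference."""
    return abs(est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# WC estimates for CEU14 vs. CHB and CHB14 vs. CEU, as reported in the text.
z = standard_errors_apart(0.114, 0.0006, 0.107, 0.0006)
```

Under this assumption the difference of 0.007 corresponds to roughly 8.2 standard errors.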

Table 1.

FST estimates for pairs of populations in 1000 Genomes

graphic file with name 1514tbl1.jpg

The effect of sample size variation is further exacerbated when ascertainment is performed within the populations studied. For example, in comparing IBS—with a sample size of only 14 individuals—to YRI, no reasonable outgroup population exists in the 1000 Genomes data. If we ascertain within one of these populations, WC estimates are 0.121 and 0.144 for ascertainment in YRI and IBS, respectively. These estimates—computed using identical populations and even identical individuals—are highly divergent at >25 standard errors apart, whereas Hudson estimates are much more stable (see Table 1). This underscores that FST estimates can vary substantially based on the choice of estimator.

Regardless of choice of estimator, our estimates of FST from 1000 Genomes data are relatively close to previously reported values of FST (see Supplemental Table S1 for all populations). This suggests that while the choice of estimator can impact the resulting value of FST, it does not explain the disparate results reported by the 1000 Genomes Consortium, and other aspects of estimation may be involved. We consider these in the sections below.

Combining estimates of FST across multiple SNPs

From 1000 Genomes data, we estimated FST for CEU and CHB as 0.106 (s.e. 0.0006) and 0.072 (s.e. 0.0003) for the ratio of averages and average of ratios, respectively. These estimates were produced over SNPs ascertained as polymorphic in YRI. This suggests that the result reported by the 1000 Genomes Consortium (0.052) may be partially explained by the large reduction in FST obtained by use of an average of ratios. These results are replicated for several comparisons of populations included in the 1000 Genomes data (see Table 2).

Table 2.

A comparison of the FST estimated using 1000 Genomes and HapMap data by either using a ratio of averages or an average of ratios

graphic file with name 1514tbl2.jpg

To explore the effect of the rare variants included in sequence data, we compared our results to those obtained using HapMap3 genotypes. We obtain FST estimates for CEU and CHB of 0.110 (s.e. 0.0010) and 0.089 (s.e. 0.0006) using the ratio of averages and average of ratios, respectively. This suggests that the inclusion of rare variants with low single-SNP FST estimates in the 1000 Genomes data tends to exacerbate the discrepancy produced by the average of ratios. We expect that this discrepancy will grow with sample sizes and sequencing depth (see Supplemental Fig. S2). Ultimately, using the average of ratios may make estimates incomparable across studies and unrelated to population demographic history.

While the use of the average of ratios clearly results in lower estimates of FST, these estimates are not as low as those published by the 1000 Genomes Consortium. Below, we explore the possibility that the remaining discrepancy can be accounted for by differences in the set of SNPs analyzed.

Dependence of FST on the set of SNPs analyzed

When estimating FST for CEU and CHB, we compared the effects of ascertaining in YRI (YRI ascertainment) versus ascertaining SNPs that were polymorphic in CEU, CHB, both populations, or either population (see Table 3). When using a ratio of averages, our estimates of FST were ∼0.103 for all of these modified ascertainment schemes. These can be compared to an FST of 0.106 produced from YRI ascertainment in 1000 Genomes data or 0.110 in HapMap3 data. Though statistically significant, these differences suggest that the effects of modified ascertainment are not very large when analyzing human populations using a ratio of averages. This indicates that reasonable estimates of FST may be produced when comparing populations without access to an outgroup.

Table 3.

Assessing the effect of ascertainment schemes and combination methods on the resulting FST estimate for CEU and CHB

graphic file with name 1514tbl3.jpg

However, when using an average of ratios and including all SNPs polymorphic in either CEU or CHB, our estimate changed from 0.072 to 0.047 (s.e. 0.0002), which is similar to the result reported by the 1000 Genomes Consortium. This suggests that much of the discrepancy between previously published estimates of FST for CEU and CHB and the published 1000 Genomes estimate is explained by using the average of ratios and an ascertainment scheme that includes all SNPs that are polymorphic in either of the two populations. These results are replicated for comparisons of continental populations included in the 1000 Genomes data as we obtained values of 0.056 and 0.063 for comparisons of CEU-YRI and CHB-YRI, respectively.

Separately, we note that when comparing CEU to CHB on the 1000 Genomes data we observed larger FST estimates of 0.108 for the lowest frequency SNPs (0.0 < MAF ≤ 0.05) versus estimates of 0.103 for the most common SNPs (0.45 < MAF < 0.5) when ascertaining in CEU. These estimates were 0.131 and 0.097 when ascertaining in CHB (see Fig. 1). Increased FST for rare variants suggests that bottlenecks are likely to be a stronger influence on FST estimates for CEU and CHB than recent expansions. Our results also indicate that bottlenecks in the population history of CHB are likely to be stronger than those in the population history of CEU, consistent with the findings of Keinan et al. (2007). This is in contrast to the much lower FST estimates reported on sequence data by the 1000 Genomes Consortium, which might suggest that expansions are a stronger influence on FST at rare SNPs.

Figure 1.


Allele frequency dependence of FST under different ascertainment schemes. This shows FST for CEU and CHB as a function of allele frequency when ascertaining in either CEU, CHB, or YRI. The increased FST for rare variants is consistent with bottlenecks being a stronger force on FST for CEU and CHB than recent expansion. In fact, this is consistent with a stronger bottleneck in the population history of CHB. We note that this frequency dependence disappears when ascertaining in YRI, suggesting that YRI is a reasonable outgroup for the comparison of CEU and CHB.

Under a simple demographic history (i.e., without migration or admixture), this dependence on minor allele frequency is expected to disappear when ascertaining SNPs in an outgroup. When ascertaining in YRI we do not observe any significant dependence on frequency, which suggests that YRI is a reasonable outgroup for the comparison for CEU and CHB.
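A frequency-stratified analysis of this kind can be sketched as follows. The example is hypothetical: the input layout (tuples of ascertainment-sample allele frequency, per-SNP numerator, and per-SNP denominator), the half-open binning convention, and the function name are assumptions, and bin edges are subject to floating-point rounding.

```python
def fst_by_maf_bin(snps, bin_width=0.05):
    """Ratio-of-averages FST within bins of minor allele frequency (MAF).

    snps: iterable of (freq_in_ascertainment_sample, numerator, denominator).
    Returns a dict mapping bin index to the FST estimate for that bin.
    """
    n_bins = int(round(0.5 / bin_width))
    num_sums = [0.0] * n_bins
    den_sums = [0.0] * n_bins
    for freq, num, den in snps:
        maf = min(freq, 1.0 - freq)
        if maf <= 0.0:
            continue  # monomorphic in the ascertainment sample
        index = min(int(maf / bin_width), n_bins - 1)
        num_sums[index] += num
        den_sums[index] += den
    return {
        i: num_sums[i] / den_sums[i]
        for i in range(n_bins)
        if den_sums[i] > 0.0
    }


binned = fst_by_maf_bin([(0.02, 0.002, 0.02), (0.97, 0.004, 0.03), (0.48, 0.05, 0.5)])
```

Here the first two SNPs fall into the lowest MAF bin and the third into the highest, giving one estimate per occupied bin.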

We note that when ascertaining in YRI, our genome-wide estimate of FST (0.106) is lower than estimated from HapMap3 (0.110). To investigate whether this difference is due to non-random ascertainment of HapMap3 SNPs, we sampled 10 subsets of SNPs from the 1000 Genomes data that matched the allele frequency spectrum of HapMap3 SNPs (see Supplemental Material). We estimated FST for CEU and CHB in each of these subsets ranging from 0.106 to 0.107 (s.e. 0.0010). This suggests that HapMap3 SNPs are more highly differentiated than random SNPs, consistent with previous findings on the effects of ascertainment on genotyping arrays (Clark et al. 2005; Albrechtsen et al. 2010).

Recommendations

Choice of FST estimator

Because the Hudson estimator is not sensitive to the ratio of sample sizes and does not systematically overestimate FST, we recommend that it be used to estimate FST for pairs of populations. The Hudson estimator for FST and a corresponding block-jackknife estimator for the standard error of FST are implemented in the EIGENSOFT software package (EIGENSOFT 4.2 http://www.hsph.harvard.edu/faculty/alkes-price/software/).
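For readers who want to see the idea behind a block-jackknife standard error, a minimal Python sketch follows. It uses the simple equal-weight delete-one-block jackknife applied to a ratio of averages; this is an illustrative assumption and is not claimed to match the EIGENSOFT implementation. The function name and the way SNPs are grouped into blocks (for example, contiguous genomic segments) are hypothetical.

```python
import math


def block_jackknife_fst(block_nums, block_dens):
    """Ratio-of-averages FST and a delete-one-block jackknife standard error.

    block_nums[k] and block_dens[k] are the summed per-SNP numerators and
    denominators of the FST estimator within block k.
    """
    g = len(block_nums)
    if g < 2 or g != len(block_dens):
        raise ValueError("need at least two blocks with matching lengths")
    total_num = sum(block_nums)
    total_den = sum(block_dens)
    estimate = total_num / total_den
    # Re-estimate FST with each block removed in turn.
    leave_one_out = [
        (total_num - n) / (total_den - d) for n, d in zip(block_nums, block_dens)
    ]
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((x - mean_loo) ** 2 for x in leave_one_out)
    return estimate, math.sqrt(variance)
```

Grouping SNPs into blocks before resampling is intended to account for correlation between nearby SNPs, which a per-SNP jackknife would ignore.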

Combining estimates of FST across multiple SNPs

Using an average of ratios will result in large reductions in FST estimates. This effect will be exacerbated when estimating FST from sequence data. Therefore, we recommend using a ratio of averages.

Dependence of FST on the set of SNPs analyzed

Estimating FST from SNPs ascertained in an outgroup has the following valuable properties: (1) FST estimates are expected to be independent of allele frequency in the outgroup, and (2) FST estimates will relate to divergence time according to Supplemental Equation s1 if there has been no migration or admixture. However, data from a reasonable outgroup is not always available. Additionally, comparison of FST between ascertained classes of SNPs (e.g., genic vs. nongenic) can be used to test a variety of hypotheses regarding population history. Thus, we recommend that future publications of FST estimates include details of the ascertainment scheme used, including the proportion of SNPs that are polymorphic in each sample.

Discussion

The use of FST to quantify the genetic distance between populations and to assess differentiation at individual SNPs is widespread. Here, we point out several challenges surrounding FST and provide a protocol for its robust estimation in the case of two populations and bi-allelic SNPs. We show that the estimator of FST, the method of combining estimates across SNPs, and the scheme for SNP ascertainment can impact the resulting estimate of FST. An inappropriate choice for any of these aspects of estimation can lead to widely disparate estimates of FST, especially in a setting of large numbers of rare variants.

Indeed, the FST estimate 0.052 for CEU and CHB reported by The 1000 Genomes Project Consortium (2010) underscores the need for a careful analysis. Utilizing the careful protocol set out here, we provide an estimate of 0.106 for CEU and CHB on 1000 Genomes data, which is close to our estimate of 0.110 on HapMap3 (International HapMap 3 Consortium 2010) data. Additionally, we show that when ascertaining for SNPs in one of the two populations studied, rare variants have higher FST estimates than common variants. This is the exact opposite of the results suggested by the 1000 Genomes data. The difference between these two results changes the conclusions that are drawn about the role of demography in shaping the patterns of differentiation between human populations. In addition to altering genome-wide estimates of FST, the choice of estimator can introduce inflation at the level of single SNP estimates, potentially making it difficult to interpret high FST estimates as signals of selection (see Supplemental Material).

Another concern about FST was considered by Jost (2008), who showed that as heterozygosity becomes large, FST will naturally approach 0—indicating low differentiation—even if all alleles at a locus are population private. In an effort to avoid this problem, Jost introduced D as an alternate measure of differentiation. However, it has been suggested that Jost's D shares the same problems as FST, and that these problems are sometimes even more pronounced for Jost's D (Ryman and Leimar 2009). In any case, FST and related measures “unquestionably provide important insights into population structure” (Jost 2008), particularly for species such as humans, in which heterozygosity is relatively low.

In conclusion, we recommend the use of the Hudson estimator (Hudson et al. 1992; Keinan et al. 2007) of FST that is independent of sample size. We demonstrate that a ratio of averages is an appropriate method for combining these estimates across multiple SNPs. We also show the value of estimating FST from SNPs ascertained in an outgroup, though we do not view this as a necessity. We do recommend, however, that future publications of FST estimates include details of the ascertainment of SNPs.

Methods

Weir and Cockerham's FST (WC)

Definition

Weir and Cockerham (1984) used the definition provided by Cockerham (1969) of FST as a ratio of the variance between populations to the total variance in the ancestral population. We analyze this definition in the Supplemental Material.

Estimator

In the setting of population-specific FST, described by the WH definition, the WC estimator will result in estimates that vary with the ratio of sample sizes (see Supplemental Material for details). In the case of two populations and biallelic SNPs, the WC estimator is

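$$\hat{F}_{ST}^{WC} = 1 - \frac{2 \frac{n_1 n_2}{n_1 + n_2} \frac{1}{n_1 + n_2 - 2} \left[ n_1 \tilde{p}_1 (1 - \tilde{p}_1) + n_2 \tilde{p}_2 (1 - \tilde{p}_2) \right]}{\frac{n_1 n_2}{n_1 + n_2} (\tilde{p}_1 - \tilde{p}_2)^2 + \left( \frac{2 n_1 n_2}{n_1 + n_2} - 1 \right) \frac{1}{n_1 + n_2 - 2} \left[ n_1 \tilde{p}_1 (1 - \tilde{p}_1) + n_2 \tilde{p}_2 (1 - \tilde{p}_2) \right]}$$

where ni is the sample size and $\tilde{p}_i$ is the sample allele frequency in population i for i ∈ {1, 2}. Then, in the limit of large sample sizes (ni − 1 ≈ ni), we can assume that sample allele frequencies become close to population allele frequencies $p_i$. We analyze the estimator as the sample sizes increase, but their ratio goes to a constant M (see Supplemental Material for a derivation). In this case, we show (see Supplemental Material) that the estimate tends toward Equation 1 (see Results).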

If the sample sizes are equal, M = 1, then the estimate becomes

$$\hat{F}_{ST}^{WC} \to \frac{F_{ST}^1 + F_{ST}^2}{2}$$

Also, when FST is identical for both populations, i.e., $F_{ST}^1 = F_{ST}^2 = F_{ST}$, it is straightforward to see that $\hat{F}_{ST}^{WC} \to F_{ST}$, i.e., the estimate will not depend upon the ratio of sample sizes (M). We note that if FST is identical across populations, weighting by sample sizes will reduce the variance of the estimator. This was the intent of Weir and Cockerham. If the sample sizes are unequal or this assumption does not hold, however, the estimate will depend upon the ratio of sample sizes underlying the limit. Given the complexity of human population history, it is unlikely that this assumption will hold in general. This means that even if large numbers of samples and SNPs are used to estimate FST for a pair of populations, this estimate may not be comparable across studies with different sample sizes.

We note that when FST is not identical for both populations, it is possible to estimate FST separately for each population (i.e., $F_{ST}^1$ and $F_{ST}^2$) (Weir and Hill 2002). Estimates produced according to the method given in Weir and Hill (2002) will not depend on sample size. We focus here on estimating FST for a pair of populations, as this is a very common use when analyzing human genetic data.
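A single-SNP version of the WC estimator for two populations and a bi-allelic SNP can be written compactly. The following Python sketch follows the standard WC variance-component form for allele-frequency data; the function name and the handling of uninformative SNPs are assumptions made for illustration.

```python
def wc_fst(p1, n1, p2, n2):
    """Single-SNP Weir and Cockerham estimate for two populations.

    p1, p2: sample allele frequencies; n1, n2: sample sizes.
    """
    # Sample-size-weighted within-population term.
    within = (n1 * p1 * (1.0 - p1) + n2 * p2 * (1.0 - p2)) / (n1 + n2 - 2)
    weight = n1 * n2 / (n1 + n2)
    numerator = 2.0 * weight * within
    denominator = weight * (p1 - p2) ** 2 + (2.0 * weight - 1.0) * within
    if denominator == 0.0:
        raise ValueError("SNP is monomorphic in the combined sample")
    return 1.0 - numerator / denominator
```

The estimate is 1 for a fixed difference between the two samples and slightly negative when the sample frequencies are identical; the sample-size weights n1 and n2 in the within-population term are what make the large-sample behavior depend on the ratio of sample sizes when population-specific FST values differ.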

Nei's FST

Definition

Nei (1986) defined FST (he used the term GST) based upon the sample gene diversity between and within populations as

graphic file with name 1514equ7.jpg

where Inline graphic is the average gene diversity between populations and HT is the diversity in the average of the two population samples. We consider this definition in detail in the Supplemental Material.

Estimator

In the case of two populations and bi-allelic SNPs, Nei's estimator is

$$\hat{F}_{ST}^{Nei} = \frac{(\tilde{p}_1 - \tilde{p}_2)^2}{2\,\tilde{p}_{avg}\,(1 - \tilde{p}_{avg})}$$

where

$$\tilde{p}_{avg} = \frac{\tilde{p}_1 + \tilde{p}_2}{2}$$

and $\tilde{p}_i$ is the sample allele frequency in population i for i ∈ {1, 2}. We note that this is Nei's updated estimator and, in the case of two populations, differs from the estimator given in Nei (1973) and Nei and Chesser (1983) by a factor of 2. We use the estimator given in Nei (1986), as it is most closely related to the other estimators considered.

Using the definition of Weir and Hill (2002), we show (see Supplemental Material) that estimates made using Nei's estimator will tend toward Equation 2 (see Results), with a maximum value of 2 as $F_{ST}^{1}, F_{ST}^{2} \to 1$. This overestimates the average of population-specific FST values and alters the relation from this average of FST values to divergence time (see Supplemental Material). Estimates of FST given for the Nei estimator were generated using the proposed estimator for the numerator (see Supplemental Material) and a simple estimator for the denominator.
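
To illustrate the upper bound of 2, the following Python sketch evaluates a plug-in version of Nei's two-population estimator, assuming the form (p̃1 − p̃2)² / (2 p̃avg (1 − p̃avg)) with p̃avg = (p̃1 + p̃2)/2. It deliberately omits the sample-size correction to the numerator mentioned above, and the function name `nei_fst` is hypothetical.

```python
def nei_fst(p1: float, p2: float) -> float:
    """Plug-in single-SNP Nei estimate from two sample allele frequencies.

    Simplified sketch: no finite-sample correction is applied to the
    numerator, so this is not the exact procedure used for reported values.
    """
    p_avg = (p1 + p2) / 2.0
    # Diversity in the average of the two population samples.
    total_diversity = 2.0 * p_avg * (1.0 - p_avg)
    if total_diversity == 0.0:
        raise ValueError("SNP is monomorphic in the combined sample")
    return (p1 - p2) ** 2 / total_diversity


if __name__ == "__main__":
    print(nei_fst(1.0, 0.0))  # populations fixed for different alleles
    print(nei_fst(0.5, 0.5))  # identical sample frequencies
```

When the two samples are fixed for different alleles (p̃1 = 1, p̃2 = 0) this sketch returns 2, and when the sample frequencies are identical it returns 0.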

Hudson's FST

Definition

Hudson et al. (1992) defined FST in terms of heterozygosity. The fundamental difference between these estimators is that for Hudson, the total variance is based upon the ancestral population and not the current sample.

Estimator

Hudson's estimator for FST is given by

$$\hat{F}_{ST}^{Hudson} = 1 - \frac{H_w}{H_b}$$

where Hw is the mean number of differences within populations, and Hb is the mean number of differences between populations. While Hudson did not give explicit equations for Hw and Hb, we cast his description into an explicit estimator (see Supplemental Material for a derivation). The estimator that we analyze is

$$\hat{F}_{ST}^{Hudson} = \frac{(\tilde{p}_1 - \tilde{p}_2)^2 - \dfrac{\tilde{p}_1(1 - \tilde{p}_1)}{n_1 - 1} - \dfrac{\tilde{p}_2(1 - \tilde{p}_2)}{n_2 - 1}}{\tilde{p}_1(1 - \tilde{p}_2) + \tilde{p}_2(1 - \tilde{p}_1)}$$

where $n_i$ is the sample size and $\tilde{p}_i$ is the sample allele frequency in population i for i ∈ {1, 2}. Analyzing this estimator using the definition of Weir and Hill (2002), we show (see Supplemental Material) that FST estimated using Hudson's estimator will tend toward Equation 3 (see Results), which is exactly the average of population-specific FST values that we seek to estimate. This emerges naturally, as the proposed estimator is the simple average of the population-specific estimators given in Weir and Hill (2002). This estimator has the desirable properties that it (1) is independent of sample composition and (2) does not overestimate FST (it has a maximum value of 1). We recommend its use to produce estimates of FST for two populations.
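
As a rough illustration of the single-SNP computation, the Python sketch below evaluates the explicit Hudson estimator as a separate numerator and denominator, assuming the numerator (p̃1 − p̃2)² − p̃1(1 − p̃1)/(n1 − 1) − p̃2(1 − p̃2)/(n2 − 1) and the denominator p̃1(1 − p̃2) + p̃2(1 − p̃1). The function names are hypothetical, and treating each sample size as a count of sampled chromosomes is an assumption of this sketch.

```python
from typing import Tuple


def hudson_components(p1: float, p2: float, n1: int, n2: int) -> Tuple[float, float]:
    """Return (numerator, denominator) of the single-SNP Hudson estimate.

    p1, p2: sample allele frequencies; n1, n2: sample sizes (assumed here
    to be counts of sampled chromosomes, each at least 2).
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample size must be at least 2")
    # Squared frequency difference, corrected for finite-sample noise.
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1.0 - p1) / (n1 - 1)
        - p2 * (1.0 - p2) / (n2 - 1)
    )
    # Chance that two alleles drawn from different populations differ.
    denominator = p1 * (1.0 - p2) + p2 * (1.0 - p1)
    return numerator, denominator


def hudson_fst(p1: float, p2: float, n1: int, n2: int) -> float:
    """Single-SNP Hudson estimate as the ratio of the two components."""
    numerator, denominator = hudson_components(p1, p2, n1, n2)
    if denominator == 0.0:
        raise ValueError("SNP is fixed for the same allele in both samples")
    return numerator / denominator


if __name__ == "__main__":
    print(hudson_fst(1.0, 0.0, 10, 10))  # fixed for different alleles
    print(hudson_fst(0.5, 0.5, 11, 11))  # identical sample frequencies
```

For samples fixed for different alleles the sketch returns 1, consistent with the stated maximum. For identical sample frequencies it returns a small negative value, because the sample-size correction is subtracted from a zero squared difference. Returning the numerator and denominator separately makes it straightforward to combine SNPs later.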

Combining estimates of FST across multiple SNPs

The Hudson estimator is asymptotically consistent, as the estimators of the variance components involved in the computation of FST are unbiased in the context of the WH definition. However, as their quotient is not an unbiased estimator of FST, use of an average of ratios will, in general, result in a biased estimate.

As many rare variants discovered by deep sequencing are population specific, we analyze the effect of this approach in the presence of many such variants. Consider a rare SNP with p1 = ε, p2 = 0. This yields a single-SNP FST = ε, so every such SNP contributes a value near zero to an average of ratios, however differentiated the common SNPs are. An estimate produced using an average of ratios will therefore be highly sensitive to rare SNPs of this type and is likely to exhibit dependence on both the sequencing depth and sample size used in the analysis (see Supplemental Fig. S2).

Previous works have examined this choice and advocated for the use of a ratio of averages (Reynolds et al. 1983; Weir and Cockerham 1984). However, in describing the WH-ML method, Weir and Hill recommend that estimates be “simply averaged over loci.” We believe that use of an average of ratios can account for the bulk of the discrepancy between the estimates of FST from The 1000 Genomes Project Consortium (2010) and previously published estimates (International HapMap 3 Consortium 2010) (see Results).
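
The contrast between the two ways of combining SNPs can be sketched in Python as follows. The example assumes the explicit Hudson numerator and denominator described earlier, uses made-up allele frequencies (two common SNPs and eight population-specific rare SNPs with p1 = 0.002, p2 = 0), and takes sample sizes of 1000 chromosomes per population; all names and values are illustrative only.

```python
from typing import List, Tuple

Snp = Tuple[float, float, int, int]  # (p1, p2, n1, n2), hypothetical record


def hudson_components(p1: float, p2: float, n1: int, n2: int) -> Tuple[float, float]:
    """Numerator and denominator of the single-SNP Hudson estimate."""
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1.0 - p1) / (n1 - 1)
        - p2 * (1.0 - p2) / (n2 - 1)
    )
    denominator = p1 * (1.0 - p2) + p2 * (1.0 - p1)
    return numerator, denominator


def ratio_of_averages(snps: List[Snp]) -> float:
    """Sum numerators and denominators over SNPs, then divide once."""
    parts = [hudson_components(*snp) for snp in snps]
    return sum(n for n, _ in parts) / sum(d for _, d in parts)


def average_of_ratios(snps: List[Snp]) -> float:
    """Divide per SNP, then take the plain mean of the per-SNP values."""
    ratios = [n / d for n, d in (hudson_components(*snp) for snp in snps)]
    return sum(ratios) / len(ratios)


# Made-up data: two common SNPs plus eight population-specific rare SNPs.
common: List[Snp] = [(0.6, 0.2, 1000, 1000), (0.3, 0.7, 1000, 1000)]
rare: List[Snp] = [(0.002, 0.0, 1000, 1000)] * 8
snps = common + rare

if __name__ == "__main__":
    print("ratio of averages:", ratio_of_averages(snps))
    print("average of ratios:", average_of_ratios(snps))
```

With these illustrative inputs the ratio of averages stays near the values of the common SNPs (roughly 0.28), while the average of ratios is pulled below 0.06 by the rare SNPs, each of which has a single-SNP value close to zero.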

Dependence of FST on the set of SNPs analyzed

In relating quantities being estimated from current populations to parameters of the evolutionary model, we have calculated expected values given the allele frequency in the ancestral population. This implicitly performs an ascertainment of SNPs that are polymorphic in the ancestral population or, equivalently, in an outgroup population. Provided there is no migration or admixture between populations, the relationship between FST and divergence time is given in Supplemental Equation s12.

This relationship accounts for changes in effective population size (i.e., bottlenecks or expansions) in the demographic history of the populations being compared. Additionally, ascertainment in an outgroup renders the estimate independent of the allele frequency spectrum in the outgroup. Therefore, with this type of ascertainment scheme, estimates should be concordant regardless of whether they are produced from rare or common SNPs.
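
In practice, ascertainment in an outgroup amounts to a filtering step applied before FST is computed. The minimal Python sketch below keeps only SNPs whose outgroup sample allele frequency lies strictly between 0 and 1. The function name is hypothetical, and using the outgroup sample frequency as a stand-in for polymorphism in the outgroup population is an assumption.

```python
from typing import List, Sequence


def polymorphic_in_outgroup(outgroup_freqs: Sequence[float]) -> List[int]:
    """Return indices of SNPs that are polymorphic in the outgroup sample.

    A SNP counts as polymorphic when its outgroup allele frequency is
    strictly between 0 and 1 (an assumption of this sketch).
    """
    return [i for i, p in enumerate(outgroup_freqs) if 0.0 < p < 1.0]


if __name__ == "__main__":
    freqs = [0.0, 0.3, 1.0, 0.05]
    print(polymorphic_in_outgroup(freqs))
```

The retained indices would then select the SNPs passed to the multi-SNP estimate for the two study populations.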

While ascertainment in an outgroup has several helpful properties, in many practical circumstances no data from a reasonable outgroup are available. In these instances, FST can be estimated using SNPs ascertained in either one of the populations under study. However, such estimates are not expected to be independent of the allele frequency spectrum or of complex demographic scenarios.

Acknowledgments

We are grateful to B. Weir for substantial contributions to this research and D. Reich, S. Sunyaev, S. Myers, N. Zaitlen, J. Wilson, A. Keinan, W. Jin, and D. Balding for helpful discussions. This research was funded by NIH grants T32 HG002295 (G.B.), R01 HG006399 (N.P. and A.L.P.), and R03 HG006170 (G.B. and A.L.P.).

Footnotes

[Supplemental material is available for this article.]

Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.154831.113.

References

  1. The 1000 Genomes Project Consortium 2010. A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073
  2. Albrechtsen A, Nielsen FC, Nielsen R 2010. Ascertainment biases in SNP chips affect measures of population divergence. Mol Biol Evol 27: 2534–2547
  3. Balding DJ 2003. Likelihood-based inference for genetic correlation coefficients. Theor Popul Biol 63: 221–230
  4. Balding DJ, Nichols RA 1995. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96: 3–12
  5. Barreiro LB, Laval G, Quach H, Patin E, Quintana-Murci L 2008. Natural selection has driven population differentiation in modern humans. Nat Genet 40: 340–345
  6. Cavalli-Sforza LL, Bodmer WF 1971. The genetics of human populations. W.H. Freeman, San Francisco, CA
  7. Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R 2005. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res 15: 1496–1502
  8. Cockerham CC 1969. Variance of gene frequencies. Evolution 23: 72–84
  9. Edelaar P, Alonso D, Lagerveld S, Senar JC, Björklund M 2012. Population differentiation and restricted gene flow in Spanish crossbills: Not isolation-by-distance but isolation-by-ecology. J Evol Biol 25: 417–430
  10. Ellstrand NC, Elam DR 1993. Population genetic consequences of small population size: Implications for plant conservation. Ann Rev Ecol Syst 24: 217–242
  11. Guries RP, Ledig FT 1982. Genetic diversity and population structure in pitch pine (Pinus rigida Mill.). Evolution 36: 387–402
  12. Hangartner S, Laurila A, Räsänen K 2012. Adaptive divergence in moor frog (Rana arvalis) populations along an acidification gradient: Inferences from QST–FST correlations. Evolution 66: 867–881
  13. Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, 1000 Genomes Project, Sella G, Przeworski M 2011. Classic selective sweeps were rare in recent human evolution. Science 331: 920–924
  14. Holsinger KE 1999. Analysis of genetic diversity in geographically structured populations: A Bayesian perspective. Hereditas 130: 245–255
  15. Holsinger KE, Weir BS 2009. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat Rev Genet 10: 639–650
  16. Hudson RR, Slatkin M, Maddison WP 1992. Estimation of levels of gene flow from DNA sequence data. Genetics 132: 583–589
  17. International HapMap Consortium 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861
  18. International HapMap 3 Consortium 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58
  19. Jost L 2008. GST and its relatives do not measure differentiation. Mol Ecol 17: 4015–4026
  20. Keinan A, Mullikin JC, Patterson N, Reich D 2007. Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat Genet 39: 1251–1255
  21. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, et al. 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104
  22. Malécot G 1948. Les mathématiques de l'hérédité. Masson & Cie, Paris, France
  23. Mathieson I, McVean G 2012. Differential confounding of rare and common variants in spatially structured populations. Nat Genet 44: 243–246
  24. McVicker G, Gordon D, Davis C, Green P 2009. Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet 5: e1000471
  25. Nei M 1973. Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci 70: 3321–3323
  26. Nei M 1986. Definition and estimation of fixation indices. Evolution 40: 643–645
  27. Nei M, Chesser RK 1983. Estimation of fixation indices and gene diversities. Ann Hum Genet 47: 253–259
  28. Nicholson G, Smith AV, Jonsson F, Gústafsson Ó, Stefansson K, Donnelly P 2002. Assessing population differentiation and isolation from single-nucleotide polymorphism data. J R Stat Soc Series B Stat Methodol 64: 695–715
  29. Palumbi SR, Baker CS 1994. Contrasting population structure from nuclear intron sequences and mtDNA of humpback whales. Mol Biol Evol 11: 426–435
  30. Ramírez-Soriano A, Calafell F 2008. FABSIM: A software for generating FST distributions with various ascertainment biases. Bioinformatics 24: 2790–2791
  31. Reynolds J, Weir BS, Cockerham CC 1983. Estimation of the coancestry coefficient: Basis for a short-term genetic distance. Genetics 105: 767–779
  32. Ryman N, Leimar O 2009. GST is still a useful measure of genetic differentiation: A comment on Jost's D. Mol Ecol 18: 2084–2087
  33. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D 2005. Calibrating a coalescent simulation of human genome sequence variation. Genome Res 15: 1576–1583
  34. Selander RK, Hudson RO 1976. Animal population structure under close inbreeding: The land snail Rumina in Southern France. Am Nat 110: 695–718
  35. Slatkin M 1991. Inbreeding coefficients and coalescence times. Genet Res 58: 167
  36. Weir BS, Cockerham CC 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370
  37. Weir BS, Hill WG 2002. Estimating F-statistics. Annu Rev Genet 36: 721–750
  38. Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG 2005. Measures of human population structure show heterogeneity among genomic regions. Genome Res 15: 1468–1476
  39. Wright S 1949. The genetical structure of populations. Ann Hum Genet 15: 323–354
  40. Xu S, Yin X, Li S, Jin W, Lou H, Yang L, Gong X, Wang H, Shen Y, Pan X, et al. 2009. Genomic dissection of population substructure of Han Chinese and its implication in association studies. Am J Hum Genet 85: 762–774

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press
