Abstract
Interpretations of values of the FST measure of genetic differentiation rely on an understanding of its mathematical constraints. Previously, it has been shown that FST values computed from a biallelic locus in a set of multiple populations and FST values computed from a multiallelic locus in a pair of populations are mathematically constrained as a function of the frequency of the allele that is most frequent across populations. We generalize from these cases to report here the mathematical constraint on FST given the frequency M of the most frequent allele at a multiallelic locus in a set of multiple populations. Using coalescent simulations of an island model of migration with an infinitely-many-alleles mutation model, we argue that the joint distribution of FST and M helps in disentangling the separate influences of mutation and migration on FST. Finally, we show that our results explain a puzzling pattern of microsatellite differentiation: the lower FST in an interspecific comparison between humans and chimpanzees than in the comparison of chimpanzee populations. We discuss the implications of our results for the use of FST.
This article is part of the theme issue ‘Celebrating 50 years since Lewontin's apportionment of human diversity’.
Keywords: allele frequency, chimpanzee, genetic differentiation, migration, population structure
1. Introduction
Multiallelic loci such as microsatellites and haplotype assignments are used to study genetic differentiation in a variety of fields, ranging from ecology and conservation genetics to anthropology and human genomics. Genetic differentiation is often measured for multiallelic loci using the multiallelic extension of Wright’s fixation index FST [1]
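$$F_{ST} = \frac{H_T - H_S}{H_T}. \tag{1.1}$$
For a polymorphic multiallelic locus with I distinct alleles in a set of K subpopulations, denoting by pk,i the frequency of allele i in subpopulation k, $H_S = 1 - \frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{I} p_{k,i}^2$ and $H_T = 1 - \sum_{i=1}^{I}\left(\frac{1}{K}\sum_{k=1}^{K} p_{k,i}\right)^2$.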
FST values are known to be smaller for multiallelic than for biallelic loci [2]. One reason invoked to explain this difference is that within-subpopulation heterozygosity HS mathematically constrains the maximal value of FST to be below 1, and the constraint is stronger when HS is high. This phenomenon was noticed concurrently in simulation-based, empirical and theoretical studies [3–7], and the mathematical constraints describing the dependence were subsequently clarified [8,9].
Studies have found that the maximal value of FST can be viewed as constrained not only by functions of the within-subpopulation allele frequency distribution such as HS, but alternatively by aspects of the global allele frequency distribution across subpopulations. For a biallelic locus in K = 2 subpopulations, Maruki et al. [10] showed that the maximal FST as a function of the frequency M of the most frequent allele decreases as M increases from 1/2 to 1 (see also [11]). Generalizing the biallelic case to arbitrarily many alleles, Jakobsson et al. [12] showed that for multiallelic loci with an unspecified number of distinct alleles, the maximal FST increases from 0 to 1 as a function of M if 0 < M < 1/2, and decreases from 1 to 0 for 1/2 ≤ M < 1 in the manner reported by Maruki et al. [10] for biallelic loci. Edge & Rosenberg [13] generalized these results to the case of a fixed finite number of alleles, showing that the maximal FST differs slightly from the unspecified case when the fixed number of distinct alleles is an odd number.
Generalizing the simplest case of K = I = 2 in a different direction, Alcala & Rosenberg [14] considered biallelic loci in the case of a fixed number of subpopulations K ≥ 2. We showed that the maximal value of FST displays a peculiar behaviour as a function of M: the upper bound has a maximum of 1 if and only if M = k/K, for integers k with K/2 ≤ k ≤ K − 1. The constraints on the maximal value of FST dissipate as K tends to infinity, even though for any fixed K, there always exists a value of M for which the maximal FST is less than 1.
Relating FST to its maximum as a function of M helps explain surprising phenomena that arise during population-genetic data analysis. For example, Jakobsson et al. [12] showed that stronger constraints on FST could explain the low FST values seen in pairs of African human populations. They also found that such constraints could explain the lower FST values seen in high-diversity multiallelic loci compared to lower-diversity loci—microsatellites compared to single-nucleotide polymorphisms. Alcala & Rosenberg [14] showed that constraints on the maximal FST could explain the lower FST values between human populations seen when computing FST pairwise rather than from all populations simultaneously.
In this study, we characterize the relationship between FST and the frequency M of the most frequent allele, for a multiallelic locus and an arbitrary specified value of the number of subpopulations K. We derive the mathematical upper bound on FST in terms of M, extending the biallelic result of Alcala & Rosenberg [14] to the multiallelic case, and providing the most comprehensive description of the mathematical constraints on FST in terms of M to date (table 1). To assist in interpreting the new bound, we simulate the joint distribution of FST and M in the island migration model, describing its properties as a function of the number of subpopulations, the migration rate and a mutation rate. The K-subpopulation upper bound on FST in terms of M facilitates an explanation of counterintuitive aspects of inter-species genetic differentiation. We discuss the importance of the results for applications of FST more generally.
Table 1.
reference | number of alleles | number of subpopulations | variable in terms of which constraints are reported |
---|---|---|---|
Long & Kittles [8] | unspecified value ≥2 | fixed finite value ≥2 | HS |
Rosenberg et al. [11] | 2 | 2 | δ |
Hedrick [9] | unspecified value ≥2 | fixed finite value ≥2 | HS |
Maruki et al. [10] | 2 | 2 | HS, M |
Jakobsson et al. [12] | unspecified value ≥2 | 2 | HT, M |
Edge & Rosenberg [13] | fixed finite value ≥2 | 2 | HT, M |
Alcala & Rosenberg [14] | 2 | fixed finite value ≥2 | M |
this paper | unspecified value ≥2 | fixed finite value ≥2 | M |
2. Model
Our goal is to derive the range of values that FST can take—the lower and upper bounds on FST—as a function of the frequency M of the most frequent allele for a multiallelic locus, when the number of subpopulations K is a fixed finite value greater than or equal to 2. We follow previous studies [12–15] in describing notation and constructing the scenario.
We consider a polymorphic locus with an unspecified number of distinct alleles, in a setting with K subpopulations contributing equally to the total population. We denote the frequency of allele i in subpopulation k by pk,i, with sum $\sigma_i = \sum_{k=1}^{K} p_{k,i}$ across subpopulations. Each allele frequency pk,i lies in [0, 1]. Within subpopulations, allele frequencies sum to 1: for each k, $\sum_{i=1}^{\infty} p_{k,i} = 1$. Hence, σi lies in [0, K], and $\sum_{i=1}^{\infty} \sigma_i = K$. We number alleles from most to least frequent, so σi ≥ σj for i ≤ j.
Because by assumption the locus is polymorphic, σi < K for each i. Alleles 1 and 2 have non-zero frequency in at least one subpopulation, not necessarily the same one; we have σ1 > 0 and σ2 > 0. We denote the mean frequency of the most frequent allele across subpopulations by M = σ1/K. We then have 0 < M < 1. We treat the allele frequencies pk,i and associated quantities M and σi as parametric values, and not as estimates computed from data.
Equation (1.1) expresses FST as a ratio involving within-subpopulation heterozygosity, HS, and total heterozygosity, HT, with 0 ≤ HS < 1 and 0 ≤ HT < 1. Because we assume the locus is polymorphic, HT > 0. We write equation (1.1) in terms of allele frequencies, permitting the number of distinct alleles to be arbitrarily large
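$$F_{ST} = \frac{\frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{\infty} p_{k,i}^2 - \sum_{i=1}^{\infty}\left(\frac{1}{K}\sum_{k=1}^{K} p_{k,i}\right)^2}{1 - \sum_{i=1}^{\infty}\left(\frac{1}{K}\sum_{k=1}^{K} p_{k,i}\right)^2}. \tag{2.1}$$
Hence, our goal is, for fixed σ1 = KM, 0 < σ1 < K, to identify the matrices (pk,i)K×∞, with pk,i in [0, 1], $\sum_{i=1}^{\infty} p_{k,i} = 1$ for each k and $\sum_{k=1}^{K} p_{k,1} = \sigma_1$, that minimize and maximize FST in equation (2.1).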
Note that we adopt the interpretation of FST as a ‘statistic’ that describes a mathematical function of allele frequencies rather than as a ‘parameter’ that describes coancestry of individuals in a population [e.g. 16]. See Alcala & Rosenberg [14] for a discussion of interpretations of FST when studying its mathematical properties.
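As a minimal sketch of this 'statistic' view, the function below evaluates FST directly from a matrix of allele frequencies. It assumes equally weighted subpopulations and the heterozygosity definitions H_S = 1 − (1/K) Σ_k Σ_i p_{k,i}² and H_T = 1 − Σ_i ((1/K) Σ_k p_{k,i})². The function name `fst` and the tolerance argument are illustrative choices, not part of the original analysis.

```python
from typing import Sequence


def fst(p: Sequence[Sequence[float]], tol: float = 1e-9) -> float:
    """Return F_ST = (H_T - H_S) / H_T for K equally weighted subpopulations.

    p[k][i] is the frequency of allele i in subpopulation k.
    """
    K = len(p)
    n_alleles = len(p[0])
    for row in p:
        if len(row) != n_alleles or abs(sum(row) - 1.0) > tol:
            raise ValueError("each row must list all alleles and sum to 1")

    # Within-subpopulation heterozygosity: one minus the mean homozygosity.
    h_s = 1.0 - sum(x * x for row in p for x in row) / K

    # Total heterozygosity, from the mean frequency of each allele.
    mean_freq = [sum(row[i] for row in p) / K for i in range(n_alleles)]
    h_t = 1.0 - sum(m * m for m in mean_freq)

    if h_t <= 0.0:
        raise ValueError("F_ST is undefined for a monomorphic locus")
    return (h_t - h_s) / h_t


# Two subpopulations fixed for different alleles are maximally differentiated.
print(fst([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
# Identical allele frequency vectors give no differentiation.
print(fst([[0.5, 0.5], [0.5, 0.5]]))  # 0.0
```

The function takes frequencies as given parametric values, so no sampling correction is applied.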
3. Mathematical constraints
(a) Lower bound of FST
Bounds on FST in terms of the frequency of the most frequent allele can be written with respect to M or σ1, noting that M ranges in (0, 1) and σ1 ranges in (0, K). For the lower bound, from equation (2.1), for any choice of σ1, FST = 0 can be achieved. Consider (σ1, σ2, …) with σi in [0, K) for each i, σi ≥ σj for i ≤ j, $\sum_{i=1}^{\infty} \sigma_i = K$, and σ1 > 0 and σ2 > 0. We set pk,i = σi/K for all subpopulations k and alleles i; this choice yields FST = 0.
FST = 0 implies that the numerator of equation (2.1), HT − HS, is zero. This numerator can be written $\frac{1}{K^2}\sum_{i=1}^{\infty}\left[K\sum_{k=1}^{K} p_{k,i}^2 - \left(\sum_{k=1}^{K} p_{k,i}\right)^2\right]$. The Cauchy–Schwarz inequality guarantees that $K\sum_{k=1}^{K} p_{k,i}^2 \geq \left(\sum_{k=1}^{K} p_{k,i}\right)^2$, with equality if and only if p1,i = p2,i = … = pK,i = σi/K. Applying the Cauchy–Schwarz inequality to all alleles i, the numerator of equation (2.1) is zero only if for all i, (p1,i, p2,i, …, pK,i) = (σi/K, σi/K, …, σi/K).
Thus, we can conclude that the allele frequency matrices in which all K subpopulations have identical allele frequency vectors are the only matrices for which FST = 0. The lower bound on FST is equal to 0 irrespective of M or σ1, for any value of the number of subpopulations K.
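The equality condition can also be read off a standard identity that rewrites each allele's contribution to HT − HS as a sum of squared pairwise differences between subpopulations. The display below is a worked restatement, assuming H_S = 1 − (1/K) Σ_k Σ_i p_{k,i}² and H_T = 1 − Σ_i ((1/K) Σ_k p_{k,i})²; it is not an additional result.

```latex
K \sum_{k=1}^{K} p_{k,i}^2 - \Bigl(\sum_{k=1}^{K} p_{k,i}\Bigr)^2
  = \sum_{1 \le k < \ell \le K} \bigl(p_{k,i} - p_{\ell,i}\bigr)^2 \;\ge\; 0,
\qquad
H_T - H_S = \frac{1}{K^2} \sum_{i=1}^{\infty} \; \sum_{1 \le k < \ell \le K}
  \bigl(p_{k,i} - p_{\ell,i}\bigr)^2 .
```

Every term is non-negative, so the sum vanishes only when each allele has the same frequency in every pair of subpopulations.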
(b) Upper bound of FST
To derive the upper bound on FST in terms of M = σ1/K, we must maximize FST in equation (2.1), assuming that σ1 and K are constant. The computations are performed in appendix A; we write the main result as a function of σ1, noting that it can be converted into a function of M by replacing σ1 with KM.
In theorem A.1, we treat the case in which σ1 has an integer value. For non-integer σ1, theorem A.2 shows that the maximal FST requires that (i) the sum of squared allele frequencies across alleles and subpopulations, $S = \sum_{k=1}^{K}\sum_{i=1}^{\infty} p_{k,i}^2$, is maximal, and (ii) alleles i = 2, 3, … are each present in at most one subpopulation, but allele 1 might be present in more than one subpopulation. We then separately maximize FST as a function of σ1 for σ1 in (0, 1) and non-integer σ1 in (1, K). These two cases differ in that allele 1 appears in a single subpopulation in the former case, and it must appear in at least two subpopulations in the latter.
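The maximal FST as a function of σ1 for σ1 in (0, K) is
$$F_{ST} \leq \begin{cases} \dfrac{(K-1)\left[1-\sigma_1(J-1)(2-J\sigma_1)\right]}{K-\left[1-\sigma_1(J-1)(2-J\sigma_1)\right]}, & 0 < \sigma_1 \leq 1, \\[2ex] \dfrac{K(K-1)-\sigma_1^2+\lfloor\sigma_1\rfloor-2(K-1)\{\sigma_1\}+(2K-1)\{\sigma_1\}^2}{K(K-1)-\sigma_1^2-\lfloor\sigma_1\rfloor+2\sigma_1-\{\sigma_1\}^2}, & 1 < \sigma_1 < K, \end{cases} \tag{3.1}$$
where $J = \lceil 1/\sigma_1 \rceil$. Here, ⌈x⌉ denotes the smallest integer greater than or equal to x, ⌊x⌋ denotes the greatest integer less than or equal to x, and {x} = x − ⌊x⌋ denotes the fractional part of x. Note that for an integer choice of σ1, the maximum from equation (3.1) and the limits as σ1 tends to the integer from above and below all equal 1, so that the maximum as a function of σ1 is continuous.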
From appendix A, FST reaches its upper bound for integer σ1 when allele 1 has frequency 1 in each of σ1 subpopulations, and when in each of the remaining K − σ1 subpopulations, an allele other than allele 1 has frequency 1. These alleles of frequency 1 need not be private, although they can be; any identity relationships among them are permissible, provided that when summing frequencies across subpopulations, none of these alleles has a sum that exceeds σ1. The locus can have as few as ⌈K/σ1⌉ alleles of non-zero frequency and as many as K − σ1 + 1.
For σ1 in interval (0, 1), FST is maximal when each allele is present in only a single subpopulation, and when each subpopulation has exactly J alleles with a non-zero frequency: J − 1 alleles at frequency σ1 and one allele at frequency 1 − (J − 1)σ1 ≤ σ1. Because each subpopulation has J distinct alleles and no alleles are shared across subpopulations, this upper bound requires that the locus has KJ alleles of non-zero frequency.
For non-integer σ1 in (1, K), FST reaches its maximum when there are ⌊σ1⌋ subpopulations in which the most frequent allele has frequency 1, a single subpopulation in which it has frequency {σ1} and a private allele has frequency 1 − {σ1}, and K − ⌊σ1⌋ − 1 subpopulations each with a different private allele at frequency 1. Only the most frequent allele is shared across subpopulations, and a single subpopulation displays polymorphism. At the maximum, K − ⌊σ1⌋ + 1 alleles have non-zero frequency.
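The maximizing configurations described above can be checked numerically. The sketch below builds the allele frequency matrix for a given σ1 and K, evaluates FST on it, and compares the result with a closed-form bound. The closed form is an assumption: it is written here as the expression obtained by substituting these configurations into FST. The names `fst`, `maximizing_matrix` and `fst_upper_bound` are hypothetical helpers. Exact rational arithmetic (`fractions.Fraction`) is used so that ceilings and floors behave correctly at interval boundaries.

```python
import math
from fractions import Fraction
from typing import List


def fst(p: List[List[Fraction]]) -> Fraction:
    """F_ST for K equally weighted subpopulations (rows of p), scaled by K^2."""
    K = len(p)
    S = sum(x * x for row in p for x in row)  # sum of squared frequencies
    sigma = [sum(row[i] for row in p) for i in range(len(p[0]))]
    sum_sigma_sq = sum(s * s for s in sigma)
    return (K * S - sum_sigma_sq) / (K * K - sum_sigma_sq)


def maximizing_matrix(sigma1: Fraction, K: int) -> List[List[Fraction]]:
    """Allele frequencies attaining the maximum, following the text's description."""
    if not 0 < sigma1 < K:
        raise ValueError("sigma1 must lie in (0, K)")
    zero, one = Fraction(0), Fraction(1)
    if sigma1 <= 1:
        # Every allele is private; each subpopulation carries J alleles.
        J = math.ceil(1 / sigma1)
        block = [sigma1] * (J - 1) + [1 - (J - 1) * sigma1]
        return [[zero] * (k * J) + block + [zero] * ((K - 1 - k) * J)
                for k in range(K)]
    fl = math.floor(sigma1)
    frac = sigma1 - fl
    rows = []
    for k in range(K):
        row = [zero] * (K - fl + 1)       # allele 1 plus K - floor(sigma1) private alleles
        if k < fl:
            row[0] = one                  # allele 1 fixed
        elif k == fl:
            row[0], row[1] = frac, 1 - frac  # the single polymorphic subpopulation
        else:
            row[k - fl + 1] = one         # a distinct private allele fixed
        rows.append(row)
    return rows


def fst_upper_bound(sigma1: Fraction, K: int) -> Fraction:
    """Closed-form maximum of F_ST given sigma1 = K * M (assumed form)."""
    if not 0 < sigma1 < K:
        raise ValueError("sigma1 must lie in (0, K)")
    if sigma1 <= 1:
        J = math.ceil(1 / sigma1)
        s = 1 - sigma1 * (J - 1) * (2 - J * sigma1)
        return (K - 1) * s / (K - s)
    fl = math.floor(sigma1)
    frac = sigma1 - fl
    num = K * (K - 1) - sigma1 ** 2 + fl - 2 * (K - 1) * frac + (2 * K - 1) * frac ** 2
    den = K * (K - 1) - sigma1 ** 2 - fl + 2 * sigma1 - frac ** 2
    return num / den


# K = 2 and M = 0.27 (sigma1 = 0.54): the bound is approximately 0.336.
print(float(fst_upper_bound(Fraction(27, 50), 2)))
```

Passing `Fraction` values rather than floats avoids rounding artefacts such as `math.ceil(1 / sigma1)` overshooting when 1/σ1 is an integer.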
(c) Properties of the upper bound
Figure 1 shows the maximal value of FST in terms of M = σ1/K for various values of the number of subpopulations, K. We describe a number of properties of this upper bound.
(i) Piecewise structure of the upper bound
First, we observe that the upper bound has a piecewise structure.
For M < 1/K, the upper bound depends on J = ⌈1/(KM)⌉. As KM increases in (0, 1), each decrement in the integer value of ⌈1/(KM)⌉ produces a distinct ‘piece’ with domain [1/(Kj), 1/(K(j − 1))), for integers j ≥ 2. Within each interval [1/(Kj), 1/(K(j − 1))), J has the constant value j.
At M = 1/K, the upper bound has its first transition between cases. For M > 1/K, the upper bound depends on ⌊KM⌋. As KM increases in [1, K), each increment in ⌊KM⌋ also produces a distinct piece of the domain. For each k from 1 to K − 1, ⌊KM⌋ = k for M in [k/K, (k + 1)/K).
Counting the intervals of the domain, we see that an infinite number of distinct intervals occur for M in (0, 1/K), and K − 1 intervals occur for M in (1/K, 1). Within intervals, the function describing the upper bound is smooth.
(ii) Behaviour of the upper bound for M = 1/K, 2/K, …, (K − 1)/K
The upper bound is equal to 1 at M = 1/K, 2/K, …, (K − 1)/K. For M in (0, 1/K), setting the numerator and denominator equal in equation (3.1), we find that the upper bound is never equal to 1. For M in (1/K, 1), the upper bound is equal to 1 if and only if {σ1} = 0, that is, if and only if σ1 is an integer and M = k/K for k = 2, 3, …, K − 1.
Hence, noting that the upper bound is equal to 1 at M = 1/K, we conclude that the upper bound can equal 1 if and only if M = k/K for integers k = 1, 2, …, K − 1. For fixed K, the upper bound on FST has exactly K − 1 maxima at which FST can equal 1, at M = 1/K, 2/K, …, (K − 1)/K. We can conclude that FST is unconstrained within the unit interval only for a finite set of values of the frequency M of the most frequent allele. The size of this set increases with the number of subpopulations K.
(iii) Behaviour of the upper bound for M in (0, 1/K)
For M in (0, 1/K), we can compute the value of the upper bound at the transition points between distinct pieces of the domain, namely values of 1/(Kj) for integers j ≥ 2. Applying equation (3.1), we observe that at M = 1/(Kj), the upper bound has value (K − 1)/(Kj − 1). In other words, the upper bound touches the curve
$$q^*(M) = \frac{(K-1)M}{1-M}. \tag{3.2}$$
This curve is represented in figure 1 as a dashed line.
Note that for K = 2, the special case considered by Jakobsson et al. [12], equation (3.2) reduces to q*(M) = M/(1 − M) = σ1/(2 − σ1), which matches equation 21 from Jakobsson et al. [12]. In fact, setting K = 2, equation (3.1) for M in (0, 1/K) reduces to the K = 2 upper bound on FST in eqn 9 of [12].
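The value at the transition points can be worked out explicitly. The derivation below assumes that, for 0 < σ1 ≤ 1, the upper bound has the form (K − 1)s/(K − s), where s = 1 − σ1(J − 1)(2 − Jσ1) is the sum of squared allele frequencies within each subpopulation at the maximizing configuration and J = ⌈1/σ1⌉. The symbol s is introduced here for convenience only.

```latex
\sigma_1 = KM = \frac{1}{j}
\;\Longrightarrow\;
J = j, \qquad
s = 1 - \frac{1}{j}\,(j-1)\Bigl(2 - j \cdot \frac{1}{j}\Bigr) = \frac{1}{j},
\qquad
\frac{(K-1)\,s}{K - s} = \frac{(K-1)/j}{K - 1/j} = \frac{K-1}{Kj - 1}
  = \frac{(K-1)M}{1 - M}.
```

The last step uses Kj = 1/M. For K = 2 the final expression is M/(1 − M), matching the reduction noted above.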
(iv) Behaviour of the upper bound for M in (1/K, 1)
Because the upper bound is a smooth function on each interval of its domain, and because it possesses maxima at interval boundaries M = 1/K, 2/K, …, (K − 1)/K, it must possess local minima in intervals [k/K, (k + 1)/K) for k = 1, 2, …, K − 2. Indeed, such minima are visible in figure 1 in cases with K = 3, K = 6, K = 40 and K = 100; for K = 2, only one maximum occurs, so that there is no interval between a pair of maxima in which a minimum can occur. Note that because we restrict attention to M in (0, 1), we do not count the point at M = 1 and FST = 0 as a local minimum.
4. Joint distribution of M and FST under an evolutionary model
So far, we have described the mathematical constraint imposed on FST by M without respect to the frequency with which particular values of M arise in evolutionary scenarios. As an assessment of the bounds in evolutionary models can illuminate the settings in which they are most salient in population-genetic data analysis [9,14,17–20], we simulated the joint distribution of FST and M under an island migration model, relating the distribution to the mathematical bounds on FST. This analysis considers allele frequency distributions, and hence values of M and FST, generated by evolutionary models. The simulation approach is modified from [14,15].
(a) Simulations
We simulated alleles under a coalescent model, using the software MS [21]. We considered a total population of KN diploid individuals subdivided into K subpopulations of size N. At each generation, a proportion m of the individuals in a subpopulation originated outside the subpopulation. Thus, the scaled migration rate is 4Nm, and it corresponds to twice the number of individuals in a subpopulation that originate elsewhere. We considered the island model [22–24], in which migrants have the same probability m/(K − 1) of coming from any other specific subpopulation. We used an infinitely-many-alleles model; mutations occur at rate μ, and the scaled mutation rate is 4Nμ.
We examined three values of K (2, 6, 40), three values of 4Nμ (0.1, 1, 10) and three values of 4Nm (0.1, 1, 10). Note that in MS, time is scaled in units of 4N generations, and there is no need to specify subpopulation sizes N. MS simulates an infinitely-many-sites model, where each mutation occurs at a new site; each haplotype is a new allele, so that each mutation creates a new allele. For our analysis, we are concerned only with the allelic categories and not with the simulated sequences; thus, although the simulation follows the infinitely-many-sites model, the analysis treats simulated datasets as having been generated under an infinitely-many-alleles model.
For each parameter triplet (K, 4Nμ, 4Nm), we performed 1000 replicate simulations, sampling 100 sequences per subpopulation in each replicate. We computed FST values from the parametric allele (haplotype) frequencies. MS commands appear in electronic supplementary material, File S1; note that the simulation approach here uses the standard method of simulating MS with a specified mutation rate θ = 4Nμ, whereas in our previous analyses of biallelic cases [14,15], we had employed the alternative approach of requiring simulated datasets to possess exactly one segregating site.
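To illustrate the step from simulated haplotypes to (M, FST), the sketch below treats each distinct haplotype as an allele, tabulates its frequency in every subpopulation, and evaluates M and FST from those frequencies as if they were parametric values. It is not the pipeline of File S1: parsing of MS output is omitted, and the function names and the toy input are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def allele_frequencies(samples: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[float]]]:
    """samples[k] lists the haplotypes sampled in subpopulation k.

    Returns the sorted allele labels and the K x I frequency matrix.
    """
    alleles = sorted({hap for pop in samples for hap in pop})
    freqs = []
    for pop in samples:
        counts = Counter(pop)  # a missing haplotype has count 0
        freqs.append([counts[a] / len(pop) for a in alleles])
    return alleles, freqs


def m_and_fst(freqs: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (M, F_ST) for equally weighted subpopulations."""
    K = len(freqs)
    mean_freq = [sum(row[i] for row in freqs) / K for i in range(len(freqs[0]))]
    M = max(mean_freq)  # mean frequency of the most frequent allele
    h_s = 1.0 - sum(x * x for row in freqs for x in row) / K
    h_t = 1.0 - sum(m * m for m in mean_freq)
    return M, (h_t - h_s) / h_t


# Toy input: two subpopulations, four haplotypes each (strings over segregating sites).
toy = [["00", "00", "01", "01"], ["11", "11", "11", "01"]]
labels, freqs = allele_frequencies(toy)
print(labels, m_and_fst(freqs))
```

Because every mutation under the infinitely-many-sites model creates a new haplotype, grouping identical haplotype strings recovers the allelic classes of the infinitely-many-alleles analysis.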
Figure 2 shows the joint distribution of M and FST for the nine values of (4Nμ, 4Nm) in the case of K = 2. Electronic supplementary material, figures S1 and S2 provide similar figures for K = 6 and K = 40, respectively.
(b) Impact of the mutation rate
For fixed migration rate 4Nm and number of subpopulations K, the main impact of the mutation rate is on the frequency M of the most frequent allele. For K = 2, under weak mutation (4Nμ = 0.1), the joint distribution of M and FST is highest in the high-M region, for all values of 4Nm (figure 2a,d,g). Although most simulation replicates produce M > 1/2 with an upper bound on FST less than one, this set of parameter values does give rise to replicates near the peak at (M, FST) = (1/2, 1).
Under intermediate mutation (4Nμ = 1), the increased mutation rate tends to decrease M, shifting the joint distribution to lower values of M for all values of 4Nm (figure 2b,e,h). Finally, under strong mutation (4Nμ = 10), the joint distribution of M and FST is highest in the low-M region, for all values of 4Nm (figure 2c,f,i). In this region, the upper bound on FST is most strongly constrained, leading to low FST values.
(c) Impact of the migration rate
For fixed mutation rate 4Nμ and number of subpopulations K, the impact of the migration rate is seen primarily in the FST values rather than the values of M. Under weak migration (4Nm = 0.1), subpopulations are differentiated, and the joint distribution of M and FST is highest near the upper bound on FST in terms of M (figure 2a–c).
Under intermediate migration (4Nm = 1), differentiation between subpopulations decreases, and the joint density of M and FST is highest at lower values of FST (figure 2d–f). Under strong migration (4Nm = 10), the joint density of M and FST nears the lower bound (figure 2g–i).
(d) Impact of the number of subpopulations
In figure 1, the number of subpopulations changes the shape of the region in which FST is permitted to range as a function of M. Thus, in simulations, the impact of the number of subpopulations K is observed in cases in which a change in K permits FST to expand its range within the unit square for (M, FST). For each of the nine choices of (4Nμ, 4Nm), figure 3 summarizes the means observed for (M, FST) in figure 2 and electronic supplementary material, figures S1 and S2, corresponding to K = 2, K = 6 and K = 40, respectively.
The number of subpopulations generally increases FST for fixed 4Nμ and 4Nm. For example, the mean FST can be substantially larger for K = 6 than for K = 2. Consider (4Nμ, 4Nm) = (0.1, 0.1). For K = 2, the mean FST is near its upper bound (figure 3a); for K = 6, FST is not as close to the bound (figure 3b). However, because the upper bound for K = 6 exceeds that for K = 2, the mean FST is nevertheless larger in the case of K = 6.
5. Example: humans and chimpanzees
We now use our theoretical results to examine genetic differentiation in humans and chimpanzees. Because humans and chimpanzees are distinct species, we might expect a genetic differentiation measure such as FST to produce a greater value for a computation between them than for a computation among populations within one or the other. Indeed, studies of multiallelic loci do find that adding chimpanzees to data on multiple human populations increases the value of FST [8,25]. However, we will see that FST has a more subtle pattern when considering data on multiple chimpanzee populations, and that our theoretical computations explain a surprising result.
We examine data on 246 multiallelic microsatellite loci assembled by Pemberton et al. [26] from several studies of worldwide human populations and a study of chimpanzees [27]. We consider FST comparisons both between humans and chimpanzees and among populations of chimpanzees. For the human data, we consider all 5795 individuals in the dataset, and for the chimpanzee data, we consider 84 chimpanzee individuals from six populations: one bonobo population, and five common chimpanzee populations (Central, Eastern, Western, hybrid and captive).
In the data analysis, we perform a computation to summarize the relationship of FST to the upper bound. For a set of Z loci, denote by Fz and Mz the values of FST and M at locus z. The mean FST for the set, or $\bar{F}_{ST}$, is
$$\bar{F}_{ST} = \frac{1}{Z}\sum_{z=1}^{Z} F_z. \tag{5.1}$$
Using equation (3.1), we can compute the corresponding maximum FST given the observed σz = KMz, z = 1, 2, …, Z. Denoting this quantity by Fmax,z, we have
$$\bar{F}_{\max} = \frac{1}{Z}\sum_{z=1}^{Z} F_{\max,z}. \tag{5.2}$$
The ratio $\bar{F}_{ST}/\bar{F}_{\max}$ measures the proximity of the FST values to their upper bounds: it ranges from 0, if FST values at all loci equal 0, to 1, if FST values at all loci equal their upper bounds.
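A minimal sketch of this summary follows, assuming that the normalized quantity is the ratio of the mean FST to the mean of the per-locus upper bounds (a ratio of means, not a mean of per-locus ratios). The function name `summarize_loci` and the input values are illustrative only. The per-locus maxima are taken as given inputs.

```python
from typing import Sequence, Tuple


def summarize_loci(f_values: Sequence[float],
                   fmax_values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean F_ST, mean upper bound, ratio of the two means).

    f_values[z] is F_ST at locus z; fmax_values[z] is its upper bound
    given the observed M at that locus and the number of subpopulations K.
    """
    if not f_values or len(f_values) != len(fmax_values):
        raise ValueError("need one upper bound per locus and at least one locus")
    if any(f > fmax for f, fmax in zip(f_values, fmax_values)):
        raise ValueError("an F_ST value exceeds its stated upper bound")
    Z = len(f_values)
    mean_f = sum(f_values) / Z
    mean_fmax = sum(fmax_values) / Z
    return mean_f, mean_fmax, mean_f / mean_fmax


# Hypothetical values for three loci.
print(summarize_loci([0.1, 0.2, 0.3], [0.5, 0.4, 0.6]))
```

The ratio equals 0 when every locus has FST = 0 and equals 1 when every locus sits at its bound, matching the range stated above.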
We computed the parametric allele frequencies for each subpopulation (the human and chimpanzee groups for the human–chimpanzee comparison, and chimpanzee subpopulations for the comparison of chimpanzees), averaging across subpopulations to obtain the frequency M of the most frequent allele. We then computed FST and the associated upper bound for each locus, averaging across loci to obtain the overall $\bar{F}_{ST}$ and $\bar{F}_{ST}/\bar{F}_{\max}$ for the full microsatellite set (equations (5.1) and (5.2)).
Surprisingly, given the longer evolutionary time between humans and chimpanzees than among chimpanzee populations, the FST value is significantly greater when comparing chimpanzee populations () than when comparing humans and chimpanzees (; p = 4.2 × 10⁻¹⁴, Wilcoxon rank sum test). The explanation for this result can be found in the properties of the upper bound on FST given M.
Values of M are similar in the two comparisons (figure 4a,b). However, K differs, equaling 2 for the human–chimpanzee comparison and 6 for the comparison of chimpanzee subpopulations. Because the theoretical range of FST is seen to be smaller for FST values computed among smaller sets of subpopulations than among larger sets (figure 1), the FST values among chimpanzees possess a larger range. For example, the maximal FST at the mean M of 0.27 observed in pairwise comparisons is 0.34 for K = 2 (red segment in figure 4a), whereas the maximal FST at the mean M of 0.36 observed for six chimpanzee populations is 0.93 for K = 6 (figure 4b). Given the stronger constraint in pairwise calculations than in calculations with more subpopulations, it is not unexpected that pairwise FST values would be smaller than those in a 6-region computation. A high FST among chimpanzees compared to between humans and chimpanzees is a by-product of mathematical constraints on FST.
Interestingly, the effect of K on FST is largely eliminated when each FST value is normalized by the associated maximum given K and M (figure 4c). The normalization leads to higher values for human–chimpanzee comparisons than among chimpanzee subpopulations ( and 0.20, respectively; p = 1.1 × 10⁻⁹, Wilcoxon rank sum test), as expected from the greater evolutionary distance between humans and chimpanzees compared to that among chimpanzees.
6. Discussion
We have analysed the range of values that FST can take as a function of the frequency M of the most frequent allele at a multiallelic locus, for an arbitrary value of the number of subpopulations K. We showed that FST can span the full unit interval only for a finite set of values of M, at M = k/K for integers k in [1, K − 1]. For all other M, FST necessarily lies below 1. Increasing the number of subpopulations K enlarges the range of values that FST can take.
This study provides the most complete relationship between FST and M obtained to date, generalizing previous results for the case of K = 2 subpopulations [12] and for a restriction to I = 2 alleles [14]. Interestingly, the maximal FST we have obtained merges patterns observed in these previous studies. Fixing K = 2, we obtain the upper bound on FST in terms of M that was reported by Jakobsson et al. [12]. As K increases, the piecewise pattern seen by Jakobsson et al. [12] for the maximal FST in the K = 2 case for M in (0, 1/2) is observed in the multiallelic case for M in (0, 1/K). The decay from (M, FST) = (1/2, 1) to (M, FST) = (1, 0) seen by Jakobsson et al. [12] for K = 2 is observed in the decay from ((K − 1)/K, 1) to (1, 0) for arbitrary K.
The allele frequency values for which the upper bound is reached for M in (0, 1/K) generalize those seen for the case of K = 2 and M in (0, 1/2) [12]. The upper bound is reached when all alleles are private, each subpopulation has as many alleles as possible at frequency KM, and at most one additional allele. The allele frequency values for which the upper bound is reached for M in ((K − 1)/K, 1) also generalize those seen for K = 2 and M in (1/2, 1): the maximum is reached when the most frequent allele is fixed in all subpopulations except one, and a single private allele is present in this last subpopulation.
The results from Alcala & Rosenberg [14] for I = 2 produce a more constrained upper bound on FST than for arbitrary I, with the domain of M restricted to [1/2, 1). Nevertheless, many properties of the maximal FST we observe for unspecified I and M in (1/K, 1) are similar to those seen for I = 2 and M in [1/2, 1): finitely many peaks at points M = k/K, local minima between the peaks, and an increase in coverage of the unit square for (M, FST) as K increases. The maximal FST functions for M in ((K − 1)/K, 1) for unspecified I and for I = 2 agree, as the number of alleles required to maximize FST in this interval in the case of unspecified I is simply equal to 2.
In assuming that the number of alleles is unspecified, we found that the number of distinct alleles needed for achieving the maximal FST is K⌈1/(KM)⌉ for M in (0, 1/K) and K − ⌊KM⌋ + 1 for M in (1/K, 1) with non-integer KM; the maximum can be achieved with each number of distinct alleles in [⌈1/M⌉, K − KM + 1] for M equal to 1/K, 2/K, …, (K − 1)/K. With a fixed maximal number of distinct alleles, such as in the I = 2 case of Alcala & Rosenberg [14] with K specified and in the K = 2 case with I specified [13], the upper bound on FST is less than or equal to that seen in the corresponding unspecified-I case. For K = 2, specifying I has a relatively small effect in reducing the maximal value of FST [13]. As in Edge & Rosenberg [13], specifying I in the case of larger values of K is expected to have the greatest impact on the FST upper bound at the lowest end of the domain for M.
In coalescent simulations, we found that the joint distribution of M and FST within their permissible space can help separate the impact of mutation and migration. Although the dependence of FST on mutation and migration rates has been long documented, the symmetric effects of mutation and migration under the island model [22] illustrate the difficulty in separating their effects. Under the island model, allele frequency M is informative about the scaled mutation rate 4Nμ, and comparing the value of FST to its maximum given M is informative about the scaled migration rate 4Nm. Adding a dimension that is more sensitive to mutation than to migration—M in our case—enables the separation of their effects. Other statistics, such as total heterozygosity HT or within-subpopulation heterozygosity HS, have the potential to play a similar role [20].
Our results can inform data analyses. In particular, we caution users to examine upper bounds on FST to assess how mathematical constraints influence observations. As the constraints are strongest for K = 2, this step is valuable in pairwise comparisons; it is also useful when the frequency M of the most frequent allele can be small in relation to the number of populations K, such as for high-diversity forensic [28] and immunological [29] loci in human populations. Visual inspection of the values of M and FST within their bounds can suggest that constraints have an effect. The ratio $\bar{F}_{ST}/\bar{F}_{\max}$ of the mean FST to the mean of its upper bounds across loci can provide a helpful summary by evaluating the proximity of FST values to their maxima.
Further, joint use of M along with FST could be useful in various applications of FST, such as in inference of model parameters by approximate Bayesian computation [30] and machine learning [31]. FST outlier tests to detect local adaptation from multiallelic loci [32] could search for FST values that represent outliers not in the distribution of FST values, but rather, outliers in relation to associated upper bounds. Computing null distributions for FST conditional on M could enhance the approach.
In an example data analysis, we have shown that taking into account mathematical constraints on FST can help understand puzzling FST behaviour. In our example, FST at a set of loci was higher when comparing K = 6 chimpanzee populations than when comparing humans and chimpanzees (K = 2), even though the same loci were used and the mean value for M was similar in the two comparisons. A comparison of FST values to their respective maxima explained these counterintuitive results.
We note that analyses of FST in relation to M differ from analyses of FST in relation to within-subpopulation statistics HS and JS = 1 − HS, such as those performed in deriving the influential Hedrick’s G′ST [9] and Jost’s D [33] statistics. We have previously shown that for biallelic loci in K subpopulations, for fixed M, the statistics FST, G′ST and D are all maximized at the same set of allele frequency values [15]. Although the normalizations of FST used to produce G′ST and D lead to statistics that are unconstrained in the unit interval as functions of HS, G′ST and D continue to be constrained as functions of M. A statistic that instead normalizes FST by its maximum as a function of M, a statistic of the total population, captures aspects of the allele frequency dependence of FST that differ from those captured by normalizations by functions of within-subpopulation statistics.
In human populations, efforts to understand FST patterns trace in large part to Lewontin’s foundational FST-like variance-partitioning computation [34], in which it was seen that among-population differences (analogous to FST) were small relative to within-population differences (analogous to 1 − FST). Studies using loci with different numbers of alleles, loci with different frequencies for the most frequent allele, and samples with different numbers of subpopulations have varied to some extent in their numerical estimates of FST [14,35–38]. Mathematical results on FST bounds provide part of the explanation for these differences: they establish that each dataset differing in the character of its loci and subpopulation set has its own distinctive interval in which its associated FST calculation could potentially land. Hence, each dataset can give rise to a numerically distinct value not due to features of the underlying human biology, but rather, due to different constraints on the FST measure itself. FST bounds contribute to explaining quantitative variation in variance-partitioning computations—in which, although numerical values differ, the within-population component of genetic variation consistently predominates. The mathematics serves to support the qualitative claim that worldwide human genetic differentiation measurements represented by FST-like statistics have low values—as was argued by Lewontin 50 years ago.
Acknowledgements
We thank Kent Holsinger for comments on the manuscript and Maike Morrison for helpful conversations.
Appendix A. Proof of equation (3.1)
This appendix derives the upper bound on FST as a function of σ1 (equation 3.1). First, we separate the case of integer values of σ1. Next, for non-integer values of σ1, we reduce the problem of maximizing FST to the problem of maximizing the sum of squared allele frequencies across alleles and subpopulations, $S = \sum_{k=1}^{K} \sum_{i=1}^{\infty} p_{k,i}^2$. Finally, we maximize S as a function of σ1, separately for σ1 in (0, 1) and for non-integer σ1 in (1, K).
(a) A useful expression for FST
Suppose K ≥ 2 is a specified integer. Suppose σ1 is a fixed value, with 0 < σ1 < K. We leave the number of alleles I unspecified. For each i ≥ 1, we write $\sigma_i = \sum_{k=1}^{K} p_{k,i}$, with σi ≥ σj for each i and j with i ≤ j. For convenience, σ1 is taken to mean both the function that computes the sum for a specified set of values of the pk,i and a fixed value for that sum.
For each (k, i) with 1 ≤ k ≤ K and i ≥ 1, pk,i lies in [0, 1], and $\sum_{i=1}^{\infty} p_{k,i} = 1$ for all k, 1 ≤ k ≤ K. Define FST as in equation (2.1). We seek to maximize FST over all possible sets of values of the pk,i with a fixed value σ1 for the sum $\sum_{k=1}^{K} p_{k,1}$. Note that because σ1 < K and $\sum_{i=1}^{\infty} \sigma_i = K$, it follows that σ2 > 0.
Denote the sum of squared frequencies of allele 1 across subpopulations, $\sum_{k=1}^{K} p_{k,1}^2$, by S1. Write $S = \sum_{k=1}^{K} \sum_{i=1}^{\infty} p_{k,i}^2$ for the corresponding sum of squared frequencies of all alleles. We express equation (2.1) in terms of σ1, S1 and S:
$$F_{ST} = \frac{(K-1)S + S_1 - \sigma_1^2 - 2\sum_{i=2}^{\infty}\sum_{1 \le k < k' \le K} p_{k,i}\,p_{k',i}}{K^2 - \sigma_1^2 - S + S_1 - 2\sum_{i=2}^{\infty}\sum_{1 \le k < k' \le K} p_{k,i}\,p_{k',i}}. \tag{A 1}$$
By construction of equation (2.1), the denominator of equation (A 1) lies in (0, K²), as 0 < HT < 1 from the fact that σ2 > 0. The numerator lies in [0, K²), as 0 ≤ HS ≤ HT < 1, so that 0 ≤ HT − HS < 1. FST lies in [0, 1], as 0 ≤ HS and 0 < HT imply 0 ≤ (HT − HS)/HT ≤ 1.
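As an illustrative check, the following Python sketch computes FST both from heterozygosities and from an expression in σ1, S1 and S, and confirms that the two agree on a small example. It assumes that equation (2.1) is FST = (HT − HS)/HT with equally weighted subpopulations, HS equal to one minus the mean within-subpopulation homozygosity, and HT equal to one minus the homozygosity of the mean allele frequencies. The function names and the example matrix are hypothetical and are not taken from the article.

```python
from math import isclose

def fst_direct(p):
    """F_ST = (H_T - H_S) / H_T for a K x I matrix p whose rows sum to 1."""
    K = len(p)
    n_alleles = len(p[0])
    # H_S: one minus the mean within-subpopulation homozygosity (assumed definition).
    h_s = 1 - sum(x * x for row in p for x in row) / K
    # H_T: one minus the homozygosity of the mean allele frequencies (assumed definition).
    mean_freq = [sum(row[i] for row in p) / K for i in range(n_alleles)]
    h_t = 1 - sum(m * m for m in mean_freq)
    return (h_t - h_s) / h_t

def fst_from_sums(p):
    """The same quantity written in terms of sigma_1, S_1, S and cross products."""
    K = len(p)
    n_alleles = len(p[0])
    sigma1 = sum(row[0] for row in p)            # summed frequency of allele 1
    s1 = sum(row[0] ** 2 for row in p)           # S_1
    s = sum(x * x for row in p for x in row)     # S
    # Sum over alleles i >= 2 and pairs k < k' of p_{k,i} * p_{k',i}.
    cross = sum(p[k][i] * p[kk][i]
                for i in range(1, n_alleles)
                for k in range(K)
                for kk in range(k + 1, K))
    numerator = (K - 1) * s + s1 - sigma1 ** 2 - 2 * cross
    denominator = K * K - sigma1 ** 2 - s + s1 - 2 * cross
    return numerator / denominator

# Hypothetical example: K = 3 subpopulations, 3 alleles, allele 1 in column 0.
example = [[0.5, 0.3, 0.2],
           [0.6, 0.1, 0.3],
           [0.7, 0.2, 0.1]]
print(fst_direct(example), fst_from_sums(example))
```

Both calls evaluate the same ratio, K²(HT − HS) over K²HT, so any disagreement beyond rounding error would indicate a mistake in the algebra.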
(b) The case of integer values of σ1
In equation (A 1), the numerator is less than or equal to the denominator, with equality if and only if S = K. This equality in turn requires that for each k, there exists some i for which pk,i = 1, a condition that can be achieved only if σ1 is an integer.
Theorem A.1. Suppose σ1 is an integer value, 1, 2, …, K − 1. FST = 1 if and only if (i) pk,1 = 1 in each of σ1 subpopulations, and (ii) for each of the K − σ1 remaining subpopulations, there exists a value of i ≥ 2 with pk,i = 1.
Proof. FST = 1 if and only if S = K, and S = K if and only if for each k, there exists an associated i with pk,i = 1. For a fixed integer value of σ1, pk,1 = 1 in exactly σ1 subpopulations. ▪
Note that any set of equivalence relationships can exist among the values of i associated with the K − σ1 subpopulations in which pk,1 = 0, provided that none of these values of i is associated with more than σ1 subpopulations. For example, these values of i can be mutually distinct, or groups of them with size as large as σ1 can be mutually equal.
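The integer case can be illustrated with a short Python sketch, given here under the assumption that FST = (HT − HS)/HT with equally weighted subpopulations. It builds two configurations with K = 5 and σ1 = 2 in which every subpopulation is fixed for one allele, one with mutually distinct alleles in the remaining subpopulations and one in which two of them share an allele. The helper names are hypothetical.

```python
from math import isclose

def fst(p):
    """F_ST = (H_T - H_S) / H_T for a K x I matrix p whose rows sum to 1."""
    K = len(p)
    n_alleles = len(p[0])
    h_s = 1 - sum(x * x for row in p for x in row) / K
    mean_freq = [sum(row[i] for row in p) / K for i in range(n_alleles)]
    h_t = 1 - sum(m * m for m in mean_freq)
    return (h_t - h_s) / h_t

def fixed_configuration(alleles):
    """Subpopulation k is fixed for allele alleles[k] (0-based; index 0 is allele 1)."""
    n_alleles = max(alleles) + 1
    return [[1.0 if i == a else 0.0 for i in range(n_alleles)] for a in alleles]

# K = 5, sigma_1 = 2: two subpopulations fixed for allele 1.
# The other three are fixed for alleles each used by at most sigma_1 = 2 subpopulations.
distinct = fixed_configuration([0, 0, 1, 2, 3])   # mutually distinct alleles
grouped = fixed_configuration([0, 0, 1, 1, 2])    # one group of size 2

# A comparison case in which the last subpopulation is polymorphic, so that S < K.
polymorphic = [row[:] for row in distinct]
polymorphic[4] = [0.0, 0.0, 0.5, 0.5]

print(fst(distinct), fst(grouped), fst(polymorphic))
```

In both fixed configurations S = K, so HS = 0 and the ratio equals 1, whereas the polymorphic comparison case falls below 1.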
(c) Non-integer values of σ1
For non-integer σ1, the numerator of equation (A 1) is strictly less than the denominator. Hence, if the other quantities in equation (A 1) are fixed, then FST decreases with increasing $\sum_{i=2}^{\infty}\sum_{1 \le k < k' \le K} p_{k,i}\,p_{k',i}$. We have the following theorem.
Theorem A.2. Suppose σ1 is not an integer. FST satisfies
$$F_{ST} \le \frac{(K-1)S + S_1 - \sigma_1^2}{K^2 - \sigma_1^2 - S + S_1}, \tag{A 2}$$
with equality requiring that for each i ≥ 2, there exists at most one value of k for which pk,i > 0.
Proof. Because $2\sum_{i=2}^{\infty}\sum_{1 \le k < k' \le K} p_{k,i}\,p_{k',i}$ is subtracted in both the numerator and the denominator of equation (A 1), and because the numerator is strictly less than the denominator for non-integer σ1, FST can be bounded above by minimizing this term. Because pk,i ≥ 0 for all (k, i), each sum $\sum_{1 \le k < k' \le K} p_{k,i}\,p_{k',i}$ is bounded below by zero. Setting the sum to 0 for all i ≥ 2 gives the upper bound in equation (A 2).
For the equality condition, $\sum_{i=2}^{\infty}\sum_{1 \le k < k' \le K} p_{k,i}\,p_{k',i} = 0$ if and only if all products $p_{k,i}\,p_{k',i}$ with i ≥ 2 and k ≠ k′ are zero, that is, if and only if for each i ≥ 2, at most one value of k has pk,i > 0. ▪
By theorem A.2, to maximize FST for fixed non-integer σ1, we must maximize the quantity in equation (A 2). It suffices to consider sets of values of pk,i in which for each i ≥ 2, at most one value of k has pk,i > 0.
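A numerical sanity check of the bound in theorem A.2 can be sketched in Python as follows. The sketch assumes that FST = (HT − HS)/HT with equally weighted subpopulations and that the bound takes the form ((K − 1)S + S1 − σ1²)/(K² − σ1² − S + S1), obtained by dropping the cross-product term for alleles i ≥ 2. The random sampling scheme, seed and function names are illustrative choices only.

```python
import random
from math import isclose

def fst(p):
    """F_ST = (H_T - H_S) / H_T for a K x I matrix p whose rows sum to 1."""
    K = len(p)
    n_alleles = len(p[0])
    h_s = 1 - sum(x * x for row in p for x in row) / K
    mean_freq = [sum(row[i] for row in p) / K for i in range(n_alleles)]
    h_t = 1 - sum(m * m for m in mean_freq)
    return (h_t - h_s) / h_t

def bound_a2(p):
    """Upper bound obtained by setting the cross products for alleles i >= 2 to zero."""
    K = len(p)
    sigma1 = sum(row[0] for row in p)
    s1 = sum(row[0] ** 2 for row in p)
    s = sum(x * x for row in p for x in row)
    return ((K - 1) * s + s1 - sigma1 ** 2) / (K * K - sigma1 ** 2 - s + s1)

def random_frequencies(K, n_alleles, rng):
    """Random frequency matrix, relabelled so that column 0 has the largest sum."""
    rows = []
    for _ in range(K):
        raw = [rng.random() for _ in range(n_alleles)]
        total = sum(raw)
        rows.append([x / total for x in raw])
    sigmas = [sum(row[i] for row in rows) for i in range(n_alleles)]
    order = sorted(range(n_alleles), key=lambda i: -sigmas[i])
    return [[row[i] for i in order] for row in rows]

def largest_violation(n_trials=1000, seed=0):
    """Largest value of F_ST minus the bound over random matrices (expected <= 0)."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(n_trials):
        K = rng.randint(2, 6)
        n_alleles = rng.randint(2, 6)
        p = random_frequencies(K, n_alleles, rng)
        worst = max(worst, fst(p) - bound_a2(p))
    return worst

# Equality case: alleles 2 and 3 are each private to a single subpopulation.
private_case = [[0.5, 0.5, 0.0],
                [0.7, 0.0, 0.3]]
print(largest_violation(), fst(private_case), bound_a2(private_case))
```

Random matrices with shared alleles should fall strictly below the bound, whereas the example with private alleles for i ≥ 2 attains it.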
(d) The case of (non-integer) σ1 in (0, 1)
In this section, we find the set of values of the pk,i that maximize FST for σ1 in (0, 1). We proceed in two steps. (i) We show that for σ1 in (0, 1), the maximal FST occurs at a set of pk,i values for which all alleles are private: that is, for each i ≥ 1, pk,i > 0 for at most one value of k. (ii) We determine the set of pk,i values that, with all alleles private, maximizes FST.
(i) In equation (A 2), note that $\sigma_1^2 - S_1 = 2\sum_{1 \le k < k' \le K} p_{k,1}\,p_{k',1} \ge 0$. Because $\sigma_1^2 - S_1$ is subtracted from both numerator and denominator in equation (A 2), the quantity in equation (A 2) is maximal when $\sigma_1^2 - S_1$ is minimal. In other words, the upper bound on FST is maximal if and only if $\sigma_1^2 - S_1$ is minimal.
Because σ1 < 1, a minimum of 0 for $\sigma_1^2 - S_1$ is achieved if and only if there is a single value k = k′ at which pk′,1 = σ1, so that pk,1 = 0 for all k ≠ k′. We then have $S_1 = \sigma_1^2$, and from equation (A 2),
$$F_{ST} \le \frac{(K-1)S}{K^2 - S}. \tag{A 3}$$
Each allele is private, and because allele 1 is the most frequent, pk,i lies in [0, σ1] for all (k, i).
(ii) The problem of finding the set of pk,i values that maximizes FST has now been reduced to the problem of maximizing the right-hand side of equation (A 3), with the constraint that all alleles are private. Because the numerator in equation (A 3) increases with S and the denominator decreases with S, the maximum is achieved if and only if S achieves its maximal value. In other words, we seek to maximize $S = \sum_{k=1}^{K}\sum_{i=1}^{\infty} p_{k,i}^2$, with the constraints $\sum_{i=1}^{\infty} p_{k,i} = 1$ and pk,i ≤ σ1 for each (k, i) with 1 ≤ k ≤ K and i ≥ 1. Because each allele is private, the maximum is achieved by separately maximizing each $\sum_{i=1}^{\infty} p_{k,i}^2$ with constraints $\sum_{i=1}^{\infty} p_{k,i} = 1$ and pk,i ≤ σ1.
This maximization is precisely that of lemma 3 of Rosenberg & Jakobsson [40]. Applying the lemma, the maximum is achieved with pk,1 = pk,2 = … = pk,J−1 = σ1, pk,J = 1 − (J − 1)σ1, and pk,i = 0 for i > J, where $J = \lceil \sigma_1^{-1} \rceil$. It satisfies $\sum_{i=1}^{\infty} p_{k,i}^2 \le 1 - \sigma_1(J-1)(2 - J\sigma_1)$. In other words, each subpopulation k possesses J − 1 private alleles with frequency σ1 and one private allele with frequency 1 − (J − 1)σ1. Hence, S ≤ K[1 − σ1(J − 1)(2 − Jσ1)], so that equation (A 3) leads to equation (3.1) for σ1 in (0, 1).
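The construction for σ1 in (0, 1) can be sketched in Python as below. The sketch assumes FST = (HT − HS)/HT with equally weighted subpopulations, and it evaluates the bound as (K − 1)S/(K² − S) with S = K[1 − σ1(J − 1)(2 − Jσ1)] and J taken as the ceiling of 1/σ1. It also builds the configuration described in the text, in which every subpopulation carries J − 1 private alleles of frequency σ1 and one private allele of frequency 1 − (J − 1)σ1. The function names and the example values K = 3, σ1 = 0.3 are hypothetical.

```python
from math import ceil, isclose

def fst(p):
    """F_ST = (H_T - H_S) / H_T for a K x I matrix p whose rows sum to 1."""
    K = len(p)
    n_alleles = len(p[0])
    h_s = 1 - sum(x * x for row in p for x in row) / K
    mean_freq = [sum(row[i] for row in p) / K for i in range(n_alleles)]
    h_t = 1 - sum(m * m for m in mean_freq)
    return (h_t - h_s) / h_t

def fst_max_small_sigma(K, sigma1):
    """Bound (K - 1) S / (K^2 - S) at the maximal S, for sigma1 in (0, 1)."""
    J = ceil(1 / sigma1)
    # Maximal within-subpopulation homozygosity under p_{k,i} <= sigma1.
    homozygosity = 1 - sigma1 * (J - 1) * (2 - J * sigma1)
    S = K * homozygosity
    return (K - 1) * S / (K * K - S)

def maximizing_configuration(K, sigma1):
    """Each subpopulation gets its own block of J private alleles."""
    J = ceil(1 / sigma1)
    block = [sigma1] * (J - 1) + [1 - (J - 1) * sigma1]
    p = [[0.0] * (K * J) for _ in range(K)]
    for k in range(K):
        p[k][k * J:(k + 1) * J] = block   # columns k*J .. k*J + J - 1 are private to k
    return p

K, sigma1 = 3, 0.3
config = maximizing_configuration(K, sigma1)
print(fst(config), fst_max_small_sigma(K, sigma1))
```

With K = 3 and σ1 = 0.3, J = 4 and each subpopulation has homozygosity 3(0.09) + 0.01 = 0.28, so both printed values should equal 0.56/2.72 up to rounding.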
(e) The case of non-integer σ1 in (1, K)
This section finds the set of values of the pk,i that maximizes FST for non-integer σ1 in (1, K). For non-integer σ1 in (1, K), because 0 ≤ pk,1 ≤ 1 for all k, pk,1 > 0 for at least two values of k. Writing S* = S − S1, equation (A 2) can be rewritten
$$F_{ST} \le \frac{K S_1 + (K-1)S^* - \sigma_1^2}{K^2 - \sigma_1^2 - S^*}. \tag{A 4}$$
Because the numerator increases with S1, and because the numerator increases with S* and the denominator decreases with S*, the upper bound on FST is greatest when both S1 and S* are maximized subject to $\sum_{i=1}^{\infty} p_{k,i} = 1$ for each k and $\sigma_i \le \sigma_1$ for each i. If S1 and S* can be simultaneously maximized at the same set of values of the pk,i, then this set of values of the pk,i achieves the maximal FST.
We proceed in three steps. (i) First, we find the set of values of the pk,i that maximizes S1. (ii) Next, we find the set of values that maximizes S*. (iii) We then conclude that because the same set maximizes both S1 and S* separately, this set achieves the upper bound in equation (A 4), and hence in equation (A 2).
(i) We first maximize S1 for fixed non-integer σ1 in (1, K). More precisely, we seek to maximize $S_1 = \sum_{k=1}^{K} p_{k,1}^2$ with constraints $\sum_{k=1}^{K} p_{k,1} = \sigma_1$ and pk,1 ≤ 1 for each k from 1 to K. This maximization is precisely that performed in theorem 1 from Alcala & Rosenberg [14], a corollary of lemma 3 of Rosenberg & Jakobsson [40]. Applying the theorem, the maximum is achieved by setting $p_{k,1} = 1$ for $1 \le k \le \lfloor\sigma_1\rfloor$, $p_{\lfloor\sigma_1\rfloor+1,1} = \{\sigma_1\}$, and pk,1 = 0 for all $k > \lfloor\sigma_1\rfloor + 1$, where $\{\sigma_1\} = \sigma_1 - \lfloor\sigma_1\rfloor$ is the fractional part of σ1. The maximal value of S1 is $\lfloor\sigma_1\rfloor + \{\sigma_1\}^2$.
(ii) Next, we maximize $S^* = \sum_{i=2}^{\infty}\sum_{k=1}^{K} p_{k,i}^2$. Because, by theorem A.2, all alleles with i ≥ 2 are private at the set of values of the pk,i that maximizes FST for fixed non-integer σ1, each non-zero pk,i for i ≥ 2 is equal to the associated σi. The sum of the frequencies of all alleles across all subpopulations is $\sum_{i=1}^{\infty}\sigma_i = K$, so that $\sum_{i=2}^{\infty}\sigma_i = K - \sigma_1$. The problem of maximizing S* is the problem of maximizing $\sum_{i=2}^{\infty}\sigma_i^2$ with the constraints $\sum_{i=2}^{\infty}\sigma_i = K - \sigma_1$ and σi ≤ 1 for each i from 2 to ∞. This maximization is again that performed in lemma 3 of Rosenberg & Jakobsson [40]. Applying the lemma, the maximum is achieved by setting $\sigma_i = 1$ for $2 \le i \le K - \lfloor\sigma_1\rfloor$, $\sigma_{K-\lfloor\sigma_1\rfloor+1} = 1 - \{\sigma_1\}$, and σi = 0 for $i > K - \lfloor\sigma_1\rfloor + 1$. The maximum is $K - \lfloor\sigma_1\rfloor - 1 + (1 - \{\sigma_1\})^2$.
(iii) S1 is maximized at a set of pk,i for which ⌊σ1⌋ subpopulations are fixed for allele 1, allele 1 has frequency {σ1} in one subpopulation and allele 1 has frequency 0 in all other subpopulations. S* is maximized at a set of pk,i for which K − ⌊σ1⌋ − 1 subpopulations are fixed, each for a distinct allele i with i ≥ 2, one subpopulation possesses a distinct allele i ≥ 2 with frequency 1 − {σ1}, and all other subpopulations possess no alleles i ≥ 2 of non-zero frequency.
The upper bound in equation (A 4) depends on both S1 and S*, each of which depends on the pk,i. Were the set of values of the pk,i that maximizes S1 and the set of values of the pk,i that maximizes S* to differ, additional work would be required to find the set of values of the pk,i that maximizes FST. However, we now observe that S1 and S* can be simultaneously maximized at the same set of values of pk,i, so that the same set of values of the pk,i maximizes S1 and S* and hence FST. In particular, ⌊σ1⌋ subpopulations are fixed for allele 1, each of K − ⌊σ1⌋ − 1 subpopulations is fixed for its own private allele, and a single subpopulation possesses allele 1 with frequency {σ1} and a private allele with frequency 1 − {σ1}. The number of alleles of non-zero frequency is K − ⌊σ1⌋ + 1. Only the most frequent allele is shared by more than one subpopulation, and a single subpopulation possesses more than one allele of non-zero frequency.
Substituting the maximal values of S1 and S* into equation (A 4), for non-integer σ1 in (1, K), we obtain the maximal FST in terms of σ1 shown in equation (3.1).
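The substitution for non-integer σ1 in (1, K) can be sketched in Python as follows, again assuming FST = (HT − HS)/HT with equally weighted subpopulations. The sketch takes the maximal values S1 = ⌊σ1⌋ + {σ1}² and S* = K − ⌊σ1⌋ − 1 + (1 − {σ1})², inserts them into a bound of the form (K S1 + (K − 1)S* − σ1²)/(K² − σ1² − S*), and compares the result with FST computed directly from the maximizing configuration described in the text. The function names and the example values K = 4, σ1 = 2.5 are hypothetical.

```python
from math import floor, isclose

def fst(p):
    """F_ST = (H_T - H_S) / H_T for a K x I matrix p whose rows sum to 1."""
    K = len(p)
    n_alleles = len(p[0])
    h_s = 1 - sum(x * x for row in p for x in row) / K
    mean_freq = [sum(row[i] for row in p) / K for i in range(n_alleles)]
    h_t = 1 - sum(m * m for m in mean_freq)
    return (h_t - h_s) / h_t

def fst_max_large_sigma(K, sigma1):
    """Bound at the maximal S_1 and S*, for non-integer sigma1 in (1, K)."""
    whole = floor(sigma1)
    frac = sigma1 - whole                        # fractional part {sigma1}
    s1 = whole + frac ** 2                       # maximal S_1
    s_star = K - whole - 1 + (1 - frac) ** 2     # maximal S*
    return (K * s1 + (K - 1) * s_star - sigma1 ** 2) / (K * K - sigma1 ** 2 - s_star)

def maximizing_configuration(K, sigma1):
    """Configuration that maximizes S_1 and S* at the same time."""
    whole = floor(sigma1)
    frac = sigma1 - whole
    n_alleles = K - whole + 1
    p = [[0.0] * n_alleles for _ in range(K)]
    for k in range(whole):                       # subpopulations fixed for allele 1
        p[k][0] = 1.0
    p[whole][0] = frac                           # the single mixed subpopulation
    p[whole][1] = 1 - frac                       # its private allele
    for k in range(whole + 1, K):                # each fixed for its own private allele
        p[k][k - whole + 1] = 1.0
    return p

K, sigma1 = 4, 2.5
config = maximizing_configuration(K, sigma1)
print(fst(config), fst_max_large_sigma(K, sigma1))
```

For K = 4 and σ1 = 2.5, the maximal values are S1 = 2.25 and S* = 1.25, so both printed values should equal 6.5/8.5 up to rounding.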
Data accessibility
Data are publicly available as described in the references cited. MS commands are provided in electronic supplementary material [39].
Authors' contributions
N.A. and N.A.R. designed the study. N.A. analysed the data. N.A. and N.A.R. wrote the manuscript.
Both authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Competing interests
Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer/World Health Organization.
Funding
Support was provided by NIH grant no. R01 HG005855, NSF grant no. BCS-2116322 and a France-Stanford Center for Interdisciplinary Studies grant.
References
- 1. Nei M. 1973. Analysis of gene diversity in subdivided populations. Proc. Natl Acad. Sci. USA 70, 3321–3323. (doi:10.1073/pnas.70.12.3321)
- 2. Holsinger KE, Weir BS. 2009. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat. Rev. Genet. 10, 639–650. (doi:10.1038/nrg2611)
- 3. Jin L, Chakraborty R. 1995. Population structure, stepwise mutations, heterozygote deficiency and their implications in DNA forensics. Heredity 74, 274–285. (doi:10.1038/hdy.1995.41)
- 4. Charlesworth B. 1998. Measures of divergence between populations and the effect of forces that reduce variability. Mol. Biol. Evol. 15, 538–543. (doi:10.1093/oxfordjournals.molbev.a025953)
- 5. Nagylaki T. 1998. Fixation indices in subdivided populations. Genetics 148, 1325–1332. (doi:10.1093/genetics/148.3.1325)
- 6. Hedrick PW. 1999. Highly variable loci and their interpretation in evolution and conservation. Evolution 53, 313–318. (doi:10.1111/j.1558-5646.1999.tb03767.x)
- 7. Balloux F, Brünner H, Lugon-Moulin N, Hausser J, Goudet J. 2000. Microsatellites can be misleading: an empirical and simulation study. Evolution 54, 1414–1422. (doi:10.1111/j.0014-3820.2000.tb00573.x)
- 8. Long JC, Kittles RA. 2003. Human genetic diversity and the nonexistence of biological races. Hum. Biol. 75, 449–471. (doi:10.1353/hub.2003.0058)
- 9. Hedrick PW. 2005. A standardized genetic differentiation measure. Evolution 59, 1633–1638. (doi:10.1111/j.0014-3820.2005.tb01814.x)
- 10. Maruki T, Kumar S, Kim Y. 2012. Purifying selection modulates the estimates of population differentiation and confounds genome-wide comparisons across single-nucleotide polymorphisms. Mol. Biol. Evol. 29, 3617–3623. (doi:10.1093/molbev/mss187)
- 11. Rosenberg NA, Li LM, Ward R, Pritchard JK. 2003. Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73, 1402–1422. (doi:10.1086/380416)
- 12. Jakobsson M, Edge MD, Rosenberg NA. 2013. The relationship between FST and the frequency of the most frequent allele. Genetics 193, 515–528. (doi:10.1534/genetics.112.144758)
- 13. Edge MD, Rosenberg NA. 2014. Upper bounds on FST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theor. Popul. Biol. 97, 20–34. (doi:10.1016/j.tpb.2014.08.001)
- 14. Alcala N, Rosenberg NA. 2017. Mathematical constraints on FST: biallelic markers in arbitrarily many populations. Genetics 206, 1581–1600. (doi:10.1534/genetics.116.199141)
- 15. Alcala N, Rosenberg NA. 2019. GST′, Jost’s D, and FST are similarly constrained by allele frequencies: a mathematical, simulation, and empirical study. Mol. Ecol. 28, 1624–1636. (doi:10.1111/mec.15000)
- 16. Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370. (doi:10.1111/j.1558-5646.1984.tb05657.x)
- 17. Whitlock MC. 2011. G′ST and D do not replace FST. Mol. Ecol. 20, 1083–1091. (doi:10.1111/j.1365-294X.2010.04996.x)
- 18. Rousset F. 2013. Exegeses on maximum genetic differentiation. Genetics 194, 557–559. (doi:10.1534/genetics.113.152132)
- 19. Alcala N, Goudet J, Vuilleumier S. 2014. On the transition of genetic differentiation from isolation to panmixia: what we can learn from GST and D. Theor. Popul. Biol. 93, 75–84. (doi:10.1016/j.tpb.2014.02.003)
- 20. Wang J. 2015. Does GST underestimate genetic differentiation from marker data? Mol. Ecol. 24, 3546–3558. (doi:10.1111/mec.13204)
- 21. Hudson RR. 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18, 337–338. (doi:10.1093/bioinformatics/18.2.337)
- 22. Maruyama T. 1970. Effective number of alleles in a subdivided population. Theor. Popul. Biol. 1, 273–306. (doi:10.1016/0040-5809(70)90047-X)
- 23. Wakeley J. 1998. Segregating sites in Wright’s island model. Theor. Popul. Biol. 53, 166–174. (doi:10.1006/tpbi.1997.1355)
- 24. Fu R, Gelfand AE, Holsinger KE. 2003. Exact moment calculations for genetic models with migration, mutation, and drift. Theor. Popul. Biol. 63, 231–243. (doi:10.1016/S0040-5809(03)00003-0)
- 25. Long JC. 2009. Update to Long and Kittles’s ‘Human genetic diversity and the nonexistence of biological races’ (2003): fixation on an index. Hum. Biol. 81, 799–803. (doi:10.3378/027.081.0622)
- 26. Pemberton TJ, DeGiorgio M, Rosenberg NA. 2013. Population structure in a comprehensive genomic data set on human microsatellite variation. G3: Genes Genom. Genet. 3, 891–907. (doi:10.1534/g3.113.005728)
- 27. Becquet C, Patterson N, Stone AC, Przeworski M, Reich D. 2007. Genetic structure of chimpanzee populations. PLoS Genet. 3, e66. (doi:10.1371/journal.pgen.0030066)
- 28. Algee-Hewitt BF, Edge MD, Kim J, Li JZ, Rosenberg NA. 2016. Individual identifiability predicts population identifiability in forensic microsatellite markers. Curr. Biol. 26, 935–942. (doi:10.1016/j.cub.2016.01.065)
- 29. Maróstica AS, Nunes K, Castelli EC, Silva NSB, Weir BS, Goudet J, Meyer D. 2022. How HLA diversity is apportioned: influence of selection and relevance to transplantation. Phil. Trans. R. Soc. B 377, 20200420. (doi:10.1098/rstb.2020.0420)
- 30. Beaumont MA. 2010. Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41, 379–406. (doi:10.1146/annurev-ecolsys-102209-144621)
- 31. Schrider DR, Kern AD. 2018. Supervised machine learning for population genetics: a new paradigm. Trends Genet. 34, 301–312. (doi:10.1016/j.tig.2017.12.005)
- 32. Hoban S, et al. 2016. Finding the genomic basis of local adaptation: pitfalls, practical solutions, and future directions. Am. Nat. 188, 379–397. (doi:10.1086/688018)
- 33. Jost L. 2008. GST and its relatives do not measure differentiation. Mol. Ecol. 17, 4015–4026. (doi:10.1111/j.1365-294X.2008.03887.x)
- 34. Lewontin RC. 1972. The apportionment of human diversity. Evol. Biol. 6, 381–398. (doi:10.1007/978-1-4684-9063-3_14)
- 35. Brown RA, Armelagos GJ. 2001. Apportionment of racial diversity: a review. Evol. Anthropol. 10, 34–40.
- 36. Ruvolo M, Seielstad MT. 2001. ‘The apportionment of human diversity’ 25 years later. In Thinking about evolution: historical, philosophical, and political perspectives (eds RS Singh, CS Krimbas, DB Paul, J Beatty), pp. 141–151. Cambridge, UK: Cambridge University Press.
- 37. Novembre J. 2022. The background and legacy of Lewontin’s apportionment of human genetic diversity. Phil. Trans. R. Soc. B 377, 20200406. (doi:10.1098/rstb.2020.0406)
- 38. Shen H, Feldman MW. 2022. Diversity and its causes: Lewontin on racism, biological determinism and the adaptationist programme. Phil. Trans. R. Soc. B 377, 20200417. (doi:10.1098/rstb.2020.0417)
- 39. Alcala N, Rosenberg NA. 2022. Mathematical constraints on FST: multiallelic markers in arbitrarily many populations. Figshare.
- 40. Rosenberg NA, Jakobsson M. 2008. The relationship between homozygosity and the frequency of the most frequent allele. Genetics 179, 2027–2036. (doi:10.1534/genetics.107.084772)