Mathematical constraints on FST: multiallelic markers in arbitrarily many populations

Nicolas Alcala; Noah A Rosenberg

doi:10.1098/rstb.2020.0414

2022 Apr 18;377(1852):20200414.

PMCID: PMC9014193 PMID: 35430885

Abstract

Interpretations of values of the F_ST measure of genetic differentiation rely on an understanding of its mathematical constraints. Previously, it has been shown that F_ST values computed from a biallelic locus in a set of multiple populations and F_ST values computed from a multiallelic locus in a pair of populations are mathematically constrained as a function of the frequency of the allele that is most frequent across populations. We generalize from these cases to report here the mathematical constraint on F_ST given the frequency M of the most frequent allele at a multiallelic locus in a set of multiple populations. Using coalescent simulations of an island model of migration with an infinitely-many-alleles mutation model, we argue that the joint distribution of F_ST and M helps in disentangling the separate influences of mutation and migration on F_ST. Finally, we show that our results explain a puzzling pattern of microsatellite differentiation: the lower F_ST in an interspecific comparison between humans and chimpanzees than in the comparison of chimpanzee populations. We discuss the implications of our results for the use of F_ST.

This article is part of the theme issue ‘Celebrating 50 years since Lewontin's apportionment of human diversity’.

Keywords: allele frequency, chimpanzee, genetic differentiation, migration, population structure

1. Introduction

Multiallelic loci such as microsatellites and haplotype assignments are used to study genetic differentiation in a variety of fields, ranging from ecology and conservation genetics to anthropology and human genomics. Genetic differentiation is often measured for multiallelic loci using the multiallelic extension of Wright’s fixation index F_ST [1]

$$F_{ST} = \frac{H_{T} - H_{S}}{H_{T}}. \tag{1.1}$$

For a polymorphic multiallelic locus with I distinct alleles in a set of K subpopulations, denoting by p_k,i the frequency of allele i in subpopulation k, $H_{S} = 1 - (1 / K) \sum_{k = 1}^{K} \sum_{i = 1}^{I} p_{k, i}^{2}$ and $H_{T} = 1 - \sum_{i = 1}^{I} {((1 / K) \sum_{k = 1}^{K} p_{k, i})}^{2}$ .
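As an illustration of equation (1.1) and the definitions of H_S and H_T, the following minimal Python sketch computes F_ST from a K × I matrix of allele frequencies, assuming equally weighted subpopulations as in the text. The function name `fst` and the list-of-lists input format are illustrative choices, not part of the original study.

```python
def fst(p):
    """F_ST from a K x I matrix p, where p[k][i] is the frequency of
    allele i in subpopulation k (each row is assumed to sum to 1)."""
    K = len(p)
    I = len(p[0])
    for row in p:
        if len(row) != I or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row needs I entries that sum to 1")
    # Mean within-subpopulation homozygosity, so that H_S = 1 - hom_s.
    hom_s = sum(x * x for row in p for x in row) / K
    # Homozygosity of the mean allele frequencies, so that H_T = 1 - hom_t.
    means = [sum(row[i] for row in p) / K for i in range(I)]
    hom_t = sum(m * m for m in means)
    h_s, h_t = 1.0 - hom_s, 1.0 - hom_t
    if h_t <= 0.0:
        raise ValueError("monomorphic locus: F_ST is undefined")
    return (h_t - h_s) / h_t
```

For example, `fst([[0.8, 0.2], [0.4, 0.6]])` has H_S = 0.40 and H_T = 0.48, giving F_ST = 1/6.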

F_ST values are known to be smaller for multiallelic than for biallelic loci [2]. One reason invoked to explain this difference is that within-subpopulation heterozygosity H_S mathematically constrains the maximal value of F_ST to be below 1, and the constraint is stronger when H_S is high. This phenomenon was noticed concurrently in simulation-based, empirical and theoretical studies [3–7], and the mathematical constraints describing the dependence were subsequently clarified [8,9].

Studies have found that the maximal value of F_ST can be viewed as constrained not only by functions of the within-subpopulation allele frequency distribution such as H_S, but alternatively by aspects of the global allele frequency distribution across subpopulations. For a biallelic locus in K = 2 subpopulations, Maruki et al. [10] showed that the maximal F_ST as a function of the frequency M of the most frequent allele decreases as M increases from $1 / 2$ to 1 (see also [11]). Generalizing the biallelic case to arbitrarily many alleles, Jakobsson et al. [12] showed that for multiallelic loci with an unspecified number of distinct alleles, the maximal F_ST increases from 0 to 1 as a function of M if $0 < M < 1 / 2$ , and decreases from 1 to 0 for $1 / 2 \leq M < 1$ in the manner reported by Maruki et al. [10] for biallelic loci. Edge & Rosenberg [13] generalized these results to the case of a fixed finite number of alleles, showing that the maximal F_ST differs slightly from the unspecified case when the fixed number of distinct alleles is an odd number.

Generalizing the simplest case of K = I = 2 in a different direction, Alcala & Rosenberg [14] considered biallelic loci in the case of a fixed number of subpopulations K ≥ 2. We showed that the maximal value of F_ST displays a peculiar behaviour as a function of M: the upper bound has a maximum of 1 if and only if M = k/K, for integers k with $⌈ K / 2 ⌉ \leq k \leq K - 1$ . The constraints on the maximal value of F_ST dissipate as K tends to infinity, even though for any fixed K, there always exists a value of M for which $F_{S T} < 2 \sqrt{2} - 2 \approx 0.8284$ .

Relating F_ST to its maximum as a function of M helps explain surprising phenomena that arise during population-genetic data analysis. For example, Jakobsson et al. [12] showed that stronger constraints on F_ST could explain the low F_ST values seen in pairs of African human populations. They also found that such constraints could explain the lower F_ST values seen in high-diversity multiallelic loci compared to lower-diversity loci—microsatellites compared to single-nucleotide polymorphisms. Alcala & Rosenberg [14] showed that constraints on the maximal F_ST could explain the lower F_ST values between human populations seen when computing F_ST pairwise rather than from all populations simultaneously.

In this study, we characterize the relationship between F_ST and the frequency M of the most frequent allele, for a multiallelic locus and an arbitrary specified value of the number of subpopulations K. We derive the mathematical upper bound on F_ST in terms of M, extending the biallelic result of Alcala & Rosenberg [14] to the multiallelic case, and providing the most comprehensive description of the mathematical constraints on F_ST in terms of M to date (table 1). To assist in interpreting the new bound, we simulate the joint distribution of F_ST and M in the island migration model, describing its properties as a function of the number of subpopulations, the migration rate and the mutation rate. The K-subpopulation upper bound on F_ST in terms of M facilitates an explanation of counterintuitive aspects of inter-species genetic differentiation. We discuss the importance of the results for applications of F_ST more generally.

Table 1.

Studies describing the mathematical constraints on F_ST. H_S and H_T denote the within-subpopulation and total heterozygosities, respectively. δ denotes the absolute difference in the frequency of a specific allele between two subpopulations, and M denotes the frequency of the most frequent allele in the total population. Instead of heterozygosities H_S or H_T, some studies consider homozygosities 1 − H_S or 1 − H_T.

reference	number of alleles	number of subpopulations	variable in terms of which constraints are reported
Long & Kittles [8]	unspecified value ≥2	fixed finite value ≥2	H_S
Rosenberg et al. [11]	2	2	δ
Hedrick [9]	unspecified value ≥2	fixed finite value ≥2	H_S
Maruki et al. [10]	2	2	H_S, M
Jakobsson et al. [12]	unspecified value ≥2	2	H_T, M
Edge & Rosenberg [13]	fixed finite value ≥2	2	H_T, M
Alcala & Rosenberg [14]	2	fixed finite value ≥2	M
this paper	unspecified value ≥2	fixed finite value ≥2	M


2. Model

Our goal is to derive the range of values that F_ST can take—the lower and upper bounds on F_ST—as a function of the frequency M of the most frequent allele for a multiallelic locus, when the number of subpopulations K is a fixed finite value greater than or equal to 2. We follow previous studies [12–15] in describing notation and constructing the scenario.

We consider a polymorphic locus with an unspecified number of distinct alleles, in a setting with K subpopulations contributing equally to the total population. We denote the frequency of allele i in subpopulation k by p_k,i, with sum $σ_{i} = \sum_{k = 1}^{K} p_{k, i}$ across subpopulations. Each allele frequency p_k,i lies in [0, 1]. Within subpopulations, allele frequencies sum to 1: for each k, $\sum_{i = 1}^{\infty} p_{k, i} = 1$ . Hence, σ_i lies in [0, K], and $\sum_{i = 1}^{\infty} σ_{i} = K$ . We number alleles from most to least frequent, so σ_i ≥ σ_j for i ≤ j.

Because by assumption the locus is polymorphic, σ_i < K for each i. Alleles 1 and 2 have non-zero frequency in at least one subpopulation, not necessarily the same one; we have σ₁ > 0 and σ₂ > 0. We denote the mean frequency of the most frequent allele across subpopulations by M = σ₁/K. We then have 0 < M < 1. We treat the allele frequencies p_k,i and associated quantities M and σ_i as parametric values, and not as estimates computed from data.

Equation (1.1) expresses F_ST as a ratio involving within-subpopulation heterozygosity, H_S, and total heterozygosity, H_T, with 0 ≤ H_S < 1 and 0 ≤ H_T < 1. Because we assume the locus is polymorphic, H_T > 0. We write equation (1.1) in terms of allele frequencies, permitting the number of distinct alleles to be arbitrarily large

$$F_{ST} = \frac{\frac{1}{K} \sum_{k = 1}^{K} \sum_{i = 1}^{\infty} p_{k, i}^{2} - \sum_{i = 1}^{\infty} \left( \frac{1}{K} \sum_{k = 1}^{K} p_{k, i} \right)^{2}}{1 - \sum_{i = 1}^{\infty} \left( \frac{1}{K} \sum_{k = 1}^{K} p_{k, i} \right)^{2}}. \tag{2.1}$$

Hence, our goal is, for fixed σ₁ = KM, 0 < σ₁ < K, to identify the matrices (p_k,i)_K×∞, with p_k,i in [0, 1], $\sum_{i = 1}^{\infty} p_{k, i} = 1$ and $(1 / K) \sum_{k = 1}^{K} p_{k, 1} = σ_{1} / K = M$ , that minimize and maximize F_ST in equation (2.1).

Note that we adopt the interpretation of F_ST as a ‘statistic’ that describes a mathematical function of allele frequencies rather than as a ‘parameter’ that describes coancestry of individuals in a population [e.g. 16]. See Alcala & Rosenberg [14] for a discussion of interpretations of F_ST when studying its mathematical properties.

3. Mathematical constraints

(a) Lower bound of F_ST

Bounds on F_ST in terms of the frequency of the most frequent allele can be written with respect to M or σ₁, noting that M ranges in (0, 1) and σ₁ ranges in (0, K). For the lower bound, from equation (2.1), for any choice of σ₁, F_ST = 0 can be achieved. Consider (σ₁, σ₂, …) with σ_i in [0, K) for each i, σ_i ≥ σ_j for i ≤ j, $\sum_{i = 1}^{\infty} σ_{i} = K$ , and σ₁ > 0 and σ₂ > 0. We set p_k,i = σ_i/K for all subpopulations k and alleles i; this choice yields F_ST = 0.

F_ST = 0 implies that the numerator of equation (2.1), H_T − H_S, is zero. This numerator can be written $(1 / K^{2}) \sum_{i = 1}^{\infty} (K \sum_{k = 1}^{K} p_{k, i}^{2} - σ_{i}^{2})$ . The Cauchy–Schwarz inequality guarantees that $K \sum_{k = 1}^{K} p_{k, i}^{2} \geq σ_{i}^{2}$ , with equality if and only if p_1,i = p_2,i = … = p_K,i = σ_i/K. Applying the Cauchy–Schwarz inequality to all alleles i, the numerator of equation (2.1) is zero only if for all i, (p_1,i, p_2,i, …, p_K,i) = (σ_i/K, σ_i/K, …, σ_i/K).
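The inequality invoked here can also be read off from a standard algebraic identity that the text does not state explicitly; it is offered as a supplementary step, not as the authors' derivation. For a fixed allele i,

```latex
K \sum_{k=1}^{K} p_{k,i}^{2} - \sigma_{i}^{2}
  = \sum_{1 \le k < l \le K} \left( p_{k,i} - p_{l,i} \right)^{2} \;\ge\; 0,
\qquad \text{so that} \qquad
H_{T} - H_{S}
  = \frac{1}{K^{2}} \sum_{i=1}^{\infty} \sum_{1 \le k < l \le K}
    \left( p_{k,i} - p_{l,i} \right)^{2}.
```

Written this way, the numerator is a sum of squared pairwise differences in allele frequency, which vanishes only when every allele has the same frequency in every pair of subpopulations.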

Thus, we can conclude that the allele frequency matrices in which all K subpopulations have identical allele frequency vectors are the only matrices for which F_ST = 0. The lower bound on F_ST is equal to 0 irrespective of M or σ₁, for any value of the number of subpopulations K.

(b) Upper bound of F_ST

To derive the upper bound on F_ST in terms of M = σ₁/K, we must maximize F_ST in equation (2.1), assuming that σ₁ and K are constant. The computations are performed in appendix A; we write the main result as a function of σ₁, noting that it can be converted into a function of M by replacing σ₁ with KM.

In theorem A.1, we treat the case in which σ₁ has an integer value. For non-integer σ₁, theorem A.2 shows that the maximal F_ST requires that (i) the sum of squared allele frequencies across alleles and subpopulations, $S = \sum_{i = 1}^{\infty} \sum_{k = 1}^{K} p_{k, i}^{2}$ , is maximal, and (ii) alleles i = 2, 3, … are each present in at most one subpopulation, but allele 1 might be present in more than one subpopulation. We then separately maximize F_ST as a function of σ₁ for σ₁ in (0, 1) and non-integer σ₁ in (1, K). These two cases differ in that allele 1 appears in a single subpopulation in the former case, and it must appear in at least two subpopulations in the latter.

The maximal F_ST as a function of σ₁ for σ₁ in (0, K) is

$$F_{ST} \leq \begin{cases} 1, & σ_{1} = 1, 2, \dots, K - 1, \\ \dfrac{(K - 1) [1 - σ_{1} (J - 1) (2 - J σ_{1})]}{K - [1 - σ_{1} (J - 1) (2 - J σ_{1})]}, & 0 < σ_{1} < 1, \\ \dfrac{K (K - 1) - σ_{1}^{2} + ⌊ σ_{1} ⌋ - 2 (K - 1) \{ σ_{1} \} + (2 K - 1) \{ σ_{1} \}^{2}}{K (K - 1) - σ_{1}^{2} - ⌊ σ_{1} ⌋ + 2 σ_{1} - \{ σ_{1} \}^{2}}, & \text{non-integer } σ_{1},\ 1 < σ_{1} < K, \end{cases} \tag{3.1}$$

where $J = ⌈ σ_{1}^{- 1} ⌉$ . Here, $⌈ x ⌉$ denotes the smallest integer greater than or equal to x, $⌊ x ⌋$ denotes the greatest integer less than or equal to x, and $\{ x \} = x - ⌊ x ⌋$ denotes the fractional part of x. Note that for an integer choice of σ₁, the maximum from equation (3.1) and the limits as σ₁ tends to the integer from above and below all equal 1, so that the maximum as a function of σ₁ is continuous.
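The three cases of equation (3.1) can be evaluated with a short function. The sketch below is one possible implementation, not code from the study. It uses exact rational arithmetic so that the test for integer σ₁ and the ceiling and floor operations are not affected by floating-point rounding. The function name `fst_upper_bound` is a hypothetical choice.

```python
from fractions import Fraction
from math import ceil, floor

def fst_upper_bound(M, K):
    """Upper bound on F_ST given the frequency M of the most frequent
    allele and the number of subpopulations K, following equation (3.1)."""
    M = Fraction(M)              # exact value; pass a Fraction for exact M
    if K < 2 or not 0 < M < 1:
        raise ValueError("need K >= 2 and 0 < M < 1")
    s = K * M                    # sigma_1 = K M
    if s.denominator == 1:       # integer sigma_1 in {1, ..., K - 1}
        return Fraction(1)
    if s < 1:                    # case 0 < sigma_1 < 1
        J = ceil(1 / s)
        a = 1 - s * (J - 1) * (2 - J * s)
        return (K - 1) * a / (K - a)
    fl = floor(s)                # integer part of sigma_1
    fr = s - fl                  # fractional part {sigma_1}
    num = K * (K - 1) - s**2 + fl - 2 * (K - 1) * fr + (2 * K - 1) * fr**2
    den = K * (K - 1) - s**2 - fl + 2 * s - fr**2
    return num / den
```

Passing `Fraction(1, 6)` with K = 2 returns exactly 1/5, and `float(fst_upper_bound(Fraction(27, 100), 2))` is approximately 0.336. A float argument is converted to its exact binary value, so inputs such as `1/3` are best written as `Fraction(1, 3)`.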

From appendix A, F_ST reaches its upper bound for integer σ₁ when allele 1 has frequency 1 in each of σ₁ subpopulations, and when in each of the remaining K − σ₁ subpopulations, an allele other than allele 1 has frequency 1. These alleles of frequency 1 need not be private, although they can be; any identity relationships among them are permissible, provided that when summing frequencies across subpopulations, none of these alleles has a sum that exceeds σ₁. The locus can have as few as $⌈ K σ_{1}^{- 1} ⌉$ alleles of non-zero frequency and as many as K − σ₁ + 1.

For σ₁ in interval (0, 1), F_ST is maximal when each allele is present in only a single subpopulation, and when each subpopulation has exactly J alleles with a non-zero frequency: J − 1 alleles at frequency σ₁ and one allele at frequency 1 − (J − 1)σ₁ ≤ σ₁. Because each subpopulation has J distinct alleles and no alleles are shared across subpopulations, this upper bound requires that the locus has KJ alleles of non-zero frequency.

For non-integer σ₁ in (1, K), F_ST reaches its maximum when there are $⌊ σ_{1} ⌋$ subpopulations in which the most frequent allele has frequency 1, a single subpopulation in which it has frequency {σ₁} and a private allele has frequency 1 − {σ₁}, and $K - ⌊ σ_{1} ⌋ - 1$ subpopulations each with a different private allele at frequency 1. Only the most frequent allele is shared across subpopulations, and a single subpopulation displays polymorphism. At the maximum, $K - ⌊ σ_{1} ⌋ + 1$ alleles have non-zero frequency.
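The three maximizing configurations described above can be constructed explicitly and checked against the closed-form bound. The following sketch builds one such configuration for a given σ₁ and K and computes F_ST directly from it. The allele labels (`"A1"`, `("private", k)`, `(k, j)`) and both function names are illustrative assumptions. For integer σ₁ the sketch uses private alleles in the remaining subpopulations, which is only one of the permissible choices.

```python
from fractions import Fraction
from math import ceil, floor

def maximizing_rows(sigma1, K):
    """One allele-frequency configuration attaining the upper bound.
    Returns a list of K dicts mapping allele label -> frequency."""
    s = Fraction(sigma1)
    if not 0 < s < K:
        raise ValueError("need 0 < sigma1 < K")
    rows = []
    if s < 1:
        # All alleles private: J - 1 at frequency s, one at 1 - (J - 1) s.
        J = ceil(1 / s)
        for k in range(K):
            freqs = [s] * (J - 1) + [1 - (J - 1) * s]
            rows.append({(k, j): f for j, f in enumerate(freqs)})
        return rows
    full, frac = floor(s), s - floor(s)
    for k in range(K):
        if k < full:                      # allele 1 fixed
            rows.append({"A1": Fraction(1)})
        elif k == full and frac > 0:      # the single polymorphic subpopulation
            rows.append({"A1": frac, ("private", k): 1 - frac})
        else:                             # a private allele fixed
            rows.append({("private", k): Fraction(1)})
    return rows

def fst_from_rows(rows):
    """F_ST = (H_T - H_S) / H_T with equally weighted subpopulations."""
    K = len(rows)
    hom_s = sum(f * f for row in rows for f in row.values()) / K
    totals = {}
    for row in rows:
        for allele, f in row.items():
            totals[allele] = totals.get(allele, 0) + f
    hom_t = sum((t / K) ** 2 for t in totals.values())
    return (hom_s - hom_t) / (1 - hom_t)
```

With K = 3 and σ₁ = 3/2, the configuration has three alleles of non-zero frequency and yields F_ST = 8/11, which agrees with the third case of equation (3.1).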

(c) Properties of the upper bound

Figure 1 shows the maximal value of F_ST in terms of M = σ₁/K for various values of the number of subpopulations, K. We describe a number of properties of this upper bound.

(i) Piecewise structure of the upper bound

First, we observe that the upper bound has a piecewise structure.

For M < 1/K, the upper bound depends on $J = ⌈ σ_{1}^{- 1} ⌉ = ⌈ 1 / (K M) ⌉$ . As KM increases in (0, 1), each decrement in the integer value of $⌈ 1 / (K M) ⌉$ produces a distinct ‘piece’ with domain [1/(Kj), 1/(K(j − 1))), for integers j ≥ 2. Within each interval [1/(Kj), 1/(K(j − 1))), J has the constant value j.

At M = 1/K, the upper bound has its first transition between cases. For M > 1/K, the upper bound depends on $⌊ σ_{1} ⌋ = ⌊ K M ⌋$ . As KM increases in [1, K), each increment in $⌊ K M ⌋$ also produces a distinct piece of the domain. For each k from 1 to K − 1, $⌊ K M ⌋ = k$ for M in [k/K, (k + 1)/K).

Counting the intervals of the domain, we see that an infinite number of distinct intervals occur for M in (0, 1/K), and K − 1 intervals occur for M in (1/K, 1). Within intervals, the function describing the upper bound is smooth.

(ii) Behaviour of the upper bound for M = 1/K, 2/K, …, (K − 1)/K

The upper bound is equal to 1 at M = 1/K, 2/K, …, (K − 1)/K. For M in (0, 1/K), setting the numerator and denominator equal in equation (3.1), we find that the upper bound is never equal to 1. For M in (1/K, 1), the upper bound is equal to 1 if and only if {σ₁} = 0, that is, if and only if σ₁ is an integer and M = k/K for k = 2, 3, …, K − 1.

Hence, noting that the upper bound is equal to 1 at M = 1/K, we conclude that the upper bound can equal 1 if and only if M = k/K for integers k = 1, 2, …, K − 1. For fixed K, the upper bound on F_ST has exactly K − 1 maxima at which F_ST can equal 1, at M = 1/K, 2/K, …, (K − 1)/K. We can conclude that F_ST is unconstrained within the unit interval only for a finite set of values of the frequency M of the most frequent allele. The size of this set increases with the number of subpopulations K.

(iii) Behaviour of the upper bound for M in (0, 1/K)

For M in (0, 1/K), we can compute the value of the upper bound at the transition points between distinct pieces of the domain, namely values of 1/(Kj) for integers j ≥ 2. Applying equation (3.1), we observe that at M = 1/(Kj), the upper bound has value (K − 1)/(Kj − 1). In other words, the upper bound touches the curve

q^{*} (M) = \frac{(K - 1) M}{1 - M} .

3.2

This curve is represented in figure 1 as a dashed line.
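The relationship between the upper bound and the curve q*(M) can be checked numerically. The sketch below evaluates the 0 < σ₁ < 1 case of equation (3.1) and equation (3.2) in exact arithmetic. The function names are hypothetical. One observation is not made in the text and is offered only as a supplementary remark: writing a = 1 − σ₁(J − 1)(2 − Jσ₁), we have a − σ₁ = (1 − Jσ₁)(1 − (J − 1)σ₁) ≤ 0, which suggests that the bound lies on or below q*(M) throughout (0, 1/K) and meets it only where Jσ₁ = 1.

```python
from fractions import Fraction
from math import ceil

def bound_small_M(M, K):
    """Equation (3.1) in the case 0 < sigma_1 < 1, that is, 0 < M < 1/K."""
    s = K * Fraction(M)
    if not 0 < s < 1:
        raise ValueError("need 0 < M < 1/K")
    J = ceil(1 / s)
    a = 1 - s * (J - 1) * (2 - J * s)
    return (K - 1) * a / (K - a)

def q_star(M, K):
    """Equation (3.2): the curve touched at M = 1/(K j), j >= 2."""
    M = Fraction(M)
    return (K - 1) * M / (1 - M)

# At each transition point the two expressions coincide.
for K in (2, 3, 6):
    for j in range(2, 6):
        M = Fraction(1, K * j)
        assert bound_small_M(M, K) == q_star(M, K) == Fraction(K - 1, K * j - 1)
```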

Note that for K = 2, the special case considered by Jakobsson et al. [12], equation (3.2) reduces to q*(M) = M/(1 − M) = σ₁/(2 − σ₁), which matches equation 21 from Jakobsson et al. [12]. In fact, setting K = 2, equation (3.1) for M in (0, 1/K) reduces to the K = 2 upper bound on F_ST in equation 9 of Jakobsson et al. [12].

(iv) Behaviour of the upper bound for M in (1/K, 1)

Because the upper bound is a smooth function on each interval of its domain, and because it possesses maxima at interval boundaries M = 1/K, 2/K, …, (K − 1)/K, it must possess local minima in intervals [k/K, (k + 1)/K) for k = 1, 2, …, K − 2. Indeed, such minima are visible in figure 1 in cases with K = 3, K = 6, K = 40 and K = 100; for K = 2, only one maximum occurs, so that there is no interval between a pair of maxima in which a minimum can occur. Note that because we restrict attention to M in (0, 1), we do not count the point at M = 1 and F_ST = 0 as a local minimum.

4. Joint distribution of M and F_ST under an evolutionary model

So far, we have described the mathematical constraint imposed on F_ST by M without respect to the frequency with which particular values of M arise in evolutionary scenarios. As an assessment of the bounds in evolutionary models can illuminate the settings in which they are most salient in population-genetic data analysis [9,14,17–20], we simulated the joint distribution of F_ST and M under an island migration model, relating the distribution to the mathematical bounds on F_ST. This analysis considers allele frequency distributions, and hence values of M and F_ST, generated by evolutionary models. The simulation approach is modified from [14,15].

(a) Simulations

We simulated alleles under a coalescent model, using the software MS [21]. We considered a total population of KN diploid individuals subdivided into K subpopulations of size N. At each generation, a proportion m of the individuals in a subpopulation originated outside the subpopulation. Thus, the scaled migration rate is 4Nm, and it corresponds to twice the number of gene copies (2Nm) in a subpopulation that originate elsewhere each generation. We considered the island model [22–24], in which migrants have the same probability m/(K − 1) of coming from any other specific subpopulation. We used an infinitely-many-alleles model; mutations occur at rate μ, and the scaled mutation rate is 4Nμ.

We examined three values of K (2, 6, 40), three values of 4Nμ (0.1, 1, 10) and three values of 4Nm (0.1, 1, 10). Note that in MS, time is scaled in units of 4N generations, and there is no need to specify subpopulation sizes N. MS simulates an infinitely-many-sites model, where each mutation occurs at a new site; each haplotype is a new allele, so that each mutation creates a new allele. For our analysis, we are concerned only with the allelic categories and not with the simulated sequences; thus, although the simulation follows the infinitely-many-sites model, the analysis treats simulated datasets as having been generated under an infinitely-many-alleles model.

For each parameter triplet (K, 4Nμ, 4Nm), we performed 1000 replicate simulations, sampling 100 sequences per subpopulation in each replicate. We computed F_ST values from the parametric allele (haplotype) frequencies. MS commands appear in electronic supplementary material, File S1; note that the simulation approach here uses the standard method of simulating MS with a specified mutation rate θ = 4Nμ, whereas in our previous analyses of biallelic cases [14,15], we had employed the alternative approach of requiring simulated datasets to possess exactly one segregating site.
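The simulation and allele-coding steps described above might be sketched as follows. The exact commands used in the study are in File S1 and may differ. The argument list below is only an assumed form based on the `-t` (scaled mutation rate) and `-I` (island model with a symmetric scaled migration rate) options of MS. The sketch also assumes that sampled haplotypes are listed subpopulation by subpopulation, and all function names are hypothetical.

```python
def ms_island_args(K, n_per_pop, theta, mig, reps):
    """Assumed MS argument list for a symmetric K-island model:
    ms <nsam> <nreps> -t <4N mu> -I <K> <n_1> ... <n_K> <4N m>."""
    return ["ms", str(K * n_per_pop), str(reps), "-t", str(theta),
            "-I", str(K), *([str(n_per_pop)] * K), str(mig)]

def allele_frequencies(haplotypes, n_per_pop):
    """Treat each distinct haplotype string as one allele.
    haplotypes: flat list ordered by subpopulation (an assumption).
    Returns one dict per subpopulation mapping haplotype -> frequency."""
    if len(haplotypes) % n_per_pop != 0:
        raise ValueError("sample does not split evenly into subpopulations")
    rows = []
    for start in range(0, len(haplotypes), n_per_pop):
        block = haplotypes[start:start + n_per_pop]
        counts = {}
        for h in block:
            counts[h] = counts.get(h, 0) + 1
        rows.append({h: c / n_per_pop for h, c in counts.items()})
    return rows

def most_frequent_allele_M(rows):
    """Mean frequency across subpopulations of the most frequent allele."""
    totals = {}
    for row in rows:
        for h, f in row.items():
            totals[h] = totals.get(h, 0.0) + f
    return max(totals.values()) / len(rows)
```

For instance, `ms_island_args(2, 100, 1.0, 10.0, 1000)` produces the arguments for K = 2, 4Nμ = 1 and 4Nm = 10 with 100 sequences per subpopulation and 1000 replicates.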

Figure 2 shows the joint distribution of M and F_ST for the nine values of (4Nμ, 4Nm) in the case of K = 2. Electronic supplementary material, figures S1 and S2 provide similar figures for K = 6 and K = 40, respectively.

(b) Impact of the mutation rate

For fixed migration rate 4Nm and number of subpopulations K, the main impact of the mutation rate is on the frequency M of the most frequent allele. For K = 2, under weak mutation (4Nμ = 0.1), the joint distribution of M and F_ST is highest in the high-M region, for all values of 4Nm (figure 2a,d,g). Although most simulation replicates produce $M > 1 / 2$ with an upper bound on F_ST less than one, this set of parameter values does give rise to replicates near the peak at $(M, F_{S T}) = (1 / 2, 1)$ .

Under intermediate mutation (4Nμ = 1), the increased mutation rate tends to decrease M, shifting the joint distribution to lower values of M for all values of 4Nm (figure 2b,e,h). Finally, under strong mutation (4Nμ = 10), the joint distribution of M and F_ST is highest in the low-M region, for all values of 4Nm (figure 2c,f,i). In this region, the upper bound on F_ST is most strongly constrained, leading to low F_ST values.

(c) Impact of the migration rate

For fixed mutation rate 4Nμ and number of subpopulations K, the impact of the migration rate is seen primarily in the F_ST values rather than the values of M. Under weak migration (4Nm = 0.1), subpopulations are differentiated, and the joint distribution of M and F_ST is highest near the upper bound on F_ST in terms of M (figure 2a–c).

Under intermediate migration (4Nm = 1), differentiation between subpopulations decreases, and the joint density of M and F_ST is highest at lower values of F_ST (figure 2d–f). Under strong migration (4Nm = 10), the joint density of M and F_ST nears the lower bound (figure 2g–i).

(d) Impact of the number of subpopulations

In figure 1, the number of subpopulations changes the shape of the region in which F_ST is permitted to range as a function of M. Thus, in simulations, the impact of the number of subpopulations K is observed in cases in which a change in K permits F_ST to expand its range within the unit square for (M, F_ST). For each of the nine choices of (4Nμ, 4Nm), figure 3 summarizes the means observed for (M, F_ST) in figure 2 and electronic supplementary material, figures S1 and S2, corresponding to K = 2, K = 6 and K = 40, respectively.

Figure 3. Mean frequency M of the most frequent allele and mean F_ST in the island migration model, for different scaled migration rates 4Nm and mutation rates 4Nμ and different numbers of subpopulations K. (a) K = 2, (b) K = 6 and (c) K = 40. The black solid lines represent the upper bound on F_ST in terms of M (equation 3.1). The coloured points represent the mean M and mean F_ST, where colours correspond to values of 4Nm. These points are taken from figure 2 and electronic supplementary material, figures S1 and S2.

Increasing the number of subpopulations generally increases F_ST for fixed 4Nμ and 4Nm. For example, the mean F_ST can be substantially larger for K = 6 than for K = 2. Consider (4Nμ, 4Nm) = (0.1, 0.1). For K = 2, the mean F_ST is near its upper bound (figure 3a); for K = 6, F_ST is not as close to the bound (figure 3b). However, because the upper bound for K = 6 exceeds that for K = 2, the mean F_ST is nevertheless larger in the case of K = 6.

5. Example: humans and chimpanzees

We now use our theoretical results to examine genetic differentiation in humans and chimpanzees. Because humans and chimpanzees are distinct species, we might expect a genetic differentiation measure such as F_ST to produce a greater value for a computation between them than for a computation among populations within one or the other. Indeed, studies of multiallelic loci do find that adding chimpanzees to data on multiple human populations increases the value of F_ST [8,25]. However, we will see that F_ST has a more subtle pattern when considering data on multiple chimpanzee populations, and that our theoretical computations explain a surprising result.

We examine data on 246 multiallelic microsatellite loci assembled by Pemberton et al. [26] from several studies of worldwide human populations and a study of chimpanzees [27]. We consider F_ST comparisons both between humans and chimpanzees and among populations of chimpanzees. For the human data, we consider all 5795 individuals in the dataset, and for the chimpanzee data, we consider 84 chimpanzee individuals from six populations: one bonobo population, and five common chimpanzee populations (Central, Eastern, Western, hybrid and captive).

In the data analysis, we perform a computation to summarize the relationship of F_ST to the upper bound. For a set of Z loci, denote by F_z and M_z the values of F_ST and M at locus z. The mean F_ST for the set, or ${\bar{F}}_{S T}$ , is

$${\bar{F}}_{S T} = \frac{1}{Z} \sum_{z = 1}^{Z} F_{z}. \tag{5.1}$$

Using equation (3.1), we can compute the corresponding maximum F_ST given the observed σ_z = KM_z, z = 1, 2, …, Z. Denoting this quantity by F_max,z, we have

$$\bar{F_{S T} / F_{max}} = \frac{1}{Z} \sum_{z = 1}^{Z} \frac{F_{z}}{F_{max, z}}. \tag{5.2}$$

$\bar{F_{S T} / F_{max}}$ measures the proximity of the F_ST values to their upper bounds: it ranges from 0, if F_ST values at all loci equal 0, to 1, if F_ST values at all loci equal their upper bounds.
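Equations (5.1) and (5.2) amount to two averages over loci, as in the minimal sketch below. It assumes the per-locus values F_z and F_max,z have already been computed; the function name `summarize_loci` is hypothetical. Note that equation (5.2) is a mean of per-locus ratios, which generally differs from the ratio of the mean F_ST to the mean upper bound.

```python
def summarize_loci(f_values, fmax_values):
    """Return (mean F_ST, mean F_ST / F_max) over Z loci.
    f_values[z] is F_z; fmax_values[z] is the upper bound F_max,z > 0."""
    Z = len(f_values)
    if Z == 0 or Z != len(fmax_values):
        raise ValueError("need two non-empty lists of equal length")
    mean_f = sum(f_values) / Z                       # equation (5.1)
    mean_ratio = sum(f / fmax for f, fmax            # equation (5.2)
                     in zip(f_values, fmax_values)) / Z
    return mean_f, mean_ratio
```

For two hypothetical loci with F_ST values 0.1 and 0.2 and upper bounds 0.5 and 0.4, the function returns a mean F_ST of 0.15 and a mean ratio of 0.35.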

We computed the parametric allele frequencies for each subpopulation—the human and chimpanzee groups for the human–chimpanzee comparison, and chimpanzee subpopulations for the comparison of chimpanzees—averaging across subpopulations to obtain the frequency M of the most frequent allele. We then computed F_ST and the associated upper bound for each locus, averaging across loci to obtain the overall ${\bar{F}}_{S T}$ and $\bar{F_{S T} / F_{max}}$ for the full microsatellite set (equations (5.1) and (5.2)).

Surprisingly, given the longer evolutionary time between humans and chimpanzees than among chimpanzee populations, the F_ST value is significantly greater when comparing chimpanzee populations ( ${\bar{F}}_{S T} = 0.16$ ) than when comparing humans and chimpanzees ( ${\bar{F}}_{S T} = 0.10$ ; p = 4.2 × 10⁻¹⁴, Wilcoxon rank sum test). The explanation for this result can be found in the properties of the upper bound on F_ST given M.

Values of M are similar in the two comparisons (figure 4a,b). However, K differs, equalling 2 for the human–chimpanzee comparison and 6 for the comparison of chimpanzee subpopulations. Because the theoretical range of F_ST is seen to be smaller for F_ST values computed among smaller sets of subpopulations than among larger sets (figure 1), the F_ST values among chimpanzees possess a larger range. For example, the maximal F_ST at the mean M of 0.27 observed in pairwise comparisons is 0.34 for K = 2 (red segment in figure 4a), whereas the maximal F_ST at the mean M of 0.36 observed for six chimpanzee populations is 0.93 for K = 6 (figure 4b). Given the stronger constraint in pairwise calculations than in calculations with more subpopulations, it is not unexpected that pairwise F_ST values would be smaller than those in a six-population computation. A high F_ST among chimpanzees compared to between humans and chimpanzees is a by-product of mathematical constraints on F_ST.
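As a worked check of the pairwise value quoted above, the K = 2 figure can be reproduced by substituting the rounded mean M = 0.27 into the 0 < σ₁ < 1 case of equation (3.1). This is an illustrative calculation using the rounded value of M, not a computation reported in the study.

```latex
\begin{aligned}
\sigma_1 &= KM = 2 \times 0.27 = 0.54, \qquad
J = \lceil 1 / 0.54 \rceil = 2, \\
1 - \sigma_1 (J - 1)(2 - J \sigma_1) &= 1 - 0.54 \times (2 - 1.08) = 1 - 0.4968 = 0.5032, \\
F_{\max} &= \frac{(K - 1) \times 0.5032}{K - 0.5032} = \frac{0.5032}{1.4968} \approx 0.34 .
\end{aligned}
```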

Interestingly, the effect of K on F_ST is largely eliminated when each F_ST value is normalized by the associated maximum given K and M (figure 4c). The normalization leads to higher values for human–chimpanzee comparisons than among chimpanzee subpopulations ( $\bar{F_{S T} / F_{max}} = 0.32$ and 0.20, respectively; p = 1.1 × 10⁻⁹, Wilcoxon rank sum test), as expected from the greater evolutionary distance between humans and chimpanzees compared to that among chimpanzees.

6. Discussion

We have analysed the range of values that F_ST can take as a function of the frequency M of the most frequent allele at a multiallelic locus, for an arbitrary value of the number of subpopulations K. We showed that F_ST can span the full unit interval only for a finite set of values of M, at M = k/K for integers k in [1, K − 1]. For all other M, F_ST necessarily lies below 1. As the number of subpopulations K increases, the range of values that F_ST can take enlarges.

This study provides the most complete relationship between F_ST and M obtained to date, generalizing previous results for the case of K = 2 subpopulations [12] and for a restriction to I = 2 alleles [14]. Interestingly, the maximal F_ST we have obtained merges patterns observed in these previous studies. Fixing K = 2, we obtain the upper bound on F_ST in terms of M that was reported by Jakobsson et al. [12]. As K increases, the piecewise pattern seen by Jakobsson et al. [12] for the maximal F_ST in the K = 2 case for M in $(0, 1 / 2)$ is observed in the multiallelic case for M in (0, 1/K). The decay from $(M, F_{S T}) = (1 / 2, 1)$ to (M, F_ST) = (1, 0) seen by Jakobsson et al. [12] for K = 2 is observed, for arbitrary K, as the decay from ((K − 1)/K, 1) to (1, 0).

The allele frequency values for which the upper bound is reached for M in (0, 1/K) generalize those seen for the case of K = 2 and M in $(0, 1 / 2)$ [12]. The upper bound is reached when all alleles are private, each subpopulation has as many alleles as possible at frequency KM, and at most one additional allele. The allele frequency values for which the upper bound is reached for M in ((K − 1)/K, 1) also generalize those seen for K = 2 and M in $(1 / 2, 1)$ : the maximum is reached when the most frequent allele is fixed in all subpopulations except one, and a single private allele is present in this last subpopulation.

The results from Alcala & Rosenberg [14] for I = 2 produce a more constrained upper bound on F_ST than for arbitrary I, with the domain of M restricted to $(1 / 2, 1)$ . Nevertheless, many properties of the maximal F_ST we observe for unspecified I and M in (1/K, 1) are similar to those seen for I = 2 and M in $(1 / 2, 1)$ : finitely many peaks at points M = k/K, local minima between the peaks, and an increase in coverage of the unit square for (M, F_ST) as K increases. The maximal F_ST functions for M in ((K − 1)/K, 1) for unspecified I and for I = 2 agree, as the number of alleles required to maximize F_ST in this interval in the case of unspecified I is simply equal to 2.

In assuming that the number of alleles is unspecified, we found that the number of distinct alleles needed for achieving the maximal F_ST is $K ⌈ σ_{1}^{- 1} ⌉$ for M in (0, 1/K) and $K - ⌊ σ_{1} ⌋ + 1$ for M in (1/K, 1) with non-integer σ₁ = KM; the maximum can be achieved with each number of distinct alleles in $[⌈ K σ_{1}^{- 1} ⌉, K - σ_{1} + 1]$ for M equal to 1/K, 2/K, …, (K − 1)/K. With a fixed maximal number of distinct alleles, such as in the I = 2 case of Alcala & Rosenberg [14] with K specified and in the K = 2 case with I specified [13], the upper bound on F_ST is less than or equal to that seen in the corresponding unspecified-I case. For K = 2, specifying I has a relatively small effect in reducing the maximal value of F_ST [13]. As in Edge & Rosenberg [13], specifying I in the case of larger values of K is expected to have the greatest impact on the F_ST upper bound at the lowest end of the domain for M.

In coalescent simulations, we found that the joint distribution of M and F_ST within their permissible space can help separate the impact of mutation and migration. Although the dependence of F_ST on mutation and migration rates has been long documented, the symmetric effects of mutation and migration under the island model [22] illustrate the difficulty in separating their effects. Under the island model, the frequency M of the most frequent allele is informative about the scaled mutation rate 4Nμ, and comparing the value of F_ST to its maximum given M is informative about the scaled migration rate 4Nm. Adding a dimension that is more sensitive to mutation than to migration (M in our case) enables the separation of their effects. Other statistics, such as total heterozygosity H_T or within-subpopulation heterozygosity H_S, have the potential to play a similar role [20].

Our results can inform data analyses. In particular, we caution users to examine upper bounds on F_ST to assess how mathematical constraints influence observations. As the constraints are strongest for K = 2, this step is valuable in pairwise comparisons; it is also useful when the frequency M of the most frequent allele can be small in relation to the number of populations K, such as for high-diversity forensic [28] and immunological [29] loci in human populations. Visual inspection of the values of M and F_ST within their bounds can suggest that constraints have an effect. The statistic $\overline{F_{ST} / F_{\max}}$ can provide a helpful summary by evaluating the proximity of F_ST values to their maxima.
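One way to compute such a summary is sketched below. The closed forms are assembled from the steps of appendix A rather than copied from equation (3.1), and the sketch assumes σ₁ = KM and an unspecified number of alleles; the function names, the `(M, F_ST)` input layout and the tolerance `tol` are hypothetical.

```python
import math

def fst_upper_bound(M, K, tol=1e-12):
    """Upper bound on F_ST given the frequency M of the most frequent allele
    in K subpopulations, following the case analysis of appendix A."""
    sigma1 = K * M
    if not 0 < sigma1 < K:
        raise ValueError("M must lie in (0, 1)")
    nearest = round(sigma1)
    if nearest >= 1 and abs(sigma1 - nearest) < tol:
        return 1.0  # integer sigma1: F_ST = 1 is attainable
    if sigma1 < 1:
        # all alleles private; J alleles per subpopulation
        J = math.ceil(1 / sigma1)
        S = K * (1 - sigma1 * (J - 1) * (2 - J * sigma1))
        return (K - 1) * S / (K**2 - S)
    # non-integer sigma1 in (1, K)
    floor_s = math.floor(sigma1)
    frac = sigma1 - floor_s                      # fractional part {sigma1}
    S1 = floor_s + frac**2                       # maximal S_1
    S_star = (K - floor_s - 1) + (1 - frac)**2   # maximal S*
    return (K * S1 + (K - 1) * S_star - sigma1**2) / (K**2 - S_star - sigma1**2)

def mean_normalized_fst(loci, K):
    """Mean of F_ST / F_max over loci given as (M, F_ST) pairs."""
    ratios = [fst / fst_upper_bound(M, K) for M, fst in loci]
    return sum(ratios) / len(ratios)

if __name__ == "__main__":
    example_loci = [(0.75, 1 / 6), (0.5, 0.5)]  # made-up values for illustration
    print(mean_normalized_fst(example_loci, K=2))
```

Values of the ratio near 1 indicate loci whose F_ST sits close to its ceiling for the observed M; the input values above are invented solely to exercise the function.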

Further, joint use of M along with F_ST could be useful in various applications of F_ST, such as in inference of model parameters by approximate Bayesian computation [30] and machine learning [31]. F_ST outlier tests to detect local adaptation from multiallelic loci [32] could search for F_ST values that represent outliers not in the distribution of F_ST values, but rather, outliers in relation to associated upper bounds. Computing null distributions for F_ST conditional on M could enhance the approach.

In an example data analysis, we have shown that taking into account mathematical constraints on F_ST can help understand puzzling F_ST behaviour. In our example, F_ST at a set of loci was higher when comparing K = 6 chimpanzee populations than when comparing humans and chimpanzees (K = 2), even though the same loci were used and the mean value for M was similar in the two comparisons. A comparison of F_ST values to their respective maxima explained these counterintuitive results.

We note that analyses of F_ST in relation to M differ from analyses of F_ST in relation to within-subpopulation statistics H_S and J_S = 1 − H_S, such as those performed in deriving the influential Hedrick’s $G_{S T}^{'}$ [9] and Jost’s D [33] statistics. We have previously shown that for biallelic loci in K subpopulations, for fixed M, the statistics F_ST, $G_{S T}^{'}$ and D are all maximized at the same set of allele frequency values [15]. Although the normalizations of F_ST used to produce $G_{S T}^{'}$ and D lead to statistics that are unconstrained in the unit interval as functions of H_S, $G_{S T}^{'}$ and D continue to be constrained as functions of M. A statistic that instead normalizes F_ST by its maximum as a function of M, a statistic of the total population, captures aspects of the allele frequency dependence of F_ST that differ from those captured by normalizations by functions of within-subpopulation statistics.

In human populations, efforts to understand F_ST patterns trace in large part to Lewontin’s foundational F_ST-like variance-partitioning computation [34], in which it was seen that among-population differences (analogous to F_ST) were small relative to within-population differences (analogous to 1 − F_ST). Studies using loci with different numbers of alleles, loci with different frequencies for the most frequent allele, and samples with different numbers of subpopulations have varied to some extent in their numerical estimates of F_ST [14,35–38]. Mathematical results on F_ST bounds provide part of the explanation for these differences: they establish that each dataset differing in the character of its loci and subpopulation set has its own distinctive interval in which its associated F_ST calculation could potentially land. Hence, each dataset can give rise to a numerically distinct value not due to features of the underlying human biology, but rather, due to different constraints on the F_ST measure itself. F_ST bounds contribute to explaining quantitative variation in variance-partitioning computations—in which, although numerical values differ, the within-population component of genetic variation consistently predominates. The mathematics serves to support the qualitative claim that worldwide human genetic differentiation measurements represented by F_ST-like statistics have low values—as was argued by Lewontin 50 years ago.

Acknowledgements

We thank Kent Holsinger for comments on the manuscript and Maike Morrison for helpful conversations.

Appendix A. Proof of equation (3.1)

This appendix derives the upper bound on F_ST as a function of σ₁ (equation 3.1). First, we separate the case of integer values of σ₁. Next, for non-integer values of σ₁, we reduce the problem of maximizing F_ST to the problem of maximizing the sum of squared allele frequencies across alleles and subpopulations, $S = \sum_{i = 1}^{\infty} \sum_{k = 1}^{K} p_{k, i}^{2}$ . Next, we maximize S as a function of σ₁, separately for σ₁ in (0, 1) and for non-integer σ₁ in (1, K).

(a) A useful expression for F_ST

Suppose K ≥ 2 is a specified integer. Suppose σ₁ is a fixed value, with 0 < σ₁ < K. We leave the number of alleles I unspecified. For each i ≥ 1, we write $σ_{i} = \sum_{k = 1}^{K} p_{k, i}$ , with σ_i ≥ σ_j for each i and j with i ≤ j. For convenience, σ₁ is taken to mean both the function that computes the sum $\sum_{k = 1}^{K} p_{k, 1}$ for a specified set of values of the p_k,i and a fixed value for that sum.

For each (k, i) with 1 ≤ k ≤ K and i ≥ 1, p_k,i lies in [0, 1], and $\sum_{i = 1}^{\infty} p_{k, i} = 1$ for all k, 1 ≤ k ≤ K. Define F_ST as in equation (2.1). We seek to maximize F_ST over all possible sets of values of the p_k,i with a fixed value σ₁ for the sum $\sum_{k = 1}^{K} p_{k, 1}$ . Note that because σ₁ < K and $\sum_{k = 1}^{K} \sum_{i = 1}^{\infty} p_{k, i} = \sum_{i = 1}^{\infty} σ_{i} = K$ , it follows that σ₂ > 0.

Denote the sum of squared frequencies of allele 1 across subpopulations, $\sum_{k = 1}^{K} p_{k, 1}^{2}$ , by S₁. Denote $S = \sum_{i = 1}^{\infty} \sum_{k = 1}^{K} p_{k, i}^{2} = \sum_{k = 1}^{K} \sum_{i = 1}^{\infty} p_{k, i}^{2}$ for the corresponding sum of squared frequencies of all alleles. We express equation (2.1) in terms of σ₁, S₁ and S:

$F_{S T} = \frac{(K - 1) S + S_{1} - σ_{1}^{2} - 2 \sum_{i = 2}^{\infty} \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, i} p_{ℓ, i}}{K^{2} - S + S_{1} - σ_{1}^{2} - 2 \sum_{i = 2}^{\infty} \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, i} p_{ℓ, i}} .$ A 1

By construction of equation (2.1), the denominator of equation (A 1) lies in (0, K²), as 0 < H_T < 1 from the fact that σ₂ > 0. The numerator lies in [0, K²), as 0 ≤ H_S ≤ H_T < 1, so that 0 ≤ H_T − H_S < 1. F_ST lies in [0, 1], as 0 ≤ H_S ≤ H_T and 0 < H_T imply 0 ≤ (H_T − H_S)/H_T ≤ 1.
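Equation (A 1) can be checked numerically against the heterozygosity form of F_ST. The sketch below assumes that equation (2.1) is F_ST = (H_T − H_S)/H_T with equally weighted subpopulations, H_S = 1 − S/K and H_T = 1 − Σ_i (σ_i/K)²; the function names and the example frequencies are illustrative.

```python
def fst_direct(p):
    """F_ST = (H_T - H_S) / H_T, where p[k][i] is the frequency of
    allele i in subpopulation k (assumed form of equation (2.1))."""
    K, I = len(p), len(p[0])
    H_S = 1 - sum(x * x for row in p for x in row) / K
    H_T = 1 - sum((sum(row[i] for row in p) / K) ** 2 for i in range(I))
    return (H_T - H_S) / H_T

def fst_a1(p):
    """F_ST written as in equation (A 1), with allele 1 stored in column 0."""
    K, I = len(p), len(p[0])
    S = sum(x * x for row in p for x in row)     # all alleles
    S1 = sum(row[0] ** 2 for row in p)           # allele 1 only
    sigma1 = sum(row[0] for row in p)
    # cross-products between subpopulations for alleles i >= 2
    cross = 2 * sum(p[k][i] * p[l][i]
                    for i in range(1, I)
                    for k in range(K - 1)
                    for l in range(k + 1, K))
    numerator = (K - 1) * S + S1 - sigma1 ** 2 - cross
    denominator = K ** 2 - S + S1 - sigma1 ** 2 - cross
    return numerator / denominator

if __name__ == "__main__":
    p = [[0.5, 0.3, 0.2],
         [0.6, 0.1, 0.3],
         [0.4, 0.4, 0.2]]
    print(fst_direct(p), fst_a1(p))
```

The two functions agree up to floating-point rounding for any frequency matrix whose rows sum to 1.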

(b) The case of integer values of σ₁

In equation (A 1), the numerator is less than or equal to the denominator, with equality if and only if $K = S = \sum_{k = 1}^{K} \sum_{i = 1}^{\infty} p_{k, i}^{2}$ . This equality in turn requires that for each k, there exists some i for which p_k,i = 1, a condition that can be achieved only if σ₁ is an integer.

Theorem A.1. —

Suppose σ₁ is an integer value, 1, 2, …, K − 1. F_ST = 1 if and only if (i) p_k,1 = 1 in each of σ₁ subpopulations, and (ii) for each of the K − σ₁ remaining subpopulations, there exists a value of i ≥ 2 with p_k,i = 1.

Proof. —

F_ST = 1 if and only if S = K, and S = K if and only if for each k, there exists an associated i with p_k,i = 1. For a fixed integer value of σ₁, p_k,1 = 1 in exactly σ₁ subpopulations. ▪

Note that any set of equivalence relationships can exist among the values of i associated with the K − σ₁ subpopulations in which p_k,1 = 0, provided that none of these values of i is associated with more than σ₁ subpopulations. For example, these values of i can be mutually distinct, or groups of them with size as large as σ₁ can be mutually equal.
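As an illustration of theorem A.1 and of the note above, consider the hypothetical case K = 4 and σ₁ = 2, with rows indexing subpopulations and columns indexing alleles. The two remaining subpopulations can be fixed for distinct alleles (P_a) or for the same allele (P_b), because a group of size 2 does not exceed σ₁.

```latex
\begin{align*}
P_a &= \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},
&
P_b &= \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}, \\
S &= \sum_{k = 1}^{4} \sum_{i} p_{k, i}^{2} = 4 = K
&&\Longrightarrow\quad F_{ST} = 1 \ \text{for both } P_a \text{ and } P_b .
\end{align*}
```

By contrast, with K = 3 and σ₁ = 1, the two subpopulations not fixed for allele 1 cannot share an allele, as that allele would then have σ₂ = 2 > σ₁.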

(c) Non-integer values of σ₁

For non-integer σ₁, the numerator of equation (A 1) is strictly less than the denominator. Hence, if the other quantities in equation (A 1) are fixed, then F_ST decreases with increasing $2 \sum_{i = 2}^{\infty} \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, i} p_{ℓ, i}$ . We have the following theorem.

Theorem A.2. —

Suppose σ₁ is not an integer. F_ST satisfies

$F_{S T} \leq \frac{(K - 1) S + S_{1} - σ_{1}^{2}}{K^{2} - S + S_{1} - σ_{1}^{2}},$ A 2

equality requiring that for each i ≥ 2, there exists at most one value of k for which p_k,i > 0.

Proof. —

Because $2 \sum_{i = 2}^{\infty} \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, i} p_{ℓ, i}$ is subtracted in both the numerator and the denominator of equation (A 1), and because the numerator is strictly less than the denominator for non-integer σ₁, F_ST can be bounded above by minimizing this term. Because p_k,i ≥ 0 for all (k, i), each sum $\sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, i} p_{ℓ, i}$ is bounded below by zero. Setting the sum to 0 for all i ≥ 2 gives the upper bound in equation (A 2).

For the equality condition, $\sum_{i = 2}^{\infty} \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, i} p_{ℓ, i} = 0$ if and only if all products $p_{k, i} p_{ℓ, i}$ are zero—that is, if and only if for each i ≥ 2, at most one value of k has p_k,i > 0. ▪
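The monotonicity used in the proof can be written out explicitly. In the sketch below, the shorthand a, b and x is introduced only for this illustration, with a and b held fixed while x varies.

```latex
\begin{align*}
a &= (K - 1) S + S_{1} - \sigma_{1}^{2}, \qquad
b = K^{2} - S + S_{1} - \sigma_{1}^{2}, \qquad
x = 2 \sum_{i = 2}^{\infty} \sum_{k = 1}^{K - 1} \sum_{\ell = k + 1}^{K} p_{k, i} p_{\ell, i}, \\
F_{ST} &= \frac{a - x}{b - x}, \qquad
\frac{\partial F_{ST}}{\partial x}
  = \frac{-(b - x) + (a - x)}{(b - x)^{2}}
  = \frac{a - b}{(b - x)^{2}}, \\
b - a &= K^{2} - K S = K (K - S) > 0 \quad \text{when } S < K
\quad\Longrightarrow\quad \frac{\partial F_{ST}}{\partial x} < 0 .
\end{align*}
```

Because S < K whenever σ₁ is not an integer, F_ST is strictly decreasing in x, and the bound in equation (A 2) corresponds to x = 0.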

By theorem A.2, to maximize F_ST for fixed non-integer σ₁, we must maximize the quantity in equation (A 2). It suffices to consider sets of values of p_k,i in which for each i ≥ 2, at most one value of k has p_k,i > 0.

(d) The case of (non-integer) σ₁ in (0, 1)

In this section, we find the set of values of the p_k,i that maximize F_ST for σ₁ in (0, 1). We proceed in two steps. (i) We show that for σ₁ in (0, 1), the maximal F_ST occurs at a set of p_k,i values for which all alleles are private: that is, for each i ≥ 1, p_k,i > 0 for at most one value of k. (ii) We determine the set of p_k,i values that, with all alleles private, maximizes F_ST.

(i) In equation (A 2), note that $σ_{1}^{2} - S_{1} = 2 \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, 1} p_{ℓ, 1}$ . Because $σ_{1}^{2} - S_{1}$ is subtracted from both numerator and denominator in equation (A 2), the quantity in equation (A 2) is maximal when $σ_{1}^{2} - S_{1}$ is minimal. In other words, the upper bound on F_ST is maximal if and only if $2 \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, 1} p_{ℓ, 1}$ is minimal.

Because σ₁ < 1, a minimum of 0 for $2 \sum_{k = 1}^{K - 1} \sum_{ℓ = k + 1}^{K} p_{k, 1} p_{ℓ, 1}$ is achieved if and only if there is a single value k = k′ at which p_k′,1 = σ₁, so that p_k,1 = 0 for all k ≠ k′. We then have $σ_{1}^{2} = S_{1}$ , and from equation (A 2),

$F_{S T} \leq \frac{(K - 1) S}{K^{2} - S} .$ A 3

Each allele is private, and because allele 1 is the most frequent, p_k,i lies in [0, σ₁] for all (k, i).

(ii) The problem of finding the set of p_k,i values that maximizes F_ST has now been reduced to the problem of maximizing the right-hand side of equation (A 3), with the constraint that all alleles are private. Because the numerator in equation (A 3) increases with S and the denominator decreases with S, the maximum is achieved if and only if S achieves its maximal value. In other words, we seek to maximize $S = \sum_{k = 1}^{K} \sum_{i = 1}^{\infty} p_{k, i}^{2}$ , with the constraints $\sum_{i = 1}^{\infty} p_{k, i} = 1$ and p_k,i ≤ σ₁ for each (k, i) with 1 ≤ k ≤ K and i ≥ 1. Because each allele is private, the maximum is achieved by separately maximizing each $\sum_{i = 1}^{\infty} p_{k, i}^{2}$ with constraints $\sum_{i = 1}^{\infty} p_{k, i} = 1$ and p_k,i ≤ σ₁.

This maximization is precisely that of lemma 3 of Rosenberg & Jakobsson [40]. Applying the lemma, the maximum is achieved with p_k,1 = p_k,2 = … = p_k,J−1 = σ₁, p_k,J = 1 − (J − 1)σ₁, and p_k,i = 0 for i > J, where $J = ⌈ σ_{1}^{- 1} ⌉$ . It satisfies $\sum_{i = 1}^{\infty} p_{k, i}^{2} \leq 1 - σ_{1} (J - 1) (2 - J σ_{1})$ . In other words, each subpopulation k possesses J − 1 private alleles with frequency σ₁ and one private allele with frequency 1 − (J − 1)σ₁. Hence, S ≤ K[1 − σ₁(J − 1)(2 − Jσ₁)], so that equation (A 3) leads to equation (3.1) for σ₁ in (0, 1).
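A small worked case may help fix ideas. The values K = 2 and σ₁ = 0.4 are chosen purely for illustration.

```latex
\begin{align*}
K &= 2, \qquad \sigma_{1} = 0.4, \qquad J = \lceil 1 / 0.4 \rceil = 3, \\
(p_{k, 1}, p_{k, 2}, p_{k, 3}) &= (0.4,\ 0.4,\ 0.2)
  \quad \text{on alleles private to subpopulation } k, \\
\sum_{i} p_{k, i}^{2} &= 0.16 + 0.16 + 0.04 = 0.36
  = 1 - 0.4 \cdot (3 - 1) \cdot (2 - 3 \cdot 0.4), \\
S &= 2 \cdot 0.36 = 0.72, \qquad
F_{ST} \leq \frac{(2 - 1) \cdot 0.72}{2^{2} - 0.72} = \frac{0.72}{3.28} \approx 0.2195 .
\end{align*}
```

Six distinct alleles are present in total, three in each subpopulation, in agreement with the count $K ⌈ σ_{1}^{- 1} ⌉$ .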

(e) The case of non-integer σ₁ in (1, K)

This section finds the set of values of the p_k,i that maximizes F_ST for non-integer σ₁ in (1, K). For non-integer $σ_{1} = \sum_{k = 1}^{K} p_{k, 1}$ in (1, K), because 0 ≤ p_k,1 ≤ 1 for all k, p_k,1 > 0 for at least two values of k. Writing S* = S − S₁, equation (A 2) can be rewritten

$F_{S T} \leq \frac{K S_{1} + (K - 1) S^{*} - σ_{1}^{2}}{K^{2} - S^{*} - σ_{1}^{2}} .$ A 4

Because the numerator increases with S₁, and because the numerator increases with S* and the denominator decreases with S*, the upper bound on F_ST is greatest when both S₁ and S* are maximized subject to $\sum_{i = 1}^{\infty} p_{k, i} = 1$ for each k and $\sum_{k = 1}^{K} p_{k, i} \leq σ_{1}$ for each i. If S₁ and S* can be simultaneously maximized at the same set of values of the p_k,i, then this set of values of the p_k,i achieves the maximal F_ST.

We proceed in three steps. (i) First, we find the set of values of the p_k,i that maximizes S₁. (ii) Next, we find the set of values that maximizes S*. (iii) We then conclude that because the same set maximizes both S₁ and S* separately, this set achieves the upper bound in equation (A 4), and hence in equation (A 2).

(i) We first maximize S₁ for fixed non-integer σ₁ in (1, K). More precisely, we seek to maximize $S_{1} = \sum_{k = 1}^{K} p_{k, 1}^{2}$ with constraints $\sum_{k = 1}^{K} p_{k, 1} = σ_{1}$ and p_k,1 ≤ 1 for each k from 1 to K. This maximization is precisely that performed in theorem 1 from Alcala & Rosenberg [14], a corollary of lemma 3 of Rosenberg & Jakobsson [40]. Applying the theorem, and writing $\{σ_{1}\} = σ_{1} - ⌊ σ_{1} ⌋$ for the fractional part of σ₁, the maximum is achieved by setting $p_{1, 1} = p_{2, 1} = \dots = p_{⌊ σ_{1} ⌋, 1} = 1$ , $p_{⌊ σ_{1} ⌋ + 1, 1} = \{σ_{1}\}$ , and p_k,1 = 0 for all $k > ⌊ σ_{1} ⌋ + 1$ . The maximal value of S₁ is $\{σ_{1}\}^{2} + ⌊ σ_{1} ⌋$ .

(ii) Next, we maximize $S^{*} = \sum_{i = 2}^{\infty} \sum_{k = 1}^{K} p_{k, i}^{2}$ . Because, by theorem A.2, all alleles with i ≥ 2 are private at the set of values of the p_k,i that maximizes F_ST for fixed non-integer σ₁, each non-zero p_k,i for i ≥ 2 is equal to the associated σ_i. The sum of the frequencies of all alleles across all subpopulations is $\sum_{i = 1}^{\infty} σ_{i} = K$ , so that $\sum_{i = 2}^{\infty} σ_{i} = K - σ_{1}$ . The problem of maximizing S* is the problem of maximizing $S^{*} = \sum_{i = 2}^{\infty} σ_{i}^{2}$ with the constraints $\sum_{i = 2}^{\infty} σ_{i} = K - σ_{1}$ and σ_i ≤ 1 for each i from 2 to ∞. This maximization is again that performed in lemma 3 of Rosenberg & Jakobsson [40]. Applying the lemma, the maximum is achieved by setting $σ_{2} = σ_{3} = \dots = σ_{K - ⌊ σ_{1} ⌋} = 1$ , $σ_{K - ⌊ σ_{1} ⌋ + 1} = 1 - \{σ_{1}\}$ , and σ_i = 0 for $i > K - ⌊ σ_{1} ⌋ + 1$ . The maximum is $(1 - \{σ_{1}\})^{2} + (K - ⌊ σ_{1} ⌋ - 1)$ .

(iii) S₁ is maximized at a set of p_k,i for which $⌊ σ_{1} ⌋$ subpopulations are fixed for allele 1, allele 1 has frequency {σ₁} in one subpopulation and allele 1 has frequency 0 in all other subpopulations. S* is maximized at a set of p_k,i for which $K - ⌊ σ_{1} ⌋ - 1$ subpopulations are fixed, each for a distinct allele i with i ≥ 2, one subpopulation possesses a distinct allele i ≥ 2 with frequency 1 − {σ₁}, and all $⌊ σ_{1} ⌋$ other subpopulations possess no alleles i ≥ 2 of non-zero frequency.

The upper bound in equation (A 4) depends on both S₁ and S*, each of which depends on the p_k,i. Were the set of values of the p_k,i that maximizes S₁ and the set of values of the p_k,i that maximizes S* to differ, additional work would be required to find the set of values of the p_k,i that maximizes F_ST. However, we now observe that S₁ and S* can be simultaneously maximized at the same set of values of p_k,i, so that the same set of values of the p_k,i maximizes S₁ and S* and hence F_ST. In particular, $⌊ σ_{1} ⌋$ subpopulations are fixed for allele 1, each of $K - ⌊ σ_{1} ⌋ - 1$ subpopulations is fixed for its own private allele, and a single subpopulation possesses allele 1 with frequency {σ₁} and a private allele with frequency 1 − {σ₁}. The number of alleles of non-zero frequency is $K - ⌊ σ_{1} ⌋ + 1$ . Only the most frequent allele is shared by more than one subpopulation, and a single subpopulation possesses more than one allele of non-zero frequency.

Substituting the maximal values of S₁ and S* into equation (A 4), for non-integer σ₁ in (1, K), we obtain the maximal F_ST in terms of σ₁ shown in equation (3.1).
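The maximizing configuration described in step (iii) can be built explicitly and compared with the bound in equation (A 4). This is a sketch under the assumption that equation (2.1) is F_ST = (H_T − H_S)/H_T with equally weighted subpopulations; the function names and the column ordering of alleles are illustrative.

```python
import math

def maximizing_frequencies(sigma1, K):
    """Frequency matrix p[k][i] for non-integer sigma1 in (1, K):
    floor(sigma1) subpopulations fixed for allele 1, one mixed subpopulation,
    and the rest each fixed for its own private allele."""
    f = math.floor(sigma1)
    frac = sigma1 - f                    # fractional part {sigma1}
    n_alleles = K - f + 1
    p = [[0.0] * n_alleles for _ in range(K)]
    for k in range(f):
        p[k][0] = 1.0                    # fixed for allele 1
    p[f][0] = frac                       # allele 1 in the mixed subpopulation
    p[f][1] = 1.0 - frac                 # its private allele
    for j, k in enumerate(range(f + 1, K)):
        p[k][2 + j] = 1.0                # fixed for a distinct private allele
    return p

def fst_direct(p):
    """F_ST = (H_T - H_S) / H_T (assumed form of equation (2.1))."""
    K, I = len(p), len(p[0])
    H_S = 1 - sum(x * x for row in p for x in row) / K
    H_T = 1 - sum((sum(row[i] for row in p) / K) ** 2 for i in range(I))
    return (H_T - H_S) / H_T

def bound_a4(sigma1, K):
    """Right-hand side of equation (A 4) at the maximal S_1 and S*."""
    f = math.floor(sigma1)
    frac = sigma1 - f
    S1 = f + frac ** 2
    S_star = (K - f - 1) + (1 - frac) ** 2
    return (K * S1 + (K - 1) * S_star - sigma1 ** 2) / (K ** 2 - S_star - sigma1 ** 2)

if __name__ == "__main__":
    p = maximizing_frequencies(1.5, 3)
    print(fst_direct(p), bound_a4(1.5, 3))
```

For K = 3 and σ₁ = 1.5, the constructed matrix has rows (1, 0, 0), (0.5, 0.5, 0) and (0, 0, 1), and both computations give 8/11.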

Data accessibility

Data are publicly available as described in the references cited. MS commands are provided in electronic supplementary material [39].

Authors' contributions

N.A. and N.A.R. designed the study. N.A. analysed the data. N.A. and N.A.R. wrote the manuscript.

Both authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Competing interests

Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer/World Health Organization.

Funding

Support was provided by NIH grant no. R01 HG005855, NSF grant no. BCS-2116322 and a France-Stanford Center for Interdisciplinary Studies grant.

References

1. Nei M. 1973. Analysis of gene diversity in subdivided populations. Proc. Natl Acad. Sci. USA 70, 3321-3323. (doi:10.1073/pnas.70.12.3321)
2. Holsinger KE, Weir BS. 2009. Genetics in geographically structured populations: defining, estimating and interpreting F_ST. Nat. Rev. Genet. 10, 639-650. (doi:10.1038/nrg2611)
3. Jin L, Chakraborty R. 1995. Population structure, stepwise mutations, heterozygote deficiency and their implications in DNA forensics. Heredity 74, 274-285. (doi:10.1038/hdy.1995.41)
4. Charlesworth B. 1998. Measures of divergence between populations and the effect of forces that reduce variability. Mol. Biol. Evol. 15, 538-543. (doi:10.1093/oxfordjournals.molbev.a025953)
5. Nagylaki T. 1998. Fixation indices in subdivided populations. Genetics 148, 1325-1332. (doi:10.1093/genetics/148.3.1325)
6. Hedrick PW. 1999. Highly variable loci and their interpretation in evolution and conservation. Evolution 53, 313-318. (doi:10.1111/j.1558-5646.1999.tb03767.x)
7. Balloux F, Brünner H, Lugon-Moulin N, Hausser J, Goudet J. 2000. Microsatellites can be misleading: an empirical and simulation study. Evolution 54, 1414-1422. (doi:10.1111/j.0014-3820.2000.tb00573.x)
8. Long JC, Kittles RA. 2003. Human genetic diversity and the nonexistence of biological races. Hum. Biol. 75, 449-471. (doi:10.1353/hub.2003.0058)
9. Hedrick PW. 2005. A standardized genetic differentiation measure. Evolution 59, 1633-1638. (doi:10.1111/j.0014-3820.2005.tb01814.x)
10. Maruki T, Kumar S, Kim Y. 2012. Purifying selection modulates the estimates of population differentiation and confounds genome-wide comparisons across single-nucleotide polymorphisms. Mol. Biol. Evol. 29, 3617-3623. (doi:10.1093/molbev/mss187)
11. Rosenberg NA, Li LM, Ward R, Pritchard JK. 2003. Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73, 1402-1422. (doi:10.1086/380416)
12. Jakobsson M, Edge MD, Rosenberg NA. 2013. The relationship between F_ST and the frequency of the most frequent allele. Genetics 193, 515-528. (doi:10.1534/genetics.112.144758)
13. Edge MD, Rosenberg NA. 2014. Upper bounds on F_ST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theor. Popul. Biol. 97, 20-34. (doi:10.1016/j.tpb.2014.08.001)
14. Alcala N, Rosenberg NA. 2017. Mathematical constraints on F_ST: biallelic markers in arbitrarily many populations. Genetics 206, 1581-1600. (doi:10.1534/genetics.116.199141)
15. Alcala N, Rosenberg NA. 2019. G_ST′, Jost’s D, and F_ST are similarly constrained by allele frequencies: a mathematical, simulation, and empirical study. Mol. Ecol. 28, 1624-1636. (doi:10.1111/mec.15000)
16. Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358-1370. (doi:10.1111/j.1558-5646.1984.tb05657.x)
17. Whitlock MC. 2011. G′_ST and D do not replace F_ST. Mol. Ecol. 20, 1083-1091. (doi:10.1111/j.1365-294X.2010.04996.x)
18. Rousset F. 2013. Exegeses on maximum genetic differentiation. Genetics 194, 557-559. (doi:10.1534/genetics.113.152132)
19. Alcala N, Goudet J, Vuilleumier S. 2014. On the transition of genetic differentiation from isolation to panmixia: what we can learn from G_ST and D. Theor. Popul. Biol. 93, 75-84. (doi:10.1016/j.tpb.2014.02.003)
20. Wang J. 2015. Does G_ST underestimate genetic differentiation from marker data? Mol. Ecol. 24, 3546-3558. (doi:10.1111/mec.13204)
21. Hudson RR. 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18, 337-338. (doi:10.1093/bioinformatics/18.2.337)
22. Maruyama T. 1970. Effective number of alleles in a subdivided population. Theor. Popul. Biol. 1, 273-306. (doi:10.1016/0040-5809(70)90047-X)
23. Wakeley J. 1998. Segregating sites in Wright’s island model. Theor. Popul. Biol. 53, 166-174. (doi:10.1006/tpbi.1997.1355)
24. Fu R, Gelfand AE, Holsinger KE. 2003. Exact moment calculations for genetic models with migration, mutation, and drift. Theor. Popul. Biol. 63, 231-243. (doi:10.1016/S0040-5809(03)00003-0)
25. Long JC. 2009. Update to Long and Kittles’s ‘Human genetic diversity and the nonexistence of biological races’ (2003): fixation on an index. Hum. Biol. 81, 799-803. (doi:10.3378/027.081.0622)
26. Pemberton TJ, DeGiorgio M, Rosenberg NA. 2013. Population structure in a comprehensive genomic data set on human microsatellite variation. G3: Genes Genom. Genet. 3, 891-907. (doi:10.1534/g3.113.005728)
27. Becquet C, Patterson N, Stone AC, Przeworski M, Reich D. 2007. Genetic structure of chimpanzee populations. PLoS Genet. 3, e66. (doi:10.1371/journal.pgen.0030066)
28. Algee-Hewitt BF, Edge MD, Kim J, Li JZ, Rosenberg NA. 2016. Individual identifiability predicts population identifiability in forensic microsatellite markers. Curr. Biol. 26, 935-942. (doi:10.1016/j.cub.2016.01.065)
29. Maróstica AS, Nunes K, Castelli EC, Silva NSB, Weir BS, Goudet J, Meyer D. 2022. How HLA diversity is apportioned: influence of selection and relevance to transplantation. Phil. Trans. R. Soc. B 377, 20200420. (doi:10.1098/rstb.2020.0420)
30. Beaumont MA. 2010. Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41, 379-406. (doi:10.1146/annurev-ecolsys-102209-144621)
31. Schrider DR, Kern AD. 2018. Supervised machine learning for population genetics: a new paradigm. Trends Genet. 34, 301-312. (doi:10.1016/j.tig.2017.12.005)
32. Hoban S, et al. 2016. Finding the genomic basis of local adaptation: pitfalls, practical solutions, and future directions. Am. Nat. 188, 379-397. (doi:10.1086/688018)
33. Jost L. 2008. G_ST and its relatives do not measure differentiation. Mol. Ecol. 17, 4015-4026. (doi:10.1111/j.1365-294X.2008.03887.x)
34. Lewontin RC. 1972. The apportionment of human diversity. Evol. Biol. 6, 381-398. (doi:10.1007/978-1-4684-9063-3_14)
35. Brown RA, Armelagos GJ. 2001. Apportionment of racial diversity: a review. Evol. Anthropol. 10, 34-40.
36. Ruvolo M, Seielstad MT. 2001. ‘The apportionment of human diversity’ 25 years later. In Thinking about evolution: historical, philosophical, and political perspectives (eds RS Singh, CS Krimbas, DB Paul, J Beatty), pp. 141–151. Cambridge, UK: Cambridge University Press.
37. Novembre J. 2022. The background and legacy of Lewontin’s apportionment of human genetic diversity. Phil. Trans. R. Soc. B 377, 20200406. (doi:10.1098/rstb.2020.0406)
38. Shen H, Feldman MW. 2022. Diversity and its causes: Lewontin on racism, biological determinism and the adaptationist programme. Phil. Trans. R. Soc. B 377, 20200417. (doi:10.1098/rstb.2020.0417)
39. Alcala N, Rosenberg NA. 2022. Mathematical constraints on F_ST: multiallelic markers in arbitrarily many populations. Figshare.
40. Rosenberg NA, Jakobsson M. 2008. The relationship between homozygosity and the frequency of the most frequent allele. Genetics 179, 2027-2036. (doi:10.1534/genetics.107.084772)


[RSTB20200414C37] 37.Novembre J. 2022. The background and legacy of Lewontin’s apportionment of human genetic diversity. Phil. Trans. R. Soc. B 377, 20200406. ( 10.1098/rstb.2020.0406) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSTB20200414C38] 38.Shen H, Feldman MW. 2022. Diversity and its causes: Lewontin on racism, biological determinism and the adaptationist programme. Phil. Trans. R. Soc. B 377, 20200417. ( 10.1098/rstb.2020.0417) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSTB20200414C39] 39.Alcala N, Rosenberg NA. 2022. Mathematical constraints on F_ST: multiallelic markers in arbitrarily many populations. Figshare. [DOI] [PMC free article] [PubMed]

[RSTB20200414C40] 40.Rosenberg NA, Jakobsson M. 2008. The relationship between homozygosity and the frequency of the most frequent allele. Genetics 179, 2027-2036. ( 10.1534/genetics.107.084772) [DOI] [PMC free article] [PubMed] [Google Scholar]
