Mathematical Constraints on FST: Biallelic Markers in Arbitrarily Many Populations

Nicolas Alcala; Noah A Rosenberg

doi:10.1534/genetics.116.199141

Genetics. 2017 May 5;206(3):1581–1600. doi: 10.1534/genetics.116.199141

PMCID: PMC5500152 PMID: 28476869

Abstract

$F_{ST}$ is one of the most widely used statistics in population genetics. Recent mathematical studies have identified constraints that challenge interpretations of $F_{ST}$ as a measure with potential to range from 0 for genetically similar populations to 1 for divergent populations. We generalize results obtained for population pairs to arbitrarily many populations, characterizing the mathematical relationship between $F_{ST},$ the frequency M of the more frequent allele at a polymorphic biallelic marker, and the number of subpopulations K. We show that for fixed K, $F_{ST}$ has a peculiar constraint as a function of M, with a maximum of 1 only if $M = i / K,$ for integers i with $⌈ K / 2 ⌉ \leq i \leq K - 1.$ For fixed M, as K grows large, the range of $F_{ST}$ becomes the closed or half-open unit interval. For fixed K, however, some $M < (K - 1) / K$ always exists at which the upper bound on $F_{ST}$ lies below $2 \sqrt{2} - 2 \approx 0.8284.$ We use coalescent simulations to show that under weak migration, $F_{ST}$ depends strongly on M when K is small, but not when K is large. Finally, examining data on human genetic variation, we use our results to explain the generally smaller $F_{ST}$ values between pairs of continents relative to global $F_{ST}$ values. We discuss implications for the interpretation and use of $F_{ST} .$

Keywords: allele frequency, F_ST, genetic differentiation, migration, population structure

Genetic differentiation, in which individuals from the same subpopulation are more genetically similar than are individuals from different subpopulations, is a central concept in population genetics. It can arise from a large variety of processes, including from aspects of the physical environment such as geographic barriers, variable permeability to migrants, and spatially heterogeneous selection pressures, as well as from biotic phenomena such as assortative mating and self-fertilization. Genetic differentiation among populations is thus a pervasive feature of population-genetic variation.

To measure genetic differentiation, Wright (1951) introduced the fixation index $F_{ST},$ defined as the “correlation between random gametes, drawn from the same subpopulation, relative to the total.” Many definitions of $F_{ST}$ and related statistics have since been proposed (reviewed by Holsinger and Weir 2009). $F_{ST}$ is often defined in terms of a ratio involving mean heterozygosity of a set of subpopulations, $H_{S},$ and “total heterozygosity” of a population formed by pooling the alleles of the subpopulations, $H_{T}$ (Nei 1973):

F_{ST} = \frac{H_{T} - H_{S}}{H_{T}} .

(1)

For a polymorphic biallelic marker whose more frequent allele has mean frequency M across K subpopulations, denoting by $p_{k}$ the frequency of the allele in subpopulation k, $H_{S} = 1 - (1 / K) \sum_{k = 1}^{K} [p_{k}^{2} + {(1 - p_{k})}^{2}]$ and $H_{T} = 1 - [M^{2} + {(1 - M)}^{2}] .$
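As a minimal illustration of Equation 1 for a biallelic marker, the following Python sketch computes $H_{S},$ $H_{T},$ and $F_{ST}$ from a list of subpopulation frequencies $p_{k} .$ The function name `fst_biallelic` is ours and does not come from the article; subpopulations are weighted equally, as in the definitions above.

```python
def fst_biallelic(p):
    """F_ST = (H_T - H_S) / H_T for a biallelic marker (Equation 1).

    p: frequencies p_k of one allele in K equally weighted subpopulations.
    """
    K = len(p)
    M = sum(p) / K  # mean frequency of the allele across subpopulations
    H_S = 1 - sum(pk**2 + (1 - pk)**2 for pk in p) / K
    H_T = 1 - (M**2 + (1 - M)**2)
    if H_T == 0:
        raise ValueError("locus is monomorphic; F_ST is undefined")
    return (H_T - H_S) / H_T


print(fst_biallelic([1.0, 0.0]))  # fixed difference between two subpopulations
print(fst_biallelic([0.6, 0.6]))  # identical frequencies in both subpopulations
```

The first call corresponds to complete differentiation and the second to no differentiation, the two extremes that $F_{ST}$ is usually taken to represent.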

$F_{ST}$ and related statistics have a wide range of applications. For example, $F_{ST}$ is used as a descriptive statistic whose values are routinely reported in empirical population-genetic studies (Holsinger and Weir 2009). It is considered as a test statistic for spatially divergent selection, either acting on a locus (Lewontin and Krakauer 1973; Bonhomme et al. 2010) or, using comparisons to a corresponding phenotypic statistic $Q_{ST},$ on a trait (Leinonen et al. 2013). $F_{ST}$ is also used as a summary statistic for demographic inference, to measure gene flow between subpopulations (Slatkin 1985); or via approximate Bayesian computation, to estimate demographic parameters (Cornuet et al. 2008).

Applications of $F_{ST}$ generally assume that values near 0 indicate that there are almost no genetic differences among subpopulations, and that values near 1 indicate that subpopulations are genetically different (Hartl and Clark 1997; Frankham et al. 2002; Holsinger and Weir 2009). Mathematical studies, however, have challenged the simplicity of this interpretation, commenting that the range of values that $F_{ST}$ can take is considerably restricted by the allele frequency distribution (Table 1). Such studies have highlighted a direct relationship between allele frequencies and constraints on the range of $F_{ST}$ through functions of the allele frequency distribution such as the mean heterozygosity across subpopulations, $H_{S} .$ The maximal $F_{ST}$ has been shown to decrease as a function of $H_{S},$ both for an infinite (Hedrick 1999) and for a fixed finite number of subpopulations $K \geq 2$ (Long and Kittles 2003; Hedrick 2005). Consequently, if subpopulations differ in their alleles but separately have high heterozygosity, then $H_{S}$ can be high and $F_{ST}$ can be low; $F_{ST}$ can be near 0 even if subpopulations are completely genetically different in the sense that no allele occurs in more than one subpopulation.
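The last point can be made concrete with a small hypothetical example (the allele labels and frequencies are invented for illustration): two subpopulations that share no alleles, each carrying 10 private alleles at frequency 0.1. A sketch in Python, with function names of our own choosing:

```python
def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1 - sum(f * f for f in freqs)


def fst_multiallelic(subpops):
    """F_ST = (H_T - H_S) / H_T for equally weighted subpopulations.

    subpops: list of dicts mapping allele label -> frequency.
    """
    K = len(subpops)
    H_S = sum(heterozygosity(s.values()) for s in subpops) / K
    pooled = {}
    for s in subpops:
        for allele, f in s.items():
            pooled[allele] = pooled.get(allele, 0.0) + f / K
    H_T = heterozygosity(pooled.values())
    return (H_T - H_S) / H_T


# Hypothetical data: no allele occurs in more than one subpopulation.
pop1 = {f"a{j}": 0.1 for j in range(10)}
pop2 = {f"b{j}": 0.1 for j in range(10)}
print(fst_multiallelic([pop1, pop2]))
```

Here $H_{S} = 0.9$ and $H_{T} = 0.95,$ so $F_{ST} = 0.05 / 0.95,$ roughly 0.05, even though the two subpopulations have no alleles in common.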

Table 1. Studies describing the mathematical constraints on F_ST.

Reference	Number of alleles	Number of subpopulations	Variable in terms of which constraints are reported^a
Hedrick (1999)	unspecified value $\geq 2$	$∞$	$H_{S}$
Long and Kittles (2003)	unspecified value $\geq 2$	fixed finite value $\geq 2$	$H_{S}$
Rosenberg et al. (2003)	2	2	δ
Hedrick (2005)	unspecified value $\geq 2$	fixed finite value $\geq 2$	$H_{S}$
Maruki et al. (2012)	2	2	$H_{S},$ M
Jakobsson et al. (2013)	unspecified value $\geq 2$	2	$H_{T},$ M
Edge and Rosenberg (2014)	fixed finite value $\geq 2$	2	$H_{T},$ M
This article	2	fixed finite value $\geq 2$	M

$H_{S}$ and $H_{T}$ denote the within-subpopulation and total heterozygosities, respectively. δ denotes the absolute difference in the frequency of a specific allele between two subpopulations, and M denotes the frequency of the most frequent allele.

^a Instead of heterozygosities $H_{S}$ or $H_{T},$ some studies consider homozygosities $J_{S} = 1 - H_{S}$ or $J_{T} = 1 - H_{T} .$

Detailed mathematical results have clarified the relationship between allele frequencies and $F_{ST}$ in the case of $K = 2$ subpopulations. Considering a biallelic marker, Maruki et al. (2012) evaluated the constraint on $F_{ST}$ by the frequency M of the most frequent allele: the maximal $F_{ST}$ decreases monotonically from 1 to 0 with increasing M, $1 / 2 \leq M < 1.$ Jakobsson et al. (2013) extended this result to multiallelic markers with an unspecified number of distinct alleles, showing that the maximal $F_{ST}$ increases from 0 to 1 as a function of M when $0 < M < 1 / 2,$ and decreases from 1 to 0 when $1 / 2 \leq M < 1$ in the manner reported by Maruki et al. (2012). Edge and Rosenberg (2014) generalized these results to the case of a fixed finite number of alleles, showing that the maximal $F_{ST}$ differs slightly from the unspecified case when the fixed number of distinct alleles is odd.

In this study, we characterize the relationship between $F_{ST}$ and the frequency M of the most frequent allele, for a biallelic marker and an arbitrary number of subpopulations K. We derive the mathematical upper bound on $F_{ST}$ in terms of M, extending the biallelic two-subpopulation result to arbitrary K. To assist in interpreting the bound, we simulate the joint distribution of $F_{ST}$ and M in the island migration model, describing its properties as a function of the number of subpopulations and the migration rate. The K-population upper bound on $F_{ST}$ as a function of M facilitates an explanation of counterintuitive aspects of global human genetic differentiation. We discuss the importance of the results for applications of $F_{ST}$ more generally.

Mathematical Constraints

Model

Our goal is to derive the range of values $F_{ST}$ can take (that is, the lower and upper bounds on $F_{ST}$) as a function of the frequency M of the most frequent allele for a biallelic marker when the number of subpopulations K is a fixed finite value ≥2. We consider a polymorphic locus with two alleles, A and a, in a setting with K subpopulations contributing equally to the total population. We denote the frequency of allele A in subpopulation k by $p_{k} .$ The frequency of allele a in subpopulation k is $1 - p_{k} .$ Each allele frequency $p_{k}$ lies in the interval $[0, 1] .$

The mean frequency of allele A across the subpopulations is $M = (1 / K) \sum_{k = 1}^{K} p_{k},$ and the mean frequency of allele a is $1 - M .$ Without loss of generality, we assume that allele A is the more frequent allele in the total population, so that $M \geq 1 / 2 \geq 1 - M .$ Because by assumption the locus is polymorphic, $M \neq 1.$

We assume that the allele frequencies M and $p_{k}$ are parametric allele frequencies of the total population and subpopulations, and not estimated values computed from data.

F_ST as a function of M

Equation 1 expresses $F_{ST}$ as a ratio involving within-subpopulation heterozygosity, $H_{S},$ and total heterozygosity, $H_{T} .$ We substitute $H_{S}$ and $H_{T}$ in Equation 1 with their respective expressions in terms of allele frequencies:

F_{ST} = \frac{\frac{1}{K} \sum_{k = 1}^{K} [p_{k}^{2} + {(1 - p_{k})}^{2}] - [M^{2} + {(1 - M)}^{2}]}{1 - [M^{2} + {(1 - M)}^{2}]} .

(2)

Simplifying Equation 2 by noting that $\sum_{k = 1}^{K} p_{k} = K M$ leads to:

F_{ST} = \frac{(\frac{1}{K} \sum_{k = 1}^{K} p_{k}^{2}) - M^{2}}{M (1 - M)} .

(3)
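For readers who want the intermediate algebra, the step from Equation 2 to Equation 3 can be written out as follows, using only $\sum_{k = 1}^{K} p_{k} = K M .$ The labels "numerator" and "denominator" refer to Equation 2; the common factor of 2 cancels in the ratio.

```latex
\begin{aligned}
\frac{1}{K}\sum_{k=1}^{K}\left[p_k^2 + (1-p_k)^2\right]
  &= \frac{1}{K}\sum_{k=1}^{K}\left(2p_k^2 - 2p_k + 1\right)
   = \frac{2}{K}\sum_{k=1}^{K} p_k^2 - 2M + 1, \\
M^2 + (1-M)^2 &= 2M^2 - 2M + 1, \\
\text{numerator} &= 2\left[\left(\frac{1}{K}\sum_{k=1}^{K} p_k^2\right) - M^2\right], \\
\text{denominator} &= 1 - \left(2M^2 - 2M + 1\right) = 2M(1-M).
\end{aligned}
```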

For fixed M, we seek the vectors $(p_{1}, p_{2}, \dots, p_{K}),$ with $p_{k} \in [0, 1]$ and $(1 / K) \sum_{k = 1}^{K} p_{k} = M,$ that minimize and maximize $F_{ST} .$

We clarify here that in mathematical analysis of the relationship between $F_{ST}$ and allele frequencies, we adopt an interpretation of $F_{ST}$ as a “statistic” that describes a mathematical function of allele frequencies rather than as a “parameter” that describes coancestry of individuals in a population. Multiple interpretations of $F_{ST}$ exist, giving rise to different expressions for computing it (e.g., Nei 1973; Nei and Chesser 1983; Weir and Cockerham 1984; Holsinger and Weir 2009). Our interest is in disentangling properties of the mathematical function by which true allele frequencies are used to compute $F_{ST}$ from the population-genetic relationships among individuals and sampling phenomena that could be viewed as affecting the computation. As a result, it is natural to follow the statistic interpretation that has been used in earlier scenarios involving $F_{ST}$ bounds in relation to allele frequencies viewed as parameters, rather than as estimates or outcomes of a stochastic process (Table 1), and in which such a disentanglement is possible. We return to this topic in the Discussion.

Lower bound

From Equation 3, for all $M \in [1 / 2, 1),$ setting $p_{k} = M$ in all subpopulations k yields $F_{ST} = 0.$ The Cauchy–Schwarz inequality guarantees that $K \sum_{k = 1}^{K} p_{k}^{2} \geq {(\sum_{k = 1}^{K} p_{k})}^{2},$ with equality if and only if $p_{1} = p_{2} = \dots = p_{K} .$ Dividing both sides by $K^{2}$ gives $(1 / K) \sum_{k = 1}^{K} p_{k}^{2} \geq M^{2},$ so the numerator of Equation 3 is nonnegative, and it equals 0 if and only if $p_{1} = p_{2} = \dots = p_{K} = M .$ Examining Equation 3, $(p_{1}, p_{2}, \dots, p_{K}) = (M, M, \dots, M)$ is thus the only vector that yields $F_{ST} = 0.$ We can conclude that the lower bound on $F_{ST}$ is equal to 0 irrespective of M, for any value of the number of subpopulations K.

Upper bound

To derive the upper bound on $F_{ST}$ in terms of M, we must maximize $F_{ST}$ in Equation 3, assuming that M and K are constant. Because all terms in Equation 3 depend only on M and K except the positive term $\sum_{k = 1}^{K} p_{k}^{2}$ in the numerator, maximizing $F_{ST}$ corresponds to maximizing $\sum_{k = 1}^{K} p_{k}^{2}$ at fixed M and K.

Denote by $⌊ x ⌋$ the greatest integer less than or equal to x, and by $\{x\} = x - ⌊ x ⌋$ the fractional part of x. Using a result from Rosenberg and Jakobsson (2008), Theorem 1 from Appendix A states that the maximum for $\sum_{k = 1}^{K} p_{k}^{2}$ satisfies

\sum_{k = 1}^{K} p_{k}^{2} \leq ⌊ K M ⌋ + \{K M\}^{2},

(4)

with equality if and only if allele A has frequency 1 in $⌊ K M ⌋$ subpopulations, frequency $\{K M\}$ in a single subpopulation, and frequency 0 in all other subpopulations. Substituting Equation 4 into Equation 3, we obtain the upper bound for $F_{ST} :$

F_{ST} \leq \frac{⌊ K M ⌋ + \{K M\}^{2} - K M^{2}}{K M (1 - M)} .

(5)

The upper bound on $F_{ST}$ in terms of M has a piecewise structure, with changes in shape occurring when $K M$ is an integer.
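The bound in Equation 5 and the frequency vector that attains it can be checked numerically. The sketch below uses exact rational arithmetic (Python's `fractions` module) so that the floor and fractional part of $K M$ are not affected by rounding; the function names and the illustrative values $M = 3 / 5,$ $K = 7$ are our own choices.

```python
from fractions import Fraction
from math import floor


def fst_from_freqs(p):
    """Equation 3: F_ST from subpopulation frequencies p_k."""
    K = len(p)
    M = sum(p) / K
    return (sum(pk * pk for pk in p) / K - M * M) / (M * (1 - M))


def fst_upper_bound(M, K):
    """Equation 5: upper bound on F_ST for M in [1/2, 1) and K subpopulations."""
    KM = K * M
    i = floor(KM)  # number of subpopulations fixed for allele A
    x = KM - i     # fractional part of K*M
    return (i + x * x - K * M * M) / (K * M * (1 - M))


def maximizing_vector(M, K):
    """Frequencies attaining the bound (valid for M < 1): 1 in floor(KM)
    subpopulations, the fractional part of KM in one, and 0 elsewhere."""
    KM = K * M
    i = floor(KM)
    x = KM - i
    return [Fraction(1)] * i + [x] + [Fraction(0)] * (K - i - 1)


M, K = Fraction(3, 5), 7  # arbitrary illustrative values
p = maximizing_vector(M, K)
print([str(pk) for pk in p], fst_upper_bound(M, K), fst_from_freqs(p))
```

For these values the bound and the $F_{ST}$ of the maximizing vector agree exactly, and both fall below 1 because $K M = 21 / 5$ is not an integer.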

For $i = ⌊ K / 2 ⌋, ⌊ K / 2 ⌋ + 1, \dots, K - 1,$ define the interval $I_{i}$ by $[1 / 2, (i + 1) / K)$ for $i = ⌊ K / 2 ⌋$ in the case that K is odd and by $[i / K, (i + 1) / K)$ for all other $(i, K) .$ For $M \in I_{i},$ $⌊ K M ⌋$ has a constant value i. Writing $x = K M - ⌊ K M ⌋ = K M - i$ so that $M = (i + x) / K,$ for each interval $I_{i},$ the upper bound on $F_{ST}$ is a smooth function:

Q_{i} (x) = \frac{K (i + x^{2}) - {(i + x)}^{2}}{(i + x) (K - i - x)},

(6)

where x lies in $[0, 1)$ (or in $[1 / 2, 1)$ for odd K and $i = ⌊ K / 2 ⌋$), and i lies in $[⌊ K / 2 ⌋, K - 1] .$
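The substitution that turns Equation 5 into Equation 6 can be spelled out term by term. On interval $I_{i},$ with $⌊ K M ⌋ = i,$ fractional part x, and $M = (i + x) / K :$

```latex
\begin{aligned}
\lfloor KM \rfloor + \{KM\}^2 - K M^2
  &= i + x^2 - \frac{(i+x)^2}{K}
   = \frac{K(i+x^2) - (i+x)^2}{K}, \\
K M (1-M)
  &= (i+x)\,\frac{K-i-x}{K}
   = \frac{(i+x)(K-i-x)}{K}, \\
Q_i(x)
  &= \frac{K(i+x^2) - (i+x)^2}{(i+x)(K-i-x)}.
\end{aligned}
```

The factor $1 / K$ is common to the numerator and denominator and cancels in the ratio.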

The conditions under which the upper bound is reached illuminate its interpretation. The maximum requires the most frequent allele to have frequency 1 or 0 in all except possibly one subpopulation, so that the locus is polymorphic in at most a single subpopulation. Thus, $F_{ST}$ is maximal when fixation is achieved in as many subpopulations as possible.

Figure 1 shows the upper bound on $F_{ST}$ in terms of M for various values of K. It has peaks at values $i / K,$ where it is possible for each of the K subpopulations to be fixed for one allele or the other and for $F_{ST}$ to reach a value of 1. Between consecutive peaks $i / K$ and $(i + 1) / K,$ the function reaches a local minimum, and beyond the last peak it decreases to 0 as M approaches 1. The upper bound is not differentiable at the peaks, and it is smooth and strictly <1 between the peaks. If K is even, then the upper bound begins from a local maximum at $M = 1 / 2;$ if K is odd, it begins from a local minimum at $M = 1 / 2 .$

Figure 1. Bounds on $F_{ST}$ as a function of the frequency of the most frequent allele, M, for different numbers of subpopulations K: (A) $K = 2,$ (B) $K = 3,$ (C) $K = 7,$ (D) $K = 10,$ and (E) $K = 40.$ The shaded region represents the space between the upper and lower bounds on $F_{ST} .$ The upper bound is computed from Equation 5; for each K, the lower bound is 0 for all values of M.

Properties of the upper bound

Local maxima:

We explore properties of the upper bound on $F_{ST}$ as a function of M for fixed K by examining the local maxima and minima. The upper bound is equal to 1 on interval $I_{i}$ if and only if the numerator and denominator in Equation 6 are equal. Noting that $K \geq 2,$ this condition is equivalent to $x^{2} = x$ and hence, because $0 \leq x < 1,$ $x = \{K M\} = 0.$ Thus, on interval $I_{i}$ for M, the maximal $F_{ST}$ is 1 if and only if $K M$ is an integer.

$K M$ has exactly $⌊ K / 2 ⌋$ integer values for $M \in [1 / 2, 1) .$ Consequently, given K, there are exactly $⌊ K / 2 ⌋$ maxima at which $F_{ST}$ can equal 1, at $M = (K + 1) / (2 K), (K + 3) / (2 K), \dots, (2 K - 2) / (2 K)$ if K is odd and at $M = K / (2 K), (K + 2) / (2 K), \dots, (2 K - 2) / (2 K)$ if K is even.
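The count and positions of the peaks can be enumerated directly. The following sketch (function names are ours) lists the values $M = i / K$ in $[1 / 2, 1)$ and confirms with exact rational arithmetic that Equation 5 equals 1 at each of them.

```python
from fractions import Fraction
from math import floor


def peak_positions(K):
    """Values M = i/K in [1/2, 1) at which K*M is an integer."""
    first = (K + 1) // 2  # smallest integer i with i/K >= 1/2
    return [Fraction(i, K) for i in range(first, K)]


def upper_bound(M, K):
    """Equation 5, evaluated exactly with rational arithmetic."""
    i = floor(K * M)
    x = K * M - i
    return (i + x * x - K * M * M) / (K * M * (1 - M))


for K in (2, 3, 7, 10):
    peaks = peak_positions(K)
    reaches_one = all(upper_bound(M, K) == 1 for M in peaks)
    print(K, [str(M) for M in peaks], reaches_one)
```

For each K the list has $⌊ K / 2 ⌋$ entries, matching the count stated above.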

This analysis finds that $F_{ST}$ is unconstrained within the unit interval only for a finite set of values of the frequency M of the most frequent allele. The size of this set increases with the number of subpopulations K.

Local minima:

Equality of the upper bound at the right endpoint of each interval $I_{i}$ and the left endpoint of $I_{i + 1}$ for each i from $⌊ K / 2 ⌋$ to $K - 2$ demonstrates that the upper bound on $F_{ST}$ is a continuous function of M. Consequently, local minima necessarily occur between the local maxima. If K is even, then the upper bound on $F_{ST}$ has $K / 2 - 1$ local minima, each inside an interval $I_{i},$ $i = K / 2, K / 2 + 1, \dots, K - 2.$ If K is odd, then the upper bound has $(K - 1) / 2$ local minima, the first in interval $[1 / 2, (K + 1) / (2 K)),$ and each of the others in an interval $I_{i},$ with $i = (K + 1) / 2, (K + 3) / 2, \dots, K - 2.$ Note that because we restrict attention to $M \in [1 / 2, 1),$ we do not count the point at $M = 1$ and $F_{ST} = 0$ as a local minimum.

Theorem 2 from Appendix B describes the relative positions of the local minima within intervals $I_{i},$ as a function of the number of subpopulations K. From Proposition 1 of Appendix B, for fixed K, the relative position of the local minimum within interval $I_{i}$ increases with i; as a result, the leftmost dips in the upper bound (those near $M = 1 / 2$) are less tilted toward the right endpoints of their associated intervals than are the subsequent dips (nearer $M = 1$). The unique local minimum in interval $I_{i}$ lies either exactly at $M = [i + (1 / 2)] / K = 1 / 2$ for the leftmost dip for odd K (Proposition 2), or slightly to the right of the midpoint $[i + (1 / 2)] / K$ of interval $I_{i}$ in other intervals, but no farther from the center than $M = (i + 2 - \sqrt{2}) / K \approx (i + 0.5858) / K$ (Proposition 3; Figure B1).

The values of the upper bound on $F_{ST}$ at the local minima as a function of i are computed in Appendix B (Equation B5) by substituting the positions M of the local minima into Equation 5. From Proposition 4 of Appendix B, for fixed K, the value of $F_{ST}$ at the local minimum in interval $I_{i}$ decreases as i increases. The maximal $F_{ST}$ among local minima increases as K increases (Proposition 5). The upper bound on $F_{ST}$ at the local minimum closest to $M = 1 / 2$ tends to 1 as $K \to ∞$ (Proposition 5). The upper bound on $F_{ST}$ at the local minimum closest to $M = 1,$ however, is always smaller than $2 \sqrt{2} - 2 \approx 0.8284$ (Proposition 6).

In conclusion, although $F_{ST}$ is constrained below 1 for all values of M in the interior of intervals $I_{i} = [i / K, (i + 1) / K),$ the constraint is reduced as $K \to ∞,$ and in the limit it even completely disappears in the interval $I_{i}$ closest to $M = 1 / 2 .$ Nevertheless, there always exists a value of $M < (K - 1) / K$ for which the upper bound on $F_{ST}$ is lower than $2 \sqrt{2} - 2 \approx 0.8284.$
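The behavior of the dip nearest $M = 1$ can be explored numerically. The sketch below evaluates Equation 6 on a grid in interval $I_{K - 2}$ and reports the position and value of the minimum; it is a grid search rather than the analytical solution of Appendix B, and the function names and grid size are our own choices.

```python
from math import sqrt


def Q(i, x, K):
    """Equation 6: upper bound on F_ST at M = (i + x) / K."""
    return (K * (i + x * x) - (i + x) ** 2) / ((i + x) * (K - i - x))


def last_dip(K, steps=100_000):
    """Grid search for the local minimum in interval I_{K-2}.

    Returns (x, value). Requires K >= 3. For K = 3 the interval starts
    at M = 1/2, so the search is restricted to x >= 1/2.
    """
    i = K - 2
    lo = 0.5 if (K % 2 == 1 and i == K // 2) else 0.0
    grid = (lo + (1 - lo) * j / steps for j in range(steps))
    return min(((x, Q(i, x, K)) for x in grid), key=lambda t: t[1])


limit = 2 * sqrt(2) - 2
for K in (3, 4, 10, 100, 1000):
    x, value = last_dip(K)
    print(K, round(x, 4), round(value, 4), value < limit)
```

In each case the minimum lies below $2 \sqrt{2} - 2,$ and its relative position x stays between $1 / 2$ and $2 - \sqrt{2},$ in line with Propositions 2, 3, and 6.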

Mean range of possible F_ST values:

We now evaluate how strongly M constrains the range of $F_{ST}$ as a function of K. Following similar computations for other settings where $F_{ST}$ is considered in relation to a quantity that constrains it (Jakobsson et al. 2013; Edge and Rosenberg 2014), we compute the mean maximum $F_{ST}$ across all possible M values. This quantity, denoted $A (K),$ is the area between the lower and upper bounds on $F_{ST}$ divided by the length of the domain for M, or $1 / 2 .$ $A (K)$ near 0 indicates a strong constraint.

Because the lower bound on $F_{ST}$ is 0 for all M between $1 / 2$ and 1, $A (K)$ corresponds to the area under the upper bound on $F_{ST}$ divided by $1 / 2,$ or twice the integral of Equation 5 between $1 / 2$ and 1:

A (K) = 2 \int_{M = 1 / 2}^{1} \frac{⌊ K M ⌋ + \{K M\}^{2} - K M^{2}}{K M (1 - M)} d M .

(7)

The integral is computed in Appendix C. We obtain

A (K) = 1 - K + 2 (K + 1) \ln K - \frac{4}{K} \sum_{i = 2}^{K} i \ln i .

(8)

We also obtain an asymptotic approximation $\tilde{A} (K) \sim A (K)$ in Appendix C, where

\tilde{A} (K) = 1 - \frac{\ln K}{3 K} - \frac{4 \ln C}{K} .

(9)

Here, $C \approx 1.2824$ represents the Glaisher–Kinkelin constant.

$A (2) = 2 \ln 2 - 1 \approx 0.3863,$ in accord with the $K = 2$ case of Jakobsson et al. (2013). Interestingly, the constraint on the mean range of $F_{ST}$ disappears as $K \to ∞ .$ Indeed, from Equation 9, we immediately see that $\lim_{K \to ∞} A (K) = 1$ (Figure 2). As a mean of 1 indicates that $F_{ST}$ ranges from 0 to 1 for all M (except possibly on a set of measure 0), for large K, the range of $F_{ST}$ is approximately invariant with respect to M.

Figure 2. The mean $A (K)$ of the upper bound on $F_{ST}$ over the interval $M \in [1 / 2, 1),$ as a function of the number of subpopulations K. $A (K)$ is computed from Equation 8 (black line). The approximation $\tilde{A} (K)$ is computed from Equation 9 (gray dashed line). A numerical computation of the relative error of the approximation as a function of K, $| A (K) - \tilde{A} (K) | / A (K),$ finds that the maximal error for $2 \leq K \leq 1000$ is 0.00174, achieved when $K = 2.$ The x-axis is plotted on a logarithmic scale.

The increase of $A (K)$ with K is monotonic (Theorem 3 of Appendix C). By numerically evaluating Equation 8, we find that although $A (2) \approx 0.3863,$ for $K \geq 7,$ $A (K) > 0.75,$ and for $K \geq 46,$ $A (K) > 0.95.$ Nevertheless, although the mean of the upper bound on $F_{ST}$ approaches 1, we have shown in Proposition 6 from Appendix B that for large K, values of M continue to exist at which the upper bound is constrained substantially below 1.
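Equations 7, 8, and 9 can be compared numerically. The sketch below evaluates the closed form, the asymptotic approximation, and a midpoint-rule integration of Equation 7; the function names, the number of integration steps, and the additional digits used for the Glaisher–Kinkelin constant are our own choices rather than details from the article.

```python
from math import floor, log

GLAISHER = 1.2824271291  # Glaisher-Kinkelin constant (C ≈ 1.2824 in the text)


def A_exact(K):
    """Equation 8."""
    s = sum(i * log(i) for i in range(2, K + 1))
    return 1 - K + 2 * (K + 1) * log(K) - 4 * s / K


def A_approx(K):
    """Equation 9."""
    return 1 - log(K) / (3 * K) - 4 * log(GLAISHER) / K


def A_numeric(K, steps=200_000):
    """Equation 7 by the midpoint rule on [1/2, 1)."""
    h = 0.5 / steps
    total = 0.0
    for j in range(steps):
        M = 0.5 + (j + 0.5) * h
        i = floor(K * M)
        x = K * M - i
        total += (i + x * x - K * M * M) / (K * M * (1 - M))
    return 2 * h * total


for K in (2, 7, 46):
    print(K, round(A_exact(K), 4), round(A_approx(K), 4), round(A_numeric(K), 4))
```

The three columns should agree closely, with the largest discrepancy between Equations 8 and 9 at $K = 2.$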

Evolutionary Processes and the Joint Distribution of M and $F_{ST}$ for a Biallelic Marker and K Subpopulations

Thus far, we have described the mathematical constraint imposed on $F_{ST}$ by M without respect to the frequency with which particular values of M arise in evolutionary scenarios. As an assessment of the bounds in evolutionary models can illuminate the settings in which they are most salient in population-genetic data analysis (Hedrick 2005; Whitlock 2011; Rousset 2013; Alcala et al. 2014; Wang 2015), we simulated the joint distribution of $F_{ST}$ and M in three migration models, in each case relating the distribution to the mathematical bounds on $F_{ST} .$ This analysis considers allele frequency distributions, and hence values of M and $F_{ST},$ generated by evolutionary models.

Simulations

We simulated independent SNPs under the coalescent, using the software MS (Hudson 2002). We considered a population of total size $K N$ diploid individuals subdivided into K subpopulations of equal size N. At each generation, a proportion m of the individuals in a subpopulation originated from another subpopulation. Thus, the scaled migration rate is $4 N m,$ which corresponds to twice the number of gene copies (equivalently, four times the number of diploid individuals) in a subpopulation that originate elsewhere each generation. We focus on the finite island model (Maruyama 1970; Wakeley 1998), in which migrants have the same probability $m / (K - 1)$ to come from any specific other subpopulation. The finite rectangular and linear stepping-stone models generate similar results (Figures S1–S4 in File S1).

We examined three values of K (2, 7, 40) and three values of $4 N m$ (0.1, 1, 10). Note that in MS, time is scaled in units of $4 N$ generations, so there is no need to specify the subpopulation sizes N. To obtain independent SNPs, we used the MS command “-s” to fix the number of segregating sites S to 1. For each parameter pair $(K, 4 N m),$ we performed 100,000 replicate simulations, sampling 100 sequences per subpopulation in each replicate. $F_{ST}$ values were computed from the parametric allele frequencies.
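For orientation, a command line of the following general shape could produce such simulations. This is a sketch under assumptions: the article states only that "-s" was used, so the use of the "-I" island-model option, the argument order, and the helper name `ms_island_command` are ours and should be checked against the MS documentation before use.

```python
def ms_island_command(K, scaled_migration, n_per_subpop=100, replicates=100_000):
    """Build an ms command line for a K-island model with one segregating site.

    Assumed layout (verify against the ms documentation):
        ms nsam nreps -s 1 -I K n_1 ... n_K 4Nm
    """
    nsam = K * n_per_subpop  # total number of sampled sequences
    sizes = " ".join([str(n_per_subpop)] * K)
    return f"ms {nsam} {replicates} -s 1 -I {K} {sizes} {scaled_migration}"


for K in (2, 7, 40):
    for rate in (0.1, 1, 10):
        print(ms_island_command(K, rate)[:60])
```

The helper only assembles the string for the nine parameter pairs; it does not run MS or parse its output.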

Fixing $S = 1$ and accepting all coalescent genealogies entails an implicit assumption that all genealogies have equal potential to produce exactly one segregating site. We therefore also considered a different approach to generating SNPs, assuming an infinitely-many-sites model with a specified scaled mutation rate θ and discarding simulations leading to $S > 1.$ We chose θ so that the expected number of segregating sites in a constant-sized population, or $\sum_{i = 1}^{K N - 1} θ / i,$ was 1. This approach produces similar results to the fixed-S simulation (Figure S5 in File S1).

Weak migration

Under the island model with weak migration $(4 N m = 0.1),$ the joint density of M and $F_{ST}$ is highest near the upper bound on $F_{ST}$ in terms of M, for all K (Figure 3, A–C). For $K = 2,$ most SNPs have M near 0.5, representing fixation of the major allele in one subpopulation and absence in the other, and $F_{ST}$ near 1 (Figure 3A). The mean $F_{ST}$ in sliding windows for M closely follows the upper bound on $F_{ST} .$ For $K = 7,$ most SNPs have M near $4 / 7,$ $5 / 7,$ or $6 / 7,$ representing fixation of the major allele in four, five, or six subpopulations and absence in the others, and $F_{ST} \approx 1$ (Figure 3B). The mean $F_{ST}$ closely follows the upper bound. For $K = 40,$ most SNPs either have M near $37 / 40,$ $38 / 40,$ or $39 / 40,$ and $F_{ST} \approx 1,$ or $M < 37 / 40$ and $F_{ST} \approx 0.92$ (Figure 3C). The mean $F_{ST}$ follows the upper bound for $M > 37 / 40 .$ For $M < 37 / 40,$ it lies below the upper bound and does not possess its characteristic peaks.

Figure 3. Joint density of the frequency M of the most frequent allele and $F_{ST}$ in the island migration model, for different numbers of subpopulations K and scaled migration rates $4 N m$ (where N is the subpopulation size and m the migration rate): (A) $K = 2,$ $4 N m = 0.1;$ (B) $K = 7,$ $4 N m = 0.1;$ (C) $K = 40,$ $4 N m = 0.1;$ (D) $K = 2,$ $4 N m = 1;$ (E) $K = 7,$ $4 N m = 1;$ (F) $K = 40,$ $4 N m = 1;$ (G) $K = 2,$ $4 N m = 10;$ (H) $K = 7,$ $4 N m = 10;$ and (I) $K = 40,$ $4 N m = 10.$ The black solid line represents the upper bound on $F_{ST}$ in terms of M (Equation 5); the red dashed line represents the mean $F_{ST}$ in sliding windows of M of size 0.02 (plotted from 0.51 to 0.99). Colors represent the density of SNPs, estimated using a Gaussian kernel density estimate with a bandwidth of 0.007, with density set to 0 outside of the bounds. SNPs are simulated using coalescent software MS, assuming an island model of migration and conditioning on one segregating site. See Figure S5 in File S1 for an alternative algorithm for simulating SNPs. Each panel considers 100,000 replicate simulations, with 100 lineages sampled per subpopulation. Figures S2 and S3 in File S1 present similar results under finite rectangular and linear stepping-stone migration models.

We can interpret these patterns using the model of Wakeley (1999), which showed that when migration is infrequent compared to coalescence, coalescence follows two phases. In the scattering phase, lineages coalesce in each subpopulation, leading to a state with a single lineage per subpopulation. In the collecting phase, lineages from different subpopulations coalesce. As a result, considering K subpopulations with equal sample size n, when $4 N m ≪ 1,$ genealogies tend to have K long branches close to the root, each corresponding to a subpopulation and each leading to n shorter terminal branches. The long branches coalesce as pairs accumulate by migration in shared ancestral subpopulations. A random mutation on such a genealogy is likely to occur in one of two places. It can occur on a long branch during the collecting phase, in which case the derived allele will have frequency 1 in all subpopulations whose lineages descend from the branch, and 0 in the others. Alternatively, it can occur toward the terminal branches in the scattering phase, in which case the mutation will have frequency $p_{k} > 0$ in one subpopulation and 0 in all others. These two scenarios that are likely under weak migration (one allele fixed in some subpopulations, or present only in one subpopulation) correspond closely to conditions under which the upper bound on $F_{ST}$ is reached at fixed M. Thus, the properties of likely genealogies explain the proximity of $F_{ST}$ to its upper bound.

Intermediate and strong migration

With intermediate migration $(4 N m = 1),$ for all K, the joint density of M and $F_{ST}$ is highest at lower values of $F_{ST}$ than with $4 N m = 0.1$ (Figure 3, D–F). For $K = 2,$ most SNPs have $M > 0.8$ and the mean $F_{ST}$ is almost equidistant from the upper and lower bounds on $F_{ST},$ nearing the upper bound as M increases (Figure 3D). For $K = 7,$ most SNPs have $M > 0.9;$ as was seen for $K = 2,$ the mean $F_{ST}$ is almost equidistant from the upper and lower bounds, moving toward the upper bound as M increases (Figure 3E). For $K = 40,$ the pattern is similar, most SNPs having $M > 0.95$ (Figure 3F).

Under intermediate migration, migration is sufficient that more mutations than in the weak-migration case generate polymorphism in multiple subpopulations. A random mutation is likely to occur on a branch that leads to many terminal branches from the same subpopulation, but also to branches from other subpopulations. Thus, the allele is likely to have intermediate frequency in multiple subpopulations. This setting does not generate the conditions under which the upper bound on $F_{ST}$ is reached, so that except at the largest M, intermediate migration leads to values farther from the upper bound than in the weak-migration case. For large M, the rarer allele is likely to be only in one subpopulation, so that $F_{ST}$ is nearer to the upper bound.

With strong migration $(4 N m = 10),$ the joint density of M and $F_{ST}$ nears the lower bound (Figure 3, G–I). For each K, most SNPs have $M > 0.9$ and $F_{ST} \approx 0,$ with the mean $F_{ST}$ increasing somewhat as K increases. Under strong migration, because lineages can migrate between subpopulations quickly, they can also coalesce quickly, irrespective of their subpopulations of origin. As a result, a random mutation is likely to occur on a branch that leads to terminal branches in many subpopulations. The allele is expected to have comparable frequency in all subpopulations, so that $F_{ST}$ is likely to be small. This scenario corresponds to the conditions under which the lower bound on $F_{ST}$ is approached.

Proximity of the joint density of M and F_ST to the upper bound

To summarize features of the relationship of $F_{ST}$ to the upper bound seen in Figure 3, we can quantify the proximity of the joint density of M and $F_{ST}$ to the bounds on $F_{ST} .$ For a set of Z loci, denote by $F_{z}$ and $M_{z}$ the values of $F_{ST}$ and M at locus z. The mean $F_{ST}$ for the set, or ${\bar{F}}_{ST},$ is

{\bar{F}}_{ST} = \frac{1}{Z} \sum_{z = 1}^{Z} F_{z} .

(10)

Using Equation 5, a corresponding mean maximum $F_{ST}$ given the observed $M_{z},$ $z = 1, 2, \dots, Z,$ denoted ${\bar{F}}_{\max},$ is

{\bar{F}}_{\max} = \frac{1}{Z} \sum_{z = 1}^{Z} \frac{⌊ K M_{z} ⌋ + \{K M_{z}\}^{2} - K M_{z}^{2}}{K M_{z} (1 - M_{z})} .

(11)

The ratio ${\bar{F}}_{ST} / {\bar{F}}_{\max}$ gives a sense of the proximity of the $F_{ST}$ values to their upper bounds: it ranges from 0, when $F_{ST}$ values at all SNPs equal their lower bounds, to 1, when $F_{ST}$ values at all SNPs equal their upper bounds.
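The ratio can be computed from per-locus frequency vectors as in the following sketch. The function names and the example frequencies are hypothetical; each locus is represented by the frequencies of one allele in the K subpopulations, and Equations 3 and 5 are evaluated in floating point.

```python
from math import floor


def fst(p):
    """Equation 3: F_ST from subpopulation frequencies at one locus."""
    K = len(p)
    M = sum(p) / K
    return (sum(pk * pk for pk in p) / K - M * M) / (M * (1 - M))


def fst_max(M, K):
    """Equation 5: upper bound on F_ST given M and K."""
    i = floor(K * M)
    x = K * M - i
    return (i + x * x - K * M * M) / (K * M * (1 - M))


def proximity_ratio(loci):
    """Mean F_ST over loci divided by the mean upper bound (Equations 10-11).

    loci: list of frequency vectors, one per locus, all of length K.
    """
    K = len(loci[0])
    mean_fst = sum(fst(p) for p in loci) / len(loci)
    mean_max = sum(fst_max(sum(p) / K, K) for p in loci) / len(loci)
    return mean_fst / mean_max


# Hypothetical frequencies for three loci in K = 3 subpopulations.
loci = [[1.0, 1.0, 0.4], [0.9, 0.8, 0.7], [0.6, 0.5, 0.55]]
print(round(proximity_ratio(loci), 3))
```

Because the ratio divides one mean by another, loci with high upper bounds contribute more to the denominator than loci with low upper bounds; it is not the mean of per-locus ratios.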

Figure 4 shows the ratio ${\bar{F}}_{ST} / {\bar{F}}_{\max}$ under the island model for different values of K and $4 N m .$ For each value of the number of subpopulations, ${\bar{F}}_{ST} / {\bar{F}}_{\max}$ decreases with $4 N m .$ This result summarizes the influence of the migration rate observed in Figure 3: $F_{ST}$ values tend to be close to the upper bound under weak migration, and near the lower bound under strong migration. ${\bar{F}}_{ST} / {\bar{F}}_{\max}$ is only minimally influenced by the number of subpopulations K (Figure 4). Even though the upper bound on $F_{ST}$ in terms of M is strongly affected by K, the proximity of $F_{ST}$ to the upper bound is similar across K values.

Figure 4. ${\bar{F}}_{ST} / {\bar{F}}_{\max},$ the ratio of the mean $F_{ST}$ to the mean maximal $F_{ST}$ given the observed frequency M of the most frequent allele, as a function of the number of subpopulations K and the scaled migration rate $4 N m$ for the island migration model. Colors represent values of K. $F_{ST}$ values are computed from coalescent simulations using MS for 10,000 independent SNPs and 100 lineages sampled per subpopulation. ${\bar{F}}_{\max}$ is computed from Equation 11. Figure S4 in File S1 presents similar results under rectangular and linear stepping-stone migration models.

Application to Human Genomic Data

We now use our theoretical results to explain observed patterns of human genetic differentiation, and in particular, to explain the impact of the number of subpopulations. We examine data from Li et al. (2008) on 577,489 SNPs from 938 individuals of the Human Genome Diversity Panel (HGDP) (Cann et al. 2002), as compiled by Pemberton et al. (2012). We use the same division of the individuals into seven geographic regions that was examined by Li et al. (2008) (Africa, Middle East, Europe, Central and South Asia, East Asia, Oceania, and America). Previous studies of these individuals have used $F_{ST}$ to compare differentiation in regions with different numbers of subpopulations sampled (Rosenberg et al. 2002; Ramachandran et al. 2004; Rosenberg 2011).

We computed the parametric allele frequencies for each region, averaging across regions to obtain the frequency M of the most frequent allele. We then computed $F_{ST}$ for each SNP, averaging $F_{ST}$ values across SNPs to obtain the overall $F_{ST}$ for the full SNP set. To assess the impact of the number of subpopulations K on the relationship between M and $F_{ST},$ we computed $F_{ST}$ for all 120 sets of two or more geographic regions (Figure 5). The 21 pairwise $F_{ST}$ values range from 0.007 (between Middle East and Europe) to 0.101 (Africa and America), with a mean of 0.057, SD of 0.027, and median of 0.061. $F_{ST}$ is substantially larger for sets of three geographic regions. The smallest value is larger, 0.012 (Middle East, Europe, Central/South Asia); as is the largest value, 0.133 (Africa, Oceania, America); the mean of 0.076; and the median of 0.089. Among the $21 \times 5 = 105$ ways of adding a third region to a pair of regions, 83 produce an increase in $F_{ST} .$ For 17 sets of three regions, the value of $F_{ST}$ exceeds that of each of its three component pairs.

Mean $F_{ST}$ values across loci for sets of geographic regions. Each box represents a particular combination of two, three, four, five, six, or all seven geographic regions. Within a box, the numerical value shown is $F_{ST}$ among the regions. The regions considered are indicated by the pattern of “.” and “X” symbols within the box, with X indicating inclusion and “.” indicating exclusion. From left to right, the regions are Africa, Middle East, Europe, Central/South Asia, East Asia, Oceania, and America. Thus, for example, X...X.. indicates the subset {Africa, East Asia}. Lines are drawn between boxes that represent nested subsets. A line is colored red if the larger subset has a higher $F_{ST}$ value, and blue if it has a lower $F_{ST} .$ Computations rely on 577,489 SNPs from the HGDP.

The pattern of increase of $F_{ST}$ with the inclusion of additional subpopulations can be seen in Figure 6A, which plots the $F_{ST}$ values from Figure 5 as a function of K. The magnitude of the increase is greatest from $K = 2$ to $K = 3,$ decreasing with increasing K. From $K = 3$ to 4, 82 of 140 additions of a region increase $F_{ST};$ 54 of 105 produce an increase from $K = 4$ to 5; 21 of 42 from $K = 5$ to 6; and 3 of 7 from $K = 6$ to 7. The seven-region $F_{ST}$ of 0.102 exceeds all the pairwise $F_{ST}$ values.
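The numbers of region sets and of single-region additions quoted above follow from elementary counting. A minimal sketch in Python that enumerates them is given here; the variable names are illustrative and the code does not touch the SNP data.

```python
from itertools import combinations
from math import comb

regions = ["Africa", "Middle East", "Europe", "Central/South Asia",
           "East Asia", "Oceania", "America"]
n = len(regions)

# All sets of two or more regions, grouped by the number of regions K.
subsets_by_size = {k: list(combinations(regions, k)) for k in range(2, n + 1)}
total_sets = sum(len(sets) for sets in subsets_by_size.values())

# Each K-region set can be extended by any of the n - K regions outside it,
# so the number of ways to go from K to K + 1 regions is C(n, K) * (n - K).
additions = {k: comb(n, k) * (n - k) for k in range(2, n)}
```

Here `total_sets` counts the sets shown in Figure 5, and `additions[K]` gives the denominator used for the step from K to $K + 1$ regions.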

$F_{ST}$ values for sets of geographic regions as a function of K, the number of regions considered. (A) ${\bar{F}}_{ST}$ computed using Equation 10. (B) ${\bar{F}}_{ST} / {\bar{F}}_{\max}$ computed using Equation 11. For each subset of populations, the value of $F_{ST}$ is taken from Figure 5. The mean across subsets for a fixed K appears as a solid red line, and the median as a dashed red line.

The larger $F_{ST}$ values with increasing K can be explained by the difference in constraints on $F_{ST}$ in terms of M (Figure 7). For fixed M, as we saw in the increase of $A (K)$ with K (Figure 2), the permissible range of $F_{ST}$ values is smaller on average for $F_{ST}$ values computed among smaller sets of populations than among larger sets. For example, the maximal $F_{ST}$ value at the mean M of 0.76 observed in pairwise comparisons is 0.33 for $K = 2$ (black line in Figure 7A), while the maximal $F_{ST}$ value at the mean M of 0.77 observed for the global comparison of seven regions is 0.86 for $K = 7$ (Figure 7B). Given the stronger constraint in pairwise calculations, it is not unexpected that pairwise $F_{ST}$ values would be smaller than the values computed with more regions, such as in the seven-region computation. Interestingly, the effect of K on $F_{ST}$ is largely eliminated when $F_{ST}$ values are normalized by their maxima (Figure 6B). The normalization, which takes both K and M into account, generates nearly constant means and medians of $F_{ST}$ as functions of K, with higher values for $K = 2.$

Joint density of the frequency M of the most frequent allele and $F_{ST}$ in human population-genetic data, considering 577,489 SNPs. (A) $F_{ST}$ computed for pairs of geographic regions. The density is evaluated from the set of $F_{ST}$ values for all 21 pairs of regions. (B) $F_{ST}$ computed among $K = 7$ geographic regions. The figure design follows Figure 3.

Data availability

MS commands for the coalescent simulations appear in File S2. See Li et al. (2008) for the human SNP data.

Discussion

We have evaluated the constraint imposed by the frequency M of the most frequent allele at a biallelic locus on the range of $F_{ST},$ for arbitrarily many subpopulations. Although $F_{ST}$ is unconstrained in the unit interval when $M = i / K$ for integers i satisfying $⌈ K / 2 ⌉ \leq i \leq K - 1,$ it is constrained below 1 for all other M. We have found that the number of subpopulations K has considerable impact on the range of $F_{ST},$ with a weaker constraint on $F_{ST}$ as K increases. As shown by Jakobsson et al. (2013) for $K = 2,$ across possible values of M, $F_{ST}$ is restricted to 38.63% of the possible space. For $K = 100,$ however, $F_{ST}$ can occupy 97.47% of the space. Although the mean over M values of the permissible interval for $F_{ST}$ approaches the full unit interval as $K \to ∞,$ for any K, an allele frequency $M < (K - 1) / K$ exists for which the maximal $F_{ST}$ is lower than $2 \sqrt{2} - 2.$
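The percentages quoted for $K = 2$ and $K = 100$ can be approximated numerically. The sketch below assumes that the proportion of the space available to $F_{ST}$ is the mean of the upper bound over M in $[1 / 2, 1),$ and it uses a simple midpoint rule; the function names and the number of grid points are our own choices.

```python
import math

def fst_upper_bound(M, K):
    """Upper bound on F_ST given M and K (floor/fractional-part form)."""
    sigma = K * M
    i = math.floor(sigma)
    x = sigma - i  # fractional part of KM
    return (i + x * x - K * M * M) / (K * M * (1.0 - M))

def mean_upper_bound(K, n_points=100_000):
    """Midpoint-rule estimate of the mean upper bound over M in [1/2, 1)."""
    width = 0.5 / n_points
    total = 0.0
    for j in range(n_points):
        M = 0.5 + (j + 0.5) * width  # midpoint of the j-th subinterval
        total += fst_upper_bound(M, K)
    return total / n_points

area_k2 = mean_upper_bound(2)
area_k100 = mean_upper_bound(100)
```

For $K = 2$ the bound is $(1 - M) / M,$ whose mean over $[1 / 2, 1)$ is $2 \ln 2 - 1 \approx 0.3863,$ in agreement with the value of Jakobsson et al. (2013).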

Multiple studies have highlighted the relationship between $F_{ST}$ and M in two subpopulations for biallelic markers (Rosenberg et al. 2003; Maruki et al. 2012) and, more generally, for an unspecified (Jakobsson et al. 2013) or specified number of alleles (Edge and Rosenberg 2014). We have extended these results to the case of biallelic markers in a specified but arbitrary number of subpopulations, comprehensively describing the relationship between $F_{ST}$ and M for the biallelic case. The study is part of an increasing body of work characterizing the mathematical relationship of population-genetic statistics with quantities that constrain them (Hedrick 1999, 2005; Rosenberg and Jakobsson 2008; Reddy and Rosenberg 2012). As we have seen, such relationships contribute to understanding the behavior of the statistics in evolutionary models and to interpreting counterintuitive results in human population genetics.

Properties of F_ST in evolutionary models

Our work extends classical results about the impact of evolutionary processes on $F_{ST}$ values. Wright (1951) showed that in an equilibrium population, $F_{ST}$ is expected to be near 1 if migration is weak, and near 0 if migration is strong. On the basis of our simulations, we can more precisely formulate this proposition: considering a SNP at frequency M in an equilibrium population, $F_{ST}$ is expected to be near its upper bound in terms of M if migration is weak and near 0 if migration is strong. This formulation of Wright’s proposition makes it possible to explain why SNPs subject to the same migration process can display a variety of $F_{ST}$ patterns; indeed, under weak migration, we expect $F_{ST}$ values to mirror the considerable variation in the upper bound on $F_{ST}$ in terms of M.

Lower F_ST values in pairwise comparisons than in comparisons of more subpopulations

$F_{ST}$ values have often been compared across computations with different numbers of subpopulations. Such comparisons appear frequently, for example, in studies of domesticated animals such as horses, pigs, and sheep (Cañon et al. 2000; Kim et al. 2005; Lawson Handley et al. 2007). In human populations, table 1 of the microsatellite study of Rosenberg et al. (2002) presents comparisons of $F_{ST}$ values for scenarios with K ranging from 2 to 52. Table 3 of Rosenberg et al. (2006) compares $F_{ST}$ values for microsatellites and biallelic indels in population sets with K ranging from 2 to 18. Major SNP studies have also compared $F_{ST}$ values for scenarios with $K = 2$ and $K = 3$ groups (Hinds et al. 2005; International HapMap Consortium 2005).

Our results suggest that such comparisons between $F_{ST}$ values with different K can hide an effect of the number of subpopulations, especially when some of the comparisons involve the most strongly constrained case of $K = 2.$ For human data, we found that owing to a difference in the $F_{ST}$ constraint for different K values, pairwise $F_{ST}$ values between continental regions were consistently lower than $F_{ST}$ values computed using three or more regions, and sets of three regions were identified for which the $F_{ST}$ value exceeded the values for all three pairs of regions in the set. The effect of K might help illuminate why SNP-based pairwise human $F_{ST}$ values (table S11 of 1000 Genomes Project Consortium 2012) are generally smaller than estimates that use all populations together (11.1% of genetic variance due to between-region or between-population differences; Li et al. 2008). We find that comparing $F_{ST}$ values with different choices of K can generate as much difference—twofold—as comparing $F_{ST}$ with different marker types (Holsinger and Weir 2009). This substantial impact of K on $F_{ST}$ merits further attention.

Consequences for the use of F_ST as a test statistic

The effects of constraints on $F_{ST}$ extend beyond the use of $F_{ST}$ as a statistic for genetic differentiation. In F_ST-based genome scans for local adaptation, tracing to the work of Lewontin and Krakauer (1973), a hypothesis of spatially divergent selection at a candidate locus is evaluated by comparing $F_{ST}$ at the locus with the $F_{ST}$ distribution estimated from a set of putatively neutral loci. Under this test, loci with $F_{ST}$ values smaller or larger than expected by chance are interpreted as being under stabilizing or divergent selection, respectively. Modern versions of this approach compare $F_{ST}$ values at single loci with the distribution across the genome (Beaumont and Nichols 1996; Akey et al. 2002; Foll and Gaggiotti 2008; Bonhomme et al. 2010; Günther and Coop 2013).

The constraints on $F_{ST}$ in our work and the work of Jakobsson et al. (2013) and Edge and Rosenberg (2014) suggest that $F_{ST}$ values strongly depend on the frequency of the most frequent allele. Consequently, we expect that $F_{ST}$ outlier tests that do not explicitly take into account this constraint will result in a deficit of power at loci with high- and low-frequency alleles. Because pairwise $F_{ST}$ and $F_{ST}$ values in many populations have different constraints, we predict that the effect of the constraint on outlier tests relying on a single global $F_{ST}$ (e.g., Beaumont and Nichols 1996; Foll and Gaggiotti 2008) will be smaller than in tests relying on pairwise $F_{ST}$ (e.g., Günther and Coop 2013).

F_ST as a statistic or as a parameter

The perspective we used in obtaining $F_{ST}$ bounds treats $F_{ST}$ as a mathematical function of allele frequencies rather than as a population-genetic parameter. Thus, the starting point for our mathematical analysis (Equation 3) is that the allele frequencies are mathematical constants rather than random outcomes of an evolutionary process.

In the alternative perspective that $F_{ST}$ is a parameter rather than a statistic, both the sample of alleles drawn from a set of subpopulations and the sample of subpopulations drawn from a larger collection of subpopulations are treated as random. An analysis of mathematical bounds analogous to our analysis of Equation 3 in terms of M would then investigate bounds on estimators of $F_{ST},$ where the value of the estimator is bounded in terms of the largest sample allele frequency. In this perspective, the estimator of Weir and Cockerham (1984) for a biallelic locus $(\hat{θ};$ Weir 1996, p. 173), under an assumption of equal sample sizes in the K subpopulations and either haploid data or a random union of gametes in diploids, is

\hat{θ} = \frac{s^{2} - \frac{1}{2 n - 1} [\tilde{M} (1 - \tilde{M}) - \frac{K - 1}{K} s^{2}]}{\tilde{M} (1 - \tilde{M}) + \frac{s^{2}}{K}} .

(12)

Here, $2 n$ is the sample size per subpopulation (n diploid individuals or $2 n$ haploids), $\tilde{M}$ is the mean sample frequency of the most frequent allele across subpopulations, and $s^{2} = [1 / (K - 1)] \sum_{k = 1}^{K} {({\tilde{p}}_{k} - \tilde{M})}^{2}$ is the empirical variance of the sample frequency ${\tilde{p}}_{k}$ of the most frequent allele across subpopulations.

Although Equation 12 has more terms than Equation 3, it can be shown that for fixed $\tilde{M}$ with $1 / 2 \leq \tilde{M} < 1$ and fixed $K \geq 2$ and $2 n \geq 2,$ $s^{2}$ and hence $\hat{θ}$ are minimized and maximized under corresponding conditions to those that minimize and maximize Equation 3. In particular, Theorem 1 applies to $\{{\tilde{p}}_{k}\}_{k = 1}^{K},$ with $\sum_{k = 1}^{K} {\tilde{p}}_{k} = K \tilde{M} .$ We then expect that corresponding mathematical results to those seen for $F_{ST}$ as computed in Equation 3 will hold for $\hat{θ}$ from Equation 12. Such computations indicate that our “statistic” perspective on $F_{ST}$ generates mathematical results of interest to a “parameter” interpretation of $F_{ST} .$
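For concreteness, Equation 12 can be evaluated directly from the sample frequencies. The following Python sketch assumes equal sample sizes of $2 n$ alleles in each of the K subpopulations and a polymorphic locus; the function name `theta_hat` and its argument names are illustrative.

```python
def theta_hat(sample_freqs, n):
    """Equation 12 for a biallelic locus with 2n alleles sampled per subpopulation.

    sample_freqs: sample frequencies of the same allele in the K subpopulations.
    """
    K = len(sample_freqs)
    if K < 2 or n < 1:
        raise ValueError("need K >= 2 subpopulations and 2n >= 2 alleles")
    M = sum(sample_freqs) / K                                  # mean sample frequency
    s2 = sum((p - M) ** 2 for p in sample_freqs) / (K - 1)     # empirical variance
    het = M * (1.0 - M)
    if het == 0.0:
        raise ValueError("locus must be polymorphic in the sample")
    numerator = s2 - (het - (K - 1) / K * s2) / (2 * n - 1)
    denominator = het + s2 / K
    return numerator / denominator
```

Two limiting cases are easy to check by hand: with two subpopulations fixed for different alleles the estimator equals 1, and with identical sample frequencies in all subpopulations $s^{2} = 0$ and the estimator equals $- 1 / (2 n - 1) .$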

Conclusions

Many recent articles have noted that $F_{ST}$ often behaves counterintuitively (Whitlock 2011; Alcala et al. 2014; Wang 2015), for example, indicating low differentiation in cases in which populations do not share any alleles (Balloux et al. 2000; Jost 2008) or suggesting less divergence among populations than is visible in clustering analyses (Tishkoff et al. 2009; Algee-Hewitt et al. 2016). It has thus become clear that observed $F_{ST}$ patterns often trace to peculiar mathematical properties of F_ST—in particular its relationship to other statistics such as homozygosity or allele frequency—instead of to biological phenomena of interest. Our work here, extending approaches of Jakobsson et al. (2013) and Edge and Rosenberg (2014), seeks to characterize those properties, so that the influence of mathematical constraints on $F_{ST}$ can be disentangled from biological phenomena.

One response to the dependence of $F_{ST}$ on M may be to compute $F_{ST}$ only when allele frequencies lie in a specific class, such as $M \leq 0.95.$ Such choices can potentially avoid a misleading interpretation that a genetic differentiation measure is low in scenarios when minor alleles, though rare, are in fact private to single populations. We note, however, that the dependence of $F_{ST}$ spans the full range of values of M, and exists for values of M both above and below a choice of cutoff. In addition, this dependence varies with the number of subpopulations K, so that use of the same cutoff could have a different effect on $F_{ST}$ values in scenarios with different numbers of subpopulations.

In a potentially more informative approach, addressing the mathematical dependence of $F_{ST}$ on the within-subpopulation mean heterozygosity $H_{S},$ Wang (2015) has proposed plotting the joint distribution of $H_{S}$ and $F_{ST}$ to assess the correlation between the two statistics. Using the island model, Wang (2015) argued that when $H_{S}$ and $F_{ST}$ are uncorrelated, $F_{ST}$ is expected to be more revealing about the demographic history of a species than when they are strongly correlated and $F_{ST}$ merely reflects the within-subpopulation diversity. Our results suggest a related framework: studies can compare plots of the joint distribution of M and $F_{ST}$ with the bounds on $F_{ST}$ in terms of M. This framework, which examines constraints on $F_{ST}$ in terms of allele frequencies in the total population, complements that of Wang (2015), which considers constraints in terms of subpopulation allele frequencies. Such analyses, considering $F_{ST}$ together with additional measures of allele frequencies, are desirable in diverse scenarios for explaining counterintuitive $F_{ST}$ phenomena, for avoiding overinterpretation of $F_{ST}$ values, and for making sense of $F_{ST}$ comparisons across settings that have a substantial difference in the nature of one or more underlying parameters.

Supplementary Material

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.199141/-/DC1.


Acknowledgments

We thank the editor M. Beaumont, K. Holsinger, and an anonymous reviewer for comments on an earlier version of the manuscript. Support was provided by National Science Foundation grants BCS-1515127 and DBI-1458059; National Institute of Justice grant 2014-DN-BX-K015; a postdoctoral fellowship from the Stanford Center for Computational, Evolutionary, and Human Genomics; and Swiss National Science Foundation Early Postdoc.Mobility fellowship P2LAP3_161869.

Appendix A: Demonstration of Equation 4

This appendix provides the derivation of the upper bound on $\sum_{k = 1}^{K} p_{k}^{2}$ as a function of K and M.

Theorem 1. Suppose $σ > 0$ and $K \geq ⌊ σ ⌋ + 1$ are specified, where K is an integer. Considering all sequences $\{p_{k}\}_{k = 1}^{K}$ with $p_{k} \in [0, 1],$ $\sum_{k = 1}^{K} p_{k} = σ,$ and $k < ℓ$ implies $p_{k} \geq p_{ℓ},$ $\sum_{k = 1}^{K} p_{k}^{2}$ is maximal if and only if $p_{k} = 1$ for $1 \leq k \leq ⌊ σ ⌋,$ $p_{⌊ σ ⌋ + 1} = σ - ⌊ σ ⌋,$ and $p_{k} = 0$ for $k > ⌊ σ ⌋ + 1,$ and its maximum is ${(σ - ⌊ σ ⌋)}^{2} + ⌊ σ ⌋ .$

Proof. This theorem is a special case of lemma 3 from Rosenberg and Jakobsson (2008), which states (changing notation for some of the variables to avoid confusion): “Suppose $A > 0$ and $C > 0$ and that $⌈ C / A ⌉$ is denoted L. Considering all sequences $\{p_{i}\}_{i = 1}^{∞}$ with $p_{i} \in [0, A],$ $\sum_{i = 1}^{∞} p_{i} = C,$ and $i < j$ implies $p_{i} \geq p_{j},$ $H (p) = \sum_{i = 1}^{∞} p_{i}^{2}$ is maximal if and only if $p_{i} = A$ for $1 \leq i \leq L - 1,$ $p_{L} = C - (L - 1) A,$ and $p_{i} = 0$ for $i > L,$ and its maximum is $L (L - 1) A^{2} - 2 C (L - 1) A + C^{2} .$”

In our special case, we apply the lemma with $A = 1$ and $C = σ .$ We also restrict consideration to sequences of finite rather than infinite length; however, our condition $K \geq ⌊ σ ⌋ + 1$ for the number of terms in the sequence guarantees that the maximum in the case of infinite sequences, which requires $⌈ σ ⌉ \leq ⌊ σ ⌋ + 1$ nonzero terms, is attainable with sequences of the finite length we consider. For convenience in numerical computations, we state our result using the floor function rather than the ceiling function, requiring some bookkeeping to obtain our corollary.

If σ is not an integer, then in lemma 3 of Rosenberg and Jakobsson (2008), $L = ⌊ σ ⌋ + 1,$ and the maximum occurs with $p_{1} = p_{2} = \dots = p_{⌊ σ ⌋} = 1,$ $p_{⌊ σ ⌋ + 1} = σ - ⌊ σ ⌋,$ and $p_{k} = 0$ for $k > ⌊ σ ⌋ + 1,$ equaling $L (L - 1) A^{2} - 2 C (L - 1) A + C^{2} = (⌊ σ ⌋ + 1) ⌊ σ ⌋ - 2 σ ⌊ σ ⌋ + σ^{2} .$

If σ is an integer, then $⌊ σ ⌋ = ⌈ σ ⌉ = σ,$ and the maximum occurs with $p_{1} = p_{2} = \dots = p_{σ} = 1,$ $p_{σ + 1} = σ - ⌊ σ ⌋ = 0,$ and $p_{k} = 0$ for $k > σ + 1,$ equaling $L (L - 1) A^{2} - 2 C (L - 1) A + C^{2} = ⌊ σ ⌋ (⌊ σ ⌋ - 1) - 2 σ (⌊ σ ⌋ - 1) + σ^{2} .$

In both cases, the maximum simplifies to ${(σ - ⌊ σ ⌋)}^{2} + ⌊ σ ⌋,$ noting that $⌊ σ ⌋ = ⌈ σ ⌉ = σ$ in the latter case.□

In our application of the theorem in the main text, the definition of M gives $\sum_{k = 1}^{K} p_{k} = K M,$ so that $K M$ plays the role of σ. We thus obtain that the maximal value of $\sum_{k = 1}^{K} p_{k}^{2}$ for sequences $\{p_{k}\}_{k = 1}^{K}$ with $p_{k} \in [0, 1],$ $k < ℓ$ implies $p_{k} \geq p_{ℓ},$ and $\sum_{k = 1}^{K} p_{k} = K M$ is ${(K M - ⌊ K M ⌋)}^{2} + ⌊ K M ⌋,$ with equality if and only if $p_{k} = 1$ for $1 \leq k \leq ⌊ K M ⌋,$ $p_{⌊ K M ⌋ + 1} = \{K M\} = K M - ⌊ K M ⌋,$ and $p_{k} = 0$ for $k > ⌊ K M ⌋ + 1.$ Considering all sequences $\{p_{k}\}_{k = 1}^{K}$ with $p_{k} \in [0, 1]$ and not necessarily ordered such that $k < ℓ$ implies $p_{k} \geq p_{ℓ},$ the maximum is achieved when any $⌊ K M ⌋$ terms equal 1, one term is $\{K M\},$ and remaining terms are 0.
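The statement of Theorem 1 can be illustrated numerically. The Python sketch below builds the maximizing sequence for a given σ and K and compares its sum of squares with that of randomly drawn feasible sequences. The rejection sampler is only an illustration device of our own and is not part of the proof.

```python
import math
import random

def max_sum_of_squares(sigma):
    """Maximum of sum(p_k^2) from Theorem 1."""
    f = math.floor(sigma)
    return (sigma - f) ** 2 + f

def maximizing_sequence(sigma, K):
    """Sequence attaining the maximum: floor(sigma) ones, one remainder, zeros."""
    f = math.floor(sigma)
    if K < f + 1:
        raise ValueError("Theorem 1 requires K >= floor(sigma) + 1")
    return [1.0] * f + [sigma - f] + [0.0] * (K - f - 1)

def random_feasible_sequence(sigma, K, rng):
    """Random sequence in [0, 1]^K with sum sigma, drawn by rejection sampling."""
    while True:
        raw = [rng.random() for _ in range(K)]
        scale = sigma / sum(raw)
        seq = [scale * r for r in raw]
        if max(seq) <= 1.0:  # reject rescaled draws that leave [0, 1]
            return seq

sigma, K = 2.4, 4  # corresponds to M = 0.6 in K = 4 subpopulations
best = maximizing_sequence(sigma, K)
rng = random.Random(0)
draws = [random_feasible_sequence(sigma, K, rng) for _ in range(200)]
```

Every sequence in `draws` should have a sum of squares no larger than `max_sum_of_squares(sigma)`, with the maximum attained by `best`.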

Appendix B: Local Minima of the Upper Bound on F_ST

This appendix derives the positions and values of the local minima in the upper bound on $F_{ST}$ in terms of M (Equation 5).

Positions of the Local Minima

To derive the positions of the local minima of the upper bound on $F_{ST}$ in terms of M, we study the function $Q_{i} (x)$ (Equation 6) on the interval $[0, 1)$ for x, where $i = ⌊ K M ⌋$ and $x = K M - i,$ so that $M = (i + x) / K .$ Recall that K and i are integers with $K \geq 2$ and i in $[⌊ K / 2 ⌋, K - 1] .$ Note that $x < 1$ ensures that $M < 1,$ in accord with our assumption of a polymorphic locus.

Theorem 2. Consider fixed integers $K \geq 2$ and i in $[⌊ K / 2 ⌋, K - 1] .$

(i) $Q_{i} (x)$ has no local minimum on $[0, 1)$ for $K = 2$ or for $i = K - 1.$
(ii) For $K \geq 3$ and i in $[⌊ K / 2 ⌋, K - 2],$ $Q_{i} (x)$ has a unique local minimum on the interval $[0, 1)$ for x, with position denoted $x_{\min} .$
(iii) For odd $K \geq 3$ and $i = (K - 1) / 2,$ $x_{\min} = 1 / 2 .$
(iv) For all other $(K, i)$ with $K \geq 3$ and i in $[⌊ K / 2 ⌋, K - 2],$ $x_{\min} = λ (K, i),$ where
$λ (K, i) = \frac{i (K - i) - \sqrt{i (i + 1) (K - i) (K - i - 1)}}{2 i - K + 1} .$ (B1)

Proof. We take the derivative of $Q_{i} (x) :$

\frac{d Q_{i} (x)}{d x} = \frac{- K (2 i - K + 1) x^{2} + 2 i K (K - i) x - i K (K - i)}{{(i + x)}^{2} {(K - i - x)}^{2}} .

(B2)

As $1 / 2 \leq M = (i + x) / K < 1,$ the denominator in Equation B2 is nonzero, and $d Q_{i} (x) / d x = 0$ is equivalent to a quadratic equation in x:

- K (2 i - K + 1) x^{2} + 2 i K (K - i) x - i K (K - i) = 0.

(B3)

If $i = (K - 1) / 2,$ then the quadratic term in Equation B3 vanishes and Equation B3 becomes a linear equation in x, with solution $x = 1 / 2 .$ That the solution is a local minimum follows from the continuity of $Q_{i} (x)$ on $[0, 1)$ together with the fact that $Q_{i} (0) = Q_{i} (1) = 1$ and $Q_{i} (x) < 1$ for $0 < x < 1.$ Consequently, if K is odd, then the local minimum for $i = (K - 1) / 2$ occurs at $M = (i + x) / K = 1 / 2,$ the lowest possible value of M. This establishes (iii).

Excluding $i = (K - 1) / 2$ for odd K, for all $i \in [⌊ K / 2 ⌋, K - 2]$ with $K \geq 3,$ Equation B3 has a unique solution in $[0, 1);$ this solution has $x = λ (K, i)$ (Equation B1). The other root of Equation B3 exceeds 1. That $x = λ (K, i)$ represents a local minimum is again a consequence of the continuity of $Q_{i} (x)$ on $[0, 1)$ together with $Q_{i} (0) = Q_{i} (1) = 1$ and $Q_{i} (x) < 1$ for $0 < x < 1.$ This establishes (ii) and (iv).

For the case of $i = K - 1,$ Equation B3 has a double root at $x = 1,$ outside the permissible domain for x, $[0, 1) .$ $Q_{i} (0) = 1,$ $0 \leq Q_{i} (x) \leq 1$ on $[0, 1),$ and $Q_{i} (x)$ approaches 0 as $x \to 1.$ Consequently, $Q_{i} (x)$ has no local minimum for $i = K - 1.$ For $K = 2,$ $i = K - 1$ is the only possible value of i, and $Q_{i} (x)$ has no local minimum. This establishes (i). □
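As a numerical check on Theorem 2, the closed-form position $x_{\min}$ can be compared with a brute-force search over a grid of x values. In this Python sketch, $Q_{i} (x)$ is written in the form that appears in Equation B5, and the function names and the grid size are our own assumptions.

```python
import math

def Q(i, x, K):
    """Upper bound on F_ST at M = (i + x) / K, in the form of Equation B5."""
    return (K * (i + x * x) - (i + x) ** 2) / ((i + x) * (K - i - x))

def x_min(K, i):
    """Position of the local minimum of Q_i on [0, 1) according to Theorem 2."""
    if K < 3 or not (K // 2 <= i <= K - 2):
        raise ValueError("Theorem 2 gives no local minimum for this (K, i)")
    if 2 * i == K - 1:  # odd K with i = (K - 1) / 2
        return 0.5
    f = i * (i + 1) * (K - i) * (K - i - 1)
    return (i * (K - i) - math.sqrt(f)) / (2 * i - K + 1)  # Equation B1

def grid_argmin(K, i, n=100_000):
    """Grid point in [0, 1) at which Q_i is smallest."""
    return min((j / n for j in range(n)), key=lambda x: Q(i, x, K))

pairs = [(3, 1), (4, 2), (5, 2), (5, 3), (10, 7)]
errors = [abs(grid_argmin(K, i) - x_min(K, i)) for K, i in pairs]
```

The entries of `errors` should be on the order of the grid spacing, and for $K = 4$ the position equals $4 - 2 \sqrt{3} .$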

Positions of the Local Minima for Fixed K as a Function of i

Having identified the locations of the local minima, we now explore how those locations change at fixed K with increasing i. For fixed $K \geq 3,$ we consider $x_{\min}$ from Theorem 2 as a function of i on the interval $[⌊ K / 2 ⌋, K - 2] .$ It is convenient to define interval $I_{*},$ equaling $[K / 2, K - 2]$ for even K and $((K - 1) / 2, K - 2]$ for odd K.

Proposition 1. Consider a fixed integer $K \geq 3.$

(i) The function $x_{\min} (i)$ increases as i increases from $⌊ K / 2 ⌋$ to $K - 2.$
(ii) Its minimum is $x_{\min} [(K - 1) / 2] = 1 / 2$ if K is odd, and $x_{\min} (K / 2) = (K / 4) (K - \sqrt{K^{2} - 4})$ if K is even.
(iii) Its maximum is $x_{\min} (1) = 1 / 2$ for $K = 3,$ and for $K > 3$ it is
$x_{\min} (K - 2) = \frac{2 (K - 2) - \sqrt{2 (K - 2) (K - 1)}}{K - 3} .$ (B4)

Proof. By Theorem 2, for fixed $K \geq 3$ and $i \in I_{*},$ $x_{\min} (i)$ is given by Equation B1. Treating i as continuous, we take the derivative:

\frac{d x_{\min} (i)}{d i} = \frac{[(K - i) (K - i - 1) + i (i + 1)] [2 i (K - i - 1) + K - 1 - 2 \sqrt{f (i)}]}{2 {(2 i - K + 1)}^{2} \sqrt{f (i)}},

where $f (i) = i (i + 1) (K - i) (K - i - 1) .$ Because all other terms of $d x_{\min} (i) / d i$ are positive for i in $((K - 1) / 2, K - 2],$ $d x_{\min} (i) / d i$ has the same sign as $2 i (K - i - 1) + K - 1 - 2 \sqrt{f (i)} .$

Rearranging terms, we have

\sqrt{f (i)} = \sqrt{{[i (K - i - 1) + \frac{K - 1}{2}]}^{2} - \frac{{(K - 2 i - 1)}^{2}}{4}} .

Because $- {(K - 2 i - 1)}^{2} / 4 < 0$ for i in $((K - 1) / 2, K - 2],$ $\sqrt{f (i)} < \sqrt{{[i (K - i - 1) + (K - 1) / 2]}^{2}}$ and $- 2 \sqrt{f (i)} > - 2 i (K - i - 1) - K + 1.$ Consequently, $2 i (K - i - 1) + K - 1 - 2 \sqrt{f (i)} > 0$ for all i in $((K - 1) / 2, K - 2] .$ Thus, $d x_{\min} (i) / d i > 0$ for all i in $I_{*},$ and $x_{\min} (i)$ increases with i in this interval.

For odd K and $i = (K - 1) / 2,$ Equation B1 gives $\lim_{i \to (K - 1) / 2^{+}} λ (K, i) = 1 / 2 .$ Thus, because $x_{\min} [(K - 1) / 2] = 1 / 2$ by Theorem 2, $x_{\min} (i)$ is continuous at $(K - 1) / 2 .$ The function $x_{\min} (i)$ therefore increases with i in the closed interval $[⌊ K / 2 ⌋, K - 2] .$ This proves (i).

Because $x_{\min} (i)$ increases with i for all i in $[⌊ K / 2 ⌋, K - 2],$ $x_{\min} (i)$ is minimal when i is minimal. For odd K, the minimal value of i is $(K - 1) / 2 .$ From Theorem 2, $x_{\min} [(K - 1) / 2] = 1 / 2$ for all odd K. For even K, the minimal value of i is $K / 2 .$ From Theorem 2, $x_{\min} (K / 2) = (K / 4) (K - \sqrt{K^{2} - 4}) .$ This proves (ii).

Similarly, $x_{\min} (i)$ is maximal when i is maximal. From Theorem 2, the maximal value of i for which there exists a minimum of $Q_{i} (x)$ is $i = K - 2,$ and the position of this local minimum is $x_{\min} (K - 2) = [2 (K - 2) - \sqrt{2 (K - 2) (K - 1)}] / (K - 3) .$ In particular, for $K = 3,$ $(K - 1) / 2 = K - 2 = 1,$ so there is a unique local minimum at position $x_{\min} [(K - 1) / 2] = 1 / 2 .$ This proves (iii). □

Positions of the First and Last Local Minima as Functions of K

We now fix i and examine the effect of K on the local minimum at fixed i. We first focus on the interval closest to $M = 1 / 2,$ the first local minimum of the upper bound on $F_{ST} .$

Proposition 2. Consider integers $K \geq 3.$

(i) For odd K, the relative position $x_{\min} [(K - 1) / 2]$ of the first local minimum does not depend on K and is $1 / 2 .$
(ii) For even K, the relative position $x_{\min} (K / 2)$ of the first local minimum decreases as $K \to \infty,$ tends to $1 / 2,$ and is bounded above by $4 - 2 \sqrt{3} \approx 0.5359.$

Proof. For odd K, the interval closest to $M = 1 / 2$ is $[1 / 2, 1 / 2 + 1 / (2 K)) .$ In this interval, from Proposition 1ii, the minimum occurs at $x_{\min} [(K - 1) / 2] = 1 / 2$ irrespective of K. This proves (i).

For even K, the interval for M closest to $M = 1 / 2$ is $[1 / 2, 1 / 2 + 1 / K) .$ In this interval, from Proposition 1ii, the minimum has position $x_{\min} (K / 2) = (K / 4) (K - \sqrt{K^{2} - 4}) .$ The derivative of this function is

\frac{d x_{\min} (\frac{K}{2})}{d K} = - \frac{{(K - \sqrt{K^{2} - 4})}^{2}}{4 \sqrt{K^{2} - 4}},

which is negative for all $K \geq 3.$ Thus, $x_{\min} (K / 2)$ decreases with K for $K \geq 3.$ In addition, as $K \to ∞,$ $x_{\min} (K / 2) \to 1 / 2 .$ Because $x_{\min} (K / 2)$ decreases with K, its maximum value is reached when K is minimal. The minimal even value of K is $K = 4.$ Thus, $x_{\min} (K / 2) \leq x_{\min} (4 / 2) = 4 - 2 \sqrt{3} .$ This proves (ii). □

By Proposition 2, if K is large and even, then the first local minimum lies near the center of the interval $[1 / 2, 1 / 2 + 1 / K)$ for M.

Proposition 3. For integers $K \geq 3,$ the relative position $x_{\min} (K - 2)$ of the last local minimum increases as $K \to \infty$ and tends to $2 - \sqrt{2} \approx 0.5858.$

Proof. From Theorem 2, for $K = 3$ and $K = 4,$ there is a single local minimum. Hence, from Proposition 2, the position of the last local minimum is $x_{\min} (1) = 1 / 2$ for $K = 3$ and $x_{\min} (2) = 4 - 2 \sqrt{3} \approx 0.5359$ for $K = 4.$ The position of the last local minimum then increases from $K = 3$ to $K = 4.$

If $K > 3,$ from Proposition 1iii, the position of the last local minimum follows Equation B4. We take the derivative

\frac{d x_{\min} (K - 2)}{d K} = \frac{(3 K - 5) - 2 \sqrt{2 (K - 2) (K - 1)}}{{(K - 3)}^{2} \sqrt{2 (K - 2) (K - 1)}} .

For $K > 3,$ the denominator is positive and $d x_{\min} (K - 2) / d K$ has the same sign as its numerator. Because for $K > 3,$ ${(3 K - 5)}^{2} - 8 (K - 2) (K - 1) = {(K - 3)}^{2} > 0,$ we have $3 K - 5 > 2 \sqrt{2 (K - 2) (K - 1)}$ and a positive numerator. Then $d x_{\min} (K - 2) / d K > 0$ and $x_{\min} (K - 2)$ increases for $K > 3.$

From Equation B4, $x_{\min} (K - 2)$ tends to $2 - \sqrt{2} \approx 0.5858$ as $K \to ∞ .$ Thus, the last local minimum is not at the center of interval $I_{K - 2};$ rather, it is nearer to the upper endpoint. Because $x_{\min} (K - 2)$ increases with K, $x_{\min} (K - 2) < \lim_{K \to ∞} x_{\min} (K - 2)$ and the last local minimum has position bounded above by $2 - \sqrt{2} .$ □

Because Proposition 1i shows that for fixed K the relative position of the local minimum increases as i increases from $⌊ K / 2 ⌋$ to $K - 2,$ this relative position is restricted to the interval $[x_{\min} (⌊ K / 2 ⌋), x_{\min} (K - 2)] .$ Further, because from Proposition 2, $x_{\min} [(K - 1) / 2] = 1 / 2$ for odd K and $x_{\min} (K / 2) > 1 / 2$ for even K, and from Proposition 3, $x_{\min} (K - 2) < 2 - \sqrt{2},$ the relative positions of the local minima must be in the interval $[1 / 2, 2 - \sqrt{2}) .$
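The restriction of the relative positions to $[1 / 2, 2 - \sqrt{2})$ and their increase with i can be checked numerically over a range of K. The Python sketch below does so for $3 \leq K \leq 100;$ the helper names and the range of K are arbitrary choices for illustration.

```python
import math

def x_min(K, i):
    """Relative position of the local minimum of Q_i (Theorem 2, Equation B1)."""
    if 2 * i == K - 1:  # odd K with i = (K - 1) / 2
        return 0.5
    f = i * (i + 1) * (K - i) * (K - i - 1)
    return (i * (K - i) - math.sqrt(f)) / (2 * i - K + 1)

def positions(K):
    """Positions x_min(i) for i = floor(K / 2), ..., K - 2."""
    return [x_min(K, i) for i in range(K // 2, K - 1)]

upper = 2 - math.sqrt(2)
all_positions = {K: positions(K) for K in range(3, 101)}

# Range check (this paragraph) and monotonicity in i (Proposition 1i).
in_range = all(0.5 <= p < upper for ps in all_positions.values() for p in ps)
increasing = all(a < b for ps in all_positions.values() for a, b in zip(ps, ps[1:]))
```

Both `in_range` and `increasing` are expected to be true for every K in the range examined.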

Figure B1 illustrates as functions of K the relative positions of the first local minimum $(x_{\min} [(K - 1) / 2]$ for odd K and $x_{\min} (K / 2)$ for even K $)$ and the last local minimum $(x_{\min} (K - 2)) .$ The restriction of these positions to the interval $[1 / 2, 2 - \sqrt{2})$ is visible, with the first local minimum lying closer to the center of interval $[0, 1)$ for x than the last local minimum. The decrease in the position of the first local minimum for even K alternating with values of $1 / 2$ for odd K (Proposition 2) and the increase in the position of the last local minimum (Proposition 3) are visible as well.

Values at the Local Minima

We obtain the value of the local minima of the upper bound on $F_{ST}$ in each interval $I_{i}$ by substituting into Equation 6 the value of i for interval $I_{i}$ and its associated $x_{\min} (i)$ from Theorem 2. We obtain

Q_{i} [x_{\min} (i)] = \frac{K [i + x_{\min} {(i)}^{2}] - {[i + x_{\min} (i)]}^{2}}{[i + x_{\min} (i)] [K - i - x_{\min} (i)]} .

(B5)

Note that for odd K, although $λ (K, i)$ is undefined at $i = (K - 1) / 2,$ $x_{\min} (i)$ is continuous. Thus, $Q_{i} [x_{\min} (i)]$ is also defined and continuous for all $i \in [⌊ K / 2 ⌋, K - 2] .$ We consider $Q_{i}$ as a function of i on this interval.
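Equation B5 can be evaluated directly to list the values of the local minima for a given K. The Python sketch below uses illustrative function names and writes $Q_{i} (x)$ in the form of Equation B5.

```python
import math

def Q(i, x, K):
    """Upper bound on F_ST at M = (i + x) / K (form of Equation B5)."""
    return (K * (i + x * x) - (i + x) ** 2) / ((i + x) * (K - i - x))

def x_min(K, i):
    """Position of the local minimum of Q_i (Theorem 2, Equation B1)."""
    if 2 * i == K - 1:  # odd K with i = (K - 1) / 2
        return 0.5
    f = i * (i + 1) * (K - i) * (K - i - 1)
    return (i * (K - i) - math.sqrt(f)) / (2 * i - K + 1)

def local_minima_values(K):
    """Values Q_i[x_min(i)] for i = floor(K / 2), ..., K - 2."""
    return [Q(i, x_min(K, i), K) for i in range(K // 2, K - 1)]

values_k5 = local_minima_values(5)
```

For $K = 5,$ the first entry of `values_k5` corresponds to $i = 2$ and $x_{\min} = 1 / 2,$ where Equation B5 gives $5 / 6.25 = 0.8;$ the second entry, for $i = 3,$ is smaller.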

Proposition 4. For fixed $K \geq 3,$ the local minima $Q_{i} [x_{\min} (i)]$ decrease as i increases from $⌊ K / 2 ⌋$ to $K - 2.$

Proof. We take the derivative $d Q_{i} [x_{\min} (i)] / d i$ for fixed K and $i \in I_{*} .$ From Equations B5 and B1,

\frac{d Q_{i} (x_{\min} (i))}{d i} = \frac{\sqrt{f (i)} K (2 i - K + 1) u (i)}{w {(i)}^{2} {[w (i) + K (2 i - K + 1)]}^{2}},

where $w (i) = \sqrt{f (i)} - i (i + 1)$ and $u (i) = 2 (K^{2} - 1) \sqrt{f (i)} + 2 (K^{2} + 1) i^{2} - (K - 1) (2 i K^{2} + K^{2} - K + 2 i) .$

For all $i \in I_{*},$ the denominator of the derivative is positive, as are $\sqrt{f (i)},$ K, and $2 i - K + 1.$ Hence, $d Q_{i} [x_{\min} (i)] / d i$ has the same sign as $u (i) .$

Because $f (i)$ decreases for $i \in [(K - 1) / 2, K - 2],$ $\sqrt{f (i)} \leq \sqrt{f [(K - 1) / 2]} = (K^{2} - 1) / 4$ and $K^{2} - 1 - 4 \sqrt{f (i)} \geq 0.$ We factor $u (i) :$

u (i) = 2 (K^{2} + 1) [i - v (i) - \frac{K - 1}{2}] [i + v (i) - \frac{K - 1}{2}],

where $v (i) = \sqrt{(K^{2} - 1) (K^{2} - 1 - 4 \sqrt{f (i)})} / (2 \sqrt{K^{2} + 1}) .$ Because $v (i) \geq 0,$ for all $i \geq (K - 1) / 2,$ $i + v (i) - (K - 1) / 2 \geq 0.$ Thus, the sign of $u (i)$ is given by the sign of

i - v (i) - \frac{K - 1}{2} = \frac{(2 i - K + 1) \sqrt{K^{2} + 1} - \sqrt{(K^{2} - 1) [K^{2} - 1 - 4 \sqrt{f (i)}]}}{2 \sqrt{K^{2} + 1}} .

For $i \in ((K - 1) / 2, K - 2],$

\frac{K^{2} {(2 i - K + 1)}^{4}}{4 {(K - 1)}^{2} {(K + 1)}^{2}} = {[\frac{K^{2} - 1}{4} - \frac{(K^{2} + 1) {[i - \frac{K - 1}{2}]}^{2}}{K^{2} - 1}]}^{2} - f (i) > 0,

and hence,

- 4 | \frac{K^{2} - 1}{4} - \frac{(K^{2} + 1) {[i - (K - 1) / 2]}^{2}}{K^{2} - 1} | < - 4 \sqrt{f (i)} .

(B6)

The term ${[i - (K - 1) / 2]}^{2}$ increases as a function of i for $i \in [(K - 1) / 2, K - 2] .$ Hence, $(K^{2} - 1) / 4 - {(K^{2} + 1) {[i - (K - 1) / 2]}^{2}} / (K^{2} - 1)$ decreases with i. It is minimal at the largest value in the permissible domain for i, or $i = K - 2,$ with minimum $[3 K {(K - 1)}^{2} - 4] / [2 (K - 1) (K + 1)] .$ The denominator of this quantity is positive and the numerator increases with K. It is thus minimal for $K = 3,$ at which $3 K {(K - 1)}^{2} - 4 = 32 > 0.$ This proves that $[(K^{2} - 1) / 4] - {(K^{2} + 1) {[i - (K - 1) / 2]}^{2}} / (K^{2} - 1) > 0$ for $i \in [(K - 1) / 2, K - 2].$

We can then remove the absolute value in Equation B6 and rearrange to obtain $(K^{2} + 1) {(2 i - K + 1)}^{2} < (K^{2} - 1) [K^{2} - 1 - 4 \sqrt{f (i)}] .$ Both sides of this inequality are positive for $i \in ((K - 1) / 2, K - 2]$ and we can take the square root to obtain $\sqrt{K^{2} + 1} (2 i - K + 1) < \sqrt{(K^{2} - 1) [K^{2} - 1 - 4 \sqrt{f (i)}]} .$ Hence $i - v (i) - (K - 1) / 2 < 0$ for $i \in ((K - 1) / 2, K - 2],$ $d Q_{i} [x_{\min} (i)] / d i < 0$ for $i \in I_{*},$ and the local minima $Q_{i} [x_{\min} (i)]$ decrease with i. □
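Proposition 4 can also be checked numerically. The sketch below is an illustration rather than part of the proof: it evaluates $Q_{i} (x)$ in the form that appears in Equation B5 and locates each local minimum by a dense grid search on $[0, 1]$ instead of using the closed form for $x_{\min} (i).$ The function names and the grid size are assumptions made for this example.

```python
def q_upper(K, i, x):
    """Upper bound Q_i(x) on F_ST, written as in Equation B5."""
    s = i + x
    return (K * (i + x * x) - s * s) / (s * (K - s))


def local_minimum(K, i, grid=20001):
    """Return (x, Q_i(x)) at the smallest grid value of Q_i on [0, 1]."""
    best_x, best_q = 0.0, q_upper(K, i, 0.0)
    for j in range(1, grid):
        x = j / (grid - 1)
        q = q_upper(K, i, x)
        if q < best_q:
            best_x, best_q = x, q
    return best_x, best_q


def minima_for(K):
    """Local minima (i, x_min, Q_min) for i = floor(K/2), ..., K - 2."""
    return [(i,) + local_minimum(K, i) for i in range(K // 2, K - 1)]


if __name__ == "__main__":
    for i, x, q in minima_for(10):
        print(f"i={i}  x_min~{x:.4f}  Q_min~{q:.6f}")
```

For a given K, the printed minima should decrease as i increases, while their relative positions increase, in agreement with Propositions 1 and 4.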

Proposition 5. For $K = 3,$ the first local minimum $Q_{i} [x_{\min} (i)]$ has value $2 / 3;$ for $K \geq 3,$ the first local minimum increases as a function of K and tends to 1 as $K \to \infty .$

Proof. From Proposition 2i, for odd K, the first local minimum is reached for $i = (K - 1) / 2$ and $x = 1 / 2,$ and the upper bound on $F_{ST}$ is $Q_{(K - 1) / 2} (1 / 2) = 1 - (1 / K) .$ Thus, for $K = 3,$ the first local minimum has value $Q_{1} (1 / 2) = 2 / 3 .$ For even K, the first local minimum is reached if $i = K / 2$ and $x = x_{\min} (K / 2),$ with upper bound on $F_{ST}$ equal to

Q_{\frac{K}{2}} [x_{\min} (\frac{K}{2})] = \frac{(K - 2) [K (K + 1) - (K + 1) \sqrt{K^{2} - 4} - 2]}{\sqrt{K^{2} - 4} (K - \sqrt{K^{2} - 4})} .

(B7)

Denote by $Δ_{1} (K) = Q_{K / 2} [x_{\min} (K / 2)] - [1 - 1 / (K - 1)]$ the difference between the first local minimum for even K and the first local minimum for odd $K - 1,$ and by $Δ_{2} (K) = [1 - 1 / (K + 1)] - Q_{K / 2} [x_{\min} (K / 2)]$ the difference between the first local minimum for odd $K + 1$ and the first local minimum for even K. To show that the first local minimum increases with K, we must show that for all even $K \geq 4,$ (i) $Δ_{1} (K) > 0,$ and (ii) $Δ_{2} (K) > 0.$

For (i), subtracting $1 - [1 / (K - 1)]$ from Equation B7, we have

Δ_{1} (K) = \frac{(K - 2) [(K + 2) (K^{2} - K - 1) - \sqrt{K^{2} - 4} (K^{2} + K - 1)]}{(K - 1) \sqrt{K^{2} - 4} (K - \sqrt{K^{2} - 4})} .

Because all other terms are positive for $K \geq 3,$ $Δ_{1} (K)$ has the same sign as $(K + 2) (K^{2} - K - 1) - \sqrt{K^{2} - 4} (K^{2} + K - 1) .$ Dividing by $\sqrt{K + 2},$ this quantity in turn has the same sign as $\sqrt{K + 2} (K^{2} - K - 1) - \sqrt{K - 2} (K^{2} + K - 1) .$ This last quantity is positive for $K \geq 4,$ as when we multiply it by the positive $\sqrt{K + 2} (K^{2} - K - 1) + \sqrt{K - 2} (K^{2} + K - 1),$ the result reduces simply to the number 4. This proves (i).
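The reduction to the number 4 follows by expanding both squares. The intermediate polynomials below are a worked expansion added for convenience; they are not part of the original proof.

```latex
\begin{aligned}
(K + 2) (K^{2} - K - 1)^{2} &= (K + 2) (K^{4} - 2 K^{3} - K^{2} + 2 K + 1) = K^{5} - 5 K^{3} + 5 K + 2, \\
(K - 2) (K^{2} + K - 1)^{2} &= (K - 2) (K^{4} + 2 K^{3} - K^{2} - 2 K + 1) = K^{5} - 5 K^{3} + 5 K - 2, \\
(K + 2) (K^{2} - K - 1)^{2} - (K - 2) (K^{2} + K - 1)^{2} &= 4 .
\end{aligned}
```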

For (ii), subtracting Equation B7 from $1 - 1 / (K + 1),$ we have

Δ_{2} (K) = \frac{(K + 2) [\sqrt{K^{2} - 4} (K^{2} - K - 1) - (K - 2) (K^{2} + K - 1)]}{(K + 1) \sqrt{K^{2} - 4} (K - \sqrt{K^{2} - 4})} .

Because all other terms are positive for $K \geq 3,$ $Δ_{2} (K)$ has the same sign as $\sqrt{K^{2} - 4} (K^{2} - K - 1) - (K - 2) (K^{2} + K - 1) .$ Dividing by $\sqrt{K - 2},$ this quantity has the same sign as $\sqrt{K + 2} (K^{2} - K - 1) - \sqrt{K - 2} (K^{2} + K - 1),$ which was shown to be positive in the proof of (i). This demonstrates (ii).

From (i) and (ii), the value of the upper bound on $F_{ST}$ at the first local minimum increases with K for all $K \geq 3.$ To see that the limiting value is 1 as $K \to ∞,$ we note that the subsequence of values $1 - (1 / K)$ at odd K tends to 1 as $K \to ∞ .$ As the sequence of values of the first local minimum with increasing K is monotonic and bounded above by 1, it is therefore convergent; as it has a subsequence converging to 1, the sequence converges to 1. □
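As a numerical illustration of Proposition 5, the following sketch tabulates the first local minimum, using $1 - (1 / K)$ for odd K and Equation B7 for even K. The function name is an assumption of this example, and the check covers only a finite range of K; it does not replace the proof.

```python
import math


def first_local_minimum(K):
    """Value of the upper bound on F_ST at the first local minimum."""
    if K < 3:
        raise ValueError("K must be at least 3")
    if K % 2 == 1:
        # Odd K: Q_{(K-1)/2}(1/2) = 1 - 1/K.
        return 1.0 - 1.0 / K
    # Even K: Equation B7.
    s = math.sqrt(K * K - 4)
    numerator = (K - 2) * (K * (K + 1) - (K + 1) * s - 2)
    return numerator / (s * (K - s))


if __name__ == "__main__":
    for K in (3, 4, 5, 6, 10, 50, 200):
        print(K, round(first_local_minimum(K), 6))
```

The values for even K fall strictly between those for the neighboring odd values $K - 1$ and $K + 1,$ which is the content of conditions (i) and (ii).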

Proposition 6. For $K = 3,$ the last local minimum $Q_{i} [x_{\min} (i)]$ has value $2 / 3;$ for $K \geq 3,$ the last local minimum increases as a function of K and tends to $2 \sqrt{2} - 2$ as $K \to \infty .$

Proof. From Proposition 5, for $K = 3,$ the single local minimum has value $2 / 3 .$ By Theorem 2, for $K > 3,$ the last local minimum is reached when $i = K - 2$ and $x = x_{\min} (K - 2),$ in which case from Equations B5 and B1 the upper bound on $F_{ST}$ is

Q_{K - 2} [x_{\min} (K - 2)] = \frac{2 (K - 2) [\sqrt{2} {(K - 1)}^{2} - \sqrt{h (K)} (K + 1)]}{\sqrt{h (K)} {(\sqrt{h (K)} - \sqrt{2})}^{2}},

(B8)

where $h (K) = (K - 2) (K - 1) .$

We examine the derivative of $Q_{K - 2} [x_{\min} (K - 2)]$ with respect to K. For $K \geq 3,$ $h (K) > 0$ and

\frac{d Q_{K - 2} [x_{\min} (K - 2)]}{d K} = \frac{α (K) \sqrt{2}}{\sqrt{h (K)} {(\sqrt{h (K)} - \sqrt{2})}^{4}},

where $α (K) = 3 K {(K - 1)}^{2} - 4 - 2 \sqrt{2} (K - 1) (K + 1) \sqrt{h (K)} .$ For $K > 3,$ the derivative has the same sign as $α (K) .$

We note that

\frac{{(K - 3)}^{4} K^{2}}{8 {(K - 1)}^{2} {(K + 1)}^{2}} = {[\frac{3 {(K - 1)}^{2} K - 4}{2 \sqrt{2} (K - 1) (K + 1)}]}^{2} - h (K) \geq 0,

with equality only at $K = 3.$ Because $3 K {(K - 1)}^{2} - 4$ is positive for $K \geq 3,$ $[3 K {(K - 1)}^{2} - 4] / [2 \sqrt{2} (K - 1) (K + 1)] \geq \sqrt{h (K)},$ with equality only at $K = 3.$ Thus, $α (K) > 0$ and $d Q_{K - 2} [x_{\min} (K - 2)] / d K > 0$ for $K > 3,$ and hence, the last local minimum increases with K.

For the limit as $K \to ∞,$ we take the limit of $Q_{K - 2} [x_{\min} (K - 2)]$ in Equation B8, obtaining $2 \sqrt{2} - 2 \approx 0.8284.$ □
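Equation B8 is straightforward to evaluate, which gives a quick check of Proposition 6 over a finite range of K. In the sketch below, the function and constant names are assumptions of this example; the case $K = 3$ is excluded because Equation B8 takes the indeterminate form 0/0 there, and the value $2 / 3$ from Proposition 5 applies instead.

```python
import math


def last_local_minimum(K):
    """Equation B8: upper bound on F_ST at the last local minimum, K > 3."""
    if K <= 3:
        raise ValueError("Equation B8 requires K > 3")
    root_h = math.sqrt((K - 2) * (K - 1))  # sqrt(h(K))
    numerator = 2 * (K - 2) * (math.sqrt(2) * (K - 1) ** 2 - root_h * (K + 1))
    return numerator / (root_h * (root_h - math.sqrt(2)) ** 2)


# Limiting value stated in Proposition 6.
LIMIT = 2 * math.sqrt(2) - 2

if __name__ == "__main__":
    for K in (4, 10, 100, 10_000):
        print(K, last_local_minimum(K), LIMIT - last_local_minimum(K))
```

The gap to the limit $2 \sqrt{2} - 2$ shrinks as K grows, and the values stay below the limit throughout.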

Appendix C: Computing the Mean Range of F_ST

This appendix provides the computation of the integral $A (K)$ (Equation 7) and its asymptotic approximation $\tilde{A} (K) .$

Computing A(K)

To compute $A (K),$ we break the integral into a sum over intervals. If K is even, then we consider intervals $I_{i} = [i / K, (i + 1) / K)$ with $i = K / 2, (K / 2) + 1, \dots, K - 1.$ If K is odd, then we use intervals $I = [1 / 2, (K + 1) / (2 K))$ and $I_{i} = [i / K, (i + 1) / K)$ with $i = (K + 1) / 2, (K + 3) / 2, \dots, K - 1.$

By construction of $Q_{i} (x)$ (Equation 6), in each interval $I_{i},$ the upper bound on $F_{ST}$ is equal to $Q_{i} (x)$ with $x = \{ K M \} .$ In the odd case, because $(K + 1) / 2$ is an integer, on interval $[1 / 2, (K + 1) / (2 K)),$ $⌊ K M ⌋$ has a constant value $(K - 1) / 2$ and $\{ K M \} = K M - (K - 1) / 2,$ and the upper bound is equal to $Q_{(K - 1) / 2} (x) .$ Making the substitution $x = K M - i,$ we obtain $d x = K d M$ and we can write Equation 7 in terms of $Q_{i} (x) :$

A (K) = \begin{cases} \frac{2}{K} \sum_{i = \frac{K}{2}}^{K - 1} \int_{0}^{1} Q_{i} (x) d x, & K \text{ even} \\ \frac{2}{K} (\int_{\frac{1}{2}}^{1} Q_{\frac{K - 1}{2}} (x) d x + \sum_{i = \frac{K + 1}{2}}^{K - 1} \int_{0}^{1} Q_{i} (x) d x), & K \text{ odd} . \end{cases}

(C1)

We next use a partial fraction decomposition. For $⌊ K / 2 ⌋ \leq i \leq K - 2,$

Q_{i} (x) = 1 - K + \frac{i (i + 1)}{i + x} + \frac{(K - i) (K - i - 1)}{K - i - x},

\int_{0}^{1} Q_{i} (x) d x = 1 - K + i (i + 1) \ln (\frac{i + 1}{i}) + (K - i) (K - i - 1) \ln (\frac{K - i}{K - i - 1}) .

For $i = K - 1,$

Q_{i} (x) = 1 - K + \frac{i (i + 1)}{i + x},

\int_{0}^{1} Q_{K - 1} (x) d x = 1 - K + (K - 1) K \ln (\frac{K}{K - 1}) .

For $i = (K - 1) / 2,$

\int_{\frac{1}{2}}^{1} Q_{\frac{K - 1}{2}} (x) d x = \frac{1 - K}{2} + (\frac{K - 1}{2}) (\frac{K + 1}{2}) \ln (\frac{K + 1}{K - 1}) .

Thus, when K is even,

\begin{matrix} A (K) = \frac{2}{K} \sum_{i = \frac{K}{2}}^{K - 1} \int_{0}^{1} Q_{i} (x) d x \\ = - \frac{2}{K} \sum_{i = \frac{K}{2}}^{K - 1} (K - 1) + \frac{2}{K} \sum_{i = \frac{K}{2}}^{K - 1} [i (i + 1) \ln (\frac{i + 1}{i})] \\ + \frac{2}{K} \sum_{i = \frac{K}{2}}^{K - 2} [(K - i) (K - i - 1) \ln (\frac{K - i}{K - i - 1})] \\ = 1 - K + \frac{2}{K} \sum_{i = 1}^{K - 1} [i (i + 1) \ln (\frac{i + 1}{i})] . \end{matrix}

When K is odd,

\begin{matrix} A (K) = \frac{2}{K} \int_{\frac{1}{2}}^{1} Q_{\frac{K - 1}{2}} (x) d x + \frac{2}{K} \sum_{i = \frac{K + 1}{2}}^{K - 1} \int_{0}^{1} Q_{i} (x) d x \\ = \frac{2}{K} \frac{1 - K}{2} + \frac{2}{K} (\frac{K - 1}{2}) (\frac{K + 1}{2}) \ln (\frac{K + 1}{K - 1}) \\ - \frac{2}{K} \sum_{i = \frac{K + 1}{2}}^{K - 1} (K - 1) + \frac{2}{K} \sum_{i = \frac{K + 1}{2}}^{K - 1} [i (i + 1) \ln (\frac{i + 1}{i})] \\ + \frac{2}{K} \sum_{i = \frac{K + 1}{2}}^{K - 2} [(K - i) (K - i - 1) \ln (\frac{K - i}{K - i - 1})] \\ = 1 - K + \frac{2}{K} \sum_{i = 1}^{K - 1} [i (i + 1) \ln (\frac{i + 1}{i})] . \end{matrix}

(C2)

The expressions for $A (K)$ are equal for even and odd K. We can simplify further:

\begin{matrix} \sum_{i = 1}^{K - 1} i (i + 1) \ln (\frac{i + 1}{i}) = \sum_{i = 2}^{K} (i - 1) i \ln i - \sum_{i = 2}^{K - 1} i (i + 1) \ln i \\ = K (K + 1) \ln K - 2 \sum_{i = 2}^{K} i \ln i . \end{matrix}

(C3)

Substituting the expression from Equation C3 into Equation C2 and simplifying, we obtain Equation 8.
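The chain of equalities from Equation C1 to Equation C3 can be cross-checked numerically. The sketch below computes $A (K)$ in three ways: by midpoint-rule integration of Equation C1, by the logarithmic sum in the last line of Equation C2, and by the closed form obtained when Equation C3 is substituted into Equation C2. That closed form is presumed, not confirmed here, to coincide with Equation 8. All function names and the number of quadrature panels are assumptions of this example.

```python
import math


def q_upper(K, i, x):
    """Q_i(x) in the partial-fraction form used in this appendix."""
    value = 1 - K + i * (i + 1) / (i + x)
    if i < K - 1:  # the second fraction vanishes when i = K - 1
        value += (K - i) * (K - i - 1) / (K - i - x)
    return value


def midpoint(f, a, b, n=2000):
    """Composite midpoint rule on [a, b] with n panels."""
    h = (b - a) / n
    return h * sum(f(a + (j + 0.5) * h) for j in range(n))


def a_quadrature(K):
    """A(K) from Equation C1 by numerical integration."""
    if K % 2 == 0:
        start, total = K // 2, 0.0
    else:
        half = (K - 1) // 2
        start = half + 1
        total = midpoint(lambda x: q_upper(K, half, x), 0.5, 1.0)
    for i in range(start, K):
        total += midpoint(lambda x, i=i: q_upper(K, i, x), 0.0, 1.0)
    return 2.0 * total / K


def a_log_sum(K):
    """A(K) from the last line of Equation C2."""
    s = math.fsum(i * (i + 1) * math.log((i + 1) / i) for i in range(1, K))
    return 1 - K + 2.0 * s / K


def a_closed(K):
    """A(K) after substituting Equation C3 into Equation C2."""
    s = math.fsum(i * math.log(i) for i in range(2, K + 1))
    return 1 - K + 2 * (K + 1) * math.log(K) - 4.0 * s / K


if __name__ == "__main__":
    for K in (2, 3, 7, 20):
        print(K, a_quadrature(K), a_log_sum(K), a_closed(K))
```

Agreement of the three columns for both even and odd K reflects the statement that the expressions for $A (K)$ are equal in the two cases.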

Asymptotic Approximation for A(K) (Equation 9)

To asymptotically approximate $A (K),$ we first need a large-K approximation of $\sum_{i = 2}^{K} i \ln i .$ Because $\sum_{i = 2}^{K} i \ln i = \ln [H (K)],$ where $H (K) = \prod_{i = 1}^{K} i^{i}$ is the hyperfactorial function, we can use classical results about the asymptotic behavior of $H (K) :$

\lim_{K \to ∞} \frac{H (K)}{\exp (- \frac{K^{2}}{4}) K^{\frac{K^{2}}{2} + \frac{K}{2} + \frac{1}{12}}} = C,

(C4)

where C is the Glaisher–Kinkelin constant. Because the logarithm function is continuous at C, $\ln \{ H (K) / [\exp (- K^{2} / 4) K^{(K^{2} / 2) + (K / 2) + (1 / 12)}] \}$ has limit $\ln C$ as $K \to ∞ .$ Thus, if we write $f (K) \sim g (K)$ for two functions that satisfy $\lim_{K \to ∞} [f (K) / g (K)] = 1,$ then

\sum_{i = 2}^{K} i \ln i \sim - \frac{K^{2}}{4} + (\frac{K^{2}}{2} + \frac{K}{2} + \frac{1}{12}) \ln K + \ln C .

(C5)

Substituting the expression from Equation C5 into Equation 8, we obtain function $\tilde{A} (K)$ (Equation 9) and the relationship $\tilde{A} (K) \sim A (K) .$
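The quality of the approximation in Equation C5 can be inspected directly. The sketch below compares the exact sum $\sum_{i = 2}^{K} i \ln i$ with the right-hand side of Equation C5. The numerical value used for the Glaisher–Kinkelin constant is a standard double-precision quotation supplied for this example, not a value taken from the article, and the function names are likewise assumptions.

```python
import math

# Glaisher-Kinkelin constant C (assumed value, quoted to double precision).
GLAISHER_KINKELIN = 1.2824271291006226


def log_hyperfactorial(K):
    """ln H(K) = sum_{i=2}^{K} i ln i, summed directly."""
    return math.fsum(i * math.log(i) for i in range(2, K + 1))


def log_hyperfactorial_approx(K):
    """Right-hand side of Equation C5."""
    return (-K * K / 4
            + (K * K / 2 + K / 2 + 1 / 12) * math.log(K)
            + math.log(GLAISHER_KINKELIN))


if __name__ == "__main__":
    for K in (2, 10, 100, 1000):
        exact = log_hyperfactorial(K)
        print(K, exact, exact - log_hyperfactorial_approx(K))
```

The absolute difference between the two sides is already small for modest K and shrinks further as K increases, which is stronger than the ratio statement $f (K) \sim g (K)$ requires.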

Monotonicity of Equation 8 in K

Theorem 3. $A (K)$ increases monotonically in K for $K \geq 2.$

Proof. We must show that $Δ A (K) = A (K + 1) - A (K) > 0$ for all $K \geq 2.$ From the expression for $A (K)$ in Equation 8, we have:

Δ A (K) = - 1 + 2 K \ln (1 + \frac{1}{K}) - 2 \ln K + \frac{4 \sum_{i = 2}^{K} i \ln i}{K (K + 1)} .

(C6)

To show that $Δ A (K) > 0,$ we find a lower bound for $Δ A (K),$ denoted $D (K),$ and then show that $D (K) > 0.$

We first find a lower bound for $\sum_{i = 2}^{K} i \ln i .$ From the Euler–Maclaurin summation formula, we have

\sum_{i = 2}^{K} i \ln i = (\int_{1}^{K} x \ln x d x) + \frac{K \ln K}{2} + \int_{1}^{K} (\ln x + 1) (x - ⌊ x ⌋ - \frac{1}{2}) d x .

For all positive integers $i \geq 2,$

\int_{i - 1}^{i} (\ln x + 1) (x - ⌊ x ⌋ - \frac{1}{2}) d x = \frac{1}{4} [2 i - 1 - 2 i (i - 1) \ln (\frac{i}{i - 1})] .

(C7)

This integral can be seen to be positive from the equivalence for $i > 1$ of $2 i - 1 - 2 i (i - 1) \ln (i / [i - 1]) > 0$ with $\exp [(2 i - 1) / [2 i (i - 1)]] > i / (i - 1) .$ This latter inequality follows from the inequality $\exp (x) > 1 + x + x^{2} / 2$ for $x > 0$ from the Taylor expansion of $e^{x},$ noting that $1 + (2 i - 1) / [2 i (i - 1)] + (1 / 2) {[(2 i - 1) / [2 i (i - 1)]]}^{2} > i / (i - 1) .$
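The last inequality can be made explicit. Writing $y = (2 i - 1) / [2 i (i - 1)],$ a symbol introduced here only for brevity, and using $i / (i - 1) = 1 + 1 / (i - 1),$ the difference of the two sides reduces to a single positive term:

```latex
\begin{aligned}
1 + y + \frac{y^{2}}{2} - \frac{i}{i - 1}
  &= \frac{y^{2}}{2} - \frac{1}{2 i (i - 1)} \\
  &= \frac{(2 i - 1)^{2} - 4 i (i - 1)}{8 i^{2} (i - 1)^{2}} \\
  &= \frac{1}{8 i^{2} (i - 1)^{2}} > 0 .
\end{aligned}
```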

Consequently, as the integral in Equation C7 is positive for each $i \geq 2,$ $\int_{1}^{K} (\ln x + 1) [x - ⌊ x ⌋ - (1 / 2)] d x > 0,$ and

\begin{matrix} \sum_{i = 2}^{K} i \ln i > (\int_{1}^{K} x \ln x d x) + \frac{K \ln K}{2} \\ = \frac{K (K + 1) \ln K}{2} + \frac{1 - K^{2}}{4} . \end{matrix}

(C8)

As a result, the following function is a lower bound for $Δ A :$

\begin{matrix} D (K) = - 1 + 2 K \ln (1 + \frac{1}{K}) - 2 \ln K + \frac{4 [\frac{K (K + 1) \ln K}{2} + \frac{1 - K^{2}}{4}]}{K (K + 1)} \\ = - 2 + \frac{1}{K} + 2 K \ln (1 + \frac{1}{K}) . \end{matrix}

Dividing by $2 K$ and substituting $u = 1 / K$ for $K > 0,$ $D (K) > 0$ if and only if $f (u) = \ln (1 + u) - u + (u^{2} / 2) > 0$ for $u > 0.$ It can be seen that this latter inequality holds by noting that $f (0) = 0$ and $f^{'} (u) = [1 / (1 + u)] - 1 + u = u^{2} / (1 + u) > 0.$ □
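The two inequalities used in the proof of Theorem 3, $Δ A (K) > D (K)$ and $D (K) > 0,$ can be illustrated numerically. The sketch below evaluates Equation C6 and the lower bound $D (K)$ over a finite range of K; the function names are assumptions of this example, and `math.log1p` is used only to keep $\ln (1 + 1 / K)$ accurate for large K.

```python
import math


def delta_a(K):
    """Equation C6: A(K + 1) - A(K)."""
    s = math.fsum(i * math.log(i) for i in range(2, K + 1))
    return (-1 + 2 * K * math.log1p(1 / K) - 2 * math.log(K)
            + 4 * s / (K * (K + 1)))


def d_lower(K):
    """Lower bound D(K) = -2 + 1/K + 2 K ln(1 + 1/K)."""
    return -2 + 1 / K + 2 * K * math.log1p(1 / K)


if __name__ == "__main__":
    for K in (2, 5, 20, 200):
        print(K, delta_a(K), d_lower(K))
```

Both quantities are positive and decrease toward 0 as K grows, consistent with $A (K)$ increasing monotonically while remaining bounded.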

Figure B1 — The first and last local minima of the upper bound on $F_{ST}$ as functions of the frequency M of the most frequent allele, for $K \geq 3$ subpopulations. (A) Relative positions within the interval $[i / K, (i + 1) / K)$ of the first and last local minima, as functions of K. The position $x_{\min} (i)$ of the local minimum in interval $I_{i}$ is computed from Equation B1. For the first local minimum, this position is $x_{\min} [(K - 1) / 2]$ if K is odd and $x_{\min} (K / 2)$ if K is even. The position of the last local minimum is $x_{\min} (K - 2) .$ Dashed lines indicate the smallest value for $x_{\min} (i)$ of $1 / 2,$ and the limiting largest value of $2 - \sqrt{2} .$ (B) The value of the upper bound on $F_{ST}$ at the first and last local minima, as functions of K. These values are computed from Equation 5, taking $⌊ K M ⌋ = i$ and $\{ K M \} = x_{\min} (i),$ with $x_{\min} (i)$ as in part (A). Dashed lines indicate the limiting values of 1 and $2 \sqrt{2} - 2$ for the first and last local minima, respectively.

Footnotes

Communicating editor: M. A. Beaumont

Literature Cited

1000 Genomes Project Consortium, 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
Akey J. M., Zhang G., Zhang K., Jin L., Shriver M. D., 2002. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12: 1805–1814.
Alcala N., Goudet J., Vuilleumier S., 2014. On the transition of genetic differentiation from isolation to panmixia: what we can learn from G_ST and D. Theor. Popul. Biol. 93: 75–84.
Algee-Hewitt B. F. B., Edge M. D., Kim J., Li J. Z., Rosenberg N. A., 2016. Individual identifiability predicts population identifiability in forensic microsatellite markers. Curr. Biol. 26: 935–942.
Balloux F., Brünner H., Lugon-Moulin N., Hausser J., Goudet J., 2000. Microsatellites can be misleading: an empirical and simulation study. Evolution 54: 1414–1422.
Beaumont M. A., Nichols R. A., 1996. Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. Ser. B Biol. Sci. 263: 1619–1626.
Bonhomme M., Chevalet C., Servin B., Boitard S., Abdallah J., et al., 2010. Detecting selection in population trees: the Lewontin and Krakauer test extended. Genetics 186: 241–262.
Cann H. M., de Toma C., Cazes L., Legrand M.-F., Morel V., et al., 2002. A human genome diversity cell line panel. Science 296: 261–262.
Cañon J., Checa M. L., Carleos C., Vega-Pla J. L., Vallejo M., et al., 2000. The genetic structure of Spanish Celtic horse breeds inferred from microsatellite data. Anim. Genet. 31: 39–48.
Cornuet J.-M., Santos F., Beaumont M. A., Robert C. P., Marin J.-M., et al., 2008. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics 24: 2713–2719.
Edge M. D., Rosenberg N. A., 2014. Upper bounds on F_ST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theor. Popul. Biol. 97: 20–34.
Foll M., Gaggiotti O., 2008. A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180: 977–993.
Frankham R., Ballou J. D., Briscoe D. A., 2002. Introduction to Conservation Genetics. Cambridge University Press, Cambridge, United Kingdom.
Günther T., Coop G., 2013. Robust identification of local adaptation from allele frequencies. Genetics 195: 205–220.
Hartl D. L., Clark A. G., 1997. Principles of Population Genetics. Sinauer, Sunderland, MA.
Hedrick P. W., 1999. Highly variable loci and their interpretation in evolution and conservation. Evolution 53: 313–318.
Hedrick P. W., 2005. A standardized genetic differentiation measure. Evolution 59: 1633–1638.
Hinds D. A., Stuve L. L., Nilsen G. B., Halperin E., Eskin E., et al., 2005. Whole-genome patterns of common DNA variation in three human populations. Science 307: 1072–1079.
Holsinger K. E., Weir B. S., 2009. Genetics in geographically structured populations: defining, estimating and interpreting F_ST. Nat. Rev. Genet. 10: 639–650.
Hudson R. R., 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.
International HapMap Consortium, 2005. A haplotype map of the human genome. Nature 437: 1299–1320.
Jakobsson M., Edge M. D., Rosenberg N. A., 2013. The relationship between F_ST and the frequency of the most frequent allele. Genetics 193: 515–528.
Jost L., 2008. G_ST and its relatives do not measure differentiation. Mol. Ecol. 17: 4015–4026.
Kim T. H., Kim K. S., Choi B. H., Yoon D. H., Jang G. W., et al., 2005. Genetic structure of pig breeds from Korea and China using microsatellite loci analysis. J. Anim. Sci. 83: 2255–2263.
Lawson Handley L. J., Byrne K., Santucci F., Townsend S., Taylor M., et al., 2007. Genetic structure of European sheep breeds. Heredity 99: 620–631.
Leinonen T., McCairns R. J. S., O’Hara R. B., Merilä J., 2013. $Q_{ST}$–$F_{ST}$ comparisons: evolutionary and ecological insights from genomic heterogeneity. Nat. Rev. Genet. 14: 179–190.
Lewontin R., Krakauer J., 1973. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74: 175–195.
Li J. Z., Absher D. M., Tang H., Southwick A. M., Casto A. M., et al., 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.
Long J. C., Kittles R. A., 2003. Human genetic diversity and the nonexistence of biological races. Hum. Biol. 75: 449–471.
Maruki T., Kumar S., Kim Y., 2012. Purifying selection modulates the estimates of population differentiation and confounds genome-wide comparisons across single-nucleotide polymorphisms. Mol. Biol. Evol. 29: 3617–3623.
Maruyama T., 1970. Effective number of alleles in a subdivided population. Theor. Popul. Biol. 1: 273–306.
Nei M., 1973. Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70: 3321–3323.
Nei M., Chesser R. K., 1983. Estimation of fixation indices and gene diversities. Ann. Hum. Genet. 47: 253–259.
Pemberton T. J., Absher D., Feldman M. W., Myers R. M., Rosenberg N. A., et al., 2012. Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 91: 275–292.
Ramachandran S., Rosenberg N. A., Zhivotovsky L. A., Feldman M. W., 2004. Robustness of the inference of human population structure: a comparison of X-chromosomal and autosomal microsatellites. Hum. Genomics 1: 87–97.
Reddy S. B., Rosenberg N. A., 2012. Refining the relationship between homozygosity and the frequency of the most frequent allele. J. Math. Biol. 64: 87–108.
Rosenberg N. A., 2011. A population-genetic perspective on the similarities and differences among worldwide human populations. Hum. Biol. 83: 659–684.
Rosenberg N. A., Jakobsson M., 2008. The relationship between homozygosity and the frequency of the most frequent allele. Genetics 179: 2027–2036.
Rosenberg N. A., Pritchard J. K., Weber J. L., Cann H. M., Kidd K. K., et al., 2002. Genetic structure of human populations. Science 298: 2381–2385.
Rosenberg N. A., Li L. M., Ward R., Pritchard J. K., 2003. Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73: 1402–1422.
Rosenberg N. A., Mahajan S., Gonzalez-Quevedo C., Blum M. G. B., Nino-Rosales L., et al., 2006. Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genet. 2: e215.
Rousset F., 2013. Exegeses on maximum genetic differentiation. Genetics 194: 557–559.
Slatkin M., 1985. Rare alleles as indicators of gene flow. Evolution 39: 53–65.
Tishkoff S. A., Reed F. A., Friedlaender F. R., Ehret C., Ranciaro A., et al., 2009. The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.
Wakeley J., 1998. Segregating sites in Wright’s island model. Theor. Popul. Biol. 53: 166–174.
Wakeley J., 1999. Nonequilibrium migration in human history. Genetics 153: 1863–1871.
Wang J., 2015. Does G_ST underestimate genetic differentiation from marker data? Mol. Ecol. 24: 3546–3558.
Weir B. S., 1996. Genetic Data Analysis II. Sinauer, Sunderland, MA.
Weir B. S., Cockerham C. C., 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370.
Whitlock M. C., 2011. $G_{ST}^{'}$ and D do not replace $F_{ST}$. Mol. Ecol. 20: 1083–1091.
Wright S., 1951. The genetical structure of populations. Ann. Eugen. 15: 323–354.

Data Availability Statement

MS commands for the coalescent simulations appear in File S2. See Li et al. (2008) for the human SNP data.

[bib1] 1000 Genomes Project Consortium , 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] Akey J. M., Zhang G., Zhang K., Jin L., Shriver M. D., 2002. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12: 1805–1814. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] Alcala N., Goudet J., Vuilleumier S., 2014. On the transition of genetic differentiation from isolation to panmixia: what we can learn from G_ST and D. Theor. Popul. Biol. 93: 75–84. [DOI] [PubMed] [Google Scholar]

[bib4] Algee-Hewitt B. F. B., Edge M. D., Kim J., Li J. Z., Rosenberg N. A., 2016. Individual identifiability predicts population identifiability in forensic microsatellite markers. Curr. Biol. 26: 935–942. [DOI] [PubMed] [Google Scholar]

[bib5] Balloux F., Brünner H., Lugon-Moulin N., Hausser J., Goudet J., 2000. Microsatellites can be misleading: an empirical and simulation study. Evolution 54: 1414–1422. [DOI] [PubMed] [Google Scholar]

[bib6] Beaumont M. A., Nichols R. A., 1996. Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. Ser. B Biol. Sci. 263: 1619–1626. [Google Scholar]

[bib7] Bonhomme M., Chevalet C., Servin B., Boitard S., Abdallah J., et al. , 2010. Detecting selection in population trees: the Lewontin and Krakauer test extended. Genetics 186: 241–262. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] Cann H. M., de Toma C., Cazes L., Legrand M.-F., Morel V., et al. , 2002. A human genome diversity cell line panel. Science 296: 261–262. [DOI] [PubMed] [Google Scholar]

[bib9] Cañon J., Checa M. L., Carleos C., Vega-Pla J. L., Vallejo M., et al. , 2000. The genetic structure of Spanish Celtic horse breeds inferred from microsatellite data. Anim. Genet. 31: 39–48. [DOI] [PubMed] [Google Scholar]

[bib10] Cornuet J.-M., Santos F., Beaumont M. A., Robert C. P., Marin J.-M., et al. , 2008. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics 24: 2713–2719. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] Edge M. D., Rosenberg N. A., 2014. Upper bounds on F_ST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theor. Popul. Biol. 97: 20–34. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] Foll M., Gaggiotti O., 2008. A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180: 977–993. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] Frankham R., Ballou J. D., Briscoe D. A., 2002. Introduction to Conservation Genetics. Cambridge University Press, Cambridge, United Kingdom. [Google Scholar]

[bib14] Günther T., Coop G., 2013. Robust identification of local adaptation from allele frequencies. Genetics 195: 205–220. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] Hartl D. L., Clark A. G., 1997. Principles of Population Genetics. Sinauer, Sunderland, MA. [Google Scholar]

[bib16] Hedrick P. W., 1999. Highly variable loci and their interpretation in evolution and conservation. Evolution 53: 313–318. [DOI] [PubMed] [Google Scholar]

[bib17] Hedrick P. W., 2005. A standardized genetic differentiation measure. Evolution 59: 1633–1638. [PubMed] [Google Scholar]

[bib18] Hinds D. A., Stuve L. L., Nilsen G. B., Halperin E., Eskin E., et al. , 2005. Whole-genome patterns of common DNA variation in three human populations. Science 307: 1072–1079. [DOI] [PubMed] [Google Scholar]

[bib19] Holsinger K. E., Weir B. S., 2009. Genetics in geographically structured populations: defining, estimating and interpreting F_ST. Nat. Rev. Genet. 10: 639–650. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] Hudson R. R., 2002. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338. [DOI] [PubMed] [Google Scholar]

[bib21] International HapMap Consortium , 2005. A haplotype map of the human genome. Nature 437: 1299–1320. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] Jakobsson M., Edge M. D., Rosenberg N. A., 2013. The relationship between F_ST and the frequency of the most frequent allele. Genetics 193: 515–528. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] Jost L., 2008. G_ST and its relatives do not measure differentiation. Mol. Ecol. 17: 4015–4026. [DOI] [PubMed] [Google Scholar]

[bib24] Kim T. H., Kim K. S., Choi B. H., Yoon D. H., Jang G. W., et al. , 2005. Genetic structure of pig breeds from Korea and China using microsatellite loci analysis. J. Anim. Sci. 83: 2255–2263. [DOI] [PubMed] [Google Scholar]

[bib25] Lawson Handley L. J., Byrne K., Santucci F., Townsend S., Taylor M., et al. , 2007. Genetic structure of European sheep breeds. Heredity 99: 620–631. [DOI] [PubMed] [Google Scholar]

[bib26] Leinonen T., McCairns R. J. S., O’Hara R. B., Merilä J., 2013. $Q_{ST} - F_{ST}$ comparisons: evolutionary and ecological insights from genomic heterogeneity. Nat. Rev. Genet. 14: 179–190. [DOI] [PubMed] [Google Scholar]

[bib27] Lewontin R., Krakauer J., 1973. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74: 175–195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib28] Li J. Z., Absher D. M., Tang H., Southwick A. M., Casto A. M., et al. , 2008. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104. [DOI] [PubMed] [Google Scholar]

[bib29] Long J. C., Kittles R. A., 2003. Human genetic diversity and the nonexistence of biological races. Hum. Biol. 75: 449–471. [DOI] [PubMed] [Google Scholar]

[bib30] Maruki T., Kumar S., Kim Y., 2012. Purifying selection modulates the estimates of population differentiation and confounds genome-wide comparisons across single-nucleotide polymorphisms. Mol. Biol. Evol. 29: 3617–3623. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib31] Maruyama T., 1970. Effective number of alleles in a subdivided population. Theor. Popul. Biol. 1: 273–306. [DOI] [PubMed] [Google Scholar]

[bib32] Nei M., 1973. Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70: 3321–3323. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] Nei M., Chesser R. K., 1983. Estimation of fixation indices and gene diversities. Ann. Hum. Genet. 47: 253–259. [DOI] [PubMed] [Google Scholar]

[bib34] Pemberton T. J., Absher D., Feldman M. W., Myers R. M., Rosenberg N. A., et al. , 2012. Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 91: 275–292. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] Ramachandran S., Rosenberg N. A., Zhivotovsky L. A., Feldman M. W., 2004. Robustness of the inference of human population structure: a comparison of X-chromosomal and autosomal microsatellites. Hum. Genomics 1: 87–97. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] Reddy S. B., Rosenberg N. A., 2012. Refining the relationship between homozygosity and the frequency of the most frequent allele. J. Math. Biol. 64: 87–108. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] Rosenberg N. A., 2011. A population-genetic perspective on the similarities and differences among worldwide human populations. Hum. Biol. 83: 659–684.

[bib38] Rosenberg N. A., Jakobsson M., 2008. The relationship between homozygosity and the frequency of the most frequent allele. Genetics 179: 2027–2036.

[bib39] Rosenberg N. A., Pritchard J. K., Weber J. L., Cann H. M., Kidd K. K., et al., 2002. Genetic structure of human populations. Science 298: 2381–2385.

[bib40] Rosenberg N. A., Li L. M., Ward R., Pritchard J. K., 2003. Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73: 1402–1422.

[bib41] Rosenberg N. A., Mahajan S., Gonzalez-Quevedo C., Blum M. G. B., Nino-Rosales L., et al., 2006. Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genet. 2: e215.

[bib42] Rousset F., 2013. Exegeses on maximum genetic differentiation. Genetics 194: 557–559.

[bib43] Slatkin M., 1985. Rare alleles as indicators of gene flow. Evolution 39: 53–65.

[bib44] Tishkoff S. A., Reed F. A., Friedlaender F. R., Ehret C., Ranciaro A., et al., 2009. The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.

[bib45] Wakeley J., 1998. Segregating sites in Wright’s island model. Theor. Popul. Biol. 53: 166–174.

[bib46] Wakeley J., 1999. Nonequilibrium migration in human history. Genetics 153: 1863–1871.

[bib47] Wang J., 2015. Does $G_{ST}$ underestimate genetic differentiation from marker data? Mol. Ecol. 24: 3546–3558.

[bib48] Weir B. S., 1996. Genetic Data Analysis II. Sinauer, Sunderland, MA.

[bib49] Weir B. S., Cockerham C. C., 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370.

[bib50] Whitlock M. C., 2011. $G_{ST}^{'}$ and D do not replace $F_{ST}$. Mol. Ecol. 20: 1083–1091.

[bib51] Wright S., 1951. The genetical structure of populations. Ann. Eugen. 15: 323–354.
