The Relationship Between Homozygosity and the Frequency of the Most Frequent Allele

Noah A Rosenberg; Mattias Jakobsson

doi:10.1534/genetics.107.084772

Genetics. 2008 Aug;179(4):2027–2036.

PMCID: PMC2516077 PMID: 18689892

Abstract

Homozygosity is a commonly used summary of allele-frequency distributions at polymorphic loci. Because high-frequency alleles contribute disproportionately to the homozygosity of a locus, it often occurs that most homozygotes are homozygous for the most frequent allele. To assess the relationship between homozygosity and the highest allele frequency at a locus, for a given homozygosity value, we determine the lower and upper bounds on the frequency of the most frequent allele. These bounds suggest tight constraints on the frequency of the most frequent allele as a function of homozygosity, differing by at most 1/4 and having an average difference of 2/3 − π²/18 ≈ 0.1184. The close connection between homozygosity and the frequency of the most frequent allele (which we illustrate using allele frequencies from human populations) has the consequence that when one of these two quantities is known, considerable information is available about the other quantity. This relationship also explains the similar performance of statistical tests of population-genetic models that rely on homozygosity and those that rely on the frequency of the most frequent allele, and it provides a basis for understanding the utility of extended homozygosity statistics in identifying haplotypes that have been elevated to high frequency as a result of positive selection.

THE concept of homozygosity appears ubiquitously in population genetics, in the context of mathematical theory as well as in statistical methods for data analysis. Consider a locus with K ≥ 2 alleles, for which the frequency of allele i is p_i > 0 and for which the alleles are placed in decreasing order of frequency so that p_i ≥ p_j if i < j. For diploids, the fraction of homozygotes expected under the assumption of Hardy–Weinberg proportions can be defined as

$$H = \sum_{i=1}^{K} p_i^2, \tag{1}$$

where

$$\sum_{i=1}^{K} p_i = 1. \tag{2}$$

In this article, we show that if all that is known about a locus is its expected homozygosity H, it is possible to localize the frequency p₁ of its most frequent allele within a quite narrow range. Conversely, given p₁, a narrow range can be specified for the value of H. Thus, we determine the upper and lower bounds on the frequency p₁ of the most frequent allele as functions of homozygosity H. We also determine the bounds on H as functions of p₁.

The connection between H and p₁ provides a close relationship between two of the most basic quantities associated with a polymorphic locus. We use this relationship to explain a high correlation observed between H and p₁ in human microsatellite data, as well as to provide a conceptual basis for the success of extended haplotype homozygosity methods in detecting positive selection. Note that expected heterozygosity under Hardy–Weinberg proportions is 1 − H; thus, by a simple transformation, our results can also be used to describe the relationship between heterozygosity and the frequency of the most frequent allele.

RESULTS

We consider a polymorphic locus with at least two alleles. We do not assume that the number of alleles with nonzero frequency is known; it is convenient to view the locus as having infinitely many alleles and to allow some of these alleles to have frequency 0. We refer to the frequency of the most frequent allele, p₁, by M. Henceforth we use H and “homozygosity” to refer to expected homozygosity assuming Hardy–Weinberg proportions. Both M and H must lie in the interval (0, 1). The quantity ⌈x⌉ denotes the smallest integer larger than or equal to x. Our main results, which are proved in the appendix, are the bounds on M as functions of H (Theorem 1) and the bounds on H as functions of M (Theorem 2).

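Theorem 1. Consider a sequence of the allele frequencies at a locus, {p_i} (i = 1, 2, …), with p_i ∈ [0, 1), ∑_i p_i = 1, H = ∑_i p_i², M = p₁, and i < j implies p_i ≥ p_j. Then (i) M < √H and (ii)

$$M \ge \frac{1}{\lceil H^{-1} \rceil}\left(1 + \sqrt{\frac{\lceil H^{-1} \rceil H - 1}{\lceil H^{-1} \rceil - 1}}\right),$$

with equality if and only if p_i = M for 1 ≤ i ≤ K − 1, p_K = 1 − (K − 1)M, and p_i = 0 for i > K, where K = ⌈H⁻¹⌉ = ⌈M⁻¹⌉.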

Theorem 2. Consider a sequence of the allele frequencies at a locus, {p_i} (i = 1, 2, …), with p_i ∈ [0, 1), ∑_i p_i = 1, H = ∑_i p_i², M = p₁, and i < j implies p_i ≥ p_j. Then (i) H > M² and (ii) H ≤ 1 − M(⌈M⁻¹⌉ − 1)(2 − ⌈M⁻¹⌉M), with equality if and only if p_i = M for 1 ≤ i ≤ K − 1, p_K = 1 − (K − 1)M, and p_i = 0 for i > K, where K = ⌈H⁻¹⌉ = ⌈M⁻¹⌉.

The bounds obtained in Theorems 1 and 2 are summarized in Table 1. Loosely speaking, Theorem 1 verifies that for a given homozygosity, the frequency of the most frequent allele is smallest when as many alleles as possible are tied as most frequent and greatest when there is one extremely frequent allele and many rare alleles. Theorem 2 shows that for a given frequency of the most frequent allele, homozygosity is smallest when many extremely rare alleles are present and greatest when as many alleles as possible are tied as most frequent. For each of the theorems, part i is straightforward to prove, and part ii follows from the fact that when considering all possible sets of nonnegative real numbers bounded above by a specified constant M and having a fixed sum C, the maximal sum of squares is obtained by greedily choosing as many of the numbers as possible to equal M and by assigning at most one additional number to be positive (Lemma 3 in the appendix).
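The bounds in Theorems 1 and 2 are simple to evaluate numerically. The following Python sketch is illustrative only: the function names are ours rather than the article's, and the lower bound on M is written as the larger root obtained by solving the Theorem 2 upper bound on H for M, which is an assumption of this sketch about the closed form.

```python
import math


def _check_open_unit(x, name):
    # Both H and M are assumed to lie strictly between 0 and 1.
    if not 0.0 < x < 1.0:
        raise ValueError(f"{name} must lie in the open interval (0, 1)")


def upper_bound_M(H):
    """Upper bound on M given homozygosity H: sqrt(H), approached but not attained."""
    _check_open_unit(H, "H")
    return math.sqrt(H)


def lower_bound_M(H):
    """Lower bound on M given H, using K = ceil(1/H)."""
    _check_open_unit(H, "H")
    K = math.ceil(1.0 / H)
    # max(...) guards against a tiny negative value from rounding when H = 1/K.
    return (1.0 + math.sqrt(max(0.0, K * H - 1.0) / (K - 1))) / K


def lower_bound_H(M):
    """Lower bound on H given M: M**2, approached but not attained."""
    _check_open_unit(M, "M")
    return M * M


def upper_bound_H(M):
    """Upper bound on H given M, using K = ceil(1/M)."""
    _check_open_unit(M, "M")
    K = math.ceil(1.0 / M)
    return 1.0 - M * (K - 1) * (2.0 - K * M)


H = 0.2118  # homozygosity reported in the article for locus AGAT017
print(lower_bound_M(H), upper_bound_M(H))
```

Because the two pairs of bounds are inverse functions of one another, composing `lower_bound_M` with `upper_bound_H` should return its argument up to rounding error, which gives a quick consistency check on the sketch.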

TABLE 1.

Bounds on homozygosity and the frequency of the most frequent allele

Quantity	Lower bound in terms of other quantity	Upper bound in terms of other quantity	Average difference between upper and lower bounds	Maximal difference between upper and lower bounds
M	(1/⌈H⁻¹⌉)[1 + √((⌈H⁻¹⌉H − 1)/(⌈H⁻¹⌉ − 1))]	√H	2/3 − π²/18 ≈ 0.1184	1/4 at H = 1/4
H	M²	1 − M(⌈M⁻¹⌉ − 1)(2 − ⌈M⁻¹⌉M)	2/3 − π²/18 ≈ 0.1184	1/4 at M = 1/2


Theorems 1 and 2 can be visualized in Figures 1–5, and various properties of the bounds that can be observed in the figures are considered in the appendix. Figure 1 illustrates the upper and lower bounds on the frequency of the most frequent allele, as functions of homozygosity. The peculiar yet continuous and monotonic nature of the lower bound can be observed, as can the relatively confined range between the upper and lower bounds, with an average difference of 2/3 − π²/18 ≈ 0.1184, in which the frequency of the most frequent allele must lie. The stepped shape for the lower bound results from transitions at reciprocals of integers for the number of alleles contained in the collection of allele frequencies that achieves the lower bound.

Figure 1. Upper and lower bounds on the frequency of the most frequent allele, as functions of homozygosity.

Figure 2. The difference between the upper and lower bounds on the frequency of the most frequent allele, for a given homozygosity, and the difference between the bounds and homozygosity itself.

Figure 3. The lower bound on the fraction of homozygosity contributed by homozygotes for the most frequent allele. The upper bound is 1.

Figure 4. Upper and lower bounds on homozygosity, as functions of the frequency of the most frequent allele. These upper and lower bounds are the inverse functions of the lower and upper bounds on the frequency of the most frequent allele, given homozygosity.

Figure 5. The difference between the upper and lower bounds on homozygosity given the frequency of the most frequent allele, and the difference between the frequency of the most frequent allele and the bounds.

Figure 2 shows the pairwise differences among the upper bound, the lower bound, and the homozygosity itself. From this figure, it is possible to see that the lower bound on the frequency of the most frequent allele is greater than or equal to the homozygosity, with equality when homozygosity is the reciprocal of an integer. It can also be seen that the difference between the lower bound and the homozygosity has numerous local maxima, the highest point being at (5/8, 1/8), and that the difference between the upper bound and the lower bound has local maxima at reciprocals of integers and local minima in the intervening intervals. The maximal difference between the upper and lower bounds occurs at (1/4, 1/4), and the highest of the local minima is nearby.

Figure 3 displays the minimal fraction of homozygosity contained in homozygotes for the most frequent allele. This function is monotonically increasing, so that for homozygosities substantially greater than 1/2, nearly all homozygotes are homozygous for the most frequent allele, regardless of the total number of alleles.

The upper and lower bounds on homozygosity in terms of the frequency of the most frequent allele are the inverse functions of the lower and upper bounds on the frequency of the most frequent allele in terms of homozygosity. Thus, there is a close relationship between the bounds on H in terms of M shown in Figure 4 and the bounds on M in terms of H shown in Figure 1.

Figure 5 depicts, as functions of the frequency of the most frequent allele, the pairwise differences among the upper bound on homozygosity, the lower bound, and the frequency of the most frequent allele itself. The frequency of the most frequent allele is greater than or equal to the upper bound, equaling the upper bound at reciprocals of integers. The difference between this frequency and the upper bound has a collection of local maxima, the highest being at (3/4, 1/8). The difference between the upper and lower bounds has local maxima at reciprocals of integers and local minima in the intervening intervals. The maximal difference between the upper and lower bounds occurs at (1/2, 1/4), near the highest of the local minima.

APPLICATION TO DATA

To demonstrate the bounds with actual allele frequencies, we consider the homozygosity and frequency of the most frequent allele for 783 multiallelic microsatellite loci studied in a sample of 1048 individuals drawn from worldwide human populations (Rosenberg et al. 2005). Although our theoretical results are useful for any collection of multiallelic loci, this data set provides a particularly illustrative example, as levels of variability of human microsatellites span quite a wide range. For each locus, we assume that the allele frequencies in the sample are parametric allele frequencies, and we obtain values for H and M in the full collection of 1048 individuals.

Figure 6 plots H and M for the 783 loci, illustrating a high degree of correlation between the two quantities. Homozygosity ranges from 0.0837 to 0.6872, and the frequency of the most frequent allele ranges from 0.1136 to 0.8146. Several loci have values of M quite close to the lower bound for their homozygosity values (Table 2). The lists of allele frequencies for these loci are fairly close to the lists that achieve the lower bound. For example, locus AGAT017 has homozygosity 0.2118, between 1/5 and 1/4, and its four most frequent alleles have frequencies 0.2425, 0.2410, 0.2300, and 0.1979. At homozygosity 0.2118, the lower bound for the frequency of the most frequent allele is achieved when four alleles have frequency 0.2243 and a fifth allele has frequency 0.1027.
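The AGAT017 example can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions: the function name is ours, and the closed form for M is obtained by solving H = (K − 1)M² + [1 − (K − 1)M]² for its larger root with K = ⌈H⁻¹⌉, following the equality condition of the theorems. Because the input homozygosity is rounded to four decimals, the output is expected to agree with the values quoted in the text only approximately.

```python
import math


def lower_bound_configuration(H):
    """Allele frequencies that attain the lower bound on M for homozygosity H.

    Returns K - 1 alleles tied at frequency M and one allele with the remainder,
    where K = ceil(1/H). The function name is hypothetical.
    """
    if not 0.0 < H < 1.0:
        raise ValueError("H must lie in the open interval (0, 1)")
    K = math.ceil(1.0 / H)
    # Larger root of K(K-1)M^2 - 2(K-1)M + (1 - H) = 0; max() absorbs rounding at H = 1/K.
    M = (1.0 + math.sqrt(max(0.0, K * H - 1.0) / (K - 1))) / K
    last = 1.0 - (K - 1) * M
    return [M] * (K - 1) + [last]


freqs = lower_bound_configuration(0.2118)
print([round(p, 4) for p in freqs])
# Sanity checks: frequencies sum to 1 and their squares sum back to H.
print(sum(freqs), sum(p * p for p in freqs))
```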

TABLE 2.

Five microsatellite loci with frequency of the most frequent allele close to the lower bound

Locus	Total data points	H	Lower bound on M	Upper bound on M	p₁ (M)	p₂	p₃	p₄	p₅	p₆	p₇	p₈	p₉	p₁₀	p₁₁	p₁₂
AGAT017	1996	0.21	0.22	0.46	0.24	0.24	0.23	0.20	0.05	0.02	<0.01	<0.01	<0.01	<0.01
TATC012	2046	0.26	0.28	0.51	0.31	0.30	0.27	0.09	0.02	<0.01	<0.01	<0.01	<0.01	<0.01	<0.01	<0.01
GATA146D07	1940	0.30	0.31	0.54	0.32	0.31	0.31	0.03	0.02	<0.01	<0.01	<0.01
GATA151C03P	2076	0.37	0.41	0.61	0.43	0.42	0.10	0.03	<0.01	<0.01	<0.01	<0.01
D6S2522	1950	0.50	0.55	0.71	0.56	0.43	<0.01	<0.01	<0.01


Two other loci whose most frequent alleles have frequency close to the lower bound, TATC012 and GATA146D07, have homozygosities between 1/4 and 1/3. For homozygosities in this interval, the lower bound is achieved when the three highest allele frequencies have the same value; indeed both loci have three high-frequency alleles with frequency near the lower bound. Similarly, locus GATA151C03P, with homozygosity between 1/3 and 1/2, has two high-frequency alleles with frequency near the lower bound.

Table 3 displays the allele frequencies for three loci with values of M close to the upper bound. The upper bound is approximated when a locus has one allele with a particularly high frequency and many alleles with low frequencies. Consistent with their M values near the upper bound, each of the three loci has a single high-frequency allele and several low-frequency alleles.

TABLE 3.

Three microsatellite loci with frequency of the most frequent allele close to the upper bound

Locus	Total data points	H	Lower bound on M	Upper bound on M	p₁ (M)	p₂	p₃	p₄	p₅	p₆	p₇	p₈	p₉	p₁₀
TAA005	2002	0.47	0.49	0.69	0.66	0.13	0.12	0.03	0.03	0.01	<0.01	<0.01	<0.01
AATA045	2090	0.48	0.49	0.70	0.67	0.13	0.11	0.06	0.01	<0.01	<0.01	<0.01	<0.01	<0.01
GATA150B10	2068	0.50	0.50	0.71	0.69	0.12	0.10	0.04	0.03	0.02	0.01	<0.01	<0.01	<0.01


Subdividing loci on the basis of their numbers of alleles, Figure 7 illustrates a trend of decreasing H with an increasing number of alleles. Considering the four plots, the mean value of M − H is greatest in Figure 7B, in which the mean homozygosity is near 0.25. This observation is explained by the fact that the range between the upper and lower bounds on M is greatest for a homozygosity of 1/4. As the mean homozygosity moves away from 1/4 in Figure 7, A, C, and D, the mean value of M − H decreases.

Figure 7. Frequency of the most frequent allele minus homozygosity for 783 microsatellite loci. Each bin is 0.01 × 0.01, and the upper and lower bounds on M − H are shown for comparison. For each plot, the mean is marked by an x. (A) Loci with 4–9 distinct alleles (233 loci): the mean is (0.2989, 0.1179). (B) Loci with 10–11 distinct alleles (222 loci): the mean is (0.2570, 0.1208). (C) Loci with 12–14 distinct alleles (175 loci): the mean is (0.2262, 0.1126). (D) Loci with 15–35 distinct alleles (153 loci): the mean is (0.1890, 0.1102).

DISCUSSION

For a biallelic locus, an exact relationship exists between homozygosity (H) and the frequency of the most frequent allele (M), as H = 2M² − 2M + 1 and M = [1 + √(2H − 1)]/2. Although in general the value of H or M is not uniquely specified from the value of the other quantity, we have found that a close connection between H and M does in fact exist. Our analysis verifies that measured values for homozygosity (and heterozygosity) consist largely of the contribution of the most common allele, and that the contribution made by rarer alleles is relatively small. Especially if homozygosity is very high or if the most frequent allele has a high frequency, each of the two summaries H and M greatly limits the possible values of the other quantity, so that both quantities provide similar information about an underlying allele-frequency distribution.

These results have implications for population-genetic methods that rely on H or M in analyses of multiallelic loci. Various neutrality tests have been developed that identify deviations from null population-genetic models on the basis of unusual values of homozygosity (Watterson 1977, 1978), heterozygosity (Depaulis and Veuille 1998; Depaulis et al. 2001; Markovtsova et al. 2001), or the frequency of the most frequent allele (Hudson et al. 1994). The close connection between homozygosity and the frequency of the most frequent allele suggests that tests using H and those using M detect similar features of the allele-frequency distribution. This observation potentially explains a high level of agreement seen in Table 7 of Innan et al. (2005) for the haplotype diversity test (Depaulis and Veuille 1998), based on haplotype heterozygosity, and the Hudson et al. (1994) haplotype test, based on the frequency of the most frequent haplotype.

Our results are also informative in relation to recently proposed methods that use “extended haplotype homozygosity”—pairwise identity of long haplotypes in the neighborhood of an index site—in detecting the signature of partial selective sweeps (Sabeti et al. 2002; Toomajian et al. 2006; Voight et al. 2006; Tang et al. 2007; Zeng et al. 2007). During such sweeps, a favored mutant allele rises to high frequency, carrying with it neighboring alleles that were near the selected site on the haplotype on which the mutation originally occurred. Thus, the detection of partial selective sweeps is a search for long high-frequency haplotypes that have not had sufficient time to be broken down by recombination. Because of the close connection between homozygosity and the frequency of the most frequent allele, genomic regions that have long high-frequency haplotypes will largely be coincident with regions that have long stretches of high haplotype homozygosity. Consequently, extended haplotype homozygosity methods provide an effective basis for accessing the signal of partial selective sweeps contained in extended high-frequency haplotypes.

Finally, the connection between homozygosity and the frequency of the most frequent allele may be useful for examining the properties of a variety of additional functions of allele frequencies that are based on homozygosity. Notably, the genetic differentiation measure F_ST and related quantities can be assembled from the homozygosities of various subgroups of a population—especially when viewed in the formulation of the G_ST measure of Nei (1987). From the connection between H and M, it follows that constraints on F_ST as functions of M undoubtedly exist; such constraints potentially provide the conceptual basis for understanding a frequency dependence observed for values of F_ST (Long and Kittles 2003; Hedrick 2005).

Acknowledgments

We thank S. Boca for numerous discussions of this work. Grant support was provided by a University of Michigan Center for Genetics in Health and Medicine postdoctoral fellowship, by National Institutes of Health grant R01 GM081441, by an Alfred P. Sloan Research Fellowship, and by a Burroughs Wellcome Fund Career Award in the Biomedical Sciences.

APPENDIX

In addition to verifying Theorems 1 and 2, this appendix formalizes many of the features visible in Figures 1–5. We begin with the proofs of the theorems. We then obtain properties of the frequency of the most frequent allele in terms of homozygosity and properties of homozygosity in terms of the frequency of the most frequent allele. For convenience, we label the bounds as follows:

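$$F(H) = \sqrt{H} \tag{A1}$$

$$f(H) = \frac{1}{\lceil H^{-1} \rceil}\left(1 + \sqrt{\frac{\lceil H^{-1} \rceil H - 1}{\lceil H^{-1} \rceil - 1}}\right) \tag{A2}$$

$$G(M) = 1 - M(\lceil M^{-1} \rceil - 1)(2 - \lceil M^{-1} \rceil M) \tag{A3}$$

$$g(M) = M^2 \tag{A4}$$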

For integers K ≥ 2, we also denote the half-open interval [1/K, 1/(K − 1)) by I_K.

The key result is Lemma 3, which considers sets of nonnegative numbers with a fixed positive sum C, in which the numbers in the set are bounded above by a positive constant M. The square of a positive number x is greater than or equal to the sum of squares for each collection of nonnegative numbers whose sum is x. As a result, considering all sets of nonnegative numbers with maximum M and with sum equal to C, we can show that the maximal sum of squares is obtained when as many of the numbers as possible are equal to M and when at most one remaining number is smaller than M. Lemma 3 makes it possible to obtain the maximal homozygosity as a function of M and, ultimately, to find the minimal M as a function of H.

Lemma 3. Suppose M > 0 and C > 0 and that ⌈C/M⌉ is denoted K. Considering all sequences p = {p_i} (i = 1, 2, …) with p_i ∈ [0, M], ∑_i p_i = C, and i < j implies p_i ≥ p_j, H(p) = ∑_i p_i² is maximal if and only if p_i = M for 1 ≤ i ≤ K − 1, p_K = C − (K − 1)M, and p_i = 0 for i > K, and its maximum is K(K − 1)M² − 2C(K − 1)M + C².

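Proof. We use induction on K. Suppose K = 1, so that C ≤ M. Because ∑_i p_i = C,

$$H(p) = \Bigl(\sum_{i=1}^{\infty} p_i\Bigr)^2 - 2\sum_{i<j} p_i p_j = C^2 - 2\sum_{i<j} p_i p_j. \tag{A5}$$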

Because a nonnegative term is subtracted in Equation A5, the maximum of H(p) occurs when this term is zero. As a result, at the maximum, p₁ = C ≤ M, p_i = 0 for i > 1, and H(p) = C². This establishes the base case.

Assume that the desired result is true for all C and M with ⌈C/M⌉ = K − 1. Now suppose ⌈C/M⌉ = K. The proposed value of p that maximizes H has p_i = M for 1 ≤ i ≤ K − 1, p_K = C − (K − 1)M, and p_i = 0 for i > K. Label this sequence by p*. Then

$$H(p^*) = (K-1)M^2 + [C - (K-1)M]^2 = K(K-1)M^2 - 2C(K-1)M + C^2. \tag{A6}$$

By assumption, ⌈C/M⌉ = K and M ≥ C/K. As Equation A6 describes a parabola in M with positive leading term, regardless of the value of M, H(p*) is greater than or equal to the value at the minimum of the parabola, or C²/K.

We now show that no other sequence p can achieve a value of H as high as H(p*). Suppose p₁ < C/K. Because p_i ≤ p₁ for i > 1,

$$H(p) = \sum_{i=1}^{\infty} p_i^2 \le p_1 \sum_{i=1}^{\infty} p_i = p_1 C < \frac{C^2}{K}. \tag{A7}$$

Because H(p*) ≥ C²/K, the sequence p that maximizes H cannot have p₁ < C/K. This sequence must therefore have p₁ ≥ C/K and ⌈C/p₁⌉ ≤ K. However, because p₁ ≤ M and ⌈C/M⌉ = K by assumption, ⌈C/p₁⌉ ≥ ⌈C/M⌉ = K. Thus, ⌈C/p₁⌉ = K and ⌈(C − p₁)/p₁⌉ = K − 1.

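Note that ∑_{i≥2} p_i = C − p₁ and that p_i ≤ p₁ for i ≥ 2. We can therefore apply the inductive hypothesis to the sequence p₂, p₃, …, with C − p₁ in place of C and p₁ in place of M. By the inductive hypothesis, the maximum of ∑_{i≥2} p_i² occurs if and only if p_i = p₁ for 2 ≤ i ≤ K − 1, p_K = (C − p₁) − (K − 2)p₁, and p_i = 0 for i > K. As a result,

$$H(p) \le p_1^2 + (K-1)(K-2)p_1^2 - 2(C - p_1)(K-2)p_1 + (C - p_1)^2 = K(K-1)p_1^2 - 2C(K-1)p_1 + C^2. \tag{A8}$$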

This function is monotonically increasing in p₁ for p₁ ≥ C/K and therefore achieves its maximum when p₁ is as large as possible—that is, when p₁ = M. ▪
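Lemma 3 lends itself to a quick numerical sanity check. The sketch below is not part of the proof and uses hypothetical helper names: it builds the greedy sequence described in the lemma, evaluates the closed-form maximum K(K − 1)M² − 2C(K − 1)M + C², and compares both against randomly generated sequences that satisfy the same constraints (nonnegative, bounded above by M, summing to C). The random sampler is a simple rejection scheme chosen for brevity, not a uniform sampler over the constraint set.

```python
import math
import random


def greedy_sequence(C, M):
    """K - 1 entries equal to M plus one remainder, where K = ceil(C / M)."""
    K = math.ceil(C / M)
    return [M] * (K - 1) + [C - (K - 1) * M]


def lemma3_maximum(C, M):
    """Closed-form maximal sum of squares stated in Lemma 3."""
    K = math.ceil(C / M)
    return K * (K - 1) * M**2 - 2 * C * (K - 1) * M + C**2


def random_candidate(C, M, n, rng):
    """Random nonnegative numbers summing to C; None if any entry exceeds M."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    p = [C * w / total for w in weights]
    return p if max(p) <= M else None


def largest_random_sum_of_squares(C, M, n=12, trials=2000, seed=0):
    """Largest sum of squares seen among accepted random candidates."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(trials):
        p = random_candidate(C, M, n, rng)
        if p is not None:
            best = max(best, sum(x * x for x in p))
    return best


C, M = 1.0, 0.3
print(sum(x * x for x in greedy_sequence(C, M)), lemma3_maximum(C, M))
print(largest_random_sum_of_squares(C, M))  # should not exceed the line above
```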

Proof of Theorem 2. (i) This result follows from the definition of H and from the fact that p₂ > 0; (ii) this follows from Lemma 3, taking C = 1 so that K = ⌈M⁻¹⌉. ▪

Lemma 4. (i) F, f, G, g: (0, 1) → (0, 1) are monotonically increasing, continuous, and bijective; (ii) f and G are differentiable on (1/K, 1/(K − 1)) for each integer K ≥ 2.

Proof. The result is trivial for F and g. For integers K ≥ 2, G(1/K) = 1/K, and G is monotonically increasing on each interval I_K, where ⌈M⁻¹⌉ has the fixed value K. Thus, G is monotonically increasing on (0, 1). From the form of G it is clear that on (1/K, 1/(K − 1)), G is continuous and differentiable. For each K, as M = 1/K is approached from either direction, G(M) approaches 1/K. Thus, G is continuous on (0, 1). Given H ∈ (0, 1), there is a unique M for which G(M) = H, so that G is bijective. Similar reasoning holds for f. ▪

As a consequence of this lemma, since G(1/K) = 1/K for integers K ≥ 2, if M ∈ (0, 1) and ⌈M⁻¹⌉ = K, then G(M) lies in the interval I_K. Similarly, if ⌈H⁻¹⌉ = K for H ∈ (0, 1), then f(H) also lies in I_K.

Lemma 5. F and g are inverse functions on (0, 1), as are f and G.

Proof. The result is trivial for F and g. As bijections, both f and G are invertible. Noting that M ∈ I_K implies G(M) ∈ I_K, ⌈M⁻¹⌉ = ⌈G(M)⁻¹⌉, from which we can solve for M in terms of G(M) on each interval I_K to find that on each interval the inverse of G is f. ▪

Proof of Theorem 1. (i) This result follows from the definition of H and from the fact that p₂ > 0. (ii) By Theorem 2, for a given value of M and a given sequence {p_i} with p₁ = M, G(M) ≥ H with equality if and only if p_i = M for 1 ≤ i ≤ K − 1, p_K = 1 − (K − 1)M, and p_i = 0 for i > K, where K = ⌈M⁻¹⌉. Applying the monotonically increasing function f to the inequality G(M) ≥ H, f(G(M)) ≥ f(H) with the same equality condition. Because f is the inverse of G, M ≥ f(H) with the same equality condition. ▪

Note that for a given value of H, we can find a set of allele frequencies for which the value of M comes arbitrarily close to its upper bound of √H. This can be accomplished by supposing that a locus has one common allele with frequency M and N rare alleles each with frequency ɛ. If all other alleles have zero frequency, such a locus must have M² + Nɛ² = H and M + Nɛ = 1. Solving this pair of equations for M in terms of N (taking the larger root) and letting N → ∞, we find that M → √H. Similar reasoning yields sets of allele frequencies that for a given value of M have values of H arbitrarily close to the lower bound of M².
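The construction with one common allele and N rare alleles can be traced numerically. In the sketch below (function name ours), substituting ɛ = (1 − M)/N into M² + Nɛ² = H gives the quadratic (N + 1)M² − 2M + (1 − NH) = 0, whose larger root is M = [1 + √(N((N + 1)H − 1))]/(N + 1); this algebra is our own working of the two equations in the text rather than a formula quoted from the article.

```python
import math


def common_allele_frequency(H, N):
    """Larger root M of M**2 + N*eps**2 = H subject to M + N*eps = 1."""
    disc = N * ((N + 1) * H - 1.0)
    if disc < 0:
        raise ValueError("no real solution: need (N + 1) * H >= 1")
    return (1.0 + math.sqrt(disc)) / (N + 1)


H = 0.25
for N in (10, 100, 10_000, 1_000_000):
    M = common_allele_frequency(H, N)
    eps = (1.0 - M) / N
    # Columns: N, M, gap to the upper bound sqrt(H), and the recovered homozygosity.
    print(N, M, math.sqrt(H) - M, M * M + N * eps * eps)
```

As N grows, the printed gap to √H should shrink toward zero while remaining positive, in line with the strict inequality M < √H.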

Frequency of the most frequent allele in terms of homozygosity:

We now derive the properties of the upper and lower bounds on the frequency of the most frequent allele, as functions of homozygosity. Most of the results that follow are relatively straightforward to prove, and they are included for completeness.

Proposition 6 determines the mean values of the bounds, finding that the difference between them has a rather small mean of 2/3 − π²/18 ≈ 0.1184. Lemma 7 then shows that the lower bound on the frequency of the most frequent allele is greater than or equal to the homozygosity itself; the mean values of the differences of the upper and lower bounds from the homozygosity are then obtained in Proposition 8. Results 9–15 concern additional properties of the differences among the upper and lower bounds and the homozygosity and properties of various maxima and minima associated with the upper and lower bounds. The section concludes with Proposition 16, which determines a lower bound on the fraction of homozygosity that is due to the most frequent allele.

Proposition 6. Averaging across values of H ∈ (0, 1), (i) the mean of F(H) is 2/3; (ii) the mean of f(H) is π²/18; (iii) the mean of F(H) − f(H) is 2/3 − π²/18.

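Proof.

(i) The mean of F(H) is ∫₀¹ √H dH = 2/3.
(ii) Because (0, 1) = ∪_{K≥2} I_K, and because ⌈H⁻¹⌉ = K for H ∈ I_K, the mean of f(H) can be written
$$\int_0^1 f(H)\,dH = \sum_{K=2}^{\infty} \int_{1/K}^{1/(K-1)} \frac{1}{K}\left(1 + \sqrt{\frac{KH-1}{K-1}}\right) dH = \sum_{K=2}^{\infty} \left[\frac{1}{3}\left(\frac{1}{K} - \frac{1}{K-1}\right) + \frac{2}{3(K-1)^2} - \frac{1}{3K^2}\right]. \tag{A9}$$
Because ∑_{K≥2} [1/K − 1/(K − 1)] reduces to −1 and because ∑_{K≥1} 1/K² = π²/6, Equation A9 simplifies to π²/18.
(iii) That the mean of F(H) − f(H) is 2/3 − π²/18 follows directly from i and ii together with the fact that F(H) > f(H) for H ∈ (0, 1). ▪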
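The means in Proposition 6 can be checked by direct numerical integration. The following sketch uses a plain midpoint rule; the function names and the grid size are arbitrary choices of ours, and the closed form used for f is the larger root obtained by inverting the upper bound on H from Theorem 2, an assumption of this sketch. Agreement is expected only to a few decimal places because f has a kink at every reciprocal of an integer.

```python
import math


def F(H):
    """Upper bound on M given H."""
    return math.sqrt(H)


def f(H):
    """Lower bound on M given H, with K = ceil(1/H)."""
    K = math.ceil(1.0 / H)
    # max() absorbs a tiny negative value from rounding when H is close to 1/K.
    return (1.0 + math.sqrt(max(0.0, K * H - 1.0) / (K - 1))) / K


def mean_on_unit_interval(func, n=200_000):
    """Midpoint-rule approximation of the mean of func over (0, 1)."""
    return sum(func((i + 0.5) / n) for i in range(n)) / n


mean_F = mean_on_unit_interval(F)
mean_f = mean_on_unit_interval(f)
print(mean_F, 2 / 3)
print(mean_f, math.pi**2 / 18)
print(mean_F - mean_f, 2 / 3 - math.pi**2 / 18)
```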

Lemma 7. For H ∈ (0, 1), f(H) ≥ H, with equality if and only if H = K⁻¹ for some integer K.

Proof. This result follows from the fact that for H ∈ (0, 1), ⌈H⁻¹⌉ > 1, and ⌈H⁻¹⌉ − 1 < H⁻¹ ≤ ⌈H⁻¹⌉, with the equality occurring if and only if H = K⁻¹ for an integer K. ▪

Proposition 8. Averaging across values of H ∈ (0, 1), (i) the mean of F(H) − H is 1/6; (ii) the mean of f(H) − H is π²/18 − 1/2.

Proof. By Theorem 1 and Lemma 7, F(H) > f(H) ≥ H on the interval (0, 1). The mean of F(H) − H [or f(H) − H] equals the mean of F(H) [or f(H)] minus the mean of H, or 1/2. Consequently, using Proposition 6, (i) the mean of F(H) − H is 2/3 − 1/2 = 1/6, and (ii) the mean of f(H) − H is π²/18 − 1/2. ▪

Proposition 9. On the interval [1/K, 1/(K − 1)), where K ≥ 2 is an integer, the maximal value of f(H) − H is 1/[4K(K − 1)], and it is achieved at H = (4K − 3)/[4K(K − 1)].

Proof. For H ∈ I_K, ⌈H⁻¹⌉ = K, and

$$f(H) - H = \frac{1}{K}\left(1 + \sqrt{\frac{KH-1}{K-1}}\right) - H.$$

f(H) − H is continuous on the interval and differentiable except at the endpoints. Its only critical point on the interval is a maximum that occurs at ((4K − 3)/[4K(K − 1)], 1/[4K(K − 1)]). ▪

Corollary 10. On (0, 1), the maximal value of f(H) − H is 1/8, and it is achieved at H = 5/8.

Proof. Because (0, 1) = ∪_{K≥2} I_K, f(H) − H has its maximum in I_K for some K (in particular, for the K for which the maximal value of f(H) − H is greatest). By Proposition 9, the maximum of f(H) − H on I_K is 1/[4K(K − 1)]. As 1/[4K(K − 1)] decreases for K ≥ 2, the maximum of f(H) − H on (0, 1) occurs in I₂. Applying Proposition 9, this maximum is at (5/8, 1/8). ▪

Proposition 11. On the interval [1/K, 1/(K − 1)], where K ≥ 2 is an integer,

i. For K ≥ 5, the maximal value of F(H) − f(H) is 1/√(K − 1) − 1/(K − 1), and it is achieved at H = 1/(K − 1). For K = 2, 3, 4, the maximal value of F(H) − f(H) is 1/√K − 1/K, and it is achieved at H = 1/K.
ii. The minimal value of F(H) − f(H) is
$$\beta(K) = \frac{1}{K}\left(\sqrt{\frac{K^2 - K - 1}{K - 1}} - 1\right), \tag{A10}$$
and it is achieved at H = (K − 1)/(K² − K − 1).

Proof. Define χ(H) = F(H) − f(H). For H ∈ I_K, ⌈H⁻¹⌉ = K, and

$$\chi(H) = \sqrt{H} - \frac{1}{K}\left(1 + \sqrt{\frac{KH-1}{K-1}}\right).$$

To verify ii, note that the only critical point of χ(H) on [1/K, 1/(K − 1)] is a minimum that occurs at ((K − 1)/(K² − K − 1), β(K)).

To obtain i, note that because there is no maximum in the interior of [1/K, 1/(K − 1)], the maximum of χ(H) occurs at the endpoint of the interval that produces the larger value of χ(H). At H = 1/K, χ(H) = 1/√K − 1/K, and at H = 1/(K − 1), χ(H) = 1/√(K − 1) − 1/(K − 1). Define γ(H) = √H − H, and note that at points H = 1/K for integers K ≥ 1, γ(H) = χ(H). At the endpoints of [0, 1], γ(H) = 0, and on [0, 1], γ(H) has its maximum and only critical point at (1/4, 1/4). Consequently, for H, H′ ∈ [0, 1], if H > H′ ≥ 1/4, then γ(H) < γ(H′), whereas if 1/4 ≥ H > H′, then γ(H) > γ(H′). Thus, for K = 2, 3, 4, γ(1/K) = χ(1/K) > χ(1/(K − 1)) = γ(1/(K − 1)), whereas for integers K ≥ 5, γ(1/(K − 1)) = χ(1/(K − 1)) > χ(1/K) = γ(1/K). ▪

Proposition 12. On (0, 1), the highest local minimum of F(H) − f(H) is (√19 − 2)/10 ≈ 0.2359, and it occurs at H = 4/19.

Proof. By Proposition 11, the minimal difference for a given interval [1/K, 1/(K − 1)] is achieved at H = (K − 1)/(K² − K − 1) and is β(K). To find the integer K ≥ 2 where β(K) is greatest, we show that β(K) > β(K + 1) for K ≥ 6. It then follows that the largest value of β(K) occurs at the integer K ∈ [2, 6] that produces the highest value of β(K). This maximum occurs at K = 5, so that H = 4/19 and β(5) = (√19 − 2)/10.

The following chain of inequalities yields the result:

Corollary 13. The maximal value of F(H) − H is 1/4, and it is achieved at H = 1/4.

Proof. This result was shown in the proof of Proposition 11 when it was found that γ(H) = √H − H has its maximum on [0, 1] at H = 1/4. ▪

Corollary 14. The maximal value of F(H) − f(H) is 1/4, and it is achieved at H = 1/4.

Proof. Because f(H) ≥ H, F(H) − f(H) ≤ F(H) − H. By Corollary 13, the maximum of F(H) − H occurs at H = 1/4 and is 1/4. Evaluating at H = 1/4, F(H) − f(H) achieves this same upper bound. ▪
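The extremes of F(H) − f(H) described in Propositions 11 and 12 and in Corollary 14 can be checked numerically. The sketch below is illustrative and uses names of our own choosing: it evaluates the difference on the K-th interval at the minimizing point H = (K − 1)/(K² − K − 1) given in Proposition 11 and reports which K gives the highest local minimum. The closed form for f inside `chi_on_interval` is the larger root from inverting the Theorem 2 upper bound, which is an assumption of this sketch.

```python
import math


def chi_on_interval(H, K):
    """F(H) - f(H) for H in [1/K, 1/(K - 1)], where ceil(1/H) = K."""
    lower = (1.0 + math.sqrt(max(0.0, K * H - 1.0) / (K - 1))) / K
    return math.sqrt(H) - lower


def local_minimum(K):
    """Location and value of the minimum of F - f on the K-th interval."""
    H_star = (K - 1) / (K * K - K - 1)
    return H_star, chi_on_interval(H_star, K)


# Scan a finite range of K; the range 2..20 is an arbitrary choice for illustration.
minima = {K: local_minimum(K) for K in range(2, 21)}
best_K = max(minima, key=lambda K: minima[K][1])
print(best_K, minima[best_K])
print(chi_on_interval(0.25, 4))  # difference between the bounds at H = 1/4
```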

Proposition 15. The difference F(H) − f(H) is (i) greater than f(H) − H if H < 4 − 2√3, (ii) equal to f(H) − H if H = 4 − 2√3, and (iii) less than f(H) − H if H > 4 − 2√3.

Proof. Consider H ∈ [1/K, 1/(K − 1)], for K ≥ 3. On this interval, by Proposition 11, the minimum of F(H) − f(H) is β(K), and by Proposition 9, the maximum of f(H) − H is μ(K) = 1/[4K(K − 1)]. The following inequalities yield F(H) − f(H) > f(H) − H for H < 1/2:

(A11)

For H ∈ [1/2, 1), [F(H) − f(H)] − [f(H) − H] = √H + H − 1 − √(2H − 1), which for H ∈ [1/2, 1) can be shown to fall on the same side of zero as H² − 8H + 4. The only root of H² − 8H + 4 = 0 for H ∈ [1/2, 1) is 4 − 2√3 ≈ 0.5359, at which the sign of H² − 8H + 4 switches from positive to negative. ▪

Proposition 16.

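i. The fraction of homozygosity due to homozygotes for the most frequent allele is greater than or equal to
$$\frac{f(H)^2}{H} = \frac{1}{\lceil H^{-1} \rceil^2 H}\left(1 + \sqrt{\frac{\lceil H^{-1} \rceil H - 1}{\lceil H^{-1} \rceil - 1}}\right)^2,$$
with equality if and only if K = ⌈H⁻¹⌉ = ⌈M⁻¹⌉, p₁ = p₂ = … = p_K₋₁ = M, and p_K = 1 − (K − 1)M.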
ii. The fraction of homozygosity due to homozygotes for the most frequent allele is greater than or equal to H, equality requiring H = K⁻¹ for some integer K ≥ 2, and p₁ = p₂ = … = p_K = H.
iii. The lower bound on the fraction of homozygosity due to homozygotes for the most frequent allele lies in [1/K, 1/(K − 1)), where K = ⌈H⁻¹⌉.
iv. The lower bound on the fraction of homozygosity due to homozygotes for the most frequent allele is monotonically increasing with H on the interval (0, 1).

Proof. The fraction of homozygosity due to homozygotes for the most frequent allele is M²/H, so that i follows directly from Theorem 1ii.

ii. That M²/H ≥ f(H)²/H ≥ H²/H follows directly from Theorem 1ii and Lemma 7, with equality under the same conditions as specified by these results.
iii. That M²/H ≥ K⁻¹ for H ∈ I_K follows trivially from ii. Note that f(H)²/H < 1/(K − 1) is equivalent to [√((K − 1)(KH − 1)) − 1]² > 0, which is true except if H = 1/(K − 1).
iv. Denote the lower bound in i by σ(H). The function σ is continuous on (0, 1), and at H = K⁻¹ for integers K ≥ 2, σ(H) = K⁻¹. To show that σ is monotonic on (0, 1) all that must be shown is that it is monotonic for H ∈ I_K. On this interval, ⌈H⁻¹⌉ = K, and the derivative of σ is
$$\sigma'(H) = \frac{1}{K^2(K-1)H^2}\left[(2 - KH)\sqrt{\frac{K-1}{KH-1}} - (K-2)\right].$$
To show that the term inside the brackets is positive for H ∈ I_K, we can begin with the inequality (K − 1)H² − KH + 1 > 0, which holds for H ∈ I_K, as the leading term is positive and the roots are located at 1/(K − 1) and 1. Multiplying by K² and adding identical terms to both sides, we have (K − 1)(K²H² − 4KH + 4) > (K² − 4K + 4)(KH − 1). Noting that K ≥ 2 and for H ∈ I_K, 2 − KH > 0, the square root of both sides can be taken to obtain √(K − 1)(2 − KH) > (K − 2)√(KH − 1). ▪
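The behavior of the lower bound in Proposition 16 can be examined on a grid. The sketch below (function name and grid spacing are ours) computes the minimal fraction of homozygosity attributable to the most frequent allele as f(H)²/H, using the larger-root closed form for f as an assumption, and checks that the values do not decrease as H increases. It is a numerical illustration and not a substitute for the proof.

```python
import math


def min_fraction_from_top_allele(H):
    """Lower bound on M**2 / H, the share of homozygosity from the most frequent allele."""
    if not 0.0 < H < 1.0:
        raise ValueError("H must lie in the open interval (0, 1)")
    K = math.ceil(1.0 / H)
    # max() absorbs a tiny negative value from rounding when H is close to 1/K.
    f = (1.0 + math.sqrt(max(0.0, K * H - 1.0) / (K - 1))) / K
    return f * f / H


grid = [i / 1000 for i in range(1, 1000)]
values = [min_fraction_from_top_allele(H) for H in grid]
# Small tolerance so that floating-point noise is not mistaken for a decrease.
is_nondecreasing = all(b >= a - 1e-12 for a, b in zip(values, values[1:]))
print(is_nondecreasing)
for H in (0.25, 0.5, 0.75, 0.9):
    print(H, min_fraction_from_top_allele(H))
```

At reciprocals of integers the bound equals H itself, while for H well above 1/2 it rises quickly toward 1, which is the pattern described for Figure 3.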

Homozygosity in terms of the frequency of the most frequent allele:

Many of the results in this section follow from those in the previous section, using the fact that the lower and upper bounds g and G for homozygosity are the respective inverse functions of the upper and lower bounds F and f for the frequency of the most frequent allele.

Proposition 17. Averaging across values of M ∈ (0, 1), (i) the mean of G(M) is 1 − π²/18; (ii) the mean of g(M) is 1/3; (iii) the mean of G(M) − g(M) is 2/3 − π²/18.

Proof.

iii. Because G and g are the inverse functions of f and F by Lemma 5, and because on (0, 1), G > g and F > f, the area between G and g equals the area between F and f. By Proposition 6, this area is 2/3 − π²/18.
ii. The mean of g(M) is ∫₀¹ M² dM = 1/3.
i. That the mean of G(M) is 1 − π²/18 follows directly from ii and iii. ▪
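These means can be approximated by direct numerical integration. The sketch below uses G(M) = 1 − M(K − 1)(2 − KM) with K = ⌈M⁻¹⌉ and g(M) = M², and applies a simple midpoint rule. The helper name and the number of cells are arbitrary assumptions.

```python
import math

def G(M):
    # Upper bound on homozygosity given M, with K = ceil(1/M).
    K = math.ceil(1.0 / M)
    return 1.0 - M * (K - 1) * (2.0 - K * M)

def g(M):
    # Lower bound on homozygosity given M.
    return M * M

def mean_on_unit_interval(func, n=200_000):
    # Midpoint rule on (0, 1) with n equal cells.
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))

mean_G = mean_on_unit_interval(G)
mean_g = mean_on_unit_interval(g)
```

Because G is continuous at the points M = 1/K, a floating-point ceiling that lands on the neighboring piece at such a point does not affect the approximation.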

Lemma 18. For M ∈ (0, 1), G(M) ≤ M, with equality if and only if M = K⁻¹ for some integer K.

Proof. This result follows directly from Lemma 7 and the inverse relationship of G and f in Lemma 5. ▪

Proposition 19. Averaging across values of M ∈ (0, 1), (i) the mean of M − G(M) is π²/18 − 1/2; (ii) the mean of M − g(M) is 1/6.

Proof. From the inverse relationship between G and f (Lemma 5), the area between M and G(M) equals the area between f(H) and H, or π²/18 − 1/2 (Proposition 8ii), and from the inverse relationship between g and F, the area between M and g(M) equals the area between F(H) and H, or 1/6 (Proposition 8i). ▪

Proposition 20. On the interval [1/K, 1/(K − 1)), where K ≥ 2 is an integer, the maximal value of M − G(M) is 1/[4K(K − 1)], and it is achieved at M = (2K − 1)/[2K(K − 1)].

Proof. For M ∈ I_K, ⌈M⁻¹⌉ = K, and M − G(M) = M − [K(K − 1)M² − 2(K − 1)M + 1]. M − G(M) is continuous on the interval and differentiable except at the endpoints. Its only critical point on the interval is a maximum that occurs at ((2K − 1)/[2K(K − 1)], 1/[4K(K − 1)]). ▪

Corollary 21. On (0, 1), the maximal value of M − G(M) is 1/8, and it is achieved at M = 3/4.

Proof. Because (0, 1) = ⋃_{K ≥ 2} I_K, M − G(M) has its maximum in I_K for some K, in particular for the K for which the maximal value of M − G(M) is greatest. By Proposition 20, the maximum of M − G(M) on I_K is 1/[4K(K − 1)]. As 1/[4K(K − 1)] decreases for K ≥ 2, the maximum of M − G(M) on (0, 1) occurs in I₂. Applying Proposition 20, this maximum is at (3/4, 1/8). ▪
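Proposition 20 and Corollary 21 can be confirmed with exact rational arithmetic. The sketch below evaluates M − G(M) at the stated critical point for several values of K. The helper names and the range of K are illustrative assumptions.

```python
import math
from fractions import Fraction

def G(M):
    # Upper bound on homozygosity given M (a Fraction), with K = ceil(1/M).
    K = math.ceil(1 / M)
    return 1 - M * (K - 1) * (2 - K * M)

def argmax_and_max(K):
    # Critical point of M - G(M) on I_K and the value attained there.
    M_star = Fraction(2 * K - 1, 2 * K * (K - 1))
    return M_star, M_star - G(M_star)

# Map each K to (location of the maximum, maximal value) on I_K.
results = {K: argmax_and_max(K) for K in range(2, 11)}
```

For K = 2 the pair is (3/4, 1/8), and the maximal values shrink as K grows, so the overall maximum on (0, 1) lies in I₂.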

Proposition 22. On the interval [1/K, 1/(K − 1)], where K ≥ 2 is an integer,

i. For K ≥ 3, the maximal value of G(M) − g(M) is (K − 2)/(K − 1)², and it is achieved at M = 1/(K − 1). For K = 2, the maximal value of G(M) − g(M) is 1/4, and it is achieved at M = 1/2.
ii. The minimal value of G(M) − g(M) is ρ(K) = (K − 2)/(K² − K − 1), and it is achieved at M = (K − 1)/(K² − K − 1).

Proof. Define ξ(M) = G(M) − g(M). For M ∈ I_K, ⌈M⁻¹⌉ = K, and ξ(M) = (K² − K − 1)M² − 2(K − 1)M + 1. To verify ii, note that the only critical point of ξ(M) on [1/K, 1/(K − 1)] is a minimum that occurs at ((K − 1)/(K² − K − 1), ρ(K)).

To obtain i, note that because there is no maximum in the interior of [1/K, 1/(K − 1)], the maximum of ξ(M) occurs at the endpoint that produces the larger value of ξ(M). At M = 1/K, ξ(M) = 1/K − 1/K², and at M = 1/(K − 1), ξ(M) = 1/(K − 1) − 1/(K − 1)². Define δ(M) = M − M², and note that at points M = 1/K for integers K ≥ 1, δ(M) = ξ(M). At the endpoints of [0, 1], δ(M) = 0, and on [0, 1], δ(M) has its maximum and only critical point at (1/2, 1/4). Consequently, for M, M′ ∈ [0, 1], if M > M′ ≥ 1/2, then δ(M) < δ(M′), whereas if 1/2 ≥ M > M′, then δ(M) > δ(M′). Thus, for K = 2, δ(1/K) = ξ(1/K) > ξ(1/(K − 1)) = δ(1/(K − 1)), whereas for integers K ≥ 3, δ(1/(K − 1)) = ξ(1/(K − 1)) > ξ(1/K) = δ(1/K). ▪
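The endpoint and interior values in Proposition 22 can be verified exactly. The sketch below evaluates ξ(M) = (K² − K − 1)M² − 2(K − 1)M + 1 with K held fixed over the closed interval [1/K, 1/(K − 1)]. The helper names and dictionary keys are hypothetical.

```python
from fractions import Fraction

def xi(M, K):
    # G(M) - g(M) on [1/K, 1/(K-1)], with K held fixed over the closed interval.
    return (K * K - K - 1) * M * M - 2 * (K - 1) * M + 1

def rho(K):
    # Stated minimal value of G(M) - g(M) on the interval.
    return Fraction(K - 2, K * K - K - 1)

def summarize(K):
    # Values of xi at both endpoints and at the stated minimizer.
    left, right = Fraction(1, K), Fraction(1, K - 1)
    M_min = Fraction(K - 1, K * K - K - 1)
    return {"left": xi(left, K), "right": xi(right, K),
            "min": xi(M_min, K), "M_min": M_min}
```

For K = 2 the larger endpoint value is at M = 1/K = 1/2, whereas for K ≥ 3 it is at M = 1/(K − 1), matching part i.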

Proposition 23. On (0, 1), the highest local minimum of G(M) − g(M) is 1/5, and it occurs at M = 2/5.

Proof. By Proposition 22, the minimal difference for a given interval [1/K, 1/(K − 1)] is achieved at M = (K − 1)/(K² − K − 1) and is ρ(K). To find the integer K ≥ 2 where ρ(K) is greatest, note that

ρ(K) − ρ(K + 1) = (K² − 3K + 1)/[(K² − K − 1)(K² + K − 1)]. (A12)

As a result, ρ(K) − ρ(K + 1) > 0 for K ≥ 3. It follows that ρ(K) is largest at the integer K ∈ [2, 3] that produces the highest value of ρ(K). This maximum occurs at K = 3, so that M = 2/5 and ρ(K) = 1/5. ▪
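The comparison of successive values of ρ(K) can be checked with exact fractions. In the sketch below, the closed form used for ρ(K) − ρ(K + 1) is derived here from the definition ρ(K) = (K − 2)/(K² − K − 1), not quoted from the text. The function names and the search range are assumptions.

```python
from fractions import Fraction

def rho(K):
    # Minimal value of G(M) - g(M) on [1/K, 1/(K-1)].
    return Fraction(K - 2, K * K - K - 1)

def rho_step(K):
    # Closed form assumed for rho(K) - rho(K + 1).
    return Fraction(K * K - 3 * K + 1,
                    (K * K - K - 1) * (K * K + K - 1))

# Integer K in [2, 200) at which rho is largest.
best_K = max(range(2, 200), key=rho)
```

The step is negative only at K = 2, so the search settles on K = 3, where ρ(3) = 1/5 and the minimizer is M = 2/5.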

Corollary 24. The maximal value of M − g(M) is 1/4, and it is achieved at M = 1/2.

Proof. This result was shown in the proof of Proposition 22 when it was found that δ(M) = M − M² has its maximum on [0, 1] at M = 1/2. ▪

Corollary 25. The maximal value of G(M) − g(M) is 1/4, and it is achieved at M = 1/2.

Proof. Because M ≥ G(M), G(M) − g(M) ≤ M − g(M). By Corollary 24, the maximum of M − g(M) occurs at M = 1/2 and is 1/4. Evaluating at M = 1/2, G(M) − g(M) achieves this same upper bound. ▪

Proposition 26. The difference G(M) − g(M) is (i) greater than M − G(M) if 0 < M < 2/3, (ii) equal to M − G(M) if M = 2/3, and (iii) less than M − G(M) if 2/3 < M < 1.

Proof. Consider M ∈ [1/K, 1/(K − 1)], for K ≥ 3. On this interval, by Proposition 22, the minimum of G(M) − g(M) is ρ(K), and by Proposition 20, the maximum of M − G(M) is μ(K) = 1/[4K(K − 1)]. The quantity ρ(K) − μ(K) can be simplified to
(4K³ − 13K² + 9K + 1)/[4K(K − 1)(K² − K − 1)],
which is clearly positive for K > 13/4, and which is also positive for K = 3. As a result, G(M) − g(M) > M − G(M) for intervals I_K with K ≥ 3, that is, for 0 < M < 1/2.

For M ∈ [1/2, 1), [G(M) − g(M)] − [M − G(M)] = 3M² − 5M + 2. The only root of 3M² − 5M + 2 = 0 for M ∈ [1/2, 1) is M = 2/3, at which the sign of 3M² − 5M + 2 switches from positive to negative. ▪
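The sign change at M = 2/3 can be checked on a rational grid. The sketch below uses G(M) = 1 − M(K − 1)(2 − KM) with K = ⌈M⁻¹⌉ and g(M) = M². The grid spacing of 1/600 is an arbitrary choice that makes M = 2/3 a grid point.

```python
import math
from fractions import Fraction

def G(M):
    # Upper bound on homozygosity given M (a Fraction), with K = ceil(1/M).
    K = math.ceil(1 / M)
    return 1 - M * (K - 1) * (2 - K * M)

def excess(M):
    # [G(M) - g(M)] - [M - G(M)], with g(M) = M^2.
    return (G(M) - M * M) - (M - G(M))

# Sign of the excess (+1, 0, or -1) at each grid point in (0, 1).
grid = [Fraction(i, 600) for i in range(1, 600)]
signs = {M: (excess(M) > 0) - (excess(M) < 0) for M in grid}
```

On this grid the sign is +1 below 2/3, 0 at 2/3, and −1 above it.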

References

Depaulis, F., and M. Veuille, 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15: 1788–1790.
Depaulis, F., S. Mousset and M. Veuille, 2001. Haplotype tests using coalescent simulations conditional on the number of segregating sites. Mol. Biol. Evol. 18: 1136–1138.
Hedrick, P. W., 2005. A standardized genetic differentiation measure. Evolution 59: 1633–1638.
Hudson, R. R., K. Bailey, D. Skarecky, J. Kwiatowski and F. J. Ayala, 1994. Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster. Genetics 136: 1329–1340.
Innan, H., K. Zhang, P. Marjoram, S. Tavaré and N. A. Rosenberg, 2005. Statistical tests of the coalescent model based on the haplotype frequency distribution and the number of segregating sites. Genetics 169: 1763–1777.
Long, J. C., and R. A. Kittles, 2003. Human genetic diversity and the nonexistence of biological races. Hum. Biol. 75: 449–471.
Markovtsova, L., P. Marjoram and S. Tavaré, 2001. On a test of Depaulis and Veuille. Mol. Biol. Evol. 18: 1132–1133.
Nei, M., 1987. Molecular Evolutionary Genetics. Columbia University Press, New York.
Rosenberg, N. A., S. Mahajan, S. Ramachandran, C. Zhao, J. K. Pritchard et al., 2005. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet. 1: 660–671.
Sabeti, P. C., D. E. Reich, J. M. Higgins, H. Z. P. Levine, D. J. Richter et al., 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
Tang, K., K. R. Thornton and M. Stoneking, 2007. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 5: 1587–1602.
Toomajian, C., T. T. Hu, M. J. Aranzana, C. Lister, C. Tang et al., 2006. A nonparametric test reveals selection for rapid flowering in the Arabidopsis genome. PLoS Biol. 4: 732–738.
Voight, B. F., S. Kudaravalli, X. Wen and J. K. Pritchard, 2006. A map of recent positive selection in the human genome. PLoS Biol. 4: 446–458.
Watterson, G. A., 1977. Heterosis or neutrality? Genetics 85: 789–814.
Watterson, G. A., 1978. The homozygosity test of neutrality. Genetics 88: 405–417.
Zeng, K., S. Mano, S. Shi and C.-I. Wu, 2007. Comparisons of site- and haplotype-frequency methods for detecting positive selection. Mol. Biol. Evol. 24: 1562–1574.

