PLOS ONE. 2014 Jan 21;9(1):e85925. doi: 10.1371/journal.pone.0085925

Confidence Intervals for Population Allele Frequencies: The General Case of Sampling from a Finite Diploid Population of Any Size

Tak Fung 1,2,*, Kevin Keenan 3
Editor: Guy Brock 4
PMCID: PMC3897575  PMID: 24465792

Abstract

The estimation of population allele frequencies using sample data forms a central component of studies in population genetics. These estimates can be used to test hypotheses on the evolutionary processes governing changes in genetic variation among populations. However, existing studies frequently do not account for sampling uncertainty in these estimates, thus compromising their utility. Incorporation of this uncertainty has been hindered by the lack of a method for constructing confidence intervals containing the population allele frequencies, for the general case of sampling from a finite diploid population of any size. In this study, we address this important knowledge gap by presenting a rigorous mathematical method to construct such confidence intervals. For a range of scenarios, the method is used to demonstrate that for a particular allele, in order to obtain accurate estimates within 0.05 of the population allele frequency with high probability (≥95%), a sample size of Inline graphic is often required. This analysis is augmented by an application of the method to empirical sample allele frequency data for two populations of the checkerspot butterfly (Melitaea cinxia L.), occupying meadows in Finland. For each population, the method is used to derive Inline graphic% confidence intervals for the population frequencies of three alleles. These intervals are then used to construct two joint Inline graphic% confidence regions, one for the set of three frequencies for each population. These regions are then used to derive a Inline graphic% confidence interval for Jost's D, a measure of genetic differentiation between the two populations. Overall, the results demonstrate the practical utility of the method with respect to informing sampling design and accounting for sampling uncertainty in studies of population genetics, important for scientific hypothesis-testing and also for risk-based natural resource management.

Introduction

Spatiotemporal patterns of genetic variation among populations are often used to test hypotheses about processes underlying the patterns, such as selection, migration and genetic drift (e.g., [1], [2], [3], [4], [5]). This genetic variation captures differences in genetic structure among populations, where the genetic structure of a population is determined by the distribution of alleles among individuals in the population. The allele distribution of a population is the result of a range of biological and environmental processes acting on the population and also on surrounding populations within the same geographical region, which result in non-random mixing of gametes among individuals in all populations. Patterns in allele distributions among populations are generally assessed by consideration of variations in allele frequencies (e.g., [6], [7], [8], [9]).

Logistic and ethical constraints mean that in practice, a population is unlikely to be sampled in its entirety, such that the population allele frequencies have to be estimated using a subset of sampled individuals. For diploid organisms in a large population, it has been established that the frequency of an allele in a sample provides an unbiased estimate of the frequency in the population as a whole [10]. Thus, if samples of a given size are repeatedly taken from a large population, then the mean frequency of an allele in a sample converges to the population allele frequency as the number of samples increases. However, many studies only take a single sample from a population and present or use the resulting frequencies of alleles in the sample, without accounting for sampling uncertainty (e.g., [11], [12], [13], [14], [15], [16], [17], [18]). Therefore, these studies implicitly assume that the sample allele frequencies are close to the population allele frequencies, which is by no means guaranteed. Interpretation of findings from these studies is therefore complicated by the potential for large sampling uncertainty. Ideally, in this single sample case, uncertainty bounds for the population allele frequencies would be quantified, based on the sample allele frequencies.

A recent study by Hale et al. [19] used computer simulations to draw samples from four diploid populations, with allele frequencies based on four empirical datasets. Subsequently, they used the sample data to derive the means, variances and ranges of allele frequencies in samples of varying size, as well as the means, variances and ranges of indicators that are a function of allele frequencies (specifically, heterozygosity and FST). Using these results, Hale et al. [19] concluded that a sample size of 30 is sufficient to accurately estimate population allele frequencies when using microsatellites. However, the authors did not quantify uncertainty bounds for the population allele frequencies using their sample data, such that their assessment of accuracy lacks a rigorous quantitative basis. Furthermore, for each sample size, sample frequencies were calculated using only 100 replicate samples, which may not give a good approximation of the true dispersion in the sample frequencies. This highlights a weakness of using a computational approach that lacks an underlying mathematical theory. Moreover, Hale et al. [19] did not consider the situation identified earlier, where one sample is taken and there is a need to quantify uncertainty of population allele frequencies using just the sample allele frequencies. For this situation, earlier studies have derived confidence intervals in order to capture uncertainty in the population allele frequencies. If a diploid population is of infinite size and is at Hardy-Weinberg equilibrium (HWE), then the frequency of an allele in a sample follows a binomial distribution [10]. Thus, Gillespie [20] proposed the use of the Wald confidence interval [21]. However, this interval only performs well for sufficiently large sample sizes – with small sample sizes, it is too short [22]. 
For the binomial distribution, other confidence intervals have been derived that do perform well for small sample sizes [22], [23], notably the Clopper-Pearson interval that was derived eight decades ago [24]. However, these only apply in the limited cases when the population is at HWE. Weir [10] proposed a confidence interval that allows for deviations from HWE, analogous to the Wald confidence interval. However, like the latter, it is expected to perform well only for sufficiently large sample sizes [10]. This is problematic because the accuracy at a particular sample size is unknown, so “sufficiently large” cannot be rigorously quantified. Moreover, all the confidence intervals considered only apply to cases when the population can be assumed to be of infinite size, i.e. when the population size is much larger than the sample size. This does not reflect the range of scenarios encountered in empirical research (e.g., [13], [25], [26], [27]).

In this study, we build on previous work by constructing confidence intervals for population allele frequencies for the general case where (i) the population is diploid, finite and can be of any size; (ii) the sample can take any size less than or equal to that of the population; and (iii) the population can deviate from HWE to any extent. These confidence intervals are guaranteed to contain the population allele frequency with a probability above a known threshold. The method derived for constructing these intervals is then used to calculate sample sizes required to achieve accurate estimates of population allele frequencies, under a range of scenarios. Here, accuracy is measured as the length of the confidence intervals. The sample sizes derived serve as a guide for determination of suitable sample sizes in future population genetic studies. In particular, we show that a sample size of 30 does not necessarily give accurate estimates, thus refining the conclusion of Hale et al. [19]. Lastly, we provide an example of how the method can be applied to microsatellite data for two populations of the checkerspot butterfly (Melitaea cinxia L.) [13], to derive confidence intervals for population allele frequencies and also for Jost's D, a measure of genetic differentiation between two populations that is a function of their population allele frequencies [9]. The data come from a study [13] in which it is unclear whether the populations can be assumed to be of infinite size or at HWE. This example illustrates how the mathematical theory underlying our method can be used to quantify sampling uncertainty not only in population allele frequencies, but also in parameters that are functions of these frequencies, important for hypothesis-testing and also for natural resource management.

Overall, this study provides a rigorous mathematical quantification of sampling uncertainty in population allele frequencies, without the typically unrealistic constraints of assuming that a population has an infinite size and is at HWE. Thus, the results can be applied to a wide range of studies in population genetics.

Methods

In order to construct confidence intervals for the general case of taking samples of any size from a diploid population of any size (larger than or equal to the sample size) and with any degree of deviation from HWE, the sampling distribution of the allele frequencies in this general case is first exactly specified. This distribution is then used to derive formulae specifying confidence intervals that contain the population allele frequencies with probability above a known threshold. Using these formulae, confidence intervals are derived for a range of archetypal scenarios, which are then used to calculate sample sizes that permit accurate estimates of population allele frequencies in these scenarios. In addition, the formulae are applied to a real scenario where samples are taken from two butterfly populations [13], to construct confidence intervals for the population allele frequencies of these two populations. The intervals are then used to derive a corresponding confidence interval for Jost's D [9].

Derivation of sampling distribution of allele frequencies

Consider a population of M diploid individuals, from which a sample of N individuals is randomly drawn (Inline graphic). At the locus of interest, there are Inline graphic alleles, denoted by Inline graphic, Inline graphic. Let the population allele frequencies be denoted by Inline graphic, with the corresponding sample allele frequencies being denoted by Inline graphic. Also, let Inline graphic be the frequency of individuals in the population with alleles Inline graphic and Inline graphic (Inline graphic), such that Inline graphic is the number of corresponding individuals in the base population. Inline graphic is related to Inline graphic by the formula Inline graphic. Here, Inline graphic is a measure of the homozygosity of individuals in the population with respect to allele Inline graphic. Inline graphic, and thus Inline graphic. If Inline graphic, then the minimum value of Inline graphic is 0, since all copies of allele Inline graphic can be distributed among heterozygotes of allele Inline graphic. However, if Inline graphic, then this is not possible – there must be at least one homozygote of allele Inline graphic. In this case, the minimum number of homozygotes is realized when all heterozygotes have a copy of allele Inline graphic, i.e. when Inline graphic. Rearranging this for Inline graphic and substituting into Inline graphic gives the minimum value of Inline graphic as Inline graphic. Thus, overall, Inline graphic.

The frequency of allele Inline graphic in the sample of size N is given by Inline graphic, where Inline graphic is the number of copies of allele Inline graphic in the sample. Thus, the probability distribution of Inline graphic is the same as the probability distribution of Inline graphic except with the x-axis scaled by a factor Inline graphic. Denote the probability mass function (pmf) for Inline graphic by Inline graphic. The range of Inline graphic is Inline graphic, so Inline graphic for Inline graphic outside this range. Thus, for the following calculations, only Inline graphic are considered. Now, Inline graphic, where Inline graphic and Inline graphic is the number of individuals in the sample with alleles Inline graphic and Inline graphic. Inline graphic, Inline graphic and Inline graphic thus represent the number of individuals in the sample with two copies, one copy and no copies of allele Inline graphic, respectively. Inline graphic, Inline graphic and Inline graphic follow a multivariate hypergeometric distribution with pmf given by:

graphic file with name pone.0085925.e063.jpg (1)

where terms on the right-hand side in brackets are binomial coefficients. Since the number of individuals in the sample must equal N, Inline graphic. Inline graphic is the sum of Inline graphic for all those biologically feasible combinations of Inline graphic, Inline graphic and Inline graphic satisfying Inline graphic. It will now be shown that this summation can be simplified to the sum of an expression that depends only on Inline graphic and Inline graphic, over all Inline graphic between lower and upper bounds that only depend on Inline graphic. This considerably eases calculation of Inline graphic. Firstly, the number of individuals in the sample with two copies, one copy or no copies of allele Inline graphic cannot exceed the corresponding numbers in the sampled population. This gives rise to three inequalities:

graphic file with name pone.0085925.e077.jpg (2)
graphic file with name pone.0085925.e078.jpg (3)
graphic file with name pone.0085925.e079.jpg (4)

Secondly, the number of individuals with two copies, one copy or no copies of allele Inline graphic in the sample must be non-negative. This gives rise to three more inequalities:

graphic file with name pone.0085925.e081.jpg (5)
graphic file with name pone.0085925.e082.jpg (6)
graphic file with name pone.0085925.e083.jpg (7)

Since Inline graphic and Inline graphic, then Inline graphic and Inline graphic. Thus, the inequalities (2)–(7) can be rearranged to obtain the double inequality:

graphic file with name pone.0085925.e088.jpg (8)

It is noted that Inline graphic, which means that Inline graphic (inequality (6)) ensures that Inline graphic, as required. Denote the upper and lower bounds in (8) by Inline graphic and Inline graphic respectively. Then Inline graphic can be written as:

graphic file with name pone.0085925.e095.jpg (9)

where Inline graphic has been rewritten using Inline graphic and Inline graphic, Inline graphic is the smallest integer greater than or equal to a, and Inline graphic is the largest integer less than or equal to a. The ceiling and floor functions are introduced to ensure that Inline graphic is an integer, which it must be to have a biological interpretation. Inline graphic is equal to Inline graphic, the pmf for Inline graphic, and thus defines the probability distribution of Inline graphic. Equation (9) can be used to define the probability distribution of Inline graphic given only four parameters: M, Inline graphic, Inline graphic and N.
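Because the equation images and inline symbols are not reproduced in this text, the following is only a sketch of the kind of single-index sum described above, written in assumed notation: \(p_i\) for the population allele frequency, \(P_{ii}\) for the population homozygote frequency, \(X\) for the number of copies of the allele in the sample, and \(H_2, H_1, H_0\) for the population counts of individuals carrying two, one and no copies. The summation limits shown are the simplest ones that work when a binomial coefficient is taken to be zero whenever its lower entry exceeds its upper entry; the bounds in (8) presumably fold the population-count constraints in explicitly.

```latex
% Assumed notation; not the authors' typeset equations.
\begin{align}
H_2 &= M P_{ii}, \qquad H_1 = 2M\,(p_i - P_{ii}), \qquad H_0 = M\,(1 - 2p_i + P_{ii}),\\
\Pr(X = x) &= \binom{M}{N}^{-1}
  \sum_{x_2 = \max(0,\,x-N)}^{\lfloor x/2 \rfloor}
  \binom{H_2}{x_2}\binom{H_1}{x - 2x_2}\binom{H_0}{N - x + x_2},
  \qquad x = 0, 1, \ldots, 2N .
\end{align}
```

Here \(x_2\) is the number of sampled homozygotes, so \(x - 2x_2\) sampled heterozygotes and \(N - x + x_2\) sampled non-carriers are implied once \(x\) and \(x_2\) are fixed.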

Derivation of confidence intervals for population allele frequencies

For given population size (M), population allele frequency (Inline graphic), population frequency of homozygotes with allele Inline graphic (Inline graphic) and sample size (N), the probability distribution for the sample allele frequency (Inline graphic) can be calculated exactly using equation (9). The mean value of this distribution is:

graphic file with name pone.0085925.e113.jpg (10)

as in the case of sampling from an infinite diploid population [10]. In equation (10), the expectation Inline graphic has been used, which follows from the fact that the Inline graphic's, with Inline graphic, follow a multivariate hypergeometric distribution with parameters M, N and Inline graphic [28]. In addition, it can be proved that the variance of Inline graphic is

graphic file with name pone.0085925.e119.jpg (11)

(see File S1). The variance thus includes a standard finite correction factor Inline graphic that tends to 1 as Inline graphic, as required. Therefore, as Inline graphic, Inline graphic, the variance in the case of an infinite population size [10]. The cumulative distribution function (cdf) for Inline graphic is specified by:

graphic file with name pone.0085925.e125.jpg (12)

where Inline graphic is an integer in the interval Inline graphic. This cdf can be calculated using equation (9) and is a function of Inline graphic. To construct a CI for Inline graphic given M, N and Inline graphic, consider testing the null hypothesis Inline graphic against the alternative Inline graphic at significance level Inline graphic for an observed value of Inline graphic, denoted by Inline graphic. The null hypothesis is not rejected if using Inline graphic, Inline graphic falls within an acceptance region defined by Inline graphic, where Inline graphic is the largest integer for which Inline graphic and Inline graphic is the smallest integer for which Inline graphic. Inline graphic can be calculated using equation (12). One method of constructing a Inline graphic% CI for Inline graphic is to determine the set of values of Inline graphic for which the null hypothesis is not rejected at significance level Inline graphic, denoted by Inline graphic, and defining the CI as

graphic file with name pone.0085925.e149.jpg (13)

[29]. This hypothesis-testing approach corresponds to the “test-method” described by Talens [30], who applied it to construct CI's for parameters of the univariate hypergeometric distribution. If Inline graphic, the empty set, then the acceptance region needs to be extended to include Inline graphic for at least one value of Inline graphic. Thus, in this case, Inline graphic is decreased until Inline graphic is in the acceptance region for at least one Inline graphic value.

The CI specified by equation (13) is not an exact Inline graphic% CI because the distribution for Inline graphic is discrete and because there may be some Inline graphic values between Inline graphic and Inline graphic for which Inline graphic, since Inline graphic is not guaranteed to be a monotonic function of Inline graphic (see equations (9) and (12)). It is noted that for the probability parameter in a binomial distribution, Clopper and Pearson [24] derived Inline graphic% CI's using an analogous method. Thus, for the case of a large population size relative to the sample size (Inline graphic) and HWE, where Inline graphic approximately follows a binomial distribution, CI's for Inline graphic derived using our method would be virtually the same as Clopper-Pearson CI's. This can be verified by explicitly calculating and comparing CI's using the two methods – for example, if Inline graphic and Inline graphic, then both methods give the CI Inline graphic for Inline graphic.
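The test-inversion idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Mathematica or R code: the function names are hypothetical, the acceptance rule is an equal-tailed one (a candidate frequency is kept when both tail probabilities at the observed count exceed α/2), which is assumed to be in the spirit of the acceptance region described above, and homozygosity is tied to the allele frequency through a simple minimum-homozygosity rule so that every candidate copy number is feasible.

```python
from fractions import Fraction
from math import comb


def pmf_copies(M, N, n_hom, n_het):
    """Exact pmf of X, the number of allele copies among N sampled individuals.

    M: population size; n_hom / n_het: population counts of individuals
    carrying two copies / one copy of the allele (assumed inputs).
    """
    n_none = M - n_hom - n_het
    if min(n_hom, n_het, n_none) < 0 or not 0 <= N <= M:
        raise ValueError("infeasible genotype counts or sample size")
    total = comb(M, N)
    pmf = []
    for x in range(2 * N + 1):
        ways = 0
        # x2 sampled homozygotes, x - 2*x2 heterozygotes, N - x + x2 non-carriers
        for x2 in range(max(0, x - N), x // 2 + 1):
            ways += comb(n_hom, x2) * comb(n_het, x - 2 * x2) * comb(n_none, N - x + x2)
        pmf.append(Fraction(ways, total))
    return pmf


def min_homozygosity(k, M):
    """Hypothetical rule: fewest homozygotes compatible with k copies in the population."""
    n_hom = max(0, k - M)
    return n_hom, k - 2 * n_hom


def allele_frequency_ci(M, N, x_obs, alpha=0.05, rule=min_homozygosity):
    """Invert an equal-tailed test over all candidate population copy numbers."""
    accepted = []
    for k in range(2 * M + 1):
        n_hom, n_het = rule(k, M)
        pmf = pmf_copies(M, N, n_hom, n_het)
        lower_tail = sum(pmf[: x_obs + 1])  # P(X <= x_obs)
        upper_tail = sum(pmf[x_obs:])       # P(X >= x_obs)
        if lower_tail > alpha / 2 and upper_tail > alpha / 2:
            accepted.append(Fraction(k, 2 * M))
    if not accepted:
        raise ValueError("no candidate accepted; a smaller alpha would be needed")
    return min(accepted), max(accepted)


lo, hi = allele_frequency_ci(M=50, N=10, x_obs=6)
print(float(lo), float(hi))
```

Exact rational arithmetic is used so that tail probabilities are not affected by rounding. Returning the smallest and largest accepted values mirrors the remark above that the accepted set need not be contiguous.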

In deriving equation (13), it was assumed that Inline graphic is known. However, Inline graphic is generally unknown. In this case, a Inline graphic% confidence region (CR) can be derived for Inline graphic and Inline graphic in an analogous way, by considering the null hypothesis Inline graphic. The CR can be derived by determining the set of vectors Inline graphic for which the null hypothesis is not rejected at significance level Inline graphic. This set is denoted by Inline graphic, where Inline graphic is as defined before. Using the CR, Inline graphic% CI's for Inline graphic and Inline graphic can be defined as

graphic file with name pone.0085925.e185.jpg (14a)

and

graphic file with name pone.0085925.e186.jpg (14b)

where Inline graphic is the set of values consisting of the jth elements in the set of vectors Inline graphic. In determining Inline graphic, Inline graphic and Inline graphic values over the biologically feasible ranges are tested. Since the number of copies of allele Inline graphic in the population must be at least the number found in the sample, Inline graphic. Also, the number of copies of all other alleles in the population must be at least the number found in the sample, Inline graphic; this constrains the maximum value of Inline graphic according to Inline graphic. For a given value of Inline graphic, Inline graphic, as determined earlier. If Inline graphic, then the acceptance region is extended to include Inline graphic for at least one pair of values of Inline graphic and Inline graphic, by decreasing Inline graphic.

CI's specified by equation (13) can be calculated given M, N, Inline graphic and Inline graphic, whereas those specified by equations (14a) and (14b) can be calculated given just M, N and Inline graphic. In this paper, computation of any CI's using these equations, incorporating the underlying formulae specified by equations (1), (8), (9) and (12), was carried out using the software package Mathematica v5.0 [31]. Supporting Webpage 1 provides Mathematica code for computation of the CI's and can be viewed at http://rpubs.com/kkeenan02/Fung-Keenan-Mathematica/. However, other software packages such as MATLAB [32] and R [33] could also be used to implement the formulae; indeed, an R version of the code is provided on Supporting Webpage 2 and can be viewed at http://rpubs.com/kkeenan02/Fung-Keenan-R/. All source code for the composition of Supporting Webpages 1 and 2, including the raw Mathematica and R code used, can be accessed at https://github.com/kkeenan02/Fung-Keenan2013/.

Determining minimum sample sizes for accurate estimation of population allele frequencies

In studies of population genetics, it is desirable to obtain a sample allele frequency close to the population allele frequency with high probability [19], i.e. a short CI of high probability. Thus, under three representative scenarios, we calculate the minimum sample sizes required to obtain Inline graphic% CI's with lengths Inline graphic and Inline graphic across all possible values of the observed sample allele frequency, Inline graphic. These minimum sample sizes are denoted by Inline graphic and Inline graphic respectively. They are calculated by starting with Inline graphic, computing CI lengths for all possible values of Inline graphic and then taking the maximum value. This is repeated for increasing N in increments of 10 until the maximum CI length becomes Inline graphic. The N value with maximum CI length closest to 0.2 is then chosen, and then increased or decreased as necessary to find Inline graphic. Inline graphic is derived analogously. Maximum CI lengths of 0.2 and 0.1 represent small maximum absolute errors in the estimated allele frequency of 0.1 and 0.05 respectively, if the estimate is taken as the mid-point of the CI. The first scenario examined, Scenario 1, is sampling from a population of Inline graphic at HWE. Inline graphic is larger than or on the same order of magnitude as the upper bound of the range of size estimates for 15/24 (63%) species populations collated by Frankham et al. [25], covering mammals, birds, insects and plants. Thus, Inline graphic is taken to represent a large population. Later, M is varied over two orders of magnitude to consider populations that range from small to very large. Since there is a HWE, Inline graphic. Also, in the sampled population, the number of homozygotes with allele i, Inline graphic, must be an integer. This restricts the number of values that Inline graphic can take to 11 equally spaced values in the interval Inline graphic (starting from 0).

Scenarios 2 and 3 represent situations where HWE does not hold, such that Inline graphic. In Scenario 2, Inline graphic is assumed to take its minimum value, which is Inline graphic, whereas in Scenario 3, Inline graphic is assumed to take its maximum value, which is Inline graphic. In comparison with Scenario 1, Inline graphic can now take a larger number of values – it can take the 2,001 equally spaced values in the interval Inline graphic in Scenario 2 and the 1,001 equally spaced values in the interval Inline graphic in Scenario 3. Since the variance of Inline graphic is an increasing function of Inline graphic (equation (11)), CI's derived using Inline graphic are expected to increase in length with Inline graphic as well. This means that CI lengths and minimum sample sizes (Inline graphic and Inline graphic) derived for Scenarios 2 and 3 are expected to encompass the entire ranges of possible values.
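The counts of admissible allele frequencies quoted for the three scenarios (11, 2,001 and 1,001) can be reproduced with a short enumeration. The sketch below assumes a population size of M = 1,000, which is not legible in this text but is the value consistent with those counts. The function name and scenario labels are hypothetical.

```python
def feasible_copy_counts(M, scenario):
    """List copy numbers k (out of 2M gene copies) giving integer genotype counts.

    scenario: "hwe"      homozygote count M * p^2 = k^2 / (4M) must be an integer
              "min_hom"  homozygote count max(0, k - M) is always an integer
              "max_hom"  all carriers are homozygotes, so k must be even
    """
    feasible = []
    for k in range(2 * M + 1):
        if scenario == "hwe" and (k * k) % (4 * M) != 0:
            continue
        if scenario == "max_hom" and k % 2 != 0:
            continue
        if scenario not in ("hwe", "min_hom", "max_hom"):
            raise ValueError("unknown scenario")
        feasible.append(k)
    return feasible


M = 1000  # assumed population size
for name in ("hwe", "min_hom", "max_hom"):
    ks = feasible_copy_counts(M, name)
    print(name, len(ks), ks[1] / (2 * M))  # count and spacing of allowed frequencies
```

Under this assumption the allowed frequencies are spaced 0.1 apart at HWE, 0.0005 apart at minimum homozygosity and 0.001 apart at maximum homozygosity.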

In the three scenarios examined, Inline graphic is specified as a function of Inline graphic, such that equation (13) can be used to calculate a CI for Inline graphic given a value of Inline graphic. Calculation of this CI involves considering all possible values of Inline graphic and determining under which values Inline graphic falls within the corresponding acceptance region (as described above). An alternative scenario, Scenario 4, is that the relationship between Inline graphic and Inline graphic is unknown. In this case, equation (14a) has to be used to calculate a CI for Inline graphic given a value of Inline graphic, which involves considering all possible values of Inline graphic and Inline graphic. This results in a considerable increase in computation time; for example, when sampling Inline graphic individuals from a population of Inline graphic, 2,001 values of Inline graphic need to be considered but 1,002,001 combinations of Inline graphic and Inline graphic need to be considered, representing an increase in computational time by two orders of magnitude. However, for a given value of Inline graphic, the maximum value of Inline graphic gives the highest variance for Inline graphic and is thus expected to maximize the length of the acceptance region within which Inline graphic could fall. Thus, CI's for Inline graphic derived under Scenario 4 are expected to closely match those derived under the case of maximum homozygosity (Scenario 3). This can be verified by explicit calculation of the CI's – for example, given Inline graphic, Inline graphic and Inline graphic, the CI's derived under Scenario 4 and Scenario 3 are Inline graphic and Inline graphic, representing a difference in length of only 0.02. Similar results hold for Inline graphic and 5. 
Therefore, results from Scenario 3 are used to approximate those for Scenario 4, obviating the need for long computational runtimes to explore all possible combinations of Inline graphic and Inline graphic.

Lastly, from equation (11), the variance of Inline graphic is an increasing function of M. Thus, CI lengths for Inline graphic are expected to increase with M, which would result in increases in Inline graphic and Inline graphic. To test this explicitly, for a population at HWE and with Inline graphic, maximum lengths for the Inline graphic% CI for Inline graphic (across all values of Inline graphic) are calculated for Inline graphic, 250, 500, 750, 1,000, 2,500, 5,000, 7,500 and 10,000. This range corresponds to populations that are small to very large [25].
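The two monotonicity statements used in this section (variance increasing with homozygosity and with M) can be checked against a standard sampling-without-replacement calculation. The block below is a sketch in assumed notation (\(p_i\), \(P_{ii}\), \(\tilde{p}_i\) for the sample frequency, and \(c_k \in \{0,1,2\}\) for the number of copies carried by individual k); it is consistent with the description of equation (11) and its finite correction factor, but it is not a transcription of that equation.

```latex
% Assumed notation; a consistency check, not the authors' equation (11).
\begin{align}
\sigma_c^2 &= \underbrace{4P_{ii} + 2\,(p_i - P_{ii})}_{\text{mean of } c_k^2} - (2p_i)^2
            = 2\,\bigl(p_i + P_{ii} - 2p_i^2\bigr),\\
\operatorname{Var}(\tilde{p}_i) &= \frac{1}{(2N)^2}\; N\sigma_c^2\;\frac{M-N}{M-1}
            = \frac{p_i + P_{ii} - 2p_i^2}{2N}\cdot\frac{M-N}{M-1}.
\end{align}
```

The expression is linear and increasing in \(P_{ii}\), and the factor \((M-N)/(M-1) = 1 - (N-1)/(M-1)\) increases with M whenever N > 1. Setting \(P_{ii} = p_i^2\) recovers the familiar \(p_i(1-p_i)/(2N)\) multiplied by the correction factor.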

Application to an empirical data set for checkerspot butterflies

To demonstrate how the theory developed can be used in practice, it is applied to microsatellite data for samples from two populations of the checkerspot butterfly (Melitaea cinxia L.) occupying meadows on the Åland Islands in Finland [13]. Specifically, for the CINX1 locus, Inline graphic% CI's are calculated for the frequencies of alleles A, B and C for the Prästö and Finström populations, using the corresponding sample allele frequencies (Table 1 of Palo et al. [13]). For each population, a CI is not calculated for the population frequency of the fourth and final allele D, because this is fixed by the population frequencies of the first three alleles. To achieve consistency with the notation used in our study, henceforth, alleles A, B, C and D are referred to as alleles Inline graphic, Inline graphic, Inline graphic and Inline graphic respectively. The sample size for the Prästö population is Inline graphic, whereas that for the Finström population is Inline graphic [13]. Inline graphic and Inline graphic, Inline graphic, are used to denote the sample allele frequencies of Inline graphic in the Prästö and Finström populations respectively. Similarly, Inline graphic and Inline graphic are used to denote the population allele frequencies of Inline graphic in the two populations, respectively. The two populations consist of two and seven subpopulations respectively. Thus, technically, the two populations can be referred to as metapopulations, although this terminology is not used in our study for clarity. The subpopulations form part of a total of about 536 subpopulations on the Åland Islands, with an estimated total size ranging from 35,000 to at least 200,000 [13]. Thus, the size of each subpopulation is assumed to be Inline graphic, such that the Prästö and Finström populations are assumed to have a size of Inline graphic and Inline graphic respectively. 
The observed sample allele frequencies in the two populations, denoted by Inline graphic and Inline graphic respectively, are taken directly from [13]. These are used to calculate the observed number of copies of allele Inline graphic in each population, denoted by Inline graphic and Inline graphic respectively, using the formulae Inline graphic and Inline graphic. Due to rounding error in Inline graphic and Inline graphic values from [13], Inline graphic and Inline graphic had to be rounded to the nearest integer. Since the true homozygosity of each allele in each population is unknown [13], conservative CI's are calculated assuming Inline graphic takes its maximum value of Inline graphic (this is expected to maximize the lengths of the CI's, as explained in the previous section). In addition, Inline graphic is chosen, such that a Inline graphic% CI is derived for each of the three population allele frequencies in each of the two populations. This choice of Inline graphic is made because, for each population, using the Bonferroni Inequality [34], the cubic region defined by the three CI's can be taken as a Inline graphic% confidence region (CR) for the three population allele frequencies, where Inline graphic. The choice of Inline graphic allows Inline graphic% CR's to be derived for each set of three population allele frequencies in the two populations.
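The Bonferroni step can be written out explicitly. The sketch below uses assumed notation (\(\mathrm{CI}_i\) for the interval of allele i, each with coverage at least \(1-\alpha\)). The numerical value of α is hypothetical and chosen only to illustrate the arithmetic, since the value used in the study is not legible in this text.

```latex
% Assumed notation; alpha = 0.01 below is a hypothetical illustration.
\begin{align}
\Pr\Bigl(\,\bigcap_{i=1}^{3}\{p_i \in \mathrm{CI}_i\}\Bigr)
  &\ge 1 - \sum_{i=1}^{3}\Pr\bigl(p_i \notin \mathrm{CI}_i\bigr)
  \ge 1 - 3\alpha,\\
\alpha = 0.01 \;&\Longrightarrow\; 1 - 3\alpha = 0.97 .
\end{align}
```

The bound holds whatever the dependence between the three intervals, which is why it suits allele frequencies that are constrained to sum to one.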

To demonstrate how the theory developed in this paper can be used to derive CI's for genetic indicators that are a function of the population allele frequencies, the Inline graphic% CR's are used to calculate a Inline graphic% CI for Jost's D for the CINX1 locus and the Prästö and Finström butterfly populations. Jost's D is a measure of genetic distance between populations [9]. For the case of two populations, it is given by

graphic file with name pone.0085925.e317.jpg (15)

where, using the notation in this example,

graphic file with name pone.0085925.e318.jpg (16)

and

graphic file with name pone.0085925.e319.jpg (17)

A lower limit to a Inline graphic% CI for Inline graphic can be derived by minimizing the function in (15) given the constraints that the six population allele frequencies are contained within the two corresponding Inline graphic% CR's, and the two constraints Inline graphic and Inline graphic. This lower limit is denoted by Inline graphic. Similarly, the upper limit to a Inline graphic% CI for Inline graphic can be derived by maximizing the function in (15) under the same constraints. This is denoted by Inline graphic. In this study, the “Minimize” and “Maximize” functions in Mathematica v5.0 [31] are used to compute Inline graphic and Inline graphic, but corresponding functions may be used in other software packages, such as the “solnp” function in the “Rsolnp” R package. A Inline graphic% CI for Inline graphic can then be defined as the interval Inline graphic. Supporting Webpage 1 provides Mathematica code that can be used to calculate Inline graphic for the butterfly case study examined (http://rpubs.com/kkeenan02/Fung-Keenan-Mathematica/). Corresponding R code is presented on Supporting Webpage 2 (http://rpubs.com/kkeenan02/Fung-Keenan-R/).
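Since the images for equations (15) to (17) are not reproduced here, the following sketch uses the commonly cited two-population form of Jost's D, D = 2(H_T − H_S)/(1 − H_S), and assumes that this corresponds to the expressions referenced above. The function name and the example frequency vectors are hypothetical. The interval limits described in the text would then be obtained by minimizing and maximizing this function over the two confidence regions, a step not shown here.

```python
def jost_d(freqs_a, freqs_b):
    """Jost's D for two populations from their allele frequency vectors.

    H_S: mean within-population heterozygosity; H_T: heterozygosity of the
    pooled (equally weighted) population. Assumed standard definitions.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency vectors must list the same alleles")
    for freqs in (freqs_a, freqs_b):
        if abs(sum(freqs) - 1.0) > 1e-9:
            raise ValueError("allele frequencies must sum to 1")
    h_s = 1.0 - 0.5 * (sum(p * p for p in freqs_a) + sum(q * q for q in freqs_b))
    h_t = 1.0 - sum(((p + q) / 2.0) ** 2 for p, q in zip(freqs_a, freqs_b))
    return 2.0 * (h_t - h_s) / (1.0 - h_s)


# Hypothetical four-allele example (not the butterfly data)
print(jost_d([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]))
```

Identical frequency vectors give D = 0 and populations sharing no alleles give D = 1, which offers a quick sanity check before any optimization is attempted.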

Results

Maximum length of Inline graphic% confidence interval with increasing sample size

For Scenario 1, where the sampled diploid population of size Inline graphic was at HWE, the maximum length of the Inline graphic% CI for the population frequency of allele Inline graphic, Inline graphic, considering all possible observed values of the sample allele frequency, Inline graphic, decreased non-linearly with sample size N (Figure 1). The minimum N required to achieve a maximum length Inline graphic, Inline graphic, was 22, whereas the minimum N required to achieve a length Inline graphic, Inline graphic, was 49 (Figure 1).

Figure 1. Change in maximum length of ≥95% confidence interval with increasing sample size.


Graph showing how the maximum length of the Inline graphic% confidence interval (CI) for the population frequency of an allele Inline graphic (Inline graphic) changes with increasing sample size N, when sampling from a diploid population of size Inline graphic. For a given N, the maximum CI length was derived by calculating CI lengths for all possible values of the observed sample allele frequency and then taking the maximum length. The three curves correspond to three scenarios where the population (1) is at Hardy-Weinberg equilibrium (HWE), (2) attains its lowest homozygosity value with respect to Inline graphic, and (3) attains its highest homozygosity value with respect to Inline graphic. For each curve, the two filled circles represent the minimum N values required for the maximum CI length to reach values of Inline graphic and Inline graphic. For visual guidance, the two dashed horizontal lines mark maximum CI lengths of 0.2 and 0.1. The exact values of N tested are described in Methods, and equation (13) was used to calculate the CI lengths.

For given N and Inline graphic, the CI for Inline graphic consists of all possible values of Inline graphic for which Inline graphic lies in the corresponding acceptance region, as described in Methods. This acceptance region is expected to widen with increasing variance of Inline graphic, Inline graphic. At HWE, the frequency of homozygotes of allele Inline graphic, Inline graphic, is equal to Inline graphic; thus, according to equation (11), Inline graphic takes its highest values at intermediate values of Inline graphic. As a result, intermediate values of Inline graphic are likely to fall within the acceptance regions of more values of Inline graphic, such that the corresponding CI lengths are generally longer. Simulation results for Scenario 1 were broadly in agreement with these theoretical expectations (Figure 2).
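The construction of a CI from acceptance regions can be illustrated with a deliberately simplified analogue. The sketch below does not implement equation (9) or equation (13); instead, it assumes a univariate hypergeometric model in which gene copies are drawn without replacement from a finite pool, and it inverts an equal-tailed test. All function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, pool_size, copies, n_drawn):
    """P(X = x) copies of the allele when n_drawn gene copies are drawn
    without replacement from a pool containing `copies` copies of it."""
    if x < 0 or x > n_drawn or x > copies or n_drawn - x > pool_size - copies:
        return 0.0
    return (comb(copies, x) * comb(pool_size - copies, n_drawn - x)
            / comb(pool_size, n_drawn))


def ci_by_test_inversion(x_obs, pool_size, n_drawn, alpha=0.05):
    """Keep every candidate copy number whose acceptance region contains the
    observed count; return the smallest and largest as frequencies."""
    accepted = []
    for copies in range(pool_size + 1):
        lower_tail = sum(hypergeom_pmf(x, pool_size, copies, n_drawn)
                         for x in range(0, x_obs + 1))
        upper_tail = sum(hypergeom_pmf(x, pool_size, copies, n_drawn)
                         for x in range(x_obs, n_drawn + 1))
        # The observed count is in the acceptance region if neither tail
        # probability falls to alpha / 2 or below.
        if lower_tail > alpha / 2 and upper_tail > alpha / 2:
            accepted.append(copies)
    return accepted[0] / pool_size, accepted[-1] / pool_size


# Illustrative values: 10 copies observed among 20 drawn from a pool of 200.
low, high = ci_by_test_inversion(10, 200, 20)
```

Candidate frequencies with larger sampling variance have wider acceptance regions, so an observed value is retained by more candidates and the resulting interval is longer, which is the mechanism described in the paragraph above.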

Figure 2. Change in length of Inline graphic% confidence interval across different observed values, under Hardy-Weinberg equilibrium.


For a sample of size N taken from a population of size Inline graphic at Hardy-Weinberg equilibrium, graph showing the length of the Inline graphic% confidence interval (CI) for the population frequency of an allele Inline graphic (Inline graphic) across all possible observed values of the sample allele frequency (Inline graphic). The three curves correspond to Inline graphic, 30 and 50. Equation (13) was used to calculate the length of each CI.

In Scenario 2, where the population was no longer at HWE but attained its lowest Inline graphic, the maximum CI length also decreased non-linearly with increasing N (Figure 1). Compared with Scenario 1, Inline graphic was slightly larger and Inline graphic was larger by about a factor of two, taking values of 26 and 94 respectively (Figure 1). This is contrary to expectations that minimum homozygosity would give smaller Inline graphic and Inline graphic (see Methods). The reason is that although Inline graphic is smaller for given N and Inline graphic compared with the case of HWE, which would be expected to result in fewer Inline graphic values for which a given Inline graphic falls within the corresponding acceptance regions and hence a shorter CI, there are more possible values of Inline graphic (2,001 compared with 11 – see Methods), which increases the number of Inline graphic values for which Inline graphic falls within the corresponding acceptance regions. For Scenario 2, Inline graphic, such that Inline graphic attains its highest values at Inline graphic values close to 0.25 and 0.75 (equation (11)). Thus, for given N, values of Inline graphic near 0.25 and 0.75 are expected to generally exhibit the longest CI lengths. Simulation results for Scenario 2 were broadly in agreement with these theoretical expectations (Figure 3).

Figure 3. Change in length of Inline graphic% confidence interval across different observed values, under minimum homozygosity.


For a sample of size N taken from a population of size Inline graphic with the minimum homozygosity possible for an allele Inline graphic, graph showing the length of the Inline graphic% confidence interval (CI) for the population frequency of Inline graphic (Inline graphic) across all possible observed values of the sample allele frequency (Inline graphic). The three curves correspond to Inline graphic, 30 and 100. Equation (13) was used to calculate the length of each CI.

For the last scenario, Scenario 3, the population attained its highest Inline graphic, representing the opposite extreme to Scenario 2. As for the previous two scenarios, the maximum CI length decreased non-linearly with increasing N (Figure 1), but this time, Inline graphic and Inline graphic took values of 94 and 285 respectively. These values were approximately four and six times as large as the corresponding values in Scenario 1 (Figure 1). In Scenario 3, Inline graphic, and equation (11) shows that intermediate values of Inline graphic give the highest values of Inline graphic. Therefore, as in Scenario 1, intermediate values of Inline graphic are expected to generally exhibit the longest CI lengths for given N. Again, simulation results were consistent with these expectations (Figure 4).

Figure 4. Change in length of Inline graphic% confidence interval across different observed values, under maximum homozygosity.


For a sample of size N taken from a population of size Inline graphic with the maximum homozygosity possible for an allele Inline graphic, graph showing the length of the Inline graphic% confidence interval (CI) for the population frequency of Inline graphic (Inline graphic) across all possible observed values of the sample allele frequency (Inline graphic). The four curves correspond to Inline graphic, 30, 100 and 200. Equation (13) was used to calculate the length of each CI.
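The locations of the variance maxima discussed for the three scenarios can be checked numerically with a short sketch. Equation (11) is not reproduced in this section, so the variance is assumed here to take the standard form for sampling N diploid individuals without replacement from a population of M, namely (M − N)/(M − 1) × (p + P − 2p²)/(2N), where p is the population allele frequency and P the population frequency of homozygotes. The homozygote frequencies assumed for the three scenarios are P = p² (HWE), P = max(0, 2p − 1) (minimum) and P = p (maximum). The values N = 30 and M = 1,000 are illustrative.

```python
def variance_sample_freq(p, hom, n_sample, m_pop):
    """Assumed form of the variance of the sample allele frequency."""
    fpc = (m_pop - n_sample) / (m_pop - 1)  # finite population correction
    return fpc * (p + hom - 2 * p * p) / (2 * n_sample)


# Homozygote frequency as a function of p under each scenario (assumed forms).
HOMOZYGOSITY = {
    "hwe": lambda p: p * p,
    "min": lambda p: max(0.0, 2 * p - 1),
    "max": lambda p: p,
}


def variance_profile(scenario, n_sample, m_pop, steps=100):
    """Variance evaluated on an evenly spaced grid of allele frequencies."""
    hom = HOMOZYGOSITY[scenario]
    freqs = [i / steps for i in range(steps + 1)]
    return [(p, variance_sample_freq(p, hom(p), n_sample, m_pop)) for p in freqs]


def peak_frequencies(profile, tol=1e-12):
    """Allele frequencies at which the variance attains its maximum."""
    top = max(v for _, v in profile)
    return [p for p, v in profile if top - v <= tol]


peaks = {name: peak_frequencies(variance_profile(name, 30, 1000))
         for name in HOMOZYGOSITY}
```

Under these assumptions, the variance peaks at p = 0.5 for the HWE and maximum-homozygosity scenarios and at p = 0.25 and 0.75 for the minimum-homozygosity scenario, and at p = 0.5 the maximum-homozygosity variance is twice the HWE variance.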

Maximum length of Inline graphic% confidence interval with increasing population size

When taking a sample of size Inline graphic from a diploid population at HWE, the maximum length of the Inline graphic% CI for Inline graphic, across all Inline graphic, increased with population size M (Figure 5). As M was increased from a small value of 100 to 1,000, the maximum CI length remained the same at 0.2 (this is possible because there is nothing to prevent different values of M giving the same maximum CI length, using equation (13)). The maximum CI length increased when M was increased from 1,000 to a large value of 2,500, but only by 0.04. Thereafter, the length remained constant up to a very large value of Inline graphic, and only increased by a small amount of 0.02 with a further increase in M to 10,000. Thus, simulation results confirm the theoretical expectation that CI length increases with M (as explained in Methods – this expectation arises because Inline graphic increases with M, according to equation (11)). However, the increase in the CI length was modest as M was increased over two orders of magnitude.

Figure 5. Change in maximum length of Inline graphic% confidence interval with increasing population size.


Graph showing how the maximum length of the Inline graphic% confidence interval (CI) for the population frequency of an allele Inline graphic (Inline graphic) changes with increasing population size M, when taking samples of size Inline graphic. The population is at Hardy-Weinberg equilibrium. For a given M, the maximum CI length was derived by calculating CI lengths for all possible values of the observed sample allele frequency and then taking the maximum length. M values of 100, 250, 500, 750, 1,000, 2,500, 5,000, 7,500 and 10,000 were tested, as indicated by the filled circles.
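The modest effect of population size can be made plausible with a small calculation. The sketch below assumes that M enters the variance of the sample allele frequency only through the standard finite population correction (M − N)/(M − 1); equation (11) is not reproduced in this section, so this form is an assumption. The sample size N = 30 is hypothetical, while the M values are those listed in the caption of Figure 5.

```python
from math import sqrt


def finite_population_correction(n_sample, m_pop):
    """Factor by which the variance is reduced relative to an infinite population."""
    return (m_pop - n_sample) / (m_pop - 1)


def relative_sd(n_sample, m_values):
    """Standard deviation of the sample allele frequency for each M,
    expressed relative to the infinite-population value."""
    return {m: sqrt(finite_population_correction(n_sample, m)) for m in m_values}


M_VALUES = [100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]
ratios = relative_sd(30, M_VALUES)  # N = 30 is a hypothetical sample size
```

For this hypothetical N, the relative standard deviation rises from about 0.84 at M = 100 to just under 1 at M = 10,000, which is consistent in direction and magnitude with the modest increase in CI length reported above.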

Inline graphic% confidence interval for Jost's D between two checkerspot butterfly populations

For the Prästö checkerspot butterfly population, the Inline graphic% CI's for the population frequencies for three of the four alleles at the CINX1 locus were computed using equation (13) and data from Palo et al. [13], as described in Methods. The three alleles are denoted by Inline graphic, Inline graphic and Inline graphic respectively, with the population frequencies denoted by Inline graphic, Inline graphic and Inline graphic respectively. The three corresponding Inline graphic% CI's derived were Inline graphic, Inline graphic and Inline graphic respectively. Using the same methodology, for the Finström population, the Inline graphic% CI's for the population frequencies of Inline graphic, Inline graphic and Inline graphic were calculated as Inline graphic, Inline graphic and Inline graphic respectively.

The three Inline graphic% CI's for the Prästö population were used to form a cubic region, which corresponds to a Inline graphic% CR for Inline graphic, Inline graphic and Inline graphic [34]. In the same way, the three Inline graphic% CI's for the Finström population were used to form a Inline graphic% CR for Inline graphic, Inline graphic and Inline graphic. Within these two CR's, the maximum and minimum values of Inline graphic (Inline graphic is given by equation (15)) were calculated and used to derive a Inline graphic% CI for Inline graphic, as described in Methods. This CI for Inline graphic was derived as Inline graphic. If Inline graphic had been calculated simply using the sample allele frequencies and equation (15), then only one value would have been obtained: 0.043.

Discussion

In scientific studies that use sample data to estimate unknown population parameters, the sampling uncertainty in the estimates needs to be quantified in order to make reliable inferences on population processes captured by the parameters. This forms an essential part of scientific hypothesis-testing. Therefore, in studies of population genetics, it is essential to quantify the sampling uncertainty of key population parameters used to infer past and present evolutionary processes. These include allele frequencies, which are often used to quantify genetic variation among populations, thereby allowing hypotheses on processes driving this variation to be tested (e.g., [1], [2], [3], [4], [5]). However, many studies do not include sampling uncertainty for allele frequencies, instead presenting and/or using single point estimates based on one sample per population (e.g., [11], [12], [13], [14], [15], [16], [17], [18]). Thus, it is not possible to assess the accuracy of any inferences from these studies. This hinders not only the advance of scientific knowledge but also decision-making based on this knowledge, such as for sustainable management and conservation of natural resources. In this context, the work presented in this paper is valuable in that it provides a method of quantifying sampling uncertainty in allele frequencies for diploid populations, in the form of confidence intervals (CI's) containing true values with probability equal to or greater than a desired threshold.

The method presented pertains to the general case of a locus with n alleles, with a sample of size N taken from a population of size M and any degree of homozygosity with respect to the n alleles. In this case, the method allows construction of a CI for the population frequency of each allele; these CI's can then be combined to create a joint confidence region (CR) for all the population allele frequencies at a given locus. It is noted that if more than one locus is considered simultaneously, then a joint CR for all population allele frequencies at all loci can be calculated by combining CI's in an analogous way. For the subcase of an infinite population size (Inline graphic) and a large sample size of Inline graphic, Weir [10] proposed an approximate Inline graphic% CI for the population allele frequency of an allele Inline graphic, Inline graphic. This is Inline graphic, where Inline graphic is the sample allele frequency; Inline graphic satisfies Inline graphic, with Inline graphic being the cumulative distribution function (cdf) of the standard Normal distribution; and Inline graphic is an estimate of the standard deviation of Inline graphic using sample data. Inline graphic is specified by equation (11) with Inline graphic replacing Inline graphic and Inline graphic, the sample frequency of homozygotes with allele Inline graphic, replacing Inline graphic, the corresponding population frequency of homozygotes. However, the accuracy of this CI depends both on how close Inline graphic is to the true standard deviation Inline graphic (specified by equation (11)) and how close the cdf of Inline graphic is to a Normal distribution with the same mean and variance. This accuracy has not been quantified [10] and thus, it is not known whether the CI actually contains Inline graphic with a probability of at least Inline graphic, rendering its use problematic. In this study, we have rectified this problem by constructing a CI for Inline graphic with probability coverage of at least Inline graphic, for the more general case where the population size can take any value larger than or equal to the sample size.
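For comparison, the approximate Normal-theory interval attributed to Weir [10] above can be sketched as follows. The sketch assumes that the estimated variance for an infinite population is (p̂ + P̂ − 2p̂²)/(2N), where p̂ is the sample allele frequency and P̂ the sample frequency of homozygotes, and that the critical value z satisfies Φ(z) = 1 − α/2. The truncation of the limits to the range from 0 to 1 is an addition made here for convenience, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def weir_approx_ci(p_hat, hom_hat, n_sample, alpha=0.05):
    """Approximate (1 - alpha) CI: p_hat +/- z * estimated standard deviation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard Normal critical value
    sd = sqrt((p_hat + hom_hat - 2 * p_hat ** 2) / (2 * n_sample))
    # Truncate to the valid range of a frequency (an addition in this sketch).
    return max(0.0, p_hat - z * sd), min(1.0, p_hat + z * sd)


# Illustrative values: p_hat = 0.5 with Hardy-Weinberg proportions, N = 50.
lower, upper = weir_approx_ci(0.5, 0.25, 50)
```

With these illustrative inputs the interval is approximately (0.402, 0.598). As noted above, the actual coverage of such an interval has not been quantified, so this sketch is a point of comparison and not a substitute for the method derived in this paper.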

The method we derived was used to show that the sampling uncertainty in Inline graphic, measured as the maximum length of the Inline graphic% CI for Inline graphic across all possible values of the observed sample allele frequency Inline graphic, decreased non-linearly with N when sampling from a large population (Inline graphic) under three archetypal scenarios. These three scenarios represent the cases where the population (1) is at Hardy-Weinberg equilibrium (HWE), (2) has the lowest value of Inline graphic and (3) has the highest value of Inline graphic. As expected from theory, for any given N, the maximum CI lengths for Scenario 3 were always greater than corresponding lengths in Scenarios 1 and 2. However, the maximum CI length for Scenario 2 was unexpectedly greater than that in Scenario 1 for some values of N, reflecting the greater number of possible values Inline graphic can take in a finite population with minimum homozygosity compared to one at HWE. This illustrates how the finite size of a population can give opposite trends to those obtained under an assumption of infinite size. According to theory and simulations, Scenario 3 gives CI lengths that closely approximate those in the case where Inline graphic is unknown (see Methods). Thus, if Inline graphic is unknown, CI lengths derived under Scenario 3 should be used. This was the approach used in the application of our method to sample data for two butterfly populations, discussed further below. On the other hand, if there is evidence that the population is at HWE, then the shorter CI's derived under Scenario 1 could be used.

Under the three scenarios examined, the non-linear decreases in sampling uncertainty with increasing N are consistent with results from the simulation study of Hale et al. [19], who found that the average difference between Inline graphic and Inline graphic also exhibited non-linear decreases with N. However, Hale et al. [19] did not use their simulation results to quantify sampling uncertainty for the realistic situation where only one sample is taken from a population; this situation was considered in our study. Furthermore, as mentioned in the Introduction, for given N, they only used 100 samples to numerically construct the distribution for Inline graphic, resulting in an incomplete distribution that may not closely reflect the true distribution. This highlights a weakness of a simulation-based approach without a rigorous mathematical underpinning, which is present in our approach. Hale et al. [19] concluded that N = 25–30 is sufficient to give accurate estimates of Inline graphic, but this conclusion has to be interpreted in light of the limitations identified. Our results refine this conclusion by showing that across the three scenarios examined, N = 49–285 is required to ensure that, with a high probability of Inline graphic, an estimate for Inline graphic can be derived from any one sample that is within 0.05 of the true value; this corresponds to a CI of length Inline graphic. To ensure that the estimate is within 0.1 rather than 0.05, corresponding to a CI of length Inline graphic, our results show that N = 22–94 is required. Thus, Inline graphic is not guaranteed to give “accurate” estimates of Inline graphic under all or most scenarios, and N values up to 10 times larger could be required. Decreasing the population size M from 1,000 would help to decrease sampling uncertainty, but results showed that decreasing M over two orders of magnitude from 10,000 to 100 only resulted in modest decreases in the maximum CI length of Inline graphic, with no decrease when M was decreased from 1,000 to 100. Thus, the overall conclusion is that Inline graphic is often insufficient to guarantee accurate estimates of Inline graphic, in the sense that Inline graphic is within 0.05 or 0.1 of the estimate. Considering that alleles at highly polymorphic loci, such as microsatellites, often occur at population frequencies of Inline graphic [35], it might be desirable to derive CI's for Inline graphic that are of length Inline graphic. Thus, sample sizes even larger than the values found in our simulations might be required under some circumstances.

The application of our method to empirical data for two populations of the checkerspot butterfly [13] demonstrated how the underlying theory can be applied to construct joint Inline graphic% CR's for the population frequencies of multiple alleles at a single locus. These CR's were then used to construct a Inline graphic% CI for Jost's D, which measures genetic differentiation between the two populations. This illustrates how our method can be used to quantify sampling uncertainty in genetic indicators that are a function of population allele frequencies, thus facilitating hypothesis-testing and also risk-based natural resource management. In the example considered, the single point estimate of Jost's D using the sample allele frequencies was about one quarter of the upper bound of its Inline graphic% CI. Thus, use of the single point estimate without accounting for sampling uncertainty could lead to misleading conclusions. The effectiveness of any management measures based on such conclusions would be compromised, hindering the achievement of objectives related to conservation or sustainable use. Our example therefore highlights the practical utility of the method that we have derived.

In conclusion, we have presented a rigorous mathematical method for quantifying sampling uncertainty in estimates of population allele frequencies, for a general case that has hitherto not been analyzed. In addition, we have demonstrated its practical application in informing sampling design and determining uncertainty in genetic indicators. Thus, the method derived advances both theory and practice, with broad implications for a range of disciplines, including conservation genetics, evolutionary genetics, genetic epidemiology, genome-wide association studies (GWAS), forensics and medical genetics. In particular, the method provides exact answers to the question of how many individuals need to be sampled from a population in order to achieve a given level of accuracy in estimates of population allele frequencies. This is a question that has rarely been studied before [19], despite its important practical implications. Previous studies have derived sample sizes required to sample, with high probability, at least one copy of all alleles at a locus above a given frequency [35], [36], [37], but these do not correspond to sample sizes required to achieve accurate estimates of population allele frequencies. Derivation of the latter requires explicit quantification of sampling uncertainty, as we have done in this study.

Possible future extensions

The CI's and CR's constructed using our method are conservative in the sense that they contain the true values with a probability equal to or greater than a desired threshold. This conservative property is useful in hypothesis-testing if there is a need to decrease the probability of obtaining a false positive below a certain threshold. However, researchers would ideally like to construct CI's and CR's containing true values with a known probability, not just with probability at or above a known threshold. Therefore, future research could attempt to tighten the intervals and regions that we have derived, ideally until they cover a known probability. For example, Cai and Krishnamoorthy [23] devised a method (using their “Combined Test” approach) that was shown to give shorter CI's for the probability parameters of both the binomial and univariate hypergeometric distributions, when compared with a hypothesis-testing approach analogous to the one used in this paper. If their method could be extended to the population allele frequency parameter for the more complex distribution used in this paper (equation (9)), based on a multivariate hypergeometric distribution (equation (1)), then this might result in tighter CI's for population allele frequencies when sampling from a finite diploid population.

In addition, the CI's derived in our paper were designed to quantify the uncertainty in population allele frequencies that arises from taking a random sample of a finite diploid population, which can exhibit any degree of relatedness. The “random” refers to equal probability of choosing individuals that may be of any type, which does not imply that once an individual of a particular type has been sampled, the next individual sampled is equally likely to belong to any of the types (i.e., does not imply individuals in the population or sample are unrelated). Relatedness among individuals in the population is implicitly included within the population frequencies of the different genotypes, and is thus accounted for when calculating the sampling distribution for the population frequency of an allele Inline graphic, as specified by equation (1). However, this representation does not give an explicit quantification of the degree of relatedness among individuals in the population, for example using kinship coefficients. DeGiorgio and Rosenberg [38] used such coefficients to derive an unbiased estimator of heterozygosity (Inline graphic, for a locus with n alleles) in the case of sampling from a population of diploid individuals that could be related, and DeGiorgio et al. [39] extended these results to the case of individuals with arbitrary ploidy. In their calculations, the number of copies of allele Inline graphic in each sampled individual k was treated as a random variable and the covariances of these variables were then related to the kinship coefficients. This is different to calculations in our paper, where the number of sampled individuals with alleles Inline graphic and Inline graphic was treated as a random variable (see Methods). The method in this paper might be revised by considering the number of copies of allele Inline graphic for each sampled individual instead, following [38], [39]. This could allow explicit quantification of relatedness in the context of deriving CI's for the population allele frequencies.

Supporting Information

File S1

Derivation of the variance of the sample allele frequency.

(PDF)

Acknowledgments

We would like to thank two anonymous reviewers for helpful and insightful comments, which have led to great improvements in this manuscript.

Funding Statement

TF is being supported by the National University of Singapore start-up grant WBS R-154-000-551-133. In addition, TF acknowledges past funding support by a Ph.D. studentship from the Beaufort Marine Research Award in Ecosystem Approach to Fisheries Management, carried out under the Sea Change Strategy and the Strategy for Science Technology and Innovation (2006–2013), with the support of the Marine Institute, funded under the Marine Research Sub-Programme of the Irish National Development Plan 2007–2013 (http://www.marine.ie/home/research/MIFunded/BeaufortAwards/). KK is being supported by a Ph.D. studentship from the Beaufort Marine Research Award in Fish Population Genetics, funded by the Irish Government under the Sea Change Strategy as above. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bowcock AM, Kidd JR, Mountain JL, Hebert JM, Carotenuto L, et al. (1991) Drift, admixture, and selection in human evolution: A study with DNA polymorphisms. Proc Natl Acad Sci USA 88: 839–843.
  • 2. Akey JM, Zhang G, Zhang K, Jin L, Shriver MD (2002) Interrogating a high-density SNP map for signatures of natural selection. Genome Res 12: 1805–1814.
  • 3. Luikart G, England PR, Tallmon D, Jordan S, Taberlet P (2003) The power and promise of population genomics: from genotyping to genome typing. Nat Rev Genet 4: 981–994.
  • 4. de Kovel CGF (2006) The power of allele frequency comparisons to detect the footprint of selection in natural and experimental situations. Genet Sel Evol 38: 3–23.
  • 5. Friedlander SM, Herrmann AL, Lowry DP, Mepham ER, Lek M, et al. (2013) ACTN3 allele frequency in humans covaries with global latitudinal gradient. PLoS ONE 8(1): e52282.
  • 6. Nei M (1972) Genetic distance between populations. Am Nat 106: 283–292.
  • 7. Nei M, Chesser RK (1983) Estimation of fixation indices and gene diversities. Ann Hum Genet 47: 253–259.
  • 8. Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370.
  • 9. Jost L (2008) GST and its relatives do not measure differentiation. Mol Ecol 17: 4015–4026.
  • 10. Weir BS (1996) Genetic data analysis II. Sunderland, USA, Sinauer Associates. 445 p.
  • 11. Eanes WF, Koehn RK (1978) An analysis of genetic structure in the Monarch butterfly, Danaus plexippus L. Evolution 32: 784–797.
  • 12. Forbes SH, Hogg JT, Buchanan FC, Crawford AM, Allendorf FW (1995) Microsatellite evolution in congeneric mammals: domestic and bighorn sheep. Mol Biol Evol 12: 1106–1113.
  • 13. Palo J, Varvio S-L, Hanski I, Väinölä R (1995) Developing microsatellite markers for insect population structure: complex variation in a checkerspot butterfly. Hereditas 123: 295–300.
  • 14. Luikart G, Allendorf FW, Cornuet J-M, Sherwin WB (1998) Distortion of allele frequency distributions provides a test for recent population bottlenecks. J Hered 89: 238–247.
  • 15. Rank NE, Dahlhoff EP (2002) Allele frequency shifts in response to climate change and physiological consequences of allozyme variation in a montane insect. Evolution 56: 2278–2289.
  • 16. Hugnet C, Bentjen SA, Mealey KL (2004) Frequency of the mutant MDR1 allele associated with multidrug sensitivity in a sample of collies from France. J Vet Pharmacol Ther 27: 227–229.
  • 17. Seider T, Fimmers R, Betz P, Lederer T (2010) Allele frequencies of the five miniSTR loci D1S1656, D2S441, D10S1248, D12S391 and D22S1045 in a German population sample. Forensic Sci Int Genet 4: e159–e160.
  • 18. Petrejčíková E, Soták M, Bernasovská J, Bernasovský I, Rębała K, et al. (2011) Allele frequencies and population data for 11 Y-chromosome STRs in samples from Eastern Slovakia. Forensic Sci Int Genet 5: e53–e62.
  • 19. Hale ML, Burg TM, Steeves TE (2012) Sampling for microsatellite-based population genetic studies: 25 to 30 individuals per population is enough to accurately estimate allele frequencies. PLoS ONE 7(9): e45170.
  • 20. Gillespie JH (2004) Population genetics: a concise guide, second edition. Baltimore, USA, The Johns Hopkins University Press. 217 p.
  • 21. Laplace PS (1812) Théorie analytique des probabilités. Paris, France, Courcier. 464 p.
  • 22. Agresti A, Coull BA (1998) Approximate is better than “exact” for interval estimation of binomial proportions. Am Stat 52: 119–126.
  • 23. Cai Y, Krishnamoorthy K (2005) A simple improved inferential method for some discrete distributions. Comput Stat Data Anal 48: 605–621.
  • 24. Clopper CJ, Pearson ES (1934) The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26: 404–413.
  • 25. Frankham R (1996) Relationship of genetic variation to population size in wildlife. Conserv Biol 10: 1500–1508.
  • 26. Green DM (2003) The ecology of extinction: population fluctuation and decline in amphibians. Biol Conserv 111: 331–343.
  • 27. Vredenberg VT, Knapp RA, Tunstall TS, Briggs CJ (2010) Dynamics of an emerging disease drive large-scale amphibian population extinctions. Proc Natl Acad Sci USA 107: 9689–9694.
  • 28. Johnson NL, Kotz S, Balakrishnan N (1997) Discrete multivariate distributions. Hoboken, USA, Wiley-Blackwell. 328 p.
  • 29. Cox DR, Hinkley DV (1974) Theoretical statistics. London, UK, Chapman and Hall. 511 p.
  • 30. Talens E (2005) Statistical auditing and the AOQL-Method. Ridderkerk, The Netherlands, Labyrint Publications. 164 p.
  • 31. Wolfram Research Inc. (2003) Mathematica Edition: Version 5.0. Champaign, Illinois, USA, Wolfram Research Inc.
  • 32. The MathWorks Inc. (2012) MATLAB version 8.0. Natick, Massachusetts, USA, The MathWorks Inc.
  • 33. R Development Core Team (2010) R: a language and environment for statistical computing. Vienna, Austria, R Foundation for Statistical Computing. 1731 p.
  • 34. Rice JA (1995) Mathematical statistics and data analysis, second edition. Belmont, USA, Duxbury Press. 651 p.
  • 35. Chakraborty R (1992) Sample size requirements for addressing the population genetic issues of forensic use of DNA typing. Hum Biol 64: 141–159.
  • 36. Gregorius H-R (1980) The probability of losing an allele when diploid genotypes are sampled. Biometrics 36: 643–652.
  • 37. Sjögren P, Wyöni P-I (1994) Conservation genetics and detection of rare alleles in finite populations. Conserv Biol 8: 267–270.
  • 38. DeGiorgio M, Rosenberg NA (2009) An unbiased estimator of gene diversity in samples containing related individuals. Mol Biol Evol 26: 501–512.
  • 39. DeGiorgio M, Jankovic I, Rosenberg NA (2010) Unbiased estimation of gene diversity in samples containing related individuals: exact variance and arbitrary ploidy. Genetics 186: 1367–1387.


