PLOS ONE. 2011 Jun 28;6(6):e21105. doi: 10.1371/journal.pone.0021105

Extrapolation of Urn Models via Poissonization: Accurate Measurements of the Microbial Unknown

Manuel E Lladser 1,*, Raúl Gouet 2, Jens Reeder 3
Editor: Dongxiao Zhu 4
PMCID: PMC3125174  PMID: 21738613

Abstract

The availability of high-throughput parallel methods for sequencing microbial communities is increasing our knowledge of the microbial world at an unprecedented rate. Though most attention has focused on determining lower-bounds on the α-diversity, i.e. the total number of different species present in the environment, tight bounds on this quantity may be highly uncertain because a small fraction of the environment could be composed of a vast number of different species. To better assess what remains unknown, we propose instead to predict the fraction of the environment that belongs to unsampled classes. Modeling samples as draws with replacement of colored balls from an urn with an unknown composition, and under the sole assumption that there are still undiscovered species, we show that conditionally unbiased predictors and exact prediction intervals (of constant length in logarithmic scale) are possible for the fraction of the environment that belongs to unsampled classes. Our predictions are based on a Poissonization argument, which we have implemented in what we call the Embedding algorithm. For fixed, i.e. non-randomized, sample sizes, the algorithm leads to very accurate predictions on a sub-sample of the original sample. We quantify the effect of fixed sample sizes on our prediction intervals and test our methods and others found in the literature against simulated environments, which we devise taking into account datasets from a human-gut and -hand microbiota. Our methodology applies to any dataset that can be conceptualized as a sample with replacement from an urn. It could be applied, for example, to quantify the proportion of all the unseen solutions to a binding site problem in a random RNA pool, or to reassess the surveillance of a certain terrorist group, predicting the conditional probability that it deploys a new tactic in a next attack.

Introduction

A fundamental problem in microbial ecology is the “rare biosphere” [1], i.e. the vast number of low-abundance species in any sample. However, because most species in a given sample are rare, estimating their total number, i.e. the α-diversity, is a difficult task [2], [3], and of dubious utility [4], [5]. Although parametric and non-parametric methods for species estimation show some promise [6], [7], microbial communities may not yet have been sufficiently deeply sampled [8] to test the suitability of the models or fit their parameters. For instance, human-skin communities demonstrate an unprecedented diversity within and across skin locations of the same individuals, with marked differences between specimens [9].

In an environment composed of an unknown number of species, let Inline graphic be the proportion in which a certain species Inline graphic occurs. Samples from microbial communities may be conceptualized as sampling, with replacement, different colored balls from an urn. The urn represents the environment where samples are taken: soil, gut, skin, etc. The balls represent the different members of the microbial community, and each color is a uniquely defined operational taxonomic unit.

In the non-parametric setting, the urn is composed of an unknown number of colors occurring in unknown relative proportions. In this setting, the α-diversity of the urn [10] corresponds to the cardinality of the set Inline graphic. Although various lower-confidence bounds for this parameter have been proposed in the literature [11][14], tight lower-bounds on the α-diversity are difficult in the non-parametric setting because a small fraction of the urn could be composed of a vast number of different colors [15]. Motivated by this, we shift our interest to predicting instead the fraction of balls with a color unrepresented in the first Inline graphic observations from the urn. This is the unobservable random variable:

graphic file with name pone.0021105.e009.jpg

where Inline graphic denote the sequence of colors observed when sampling Inline graphic balls from the urn. Notice how Inline graphic depends both on the specific colors observed in the sample, and the unknown proportions of these colors in the urn. This quantity is very useful to assess what remains unknown in the urn. For instance, the probability of discovering a new color with one additional observation is precisely Inline graphic, and the mean number of additional observations to discover a new color is Inline graphic. We note that Inline graphic corresponds to what is called the conditional coverage of a sample of size Inline graphic in the literature. For this reason, we refer to Inline graphic as the conditional uncovered probability of the sample.
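As a concrete illustration of the conditional uncovered probability, the following minimal Python sketch computes it for a toy urn whose composition is known to the simulator. The urn, its color names and the helper `uncovered_probability` are hypothetical and serve only to make the definition tangible; in a real environment the proportions are unknown, which is why this quantity has to be predicted rather than computed.

```python
import random

def uncovered_probability(proportions, sample):
    """Fraction of the urn whose colors do not appear in `sample`.

    proportions: dict mapping each color to its proportion in the urn.
    sample:      iterable with the colors observed so far.
    """
    seen = set(sample)
    return sum(p for color, p in proportions.items() if color not in seen)

# Hypothetical urn: one dominant color and fifty equally rare ones.
urn = {"dominant": 0.5}
urn.update({f"rare{i}": 0.5 / 50 for i in range(50)})

rng = random.Random(0)
colors, weights = zip(*urn.items())
sample = rng.choices(colors, weights=weights, k=40)  # draws with replacement

u = uncovered_probability(urn, sample)
print(f"uncovered probability after 40 draws: {u:.3f}")
if u > 0:
    # The waiting time for a new color is geometric with success probability u.
    print(f"mean additional draws to see a new color: {1 / u:.1f}")
```

The last lines mirror the remark in the text: the chance that the next draw reveals a new color equals the uncovered probability, so the mean waiting time is its reciprocal.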

The expected value of Inline graphic is given by:

graphic file with name pone.0021105.e019.jpg

Unlike the conditional uncovered probability of the sample, Inline graphic is a parameter that depends on the unknown urn composition but not on the specific colors observed in the sample. Interest in the above quantities or related ones has ranged from estimating the probability distribution of the keys used in the Kenngruppenbuch (the Enigma cipher book) in World War II [16], to assessing the confidence that an iterative procedure with a random start has found the global maximum of a given function [17], to predicting the probability of discovering a new gene by sequencing additional clones from a cDNA library [18]. We note that Inline graphic is called the expected coverage of the sample in the literature.
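The expected value above can be checked numerically. A color with proportion p is missed by all n independent draws with probability (1 - p)^n, so the expected uncovered probability is the sum of p(1 - p)^n over the colors of the urn. The sketch below, written for a hypothetical four-color urn, compares this closed form with a Monte Carlo average; the function names are illustrative and not taken from the paper.

```python
import random

def expected_uncovered(proportions, n):
    """Sum over colors of p * (1 - p)**n, i.e. the mass of colors missed by n draws."""
    return sum(p * (1.0 - p) ** n for p in proportions)

def simulated_uncovered(proportions, n, trials, rng):
    """Monte Carlo average of the uncovered fraction after n draws with replacement."""
    colors = range(len(proportions))
    total = 0.0
    for _ in range(trials):
        seen = set(rng.choices(colors, weights=proportions, k=n))
        total += sum(p for c, p in enumerate(proportions) if c not in seen)
    return total / trials

proportions = [0.4, 0.3, 0.2, 0.1]  # hypothetical urn composition
exact = expected_uncovered(proportions, n=5)
approx = simulated_uncovered(proportions, n=5, trials=20000, rng=random.Random(0))
print(f"closed form: {exact:.4f}, simulated: {approx:.4f}")
```

Unlike the conditional quantity, this value does not depend on which colors happened to be drawn, only on the urn composition and the sample size.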

Various predictors of Inline graphic and estimators of Inline graphic have been proposed in the literature. These are mostly based on a user-defined parameter Inline graphic and the statistics Inline graphic, Inline graphic; defined as the number of colors observed Inline graphic-times, when Inline graphic additional balls are sampled from the urn.

Turing and Good [19] proposed to estimate Inline graphic using the biased statistic Inline graphic. Later, Robbins [20] proposed to predict Inline graphic using

graphic file with name pone.0021105.e032.jpg (1)

which he showed to be unbiased for Inline graphic and to satisfy the inequality Inline graphic. Despite the possibly small quadratic variation distance between Inline graphic and Robbins' estimator, and as illustrated by the plots on the left side of Fig. 1, when using Robbins' estimator to predict Inline graphic sequentially with Inline graphic (to assess the quality of the predictions at various depths in the sample), we observe that unusually small or large values of Inline graphic may offset subsequent predictions of Inline graphic. In fact, as seen on the right-hand plots of the same figure, an offset prediction is usually followed by another offset prediction of the same order of magnitude, even Inline graphic observations later (correlation coefficient of green clouds, Inline graphic and Inline graphic on top- and bottom-right plots).
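For readers who want to experiment with these classical estimators, here is a minimal sketch. It assumes the forms in which they are commonly stated in the literature: the Turing-Good statistic is the number of singletons in a sample of size n divided by n, and Robbins' predictor is the number of singletons in the sample enlarged by one extra draw, divided by n + 1. Since the symbols of equation (1) are not legible in this copy, treat the exact correspondence as an assumption; the function names are hypothetical.

```python
from collections import Counter

def singletons(sample):
    """Number of colors observed exactly once in `sample`."""
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c == 1)

def turing_good(sample):
    """Singletons among the n observations, divided by n (biased)."""
    return singletons(sample) / len(sample)

def robbins(sample, extra_draw):
    """Singletons in the sample enlarged by one extra draw, divided by n + 1."""
    extended = list(sample) + [extra_draw]
    return singletons(extended) / len(extended)

sample = ["a", "b", "a", "c"]
print(turing_good(sample))    # singletons are b and c
print(robbins(sample, "d"))   # singletons are b, c and d
print(robbins(sample, "b"))   # only c remains a singleton
```

The toy calls show how a single extra observation can move Robbins' prediction substantially in small samples, which is one way to see why sequential predictions built from all the data may fluctuate.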

Figure 1. Point predictions in a human-gut and exponential urn.


Plots associated with a human-gut (top-row) and exponential urn (bottom-row). Left-column, sequential predictions of the conditional uncovered probability (black), as a function of the number Inline graphic of observations, using Robbins' estimator in equation (1) (green), Starr's estimator in equation (2) (orange), and the Embedding algorithm (blue, red), over the same sample of size Inline graphic from each urn. Starr's estimator was implemented keeping Inline graphic. Blue predictions correspond to consecutive outputs of the Embedding algorithm in Table 1, which was reiterated until exhausting the sample using the parameter Inline graphic. Red predictions correspond to outputs of the algorithm each time a new species was discovered. Right-column, correlation plots associated with consecutive predictions of the conditional uncovered probability (normalized by its true value at the point of prediction), under the various methods. The green and orange clouds correspond to pairs of predictions, 100 observations apart, using Robbins' and Starr's estimators, respectively. Blue and red clouds correspond to pairs of consecutive outputs of the Embedding algorithm, following the same coloring scheme as in the left plots. Notice how the red and blue clouds are centered around Inline graphic, indicating the accuracy of our methodology in a log-scale. Furthermore, the green and orange clouds show a higher level of correlation than the blue and red clouds, indicating that our method recovers more easily from previously offset predictions. In each urn, our predictions used the Inline graphic observations and an HPP with intensity one, simulated independently from the urn, to predict sequentially the uncovered probability of the first part of the sample. See Fig. 4 for the associated rank curve in each urn.

Subsequently, for each Inline graphic, Starr [21] proposed to predict Inline graphic using

graphic file with name pone.0021105.e051.jpg (2)

Even though Inline graphic is the minimum variance unbiased estimator of Inline graphic based on Inline graphic additional observations from the urn [22], Starr showed that Inline graphic may be strongly negatively correlated with Inline graphic when Inline graphic (note that Starr's and Robbins' estimators are identical when Inline graphic). Furthermore, the sequential prediction of Inline graphic via Starr's estimator is affected by issues similar to those of Robbins' estimator, as is also illustrated in Fig. 1, even when the parameter Inline graphic is set as large as possible, namely when Inline graphic is equal to the sample size (correlation coefficient of orange clouds, Inline graphic and Inline graphic on top- and bottom-right, respectively). We observe that Inline graphic and Inline graphic are indistinguishable in a linear scale when Inline graphic because, for each Inline graphic, it holds that (see Materials and Methods):

graphic file with name pone.0021105.e068.jpg (3)

In terms of prediction intervals, if Inline graphic denotes the Inline graphic upper quantile of a standard Normal distribution, it follows from Esty's analysis [23] that if Inline graphic is not very near Inline graphic or Inline graphic then

graphic file with name pone.0021105.e074.jpg (4)

is approximately a Inline graphic prediction interval for Inline graphic. In practice, and as seen in Fig. 2, when the center of the interval is of a similar or lesser order of magnitude than its radius, the ratio between the upper- and lower-bound of these intervals may oscillate erratically, sometimes over several orders of magnitude. This can be an issue in assessing the depth of sampling in rich environments. For instance, being highly confident that Inline graphic is not of practical use, because one may need from Inline graphic to Inline graphic additional observations to discover a new species.

Figure 2. Prediction intervals in the human-gut and exponential urn.


95% prediction intervals for the conditional uncovered probability (black) of the human-gut and exponential urn as a function of the number of observations. Esty's prediction intervals in equation (4) (green), and prediction intervals based on the Embedding algorithm (blue, red), using the parameters Inline graphic and Inline graphic on the left and right, respectively. Blue and red curves correspond to the conservative-lower and -upper prediction intervals for the uncovered probability, respectively. The missing segments on the lower green-curves correspond to Esty's prediction intervals that contained Inline graphic. Although the upper- and lower-bound of Esty's intervals may be of different orders of magnitude, our method produces intervals of a constant length in logarithmic scale. This length is controlled by the user-defined parameter Inline graphic. In each urn, our method predicted accurately the uncovered probability of a random sub-sample of the Inline graphic observations from the urn. See Fig. 4 for the associated rank curve in each urn.

The issues of the aforementioned methods are somewhat expected. On one hand, the problem of predicting Inline graphic is very different from estimating Inline graphic: the former requires predicting the exact proportion of balls in the urn with colors outside the random set Inline graphic, rather than on average over all possible such sets. On the other hand, the point estimators of Inline graphic are unlikely to predict Inline graphic accurately in a logarithmic scale, unless the standard deviation of Inline graphic is small relative to Inline graphic. Finally, the methods we have described from the literature were designed for static situations, i.e. to predict Inline graphic or estimate Inline graphic when Inline graphic is fixed.

Results

Embedding Algorithm

Here we propose a new methodology to address the issues of the methods presented in the Introduction to predict Inline graphic. Our methodology lends itself better to sequential analysis and to accurate predictions in a logarithmic scale and, in particular, also in a linear scale, though it relies on randomized sample sizes. Because of this, in static situations, i.e. for fixed sample sizes, our method only yields predictions for a random sub-sample of the original sample.

Randomized sample sizes are more than just an artifact of our procedure: due to Theorem 1 below, for any predetermined sample size, there is no deterministic algorithm to predict Inline graphic and Inline graphic unbiasedly, unless the urn is composed of a known and flat distribution of colors. See the Materials and Methods section for the proofs of our theorems.

Theorem 1 If Inline graphic is a continuous and one-to-one function then the following two statements are equivalent: (i) there is a non-randomized algorithm based on Inline graphic that predicts Inline graphic conditionally unbiasedly; (ii) the urn is composed of a known and equidistributed number of colors.

Our methodology is based on a so-called Poissonization argument [24]. This technique is often used in allocation problems to remove correlations [25]. It was applied in [26] to show that the cardinality of the random set Inline graphic is asymptotically Gaussian after the appropriate renormalization. Mao and Lindsay [27] implicitly used a Poissonization argument to argue that intervals such as in equation (4) have a Inline graphic asymptotic confidence, under the hypothesis that the times at which each color in the urn is observed obey a homogeneous Poisson point process (HPP) with a random intensity. Here, asymptotic means that the α-diversity tends to infinity, which entails adding colors into the urn. Our approach, however, is not based on any assumption on the times the data was collected, nor on an asymptotic rescaling of the problem, but rather on the embedding of a sample from an urn into an HPP with intensity Inline graphic in the semi-infinite interval Inline graphic. We emphasize that the HPP is a mathematical artifice simulated independently from the urn.

In what follows, Inline graphic is a user-defined integer parameter. We have implemented the Poissonization argument in what we call the Embedding algorithm in Table 1. For a schematic description of the algorithm see Fig. 3 and, for its heuristic, consult the Materials and Methods section.

Table 1. Embedding algorithm.

Input: Inline graphic, a set Inline graphic of colors known to be in the urn, and constants Inline graphic that satisfy condition (5).
Output: Unbiased predictor of Inline graphic, Inline graphic prediction interval for Inline graphic and an updated set Inline graphic of colors known to belong to the urn.
Step 1. Assign Inline graphic, Inline graphic, and Inline graphic.
Step 2. While Inline graphic assign Inline graphic, and sample with replacement a ball from the urn. Let Inline graphic be the color of the sampled ball. If Inline graphic then assign Inline graphic and Inline graphic.
Step 3. Simulate Inline graphic, and assign Inline graphic.
Step 4. Output Inline graphic, Inline graphic and Inline graphic.
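The steps of Table 1 can be sketched in Python as follows. This is an illustrative reading of the algorithm and not the authors' implementation: it assumes that the counter in Step 2 tracks draws whose color lies outside the input set of known colors, that the simulated time in Step 3 is Gamma distributed with shape equal to the total number of draws (a sum of that many unit-mean exponentials), and that the point predictor is (r - 1) divided by that time. The names `embedding_step` and `draw_ball` are hypothetical.

```python
import random

def embedding_step(draw_ball, known, r, a, b, rng):
    """One pass of an Embedding-style prediction (illustrative sketch).

    draw_ball: zero-argument callable returning the color of one draw.
    known:     colors already known to be in the urn (left unmodified).
    r:         number of draws with colors outside `known` to wait for.
    a, b:      interval constants, assumed to satisfy condition (5).
    The loop only terminates if the urn has colors outside `known`.
    """
    known = set(known)
    new_colors = set()
    draws = 0      # total balls drawn in this pass
    outside = 0    # draws whose color is outside `known`
    while outside < r:
        color = draw_ball()
        draws += 1
        if color not in known:
            outside += 1
            new_colors.add(color)
    # Poissonization: space the draws with unit-mean exponentials, so the
    # position of the last draw is Gamma distributed with shape `draws`.
    t = rng.gammavariate(draws, 1.0)
    point = (r - 1) / t
    interval = (a / t, b / t)
    return point, interval, known | new_colors

# Hypothetical urn: three known colors cover 80% of the balls.
rng = random.Random(1)
colors = ["red", "white", "blue"] + [f"rare{i}" for i in range(200)]
weights = [0.4, 0.3, 0.1] + [0.2 / 200] * 200
draw = lambda: rng.choices(colors, weights=weights)[0]

# r = 10 with the 95% upper-bound constant listed for r = 10 in Table 3.
point, (lo, hi), updated = embedding_step(
    draw, {"red", "white", "blue"}, r=10, a=0.0, b=15.70521642, rng=rng)
print(f"point prediction {point:.3f}, interval [{lo:.3f}, {hi:.3f}]")
```

Calling the function again with `updated` as the known set gives the sequential use described in the text, where each pass relies only on newly collected draws.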

Figure 3. Schematic description of the Embedding algorithm.


Suppose that in a first sample from an urn you only observe the colors red, white and blue; in particular, Inline graphic. Let Inline graphic be the unknown proportion in the urn of balls colored with any of these colors i.e. Inline graphic. To estimate Inline graphic, sample additional balls from the urn until observing Inline graphic balls with colors outside Inline graphic. Embed the colors of this second sample into a homogeneous Poisson point process with intensity one; in particular, the separations between consecutive points with colors outside Inline graphic are independent exponential random variables with mean Inline graphic. The unknown quantity Inline graphic can now be estimated from the random variable Inline graphic. As a byproduct of our methodology, conditional on Inline graphic, if Inline graphic denotes the relative proportion of color Inline graphic in the first sample then Inline graphic predicts the true proportion of color Inline graphic in the urn.

Suppose that a set Inline graphic of colors is already known to belong to the urn and let Inline graphic be the coverage probability of the colors in this set. We note that, in the context of the previous discussion, Inline graphic with Inline graphic.

To predict Inline graphic, draw balls from the urn until Inline graphic colors outside Inline graphic are observed. Visualize each observation as a colored point in the interval Inline graphic. The Poissonization consists in spacing these points out using independent exponential random variables with mean one. Due to the thinning property of Poisson point processes [28], the position Inline graphic of the point farthest apart from Inline graphic has a Gamma distribution with mean Inline graphic. We may exploit this to obtain conditionally unbiased predictors and exact prediction intervals for Inline graphic and Inline graphic as follows. Regarding direct predictions of Inline graphic, note that measuring Inline graphic in a logarithmic rather than linear scale makes more sense when deep sampling is possible.

Theorem 2 Conditioned on Inline graphic and the event Inline graphic , the following applies:

  1. If Inline graphic then Inline graphic is unbiased for Inline graphic, with variance Inline graphic.

  2. If Inline graphic and Inline graphic denotes Euler's constant then Inline graphic is unbiased for Inline graphic, with variance Inline graphic, which is bounded between Inline graphic and Inline graphic.

  3. If Inline graphic, Inline graphic and Inline graphic are such that
    graphic file with name pone.0021105.e174.jpg (5)
    then the interval Inline graphic contains Inline graphic with exact probability Inline graphic; in particular, Inline graphic contains Inline graphic also with probability Inline graphic.

We note that Inline graphic is the uniformly minimum variance unbiased estimator of Inline graphic based on Inline graphic exponential random variables with unknown mean Inline graphic. Furthermore, Inline graphic converges almost surely to Inline graphic, as Inline graphic tends to infinity; in particular, the point predictors in part (i) and (ii) are strongly consistent.
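The unbiasedness and variance claims in part (i) can be traced back to two standard Gamma moments. The derivation below is a sketch under assumed notation, since the original symbols are not legible in this copy: \lambda stands for the uncovered fraction being predicted, and T_r for the position of the r-th point with a color outside the known set, taken to be Gamma distributed with shape r and rate \lambda.

```latex
% Assumed notation: \lambda is the uncovered fraction being predicted and
% T_r \sim \mathrm{Gamma}(r, \lambda) is the position of the r-th point
% whose color lies outside the set of known colors.
\begin{align*}
\mathbb{E}\!\left[\frac{1}{T_r}\right]
  &= \int_0^\infty \frac{1}{t}\,
     \frac{\lambda^{r} t^{r-1} e^{-\lambda t}}{(r-1)!}\,dt
   = \frac{\lambda\,(r-2)!}{(r-1)!}
   = \frac{\lambda}{r-1}, && r \ge 2,\\
\mathbb{E}\!\left[\frac{1}{T_r^{2}}\right]
  &= \frac{\lambda^{2}\,(r-3)!}{(r-1)!}
   = \frac{\lambda^{2}}{(r-1)(r-2)}, && r \ge 3,\\
\operatorname{Var}\!\left(\frac{r-1}{T_r}\right)
  &= \frac{(r-1)\,\lambda^{2}}{r-2} - \lambda^{2}
   = \frac{\lambda^{2}}{r-2}, && r \ge 3.
\end{align*}
```

Under these assumptions (r - 1)/T_r has mean \lambda, and its standard deviation relative to \lambda is 1/\sqrt{r-2}, which does not depend on the unknown urn composition.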

We also note that the logarithm of the statistic in part (i) under-estimates Inline graphic on average. In fact, the difference between the natural logarithm of the statistic in (i) and the statistic in (ii) is Inline graphic, which is negative for Inline graphic, and increases to zero as Inline graphic tends to infinity. From a computational standpoint, however, the statistics Inline graphic and Inline graphic differ by at most Inline graphic-units when Inline graphic. The same precision may be reached for smaller values of Inline graphic if larger bases are utilized. For instance, in base-10, the discrepancy will be at most Inline graphic for Inline graphic.
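The size of this discrepancy is easy to tabulate. The sketch below assumes that the statistic in (i) is (r - 1)/T and that the statistic in (ii) is the digamma-type correction -γ + 1 + 1/2 + ... + 1/(r - 1) minus ln T, where T denotes the simulated time; under that assumption T cancels and the difference is ln(r - 1) minus the correction term. These forms are inferred from the surrounding text, not copied from the theorem, so treat them as assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def digamma_int(r):
    """Digamma function at a positive integer r: -gamma + sum_{j=1}^{r-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, r))

def log_discrepancy(r, base=math.e):
    """log(r - 1) - digamma(r) for r >= 2, expressed in the given logarithm base."""
    return (math.log(r - 1) - digamma_int(r)) / math.log(base)

# Natural-log and base-10 discrepancies for a few values of r.
for r in (2, 5, 10, 50, 100):
    print(r, f"{log_discrepancy(r):.5f}", f"{log_discrepancy(r, base=10):.5f}")
```

The printed values are negative and shrink in magnitude roughly like 1/(2r), which is in line with the statement that the two statistics become interchangeable for moderately large values of the parameter.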

With regard to part (iii) of the theorem, we note that our prediction intervals for Inline graphic cannot contain zero unless Inline graphic. On the other hand, since the density function used in equation (5) is unimodal, the shortest prediction interval for Inline graphic corresponds to a pair of non-negative constants Inline graphic such that:

graphic file with name pone.0021105.e203.jpg (6)

Similarly, optimal prediction intervals for Inline graphic follow when

graphic file with name pone.0021105.e205.jpg (7)

with Inline graphic (see Materials and Methods for a numerical procedure to approximate these constants). In either case, because Inline graphic converges in distribution to a standard Normal as Inline graphic tends to infinity, one may select in (5) the approximate constants Inline graphic and Inline graphic. With these approximate values, if Inline graphic then the true confidence Inline graphic of the associated prediction intervals satisfies (see Materials and Methods):

graphic file with name pone.0021105.e213.jpg (8)

(The term in the exponential on the left-hand side above is big-O of Inline graphic; in particular, the lower-bound is of the same asymptotic order as the upper-bound.) We note that the constants produced by the Normal approximation may be crude for relatively large values of Inline graphic, as seen in Table 2.
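To get a feel for how the Normal approximation behaves, one can compute the exact Gamma probability captured by approximate constants. The sketch below assumes that the approximate constants are r - z√r and r + z√r, with z the upper α/2 quantile of the standard Normal and the lower constant truncated at zero; this choice is an assumption suggested by the central limit statement above, since the constants themselves are not legible in this copy. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def gamma_cdf(x, r):
    """P(G <= x) for G ~ Gamma(r, 1) with positive integer shape r."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0
    for k in range(1, r):
        term *= x / k          # term equals x**k / k!
        total += term
    return 1.0 - math.exp(-x) * total

def normal_constants(r, alpha):
    """Assumed approximate constants r -/+ z * sqrt(r), lower one truncated at 0."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(r)
    return max(r - half_width, 0.0), r + half_width

def true_confidence(r, alpha):
    """Exact Gamma(r, 1) probability of the interval given by the approximate constants."""
    a, b = normal_constants(r, alpha)
    return gamma_cdf(b, r) - gamma_cdf(a, r)

for r in (5, 10, 25, 100):
    a, b = normal_constants(r, 0.05)
    print(r, f"a={a:.3f}", f"b={b:.3f}", f"confidence={true_confidence(r, 0.05):.4f}")
```

Even when the overall confidence is close to the target, the individual constants can differ noticeably from the optimal ones, which is the kind of discrepancy that Table 2 reports.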

Table 2. Optimal versus asymptotic Inline graphic prediction intervals.

Inline graphic Prediction interval for Optimal constants Gaussian approximation Relative error Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic Same as above Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic Same as above Inline graphic Inline graphic

As high-throughput technologies allow deeper sampling of microbial communities, it will be increasingly important to have upper- and lower-bounds for Inline graphic of a comparable order of magnitude. Since the prediction intervals for this quantity in Theorem 2 are of the form Inline graphic, and the ratio between the upper- and lower-bound of this interval is Inline graphic, one may wish to determine constants Inline graphic and Inline graphic such that, not only (5) is satisfied, but also

graphic file with name pone.0021105.e252.jpg (9)

where Inline graphic is a user-defined parameter. Not all values of Inline graphic are attainable for a given Inline graphic and confidence level. In fact, the smallest attainable value is given by the constants associated with the optimal prediction interval for Inline graphic. Equivalently, Inline graphic is attainable if and only if

graphic file with name pone.0021105.e258.jpg

Conversely, and as stated in the following result, any value of Inline graphic is attainable at a given confidence level, provided that the parameter Inline graphic is selected sufficiently large.

Theorem 3 Let Inline graphic and Inline graphic be fixed constants. For each Inline graphic sufficiently large, there are constants Inline graphic such that (5) and (9) are satisfied.

For a given parameter Inline graphic, there are at most two constants Inline graphic such that Inline graphic and Inline graphic are prediction intervals for Inline graphic with exact confidence Inline graphic. We refer to these as conservative-lower and conservative-upper prediction intervals, respectively. We refer to intervals of the form Inline graphic and Inline graphic as upper- and lower-bound prediction intervals, respectively. See Table 3 for the determination of these constants for various values of Inline graphic when Inline graphic.

Table 3. Constants associated with 95% prediction intervals.

Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
1 2.995732274 Inline graphic Inline graphic 0.051293294
2 4.743864518 Inline graphic Inline graphic 0.355361510
3 6.295793622 Inline graphic Inline graphic 0.817691447
4 7.753656528 0.806026244 1.360288674 1.366318397
5 9.153519027 0.924031159 1.969902541 1.970149568
6 10.51303491 1.053998892 2.61300725 2.613014744
7 11.84239565 1.185086999 3.28531552 3.285315692
8 13.14811380 1.315076338 3.98082278 3.980822786
9 14.43464972 1.443547021 4.69522754 4.695227540
10 15.70521642 1.570546801 5.42540570 5.425405697
11 16.96221924 1.696229569 6.16900729 6.169007289
12 18.20751425 1.820753729 6.92421252 6.924212514
13 19.44256933 1.944257623 7.68957829 7.689578292
14 20.66856908 2.066857113 8.46393752 8.463937522
15 21.88648591 2.188648652 9.24633050 9.246330491
16 23.09712976 2.309712994 10.03595673 10.03595673
17 24.30118368 2.430118373 10.83214036 10.83214036
18 25.49923008 2.549923010 11.63430451 11.63430451
19 26.69177031 2.669177032 12.44195219 12.44195219
20 27.87923964 2.787923964 13.25465160 13.25465160
21 29.06201884 2.906201884 14.07202475 14.07202475
22 30.24044329 3.024044329 14.89373854 14.89373854
23 31.41481021 3.141481021 15.71949763 15.71949763
24 32.58538445 3.258538445 16.54903872 16.54903871
25 33.75240327 3.375240328 17.38212584 17.38212584
Constants associated with Inline graphic upper-bound, conservative-lower, conservative-upper and lower-bound prediction intervals for Inline graphic, when Inline graphic and Inline graphic. By definition, this means that Inline graphic and Inline graphic. Furthermore, the constants Inline graphic are solutions to the equation:
graphic file with name pone.0021105.e293.jpg
solved numerically with Newton's method using Maple 13.02. This equation may have at most two different solutions, and star (Inline graphic) denotes that the equation has no solution.
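The conservative constants in Table 3 can be reproduced approximately with a few lines of code. The sketch below assumes that the equation in the caption asks for a constant a such that a Gamma variable with shape r and unit scale falls between a and 10a with probability 0.95, which is consistent with the tabulated values for r = 4. It uses bisection on each side of the peak of the coverage function instead of the Newton iteration mentioned in the caption; the function names are hypothetical.

```python
import math

def gamma_cdf(x, r):
    """P(G <= x) for G ~ Gamma(r, 1) with positive integer shape r."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0
    for k in range(1, r):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total

def ratio_interval_constants(r, ratio, conf, tol=1e-12):
    """Solutions a of P(a <= G <= ratio * a) = conf; empty list if none exists."""
    def coverage(a):
        return gamma_cdf(ratio * a, r) - gamma_cdf(a, r)

    # The coverage is unimodal in a, with its maximum at this point.
    a_peak = r * math.log(ratio) / (ratio - 1.0)
    if coverage(a_peak) < conf:
        return []

    def bisect(lo, hi):
        f_lo = coverage(lo) - conf
        while hi - lo > tol * max(1.0, hi):
            mid = 0.5 * (lo + hi)
            if (coverage(mid) - conf) * f_lo > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    hi = a_peak
    while coverage(hi) >= conf:   # expand until the coverage drops below conf
        hi *= 2.0
    return [bisect(0.0, a_peak), bisect(a_peak, hi)]

print(ratio_interval_constants(4, 10.0, 0.95))   # compare with row 4 of Table 3
print(ratio_interval_constants(3, 10.0, 0.95))   # no solution, a starred entry
```

Bisection is slower than Newton's method but needs no derivative and cannot overshoot, which is convenient when the two roots are close to each other, as happens for larger values of r.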

Effect of non-randomized sample sizes

The Embedding algorithm provides conditionally unbiased predictors and intervals for Inline graphic and Inline graphic, provided that an arbitrary number of additional observations is possible until observing Inline graphic balls with colors outside Inline graphic. When dealing with fixed sample sizes, there is a positive probability of not meeting this condition, in which case the Embedding algorithm is inconclusive. In large samples however, such as those collected in microbial datasets, the algorithm may be applied sequentially until it yields an inconclusive prediction. In that case, the true confidence of the prediction intervals produced by the algorithm satisfies the following.

Theorem 4 Suppose that condition (5) is satisfied. Conditioned on Inline graphic , if Inline graphic balls with colors outside Inline graphic are observed in the next Inline graphic draws from the urn, then the true confidence Inline graphic of the prediction interval for Inline graphic produced by the Embedding algorithm satisfies:

  1. if Inline graphic then Inline graphic ;

  2. if Inline graphic then Inline graphic, where Inline graphic is a Gamma random variable with parameters Inline graphic, and Inline graphic is a Negative Binomial random variable with parameters Inline graphic.

Thus, if the Embedding algorithm produces an output in what remains of a finite sample size, the upper-bound prediction interval for Inline graphic has at least the user-defined confidence. This is perhaps the case of most interest in applications: it allows the user to estimate the least number of additional samples to observe a color not seen in any sample. For the other three interval types, the true confidence is approximately at least the targeted one if the probability that the algorithm produces an output in what remains of the sample is large.

Discussion

Comparisons with Robbins-Starr estimators

Note that, like Robbins' and Starr's estimators, our method requires extracting additional balls from the urn to make a prediction. However, unlike the methods of the Introduction, our method uses only the additionally collected data–instead of all the data ever collected from the urn–to make a prediction. In terms of sequential analysis, this is advantageous to recover from earlier erroneous predictions (we expand on this point in the next section, see Fig. 1).

In what remains of this section, Inline graphic hence Inline graphic, the conditional uncovered probability of a sample of size Inline graphic. Furthermore, to rule out trivial cases, we assume that Inline graphic with positive probability, i.e. the urn is composed of more than just balls of a single color.

Part (i) of Theorem 2 provides a conditionally unbiased predictor for Inline graphic. We can show, however, that Robbins' and Starr's estimators are not conditionally unbiased for Inline graphic in the non-parametric case when Inline graphic. To see this, first notice that Inline graphic due to the inequality (3). On the other hand, if Inline graphic is a color in the urn such that Inline graphic then

graphic file with name pone.0021105.e324.jpg

As a result:

graphic file with name pone.0021105.e325.jpg

Hence, if there exists a color Inline graphic in the urn that makes the above quantity strictly positive (there are infinitely many such urns, including all urns composed of infinitely many colors, because Inline graphic) then Inline graphic cannot be conditionally unbiased for Inline graphic.

On the other hand, due to parts (i) and (ii) in Theorem 2, we obtain (see Materials and Methods):

graphic file with name pone.0021105.e330.jpg (10)
graphic file with name pone.0021105.e331.jpg (11)

where Inline graphic denotes correlation and Inline graphic variance. Consequently, the point predictors in Theorem 2 are positively correlated with the quantities they were designed to predict. This contrasts with Robbins' estimator, which may be strongly negatively correlated with Inline graphic. For instance, if Inline graphic for Inline graphic different colors in the urn, it is shown in [21] that the correlation between Inline graphic and Robbins' estimator Inline graphic is asymptotically negative when Inline graphic converges to a strictly positive but finite constant Inline graphic. In this same regime but provided that Inline graphic, we can show that (see Materials and Methods):

graphic file with name pone.0021105.e342.jpg (12)

Since the right-hand side above is negative for all Inline graphic sufficiently small, Starr's estimator Inline graphic may also have a strong negative correlation with Inline graphic when Inline graphic is much smaller than Inline graphic.

A further calculation based on parts (i) and (ii) in Theorem 2 shows that

graphic file with name pone.0021105.e348.jpg
graphic file with name pone.0021105.e349.jpg

In particular, for fixed Inline graphic, the correlations in equations (10) and (11) approach one as Inline graphic tends to infinity.

Finally, for non-trivial urns with finite α-diversity, i.e. urns composed of balls with at least two but a finite number of different colors, one can show for fixed Inline graphic that the correlation in equation (10) approaches Inline graphic as Inline graphic tends to infinity. Furthermore, if we again assume that Inline graphic for Inline graphic different colors in the urn and Inline graphic converges to a strictly positive but finite constant, then the correlation in equation (10) approaches zero from above. As we pointed out before, in this regime, Robbins' estimator is asymptotically negatively correlated with Inline graphic.

Selection of parameters

There are two main criteria to select the parameter Inline graphic of the Embedding algorithm in a concrete application.

One criterion applies to point predictors. In this case, conditioned on Inline graphic, the standard deviation of the relative error of our prediction of Inline graphic is Inline graphic (Theorem 2, part (i)). To predict Inline graphic, Inline graphic should therefore be selected as small as possible while still meeting the user's tolerance on the average relative error of our predictions. The same criterion applies to point predictors of Inline graphic, for which the standard deviation of the absolute error is of order Inline graphic, uniformly for all Inline graphic (Theorem 2, part (ii)).

A different criterion applies to prediction intervals. In this case, conditioned on Inline graphic, the user should first specify the confidence level, and how much larger the upper-prediction bound should be in relation to the lower-prediction bound of Inline graphic. Since the ratio between these last two quantities is given by the parameter Inline graphic in (9), Inline graphic should be selected as small as possible to meet the user's pre-specified factor Inline graphic for the given confidence level of the prediction interval (Theorem 3). See Table 4 for the optimal choice of Inline graphic for various values of Inline graphic when Inline graphic. Note that for the selected parameter Inline graphic, the constants associated with the optimal prediction intervals are given in equations (6) and (7); see Materials and Methods.

Table 4. Optimal selection of parameter Inline graphic in terms of parameter Inline graphic.

Inline graphic Inline graphic Inline graphic Inline graphic
80 2 0.0598276655 0.355361510
48 2 0.1013728884 0.355358676
40 2 0.1231379857 0.355320458
24 2 0.226833483 0.346045204
20 3 0.320984257 0.817610455
12 3 0.590243030 0.787721610
10 4 0.806026244 1.360288674
6 6 1.822307383 2.58658608
5 7 2.48303930 3.22806682
3 14 7.17185045 8.27008349
2.5 19 11.26109001 11.96814857
1.5 94 75.9077267 76.5492088
1.25 309 275.661191 275.949782
Constants associated with the controlled upper- to lower-bound ratio prediction intervals for Inline graphic, when Inline graphic; in particular, for each Inline graphic and Inline graphic, Inline graphic and Inline graphic contain Inline graphic with a Inline graphic probability. For each Inline graphic, the smallest value of Inline graphic for which the equation:
graphic file with name pone.0021105.e394.jpg
admits a solution is reported. Numerical values were determined using Maple 13.02.
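One way to reproduce the first numerical constant in each row of Table 4 is sketched below. The sketch rests on an assumption inferred from the tabulated values rather than read from the equation itself, which is not legible in this copy: that the constant, here called `a`, satisfies P(a ≤ X ≤ c·a) = 0.95 for a Gamma random variable X with integer shape `r` and scale one, and that the reported root is the one to the left of the point where this coverage is largest. The names `gamma_cdf`, `coverage` and `smallest_r_and_a` are illustrative only.

```python
import math

def gamma_cdf(x, r):
    """CDF at x of a Gamma variable with integer shape r and scale one."""
    term, total = 1.0, 1.0
    for k in range(1, r):
        term *= x / k            # term equals x**k / k!
        total += term
    return 1.0 - math.exp(-x) * total

def coverage(a, c, r):
    """Assumed target: P(a <= X <= c * a) for X ~ Gamma(r, 1)."""
    return gamma_cdf(c * a, r) - gamma_cdf(a, r)

def smallest_r_and_a(c, level=0.95, r_max=1000):
    """Smallest integer r for which coverage(a, c, r) = level has a root, and that root."""
    for r in range(1, r_max + 1):
        # Setting the derivative of the coverage to zero gives its maximizer in a.
        a_peak = r * math.log(c) / (c - 1.0)
        if coverage(a_peak, c, r) < level:
            continue             # the level is unreachable for this r
        lo, hi = 0.0, a_peak     # coverage increases from 0 on [0, a_peak]
        for _ in range(200):     # plain bisection
            mid = 0.5 * (lo + hi)
            if coverage(mid, c, r) < level:
                lo = mid
            else:
                hi = mid
        return r, hi
    raise ValueError("no solution found for r <= r_max")

if __name__ == "__main__":
    for c in (80.0, 20.0, 5.0):
        print(c, smallest_r_and_a(c))
```

The printed pairs can be compared with the second and third columns of the table; bisection is used here instead of a Newton iteration only to keep the sketch short and robust.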

Simulations on analytic and non-analytic urns

We tested our methods against an urn with an exponential relative abundance rank curve over Inline graphic species, and an urn matching the observed distribution of microbes in a human-gut sample from [29]. We also analyzed a sample from a human-hand microbiota found in [30]. The gut and hand data are part of the largest microbial datasets collected thus far (see Fig. 4 for the relative abundance rank curve associated with each urn). The relative abundance rank curve, or for simplicity “rank curve”, associated with an urn is a graphical representation of its composition: the height of the graph above a non-negative integer Inline graphic is the fraction of balls in the urn with the Inline graphic-th most dominant color.
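As a minimal sketch of this definition, the following Python function turns a table of per-color counts into a rank curve. The function name `rank_curve` and the toy counts are illustrative assumptions and are unrelated to the datasets analyzed here.

```python
from collections import Counter

def rank_curve(counts):
    """Return relative abundances sorted from the most to the least dominant color."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("the urn must contain at least one ball")
    return [n / total for n in sorted(counts.values(), reverse=True)]

# Toy urn with 5 red, 3 blue and 2 green balls (hypothetical data).
toy_urn = Counter({"red": 5, "blue": 3, "green": 2})
curve = rank_curve(toy_urn)
print(curve)  # [0.5, 0.3, 0.2]
```

The height of the curve at rank 1 is the fraction of balls with the most dominant color, at rank 2 the fraction with the second most dominant color, and so on.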

Figure 4. Rank curves associated with the human-gut, human-hand and exponential urn.

In a rank curve, the relative abundance of a species is plotted against its sorted rank amongst all species, allowing for a quick overview of the evenness of a community. On the left, rank curves associated with the human-gut (blue) and -hand data (green) show a relatively small number of species with an abundance greater than 1%, and a long tail of relatively rare species. The right rank curve of the exponential urn (red) simulates an extreme environment, where relatively excessive sampling is unlikely to exhaust the pool of rare species.

The blue dots and red curves on the plots on the left side in Fig. 1 show very accurate point predictions in log-scale of the conditional uncovered probability (as a function of the number of observations), when we apply the Embedding algorithm to a sample of size Inline graphic from the human-gut and exponential urn, respectively. In both instances, the parameter Inline graphic of the Embedding algorithm was set to Inline graphic. The accuracy of our method is confirmed by the red clouds on the plots on the right side of Fig. 1, which are centered around Inline graphic. The red clouds also indicate that our predictions recover more easily from offset predictions as compared to Robbins' and Starr's (correlation coefficient of red clouds, Inline graphic and Inline graphic on top- and bottom-right, respectively). This is to be expected because the Embedding algorithm relies only on the additionally collected data to make a new prediction, whereas Robbins' and Starr's estimators use all the data ever collected from the urn. On the other hand, the red and blue curves in Fig. 2 show that the conservative-upper and -lower prediction intervals of the conditional uncovered probability (also as a function of the number of observations) contain this quantity with high probability and, unlike Esty's intervals, have a constant length in logarithmic scale. The intervals on the plots on the right side are tighter than those on the left because of the decrease of the parameter Inline graphic from Inline graphic to Inline graphic. In each case, the parameter Inline graphic was selected according to the guidelines in Table 4. We note that sequential predictions based on the Embedding algorithm in Figs. 1 and 2 were produced until the algorithm yielded inconclusive predictions. For this reason, our predictions ended before exhausting each sample.

In the human-hand dataset, Inline graphic species were observed in a sample of size Inline graphic. To simulate draws with replacement from this environment, we produced a random permutation of the data (see Materials and Methods section). Using the Embedding algorithm with parameters Inline graphic, and according to our point predictor, Inline graphic of the species observed in the sample represent Inline graphic of that hand environment; in particular, the remaining Inline graphic is composed of at least Inline graphic species. Furthermore, according to our upper-bound prediction interval, and with at least a Inline graphic confidence, the species not represented in the sample account for less than Inline graphic of that environment.

To test the above predictions, we simulated the rare biosphere as follows. We hypothesized that our point prediction of the conditional uncovered probability could be offset by up to one order of magnitude. We also hypothesized that the unseen species in the sample had an exponential relative abundance rank curve, composed of either Inline graphic, Inline graphic or Inline graphic species. This leads to nine different urns in which to test our methods. These urns are devised such that they gradually change from the almost unchanged urn in the bottom left corner to the urn in the upper right, which is dominated by rare species (see Fig. 5 for the associated rank curves). As seen on the plots in Fig. 6, the Embedding algorithm yields very accurate predictions in each of these nine scenarios, for all the sample sizes considered.

Figure 5. Rank curves associated with the rare biosphere simulation in the human-gut and -hand urn.

Rank curves associated with Fig. 6 (green) and Fig. 7 (blue).

Figure 6. Predictions in the human-hand urn when simulating the rare biosphere.

Prediction of the conditional uncovered probability (black) in nine urns associated with a human-hand urn. Point predictions produced by the Embedding algorithm (blue), point predictions produced by the algorithm each time a new species was discovered (red), Inline graphic upper-bound interval (orange), and Inline graphic conservative-upper interval (green). The algorithm used the parameters Inline graphic. The different urns were devised as follows. For each Inline graphic (indexing rows) and Inline graphic (indexing columns), a mixture of two urns was considered: an urn with the same distribution as the microbes found in a sample from a human-hand and weighted by the factor Inline graphic, and an urn consisting of Inline graphic colors (disjoint from the hand urn), with an exponentially decaying rank curve and weighted by the factor Inline graphic. See Fig. 5 for the rank curve associated with each urn.
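The mixture construction described in this caption can be sketched in Python as follows. The mixing weight `w`, the number of added colors `k` and, in particular, the geometric decay `ratio` of the added colors are hypothetical placeholders: the values actually used are not legible in this copy, and the exact form of the exponentially decaying rank curve is an assumption.

```python
def exponential_urn(k, ratio=0.5):
    """Relative abundances of k colors decaying geometrically by `ratio` (assumed form)."""
    weights = [ratio ** i for i in range(k)]
    total = sum(weights)
    return [x / total for x in weights]

def mixture_urn(base, k, w, ratio=0.5):
    """Mix a base urn (weight 1 - w) with k new, disjoint colors (weight w)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    rare = exponential_urn(k, ratio)
    # The new colors are appended, so they never coincide with the base colors.
    return [(1.0 - w) * p for p in base] + [w * q for q in rare]

# Toy base urn with three colors (hypothetical), plus ten rare colors weighing 1%.
base = [0.5, 0.3, 0.2]
urn = mixture_urn(base, k=10, w=0.01)
print(len(urn), sum(urn))
```

By construction, the added colors account for a fraction `w` of the mixed urn, which plays the role of the uncovered probability before any of them is sampled.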

As seen in Fig. 7, our predictions are also in excellent agreement with the human-gut dataset when we simulate the rare biosphere. As expected, the conditional uncovered probability almost always lies between the predicted bounds. We also note that the predictions based on the Embedding algorithm are accurate even for a small number of observations. This suggests that our algorithm can be applied to deeply as well as shallowly sampled environments.

Figure 7. Predictions in the human-gut urn when simulating the rare biosphere.

In a sample of size Inline graphic from a human-gut, Inline graphic species were discovered. Based on our methods, we estimate that Inline graphic of these species represent Inline graphic of that gut environment; hence, the remaining Inline graphic is composed of at least Inline graphic species. To test our predictions of the conditional uncovered probability (black), we simulated the rare biosphere by adding species and hypothesized that our point prediction could be offset by up to one order of magnitude: point predictions produced by the Embedding algorithm (blue), point predictions produced by the algorithm each time a new species was discovered (red), Inline graphic upper-bound (orange), and Inline graphic conservative-upper interval (green). The predictions used the parameters Inline graphic. The different urns were devised as follows. For each Inline graphic (indexing rows) and Inline graphic (indexing columns), a mixture of two urns was considered: an urn with the same distribution as the microbes found in the gut dataset, and weighted by the factor Inline graphic, and an urn consisting of Inline graphic colors (disjoint from the gut urn), with an exponentially decaying rank curve and weighted by the factor Inline graphic. See Fig. 5 for the rank curve associated with each urn.

Materials and Methods

Heuristic behind the Embedding algorithm

The number of times a rare color occurs in a sample from an urn is approximately Poisson distributed. In the non-parametric setting, a direct use of this approximation is tricky because “rare” is relative to the sample size and the unknown urn composition. The embedding into a HPP is a way to accommodate the Poisson approximation heuristic without making additional assumptions on the urn's composition. To fix ideas, imagine that no ball in the urn is colored black. Make up a second urn with a single ball colored black. We refer to this as the “black-urn”. Now sample (with replacement) balls according to the following scheme: draw a ball from the original- versus black-urn with probability Inline graphic and Inline graphic, respectively, where Inline graphic is a fixed but small parameter. Under this sampling scheme, even the most abundant colors in the original-urn are rare. In particular, the smaller Inline graphic is, the closer the distribution of the number of times a particular set of colors (excluding black) is observed is to a Poisson distribution. This approach is not very practical, however, because the number of samples needed to observe a given number of balls from the original urn can be astronomically large when Inline graphic is very small. To overcome this issue, imagine drawing a ball every Inline graphic-seconds. Draws from the original urn will then be Inline graphic seconds apart, where Inline graphic has a Geometric distribution with mean Inline graphic. As a result: Inline graphic, for Inline graphic. Thus, as Inline graphic gets smaller, the time-separations between consecutive samples from the original urn resemble independent Exponential random variables with mean one. The black-urn can therefore be removed from the heuristic altogether by embedding samples from the original urn into a HPP with intensity one over the interval Inline graphic.
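The limit behind this heuristic can be checked numerically. The sketch below assumes, since the symbols are not legible in this copy, that the small parameter equals 1/n for an integer n, that a ball is drawn every 1/n seconds, and that each draw comes from the original urn with probability 1/n; the name `gap_survival` is illustrative. It compares the exact probability that the time between two consecutive draws from the original urn exceeds t with the Exponential tail exp(-t).

```python
import math

def gap_survival(t, n):
    """P(G / n > t) for G geometric on {1, 2, ...} with success probability 1/n."""
    eps = 1.0 / n
    draws_by_time_t = math.floor(t * n)   # draws made up to time t, one every eps seconds
    # The gap exceeds t exactly when all of those draws come from the black-urn.
    return (1.0 - eps) ** draws_by_time_t

t = 1.5
errors = []
for n in (10, 100, 1000):
    exact = gap_survival(t, n)
    errors.append(abs(exact - math.exp(-t)))
    print(n, exact, math.exp(-t))
```

As n grows, the exact survival probability approaches the Exponential one, which is the sense in which the scaled geometric gaps resemble Exponential random variables with mean one.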

Simulating draws with replacement

To simulate draws with replacement using data already collected from an environment, produce a random permutation of the data. This can be accomplished with low-memory complexity using the discrete inverse transform method to simulate draws–without replacement–from a finite population [31].
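A minimal sketch of such a low-memory permutation is given below. It assumes the data are summarized as a table of counts per species and keeps only that table in memory; each draw inverts the cumulative distribution of the remaining counts. The function name `permuted_draws` and the toy counts are hypothetical, and the sketch is not claimed to match the implementation used for the analyses.

```python
import random

def permuted_draws(counts, rng=None):
    """Yield the labels of a pooled sample in uniformly random order.

    `counts` maps each label to a non-negative integer. Labels are drawn
    one at a time, without replacement, until the table is exhausted.
    """
    rng = rng or random.Random()
    remaining = dict(counts)
    total = sum(remaining.values())
    while total > 0:
        u = rng.randrange(total)          # uniform integer in [0, total)
        cumulative = 0
        for label, n in remaining.items():
            cumulative += n
            if u < cumulative:            # discrete inverse transform step
                break
        yield label
        remaining[label] -= 1
        total -= 1

# Toy data with three species (hypothetical counts) and a seeded generator.
sample = list(permuted_draws({"A": 3, "B": 1, "C": 2}, random.Random(0)))
print(sample)
```

Each pass over the table costs time proportional to the number of distinct species, which is the price paid for not storing the full list of observations.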

Constants associated with optimal prediction intervals

To numerically approximate a pair of constants Inline graphic such that Inline graphic and Inline graphic, where the integer Inline graphic and the number Inline graphic are given constants, introduce the auxiliary variable Inline graphic, and note that the latter condition is fulfilled only when Inline graphic and Inline graphic. By Newton's method, the sequence Inline graphic defined recursively as follows converges to the unique Inline graphic that satisfies the integrability condition, provided that Inline graphic is chosen sufficiently close to Inline graphic:

graphic file with name pone.0021105.e467.jpg
graphic file with name pone.0021105.e468.jpg
graphic file with name pone.0021105.e469.jpg

Proof of Inequality (3)

First notice that

graphic file with name pone.0021105.e470.jpg (13)

To bound the first term on the right-hand side above, notice that Inline graphic. As a result, since Inline graphic, we obtain that:

graphic file with name pone.0021105.e473.jpg (14)

On the other hand, to bound the second term on the right-hand side of equation (13), define the quantity Inline graphic and notice that Inline graphic. Using that a weighted average is at most the largest of the terms averaged, we obtain that:

graphic file with name pone.0021105.e476.jpg (15)

where, for the last inequality, we have used that for each Inline graphic, the associated product is less than or equal to the factor associated with the index Inline graphic. Equation (3) is now a direct consequence of equations (13), (14) and (15).

Proof of Theorem 1

In what follows, Inline graphic denotes the inverse function of Inline graphic.

Define Inline graphic to be the set of decreasing partitions of Inline graphic i.e. vectors of the form Inline graphic, with Inline graphic and Inline graphic integers, such that Inline graphic. To each possible sample Inline graphic, let Inline graphic be the decreasing partition of Inline graphic associated with the observed ranks in the sample.

Define Inline graphic, for each set Inline graphic of colors. Part (i) in the theorem is equivalent to the existence of a function Inline graphic such that

graphic file with name pone.0021105.e493.jpg (16)

with probability one. This is because, in the non-parametric setting, the different colors in the urn carry no intrinsic meaning apart from being different. If there is a certain function Inline graphic which satisfies condition (16) then Inline graphic, for each color Inline graphic such that Inline graphic. In particular, the set Inline graphic must be finite. Furthermore, if this set has cardinality Inline graphic then Inline graphic, for each color Inline graphic in the set; in particular, Inline graphic. Condition (ii) is therefore necessary for condition (i). Conversely, if condition (ii) is satisfied and the urn is composed of Inline graphic colors occurring in equal proportions then the function Inline graphic defined as Inline graphic satisfies condition (16).

Proof of Theorem 2

Conditioned on the set Inline graphic, and the random index Inline graphic used in Step 3 of the Embedding algorithm, Inline graphic has a Gamma distribution with shape parameter Inline graphic and scale parameter Inline graphic. However, because Inline graphic has a Negative Binomial distribution, conditioned on Inline graphic alone, Inline graphic has a Gamma distribution with shape parameter Inline graphic and scale parameter Inline graphic. In particular, Inline graphic has probability density function Inline graphic, for Inline graphic. From this, parts (i) and (iii) in the theorem are immediate. To show part (ii), notice first that Inline graphic is conditionally unbiased for Inline graphic, where

graphic file with name pone.0021105.e521.jpg

The second identity above is due to an integration by parts argument and only holds for Inline graphic. However, since Inline graphic, we obtain that Inline graphic, for Inline graphic. This shows that Inline graphic is conditionally unbiased for Inline graphic. To complete the proof of the theorem, notice that Inline graphic and Inline graphic have the same variance. In particular, Inline graphic, where

graphic file with name pone.0021105.e531.jpg

The last identity above holds only for Inline graphic. Using that Inline graphic, we conclude that Inline graphic, for Inline graphic. As a result: Inline graphic; in particular, since Inline graphic, Inline graphic. The theorem is now a consequence of the following inequalities:

graphic file with name pone.0021105.e539.jpg

Proof of Equation (8)

Let Inline graphic and assume that Inline graphic. Observe that:

graphic file with name pone.0021105.e542.jpg

The factor multiplying the previous integral is an increasing function of Inline graphic; in particular, due to Stirling's formula, it is bounded by Inline graphic from above. Furthermore, from section 6.1.42 in [32], it follows that

graphic file with name pone.0021105.e545.jpg

On the other hand, if one rewrites the integrand of the previous integral in an exponential-logarithmic form and uses that Inline graphic, for all Inline graphic, where

graphic file with name pone.0021105.e548.jpg

one sees that

graphic file with name pone.0021105.e549.jpg

All together, these inequalities imply that

graphic file with name pone.0021105.e550.jpg

from which the result follows.

Proof of Theorem 3

Due to the Central Limit Theorem, if Inline graphic and Inline graphic then

graphic file with name pone.0021105.e553.jpg

As a result, for all Inline graphic sufficiently large, Inline graphic, and the integral on the left-hand side above is greater than or equal to Inline graphic. Fix any such Inline graphic. Since the value of the associated integral may be decreased continuously by increasing the parameter Inline graphic, there is Inline graphic such that Inline graphic and

graphic file with name pone.0021105.e561.jpg

Define Inline graphic, for Inline graphic. Since Inline graphic and, because Inline graphic, Inline graphic, the continuity of Inline graphic implies that there is Inline graphic such that Inline graphic. Selecting Inline graphic and Inline graphic shows the theorem.

Proof of Theorem 4

The proof is based on a coupling argument. First observe that one can define on the same probability space random variables Inline graphic such that (1) Inline graphic and Inline graphic have Negative Binomial distributions with parameters Inline graphic, but with Inline graphic conditioned to be less than or equal to Inline graphic; (2) Inline graphic but Inline graphic when Inline graphic; and (3) Inline graphic are independent Exponentials with mean Inline graphic and independent of Inline graphic.

Let Inline graphic be the event “Inline graphic balls with colors outside Inline graphic are observed in the next Inline graphic draws from the urn”. Conditioned on Inline graphic, we have that Inline graphic and Inline graphic. As a result:

graphic file with name pone.0021105.e591.jpg
graphic file with name pone.0021105.e592.jpg

Since Inline graphic, and because Inline graphic when Inline graphic, we obtain that

graphic file with name pone.0021105.e596.jpg

From this, the upper-bound in part (i) and both inequalities in part (ii) follow after noticing that Inline graphic has a Gamma distribution with shape parameter Inline graphic and scale parameter Inline graphic. To show the lower-bound in (i), we again notice that Inline graphic. In particular, if Inline graphic then

graphic file with name pone.0021105.e602.jpg

Proof of Equations (10) and (11)

Consider random variables Inline graphic and Inline graphic and a random vector Inline graphic, defined on the same probability space. Assume that Inline graphic is square-integrable and conditionally unbiased for Inline graphic given Inline graphic, i.e. Inline graphic. Furthermore, assume that Inline graphic, hence Inline graphic. Because Inline graphic is also square-integrable and Inline graphic, we obtain that

graphic file with name pone.0021105.e614.jpg
graphic file with name pone.0021105.e615.jpg
graphic file with name pone.0021105.e616.jpg

Hence Inline graphic.

Equation (10) follows by considering Inline graphic, Inline graphic and Inline graphic. Similarly, equation (11) follows by considering Inline graphic and Inline graphic.
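The argument above can be summarized in one chain of identities. The following is a sketch in which the symbols X (the predictor), Y (the predicted quantity) and W (the conditioning vector) are assumed names, since the original notation is not legible in this copy. It further assumes that Y is a function of W, that E(X | W) = Y, and that both variances are finite and strictly positive.

```latex
\operatorname{Cov}(X, Y)
  = E(XY) - E(X)\,E(Y)
  = E\big(Y \, E(X \mid W)\big) - E(Y)^2
  = E(Y^2) - E(Y)^2
  = V(Y),
\qquad \text{hence} \qquad
\rho(X, Y) = \frac{V(Y)}{\sqrt{V(X)\, V(Y)}} = \sqrt{\frac{V(Y)}{V(X)}} .
```

Under these assumptions the correlation is non-negative, and it is at most one because V(X) = V(Y) + E(V(X | W)) ≥ V(Y).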

Proof of Inequality (12)

First note that

graphic file with name pone.0021105.e623.jpg (17)

Now observe that Inline graphic because of inequality (13), which implies that Inline graphic because Inline graphic. On the other hand, because Robbins' and Starr's estimators are both unbiased for Inline graphic, we have Inline graphic. Furthermore, according to the proof of Theorem 2 in [21], Inline graphic, therefore

graphic file with name pone.0021105.e630.jpg

As a result, Inline graphic. Inequality (12) is now a direct consequence of inequality (17), and the next identities [21]:

graphic file with name pone.0021105.e632.jpg
graphic file with name pone.0021105.e633.jpg
graphic file with name pone.0021105.e634.jpg

Acknowledgments

We would like to thank three anonymous referees for their careful reading of our manuscript and their numerous suggestions, which were incorporated in this final version. The authors are also thankful to R. Knight for contributing to an early version of the code, implementing some of the analyses, and commenting on the manuscript.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: M. Lladser was partially supported by NASA ROSES NNX08AP60G and NIH R01 HG004872. R. Gouet was supported by FONDAP, BASAL-CMM and Fondecyt 1090216. J. Reeder was supported by a post-doctoral scholarship from the German Academic Exchange Service (DAAD). M. Lladser and R. Gouet are thankful to the project Núcleo Milenio Información y Aleatoriedad, which facilitated a research visit to collaborate in person. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA. 2006;103:12115–12120. doi: 10.1073/pnas.0605127103.
  • 2. Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJ. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl Environ Microbiol. 2001;67:4399–4406. doi: 10.1128/AEM.67.10.4399-4406.2001.
  • 3. Schloss PD, Handelsman J. Status of the microbial census. Microbiol Mol Biol Rev. 2004;68:686–691. doi: 10.1128/MMBR.68.4.686-691.2004.
  • 4. Curtis TP, Head IM, Lunn M, Woodcock S, Schloss PD, et al. What is the extent of prokaryotic diversity? Phil Trans R Soc Lond. 2006;361:2023–2037. doi: 10.1098/rstb.2006.1921.
  • 5. Roesch LF, Fulthorpe RR, Riva A, Casella G, Hadwin AK, et al. Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J. 2007;1:283–290. doi: 10.1038/ismej.2007.53.
  • 6. Hong SH, Bunge J, Jeon SO, Epstein SS. Predicting microbial species richness. Proc Natl Acad Sci USA. 2006;103:117–122. doi: 10.1073/pnas.0507245102.
  • 7. Quince C, Curtis TP, Sloan WT. The rational exploration of microbial diversity. ISME J. 2008;2:997–1006. doi: 10.1038/ismej.2008.69.
  • 8. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–484. doi: 10.1038/nature07540.
  • 9. Fierer N, Lauber CL, Zhou N, McDonald D, Costello EK, et al. Forensic identification using skin bacterial communities and/or references within. Proc Natl Acad Sci USA. 2010;107:6477–6481. doi: 10.1073/pnas.1000162107.
  • 10. Magurran AE. Measuring Biological Diversity. Oxford: Blackwell; 2004.
  • 11. Burnham KP, Overton WS. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika. 1978;65:625–633.
  • 12. Chao A. Nonparametric estimation of the number of classes in a population. Scand J Stat. 1984;11:265–270.
  • 13. Chao A. Estimating the population size for capture-recapture data with unequal catchability. Biometrics. 1987;43:783–791.
  • 14. Mao CX, Lindsay BG. Estimating the number of classes. Ann Stat. 2007;35:917–930.
  • 15. Bunge J, Fitzpatrick M. Estimating the number of species: A review. J Am Stat Assoc. 1993;88:364–373.
  • 16. Hinsley F, Stripp A. Codebreakers: The Inside Story of Bletchley Park. Oxford Univ. Press; 1993.
  • 17. Finch SJ, Mendell NR, Thode HC Jr. Probabilistic measures of adequacy of a numerical search for a global maximum. J Am Stat Assoc. 1989;84:1020–1023.
  • 18. Mao CX. Predicting the conditional probability of discovering a new class. J Am Stat Assoc. 2004;99:1108–1118.
  • 19. Good IJ. The population frequencies of species and the estimation of population parameters. Biometrika. 1953;40:237–264.
  • 20. Robbins HE. On estimating the total probability of the unobserved outcomes of an experiment. Ann Math Stat. 1968;39:256–257.
  • 21. Starr N. Linear estimation of the probability of discovering a new species. Ann Stat. 1979;7:644–652.
  • 22. Clayton MK, Frees EW. Nonparametric estimation of the probability of discovering a new species. J Am Stat Assoc. 1987;82:305–311.
  • 23. Esty WW. A Normal limit law for a nonparametric estimator of the coverage of a random sample. Ann Stat. 1983;11:905–912.
  • 24. Aldous D. Probability Approximations via the Poisson Clumping Heuristic. Springer-Verlag; 1988.
  • 25. Mahmoud HM. Sorting: A Distribution Theory. Wiley-Interscience; 2000.
  • 26. Hwang HK, Janson S. Local limit theorems for finite and infinite urn models. Ann Probab. 2008;36:992–1022.
  • 27. Mao CX, Lindsay BG. A Poisson model for the coverage problem with a genomic application. Biometrika. 2002;89:669–681.
  • 28. Durrett R. Essentials of stochastic processes. Springer Texts in Statistics; 1999.
  • 29. Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, et al. The effect of diet on the human gut microbiome: A metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med. 2009;1:6ra14. doi: 10.1126/scitranslmed.3000322.
  • 30. Fierer N, Hamady M, Lauber CL, Knight R. The influence of sex, handedness, and washing on the diversity of hand surface bacteria. Proc Natl Acad Sci USA. 2008;105:17994–17999. doi: 10.1073/pnas.0807920105.
  • 31. Ross SM. Simulation. Academic Press, third edition; 2002.
  • 32. Abramowitz M, Stegun IA. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover, ninth Dover printing, tenth GPO printing edition; 1964.

Articles from PLoS ONE are provided here courtesy of PLOS
