Abstract
Anne Chao proposed a very popular nonparametric estimator of the species richness of a community, based on a sample of limited size drawn from that community. The expression was originally derived on a statistical basis as a lower-bound estimate of the number of species missing from the sample and accordingly provides a minimal threshold for the total species richness of the community. Here, we propose an alternative, algebraic derivation of Chao's estimator and thereby demonstrate that Chao's formulation may also provide centered estimates (and not only a lower-bound threshold), provided that the sampled community satisfies a specific type of species abundance distribution (SAD). This particular SAD corresponds to the case in which the number of unrecorded species decreases exponentially with increasing sample size. The shape of this “ideal” SAD often conforms approximately to the types usually recorded in nature, such as “log-normal” or “broken-stick.” This may explain why Chao's formulation is generally recognized as a particularly satisfactory nonparametric estimator.
1. Introduction
Estimating the total species richness of a large community from samples of limited size is a common, long-standing challenge that has given rise to numerous estimation procedures.
For a few decades, a series of so-called nonparametric estimators have provided elegant and convenient solutions to this question. These estimators are usually simple formulas which, moreover, require no specific assumption regarding the statistical distribution of species abundances (hence the term “nonparametric”).
Within this category of formulations, Anne Chao proposed a very popular nonparametric expression, which actually stands among the most commonly used estimators of the total species richness of a sampled community.
Let Δ be the number of species that are missed by (i.e., unrecorded within) the limited sampling of a large community of species. Then, according to Chao's formulation [1],
$$\Delta = \frac{f_1^2}{2 f_2} \tag{1}$$
with $f_1$ and $f_2$ as the numbers of species encountered only once and twice, respectively, in the sample.
This expression was later generalised [2] as
$$\Delta = \left[\frac{f_1^{x}}{x!\, f_x}\right]^{1/(x-1)} \tag{2}$$
with $f_x$ standing for the number of species recorded x times in the sample.
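As an illustration, the two expressions can be evaluated directly from a list of per-species counts. The following is a minimal sketch, not the author's code: the function names and the sample counts are hypothetical, and the generalised form is taken as Δ = [f_1^x/(x!·f_x)]^(1/(x−1)), which reduces to f_1²/(2·f_2) for x = 2.

```python
from collections import Counter
from math import factorial

def frequency_counts(abundances):
    """Map x -> f_x, the number of species recorded exactly x times."""
    return Counter(n for n in abundances if n > 0)

def chao_missing(f1, fx, x=2):
    """Estimated number of missing species: [f1**x / (x! * fx)] ** (1 / (x - 1))."""
    if x < 2:
        raise ValueError("x must be at least 2")
    if fx <= 0:
        raise ValueError("f_x must be positive for the estimate to be defined")
    return (f1 ** x / (factorial(x) * fx)) ** (1.0 / (x - 1))

# Hypothetical sample: number of individuals counted for each recorded species.
sample = [12, 7, 5, 3, 2, 2, 1, 1, 1, 1]
f = frequency_counts(sample)

delta = chao_missing(f[1], f[2])   # x = 2 case, f1**2 / (2 * f2)
richness = len(sample) + delta     # recorded species + estimated missing species
print(delta, richness)
```

The estimate is undefined when no species is recorded exactly x times, which is why the sketch raises an error in that case rather than returning a value.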
These expressions, derived on a statistical basis, provide a lower-bound estimate of the number Δ of missing species in a sampled community [1, 2]. That is, Chao's formulation is expected to provide only a minimal threshold (in the statistical sense) for the estimated species richness. Yet several decades of practice (especially in ecology and biodiversity surveys) have placed Chao's expression among the most valuable and reliable estimators [3–7] since, on many occasions, it appears to provide approximately centered rather than only lower-bound estimates. In short, although designed conceptually as a lower-bound evaluation, Chao's formulation fairly often provides rather centered estimates in common practice.
Hereafter, we address this apparent paradox and propose new insights and arguments, drawn from an alternative, algebraic derivation of the formulation that Chao originally derived on a statistical basis.
2. The Specific Condition Allowing an Alternative, Algebraic Derivation of the “Chao” Formulation
We demonstrate (see Appendix A for mathematical details) that the general expression of Chao's estimator, originally established on a statistical basis, also admits an algebraic derivation, leading to exactly the same expression as the statistically derived formulation, $\Delta = [f_1^{x}/(x!\, f_x)]^{1/(x-1)}$.
Yet, while the statistical derivation of Chao's formulation requires no particular restriction (“nonparametric”), the algebraic derivation implies a particular shape for the expected decrease of the proportion Δ/S of missing (i.e., unrecorded) species as the sample size N increases. In fact, as demonstrated in Appendix A, this asymptotic decrease should conform to a negative exponential:
$$\frac{\Delta}{S} = \exp(-kN) \tag{3}$$
with
(i) “S” as the species richness, that is, the total (unknown) number of species of the community,
(ii) “k” as a constant,
(iii) “N” as the sample size, that is, the number of individuals recorded in the sample.
In turn, this particular form of the decrease of the number of missing species with enlarging sample sizes constrains the shape of the species abundance distribution (the “SAD,” that is, the distribution of species abundances when species are conventionally ranked by decreasing order of abundance).
According to (3), the number “r” of recorded species in the sample is
$$r = S - \Delta = S\left[1 - \exp(-kN)\right] \tag{4}$$
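Let k′ be the number of individuals belonging to the least abundant species among the “r” species recorded in the sample (i.e., the species of rank “r” when species are ranked by decreasing order of abundance). Then, the relative abundance $a_r$ of the species of rank r is expected to be inversely proportional to N, as
$$a_r = \frac{k'}{N} \tag{5}$$
With a negative exponential decrease of Δ/S, the sample size needed to record r species is $N = -\ln(1 - r/S)/k$, so that
$$a_r = \frac{-k\,k'}{\ln(1 - r/S)} \tag{6}$$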
Depending on the sample size N, every species may in turn play the role of the least abundant species within the sample: as the sample size N is continuously decreased, each species of the community (including, at last, the most common) successively becomes the least abundant species in the sample.
Therefore, (6) holds in the same form for any species of any rank “i” in the SAD:
$$a_i = \frac{-k\,k'}{\ln(1 - i/S)} \tag{7}$$
This equation thus describes the shape of the species abundance distribution ($a_i = f(i)$) when the proportion of unrecorded species in a sample decreases exponentially with the sample size, that is, the shape of the species abundance distribution that conditions the validity of the algebraic derivation of Chao's estimator. Figure 1 provides examples of the corresponding shapes of the SAD.
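The shape of this SAD can be sketched numerically. The block below is an illustration under stated assumptions, not a reproduction of Figure 1: it assumes that the negative exponential decrease reads Δ/S = exp(−kN) and that the relative abundance of the least abundant recorded species is k′/N, which together give abundances proportional to −1/ln(1 − i/S). The constants k and k′ cancel once abundances are normalised to sum to 1. The function name `ideal_sad` and the half-rank offset are hypothetical choices; the offset only keeps the last rank (i = S) away from the singular point ln(0).

```python
import math

def ideal_sad(S, offset=0.5):
    """Relative abundances proportional to -1 / ln(1 - (i - offset)/S), summing to 1.

    `offset` is an assumption of this sketch (not taken from the paper):
    it avoids evaluating ln(0) at the last rank i = S.
    """
    weights = [-1.0 / math.log(1.0 - (i - offset) / S) for i in range(1, S + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# The two richness values mentioned in the caption of Figure 1.
sad_20 = ideal_sad(20)
sad_50 = ideal_sad(50)

for rank, a in enumerate(sad_20, start=1):
    print(rank, f"{100 * a:.2f}%")
```

Plotting these values on a logarithmic abundance axis against rank is one way to inspect how close the curve is to a log-normal or broken-stick distribution.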
Figure 1.

Typical shape of the “SAD” when the number Δ of unrecorded species is decreasing as a negative exponential of the sample size. The sum of abundances is normalised at 100% (open circles: total species richness = 20; black diamonds: total species richness = 50).
3. The Resulting Restrictive Condition Which Allows the “Chao” Formulation to Become a Centered Estimator of Species Richness
As the algebraic derivation is deterministic in essence, it provides a centered estimate of Δ and of the resulting total species richness S, instead of only a lower-bound estimate, as is the case in the nonparametric context. As mentioned above, the algebraic derivation of Chao's formulation requires that the sampled community satisfies, at least approximately, the particular type of SAD defined by (7) and illustrated in Figure 1. This restrictive condition is the “price” to be paid for the more accurate estimate, namely, the loss of the strict “nonparametric” character of the statistically based conception.
Yet, this condition assigned to the shape of the distribution of species abundances might not be so restrictive in practice, at least as a first approximation. Reasons for this may be as follows:
(a) an asymptotic decrease of Δ to zero with N (equation (3)) seems logical and intuitive, because aiming to estimate the total number of species in a community implicitly requires that this number exists and can be progressively reached as the sample size N increases continuously;
(b) among the different types of accumulation curves with such an asymptotic evolution, the negative exponential response of Δ to increasing sample size is, admittedly, among the simplest, most robust, and seemingly most common [8, 9];
(c) the sigmoidal shape of the prescribed SAD (equation (7) and Figure 1) is not far from the most classically cited empirical types, the broken-stick and log-normal distributions [10]. Yet strict conformity with any empirical model is not expected a priori. For example, the equation $i = f(a_i)$ for the SAD corresponding to a broken-stick distribution is $i \approx S \cdot \exp(-S \cdot a_i)$ [10], which is formally different from (7).
Accordingly, it is no real surprise that Chao's formula often approaches a strictly centered estimate, in spite of being only a lower-bound estimate in full generality.
This would explain why, in ecological practice in particular, Chao's formulation is nevertheless considered one of the more accurate and reliable estimators of the total species richness within partially sampled communities.
As mentioned in particular by Gotelli (personal communication), Chao's estimates still tend to increase somewhat when a series of samples of growing size is drawn from the same community, instead of remaining ideally stable on average. This, however, does not necessarily contradict the preceding arguments; it should result from the residual discrepancy between the real SAD and the ideal model described by (7) and exemplified in Figure 1.
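The behaviour of the estimate with growing sample size can be explored with a small deterministic sketch. It is an illustration only: it uses the expected values of f_0, f_1, and f_2 under binomial sampling instead of random samples, and the geometric SAD, the function names, and the sample sizes are all hypothetical choices. For any SAD, the Cauchy–Schwarz inequality guarantees that f_1²/(2·f_2) computed from these expected values cannot exceed f_0·N/(N − 1), which is the lower-bound property in its simplest form.

```python
from math import comb

def expected_f(p, N, x):
    """Expected number of species recorded exactly x times among N individuals."""
    return comb(N, x) * sum(pi ** x * (1.0 - pi) ** (N - x) for pi in p)

def chao_richness(p, N):
    """Expected recorded richness plus f1**2 / (2 * f2), using expected counts."""
    f0 = expected_f(p, N, 0)   # expected number of unrecorded species
    f1 = expected_f(p, N, 1)
    f2 = expected_f(p, N, 2)
    recorded = len(p) - f0
    return recorded + f1 ** 2 / (2.0 * f2)

# Hypothetical community of 50 species with a geometric SAD (an assumption).
raw = [0.9 ** i for i in range(50)]
p = [v / sum(raw) for v in raw]

for N in (100, 200, 400, 800):
    print(N, round(chao_richness(p, N), 2), "true richness:", len(p))
```

Comparing the printed values across N shows whether, for the chosen SAD, the estimate drifts with sample size or stays stable.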
Acknowledgments
Two anonymous reviewers are gratefully acknowledged for their useful suggestions on a previous version of the paper. Anne Chao and Nick Gotelli also provided encouraging comments on the original version of the paper.
Appendix
A. An Alternative, Algebraic Derivation of the Chao 1 Estimator
Consider a community of species, actually containing an unknown total number “S” of species.
Let $p_i$ be the probability of occurrence (assimilated to the relative abundance) of species “i” within this community and let N be the sample size, that is, the number of individuals recorded in the sample.
The estimated number of species that escape recording during sampling of the community is Δ(N):
$$\Delta(N) = \sum_i (1 - p_i)^{N} \tag{A.1}$$
with $\sum_i$ denoting summation over all “S” species “i” (either recorded or not).
The number $f_x$ of species recorded x times in the sample is then, according to the binomial distribution:
$$f_x = C_{N,x} \sum_i p_i^{x} (1 - p_i)^{N-x} \tag{A.2}$$
with $C_{N,x} = N!/[x!\,(N-x)!]$.
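These two expected values are easy to evaluate for a known community. The sketch below assumes the usual binomial forms, Δ(N) = Σ_i (1 − p_i)^N and f_x = C(N, x) Σ_i p_i^x (1 − p_i)^(N − x); the function names and the example community are hypothetical.

```python
from math import comb

def expected_missing(p, N):
    """Expected number of species absent from a sample of N individuals."""
    return sum((1.0 - pi) ** N for pi in p)

def expected_f(p, N, x):
    """Expected number of species recorded exactly x times (binomial sampling)."""
    return comb(N, x) * sum(pi ** x * (1.0 - pi) ** (N - x) for pi in p)

# Hypothetical community of 30 species with geometric-like relative abundances.
raw = [0.8 ** i for i in range(30)]
p = [v / sum(raw) for v in raw]
N = 200

delta = expected_missing(p, N)
f1 = expected_f(p, N, 1)
f2 = expected_f(p, N, 2)
print(delta, f1, f2)
```

Two consistency checks follow from the definitions: f_0 coincides with Δ(N), and the f_x summed over x = 0, …, N give back the total number of species S.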
A.1. Deriving $f_x$ as a Function of Δ(N)
The number $f_x$ of species recorded x times in the sample may be derived as a function of Δ(N).
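(i) Consider
$$f_1 = N \sum_i p_i (1 - p_i)^{N-1} = N \sum_i \left[1 - (1 - p_i)\right](1 - p_i)^{N-1} = N \left[\sum_i (1 - p_i)^{N-1} - \sum_i (1 - p_i)^{N}\right] \tag{A.3}$$
According to (A.1), it becomes as follows: $f_1 = N(\Delta(N-1) - \Delta(N)) = -N(\Delta(N) - \Delta(N-1))$.
As appropriate sampling requires a size N ≫ 1, the difference (Δ(N) − Δ(N−1)) can be likened to the corresponding derivative (∂Δ(N)/∂N). Then,
$$f_1 = -N\,\frac{\partial \Delta(N)}{\partial N} = -N\,\Delta(N)' \tag{A.4}$$
where Δ(N)′ is the first derivative of Δ(N) with respect to N. Thus,
$$\Delta(N)' = -\frac{f_1}{N} \tag{A.5}$$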
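The quality of this approximation can be checked numerically. The sketch below is an illustration with a hypothetical community; it assumes Δ(N) = Σ_i (1 − p_i)^N, so that the derivative with respect to a continuous N is Σ_i ln(1 − p_i)(1 − p_i)^N. The finite-difference identity f_1 = N(Δ(N − 1) − Δ(N)) is exact, whereas f_1 ≈ −N·Δ(N)′ holds only approximately.

```python
import math

def missing(p, N):
    """Delta(N): expected number of unrecorded species."""
    return sum((1.0 - pi) ** N for pi in p)

def missing_derivative(p, N):
    """Derivative of Delta(N) with respect to N, treating N as continuous."""
    return sum(math.log(1.0 - pi) * (1.0 - pi) ** N for pi in p)

def expected_singletons(p, N):
    """f_1: expected number of species recorded exactly once."""
    return N * sum(pi * (1.0 - pi) ** (N - 1) for pi in p)

# Hypothetical community of 40 species with abundances proportional to 1/rank.
raw = [1.0 / (i + 1) for i in range(40)]
p = [v / sum(raw) for v in raw]
N = 300

f1 = expected_singletons(p, N)
from_difference = N * (missing(p, N - 1) - missing(p, N))  # exact identity
from_derivative = -N * missing_derivative(p, N)            # approximation
print(f1, from_difference, from_derivative)
```

The gap between the last two printed values gives a direct measure of the error made when the difference Δ(N) − Δ(N − 1) is replaced by the derivative.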
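Similarly,
(ii) consider
$$f_2 = C_{N,2} \sum_i p_i^{2} (1 - p_i)^{N-2} = C_{N,2} \sum_i \left[1 - (1 - p_i)(1 + p_i)\right](1 - p_i)^{N-2} = C_{N,2}\left[\Delta(N-2) - \Delta(N-1) - \frac{f_1}{N}\right] = C_{N,2}\left[-\Delta(N-1)' + \Delta(N)'\right] = C_{N,2}\,\Delta(N)'' \tag{A.6}$$
where Δ(N)′′ is the second derivative of Δ(N) with respect to N. Thus,
$$\Delta(N)'' = \frac{f_2}{C_{N,2}} \tag{A.7}$$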
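(iii) Consider
$$f_3 = C_{N,3} \sum_i p_i^{3} (1 - p_i)^{N-3} = C_{N,3} \sum_i \left[1 - (1 - p_i)(1 + p_i + p_i^{2})\right](1 - p_i)^{N-3} = C_{N,3}\left[\Delta(N-3) - \Delta(N-2) - \frac{f_1^{*}}{N-1} - \frac{f_2}{C_{N,2}}\right] \tag{A.8}$$
where $f_1^{*}$ is the number of singletons that would be recorded in a sample of size (N − 1) instead of N.
$$f_1^{*} = -(N-1)\,\Delta(N-1)' \tag{A.9}$$
where Δ(N − 1)′ is the first derivative of Δ(N) with respect to N, at point (N − 1). Then,
$$f_3 = C_{N,3}\left[-\Delta(N-2)' + \Delta(N-1)' - \Delta(N)''\right] = C_{N,3}\left[\Delta(N-1)'' - \Delta(N)''\right] = -C_{N,3}\,\Delta(N)''' \tag{A.10}$$
where Δ(N)′′′ is the third derivative of Δ(N) with respect to N. Thus,
$$\Delta(N)''' = -\frac{f_3}{C_{N,3}} \tag{A.11}$$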
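(iv) Generalising this approach to calculate the number $f_x$ of species recorded x times in the sample:
$$f_x = C_{N,x} \sum_i p_i^{x} (1 - p_i)^{N-x} = C_{N,x} \sum_i \left[1 - (1 - p_i)\sum_j p_i^{j}\right](1 - p_i)^{N-x} \tag{A.12}$$
with $\sum_j$ as the summation from j = 0 to j = x − 1. It becomes as follows:
$$f_x = C_{N,x}\left[\sum_i (1 - p_i)^{N-x} - \sum_i (1 - p_i)^{N-x+1} - \sum_k \sum_i p_i^{k} (1 - p_i)^{N-x+1}\right] \tag{A.13}$$
with $\sum_k$ as the summation from k = 1 to k = x − 1; that is,
$$f_x = C_{N,x}\left[\Delta(N-x) - \Delta(N-x+1) - \sum_k \frac{f_k^{*}}{C_{(N-x+1+k),k}}\right] \tag{A.14}$$
where $C_{(N-x+1+k),k} = (N-x+1+k)!/[k!\,(N-x+1)!]$ and $f_k^{*}$ is the number of species recorded k times during a sampling of size (N − x + 1 + k) (instead of size N).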
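The same reasoning that yielded the expression of $f_1^{*}$ above (see (A.9)) applies to the $f_k^{*}$ (with k up to x − 1) and gives
$$f_k^{*} = (-1)^{k}\, C_{(N-x+1+k),k}\, \Delta(N-x+1+k)^{(k)} \tag{A.15}$$
where $\Delta(N-x+1+k)^{(k)}$ is the kth derivative of Δ(N) with respect to N, at point (N − x + 1 + k). Then,
$$f_x = C_{N,x}\left[\Delta(N-x) - \Delta(N-x+1) - \sum_k (-1)^{k}\, \Delta(N-x+1+k)^{(k)}\right] \tag{A.16}$$
which finally yields
$$f_x = (-1)^{x}\, C_{N,x}\, \Delta(N)^{(x)} \tag{A.17}$$
That is,
$$\Delta(N)^{(x)} = (-1)^{x}\, \frac{f_x}{C_{N,x}} \tag{A.18}$$
where $\Delta(N)^{(x)}$ is the xth derivative of Δ(N) with respect to N, at point N.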
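A relationship is thus derived between the series of the numbers $f_x$ of species recorded x times and the series of derivatives $\Delta(N)^{(x)}$ at order x for Δ(N):
$$\Delta(N)' = -\frac{f_1}{C_{N,1}}, \quad \Delta(N)'' = \frac{f_2}{C_{N,2}}, \quad \Delta(N)''' = -\frac{f_3}{C_{N,3}}, \quad \ldots, \quad \Delta(N)^{(x)} = (-1)^{x}\, \frac{f_x}{C_{N,x}} \tag{A.19}$$
These relations will serve as the basis for deriving the formulation of Δ as a function of the $f_x$.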
A.2. Deriving a Generalised Expression for the “Chao” Formulation
Let us consider now the case when the number of missing species Δ(N) conforms (or is close) to a negative exponential with respect to the sampling size N. Accordingly, Δ(N) would thus verify the following series of differential equations:
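$$\frac{\Delta(N)'}{\Delta(N)} = \frac{\Delta(N)''}{\Delta(N)'} = \frac{\Delta(N)'''}{\Delta(N)''} = \cdots = \frac{\Delta(N)^{(x)}}{\Delta(N)^{(x-1)}} \tag{A.20}$$
that yields $\Delta(N)/\Delta(N)^{(x)} = (\Delta(N)/\Delta(N)')^{x}$ and then $\Delta(N) = [(\Delta(N)')^{x}/\Delta(N)^{(x)}]^{1/(x-1)}$, where Δ(N)′, Δ(N)′′, Δ(N)′′′, and $\Delta(N)^{(x)}$ are the first, second, third, and xth derivatives of Δ(N) with respect to N.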
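As $\Delta(N)' = -f_1/N$ and $\Delta(N)^{(x)} = (-1)^{x} f_x/C_{N,x}$ (see (A.5) and (A.18)), it follows for Δ(N) that
$$\Delta(N) = \left[\frac{(-f_1/N)^{x}}{(-1)^{x} f_x/C_{N,x}}\right]^{1/(x-1)} = \left[\frac{C_{N,x}}{N^{x}}\,\frac{f_1^{x}}{f_x}\right]^{1/(x-1)} \tag{A.21}$$
Note that $C_{N,x}/N^{x} \approx 1/x!$ since, in practice, x remains far smaller than N. Accordingly,
$$\Delta(N) \approx \left[\frac{f_1^{x}}{x!\, f_x}\right]^{1/(x-1)} \tag{A.22}$$
In particular, for x = 2,
$$\Delta(N) \approx \frac{f_1^{2}}{2 f_2} \tag{A.23}$$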
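The final step can be verified numerically for an exactly exponential Δ(N). The sketch below is an illustration under stated assumptions: it takes Δ(N) = S·exp(−kN), so that the xth derivative is (−k)^x·Δ(N), and uses the relation f_x = (−1)^x·C(N, x)·Δ(N)^(x) to generate the f_x. The values of S, k, and N and the function names are hypothetical.

```python
from math import comb, exp, factorial

def chao_general(f1, fx, x):
    """Generalised estimate: [f1**x / (x! * fx)] ** (1 / (x - 1))."""
    return (f1 ** x / (factorial(x) * fx)) ** (1.0 / (x - 1))

# Hypothetical values (assumptions of this sketch).
S, k, N = 100.0, 0.004, 500
delta = S * exp(-k * N)   # missing species under the exponential model

def f_from_exponential(x):
    """f_x = (-1)**x * C(N, x) * Delta^(x), with Delta^(x) = (-k)**x * Delta."""
    return comb(N, x) * k ** x * delta

f1 = f_from_exponential(1)
for x in (2, 3, 4):
    estimate = chao_general(f1, f_from_exponential(x), x)
    # Ratio of the estimate to the true Delta, and quality of C(N,x)/N**x ~ 1/x!
    print(x, estimate / delta, comb(N, x) * factorial(x) / N ** x)
```

The small departure of the ratio from 1 comes only from replacing C(N, x)/N^x by 1/x!, which is why it shrinks as N grows relative to x.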
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
References
- 1. Chao A. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics. 1984;11(4):265–270.
- 2. Chao A., Shen T.-J. Nonparametric prediction in species sampling. Journal of Agricultural, Biological, and Environmental Statistics. 2004;9(3):253–269. doi: 10.1198/108571104X3262.
- 3. Coddington J. A., Young L. H., Coyle F. A. Estimating spider species richness in a Southern Appalachian cove hardwood forest. Journal of Arachnology. 1996;24(2):111–128.
- 4. Hraber P. T. Discovering Molecular Mechanisms of Mutualism with Computational Approaches to Endosymbiosis. Santa Fe, NM, USA: University of New Mexico; 2001.
- 5. Aubry S., Magnin F., Bonnet V., Preece R. C. Multi-scale altitudinal patterns in species richness of land snail communities in South-Eastern France. Journal of Biogeography. 2005;32(6):985–998. doi: 10.1111/j.1365-2699.2005.01275.x.
- 6. Brittain S., Böhning D. Estimators in capture-recapture studies with two sources. Advances in Statistical Analysis. 2009;93(1):23–47. doi: 10.1007/s10182-008-0085-y.
- 7. Chao A., Colwell R. K., Lin C.-W., Gotelli N. J. Sufficient sampling for asymptotic minimum species richness estimators. Ecology. 2009;90(4):1125–1133. doi: 10.1890/07-2147.1.
- 8. Holdridge L. R., Grenke W. G., Hatheway W. H., Liang T., Tosi J. A. Forest Environments in Tropical Life Zones. Oxford, UK: Pergamon Press; 1971.
- 9. Soberon M. J., Llorente B. J. The use of species accumulation functions for the prediction of species richness. Conservation Biology. 1993;7(3):480–488. doi: 10.1046/j.1523-1739.1993.07030480.x.
- 10. May R. M. Patterns of species abundance and diversity. In: Cody M. L., Diamond J. M., editors. Ecology and Evolution of Communities. Cambridge, Mass, USA: The Belknap Press of Harvard University; 1975.
