Summary
We propose a Poisson-compound gamma approach for species richness estimation. Based on the denseness and nesting properties of the gamma mixture, we fix the shape parameter of each gamma component at a unified value, and estimate the mixture using nonparametric maximum likelihood. A least-squares crossvalidation procedure is proposed for the choice of the common shape parameter. The performance of the resulting estimator of N is assessed using numerical studies and genomic data.
Some key words: Crossvalidation, Nesting property of gamma mixtures, Nonparametric maximum likelihood estimation, Poisson-compound gamma model, Species richness estimation
1. Introduction
A typical dataset in the species problem comprises a series of counts {xi : i = 1, . . ., D}, recording the numbers of individuals from a total of D observed distinct species. These counts are often summarized into the sufficient statistic n = (n1, . . ., nt), where nj = ∑i I {xi = j}, t = max{xi}. The parameter for estimation is the total number of the distinct species N in the underlying population. Research on this old but challenging problem is motivated by several novel genomic applications, and the following are two examples.
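As a small illustration of the notation above, the following sketch tabulates the frequency counts n = (n1, . . ., nt) from per-species counts xi. The function name and the six-species sample are hypothetical and serve only to make the definition nj = ∑i I{xi = j} concrete.

```python
from collections import Counter


def frequency_counts(x):
    """Return D and n = (n_1, ..., n_t) from per-species counts x_i >= 1."""
    if any(xi < 1 for xi in x):
        raise ValueError("only observed species (x_i >= 1) belong in the sample")
    tally = Counter(x)
    t = max(x)  # t = max{x_i}
    n = [tally.get(j, 0) for j in range(1, t + 1)]
    return len(x), n


# Hypothetical sample of six observed species (not data from the paper).
x = [1, 1, 2, 1, 4, 2]
D, n = frequency_counts(x)
print(D, n)  # 6 [3, 2, 0, 1]
```

The number of unobserved species, N − D, is exactly what this summary cannot reveal, which is why a model for the abundance distribution is needed.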
Serial analysis of gene expression data and expressed sequence tag data are two similar types of genomic data widely used for surveying transcribed genes. Both are collections of tags or signatures of genes, representing random samples from the entire gene expression population or mRNA pool (Thygesen & Zwinderman, 2006; Morris et al., 2003; Wang et al., 2005). We let nj stand for the number of genes that have j tags in a sample; but owing to the limited sampling or sequencing effort, some expressed genes may not be observed. By knowing the total number of expressed genes, biologists can gain insights into the diversity of active genes in the given tissue.
The diversity of microbial species present can provide an interesting profile of a microbial community (Acinas et al., 2004; Hong et al., 2006). For example, using the 16S ribosomal RNA gene as the signature, researchers can identify distinct microbial species present in a soil sample. However, sampling is often shallow in practice, i.e. only a small fraction of the species is obtained, and the total diversity needs to be estimated.
In both examples, the gene expression level or the species abundance varies substantially. The first challenge in the estimation of N is how to model the heterogeneity. Ignoring it may result in a large downward bias (Chao & Lee, 1992; Bunge & Fitzpatrick, 1993; Norris & Pollock, 1998; Böhning & Schön, 2005). In a likelihood-based approach, one first specifies a probability distribution function f(xi; λi) (λi > 0), and then imposes a nondegenerate distribution Q(λ; θ), θ ∈ Θ ⊂ ℝ^d, typically continuous, for the species abundance index λi. Thus, the marginal distribution of X becomes a mixture distribution, i.e. f(x; Q) = ∫ f(x; λ)dQ(λ). However, an inappropriate choice of Q can lead to systematic bias of the resulting estimator.
A more robust approach is to estimate Q by nonparametric maximum likelihood (Norris & Pollock, 1998). The nonparametric maximum likelihood estimator, Q̂, is in the form of a discrete distribution (Lindsay, 1995), and in the case of a Poisson mixture it is unique with a finite number of support points for any dataset. Wang & Lindsay (2005) proposed a penalized nonparametric maximum likelihood estimator, which is more stable than that of Norris & Pollock (1998). Nevertheless, severe underestimation can occur due to the interplay of factors including inadequate sampling effort, heterogeneity and skewness of the abundance curve. This weakness is also observed in other non-likelihood-based nonparametric approaches. In the motivating genomic applications, shallow sampling is typical. Thus, underestimation is the second major challenge.
In the motivating applications, a continuous Q can be appealing for interpretability of species abundance. More importantly, it could better capture the information of species abundance near zero, and hence improve the estimation of N when underestimation is a major concern. The objective of this study is to model Q using a smooth curve while estimating it using nonparametric maximum likelihood based on a Poisson-compound gamma model.
2. The model and maximum likelihood estimation
2.1. A general model
Consider a mixture distribution f(x; H) defined by a density kernel f(x; λ), λ ∈ ℝ, and a mixing distribution H(λ; μ, ν), μ, ν ∈ ℝ. Here, we assume that H is continuous and has location and dispersion parameters μ and ν, respectively. Furthermore, as ν → 0, H becomes degenerate, and thus f(x; H) converges to a unicomponent distribution. If we fix ν and allow a mixing distribution G for μ, then the overall mixing distribution for λ becomes continuous, having the form Q(λ) = ∫ H(λ; μ, ν)dG(μ). The mixture density of X becomes $f(x; Q) = \int f_\nu(x; \mu)\,dG(\mu)$, where $f_\nu(x; \mu) = \int f(x; \lambda)\,dH(\lambda; \mu, \nu)$. If H is chosen to be conjugate for f, then $f_\nu(x; \mu)$ can often be obtained in a simple closed form. For each given ν, the global maximum likelihood estimator of G, say Ĝν, can be found by nonparametric maximum likelihood estimation (Lindsay, 1983, 1995). The estimator Ĝν is a discrete distribution, while the resulting mixing distribution Q of λ, denoted by Q̂ν(λ), is continuous. In this paper, we term Q̂ν(λ) the smooth nonparametric maximum likelihood estimator (Magder & Zeger, 1996).
Alternatively, one can ignore the H structure and directly estimate Q(λ) in the defined mixture model f (x; Q) = ∫ f (x; λ)dQ by nonparametric maximum likelihood. The resulting discrete Q̂(λ) maximizes the likelihood over all Q; we call it the discrete nonparametric maximum likelihood estimator.
2.2. Properties of gamma mixtures
Let $h(\lambda; \alpha, \beta) = \beta^{\alpha}\{\Gamma(\alpha)\}^{-1}\lambda^{\alpha-1}e^{-\beta\lambda}$ (λ, α, β > 0) be the gamma density. We consider two reparameterizations: (α, β) ↦ (α, μ) and (α, β) ↦ (μ, β), where μ = α/β. The following denseness property of the gamma mixture is cited from Fang & Chlamtac (1999, p. 1064):
Lemma 1. Let Q(λ) be the cumulative distribution function of a nonnegative random variable. Then it is possible to choose a sequence of distribution functions Qm(λ), each of which corresponds to a gamma mixture, such that Qm(λ) → Q(λ) as m → ∞, at any point where Q(λ) is continuous. For example, we may choose
$$Q_m(\lambda) = Q(0) + \sum_{k=1}^{\infty}\left\{Q\!\left(\frac{k}{m}\right) - Q\!\left(\frac{k-1}{m}\right)\right\} H_k^m(\lambda),$$
where $H_k^m$ is a gamma distribution with mean k/m and variance k/m².
Intuitively, when m becomes large, each individual gamma component approximates a delta function at k/m with a weight equal to the increase in the cumulative distribution function of Q from (k − 1)/m to k/m. Therefore, a gamma mixture model can be used to approximate any distribution defined on ℝ+. In practice, Q is unknown and the sample size is limited, and one can fit a gamma mixture by the method of maximum likelihood.
The second property is the nesting property of gamma mixtures.
Lemma 2. Let h(λ ; θ, μ0) be the reparameterized gamma density where θ = α or β and μ0 is the mean parameter. Then for any given δ > 0, there exists a unique probability measure G for μ such that for all λ > 0, ∫h(λ ; θ + δ, μ)dG(μ) = h(λ; θ, μ0).
Proof. See the Appendix.
The existence part of Lemma 2 for θ = α is a result from Lindsay (1995, p. 54), given in the new parameterization. The existence result for the case θ = β and the uniqueness in both cases are new. The uniqueness is a consequence of the identifiability presented in Theorem 2 below. Subsequently, we shall use h{θ, G(μ)} to denote a gamma mixture defined as ∫h(λ; θ, μ)dG(μ). The following nesting property is an immediate consequence of Lemma 2.
Theorem 1. Let δ > 0, θ > 0 and h(θ, μ) be the reparameterized gamma density, where θ = α or β. Let 𝒢 = {G(μ) : G is a probability measure on ℝ+}. Define the mixture sets ℋθ = [h{θ, G(μ)} : G ∈ 𝒢] and ℋθ+δ = [h{θ + δ, G(μ)} : G ∈ 𝒢]. Then ℋθ ⊂ ℋθ+δ.
Lemma 2 implies that an arbitrary gamma distribution h(θ, μ0) can be uniquely expressed as a gamma mixture h{θ + δ, G(μ)} for any given δ > 0. Hence, a gamma mixture model, h{θ, G(μ)} ∈ ℋθ, can be rewritten as a model in ℋθ+δ, but with a different G(μ). This implies that G(μ) is not identifiable in the entire distribution space 𝒢 if θ is unknown. Nevertheless, if θ is given, G is identifiable in 𝒢.
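The nesting property for θ = α can be checked numerically. The sketch below relies on the standard beta-gamma identity (if Y is gamma with shape α + δ and rate β, and B is an independent Beta(α, δ) variable, then YB is gamma with shape α and rate β), which yields one explicit G: μ = (α + δ)μ0B/α. This construction is our assumption for illustration and is not necessarily the one used in the Appendix; all function names are hypothetical.

```python
import math


def gamma_pdf(lam, shape, rate):
    """Gamma density h(lam; shape, rate), computed on the log scale."""
    log_h = (shape * math.log(rate) - math.lgamma(shape)
             + (shape - 1.0) * math.log(lam) - rate * lam)
    return math.exp(log_h)


def beta_pdf(b, a, d):
    """Beta(a, d) density at 0 < b < 1."""
    log_norm = math.lgamma(a + d) - math.lgamma(a) - math.lgamma(d)
    return math.exp(log_norm + (a - 1.0) * math.log(b) + (d - 1.0) * math.log(1.0 - b))


def nested_mixture_pdf(lam, alpha, delta, mu0, steps=20000):
    """Midpoint-rule value of the mixture of h(lam; alpha + delta, mu) over
    mu = (alpha + delta) * mu0 * b / alpha with b ~ Beta(alpha, delta)."""
    total = 0.0
    for i in range(steps):
        b = (i + 0.5) / steps
        mu = (alpha + delta) * mu0 * b / alpha
        rate = (alpha + delta) / mu  # rate = shape / mean
        total += gamma_pdf(lam, alpha + delta, rate) * beta_pdf(b, alpha, delta)
    return total / steps


# Gamma with alpha = 2 and mean 1 (rate 2), rewritten with shape alpha + delta = 3.
target = gamma_pdf(1.0, 2.0, 2.0)
mixed = nested_mixture_pdf(1.0, 2.0, 1.0, 1.0)
print(target, mixed)
```

The two printed values should agree up to quadrature error, illustrating that a single gamma component with shape α is itself a continuous mixture of gamma components with the larger shape α + δ.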
Theorem 2. Let h{θ, G(μ)} be a gamma mixture model as defined earlier with a given θ > 0 and an unknown G(μ) defined on ℝ+. Then G(μ) is identifiable in 𝒢.
Proof. See the Appendix.
The identifiability of the gamma mixture model under a given unified θ implies the uniqueness of G(μ), as stated in Lemma 2. The gamma mixture with different θs is not identifiable in the entire distribution space 𝒢. If we assume that the mixture is finite, i.e. the mixing distribution G is discrete with a finite number of support points, then the mixture becomes identifiable (Teicher, 1963). In practice, as we fit a finite mixture of gammas, the identifiability among the finite mixture set is adequate to justify the estimation.
2.3. Poisson-compound gamma model
The denseness property suggests that there exists a sequence of finite gamma mixtures that can approximate any Q on ℝ+. As Q is unknown, we need to estimate Q based on the data. If Q̃ is such an estimator in a finite gamma mixture form, then by the nesting property Q̃ can be rewritten as a new gamma mixture, say Q̃θ, with a unified θ parameter. However, the likelihood for such Q̃θ is always dominated by that of the smooth nonparametric maximum likelihood estimator Q̂θ obtained with the given θ. In other words, the smooth nonparametric maximum likelihood estimator with an appropriate choice of θ can also approximate Q arbitrarily well regardless of the form of true Q. Hence, robustness of this method is anticipated.
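We consider a Poisson-compound gamma model for the species problem. If the compound gamma has a fixed θ and a mixing distribution G for μ, then $f_\theta(x; G) = \iint e^{-\lambda}\lambda^{x}(x!)^{-1} h(\lambda; \theta, \mu)\,d\lambda\,dG(\mu)$. We can define $f_\theta(x; \mu) = \int e^{-\lambda}\lambda^{x}(x!)^{-1} h(\lambda; \theta, \mu)\,d\lambda$. For example, under the (α, μ) parameterization, $f_\alpha(x; \mu) = \frac{\Gamma(x+\alpha)}{\Gamma(\alpha)\,x!}\left(\frac{\alpha}{\alpha+\mu}\right)^{\alpha}\left(\frac{\mu}{\alpha+\mu}\right)^{x}$, a negative binomial probability. Then $f_\theta(x; G) = \int f_\theta(x; \mu)\,dG(\mu)$. The likelihood function is
$$L(N, G) = \binom{N}{D} f_\theta(0; G)^{N-D}\{1 - f_\theta(0; G)\}^{D} \times \frac{D!}{\prod_{j \ge 1} n_j!}\prod_{j \ge 1}\left\{\frac{f_\theta(j; G)}{1 - f_\theta(0; G)}\right\}^{n_j}, \qquad (1)$$
where the expressions before and after the multiplication sign are referred to as the marginal and conditional likelihood, respectively, and are denoted by $L_m(N, G)$ and $L_c(G)$ henceforth.
Based on the likelihood in (1), there are two possible likelihood approaches, conditional or unconditional (Sanathanan, 1972, 1977). In the conditional approach, the mixing distribution G is first estimated from $L_c$ alone, giving Ĝc, and N is then estimated by N̂c = ⌊D/{1 − fθ(0; Ĝc)}⌋, where ⌊x⌋ stands for the largest integer no greater than x. The unconditional approach seeks the pair (N̂u, Ĝu) to maximize the full likelihood, which must also satisfy N̂u = ⌊D/{1 − fθ(0; Ĝu)}⌋ (Wang & Lindsay, 2005).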
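To make the conditional estimator concrete, the following sketch evaluates the Poisson-gamma (negative binomial) kernel under the (α, μ) parameterization, the mixture probability for a discrete G, and N̂c = ⌊D/{1 − fα(0; G)}⌋. The function names are hypothetical, and the mixing distribution is supplied by hand rather than fitted; the nonparametric maximum likelihood step is not implemented here.

```python
import math


def nb_kernel(x, alpha, mu):
    """f_alpha(x; mu): Poisson mixed over a gamma with shape alpha and mean mu."""
    log_p = (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
             + alpha * math.log(alpha / (alpha + mu))
             + x * math.log(mu / (alpha + mu)))
    return math.exp(log_p)


def mixture_prob(x, alpha, support, weights):
    """f_alpha(x; G) for a discrete G with the given support points and weights."""
    return sum(w * nb_kernel(x, alpha, m) for m, w in zip(support, weights))


def conditional_estimate(D, alpha, support, weights):
    """Integer part of D / {1 - f_alpha(0; G)}."""
    f0 = mixture_prob(0, alpha, support, weights)
    return math.floor(D / (1.0 - f0))


# Assumed inputs: alpha = 2, G with mass 0.5 at mu = 1 and 0.5 at mu = 2,
# and a hypothetical D = 650 observed species.
f0 = mixture_prob(0, 2.0, [1.0, 2.0], [0.5, 0.5])
N_c = conditional_estimate(650, 2.0, [1.0, 2.0], [0.5, 0.5])
print(f0, N_c)
```

Because N̂c depends on the data only through D and fα(0; Ĝ), any instability in the fitted zero-class probability carries over directly to the estimate of N.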
In this paper, we consider the conditional approach for its computational simplicity. Below we shall use Ĝ to denote the conditional nonparametric maximum likelihood estimator of G. A frequently occurring issue in the conditional discrete nonparametric maximum likelihood approach is the boundary problem, where a component λ ≈ 0+ is fitted. The same problem arises in the unconditional discrete (Wang & Lindsay, 2005) and the finite mixture approaches. Unfortunately, the continuity brought into Q in the Poisson-compound gamma model does not help overcome this issue. As the gamma distribution has μ = α/β and σ² = μ/β, with the given α, fitting a μ ≈ 0+ implies β → ∞, and hence σ² → 0. Therefore, the fitted gamma component corresponds to a spike of λ at 0+. Such a tiny component is problematic because it blows up the resulting point estimate of N. In Wang & Lindsay (2008), it was further shown that this boundary problem is structural in the sense that the fitted probability at X = 1 for the zero-truncated sample must be no less than the empirical probability n1/∑i ni. It can be shown that this is also true for smooth nonparametric maximum likelihood estimation.
An effective solution to this problem is to penalize the likelihood. Define the odds parameter ψ as ψ(G) = fθ(0; G)/{1 − fθ(0; G)}. Then N̂c can be written as N̂c = ⌊D{1 + ψ(Ĝ)}⌋. Thus, penalizing large ψ values can stabilize the conditional estimator. Such a penalty function of ψ in the likelihood has a natural interpretation as a Bayesian prior for ψ, in which, however, the hyperparameter needs to be subjectively prespecified (Wang & Lindsay, 2005). To overcome this weakness, here we use a partial prior approach similar to that proposed in Wang & Lindsay (2008) for the binomial mixture. We impose an exponential prior for the odds parameter of the form $p(\psi; \eta) = \eta e^{-\eta\psi}$ (ψ, η > 0). As we are estimating the entire mixing distribution in 𝒢 while the prior is specified for a functional of G in ℝ+, such a prior is termed a partial prior. We use an empirical Bayes method to find (η̂, Ĝ) that maximizes the joint likelihood
$$L_p(\eta, G) = L_c(G)\,\eta e^{-\eta\psi(G)}, \qquad (2)$$
where $L_c(G)$ is the conditional likelihood defined in (1). It was shown in Wang & Lindsay (2008) that ψ(Ĝ) = 1/η̂, and that Ĝ maximizes $L_p$ if and only if it maximizes the alternative likelihood
$$L_p^{*}(G) = \frac{L_c(G)}{\psi(G)} = L_c(G)\,\frac{1 - f_\theta(0; G)}{f_\theta(0; G)}. \qquad (3)$$
Indeed, for fixed G, (2) is maximized over η at η = 1/ψ(G), where it equals $e^{-1}L_c(G)/\psi(G)$. Clearly, as ψ → ∞, we have $L_p^{*}(G) \to 0$. Thus, the boundary problem is prevented. Compared to the quadratic penalty in Wang & Lindsay (2005), the exponential partial prior approach is simpler and less ad hoc. The proposed estimator of N in this paper is based on the likelihood $L_p$ or its equivalent $L_p^{*}$. The penalized nonparametric maximum likelihood estimator of G from $L_p^{*}$ can be reliably found using an EM algorithm (Wang, 2007).
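The effect of the partial prior can be seen by evaluating the penalized objective, log Lc(G) − log ψ(G), for a given discrete G. The sketch below is an illustration under assumed inputs: the function names are hypothetical, the multinomial constant of Lc is dropped, the two candidate mixing distributions are chosen by hand, and no maximization over G is performed.

```python
import math


def nb_kernel(x, alpha, mu):
    """Poisson-gamma (negative binomial) probability, shape alpha and mean mu."""
    log_p = (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
             + alpha * math.log(alpha / (alpha + mu))
             + x * math.log(mu / (alpha + mu)))
    return math.exp(log_p)


def mixture_prob(x, alpha, support, weights):
    return sum(w * nb_kernel(x, alpha, m) for m, w in zip(support, weights))


def penalized_loglik(n, alpha, support, weights):
    """Return (log L_c - log psi, psi), omitting the multinomial constant."""
    f0 = mixture_prob(0, alpha, support, weights)
    psi = f0 / (1.0 - f0)  # odds of a species being unobserved
    loglik_c = sum(
        nj * math.log(mixture_prob(j, alpha, support, weights) / (1.0 - f0))
        for j, nj in enumerate(n, start=1) if nj > 0
    )
    return loglik_c - math.log(psi), psi


# Frequency counts of the simulated example in Section 3.
n = [269, 183, 88, 56, 34, 8, 9, 6, 0, 1, 1, 1, 1, 1]
plain = penalized_loglik(n, 2.0, [1.0, 2.0], [0.5, 0.5])
# Hand-made G with a heavy boundary component near mu = 0 (assumed for illustration).
spiked = penalized_loglik(n, 2.0, [1e-4, 1.0, 2.0], [0.9, 0.05, 0.05])
print(plain, spiked)
```

The boundary component barely changes the zero-truncated probabilities, so the conditional likelihood alone hardly distinguishes the two candidates, whereas the −log ψ term lowers the objective for the spiked one.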
3. Choosing θ by least-squares crossvalidation
The nesting property of the gamma mixture directly results in the following monotonicity property of the likelihood function defined in (1), (2) and (3), whose proof is omitted.
Proposition 1. Let ℓ̂θ be the maximized loglikelihood value from the smooth nonparametric maximum likelihood estimation based on conditional or unconditional likelihood from (1), or based on the likelihood function with the partial prior as defined in (2) or (3). Then ℓ̂θ is a monotone nondecreasing function of θ.
Proposition 1 implies that the global maximum likelihood is achieved at the discrete nonparametric maximum likelihood estimator, namely θ = ∞. When θ = α, as σ² = μ²/α, a larger α for given μ implies a more spiky gamma density. As α → ∞, each gamma component will converge to a degenerate distribution. The same conclusion holds when θ = β. By design, the tuning parameter θ brings better smoothness to Q(λ), and should reduce the underestimation. The likelihood value under smooth nonparametric maximum likelihood estimation can be regarded as the profile likelihood for θ. Unfortunately, the corresponding likelihood ratio test does not have a known distribution (Lindsay, 1995). Hence we propose a crossvalidation procedure for model selection.
In our estimation procedure, the parameter θ plays the same role as a bandwidth parameter in kernel density estimation; it tunes the smoothness of the resulting gamma mixture density. However, our problem differs from the latter in two senses: the species abundance value from the gamma mixture is not directly observed, and the density estimator is defined as a gamma mixture rather than a convolution of Gaussian densities. We investigate a crossvalidation procedure similar to that in density estimation (Silverman, 1992, p. 48) to determine the value of the tuning parameter. Our goal is to find a θ that minimizes the summed square error
$$\sum_{j \ge 1}\{f(j; \hat G) - f(j; G)\}^2 = \sum_{j \ge 1} f(j; \hat G)^2 - 2\sum_{j \ge 1} f(j; \hat G) f(j; G) + \sum_{j \ge 1} f(j; G)^2, \qquad (4)$$
where f(j; Ĝ) is the estimated zero-truncated Poisson mixture probability at j, and f(j; G) is its true value. The last term in (4), $\sum_{j \ge 1} f(j; G)^2$, does not involve Ĝ, so a good choice of Ĝ is one that minimizes $\sum_{j \ge 1} f(j; \hat G)^2 - 2\sum_{j \ge 1} f(j; \hat G) f(j; G)$. Because f(j; G) is unknown, we replace it with an empirical version, nj/D. In addition we define f̂−j as the leave-one-out crossvalidated probability density after excluding one species with count j from the sample. Further replacing f(j; Ĝ) with f̂−j(j) in the second term leads us to minimize the function
$$M_0(\theta) = \sum_{j \ge 1} f(j; \hat G)^2 - \frac{2}{D}\sum_{j \ge 1} n_j \hat f_{-j}(j), \qquad (5)$$
with respect to θ. As f̂−j(j) ≈ f(j; Ĝ),
$$M_0(\theta) \approx \sum_{j \ge 1}\left\{f(j; \hat G) - \frac{n_j}{D}\right\}^2 - \sum_{j \ge 1}\left(\frac{n_j}{D}\right)^2.$$
Given the data, minimizing M0(θ) tends to minimize $\sum_{j \ge 1}\{f(j; \hat G) - n_j/D\}^2$, to provide good fit to the data. Incorporating the leave-one-out crossvalidated probability f̂−j(j) in M0(θ) in (5) introduces a variance component into the objective function M0(θ), owing to a small perturbation of the data. In the Poisson-compound gamma model, an over-small θ results in a large summed square error due to lack of fit or bias, whereas an over-large θ, though always beating any smaller θ in terms of the likelihood by the nesting property, may yield a relatively larger variance in f(j; Ĝ). Minimizing M0(θ) over θ balances the two competing components in this model.
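The crossvalidation score in (5) can be sketched as follows. The function m0 takes any fitting routine that maps frequency counts to fitted zero-truncated probabilities. As a stand-in for the penalized nonparametric fit at a given θ, which is not implemented here, we use a one-component model with α = 1, for which the zero-truncated distribution is geometric with a closed-form conditional maximum likelihood estimate. All names and the truncation point j_max are assumptions of this sketch.

```python
def m0(n, fit, j_max=500):
    """Least-squares crossvalidation score: sum_j f(j)^2 - (2/D) sum_j n_j f_{-j}(j).

    fit(n) must return a function j -> fitted zero-truncated probability at j.
    """
    D = sum(n)
    full = fit(n)
    fit_term = sum(full(j) ** 2 for j in range(1, j_max + 1))
    cv_term = 0.0
    for j, nj in enumerate(n, start=1):
        if nj == 0:
            continue
        reduced = list(n)
        reduced[j - 1] -= 1  # leave out one species with count j, then refit
        cv_term += nj / D * fit(reduced)(j)
    return fit_term - 2.0 * cv_term


def geometric_fit(n):
    """Stand-in fit: single component with alpha = 1, i.e. a zero-truncated geometric."""
    D = sum(n)
    total = sum(j * nj for j, nj in enumerate(n, start=1))
    p = D / total  # conditional maximum likelihood estimate
    return lambda j: p * (1.0 - p) ** (j - 1)


# Frequency counts of the simulated example in Section 3.
n = [269, 183, 88, 56, 34, 8, 9, 6, 0, 1, 1, 1, 1, 1]
score = m0(n, geometric_fit)
print(score)
```

In the procedure described above, the score would be computed once per grid value of θ, with fit replaced by the penalized smooth nonparametric maximum likelihood fit at that θ, and the minimizing θ would be selected.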
The summed square error defined earlier is analogous to the integrated square error used in the kernel density estimation. Although we are unable to establish similar asymptotics because of the complexity of nonparametric maximum likelihood estimation of mixtures, this procedure works surprisingly well in our application. We illustrate this using simulated species data from a Poisson-compound gamma model, where N = 1000, and the species abundance is a two-component gamma mixture 0.5Ga(α = 2, μ = 1) + 0.5Ga(α = 2, μ = 2). The data are: n = (n1, . . ., n14) = (269, 183, 88, 56, 34, 8, 9, 6, 0, 1, 1, 1, 1, 1). We consider the θ = α setting by treating α as a tuning parameter. In Fig. 1, we plot the maximized log conditional likelihood under the exponential partial prior on a grid of α ∈ 𝒜 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40, 60, 100} together with M0(α) and the resulting N̂.
Fig. 1.
Plots of (a) maximized loglikelihood relative to that achieved at discrete nonparametric maximum likelihood, (b) M0, and (c) N̂ versus log(α) at α = 1, . . ., 10, 20, 40, 60, 100 for the example data. The dashed line in (c) corresponds to the discrete nonparametric maximum likelihood estimator, N̂ = 886.
The loglikelihood plot in Fig. 1(a) confirms the monotonicity. The crossvalidation procedure identified α = 2 as the grid value that minimizes M0(α), yielding N̂ = 981. The estimator N̂ decreases in α, until reaching 886 at α = ∞. This overall decreasing pattern of N̂ is typical, though in our experience strict monotonicity does not always hold. The discrepancy between N̂ at α = 2 and at α = ∞ suggests that an appropriate choice of α could lead to a less biased estimator than the discrete nonparametric maximum likelihood estimator.
4. Simulation study
We now investigate whether this new approach could preserve the desired robustness as a nonparametric maximum likelihood estimator, and whether it could significantly reduce the underestimation. Table 1 lists 12 settings, where the species abundance distribution Q(λ) varied from gamma, gamma mixture, lognormal and lognormal mixture, to discrete distributions, with the expected fraction of the sampled species, defined as the sampling depth, d, ranging from 0.20 to 0.90. For each setting, we simulated 200 samples from a population containing N = 1000 species. Crossvalidation was applied to each sample on a grid α ∈ 𝒜 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40, 60, 100, ∞}. One could carry out the crossvalidation for α < 1, but the gamma density goes to ∞ at 0+ for α < 1. Consequently, the resulting estimator can be extremely variable or biased. Hence, in this simulation, we took only α ⩾ 1.
Table 1.
Species abundance distribution and sampling depth in 12 simulation settings
| Setting | Distribution type | Q | d |
|---|---|---|---|
| 1 | Gamma | Ga(4, 3.125) | 0.90 |
| 2 | Gamma | Ga(4, 1) | 0.59 |
| 3 | Gamma | Ga(1, 0.25) | 0.20 |
| 4 | Gamma mixture | 0.5Ga(2, 1) + 0.5Ga(2, 2) | 0.65 |
| 5 | Gamma mixture | 0.5Ga(2, 1) + 0.5Ga(4, 1) | 0.57 |
| 6 | Lognormal | LN(0.75, 0.75) | 0.82 |
| 7 | Lognormal | LN(−0.5, 2) | 0.50 |
| 8 | Lognormal | LN(−1, 1) | 0.36 |
| 9 | Lognormal mixture | 0.5LN(−0.5, 1) + 0.5LN(0.5, 1) | 0.61 |
| 10 | Discrete | 0.8Δ1.2 + 0.2Δ6.7 | 0.76 |
| 11 | Discrete | 0.89Δ0.5 + 0.11Δ6.7 | 0.46 |
| 12 | Discrete | 0.8Δ0.2 + 0.2Δ1.3 | 0.29 |
Ga, gamma (α, μ); LN, lognormal(μ, σ); Δx, a degenerate distribution at x ; d is defined as the sampling depth, which is the expected fraction of species to be observed in the sample, i.e. d = 1 − f (0; Q).
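For the gamma and discrete settings, the sampling depth d = 1 − f(0; Q) has a closed form, since a Poisson mixed over Ga(α, μ) gives f(0) = {α/(α + μ)}^α and a point mass at λ gives f(0) = e^{−λ}. The sketch below recomputes d for those settings; the function names are hypothetical, and the lognormal settings are omitted because they require numerical integration.

```python
import math


def depth_gamma_mixture(components):
    """components: (weight, alpha, mu) triples; returns d = 1 - f(0; Q)."""
    f0 = sum(w * (a / (a + m)) ** a for w, a, m in components)
    return 1.0 - f0


def depth_discrete(components):
    """components: (weight, lambda) pairs for point masses; returns d = 1 - f(0; Q)."""
    return 1.0 - sum(w * math.exp(-lam) for w, lam in components)


depths = {
    1: depth_gamma_mixture([(1.0, 4.0, 3.125)]),
    2: depth_gamma_mixture([(1.0, 4.0, 1.0)]),
    3: depth_gamma_mixture([(1.0, 1.0, 0.25)]),
    4: depth_gamma_mixture([(0.5, 2.0, 1.0), (0.5, 2.0, 2.0)]),
    5: depth_gamma_mixture([(0.5, 2.0, 1.0), (0.5, 4.0, 1.0)]),
    10: depth_discrete([(0.8, 1.2), (0.2, 6.7)]),
    11: depth_discrete([(0.89, 0.5), (0.11, 6.7)]),
    12: depth_discrete([(0.8, 0.2), (0.2, 1.3)]),
}
for setting, d in depths.items():
    print(setting, round(d, 2))
```

The rounded values reproduce the d column of Table 1 for these eight settings.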
Overall, the crossvalidation procedure works satisfactorily in selecting appropriate models for different settings. For the single gamma models including settings 1–3, and the gamma mixture setting 4 where the components had the same α parameter, the medians of the selected α values, denoted as α̂, were 4, 4, 1 and 2, respectively, which were consistent with the true values. For setting 5, where the gamma mixture components had differing α values, the crossvalidation tended to select some intermediate α̂ values with a median of 3. For the three discrete Q models, settings 10–12, the medians of α̂ were 100, 80 and 1, respectively. The difference probably arises because in the first two, the sampling was deeper, and the two discrete components were farther apart than setting 12. As a result, the crossvalidation procedure can better identify the latent structure in the former two. For the four lognormal models including settings 6–9, the medians were 2, 1, 1 and 1, respectively, suggesting that the crossvalidation procedure tended to select gamma components with a larger coefficient of variation to approximate the lognormal distributions. In Fig. 2, we present the histograms of α̂ from settings 2, 4, 5, 8, 11 and 12.
Fig. 2.
Histogram of selected α values in simulation settings (a) 2, (b) 4, (c) 5, (d) 8, (e) 11, and (f) 12. Each bar in the histogram represents the frequency of selected α models for α ∈ 𝒜 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40, 60, 100, ∞}. The bar at α = 101 represents the frequency at α̂ = ∞.
Table 2 summarizes the Monte Carlo estimates from the proposed estimator, denoted by N̂pcg, in comparison with other popular estimators, including the penalized discrete nonparametric maximum likelihood estimator N̂wl of Wang & Lindsay (2005); two sample coverage-based estimators N̂c1 and N̂c2, corresponding to N̂2 and N̂3, respectively, from Chao & Lee (1992), where N̂c2 provides further bias correction based on N̂c1; a coverage-duplication estimator N̂c3 based on a Poisson-gamma model from Chao & Bunge (2002); and a jackknife estimator N̂j from Burnham & Overton (1978, 1979). The jackknife order was selected up to five based on a procedure proposed in the original papers. A confidence interval for N̂pcg was obtained using a standard bootstrap procedure as follows. Let g(j; Ĝα̂) (j = 1, 2, . . .) be the fitted zero-truncated Poisson mixture probabilities from a Monte Carlo sample with D observed distinct species. We generated 200 bootstrap samples, each containing D observations sampled from g(j; Ĝα̂). The crossvalidation procedure was applied to each bootstrap sample and N̂pcg was calculated. The interval spanned by the 0.025 and 0.975 quantiles of the N estimates was regarded as the bootstrap confidence interval. We also computed three additional likelihood-based estimators without a partial prior, including N̂fm, N̂pcg2 and N̂pcg3. Estimator N̂fm was obtained by fitting a finite mixture of zero-truncated Poisson distributions to the data. Estimators N̂pcg2 and N̂pcg3 were the conditional and unconditional maximum likelihood estimators, respectively, based on the Poisson-compound gamma model without fixing α or β. In each case, the number of components was selected using the Bayesian information criterion; and the confidence interval coverage was based on the same bootstrap procedure as for N̂pcg.
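The bootstrap interval described above can be sketched as follows. The resampling and percentile steps follow the description in the text; the estimator plugged in is a stand-in (a one-component fit with α = 1, for which N̂ = D/(1 − p̂) with p̂ = D/∑ xi), since rerunning the crossvalidation and the penalized nonparametric fit is beyond this sketch. The function names, the seed, the quantile indexing rule and the fitted probabilities g are all assumptions.

```python
import math
import random


def percentile_bootstrap(g, D, estimate_N, B=200, seed=0):
    """Percentile interval from B resamples of D counts drawn from g.

    g[j - 1] is the fitted zero-truncated probability of count j on a truncated support.
    estimate_N maps a list of counts to an estimate of N.
    """
    rng = random.Random(seed)
    support = list(range(1, len(g) + 1))
    estimates = sorted(
        estimate_N(rng.choices(support, weights=g, k=D)) for _ in range(B)
    )
    lower = estimates[math.floor(0.025 * (B - 1))]
    upper = estimates[math.ceil(0.975 * (B - 1))]
    return lower, upper


def geometric_estimate(counts):
    """Stand-in estimator (not N_pcg): single component with alpha = 1."""
    D = len(counts)
    total = sum(counts)
    if total == D:  # all singletons: the estimate is unbounded
        return math.inf
    p0 = D / total  # conditional MLE of f(0) under the geometric model
    return D / (1.0 - p0)


# Hypothetical fitted probabilities: zero-truncated geometric with f(0) = 0.4.
g = [0.4 * 0.6 ** (j - 1) for j in range(1, 61)]
lower, upper = percentile_bootstrap(g, 300, geometric_estimate)
print(lower, upper)
```

Replacing geometric_estimate with a routine that reruns the crossvalidation and refits the model on each resample would give the interval used for N̂pcg, at a much higher computational cost.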
Table 2.
Comparing nine different estimators with respect to median bias, mean absolute error and 95% confidence interval coverage in 12 simulation settings listed in Table 1
| Setting | N̂ | M̂ | s | MAE | Cov. | Setting | M̂ | s | MAE | Cov. | Setting | M̂ | s | MAE | Cov. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N̂pcg | 1003 | 25 | 19 | 96 | 2 | 1021 | 135 | 95 | 95 | 3 | 925 | 234 | 195 | 98 |
| 1 | N̂wl | 1004 | 46 | 33 | 90 | 2 | 998 | 112 | 86 | 92 | 3 | 674 | 148 | 314 | 82 |
| 1 | N̂c1 | 982 | 14 | 20 | 75 | 2 | 925 | 44 | 76 | 70 | 3 | 615 | 101 | 376 | 18 |
| 1 | N̂c2 | 988 | 16 | 16 | 88 | 2 | 953 | 57 | 60 | 89 | 3 | 654 | 150 | 326 | 46 |
| 1 | N̂c3 | 989 | 16 | 16 | 90 | 2 | 993 | 89 | 67 | 96 | 3 | – | – | – | – |
| 1 | N̂j | 1075 | 17 | 74 | 100 | 2 | 1134 | 108 | 140 | 43 | 3 | 590 | 85 | 406 | 1 |
| 1 | N̂fm | 974 | 16 | 27 | 38 | 2 | 868 | 87 | 131 | 14 | 3 | 541 | 177 | 448 | 22 |
| 1 | N̂pcg2 | 1001 | 17 | 13 | 92 | 2 | 991 | 90 | 68 | 95 | 3 | – | – | – | – |
| 1 | N̂pcg3 | 999 | 17 | 13 | 87 | 2 | 981 | 85 | 66 | 95 | 3 | – | – | – | – |
| 4 | N̂pcg | 1097 | 127 | 118 | 99 | 5 | 1012 | 140 | 104 | 96 | 6 | 1022 | 65 | 50 | 98 |
| 4 | N̂wl | 955 | 98 | 84 | 99 | 5 | 965 | 110 | 88 | 99 | 6 | 1008 | 73 | 58 | 95 |
| 4 | N̂c1 | 900 | 34 | 101 | 23 | 5 | 895 | 44 | 102 | 45 | 6 | 965 | 21 | 34 | 69 |
| 4 | N̂c2 | 951 | 45 | 55 | 87 | 5 | 934 | 59 | 72 | 83 | 6 | 989 | 25 | 22 | 93 |
| 4 | N̂c3 | 1014 | 71 | 55 | 96 | 5 | 999 | 105 | 78 | 95 | 6 | 1001 | 28 | 22 | 95 |
| 4 | N̂j | 1036 | 80 | 60 | 84 | 5 | 1094 | 106 | 107 | 33 | 6 | 1064 | 65 | 74 | 33 |
| 4 | N̂fm | 874 | 170 | 145 | 27 | 5 | 886 | 189 | 148 | 22 | 6 | 885 | 31 | 114 | 14 |
| 4 | N̂pcg2 | 1038 | 83 | 72 | 92 | 5 | 1002 | 107 | 79 | 96 | 6 | 1100 | 207 | 120 | 26 |
| 4 | N̂pcg3 | 1028 | 81 | 66 | 95 | 5 | 991 | 101 | 76 | 94 | 6 | 1094 | 62 | 99 | 100 |
| 7 | N̂pcg | 847 | 130 | 156 | 78 | 8 | 1033 | 191 | 147 | 100 | 9 | 1041 | 139 | 108 | 100 |
| 7 | N̂wl | 755 | 91 | 229 | 57 | 8 | 794 | 115 | 199 | 97 | 9 | 920 | 96 | 99 | 100 |
| 7 | N̂c1 | 674 | 29 | 325 | 0 | 8 | 769 | 72 | 226 | 23 | 9 | 856 | 35 | 142 | 7 |
| 7 | N̂c2 | 739 | 43 | 257 | 2 | 8 | 940 | 140 | 124 | 87 | 9 | 932 | 50 | 71 | 76 |
| 7 | N̂c3 | 846 | 98 | 153 | 73 | 8 | – | – | – | – | 9 | 1036 | 98 | 82 | 95 |
| 7 | N̂j | 802 | 112 | 193 | 19 | 8 | 892 | 126 | 131 | 55 | 9 | 984 | 112 | 76 | 75 |
| 7 | N̂fm | 628 | 50 | 363 | 80 | 8 | 699 | 106 | 285 | 32 | 9 | 821 | 66 | 176 | 28 |
| 7 | N̂pcg2 | – | – | – | – | 8 | – | – | – | – | 9 | – | – | – | – |
| 7 | N̂pcg3 | 581 | 97 | 396 | 25 | 8 | – | – | – | – | 9 | – | – | – | – |
| 10 | N̂pcg | 1013 | 36 | 31 | 83 | 11 | 1035 | 133 | 99 | 92 | 12 | 915 | 168 | 148 | 94 |
| 10 | N̂wl | 1026 | 60 | 49 | 67 | 11 | 1048 | 142 | 116 | 66 | 12 | 742 | 129 | 247 | 89 |
| 10 | N̂c1 | 1060 | 36 | 65 | 52 | 11 | 1003 | 71 | 55 | 96 | 12 | 625 | 64 | 373 | 2 |
| 10 | N̂c2 | 1171 | 54 | 180 | 2 | 11 | 1487 | 172 | 490 | 2 | 12 | 676 | 93 | 314 | 21 |
| 10 | N̂c3 | 1343 | 115 | 365 | 0 | 11 | – | – | – | – | 12 | – | – | – | – |
| 10 | N̂j | 1160 | 45 | 164 | 0 | 11 | 1183 | 144 | 215 | 46 | 12 | 797 | 104 | 205 | 32 |
| 10 | N̂fm | 1002 | 31 | 25 | 86 | 11 | 1006 | 461 | 386 | 51 | 12 | 534 | 244 | 459 | 7 |
| 10 | N̂pcg2 | 1018 | 39 | 34 | 80 | 11 | – | – | – | – | 12 | – | – | – | – |
| 10 | N̂pcg3 | 1025 | 59 | 41 | 87 | 11 | – | – | – | – | 12 | – | – | – | – |
M̂, s and MAE are the sample median, standard deviation and mean of absolute error calculated based on 200 Monte Carlo estimates for each setting. Cov. is the average of 95% confidence interval coverage evaluated from 200 samples. N̂c1, N̂c2 and N̂c3 were estimated based on less abundant subset of species defined as nj ⩽ 10. For N̂pcg, N̂wl, N̂fm, N̂pcg2 and N̂pcg3, the coverage of confidence interval was estimated based on a similar bootstrap procedure. ‘–’ stands for situations where extreme bias/variance occurred.
For each setting, the summary statistics based on 200 Monte Carlo estimates presented in Table 2 include the median, standard deviation, mean absolute error and the coverage of a nominal 95% confidence interval. For N̂c1, N̂c2 and N̂c3, we used a log-transformation procedure from Chao (1987) for the 95% confidence interval. For computational reasons, the number of Monte Carlo replicates is too small for precise estimation of coverages; despite this, major differences between the procedures are clear.
Overall, N̂pcg exhibited the desirable robustness similar to N̂wl, N̂c1, N̂j and N̂fm, having no extreme behaviour in any setting. In contrast, N̂pcg2, N̂pcg3 and N̂c3, which make gamma or gamma mixture assumptions for Q, are susceptible to model misspecification. For example, they all had large bias in multiple settings. In particular, for N̂pcg2 and N̂pcg3, the Bayesian information criterion tended to eliminate the boundary component if its weight was small, but not otherwise. Consequently, large bias was still occasionally observed. In most other cases, the positive bias appeared to be systematic, which occurred when using gamma models to approximate lognormal and discrete distributions in settings 7, 8, 9, 11 and 12. A large bias was frequently observed when a fitted gamma component had α̂ < 1. Such results were excluded from Table 2. Although N̂pcg was built on the same model, its estimation strategy, including using the partial prior, crossvalidation and controlling α ⩾ 1, appeared to be effective in alleviating such issues regardless of the true form of Q. For N̂fm, we observed a few extremely large estimates owing to the boundary problem in several settings. In the summary in Table 2, such extreme estimates, if greater than 2000, were replaced with 2000.
We further calculated the median relative bias, defined as bias/N for each setting. The average of the absolute median relative bias across different settings for N̂pcg, N̂wl, N̂c1, N̂c2, N̂c3, N̂j and N̂fm was 4.6%, 10%, 14.6%, 14.8%, 6.3%, 13.9% and 19.2%, respectively. The average bias of 6.3% for N̂c3 was obtained after excluding four settings where it had large bias. In particular, when the sampling was shallow and underestimation was a severe issue, N̂pcg significantly improved over other estimators regarding bias.
The proposed estimator N̂pcg pays a price of relatively larger variance for the achieved robustness, improved bias and confidence interval coverage when the true Q is continuous. For example, when compared to the coverage-based estimator N̂c2, N̂pcg had relatively larger mean absolute error in seven settings, though the coverage of confidence interval of N̂pcg was uniformly better in such settings.
5. Data analysis
In the motivating genomic applications, each dataset often presents a shallow survey of the entire population of genes or microbial species. Hence, underestimation is probably the primary concern. In Table 3, we present three datasets including the microbial species data from Acinas et al. (2004), the Arabidopsis thaliana expressed sequence tag data from Wang & Lindsay (2005) and the traffic data from Simar (1976) and Böhning & Schön (2005).
Table 3.
Microbial species, Arabidopsis thaliana root expressed sequence tag and Traffic data
| j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Microbial (nj) | 381 | 65 | 23 | 18 | 4 | 5 | 3 | 0 | 1 | 0 | 4 | 0 | 3 | 2 | 0 | 1 | 3 |
| Root (nj) | 2187 | 490 | 133 | 121 | 37 | 51 | 22 | 19 | 7 | 8 | 6 | 7 | 6 | 4 | 5 | 5 | 18 |
| Traffic (nj) | 1317 | 239 | 42 | 14 | 4 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
In total, 513 species were observed in the microbial sample. The point estimate from N̂pcg was 3005 at α̂ = 3 with 95% confidence interval (1691, 4530). The point estimate and confidence bounds from other estimators including N̂wl, N̂c1, N̂c2, N̂j, N̂pcg2 and N̂pcg3 are also presented in Table 4 for comparison. The estimator N̂c3 gave all negative estimates, and was therefore excluded. Estimates from other estimators varied substantially. The point estimates from N̂c1 and the order 5 jackknife estimator N̂j were 1674 and 1913, respectively, and both are likely to be underestimates of N based on our experience gained from the simulation. It is possible that there are actually more than 3000 species, as indicated by the confidence limits from N̂pcg.
Table 4.
Results for data in Table 3
| Estimators | N̂pcg | N̂wl | N̂c1 | N̂c2 | N̂j | N̂pcg2 | N̂pcg3 |
|---|---|---|---|---|---|---|---|
| Microbial | 3005 (1691, 4530) | 2053 (1519, 2585) | 1674 (1388, 2052) | 2510 (1878, 3434) | 1913 (1634, 2191) | – (521, 2272) | 1098 (855, 2712) |
| Root | 8980 (8383, 18 771) | 8919 (8211, 10 918) | 9090 (8470, 9783) | 13 948 (12 459, 15 674) | 9891 (9192, 10 589) | 9483 – | 9478 (6530, 18 033) |
| Traffic | 6935 (5198, 12 409) | 5496 (4991, 6918) | 5684 (5031, 6461) | 6788 (5665, 8223) | 6588 (5892, 7284) | – – | – – |
The point estimate and 95% confidence interval (in parentheses) are presented for each dataset. N̂wl was computed using the ESTstatCF program from http://bioinfo.stats.northwestern.edu/~jzwang. ‘–’ indicates that the estimates were extremely large due to boundary problems or possible bias.
For the expressed sequence tag data and the traffic data, the estimates from N̂pcg were 8980 (8383, 18 771) at α̂ = ∞ and 6935 (5198, 12 409) at α̂ = 3, respectively. In both cases, the bootstrap estimates were highly skewed to the right. The large skewness should not be read simply as a sign of excess variability; it may instead reflect an appealing feature in coverage. For example, the true N in the traffic data is known to be 9461, and N̂pcg is the only estimator whose confidence interval covers the true N. Similarly, for the root data, 8980 could be a conservative estimate of the total number of expressed genes in the root tissue. As the Arabidopsis thaliana genome contains approximately 27 000 protein-coding genes, the N̂pcg confidence limits suggest that up to about two-thirds of the entire transcriptome may be expressed in the root tissue.
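The coverage statement for the traffic data and the two-thirds figure for the root data can be verified with simple arithmetic. The sketch below copies the traffic intervals from Table 4 and uses short hypothetical keys for the estimators; the two estimators reported as ‘–’ are omitted because no interval is available for them.

```python
TRUE_N_TRAFFIC = 9461  # known population size for the traffic data, as stated in the text

# 95% confidence intervals for the traffic data, copied from Table 4.
# Keys are shorthand labels for the estimators; pcg2 and pcg3 have no interval.
traffic_intervals = {
    "pcg": (5198, 12409),
    "wl": (4991, 6918),
    "c1": (5031, 6461),
    "c2": (5665, 8223),
    "j": (5892, 7284),
}


def covers(interval, value):
    """Return True if value lies inside the closed interval (lower, upper)."""
    lower, upper = interval
    return lower <= value <= upper


covering = [name for name, ci in traffic_intervals.items() if covers(ci, TRUE_N_TRAFFIC)]

# Root data: upper confidence limit of the pcg estimator relative to the
# approximate number of protein-coding genes in Arabidopsis thaliana.
root_upper_fraction = 18771 / 27000

print("Intervals covering the true N:", covering)
print(f"Upper limit as a fraction of the genome: {root_upper_fraction:.3f}")
```

Only the N̂pcg interval contains 9461, and the upper limit for the root data corresponds to roughly 0.70 of the 27 000 genes, in line with the two-thirds figure quoted above.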
6. Discussion
This estimation strategy can potentially be applied to many problems that involve mixture models. An immediate application is to the capture–recapture problem. Instead of using a finite mixture of binomials (Pledger, 2000), one could use a binomial-compound beta model (Morgan & Ridout, 2008). The crossvalidation procedure could also be investigated further as a tool for selecting the number of components in finite mixture models. Detailed discussion can be found in the online Supplementary Material. The R (R Development Core Team, 2010) package species, which automates different estimators for species richness estimation, can be found at http://bioinfo.stats.northwestern.edu/~jzwang.
Supplementary material
Supplementary material is available at Biometrika online.
Acknowledgments
The author would like to thank Professor Bruce Lindsay of the Department of Statistics at Pennsylvania State University, two referees, the associate editor and the editor. The author is also grateful to Professor Bruce Spencer, Professor Thomas Severini and Professor Wenxin Jiang of the Department of Statistics at Northwestern University. This work was supported by the National Institutes of Health, U.S.A., and the National Center for Supercomputing Applications.
Appendix
Proof of Lemma 2. We show the existence of G for θ = α or β. The uniqueness is guaranteed by the identifiability of gamma mixtures under a unified α or β parameter, stated in Theorem 2.
When θ = α, given any δ > 0 and μ0 > 0, the following G satisfies ∫h(λ ; α + δ, μ)dG(μ) = h(λ; α, μ0), for all λ > 0:
It is straightforward to verify that G(μ) is a well-defined probability distribution function on the interval (0, (α + δ)μ0/α).
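The displayed form of G is missing from this copy. One mixing distribution with the stated support that does satisfy the nesting identity follows from the standard beta-gamma relation: take μ = cB with c = (α + δ)μ0/α and B distributed as Beta(α, δ). Whether this matches the author's notation is an assumption; the Python sketch below only checks numerically, with a midpoint rule and illustrative parameter values, that this choice reproduces h(λ; α, μ0). It assumes h(λ; α, μ) denotes the gamma density with shape α and mean μ.

```python
import math


def gamma_pdf_mean(lam, shape, mean):
    """Gamma density with the given shape and mean (rate = shape / mean)."""
    rate = shape / mean
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(lam)
                    - rate * lam - math.lgamma(shape))


def scaled_beta_pdf(mu, alpha, delta, mu0):
    """Density of c * B, where B ~ Beta(alpha, delta) and c = (alpha + delta) * mu0 / alpha."""
    c = (alpha + delta) * mu0 / alpha
    u = mu / c
    log_norm = math.lgamma(alpha + delta) - math.lgamma(alpha) - math.lgamma(delta)
    return math.exp(log_norm + (alpha - 1) * math.log(u)
                    + (delta - 1) * math.log(1 - u)) / c


def mixture_pdf(lam, alpha, delta, mu0, n=20000):
    """Midpoint-rule approximation of the integral of h(lam; alpha + delta, mu) dG(mu)."""
    c = (alpha + delta) * mu0 / alpha
    step = c / n
    total = 0.0
    for i in range(n):
        mu = (i + 0.5) * step  # midpoint of the i-th subinterval of (0, c)
        total += gamma_pdf_mean(lam, alpha + delta, mu) * scaled_beta_pdf(mu, alpha, delta, mu0)
    return total * step


# Illustrative values only: alpha = 2, delta = 2, mu0 = 1.5.
for lam in (0.3, 1.0, 2.5):
    print(lam, mixture_pdf(lam, 2.0, 2.0, 1.5), gamma_pdf_mean(lam, 2.0, 1.5))
```

The two printed columns should agree up to the quadrature error, which illustrates that a gamma density with shape α can be written as a mixture of gamma densities with the larger shape α + δ.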
When θ = β, if such a G(μ) exists, it has to satisfy that for all λ > 0,
| (A1) |
We can define G(μ) as a discrete probability measure such that
| (A2) |
To verify that (A2) is a probability measure, consider a Po(x; δλ) variable x, where λ has prior distribution Ga(βμ0, β). Then
This is exactly in the same form as PG. Plugging G into the left-hand side of the equality (A1), we have
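The Poisson-gamma argument above can be checked numerically. If x given λ is Po(δλ) and λ is Ga(βμ0, β), the marginal distribution of x is negative binomial and the posterior of λ given x is gamma with shape βμ0 + x and rate β + δ, that is, mean (βμ0 + x)/(β + δ). Mixing these posteriors over the marginal weights must recover the prior. The sketch below assumes Ga(a, b) denotes shape a and rate b; the function names and parameter values are illustrative.

```python
import math


def gamma_pdf(lam, shape, rate):
    """Gamma density with the given shape and rate."""
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(lam)
                    - rate * lam - math.lgamma(shape))


def nb_weight(x, shape, beta, delta):
    """Marginal pmf of x when x | lam ~ Po(delta * lam) and lam ~ Ga(shape, rate=beta)."""
    log_w = (math.lgamma(shape + x) - math.lgamma(shape) - math.lgamma(x + 1)
             + shape * math.log(beta / (beta + delta))
             + x * math.log(delta / (beta + delta)))
    return math.exp(log_w)


def mixture_pdf(lam, beta, mu0, delta, terms=200):
    """Truncated mixture of Ga(beta*mu0 + x, beta + delta) densities with negative binomial weights."""
    shape0 = beta * mu0
    return sum(nb_weight(x, shape0, beta, delta) * gamma_pdf(lam, shape0 + x, beta + delta)
               for x in range(terms))


# Illustrative values only: beta = 2, mu0 = 1.5, delta = 1.
beta, mu0, delta = 2.0, 1.5, 1.0
total_weight = sum(nb_weight(x, beta * mu0, beta, delta) for x in range(200))
print("sum of weights:", total_weight)
for lam in (0.2, 1.0, 3.0):
    print(lam, mixture_pdf(lam, beta, mu0, delta), gamma_pdf(lam, beta * mu0, beta))
```

The weights sum to one up to truncation error, and the mixture density matches the Ga(βμ0, β) density at each λ, which is the content of the nesting property for a common β.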
Proof of Theorem 2. Suppose that under the reparameterization (α, μ), the mixing distribution in μ is G(μ). As α is fixed, G(μ) corresponds to a unique mixing distribution in β, say G*(β), in the original (α, β) parameterization, with G*(β) = 1 − G{(α/β)−}, where G{(α/β)−} = P(μ < α/β). Conversely, G(μ) = 1 − G*{(α/μ)−}. As G(μ) and G*(β) determine each other uniquely, it suffices to show that G*(β) is identifiable under the original parameterization (α, β). Suppose there is another G** such that for all λ > 0,
| (A3) |
This equality holds at λ = 1, and hence
| (A4) |
Dividing each side of equation (A3) by the corresponding side of (A4) gives
| (A5) |
where
If we let t = 1 − λ, then the left- and right-hand sides of (A5) are the moment generating functions of P* and P**, respectively, which are finite for t in a neighbourhood of zero, and hence P* = P**. As dG*(β) = β^{−α}e^{β} dP*(β)/{∫ β^{−α}e^{β} dP*(β)} and dG**(β) = β^{−α}e^{β} dP**(β)/{∫ β^{−α}e^{β} dP**(β)}, it follows that G* = G**. Hence G(μ) is identifiable in 𝒢.
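The displays (A3) to (A5) are missing from this copy. A reconstruction consistent with the surrounding argument, assuming the gamma density in the (α, β) parameterization is β^α λ^{α−1} e^{−βλ}/Γ(α) and with the common factor λ^{α−1}/Γ(α) cancelled, might read as follows; the author's exact notation may differ.

```latex
\begin{align*}
\int \beta^{\alpha} e^{-\beta\lambda}\, dG^{*}(\beta)
  &= \int \beta^{\alpha} e^{-\beta\lambda}\, dG^{**}(\beta)
  && \text{(A3)} \\
\int \beta^{\alpha} e^{-\beta}\, dG^{*}(\beta)
  &= \int \beta^{\alpha} e^{-\beta}\, dG^{**}(\beta)
  && \text{(A4)} \\
\int e^{-\beta(\lambda - 1)}\, dP^{*}(\beta)
  &= \int e^{-\beta(\lambda - 1)}\, dP^{**}(\beta)
  && \text{(A5)} \\
dP^{*}(\beta)
  &= \frac{\beta^{\alpha} e^{-\beta}\, dG^{*}(\beta)}
          {\int \beta^{\alpha} e^{-\beta}\, dG^{*}(\beta)},
  \qquad
dP^{**}(\beta)
   = \frac{\beta^{\alpha} e^{-\beta}\, dG^{**}(\beta)}
          {\int \beta^{\alpha} e^{-\beta}\, dG^{**}(\beta)}.
\end{align*}
```

With t = 1 − λ the integrand in (A5) becomes e^{tβ}, so each side is a moment generating function, and inverting the definition of P* gives the expression for dG*(β) used in the final step.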
Now consider the (β, μ) setting. Suppose the mixing distribution has a cumulative distribution function G(μ), and there exists another G* such that for all λ > 0
| (A6) |
This equality holds at λ = 1, and hence
| (A7) |
Dividing each side of equation (A6) by the corresponding side of (A7) gives
| (A8) |
where
Let t = β log(λ). The two sides of equation (A8) become the moment generating functions of P and P*, respectively, which are finite when λ is near 1, that is, when t is near 0. Thus, by the same argument as above, G(μ) is identifiable in 𝒢.
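The displays (A6) to (A8) are likewise missing. A reconstruction consistent with the substitution t = β log(λ), assuming that in the (β, μ) parameterization the gamma density has rate β and shape βμ, might read as follows; again the author's notation may differ.

```latex
\begin{align*}
\int \frac{\beta^{\beta\mu} \lambda^{\beta\mu - 1} e^{-\beta\lambda}}{\Gamma(\beta\mu)}\, dG(\mu)
  &= \int \frac{\beta^{\beta\mu} \lambda^{\beta\mu - 1} e^{-\beta\lambda}}{\Gamma(\beta\mu)}\, dG^{*}(\mu)
  && \text{(A6)} \\
\int \frac{\beta^{\beta\mu} e^{-\beta}}{\Gamma(\beta\mu)}\, dG(\mu)
  &= \int \frac{\beta^{\beta\mu} e^{-\beta}}{\Gamma(\beta\mu)}\, dG^{*}(\mu)
  && \text{(A7)} \\
\int \lambda^{\beta\mu}\, dP(\mu)
  &= \int \lambda^{\beta\mu}\, dP^{*}(\mu)
  && \text{(A8)} \\
dP(\mu)
  &= \frac{\{\beta^{\beta\mu}/\Gamma(\beta\mu)\}\, dG(\mu)}
          {\int \{\beta^{\beta\mu}/\Gamma(\beta\mu)\}\, dG(\mu)},
  \qquad
dP^{*}(\mu)
   = \frac{\{\beta^{\beta\mu}/\Gamma(\beta\mu)\}\, dG^{*}(\mu)}
          {\int \{\beta^{\beta\mu}/\Gamma(\beta\mu)\}\, dG^{*}(\mu)}.
\end{align*}
```

Since λ^{βμ} = exp{μ β log(λ)} = e^{tμ}, each side of (A8) is the moment generating function of the corresponding measure evaluated at t.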
References
- Acinas S, Klepac-Ceraj V, Hunt D, Pharino C, Ceraj I, Distel D, Polz M. Fine-scale phylogenetic architecture of a complex bacterial community. Nature. 2004;430:551–4. doi: 10.1038/nature02649.
- Böhning D, Schön D. Nonparametric maximum likelihood estimation of population size based on the counting distribution. J R Statist Soc C. 2005;54:721–37.
- Bunge J, Fitzpatrick M. Estimating the number of species: a review. J Am Statist Assoc. 1993;88:364–73.
- Burnham KP, Overton WS. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika. 1978;65:625–33.
- Burnham KP, Overton WS. Robust estimation of population size when capture probabilities vary among animals. Ecology. 1979;60:927–36.
- Chao A. Estimating the population size for capture-recapture data with unequal catchability. Biometrics. 1987;43:783–91.
- Chao A, Bunge J. Estimating the number of species in a stochastic abundance model. Biometrics. 2002;58:531–9. doi: 10.1111/j.0006-341x.2002.00531.x.
- Chao A, Lee S-M. Estimating the number of classes via sample coverage. J Am Statist Assoc. 1992;87:210–7.
- Fang Y, Chlamtac I. Teletraffic analysis and mobility modeling of PCS networks. IEEE Trans Commun. 1999;47:1062–72.
- Hong S-H, Bunge J, Jeon S-O, Epstein S. Predicting microbial species richness. Proc Nat Acad Sci. 2006:117–22. doi: 10.1073/pnas.0507245102.
- Lindsay BG. The geometry of mixture likelihoods: a general theory. Ann Statist. 1983;11:86–94.
- Lindsay BG. Mixture Models: Theory, Geometry and Applications. Hayward, CA: Institute of Mathematical Statistics; 1995. NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 5.
- Magder LS, Zeger SL. A smooth nonparametric estimate of a mixing distribution using mixtures of Gaussians. J Am Statist Assoc. 1996;91:1141–51.
- Morgan BJT, Ridout MS. A new mixture model for capture heterogeneity. Appl Statist. 2008;57:433–46.
- Morris JS, Baggerly KA, Coombes KR. Bayesian shrinkage estimation of the relative abundance of mRNA transcripts using SAGE. Biometrics. 2003;59:476–86. doi: 10.1111/1541-0420.00057.
- Norris JL, Pollock KH. Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Envir Ecol Statist. 1998;5:391–402.
- Pledger S. Unified maximum likelihood estimates for closed capture-recapture models using mixtures. Biometrics. 2000;56:434–42. doi: 10.1111/j.0006-341x.2000.00434.x.
- R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2010. URL: http://www.R-project.org.
- Sanathanan L. Estimating the size of a multinomial population. Ann Math Statist. 1972;43:142–52.
- Sanathanan L. Estimating the size of a truncated sample. J Am Statist Assoc. 1977;72:669–72.
- Silverman BW. Density Estimation for Statistics and Data Analysis. New York: Chapman & Hall; 1992.
- Simar L. Maximum likelihood estimation of a compound Poisson process. Ann Statist. 1976;4:1200–9.
- Teicher H. Identifiability of finite mixtures. Ann Math Statist. 1963;34:1265–9.
- Thygesen HH, Zwinderman AH. Modeling SAGE data with a truncated gamma-Poisson model. BMC Bioinformatics. 2006;7:157. doi: 10.1186/1471-2105-7-157.
- Wang J-P. A linearization procedure and a VDM/ECM algorithm for penalized and constrained nonparametric maximum likelihood estimation for mixture models. Comp Statist Data Anal. 2007;51:2946–57.
- Wang J-P, Lindsay BG. An exponential partial prior for improving nonparametric maximum likelihood estimation in mixture models. Statist Methodol. 2008;5:30–45.
- Wang J-PZ, Lindsay B, Cui L, Wall P, Marion J, Zhang J, dePamphilis CW. Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries. BMC Bioinformatics. 2005;6:300. doi: 10.1186/1471-2105-6-300.
- Wang J-PZ, Lindsay BG. A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Statist Assoc. 2005;100:942–59.


