Abstract
We discuss fully Bayesian inference in a class of species sampling models that are induced by residual allocation (sometimes called stick-breaking) priors on almost surely discrete random measures. This class provides a generalization of the well-known Ewens sampling formula that allows for additional flexibility while retaining computational tractability. In particular, the procedure is used to derive the exchangeable predictive probability functions associated with the generalized Dirichlet process of Hjort (2000) and the probit stick-breaking prior of Chung and Dunson (2009) and Rodriguez and Dunson (2011). The procedure is illustrated with applications to genetics and nonparametric mixture modeling.
Keywords: Exchangeable partition probability function, Generalized Dirichlet process, Probit stick-breaking process, Size-biased permutation
1. Introduction
An exchangeable sequence of random variables X1, X2, … defined on a probability space follows a species sampling model (SSM) if its joint distribution can be characterized by a sequence of predictive rules where X1 ~ G0 and

$$X_{n+1} \mid X_1, \ldots, X_n \sim \sum_{k=1}^{K_n} p_k(\mathbf{m}_n)\, \delta_{X_k^*} + p_{K_n+1}(\mathbf{m}_n)\, G_0$$

for some non-atomic measure G0 and for all n ≥ 1. In the previous expression, δa denotes the degenerate probability measure placing probability 1 on a, $\mathbf{m}_n = (m_1, \ldots, m_{K_n})$ with $m_k$ being the number of values among X1, … , Xn that are equal to the k-th distinct value $X_k^*$, $K_n$ being the number of unique values among X1, … , Xn, and $p_k(\mathbf{m}_n)$ for k ≤ Kn + 1 and n = 1, 2, … being a collection of functions of mn (usually called the predictive probability functions, or PPFs) which for all n and mn satisfy

$$p_k(\mathbf{m}_n) \ge 0 \quad \text{and} \quad \sum_{k=1}^{K_n+1} p_k(\mathbf{m}_n) = 1.$$
The name species sampling model comes from the application of this type of model in ecology and genetics (Fisher et al., 1943; Ewens, 1972; Engen, 1978). Indeed, we could think of sequentially sampling individuals in an infinite population with an unknown (and potentially infinite) number of species. When the first individual is sampled, its species is assigned a random tag generated according to G0. Subsequently, individual n + 1 either belongs to one of the previously observed Kn species with respective probabilities $p_1(\mathbf{m}_n), \ldots, p_{K_n}(\mathbf{m}_n)$, or to a new species (which is again given a random tag according to G0) with probability $p_{K_n+1}(\mathbf{m}_n)$.
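A well-defined set of PPFs implies a symmetric joint probability distribution for the number of species and the sample sizes associated with each one of them, which can be obtained through a recursion where p(1) = 1,

$$p(m_1, \ldots, m_k + 1, \ldots, m_{K_n}) = p_k(\mathbf{m}_n)\, p(m_1, \ldots, m_{K_n})$$

for k ≤ Kn, and

$$p(m_1, \ldots, m_{K_n}, 1) = p_{K_n+1}(\mathbf{m}_n)\, p(m_1, \ldots, m_{K_n})$$

(for example, see Pitman, 1995 and Lee et al., 2013). The function p is often called the exchangeable partition probability function (EPPF). Alternatively, this can be written as

$$p_k(\mathbf{m}_n) = \frac{p(m_1, \ldots, m_k + 1, \ldots, m_{K_n})}{p(m_1, \ldots, m_{K_n})} \;\; (k \le K_n), \qquad p_{K_n+1}(\mathbf{m}_n) = \frac{p(m_1, \ldots, m_{K_n}, 1)}{p(m_1, \ldots, m_{K_n})}. \tag{1}$$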
In principle, constructing EPPFs is a difficult task, as ensuring that X1, X2, … is an exchangeable sequence implies quite strict conditions on the PPFs (for a discussion, see for example Pitman, 1995, Pitman, 1996b and Lee et al., 2013). As a consequence, the number of species sampling models available in the literature is rather small, with the most popular ones arguably being those associated with the Dirichlet process (DP) (Ferguson, 1973; Blackwell and MacQueen, 1973), the two-parameter Poisson-Dirichlet process (PDP) (Pitman, 1995), and the normalized inverse Gaussian measures (NIGM) (Lijoi et al., 2005).
The previous list suggests that there is a close link between the class of nonparametric priors on almost-surely discrete distributions often used in nonparametric Bayesian modeling and the class of SSMs. Indeed, a well known result due to Pitman (1996b) establishes that the de Finetti measure of any SSM can be written as
$$G(\cdot) = \sum_{k=1}^{\infty} \omega_k\, \delta_{\vartheta_k}(\cdot) + R\, G_0(\cdot) \tag{2}$$

for some sequence of positive random variables ω1, ω2, … and R such that $R = 1 - \sum_{k} \omega_k \ge 0$ almost surely, $\vartheta_1, \vartheta_2, \ldots$ is a random sample from a non-atomic G0, and the sequences ω1, ω2, … and $\vartheta_1, \vartheta_2, \ldots$ are independent. The resulting SSM is termed proper if R = 0 with probability 1. This connection can be exploited to generate novel SSMs. In particular, we study the SSM induced by a class of residual allocation models, as well as the special cases associated with the generalized Dirichlet process (GDP) of Hjort (2000), and the probit stick-breaking processes (PSBP) (Rodriguez et al., 2009; Chung and Dunson, 2009; Rodriguez and Dunson, 2011). In addition to model construction, we discuss computational issues associated with Bayesian estimation in this class of models.
The remainder of the paper is organized as follows. Section 2 discusses the EPPF for a class of species sampling models, specifically, independent residual allocation models, giving a general expression and analyzing some special well-known particular cases. Section 3 applies the results to the probit stick-breaking priors of Chung and Dunson (2009) and Rodriguez and Dunson (2011) and to the generalized Dirichlet process of Hjort (2000). Section 4 discusses Bayesian inference for the parameter controlling the allocation distribution and the probability of discovery of a new species for the class of models considered here. Section 5 illustrates our methods using both simulated and real datasets. Final comments are given in Section 6. An Appendix contains proofs of some auxiliary results.
2. Exchangeable partition probability functions and predictive probability functions derived from residual allocation models
A random probability measure G defined on a probability space is said to follow an independent residual allocation prior if it can be represented as
$$G(\cdot) = \sum_{k=1}^{N} \omega_k\, \delta_{\vartheta_k}(\cdot) \tag{3}$$

where $\vartheta_1, \vartheta_2, \ldots$ is a sequence of independent and identically distributed realizations from some distribution G0 and ωk = zk ∏ℓ<k{1 – zℓ}, where $z_k \sim H_k(\cdot \mid \theta)$ independently for all k = 1, 2, … (with the convention zN = 1 if N < ∞), and $H_k(\cdot \mid \theta)$ is a probability distribution on [0, 1] indexed by the vector of parameters θ. Two of the best-known examples of residual allocation priors are the Dirichlet process (where N = ∞, θ = b, and zk ~ Beta(1, b) for some b > 0) and the Poisson-Dirichlet process (for which θ = (a, b), zk ~ Beta(1 – a, b + ka) and either N = ∞, 0 ≤ a < 1 and b > –a or N < ∞, a < 0 and b = –aN). It follows that (3) is a special case of SSM (2). It should also be noted that for the case N = ∞, the resulting residual allocation model defines a probability model (i.e. the SSM is proper) if and only if $\sum_{k=1}^{\infty} E\{\log(1 - z_k)\} = -\infty$, which is for instance the case of the PDP. See Barrientos et al. (2012).
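As a rough illustration of the construction ωk = zk ∏ℓ<k{1 – zℓ}, the following sketch draws the first few weights when the ratios are i.i.d. Beta(a, b). The function name, the truncation at `n_atoms`, and the choice of a Beta distribution for the ratios are assumptions made only for this example.

```python
import random


def stick_breaking_weights(a, b, n_atoms, rng):
    """Draw the first n_atoms weights with ratios z_k ~ Beta(a, b), i.i.d.

    Returns the weights and the mass left unassigned after n_atoms breaks.
    """
    weights = []
    remaining = 1.0  # length of the stick still to be broken
    for _ in range(n_atoms):
        z = rng.betavariate(a, b)
        weights.append(z * remaining)  # omega_k = z_k * prod_{l<k} (1 - z_l)
        remaining *= 1.0 - z
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(1.0, 2.0, 50, rng)
print(sum(weights) + leftover)  # the assigned and unassigned mass add up to one
```

Setting a = 1 gives the Dirichlet process ratios Beta(1, b) described above; the leftover mass shrinks towards zero as more atoms are generated.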
Kingman (1978) and Pitman (1996b) discuss in more generality the relationship between almost surely discrete random distributions and SSMs. In particular, note that if X1, … , Xn is an independent and identically distributed sample from G which follows a stick-breaking prior, then it is exchangeable and there is a positive probability of ties among the Xis. In fact, the EPPF associated with this prior can be computed as
$$p(m_1, \ldots, m_{K_n}) = \sum_{(j_1, \ldots, j_{K_n}) \in \mathcal{J}_{K_n}} E\left\{ \prod_{k=1}^{K_n} \omega_{j_k}^{m_k} \right\} \tag{4}$$

where $\mathcal{J}_{K_n}$ represents the set of all possible sequences of distinct (not necessarily consecutive) positive integers of length Kn. Briefly, this formula states that to compute the EPPF we need to consider all possible allocations of the observed species to Kn of the atoms in (3), and compute the expected value of the probability of obtaining the sample under each of these allocations. Note that once the EPPF has been obtained, the PPFs necessary to define the associated SSM can be obtained using (1).
Example 1 (The Ewens sampling formula and the Dirichlet process)
Consider the case n = 2. When computing p(2) (i.e., the probability that the first two samples belong to the same species), Equation (4) reduces to $p(2) = \sum_{k} E(\omega_k^2)$. More generally, the probability that all the samples belong to the same species is simply $p(n) = \sum_{k} E(\omega_k^n)$. In the case of the Dirichlet process, this implies

$$p(n) = \frac{\Gamma(b+1)\, \Gamma(n)}{\Gamma(b+n)},$$

which is a well known result.
The residual allocation construction associated with the weights in equation (3) provides a very general mechanism to define priors on random measures and, implicitly, on the associated SSMs. In the sequel, we will focus on the special case where all the stick-breaking ratios zk are not only independent but also identically distributed, i.e., zk ~ H for all k = 1, 2, …. Note that this type of construction leads to weights that are stochastically ordered. Indeed, in this case we have ωk+1/ωk = (1 – zk)zk+1/zk for all k ≥ 1. Since P(zk ≤ t) = P(zk+1 ≤ t) and P((1 – zk)zk+1 ≤ zk+1) = 1, the event zk+1 ≤ t implies (1 – zk)zk+1 ≤ t, so that P(zk ≤ t) = P(zk+1 ≤ t) ≤ P((1 – zk)zk+1 ≤ t), from which stochastic order follows. However, in spite of this property, this class of models still affords a great deal of flexibility while retaining some tractability.
Lemma 1
The EPPF induced by the stick breaking prior in (3) is given by
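$$p(m_1, \ldots, m_{K_n}) = \sum_{\sigma \in \mathcal{S}_{K_n}} \prod_{k=1}^{K_n} \frac{\gamma_\theta\left(m_{\sigma(k)}, \sum_{\ell=k+1}^{K_n} m_{\sigma(\ell)}\right)}{1 - \gamma_\theta\left(0, \sum_{\ell=k}^{K_n} m_{\sigma(\ell)}\right)} \tag{5}$$

where $\mathcal{S}_{K_n}$ denotes the set of all permutations of {1, … , Kn}, and $\gamma_\theta(x, y) = E_\theta\{z^x (1 - z)^y\}$ with z ~ H.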
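For a proof, see the Appendix. Note that, for the case where the stick-breaking ratios are i.i.d., we only need to consider the order in which the species arise during sampling, and not the actual identity of the atoms. From (5) it is easy to see that for $\mathbf{m}_n = (n)$ (i.e., all samples belong to the same species, or Kn = 1) we have

$$p(n) = \frac{\gamma_\theta(n, 0)}{1 - \gamma_\theta(0, n)} \tag{6}$$

while in the case $\mathbf{m}_n = (1, \ldots, 1)$ (i.e., each sample belongs to a different species, or Kn = n) we have

$$p(1, \ldots, 1) = n! \prod_{k=1}^{n} \frac{\gamma_\theta(1, n-k)}{1 - \gamma_\theta(0, n-k+1)}. \tag{7}$$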
Furthermore, equation (7) corresponds to equation (5.2) from Hjort (2000).
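For small Kn the sum over permutations in Lemma 1 can be evaluated by brute force. The sketch below assumes that each permutation σ contributes the product over k of γθ(mσ(k), Σℓ>k mσ(ℓ)) / {1 – γθ(0, Σℓ≥k mσ(ℓ))}, which is the form obtained by summing over the gaps between occupied atoms, and that z ~ Beta(a, b). The function names are hypothetical. With a = 1 the result can be compared against the Ewens sampling formula.

```python
import math
from itertools import permutations


def gamma_beta(x, y, a, b):
    """gamma(x, y) = E{z^x (1 - z)^y} for z ~ Beta(a, b)."""
    log_val = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
               + math.lgamma(a + x) + math.lgamma(b + y)
               - math.lgamma(a + b + x + y))
    return math.exp(log_val)


def g_ordered(counts, gamma):
    """Contribution of one ordering; counts[k] is the size of the species
    sitting on the k-th occupied atom."""
    total = 1.0
    tail = sum(counts)  # observations on this atom and on all later atoms
    for c in counts:
        total *= gamma(c, tail - c) / (1.0 - gamma(0, tail))
        tail -= c
    return total


def eppf(m, gamma):
    """Sum the ordered contributions over all K! orderings of the species."""
    return sum(g_ordered(order, gamma) for order in permutations(m))


def ewens(m, b):
    """Ewens sampling formula, used here only as a reference value."""
    n, k = sum(m), len(m)
    log_val = (k * math.log(b) + math.lgamma(b) - math.lgamma(b + n)
               + sum(math.lgamma(c) for c in m))
    return math.exp(log_val)


dp_gamma = lambda x, y: gamma_beta(x, y, 1.0, 2.0)
print(eppf((3, 2, 1), dp_gamma), ewens((3, 2, 1), 2.0))
```

The cost grows as Kn!, so this is only useful as a check on small configurations.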
Example 2 (The Ewens sampling formula and the Dirichlet process continued)
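Again, if zk ~ Beta(1, b) we have θ = b and

$$\gamma_b(x, y) = \frac{b\, \Gamma(x+1)\, \Gamma(b+y)}{\Gamma(b+x+y+1)}.$$

Hence, writing $M_k = \sum_{\ell=k}^{K_n} m_{\sigma(\ell)}$ (with $M_{K_n+1} = 0$),

$$\gamma_b\left(m_{\sigma(k)}, M_{k+1}\right) = \frac{b\, \Gamma(m_{\sigma(k)}+1)\, \Gamma(b+M_{k+1})}{\Gamma(b+M_k+1)} \tag{8}$$

while

$$1 - \gamma_b(0, M_k) = \frac{M_k}{b+M_k}. \tag{9}$$

Putting (8) and (9) together we have

$$p(m_1, \ldots, m_{K_n}) = \frac{b^{K_n}\, \Gamma(b)}{\Gamma(b+n)} \left\{ \prod_{k=1}^{K_n} m_k! \right\} \sum_{\sigma} \prod_{k=1}^{K_n} \frac{1}{M_k} = \frac{b^{K_n}\, \Gamma(b)}{\Gamma(b+n)} \prod_{k=1}^{K_n} (m_k - 1)!,$$

which is a well known formula for the Dirichlet process. The fact that $\sum_{\sigma} \prod_{k} M_k^{-1} = \prod_{k} m_k^{-1}$ can be shown by induction (see the Appendix).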
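Note that the function

$$g_\theta(\mathbf{m}_n, \sigma) = \prod_{k=1}^{K_n} \frac{\gamma_\theta\left(m_{\sigma(k)}, \sum_{\ell=k+1}^{K_n} m_{\sigma(\ell)}\right)}{1 - \gamma_\theta\left(0, \sum_{\ell=k}^{K_n} m_{\sigma(\ell)}\right)}$$

satisfies gθ(1) = 1, but is in general not symmetric in its arguments and depends on the permutation σ, which encodes the order of the indices that identify the components that generate the unique labels associated with the observed species. Indeed, in what follows it will be convenient to interpret $g_\theta(\mathbf{m}_n, \sigma)$ as describing the joint probability distribution of the vector $\mathbf{m}_n$ and the permutation σ, $p(\mathbf{m}_n, \sigma \mid \theta)$. This joint probability can be potentially factorized as

$$p(\mathbf{m}_n, \sigma \mid \theta) = p(\sigma \mid \mathbf{m}_n, \theta)\, p(\mathbf{m}_n \mid \theta).$$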
Example 3 (The Ewens sampling formula and the Dirichlet process continued)
Since the Dirichlet process is invariant to size-biased permutations of the components, for the Ewens sampling formula we have

$$p(\mathbf{m}_n \mid b) = \frac{b^{K_n}\, \Gamma(b)}{\Gamma(b+n)} \prod_{k=1}^{K_n} (m_k - 1)!$$

and

$$p(\sigma \mid \mathbf{m}_n, b) = \prod_{k=1}^{K_n} \frac{m_{\sigma(k)}}{\sum_{\ell=k}^{K_n} m_{\sigma(\ell)}}.$$

This probability is maximized for the permutation σ’ such that mσ’(1) ≥ mσ’(2) ≥ ⋯ ≥ mσ’(Kn). Indeed, since in the Dirichlet process the weights are stochastically ordered, we do not expect all possible permutations of the labels to be equiprobable. Instead, permutations that assign larger groups to lower indexes should be preferred.
The interpretation of gθ as a joint probability distribution will be helpful later in developing computational algorithms for these species sampling models.
3. Two new species sampling models
3.1. The probit stick-breaking process
Consider the class of probit stick-breaking priors (Chung and Dunson, 2009; Rodriguez and Dunson, 2011) where zk = Φ(uk) and uk ~ N(μ, τ²). Note that if μ = 0 and τ² = 1, then zk ~ Uni[0, 1], which is the Beta(1, 1) distribution, and we recover the Dirichlet process with b = 1 as a special case.
For any integers c and d, we can write
Rodriguez and Dunson (2011) show that, for any integer q, where , and Tq = (T1, … , Tq)’ follows a q-variate normal distribution with mean E(Ti) = μ and covariance matrix satisfying Var(Ti) = τ2 + 1 and Cov(Ti, Tj) = τ2. Hence
Alternatively, noting that , where u* = Φ(z*) and z* ~ N(–μ, τ2), we could also write
where . Computationally, the first expression would be preferred if d < c and the second if d > c (if d = c both formulas have the same computational complexity).
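As a numerical cross-check on the expressions above, γ(c, d) = E{Φ(u)^c (1 – Φ(u))^d} with u ~ N(μ, τ²) can also be approximated by one-dimensional quadrature. The sketch below uses a plain trapezoidal rule; this is a swapped-in technique chosen for illustration, not the multivariate normal representation of Rodriguez and Dunson (2011), and the function names and grid settings are assumptions.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gamma_probit(c, d, mu, tau, n_grid=4000, width=10.0):
    """Trapezoidal approximation to E{Phi(u)^c (1 - Phi(u))^d}, u ~ N(mu, tau^2).

    The integral is truncated to mu +/- width * tau.
    """
    lo = mu - width * tau
    step = 2.0 * width * tau / n_grid
    total = 0.0
    for i in range(n_grid + 1):
        u = lo + i * step
        density = (math.exp(-0.5 * ((u - mu) / tau) ** 2)
                   / (tau * math.sqrt(2.0 * math.pi)))
        phi = std_normal_cdf(u)
        value = density * phi ** c * (1.0 - phi) ** d
        # end points receive half weight in the trapezoidal rule
        total += value if 0 < i < n_grid else 0.5 * value
    return total * step


# With mu = 0 and tau = 1 the ratio z = Phi(u) is uniform on [0, 1],
# so gamma(2, 1) should be close to E{z^2 (1 - z)} = 1/12.
print(gamma_probit(2, 1, 0.0, 1.0))
```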
3.2. A generalized Dirichlet process
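Hjort (2000) defined a new class of residual allocation priors by letting N = ∞ and zk ~ Beta(a, b). Since a = 1 implies that G follows a Dirichlet process, he called this construction a generalized Dirichlet process (GDP). To compute the EPPF associated with the GDP, note that

$$\gamma_{a,b}(x, y) = \frac{\Gamma(a+b)\, \Gamma(a+x)\, \Gamma(b+y)}{\Gamma(a)\, \Gamma(b)\, \Gamma(a+b+x+y)}.$$

Hence, writing $M_k = \sum_{\ell=k}^{K_n} m_{\sigma(\ell)}$ (with $M_{K_n+1} = 0$),

$$p(m_1, \ldots, m_{K_n}) = \sum_{\sigma \in \mathcal{S}_{K_n}} \prod_{k=1}^{K_n} \frac{\Gamma(a+b)\, \Gamma(a+m_{\sigma(k)})\, \Gamma(b+M_{k+1})}{\Gamma(a) \left\{ \Gamma(b)\, \Gamma(a+b+M_k) - \Gamma(a+b)\, \Gamma(b+M_k) \right\}}.$$

Example 4
Note that if a = 1 this expression leads to

$$p(m_1, \ldots, m_{K_n}) = \frac{b^{K_n}\, \Gamma(b)}{\Gamma(b+n)} \prod_{k=1}^{K_n} (m_k - 1)!,$$

as expected.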
As the following result shows, the GDP is particularly appealing as the basis for creating species sampling models because it allows for a multitude of asymptotic behaviors for the expected number of species.
Lemma 2
The expected number of clusters Ea,b(Kn) for the species sampling model induced by a generalized Dirichlet process where zk ~ Beta(a, b) is given by
| (10) |
For a proof, see the Appendix. Note that for a = 1, equation (10) simplifies to $E_{1,b}(K_n) = \sum_{i=1}^{n} b/(b+i-1)$, a well known result for the Dirichlet process (Antoniak, 1974). For more general values of a, Stirling’s approximation can be used to show that
where C(a, b) = {aΓ(a + b)/Γ(b)} exp{−2(a + 1)}. Hence, for a ≤ 1 the expected number of distinct species will grow slowly but without bound as n increases, with Ea,b(Kn) ~ o(n^{1–a}). This behavior is similar to the one associated with the Poisson-Dirichlet process (for example, see Pitman, 2006 and Sudderth and Jordan, 2009). However, for a > 1 the expected number of species instead converges to a finite constant, a result that is similar to the one obtained from a hierarchical specification for the Poisson-Dirichlet process in which a < 0, b = –aL for some integer L, and L is given a prior with support over the positive integers and having a finite mean.
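The growth of the expected number of species can also be explored by simulation, without relying on the closed form in (10). The sketch below is a plain Monte Carlo estimate under zk ~ Beta(a, b); it generates the stick-breaking ratios lazily for each replicate, and the function names, the cap `max_atoms`, and the number of replicates are assumptions made for illustration.

```python
import random


def sample_num_species(n, a, b, rng, max_atoms=100000):
    """Number of distinct species among n draws from one realization of G."""
    ratios = []   # stick-breaking ratios z_1, z_2, ..., generated on demand
    seen = set()  # indices of the atoms observed so far
    for _ in range(n):
        u = rng.random()
        remaining = 1.0
        for k in range(max_atoms):
            if k == len(ratios):
                ratios.append(rng.betavariate(a, b))
            weight = ratios[k] * remaining
            if u < weight:
                break  # the observation falls on atom k
            u -= weight
            remaining *= 1.0 - ratios[k]
        seen.add(k)
    return len(seen)


def expected_num_species(n, a, b, n_rep, rng):
    """Monte Carlo estimate of E(K_n) over n_rep independent realizations."""
    return sum(sample_num_species(n, a, b, rng) for _ in range(n_rep)) / n_rep


rng = random.Random(42)
estimate = expected_num_species(50, 1.0, 2.0, 1000, rng)
exact_dp = sum(2.0 / (2.0 + i) for i in range(50))  # Dirichlet process case, a = 1
print(estimate, exact_dp)
```

For a = 1 the estimate can be compared with the Dirichlet process value Σ b/(b + i – 1); other values of a can be tried to see the different growth regimes described above.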
4. Bayesian inference for species sampling models
In this section we discuss practical issues associated with Bayesian inference for species sampling models with an EPPF defined by (5). In particular, we focus on estimation of the parameter θ and prediction of the probability of discovery of new species in a new sample using simulation-based algorithms.
4.1. Parameter estimation
Given a prior p(θ), the posterior distribution of the parameters of the SSM is trivially given by

$$p(\theta \mid \mathbf{m}_n) \propto p(\theta)\, p(\mathbf{m}_n \mid \theta),$$

where $p(\mathbf{m}_n \mid \theta)$ was given in (5). Since θ is typically a low dimensional vector, exploring this posterior distribution using a Markov chain Monte Carlo algorithm should be in principle straightforward. However, for realistic applications, evaluating the sum over all possible permutations of the set {1, 2, … , Kn} is infeasible (for example, with only Kn = 10 species, the number of terms in the sum is 10! = 3,628,800, all of which need to be evaluated each time the likelihood is calculated).
In a Bayesian setting we can get around this issue by reinterpreting $g_\theta(\mathbf{m}_n, \sigma)$ as describing a joint distribution over both the permutation σ and the vector of observations $\mathbf{m}_n$, and treating σ as a latent variable to be imputed as part of an augmented sampler. Once we augment the model by introducing the permutation σ, the joint posterior distribution can be written as

$$p(\theta, \sigma \mid \mathbf{m}_n) \propto p(\theta)\, g_\theta(\mathbf{m}_n, \sigma).$$

This posterior distribution can be explored by alternately sampling from the full conditional associated with the parameters θ that control the prior size of the weights, and the full conditional distribution of the permutation σ controlling the assignment of species to atoms:
- Randomly initialize the values of the model parameters to θ(0) and σ(0).
- For each b = 1, … , B repeat the following two steps:
- (a) Sample θ(b) from the full conditional distribution p (θ ∣ m, σ(b–1)), possibly using a Metropolis-Hastings step.
- (b) Sample σ(b) from the full conditional distribution p (σ ∣ m, θ(b)), possibly using a Metropolis-Hastings step.
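The two-step scheme above can be sketched as follows for the SSM induced by the generalized Dirichlet process. This is a minimal illustration rather than the implementation used in the paper: it assumes exponential priors with mean 1 on a and b, an uncorrelated Gaussian random walk on (log a, log b) with a fixed step size, and the simple two-index swap for σ. All function names are hypothetical, and everything is computed on the log scale to avoid underflow.

```python
import math
import random


def log_gamma_beta(x, y, a, b):
    """log E{z^x (1 - z)^y} for z ~ Beta(a, b)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + math.lgamma(a + x) + math.lgamma(b + y)
            - math.lgamma(a + b + x + y))


def log_g(m, sigma, a, b):
    """log of the order-dependent term for counts m and permutation sigma."""
    total, tail = 0.0, sum(m)
    for s in sigma:
        total += log_gamma_beta(m[s], tail - m[s], a, b)
        total -= math.log(-math.expm1(log_gamma_beta(0, tail, a, b)))
        tail -= m[s]
    return total


def log_target(m, sigma, log_a, log_b):
    a, b = math.exp(log_a), math.exp(log_b)
    # Exp(1) priors on a and b (assumed) plus the Jacobian of the log transform.
    return log_g(m, sigma, a, b) - a - b + log_a + log_b


def run_sampler(m, n_iter, step, rng):
    sigma = list(range(len(m)))
    rng.shuffle(sigma)
    log_a, log_b = 0.0, 0.0
    draws = []
    for _ in range(n_iter):
        # (a) random-walk Metropolis step for theta = (a, b) given sigma
        prop_a = log_a + rng.gauss(0.0, step)
        prop_b = log_b + rng.gauss(0.0, step)
        log_ratio = (log_target(m, sigma, prop_a, prop_b)
                     - log_target(m, sigma, log_a, log_b))
        if math.log(1.0 - rng.random()) < log_ratio:
            log_a, log_b = prop_a, prop_b
        # (b) symmetric swap proposal for sigma given theta
        a, b = math.exp(log_a), math.exp(log_b)
        i, j = rng.sample(range(len(m)), 2)
        proposal = list(sigma)
        proposal[i], proposal[j] = proposal[j], proposal[i]
        if math.log(1.0 - rng.random()) < log_g(m, proposal, a, b) - log_g(m, sigma, a, b):
            sigma = proposal
        draws.append((a, b, tuple(sigma)))
    return draws


counts = (36, 160, 49, 85, 35, 57, 63, 10, 1, 1, 3)
draws = run_sampler(counts, 200, 0.3, random.Random(1))
print(draws[-1][:2])
```

A practical run would need far more iterations, a tuned proposal covariance, and the block and global moves for σ discussed in the text.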
For sampling θ we favor a Metropolis-Hastings step with a multivariate Gaussian random walk proposal on a suitable transformation of θ (for example, in the case of an SSM induced by the generalized Dirichlet process prior, we use a bivariate Gaussian random walk on log a and log b as our proposal distribution). The covariance matrix for the proposal is chosen on the basis of an estimate of Cov(θ ∣ m) obtained from a preliminary run that uses uncorrelated proposals. The covariance matrix obtained in this way is then rescaled to obtain an overall acceptance rate of around 35%. The acceptance ratio for these proposals reduces to
where ∣J(θ)∣ is the Jacobian of the transformation applied to θ.
For sampling σ, the most obvious proposal involves generating a new permutation σ* from the current state of the Markov chain, σ(b–1), by randomly selecting two indexes i, j ∈ {1, … , Kn} and setting σ*(i) = σ(b–1)(j) and σ*(j) = σ(b–1)(i), while keeping all other entries of σ* identical to those of σ(b–1). Because this proposal is symmetric, the acceptance ratio reduces to

$$\frac{g_{\theta^{(b)}}(\mathbf{m}, \sigma^*)}{g_{\theta^{(b)}}(\mathbf{m}, \sigma^{(b-1)})}.$$

Although this is a natural approach, our numerical experiments suggest that this simple scheme mixes too slowly to be practical. Instead, we consider sampling algorithms that simultaneously update blocks of r randomly selected components of the permutation σ. More specifically, note that for any set of indexes Ξ = {ξ1, … , ξr} with ξi ∈ {1, … , Kn} and ξi ≠ ξj for i ≠ j, and for any permutation π of Ξ, the full conditional probability given by $p\{(\sigma_{\xi_1}, \ldots, \sigma_{\xi_r}) = (\sigma_{\pi(\xi_1)}, \ldots, \sigma_{\pi(\xi_r)}) \mid \mathbf{m}, \cdots\}$ is proportional to

$$g_\theta(\mathbf{m}, \sigma^{\pi}),$$

where $\sigma^{\pi}$ denotes the permutation obtained from σ by rearranging the entries in the positions Ξ according to π.
For small values of r (say, in the range of 4 to 6) an algorithm that uniformly selects the indexes in Ξ at each iteration can sample from this full conditional distribution in a reasonable amount of time. In particular, note that the computation of $p(\sigma_{\xi_1}, \ldots, \sigma_{\xi_r} \mid \mathbf{m}, \cdots)$ can be easily parallelized. Furthermore, in the case r = 2 this scheme is essentially equivalent to the naive algorithm described before.
The proposal described above works well when the values m1, … , mKn are, for the most part, distinct. However, when there are multiple ties among the counts this proposal can be very inefficient, as permutations that involve swapping positions for species that have the same size lead to the same value of gθ. In those cases, we use stratified sampling instead of random sampling to select the set Ξ. More specifically, the set Ξ is selected by first selecting r cluster sizes among the unique values of m1, … , mKn, and then each ξi is selected uniformly among the indexes that correspond to each of the unique values previously selected. If m1, … , mKn are all distinct, this scheme is equivalent to randomly sampling ξ1, … , ξr without replacement from the set {1, … , Kn}. However, when ties are present, the algorithm will avoid considering swaps that involve two species with the same number of observations.
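The block update can be sketched by enumerating the r! rearrangements of the selected positions and normalizing the corresponding values of gθ. The code below assumes Beta(a, b) ratios and uniform (not stratified) selection of the positions; the function names are hypothetical. For a = 1 the full conditional over all positions should match the size-biased permutation probabilities of the Dirichlet process.

```python
import math
import random
from itertools import permutations


def log_gamma_beta(x, y, a, b):
    """log E{z^x (1 - z)^y} for z ~ Beta(a, b)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + math.lgamma(a + x) + math.lgamma(b + y)
            - math.lgamma(a + b + x + y))


def log_g(counts_in_order, a, b):
    """log g for species sizes listed in the order of their atoms."""
    total, tail = 0.0, sum(counts_in_order)
    for c in counts_in_order:
        total += log_gamma_beta(c, tail - c, a, b)
        total -= math.log(-math.expm1(log_gamma_beta(0, tail, a, b)))
        tail -= c
    return total


def block_conditional(m, sigma, positions, a, b):
    """Full conditional over rearrangements of sigma restricted to positions."""
    current = [sigma[p] for p in positions]
    candidates, log_w = [], []
    for perm in permutations(current):
        proposal = list(sigma)
        for p, s in zip(positions, perm):
            proposal[p] = s
        candidates.append(tuple(proposal))
        log_w.append(log_g([m[s] for s in proposal], a, b))
    top = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(v - top) for v in log_w]
    norm = sum(w)
    return {c: v / norm for c, v in zip(candidates, w)}


def sample_block(m, sigma, r, a, b, rng):
    """Pick r positions uniformly and draw sigma from its full conditional."""
    positions = rng.sample(range(len(sigma)), r)
    cond = block_conditional(m, sigma, positions, a, b)
    perms = list(cond)
    return rng.choices(perms, weights=[cond[p] for p in perms])[0]


cond = block_conditional((5, 2, 1), (0, 1, 2), [0, 1, 2], 1.0, 2.0)
print(cond[(0, 1, 2)])  # compare with (5/8) * (2/3) * (1/1) in the a = 1 case
```

The evaluations of `log_g` inside `block_conditional` are independent of each other, which is what makes this step easy to parallelize.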
Finally, in addition to the local updates for σ just described, we incorporate a global update of σ that consists of proposing a completely new permutation σ* from an independent proposal with

$$h(\sigma^*) = \prod_{k=1}^{K_n} \frac{m_{\sigma^*(k)}}{\sum_{\ell=k}^{K_n} m_{\sigma^*(\ell)}}.$$

(Recall that this is the probability of a permutation of the indexes associated with a Dirichlet process prior, and it favors permutations in which larger groups are assigned to lower indexes). The acceptance probability for this proposal is

$$\min\left\{ 1, \frac{g_\theta(\mathbf{m}, \sigma^*)\, h(\sigma)}{g_\theta(\mathbf{m}, \sigma)\, h(\sigma^*)} \right\}.$$

This long-range move is used sparingly. We select q, the number of steps between attempts to perform a global move, so that the probability of covering any specific position with the short-range moves discussed before over q moves, $1 - (1 - r/K_n)^q$, is high (in our numerical experiments, 0.999).
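The spacing between global moves can be computed directly once a coverage target is fixed. The sketch below reads the coverage probability as one minus the chance (1 – r/K)^q that a given position is missed by q consecutive block updates with uniformly selected positions; that reading, and the function name, are assumptions.

```python
import math


def moves_between_global_updates(r, k, target=0.999):
    """Smallest q such that 1 - (1 - r/k)^q >= target."""
    if r >= k:
        return 1  # every position is covered by a single block update
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - r / k))


q = moves_between_global_updates(4, 17)
print(q, 1.0 - (1.0 - 4 / 17) ** q)
```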
4.2. Predicting the number of new species
The previous MCMC algorithm can be easily extended to estimate the predictive distribution of , the number of new species to be observed if the current sample size is extended from n to . This can be done by computing an “order-dependent” EPPF, as described below.
Consider first the probability that a single newly sampled individual belongs to one of the existing species. In this case, the addition of the new individual does not affect the permutation σ. Furthermore, for the species identified by σk, it is clear that such probability is proportional to
On the other hand, the probability that a new species appears will depend on the position in the order where that new species resides. Hence, to compute this probability we need to consider an extended permutation of the elements of {1, … , Kn + 1} where (which indicates the relative position of the newly observed species) can take any value on {1, … , Kn + 1}, while must respect the same order of the elements induced by the original permutation σ. In particular, the permutation that assigns has probability proportional to
In the context of our MCMC algorithm, these probabilities can be used to sample the species of a new individual by extending the vector mn to mn+1 and (if required) the permutation σ to . Furthermore, by sequentially applying the same ideas we can simulate from the distribution of new individuals over species, allowing for inferences over , as well as any other functionals of interest.
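One way to sketch the order-dependent predictive step is to evaluate gθ on each candidate extension of the current configuration and divide by its current value. The code below assumes Beta(a, b) ratios, represents the state as the species sizes listed in atom order (so that `counts[k]` plays the role of mσ(k)), and uses hypothetical function names. Under these assumptions the ratios for the Kn existing species and the Kn + 1 possible positions of a new species add up to one.

```python
import math


def log_gamma_beta(x, y, a, b):
    """log E{z^x (1 - z)^y} for z ~ Beta(a, b)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + math.lgamma(a + x) + math.lgamma(b + y)
            - math.lgamma(a + b + x + y))


def log_g(counts, a, b):
    """log g for species sizes listed in the order of their atoms."""
    total, tail = 0.0, sum(counts)
    for c in counts:
        total += log_gamma_beta(c, tail - c, a, b)
        total -= math.log(-math.expm1(log_gamma_beta(0, tail, a, b)))
        tail -= c
    return total


def predictive_probabilities(counts, a, b):
    """Probabilities for the next individual given counts in atom order.

    Returns (existing, new): existing[k] is the probability of joining the
    species on the k-th occupied atom, new[j] the probability of a new
    species whose atom lands in relative position j.
    """
    base = log_g(counts, a, b)
    existing = []
    for k in range(len(counts)):
        grown = list(counts)
        grown[k] += 1
        existing.append(math.exp(log_g(grown, a, b) - base))
    new = []
    for j in range(len(counts) + 1):
        extended = list(counts[:j]) + [1] + list(counts[j:])
        new.append(math.exp(log_g(extended, a, b) - base))
    return existing, new


existing, new = predictive_probabilities([5, 2, 1], 0.5, 1.5)
print(sum(existing) + sum(new))  # total probability over all extensions
print(sum(new))                  # probability of discovering a new species
```

Applying the step repeatedly, updating the counts and the order after each draw, simulates the allocation of a whole batch of new individuals.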
5. Illustrations
5.1. Simulation study 1
We begin by considering four simulated datasets, which are presented in Table 1. The first two datasets were simulated from a species sampling model induced by a DP with precision parameter b = 2.0. These two datasets differ in terms of the total number of observations (n = 500 in the first case and n = 5000 in the second). On the other hand, the last two datasets were generated from the species sampling model induced by a GDP with parameters a = 0.2, b = 1.5, and a = 2.0, b = 1.5 (see Table 1). These parameters correspond to situations in which the prior expected number of species either grows without bound with the number of observations, or converges to a finite value as n grows. In addition to estimating the parameters of the GDP, we also predict the expected number of new species when the sample is extended with additional observations. These results are compared against those obtained by fitting species sampling models based on the Dirichlet process (fitted using the data-augmentation algorithm of Escobar and West, 1995) and/or the Poisson-Dirichlet process (fitted using a joint random walk Metropolis-Hastings algorithm). All results are based on 250000 iterations of the respective algorithms, which were obtained after a burn-in period of 20000 iterations. Priors for the parameters of the SSM associated with the GDP and DP are exponentials with mean 1, while for the PDP we use a uniform prior on the interval [0, 1] for the intensity parameter and an exponential with mean 1 for the concentration parameter.
Table 1.
Four datasets used in our first simulation study.
| ID | True a | True b | m |
|---|---|---|---|
| 1 | 1.0 | 2.0 | (36, 160, 49, 85, 35, 57, 63, 10, 1, 1, 3) |
| 2 | 1.0 | 2.0 | (2552, 1, 919, 624, 327, 227, 131, 129, 12, 5, 8, 30, 21, 1, 9, 3, 1) |
| 3 | 0.2 | 1.5 | (120, 19, 20, 19, 2, 22, 1, 2, 173, 21, 56, 2, 7, 2, 5, 2, 1, 4, 4, 12, 1, 4, 1) |
| 4 | 2.0 | 1.5 | (353, 81, 30, 16, 9, 6, 5) |
Table 2 presents posterior means and 95% posterior symmetric credible intervals for the parameters of the SSM induced by the GDP, along with the associated equivalent sample sizes (ESS). Overall, we are capable of recovering the true values of the parameters in all four examples, although the level of uncertainty associated with them is relatively high. Furthermore, we note that the equivalent sample sizes are relatively small, with the ESS tending to be smaller when the number of components is higher. This is likely a consequence of the need to sample the permutation of the groups using a Metropolis step, which increases the overall autocorrelation in the chain.
Table 2.
Results from our simulation study. The last four columns show the posterior mean and the posterior 95% symmetric credible intervals for the parameters of the SSM induced by a generalized Dirichlet process, along with the equivalent sample sizes associated with the MCMC sample.
| ID | n | K | True a | True b | E(a ∣ m) under GDP | ESS(a) | E(b ∣ m) under GDP | ESS(b) |
|---|---|---|---|---|---|---|---|---|
| 1 | 500 | 11 | 1.0 | 2.0 | 0.98 (0.07, 2.63) | 11,817 | 1.70 (0.45, 4.00) | 12,136 |
| 2 | 5000 | 17 | 1.0 | 2.0 | 1.09 (0.04, 3.30) | 2,979 | 2.07 (0.51, 5.12) | 3,394 |
| 3 | 500 | 23 | 0.2 | 1.5 | 0.35 (0.01, 1.14) | 3,963 | 2.26 (0.82, 4.96) | 5,497 |
| 4 | 500 | 7 | 2.0 | 1.5 | 1.47 (0.09, 4.41) | 15,801 | 1.31 (0.28, 3.46) | 13,616 |
Figures 1(a) and 1(b) present estimates of the predictive distribution of the number of new species in a sample of 100 observations for our first two datasets. In this setting we compare the distribution generated by the SSM associated with the GDP with unknown parameters, with that generated by the SSM associated with a DP with unknown parameter b (recall that this is equivalent to setting a = 1 in the GDP), and that associated with a DP with known (true) parameter b = 2. Note that, even though there is substantial uncertainty associated with the estimates of a and b, in both cases the predictions of the number of species from the GDP are essentially identical to those generated by the DP with unknown parameter. Similarly, Figures 2(a) and 2(b) show the predictive distributions for the number of species under the SSMs associated with the GDP and the PDP for the last two datasets. Interestingly enough, the predictions from these two models (which, it should be emphasized, are not nested) are very similar in these examples.
Figure 1.
Distribution of the number of new species in a new sample of 150 individuals for datasets 1 and 2. We compare the distribution under the true parameter with the posterior distribution under the species sampling model induced by a Dirichlet process with precision parameter b, and under a species sampling model induced by a generalized Dirichlet process model with parameters a and b.
Figure 2.
Posterior distribution of the number of new species in a new sample of 150 individuals for datasets 3 and 4. We compare the posterior distribution under the species sampling model induced by a Dirichlet process with precision parameter b and under a species sampling model induced by a generalized Dirichlet process model with parameters a and b.
5.2. Simulation study 2
In this section we consider two datasets, both with K = 8 and n = 1200 (see Table 3). Neither of the two datasets was simulated according to a known species sampling model. Instead, they were selected to represent two opposite situations in terms of the relative size of the observed species. Indeed, while the first dataset corresponds to a situation where species are uniformly represented, the second corresponds to a very non-uniform distribution.
Table 3.
Two datasets used in our second simulation study.
| ID | m |
|---|---|
| 5 | (150, 150, 150, 150, 150, 150, 150, 150) |
| 6 | (800, 300, 80, 16, 1, 1, 1, 1) |
Figure 3(a) shows the posterior expected number of new species for each of the two datasets under a generalized Dirichlet process with parameters a = 0.2 and b = 0.7 as a function of the new sample size. These curves were constructed using 40000 samples of our MCMC algorithm, obtained by discarding the first 200000 iterations of the chain as burn-in and retaining every 50th iteration thereafter. The differences between the two curves illustrate that, unlike the Poisson-Dirichlet process, predictions under the generalized Dirichlet process depend not only on the number of species observed so far, but also on the relative sizes of the species. To further illustrate the differences between the methods, we also present in Figure 3(b) the expected number of species for each of the two datasets under the generalized Dirichlet process when the values of a and b are estimated from the data. These estimates are complemented with the expected values under a Poisson-Dirichlet process whose parameters are also treated as unknown.
Figure 3.
Posterior expected number of new species in an additional sample of size for the datasets in Table 3.
5.3. Modeling expressed sequence tag data
As a final illustration we consider a real dataset discussed in Mao and Lindsay (2002), Mao (2004), and Lijoi et al. (2007). The data consist of n = 2574 randomly selected expressed sequence tags taken from a large cDNA library made from the 0 mm to 3 mm buds of tomato flowers. The number of distinct tags observed in this sample is K = 1814; Table 4 shows the number of times each species size was observed in the sample (e.g., 1423 tags were observed only once).
Table 4.
Data on the number of expressed sequence tags from a cDNA library made from the 0 mm to 3 mm buds of tomato flowers.
| Size | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 16 | 23 | 27 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 1423 | 253 | 71 | 33 | 11 | 6 | 2 | 3 | 1 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 1 |
We compare the species sampling model associated with the GDP against the species sampling models induced by the DP and the PDP. For this illustration, the algorithm described in Section 4 was used to generate 20000 samples from the posterior distribution, obtained after a burn-in of 100000 iterations and thinning every 80 iterations. The parameters a and b of the GDP are given exponential prior distributions with means 0.1 and 1000 respectively; the corresponding posterior means are E(a ∣ m) = 0.053 and E(b ∣ m) = 1345, while the respective 95% symmetric credible intervals are (0.001, 0.186) and (1172, 1584). The equivalent sample sizes for these parameters are 3385 for a and 5431 for b. In the case of the PDP, the parameters are given a uniform distribution on the (0, 1) interval and an exponential prior with mean 800; the corresponding posterior means are 0.61 and 735.9, with 95% credible intervals of (0.53, 0.67) and (515.6, 1017.4). Finally, for the DP an exponential prior with mean 1000 is employed for the concentration parameter, which leads to a posterior mean of 2724.9 and a 95% posterior credible interval of (2491.3, 2972.1).
Figure 4(a) presents the distribution of the number of unique species in a new sample of size 500 taken from this population under these three models, while Figure 4(b) shows the posterior expected number of species as a function of the size of the new sample. Not surprisingly given the large number of singleton species, the PDP favors a larger number of new species than the DP. On the other hand, the GDP seems to offer a compromise between both predictions, with a forecast of the number of species that lies in between those generated by the DP and the PDP.
Figure 4.
Prediction for the number of new species for the expressed sequence tag data of Mao and Lindsay (2002). Panel (a) shows the posterior distribution of the number of new species in a new sample of 500 individuals under each of three competing models: the generalized Dirichlet process (GDP), the Poisson-Dirichlet process (PDP) and the Dirichlet process (DP). Panel (b) shows the posterior expected number of species for a new sample of size under each of the three competing models.
6. Discussion
We have discussed the structure of species sampling models that arise from residual allocation models with independent and identically distributed stick-breaking ratios. The approach we propose avoids an explicit truncation of the process and the need to directly infer the underlying probabilities of the different species.
One interesting observation from our examples is that, for small samples, there is little difference between the GDP and other similar models. However, for large samples with a large number of individuals belonging to unique species, the results can be quite different. Another interesting observation relates to the relatively high levels of uncertainty associated with posterior estimates of the parameters of the species sampling models. As pointed out by one referee, this is particularly noteworthy in situations such as dataset 4, where the posterior credible interval for a in the GDP encompasses situations that correspond to both finite and infinite prior expected number of species. To us, this highlights the importance of using model-averaged predictions (in which we integrate over the posterior distribution of the parameters) instead of just “plugging in” point estimators. It also highlights the difficulties of differentiating, in practice, between models that assume an infinite number of species and models with a finite but unknown number of them.
Although we do not consider an explicit example because of space constraints, it is worthwhile noting that, besides being used to model data directly, species sampling models are widely used as priors on partition structures in the context of (nonparametric) mixture models (e.g., see Escobar and West, 1995). In that context, a closed form for the EPPF provides the basis for the so-called “collapsed” Gibbs samplers, which act on the space of equivalence classes implied by the partitions (e.g., see Escobar, 1994, Neal, 2000 and references therein). In such cases, sampling over the permutations plays the same role as the label-switching moves commonly used in “blocked” and “retrospective” Gibbs samplers that explicitly represent the mixture distribution (for example, see Roberts and Papaspiliopoulos, 2008). Because of the invariance to size-biased permutations inherent to DP and PDP priors (Pitman, 1996a), label-switching moves of that type are not required in collapsed Gibbs samplers for mixture models based on those processes. However, for more general priors on the weights, label-switching moves seem to be required for both collapsed and blocked samplers.
The approach discussed in this paper could potentially be extended to residual allocation models for which residuals are not identically distributed. Indeed, our key observation is that Bayesian inference on general species sampling models can be carried out using a data augmentation approach that considers the identity of the atoms to which each species is assigned. When the weights are independent and identically distributed this is relatively straightforward because only the order in which the different species appear matters. For more general models we also need to consider the separation between species, considerably increasing the computational burden.
Devising specific sampling algorithms for the parameters associated with the SSM is highly model-dependent, and we have discussed only simple random walk Metropolis algorithms. In order to improve the efficiency of our algorithm, adaptive techniques such as those introduced in Roberts and Rosenthal (2009) might be worth pursuing. However, we do note that the slow mixing seems to be caused by the need to sample the permutations; designing better algorithms for this particular step might have a stronger impact on the efficiency of the algorithm.
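As an indication of what such an adaptive scheme might look like, the sketch below implements a one-dimensional random walk Metropolis sampler whose proposal scale is tuned in batches, in the spirit of Roberts and Rosenthal (2009): after each batch of 50 iterations the log proposal standard deviation is moved up or down by min(0.01, b^(-1/2)), where b is the batch index, aiming at an acceptance rate of about 0.44. The target used here (a Gamma(3, 2) density for a positive parameter, sampled on the log scale) is a toy assumption and not one of the posteriors in this paper.

```python
import math
import random

def adaptive_rwm(log_target, x0, n_iter=20000, batch=50, target_acc=0.44, seed=0):
    """Random walk Metropolis with batch-wise adaptation of the proposal scale."""
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    log_sd = 0.0          # log of the proposal standard deviation
    accepted = 0
    draws = []
    for it in range(1, n_iter + 1):
        prop = x + math.exp(log_sd) * rng.gauss(0.0, 1.0)
        lp_prop = log_target(prop)
        # Metropolis acceptance step for a symmetric proposal.
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
            accepted += 1
        draws.append(x)
        if it % batch == 0:
            delta = min(0.01, (it // batch) ** -0.5)
            # Widen the proposal if accepting too often, narrow it otherwise.
            log_sd += delta if accepted / batch > target_acc else -delta
            accepted = 0
    return draws, math.exp(log_sd)

# Toy target: theta ~ Gamma(shape=3, rate=2), sampled through phi = log(theta).
# The log density of phi includes the Jacobian term: 3 * phi - 2 * exp(phi).
def log_target_phi(phi):
    return 3.0 * phi - 2.0 * math.exp(phi)

draws, final_sd = adaptive_rwm(log_target_phi, x0=0.0)
theta_mean = sum(math.exp(d) for d in draws[2000:]) / len(draws[2000:])
print(theta_mean, final_sd)
```

This only tunes the parameter updates. It does not address the permutation step, which the discussion above identifies as the likely bottleneck.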
Highlights
- We discuss fully Bayesian inference for a class of models induced by residual allocation priors.
- We give a generalization of the Ewens sampling formula for the class under study.
- We derive the exchangeable partition probability function for generalized Dirichlet and probit stick-breaking priors.
- A suitable computational strategy for fitting the models is described.
Acknowledgments
The authors would like to thank an anonymous referee for helpful comments and suggestions. Abel Rodríguez was partially funded by awards NSF-DMS 0915272 and NIH/NIGMS R01GM090201-01. Fernando Quintana was partially funded by award FONDECYT 1100010.
Appendix. Proofs of Theoretical Results
Proof of Lemma 1
From (4) we have
This sum can be broken into Kn! pieces according to the ordering of the indexes j1, … , jKn. Because of the symmetry of the problem, it is enough to work with a particular ordering of the variables, for example j1 < j2 < … < jKn. In that case
where si = ji – ji–1 – 1 for i = 1, … , Kn are the sizes of the “gaps” between the labels of two consecutive observed species, with the convention j0 = 0. Now, since z1, z2, … are assumed to be independent and identically distributed, we have
where z ~ H, and
The proof is completed by noting that the results for the other Kn! – 1 terms can be obtained through a similar calculation that involves permuting the order of the counts m1, … , mKn.
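The calculation for a single ordering can be sketched as follows. This is a reconstruction under stated assumptions rather than the displayed equations of the original proof: it assumes stick-breaking weights $w_j = z_j \prod_{l<j} (1 - z_l)$ with $z_1, z_2, \ldots$ independent draws from $H$, and it assumes that the contribution of a fixed set of labels is $\mathbb{E}\big[\prod_i w_{j_i}^{m_i}\big]$. The shorthand $M_i = \sum_{h=i}^{K_n} m_h$ (with $M_{K_n+1} = 0$) is introduced here for convenience.

```latex
\begin{align*}
\mathbb{E}\Big[\prod_{i=1}^{K_n} w_{j_i}^{m_i}\Big]
  &= \prod_{i=1}^{K_n}
     \mathbb{E}\big[z^{m_i} (1 - z)^{M_{i+1}}\big]
     \Big\{\mathbb{E}\big[(1 - z)^{M_i}\big]\Big\}^{s_i},
  \qquad j_1 < j_2 < \cdots < j_{K_n}, \\
\sum_{s_1 = 0}^{\infty} \cdots \sum_{s_{K_n} = 0}^{\infty}
\mathbb{E}\Big[\prod_{i=1}^{K_n} w_{j_i}^{m_i}\Big]
  &= \prod_{i=1}^{K_n}
     \frac{\mathbb{E}\big[z^{m_i} (1 - z)^{M_{i+1}}\big]}
          {1 - \mathbb{E}\big[(1 - z)^{M_i}\big]}.
\end{align*}
```

The first line uses independence: the occupied atom $j_i$ contributes $z^{m_i} (1 - z)^{M_{i+1}}$, and each of the $s_i$ unoccupied atoms in the gap before it contributes $(1 - z)^{M_i}$. The second line sums a geometric series in each $s_i$, which converges whenever $\mathbb{E}[(1 - z)^{M_i}] < 1$, that is, whenever $z$ is not almost surely zero.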
The distribution over permutations for the Ewens sampling formula
Lemma 3
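For all $K \ge 1$ and $m_1, \ldots, m_K \ge 1$ write

$$r_K(m_1, \ldots, m_K) = \sum_{\sigma \in \mathcal{S}_K} \prod_{i=1}^{K} \left[ \frac{m_{\sigma(i)}}{\sum_{j=i}^{K} m_{\sigma(j)}} \right] \qquad (.1)$$

where $\mathcal{S}_K$ denotes the set of all $K!$ permutations of the indices $(1, 2, \ldots, K)$. Then $r_K(m_1, \ldots, m_K) = 1$ for all $K \ge 1$.
Proof
We proceed by induction on $K$. When $K = 1$, $\mathcal{S}_1$ contains only one element, namely $(1)$, and the only term within brackets in (.1) trivially becomes 1, so $r_1(m_1) = 1$. Assume now that $r_K(m_1, \ldots, m_K) = 1$ for a given $K$ and all positive counts, and consider $r_{K+1}(m_1, \ldots, m_K, m_{K+1})$. Write any $\sigma^{(K+1)} \in \mathcal{S}_{K+1}$ as $\sigma^{(K+1)} = (\sigma^{(K+1)}(1), \ldots, \sigma^{(K+1)}(K+1))$. For $s = 1, \ldots, K+1$ define $\mathcal{S}_{K+1}^{s} = \{\sigma^{(K+1)} \in \mathcal{S}_{K+1} : \sigma^{(K+1)}(1) = s\}$, i.e., $\mathcal{S}_{K+1}^{s}$ is the set of all permutations of $1, \ldots, K+1$ whose first element is $s$. Denote also $n = \sum_{i=1}^{K+1} m_i$. The first term in the product involved in (.1) for $K + 1$ is $m_s / n$, while the remaining terms are precisely those involved in the definition of $r_K(m_1, \ldots, m_{s-1}, m_{s+1}, \ldots, m_{K+1})$. Therefore, the summation over all permutations in $\mathcal{S}_{K+1}$ can be divided into $K + 1$ sums running over all possible values of $\sigma^{(K+1)}(1)$. Then

$$\begin{aligned}
r_{K+1}(m_1, \ldots, m_{K+1}) &= \sum_{s=1}^{K+1} \sum_{\sigma \in \mathcal{S}_{K+1}^{s}} \prod_{i=1}^{K+1} \left[ \frac{m_{\sigma(i)}}{\sum_{j=i}^{K+1} m_{\sigma(j)}} \right] = \sum_{s=1}^{K+1} \frac{m_s}{n} \, r_K(m_1, \ldots, m_{s-1}, m_{s+1}, \ldots, m_{K+1}) \\
&= \sum_{s=1}^{K+1} \frac{m_s}{n} = 1,
\end{aligned}$$

where the first equality in the second line holds by the induction hypothesis.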
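The statement of the lemma can also be checked numerically. The sketch below assumes that each term of the sum is the size-biased ordering probability, that is, the product over i of m_σ(i) divided by the total count of the species not yet placed. It then confirms with exact rational arithmetic that these terms add up to one. The function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def size_biased_prob(counts, order):
    """Probability that the species are listed in `order` under size-biased sampling.

    At each step the next species is chosen with probability proportional to its
    count among the species that have not been listed yet.
    """
    remaining = sum(counts)
    prob = Fraction(1)
    for idx in order:
        prob *= Fraction(counts[idx], remaining)
        remaining -= counts[idx]
    return prob

def r(counts):
    """Sum of the size-biased ordering probabilities over all K! permutations."""
    return sum(size_biased_prob(counts, order)
               for order in permutations(range(len(counts))))

for counts in ([5], [3, 1, 2], [2, 2, 7, 1]):
    print(counts, r(counts))
```

Exact fractions avoid any floating-point tolerance in the comparison with one. The enumeration costs K! terms, so the check is only practical for small K.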
Proof of Lemma 2
Let W1, W2, … be a sequence of binary variables such that Wi = 1 if observation i corresponds to a species that had not been observed in the past, and Wi = 0 otherwise. Note that Kn = W1 + … + Wn and that, conditionally on the random weights, these random variables are independent (although not identically distributed). Hence,
where, from Hjort (2000),
which completes the proof.
Contributor Information
Abel Rodríguez, Department of Applied Mathematics and Statistics, University of California, Santa Cruz, USA.
Fernando A. Quintana, Departamento de Estadística, Facultad de Matemáticas, Pontificia Universidad Católica de Chile, Santiago, Chile.
References
- Antoniak C. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics. 1974;2:1152–1174.
- Barrientos AF, Jara A, Quintana FA. On the support of MacEachern’s dependent Dirichlet processes and extensions. Bayesian Analysis. 2012;7(2):277–309.
- Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. The Annals of Statistics. 1973;1:353–355.
- Chung Y, Dunson DB. Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association. 2009;104:1646–1660. doi: 10.1198/jasa.2009.tm08302.
- Engen S. Stochastic Abundance Models with Emphasis on Biological Communities and Species Diversity. Chapman & Hall Ltd; 1978.
- Escobar MD. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association. 1994;89:268–277.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90:577–588.
- Ewens WJ. The sampling theory of selectively neutral alleles. Theoretical Population Biology. 1972;3:87–112. doi: 10.1016/0040-5809(72)90035-4.
- Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
- Fisher RA, Corbet AS, Williams CB. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology. 1943;12:42–58.
- Hjort N. Bayesian analysis for a generalized Dirichlet process prior. Tech. rep. University of Oslo; 2000.
- Kingman JFC. The representation of partition structures. Journal of the London Mathematical Society. 1978;18:374–380.
- Lee J, Quintana FA, Müller P, Trippa L. Defining predictive probability functions for species sampling models. Statistical Science. 2013;28(2):209–222. doi: 10.1214/12-sts407.
- Lijoi A, Mena RH, Prünster I. Bayesian nonparametric analysis for a generalized Dirichlet process prior. Statistical Inference for Stochastic Processes. 2005;8:283–309.
- Lijoi A, Mena RH, Prünster I. Bayesian nonparametric estimation of the probability of discovering new species. Biometrika. 2007;94:769–786.
- Mao CX. Prediction of the conditional probability of discovering a new class. Journal of the American Statistical Association. 2004;99:1108–1118.
- Mao CX, Lindsay BG. A Poisson model for the coverage problem with a genomic application. Biometrika. 2002;89:669–682.
- Neal RM. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics. 2000;9:249–265.
- Pitman J. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields. 1995;102:145–158.
- Pitman J. Random discrete distributions invariant under size-biased permutation. Advances in Applied Probability. 1996a;28(2):525–539.
- Pitman J. Some developments of the Blackwell-MacQueen urn scheme. In: Ferguson TS, Shapley LS, MacQueen JB, editors. Statistics, Probability and Game Theory. Papers in Honor of David Blackwell. IMS; Hayward, CA: 1996b. pp. 245–268.
- Pitman J. Combinatorial stochastic processes. Vol. 1875 of Lecture Notes in Mathematics. Springer-Verlag; Berlin: 2006. Lectures from the 32nd Summer School on Probability Theory held in Saint-Flour, July 7–24, 2002. With a foreword by Jean Picard.
- Roberts G, Papaspiliopoulos O. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika. 2008;95:169–186.
- Roberts GO, Rosenthal JS. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics. 2009;18(2):349–367.
- Rodriguez A, Dunson DB, Taylor J. Bayesian hierarchically weighted finite mixture models for samples of distributions. Biostatistics. 2009;10(1):155–171. doi: 10.1093/biostatistics/kxn024.
- Rodriguez A, Dunson DB. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis. 2011;6:145–178. doi: 10.1214/11-BA605.
- Sudderth EB, Jordan MI. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In: Koller D, Schuurmans D, Bengio Y, Bottou L, editors. Advances in Neural Information Processing Systems 21. Curran Associates, Inc.; 2009. pp. 1585–1592.




