Abstract
There has been increasing interest in applying Bayesian nonparametric methods in large samples and high dimensions. As Markov chain Monte Carlo (MCMC) algorithms are often infeasible, there is a pressing need for much faster algorithms. This article proposes a fast approach for inference in Dirichlet process mixture (DPM) models. Viewing the partitioning of subjects into clusters as a model selection problem, we propose a sequential greedy search algorithm for selecting the partition. Then, when conjugate priors are chosen, the resulting posterior conditionally on the selected partition is available in closed form. This approach allows testing of parametric models versus nonparametric alternatives based on Bayes factors. We evaluate the approach using simulation studies and compare it with four other fast nonparametric methods in the literature. We apply the proposed approach to three datasets including one from a large epidemiologic study. Matlab codes for the simulation and data analyses using the proposed approach are available online in the supplemental materials.
Keywords: Clustering, Density estimation, Efficient computation, Large samples, Nonparametric Bayes, Pólya urn scheme, Sequential analysis
1. INTRODUCTION
In recent years, there has been an explosion of interest in Bayesian nonparametric methods due to their flexibility and to the availability of efficient and easy-to-use algorithms for posterior computation. Most of the focus has been on Dirichlet process mixture (DPM) models (Lo 1984; Escobar 1994; Escobar and West 1995), which place a Dirichlet process (DP) prior (Ferguson 1973, 1974) on parameters in a hierarchical model. For DPMs, there is a rich literature on Markov chain Monte Carlo (MCMC) algorithms for posterior computation, proposing marginal Gibbs sampling (MacEachern 1994; West, Müller, and Escobar 1994; Bush and MacEachern 1996), conditional Gibbs sampling (Ishwaran and James 2001), and split-merge (Jain and Neal 2004) algorithms. These approaches are very useful in small to moderate sized datasets when one can devote several hours (or days) to computation.
However, there is clearly a pressing need for dramatically faster alternatives to MCMC, which can be executed within seconds (or at most minutes) even for very large datasets. Such algorithms are absolutely required in large-scale data analyses, in which computational speed is paramount. In the pregnancy outcome application considered in Section 6, data were available for 34,178 pregnancies and it was infeasible to implement MCMC. Even in smaller applications, it is very desirable to obtain results quickly. Speed also has the advantage of allowing detailed simulation studies of operating characteristics and sensitivity analyses for different prior specifications. In addition to obtaining results quickly for one DPM, it is typically of interest to compare DPMs to simpler parametric models. Typical MCMC algorithms do not allow such comparisons, as marginal likelihoods are not estimated, though there has been some recent work to address this gap (Basu and Chib 2003).
The focus of this article is on extremely fast alternatives to MCMC, which allow accurate approximate Bayes inferences under one DPM, while also producing marginal likelihood estimates to be used in model comparison. For example, one may be interested in comparing a DPM to a simpler parametric model. For simplicity in exposition, we focus throughout the article on Gaussian DPMs, though the methods can be trivially modified to other cases in which a conjugate prior is chosen.
For DPM models, a number of alternatives to MCMC have been proposed, including predictive recursion (PR) (Newton and Zhang 1999; Newton 2002; Ghosh and Tokdar 2006; Tokdar, Martin, and Ghosh 2009), weighted Chinese restaurant (WCR) sampling (Lo, Brunner, and Chan 1996; Ishwaran and Takahara 2002; Ishwaran and James 2003), sequential importance sampling (SIS) (MacEachern, Clyde, and Liu 1999; Quintana and Newton 2000), and variational Bayes (VB) (Blei and Jordan 2006; Kurihara, Welling, and Vlassis 2006; Kurihara, Welling, and Teh 2007). The WCR and SIS approaches are computationally intensive because they are based on a large number of particles. For a Gaussian mixture model with unknown mean and variance, the recursive algorithm (Newton 2002; Ghosh and Tokdar 2006; Tokdar, Martin, and Ghosh 2009) needs to estimate a bivariate mixing density and involves approximating a normalizing constant in each sequential updating step. VB relies on maximization of a lower bound on the marginal likelihood using a factorization approximation to the posterior. Wang and Titterington (2005) showed a tendency of VB to underestimate uncertainty in mixture models. Also, VB is sensitive to the starting values, motivating the use of a short SIS run to choose initial values.
We propose an alternative sequential updating and greedy search (SUGS) algorithm. This algorithm relies on factorizing the DP prior as a product of a prior on the partition of subjects into clusters and independent priors on the parameters within each cluster. Adding subjects one at a time, we allocate subjects to the cluster that maximizes the conditional posterior probability given their data and the allocation of previous subjects, while also updating the posterior distribution of the cluster-specific parameters. Hence, viewing selection of the partition as a model selection problem, we implement a sequential greedy search for a good partition, with the exact posterior given this partition then available in closed form. The algorithm is very fast involving only a single cycle of simple calculations for each subject. In addition, a marginal likelihood is produced that can be used for model selection and for eliminating sensitivity to the order in which subjects are added through model averaging or selection over random orders. Existing methods related to SUGS include those of Daumé III (2007), Fearnhead (2004), Minka and Ghahramani (2003), and Zhang, Ghahramani, and Yang (2005).
Section 2 describes the prior structure. Section 3 proposes the fast SUGS posterior updating algorithm, with Section 4 providing details for normal DPMs. Section 5 evaluates the approach and compares it with four other fast nonparametric methods through simulation studies. Section 6 contains three real data applications and Section 7 concludes with some remarks.
2. DIRICHLET PROCESS MIXTURES AND PARTITION MODELS
DPM models have a well-known relationship to partition models (Quintana and Iglesias 2003; Park and Dunson 2009). For example, consider a DP mixture of normals (Lo 1984):
$$y_i \sim N(\tilde{\mu}_i, \tilde{\tau}_i^{-1}), \qquad \tilde{\theta}_i = (\tilde{\mu}_i, \tilde{\tau}_i) \sim P, \qquad P \sim \mathrm{DP}(\alpha P_0), \qquad (2.1)$$
where θ̃i = (μ̃i, τ̃i) are parameters specific to subject i, α is the DP precision parameter, and P0 is a base probability measure. Then, upon marginalizing out the random mixing measure P, one obtains the DP prediction rule (Blackwell and MacQueen 1973):
$$(\tilde{\theta}_i \mid \tilde{\theta}_1, \ldots, \tilde{\theta}_{i-1}) \sim \left(\frac{\alpha}{\alpha + i - 1}\right) P_0 + \sum_{j=1}^{i-1} \left(\frac{1}{\alpha + i - 1}\right) \delta_{\tilde{\theta}_j}, \qquad (2.2)$$
where δθ is a probability measure concentrated at θ. Sequential application of the DP prediction rule for subjects 1, …, n creates a random partition of the integers {1, …, n}. Commonly used algorithms for posterior computation in DPM models rely on marginalizing out P to obtain a random partition, so that one bypasses computation for the infinitely-many parameters characterizing P (Bush and MacEachern 1996).
Taking advantage of a characterization of Lo (1984), one can express the posterior distribution in DPMs after marginalizing out P as a product of the posterior for the partition multiplied by independent posteriors for each cluster, obtained by updating the prior P0 with the data for the subjects allocated to that cluster. Instead of obtaining this structure indirectly through marginalization of P, one could directly specify a model for the random partition, while assuming conditional independence given the allocation to clusters. This possibility was suggested by Quintana and Iglesias (2003), who focused on product partition models (PPMs) (Barry and Hartigan 1992).
We assume that there is an infinite sequence of clusters, with θh representing the parameters specific to cluster h, for h = 1, …, ∞. We use the DP prediction rule in (2.2) for sequentially allocating subjects to a sparse subset of these clusters. The first subject will be automatically allocated to cluster h = 1, with additional clusters occupied as subjects are added as needed to improve predictive performance, obtaining an online updating approach. Sensitivity to ordering will be discussed later in the article.
Let γi be a cluster index for subject i, with γi = h denoting that subject i is allocated to cluster h. Relying on the DP prediction rule, the conditional prior distribution of γi given γ(i−1) = (γ1, …, γi−1) is assumed to be multinomial with

$$\Pr(\gamma_i = h \mid \gamma^{(i-1)}) = \begin{cases} \dfrac{\sum_{j=1}^{i-1} 1(\gamma_j = h)}{\alpha + i - 1}, & h = 1, \ldots, k_{i-1}, \\[2mm] \dfrac{\alpha}{\alpha + i - 1}, & h = k_{i-1} + 1, \end{cases} \qquad (2.3)$$

where α > 0 is a DP precision parameter controlling sparseness and $k_{i-1} = \max(\gamma_1, \ldots, \gamma_{i-1})$ is the number of clusters after i − 1 subjects have been sequentially added. As α increases, there is an increasing tendency to allocate subjects to new clusters instead of clusters occupied by previous subjects. The prior probabilities in (2.3) favor allocation of subject i to clusters having large numbers of subjects.
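As a small numerical illustration of (2.3), the following Matlab snippet computes the prior allocation probabilities for a new subject given a set of previous allocations; the values of α and the previous allocations are made up for this example and are not from the article.

```matlab
% Illustration of the prior allocation probabilities (2.3); alpha and
% gammaPrev are illustrative values, not taken from the article.
alpha     = 1;
gammaPrev = [1 1 2 1 3];                                   % allocations of subjects 1..5
counts    = sum(gammaPrev(:) == (1:max(gammaPrev)), 1);    % cluster occupancy counts
piPrior   = [counts, alpha] / (alpha + numel(gammaPrev));  % (2.3) for subject i = 6
% piPrior = [0.5000 0.1667 0.1667 0.1667]: the largest cluster is favored,
% but a new cluster keeps probability alpha/(alpha + 5).
```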
To complete a Bayesian specification, it is necessary to choose priors for the parameters within each of the clusters:
$$\theta_h \stackrel{\mathrm{iid}}{\sim} p_0, \qquad h = 1, \ldots, \infty, \qquad (2.4)$$
where p0 is the prior distribution on the cluster-specific coefficients θh and independence across the clusters is implied by the result of Lo (1984).
3. SEQUENTIAL UPDATING AND GREEDY SEARCH
3.1 Proposed Algorithm
Suppose that a measurement yi is obtained for subjects i = 1, …, n. Updating (2.3) one can obtain the conditional posterior probability of allocating subject i to cluster h given the data for subjects 1, …, i [y(i) = (y1, …, yi)′] and the cluster assignment for subjects 1, …, i − 1 [γ(i−1) = (γ1, …, γi−1)′]:
$$\Pr(\gamma_i = h \mid y^{(i)}, \gamma^{(i-1)}) = \frac{\pi_{ih}\, L_{ih}(y_i)}{\sum_{l=1}^{k_{i-1}+1} \pi_{il}\, L_{il}(y_i)}, \qquad h = 1, \ldots, k_{i-1} + 1, \qquad (3.1)$$
where πih = Pr(γi = h|γ(i−1)) is the conditional prior probability in expression (2.3), and Lih(yi) = ∫f(yi |θh)π(θh|y(i−1), γ(i−1))dθh is the conditional likelihood of yi given allocation to cluster h and the cluster allocation for subjects 1, …, i − 1, with f(yi|θh) denoting the likelihood of yi given parameters θh and π(θh|y(i−1), γ(i−1)) ∝ p0(θh)Π{j:γj=h,1≤j≤i−1} f(yj |θh), the posterior distribution of θh given y(i−1) and γ(i−1). For a new cluster h = ki−1 + 1, π(θh|y(i−1), γ(i−1)) = p0(θh), as none of the first i − 1 subjects have been allocated to cluster ki−1 + 1.
For conjugate p0, the posterior π(θh| y(i−1), γ(i−1)) and likelihood Lih(yi) are available in closed form. Hence, the joint posterior distribution for the cluster-specific coefficients given the data and cluster allocation for all n subjects,

$$\pi(\theta \mid y^{(n)}, \gamma^{(n)}) = \prod_{h=1}^{k_n} \pi(\theta_h \mid y^{(n)}, \gamma^{(n)}) \prod_{h=k_n+1}^{\infty} p_0(\theta_h),$$

is similarly available in closed form. Note that the first kn clusters are occupied in that they have at least one member from the sample.
The real challenge is addressing uncertainty in the partition of subjects to clusters, γ. MCMC algorithms attempt to address this uncertainty by generating samples from the joint posterior distribution of γ and θ. As highlighted in Section 1, such MCMC algorithms are quite expensive computationally. This is particularly true if sufficient numbers of samples are collected to adequately explore the posterior distribution of γ. The multimodal nature of the posterior and the tendency to remain for long intervals in local modes make this exploration quite challenging. In addition, γ ∈ Γ can be viewed as a model index belonging to the high-dimensional space Γ. As for other high-dimensional stochastic search procedures, for sufficiently large n, it is for all practical purposes infeasible to fully explore Γ or to draw enough samples to accurately represent the posterior of γ.
An additional issue is that, even if one could obtain iid draws from γ, problems in interpretation often arise due to the label switching issue. Viewing γ as a model index, samples from the joint posterior of γ and θ can be used to obtain model-averaged predictions and inferences, allowing for uncertainty in selection of γ. Although it is well known that model averaging is most useful for prediction, the ability to obtain interpretable inferences may be lost in averaging across models. This is certainly true in mixture models, because the meaning of the cluster labels changes across the samples, making it difficult to summarize cluster-specific results. There has been some work on postprocessing algorithms to align the clusters (Stephens 2000), though this can add considerably to the computational burden.
Motivated by these issues, there has been some recent work on obtaining an optimal estimate of γ (Lau and Green 2007; Dahl 2009). These approaches are quite expensive computationally, so will not be considered further. We instead propose a very fast sequential updating and greedy search (SUGS) algorithm, which cycles through subjects, i = 1, …, n, sequentially allocating them to the cluster that maximizes the conditional posterior allocation probability. This proceeds as follows:
1. Let γ1 = 1 and calculate π(θ1 | y1, γ1).
2. For i = 2, …, n:
   (a) Choose γi to maximize the conditional probability of γi = h given y(i) and γ(i−1) using (3.1).
   (b) Update π(θγi | y(i−1), γ(i−1)) using the data for subject i.
This algorithm only requires a single cycle of simple deterministic calculations for each subject under study, and can be implemented within a few seconds even for very large datasets. In addition, the algorithm is online so that additional subjects can be added as they become available without additional computations for the past subjects. Hence, the algorithm is particularly suited for large-scale real-time prediction. The proposed method is similar to the hard decision method by Zhang, Ghahramani, and Yang (2005) in the field of online document clustering and the “trivial” algorithm by Daumé III (2007). However, we also propose methods to remove order dependence in sequential updating, allow unknown DP precision parameter α, use empirical Bayes for estimation of key hyperparameters, and conduct model comparison associated with SUGS.
3.2 Removing Order Dependence
The SUGS approach for selecting γ ∈ Γ is sequentially optimal, but will not in general produce a global maximum a posteriori (MAP) estimate of γ. Producing the global MAP is in general quite challenging computationally given the multimodality and size of Γ. In addition, as noted by Stephens (2000), there are in general very many choices of γ having identical or close to identical marginal likelihoods. Hence, SUGS seems to provide a reasonable strategy for rapidly identifying a good partition without spending an enormous amount of additional time searching for alternative partitions that may provide only minimal improvement.
One aspect that is unappealing is dependence of the selection of γ on the order in which subjects are added. As this order is typically arbitrary, one would prefer to eliminate this order dependence. To address this issue, we recommend repeating the SUGS algorithm of Section 3.1 for multiple permutations of the ordering {1, …, n}. The marginal likelihood of the partition γ selected under a given ordering is calculated as

$$\pi(y \mid \gamma) = \prod_{i=1}^{n} L_{i\gamma_i}(y_i) = \prod_{i=1}^{n} \int f(y_i \mid \theta_{\gamma_i})\, \pi(\theta_{\gamma_i} \mid y^{(i-1)}, \gamma^{(i-1)})\, d\theta_{\gamma_i}. \qquad (3.2)$$
Selecting the ordering with the largest marginal likelihood generally works well in eliminating the ordering effect. However, this marginal likelihood criterion is not perfectly reliable and sometimes leads to poor predictive density estimation, because the ordering with the largest marginal likelihood occasionally overfits the data, assigning subjects to more clusters than necessary. As an alternative, we propose to use the pseudo-marginal likelihood (PML) and base inferences on the ordering having the largest PML.
The pseudo-marginal likelihood is defined as the product of conditional predictive ordinates (Geisser 1980; Pettit 1990; Gelfand and Dey 1994) as follows:

$$\mathrm{PML}_{\gamma}(y) = \prod_{i=1}^{n} \pi(y_i \mid y^{(-i)}, \gamma) = \prod_{i=1}^{n} \int f(y_i \mid \theta_{\gamma_i})\, \pi(\theta_{\gamma_i} \mid y^{(-i)}, \gamma)\, d\theta_{\gamma_i}, \qquad (3.3)$$
where y(−i) is the set of all the data but yi, for i = 1, …, n. The PMLγ(y) criterion is appealing in favoring a partition with good predictive performance and has been used for assessing goodness of fit and Bayesian model selection by Geisser and Eddy (1979), Gelfand and Dey (1994), Sinha, Chen, and Ghosh (1999), and Mukhopadhyay, Ghosh, and Berger (2005), among others. To speed up computation, we use π(θ|y, γ) instead of π(θ|y(−i), γ(−i)) in practice, approximating PML with a product of the predictive densities defined in (3.8) over all subjects. This approximation is accurate, particularly for large samples. We use the PML criterion in all implementations of SUGS in the simulation studies and real data analyses unless mentioned otherwise. Because SUGS is extremely fast, repeating it for a modest number of random orderings and selecting a good ordering does not take much time.
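The following Matlab sketch illustrates this ordering-selection strategy under stated assumptions: `sugs_fit` is a hypothetical helper (not part of the article's released code) implementing the pass of Section 3.1 and returning, for each subject, the log predictive density computed from the fitted closed-form posterior, so that summing these values gives the log of the PML approximation.

```matlab
% Hedged sketch of Section 3.2: run SUGS over several random orderings and
% keep the ordering/partition with the largest approximate PML (3.3).
% sugs_fit is a hypothetical helper; its output fields are illustrative.
nPerm   = 10;                             % number of random orderings
bestPML = -Inf;
for r = 1:nPerm
    ord = randperm(numel(y));             % random presentation order
    fit = sugs_fit(y(ord));               % one greedy SUGS pass (Section 3.1)
    pml = sum(fit.logPredDens);           % log of the PML approximation
    if pml > bestPML
        bestPML = pml;  bestFit = fit;  bestOrd = ord;
    end
end
```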
A variety of strategies have been suggested to limit order dependence in other non-parametric sequential algorithms. The recursive algorithm (Newton 2002; Tokdar, Martin, and Ghosh 2009) and the expectation propagation method (Minka and Ghahramani 2003) proposed to take an unweighted average over a number of permutations. Daumé III (2007) proposed to present the data in ascending order of individual marginal likelihood, ∫f(yi |θ)p0(θ) dθ, prior to the online updating.
3.3 Allowing the DP Precision Parameter α to be Unknown
In the above development, we have assumed that the DP precision parameter α is fixed, which is not recommended because the value of α plays a strong role in the allocation of subjects to clusters. To allow unknown α, we choose the prior:
$$\pi(\alpha) = \sum_{r=1}^{R} p_r^{(0)}\, 1(\alpha = \alpha_r), \qquad (3.4)$$

with $\alpha_1 < \cdots < \alpha_R$ a prespecified grid of possible values with a large range and $\sum_{r=1}^{R} p_r^{(0)} = 1$.
We can easily modify the SUGS algorithm to allow simultaneous updating of α. Letting $\pi_{ih}(\alpha) = \Pr(\gamma_i = h \mid \gamma^{(i-1)}, \alpha)$ denote the prior probability (2.3) evaluated at a given α and $p_r^{(i-1)} = \Pr(\alpha = \alpha_r \mid y^{(i-1)}, \gamma^{(i-1)})$, we obtain the following modification to (3.1):

$$\Pr(\gamma_i = h \mid y^{(i)}, \gamma^{(i-1)}) \propto L_{ih}(y_i) \sum_{r=1}^{R} \pi_{ih}(\alpha_r)\, p_r^{(i-1)}, \qquad h = 1, \ldots, k_{i-1} + 1, \qquad (3.5)$$
which is obtained marginalizing over the posterior for α given the data and allocation for subjects 1, …, i −1. Then we obtain the following updated probabilities:
$$p_r^{(i)} = \Pr(\alpha = \alpha_r \mid y^{(i)}, \gamma^{(i)}) = \frac{\pi_{i\gamma_i}(\alpha_r)\, p_r^{(i-1)}}{\sum_{s=1}^{R} \pi_{i\gamma_i}(\alpha_s)\, p_s^{(i-1)}}, \qquad r = 1, \ldots, R. \qquad (3.6)$$
Note that we obtain a closed-form joint posterior distribution for the cluster-specific parameters θ and DP precision α given γ.
In our proposed approach, we handle the DP precision parameter α from a fully Bayesian perspective and obtain its posterior distribution. This is more appealing than most of the fast DP mixture algorithms, which use a fixed value of α, for example, the particle filter of Fearnhead (2004), the fast DPM search of Daumé III (2007), and the accelerated and collapsed variational DP mixture models of Kurihara, Welling, and Vlassis (2006) and Kurihara, Welling, and Teh (2007), respectively. The online document clustering method of Zhang, Ghahramani, and Yang (2005) employed an empirical Bayes method to estimate α, whereas Blei and Jordan (2006) adopted a gamma prior for α in their variational DP mixture approach. We found the Blei and Jordan (2006) approach to have better performance than the newer VB variants in simulations (results not shown).
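A minimal sketch of one step of the grid update in (3.5)-(3.6) is given below. The input names (grid of α values, current grid weights, cluster occupancy counts, and the conditional likelihoods L_{ih}(y_i)) are illustrative assumptions rather than variables from the article's code.

```matlab
% Hedged sketch of one step of the alpha-grid update, (3.5)-(3.6).
% Assumed inputs: alphaGrid (1 x R grid), pAlpha (1 x R current weights),
% counts (1 x k occupancy counts of existing clusters), Lih ((k+1) x 1
% conditional likelihoods, last entry = new cluster).
k     = numel(counts);
nPrev = sum(counts);                            % i - 1 previous subjects
piA   = zeros(k + 1, numel(alphaGrid));         % pi_{ih}(alpha_r)
for r = 1:numel(alphaGrid)
    a = alphaGrid(r);
    piA(1:k, r) = counts(:) / (a + nPrev);      % existing clusters, (2.3)
    piA(k+1, r) = a / (a + nPrev);              % new cluster
end
postH  = Lih(:) .* (piA * pAlpha(:));           % unnormalized (3.5)
[~, h] = max(postH);                            % greedy allocation
pAlpha = piA(h, :) .* pAlpha;                   % update (3.6)
pAlpha = pAlpha / sum(pAlpha);
```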
3.4 Estimating Predictive Distributions
From applying SUGS, we obtain a selected partition γ and posterior distributions in closed form for the parameters within each cluster, π(θh|y, γ), and for DP precision parameter α as in (3.6). From these posterior distributions, we can conduct inferences on the cluster-specific coefficients, θ1, …, θkn.
In addition, we can conduct fast online predictions for new subjects. The predicted probability of allocation of subject i = n +1 to cluster h is
$$\Pr(\gamma_{n+1} = h \mid y, \gamma) = \sum_{r=1}^{R} \Pr(\gamma_{n+1} = h \mid \gamma, \alpha_r)\, p_r^{(n)}, \qquad h = 1, \ldots, k_n + 1. \qquad (3.7)$$
The predictive density is then
$$\hat{f}(y_{n+1}) = \sum_{h=1}^{k_n+1} \Pr(\gamma_{n+1} = h \mid y, \gamma) \int f(y_{n+1} \mid \theta_h)\, \pi(\theta_h \mid y, \gamma)\, d\theta_h, \qquad (3.8)$$
which is available in closed form.
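A minimal sketch of evaluating (3.8) on a grid is shown below; it assumes the allocation probabilities from (3.7) are stored in a vector `w` (last entry corresponding to a new cluster) and that each cluster's closed-form predictive density is available as a function handle in the cell array `predDens`. Both names are illustrative.

```matlab
% Hedged sketch of evaluating the predictive density (3.8) on a grid.
% w: (k_n + 1) x 1 allocation probabilities from (3.7);
% predDens{h}(ygrid): closed-form predictive density of cluster h
% (a t density in the normal model of Section 4). Names are illustrative.
ygrid = linspace(min(y) - 1, max(y) + 1, 200);
fhat  = zeros(size(ygrid));
for h = 1:numel(w)
    fhat = fhat + w(h) * predDens{h}(ygrid);    % weighted mixture (3.8)
end
plot(ygrid, fhat);                              % predictive density estimate
```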
To obtain pointwise credible intervals for the conditional density estimate, f̂(yn+1), apply the following Monte Carlo procedure:

1. Draw S samples $\{\theta^{(s)}, \alpha^{(s)}\}_{s=1}^{S}$ from the joint posterior distribution of (θ, α) given (y, γ).
2. Calculate the conditional density for each of these draws:
   $$f^{(s)}(y_{n+1}) = \sum_{h=1}^{k_n+1} \Pr(\gamma_{n+1} = h \mid \gamma, \alpha^{(s)})\, f(y_{n+1} \mid \theta_h^{(s)}),$$
   where $\Pr(\gamma_{n+1} = h \mid \gamma, \alpha^{(s)})$ is calculated using formula (3.1) with α = α(s) and i = n + 1.
3. Calculate empirical percentiles of $\{f^{(s)}(y_{n+1}),\ s = 1, \ldots, S\}$.
Because the proposed SUGS algorithm is deterministic for a selected ordering, the resulting credible intervals tend to underestimate the uncertainty. This is observed in our simulation study and occurs in many other competing approaches, such as VB.
3.5 Model Comparison
One very appealing aspect of the SUGS approach is that we obtain a closed-form expression for the exact marginal likelihood for the selected model γ, because each integral term in (3.2) has a simple closed form due to the conjugacy. Hence, we can obtain Bayes factors and posterior probabilities for competing models. For example, the Bayes factor for comparing the selected semiparametric model to the parametric base model is

$$\mathrm{BF} = \frac{\pi(y \mid \gamma)}{\pi(y \mid \gamma_0)},$$

where γ0 allocates all subjects to cluster 1, so that the denominator is the marginal likelihood obtained in allocating all subjects to the first cluster. The performance of tests based on these Bayes factors is assessed through simulations in Section 5.
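In practice the two marginal likelihoods in this ratio are most conveniently accumulated on the log scale. The snippet below is a hedged sketch; `logML_sugs` and `logML_null` are assumed to hold the log of (3.2) for the selected partition and for the all-in-one-cluster partition, respectively (illustrative names).

```matlab
% Hedged sketch of the Bayes factor of Section 3.5 on the log scale.
% logML_sugs, logML_null: assumed precomputed sums of log L_{i gamma_i}(y_i)
% under the selected partition and under the single-cluster partition.
logBF = logML_sugs - logML_null;   % log Bayes factor vs. the parametric base model
BF    = exp(logBF);                % may overflow for large n; report logBF instead
```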
4. DP MIXTURES OF NORMALS AND SUGS DETAILS
4.1 SUGS Details
We focus on normal mixture models as an important special case, letting θh = (μh, τh)′ represent the mean parameter μh and residual precision τh for cluster h, h = 1, …, ∞. To specify p0, we choose conjugate normal inverse-gamma priors as follows:
$$p_0(\mu_h, \tau_h) = N(\mu_h;\, m,\, \psi\,\tau_h^{-1})\, G(\tau_h;\, a, b), \qquad (4.1)$$
with m, ψ, a, b hyperparameters that are assumed known.
After updating prior (4.1) with the data for subjects 1, …, i, we have

$$\pi(\mu_h, \tau_h \mid y^{(i)}, \gamma^{(i)}) = N(\mu_h;\, m_h^{(i)},\, \psi_h^{(i)}\,\tau_h^{-1})\, G(\tau_h;\, a_h^{(i)}, b_h^{(i)}), \qquad (4.2)$$

where the values $(m_h^{(i)}, \psi_h^{(i)}, a_h^{(i)}, b_h^{(i)})$ are obtained through sequential application of the updating equations: for h = γi,

$$m_h^{(i)} = \frac{m_h^{(i-1)} + \psi_h^{(i-1)} y_i}{1 + \psi_h^{(i-1)}}, \qquad \psi_h^{(i)} = \frac{\psi_h^{(i-1)}}{1 + \psi_h^{(i-1)}}, \qquad a_h^{(i)} = a_h^{(i-1)} + \frac{1}{2}, \qquad b_h^{(i)} = b_h^{(i-1)} + \frac{(y_i - m_h^{(i-1)})^2}{2\,(1 + \psi_h^{(i-1)})},$$

with the hyperparameters of the remaining clusters h ≠ γi left unchanged, and with $(m_h^{(0)}, \psi_h^{(0)}, a_h^{(0)}, b_h^{(0)}) = (m, \psi, a, b)$ corresponding to the initial prior in (4.1).
Letting πih = Pr(γi = h|γ(i−1)) as shorthand for the conditional prior probabilities in (2.3) and updating with the data for subject i, we obtain

$$\Pr(\gamma_i = h \mid y^{(i)}, \gamma^{(i-1)}) = \frac{\pi_{ih}\, f(y_i \mid \gamma_i = h, \gamma^{(i-1)}, y^{(i-1)})}{\sum_{l=1}^{k_{i-1}+1} \pi_{il}\, f(y_i \mid \gamma_i = l, \gamma^{(i-1)}, y^{(i-1)})}, \qquad (4.3)$$

for h = 1, …, ki−1 + 1, where f(yi|γi = h, γ(i−1), y(i−1)) corresponds to a noncentral t-distribution, a special case of (A.1) in the Appendix with x = 1 and one-dimensional β, with $(m_h^{(i-1)}, \psi_h^{(i-1)}, a_h^{(i-1)}, b_h^{(i-1)})$ used in place of (ξ, Ψ, a, b) in (A.1).
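To make the preceding steps concrete, the following Matlab function sketches a single SUGS pass for this normal special case with α fixed for simplicity (the grid update of Section 3.3 is omitted). It assumes a rate-parameterized gamma prior G(τ; a, b), uses illustrative variable names, and requires the Statistics Toolbox for `tpdf`; it is a minimal sketch under these assumptions, not the authors' released code.

```matlab
function [gamma, hyp] = sugs_normal_pass(y, alpha, m0, psi0, a0, b0)
% Hedged sketch: one greedy SUGS pass for the conjugate normal model
% (Sections 3.1 and 4.1) with alpha fixed. Rows of hyp hold (m, psi, a, b)
% for each occupied cluster. Requires the Statistics Toolbox for tpdf.
n     = numel(y);
gamma = zeros(n, 1);
hyp   = [];                                    % no occupied clusters yet
hyp0  = [m0, psi0, a0, b0];                    % prior hyperparameters (4.1)
for i = 1:n
    H   = [hyp; hyp0];                         % occupied clusters + a new one
    k   = size(hyp, 1);
    cnt = sum(gamma(1:i-1) == (1:k), 1);       % occupancy counts (1 x k)
    piP = [cnt, alpha] / (alpha + i - 1);      % prior probabilities (2.3)
    nu  = 2 * H(:, 3);                         % t degrees of freedom
    s2  = H(:, 4) .* (1 + H(:, 2)) ./ H(:, 3); % squared t scale, cf. (A.1)
    Lih = tpdf((y(i) - H(:, 1)) ./ sqrt(s2), nu) ./ sqrt(s2);  % likelihoods in (4.3)
    [~, h]   = max(piP(:) .* Lih);             % greedy allocation, (3.1)
    gamma(i) = h;
    mh = H(h, 1);  ph = H(h, 2);               % hyperparameters before the update
    H(h, :) = [(mh + ph * y(i)) / (1 + ph), ...% conjugate update of cluster h
               ph / (1 + ph), ...
               H(h, 3) + 1/2, ...
               H(h, 4) + (y(i) - mh)^2 / (2 * (1 + ph))];
    if h > k, hyp = H; else, hyp(h, :) = H(h, :); end   % open new cluster or update
end
end
```

For instance, `[gamma, hyp] = sugs_normal_pass(zscore(y), 1, 0, 1, 1, 0.1)` would run one pass on normalized data with the default hyperparameters of Section 4.2, a fixed α = 1, and an illustrative value of b.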
4.2 Empirical SUGS
In implementing SUGS, we have found some sensitivity to the prior specification, which is an expected feature of analyses based on DP mixture models. To reduce this sensitivity, we recommend routinely normalizing the data prior to analysis. Then, one can let m = 0, ψ = 1, and a = 1 in prior (4.1) as a default. However, there is still some sensitivity to the choice of b, which we propose to address through the following procedure. We first choose a prior for b, π(b) = G(c, d), with c = 1, d = 10 providing a good default choice. We then propose to update this prior sequentially within a preliminary SUGS run to obtain an estimate of b. This estimate will then be plugged in for b in a subsequent SUGS analysis. We find this modification leads to good performance in terms of estimation and model selection in a very wide variety of cases.
Let b̂(i) be an estimate of b after the first i −1 subjects have been incorporated, with
The updating equation for is then modified to be
The final estimate b̂(n+1) is used as the value for b in the subsequent SUGS analyses. Although this increases the computational cost, the result is a more robust estimate.
5. SIMULATION STUDY
5.1 Performance of SUGS
Simulation studies were conducted to evaluate the performance of the proposed algorithm. We focused on the normal DPM model of Section 4 and considered two cases for the true density: (1) mixture of three normals:
and (2) a single normal with mean 0 and variance 0.4. In each case, we considered 100 simulated datasets each with sample size n = 500. For all the analyses reported in this article, we used the default priors recommended in Section 4.2, and took the prior for α to be a discretized Gamma(1, 1) distribution with support on the points {0.01, 0.05} ∪ {0.1 + 0.2k, k = 0, 1, …, 20}. In addition, SUGS was repeated for 10 random orderings.
For each sampled dataset, we calculated the predictive density using SUGS, the typical frequentist kernel density estimate, and the Bayes factor of the selected model against the parametric null model (a single normal distribution). The kernel density estimate was obtained using the “ksdensity” function in Matlab, with default settings including using a normal kernel (Bowman and Azzalini 1997) and an optimal default value for kernel width. To measure the closeness of the proposed density estimate and the true density, we calculated the Kullback–Leibler divergence (KLD) between densities f and g defined as follows:
$$\mathrm{KLD}(f, g) = \int f(y) \log \frac{f(y)}{g(y)}\, dy,$$

with f being the true density and g being an estimate.
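For reference, a KLD of this form can be approximated numerically as in the following Matlab sketch, where `f` and `g` are function handles for the true and estimated densities (illustrative names) and the integration grid is assumed wide enough to cover essentially all of the support.

```matlab
% Hedged sketch: numerical Kullback-Leibler divergence KLD(f, g) by
% trapezoidal integration. f and g are density function handles; the grid
% is assumed wide enough that both densities are negligible outside it.
ygrid = linspace(-5, 5, 2000);
fy    = f(ygrid);
gy    = g(ygrid);
kld   = trapz(ygrid, fy .* log(fy ./ gy));   % int f(y) log(f(y)/g(y)) dy
```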
Figures 1 and 2 plot the true density (solid) and the 100 predictive densities (dotted) given by the SUGS algorithm in case 1 and case 2, respectively. Clearly, the predictive densities are very close to the true density. The averages of 100 KLDs of the proposed density estimates and the kernel density estimates relative to the true density are 0.0111 and 0.041 in case 1 and 0.0027 and 0.0080 in case 2, respectively. The results suggest that the proposed density estimates are closer to the true density than the kernel density estimates.
Figure 1.
SUGS density estimates in simulation case 1 for n = 500. The estimated densities (dotted) from 100 datasets and the true density (solid). A color version of this figure is available in the electronic version of this article.
Figure 2.
SUGS density estimates in simulation case 2 for n = 500. The estimated densities (dotted) from 100 datasets and the true density (solid). A color version of this figure is available in the electronic version of this article.
Table 1 summarizes the estimated Bayes factors across the simulations. To obtain the Bayes factor, we calculate the marginal likelihood using the formula in (3.2), with the hyperparameter b = 1 for the normal baseline model. When data are generated from a mixture of normals as in case 1, the Bayes factors provide decisive support in favor of the true model, as shown in Table 1. When data are generated from the null model as in case 2, the Bayes factors favor the base normal model for over 90% of the datasets. These results show that SUGS has good performance in selecting between a single normal and a mixture of normals.
Table 1.
Performance of Bayes factor under null and alternative models.
| | BF ≤ 1 | 1 < BF ≤ 100 | BF > 100 |
|---|---|---|---|
| Case 1 | 0 | 0 | 100 |
| Case 2 | 92 | 4 | 4 |
In the above implementation of SUGS, we treat α as unknown and assign a discretized gamma prior to it. To see the benefit of this, we ran SUGS with several fixed values of α separately. The comparison results are presented in Table 2 in terms of the average of the 100 KLDs and the computational time per dataset. From Table 2, SUGS using random α performs better than SUGS using a fixed value of α in estimating the predictive density, as it produces the smallest average KLD. This is most apparent in case 1 of the simulation, for which all the average KLDs from SUGS with fixed α values are relatively large. The computation time increases as the value of α becomes large, because large values of α tend to induce more clusters and thus require more computation. It is appealing that SUGS with random α performs better than SUGS with a fixed α while not necessarily taking more time.
Table 2.
Effect of using random and fixed value of α in SUGS.
| | | Random α | α = 0.1 | α = 0.5 | α = 1 | α = 2 |
|---|---|---|---|---|---|---|
| KLD | Case 1 | 0.0111 | 0.1182 | 0.0643 | 0.0488 | 0.0430 |
| | Case 2 | 0.0027 | 0.0037 | 0.0064 | 0.0069 | 0.0067 |
| Time (sec) | Case 1 | 2.98 | 2.27 | 3.88 | 5.55 | 8.28 |
| | Case 2 | 2.93 | 2.71 | 3.75 | 5.36 | 8.03 |
In the above simulations, we adopt the PML criterion to eliminate the effect of the sequential ordering of subjects in SUGS. For comparison, we also ran SUGS using the marginal likelihood (ML) criterion mentioned in Section 3.2, together with SUGS using simple averaging (SAV) over 10 random orderings for each dataset. The first part of Table 3 lists the sample standard deviation (SSD) of the 100 log marginal likelihood estimates obtained from the 100 datasets using these three criteria in SUGS. From Table 3, the PML criterion clearly performs best, with the smallest SSD compared to the ML and SAV criteria in both simulation cases. The large SSD for the SAV criterion indicates that SUGS is very sensitive to the ordering in which subjects are added to the model. The proposed PML criterion is effective in eliminating this ordering effect, as clearly indicated in Table 3.
Table 3.
Sample standard deviation of log marginal likelihood estimates for running SUGS and the inadmissible approach using different criteria regarding the online clustering of subjects based on 100 datasets: PML, ML, and SAV criteria for SUGS and AS and ML criteria for the inadmissible approach.
| | SUGS | | | Inad | |
|---|---|---|---|---|---|
| | PML | ML | SAV | AS | ML |
| Case 1 | 17.4 | 62.4 | 209.9 | 99.2 | 21.6 |
| Case 2 | 4.1 | 88.3 | 143.9 | 62.9 | 19.0 |
The SUGS algorithm is very fast. In both case 1 and case 2, the analyses for all 100 simulated datasets were completed in approximately 3 minutes for sample size 500. We also ran simulations with sample size n = 5000 and obtained excellent results (not shown). All programs, including the simulations in Section 5.2 and the data analyses in Section 6, were executed in Matlab version 7.3 running on a Dell desktop with an Intel(R) Xeon(R) CPU and 3.00 GB of RAM.
5.2 Comparison With Four Other Fast Nonparametric DPM Algorithms
To compare SUGS with competing fast nonparametric methods, we reanalyzed the same simulated data in Section 5.1 using the VB method of Blei and Jordan (2006), the PR approach of Newton (2002), the SIS of MacEachern, Clyde, and Liu (1999), and the inadmissible (Inad) approach of Daumé III (2007). We evaluated their performance in terms of predictive density estimation and running time.
The idea of variational inference is to formulate the computation of the posterior distribution as an optimization problem (Blei and Jordan 2006; Wainwright and Jordan 2008). To implement the variational Dirichlet process Gaussian mixture model, we applied the code created by Dr. Kenichi Kurihara; the code is available at http://sato-www.cs.titech.ac.jp/kurihara/vdpmog.html. It is known that VB is very sensitive to the initial values and poorly chosen initial values usually result in poor estimates. To overcome this problem, the code adopts sequential importance sampling to find good initial values.
The recursive algorithm of Newton (2002) sequentially updates the mixing density π(θ) via the following equation:

$$\pi_i(\theta) = (1 - w_i)\, \pi_{i-1}(\theta) + w_i\, \frac{f(y_i \mid \theta)\, \pi_{i-1}(\theta)}{c(y_i, \pi_{i-1})}, \qquad i = 1, \ldots, n,$$

where w = (w1, …, wn) is a sequence of weights satisfying some conditions, c(y, π) = ∫f(y|θ)π(θ)dθ, and πn is used as the estimated mixing density. The predictive density estimate is then fn(y) = ∫f(y|θ)πn(θ)dθ, which is strongly consistent (Tokdar, Martin, and Ghosh 2009). Because πn depends on the ordering of (y1, …, yn), following the recommendation of Newton (2002) and Tokdar, Martin, and Ghosh (2009), we use the average of fn(y) over 10 permutations.
We generalize the SIS of MacEachern, Clyde, and Liu (1999) to the DP mixture of normals. SIS is similar to SUGS in its sequential updating, but differs in that it draws each subject's cluster allocation at random instead of choosing the allocation that maximizes the posterior probabilities, as in step 2(a) of Section 3.1 of SUGS. Here we implement SIS with 10 particles (permutations) and calculate the predictive density by taking the average for each dataset.
The inadmissible approach is the one that performs fastest and best among the three algorithms proposed by Daumé III (2007). It aims to find the maximum a posteriori assignment of data points to clusters. To achieve this goal, it sequentially updates multiple clusterings in a queue to obtain clusterings that have the highest scores. The clustering with the highest score is chosen to be the allocation of all subjects. To be fair to the inadmissible approach in comparison with SUGS, we adopt only one clustering in the queue. We fix the DP precision parameter α to be 1 in the simulation and calculate the marginal likelihood and predictive density based on the final clustering that has the highest score. Because the ordering of subjects can affect the clustering result, Daumé III (2007) recommended presenting the data in ascending order (AS) of individual marginal likelihood. We evaluate this criterion and also implement a marginal likelihood (ML) criterion, in which one runs the inadmissible approach over 10 random orderings and chooses the one that leads to the highest marginal likelihood among the 10 final clusterings. The results in the second part of Table 3 suggest that AS is not a good criterion, as it produces large uncertainty. In contrast, the ML criterion seems to be more reliable because the SSD is much smaller, as seen in Table 3. Also, Inad ML gives larger marginal likelihoods than Inad AS in our simulation (results not shown).
Table 4 shows the comparison of SUGS with VB, PR, SIS, Inad AS, and Inad ML in terms of the average of the 100 KLDs and the running time per dataset. The average KLDs obtained from SUGS and VB are comparable and are smaller than the corresponding values obtained from PR and the Inad methods in case 1 of the simulation. In case 2, the average KLD is small for all the approaches except the inadmissible approach, although the latter gives acceptable predictive density estimates using the ML criterion.
Table 4.
Comparison of SUGS, VB, PR, SIS, and the inadmissible algorithm proposed by Daumé III (2007) in terms of the average of KLD and running time.
| | | SUGS | VB | PR | SIS | Inad AS | Inad ML |
|---|---|---|---|---|---|---|---|
| KLD | Case 1 | 0.0111 | 0.0101 | 0.0265 | 0.1395 | 0.0559 | 0.0205 |
| | Case 2 | 0.0027 | 0.0040 | 0.0051 | 0.0032 | 0.0427 | 0.0298 |
| Time (sec) | Case 1 | 2.98 | 33.40 | 47.18 | 2.60 | 0.37 | 6.94 |
| | Case 2 | 2.93 | 33.70 | 49.01 | 2.74 | 0.53 | 9.05 |
The second part of Table 4 shows the runtime (in seconds) of all the methods, including SUGS, per dataset. From Table 4, SUGS is only slower than SIS and Inad AS, both of which, however, perform poorly in estimating the predictive density. Also, we observed from our simulations that SUGS runs over 10 times faster than VB and over 15 times faster than the recursive algorithm. The reason that VB is slow here is that the applied VB code adopts a sequential importance sampling step to obtain a feasible initialization, which is quite time-consuming. However, the SIS step is necessary because VB gives poor results if it is used alone. The recursive algorithm is slow because it involves estimating a two-dimensional mixing density in the sequential updating and also averages over 10 permutations to eliminate the ordering effect for each dataset.
6. APPLICATIONS
We applied the SUGS algorithm to three data examples. The first two are the galaxy data and the enzyme data, which have been studied thoroughly in the literature. The third is gestational age at delivery data from the Collaborative Perinatal Project (CPP), a very large epidemiologic study conducted in the 1960s and 1970s. The official CPP data and documentation are available at ftp://sph-ftp.jhsph.edu/cpp/, provided by the Johns Hopkins University School of Public Health. Here we focus on the 34,178 pregnancies in the CPP data that had gestational age at delivery (GAD) recorded, providing a large-sample example. The Matlab codes of the SUGS algorithms for the simulation in Section 5.1 and the data analyses, as well as all the datasets, are provided in the Supplemental Materials.
The galaxy data are a commonly used example in assessing methods for Bayesian density estimation and clustering; see, for example, Roeder (1990), Escobar and West (1995), and Richardson and Green (1997), among others. The data contain measured velocities of 82 galaxies from six well-separated conic sections of space. The SUGS algorithm gives five clusters, and the corresponding predictive density is shown in Figure 3; it is similar to the kernel estimate and to the results of Escobar and West (1995), Richardson and Green (1997), and Fearnhead (2004).
Figure 3.

Predictive density estimate (solid), kernel density estimate (dashed), histogram, and plots of galaxy data (+). A color version of this figure is available in the electronic version of this article.
The enzyme data record enzyme activities in blood for 245 individuals. One interest in analyzing this dataset is the identification of subgroups of slow or fast metabolizers as a marker of genetic polymorphisms (Richardson and Green 1997). Bechtel et al. (1993) concluded that the distribution is a mixture of two skewed distributions based on a maximum likelihood analysis. Richardson and Green (1997) analyzed the data using Bayesian normal mixtures with an unknown number of components. The application of the SUGS algorithm using the default priors on the enzyme data gives a partition of three clusters. The predictive density shown in Figure 4 agrees closely with the findings in the above mentioned papers.
Figure 4.
Predictive density estimate (solid), kernel density estimate (dashed), histogram, and plots of enzyme data (+). A color version of this figure is available in the electronic version of this article.
For the third example, we consider the GADs in weeks for 34,178 births in the CPP. We are interested in the relationship of GAD with the covariates race, sex, maternal smoking status during pregnancy, and maternal age. We use indicators X1, X2, X3, and X4 to denote these four variables, with 1 indicating black, female, smoker, and maternal age less than 35, respectively, and 0 indicating nonblack, male, nonsmoker, and maternal age of 35 or greater, respectively. Table 5 gives the observed frequencies for these covariates. The distribution of GAD is known from previous research (e.g., Smith 2001) to be nonnormal with a heavy left tail. In the following, we apply the proposed SUGS algorithm to this dataset. The left-tail behavior of the distribution of GAD, corresponding to premature deliveries, is of particular interest.
Table 5.
The frequency of observations in categories of each covariate.
| Value | x1 (race) | x2 (sex) | x3 (smoke) | x4 (age) |
|---|---|---|---|---|
| 0 | 51.23% | 49.72% | 48.51% | 7.72% |
| 1 | 48.77% | 50.28% | 51.49% | 92.28% |
Let zi = (1, xi1, xi2, xi3, xi4)′ and let yi denote the GAD for subject i. We consider the following model:

$$y_i \sim N(z_i'\beta_i,\, \tau_i^{-1}), \qquad (\beta_i, \tau_i) \sim P, \qquad P \sim \mathrm{DP}(\alpha P_0), \qquad P_0 = N_5(\beta;\, \xi, \Psi\tau^{-1})\, G(\tau;\, a, b),$$

where βi = (βi0, βi1, βi2, βi3, βi4)′ denotes the random effects of the intercept and covariates for subject i. To apply SUGS, we set ξ = 0, Ψ = n(z′z)−1, and a = 1, and estimated b as described in Section 4.2, where z = (z1, …, zn)′. We ran 20 permutations for the CPP data to eliminate the ordering effect. In terms of computational speed, the analysis was completed within a few seconds for the galaxy and enzyme data, whereas a single permutation took approximately 2 minutes for the CPP data.
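A minimal sketch of this hyperparameter setup is shown below, assuming the covariates are held in an n × 4 matrix `X` and the GADs in a vector `y`; the variable names are illustrative, and the unit-information reading of Ψ = n(z′z)−1 is our gloss rather than the article's wording.

```matlab
% Hedged sketch of the prior setup used for the CPP regression example.
% X: n x 4 covariate matrix (race, sex, smoke, age); y: n x 1 GADs.
n   = size(X, 1);
Z   = [ones(n, 1), X];          % z_i = (1, x_i1, x_i2, x_i3, x_i4)'
xi  = zeros(5, 1);              % prior mean of beta
Psi = n * inv(Z' * Z);          % Psi = n (z'z)^{-1}, a unit-information scale
a   = 1;                        % shape of the gamma prior on tau
% b is estimated by the empirical Bayes procedure of Section 4.2.
```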
Figure 5 shows the estimated predictive densities and cumulative distribution functions of GAD for nonblack babies and black babies, holding the other covariates at zero. As seen in Figure 5, the predictive density of GAD for black babies is shifted left by around 1 week compared to the density for nonblack babies. This result suggests that black babies are more likely to be born prematurely than nonblack babies. We show only the comparison of densities and CDFs of GAD across race groups; the corresponding densities and CDFs across the other covariate groups are very close to each other (not shown).
Figure 5.
Top: Densities of GAD for black race (dashed) and other race (solid). Bottom: Cumulative distribution functions of GAD for black race (dashed) and other race (solid). A color version of this figure is available in the electronic version of this article.
Note that SUGS will produce clusters of subjects having identical coefficients for the different predictors. Table 6 summarizes the cluster-specific coefficients on the original data scale given by SUGS, including the posterior means and the corresponding 95% credible intervals. Table 6 also presents the sample mean of GAD and the number of subjects for each cluster in the last two rows. Clearly, the four clusters represent different groups of babies, with the first cluster at the right tail of the distribution of GAD and the third and fourth clusters at the left tail. Cluster 2 is the dominant cluster, containing about 94% of the babies, and the covariate effects are all significant in this cluster due to the extremely large sample size. Across the clusters, the effect of black race on GAD tends to be larger in the clusters corresponding to lower-GAD babies. This implies an interaction in which black race shifts GAD only slightly for full-term deliveries, with a larger impact on the timing of premature deliveries.
Table 6.
Cluster-specific coefficients obtained by SUGS
| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| β0 | 48.26 (48.19, 48.32) | 39.88 (39.88, 39.88) | 31.40 (31.36, 31.44) | 19.98 (16.19, 23.77) |
| β1 | −0.01 (−0.02, 0.01) | −0.70 (−0.70, −0.70) | −1.62 (−1.63, −1.61) | −3.28 (−4.43, −2.12) |
| β2 | −0.11 (−0.12, −0.09) | 0.12 (0.12, 0.12) | 0.05 (0.04, 0.06) | −0.56 (−1.48, 0.37) |
| β3 | 0.03 (0.02, 0.05) | 0.02 (0.01, 0.02) | 0.04 (0.03, 0.05) | 0.16 (−0.83, 1.14) |
| β4 | 0.04 (−0.02, 0.10) | 0.15 (0.15, 0.15) | 1.06 (1.03, 1.10) | 1.52 (−1.01, 4.04) |
| ȳj | 48.28 | 39.75 | 31.39 | 18.42 |
| nj | 385 | 32,024 | 1702 | 67 |
7. DISCUSSION
We have proposed a fast algorithm for posterior computation and model selection in Dirichlet process mixture models. The proposed SUGS approach is very fast and can be implemented easily for very large datasets when priors are chosen to be conjugate. In the simulations and real data examples we considered, we obtained promising results. Extensions to nonconjugate cases are conceptually straightforward. In such cases, instead of obtaining the exact marginal likelihoods conditional on the allocation to clusters, one can utilize an approximation. A promising strategy for many models would be to use a Laplace approximation. The performance of such an approach remains to be evaluated.
Although our focus was on DPMs, the same type of approach can conceptually be applied in a much broader class of models, including species sampling models and general product partition models.
Supplementary Material
Acknowledgments
The authors thank the editor, the associate editor, and two anonymous reviewers for their critical and constructive comments that greatly improved the presentation of this article.
APPENDIX: DESCRIPTION OF THE PREDICTIVE DISTRIBUTION
Suppose that (y|x, β, τ) ~ N(x′β, τ−1) with π(β, τ) = Np(β; ξ, Ψτ−1)G(τ; a, b) the prior. Then, the marginal density of y given x follows the noncentral t-distribution:

$$f(y \mid x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu\pi\,\sigma^2}} \left(1 + \frac{(y - \mu_y)^2}{\nu\,\sigma^2}\right)^{-\frac{\nu+1}{2}}, \qquad (A.1)$$

where ν = 2a is the degrees of freedom, Ψ̂ = (Ψ−1 + xx′)−1, μy = x′ξ, and σ2 = b(1 + x′Ψx)/a = b/[a(1 − x′Ψ̂x)], with μy the mean and σ2ν/(ν − 2) the variance for ν > 2.
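For completeness, a direct Matlab evaluation of (A.1) under the same assumptions (rate-parameterized gamma prior; Statistics Toolbox available for `tpdf`) might look as follows; the function name is illustrative.

```matlab
% Hedged sketch of the marginal (noncentral t) density (A.1).
function dens = marginal_t_density(y, x, xi, Psi, a, b)
    nu   = 2 * a;                               % degrees of freedom
    mu_y = x' * xi;                             % location x'xi
    s2   = (b / a) * (1 + x' * Psi * x);        % squared scale
    dens = tpdf((y - mu_y) / sqrt(s2), nu) / sqrt(s2);
end
```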
Footnotes
Data Files: The galaxy data, enzyme data, and CPP data used in Section 6. (Data_ application.zip, WinZip archived file)
Matlab Codes for Data Analysis: The SUGS codes used for analyzing the galaxy data, enzyme data, and CPP data. (SUGS_application.zip, WinZip archived file)
Matlab Code for Simulation: The SUGS code used in case 1 of the simulation study in Section 5.1. (SUGSsimu.m, Matlab m file)
Contributor Information
Lianming Wang, Email: wang99@mailbox.sc.edu, Department of Statistics, University of South Carolina, Columbia, SC 29208.
David B. Dunson, Department of Statistical Science, Duke University, Durham, NC 27708.
References
- Barry D, Hartigan JA. Product Partition Models for Change Point Problems. The Annals of Statistics. 1992;20:260–279.
- Basu S, Chib S. Marginal Likelihood and Bayes Factors for Dirichlet Process Mixture Models. Journal of the American Statistical Association. 2003;98:224–235.
- Bechtel YC, Bonaïti-Pellié C, Poisson N, Magnette J, Bechtel PR. A Population and Family Study of N-Acetyltransferase Using Caffeine Urinary Metabolites. Clinical Pharmacology & Therapeutics. 1993;54:134–141. doi: 10.1038/clpt.1993.124.
- Blackwell D, MacQueen J. Ferguson Distributions via Polya Urn Schemes. The Annals of Statistics. 1973;1:353–355.
- Blei DM, Jordan MI. Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis. 2006;1:121–144.
- Bowman AW, Azzalini A. Applied Smoothing Techniques for Data Analysis. New York: Oxford University Press; 1997.
- Bush CA, MacEachern SN. A Semiparametric Bayesian Model for Randomized Block Designs. Biometrika. 1996;83:175–185.
- Dahl DB. Modal Clustering in a Class of Product Partition Models. Bayesian Analysis. 2009;4:243–264.
- Daumé H, III. Fast Search for Dirichlet Process Mixture Models. In: Proceedings of the Conference on Artificial Intelligence and Statistics; 2007.
- Escobar MD. Estimating Normal Means With a Dirichlet Process Prior. Journal of the American Statistical Association. 1994;89:268–277.
- Escobar MD, West M. Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association. 1995;90:577–588.
- Fearnhead P. Particle Filters for Mixture Models With an Unknown Number of Components. Statistics and Computing. 2004;14:11–21.
- Ferguson TS. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics. 1973;1:209–230.
- Ferguson TS. Prior Distributions on Spaces of Probability Measures. The Annals of Statistics. 1974;2:615–629.
- Geisser S. Discussion on “Sampling and Bayes Inference in Scientific Modeling and Robustness,” by G. E. P. Box. Journal of the Royal Statistical Society, Ser A. 1980;143:416–417.
- Geisser S, Eddy W. A Predictive Approach to Model Selection. Journal of the American Statistical Association. 1979;74:153–160.
- Gelfand AE, Dey DK. Bayesian Model Choice: Asymptotics and Exact Calculations (with discussion). Journal of the Royal Statistical Society, Ser B. 1994;56:501–514.
- Ghosh J, Tokdar S. Convergence and Consistency of Newton’s Algorithm for Estimating a Mixing Distribution. In: Fan J, Koul H, editors. The Frontiers of Statistics. London: Imperial College Press; 2006. pp. 429–443.
- Ishwaran H, James LF. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association. 2001;96:161–173.
- Ishwaran H, James LF. Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models. Statistica Sinica. 2003;13:1211–1235.
- Ishwaran H, Takahara G. Independent and Identically Distributed Monte Carlo Algorithms for Semiparametric Linear Mixed Models. Journal of the American Statistical Association. 2002;97:1154–1166.
- Jain S, Neal RM. A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model. Journal of Computational and Graphical Statistics. 2004;13:158–182.
- Kurihara K, Welling M, Teh YW. Collapsed Variational Dirichlet Process Mixture Models. In: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI07); San Francisco, CA: Kaufmann; 2007. pp. 2796–2801.
- Kurihara K, Welling M, Vlassis N. Accelerated Variational Dirichlet Mixture Models. In: Advances in Neural Information Processing Systems 19 (NIPS 2006); Vancouver, British Columbia, Canada; 2006.
- Lau JW, Green PJ. Bayesian Model-Based Clustering Procedures. Journal of Computational and Graphical Statistics. 2007;16:526–558.
- Lo AY. On a Class of Bayesian Nonparametric Estimates: I, Density Estimates. The Annals of Statistics. 1984;12:351–357.
- Lo AY, Brunner LJ, Chan AT. Weighted Chinese Restaurant Processes and Bayesian Mixture Models. Research Report 1, Hong Kong University of Science and Technology; 1996.
- MacEachern SN. Estimating Normal Means With a Conjugate Style Dirichlet Process Prior. Communications in Statistics: Simulation and Computation. 1994;23:727–741.
- MacEachern SN, Clyde M, Liu JS. Sequential Importance Sampling for Nonparametric Bayes Models: The Next Generation. Canadian Journal of Statistics. 1999;27:251–267.
- Minka T, Ghahramani Z. Expectation Propagation for Infinite Mixtures. In: NIPS’03 Workshop on Nonparametric Bayesian Methods and Infinite Models; Vancouver, British Columbia, Canada; 2003.
- Mukhopadhyay N, Ghosh JK, Berger JO. Some Bayesian Predictive Approaches to Model Selection. Statistics & Probability Letters. 2005;73:369–379.
- Newton MA. On a Nonparametric Recursive Estimator of the Mixing Distribution. Sankhyā, Ser A. 2002;64:306–322.
- Newton MA, Zhang Y. A Recursive Algorithm for Nonparametric Analysis With Missing Data. Biometrika. 1999;86:15–26.
- Park J-H, Dunson DB. Bayesian Generalized Product Partition Models. Statistica Sinica. 2009; to appear.
- Pettit LI. The Conditional Predictive Ordinate for the Normal Distribution. Journal of the Royal Statistical Society, Ser B. 1990;52:175–184.
- Quintana FA, Iglesias PL. Bayesian Clustering and Product Partition Models. Journal of the Royal Statistical Society, Ser B. 2003;65:557–574.
- Quintana FA, Newton MA. Computational Aspects of Nonparametric Bayesian Analysis With Applications to the Modeling of Multiple Binary Sequences. Journal of Computational and Graphical Statistics. 2000;9:711–737.
- Richardson S, Green PJ. On Bayesian Analysis of Mixtures With an Unknown Number of Components. Journal of the Royal Statistical Society, Ser B. 1997;59:731–792.
- Roeder K. Density Estimation With Confidence Sets Exemplified by Superclusters and Voids in the Galaxies. Journal of the American Statistical Association. 1990;85:617–624.
- Sinha D, Chen MH, Ghosh SK. Bayesian Analysis and Model Selection for Interval-Censored Survival Data. Biometrics. 1999;55:585–590. doi: 10.1111/j.0006-341x.1999.00585.x.
- Smith GCS. Use of Time to Event Analysis to Estimate the Normal Duration of Human Pregnancy. Human Reproduction. 2001;16:1497–1500. doi: 10.1093/humrep/16.7.1497.
- Stephens M. Dealing With Label Switching in Mixture Models. Journal of the Royal Statistical Society, Ser B. 2000;62:795–809.
- Tokdar ST, Martin R, Ghosh JK. Consistency of a Recursive Estimate of Mixing Distributions. The Annals of Statistics. 2009;37:2502–2522.
- Wainwright MJ, Jordan MI. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning. 2008;1:1–305.
- Wang B, Titterington M. Inadequacy of Interval Estimates Corresponding to Variational Bayesian Approximations. In: Cowell RG, Ghahramani Z, editors. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. London: Society for Artificial Intelligence and Statistics; 2005. pp. 373–380.
- West M, Müller P, Escobar MD. Hierarchical Priors and Mixture Models, With Application in Regression and Density Estimation. In: Freeman PR, Smith AFM, editors. Aspects of Uncertainty: A Tribute to D. V. Lindley. London: Wiley; 1994. pp. 363–386.
- Zhang J, Ghahramani Z, Yang Y. A Probabilistic Model for Online Document Clustering With Application to Novelty Detection. In: Advances in Neural Information Processing Systems 17 (NIPS 2004); Vancouver, British Columbia, Canada; 2005.