Biometrika. 2016 May 6;103(2):319–335. doi: 10.1093/biomet/asw005

Data augmentation for models based on rejection sampling

Vinayak Rao, Lizhen Lin and David B. Dunson

Abstract

We present a data augmentation scheme to perform Markov chain Monte Carlo inference for models where data generation involves a rejection sampling algorithm. Our idea is a simple scheme to instantiate the rejected proposals preceding each data point. The resulting joint probability over observed and rejected variables can be much simpler than the marginal distribution over the observed variables, which often involves intractable integrals. We consider three problems: modelling flow-cytometry measurements subject to truncation; the Bayesian analysis of the matrix Langevin distribution on the Stiefel manifold; and Bayesian inference for a nonparametric Gaussian process density model. The latter two are instances of doubly-intractable Markov chain Monte Carlo problems, where evaluating the likelihood is intractable. Our experiments demonstrate superior performance over state-of-the-art sampling algorithms for such problems.

Keywords: Bayesian inference, Density estimation, Gaussian process, Intractable likelihood, Markov chain Monte Carlo, Matrix Langevin distribution, Rejection sampling, Truncation

1. Introduction

Rejection sampling allows sampling from a probability density $p$ by constructing an upper bound to $p$, and accepting or rejecting samples drawn from a density proportional to the bounding envelope. The envelope is usually much simpler than $p$, with the number of rejections determined by how closely it matches the true density.

In typical applications, the probability density of interest is indexed by a parameter $\theta$, and we write it as $p(x \mid \theta)$. A Bayesian analysis places a prior on $\theta$ and, given observations from the likelihood $p(x \mid \theta)$, studies the posterior over $\theta$. An intractable likelihood, often with a normalization constant depending on $\theta$, precludes straightforward Markov chain Monte Carlo inference over $\theta$: calculating a Metropolis–Hastings acceptance probability involves evaluating the ratio of two such likelihoods, and is itself intractable. This class of problems is called doubly-intractable (Murray et al., 2006), and existing approaches require the ability to draw exact samples from $p(x \mid \theta)$, or to obtain positive unbiased estimates of the likelihood.

We describe an approach that is applicable when $p(x \mid \theta)$ has an associated rejection sampling algorithm. Our idea is to instantiate the rejected proposals preceding each observation, resulting in an augmented state-space on which we run a Markov chain. Including the rejected proposals can eliminate any intractable terms, and allows the application of standard techniques (Adams et al., 2009). We show that, conditioned on the observations, it is straightforward to independently sample the number and values of the rejected proposals: this just requires running the rejection sampler to generate as many acceptances as there are observations, with all rejected proposals kept. The ability to produce a conditionally independent draw of these variables is important when posterior updates of some parameters are intractable while others are simple. In such a situation, we introduce the rejected variables only when we need to carry out the intractable updates, after which we discard them and carry out the simpler updates.

A particular application of our algorithm is parameter inference for probability distributions truncated to sets like the positive orthant, the simplex, or the unit sphere. Such distributions correspond to sampling proposals from the untruncated distribution and rejecting those outside the domain of interest. We consider an application from flow cytometry where this representation is the actual data collection process. Truncated distributions also arise in applications like measured time-to-infection (Goethals et al., 2009), where times larger than a year are truncated, mortality data (Alai et al., 2013), annuity valuation for truncated lifetimes (Alai et al., 2013), and stock price changes (Aban et al., 2006). One approach for such problems was proposed in Liechty et al. (2009), though their algorithm samples from an approximation to the posterior distribution of interest. Our algorithm provides a simple and general way to apply the machinery of Bayesian inference to such problems.

2. Rejection sampling

Consider a probability density $p(x \mid \theta) = f(x \mid \theta)/Z(\theta)$ on some space $\mathcal{X}$, with the parameter $\theta$ taking values in $\Theta$. We assume that the normalization constant $Z(\theta) = \int_{\mathcal{X}} f(x \mid \theta)\, \mu(dx)$ is difficult to evaluate, so that naïve sampling from $p(x \mid \theta)$ is not easy. We also assume there exists a second, simpler density $q(x \mid \theta)$ satisfying $f(x \mid \theta) \le M_\theta\, q(x \mid \theta)$ for all $x \in \mathcal{X}$ and some positive $M_\theta$.

Rejection sampling generates samples distributed as $p(x \mid \theta)$ by first proposing samples from $q(x \mid \theta)$. A draw $y$ from $q(\cdot \mid \theta)$ is accepted with probability $f(y \mid \theta)/\{M_\theta\, q(y \mid \theta)\}$. Let there be $r$ rejected proposals preceding an accepted sample $x$, and denote them by $\mathcal{Y} = \{y_1, \ldots, y_r\}$, where $r$ itself is a random variable. Write $p(x, \mathcal{Y} \mid \theta)$ for the joint density of the accepted sample and the rejected proposals, so that the joint probability is

$$p(x, \mathcal{Y} \mid \theta) = \frac{f(x \mid \theta)}{M_\theta} \prod_{j=1}^{r} \left\{ q(y_j \mid \theta) - \frac{f(y_j \mid \theta)}{M_\theta} \right\}. \qquad (1)$$

This procedure recovers samples from $p(x \mid \theta)$, so that (1) has the correct marginal distribution over $x$ (Robert & Casella, 2005, p. 51). Later, we will need to sample the rejected variables $\mathcal{Y}$ given an observation $x$ drawn from $p(x \mid \theta)$. Simulating from $p(\mathcal{Y} \mid x, \theta)$ involves the two steps in Algorithm 1, which relies on Proposition 1 below, proved in the Appendix.

Algorithm 1 —

Algorithm to sample from $p(\mathcal{Y} \mid x, \theta)$

  • Input: A sample $x \sim p(\cdot \mid \theta)$, and the parameter value $\theta$.

  • Output: The set of rejected proposals $\mathcal{Y}$ preceding $x$.

  • Step 1: Sample proposals independently from $q(\cdot \mid \theta)$ until a point $\hat{x}$ is accepted.

  • Step 2: Discard $\hat{x}$, and treat the preceding rejected proposals as $\mathcal{Y}$.

Proposition 1 —

The set of rejected samples $\mathcal{Y}$ preceding an accepted sample $x$ is independent of $x$: $p(x, \mathcal{Y} \mid \theta) = p(x \mid \theta)\, p(\mathcal{Y} \mid \theta)$.
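To make the two steps concrete, the following minimal R sketch recovers the rejected proposals for an illustrative example: a standard normal with mean $\theta$ truncated to $[0, 1]$, with a uniform proposal. The densities, the envelope constant and the function names here are our own illustrative choices, not part of the algorithm statement.

    # Minimal sketch of Algorithm 1: recover the rejected proposals preceding
    # an accepted sample, for a target f(x | theta) <= M_theta * q(x | theta).
    # Illustrative choices: f is an unnormalized normal truncated to [0, 1],
    # q is uniform on [0, 1], and the envelope constant is 1.
    f <- function(x, theta) exp(-0.5 * (x - theta)^2) * (x >= 0 & x <= 1)
    q_sample <- function(theta) runif(1)          # propose from q(. | theta)
    q_density <- function(x, theta) dunif(x)
    M <- function(theta) 1                        # f <= M * q on [0, 1]

    sample_rejected <- function(theta) {
      rejected <- numeric(0)
      repeat {
        y <- q_sample(theta)
        if (runif(1) < f(y, theta) / (M(theta) * q_density(y, theta)))
          return(rejected)                        # discard the accept, keep rejects
        rejected <- c(rejected, y)
      }
    }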

3. Bayesian inference

3.1. Sampling by introducing rejected proposals

Given observations $X = \{x_1, \ldots, x_n\}$ and a prior $p(\theta)$, Bayesian inference typically uses Markov chain Monte Carlo simulation to sample from an intractable posterior $p(\theta \mid X)$. Split $\theta$ as $\theta = (\theta_1, \theta_2)$ so that the normalization constant factors as $Z(\theta) = Z_1(\theta_1)\, Z_2(\theta_2)$, with $Z_1$ simple to evaluate, and $Z_2$ intractable. Updating $\theta_1$ with $\theta_2$ fixed is easy, and there are situations where we can place a conjugate prior on $\theta_1$. Inference for $\theta_2$ is a doubly-intractable problem.

We assume that $p(x \mid \theta)$ has an associated rejection sampling algorithm with proposal density $q(x \mid \theta)$. For the $i$th observation $x_i$, write the preceding set of rejected samples as $\mathcal{Y}_i$. The joint density of all samples, both rejected and accepted, is

$$p(X, \mathcal{Y}_1, \ldots, \mathcal{Y}_n \mid \theta) = \prod_{i=1}^{n} \left[ \frac{f(x_i \mid \theta)}{M_\theta} \prod_{y \in \mathcal{Y}_i} \left\{ q(y \mid \theta) - \frac{f(y \mid \theta)}{M_\theta} \right\} \right].$$

This involves no intractable terms, so standard techniques can be applied to update $\theta$. To introduce the rejected proposals $\mathcal{Y} = \{\mathcal{Y}_1, \ldots, \mathcal{Y}_n\}$, we simply follow Algorithm 1: draw proposals from $q(\cdot \mid \theta)$ until we have $n$ acceptances, with the $i$th batch of rejected proposals forming the set $\mathcal{Y}_i$.

The ability to produce conditionally independent draws of $\mathcal{Y}$ is important when, for instance, there exists a conjugate prior $p(\theta_1)$ on $\theta_1$ for the likelihood $p(X \mid \theta)$. Introducing the rejected proposals $\mathcal{Y}$ breaks this conjugacy, and the resulting complications in updating $\theta_1$ can slow down mixing, especially when $\theta_1$ is high-dimensional. A much cleaner solution is to sample $\theta_1$ from its conditional posterior $p(\theta_1 \mid X, \theta_2)$, introducing the auxiliary variables only when needed to update $\theta_2$. After updating $\theta_2$, they can be discarded. Algorithm 2 describes this.

Algorithm 2 —

An iteration of the Markov chain for posterior inference for $\theta = (\theta_1, \theta_2)$

  • Input: The observations $X = \{x_1, \ldots, x_n\}$, and the current parameter values $(\theta_1, \theta_2)$.

  • Output: New parameter values $(\tilde{\theta}_1, \tilde{\theta}_2)$.

  • Step 1: Run Algorithm 1 $n$ times, keeping all the rejected proposals $\mathcal{Y} = \{\mathcal{Y}_1, \ldots, \mathcal{Y}_n\}$.

  • Step 2: Update $\theta_2$ to $\tilde{\theta}_2$ with a Markov kernel having $p(\theta_2 \mid X, \mathcal{Y}, \theta_1)$ as its stationary distribution.

  • Step 3: Discard the rejected proposals $\mathcal{Y}$.

  • Step 4: Sample a new value $\tilde{\theta}_1$ from the conditional $p(\theta_1 \mid X, \tilde{\theta}_2)$.
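One iteration can be organized as in the R skeleton below. Here sample_rejected is the sketch following Algorithm 1, while update_theta2 and draw_theta1 are hypothetical names standing for problem-specific kernels: any Markov kernel targeting the augmented conditional of $\theta_2$, which is free of intractable terms, and an exact draw from the conjugate conditional of $\theta_1$.

    # Skeleton of one iteration of Algorithm 2 (hypothetical helper names).
    mcmc_iteration <- function(X, theta1, theta2) {
      Y <- lapply(X, function(x) sample_rejected(c(theta1, theta2)))  # step 1
      theta2 <- update_theta2(X, Y, theta1, theta2)                   # step 2
      rm(Y)                                  # step 3: discard the auxiliaries
      theta1 <- draw_theta1(X, theta2)                                # step 4
      list(theta1 = theta1, theta2 = theta2)
    }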

3.2. Related work

One of the simplest and most widely applicable Markov chain Monte Carlo algorithms for doubly-intractable distributions is the exchange sampler of Murray et al. (2006). Simplifying an earlier idea by Møller et al. (2006), this algorithm effectively amounts to the following: given the current parameter $\theta$, propose a new parameter $\theta^*$ according to some proposal distribution. Additionally, generate a dataset of $n$ pseudo-observations $X^*$ from $p(\cdot \mid \theta^*)$. The exchange algorithm then proposes to exchange parameters associated with datasets. Murray et al. (2006) show that all intractable terms cancel out in the resulting acceptance probability, and that the resulting Markov chain has the correct stationary distribution.

While the exchange algorithm is applicable whenever one can sample from the likelihood $p(x \mid \theta)$, it does not exploit the mechanism used to produce these samples. When the latter is a rejection sampling algorithm, each pseudo-observation is preceded by a sequence of rejected proposals. These are all discarded, and only the accepted proposals are used to evaluate the new parameter $\theta^*$. By contrast, our algorithm explicitly instantiates these rejected proposals, so that they can be used to guide good parameter proposals. In our experiments, we use a Hamiltonian Monte Carlo sampler on the augmented space and exploit gradient information to make nonlocal moves with a high probability of acceptance. For reasonable acceptance probabilities under the exchange sampler, one must make local updates to $\theta$, or resort to complicated annealing schemes. Of course, the exchange sampler remains applicable when no efficient rejection sampling scheme exists, such as when carrying out parameter inference for a Markov random field.

Another framework for doubly-intractable distributions is the pseudo-marginal approach of Andrieu & Roberts (2009). The idea here is that even if we cannot exactly evaluate the acceptance probability, it is sufficient to use a positive, unbiased estimator: this will still result in a Markov chain with the correct stationary distribution. In our case, instead of requiring an unbiased estimate, we require an upper bound on $f(x \mid \theta)$, given by the envelope $M_\theta\, q(x \mid \theta)$. Additionally, like the exchange sampler, the pseudo-marginal method provides a mechanism to evaluate a proposed $\theta^*$; making good proposals (Dahlin et al., 2015) is less obvious. Other related papers are Beskos et al. (2006), based on a rejection sampling algorithm for diffusions, and Walker (2011).

Most closely related to our ideas is a sampler from Adams et al. (2009); see also §7. Their problem also involved inference on the parameters governing the output of a rejection sampling algorithm. Like us, they augment the state space to include the rejected proposals $\mathcal{Y}$, and like us, given these auxiliary variables, they use Hamiltonian Monte Carlo to efficiently update parameters. However, rather than generating independent realizations of $\mathcal{Y}$ when needed, Adams et al. (2009) outlined a set of Markov transition operators that perturb the current configuration of $\mathcal{Y}$ while maintaining the correct stationary distribution. With prespecified probabilities, they proposed adding a new variable to $\mathcal{Y}$, deleting a variable from $\mathcal{Y}$, or perturbing the value of an existing element of $\mathcal{Y}$. These local updates to $\mathcal{Y}$ can slow down Markov chain mixing, require the user to specify a number of parameters, and involve calculating Metropolis–Hastings acceptance probabilities for each local step. Furthermore, the Markov nature of their updates requires them to maintain the rejected proposals at all times; this can break any conjugacy, and complicate inference for other parameters.

4. Convergence properties

Write the Markov transition density of our chain as $T(\tilde{\theta} \mid \theta)$, and the $m$-fold transition density as $T^m(\tilde{\theta} \mid \theta)$. The Markov chain is uniformly ergodic if constants $C$ and $\rho < 1$ exist such that for all $\theta_0$ and $m$, $\lVert T^m(\cdot \mid \theta_0) - p(\cdot \mid X) \rVert_1 \le C \rho^m$. The term on the left is twice the total variation distance between the desired posterior and the state of the Markov chain initialized at $\theta_0$ after $m$ iterations. Small values of $\rho$ imply faster mixing. The following minorization condition is sufficient for uniform ergodicity (Jones & Hobert, 2001): there exist a probability density $h(\theta)$ and an $\epsilon > 0$ such that for all $\theta \in \Theta$,

$$T(\tilde{\theta} \mid \theta) \ge \epsilon\, h(\tilde{\theta}). \qquad (2)$$

When this holds, the mixing rate satisfies $\rho \le 1 - \epsilon$, so that a large $\epsilon$ implies rapid mixing.

Our Markov transition density first introduces the rejected proposals $\mathcal{Y} = \{\mathcal{Y}_1, \ldots, \mathcal{Y}_n\}$, and then conditionally updates $\theta$. The set $\mathcal{Y}_i$ preceding the $i$th observation takes values in the union space $\mathcal{X}^* = \bigcup_{r=0}^{\infty} \mathcal{X}^r$. The output of the rejection sampler, including the $i$th observation, lies in the product space $\mathcal{X}^* \times \mathcal{X}$, with density given by equation (1), so that any $\mathcal{Y}_i = \{y_1, \ldots, y_r\}$ has probability

$$p(\mathcal{Y}_i \mid \theta) = \frac{Z(\theta)}{M_\theta} \prod_{j=1}^{r} \left\{ q(y_j \mid \theta) - \frac{f(y_j \mid \theta)}{M_\theta} \right\}. \qquad (3)$$

Here, $\mu$ is the measure with respect to which the densities $p(\cdot \mid \theta)$ and $q(\cdot \mid \theta)$ are defined, and it is easy to see that equation (3) integrates to 1. From Bayes' rule, the conditional density over $\mathcal{Y}_i$ is

$$p(\mathcal{Y}_i \mid x_i, \theta) = \frac{p(x_i, \mathcal{Y}_i \mid \theta)}{p(x_i \mid \theta)} = \frac{Z(\theta)}{M_\theta} \prod_{j=1}^{r} \left\{ q(y_j \mid \theta) - \frac{f(y_j \mid \theta)}{M_\theta} \right\}. \qquad (4)$$

The fact that the right-hand side does not depend on $x_i$ gives another proof of Proposition 1. Equation (4) also motivates the use of our algorithm outside the context of rejection sampling: we can view $\mathcal{Y}_i$ as convenient auxiliary variables that are independent of $x_i$, and whose density is such that $Z(\theta)$ cancels when evaluating the joint density of $(x_i, \mathcal{Y}_i)$.
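To verify the normalization of equation (3), note that each factor of the product integrates to $1 - Z(\theta)/M_\theta$, so summing over the number of rejections $r$ gives a geometric series:

$$\sum_{r=0}^{\infty} \int_{\mathcal{X}^r} \frac{Z(\theta)}{M_\theta} \prod_{j=1}^{r} \left\{ q(y_j \mid \theta) - \frac{f(y_j \mid \theta)}{M_\theta} \right\} \mu(dy_1) \cdots \mu(dy_r) = \frac{Z(\theta)}{M_\theta} \sum_{r=0}^{\infty} \left\{ 1 - \frac{Z(\theta)}{M_\theta} \right\}^r = 1.$$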

The density in equation (4) characterizes the data augmentation step of our sampling algorithm. In practice, we need as many draws from this density as there are observations. The next step involves updating $\theta$ given $(X, \mathcal{Y})$, and depends on the problem at hand. We simplify matters by assuming that we can sample from $p(\theta \mid X, \mathcal{Y})$ independently of the old $\theta$: this is the classical data augmentation algorithm. We also assume that the functions $f(x \mid \theta)$ and $q(x \mid \theta)$ are uniformly bounded from above and below by finite, positive quantities $(\hat{f}, \check{f})$ and $(\hat{q}, \check{q})$ respectively, and that $\mu(\mathcal{X}) < \infty$. It follows that, for an envelope constant chosen as $M_\theta = \sup_x f(x \mid \theta)/q(x \mid \theta)$, there exist positive numbers $\check{M} = \check{f}/\hat{q}$ and $\hat{M} = \hat{f}/\check{q}$ that bound $M_\theta$ from below and above. We can now state our result.

Theorem 1 —

Assume that $\mu(\mathcal{X}) < \infty$ and that positive bounds $(\check{f}, \hat{f}, \check{q}, \hat{q})$ exist, with $\check{M}$ and $\hat{M}$ as defined earlier. Further assume we can sample from the conditional $p(\theta \mid X, \mathcal{Y})$. Then our data augmentation algorithm is uniformly ergodic, with mixing rate $\rho$ bounded above by $1 - \epsilon_0^n$, where $\epsilon_0 = \check{f}\, \mu(\mathcal{X})/\hat{M}$ and $n$ is the number of observations.

Despite our assumptions, our theorem has a number of useful implications. The ratio $\check{f}/\hat{f}$ is a measure of how flat the function $f(\cdot \mid \theta)$ is, and the closer it is to unity, the more efficient rejection sampling for $p(x \mid \theta)$ can be. From our result, the smaller this ratio, the larger the bound on $\rho$, suggesting slower mixing. This is consistent with more rejected proposals increasing the coupling between successive values of $\theta$ in the Markov chain. On the other hand, a small $\hat{M}$ suggests a proposal distribution tailored to $p(x \mid \theta)$, and our result shows that this implies faster mixing. The numbers $\check{M}$ and $\hat{M}$ are measures of mismatch between the target and proposal densities, with small values giving better mixing. Finally, more observations $n$ result in slower mixing. We suspect that this last property holds for most exact samplers for doubly-intractable distributions, though we are unaware of any such result.

Even without assuming we can sample from $p(\theta \mid X, \mathcal{Y})$, our ability to sample $\mathcal{Y}$ independently means that the marginal chain over $\theta$ is Markovian. By contrast, existing approaches (Adams et al., 2009; Walker, 2011) only produce dependent updates in the complicated auxiliary space: they sample from $p(\mathcal{Y} \mid X, \theta)$ by making local updates to the current $\mathcal{Y}$. Consequently, these chains are Markovian only in the complicated augmented space, and the marginal processes over $\theta$ have long-term dependencies. Besides affecting mixing, this can also complicate analysis.

5. Flow cytometry data

We apply our algorithm to a dataset of flow cytometry measurements from patients subjected to bone-marrow transplant (Brinkman et al., 2007). This graft-versus-host disease dataset has 6809 control and 9083 positive observations, corresponding to whether donor immune cells attack host cells. Each observation consists of four biomarker measurements truncated between 0 and 1024, though more complicated truncation rules are often used according to operator judgement (Lee & Scott, 2012). We normalize and plot the first two dimensions, markers CD4 and CD8b, in Fig. 1. Truncation complicates the clustering of observations into homogeneous groups, an important step in the flow-cytometry pipeline called gating. Consequently, Lee & Scott (2012) propose an expectation-maximization algorithm for truncated Gaussian mixture models, which must be adapted if different mixture components or truncation rules are used.

Fig. 1.

Scatterplots of the first two dimensions for the control (left) and positive (right) group. Contours represent log posterior-mean densities under a Dirichlet process mixture.

We model the untruncated distribution for each group as a Dirichlet process mixture of Gaussian kernels (Lo, 1984), with points outside the four-dimensional unit hypercube discarded to form the normalized dataset. The Dirichlet process mixture model is a flexible nonparametric prior over densities, parameterized by a concentration parameter $\alpha$ and a base probability measure. We fix $\alpha$, and for the base measure, which gives the distribution over cluster parameters, we use a normal-inverse-Wishart distribution. Given the rejected variables, we can use standard techniques to update a representation of the Dirichlet process. We follow the blocked sampler of Ishwaran & James (2001), based on the stick-breaking representation of the Dirichlet process, using a truncation level of 50 clusters. This corresponds to updating $\theta_2$, step 2 in Algorithm 2. Having done this, we discard the old rejected samples, and produce a new set by drawing from a 50-component Gaussian mixture model, corresponding to step 1 in Algorithm 2.
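The augmentation step for this model reduces to sampling from the current mixture until the required number of draws lands inside the unit hypercube; draws falling outside form the rejected proposals. A minimal R sketch, assuming the current state is stored as stick-breaking weights w and lists of component means mu and covariances Sigma (hypothetical names):

    # Step 1 of Algorithm 2 for the truncated mixture model.
    augment_truncated_mixture <- function(n, w, mu, Sigma) {
      accepted <- 0; rejected <- list()
      while (accepted < n) {
        k <- sample(length(w), 1, prob = w)          # pick a mixture component
        x <- MASS::mvrnorm(1, mu[[k]], Sigma[[k]])   # draw from that Gaussian
        if (all(x >= 0 & x <= 1)) accepted <- accepted + 1
        else rejected[[length(rejected) + 1]] <- x   # outside [0,1]^4: reject
      }
      do.call(rbind, rejected)                       # matrix of rejected proposals
    }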

Figure 1 shows the log mean posterior densities for the first two dimensions from 10 000 iterations. While the control group has three clear modes, these are much less pronounced in the positive group. Directly modelling observations with a Gaussian mixture model obscured this by forcing modes away from the edges. One can use components with bounded support in the mixture model, such as a Dirichlet process mixture of Beta densities; however, these do not reflect the underlying data generation process, and are unsuitable when different groups have different truncation levels. By contrast, it is easy to extend our modelling ideas to allow groups to share components, allowing better identification of disease predictors.

Our sampler took less than two minutes to run 1000 iterations, not much longer than a typical Dirichlet process sampler for datasets of this size. The average number of augmented points was 3960 and 4608 for the two groups. We study our sampler more systematically in the next section, but this application demonstrates the flexibility and simplicity of our main idea.

6. Bayesian inference for the matrix Langevin distribution

6.1. The matrix Langevin distribution on the Stiefel manifold

The Stiefel manifold $\mathcal{V}_{p,d}$ is the space of all $d \times p$ orthonormal matrices, that is, $d \times p$ matrices $X$ such that $X^{\mathrm{T}} X = I_p$, where $I_p$ is the $p \times p$ identity matrix. When $p = 1$, this is the hypersphere $S^{d-1}$, and when $p = d$, this is the space of all $d \times d$ orthogonal matrices. Probability distributions on the Stiefel manifold play an important role in statistics, signal processing and machine learning, with applications ranging from studies of orientations of orbits of comets and asteroids to principal components analysis to the estimation of rotation matrices. The simplest such distribution is the matrix Langevin distribution, an exponential-family distribution whose density with respect to the invariant Haar volume measure (Edelman et al., 1998) is $p_{\mathrm{ML}}(X \mid F) = \mathrm{etr}(F^{\mathrm{T}} X)/Z(F)$. Here $\mathrm{etr}(\cdot) = \exp\{\mathrm{tr}(\cdot)\}$ is the exponential-trace, and $F$ is a $d \times p$ matrix. The normalization constant $Z(F) = {}_0F_1(d/2;\, F^{\mathrm{T}} F/4)$ is the hypergeometric function with matrix arguments, evaluated at $F^{\mathrm{T}} F/4$ (Chikuse, 2003). Let $F = U \kappa V^{\mathrm{T}}$ be the singular value decomposition of $F$, where $U$ and $V$ are $d \times p$ and $p \times p$ orthonormal matrices, and $\kappa$ is a positive diagonal matrix. We parameterize the distribution by $(\kappa, U, V)$, and one can think of $U$ and $V$ as orientations, with $\kappa$ controlling the concentration in directions determined by these orientations. Large values of $\kappa$ imply concentration along the associated directions, while setting $\kappa$ to zero gives the uniform distribution on the Stiefel manifold. It can be shown (Khatri & Mardia, 1977) that the normalization constant depends only on $\kappa$; we write it as $Z(\kappa)$.

In our Bayesian analysis, we place independent priors on $\kappa$, $U$ and $V$. The last two lie on the Stiefel manifolds $\mathcal{V}_{p,d}$ and $\mathcal{V}_{p,p}$, and we place matrix Langevin priors $p_{\mathrm{ML}}(\cdot \mid F_U)$ and $p_{\mathrm{ML}}(\cdot \mid F_V)$ on these: we will see below that these are conditionally conjugate. We place independent Gamma priors on the diagonal elements of $\kappa$. However, the difficulty in evaluating the normalization constant $Z(\kappa)$ makes posterior inference for $\kappa$ doubly intractable. Thus, in a 2006 University of Iowa PhD thesis, Camano-Garcia keeps $\kappa$ constant, while Hoff (2009a) uses a first-order Taylor expansion of the intractable term to run an approximate sampling algorithm. Below, we show how fully Bayesian inference can be carried out for this quantity as well.

6.2. A rejection sampling algorithm

We first describe a rejection sampling algorithm from Hoff (2009b) to sample from $p_{\mathrm{ML}}(\cdot \mid \kappa, U, V)$. For simplicity, assume $V$ is the identity matrix. In the general case, we simply rotate the resulting draw by $V$, since if $X \sim p_{\mathrm{ML}}(\cdot \mid \kappa, U, I_p)$, then $X V^{\mathrm{T}} \sim p_{\mathrm{ML}}(\cdot \mid \kappa, U, V)$. At a high level, the algorithm sequentially proposes vectors from the matrix Langevin on the unit sphere: this is also called the von Mises–Fisher distribution and is easy to simulate (Wood, 1994). The mean of the $r$th vector is column $r$ of $U$, written $u_r$, projected onto the nullspace $N_r$ of the earlier vectors. The sampled vector is then mapped back into the original space and normalized, and the process is repeated $p$ times. Call the resulting distribution $q(\cdot \mid \kappa, U)$; for more details, see Algorithm 3 and Hoff (2009b).

Algorithm 3 —

Proposal $q(\cdot \mid \kappa, U)$ for the matrix Langevin distribution (Hoff, 2009b)

  • Input: Parameters $(\kappa, U)$; write $u_r$ for column $r$ of $U$, and $\kappa_r$ for element $(r, r)$ of $\kappa$.

  • Output: A sample $X$; write $x_r$ for column $r$ of $X$.

  • Sample $x_1 \sim \mathrm{vMF}(\kappa_1 u_1)$.

  • For $r = 2, \ldots, p$:

    • Construct $N_r$, an orthonormal basis for the nullspace of $(x_1, \ldots, x_{r-1})$.

    • Sample $z \sim \mathrm{vMF}(\kappa_r N_r^{\mathrm{T}} u_r)$.

    • Set $x_r = N_r z$.
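A minimal R sketch of Algorithm 3, assuming the rstiefel package's rmf.vector (von Mises–Fisher draws on the sphere) and NullC (orthonormal null-space basis) behave as documented:

    # Sketch of the sequential proposal q(. | kappa, U) of Hoff (2009b).
    library(rstiefel)
    rmf_proposal <- function(U, kappa) {
      d <- nrow(U); p <- ncol(U)
      X <- matrix(0, d, p)
      X[, 1] <- rmf.vector(kappa[1] * U[, 1])      # first column: plain vMF
      for (r in seq_len(p)[-1]) {
        N <- NullC(X[, 1:(r - 1), drop = FALSE])   # nullspace of earlier columns
        z <- rmf.vector(kappa[r] * t(N) %*% U[, r])
        X[, r] <- N %*% z                          # map back to the ambient space
      }
      X
    }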

Letting $I_\nu(\cdot)$ be the modified Bessel function of the first kind, $q(\cdot \mid \kappa, U)$ is a density on the Stiefel manifold with

$$q(X \mid \kappa, U) = \mathrm{etr}(\kappa U^{\mathrm{T}} X) \prod_{r=1}^{p} \left\{ \Gamma\!\left(\frac{d-r+1}{2}\right) \left(\frac{2}{\kappa_r \lVert N_r^{\mathrm{T}} u_r \rVert}\right)^{(d-r-1)/2} I_{(d-r-1)/2}(\kappa_r \lVert N_r^{\mathrm{T}} u_r \rVert) \right\}^{-1}.$$

Write $D(X \mid \kappa, U)$ for the reciprocal of the term in braces. Since $z^{-\nu} I_\nu(z)$ is an increasing function of $z$, and $\lVert N_r^{\mathrm{T}} u_r \rVert \le 1$, we have the following bound $M_\kappa$ for $D(X \mid \kappa, U)$:

$$D(X \mid \kappa, U) \le \prod_{r=1}^{p} \Gamma\!\left(\frac{d-r+1}{2}\right) \left(\frac{2}{\kappa_r}\right)^{(d-r-1)/2} I_{(d-r-1)/2}(\kappa_r) \equiv M_\kappa.$$

This implies that $\mathrm{etr}(\kappa U^{\mathrm{T}} X) \le M_\kappa\, q(X \mid \kappa, U)$, allowing the following rejection sampler: sample $X$ from $q(\cdot \mid \kappa, U)$, and accept with probability $D(X \mid \kappa, U)/M_\kappa$. The accepted proposals come from $p_{\mathrm{ML}}(\cdot \mid \kappa, U, I_p)$, and for samples from $p_{\mathrm{ML}}(\cdot \mid \kappa, U, V)$, post-multiply these by $V^{\mathrm{T}}$.
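Under the reconstructed Bessel-function form displayed above, the acceptance probability $D(X \mid \kappa, U)/M_\kappa$ can be computed on the log scale with base R's besselI; NullC is again from rstiefel. This is a sketch under those assumptions:

    # log 0F1{m/2; z^2/4} via the Bessel representation, numerically stable.
    log0F1 <- function(m, z) {
      nu <- m / 2 - 1
      if (z == 0) return(0)
      lgamma(m / 2) + nu * log(2 / z) + z + log(besselI(z, nu, expon.scaled = TRUE))
    }
    accept_prob <- function(X, U, kappa) {
      d <- nrow(U); p <- ncol(U)
      logD <- logM <- 0
      for (r in 1:p) {
        N <- if (r == 1) diag(d) else NullC(X[, 1:(r - 1), drop = FALSE])
        rho <- sqrt(sum((t(N) %*% U[, r])^2))   # |N_r^T u_r| <= 1
        logD <- logD + log0F1(d - r + 1, kappa[r] * rho)
        logM <- logM + log0F1(d - r + 1, kappa[r])
      }
      exp(logD - logM)                          # D <= M since 0F1 is increasing
    }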

6.3. Posterior sampling

Given a set of $n$ observations $\mathbb{X} = \{X_1, \ldots, X_n\}$, and writing $S = \sum_{i=1}^{n} X_i$, we have

$$p(\kappa, U, V \mid \mathbb{X}) \propto p(\kappa)\, p(U)\, p(V)\, \frac{\mathrm{etr}(V \kappa U^{\mathrm{T}} S)}{Z(\kappa)^n}.$$

At a high level, our approach is a Gibbs sampler that sequentially updates $(U, V)$ and $\kappa$. The pair of matrices $(U, V)$ corresponds to the tractable component $\theta_1$ in Algorithm 2, while $\kappa$ corresponds to $\theta_2$. Updating the first two is straightforward, while the third requires our augmentation scheme.

  1. Updating $U$ and $V$: With a matrix Langevin prior $p_{\mathrm{ML}}(\cdot \mid F_V)$ on $V$, the posterior is

$$p(V \mid \mathbb{X}, \kappa, U) = p_{\mathrm{ML}}(V \mid F_V + S^{\mathrm{T}} U \kappa).$$

    This is just the matrix Langevin distribution over rotation matrices, and one can sample from it following §6.2. From here onwards, we will rotate the observations by $V$, allowing us to ignore this term. Redefining $S$ as $SV$, the posterior over $U$ is also matrix Langevin,

$$p(U \mid \mathbb{X}, \kappa, V) = p_{\mathrm{ML}}(U \mid F_U + S \kappa).$$

  2. Updating $\kappa$: Here, we exploit the rejection sampling scheme of the previous section, and instantiate the rejected proposals using Algorithm 1. From §6.2, the joint probability is

$$p(\mathbb{X}, \mathcal{Y} \mid \kappa, U) = \prod_{i=1}^{n} \left[ \frac{\mathrm{etr}(\kappa U^{\mathrm{T}} X_i)}{M_\kappa} \prod_{Y \in \mathcal{Y}_i} \mathrm{etr}(\kappa U^{\mathrm{T}} Y) \left\{ \frac{1}{D(Y \mid \kappa, U)} - \frac{1}{M_\kappa} \right\} \right]. \qquad (5)$$

All terms in (5) can be evaluated easily, allowing a simple Metropolis–Hastings algorithm in this augmented space. In fact, we can calculate gradients to run a Hamiltonian Monte Carlo algorithm (Neal, 2010) that makes significantly more efficient proposals than a random-walk sampling algorithm. In particular, let $\mathcal{Y} = \{\mathcal{Y}_1, \ldots, \mathcal{Y}_n\}$ denote all the rejected proposals, and let $p(\kappa)$ be the prior on $\kappa$. The log joint probability $\mathcal{L}(\kappa) = \log p(\kappa, \mathbb{X}, \mathcal{Y} \mid U)$ is

$$\mathcal{L}(\kappa) = \log p(\kappa) + \sum_{i=1}^{n} \mathrm{tr}(\kappa U^{\mathrm{T}} X_i) - n \log M_\kappa + \sum_{Y \in \mathcal{Y}} \left[ \mathrm{tr}(\kappa U^{\mathrm{T}} Y) + \log \left\{ \frac{1}{D(Y \mid \kappa, U)} - \frac{1}{M_\kappa} \right\} \right].$$

In the Appendix, we give an expression for the gradient of this log-likelihood. We use this to construct a Hamiltonian Monte Carlo sampler (Neal, 2010) for $\kappa$. Here, it suffices to note that a proposal involves taking $L$ leapfrog steps of size $\varepsilon$ along the gradient, and accepting the resulting state with probability proportional to the product of equation (5) and a simple Gaussian momentum term. The acceptance probability depends on how well the $\varepsilon$-discretization approximates the continuous dynamics of the system, and choosing a small $\varepsilon$ and a large $L$ can give global moves with high acceptance probability. A large $L$, however, costs a large number of gradient evaluations. We study this trade-off in §6.5.
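A generic leapfrog update of this form takes a few lines of R. Here logp and grad_logp stand for $\mathcal{L}(\kappa)$ and its gradient; this is a sketch, and a practical implementation would update $\log \kappa$ rather than $\kappa$ to respect positivity under the Gamma prior.

    # One Hamiltonian Monte Carlo update with identity mass matrix.
    hmc_update <- function(kappa, logp, grad_logp, eps, L) {
      m0 <- rnorm(length(kappa))                      # momentum from N(0, I)
      k <- kappa
      m <- m0 + 0.5 * eps * grad_logp(k)              # initial half step
      for (l in 1:L) {
        k <- k + eps * m                              # full position step
        m <- m + (if (l < L) eps else 0.5 * eps) * grad_logp(k)
      }
      log_ratio <- logp(k) - logp(kappa) - 0.5 * sum(m^2) + 0.5 * sum(m0^2)
      if (log(runif(1)) < log_ratio) k else kappa     # Metropolis correction
    }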

6.4. Vectorcardiogram dataset

The vectorcardiogram is a loop traced by the cardiac vector during a cycle of the heart beat. The two directions of orientation of this loop in three dimensions form a point on the Stiefel manifold $\mathcal{V}_{2,3}$. The dataset of Downs et al. (1971) includes 98 such recordings, and is displayed in Fig. 2(a). We represent each observation with a pair of orthonormal vectors, with the cone of lines to the right forming the first component. This empirical distribution possesses a single mode, so that the matrix Langevin distribution seems a suitable model.

Fig. 2.

(a) Vectorcardiogram dataset with inferences. Bold solid lines are maximum likelihood estimates of the columns of $U$, solid circles are regions of high posterior mass, and dashed circles are predictive probability regions. (b) Posterior distribution over $\kappa_1$ and $\kappa_2$; circles are maximum likelihood estimates.

We place independent exponential priors with mean 10 and variance 100 on the scale parameter $\kappa$, and a uniform prior on the location parameter $U$. We restrict $V$ to be the identity matrix. Inferences were carried out using the Hamiltonian sampler to produce 10 000 samples, with a burn-in period of 1000. For the leapfrog dynamics, we set a step size of 0·3, with the number of steps equal to 5. We fix the mass parameter to the identity matrix. We implemented all algorithms in R (R Development Core Team, 2016), building on the rstiefel package of Hoff (2009b). Simulations were run on an Intel Core 2 Duo 3 GHz CPU. For comparison, we include the maximum likelihood estimates of $U$ and $\kappa$. For $\kappa_1$ and $\kappa_2$, these were 11·9 and 5·9, and we plot these in Fig. 2(b) as circles.

The bold straight lines in Fig. 2(a) show the maximum likelihood estimates of the components of $U$, with the small circles corresponding to Bayesian credible regions estimated from the Monte Carlo output. The dashed circles correspond to predictive probability regions for the Bayesian model. For these, we generated 50 points on $\mathcal{V}_{2,3}$ for each sample, with parameters specified by that sample; the dashed circles are drawn to contain the bulk of these points across all samples. Figure 2(b) shows the posterior over $\kappa_1$ and $\kappa_2$.

6.5. Comparison of exact samplers

To quantify sampler efficiency, we estimate the effective sample sizes produced per unit time. This corrects for correlation between successive Markov chain samples by estimating the number of independent samples produced; for this we used the coda package of Plummer et al. (2006).

Figure 3(a) shows the effective sample size per second for two Metropolis–Hastings samplers, the exchange sampler and our latent variable sampler, on the vectorcardiogram dataset. Both perform a random walk in the $\kappa$-space, with the steps drawn from a normal distribution whose variance increases along the horizontal axis. The figure shows that both samplers' performance peaks at moderate proposal variances, with the exchange sampler performing slightly better. However, the real advantage of our sampler is that introducing the latent variables results in a joint distribution with no intractable terms, allowing the use of more sophisticated sampling algorithms. Figure 3(b) studies the Hamiltonian Monte Carlo sampler described at the end of §6.3. Here we vary the size of the leapfrog steps along the horizontal axis, with the different curves corresponding to different numbers of such steps. This performs an order of magnitude better than either of the previous algorithms, with performance peaking with 3 to 5 steps of size 0·3 to 0·5, fairly typical values for this algorithm. This shows the advantage of exploiting gradient information in exploring the parameter space.

Fig. 3.

Effective samples per second for (a) random walk and (b) Hamiltonian samplers. From bottom to top at abscissa 0·5: (a) Metropolis–Hastings data-augmentation sampler and exchange sampler, and (b) 1, 10, 5 and 3 leapfrog steps of the Hamiltonian sampler.

6.6. Comparison with an approximate sampler

In this section, we consider an approximate sampler based on an asymptotic approximation to $Z(\kappa)$ for large values of $\kappa$ (Khatri & Mardia, 1977):

$$Z(\kappa) \approx c_{d,p}\, \mathrm{etr}(\kappa) \prod_{i=1}^{p} \kappa_i^{-(d-p)/2} \prod_{i<j} (\kappa_i + \kappa_j)^{-1/2},$$

where the constant $c_{d,p}$ does not depend on $\kappa$.

We use this approximation in the acceptance probability of a Metropolis–Hastings algorithm; it can similarly be used to construct a Hamiltonian sampler. For a more complicated but more accurate approximation, see Kume et al. (2013). In general, however, using such approximate schemes involves the ratio of two approximations, and can have very unpredictable performance.
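For reference, the approximation displayed above can be evaluated in R, up to the additive constant $\log c_{d,p}$ that cancels in Metropolis–Hastings ratios; this sketch assumes the reconstructed form of the approximation.

    # Approximate log Z(kappa), up to a constant independent of kappa.
    log_Z_approx <- function(kappa, d) {
      p <- length(kappa)
      pair_sums <- if (p > 1) colSums(combn(kappa, 2)) else numeric(0)
      sum(kappa) - 0.5 * (d - p) * sum(log(kappa)) - 0.5 * sum(log(pair_sums))
    }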

On the vectorcardiogram dataset, the approximate sampler is about forty times faster than the exact samplers. For larger datasets, this difference will be even greater, and the real question is how accurate the approximation is. Our exact sampler allows us to study this: we consider the Stiefel manifold $\mathcal{V}_{3,d}$, with the three diagonal elements of $\kappa$ fixed to distinct values, the largest being 10. With this setting of $\kappa$, and a random $U$, we generate datasets with 50 observations, for values of the dimension $d$ increasing up to 10. In each case, we estimate the posterior mean of $\kappa$ by running the exchange sampler, and treat this as the truth. We compare this with posterior means returned by our Hamiltonian sampler, as well as the approximate sampler. Figure 4 shows these results. As expected, the two exact samplers agree, and the Hamiltonian sampler has almost no error. For values of $d$ around 5, the estimated posterior mean for the approximate sampler is close to that of the exact samplers. Smaller values lead to an approximate posterior mean that underestimates the actual posterior mean, while in higher dimensions the opposite occurs. Recalling that $\kappa$ controls the concentration of the matrix Langevin distribution about its mode, this implies that in high dimensions the approximate sampler underestimates uncertainty in the distribution of future observations.

Fig. 4.

Errors in the posterior mean for the vectorcardiogram dataset. Each panel is a different component of $\kappa$; solid/dashed lines are the Hamiltonian/approximate sampler.

7. The Gaussian process density sampler

7.1. Nonparametric density modelling with a transformed Gaussian process

Our next application is the Gaussian process density sampler of Adams et al. (2009), a nonparametric prior for probability densities induced by a logistic transformation of a random function from a Gaussian process. Letting $\sigma(z) = \{1 + \exp(-z)\}^{-1}$ denote the logistic function, the random density is

$$p(x) = \frac{g_0(x)\, \sigma\{f(x)\}}{\int_{\mathcal{X}} g_0(y)\, \sigma\{f(y)\}\, dy}, \qquad f \sim \mathrm{GP},$$

with $g_0(\cdot)$ a parametric base density and $\mathrm{GP}$ denoting a Gaussian process. The inequality $\sigma\{f(x)\} \le 1$ allows a rejection sampling algorithm by making proposals from $g_0$. At a proposed location $y$, we sample the function value $f(y)$ conditioning on all previous evaluations, and accept the proposal with probability $\sigma\{f(y)\}$. Such a scheme involves no approximation error, and only requires evaluating the random function on a finite set of points. Algorithm 4 describes the steps involved in generating $m$ new observations.

Algorithm 4 —

Generate $m$ new samples from the Gaussian process density sampler

  • Input: A base probability density $g_0(\cdot)$;

    • previous accepted and rejected proposals $X$ and $\mathcal{Y}$;

    • Gaussian process evaluations $f_X$ and $f_{\mathcal{Y}}$ at these locations.

  • Output: $m$ new samples $\hat{X}$, with the associated rejected proposals $\hat{\mathcal{Y}}$;

    • Gaussian process evaluations $f_{\hat{X}}$ and $f_{\hat{\mathcal{Y}}}$ at these locations.

  • Repeat

    • Sample a proposal $y$ from $g_0(\cdot)$.

    • Sample $f(y)$, the Gaussian process evaluated at $y$, conditioning on $f_X$, $f_{\mathcal{Y}}$, $f_{\hat{X}}$ and $f_{\hat{\mathcal{Y}}}$.

    • With probability $\sigma\{f(y)\}$:

      • Accept $y$ and add it to $\hat{X}$. Add $f(y)$ to $f_{\hat{X}}$.

    • Else:

      • Reject $y$ and add it to $\hat{\mathcal{Y}}$. Add $f(y)$ to $f_{\hat{\mathcal{Y}}}$.

  • Until $m$ samples are accepted.
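A self-contained R sketch of this generative process in one dimension, with a squared-exponential kernel, a standard normal base density and our own function names, follows; the function value at each new proposal is drawn from the Gaussian process conditioned on all earlier evaluations.

    se_kernel <- function(a, b) outer(a, b, function(u, v) exp(-0.5 * (u - v)^2))
    gpds_generate <- function(n) {
      xs <- numeric(0); fs <- numeric(0)          # all GP evaluations so far
      accepted <- numeric(0); rejected <- numeric(0)
      while (length(accepted) < n) {
        y <- rnorm(1)                             # propose from g0 = N(0, 1)
        if (length(xs) == 0) {
          fy <- rnorm(1)                          # prior marginal N(0, 1)
        } else {
          K <- se_kernel(xs, xs) + 1e-8 * diag(length(xs))
          k <- se_kernel(y, xs)                   # cross-covariances
          w <- solve(K, t(k))
          fy <- rnorm(1, k %*% solve(K, fs), sqrt(max(1 - k %*% w, 1e-12)))
        }
        xs <- c(xs, y); fs <- c(fs, fy)           # remember the evaluation
        if (runif(1) < plogis(fy)) accepted <- c(accepted, y)
        else rejected <- c(rejected, y)
      }
      list(accepted = accepted, rejected = rejected, xs = xs, fs = fs)
    }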

7.2. Posterior inference

Given observations $X = \{x_1, \ldots, x_n\}$, we are interested in the posterior over the underlying density. Since the density is determined by the modulating function $f$, we focus on the posterior over $f$. While this quantity is doubly intractable, after augmenting the state space to include the proposals $\mathcal{Y}$ from the rejection sampling algorithm, the function $f$ evaluated at $(X, \mathcal{Y})$ has density proportional to $\prod_{x \in X} \sigma\{f(x)\} \prod_{y \in \mathcal{Y}} \sigma\{-f(y)\}$ with respect to the Gaussian process prior; see also Adams et al. (2009). In words, the posterior over $f$ evaluated at $(X, \mathcal{Y})$ is just the posterior from a Gaussian process classification problem with a logistic link-function, and with the accepted and rejected proposals corresponding to the two classes. Markov chain Monte Carlo methods such as Hamiltonian Monte Carlo or elliptical slice sampling (Murray et al., 2010) are applicable in such a situation. Given $f$ on $(X, \mathcal{Y})$, the Gaussian process can be evaluated anywhere else by conditionally sampling from a multivariate normal.
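As one concrete option, a single elliptical slice sampling update for the latent vector takes the following form in R; loglik is the product of logistic terms above, evaluated on the log scale, and the code is a generic sketch of Murray et al. (2010), not taken from the paper.

    # One elliptical slice sampling update of f ~ N(0, K) given loglik(f);
    # for the density sampler, loglik(f) = sum(log(plogis(f[acc]))) +
    # sum(log(plogis(-f[rej]))) over accepted/rejected locations.
    ess_update <- function(f, K, loglik) {
      nu <- drop(t(chol(K)) %*% rnorm(length(f)))  # draw from the N(0, K) prior
      log_y <- loglik(f) + log(runif(1))           # slice height
      theta <- runif(1, 0, 2 * pi)
      lo <- theta - 2 * pi; hi <- theta
      repeat {
        f_new <- f * cos(theta) + nu * sin(theta)  # point on the ellipse
        if (loglik(f_new) > log_y) return(f_new)
        if (theta < 0) lo <- theta else hi <- theta  # shrink the bracket
        theta <- runif(1, lo, hi)
      }
    }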

Sampling the rejected proposals $\mathcal{Y}$ given $X$ and $f$ is straightforward using Algorithm 1: run the rejection sampler until there are $n$ acceptances, and treat the rejected proposals generated along the way as $\mathcal{Y}$. In practice, we do not have access to the entire function $f$, only its values evaluated on $X$ and on the locations of the previous thinned variables. However, just as under the generative mechanism, we can retrospectively evaluate the function where needed. After proposing from $g_0$, we sample the value of the function at this location conditioned on all previous evaluations, and use this value to decide whether to accept or reject. We outline the inference algorithm in Algorithm 5, noting that it is much simpler than that proposed in Adams et al. (2009). We also refer to that paper for limitations of the exchange sampler in this problem.

Algorithm 5 —

A Markov chain iteration for inference in the Gaussian process density sampler

  • Input: Observations $X$ with corresponding function evaluations $f_X$;

    • current rejected proposals $\mathcal{Y}$ with corresponding function evaluations $f_{\mathcal{Y}}$.

  • Output: New rejected proposals $\hat{\mathcal{Y}}$;

    • new Gaussian process evaluations $\tilde{f}_X$ and $\tilde{f}_{\hat{\mathcal{Y}}}$ at $X$ and $\hat{\mathcal{Y}}$;

    • new hyperparameters.

  • Run Algorithm 4 to produce $n$ accepted samples, with $(X, \mathcal{Y})$ and $(f_X, f_{\mathcal{Y}})$ as inputs.

  • Replace $\mathcal{Y}$ and $f_{\mathcal{Y}}$ with the rejected proposals and function values returned by the previous step; call these $\hat{\mathcal{Y}}$ and $f_{\hat{\mathcal{Y}}}$.

  • Update $f_X$ and $f_{\hat{\mathcal{Y}}}$ using, for example, hybrid Monte Carlo, to get $\tilde{f}_X$ and $\tilde{f}_{\hat{\mathcal{Y}}}$.

  • Update Gaussian process and base-distribution hyperparameters.

7.3. Experiments

Voice changes are a symptom and measure of the onset of Parkinson's disease, and one attribute is voice shimmer, a measure of variation in amplitude. We consider a dataset of such measurements for subjects with and without the disease (Little et al., 2007), with 147 measurements with, and 48 without, the disease. We normalized these to vary from 0 to 5, and used the model of Adams et al. (2009) as a prior on the underlying probability densities. We set $g_0$ to a normal density $N(\mu_0, \sigma_0^2)$, with a normal-inverse-Gamma prior on $(\mu_0, \sigma_0^2)$. The latter had its mean, inverse-scale, degrees-of-freedom and variance set to 0, 0·1, 1 and 10. The Gaussian process had a squared-exponential kernel, with variance and length-scale of 1. For each case, we ran a Matlab implementation of our data augmentation algorithm to produce 2000 posterior samples after a burn-in of 500 samples.

Figure 5(a) shows the resulting posterior over densities, corresponding to the intractable component $\theta_2$ in Algorithm 2. The control group is fairly Gaussian, while the disease group is skewed to the right. Figure 5(b) focuses on the deviation from normality by plotting the posterior over the latent function $f$. We see that to the right of 0·5, this deviation is larger than its prior mean of zero, implying larger probability than under a Gaussian density. Figure 6 studies the distribution of the rejected proposals $\mathcal{Y}$. Figure 6(a) shows the distribution of their locations: most of these occurred near the origin, where the disease density reverts to a Gaussian or even sub-Gaussian density, with the modulating function taking small values. Figure 6(b) is a histogram of the number of rejected proposals: this is typically around 100 to 150, though the largest value we observed was 668. Since inference on the latent function involves evaluating it at the accepted as well as rejected proposals, the largest covariance matrix we had to deal with was about $800 \times 800$; typical sizes were around $300 \times 300$. Using the same set-up as §6.5, it took a naïve Matlab implementation 26 and 18 minutes to run 2500 iterations for the disease and control datasets. One can imagine computations becoming unwieldy for a large number of observations, or when there is a large mismatch between the true density and the base measure $g_0$. In such situations, one might have to choose the Gaussian process covariance kernel more carefully, use one of many sparse approximation techniques, or use other nonparametric priors such as splines. In all these cases, we can use our algorithm to recover the rejected proposals $\mathcal{Y}$, and given these, posterior inference for $f$ can be carried out using standard techniques.

Fig. 5.

Inferences for the Parkinson's dataset: (a) posterior density for positive (solid) and control (dashed) groups, (b) posterior distribution of the Gaussian process function for positive group with observations. Both panels show the median with 80 percent credible intervals.

Fig. 6.

Rejected proposals for the Parkinson's dataset: (a) kernel density estimate of locations of rejected proposals, and (b) histogram of the number of rejected proposals for the positive group.

8. Future work

Our algorithm, while exact, also provides a framework for faster, approximate algorithms. A priori, the number of rejected proposals preceding any observation is unbounded: one can bound the computational cost of an iteration by limiting the maximum number of rejected proposals. Similarly, one might share rejected proposals across observations. We leave the study of such approximate sampling algorithms for future research. Also left open is a more careful analysis of Markov mixing rates for the applications we considered. There are also a number of applications that we have not described here: particularly relevant are rejection samplers for diffusions (Beskos et al., 2006; Bladt & Sørensen, 2014).

Acknowledgement

This work was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health. We are grateful to the editor and reviewers for valuable comments.

Appendix

Proofs

Proof of Proposition 1 —

Rejection sampling first proposes from $q(\cdot \mid \theta)$, and then accepts with probability $f(\cdot \mid \theta)/\{M_\theta\, q(\cdot \mid \theta)\}$. Conceptually, one can first decide whether to accept or reject, and then conditionally sample the location. The marginal acceptance probability is $Z(\theta)/M_\theta$, the area under $f(\cdot \mid \theta)$ divided by that under $M_\theta\, q(\cdot \mid \theta)$. An accepted sample is distributed as the target distribution $p(x \mid \theta) = f(x \mid \theta)/Z(\theta)$, while rejected samples are distributed as $\{q(y \mid \theta) - f(y \mid \theta)/M_\theta\}/\{1 - Z(\theta)/M_\theta\}$. This two-component mixture is just the proposal $q(\cdot \mid \theta)$. While this mixture representation loses the computational benefits of the original algorithm, it shows that the location of an accepted sample is independent of the past, and consequently, that the number and locations of the rejected samples preceding an accepted sample are independent of the location of that sample. Thus, one can use the rejected samples preceding any other accepted sample.

Proof of Theorem 1 —

It follows from Bayes' rule and the assumed bounds that for an observation $x_i$,

$$p(\mathcal{Y}_i = \emptyset \mid x_i, \theta) = \frac{Z(\theta)}{M_\theta} \ge \frac{\check{f}\, \mu(\mathcal{X})}{\hat{M}} = \epsilon_0.$$

Let the number of observations be $n$. Then,

$$T(\tilde{\theta} \mid \theta) = \int p(\tilde{\theta} \mid X, \mathcal{Y})\, p(\mathcal{Y} \mid X, \theta)\, d\mathcal{Y} \ge \epsilon_0^n\, p(\tilde{\theta} \mid X, \mathcal{Y}_1 = \cdots = \mathcal{Y}_n = \emptyset).$$

Thus $T$ satisfies equation (2), with $\epsilon = \epsilon_0^n$, and $h(\tilde{\theta}) = p(\tilde{\theta} \mid X, \mathcal{Y}_1 = \cdots = \mathcal{Y}_n = \emptyset)$.

Gradient information

For the $n$ pairs $(X_i, \mathcal{Y}_i)$, with $f(X \mid \kappa, U) = \mathrm{etr}(\kappa U^{\mathrm{T}} X)$ and $q(X \mid \kappa, U) = f(X \mid \kappa, U)/D(X \mid \kappa, U)$, we have the log joint probability $\mathcal{L}(\kappa)$ of §6.3:

$$\mathcal{L}(\kappa) = \log p(\kappa) + \sum_{i=1}^{n} \mathrm{tr}(\kappa U^{\mathrm{T}} X_i) - n \log M_\kappa + \sum_{Y \in \mathcal{Y}} \left[ \mathrm{tr}(\kappa U^{\mathrm{T}} Y) + \log \left\{ \frac{1}{D(Y \mid \kappa, U)} - \frac{1}{M_\kappa} \right\} \right].$$

Let $m_r(z) = \Gamma\{(d-r+1)/2\}\, (2/z)^{(d-r-1)/2}\, I_{(d-r-1)/2}(z)$, so that $D(X \mid \kappa, U) = \prod_{r=1}^{p} m_r(\kappa_r \rho_r)$ with $\rho_r = \lVert N_r^{\mathrm{T}} u_r \rVert$, and $M_\kappa = \prod_{r=1}^{p} m_r(\kappa_r)$. Since $\frac{d}{dz}\{z^{-\nu} I_\nu(z)\} = z^{-\nu} I_{\nu+1}(z)$,

$$m_r'(z) = \Gamma\!\left(\frac{d-r+1}{2}\right) \left(\frac{2}{z}\right)^{(d-r-1)/2} I_{(d-r+1)/2}(z).$$

Then, writing $x_{i,r}$ for column $r$ of $X_i$, and $\rho_{r,Y}$ and $D_Y = D(Y \mid \kappa, U)$ for the values of $\rho_r$ and $D$ at a rejected proposal $Y$ with columns $y_r$, we have

$$\frac{\partial \mathcal{L}}{\partial \kappa_r} = \frac{\partial \log p(\kappa)}{\partial \kappa_r} + \sum_{i=1}^{n} u_r^{\mathrm{T}} x_{i,r} - n\, \frac{m_r'(\kappa_r)}{m_r(\kappa_r)} + \sum_{Y \in \mathcal{Y}} \left[ u_r^{\mathrm{T}} y_r + \frac{ M_\kappa^{-1}\, m_r'(\kappa_r)/m_r(\kappa_r) - D_Y^{-1}\, \rho_{r,Y}\, m_r'(\kappa_r \rho_{r,Y})/m_r(\kappa_r \rho_{r,Y}) }{ D_Y^{-1} - M_\kappa^{-1} } \right].$$

References

  1. Aban I. B., Meerschaert M. M. & Panorska A. K. (2006). Parameter estimation for the truncated Pareto distribution. J. Am. Statist. Assoc. 101, 270–8.
  2. Adams R. P., Murray I. & MacKay D. J. C. (2009). The Gaussian process density sampler. In Adv. Neural Inf. Process. Syst. 21, D. Koller, D. Schuurmans, Y. Bengio & L. Bottou, eds. Cambridge, MA: MIT Press, pp. 9–16.
  3. Alai D. H., Landsman Z. & Sherris M. (2013). Lifetime dependence modelling using a truncated multivariate Gamma distribution. Insurance Math. Econom. 52, 542–9.
  4. Andrieu C. & Roberts G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. 37, 697–725.
  5. Beskos A., Papaspiliopoulos O., Roberts G. O. & Fearnhead P. (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes. J. R. Statist. Soc. B 68, 333–82.
  6. Bladt M. & Sørensen M. (2014). Simple simulation of diffusion bridges with application to likelihood inference for diffusions. Bernoulli 20, 645–75.
  7. Brinkman R. R., Gasparetto M., Lee S.-J. J., Ribickas A. J., Perkins J., Janssen W., Smiley R. & Smith C. (2007). High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biol. Blood Marrow Transplant. 13, 691–700.
  8. Chikuse Y. (2003). Statistics on Special Manifolds. New York: Springer.
  9. Dahlin J., Lindsten F. & Schön T. B. (2015). Particle Metropolis–Hastings using gradient and Hessian information. Statist. Comp. 25, 81–92.
  10. Downs T. D., Liebman J. & Mackay W. (1971). Statistical methods for vectorcardiogram orientations. In Vectorcardiography 2: Proc. XIth Int. Symp. Vectorcardiography, R. H. I. Hoffman & E. E. Glassman, eds. Amsterdam: North-Holland, pp. 216–22.
  11. Edelman A., Arias T. A. & Smith S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–53.
  12. Goethals K., Ampe B., Berkvens D., Laevens H., Janssen P. & Duchateau L. (2009). Modeling interval-censored, clustered cow udder quarter infection times through the shared gamma frailty model. J. Agric. Biol. Envir. Statist. 14, 1–14.
  13. Hoff P. D. (2009a). A hierarchical eigenmodel for pooled covariance estimation. J. R. Statist. Soc. B 71, 971–92.
  14. Hoff P. D. (2009b). Simulation of the matrix Bingham–von Mises–Fisher distribution, with applications to multivariate and relational data. J. Comp. Graph. Statist. 18, 438–56.
  15. Ishwaran H. & James L. F. (2001). Gibbs sampling methods for stick-breaking priors. J. Am. Statist. Assoc. 96, 161–73.
  16. Jones G. L. & Hobert J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statist. Sci. 16, 312–34.
  17. Khatri C. G. & Mardia K. V. (1977). The von Mises–Fisher matrix distribution in orientation statistics. J. R. Statist. Soc. B 39, 95–106.
  18. Kume A., Preston S. P. & Wood A. T. A. (2013). Saddlepoint approximations for the normalizing constant of Fisher–Bingham distributions on products of spheres and Stiefel manifolds. Biometrika 100, 971–84.
  19. Lee G. & Scott C. (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Comput. Statist. Data Anal. 56, 2816–29.
  20. Liechty M. W., Liechty J. C. & Müller P. (2009). The shadow prior. J. Comp. Graph. Statist. 18, 368–83.
  21. Little M. A., McSharry P. E., Roberts S. J., Costello D. A. E. & Moroz I. M. (2007). Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Biomed. Eng. Online 6, 23.
  22. Lo A. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. Ann. Statist. 12, 351–7.
  23. Møller J., Pettitt A. N., Reeves R. & Berthelsen K. K. (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93, 451–8.
  24. Murray I., Ghahramani Z. & MacKay D. J. C. (2006). MCMC for doubly-intractable distributions. In Proc. 22nd Conf. Uncert. Artif. Intell. AUAI Press, pp. 359–66.
  25. Murray I., Adams R. P. & MacKay D. J. C. (2010). Elliptical slice sampling. J. Mach. Learn. Res. 9, 541–8.
  26. Neal R. M. (2010). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, S. P. Brooks, A. Gelman, G. L. Jones & X.-L. Meng, eds. Boca Raton, Florida: Chapman & Hall/CRC Press, pp. 113–62.
  27. Plummer M., Best N., Cowles K. & Vines K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News 6, 7–11.
  28. Robert C. P. & Casella G. (2005). Monte Carlo Statistical Methods, 2nd ed. New York: Springer.
  29. R Development Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org.
  30. Walker S. G. (2011). Posterior sampling when the normalizing constant is unknown. Commun. Statist. B 40, 784–92.
  31. Wood A. T. A. (1994). Simulation of the von Mises–Fisher distribution. Commun. Statist. B 23, 157–64.
