Author manuscript; available in PMC 2025 Aug 8. Published before final editing as: Bayesian Analysis, 2024 Apr 23. doi: 10.1214/24-ba1422

Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator

Martin Metodiev, Marie Perrot-Dockès, Sarah Ouadah, Nicholas J. Irons, Pierre Latouche, Adrian E. Raftery

Abstract

We propose an easily computed estimator of marginal likelihoods from posterior simulation output, via reciprocal importance sampling, combining earlier proposals of DiCiccio et al. (1997) and Robert and Wraith (2009). It involves only the unnormalized posterior densities of the sampled parameter values, and requires no simulations beyond the main posterior simulation and no complicated additional calculations, provided that the parameter space is unconstrained. Even if this is not the case, the estimator is easily adjusted by a simple Monte Carlo approximation. It is unbiased for the reciprocal of the marginal likelihood, consistent, has finite variance, and is asymptotically normal. It involves one user-specified control parameter, and we derive an optimal way of specifying this. We illustrate it with several numerical examples.

Keywords: marginal likelihood estimation, reciprocal importance sampling

MSC2020 subject classifications: Primary 62F15, 62-04; secondary 62F12

1. Introduction

A key quantity in Bayesian model selection is the marginal likelihood, also known as the evidence, the normalizing constant of the posterior density, or the integrated likelihood. Consider a statistical model with parameter vector $\theta$ and data $\mathcal{D}$. Let $L(\theta)=p(\mathcal{D}\mid\theta)$ be the usual likelihood, and $\pi(\theta)$ be the prior distribution of $\theta$. Then $Z=p(\mathcal{D})=\int L(\theta)\,\pi(\theta)\,d\theta$ is the marginal likelihood.

The marginal likelihood plays a key role in defining Bayes factors. Consider two models $M_1$ and $M_2$ with marginal likelihoods $Z_1$ and $Z_2$. Then the Bayes factor (the ratio of posterior to prior odds) for model $M_1$ against $M_2$ is $B_{1,2}=Z_1/Z_2$.

The marginal likelihood is also a critical quantity for Bayesian model averaging (BMA). Consider $K$ models, $M_1,\ldots,M_K$, with prior model probabilities $\Pi_k$ (which add up to 1) and marginal likelihoods $Z_k$. Suppose $Q$ is a quantity of interest, such as a parameter or a future observation to be predicted. Then the BMA posterior distribution of $Q$ is

$$p(Q\mid\mathcal{D})=\sum_{k=1}^{K}p(Q\mid\mathcal{D},M_k)\,p(M_k\mid\mathcal{D}), \tag{1}$$

where $p(M_k\mid\mathcal{D})$ is the posterior model probability of $M_k$, which satisfies $p(M_k\mid\mathcal{D})\propto\Pi_k Z_k$ and $\sum_{k=1}^{K}p(M_k\mid\mathcal{D})=1$. So $p(Q\mid\mathcal{D})=\sum_{k=1}^{K}p(Q\mid\mathcal{D},M_k)\,\Pi_k Z_k\big/\sum_{k=1}^{K}\Pi_k Z_k$.

Finally, the most likely model a posteriori is the one that maximizes $\Pi_k Z_k$. Choosing it minimizes the model selection error rate on average over the prior [17]. Often the prior over the model space is chosen to be uniform, in which case $\Pi_k=1/K$ for all $k$. In this case, Bayesian model selection by choosing the most likely model a posteriori boils down to choosing the model with the largest $Z_k$, and hence involves only the marginal likelihoods.
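Given estimates of the $Z_k$ on the log scale, the posterior model probabilities can be computed stably with a log-sum-exp shift. A minimal R sketch, where `log_Z` is a hypothetical vector of log marginal likelihood estimates:

```r
# Posterior model probabilities p(M_k | D) from log marginal likelihoods.
post_model_prob <- function(log_Z,
                            prior_prob = rep(1 / length(log_Z), length(log_Z))) {
  log_w <- log_Z + log(prior_prob)  # log(Pi_k * Z_k), up to an additive constant
  w <- exp(log_w - max(log_w))      # subtract the max to avoid underflow
  w / sum(w)                        # p(M_k | D)
}

post_model_prob(c(-8278.8, -8136.6))  # hypothetical two-model comparison
```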

Bayesian models are often estimated using Monte Carlo methods in which a sample of values of $\theta$ is simulated from the posterior distribution. The most common class of such methods is Markov chain Monte Carlo (MCMC). Perhaps surprisingly, estimating the marginal likelihood from the output of MCMC and other posterior simulation methods has turned out not to be straightforward. Many different methods have been proposed, and none of them is widely considered to be generally the best. Llorente et al. [20] provide a comprehensive review of such methods, describing 16 different methods and, remarkably, citing over 20 other review articles!

We seek a method that is precise, generic and simple for estimating the marginal likelihood from posterior simulation output. We take this to mean that it gives low variance estimates of the marginal likelihood, uses posterior simulation output for just the one model being analyzed, uses only likelihoods and prior densities of the sampled values of θ, and does not need additional simulations or complicated calculations.

Some well-known methods do not satisfy our desiderata. These include Chib’s method [3], which requires complicated additional calculations, bridge sampling [22], which requires simulations from two models, importance sampling, which requires additional simulations, nested sampling [34], which involves other simulations, and more advanced methods such as adaptive annealed importance sampling [19]. They also include the harmonic mean of the likelihoods [26], which is unbiased and consistent, but has infinite variance and is unstable, as pointed out by the original authors.

Arguably, the only methods that are precise, generic and simple for estimating the marginal likelihood from MCMC by our definition are versions of reciprocal importance sampling (RIS) [8]. These are based on the identity:

$$Z^{-1}=E_{\theta}\!\left[\frac{h(\theta)}{L(\theta)\,\pi(\theta)}\,\middle|\,\mathcal{D}\right], \tag{2}$$

where h(θ) is a (normalized) probability density function (pdf) over the posterior support. Remarkably, (2) holds for any pdf h(θ). This leads to the estimator

$$\hat{Z}^{-1}=\frac{1}{T}\sum_{t=1}^{T}\frac{h\big(\theta^{(t)}\big)}{L\big(\theta^{(t)}\big)\,\pi\big(\theta^{(t)}\big)}, \tag{3}$$

where $\theta^{(1)},\ldots,\theta^{(T)}$ are simulated from the posterior using MCMC or another method. This estimator has good properties in general, provided that the tails of the distribution $h(\theta)$ are thin enough in all directions. It can be hard to choose $h(\theta)$ so that it both overlaps substantially with the posterior distribution (needed for efficiency) and has thin enough tails (needed for finite variance), especially in higher dimensions. We propose a choice of $h(\theta)$ that leads to easily computed estimates and is optimal or near optimal in a certain sense.
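To fix ideas, here is a minimal R sketch of the generic RIS estimator (3) on a conjugate one-dimensional example, with $h$ taken to be a normal density matched to the posterior; the THAMES of Section 3 replaces this $h$ with a uniform density on an ellipsoid:

```r
set.seed(1)
y <- rnorm(20, mean = 2)                  # data: y_i ~ N(mu, 1), prior mu ~ N(0, 1)
log_lik   <- function(mu) sapply(mu, function(m) sum(dnorm(y, m, 1, log = TRUE)))
log_prior <- function(mu) dnorm(mu, 0, 1, log = TRUE)

# The exact posterior here is N(m_n, s_n) with precision n + 1, so we can
# draw from it directly as a stand-in for MCMC output
n <- length(y); s_n <- 1 / (n + 1); m_n <- n * mean(y) * s_n
theta <- rnorm(1e4, m_n, sqrt(s_n))

# An h matched exactly to the posterior makes every term equal to 1/Z
# (zero variance); in practice h only approximates the posterior
log_h <- dnorm(theta, m_n, sqrt(s_n), log = TRUE)
Z_inv_hat <- mean(exp(log_h - log_lik(theta) - log_prior(theta)))
-log(Z_inv_hat)                           # estimated log marginal likelihood
```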

The paper is organized as follows. In Section 2 we discuss reciprocal importance sampling and its properties. In Section 3 we describe our proposed choice of $h(\theta)$ and derive some of its properties. In Section 4 we give several numerical examples, including a multivariate Gaussian example, a Bayesian regression example, a non-Gaussian case, and a Bayesian hierarchical model. We conclude in Section 5 with a discussion. The code for this paper is available via GitHub, and THAMES is implemented in an R package [15], which has been submitted to CRAN.

2. Reciprocal Importance Sampling

In general, the RIS estimator of the marginal likelihood is defined by Equation (3). This has several good properties. It is unbiased, in the sense that $E[\hat{Z}^{-1}]=Z^{-1}$, where the expectation is over the posterior distribution of $\theta$. It is also strongly simulation-consistent, in the sense that $\hat{Z}^{-1}\to Z^{-1}$ almost surely as $T\to\infty$.

In addition, the RIS estimator of the reciprocal marginal likelihood, $\hat{Z}^{-1}$, has finite variance and is asymptotically normally distributed as $T\to\infty$ if the tails of $h(\theta)$ are thin enough. Specifically, this requires that

$$\int\frac{h(\theta)^2}{L(\theta)\,\pi(\theta)}\,d\theta<\infty. \tag{4}$$

It is hard to choose h(θ) so that it both overlaps substantially with the area of the parameter space with high posterior density, which is needed for efficiency, and so that it also has thin enough tails, which is needed for finite variance. The difficulty grows as the dimension increases.

Two choices of $h(\theta)$ in the literature deserve attention. DiCiccio et al. [4] proposed $h(\theta)=\mathrm{MVN}(\theta;\hat{\theta},\hat{\Sigma})$, where $\hat{\theta}$ is the posterior mean or mode, and $\hat{\Sigma}$ is an estimate of the posterior covariance matrix. This overlaps nicely with $L(\theta)\pi(\theta)$, but its tails may not be thin enough when the posterior is asymmetric or the parameter is high-dimensional.

To remedy the problem of the tails possibly being too thick, DiCiccio et al. [4] proposed truncating this distribution, using instead $h(\theta)=\mathrm{TMVN}_A(\hat{\theta},\hat{\Sigma})$, a multivariate normal distribution truncated to the set $A$, where

$$A=\big\{\theta:(\theta-\hat{\theta})^T\hat{\Sigma}^{-1}(\theta-\hat{\theta})<c^2\big\}. \tag{5}$$

Thus A is an ellipsoid with radius c and volume

$$V(A)=c^d\,\pi^{d/2}\,|\hat{\Sigma}|^{1/2}\Big/\,\Gamma\!\Big(\frac{d}{2}+1\Big). \tag{6}$$

Truncating the distribution ensures that the estimator $\hat{Z}^{-1}$ has finite variance. The authors found that the truncation improved the performance of the RIS estimator. However, with high-dimensional parameters, the result might be sensitive to the specification of $c$.
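Equation (6) is cheap to evaluate on the log scale; a small R sketch, with `Sigma_hat` a hypothetical estimate of the posterior covariance matrix:

```r
# Log-volume of the ellipsoid A in Eq. (6), computed on the log scale
# (via the log-determinant) for numerical stability in higher dimensions.
log_V_A <- function(c, Sigma_hat) {
  d <- nrow(Sigma_hat)
  d * log(c) + (d / 2) * log(pi) +
    0.5 * as.numeric(determinant(Sigma_hat, logarithm = TRUE)$modulus) -
    lgamma(d / 2 + 1)
}

exp(log_V_A(1, diag(2)))  # unit disc in two dimensions: pi
```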

Robert and Wraith [30] proposed setting $h(\theta)$ to be a uniform distribution on the convex hull of the simulated MCMC parameter values in the $\alpha$-HPD region, namely the highest posterior density region containing a proportion $\alpha$ of the sampled parameter values. They considered the values $\alpha=0.1$ and $0.25$. They applied it to a two-dimensional toy example where it performed well.

However, as far as we know, that method has not yet been fully developed for realistic, higher-dimensional situations. For example, we know of no simple way to compute the volume of the convex hull of a set of points in higher dimensions, which is required for the method in general. It is also not clear how best to choose α nor how sensitive the method would be to α in higher dimensions. It has been used in a higher-dimensional application by Durmus et al. [5], but this involved comparing competing models defined on the same parameter space, thus avoiding the need to calculate the volume of A, which canceled out in Bayesian model comparisons. Calculating the volume of A may be the most difficult part of this method in general.

3. Estimating the marginal likelihood

3.1. Estimating the marginal likelihood with THAMES

We propose combining the proposals of DiCiccio et al. [4] and Robert and Wraith [30] to obtain a method that we believe satisfies all our desiderata. We propose specifying h(θ) to be a uniform distribution, but to be uniform over the set A defined in Equation (5), rather than over a convex hull of points. This resolves the problem of computing the volume of A, since this is given analytically by Equation (6). If A is not a subset of the posterior support, for example if the posterior support is constrained, we adjust the volume of A by a simple Monte Carlo approximation. This yields the estimator

$$\hat{Z}^{-1}=\frac{1}{V(A)\,T}\sum_{t=1}^{T}\frac{\mathbf{1}\big\{\theta^{(t)}\in A\big\}}{L\big(\theta^{(t)}\big)\,\pi\big(\theta^{(t)}\big)}. \tag{7}$$

Thus $\hat{Z}$ is a truncated harmonic mean of the unnormalized posterior densities, $L(\theta^{(t)})\,\pi(\theta^{(t)})$ (see Footnote 1). We call it the Truncated HArmonic Mean EStimator, or THAMES.

The THAMES, $\hat{Z}^{-1}$, has several desirable properties. It is simple to compute, involving only the prior and likelihood values of the sampled parameter values; in fact it involves only their product, namely the unnormalized posterior densities of the sampled parameter values. It is unbiased as an estimator of $Z^{-1}$, as long as $A$ is specified independently of the sample. It is also simulation-consistent, in the sense that $\hat{Z}^{-1}\to Z^{-1}$ almost surely as $T\to\infty$, by the strong law of large numbers. Its variance (over simulation from the posterior given the data $\mathcal{D}$) is finite provided that

$$\int_A\big(L(\theta)\,\pi(\theta)\big)^{-1}\,d\theta<\infty, \tag{8}$$

which will usually hold since $A$ is a bounded set in $\mathbb{R}^d$. In fact, it suffices that the likelihood and the prior are continuous with respect to $\theta$ and strictly positive on the closure of $A$. If Equation (8) holds, $\hat{Z}^{-1}$ is asymptotically normal (again as the number of parameter values simulated increases), by the Lindeberg central limit theorem. Note that asymptotic normality holds on the scale of $\hat{Z}^{-1}$, and not exactly on other scales such as $\hat{Z}$ or $\log(\hat{Z})$.

If the posterior simulation method yields independent draws, then $\mathrm{Var}(\hat{Z}^{-1})$ can be estimated directly as the empirical variance of the values of $\big(L(\theta^{(t)})\,\pi(\theta^{(t)})\big)^{-1}\mathbf{1}\{\theta^{(t)}\in A\}$, divided by $T\,V(A)^2$. If MCMC is used, successive simulations from the posterior will in general not be independent. A central limit theorem will still hold, but the variance needs to take account of the serial dependence. This can be done approximately by computing the variance based on serial independence and multiplying it by an estimate of the spectral density of the sequence at zero. For example, if the sequence of values of $1/(L(\theta)\pi(\theta))$ can be approximated by a first-order autoregressive model with parameter $\phi$, then this factor would be approximately $1/(1-\phi)^2$. An alternative would be to thin the sequence enough that the resulting subsequence is approximately uncorrelated and then use the variance based on assuming independence. A different approach was taken by Frühwirth-Schnatter [7].

Note that an approximate normal confidence interval can be obtained for $\hat{Z}^{-1}$, because that is the scale on which a central limit theorem holds. This can be turned into a confidence interval for $\hat{Z}$ by taking the reciprocals of the endpoints of the normal confidence interval for $\hat{Z}^{-1}$; the resulting confidence interval is not symmetric. The same can be done for $\log(\hat{Z})$ in a similar manner.
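A sketch of this variance estimate and the resulting intervals in R, under the AR(1) approximation just described. Here `terms` is a hypothetical vector of the values $(L\pi)^{-1}\mathbf{1}\{\theta^{(t)}\in A\}$ along one chain and `V_A` the volume from Equation (6); the interval for $\log\hat{Z}$ requires the lower endpoint for $\hat{Z}^{-1}$ to be positive:

```r
ris_ci <- function(terms, V_A, level = 0.95) {
  T_n <- length(terms)
  var_iid  <- var(terms) / (T_n * V_A^2)               # variance assuming independence
  phi      <- ar(terms, order.max = 1, aic = FALSE)$ar # fitted AR(1) coefficient
  var_mcmc <- var_iid / (1 - phi)^2                    # inflate for serial dependence
  z_inv <- mean(terms) / V_A                           # estimate of Z^{-1}
  half  <- qnorm(1 - (1 - level) / 2) * sqrt(var_mcmc)
  ci_inv <- c(z_inv - half, z_inv + half)              # normal CI on the Z^{-1} scale
  list(log_Z = -log(z_inv),
       ci_log_Z = rev(-log(ci_inv)))                   # asymmetric CI for log Z
}
```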

3.2. Optimal choice of control parameter, c

We now address the question of how to choose the radius $c$ of the ellipsoid that specifies the THAMES in Equation (5). Ignoring serial correlation between simulated values of the parameters, we suggest choosing $c$ to minimize the estimated variance of $\hat{Z}^{-1}$. This could be done empirically by computing $\hat{Z}^{-1}$ for a range of values of $c$, estimating $\mathrm{Var}(\hat{Z}^{-1})$ for each value of $c$, and optimizing over $c$ by a grid search or a one-dimensional numerical optimization method.

It is possible to obtain analytic results in the case where the posterior distribution is normal. This is of considerable interest as the posterior distribution is asymptotically normal in many common situations, including some where standard regularity conditions do not hold [13, 9, 32, 25]. In this case the THAMES has finite variance since the posterior density, and thus the product of the likelihood and the prior, is continuous with respect to θ and strictly positive everywhere.

We want to minimize the variance of the THAMES. Under our assumption of independence of the successive MCMC simulations, this variance simplifies to

$$\mathrm{Var}\big(\hat{Z}^{-1}\,\big|\,\mathcal{D}\big)=\frac{1}{T}\,\frac{1}{Z^2}\,\mathrm{SCV}(d,c). \tag{9}$$

Here SCV(d,c) denotes

$$\mathrm{SCV}(d,c)\equiv\frac{\mathrm{Var}_{\theta^{(1)}}\!\left[\dfrac{\mathbf{1}_A(\theta^{(1)})/V(A)}{L(\theta^{(1)})\,\pi(\theta^{(1)})}\,\middle|\,\mathcal{D}\right]}{E_{\theta^{(1)}}\!\left[\dfrac{\mathbf{1}_A(\theta^{(1)})/V(A)}{L(\theta^{(1)})\,\pi(\theta^{(1)})}\,\middle|\,\mathcal{D}\right]^2}, \tag{10}$$

the squared coefficient of variation of the first term of the THAMES. Since the variance is the product of $1/T$, $1/Z^2$ and $\mathrm{SCV}(d,c)$, minimizing $\mathrm{SCV}(d,c)$ with respect to $c$ is equivalent to minimizing the variance of the THAMES.

We derive a statement about the optimal choice of $c$ by assuming that the posterior covariance matrix $\Sigma$ and the posterior mean $m$ can be provided by a stochastic oracle. The THAMES can then be defined using

$$A_{\mathrm{or}}\equiv\big\{\theta:(\theta-m)^T\Sigma^{-1}(\theta-m)<c^2\big\}. \tag{11}$$

We will show that the radius $c$ that minimizes the variance of the THAMES depends on the dimension $d$, and is equal to $c_d=\sqrt{d+L_d}$, with $L_d$ being close to one for large $d$. Interestingly, in this case the SCV depends neither on the data, $\mathcal{D}$, nor on the number of samples from the posterior, $T$. Of course, this is rarely exactly the case in practice. However, plugging in consistent estimators of $(m,\Sigma)$ gives approximately the same results if the number of samples from the posterior is large enough, provided that a sample splitting procedure is used. This will be a consequence of Theorem 3.3. The sample splitting procedure that we suggest is described in Section 3.3. The proofs of these results are given in Supplement A [23]. Additional numerical results about the behaviour of the optimal radius are given in Supplement B [24].

Assumption 1. For the following theorems it is assumed that we can ignore serial correlation (i.e., we assume independence of the successive MCMC iterations) and that the posterior distribution is normal with mean $m\in\mathbb{R}^d$ and positive definite covariance matrix $\Sigma\in\mathcal{M}_{d\times d}(\mathbb{R})$. We further assume that the THAMES is defined on $A_{\mathrm{or}}$.

Theorem 3.1. There exists a unique radius $c_d\in(0,\infty)$ such that the ellipsoid $A_{\mathrm{or}}$ with radius $c_d$ minimizes the variance of the THAMES. This value $c_d$ does not depend on the posterior mean or covariance matrix. It satisfies $c_d=\sqrt{d+L_d}$, where the optimal shifting parameter $L_d\ge 0$ is a sequence for which $L_d/d\xrightarrow{d\to\infty}0$.

Remark 1. Theorem 3.1 ensures that the optimal radius $c_d$ is asymptotically equivalent to $\sqrt{d}$. In fact, our calculations suggest that $c_d=\sqrt{d+L_d}$ can be approximated by $\sqrt{d+1}$ (see Supplement B [24]).

Theorem 3.2. The following statements hold for the SCV:

  1. For any choice of the shifting parameter $L\in\mathbb{R}$ and for all $\varepsilon>0$,
  $$1-\varepsilon\;\le\;\frac{\mathrm{SCV}\big(d,\sqrt{d+L_d}\big)}{\sqrt{(d+2)\pi}/4}\;\le\;\frac{\mathrm{SCV}\big(d,\sqrt{d+L}\big)}{\sqrt{(d+2)\pi}/4}\;\le\;2+\varepsilon, \tag{12}$$
  for all but finitely many $d$. Thus choosing the radius $\sqrt{d+L}$ results in an SCV that is both asymptotically at most twice as large as the optimal SCV and of order $\sqrt{d}$.
  2. The following inequality for the SCV can be given for the choice of radius $c=\sqrt{d+1}$:
  $$0.63\,\sqrt{(d+2)\pi}/4-1\;\le\;\mathrm{SCV}\big(d,\sqrt{d+L_d}\big) \tag{13}$$
  $$\le\;\mathrm{SCV}\big(d,\sqrt{d+1}\big)\;\le\;1.05\cdot 2\,\sqrt{(d+2)\pi}/4-1. \tag{14}$$
  This inequality holds for all $d\ge 1$.

Remark 2. Statement 1 of Theorem 3.2 shows that $\mathrm{SCV}(d,c)$ increases at the rate $\sqrt{d}$ as $d\to\infty$, both for our choice $c=\sqrt{d+1}$ and for the optimal choice $c_d$. Further, any choice of the shifting parameter $L$ used to define the radius $\sqrt{d+L}$ is asymptotically at most twice as bad as any optimal solution in terms of the SCV. This suggests some robustness of our estimator with respect to the choice of $L$. For numerical results on the behaviour of the SCV for different values of $L$, we refer the reader to Supplement B [24].

One can also calculate the bias of the THAMES on the scale of the marginal likelihood by considering a second-order Taylor approximation of $x\mapsto 1/x$ around $E[\hat{Z}^{-1}]=Z^{-1}$, which gives $E[1/X]\approx 1/\mu+\mathrm{Var}(X)/\mu^3$, and then using Equation (9):

$$E[\hat{Z}]\approx Z+\mathrm{Var}\big(\hat{Z}^{-1}\big)\big/\big(Z^{-1}\big)^{3}=Z\,\big(1+\mathrm{SCV}(d,c)/T\big). \tag{15}$$

Considering Equation (15), the bias can be estimated by using the plug-in estimates $\widehat{\mathrm{SCV}}$ and $\hat{Z}^{-1}$. We also observe that the bias vanishes as $T$ increases.

Remark 3. Statement 2 of Theorem 3.2 gives a very rough theoretical guarantee: for any dimension $d\ge 1$, the SCV obtained by choosing our recommended radius, $\sqrt{d+1}$, and the SCV obtained by choosing the optimal radius, $c_d$, can both be bounded by an affine transform of $\sqrt{d+2}$. However, our calculations suggest that the SCV at the point $c=\sqrt{d+1}$ has asymptotically optimal performance (see Supplement B [24]).

So far, we have given results for the idealized situation where the posterior distribution is exactly normal. We now give a result for the more common and realistic situation where the posterior distribution is only asymptotically normal.

Theorem 3.3. Let $p_n(\theta\mid\mathcal{D}_n)$ be a sequence of posterior densities with data $\mathcal{D}_n$, posterior covariance matrix $\Sigma_n$, posterior mean $m_n$ and an SCV denoted by $\mathrm{SCV}_n$. Then, if

$$|\Sigma_n|^{1/2}\,p_n\big(\Sigma_n^{1/2}\theta+m_n\,\big|\,\mathcal{D}_n\big)\;\xrightarrow{n\to\infty}\;|\Sigma|^{1/2}\,p\big(\Sigma^{1/2}\theta+m\,\big|\,\mathcal{D}\big) \tag{16}$$

uniformly in $\theta$ on all compact subsets of $\mathbb{R}^d$, it follows that

$$\mathrm{SCV}_n(d,c)\;\xrightarrow{n\to\infty}\;\mathrm{SCV}(d,c) \tag{17}$$

uniformly in $c$ on all compact subsets of $(0,\infty)$. In particular, for any $b\ge c_d\ge a>0$,

$$c_d^n\in\operatorname*{arg\,min}_{c\in[a,b]}\mathrm{SCV}_n(d,c)\ \text{for all}\ n\quad\Longrightarrow\quad\lim_{n\to\infty}c_d^n=c_d. \tag{18}$$

Remark 4. We have already stated that the normal case is important because the posterior distribution is often asymptotically normal when the size of the data, n, is large. Theorem 3.3 assures us that our results still hold in this limiting case, under some assumptions:

If the convergence of the normalized posterior pdf is uniform in $\theta$ (Equation (16)), our statements about the limiting behaviour of the SCV (Theorem 3.2 and Remarks 2–3) still hold approximately when $n$ is large (Equation (17)). If additionally the optimal radii $c_d^n$ do not converge to zero or infinity, any result about $c_d$ (Theorem 3.1 and Remark 1) also holds approximately when $n$ is large (Equation (18)).

Let $H_0$ denote the Fisher information matrix. Reformulating Equation (16) by replacing $\Sigma_n$ by $\frac{1}{n}H_0^{-1}$, to which it is asymptotically equivalent, gives a statement that has been proven under a variety of assumptions, e.g. Miller [25, Theorem 4], except that in these results the type of convergence is usually not uniform convergence, but a weaker type of convergence, such as convergence in distribution or convergence in total variation.

Additional assumptions can be made about the pdfs of the sequence of distributions such that convergence in distribution implies uniform convergence of the pdfs. For example, if the pdfs are asymptotically equicontinuous and we have convergence in distribution, the convergence of the pdfs is uniform [38, Theorem 1].

Note that in this case there is no problem if the parameter space is constrained: uniform convergence of the pdfs implies that $A_n$ is a subset of the posterior support if $n$ is large enough. There are also no assumptions about $m_n,\Sigma_n$ other than that they converge to the moments of the posterior limit. In this sense, the estimators $(\hat{\mu},\hat{\Sigma})$ take the place of these constants in practice, and Theorem 3.3 holds even when we use these estimators.

Remark 5. Due to the assumption of normality, when choosing the optimal radius $c_d=\sqrt{d+L_d}$, the probability that a term of the THAMES in $\theta^{(t)}$ is not set to 0 is equal to

$$P\big(\theta^{(t)}\in A_{\mathrm{or}}\big)=P\Big(\big(\theta^{(t)}-m\big)^T\Sigma^{-1}\big(\theta^{(t)}-m\big)<d+L_d\Big)=\chi^2(d+L_d;\,d), \tag{19}$$

the CDF of the $\chi^2$-distribution with $d$ degrees of freedom evaluated at $d+L_d$. This approaches 50% due to Theorem 3.1. Thus the algorithm sets about 50% of the highest terms in Equation (7) to 0. This means that for a large number of samples $T$, and given the normality assumption, our algorithm is similar to the following method:

Instead of checking whether $\theta^{(t)}\in A_{\mathrm{or}}$ directly, one can set roughly the 50% highest terms of the THAMES, namely the terms not included in the Highest Posterior Density (HPD) region of size 50%, to 0.
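A sketch of this variant in R, where `log_post` is a hypothetical vector holding $\log\{L(\theta^{(t)})\pi(\theta^{(t)})\}$ for each draw and `log_V_A` the log-volume of the ellipsoid from Equation (6); the sum is computed by log-sum-exp to avoid overflow:

```r
# HPD-50% variant of Remark 5: drop the 50% of draws with the lowest
# unnormalized posterior density (equivalently, the highest inverse-density
# terms), then average as in Eq. (7). Returns the estimated log Z.
thames_hpd50_log_Z <- function(log_post, log_V_A) {
  keep <- log_post >= median(log_post)   # draws inside the 50% HPD region
  a <- -log_post[keep] - log_V_A         # log of each kept term, h = 1/V(A)
  m <- max(a)
  -(m + log(sum(exp(a - m))) - log(length(log_post)))  # divide by the full T
}
```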

Remark 6. It is assumed that the covariance matrix of the posterior distribution is positive definite. This assumption is necessary since otherwise a posterior density with respect to the Lebesgue measure on $\mathbb{R}^d$ would not exist. On the other hand, this assumption is not restrictive, since the same estimation procedure can be applied to a lower-dimensional subspace of $\mathbb{R}^d$ on which a density is defined.

We can illustrate the relationship between the THAMES and the harmonic mean estimator defined by Newton and Raftery [26] using the toy example from Figure 1. It was calculated using the same model as the one introduced in Section 4.1 with the dimension of the parameter space $d=2$, but by setting the data set to $\mathcal{D}_0$ to ensure stability of the estimator on the inverse likelihood scale.

Figure 1: Left: The THAMES calculated by choosing the radii $c=\sqrt{d+1}=\sqrt{3}$, $c=0.1\sqrt{3}$ and $c=50\sqrt{3}$ in the two-dimensional case, $d=2$, with the true value $1/Z$ (dotted line) and the harmonic mean estimator. Right: The posterior sample evaluated at the inverse of the unnormalized posterior density and the different ellipses used to define the THAMES. In this particular case the posterior covariance matrix is a scaled identity matrix, so the ellipses are spheres. The two samples occurring at points 644 and 7216 have a very low likelihood. They cause massive jumps in the harmonic mean estimator when the radius of $A$ is large, and are excluded when the radius is equal to $\sqrt{d+1}=\sqrt{3}$. One can choose a smaller radius (e.g., $c=0.1\sqrt{3}$), but then too much of the sample is excluded and convergence takes longer.

The pdf of the uniform distribution on the ellipsoid is essentially used as a rejection rule: values with a very low posterior density (and therefore a high inverse posterior density) are rejected, while high-density values are accepted. A balance between the volume of the ellipsoid and the percentage of the rejected posterior sample needs to be found to ensure optimal performance. The harmonic mean estimator does not have this rejection rule, so sample points with low posterior densities can lead to massive jumps.

3.3. THAMES algorithm

Below is an algorithm for the implementation of the THAMES. Procedures for sample splitting, as well as the truncated ellipsoid correction used in the case where the parameter space is constrained, have been included. These additions are described in the two subsections that follow.

We recommend these additions, but we have also found that in some cases they make almost no difference. For example, sample splitting does not appear to have an impact when the dimension of the parameter space, d, is small (Section 4.1), while the truncated ellipsoid correction is negligible when the posterior mean is not close to the edge of the posterior support (Section 4.4 and Section 4.3).

Algorithm 1: $\hat{Z}^{-1}$ calculation

Input: Data $\mathcal{D}$ and posterior samples $(\theta^{(i)})_{i\in\{1,\ldots,T\}}$.
Sample splitting: Calculate the empirical mean $\hat{\theta}$ and sample covariance matrix $\hat{\Sigma}$ from the first $T/2$ posterior samples $(\theta^{(i)})_{i\in\{1,\ldots,T/2\}}$.
Standardization: $\tilde{\theta}^{(i)}=\hat{\Sigma}^{-1/2}\big(\theta^{(i)}-\hat{\theta}\big)$ for $i\in\{T/2+1,\ldots,T\}$.
Truncation subset: $\mathcal{S}=\big\{i:\|\tilde{\theta}^{(i)}\|_2^2<d+1\big\}$.
Calculate THAMES estimator:
$$\hat{Z}^{-1}=\frac{1}{T/2}\sum_{i=T/2+1}^{T}\frac{h\big(\theta^{(i)}\big)}{L\big(\theta^{(i)}\big)\,\pi\big(\theta^{(i)}\big)},$$
 where $h(\theta^{(i)})=1/V(A)$ if $i\in\mathcal{S}$ and $0$ otherwise, with $V(A)=|\hat{\Sigma}|^{1/2}\,\pi^{d/2}\,(d+1)^{d/2}\big/\,\Gamma\big(\tfrac{d}{2}+1\big)$ and $A=\big\{\theta:(\theta-\hat{\theta})^T\hat{\Sigma}^{-1}(\theta-\hat{\theta})<d+1\big\}$.
if the posterior support $\mathrm{supp}(\theta\mid\mathcal{D})$ is constrained then
 Simulate the sample $\nu^{(1)},\ldots,\nu^{(N)}$ from the uniform distribution on $A$.
 Approximate the volume ratio $V\big(A\cap\mathrm{supp}(\theta\mid\mathcal{D})\big)/V(A)$ via the Monte Carlo estimator
$$\hat{R}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{\{\theta:\,\pi(\theta)L(\theta)>0\}}\big(\nu^{(i)}\big).$$
 Assign $\hat{Z}^{-1}\leftarrow\hat{R}^{-1}\hat{Z}^{-1}$.
end if
Output: THAMES estimator $\hat{Z}^{-1}$.
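A self-contained R sketch of the unconstrained part of Algorithm 1, taking a $T\times d$ matrix `theta` of posterior draws and a vector `log_post` of the corresponding values of $\log\{L(\theta^{(i)})\pi(\theta^{(i)})\}$ (the constrained-support correction is sketched in the corresponding subsection below):

```r
thames_log_Z <- function(theta, log_post) {
  theta <- as.matrix(theta)
  T_n <- nrow(theta); d <- ncol(theta)
  i1 <- 1:(T_n %/% 2)                  # first half: estimate the ellipsoid
  i2 <- (T_n %/% 2 + 1):T_n            # second half: evaluate the estimator

  theta_hat <- colMeans(theta[i1, , drop = FALSE])
  Sigma_hat <- cov(theta[i1, , drop = FALSE])

  # Truncation subset S: squared Mahalanobis distance below d + 1
  md2  <- mahalanobis(theta[i2, , drop = FALSE], theta_hat, Sigma_hat)
  keep <- md2 < d + 1

  # log V(A), Eq. (6) with c = sqrt(d + 1), on the log scale for stability
  log_V <- (d / 2) * log(d + 1) + (d / 2) * log(pi) +
    0.5 * as.numeric(determinant(Sigma_hat, logarithm = TRUE)$modulus) -
    lgamma(d / 2 + 1)

  # Eq. (7) over the second half, via log-sum-exp to avoid overflow
  a <- -log_post[i2][keep] - log_V
  m <- max(a)
  -(m + log(sum(exp(a - m))) - log(length(i2)))   # estimated log Z
}
```

Computing the truncated sum on the log scale avoids overflow when the log unnormalized posterior densities are large in magnitude, which is the typical case for real data sets.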

Sample splitting

The theoretical guarantees established in Section 3.2 operate under the assumption of an oracle ellipsoid $A_{\mathrm{or}}$. In particular, this means that the ellipsoid determining the THAMES estimator $\hat{Z}^{-1}$ is defined independently of $(\theta^{(i)})_{i\in\{1,\ldots,T\}}$. In practice, we find that estimating $A$ and $Z^{-1}$ simultaneously using the same posterior sample can induce bias in $\hat{Z}^{-1}$ when the parameter space is high-dimensional. As such, we implement a sample splitting procedure that involves estimating $A$ and $Z^{-1}$ using separate posterior draws. Specifically, we first estimate the posterior mean and covariance matrix via the empirical mean $\hat{\theta}$ and sample covariance $\hat{\Sigma}$ using the first $T/2$ posterior samples $(\theta^{(i)})_{i\in\{1,\ldots,T/2\}}$. Defining $A$ as in Equation (5) based on $\hat{\theta}$ and $\hat{\Sigma}$, we then calculate the THAMES estimator $\hat{Z}^{-1}$ using the last $T/2$ posterior samples $(\theta^{(i)})_{i\in\{T/2+1,\ldots,T\}}$. The same problem was noted by Gronau et al. [10] in their popular implementation of bridge sampling. For this reason, the bridgesampling package uses the same sample splitting procedure just described in its default setting.

Correcting for the presence of constrained parameters

Whenever the posterior support of the parameters is not $\mathbb{R}^d$, for example when the parameters are variances or probabilities, it is possible that our choice of $h$ in Equation (3), the pdf of the uniform distribution on $A$, is not correctly normalized. This is due to the fact that $A$ is not necessarily a subset of the posterior support, and thus $h$ is not a pdf over this space.

In this case, the expectation of the THAMES is distorted by a multiplicative constant:

$$E_{\theta}\big[\hat{Z}^{-1}\,\big|\,\mathcal{D}\big]=E_{\theta}\!\left[\frac{h(\theta)}{L(\theta)\,\pi(\theta)}\,\middle|\,\mathcal{D}\right]=Z^{-1}\,\frac{V\big(A\cap\mathrm{supp}(\theta\mid\mathcal{D})\big)}{V(A)}\equiv Z^{-1}R, \tag{20}$$

where $V\big(A\cap\mathrm{supp}(\theta\mid\mathcal{D})\big)$ denotes the volume of the intersection between $A$ and the posterior support. One way to deal with this problem is to transform the parameter space, e.g., by setting $\vartheta\equiv\log(\theta)$ if $\theta$ is a variance parameter. One can then continue with marginal likelihood estimation on $\vartheta$, using the transformed prior distribution. In this case, it is of course important to include the Jacobian of the transformation when computing the prior density. It should be noted that the default proposal in the bridgesampling package of Gronau et al. [10] uses this transformation, because it suffers from the same problem when the parameter space is constrained. This solution is also a viable option for the THAMES, since the transformation removes the need for any adjustments. However, it requires deciding on a viable transformation for each new type of constraint (e.g., simplex constraints, interval constraints, etc.), and the posterior behaviour of the transformed parameters may be hard to interpret. For this reason, we suggest a different correction.

Another way is to adjust for the bias by calculating the ratio of these volumes, $R$, using a simple Monte Carlo approximation: we simulate $\nu^{(1)},\ldots,\nu^{(N)}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{Unif}(A)$, $N\in\mathbb{N}$, and calculate

$$\hat{R}\equiv\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{\mathrm{supp}(\theta\mid\mathcal{D})}\big(\nu^{(i)}\big)=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{\{\theta:\,\pi(\theta)L(\theta)>0\}}\big(\nu^{(i)}\big). \tag{21}$$

Given $A$, this is an unbiased and consistent estimator of $R$ by the law of large numbers. The bias-adjusted THAMES is then $\hat{Z}^{-1}_{\mathrm{adj}}=\hat{R}^{-1}\hat{Z}^{-1}$.
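A sketch of this correction in R, assuming a hypothetical function `log_post_fun` that returns $\log\{\pi(\theta)L(\theta)\}$ and `-Inf` outside the posterior support; uniform draws on the ellipsoid $A$ are generated by rescaling a uniform point on the unit ball:

```r
# Monte Carlo estimate of the volume ratio R in Eq. (21).
# A uniform point on the unit ball is (g / ||g||) * U^(1/d) with g Gaussian;
# mapping it through sqrt(c2) * t(chol(Sigma_hat)) makes it uniform on A.
volume_ratio_hat <- function(N, theta_hat, Sigma_hat, c2, log_post_fun) {
  d <- length(theta_hat)
  U <- chol(Sigma_hat)                             # Sigma_hat = t(U) %*% U
  inside <- replicate(N, {
    g <- rnorm(d)
    u <- g / sqrt(sum(g^2)) * runif(1)^(1 / d)     # uniform draw on the unit ball
    nu <- theta_hat + sqrt(c2) * drop(t(U) %*% u)  # uniform draw on the ellipsoid A
    is.finite(log_post_fun(nu))                    # 1{pi(nu) L(nu) > 0}
  })
  mean(inside)  # R_hat; the adjusted estimator is Z_inv_hat / R_hat
}
```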

The problem of the parameter space being constrained is common not only for the THAMES, but for reciprocal importance sampling estimators in general. It has, for example, been addressed by Hajargasht and Woźniak [11] and Sims et al. [33]. Hajargasht and Woźniak [11] used variational Bayes techniques and showed that these ensure that the support of the chosen $h$ is a subset of the posterior support, under mild conditions. Sims et al. [33] used an ellipsoidal density truncated on a subset of the joint support, $\Theta_U\equiv\{\theta:\pi(\theta)L(\theta)>U\}$, where $U>0$. Since the support of $\pi(\theta)L(\theta)$ is equal to the posterior support, our truncation set is similar to the one chosen by Sims et al. [33], except that we set $U=0$.

The adjustment is usually very small. The problem arises only when the posterior mean is close enough to the edge of the parameter space. The edge of the parameter space often indicates a priori unlikely values, so it is also rare for the data to yield posterior parameters close to the edge. Thus the ratio between the volumes is close to one and the variance of $\hat{R}$ is small. In fact, the adjustment did not have any sizeable impact on any of the examples simulated in Section 4. This may not be the case, however, if the actual data-generating mechanism is very different from the model being considered; in that case the posterior mean can in practice be very close to the edge. We show one example of this in Supplement B [24].

In practice we have found that a small number of simulations, around $N=100$, is usually enough. Confidence intervals obtained from the fact that $\hat{R}$ is asymptotically normal can be used to check whether the variance of $\hat{R}$ is large; in that case $N$ should be increased to yield a more precise approximation. The computational cost of implementing this adjustment is typically small.

4. Examples

We now describe several simulated and real-data examples to assess the THAMES estimator. In Sections 4.1, 4.2, and 4.3, we consider three statistical models for which exact expressions of the marginal likelihood are available. This allows us to compare the THAMES estimates to the exact values for evaluation. In Section 4.4, we consider a real-data example with models for which, to our knowledge, no analytical expressions for the marginal likelihood are available, and where there is a need for reliable estimators. We compare our estimator to bridge sampling, which is more complicated than THAMES but is known to perform well [22, 10].

4.1. Multivariate Gaussian data

We first consider the case where data $Y_i$, $i=1,\ldots,n$, are drawn independently from a multivariate normal distribution:

$$Y_i\mid\mu\overset{\mathrm{iid}}{\sim}\mathrm{MVN}_d(\mu,\,I_d),\quad i=1,\ldots,n,$$

along with a prior distribution on the mean vector μ:

$$p(\mu)=\mathrm{MVN}_d(\mu;\,0_d,\,s_0 I_d),$$

with $s_0>0$. As shown in Supplement A [23], the posterior distribution of the mean vector $\mu$ given the data $\mathcal{D}=(y_1,\ldots,y_n)$ is given by:

$$p(\mu\mid\mathcal{D})=\mathrm{MVN}_d(\mu;\,m_n,\,s_n I_d), \tag{22}$$

where $m_n=n\bar{y}\big/(n+1/s_0)$, $\bar{y}=(1/n)\sum_{i=1}^{n}y_i$, and $s_n=1\big/(n+1/s_0)$.

Interestingly, while the observations $(Y_i)_i$ are independent given the vector $\mu$, they are not independent marginally, and the marginal likelihood does not take the form of a product over marginal terms in $i$. Conversely, thanks to the isotropic Gaussian prior distribution considered for $\mu$, in which the $(\mu_j)_j$ are all iid, not only are the vectors $(Y_{\cdot j})_j$ independent given $\mu$, they are also independent marginally. From this property, we prove in Proposition 2 of Supplement A [23] that the marginal likelihood of the model can be written analytically as

$$p(\mathcal{D})=\prod_{j=1}^{d}\mathrm{MVN}_n\big(y_{\cdot j};\,0_n,\,s_0 1_n 1_n^T+I_n\big), \tag{23}$$

where $y_{\cdot j}\in\mathbb{R}^n$ is the vector of all observations of variable $j$, so that $(y_{\cdot j})_i=y_{ij}$, and $1_n$ is the vector of ones in $\mathbb{R}^n$.
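Equation (23) can be evaluated without any multivariate normal routine, since the matrix determinant lemma gives $|s_0 1_n 1_n^T+I_n|=1+ns_0$ and the Woodbury identity gives the quadratic form directly. A minimal R sketch:

```r
# Exact log marginal likelihood from Eq. (23); Y is an n x d data matrix.
log_Z_exact <- function(Y, s0) {
  Y <- as.matrix(Y); n <- nrow(Y)
  sum(apply(Y, 2, function(yj) {
    # y^T (s0 * 1 1^T + I)^{-1} y via Woodbury; log-determinant is log(1 + n s0)
    quad <- sum(yj^2) - s0 * sum(yj)^2 / (1 + n * s0)
    -0.5 * (n * log(2 * pi) + log(1 + n * s0) + quad)
  }))
}

set.seed(1)
Y <- matrix(rnorm(20, mean = 2), 20, 1)  # n = 20, d = 1, as in the text
log_Z_exact(Y, s0 = 1)
```

This exact value is the benchmark against which the THAMES estimates are compared in Figures 2 and 3.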

Assessing the precision of the THAMES estimator as a function of T

We first considered the univariate case $d=1$. Thus, we simulated a unique sample of size $n=20$ with $\mu=2$. Moreover, we set $s_0=1$, for illustration; other choices of $s_0$ led to similar conclusions regarding the quality of the estimation. Figure 2 shows the THAMES estimates of the log marginal likelihood for $T=5,1005,2005,\ldots,9005$ samples from the posterior distribution (Equation (22)). Confidence intervals, as well as the exact value of the log marginal likelihood computed using Equation (23), are also reported. It can be seen that the estimate converges to the correct value and that the confidence intervals contain the true value in all cases, even for $T=5$.

Figure 2: Estimation of the log marginal likelihood using THAMES for a unique univariate Gaussian sample with $n=20$, as a function of $T$, the number of (cumulative) samples from the posterior distribution. The black dots indicate the values of the THAMES estimator of the log marginal likelihood. The vertical lines represent 95% confidence intervals, and the dashed blue line represents the exact value computed using Equation (23).

Assessing the precision of the THAMES estimator as a function of d

For this second set of experiments, we considered different values of $d$, and aimed to test the robustness of the THAMES approach on multiple data sets with increasing dimensionality. Thus, for each $d$, we generated 50 different data sets of size $n=20$ using the multivariate Gaussian model. In practice, we set the true value of every component of $\mu$ to 2. Again, the prior parameter $s_0$ was set to 1, and similar conclusions were drawn for other values. Moreover, the value of $T$ was set to 10,000 for all the experiments.

We also used this example to assess the sample splitting procedure for the posterior samples, as proposed in Section 3.3. The results are given in Figure 3. In the left panel, where no sample splitting of the posterior samples is used to compute THAMES, we observe that a bias appears as the dimensionality of the model increases, and the log marginal likelihood tends to be slightly underestimated. As illustrated by the right panel, this bias is primarily related to the estimation of the posterior covariance matrix, and not to the THAMES estimation itself. Indeed, if the exact expression of the posterior covariance matrix given in Equation (22) is used to compute THAMES, then while the variance of the estimator increases with $d$, we do not observe any bias. Crucially, if sample splitting of the posterior samples is employed to compute THAMES (middle panel), then again we do not observe any bias.

Figure 3: Difference between the estimated log marginal likelihood using THAMES and the true log marginal likelihood for a multivariate Gaussian model with $n=20$ and $T=10000$. The procedure is repeated on 50 different data sets for each $d$. Left: the THAMES approach with no sample splitting of the posterior samples. Middle: the THAMES approach with sample splitting of the posterior samples. Right: the case where the exact expression of the posterior covariance matrix given in Equation (22) is used to compute THAMES.

Overall, we found that the sample splitting procedure was not necessary to compute THAMES for low values of $d$; the estimated values are then particularly close to the exact ones. However, for large values of $d$, we recommend using the sample splitting procedure to remove the bias.

4.2. Bayesian Regression

We consider a data set $(x_i,Y_i)$, $i=1,\ldots,n$, used to train a linear regression model of the form

$$Y_i\mid x_i,\beta,\sigma^2\sim\mathcal{N}\big(x_i\beta,\,\sigma^2\big),\quad i=1,\ldots,n.$$

In this section, the goal is to assess the quality of our proposed estimator. As such, we choose a prior on $(\beta,\sigma^2)$ for which an exact expression for the marginal likelihood, $Z$, exists. We compare our estimator, the THAMES, to the bridge sampling estimator implemented in Gronau et al. [10] and to a simple Monte Carlo (MC) estimator, calculated by averaging the likelihood over parameter values simulated from the prior.

Denoting by $Y\in\mathbb{R}^n$ the vector of target variables $Y_i$, and by $X\in\mathcal{M}_{n\times(d-1)}(\mathbb{R})$ the design matrix in which the input vectors $x_i\in\mathbb{R}^{d-1}$ are stacked as row vectors, the linear regression model becomes:

$$Y\mid X,\beta,\sigma^2\sim\mathrm{MVN}_n\big(X\beta,\,\sigma^2 I_n\big).$$

We rely on a centered isotropic Gaussian prior distribution for the regression vector $\beta$, and an inverse-gamma prior for the variance $\sigma^2$:

$$p(\beta\mid\sigma^2)=\mathrm{MVN}_{d-1}\big(\beta;\,0_{d-1},\,g\sigma^2(X^TX)^{-1}\big),\qquad p(\sigma^2)=\mathrm{InvGamma}\big(\sigma^2;\,\tfrac{1}{2}\nu_0,\,\tfrac{1}{2}\sigma_0^2\nu_0\big),$$

with $g,\sigma_0^2,\nu_0>0$. Introduced by Zellner [39, 40], this framework offers a conjugate prior with the particularly attractive property of scale-invariance with respect to the regressors [14]. The posterior distribution of $(\beta,\sigma^2)$, given the training data set $\mathcal{D}=\big((x_1,y_1),\ldots,(x_n,y_n)\big)$, is then tractable:

$$p(\beta\mid\sigma^2,\mathcal{D})=\mathrm{MVN}_{d-1}\Big(\beta;\,\tfrac{g}{g+1}m_n,\,\tfrac{g}{g+1}\sigma^2(X^TX)^{-1}\Big),$$
$$p(\sigma^2\mid\mathcal{D})=\mathrm{InvGamma}\Big(\sigma^2;\,\tfrac{1}{2}(\nu_0+n),\,\tfrac{1}{2}\big(\nu_0\sigma_0^2+s_n\big)\Big),$$

with $s_n=y^Ty-\tfrac{g}{g+1}\,y^TX(X^TX)^{-1}X^Ty$ and $m_n=(X^TX)^{-1}X^Ty$, where $y\in\mathbb{R}^n$ is the observed vector of target variables associated with $Y$. Moreover, the marginal likelihood also has an analytical expression:

$$p(y\mid X)=(g+1)^{-(d-1)/2}\,\pi^{-n/2}\,\frac{\Gamma\big(\tfrac{1}{2}(\nu_0+n)\big)}{\Gamma\big(\tfrac{1}{2}\nu_0\big)}\,\frac{\big(\nu_0\sigma_0^2\big)^{\nu_0/2}}{\big(\nu_0\sigma_0^2+s_n\big)^{(\nu_0+n)/2}}.$$

Proofs of the exact expressions for the posterior and the marginal likelihood are given in Hoff [14, Chapter 9]. The data for this example are described by Hastie et al. [12] and come from a study by Stamey et al. [36], who examined the correlation between the level of prostate-specific antigen (lpsa) and eight clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The target variable is the level of prostate-specific antigen (lpsa).
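The analytic marginal likelihood above is easily evaluated on the log scale; a minimal R sketch that mirrors the displayed expression, with `p = d - 1` regressors:

```r
log_Z_zellner <- function(y, X, g, nu0, sigma02) {
  n <- length(y); p <- ncol(X)                 # p = d - 1 regressors
  Xty <- crossprod(X, y)                       # X^T y
  s_n <- sum(y^2) -
    (g / (g + 1)) * drop(crossprod(Xty, solve(crossprod(X), Xty)))
  -(p / 2) * log(g + 1) - (n / 2) * log(pi) +
    lgamma((nu0 + n) / 2) - lgamma(nu0 / 2) +
    (nu0 / 2) * log(nu0 * sigma02) -
    ((nu0 + n) / 2) * log(nu0 * sigma02 + s_n)
}
```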

The choice of the hyperparameter $g$ is a topic of much discussion [28, 6]. In Porwal and Raftery [28], $g=n$ showed good performance compared to a variety of alternatives, albeit in a slightly different setting where the prior on $\sigma^2$ is improper. For this reason, we chose $g=n$. We chose $(\nu_0,\sigma_0^2)=(4,1)$ for the other hyperparameters; other choices of $(g,\nu_0,\sigma_0^2)$ led to similar conclusions regarding the quality of the estimation.

Seven different regression models, $M_2,M_3,\ldots,M_8$, each with a different number of selected variables, ranging from 2 to 8, are considered for illustration. The variables are added in the order given above. Thus, $M_2$ includes the predictor variables lcavol and lweight, while model $M_3$ uses the variables lcavol, lweight, and age for prediction. Finally, model $M_8$ takes all 8 input variables into account. Figure 4 shows the estimators of the log marginal likelihood for different numbers of samples from the posterior distribution of $(\beta,\sigma^2)$, for the different models, as well as the approximate confidence intervals.

Figure 4: Log marginal likelihood (dotted line) and its estimators (dots) for the prostate data set. The approximate confidence intervals of the estimators are also indicated. Bridge sampling and THAMES are on point, while the simple Monte Carlo estimator does not seem to have converged.

Sample splitting was used, and there was no noticeable bias in the results, even though we did not correct the THAMES for the bounded parameter $\sigma^2$, because the posterior mode of this parameter is far away from 0. We also calculated the bias correction from Equation (15), which had no numerical impact, even for a posterior sample size as small as $T=50$.

While the simple Monte Carlo estimator did not converge, the bridge sampling estimator and the THAMES behaved very similarly. Indeed, both estimators converged rapidly to the correct value, and the intervals covered the correct values in most cases, even when the number of samples used was small, for all models investigated. Figure 5 zooms in on the estimates produced by these methods on a finer scale for the full model, in which all eight clinical measures are included. Notably, the confidence intervals obtained by the THAMES were more conservative and wider than those obtained for the bridge sampling estimator, while the estimators themselves converged at a similar speed and in a similar manner.

Figure 5: Log marginal likelihood (dotted line) and its estimators (dots) for the prostate data set with the full model (i.e., all eight clinical measures are selected). The approximate confidence intervals of the estimators are also indicated. The confidence intervals of the THAMES are more conservative, while the ones obtained from bridge sampling are narrower.

For all models, these two estimators are particularly precise with only 1000 samples from the posterior. While the main goal of this section is to illustrate the precision of our estimation strategy for a series of models, we can also report that the model with the highest marginal likelihood, among those considered for this data set, is model $M_2$. In other words, the variables lcavol and lweight are seen as key for the prediction of the level of prostate-specific antigen.

4.3. Dirichlet-multinomial model

Extensions of the Dirichlet-multinomial model are widely used in the context of topic modelling, see, e.g., Blei et al. [2]. The expression for the marginal likelihood in this model is known, as in the previous two sections. This allows us to assess the performance of our estimator in another simulation study, in a non-Gaussian context.

A simulation study in this setting is useful for two reasons: First, this is a high-dimensional setting in which the posterior distribution of the parameters is highly non-Gaussian. In fact, the parameter space is bounded. This allows us to assess how well the THAMES performs in a very different setting, and also how much of an impact the correction for a bounded parameter space from Section 3.3 has. We check this numerically in Supplement B [24].

Second, there do exist similar models to this one for which the marginal likelihood is not tractable, e.g. Blei and Lafferty [1]. These models are therefore a possible application of the THAMES. The simulation study might give an idea of how well the THAMES would perform in these applications.

The Dirichlet-multinomial model is defined as follows. Each data point $Y_i\in\{0,\ldots,l\}^K$ is drawn from a multinomial distribution given a Dirichlet-distributed random variable $\mu$:

$$\mu\sim\mathrm{Dirichlet}(\mu;\,a_0,\ldots,a_0),\qquad Y_i\mid\mu\overset{\mathrm{i.i.d.}}{\sim}\mathrm{Multinomial}(l,\mu),\quad i=1,\ldots,n.$$

Here, $\mu$ is positive and $K$-dimensional with components summing to 1. The covariance matrix of $\mu$ is thus necessarily singular. As noted in Remark 6, the THAMES needs to be used on posterior simulations from the subspace of $\mathbb{R}^K$ on which a density is defined. In this case, this is $\mathbb{R}^{K-1}\equiv\mathbb{R}^d$. The prior density is thus

$$\pi(\mu_1,\ldots,\mu_d)=\mathrm{Dirichlet}\Big(\mu_1,\ldots,\mu_d,\,1-\sum_{j=1}^{d}\mu_j;\,a_0,\ldots,a_0\Big).$$

The posterior support is $\big\{\mu\in\mathbb{R}^d:\sum_{j=1}^{d}\mu_j<1,\ \mu_1,\ldots,\mu_d>0\big\}$. The posterior distribution given the data $\mathcal{D}=(y_1,\ldots,y_n)$ is tractable:

$$p(\mu_1,\ldots,\mu_d\mid\mathcal{D})=\mathrm{Dirichlet}\Big(\mu_1,\ldots,\mu_d,\,1-\sum_{j=1}^{d}\mu_j;\,\alpha_1,\ldots,\alpha_K\Big),$$

with $\alpha_j=a_0+\sum_{i=1}^{n}y_{ij}$. The marginal likelihood is thus also tractable, using Bayes' theorem.

Results

The marginal likelihood was estimated in the setting $(n,l,T,a_0)=(400,150,10000,1)$, with $d$ varying over 1, 20, 50 and 100. The quantities $n$ and $l$ were intentionally chosen to be large, since this model has very high-dimensional applications. For example, Blei et al. [2] used a data set with 8000 documents and $n=15{,}818$ words, and used up to $K=200$ different topics.

As mentioned, an alternative to the correction proposed in Section 3.3 is to reparametrize μ such that the support of the parameter is unconstrained. We did this by setting

$$\big(\mu_1^{(t)},\ldots,\mu_d^{(t)}\big)\equiv\frac{\big(\exp\vartheta_1^{(t)},\ldots,\exp\vartheta_d^{(t)}\big)}{\exp\big(-\sum_{k=1}^{d}\vartheta_k^{(t)}\big)+\sum_{k=1}^{d}\exp\vartheta_k^{(t)}}=\mathrm{softmax}\big(\vartheta^{(t)}\big),\quad t=1,\ldots,T.$$

We are using a bijective version of the softmax function (we take the first $d$ elements of the softmax applied to a parameter vector constrained to sum to 0, which puts the Dirichlet sample in one-to-one correspondence with that parameter), together with the induced prior $\pi_2(\vartheta)\equiv\pi\big(\mathrm{softmax}(\vartheta)\big)\,\big|\mathrm{Jac}_{\mathrm{softmax}}(\vartheta)\big|$. We stress that this procedure is not necessary to calculate the THAMES, since the THAMES can be calculated on any parameter space. It is, however, necessary for the bridge sampling estimator implemented in Gronau et al. [10].
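A sketch of this map and the induced log prior in R, evaluating the Jacobian numerically rather than in closed form (this assumes the numDeriv package; `log_prior_mu` is a hypothetical function returning the Dirichlet log density of the full vector $(\mu,1-\sum_j\mu_j)$):

```r
library(numDeriv)  # assumed available, for numerical Jacobians

softmax_bij <- function(v) {      # v in R^d; the full K-vector sums to 0
  e <- exp(v)
  e / (exp(-sum(v)) + sum(e))     # first d coordinates of mu
}

to_v <- function(mu) {            # inverse map, applied to each posterior draw
  mu_K <- c(mu, 1 - sum(mu))      # recover the K-th coordinate
  (log(mu_K) - mean(log(mu_K)))[seq_along(mu)]
}

log_prior_induced <- function(v, log_prior_mu) {
  mu <- softmax_bij(v)
  J  <- jacobian(softmax_bij, v)  # d x d Jacobian, computed numerically
  log_prior_mu(mu) +
    as.numeric(determinant(J, logarithm = TRUE)$modulus)  # + log |det Jacobian|
}
```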

Table 1 shows the results from calculating the THAMES and the bridge sampling estimator on $\vartheta^{(1)},\ldots,\vartheta^{(T)}$ (see Footnote 2). A fixed parameter $\mu=(1/K,\ldots,1/K)$ was used, and 50 different samples were generated using the parameters $l$ and $\mu$. The MC estimator was also computed for comparison.

Table 1: Average CPU times (in seconds per 10,000 posterior draws), as well as mean absolute errors and standard deviations, for bridge sampling, the THAMES and the naive Monte Carlo (MC) estimator (errors of the latter were rounded to 1 decimal place). Estimates using MC are quickest to compute, but also the least precise. The THAMES is much faster than bridge sampling, although point estimates from the latter are more accurate.

            d = 1                          d = 20
         MAE      SD         Time       MAE       SD        Time
Bridge   0.0001   0.0002     6.0026     0.0019    0.0024    36.9822
MC       0.093    0.1145     0.0020     6903.4    942.444   0.0008
THAMES   0.0064   0.0087     0.0158     0.0197    0.025     0.2486

            d = 50                         d = 100
         MAE      SD         Time       MAE       SD        Time
Bridge   0.0037   0.0046     58.9884    0.0086    0.0108    116.5902
MC       13879.1  1231.7635  0.0004     18418.1   844.7528  0.0046
THAMES   0.0315   0.039      0.3094     0.0473    0.0617    1.0532

Both bridge sampling and the THAMES outperformed the MC estimator. Additionally, while the bridge sampling estimator performed better in terms of mean absolute error, the THAMES is not only easier but also quicker to compute, with the difference in computation time growing as the dimension of the parameter space increases. This is likely because the THAMES does not require evaluations of the likelihood beyond the precomputed likelihood values of the posterior sample, so its computation time grows much more slowly with increasing $d$, whereas bridge sampling does require additional evaluations, which take up an increasing amount of computation time. The average computation time of the bridge sampling estimator is about 361 times that of the THAMES for $d=1$, and 118 times for $d=100$. However, we would like to emphasize that, in our opinion, the real strength of our estimator lies in the fact that it is not only quick, but also easy to implement.

4.4. Mixed effects model

Netherlands schools data

To demonstrate the performance of THAMES on a random effects model, we consider the Netherlands (NL) schools dataset of Snijders and Bosker [35]. For our purposes, the data consist of language test scores of 2,287 eighth-grade pupils from 133 classes (in 131 schools) in the Netherlands. We denote by $y_{ij}\in\mathbb{R}$ the language test score of pupil $i$ in class $j$, where $j\in\{1,\ldots,J\}$ with $J=133$, and $i\in\{1,\ldots,n_j\}$ with $n_j$ the size of class $j$. Let $n=\sum_{j=1}^{J}n_j=2{,}287$ denote the full sample size.

We aim to determine if there is clustering of language test scores by class, with some classes performing significantly better than others on average. To do this, we fit both a simple mean model (which treats test scores of students in the same class as independent) and a random intercept model (which accounts for correlation of test scores within each class) to the data. The former (null) model $H_0$ posits that all classes perform the same, on average, while the latter (alternative) model $H_1$ allows for variation in performance at the class level. We estimate the log marginal likelihoods for the two models, $\ell_0(y)$ and $\ell_1(y)$, respectively, using the THAMES. For comparison, we also compute estimates using bridge sampling [10] and a simple Monte Carlo (MC) estimator that averages the likelihood against draws from the prior. With estimates of $\ell_0(y)$ and $\ell_1(y)$, we estimate the log Bayes factor $b_{01}$ to conduct a Bayesian hypothesis test of $H_0$ versus $H_1$. Note that posterior simulation and marginal likelihood calculation are not analytically tractable for this model. As such, the use of approximate posterior sampling (e.g., via MCMC) and marginal likelihood estimation (e.g., via the THAMES) is required.

Linear model (LM)

We first consider a simple mean model (denoted LM), which posits that

$$y_{ij}=\mu+\varepsilon_{ij},\qquad j\in\{1,\ldots,J\},\quad i\in\{1,\ldots,n_j\},\quad \textstyle\sum_j n_j=n,$$
$$\varepsilon_{ij}\overset{\mathrm{iid}}{\sim}N\big(0,\sigma_\varepsilon^2\big),$$
$$\mu\sim N\big(\hat{\mu},\hat{\sigma}_\mu^2\big),$$
$$\sigma_\varepsilon^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon\big).$$

The fixed hyperparameters $(\hat{\mu},\hat{\sigma}_\mu^2,\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon)$ are specified so as to ensure that the prior distribution is dispersed relative to the likelihood, but on the same scale, as

$$\hat{\mu}=\mathrm{mean}(y_{ij})=40.93,\qquad \hat{\sigma}_\mu=\sqrt{2}\,\mathrm{sd}(y_{ij})=12.73,$$
$$\hat{\nu}_\varepsilon=0.5,\qquad \hat{\beta}_\varepsilon=0.5\,\mathrm{var}(y_{ij})=40.53.$$

The hyperparameters $(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon)$ are chosen so that the prior mean of the precision $1/\sigma_\varepsilon^2$ equals $1/\mathrm{var}(y_{ij})$. The set of parameters to be estimated in this model, $(\mu,\sigma_\varepsilon^2)$, has dimension $d=2$. As we are not using a conjugate prior for the linear model, the marginal likelihood does not admit an analytic expression in this case.

Full linear mixed model (full LMM)

We consider the random intercept model (denoted full LMM):

$$y_{ij}=\mu+\alpha_j+\varepsilon_{ij},$$
$$\varepsilon_{ij}\overset{\mathrm{iid}}{\sim}N\big(0,\sigma_\varepsilon^2\big),\qquad \alpha_j\overset{\mathrm{iid}}{\sim}N\big(0,\sigma_\alpha^2\big),$$
$$\mu\sim N\big(\hat{\mu},\hat{\sigma}_\mu^2\big),$$
$$\sigma_\varepsilon^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon\big),\qquad \sigma_\alpha^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\alpha,\hat{\beta}_\alpha\big).$$

Here $(\hat{\mu},\hat{\sigma}_\mu^2,\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon)$ are as above, and we specify $\hat{\nu}_\alpha=0.5$ and $\hat{\beta}_\alpha=0.5\,\mathrm{var}(\hat{\mu}_j)=13.77$, where $\hat{\mu}_j=\frac{1}{n_j}\sum_{i=1}^{n_j}y_{ij}$ is the sample mean for class $j\in\{1,\ldots,J\}$. The hyperparameters $(\hat{\nu}_\alpha,\hat{\beta}_\alpha)$ are chosen so that the prior mean of the precision $1/\sigma_\alpha^2$ equals $1/\mathrm{var}(\hat{\mu}_j)$. The set of parameters to be estimated in this model, $(\mu,\sigma_\varepsilon^2,\sigma_\alpha^2,\alpha)$, has dimension $d=136$.

Reduced linear mixed model (reduced LMM)

Note that the intercept parameters of the full LMM are not identifiable, as there is give-and-take between estimating the grand mean $\mu$ and the random intercepts $\alpha_j$. By absorbing $\alpha_j$ into the error term $\varepsilon_{ij}$, we can specify an equivalent model (having the same marginal likelihood) with $d=3$ identifiable parameters $(\mu,\sigma_\varepsilon^2,\sigma_\alpha^2)$. Mathematically, this amounts to marginalizing the $\alpha_j$'s out of the model. The model (which we call reduced LMM) is given by

$$y_{ij}=\mu+\varepsilon_{ij},$$
$$\varepsilon_{ij}\sim N\big(0,\sigma_\varepsilon^2+\sigma_\alpha^2\big),$$
$$\mathrm{Cov}\big(\varepsilon_{ij},\varepsilon_{i'j}\big)=\sigma_\alpha^2,\quad i,i'\in\{1,\ldots,n_j\},\ i\ne i',$$
$$\mathrm{Cov}\big(\varepsilon_{ij},\varepsilon_{i'j'}\big)=0,\quad j\ne j',$$
$$\mu\sim N\big(\hat{\mu},\hat{\sigma}_\mu^2\big),$$
$$\sigma_\varepsilon^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon\big),\qquad \sigma_\alpha^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\alpha,\hat{\beta}_\alpha\big).$$

Here $(\hat{\mu},\hat{\sigma}_\mu^2,\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon,\hat{\nu}_\alpha,\hat{\beta}_\alpha)$ are as above.

Results

Figure 6 shows the log marginal likelihood of the NL schools data for each model, computed using the THAMES, bridge sampling, and simple Monte Carlo estimators, with approximate 95% confidence intervals, as a function of the number of posterior MCMC or prior MC draws. Bridge sampling is a popular state-of-the-art method for estimating log marginal likelihoods from posterior MCMC samples, which is substantially more complicated computationally than the THAMES. Posterior MCMC sampling is carried out in R using Stan [29, 37]. We use values of the sample size $T$ evenly spaced between 1,000 and 20,000. For each $T$, we run 4 chains in parallel for $T/2$ iterations and remove the first $T/4$ as burn-in, yielding $T/4$ MCMC samples from each of the 4 chains, which are used to compute the THAMES and bridge sampling estimates.

Figure 6: Log marginal likelihood estimates for models fitted to the NL schools data.

THAMES provides consistent estimates of the log marginal likelihood, with greater precision as the posterior sample size grows. As we would expect, THAMES converges much faster for the LM (with $d=2$) and the reduced LMM (with $d=3$) than for the full LMM (with $d=136$), although the estimates for the reduced and full LMM converge to the same value. While the posterior support of this model is constrained due to the variance parameters $(\sigma_\varepsilon^2,\sigma_\alpha^2)$, we found that the truncation correction defined in Section 3.3 had no impact on the results. For a given posterior sample size, we find that bridge sampling generally produces more precise estimates than THAMES. However, THAMES has the advantage of being much simpler to implement and more computationally efficient in practice. On average over the samples in Figure 6, bridge sampling required 6.4 times as much compute time as THAMES for the full LMM, 556.5 times as much for the reduced LMM, and 26.8 times as much for the LM when estimated from the same number of posterior draws, as reported in Table 2 (see Footnote 3). The MC estimator, while fast and theoretically unbiased and consistent, suffers from substantial variance. In the left panel of Figure 6, the MC estimates are not shown, as they lie outside the range of the plot.

Using the THAMES estimates of the log marginal likelihoods for the LM, $\hat{\ell}_0(y)$, and the (reduced) LMM, $\hat{\ell}_1(y)$, with 20,000 posterior draws, the log Bayes factor $b_{01}$ is estimated as

$$\hat{b}_{01}=\hat{\ell}_0(y)-\hat{\ell}_1(y)=-8278.842+8136.561=-142.281,$$

indicating decisive evidence in favor of the random intercept model [18].

5. Discussion

We have proposed an estimator of the reciprocal of the marginal likelihood, called THAMES, which is simple to compute, unbiased, consistent, has finite variance and is asymptotically normal, with available confidence intervals. It is a version of reciprocal importance sampling. The estimator has one user-specified control parameter, and we have derived an optimal value for this in the situation where the posterior distribution is normal, which is of great interest because posterior distributions are asymptotically normal in many situations. We have carried out several numerical experiments in which the estimator performs well.

A similar proposal was made independently by McEwen et al. [21] under the name “learnt harmonic mean estimator”, where a variety of different sample models were suggested to work in conjunction. One of these models, the “hypersphere”, corresponds to the THAMES, the difference being that no theoretical results were given for the optimal control parameter, $c$. Instead, $c$ was optimized numerically as the minimum of the second harmonic moment, via the Brent hybrid root-finding algorithm. In the only high-dimensional application, which was in fact a Gaussian posterior, $c$ was not optimized, and it was noted that “alternative more effective target models can be developed that better scale to higher-dimensional settings”. We believe that the THAMES, with the suggested optimal control parameter, is such a model.

The THAMES relies on estimating the posterior covariance matrix and mean. In our experience it is important that the estimator chosen for the covariance matrix be accurate for estimating each matrix entry. Elementwise accuracy appears to be important because the covariance matrix is used to precisely define a quadratic inequality. For example, using a shrinkage estimator for the covariance matrix, which can produce large errors in a small proportion of its elements, has in our experience degraded the performance of the THAMES in some situations.

One possible alternative to covariance matrix estimation would be to select a minimum-volume covering ellipse which includes a certain percentage of those points of the posterior sample which have the largest value with respect to the (unnormalized) posterior density evaluated at those points. This would ensure that an HPD-region is well approximated, independent of the underlying posterior distribution. Determining a minimum-volume covering ellipse given a set of points can be difficult computationally, but this problem has been addressed in the literature in different settings and could possibly be adapted to the THAMES.

Supplementary Material

Supplement A

Supplement A: Proofs and Calculations. We prove the analytic results from Section 3 and derive the exact expression of the posterior and marginal density used for the multinomial likelihood in Section 4.

Supplement B

Supplement B: Additional Simulations. We give additional numerical results about the approximate behaviour of the THAMES in the normal case, as well as in the case where the posterior support is constrained.

Table 2: Average CPU times (in seconds per 1,000 posterior draws) to produce the estimates in Figure 6. The THAMES is faster than bridge sampling. Both take more CPU time for the same number of iterations than the naive Monte Carlo (MC) estimator, even though the latter is far less precise (see Figure 6).

         Full LMM   Reduced LMM   LM
Bridge   0.3815     1.6482        0.0723
MC       0.0002     0.0002        0.0002
THAMES   0.0581     0.0030        0.0027

Acknowledgments

The authors would like to thank the anonymous referees, an Associate Editor and the Editor for their constructive comments. The authors would further like to thank Christoph Richard from the Friedrich-Alexander-Universität Erlangen-Nürnberg for his helpful comments. Their comments dramatically improved the content and quality of this paper.

Funding

Irons’s research was supported by a Shanahan Endowment Fellowship and a Eunice Kennedy Shriver National Institute of Child Health and Human Development training grant, T32 HD101442-01, to the Center for Studies in Demography & Ecology at the University of Washington. Raftery’s research was supported by NIH grant R01 HD070936 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), by the Fondation des Sciences Mathématiques de Paris (FSMP), and by Université Paris-Cité.

Footnotes

1. Recall that the unstable harmonic mean estimator described by [26] was quite different, not being truncated, and being a harmonic mean of the likelihoods rather than of the unnormalized posterior density values.

2. Computations were performed on an Intel(R) Core(TM) i7-7700HQ CPU at 2.80GHz with 16 GB RAM.

3. Computations were performed on an Apple M1 chip with a 3.20GHz processor and 16 GB RAM.

References

  • [1] Blei DM and Lafferty JD (2007). “A correlated topic model of Science.” Annals of Applied Statistics, 1(1): 17–35.
  • [2] Blei DM, Ng AY, and Jordan MI (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3: 993–1022.
  • [3] Chib S (1995). “Marginal likelihood from the Gibbs output.” Journal of the American Statistical Association, 90: 1313–1321.
  • [4] DiCiccio TJ, Kass RE, Raftery AE, and Wasserman L (1997). “Computing Bayes factors by combining simulation and asymptotic approximations.” Journal of the American Statistical Association, 92: 903–915.
  • [5] Durmus A, Moulines E, and Pereyra M (2018). “Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau.” SIAM Journal on Imaging Sciences, 11: 473–506.
  • [6] Fernández C, Ley E, and Steel MF (2001). “Benchmark priors for Bayesian model averaging.” Journal of Econometrics, 100(2): 381–427. URL https://www.sciencedirect.com/science/article/pii/S0304407600000762
  • [7] Frühwirth-Schnatter S (2004). “Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques.” Econometrics Journal, 7: 143–167.
  • [8] Gelfand AE and Dey DK (1994). “Bayesian model choice: asymptotics and exact calculations.” Journal of the Royal Statistical Society: Series B (Methodological), 56: 501–514.
  • [9] Ghosal S (2000). “Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity.” Journal of Multivariate Analysis, 74: 49–68.
  • [10] Gronau QF, Singmann H, and Wagenmakers E-J (2020). “bridgesampling: An R Package for Estimating Normalizing Constants.” Journal of Statistical Software, 92(10): 1–29.
  • [11] Hajargasht G and Woźniak T (2018). “Accurate Computation of Marginal Data Densities Using Variational Bayes.” arXiv preprint. https://arxiv.org/pdf/1805.10036.pdf
  • [12] Hastie T, Tibshirani R, and Friedman JH (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
  • [13] Heyde CC and Johnstone IM (1979). “On asymptotic posterior normality for stochastic processes.” Journal of the Royal Statistical Society: Series B (Methodological), 41: 184–189.
  • [14] Hoff P (2009). A First Course in Bayesian Statistical Methods. Springer Texts in Statistics. Springer, New York. URL https://books.google.de/books?id=DykcMwEACAAJ
  • [15] Irons NJ, Perrot-Dockès M, and Metodiev M (2023). “thames: Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator.” R package version 0.1.0.
  • [16] Jameson GJO (2015). “A simple proof of Stirling’s formula for the gamma function.” The Mathematical Gazette, 99(544): 68–74. URL http://www.jstor.org/stable/24496904
  • [17] Jeffreys H (1961). Theory of Probability. Oxford, U.K.: Oxford University Press, 3rd edition.
  • [18] Kass RE and Raftery AE (1995). “Bayes factors.” Journal of the American Statistical Association, 90(430): 773–795.
  • [19] Liu B (2014). “Adaptive annealed importance sampling for multimodal posterior exploration and model selection with application to extrasolar planet detection.” The Astrophysical Journal Supplement Series, 213(1): 14.
  • [20] Llorente F, Martino L, Delgado D, and Lopez-Santiago J (2023). “Marginal likelihood computation for model selection and hypothesis testing: An extensive review.” SIAM Review, 65: 3–58.
  • [21] McEwen JD, Wallis CGR, Price MA, and Docherty MM (2022). “Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator.” arXiv preprint. https://arxiv.org/pdf/2111.12720.pdf
  • [22] Meng X-L and Wong WH (1996). “Simulating ratios of normalizing constants via a simple identity: A theoretical exploration.” Statistica Sinica, 6: 831–860.
  • [23] Metodiev M, Perrot-Dockès M, Ouadah S, Irons NJ, Latouche P, and Raftery AE (2024). “Supplement A to ‘Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator’.”
  • [24] — (2024). “Supplement B to ‘Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator’.”
  • [25] Miller JW (2021). “Asymptotic normality, concentration, and coverage of generalized posteriors.” The Journal of Machine Learning Research, 22: 7598–7650.
  • [26] Newton MA and Raftery AE (1994). “Approximate Bayesian inference with the weighted likelihood bootstrap.” Journal of the Royal Statistical Society: Series B (Methodological), 56: 3–26.
  • [27] Olver FWJ, Olde Daalhuis AB, Lozier DW, Schneider BI, Boisvert RF, Clark CW, Miller BR, Saunders BV, Cohl HS, and McClain MA, eds. “NIST Digital Library of Mathematical Functions.” Release 1.1.9 of 2023-03-15. URL https://dlmf.nist.gov/
  • [28] Porwal A and Raftery AE (2022). “Comparing methods for statistical inference with model uncertainty.” Proceedings of the National Academy of Sciences, 119(16): e2120737119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2120737119
  • [29] R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
  • [30] Robert CP and Wraith D (2009). “Computational methods for Bayesian model choice.” In AIP Conference Proceedings, volume 1193, 251–262. American Institute of Physics.
  • [31] Rockafellar R and Wets R-B (2009). Variational Analysis. Springer, 1st edition.
  • [32] Shen X (2002). “Asymptotic Normality of Semiparametric and Nonparametric Posterior Distributions.” Journal of the American Statistical Association, 97: 222–235.
  • [33] Sims CA, Waggoner DF, and Zha T (2008). “Methods for inference in large multiple-equation Markov-switching models.” Journal of Econometrics, 146(2): 255–274.
  • [34] Skilling J (2006). “Nested sampling for general Bayesian computation.” Bayesian Analysis, 1: 833–859.
  • [35] Snijders TAB and Bosker RJ (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling. Sage.
  • [36] Stamey T, Kabalin J, McNeal J, Johnstone I, Freiha F, Redwine E, and Yang N (1989). “Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients.” Journal of Urology, 16: 1076–1083.
  • [37] Stan Development Team (2022). “RStan: the R interface to Stan.” R package version 2.21.7. URL https://mc-stan.org/
  • [38] Sweeting T (1996). “On a Converse to Scheffé’s Theorem.” The Annals of Statistics, 14: 1252–1256.
  • [39] Zellner A (1971). An Introduction to Bayesian Inference in Econometrics. Krieger. URL https://books.google.de/books?id=paqiswEACAAJ
  • [40] — (1986). “Bayesian Estimation and Prediction Using Asymmetric Loss Functions.” Journal of the American Statistical Association, 81(394): 446–451. URL http://www.jstor.org/stable/2289234
