Author manuscript; available in PMC 2025 Aug 8. Published before final editing as: Bayesian Analysis, 2024 Apr 23. doi: 10.1214/24-ba1422

Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator

Martin Metodiev, Marie Perrot-Dockès, Sarah Ouadah, Nicholas J. Irons, Pierre Latouche, Adrian E. Raftery

Abstract

We propose an easily computed estimator of marginal likelihoods from posterior simulation output, via reciprocal importance sampling, combining earlier proposals of DiCiccio et al. (1997) and Robert and Wraith (2009). It involves only the unnormalized posterior densities of the sampled parameter values, and requires no simulations beyond the main posterior simulation and no complicated additional calculations, provided that the parameter space is unconstrained. Even if this is not the case, the estimator is easily adjusted by a simple Monte Carlo approximation. It is unbiased for the reciprocal of the marginal likelihood, consistent, has finite variance, and is asymptotically normal. It involves one user-specified control parameter, and we derive an optimal way of specifying this. We illustrate it with several numerical examples.

Keywords: marginal likelihood estimation, reciprocal importance sampling

MSC2020 subject classifications: Primary 62F15, 62-04; secondary 62F12

1. Introduction

A key quantity in Bayesian model selection is the marginal likelihood, also known as the evidence, the normalizing constant of the posterior density, or the integrated likelihood. Consider a statistical model with parameter vector $\theta$ and data $\mathcal{D}$. Let $L(\theta)=p(\mathcal{D}\mid\theta)$ be the usual likelihood, and $\pi(\theta)$ be the prior distribution of $\theta$. Then $Z=p(\mathcal{D})=\int L(\theta)\,\pi(\theta)\,d\theta$ is the marginal likelihood.

The marginal likelihood plays a key role in defining Bayes factors. Consider two models $M_1$ and $M_2$ with marginal likelihoods $Z_1$ and $Z_2$. Then the Bayes factor (the ratio of posterior to prior odds) for model $M_1$ against $M_2$ is $B_{1,2}=Z_1/Z_2$.

The marginal likelihood is also a critical quantity for Bayesian model averaging (BMA). Consider $K$ models, $M_1,\ldots,M_K$, with prior model probabilities $\Pi_k$ (which add up to 1) and marginal likelihoods $Z_k$. Suppose $Q$ is a quantity of interest, such as a parameter or a future observation to be predicted. Then the BMA posterior distribution of $Q$ is

$$p(Q\mid\mathcal{D})=\sum_{k=1}^{K}p(Q\mid\mathcal{D},M_k)\,p(M_k\mid\mathcal{D}), \tag{1}$$

where $p(M_k\mid\mathcal{D})$ is the posterior model probability of $M_k$, which satisfies $p(M_k\mid\mathcal{D})\propto\Pi_k Z_k$ and $\sum_{k=1}^{K}p(M_k\mid\mathcal{D})=1$. So $p(Q\mid\mathcal{D})=\sum_{k=1}^{K}p(Q\mid\mathcal{D},M_k)\,\Pi_k Z_k\big/\sum_{k=1}^{K}\Pi_k Z_k$.

Finally, the most likely model a posteriori is the one that maximizes $\Pi_k Z_k$. Choosing it minimizes the model selection error rate on average over the prior [17]. Often the prior over the model space is chosen to be uniform, in which case $\Pi_k=1/K$ for all $k$. In this case, Bayesian model selection by choosing the most likely model a posteriori boils down to choosing the model with the largest $Z_k$, and hence involves only the marginal likelihoods.
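Given estimates of the $Z_k$ on the log scale, the posterior model probabilities can be computed stably with a log-sum-exp shift. A minimal R sketch, where `log_Z` is a hypothetical vector of log marginal likelihood estimates:

```r
# Posterior model probabilities p(M_k | D) from log marginal likelihoods.
post_model_prob <- function(log_Z,
                            prior_prob = rep(1 / length(log_Z), length(log_Z))) {
  log_w <- log_Z + log(prior_prob)  # log(Pi_k * Z_k), up to an additive constant
  w <- exp(log_w - max(log_w))      # subtract the max to avoid underflow
  w / sum(w)                        # p(M_k | D)
}

post_model_prob(c(-8278.8, -8136.6))  # hypothetical two-model comparison
```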

Bayesian models are often estimated using Monte Carlo methods in which a sample of values of $\theta$ is simulated from the posterior distribution. The most common class of such methods is Markov chain Monte Carlo (MCMC). Perhaps surprisingly, estimating the marginal likelihood from the output of MCMC and other posterior simulation methods has turned out not to be straightforward. Many different methods have been proposed, and none of them is widely considered to be generally the best. Llorente et al. [20] provide a comprehensive review of such methods, describing 16 different methods and, remarkably, citing over 20 other review articles!

We seek a method that is precise, generic and simple for estimating the marginal likelihood from posterior simulation output. We take this to mean that it gives low variance estimates of the marginal likelihood, uses posterior simulation output for just the one model being analyzed, uses only likelihoods and prior densities of the sampled values of θ, and does not need additional simulations or complicated calculations.

Some well-known methods do not satisfy our desiderata. These include Chib’s method [3], which requires complicated additional calculations, bridge sampling [22], which requires simulations from two models, importance sampling, which requires additional simulations, nested sampling [34], which involves other simulations, and more advanced methods such as adaptive annealed importance sampling [19]. They also include the harmonic mean of the likelihoods [26], which is unbiased and consistent, but has infinite variance and is unstable, as pointed out by the original authors.

Arguably, the only methods that are precise, generic and simple for estimating the marginal likelihood from MCMC by our definition are versions of reciprocal importance sampling (RIS) [8]. These are based on the identity:

$$Z^{-1}=E_{\theta}\!\left[\frac{h(\theta)}{L(\theta)\,\pi(\theta)}\,\middle|\,\mathcal{D}\right], \tag{2}$$

where h(θ) is a (normalized) probability density function (pdf) over the posterior support. Remarkably, (2) holds for any pdf h(θ). This leads to the estimator

$$\hat{Z}^{-1}=\frac{1}{T}\sum_{t=1}^{T}\frac{h\big(\theta^{(t)}\big)}{L\big(\theta^{(t)}\big)\,\pi\big(\theta^{(t)}\big)}, \tag{3}$$

where $\theta^{(1)},\ldots,\theta^{(T)}$ are simulated from the posterior using MCMC or another method. This estimator has good properties in general, provided that the tails of the distribution $h(\theta)$ are thin enough in all directions. It can be hard to choose $h(\theta)$ so that it both overlaps substantially with the posterior distribution (needed for efficiency) and has thin enough tails (needed for finite variance), especially in higher dimensions. We propose a choice of $h(\theta)$ that leads to easily computed estimates and is optimal or near optimal in a certain sense.
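To fix ideas, here is a minimal R sketch of the generic RIS estimator (3) on a conjugate one-dimensional example, with $h$ taken to be a normal density matched to the posterior; the THAMES of Section 3 replaces this $h$ with a uniform density on an ellipsoid:

```r
set.seed(1)
y <- rnorm(20, mean = 2)                  # data: y_i ~ N(mu, 1), prior mu ~ N(0, 1)
log_lik   <- function(mu) sapply(mu, function(m) sum(dnorm(y, m, 1, log = TRUE)))
log_prior <- function(mu) dnorm(mu, 0, 1, log = TRUE)

# The exact posterior here is N(m_n, s_n) with precision n + 1, so we can
# draw from it directly as a stand-in for MCMC output
n <- length(y); s_n <- 1 / (n + 1); m_n <- n * mean(y) * s_n
theta <- rnorm(1e4, m_n, sqrt(s_n))

# An h matched exactly to the posterior makes every term equal to 1/Z
# (zero variance); in practice h only approximates the posterior
log_h <- dnorm(theta, m_n, sqrt(s_n), log = TRUE)
Z_inv_hat <- mean(exp(log_h - log_lik(theta) - log_prior(theta)))
-log(Z_inv_hat)                           # estimated log marginal likelihood
```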

The paper is organized as follows. In Section 2 we discuss reciprocal importance sampling and its properties. In Section 3 we describe our proposed choice of $h(\theta)$ and derive some of its properties. In Section 4 we give several numerical examples, including a multivariate Gaussian example, a Bayesian regression example, a non-Gaussian case, and a Bayesian hierarchical model. We conclude in Section 5 with a discussion. The code for this paper is available via GitHub, and THAMES is implemented in an R package [15], which has been submitted to CRAN.

2. Reciprocal Importance Sampling

In general, the RIS estimator of the marginal likelihood is defined by Equation (3). This has several good properties. It is unbiased, in the sense that $E[\hat{Z}^{-1}]=Z^{-1}$, where the expectation is over the posterior distribution of $\theta$. It is also strongly simulation-consistent, in the sense that $\hat{Z}^{-1}\to Z^{-1}$ almost surely as $T\to\infty$.

In addition, the RIS estimator of the reciprocal marginal likelihood, $\hat{Z}^{-1}$, has finite variance and is asymptotically normally distributed as $T\to\infty$ if the tails of $h(\theta)$ are thin enough. Specifically, this requires that

$$\int\frac{h(\theta)^2}{L(\theta)\,\pi(\theta)}\,d\theta<\infty. \tag{4}$$

It is hard to choose h(θ) so that it both overlaps substantially with the area of the parameter space with high posterior density, which is needed for efficiency, and so that it also has thin enough tails, which is needed for finite variance. The difficulty grows as the dimension increases.

Two choices of $h(\theta)$ in the literature deserve attention. DiCiccio et al. [4] proposed $h(\theta)=\mathrm{MVN}(\theta;\hat{\theta},\hat{\Sigma})$, where $\hat{\theta}$ is the posterior mean or mode, and $\hat{\Sigma}$ is an estimate of the posterior covariance matrix. This overlaps nicely with $L(\theta)\pi(\theta)$, but its tails may not be thin enough when the posterior is asymmetric or the parameter is high-dimensional.

To remedy the problem of the tails possibly being too thick, DiCiccio et al. [4] proposed truncating this distribution, using instead $h(\theta)=\mathrm{TMVN}_A(\hat{\theta},\hat{\Sigma})$, a multivariate normal distribution truncated to the set $A$, where

$$A=\big\{\theta:(\theta-\hat{\theta})^T\hat{\Sigma}^{-1}(\theta-\hat{\theta})<c^2\big\}. \tag{5}$$

Thus A is an ellipsoid with radius c and volume

$$V(A)=c^d\,\pi^{d/2}\,|\hat{\Sigma}|^{1/2}\Big/\,\Gamma\!\Big(\frac{d}{2}+1\Big). \tag{6}$$

Truncating the distribution ensures that the estimator $\hat{Z}^{-1}$ has finite variance. The authors found that the truncation improved the performance of the RIS estimator. However, with high-dimensional parameters, the result might be sensitive to the specification of $c$.
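Equation (6) is cheap to evaluate on the log scale; a small R sketch, with `Sigma_hat` a hypothetical estimate of the posterior covariance matrix:

```r
# Log-volume of the ellipsoid A in Eq. (6), computed on the log scale
# (via the log-determinant) for numerical stability in higher dimensions.
log_V_A <- function(c, Sigma_hat) {
  d <- nrow(Sigma_hat)
  d * log(c) + (d / 2) * log(pi) +
    0.5 * as.numeric(determinant(Sigma_hat, logarithm = TRUE)$modulus) -
    lgamma(d / 2 + 1)
}

exp(log_V_A(1, diag(2)))  # unit disc in two dimensions: pi
```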

Robert and Wraith [30] proposed setting $h(\theta)$ to be a uniform distribution on the convex hull of the simulated MCMC parameter values in the $\alpha$-HPD region, namely the highest posterior density region containing a proportion $\alpha$ of the sampled parameter values. They considered the values $\alpha=0.1$ and $0.25$. They applied it to a two-dimensional toy example where it performed well.

However, as far as we know, that method has not yet been fully developed for realistic, higher-dimensional situations. For example, we know of no simple way to compute the volume of the convex hull of a set of points in higher dimensions, which is required for the method in general. It is also not clear how best to choose α nor how sensitive the method would be to α in higher dimensions. It has been used in a higher-dimensional application by Durmus et al. [5], but this involved comparing competing models defined on the same parameter space, thus avoiding the need to calculate the volume of A, which canceled out in Bayesian model comparisons. Calculating the volume of A may be the most difficult part of this method in general.

3. Estimating the marginal likelihood

3.1. Estimating the marginal likelihood with THAMES

We propose combining the proposals of DiCiccio et al. [4] and Robert and Wraith [30] to obtain a method that we believe satisfies all our desiderata. We propose specifying h(θ) to be a uniform distribution, but to be uniform over the set A defined in Equation (5), rather than over a convex hull of points. This resolves the problem of computing the volume of A, since this is given analytically by Equation (6). If A is not a subset of the posterior support, for example if the posterior support is constrained, we adjust the volume of A by a simple Monte Carlo approximation. This yields the estimator

$$\hat{Z}^{-1}=\frac{1}{V(A)\,T}\sum_{t=1}^{T}\frac{\mathbf{1}\big\{\theta^{(t)}\in A\big\}}{L\big(\theta^{(t)}\big)\,\pi\big(\theta^{(t)}\big)}. \tag{7}$$

Thus $\hat{Z}$ is a truncated harmonic mean of the unnormalized posterior densities, $L(\theta^{(t)})\,\pi(\theta^{(t)})$ (see Footnote 1). We call it the Truncated HArmonic Mean EStimator, or THAMES.

The THAMES, $\hat{Z}^{-1}$, has several desirable properties. It is simple to compute, involving only the prior and likelihood values of the sampled parameter values; in fact it involves only their product, namely the unnormalized posterior densities of the sampled parameter values. It is unbiased as an estimator of $Z^{-1}$, as long as $A$ is specified independently of the sample. It is also simulation-consistent, in the sense that $\hat{Z}^{-1}\to Z^{-1}$ almost surely as $T\to\infty$, by the strong law of large numbers. Its variance (over simulation from the posterior given the data $\mathcal{D}$) is finite provided that

$$\int_A\big(L(\theta)\,\pi(\theta)\big)^{-1}\,d\theta<\infty, \tag{8}$$

which will usually hold since $A$ is a bounded set in $\mathbb{R}^d$. In fact, it suffices that the likelihood and the prior are continuous with respect to $\theta$ and strictly positive on the closure of $A$. If Equation (8) holds, $\hat{Z}^{-1}$ is asymptotically normal (again as the number of parameter values simulated increases), by the Lindeberg central limit theorem. Note that asymptotic normality holds on the scale of $\hat{Z}^{-1}$, and not exactly on other scales such as $\hat{Z}$ or $\log(\hat{Z})$.

If the posterior simulation method yields independent draws, then $\mathrm{Var}(\hat{Z}^{-1})$ can be estimated directly as the empirical variance of the values of $\big(L(\theta^{(t)})\,\pi(\theta^{(t)})\big)^{-1}\mathbf{1}\{\theta^{(t)}\in A\}$, divided by $T\,V(A)^2$. If MCMC is used, successive simulations from the posterior will in general not be independent. A central limit theorem will still hold, but the variance needs to take account of the serial dependence. This can be done approximately by computing the variance based on serial independence and multiplying it by an estimate of the spectral density of the sequence at zero. For example, if the sequence of values of $1/(L(\theta)\pi(\theta))$ can be approximated by a first-order autoregressive model with parameter $\phi$, then this factor would be approximately $1/(1-\phi)^2$. An alternative would be to thin the sequence enough that the resulting subsequence is approximately uncorrelated and then use the variance based on assuming independence. A different approach was taken by Frühwirth-Schnatter [7].

Note that an approximate normal confidence interval can be obtained for $\hat{Z}^{-1}$, because that is the scale on which a central limit theorem holds. This can be turned into a confidence interval for $\hat{Z}$ by taking the reciprocals of the endpoints of the normal confidence interval for $\hat{Z}^{-1}$; the resulting confidence interval is not symmetric. The same can be done for $\log(\hat{Z})$ in a similar manner.
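A sketch of this variance estimate and the resulting intervals in R, under the AR(1) approximation just described. Here `terms` is a hypothetical vector of the values $(L\pi)^{-1}\mathbf{1}\{\theta^{(t)}\in A\}$ along one chain and `V_A` the volume from Equation (6); the interval for $\log\hat{Z}$ requires the lower endpoint for $\hat{Z}^{-1}$ to be positive:

```r
ris_ci <- function(terms, V_A, level = 0.95) {
  T_n <- length(terms)
  var_iid  <- var(terms) / (T_n * V_A^2)               # variance assuming independence
  phi      <- ar(terms, order.max = 1, aic = FALSE)$ar # fitted AR(1) coefficient
  var_mcmc <- var_iid / (1 - phi)^2                    # inflate for serial dependence
  z_inv <- mean(terms) / V_A                           # estimate of Z^{-1}
  half  <- qnorm(1 - (1 - level) / 2) * sqrt(var_mcmc)
  ci_inv <- c(z_inv - half, z_inv + half)              # normal CI on the Z^{-1} scale
  list(log_Z = -log(z_inv),
       ci_log_Z = rev(-log(ci_inv)))                   # asymmetric CI for log Z
}
```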

3.2. Optimal choice of control parameter, c

We now address the question of how to choose the radius $c$ of the ellipsoid that specifies the THAMES in Equation (5). Ignoring serial correlation between simulated values of the parameters, we suggest choosing $c$ to minimize the estimated variance of $\hat{Z}^{-1}$. This could be done empirically by computing $\hat{Z}^{-1}$ for a range of values of $c$, estimating $\mathrm{Var}(\hat{Z}^{-1})$ for each value of $c$, and optimizing over $c$ by a grid search or a one-dimensional numerical optimization method.

It is possible to obtain analytic results in the case where the posterior distribution is normal. This is of considerable interest as the posterior distribution is asymptotically normal in many common situations, including some where standard regularity conditions do not hold [13, 9, 32, 25]. In this case the THAMES has finite variance since the posterior density, and thus the product of the likelihood and the prior, is continuous with respect to θ and strictly positive everywhere.

We want to minimize the variance of the THAMES. Under our assumption of independence of the successive MCMC simulations, this variance simplifies to

$$\mathrm{Var}\big(\hat{Z}^{-1}\,\big|\,\mathcal{D}\big)=\frac{1}{T}\,\frac{1}{Z^2}\,\mathrm{SCV}(d,c). \tag{9}$$

Here SCV(d,c) denotes

$$\mathrm{SCV}(d,c)\equiv\frac{\mathrm{Var}_{\theta^{(1)}}\!\left[\dfrac{\mathbf{1}_A(\theta^{(1)})/V(A)}{L(\theta^{(1)})\,\pi(\theta^{(1)})}\,\middle|\,\mathcal{D}\right]}{E_{\theta^{(1)}}\!\left[\dfrac{\mathbf{1}_A(\theta^{(1)})/V(A)}{L(\theta^{(1)})\,\pi(\theta^{(1)})}\,\middle|\,\mathcal{D}\right]^2}, \tag{10}$$

the squared coefficient of variation of the first term of the THAMES. Since the variance is the product of $1/T$, $1/Z^2$ and $\mathrm{SCV}(d,c)$, minimizing $\mathrm{SCV}(d,c)$ with respect to $c$ is equivalent to minimizing the variance of the THAMES.

We derive a statement about the optimal choice of $c$ by assuming that the posterior covariance matrix $\Sigma$ and the posterior mean $m$ can be provided by a stochastic oracle. The THAMES can then be defined using

$$A_{\mathrm{or}}\equiv\big\{\theta:(\theta-m)^T\Sigma^{-1}(\theta-m)<c^2\big\}. \tag{11}$$

We will show that the radius $c$ that minimizes the variance of the THAMES depends on the dimension $d$, and is equal to $c_d=\sqrt{d+L_d}$, with $L_d$ being close to one for large $d$. Interestingly, in this case the SCV depends neither on the data, $\mathcal{D}$, nor on the number of samples from the posterior, $T$. Of course, this is rarely exactly the case in practice. However, plugging in consistent estimators of $(m,\Sigma)$ gives approximately the same results if the number of samples from the posterior is large enough, provided that a sample splitting procedure is used. This will be a consequence of Theorem 3.3. The sample splitting procedure that we suggest is described in Section 3.3. The proofs of these results are given in Supplement A [23]. Additional numerical results about the behaviour of the optimal radius are given in Supplement B [24].

Assumption 1. For the following theorems it is assumed that we can ignore serial correlation (i.e., we assume independence of the successive MCMC iterations) and that the posterior distribution is normal with mean $m\in\mathbb{R}^d$ and positive definite covariance matrix $\Sigma\in\mathcal{M}_{d\times d}(\mathbb{R})$. We further assume that the THAMES is defined on $A_{\mathrm{or}}$.

Theorem 3.1. There exists a unique radius $c_d\in(0,\infty)$ such that the ellipsoid $A_{\mathrm{or}}$ with radius $c_d$ minimizes the variance of the THAMES. This value $c_d$ does not depend on the posterior mean or covariance matrix. It satisfies $c_d=\sqrt{d+L_d}$, where the optimal shifting parameter $L_d\ge 0$ is a sequence for which $L_d/d\xrightarrow{d\to\infty}0$.

Remark 1. Theorem 3.1 ensures that the optimal radius $c_d$ is asymptotically equivalent to $\sqrt{d}$. In fact, our calculations suggest that $c_d=\sqrt{d+L_d}$ can be approximated by $\sqrt{d+1}$ (see Supplement B [24]).

Theorem 3.2. The following statements hold for the SCV:

  1. For any choice of the shifting parameter $L\in\mathbb{R}$ and for all $\varepsilon>0$,
  $$1-\varepsilon\;\le\;\frac{\mathrm{SCV}\big(d,\sqrt{d+L_d}\big)}{\sqrt{(d+2)\pi}/4}\;\le\;\frac{\mathrm{SCV}\big(d,\sqrt{d+L}\big)}{\sqrt{(d+2)\pi}/4}\;\le\;2+\varepsilon, \tag{12}$$
  for all but finitely many $d$. Thus choosing the radius $\sqrt{d+L}$ results in an SCV that is both asymptotically at most twice as large as the optimal SCV and of order $\sqrt{d}$.
  2. The following inequality for the SCV can be given for the choice of radius $c=\sqrt{d+1}$:
  $$0.63\,\sqrt{(d+2)\pi}/4-1\;\le\;\mathrm{SCV}\big(d,\sqrt{d+L_d}\big) \tag{13}$$
  $$\le\;\mathrm{SCV}\big(d,\sqrt{d+1}\big)\;\le\;1.05\cdot 2\,\sqrt{(d+2)\pi}/4-1. \tag{14}$$
  This inequality holds for all $d\ge 1$.

Remark 2. Statement 1 of Theorem 3.2 shows that $\mathrm{SCV}(d,c)$ increases at the rate $\sqrt{d}$ as $d\to\infty$, both for our choice $c=\sqrt{d+1}$ and for the optimal choice $c_d$. Further, any choice of the shifting parameter $L$ used to define the radius $\sqrt{d+L}$ is asymptotically at most twice as bad as any optimal solution in terms of the SCV. This suggests some robustness of our estimator with respect to the choice of $L$. For numerical results on the behaviour of the SCV for different values of $L$, we refer the reader to Supplement B [24].

One can also calculate the bias of the THAMES on the scale of the marginal likelihood by considering a second-order Taylor approximation of $x\mapsto 1/x$ around $E[\hat{Z}^{-1}]=Z^{-1}$, which gives $E[1/X]\approx 1/\mu+\mathrm{Var}(X)/\mu^3$, and then using Equation (9):

$$E[\hat{Z}]\approx Z+\mathrm{Var}\big(\hat{Z}^{-1}\big)\big/\big(Z^{-1}\big)^{3}=Z\,\big(1+\mathrm{SCV}(d,c)/T\big). \tag{15}$$

Considering Equation (15), the bias can be estimated by using the plug-in estimates $\widehat{\mathrm{SCV}}$ and $\hat{Z}^{-1}$. We also observe that the bias vanishes as $T$ increases.

Remark 3. Statement 2 of Theorem 3.2 gives a very rough theoretical guarantee: for any dimension $d\ge 1$, the SCV obtained by choosing our recommended radius, $\sqrt{d+1}$, and the SCV obtained by choosing the optimal radius, $c_d$, can both be bounded by an affine transform of $\sqrt{d+2}$. However, our calculations suggest that the SCV at the point $c=\sqrt{d+1}$ has asymptotically optimal performance (see Supplement B [24]).

So far, we have given results for the idealized situation where the posterior distribution is exactly normal. We now give a result for the more common and realistic situation where the posterior distribution is only asymptotically normal.

Theorem 3.3. Let $p_n(\theta\mid\mathcal{D}_n)$ be a sequence of posterior densities with data $\mathcal{D}_n$, posterior covariance matrix $\Sigma_n$, posterior mean $m_n$ and an SCV denoted by $\mathrm{SCV}_n$. Then, if

$$|\Sigma_n|^{1/2}\,p_n\big(\Sigma_n^{1/2}\theta+m_n\,\big|\,\mathcal{D}_n\big)\;\xrightarrow{n\to\infty}\;|\Sigma|^{1/2}\,p\big(\Sigma^{1/2}\theta+m\,\big|\,\mathcal{D}\big) \tag{16}$$

uniformly in $\theta$ on all compact subsets of $\mathbb{R}^d$, it follows that

$$\mathrm{SCV}_n(d,c)\;\xrightarrow{n\to\infty}\;\mathrm{SCV}(d,c) \tag{17}$$

uniformly in $c$ on all compact subsets of $(0,\infty)$. In particular, for any $b\ge c_d\ge a>0$,

$$c_d^n\in\operatorname*{arg\,min}_{c\in[a,b]}\mathrm{SCV}_n(d,c)\ \text{for all}\ n\quad\Longrightarrow\quad\lim_{n\to\infty}c_d^n=c_d. \tag{18}$$

Remark 4. We have already stated that the normal case is important because the posterior distribution is often asymptotically normal when the size of the data, n, is large. Theorem 3.3 assures us that our results still hold in this limiting case, under some assumptions:

If the convergence of the normalized posterior pdf is uniform in $\theta$ (Equation (16)), our statements about the limiting behaviour of the SCV (Theorem 3.2 and Remarks 2–3) still hold approximately when $n$ is large (Equation (17)). If additionally the optimal radii $c_d^n$ do not converge to zero or infinity, any result about $c_d$ (Theorem 3.1 and Remark 1) also holds approximately when $n$ is large (Equation (18)).

Let $H_0$ denote the Fisher information matrix. Reformulating Equation (16) by replacing $\Sigma_n$ by $\frac{1}{n}H_0^{-1}$, to which it is asymptotically equivalent, gives a statement that has been proven under a variety of assumptions, e.g. Miller [25, Theorem 4], except that in these results the type of convergence is usually not uniform convergence, but a weaker type of convergence, such as convergence in distribution or convergence in total variation.

Additional assumptions can be made about the pdfs of the sequence of distributions such that convergence in distribution implies uniform convergence of the pdfs. For example, if the pdfs are asymptotically equicontinuous and we have convergence in distribution, the convergence of the pdfs is uniform [38, Theorem 1].

Note that in this case there is no problem if the parameter space is constrained: uniform convergence of the pdfs implies that $A_n$ is a subset of the posterior support if $n$ is large enough. There are also no assumptions about $m_n,\Sigma_n$ other than that they converge to the moments of the posterior limit. In this sense, the estimators $(\hat{\mu},\hat{\Sigma})$ take the place of these constants in practice, and Theorem 3.3 holds even when we use these estimators.

Remark 5. Due to the assumption of normality, when choosing the optimal radius $c_d=\sqrt{d+L_d}$, the probability that a term of the THAMES in $\theta^{(t)}$ is not set to 0 is equal to

$$P\big(\theta^{(t)}\in A_{\mathrm{or}}\big)=P\Big(\big(\theta^{(t)}-m\big)^T\Sigma^{-1}\big(\theta^{(t)}-m\big)<d+L_d\Big)=\chi^2(d+L_d;\,d), \tag{19}$$

the CDF of the $\chi^2$-distribution with $d$ degrees of freedom evaluated at $d+L_d$. This approaches 50% due to Theorem 3.1. Thus the algorithm sets about 50% of the highest terms in Equation (7) to 0. This means that for a large number of samples $T$, and given the normality assumption, our algorithm is similar to the following method:

Instead of checking whether $\theta^{(t)}\in A_{\mathrm{or}}$ directly, one can set roughly the 50% highest terms of the THAMES, namely the terms not included in the Highest Posterior Density (HPD) region of size 50%, to 0.
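A sketch of this variant in R, where `log_post` is a hypothetical vector holding $\log\{L(\theta^{(t)})\pi(\theta^{(t)})\}$ for each draw and `log_V_A` the log-volume of the ellipsoid from Equation (6); the sum is computed by log-sum-exp to avoid overflow:

```r
# HPD-50% variant of Remark 5: drop the 50% of draws with the lowest
# unnormalized posterior density (equivalently, the highest inverse-density
# terms), then average as in Eq. (7). Returns the estimated log Z.
thames_hpd50_log_Z <- function(log_post, log_V_A) {
  keep <- log_post >= median(log_post)   # draws inside the 50% HPD region
  a <- -log_post[keep] - log_V_A         # log of each kept term, h = 1/V(A)
  m <- max(a)
  -(m + log(sum(exp(a - m))) - log(length(log_post)))  # divide by the full T
}
```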

Remark 6. It is assumed that the covariance matrix of the posterior distribution is positive definite. This assumption is necessary since otherwise a posterior density with respect to the Lebesgue measure on $\mathbb{R}^d$ would not exist. On the other hand, this assumption is not restrictive, since the same estimation procedure can be applied to a lower-dimensional subspace of $\mathbb{R}^d$ on which a density is defined.

We can illustrate the relationship between the THAMES and the harmonic mean estimator defined by Newton and Raftery [26] using the toy example from Figure 1. It was calculated using the same model as the one introduced in Section 4.1 with the dimension of the parameter space $d=2$, but by setting the data set to $\mathcal{D}_0$ to ensure stability of the estimator on the inverse likelihood scale.

Figure 1: Left: The THAMES calculated by choosing the radii $c=\sqrt{d+1}=\sqrt{3}$, $c=0.1\sqrt{3}$ and $c=50\sqrt{3}$ in the two-dimensional case, $d=2$, with the true value $1/Z$ (dotted line) and the harmonic mean estimator. Right: The posterior sample evaluated at the inverse of the unnormalized posterior density and the different ellipses used to define the THAMES. In this particular case the posterior covariance matrix is a scaled identity matrix, so the ellipses are spheres. The two samples occurring at points 644 and 7216 have a very low likelihood. They cause massive jumps in the harmonic mean estimator when the radius of $A$ is large, and are excluded when the radius is equal to $\sqrt{d+1}=\sqrt{3}$. One can choose a smaller radius (e.g., $c=0.1\sqrt{3}$), but then too much of the sample is excluded and convergence takes longer.

The pdf of the uniform distribution on the ellipsoid is essentially used as a rejection rule: values with a very low posterior density (and therefore a high inverse posterior density) are rejected, while high-density values are accepted. A balance between the volume of the ellipsoid and the percentage of the rejected posterior sample needs to be found to ensure optimal performance. The harmonic mean estimator does not have this rejection rule, so sample points with low posterior densities can lead to massive jumps.

3.3. THAMES algorithm

Below is an algorithm for the implementation of the THAMES. Procedures for sample splitting, as well as the truncated ellipsoid correction used in the case where the parameter space is constrained, have been included. These additions are described in the two subsections that follow.

We recommend these additions, but we have also found that in some cases they make almost no difference. For example, sample splitting does not appear to have an impact when the dimension of the parameter space, d, is small (Section 4.1), while the truncated ellipsoid correction is negligible when the posterior mean is not close to the edge of the posterior support (Section 4.4 and Section 4.3).

Algorithm 1: $\hat{Z}^{-1}$ calculation

Input: Data $\mathcal{D}$ and posterior samples $(\theta^{(i)})_{i\in\{1,\ldots,T\}}$.
Sample splitting: Calculate the empirical mean $\hat{\theta}$ and sample covariance matrix $\hat{\Sigma}$ from the first $T/2$ posterior samples $(\theta^{(i)})_{i\in\{1,\ldots,T/2\}}$.
Standardization: $\tilde{\theta}^{(i)}=\hat{\Sigma}^{-1/2}\big(\theta^{(i)}-\hat{\theta}\big)$ for $i\in\{T/2+1,\ldots,T\}$.
Truncation subset: $\mathcal{S}=\big\{i:\|\tilde{\theta}^{(i)}\|_2^2<d+1\big\}$.
Calculate THAMES estimator:
$$\hat{Z}^{-1}=\frac{1}{T/2}\sum_{i=T/2+1}^{T}\frac{h\big(\theta^{(i)}\big)}{L\big(\theta^{(i)}\big)\,\pi\big(\theta^{(i)}\big)},$$
 where $h(\theta^{(i)})=1/V(A)$ if $i\in\mathcal{S}$ and $0$ otherwise, with $V(A)=|\hat{\Sigma}|^{1/2}\,\pi^{d/2}\,(d+1)^{d/2}\big/\,\Gamma\big(\tfrac{d}{2}+1\big)$ and $A=\big\{\theta:(\theta-\hat{\theta})^T\hat{\Sigma}^{-1}(\theta-\hat{\theta})<d+1\big\}$.
if the posterior support $\mathrm{supp}(\theta\mid\mathcal{D})$ is constrained then
 Simulate the sample $\nu^{(1)},\ldots,\nu^{(N)}$ from the uniform distribution on $A$.
 Approximate the volume ratio $V\big(A\cap\mathrm{supp}(\theta\mid\mathcal{D})\big)/V(A)$ via the Monte Carlo estimator
$$\hat{R}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{\{\theta:\,\pi(\theta)L(\theta)>0\}}\big(\nu^{(i)}\big).$$
 Assign $\hat{Z}^{-1}\leftarrow\hat{R}^{-1}\hat{Z}^{-1}$.
end if
Output: THAMES estimator $\hat{Z}^{-1}$.
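A self-contained R sketch of the unconstrained part of Algorithm 1, taking a $T\times d$ matrix `theta` of posterior draws and a vector `log_post` of the corresponding values of $\log\{L(\theta^{(i)})\pi(\theta^{(i)})\}$ (the constrained-support correction is sketched in the corresponding subsection below):

```r
thames_log_Z <- function(theta, log_post) {
  theta <- as.matrix(theta)
  T_n <- nrow(theta); d <- ncol(theta)
  i1 <- 1:(T_n %/% 2)                  # first half: estimate the ellipsoid
  i2 <- (T_n %/% 2 + 1):T_n            # second half: evaluate the estimator

  theta_hat <- colMeans(theta[i1, , drop = FALSE])
  Sigma_hat <- cov(theta[i1, , drop = FALSE])

  # Truncation subset S: squared Mahalanobis distance below d + 1
  md2  <- mahalanobis(theta[i2, , drop = FALSE], theta_hat, Sigma_hat)
  keep <- md2 < d + 1

  # log V(A), Eq. (6) with c = sqrt(d + 1), on the log scale for stability
  log_V <- (d / 2) * log(d + 1) + (d / 2) * log(pi) +
    0.5 * as.numeric(determinant(Sigma_hat, logarithm = TRUE)$modulus) -
    lgamma(d / 2 + 1)

  # Eq. (7) over the second half, via log-sum-exp to avoid overflow
  a <- -log_post[i2][keep] - log_V
  m <- max(a)
  -(m + log(sum(exp(a - m))) - log(length(i2)))   # estimated log Z
}
```

Computing the truncated sum on the log scale avoids overflow when the log unnormalized posterior densities are large in magnitude, which is the typical case for real data sets.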

Sample splitting

The theoretical guarantees established in Section 3.2 operate under the assumption of an oracle ellipsoid $A_{\mathrm{or}}$. In particular, this means that the ellipsoid determining the THAMES estimator $\hat{Z}^{-1}$ is defined independently of $(\theta^{(i)})_{i\in\{1,\ldots,T\}}$. In practice, we find that estimating $A$ and $Z^{-1}$ simultaneously using the same posterior sample can induce bias in $\hat{Z}^{-1}$ when the parameter space is high-dimensional. As such, we implement a sample splitting procedure that involves estimating $A$ and $Z^{-1}$ using separate posterior draws. Specifically, we first estimate the posterior mean and covariance matrix via the empirical mean $\hat{\theta}$ and sample covariance $\hat{\Sigma}$ using the first $T/2$ posterior samples $(\theta^{(i)})_{i\in\{1,\ldots,T/2\}}$. Defining $A$ as in Equation (5) based on $\hat{\theta}$ and $\hat{\Sigma}$, we then calculate the THAMES estimator $\hat{Z}^{-1}$ using the last $T/2$ posterior samples $(\theta^{(i)})_{i\in\{T/2+1,\ldots,T\}}$. The same problem was noted by Gronau et al. [10] in their popular implementation of bridge sampling. For this reason, the bridgesampling package uses the same sample splitting procedure just described in its default setting.

Correcting for the presence of constrained parameters

Whenever the posterior support of the parameters is not $\mathbb{R}^d$, for example when the parameters are variances or probabilities, it is possible that our choice of $h$ in Equation (3), the pdf of the uniform distribution on $A$, is not correctly normalized. This is due to the fact that $A$ is not necessarily a subset of the posterior support, and thus $h$ is not a pdf over this space.

In this case, the expectation of the THAMES is distorted by a multiplicative constant:

$$E_{\theta}\big[\hat{Z}^{-1}\,\big|\,\mathcal{D}\big]=E_{\theta}\!\left[\frac{h(\theta)}{L(\theta)\,\pi(\theta)}\,\middle|\,\mathcal{D}\right]=Z^{-1}\,\frac{V\big(A\cap\mathrm{supp}(\theta\mid\mathcal{D})\big)}{V(A)}\equiv Z^{-1}R, \tag{20}$$

where $V\big(A\cap\mathrm{supp}(\theta\mid\mathcal{D})\big)$ denotes the volume of the intersection between $A$ and the posterior support. One way to deal with this problem is to transform the parameter space, e.g., by setting $\vartheta\equiv\log(\theta)$ if $\theta$ is a variance parameter. One can then continue with marginal likelihood estimation on $\vartheta$, using the transformed prior distribution. In this case, it is of course important to include the Jacobian of the transformation when computing the prior density. It should be noted that the default proposal in the bridgesampling package of Gronau et al. [10] uses this transformation, because it suffers from the same problem when the parameter space is constrained. This solution is also a viable option for the THAMES, since the transformation removes the need for any adjustments. However, it requires deciding on a viable transformation for each new type of constraint (e.g., simplex constraints, interval constraints, etc.), and the posterior behaviour of the transformed parameters may be hard to interpret. For this reason, we suggest a different correction.

Another way is to adjust for the bias by calculating the ratio of these volumes, $R$, using a simple Monte Carlo approximation: we simulate $\nu^{(1)},\ldots,\nu^{(N)}\overset{\mathrm{i.i.d.}}{\sim}\mathrm{Unif}(A)$, $N\in\mathbb{N}$, and calculate

$$\hat{R}\equiv\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{\mathrm{supp}(\theta\mid\mathcal{D})}\big(\nu^{(i)}\big)=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{\{\theta:\,\pi(\theta)L(\theta)>0\}}\big(\nu^{(i)}\big). \tag{21}$$

Given $A$, this is an unbiased and consistent estimator of $R$ by the law of large numbers. The bias-adjusted THAMES is then $\hat{Z}^{-1}_{\mathrm{adj}}=\hat{R}^{-1}\hat{Z}^{-1}$.
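A sketch of this correction in R, assuming a hypothetical function `log_post_fun` that returns $\log\{\pi(\theta)L(\theta)\}$ and `-Inf` outside the posterior support; uniform draws on the ellipsoid $A$ are generated by rescaling a uniform point on the unit ball:

```r
# Monte Carlo estimate of the volume ratio R in Eq. (21).
# A uniform point on the unit ball is (g / ||g||) * U^(1/d) with g Gaussian;
# mapping it through sqrt(c2) * t(chol(Sigma_hat)) makes it uniform on A.
volume_ratio_hat <- function(N, theta_hat, Sigma_hat, c2, log_post_fun) {
  d <- length(theta_hat)
  U <- chol(Sigma_hat)                             # Sigma_hat = t(U) %*% U
  inside <- replicate(N, {
    g <- rnorm(d)
    u <- g / sqrt(sum(g^2)) * runif(1)^(1 / d)     # uniform draw on the unit ball
    nu <- theta_hat + sqrt(c2) * drop(t(U) %*% u)  # uniform draw on the ellipsoid A
    is.finite(log_post_fun(nu))                    # 1{pi(nu) L(nu) > 0}
  })
  mean(inside)  # R_hat; the adjusted estimator is Z_inv_hat / R_hat
}
```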

The problem of the parameter space being constrained is common not only for the THAMES, but for reciprocal importance sampling estimators in general. It has, for example, been addressed by Hajargasht and Woźniak [11] and Sims et al. [33]. Hajargasht and Woźniak [11] used variational Bayes techniques and showed that these ensure that the support of the chosen $h$ is a subset of the posterior support, under mild conditions. Sims et al. [33] used an ellipsoidal density truncated on a subset of the joint support, $\Theta_U\equiv\{\theta:\pi(\theta)L(\theta)>U\}$, where $U>0$. Since the support of $\pi(\theta)L(\theta)$ is equal to the posterior support, our truncation set is similar to the one chosen by Sims et al. [33], except that we set $U=0$.

The adjustment is usually very small. The problem arises only when the posterior mean is close enough to the edge of the parameter space. The edge of the parameter space often indicates a priori unlikely values, so it is also rare for the data to yield posterior parameters close to the edge. Thus the ratio between the volumes is close to one and the variance of $\hat{R}$ is small. In fact, the adjustment did not have any sizeable impact on any of the examples simulated in Section 4. This may not be the case, however, if the actual data-generating mechanism is very different from the model being considered; in that case the posterior mean can in practice be very close to the edge. We show one example of this in Supplement B [24].

In practice we have found that a small number of simulations, around $N=100$, is usually enough. Confidence intervals obtained from the fact that $\hat{R}$ is asymptotically normal can be used to check whether the variance of $\hat{R}$ is large; in that case $N$ should be increased to yield a more precise approximation. The computational cost of implementing this adjustment is typically small.

4. Examples

We now describe several simulated and real-data examples to assess the THAMES estimator. In Sections 4.1, 4.2, and 4.3, we consider three statistical models for which exact expressions of the marginal likelihood are available. This allows us to compare the THAMES estimates to the exact values for evaluation. In Section 4.4, we consider a real-data example with models for which, to our knowledge, no analytical expressions for the marginal likelihood are available, and where there is a need for reliable estimators. We compare our estimator to bridge sampling, which is more complicated than THAMES but is known to perform well [22, 10].

4.1. Multivariate Gaussian data

We first consider the case where data $Y_i$, $i=1,\ldots,n$, are drawn independently from a multivariate normal distribution:

$$Y_i\mid\mu\overset{\mathrm{iid}}{\sim}\mathrm{MVN}_d(\mu,\,I_d),\quad i=1,\ldots,n,$$

along with a prior distribution on the mean vector μ:

$$p(\mu)=\mathrm{MVN}_d(\mu;\,0_d,\,s_0 I_d),$$

with $s_0>0$. As shown in Supplement A [23], the posterior distribution of the mean vector $\mu$ given the data $\mathcal{D}=(y_1,\ldots,y_n)$ is given by:

$$p(\mu\mid\mathcal{D})=\mathrm{MVN}_d(\mu;\,m_n,\,s_n I_d), \tag{22}$$

where $m_n=n\bar{y}\big/(n+1/s_0)$, $\bar{y}=(1/n)\sum_{i=1}^{n}y_i$, and $s_n=1\big/(n+1/s_0)$.

Interestingly, while the observations $(Y_i)_i$ are independent given the vector $\mu$, they are not independent marginally, and the marginal likelihood does not take the form of a product over marginal terms in $i$. Conversely, thanks to the isotropic Gaussian prior distribution considered for $\mu$, in which the $(\mu_j)_j$ are all iid, not only are the vectors $(Y_{\cdot j})_j$ independent given $\mu$, they are also independent marginally. From this property, we prove in Proposition 2 of Supplement A [23] that the marginal likelihood of the model can be written analytically as

$$p(\mathcal{D})=\prod_{j=1}^{d}\mathrm{MVN}_n\big(y_{\cdot j};\,0_n,\,s_0 1_n 1_n^T+I_n\big), \tag{23}$$

where $y_{\cdot j}\in\mathbb{R}^n$ is the vector of all observations of variable $j$, so that $(y_{\cdot j})_i=y_{ij}$, and $1_n$ is the vector of ones in $\mathbb{R}^n$.
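Equation (23) can be evaluated without any multivariate normal routine, since the matrix determinant lemma gives $|s_0 1_n 1_n^T+I_n|=1+ns_0$ and the Woodbury identity gives the quadratic form directly. A minimal R sketch:

```r
# Exact log marginal likelihood from Eq. (23); Y is an n x d data matrix.
log_Z_exact <- function(Y, s0) {
  Y <- as.matrix(Y); n <- nrow(Y)
  sum(apply(Y, 2, function(yj) {
    # y^T (s0 * 1 1^T + I)^{-1} y via Woodbury; log-determinant is log(1 + n s0)
    quad <- sum(yj^2) - s0 * sum(yj)^2 / (1 + n * s0)
    -0.5 * (n * log(2 * pi) + log(1 + n * s0) + quad)
  }))
}

set.seed(1)
Y <- matrix(rnorm(20, mean = 2), 20, 1)  # n = 20, d = 1, as in the text
log_Z_exact(Y, s0 = 1)
```

This exact value is the benchmark against which the THAMES estimates are compared in Figures 2 and 3.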

Assessing the precision of the THAMES estimator as a function of T

We first considered the univariate case $d=1$. Thus, we simulated a unique sample of size $n=20$ with $\mu=2$. Moreover, we set $s_0=1$, for illustration; other choices of $s_0$ led to similar conclusions regarding the quality of the estimation. Figure 2 shows the THAMES estimates of the log marginal likelihood for $T=5,1005,2005,\ldots,9005$ samples from the posterior distribution (Equation (22)). Confidence intervals, as well as the exact value of the log marginal likelihood computed using Equation (23), are also reported. It can be seen that the estimate converges to the correct value and that the confidence intervals contain the true value in all cases, even for $T=5$.

Figure 2: Estimation of the log marginal likelihood using THAMES for a unique univariate Gaussian sample with $n=20$, as a function of $T$, the number of (cumulative) samples from the posterior distribution. The black dots indicate the values of the THAMES estimator of the log marginal likelihood. The vertical lines represent 95% confidence intervals, and the dashed blue line represents the exact value computed using Equation (23).

Assessing the precision of the THAMES estimator as a function of d

For this second set of experiments, we considered different values of $d$, and aimed to test the robustness of the THAMES approach on multiple data sets with increasing dimensionality. Thus, for each $d$, we generated 50 different data sets of size $n=20$ using the multivariate Gaussian model. In practice, we set the true value of every component of $\mu$ to 2. Again, the prior parameter $s_0$ was set to 1, and similar conclusions were drawn for other values. Moreover, the value of $T$ was set to 10,000 for all the experiments.

We also used this example to assess the sample splitting procedure for the posterior samples, as proposed in Section 3.3. The results are given in Figure 3. In the left panel, where no sample splitting of the posterior samples is used to compute THAMES, we observe that a bias appears as the dimensionality of the model increases, and the log marginal likelihood tends to be slightly underestimated. As illustrated by the right panel, this bias is primarily related to the estimation of the posterior covariance matrix, and not to the THAMES estimation itself. Indeed, if the exact expression of the posterior covariance matrix given in Equation (22) is used to compute THAMES, then while the variance of the estimator increases with $d$, we do not observe any bias. Crucially, if sample splitting of the posterior samples is employed to compute THAMES (middle panel), then again we do not observe any bias.

Figure 3: Difference between the estimated log marginal likelihood using THAMES and the true log marginal likelihood for a multivariate Gaussian model with $n=20$ and $T=10000$. The procedure is repeated on 50 different data sets for each $d$. Left: the THAMES approach with no sample splitting of the posterior samples. Middle: the THAMES approach with sample splitting of the posterior samples. Right: the case where the exact expression of the posterior covariance matrix given in Equation (22) is used to compute THAMES.

Overall, we found that the sample splitting procedure was not necessary to compute THAMES for low values of $d$; the estimated values are then particularly close to the exact ones. However, for large values of $d$, we recommend using the sample splitting procedure to remove the bias.

4.2. Bayesian Regression

We consider a data set $(x_i,Y_i)$, $i=1,\ldots,n$, used to train a linear regression model of the form

$$Y_i\mid x_i,\beta,\sigma^2\sim\mathcal{N}\big(x_i\beta,\,\sigma^2\big),\quad i=1,\ldots,n.$$

In this section, the goal is to assess the quality of our proposed estimator. As such, we choose a prior on $(\beta,\sigma^2)$ for which an exact expression for the marginal likelihood, $Z$, exists. We compare our estimator, the THAMES, to the bridge sampling estimator implemented in Gronau et al. [10] and to a simple Monte Carlo (MC) estimator, calculated by averaging the likelihood over parameter values simulated from the prior.

Denoting by $Y\in\mathbb{R}^n$ the vector of target variables $Y_i$, and by $X\in\mathcal{M}_{n\times(d-1)}(\mathbb{R})$ the design matrix in which the input vectors $x_i\in\mathbb{R}^{d-1}$ are stacked as row vectors, the linear regression model becomes:

$$Y\mid X,\beta,\sigma^2\sim\mathrm{MVN}_n\big(X\beta,\,\sigma^2 I_n\big).$$

We rely on a centered isotropic Gaussian prior distribution for the regression vector $\beta$, and an inverse-gamma prior for the variance $\sigma^2$:

$$p(\beta\mid\sigma^2)=\mathrm{MVN}_{d-1}\big(\beta;\,0_{d-1},\,g\sigma^2(X^TX)^{-1}\big),\qquad p(\sigma^2)=\mathrm{InvGamma}\big(\sigma^2;\,\tfrac{1}{2}\nu_0,\,\tfrac{1}{2}\sigma_0^2\nu_0\big),$$

with $g,\sigma_0^2,\nu_0>0$. Introduced by Zellner [39, 40], this framework offers a conjugate prior with the particularly attractive property of scale-invariance with respect to the regressors [14]. The posterior distribution of $(\beta,\sigma^2)$, given the training data set $\mathcal{D}=\big((x_1,y_1),\ldots,(x_n,y_n)\big)$, is then tractable:

$$p(\beta\mid\sigma^2,\mathcal{D})=\mathrm{MVN}_{d-1}\Big(\beta;\,\tfrac{g}{g+1}m_n,\,\tfrac{g}{g+1}\sigma^2(X^TX)^{-1}\Big),$$
$$p(\sigma^2\mid\mathcal{D})=\mathrm{InvGamma}\Big(\sigma^2;\,\tfrac{1}{2}(\nu_0+n),\,\tfrac{1}{2}\big(\nu_0\sigma_0^2+s_n\big)\Big),$$

with $s_n=y^Ty-\tfrac{g}{g+1}\,y^TX(X^TX)^{-1}X^Ty$ and $m_n=(X^TX)^{-1}X^Ty$, where $y\in\mathbb{R}^n$ is the observed vector of target variables associated with $Y$. Moreover, the marginal likelihood also has an analytical expression:

$$p(y\mid X)=(g+1)^{-(d-1)/2}\,\pi^{-n/2}\,\frac{\Gamma\big(\tfrac{1}{2}(\nu_0+n)\big)}{\Gamma\big(\tfrac{1}{2}\nu_0\big)}\,\frac{\big(\nu_0\sigma_0^2\big)^{\nu_0/2}}{\big(\nu_0\sigma_0^2+s_n\big)^{(\nu_0+n)/2}}.$$

Proofs of the exact expressions for the posterior and the marginal likelihood are given in Hoff [14, Chapter 9]. The data for this example are described by Hastie et al. [12] and come from a study by Stamey et al. [36], who examined the correlation between the level of prostate-specific antigen (lpsa) and eight clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The target variable is the level of prostate-specific antigen (lpsa).
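The analytic marginal likelihood above is easily evaluated on the log scale; a minimal R sketch that mirrors the displayed expression, with `p = d - 1` regressors:

```r
log_Z_zellner <- function(y, X, g, nu0, sigma02) {
  n <- length(y); p <- ncol(X)                 # p = d - 1 regressors
  Xty <- crossprod(X, y)                       # X^T y
  s_n <- sum(y^2) -
    (g / (g + 1)) * drop(crossprod(Xty, solve(crossprod(X), Xty)))
  -(p / 2) * log(g + 1) - (n / 2) * log(pi) +
    lgamma((nu0 + n) / 2) - lgamma(nu0 / 2) +
    (nu0 / 2) * log(nu0 * sigma02) -
    ((nu0 + n) / 2) * log(nu0 * sigma02 + s_n)
}
```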

The choice of the hyperparameter $g$ is a topic of much discussion [28, 6]. In Porwal and Raftery [28], $g=n$ showed good performance compared to a variety of alternatives, albeit in a slightly different setting where the prior on $\sigma^2$ is improper. For this reason, we chose $g=n$. We chose $(\nu_0,\sigma_0^2)=(4,1)$ for the other hyperparameters; other choices of $(g,\nu_0,\sigma_0^2)$ led to similar conclusions regarding the quality of the estimation.

Seven different regression models, $M_2,M_3,\ldots,M_8$, each with a different number of selected variables, ranging from 2 to 8, are considered for illustration. The variables are added in the order given above. Thus, $M_2$ includes the predictor variables lcavol and lweight, while model $M_3$ uses the variables lcavol, lweight, and age for prediction. Finally, model $M_8$ takes all 8 input variables into account. Figure 4 shows the estimators of the log marginal likelihood for different numbers of samples from the posterior distribution of $(\beta,\sigma^2)$, for the different models, as well as the approximate confidence intervals.

Figure 4: Log marginal likelihood (dotted line) and its estimators (dots) for the prostate data set. The approximate confidence intervals of the estimators are also indicated. Bridge sampling and THAMES are on point, while the simple Monte Carlo estimator does not seem to have converged.

Sample splitting was used, and there was no noticeable bias in the results, even though we did not correct the THAMES for the bounded parameter $\sigma^2$, because the posterior mode of this parameter is far away from 0. We also calculated the bias correction from Equation (15), which had no numerical impact, even for a posterior sample size as small as $T=50$.

While the simple Monte Carlo estimator did not converge, the bridge sampling estimator and the THAMES behaved very similarly. Indeed, both estimators converged rapidly to the correct value, and the intervals covered the correct values in most cases, even when the number of samples used was small, for all models investigated. Figure 5 zooms in on the estimates produced by these methods on a finer scale for the full model, in which all eight clinical measures are included. Notably, the confidence intervals obtained by the THAMES were more conservative and wider than those obtained for the bridge sampling estimator, while the estimators themselves converged at a similar speed and in a similar manner.

Figure 5: Log marginal likelihood (dotted line) and its estimators (dots) for the prostate data set with the full model (i.e., all eight clinical measures are selected). The approximate confidence intervals of the estimators are also indicated. The confidence intervals of the THAMES are more conservative, while the ones obtained from bridge sampling are narrower.

For all models, these two estimators are particularly precise with only 1000 samples from the posterior. While the main goal of this section is to illustrate the precision of our estimation strategy for a series of models, we can also report that the model with the highest marginal likelihood, among those considered for this data set, is model $M_2$. In other words, the variables lcavol and lweight are seen as key for the prediction of the level of prostate-specific antigen.

4.3. Dirichlet-multinomial model

Extensions of the Dirichlet-multinomial model are widely used in the context of topic modelling, see, e.g., Blei et al. [2]. The expression for the marginal likelihood in this model is known, as in the previous two sections. This allows us to assess the performance of our estimator in another simulation study, in a non-Gaussian context.

A simulation study in this setting is useful for two reasons: First, this is a high-dimensional setting in which the posterior distribution of the parameters is highly non-Gaussian. In fact, the parameter space is bounded. This allows us to assess how well the THAMES performs in a very different setting, and also how much of an impact the correction for a bounded parameter space from Section 3.3 has. We check this numerically in Supplement B [24].

Second, there do exist similar models to this one for which the marginal likelihood is not tractable, e.g. Blei and Lafferty [1]. These models are therefore a possible application of the THAMES. The simulation study might give an idea of how well the THAMES would perform in these applications.

The Dirichlet-multinomial model is defined as follows. Each data point $Y_i\in\{0,\ldots,l\}^K$ is drawn from a multinomial distribution given a Dirichlet-distributed random variable $\mu$:

$$\mu\sim\mathrm{Dirichlet}(\mu;\,a_0,\ldots,a_0),\qquad Y_i\mid\mu\overset{\mathrm{i.i.d.}}{\sim}\mathrm{Multinomial}(l,\mu),\quad i=1,\ldots,n.$$

Here, $\mu$ is positive and $K$-dimensional with components summing to 1. The covariance matrix of $\mu$ is thus necessarily singular. As noted in Remark 6, the THAMES needs to be used on posterior simulations from the subspace of $\mathbb{R}^K$ on which a density is defined. In this case, this is $\mathbb{R}^{K-1}\equiv\mathbb{R}^d$. The prior density is thus

$$\pi(\mu_1,\ldots,\mu_d)=\mathrm{Dirichlet}\Big(\mu_1,\ldots,\mu_d,\,1-\sum_{j=1}^{d}\mu_j;\,a_0,\ldots,a_0\Big).$$

The posterior support is $\big\{\mu\in\mathbb{R}^d:\sum_{j=1}^{d}\mu_j<1,\ \mu_1,\ldots,\mu_d>0\big\}$. The posterior distribution given the data $\mathcal{D}=(y_1,\ldots,y_n)$ is tractable:

$$p(\mu_1,\ldots,\mu_d\mid\mathcal{D})=\mathrm{Dirichlet}\Big(\mu_1,\ldots,\mu_d,\,1-\sum_{j=1}^{d}\mu_j;\,\alpha_1,\ldots,\alpha_K\Big),$$

with $\alpha_j=a_0+\sum_{i=1}^{n}y_{ij}$. The marginal likelihood is thus also tractable, using Bayes' theorem.

Results

The marginal likelihood was estimated in the setting $(n,l,T,a_0)=(400,150,10000,1)$, with $d$ varying over 1, 20, 50 and 100. The quantities $n$ and $l$ were intentionally chosen to be large, since this model has very high-dimensional applications. For example, Blei et al. [2] used a data set with 8000 documents and $n=15{,}818$ words, and used up to $K=200$ different topics.

As mentioned, an alternative to the correction proposed in Section 3.3 is to reparametrize μ such that the support of the parameter is unconstrained. We did this by setting

$$\big(\mu_1^{(t)},\ldots,\mu_d^{(t)}\big)\equiv\frac{\big(\exp\vartheta_1^{(t)},\ldots,\exp\vartheta_d^{(t)}\big)}{\exp\big(-\sum_{k=1}^{d}\vartheta_k^{(t)}\big)+\sum_{k=1}^{d}\exp\vartheta_k^{(t)}}=\mathrm{softmax}\big(\vartheta^{(t)}\big),\quad t=1,\ldots,T.$$

We are using a bijective version of the softmax function (we take the first $d$ elements of the softmax applied to a parameter vector constrained to sum to 0, which puts the Dirichlet sample in one-to-one correspondence with that parameter), together with the induced prior $\pi_2(\vartheta)\equiv\pi\big(\mathrm{softmax}(\vartheta)\big)\,\big|\mathrm{Jac}_{\mathrm{softmax}}(\vartheta)\big|$. We stress that this procedure is not necessary to calculate the THAMES, since the THAMES can be calculated on any parameter space. It is, however, necessary for the bridge sampling estimator implemented in Gronau et al. [10].
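A sketch of this map and the induced log prior in R, evaluating the Jacobian numerically rather than in closed form (this assumes the numDeriv package; `log_prior_mu` is a hypothetical function returning the Dirichlet log density of the full vector $(\mu,1-\sum_j\mu_j)$):

```r
library(numDeriv)  # assumed available, for numerical Jacobians

softmax_bij <- function(v) {      # v in R^d; the full K-vector sums to 0
  e <- exp(v)
  e / (exp(-sum(v)) + sum(e))     # first d coordinates of mu
}

to_v <- function(mu) {            # inverse map, applied to each posterior draw
  mu_K <- c(mu, 1 - sum(mu))      # recover the K-th coordinate
  (log(mu_K) - mean(log(mu_K)))[seq_along(mu)]
}

log_prior_induced <- function(v, log_prior_mu) {
  mu <- softmax_bij(v)
  J  <- jacobian(softmax_bij, v)  # d x d Jacobian, computed numerically
  log_prior_mu(mu) +
    as.numeric(determinant(J, logarithm = TRUE)$modulus)  # + log |det Jacobian|
}
```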

Table 1 shows the results from calculating the THAMES and the bridge sampling estimator on $\vartheta^{(1)},\ldots,\vartheta^{(T)}$ (see Footnote 2). A fixed parameter $\mu=(1/K,\ldots,1/K)$ was used, and 50 different samples were generated using the parameters $l$ and $\mu$. The MC estimator was also computed for comparison.

Table 1: Average CPU times (in seconds per 10,000 posterior draws), as well as mean absolute errors and standard deviations, for bridge sampling, the THAMES and the naive Monte Carlo (MC) estimator (errors of the latter were rounded to 1 decimal place). Estimates using MC are quickest to compute, but also the least precise. The THAMES is much faster than bridge sampling, although point estimates from the latter are more accurate.

            d = 1                          d = 20
         MAE      SD         Time       MAE       SD        Time
Bridge   0.0001   0.0002     6.0026     0.0019    0.0024    36.9822
MC       0.093    0.1145     0.0020     6903.4    942.444   0.0008
THAMES   0.0064   0.0087     0.0158     0.0197    0.025     0.2486

            d = 50                         d = 100
         MAE      SD         Time       MAE       SD        Time
Bridge   0.0037   0.0046     58.9884    0.0086    0.0108    116.5902
MC       13879.1  1231.7635  0.0004     18418.1   844.7528  0.0046
THAMES   0.0315   0.039      0.3094     0.0473    0.0617    1.0532

Both bridge sampling and the THAMES outperformed the MC estimator. Additionally, while the bridge sampling estimator performed better in terms of mean absolute error, the THAMES is not only easier but also quicker to compute, with the difference in computation time growing as the dimension of the parameter space increases. This is likely because the THAMES does not require evaluations of the likelihood beyond the precomputed likelihood values of the posterior sample, so its computation time grows much more slowly with increasing $d$, whereas bridge sampling does require additional evaluations, which take up an increasing amount of computation time. The average computation time of the bridge sampling estimator is about 361 times that of the THAMES for $d=1$, and 118 times for $d=100$. However, we would like to emphasize that, in our opinion, the real strength of our estimator lies in the fact that it is not only quick, but also easy to implement.

4.4. Mixed effects model

Netherlands schools data

To demonstrate the performance of THAMES on a random effects model, we consider the Netherlands (NL) schools dataset of Snijders and Bosker [35]. For our purposes, the data consist of language test scores of 2,287 eighth-grade pupils from 133 classes (in 131 schools) in the Netherlands. We denote by $y_{ij}\in\mathbb{R}$ the language test score of pupil $i$ in class $j$, where $j\in\{1,\ldots,J\}$ with $J=133$, and $i\in\{1,\ldots,n_j\}$ with $n_j$ the size of class $j$. Let $n=\sum_{j=1}^{J}n_j=2{,}287$ denote the full sample size.

We aim to determine if there is clustering of language test scores by class, with some classes performing significantly better than others on average. To do this, we fit both a simple mean model (which treats test scores of students in the same class as independent) and a random intercept model (which accounts for correlation of test scores within each class) to the data. The former (null) model $H_0$ posits that all classes perform the same, on average, while the latter (alternative) model $H_1$ allows for variation in performance at the class level. We estimate the log marginal likelihoods for the two models, $\ell_0(y)$ and $\ell_1(y)$, respectively, using the THAMES. For comparison, we also compute estimates using bridge sampling [10] and a simple Monte Carlo (MC) estimator that averages the likelihood against draws from the prior. With estimates of $\ell_0(y)$ and $\ell_1(y)$, we estimate the log Bayes factor $b_{01}$ to conduct a Bayesian hypothesis test of $H_0$ versus $H_1$. Note that posterior simulation and marginal likelihood calculation are not analytically tractable for this model. As such, the use of approximate posterior sampling (e.g., via MCMC) and marginal likelihood estimation (e.g., via the THAMES) is required.

Linear model (LM)

We first consider a simple mean model (denoted LM), which posits that

$$y_{ij}=\mu+\varepsilon_{ij},\qquad j\in\{1,\ldots,J\},\quad i\in\{1,\ldots,n_j\},\quad \textstyle\sum_j n_j=n,$$
$$\varepsilon_{ij}\overset{\mathrm{iid}}{\sim}N\big(0,\sigma_\varepsilon^2\big),$$
$$\mu\sim N\big(\hat{\mu},\hat{\sigma}_\mu^2\big),$$
$$\sigma_\varepsilon^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon\big).$$

The fixed hyperparameters $(\hat{\mu},\hat{\sigma}_\mu^2,\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon)$ are specified so as to ensure that the prior distribution is dispersed relative to the likelihood, but on the same scale, as

$$\hat{\mu}=\mathrm{mean}(y_{ij})=40.93,\qquad \hat{\sigma}_\mu=\sqrt{2}\,\mathrm{sd}(y_{ij})=12.73,$$
$$\hat{\nu}_\varepsilon=0.5,\qquad \hat{\beta}_\varepsilon=0.5\,\mathrm{var}(y_{ij})=40.53.$$

The hyperparameters $(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon)$ are chosen so that the prior mean of the precision $1/\sigma_\varepsilon^2$ equals $1/\mathrm{var}(y_{ij})$. The set of parameters to be estimated in this model, $(\mu,\sigma_\varepsilon^2)$, has dimension $d=2$. As we are not using a conjugate prior for the linear model, the marginal likelihood does not admit an analytic expression in this case.

Full linear mixed model (full LMM)

We consider the random intercept model (denoted full LMM):

$$y_{ij}=\mu+\alpha_j+\varepsilon_{ij},$$
$$\varepsilon_{ij}\overset{\mathrm{iid}}{\sim}N\big(0,\sigma_\varepsilon^2\big),\qquad \alpha_j\overset{\mathrm{iid}}{\sim}N\big(0,\sigma_\alpha^2\big),$$
$$\mu\sim N\big(\hat{\mu},\hat{\sigma}_\mu^2\big),$$
$$\sigma_\varepsilon^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon\big),\qquad \sigma_\alpha^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\alpha,\hat{\beta}_\alpha\big).$$

Here $(\hat{\mu},\hat{\sigma}_\mu^2,\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon)$ are as above, and we specify $\hat{\nu}_\alpha=0.5$ and $\hat{\beta}_\alpha=0.5\,\mathrm{var}(\hat{\mu}_j)=13.77$, where $\hat{\mu}_j=\frac{1}{n_j}\sum_{i=1}^{n_j}y_{ij}$ is the sample mean for class $j\in\{1,\ldots,J\}$. The hyperparameters $(\hat{\nu}_\alpha,\hat{\beta}_\alpha)$ are chosen so that the prior mean of the precision $1/\sigma_\alpha^2$ equals $1/\mathrm{var}(\hat{\mu}_j)$. The set of parameters to be estimated in this model, $(\mu,\sigma_\varepsilon^2,\sigma_\alpha^2,\alpha)$, has dimension $d=136$.

Reduced linear mixed model (reduced LMM)

Note that the intercept parameters of the full LMM are not identifiable, as there is give-and-take between estimating the grand mean $\mu$ and the random intercepts $\alpha_j$. By absorbing $\alpha_j$ into the error term $\varepsilon_{ij}$, we can specify an equivalent model (having the same marginal likelihood) with $d=3$ identifiable parameters $(\mu,\sigma_\varepsilon^2,\sigma_\alpha^2)$. Mathematically, this amounts to marginalizing the $\alpha_j$'s out of the model. The model (which we call reduced LMM) is given by

$$y_{ij}=\mu+\varepsilon_{ij},$$
$$\varepsilon_{ij}\sim N\big(0,\sigma_\varepsilon^2+\sigma_\alpha^2\big),$$
$$\mathrm{Cov}\big(\varepsilon_{ij},\varepsilon_{i'j}\big)=\sigma_\alpha^2,\quad i,i'\in\{1,\ldots,n_j\},\ i\ne i',$$
$$\mathrm{Cov}\big(\varepsilon_{ij},\varepsilon_{i'j'}\big)=0,\quad j\ne j',$$
$$\mu\sim N\big(\hat{\mu},\hat{\sigma}_\mu^2\big),$$
$$\sigma_\varepsilon^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon\big),\qquad \sigma_\alpha^2\sim\mathrm{InverseGamma}\big(\hat{\nu}_\alpha,\hat{\beta}_\alpha\big).$$

Here $(\hat{\mu},\hat{\sigma}_\mu^2,\hat{\nu}_\varepsilon,\hat{\beta}_\varepsilon,\hat{\nu}_\alpha,\hat{\beta}_\alpha)$ are as above.

Results

Figure 6 shows the log marginal likelihood of the NL schools data for each model, computed using the THAMES, bridge sampling, and simple Monte Carlo estimators, with approximate 95% confidence intervals, as a function of the number of posterior MCMC or prior MC draws. Bridge sampling is a popular state-of-the-art method for estimating log marginal likelihoods from posterior MCMC samples, which is substantially more complicated computationally than the THAMES. Posterior MCMC sampling is carried out in R using Stan [29, 37]. We use values of the sample size $T$ evenly spaced between 1,000 and 20,000. For each $T$, we run 4 chains in parallel for $T/2$ iterations and remove the first $T/4$ as burn-in, yielding $T/4$ MCMC samples from each of the 4 chains, which are used to compute the THAMES and bridge sampling estimates.

Figure 6: Log marginal likelihood estimates for models fitted to the NL schools data.

THAMES provides consistent estimates of the log marginal likelihood, with greater precision as the posterior sample size grows. As we would expect, THAMES converges much faster for the LM (with $d=2$) and the reduced LMM (with $d=3$) than for the full LMM (with $d=136$), although the estimates for the reduced and full LMM converge to the same value. While the posterior support of this model is constrained due to the variance parameters $(\sigma_\varepsilon^2,\sigma_\alpha^2)$, we found that the truncation correction defined in Section 3.3 had no impact on the results. For a given posterior sample size, we find that bridge sampling generally produces more precise estimates than THAMES. However, THAMES has the advantage of being much simpler to implement and more computationally efficient in practice. On average over the samples in Figure 6, bridge sampling required 6.4 times as much compute time as THAMES for the full LMM, 556.5 times as much for the reduced LMM, and 26.8 times as much for the LM when estimated from the same number of posterior draws, as reported in Table 2 (see Footnote 3). The MC estimator, while fast and theoretically unbiased and consistent, suffers from substantial variance. In the left panel of Figure 6, the MC estimates are not shown, as they lie outside the range of the plot.

Using the THAMES estimates of the log marginal likelihoods for the LM, $\hat{\ell}_0(y)$, and the (reduced) LMM, $\hat{\ell}_1(y)$, with 20,000 posterior draws, the log Bayes factor $b_{01}$ is estimated as

$$\hat{b}_{01}=\hat{\ell}_0(y)-\hat{\ell}_1(y)=-8278.842+8136.561=-142.281,$$

indicating decisive evidence in favor of the random intercept model [18].

5. Discussion

We have proposed an estimator of the reciprocal of the marginal likelihood, called THAMES, which is simple to compute, unbiased, consistent, has finite variance and is asymptotically normal, with available confidence intervals. It is a version of reciprocal importance sampling. The estimator has one user-specified control parameter, and we have derived an optimal value for this in the situation where the posterior distribution is normal, which is of great interest because posterior distributions are asymptotically normal in many situations. We have carried out several numerical experiments in which the estimator performs well.

A similar proposal was made independently by McEwen et al. [21] under the name “learnt harmonic mean estimator”, where a variety of different sample models were suggested to work in conjunction. One of these models, the “hypersphere”, corresponds to the THAMES, the difference being that no theoretical results were given for the optimal control parameter, $c$. Instead, $c$ was optimized numerically as the minimum of the second harmonic moment, via the Brent hybrid root-finding algorithm. In the only high-dimensional application, which was in fact a Gaussian posterior, $c$ was not optimized, and it was noted that “alternative more effective target models can be developed that better scale to higher-dimensional settings”. We believe that the THAMES, with the suggested optimal control parameter, is such a model.

The THAMES relies on estimating the posterior covariance matrix and mean. In our experience it is important that the estimator chosen for the covariance matrix be accurate for estimating each matrix entry. Elementwise accuracy appears to be important because the covariance matrix is used to precisely define a quadratic inequality. For example, using a shrinkage estimator for the covariance matrix, which can produce large errors in a small proportion of its elements, has in our experience degraded the performance of the THAMES in some situations.

One possible alternative to covariance matrix estimation would be to select a minimum-volume covering ellipse which includes a certain percentage of those points of the posterior sample which have the largest value with respect to the (unnormalized) posterior density evaluated at those points. This would ensure that an HPD-region is well approximated, independent of the underlying posterior distribution. Determining a minimum-volume covering ellipse given a set of points can be difficult computationally, but this problem has been addressed in the literature in different settings and could possibly be adapted to the THAMES.

Supplementary Material

Supplement A

Supplement A: Proofs and Calculations. We prove the analytic results from Section 3 and derive the exact expression of the posterior and marginal density used for the multinomial likelihood in Section 4.

Supplement B

Supplement B: Additional Simulations. We give additional numerical results about the approximate behaviour of the THAMES in the normal case, as well as in the case where the posterior support is constrained.

Table 2: Average CPU times (in seconds per 1,000 posterior draws) to produce the estimates in Figure 6. The THAMES is faster than bridge sampling. Both take more CPU time for the same number of iterations than the naive Monte Carlo (MC) estimator, even though the latter is far less precise (see Figure 6).

         Full LMM   Reduced LMM   LM
Bridge   0.3815     1.6482        0.0723
MC       0.0002     0.0002        0.0002
THAMES   0.0581     0.0030        0.0027

Acknowledgments

The authors would like to thank the anonymous referees, an Associate Editor and the Editor for their constructive comments. The authors would further like to thank Christoph Richard from the Friedrich-Alexander-Universität Erlangen-Nürnberg for his helpful comments. Their comments dramatically improved the content and quality of this paper.

Funding

Irons’s research was supported by a Shanahan Endowment Fellowship and a Eunice Kennedy Shriver National Institute of Child Health and Human Development training grant, T32 HD101442-01, to the Center for Studies in Demography & Ecology at the University of Washington. Raftery’s research was supported by NIH grant R01 HD070936 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), by the Fondation des Sciences Mathématiques de Paris (FSMP), and by Université Paris-Cité.

Footnotes

1. Recall that the unstable harmonic mean estimator described by [26] was quite different, not being truncated, and being a harmonic mean of the likelihoods rather than of the unnormalized posterior density values.

2. Computations were performed on an Intel(R) Core(TM) i7-7700HQ CPU at 2.80GHz with 16 GB RAM.

3. Computations were performed on an Apple M1 chip with a 3.20GHz processor and 16 GB RAM.

References

  • [1] Blei DM and Lafferty JD (2007). “A correlated topic model of Science.” Annals of Applied Statistics, 1(1): 17–35.
  • [2] Blei DM, Ng AY, and Jordan MI (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3: 993–1022.
  • [3] Chib S (1995). “Marginal likelihood from the Gibbs output.” Journal of the American Statistical Association, 90: 1313–1321.
  • [4] DiCiccio TJ, Kass RE, Raftery AE, and Wasserman L (1997). “Computing Bayes factors by combining simulation and asymptotic approximations.” Journal of the American Statistical Association, 92: 903–915.
  • [5] Durmus A, Moulines E, and Pereyra M (2018). “Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau.” SIAM Journal on Imaging Sciences, 11: 473–506.
  • [6] Fernández C, Ley E, and Steel MF (2001). “Benchmark priors for Bayesian model averaging.” Journal of Econometrics, 100(2): 381–427. URL https://www.sciencedirect.com/science/article/pii/S0304407600000762
  • [7] Frühwirth-Schnatter S (2004). “Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques.” Econometrics Journal, 7: 143–167.
  • [8] Gelfand AE and Dey DK (1994). “Bayesian model choice: asymptotics and exact calculations.” Journal of the Royal Statistical Society: Series B (Methodological), 56: 501–514.
  • [9] Ghosal S (2000). “Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity.” Journal of Multivariate Analysis, 74: 49–68.
  • [10] Gronau QF, Singmann H, and Wagenmakers E-J (2020). “bridgesampling: An R Package for Estimating Normalizing Constants.” Journal of Statistical Software, 92(10): 1–29.
  • [11] Hajargasht G and Woźniak T (2018). “Accurate Computation of Marginal Data Densities Using Variational Bayes.” arXiv preprint. https://arxiv.org/pdf/1805.10036.pdf
  • [12] Hastie T, Tibshirani R, and Friedman JH (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
  • [13] Heyde CC and Johnstone IM (1979). “On asymptotic posterior normality for stochastic processes.” Journal of the Royal Statistical Society: Series B (Methodological), 41: 184–189.
  • [14] Hoff P (2009). A First Course in Bayesian Statistical Methods. Springer Texts in Statistics. Springer, New York. URL https://books.google.de/books?id=DykcMwEACAAJ
  • [15] Irons NJ, Perrot-Dockès M, and Metodiev M (2023). “thames: Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator.” R package version 0.1.0.
  • [16] Jameson GJO (2015). “A simple proof of Stirling’s formula for the gamma function.” The Mathematical Gazette, 99(544): 68–74. URL http://www.jstor.org/stable/24496904
  • [17] Jeffreys H (1961). Theory of Probability. Oxford, U.K.: Oxford University Press, 3rd edition.
  • [18] Kass RE and Raftery AE (1995). “Bayes factors.” Journal of the American Statistical Association, 90(430): 773–795.
  • [19] Liu B (2014). “Adaptive annealed importance sampling for multimodal posterior exploration and model selection with application to extrasolar planet detection.” The Astrophysical Journal Supplement Series, 213(1): 14.
  • [20] Llorente F, Martino L, Delgado D, and Lopez-Santiago J (2023). “Marginal likelihood computation for model selection and hypothesis testing: An extensive review.” SIAM Review, 65: 3–58.
  • [21] McEwen JD, Wallis CGR, Price MA, and Docherty MM (2022). “Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator.” arXiv preprint. https://arxiv.org/pdf/2111.12720.pdf
  • [22] Meng X-L and Wong WH (1996). “Simulating ratios of normalizing constants via a simple identity: A theoretical exploration.” Statistica Sinica, 6: 831–860.
  • [23] Metodiev M, Perrot-Dockès M, Ouadah S, Irons NJ, Latouche P, and Raftery AE (2024). “Supplement A to ‘Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator’.”
  • [24] — (2024). “Supplement B to ‘Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator’.”
  • [25] Miller JW (2021). “Asymptotic normality, concentration, and coverage of generalized posteriors.” The Journal of Machine Learning Research, 22: 7598–7650.
  • [26] Newton MA and Raftery AE (1994). “Approximate Bayesian inference with the weighted likelihood bootstrap.” Journal of the Royal Statistical Society: Series B (Methodological), 56: 3–26.
  • [27] Olver FWJ, Olde Daalhuis AB, Lozier DW, Schneider BI, Boisvert RF, Clark CW, Miller BR, Saunders BV, Cohl HS, and McClain MA, eds. “NIST Digital Library of Mathematical Functions.” Release 1.1.9 of 2023-03-15. URL https://dlmf.nist.gov/
  • [28] Porwal A and Raftery AE (2022). “Comparing methods for statistical inference with model uncertainty.” Proceedings of the National Academy of Sciences, 119(16): e2120737119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2120737119
  • [29] R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
  • [30] Robert CP and Wraith D (2009). “Computational methods for Bayesian model choice.” In AIP Conference Proceedings, volume 1193, 251–262. American Institute of Physics.
  • [31] Rockafellar R and Wets R-B (2009). Variational Analysis. Springer, 1st edition.
  • [32] Shen X (2002). “Asymptotic Normality of Semiparametric and Nonparametric Posterior Distributions.” Journal of the American Statistical Association, 97: 222–235.
  • [33] Sims CA, Waggoner DF, and Zha T (2008). “Methods for inference in large multiple-equation Markov-switching models.” Journal of Econometrics, 146(2): 255–274.
  • [34] Skilling J (2006). “Nested sampling for general Bayesian computation.” Bayesian Analysis, 1: 833–859.
  • [35] Snijders TAB and Bosker RJ (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling. Sage.
  • [36] Stamey T, Kabalin J, McNeal J, Johnstone I, Freiha F, Redwine E, and Yang N (1989). “Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients.” Journal of Urology, 16: 1076–1083.
  • [37] Stan Development Team (2022). “RStan: the R interface to Stan.” R package version 2.21.7. URL https://mc-stan.org/
  • [38] Sweeting T (1996). “On a Converse to Scheffé’s Theorem.” The Annals of Statistics, 14: 1252–1256.
  • [39] Zellner A (1971). An Introduction to Bayesian Inference in Econometrics. Krieger. URL https://books.google.de/books?id=paqiswEACAAJ
  • [40] — (1986). “Bayesian Estimation and Prediction Using Asymmetric Loss Functions.” Journal of the American Statistical Association, 81(394): 446–451. URL http://www.jstor.org/stable/2289234
