Published in final edited form as: SIAM/ASA J Uncertain Quantif. 2020 Aug 24;8(3):1139–1188. doi: 10.1137/18M122964X

Stratification as a general variance reduction method for Markov chain Monte Carlo

Aaron R. Dinner, Erik H. Thiede, Brian Van Koten, Jonathan Weare

Abstract

The Eigenvector Method for Umbrella Sampling (EMUS) [46] belongs to a popular class of methods in statistical mechanics which adapt the principle of stratified survey sampling to the computation of free energies. We develop a detailed theoretical analysis of EMUS. Based on this analysis, we show that EMUS is an efficient general method for computing averages over arbitrary target distributions. In particular, we show that EMUS can be dramatically more efficient than direct MCMC when the target distribution is multimodal or when the goal is to compute tail probabilities. To illustrate these theoretical results, we present a tutorial application of the method to a problem from Bayesian statistics.

1. Introduction.

Markov chain Monte Carlo (MCMC) methods have been widely used with great success throughout statistics, engineering, and the natural sciences. However, when estimating tail probabilities or when sampling from multimodal distributions, accurate MCMC estimates often require a prohibitively large number of samples. In this article, we analyze the Eigenvector Method for Umbrella Sampling (EMUS) [46]. We first proposed EMUS as a method for computing free energies, and we demonstrated that it was useful for treating the multimodality that typically arises in that context. Here, we demonstrate that EMUS is an effective general means of addressing the challenges posed not just by multimodality but also tail events, with potential applications to a broad range of problems in statistics, engineering, and the natural sciences.

EMUS was inspired by Umbrella Sampling [47] and other methods such as the Weighted Histogram Analysis Method (WHAM) [27] and the Multistate Bennett Acceptance Ratio (MBAR) [42] for computing potentials of mean force and free energies in statistical mechanics.1 We call these stratified MCMC methods since they each adapt the principle of stratified survey sampling to MCMC simulation. Stratified MCMC methods are among the most powerful, most successful, and most widely used tools in molecular simulation. (However, in contrast to our presentation here, they are not typically used in molecular simulation to compute averages of general observables.) WHAM, for example, has been instrumental for treating biomolecular processes ranging from protein folding [7] to conductance by ion channels [3].

While the practical utility of stratification has been established in many applications, the advantages and disadvantages of the method have remained poorly understood; cf. Remark 3.8 and [46]. Motivated by the substantial gap between theory and application of stratified MCMC within statistical mechanics, and also by the general challenges posed by multimodality and tail probabilities, the goal of this paper is to develop a clear theoretical explanation of the advantages of EMUS. Our theory suggests new applications of stratified MCMC (and EMUS in particular) to broad classes of sampling problems arising in statistics and statistical mechanics. For example, very recently EMUS was successfully applied to a parameter estimation problem in cosmology [34].

We now describe EMUS and its relationship to other MCMC methods. The EMUS algorithm proceeds roughly as follows:

  1. We divide the support of the target distribution into regions called strata. Associated to each stratum, we define a biased distribution whose support lies within that stratum. For example, one might let the biased distribution corresponding to a stratum be the target distribution conditioned on the stratum.

  2. We use MCMC to sample the biased distributions.

  3. We weight the samples from each stratum to compute estimates of general averages with respect to the target distribution.

EMUS belongs to a large class of MCMC methods that by various mechanisms promote a more uniform sampling of space. For example, in parallel tempering [44, 16], one uses MCMC samples drawn from a distribution or sequence of distributions close to the uniform distribution to speed sampling of the target distribution. The bias introduced by the choice of distributions is corrected either by reweighting the samples or by a replica exchange strategy [16]. The Wang–Landau [50] and Metadynamics methods [28] adaptively construct a biased distribution to achieve uniform sampling in certain coordinates. The temperature accelerated molecular dynamics method [33] is also designed to achieve uniform sampling in a given coordinate, but it works by entirely different means. In EMUS and other stratified MCMC methods, one achieves more uniform sampling by ensuring that each stratum contains points from at least one MCMC simulation.

EMUS also resembles certain methods for computing normalization constants of families of probability densities [17, 49, 35, 26]. The resemblance arises because the weights in the third step of EMUS are the normalization constants of the biased distributions. These methods have been used, for example, to compute Bayes factors in model selection problems [17] and for computations related to selection bias models [49]. However, despite a strong formal resemblance, EMUS has entirely different objectives from these methods: When computing normalization constants, the distributions analogous to our biased distributions are specified as part of the problem. By contrast, in EMUS and other stratified MCMC methods, the strata are chosen as in stratified survey sampling to maximize efficiency. EMUS is perhaps more similar in spirit to the parallel Markov chain Monte Carlo method [48]. Like EMUS, this method is designed to make MCMC more efficient.

Summary of Main Results.

Our most general results are a central limit theorem (CLT) for the EMUS method and a convenient upper bound on the asymptotic variance, cf. Theorem 3.3 and Theorem 3.5. We note that the proof of the upper bound relies on a new class of perturbation estimates for Markov chains which we derived in [45]. These estimates are substantially more detailed than previous results [10], cf. Remark 4.2. After proving the CLT, we address the dependence of the sampling error on the choice of strata. In particular, for a representative MCMC method, we estimate the asymptotic variances of trajectory averages sampling the biased distributions, cf. Theorem 3.7. Our estimate shows how factors such as the diameters of the strata influence the asymptotic variances.

In Section 4, we apply the general theory developed in Section 3 to case studies involving tail probabilities and multimodality. Our results concern two limits: a small probability limit and a low-temperature limit. In the small probability limit, we consider estimation of probabilities of the form

$p_M := \mathbb{P}[X \ge M].$

For a broad class of random variables X, we show that while the cost of computing pM with relative precision by direct MCMC increases exponentially with M, the cost by EMUS increases only polynomially; cf. Section 4.2. In the low-temperature limit, a parameter of the target distribution decreases, intensifying the effects of multimodality on the efficiency of MCMC sampling. We show that the cost of computing an average to fixed precision by direct MCMC increases exponentially in this limit, whereas the cost by EMUS increases only polynomially; cf. Section 4.1. We conclude that EMUS may be dramatically more efficient than direct MCMC sampling when the target distribution is multimodal or when the goal is to compute a small tail probability.

To illustrate our theoretical results, we present a tutorial numerical study applying EMUS to a problem in Bayesian statistics, cf. Section 5. In addition to illustrating the theory, our numerical study demonstrates the problems that may occur when EMUS and other similar stratified MCMC methods are used carelessly. It also addresses practical issues such as the choice of strata and the computation of error bars for averages estimated by EMUS.

The results in this article significantly extend and generalize the ideas in [46]. We first proposed the EMUS method with the goal of analyzing and improving umbrella sampling approaches in free energy calculations. Here, our goal is to establish EMUS as a general variance reduction technique, and we present many entirely new results, including an upper bound on the asymptotic variance of EMUS (Theorem 3.5), a condition to guide some aspects of the choice of strata (Remark 3.12), a theoretical argument demonstrating the benefits of EMUS for computing tail probabilities (Section 4.2), numerical results applying EMUS to Bayesian inference (Section 5), a method of correcting problems related to poorly chosen strata (Section 5.3), and a greatly improved numerical method for estimating the standard deviations of quantities computed by EMUS (Appendix F). In addition, we give complete justifications of some results that were stated without proof in [46], including Theorem 3.7 concerning the dependence of the sampling error on the choice of strata. Finally, we note that our results concerning multimodal distributions and the low-temperature limit generalize and clarify the results given in [46]; in particular, our Theorem 4.1 covers periodic boundary conditions and stratification in more than one variable.

2. The Eigenvector Method for Umbrella Sampling.

In this section, we derive the Eigenvector Method for Umbrella Sampling (EMUS), and we prove that it is consistent. We also derive a related method, iterative EMUS, and we compare iterative EMUS with the MBAR method from statistical mechanics [42].

2.1. Derivation of EMUS.

The objective of EMUS is to compute the average

$\pi[g] := \int_\Omega g(x)\,\pi(dx),$

of a function g with respect to a measure π defined on a set Ω. In EMUS, instead of sampling directly from π, we sample from biased distributions analogous to the strata in stratified survey sampling methods. We then weight the samples from the biased distributions to estimate π[g].

We assume that the biased distributions take the form

$\pi_i(dx) := \frac{\psi_i(x)\,\pi(dx)}{\pi[\psi_i]}$

for some set $\{\psi_i\}_{i=1}^L$ of non-negative bias functions defined on Ω. We call the support of $\psi_i$ the i'th stratum to make an analogy between the biased distributions of EMUS and the strata of stratified survey sampling.

The EMUS method is based on a formula expressing π[g] as a function of expectations over the biased distributions. To derive this formula, we assume that

$\sum_{i=1}^L \psi_i(x) > 0 \quad \text{for all } x \in \Omega.$

We then define

$g^*(x) := \frac{g(x)}{\sum_{k=1}^L \psi_k(x)}$

for any function $g : \Omega \to \mathbb{R}$, and we observe that

$\pi[g] = \int_\Omega g(x)\,\pi(dx) = \left(\sum_{k=1}^L \pi[\psi_k]\right)\sum_{i=1}^L \frac{\pi[\psi_i]}{\sum_{k=1}^L \pi[\psi_k]}\int_\Omega \frac{g(x)}{\sum_{k=1}^L \psi_k(x)}\,\frac{\psi_i(x)\,\pi(dx)}{\pi[\psi_i]} = \Psi\sum_{i=1}^L z_i\,\pi_i[g^*],$ (2.1)

where we set

$\Psi := \sum_{k=1}^L \pi[\psi_k] \quad\text{and}\quad z_i := \frac{\pi[\psi_i]}{\Psi}.$ (2.2)

We call the vector $z \in \mathbb{R}^L$ with entries $z_i$ the weight vector. Now let $\mathbf{1}$ be the constant function with $\mathbf{1}(x) = 1$ for all $x \in \Omega$. Taking $g = \mathbf{1}$ in equation (2.1) yields

$\Psi = \frac{1}{\sum_{i=1}^L z_i\,\pi_i[\mathbf{1}^*]},$

and so

$\pi[g] = \frac{\sum_{i=1}^L z_i\,\pi_i[g^*]}{\sum_{i=1}^L z_i\,\pi_i[\mathbf{1}^*]}.$ (2.3)

Therefore, to express π[g] as a function of averages over the biased distributions, it will suffice to express the weight vector z as a function of averages over the biased distributions. Taking g equal to ψj in (2.1) yields

$\Psi z_j = \pi[\psi_j] = \Psi\sum_{i=1}^L z_i\,\pi_i[\psi_j^*],$

which is equivalent to the eigenvector equation

$z^t F = z^t, \quad\text{where } F_{ij} := \pi_i[\psi_j^*].$ (2.4)

We call F the overlap matrix. We observe that F is stochastic (its entries are nonnegative and its rows each sum to one) and that z is a probability vector (its entries sum to one). Therefore, by the Perron–Frobenius theorem, as long as F is irreducible, the eigenvector problem (2.4) determines z as a function of F, a matrix whose entries are averages over the biased distributions. We assume throughout the remainder of this work that F is irreducible. We give a simple condition on the bias functions which guarantees irreducibility of F in Lemma 2.1 below. In general, for any irreducible, stochastic matrix $G \in \mathbb{R}^{L \times L}$, we will let $w(G) \in \mathbb{R}^L$ denote the unique solution of

$w(G)^t G = w(G)^t \quad\text{with}\quad \sum_{i=1}^L w_i(G) = 1.$

With this notation, z = w(F), and by (2.3) and (2.4) we have

$\pi[g] = \frac{\sum_{i=1}^L w_i(F)\,\pi_i[g^*]}{\sum_{i=1}^L w_i(F)\,\pi_i[\mathbf{1}^*]},$ (2.5)

expressing π[g] as a function of averages over the biased distributions, as desired.

In EMUS, we substitute MCMC estimates for the averages over biased distributions on the right hand side of (2.5) to estimate π[g]. To be precise, let Xti be a Markov process ergodic for πi. We call Xti the biased process sampling the biased distribution πi. The EMUS algorithm proceeds as follows:

  1. For each i = 1, …, L, compute Ni steps of the process Xti.

  2. Compute the averages
    $\bar{g}_i^* := \frac{1}{N_i}\sum_{t=1}^{N_i} g^*(X_t^i), \quad \bar{\mathbf{1}}_i^* := \frac{1}{N_i}\sum_{t=1}^{N_i} \mathbf{1}^*(X_t^i), \quad\text{and}\quad \bar{F}_{ij} := \frac{1}{N_i}\sum_{t=1}^{N_i} \psi_j^*(X_t^i).$
  3. Compute $w(\bar{F})$ numerically, for example from the QR factorization of $I - \bar{F}$ [19].

  4. Compute the estimate
    $\pi^{US}[g] := \frac{\sum_{i=1}^L w_i(\bar{F})\,\bar{g}_i^*}{\sum_{i=1}^L w_i(\bar{F})\,\bar{\mathbf{1}}_i^*}$
    of π[g].
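To make the four steps concrete, here is a minimal sketch in Python with NumPy (the language used for all code sketches in this document). The data layout (one array of bias-function values and one array of g-values per biased trajectory) and the power-iteration solver for $w(\bar{F})$ are our own illustrative choices, not prescriptions from the paper; the QR-based solver of [19] mentioned in step 3 is an alternative.

```python
import numpy as np

def emus_estimate(psis, gs, tol=1e-12, max_iter=10000):
    """Minimal EMUS sketch.

    psis : list of arrays; psis[i] has shape (N_i, L), with
           psis[i][t, j] = psi_j(X_t^i) for the trajectory sampling pi_i.
    gs   : list of arrays; gs[i] has shape (N_i,), with gs[i][t] = g(X_t^i).
    Returns the EMUS estimate of pi[g].
    """
    L = psis[0].shape[1]
    Fbar = np.zeros((L, L))
    gbar = np.zeros(L)    # trajectory averages of g* over each biased process
    onebar = np.zeros(L)  # trajectory averages of 1* over each biased process
    for i in range(L):
        denom = psis[i].sum(axis=1)                        # sum_k psi_k(X_t^i)
        Fbar[i] = (psis[i] / denom[:, None]).mean(axis=0)  # Fbar_ij
        gbar[i] = (gs[i] / denom).mean()                   # gbar_i*
        onebar[i] = (1.0 / denom).mean()                   # 1bar_i*
    # Solve w^t Fbar = w^t with sum(w) = 1 by power iteration
    # (the QR factorization of I - Fbar, as in [19], is an alternative).
    w = np.full(L, 1.0 / L)
    for _ in range(max_iter):
        w_new = w @ Fbar
        w_new /= w_new.sum()
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return (w * gbar).sum() / (w * onebar).sum()
```

When the bias functions are a partition of unity, as in Sections 4 and 5, the denominators above are identically one and the same code applies unchanged.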

Note that w(F¯) is defined only if F¯ is irreducible. In the following lemma, we state simple criteria for the irreducibility of F and F¯:

Lemma 2.1. The overlap matrix F is irreducible if and only if for every A ⊂ {1, 2, …, L}, we have

$\pi\!\left[\left(\sum_{i \in A}\psi_i\right)\left(\sum_{j \notin A}\psi_j\right)\right] > 0.$ (2.6)

The approximate overlap matrix $\bar{F}$ is irreducible if and only if for every A ⊂ {1, 2, …, L}, the set $\bigcup_{i \in A}\{x : \psi_i(x) > 0\}$ contains at least one sample point generated from one of the biased processes $X_t^j$ with $j \notin A$.

Proof. We prove only the second statement; proof of the first is similar. By definition, a non-negative matrix $M \in \mathbb{R}^{L \times L}$ is irreducible if and only if for every subset A ⊂ {1, 2, …, L} of the indices, there exist indices $i \in A$ and $j \notin A$ so that $M_{ji} > 0$. Now assume that for every A ⊂ {1, 2, …, L}, there exist $j \notin A$ and t ≥ 0 so that

$X_t^j \in \bigcup_{k \in A}\{x : \psi_k(x) > 0\}.$

Then for some $i \in A$, $\psi_i(X_t^j) > 0$, so $\bar{F}_{ji} > 0$, hence $\bar{F}$ is irreducible. ■

We claim that the EMUS estimator is consistent; that is, πUS[g] converges almost surely to π[g] as the total number of samples tends to infinity. To make this precise, we require the following assumption on the growth of Ni with the total number of samples:

Assumption 2.2. Let

$N = \sum_{i=1}^L N_i$

be the total number of samples from all biased distributions. Assume that for each i,

$\lim_{N \to \infty} N_i / N = \kappa_i > 0.$

That is, assume that when N is large, the proportion of samples drawn from the i’th biased distribution is fixed and greater than zero.

We now prove that EMUS is consistent:

Lemma 2.3. Under Assumption 2.2 and the irreducibility condition (2.6), πUS[g] converges almost surely to π[g] as the total number of samples N tends to infinity.

Proof. Since the processes Xti are ergodic,

$\bar{F} \xrightarrow{a.s.} F, \quad \bar{g}_i^* \xrightarrow{a.s.} \pi_i[g^*], \quad\text{and}\quad \bar{\mathbf{1}}_i^* \xrightarrow{a.s.} \pi_i[\mathbf{1}^*] \quad \text{as } N \to \infty.$ (2.7)

Moreover, by Lemma A.1 in Appendix A, w(G) is continuous at F. (Technically, w(G) admits an extension to the set of all $L \times L$ matrices, which is continuous at F.) Therefore, as a function of $\bar{F}$, $\bar{g}_i^*$, and $\bar{\mathbf{1}}_i^*$, $\pi^{US}[g]$ is continuous at F, $\pi_i[g^*]$, and $\pi_i[\mathbf{1}^*]$. It follows by the continuous mapping theorem and equation (2.5) that $\pi^{US}[g] \xrightarrow{a.s.} \pi[g]$. ■

2.2. Iterative EMUS and comparison with Vardi’s Estimator.

In this section, we explain how EMUS relates to Vardi’s estimator for selection bias models [49] and its descendants such as the popular Multistate Bennett Acceptance Ratio (MBAR) method [42]. In addition to comparing EMUS with these methods, we review a method, iterative EMUS, for solving the nonlinear system of equations defining Vardi’s estimator [46]. The first iterate of this method is exactly the EMUS estimator.

Vardi’s estimator is similar to EMUS, except that it uses the identity

$z_j = \sum_{i=1}^L z_i\,\pi_i\!\left[\frac{\psi_j\,N_i/z_i}{\sum_{k=1}^L \psi_k\,N_k/z_k}\right]$ (2.8)

instead of our eigenvector problem (2.4). (This identity appears as equation (1.12) in [18].) That is, Vardi's estimate $z^V$ of the weight vector is the solution of (2.8), but with trajectory averages replacing averages over the biased distributions:

$z_j^V = \sum_{i=1}^L z_i^V\,\bar{G}_{ij}(z^V),$ (2.9)

where for any positive $u \in \mathbb{R}^L$ we define

$\bar{G}_{ij}(u) := \frac{1}{N_i}\sum_{n=1}^{N_i}\frac{\psi_j(X_n^i)\,N_i/u_i}{\sum_{k=1}^L \psi_k(X_n^i)\,N_k/u_k}.$ (2.10)

By [49, Theorem 1], this nonlinear equation determines zV uniquely up to a constant multiple whenever the irreducibility criterion of Lemma 2.1 holds.

Vardi’s estimator was originally derived assuming that the samples Xti from the biased distributions were i.i.d. In that case, it is the nonparametric maximum likelihood estimator of the target distribution π given samples from the biased distributions πi [49], and it has certain optimality properties [18]. Several adjustments to the estimator have been proposed for the case of samples from Markov processes. In the Multistate Bennett Acceptance Ratio (MBAR) method, one replaces the factors Ni appearing in the summand in (2.10) with effective sample sizes ni, which are computed from estimates of the integrated autocovariance of a family of functions [42]. (In addition, in some versions of MBAR, the sample average over all Ni points on the right hand side of (2.10) is replaced with a sample average over the ni points obtained by including only every Ni/ni’th point along the trajectory Xti.) Another recent work proposes different effective sample sizes computed by minimizing an estimate of the asymptotic variance of the estimator [12]. In fact, the estimator is consistent with Ni replaced by any fixed positive number [12]. We have found that our numerical results do not depend sensitively on the choice of effective sample size, so we use Ni for simplicity.

We now review iterative EMUS, which we introduced in [46]. Iterative EMUS may be understood as a fixed point iteration for solving equation (2.9). The iteration proceeds as follows:

  1. As an initial guess for $z^V$, choose a positive vector $z^0 \in \mathbb{R}^L$. Set m = 0. Choose a tolerance τ > 0.

  2. Compute $\bar{G}_{ij}(z^m)$ by (2.10). Solve the eigenvector equation
    $z_j^{m+1} = \sum_{i=1}^L z_i^{m+1}\,\bar{G}_{ij}(z^m)$ (2.11)
    for an updated estimate $z^{m+1}$ of $z^V$.
  3. If $\max_i |z_i^{m+1} - z_i^m| / z_i^m > \tau$, then increment m and repeat step 2.

Remark 2.4. In a related work, we show that the eigenvector equation (2.11) has a unique solution for every m, and we suggest a numerical method for finding the solution [46]. We also discuss the convergence of iterative EMUS, and we show that for every fixed m, zm is a consistent estimator of the weight vector z.

Remark 2.5. If one chooses $z_i^0 = N_i/N$, then $z^1$ is the EMUS estimate $w(\bar{F})$ of z.
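The fixed-point iteration reuses the stored trajectories: each sweep reweights the samples by the current estimate of z and solves one eigenvector problem of the form (2.11). The sketch below (Python/NumPy) follows the data layout of the sketch in Section 2.1 and starts from $z_i^0 = N_i/N$, so its first iterate is the EMUS estimate; the helper `stationary`, the stopping rule mirroring step 3, and the cap on sweeps are implementation assumptions of ours.

```python
import numpy as np

def stationary(G, tol=1e-12, max_iter=10000):
    """Positive left eigenvector w with w^t G = w^t and sum(w) = 1 (power iteration)."""
    L = G.shape[0]
    w = np.full(L, 1.0 / L)
    for _ in range(max_iter):
        w_new = w @ G
        w_new /= w_new.sum()
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return w

def iterative_emus_weights(psis, tau=1e-8, max_sweeps=100):
    """psis[i] has shape (N_i, L) with psis[i][t, j] = psi_j(X_t^i)."""
    L = psis[0].shape[1]
    Ns = np.array([p.shape[0] for p in psis], dtype=float)
    z = Ns / Ns.sum()                       # z_i^0 = N_i / N, so z^1 = w(Fbar)
    for _ in range(max_sweeps):
        G = np.zeros((L, L))
        for i in range(L):
            denom = psis[i] @ (Ns / z)          # sum_k psi_k(X_n^i) N_k / z_k
            num = psis[i] * (Ns[i] / z[i])      # psi_j(X_n^i) N_i / z_i
            G[i] = (num / denom[:, None]).mean(axis=0)   # Gbar_ij(z^m), eq. (2.10)
        z_new = stationary(G)               # solves eq. (2.11)
        if np.max(np.abs(z_new - z) / z) <= tau:
            return z_new
        z = z_new
    return z
```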

3. Error Analysis of EMUS.

Here, we develop tools for analyzing the error of EMUS. First, in Section 3.1, we prove a CLT for EMUS, and we derive a convenient upper bound on the asymptotic variance. Then, in Section 3.2, we analyze the dependence of the asymptotic variance of EMUS on the choice of biased distributions. We use these tools in Section 4 to prove limiting results demonstrating the advantages of EMUS for multimodal distributions and tail probabilities.

3.1. A CLT for EMUS and an Estimate of the Asymptotic Variance.

In this section, we prove a Central Limit Theorem (CLT) for EMUS, and we derive an upper bound on the asymptotic variance σUS2(g) of πUS[g]. To prove the CLT for EMUS, we must assume that a CLT holds for trajectory averages over the biased processes:

Assumption 3.1. For any matrix H, let $H_{i:}$ denote the i'th row of H. Define $\bar{G} \in \mathbb{R}^{L \times 2}$ by

$\bar{G}_{i:} := (\bar{g}_i^*, \bar{\mathbf{1}}_i^*).$

Assume that

$\sqrt{N_i}\left((\bar{F}_{i:}, \bar{G}_{i:}) - (F_{i:}, \pi_i[g^*], \pi_i[\mathbf{1}^*])\right) \xrightarrow{d} \mathcal{N}(0, \Sigma_i)$ (3.1)

for some asymptotic covariance matrix $\Sigma_i \in \mathbb{R}^{(L+2) \times (L+2)}$ of the form

$\Sigma_i = \begin{pmatrix} \sigma^i & \rho^i \\ (\rho^i)^t & \tau^i \end{pmatrix},$ (3.2)

where $\sigma^i \in \mathbb{R}^{L \times L}$ denotes the asymptotic covariance of $\bar{F}_{i:}$ with itself, $\rho^i \in \mathbb{R}^{L \times 2}$ denotes the asymptotic covariance of $\bar{F}_{i:}$ with $\bar{G}_{i:}$, and $\tau^i \in \mathbb{R}^{2 \times 2}$ denotes the asymptotic covariance of $\bar{G}_{i:}$ with itself.

We expect a CLT to hold for most MCMC methods, target distributions, and target functions of interest in statistics and statistical mechanics. We refer to [40] for a comprehensive review of conditions guaranteeing a CLT. In Theorem 3.7 of Section 3.2, we prove a CLT and an estimate of the asymptotic variance for a simple family of processes which one might use to sample the biased distributions in an application of EMUS.

We now prove a CLT for $\pi^{US}[g]$, and we give a formula expressing the asymptotic variance $\sigma_{US}^2(g)$ of $\pi^{US}[g]$ in terms of the asymptotic variances $\Sigma_i$ of the trajectory averages. In this formula, $(I - F)^\#$ denotes the group generalized inverse of $I - F$; the group inverse $A^\#$ of a matrix A is characterized by the properties

$A A^\# A = A, \quad A^\# A A^\# = A^\#, \quad\text{and}\quad A A^\# = A^\# A.$

We refer to [19] for a detailed explanation of the properties of the group inverse, a proof that $(I - F)^\#$ exists whenever F is stochastic and irreducible, and an algorithm for computing $(I - F)^\#$.
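For readers who prefer not to implement the algorithm of [19], the group inverse can also be evaluated through the fundamental matrix of the chain, using the standard identity $(I - F)^\# = (I - F + \mathbf{1}w^t)^{-1} - \mathbf{1}w^t$ with $w = w(F)$. The short Python/NumPy sketch below is offered only as one convenient alternative; it is not the method used in the paper.

```python
import numpy as np

def group_inverse_I_minus_F(F, w):
    """Group inverse (I - F)^# for an irreducible stochastic matrix F with
    stationary (weight) vector w, via the fundamental-matrix identity
    (I - F)^# = (I - F + 1 w^t)^{-1} - 1 w^t."""
    L = F.shape[0]
    W = np.outer(np.ones(L), w)          # rank-one matrix 1 w^t
    return np.linalg.inv(np.eye(L) - F + W) - W
```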

In Theorem 3.3 and below we impose the following assumption:

Assumption 3.2. The processes Xti sampling the biased distributions are independent.

This assumption does not hold for all stratified MCMC methods. For example, in replica exchange umbrella sampling one periodically allows configuration exchanges between neighboring processes; see [32] for a general discussion of replica exchange strategies and [43] for an application of replica exchange in a method similar to EMUS. The result is a single process taking values in $\mathbb{R}^{L \times d}$ and sampling the product distribution $\Pi(x_1, x_2, \dots, x_L) = \pi_1(x_1)\pi_2(x_2)\cdots\pi_L(x_L)$. In this case, a CLT would still hold for EMUS, but the asymptotic variance would take a different form.

Theorem 3.3. Let Assumptions 2.2, 3.1, and 3.2 and the irreducibility condition (2.6) hold. Let g be square integrable over π, so $\pi[g^2] < \infty$. Recall the definition of Ψ from (2.2), and define

$\ell := \Psi\,(1, -\pi[g])^t \in \mathbb{R}^2.$

Let $\mathbf{g} \in \mathbb{R}^L$ be the vector with $\mathbf{g}_i := (\pi_i[g^*], \pi_i[\mathbf{1}^*])\,\ell$. We have

$\sqrt{N}\left(\pi^{US}[g] - \pi[g]\right) \xrightarrow{d} \mathcal{N}(0, \sigma_{US}^2(g)),$ (3.3)

where

$\sigma_{US}^2(g) = \sum_{i=1}^L \frac{z_i^2}{\kappa_i}\left\{\left((I-F)^\#\mathbf{g}\right)^t \sigma^i\,(I-F)^\#\mathbf{g} + 2\left((I-F)^\#\mathbf{g}\right)^t \rho^i\,\ell + \ell^t \tau^i \ell\right\}.$ (3.4)

Proof. The result follows using the delta method and a formula expressing w′(F) in terms of $(I - F)^\#$; we give the details in Appendix A. ■

In Appendix F, we explain how to compute an estimate of $\sigma_{US}^2(g)$ given trajectories of the biased processes. We use these estimates in Section 5 when analyzing our computational experiments. We note that the numerical methods presented in Appendix F for estimating $\sigma_{US}^2(g)$ significantly improve upon our original proposal in [46]; see Figure 11.

Figure 11: Sign pattern of the group inverse $(I - \bar{F})^\#$ computed by the method of [19] (Figure 11a) and by power iteration (Figure 11b). Yellow indicates an entry with positive sign, blue a negative sign. Here, we consider the overlap matrix $\bar{F}$ computed to estimate the marginal density of μ2 in Section 5.3. The oscillations in sign observed in the upper right corner of Figure 11a are evidence of numerical error.

We now derive a convenient upper bound on the asymptotic variance $\sigma_{US}^2(g)$. In Section 4, we use this bound to analyze the efficiency of EMUS in the low-temperature limit and in the limit of small tail probabilities. Our bound is based on the probability $P_i[\tau_j < \tau_i]$ defined below:

Definition 3.4. Let $Y_n$ be the Markov chain with state space {1, 2, …, L} and transition matrix F. Let $P_i[\tau_j < \tau_i]$ denote the probability that $Y_n$ hits j before returning to i, conditioned on $Y_0 = i$.

Theorem 3.5. Let Assumptions 2.2, 3.1, and 3.2 and the irreducibility condition (2.6) hold. Let g be square integrable over π, so $\pi[g^2] < \infty$. Let $\sigma_{US}^2(g)$ be the asymptotic variance of $\pi^{US}[g]$, and for any measure ν and function f let $\operatorname{var}_\nu(f)$ be the variance of f over ν. Define the function

$h = g^* - \pi[g]\,\mathbf{1}^*,$

and let C(h¯i) be the asymptotic variance of the trajectory average of h over the biased process Xti. We have

$\sigma_{US}^2(g) \le 2\sum_{i=1}^L \frac{1}{\kappa_i}\left\{z_i^2\Psi^2\,C(\bar{h}_i) + \operatorname{tr}(R^i)\,\pi[|h|]^2\sum_{\substack{j \ne i \\ F_{ij} > 0}}\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}\right\},$ (3.5)

where $R^i \in \mathbb{R}^{L \times L}$ with

$R^i_{jk} := \frac{\sigma^i_{jk}}{\sqrt{\operatorname{var}_{\pi_i}(\psi_j^*)\,\operatorname{var}_{\pi_i}(\psi_k^*)}}.$

Proof. The result follows from Theorem 3.3, using the perturbation bounds which we derived in [45]. Details appear in Appendix A. ■

When the bias functions are a partition of unity, both the EMUS method and the statements of Theorems 3.3 and 3.5 simplify considerably. (The bias functions are a partition of unity if and only if $\sum_{i=1}^L \psi_i(x) = 1$ for all x.) In this case, $f^* = f$ for all functions f, and the EMUS method reduces to

$\pi^{US}[g] = \sum_{i=1}^L w_i(\bar{F})\,\bar{g}_i,$

where

$\bar{F}_{ij} = N_i^{-1}\sum_{t=1}^{N_i}\psi_j(X_t^i) \quad\text{and}\quad \bar{g}_i = N_i^{-1}\sum_{t=1}^{N_i} g(X_t^i).$

In the statement of Theorem 3.5, one can replace π[|h|]2 with varπ(g) or π[|g|]2. We also have Ψ = 1, and one can replace C(h¯i) with the asymptotic variance C(g¯i) of g¯i. (We verify these claims in Appendix A as part of the proof of Theorem 3.5.) In our limiting results (Section 4) and computational experiments (Section 5), we choose the bias functions to be a partition of unity.

3.2. Dependence of the Asymptotic Variance on the Choice of Strata.

In this section, we consider how the choice of strata influences the factors in the upper bound (3.5) on σUS2(g). Roughly, the asymptotic variances C(h¯i) and tr(Ri) characterize the sampling error, and for each i the factor

$\sum_{\substack{j \ne i \\ F_{ij} > 0}}\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2},$

measures the sensitivity of the EMUS estimator to sampling errors associated with $\pi_i$. We show in Section 3.2.1 that $C(\bar{h}_i)$ and $\operatorname{tr}(R^i)$ may be controlled by decreasing the diameters of the strata, under some conditions. We show in Section 3.2.2 that $P_i[\tau_j < \tau_i]$ may be controlled by ensuring sufficient overlap between neighboring strata. This last observation leads to a practical condition guiding the choice of strata; see Remark 3.12 and (5.8).

To streamline our illustration of the benefits of stratification, we impose additional assumptions in this section. These assumptions are much stronger than those made above. We discuss how they relate to practical implementations of stratified MCMC in Remarks 3.9 and 3.12.

3.2.1. Asymptotic Variances of MCMC Averages.

Here, we consider the effect of the choice of strata on the asymptotic variances C(h¯i) and tr(Ri). Because such a diverse variety of biased processes and distributions could in principle be used, it is futile in our opinion to try for a completely general result. Instead, motivated by the efficiency analysis undertaken in Section 4, we introduce a simple parametric family of bias functions, and for this family we state Assumption 3.6 relating the diameters of the strata with the asymptotic variances. In Theorem 3.7, we verify Assumption 3.6 for one representative class of biased processes. Finally, at the end of this section, we explain why we expect the assumption to hold for other choices of biased processes and distributions; cf. Remark 3.9.

Consider the following representative class of bias functions: Given a family of sets $\{U_i : i = 1, \dots, L\}$ with $\bigcup_{i=1}^L U_i = \Omega$, define

$\psi_i := \mathbf{1}_{U_i} \quad\text{and}\quad \pi_i(dx) := \frac{\mathbf{1}_{U_i}(x)\,\pi(dx)}{\pi[\mathbf{1}_{U_i}]} \quad\text{for } i = 1, \dots, L,$ (3.6)

where $\mathbf{1}_{U_i}$ denotes the characteristic function of $U_i$. Assume that the sets $U_i$ are chosen so that the irreducibility criterion of Lemma 2.1 holds. For example, suppose that $\Omega = [0, 1]^d$ is the d-dimensional unit cube. One might choose $K \in \mathbb{N}$, set h := 1/K, and define

$U_i = \left(h[-1,1]^d + hi\right)\cap\Omega \quad\text{for } i \in \{0, 1, \dots, K\}^d,$ (3.7)

covering Ω uniformly by a grid of strata having diameters proportional to h. We use this uniform grid as a device when analyzing the effect of the stratum size on the efficiency of EMUS in Section 4. However, while such a naïve choice may suffice for small d, it is not practical for large d. We discuss appropriate bias functions for high-dimensional problems later in this section and again in Section 5.1.
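As a concrete illustration of the grid (3.6)–(3.7), the following Python/NumPy sketch builds the indicator bias functions on the unit cube and evaluates them at a point; the function name and the vector layout are illustrative assumptions of ours.

```python
import itertools
import numpy as np

def uniform_grid_strata(K, d):
    """Bias functions psi_i = 1_{U_i}, with U_i = (h[-1,1]^d + h*i) intersected
    with [0,1]^d, indexed by i in {0,...,K}^d and h = 1/K, as in (3.6)-(3.7)."""
    h = 1.0 / K
    centers = [h * np.array(i) for i in itertools.product(range(K + 1), repeat=d)]

    def psi(x):
        """Return the vector (1_{U_i}(x))_i for a point x in [0,1]^d."""
        x = np.asarray(x, dtype=float)
        return np.array([float(np.all(np.abs(x - c) <= h)) for c in centers])

    return centers, psi

# Example: in d = 2 with K = 4, a generic point lies in 2^d = 4 strata.
centers, psi = uniform_grid_strata(K=4, d=2)
print(int(psi([0.33, 0.61]).sum()))  # prints 4
```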

Since we wish to study grids like (3.7) as h = 1/K varies, we state our assumption on asymptotic variances in terms of the following parametric family of strata: Let $x_0 \in \Omega$, and let $Z \subset \mathbb{R}^d$ be a bounded set containing 0. For each h > 0, define a stratum and a biased distribution by

$Z_h = x_0 + hZ \quad\text{and}\quad \pi_h(dx) = \frac{\mathbf{1}_{Z_h}(x)\,\pi(dx)}{\pi[\mathbf{1}_{Z_h}]}.$ (3.8)

Assumption 3.6 characterizes the dependence of the asymptotic variance of MCMC averages over πh on the parameter h:

Assumption 3.6. Assume that $f : \Omega \to \mathbb{R}$ has finite variance $\operatorname{var}_h(f)$ over $\pi_h$, and define $\sigma_h^2(f)$ to be the asymptotic variance of an MCMC trajectory average approximating $\pi_h[f]$. Write

$\pi(x) = \frac{\exp(-\beta V(x))}{\int \exp(-\beta V(y))\,dy},$

for some potential $V : \Omega \to \mathbb{R}$ and inverse temperature β > 0. We assume

$\frac{\sigma_h^2(f)}{\operatorname{var}_{\pi_h}(f)} \le C h^a \beta^b \exp\!\left(\beta\left(\max_{Z_h} V - \min_{Z_h} V\right)\right) \le C h^a \beta^b \exp\!\left(\beta\,h\,\operatorname{diam}(Z)\,\|\nabla V\|_\infty\right)$

for some C, a, b ≥ 0 independent of h,Z, and f.

To motivate Assumption 3.6, we prove that a special case holds for a representative class of processes sampling the biased distributions, cf. Theorem 3.7. Assume that the potential V appearing in the assumption is continuously differentiable. Let $Z \subset \mathbb{R}^d$ be either a convex polyhedron or a set with $C^3$ boundary. Now let $X_t^h$ be the overdamped Langevin process with reflecting boundary conditions on $Z_h$. This process is defined by the Fokker–Planck equation

$\frac{\partial u}{\partial t}(x,t) = \operatorname{div}\!\left(\beta^{-1}\nabla u(x,t) + u(x,t)\nabla V(x)\right)$ for $x \in Z_h$, $t > 0$; $\left(\beta^{-1}\nabla u(x,t) + u(x,t)\nabla V(x)\right)\cdot n(x) = 0$ for $x \in \partial Z_h$, $t \ge 0$; and $u(x,0) = p(x)$ for $x \in Z_h$, (3.9)

where β and V are the inverse temperature and potential defined in Assumption 3.6 and n(x) denotes the inward unit normal to ∂Zh at x. That is, Xth is the unique Markov process so that if X0h has density p(x), then Xth has density u(x, t). The existence of the reflected process is established in [51, 2] when Z is a convex polyhedron and in [13, Chapter 8] when Z has C3 boundary. A simple introduction to the reflected process and its properties appears in [36, Chapter 4]. We show in Theorem 3.7 that Xth is ergodic for πh, at least when Z is bounded.

The reflected process Xth shares many features with the processes used in practical stratified MCMC methods. In particular, it is closely related to the (unreflected) overdamped Langevin process Yt, which solves

$dY_t = -\nabla V(Y_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t.$ (3.10)

In fact, the Fokker–Planck equation of the unreflected process is the same as that of the reflected process, except with no boundary condition and with $\mathbb{R}^d$ in place of $Z_h$.

We now verify Assumption 3.6 for the reflected process:

Theorem 3.7. Assume that $f : \Omega \to \mathbb{R}$ has finite variance $\operatorname{var}_h(f)$ over $\pi_h$. Let Z either have $C^3$ boundary or be convex. Assume that V is continuously differentiable. Suppose that $X_t^h$ is stationary; that is, $X_0^h$ has distribution $\pi_h$. Let $\bar{f}_h := \frac{1}{T}\int_0^T f(X_t^h)\,dt$ be the continuous time trajectory average of f. We have

$\sqrt{T}\left(\bar{f}_h - \pi_h[f]\right) \xrightarrow{d} \mathcal{N}(0, \sigma_h^2(f)),$

where

$\sigma_h^2(f) \le \Lambda\,h^2\beta\,\exp\!\left(\beta\left(\max_{Z_h} V - \min_{Z_h} V\right)\right)\operatorname{var}_h(f).$ (3.11)

The constant Λ depends only on Z, not on h, β, V, or f.

Sketch of proof. We give a detailed proof of Theorem 3.7 in Appendix C; here we present only an outline. Our proof is based on a formula expressing the asymptotic variance of trajectory averages in terms of the generator of Xth: Under certain conditions,

$\sigma_h^2(g) = -2\,\pi_h\!\left[(g - \pi_h[g])\,L^{-1}(g - \pi_h[g])\right],$ (3.12)

where L is the generator of $X_t^h$ and $L^{-1}(g - \pi_h[g])$ is a function so that

$L\left(L^{-1}(g - \pi_h[g])\right) = g - \pi_h[g],$

and $\nabla L^{-1}(g - \pi_h[g]) \cdot n = 0$ on $\partial Z_h$. To estimate $\sigma_h^2(g)$, we prove a Poincaré inequality for $\pi_h$, which implies an upper bound on $\|L^{-1}(g - \pi_h[g])\|_{L^2(\pi_h)}$. This general approach is outlined in [31, Section 3], which treats the case of the overdamped Langevin dynamics on an unbounded domain. ■

Remark 3.8. In a related work, we summarize and refute a widely accepted argument in favor of umbrella sampling from the chemistry literature [9, Chapter 8]; see [46, Section VI.A]. Roughly speaking, the argument in the chemistry literature treated the dependence of the sampling error on the choice of strata correctly, but the sensitivity of the algorithm to those errors incorrectly. Our Theorem 3.7 verifies and extends the correct part of this argument.

Remark 3.9 (Practical Biased Processes and Assumption 3.6). As explained above, the reflected process $X_t^h$ is quite similar to the processes used in practice. We now discuss practical methods, and we explain when we expect Assumption 3.6 to hold. We identify three major differences between $X_t^h$ and practical methods: First, in molecular simulations, one typically chooses Gaussian bias functions instead of piecewise constant ones. Second, practical methods must be discrete in time, e.g. one might use a discretization of the continuous time process $X_t^h$. Third, for high-dimensional problems, one typically stratifies only a certain low-dimensional reaction coordinate or collective variable.

In the first case, for Gaussian bias functions, a version of Theorem 3.7 holds with minor adjustments; we omit the exact statement and proof for simplicity. In the second case, for discretizations of Langevin dynamics, the asymptotic variances of trajectory averages are closely related to the corresponding averages for the continuous time dynamics: In fact, under some conditions on the potential V,

$\lim_{\Delta t \to 0} \Delta t\,\zeta_{\Delta t}^2(f) = \zeta^2(f),$ (3.13)

where $\zeta_{\Delta t}^2(f)$ is the asymptotic variance of the trajectory average of f for the discretization with time step Δt and $\zeta^2(f)$ is the asymptotic variance for the continuous time process [31, Section 3.2]. For other discrete time processes, we expect Assumption 3.6 to hold with different exponents a and b. For example, the affine invariance property of the affine invariant ensemble sampler [20] suggests a = 0.
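To fix what "discretization" means here, the following is a minimal Euler–Maruyama sketch of the unreflected dynamics (3.10) with time step Δt (Python/NumPy). It is a standard scheme shown only for orientation; confinement to a stratum, e.g. by reflection at $\partial Z_h$ or by rejecting proposals that leave the support, is deliberately omitted.

```python
import numpy as np

def euler_maruyama(grad_V, x0, beta, dt, n_steps, rng=None):
    """Euler-Maruyama discretization of the overdamped Langevin SDE (3.10):
    dY_t = -grad V(Y_t) dt + sqrt(2/beta) dB_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.array(x0, dtype=float))
    traj = np.empty((n_steps, x.size))
    for t in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x - grad_V(x) * dt + np.sqrt(2.0 * dt / beta) * noise
        traj[t] = x
    return traj
```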

The third case is subtle. When d is large, one typically stratifies only in a function $\theta : \Omega \subset \mathbb{R}^d \to \mathbb{R}^\ell$ with $\ell$ much smaller than d. To be precise, one might choose a uniform grid of nonnegative functions $\eta_i : \mathbb{R}^\ell \to \mathbb{R}$ defined as in (3.7), but with supports covering $\theta(\Omega) \subset \mathbb{R}^\ell$ instead of $\Omega \subset \mathbb{R}^d$. One would then define the bias functions

$\psi_i(x) := \eta_i(\theta(x)).$ (3.14)

(We make a similar choice in our calculations in Section 5, cf. the natural stratification (5.1).) For a clever choice of θ, these biased distributions may be much easier to sample than the target distribution. For example, suppose that the marginal $\pi_\theta$ of π in θ were multimodal, but that the conditional distributions π(· | θ = θ0) were unimodal or otherwise easy to sample for each fixed θ0. In that case, for h sufficiently small, each biased distribution would be unimodal, hence easy to sample. (Recall that h sets the diameters of the strata for the grid of bias functions defined in (3.7), so h small means that the diameter of the support of $\eta_i$ is small.) In free energy calculations, it is often possible to choose such a θ based on intuition or scientific principles; see [46, 30] for discussion. Also, when computing tails or marginals, the problem itself typically suggests a particular θ; cf. the natural stratification in Section 5.1.

The reader will notice that bias functions of the form (3.14) will typically have infinite support, rendering the bound in Assumption 3.6 useless. In this case, one might hope for a similar bound with the potential function V replaced by the free energy

$F(\theta) := -\beta^{-1}\log(\pi_\theta(\theta)),$

where πθ is the marginal density of π in θ. Roughly, this replacement will be valid when, for MCMC processes sampling π, the θ variables equilibrate very slowly compared to other variables. This will occur, for example, when the marginal in θ is multimodal or otherwise difficult to sample, but the conditional distributions are easy to sample. More on the effective dynamics of low-dimensional variables can be found in [37] or [29].

3.2.2. Controlling the Probabilities Pi[τj < τi].

Here, we examine the effect of the choice of strata on the factor

$\sum_{\substack{j \ne i \\ F_{ij} > 0}}\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}$ (3.15)

appearing in our upper bound (3.5) on σUS2(g).

We begin with a lemma estimating varπi(ψj*)/Pi[τj<τi]2 in terms of Fij:

Lemma 3.10. We have

$\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2} \le \frac{1}{F_{ij}}.$

Proof. We have

$P_i[\tau_k < \tau_i] \ge \mathbb{P}[X_1 = k \mid X_0 = i] = F_{ik},$

where $X_t$ denotes the Markov chain with transition matrix F. Therefore, since $\psi_j^*(x) \in [0,1]$,

$\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2} \le \frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{F_{ij}^2} \le \frac{\pi_i[(\psi_j^*)^2]}{F_{ij}^2} \le \frac{\pi_i[\psi_j^*]}{F_{ij}^2} = \frac{1}{F_{ij}}.$ ■

We now estimate the size of Fij for piecewise constant bias functions such as the uniform grid (3.7):

Lemma 3.11. Assume as in (3.6) that the bias functions are piecewise constant, and write π(x) ∝ exp(−βV (x)). We have

$F_{ij} \ge \frac{|U_i \cap U_j|}{|U_i|}\,\frac{1}{\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty}\exp\!\left(\beta\left(\min_{U_i} V - \max_{U_i} V\right)\right).$

In particular, for the uniform grid of strata (3.7), we have

$F_{ij} \ge \frac{1}{4^d}\exp\!\left(\beta\left(\min_{U_i} V - \max_{U_i} V\right)\right) \ge \frac{1}{4^d}\exp\!\left(-2\beta h\sqrt{d}\,\|\nabla V\|_\infty\right)$ (3.16)

for any $i, j \in \{0, 1, \dots, K\}^d$ so that $F_{ij} > 0$.

Proof. We have

$F_{ij} = \pi_i[\psi_j^*] = \pi_i\!\left[\frac{\mathbf{1}_{U_j}}{\sum_{k=1}^L \mathbf{1}_{U_k}}\right] \ge \frac{\pi[U_i \cap U_j]}{\pi[U_i]}\,\frac{1}{\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty} \ge \frac{|U_i \cap U_j|}{|U_i|}\,\frac{1}{\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty}\exp\!\left(\beta\left(\min_{U_i} V - \max_{U_i} V\right)\right),$

which proves the first claim made in the statement of the lemma.

Now, for the uniform grid of strata (3.7), the minimum nonzero value of $|U_i \cap U_j|/|U_i|$ is $1/2^d$, attained when $j = (1, 1, \dots, 1) + i$. Moreover, except for a set of measure zero, each $x \in \mathbb{R}^d$ lies within $2^d$ strata, so $\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty = 2^d$. Finally, we have

$\max_{U_i} V - \min_{U_i} V \le \operatorname{diam}(U_i)\,\|\nabla V\|_\infty = 2\sqrt{d}\,h\,\|\nabla V\|_\infty,$

and the result follows. ■

Remark 3.12 (A Condition to Guide the Choice of Strata). Lemmas 3.10 and 3.11 suggest a practical constraint on the choice of strata: To ensure that the calculation of the weights is not too sensitive to sampling errors, it will suffice to choose strata so that nonzero entries of F are not too small. We let this condition guide the choice of strata in Section 5, cf. (5.8). However, the condition is only sufficient, not necessary. For example, consider a uniform grid of Gaussian bias functions similar to (3.7), but with Gaussian densities having mean $hi$ and variance $\sigma^2 = h^2$ replacing the characteristic functions $\mathbf{1}_{U_i}$. In that case, even though F will be dense and may have some extremely small nonzero entries, one can still control (3.15) by decreasing h, under some conditions on π. We omit the exact statement and proof for simplicity.
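In practice the condition of Remark 3.12 can be monitored directly from the estimated overlap matrix, in the spirit of the check (5.8) reported later. A small diagnostic sketch (Python/NumPy; the function name and the idea of thresholding are ours):

```python
import numpy as np

def min_nonzero_overlap(Fbar, tol=0.0):
    """Smallest nonzero off-diagonal entry of the estimated overlap matrix.
    Very small values flag strata whose weights may be overly sensitive to
    sampling error (cf. Remark 3.12 and the check (5.8))."""
    off_diag = Fbar[~np.eye(Fbar.shape[0], dtype=bool)]
    nonzero = off_diag[off_diag > tol]
    return nonzero.min() if nonzero.size else 0.0
```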

Despite the exponential dependence on d in (3.16), EMUS and other stratified MCMC methods are advantageous for high-dimensional problems because it often suffices to stratify only a low-dimensional collective variable. In such cases, the dimension of the grid of strata is much smaller than the dimension of the state space Ω; see our discussion of collective variables in Section 3.2.1 and our computations in Section 5. It is important to keep this in mind when reading our results below.

Remark 3.13. One may define a uniform grid of strata so that (3.15) increases only as $d^2$ with dimension, not exponentially: For any $i \in \mathbb{Z}^d$, let $V_i := hi + h[-\tfrac{1}{2}, \tfrac{1}{2}]^d$. For $i \ne j$, define

$W_{ij} := \left\{x \in V_j : \min_{y \in V_i}\|x - y\| \le \min_{y \in V_k}\|x - y\| \text{ for any } k \in \mathbb{Z}^d\setminus\{j\}\right\}$

to be the d-dimensional pyramid consisting of all points in $V_j$ closer to $V_i$ than to any other cube $V_k$. Now let $e_n$ denote the n'th standard basis vector in $\mathbb{R}^d$, and define

$V_i' := \bigcup_{n=1}^d\left(W_{i, i+e_n} \cup W_{i, i-e_n}\right)\cup V_i$

to be the cube $V_i$ enlarged by all the neighboring pyramids $W_{ij}$. The strata $V_i'$ are convex, and the corresponding bias functions $\psi_i = \tfrac{1}{2}\mathbf{1}_{V_i'}$ are a partition of unity. Each stratum $V_i'$ intersects only the $2d$ neighboring strata $V_{i \pm e_n}'$ for n = 1, …, d. Moreover, each intersection between neighboring strata $V_i'$ and $V_j'$ consists of the pair of pyramids $W_{ij}$ and $W_{ji}$, and it has volume 1/d. Therefore, by Lemma 3.11, for this choice of bias functions, the nonzero entries of F decrease as 1/d. It follows that (3.15) increases as $d^2$.

4. Limiting Results as a Rationale for EMUS.

In this section, we analyze the efficiency of EMUS in two limits: First, we consider a low temperature limit, where we write π(x) ∝ exp(−βV(x)) and let the inverse temperature β increase, concentrating the target distribution at its modes and intensifying the effects of multimodality on the efficiency of MCMC sampling. Second, we consider the estimation of increasingly small tail probabilities. Our goal in each case is to elucidate the advantages and disadvantages of EMUS for a broad class of problems, providing a rationale for the use of the method. We hope that others will use the tools of Section 3 in similar fashion to develop their own novel applications of EMUS.

4.1. Limit of Low Temperature.

Let the target distribution take the form

$\pi_\beta(x) = \frac{\exp(-\beta V(x))}{\int \exp(-\beta V(y))\,dy}$

for some potential V and inverse temperature β > 0, as in Section 3.2. In this section, we analyze the efficiency of EMUS in the low temperature limit as β tends to infinity with V fixed. We observe that πβ concentrates at its modes (the minima of V) in this limit. As a consequence, MCMC methods for sampling πβ undergo transitions between modes only rarely, which makes direct MCMC sampling increasingly inefficient. To be precise, we show that the asymptotic variance of a trajectory average of the overdamped Langevin dynamics increases exponentially with β in the worst case. On the other hand, we show that the asymptotic variance of the EMUS estimate of the same average increases only polynomially. Therefore, EMUS is dramatically more efficient than direct sampling in the low temperature limit.

We consider the low temperature limit because it provides a convenient sequence of increasingly difficult to sample multimodal distributions: By analyzing EMUS in the low temperature limit, we hope to elucidate its advantages for multimodal problems in general. We have no other interest in low temperature.

We now examine the overdamped Langevin dynamics.

$dX_t^\beta = -\nabla V(X_t^\beta)\,dt + \sqrt{2\beta^{-1}}\,dB_t$ (4.1)

in the low temperature limit. (The overdamped Langevin dynamics is ergodic for πβ under certain conditions on V; see [41] for example.) For typical potentials V, the generator

$L := \beta^{-1}\Delta - \nabla V \cdot \nabla$

of (4.1) has a spectral gap that shrinks exponentially with β; that is, for some c > 0,

$-\exp(-c\beta) \le \lambda_1 < 0,$ (4.2)

where $\lambda_1$ is the greatest nonzero eigenvalue of L. We refer to [31, Section 2.5] for a review of results on the spectrum of L, and we refer to [21] for precise conditions on V which guarantee (4.2). Now let $v_1$ be an eigenfunction corresponding to $\lambda_1$ normalized so that $\pi_\beta[v_1^2] = 1$. By formula (3.12), the asymptotic variance $\sigma_\beta^2(v_1)$ of the trajectory average of $v_1$ satisfies

$\sigma_\beta^2(v_1) = -2\,\pi_\beta[v_1\,L^{-1}v_1] = -2\lambda_1^{-1}\,\pi_\beta[v_1^2] = -2\lambda_1^{-1} \ge 2\exp(c\beta),$

indicating that the cost of estimating π[v1] by direct MCMC grows exponentially with β.

Having analyzed the overdamped Langevin dynamics, we now examine EMUS in the low temperature limit. For convenience, we assume that Ω is the unit cube $[0,1]^d \subset \mathbb{R}^d$ with periodic boundary conditions; to be more precise, we let $\Omega = \mathbb{R}^d/\mathbb{Z}^d$ be the set of all points in $\mathbb{R}^d$ with x and y identified if and only if $x - y \in \mathbb{Z}^d$. Periodic boundary conditions are typical of problems in chemistry and computational statistical mechanics. We do not see any difficulties in generalizing our results to other types of domains.

As β increases, we must make the supports of the bias functions smaller. We accomplish this by adjusting the parameter h in a uniform grid of bias functions similar to those defined in (3.7). To be precise, we fix $K \in \mathbb{N}$, set h := 1/K, and define

$\psi_i(x) := \frac{1}{2^d}\,\mathbf{1}_{[-1,1]^d}\!\left(K(x - hi)\right) \quad\text{for } i \in \{0, 1, \dots, K-1\}^d.$ (4.3)

This family of $K^d$ bias functions is a partition of unity over Ω, and the support of the i'th bias function is

$U_i := h[-1,1]^d + hi.$

For convenience, we treat the index i as an element of $\mathbb{Z}^d/K\mathbb{Z}^d$; that is, we let i be periodic with period K in each of its components, identifying $(0, i_2, \dots, i_d)$ with $(K, i_2, \dots, i_d)$, for example. Figure 1 illustrates such a family of bias functions, and it demonstrates the appropriate relationship between β and h.

Figure 1: Bias functions and target distributions in the low temperature limit. In the upper two plots, the black curves are the densities of the target distributions for two different values of β. Observe that π concentrates at the minima of V as β increases. The red bands each lie above a single stratum chosen from a family of strata for which $h \propto \beta^{-1}$. In the lower two plots, the blue curve is βV(x) and the x-axis covers the bottom of the red band in the plot immediately above. Observe that the range of βV(x) over the red band is the same for each of the two values of β. By Theorem 3.7 and the ensuing discussion in Section 3.2, this implies that the cost of sampling a single biased distribution increases at most polynomially with β when $h \propto \beta^{-1}$.

We now show that the asymptotic variance of EMUS increases at most polynomially with β when K is chosen appropriately. In light of the above discussion, this means that EMUS may be dramatically more efficient than direct sampling for multimodal problems. We note that despite the exponential dependence on d in (4.4) below, EMUS and other stratified MCMC methods are often advantageous for high-dimensional multimodal problems; see our discussion of low-dimensional collective variables in Section 3.2 and also our computations in Section 5.

Theorem 4.1. For any bounded continuous function g, let σβ,US2(g) denote the asymptotic variance of πβ,US[g]. Let the bias functions be defined by (4.3) with K equal to the least integer greater than β; that is,

$K = \lceil \beta \rceil.$

Take $\kappa_i = 1/K^d$. Let Assumption 3.6 hold. We have

$\frac{\sigma_{\beta,US}^2(g)}{\operatorname{var}_{\pi_\beta}(g)} \le C(1 + \beta)^{qd}$ (4.4)

for constants C, q > 0 independent of g and β, but depending on V and the constants in Assumption 3.6.

Proof. The proof is a straightforward application of the theory developed in Section 3; we present the details in Appendix D. ■

Remark 4.2. Our proof of Theorem 4.1 relies on the perturbation bounds which we derived in [45]. These bounds allow one to estimate the sensitivity of w(F) to small perturbations of F. Most perturbation bounds in the literature predict that w(F) is highly sensitive when the spectral gap of F is small, but ours show that this is not always the case. (The spectral gap is 1 − |λ2|, where λ2 is the eigenvalue of F with second largest absolute value.) In the low-temperature limit, the spectral gap of F decreases exponentially with β; see [45] for a simple example of this phenomenon. Nonetheless, using our bounds, we show that the cost to compute averages by EMUS increases only polynomially in β.

4.2. Limit of Small Probability.

In this section, we assess the performance of EMUS for computing tail probabilities. To be precise, we let Ω = [0,∞), and we consider estimation of probabilities of the form

$p_M := \pi([M, \infty)).$

We show that for a broad class of distributions π, the cost of computing pM with relative precision by direct MCMC increases exponentially with M, whereas the cost by EMUS increases only polynomially. Thus, EMUS is dramatically more efficient than direct sampling for computing the probabilities of tail events.

In Assumption 4.3 below, we state the conditions which we will impose on π in our analysis. These conditions specify a simple class of problems for which strong conclusions may be drawn. Similar results hold more generally. For example, in Section 5, we report the results of a computational experiment demonstrating the advantages of EMUS for computing tails of a marginal density.

Assumption 4.3. Write

$\pi(x) = \exp(-V(x))$

for some potential function $V : [0, \infty) \to \mathbb{R}$. Assume that for some $M_0 \ge 0$:

  1. Whenever $x \ge M_0$,
    $0 \le V''(x) \quad\text{and}\quad 0 < V'(x).$ (4.5)
  2. For some α ∈ (0, 1) and c > 0, whenever $x \ge M_0$,
    $\alpha V'(x)^2 - V''(x) \ge c > 0.$ (4.6)

For example, we might have

$\pi(x) \propto \exp(-|x|^r) \quad\text{for any } r \ge 1.$

Remark 4.4. Condition (4.6) in Assumption 4.3 implies geometric ergodicity of the overdamped Langevin dynamics with potential V [40]. We rely on this fact to motivate Assumption 4.5 concerning the convergence of MCMC processes sampling biased distributions with unbounded support. Interestingly, we use the same condition to prove lower bounds on some of the entries of the overlap matrix; cf. Lemma E.1.

Condition (4.5) in Assumption 4.3 implies

$p_M \le D\exp(-\gamma M)$

whenever $M \ge M_0$ for some D, γ > 0. Therefore, the relative variance $\rho_M^2$ of $\mathbf{1}_{[M,\infty)}$ over π satisfies

$\rho_M^2 = \frac{p_M - p_M^2}{p_M^2} \ge D^{-1}\exp(\gamma M) - 1.$

We conclude that estimating pM with relative accuracy by a direct MCMC method (or even Monte Carlo with independent samples) requires a number of samples increasing exponentially with M.

By contrast, we show that for an appropriate choice of bias functions, the cost to estimate $p_M$ by EMUS increases only polynomially in M. For each M > 0 and $K \in \mathbb{N}$, let

$h := \frac{M}{K},$

and define the family of K + 2 bias functions

$\psi_i(x) := \begin{cases} \tfrac{1}{2}\,\mathbf{1}_{[0,h]}(x) & \text{for } i = 0,\\ \tfrac{1}{2}\,\mathbf{1}_{[(i-1)h,(i+1)h]}(x) & \text{for } i = 1, \dots, K-1,\\ \tfrac{1}{2}\,\mathbf{1}_{[M-h,\infty)}(x) & \text{for } i = K, \text{ and}\\ \tfrac{1}{2}\,\mathbf{1}_{[M,\infty)}(x) & \text{for } i = K+1. \end{cases}$ (4.7)

As in Section 4.1, let Ui denote the support of ψi. This family of bias functions is a partition of unity on [0,∞); see Figure 2.

Figure 2: The bias functions {ψi : i = 0, …, K + 1} defined in (4.7) and a potential function V satisfying Assumption 4.3. Observe that the bias functions $\psi_K$ and $\psi_{K+1}$ have unbounded support.
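The partition of unity (4.7) is straightforward to implement; the sketch below (Python/NumPy, vectorized over a batch of points, which is an implementation choice of ours) evaluates the K + 2 bias functions at given points and checks that they sum to one away from bin boundaries.

```python
import numpy as np

def tail_bias_functions(x, M, K):
    """Evaluate the K + 2 bias functions of (4.7) at points x >= 0.
    Returns an array of shape (len(x), K + 2) whose rows sum to 1."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    h = M / K
    psi = np.zeros((x.size, K + 2))
    psi[:, 0] = 0.5 * ((0.0 <= x) & (x <= h))                        # i = 0
    for i in range(1, K):                                            # i = 1, ..., K-1
        psi[:, i] = 0.5 * (((i - 1) * h <= x) & (x <= (i + 1) * h))
    psi[:, K] = 0.5 * (x >= M - h)                                   # i = K
    psi[:, K + 1] = 0.5 * (x >= M)                                   # i = K+1
    return psi

# Quick check of the partition-of-unity property (boundary points excepted):
pts = np.random.uniform(0.0, 12.0, size=1000)
assert np.allclose(tail_bias_functions(pts, M=10.0, K=20).sum(axis=1), 1.0)
```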

We now address the cost of estimating pM by EMUS. First, we observe that Assumption 3.6 on the asymptotic variances of MCMC averages does not cover the sampling of πK and πK+1, since the supports of these distributions are unbounded. We need to add to that assumption

Assumption 4.5. Let $f : [0, \infty) \to \mathbb{R}$, and define $\sigma_i^2(f)$ to be the asymptotic variance of an MCMC trajectory average approximating $\pi_i[f]$ for i = K, K + 1. We assume

$\frac{\sigma_i^2(f)}{\operatorname{var}_{\pi_i}(f)} \le D$

for some D independent of M and f.

In fact, since Assumption 4.3 implies that the overdamped Langevin dynamics is ergodic for π(x) = exp(−V(x)) on the unbounded domain Ω = [0, ∞) (cf. Remark 4.4), we fully expect (but do not prove here) that under Assumption 4.3, Assumption 4.5 holds for the overdamped Langevin dynamics constrained (by reflection as in (3.9)) to remain in the support of $\pi_K$ or $\pi_{K+1}$. Alternatively, the reader may simply assume that we draw i.i.d. samples from the biased distributions. All our results hold in that case.

We show in Theorem 4.6 that the relative asymptotic variance of the EMUS estimate of pM grows only polynomially with M for a broad class of target distributions π. Therefore, EMUS may be dramatically more efficient than direct MCMC sampling when the goal is to compute tail probabilities. We observe that while the hypotheses of the theorem are somewhat restrictive, similar results hold more generally; for example, see Section 5 where we compute tails of a marginal density.

Theorem 4.6. Let Assumptions 3.6, 4.3, and 4.5 hold. Set

$K = \left\lceil M \max_{x \le M} |V'(x)| \right\rceil.$

Define a family of K + 2 bias functions ψi by (4.7). Take κi = 1/(K + 2). Let σM,US2 denote the asymptotic variance of the EMUS estimate of pM. We have

$\frac{\sigma_{M,US}^2}{p_M^2} \le C K^2$

for some constant C > 0 depending on V but not on M.

For example, suppose that

$V(x) = \tilde{V}(x) + x^r,$

where $\tilde{V}$ has bounded support and $r \ge 1$. Then $|V'(x)| \le C(1 + M^{r-1})$, and so

$\frac{\sigma_{M,US}^2}{p_M^2} \le C M^2(1 + M^{r-1})^2.$

Proof. The proof is similar to that of the low temperature limit, Theorem 4.1, but with complications arising because not all strata are bounded and because here we consider the relative variance instead of the variance; see Appendix E. In particular, we require Assumption 4.3 to show that one can in fact choose h so that all nonzero entries of F are bounded away from zero uniformly as M increases; cf. Lemma E.1. This is the only part of the proof relying on Assumption 4.3. ■

5. EMUS for tails: An example from Bayesian inference.

We demonstrate the use of EMUS for efficiently exploring and visualizing distributions. In particular, we show how EMUS may be used to efficiently compute both marginal densities and also tail probabilities of the form P[η(Z) ≥ ε−1] where η(Z) is a real valued function of a high-dimensional random variable Z. For both tails and marginals, there is a natural and easy to implement choice of strata, which we describe in Section 5.1.

In Section 5.3, we calculate two different one-dimensional marginals of the posterior distribution of the hierarchical Bayesian mixture model described in Section 5.2. For one marginal, the natural stratification suffices. For the other, it does not, but a preliminary computation made with the natural stratification suggests a better choice of strata. We use this example to explain how to diagnose and correct problems related to poorly chosen strata: Our results will serve to guide the practice of stratified MCMC.

5.1. The natural stratification for tails and marginals.

Here, we briefly explain how EMUS can be used to estimate tail probabilities and low-dimensional marginals of high-dimensional distributions. Let $\Omega \subset \mathbb{R}^d$; let π be a probability distribution on Ω; and let $\eta : \Omega \to \mathbb{R}$. Suppose that one wishes to estimate the very small tail probability $P[\eta(Z) \ge \epsilon^{-1}]$. In this case, it is natural to stratify in η only. That is, one may choose a partition of unity $\{\phi_i\}_{i=1}^L$ on $\mathbb{R}$ and define bias functions

$\psi_i(x) = \phi_i(\eta(x)) \quad\text{for } i = 1, \dots, L$ (5.1)

depending only on η. For a partition of unity, one might choose the regular grid of piecewise constant functions defined in Section 4.2. We refer to (5.1) as the natural stratification. To compute the tail probability, one uses EMUS to estimate $\pi[\mathbf{1}_{[\epsilon^{-1},\infty)} \circ \eta]$.

Computing marginal densities is similar; in fact, computing tails may be understood as a special case of computing a marginal density. Suppose now that $\eta : \Omega \to \mathbb{R}^\ell$. To estimate the marginal $\pi_\eta$ of π in η, one chooses a partition of unity $\{\phi_i\}_{i=1}^L$ on $\mathbb{R}^\ell$, again defining bias functions by (5.1). One then uses EMUS to compute averages of histogram bins, which are functions of the form

$b_{\eta_0}(\eta(x)) = \mathbf{1}_{\eta_0 + h[-1,1]^\ell}(\eta(x)).$ (5.2)

We have

$\lim_{h \to 0}\frac{1}{(2h)^\ell}\,\pi[b_{\eta_0}] = \pi_\eta(\eta_0),$

so for small h the averages of the histogram bins approximate πη.

By the argument in Section 4.2, EMUS with the natural stratification will be dramatically more efficient than direct sampling as long as the biased distributions are no harder to sample than the target distribution π. Essentially, this is because with the natural stratification very small averages like $P[\eta(Z) \ge \epsilon^{-1}]$ over the target distribution π are expressed as functions of much larger averages over the biased distributions $\pi_i$. Unfortunately, however, for general functions η, the biased distributions of the natural stratification need not be easy to sample. In Section 5.3, we give one example where the natural stratification works and one where it does not. In the case where it does not, we explain how to make a better choice of strata.
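Concretely, with the natural stratification a tail probability becomes an ordinary EMUS average of an indicator function. The sketch below (Python/NumPy) assumes each biased trajectory is stored only through its η-values and reuses the hypothetical `emus_estimate` helper from the sketch in Section 2.1; both choices are illustrative assumptions of ours.

```python
import numpy as np

def tail_probability_emus(eta_trajs, phis, threshold):
    """Estimate P[eta(Z) >= threshold] with the natural stratification (5.1).

    eta_trajs : list of 1-d arrays, eta evaluated along the trajectory of each
                biased process.
    phis      : list of L callables forming a partition of unity on the real line.
    """
    # psi_i(x) = phi_i(eta(x)), evaluated along each stored trajectory
    psis = [np.column_stack([phi(tr) for phi in phis]) for tr in eta_trajs]
    # g(x) = indicator of the tail event, also a function of eta only
    gs = [(tr >= threshold).astype(float) for tr in eta_trajs]
    return emus_estimate(psis, gs)   # helper defined in the Section 2.1 sketch
```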

5.2. A hierarchical Bayesian mixture model.

Here, we review the hierarchical Bayesian mixture model proposed in [39], and we discuss the difficulties which complicate inference under this model. As a tutorial in the use of EMUS, we present a numerical investigation of these difficulties in Section 5.3.

In the hierarchical mixture model, the data vector $y = (y_1, \dots, y_n) \in \mathbb{R}^n$ consists of independent identically distributed samples drawn from a mixture distribution of the form

$p(y_i \mid \phi) = \sum_{k=1}^K q_k\,\nu(y_i; \mu_k, \lambda_k^{-1}),$

where K is the number of mixture components, $q_k$ is the weight of the k'th mixture component, $\nu(\cdot\,; \mu_k, \lambda_k^{-1})$ is the normal density with mean $\mu_k$ and variance $\lambda_k^{-1}$, and ϕ is the vector of parameters

$\phi = (\mu_1, \dots, \mu_K, \lambda_1, \dots, \lambda_K, q_1, \dots, q_{K-1}).$

(Since $p(y_i \mid \phi)$ is a probability distribution, $q_1 + \cdots + q_K = 1$, and $q_1, \dots, q_{K-1}$ determine $q_K$.) The following prior distribution is imposed on ϕ:

$\mu_k \sim \mathcal{N}(m, \kappa^{-1}), \quad \lambda_k \sim \operatorname{Gamma}(\alpha, \beta), \quad \beta \sim \operatorname{Gamma}(g, h), \quad (q_1, \dots, q_{K-1}) \sim \operatorname{Dirichlet}_K(1, \dots, 1).$

As in [23, 11], we choose

$m = M, \quad \kappa = \frac{4}{R^2}, \quad \alpha = 2, \quad g = 0.2, \quad\text{and}\quad h = \frac{100\,g}{\alpha R^2}$

where R and M are the range and the mean of the observed data, respectively. The posterior density is

$p(\theta \mid y) = \frac{\kappa^{K/2}\,h^g\,\beta^{K\alpha + g - 1}}{Z_K\,\Gamma(\alpha)^K\,\Gamma(g)\,(2\pi)^{\frac{n+K}{2}}}\left(\prod_{k=1}^K \lambda_k\right)^{\alpha-1} \times \exp\left\{-\frac{\kappa}{2}\sum_{k=1}^K(\mu_k - M)^2 - \beta\left(h + \sum_{k=1}^K \lambda_k\right)\right\} \times \prod_{i=1}^{n}\left(\sum_{k=1}^K q_k\,\lambda_k^{\frac{1}{2}}\exp\left\{-\frac{\lambda_k}{2}(y_i - \mu_k)^2\right\}\right),$

where θ = (ϕ, β) denotes the vector of all parameters to be inferred, including the hyperparameter β.

Several factors complicate inference based on this model: First, the mixture components are not identifiable; that is, the posterior distribution is invariant under permutation of the labels of the mixture components. Consequences of non-identifiability are discussed at length in [23, 11]. In our computations in Section 5.3, we impose the constraint

$\mu_1 \le \mu_2 \le \cdots \le \mu_K$

to ensure that the components are identifiable. Second, in Lemma 5.1, we show that the posterior density may be unbounded, introducing spurious modes with infinite density. Finally, even with identifiability constraints, the posterior distribution may have multiple modes of finite posterior density. For example, see the modes reported in [11]. In Section 5.3, we use EMUS to efficiently visualize the posterior, assessing the effects of multimodality and unboundedness.

We suspect that the unboundedness of the posterior for this model is well known. However, we are unable to find a reference, so we now explain. It is certainly well known that the likelihood of a Gaussian mixture model is unbounded: Roughly speaking, the likelihood is infinite when any mixture component is collapsed on a single data point [1]. Nonetheless, one might expect the posterior density p(θ|y) to be bounded, since the prior penalizes large values of the precisions λi. This is not always the case when the data vector contains repeated entries:

Lemma 5.1. If any datum yi has frequency Ni greater than

$2g + 2(K-1)\alpha,$

then the posterior density p(θ|y) is unbounded.

Proof. Take the limit of p(θ|y) as $\lambda_1 \to \infty$ with $\mu_1 = y_i$, $\beta = \lambda_1^{-1}$, and all other variables held fixed. ■

The reader will observe that under the model, the set of data vectors with repeated entries has probability zero. However, in practice, the data consist of measurements with finite precision, and therefore repeated entries occur commonly, cf. the Hidalgo stamp data used in Section 5.3.

5.3. Numerical experiments: Choosing strata, computing tails, diagnosis of problems.

In this section, we explain how to recognize and correct problems related to poor choices of strata, and we demonstrate the use of EMUS to investigate the multimodality and unboundedness of the posterior in the mixture model. We first compute two one-dimensional marginals of the high-dimensional posterior density p(θ|y) using the natural stratification (5.1). The natural stratification works in one case but not the other. In the case where the natural stratification does not work, preliminary calculations based on the natural stratification suggest a better choice of strata.

Here, we let y be the Hidalgo stamp data set first studied in [22], consisting of the thicknesses of 485 stamps, ranging between 60 μm and 130 μm. We let there be three mixture components (K = 3), following previous computational studies [11, 23]. In our first calculation, we estimated the marginal in μ2 using the natural stratification with a grid of 201 bias functions covering the range [7, 11], with the support of the leftmost and rightmost bias functions reaching to −∞ and ∞, respectively. For the middle strata, define ϕ_1 : ℝ → ℝ by

\phi_1(x) := \max\{0, 1 - |x|\}. \qquad (5.3)

We used the bias functions

\psi_i(\theta) = \phi_1\!\left(\frac{\mu_2 - (7 + (i-1)h)}{h}\right), \quad \text{where } h := 0.02, \qquad (5.4)

for i = 2, …, 200. Now, define ϕ_2 : ℝ → ℝ by

\phi_2(x) := \min\{\max\{0, 1 - x\}, 1\}. \qquad (5.5)

The first and last bias functions were

\psi_1(\theta) = \phi_2\!\left(\frac{\mu_2 - 7}{h}\right), \qquad (5.6)
\psi_{201}(\theta) = \phi_2\!\left(\frac{(7 + 200h) - \mu_2}{h}\right), \qquad (5.7)

where h = 0.02 as before.
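To make the construction concrete, the following Python sketch (ours, not part of the EMUS software; the helper names phi1, phi2, and make_bias_functions are ours) builds the 201 bias functions (5.4)–(5.7) as functions of the stratified coordinate μ2 and checks that they form a partition of unity.

```python
import numpy as np

def phi1(x):
    """Triangular hat function: phi_1(x) = max{0, 1 - |x|}, cf. (5.3)."""
    return np.maximum(0.0, 1.0 - np.abs(x))

def phi2(x):
    """One-sided ramp: phi_2(x) = min{max{0, 1 - x}, 1}, cf. (5.5)."""
    return np.minimum(np.maximum(0.0, 1.0 - x), 1.0)

def make_bias_functions(lo=7.0, hi=11.0, n=201):
    """Return n bias functions psi_i of mu2 on a uniform grid over [lo, hi].

    Interior functions are hats of half-width h; the first and last are ramps
    whose support extends to -inf and +inf, so the psi_i sum to one everywhere.
    """
    h = (hi - lo) / (n - 1)                                          # h = 0.02 here
    psis = [lambda mu2, lo=lo, h=h: phi2((mu2 - lo) / h)]            # psi_1, cf. (5.6)
    for i in range(2, n):
        c = lo + (i - 1) * h
        psis.append(lambda mu2, c=c, h=h: phi1((mu2 - c) / h))       # psi_i, cf. (5.4)
    psis.append(lambda mu2, hi=hi, h=h: phi2((hi - mu2) / h))        # psi_n, cf. (5.7)
    return psis

# Sanity check: the bias functions form a partition of unity.
psis = make_bias_functions()
grid = np.linspace(5.0, 13.0, 1000)
assert np.allclose(sum(psi(grid) for psi in psis), 1.0)
```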

We chose the total number of bias functions based on the sizes of the off-diagonal entries in the overlap matrix. For any bias functions of the form (5.4), the overlap matrix is tridiagonal. Thus, by Remark 3.12, if the superdiagonal and subdiagonal entries Fi,i+1 and Fi,i−1 are sufficiently large, then the EMUS estimator is not too sensitive to statistical errors in F¯. For our choice of bias functions,

\min\{F_{i,i+1} : i = 1, \ldots, 200\} \geq 0.01 \quad \text{and} \quad \min\{F_{i,i-1} : i = 2, \ldots, 201\} \geq 0.004. \qquad (5.8)

We sampled the biased distributions using the affine invariant ensemble sampler with 100 walkers, as implemented in the emcee package [15]. Due to computational restrictions on memory, only every tenth sample point was saved. As a check on the sampling, the average acceptance probability over all walkers in the ensemble sampler was calculated for each biased distribution. Averaging over biased distributions gave a total average acceptance probability of 0.31. The minimum acceptance probability over all distributions was 0.12.
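For readers who want to reproduce this kind of calculation, the sketch below shows how one stratum might be sampled with emcee, assuming the emcee 3.x interface; it is a minimal illustration rather than the code used for the paper. Here log_posterior is assumed to evaluate the log of the unnormalized posterior p(θ|y), and psi_i is one bias function evaluated at the stratified coordinate of θ, for example lambda theta: psi(theta[idx_mu2]) with psi taken from the previous sketch and idx_mu2 the (hypothetical) index of μ2 in the parameter vector.

```python
import numpy as np
import emcee

def log_biased_density(theta, log_posterior, psi_i):
    """log of psi_i(theta) * p(theta | y), up to an additive constant."""
    w = psi_i(theta)
    if w <= 0.0:                      # outside the support of this stratum
        return -np.inf
    return np.log(w) + log_posterior(theta)

def sample_stratum(log_posterior, psi_i, p0, n_burn=3000, n_steps=100000, thin=10):
    """Sample one biased distribution; p0 has shape (n_walkers, n_dim)."""
    n_walkers, n_dim = p0.shape
    sampler = emcee.EnsembleSampler(
        n_walkers, n_dim, log_biased_density, args=(log_posterior, psi_i))
    state = sampler.run_mcmc(p0, n_burn)          # equilibration
    sampler.reset()
    sampler.run_mcmc(state, n_steps)              # production run
    chain = sampler.get_chain(thin=thin, flat=True)   # keep every thin-th point
    mean_accept = np.mean(sampler.acceptance_fraction)
    return chain, mean_accept
```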

To initialize sampling, we computed an unbiased test trajectory; that is, a trajectory having ergodic distribution π. We then started by sampling a single biased distribution πk, initializing with points drawn randomly from the unbiased trajectory. We sampled the other biased distributions in sequence, initializing with points drawn randomly from samples of adjacent biased distributions. Thus, we sampled πk first, then πk−1 and πk+1, then πk−2 and πk+2, etc. We equilibrated the sampler in each πi for 3000 Monte Carlo steps, and collected data for an additional 100000 Monte Carlo steps. Each step of the ensemble sampler involves perturbing the positions of each of the 100 walkers.
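The sweep just described can be written down compactly; the sketch below (our own notation, reusing sample_stratum from the previous snippet, with the psis here taken as functions of the full parameter vector) seeds each stratum with points drawn from the sample of the adjacent stratum already visited, restricted to the support of the new bias function.

```python
import numpy as np

def sample_all_strata(log_posterior, psis, unbiased_points, k, n_walkers=100, seed=None):
    """Sample every stratum, sweeping outward from stratum k."""
    rng = np.random.default_rng(seed)
    L = len(psis)
    samples = [None] * L

    def init_from(points, psi):
        # Seed walkers only from points lying in the support of the new stratum.
        inside = np.array([psi(p) > 0 for p in points])
        pool = points[inside] if inside.any() else points
        return pool[rng.integers(0, len(pool), size=n_walkers)]

    # Stratum k is seeded from an unbiased test trajectory ...
    samples[k], _ = sample_stratum(log_posterior, psis[k], init_from(unbiased_points, psis[k]))
    # ... and the rest from adjacent samples: k-1 and k+1, then k-2 and k+2, etc.
    for offset in range(1, L):
        for i in (k - offset, k + offset):
            if 0 <= i < L and samples[i] is None:
                neighbor = i + 1 if i < k else i - 1
                samples[i], _ = sample_stratum(
                    log_posterior, psis[i], init_from(samples[neighbor], psis[i]))
    return samples
```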

We computed the marginal in μ2 using a grid of 200 histogram bins, covering the region [7, 11]; this corresponds to taking h = 0.01 in (5.2). The result is the curve labeled EMUS in Figure 3a. The marginal in μ2 has two modes, labeled 1 and 2 in Figure 3a. We plot the mixture distributions corresponding to these modes in Figure 4. (To be precise, the distributions in Figure 4 correspond to means over histogram bins centered at the labeled points.)
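The following numpy sketch shows how the per-stratum data might be combined into such a marginal estimate in the partition-of-unity case, using the estimator from Appendix A, which then reduces to π^US[g] = Σ_i w_i(F̄) ḡ_i. It is our own minimal illustration, not the EMUS package; coord denotes the map from a saved point to its stratified coordinate (here μ2), and psis are the bias functions of that coordinate from the earlier sketch.

```python
import numpy as np

def emus_weights(samples, psis, coord):
    """Estimate the overlap matrix F-bar and its stationary vector w (z_i proportional to w_i)."""
    L = len(psis)
    F = np.zeros((L, L))
    for i in range(L):
        x = coord(samples[i])                       # stratified coordinate of each saved point
        denom = sum(psi(x) for psi in psis)         # equals 1 for a partition of unity
        for j in range(L):
            F[i, j] = np.mean(psis[j](x) / denom)   # F-bar_{ij}: average of psi_j^* over stratum i
    # w solves w F = w with sum(w) = 1: the left eigenvector for the eigenvalue 1.
    evals, evecs = np.linalg.eig(F.T)
    w = np.real(evecs[:, np.argmax(np.real(evals))])
    w = np.abs(w) / np.abs(w).sum()
    return F, w

def emus_marginal_histogram(samples, psis, coord, edges):
    """EMUS estimate of the marginal density on the bins defined by `edges`."""
    _, w = emus_weights(samples, psis, coord)
    hist = np.zeros(len(edges) - 1)
    for wi, s in zip(w, samples):
        counts, _ = np.histogram(coord(s), bins=edges)
        hist += wi * counts / len(s)    # pi[bin] ~ sum_i w_i * (fraction of stratum i in bin)
    return hist / np.diff(edges)        # convert bin probabilities to a density
```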

Figure 3:

Estimates of the logarithm of the marginal density in μ2 and the asymptotic variances of those estimates. Figure 3a displays estimates of the marginal in μ2 computed by EMUS and by an unbiased trajectory of the ensemble sampler. Figure 3b displays the asymptotic variances of these two estimates of the marginal density. We note that while the unbiased calculation has greater accuracy near the mode, the EMUS calculation has greater accuracy in the tails. The relative errors in this figure were estimated using the method described in Appendix F.

Figure 4:

Gaussian mixtures corresponding to modes of the marginal in μ2. Mixtures 1 and 2 correspond to the labeled points in Figure 3a. To be precise, the blue curve in each plot is the mixture distribution corresponding to the mean of a histogram bin centered at the point labeled in Figure 3a. The green curves are the individual mixture components. The black bars are a histogram of the Hidalgo stamp data.

For comparison, we also estimated the marginal in μ2 from multiple long, unbiased trajectories. We computed 100 unbiased trajectories of the affine invariant ensemble sampler in parallel. For each trajectory, the ensembles were first equilibrated for 10000 Monte Carlo steps, and then data were collected for 100000 steps. These trajectories were combined and binned to produce the density labeled Unbiased in Figure 3a. We estimated the relative asymptotic variance of the marginal density for the unbiased calculation using ACOR [14], and we estimated the relative asymptotic variance for the EMUS calculation using the method outlined in Appendix F. We present the results in Figure 3b. Note that near the mode, unbiased MCMC performs slightly better than EMUS, but in the tails, EMUS performs dramatically better.

After computing the marginal in μ2, we tried computing the marginal in log10 λ1. We used the natural stratification with a grid of 50 bias functions with maxima equally spaced between −1 and 3.2 constructed as

\psi_i(\theta) = \phi_1\!\left(\frac{-1 + h(i-1) - \log_{10}\lambda_1}{h}\right),

where

h = \frac{3.2 - (-1)}{49}.

We used the same initialization scheme as for the marginal in μ2, beginning with a single biased distribution initialized from an unbiased test trajectory. We call this the center sample. The result of this calculation was the density labeled “1D Center” in Figure 5a. When we tried to compute the asymptotic variance of this density estimate, we noticed very slow convergence of the sampler for some biased distributions. To investigate, we performed another EMUS calculation using a similar initialization procedure, but starting from π1, the biased distribution at the extreme left, covering the lowest values of λ1. We call this the left sample. The result of this second calculation was the density labeled “1D Left” in Figure 5a. For both the center and left samples, the strata were equilibrated for 3000 steps and sampled for another 200000. We observe that the two densities differ significantly in the region −1 ≤ log10 λ1 ≤ 0.5. They should be the same up to sampling errors; for example, we observe that different initializations have no effect on the calculation of the marginal in μ2, cf. Figure 3a.

Figure 5:

Estimates of the logarithm of the marginal density in log10 λ1 and the asymptotic variances of those estimates. Figure 5a displays the estimates of the marginal in log10 λ1 computed by various methods. The error bars are twice the estimated asymptotic standard deviation in each histogram bin. For the two-dimensional EMUS calculations, standard deviations were estimated using the method described in Appendix F. For the unbiased calculation, asymptotic variances were estimated using ACOR [14]. No error bars are given for the two one-dimensional calculations, as the barrier depicted in Figure 10 makes accurate estimation of the asymptotic variance impossible. A clear error is visible in the two one-dimensional umbrella sampling calculations, due to initialization on either side of the barrier in Figure 10. Figure 5b displays the asymptotic variance of the marginal density in log10 λ1 for the unbiased and the two-dimensional EMUS calculations. We note that while the unbiased calculation achieves greater accuracy near the mode, the EMUS calculation achieves greater accuracy in the tails.

Figure 6 explains the problem and suggests a solution: In the region 0.2 ≤ log10 λ1 ≤ 0.7, the center and left samples cover entirely different ranges of log10 λ2. This suggests that the biased distributions corresponding to the range 0.2 ≤ log10 λ1 ≤ 0.7 are multimodal, with barriers in λ2 impeding sampling.

Figure 6:

To generate Figure 6, we binned the samples for the one-dimensional left and center EMUS calculations, and we plotted the difference in the histograms. The contour lines are contours of the log marginal density, as in Figure 7a. Figure 6 shows that while the two calculations largely sample the same regions, near log10 λ1 = 0.45 they become trapped on opposite sides of a barrier. This leads to poor sampling, causing a slowly decaying error in the estimates of the marginal density, cf. Figure 5a.

To confirm the hypothesis that barriers in λ2 were responsible for the poor convergence observed in the center and left samples, we performed a third calculation, stratifying in both log10 λ1 and log10 λ2. We used a 50 × 50 grid of bilinear bias functions, with maxima equally spaced between −1 and 3.2. To be precise, for i, j = 1, …, 50, we defined the bias functions

\psi_{ij}(\theta) = \phi_1\!\left(\frac{-1 + h(i-1) - \log_{10}\lambda_1}{h}\right)\, \phi_1\!\left(\frac{-1 + h(j-1) - \log_{10}\lambda_2}{h}\right),

with h as before. Let ηij denote the biased distribution corresponding to ψij.

We performed the two-dimensional EMUS calculation twice, initializing from the center and left samples drawn from the natural stratification in log10 λ1. For each i = 1, …, 50, to sample the row {ηij : j = 1, …, 50} of biased distributions, we began by initializing sampling of a single biased distribution ηik with points from either the center or left sample of πi. We then sampled the other distributions ηij for j ≠ k in sequence, again initializing with points from samples of adjacent distributions, either ηi,j+1 or ηi,j−1 in this case. If no samples were found inside the support of a biased distribution, that distribution was ignored. For each biased distribution, sampling was burned in for 4500 steps, and samples were collected for an additional 2500 steps. Ultimately, 1397 of the 2500 biased distributions were sampled; the unsampled distributions correspond to the white space in Figure 7a.

Figure 7:

Logarithm of marginal density in log10 λ1 and log10 λ2 as estimated by EMUS and unbiased MCMC. Contour lines in both figures are every unit change in the estimated log10 marginal density. Figure 7a is the EMUS estimate. The numbers 1, 2, and 3 on this figure correspond to the mixture densities in Figure 8. Note that at values of log10 λ near 3.0 we begin to see the modes corresponding to singularities of the posterior. Figure 7b is the marginal density estimated from a long unbiased trajectory of the ensemble sampler. Note that the entire trajectory lies in a small neighborhood of the mode labeled 1 in Figure 7a.

We computed the marginal in log10 λ1 and log10 λ2 using a 200×200 grid of histogram bins, covering the region −1 ≤ log10 λ1 ≤ 3.2 and −1 ≤ log10 λ2 ≤ 3.2; this corresponds to taking h = (3.2 − (−1))/200 in (5.2); the result from the center calculation appears in Figure 7a. In Figure 8, we show the mixture distributions corresponding to the modes of the two-dimensional marginal in Figure 7a. The two-dimensional marginals were essentially the same for the center and left initializations; see Figure 9. We also estimated the one-dimensional marginal in log10 λ1 using the two-dimensional stratification; see the results labeled “2D Center” and “2D Left” in Figure 5a. Finally, we estimated the relative asymptotic variance of the marginal in log10 λ1 computed by two-dimensional stratification. Again, we observe that EMUS performs much better than unbiased sampling in the tails, cf. Figure 5b.

Figure 8:

Gaussian mixtures corresponding to means of histogram bins. Mixtures 1 through 3 correspond to the labeled points in Figure 7a; mixture 4 corresponds to a distribution near a singularity of the posterior, with log10 λ1 = 4.34 and log10 λ2 = 0.79. To be precise, the blue curve in each plot is the mixture distribution corresponding to the mean of a histogram bin centered at the point labeled in Figure 7a. The green curves are the individual mixture components. The black bars are a histogram of the Hidalgo stamp data.

Figure 9:

The difference between the free energy surfaces of the two-dimensional umbrella sampling runs. The center calculation was initialized from the center one-dimensional calculation, and the left calculation from the left one-dimensional calculation. In general the difference is small, roughly a tenth of an order of magnitude in the log marginal.

The marginal in log10 λ1 and log10 λ2 confirms that barriers in λ2 caused the problems observed in calculating the marginal in log10 λ1 using the natural stratification. In fact, we see that computing the marginal in either λ1 or λ2 requires stratifying both variables, as stratifying only one leads to barriers that impede sampling in the other. In particular, there are barriers in λ2 along the line log10 λ1 = 0.45 and a barrier in λ1 along log10 λ2 = 0.6: In Figure 10, we plot an estimate of the conditional distribution of log10 λ2 with log10 λ1 = 0.45 fixed. This distribution is multimodal with a region of very low probability separating the modes, which explains the poor sampling depicted in Figure 6.

Figure 10:

Here we give an estimate of the conditional distribution of log10 λ2 with log10 λ1 = 0.45 calculated from the two-dimensional marginal seen in Figure 7a. The conditional distribution is multimodal. The mode on the left corresponds to mixtures with the data from thicknesses of 60 to 85 μm covered by a single Gaussian similar to mode 2 in Figure 8. The mode on the right corresponds to mixtures with these data covered by two Gaussians similar to mode 1 in Figure 8.

To conclude, we have confirmed that EMUS can be extremely efficient for computing tails. However, one must exercise care in the choice of strata. The natural stratification often suffices, but in some cases, like computing the marginal in log10 λ1, the biased distributions of the natural stratification may be very difficult to sample. We propose the use of different initializations, like the center and left samples, as a method of identifying problems related to poorly chosen strata. Careful inspection of simulations performed with these different initializations can identify problems and suggest better strata.

6. Conclusions.

We have analyzed the Eigenvector Method for Umbrella Sampling (EMUS), an especially simple and effective stratified MCMC method sharing many features with the popular WHAM [27] and MBAR [42] methods of computational chemistry. We have demonstrated the advantages of EMUS for sampling from multimodal distributions and computing tail probabilities, and we have explained how to identify and resolve the problems which may occur if the method is implemented poorly. We have also given a tutorial intended to explain how to diagnose and correct problems related to poorly chosen strata.

Our purpose was to explain the benefits of stratified MCMC analytically, with the ultimate goal of introducing stratified MCMC to a diverse audience of statisticians, engineers, and scientists. Since stratified MCMC had previously been applied only to a particular class of statistical mechanics calculations without any general justification, we began by developing a general theory. We hope that our theory will serve as the basis for further developments. For example, it may now be possible to undertake a comparison of EMUS and other so-called reaction coordinate methods such as Wang–Landau sampling [50] or Metadynamics [28]. Despite some similarities with EMUS, these methods work by a substantially different mechanism and understanding the relative advantages of the two approaches is non-trivial.

Acknowledgments

BvK was supported by NSF RTG: Computational and Applied Mathematics in Statistical Science, number 1547396. ARD and EHT were supported by National Institutes of Health (NIH) Grant Number 5 R01 GM109455-02. We wish to thank Jonathan Mattingly, Jeremy Tempkin, and Charlie Matthews for helpful discussions.

Appendix A. Proof of Theorem 3.3.

Our proof of Theorem 3.3 (the CLT for EMUS) is based on the delta method. To apply the delta method, we require the following result ensuring the differentiability of w(G):

Lemma A.1. The function w(G) admits an extension w̃ : ℝ^{L×L} → ℝ^L which is differentiable on the set of irreducible stochastic matrices.

Proof. By [45, Lemma 3.1], w(G) admits a continuously differentiable extension to an open set U ⊂ ℝ^{L×L}. We further extend the domain of w(G) to ℝ^{L×L} by arbitrarily defining w(G) = 0 whenever G ∈ ℝ^{L×L} \ U. ■

The extension in Lemma A.1 resolves two technicalities: First, the set of stochastic matrices is not a vector space but a compact, convex subset of ℝ^{L×L} with empty interior. Therefore, the derivative of w is undefined. Second, F̄ may be reducible for some values of N and some realizations of the processes sampling the biased distributions. In that case, the invariant distribution of F̄ is not unique, so w(F̄) is undefined. Throughout the remainder of this work, w(G) will denote the extension guaranteed by the lemma. We now prove the CLT for EMUS.

Proof of Theorem 3.3. The proof is based on the delta method [6, Proposition 6.2] and a formula for w(F¯) given in [19].

By Lemma A.1, w(F¯) is differentiable at F, so the function

B\bigl(\bar F, \{\bar g_i^*\}_{i=1}^{L}, \{\bar 1_i^*\}_{i=1}^{L}\bigr) := \pi^{\mathrm{US}}[g] = \frac{\sum_{i=1}^{L} w_i(\bar F)\, \bar g_i^*}{\sum_{i=1}^{L} w_i(\bar F)\, \bar 1_i^*}

is differentiable at (F, {π_i[g^*]}_{i=1}^L, {π_i[1^*]}_{i=1}^L). Let ∂_iB ∈ ℝ^{L+2} be the derivative of B with respect to those quantities computed from X_t^i; that is,

\partial_i B := \left(\frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}}\right) \in \mathbb{R}^{L+2}, \qquad (A.1)

where ∂B/∂F̄_{i:} ∈ ℝ^L denotes the partial derivative of B with respect to the i'th row of F̄ and

\frac{\partial B}{\partial \bar G_{i:}} := \left(\frac{\partial B}{\partial \bar g_i^*}, \frac{\partial B}{\partial \bar 1_i^*}\right) \in \mathbb{R}^{2}.

To simplify notation, we will assume throughout the remainder of this argument that all derivatives are evaluated at (F, {π_i[g^*]}_{i=1}^L, {π_i[1^*]}_{i=1}^L). In formulas involving matrix multiplication, we will treat ∂_iB, ∂B/∂F̄_{i:}, and ∂B/∂Ḡ_{i:} as row vectors.

Since we assume that the processes Xti sampling the different measures πi are independent, [5, Chapter 1, Theorem 2.8] implies that

\sqrt{M}\Bigl(\bigl(\bar F_{1:}, \bar g_1^*, \bar 1_1^*, \ldots, \bar F_{L:}, \bar g_L^*, \bar 1_L^*\bigr) - \bigl(F_{1:}, \pi_1[g^*], \pi_1[1^*], \ldots, F_{L:}, \pi_L[g^*], \pi_L[1^*]\bigr)\Bigr) \xrightarrow{d} N(0, \Sigma), \qquad (A.2)

where Σ is the covariance matrix of the product of the distributions N(0, κ_i^{−1}Σ^i). (That is, Σ ∈ ℝ^{L(L+2)×L(L+2)} is the block diagonal matrix with the matrices κ_i^{−1}Σ^i along the diagonal.) Therefore, by the delta method,

\sqrt{M}\bigl(\pi^{\mathrm{US}}[g] - \pi[g]\bigr) \xrightarrow{d} N(0, \sigma^2),

where

\sigma^2 = (\partial_1 B, \ldots, \partial_L B)\, \Sigma\, (\partial_1 B, \ldots, \partial_L B)^t = \sum_{i=1}^{L} \kappa_i^{-1}\, \partial_i B\, \Sigma^i\, \partial_i B^t. \qquad (A.3)

Now we observe that for any column vector v ∈ ℝ^L having mean zero,

\frac{d}{d\varepsilon}\, w_k(F + \varepsilon\, e_i v^t)\Big|_{\varepsilon=0} = \frac{\partial w_k}{\partial \bar F_{i:}}\, v = z_i\, v^t (I - F)^{\#} e_k,

by [19, Theorem 3.1]. (In the formula above, e_i ∈ ℝ^L denotes the i'th standard basis vector.) Therefore, we have

\frac{\partial B}{\partial \bar F_{i:}}\, v = \frac{\sum_{k=1}^{L} \frac{\partial w_k}{\partial \bar F_{i:}} v\, \pi_k[g^*]}{\sum_{k=1}^{L} z_k \pi_k[1^*]} - \frac{\sum_{k=1}^{L} \frac{\partial w_k}{\partial \bar F_{i:}} v\, \pi_k[1^*]}{\sum_{k=1}^{L} z_k \pi_k[1^*]}\, \frac{\sum_{k=1}^{L} z_k \pi_k[g^*]}{\sum_{k=1}^{L} z_k \pi_k[1^*]} = \sum_{k=1}^{L} \frac{\partial w_k}{\partial \bar F_{i:}} v\, \Psi\bigl(\pi_k[g^*] - \pi[g]\, \pi_k[1^*]\bigr) = z_i\, v^t (I - F)^{\#}\, g, \qquad (A.4)

where

g_k = \Psi\, \pi_k\bigl[g^* - \pi[g]\, 1^*\bigr] = l\bigl(\pi_k[g^*], \pi_k[1^*]\bigr).

(Equality (A.4) above follows from (2.3) and the definition (2.2) of Ψ.) Also,

\frac{\partial B}{\partial \bar G_{i:}} = z_i\, \Psi\, (1,\, -\pi[g]) = z_i\, l^t. \qquad (A.5)

Thus,

\partial_i B\, \Sigma^i\, \partial_i B^t = \frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i\, \frac{\partial B}{\partial \bar F_{i:}}^{\,t} + 2\, \frac{\partial B}{\partial \bar F_{i:}}\, \rho^i\, \frac{\partial B}{\partial \bar G_{i:}}^{\,t} + \frac{\partial B}{\partial \bar G_{i:}}\, \tau^i\, \frac{\partial B}{\partial \bar G_{i:}}^{\,t} = z_i^2\left\{ \bigl((I-F)^{\#} g\bigr)^t \sigma^i\, (I-F)^{\#} g + 2\,\bigl((I-F)^{\#} g\bigr)^t \rho^i\, l + l^t \tau^i\, l \right\},

and the result follows by (A.3). ■

Appendix B. Proof of Theorem 3.5.

Definition B.1. Let e_i ∈ ℝ^L denote the i'th standard basis vector. For i, j ∈ {1, 2, …, L} with i ≠ j, define the logarithmic partial derivatives

\frac{\partial \log w_k}{\partial F_{ij}}(F) := \frac{d}{d\varepsilon}\Big|_{\varepsilon=0} \log w_k\bigl(F + \varepsilon\,(e_i e_j^t - e_i e_i^t)\bigr). \qquad (B.1)

(These partial derivatives must be understood as derivatives of the extension guaranteed by Lemma A.1; otherwise, they are defined only when Fij > 0 and Fii > 0.)

Our definition of logarithmic partial derivatives in (B.1) is not standard. However, we observe that a version of the standard formula relating the total and partial derivatives of log w holds: For all matrices H whose rows sum to zero,

\frac{d}{d\varepsilon}\Big|_{\varepsilon=0} \log w_k(F + \varepsilon H) = \sum_{i \neq j} \frac{\partial \log w_k}{\partial F_{ij}}(F)\, H_{ij}. \qquad (B.2)

We need only consider matrices whose rows sum to zero, since these are the only perturbations for which F + εH can be stochastic.

The following result appears in [45, Theorem 3.6]. It is crucial in our proof of Theorem 3.5.

Lemma B.2. Recall P_i[τ_j < τ_i] and ∂ log w_k/∂F_{ij} from Definitions 3.4 and B.1. For all irreducible stochastic matrices F, we have

\frac{1}{2}\, \frac{1}{P_i[\tau_j < \tau_i]} \;\leq\; \max_k \left|\frac{\partial \log w_k}{\partial F_{ij}}(F)\right| \;\leq\; \frac{1}{P_i[\tau_j < \tau_i]}.

We also require the following lemma in the proof of Theorem 3.5.

Lemma B.3. The asymptotic covariance matrix σ^i has the following properties:

  1. The rows and columns of σ^i sum to zero. That is, for e ∈ ℝ^L the vector of all ones,
     σ^i e = 0 and e^t σ^i = 0.
  2. For all j = 1, …, L,
     σ^i_{jk} = σ^i_{kj} = 0 whenever F_{ik} = 0.

Proof. Since the rows of F¯ sum to one with probability one, we have

\mathrm{var}(\bar F_{i:}\, e) = 0

for any fixed number of samples N_i. Therefore, the asymptotic covariance σ^i satisfies e^t σ^i e = 0, and it follows that e^t σ^i = σ^i e = 0 since σ^i is symmetric and positive semidefinite.

Let k be such that Fik = 0. Since F¯ik=0 with probability one, we have

cov(F¯ik,F¯ij)=0

for any j = 1, …, L, and therefore σ^i_{jk} = 0. ■

We now prove Theorem 3.5.

Proof of Theorem 3.5. We begin with formula (A.3):

\sigma^2 = \sum_{i=1}^{L} \kappa_i^{-1}\, \partial_i B\, \Sigma^i\, \partial_i B^t. \qquad (B.3)

Since the asymptotic covariance matrix Σ^i is symmetric and positive semidefinite, the Cauchy inequality holds:

a^t \Sigma^i b \;\leq\; \tfrac{1}{2}\, a^t \Sigma^i a + \tfrac{1}{2}\, b^t \Sigma^i b,

for all a, b ∈ ℝ^{L+2}. Therefore,

\partial_i B\, \Sigma^i\, \partial_i B^t = \Bigl(\frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)\, \Sigma^i\, \Bigl(\frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)^t \leq 2\,\Bigl(\frac{\partial B}{\partial \bar F_{i:}}, 0\Bigr)\, \Sigma^i\, \Bigl(\frac{\partial B}{\partial \bar F_{i:}}, 0\Bigr)^t + 2\,\Bigl(0^t, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)\, \Sigma^i\, \Bigl(0^t, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)^t = 2\,\frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i\, \frac{\partial B}{\partial \bar F_{i:}}^{\,t} + 2\,\frac{\partial B}{\partial \bar G_{i:}}\, \tau^i\, \frac{\partial B}{\partial \bar G_{i:}}^{\,t} =: 2A_0 + 2A_1. \qquad (B.4)

(Here, 0 denotes the zero vector in L, interpreted as a column vector.)

We now estimate the term A0 defined above. By (A.4), we have

A_0 = \frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i\, \frac{\partial B}{\partial \bar F_{i:}}^{\,t} = \sum_{j,k,l,m=1}^{L} g_l\, \frac{\partial w_l}{\partial \bar F_{ij}}\, \sigma^i_{jk}\, g_m\, \frac{\partial w_m}{\partial \bar F_{ik}} = \sum_{l,m=1}^{L} z_l g_l\, z_m g_m \sum_{\substack{j\neq i\\ F_{ij}>0}} \sum_{\substack{k\neq i\\ F_{ik}>0}} \frac{\partial \log w_l}{\partial \bar F_{ij}}\, \sigma^i_{jk}\, \frac{\partial \log w_m}{\partial \bar F_{ik}} = \sum_{l,m=1}^{L} z_l g_l\, z_m g_m \sum_{\substack{j\neq i\\ F_{ij}>0}} \sum_{\substack{k\neq i\\ F_{ik}>0}} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \frac{\partial \log w_l}{\partial \bar F_{ij}}\, R^i_{jk}\, \sqrt{\mathrm{var}_{\pi_i}(\psi_k^*)}\, \frac{\partial \log w_m}{\partial \bar F_{ik}}, \qquad (B.5)

where

R^i_{jk} := \frac{\sigma^i_{jk}}{\sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)\,\mathrm{var}_{\pi_i}(\psi_k^*)}}.

(The third equality above follows from formula (B.2) relating the total and partial derivatives of log w, since the rows and columns of σ^i sum to zero by Lemma B.3.)

We claim that

\sum_{\substack{j\neq i\\ F_{ij}>0}} \sum_{\substack{k\neq i\\ F_{ik}>0}} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \frac{\partial \log w_l}{\partial \bar F_{ij}}\, R^i_{jk}\, \sqrt{\mathrm{var}_{\pi_i}(\psi_k^*)}\, \frac{\partial \log w_m}{\partial \bar F_{ik}} \;\leq\; \mathrm{tr}(R^i) \sum_{\substack{j\neq i\\ F_{ij}>0}} \mathrm{var}_{\pi_i}(\psi_j^*)\, \left(\frac{\partial \log w_l}{\partial \bar F_{ij}}\right)^2. \qquad (B.6)

To prove this, we observe that R^i is symmetric and positive semidefinite since σ^i is symmetric and positive semidefinite. Therefore, R^i has the spectral decomposition

R^i = \sum_{j=1}^{L} \lambda^{i,j}\, v^{i,j} (v^{i,j})^t

with eigenvalues λ^{i,j} ≥ 0 and corresponding eigenvectors v^{i,j} such that ‖v^{i,j}‖ = 1. Thus, for any a ∈ ℝ^L,

a^t R^i a = \sum_{j=1}^{L} \lambda^{i,j} \bigl| v^{i,j} \cdot a \bigr|^2 \leq \Bigl(\sum_{j=1}^{L} \lambda^{i,j}\Bigr) \|a\|^2 = \mathrm{tr}(R^i)\, \|a\|^2. \qquad (B.7)

Inequality (B.6) follows from (B.7) by setting

a_j = \begin{cases} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \dfrac{\partial \log w_l}{\partial F_{ij}} & \text{if } j \neq i \text{ and } F_{ij} > 0, \\ 0 & \text{otherwise.} \end{cases}

Finally, combining (B.5), (B.6), and Lemma B.2 yields

A_0 \leq \mathrm{tr}(R^i)\, \Bigl(\sum_{l=1}^{L} z_l |g_l|\Bigr)^{2} \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}.

Moreover, we have

\sum_{l=1}^{L} z_l |g_l| = \Psi \sum_{l=1}^{L} z_l \bigl|\pi_l[g^* - \pi[g]\, 1^*]\bigr| \leq \Psi \sum_{l=1}^{L} z_l\, \pi_l[|h|] = \pi[|h|],

by (2.3), and therefore

A_0 \leq \mathrm{tr}(R^i)\, \pi[|h|]^2 \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}. \qquad (B.8)

We now observe that by (A.5)

A_1 = z_i^2\, l^t \tau^i\, l = \Psi^2 z_i^2\, C(\bar h_i),

where C(h̄_i) denotes the asymptotic covariance of the trajectory average h̄_i of h over the biased process X_t^i. Therefore, combining (B.3) and (B.8), we find

\sigma^2 \leq 2 \sum_{i=1}^{L} \frac{1}{\kappa_i} \left\{ z_i^2\, \Psi^2\, C(\bar h_i) + \mathrm{tr}(R^i)\, \pi[|h|]^2 \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2} \right\},

as desired.

Finally, suppose that the bias functions are a partition of unity. In that case,

\pi[|h|]^2 = \pi\bigl[|g - \pi[g]|\bigr]^2 \leq \pi\bigl[|g - \pi[g]|^2\bigr] = \mathrm{var}_\pi(g),

and so we may replace π[|h|]² with var_π(g). In addition, we observe that for a partition of unity, equation (A.4) holds with g_k = π_k[g]. Thus, following the argument above, one may verify that the result holds with π[|g|]² in place of π[|h|]².

Appendix C. Proof of Theorem 3.7.

In the arguments below, for any probability measure ν on a set Ω, we let

L^2(\nu) := \{ u : \Omega \to \mathbb{R} : \nu[u^2] < \infty \},

and we define the L2(ν) inner product

\langle f, g \rangle_\nu = \nu[fg]

with the corresponding norm

\|f\|_{L^2(\nu)} := \sqrt{\langle f, f \rangle_\nu}.

Given a set U ⊂ ℝ^d, we define L²(U), ‖·‖_{L²(U)}, and ⟨·, ·⟩_U to be the analogous function space, norm, and inner product for Lebesgue measure on U.

Our proof of Theorem 3.7 requires a Poincaré inequality, Lemma C.1. We refer to [31, Section 3] for an introduction to Poincaré inequalities and their role in the theory of diffusion processes.

Lemma C.1. Assume that the Poincaré inequality holds for U with constant Λ; that is, assume that for all weakly differentiable f : U → ℝ such that ∇f ∈ L²(U),

\Bigl\| f - \int_U f\, dx \Bigr\|_{L^2(U)} \leq \Lambda(U)\, \|\nabla f\|_{L^2(U)}.

We have a similar Poincaré inequality for πh:

\|f - \pi_h(f)\|_{L^2(\pi_h)} \leq h\,\Lambda(U)\, \exp\!\Bigl(\frac{\beta}{2}\bigl(\sup_{U_h} V - \inf_{U_h} V\bigr)\Bigr)\, \|\nabla f\|_{L^2(\pi_h)}.

Proof. By a standard scaling argument, the Poincaré inequality holds for U_h with constant hΛ. To see this, let A_h : U → U_h be the affine transformation

A_h x = x_0 + h(x - x_0).

For any f : U_h → ℝ with ∇f ∈ L²(U_h), using the change of variables formula and the chain rule, we have

\Bigl\| f - \int_{U_h} f \Bigr\|_{L^2(U_h)}^2 = h^d \Bigl\| f \circ A_h - \int_U f \circ A_h \Bigr\|_{L^2(U)}^2 \leq h^d\, \Lambda^2\, \|\nabla(f \circ A_h)\|_{L^2(U)}^2 = h^d\, h^2 \Lambda^2\, \|(\nabla f) \circ A_h\|_{L^2(U)}^2 = h^2 \Lambda^2\, \|\nabla f\|_{L^2(U_h)}^2.

Now observe that for any f ∈ L²(π_h),

\|f - \pi_h[f]\|_{L^2(\pi_h)} = \min_{c \in \mathbb{R}} \|f - c\|_{L^2(\pi_h)},

since πh[f] is the L2(πh) orthogonal projection of f onto the space of constant functions. Therefore, we have

\|f - \pi_h[f]\|_{L^2(\pi_h)}^2 \leq \Bigl\| f - \int_{U_h} f \Bigr\|_{L^2(\pi_h)}^2 \leq \Bigl(\sup_{x \in U_h} \pi_h(x)\Bigr) \Bigl\| f - \int_{U_h} f \Bigr\|_{L^2(U_h)}^2 \leq h^2 \Lambda^2 \Bigl(\sup_{x \in U_h} \pi_h(x)\Bigr) \|\nabla f\|_{L^2(U_h)}^2 \leq h^2 \Lambda^2\, \frac{\sup_{x \in U_h} \pi_h(x)}{\inf_{x \in U_h} \pi_h(x)}\, \|\nabla f\|_{L^2(\pi_h)}^2,

and the result follows. ■

Remark C.2. The Poincaré inequality for the Lebesgue measure on a set U holds under very weak conditions on U. For example, when U is convex, the Poincaré inequality holds with constant Λ(U) = D/π, where D is the diameter of the domain [38].
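As a concrete illustration of the remark (our example, not taken from [38]): if U is the unit cube [0, 1]^d, then U is convex with diameter D = √d, so the Poincaré constant is Λ(U) = √d/π; the scaling argument in the proof of Lemma C.1 then gives the constant hΛ(U) = h√d/π for the rescaled domain U_h = A_h(U), which is a cube of side h.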

We now prove Theorem 3.7:

Proof of Theorem 3.7. We begin by stating a simple consequence of the functional central limit theorem for reversible, continuous time Markov processes: Let Y_t be a reversible, stationary Markov process with ergodic distribution π and generator L. Let g ∈ L²(π), and define

\bar g := T^{-1} \int_{0}^{T} g(Y_s)\, ds.

By [25, Corollary 1.9],

\sqrt{T}\,\bigl(\bar g - \pi[g]\bigr) \xrightarrow{d} N\bigl(0, \sigma^2(g)\bigr),

where

\sigma^2(g) = -\langle g - \pi[g],\, L^{-1}(g - \pi[g]) \rangle_\pi. \qquad (C.1)

Here, L^{−1}(g − π[g]) denotes any function in the domain of L with

L\bigl(L^{-1}(g - \pi[g])\bigr) = g - \pi[g]

and π[L^{−1}(g − π[g])] = 0. Such a function must exist when g ∈ L²(π) and Y_t is reversible [25].

We now show that the process X_t^h meets the conditions above for the central limit theorem. First, we recall that the generator of X_t^h is the operator

L_h = \beta^{-1}\Delta - \nabla V \cdot \nabla

with domain

D(L_h) := \{ g \in C^2(U_h) : \nabla g(x) \cdot n(x) = 0 \text{ for all } x \in \partial U_h \};

see [2, Proposition 3.2] for the case of a convex polyhedron or [13, Chapter 8] for a domain with C3 boundary.

By [24, Theorem 4.3.3], a process Yt with invariant distribution π is reversible if its generator is symmetric and it has the strong continuity property

\lim_{t \to 0^+} \|T_t f - f\|_{L^2(\pi)} = 0 \quad \text{for all } f \in L^2(\pi), \qquad (C.2)

where T_t f(x) := E_x[f(Y_t)] denotes the backwards semigroup associated with Y_t. The generator L_h of X_t^h is symmetric, since for all f, g ∈ D(L_h), using integration by parts, we have

\beta^{-1}\langle \nabla f, \nabla g \rangle_\pi = \beta^{-1}\int_{U_h} \nabla f \cdot \nabla g\; z_h^{-1}\exp(-\beta V)\, dx = -\beta^{-1}\int_{U_h} f\, \mathrm{div}\bigl(z_h^{-1}\exp(-\beta V)\nabla g\bigr)\, dx + \beta^{-1}\int_{\partial U_h} f\, z_h^{-1}\exp(-\beta V)\, \nabla g \cdot n\, dS = -\int_{U_h}\bigl(\beta^{-1}\Delta g - \nabla V \cdot \nabla g\bigr)\, f\, z_h^{-1}\exp(-\beta V)\, dx = -\langle f, L_h g \rangle_\pi. \qquad (C.3)

(Here, z_h := \int_{U_h}\exp(-\beta V)\, dx is the normalizing constant for π_h.) Since ⟨∇f, ∇g⟩_π is invariant under exchanging f and g, ⟨f, L_h g⟩_π = ⟨L_h f, g⟩_π and L_h is symmetric. We postpone discussion of the strong continuity of X_t^h to the end of the proof.

We now use the Poincaré inequality (Lemma C.1) and (C.3) to prove that X_t^h is ergodic and to estimate the term L_h^{−1}(g − π_h[g]) appearing in the formula for σ_h²(g); in essence, we adapt the approach outlined in [31, Section 3] to the family of reflected processes X_t^h. We prove ergodicity first. By [4, Proposition 2.2], a process is ergodic if and only if 0 is a simple eigenvalue of its generator. By the Poincaré inequality (Lemma C.1) and (C.3), for all u ∈ D(L_h),

\|u - \pi_h[u]\|_{L^2(\pi_h)}^2 \leq C_h^2\, \|\nabla u\|_{L^2(\pi_h)}^2 = -C_h^2\, \beta\, \langle u, L_h u \rangle_{\pi_h} \leq C_h^2\, \beta\, \|u\|_{L^2(\pi_h)}\, \|L_h u\|_{L^2(\pi_h)}, \qquad (C.4)

where

C_h = h\,\Lambda(U)\, \exp\!\Bigl(\frac{\beta}{2}\bigl(\sup_{U_h} V - \inf_{U_h} V\bigr)\Bigr).

Now if u is not constant, ‖u − π_h[u]‖²_{L²(π_h)} > 0, so ‖L_h u‖_{L²(π_h)} > 0 and u is not an eigenvector with eigenvalue 0. Hence, 0 is a simple eigenvalue of L_h, and X_t^h is ergodic.

Finally, we estimate σ_h²(g). For any u ∈ D(L_h) with π_h[u] = 0, the inequality (C.4) gives

\|u\|_{L^2(\pi_h)} \leq C_h^2\, \beta\, \|L_h u\|_{L^2(\pi_h)}.

Taking u = L_h^{−1}(g − π_h[g]) in the above yields

\|L_h^{-1}(g - \pi_h[g])\|_{L^2(\pi_h)} \leq C_h^2\, \beta\, \|g - \pi_h[g]\|_{L^2(\pi_h)},

which implies

\sigma_h^2(g) = -\langle g - \pi_h[g],\, L_h^{-1}(g - \pi_h[g]) \rangle_{\pi_h} \leq C_h^2\, \beta\, \mathrm{var}_{\pi_h}(g),

using the Cauchy–Schwarz inequality.

It remains to show that the process X_t^h has the strong continuity property (C.2). We only sketch an argument, since the basic ideas are standard. First, one can use the Lipschitz continuity of strong solutions of the reflected process [2, Lemma 4.1] to show that X_t^h has the Feller property. (That is, one can show that T_t u is continuous whenever u is continuous.) In addition, since the process X_t^h has an infinitesimal generator, we have the pointwise continuity property

\lim_{t \to 0^+} T_t u(x) = u(x) \qquad (C.5)

for all x ∈ U_h and all u ∈ D(L_h). Now we have ‖T_t‖_∞ ≤ 1 for all t ≥ 0, where ‖T_t‖_∞ is the operator norm of T_t on the space of continuous functions with the sup-norm, and therefore by a density argument the limit (C.5) holds for all continuous u. Hence, by [8, Lemma 1.4], we have

\lim_{t \to 0^+} \sup_{x \in U_h} |T_t u(x) - u(x)| = 0

for all continuous u. The strong continuity property (C.2) then follows by another density argument, using that ‖T_t‖_{L²(π_h)} ≤ 1 for all t ≥ 0. ■

Appendix D. Proof of Theorem 4.1.

Proof of Theorem 4.1. By the remarks immediately following Theorem 3.5, since the bias functions are a partition of unity, we have

\sigma^2(g) \leq 2 \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} \kappa_i^{-1} \left\{ C(\bar g_i)\, z_i^2 + \mathrm{var}_\pi(g)\, \mathrm{tr}(R^i) \sum_{\substack{j \neq i\\ F_{ij} > 0}} \frac{1}{F_{ij}} \right\}. \qquad (D.1)

To prove the desired upper bound, we substitute estimates of C(g¯i), Ri, and Fij into the inequality above.

First, we consider the asymptotic covariances Ri and C(g¯i). Let

h=1/K.

The diameter of U_i is 2√d h, so by Assumption 3.6

R^i_{jj} \leq C\, h^a \beta^b \exp\bigl(2\sqrt{d}\, h\beta\, \|\nabla V\|_{L^\infty}\bigr) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr). \qquad (D.2)

(The second inequality follows since hβ ≤ 1 by definition.) Similarly,

C(\bar g_i) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr)\, \mathrm{var}_{\pi_i}(g). \qquad (D.3)

Second, by Lemma 3.11, the nonzero entries of the overlap matrix F are bounded below as β tends to infinity:

F_{ij} \geq \frac{\exp\bigl(-2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr)}{4^d} \qquad (D.4)

for all i, j so that F_{i,j} > 0. We also observe that each row of F has 3^d nonzero entries, since F_{i,i+k} > 0 only when all entries of k ∈ ℤ^d belong to {−1, 0, 1}.

We now estimate the term involving C(g¯i) in (D.1). By (D.3), we have

\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, C(\bar g_i) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, \mathrm{var}_{\pi_i}(g). \qquad (D.5)

Now we have

\mathrm{var}_{\pi_i}(g) = \pi_i\bigl[|g - \pi_i[g]|^2\bigr] \leq \pi_i\bigl[|g - \pi[g]|^2\bigr].

Therefore,

\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, \mathrm{var}_{\pi_i}(g) \leq \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i\, \pi_i\bigl[|g - \pi[g]|^2\bigr] = \pi\bigl[|g - \pi[g]|^2\bigr] = \mathrm{var}_\pi(g). \qquad (D.6)

(The inequality follows since 0 ≤ z_i ≤ 1 for all i; the second to last equality follows using (2.5) and that {ψ_i}_{i∈ℤ^d/Kℤ^d} is a partition of unity.) Thus,

\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, C(\bar g_i) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr)\, \mathrm{var}_\pi(g). \qquad (D.7)

It remains to address the term involving R^i in (D.1): Using (D.2), (D.4), and that each row of F has 3^d nonzero entries, we have

\mathrm{tr}(R^i) \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{1}{F_{ij}} \leq C\, 6^{2d}\, h^{a-b} \exp\bigl(4\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) \qquad (D.8)

for every i ∈ ℤ^d/Kℤ^d. Finally, using (D.7), (D.8), and κ_i^{−1} = K^d = β^d, we conclude

\sigma^2(g) \leq 2C\, h^{a-b} \Bigl\{ K^d \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) + 6^{2d} K^{2d} \exp\bigl(4\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) \Bigr\}\, \mathrm{var}_\pi(g) \leq \bigl(D\, \beta^{d+b-a} + E\, \beta^{2d+b-a}\bigr)\, \mathrm{var}_\pi(g),

where the constants D and E depend on d and V, but not on g or β. ■

We note that if one uses the bias functions proposed in Remark 3.13, then the constants D and E in the proof of Theorem 4.1 grow only polynomially with the dimension d, not exponentially. However, we do not claim that those bias functions perform better than the uniform grid (3.7) or the bias functions of Section 5.3 in practice.

Appendix E. Proof of Theorem 4.6.

Proof of Theorem 4.6. Take g := 1_{\{x \geq M\}}. As explained in the remarks after the statement of Theorem 3.5, since the bias functions are a partition of unity, we have

\sigma_M^2 \leq 2 \sum_{i=0}^{K+1} \kappa_i^{-1} \left\{ C(\bar g_i)\, z_i^2 + p_M^2\, \mathrm{tr}(R^i) \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{1}{F_{ij}} \right\}. \qquad (E.1)

First, we estimate

R^i_{jj} \leq C\, h^a \exp\Bigl(h \max_{x \leq M} |V'(x)|\Bigr) \leq C e\, h^a

for all i = 1, …, K − 1 by Assumption 3.6. By Assumption 4.5,

R^K_{jj} \leq D \quad \text{for } j = K-1,\, K,

and R^K_{jj} = 0 for j ≠ K − 1, K, since ψ_{K+1} is constant over the support of π_K. In addition,

R^{K+1}_{jj} = 0 \quad \text{for all } j = 1, \ldots, L,

since all bias functions ψi take a constant value over the support of πK+1. Likewise,

C(\bar g_K) \leq C e\, h^a\, \mathrm{var}_{\pi_K}(g) \leq C e\, h^a,

and C(ḡ_i) = 0 for all i ≠ K.

We now show that the nonzero entries of the overlap matrix are bounded below independent of M. First, we estimate the entries which are averages over the biased distributions with bounded support. By Lemma 3.11, we have

F_{ij} \geq \tfrac{1}{2}\exp(-2) > 0

for all i = 0, …, K − 1 and j so that F_{ij} > 0. It remains to address those entries related to biased distributions with unbounded support, so with i = K, K + 1. By Lemma E.1, F_{K,K+1} and F_{K,K−1} are bounded below by some θ > 0 independent of M, for this choice of bias functions. (Lemma E.1 and its proof appear at the end of this appendix. Lemma E.1 is the only part of the proof which relies on Assumption 4.3.) In addition, for any i = 0, …, K + 1, we have F_{ii} = 1/2, which implies F_{K+1,K} = 1 − F_{K+1,K+1} = 1/2 since F is stochastic when the bias functions are a partition of unity.

Finally, we substitute the above estimates of the overlap matrix and the variances into (E.1). Let c = min{θ, (1/2)exp(−2)}. Observe that h decreases with M, so Ce h^a ≤ E for some constant E, uniformly in M. Let F = max{D, E}. We have

\frac{\sigma^2}{p_M^2} \leq \frac{2}{p_M^2} \sum_{i=0}^{K+1} (K+2) \left\{ C(\bar g_i)\, z_i^2 + p_M^2\, \frac{2F}{c^2} \right\} \leq 2(K+2)\, F\, \frac{z_K^2}{p_M^2} + \frac{4(K+2)^2 F}{c^2}.

We now observe that

\frac{z_K}{p_M} = \frac{z_K}{z_{K+1}} = \frac{F_{K+1,K}}{F_{K,K+1}} \leq \frac{1}{c}.

Therefore,

\frac{\sigma^2}{p_M^2} \leq \frac{2F(K+2)}{c^2} + \frac{4F(K+2)^2}{c^2},

which proves the result. ■

We now prove Lemma E.1, which is used in the proof of Theorem 4.6.

Lemma E.1. Under the hypotheses of Theorem 4.6, there exist constants M₁, θ₊, θ₋ > 0 depending on V but not on M so that

F_{K,K+1} \geq \theta_+ > 0 \quad \text{and} \quad F_{K,K-1} \geq \theta_- > 0

whenever M ≥ M₁.

Proof. We consider FK,K−1 first. We have

F_{K,K-1} = \frac{1}{2}\, \frac{\pi([M-h, M))}{\pi([M-h, \infty))} = \frac{1}{2}\, \frac{\int_{M-h}^{M} \exp(-V(x))\, dx}{\int_{M-h}^{\infty} \exp(-V(x))\, dx}.

By the integral mean value theorem,

\int_{M-h}^{M} \exp(-V(x))\, dx = h\, \exp\bigl(-V(\xi_{M-h,M})\bigr)

for some ξ_{M−h,M} ∈ [M − h, M]. Moreover, by (4.5), we have

V(x) \geq V(M) + V'(M)(x - M) \quad \text{for all } x \geq M \geq M_0.

Therefore, when M − h ≥ M₀,

\int_{M-h}^{\infty} \exp(-V(x))\, dx \leq \int_{M-h}^{\infty} \exp\bigl(-V(M-h) - V'(M-h)(x - M + h)\bigr)\, dx = \frac{\exp(-V(M-h))}{V'(M-h)}.

It follows that

F_{K,K-1} \geq \frac{h\, V'(M-h)}{2}\, \exp\bigl(V(M-h) - V(\xi_{M-h,M})\bigr) \geq \frac{h\, V'(M-h)}{2}\, \exp\Bigl(-h \max_{x \leq M}|V'(x)|\Bigr) \geq \frac{h\, V'(M-h)}{2}\, \exp(-1) = \frac{V'(M-h)}{2\max_{x \leq M}|V'(x)|}\, \exp(-1), \qquad (E.2)

using the definition h = M/K.

To estimate the quotient in expression (E.2), we distinguish two cases: By (4.5), V′ is nondecreasing on [M0,∞), so either limx→∞ V′(x) = C2 < ∞ or limx→∞ V′(x) = ∞. In the first case, V′ is bounded, and we have

\frac{V'(M-h)}{\max_{x \leq M}|V'(x)|} \geq \frac{V'(M_0)}{\max_{x \in [0,\infty)}|V'(x)|} > 0, \qquad (E.3)

whenever M − h ≥ M₀. In the second case, for M sufficiently large,

\max_{x \leq M} |V'(x)| = V'(M).

Therefore, applying in succession the mean value theorem, the monotonicity of V′, assumption (4.6), and the hypothesis limx→∞ V′(x) = ∞, we have that for all M sufficiently large,

\frac{V'(M-h)}{\max_{x \leq M}|V'(x)|} = \frac{V'(M) - \bigl(V'(M) - V'(M-h)\bigr)}{V'(M)} = \frac{V'(M) - h\, V''(\eta_{M-h,M})}{V'(M)} \geq \frac{V'(M) - V'(M)\, \dfrac{V''(\eta_{M-h,M})}{V'(\eta_{M-h,M})^2}}{V'(M)} \geq \frac{V'(M) - V'(M)\,\alpha}{V'(M)} \geq \frac{1-\alpha}{2} > 0. \qquad (E.4)

(In the second and third lines above, η_{M−h,M} ∈ [M − h, M] denotes the point guaranteed by the mean value theorem so that V′(M) − V′(M−h) = hV′′(η_{M−h,M}).) It follows from (E.2), (E.3), and (E.4) that there exist M₋, θ₋ > 0 so that

F_{K,K-1} \geq \theta_- > 0 \qquad (E.5)

whenever M ≥ M₋.

Now we prove that FK,K+1 is bounded below. We have

F_{K,K+1} = \frac{1}{2}\, \frac{\int_{M}^{\infty} \exp(-V(x))\, dx}{\int_{M-h}^{\infty} \exp(-V(x))\, dx} = F_{K,K-1}\, \frac{\int_{M}^{\infty} \exp(-V(x))\, dx}{\int_{M-h}^{M} \exp(-V(x))\, dx} \geq \theta_-\, \frac{\int_{M}^{M+h} \exp(-V(x))\, dx}{\int_{M-h}^{M} \exp(-V(x))\, dx} = \theta_-\, \frac{\int_{M}^{M+h} \exp\bigl(V(x-h) - V(x)\bigr)\exp(-V(x-h))\, dx}{\int_{M-h}^{M} \exp(-V(x))\, dx} \geq \theta_-\, \exp\Bigl(\min_{[M-h,M+h]} V - \max_{[M-h,M+h]} V\Bigr) \geq \theta_-\, \exp\Bigl(-2h \max_{[M-h,M+h]} |V'|\Bigr). \qquad (E.6)

As above, to bound the quantity appearing in the exponent in (E.6), we distinguish the two cases lim_{x→∞} V′(x) = C₁ < ∞ and lim_{x→∞} V′(x) = ∞. In the first case, for M sufficiently large that 2C₁ ≥ |V′(x)| ≥ C₁/2 whenever x ≥ M − h, we have

h \max_{[M-h,M+h]} |V'| = \frac{\max_{[M-h,M+h]} |V'|}{\max_{[0,M]} |V'|} \leq \frac{2C_1}{C_1/2} = 4. \qquad (E.7)

In the second case, for M sufficiently large,

h \max_{[M-h,M+h]} |V'| = \frac{\max_{[M-h,M+h]} |V'|}{\max_{[0,M]} |V'|} \leq \frac{V'(M+h)}{V'(M)}. \qquad (E.8)

By (4.6), we have the differential inequality

V'' < \alpha\, |V'|^2.

This implies

V'(M+s) \leq y(s)

for

y(s) = \frac{1}{V'(M)^{-1} - \alpha s}

the solution of the initial value problem

y' = \alpha\, y^2 \quad \text{and} \quad y(0) = V'(M).

Therefore,

V'(M+h) \leq \frac{1}{V'(M)^{-1} - \alpha h} = \frac{1}{V'(M)^{-1} - \alpha\, V'(M)^{-1}} = \frac{V'(M)}{1 - \alpha},

so by (E.8),

h \max_{[M-h,M+h]} |V'| \leq \frac{1}{1 - \alpha}. \qquad (E.9)

It follows from (E.6), (E.7), and (E.9) that there exist M+, θ+ > 0 so that

F_{K,K+1} \geq \theta_+ > 0 \qquad (E.10)

whenever M ≥ M₊. ■

Appendix F. An improved method of computing error bars for EMUS.

In [46, Section VII.B.1], we proposed a practical method of estimating the asymptotic standard deviations (error bars) of averages computed by EMUS. Using the notation established in Appendix A, our method proceeds as follows:

  1. Compute F̄, {ḡ_i^*}_{i=1}^L, and {1̄_i^*}_{i=1}^L.

  2. Compute w(F̄) and the group inverse (I − F̄)^#.

  3. Evaluate ∂_iB at F̄, {ḡ_i^*}_{i=1}^L, and {1̄_i^*}_{i=1}^L.

  4. Compute the time series
     \bar\zeta_t^i = \partial_i B\, \Bigl( \bigl(\psi_1(X_t^i), \ldots, \psi_L(X_t^i), g^*(X_t^i), 1^*(X_t^i)\bigr) - \bigl(\bar F_{i1}, \ldots, \bar F_{iL}, \bar g_i^*, \bar 1_i^*\bigr) \Bigr).
  5. Compute an estimate χ̄_i² of the integrated autocovariance of ζ̄_t^i using an algorithm such as ACOR [14].

  6. Compute, as an estimate of σ², the quantity
     \bar\sigma^2 := \sum_{i=1}^{L} \frac{\bar\chi_i^2}{\kappa_i}. \qquad (F.1)

We originally proposed computing the group inverse (I − F̄)^# using the method of [19] based on the QR factorization. We have since discovered that this method does not always yield sufficiently accurate results. For example, when computing error bars for the marginal in μ2 in Section 5.3, we observed a highly oscillatory numerical error affecting some entries of (I − F̄)^#. That the sign pattern in Figure 11a fails to be symmetric is evidence of this numerical error. We note that since the exact overlap matrix F is in detailed balance with w(F), we have diag(w(F)) F diag(w(F))^{−1} = F^t. (Here, diag(w(F)) denotes the diagonal matrix with w(F) along the diagonal.) Therefore,

\bigl((I - F)^{\#}\bigr)^t = \mathrm{diag}(w(F))\, (I - F)^{\#}\, \mathrm{diag}(w(F))^{-1},

which implies that the sign pattern of (I − F)^# is symmetric since w(F) is positive. As a result of these numerical errors, we were unable to accurately compute error bars for the EMUS estimate of the marginal density.

We therefore propose computing the group inverse by a new method combining QR factorization with power iteration. We first compute an estimate G₀ of (I − F̄)^# by the method of [19]. We then iterate

G_{n+1} = \mathcal{I}(G_n) = \tilde F\, G_n + I - e\, w(\bar F)^t, \qquad (F.2)

where e ∈ ℝ^L denotes the column vector of all ones and F̃ := (I − e w(F̄)^t) F̄. We observe that (I − F̄)^# is a fixed point of this iteration, since

\mathcal{I}\bigl((I - \bar F)^{\#}\bigr) = (I - e\, w(\bar F)^t)\, \bar F\, (I - \bar F)^{\#} + (I - e\, w(\bar F)^t) = (\bar F - I)(I - \bar F)^{\#} + (I - e\, w(\bar F)^t) + (I - \bar F)^{\#} = (I - \bar F)^{\#}.

Above, we use well-known properties of the group inverse, including that the spectral projector I − e w(F̄)^t commutes with F̄, that (I − e w(F̄)^t)(I − F̄)^# = (I − F̄)^#, and that (I − F̄)(I − F̄)^# = I − e w(F̄)^t.

Moreover, when F̄ is irreducible, \mathcal{I}^K is a contraction for K sufficiently large. By the Perron–Frobenius theorem, the spectral radius of F̃ is smaller than 1 − ε for some ε > 0. Therefore, by Gelfand's formula, for any matrix norm ‖·‖, we have lim_{k→∞} ‖F̃^k‖^{1/k} < 1 − ε/2, and so for some K,

\|\tilde F^k\| < (1 - \varepsilon/2)^k \quad \text{whenever } k \geq K.

Now

\mathcal{I}^K(G) = \tilde F^K G + (I - e\, w(\bar F)^t) \sum_{j=0}^{K-1} \bar F^j.

Thus, assuming that the norm ‖·‖ is submultiplicative,

\|\mathcal{I}^K(G) - \mathcal{I}^K(H)\| = \|\tilde F^K (G - H)\| \leq \|\tilde F^K\|\, \|G - H\| \leq (1 - \varepsilon/2)^K\, \|G - H\|.

Therefore, the power iteration converges and its limit is the group inverse (I − F̄)^#.
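In numpy, the iteration (F.2) takes only a few lines; the sketch below is our own illustration (the initial guess G0 would come from the QR-based method of [19], and w is the estimated stationary vector w(F̄)).

```python
import numpy as np

def group_inverse_power_iteration(F_bar, w, G0=None, n_iter=100000, tol=1e-14):
    """Refine an estimate of the group inverse (I - F_bar)^# by iterating (F.2)."""
    L = F_bar.shape[0]
    e = np.ones((L, 1))
    w = np.asarray(w).reshape(1, L)        # stationary row vector, w F_bar = w
    P = np.eye(L) - e @ w                  # spectral projector I - e w^t
    F_tilde = P @ F_bar
    G = np.eye(L) if G0 is None else G0.copy()
    for _ in range(n_iter):
        G_new = F_tilde @ G + P            # G_{n+1} = F_tilde G_n + I - e w^t
        if np.max(np.abs(G_new - G)) < tol:
            return G_new
        G = G_new
    return G
```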

Using this new method, we computed (I − F̄)^# for F̄ the overlap matrix involved in estimating the marginal in μ2 in Section 5.3. We performed 10^6 power method iterates. Observe that the sign pattern of the group inverse computed with power iteration is symmetric; see Figure 11b.

The power iteration (F.2) converges slowly when the spectral gap of F¯ is small. We have shown in [45] that the spectral gap may be very small: It decreases exponentially with a temperature parameter in a limit similar to the one analyzed in Section 4.1 above. However, even when the spectral gap is small, we conjecture that a modest number of power iterations will significantly reduce the numerical error in the group inverse, since the error in the initial calculation seems to be highly oscillatory and the power iteration has a smoothing effect.

Footnotes

1

A potential of mean force is the logarithm of a marginal density. A free energy is the logarithm of a normalization constant. Both quantities play fundamental roles in statistical mechanics, e.g. in theories of rates of chemical reactions.

2

The boundary of a set is C3 if in a neighborhood of each point on the boundary, the boundary is the graph of a three times continuously differentiable function.

REFERENCES

[1] Aitkin M: Likelihood and Bayesian analysis of mixtures. Statistical Modelling 1(4), 287–304 (2001)
[2] Andres S: Pathwise differentiability for SDEs in a convex polyhedron with oblique reflection. Ann. Inst. Henri Poincaré Probab. Stat. 45(1), 104–116 (2009)
[3] Berneche S, Roux B: Energetics of ion conduction through the K+ channel. Nature 414(6859), 73 (2001)
[4] Bhattacharya RN: On the functional central limit theorem and the law of the iterated logarithm for Markov processes. Z. Wahrsch. Verw. Gebiete 60(2), 185–201 (1982)
[5] Billingsley P: Convergence of Probability Measures, second edn. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (1999)
[6] Bilodeau M, Brenner D: Theory of Multivariate Statistics. Springer Texts in Statistics. Springer, New York (1999)
[7] Boczko EM, Brooks CL: First-principles calculation of the folding free energy of a three-helix bundle protein. Science 269(5222), 393–396 (1995)
[8] Böttcher B, Schilling RL, Wang J: A primer on Feller processes. In: Lévy Matters III: Lévy-type Processes: Construction, Approximation and Sample Path Properties, Lecture Notes in Mathematics, chap. 1, pp. 1–30. Springer (2013)
[9] Chandler D: Introduction to Modern Statistical Mechanics. Oxford University Press, New York (1987)
[10] Cho GE, Meyer CD: Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra Appl. 335, 137–150 (2001)
[11] Chopin N, Lelièvre T, Stoltz G: Free energy methods for Bayesian inference: efficient exploration of univariate Gaussian mixture posteriors. Statistics and Computing 22(4), 897–916 (2012)
[12] Doss H, Tan A: Estimates and standard errors for ratios of normalizing constants from multiple Markov chains via regeneration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(4), 683–712 (2014)
[13] Ethier SN, Kurtz TG: Markov Processes: Characterization and Convergence. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York (1986)
[14] Foreman-Mackey D, Goodman J: ACOR 1.1.1. https://pypi.python.org/pypi/acor/1.1.1 (2014)
[15] Foreman-Mackey D, Hogg DW, Lang D, Goodman J: emcee: The MCMC hammer. Publications of the Astronomical Society of the Pacific 125(925), 306 (2013)
[16] Geyer CJ: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. American Statistical Association (1991)
[17] Geyer CJ: Estimating normalizing constants and reweighting mixtures. Technical Report No. 568 (1994). Retrieved from the University of Minnesota Digital Conservancy
[18] Gill RD, Vardi Y, Wellner JA: Large sample theory of empirical distributions in biased sampling models. Ann. Statist. 16(3), 1069–1112 (1988)
[19] Golub GH, Meyer CD Jr.: Using the QR factorization and group inversion to compute, differentiate, and estimate the sensitivity of stationary probabilities for Markov chains. SIAM J. Algebraic Discrete Methods 7(2), 273–281 (1986)
[20] Goodman J, Weare J: Ensemble samplers with affine invariance. Commun. Appl. Math. Comput. Sci. 5(1), 65–80 (2010)
[21] Helffer B, Klein M, Nier F: Quantitative analysis of metastability in reversible diffusion processes via a Witten complex approach. Mat. Contemp. 26, 41–85 (2004)
[22] Izenman AJ, Sommer CJ: Philatelic mixtures and multimodal densities. Journal of the American Statistical Association 83(404), 941–953 (1988)
[23] Jasra A, Holmes CC, Stephens DA: Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. 20(1), 50–67 (2005)
[24] Jiang DQ, Qian M, Qian MP: Mathematical Theory of Nonequilibrium Steady States: On the Frontier of Probability and Dynamical Systems. Lecture Notes in Mathematics. Springer, Berlin; New York (2004)
[25] Kipnis C, Varadhan SRS: Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Comm. Math. Phys. 104(1), 1–19 (1986)
[26] Kong A, McCullagh P, Meng XL, Nicolae D, Tan Z: A theory of statistical models for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(3), 585–604 (2003)
[27] Kumar S, Rosenberg JM, Bouzida D, Swendsen RH, Kollman PA: The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J. Comput. Chem. 13(8), 1011–1021 (1992)
[28] Laio A, Parrinello M: Escaping free-energy minima. Proceedings of the National Academy of Sciences 99(20), 12562–12566 (2002)
[29] Legoll F, Lelièvre T: Effective dynamics using conditional expectations. Nonlinearity 23(9), 2131–2163 (2010)
[30] Lelièvre T, Rousset M, Stoltz G: Free Energy Computations: A Mathematical Perspective. Imperial College Press, London (2010)
[31] Lelièvre T, Stoltz G: Partial differential equations and stochastic methods in molecular dynamics. Acta Numerica 25, 681 (2016)
[32] Liu JS: Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, New York (2001)
[33] Maragliano L, Vanden-Eijnden E: A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations. Chemical Physics Letters 426(1–3), 168–175 (2006)
[34] Matthews C, Weare J, Kravtsov A, Jennings E: Umbrella sampling: a powerful method to sample tails of distributions (2017). ArXiv:1712.05024
[35] Meng XL, Wong WH: Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica 6(4), 831–860 (1996)
[36] Pavliotis GA: Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Texts in Applied Mathematics, vol. 60. Springer, New York (2014)
[37] Pavliotis GA, Stuart AM: Multiscale Methods: Averaging and Homogenization. Texts in Applied Mathematics, vol. 53. Springer, New York (2008)
[38] Payne LE, Weinberger HF: An optimal Poincaré inequality for convex domains. Archive for Rational Mechanics and Analysis 5(1), 286–292 (1960)
[39] Richardson S, Green PJ: On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(4), 731–792 (1997)
[40] Roberts GO, Rosenthal JS: Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab. 2, no. 2, 13–25 (1997)
[41] Roberts GO, Tweedie RL: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
[42] Shirts MR, Chodera JD: Statistically optimal analysis of samples from multiple equilibrium states. The Journal of Chemical Physics 129(12), 124105 (2008)
[43] Sugita Y, Kitao A, Okamoto Y: Multidimensional replica-exchange method for free-energy calculations. J. Chem. Phys. 113(15), 11 (2000)
[44] Swendsen RH, Wang JS: Replica Monte Carlo simulation of spin-glasses. Physical Review Letters 57(21), 2607 (1986)
[45] Thiede E, Van Koten B, Weare J: Sharp entrywise perturbation bounds for Markov chains. SIAM Journal on Matrix Analysis and Applications 36(3), 917–941 (2015)
[46] Thiede EH, Van Koten B, Weare J, Dinner AR: Eigenvector method for umbrella sampling enables error analysis. The Journal of Chemical Physics 145(8), 084115 (2016)
[47] Torrie GM, Valleau JP: Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics 23(2), 187 (1977)
[48] VanDerwerken DN, Schmidler SC: Parallel Markov chain Monte Carlo (2013). ArXiv:1312.7479
[49] Vardi Y: Empirical distributions in selection bias models. The Annals of Statistics 13(1), 178–203 (1985)
[50] Wang F, Landau DP: Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 86, 2050–2053 (2001)
[51] Wang FY, Yan L: Gradient estimate on convex domains and applications. Proc. Amer. Math. Soc. 141(3), 1067–1081 (2013)
