Published in final edited form as: SIAM/ASA J Uncertain Quantif. 2020 Aug 24;8(3):1139–1188. doi: 10.1137/18M122964X

Stratification as a general variance reduction method for Markov chain Monte Carlo

Aaron R. Dinner, Erik H. Thiede, Brian Van Koten, Jonathan Weare

Abstract

The Eigenvector Method for Umbrella Sampling (EMUS) [46] belongs to a popular class of methods in statistical mechanics which adapt the principle of stratified survey sampling to the computation of free energies. We develop a detailed theoretical analysis of EMUS. Based on this analysis, we show that EMUS is an efficient general method for computing averages over arbitrary target distributions. In particular, we show that EMUS can be dramatically more efficient than direct MCMC when the target distribution is multimodal or when the goal is to compute tail probabilities. To illustrate these theoretical results, we present a tutorial application of the method to a problem from Bayesian statistics.

1. Introduction.

Markov chain Monte Carlo (MCMC) methods have been widely used with great success throughout statistics, engineering, and the natural sciences. However, when estimating tail probabilities or when sampling from multimodal distributions, accurate MCMC estimates often require a prohibitively large number of samples. In this article, we analyze the Eigenvector Method for Umbrella Sampling (EMUS) [46]. We first proposed EMUS as a method for computing free energies, and we demonstrated that it was useful for treating the multimodality that typically arises in that context. Here, we demonstrate that EMUS is an effective general means of addressing the challenges posed not just by multimodality but also tail events, with potential applications to a broad range of problems in statistics, engineering, and the natural sciences.

EMUS was inspired by Umbrella Sampling [47] and other methods such as the Weighted Histogram Analysis Method (WHAM) [27] and the Multistate Bennett Acceptance Ratio (MBAR) [42] for computing potentials of mean force and free energies in statistical mechanics.1 We call these stratified MCMC methods since they each adapt the principle of stratified survey sampling to MCMC simulation. Stratified MCMC methods are among the most powerful, most successful, and most widely used tools in molecular simulation. (However, in contrast to our presentation here, they are not typically used in molecular simulation to compute averages of general observables.) WHAM, for example, has been instrumental for treating biomolecular processes ranging from protein folding [7] to conductance by ion channels [3].

While the practical utility of stratification has been established in many applications, the advantages and disadvantages of the method have remained poorly understood; cf. Remark 3.8 and [46]. Motivated by the substantial gap between theory and application of stratified MCMC within statistical mechanics, and also by the general challenges posed by multimodality and tail probabilities, the goal of this paper is to develop a clear theoretical explanation of the advantages of EMUS. Our theory suggests new applications of stratified MCMC (and EMUS in particular) to broad classes of sampling problems arising in statistics and statistical mechanics. For example, very recently EMUS was successfully applied to a parameter estimation problem in cosmology [34].

We now describe EMUS and its relationship to other MCMC methods. The EMUS algorithm proceeds roughly as follows:

  1. We divide the support of the target distribution into regions called strata. Associated to each stratum, we define a biased distribution whose support lies within that stratum. For example, one might let the biased distribution corresponding to a stratum be the target distribution conditioned on the stratum.

  2. We use MCMC to sample the biased distributions.

  3. We weight the samples from each stratum to compute estimates of general averages with respect to the target distribution.

EMUS belongs to a large class of MCMC methods that by various mechanisms promote a more uniform sampling of space. For example, in parallel tempering [44, 16], one uses MCMC samples drawn from a distribution or sequence of distributions close to the uniform distribution to speed sampling of the target distribution. The bias introduced by the choice of distributions is corrected either by reweighting the samples or by a replica exchange strategy [16]. The Wang–Landau [50] and Metadynamics methods [28] adaptively construct a biased distribution to achieve uniform sampling in certain coordinates. The temperature accelerated molecular dynamics method [33] is also designed to achieve uniform sampling in a given coordinate, but it works by entirely different means. In EMUS and other stratified MCMC methods, one achieves more uniform sampling by ensuring that each stratum contains points from at least one MCMC simulation.

EMUS also resembles certain methods for computing normalization constants of families of probability densities [17, 49, 35, 26]. The resemblance arises because the weights in the third step of EMUS are the normalization constants of the biased distributions. These methods have been used, for example, to compute Bayes factors in model selection problems [17] and for computations related to selection bias models [49]. However, despite a strong formal resemblance, EMUS has entirely different objectives from these methods: When computing normalization constants, the distributions analogous to our biased distributions are specified as part of the problem. By contrast, in EMUS and other stratified MCMC methods, the strata are chosen as in stratified survey sampling to maximize efficiency. EMUS is perhaps more similar in spirit to the parallel Markov chain Monte Carlo method [48]. Like EMUS, this method is designed to make MCMC more efficient.

Summary of Main Results.

Our most general results are a central limit theorem (CLT) for the EMUS method and a convenient upper bound on the asymptotic variance, cf. Theorem 3.3 and Theorem 3.5. We note that the proof of the upper bound relies on a new class of perturbation estimates for Markov chains which we derived in [45]. These estimates are substantially more detailed than previous results [10], cf. Remark 4.2. After proving the CLT, we address the dependence of the sampling error on the choice of strata. In particular, for a representative MCMC method, we estimate the asymptotic variances of trajectory averages sampling the biased distributions, cf. Theorem 3.7. Our estimate shows how factors such as the diameters of the strata influence the asymptotic variances.

In Section 4, we apply the general theory developed in Section 3 to case studies involving tail probabilities and multimodality. Our results concern two limits: a small probability limit and a low-temperature limit. In the small probability limit, we consider estimation of probabilities of the form

$p_M := \mathbb{P}[X \ge M].$

For a broad class of random variables X, we show that while the cost of computing pM with relative precision by direct MCMC increases exponentially with M, the cost by EMUS increases only polynomially; cf. Section 4.2. In the low-temperature limit, a parameter of the target distribution decreases, intensifying the effects of multimodality on the efficiency of MCMC sampling. We show that the cost of computing an average to fixed precision by direct MCMC increases exponentially in this limit, whereas the cost by EMUS increases only polynomially; cf. Section 4.1. We conclude that EMUS may be dramatically more efficient than direct MCMC sampling when the target distribution is multimodal or when the goal is to compute a small tail probability.

To illustrate our theoretical results, we present a tutorial numerical study applying EMUS to a problem in Bayesian statistics, cf. Section 5. In addition to illustrating the theory, our numerical study demonstrates the problems that may occur when EMUS and other similar stratified MCMC methods are used carelessly. It also addresses practical issues such as the choice of strata and the computation of error bars for averages estimated by EMUS.

The results in this article significantly extend and generalize the ideas in [46]. We first proposed the EMUS method with the goal of analyzing and improving umbrella sampling approaches in free energy calculations. Here, our goal is to establish EMUS as a general variance reduction technique, and we present many entirely new results, including an upper bound on the asymptotic variance of EMUS (Theorem 3.5), a condition to guide some aspects of the choice of strata (Remark 3.12), a theoretical argument demonstrating the benefits of EMUS for computing tail probabilities (Section 4.2), numerical results applying EMUS to Bayesian inference (Section 5), a method of correcting problems related to poorly chosen strata (Section 5.3), and a greatly improved numerical method for estimating the standard deviations of quantities computed by EMUS (Appendix F). In addition, we give complete justifications of some results that were stated without proof in [46], including Theorem 3.7 concerning the dependence of the sampling error on the choice of strata. Finally, we note that our results concerning multimodal distributions and the low-temperature limit generalize and clarify the results given in [46]; in particular, our Theorem 4.1 covers periodic boundary conditions and stratification in more than one variable.

2. The Eigenvector Method for Umbrella Sampling.

In this section, we derive the Eigenvector Method for Umbrella Sampling (EMUS), and we prove that it is consistent. We also derive a related method, iterative EMUS, and we compare iterative EMUS with the MBAR method from statistical mechanics [42].

2.1. Derivation of EMUS.

The objective of EMUS is to compute the average

$\pi[g] := \int_\Omega g(x)\,\pi(dx),$

of a function g with respect to a measure π defined on a set Ω. In EMUS, instead of sampling directly from π, we sample from biased distributions analogous to the strata in stratified survey sampling methods. We then weight the samples from the biased distributions to estimate π[g].

We assume that the biased distributions take the form

$\pi_i(dx) := \frac{\psi_i(x)\,\pi(dx)}{\pi[\psi_i]}$

for some set $\{\psi_i\}_{i=1}^L$ of non-negative bias functions defined on Ω. We call the support of $\psi_i$ the i'th stratum to make an analogy between the biased distributions of EMUS and the strata of stratified survey sampling.

The EMUS method is based on a formula expressing π[g] as a function of expectations over the biased distributions. To derive this formula, we assume that

$\sum_{i=1}^L \psi_i(x) > 0 \quad \text{for all } x \in \Omega.$

We then define

$g^*(x) := \frac{g(x)}{\sum_{k=1}^L \psi_k(x)}$

for any function $g : \Omega \to \mathbb{R}$, and we observe that

$\pi[g] = \int_\Omega g(x)\,\pi(dx) = \left(\sum_{k=1}^L \pi[\psi_k]\right)\sum_{i=1}^L \frac{\pi[\psi_i]}{\sum_{k=1}^L \pi[\psi_k]}\int_\Omega \frac{g(x)}{\sum_{k=1}^L \psi_k(x)}\,\frac{\psi_i(x)\,\pi(dx)}{\pi[\psi_i]} = \Psi\sum_{i=1}^L z_i\,\pi_i[g^*],$ (2.1)

where we set

$\Psi := \sum_{k=1}^L \pi[\psi_k] \quad\text{and}\quad z_i := \frac{\pi[\psi_i]}{\Psi}.$ (2.2)

We call the vector $z \in \mathbb{R}^L$ with entries $z_i$ the weight vector. Now let $\mathbf{1}$ be the constant function with $\mathbf{1}(x) = 1$ for all $x \in \Omega$. Taking $g = \mathbf{1}$ in equation (2.1) yields

$\Psi = \frac{1}{\sum_{i=1}^L z_i\,\pi_i[\mathbf{1}^*]},$

and so

$\pi[g] = \frac{\sum_{i=1}^L z_i\,\pi_i[g^*]}{\sum_{i=1}^L z_i\,\pi_i[\mathbf{1}^*]}.$ (2.3)

Therefore, to express π[g] as a function of averages over the biased distributions, it will suffice to express the weight vector z as a function of averages over the biased distributions. Taking g equal to ψj in (2.1) yields

$\Psi z_j = \pi[\psi_j] = \Psi\sum_{i=1}^L z_i\,\pi_i[\psi_j^*],$

which is equivalent to the eigenvector equation

$z^t F = z^t, \quad\text{where } F_{ij} := \pi_i[\psi_j^*].$ (2.4)

We call F the overlap matrix. We observe that F is stochastic (its entries are nonnegative and its rows each sum to one) and that z is a probability vector (its entries sum to one). Therefore, by the Perron–Frobenius theorem, as long as F is irreducible, the eigenvector problem (2.4) determines z as a function of F, a matrix whose entries are averages over the biased distributions. We assume throughout the remainder of this work that F is irreducible. We give a simple condition on the bias functions which guarantees irreducibility of F in Lemma 2.1 below. In general, for any irreducible, stochastic matrix $G \in \mathbb{R}^{L \times L}$, we will let $w(G) \in \mathbb{R}^L$ denote the unique solution of

$w(G)^t G = w(G)^t \quad\text{with}\quad \sum_{i=1}^L w_i(G) = 1.$

With this notation, z = w(F), and by (2.3) and (2.4) we have

$\pi[g] = \frac{\sum_{i=1}^L w_i(F)\,\pi_i[g^*]}{\sum_{i=1}^L w_i(F)\,\pi_i[\mathbf{1}^*]},$ (2.5)

expressing π[g] as a function of averages over the biased distributions, as desired.

In EMUS, we substitute MCMC estimates for the averages over biased distributions on the right hand side of (2.5) to estimate π[g]. To be precise, let Xti be a Markov process ergodic for πi. We call Xti the biased process sampling the biased distribution πi. The EMUS algorithm proceeds as follows:

  1. For each i = 1, …, L, compute Ni steps of the process Xti.

  2. Compute the averages
    $\bar{g}_i^* := \frac{1}{N_i}\sum_{t=1}^{N_i} g^*(X_t^i), \quad \bar{\mathbf{1}}_i^* := \frac{1}{N_i}\sum_{t=1}^{N_i} \mathbf{1}^*(X_t^i), \quad\text{and}\quad \bar{F}_{ij} := \frac{1}{N_i}\sum_{t=1}^{N_i} \psi_j^*(X_t^i).$
  3. Compute $w(\bar{F})$ numerically, for example from the QR factorization of $I - \bar{F}$ [19].

  4. Compute the estimate
    $\pi^{US}[g] := \frac{\sum_{i=1}^L w_i(\bar{F})\,\bar{g}_i^*}{\sum_{i=1}^L w_i(\bar{F})\,\bar{\mathbf{1}}_i^*}$
    of π[g].
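To make the four steps concrete, here is a minimal sketch in Python with NumPy (the language used for all code sketches in this document). The data layout (one array of bias-function values and one array of g-values per biased trajectory) and the power-iteration solver for $w(\bar{F})$ are our own illustrative choices, not prescriptions from the paper; the QR-based solver of [19] mentioned in step 3 is an alternative.

```python
import numpy as np

def emus_estimate(psis, gs, tol=1e-12, max_iter=10000):
    """Minimal EMUS sketch.

    psis : list of arrays; psis[i] has shape (N_i, L), with
           psis[i][t, j] = psi_j(X_t^i) for the trajectory sampling pi_i.
    gs   : list of arrays; gs[i] has shape (N_i,), with gs[i][t] = g(X_t^i).
    Returns the EMUS estimate of pi[g].
    """
    L = psis[0].shape[1]
    Fbar = np.zeros((L, L))
    gbar = np.zeros(L)    # trajectory averages of g* over each biased process
    onebar = np.zeros(L)  # trajectory averages of 1* over each biased process
    for i in range(L):
        denom = psis[i].sum(axis=1)                        # sum_k psi_k(X_t^i)
        Fbar[i] = (psis[i] / denom[:, None]).mean(axis=0)  # Fbar_ij
        gbar[i] = (gs[i] / denom).mean()                   # gbar_i*
        onebar[i] = (1.0 / denom).mean()                   # 1bar_i*
    # Solve w^t Fbar = w^t with sum(w) = 1 by power iteration
    # (the QR factorization of I - Fbar, as in [19], is an alternative).
    w = np.full(L, 1.0 / L)
    for _ in range(max_iter):
        w_new = w @ Fbar
        w_new /= w_new.sum()
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return (w * gbar).sum() / (w * onebar).sum()
```

When the bias functions are a partition of unity, as in Sections 4 and 5, the denominators above are identically one and the same code applies unchanged.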

Note that w(F¯) is defined only if F¯ is irreducible. In the following lemma, we state simple criteria for the irreducibility of F and F¯:

Lemma 2.1. The overlap matrix F is irreducible if and only if for every A ⊂ {1, 2, …, L}, we have

$\pi\!\left[\left(\sum_{i \in A}\psi_i\right)\left(\sum_{j \notin A}\psi_j\right)\right] > 0.$ (2.6)

The approximate overlap matrix $\bar{F}$ is irreducible if and only if for every A ⊂ {1, 2, …, L}, the set $\bigcup_{i \in A}\{x : \psi_i(x) > 0\}$ contains at least one sample point generated from one of the biased processes $X_t^j$ with $j \notin A$.

Proof. We prove only the second statement; proof of the first is similar. By definition, a non-negative matrix $M \in \mathbb{R}^{L \times L}$ is irreducible if and only if for every subset A ⊂ {1, 2, …, L} of the indices, there exist indices $i \in A$ and $j \notin A$ so that $M_{ji} > 0$. Now assume that for every A ⊂ {1, 2, …, L}, there exist $j \notin A$ and t ≥ 0 so that

$X_t^j \in \bigcup_{k \in A}\{x : \psi_k(x) > 0\}.$

Then for some $i \in A$, $\psi_i(X_t^j) > 0$, so $\bar{F}_{ji} > 0$, hence $\bar{F}$ is irreducible. ■

We claim that the EMUS estimator is consistent; that is, πUS[g] converges almost surely to π[g] as the total number of samples tends to infinity. To make this precise, we require the following assumption on the growth of Ni with the total number of samples:

Assumption 2.2. Let

$N = \sum_{i=1}^L N_i$

be the total number of samples from all biased distributions. Assume that for each i,

$\lim_{N \to \infty} N_i / N = \kappa_i > 0.$

That is, assume that when N is large, the proportion of samples drawn from the i’th biased distribution is fixed and greater than zero.

We now prove that EMUS is consistent:

Lemma 2.3. Under Assumption 2.2 and the irreducibility condition (2.6), πUS[g] converges almost surely to π[g] as the total number of samples N tends to infinity.

Proof. Since the processes Xti are ergodic,

$\bar{F} \xrightarrow{a.s.} F, \quad \bar{g}_i^* \xrightarrow{a.s.} \pi_i[g^*], \quad\text{and}\quad \bar{\mathbf{1}}_i^* \xrightarrow{a.s.} \pi_i[\mathbf{1}^*] \quad \text{as } N \to \infty.$ (2.7)

Moreover, by Lemma A.1 in Appendix A, w(G) is continuous at F. (Technically, w(G) admits an extension to the set of all $L \times L$ matrices, which is continuous at F.) Therefore, as a function of $\bar{F}$, $\bar{g}_i^*$, and $\bar{\mathbf{1}}_i^*$, $\pi^{US}[g]$ is continuous at F, $\pi_i[g^*]$, and $\pi_i[\mathbf{1}^*]$. It follows by the continuous mapping theorem and equation (2.5) that $\pi^{US}[g] \xrightarrow{a.s.} \pi[g]$. ■

2.2. Iterative EMUS and comparison with Vardi’s Estimator.

In this section, we explain how EMUS relates to Vardi’s estimator for selection bias models [49] and its descendants such as the popular Multistate Bennett Acceptance Ratio (MBAR) method [42]. In addition to comparing EMUS with these methods, we review a method, iterative EMUS, for solving the nonlinear system of equations defining Vardi’s estimator [46]. The first iterate of this method is exactly the EMUS estimator.

Vardi’s estimator is similar to EMUS, except that it uses the identity

$z_j = \sum_{i=1}^L z_i\,\pi_i\!\left[\frac{\psi_j\,N_i/z_i}{\sum_{k=1}^L \psi_k\,N_k/z_k}\right]$ (2.8)

instead of our eigenvector problem (2.4). (This identity appears as equation (1.12) in [18].) That is, Vardi's estimate $z^V$ of the weight vector is the solution of (2.8), but with trajectory averages replacing averages over the biased distributions:

$z_j^V = \sum_{i=1}^L z_i^V\,\bar{G}_{ij}(z^V),$ (2.9)

where for any positive $u \in \mathbb{R}^L$ we define

$\bar{G}_{ij}(u) := \frac{1}{N_i}\sum_{n=1}^{N_i}\frac{\psi_j(X_n^i)\,N_i/u_i}{\sum_{k=1}^L \psi_k(X_n^i)\,N_k/u_k}.$ (2.10)

By [49, Theorem 1], this nonlinear equation determines zV uniquely up to a constant multiple whenever the irreducibility criterion of Lemma 2.1 holds.

Vardi’s estimator was originally derived assuming that the samples Xti from the biased distributions were i.i.d. In that case, it is the nonparametric maximum likelihood estimator of the target distribution π given samples from the biased distributions πi [49], and it has certain optimality properties [18]. Several adjustments to the estimator have been proposed for the case of samples from Markov processes. In the Multistate Bennett Acceptance Ratio (MBAR) method, one replaces the factors Ni appearing in the summand in (2.10) with effective sample sizes ni, which are computed from estimates of the integrated autocovariance of a family of functions [42]. (In addition, in some versions of MBAR, the sample average over all Ni points on the right hand side of (2.10) is replaced with a sample average over the ni points obtained by including only every Ni/ni’th point along the trajectory Xti.) Another recent work proposes different effective sample sizes computed by minimizing an estimate of the asymptotic variance of the estimator [12]. In fact, the estimator is consistent with Ni replaced by any fixed positive number [12]. We have found that our numerical results do not depend sensitively on the choice of effective sample size, so we use Ni for simplicity.

We now review iterative EMUS, which we introduced in [46]. Iterative EMUS may be understood as a fixed point iteration for solving equation (2.9). The iteration proceeds as follows:

  1. As an initial guess for $z^V$, choose a positive vector $z^0 \in \mathbb{R}^L$. Set m = 0. Choose a tolerance τ > 0.

  2. Compute $\bar{G}_{ij}(z^m)$ by (2.10). Solve the eigenvector equation
    $z_j^{m+1} = \sum_{i=1}^L z_i^{m+1}\,\bar{G}_{ij}(z^m)$ (2.11)
    for an updated estimate $z^{m+1}$ of $z^V$.
  3. If $\max_i |z_i^{m+1} - z_i^m| / z_i^m > \tau$, then increment m and repeat step 2.

Remark 2.4. In a related work, we show that the eigenvector equation (2.11) has a unique solution for every m, and we suggest a numerical method for finding the solution [46]. We also discuss the convergence of iterative EMUS, and we show that for every fixed m, zm is a consistent estimator of the weight vector z.

Remark 2.5. If one chooses $z_i^0 = N_i/N$, then $z^1$ is the EMUS estimate $w(\bar{F})$ of z.
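The fixed-point iteration reuses the stored trajectories: each sweep reweights the samples by the current estimate of z and solves one eigenvector problem of the form (2.11). The sketch below (Python/NumPy) follows the data layout of the sketch in Section 2.1 and starts from $z_i^0 = N_i/N$, so its first iterate is the EMUS estimate; the helper `stationary`, the stopping rule mirroring step 3, and the cap on sweeps are implementation assumptions of ours.

```python
import numpy as np

def stationary(G, tol=1e-12, max_iter=10000):
    """Positive left eigenvector w with w^t G = w^t and sum(w) = 1 (power iteration)."""
    L = G.shape[0]
    w = np.full(L, 1.0 / L)
    for _ in range(max_iter):
        w_new = w @ G
        w_new /= w_new.sum()
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return w

def iterative_emus_weights(psis, tau=1e-8, max_sweeps=100):
    """psis[i] has shape (N_i, L) with psis[i][t, j] = psi_j(X_t^i)."""
    L = psis[0].shape[1]
    Ns = np.array([p.shape[0] for p in psis], dtype=float)
    z = Ns / Ns.sum()                       # z_i^0 = N_i / N, so z^1 = w(Fbar)
    for _ in range(max_sweeps):
        G = np.zeros((L, L))
        for i in range(L):
            denom = psis[i] @ (Ns / z)          # sum_k psi_k(X_n^i) N_k / z_k
            num = psis[i] * (Ns[i] / z[i])      # psi_j(X_n^i) N_i / z_i
            G[i] = (num / denom[:, None]).mean(axis=0)   # Gbar_ij(z^m), eq. (2.10)
        z_new = stationary(G)               # solves eq. (2.11)
        if np.max(np.abs(z_new - z) / z) <= tau:
            return z_new
        z = z_new
    return z
```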

3. Error Analysis of EMUS.

Here, we develop tools for analyzing the error of EMUS. First, in Section 3.1, we prove a CLT for EMUS, and we derive a convenient upper bound on the asymptotic variance. Then, in Section 3.2, we analyze the dependence of the asymptotic variance of EMUS on the choice of biased distributions. We use these tools in Section 4 to prove limiting results demonstrating the advantages of EMUS for multimodal distributions and tail probabilities.

3.1. A CLT for EMUS and an Estimate of the Asymptotic Variance.

In this section, we prove a Central Limit Theorem (CLT) for EMUS, and we derive an upper bound on the asymptotic variance σUS2(g) of πUS[g]. To prove the CLT for EMUS, we must assume that a CLT holds for trajectory averages over the biased processes:

Assumption 3.1. For any matrix H, let $H_{i:}$ denote the i'th row of H. Define $\bar{G} \in \mathbb{R}^{L \times 2}$ by

$\bar{G}_{i:} := (\bar{g}_i^*, \bar{\mathbf{1}}_i^*).$

Assume that

$\sqrt{N_i}\left((\bar{F}_{i:}, \bar{G}_{i:}) - (F_{i:}, \pi_i[g^*], \pi_i[\mathbf{1}^*])\right) \xrightarrow{d} \mathcal{N}(0, \Sigma_i)$ (3.1)

for some asymptotic covariance matrix $\Sigma_i \in \mathbb{R}^{(L+2) \times (L+2)}$ of the form

$\Sigma_i = \begin{pmatrix} \sigma^i & \rho^i \\ (\rho^i)^t & \tau^i \end{pmatrix},$ (3.2)

where $\sigma^i \in \mathbb{R}^{L \times L}$ denotes the asymptotic covariance of $\bar{F}_{i:}$ with itself, $\rho^i \in \mathbb{R}^{L \times 2}$ denotes the asymptotic covariance of $\bar{F}_{i:}$ with $\bar{G}_{i:}$, and $\tau^i \in \mathbb{R}^{2 \times 2}$ denotes the asymptotic covariance of $\bar{G}_{i:}$ with itself.

We expect a CLT to hold for most MCMC methods, target distributions, and target functions of interest in statistics and statistical mechanics. We refer to [40] for a comprehensive review of conditions guaranteeing a CLT. In Theorem 3.7 of Section 3.2, we prove a CLT and an estimate of the asymptotic variance for a simple family of processes which one might use to sample the biased distributions in an application of EMUS.

We now prove a CLT for $\pi^{US}[g]$, and we give a formula expressing the asymptotic variance $\sigma_{US}^2(g)$ of $\pi^{US}[g]$ in terms of the asymptotic variances $\Sigma_i$ of the trajectory averages. In this formula, $(I - F)^\#$ denotes the group generalized inverse of $I - F$; the group inverse $A^\#$ of a matrix A is characterized by the properties

$A A^\# A = A, \quad A^\# A A^\# = A^\#, \quad\text{and}\quad A A^\# = A^\# A.$

We refer to [19] for a detailed explanation of the properties of the group inverse, a proof that $(I - F)^\#$ exists whenever F is stochastic and irreducible, and an algorithm for computing $(I - F)^\#$.
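For readers who prefer not to implement the algorithm of [19], the group inverse can also be evaluated through the fundamental matrix of the chain, using the standard identity $(I - F)^\# = (I - F + \mathbf{1}w^t)^{-1} - \mathbf{1}w^t$ with $w = w(F)$. The short Python/NumPy sketch below is offered only as one convenient alternative; it is not the method used in the paper.

```python
import numpy as np

def group_inverse_I_minus_F(F, w):
    """Group inverse (I - F)^# for an irreducible stochastic matrix F with
    stationary (weight) vector w, via the fundamental-matrix identity
    (I - F)^# = (I - F + 1 w^t)^{-1} - 1 w^t."""
    L = F.shape[0]
    W = np.outer(np.ones(L), w)          # rank-one matrix 1 w^t
    return np.linalg.inv(np.eye(L) - F + W) - W
```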

In Theorem 3.3 and below we impose the following assumption:

Assumption 3.2. The processes Xti sampling the biased distributions are independent.

This assumption does not hold for all stratified MCMC methods. For example, in replica exchange umbrella sampling one periodically allows configuration exchanges between neighboring processes; see [32] for a general discussion of replica exchange strategies and [43] for an application of replica exchange in a method similar to EMUS. The result is a single process taking values in $\mathbb{R}^{L \times d}$ and sampling the product distribution $\Pi(x_1, x_2, \dots, x_L) = \pi_1(x_1)\pi_2(x_2)\cdots\pi_L(x_L)$. In this case, a CLT would still hold for EMUS, but the asymptotic variance would take a different form.

Theorem 3.3. Let Assumptions 2.2, 3.1, and 3.2 and the irreducibility condition (2.6) hold. Let g be square integrable over π, so $\pi[g^2] < \infty$. Recall the definition of Ψ from (2.2), and define

$\ell := \Psi\,(1, -\pi[g])^t \in \mathbb{R}^2.$

Let $\mathbf{g} \in \mathbb{R}^L$ be the vector with $\mathbf{g}_i := (\pi_i[g^*], \pi_i[\mathbf{1}^*])\,\ell$. We have

$\sqrt{N}\left(\pi^{US}[g] - \pi[g]\right) \xrightarrow{d} \mathcal{N}(0, \sigma_{US}^2(g)),$ (3.3)

where

$\sigma_{US}^2(g) = \sum_{i=1}^L \frac{z_i^2}{\kappa_i}\left\{\left((I-F)^\#\mathbf{g}\right)^t \sigma^i\,(I-F)^\#\mathbf{g} + 2\left((I-F)^\#\mathbf{g}\right)^t \rho^i\,\ell + \ell^t \tau^i \ell\right\}.$ (3.4)

Proof. The result follows using the delta method and a formula expressing w′(F) in terms of $(I - F)^\#$; we give the details in Appendix A. ■

In Appendix F, we explain how to compute an estimate of $\sigma_{US}^2(g)$ given trajectories of the biased processes. We use these estimates in Section 5 when analyzing our computational experiments. We note that the numerical methods presented in Appendix F for estimating $\sigma_{US}^2(g)$ significantly improve upon our original proposal in [46]; see Figure 11.

Figure 11: Sign pattern of the group inverse $(I - \bar{F})^\#$ computed by the method of [19] (Figure 11a) and by power iteration (Figure 11b). Yellow indicates an entry with positive sign, blue a negative sign. Here, we consider the overlap matrix $\bar{F}$ computed to estimate the marginal density of μ2 in Section 5.3. The oscillations in sign observed in the upper right corner of Figure 11a are evidence of numerical error.

We now derive a convenient upper bound on the asymptotic variance $\sigma_{US}^2(g)$. In Section 4, we use this bound to analyze the efficiency of EMUS in the low-temperature limit and in the limit of small tail probabilities. Our bound is based on the probability $P_i[\tau_j < \tau_i]$ defined below:

Definition 3.4. Let $Y_n$ be the Markov chain with state space {1, 2, …, L} and transition matrix F. Let $P_i[\tau_j < \tau_i]$ denote the probability that $Y_n$ hits j before returning to i, conditioned on $Y_0 = i$.

Theorem 3.5. Let Assumptions 2.2, 3.1, and 3.2 and the irreducibility condition (2.6) hold. Let g be square integrable over π, so $\pi[g^2] < \infty$. Let $\sigma_{US}^2(g)$ be the asymptotic variance of $\pi^{US}[g]$, and for any measure ν and function f let $\operatorname{var}_\nu(f)$ be the variance of f over ν. Define the function

$h = g^* - \pi[g]\,\mathbf{1}^*,$

and let C(h¯i) be the asymptotic variance of the trajectory average of h over the biased process Xti. We have

$\sigma_{US}^2(g) \le 2\sum_{i=1}^L \frac{1}{\kappa_i}\left\{z_i^2\Psi^2\,C(\bar{h}_i) + \operatorname{tr}(R^i)\,\pi[|h|]^2\sum_{\substack{j \ne i \\ F_{ij} > 0}}\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}\right\},$ (3.5)

where $R^i \in \mathbb{R}^{L \times L}$ with

$R^i_{jk} := \frac{\sigma^i_{jk}}{\sqrt{\operatorname{var}_{\pi_i}(\psi_j^*)\,\operatorname{var}_{\pi_i}(\psi_k^*)}}.$

Proof. The result follows from Theorem 3.3, using the perturbation bounds which we derived in [45]. Details appear in Appendix A. ■

When the bias functions are a partition of unity, both the EMUS method and the statements of Theorems 3.3 and 3.5 simplify considerably. (The bias functions are a partition of unity if and only if $\sum_{i=1}^L \psi_i(x) = 1$ for all x.) In this case, $f^* = f$ for all functions f, and the EMUS method reduces to

$\pi^{US}[g] = \sum_{i=1}^L w_i(\bar{F})\,\bar{g}_i,$

where

$\bar{F}_{ij} = N_i^{-1}\sum_{t=1}^{N_i}\psi_j(X_t^i) \quad\text{and}\quad \bar{g}_i = N_i^{-1}\sum_{t=1}^{N_i} g(X_t^i).$

In the statement of Theorem 3.5, one can replace π[|h|]2 with varπ(g) or π[|g|]2. We also have Ψ = 1, and one can replace C(h¯i) with the asymptotic variance C(g¯i) of g¯i. (We verify these claims in Appendix A as part of the proof of Theorem 3.5.) In our limiting results (Section 4) and computational experiments (Section 5), we choose the bias functions to be a partition of unity.

3.2. Dependence of the Asymptotic Variance on the Choice of Strata.

In this section, we consider how the choice of strata influences the factors in the upper bound (3.5) on σUS2(g). Roughly, the asymptotic variances C(h¯i) and tr(Ri) characterize the sampling error, and for each i the factor

$\sum_{\substack{j \ne i \\ F_{ij} > 0}}\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2},$

measures the sensitivity of the EMUS estimator to sampling errors associated with $\pi_i$. We show in Section 3.2.1 that $C(\bar{h}_i)$ and $\operatorname{tr}(R^i)$ may be controlled by decreasing the diameters of the strata, under some conditions. We show in Section 3.2.2 that $P_i[\tau_j < \tau_i]$ may be controlled by ensuring sufficient overlap between neighboring strata. This last observation leads to a practical condition guiding the choice of strata; see Remark 3.12 and (5.8).

To streamline our illustration of the benefits of stratification, we impose additional assumptions in this section. These assumptions are much stronger than those made above. We discuss how they relate to practical implementations of stratified MCMC in Remarks 3.9 and 3.12.

3.2.1. Asymptotic Variances of MCMC Averages.

Here, we consider the effect of the choice of strata on the asymptotic variances C(h¯i) and tr(Ri). Because such a diverse variety of biased processes and distributions could in principle be used, it is futile in our opinion to try for a completely general result. Instead, motivated by the efficiency analysis undertaken in Section 4, we introduce a simple parametric family of bias functions, and for this family we state Assumption 3.6 relating the diameters of the strata with the asymptotic variances. In Theorem 3.7, we verify Assumption 3.6 for one representative class of biased processes. Finally, at the end of this section, we explain why we expect the assumption to hold for other choices of biased processes and distributions; cf. Remark 3.9.

Consider the following representative class of bias functions: Given a family of sets $\{U_i : i = 1, \dots, L\}$ with $\bigcup_{i=1}^L U_i = \Omega$, define

$\psi_i := \mathbf{1}_{U_i} \quad\text{and}\quad \pi_i(dx) := \frac{\mathbf{1}_{U_i}(x)\,\pi(dx)}{\pi[\mathbf{1}_{U_i}]} \quad\text{for } i = 1, \dots, L,$ (3.6)

where $\mathbf{1}_{U_i}$ denotes the characteristic function of $U_i$. Assume that the sets $U_i$ are chosen so that the irreducibility criterion of Lemma 2.1 holds. For example, suppose that $\Omega = [0, 1]^d$ is the d-dimensional unit cube. One might choose $K \in \mathbb{N}$, set h := 1/K, and define

$U_i = \left(h[-1,1]^d + hi\right)\cap\Omega \quad\text{for } i \in \{0, 1, \dots, K\}^d,$ (3.7)

covering Ω uniformly by a grid of strata having diameters proportional to h. We use this uniform grid as a device when analyzing the effect of the stratum size on the efficiency of EMUS in Section 4. However, while such a naïve choice may suffice for small d, it is not practical for large d. We discuss appropriate bias functions for high-dimensional problems later in this section and again in Section 5.1.
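As a concrete illustration of the grid (3.6)–(3.7), the following Python/NumPy sketch builds the indicator bias functions on the unit cube and evaluates them at a point; the function name and the vector layout are illustrative assumptions of ours.

```python
import itertools
import numpy as np

def uniform_grid_strata(K, d):
    """Bias functions psi_i = 1_{U_i}, with U_i = (h[-1,1]^d + h*i) intersected
    with [0,1]^d, indexed by i in {0,...,K}^d and h = 1/K, as in (3.6)-(3.7)."""
    h = 1.0 / K
    centers = [h * np.array(i) for i in itertools.product(range(K + 1), repeat=d)]

    def psi(x):
        """Return the vector (1_{U_i}(x))_i for a point x in [0,1]^d."""
        x = np.asarray(x, dtype=float)
        return np.array([float(np.all(np.abs(x - c) <= h)) for c in centers])

    return centers, psi

# Example: in d = 2 with K = 4, a generic point lies in 2^d = 4 strata.
centers, psi = uniform_grid_strata(K=4, d=2)
print(int(psi([0.33, 0.61]).sum()))  # prints 4
```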

Since we wish to study grids like (3.7) as h = 1/K varies, we state our assumption on asymptotic variances in terms of the following parametric family of strata: Let $x_0 \in \Omega$, and let $Z \subset \mathbb{R}^d$ be a bounded set containing 0. For each h > 0, define a stratum and a biased distribution by

$Z_h = x_0 + hZ \quad\text{and}\quad \pi_h(dx) = \frac{\mathbf{1}_{Z_h}(x)\,\pi(dx)}{\pi[\mathbf{1}_{Z_h}]}.$ (3.8)

Assumption 3.6 characterizes the dependence of the asymptotic variance of MCMC averages over πh on the parameter h:

Assumption 3.6. Assume that $f : \Omega \to \mathbb{R}$ has finite variance $\operatorname{var}_h(f)$ over $\pi_h$, and define $\sigma_h^2(f)$ to be the asymptotic variance of an MCMC trajectory average approximating $\pi_h[f]$. Write

$\pi(x) = \frac{\exp(-\beta V(x))}{\int \exp(-\beta V(y))\,dy},$

for some potential $V : \Omega \to \mathbb{R}$ and inverse temperature β > 0. We assume

$\frac{\sigma_h^2(f)}{\operatorname{var}_{\pi_h}(f)} \le C h^a \beta^b \exp\!\left(\beta\left(\max_{Z_h} V - \min_{Z_h} V\right)\right) \le C h^a \beta^b \exp\!\left(\beta\,h\,\operatorname{diam}(Z)\,\|\nabla V\|_\infty\right)$

for some C, a, b ≥ 0 independent of h,Z, and f.

To motivate Assumption 3.6, we prove that a special case holds for a representative class of processes sampling the biased distributions, cf. Theorem 3.7. Assume that the potential V appearing in the assumption is continuously differentiable. Let $Z \subset \mathbb{R}^d$ be either a convex polyhedron or a set with $C^3$ boundary. Now let $X_t^h$ be the overdamped Langevin process with reflecting boundary conditions on $Z_h$. This process is defined by the Fokker–Planck equation

$\frac{\partial u}{\partial t}(x,t) = \operatorname{div}\!\left(\beta^{-1}\nabla u(x,t) + u(x,t)\nabla V(x)\right)$ for $x \in Z_h$, $t > 0$; $\left(\beta^{-1}\nabla u(x,t) + u(x,t)\nabla V(x)\right)\cdot n(x) = 0$ for $x \in \partial Z_h$, $t \ge 0$; and $u(x,0) = p(x)$ for $x \in Z_h$, (3.9)

where β and V are the inverse temperature and potential defined in Assumption 3.6 and n(x) denotes the inward unit normal to ∂Zh at x. That is, Xth is the unique Markov process so that if X0h has density p(x), then Xth has density u(x, t). The existence of the reflected process is established in [51, 2] when Z is a convex polyhedron and in [13, Chapter 8] when Z has C3 boundary. A simple introduction to the reflected process and its properties appears in [36, Chapter 4]. We show in Theorem 3.7 that Xth is ergodic for πh, at least when Z is bounded.

The reflected process Xth shares many features with the processes used in practical stratified MCMC methods. In particular, it is closely related to the (unreflected) overdamped Langevin process Yt, which solves

$dY_t = -\nabla V(Y_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t.$ (3.10)

In fact, the Fokker–Planck equation of the unreflected process is the same as that of the reflected process, except with no boundary condition and with $\mathbb{R}^d$ in place of $Z_h$.

We now verify Assumption 3.6 for the reflected process:

Theorem 3.7. Assume that $f : \Omega \to \mathbb{R}$ has finite variance $\operatorname{var}_h(f)$ over $\pi_h$. Let Z either have $C^3$ boundary or be convex. Assume that V is continuously differentiable. Suppose that $X_t^h$ is stationary; that is, $X_0^h$ has distribution $\pi_h$. Let $\bar{f}_h := \frac{1}{T}\int_0^T f(X_t^h)\,dt$ be the continuous time trajectory average of f. We have

$\sqrt{T}\left(\bar{f}_h - \pi_h[f]\right) \xrightarrow{d} \mathcal{N}(0, \sigma_h^2(f)),$

where

$\sigma_h^2(f) \le \Lambda\,h^2\beta\,\exp\!\left(\beta\left(\max_{Z_h} V - \min_{Z_h} V\right)\right)\operatorname{var}_h(f).$ (3.11)

The constant Λ depends only on Z, not on h, β, V, or f.

Sketch of proof. We give a detailed proof of Theorem 3.7 in Appendix C; here we present only an outline. Our proof is based on a formula expressing the asymptotic variance of trajectory averages in terms of the generator of Xth: Under certain conditions,

$\sigma_h^2(g) = -2\,\pi_h\!\left[(g - \pi_h[g])\,L^{-1}(g - \pi_h[g])\right],$ (3.12)

where L is the generator of $X_t^h$ and $L^{-1}(g - \pi_h[g])$ is a function so that

$L\left(L^{-1}(g - \pi_h[g])\right) = g - \pi_h[g],$

and $\nabla L^{-1}(g - \pi_h[g]) \cdot n = 0$ on $\partial Z_h$. To estimate $\sigma_h^2(g)$, we prove a Poincaré inequality for $\pi_h$, which implies an upper bound on $\|L^{-1}(g - \pi_h[g])\|_{L^2(\pi_h)}$. This general approach is outlined in [31, Section 3], which treats the case of the overdamped Langevin dynamics on an unbounded domain. ■

Remark 3.8. In a related work, we summarize and refute a widely accepted argument in favor of umbrella sampling from the chemistry literature [9, Chapter 8]; see [46, Section VI.A]. Roughly speaking, the argument in the chemistry literature treated the dependence of the sampling error on the choice of strata correctly, but the sensitivity of the algorithm to those errors incorrectly. Our Theorem 3.7 verifies and extends the correct part of this argument.

Remark 3.9 (Practical Biased Processes and Assumption 3.6). As explained above, the reflected process $X_t^h$ is quite similar to the processes used in practice. We now discuss practical methods, and we explain when we expect Assumption 3.6 to hold. We identify three major differences between $X_t^h$ and practical methods: First, in molecular simulations, one typically chooses Gaussian bias functions instead of piecewise constant ones. Second, practical methods must be discrete in time, e.g. one might use a discretization of the continuous time process $X_t^h$. Third, for high-dimensional problems, one typically stratifies only a certain low-dimensional reaction coordinate or collective variable.

In the first case, for Gaussian bias functions, a version of Theorem 3.7 holds with minor adjustments; we omit the exact statement and proof for simplicity. In the second case, for discretizations of Langevin dynamics, the asymptotic variances of trajectory averages are closely related to the corresponding averages for the continuous time dynamics: In fact, under some conditions on the potential V,

$\lim_{\Delta t \to 0} \Delta t\,\zeta_{\Delta t}^2(f) = \zeta^2(f),$ (3.13)

where $\zeta_{\Delta t}^2(f)$ is the asymptotic variance of the trajectory average of f for the discretization with time step Δt and $\zeta^2(f)$ is the asymptotic variance for the continuous time process [31, Section 3.2]. For other discrete time processes, we expect Assumption 3.6 to hold with different exponents a and b. For example, the affine invariance property of the affine invariant ensemble sampler [20] suggests a = 0.
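To fix what "discretization" means here, the following is a minimal Euler–Maruyama sketch of the unreflected dynamics (3.10) with time step Δt (Python/NumPy). It is a standard scheme shown only for orientation; confinement to a stratum, e.g. by reflection at $\partial Z_h$ or by rejecting proposals that leave the support, is deliberately omitted.

```python
import numpy as np

def euler_maruyama(grad_V, x0, beta, dt, n_steps, rng=None):
    """Euler-Maruyama discretization of the overdamped Langevin SDE (3.10):
    dY_t = -grad V(Y_t) dt + sqrt(2/beta) dB_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.array(x0, dtype=float))
    traj = np.empty((n_steps, x.size))
    for t in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x - grad_V(x) * dt + np.sqrt(2.0 * dt / beta) * noise
        traj[t] = x
    return traj
```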

The third case is subtle. When d is large, one typically stratifies only in a function $\theta : \Omega \subset \mathbb{R}^d \to \mathbb{R}^\ell$ with $\ell$ much smaller than d. To be precise, one might choose a uniform grid of nonnegative functions $\eta_i : \mathbb{R}^\ell \to \mathbb{R}$ defined as in (3.7), but with supports covering $\theta(\Omega) \subset \mathbb{R}^\ell$ instead of $\Omega \subset \mathbb{R}^d$. One would then define the bias functions

$\psi_i(x) := \eta_i(\theta(x)).$ (3.14)

(We make a similar choice in our calculations in Section 5, cf. the natural stratification (5.1).) For a clever choice of θ, these biased distributions may be much easier to sample than the target distribution. For example, suppose that the marginal $\pi_\theta$ of π in θ were multimodal, but that the conditional distributions π(· | θ = θ0) were unimodal or otherwise easy to sample for each fixed θ0. In that case, for h sufficiently small, each biased distribution would be unimodal, hence easy to sample. (Recall that h sets the diameters of the strata for the grid of bias functions defined in (3.7), so h small means that the diameter of the support of $\eta_i$ is small.) In free energy calculations, it is often possible to choose such a θ based on intuition or scientific principles; see [46, 30] for discussion. Also, when computing tails or marginals, the problem itself typically suggests a particular θ; cf. the natural stratification in Section 5.1.

The reader will notice that bias functions of the form (3.14) will typically have infinite support, rendering the bound in Assumption 3.6 useless. In this case, one might hope for a similar bound with the potential function V replaced by the free energy

$F(\theta) := -\beta^{-1}\log(\pi_\theta(\theta)),$

where πθ is the marginal density of π in θ. Roughly, this replacement will be valid when, for MCMC processes sampling π, the θ variables equilibrate very slowly compared to other variables. This will occur, for example, when the marginal in θ is multimodal or otherwise difficult to sample, but the conditional distributions are easy to sample. More on the effective dynamics of low-dimensional variables can be found in [37] or [29].

3.2.2. Controlling the Probabilities Pi[τj < τi].

Here, we examine the effect of the choice of strata on the factor

$\sum_{\substack{j \ne i \\ F_{ij} > 0}}\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}$ (3.15)

appearing in our upper bound (3.5) on σUS2(g).

We begin with a lemma estimating varπi(ψj*)/Pi[τj<τi]2 in terms of Fij:

Lemma 3.10. We have

$\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2} \le \frac{1}{F_{ij}}.$

Proof. We have

$P_i[\tau_k < \tau_i] \ge \mathbb{P}[X_1 = k \mid X_0 = i] = F_{ik},$

where $X_t$ denotes the Markov chain with transition matrix F. Therefore, since $\psi_j^*(x) \in [0,1]$,

$\frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2} \le \frac{\operatorname{var}_{\pi_i}(\psi_j^*)}{F_{ij}^2} \le \frac{\pi_i[(\psi_j^*)^2]}{F_{ij}^2} \le \frac{\pi_i[\psi_j^*]}{F_{ij}^2} = \frac{1}{F_{ij}}.$ ■

We now estimate the size of Fij for piecewise constant bias functions such as the uniform grid (3.7):

Lemma 3.11. Assume as in (3.6) that the bias functions are piecewise constant, and write π(x) ∝ exp(−βV (x)). We have

$F_{ij} \ge \frac{|U_i \cap U_j|}{|U_i|}\,\frac{1}{\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty}\exp\!\left(\beta\left(\min_{U_i} V - \max_{U_i} V\right)\right).$

In particular, for the uniform grid of strata (3.7), we have

$F_{ij} \ge \frac{1}{4^d}\exp\!\left(\beta\left(\min_{U_i} V - \max_{U_i} V\right)\right) \ge \frac{1}{4^d}\exp\!\left(-2\beta h\sqrt{d}\,\|\nabla V\|_\infty\right)$ (3.16)

for any $i, j \in \{0, 1, \dots, K\}^d$ so that $F_{ij} > 0$.

Proof. We have

$F_{ij} = \pi_i[\psi_j^*] = \pi_i\!\left[\frac{\mathbf{1}_{U_j}}{\sum_{k=1}^L \mathbf{1}_{U_k}}\right] \ge \frac{\pi[U_i \cap U_j]}{\pi[U_i]}\,\frac{1}{\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty} \ge \frac{|U_i \cap U_j|}{|U_i|}\,\frac{1}{\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty}\exp\!\left(\beta\left(\min_{U_i} V - \max_{U_i} V\right)\right),$

which proves the first claim made in the statement of the lemma.

Now, for the uniform grid of strata (3.7), the minimum nonzero value of $|U_i \cap U_j|/|U_i|$ is $1/2^d$, attained when $j = (1, 1, \dots, 1) + i$. Moreover, except for a set of measure zero, each $x \in \mathbb{R}^d$ lies within $2^d$ strata, so $\left\|\sum_{k=1}^L \mathbf{1}_{U_k}\right\|_\infty = 2^d$. Finally, we have

$\max_{U_i} V - \min_{U_i} V \le \operatorname{diam}(U_i)\,\|\nabla V\|_\infty = 2\sqrt{d}\,h\,\|\nabla V\|_\infty,$

and the result follows. ■

Remark 3.12 (A Condition to Guide the Choice of Strata). Lemmas 3.10 and 3.11 suggest a practical constraint on the choice of strata: To ensure that the calculation of the weights is not too sensitive to sampling errors, it will suffice to choose strata so that nonzero entries of F are not too small. We let this condition guide the choice of strata in Section 5, cf. (5.8). However, the condition is only sufficient, not necessary. For example, consider a uniform grid of Gaussian bias functions similar to (3.7), but with Gaussian densities having mean $hi$ and variance $\sigma^2 = h^2$ replacing the characteristic functions $\mathbf{1}_{U_i}$. In that case, even though F will be dense and may have some extremely small nonzero entries, one can still control (3.15) by decreasing h, under some conditions on π. We omit the exact statement and proof for simplicity.
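In practice the condition of Remark 3.12 can be monitored directly from the estimated overlap matrix, in the spirit of the check (5.8) reported later. A small diagnostic sketch (Python/NumPy; the function name and the idea of thresholding are ours):

```python
import numpy as np

def min_nonzero_overlap(Fbar, tol=0.0):
    """Smallest nonzero off-diagonal entry of the estimated overlap matrix.
    Very small values flag strata whose weights may be overly sensitive to
    sampling error (cf. Remark 3.12 and the check (5.8))."""
    off_diag = Fbar[~np.eye(Fbar.shape[0], dtype=bool)]
    nonzero = off_diag[off_diag > tol]
    return nonzero.min() if nonzero.size else 0.0
```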

Despite the exponential dependence on d in (3.16), EMUS and other stratified MCMC methods are advantageous for high-dimensional problems because it often suffices to stratify only a low-dimensional collective variable. In such cases, the dimension of the grid of strata is much smaller than the dimension of the state space Ω; see our discussion of collective variables in Section 3.2.1 and our computations in Section 5. It is important to keep this in mind when reading our results below.

Remark 3.13. One may define a uniform grid of strata so that (3.15) increases only as $d^2$ with dimension, not exponentially: For any $i \in \mathbb{Z}^d$, let $V_i := hi + h[-\tfrac{1}{2}, \tfrac{1}{2}]^d$. For $i \ne j$, define

$W_{ij} := \left\{x \in V_j : \min_{y \in V_i}\|x - y\| \le \min_{y \in V_k}\|x - y\| \text{ for any } k \in \mathbb{Z}^d\setminus\{j\}\right\}$

to be the d-dimensional pyramid consisting of all points in $V_j$ closer to $V_i$ than to any other cube $V_k$. Now let $e_n$ denote the n'th standard basis vector in $\mathbb{R}^d$, and define

$V_i' := \bigcup_{n=1}^d\left(W_{i, i+e_n} \cup W_{i, i-e_n}\right)\cup V_i$

to be the cube $V_i$ enlarged by all the neighboring pyramids $W_{ij}$. The strata $V_i'$ are convex, and the corresponding bias functions $\psi_i = \tfrac{1}{2}\mathbf{1}_{V_i'}$ are a partition of unity. Each stratum $V_i'$ intersects only the $2d$ neighboring strata $V_{i \pm e_n}'$ for n = 1, …, d. Moreover, each intersection between neighboring strata $V_i'$ and $V_j'$ consists of the pair of pyramids $W_{ij}$ and $W_{ji}$, and it has volume 1/d. Therefore, by Lemma 3.11, for this choice of bias functions, the nonzero entries of F decrease as 1/d. It follows that (3.15) increases as $d^2$.

4. Limiting Results as a Rationale for EMUS.

In this section, we analyze the efficiency of EMUS in two limits: First, we consider a low temperature limit, where we write π(x) ∝ exp(−βV(x)) and let the inverse temperature β increase, concentrating the target distribution at its modes and intensifying the effects of multimodality on the efficiency of MCMC sampling. Second, we consider the estimation of increasingly small tail probabilities. Our goal in each case is to elucidate the advantages and disadvantages of EMUS for a broad class of problems, providing a rationale for the use of the method. We hope that others will use the tools of Section 3 in similar fashion to develop their own novel applications of EMUS.

4.1. Limit of Low Temperature.

Let the target distribution take the form

$\pi_\beta(x) = \frac{\exp(-\beta V(x))}{\int \exp(-\beta V(y))\,dy}$

for some potential V and inverse temperature β > 0, as in Section 3.2. In this section, we analyze the efficiency of EMUS in the low temperature limit as β tends to infinity with V fixed. We observe that πβ concentrates at its modes (the minima of V) in this limit. As a consequence, MCMC methods for sampling πβ undergo transitions between modes only rarely, which makes direct MCMC sampling increasingly inefficient. To be precise, we show that the asymptotic variance of a trajectory average of the overdamped Langevin dynamics increases exponentially with β in the worst case. On the other hand, we show that the asymptotic variance of the EMUS estimate of the same average increases only polynomially. Therefore, EMUS is dramatically more efficient than direct sampling in the low temperature limit.

We consider the low temperature limit because it provides a convenient sequence of increasingly difficult to sample multimodal distributions: By analyzing EMUS in the low temperature limit, we hope to elucidate its advantages for multimodal problems in general. We have no other interest in low temperature.

We now examine the overdamped Langevin dynamics.

$dX_t^\beta = -\nabla V(X_t^\beta)\,dt + \sqrt{2\beta^{-1}}\,dB_t$ (4.1)

in the low temperature limit. (The overdamped Langevin dynamics is ergodic for πβ under certain conditions on V; see [41] for example.) For typical potentials V, the generator

$L := \beta^{-1}\Delta - \nabla V \cdot \nabla$

of (4.1) has a spectral gap that shrinks exponentially with β; that is, for some c > 0,

$-\exp(-c\beta) \le \lambda_1 < 0,$ (4.2)

where $\lambda_1$ is the greatest nonzero eigenvalue of L. We refer to [31, Section 2.5] for a review of results on the spectrum of L, and we refer to [21] for precise conditions on V which guarantee (4.2). Now let $v_1$ be an eigenfunction corresponding to $\lambda_1$ normalized so that $\pi_\beta[v_1^2] = 1$. By formula (3.12), the asymptotic variance $\sigma_\beta^2(v_1)$ of the trajectory average of $v_1$ satisfies

$\sigma_\beta^2(v_1) = -2\,\pi_\beta[v_1\,L^{-1}v_1] = -2\lambda_1^{-1}\,\pi_\beta[v_1^2] = -2\lambda_1^{-1} \ge 2\exp(c\beta),$

indicating that the cost of estimating π[v1] by direct MCMC grows exponentially with β.

Having analyzed the overdamped Langevin dynamics, we now examine EMUS in the low temperature limit. For convenience, we assume that Ω is the unit cube $[0,1]^d \subset \mathbb{R}^d$ with periodic boundary conditions; to be more precise, we let $\Omega = \mathbb{R}^d/\mathbb{Z}^d$ be the set of all points in $\mathbb{R}^d$ with x and y identified if and only if $x - y \in \mathbb{Z}^d$. Periodic boundary conditions are typical of problems in chemistry and computational statistical mechanics. We do not see any difficulties in generalizing our results to other types of domains.

As β increases, we must make the supports of the bias functions smaller. We accomplish this by adjusting the parameter h in a uniform grid of bias functions similar to those defined in (3.7). To be precise, we fix $K \in \mathbb{N}$, set h := 1/K, and define

$\psi_i(x) := \frac{1}{2^d}\,\mathbf{1}_{[-1,1]^d}\!\left(K(x - hi)\right) \quad\text{for } i \in \{0, 1, \dots, K-1\}^d.$ (4.3)

This family of $K^d$ bias functions is a partition of unity over Ω, and the support of the i'th bias function is

$U_i := h[-1,1]^d + hi.$

For convenience, we treat the index i as an element of $\mathbb{Z}^d/K\mathbb{Z}^d$; that is, we let i be periodic with period K in each of its components, identifying $(0, i_2, \dots, i_d)$ with $(K, i_2, \dots, i_d)$, for example. Figure 1 illustrates such a family of bias functions, and it demonstrates the appropriate relationship between β and h.

Figure 1: Bias functions and target distributions in the low temperature limit. In the upper two plots, the black curves are the densities of the target distributions for two different values of β. Observe that π concentrates at the minima of V as β increases. The red bands each lie above a single stratum chosen from a family of strata for which $h \propto \beta^{-1}$. In the lower two plots, the blue curve is βV(x) and the x-axis covers the bottom of the red band in the plot immediately above. Observe that the range of βV(x) over the red band is the same for each of the two values of β. By Theorem 3.7 and the ensuing discussion in Section 3.2, this implies that the cost of sampling a single biased distribution increases at most polynomially with β when $h \propto \beta^{-1}$.

We now show that the asymptotic variance of EMUS increases at most polynomially with β when K is chosen appropriately. In light of the above discussion, this means that EMUS may be dramatically more efficient than direct sampling for multimodal problems. We note that despite the exponential dependence on d in (4.4) below, EMUS and other stratified MCMC methods are often advantageous for high-dimensional multimodal problems; see our discussion of low-dimensional collective variables in Section 3.2 and also our computations in Section 5.

Theorem 4.1. For any bounded continuous function g, let σβ,US2(g) denote the asymptotic variance of πβ,US[g]. Let the bias functions be defined by (4.3) with K equal to the least integer greater than β; that is,

$K = \lceil \beta \rceil.$

Take $\kappa_i = 1/K^d$. Let Assumption 3.6 hold. We have

$\frac{\sigma_{\beta,US}^2(g)}{\operatorname{var}_{\pi_\beta}(g)} \le C(1 + \beta)^{qd}$ (4.4)

for constants C, q > 0 independent of g and β, but depending on V and the constants in Assumption 3.6.

Proof. The proof is a straightforward application of the theory developed in Section 3; we present the details in Appendix D. ■

Remark 4.2. Our proof of Theorem 4.1 relies on the perturbation bounds which we derived in [45]. These bounds allow one to estimate the sensitivity of w(F) to small perturbations of F. Most perturbation bounds in the literature predict that w(F) is highly sensitive when the spectral gap of F is small, but ours show that this is not always the case. (The spectral gap is 1 − |λ2|, where λ2 is the eigenvalue of F with second largest absolute value.) In the low-temperature limit, the spectral gap of F decreases exponentially with β; see [45] for a simple example of this phenomenon. Nonetheless, using our bounds, we show that the cost to compute averages by EMUS increases only polynomially in β.

4.2. Limit of Small Probability.

In this section, we assess the performance of EMUS for computing tail probabilities. To be precise, we let Ω = [0,∞), and we consider estimation of probabilities of the form

$p_M := \pi([M, \infty)).$

We show that for a broad class of distributions π, the cost of computing pM with relative precision by direct MCMC increases exponentially with M, whereas the cost by EMUS increases only polynomially. Thus, EMUS is dramatically more efficient than direct sampling for computing the probabilities of tail events.

In Assumption 4.3 below, we state the conditions which we will impose on π in our analysis. These conditions specify a simple class of problems for which strong conclusions may be drawn. Similar results hold more generally. For example, in Section 5, we report the results of a computational experiment demonstrating the advantages of EMUS for computing tails of a marginal density.

Assumption 4.3. Write

$\pi(x) = \exp(-V(x))$

for some potential function $V : [0, \infty) \to \mathbb{R}$. Assume that for some $M_0 \ge 0$:

  1. Whenever $x \ge M_0$,
    $0 \le V''(x) \quad\text{and}\quad 0 < V'(x).$ (4.5)
  2. For some α ∈ (0, 1) and c > 0, whenever $x \ge M_0$,
    $\alpha V'(x)^2 - V''(x) \ge c > 0.$ (4.6)

For example, we might have

$\pi(x) \propto \exp(-|x|^r) \quad\text{for any } r \ge 1.$

Remark 4.4. Condition (4.6) in Assumption 4.3 implies geometric ergodicity of the overdamped Langevin dynamics with potential V [40]. We rely on this fact to motivate Assumption 4.5 concerning the convergence of MCMC processes sampling biased distributions with unbounded support. Interestingly, we use the same condition to prove lower bounds on some of the entries of the overlap matrix; cf. Lemma E.1.

Condition (4.5) in Assumption 4.3 implies

$p_M \le D\exp(-\gamma M)$

whenever $M \ge M_0$ for some D, γ > 0. Therefore, the relative variance $\rho_M^2$ of $\mathbf{1}_{[M,\infty)}$ over π satisfies

$\rho_M^2 = \frac{p_M - p_M^2}{p_M^2} \ge D^{-1}\exp(\gamma M) - 1.$

We conclude that estimating pM with relative accuracy by a direct MCMC method (or even Monte Carlo with independent samples) requires a number of samples increasing exponentially with M.

By contrast, we show that for an appropriate choice of bias functions, the cost to estimate $p_M$ by EMUS increases only polynomially in M. For each M > 0 and $K \in \mathbb{N}$, let

$h := \frac{M}{K},$

and define the family of K + 2 bias functions

$\psi_i(x) := \begin{cases} \tfrac{1}{2}\,\mathbf{1}_{[0,h]}(x) & \text{for } i = 0,\\ \tfrac{1}{2}\,\mathbf{1}_{[(i-1)h,(i+1)h]}(x) & \text{for } i = 1, \dots, K-1,\\ \tfrac{1}{2}\,\mathbf{1}_{[M-h,\infty)}(x) & \text{for } i = K, \text{ and}\\ \tfrac{1}{2}\,\mathbf{1}_{[M,\infty)}(x) & \text{for } i = K+1. \end{cases}$ (4.7)

As in Section 4.1, let Ui denote the support of ψi. This family of bias functions is a partition of unity on [0,∞); see Figure 2.

Figure 2: The bias functions {ψi : i = 0, …, K + 1} defined in (4.7) and a potential function V satisfying Assumption 4.3. Observe that the bias functions $\psi_K$ and $\psi_{K+1}$ have unbounded support.
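The partition of unity (4.7) is straightforward to implement; the sketch below (Python/NumPy, vectorized over a batch of points, which is an implementation choice of ours) evaluates the K + 2 bias functions at given points and checks that they sum to one away from bin boundaries.

```python
import numpy as np

def tail_bias_functions(x, M, K):
    """Evaluate the K + 2 bias functions of (4.7) at points x >= 0.
    Returns an array of shape (len(x), K + 2) whose rows sum to 1."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    h = M / K
    psi = np.zeros((x.size, K + 2))
    psi[:, 0] = 0.5 * ((0.0 <= x) & (x <= h))                        # i = 0
    for i in range(1, K):                                            # i = 1, ..., K-1
        psi[:, i] = 0.5 * (((i - 1) * h <= x) & (x <= (i + 1) * h))
    psi[:, K] = 0.5 * (x >= M - h)                                   # i = K
    psi[:, K + 1] = 0.5 * (x >= M)                                   # i = K+1
    return psi

# Quick check of the partition-of-unity property (boundary points excepted):
pts = np.random.uniform(0.0, 12.0, size=1000)
assert np.allclose(tail_bias_functions(pts, M=10.0, K=20).sum(axis=1), 1.0)
```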

We now address the cost of estimating pM by EMUS. First, we observe that Assumption 3.6 on the asymptotic variances of MCMC averages does not cover the sampling of πK and πK+1, since the supports of these distributions are unbounded. We need to add to that assumption

Assumption 4.5. Let $f : [0, \infty) \to \mathbb{R}$, and define $\sigma_i^2(f)$ to be the asymptotic variance of an MCMC trajectory average approximating $\pi_i[f]$ for i = K, K + 1. We assume

$\frac{\sigma_i^2(f)}{\operatorname{var}_{\pi_i}(f)} \le D$

for some D independent of M and f.

In fact, since Assumption 4.3 implies that the overdamped Langevin dynamics is ergodic for π(x) = exp(−V(x)) on the unbounded domain Ω = [0, ∞) (cf. Remark 4.4), we fully expect (but do not prove here) that under Assumption 4.3, Assumption 4.5 holds for the overdamped Langevin dynamics constrained (by reflection as in (3.9)) to remain in the support of $\pi_K$ or $\pi_{K+1}$. Alternatively, the reader may simply assume that we draw i.i.d. samples from the biased distributions. All our results hold in that case.

We show in Theorem 4.6 that the relative asymptotic variance of the EMUS estimate of pM grows only polynomially with M for a broad class of target distributions π. Therefore, EMUS may be dramatically more efficient than direct MCMC sampling when the goal is to compute tail probabilities. We observe that while the hypotheses of the theorem are somewhat restrictive, similar results hold more generally; for example, see Section 5 where we compute tails of a marginal density.

Theorem 4.6. Let Assumptions 3.6, 4.3, and 4.5 hold. Set

$K = \left\lceil M \max_{x \le M} |V'(x)| \right\rceil.$

Define a family of K + 2 bias functions ψi by (4.7). Take κi = 1/(K + 2). Let σM,US2 denote the asymptotic variance of the EMUS estimate of pM. We have

$\frac{\sigma_{M,US}^2}{p_M^2} \le C K^2$

for some constant C > 0 depending on V but not on M.

For example, suppose that

$V(x) = \tilde{V}(x) + x^r,$

where $\tilde{V}$ has bounded support and $r \ge 1$. Then $|V'(x)| \le C(1 + M^{r-1})$, and so

$\frac{\sigma_{M,US}^2}{p_M^2} \le C M^2(1 + M^{r-1})^2.$

Proof. The proof is similar to that of the low temperature limit, Theorem 4.1, but with complications arising because not all strata are bounded and because here we consider the relative variance instead of the variance; see Appendix E. In particular, we require Assumption 4.3 to show that one can in fact choose h so that all nonzero entries of F are bounded away from zero uniformly as M increases; cf. Lemma E.1. This is the only part of the proof relying on Assumption 4.3. ■

5. EMUS for tails: An example from Bayesian inference.

We demonstrate the use of EMUS for efficiently exploring and visualizing distributions. In particular, we show how EMUS may be used to efficiently compute both marginal densities and also tail probabilities of the form P[η(Z) ≥ ε−1] where η(Z) is a real valued function of a high-dimensional random variable Z. For both tails and marginals, there is a natural and easy to implement choice of strata, which we describe in Section 5.1.

In Section 5.3, we calculate two different one-dimensional marginals of the posterior distribution of the hierarchical Bayesian mixture model described in Section 5.2. For one marginal, the natural stratification suffices. For the other, it does not, but a preliminary computation made with the natural stratification suggests a better choice of strata. We use this example to explain how to diagnose and correct problems related to poorly chosen strata: Our results will serve to guide the practice of stratified MCMC.

5.1. The natural stratification for tails and marginals.

Here, we briefly explain how EMUS can be used to estimate tail probabilities and low-dimensional marginals of high-dimensional distributions. Let $\Omega \subset \mathbb{R}^d$; let π be a probability distribution on Ω; and let $\eta : \Omega \to \mathbb{R}$. Suppose that one wishes to estimate the very small tail probability $P[\eta(Z) \ge \epsilon^{-1}]$. In this case, it is natural to stratify in η only. That is, one may choose a partition of unity $\{\phi_i\}_{i=1}^L$ on $\mathbb{R}$ and define bias functions

$\psi_i(x) = \phi_i(\eta(x)) \quad\text{for } i = 1, \dots, L$ (5.1)

depending only on η. For a partition of unity, one might choose the regular grid of piecewise constant functions defined in Section 4.2. We refer to (5.1) as the natural stratification. To compute the tail probability, one uses EMUS to estimate $\pi[\mathbf{1}_{[\epsilon^{-1},\infty)} \circ \eta]$.

Computing marginal densities is similar; in fact, computing tails may be understood as a special case of computing a marginal density. Suppose now that $\eta : \Omega \to \mathbb{R}^\ell$. To estimate the marginal $\pi_\eta$ of π in η, one chooses a partition of unity $\{\phi_i\}_{i=1}^L$ on $\mathbb{R}^\ell$, again defining bias functions by (5.1). One then uses EMUS to compute averages of histogram bins, which are functions of the form

$b_{\eta_0}(\eta(x)) = \mathbf{1}_{\eta_0 + h[-1,1]^\ell}(\eta(x)).$ (5.2)

We have

$\lim_{h \to 0}\frac{1}{(2h)^\ell}\,\pi[b_{\eta_0}] = \pi_\eta(\eta_0),$

so for small h the averages of the histogram bins approximate πη.

By the argument in Section 4.2, EMUS with the natural stratification will be dramatically more efficient than direct sampling as long as the biased distributions are no harder to sample than the target distribution π. Essentially, this is because with the natural stratification very small averages like $P[\eta(Z) \ge \epsilon^{-1}]$ over the target distribution π are expressed as functions of much larger averages over the biased distributions $\pi_i$. Unfortunately, however, for general functions η, the biased distributions of the natural stratification need not be easy to sample. In Section 5.3, we give one example where the natural stratification works and one where it does not. In the case where it does not, we explain how to make a better choice of strata.
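Concretely, with the natural stratification a tail probability becomes an ordinary EMUS average of an indicator function. The sketch below (Python/NumPy) assumes each biased trajectory is stored only through its η-values and reuses the hypothetical `emus_estimate` helper from the sketch in Section 2.1; both choices are illustrative assumptions of ours.

```python
import numpy as np

def tail_probability_emus(eta_trajs, phis, threshold):
    """Estimate P[eta(Z) >= threshold] with the natural stratification (5.1).

    eta_trajs : list of 1-d arrays, eta evaluated along the trajectory of each
                biased process.
    phis      : list of L callables forming a partition of unity on the real line.
    """
    # psi_i(x) = phi_i(eta(x)), evaluated along each stored trajectory
    psis = [np.column_stack([phi(tr) for phi in phis]) for tr in eta_trajs]
    # g(x) = indicator of the tail event, also a function of eta only
    gs = [(tr >= threshold).astype(float) for tr in eta_trajs]
    return emus_estimate(psis, gs)   # helper defined in the Section 2.1 sketch
```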

5.2. A hierarchical Bayesian mixture model.

Here, we review the hierarchical Bayesian mixture model proposed in [39], and we discuss the difficulties which complicate inference under this model. As a tutorial in the use of EMUS, we present a numerical investigation of these difficulties in Section 5.3.

In the hierarchical mixture model, the data vector $y = (y_1, \dots, y_n) \in \mathbb{R}^n$ consists of independent identically distributed samples drawn from a mixture distribution of the form

$p(y_i \mid \phi) = \sum_{k=1}^K q_k\,\nu(y_i; \mu_k, \lambda_k^{-1}),$

where K is the number of mixture components, $q_k$ is the weight of the k'th mixture component, $\nu(\cdot\,; \mu_k, \lambda_k^{-1})$ is the normal density with mean $\mu_k$ and variance $\lambda_k^{-1}$, and ϕ is the vector of parameters

$\phi = (\mu_1, \dots, \mu_K, \lambda_1, \dots, \lambda_K, q_1, \dots, q_{K-1}).$

(Since $p(y_i \mid \phi)$ is a probability distribution, $q_1 + \cdots + q_K = 1$, and $q_1, \dots, q_{K-1}$ determine $q_K$.) The following prior distribution is imposed on ϕ:

$\mu_k \sim \mathcal{N}(m, \kappa^{-1}), \quad \lambda_k \sim \operatorname{Gamma}(\alpha, \beta), \quad \beta \sim \operatorname{Gamma}(g, h), \quad (q_1, \dots, q_{K-1}) \sim \operatorname{Dirichlet}_K(1, \dots, 1).$

As in [23, 11], we choose

$m = M, \quad \kappa = \frac{4}{R^2}, \quad \alpha = 2, \quad g = 0.2, \quad\text{and}\quad h = \frac{100\,g}{\alpha R^2}$

where R and M are the range and the mean of the observed data, respectively. The posterior density is

$p(\theta \mid y) = \frac{\kappa^{K/2}\,h^g\,\beta^{K\alpha + g - 1}}{Z_K\,\Gamma(\alpha)^K\,\Gamma(g)\,(2\pi)^{\frac{n+K}{2}}}\left(\prod_{k=1}^K \lambda_k\right)^{\alpha-1} \times \exp\left\{-\frac{\kappa}{2}\sum_{k=1}^K(\mu_k - M)^2 - \beta\left(h + \sum_{k=1}^K \lambda_k\right)\right\} \times \prod_{i=1}^{n}\left(\sum_{k=1}^K q_k\,\lambda_k^{\frac{1}{2}}\exp\left\{-\frac{\lambda_k}{2}(y_i - \mu_k)^2\right\}\right),$

where θ = (ϕ, β) denotes the vector of all parameters to be inferred, including the hyperparameter β.

Several factors complicate inference based on this model: First, the mixture components are not identifiable; that is, the posterior distribution is invariant under permutation of the labels of the mixture components. Consequences of non-identifiability are discussed at length in [23, 11]. In our computations in Section 5.3, we impose the constraint

$\mu_1 \le \mu_2 \le \cdots \le \mu_K$

to ensure that the components are identifiable. Second, in Lemma 5.1, we show that the posterior density may be unbounded, introducing spurious modes with infinite density. Finally, even with identifiability constraints, the posterior distribution may have multiple modes of finite posterior density. For example, see the modes reported in [11]. In Section 5.3, we use EMUS to efficiently visualize the posterior, assessing the effects of multimodality and unboundedness.

We suspect that the unboundedness of the posterior for this model is well known. However, we are unable to find a reference, so we now explain. It is certainly well known that the likelihood of a Gaussian mixture model is unbounded: Roughly speaking, the likelihood is infinite when any mixture component is collapsed on a single data point [1]. Nonetheless, one might expect the posterior density p(θ|y) to be bounded, since the prior penalizes large values of the precisions λi. This is not always the case when the data vector contains repeated entries:

Lemma 5.1. If any datum yi has frequency Ni greater than

$2g + 2(K-1)\alpha,$

then the posterior density p(θ|y) is unbounded.

Proof. Take the limit of p(θ|y) as $\lambda_1 \to \infty$ with $\mu_1 = y_i$, $\beta = \lambda_1^{-1}$, and all other variables held fixed. ■

The reader will observe that under the model, the set of data vectors with repeated entries has probability zero. However, in practice, the data consist of measurements with finite precision, and therefore repeated entries occur commonly, cf. the Hidalgo stamp data used in Section 5.3.

5.3. Numerical experiments: Choosing strata, computing tails, diagnosis of problems.

In this section, we explain how to recognize and correct problems related to poor choices of strata, and we demonstrate the use of EMUS to investigate the multimodality and unboundedness of the posterior in the mixture model. We first compute two one-dimensional marginals of the high-dimensional posterior density p(θ|y) using the natural stratification (5.1). The natural stratification works in one case but not the other. In the case where the natural stratification does not work, preliminary calculations based on the natural stratification suggest a better choice of strata.

Here, we let y be the Hidalgo stamp data set first studied in [22], consisting of the thicknesses of 485 stamps, ranging between 60 μm and 130 μm. We let there be three mixture components (K = 3), following previous computational studies [11, 23]. In our first calculation, we estimated the marginal in μ2 using the natural stratification with a grid of 201 bias functions covering the range [7, 11], with the support of the leftmost and rightmost bias functions reaching to −∞ and ∞, respectively. For the middle strata, define ϕ_1 : ℝ → ℝ by

\phi_1(x) := \max\{0, 1 - |x|\}. \qquad (5.3)

We used the bias functions

\psi_i(\theta) = \phi_1\!\left(\frac{\mu_2 - (7 + (i-1)h)}{h}\right), \quad \text{where } h := 0.02, \qquad (5.4)

for i = 2, …, 200. Now, define ϕ_2 : ℝ → ℝ by

\phi_2(x) := \min\{\max\{0, 1 - x\}, 1\}. \qquad (5.5)

The first and last bias functions were

\psi_1(\theta) = \phi_2\!\left(\frac{\mu_2 - 7}{h}\right), \qquad (5.6)
\psi_{201}(\theta) = \phi_2\!\left(\frac{(7 + 200h) - \mu_2}{h}\right), \qquad (5.7)

where h = 0.02 as before.
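To make the construction concrete, the following Python sketch (ours, not part of the EMUS software; the helper names phi1, phi2, and make_bias_functions are ours) builds the 201 bias functions (5.4)–(5.7) as functions of the stratified coordinate μ2 and checks that they form a partition of unity.

```python
import numpy as np

def phi1(x):
    """Triangular hat function: phi_1(x) = max{0, 1 - |x|}, cf. (5.3)."""
    return np.maximum(0.0, 1.0 - np.abs(x))

def phi2(x):
    """One-sided ramp: phi_2(x) = min{max{0, 1 - x}, 1}, cf. (5.5)."""
    return np.minimum(np.maximum(0.0, 1.0 - x), 1.0)

def make_bias_functions(lo=7.0, hi=11.0, n=201):
    """Return n bias functions psi_i of mu2 on a uniform grid over [lo, hi].

    Interior functions are hats of half-width h; the first and last are ramps
    whose support extends to -inf and +inf, so the psi_i sum to one everywhere.
    """
    h = (hi - lo) / (n - 1)                                          # h = 0.02 here
    psis = [lambda mu2, lo=lo, h=h: phi2((mu2 - lo) / h)]            # psi_1, cf. (5.6)
    for i in range(2, n):
        c = lo + (i - 1) * h
        psis.append(lambda mu2, c=c, h=h: phi1((mu2 - c) / h))       # psi_i, cf. (5.4)
    psis.append(lambda mu2, hi=hi, h=h: phi2((hi - mu2) / h))        # psi_n, cf. (5.7)
    return psis

# Sanity check: the bias functions form a partition of unity.
psis = make_bias_functions()
grid = np.linspace(5.0, 13.0, 1000)
assert np.allclose(sum(psi(grid) for psi in psis), 1.0)
```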

We chose the total number of bias functions based on the sizes of the off-diagonal entries in the overlap matrix. For any bias functions of the form (5.4), the overlap matrix is tridiagonal. Thus, by Remark 3.12, if the superdiagonal and subdiagonal entries Fi,i+1 and Fi,i−1 are sufficiently large, then the EMUS estimator is not too sensitive to statistical errors in F¯. For our choice of bias functions,

\min\{F_{i,i+1} : i = 1, \ldots, 200\} \geq 0.01 \quad \text{and} \quad \min\{F_{i,i-1} : i = 2, \ldots, 201\} \geq 0.004. \qquad (5.8)

We sampled the biased distributions using the affine invariant ensemble sampler with 100 walkers, as implemented in the emcee package [15]. Due to computational restrictions on memory, only every tenth sample point was saved. As a check on the sampling, the average acceptance probability over all walkers in the ensemble sampler was calculated for each biased distribution. Averaging over biased distributions gave a total average acceptance probability of 0.31. The minimum acceptance probability over all distributions was 0.12.
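For readers who want to reproduce this kind of calculation, the sketch below shows how one stratum might be sampled with emcee, assuming the emcee 3.x interface; it is a minimal illustration rather than the code used for the paper. Here log_posterior is assumed to evaluate the log of the unnormalized posterior p(θ|y), and psi_i is one bias function evaluated at the stratified coordinate of θ, for example lambda theta: psi(theta[idx_mu2]) with psi taken from the previous sketch and idx_mu2 the (hypothetical) index of μ2 in the parameter vector.

```python
import numpy as np
import emcee

def log_biased_density(theta, log_posterior, psi_i):
    """log of psi_i(theta) * p(theta | y), up to an additive constant."""
    w = psi_i(theta)
    if w <= 0.0:                      # outside the support of this stratum
        return -np.inf
    return np.log(w) + log_posterior(theta)

def sample_stratum(log_posterior, psi_i, p0, n_burn=3000, n_steps=100000, thin=10):
    """Sample one biased distribution; p0 has shape (n_walkers, n_dim)."""
    n_walkers, n_dim = p0.shape
    sampler = emcee.EnsembleSampler(
        n_walkers, n_dim, log_biased_density, args=(log_posterior, psi_i))
    state = sampler.run_mcmc(p0, n_burn)          # equilibration
    sampler.reset()
    sampler.run_mcmc(state, n_steps)              # production run
    chain = sampler.get_chain(thin=thin, flat=True)   # keep every thin-th point
    mean_accept = np.mean(sampler.acceptance_fraction)
    return chain, mean_accept
```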

To initialize sampling, we computed an unbiased test trajectory; that is, a trajectory having ergodic distribution π. We then started by sampling a single biased distribution πk, initializing with points drawn randomly from the unbiased trajectory. We sampled the other biased distributions in sequence, initializing with points drawn randomly from samples of adjacent biased distributions. Thus, we sampled πk first, then πk−1 and πk+1, then πk−2 and πk+2, etc. We equilibrated the sampler in each πi for 3000 Monte Carlo steps, and collected data for an additional 100000 Monte Carlo steps. Each step of the ensemble sampler involves perturbing the positions of each of the 100 walkers.
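The sweep just described can be written down compactly; the sketch below (our own notation, reusing sample_stratum from the previous snippet, with the psis here taken as functions of the full parameter vector) seeds each stratum with points drawn from the sample of the adjacent stratum already visited, restricted to the support of the new bias function.

```python
import numpy as np

def sample_all_strata(log_posterior, psis, unbiased_points, k, n_walkers=100, seed=None):
    """Sample every stratum, sweeping outward from stratum k."""
    rng = np.random.default_rng(seed)
    L = len(psis)
    samples = [None] * L

    def init_from(points, psi):
        # Seed walkers only from points lying in the support of the new stratum.
        inside = np.array([psi(p) > 0 for p in points])
        pool = points[inside] if inside.any() else points
        return pool[rng.integers(0, len(pool), size=n_walkers)]

    # Stratum k is seeded from an unbiased test trajectory ...
    samples[k], _ = sample_stratum(log_posterior, psis[k], init_from(unbiased_points, psis[k]))
    # ... and the rest from adjacent samples: k-1 and k+1, then k-2 and k+2, etc.
    for offset in range(1, L):
        for i in (k - offset, k + offset):
            if 0 <= i < L and samples[i] is None:
                neighbor = i + 1 if i < k else i - 1
                samples[i], _ = sample_stratum(
                    log_posterior, psis[i], init_from(samples[neighbor], psis[i]))
    return samples
```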

We computed the marginal in μ2 using a grid of 200 histogram bins, covering the region [7, 11]; this corresponds to taking h = 0.01 in (5.2). The result is the curve labeled EMUS in Figure 3a. The marginal in μ2 has two modes, labeled 1 and 2 in Figure 3a. We plot the mixture distributions corresponding to these modes in Figure 4. (To be precise, the distributions in Figure 4 correspond to means over histogram bins centered at the labeled points.)
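The following numpy sketch shows how the per-stratum data might be combined into such a marginal estimate in the partition-of-unity case, using the estimator from Appendix A, which then reduces to π^US[g] = Σ_i w_i(F̄) ḡ_i. It is our own minimal illustration, not the EMUS package; coord denotes the map from a saved point to its stratified coordinate (here μ2), and psis are the bias functions of that coordinate from the earlier sketch.

```python
import numpy as np

def emus_weights(samples, psis, coord):
    """Estimate the overlap matrix F-bar and its stationary vector w (z_i proportional to w_i)."""
    L = len(psis)
    F = np.zeros((L, L))
    for i in range(L):
        x = coord(samples[i])                       # stratified coordinate of each saved point
        denom = sum(psi(x) for psi in psis)         # equals 1 for a partition of unity
        for j in range(L):
            F[i, j] = np.mean(psis[j](x) / denom)   # F-bar_{ij}: average of psi_j^* over stratum i
    # w solves w F = w with sum(w) = 1: the left eigenvector for the eigenvalue 1.
    evals, evecs = np.linalg.eig(F.T)
    w = np.real(evecs[:, np.argmax(np.real(evals))])
    w = np.abs(w) / np.abs(w).sum()
    return F, w

def emus_marginal_histogram(samples, psis, coord, edges):
    """EMUS estimate of the marginal density on the bins defined by `edges`."""
    _, w = emus_weights(samples, psis, coord)
    hist = np.zeros(len(edges) - 1)
    for wi, s in zip(w, samples):
        counts, _ = np.histogram(coord(s), bins=edges)
        hist += wi * counts / len(s)    # pi[bin] ~ sum_i w_i * (fraction of stratum i in bin)
    return hist / np.diff(edges)        # convert bin probabilities to a density
```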

Figure 3:

Estimates of the logarithm of the marginal density in μ2 and the asymptotic variances of those estimates. Figure 3a displays estimates of the marginal in μ2 computed by EMUS and by an unbiased trajectory of the ensemble sampler. Figure 3b displays the asymptotic variances of these two estimates of the marginal density. We note that while the unbiased calculation has greater accuracy near the mode, the EMUS calculation has greater accuracy in the tails. The relative errors in this figure were estimated using the method described in Appendix F.

Figure 4:

Gaussian mixtures corresponding to modes of the marginal in μ2. Mixtures 1 and 2 correspond to the labeled points in Figure 3a. To be precise, the blue curve in each plot is the mixture distribution corresponding to the mean of a histogram bin centered at the point labeled in Figure 3a. The green curves are the individual mixture components. The black bars are a histogram of the Hidalgo stamp data.

For comparison, we also estimated the marginal in μ2 from multiple long, unbiased trajectories. We computed 100 unbiased trajectories of the affine invariant ensemble sampler in parallel. For each trajectory, the ensembles were first equilibrated for 10000 Monte Carlo steps, and then data were collected for 100000 steps. These trajectories were combined and binned to produce the density labeled Unbiased in Figure 3a. We estimated the relative asymptotic variance of the marginal density for the unbiased calculation using ACOR [14], and we estimated the relative asymptotic variance for the EMUS calculation using the method outlined in Appendix F. We present the results in Figure 3b. Note that near the mode, unbiased MCMC performs slightly better than EMUS, but in the tails, EMUS performs dramatically better.

After computing the marginal in μ2, we tried computing the marginal in log10 λ1. We used the natural stratification with a grid of 50 bias functions with maxima equally spaced between −1 and 3.2 constructed as

\psi_i(\theta) = \phi_1\!\left(\frac{-1 + h(i-1) - \log_{10}\lambda_1}{h}\right),

where

h = \frac{3.2 - (-1)}{49}.

We used the same initialization scheme as for the marginal in μ2, beginning with a single biased distribution initialized from an unbiased test trajectory. We call this the center sample. The result of this calculation was the density labeled “1D Center” in Figure 5a. When we tried to compute the asymptotic variance of this density estimate, we noticed very slow convergence of the sampler for some biased distributions. To investigate, we performed another EMUS calculation using a similar initialization procedure, but starting from π1, the biased distribution at the extreme left, covering the lowest values of λ1. We call this the left sample. The result of this second calculation was the density labeled “1D Left” in Figure 5a. For both the center and left samples, the strata were equilibrated for 3000 steps and sampled for another 200000. We observe that the two densities differ significantly in the region −1 ≤ log10 λ1 ≤ 0.5. They should be the same up to sampling errors; for example, we observe that different initializations have no effect on the calculation of the marginal in μ2, cf. Figure 3a.

Figure 5:

Estimates of the logarithm of the marginal density in log10 λ1 and the asymptotic variances of those estimates. Figure 5a displays the estimates of the marginal in log10 λ1 computed by various methods. The error bars are twice the estimated asymptotic standard deviation in each histogram bin. For the two-dimensional EMUS calculations, standard deviations were estimated using the method described in Appendix F. For the unbiased calculation, asymptotic variances were estimated using ACOR [14]. No error bars are given for the two one-dimensional calculations, as the barrier depicted in Figure 10 makes accurate estimation of the asymptotic variance impossible. A clear error is visible in the two one-dimensional umbrella sampling calculations, due to initialization on either side of the barrier in Figure 10. Figure 5b displays the asymptotic variance of the marginal density in log10 λ1 for the unbiased and the two-dimensional EMUS calculations. We note that while the unbiased calculation achieves greater accuracy near the mode, the EMUS calculation achieves greater accuracy in the tails.

Figure 6 explains the problem and suggests a solution: In the region 0.2 ≤ log10 λ1 ≤ 0.7, the center and left samples cover entirely different ranges of log10 λ2. This suggests that the biased distributions corresponding to the range 0.2 ≤ log10 λ1 ≤ 0.7 are multimodal, with barriers in λ2 impeding sampling.

Figure 6:

To generate Figure 6, we binned the samples for the one-dimensional left and center EMUS calculations, and we plotted the difference in the histograms. The contour lines are contours of the log marginal density, as in Figure 7a. Figure 6 shows that while the two calculations largely sample the same regions, near log10 λ1 = 0.45 they become trapped on opposite sides of a barrier. This leads to poor sampling, causing a slowly decaying error in the estimates of the marginal density, cf. Figure 5a.

To confirm the hypothesis that barriers in λ2 were responsible for the poor convergence observed in the center and left samples, we performed a third calculation, stratifying in both log10 λ1 and log10 λ2. We used a 50 × 50 grid of bilinear bias functions, with maxima equally spaced between −1 and 3.2. To be precise, for i, j = 1, …, 50, we defined the bias functions

\psi_{ij}(\theta) = \phi_1\!\left(\frac{-1 + h(i-1) - \log_{10}\lambda_1}{h}\right)\, \phi_1\!\left(\frac{-1 + h(j-1) - \log_{10}\lambda_2}{h}\right),

with h as before. Let ηij denote the biased distribution corresponding to ψij.

We performed the two-dimensional EMUS calculation twice, initializing from the center and left samples drawn from the natural stratification in log10 λ1. For each i = 1, …, 50, to sample the row {ηij : j = 1, …, 50} of biased distributions, we began by initializing sampling of a single biased distribution ηik with points from either the center or left sample of πi. We then sampled the other distributions ηij for j ≠ k in sequence, again initializing with points from samples of adjacent distributions, either ηi,j+1 or ηi,j−1 in this case. If no samples were found inside the support of a biased distribution, that distribution was ignored. For each biased distribution, sampling was burned in for 4500 steps, and samples were collected for an additional 2500 steps. Ultimately, 1397 of the 2500 biased distributions were sampled; the unsampled distributions correspond to the white space in Figure 7a.

Figure 7:

Logarithm of marginal density in log10 λ1 and log10 λ2 as estimated by EMUS and unbiased MCMC. Contour lines in both figures are every unit change in the estimated log10 marginal density. Figure 7a is the EMUS estimate. The numbers 1, 2, and 3 on this figure correspond to the mixture densities in Figure 8. Note that at values of log10 λ near 3.0 we begin to see the modes corresponding to singularities of the posterior. Figure 7b is the marginal density estimated from a long unbiased trajectory of the ensemble sampler. Note that the entire trajectory lies in a small neighborhood of the mode labeled 1 in Figure 7a.

We computed the marginal in log10 λ1 and log10 λ2 using a 200×200 grid of histogram bins, covering the region −1 ≤ log10 λ1 ≤ 3.2 and −1 ≤ log10 λ2 ≤ 3.2; this corresponds to taking h = (3.2 − (−1))/200 in (5.2); the result from the center calculation appears in Figure 7a. In Figure 8, we show the mixture distributions corresponding to the modes of the two-dimensional marginal in Figure 7a. The two-dimensional marginals were essentially the same for the center and left initializations; see Figure 9. We also estimated the one-dimensional marginal in log10 λ1 using the two-dimensional stratification; see the results labeled “2D Center” and “2D Left” in Figure 5a. Finally, we estimated the relative asymptotic variance of the marginal in log10 λ1 computed by two-dimensional stratification. Again, we observe that EMUS performs much better than unbiased sampling in the tails, cf. Figure 5b.

Figure 8:

Gaussian mixtures corresponding to means of histogram bins. Mixtures 1 through 3 correspond to the labeled points in Figure 7a; mixture 4 corresponds to a distribution near a singularity of the posterior, with log10 λ1 = 4.34 and log10 λ2 = 0.79. To be precise, the blue curve in each plot is the mixture distribution corresponding to the mean of a histogram bin centered at the point labeled in Figure 7a. The green curves are the individual mixture components. The black bars are a histogram of the Hidalgo stamp data.

Figure 9:

The difference between the free energy surfaces of the two-dimensional umbrella sampling runs. The center calculation was initialized from the center one-dimensional calculation, and the left calculation from the left one-dimensional calculation. In general the difference is small, roughly a tenth of an order of magnitude in the log marginal.

The marginal in log10 λ1 and log10 λ2 confirms that barriers in λ2 caused the problems observed in calculating the marginal in log10 λ1 using the natural stratification. In fact, we see that computing the marginal in either λ1 or λ2 requires stratifying both variables, as stratifying only one leads to barriers that impede sampling in the other. In particular, there are barriers in λ2 along the line log10 λ1 = 0.45 and a barrier in λ1 along log10 λ2 = 0.6: In Figure 10, we plot an estimate of the conditional distribution of log10 λ2 with log10 λ1 = 0.45 fixed. This distribution is multimodal with a region of very low probability separating the modes, which explains the poor sampling depicted in Figure 6.

Figure 10:

Here we give an estimate of the conditional distribution of log10 λ2 with log10 λ1 = 0.45 calculated from the two-dimensional marginal seen in Figure 7a. The conditional distribution is multimodal. The mode on the left corresponds to mixtures with the data from thicknesses of 60 to 85 μm covered by a single Gaussian similar to mode 2 in Figure 8. The mode on the right corresponds to mixtures with these data covered by two Gaussians similar to mode 1 in Figure 8.

To conclude, we have confirmed that EMUS can be extremely efficient for computing tails. However, one must exercise care in the choice of strata. The natural stratification often suffices, but in some cases, like computing the marginal in log10 λ1, the biased distributions of the natural stratification may be very difficult to sample. We propose the use of different initializations, like the center and left samples, as a method of identifying problems related to poorly chosen strata. Careful inspection of simulations performed with these different initializations can identify problems and suggest better strata.

6. Conclusions.

We have analyzed the Eigenvector Method for Umbrella Sampling (EMUS), an especially simple and effective stratified MCMC method sharing many features with the popular WHAM [27] and MBAR [42] methods of computational chemistry. We have demonstrated the advantages of EMUS for sampling from multimodal distributions and computing tail probabilities, and we have explained how to identify and resolve the problems which may occur if the method is implemented poorly. We have also given a tutorial intended to explain how to diagnose and correct problems related to poorly chosen strata.

Our purpose was to explain the benefits of stratified MCMC analytically, with the ultimate goal of introducing stratified MCMC to a diverse audience of statisticians, engineers, and scientists. Since stratified MCMC had previously been applied only to a particular class of statistical mechanics calculations without any general justification, we began by developing a general theory. We hope that our theory will serve as the basis for further developments. For example, it may now be possible to undertake a comparison of EMUS and other so-called reaction coordinate methods such as Wang–Landau sampling [50] or Metadynamics [28]. Despite some similarities with EMUS, these methods work by a substantially different mechanism and understanding the relative advantages of the two approaches is non-trivial.

Acknowledgments

BvK was supported by NSF RTG: Computational and Applied Mathematics in Statistical Science, number 1547396. ARD and EHT were supported by National Institutes of Health (NIH) Grant Number 5 R01 GM109455-02. We wish to thank Jonathan Mattingly, Jeremy Tempkin, and Charlie Matthews for helpful discussions.

Appendix A. Proof of Theorem 3.3.

Our proof of Theorem 3.3 (the CLT for EMUS) is based on the delta method. To apply the delta method, we require the following result ensuring the differentiability of w(G):

Lemma A.1. The function w(G) admits an extension w̃ : ℝ^{L×L} → ℝ^L which is differentiable on the set of irreducible stochastic matrices.

Proof. By [45, Lemma 3.1], w(G) admits a continuously differentiable extension to an open set U ⊂ ℝ^{L×L}. We further extend the domain of w(G) to ℝ^{L×L} by arbitrarily defining w(G) = 0 whenever G ∈ ℝ^{L×L} \ U. ■

The extension in Lemma A.1 resolves two technicalities: First, the set of stochastic matrices is not a vector space but a compact, convex subset of ℝ^{L×L} with empty interior. Therefore, the derivative of w is undefined. Second, F̄ may be reducible for some values of N and some realizations of the processes sampling the biased distributions. In that case, the invariant distribution of F̄ is not unique, so w(F̄) is undefined. Throughout the remainder of this work, w(G) will denote the extension guaranteed by the lemma. We now prove the CLT for EMUS.

Proof of Theorem 3.3. The proof is based on the delta method [6, Proposition 6.2] and a formula for w(F¯) given in [19].

By Lemma A.1, w(F¯) is differentiable at F, so the function

B\bigl(\bar F, \{\bar g_i^*\}_{i=1}^{L}, \{\bar 1_i^*\}_{i=1}^{L}\bigr) := \pi^{\mathrm{US}}[g] = \frac{\sum_{i=1}^{L} w_i(\bar F)\, \bar g_i^*}{\sum_{i=1}^{L} w_i(\bar F)\, \bar 1_i^*}

is differentiable at (F, {π_i[g^*]}_{i=1}^L, {π_i[1^*]}_{i=1}^L). Let ∂_iB ∈ ℝ^{L+2} be the derivative of B with respect to those quantities computed from X_t^i; that is,

\partial_i B := \left(\frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}}\right) \in \mathbb{R}^{L+2}, \qquad (A.1)

where ∂B/∂F̄_{i:} ∈ ℝ^L denotes the partial derivative of B with respect to the i'th row of F̄ and

\frac{\partial B}{\partial \bar G_{i:}} := \left(\frac{\partial B}{\partial \bar g_i^*}, \frac{\partial B}{\partial \bar 1_i^*}\right) \in \mathbb{R}^{2}.

To simplify notation, we will assume throughout the remainder of this argument that all derivatives are evaluated at (F, {π_i[g^*]}_{i=1}^L, {π_i[1^*]}_{i=1}^L). In formulas involving matrix multiplication, we will treat ∂_iB, ∂B/∂F̄_{i:}, and ∂B/∂Ḡ_{i:} as row vectors.

Since we assume that the processes Xti sampling the different measures πi are independent, [5, Chapter 1, Theorem 2.8] implies that

\sqrt{M}\Bigl(\bigl(\bar F_{1:}, \bar g_1^*, \bar 1_1^*, \ldots, \bar F_{L:}, \bar g_L^*, \bar 1_L^*\bigr) - \bigl(F_{1:}, \pi_1[g^*], \pi_1[1^*], \ldots, F_{L:}, \pi_L[g^*], \pi_L[1^*]\bigr)\Bigr) \xrightarrow{d} N(0, \Sigma), \qquad (A.2)

where Σ is the covariance matrix of the product of the distributions N(0, κ_i^{−1}Σ^i). (That is, Σ ∈ ℝ^{L(L+2)×L(L+2)} is the block diagonal matrix with the matrices κ_i^{−1}Σ^i along the diagonal.) Therefore, by the delta method,

\sqrt{M}\bigl(\pi^{\mathrm{US}}[g] - \pi[g]\bigr) \xrightarrow{d} N(0, \sigma^2),

where

\sigma^2 = (\partial_1 B, \ldots, \partial_L B)\, \Sigma\, (\partial_1 B, \ldots, \partial_L B)^t = \sum_{i=1}^{L} \kappa_i^{-1}\, \partial_i B\, \Sigma^i\, \partial_i B^t. \qquad (A.3)

Now we observe that for any column vector v ∈ ℝ^L having mean zero,

\frac{d}{d\varepsilon}\, w_k(F + \varepsilon\, e_i v^t)\Big|_{\varepsilon=0} = \frac{\partial w_k}{\partial \bar F_{i:}}\, v = z_i\, v^t (I - F)^{\#} e_k,

by [19, Theorem 3.1]. (In the formula above, e_i ∈ ℝ^L denotes the i'th standard basis vector.) Therefore, we have

\frac{\partial B}{\partial \bar F_{i:}}\, v = \frac{\sum_{k=1}^{L} \frac{\partial w_k}{\partial \bar F_{i:}} v\, \pi_k[g^*]}{\sum_{k=1}^{L} z_k \pi_k[1^*]} - \frac{\sum_{k=1}^{L} \frac{\partial w_k}{\partial \bar F_{i:}} v\, \pi_k[1^*]}{\sum_{k=1}^{L} z_k \pi_k[1^*]}\, \frac{\sum_{k=1}^{L} z_k \pi_k[g^*]}{\sum_{k=1}^{L} z_k \pi_k[1^*]} = \sum_{k=1}^{L} \frac{\partial w_k}{\partial \bar F_{i:}} v\, \Psi\bigl(\pi_k[g^*] - \pi[g]\, \pi_k[1^*]\bigr) = z_i\, v^t (I - F)^{\#}\, g, \qquad (A.4)

where

g_k = \Psi\, \pi_k\bigl[g^* - \pi[g]\, 1^*\bigr] = l\bigl(\pi_k[g^*], \pi_k[1^*]\bigr).

(Equality (A.4) above follows from (2.3) and the definition (2.2) of Ψ.) Also,

\frac{\partial B}{\partial \bar G_{i:}} = z_i\, \Psi\, (1,\, -\pi[g]) = z_i\, l^t. \qquad (A.5)

Thus,

\partial_i B\, \Sigma^i\, \partial_i B^t = \frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i\, \frac{\partial B}{\partial \bar F_{i:}}^{\,t} + 2\, \frac{\partial B}{\partial \bar F_{i:}}\, \rho^i\, \frac{\partial B}{\partial \bar G_{i:}}^{\,t} + \frac{\partial B}{\partial \bar G_{i:}}\, \tau^i\, \frac{\partial B}{\partial \bar G_{i:}}^{\,t} = z_i^2\left\{ \bigl((I-F)^{\#} g\bigr)^t \sigma^i\, (I-F)^{\#} g + 2\,\bigl((I-F)^{\#} g\bigr)^t \rho^i\, l + l^t \tau^i\, l \right\},

and the result follows by (A.3). ■

Appendix B. Proof of Theorem 3.5.

Definition B.1. Let e_i ∈ ℝ^L denote the i'th standard basis vector. For i, j ∈ {1, 2, …, L} with i ≠ j, define the logarithmic partial derivatives

\frac{\partial \log w_k}{\partial F_{ij}}(F) := \frac{d}{d\varepsilon}\Big|_{\varepsilon=0} \log w_k\bigl(F + \varepsilon\,(e_i e_j^t - e_i e_i^t)\bigr). \qquad (B.1)

(These partial derivatives must be understood as derivatives of the extension guaranteed by Lemma A.1; otherwise, they are defined only when Fij > 0 and Fii > 0.)

Our definition of logarithmic partial derivatives in (B.1) is not standard. However, we observe that a version of the standard formula relating the total and partial derivatives of log w holds: For all matrices H whose rows sum to zero,

\frac{d}{d\varepsilon}\Big|_{\varepsilon=0} \log w_k(F + \varepsilon H) = \sum_{i \neq j} \frac{\partial \log w_k}{\partial F_{ij}}(F)\, H_{ij}. \qquad (B.2)

We need only consider matrices whose rows sum to zero, since these are the only perturbations for which F + εH can be stochastic.

The following result appears in [45, Theorem 3.6]. It is crucial in our proof of Theorem 3.5.

Lemma B.2. Recall P_i[τ_j < τ_i] and ∂ log w_k/∂F_{ij} from Definitions 3.4 and B.1. For all irreducible stochastic matrices F, we have

\frac{1}{2}\, \frac{1}{P_i[\tau_j < \tau_i]} \;\leq\; \max_k \left|\frac{\partial \log w_k}{\partial F_{ij}}(F)\right| \;\leq\; \frac{1}{P_i[\tau_j < \tau_i]}.

We also require the following lemma in the proof of Theorem 3.5.

Lemma B.3. The asymptotic covariance matrix σ^i has the following properties:

  1. The rows and columns of σ^i sum to zero. That is, for e ∈ ℝ^L the vector of all ones,
     σ^i e = 0 and e^t σ^i = 0.
  2. For all j = 1, …, L,
     σ^i_{jk} = σ^i_{kj} = 0 whenever F_{ik} = 0.

Proof. Since the rows of F¯ sum to one with probability one, we have

\mathrm{var}(\bar F_{i:}\, e) = 0

for any fixed number of samples N_i. Therefore, the asymptotic covariance σ^i satisfies e^t σ^i e = 0, and it follows that e^t σ^i = σ^i e = 0 since σ^i is symmetric and positive semidefinite.

Let k be such that Fik = 0. Since F¯ik=0 with probability one, we have

cov(F¯ik,F¯ij)=0

for any j = 1, …, L, and therefore σ^i_{jk} = 0. ■

We now prove Theorem 3.5.

Proof of Theorem 3.5. We begin with formula (A.3):

\sigma^2 = \sum_{i=1}^{L} \kappa_i^{-1}\, \partial_i B\, \Sigma^i\, \partial_i B^t. \qquad (B.3)

Since the asymptotic covariance matrix Σ^i is symmetric and positive semidefinite, the Cauchy inequality holds:

a^t \Sigma^i b \;\leq\; \tfrac{1}{2}\, a^t \Sigma^i a + \tfrac{1}{2}\, b^t \Sigma^i b,

for all a, b ∈ ℝ^{L+2}. Therefore,

\partial_i B\, \Sigma^i\, \partial_i B^t = \Bigl(\frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)\, \Sigma^i\, \Bigl(\frac{\partial B}{\partial \bar F_{i:}}, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)^t \leq 2\,\Bigl(\frac{\partial B}{\partial \bar F_{i:}}, 0\Bigr)\, \Sigma^i\, \Bigl(\frac{\partial B}{\partial \bar F_{i:}}, 0\Bigr)^t + 2\,\Bigl(0^t, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)\, \Sigma^i\, \Bigl(0^t, \frac{\partial B}{\partial \bar G_{i:}}\Bigr)^t = 2\,\frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i\, \frac{\partial B}{\partial \bar F_{i:}}^{\,t} + 2\,\frac{\partial B}{\partial \bar G_{i:}}\, \tau^i\, \frac{\partial B}{\partial \bar G_{i:}}^{\,t} =: 2A_0 + 2A_1. \qquad (B.4)

(Here, 0 denotes the zero vector in L, interpreted as a column vector.)

We now estimate the term A0 defined above. By (A.4), we have

A_0 = \frac{\partial B}{\partial \bar F_{i:}}\, \sigma^i\, \frac{\partial B}{\partial \bar F_{i:}}^{\,t} = \sum_{j,k,l,m=1}^{L} g_l\, \frac{\partial w_l}{\partial \bar F_{ij}}\, \sigma^i_{jk}\, g_m\, \frac{\partial w_m}{\partial \bar F_{ik}} = \sum_{l,m=1}^{L} z_l g_l\, z_m g_m \sum_{\substack{j\neq i\\ F_{ij}>0}} \sum_{\substack{k\neq i\\ F_{ik}>0}} \frac{\partial \log w_l}{\partial \bar F_{ij}}\, \sigma^i_{jk}\, \frac{\partial \log w_m}{\partial \bar F_{ik}} = \sum_{l,m=1}^{L} z_l g_l\, z_m g_m \sum_{\substack{j\neq i\\ F_{ij}>0}} \sum_{\substack{k\neq i\\ F_{ik}>0}} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \frac{\partial \log w_l}{\partial \bar F_{ij}}\, R^i_{jk}\, \sqrt{\mathrm{var}_{\pi_i}(\psi_k^*)}\, \frac{\partial \log w_m}{\partial \bar F_{ik}}, \qquad (B.5)

where

R^i_{jk} := \frac{\sigma^i_{jk}}{\sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)\,\mathrm{var}_{\pi_i}(\psi_k^*)}}.

(The third equality above follows from formula (B.2) relating the total and partial derivatives of log w, since the rows and columns of σ^i sum to zero by Lemma B.3.)

We claim that

\sum_{\substack{j\neq i\\ F_{ij}>0}} \sum_{\substack{k\neq i\\ F_{ik}>0}} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \frac{\partial \log w_l}{\partial \bar F_{ij}}\, R^i_{jk}\, \sqrt{\mathrm{var}_{\pi_i}(\psi_k^*)}\, \frac{\partial \log w_m}{\partial \bar F_{ik}} \;\leq\; \mathrm{tr}(R^i) \sum_{\substack{j\neq i\\ F_{ij}>0}} \mathrm{var}_{\pi_i}(\psi_j^*)\, \left(\frac{\partial \log w_l}{\partial \bar F_{ij}}\right)^2. \qquad (B.6)

To prove this, we observe that R^i is symmetric and positive semidefinite since σ^i is symmetric and positive semidefinite. Therefore, R^i has the spectral decomposition

R^i = \sum_{j=1}^{L} \lambda^{i,j}\, v^{i,j} (v^{i,j})^t

with eigenvalues λ^{i,j} ≥ 0 and corresponding eigenvectors v^{i,j} such that ‖v^{i,j}‖ = 1. Thus, for any a ∈ ℝ^L,

a^t R^i a = \sum_{j=1}^{L} \lambda^{i,j} \bigl| v^{i,j} \cdot a \bigr|^2 \leq \Bigl(\sum_{j=1}^{L} \lambda^{i,j}\Bigr) \|a\|^2 = \mathrm{tr}(R^i)\, \|a\|^2. \qquad (B.7)

Inequality (B.6) follows from (B.7) by setting

a_j = \begin{cases} \sqrt{\mathrm{var}_{\pi_i}(\psi_j^*)}\, \dfrac{\partial \log w_l}{\partial F_{ij}} & \text{if } j \neq i \text{ and } F_{ij} > 0, \\ 0 & \text{otherwise.} \end{cases}

Finally, combining (B.5), (B.6), and Lemma B.2 yields

A_0 \leq \mathrm{tr}(R^i)\, \Bigl(\sum_{l=1}^{L} z_l |g_l|\Bigr)^{2} \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}.

Moreover, we have

\sum_{l=1}^{L} z_l |g_l| = \Psi \sum_{l=1}^{L} z_l \bigl|\pi_l[g^* - \pi[g]\, 1^*]\bigr| \leq \Psi \sum_{l=1}^{L} z_l\, \pi_l[|h|] = \pi[|h|],

by (2.3), and therefore

A_0 \leq \mathrm{tr}(R^i)\, \pi[|h|]^2 \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2}. \qquad (B.8)

We now observe that by (A.5)

A_1 = z_i^2\, l^t \tau^i\, l = \Psi^2 z_i^2\, C(\bar h_i),

where C(h̄_i) denotes the asymptotic covariance of the trajectory average h̄_i of h over the biased process X_t^i. Therefore, combining (B.3) and (B.8), we find

\sigma^2 \leq 2 \sum_{i=1}^{L} \frac{1}{\kappa_i} \left\{ z_i^2\, \Psi^2\, C(\bar h_i) + \mathrm{tr}(R^i)\, \pi[|h|]^2 \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{\mathrm{var}_{\pi_i}(\psi_j^*)}{P_i[\tau_j < \tau_i]^2} \right\},

as desired.

Finally, suppose that the bias functions are a partition of unity. In that case,

\pi[|h|]^2 = \pi\bigl[|g - \pi[g]|\bigr]^2 \leq \pi\bigl[|g - \pi[g]|^2\bigr] = \mathrm{var}_\pi(g),

and so we may replace π[|h|]² with var_π(g). In addition, we observe that for a partition of unity, equation (A.4) holds with g_k = π_k[g]. Thus, following the argument above, one may verify that the result holds with π[|g|]² in place of π[|h|]².

Appendix C. Proof of Theorem 3.7.

In the arguments below, for any probability measure ν on a set Ω, we let

L^2(\nu) := \{ u : \Omega \to \mathbb{R} : \nu[u^2] < \infty \},

and we define the L2(ν) inner product

\langle f, g \rangle_\nu = \nu[fg]

with the corresponding norm

\|f\|_{L^2(\nu)} := \sqrt{\langle f, f \rangle_\nu}.

Given a set U ⊂ ℝ^d, we define L²(U), ‖·‖_{L²(U)}, and ⟨·, ·⟩_U to be the analogous function space, norm, and inner product for Lebesgue measure on U.

Our proof of Theorem 3.7 requires a Poincaré inequality, Lemma C.1. We refer to [31, Section 3] for an introduction to Poincaré inequalities and their role in the theory of diffusion processes.

Lemma C.1. Assume that the Poincaré inequality holds for U with constant Λ; that is, assume that for all weakly differentiable f : U → ℝ such that ∇f ∈ L²(U),

\Bigl\| f - \int_U f\, dx \Bigr\|_{L^2(U)} \leq \Lambda(U)\, \|\nabla f\|_{L^2(U)}.

We have a similar Poincaré inequality for πh:

\|f - \pi_h(f)\|_{L^2(\pi_h)} \leq h\,\Lambda(U)\, \exp\!\Bigl(\frac{\beta}{2}\bigl(\sup_{U_h} V - \inf_{U_h} V\bigr)\Bigr)\, \|\nabla f\|_{L^2(\pi_h)}.

Proof. By a standard scaling argument, the Poincaré inequality holds for U_h with constant hΛ. To see this, let A_h : U → U_h be the affine transformation

A_h x = x_0 + h(x - x_0).

For any f : U_h → ℝ with ∇f ∈ L²(U_h), using the change of variables formula and the chain rule, we have

\Bigl\| f - \int_{U_h} f \Bigr\|_{L^2(U_h)}^2 = h^d \Bigl\| f \circ A_h - \int_U f \circ A_h \Bigr\|_{L^2(U)}^2 \leq h^d\, \Lambda^2\, \|\nabla(f \circ A_h)\|_{L^2(U)}^2 = h^d\, h^2 \Lambda^2\, \|(\nabla f) \circ A_h\|_{L^2(U)}^2 = h^2 \Lambda^2\, \|\nabla f\|_{L^2(U_h)}^2.

Now observe that for any f ∈ L²(π_h),

\|f - \pi_h[f]\|_{L^2(\pi_h)} = \min_{c \in \mathbb{R}} \|f - c\|_{L^2(\pi_h)},

since πh[f] is the L2(πh) orthogonal projection of f onto the space of constant functions. Therefore, we have

\|f - \pi_h[f]\|_{L^2(\pi_h)}^2 \leq \Bigl\| f - \int_{U_h} f \Bigr\|_{L^2(\pi_h)}^2 \leq \Bigl(\sup_{x \in U_h} \pi_h(x)\Bigr) \Bigl\| f - \int_{U_h} f \Bigr\|_{L^2(U_h)}^2 \leq h^2 \Lambda^2 \Bigl(\sup_{x \in U_h} \pi_h(x)\Bigr) \|\nabla f\|_{L^2(U_h)}^2 \leq h^2 \Lambda^2\, \frac{\sup_{x \in U_h} \pi_h(x)}{\inf_{x \in U_h} \pi_h(x)}\, \|\nabla f\|_{L^2(\pi_h)}^2,

and the result follows. ■

Remark C.2. The Poincaré inequality for the Lebesgue measure on a set U holds under very weak conditions on U. For example, when U is convex, the Poincaré inequality holds with constant Λ(U) = D/π, where D is the diameter of the domain [38].
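As a concrete illustration of the remark (our example, not taken from [38]): if U is the unit cube [0, 1]^d, then U is convex with diameter D = √d, so the Poincaré constant is Λ(U) = √d/π; the scaling argument in the proof of Lemma C.1 then gives the constant hΛ(U) = h√d/π for the rescaled domain U_h = A_h(U), which is a cube of side h.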

We now prove Theorem 3.7:

Proof of Theorem 3.7. We begin by stating a simple consequence of the functional central limit theorem for reversible, continuous time Markov processes: Let Y_t be a reversible, stationary Markov process with ergodic distribution π and generator L. Let g ∈ L²(π), and define

\bar g := T^{-1} \int_{0}^{T} g(Y_s)\, ds.

By [25, Corollary 1.9],

\sqrt{T}\,\bigl(\bar g - \pi[g]\bigr) \xrightarrow{d} N\bigl(0, \sigma^2(g)\bigr),

where

\sigma^2(g) = -\langle g - \pi[g],\, L^{-1}(g - \pi[g]) \rangle_\pi. \qquad (C.1)

Here, L^{−1}(g − π[g]) denotes any function in the domain of L with

L\bigl(L^{-1}(g - \pi[g])\bigr) = g - \pi[g]

and π[L^{−1}(g − π[g])] = 0. Such a function must exist when g ∈ L²(π) and Y_t is reversible [25].

We now show that the process X_t^h meets the conditions above for the central limit theorem. First, we recall that the generator of X_t^h is the operator

L_h = \beta^{-1}\Delta - \nabla V \cdot \nabla

with domain

D(L_h) := \{ g \in C^2(U_h) : \nabla g(x) \cdot n(x) = 0 \text{ for all } x \in \partial U_h \};

see [2, Proposition 3.2] for the case of a convex polyhedron or [13, Chapter 8] for a domain with C3 boundary.

By [24, Theorem 4.3.3], a process Yt with invariant distribution π is reversible if its generator is symmetric and it has the strong continuity property

\lim_{t \to 0^+} \|T_t f - f\|_{L^2(\pi)} = 0 \quad \text{for all } f \in L^2(\pi), \qquad (C.2)

where T_t f(x) := E_x[f(Y_t)] denotes the backwards semigroup associated with Y_t. The generator L_h of X_t^h is symmetric, since for all f, g ∈ D(L_h), using integration by parts, we have

\beta^{-1}\langle \nabla f, \nabla g \rangle_\pi = \beta^{-1}\int_{U_h} \nabla f \cdot \nabla g\; z_h^{-1}\exp(-\beta V)\, dx = -\beta^{-1}\int_{U_h} f\, \mathrm{div}\bigl(z_h^{-1}\exp(-\beta V)\nabla g\bigr)\, dx + \beta^{-1}\int_{\partial U_h} f\, z_h^{-1}\exp(-\beta V)\, \nabla g \cdot n\, dS = -\int_{U_h}\bigl(\beta^{-1}\Delta g - \nabla V \cdot \nabla g\bigr)\, f\, z_h^{-1}\exp(-\beta V)\, dx = -\langle f, L_h g \rangle_\pi. \qquad (C.3)

(Here, z_h := \int_{U_h}\exp(-\beta V)\, dx is the normalizing constant for π_h.) Since ⟨∇f, ∇g⟩_π is invariant under exchanging f and g, ⟨f, L_h g⟩_π = ⟨L_h f, g⟩_π and L_h is symmetric. We postpone discussion of the strong continuity of X_t^h to the end of the proof.

We now use the Poincaré inequality (Lemma C.1) and (C.3) to prove that X_t^h is ergodic and to estimate the term L_h^{−1}(g − π_h[g]) appearing in the formula for σ_h²(g); in essence, we adapt the approach outlined in [31, Section 3] to the family of reflected processes X_t^h. We prove ergodicity first. By [4, Proposition 2.2], a process is ergodic if and only if 0 is a simple eigenvalue of its generator. By the Poincaré inequality (Lemma C.1) and (C.3), for all u ∈ D(L_h),

\|u - \pi_h[u]\|_{L^2(\pi_h)}^2 \leq C_h^2\, \|\nabla u\|_{L^2(\pi_h)}^2 = -C_h^2\, \beta\, \langle u, L_h u \rangle_{\pi_h} \leq C_h^2\, \beta\, \|u\|_{L^2(\pi_h)}\, \|L_h u\|_{L^2(\pi_h)}, \qquad (C.4)

where

C_h = h\,\Lambda(U)\, \exp\!\Bigl(\frac{\beta}{2}\bigl(\sup_{U_h} V - \inf_{U_h} V\bigr)\Bigr).

Now if u is not constant, ‖u − π_h[u]‖²_{L²(π_h)} > 0, so ‖L_h u‖_{L²(π_h)} > 0 and u is not an eigenvector with eigenvalue 0. Hence, 0 is a simple eigenvalue of L_h, and X_t^h is ergodic.

Finally, we estimate σ_h²(g). For any u ∈ D(L_h) with π_h[u] = 0, the inequality (C.4) gives

\|u\|_{L^2(\pi_h)} \leq C_h^2\, \beta\, \|L_h u\|_{L^2(\pi_h)}.

Taking u = L_h^{−1}(g − π_h[g]) in the above yields

\|L_h^{-1}(g - \pi_h[g])\|_{L^2(\pi_h)} \leq C_h^2\, \beta\, \|g - \pi_h[g]\|_{L^2(\pi_h)},

which implies

\sigma_h^2(g) = -\langle g - \pi_h[g],\, L_h^{-1}(g - \pi_h[g]) \rangle_{\pi_h} \leq C_h^2\, \beta\, \mathrm{var}_{\pi_h}(g),

using the Cauchy–Schwarz inequality.

It remains to show that the process X_t^h has the strong continuity property (C.2). We only sketch an argument, since the basic ideas are standard. First, one can use the Lipschitz continuity of strong solutions of the reflected process [2, Lemma 4.1] to show that X_t^h has the Feller property. (That is, one can show that T_t u is continuous whenever u is continuous.) In addition, since the process X_t^h has an infinitesimal generator, we have the pointwise continuity property

\lim_{t \to 0^+} T_t u(x) = u(x) \qquad (C.5)

for all x ∈ U_h and all u ∈ D(L_h). Now we have ‖T_t‖_∞ ≤ 1 for all t ≥ 0, where ‖T_t‖_∞ is the operator norm of T_t on the space of continuous functions with the sup-norm, and therefore by a density argument the limit (C.5) holds for all continuous u. Hence, by [8, Lemma 1.4], we have

\lim_{t \to 0^+} \sup_{x \in U_h} |T_t u(x) - u(x)| = 0

for all continuous u. The strong continuity property (C.2) then follows by another density argument, using that ‖T_t‖_{L²(π_h)} ≤ 1 for all t ≥ 0. ■

Appendix D. Proof of Theorem 4.1.

Proof of Theorem 4.1. By the remarks immediately following Theorem 3.5, since the bias functions are a partition of unity, we have

\sigma^2(g) \leq 2 \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} \kappa_i^{-1} \left\{ C(\bar g_i)\, z_i^2 + \mathrm{var}_\pi(g)\, \mathrm{tr}(R^i) \sum_{\substack{j \neq i\\ F_{ij} > 0}} \frac{1}{F_{ij}} \right\}. \qquad (D.1)

To prove the desired upper bound, we substitute estimates of C(g¯i), Ri, and Fij into the inequality above.

First, we consider the asymptotic covariances Ri and C(g¯i). Let

h=1/K.

The diameter of U_i is 2√d h, so by Assumption 3.6

R^i_{jj} \leq C\, h^a \beta^b \exp\bigl(2\sqrt{d}\, h\beta\, \|\nabla V\|_{L^\infty}\bigr) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr). \qquad (D.2)

(The second inequality follows since hβ ≤ 1 by definition.) Similarly,

C(\bar g_i) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr)\, \mathrm{var}_{\pi_i}(g). \qquad (D.3)

Second, by Lemma 3.11, the nonzero entries of the overlap matrix F are bounded below as β tends to infinity:

F_{ij} \geq \frac{\exp\bigl(-2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr)}{4^d} \qquad (D.4)

for all i, j so that F_{i,j} > 0. We also observe that each row of F has 3^d nonzero entries, since F_{i,i+k} > 0 only when all entries of k ∈ ℤ^d belong to {−1, 0, 1}.

We now estimate the term involving C(g¯i) in (D.1). By (D.3), we have

\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, C(\bar g_i) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, \mathrm{var}_{\pi_i}(g). \qquad (D.5)

Now we have

\mathrm{var}_{\pi_i}(g) = \pi_i\bigl[|g - \pi_i[g]|^2\bigr] \leq \pi_i\bigl[|g - \pi[g]|^2\bigr].

Therefore,

\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, \mathrm{var}_{\pi_i}(g) \leq \sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i\, \pi_i\bigl[|g - \pi[g]|^2\bigr] = \pi\bigl[|g - \pi[g]|^2\bigr] = \mathrm{var}_\pi(g). \qquad (D.6)

(The inequality follows since 0 ≤ z_i ≤ 1 for all i; the second to last equality follows using (2.5) and that {ψ_i}_{i∈ℤ^d/Kℤ^d} is a partition of unity.) Thus,

\sum_{i \in \mathbb{Z}^d/K\mathbb{Z}^d} z_i^2\, C(\bar g_i) \leq C\, h^{a-b} \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr)\, \mathrm{var}_\pi(g). \qquad (D.7)

It remains to address the term involving R^i in (D.1): Using (D.2), (D.4), and that each row of F has 3^d nonzero entries, we have

\mathrm{tr}(R^i) \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{1}{F_{ij}} \leq C\, 6^{2d}\, h^{a-b} \exp\bigl(4\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) \qquad (D.8)

for every i ∈ ℤ^d/Kℤ^d. Finally, using (D.7), (D.8), and κ_i^{−1} = K^d = β^d, we conclude

\sigma^2(g) \leq 2C\, h^{a-b} \Bigl\{ K^d \exp\bigl(2\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) + 6^{2d} K^{2d} \exp\bigl(4\sqrt{d}\, \|\nabla V\|_{L^\infty}\bigr) \Bigr\}\, \mathrm{var}_\pi(g) \leq \bigl(D\, \beta^{d+b-a} + E\, \beta^{2d+b-a}\bigr)\, \mathrm{var}_\pi(g),

where the constants D and E depend on d and V, but not on g or β. ■

We note that if one uses the bias functions proposed in Remark 3.13, then the constants D and E in the proof of Theorem 4.1 grow only polynomially with the dimension d, not exponentially. However, we do not claim that those bias functions perform better than the uniform grid (3.7) or the bias functions of Section 5.3 in practice.

Appendix E. Proof of Theorem 4.6.

Proof of Theorem 4.6. Take g := 1_{\{x \geq M\}}. As explained in the remarks after the statement of Theorem 3.5, since the bias functions are a partition of unity, we have

\sigma_M^2 \leq 2 \sum_{i=0}^{K+1} \kappa_i^{-1} \left\{ C(\bar g_i)\, z_i^2 + p_M^2\, \mathrm{tr}(R^i) \sum_{\substack{j\neq i\\ F_{ij}>0}} \frac{1}{F_{ij}} \right\}. \qquad (E.1)

First, we estimate

R^i_{jj} \leq C\, h^a \exp\Bigl(h \max_{x \leq M} |V'(x)|\Bigr) \leq C e\, h^a

for all i = 1, …, K − 1 by Assumption 3.6. By Assumption 4.5,

R^K_{jj} \leq D \quad \text{for } j = K-1,\, K,

and R^K_{jj} = 0 for j ≠ K − 1, K, since ψ_{K+1} is constant over the support of π_K. In addition,

R^{K+1}_{jj} = 0 \quad \text{for all } j = 1, \ldots, L,

since all bias functions ψi take a constant value over the support of πK+1. Likewise,

C(\bar g_K) \leq C e\, h^a\, \mathrm{var}_{\pi_K}(g) \leq C e\, h^a,

and C(ḡ_i) = 0 for all i ≠ K.

We now show that the nonzero entries of the overlap matrix are bounded below independent of M. First, we estimate the entries which are averages over the biased distributions with bounded support. By Lemma 3.11, we have

F_{ij} \geq \tfrac{1}{2}\exp(-2) > 0

for all i = 0, …, K − 1 and j so that F_{ij} > 0. It remains to address those entries related to biased distributions with unbounded support, so with i = K, K + 1. By Lemma E.1, F_{K,K+1} and F_{K,K−1} are bounded below by some θ > 0 independent of M, for this choice of bias functions. (Lemma E.1 and its proof appear at the end of this appendix. Lemma E.1 is the only part of the proof which relies on Assumption 4.3.) In addition, for any i = 0, …, K + 1, we have F_{ii} = 1/2, which implies F_{K+1,K} = 1 − F_{K+1,K+1} = 1/2 since F is stochastic when the bias functions are a partition of unity.

Finally, we substitute the above estimates of the overlap matrix and the variances into (E.1). Let c = min{θ, (1/2)exp(−2)}. Observe that h decreases with M, so Ce h^a ≤ E for some constant E, uniformly in M. Let F = max{D, E}. We have

\frac{\sigma^2}{p_M^2} \leq \frac{2}{p_M^2} \sum_{i=0}^{K+1} (K+2) \left\{ C(\bar g_i)\, z_i^2 + p_M^2\, \frac{2F}{c^2} \right\} \leq 2(K+2)\, F\, \frac{z_K^2}{p_M^2} + \frac{4(K+2)^2 F}{c^2}.

We now observe that

\frac{z_K}{p_M} = \frac{z_K}{z_{K+1}} = \frac{F_{K+1,K}}{F_{K,K+1}} \leq \frac{1}{c}.

Therefore,

\frac{\sigma^2}{p_M^2} \leq \frac{2F(K+2)}{c^2} + \frac{4F(K+2)^2}{c^2},

which proves the result. ■

We now prove Lemma E.1, which is used in the proof of Theorem 4.6.

Lemma E.1. Under the hypotheses of Theorem 4.6, there exist constants M₁, θ₊, θ₋ > 0 depending on V but not on M so that

F_{K,K+1} \geq \theta_+ > 0 \quad \text{and} \quad F_{K,K-1} \geq \theta_- > 0

whenever M ≥ M₁.

Proof. We consider FK,K−1 first. We have

F_{K,K-1} = \frac{1}{2}\, \frac{\pi([M-h, M))}{\pi([M-h, \infty))} = \frac{1}{2}\, \frac{\int_{M-h}^{M} \exp(-V(x))\, dx}{\int_{M-h}^{\infty} \exp(-V(x))\, dx}.

By the integral mean value theorem,

\int_{M-h}^{M} \exp(-V(x))\, dx = h\, \exp\bigl(-V(\xi_{M-h,M})\bigr)

for some ξ_{M−h,M} ∈ [M − h, M]. Moreover, by (4.5), we have

V(x) \geq V(M) + V'(M)(x - M) \quad \text{for all } x \geq M \geq M_0.

Therefore, when M − h ≥ M₀,

\int_{M-h}^{\infty} \exp(-V(x))\, dx \leq \int_{M-h}^{\infty} \exp\bigl(-V(M-h) - V'(M-h)(x - M + h)\bigr)\, dx = \frac{\exp(-V(M-h))}{V'(M-h)}.

It follows that

F_{K,K-1} \geq \frac{h\, V'(M-h)}{2}\, \exp\bigl(V(M-h) - V(\xi_{M-h,M})\bigr) \geq \frac{h\, V'(M-h)}{2}\, \exp\Bigl(-h \max_{x \leq M}|V'(x)|\Bigr) \geq \frac{h\, V'(M-h)}{2}\, \exp(-1) = \frac{V'(M-h)}{2\max_{x \leq M}|V'(x)|}\, \exp(-1), \qquad (E.2)

using the definition h = M/K.

To estimate the quotient in expression (E.2), we distinguish two cases: By (4.5), V′ is nondecreasing on [M0,∞), so either limx→∞ V′(x) = C2 < ∞ or limx→∞ V′(x) = ∞. In the first case, V′ is bounded, and we have

\frac{V'(M-h)}{\max_{x \leq M}|V'(x)|} \geq \frac{V'(M_0)}{\max_{x \in [0,\infty)}|V'(x)|} > 0, \qquad (E.3)

whenever M − h ≥ M₀. In the second case, for M sufficiently large,

\max_{x \leq M} |V'(x)| = V'(M).

Therefore, applying in succession the mean value theorem, the monotonicity of V′, assumption (4.6), and the hypothesis limx→∞ V′(x) = ∞, we have that for all M sufficiently large,

\frac{V'(M-h)}{\max_{x \leq M}|V'(x)|} = \frac{V'(M) - \bigl(V'(M) - V'(M-h)\bigr)}{V'(M)} = \frac{V'(M) - h\, V''(\eta_{M-h,M})}{V'(M)} \geq \frac{V'(M) - V'(M)\, \dfrac{V''(\eta_{M-h,M})}{V'(\eta_{M-h,M})^2}}{V'(M)} \geq \frac{V'(M) - V'(M)\,\alpha}{V'(M)} \geq \frac{1-\alpha}{2} > 0. \qquad (E.4)

(In the second and third lines above, η_{M−h,M} ∈ [M − h, M] denotes the point guaranteed by the mean value theorem so that V′(M) − V′(M−h) = hV′′(η_{M−h,M}).) It follows from (E.2), (E.3), and (E.4) that there exist M₋, θ₋ > 0 so that

F_{K,K-1} \geq \theta_- > 0 \qquad (E.5)

whenever M ≥ M₋.

Now we prove that FK,K+1 is bounded below. We have

F_{K,K+1} = \frac{1}{2}\, \frac{\int_{M}^{\infty} \exp(-V(x))\, dx}{\int_{M-h}^{\infty} \exp(-V(x))\, dx} = F_{K,K-1}\, \frac{\int_{M}^{\infty} \exp(-V(x))\, dx}{\int_{M-h}^{M} \exp(-V(x))\, dx} \geq \theta_-\, \frac{\int_{M}^{M+h} \exp(-V(x))\, dx}{\int_{M-h}^{M} \exp(-V(x))\, dx} = \theta_-\, \frac{\int_{M}^{M+h} \exp\bigl(V(x-h) - V(x)\bigr)\exp(-V(x-h))\, dx}{\int_{M-h}^{M} \exp(-V(x))\, dx} \geq \theta_-\, \exp\Bigl(\min_{[M-h,M+h]} V - \max_{[M-h,M+h]} V\Bigr) \geq \theta_-\, \exp\Bigl(-2h \max_{[M-h,M+h]} |V'|\Bigr). \qquad (E.6)

As above, to bound the quantity appearing in the exponent in (E.6), we distinguish the two cases lim_{x→∞} V′(x) = C₁ < ∞ and lim_{x→∞} V′(x) = ∞. In the first case, for M sufficiently large that 2C₁ ≥ |V′(x)| ≥ C₁/2 whenever x ≥ M − h, we have

h \max_{[M-h,M+h]} |V'| = \frac{\max_{[M-h,M+h]} |V'|}{\max_{[0,M]} |V'|} \leq \frac{2C_1}{C_1/2} = 4. \qquad (E.7)

In the second case, for M sufficiently large,

h \max_{[M-h,M+h]} |V'| = \frac{\max_{[M-h,M+h]} |V'|}{\max_{[0,M]} |V'|} \leq \frac{V'(M+h)}{V'(M)}. \qquad (E.8)

By (4.6), we have the differential inequality

V'' < \alpha\, |V'|^2.

This implies

V'(M+s) \leq y(s)

for

y(s) = \frac{1}{V'(M)^{-1} - \alpha s}

the solution of the initial value problem

y' = \alpha\, y^2 \quad \text{and} \quad y(0) = V'(M).

Therefore,

V'(M+h) \leq \frac{1}{V'(M)^{-1} - \alpha h} = \frac{1}{V'(M)^{-1} - \alpha\, V'(M)^{-1}} = \frac{V'(M)}{1 - \alpha},

so by (E.8),

h \max_{[M-h,M+h]} |V'| \leq \frac{1}{1 - \alpha}. \qquad (E.9)

It follows from (E.6), (E.7), and (E.9) that there exist M+, θ+ > 0 so that

F_{K,K+1} \geq \theta_+ > 0 \qquad (E.10)

whenever M ≥ M₊. ■

Appendix F. An improved method of computing error bars for EMUS.

In [46, Section VII.B.1], we proposed a practical method of estimating the asymptotic standard deviations (error bars) of averages computed by EMUS. Using the notation established in Appendix A, our method proceeds as follows:

  1. Compute F̄, {ḡ_i^*}_{i=1}^L, and {1̄_i^*}_{i=1}^L.

  2. Compute w(F̄) and the group inverse (I − F̄)^#.

  3. Evaluate ∂_iB at F̄, {ḡ_i^*}_{i=1}^L, and {1̄_i^*}_{i=1}^L.

  4. Compute the time series
     \bar\zeta_t^i = \partial_i B\, \Bigl( \bigl(\psi_1(X_t^i), \ldots, \psi_L(X_t^i), g^*(X_t^i), 1^*(X_t^i)\bigr) - \bigl(\bar F_{i1}, \ldots, \bar F_{iL}, \bar g_i^*, \bar 1_i^*\bigr) \Bigr).
  5. Compute an estimate χ̄_i² of the integrated autocovariance of ζ̄_t^i using an algorithm such as ACOR [14].

  6. Compute, as an estimate of σ², the quantity
     \bar\sigma^2 := \sum_{i=1}^{L} \frac{\bar\chi_i^2}{\kappa_i}. \qquad (F.1)

We originally proposed computing the group inverse (I − F̄)^# using the method of [19] based on the QR factorization. We have since discovered that this method does not always yield sufficiently accurate results. For example, when computing error bars for the marginal in μ2 in Section 5.3, we observed a highly oscillatory numerical error affecting some entries of (I − F̄)^#. That the sign pattern in Figure 11a fails to be symmetric is evidence of this numerical error. We note that since the exact overlap matrix F is in detailed balance with w(F), we have diag(w(F)) F diag(w(F))^{−1} = F^t. (Here, diag(w(F)) denotes the diagonal matrix with w(F) along the diagonal.) Therefore,

\bigl((I - F)^{\#}\bigr)^t = \mathrm{diag}(w(F))\, (I - F)^{\#}\, \mathrm{diag}(w(F))^{-1},

which implies that the sign pattern of (I − F)^# is symmetric since w(F) is positive. As a result of these numerical errors, we were unable to accurately compute error bars for the EMUS estimate of the marginal density.

We therefore propose computing the group inverse by a new method combining QR factorization with power iteration. We first compute an estimate G₀ of (I − F̄)^# by the method of [19]. We then iterate

G_{n+1} = \mathcal{I}(G_n) = \tilde F\, G_n + I - e\, w(\bar F)^t, \qquad (F.2)

where e ∈ ℝ^L denotes the column vector of all ones and F̃ := (I − e w(F̄)^t) F̄. We observe that (I − F̄)^# is a fixed point of this iteration, since

\mathcal{I}\bigl((I - \bar F)^{\#}\bigr) = (I - e\, w(\bar F)^t)\, \bar F\, (I - \bar F)^{\#} + (I - e\, w(\bar F)^t) = (\bar F - I)(I - \bar F)^{\#} + (I - e\, w(\bar F)^t) + (I - \bar F)^{\#} = (I - \bar F)^{\#}.

Above, we use well-known properties of the group inverse, including that the spectral projector I − e w(F̄)^t commutes with F̄, that (I − e w(F̄)^t)(I − F̄)^# = (I − F̄)^#, and that (I − F̄)(I − F̄)^# = I − e w(F̄)^t.

Moreover, when F̄ is irreducible, \mathcal{I}^K is a contraction for K sufficiently large. By the Perron–Frobenius theorem, the spectral radius of F̃ is smaller than 1 − ε for some ε > 0. Therefore, by Gelfand's formula, for any matrix norm ‖·‖, we have lim_{k→∞} ‖F̃^k‖^{1/k} < 1 − ε/2, and so for some K,

\|\tilde F^k\| < (1 - \varepsilon/2)^k \quad \text{whenever } k \geq K.

Now

\mathcal{I}^K(G) = \tilde F^K G + (I - e\, w(\bar F)^t) \sum_{j=0}^{K-1} \bar F^j.

Thus, assuming that the norm ‖·‖ is submultiplicative,

\|\mathcal{I}^K(G) - \mathcal{I}^K(H)\| = \|\tilde F^K (G - H)\| \leq \|\tilde F^K\|\, \|G - H\| \leq (1 - \varepsilon/2)^K\, \|G - H\|.

Therefore, the power iteration converges and its limit is the group inverse (I − F̄)^#.
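In numpy, the iteration (F.2) takes only a few lines; the sketch below is our own illustration (the initial guess G0 would come from the QR-based method of [19], and w is the estimated stationary vector w(F̄)).

```python
import numpy as np

def group_inverse_power_iteration(F_bar, w, G0=None, n_iter=100000, tol=1e-14):
    """Refine an estimate of the group inverse (I - F_bar)^# by iterating (F.2)."""
    L = F_bar.shape[0]
    e = np.ones((L, 1))
    w = np.asarray(w).reshape(1, L)        # stationary row vector, w F_bar = w
    P = np.eye(L) - e @ w                  # spectral projector I - e w^t
    F_tilde = P @ F_bar
    G = np.eye(L) if G0 is None else G0.copy()
    for _ in range(n_iter):
        G_new = F_tilde @ G + P            # G_{n+1} = F_tilde G_n + I - e w^t
        if np.max(np.abs(G_new - G)) < tol:
            return G_new
        G = G_new
    return G
```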

Using this new method, we computed (I − F̄)^# for F̄ the overlap matrix involved in estimating the marginal in μ2 in Section 5.3. We performed 10^6 power method iterates. Observe that the sign pattern of the group inverse computed with power iteration is symmetric; see Figure 11b.

The power iteration (F.2) converges slowly when the spectral gap of F¯ is small. We have shown in [45] that the spectral gap may be very small: It decreases exponentially with a temperature parameter in a limit similar to the one analyzed in Section 4.1 above. However, even when the spectral gap is small, we conjecture that a modest number of power iterations will significantly reduce the numerical error in the group inverse, since the error in the initial calculation seems to be highly oscillatory and the power iteration has a smoothing effect.

Footnotes

1

A potential of mean force is the logarithm of a marginal density. A free energy is the logarithm of a normalization constant. Both quantities play fundamental roles in statistical mechanics, e.g. in theories of rates of chemical reactions.

2

The boundary of a set is C3 if in a neighborhood of each point on the boundary, the boundary is the graph of a three times continuously differentiable function.

REFERENCES

[1] Aitkin M: Likelihood and Bayesian analysis of mixtures. Statistical Modelling 1(4), 287–304 (2001)
[2] Andres S: Pathwise differentiability for SDEs in a convex polyhedron with oblique reflection. Ann. Inst. Henri Poincaré Probab. Stat. 45(1), 104–116 (2009)
[3] Berneche S, Roux B: Energetics of ion conduction through the K+ channel. Nature 414(6859), 73 (2001)
[4] Bhattacharya RN: On the functional central limit theorem and the law of the iterated logarithm for Markov processes. Z. Wahrsch. Verw. Gebiete 60(2), 185–201 (1982)
[5] Billingsley P: Convergence of Probability Measures, second edn. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (1999)
[6] Bilodeau M, Brenner D: Theory of Multivariate Statistics. Springer Texts in Statistics. Springer, New York (1999)
[7] Boczko EM, Brooks CL: First-principles calculation of the folding free energy of a three-helix bundle protein. Science 269(5222), 393–396 (1995)
[8] Böttcher B, Schilling RL, Wang J: A primer on Feller processes. In: Lévy Matters III: Lévy-type Processes: Construction, Approximation and Sample Path Properties, Lecture Notes in Mathematics, chap. 1, pp. 1–30. Springer (2013)
[9] Chandler D: Introduction to Modern Statistical Mechanics. Oxford University Press, New York (1987)
[10] Cho GE, Meyer CD: Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra Appl. 335, 137–150 (2001)
[11] Chopin N, Lelièvre T, Stoltz G: Free energy methods for Bayesian inference: efficient exploration of univariate Gaussian mixture posteriors. Statistics and Computing 22(4), 897–916 (2012)
[12] Doss H, Tan A: Estimates and standard errors for ratios of normalizing constants from multiple Markov chains via regeneration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(4), 683–712 (2014)
[13] Ethier SN, Kurtz TG: Markov Processes: Characterization and Convergence. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York (1986)
[14] Foreman-Mackey D, Goodman J: ACOR 1.1.1. https://pypi.python.org/pypi/acor/1.1.1 (2014)
[15] Foreman-Mackey D, Hogg DW, Lang D, Goodman J: emcee: The MCMC hammer. Publications of the Astronomical Society of the Pacific 125(925), 306 (2013)
[16] Geyer CJ: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. American Statistical Association (1991)
[17] Geyer CJ: Estimating normalizing constants and reweighting mixtures. Technical Report No. 568 (1994). Retrieved from the University of Minnesota Digital Conservancy
[18] Gill RD, Vardi Y, Wellner JA: Large sample theory of empirical distributions in biased sampling models. Ann. Statist. 16(3), 1069–1112 (1988)
[19] Golub GH, Meyer CD Jr.: Using the QR factorization and group inversion to compute, differentiate, and estimate the sensitivity of stationary probabilities for Markov chains. SIAM J. Algebraic Discrete Methods 7(2), 273–281 (1986)
[20] Goodman J, Weare J: Ensemble samplers with affine invariance. Commun. Appl. Math. Comput. Sci. 5(1), 65–80 (2010)
[21] Helffer B, Klein M, Nier F: Quantitative analysis of metastability in reversible diffusion processes via a Witten complex approach. Mat. Contemp. 26, 41–85 (2004)
[22] Izenman AJ, Sommer CJ: Philatelic mixtures and multimodal densities. Journal of the American Statistical Association 83(404), 941–953 (1988)
[23] Jasra A, Holmes CC, Stephens DA: Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. 20(1), 50–67 (2005)
[24] Jiang DQ, Qian M, Qian MP: Mathematical Theory of Nonequilibrium Steady States: On the Frontier of Probability and Dynamical Systems. Lecture Notes in Mathematics. Springer, Berlin; New York (2004)
[25] Kipnis C, Varadhan SRS: Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Comm. Math. Phys. 104(1), 1–19 (1986)
[26] Kong A, McCullagh P, Meng XL, Nicolae D, Tan Z: A theory of statistical models for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(3), 585–604 (2003)
[27] Kumar S, Rosenberg JM, Bouzida D, Swendsen RH, Kollman PA: The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J. Comput. Chem. 13(8), 1011–1021 (1992)
[28] Laio A, Parrinello M: Escaping free-energy minima. Proceedings of the National Academy of Sciences 99(20), 12562–12566 (2002)
[29] Legoll F, Lelièvre T: Effective dynamics using conditional expectations. Nonlinearity 23(9), 2131–2163 (2010)
[30] Lelièvre T, Rousset M, Stoltz G: Free Energy Computations: A Mathematical Perspective. Imperial College Press, London (2010)
[31] Lelièvre T, Stoltz G: Partial differential equations and stochastic methods in molecular dynamics. Acta Numerica 25, 681 (2016)
[32] Liu JS: Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, New York (2001)
[33] Maragliano L, Vanden-Eijnden E: A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations. Chemical Physics Letters 426(1–3), 168–175 (2006)
[34] Matthews C, Weare J, Kravtsov A, Jennings E: Umbrella sampling: a powerful method to sample tails of distributions (2017). ArXiv:1712.05024
[35] Meng XL, Wong WH: Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica 6(4), 831–860 (1996)
[36] Pavliotis GA: Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Texts in Applied Mathematics, vol. 60. Springer, New York (2014)
[37] Pavliotis GA, Stuart AM: Multiscale Methods: Averaging and Homogenization. Texts in Applied Mathematics, vol. 53. Springer, New York (2008)
[38] Payne LE, Weinberger HF: An optimal Poincaré inequality for convex domains. Archive for Rational Mechanics and Analysis 5(1), 286–292 (1960)
[39] Richardson S, Green PJ: On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(4), 731–792 (1997)
[40] Roberts GO, Rosenthal JS: Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab. 2, no. 2, 13–25 (1997)
[41] Roberts GO, Tweedie RL: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
[42] Shirts MR, Chodera JD: Statistically optimal analysis of samples from multiple equilibrium states. The Journal of Chemical Physics 129(12), 124105 (2008)
[43] Sugita Y, Kitao A, Okamoto Y: Multidimensional replica-exchange method for free-energy calculations. J. Chem. Phys. 113(15), 11 (2000)
[44] Swendsen RH, Wang JS: Replica Monte Carlo simulation of spin-glasses. Physical Review Letters 57(21), 2607 (1986)
[45] Thiede E, Van Koten B, Weare J: Sharp entrywise perturbation bounds for Markov chains. SIAM Journal on Matrix Analysis and Applications 36(3), 917–941 (2015)
[46] Thiede EH, Van Koten B, Weare J, Dinner AR: Eigenvector method for umbrella sampling enables error analysis. The Journal of Chemical Physics 145(8), 084115 (2016)
[47] Torrie GM, Valleau JP: Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics 23(2), 187 (1977)
[48] VanDerwerken DN, Schmidler SC: Parallel Markov chain Monte Carlo (2013). ArXiv:1312.7479
[49] Vardi Y: Empirical distributions in selection bias models. The Annals of Statistics 13(1), 178–203 (1985)
[50] Wang F, Landau DP: Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 86, 2050–2053 (2001)
[51] Wang FY, Yan L: Gradient estimate on convex domains and applications. Proc. Amer. Math. Soc. 141(3), 1067–1081 (2013)
