2020 Apr;130(4):2200–2227. doi: 10.1016/j.spa.2019.06.015

Perturbation bounds for Monte Carlo within Metropolis via restricted approximations

Felipe Medina-Aguayo a, Daniel Rudolf b, Nikolaus Schweizer c
PMCID: PMC7074005  PMID: 32255890

Abstract

The Monte Carlo within Metropolis (MCwM) algorithm, interpreted as a perturbed Metropolis–Hastings (MH) algorithm, provides an approach for approximate sampling when the target distribution is intractable. Assuming the unperturbed Markov chain is geometrically ergodic, we show explicit estimates of the difference between the nth step distributions of the perturbed MCwM and the unperturbed MH chains. These bounds are based on novel perturbation results for Markov chains which are of interest beyond the MCwM setting. To apply the bounds, we need to control the difference between the transition probabilities of the two chains and to verify stability of the perturbed chain.

Keywords: Markov chain Monte Carlo, Restricted approximation, Monte Carlo within Metropolis, Intractable likelihood

1. Introduction

The Metropolis–Hastings (MH) algorithm is a classical method for sampling approximately from a distribution of interest, relying only on point-wise evaluations of an unnormalized density. However, when even this unnormalized density depends on unknown integrals and cannot easily be evaluated, this approach is infeasible. A possible solution is to replace the required density evaluations in the MH acceptance ratio with suitable approximations. This idea is implemented in Monte Carlo within Metropolis (MCwM) algorithms, which substitute the unnormalized density evaluations by Monte Carlo estimates of the intractable integrals.

Yet in general, replacing the exact MH acceptance ratio by an approximation leads to inexact algorithms in the sense that a stationary distribution of the transition kernel of the resulting Markov chain (if it exists) is not the distribution of interest. Moreover, convergence to a distribution is not at all clear. Nonetheless, these approximate, perturbed, or noisy methods, see e.g. [1], [10], [12], have recently gained increased attention due to their applicability in certain intractable sampling problems. In this work we attempt to answer the following questions about the MCwM algorithm:

  • Can one quantify the quality of MCwM algorithms?

  • When might the MCwM algorithm fail and what can one do in such situations?

Regarding the first question, we give a positive answer by deriving bounds on the difference between the nth step distributions of the Markov chains generated by the MH and MCwM algorithms. For the second question, we suggest a modification for stabilizing the MCwM approach by restricting the Markov chain to a suitably chosen set that contains the “essential part”, which we also call the “center” of the state space. We provide examples where this restricted version of MCwM converges towards the distribution of interest while the unrestricted version does not. Note also that in practical implementations of Markov chain Monte Carlo on a computer, simulated chains are effectively restricted to compact state spaces due to memory limitations. Our results on restricted approximations can also be read in this spirit.

Perturbation theory. Our overall approach is based on perturbation theory for Markov chains. Let (X_n)_{n∈ℕ₀} be a Markov chain with transition kernel P and (X̃_n)_{n∈ℕ₀} be a Markov chain with transition kernel P̃ on a common Polish space (G, B(G)). We think of P and P̃ as “close” to each other in a suitable sense and consider P̃ as a perturbation of P. In order to quantify the difference between the distributions of X_n and X̃_n, denoted by p_n and p̃_n respectively, we work with

‖p_n − p̃_n‖_tv, (1)

where ‖·‖_tv denotes the total variation distance. The Markov chain (X_n)_{n∈ℕ₀} can be interpreted as the unavailable, unperturbed, or ideal chain, while (X̃_n)_{n∈ℕ₀} is a perturbation that is available for simulation. We focus on the case where the ideal Markov chain is geometrically ergodic, more precisely V-uniformly ergodic, implying that its transition kernel P satisfies a Lyapunov condition of the form

PV(x) ≤ δV(x) + L,   x ∈ G,

for some function V: G → [1,∞) and numbers δ ∈ [0,1), L ∈ [1,∞).

To obtain estimates of (1) we need two assumptions which can be informally explained as follows:

  • 1.

    Closeness of P̃ and P: The difference between P̃ and P is measured by controlling either a weighted total variation distance or a weighted V-norm of P(x,·) − P̃(x,·) uniformly in x. Here, uniformity either refers to the entire state space or, at least, to the “essential” part of it.

  • 2.

    Stability of P˜: A stability condition on P˜ is satisfied either in the form of a Lyapunov condition or by restriction to the center of the state space determined by V.

Under these assumptions, explicit bounds on (1) are provided in Section 3. More precisely, in Proposition 6 and Theorem 7 stability is guaranteed through a Lyapunov condition for P˜, whereas in Theorem 9 a restricted approximation P˜ is considered.

Monte Carlo within Metropolis. In Section 4 we apply our perturbation bounds in the context of approximate sampling via MCwM. In the following we briefly introduce the setting. The goal is to (approximately) sample from a target distribution π on G, which is determined by an unnormalized density function π_u: G → [0,∞) w.r.t. a reference measure μ, that is,

π(A) = ∫_A π_u(x) dμ(x) / ∫_G π_u(x) dμ(x),   A ∈ B(G).

Classically the method of choice is to construct a Markov chain (X_n)_{n∈ℕ₀} based on the MH algorithm for approximate sampling of π. This algorithm crucially relies on knowing (at least) the ratio π_u(y)/π_u(x) for arbitrary (x,y) ∈ G², e.g., because π_u(x) and π_u(y) can readily be computed. However, in some scenarios, only approximations of π_u(x) and π_u(y) are available. Replacing the true unnormalized density π_u in the MH algorithm by an approximation yields a perturbed, “inexact” Markov chain (X̃_n)_{n∈ℕ₀}. If the approximation is based on a Monte Carlo method, the perturbed chain is called MCwM chain.

Two particular settings where approximations of π_u may rely on Monte Carlo estimates are doubly-intractable distributions and latent variables. Examples of the former occur in Markov or Gibbs random fields, where the function values π_u(x) of the unnormalized density itself are only known up to a factor Z(x). This means that

π_u(x) = ρ(x)/Z(x),   x ∈ G, (2)

where values of ρ(x) can easily be computed while the computational problem lies in evaluating

Z(x) = ∫_Y f_x(y) r_x(dy),

where Y denotes an auxiliary variable space, f_x: Y → [0,∞) is a measurable function and r_x is a probability distribution on Y. We investigate a MCwM algorithm which in every transition uses an iid sequence of random variables (Y_i(x))_{1≤i≤N}, with Y_1(x) ∼ r_x, to approximate Z(x) by Ẑ_N(x) ≔ N⁻¹ Σ_{i=1}^N f_x(Y_i(x)) (and Z(y) by Ẑ_N(y), respectively). The second setting we study arises from latent variables. Here, π_u(x) cannot be evaluated since it takes the form

π_u(x) = ∫_Y f_x(y) r_x(dy), (3)

where r_x is a probability distribution on a measurable space Y of latent variables y, and f_x: Y → [0,∞) is a non-negative density function. In general, no explicitly computable expression for the above integral is at hand and the MCwM idea is to substitute π_u(x) in the MH algorithm by a Monte Carlo estimate based on iid sequences of random variables (Y_i(x))_{1≤i≤N} and (Y_i(y))_{1≤i≤N} with Y_1(x) ∼ r_x, Y_1(y) ∼ r_y. The resulting MCwM algorithm has been studied before in [3], [14]. Let us note here that this MCwM approach should not be confused with the pseudo-marginal method, see [3]. The pseudo-marginal method constructs a Markov chain on the extended space G×Y that targets a distribution with π as its marginal on G.

Perturbation bounds for MCwM. In both intractability settings, the corresponding MCwM Markov chains depend on the parameter N ∈ ℕ which denotes the number of samples used within the Monte Carlo estimates. As a consequence, any bound on (1) is N-dependent, which allows us to control the dissimilarity to the ideal MH-based Markov chain. In Corollary 16 and the application of Corollary 17 to the examples considered in Section 4 we provide informative rates of convergence as N → ∞. Note that with those estimates we relax the requirement of uniform bounds on the approximation error introduced by the estimator of π_u, which is essentially imposed in [1], [14]. In contrast to this requirement, we use (if available) the Lyapunov function as a counterweight for a second as well as an inverse second moment and can therefore handle situations where uniform bounds on the approximation error are not available. If we do not have access to a Lyapunov function for the MCwM transition kernel, we suggest restricting it to a subset of the state space, i.e., using restricted approximations. This subset is determined by V and usually corresponds to a ball with some radius R(N) that increases as the approximation quality improves, that is, R(N) → ∞ as N → ∞.

Our analysis of the MCwM algorithm is guided by some facts we observe in simple illustrations; in particular, we consider a log-normal example discussed in Section 4.1. In this example, we encounter a situation where the mean squared error of the Monte Carlo approximation grows exponentially in the tail of the target distribution. We observe empirically that (unrestricted) MCwM works well whenever the growth behavior is dominated by the decay of the (Gaussian) target density in the tail. The application of Corollary 17 to the log-normal example shows that the restricted approximation converges towards the true target density in the number of samples N at least like (log N)^{−1}, independent of any growth of the error. However, the convergence is better, at least like (log N)/√N, if the growth is dominated by the decay of the target density.

2. Preliminaries

Let G be a Polish space and let B(G) denote its Borel σ-algebra. Assume that P is a transition kernel with stationary distribution π on G. For a signed measure q on G and a measurable function f: G → ℝ we define

qP(A) ≔ ∫_G P(x,A) dq(x),   Pf(x) ≔ ∫_G f(y) P(x,dy),   x ∈ G, A ∈ B(G).

For a distribution μ on G we use the notation μ(f) ≔ ∫_G f(x) dμ(x). For a measurable function V: G → [1,∞) and two probability measures μ, ν on G define

‖μ − ν‖_V ≔ sup_{|f|≤V} |μ(f) − ν(f)|.

For the constant function V = 1 this is the total variation distance, i.e.,

‖μ − ν‖_tv ≔ sup_{|f|≤1} |μ(f) − ν(f)|.
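On a finite state space these norms reduce to weighted sums, which the following minimal sketch illustrates (the function names and example vectors are ours, not from the paper):

```python
import numpy as np

def v_norm_distance(mu, nu, V):
    """||mu - nu||_V = sup_{|f| <= V} |mu(f) - nu(f)| for measures on a
    finite state space; the supremum is attained at f = V * sign(mu - nu),
    so the distance equals sum_i V_i * |mu_i - nu_i|."""
    mu, nu, V = (np.asarray(a, dtype=float) for a in (mu, nu, V))
    return float(np.sum(V * np.abs(mu - nu)))

def tv_distance(mu, nu):
    """Total variation distance, i.e. the V-norm distance with V = 1."""
    return v_norm_distance(mu, nu, np.ones(len(mu)))

# With this scaling, point masses on distinct states are at tv-distance 2.
assert tv_distance([1.0, 0.0], [0.0, 1.0]) == 2.0
assert v_norm_distance([1.0, 0.0], [0.5, 0.5], [1.0, 3.0]) == 2.0
```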

The next, well-known theorem defines geometric ergodicity and states a useful equivalent condition. The proof follows by [23, Proposition 2.1] and [17, Theorem 16.0.1].

Theorem 1

For a ϕ-irreducible and aperiodic transition kernel P with stationary distribution π defined on G the following statements are equivalent:

  • The transition kernel P is geometrically ergodic, that is, there exist a number ᾱ ∈ [0,1) and a measurable function C: G → [1,∞) such that for π-a.e. x ∈ G we have
    ‖P^n(x,·) − π‖_tv ≤ C(x) ᾱ^n,   n ∈ ℕ. (4)
  • There is a π-a.e. finite measurable function V: G → [1,∞] with finite moments with respect to π, and there are constants α ∈ [0,1) and C ∈ [1,∞) such that
    ‖P^n(x,·) − π‖_V ≤ C V(x) α^n,   x ∈ G, n ∈ ℕ. (5)

In particular, the function V can be chosen such that a Lyapunov condition of the form

PV(x) ≤ δV(x) + L,   x ∈ G, (6)

for some δ ∈ [0,1) and L ∈ (0,∞), is satisfied.

Remark 2

We call a transition kernel V-uniformly ergodic if it satisfies (5) and note that this condition can be rewritten as

sup_{x∈G} ‖P^n(x,·) − π‖_V / V(x) ≤ C α^n. (7)

3. Quantitative perturbation bounds

Assume that (X_n)_{n∈ℕ₀} is a Markov chain with transition kernel P and initial distribution p_0 on G. We define p_n ≔ p_0 P^n, i.e., p_n is the distribution of X_n. The distribution p_n is approximated by using another Markov chain (X̃_n)_{n∈ℕ₀} with transition kernel P̃ and initial distribution p̃_0. We define p̃_n ≔ p̃_0 P̃^n, i.e., p̃_n is the distribution of X̃_n. The idea throughout the paper is to interpret (X_n)_{n∈ℕ₀} as some ideal, unperturbed chain and (X̃_n)_{n∈ℕ₀} as an approximating, perturbed Markov chain.

In the spirit of the doubly-intractable distribution and latent variable cases considered in Section 4 we think of the unperturbed Markov chain as “nice”, with readily available convergence properties. Unfortunately, since we cannot simulate the “nice” chain, we approximate it with a perturbed Markov chain, which, because of the perturbation, is difficult to analyze directly. With this in mind, we make the following standing assumption on the unperturbed Markov chain.

Assumption 3

Let V: G → [1,∞) be a measurable function and assume that P is V-uniformly ergodic, that is, (5) holds for some constants C ∈ [1,∞) and α ∈ [0,1).

We start with an auxiliary estimate of ‖p_n − p̃_n‖_tv which is interesting on its own and is proved in Appendix A.1.

Lemma 4

Let Assumption 3 be satisfied and for a measurable function W: G → [1,∞) define

ε_tv,W ≔ sup_{x∈G} ‖P(x,·) − P̃(x,·)‖_tv / W(x),
ε_V,W ≔ sup_{x∈G} ‖P(x,·) − P̃(x,·)‖_V / W(x).

Then, for any r ∈ (0,1],

‖p_n − p̃_n‖_tv ≤ C α^n ‖p_0 − p̃_0‖_V + ε_tv,W^{1−r} ε_V,W^r C^r Σ_{i=0}^{n−1} p̃_i(W) α^{(n−i−1)r}. (8)

Remark 5

The quantities ε_tv,W and ε_V,W measure the difference between P and P̃. Note that we can interpret them as operator norms

ε_tv,W = ‖P − P̃‖_{B(1)→B(W)}   and   ε_V,W = ‖P − P̃‖_{B(V)→B(W)},

where

B(W) ≔ {f: G → ℝ : ‖f‖_W ≔ sup_{x∈G} |f(x)|/W(x) < ∞}. (9)

It is also easily seen that ε_tv,W ≤ min{2, ε_V,W}, which implies that a small ε_V,W also leads to a small ε_tv,W. In (8) an additional parameter r appears which can be used to tune the estimate. Namely, if one is not able to bound ε_V,W sufficiently well but has a good estimate of ε_tv,W, one can optimize over r. On the other hand, if there is a satisfying estimate of ε_V,W one can just set r = 1.

In the previous lemma we proved an upper bound of ‖p_n − p̃_n‖_tv which still contains an unknown quantity given by

Σ_{i=0}^{n−1} p̃_i(W) α^{(n−i−1)r}

which measures, in a sense, stability of the perturbed chain through a weighted sum of expectations of the Lyapunov function W under p˜i. To control this term, we impose additional assumptions on the perturbed chain. In the following, we consider two assumptions of this type, a Lyapunov condition and a bounded support assumption.

3.1. Lyapunov condition

We start with a simple version of our main estimate which already illustrates some key aspects of the approach via the Lyapunov condition. Here the intuition is as follows: By Theorem 1 we know that the function V of Assumption 3 can be chosen such that a Lyapunov condition for P is satisfied. Since we think of P̃ as being close to P, it might be possible to establish a Lyapunov condition for P̃ with the same V. If this is the case, the following proposition is applicable.

Proposition 6

Let Assumption 3 be satisfied. Additionally, let δ̃ ∈ [0,1) and L̃ ∈ (0,∞) be such that

P̃V(x) ≤ δ̃V(x) + L̃,   x ∈ G. (10)

Assume that p_0 = p̃_0 and define κ ≔ max{p̃_0(V), L̃/(1−δ̃)}, as well as (for simplicity)

ε_tv ≔ ε_tv,V,   ε_V ≔ ε_V,V.

Then, for any r ∈ (0,1],

‖p_n − p̃_n‖_tv ≤ ε_tv^{1−r} ε_V^r C^r κ / (r(1−α)). (11)

Proof

We use Lemma 4 with W = V. By (10), it follows that

p̃_i(V) = ∫_G P̃^i V(x) p̃_0(dx) ≤ δ̃^i p̃_0(V) + (1 − δ̃^i) L̃/(1−δ̃) ≤ κ. (12)

The final estimate is obtained by a geometric series and the inequality 1 − α^r ≥ r(1−α). □

Now we state a more general theorem. In particular, in this estimate the dependence on the initial distribution can be weakened. In the perturbation bound of the previous estimate, the initial distribution is only forgotten if p̃_0(V) < L̃/(1−δ̃). Yet, intuitively, for long-term stability results p̃_0(V) should not matter at all. This intuition is confirmed by the theorem.

Theorem 7

Let Assumption 3 be satisfied. Assume also that W: G → [1,∞) is a measurable function which satisfies, with δ̃ ∈ [0,1) and L̃ ∈ (0,∞), the Lyapunov condition

P̃W(x) ≤ δ̃W(x) + L̃,   x ∈ G. (13)

Define ε_tv,W, ε_V,W as in Lemma 4 and γ ≔ L̃/(1−δ̃). Then, for any r ∈ (0,1], with

β_{n,r}(δ̃,α) ≔ n α^{(n−1)r} if α^r = δ̃,   and   β_{n,r}(δ̃,α) ≔ |α^{rn} − δ̃^n| / |α^r − δ̃| if α^r ≠ δ̃,

we have

‖p̃_n − p_n‖_tv ≤ C α^n ‖p̃_0 − p_0‖_V + ε_tv,W^{1−r} ε_V,W^r C^r (p̃_0(W) β_{n,r}(δ̃,α) + γ/(r(1−α))). (14)

Proof

Here we use Lemma 4 with possibly different W and V. By (13) we have p̃_i(W) ≤ δ̃^i p̃_0(W) + γ, and by

Σ_{i=0}^{n−1} δ̃^i α^{(n−i−1)r} = β_{n,r}(δ̃,α)

we obtain the assertion by a geometric series and the inequality 1 − α^r ≥ r(1−α). □

Remark 8

We consider an illustrating example where Theorem 7 leads to a considerably sharper bound than Proposition 6. This improvement is due to the combination of two novel properties of the bound of Theorem 7:

  • 1.

    In the Lyapunov condition (13) the function W can be chosen differently from V.

  • 2.

    Note that β_{n,r}(δ̃,α) is bounded from above by n·max{δ̃, α^r}^{n−1}. Thus β_{n,r}(δ̃,α) converges almost exponentially fast to zero in n. This implies that for n sufficiently large the dependence on p̃_0(W) vanishes. Nevertheless, the leading factor n can capture situations in which the perturbation error is increasing in n for small n.

Illustrating example. Let G = {0,1} and assume p_0 = p̃_0 = (0,1). Here state “1” can be interpreted as the “transitional” and state “0” as the “essential” part of the state space. Define

P = [1 0; 1 0]   and   P̃ = [1 0; 1/2 1/2],

where the rows contain the transition probabilities out of states “0” and “1”, respectively. Thus, the unperturbed Markov chain (X_n)_{n∈ℕ₀} moves from “1” to “0” right away, while the perturbed one (X̃_n)_{n∈ℕ₀} takes longer. Both transition matrices have the same stationary distribution π = (1,0). Obviously, ‖p_0 − p̃_0‖_tv = 0 and for n ∈ ℕ it holds that

‖p_n − p̃_n‖_tv = 2 P(X_n ≠ X̃_n) = 1/2^{n−1}.

The unperturbed Markov chain is uniformly ergodic, so that we can choose V = 1 and (5) is satisfied with C = 1 and α = 0. In particular, in this setting ε_tv and ε_V from Proposition 6 coincide; we have ε_tv = ε_V = 1. Thus, the estimate of Proposition 6 gives

‖p_n − p̃_n‖_tv ≤ ε_tv = 1.

This bound is optimal in the sense that it is best possible for n = 1. But for increasing n it gets worse. Notice also that a different choice of V cannot really remedy this situation: the chains differ most strongly at n = 1 and the bound of Proposition 6 is constant over time. Now choose the function W(x) = 1 + v·1{x=1} for some v ≥ 0. The transition matrix P̃ satisfies the Lyapunov condition

P̃W(x) ≤ (1/2) W(x) + 1/2,

i.e., δ̃ = L̃ = 1/2. Moreover, we have p̃_0(W) = 1 + v and ε_V,W = ε_tv,W = 1/(1+v). Thus, in the bound from Theorem 7 we can set r = 1 and γ = 1, such that

‖p_n − p̃_n‖_tv ≤ 1/(v+1) + 1/2^{n−1}.

Since v can be chosen arbitrarily large, it follows that

‖p_n − p̃_n‖_tv ≤ 1/2^{n−1},

which is best possible for all n ∈ ℕ.
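The claimed exact distance can be checked numerically (a sketch; the row-stochastic matrices below order the states as (0, 1)):

```python
import numpy as np

P      = np.array([[1.0, 0.0], [1.0, 0.0]])   # unperturbed: "1" jumps to "0" immediately
P_tild = np.array([[1.0, 0.0], [0.5, 0.5]])   # perturbed: leaves "1" only with prob. 1/2

p, p_tild = np.array([0.0, 1.0]), np.array([0.0, 1.0])  # both chains start in state "1"
for n in range(1, 11):
    p, p_tild = p @ P, p_tild @ P_tild
    tv = np.abs(p - p_tild).sum()   # total variation distance in this paper's scaling
    assert np.isclose(tv, 0.5 ** (n - 1))
```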

The previous example can be seen as a toy model of a situation where the transition probabilities of the perturbed and unperturbed Markov chains are very similar in the “essential” part of the state space, but differ considerably in the “tail”, seen as the “transitional” part. When both chains start at the same point in the “tail”, considerable differences between their distributions can build up along the initial transient and then vanish again. Earlier perturbation bounds, as for example in [18], [22], [26], take only an initial error and a remaining error into account. Thus, they are worse in situations where the transient error captured by β_{n,r} dominates. A very similar term also appears in the very recent error bounds due to [10]. In any case, the example also illustrates that a function W different from V is advantageous.

3.2. Restricted approximation

In the previous section we have seen that a Lyapunov condition for the perturbed chain helps to control the long-term stability of approximating a V-uniformly ergodic Markov chain. In this section we assume instead that the perturbed chain is restricted to a “large” subset of the state space. In this setting a sufficiently good approximation of the unperturbed Markov chain on this subset leads to a perturbation estimate.

For the unperturbed Markov chain we assume that its transition kernel P is V-uniformly ergodic. Then, for R ≥ 1 define the “large subset” of the state space as

B_R ≔ {x ∈ G : V(x) ≤ R}.

If V is chosen as a monotonic transformation of a norm on G, then B_R is simply a ball around 0. The restriction of P to the set B_R, denoted P_R, is defined as

P_R(x,A) ≔ P(x, A ∩ B_R) + 1_A(x) P(x, B_R^c),   A ∈ B(G), x ∈ G.

In other words, whenever P would make a transition from x ∈ B_R to G∖B_R, the kernel P_R remains at x; otherwise, P_R behaves as P. We obtain the following perturbation bound for approximations whose stability is guaranteed through a restriction to the set B_R.
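Given a routine that samples from P(x, ·) and the Lyapunov function V, one transition of the restricted kernel can be sketched as follows (the function names and the toy kernel are ours):

```python
import numpy as np

def restricted_step(x, sample_P, V, R, rng):
    """One transition of P_R: draw a move according to P(x, .) and stay
    at x whenever the proposed move would leave B_R = {V <= R}."""
    y = sample_P(x, rng)
    return y if V(y) <= R else x

# Toy usage: P is a Gaussian random walk on the real line and V(x) = exp(|x|),
# so B_R with R = exp(3) is the interval [-3, 3].
rng = np.random.default_rng(0)
V = lambda x: np.exp(abs(x))
x = 0.0
for _ in range(1000):
    x = restricted_step(x, lambda s, rng: s + rng.normal(), V, np.exp(3.0), rng)
assert abs(x) <= 3.0   # the restricted chain never leaves B_R
```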

Theorem 9

Under the V-uniform ergodicity of Assumption 3, let δ ∈ [0,1) and L ∈ [1,∞) be chosen in such a way that

PV(x) ≤ δV(x) + L,   x ∈ G.

For the perturbed transition kernel P̃ assume that it is restricted to B_R, i.e., P̃(x, B_R) = 1 for all x ∈ G, and that R·Δ(R) ≤ (1−δ)/2 with

Δ(R) ≔ sup_{x∈B_R} ‖P_R(x,·) − P̃(x,·)‖_tv / V(x).

Then, with p_0 = p̃_0 and

κ ≔ max{p̃_0(V), L/(1−δ)}

we have for R ≥ exp(1) that

‖p_n − p̃_n‖_tv ≤ 33 C (L+1) κ / (1−α) · (log R)/R. (15)

The proof of the result is stated in Appendix A.1. Notice that while the perturbed chain is restricted to the set B_R, we do not place a similar restriction on the unperturbed chain. The estimate (15) compares the restricted, perturbed chain to the unrestricted, unperturbed one.

Remark 10

In the special case where P̃(x,·) = P_R(x,·) for x ∈ B_R we have Δ(R) = 0. For example,

P̃(x,A) ≔ 1_{B_R}(x) P_R(x,A) + 1_{B_R^c}(x) δ_{x_0}(A),   A ∈ B(G),

with x_0 ∈ B_R satisfies this condition. The resulting perturbed Markov chain is simply a restriction of the unperturbed Markov chain to B_R and Theorem 9 provides a quantitative bound on the difference of the distributions.

3.3. Relationship to earlier perturbation bounds

In contrast to the V-uniform ergodicity assumption we impose on the ideal Markov chain, the results in [1], [12], [18] only cover perturbations of uniformly ergodic Markov chains. Nonetheless, perturbation theoretical questions for geometrically ergodic Markov chains have been studied before, see e.g. [5], [7], [14], [20], [24], [26], [28] and the references therein. A crucial aspect where those papers differ from each other is how one measures the closeness of the transitions of the unperturbed and perturbed Markov chains to have applicable estimates, see the discussion about this in [7], [26], [28]. Our Proposition 6 and Theorem 7 refine and extend the results of [26, Theorem 3.2]. In particular, in Theorem 7 we take a restriction to the center of the state space into account. Let us also mention here that [22], [26] contain related results under Wasserstein ergodicity assumptions. More recently, [11] studies approximate chains using notions of maximal couplings, [20] extends the uniformly ergodic setting from [12] to using L2 norms instead of total variation, and [10] explores bounds on the approximation error of time averages.

The usefulness of restricted approximations in the study of Markov chains has been observed before. For example in [27], in an infinite-dimensional setting, spectral gap properties of a Markov operator based on a restricted approximation are investigated. Also recently in [30] it is proposed to consider a subset of the state space termed “large set” in which a certain Lyapunov condition holds. This is in contrast to a Lyapunov function defined on the entire space, which might deteriorate as the dimension of the state space or the number of observations increases. This new Lyapunov condition from [30] is particularly useful for obtaining explicit bounds on the number of iterations to get close to the stationary distribution in high-dimensional settings.

4. Monte Carlo within Metropolis

In Bayesian statistics it is of interest to sample with respect to a distribution π on (G,B(G)). We assume that π admits a possibly unnormalized density πu:G[0,) with respect to a reference measure μ, for example the counting, Lebesgue or some Gaussian measure. The Metropolis–Hastings (MH) algorithm is often the method of choice to draw approximate samples according to π:

Algorithm 1

For a proposal transition kernel Q a transition from x to y of the MH algorithm works as follows.

  • 1.

    Draw U ∼ Unif[0,1] and a proposal Z ∼ Q(x,·) independently; call the results u and z, respectively.

  • 2.
    Compute the acceptance ratio
    r(x,z) ≔ [π(dz) Q(z,dx)] / [π(dx) Q(x,dz)] = [π_u(z)/π_u(x)] · [μ(dz) Q(z,dx)] / [μ(dx) Q(x,dz)], (16)
    which is the density of the measure π(dz)Q(z,dx) w.r.t. π(dx)Q(x,dz), see [29].
  • 3.

    If u < r(x,z), then accept the proposal and return y ≔ z; otherwise reject the proposal and return y ≔ x.

The transition kernel of the MH algorithm with proposal Q, stationary distribution π and acceptance probability

a(x,z) ≔ min{1, r(x,z)}

is given by

M_a(x,dz) ≔ a(x,z) Q(x,dz) + δ_x(dz) (1 − ∫_G a(x,y) Q(x,dy)). (17)
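For a symmetric random-walk proposal the ratios involving Q and μ in (16) cancel, and a transition of Algorithm 1 reduces to the following sketch (a minimal illustration with our own function names, not the paper's implementation):

```python
import numpy as np

def mh_step(x, log_pi_u, step_size, rng):
    """One Metropolis-Hastings transition with the symmetric Gaussian
    proposal Q(x, .) = N(x, step_size**2), for which the acceptance
    ratio (16) reduces to pi_u(z) / pi_u(x)."""
    z = x + step_size * rng.normal()
    log_r = log_pi_u(z) - log_pi_u(x)
    if log_r >= 0 or rng.uniform() < np.exp(log_r):
        return z      # accept the proposal
    return x          # reject and stay at x

# Toy usage: target the standard normal via its unnormalized log-density.
rng = np.random.default_rng(1)
x, chain = 0.0, []
for _ in range(20000):
    x = mh_step(x, lambda t: -0.5 * t * t, 1.0, rng)
    chain.append(x)
assert abs(np.mean(chain)) < 0.15   # sample mean close to the target mean 0
```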

In the computation of r(x,z) for the MH algorithm one uses the ratio π_u(z)/π_u(x), which might be known from having access to function evaluations of the unnormalized density π_u. However, when it is expensive or even impossible to compute function values of π_u, it may not be feasible to sample from π using the MH algorithm. Here are two typical examples of such scenarios:

  • Doubly-intractable distribution: For models such as Markov or Gibbs random fields, the unnormalized density π_u(x) itself is typically only known up to a factor Z(x), that is,
    π_u(x) = ρ(x)/Z(x),   x ∈ G, (18)
    where function values of ρ can be computed, but function values of Z cannot. For instance, Z might be given in the form

    Z(x) = ∫_Y f_x(y) r_x(dy),

    where Y denotes an auxiliary variable space, f_x: Y → [0,∞) is a measurable function and r_x is a probability distribution on Y.
  • Latent variables: Here π_u(x) cannot be evaluated, since it takes the form
    π_u(x) = ∫_Y f_x(y) r_x(dy), (19)
    with a probability distribution r_x on a measurable space Y of latent variables y and a non-negative function f_x: Y → [0,∞).

In the next sections, we study in both of these settings the perturbation error of an approximating MH algorithm. A fair assumption in both scenarios, which holds for a large family of target distributions using random-walk type proposals, see, e.g., [9], [16], [25], is that the infeasible, unperturbed MH algorithm is V-uniformly ergodic:

Assumption 11

For some function V: G → [1,∞) let the transition kernel M_a of the MH algorithm be V-uniformly ergodic, that is,

‖M_a^n(x,·) − π‖_V ≤ C V(x) α^n,   x ∈ G, n ∈ ℕ,

with C ∈ [1,∞) and α ∈ [0,1), and additionally assume that the Lyapunov condition

M_a V(x) ≤ δV(x) + L,   x ∈ G,

for some δ ∈ [0,1) and L ∈ [1,∞), is satisfied.

We have the following standard proposition (see e.g. [26, Lemma 4.1] or [1], [4], [10], [15], [22]), which leads to upper bounds on ε_tv, ε_V and Δ(R) (see Lemma 4 and Theorem 9) for two MH-type algorithms M_b and M_c with common proposal distribution but different acceptance probability functions b, c: G×G → [0,1], respectively.

Proposition 12

Let b, c: G×G → [0,1] and let V: G → [1,∞) be such that sup_{x∈G} M_b V(x)/V(x) ≤ T for a constant T ≥ 1. Assume that there are functions η, ξ: G → [0,∞) and a set B ⊆ G such that either

|b(x,y) − c(x,y)| ≤ 1_B(y) (η(x) + η(y)) b(x,y) ξ(x),   or   |b(x,y) − c(x,y)| ≤ 1_B(y) (η(x) + η(y)) b(x,y) ξ(y), (20)

for all x, y ∈ G. Then we have

sup_{x∈B} ‖M_b(x,·) − M_c(x,·)‖_V / V(x) ≤ 4T ‖η 1_B‖_∞ ‖ξ 1_B‖_∞,

and, with the definition of ‖·‖_W provided in (9), for any β ∈ (0,1),

sup_{x∈B} ‖M_b(x,·) − M_c(x,·)‖_tv / V(x) ≤ 4T ‖η 1_B‖_{V^β} ‖ξ 1_B‖_{V^{1−β}}.

The proposition provides a tool for controlling the distance between the transition kernels of two MH-type algorithms with identical proposal and different acceptance probabilities. The specific functional form of the dependence of the upper bound in (20) on x and y is motivated by the applications below. The set B indicates the “essential” part of G where the difference of the acceptance probabilities matters. The parameter β is used to shift weight between the two components ξ and η of the approximation error. For the proof of the proposition, we refer to Appendix A.2.

4.1. Doubly-intractable distributions

In the case where π_u takes the form (18), we can approximate Z(x) by the Monte Carlo estimate

Ẑ_N(x) ≔ (1/N) Σ_{i=1}^N f_x(Y_i(x))

under the assumption that we have access to an iid sequence of random variables (Y_i(x))_{1≤i≤N} where each Y_i(x) is distributed according to r_x. Then, the idea is to substitute the unknown quantity Z(x) by the approximation Ẑ_N(x) within the acceptance ratio. Defining W_N(x) ≔ Ẑ_N(x)/Z(x), the acceptance ratio can be written as

r̃(x,z,W_N(x),W_N(z)) ≔ [ρ(z)/ρ(x)] · [μ(dz) Q(z,dx)] / [μ(dx) Q(x,dz)] · Ẑ_N(x)/Ẑ_N(z) = r(x,z) · W_N(x)/W_N(z),

where the random variables W_N(x), W_N(z) are assumed to be independent of each other. Notice that the quantities W_N only appear in the theoretical analysis of the algorithm; for the implementation, it is sufficient to be able to compute r̃. This leads to a Monte Carlo within Metropolis (MCwM) algorithm:

Algorithm 2

For a given proposal transition kernel Q, a transition from x to y of the MCwM algorithm works as follows.

  • 1.

    Draw U ∼ Unif[0,1] and a proposal Z ∼ Q(x,·) independently; call the results u and z, respectively.

  • 2.

    Calculate r̃(x,z,W_N(x),W_N(z)) based on independent samples for W_N(x), W_N(z), which are also independent from previous iterations.

  • 3.

    If u < r̃(x,z,W_N(x),W_N(z)), then accept the proposal and return y ≔ z; otherwise reject the proposal and return y ≔ x.

Given the current state xG and a proposed state zG the overall acceptance probability is

a_N(x,z) ≔ E[min{1, r̃(x,z,W_N(x),W_N(z))}], (21)

which leads to the corresponding transition kernel M_{a_N}, see (17).
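A transition of Algorithm 2 with a symmetric random-walk proposal can be sketched as follows (our own names throughout; `sample_r` draws Y ∼ r_x, `f` is the integrand of Z, so `z_hat` is the Monte Carlo estimate Ẑ_N):

```python
import numpy as np

def z_hat(x, f, sample_r, N, rng):
    """Monte Carlo estimate of Z(x) = E[f(x, Y)] with Y ~ r_x."""
    return np.mean([f(x, sample_r(x, rng)) for _ in range(N)])

def mcwm_step(x, log_rho, f, sample_r, N, step_size, rng):
    """One MCwM transition: the exact ratio rho(z) / rho(x) is combined
    with the estimated factor Z_hat_N(x) / Z_hat_N(z); fresh auxiliary
    samples are drawn in every call."""
    z = x + step_size * rng.normal()
    log_rt = (log_rho(z) - log_rho(x)
              + np.log(z_hat(x, f, sample_r, N, rng))
              - np.log(z_hat(z, f, sample_r, N, rng)))
    return z if log_rt >= 0 or rng.uniform() < np.exp(log_rt) else x

# Toy usage: Z(x) = E[Y] = 1 for Y ~ Exp(1), so the chain approximately
# targets the density proportional to rho(x) = exp(-x**2 / 2).
rng = np.random.default_rng(2)
x, chain = 0.0, []
for _ in range(2000):
    x = mcwm_step(x, lambda t: -0.5 * t * t, lambda t, y: y,
                  lambda t, rng: rng.exponential(), 50, 1.0, rng)
    chain.append(x)
assert abs(np.mean(chain)) < 0.5
```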

Remark 13

Let us emphasize that the doubly-intractable case can also be approached algorithmically from various other perspectives. For instance, instead of estimating the normalizing constant Z(x) one could estimate (Z(x))^{−1} unbiasedly whenever exact simulation from the Markov or Gibbs random field is possible. In this case, π_u(x) turns into a Monte Carlo estimate which can formally be analyzed with exactly the same techniques as the latent variable scenario described below. Yet another algorithmic possibility is explored in the noisy exchange algorithm of [1], where ratios of the form Z(x)/Z(y) are approximated by a single Monte Carlo estimate. Their algorithm is motivated by the exchange algorithm [19] which, perhaps surprisingly, can avoid the need for evaluating the ratio Z(x)/Z(y) and targets the distribution π exactly, see e.g. [6], [21] for an overview of these and related methods. However, in some cases the exchange algorithm performs poorly, see [1]. Then approximate sampling methods for distributions of the form (2) might prove useful as long as the introduced bias is not too large. As a final remark in this direction, the recent work [2] considers a correction of the noisy exchange algorithm which produces a Markov chain with stationary distribution π.

The quality of the MCwM algorithm depends on the error of the approximation of Z(x). The root mean squared error of this approximation can be quantified through W_N, that is,

(E|W_N(x) − 1|²)^{1/2} = s(x)/√N,   x ∈ G, N ∈ ℕ, (22)

where

s(x) ≔ (E|W_1(x) − 1|²)^{1/2}

is determined by the second moment of W_1(x). In addition, due to the appearance of the estimator W_N(z) in the denominator of r̃, we need some control of its distribution near zero. To this end, we define, for z ∈ G and p > 0, the inverse moment function

i_{p,N}(z) ≔ (E[W_N(z)^{−p}])^{1/p}.

With this notation we obtain the following estimate, which is proved in Appendix A.2.

Lemma 14

Assume that there exists k ∈ ℕ such that i_{2,k}(x) and s(x) are finite for all x ∈ G. Then, for all x, z ∈ G and N ≥ k we have

|a(x,z) − a_N(x,z)| ≤ a(x,z) · (1/√N) · i_{2,k}(z) (s(x) + s(z)).

Remark 15

One can replace the boundedness of the second inverse moment i_{2,k}(x) for any x ∈ G by boundedness of a lower moment i_{p,m}(x) for p ∈ (0,2) with suitably adjusted m ∈ ℕ, see Lemma 23 in Appendix A.2.

4.1.1. Inheritance of the Lyapunov condition

If the second and the inverse second moments are uniformly bounded, that is, sup_{x∈G} s(x) < ∞ as well as sup_{x∈G} i_{2,k}(x) < ∞, one can show that the Lyapunov condition of the MH transition kernel is inherited by the MCwM algorithm. In the following corollary, we prove this inheritance and state the resulting error bound for MCwM.

Corollary 16

For a distribution m_0 on G let m_n ≔ m_0 M_a^n and m_{n,N} ≔ m_0 M_{a_N}^n be the respective distributions of the MH and MCwM algorithms after n steps. Let Assumption 11 be satisfied and for some k ∈ ℕ let

D ≔ 8L (sup_{x∈G} i_{2,k}(x)) (sup_{x∈G} s(x)) < ∞.

Further, define δ_N ≔ δ + D/√N and β_n ≔ n·max{δ_N, α}^{n−1}. Then, for any

N > max{k, D²/(1−δ)²}

we have δ_N ∈ [0,1) and

‖m_n − m_{n,N}‖_tv ≤ (DC/√N) (m_0(V) β_n + L/((1−δ_N)(1−α))).
Proof

Assumption 11 implies sup_{x∈G} M_a V(x)/V(x) ≤ 2L. By Lemma 14 and Proposition 12, with B = G, we obtain

ε_{V,V} = sup_{x∈G} ‖M_a(x,·) − M_{a_N}(x,·)‖_V / V(x) ≤ D/√N.

Further, note that

M_{a_N} V(x) − M_a V(x) ≤ ‖M_a(x,·) − M_{a_N}(x,·)‖_V ≤ (D/√N) V(x),

which implies, by Assumption 11, that for N > D²/(1−δ)² we have δ_N ∈ [0,1) and M_{a_N} V(x) ≤ δ_N V(x) + L. By Theorem 7 and Remark 8 we obtain the assertion for r = 1.  □

Observe that the estimate is bounded in n ∈ ℕ, so that the difference of the distributions converges to zero uniformly in n as N → ∞. The constant δ_N decreases with increasing N, so that larger values of N improve the bound.

Log-normal example I. Let G = ℝ and let the target measure π be the standard normal distribution. We choose a Gaussian proposal kernel Q(x,·) = N(x, γ²) for some γ² > 0, where N(μ, σ²) denotes the normal distribution with mean μ and variance σ². It is well known, see [9, Theorem 4.1, Theorem 4.3 and Theorem 4.6], that the MH transition kernel satisfies Assumption 11 for some numbers α, C, δ and L with V(x) = exp(x²/4).

Let $g(y; \mu, \sigma^2)$ be the density of the log-normal distribution with parameters $\mu$ and $\sigma^2$, i.e., $g$ is the density of $\exp(\mu + \sigma S)$ for a random variable $S \sim \mathcal{N}(0,1)$. Then, by the fact that $\int_0^\infty y\, g\big(y; -\frac{\sigma(x)^2}{2}, \sigma(x)^2\big)\,dy = 1$ for all functions $\sigma\colon G \to (0,\infty)$, we can write the (unnormalized) standard normal density as

$$\pi_u(x) = \exp\left(-\frac{x^2}{2}\right) = \frac{\exp(-x^2/2)}{\int_0^\infty y\, g\big(y; -\frac{\sigma(x)^2}{2}, \sigma(x)^2\big)\,dy}.$$

Hence $\pi_u$ takes the form (18) with $Y = [0,\infty)$, $\rho(x) = \exp(-x^2/2)$, the integrand $y \mapsto y$, and $r_x$ being a log-normal distribution with parameters $-\frac{\sigma(x)^2}{2}$ and $\sigma(x)^2$. Independent draws from this log-normal distribution are used in the MCwM algorithm to approximate the integral. We have $\mathbb{E}[W_1(x)^p] = \exp\big(\frac{p(p-1)\sigma(x)^2}{2}\big)$ for all $x \in G$ and $p \in \mathbb{R}$ and, accordingly,

$$s(x) = \big(\exp(\sigma(x)^2) - 1\big)^{1/2} \le \exp\left(\frac{\sigma(x)^2}{2}\right),$$
$$i_{p,1}(x) = \exp\left(\frac{(p+1)\,\sigma(x)^2}{2}\right).$$

By Lemma 23 we conclude that

$$i_{2,k}(x) \le i_{2/k,1}(x) = \exp\left(\left(\frac{1}{2} + \frac{1}{k}\right)\sigma(x)^2\right).$$

Hence, $\|s\|_\infty$ as well as $\|i_{2,k}\|_\infty$ are bounded if for some constant $c > 0$ we have $\sigma(x)^2 \le c$ for all $x \in G$. In that case Corollary 16 is applicable and provides estimates for the difference between the distributions of the MH and MCwM algorithms after $n$ steps. However, one might ask what happens if the function $\sigma(x)^2$ is not uniformly bounded, taking, for example, the form $\sigma(x)^2 = |x|^q$ for some $q > 0$. In Fig. 1 we illustrate the difference between the target density and a kernel density estimator based on a sample of the MCwM algorithm for $\sigma(x)^2 = |x|^{1.8}$. Even though $s(x)$ and $i_{p,1}(x)$ grow super-exponentially in $|x|$, the MCwM algorithm still works reasonably well in this case.
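As a concrete illustration, the following Python sketch runs the (unrestricted) MCwM algorithm for this example. It assumes the acceptance ratio $r(x,z)\,W_N(x)/W_N(z)$ of Lemma 14 with the Gaussian random walk proposal from above and fresh, independent weights in every iteration; the function names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def w_mean(x, N, rng, q=1.8):
    # W_N(x): average of N iid draws from log-normal(-sigma(x)^2/2, sigma(x)^2),
    # so that E[W_N(x)] = 1, with sigma(x)^2 = |x|^q
    s2 = abs(x) ** q
    return np.exp(rng.normal(-s2 / 2.0, np.sqrt(s2), size=N)).mean()

def mcwm(n_steps, N, gamma=1.0, x0=0.0, rng=rng):
    # MCwM chain targeting the standard normal with proposal N(x, gamma^2);
    # the intractable integral (identically 1 here) is replaced by the noisy
    # estimate W_N, entering the ratio as W_N(x) / W_N(z), cf. Lemma 14
    x = x0
    chain = np.empty(n_steps)
    for n in range(n_steps):
        z = x + gamma * rng.normal()
        ratio = np.exp((x * x - z * z) / 2.0) * w_mean(x, N, rng) / w_mean(z, N, rng)
        if rng.random() < min(1.0, ratio):
            x = z
        chain[n] = x
    return chain
```

For $q = 1.8$ and moderate $N$ the resulting histogram is close to the standard normal density, matching the behavior in Fig. 1.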

Fig. 1. Here $\sigma(x)^2 = |x|^{1.8}$ for $x \in \mathbb{R}$. The target density (standard normal) is plotted in gray, a kernel density estimator based on $10^5$ steps of the MCwM algorithm with $N = 10$ (left), $N = 10^2$ (middle) and $N = 10^3$ (right) is plotted in blue.

However, in Fig. 2 we consider the case where $\sigma(x)^2 = |x|^{2.2}$, and the behavior changes dramatically: here the MCwM algorithm does not seem to work at all. This motivates a modification of the MCwM algorithm that restricts the state space to the “essential part” determined by the Lyapunov condition.

Fig. 2. Here $\sigma(x)^2 = |x|^{2.2}$ for $x \in \mathbb{R}$. The target density (standard normal) is plotted in gray, a kernel density estimator based on $10^5$ steps of the MCwM algorithm with $N = 10$ (left), $N = 10^2$ (middle) and $N = 10^3$ (right) is plotted in blue.

4.1.2. Restricted MCwM approximation

With the notation and definitions from the previous section we consider the case where the functions $i_{2,k}(x)$ and $s(x)$ are not uniformly bounded. Under Assumption 11 there are two tools, used simultaneously, which help to control the difference between a transition of MH and one of MCwM:

  • 1. The Lyapunov condition leads to a weight function and eventually to a weighted norm, see Proposition 12.

  • 2. By restricting the MCwM to the “essential part” of the state space we prevent the approximating Markov chain from deteriorating. Namely, for some $R \ge 1$ we restrict the MCwM to $B_R$, see Section 3.2.

For $x,z \in G$ the acceptance ratio $\tilde r$ used in Algorithm 2 is now modified to

$$\mathbb{1}_{B_R}(z)\,\tilde r\big(x, z, W_N(x), W_N(z)\big),$$

which leads to the restricted MCwM algorithm:

Algorithm 3

For a given $R \ge 1$ and a proposal transition kernel $Q$, a transition from $x$ to $y$ of the restricted MCwM algorithm works as follows.

  • 1. Draw $U \sim \mathrm{Unif}[0,1]$ and a proposal $Z \sim Q(x,\cdot)$ independently; call the results $u$ and $z$, respectively.

  • 2. Calculate $\tilde r(x, z, W_N(x), W_N(z))$ based on independent samples for $W_N(x)$, $W_N(z)$, which are also independent from previous iterations.

  • 3. If $u < \mathbb{1}_{B_R}(z)\,\tilde r(x, z, W_N(x), W_N(z))$, then accept the proposal and return $y := z$; otherwise reject the proposal and return $y := x$.
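The transition just described can be sketched in Python. The names `restricted_mcwm_step`, `log_rho` (log of the tractable factor $\rho$), `w_est` (a fresh draw of $W_N$) and the Lyapunov function `V` are our own, hypothetical choices, and a Gaussian random walk proposal together with the ratio $r(x,z)\,W_N(x)/W_N(z)$ of Section 4.1 is assumed:

```python
import numpy as np

def restricted_mcwm_step(x, log_rho, w_est, V, R, gamma, N, rng):
    """One transition of Algorithm 3: a proposal z outside B_R = {V <= R}
    is rejected automatically, since 1_{B_R}(z) * r_tilde(x, z, ...) = 0."""
    z = x + gamma * rng.normal()             # proposal Z ~ N(x, gamma^2)
    if V(z) > R:                             # z lies outside the essential part
        return x
    # acceptance ratio r(x, z) * W_N(x) / W_N(z) with fresh, independent weights
    ratio = np.exp(log_rho(z) - log_rho(x)) * w_est(x, N, rng) / w_est(z, N, rng)
    return z if rng.random() < min(1.0, ratio) else x
```

With $V(x) = \exp(x^2/4)$ and $R = \exp(25)$ this restricts the chain to $B_R = [-10, 10]$, the setting of Fig. 3, Fig. 4 below.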

Given the current state $x \in G$ and a proposed state $z \in G$, the overall acceptance probability is

$$a_N^{(R)}(x,z) := \mathbb{E}\min\left\{1,\ \mathbb{1}_{B_R}(z)\,\tilde r\big(x,z,W_N(x),W_N(z)\big)\right\} = \mathbb{1}_{B_R}(z)\,a_N(x,z),$$

which leads to the corresponding transition kernel of the form $M_{a_N^{(R)}}$, see (17). By using Theorem 9 and Proposition 12 we obtain the following estimate.

Corollary 17

Let Assumption 11 be satisfied, i.e., $M_a$ is $V$-uniformly ergodic and the function $V$ as well as the constants $\alpha, C, \delta$ and $L$ are determined. For $\beta \in (0,1)$, $R \ge 1$ and some $k \in \mathbb{N}$ let

$$B_R := \{x \in G : V(x) \le R\},$$
$$D_R := 12L\,\|i_{2,k}\mathbb{1}_{B_R}\|_{V^{1-\beta}}\,\|s\,\mathbb{1}_{B_R}\|_{V^\beta} < \infty.$$

Let $m_0$ be a distribution on $B_R$ and $\kappa := \max\big\{m_0(V), \frac{L}{1-\delta}\big\}$. Then, for

$$N \ge \max\left\{k,\ 4\left(\frac{R\,D_R}{1-\delta}\right)^2\right\} \quad (23)$$

and $R \ge \exp(1)$ we have

$$\|m_n - m_{n,N}^{(R)}\|_{\mathrm{tv}} \le \frac{33\,C(L+1)\,\kappa}{1-\alpha}\cdot\frac{\log R}{R},$$

where $m_{n,N}^{(R)} := m_0 M_{a_N^{(R)}}^n$ and $m_n := m_0 M_a^n$ are the distributions of the restricted MCwM and MH algorithms after $n$ steps.

Proof

We apply Theorem 9 with $P(x,\cdot) = M_a(x,\cdot)$ and

$$\tilde P(x,\cdot) = \mathbb{1}_{B_R}(x)\,M_{a_N^{(R)}}(x,\cdot) + \mathbb{1}_{B_R^c}(x)\,\delta_{x_0}(\cdot), \qquad x \in G,$$

for some $x_0 \in B_R$. Note that $\tilde P(x, B_R) = 1$ for any $x \in G$. Further, $\tilde P$ and $M_{a_N^{(R)}}$ coincide on $B_R$; thus we also have $\tilde P^n = M_{a_N^{(R)}}^n$ on $B_R$ for $n \in \mathbb{N}$. Observe also that the restriction of $P$ to $B_R$, denoted by $P_R$, satisfies $P_R = M_{a^{(R)}}$ with $a^{(R)}(x,z) := \mathbb{1}_{B_R}(z)\,a(x,z)$. Hence

$$\Delta(R) = \sup_{x \in B_R}\frac{\|M_{a^{(R)}}(x,\cdot) - M_{a_N^{(R)}}(x,\cdot)\|_{\mathrm{tv}}}{V(x)}.$$

Moreover, we have by Lemma 14 that

$$|a^{(R)}(x,z) - a_N^{(R)}(x,z)| = \mathbb{1}_{B_R}(z)\,|a(x,z) - a_N(x,z)| \le \mathbb{1}_{B_R}(z)\,a(x,z)\,\frac{1}{\sqrt{N}}\,i_{2,k}(z)\,\big(s(x)+s(z)\big) = a^{(R)}(x,z)\,\frac{1}{\sqrt{N}}\,i_{2,k}(z)\,\big(s(x)+s(z)\big).$$

With Proposition 12 and

$$\sup_{x\in G}\frac{M_{a^{(R)}}V(x)}{V(x)} \le \sup_{x\in G}\frac{M_a V(x)}{V(x)} + 1 \overset{\text{Assumption 11}}{\le} 3L,$$

we have that $\Delta(R) \le \frac{D_R}{\sqrt{N}}$. Then, by $N \ge 4\big(\frac{R\,D_R}{1-\delta}\big)^2$ we obtain

$$R\,\Delta(R) \le \frac{1-\delta}{2},$$

such that all conditions of Theorem 9 are verified and the stated estimate follows. □

Remark 18

The estimate depends crucially on the sample size $N$ as well as on the parameter $R$. If the influence of $R$ on $D_R$ is explicitly known, then one can choose $R$ depending on $N$ in such a way that the conditions of the corollary are satisfied, and one eventually obtains an upper bound on the total variation distance between the distributions depending only on $N$ and not on $R$ anymore. For example, if we additionally assume that the function $g\colon (0,\infty) \to (0,\infty)$ given by $g(R) = R\,D_R$ is invertible, then for $N \ge k$ and the choice $R := g^{-1}\big(\frac{(1-\delta)\sqrt{N}}{2}\big)$ we have

$$\|m_n - m_{n,N}^{(R)}\|_{\mathrm{tv}} \le \frac{33\,C(L+1)\,\kappa}{1-\alpha}\cdot\frac{\log g^{-1}\big(\frac{(1-\delta)\sqrt{N}}{2}\big)}{g^{-1}\big(\frac{(1-\delta)\sqrt{N}}{2}\big)}.$$

Thus, whether and how fast $g^{-1}\big(\frac{(1-\delta)\sqrt{N}}{2}\big) \to \infty$ as $N \to \infty$ determines the convergence of this upper bound on $\|m_n - m_{n,N}^{(R)}\|_{\mathrm{tv}}$ to zero.

Log-normal example II. We continue with the log-normal example. In this setting we have

$$B_R = \{x \in \mathbb{R} : |x| \le 2\sqrt{\log R}\},$$
$$\|i_{2,k}\mathbb{1}_{B_R}\|_{V^{1-\beta}} \le \sup_{|x| \le 2\sqrt{\log R}} \exp\left(\left(\frac{1}{2} + \frac{1}{k}\right)\sigma(x)^2 - \frac{(1-\beta)\,x^2}{4}\right),$$
$$\|s\,\mathbb{1}_{B_R}\|_{V^\beta} \le \sup_{|x| \le 2\sqrt{\log R}} \exp\left(\frac{\sigma(x)^2}{2} - \frac{\beta\,x^2}{4}\right).$$

Thus, $D_R$ is uniformly bounded in $R$ for $\sigma(x)^2 = |x|^q$ with $q < 2$ and not uniformly bounded for $q > 2$. As in the numerical experiments of Fig. 1, Fig. 2, let us consider the cases $\sigma(x)^2 = |x|^{1.8}$ and $\sigma(x)^2 = |x|^{2.2}$. In Fig. 3 we compare the normal target density with a kernel density estimator based on the restricted MCwM on $B_R = [-10,10]$ and observe essentially the same reasonable behavior as in Fig. 1. In Fig. 4 we consider the same scenario for $\sigma(x)^2 = |x|^{2.2}$ and observe that the restriction indeed stabilizes the algorithm. In contrast to Fig. 2, convergence to the true target distribution is visible but, in line with the theory, slower than for $\sigma(x)^2 = |x|^{1.8}$.

Fig. 3. Here $\sigma(x)^2 = |x|^{1.8}$ for $x \in \mathbb{R}$ and $B_R = [-10, 10]$. The target density (standard normal) is plotted in gray, a kernel density estimator based on $10^5$ steps of the restricted MCwM algorithm with $N = 10$ (left), $N = 10^2$ (middle) and $N = 10^3$ (right) is plotted in blue.

Fig. 4. Here $\sigma(x)^2 = |x|^{2.2}$ for $x \in \mathbb{R}$ and $B_R = [-10, 10]$. The target density (standard normal) is plotted in gray, a kernel density estimator based on $10^5$ steps of the restricted MCwM algorithm with $N = 10$ (left), $N = 10^2$ (middle) and $N = 10^3$ (right) is plotted in blue.

Now we apply Corollary 17 in both cases; by similar arguments one can also treat $\sigma(x)^2 = |x|^q$ with, respectively, $q < 2$ or $q > 2$.

1. Case $\sigma(x)^2 = |x|^{1.8}$: For $k = 100$ and $\beta = \frac{1}{2}$ one can easily see that $\|i_{2,100}\mathbb{1}_{B_R}\|_{V^{1/2}}$ and $\|s\,\mathbb{1}_{B_R}\|_{V^{1/2}}$ are bounded by 6000, independently of $R$. Hence there is a constant $D \ge 1$ such that $D_R \le D$. With this knowledge we choose $R = \frac{(1-\delta)\sqrt{N}}{2D}$, so that for $N \ge \max\{100,\ 4\exp(2)\,D^2(1-\delta)^{-2}\}$ condition (23) and $R \ge \exp(1)$ are satisfied. Then Corollary 17 gives the existence of a constant $\tilde C > 0$ such that

$$\|m_n - m_{n,N}^{(R)}\|_{\mathrm{tv}} \le \tilde C\,\frac{\log N}{\sqrt{N}}$$

for any initial distribution $m_0$ on $B_R$.

2. Case $\sigma(x)^2 = |x|^{2.2}$: For $k = 100$ and $\beta = \frac{1}{2}$ we obtain

$$\|i_{2,100}\mathbb{1}_{B_R}\|_{V^{1/2}} \le \exp\big(2.5\,(\log R)^{11/10}\big), \qquad \|s\,\mathbb{1}_{B_R}\|_{V^{1/2}} \le \exp\big(2.5\,(\log R)^{11/10}\big).$$

Hence $D_R \le 12L\exp\big(5\,(\log R)^{11/10}\big)$. Eventually, for

$$N \ge \max\left\{100,\ 24^2\exp\big(2\cdot 6^{11/10}\big)\,L^2\,(1-\delta)^{-2}\right\}$$

we have with $R = \exp\left(\frac{1}{6}\left(\log\frac{\sqrt{N}(1-\delta)}{24L}\right)^{10/11}\right)$ that $R \ge \exp(1)$ and (23) is satisfied. Then, with $\tilde C_1 := \frac{33\,C(L+1)\,\kappa}{1-\alpha}$, $\tilde C_2 := \big(\frac{1-\delta}{24L}\big)^2$ and Corollary 17 we have

$$\|m_n - m_{n,N}^{(R)}\|_{\mathrm{tv}} \le \tilde C_1\,\frac{\big(\log(\tilde C_2 N)\big)^{10/11}}{6\cdot 2^{10/11}}\,\exp\left(-\frac{\big(\log(\tilde C_2 N)\big)^{10/11}}{6\cdot 2^{10/11}}\right) \le \tilde C_1\,(k+1)!\left(\frac{6\cdot 2^{10/11}}{\big(\log(\tilde C_2 N)\big)^{10/11}}\right)^{k},$$

for any initial distribution $m_0$ on $B_R$ and all $k \in \mathbb{N}$. Here the last inequality follows by the fact that $\exp(x) \ge \frac{x^{k+1}}{(k+1)!}$ for any $x \ge 0$ and $k \in \mathbb{N}$.

To summarize, by choosing $N$, and $R$ possibly depending on $N$, sufficiently large, the difference between the distributions of the restricted MCwM and MH algorithms after $n$ steps can be made arbitrarily small.

4.2. Latent variables

In this section we consider $\pi_u$ of the form (19). Here, as for doubly intractable distributions, the idea is to substitute $\pi_u(x)$ in the acceptance probability of the MH algorithm by a Monte Carlo estimate

[display defining the Monte Carlo estimate $\hat\rho_N(x)$; not recoverable from the extraction]

where we assume that we have access to an iid sequence of random variables $(Y_i(x))_{1\le i\le N}$ where each $Y_i(x)$ has distribution $r_x$. Define a function $W_N\colon G \to \mathbb{R}$ by $W_N(x) := \frac{\hat\rho_N(x)}{\pi_u(x)}$ and note that $\mathbb{E}[W_N(x)] = 1$. Then, the acceptance probability given $W_N(x)$, $W_N(z)$ modifies to

$$a_N(x,z) := \mathbb{E}\min\left\{1,\ r(x,z)\,\frac{W_N(z)}{W_N(x)}\right\},$$

where $W_N(x)$, $W_N(z)$ are assumed to be independent random variables. Note that all the objects which depend on $a_N$, such as $M_{a_N}$, $a_N^{(R)}$, $M_{a_N^{(R)}}$, that appear in this section are defined just as in Section 4.1. The only difference is that the order of the variables $W_N(x)$ and $W_N(z)$ in the ratio $\tilde r$ at (21) has been reversed. Thus, this leads to an MCwM algorithm as stated in Algorithm 2, where the transition kernel is given by $M_{a_N}$.

Also, as in Section 4.1, we define $s(x) := \mathbb{E}\big[|W_1(x) - 1|^2\big]^{1/2}$ and $i_{p,N}(x) := \big(\mathbb{E}[W_N(x)^{-p}]\big)^{1/p}$ for all $x \in G$ and $p > 0$. With those quantities we obtain the following estimate of the difference of the acceptance probabilities of $M_a$ and $M_{a_N}$, proved in Appendix A.2.

Lemma 19

Assume that there exists $k \in \mathbb{N}$ such that $i_{2,k}(x)$ and $s(x)$ are finite for all $x \in G$. Then, for all $x,z \in G$ and $N \ge k$ we have

$$|a(x,z) - a_N(x,z)| \le a(x,z)\,\frac{1}{\sqrt{N}}\,i_{2,k}(x)\,\big(s(x) + s(z)\big). \quad (24)$$

If $\|s\|_\infty$ and $\|i_{2,k}\|_\infty$ are finite for some $k \in \mathbb{N}$, then the same statement as formulated in Corollary 16 holds, with the proof working exactly as stated there. Examples which satisfy this condition are presented, for instance, in [15]. However, there are cases where the functions $s$ and $i_{2,k}$ are unbounded. In this setting, as in Section 4.1.2, we consider the restricted MCwM algorithm with transition kernel $M_{a_N^{(R)}}$. Here again the same statement and proof as formulated in Corollary 17 hold. We next provide an application of this corollary in the latent variable setting.

Normal–normal model. Let $G = \mathbb{R}$ and let $\varphi_{\mu,\sigma^2}$ be the density of $\mathcal{N}(\mu, \sigma^2)$. For some $z \in \mathbb{R}$ and (precision) parameters $\gamma_Z, \gamma_Y > 0$ define

$$\pi_u(x) := \int_{\mathbb{R}} \varphi_{z,\gamma_Z^{-1}}(y)\,\varphi_{0,\gamma_Y^{-1}}(x - y)\,dy,$$

that is, $Y = \mathbb{R}$, the integrand is $y \mapsto \varphi_{z,\gamma_Z^{-1}}(y)$, and $r_x = \mathcal{N}(x, \gamma_Y^{-1})$. By the convolution of two normals the target distribution $\pi$ satisfies

$$\pi_u(x) = \varphi_{z,\gamma_{Z,Y}^{-1}}(x), \qquad \text{with } \gamma_{Z,Y}^{-1} := \gamma_Z^{-1} + \gamma_Y^{-1}. \quad (25)$$

Note that, for real-valued random variables $Y, Z$, the probability measure $\pi$ is the posterior distribution given an observation $Z = z$ within the model

$$Z \mid Y = y \sim \mathcal{N}(y, \gamma_Z^{-1}), \qquad Y \mid x \sim \mathcal{N}(x, \gamma_Y^{-1}),$$

with the improper Lebesgue prior imposed on $x$.

Pretending that we do not know $\pi_u(x)$, we compute

$$\hat\rho_N(x) = \frac{1}{N}\sum_{i=1}^N \varphi_{z,\gamma_Z^{-1}}\big(Y_i(x)\big),$$

where $(Y_i(x))_{1\le i\le N}$ is a sequence of iid random variables with $Y_1(x) \sim \mathcal{N}(x, \gamma_Y^{-1})$. Hence

$$W_N(x) = \frac{\frac{1}{N}\sum_{i=1}^N \varphi_{z,\gamma_Z^{-1}}(Y_i(x))}{\varphi_{z,\gamma_{Z,Y}^{-1}}(x)} = \frac{1}{N}\left(\frac{\gamma_Z}{\gamma_{Z,Y}}\right)^{1/2}\sum_{i=1}^N \frac{\varphi_{0,1}\big(\sqrt{\gamma_Z}\,(z - Y_i(x))\big)}{\varphi_{0,1}\big(\sqrt{\gamma_{Z,Y}}\,(z - x)\big)}.$$
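Both the convolution identity (25) and the normalization $\mathbb{E}[W_N(x)] = 1$ can be checked numerically; the parameter values in the following Python sketch are purely illustrative:

```python
import numpy as np

def phi(y, mu, var):
    # density of N(mu, var) evaluated at y
    return np.exp(-(y - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

gz, gy, z, x = 1.0, 3.0, 0.5, 1.2     # illustrative precisions and points
gzy = 1.0 / (1.0 / gz + 1.0 / gy)     # gamma_{Z,Y} from (25)

# (25): the convolution of the two normal densities equals phi(x; z, 1/gzy)
ys = np.linspace(-25.0, 25.0, 200001)
conv = np.sum(phi(ys, z, 1.0 / gz) * phi(x - ys, 0.0, 1.0 / gy)) * (ys[1] - ys[0])
assert abs(conv - phi(x, z, 1.0 / gzy)) < 1e-8

# E[W_N(x)] = 1: rho_hat_N(x) is an unbiased estimator of pi_u(x)
rng = np.random.default_rng(1)
Y = rng.normal(x, np.sqrt(1.0 / gy), size=10**6)
w = phi(Y, z, 1.0 / gz).mean() / phi(x, z, 1.0 / gzy)
assert abs(w - 1.0) < 0.01
```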

By using a random variable $\xi \sim \mathcal{N}(0,1)$ we have, for $p > -\frac{\gamma_Y}{\gamma_Z}$, that

$$\mathbb{E}\big[W_1(x)^p\big] = \left(\frac{\gamma_Z}{\gamma_{Z,Y}}\right)^{p/2} \mathbb{E}\exp\left(\frac{p}{2}\,\gamma_{Z,Y}\,(z-x)^2 - \frac{p}{2}\,\frac{\gamma_Z}{\gamma_Y}\big(\gamma_Y^{1/2}(z-x) - \xi\big)^2\right) \propto \exp\left(\frac{\gamma_Z\,\gamma_{Z,Y}\,p\,(p-1)}{2(\gamma_Y + p\,\gamma_Z)}\,(z-x)^2\right). \quad (26)$$

Here $\propto$ means equal up to a constant independent of $x$. As a consequence, $\|s\|_\infty = \infty$ and therefore Corollary 16 (which is also true in the latent variable setting) cannot be applied. Nevertheless, we can obtain bounds for the restricted MCwM in this example using the statement of Corollary 17 by controlling $s$ and $i_{2,k}$ through a Lyapunov function $V$. The following result, proved in Appendix A.2, verifies the necessary moment conditions under some additional restrictions on the model parameters.

Proposition 20

Assume that $\gamma_Y > 2\gamma_Z$, that the unnormalized density $\pi_u$ is given as in (25), and let the proposal transition kernel $Q$ be a Gaussian random walk, that is, $Q(x,\cdot) = \mathcal{N}(x, \sigma^2)$ for some $\sigma > 0$. Then there is a Lyapunov function $V\colon G \to [1,\infty)$ for $M_a$ such that $M_a$ is $V$-uniformly ergodic, i.e., Assumption 11 is satisfied, and there are $\beta \in (0,1)$ as well as $k \in \mathbb{N}$ such that

$$\|i_{2,k}\|_{V^{1-\beta}} < \infty \qquad \text{and} \qquad \|s\|_{V^\beta} < \infty.$$

The previous proposition implies that there is a constant $D < \infty$ such that $D_R$ from Corollary 17 is bounded by $D$ independently of $R$. Hence there are numbers $\tilde C_1, \tilde C_2 > 0$ such that, with $R = \tilde C_1\sqrt{N}$ and for $N$ sufficiently large, we have

$$\|m_n - m_{n,N}^{(R)}\|_{\mathrm{tv}} \le \tilde C_2\,\frac{\log N}{\sqrt{N}}$$

for any initial distribution $m_0$ on $B_R$.
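A small simulation of the restricted MCwM algorithm for this model recovers the posterior $\mathcal{N}(z, \gamma_{Z,Y}^{-1})$. This is a hedged sketch: the parameter values and function names are ours, with $\gamma_Y = 3 > 2 = 2\gamma_Z$ so that Proposition 20 applies, and the acceptance ratio $r(x,z)\,W_N(z)/W_N(x)$ of this section collapses to $\hat\rho_N(z)/\hat\rho_N(x)$:

```python
import numpy as np

rng = np.random.default_rng(5)
gz, gy, z_obs = 1.0, 3.0, 0.5          # gamma_Y > 2 gamma_Z, as in Proposition 20
gzy = 1.0 / (1.0 / gz + 1.0 / gy)      # target is N(z_obs, 1/gzy), cf. (25)

def phi(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def rho_hat(x, N, rng):
    # unbiased Monte Carlo estimate of pi_u(x) via latent draws Y_i ~ N(x, 1/gy)
    Y = rng.normal(x, np.sqrt(1.0 / gy), size=N)
    return phi(Y, z_obs, 1.0 / gz).mean()

def restricted_mcwm(n_steps, N, half_width=10.0, gamma=2.0, x0=0.5, rng=rng):
    # restricted MCwM: proposals outside the essential part (an interval
    # around z_obs playing the role of B_R) are rejected automatically
    x = x0
    chain = np.empty(n_steps)
    for n in range(n_steps):
        z = x + gamma * rng.normal()
        if abs(z - z_obs) <= half_width:
            ratio = rho_hat(z, N, rng) / rho_hat(x, N, rng)
            if rng.random() < min(1.0, ratio):
                x = z
        chain[n] = x
    return chain
```

The sample mean of a moderately long run should then be close to the posterior mean $z$.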

Acknowledgments

Daniel Rudolf gratefully acknowledges support of the Felix-Bernstein-Institute for Mathematical Statistics in the Biosciences (Volkswagen Foundation), the Campus laboratory AIMS and the DFG within the project 389483880. Felipe Medina-Aguayo was supported by BBSRC grant BB/N00874X/1 and thanks Richard Everitt for useful discussions.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Contributor Information

Felipe Medina-Aguayo, Email: f.j.medinaaguayo@reading.ac.uk.

Daniel Rudolf, Email: daniel.rudolf@uni-goettingen.de.

Nikolaus Schweizer, Email: n.f.f.schweizer@uvt.nl.

Appendix. Technical proofs

A.1. Proofs of Section 3

Before we come to the proofs of Section 3, let us recall a relation between geometric ergodicity and an ergodicity coefficient. Let $V\colon G \to [1,\infty]$ be a measurable, $\pi$-a.e. finite function; then define the ergodicity coefficient $\tau_V(P)$ as

$$\tau_V(P) := \sup_{x,y \in G}\frac{\|P(x,\cdot) - P(y,\cdot)\|_V}{V(x) + V(y)}.$$

The next lemma provides a relation between the ergodicity coefficient and V-uniform ergodicity.

Lemma 21

If (7) is satisfied, then $\tau_V(P^n) \le C\alpha^n$.

A proof of this fact is implicitly contained in [13] and can also be found in [26, Lemma 3.2]. Both references crucially use an observation of Hairer and Mattingly [8].

To summarize: if the transition kernel $P$ is geometrically ergodic, then by Theorem 1 there exist a function $V\colon G \to [1,\infty)$, $\alpha \in [0,1)$ and $C \in (0,\infty)$ such that, by Lemma 21, $\tau_V(P^n) \le C\alpha^n$. The next proposition states two further useful properties (submultiplicativity and contractivity) of the ergodicity coefficient. For a proof of the corresponding inequalities see for example [13, Proposition 2.1].

Proposition 22

Assume $P, Q$ are transition kernels and $\mu, \nu$ are probability measures on $G$. Then

$$\tau_V(PQ) \le \tau_V(P)\,\tau_V(Q) \quad \text{(submultiplicativity)},$$
$$\|(\mu - \nu)P\|_V \le \tau_V(P)\,\|\mu - \nu\|_V \quad \text{(contractivity)}.$$
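For a finite state space, transition kernels are stochastic matrices and $\tau_1$ (i.e. $V \equiv 1$) can be computed directly. The following Python sketch, using the total-mass convention for $\|\cdot\|_{\mathrm{tv}}$, checks both inequalities on random kernels:

```python
import numpy as np

def tau1(P):
    # tau_V(P) for V = 1: sup_{x,y} ||P(x,.) - P(y,.)||_tv / (V(x) + V(y)),
    # where ||.||_tv is the total-mass norm (sum of absolute values)
    n = P.shape[0]
    return max(np.abs(P[i] - P[j]).sum() / 2.0
               for i in range(n) for j in range(n))

rng = np.random.default_rng(2)
P = rng.random((4, 4)); P /= P.sum(axis=1, keepdims=True)
Q = rng.random((4, 4)); Q /= Q.sum(axis=1, keepdims=True)

# submultiplicativity: tau(PQ) <= tau(P) * tau(Q)
assert tau1(P @ Q) <= tau1(P) * tau1(Q) + 1e-12

# contractivity: ||(mu - nu) P||_tv <= tau(P) * ||mu - nu||_tv
mu = rng.random(4); mu /= mu.sum()
nu = rng.random(4); nu /= nu.sum()
assert np.abs((mu - nu) @ P).sum() <= tau1(P) * np.abs(mu - nu).sum() + 1e-12
```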

Now we prove Lemma 4.

Proof of Lemma 4

As in the proof of [18, Theorem 3.1] we use

$$\tilde p_n - p_n = (\tilde p_0 - p_0)P^n + \sum_{i=0}^{n-1}\tilde p_i(\tilde P - P)P^{n-i-1},$$

which can be shown by induction over $n \in \mathbb{N}$. Then

$$\|\tilde p_n - p_n\|_{\mathrm{tv}} \le \|(\tilde p_0 - p_0)P^n\|_{\mathrm{tv}} + \sum_{i=0}^{n-1}\|\tilde p_i(\tilde P - P)P^{n-i-1}\|_{\mathrm{tv}}. \quad (A.1)$$

With Proposition 22 and Lemma 21 we estimate the first term of the previous inequality by

$$\|(\tilde p_0 - p_0)P^n\|_{\mathrm{tv}} \le \|(\tilde p_0 - p_0)P^n\|_V \le \tau_V(P^n)\,\|\tilde p_0 - p_0\|_V \le C\alpha^n\,\|\tilde p_0 - p_0\|_V.$$

For the terms which appear in the sum of (A.1) we can use two types of estimates. Note that $\tau_1(P) \le 1$ (here the subscript indicates that $V = 1$), which leads by Proposition 22 to

$$\|\tilde p_i(\tilde P - P)P^{n-i-1}\|_{\mathrm{tv}} \le \|\tilde p_i(\tilde P - P)\|_{\mathrm{tv}}\,\tau_1(P^{n-i-1}) \le \|\tilde p_i(\tilde P - P)\|_{\mathrm{tv}} = \sup_{|f|\le 1}\left|\int_G f(x)\,\big[\tilde p_i(\tilde P - P)\big](dx)\right| = \sup_{|f|\le 1}\left|\int_G (\tilde P - P)f(x)\,\tilde p_i(dx)\right| \le \int_G \|\tilde P(x,\cdot) - P(x,\cdot)\|_{\mathrm{tv}}\,\tilde p_i(dx) \le \varepsilon_{\mathrm{tv},W}\,\tilde p_i(W).$$

On the other hand,

$$\|\tilde p_i(\tilde P - P)P^{n-i-1}\|_{\mathrm{tv}} \le \|\tilde p_i(\tilde P - P)P^{n-i-1}\|_V \le \|\tilde p_i(\tilde P - P)\|_V\,\tau_V(P^{n-i-1}) \le C\alpha^{n-i-1}\,\|\tilde p_i(\tilde P - P)\|_V \le C\alpha^{n-i-1}\int_G \|\tilde P(x,\cdot) - P(x,\cdot)\|_V\,\tilde p_i(dx) \le C\alpha^{n-i-1}\,\varepsilon_{V,W}\,\tilde p_i(W).$$

Thus, for any $r \in (0,1]$ we obtain

$$\|\tilde p_i(\tilde P - P)P^{n-i-1}\|_{\mathrm{tv}} = \|\tilde p_i(\tilde P - P)P^{n-i-1}\|_{\mathrm{tv}}^{1-r}\,\|\tilde p_i(\tilde P - P)P^{n-i-1}\|_{\mathrm{tv}}^{r} \le \varepsilon_{\mathrm{tv},W}^{1-r}\,\varepsilon_{V,W}^{r}\,C^{r}\,\tilde p_i(W)\,\alpha^{(n-i-1)r},$$

which gives by (A.1) the final estimate. □
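The telescoping identity underlying the proof can be verified directly for stochastic matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 6
P  = rng.random((d, d)); P  /= P.sum(axis=1, keepdims=True)   # kernel P
Pt = rng.random((d, d)); Pt /= Pt.sum(axis=1, keepdims=True)  # perturbed kernel
p0  = rng.random(d); p0  /= p0.sum()    # initial distribution of the P-chain
pt0 = rng.random(d); pt0 /= pt0.sum()   # initial distribution of the Pt-chain

mp = np.linalg.matrix_power
lhs = pt0 @ mp(Pt, n) - p0 @ mp(P, n)   # tilde p_n - p_n
rhs = (pt0 - p0) @ mp(P, n)
for i in range(n):
    # accumulate the terms tilde p_i (Pt - P) P^{n-i-1}
    rhs = rhs + (pt0 @ mp(Pt, i)) @ (Pt - P) @ mp(P, n - i - 1)

assert np.allclose(lhs, rhs)
```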

Next we prove Theorem 9.

Proof of Theorem 9

Locally, for $x \in B_R$, we have $P_R V(x) \le PV(x) \le \delta V(x) + L$ and, eventually,

$$\tilde P V(x) \le P_R V(x) + |\tilde P V(x) - P_R V(x)| \le \delta V(x) + R\,\|\tilde P(x,\cdot) - P_R(x,\cdot)\|_{\mathrm{tv}} + L \le \big(\delta + R\,\Delta(R)\big)V(x) + L. \quad (A.2)$$

We write $B_R^c$ for $G \setminus B_R$ and obtain for $x \in B_R^c$ that

$$\tilde P V(x) = \int_{B_R} V(y)\,\tilde P(x, dy) \le V(x). \quad (A.3)$$

Denote $\tilde\delta := \delta + R\,\Delta(R) \le \frac{1}{2} + \frac{\delta}{2} < 1$. For $i \ge 2$ we obtain by (A.2), (A.3) and $(1 - \tilde\delta^i) \le 2(1 - \tilde\delta^{i-1})$ that

$$\tilde p_i(V) \le \tilde\delta^i \int_{B_R} V(x)\,p_0(dx) + \frac{(1 - \tilde\delta^i)L}{1 - \tilde\delta} + \tilde\delta^{i-1}\int_{B_R^c}\tilde P V(x)\,p_0(dx) + \frac{(1 - \tilde\delta^{i-1})L}{1 - \tilde\delta} \le \tilde\delta^{i-1}\,p_0(V) + \frac{(1 - \tilde\delta^{i-1})\,3L}{1 - \tilde\delta} \le 6\kappa.$$

Furthermore, $p_0(V) \le \kappa$ and $\tilde p_1(V) \le 2\kappa$. Now it is easily seen that

$$\sum_{i=0}^{n-1}\tilde p_i(V)\,\alpha^{(n-i-1)r} \le \frac{6\kappa}{r(1-\alpha)}.$$

For $\varepsilon_{\mathrm{tv},V}$ we have

$$\varepsilon_{\mathrm{tv},V} \le \max\left\{\sup_{x \in B_R}\frac{\|P(x,\cdot) - \tilde P(x,\cdot)\|_{\mathrm{tv}}}{V(x)},\ \sup_{x \in B_R^c}\frac{\|P(x,\cdot) - \tilde P(x,\cdot)\|_{\mathrm{tv}}}{V(x)}\right\}.$$

The second term in the maximum is bounded by $\frac{2}{R}$. For $x \in B_R$ we have

$$\|P(x,\cdot) - \tilde P(x,\cdot)\|_{\mathrm{tv}} \le \|P(x,\cdot) - P_R(x,\cdot)\|_{\mathrm{tv}} + \|P_R(x,\cdot) - \tilde P(x,\cdot)\|_{\mathrm{tv}} \le 2P(x, B_R^c) + \|P_R(x,\cdot) - \tilde P(x,\cdot)\|_{\mathrm{tv}},$$

so that the first term in the maximum satisfies

$$\sup_{x \in B_R}\frac{\|P(x,\cdot) - \tilde P(x,\cdot)\|_{\mathrm{tv}}}{V(x)} \le \Delta(R) + 2\sup_{x \in B_R}\frac{P(x, B_R^c)}{V(x)}.$$

Consider a random variable $X_1^x$ with distribution $P(x,\cdot)$, $x \in B_R$. Applying Markov's inequality to the random variable $V(X_1^x)$ leads to

$$PV(x) = \mathbb{E}[V(X_1^x)] \ge R\,\mathbb{P}\big(V(X_1^x) > R\big) = R\,P(x, B_R^c),$$

and thus

$$\sup_{x \in B_R}\frac{P(x, B_R^c)}{V(x)} \le \sup_{x \in B_R}\frac{PV(x)}{R\,V(x)} \le \frac{\delta + L}{R}.$$

Finally, $R\,\Delta(R) < 1 - \delta$ and $L \ge 1$ imply $\varepsilon_{\mathrm{tv},V} \le \frac{2(L+1)}{R}$.

We obtain $\varepsilon_{V,V} \le 2(L+1)$ by the use of

$$\|P(x,\cdot) - \tilde P(x,\cdot)\|_V \le PV(x) + \tilde P V(x),$$

the fact that $\sup_{x\in G}\frac{PV(x)}{V(x)} \le \delta + L$ and

$$\sup_{x\in G}\frac{\tilde P V(x)}{V(x)} \le \max\left\{\sup_{x \in B_R}\frac{\tilde P V(x)}{V(x)},\ \sup_{x \in B_R^c}\frac{\tilde P V(x)}{V(x)}\right\} \overset{\text{(A.2), (A.3)}}{\le} \max\{\tilde\delta + L,\ 1\} \le L + 1.$$

Then, by Lemma 4, for $r \in (0,1]$,

$$\|p_n - \tilde p_n\|_{\mathrm{tv}} \le \frac{12\,C^r(L+1)\,\kappa}{r\,R^{1-r}\,(1-\alpha)} \le \frac{12\,C(L+1)\,\kappa}{r\,R^{1-r}\,(1-\alpha)}.$$

By minimizing over $r$ (the minimum is attained at $r = \frac{1}{\log R} \in (0,1]$) we obtain for $R \ge \exp(1)$ that

$$\|p_n - \tilde p_n\|_{\mathrm{tv}} \le \frac{12\,C(L+1)\,\kappa}{1-\alpha}\,R^{\frac{1}{\log R}}\,\frac{\log R}{R}.$$

Finally, by the fact that $R^{\frac{1}{\log R}} = \exp(1) < \frac{33}{12}$, the assertion follows. □

A.2. Proofs of Section 4

We start with the proof of Proposition 12.

Proof of Proposition 12

For any f:GR we have

Mbf(x)Mcf(x)=Gf(y)(b(x,y)c(x,y))Q(x,dy)+f(x)G(c(x,y)b(x,y))Q(x,dy).

In the first case of (20), we have for all xB that

Mb(x,)Mc(x,)tv2G|b(x,y)c(x,y)|Q(x,dy)2Bb(x,y)ξ(x)(η(x)+η(y))Q(x,dy)2ξ(x)(η(x)+Mb(η1B)(x))2ξ(x)(η(x)+MbVβ(x)η1B,Vβ)4Tξ1B,V1βη1B,VβV(x),

where we used that supxGMbV(x)V(x)T implies supxGMbV(x)βV(x)βTβ by Jensen’s inequality. Moreover, for any xB we obtain

Mb(x,)Mc(x,)Vsup|f|V|Gf(y)(b(x,y)c(x,y))Q(x,dy)+f(x)G(c(x,y)b(x,y))Q(x,dy)|GV(y)|b(x,y)c(x,y)|Q(x,dy)+V(x)G|b(x,y)c(x,y)|Q(x,dy)BV(y)b(x,y)ξ(x)(η(x)+η(y))Q(x,dy)+V(x)Bb(x,y)ξ(x)(η(x)+η(y))Q(x,dy)2η1Bξ1B(MbV(x)+V(x)),

which implies the assertion in that case. In the second case of (20), we have similarly for any xB that

Mb(x,)Mc(x,)tv2η(x)Mb(ξ1B)+2Mb(ξη1B)2η(x)ξ1B,V1βMb(V1β)(x)+2ξη1B,VMbV(x)4Tη1B,Vβξ1B,V1βV(x)

and

Mb(x,)Mc(x,)VBV(y)b(x,y)ξ(y)(η(x)+η(y))Q(x,dy)+V(x)Bb(x,y)ξ(y)(η(x)+η(y))Q(x,dy)2ξ1Bη1B(MbV(x)+V(x)),

which finishes the proof. □

Before we come to further proofs of Section 4, we provide some properties of inverse moments of averages of non-negative real-valued iid random variables $(S_i)_{i\in\mathbb{N}}$. In this setting, the $p$th inverse moment, for $p > 0$, is defined by

$$j_{p,r} := \mathbb{E}\left[\left(\frac{1}{r}\sum_{i=1}^r S_i\right)^{-p}\right]^{1/p}.$$

Lemma 23

Assume that $j_{p,r} < \infty$ for some $r \in \mathbb{N}$ and $p > 0$. Then:

  • (i) $j_{p,s} \le j_{p,r}$ for $s \in \mathbb{N}$ with $s \ge r$;

  • (ii) $j_{q,r} \le j_{p,r}$ for $0 < q < p$;

  • (iii) $j_{kp,kr} \le j_{p,r}$ for any $k \in \mathbb{N}$.

Proof

Properties (i), (ii) follow as in [14, Lemma 3.5]. For proving (iii) we have to show that

$$\mathbb{E}\left[\left(\frac{1}{kr}\sum_{i=1}^{kr}S_i\right)^{-pk}\right] \le \mathbb{E}\left[\left(\frac{1}{r}\sum_{i=1}^r S_i\right)^{-p}\right]^k.$$

To this end, observe first that we can write

$$\frac{1}{kr}\sum_{i=1}^{kr}S_i = \frac{1}{k}\sum_{i=1}^k V_i,$$

where the “batch means” $V_1,\dots,V_k$ are non-negative, real-valued iid random variables which have the same distribution as $\frac{1}{r}\sum_{i=1}^r S_i$. With $Z_i := V_i^{-1}$ we obtain

$$\mathbb{E}\left[\left(\frac{1}{kr}\sum_{i=1}^{kr}S_i\right)^{-pk}\right] = \mathbb{E}\left[\left(\frac{1}{\frac{1}{k}\sum_{i=1}^k \frac{1}{Z_i}}\right)^{pk}\right],$$

which is a moment of the harmonic mean of $Z_1,\dots,Z_k$. Using the inequality between the geometric and harmonic means as well as the independence we find that

$$\mathbb{E}\left[\left(\frac{1}{\frac{1}{k}\sum_{i=1}^k \frac{1}{Z_i}}\right)^{pk}\right] \le \mathbb{E}\left[\prod_{i=1}^k Z_i^p\right] = \mathbb{E}\big[Z_1^p\big]^k = \mathbb{E}\left[\left(\frac{1}{r}\sum_{i=1}^r S_i\right)^{-p}\right]^k. \qquad □$$

The previous lemma shows that when inverse moments of some positive order are finite, then so are inverse moments of all higher and lower orders, provided the sample size is adjusted accordingly.
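Property (iii) can also be checked by simulation. In the following Python sketch the $S_i$ are log-normal with mean one, for which $j_{p,1} = \exp\big(\frac{(p+1)\sigma^2}{2}\big)$ is available in closed form (cf. the log-normal example of Section 4.1); the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
s2, k, M = 1.0, 20, 200000

# S_i iid log-normal(-s2/2, s2), so E[S_i] = 1 and, in closed form,
# j_{p,1} = E[S^{-p}]^{1/p} = exp((p + 1) * s2 / 2)
S = np.exp(rng.normal(-s2 / 2.0, np.sqrt(s2), size=(M, k)))

j_2_k = np.mean(S.mean(axis=1) ** (-2.0)) ** 0.5   # j_{2,k} by Monte Carlo
j_2k_1 = np.exp((2.0 / k + 1.0) * s2 / 2.0)        # j_{2/k,1} in closed form

# Lemma 23(iii) with p = 2/k and r = 1 gives j_{2,k} <= j_{2/k,1}
assert j_2_k <= j_2k_1 * 1.01
```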

Proof of Lemma 14

It is easily seen that

$$a(x,z)\,\mathbb{E}\min\left\{1, \frac{W_N(x)}{W_N(z)}\right\} \le a_N(x,z)$$

for any $x,z \in G$. By virtue of Jensen's inequality and $\mathbb{E}[W_N(z)] = 1$ we have $\mathbb{E}[W_N(z)^{-1}] \ge 1$ as well as

$$a_N(x,z) \le \min\left\{1,\ r(x,z)\,\mathbb{E}\left[\frac{W_N(x)}{W_N(z)}\right]\right\} \le a(x,z)\,\mathbb{E}\left[\frac{W_N(x)}{W_N(z)}\right],$$

where we also used the independence of $W_N(x)$ and $W_N(z)$ in the last inequality. (The previous arguments are similar to those in [14, Lemma 3.3 and the proof of Lemma 3.2].) Note that $i_{2,N}(x) \le i_{2,k}(x)$ for $N \ge k$ by Lemma 23. Hence, one can conclude that

$$|a(x,z) - a_N(x,z)| \le \begin{cases} a(x,z)\,\mathbb{E}\max\left\{0,\ 1 - \frac{W_N(x)}{W_N(z)}\right\}, & a(x,z) \ge a_N(x,z),\\ a(x,z)\,\mathbb{E}\left[\frac{W_N(x)}{W_N(z)} - 1\right], & a(x,z) < a_N(x,z), \end{cases}$$

so that in both cases

$$|a(x,z) - a_N(x,z)| \le a(x,z)\,\mathbb{E}\left|1 - \frac{W_N(x)}{W_N(z)}\right| \le a(x,z)\,i_{2,N}(z)\,\mathbb{E}\big[|W_N(x) - W_N(z)|^2\big]^{1/2} \le a(x,z)\,i_{2,N}(z)\left(\mathbb{E}\big[|W_N(x) - 1|^2\big]^{1/2} + \mathbb{E}\big[|W_N(z) - 1|^2\big]^{1/2}\right) \le a(x,z)\,\frac{i_{2,k}(z)}{\sqrt{N}}\,\big(s(x) + s(z)\big). \qquad □$$

Proof of Lemma 19

As in the previous proof, or from [14, Lemma 3.3 and the proof of Lemma 3.2], an immediate consequence is

$$a(x,z)\,\mathbb{E}\min\left\{1, \frac{W_N(z)}{W_N(x)}\right\} \le a_N(x,z) \le a(x,z)\,\mathbb{E}\left[\frac{W_N(z)}{W_N(x)}\right].$$

Note that $i_{2,N} \le i_{2,k}$ for $N \ge k$, see Lemma 23. The rest of the proof follows as in the previous one, only with the ratio $\frac{W_N(x)}{W_N(z)}$ reversed. □

Proof of Proposition 20

For random-walk-based Metropolis chains (in particular for $Q$ as assumed in the statement), by [9, Theorem 4.1 and the first sentence after the proof of the theorem, as well as Theorems 4.3 and 4.6] we have that $M_a$ is $V_t$-uniformly ergodic with

$$V_t(x) := \pi_u(x)^{-t} \propto \exp\left(\frac{t\,\gamma_{Z,Y}}{2}\,(z-x)^2\right),$$

for any $t \in (0,1)$. Hence, Assumption 11 is satisfied and we need to find $t \in (0,1)$ as well as $\beta \in (0,1)$ such that $\|i_{2,k}\|_{V_t^{1-\beta}} < \infty$ and $\|s\|_{V_t^\beta} < \infty$ for some $k \in \mathbb{N}$. For showing $\|s\|_{V_t^\beta} < \infty$ we use (26) to see that

$$s(x) \le \tilde C\exp\left(\frac{\gamma_Z}{\gamma_Y + 2\gamma_Z}\cdot\frac{\gamma_{Z,Y}}{2}\,(z-x)^2\right)$$

for some $\tilde C < \infty$. Hence

$$\frac{s(x)}{V_t(x)^\beta} \le \tilde C\exp\left(\left(\frac{\gamma_Z}{\gamma_Y + 2\gamma_Z} - t\beta\right)\frac{\gamma_{Z,Y}}{2}\,(z-x)^2\right),$$

and choosing $\beta \in (0,1)$ such that

$$t\beta = \frac{\gamma_Z}{\gamma_Y + 2\gamma_Z} \quad (A.4)$$

leads to $\|s\|_{V_t^\beta} < \infty$. In order to show $\|i_{2,k}\|_{V_t^{1-\beta}} < \infty$, we first use Lemma 23(iii) and obtain for any $x \in G$ and any $k \in \mathbb{N}$

$$i_{2,k}(x) = \mathbb{E}\big[W_k(x)^{-2}\big]^{1/2} \le \mathbb{E}\big[W_1(x)^{-2/k}\big]^{k/2}.$$

Then, for $k > \frac{2\gamma_Z}{\gamma_Y}$, by (26) we have

$$\mathbb{E}\big[W_1(x)^{-2/k}\big]^{k/2} \propto \exp\left(\frac{\gamma_Z\big(1 + \frac{2}{k}\big)}{\gamma_Y - \frac{2}{k}\gamma_Z}\cdot\frac{\gamma_{Z,Y}}{2}\,(z-x)^2\right).$$

Therefore, there is a constant $\tilde C < \infty$ such that

$$\frac{i_{2,k}(x)}{V_t(x)^{1-\beta}} \le \tilde C\exp\left(\left(\frac{\gamma_Z\big(1 + \frac{2}{k}\big)}{\gamma_Y - \frac{2}{k}\gamma_Z} - t(1-\beta)\right)\frac{\gamma_{Z,Y}}{2}\,(z-x)^2\right).$$

We have $\|i_{2,k}\|_{V_t^{1-\beta}} < \infty$ if $\frac{\gamma_Z(1 + \frac{2}{k})}{\gamma_Y - \frac{2}{k}\gamma_Z} \le t(1-\beta)$. The latter condition holds whenever

$$k \ge \frac{2\gamma_Z\big(1 + t(1-\beta)\big)}{\gamma_Y\,t(1-\beta) - \gamma_Z},$$

provided that $t(1-\beta) > \frac{\gamma_Z}{\gamma_Y}$. This implies, by (A.4), that $t$ should be chosen such that

$$t > \frac{\gamma_Z}{\gamma_Y} + \frac{\gamma_Z}{\gamma_Y + 2\gamma_Z}. \quad (A.5)$$

Choosing $t$ such that it satisfies (A.5) is feasible whenever the right-hand side of (A.5) is smaller than 1. This is the case if $\gamma_Y > 2\gamma_Z$. □

References

  • 1.Alquier P., Friel N., Everitt R., Boland A. Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Stat. Comput. 2016;26(1):29–47. [Google Scholar]
  • 2.C. Andrieu, A. Doucet, S. Yıldırım, N. Chopin, On the utility of Metropolis-Hastings with asymmetric acceptance ratio, ArXiv preprint arXiv:1803.09527.
  • 3.Andrieu C., Roberts G. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. 2009;37(2):697–725. [Google Scholar]
  • 4.Bardenet R., Doucet A., Holmes C. Proceedings of the 31st International Conference on Machine Learning. 2014. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach; pp. 405–413. [Google Scholar]
  • 5.Breyer L., Roberts G., Rosenthal J. A note on geometric ergodicity and floating-point roundoff error. Statist. Probab. Lett. 2001;53(2):123–127. [Google Scholar]
  • 6.Everitt R.G., Johansen A.M., Rowing E., Evdemon-Hogan M. Bayesian Model comparison with un-normalised likelihoods. Stat. Comput. 2017;27(2):403–422. [Google Scholar]
  • 7.Ferré D., Hervé L., Ledoux J. Regular perturbation of V-geometrically ergodic Markov chains. J. Appl. Probab. 2013;50(1):184–194. [Google Scholar]
  • 8.Hairer M., Mattingly J.C. Seminar on Stochastic Analysis, Random Fields and Applications, Vol. VI. Springer; 2011. Yet another look at Harris’ ergodic theorem for Markov chains; pp. 109–117. [Google Scholar]
  • 9.Jarner S., Hansen E. Geometric ergodicity of Metropolis algorithms. Stochastic Process. Appl. 2000;85(2):341–361. [Google Scholar]
  • 10.J.E. Johndrow, J.C. Mattingly, Error bounds for Approximations of Markov chains used in Bayesian Sampling, ArXiv preprint arXiv:1711.05382.
  • 11.J.E. Johndrow, J.C. Mattingly, Coupling and Decoupling to bound an approximating Markov Chain, ArXiv preprint arXiv:1706.02040.
  • 12.J.E. Johndrow, J.C. Mattingly, S. Mukherjee, D. Dunson, Optimal approximating Markov chains for Bayesian inference, ArXiv preprint arXiv:1508.03387.
  • 13.Mao Y., Zhang M., Zhang Y. A generalization of Dobrushin coefficient. Chin. J. Appl. Probab. Statist. 2013;29(5):489–494. [Google Scholar]
  • 14.Medina-Aguayo F.J., Lee A., Roberts G. Stability of noisy Metropolis–Hastings. Stat. Comp. 2016;26(6):1187–1211. doi: 10.1007/s11222-015-9604-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Medina-Aguayo F.J., Lee A., Roberts G.O. Erratum to: Stability of noisy Metropolis–Hastings. Stat. Comp. 2018;28(1):239. doi: 10.1007/s11222-015-9604-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Mengersen K., Tweedie R. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 1996;24(1):101–121. [Google Scholar]
  • 17.Meyn S., Tweedie R. second ed. Cambridge University Press; 2009. Markov Chains and Stochastic Stability. [Google Scholar]
  • 18.Mitrophanov A. Sensitivity and convergence of uniformly ergodic Markov chains. J. Appl. Probab. 2005;42(4):1003–1014. [Google Scholar]
  • 19.Murray I., Ghahramani Z., MacKay D. Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence UAI06. 2006. MCMC For doubly-intractable distributions. [Google Scholar]
  • 20.J. Negrea, J.S. Rosenthal, Error Bounds for Approximations of Geometrically Ergodic Markov Chains, ArXiv preprint arXiv:1702.07441.
  • 21.Park J., Haran M. Bayesian Inference in the presence of intractable normalizing functions. J. Amer. Statist. Assoc. 2018;113(523):1372–1390. [Google Scholar]
  • 22.N. Pillai, A. Smith, Ergodicity of Approximate MCMC Chains with Applications to Large Data Sets, ArXiv preprint arXiv:1405.0182.
  • 23.Roberts G., Rosenthal J. Geometric ergodicity and hybrid Markov chains. Electron. Commun. Probab. 1997;2:13–25. [Google Scholar]
  • 24.Roberts G., Rosenthal J., Schwartz P. Convergence properties of perturbed Markov chains. J. Appl. Probab. 1998;35(1):1–11. [Google Scholar]
  • 25.Roberts G., Tweedie R. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika. 1996;83(1):95–110. [Google Scholar]
  • 26.Rudolf D., Schweizer N. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli. 2018;24(4A):2610–2639. [Google Scholar]
  • 27.Rudolf D., Sprungk B. On a generalization of the preconditioned Crank-Nicolson Metropolis algorithm. Found. Comput. Math. 2018;18:309–343. [Google Scholar]
  • 28.Shardlow T., Stuart A. A perturbation theory for ergodic Markov chains and application to numerical approximations. SIAM J. Numer. Anal. 2000;37:1120–1137. [Google Scholar]
  • 29.Tierney L. A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab. 1998;8:1–9. [Google Scholar]
  • 30.J. Yang, J.S. Rosenthal, Complexity Results for MCMC derived from Quantitative Bounds, ArXiv preprint arXiv:1708.00829.
