. 2017 Nov 2;169(6):1098–1131. doi: 10.1007/s10955-017-1906-8

Using Perturbed Underdamped Langevin Dynamics to Efficiently Sample from Probability Distributions

A B Duncan 1, N Nüsken 2, G A Pavliotis 2
PMCID: PMC6959385  PMID: 32009676

Abstract

In this paper we introduce and analyse Langevin samplers that consist of perturbations of the standard underdamped Langevin dynamics. The perturbed dynamics is such that its invariant measure is the same as that of the unperturbed dynamics. We show that appropriate choices of the perturbations can lead to samplers that have improved properties, at least in terms of reducing the asymptotic variance. We present a detailed analysis of the new Langevin sampler for Gaussian target distributions. Our theoretical results are supported by numerical experiments with non-Gaussian target measures.

Introduction and Motivation

Sampling from probability measures in high-dimensional spaces is a problem that appears frequently in applications, e.g. in computational statistical mechanics and in Bayesian statistics. In particular, we are faced with the problem of computing expectations with respect to a probability measure $\pi$ on $\mathbb{R}^d$, i.e. we wish to evaluate integrals of the form:

$$\pi(f) := \int_{\mathbb{R}^d} f(x)\,\pi(dx). \tag{1}$$

As is typical in many applications, particularly in molecular dynamics and Bayesian inference, the density (for convenience denoted by the same symbol π) is known only up to a normalization constant; furthermore, the dimension of the underlying space is quite often large enough to render deterministic quadrature schemes computationally infeasible.

A standard approach to approximating such integrals is provided by Markov chain Monte Carlo (MCMC) techniques [19, 32, 52], where a Markov process $(X_t)_{t\ge 0}$ is constructed which is ergodic with respect to the probability measure $\pi$. Then, defining the long-time average

$$\pi_T(f) := \frac{1}{T}\int_0^T f(X_s)\,ds \tag{2}$$

for $f\in L^1(\pi)$, the ergodic theorem guarantees almost sure convergence of the long-time average $\pi_T(f)$ to $\pi(f)$ as $T\to\infty$.

There are infinitely many Markov processes, and in particular (for the purposes of this paper) diffusion processes, that can be constructed in such a way that they are ergodic with respect to the target distribution. A natural question is then how to choose the ergodic diffusion process $(X_t)_{t\ge 0}$. The choice should be dictated by the requirement that the computational cost of (approximately) calculating (1) is minimized. A standard example is given by the overdamped Langevin dynamics, defined to be the unique (strong) solution $(X_t)_{t\ge 0}$ of the following stochastic differential equation (SDE):

$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dW_t, \tag{3}$$

where $V=-\log\pi$ is the potential associated with the smooth positive density $\pi$. Under appropriate assumptions on $V$, i.e. on the measure $\pi(dx)$, the process $(X_t)_{t\ge 0}$ is ergodic and in fact reversible with respect to the target distribution.
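To make the estimator (2) concrete for the dynamics (3), the following is a minimal sketch using a plain Euler–Maruyama discretisation. The function name `overdamped_langevin_average`, the step size and the Gaussian test case are illustrative assumptions, not part of the paper; the discretisation introduces an $O(\Delta t)$ bias that is ignored here.

```python
import numpy as np

def overdamped_langevin_average(grad_V, f, x0, dt, n_steps, rng):
    """Euler-Maruyama discretisation of dX = -grad V(X) dt + sqrt(2) dW.

    x0 has shape (n_chains, d). Returns the time average of f along the
    trajectories, further averaged over the independent chains.
    """
    x = np.array(x0, dtype=float)
    running_sum = 0.0
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - grad_V(x) * dt + np.sqrt(2.0 * dt) * noise
        running_sum += np.mean(f(x))
    return running_sum / n_steps

# Hypothetical test case: standard Gaussian target, V(x) = |x|^2 / 2,
# observable f(x) = x_1^2, so that pi(f) = 1.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((100, 1))  # chains started at equilibrium
estimate = overdamped_langevin_average(
    grad_V=lambda x: x,
    f=lambda x: x[:, 0] ** 2,
    x0=x0, dt=0.01, n_steps=5000, rng=rng,
)
print(estimate)  # close to 1, up to discretisation bias and Monte Carlo error
```

Averaging over independent chains is only a variance-reduction convenience for this sketch; the ergodic theorem applies to each single trajectory.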

Another well-known example is the underdamped Langevin dynamics given by $(X_t)_{t\ge 0}=(q_t,p_t)_{t\ge 0}$, defined on the extended space (phase space) $\mathbb{R}^d\times\mathbb{R}^d$ by the following pair of coupled SDEs:

$$dq_t = M^{-1}p_t\,dt,\qquad dp_t = -\nabla V(q_t)\,dt - \Gamma M^{-1}p_t\,dt + \sqrt{2\Gamma}\,dW_t, \tag{4}$$

where the mass and friction tensors $M$ and $\Gamma$ are assumed to be symmetric positive definite matrices. It is well-known [36, 46] that $(q_t,p_t)_{t\ge 0}$ is ergodic with respect to the measure $\hat\pi := \pi\otimes\mathcal{N}(0,M)$, having density with respect to the Lebesgue measure on $\mathbb{R}^{2d}$ given by

$$\hat\pi(q,p) = \frac{1}{\hat Z}\exp\left(-V(q) - \frac{1}{2}\,p\cdot M^{-1}p\right), \tag{5}$$

where $\hat Z$ is a normalization constant. Note that the marginal of $\hat\pi$ obtained by integrating out $p$ is $\pi$, and thus for functions $f\in L^1(\pi)$ we have that $\frac{1}{t}\int_0^t f(q_s)\,ds\to\pi(f)$ almost surely as $t\to\infty$. Notice also that the dynamics restricted to the $q$-variables is no longer Markovian. The $p$-variables can thus be interpreted as giving some instantaneous memory to the system, facilitating efficient exploration of the state space. Higher order Markovian models, based on a finite dimensional (Markovian) approximation of the generalized Langevin equation, can also be used [12].

As there is a lot of freedom in choosing the dynamics in (2), see the discussion in Sect. 2, it is desirable to choose the diffusion process $(X_t)_{t\ge 0}$ in such a way that $\pi_T(f)$ provides a good estimate of $\pi(f)$. The performance of the estimator (2) can be quantified in various ways. The ultimate goal, of course, is to choose the dynamics as well as the numerical discretization in such a way that the computational cost of the long-time-average estimator is minimized for a given tolerance. The minimization of the computational cost consists of three steps: bias correction, variance reduction and choice of an appropriate discretization scheme. For the latter step see Sect. 5 and [14, Sect. 6].

Under appropriate conditions on the potential $V$ it can be shown that both (3) and (4) converge to equilibrium exponentially fast, e.g. in relative entropy. One performance objective would then be to choose the process $(X_t)_{t\ge 0}$ so that this rate of convergence is maximised. Conditions on the potential $V$ which guarantee exponential convergence to equilibrium, both in $L^2(\pi)$ and in relative entropy, can be found in [7, 39, 54]. In the case when the target measure $\pi$ is Gaussian, both the overdamped (3) and the underdamped (4) dynamics become generalized Ornstein–Uhlenbeck processes. For such processes the entire spectrum of the generator (or, equivalently, of the Fokker–Planck operator) can be computed analytically and, in particular, an explicit formula for the $L^2$-spectral gap can be obtained [38, 43, 44]. A detailed analysis of the convergence to equilibrium in relative entropy for stochastic differential equations with linear drift, i.e. generalized Ornstein–Uhlenbeck processes, has been carried out in [1, 2].

In addition to speeding up convergence to equilibrium, i.e. reducing the bias of the estimator (2), one is naturally also interested in reducing the asymptotic variance. Under appropriate conditions on the target measure $\pi$ and the observable $f$, the estimator $\pi_T(f)$ satisfies a central limit theorem (CLT) [31], that is,

$$\sqrt{T}\left(\pi_T(f)-\pi(f)\right)\xrightarrow[T\to\infty]{d}\mathcal{N}\left(0,2\sigma_f^2\right),$$

where $\sigma_f^2<\infty$ is the asymptotic variance of the estimator $\pi_T(f)$. The asymptotic variance characterises the magnitude of fluctuations of $\pi_T(f)$ around $\pi(f)$. Consequently, another natural objective is to choose the process $(X_t)_{t\ge 0}$ such that $\sigma_f^2$ is as small as possible. It is well known that the asymptotic variance can be expressed in terms of the solution to an appropriate Poisson equation for the generator $\mathcal{L}$ of the dynamics [31]:

$$-\mathcal{L}\phi = f-\pi(f),\qquad \sigma_f^2=\int_{\mathbb{R}^d}\phi\,(-\mathcal{L}\phi)\,\pi(dx). \tag{6}$$

Techniques from the theory of partial differential equations can then be used in order to study the problem of minimizing the asymptotic variance. This is the approach that was taken in [14], see also [23], and it will also be used in this paper.
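As a simple sanity check of (6) (a worked example added for illustration, not taken from the cited references), consider the one-dimensional overdamped dynamics (3) with $V(x)=x^2/2$, so that $\pi=\mathcal{N}(0,1)$ and $\mathcal{L}=-x\,\partial_x+\partial_x^2$, together with the observable $f(x)=x$:

```latex
\begin{aligned}
-\mathcal{L}\phi &= x\,\phi'(x)-\phi''(x) = x
  \quad\Longrightarrow\quad \phi(x)=x, \\
\sigma_f^2 &= \int_{\mathbb{R}} \phi\,(-\mathcal{L}\phi)\,\pi(dx)
  = \int_{\mathbb{R}} x^2\,\pi(dx) = 1 .
\end{aligned}
```

This agrees with the direct computation from the stationary autocorrelation $\mathbb{E}[X_0X_t]=e^{-|t|}$ of this Ornstein–Uhlenbeck process: the limiting variance in the CLT is $2\int_0^\infty e^{-t}\,dt=2=2\sigma_f^2$.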

Other measures of performance have also been considered. For example, in [50, 51], performance of the estimator is quantified in terms of the rate functional of the ensemble measure $\frac{1}{t}\int_0^t\delta_{X_s}(dx)\,ds$. See also [28] for a study of the nonasymptotic behaviour of MCMC techniques, including the case of overdamped Langevin dynamics.

Similar analyses have been carried out for various modifications of (3). Of particular interest to us are the Riemannian manifold MCMC [18] (see the discussion in Sect. 2) and the nonreversible Langevin samplers [20, 21]. As a particular example of the general framework that was introduced in [18], we mention the preconditioned overdamped Langevin dynamics $dX_t=-P\nabla V(X_t)\,dt+\sqrt{2P}\,dW_t$ that was presented in [4]. There, the long-time behaviour of the dynamics as well as the asymptotic variance of the corresponding estimator $\pi_T(f)$ are studied and applied to equilibrium sampling in molecular dynamics. A variant of the standard underdamped Langevin dynamics that can be thought of as a form of preconditioning and that has been used by practitioners is the mass-tensor molecular dynamics [6].

The nonreversible overdamped Langevin dynamics

$$dX_t = \left(-\nabla V(X_t)-\gamma(X_t)\right)dt+\sqrt{2}\,dW_t, \tag{7}$$

where the vector field $\gamma$ satisfies $\nabla\cdot(\pi\gamma)=0$, is ergodic (but not reversible) with respect to the target measure $\pi$ for all choices of the divergence-free vector field $\gamma$. The asymptotic behaviour of this process was considered for Gaussian diffusions in [20], where the rate of convergence of the covariance to equilibrium was quantified in terms of the choice of $\gamma$. This work was extended to the case of non-Gaussian target densities, and consequently to nonlinear SDEs of the form (7), in [21]. The problem of constructing the optimal nonreversible perturbation in terms of the $L^2(\pi)$ spectral gap for Gaussian target densities was studied in [34], see also [55]. Optimal nonreversible perturbations with respect to minimizing the asymptotic variance were studied in [14, 23]. In all these works it was shown that, in theory [i.e. without taking into account the computational cost of the discretization of the dynamics (7)], the nonreversible Langevin sampler (7) is never worse than its reversible counterpart (3), both in terms of converging faster to the target distribution and in terms of having a lower asymptotic variance. It should be emphasized that the two optimality criteria, maximizing the spectral gap and minimizing the asymptotic variance, lead to different choices for the nonreversible drift $\gamma(x)$.

The goal of this paper is to extend the analysis presented in [14, 34] by introducing the following modification of the standard underdamped Langevin dynamics:

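$$\begin{aligned} dq_t &= M^{-1}p_t\,dt-\mu J_1\nabla V(q_t)\,dt,\\ dp_t &= -\nabla V(q_t)\,dt-\nu J_2M^{-1}p_t\,dt-\Gamma M^{-1}p_t\,dt+\sqrt{2\Gamma}\,dW_t, \end{aligned} \tag{8}$$

where $M,\Gamma\in\mathbb{R}^{d\times d}$ are constant strictly positive definite matrices, $\mu$ and $\nu$ are scalar constants and $J_1,J_2\in\mathbb{R}^{d\times d}$ are constant skew-symmetric matrices. As demonstrated in Sect. 2, the process defined by (8) will be ergodic with respect to the Gibbs measure $\hat\pi$ defined in (5).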
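The structure of (8) can be illustrated with a short simulation. The sketch below uses a plain Euler–Maruyama step, which is an assumption made here for brevity and is not the numerical scheme the paper introduces in Sect. 5; the function name `perturbed_underdamped_em` and all parameter values in the usage lines are hypothetical. Setting `mu = nu = 0` recovers a discretisation of the unperturbed dynamics (4).

```python
import numpy as np

def perturbed_underdamped_em(grad_V, M, Gamma, J1, J2, mu, nu,
                             q0, p0, dt, n_steps, rng):
    """Euler-Maruyama sketch for the perturbed dynamics (8).

    q0, p0 have shape (n_chains, d); rows are independent chains.
    Returns the positions at every step, shape (n_steps, n_chains, d).
    """
    Minv = np.linalg.inv(M)
    # Any factor with L @ L.T == Gamma yields noise of covariance 2*Gamma*dt.
    L = np.linalg.cholesky(Gamma)
    q, p = np.array(q0, dtype=float), np.array(p0, dtype=float)
    path = np.empty((n_steps,) + q.shape)
    for k in range(n_steps):
        g = grad_V(q)          # rows hold grad V(q)
        v = p @ Minv.T         # rows hold M^{-1} p
        xi = rng.standard_normal(p.shape)
        q_new = q + (v - mu * g @ J1.T) * dt
        p_new = (p + (-g - nu * v @ J2.T - v @ Gamma.T) * dt
                 + np.sqrt(2.0 * dt) * xi @ L.T)
        q, p = q_new, p_new
        path[k] = q
    return path

# Hypothetical usage: standard Gaussian target in d = 2, V(q) = |q|^2 / 2.
rng = np.random.default_rng(0)
d, n_chains = 2, 200
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
path = perturbed_underdamped_em(
    grad_V=lambda q: q, M=np.eye(d), Gamma=2.0 * np.eye(d),
    J1=J, J2=J, mu=1.0, nu=1.0,
    q0=rng.standard_normal((n_chains, d)),
    p0=rng.standard_normal((n_chains, d)),
    dt=0.01, n_steps=3000, rng=rng,
)
estimate = np.mean(path[..., 0] ** 2)  # approximates pi(q_1^2) = 1
```

Only the $q$-component is stored, since observables depend on $q$ alone and the $p$-variables are auxiliary.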

Our objective is to investigate the use of these dynamics for computing ergodic averages of the form (2). To this end, we study the long time behaviour of (8) and, using hypocoercivity techniques, prove that the process converges exponentially fast to equilibrium. This perturbed underdamped Langevin process introduces a number of parameters in addition to the mass and friction tensors which must be tuned to ensure that the process is an efficient sampler. For Gaussian target densities, we derive estimates for the spectral gap and the asymptotic variance, valid in certain parameter regimes. Moreover, for certain classes of observables, we are able to identify the choices of parameters which lead to the optimal performance in terms of asymptotic variance. While these results are valid for Gaussian target densities, we advocate these particular parameter choices also for more complex target densities. To demonstrate their efficacy, we perform a number of numerical experiments on more complex, multimodal distributions. In particular, we use the Langevin sampler (8) in order to study the problem of diffusion bridge sampling.

The rest of the paper is organized as follows. In Sect. 2 we present some background material on Langevin dynamics, we construct general classes of Langevin samplers and we introduce criteria for assessing the performance of the samplers. In Sect. 3 we study qualitative properties of the perturbed underdamped Langevin dynamics (8) including exponentially fast convergence to equilibrium and the overdamped limit. In Sect. 4 we study in detail the performance of the Langevin sampler (8) for the case of Gaussian target distributions. In Sect. 5 we introduce a numerical scheme for simulating the perturbed dynamics (8) and we present numerical experiments on the implementation of the proposed samplers for the problem of diffusion bridge sampling. Section 6 is reserved for conclusions and suggestions for further work.

Construction of General Langevin Samplers

Background and Preliminaries

In this section we consider estimators of the form (2) where $(X_t)_{t\ge 0}$ is a diffusion process given by the solution of the following Itô SDE:

$$dX_t = a(X_t)\,dt+\sqrt{2}\,b(X_t)\,dW_t, \tag{9}$$

with drift coefficient $a:\mathbb{R}^d\to\mathbb{R}^d$ and diffusion coefficient $b:\mathbb{R}^d\to\mathbb{R}^{d\times m}$ both having smooth components, and where $(W_t)_{t\ge 0}$ is a standard $\mathbb{R}^m$-valued Brownian motion. Associated with (9) is the infinitesimal generator $\mathcal{L}$ formally given by

$$\mathcal{L}f = a\cdot\nabla f+\Sigma:\nabla^2 f,\qquad f\in C_c^2(\mathbb{R}^d), \tag{10}$$

where $\Sigma=bb^\top$, $\nabla^2 f$ denotes the Hessian of the function $f$ and $:$ denotes the Frobenius inner product. In general, $\Sigma$ is nonnegative definite, and could possibly be degenerate. In particular, the infinitesimal generator (10) need not be uniformly elliptic. To ensure that the corresponding semigroup exhibits sufficient smoothing behaviour, we shall require that the process (9) is hypoelliptic in the sense of Hörmander. If this condition holds, then irreducibility of the process $(X_t)_{t\ge 0}$ will be an immediate consequence of the existence of a strictly positive invariant distribution $\pi(x)\,dx$, see [30].

Suppose that $(X_t)_{t\ge 0}$ is nonexplosive. It follows from the hypoellipticity assumption that the process $(X_t)_{t\ge 0}$ possesses a smooth transition density $p(t,x,y)$ which is defined for all $t>0$ and $x,y\in\mathbb{R}^d$ [5, Theorem VII.5.6]. The associated strongly continuous Markov semigroup $(P_t)_{t\ge 0}$ is defined by $P_tf(x)=\int_{\mathbb{R}^d}p(t,x,y)f(y)\,dy$. Suppose that $(P_t)_{t\ge 0}$ is invariant with respect to the target measure $\pi$, i.e. $\int_{\mathbb{R}^d}P_tf(x)\,\pi(dx)=\int_{\mathbb{R}^d}f(x)\,\pi(dx)$ for $t\ge 0$ and all bounded continuous functions $f$. Then $(P_t)_{t\ge 0}$ can be extended to a positivity preserving contraction semigroup on $L^2(\pi)$ which is strongly continuous. Moreover, the infinitesimal generator corresponding to $(P_t)_{t\ge 0}$ is given by an extension of $(\mathcal{L},C_c^2(\mathbb{R}^d))$, also denoted by $\mathcal{L}$.

Due to hypoellipticity and invariance with respect to $(P_t)_{t\ge 0}$, the probability measure $\pi$ on $\mathbb{R}^d$ has a smooth density with respect to the Lebesgue measure. If this density is strictly positive, it follows that $\pi$ is necessarily the unique invariant distribution. Slightly abusing the notation, we will denote both the measure and its density by $\pi$. Furthermore, we will denote by $L^2(\pi)$ the Hilbert space of $\pi$-square integrable functions equipped with inner product $\langle\cdot,\cdot\rangle_{L^2(\pi)}$ and norm $\|\cdot\|_{L^2(\pi)}$.

A General Characterisation of Ergodic Diffusions

A natural question is what conditions on the coefficients $a$ and $b$ of (9) are required to ensure that $(X_t)_{t\ge 0}$ is invariant with respect to the distribution $\pi(x)\,dx$. The following result provides a necessary and sufficient condition for a diffusion process to be invariant with respect to a given target distribution.

Theorem 1

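Consider a diffusion process $(X_t)_{t\ge 0}$ on $\mathbb{R}^d$ defined by the unique, non-explosive solution to the Itô SDE (9) with drift $a\in C^1(\mathbb{R}^d;\mathbb{R}^d)$ and diffusion coefficient $b\in C^1(\mathbb{R}^d;\mathbb{R}^{d\times m})$. Then $(X_t)_{t\ge 0}$ is invariant with respect to $\pi$ if and only if

$$a=\Sigma\nabla\log\pi+\nabla\cdot\Sigma+\gamma, \tag{11}$$

where $\Sigma=bb^\top$ and $\gamma:\mathbb{R}^d\to\mathbb{R}^d$ is a continuously differentiable vector field satisfying

$$\nabla\cdot(\pi\gamma)=0. \tag{12}$$

If additionally $\gamma\in L^1(\pi)$, then there exists a skew-symmetric matrix function $C:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ such that $\gamma=\frac{1}{\pi}\nabla\cdot(\pi C)$. In this case the infinitesimal generator can be written as an $L^2(\pi)$-extension of

$$\mathcal{L}f=\frac{1}{\pi}\nabla\cdot\left((\Sigma+C)\,\pi\nabla f\right),\qquad f\in C_c^2(\mathbb{R}^d).$$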

The proof of the first part of this result can be found in [46, Chap. 4]; similar versions of this characterisation can be found in [54] and [21]. For the existence of the skew-symmetric matrix C see, e.g., [16, Sec.4, Prop. 1]. See also [37].

Remark 1

If (11) holds and $\mathcal{L}$ is hypoelliptic it follows immediately that $(X_t)_{t\ge 0}$ is ergodic with unique invariant distribution $\pi(x)\,dx$ (see [30]).

More generally, we can consider Itô diffusions in an extended phase space:

$$dZ_t=b(Z_t)\,dt+\sqrt{2}\,\sigma(Z_t)\,dW_t, \tag{13}$$

where $(W_t)_{t\ge 0}$ is a standard Brownian motion in $\mathbb{R}^N$, $N\ge d$. This is a Markov process with generator

$$\mathcal{L}=b(z)\cdot\nabla_z+\Sigma(z):\nabla_z\nabla_z, \tag{14}$$

where $\Sigma(z)=(\sigma\sigma^\top)(z)$. We will consider dynamics $(Z_t)_{t\ge 0}$ that is ergodic with respect to $\pi_z(z)\,dz$ such that $\int_{\mathbb{R}^m}\pi_z(x,y)\,dy=\pi(x)$, where $z=(x,y)$, $x\in\mathbb{R}^d$, $y\in\mathbb{R}^m$, $d+m=N$.

There are various well-known choices of dynamics which are invariant (and indeed ergodic) with respect to the target distribution $\pi(x)\,dx$.

  1. Choosing $b=I$ and $\gamma=0$ we immediately recover the overdamped Langevin dynamics (3).

  2. Choosing $b=I$, and $\gamma\neq 0$ such that (12) holds, gives rise to the nonreversible overdamped equation defined by (7). As it satisfies the conditions of Theorem 1, it is ergodic with respect to $\pi$. In particular, choosing $\gamma(x)=J\nabla V(x)$ for a constant skew-symmetric matrix $J$ we obtain
    $$dX_t=-(I+J)\nabla V(X_t)\,dt+\sqrt{2}\,dW_t, \tag{15}$$
    which has been studied in previous works.
  3. Given a target density $\pi>0$ on $\mathbb{R}^d$, if we consider the augmented target density $\hat\pi$ on $\mathbb{R}^{2d}$ given in (5), then choosing
    $$\gamma((q,p))=\begin{pmatrix}M^{-1}p\\ -\nabla V(q)\end{pmatrix}\quad\text{and}\quad b=\begin{pmatrix}0\\ \sqrt{\Gamma}\end{pmatrix}\in\mathbb{R}^{2d\times d}, \tag{16}$$
    where $M$ and $\Gamma$ are positive definite symmetric matrices, the conditions of Theorem 1 are satisfied for the target density $\hat\pi$. The resulting dynamics $(q_t,p_t)_{t\ge 0}$ is determined by the underdamped Langevin equation (4). It is straightforward to verify that the generator is hypoelliptic [35, Sec 2.2.3.1], and thus $(q_t,p_t)_{t\ge 0}$ is ergodic.
  4. More generally, consider the augmented target density $\hat\pi$ on $\mathbb{R}^{2d}$ as above, and choose
    $$\gamma((q,p))=\begin{pmatrix}M^{-1}p-\mu J_1\nabla V(q)\\ -\nabla V(q)-\nu J_2M^{-1}p\end{pmatrix}\quad\text{and}\quad b=\begin{pmatrix}0\\ \sqrt{\Gamma}\end{pmatrix}\in\mathbb{R}^{2d\times d}, \tag{17}$$
    where $\mu$ and $\nu$ are scalar constants and $J_1,J_2\in\mathbb{R}^{d\times d}$ are constant skew-symmetric matrices. With this choice we recover the perturbed Langevin dynamics (8). It is straightforward to check that (17) satisfies the invariance condition (12), and thus Theorem 1 guarantees that (8) is invariant with respect to $\hat\pi$.
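  5. In a similar fashion, one can introduce an augmented target density on $\mathbb{R}^{(m+2)d}$, with
    $$\hat{\hat\pi}(q,p,u_1,\ldots,u_m)\propto\exp\left(-\left(\frac{|p|^2}{2}+\frac{|u_1|^2+\cdots+|u_m|^2}{2}+V(q)\right)\right),$$
    where $p,q,u_i\in\mathbb{R}^d$, for $i=1,\ldots,m$. Clearly $\int_{\mathbb{R}^d\times\mathbb{R}^{md}}\hat{\hat\pi}(q,p,u_1,\ldots,u_m)\,dp\,du_1\cdots du_m=\pi(q)$. We now define $\gamma:\mathbb{R}^{(m+2)d}\to\mathbb{R}^{(m+2)d}$ by
    $$\gamma(q,p,u_1,\ldots,u_m)=\left(p,\;-\nabla_qV(q)+\sum_{j=1}^m\lambda_ju_j,\;-\lambda_1p,\;\ldots,\;-\lambda_mp\right)^\top$$
    and $b:\mathbb{R}^{(m+2)d}\to\mathbb{R}^{(m+2)d\times(m+2)d}$ by the block-diagonal matrix
    $$b(q,p,u_1,\ldots,u_m)=\begin{pmatrix}0&0&0&\cdots&0\\ 0&0&0&\cdots&0\\ 0&0&\sqrt{\alpha_1}\,I_{d\times d}&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&\sqrt{\alpha_m}\,I_{d\times d}\end{pmatrix},$$
    where $\lambda_i\in\mathbb{R}$ and $\alpha_i>0$, for $i=1,\ldots,m$. The resulting process (9) is given by
    $$\begin{aligned} dq_t&=p_t\,dt,\\ dp_t&=-\nabla_qV(q_t)\,dt+\sum_{j=1}^m\lambda_ju_t^j\,dt,\\ du_t^1&=-\lambda_1p_t\,dt-\alpha_1u_t^1\,dt+\sqrt{2\alpha_1}\,dW_t^1,\\ &\;\;\vdots\\ du_t^m&=-\lambda_mp_t\,dt-\alpha_mu_t^m\,dt+\sqrt{2\alpha_m}\,dW_t^m, \end{aligned} \tag{18}$$
    where $(W_t^j)_{t\ge 0}$, $j=1,\ldots,m$, are independent $\mathbb{R}^d$-valued Brownian motions. This process is ergodic with unique invariant distribution $\hat{\hat\pi}$, and under appropriate conditions on $V$, converges exponentially fast to equilibrium in relative entropy [42]. Equation (18) is a Markovian representation of a generalised Langevin equation of the form
    $$dq_t=p_t\,dt,\qquad dp_t=-\nabla_qV(q_t)\,dt-\int_0^tF(t-s)\,p_s\,ds\,dt+N(t)\,dt,$$
    where $N(t)$ is a mean-zero stationary Gaussian process with autocorrelation function $F(t)$, i.e. $\mathbb{E}\left[N(t)\otimes N(s)\right]=F(t-s)\,I_{d\times d}$ and $F(t)=\sum_{i=1}^m\lambda_i^2e^{-\alpha_i|t|}$.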
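  6. Let $\tilde\pi(z)\propto\exp(-\Phi(z))$ be a positive density on $\mathbb{R}^N$, where $N>d$, such that $\pi(x)=\int_{\mathbb{R}^{N-d}}\tilde\pi(x,y)\,dy$, where $z=(x,y)\in\mathbb{R}^d\times\mathbb{R}^{N-d}$. Then choosing $b=I_{N\times N}$ and $\gamma=0$ we obtain the dynamics
    $$dX_t=-\nabla_x\Phi(X_t,Y_t)\,dt+\sqrt{2}\,dW_t^1,\qquad dY_t=-\nabla_y\Phi(X_t,Y_t)\,dt+\sqrt{2}\,dW_t^2,$$
    and $(X_t,Y_t)_{t\ge 0}$ is immediately ergodic with respect to $\tilde\pi$.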
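The claim in item 4 that (17) satisfies the invariance condition (12) can be spelled out as follows. This is a short verification added for the reader's convenience; the notation $H(q,p)=V(q)+\tfrac12\,p\cdot M^{-1}p$, so that $\hat\pi\propto e^{-H}$, is introduced here only for this computation.

```latex
\begin{aligned}
\nabla\cdot(\hat\pi\gamma)
  &= \hat\pi\,\bigl(\nabla\cdot\gamma-\gamma\cdot\nabla H\bigr), \\
\nabla\cdot\gamma
  &= -\mu\,J_1:\nabla^2V(q)-\nu\,\operatorname{Tr}\bigl(J_2M^{-1}\bigr)=0, \\
\gamma\cdot\nabla H
  &= \bigl(M^{-1}p-\mu J_1\nabla V\bigr)\cdot\nabla V
     +\bigl(-\nabla V-\nu J_2M^{-1}p\bigr)\cdot M^{-1}p \\
  &= -\mu\,\nabla V\cdot J_1\nabla V-\nu\,(M^{-1}p)\cdot J_2(M^{-1}p)=0 .
\end{aligned}
```

Every term vanishes because the contraction of a skew-symmetric matrix with a symmetric one ($\nabla^2V$, $M^{-1}$, or $w\,w^\top$ for a vector $w$) is zero. The same argument with $\mu=\nu=0$ covers item 3, and the analogous computation for $\gamma=J\nabla V$ and $\pi\propto e^{-V}$ covers item 2.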

Comparison Criteria

For a fixed observable $f$, a natural measure of accuracy of the estimator $\pi_T(f)=T^{-1}\int_0^Tf(X_s)\,ds$ is the mean square error (MSE) defined by

$$\mathrm{MSE}(f,T):=\mathbb{E}_x\left|\pi_T(f)-\pi(f)\right|^2, \tag{19}$$

where $\mathbb{E}_x$ denotes the expectation conditioned on the process $(X_t)_{t\ge 0}$ starting at $x$. It is instructive to introduce the decomposition $\mathrm{MSE}(f,T)=\mu^2(f,T)+\sigma^2(f,T)$, where

$$\mu(f,T)=\mathbb{E}_x[\pi_T(f)]-\pi(f)\quad\text{and}\quad\sigma^2(f,T)=\mathbb{E}_x\left|\pi_T(f)-\mathbb{E}_x\pi_T(f)\right|^2=\operatorname{Var}[\pi_T(f)]. \tag{20}$$

Here $\mu(f,T)$ measures the bias of the estimator $\pi_T(f)$ and $\sigma^2(f,T)$ measures the variance of fluctuations of $\pi_T(f)$ around the mean.

The speed of convergence to equilibrium of the process $(X_t)_{t\ge 0}$ will control both the bias term $\mu(f,T)$ and the variance $\sigma^2(f,T)$. To make this claim more precise, suppose that the semigroup $(P_t)_{t\ge 0}$ associated with $(X_t)_{t\ge 0}$ decays exponentially fast in $L^2(\pi)$, i.e. there exist constants $\lambda>0$ and $C\ge 1$ such that

$$\|P_tg-\pi(g)\|_{L^2(\pi)}\le Ce^{-\lambda t}\|g-\pi(g)\|_{L^2(\pi)},\qquad g\in L^2(\pi). \tag{21}$$
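In practice the two terms in (20) can be estimated from independent replications of the estimator. The sketch below replaces the expectation $\mathbb{E}_x$ by an empirical average over replicas; the function name `mse_decomposition` and the synthetic replica values are hypothetical and only serve to illustrate the decomposition.

```python
import numpy as np

def mse_decomposition(estimates, exact):
    """Split the empirical MSE of replicated estimates into bias^2 + variance."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - exact
    variance = estimates.var()  # population variance (ddof=0)
    mse = np.mean((estimates - exact) ** 2)
    return mse, bias, variance

# Synthetic stand-ins for replicated values of pi_T(f), with pi(f) = 1.
rng = np.random.default_rng(1)
replicas = 1.05 + 0.1 * rng.standard_normal(1000)
mse, bias, variance = mse_decomposition(replicas, exact=1.0)
print(mse, bias ** 2 + variance)  # the two numbers agree up to rounding
```

With the population variance (`ddof=0`) the identity MSE = bias² + variance holds exactly for the empirical quantities, mirroring the decomposition stated above.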

Remark 2

If (21) holds with $C=1$, this estimate is equivalent to $-\mathcal{L}$ having a spectral gap in $L^2(\pi)$. Allowing for a constant $C>1$ is essential for our purposes, though, in order to treat nonreversible and degenerate diffusion processes by the theory of hypocoercivity as outlined in [54].

The following lemma characterises the decay of the bias $\mu(f,T)$ as $T\to\infty$ in terms of $\lambda$ and $C$. The proof can be found in [41].

Lemma 1

Let $(X_t)_{t\ge 0}$ be the unique, non-explosive solution of (9), such that $X_0\sim\pi_0\ll\pi$ and $\frac{d\pi_0}{d\pi}\in L^2(\pi)$, where $\frac{d\pi_0}{d\pi}$ denotes the Radon–Nikodym derivative of $\pi_0$ with respect to $\pi$. Suppose that the process is ergodic with respect to $\pi$ such that the Markov semigroup $(P_t)_{t\ge 0}$ satisfies (21). Then for $f\in L^\infty(\pi)$,

$$|\mu(f,T)|\le\frac{C}{\lambda T}\left(1-e^{-\lambda T}\right)\|f\|_{L^\infty}\operatorname{Var}_\pi\left(\frac{d\pi_0}{d\pi}\right)^{1/2}.$$

The study of the long time behaviour of the variance $\sigma^2(f,T)$ involves deriving a central limit theorem for the additive functional $\int_0^t\left(f(X_s)-\pi(f)\right)ds$. As discussed in [13], we reduce this problem to proving well-posedness of the Poisson equation

$$-\mathcal{L}\chi=f-\pi(f),\qquad\pi(\chi)=0. \tag{22}$$

The only complications in this approach arise from the fact that the generator $\mathcal{L}$ need not be symmetric in $L^2(\pi)$ nor uniformly elliptic. The following result summarises conditions for the well-posedness of the Poisson equation and also provides us with a formula for the asymptotic variance. Again, the proof can be found in [41].

Lemma 2

Let $(X_t)_{t\ge 0}$ be the unique, non-explosive solution of (9) with smooth drift and diffusion coefficients, such that the corresponding infinitesimal generator is hypoelliptic. Suppose that $(X_t)_{t\ge 0}$ is ergodic with respect to $\pi$ and moreover, $(P_t)_{t\ge 0}$ decays exponentially fast in $L^2(\pi)$ as in (21). Then for all $f\in L^2(\pi)$, there exists a unique mean zero solution $\chi$ to the Poisson equation (22). If $X_0\sim\pi$, then for all $f\in C^\infty(\mathbb{R}^d)\cap L^2(\pi)$

$$\sqrt{T}\left(\pi_T(f)-\pi(f)\right)\xrightarrow[T\to\infty]{d}\mathcal{N}\left(0,2\sigma_f^2\right), \tag{23}$$

where $\sigma_f^2$ is the asymptotic variance defined by

$$\sigma_f^2=\langle\chi,(-\mathcal{L})\chi\rangle_{L^2(\pi)}=\langle\nabla\chi,\Sigma\nabla\chi\rangle_{L^2(\pi)}. \tag{24}$$

Moreover, if $X_0\sim\pi_0$ where $\pi_0\ll\pi$ and $\frac{d\pi_0}{d\pi}\in L^2(\pi)$, then (23) holds for all $f\in C^\infty(\mathbb{R}^d)\cap L^\infty(\mathbb{R}^d)$.

Clearly, observables that only differ by a constant have the same asymptotic variance. In the sequel, we will hence restrict our attention to observables $f\in L^2(\pi)$ satisfying $\pi(f)=0$, simplifying expressions (22) and (23). The corresponding subspace of $L^2(\pi)$ will be denoted by $L_0^2(\pi)$. If the exponential decay estimate (21) is satisfied, then Lemma 2 shows that $-\mathcal{L}$ is invertible on $L_0^2(\pi)$, so we can express the asymptotic variance as

$$\sigma_f^2=\langle f,(-\mathcal{L})^{-1}f\rangle_{L^2(\pi)},\qquad f\in L_0^2(\pi). \tag{25}$$

We note that the constants $C$ and $\lambda$ appearing in the exponential decay estimate (21) also control the speed of convergence of $\sigma^2(f,T)$ to zero. Indeed, it is straightforward to show that if (21) is satisfied, then the solution $\chi$ of (22) satisfies

$$\sigma_f^2=\langle\chi,f-\pi(f)\rangle_{L^2(\pi)}\le\frac{C}{\lambda}\|f\|_{L^2(\pi)}^2. \tag{26}$$
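For completeness, one way to obtain the bound (26) for $f\in L_0^2(\pi)$ is sketched below. It uses the standard representation of the solution of the Poisson equation as a time integral of the semigroup, which is well defined under (21); this short argument is added here and is not quoted from [41].

```latex
\begin{aligned}
\chi &= \int_0^\infty P_tf\,dt
  \quad\Longrightarrow\quad
  \|\chi\|_{L^2(\pi)}
  \le \int_0^\infty Ce^{-\lambda t}\,\|f\|_{L^2(\pi)}\,dt
  = \frac{C}{\lambda}\,\|f\|_{L^2(\pi)}, \\
\sigma_f^2 &= \langle\chi,f\rangle_{L^2(\pi)}
  \le \|\chi\|_{L^2(\pi)}\,\|f\|_{L^2(\pi)}
  \le \frac{C}{\lambda}\,\|f\|_{L^2(\pi)}^2 .
\end{aligned}
```

The first line is the exponential decay estimate (21) applied to $g=f$ with $\pi(f)=0$, and the second line is the Cauchy–Schwarz inequality.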

Lemmas 1 and 2 would suggest that choosing the coefficients $\Sigma$ and $\gamma$ to optimize the constants $C$ and $\lambda$ in (26) would be an effective means of improving the performance of the estimator $\pi_T(f)$, especially since the improvement in performance would be uniform over an entire class of observables. When this is possible, this is indeed the case. However, as has been observed in [20, 21, 34], maximising the speed of convergence to equilibrium is a delicate task. Since the asymptotic variance $\sigma_f^2$ determines the leading order term in $\mathrm{MSE}(f,T)$, it is typically sufficient to focus specifically on $\sigma_f^2$ and study how the parameters of the SDE (9) can be chosen to minimise it. This study was undertaken in [14] for processes of the form (7).

Perturbation of Underdamped Langevin Dynamics

The primary objective of this work is to compare the performances of the perturbed underdamped Langevin dynamics (8) and the unperturbed dynamics (4) according to the criteria outlined in Sect. 2.3 and to find suitable choices for the matrices $J_1$, $J_2$, $M$ and $\Gamma$ that improve the performance of the sampler. We begin our investigations of (8) by establishing ergodicity and exponentially fast return to equilibrium, and by studying the overdamped limit of (8). As the latter turns out to be nonreversible and therefore in principle superior to the usual overdamped limit (3), see e.g. [21], this calculation provides us with further motivation to study the proposed dynamics.

For the bulk of this work, we focus on the particular case when the target measure is Gaussian, i.e. when the potential is given by $V(q)=\frac{1}{2}q^\top Sq$ with a symmetric and positive definite precision matrix $S$ (i.e. the covariance matrix is given by $S^{-1}$). In this case, we advocate the following conditions for the choice of parameters:

$$M=S,\qquad\Gamma=\gamma S,\qquad SJ_1S=J_2,\qquad\mu=\nu. \tag{27}$$

Under the above choices (27), we show that the large perturbation limit $\lim_{\mu\to\infty}\sigma_f^2$ exists and is finite and we provide an explicit expression for it (see Corollary 4). From this expression, we derive an algorithm for finding optimal choices for $J_1$ in the case of quadratic observables (see Algorithm 2).
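The conditions (27) translate directly into a small helper that builds the sampler parameters from the precision matrix. The sketch below is illustrative only: the function name `gaussian_parameter_choice` and the example matrices are hypothetical, and the choice of $J_1$ itself (the subject of Algorithm 2) is not addressed.

```python
import numpy as np

def gaussian_parameter_choice(S, J1, gamma, mu):
    """Return (M, Gamma, J2, nu) following the conditions (27)."""
    S = np.asarray(S, dtype=float)
    J1 = np.asarray(J1, dtype=float)
    if not np.allclose(S, S.T):
        raise ValueError("S must be symmetric")
    if not np.allclose(J1, -J1.T):
        raise ValueError("J1 must be skew-symmetric")
    M = S.copy()
    Gamma = gamma * S
    J2 = S @ J1 @ S  # skew-symmetric whenever S is symmetric and J1 is skew
    nu = mu
    return M, Gamma, J2, nu

# Hypothetical example with a 2 x 2 precision matrix.
S = np.array([[2.0, 0.5], [0.5, 1.0]])
J1 = np.array([[0.0, 1.0], [-1.0, 0.0]])
M, Gamma, J2, nu = gaussian_parameter_choice(S, J1, gamma=3.0, mu=0.5)
```

Note that $J_2=SJ_1S$ inherits skew-symmetry from $J_1$, so the resulting parameters are admissible for (8).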

If the friction coefficient is not too small ($\gamma>\sqrt{2}$), and under certain mild nondegeneracy conditions, we prove that adding a small perturbation will always decrease the asymptotic variance for observables of the form $f(q)=q\cdot Kq+l\cdot q+C$:

$$\frac{d}{d\mu}\sigma_f^2\Big|_{\mu=0}=0\quad\text{and}\quad\frac{d^2}{d\mu^2}\sigma_f^2\Big|_{\mu=0}<0,$$

see Theorem 3. In fact, we conjecture that this statement is true for arbitrary observables $f\in L^2(\pi)$, but we have not been able to prove this. The dynamics (8) [used in conjunction with the conditions (27)] proves to be especially effective when the observable is antisymmetric (i.e. when it changes sign under the substitution $q\mapsto-q$) or when it has a significant antisymmetric part. In particular, in Proposition 3 we show that under certain conditions on the spectrum of $J_1$, for any antisymmetric observable $f\in L^2(\pi)$ it holds that $\lim_{\mu\to\infty}\sigma_f^2=0$.

Numerical experiments and analysis show that departing significantly from (27) can in fact degrade the performance of the sampler. This is in stark contrast to (7), where it is not possible to increase the asymptotic variance by any perturbation. For that reason, at present it seems practical to use (8) as a sampler only when a reasonable estimate of the global covariance of the target distribution is available. In the case of Bayesian inverse problems and diffusion bridge sampling, the target measure $\pi$ is given with respect to a Gaussian prior. We demonstrate the effectiveness of our approach in these applications, taking the prior Gaussian covariance as $S$ in (27).

Remark 3

In [34, Rem. 3] another modification of (4) was suggested (albeit with the simplifications $\Gamma=\gamma\cdot I$ and $M=I$):

$$dq_t=(1-J)M^{-1}p_t\,dt,\qquad dp_t=-(1+J)\nabla V(q_t)\,dt-\Gamma M^{-1}p_t\,dt+\sqrt{2\Gamma}\,dW_t, \tag{28}$$

with $J$ again denoting an antisymmetric matrix. However, under the change of variables $p\mapsto(1+J)\tilde p$ the above equations transform into

$$dq_t=\tilde M^{-1}\tilde p_t\,dt,\qquad d\tilde p_t=-\nabla V(q_t)\,dt-\tilde\Gamma\tilde M^{-1}\tilde p_t\,dt+\sqrt{2\tilde\Gamma}\,d\tilde W_t, \tag{29}$$

where $\tilde M=(1+J)^{-1}M(1-J)^{-1}$ and $\tilde\Gamma=(1+J)^{-1}\Gamma(1-J)^{-1}$. Since any observable $f$ depends only on $q$ (the $p$-variables are merely auxiliary), the estimator $\pi_T(f)$ as well as its associated convergence characteristics (i.e. asymptotic variance and speed of convergence to equilibrium) are invariant under this transformation. Therefore, (28) reduces to the underdamped Langevin dynamics (4) and does not represent an independent approach to sampling. Suitable choices of $M$ and $\Gamma$ will be discussed in Sect. 4.5.
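The algebra behind the change of variables in Remark 3 can be checked numerically. The following sketch draws random matrices (all names and sizes are arbitrary choices made here) and verifies that the drift and noise coefficients of (28), written in terms of $\tilde p=(1+J)^{-1}p$, coincide with those of (29).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
I = np.eye(d)

A = rng.standard_normal((d, d))
J = A - A.T                               # antisymmetric
R = rng.standard_normal((d, d))
M = R @ R.T + d * I                       # symmetric positive definite
Q = rng.standard_normal((d, d))
Gamma = Q @ Q.T + d * I                   # symmetric positive definite

P = I + J                                 # invertible since J is antisymmetric
Pinv = np.linalg.inv(P)
M_tilde = Pinv @ M @ np.linalg.inv(I - J)
Gamma_tilde = Pinv @ Gamma @ np.linalg.inv(I - J)

# dq = (1 - J) M^{-1} p dt with p = (1 + J) p_tilde equals M_tilde^{-1} p_tilde dt.
q_drift_ok = np.allclose((I - J) @ np.linalg.inv(M) @ P,
                         np.linalg.inv(M_tilde))
# Friction term: (1 + J)^{-1} Gamma M^{-1} (1 + J) = Gamma_tilde M_tilde^{-1}.
friction_ok = np.allclose(Pinv @ Gamma @ np.linalg.inv(M) @ P,
                          Gamma_tilde @ np.linalg.inv(M_tilde))
# Noise covariance: (1 + J)^{-1} (2 Gamma) (1 + J)^{-T} = 2 Gamma_tilde.
noise_ok = np.allclose(Pinv @ (2.0 * Gamma) @ Pinv.T, 2.0 * Gamma_tilde)
# M_tilde is again symmetric positive definite.
spd_ok = (np.allclose(M_tilde, M_tilde.T)
          and np.all(np.linalg.eigvalsh((M_tilde + M_tilde.T) / 2) > 0))
print(q_drift_ok, friction_ok, noise_ok, spd_ok)
```

The force term needs no check: multiplying $-(1+J)\nabla V$ by $(1+J)^{-1}$ gives $-\nabla V$ directly.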

Properties of Perturbed Underdamped Langevin Dynamics

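In this section we study some of the properties of the perturbed underdamped dynamics (8). First, note that its generator is given by

$$\mathcal{L}=\underbrace{\underbrace{M^{-1}p\cdot\nabla_q-\nabla_qV\cdot\nabla_p}_{\mathcal{L}_{\mathrm{ham}}}\;\underbrace{-\,\Gamma M^{-1}p\cdot\nabla_p+\Gamma:D_p^2}_{\mathcal{L}_{\mathrm{therm}}}}_{\mathcal{L}_0}\;\underbrace{-\,\mu J_1\nabla V\cdot\nabla_q-\nu J_2M^{-1}p\cdot\nabla_p}_{\mathcal{L}_{\mathrm{pert}}}, \tag{30}$$

decomposed into the perturbation $\mathcal{L}_{\mathrm{pert}}$ and the unperturbed operator $\mathcal{L}_0$, which can be further split into the Hamiltonian part $\mathcal{L}_{\mathrm{ham}}$ and the thermostat (Ornstein–Uhlenbeck) part $\mathcal{L}_{\mathrm{therm}}$, see [35, 36, 46].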

Lemma 3

The infinitesimal generator $\mathcal{L}$ in (30) is hypoelliptic.

Proof

The proof consists of verifying the conditions of Hörmander’s Theorem for the generator (30) and can be found in [41].

An immediate corollary of this result and of Theorem 1 is that the perturbed underdamped Langevin process (8) is ergodic with unique invariant distribution $\hat\pi$ given by (5).

As explained in Sect. 2.3, the exponential decay estimate (21) is crucial for our approach, as in particular it guarantees the well-posedness of the Poisson equation (22). From now on, we will therefore make the following assumption on the potential $V$, required to prove exponential decay in $L^2(\pi)$:

Assumption 1

Assume that the Hessian of $V$ is bounded and that the target measure $\pi(dq)=\frac{1}{Z}e^{-V}dq$ satisfies a Poincaré inequality, i.e. there exists a constant $\rho>0$ such that

$$\int_{\mathbb{R}^d}\phi^2\,d\pi\le\rho\int_{\mathbb{R}^d}|\nabla\phi|^2\,d\pi,\qquad\phi\in L_0^2(\pi)\cap H^1(\pi). \tag{31}$$

Sufficient conditions on the potential so that Poincaré's inequality holds, e.g. the Bakry–Émery criterion, are presented in [7].

Theorem 2

Under Assumption 1 there exist constants $C\ge 1$ and $\lambda>0$ such that the semigroup $(P_t)_{t\ge 0}$ generated by $\mathcal{L}$ satisfies exponential decay in $L^2(\pi)$ as in (21).

Proof

The proof uses the machinery of hypocoercivity developed in [54] and can be found in [41]. Using the framework of [15], we conjecture that the assumption on the boundedness of the Hessian of $V$ can be substantially weakened and more quantitative decay estimates (in particular with respect to $\mu$ and $\nu$) can be obtained. This approach has recently been successfully applied to equilibrium and nonequilibrium Langevin dynamics, see [27, 53]. We leave this line of work for future study.

The Overdamped Limit

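In this section we develop a connection between the perturbed underdamped Langevin dynamics (8) and the nonreversible overdamped Langevin dynamics (7). The analysis is very similar to the one presented in [35, Sect. 2.2.2] and we will be brief. For convenience, in this section we will perform the analysis on the $d$-dimensional torus $\mathbb{T}^d\cong(\mathbb{R}/\mathbb{Z})^d$, i.e. we will assume $q\in\mathbb{T}^d$. Consider the following scaling of (8):

$$dq_t^\epsilon=\frac{1}{\epsilon}M^{-1}p_t^\epsilon\,dt-\mu J_1\nabla_qV(q_t^\epsilon)\,dt, \tag{32a}$$
$$dp_t^\epsilon=-\frac{1}{\epsilon}\nabla_qV(q_t^\epsilon)\,dt-\frac{1}{\epsilon^2}\nu J_2M^{-1}p_t^\epsilon\,dt-\frac{1}{\epsilon^2}\Gamma M^{-1}p_t^\epsilon\,dt+\frac{1}{\epsilon}\sqrt{2\Gamma}\,dW_t, \tag{32b}$$

valid for the small mass/small momentum regime $M\to\epsilon^2M$, $p_t\to\epsilon p_t$. Equivalently, those modifications can be obtained from substituting $\Gamma\to\epsilon^{-1}\Gamma$ and $t\mapsto\epsilon^{-1}t$, and so in the limit as $\epsilon\to 0$ the dynamics (32) describes the limit of large friction with rescaled time. It turns out that as $\epsilon\to 0$, the dynamics (32) converges to the limiting SDE

$$dq_t=-(\nu J_2+\Gamma)^{-1}\nabla_qV(q_t)\,dt-\mu J_1\nabla_qV(q_t)\,dt+(\nu J_2+\Gamma)^{-1}\sqrt{2\Gamma}\,dW_t. \tag{33}$$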

The following proposition makes this statement precise.

Proposition 1

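Denote by $(q_t^\epsilon,p_t^\epsilon)$ the solution to (32) with (deterministic) initial conditions $(q_0^\epsilon,p_0^\epsilon)=(q_{\mathrm{init}},p_{\mathrm{init}})$ and by $q_t^0$ the solution to (33) with initial condition $q_0^0=q_{\mathrm{init}}$. For any $T>0$, $(q_t^\epsilon)_{0\le t\le T}$ converges to $(q_t^0)_{0\le t\le T}$ in $L^2(\Omega,C([0,T],\mathbb{T}^d))$ as $\epsilon\to 0$, i.e. $\lim_{\epsilon\to 0}\mathbb{E}\left(\sup_{0\le t\le T}|q_t^\epsilon-q_t^0|^2\right)=0$.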

Proof

The proof follows standard arguments (see for instance [46]) and can be found in [41]. By a more refined analysis, it is also possible to get information on the rate of convergence; see, e.g. [48, 49].

Remark 4

The overdamped limit (33) respects the invariant distribution, in the sense that it is ergodic with respect to $\pi(dq)=\frac{1}{Z}e^{-V}dq$.

The limiting SDE (33) is nonreversible due to the term $-\mu J_1\nabla_qV(q_t)\,dt$ and also because the matrix $(\nu J_2+\Gamma)^{-1}$ is in general neither symmetric nor antisymmetric. This result, together with the fact that nonreversible perturbations of overdamped Langevin dynamics of the form (7) are by now well-known to have improved performance properties, motivates further investigation of the dynamics (8).
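The invariance statement of Remark 4 can be checked numerically in a simple special case. The sketch below assumes a Gaussian target on $\mathbb{R}^d$ with $V(q)=\frac12 q^\top Sq$ (an assumption made here for illustration; the section itself works on the torus), for which (33) is a linear SDE $dq_t=-Aq_t\,dt+\sigma\,dW_t$. The law $\mathcal{N}(0,S^{-1})$ is stationary if and only if the Lyapunov equation $AS^{-1}+S^{-1}A^\top=\sigma\sigma^\top$ holds. All matrices are random and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3

def random_spd(d):
    """Random symmetric positive definite matrix."""
    R = rng.standard_normal((d, d))
    return R @ R.T + d * np.eye(d)

def random_skew(d):
    """Random skew-symmetric matrix."""
    A = rng.standard_normal((d, d))
    return A - A.T

S, Gamma = random_spd(d), random_spd(d)  # precision matrix and friction
J1, J2 = random_skew(d), random_skew(d)
mu, nu = 0.7, 1.3

X = nu * J2 + Gamma
Xinv = np.linalg.inv(X)
drift = Xinv @ S + mu * J1 @ S           # (33) reads dq = -drift q dt + ...
diffusion = 2.0 * Xinv @ Gamma @ Xinv.T  # covariance rate of the noise term

# Stationarity of N(0, S^{-1}): drift * Cov + Cov * drift^T = diffusion.
cov = np.linalg.inv(S)
residual = drift @ cov + cov @ drift.T - diffusion
print(np.max(np.abs(residual)))  # zero up to rounding error
```

The residual vanishes because $X^{-1}+X^{-\top}=X^{-1}(X+X^\top)X^{-\top}=2X^{-1}\Gamma X^{-\top}$ for $X=\nu J_2+\Gamma$, while the two terms involving $\mu J_1$ cancel by skew-symmetry.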

Sampling from a Gaussian Distribution

In this section we study in detail the performance of the Langevin sampler (8) for Gaussian target densities, first considering the case of unit covariance. In particular, we study the optimal choice for the parameters in the sampler, the exponential decay rate and the asymptotic variance. We then extend our results to Gaussian target densities with arbitrary covariance matrices.

Unit Covariance: Small Perturbations

In our study of the dynamics given by (8) we first consider the simple case when $V(q)=\frac{1}{2}|q|^2$, i.e. the task of sampling from a Gaussian measure with unit covariance. We will assume $M=I$, $\Gamma=\gamma I$ and $J_1=J_2=:J$ (so that the $q$- and $p$-dynamics are perturbed in the same way, albeit possibly with different strengths $\mu$ and $\nu$). Our first result concerns the asymptotic variance for linear and quadratic observables for small perturbations of equal strength ($\mu=\nu$). For sufficiently strong damping ($\gamma>\sqrt{2}$), such a perturbation always leads to an improvement in asymptotic variance, provided that at least one of the nondegeneracy conditions $[J,K]\neq 0$ and $l\notin\ker J$ holds:

Theorem 3

Consider the dynamics

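$$dq_t=p_t\,dt-\mu Jq_t\,dt,\qquad dp_t=-q_t\,dt-\mu Jp_t\,dt-\gamma p_t\,dt+\sqrt{2\gamma}\,dW_t, \tag{34}$$

with $\gamma>\sqrt{2}$ and an observable of the form $f(q)=q\cdot Kq+l\cdot q+C$, where $K\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$, $l\in\mathbb{R}^d$ and $C\in\mathbb{R}$. If at least one of the conditions $[J,K]\neq 0$ and $l\notin\ker J$ is satisfied, then the asymptotic variance, as a function of $\mu$, has a local maximum at the unperturbed sampler $\mu=0$, independently of $K$ and $J$ (and $\gamma$, as long as $\gamma>\sqrt{2}$), i.e.

$$\partial_\mu\sigma_f^2\big|_{\mu=0}=0\quad\text{and}\quad\partial_\mu^2\sigma_f^2\big|_{\mu=0}<0.$$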

Proof

The dynamics (34) are of Ornstein–Uhlenbeck type, i.e. we can write

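$$dX_t=-BX_t\,dt+\sqrt{2Q}\,d\bar{W}_t,\qquad B=\begin{pmatrix}\mu J&-I\\ I&\gamma I+\mu J\end{pmatrix},\qquad Q=\begin{pmatrix}0&0\\ 0&\gamma I\end{pmatrix} \tag{35}$$

with $X=(q,p)^T$, and $(\bar{W}_t)_{t\ge 0}$ denoting a standard Wiener process on $\mathbb{R}^{2d}$. The generator of (35) is then given by

$$\mathcal{L}=-Bx\cdot\nabla+\nabla^TQ\nabla. \tag{36}$$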

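According to Lemma 2, the asymptotic variance can be expressed as

$$\sigma_f^2=\langle\chi,f\rangle_{L^2(\hat\pi)},\quad\text{where }\chi\text{ is the solution to }-\mathcal{L}\chi=f,\ \hat\pi(\chi)=0. \tag{37}$$

By calculations similar to those in [14, Sect. 4], $\chi$ is given by $\chi(x)=x\cdot Cx+D\cdot x-\operatorname{Tr}C$, where

$$BC+CB^T=\bar{K},\qquad B^TD=\bar{l}, \tag{38}$$

using the notations

$$\bar{K}=\begin{pmatrix}K&0\\ 0&0\end{pmatrix}\in\mathbb{R}^{2d\times 2d}\quad\text{and}\quad\bar{l}=\begin{pmatrix}l\\ 0\end{pmatrix}\in\mathbb{R}^{2d}. \tag{39}$$

The asymptotic variance is then given by

$$\sigma_f^2=2\operatorname{Tr}(C\bar{K})+D\cdot\bar{l}. \tag{40}$$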

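Taking derivatives of (38) and solving the ensuing matrix equations, it is possible to obtain explicit expressions for $\partial_\mu C|_{\mu=0}$, $\partial_\mu^2C|_{\mu=0}$, $\partial_\mu D|_{\mu=0}$ and $\partial_\mu^2D|_{\mu=0}$ as detailed in [41]. We obtain

$$\partial_\mu\sigma_f^2\big|_{\mu=0}=0\quad\text{and}\quad\partial_\mu^2\sigma_f^2\big|_{\mu=0}=(-2\gamma^3+4\gamma)\,|Jl|^2+\Big(\gamma-\frac{4}{\gamma^3}-\gamma^3-\frac{1}{\gamma}\Big)\cdot\big(\operatorname{Tr}(JKJK)-\operatorname{Tr}(J^2K^2)\big).$$

Notice that $\operatorname{Tr}(JKJK)-\operatorname{Tr}(J^2K^2)=\frac{1}{2}\operatorname{Tr}([J,K]^2)$ and that $[J,K]$ is symmetric. It follows that $\operatorname{Tr}(JKJK)-\operatorname{Tr}(J^2K^2)\ge 0$ with equality if and only if $[J,K]=0$. Together with $-2\gamma^3+4\gamma<0$ for $\gamma>\sqrt{2}$ and $\gamma-\frac{4}{\gamma^3}-\gamma^3-\frac{1}{\gamma}<0$ for $\gamma>0$, the claim follows.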
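The two derivative identities in the proof can be checked numerically. The following Python sketch is an illustration only: it assumes NumPy, the function names `asymptotic_variance` and `second_derivative_formula` are hypothetical, and the value $\gamma=2.5$ is an arbitrary choice. It solves the linear equations (38) by vectorisation, evaluates (40), and compares central finite differences in $\mu$ with the closed-form second derivative, for $d=2$, $l=(1,1)^T$, $K=\operatorname{diag}(2,1)$ and $J$ the generator of planar rotations.

```python
import numpy as np

def asymptotic_variance(mu, gamma, J, K, l):
    """Evaluate (40) after solving the linear equations (38) for C and D."""
    d = J.shape[0]
    I = np.eye(d)
    # Drift matrix B from (35), i.e. equal perturbation strengths mu = nu.
    B = np.block([[mu * J, -I], [I, gamma * I + mu * J]])
    K_bar = np.zeros((2 * d, 2 * d))
    K_bar[:d, :d] = K
    l_bar = np.concatenate([l, np.zeros(d)])
    # Row-major vectorisation: vec(B C) = (B kron I) vec(C), vec(C B^T) = (I kron B) vec(C).
    I2 = np.eye(2 * d)
    lhs = np.kron(B, I2) + np.kron(I2, B)
    C = np.linalg.solve(lhs, K_bar.ravel()).reshape(2 * d, 2 * d)
    D = np.linalg.solve(B.T, l_bar)
    return 2.0 * np.trace(C @ K_bar) + D @ l_bar

def second_derivative_formula(gamma, J, K, l):
    """Closed-form second derivative of the asymptotic variance at mu = 0."""
    trace_term = np.trace(J @ K @ J @ K) - np.trace(J @ J @ K @ K)
    linear_part = (-2 * gamma**3 + 4 * gamma) * np.dot(J @ l, J @ l)
    quadratic_part = (gamma - 4 / gamma**3 - gamma**3 - 1 / gamma) * trace_term
    return linear_part + quadratic_part

# Illustrative parameters (gamma is an arbitrary choice above the threshold).
gamma = 2.5
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
K = np.array([[2.0, 0.0], [0.0, 1.0]])
l = np.array([1.0, 1.0])

h = 1e-3  # finite-difference step in mu
s0, s_plus, s_minus = (asymptotic_variance(m, gamma, J, K, l) for m in (0.0, h, -h))
fd_first = (s_plus - s_minus) / (2 * h)
fd_second = (s_plus - 2 * s0 + s_minus) / h**2
formula = second_derivative_formula(gamma, J, K, l)
print("first derivative (finite difference):", fd_first)
print("second derivative (finite difference):", fd_second)
print("second derivative (closed form):", formula)
```

The finite-difference step `h` trades truncation error against round-off; the comparison is only meaningful up to that accuracy.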

Remark 5

As we will see in Sect. 4.3, Example 1, if $[J,K]=0$ and $l\in\ker J$, the asymptotic variance is constant as a function of $\mu$, i.e. the perturbation has no effect.

Numerical examples show that the conditions $\gamma>\sqrt{2}$ and $\mu=\nu$ are indeed necessary for the conclusions of Theorem 3 to hold (an explanation in terms of the spectrum of the generator will be provided in Sect. 4.2). In particular, an unfortunate choice of the perturbations can actually increase the asymptotic variance of the dynamics.

Let us illustrate this by plotting the asymptotic variance as a function of the perturbation strength $\mu$ (see Fig. 1), making the choices $d=2$, $l=(1,1)^T$,

$$K=\begin{pmatrix}2&0\\ 0&1\end{pmatrix}\quad\text{and}\quad J=\begin{pmatrix}0&1\\ -1&0\end{pmatrix}. \tag{41}$$

The asymptotic variance has been computed according to (38) and (40). Going beyond the results of this section, the graphs give an impression of the behaviour of the asymptotic variance for large values of $\mu$, discussed further in Sect. 4.3.

Fig. 1

Asymptotic variance for linear and quadratic observables, depending on relative perturbation and friction strengths. a Equal perturbations: μ=ν. b Approximately equal perturbations: μ=0.9ν. c Opposing perturbations: μ=-ν. d Equal perturbations: μ=ν (sufficiently large friction γ). e Equal perturbations: μ=ν (small friction γ)

Figure 1a, b, c show the asymptotic variance associated with the quadratic observable $f(q)=q\cdot Kq$. In accordance with Theorem 3, the asymptotic variance is at a local maximum at zero perturbation in the case $\mu=\nu$ (see Fig. 1a). For increasing perturbation strength, the graph shows monotone decay as $\mu\to\infty$ (this limiting behaviour will be explored analytically in Sect. 4.3). If the condition $\mu=\nu$ is only approximately satisfied (Fig. 1b), our numerical examples still exhibit decaying asymptotic variance in the neighbourhood of the critical point. In this case, however, the asymptotic variance diverges for growing values of the perturbation $\mu$. If the perturbations are opposed ($\mu=-\nu$), it is possible for certain observables that the unperturbed dynamics represents a global minimum. Such a case is observed in Fig. 1c. In Fig. 1d, e the observable $f(q)=l\cdot q$ is considered. If the damping is sufficiently strong ($\gamma>\sqrt{2}$), the unperturbed dynamics is at a local maximum of the asymptotic variance (Fig. 1d). Furthermore, the asymptotic variance approaches zero as $\mu\to\infty$ (for a theoretical explanation see again Sect. 4.3). The graph in Fig. 1e shows that the assumption of $\gamma$ not being too small cannot be dropped from Theorem 3. Even in this case, though, the example shows decay of the asymptotic variance for large values of $\mu$.

Exponential Decay Rate

Let us denote by $\lambda^*$ the optimal exponential decay rate in (21), i.e.

$$\lambda^*=\sup\{\lambda>0\mid\text{there exists }C\ge 1\text{ such that (21) holds}\}. \tag{42}$$

Note that $\lambda^*$ is well-defined and positive by Theorem 2. We also define the spectral bound of the generator $\mathcal{L}$ by

$$s(\mathcal{L})=\inf\big(\operatorname{Re}\sigma(-\mathcal{L})\setminus\{0\}\big). \tag{43}$$

In [38] it is proven that the Ornstein–Uhlenbeck semigroup $(P_t)_{t\ge 0}$ considered in this section is differentiable (see Proposition 2.1). In this case (see Corollary 3.12 of [17]), it is known that the exponential decay rate and the spectral bound coincide, i.e. $\lambda^*=s(\mathcal{L})$, whereas in general only $\lambda^*\le s(\mathcal{L})$ holds. In this section we will therefore analyse the spectral properties of the generator $\mathcal{L}$ in the Gaussian case. In particular, this provides some intuition as to why choosing equal perturbations ($\mu=\nu$) is crucial for the performance of the sampler.

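In [38] (see also [43]), it was proven that the spectrum of $\mathcal{L}$ as in (36) in $L^2(\hat\pi)$ is given by

$$\sigma(\mathcal{L})=\Big\{-\sum_{j=1}^{r}n_j\lambda_j:\ n_j\in\mathbb{N},\ \lambda_j\in\sigma(B)\Big\}. \tag{44}$$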

Note that σ(L) only depends on the drift matrix B. In the case where μ=ν, the spectrum of B can be computed explicitly.

Lemma 4

Assume μ=ν. Then the spectrum of B is given by2

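$$\sigma(B)=\Big\{\mu\lambda+\sqrt{\big(\tfrac{\gamma}{2}\big)^2-1}+\tfrac{\gamma}{2}\ \Big|\ \lambda\in\sigma(J)\Big\}\cup\Big\{\mu\lambda-\sqrt{\big(\tfrac{\gamma}{2}\big)^2-1}+\tfrac{\gamma}{2}\ \Big|\ \lambda\in\sigma(J)\Big\}. \tag{45}$$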

Proof

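We will compute $\sigma\big(B-\frac{\gamma}{2}I\big)$ and then use the identity $\sigma(B)=\big\{\lambda+\frac{\gamma}{2}\ \big|\ \lambda\in\sigma\big(B-\frac{\gamma}{2}I\big)\big\}$. We have

$$\det\Big(B-\tfrac{\gamma}{2}I-\lambda I\Big)=\det\Big(\big(\mu J-\tfrac{\gamma}{2}I-\lambda I\big)\big(\mu J+\tfrac{\gamma}{2}I-\lambda I\big)+I\Big)=\det\Big(\mu J-\lambda I+\sqrt{\big(\tfrac{\gamma}{2}\big)^2-1}\,I\Big)\cdot\det\Big(\mu J-\lambda I-\sqrt{\big(\tfrac{\gamma}{2}\big)^2-1}\,I\Big),$$

where $I$ is understood to denote the identity matrix of appropriate dimension. The above quantity is zero if and only if

$$\lambda-\sqrt{\big(\tfrac{\gamma}{2}\big)^2-1}\in\sigma(\mu J)\quad\text{or}\quad\lambda+\sqrt{\big(\tfrac{\gamma}{2}\big)^2-1}\in\sigma(\mu J).$$

Together with the identity for $\sigma(B)$ stated at the beginning of the proof, the claim follows.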
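As a sanity check of Lemma 4, the eigenvalues of the drift matrix $B$ from (35) can be compared with the right-hand side of (45). The sketch below assumes NumPy; the helper names (`drift_matrix`, `spectrum_formula`, `same_multiset`) and the test values of $\mu$ and $\gamma$ are illustrative choices, not taken from the paper. The degenerate case $\gamma=2$ is avoided here because the eigenvalues are then numerically ill-conditioned.

```python
import numpy as np

def drift_matrix(mu, gamma, J):
    """Drift matrix B from (35) for equal perturbations (mu = nu)."""
    d = J.shape[0]
    I = np.eye(d)
    return np.block([[mu * J, -I], [I, gamma * I + mu * J]])

def spectrum_formula(mu, gamma, J):
    """Right-hand side of (45); the square root is taken in the complex plane."""
    root = np.sqrt(complex((gamma / 2.0) ** 2 - 1.0))
    lam_J = np.linalg.eigvals(J)
    return np.concatenate([mu * lam_J + root + gamma / 2.0,
                           mu * lam_J - root + gamma / 2.0])

def same_multiset(a, b, tol=1e-8):
    """Greedy comparison of two collections of complex numbers."""
    remaining = list(b)
    for z in a:
        k = int(np.argmin([abs(z - w) for w in remaining]))
        if abs(z - remaining[k]) > tol:
            return False
        remaining.pop(k)
    return len(remaining) == 0

J = np.array([[0.0, 1.0], [-1.0, 0.0]])
checks = []
for gamma in (1.0, 3.0):          # below and above the critical value gamma = 2
    for mu in (0.0, 0.7, 5.0):
        numeric = np.linalg.eigvals(drift_matrix(mu, gamma, J))
        checks.append(same_multiset(numeric, spectrum_formula(mu, gamma, J)))
print(checks)
```

In both regimes the real parts of the eigenvalues do not depend on $\mu$, which is the observation used in the discussion of Fig. 2a.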

Using formula (45), in Fig. 2a we show a sketch of the spectrum $\sigma(-\mathcal{L})$ for the case of equal perturbations ($\mu=\nu$) with the convenient choices $n=1$ and $\gamma=2$. Of course, the eigenvalue at 0 is associated to the invariant measure since $\mathcal{L}^*\hat\pi=0$. The arrows indicate the movement of the eigenvalues as the perturbation $\mu$ increases in accordance with Lemma 4. Clearly, the spectral bound of $\mathcal{L}$ is not affected by the perturbation. Note that the eigenvalues on the real axis stay invariant under the perturbation. The subspace of $L_0^2(\hat\pi)$ associated to those will turn out to be crucial for the characterisation of the limiting asymptotic variance as $\mu\to\infty$ (see Remark 10).

Fig. 2

Effects of the perturbation on the spectra of $-\mathcal{L}$ and $B$. a $\sigma(-\mathcal{L})$ in the case $\mu=\nu$. The arrows indicate the movement of the spectrum as the perturbation strength $\mu$ increases. b $\sigma(B)$ in the case $J_1=0$, i.e. the dynamics is only perturbed by $-\nu J_2p\,dt$. The arrows indicate the movement of the eigenvalues as $\nu$ increases

To illustrate the suboptimal properties of the perturbed dynamics when the perturbations are not equal, we plot the spectrum of the drift matrix $\sigma(B)$ in the case when the dynamics is only perturbed by the term $\nu J_2p\,dt$ (i.e. $\mu=0$) for $n=2$, $\gamma=2$ and

$$J_2=\begin{pmatrix}0&-1\\ 1&0\end{pmatrix}, \tag{46}$$

(see Fig. 2b). Note that the full spectrum σ(-L) can be inferred from (44). For ν=0 we have that the spectrum σ(B) only consists of the (degenerate) eigenvalue 1. For increasing ν, the figure shows that the degenerate eigenvalue splits up into four eigenvalues, two of which get closer to the imaginary axis as ν increases, leading to a smaller spectral bound and therefore to a decrease in the speed of convergence to equilibrium. Figure 2a, b give an intuitive explanation of why the fine-tuning of the perturbation strengths is crucial.
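The contrast between the two panels of Fig. 2 can be reproduced with a few lines of Python. In this sketch (NumPy assumed; `spectral_bound_B` and the list of perturbation strengths are illustrative names and values) the drift matrix is taken to be $B=\begin{pmatrix}\mu J_1&-I\\ I&\gamma I+\nu J_2\end{pmatrix}$, which reduces to (35) for $J_1=J_2$ and $\mu=\nu$, and the smallest real part of its eigenvalues is printed for equal perturbations and for a perturbation of the momentum equation only.

```python
import numpy as np

def spectral_bound_B(mu, nu, gamma, J1, J2):
    """Smallest real part of the eigenvalues of the drift matrix B."""
    d = J1.shape[0]
    I = np.eye(d)
    B = np.block([[mu * J1, -I], [I, gamma * I + nu * J2]])
    return np.linalg.eigvals(B).real.min()

J = np.array([[0.0, -1.0], [1.0, 0.0]])   # the matrix in (46)
gamma = 2.0
strengths = (0.0, 1.0, 2.0, 4.0)          # illustrative perturbation strengths

equal = [spectral_bound_B(nu, nu, gamma, J, J) for nu in strengths]
p_only = [spectral_bound_B(0.0, nu, gamma, J, J) for nu in strengths]
for nu, a, b in zip(strengths, equal, p_only):
    print(f"nu = {nu:3.1f}   equal perturbations: {a:.4f}   p-perturbation only: {b:.4f}")
```

With equal perturbations the bound stays at $\gamma/2$, whereas it shrinks as $\nu$ grows when only the momentum equation is perturbed. Since $\gamma=2$ is a degenerate case, the computed eigenvalues are accurate only to roughly the square root of machine precision.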

We close this subsection by providing autocorrelation plots (see Fig. 3) for the linear observable considered in Fig. 1d (with a friction coefficient of $\gamma=2.5$). It is well-known that the asymptotic variance is given by the integrated autocorrelation function (see e.g. Proposition IV 1.3 in [3]),

$$\sigma_f^2=\int_0^\infty\mathbb{E}_\pi\Big[\big(f(q_0)-\pi(f)\big)\big(f(q_t)-\pi(f)\big)\Big]\,dt. \tag{47}$$

Comparing Fig. 3a, b yields additional insight into the mechanics of the variance reduction: the increase of the imaginary part of the eigenvalues of L (as indicated in Fig. 2a) leads to oscillations in the autocorrelation function and therefore to cancellations in (47). A similar effect has already been observed in [50] for the nonreversible overdamped Langevin dynamics (15).

Fig. 3

Autocorrelation plots for the perturbed and unperturbed dynamics. a Unperturbed Langevin dynamics. b Perturbed Langevin dynamics

Unit Covariance: Large Perturbations

In the previous subsection we observed that for the particular perturbation $J_1=J_2$ and $\mu=\nu$ [see equation (34)] the perturbed Langevin dynamics demonstrated an improvement in performance for $\mu$ in a neighbourhood of 0, when the observable is linear or quadratic. Recall that this dynamics is ergodic with respect to a standard Gaussian measure $\hat\pi$ on $\mathbb{R}^{2d}$ with marginal $\pi$ with respect to the $q$-variable. As before, we shall consider only observables that do not depend on $p$. Moreover, we assume without loss of generality that $\pi(f)=0$. For such observables we will write $f\in L_0^2(\pi)$ and consider the canonical embedding $L_0^2(\pi)\subset L^2(\hat\pi)$. We emphasize that $L_0^2(\pi)$ consists of functions that only depend on $q$, whereas functions in $L^2(\hat\pi)$ may depend on both $q$ and $p$.

In this subsection we will analyse the asymptotic variance for large values of $\mu$. The infinitesimal generator of (34) can be written as

$$\mathcal{L}=\underbrace{p\cdot\nabla_q-q\cdot\nabla_p+\gamma(-p\cdot\nabla_p+\Delta_p)}_{\mathcal{L}_0}+\mu\underbrace{(-Jq\cdot\nabla_q-Jp\cdot\nabla_p)}_{\mathcal{A}}=:\mathcal{L}_0+\mu\mathcal{A}, \tag{48}$$

where we have introduced the notation $\mathcal{L}_{pert}=\mu\mathcal{A}$. In the sequel, the adjoint of an operator $B$ in $L^2(\hat\pi)$ will be denoted by $B^*$. In the rest of this section we will make repeated use of the Hermite polynomials

$$g_\alpha(x)=(-1)^{|\alpha|}e^{\frac{|x|^2}{2}}\nabla^\alpha e^{-\frac{|x|^2}{2}},\qquad\alpha\in\mathbb{N}^{2d}, \tag{49}$$

invoking the notation $x=(q,p)\in\mathbb{R}^{2d}$. For $m\in\mathbb{N}_0$ define the Hilbert spaces

$$H_m=\operatorname{span}\{g_\alpha:|\alpha|=m\},\qquad\langle f,g\rangle_m:=\langle f,g\rangle_{L^2(\hat\pi)},\quad f,g\in H_m.$$

The following result (Theorem 4) holds for operators of the form (36), providing an orthogonal decomposition of $L^2(\hat\pi)$ into invariant subspaces. The drift and diffusion matrices $B$ and $Q$ are assumed to be such that $\mathcal{L}$ is the generator of an ergodic stochastic process (see [2, Definition 2.1] for precise conditions).

Theorem 4

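[2, Sect. 5]. The following holds:

  (a) The space $L^2(\hat\pi)$ has a decomposition into mutually orthogonal subspaces:
    $$L^2(\hat\pi)=\bigoplus_{m\in\mathbb{N}_0}H_m.$$
  (b) For all $m\in\mathbb{N}_0$, $H_m$ is invariant under $\mathcal{L}$ as well as under the semigroup $(e^{t\mathcal{L}})_{t\ge 0}$.

  (c) The spectrum of $\mathcal{L}$ has the following decomposition:
    $$\sigma(\mathcal{L})=\bigcup_{m\in\mathbb{N}_0}\sigma(\mathcal{L}|_{H_m}),\qquad\sigma(\mathcal{L}|_{H_m})=\Big\{\sum_{j=1}^{2d}\alpha_j\lambda_j:\ |\alpha|=m,\ \lambda_j\in\sigma(B)\Big\}.$$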

Remark 6

Note that by the ergodicity of the dynamics, $\ker\mathcal{L}$ consists of constant functions and so $\ker\mathcal{L}=H_0$. Therefore, $L_0^2(\hat\pi)$ has the decomposition

$$L_0^2(\hat\pi)=L^2(\hat\pi)/\ker\mathcal{L}=\bigoplus_{m\ge 1}H_m.$$

Our first main result of this section is an expression for the asymptotic variance in terms of the unperturbed operator L0 and the perturbation A:

Proposition 2

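Let $f\in L_0^2(\pi)$ (so in particular $f=f(q)$). Then the associated asymptotic variance is given by

$$\sigma_f^2=\big\langle f,-\mathcal{L}_0(\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A})^{-1}f\big\rangle_{L^2(\hat\pi)}. \tag{50}$$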

Remark 7

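The proof of the preceding Proposition will show that $\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A}$ is invertible on $L_0^2(\hat\pi)$ and that $(\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A})^{-1}f\in D(\mathcal{L}_0)$ for all $f\in L_0^2(\hat\pi)$.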

To prove Proposition 2 we will make use of the generator with reversed perturbation

$$\mathcal{L}_-=\mathcal{L}_0-\mu\mathcal{A}$$

and the momentum flip operator

$$P:L_0^2(\hat\pi)\to L_0^2(\hat\pi),\qquad\phi(q,p)\mapsto\phi(q,-p). \tag{51}$$

Clearly, $P^2=I$ and $P^*=P$. Further properties of $\mathcal{L}_0$, $\mathcal{A}$ and the auxiliary operators $\mathcal{L}_-$ and $P$ are gathered in the following lemma:

Lemma 5

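For all $\phi,\psi\in C^\infty(\mathbb{R}^{2d})\cap L^2(\hat\pi)$ the following holds:

  (a) The generator $\mathcal{L}_0$ is symmetric in $L^2(\hat\pi)$ with respect to $P$:
    $$\langle\phi,P\mathcal{L}_0P\psi\rangle_{L^2(\hat\pi)}=\langle\mathcal{L}_0\phi,\psi\rangle_{L^2(\hat\pi)}.$$
  (b) The perturbation $\mathcal{A}$ is skewadjoint in $L^2(\hat\pi)$:
    $$\mathcal{A}^*=-\mathcal{A}.$$
  (c) The operators $\mathcal{L}_0$ and $\mathcal{A}$ commute:
    $$[\mathcal{L}_0,\mathcal{A}]\phi=0.$$
  (d) The perturbation $\mathcal{A}$ satisfies
    $$P\mathcal{A}P\phi=\mathcal{A}\phi.$$
  (e) $\mathcal{L}$ and $\mathcal{L}_-$ commute,
    $$[\mathcal{L},\mathcal{L}_-]\phi=0,$$
    and the following relation holds:
    $$\langle\phi,P\mathcal{L}P\psi\rangle_{L^2(\hat\pi)}=\langle\mathcal{L}_-\phi,\psi\rangle_{L^2(\hat\pi)}. \tag{52}$$
  (f) The operators $\mathcal{L}$, $\mathcal{L}_0$, $\mathcal{L}_-$, $\mathcal{A}$ and $P$ leave the Hermite spaces $H_m$ invariant.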

Remark 8

The claim (c) in the above lemma is crucial for our approach, which itself rests heavily on the fact that the q- and p-perturbations match (J1=J2).

Proof of Lemma 5

The statement (a) is well-known and its proof can be found in [35, Sect. 2.2.3.1] for instance. The claim (b) follows by noting that the flow vector field $b(q,p)=(-Jq,-Jp)$ associated to $\mathcal{A}$ is divergence-free with respect to $\hat\pi$, i.e. $\nabla\cdot(\hat\pi b)=0$. Therefore, $\mathcal{A}$ is the generator of a strongly continuous unitary semigroup on $L^2(\hat\pi)$ and hence skewadjoint by Stone’s Theorem. The claims (c), (d) and (e) follow by direct computations which can be found in [41]. To prove (f) first notice that $\mathcal{L}$, $\mathcal{L}_0$ and $\mathcal{L}_-$ are of the form (36) and therefore leave the spaces $H_m$ invariant by Theorem 4. It follows immediately that $\mathcal{A}$ also leaves those spaces invariant. The fact that $P$ leaves the spaces $H_m$ invariant follows directly by inspection of (49) and (51).

Now we proceed with the proof of Proposition 2:

Proof of Proposition 2

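Since the potential $V$ is quadratic, Assumption 1 clearly holds and thus Lemma 2 ensures that $\mathcal{L}$ and $\mathcal{L}_-$ are invertible on $L_0^2(\hat\pi)$ with

$$(-\mathcal{L})^{-1}=\int_0^\infty e^{t\mathcal{L}}\,dt, \tag{53}$$

and analogously for $\mathcal{L}_-^{-1}$. In particular, the asymptotic variance can be written as $\sigma_f^2=\langle f,(-\mathcal{L})^{-1}f\rangle_{L^2(\hat\pi)}$. Due to the representation (53) and Theorem 4, the inverses of $\mathcal{L}$ and $\mathcal{L}_-$ leave the Hermite spaces $H_m$ invariant. We will prove the claim from Proposition 2 under the assumption that $Pf=f$, which includes the case $f=f(q)$. For the following calculations we will assume $f\in H_m$ for fixed $m\ge 1$. Combining statement (f) with (a) and (e) of Lemma 5 (and noting that $H_m\subset C^\infty(\mathbb{R}^{2d})\cap L^2(\hat\pi)$) we see that

$$P\mathcal{L}P=\mathcal{L}_-^*\quad\text{and}\quad P\mathcal{L}_0P=\mathcal{L}_0^* \tag{54}$$

when restricted to $H_m$. Therefore, the following calculations are justified:

$$\begin{aligned}\langle f,(-\mathcal{L})^{-1}f\rangle_{L^2(\hat\pi)}&=\tfrac{1}{2}\Big(\langle f,(-\mathcal{L})^{-1}f\rangle_{L^2(\hat\pi)}+\langle f,(-\mathcal{L})^{-1}f\rangle_{L^2(\hat\pi)}\Big)\\&=\tfrac{1}{2}\Big(\langle f,(-\mathcal{L})^{-1}f\rangle_{L^2(\hat\pi)}+\langle Pf,(-\mathcal{L})^{-1}Pf\rangle_{L^2(\hat\pi)}\Big)\\&=\tfrac{1}{2}\Big(\langle f,(-\mathcal{L})^{-1}f\rangle_{L^2(\hat\pi)}+\langle f,(-\mathcal{L}_-^*)^{-1}f\rangle_{L^2(\hat\pi)}\Big)\\&=\tfrac{1}{2}\Big\langle f,\big((-\mathcal{L})^{-1}+(-\mathcal{L}_-)^{-1}\big)f\Big\rangle_{L^2(\hat\pi)},\end{aligned}$$

where in the second line we have used the assumption $Pf=f$ and in the third line the properties $P^2=I$, $P^*=P$ and Eq. (54). Since $\mathcal{L}$ and $\mathcal{L}_-$ commute on $H_m$ according to Lemma 5(e),(f) we can write

$$(-\mathcal{L})^{-1}+(-\mathcal{L}_-)^{-1}=\mathcal{L}_-(-\mathcal{L}\mathcal{L}_-)^{-1}+\mathcal{L}(-\mathcal{L}\mathcal{L}_-)^{-1}=-2\mathcal{L}_0(\mathcal{L}\mathcal{L}_-)^{-1}$$

for the restrictions on $H_m$, using $\mathcal{L}+\mathcal{L}_-=2\mathcal{L}_0$. We also have $\mathcal{L}\mathcal{L}_-=(\mathcal{L}_0+\mu\mathcal{A})(\mathcal{L}_0-\mu\mathcal{A})=\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A}$, since $\mathcal{L}_0$ and $\mathcal{A}$ commute and $\mathcal{A}^*=-\mathcal{A}$. We thus arrive at the formula

$$\sigma_f^2=\big\langle f,-\mathcal{L}_0(\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A})^{-1}f\big\rangle_{L^2(\hat\pi)},\qquad f\in H_m. \tag{55}$$

Now since $(\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A})^{-1}f=(\mathcal{L}\mathcal{L}_-)^{-1}f\in D(\mathcal{L}_0)$ for all $f\in L^2(\hat\pi)$, it follows that the operator $-\mathcal{L}_0(\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A})^{-1}$ is bounded. We can therefore extend formula (55) to the whole of $L^2(\hat\pi)$ by continuity, using the fact that $L_0^2(\hat\pi)=\bigoplus_{m\ge 1}H_m$.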

Applying Proposition 2 we can analyse the behaviour of $\sigma_f^2$ in the limit of large perturbation strength $\mu\to\infty$. To this end, we introduce the orthogonal decomposition

$$L_0^2(\pi)=\ker(Jq\cdot\nabla_q)\oplus\ker(Jq\cdot\nabla_q)^\perp, \tag{56}$$

where $Jq\cdot\nabla_q$ is understood as an unbounded operator acting on $L_0^2(\pi)$, obtained as the smallest closed extension of $Jq\cdot\nabla_q$ acting on $C_c^\infty(\mathbb{R}^d)$. In particular, $\ker(Jq\cdot\nabla_q)$ is a closed linear subspace of $L_0^2(\pi)$. Let $\Pi$ denote the $L_0^2(\pi)$-orthogonal projection onto $\ker(Jq\cdot\nabla_q)$. We will write $\sigma_f^2(\mu)$ to stress the dependence of the asymptotic variance on the perturbation strength. The following result shows that for large perturbations, the limiting asymptotic variance is never larger than the asymptotic variance in the unperturbed case. Furthermore, the limit is given as the asymptotic variance of the projected observable $\Pi f$ for the unperturbed dynamics.

Theorem 5

Let $f\in L_0^2(\pi)$ (so in particular $f=f(q)$). Then $\lim_{\mu\to\infty}\sigma_f^2(\mu)=\sigma_{\Pi f}^2(0)\le\sigma_f^2(0)$.

Remark 9

Note that the fact that the limit exists and is finite is nontrivial. In particular, as Fig. 1b, c demonstrate, it is often the case that $\lim_{\mu\to\infty}\sigma_f^2(\mu)=\infty$ if the condition $\mu=\nu$ is not satisfied.
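Theorem 5 can be illustrated for the quadratic observable $f(q)=q\cdot Kq-\operatorname{Tr}K$ with $K=\operatorname{diag}(2,1)$ and $J$ the generator of planar rotations. The Python sketch below (NumPy assumed; `sigma2_quadratic`, the value $\gamma=2.5$ and the grid of $\mu$ values are illustrative choices) evaluates (40) via the linear equation (38) for growing $\mu$. It relies on the identification $\Pi f=q\cdot K_1q-\operatorname{Tr}K$ with $K_1=\frac{\operatorname{Tr}K}{d}I$, which in $d=2$ follows from Lemma 8 in Sect. 4.4, so the expected limit is the unperturbed asymptotic variance of the trace part.

```python
import numpy as np

def sigma2_quadratic(mu, gamma, J, K):
    """Asymptotic variance 2 Tr(C K_bar) of f(q) = q.Kq - Tr K, using (38) and (40)."""
    d = J.shape[0]
    I = np.eye(d)
    B = np.block([[mu * J, -I], [I, gamma * I + mu * J]])   # drift matrix from (35)
    K_bar = np.zeros((2 * d, 2 * d))
    K_bar[:d, :d] = K
    I2 = np.eye(2 * d)
    # Row-major vectorisation of B C + C B^T = K_bar.
    lhs = np.kron(B, I2) + np.kron(I2, B)
    C = np.linalg.solve(lhs, K_bar.ravel()).reshape(2 * d, 2 * d)
    return 2.0 * np.trace(C @ K_bar)

gamma = 2.5                                   # illustrative friction
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
K = np.array([[2.0, 0.0], [0.0, 1.0]])
K1 = np.trace(K) / 2.0 * np.eye(2)            # trace part of K (assumed to represent Pi f)

limit = sigma2_quadratic(0.0, gamma, J, K1)   # sigma^2_{Pi f}(0)
values = {mu: sigma2_quadratic(mu, gamma, J, K) for mu in (0.0, 1.0, 10.0, 100.0)}
for mu, val in values.items():
    print(f"mu = {mu:6.1f}   sigma_f^2 = {val:.6f}")
print(f"sigma^2 of the projected observable at mu = 0: {limit:.6f}")
```

The printed values decrease from the unperturbed variance towards the variance of the projected observable, in line with the inequality in Theorem 5.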

Remark 10

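The decomposition (56) can be interpreted in terms of the spectrum $\sigma(\mathcal{L})$ as follows: First observe that for functions $f$ that only depend on $q$, $f\in\ker(Jq\cdot\nabla_q)$ is equivalent to $f\in\ker\mathcal{A}$. Let us denote by $\bar\sigma=\bigcap_{\mu\in\mathbb{R}}\sigma(\mathcal{L}_0+\mu\mathcal{A})$ the part of $\sigma(\mathcal{L}_0)$ that is not affected by the perturbation and by

$$\bar{E}=\overline{\operatorname{span}\{f\in L^2(\hat\pi):\ \text{there exists }\lambda\in\bar\sigma\text{ such that }\mathcal{L}_0f=\lambda f\}}^{\,L^2(\hat\pi)}$$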

the corresponding subspace. Then it is straightforward to see that ker(A)=E¯.3 In Fig. 2a, σ¯ has been highlighted by diamonds.

Proof of Theorem 5

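Note that $\mathcal{L}_0$ and $\mathcal{A}^*\mathcal{A}$ leave the Hermite spaces $H_m$ invariant and their restrictions to those spaces commute (see Lemma 5, (b), (c) and (f)). Furthermore, as the Hermite spaces $H_m$ are finite-dimensional, those operators have discrete spectrum. As $\mathcal{A}^*\mathcal{A}$ is nonnegative self-adjoint, there exists an orthogonal decomposition $L_0^2(\pi)=\bigoplus_iW_i$ into eigenspaces of the operator $-\mathcal{L}_0(\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A})^{-1}$, the decomposition $\bigoplus_iW_i$ being finer than $\bigoplus_mH_m$ in the sense that every $W_i$ is a subspace of some $H_m$. Moreover, $-\mathcal{L}_0(\mathcal{L}_0^2+\mu^2\mathcal{A}^*\mathcal{A})^{-1}|_{W_i}=-\mathcal{L}_0(\mathcal{L}_0^2+\mu^2\lambda_i)^{-1}|_{W_i}$, where $\lambda_i\ge 0$ is the eigenvalue of $\mathcal{A}^*\mathcal{A}$ associated to the subspace $W_i$. Consequently, formula (50) can be written as

$$\sigma_f^2=\sum_i\big\langle f_i,-\mathcal{L}_0(\mathcal{L}_0^2+\mu^2\lambda_i)^{-1}f_i\big\rangle_{L^2(\hat\pi)}, \tag{57}$$

where $f=\sum_if_i$ and $f_i\in W_i$. Let us assume now without loss of generality that $W_0=\ker\mathcal{A}^*\mathcal{A}$, so in particular $\lambda_0=0$. Then clearly

$$\lim_{\mu\to\infty}\sigma_f^2=2\big\langle f_0,-\mathcal{L}_0(\mathcal{L}_0^2)^{-1}f_0\big\rangle_{L^2(\hat\pi)}=2\big\langle f_0,(-\mathcal{L}_0)^{-1}f_0\big\rangle_{L^2(\hat\pi)}=\sigma_{f_0}^2(0).$$

Now notice that $W_0=\ker\mathcal{A}^*\mathcal{A}=\ker\mathcal{A}$, showing the equality in the claim. It remains to show that $\sigma_{\Pi f}^2(0)\le\sigma_f^2(0)$. To see this, we write

$$\sigma_f^2(0)=2\big\langle f,(-\mathcal{L}_0)^{-1}f\big\rangle_{L^2(\hat\pi)}=2\big\langle\Pi f+(1-\Pi)f,(-\mathcal{L}_0)^{-1}(\Pi f+(1-\Pi)f)\big\rangle_{L^2(\hat\pi)}=\sigma_{\Pi f}^2(0)+\sigma_{(1-\Pi)f}^2(0)+R,$$

where

$$R=2\big\langle\Pi f,(-\mathcal{L}_0)^{-1}(1-\Pi)f\big\rangle_{L^2(\hat\pi)}+2\big\langle(1-\Pi)f,(-\mathcal{L}_0)^{-1}\Pi f\big\rangle_{L^2(\hat\pi)}.$$

Note that since we only consider observables that do not depend on $p$, $\Pi f\in\ker(Jq\cdot\nabla_q)$ and $(1-\Pi)f\in\bigoplus_{i\ge 1}W_i$. Since $\mathcal{L}_0$ commutes with $\mathcal{A}$, it follows that $(-\mathcal{L}_0)^{-1}$ leaves both $W_0$ and $\bigoplus_{i\ge 1}W_i$ invariant. Therefore, as the latter spaces are orthogonal to each other, it follows that $R=0$. Since $\sigma_{(1-\Pi)f}^2(0)\ge 0$, the result follows.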

From Theorem 5 it follows that in the limit as $\mu\to\infty$, the asymptotic variance $\sigma_f^2(\mu)$ is not decreased by the perturbation if $f\in\ker(Jq\cdot\nabla_q)$. In fact, this result also holds true non-asymptotically, i.e. observables in $\ker(Jq\cdot\nabla_q)$ are not affected at all by the perturbation:

Lemma 6

Let $f\in\ker(Jq\cdot\nabla_q)$. Then $\sigma_f^2(\mu)=\sigma_f^2(0)$ for all $\mu\in\mathbb{R}$.

Proof

From $f\in\ker(Jq\cdot\nabla_q)$ it follows immediately that $f\in\ker\mathcal{A}^*\mathcal{A}$. Then the claim follows from the expression (57).

Example 1

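Recall the case of observables of the form $f(q)=q\cdot Kq+l\cdot q+C$ with $K\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$, $l\in\mathbb{R}^d$ and $C\in\mathbb{R}$ from Sect. 4.1. If $[J,K]=0$ and $l\in\ker J$, then $f\in\ker(Jq\cdot\nabla_q)$ as

$$Jq\cdot\nabla_q(q\cdot Kq+l\cdot q+C)=2Jq\cdot Kq+Jq\cdot l=q\cdot(KJ-JK)q-q\cdot Jl=0.$$

From the preceding lemma it follows that $\sigma_f^2(\mu)=\sigma_f^2(0)$ for all $\mu\in\mathbb{R}$, showing that the assumption in Theorem 3 does not exclude nontrivial cases.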

The following result shows that the dynamics (34) is particularly effective for antisymmetric observables (at least in the limit of large perturbations):

Proposition 3

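Let $f\in L_0^2(\pi)$ satisfy $f(-q)=-f(q)$ and assume that $\ker J=\{0\}$. Furthermore, assume that the eigenvalues of $J$ are rationally independent, i.e.

$$\sigma(J)=\{\pm i\lambda_1,\pm i\lambda_2,\ldots,\pm i\lambda_d\} \tag{58}$$

with $\lambda_i\in\mathbb{R}_{>0}$ and $\sum_ik_i\lambda_i\neq 0$ for all $(k_1,\ldots,k_d)\in\mathbb{Z}^d\setminus(0,\ldots,0)$. Then $\lim_{\mu\to\infty}\sigma_f^2(\mu)=0$.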

Proof of Proposition 3

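The claim would immediately follow from $f\in\ker(Jq\cdot\nabla)^\perp$ according to Theorem 5, but that does not seem to be so easy to prove directly. Instead, we again make use of the Hermite polynomials.

Recall from the proof of Proposition 2 that $\mathcal{L}$ is invertible on $L_0^2(\hat\pi)$ and its inverse leaves the Hermite spaces $H_m$ invariant. Consequently, the asymptotic variance of an observable $f\in L_0^2(\hat\pi)$ can be written as

$$\sigma_f^2=\big\langle f,(-\mathcal{L})^{-1}f\big\rangle_{L^2(\hat\pi)}=\sum_{m=1}^\infty\big\langle\Pi_mf,(-\mathcal{L}|_{H_m})^{-1}\Pi_mf\big\rangle_{L^2(\hat\pi)}, \tag{59}$$

where $\Pi_m:L_0^2(\hat\pi)\to H_m$ denotes the orthogonal projection onto $H_m$. From (49) it is clear that $g_\alpha$ is symmetric for $|\alpha|$ even and antisymmetric for $|\alpha|$ odd. Therefore, from $f$ being antisymmetric it follows that $f\in\bigoplus_{m\ge 1,\,m\text{ odd}}H_m$. In view of (45), Theorem 4(c) and (58) the spectrum of $\mathcal{L}|_{H_m}$ can be written as

$$\sigma(\mathcal{L}|_{H_m})=\Big\{\mu\sum_{j=1}^{2d}\alpha_j\beta_j+C_{\alpha,\gamma}:\ |\alpha|=m,\ \beta_j\in\sigma(J)\Big\}=\Big\{i\mu\sum_{j=1}^{d}(\alpha_j-\alpha_{j+d})\lambda_j+C_{\alpha,\gamma}:\ |\alpha|=m\Big\} \tag{60}$$

with appropriate real constants $C_{\alpha,\gamma}\in\mathbb{R}$ that depend on $\alpha$ and $\gamma$, but not on $\mu$. For $|\alpha|=\sum_{j=1}^{2d}\alpha_j=m$ odd, we have that

$$\sum_{j=1}^{d}(\alpha_j-\alpha_{j+d})\lambda_j\neq 0. \tag{61}$$

Indeed, assume to the contrary that the above expression is zero. Then it would follow that $\alpha_j=\alpha_{j+d}$ for all $j=1,\ldots,d$ by rational independence of $\lambda_1,\ldots,\lambda_d$, and $m=|\alpha|$ would have to be even. From (60) and (61) it is clear that

$$\sup\big\{r>0:\ B(0,r)\cap\sigma(\mathcal{L}|_{H_m})=\emptyset\big\}\xrightarrow{\ \mu\to\infty\ }\infty,$$

where $B(0,r)$ denotes the ball of radius $r$ centered at the origin in $\mathbb{C}$. Consequently, the spectral radius of $(-\mathcal{L}|_{H_m})^{-1}$ and hence $(-\mathcal{L}|_{H_m})^{-1}$ itself converges to zero as $\mu\to\infty$. The result then follows from (59).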

Remark 11

The idea of the preceding proof can be explained using Fig. 2a and Remark 10. The eigenvalues in the fixed spectrum E¯ (on the real axis, highlighted by diamonds) correspond to Hermite polynomials of even order. The independence condition on the eigenvalues of J prevents cancellations that would lead to fixed eigenvalues associated to Hermite polynomials of odd order. Therefore, antisymmetric observables are orthogonal to E¯=kerA.

The following corollary gives a version of the converse of Proposition 3 and provides further intuition into the mechanics of the variance reduction achieved by the perturbation.

Corollary 1

Let $f\in L_0^2(\pi)$ and assume that $\lim_{\mu\to\infty}\sigma_f^2(\mu)=0$. Then

$$\int_{B(0,r)}f\,dq=0$$

for all $r\in(0,\infty)$, where $B(0,r)$ denotes the ball centered at 0 with radius $r$.

Proof

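According to Theorem 5, $\lim_{\mu\to\infty}\sigma_f^2(\mu)=0$ implies $\sigma_{\Pi f}^2(0)=0$. We can write

$$\sigma_{\Pi f}^2(0)=\big\langle\Pi f,(-\mathcal{L}_0)^{-1}\Pi f\big\rangle_{L^2(\hat\pi)}=\tfrac{1}{2}\Big\langle\Pi f,\big((-\mathcal{L}_0)^{-1}+(-\mathcal{L}_0^*)^{-1}\big)\Pi f\Big\rangle_{L^2(\hat\pi)} \tag{62}$$

and recall from the proof of Proposition 2 that $(-\mathcal{L}_0)^{-1}$ and $(-\mathcal{L}_0^*)^{-1}$ leave the Hermite spaces $H_m$ invariant. Therefore $\ker\big((-\mathcal{L}_0)^{-1}+(-\mathcal{L}_0^*)^{-1}\big)=0$ in $L_0^2(\hat\pi)$, and in particular $\sigma_{\Pi f}^2(0)=0$ implies $\Pi f=0$, which in turn shows that $f\in\ker(Jq\cdot\nabla)^\perp$. From $(Jq\cdot\nabla)^*=-Jq\cdot\nabla$, it follows that

$$\ker(Jq\cdot\nabla)^\perp=\overline{\operatorname{im}(Jq\cdot\nabla)^*}=\overline{\operatorname{im}(Jq\cdot\nabla)}. \tag{63}$$

Hence, there exists a sequence $(\phi_n)_n\subset C_c^\infty(\mathbb{R}^d)$ such that $Jq\cdot\nabla\phi_n\to f$ in $L^2(\pi)$. Taking a subsequence if necessary, we can assume that the convergence is pointwise $\pi$-almost everywhere and that the sequence is pointwise bounded by a function in $L^1(\pi)$. Since $J$ is antisymmetric, we have that $Jq\cdot\nabla\phi_n=\nabla\cdot(\phi_nJq)$. Now Gauss’s theorem yields

$$\int_{B(0,r)}f\,dq=\int_{B(0,r)}\nabla\cdot(\phi Jq)\,dq=\int_{\partial B(0,r)}\phi Jq\cdot dn,$$

where $n$ denotes the outward normal to the sphere $\partial B(0,r)$. This quantity is zero due to the orthogonality of $Jq$ and $n$, and so the result follows from Lebesgue’s dominated convergence theorem.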

Optimal Choices of J for Quadratic Observables

Assume $f\in L_0^2(\pi)$ is given by $f(q)=q\cdot Kq+l\cdot q-\operatorname{Tr}K$, with $K\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$ and $l\in\mathbb{R}^d$ (note that the constant term is chosen such that $\pi(f)=0$). Our objective is to choose $J$ in such a way that $\lim_{\mu\to\infty}\sigma_f^2(\mu)$ becomes as small as possible. To stress the dependence on the choice of $J$, we introduce the notation $\sigma_f^2(\mu,J)$. Also, we denote the orthogonal projection onto $\ker J$ by $\Pi_{\ker J}$.

Lemma 7

(Zero variance limit for linear observables). Assume $K=0$ and $\Pi_{\ker J}l=0$. Then

$$\lim_{\mu\to\infty}\sigma_f^2(\mu,J)=0.$$

Proof

According to Theorem 5, we have to show that $\Pi f=0$, where $\Pi$ is the $L^2(\pi)$-orthogonal projection onto $\ker(Jq\cdot\nabla)$. Let us thus use (63) and prove that $f\in\overline{\operatorname{im}(Jq\cdot\nabla)}$. Indeed, since $\Pi_{\ker J}l=0$, by Fredholm’s alternative there exists $u\in\mathbb{R}^d$ such that $Ju=l$. Now define $\phi\in L_0^2(\pi)$ by $\phi(q)=-u\cdot q$, leading to $f=Jq\cdot\nabla\phi$, so the result follows.

Lemma 8

(Zero variance limit for purely quadratic observables.) Let $l=0$ and consider the decomposition $K=K_0+K_1$ into the traceless part $K_0=K-\frac{\operatorname{Tr}K}{d}\cdot I$ and the trace-part $K_1=\frac{\operatorname{Tr}K}{d}\cdot I$. For the corresponding decomposition of the observable

$$f(q)=f_0(q)+f_1(q)=q\cdot K_0q+q\cdot K_1q-\operatorname{Tr}K$$

the following holds:

  (a) There exists an antisymmetric matrix $J$ such that $\lim_{\mu\to\infty}\sigma_{f_0}^2(\mu,J)=0$, and there is an algorithmic way (see Algorithm 1) to compute an appropriate $J$ in terms of $K$.

  (b) The trace-part is not affected by the perturbation, i.e. $\sigma_{f_1}^2(\mu,J)=\sigma_{f_1}^2(0)$ for all $\mu\in\mathbb{R}$.

Proof

To prove the first claim, according to Theorem 5 it is sufficient to show that $f_0\in\ker(Jq\cdot\nabla)^\perp=\overline{\operatorname{im}(Jq\cdot\nabla)}$. Let us consider the function $\phi(q)=q\cdot Aq$, with $A\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$. It holds that $Jq\cdot\nabla\phi=2q\cdot(J^TAq)=q\cdot[A,J]q$. The task of finding an antisymmetric matrix $J$ such that $\lim_{\mu\to\infty}\sigma_{f_0}^2(\mu,J)=0$ can therefore be accomplished by constructing an antisymmetric matrix $J$ such that there exists a symmetric matrix $A$ with the property $K_0=[A,J]$. Given any traceless matrix $K_0$ there exists an orthogonal matrix $U\in O(\mathbb{R}^d)$ such that $UK_0U^T$ has zero entries on the diagonal, and $U$ can be obtained in an algorithmic manner (see for example [29] or [22, Chap. 2, Sect. 2, Problem 3]). Assume thus that such a matrix $U\in O(\mathbb{R}^d)$ has been found and choose real numbers $a_1,\ldots,a_d\in\mathbb{R}$ such that $a_i\neq a_j$ if $i\neq j$. We now set $\bar{A}=\operatorname{diag}(a_1,\ldots,a_d)$, and

$$\bar{J}_{ij}=\begin{cases}\dfrac{(UK_0U^T)_{ij}}{a_i-a_j}&\text{if }i\neq j,\\[2mm]0&\text{if }i=j.\end{cases} \tag{64}$$

Observe that since $UK_0U^T$ is symmetric, $\bar{J}$ is antisymmetric. A short calculation shows that $[\bar{A},\bar{J}]=UK_0U^T$. We can thus define $A=U^T\bar{A}U$ and $J=U^T\bar{J}U$ to obtain $[A,J]=K_0$. Therefore, the $J$ constructed in this way indeed satisfies (4.4). For the second claim, note that $f_1\in\ker(Jq\cdot\nabla)$, since $Jq\cdot\nabla\big(q\cdot\frac{\operatorname{Tr}K}{d}q\big)=2\frac{\operatorname{Tr}K}{d}\,q\cdot Jq=0$ due to the antisymmetry of $J$. The result then follows from Lemma 6.

We would like to stress that the perturbation $J$ constructed in the previous lemma is far from unique due to the freedom of choice of $U$ and $a_1,\ldots,a_d\in\mathbb{R}$ in its proof. However, it is asymptotically optimal:

Corollary 2

In the setting of Lemma 8 the following holds:

$$\min_{J^T=-J}\Big(\lim_{\mu\to\infty}\sigma_f^2(\mu,J)\Big)=\sigma_{f_1}^2(0).$$

Proof

The claim follows immediately since $f_1\in\ker(Jq\cdot\nabla)$ for arbitrary antisymmetric $J$ as shown in (4.4), and therefore the contribution of the trace part $f_1$ to the asymptotic variance cannot be reduced by any choice of $J$ according to Lemma 6.

As the proof of Lemma 8 is constructive, we obtain the following algorithm for determining optimal perturbations for quadratic observables:

Algorithm 1

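Given $K\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$, determine an optimal antisymmetric perturbation $J$ as follows:

  1. Set $K_0=K-\frac{\operatorname{Tr}K}{d}\cdot I$.

  2. Find $U\in O(\mathbb{R}^d)$ such that $UK_0U^T$ has zero entries on the diagonal.

  3. Choose $a_i\in\mathbb{R}$, $i=1,\ldots,d$ such that $a_i\neq a_j$ for $i\neq j$ and set
    $$\bar{J}_{ij}=\frac{(UK_0U^T)_{ij}}{a_i-a_j}$$
    for $i\neq j$ and $\bar{J}_{ii}=0$.
  4. Set $J=U^T\bar{J}U$.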
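The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: NumPy is assumed, the function names `zero_diagonal_basis` and `optimal_perturbation` are hypothetical, the default choice $a_i=i-1$ is arbitrary, and step 2 is carried out with successive plane rotations, which is one possible constructive procedure and not necessarily the one described in [29] or [22].

```python
import numpy as np

def zero_diagonal_basis(K0, tol=1e-12):
    """Orthogonal U such that U K0 U^T has zero diagonal (K0 symmetric and traceless).

    Assumption: plane rotations are used; each one annihilates one diagonal entry.
    """
    d = K0.shape[0]
    U = np.eye(d)
    M = np.array(K0, dtype=float)
    for i in range(d - 1):
        if abs(M[i, i]) <= tol:
            continue
        # The remaining diagonal sums to zero, so an entry of opposite sign exists.
        j = next(k for k in range(i + 1, d) if M[i, i] * M[k, k] < 0)
        # Solve M_jj t^2 + 2 M_ij t + M_ii = 0 for t = tan(theta).
        disc = np.sqrt(M[i, j] ** 2 - M[i, i] * M[j, j])
        t = (-M[i, j] + disc) / M[j, j]
        c = 1.0 / np.sqrt(1.0 + t * t)
        s = t * c
        G = np.eye(d)
        G[i, i], G[i, j], G[j, i], G[j, j] = c, s, -s, c
        M = G @ M @ G.T          # new M[i, i] = c^2 (M_ii + 2 t M_ij + t^2 M_jj) = 0
        U = G @ U
    return U

def optimal_perturbation(K, a=None):
    """Steps 1-4 of Algorithm 1; also returns A with [A, J] = K0 and K0 itself."""
    d = K.shape[0]
    K0 = K - np.trace(K) / d * np.eye(d)                 # step 1
    U = zero_diagonal_basis(K0)                          # step 2
    a = np.arange(d, dtype=float) if a is None else np.asarray(a, dtype=float)
    M = U @ K0 @ U.T
    diff = a[:, None] - a[None, :]                       # a_i - a_j
    np.fill_diagonal(diff, 1.0)                          # placeholder, diagonal reset below
    J_bar = M / diff                                     # step 3
    np.fill_diagonal(J_bar, 0.0)
    J = U.T @ J_bar @ U                                  # step 4
    A = U.T @ np.diag(a) @ U
    return J, A, K0

# Illustrative symmetric matrix (not taken from the paper).
K = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.5],
              [0.0, 0.5, -1.0]])
J, A, K0 = optimal_perturbation(K)
print("J antisymmetric:", np.allclose(J, -J.T))
print("[A, J] = K0:", np.allclose(A @ J - J @ A, K0))
```

The second printed check is exactly the commutator identity $[A,J]=K_0$ used in the proof of Lemma 8; different choices of the numbers $a_i$ give different, equally admissible matrices $J$.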

Remark 12

In [14], the authors consider the task of finding optimal perturbations $J$ for the nonreversible overdamped Langevin dynamics given in (15). In the Gaussian case this optimization problem turns out to be equivalent to the one considered in this section. Indeed, equation (39) of [14] can be rephrased as $f\in\ker(Jq\cdot\nabla)^\perp$. Therefore, Algorithm 1 and its generalization Algorithm 2 (described in Sect. 4.5) can be used without modifications to find optimal perturbations of overdamped Langevin dynamics.

Gaussians with Arbitrary Covariance and Preconditioning

In this section we extend the results of the preceding sections to the case when the target measure $\pi$ is given by a Gaussian with arbitrary covariance, i.e. $V(q)=\frac{1}{2}q\cdot Sq$ with $S\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$ symmetric and positive definite. The dynamics (8) then takes the form

$$dq_t=M^{-1}p_t\,dt-\mu J_1Sq_t\,dt,\qquad dp_t=-Sq_t\,dt-\nu J_2M^{-1}p_t\,dt-\Gamma M^{-1}p_t\,dt+\sqrt{2\Gamma}\,dW_t. \tag{65}$$

The key observation is now that the choices $M=S$ and $\Gamma=\gamma S$ together with the transformation $\tilde{q}=S^{1/2}q$ and $\tilde{p}=S^{-1/2}p$ lead to the dynamics

$$d\tilde{q}_t=\tilde{p}_t\,dt-\mu S^{1/2}J_1S^{1/2}\tilde{q}_t\,dt,\qquad d\tilde{p}_t=-\tilde{q}_t\,dt-\mu S^{-1/2}J_2S^{-1/2}\tilde{p}_t\,dt-\gamma\tilde{p}_t\,dt+\sqrt{2\gamma}\,dW_t, \tag{66}$$

which is of the form (34) if $J_1$ and $J_2$ obey the condition $SJ_1S=J_2$. Clearly the dynamics (66) is ergodic with respect to a Gaussian measure with unit covariance, in the following denoted by $\tilde\pi$. The connection between the asymptotic variances associated to (65) and (66) is as follows:

For an observable $f\in L_0^2(\pi)$ we can write

$$\sqrt{T}\Big(\frac{1}{T}\int_0^Tf(q_s)\,ds-\pi(f)\Big)=\sqrt{T}\Big(\frac{1}{T}\int_0^T\tilde{f}(\tilde{q}_s)\,ds-\tilde\pi(\tilde{f})\Big),$$

where $\tilde{f}(q)=f(S^{-1/2}q)$. Therefore, the asymptotic variances satisfy $\sigma_f^2=\tilde\sigma_{\tilde{f}}^2$, where $\tilde\sigma_{\tilde{f}}^2$ denotes the asymptotic variance of the process $(\tilde{q}_t)_{t\ge 0}$. Because of this, the results from the previous sections generalise to (65), subject to the condition that the choices $M=S$, $\Gamma=\gamma S$ and $SJ_1S=J_2$ are made. We formulate our results in this general setting as corollaries:

Corollary 3

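Consider the dynamics

$$dq_t=M^{-1}p_t\,dt-\mu J_1\nabla V(q_t)\,dt,\qquad dp_t=-\nabla V(q_t)\,dt-\mu J_2M^{-1}p_t\,dt-\Gamma M^{-1}p_t\,dt+\sqrt{2\Gamma}\,dW_t, \tag{67}$$

with $V(q)=\frac{1}{2}q\cdot Sq$. Assume that $M=S$, $\Gamma=\gamma S$ with $\gamma>\sqrt{2}$ and $SJ_1S=J_2$. Let $f\in L^2(\pi)$ be an observable of the form

$$f(q)=q\cdot Kq+l\cdot q+C \tag{68}$$

with $K\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$, $l\in\mathbb{R}^d$ and $C\in\mathbb{R}$. If at least one of the conditions $KJ_1S\neq SJ_1K$ and $l\notin\ker J_1$ is satisfied, then the asymptotic variance is at a local maximum for the unperturbed sampler, i.e.

$$\partial_\mu\sigma_f^2\big|_{\mu=0}=0\quad\text{and}\quad\partial_\mu^2\sigma_f^2\big|_{\mu=0}<0.$$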

Proof

Note that

$$\tilde{f}(q)=f(S^{-1/2}q)=q\cdot S^{-1/2}KS^{-1/2}q+S^{-1/2}l\cdot q+C=q\cdot\tilde{K}q+\tilde{l}\cdot q+C$$

is again of the form (68) (where in the last equality, $\tilde{K}=S^{-1/2}KS^{-1/2}$ and $\tilde{l}=S^{-1/2}l$ have been defined). From (66), (4.5) and Theorem 3 the claim follows if at least one of the conditions $[\tilde{K},S^{1/2}J_1S^{1/2}]\neq 0$ and $\tilde{l}\notin\ker(S^{1/2}J_1S^{1/2})$ is satisfied. The first of those can easily be seen to be equivalent to $S^{-1/2}(KJ_1S-SJ_1K)S^{-1/2}\neq 0$, which is equivalent to $KJ_1S\neq SJ_1K$ since $S$ is nondegenerate. The second condition is equivalent to $S^{1/2}J_1l\neq 0$, which is equivalent to $J_1l\neq 0$, again by nondegeneracy of $S$.

Corollary 4

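Assume the setting from the previous corollary and denote by $\Pi$ the orthogonal projection onto $\ker(J_1Sq\cdot\nabla)$. For $f\in L^2(\pi)$ it holds that $\lim_{\mu\to\infty}\sigma_f^2(\mu)=\sigma_{\Pi f}^2(0)\le\sigma_f^2(0)$.

Proof

Theorem 5 implies $\lim_{\mu\to\infty}\tilde\sigma_{\tilde{f}}^2(\mu)=\tilde\sigma_{\tilde\Pi\tilde{f}}^2(0)\le\tilde\sigma_{\tilde{f}}^2(0)$ for the transformed system (66). Here $\tilde{f}(q)=f(S^{-1/2}q)$ is the transformed observable and $\tilde\Pi$ denotes the $L^2(\pi)$-orthogonal projection onto $\ker(S^{1/2}J_1S^{1/2}q\cdot\nabla)$. According to (4.5), it is sufficient to show that $(\Pi f)\circ S^{-1/2}=\tilde\Pi\tilde{f}$. This however follows directly from the fact that the linear transformation $\phi\mapsto\phi\circ S^{1/2}$ maps $\ker(S^{1/2}J_1S^{1/2}q\cdot\nabla)$ bijectively onto $\ker(J_1Sq\cdot\nabla)$.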

Let us also reformulate Algorithm 1 for the case of a Gaussian with arbitrary covariance.

Algorithm 2

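Given $K,S\in\mathbb{R}^{d\times d}_{\mathrm{sym}}$ with $f(q)=q\cdot Kq$ and $V(q)=\frac{1}{2}q\cdot Sq$ (assuming $S$ is nondegenerate), determine optimal perturbations $J_1$ and $J_2$ as follows:

  1. Set $\tilde{K}=S^{-1/2}KS^{-1/2}$ and $\tilde{K}_0=\tilde{K}-\frac{\operatorname{Tr}\tilde{K}}{d}\cdot I$.

  2. Find $U\in O(\mathbb{R}^d)$ such that $U\tilde{K}_0U^T$ has zero entries on the diagonal.

  3. Choose $a_i\in\mathbb{R}$, $i=1,\ldots,d$ such that $a_i\neq a_j$ for $i\neq j$ and set
    $$\bar{J}_{ij}=\frac{(U\tilde{K}_0U^T)_{ij}}{a_i-a_j}$$
    for $i\neq j$ and $\bar{J}_{ii}=0$.
  4. Set $\tilde{J}=U^T\bar{J}U$.

  5. Put $J_1=S^{-1/2}\tilde{J}S^{-1/2}$ and $J_2=S^{1/2}\tilde{J}S^{1/2}$.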

Finally, we obtain the following optimality result from Lemma 7 and Corollary 2.

Corollary 5

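Let $f(q)=q\cdot Kq+l\cdot q-\operatorname{Tr}K$ and assume that $\Pi_{\ker J}l=0$. Then

$$\min_{J_1^T=-J_1,\ J_2=SJ_1S}\Big(\lim_{\mu\to\infty}\sigma_f^2(\mu,J_1,J_2)\Big)=\sigma_{f_1}^2(0),$$

where $f_1(q)=q\cdot K_1q$, $K_1=\frac{\operatorname{Tr}(S^{-1}K)}{d}S$. Optimal choices for $J_1$ and $J_2$ can be obtained using Algorithm 2.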

Remark 13

Since in Sect. 4.1 we analysed the case where J1 and J2 are proportional, we are not able to drop the restriction J2=SJ1S from the above optimality result. Analysis of completely arbitrary perturbations will be the subject of future work.

Remark 14

The choices $M=S$ and $\Gamma=\gamma S$ have been introduced to make the perturbations considered in this article lead to samplers that perform well in terms of reducing the asymptotic variance. However, adjusting the mass and friction matrices according to the target covariance in this way (i.e. $M=S$ and $\Gamma=\gamma S$) is a popular way of preconditioning the dynamics, see for instance [18] and, in particular, mass-tensor molecular dynamics [6]. Here we will present an argument why such a preconditioning is indeed beneficial in terms of the convergence rate of the dynamics. Let us first assume that $S$ is diagonal, i.e. $S=\operatorname{diag}(s^{(1)},\ldots,s^{(d)})$ and that $M=\operatorname{diag}(m^{(1)},\ldots,m^{(d)})$ and $\Gamma=\operatorname{diag}(\gamma^{(1)},\ldots,\gamma^{(d)})$ are chosen diagonally as well. Then (65) decouples into one-dimensional SDEs of the following form:

$$dq_t^{(i)}=\frac{1}{m^{(i)}}p_t^{(i)}\,dt,\qquad dp_t^{(i)}=-s^{(i)}q_t^{(i)}\,dt-\frac{\gamma^{(i)}}{m^{(i)}}p_t^{(i)}\,dt+\sqrt{2\gamma^{(i)}}\,dW_t^{(i)},\qquad i=1,\ldots,d. \tag{69}$$

Let us write those Ornstein–Uhlenbeck processes as

$$dX_t^{(i)}=-B^{(i)}X_t^{(i)}\,dt+\sqrt{2Q^{(i)}}\,dW_t^{(i)},\qquad B^{(i)}=\begin{pmatrix}0&-\frac{1}{m^{(i)}}\\ s^{(i)}&\frac{\gamma^{(i)}}{m^{(i)}}\end{pmatrix},\qquad Q^{(i)}=\begin{pmatrix}0&0\\ 0&\gamma^{(i)}\end{pmatrix}. \tag{70}$$

As in Sect. 4.2, the rate of the exponential decay of (70) is equal to $\min\operatorname{Re}\sigma(B^{(i)})$. A short calculation shows that the eigenvalues of $B^{(i)}$ are given by

$$\lambda_{1,2}^{(i)}=\frac{\gamma^{(i)}}{2m^{(i)}}\pm\sqrt{\Big(\frac{\gamma^{(i)}}{2m^{(i)}}\Big)^2-\frac{s^{(i)}}{m^{(i)}}}.$$

Therefore, the rate of exponential decay is maximal when

$$\Big(\frac{\gamma^{(i)}}{2m^{(i)}}\Big)^2-\frac{s^{(i)}}{m^{(i)}}=0, \tag{71}$$

in which case it is given by

$$(\lambda^{(i)})^*=\sqrt{\frac{s^{(i)}}{m^{(i)}}}.$$

Naturally, it is reasonable to choose $m^{(i)}$ in such a way that the exponential rate $(\lambda^{(i)})^*$ is the same for all $i$, leading to the restriction $M=cS$ with $c>0$. Choosing $c$ small will result in fast convergence to equilibrium, but also make the dynamics (69) quite stiff, requiring a very small timestep $\Delta t$ in a discretisation scheme. The choice of $c$ will therefore need to strike a balance between those two competing effects. The constraint (71) then implies $\Gamma=2\sqrt{c}\,S$. By a coordinate transformation, the preceding argument also applies if $S$, $M$ and $\Gamma$ are diagonal in the same basis, and of course $M$ and $\Gamma$ can always be chosen that way. Numerical experiments show that it is possible to increase the rate of convergence to equilibrium even further by choosing $M$ and $\Gamma$ nondiagonally with respect to $S$ (although only by a small margin). A clearer understanding of this is a topic of further investigation.
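The dependence of the decay rate on the friction in one coordinate can be made concrete with a short Python sketch using only the standard library. The function names `decay_rate` and `critical_friction` and the numerical values of $s^{(i)}$ and $m^{(i)}$ are illustrative assumptions; the formulas are the eigenvalue expression for $B^{(i)}$ and the condition (71).

```python
import cmath
import math

def decay_rate(s, m, gamma):
    """Smallest real part of the eigenvalues of B^(i) in (70)."""
    half = gamma / (2.0 * m)
    root = cmath.sqrt(half * half - s / m)   # complex square root covers the oscillatory regime
    return min((half + root).real, (half - root).real)

def critical_friction(s, m):
    """Friction solving (71), i.e. gamma = 2 sqrt(s m)."""
    return 2.0 * math.sqrt(s * m)

s, m = 4.0, 0.25                 # illustrative stiffness and mass
g_star = critical_friction(s, m)
rates = {g: decay_rate(s, m, g) for g in (0.5 * g_star, g_star, 2.0 * g_star)}
for g, rate in rates.items():
    print(f"gamma = {g:4.1f}   decay rate = {rate:.4f}")
print("sqrt(s/m) =", math.sqrt(s / m))
```

Both under- and over-damping relative to the critical friction reduce the rate, and at the critical value the rate equals $\sqrt{s^{(i)}/m^{(i)}}$.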

Numerical Experiments: Diffusion Bridge Sampling

Numerical Scheme

In this section we introduce a splitting scheme for simulating the perturbed underdamped Langevin dynamics given by Eq. (8). In the unpertubed case, i.e. when J1=J2=0, the BAOAB scheme (see [33] and references therein) has proven to be efficient for computing long time ergodic averages with respect to q-dependent observables. Motivated by this, we introduce the following perturbed scheme, introducing additional Runge-Kutta integration steps:

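\begin{align}
p_{n+1/2} &= p_n - \tfrac{1}{2}\Delta t\,\nabla V(q_n), \tag{72a}\\
q_{n+1/2} &= q_n + \tfrac{1}{2}\Delta t\cdot M^{-1} p_{n+1/2}, \tag{72b}\\
q_{n+1/2} &= \mathrm{RK4}\left(\tfrac{1}{2}\Delta t,\, q_{n+1/2}\right), \tag{72c}\\
\hat{p} &= \exp\left(-\Delta t\,(\Gamma M^{-1} + \nu J_2 M^{-1})\right) p_{n+1/2} + \sqrt{I - e^{-2\Gamma\Delta t}}\;\mathcal{N}(0,1), \tag{72d}\\
q_{n+1/2} &= \mathrm{RK4}\left(\tfrac{1}{2}\Delta t,\, q_{n+1/2}\right), \tag{72e}\\
q_{n+1} &= q_{n+1/2} + \tfrac{1}{2}\Delta t\cdot M^{-1}\hat{p}, \tag{72f}\\
p_{n+1} &= \hat{p} - \tfrac{1}{2}\Delta t\cdot\nabla V(q_{n+1}), \tag{72g}
\end{align}

where $\mathrm{RK4}(\Delta t, q_0)$ refers to fourth order Runge-Kutta integration of the ODE

$$\dot{q} = -J_1 \nabla V(q), \quad q(0) = q_0 \tag{73}$$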

up until time $\Delta t$. We remark that the $J_2$-perturbation is linear and can therefore be included in the O-part without much computational overhead. We emphasize the fact that many different splitting schemes could be investigated: although the BAOAB-scheme works well for unperturbed Langevin dynamics, it is not clear whether this remains true for the perturbed dynamics. Moreover, the perturbations introduced by $J_1$ and $J_2$ can be added in various places. Other discretisation schemes for the ODE (73) could be useful as well, for instance one could use a symplectic integrator, using the Hamiltonian structure of (73). However, since $V$ as the Hamiltonian for (73) is not separable in general, such a symplectic integrator would have to be implicit. Note that (72c) and (72e) could be merged since (72e) commutes with (72d). In this paper, we content ourselves with the above scheme for our numerical experiments. Investigation of optimal numerical schemes for perturbed Langevin dynamics is an interesting problem for further research.
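To make the order of the substeps concrete, one step of the scheme (72a)-(72g) can be sketched as follows. This is a simplified illustration under explicit assumptions that are not part of the paper: $M = I$, $\Gamma = \gamma I$ with scalar $\gamma$, an antisymmetric $J_2$, a single RK4 step per half-interval, and a matrix `J1` that the caller has already scaled by the perturbation strength. The function names `rk4_flow`, `o_step_matrix` and `perturbed_baoab_step` are hypothetical.

```python
import numpy as np

def rk4_flow(q, h, J1, grad_V):
    """One classical RK4 step of length h for dq/dt = -J1 @ grad_V(q), cf. (73)."""
    f = lambda x: -J1 @ grad_V(x)
    k1 = f(q)
    k2 = f(q + 0.5 * h * k1)
    k3 = f(q + 0.5 * h * k2)
    k4 = f(q + h * k3)
    return q + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def o_step_matrix(dt, gamma, nu, J2):
    """exp(-dt*(gamma*I + nu*J2)) for antisymmetric J2 (assumes M = I, Gamma = gamma*I)."""
    # 1j*J2 is Hermitian, so J2 = U diag(-1j*w) U^H with real eigenvalues w.
    w, U = np.linalg.eigh(1j * J2)
    rot = (U * np.exp(1j * dt * nu * w)) @ U.conj().T   # equals exp(-dt*nu*J2)
    return np.exp(-gamma * dt) * rot.real

def perturbed_baoab_step(q, p, dt, gamma, nu, J1, J2, grad_V, rng):
    """One step of (72a)-(72g) under the simplifying assumptions stated above."""
    p = p - 0.5 * dt * grad_V(q)                        # (72a)
    q = q + 0.5 * dt * p                                # (72b), M = I
    q = rk4_flow(q, 0.5 * dt, J1, grad_V)               # (72c)
    noise = np.sqrt(1.0 - np.exp(-2.0 * gamma * dt)) * rng.standard_normal(p.shape)
    p = o_step_matrix(dt, gamma, nu, J2) @ p + noise    # (72d)
    q = rk4_flow(q, 0.5 * dt, J1, grad_V)               # (72e)
    q = q + 0.5 * dt * p                                # (72f)
    p = p - 0.5 * dt * grad_V(q)                        # (72g)
    return q, p

# Usage on a two-dimensional standard Gaussian target, V(q) = |q|^2 / 2.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
rng = np.random.default_rng(0)
q, p = np.array([1.0, 0.5]), np.zeros(2)
for _ in range(100):
    q, p = perturbed_baoab_step(q, p, 1e-2, 1.0, 1.0, J, J, lambda x: x, rng)
```

For general $M$ and $\Gamma$ the matrix exponential in (72d) would need a general-purpose routine, and in practice the O-step matrix would be computed once rather than at every step.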

Remark 15

The aforementioned schemes lead to an error in the approximation of $\pi(f)$, since the invariant measure $\pi$ is not preserved exactly by the numerical scheme. In practice, the BAOAB-scheme can therefore be accompanied by an accept-reject Metropolis step as in [40], leading to an unbiased estimate of $\pi(f)$, albeit with an inflated variance. In this case, after every rejection the momentum variable has to be flipped ($p \to -p$) in order to keep the correct invariant measure. We note here that our perturbed scheme can be 'Metropolized' in a similar way by 'flipping' the matrices $J_1$ and $J_2$ after every rejection ($J_1 \to -J_1$ and $J_2 \to -J_2$) and using an appropriate (volume-preserving and time-reversible) integrator for the dynamics given by (73). Implementations of this idea are the subject of ongoing work. See [47] for a similar approach to nonreversible overdamped Langevin dynamics.

Diffusion Bridge Sampling

To numerically test our analytical results, we will apply the dynamics (8) to sample a measure on path space associated to a diffusion bridge. Specifically, consider the SDE

$$dX_s = -\nabla U(X_s)\,ds + \sqrt{2\beta^{-1}}\,dW_s,$$

with $X_s \in \mathbb{R}^n$, $\beta > 0$ and the potential $U:\mathbb{R}^n \to \mathbb{R}$ obeying adequate growth and smoothness conditions (see [24], Sect. 5 for precise statements). The law of the solution to this SDE conditioned on the events $X(0) = x_-$ and $X(s_+) = x_+$ is a probability measure $\pi$ on $L^2([0,s_+],\mathbb{R}^n)$ which poses a challenging and important sampling problem, especially if $U$ is multimodal. This setting has been used as a test case for sampling probability measures in high dimensions (see for example [9] and [45]). For a more detailed introduction (including applications) see [11] and for a rigorous theoretical treatment the papers [11, 24–26].

In the case $U \equiv 0$, it can be shown that the law of the conditioned process is given by a Gaussian measure $\pi_0$ with mean zero and precision operator $S = -\frac{\beta}{2}\Delta$ on the Sobolev space $H^1([0,s_+],\mathbb{R}^d)$ equipped with appropriate boundary conditions. The general case can then be understood as a perturbation thereof: The measure $\pi$ is absolutely continuous with respect to $\pi_0$ with Radon-Nikodym derivative

$$\frac{d\pi}{d\pi_0} \propto \exp(-\Psi), \tag{74}$$

where

$$\Psi(x) = \frac{\beta}{2}\int_0^{s_+} G(x(s),\beta)\,ds \quad\text{and}\quad G(x,\beta) = \frac{1}{2}|\nabla U(x)|^2 - \frac{1}{\beta}\Delta U(x).$$

We will make the choice $x_- = x_+ = 0$, which is possible without loss of generality as explained in [10, Remark 3.1], leading to Dirichlet boundary conditions on $[0,s_+]$ for the precision operator $S$. Furthermore, we choose $s_+ = 1$ and discretise the ensuing $s$-interval $[0, 1]$ according to

$$[0,1] = [0,s_1) \cup [s_1,s_2) \cup \ldots \cup [s_{n-1},s_n) \cup [s_n,1]$$

in an equidistant way with stepsize $s_{j+1} - s_j \equiv \delta = \frac{1}{d+1}$. Functions on this grid are determined by the values $x(s_1) = x_1, \ldots, x(s_n) = x_n$, recalling that $x(0) = x(1) = 0$ by the Dirichlet boundary conditions. We discretise the functional $\Psi$ as

$$\tilde{\Psi}(x_1,\ldots,x_n) = \frac{\beta}{2}\,\delta\sum_{i=1}^{d} G(x_i,\beta) = \frac{\beta}{2}\,\delta\sum_{i=1}^{d}\left(\frac{1}{2}U'(x_i)^2 - \frac{1}{\beta}U''(x_i)\right),$$

such that its gradient is given by

$$(\nabla\tilde{\Psi})_i = \frac{\beta}{2}\,\delta\left(U'(x_i)\,U''(x_i) - \frac{1}{\beta}U'''(x_i)\right), \quad i = 1,\ldots,d.$$

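We denote by $A_\delta$ the discretised Dirichlet Laplacian on $[0, 1]$ with stepsize $\delta$. Following (74), the discretised target measure $\hat{\pi}$ has the form

$$\hat{\pi} = \frac{1}{Z}e^{-V}\,dx \quad\text{with}\quad V(x) = \tilde{\Psi}(x) - \frac{\beta\delta}{4}\,x\cdot A_\delta x, \quad x \in \mathbb{R}^d.$$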
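The discretised potential and its gradient can be assembled as in the following minimal sketch. It assumes the double-well potential $U(x) = \frac{1}{2}(x^2-1)^2$ used in the experiments and the standard three-point stencil for the Dirichlet Laplacian; the function name `make_target` and the choice of test point are illustrative assumptions.

```python
import numpy as np

def make_target(d, beta=1.0):
    """Return V, grad V and A_delta for the discretised bridge measure on d grid points."""
    delta = 1.0 / (d + 1)
    # Discretised Dirichlet Laplacian (assumed stencil): tridiag(1, -2, 1) / delta^2.
    A = (np.diag(-2.0 * np.ones(d))
         + np.diag(np.ones(d - 1), 1)
         + np.diag(np.ones(d - 1), -1)) / delta**2

    # Derivatives of the double well U(x) = (x^2 - 1)^2 / 2.
    dU = lambda x: 2.0 * x * (x**2 - 1.0)
    d2U = lambda x: 6.0 * x**2 - 2.0
    d3U = lambda x: 12.0 * x

    def V(x):
        psi = 0.5 * beta * delta * np.sum(0.5 * dU(x)**2 - d2U(x) / beta)
        return psi - 0.25 * beta * delta * (x @ (A @ x))

    def grad_V(x):
        # d/dx of U'(x)^2 / 2 is U'(x) U''(x); A is symmetric.
        grad_psi = 0.5 * beta * delta * (dU(x) * d2U(x) - d3U(x) / beta)
        return grad_psi - 0.5 * beta * delta * (A @ x)

    return V, grad_V, A

# Finite-difference check of the gradient at an arbitrary point.
V, grad_V, A = make_target(5)
x = np.linspace(-1.0, 1.0, 5)
eps = 1e-6
fd = np.array([(V(x + eps * e) - V(x - eps * e)) / (2.0 * eps) for e in np.eye(5)])
print(np.max(np.abs(fd - grad_V(x))))
```

A central-difference comparison of this kind is a cheap safeguard before running long simulations, since an error in the gradient would silently change the invariant measure of the sampler.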

In the following we will consider the case $n = 1$ with potential $U:\mathbb{R}\to\mathbb{R}$ given by $U(x) = \frac{1}{2}(x^2-1)^2$ and set $\beta = 1$. To test our algorithm we adjust the parameters $M$, $\Gamma$, $J_1$ and $J_2$ according to the recommended choice in the Gaussian case, (27), where we take $S = -\frac{\beta}{2}\delta\cdot A_\delta$ as the precision operator of the Gaussian target (positive definite, since $A_\delta$ is negative definite). We will consider the linear observable $f_1(x) = l\cdot x$ with $l = (1,\ldots,1)$ and the quadratic observable $f_2(x) = |x|^2$. In a first experiment we adjust the perturbations $J_1$ and $J_2$ to the observable $f_2$ according to Algorithm 2. The dynamics (8) is integrated using the splitting scheme introduced in Sect. 5.1 with a stepsize of $\Delta t = 10^{-4}$ over the time interval $[0, T]$ with $T = 10^2$. Furthermore, we choose initial conditions $q_0 = (1,\ldots,1)$, $p_0 = (0,\ldots,0)$ and introduce a burn-in time $T_0 = 1$, i.e. we take the estimator to be $\hat{\pi}(f) \approx \frac{1}{T-T_0}\int_{T_0}^{T} f(q_t)\,dt$. We compute the variance of the above estimator from $N = 500$ realisations and compare the results for different choices of the friction coefficient $\gamma$ and of the perturbation strength $\mu$.
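The estimator with burn-in and its spread over independent realisations can be sketched as follows. This is an illustration only: `time_average`, `estimator_std` and the callable `run_once` (which is assumed to simulate one trajectory and return one estimate) are hypothetical names, and the integral is replaced by a Riemann sum over equally spaced samples.

```python
import numpy as np

def time_average(f_values, dt, burn_in):
    """Riemann-sum approximation of (1/(T - T0)) * int_{T0}^{T} f(q_t) dt."""
    f_values = np.asarray(f_values, dtype=float)
    k0 = int(round(burn_in / dt))        # number of samples discarded as burn-in
    return f_values[k0:].mean()

def estimator_std(run_once, n_runs, seed=0):
    """Sample standard deviation of the estimator over n_runs independent realisations."""
    rng = np.random.default_rng(seed)
    estimates = np.array([run_once(rng) for _ in range(n_runs)])
    return estimates.std(ddof=1)

# Toy usage: each "realisation" is a noisy constant signal rather than a Langevin path.
def run_once(rng):
    f_values = 1.0 + 0.1 * rng.standard_normal(1000)
    return time_average(f_values, dt=0.01, burn_in=1.0)

print(estimator_std(run_once, n_runs=50))
```

In an actual experiment `run_once` would integrate the dynamics with the splitting scheme and evaluate the observable along the $q$-trajectory.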

The numerical experiments show that the perturbed dynamics generally outperform the unperturbed dynamics independently of the choice of $\mu$ and $\gamma$, both for linear and quadratic observables. One notable exception is the behaviour of the linear observable for small friction $\gamma = 10^{-3}$ (see Fig. 4a), where the asymptotic variance initially increases for small perturbation strengths $\mu$. This does not contradict our analytical results, since the small perturbation results Theorem 3 and Corollary 3 are only valid if $\gamma > 2$. We remark here that the condition $\gamma > 2$, while necessary for the theoretical results from Sect. 4.1, is not a very advisable choice in practice (at least in this experiment), since Figs. 4b and 5b clearly indicate that the optimal friction is around $\gamma \approx 10^{-1}$. Interestingly, the problem of choosing a suitable value for the friction coefficient $\gamma$ becomes mitigated by the introduction of the perturbation: While the performance of the unperturbed sampler depends quite sensitively on $\gamma$, the asymptotic variance of the perturbed dynamics is a lot more stable with respect to variations of $\gamma$. A somewhat surprising phenomenon is that the standard deviation $\sigma$ associated to the linear observable decays in the range $\gamma \in [10, 100]$ for the unperturbed sampler (see Fig. 4b). We confirmed this behaviour by further numerical experiments and remark that as the target measure $\hat{\pi}$ is fairly complicated, convexity of the function $\gamma \mapsto \sigma$ should not be expected.

Fig. 4. Standard deviation of $\hat{\pi}(f)$ for a linear observable as a function of friction $\gamma$ and perturbation strength $\mu$

Fig. 5. Standard deviation of $\hat{\pi}(f)$ for a quadratic observable as a function of friction $\gamma$ and perturbation strength $\mu$

In the regime of growing values of μ, the experiments confirm the results from Sect. 4.3, i.e. the asymptotic variance approaches a limit that is smaller than the asymptotic variance of the unperturbed dynamics.

As a final remark we report our finding that the performance of the sampler for the linear observable is qualitatively independent of the choice of $J_1$ (as long as $J_2$ is adjusted according to (27)). This result is in alignment with Proposition 3, which predicts good properties of the sampler for antisymmetric observables. In contrast to this, a judicious choice of $J_1$ is critical for quadratic observables. In particular, applying Algorithm 2 significantly improves the performance of the perturbed sampler in comparison to choosing $J_1$ arbitrarily.

Outlook and Future Work

A new family of Langevin samplers was introduced in this paper. These new SDE samplers consist of perturbations of the underdamped Langevin dynamics (which is known to be ergodic with respect to the canonical measure), where auxiliary drift terms in the equations for both the position and the momentum are added in such a way that the perturbed family of dynamics is ergodic with respect to the same (canonical) distribution. These new Langevin samplers were studied in detail for Gaussian target distributions, where it was shown, using tools from spectral theory for differential operators, that an appropriate choice of the perturbations in the equations for the position and momentum can improve the performance of the Langevin sampler, at least in terms of reducing the asymptotic variance. The performance of the perturbed Langevin sampler for non-Gaussian target densities was tested numerically on the problem of diffusion bridge sampling.

The work presented in this paper can be improved and extended in several directions. First, a rigorous analysis of the new family of Langevin samplers for non-Gaussian target densities is needed. The analytical tools developed in [14] can be used as a starting point. Furthermore, the study of the actual computational cost and its minimization by an appropriate choice of the numerical scheme and of the perturbations in position and momentum would be of interest to practitioners. In addition, the analysis of our proposed samplers can be facilitated by using tools from symplectic and differential geometry. Finally, combining the new Langevin samplers with existing variance reduction techniques such as zero variance MCMC and preconditioning/Riemannian manifold MCMC can lead to sampling schemes that are of interest to practitioners, in particular in molecular dynamics simulations. All these topics are currently under investigation.

Acknowledgements

AD was supported by the EPSRC under Grant No. EP/J009636/1. NN is supported by EPSRC through a Roth Departmental Scholarship. GP is partially supported by the EPSRC under Grants No. EP/J009636/1, EP/L024926/1, EP/L020564/1 and EP/L025159/1. Part of the work reported in this paper was done while NN and GP were visiting the Institut Henri Poincaré during the Trimester Program “Stochastic Dynamics Out of Equilibrium”. The hospitality of the Institute and of the organizers of the program is gratefully acknowledged. We thank the referees for their useful comments and suggestions that have led to various improvements in the presentation of this paper.

Footnotes

1

In fact, using the results from [8], we could consider observables in $L^2(\pi)$. However, we will not extend this point further in this paper.

2

Notice that γ22-1 is understood to be a complex number for γ<2.

3

Indeed, the fact that fkerA is equivalent to fE¯ is easy to check if f is an eigenvector of L0 (recall that f is then an eigenvector of L0+μA as well, using Lemma 5(c)). The claim then follows by extending linearly.

Contributor Information

A. B. Duncan, Email: Andrew.Duncan@sussex.ac.uk

N. Nüsken, Email: n.nusken14@imperial.ac.uk

G. A. Pavliotis, Email: g.pavliotis@imperial.ac.uk

References

  • 1. Achleitner F, Arnold A, Stürzer D. Large-time behavior in non-symmetric Fokker-Planck equations. Riv. Math. Univ. Parma (N.S.) 2015;6(1):1–68.
  • 2. Arnold, A., Erb, J.: Sharp entropy decay for hypocoercive and non-symmetric Fokker-Planck equations with linear drift. arXiv:1409.5425v2 (2014)
  • 3. Asmussen S, Glynn PW. Stochastic Simulation: Algorithms and Analysis. New York: Springer; 2007.
  • 4. Alrachid, H., Mones, L., Ortner, C.: Some remarks on preconditioning molecular dynamics. arXiv preprint arXiv:1612.05435 (2016)
  • 5. Bass RF. Diffusions and Elliptic Operators. Berlin: Springer; 1998.
  • 6. Bennett CH. Mass tensor molecular dynamics. J. Comput. Phys. 1975;19(3):267–279. doi: 10.1016/0021-9991(75)90077-7.
  • 7. Bakry D, Gentil I, Ledoux M. Analysis and Geometry of Markov Diffusion Operators. Berlin: Springer; 2013.
  • 8. Bhattacharya RN. On the functional central limit theorem and the law of the iterated logarithm for Markov processes. Z. Wahrsch. Verw. Gebiete. 1982;60(2):185–201. doi: 10.1007/BF00531822.
  • 9. Beskos A, Pinski FJ, Sanz-Serna JM, Stuart AM. Hybrid Monte Carlo on Hilbert spaces. Stoch. Process. Appl. 2011;121(10):2201–2230. doi: 10.1016/j.spa.2011.06.003.
  • 10. Beskos A, Roberts G, Stuart A, Voss J. MCMC methods for diffusion bridges. Stoch. Dyn. 2008;8(3):319–350. doi: 10.1142/S0219493708002378.
  • 11. Beskos, A., Stuart, A.: MCMC methods for sampling function space. In: ICIAM 07 - 6th International Congress on Industrial and Applied Mathematics, pp. 337–364. European Mathematical Society, Zürich (2009)
  • 12. Ceriotti M, Bussi G, Parrinello M. Langevin equation with colored noise for constant-temperature molecular dynamics simulations. Phys. Rev. Lett. 2009;102(2):020601. doi: 10.1103/PhysRevLett.102.020601.
  • 13. Cattiaux P, Chafaï D, Guillin A. Central limit theorems for additive functionals of ergodic Markov diffusions processes. ALEA. 2012;9(2):337–382.
  • 14. Duncan AB, Lelievre T, Pavliotis GA. Variance reduction using nonreversible Langevin samplers. J. Stat. Phys. 2016;163(3):457–491. doi: 10.1007/s10955-016-1491-2.
  • 15. Dolbeault J, Mouhot C, Schmeiser C. Hypocoercivity for linear kinetic equations conserving mass. Trans. Am. Math. Soc. 2015;367(6):3807–3828. doi: 10.1090/S0002-9947-2015-06012-7.
  • 16. Eyink GL, Lebowitz JL, Spohn H. Hydrodynamics and fluctuations outside of local equilibrium: driven diffusive systems. J. Stat. Phys. 1996;83(3–4):385–472. doi: 10.1007/BF02183738.
  • 17. Engel, K.-J., Nagel, R.: One-parameter semigroups for linear evolution equations, volume 194 of Graduate Texts in Mathematics. Springer, New York (2000). With contributions by Brendle, S., Campiti, M., Hahn, T., Metafune, G., Nickel, G., Pallara, D., Perazzoli, C., Rhandi, A., Romanelli, S., Schnaubelt, R
  • 18. Girolami M, Calderhead B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol. 2011;73(2):123–214. doi: 10.1111/j.1467-9868.2010.00765.x.
  • 19. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3. Boca Raton, FL: CRC Press; 2014.
  • 20. Hwang C-R, Hwang-Ma S, Sheu SJ. Accelerating Gaussian diffusions. Ann. Appl. Probab. 1993;3(3):897–913. doi: 10.1214/aoap/1177005371.
  • 21. Hwang C-R, Hwang-Ma S-Y, Sheu S-J. Accelerating diffusions. Ann. Appl. Probab. 2005;15(2):1433–1444. doi: 10.1214/105051605000000025.
  • 22. Horn RA, Johnson CR. Matrix Analysis. 2. Cambridge: Cambridge University Press; 2013.
  • 23. Hwang C-R, Normand R, Wu S-J. Variance reduction for diffusions. Stoch. Process. Appl. 2015;125(9):3522–3540. doi: 10.1016/j.spa.2015.03.006.
  • 24. Hairer M, Stuart AM, Voss J. Analysis of SPDEs arising in path sampling. II. The nonlinear case. Ann. Appl. Probab. 2007;17(5–6):1657–1706. doi: 10.1214/07-AAP441.
  • 25. Hairer, M., Stuart, A., Voss, J.: Sampling conditioned diffusions. In: Trends in Stochastic Analysis. London Mathematical Society Lecture Note Series, vol. 353, pp. 159–185. Cambridge University Press, Cambridge (2009)
  • 26. Hairer M, Stuart AM, Voss J, Wiberg P. Analysis of SPDEs arising in path sampling. I. The Gaussian case. Commun. Math. Sci. 2005;3(4):587–603. doi: 10.4310/CMS.2005.v3.n4.a8.
  • 27. Iacobucci, A., Olla, S., Stoltz, G.: Convergence rates for nonequilibrium Langevin dynamics. arXiv:1702.03685 (2017)
  • 28. Joulin A, Ollivier Y. Curvature, concentration and error estimates for Markov chain Monte Carlo. Ann. Probab. 2010;38(6):2418–2442. doi: 10.1214/10-AOP541.
  • 29. Kazakia JY. Orthogonal transformation of a trace free symmetric matrix into one with zero diagonal elements. Int. J. Eng. Sci. 1988;26(8):903–906. doi: 10.1016/0020-7225(88)90041-9.
  • 30. Kliemann W. Recurrence and invariant measures for degenerate diffusions. Ann. Probab. 1987;15:690–707. doi: 10.1214/aop/1176992166.
  • 31. Komorowski, T., Landim, C., Olla, S.: Fluctuations in Markov processes. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 345. Springer, Heidelberg. Time symmetry and martingale approximation (2012)
  • 32. Liu JS. Monte Carlo Strategies in Scientific Computing. Berlin: Springer; 2008.
  • 33. Leimkuhler, B., Matthews, C.: Molecular Dynamics. Interdisciplinary Applied Mathematics, vol. 39. Springer, Berlin (2015). With deterministic and stochastic numerical methods
  • 34. Lelièvre T, Nier F, Pavliotis GA. Optimal non-reversible linear drift for the convergence to equilibrium of a diffusion. J. Stat. Phys. 2013;152(2):237–274. doi: 10.1007/s10955-013-0769-x.
  • 35. Lelièvre, T., Rousset, M., Stoltz, G.: Free Energy Computations. Imperial College Press, London (2010). A Mathematical Perspective
  • 36. Lelièvre T, Stoltz G. Partial differential equations and stochastic methods in molecular dynamics. Acta Numer. 2016;25:681–880. doi: 10.1017/S0962492916000039.
  • 37. Ma, Y.-A., Chen, T., Fox, E.: A complete recipe for stochastic gradient MCMC. In: Advances in Neural Information Processing Systems, pp. 2899–2907 (2015)
  • 38. Metafune G, Pallara D, Priola E. Spectrum of Ornstein-Uhlenbeck operators in Lp spaces with respect to invariant measures. J. Funct. Anal. 2002;196(1):40–60. doi: 10.1006/jfan.2002.3978.
  • 39. Markowich PA, Villani C. On the trend to equilibrium for the Fokker-Planck equation: an interplay between physics and functional analysis. Mat. Contemp. 2000;19:1–29.
  • 40. Matthews, C., Weare, J., Leimkuhler, B.: Ensemble preconditioning for Markov Chain Monte Carlo simulation. arXiv:1607.03954 (2016)
  • 41. Nüsken, N.: Construction of optimal samplers (in preparation). PhD thesis, Imperial College London (2018)
  • 42. Ottobre M, Pavliotis GA. Asymptotic analysis for the generalized Langevin equation. Nonlinearity. 2011;24(5):1629. doi: 10.1088/0951-7715/24/5/013.
  • 43. Ottobre M, Pavliotis GA, Pravda-Starov K. Exponential return to equilibrium for hypoelliptic quadratic systems. J. Funct. Anal. 2012;262(9):4000–4039. doi: 10.1016/j.jfa.2012.02.008.
  • 44. Ottobre M, Pavliotis GA, Pravda-Starov K. Some remarks on degenerate hypoelliptic Ornstein-Uhlenbeck operators. J. Math. Anal. Appl. 2015;429(2):676–712. doi: 10.1016/j.jmaa.2015.04.019.
  • 45. Ottobre M, Pillai NS, Pinski FJ, Stuart AM. A function space HMC algorithm with second order Langevin diffusion limit. Bernoulli. 2016;22(1):60–106. doi: 10.3150/14-BEJ621.
  • 46. Pavliotis GA. Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Berlin: Springer; 2014.
  • 47. Poncet, R.: Generalized and hybrid Metropolis-Hastings overdamped Langevin algorithms. arXiv:1701.05833 (2017)
  • 48. Pavliotis GA, Stuart AM. White noise limits for inertial particles in a random field. Multiscale Model. Simul. 2003;1(4):527–533. doi: 10.1137/S1540345903421076.
  • 49. Pavliotis GA, Stuart AM. Analysis of white noise limits for stochastic systems with two fast relaxation times. Multiscale Model. Simul. 2005;4(1):1–35. doi: 10.1137/040610507.
  • 50. Rey-Bellet L, Spiliopoulos K. Irreversible Langevin samplers and variance reduction: a large deviations approach. Nonlinearity. 2015;28(7):2081–2103. doi: 10.1088/0951-7715/28/7/2081.
  • 51. Rey-Bellet, L., Spiliopoulos, K.: Variance reduction for irreversible Langevin samplers and diffusion on graphs. Electron. Commun. Probab., vol. 20, pp. 15, 16, (2015)
  • 52. Robert C, Casella G. Monte Carlo Statistical Methods. Berlin: Springer; 2013.
  • 53. Roussel, J., Stoltz, G.: Spectral methods for Langevin dynamics and associated error estimates. arXiv:1702.04718 (2017)
  • 54. Villani, C.: Hypocoercivity. Number 949-951. American Mathematical Society (2009)
  • 55. Wu S-J, Hwang C-R, Chu MT. Attaining the optimal Gaussian diffusion acceleration. J. Stat. Phys. 2014;155(3):571–590. doi: 10.1007/s10955-014-0963-5.

Articles from Journal of Statistical Physics are provided here courtesy of Springer
