Abstract
Markov Chain Monte Carlo (MCMC) methods are a powerful tool for computation with complex probability distributions. However, the performance of such methods is critically dependent on properly tuned parameters, most of which are difficult if not impossible to know a priori for a given target distribution. Adaptive MCMC methods aim to address this by allowing the parameters to be updated during sampling based on previous samples from the chain, at the expense of requiring a new theoretical analysis to ensure convergence. In this work we extend the convergence theory of adaptive MCMC methods to a new class of methods built on a powerful class of parametric density estimators known as normalizing flows. In particular, we consider an independent Metropolis-Hastings sampler where the proposal distribution is represented by a normalizing flow whose parameters are updated using stochastic gradient descent. We explore the practical performance of this procedure both in synthetic settings and in the analysis of a physical field system, and compare it against both adaptive and non-adaptive MCMC methods.
1. INTRODUCTION
Markov Chain Monte Carlo (MCMC) methods are procedures for generating samples from probability distributions, typically given knowledge of the density of the distribution up to proportionality. These MCMC samplers often depend on parameters; for instance, in the random walk Metropolis procedure on ℝ^d, one may treat the covariance matrix of a normal proposal distribution as a parameter of the method; see, for instance, Haario et al. (2001). The performance of an MCMC procedure will depend on these parameters. It would be preferable if these parameters could be adapted during sampling at every step of the chain; however, such adaptations can violate the Markov property of the chain and undermine its convergence to the desired target distribution.
An important variation of MCMC is the independent Metropolis-Hastings sampler. This method samples from a target distribution by first sampling from an auxiliary proposal distribution (independently of the current state of the chain) and accepts or rejects those proposals according to the Metropolis-Hastings criterion. The effectiveness of this algorithm depends on the ratio of the target density to the proposal density (Robert and Casella, 2005): if this ratio is bounded over the support of the target distribution, the algorithm enjoys a powerful theory of geometric ergodicity. The independent Metropolis-Hastings algorithm is the focus of the present work.
Recently in the machine learning community, normalizing flows have emerged as a powerful mechanism for expressing complex densities; see (Kobyzev et al., 2020; Papamakarios et al., 2021) for recent reviews. Normalizing flows are defined by a parametric, smooth, and invertible function which transforms a simple distribution (e.g., a Gaussian) into a more complex one (e.g., natural images) and uses the change-of-variables formula to exactly determine the resulting probability density function in the complex space. Provided that the family of normalizing flows under consideration is sufficiently expressive, any distribution can in theory be constructed this way. In practice, many normalizing flows exhibit a universal approximation property whereby, given suitable model capacity, they can approximate any distribution arbitrarily well, e.g., (Huang et al., 2018; Jaini et al., 2019). Indeed, normalizing flows are distinguished among parametric families of distributions by their expressiveness and tractability of sampling and log-density evaluation: precisely the attributes that one requires for a proposal distribution in the independent Metropolis-Hastings sampler. By incorporating normalizing flows into the MCMC framework we seek to leverage their expressivity along with the ergodicity of the MCMC procedure in order to produce samples from a target distribution (see fig. 1). The principal computational challenge associated with normalizing flows is the identification of parameters that produce the best approximation of a target density. Therefore, a question of principal theoretical interest and practical importance is, “During the course of sampling, under what conditions can the parameters of the normalizing flow be adapted at every step of the chain?”
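To fix notation for what follows: if fθ is the smooth, invertible map of the flow and ρ is the density of the base distribution, the change-of-variables formula gives the flow density (a standard identity; the notation here is ours)

$$\tilde{\pi}_\theta(x) = \rho\big(f_\theta^{-1}(x)\big)\,\big|\det J_{f_\theta^{-1}}(x)\big|,$$

so that sampling (draw z ~ ρ and return fθ(z)) and exact log-density evaluation are both tractable whenever fθ, its inverse, and the Jacobian determinant are.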
Figure 1: This work examines the convergence of adaptive Markov chain Monte Carlo algorithms using the independent Metropolis-Hastings algorithm when the proposal distribution is parameterized by a normalizing flow. In this illustration, we seek to draw samples from a target distribution. We begin with an initial parameter Θ0, which parameterizes a simple proposal distribution Π̃Θ0 represented by a normalizing flow, and an initial state of the chain X0; a sample from this proposal is accepted or rejected according to the Metropolis-Hastings criterion, yielding a transition to the state X1. The parameters of the normalizing flow are thereafter adapted to produce a new proposal distribution Π̃Θ1, which we hope is closer to the target distribution. Iterating this procedure, we obtain both a sequence of states (X0, X1, …) and a sequence of normalizing flow parameters (Θ0, Θ1, …). The principal question of this work is to establish when the sequence of states converges to the target density.
The outline of this paper is as follows. In section 2 we review important concepts from the analysis of Markov chains; we present the independent Metropolis-Hastings algorithm and state the conditions under which it enjoys geometric ergodicity; and we devise a metric space over transition kernels, which will be important for analyzing notions of continuity. We review recent experimental works that demonstrated the benefit of normalizing flow proposals in MCMC, and related theoretical literature, in section 3. In section 4 we state our theory for the continual adaptation of Markov chains. We begin by considering deterministic adaptations, wherein parameter updates are determined sequentially and deterministically without regard to the state of the chain; this case can be used to motivate the adaptation of normalizing flows as a gradient flow. We then proceed to consider stochastic adaptations, wherein the state of the chain and the adaptation of the parameters of the normalizing flow at the nth step are not necessarily independent given the history of the chain up to the (n−1)th step. This circumstance includes the case wherein the accepted proposal sampled from the normalizing flow is also used in the computation of the adaptation, as is necessary for the “pseudo-likelihood” algorithm we examine numerically in section 5.
2. PRELIMINARIES
In giving an overview of Markov chains and their associated theory, we emulate the notation and presentation of Meyn and Tweedie (1993). Refer to appendix A for a review of total variation distances. Throughout, we let 𝒳 denote a set which we equip with its Borel σ-algebra, denoted 𝔅(𝒳). We associate to (𝒳, 𝔅(𝒳)) a measure μ : 𝔅(𝒳) → [0, ∞) – satisfying μ(A) ≥ 0 for all A ∈ 𝔅(𝒳), μ(∅) = 0, and the condition of countable additivity – to create the measure space (𝒳, 𝔅(𝒳), μ). A probability measure is a measure which satisfies Kolmogorov’s axioms (Kolmogorov, 1960). A signed measure relaxes the condition of non-negativity. If X is an 𝒳-valued random variable and Π is a probability measure on (𝒳, 𝔅(𝒳)) we write X ~ Π(·) to mean that for any A ∈ 𝔅(𝒳) we have Pr[X ∈ A] = Π(A). If a probability measure Π has a density with respect to a dominating measure μ, this means that for all A ∈ 𝔅(𝒳), Π(A) = ∫A π(x) μ(dx). The support of a density π is Supp(π) = {x ∈ 𝒳 : π(x) > 0}. When we turn our attention to the discussion of parameterizations of transition kernels, we will write 𝒴 as a generic parameter space and use the symbol θ ∈ 𝒴 to refer to a particular parameterization. We denote the Dirac measure concentrated at x ∈ 𝒳 by δx(·).
2.1. Transition Kernels
In MCMC, we generate a sequence of 𝒳-valued random variables, denoted (X0, X1,…) that satisfy the Markov property. The transition to state Xn+1 given Xn = xn is formally captured by the notion of a transition kernel.
Definition 2.1 (Robert and Casella (2005)). A transition kernel on 𝒳 is a function 𝒳 × 𝔅(𝒳) ∋ (x, A) ↦ K(x, A) that satisfies the following two properties: (i) For all x ∈ 𝒳, K(x,·) is a probability measure and (ii) For all A ∈ 𝔅(𝒳), K(·, A) is 𝔅(𝒳)-measurable.
Thus, the propagation of the state from step n to step n+1 is represented by Xn+1 ~ K(xn,·). When considering Markov chains, we will frequently be interested in the n-step transition probability measure from some initial state X0 = x0; we denote this probability measure by Kn(x0,·) = Pr[Xn ∈ ·|X0 = x0], which has the following expression:
$$K^n(x_0, A) = \int_{\mathcal{X}} K(x, A)\, K^{n-1}(x_0, \mathrm{d}x), \qquad K^1(x_0, A) = K(x_0, A). \tag{1}$$
Of principal interest in the theory of Markov chains is the limiting behavior of the n-step transition probability measure.
Definition 2.2. The transition kernel K with n-step transition law Kn is ergodic for Π if, for every x ∈ 𝒳, limn→∞ ∥Kn(x, ·) − Π(·)∥TV = 0.
In the sequel, we will require continuity of sequences of transition kernels, which necessitates that we equip the space of transition kernels with a metric. A natural metric considers the worst-case total variation distance between kernels.
Definition 2.3. Two transition kernels K and K′ on 𝒳 × 𝔅(𝒳) are equal if supx∈𝒳 ∥K(x, ·)−K′(x, ·)∥TV = 0.
Proposition 2.4. Let K and K′ be transition kernels on 𝒳 × 𝔅(𝒳). Then the function, d(K, K′) = supx∈𝒳 ∥K(x, ·) – K′(x, ·)∥TV is a distance function on transition kernels.
A proof is given in appendix D.
2.2. Independent Metropolis-Hastings
Definition 2.5. Let Π and Π̃ be two probability measures on 𝔅(𝒳) with densities with respect to some dominating measure μ given by π and π̃, respectively. Consider a Markov chain (X0, X1, X2, …) constructed via the following procedure given an initial state of the chain X0 = x0. First, randomly sample X̃n+1 ~ Π̃(·). Then set Xn+1 = X̃n+1 with probability min{1, π(X̃n+1) π̃(Xn) / (π(Xn) π̃(X̃n+1))} and otherwise set Xn+1 = Xn. The Markov chain (X0, X1, X2, …) is called the independent Metropolis-Hastings sampler of Π given Π̃.
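As an illustration, the following is a minimal sketch of the transition in definition 2.5 (our own, not the paper's reference implementation), here with a fixed Gaussian proposal that is wider than a Gaussian target so that the density ratio is bounded:

```python
import numpy as np

def imh_step(x, log_target, log_prop, sample_prop, rng):
    """One independent Metropolis-Hastings transition.

    x: current state; log_target: unnormalized log pi;
    log_prop / sample_prop: log-density and sampler of the proposal.
    """
    x_prop = sample_prop(rng)
    # log acceptance ratio: log pi(x') + log ptilde(x) - log pi(x) - log ptilde(x')
    log_alpha = (log_target(x_prop) + log_prop(x)
                 - log_target(x) - log_prop(x_prop))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return x_prop  # accept the independent proposal
    return x           # reject: remain at the current state

# Usage: standard Gaussian target, N(0, 4) proposal (constants cancel in the ratio).
rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * x**2
log_prop = lambda x: -0.5 * (x / 2.0)**2 - np.log(2.0)
sample_prop = lambda rng: 2.0 * rng.normal()
chain = [0.0]
for _ in range(1000):
    chain.append(imh_step(chain[-1], log_target, log_prop, sample_prop, rng))
```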
Proposition 2.6. Let K denote the transition kernel of the independent Metropolis-Hastings sampler. The stationary distribution of (X0, X1, X2, …) is Π and, if there exists a constant M ≥ 1 such that π(x) ≤ M π̃(x) for all x ∈ Supp(π), then the independent Metropolis-Hastings sampler is uniformly ergodic in the sense that ∥Kn(x, ·) − Π(·)∥TV ≤ 2(1 − 1/M)^n.
For a proof of these results, refer to Meyn and Tweedie (1993); Robert and Casella (2005). There is a question of when such an M as in proposition 2.6 will exist. Under a compactness condition and continuity assumptions on both the proposal and target densities, an affirmative existence result can be given.
Corollary 2.7. If, in addition, 𝒳 is a compact set, if π and π̃ are continuous on 𝒳, and if π̃(x) > 0 for all x ∈ 𝒳, then there exists such an M as in proposition 2.6.
A proof is given in appendix E. The transition kernel of the independent Metropolis-Hastings sampler has the form
$$K(x, \mathrm{d}x') = \min\left\{1, \frac{\pi(x')\,\tilde{\pi}(x)}{\pi(x)\,\tilde{\pi}(x')}\right\} \tilde{\Pi}(\mathrm{d}x') + \delta_x(\mathrm{d}x') \int_{\mathcal{X}} \left(1 - \min\left\{1, \frac{\pi(x'')\,\tilde{\pi}(x)}{\pi(x)\,\tilde{\pi}(x'')}\right\}\right) \tilde{\Pi}(\mathrm{d}x''). \tag{2}$$
The first term in eq. (2) is the probability of an accepted transition from x to the region dx′ whereas the second term is the probability of remaining at x, which only contributes if x lies in the region dx′.
2.3. Adaptive Transition Kernels
As alluded to in section 1, the transition kernel may depend on parameters, denoted by θ and taking values in a set 𝒴. In this case, we express the dependency of the kernel K on its parameters by writing Kθ. In adaptive MCMC, given a target probability measure Π, we seek to strategically construct a sequence of transition kernels (KΘ0, KΘ1, …), where (Θ0, Θ1, …) is a sequence of 𝒴-valued random variables. Ideally, the sequence will enable sampling from Π that becomes more effective with each step. In the adaptive MCMC framework, the one-step transition law for Xn+1 given Xn = xn and Θn = θn is Kθn(xn, ·). The n-step transition law given X0 = x0 and (Θ0 = θ0, …, Θn−1 = θn−1) is
$$G^n(x_0, A \mid \theta_0, \ldots, \theta_{n-1}) = \int_{\mathcal{X}} \cdots \int_{\mathcal{X}} K_{\theta_{n-1}}(x_{n-1}, A)\, K_{\theta_{n-2}}(x_{n-2}, \mathrm{d}x_{n-1}) \cdots K_{\theta_0}(x_0, \mathrm{d}x_1). \tag{3}$$
Therefore, by the law of total expectation, the n-step transition law given X0 = x0 is Gn(x0, ·) = 𝔼[Gn(x0, · | Θ0, …, Θn−1)], where the expectation is computed over the marginal distribution of the parameters. We now give a precise definition of what it means for an adaptive MCMC procedure to be ergodic.
Definition 2.8. The n-step transition law Gn is said to be ergodic for the probability measure Π if, for every x ∈ 𝒳, limn→∞ ∥Gn(x,·) − Π(·)∥TV = 0.
The principal theoretical tools of our analysis are the definitions of containment, simultaneous uniform ergodicity, and diminishing adaptation. Diminishing adaptation together with either containment or simultaneous uniform ergodicity implies ergodicity of the adaptive MCMC procedure in the sense of definition 2.8. The remainder of this section is a review of Roberts and Rosenthal (2007); Bai et al. (2011).
Definition 2.9. The sequence of Markov transition kernels (KΘn)n∈ℕ is said to exhibit diminishing adaptation if limn→∞ supx∈𝒳 ∥KΘn+1(x, ·) − KΘn(x, ·)∥TV = 0 in probability.
Lemma 2.10 (Roberts and Rosenthal (2007)). Suppose that Θn+1 = Θn w.p. 1 − αn and otherwise Θn+1 = θ′ where θ′ is any other element of the index set 𝒴. If limn→∞ αn = 0, then (KΘn)n∈ℕ exhibits diminishing adaptation.
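A minimal sketch of this mechanism (our illustration; `draw_new_parameter` is a hypothetical stand-in for an arbitrary, even badly behaved, adaptation rule):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_new_parameter(rng):
    # Hypothetical stand-in for any adaptation rule whatsoever.
    return rng.normal()

theta = 0.0
for n in range(1, 10_001):
    alpha_n = 1.0 / n  # adaptation probability, decaying to zero
    if rng.uniform() < alpha_n:
        theta = draw_new_parameter(rng)  # Theta_{n+1} != Theta_n w.p. alpha_n
    # ...otherwise Theta_{n+1} = Theta_n; an MCMC step with K_theta goes here.
```

Because the probability of any change at step n vanishes, the difference between successive kernels converges to zero in probability regardless of how the new parameter is chosen.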
Definition 2.11. Define Wϵ(x, K) = inf {n ≥ 1 : ∥Kn(x, ·) − Π(·)∥TV < ϵ}. The sequence (Θ0, Θ1, …) is said to exhibit containment if, for every ϵ > 0, the sequence (Wϵ(Xn, KΘn))n∈ℕ is bounded in probability given X0 = x0 and Θ0 = θ0, where (Xn, Θn)n∈ℕ is the adaptive chain together with its sequence of parameters.
Containment states that for a particular stochastic sequence of adaptations there is, with arbitrarily high probability, a finite number of steps one may take with any of the parameters in the sequence in order to be arbitrarily close to the target distribution. The following theorems give the relationships between diminishing adaptation, simultaneous uniform ergodicity, containment, and ergodicity of the adaptive MCMC procedure. The proofs of these results may be found in (Roberts and Rosenthal, 2007).
Theorem 2.12. Let {Kθ}θ∈𝒴 be a family of Markov chain transition kernels that are all stationary for the same distribution Π. Suppose that the family satisfies simultaneous uniform ergodicity (definition N.1) and that the sequence (Θ0, Θ1, …) satisfies definition 2.9. Then the chain whose transitions are governed by (KΘn)n∈ℕ is ergodic for the distribution Π.
Theorem 2.13. Let {Kθ}θ∈𝒴 be a family of Markov chain transition kernels that are all stationary for the same distribution Π. Suppose that the sequence (Θ0, Θ1, …) satisfies definitions 2.9 and 2.11. Then the chain whose transitions are governed by (KΘn)n∈ℕ is ergodic for the distribution Π.
3. RELATED WORK
A series of works recently investigated learning a proposal distribution for the independent Metropolis-Hastings sampler with normalizing flows, in particular for statistical mechanics field theories. For such models, Albergo et al. (2019), followed by Nicoli et al. (2020, 2021), used stochastic independent adaptations based on the optimization of the reverse Kullback-Leibler (KL) divergence, as in example 2 of the next section. While this strategy is successful when the target is unimodal, it is known to yield underdispersed approximations of the target distribution and to be prone to mode collapse. Within the framework of variational inference, Naesseth et al. (2020) proposed to address these issues by instead optimizing an approximate forward KL using simple parametric families for the proposal. In this case, adaptations are stochastic and rely on the previous states of the chain to estimate gradients of the approximate forward KL, called the “pseudo-likelihood” in example 3 of the present paper. Incorporating normalizing flows, Gabrié et al. (2021a) successfully sampled multimodal distributions using an initialization that echoes the containment property. In the context of statistical field theories, Hackett et al. (2021) also demonstrated the need for forward KL training to assist sampling of multimodal distributions while surveying strategies to obtain training samples different from the adaptive MCMC discussed here. Hoffman et al. (2019) focus on using normalizing flows to adapt Hamiltonian Monte Carlo to unfavorable posterior geometry by transforming a complicated posterior into an isotropic Gaussian.
Among the works above, ergodicity was only tested numerically. One exception is Gabrié et al. (2021a), where a convergence argument based on a continuous-time analysis is developed under the assumption of perfect adaptation. The present paper provides a theoretical framework to analyze the ergodicity of the methods presented in the body of work above. Though our work has focused on establishing ergodicity via the mechanism of Roberts and Rosenthal (2007), we note the work of Andrieu and Moulines (2006), which may also be used to establish an ergodicity theory. We concur with the statement in Roberts and Rosenthal (2007) that Andrieu and Moulines (2006) “requir[es] other technical hypotheses which may be difficult to verify in practice” and that diminishing adaptation and containment are “somewhat simpler conditions.” Holden et al. (2009) considered the case of independent adaptations of the independent Metropolis-Hastings algorithm; however, this technique requires that accepted and rejected states be treated identically in the adaptation procedure, so we do not consider it further. We also note that Parno and Marzouk (2018) investigated the ergodicity of an adaptive MCMC using invertible maps. These works have similar aims but differ in several key details. For instance, Parno and Marzouk (2018) focuses on establishing an ergodicity theory for triangular transformations of a Gaussian base measure, representing a local proposal distribution, which in practice is accomplished by employing third-degree Hermite polynomials. Our work, on the other hand, employs normalizing flows as global proposal mechanisms (independent of the current state of the chain). This necessitates a somewhat different treatment in order to establish ergodicity of the adaptive chain. As in this work, a pseudo-likelihood objective (see example 3) is employed in order to inform adaptations, but their objective is concave due to the choice of Hermite polynomials, whereas in the case of neural networks the objective is more complex. Parno and Marzouk (2018) also assumes that the parameter space 𝒴 is compact, which is untrue for typical parameterizations of normalizing flows, and insists on enforcing diminishing adaptation probabilistically (lemma 2.10), whereas we also allow parameters to converge in probability (lemma 4.2 and theorem 4.3).
4. ANALYTICAL APPARATUS
We now consider the principal problem of this paper: When can the adaptive independent Metropolis-Hastings sampler with proposal distribution parameterized by a normalizing flow be given an ergodicity theory? We separate our discussion into two components wherein the adaptations are either deterministic or not necessarily independent of the state of the chain.
4.1. Deterministic Adaptations
Theorem 4.1. Let Π be a probability measure with density π. Suppose that every θ ∈ 𝒴 parameterizes a probability measure Π̃θ on 𝔅(𝒳) with density π̃θ. Suppose that (θ0, θ1, …) is a deterministic 𝒴-valued sequence. Let (Kθn)n∈ℕ be the associated sequence of Markov transition kernels of the independent Metropolis-Hastings sampler of Π given Π̃θn. Let Kn(x0, A) denote the n-step transition probability from x0 to A ∈ 𝔅(𝒳) with law eq. (3). Then Π is the stationary distribution for Kn. Suppose further that for each n ∈ ℕ there exists Mn ≥ 1 satisfying π(x) ≤ Mn π̃θn(x) for all x ∈ Supp(π). Then, ∥Kn(x0, ·) − Π(·)∥TV ≤ 2 ∏_{m=0}^{n−1} (1 − 1/Mm).
A proof is given in appendix B. We note that theorem 4.1 permits great generality in how θ parameterizes Π̃θ; indeed, our analysis here, and subsequently, applies to any parameterized family of distributions.
Example 1. Let Π be a probability measure with density π. Let 𝒴 = ℝ^m and suppose that every θ ∈ 𝒴 smoothly parameterizes a probability measure Π̃θ on 𝔅(𝒳) with density π̃θ for which Supp(π) ⊆ Supp(π̃θ). Consider the initial value problem
$$\dot{\theta}(t) = -\nabla_\theta \mathrm{KL}\big(\tilde{\Pi}_{\theta(t)} \,\|\, \Pi\big), \qquad \theta(0) = \theta_0, \tag{4}$$
where θ0 ∈ 𝒴. Let (t0, t1, …) be a deterministic sequence of times and let θn = θ(tn) for n ∈ ℕ. Consider the family of Markov chain transition operators of the independent Metropolis-Hastings sampler of Π given Π̃θn with transition kernels (Kθn)n∈ℕ. Then Π is the stationary distribution of the Markov chain whose transitions satisfy Xn+1 ~ Kθn(Xn, ·): from the condition Supp(π) ⊆ Supp(π̃θ) it follows that Π is the stationary distribution for each Kθ, and since (θ0, θ1, …) is a deterministic sequence, it follows from theorem 4.1 that Π is the stationary distribution of the adaptive chain. The particular mechanism of producing a deterministic sequence was not important; however, the time derivative eq. (4) was chosen because it begins to imitate the evolution encountered in normalizing flow loss functions. ∥
4.2. Non-Independent Adaptations
Notice that the decision to make the adaptation and the subsequent state of the chain dependent is not artificial or contrived; in fact, if such a procedure can be equipped with an ergodicity theory, then the resulting algorithm would have an important computational advantage. Specifically, it would require fewer evaluations of the target density (or the gradient of the target density) than the corresponding procedure with independent adaptations. For instance, the following adaptation scheme does not fall into the category of independent adaptations.
Example 2. Let Π be a probability measure with density π on a space 𝒳. Let 𝒴 = ℝ^m and suppose that every θ ∈ 𝒴 smoothly parameterizes a probability measure Π̃θ on 𝔅(𝒳) with density π̃θ for which Supp(π) ⊆ Supp(π̃θ). Let X̃n+1 ~ Π̃Θn(·) be the proposal produced by the independent Metropolis-Hastings sampler of Π given Π̃Θn. Consider the adaptation
$$\Theta_{n+1} = \Theta_n - \epsilon_{n+1} \nabla_\theta \left[\log \tilde{\pi}_\theta(\tilde{X}_{n+1}) - \log \pi(\tilde{X}_{n+1})\right]\Big|_{\theta = \Theta_n}, \tag{5}$$
which, for a step size ϵn+1 > 0, can be interpreted as the single-sample approximation of the gradient flow of KL(Π̃θ ∥ Π). ∥
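To make eq. (5) concrete, the following is a minimal sketch (our own, not the paper's implementation) of the reparameterized single-sample update for a one-dimensional affine flow x = e^s·z + b with standard Gaussian base; the target log π is a hypothetical placeholder and constants independent of θ are dropped:

```python
import torch

log_s = torch.zeros(1, requires_grad=True)  # log-scale of the affine flow
b = torch.zeros(1, requires_grad=True)      # shift of the affine flow

def log_pi(x):
    # Hypothetical target: standard Gaussian, up to an additive constant.
    return -0.5 * (x ** 2).sum()

for n in range(1000):
    z = torch.randn(1)
    x = torch.exp(log_s) * z + b   # reparameterized proposal sample
    # Log flow density at x, written via z and dropping theta-free constants.
    log_q = -0.5 * (z ** 2).sum() - log_s.sum()
    loss = log_q - log_pi(x)       # single-sample estimate of KL(ptilde || pi)
    loss.backward()
    eps = 1e-2                     # step size eps_{n+1} in eq. (5)
    with torch.no_grad():
        log_s -= eps * log_s.grad
        b -= eps * b.grad
    log_s.grad.zero_()
    b.grad.zero_()
```

Because the sample x is a differentiable function of the parameters, a single backward pass yields the stochastic gradient in eq. (5).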
This motivates us to explore this direction. Definition 2.9 and the continuous mapping theorem (see theorem D.1) lead immediately to the following result.
Lemma 4.2. Suppose that the map θ ↦ Kθ is continuous and that the sequence (Θ0, Θ1, …) converges in probability in 𝒴. Then (KΘn)n∈ℕ exhibits diminishing adaptation.
A proof is given in appendix D. We now consider the question of the continuity of the mapping θ ↦ Kθ.
Theorem 4.3. Let (θ1, θ2, …) be a 𝒴-valued sequence converging to θ. Let π be a probability density function on a space 𝒳 and let {π̃θ}θ∈𝒴 be a family of density functions on 𝒳 indexed by θ such that the map θ ↦ π̃θ(x) is continuous for every x ∈ 𝒳. Suppose further that Supp(π) ⊆ Supp(π̃θ) for every θ ∈ 𝒴. Let Π be the probability measure on 𝔅(𝒳) with density π and let Π̃θ be the probability measure on 𝔅(𝒳) with density π̃θ. Let Kθ be the transition kernel of the independent Metropolis-Hastings sampler of Π given Π̃θ. Then limn→∞ d(Kθn, Kθ) = 0 (i.e. the mapping θ ↦ Kθ is continuous).
A proof is given in appendix D. When training normalizing flows, it is typical to apply stochastic gradient descent to the minimization of some loss function. The question of when the iterates of stochastic gradient descent converge is an important one that has recently been treated in the case of non-convex losses. We refer the interested reader to Bottou (1999); Mertikopoulos et al. (2020) for conditions and results guaranteeing the convergence of stochastic gradient descent. In practice, the convergence of the sequence of normalizing flow parameters can be further encouraged by a decreasing learning rate schedule. In appendix N we discuss simultaneous uniform ergodicity on compact spaces and give some examples of normalizing flows suited to these cases. The condition for geometric ergodicity of the independent Metropolis-Hastings sampler is that there exists M ≥ 1 such that π(x) ≤ M π̃(x) for all x ∈ Supp(π), where π is the density of the target distribution and π̃ is the proposal density. By taking the logarithm of both sides and rearranging, we obtain the equivalent inequality log π(x) − log π̃(x) ≤ log M for all x ∈ Supp(π).
Proposition 4.4. Suppose that every θ ∈ 𝒴 parameterizes a probability measure Π̃θ on 𝔅(𝒳) with density π̃θ. Let (Θ0, Θ1, …) be a sequence of 𝒴-valued random variables and consider the family of Markov chain transition operators of the independent Metropolis-Hastings sampler of Π given Π̃Θn with transition kernels (KΘn)n∈ℕ. Suppose that for all δ > 0, there exists M ≡ M(δ) ∈ [1, ∞) such that
$$\Pr\left[\pi(x) \le M\,\tilde{\pi}_{\Theta_n}(x)\ \ \text{for all}\ x \in \mathrm{Supp}(\pi)\right] \ge 1 - \delta \tag{6}$$
for all n ∈ ℕ. Then, (KΘn)n∈ℕ exhibits containment.
A proof is given in appendix F. Regarding the tail condition in eq. (6), we note that the tail behaviour of the most popular normalizing flow architectures can be explicitly controlled, as shown by Jaini et al. (2020). Specifically, with Lipschitz triangular bijections (including most affine coupling flow implementations) the tail behaviour remains identical to that of the base measure. Thus, to ensure heavy tails in a flow one can simply replace the typical Gaussian base measure with a heavier-tailed one, e.g., a Laplace or Student-t. An even stronger condition than eq. (6) is that Pr[supx∈Supp(π) |log π(x) − log π̃Θn(x)| ≤ log M] ≥ 1 − δ for every n. Thus, we see that containment can be obtained for the transition kernels of the independent Metropolis-Hastings sampler if, for every n, log π̃Θn is within log M of log π with probability 1 − δ. Note that M does not even need to be close to unity (equivalently, log M need not be close to zero) in order for containment to hold; it is sufficient merely that, with high probability, the sequence does not produce arbitrarily poor approximations of log π.
The loss functions used in estimating normalizing flows are chosen to encourage closeness of the approximation and the target density. For instance, if one chooses to minimize KL(Π̃θ ∥ Π) as a function of θ ∈ 𝒴, then the loss is non-negative and equals zero if and only if Π̃θ = Π. The minimization of a loss function that encourages the closeness of the approximation and the target density is certainly no guarantee that eq. (6) holds; however, it gives an indication that eq. (6) might be true. We turn our attention in the next section to the empirical evaluation of adaptive samplers using normalizing flows. Some obstacles that could prevent the conditions of proposition 4.4 from holding are stated in appendix L.
Example 3. Recently, Gabrié et al. (2021a,b) proposed to sample from Boltzmann distributions and posteriors over the parameters of physical systems by alternating between an independent Metropolis-Hastings algorithm whose proposal is represented as a RealNVP normalizing flow (Dinh et al., 2017) and local updates computed by the Metropolis-adjusted Langevin algorithm (MALA). In Gabrié et al. (2021a) the authors “demonstrate the importance of initializing the training with some a priori knowledge of the relevant modes.” This incorporation of prior knowledge is done to avert mode collapse. We can connect knowledge of modes to the property of containment: by ensuring that the proposal density of the independent Metropolis-Hastings sampler places sufficient mass on all modes with high probability, one satisfies containment by proposition 4.4. The specific training procedure used by these samplers is to adapt the parameters of the normalizing flow as Θn+1 = Θn + εn+1 ∇θ (1/n) Σ_{i=1}^{n} log π̃θ(Xi) |θ=Θn, where (X1, …, Xn) are the states of the chain up to the nth step and (εn)n∈ℕ is a sequence of adaptation step-sizes. Because the states of the chain can only be regarded as approximate samples from the target distribution, we understand this update as seeking to maximize a “pseudo-likelihood.” Diminishing adaptation of this procedure can be enforced either using lemma 2.10 or via convergence and continuity using lemma 4.2. When diminishing adaptation and containment are satisfied, this adaptive algorithm produces an ergodic chain by theorem 2.13. ∥
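The following is a minimal sketch of this scheme (our own reading of the pseudo-likelihood update, not the authors' code): an affine flow adapted by gradient ascent on the mean log flow density of recent chain states, with a decaying step size to enforce diminishing adaptation. The one-dimensional Gaussian target is a hypothetical stand-in.

```python
import torch

log_s = torch.zeros(1, requires_grad=True)  # affine flow parameters
b = torch.zeros(1, requires_grad=True)

def log_pi(x):
    # Hypothetical target: unit-variance Gaussian centered at 3.
    return -0.5 * ((x - 3.0) ** 2).sum()

def log_flow(x):
    # Log-density of the affine flow N(b, exp(2 * log_s)), up to a constant.
    return -0.5 * (((x - b) / torch.exp(log_s)) ** 2).sum() - log_s.sum()

x = torch.zeros(1)  # current state of the chain
history = []
for n in range(1, 2001):
    with torch.no_grad():
        # Independent Metropolis-Hastings step with the current flow proposal.
        x_prop = torch.exp(log_s) * torch.randn(1) + b
        log_alpha = (log_pi(x_prop) + log_flow(x)
                     - log_pi(x) - log_flow(x_prop))
        if torch.log(torch.rand(1)) < log_alpha:
            x = x_prop
    history.append(x.clone())
    # Pseudo-likelihood adaptation: ascend the mean log flow density of past
    # states (here a recent window rather than the full history, for brevity).
    loss = -torch.stack([log_flow(xi) for xi in history[-100:]]).mean()
    loss.backward()
    eps_n = 0.05 / n ** 0.6  # decaying step size -> diminishing adaptation
    with torch.no_grad():
        log_s -= eps_n * log_s.grad
        b -= eps_n * b.grad
    log_s.grad.zero_()
    b.grad.zero_()
```

Note that the proposal accepted into the chain also enters the adaptation, which is precisely the non-independent setting analyzed above.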
5. EXPERIMENTS
Here we evaluate the adaptive independent Metropolis-Hastings algorithm following the “pseudo-likelihood” objective, with non-independent adaptations, summarized in algorithm 1 in appendix M. As a baseline adaptive MCMC technique, we consider the random walk Metropolis method of Haario et al. (2001); we also compare against Langevin dynamics. To assess the ergodicity of samplers, we compare MCMC samples against analytic samples drawn from the target density, except in the case of the physical system, wherein we use domain knowledge to compare against Langevin dynamics. Specifically, we choose 10,000 random unit vectors and project the samples of the adaptive chain onto the one-dimensional subspaces spanned by the chosen unit vectors; we then compare these one-dimensional quantities to the corresponding projections of the baseline samples from the target distribution and compute the two-sample Kolmogorov-Smirnov (KS) test statistic (Smirnov, 1948; Cuesta-Albertos et al., 2006). In appendix J, we show how adaptation can actually degrade sample quality at finite time.
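A minimal sketch of this diagnostic (ours; array shapes and the projection count are assumptions) using scipy.stats.ks_2samp:

```python
import numpy as np
from scipy.stats import ks_2samp

def projected_ks(mcmc_samples, reference_samples, n_projections=10_000, seed=0):
    """Two-sample KS statistics along random one-dimensional projections.

    mcmc_samples, reference_samples: arrays of shape (n_samples, dim).
    Returns one KS statistic per random unit vector.
    """
    rng = np.random.default_rng(seed)
    dim = mcmc_samples.shape[1]
    stats = np.empty(n_projections)
    for i in range(n_projections):
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)  # random unit vector
        stats[i] = ks_2samp(mcmc_samples @ u, reference_samples @ u).statistic
    return stats
```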
Code implementing the experiments on the Brownian bridge, the two-dimensional multimodal example, and the experiments of appendix J can be found at https://github.com/JamesBrofos/Adaptive-Normalizing-Flow-Chains.
5.1. Affine Flows in a Brownian Bridge
We consider sampling from a Gaussian process with mean function μ(t) = sin(πt) and covariance function Σ(t, s) = min(t, s) − st. For 0 < t, s < 1, the covariance function identifies this distribution as a Brownian bridge whose mean is a sinusoid. We seek to sample this Gaussian process at 50 equally spaced times in the unit interval, yielding a fifty-dimensional target distribution. We estimate an affine normalizing flow from a Gaussian base distribution in order to sample from the target. Since the base distribution of the flow is Gaussian, and since affine transformations of Gaussian random variables remain Gaussian, the proposal is itself Gaussian; in addition to the pseudo-likelihood training objective, we therefore also consider gradient descent on the exact KL divergence between the target and the current proposal distribution, which is available in closed form for two Gaussians. Minimization of the exact KL divergence is equivalent to maximum likelihood training, and therefore allows us to compare the efficiency lost by training with the pseudo-likelihood objective rather than the true likelihood. To enforce diminishing adaptation, we set a learning rate schedule for the gradient steps on the shift and scale of the affine transformation that converges to zero. In addition to Langevin dynamics, we also consider a preconditioned variant of the Metropolis-adjusted Langevin algorithm that uses the Hessian of the log-density to adapt proposals to the geometry of the target distribution (Girolami and Calderhead, 2011). Results shown in fig. 2 demonstrate the advantages of the adaptive independent Metropolis-Hastings samplers.
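For reference, the target of this experiment can be constructed directly; the following sketch (grid conventions are our assumption) builds the mean and covariance and draws exact samples:

```python
import numpy as np

t = np.linspace(0, 1, 52)[1:-1]  # 50 interior times in the unit interval
mu = np.sin(np.pi * t)           # sinusoidal mean function
Sigma = np.minimum.outer(t, t) - np.outer(t, t)  # Brownian bridge covariance

# Exact samples from the fifty-dimensional Gaussian target, for comparison.
rng = np.random.default_rng(0)
reference = rng.multivariate_normal(mu, Sigma, size=5000)
```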
Figure 2: Result of the Brownian bridge experiment. In assessing ergodicity according to the distribution of Kolmogorov-Smirnov statistics along random one-dimensional subspaces, the methods based on the independent Metropolis-Hastings algorithm and preconditioned Langevin dynamics perform best. Langevin dynamics struggles in this posterior due to the multi-scale phenomena present in this distribution. In terms of the effective sample size per second of computation, the near-independent proposals and high acceptance rate of the independent Metropolis-Hastings sampler cause these algorithms to dominate. We also show the acceptance probability of the adaptive methods; we observe that the independent Metropolis-Hastings procedures enjoy adaptations that cause the acceptance probability to consistently improve over the course of learning.
5.2. Two-Dimensional Examples
We use a RealNVP architecture to model a multimodal distribution and Neal’s funnel distribution, both in ℝ². The multimodal density is a mixture of two Gaussians with a shared covariance structure given by Σ = diag(1/100, 1/100). The two means of the component Gaussians are (−2, 2) and (2, −2). Neal’s funnel distribution is defined by generating v ~ Normal(0, 9) and x ~ Normal(0, e^{−v}), which together define a distribution in ℝ². To enforce diminishing adaptation, we set a learning rate schedule for the gradient steps on the parameters of the RealNVP bijections that converges to zero. Results are shown in fig. 3. The expressivity of the RealNVP normalizing flow is key to building efficient proposals accommodating the distinct modes or the challenging multi-scale structure of Neal’s funnel.
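Analytic samples from both targets follow directly from their generative definitions; a short sketch (sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Mixture of two Gaussians with shared covariance diag(1/100, 1/100).
means = np.array([[-2.0, 2.0], [2.0, -2.0]])
comp = rng.integers(0, 2, size=n)            # equal-weight component labels
mixture = means[comp] + rng.normal(scale=0.1, size=(n, 2))

# Neal's funnel as defined above: v ~ N(0, 9), x | v ~ N(0, exp(-v)).
v = rng.normal(scale=3.0, size=n)
x = rng.normal(size=n) * np.exp(-v / 2.0)
funnel = np.column_stack([x, v])
```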
Figure 3: Examination of the performance of MCMC methods on sampling from the multimodal mixture of Gaussians and Neal’s funnel distribution. Both adaptive methods enjoy increasing acceptance rates on the multimodal distribution as a function of sampling iteration, but only the adaptive independent Metropolis-Hastings algorithm exhibits ergodicity for this distribution. Indeed, for the adaptive random walk and Langevin sampling methods, which are based on local updates, the multimodal distribution poses distinct challenges; in fact, both methods get stuck in one of the modes. By contrast, the adaptive independent Metropolis-Hastings samplers exhibit the best ergodicity of all methods considered. On Neal’s funnel distribution, the adaptive independent Metropolis-Hastings algorithm likewise possesses the best ergodicity.
5.3. Analysis of a Physical Field System
We finally revisit a high-dimensional bimodal example: the 1d ϕ⁴ system studied in Gabrié et al. (2021a). The statistics of a field ϕ : [0, 1] → ℝ are given by the Boltzmann weight e^{−βU(ϕ)} with the energy functional
$$U(\phi) = \int_0^1 \left[\frac{a}{2}\left(\frac{\partial \phi}{\partial x}\right)^{\!2} + \frac{1}{4a}\left(1 - \phi^2(x)\right)^2\right] \mathrm{d}x, \tag{7}$$
assuming boundary conditions ϕ(0) = ϕ(1) = 0. We discretize the field at 100 equally spaced locations between 0 and 1, for a = 0.1 and β = 20. Examples of states are plotted in fig. 6 of appendix O. The algorithm proposed in Gabrié et al. (2021a) is adapted with a learning rate schedule enforcing diminishing adaptations and a mixture transition kernel stochastically choosing between local Langevin updates and proposal sampling from the normalizing flow (appendix I shows that we can expect this mixture kernel to exhibit containment and diminishing adaptation). Because the distribution is high-dimensional and multimodal, it is necessary¹ to run multiple parallel walkers initialized around the different modes. In this specific case, the energy and the distribution are even functions of ϕ. In the experiments, we initialize 100 walkers with uneven proportions in each mode (20–80) and test for the ergodicity of the parallel chains. Results are shown in fig. 4. Unlike the adaptive independent Metropolis-Hastings samplers, single MALA walkers are stuck in the mode they were initialized in and cannot recover the correct equal weights of the positive and negative modes. We also compare with the Jump Adaptive Multimodal Sampler (JAMS) of Pompe et al. (2020), using a MALA sampler for the local steps and an adaptive Gaussian mixture for the jumps, and report results for the best of 10 runs and a typical failed run where the chains collapsed into one mode. We observe a mean KS statistic of 0.12 for one chain (best of 10 runs), compared to a mean KS statistic of 0.024 for the normalizing flow IMH with the same number of iterations. The acceptance rate of jumps in JAMS reaches around 40% while the IMH reaches 55%. Additional details can be found in appendix O. Code to reproduce this experiment can be found at https://github.com/marylou-gabrie/adapt-flow-ergo.
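Under our reading of eq. (7), a simple finite-difference discretization of the energy is as follows (a sketch; the exact scheme used in the experiments may differ):

```python
import numpy as np

def phi4_energy(phi, a=0.1):
    """Finite-difference discretization of eq. (7).

    phi: field values at interior grid points; boundaries phi(0) = phi(1) = 0.
    """
    phi_ext = np.concatenate([[0.0], phi, [0.0]])  # enforce boundary conditions
    h = 1.0 / (len(phi_ext) - 1)                   # grid spacing
    grad = np.diff(phi_ext) / h                    # forward differences
    return (np.sum(0.5 * a * grad**2 * h)
            + np.sum((1 - phi**2)**2 / (4 * a) * h))

# Unnormalized log-density of the Boltzmann weight exp(-beta * U).
log_density = lambda phi, beta=20.0: -beta * phi4_energy(phi)
```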
Figure 4: Results of the ϕ⁴ field experiment. As Langevin dynamics is unable to mix between the two modes, the better ergodicity of the independent Metropolis-Hastings algorithm is reflected in the Kolmogorov-Smirnov statistics, as expected. The single-chain Langevin sampler has poorer ergodicity than its parallel-chain equivalent, while for the IMH a single chain approaches the ergodicity of the parallel setting. The effective sample sizes are reported for chains of 1000 steps extracted at burn-in, after 4 × 10⁴ iterations (early), and when the NF proposal acceptance probability has reached 50% (late). Note that periodic jumps in acceptance correspond to iterations where the learning rate was decreased.
6. CONCLUSION
We have examined the question of when an adaptive independent Metropolis-Hastings sampler can be equipped with an ergodicity theory. We specifically consider the case wherein the proposal distribution is parameterized as a normalizing flow. We have considered the cases of deterministic adaptations, independent adaptations, and non-independent adaptations. For the non-independent adaptations case, we examine mechanisms by which to enforce the diminishing adaptation and containment conditions that together imply ergodicity.
Acknowledgements
M. G. would like to thank G. Rotskoff and E. Vanden-Eijnden for useful discussions about the physical field system experiments. We thank the Yale Center for Research Computing for use of the research computing infrastructure. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1752134. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. MAB was supported by an NSERC Discovery Grant and as part of the Vision: Science to Applications program, thanks in part to funding from the Canada First Research Excellence Fund. The work is also supported in part by NIH/NIGMS 1R01GM136780-01 and AFOSR FA9550-21-1-0317.
Footnotes
Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Valencia, Spain. PMLR: Volume 151. Copyright 2022 by the author(s).
This necessity can be lifted by employing an auxiliary fixed set of “training samples” featuring the two modes, in arbitrary proportions. These samples would drive the learning towards relevant regions, so that a random walker can then inform the adaptation about the relative statistical weights of the different modes.
Contributor Information
James A. Brofos, Yale University
Marylou Gabrié, CDS, New York University, CCM, Flatiron Institute.
Marcus A. Brubaker, York University, Vector Institute
Roy R. Lederman, Yale University
References
- Albergo MS, Kanwar G, and Shanahan PE (2019). Flow-based generative models for Markov chain Monte Carlo in lattice field theory. Physical Review D, 100(3):034515.
- Andrieu C and Moulines É (2006). On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462–1505.
- Bai Y, Roberts G, and Rosenthal J (2011). On the containment condition for adaptive Markov chain Monte Carlo algorithms. Advances and Applications in Statistics, 21.
- Berglund N, Gesù GD, and Weber H (2017). An Eyring–Kramers law for the stochastic Allen–Cahn equation in dimension two. Electronic Journal of Probability, 22:1–27.
- Bottou L (1999). On-line learning and stochastic approximations, pages 9–42. Publications of the Newton Institute. Cambridge University Press.
- Cuesta-Albertos J, Fraiman R, and Ransford T (2006). Random projections and goodness-of-fit tests in infinite-dimensional spaces. Bulletin of the Brazilian Mathematical Society, 37:477–501.
- Dinh L, Sohl-Dickstein J, and Bengio S (2017). Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.
- Falorsi L, de Haan P, Davidson TR, and Forré P (2019). Reparameterizing distributions on Lie groups. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 3244–3253. PMLR.
- Gabrié M, Rotskoff GM, and Vanden-Eijnden E (2021a). Adaptive Monte Carlo augmented with normalizing flows.
- Gabrié M, Rotskoff GM, and Vanden-Eijnden E (2021b). Efficient Bayesian sampling using normalizing flows to assist Markov chain Monte Carlo methods. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models.
- Girolami M and Calderhead B (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.
- Haario H, Saksman E, and Tamminen J (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242.
- Hackett DC, Hsieh C-C, Albergo MS, Boyda D, Chen J-W, Chen K-F, Cranmer K, Kanwar G, and Shanahan PE (2021). Flow-based sampling for multimodal distributions in lattice field theory. arXiv preprint, 2107.00734.
- Hoffman M, Sountsov P, Dillon JV, Langmore I, Tran D, and Vasudevan S (2019). NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint, 1903.03704.
- Holden L, Hauge R, and Holden M (2009). Adaptive independent Metropolis–Hastings. The Annals of Applied Probability, 19(1):395–413.
- Huang C-W, Krueger D, Lacoste A, and Courville A (2018). Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2078–2087. PMLR.
- Jaini P, Kobyzev I, Yu Y, and Brubaker M (2020). Tails of Lipschitz triangular flows. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4673–4681. PMLR.
- Jaini P, Selby KA, and Yu Y (2019). Sum-of-squares polynomial flow. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research. PMLR.
- Kobyzev I, Prince S, and Brubaker M (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Kolmogorov AN (1960). Foundations of the Theory of Probability. Chelsea Publishing Company, 2nd edition.
- Lebanon G (2017). Probability. Online manuscript.
- Maire F, Friel N, Mira A, and Raftery AE (2019). Adaptive incremental mixture Markov chain Monte Carlo. Journal of Computational and Graphical Statistics, 28(4):790–805.
- Mertikopoulos P, Hallak N, Kavis A, and Cevher V (2020). On the almost sure convergence of stochastic gradient descent in non-convex problems. In Advances in Neural Information Processing Systems, volume 33, pages 1117–1128. Curran Associates, Inc.
- Meyn S and Tweedie R (1993). Markov Chains and Stochastic Stability. Springer-Verlag, London.
- Naesseth CA, Lindsten F, and Blei D (2020). Markovian score climbing: Variational inference with KL(p||q). In Advances in Neural Information Processing Systems, volume 33.
- Nicoli KA, Anders CJ, Funcke L, Hartung T, Jansen K, Kessel P, Nakajima S, and Stornati P (2021). Estimation of thermodynamic observables in lattice field theories with deep generative models. Physical Review Letters, 126(3):032001.
- Nicoli KA, Nakajima S, Strodthoff N, Samek W, Müller KR, and Kessel P (2020). Asymptotically unbiased estimation of physical observables with neural samplers. Physical Review E, 101(2).
- Papamakarios G, Nalisnick E, Rezende DJ, Mohamed S, and Lakshminarayanan B (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64.
- Parno MD and Marzouk YM (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682.
- Pollard D (2001). A User’s Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
- Pompe E, Holmes C, and Łatuszyński K (2020). A framework for adaptive MCMC targeting multimodal distributions. Annals of Statistics, 48(5):2930–2952.
- Rezende DJ, Papamakarios G, Racanière S, Albergo MS, Kanwar G, Shanahan PE, and Cranmer K (2020). Normalizing flows on tori and spheres. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8083–8092. PMLR.
- Robert CP and Casella G (2005). Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg.
- Roberts G and Rosenthal J (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1.
- Roberts GO and Rosenthal JS (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458–475.
- Royden HL (1968). Real Analysis. Macmillan, New York, 2nd edition.
- Smirnov N (1948). Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2):279–281.