Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
2023 Mar 27;381(2247):20220150. doi: 10.1098/rsta.2022.0150

On free energy barriers in Gaussian priors and failure of cold start MCMC for high-dimensional unimodal distributions

Afonso S Bandeira 1, Antoine Maillard 1, Richard Nickl 2, Sven Wang 3
PMCID: PMC10041355  PMID: 36970818

Abstract

We exhibit examples of high-dimensional unimodal posterior distributions arising in nonlinear regression models with Gaussian process priors for which Markov chain Monte Carlo (MCMC) methods can take an exponential run-time to enter the regions where the bulk of the posterior measure concentrates. Our results apply to worst-case initialized (‘cold start’) algorithms that are local in the sense that their step sizes cannot be too large on average. The counter-examples hold for general MCMC schemes based on gradient or random walk steps, and the theory is illustrated for Metropolis–Hastings adjusted methods such as preconditioned Crank–Nicolson and Metropolis-adjusted Langevin algorithm.

This article is part of the theme issue ‘Bayesian inference: challenges, perspectives, and prospects’.

Keywords: MCMC, Bayesian inference, Gaussian processes, computational hardness

1. Introduction

Markov chain Monte Carlo (MCMC) methods are the workhorse of Bayesian computation when closed formulae for estimators or probability distributions are not available. For this reason they have been central to the development and success of high-dimensional Bayesian statistics in recent decades, where one attempts to generate samples from a posterior distribution Π(·|data) arising from a prior Π on D-dimensional Euclidean space and the observed data vector. MCMC methods tend to perform well in a large variety of problems, are very flexible and user-friendly, and enjoy many theoretical guarantees. Under mild assumptions, they are known to converge to their stationary ‘target’ distributions as a consequence of the ergodic theorem, albeit perhaps at a slow speed, so that a large number of iterations may be required to obtain numerically accurate algorithms. When the target distribution is log-concave, MCMC algorithms are known to mix rapidly, even in high dimensions. But for general D-dimensional densities, we have only a restricted understanding of how the mixing time of Markov chains scales with D or with the ‘informativeness’ (sample size or noise level) of the data vector.

A classical source of difficulty for MCMC algorithms is multi-modal distributions. When there is a deep well in the posterior density between the starting point of an MCMC algorithm and the location where the posterior is concentrated, many MCMC algorithms are known to take a time exponential in the depth of the well to reach the target region, even in low-dimensional settings; see figure 1a and also the discussion surrounding proposition 4.2 below. However, for distributions with a single mode and when the dimension D is fixed, MCMC methods can usually be expected to perform well.

Figure 1. Two possible sources of MCMC hardness in high dimensions: multi-modal likelihoods and entropic barriers. (a) In low dimensions (here D=1), MCMC hardness usually arises from a non-unimodal likelihood, creating an ‘energy barrier’, even though the maximum likelihood is attained at θ=θ0. The MCMC algorithm is assumed to be initialized in the set S containing a local maximum of the likelihood. (b) Illustration of how entropic (or volumetric) difficulties arise, here in dimension D=3: the set of points close to θ0 has much less volume than the set of points far away. As D increases, this phenomenon is amplified: all ratios of volumes of the three sets T, W, S scale exponentially with D. (Online version in colour.)

In essence this article is an attempt to explain how, in high dimensions, wells can be formed without multi-modality of a given posterior distribution. The difficulty in this case is volumetric, also referred to as entropic: while the target region contains most of the posterior mass, its (prior) volume is so small compared to the rest of the space that an MCMC algorithm may take an exponential time to find it, see figure 1b. This competition between ‘energy’ (represented here by the log-likelihood ℓ_N in the posterior distribution dΠ(θ|data) ∝ exp{ℓ_N(θ) + log π(θ)} dθ) and ‘entropy’ (related to the prior density π) has also been exploited in recent work on statistical aspects of MCMC in various high-dimensional inference and statistical physics models [1–5]. These ideas date back to the nineteenth-century foundations of statistical mechanics [6] and the notion of free energy, a sum of energetic and entropic contributions which the system spontaneously attempts to minimize. The ‘MCMC-hardness’ phenomenon described above is then akin to the meta-stable behaviour of thermodynamical systems, such as glasses or supercooled liquids. As the temperature decreases, such systems can undergo a ‘first-order’ phase transition, in which a global free energy minimum (analogous to the target region above) abruptly appears, while the system remains trapped in a suboptimal local minimum of the free energy (the starting region of the MCMC algorithm). To reach thermodynamic equilibrium the system must cross an extensive free energy barrier: such a crossing requires an exponentially long time, so that the system appears equilibrated on all relevant timescales, similarly to the MCMC stuck in the starting region. Classical examples include glasses and the popular experiment in which supercooled water (i.e. water that remains liquid below its freezing point) freezes rapidly after a perturbation is introduced.

Inspired by recent work [4,5,7], let us illustrate some of the volumetric phenomena which are key to our results below. We separate the parameter space into three regions (see figures 1 and 2), which we name by common MCMC terminology: firstly a starting (or initialization) region S, where an algorithm starts; secondly a target region T where both the bulk of the posterior mass and the ground truth are situated; and thirdly an intermediate free-entropy well W that separates S from T. In our theorems, these regions will be characterized by their Euclidean distance to the ground truth parameter θ0 generating the data. The prior volumes of the ε-annuli {θ : r−ε < ‖θ−θ0‖₂ ≤ r}, r>0, closer to the ground truth are smaller than those further out, as illustrated in figure 1b, and in high dimensions this effect becomes quantitative in an essential way. Specifically, the trade-off between the entropic and energetic terms can be such that the following three statements are simultaneously true.

  • (i) T contains ‘almost all’ of the posterior mass.

  • (ii) As one gets closer to T (and thus the ground truth θ0), the log-likelihood is strictly monotonically increasing.

  • (iii) Yet S still possesses exponentially more posterior mass than W.

Using ‘bottleneck’ arguments from Markov chain theory (ch. 7 in [8]), this means that an MCMC algorithm that starts in S is expected to take an exponential time to visit W. If the step size is such that it cannot ‘jump over’ W, this also implies an exponential hitting time lower bound for reaching T. This is illustrated in figure 2 for an averaged version of the model described in §2.
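The volumetric mechanism behind (i)–(iii) can be checked numerically. The following sketch (our own illustration, not code from the paper) takes θ uniform on the sphere S^{n−1}, for which the overlap m = ⟨θ,θ0⟩ has density proportional to (1−m²)^{(n−3)/2}, and compares the log prior masses of an equatorial band S and a band W lying further towards θ0; the cutoffs s, t are illustrative choices.

```python
import numpy as np

def log_band_mass(n, lo, hi, grid=20000):
    # Unnormalized log mass of the band {lo < <theta,theta0> <= hi} under the
    # uniform law on S^{n-1}: the overlap m has density prop. to (1-m^2)^((n-3)/2).
    m = np.linspace(lo, hi, grid)[1:-1]          # avoid the endpoints m = +-1
    logf = 0.5 * (n - 3) * np.log1p(-m * m)
    c = logf.max()                               # log-sum-exp for stability
    return c + np.log(np.sum(np.exp(logf - c)) * (m[1] - m[0]))

s, t = 0.2, 0.6                                  # S = {|m| <= s}, W = {s < m <= t}
for n in [50, 100, 200, 400]:
    log_ratio = log_band_mass(n, s, t) - log_band_mass(n, -s, s)
    print(n, round(log_ratio, 2))                # log[vol(W)/vol(S)]: ~ linear in -n
```

The normalizing constant cancels in the ratio, which decays roughly linearly in n on the log scale: the band W closer to the pole is exponentially smaller than the equatorial band S, which is the entropic effect driving (iii).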

Figure 2. Illustration of a free-energy barrier (or free-entropy well) arising with a unimodal posterior. The model is an ‘averaged’ version of the spiked tensor model, with log-likelihood ℓ_n(θ) = λ⟨θ,θ0⟩³/2 and uniform prior Π on the n-dimensional unit sphere S^{n−1}. θ0 is chosen arbitrarily on S^{n−1}. The posterior is dΠ(θ|Y) ∝ exp{n ℓ_n(θ)} dΠ(θ), for θ ∈ S^{n−1}. Up to a constant, the free entropy F(r) = (1/n) log ∫ dΠ(θ|Y) δ(r − ‖θ−θ0‖₂) can be decomposed as the sum of ℓ_n(θ) (which only depends on r = ‖θ−θ0‖₂) and the ‘entropic’ contribution (1/n) log ∫ dΠ(θ) δ(r − ‖θ−θ0‖₂). In the figure we show λ=2.1. (Online version in colour.)
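The free-entropy curve of figure 2 can be reproduced directly from the caption's formulas. Reparametrizing by the overlap m = ⟨θ,θ0⟩ (so that r² = 2(1−m)), the entropic term converges to (1/2)log(1−m²) as n → ∞; the sketch below (our own, using the caption's λ = 2.1) locates the equatorial maximum at m = 0, the well, and the secondary maximum closer to θ0.

```python
import numpy as np

lam = 2.1
m = np.linspace(0.0, 0.995, 2000)        # overlap m = <theta, theta0>
energy = lam * m**3 / 2.0                # averaged log-likelihood lam * m^3 / 2
entropy = 0.5 * np.log1p(-m * m)         # n -> infinity limit of the band volume
F = energy + entropy                     # free entropy (up to a constant)

# F decreases away from m = 0 (equatorial maximum), reaches a local minimum
# (the well), then rises again to a secondary maximum closer to theta0.
crit = np.nonzero(np.diff(np.sign(np.diff(F))))[0]
print(m[crit])                           # approx. [well, secondary maximum]
```

At λ = 2.1 the secondary maximum is separated from the equator by a genuine dip in F, which is exactly the free-entropy well W of the figure.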

In the situation described above, the MCMC iterates never visit the region where the posterior is statistically informative, and hence yield no better inference than a random number generator. One could regard this as a ‘hardness’ result about computation of posterior distributions in high dimensions by MCMC. In this work we show that such situations can occur generically, and we establish hitting time lower bounds for common gradient or random walk based MCMC schemes in model problems with nonlinear regression and Gaussian process priors. Before doing this, we briefly review some important results of Ben Arous et al. [4] for the problem of principal component analysis (PCA) in tensor models, from which the inspiration for our work was drawn. This technique for establishing lower bounds for MCMC algorithms has also recently been leveraged in [5] in the context of sparse PCA, and in [7] to establish connections between MCMC lower bounds and the Low Degree Method for algorithmic hardness predictions (see [9] for an expository note on this technique).

When the target distribution is globally log-concave, pictures such as in figure 2 are ruled out (see also remark 4.7) and polynomial-time mixing bounds have been shown for a variety of commonly used MCMC methods. While an exhaustive discussion would be beyond the scope of this paper, we mention here the seminal works [10,11], which were among the first to demonstrate high-dimensional mixing of discretized Langevin methods (even upon ‘cold-start’ initializations like the ones assumed in the present paper). In concrete nonlinear regression models, polynomial-time computation guarantees were given in [12] under a general ‘gradient stability’ condition on the regression map which guarantees that the posterior is (with high probability) locally log-concave on a large enough region including θ0. While this condition can be expected to hold under natural injectivity hypotheses and was verified for an inverse problem with the Schrödinger equation in [12], for non-Abelian X-ray transforms in [13], for the ‘Darcy flow’ model involving elliptic partial differential equations (PDE) in [14] and for generalized linear models in [15], all these results hinge on the existence of a suitable initializer of the gradient MCMC scheme used. These results form part of a larger research programme [14,16–19] on algorithmic and statistical guarantees for Bayesian inversion methods [20] applied to problems with partial differential equations. The present article shows that the hypothesis of the existence of a suitable initializer is, at least in principle, essential in these results if D/N → κ > 0, and that at most ‘moderately’ high-dimensional (D = o(N)) MCMC implementations of Gaussian process priors may be preferable to bypass computational bottlenecks.

Our negative results apply to (worst-case initialized) Markov chains whose step sizes cannot be too large with high probability. As we show, this includes many commonly used algorithms (such as preconditioned Crank–Nicolson (pCN) and the Metropolis-adjusted Langevin algorithm (MALA)) whose dynamics are of a ‘local’ nature. A variety of MCMC methods developed recently, such as piecewise deterministic Markov processes and boomerang or zig-zag samplers [21–24], may not fall into our framework. While we are not aware of any rigorous results that would establish polynomial hitting or mixing times of these algorithms for high-dimensional posterior distributions such as those exhibited here, it is of great interest to study whether our computational hardness barriers can be overcome by ‘non-local’ methods. There is some empirical evidence that this may be possible. For instance, in the numerical simulation of models of supercooled liquids [25], methods such as swap Monte Carlo [26] have been observed to equilibrate to low-temperature distributions which were not reachable by local approaches. Another example is given by the planted clique problem [27]: this model is conjectured to possess a large algorithmically hard phase, and local Monte Carlo methods are known to fail far from the conjectured algorithmic threshold [28–30]. On the other hand, non-local exchange Monte Carlo methods (such as parallel tempering [31]) have been numerically observed to perform significantly better [32].

2. The spiked tensor model: an illustrative example

In this section, we present (a simplified version of) results obtained mostly in [4]. First some notation. For any n ≥ 1, we denote by S^{n−1} = {θ ∈ ℝⁿ : ‖θ‖₂ = 1} the Euclidean unit sphere in n dimensions. For θ, θ′ ∈ ℝⁿ we denote by θ⊗θ′ = (θᵢθ′ⱼ)_{1≤i,j≤n} ∈ ℝ^{n²} their tensor product, and by θ^{⊗p} the p-fold tensor product of θ with itself.

Spiked tensor estimation is a synthetic model to study tensor PCA, and corresponds to a Gaussian Additive Model with a low-rank prior. More formally, it can be defined as follows [33].

Definition 2.1 (Spiked tensor model). —

Let p ≥ 3 denote the order of the tensor. The observations Y and the parameter θ are generated according to the following joint probability distribution:

dQ(Y,θ) = (2π)^{−n^p/2} exp{ −(1/2) ‖Y − √n λ θ^{⊗p}‖₂² } dΠ(θ) dY.   (2.1)

Here, dY denotes the Lebesgue measure on the space (ℝⁿ)^{⊗p} ≅ ℝ^{n^p} of p-tensors of size n, Π is the uniform probability measure on S^{n−1}, and λ ≥ 0 is the signal-to-noise ratio (SNR) parameter. In particular, the posterior distribution Π(θ|Y) is

dΠ(θ|Y) = (1/Z_Y) exp( n ℓ_{n,Y}(θ) ) dΠ(θ),   (2.2)

in which Z_Y is a normalization, and we defined the log-likelihood (up to additive constants) as

ℓ_{n,Y}(θ) = (λ/(2√n)) ⟨θ^{⊗p}, Y⟩.   (2.3)

In the following, we study the model from definition 2.1 through the prism of statistical inference. In particular, we will study the posterior Π(θ|Y) for a fixed ‘data tensor’ Y. Since such a tensor was generated according to the marginal of (2.1), we parameterize it as Y = √n λ θ0^{⊗p} + Z, with Z a p-tensor with i.i.d. N(0,1) coordinates, and θ0 a ‘ground truth’ vector uniformly sampled in S^{n−1}. The goal of our inference task is to recover information on the low-rank perturbation θ0^{⊗p} (or equivalently on the vector θ0, possibly up to a global sign depending on the parity of p) from the posterior distribution Π(·|Y).

Crucially, we are interested in the limit of the model of definition 2.1 as n → ∞. In particular, all our statements, although sometimes non-asymptotic, are to be interpreted as n grows. We say that an event occurs ‘with high probability’ (w.h.p.) when its probability is 1−o_n(1). Moreover, by rotation invariance, all statements are uniform over θ0 ∈ S^{n−1}, so that said probabilities only refer to the noise tensor Z. Finally, throughout our discussion we will work with latitude intervals (or bands) on the sphere, with the North Pole taken to be θ0. We characterize them using inner products (correlations) ⟨θ,θ0⟩ for odd p, and |⟨θ,θ0⟩| for even p (since in this case θ0 and −θ0 are indistinguishable from the point of view of the observer).
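A small simulation makes the setup concrete. The sketch below (our own illustration; the dimensions are far from asymptotic) draws Y as parameterized above for p = 3 and evaluates the log-likelihood (2.3) at the ground truth and at an independent uniform point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 3, 3.0

theta0 = rng.standard_normal(n)
theta0 /= np.linalg.norm(theta0)                 # theta0 uniform on S^{n-1}
spike = np.einsum('i,j,k->ijk', theta0, theta0, theta0)
Y = np.sqrt(n) * lam * spike + rng.standard_normal((n, n, n))

def loglik(theta):
    # l_{n,Y}(theta) = (lam / (2 sqrt(n))) * <theta^{(x)p}, Y>, cf. (2.3)
    return lam / (2 * np.sqrt(n)) * np.einsum('ijk,i,j,k->', Y, theta, theta, theta)

theta = rng.standard_normal(n)
theta /= np.linalg.norm(theta)                   # an independent uniform point
print(loglik(theta0), loglik(theta))             # the signal point scores higher
```

A uniformly drawn θ has overlap ⟨θ,θ0⟩ of order n^{−1/2}, so its likelihood is dominated by the noise term, while θ0 itself picks up the full signal contribution of order λ²/2.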

Definition 2.2 (Latitude intervals). —

Assume that p ≥ 3 is even. For 0 ≤ s < t ≤ 1 we define:

  • S_s = {θ ∈ S^{n−1} : |⟨θ,θ0⟩| ≤ s},

  • W_{s,t} = {θ ∈ S^{n−1} : s < |⟨θ,θ0⟩| ≤ t},

  • T_t = {θ ∈ S^{n−1} : t < |⟨θ,θ0⟩|}.

If p is odd, we define these sets similarly, replacing |⟨θ,θ0⟩| by ⟨θ,θ0⟩.

Note that these sets can also be characterized using the distance to the ground truth, e.g. S_s = {θ ∈ S^{n−1} : min{‖θ−θ0‖₂², ‖θ+θ0‖₂²} ≥ 2(1−s)} when p is even.
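This distance characterization is immediate from expanding the squared norms on the unit sphere; for completeness:

```latex
\|\theta \mp \theta_0\|_2^2
  = \|\theta\|_2^2 \mp 2\langle\theta,\theta_0\rangle + \|\theta_0\|_2^2
  = 2\left(1 \mp \langle\theta,\theta_0\rangle\right),
\qquad \theta,\theta_0 \in S^{n-1},
```

so that min{‖θ−θ0‖₂², ‖θ+θ0‖₂²} = 2(1 − |⟨θ,θ0⟩|), and |⟨θ,θ0⟩| ≤ s is equivalent to this minimum being at least 2(1−s).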

(a) Posterior contraction

We can use uniform concentration of the likelihood to show that as λ → ∞ (after taking the limit n → ∞) the posterior contracts to a region infinitesimally close to the ground truth θ0. We first show that a region arbitrarily close to the ground truth exponentially dominates a very large starting region.

Proposition 2.3. —

For any K>0 there exist λ0>0 and functions s(λ), t(λ) ∈ [0,1) such that s(λ) < t(λ), s(λ), t(λ) → 1 as λ → ∞, and for all λ ≥ λ0:

lim sup_{n→∞} (1/n) log [ Π(S_{s(λ)}|Y) / Π(T_{t(λ)}|Y) ] ≤ −K,  almost surely.   (2.4)

Posterior contraction is the content of the following result:

Corollary 2.4 (Posterior contraction). —

There exist λ0>0 and a function s(λ) ∈ [0,1) satisfying s(λ) → 1 as λ → ∞, such that for all λ ≥ λ0:

lim_{n→∞} Π[T_{s(λ)}|Y] = 1,  almost surely.   (2.5)

The proofs of proposition 2.3 and corollary 2.4 are given in appendix A.

Remark 2.5 (Suboptimality of uniform bounds). —

Stronger than corollary 2.4, it is known that there exists a sharp threshold λ*(p) such that for any λ > λ*(p) the posterior mean, as well as the maximum likelihood estimator, sit w.h.p. in T_{s(λ)} with s(λ) > 0, while such a statement is false for λ ≤ λ*(p) [34–36]. The λ0 given by corollary 2.4 is, on the other hand, clearly not sharp, because of the crude uniform bound used in the proof. This can easily be understood in the p=2 case, corresponding to rank-one matrix estimation: uniform bounds such as the ones used here would show posterior contraction only for λ = ω(1), while it is known through the celebrated BBP transition that the maximum likelihood estimator is already correlated with the signal for any λ > 1 [37]. With more refined techniques from the study of random matrices and the spin glass theory of statistical physics it is often possible to obtain precise constants for such thresholds.

(b) Algorithmic bottleneck for MCMC

Simple volume arguments, combined with an ingenious use of Markov’s inequality and of the rotation invariance of the noise tensor Z due to Ben Arous et al. [4], allow us to obtain a computational hardness result for MCMC algorithms, even though the posterior contracts infinitesimally close to the ground truth as we saw in corollary 2.4. In the context of the spiked tensor model, these computational hardness results can be found in [4] (see in particular §7). We will state similar results for general nonlinear regression models in §3: in that context we will not need the Markov’s inequality-based technique of Ben Arous et al. [4], and will rely solely on concentration arguments.

Recall that by §2(a), we can find s(λ) such that s(λ) → 1 as λ → ∞ and, for all λ large enough, Π(T_{s(λ)}|Y) = 1−o_n(1). Here, we show that escaping the ‘initialization’ region of the MCMC algorithm is hard in a large range of λ (possibly diverging with n). In what follows, the step size of the algorithm denotes the maximal change ‖x_{t+1}−x_t‖₂ allowed in any iteration. We first state this bottleneck result informally.

Proposition 2.6 (MCMC bottleneck, informal). —

Assume that λ = O(n^{(p−2)/4+η}) for all η>0. Then any MCMC algorithm whose invariant distribution is Π(·|Y), and with a step size bounded by δ = O([√n λ²]^{−1/p}), will take an exponential time to get out of the ‘initialization’ region.

Note that the step size condition of proposition 2.6 is always meaningful, since our hypothesis on λ implies [√n λ²]^{−1/p} = ω(n^{−1/2}), and many MCMC algorithms (e.g. any procedure in which only O(1) coordinates of the current iterate are changed in a single iteration) will have a step size O(n^{−1/2}).
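The exponent comparison behind this remark can be spelled out; a short check under the assumption λ = O(n^{(p−2)/4+η}) for every η > 0:

```latex
\sqrt{n}\,\lambda^{2} = O\!\left(n^{\frac{1}{2}+\frac{p-2}{2}+2\eta}\right)
                      = O\!\left(n^{\frac{p-1}{2}+2\eta}\right)
\quad\Longrightarrow\quad
\left[\sqrt{n}\,\lambda^{2}\right]^{-1/p}
  = \Omega\!\left(n^{-\frac{p-1}{2p}-\frac{2\eta}{p}}\right),
```

and since (p−1)/(2p) < 1/2 for every p ≥ 3, taking η small enough makes this lower bound larger than n^{−1/2} by a positive power of n.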

Remark 2.7. —

The results of Ben Arous et al. [4] are stated for a more general ‘Gibbs-type’ invariant distribution G_{β,Y}(dx) ∝ e^{βH(x)} dΠ(x), with H(x) = (√n/2)⟨x^{⊗p}, Y⟩. The case we consider here is the ‘Bayes-optimal’ β = λ, for which G_{λ,Y} = Π(·|Y). For the general distribution G_{β,Y} the conditions of proposition 2.6 become βλ = O(n^{(p−2)/2+η}) and δ = O([√n βλ]^{−1/p}). The authors of [4] usually consider β = O(1), so that they show the bottleneck under the condition λ = O(n^{(p−2)/2+η}).

More generally, λ ≪ n^{(p−2)/4} is conjectured to be a regime in which all polynomial-time algorithms fail to recover θ0 [33,38–41]. On the other hand, ‘local’ methods (such as gradient-based algorithms [42–46], message-passing iterations [35] or natural MCMC algorithms such as the ones of the previous remark) are conjectured or known to fail in the larger range λ ≪ n^{(p−2)/2}. Proposition 2.6 shows that ‘Bayes-optimal’ MCMC algorithms fail for λ ≪ n^{(p−2)/4}. To the best of our knowledge, analysing this class of algorithms in the regime n^{(p−2)/4} ≪ λ ≪ n^{(p−2)/2} is still open.

Let us now state formally the key ingredient behind proposition 2.6. It is a rewriting of the ‘free energy wells’ result of Ben Arous et al. [4].

Lemma 2.8 (Bottleneck, formal). —

Assume that λ = O(n^{(p−2)/4+η}) for all η>0, and let δ = O([√n λ²]^{−1/p}). Let r(ε) = n^{−1/2+ε}. Then for any ε>0 small enough, there exist c, C>0 such that for large enough n, with probability at least 1−exp(−c n^{2ε}) we have:

Π(S_{r(ε)}|Y) / Π(W_{r(ε),r(ε)+δ}|Y) ≥ exp{C n^{2ε}}.   (2.6)

Note that by simple volume arguments, Π(S_{r(ε)}) = 1−o_n(1), so that S_{r(ε)} contains ‘almost all’ the mass of the uniform distribution.

One can then deduce from lemma 2.8 hitting time lower bounds for MCMC using a folklore bottleneck argument (see Jerrum [8]), which we recall here in a simplified form (see also [5], as well as proposition 4.4, where we will detail it further along with a short proof).

Proposition 2.9. —

We fix any Y and n, and let 0 < s < t < 1. Let θ^{(0)}, θ^{(1)}, … be a Markov chain on S^{n−1} with stationary distribution Π(·|Y), initialized from θ^{(0)} ∼ Π_{S_s}(·|Y), the posterior distribution conditioned on S_s. Let τ_t = inf{k ∈ ℕ : θ^{(k)} ∈ T_t} be the hitting time of the Markov chain onto T_t. Then, for any k ≥ 1,

Pr(τ_t ≤ k) ≤ k · Π(W_{s,t}|Y) / Π(S_s|Y).   (2.7)
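The bound (2.7) is easy to verify on a toy chain. The sketch below (our own illustration, not from the paper) runs a Metropolis random walk on {0, …, 60} whose stationary law has a flat plateau S, an exponentially suppressed well W and a dominant region T, and compares the empirical hitting probability of T with the bottleneck bound k·π(W)/π(S).

```python
import numpy as np

rng = np.random.default_rng(1)
M = 60
x = np.arange(M + 1)
# Potential: flat on S = [0,20], uphill well on W = (20,40], downhill on T = (40,60]
V = np.where(x <= 20, 0.0,
    np.where(x <= 40, 2.0 * (x - 20), 40.0 - 3.0 * (x - 40)))
pi = np.exp(-V)
pi /= pi.sum()                       # stationary law; the bulk of mass sits in T
S, W = x <= 20, (x > 20) & (x <= 40)

def mh_step(state):
    # Metropolis step with symmetric +-1 proposal (step size 1: cannot jump over W)
    prop = state + rng.choice((-1, 1))
    if prop < 0 or prop > M:
        return state
    return prop if rng.random() < min(1.0, pi[prop] / pi[state]) else state

k, trials, hits = 100, 2000, 0
start = pi * S
start /= start.sum()                 # stationary law conditioned on S
for _ in range(trials):
    state = rng.choice(x, p=start)
    for _ in range(k):
        state = mh_step(state)
        if state > 40:               # hit the target region T
            hits += 1
            break

bound = k * pi[W].sum() / pi[S].sum()
print(hits / trials, bound)          # empirical hitting probability <= bound
```

Even though most of the stationary mass lies in T, the chain started (stationarily) in S essentially never reaches T within k steps, in line with the bound.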

Remark 2.10 (MCMC initialization). —

Note that lemma 2.8, combined with proposition 2.9, shows hardness of MCMC initialized at points drawn from Π_{S_{r(ε)}}(·|Y). In particular, it is easy to see that this implies (via the probabilistic method) the existence of such ‘hard’ initialization points. While one might hope to show such negative results for more general initializations, this remains an open problem. On the other hand, Ben Arous et al. [4] show that there exist initializers in S_{r(ε)} for which vanilla Langevin dynamics achieve non-trivial recovery of the signal even for λ = Θ_n(1) (a phenomenon they call ‘equatorial passes’).

3. Main results for nonlinear regression with Gaussian priors

We now turn to the main contribution of this article, which is to exhibit some of the phenomena described in §2 in the context of nonlinear regression models. All the theorems of this section are proven in detail in §4.

Consider data Z^{(N)} = (Y_i, X_i)_{i=1}^N drawn i.i.d. from the random design regression model

Y_i = G(θ)(X_i) + ε_i,  ε_i ∼ N(0,1) i.i.d.,  i = 1, …, N,   (3.1)

where G : Θ → L²_μ(X) is a regression map taking values in the space L²(X) = L²_μ(X) on some bounded subset X of ℝ^d, and where the X_i ∼ μ are drawn i.i.d. uniformly on X. For convenience, we assume that X has Lebesgue measure ∫_X dx = 1. The law of the data dP_θ^N(z_1, …, z_N) = ∏_{i=1}^N dP_θ(z_i) is a product measure on (ℝ×X)^N, with associated expectation operator E_θ^N. Here θ varies in some parameter space

Θ ⊆ ℝ^D,  D/N → κ ≥ 0,

and θ0 ∈ Θ is a ‘ground truth’ (we could use a ‘mis-specified’ θ0 and project it onto Θ). We will primarily consider the case where κ > 0 and Θ = ℝ^D, and consider high-dimensional asymptotics where D (and then also N) diverges to infinity, even though some aspects of our proofs do not rely on these assumptions. We will say that events A_N hold with high probability if P_{θ0}^N(A_N) → 1 as N → ∞, and we will use the same terminology later when it involves the law of some Markov chain.

Let Π be a prior (Borel probability measure) on Θ so that given the data Z(N) the posterior measure is the ‘Gibbs’-type distribution

dΠ(θ|Z^{(N)}) = e^{ℓ_N(θ)} dΠ(θ) / ∫_Θ e^{ℓ_N(θ′)} dΠ(θ′),  θ ∈ Θ,   (3.2)

where

ℓ_N(θ) = −(1/2) ∑_{i=1}^N |Y_i − G(θ)(X_i)|²,  ℓ(θ) = E_{θ0}^N ℓ_N(θ),  θ ∈ Θ.
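To fix ideas, the model (3.1) and the log-likelihood ℓ_N can be simulated directly. The forward map below is a hypothetical placeholder chosen only to be radially symmetric and bounded (the maps G constructed in §4 are different); everything else follows the displays above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 500, 50
theta0 = np.zeros(D)                                  # ground truth theta0 = 0

def G(theta, x):
    # Hypothetical radially symmetric forward map (placeholder, not the paper's G)
    return np.cos(2 * np.pi * x) / (1.0 + np.linalg.norm(theta))

X = rng.uniform(size=N)                               # design X_i ~ mu = Unif([0,1])
Y = G(theta0, X) + rng.standard_normal(N)             # model (3.1) at theta = theta0

def ell_N(theta):
    # l_N(theta) = -(1/2) * sum_i (Y_i - G(theta)(X_i))^2
    return -0.5 * np.sum((Y - G(theta, X)) ** 2)

far = np.ones(D)                                      # a point with ||far|| = sqrt(D)
print(ell_N(theta0) > ell_N(far))                     # the truth fits better, w.h.p.
```

Because the placeholder G depends on θ only through ‖θ‖, the induced average likelihood is radially symmetric, mirroring the structural properties (i)–(ii) of the theorems below.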

(a) Hardness examples for posterior computation with Gaussian priors

We are concerned here with the question of whether one can sample from the Gibbs-type measure (3.2) by MCMC algorithms. The priors will be Gaussian, so the ‘source’ of the difficulty will arise from the log-likelihood function ℓ_N. On the one hand, recent work [10–14] has demonstrated that if ℓ_N(θ) is ‘on average’ (under E_{θ0}) log-concave, possibly only locally near the ground truth θ0, then MCMC methods that are initialized in the area of log-concavity can mix towards Π(·|Z^{(N)}) in polynomial time even in high-dimensional (D → ∞) and ‘informative’ (N → ∞) settings. In the absence of such structural assumptions, however, posterior computation may be intractable, and the purpose of this section is to give some concrete examples of this with choices of G that are representative of nonlinear regression models.

We will provide lower bounds on the run-time of ‘worst case’ initialized MCMC in settings where the average posterior surface is not globally log-concave but still unimodal. Both the log-likelihood function and posterior density exhibit linear growth towards their modes, and the average log-likelihood is locally log-concave at θ0. In particular, the Fisher information is well defined and non-singular at the ground truth.

The computational hardness does not arise from a local optimum (‘multi-modality’), but from the difficulty MCMC encounters in ‘choosing’ among many high-dimensional directions when started away from the bulk of the support of the posterior measure. That such problems occur in high dimensions is related to the probabilistic structure of the prior Π, and the manifestation of ‘free energy barriers’ in the posterior distribution.

In many applications of Bayesian statistics, such as in machine learning or in nonlinear inverse problems with PDEs, Gaussian process priors are commonly used for inference. To connect to such situations we illustrate the key ideas that follow with two canonical examples where the prior on RD is the law

(a) θ ∼ N(0, I_D/D),  or  (b) θ ∼ N(0, Σ_α),   (3.3)

where Σ_α is the covariance matrix arising from the law of a D-dimensional Whittle–Matérn-type Gaussian random field (see §4(d)(i) for a detailed definition). These priors represent widely popular choices in Bayesian statistical inference [47,48] and can be expected to yield consistent statistical solutions of regression problems even when D/N → κ > 0, see [48,49]. In (b), we can also accommodate a further ‘rescaling’ (N-dependent shrinkage) of the prior similar to what has been used in recent theory for nonlinear inverse problems [12,13,18], see remark 4.6 for details.

We will present our main results for the case where the ground truth is θ0=0. This streamlines notation while also being the ‘hardest’ case for negative results, since the priors from (a) and (b) are then already centred at the correct parameter.

To formalize our results, let us define balls

B_r = {θ ∈ ℝ^D : ‖θ‖_{ℝ^D} ≤ r},  r > 0,   (3.4)

centred at θ0=0. We will also require the annuli

Θ_{r,ε} = {θ ∈ ℝ^D : ‖θ‖_{ℝ^D} ∈ (r, r+ε)},   (3.5)

for r, ε > 0 to be chosen. To connect this to the notation in the preceding sections, the sets Θ_{r,ε} will play the role of the initialization (or starting) region S, while B_s (for suitable s) corresponds to the target region T where the posterior mass concentrates. The ‘intermediate’ region W = Θ_{s,η} representing the ‘free-energy barrier’ is constructed in the proofs of the theorems to follow.

Our results hold for general Markov chains whose invariant measure equals the posterior measure (3.2), and which admit a bound on their ‘typical’ step sizes. As step sizes can be random, this assumption needs to be accommodated in the probabilistic framework describing the transition probabilities of the chain. Let P_N(θ, A), N ∈ ℕ (for θ ∈ ℝ^D and Borel sets A ⊆ ℝ^D), denote a sequence of Markov kernels describing the Markov chain dynamics employed for the computation of the posterior distribution Π(·|Z^{(N)}). Recall that a probability measure μ on ℝ^D is called invariant for P_N if ∫_{ℝ^D} P_N(θ, A) dμ(θ) = μ(A) for all Borel sets A.

Assumption 3.1. —

Let P_N(·,·) be a sequence of Markov kernels satisfying the following:

  • (i) P_N(·,·) has invariant distribution Π(·|Z^{(N)}) from (3.2).

  • (ii) For some fixed c0 > 0 and for sequences Q = Q_N > 0, η = η_N > 0, with P_0^N-probability approaching 1 as N → ∞,

    sup_{θ ∈ B_Q} P_N(θ, {ϑ : ‖θ−ϑ‖_{ℝ^D} ≥ η/2}) ≤ e^{−c0 N},  N ≥ 1.

This assumption states that typical steps of the Markov chain are, with high probability (both under the law of the Markov chain and the randomness of the invariant ‘target’ measure), concentrated in an area of size η/2 around the current state θ, uniformly in a ball of radius Q around θ0=0. For standard MCMC algorithms (such as pCN, MALA) whose proposal steps are based on the discretization of some continuous-time diffusion process, such conditions can be checked, as we will show in the next section.
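For pCN with a N(0, I_D/D) reference measure, condition (ii) is plausible by direct simulation: the proposal ϑ = √(1−2δ)θ + √(2δ)ξ with ξ ∼ N(0, I_D/D) moves by roughly √(2δ) per step, so any η somewhat larger than that scale works. The sketch below is our own illustration; the numbers δ, D, Q are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
D, delta, Q = 1000, 1e-3, 1.5

def pcn_proposal(theta):
    # pCN proposal for the N(0, I_D/D) reference measure:
    # vartheta = sqrt(1 - 2*delta) * theta + sqrt(2*delta) * xi
    xi = rng.standard_normal(D) / np.sqrt(D)
    return np.sqrt(1.0 - 2.0 * delta) * theta + np.sqrt(2.0 * delta) * xi

theta = Q * rng.standard_normal(D) / np.sqrt(D)       # a state with ||theta|| ~ Q
steps = np.array([np.linalg.norm(pcn_proposal(theta) - theta)
                  for _ in range(2000)])

eta = 2.0 * (np.sqrt(2.0 * delta) + delta * Q)        # tolerance ~ twice typical step
print(steps.mean(), steps.max(), steps.max() < eta)
```

The step norm concentrates around √(2δ) because ‖ξ‖ concentrates at 1 in high dimensions; making the exponential tail bound of assumption 3.1(ii) rigorous for pCN and MALA is the content of the next section.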

Theorem 3.2. —

Let D/N → κ > 0, consider the posterior (3.2) arising from the model (3.1) and a N(0, I_D/D) prior with density π, and let θ0 = 0. Then there exist G and a fixed constant s ∈ (0, 1/3) for which the following statements hold true.

  • (i)

The expected likelihood ℓ(θ) is unimodal with mode 0, locally log-concave near 0, radially symmetric, Lipschitz continuous and monotonically decreasing in ‖θ‖_{ℝ^D} on ℝ^D.

  • (ii)

For any fixed r > 0, with high probability the log-likelihood ℓ_N(θ) and the posterior density π(·|Z^{(N)}) are monotonically decreasing in ‖θ‖_{ℝ^D} on the set {θ : ‖θ‖_{ℝ^D} ≥ r}.

  • (iii)

We have that Π(B_s|Z^{(N)}) → 1 in probability as N → ∞.

  • (iv)
There exists ε > 0 such that for any (sequence of) Markov kernels P_N on ℝ^D and associated chains (ϑ_k : k ≥ 1) that satisfy assumption 3.1 for some c0 > 0, Q = 1+ε, sequence η_N ∈ (0, s) and all N ≥ 1 large enough, we can find an initialization point ϑ0 ∈ Θ_{2/3,ε} such that with high probability (under the law of Z^{(N)} and the Markov chain), the hitting time τ_{B_s} for ϑ_k to reach B_s (with s as in (iii)) is lower bounded as
    τ_{B_s} ≥ exp( min{c0, 1} N/2 ).

The interpretation is that despite the posterior density increasing monotonically as one moves radially inward in ‖θ‖_{ℝ^D} (at least for ‖θ‖_{ℝ^D} ≥ r, any r > 0; note that maximizers of the posterior density may deviate from the ‘ground truth’ θ0 = 0 by some asymptotically vanishing error, cf. also proposition 4.1), MCMC algorithms started in Θ_{2/3,ε} will still take an exponential time before visiting the region B_s where the posterior mass concentrates. This is true for small enough step sizes independently of D, N. The result also holds for ϑ0 drawn from an absolutely continuous distribution on Θ_{2/3,ε}, as inspection of the proof shows. Finally, we note that at the expense of more cumbersome notation, the above high probability results (and similarly in theorem 3.3) could be made non-asymptotic, in the sense that for all δ > 0 all statements hold with probability at least 1−δ for all N ≥ N0(δ) large enough, where the dependency of N0 on δ can be made explicit.
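The volumetric effect behind theorem 3.2 is elementary to quantify: under the N(0, I_D/D) prior, ‖θ‖²_{ℝ^D} is a χ²_D/D variable concentrating at 1, so the prior mass of B_s for fixed s < 1 decays exponentially in D. A standard Chernoff bound (our own illustration) makes this explicit:

```python
import numpy as np

def log_prior_mass_bound(D, s):
    # Chernoff bound for the lower chi-square tail: for theta ~ N(0, I_D/D), s < 1,
    # Pi(B_s) = P(chi2_D <= D s^2) <= exp((D/2) * (1 - s^2 + 2*log(s))).
    return 0.5 * D * (1.0 - s * s + 2.0 * np.log(s))

s = 1.0 / 3.0
for D in [10, 100, 1000]:
    print(D, log_prior_mass_bound(D, s))       # log Pi(B_s): linear decay in D

# Monte Carlo sanity check at D = 10
rng = np.random.default_rng(4)
thetas = rng.standard_normal((200_000, 10)) / np.sqrt(10)
mc = np.mean(np.linalg.norm(thetas, axis=1) <= s)
print(mc, np.exp(log_prior_mass_bound(10, s)))  # estimate lies below the bound
```

Thus the target region B_s, which carries almost all the posterior mass by (iii), has exponentially small prior volume; this is what creates the free-energy well separating it from Θ_{2/3,ε}.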

For ‘ellipsoidally supported’ α-regular priors (b), the idea is similar but the geometry of the problem changes, as the prior now ‘prefers’ low-dimensional subspaces of ℝ^D, forcing the posterior closer towards the ground truth θ0 = 0. We show that if the step size is small compared to a scaling N^{−b}, for b > 0 determined by α, then the same hardness phenomenon persists. Note that ‘small’ is only ‘polynomially small’ in N, and hence algorithmic hardness does not come from exponentially small step sizes.

Theorem 3.3. —

Let D/N → κ > 0, consider the posterior (3.2) arising from the model (3.1) and a N(0, Σ_α) prior with density π for some α > d/2, and let θ0 = 0. Define b = α/d − 1/2 > 0. Then there exist G and some fixed constant s_b ∈ (0, 1/2) for which the following statements hold true.

  • (i)

The expected likelihood ℓ(θ) is unimodal with mode 0, locally log-concave near 0, radially symmetric, Lipschitz continuous and monotonically decreasing in ‖θ‖_{ℝ^D} on ℝ^D.

  • (ii)

For any fixed r > 0, with high probability ℓ_N(θ) is radially symmetric and decreasing in ‖θ‖_{ℝ^D} on the set {θ : ‖θ‖_{ℝ^D} ≥ r N^{−b}}.

  • (iii)

Defining s = s_b N^{−b}, we have Π(B_s|Z^{(N)}) → 1 in probability as N → ∞.

  • (iv)
There exist positive constants ε, C > 0 and ν = ν(κ, α, d) > 0 such that for any (sequence of) Markov kernels P_N on ℝ^D and associated chains (ϑ_k : k ≥ 1) that satisfy assumption 3.1 for some c0 > 0, Q = Q_N = CN, sequence η = η_N ∈ (0, s_b N^{−b}) and all N ≥ 1 large enough, we can find an initialization point ϑ0 ∈ Θ_{N^{−b}, εN^{−b}} such that with high probability (under the law of Z^{(N)} and the Markov chain), the hitting time τ_{B_s} for ϑ_k to reach B_s is lower bounded as
    τ_{B_s} ≥ exp( min{c0, ν} N/2 ).

Again, (iv) also holds for ϑ0 drawn from an absolutely continuous distribution on Θ_{N^{−b}, εN^{−b}}. We also note that ε depends only on α, κ, d and the choice of G, but not on any other parameters.

Remark 3.4. —

As opposed to theorem 3.2, due to the anisotropy of the prior density π, the posterior distribution is no longer radially symmetric in the preceding theorem, whence part (ii) differs from theorem 3.2. But a slightly weaker form of monotonicity of the posterior density π(|Z(N)) still holds: the same arguments employed to prove part (ii) of theorem 3.2 show that π(|Z(N)) is decreasing on {θ:||θ||RDrNb} (any r>0) along the half-lines through 0, i.e.

P_0^N( π(v e | Z^{(N)}) ≤ π(v′ e | Z^{(N)}) for all v ≥ v′ ≥ r, e ∈ R^D, ||e||_{R^D} = N^{−b} ) → 1 as N → ∞. (3.6)

We note that this notion precludes the possibility of π(·|Z^{(N)}) having extremal points outside of the region of dominant posterior mass, and implies that moving toward the origin will always increase the posterior density. As a result, many typical Metropolis–Hastings algorithms would be encouraged to accept such ‘radially inward’ moves whenever they arise as proposals. Thus, crucially, our exponential hitting time lower bound in part (iv) arises not from multi-modality, but merely from volumetric properties of high-dimensional Gaussian measures.

Remark 3.5 (On the step size condition). —

One may wonder whether larger step sizes can help to overcome the negative result presented in the last theorem. If the step sizes are ‘time-homogeneous’ and of order at least N^{−b} on average, then the chain may indeed hit the region where the posterior concentrates at some point in time. But this would happen ‘by chance’, and not because the data (via ℓ_N) suggest moving there; future proposals will then likely fall outside of that bulk region, so that the chain will either exit the relevant region again or effectively become deterministic, because the accept/reject step refuses to move in such directions. In this sense, a negative result for (polynomially) small step sizes exhibits fundamental limitations on the ability of the chain to explore the precise characteristics of the posterior distribution. We also remark that the Lipschitz constants of ℓ_N(θ) are of order D or D^{1+b}, respectively, in the preceding theorems. A Markov chain obtained from discretizing a continuous diffusion process (such as MALA, discussed in the next section) will generally require step sizes that are inversely proportional to that Lipschitz constant in order to inherit the dynamics of the continuous process. For such examples, assumption 3.1 is natural. But, as discussed at the end of the introduction, there exists a variety of ‘non-local’ MCMC algorithms for which this step size assumption may not be satisfied.

(b) . Implications for common MCMC methods with ‘cold-start’

The preceding general hitting time bounds apply to commonly used MCMC methods in high-dimensional statistics. We focus in particular on algorithms that are popular with PDE models and inverse problems, see, e.g. [50,51] and also [14] for many more references. We illustrate this for two natural examples with Metropolis–Hastings adjusted random walk and gradient algorithms. Other examples can be generated without difficulty.

(i) . Preconditioned Crank–Nicolson

We first give some hardness results for the popular pCN algorithm. A dimension-free convergence analysis for pCN was given in the important paper by Hairer et al. [52], based on ideas from [53]. The results in the present section show that while the mixing bounds from [52] are in principle uniform in D, the implicit dependence of the constants on the assumptions placed on the log-likelihood function in [52] can re-introduce exponential scaling when one applies those results to concrete (N-dependent) posterior distributions. This confirms a conjecture about pCN made in Section 1.2.1 of Nickl & Wang [12].

Let C denote the covariance of some Gaussian prior on R^D with density π. Then the pCN algorithm for sampling from some posterior density π(θ | Z^{(N)}) ∝ e^{ℓ_N(θ)} π(θ) is given as follows. Let (ξ_k : k ≥ 1) be an i.i.d. sequence of N(0, C) random vectors. For initializer ϑ_0 ∈ R^D, step size β > 0 and k ≥ 1, the MCMC chain is then given by

  • 1.

    Proposal: p_k = √(1 − β²) ϑ_{k−1} + β ξ_k,

  • 2.
    Accept–reject: Set
    ϑ_k = p_k with probability min{1, exp(ℓ_N(p_k) − ℓ_N(ϑ_{k−1}))}, and ϑ_k = ϑ_{k−1} otherwise. (3.7)

By standard Markov chain arguments one verifies (see [52] or Ch. 1 in [14]) that the (unique) invariant density of (ϑ_k : k ≥ 1) equals π(·|Z^{(N)}).
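The two steps above can be sketched in code. The following is a minimal illustrative sketch under our own naming (it is not the construction from the theorems): `log_lik` is a stand-in for ℓ_N, the toy quadratic target is ours, and `C_sqrt` is any matrix square root of the prior covariance C.

```python
import numpy as np

def pcn_chain(log_lik, C_sqrt, theta0, beta, n_iter, rng):
    """Preconditioned Crank-Nicolson chain targeting pi(theta|Z) ∝ exp(log_lik(theta)) N(0, C).

    C_sqrt: a matrix square root of the prior covariance C, used to draw xi_k ~ N(0, C).
    """
    theta = np.asarray(theta0, dtype=float)
    ll = log_lik(theta)
    chain = [theta]
    for _ in range(n_iter):
        xi = C_sqrt @ rng.standard_normal(theta.shape[0])   # xi_k ~ N(0, C)
        prop = np.sqrt(1.0 - beta**2) * theta + beta * xi   # pCN proposal
        ll_prop = log_lik(prop)
        # accept with probability min{1, exp(l_N(p_k) - l_N(theta_{k-1}))};
        # the Gaussian prior ratio cancels exactly for this proposal, so only l_N appears
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        chain.append(theta)
    return np.array(chain)

# toy usage: N(0, I_D/D) prior and a unimodal synthetic log-likelihood with mode at 0
rng = np.random.default_rng(0)
D = 10
C_sqrt = np.eye(D) / np.sqrt(D)
log_lik = lambda th: -5.0 * np.sum(th**2)   # hypothetical stand-in for l_N
chain = pcn_chain(log_lik, C_sqrt, np.full(D, 0.5), beta=0.2, n_iter=2000, rng=rng)
```

A design point worth noting: because the proposal is reversible with respect to the Gaussian prior, the accept–reject step involves only the likelihood, which is what makes pCN well defined independently of the dimension D.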

We now give a hitting time lower bound for the pCN algorithm which holds true in the regression setting for which the main theorems 3.2 and 3.3 (for generic Markov chains) were derived. In particular, we emphasize that the lower bounds to follow hold for the choice of regression ‘forward’ map G constructed in the proofs of theorems 3.2 and 3.3. As for the general results, we treat the two cases of C=ID/D or C=Σα separately.

Theorem 3.6. —

Let ϑk denote the pCN Markov chain from (3.7).

  • (i)

    Assume the setting of theorem 3.2 with C = I_D/D, and let G be as in theorem 3.2. Then there exist constants c_1, c_2, ε > 0 such that for any β ≤ c_1, there is an initialization point ϑ_0 ∈ Θ_{2/3,ε} such that the hitting time τ_{B_s} = inf{k : ϑ_k ∈ B_s} (for B_s as in (3.4)) satisfies, with high probability (under the law of the data and of the Markov chain) as N → ∞, τ_{B_s} ≥ exp(c_2 D).

  • (ii)

    Assume the setting of theorem 3.3 with C = Σ_α for α > d/2, and let G be as in theorem 3.3. Then there exist constants c_1, c_2, ε > 0 such that if β ≤ c_1 N^{−1−2b}, there is an initialization point ϑ_0 ∈ Θ_{N^{−b}, εN^{−b}} such that the hitting time τ_{B_s} = inf{k : ϑ_k ∈ B_s} satisfies, with high probability, τ_{B_s} ≥ exp(c_2 D).

(ii) . Gradient-based Langevin algorithms

We now turn to gradient-based Langevin algorithms which are based on the discretization of continuous-time diffusion processes [10,50]. A polynomial time convergence analysis for the unadjusted Langevin algorithm in the strongly log-concave case has been given in [10,11] and also in [54] for the Metropolis-adjusted case (MALA). We show here that for unimodal but not globally log-concave distributions, the MCMC scheme can take an exponential time to reach the bulk of the posterior distribution. For simplicity we focus on the Metropolis-adjusted Langevin algorithm which is defined as follows. Let (ξk:k1) be a sequence of i.i.d. N(0,ID) variables, and let γ>0 be a step size.

  • 1.

    Proposal: p_k = ϑ_{k−1} + γ ∇log π(ϑ_{k−1} | Z^{(N)}) + √(2γ) ξ_k.

  • 2.
    Accept–reject: Set
    ϑ_k = p_k with probability min{1, [π(p_k|Z^{(N)}) exp(−||ϑ_{k−1} − p_k − γ∇log π(p_k|Z^{(N)})||²/(4γ))] / [π(ϑ_{k−1}|Z^{(N)}) exp(−||p_k − ϑ_{k−1} − γ∇log π(ϑ_{k−1}|Z^{(N)})||²/(4γ))]}, and ϑ_k = ϑ_{k−1} otherwise. (3.8)

Again, standard Markov chain arguments show that Π(·|Z^{(N)}) is indeed the (unique) invariant distribution of (ϑ_k : k ≥ 1). We note here that for the forward map G featuring in the results to follow, ∇log π may only be well defined (Lebesgue-) almost everywhere on R^D, due to our piecewise smooth choice of w, see (4.6) below. However, since all proposal distributions involved possess a Lebesgue density, this almost-everywhere specification suffices to propagate the Markov chain with probability 1. Alternatively, one could straightforwardly avoid this technicality by smoothing our choice of the function w in (4.6), which we refrain from doing for notational ease.
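A single MALA update as defined above can be sketched as follows. This is a generic sketch with our own naming (`log_post` and `grad_log_post` standing for log π(·|Z^{(N)}) and its gradient); the toy standard-Gaussian target is hypothetical and only illustrates the accept–reject mechanics.

```python
import numpy as np

def mala_step(theta, log_post, grad_log_post, gamma, rng):
    """One Metropolis-adjusted Langevin step with step size gamma."""
    xi = rng.standard_normal(theta.shape[0])
    prop = theta + gamma * grad_log_post(theta) + np.sqrt(2.0 * gamma) * xi

    # log of the Gaussian proposal density q(x, y) ∝ exp(-||y - x - gamma grad(x)||^2 / (4 gamma))
    def log_q(x, y):
        diff = y - x - gamma * grad_log_post(x)
        return -np.dot(diff, diff) / (4.0 * gamma)

    # Metropolis-Hastings correction for the asymmetric Langevin proposal
    log_alpha = log_post(prop) + log_q(prop, theta) - log_post(theta) - log_q(theta, prop)
    if np.log(rng.uniform()) < log_alpha:
        return prop
    return theta

# toy usage: standard Gaussian target in D = 5 dimensions
rng = np.random.default_rng(1)
log_post = lambda th: -0.5 * np.sum(th**2)
grad_log_post = lambda th: -th
theta = np.full(5, 3.0)
samples = []
for _ in range(5000):
    theta = mala_step(theta, log_post, grad_log_post, 0.2, rng)
    samples.append(theta)
samples = np.array(samples)
```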

Theorem 3.7. —

Let ϑk denote the MALA Markov chain from (3.8).

  • (i)

    Assume the setting of theorem 3.2, with N(0, I_D/D) prior, and let G also be as in theorem 3.2. There exist constants c_1, c_2, ε > 0 such that if the step size of (ϑ_k : k ≥ 1) satisfies γ ≤ c_1/N, then there is an initialization point ϑ_0 ∈ Θ_{2/3,ε} such that the hitting time τ_{B_s} = inf{k : ϑ_k ∈ B_s} (for B_s as in (3.4)) satisfies, with high probability (under the law of the data and of the Markov chain) as N → ∞, τ_{B_s} ≥ exp(c_2 D).

  • (ii)

    Assume the setting of theorem 3.3, with a N(0, Σ_α) prior, and let G also be as in theorem 3.3. Then there exist constants c_1, c_2, ε > 0 such that whenever γ ≤ c_1 N^{−1−b−2α}, there is an initialization point ϑ_0 ∈ Θ_{N^{−b}, εN^{−b}} such that the hitting time τ_{B_s} = inf{k : ϑ_k ∈ B_s} satisfies, with high probability (under the law of the data and of the Markov chain), τ_{B_s} ≥ exp(c_2 D).

As mentioned in remark 3.5, a bound on the step size that is inversely proportional to the Lipschitz constant of ℓ_N is natural for algorithms like MALA that arise from the discretization of a continuous-time Markov process, see e.g. [11,54]. We emphasize again that these Lipschitz constants are D- and N-dependent, so that the required bounds on γ are not unnatural. ‘Optimal’ step size prescriptions for MALA [54–57], derived for Gaussian and log-concave targets or, more generally, mean-field limits (in which the posterior distribution possesses a product or mean-field structure, unlike in the models considered here), would need to be adjusted to our model classes to be comparable.

4. Proofs of the main theorems

We begin in §4a by constructing the family of regression maps G underlying our results from §3. Sections 4b and 4c reduce the hitting time bounds from theorems 3.2 and 3.3 (for general Markov chains) to hitting time bounds for intermediate ‘free energy barriers’ that the Markov chain needs to travel through. Subsequently, theorems 3.3 and 3.2 are proved in §4d and §4e, respectively. Finally, the proofs for pCN (theorem 3.6) and MALA (theorem 3.7) are contained in §4f.

(a) . Radially symmetric choices of G

We start with our parameterization of the map G. In our regression model, and since Eε² = 1,

ℓ(θ) = −(N/2) E_{θ_0}^1 |Y − G(θ)(X)|² = −(N/2) ||G(θ_0) − G(θ)||²_{L²} − N/2, θ ∈ R^D. (4.1)

We have θ0=0 and by subtracting a fixed function G(0) from G(θ) if necessary we can also assume that G(θ0)=0. In this case, since vol(X)=1,

ℓ(θ) = −(N/2) ||G(θ)||²_{L²} − N/2. (4.2)

Take a bounded continuous function w : [0, ∞) → [0, ||w||_∞) with unique minimizer w(0) = 0, and take G of the ‘radial’ form

G(θ) = w(||θ||_{R^D}) × g(x), θ ∈ R^D, x ∈ X,

where

g : X → [g_min, g_max], 0 < g_min < g_max < ∞, ||g||_{L²_μ(X)} = 1.

The assumption G(θ_0) = 0 implies Y_i = 0 + ε_i under P_{θ_0}^N, so that we have

ℓ_N(θ) = −(1/2) Σ_{i=1}^N |ε_i − w(||θ||)g(X_i)|² = −(w(||θ||_{R^D})²/2) Σ_{i=1}^N g²(X_i) − (1/2) Σ_{i=1}^N ε_i² + w(||θ||) Σ_{i=1}^N ε_i g(X_i), (4.3)

and the average log-likelihood is

ℓ(θ) = E_{θ_0}^N ℓ_N(θ) = −(N/2) w(||θ||_{R^D})² − N/2, θ ∈ R^D. (4.4)

Define ϵ-annuli of Euclidean space

Θ_{r,ϵ} = {θ ∈ R^D : ||θ||_{R^D} ∈ (r, r + ϵ)}, r ≥ 0. (4.5)

We then also set, for any s0,ϵ>0,

w_−(r, ϵ) = inf_{s ∈ (r, r+ϵ)} w(s), w_+(r, ϵ) = sup_{s ∈ (r, r+ϵ)} w(s).

For our main theorems the map w will be monotone increasing and the preceding notation w,w+ is then not necessary, but proposition 4.2 is potentially also useful in non-monotone settings (as remarked after its proof), hence the slightly more general notation here.

The choice of a radial G is convenient in the proofs, but it means that the model is identifiable only up to a rotation of θ_0. One could easily make it identifiable by more intricate choices of G, but the main point for our negative results is that the function ℓ has a unique mode at the ground truth parameter θ_0 and is identifiable there.

(i) . A locally log-concave, globally monotone choice of w

Define for 0 < t < L and any r ≥ 0 the function w : [0, ∞) → R as

[display equation (4.6): w is defined piecewise — quadratic on [0, t/2] with w(0) = 0, linear with slope T on (t/2, t], linear with slope ρ on (t, L], and constant on (L, ∞)]

where T > ρ are fixed constants to be chosen. Note that w is monotone increasing and

||w||_∞ = w(L) = w(t/2) + T(t/2) + ρ(L − t) < ∞. (4.7)

The function w is quadratic near its minimum at the origin up until t/2, from which point onwards it is piecewise linear. In the linear regime it initially has a ‘steep’ ascent of gradient T until t, then grows more slowly with small gradient ρ from t until L, and from then on it is constant. The function w is not smooth at the points r = t/2, r = t, r = L, but we could easily make it smooth by convolving with a smooth function supported in small neighbourhoods of these breakpoints, without changing the findings that follow. We abstain from this to simplify notation.
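As one concrete (hypothetical) parameterization consistent with the verbal description — the exact constants in (4.6) may differ, and the value at t/2 chosen below is our own assumption — such a w can be written down as:

```python
def make_w(T, rho, t, L):
    """A continuous, monotone increasing w matching the verbal description of (4.6):
    quadratic on [0, t/2], slope T on (t/2, t], slope rho on (t, L], constant after L.
    The quadratic coefficient (value T*t/2 at r = t/2) is one possible choice,
    not necessarily the paper's."""
    assert 0 < t < L and 0 < rho < T
    w_half = 0.5 * T * t                                  # assumed value at r = t/2

    def w(r):
        if r <= t / 2:
            return w_half * (r / (t / 2)) ** 2            # quadratic, w(0) = 0
        if r <= t:
            return w_half + T * (r - t / 2)               # steep ascent, gradient T
        if r <= L:
            return w_half + T * (t / 2) + rho * (r - t)   # slow ascent, gradient rho
        return w_half + T * (t / 2) + rho * (L - t)       # constant beyond L

    return w
```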

The following proposition summarizes some monotonicity properties of the empirical log-likelihood function arising from the above choice of w.

Proposition 4.1. —

Let w be as in (4.6). Then there exists C > 0 such that for any r_0 > 0 and N ≥ 1, we have

P_0^N( sup_{r_0 ≤ r < s ≤ L} sup_{||θ_s||=s, ||θ_r||=r} [ℓ_N(θ_s) − ℓ_N(θ_r)] / [w(s)² − w(r)²] ≤ −N/4 ) ≥ 1 − C/N − C/(N w(r_0)²).

In particular, if r_0 < t/2 is such that N w(r_0)² → ∞ as N → ∞, then the r.h.s. is 1 − o(1).

Proof. —

Recalling (4.4) and (4.3), and since w is monotonically increasing, we bound

P_0^N( ℓ_N(θ_s) − ℓ_N(θ_r) > −(N/4)(w(s)² − w(r)²) ) = P_0^N( ℓ_N(θ_s) − ℓ_N(θ_r) − (ℓ(θ_s) − ℓ(θ_r)) > (N/4)(w(s)² − w(r)²) ) = Pr( −((w(s)² − w(r)²)/2) Σ_{i=1}^N (g²(X_i) − 1) + (w(s) − w(r)) Σ_{i=1}^N ε_i g(X_i) > (N/4)(w(s)² − w(r)²) ) = Pr( −(1/2) Σ_{i=1}^N (g²(X_i) − Eg²(X)) + [1/(w(s)+w(r))] Σ_{i=1}^N ε_i g(X_i) > N/4 ) ≤ Pr( |Σ_{i=1}^N (g²(X_i) − Eg²(X))| > N/4 ) + Pr( |Σ_{i=1}^N ε_i g(X_i)| > N w(r_0)/4 ) = O(1/N) + O(1/(N w(r_0)²)),

using Chebyshev’s inequality in the last step. Since the events in the penultimate step do not depend on r<s[r0,L], the result follows.

(b) . Bounds for posterior ratios of annuli

A key quantity in the proofs to follow will be asymptotic (N → ∞) bounds for the following functional (recalling the definition of the Euclidean annuli Θ_{r,ε} from (4.5)),

F_N(r, ε) = (1/N) log ∫_{Θ_{r,ε}} e^{ℓ_N(θ)} dΠ(θ), r ≥ 0, ε > 0, (4.8)

in terms of the map w. As a side note, we remark that this functional has a long history in the statistical physics of glasses, where it is often referred to as the Franz–Parisi potential [7,58].
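Purely for illustration (this plays no role in the proofs), F_N can be approximated by prior Monte Carlo in low dimensions: draw from Π, keep the draws landing in the annulus, and average e^{ℓ_N} there. All names and the toy model below are our own assumptions.

```python
import numpy as np

def franz_parisi(log_lik_n, prior_sampler, r, eps, N, n_mc, rng):
    """Monte Carlo estimate of F_N(r, eps) = (1/N) log ∫_{Θ_{r,eps}} e^{l_N(θ)} dΠ(θ).

    The integral is estimated as (prior mass of the annulus) × (mean of e^{l_N} on it),
    with the log computed stably via a max shift."""
    draws = prior_sampler(n_mc, rng)                   # rows ~ Π
    radii = np.linalg.norm(draws, axis=1)
    mask = (radii > r) & (radii < r + eps)             # restrict to the annulus Θ_{r,eps}
    if not mask.any():
        return -np.inf
    ll = np.array([log_lik_n(th) for th in draws[mask]])
    m = ll.max()
    log_integral = m + np.log(np.mean(np.exp(ll - m))) + np.log(mask.mean())
    return log_integral / N

# toy usage: N(0, I_D/D) prior and l_N(θ) = -(N/2) w(||θ||)^2 with w(r) = min(r, 1)
rng = np.random.default_rng(2)
D, N = 8, 50
prior = lambda n, rng: rng.standard_normal((n, D)) / np.sqrt(D)
log_lik = lambda th: -(N / 2.0) * min(np.linalg.norm(th), 1.0) ** 2
f_inner = franz_parisi(log_lik, prior, 0.2, 0.2, N, 20000, rng)
f_outer = franz_parisi(log_lik, prior, 0.9, 0.2, N, 20000, rng)
```

In this toy example the outer annulus carries most of the prior mass but a heavy likelihood penalty, while the inner annulus has little prior mass — the competition between the two terms is exactly what F_N records.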

Proposition 4.2. —

Consider the regression model (3.1) with the radially symmetric choice of G from §4a, such that ||w||_∞ ≤ W for some fixed W < ∞ (independent of D, N), and let Π = Π_N denote a sequence of prior probability measures on R^D.

  • (i)
    Suppose that for some radii 0 < s < σ, constants ε, η, ν > 0 and for all N ≥ 1 large enough, we have
    (1/N) log [Π(Θ_{s,η}) / Π(Θ_{σ,ε})] ≤ −2ν − (1/2)(w_+(σ,ε)² − w_−(s,η)²). (4.9)
    Then the posterior distribution Π(·|Z^{(N)}) from (3.2) arising in the model (3.1) satisfies, with high P_0^N-probability as N → ∞,
    Π(Θ_{s,η}|Z^{(N)}) / Π(Θ_{σ,ε}|Z^{(N)}) ≤ e^{−νN}. (4.10)
  • (ii)
    If in addition w is monotone increasing on [0, ∞) and if for some Q > σ + ε,
    (1/N) log [Π(B_Q^c) / Π(Θ_{σ,ε})] ≤ −2ν, (4.11)
    then the posterior distribution Π(·|Z^{(N)}) also satisfies (with high probability as N → ∞) that
    Π(B_Q^c|Z^{(N)}) / Π(Θ_{σ,ε}|Z^{(N)}) ≤ e^{−νN}. (4.12)

Remark 4.3 (The prior condition for w from (4.6)). —

If σ > s > t, then for w from (4.6) both w_−(s,η) = w(s) and w_+(σ,ε) = w(σ+ε) lie in the regime where w has slope ρ, and the ‘likelihood’ term in proposition 4.2 satisfies

w_+(σ,ε)² − w_−(s,η)² ≤ ρ(σ + ε − t) − ρ(s − t) = ρ(σ + ε − s) > 0, (4.13)

so that if we also assume

Tt + ρL = O(√N), (4.14)

to control ω_N, ω′_N in the proof that follows, then to verify (4.9) it suffices to check

(1/N) log [Π(Θ_{s,η}) / Π(Θ_{σ,ε})] ≤ −2ν − (ρ/2)(σ + ε − s), (4.15)

for all large enough N.

Proof. —

Proof of part (i). From the definition of ℓ_N in (4.3), we first note that for all r ≥ 0, ϵ > 0,

inf_{θ∈Θ_{r,ϵ}} ℓ_N(θ) ≥ −(1/2) Σ_{i=1}^N ε_i² − (1/2) w_+(r,ϵ)² Σ_{i=1}^N g²(X_i) − w_+(r,ϵ) |Σ_{i=1}^N ε_i g(X_i)|

and

sup_{θ∈Θ_{r,ϵ}} ℓ_N(θ) ≤ −(1/2) Σ_{i=1}^N ε_i² − (1/2) w_−(r,ϵ)² Σ_{i=1}^N g²(X_i) + w_+(r,ϵ) |Σ_{i=1}^N ε_i g(X_i)|.

We can now further bound, for our G,

(1/N) log ∫_{Θ_{r,ϵ}} e^{ℓ_N(θ)} dΠ(θ) ≥ −(1/2N) Σ_{i=1}^N ε_i² − (w_+(r,ϵ)²/2N) Σ_{i=1}^N g²(X_i) − (w_+(r,ϵ)/N) |Σ_{i=1}^N ε_i g(X_i)| + (1/N) log Π(Θ_{r,ϵ})

and

(1/N) log ∫_{Θ_{r,ϵ}} e^{ℓ_N(θ)} dΠ(θ) ≤ −(1/2N) Σ_{i=1}^N ε_i² − (w_−(r,ϵ)²/2N) Σ_{i=1}^N g²(X_i) + (w_+(r,ϵ)/N) |Σ_{i=1}^N ε_i g(X_i)| + (1/N) log Π(Θ_{r,ϵ}).

We estimate w_+(r,ϵ) ≤ w̄(r,ϵ) := max(w_+(r,ϵ), 1), and noting that

Eε_i² = 1 = Eg²(X_i) and E[ε_i g(X_i)] = 0,

we can use Chebyshev's (or Bernstein's) inequality to construct an event of high probability on which the functional F_N from (4.8) is bounded as

F_N(r,ϵ) ≥ −1/2 − (1/2) w_+(r,ϵ)² + (1/N) log Π(Θ_{r,ϵ}) − ω_N(r,ϵ) (4.16)

and

F_N(r,ϵ) ≤ −1/2 − (1/2) w_−(r,ϵ)² + (1/N) log Π(Θ_{r,ϵ}) + ω′_N(r,ϵ), (4.17)

where

ω_N(r,ϵ), ω′_N(r,ϵ) = O((1 + w_+(r,ϵ) + w̄(r,ϵ)²)/√N), (4.18)

and this is uniform in all (r,ϵ) since ||w||_∞ ≤ W is bounded. Using the above with (r,ϵ) chosen as (s,η) and (σ,ε) respectively, we then obtain

(1/N) log [Π(Θ_{s,η}|Z^{(N)}) / Π(Θ_{σ,ε}|Z^{(N)})] = F_N(s,η) − F_N(σ,ε) ≤ −(1/2) w_−(s,η)² + (1/2) w_+(σ,ε)² + (1/N) log [Π(Θ_{s,η})/Π(Θ_{σ,ε})] + ω′_N(s,η) + ω_N(σ,ε), (4.19)

with high P_{θ_0}^N-probability. The result now follows from the hypothesis (4.9), since the terms ω_N, ω′_N are o(1).

(Proof of part ii). The proof of part (ii) follows from an obvious modification of the previous arguments.

In the case where Π(Θ_{s,η}) and Π(Θ_{σ,ε}) are comparable (so that the l.h.s. in (4.9) converges to zero), a local optimum of the function w at some σ away from zero can verify the last inequality for ‘intermediate’ s such that w_−(s,η)² ≥ w_+(σ,ε)² + 4ν. This can be used to give computational hardness results for MCMC in multi-modal distributions. But we are interested in the more challenging case of the ‘unimodal’ examples w from (4.6). Before we turn to this, let us point out what can be said about the hitting times of Markov chains if the conclusion (4.10) of proposition 4.2 holds.

(c) . Bounds for Markov chain hitting times

(i) . Hitting time bounds for intermediate sets Θs,η

In (4.10), we can think of Θ_{σ,ε} as the ‘initialization region’ (further away from θ_0) and of Θ_{s,η}, for intermediate s, as the ‘barrier’ to be crossed before the chain gets close to θ_0 = 0. The last bound permits the following classic hitting time argument, taken from Ben Arous et al. [5], see also [8].

Proposition 4.4. —

Consider any Markov chain (ϑ_k : k ∈ N) with invariant measure μ = Π(·|Z^{(N)}) for which (4.10) holds. For constants η < σ − s, suppose ϑ_0 is started in Θ_{σ,ε} with μ(Θ_{σ,ε}) > 0, drawn from the conditional distribution μ(·|Θ_{σ,ε}), and denote by τ_s the hitting time of the Markov chain onto Θ_{s,η}, that is, the number τ_s of iterates required until ϑ_k visits the set Θ_{s,η}. Then

Pr(τ_s ≤ K) ≤ K e^{−νN}, K > 0.

Similarly, on the event where (4.12) holds, we have that

Pr(τ_{B_Q^c} ≤ K) ≤ K e^{−νN}, K > 0.
Proof of proposition 4.4. —

We have

Pr(τ_s ≤ K) = Pr(ϑ_k ∈ Θ_{s,η} for some 1 ≤ k ≤ K | ϑ_0 ∈ Θ_{σ,ε}) ≤ Σ_{k≤K} Pr(ϑ_k ∈ Θ_{s,η} | ϑ_0 ∈ Θ_{σ,ε}) ≤ Σ_{k≤K} μ(Θ_{s,η})/μ(Θ_{σ,ε}) = K μ(Θ_{s,η})/μ(Θ_{σ,ε}) ≤ K e^{−νN}, where we used that the law of ϑ_k is dominated by μ/μ(Θ_{σ,ε}) when ϑ_0 ∼ μ(·|Θ_{σ,ε}) and the chain is stationary for μ.

The second claim is proved analogously.

The last proposition holds ‘on average’ for initializers ϑ_0 ∼ μ(·|Θ_{σ,ε}); since Pr = E_{ϑ_0∼μ(·|Θ_{σ,ε})}[Pr_{ϑ_0}], where Pr_{ϑ_0} is the law of the Markov chain started at ϑ_0, and since inf_{ϑ_0} Pr_{ϑ_0}(τ_s ≤ K) ≤ E_{μ(·|Θ_{σ,ε})}[Pr_{ϑ_0}(τ_s ≤ K)], the hitting time inequality holds for at least one point in Θ_{σ,ε}.
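The union-bound mechanism behind proposition 4.4 can be checked exactly on a toy two-state chain (our own example, not from the paper), where both the hitting probability and the bound K μ(target)/μ(start) are available in closed form:

```python
def hitting_bound_check(p, b, K):
    """Two-state chain on {0, 1} with stationary measure mu = (1-p, p), reversible with
    P(1 -> 0) = b and hence P(0 -> 1) = a = p*b/(1-p) by detailed balance.
    Started at 0 (i.e. from mu conditioned on {0}), the hitting time tau of state 1
    satisfies Pr(tau <= K) = 1 - (1-a)**K exactly, while the proposition-4.4-style
    union bound gives K * mu({1}) / mu({0})."""
    a = p * b / (1.0 - p)                   # detailed balance: (1-p) a = p b
    exact = 1.0 - (1.0 - a) ** K
    union_bound = K * p / (1.0 - p)
    return exact, union_bound

# the bound holds for every parameter choice: 1-(1-a)^K <= K a <= K p/(1-p)
for p, b, K in [(1e-3, 0.5, 10), (1e-2, 0.9, 100), (1e-4, 1.0, 1000)]:
    exact, bound = hitting_bound_check(p, b, K)
    assert exact <= bound
```

The point of the toy computation: when the stationary mass p of the target region is exponentially small relative to the initialization region, the bound forces K to be exponentially large before the hitting probability can be non-negligible.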

(ii) . Reducing hitting times for Bs to ones for Θs,η

We now reduce part (iv) of theorems 3.2 and 3.3, i.e. bounds on the hitting time of the region B_s in which the posterior contracts, to a bound for the hitting time τ_s for the annulus Θ_{s,η}, which is controlled in proposition 4.4. To this end, in the case of theorem 3.2, we suppose that propositions 4.2 and 4.4 are verified with ν = 1, σ = 2/3, some ε > 0 and Q, s, η as in the theorem; in the case of theorem 3.3, we assume the same with the choice σ = N^{−b} and ν > 0 given after (4.27) below. For c_0 from assumption 3.1, define the events

A_N := {for all k ≤ e^{min{ν,c_0}N/2} : ||ϑ_{k+1} − ϑ_k||_{R^D} ≤ η/2}.

We can then estimate, using assumption 3.1, that on the frequentist event on which proposition 4.4 holds (which we apply with K = e^{min{ν,c_0}N/2} ≤ e^{νN/2}), under the probability law of the Markov chain we have

Pr(τ_{B_s} ≤ e^{min{ν,c_0}N/2}) ≤ Pr(τ_{B_s} ≤ e^{min{ν,c_0}N/2}, A_N) + Pr(A_N^c) ≤ Pr(τ_s ≤ e^{min{ν,c_0}N/2}) + Pr(A_N^c, τ_{B_Q^c} > e^{min{ν,c_0}N/2}) + Pr(τ_{B_Q^c} ≤ e^{min{ν,c_0}N/2}) ≤ 2e^{−νN/2} + e^{min{ν,c_0}N/2} sup_{θ∈B_Q} P_N(θ, {ϑ : ||θ − ϑ||_{R^D} > η/2}) ≤ 2e^{−min{ν,c_0}N/2} + e^{min{ν,c_0}N/2} e^{−c_0 N} ≤ 3e^{−min{ν,c_0}N/2},

where in the second inequality we have used that on the events AN, the Markov chain ϑk, when started in Θ2/3,ε, needs to pass through Θs,η in order to reach Bs.

(d) . Proof of theorem 3.3

In this section, we use the results derived in the previous parts of §4 to finish the proof of theorem 3.3. Parts (i) and (ii) of the theorem follow from proposition 4.1 and our choice of w in (4.6). We therefore concentrate on the proofs of parts (iii) and (iv). We start by proving a key lemma on small ball estimates for truncated α-regular Gaussian priors.

(i) . Small ball estimates for α-regular priors

Let us first define precisely the notion of α-regular Gaussian priors. For some fixed α > d/2, the prior Π arises as the truncated law Law(θ) of an α-regular Gaussian process with RKHS H = H^α, a Sobolev space over some bounded domain/manifold X, see e.g. section 6.2.1 in [14] for details. Equivalently (under the Parseval isometry), we take a Gaussian Borel measure on the usual sequence space ℓ² ≅ L² with RKHS equal to

h_α = {(θ_i)_{i=1}^∞ : Σ_{i=1}^∞ i^{2α/d} θ_i² = ||θ||²_{H^α} < ∞}, α > d/2.

The prior Π is the law of the truncated vector θ^D = (θ_1, …, θ_D), D ∈ N.

Lemma 4.5. —

Fix z > 0, α > d/2 and κ > 0, and set

b = α/d − 1/2, τ = 1/b = 2d/(2α − d).

Then if D/N → κ > 0, there exist constants c̄_0 > c_0 > 0 (depending on b, κ) such that for all N (≥ N_0(z, b)) large enough:

−c̄_0 z^{−τ} ≤ (1/N) log Π(||θ||_{R^D} ≤ zN^{−b}) ≤ −c_0 (z + κ^{−α/d} z^{−τ/2})^{−τ}. (4.20)
Proof of lemma 4.5. —

Note first that the L²-covering numbers of the ball h(α, B) of radius B in H^α satisfy the well-known two-sided estimate

log N(δ, ||·||_{L²}, h(α,B)) ≍ (AB/δ)^{d/α}, 0 < δ < AB, (4.21)

for two-sided equivalence constants in ≍ depending only on d, α. The upper bound is given in proposition 6.1.1 in [14], and a lower bound can be found as well in the literature [59] (by injecting H^α(X_0) into H̃^α(X) for some strict sub-domain X_0 ⊂ X, and using metric entropy lower bounds for the injection H^α(X_0) → L²(X_0)).

Using the results about small deviation asymptotics for Gaussian measures in Banach spaces [60]—specifically theorem 6.2.1 in [14] with a = 2d/(2α − d)—and assuming α > d/2, this means that the concentration function of the ‘untruncated’ prior satisfies the two-sided estimate

−log Π(||θ||_{L²} ≤ γ) ≍ γ^{−2d/(2α−d)} = γ^{−τ}, γ → 0. (4.22)

Here, restricting to γ ∈ (0, 1), the two-sided equivalence constants depend only on α, d. Setting

γ = zN^{−b}, z > 0, (4.23)

and noting that bτ = 1, we hence obtain that for some constants c_l, c_u > 0,

e^{−c_l z^{−τ} N} ≤ Π(||θ||_{L²} ≤ zN^{−b}) ≤ e^{−c_u z^{−τ} N}, any z > 0. (4.24)

We now show that as long as D/Nκ>0, one may use the above asymptotics to derive the desired small ball probabilities for the projected prior on RD.

We obviously have, by set inclusion and projection,

Π(||θ^D||_{R^D} ≤ zN^{−b}) ≥ Π(||θ||_{L²} ≤ zN^{−b}),

and hence it only remains to show the first inequality in equation (4.20). The Gaussian isoperimetric theorem (theorem 2.6.12 in [61]) and (4.24) imply that for m ≥ 4√(c_l) and some c > 0, we have (with Φ denoting the c.d.f. of N(0,1))

Π(θ = θ_1 + θ_2 : ||θ_1||_{L²} ≤ zN^{−b}, ||θ_2||_{h_α} ≤ m z^{−τ/2}√N) ≥ Φ(Φ^{−1}(Π({θ : ||θ||_{L²} ≤ zN^{−b}})) + m z^{−τ/2}√N) ≥ Φ(−√(2c_l) z^{−τ/2}√N + m z^{−τ/2}√N) ≥ 1 − e^{−c z^{−τ} N},

(see also the proof of lemma 5.17 in [19] for a similar calculation). Then, if the event in the last probability is denoted by I, we have

Π(||θ^D||_{R^D} ≤ zN^{−b}) ≤ Π({||θ^D||_{R^D} ≤ zN^{−b}} ∩ I) + e^{−c z^{−τ} N}.

On I, if D/N → κ > 0 and by the usual tail estimate for vectors in h_α, we have for some c′ > 0 the bound

||θ − θ^D||_{L²} ≤ ||θ_1||_{L²} + c′ D^{−α/d} ||θ_2||_{h_α} ≤ zN^{−b} + c′ κ^{−α/d} z^{−τ/2} N^{−b},

so that for any z > 0,

Π(||θ^D||_{R^D} ≤ zN^{−b}) ≤ Π(||θ||_{L²} ≤ zN^{−b} + ||θ − θ^D||_{L²}, I) + e^{−c z^{−τ} N} ≤ Π(||θ||_{L²} ≤ (2z + c′κ^{−α/d} z^{−τ/2})N^{−b}) + e^{−c z^{−τ} N} ≤ e^{−c_u (2z + c′κ^{−α/d} z^{−τ/2})^{−τ} N} + e^{−c z^{−τ} N},

and hence the lemma follows by appropriately choosing c0>0.

Remark 4.6. —

For statistical consistency proofs in nonlinear inverse problems, often rescaled Gaussian priors are used to provide additional regularization [12,13,19]. For these priors a computation analogous to the previous lemma is valid: specifically if we rescale θ by NδN, where δN=Nα/(2α+d) so that NδN=N(d/2)/(2α+d)=Nk, then we just take Nβ+k=Nb in the above small ball computation, that is b=β+k or b=βk, and the same bounds (as well as the proof to follow) apply.

(ii) . Proof of theorem 3.3, part (iv)

Lemma 4.5 and the hypotheses on η immediately imply

Π(θ ∈ Θ_{s,η}) = Π(||θ||_{R^D} ∈ (s_bN^{−b}, s_bN^{−b} + η)) ≤ Π(||θ||_{R^D} ≤ 2s_bN^{−b}) ≤ e^{−c_0 N (2s_b + κ^{−α/d}(2s_b)^{−τ/2})^{−τ}}.

To lower bound Π(Θ_{N^{−b}, εN^{−b}}), we choose ε large enough such that

c̄_0 (1+ε)^{−τ} < c_0 (1 + κ^{−α/d})^{−τ},

which implies for all N large enough that

Π(||θ||_{R^D} ∈ (N^{−b}, (1+ε)N^{−b})) = Π(||θ||_{R^D} ≤ (1+ε)N^{−b}) − Π(||θ||_{R^D} ≤ N^{−b}) ≥ e^{−c̄_0(1+ε)^{−τ}N} − e^{−c_0(1+κ^{−α/d})^{−τ}N} ≥ e^{−2c̄_0(1+ε)^{−τ}N}. (4.25)

Now, for w from (4.6), we set

t = t_b N^{−b}, ρ ∈ (0, 1], 0 < t_b < s_b < 1/2 < L < ∞, T = T_b N^{b}, (4.26)

for T_b to be chosen and ρ, L, s_b, t_b fixed constants, so that ||w||_∞ is bounded (uniformly in N) by a constant which depends only on T_b, L, ρ, whence (4.14) holds. Now the key inequality (4.15), with s = s_bN^{−b} and with our choices of η, ε, σ = N^{−b}, will be satisfied if

c_0 (2s_b + κ^{−α/d}(2s_b)^{−τ/2})^{−τ} ≥ 2c̄_0 (1+ε)^{−τ} + 2ν + (ρ/2)N^{−b}(1 + ε − s_b). (4.27)

We define ν to equal 1/3 of the l.h.s. of (4.27), so that (4.27) follows, for the given s_b, κ, α, d, by choosing ε large enough and whenever N is large enough.

Finally, let us note that with Q = C√N for some C² ≥ 2E[||θ||²], where θ is the infinite Gaussian vector with RKHS h_α, we can deduce from theorem 2.1.20 and exercise 2.1.5 in [61] that

Pr(||θ||_{R^D} ≥ Q) ≤ 2 exp(−cC²N/2), for some c > 0.

Thus, using also (4.25), choosing C large enough verifies (4.11). Since (4.25) and the a.s. finiteness of sup_θ |ℓ_N(θ)| for ℓ_N from (4.3) imply that Π(Θ_{N^{−b}, εN^{−b}}|Z^{(N)}) > 0 a.s., propositions 4.2 and 4.4 apply for this prior, and the arguments from §4c(ii) yield the desired result.

(iii) . Proof of theorem 3.3, part (iii)

We finish the proof of the theorem by showing part (iii). We use the setting and choices from the previous section. Let us write G(A) = ∫_A e^{ℓ_N(θ)} dΠ(θ) for any measurable set A, and recall the notation B_r = {θ : ||θ||_{R^D} ≤ r}, r > 0. Repeating the argument leading to (4.17) with B_{t/2} in place of Θ_{r,ϵ}, and using lemma 4.5, we have with high probability

(1/N) log G(B_{t/2}) ≥ −(1/2) sup_{r ≤ t_bN^{−b}/2} w(r)² − c̄_0 (t_b/2)^{−τ} + ω_N(t/2),

where ω_N(t/2) = O(||w||_∞/√N) = o(1). Likewise, we also have

(1/N) log G(B_s^c) ≤ −(1/2) inf_{r ≥ s_bN^{−b}} w(r)² + (1/N) log Π(B_s^c) + ω′_N(s),

where ω′_N(s) = O(||w||_∞/√N) = o(1). We can assume that G(B_s^c) > 0. Hence, since Π(B_s^c) ≤ 1 and w is monotone increasing,

(1/N) log [G(B_{t/2})/G(B_s^c)] ≥ (1/2)[w(s_bN^{−b})² − w(t_bN^{−b}/2)²] − c̄_0 (t_b/2)^{−τ} + o(1). (4.28)

Now, for t_b < s_b fixed, the leading term in (4.28) grows with T_b by (4.26), so we can choose T_b large enough such that the last quantity exceeds 1 with high probability (in particular, this retrospectively justifies the last o(1), as ||w||_∞ = O(1) for our fixed choice of T_b). Therefore, again with high probability,

G(B_{t/2}) / G(B_s^c) ≥ e^{N(1+o(1))}. (4.29)

For M_{t,s} = {θ : t/2 < ||θ||_{R^D} ≤ s}, this further implies that with high probability

[G(B_{t/2}) + G(M_{t,s})] / G(B_s^c) ≥ e^{N(1+o(1))},

and then,

Π(B_s|Z^{(N)}) = [G(B_{t/2}) + G(M_{t,s})] / [G(B_{t/2}) + G(M_{t,s}) + G(B_s^c)] = [1 + G(B_s^c)/(G(B_{t/2}) + G(M_{t,s}))]^{−1} → 1,

again with high probability, which is what we wanted to show.

Remark 4.7. —

If the map w were globally convex, say w(s) = Ts²/2 for all s ≥ 0, then a ‘large enough’ choice of ε after (4.27) would not be possible. It is here that global log-concavity of the likelihood function helps: it enforces a certain ‘uniform’ spread of the posterior across its support via a global coercivity constant T. By contrast, the above example of w is not convex; rather, it is very spiked on (0, t/2) and then ‘flattens out’.

(e) . Proof of theorem 3.2

The proof of theorem 3.2 proceeds along the same lines as that of theorem 3.3, with scalings t, L, ρ, s, η constant in N (corresponding to b = 0 in N^{−b}), and with the volumetric lemma 4.5 replaced by the following basic result.

Lemma 4.8. —

Let θ ∼ N(0, I_D/D) and let a ∈ (0, 1/2). Then for all D ≥ D_0(a) large enough,

(1/D) log Π(||θ||_{R^D} ≤ z) ≤ (1/2)(2 log z − z² + 1), any z ∈ (0, 1 − a). (4.30)

A proof of (4.30) is sketched in appendix B. As a consequence of the previous lemma,

(1/N) log Π(Θ_{s,η}) ≤ (1/N) log Π(B_{2s}) ≤ (κ/2)(2 log(2s) − 4s² + 1) + o(1).

Moreover, to lower bound Π(Θ_{2/3,ε}), we choose ε > 2/3. Then, using theorem 2.5.7 in [61] as well as E||θ|| ≤ (E||θ||²)^{1/2} = 1, and then also (4.30) with z = 2/3, we obtain that

Π(Θ_{2/3,ε}) ≥ Π(| ||θ||_{R^D} − 1 | ≤ 1/3) ≥ 1 − Π(||θ||_{R^D} ≥ E||θ||_{R^D} + 1/3) − Π(||θ||_{R^D} ≤ 2/3) ≥ 1 − exp(−D/18) − exp(−cD),

for some fixed constant c > 0 given by (4.30), whence Π(Θ_{2/3,ε}) → 1 and also N^{−1} log Π(Θ_{2/3,ε}) → 0. Therefore, the key inequality (4.15) with σ = 2/3, ν = 1 holds whenever we choose s = s_0 small enough such that

−log(2s_0) > 2κ^{−1}[2 + (ρ/2)(2/3 + ε − s_0)] + 2s_0² + 1/2.

The rest of the detailed derivations follow the same pattern as in the proof of theorem 3.3 and are left to the reader, including verification of (4.11) via an application of theorem 2.5.7 in [61]. In particular, the proof of part (iii) follows the same arguments (suppressing the N^{−b} scaling everywhere) as in theorem 3.3.
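The small ball bound of lemma 4.8 can be sanity-checked numerically: for θ ∼ N(0, I_D/D), ||θ||² is a scaled χ²_D variable, so for even D the small ball probability is a Poisson tail, and (1/D) log Π(||θ|| ≤ z) is bounded by, and approaches, f(z) = −z²/2 + log z + 1/2 as D grows. A pure-Python check (our own sketch, using the classical identity P(χ²_{2m} ≤ x) = P(Poisson(x/2) ≥ m)):

```python
import math

def log_small_ball(D, z):
    """log P(||theta|| <= z) for theta ~ N(0, I_D/D), with D even.
    ||theta||^2 ~ chi2_D / D, and P(chi2_{2m} <= x) = P(Poisson(x/2) >= m)."""
    assert D % 2 == 0
    m, lam = D // 2, D * z * z / 2.0
    # log-sum-exp over the Poisson upper tail; terms decay geometrically (ratio lam/k < 1),
    # so a few hundred terms are ample
    log_terms = [k * math.log(lam) - math.lgamma(k + 1) - lam for k in range(m, m + 400)]
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

f = lambda z: -z * z / 2.0 + math.log(z) + 0.5

for D in (100, 400, 1000):
    lhs = log_small_ball(D, 0.5) / D
    assert lhs <= f(0.5)                  # a Chernoff-type upper bound, valid for every D
    assert abs(lhs - f(0.5)) < 50.0 / D   # and attained to first order as D grows
```

The assertion lhs ≤ f(z) is the finite-D Chernoff bound P(χ²_D ≤ Dz²) ≤ (z² e^{1−z²})^{D/2} for z < 1, which is exactly the exponent appearing in (4.30).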

(f) . Proofs for §3b

In this section, we prove the results of §3b which detail the consequences of the general theorems 3.2 and 3.3 for practical MCMC algorithms.

(i) . Proofs for pCN

Theorem 3.6 is proved by verifying assumption 3.1 for suitable choices of η and Q, and for c_0 = κ/2 > 0.

Lemma 4.9. —

Let PN denote the transition kernel of pCN from (3.7) with parameter β>0.

  • (i)
    Suppose Π = N(0, I_D/D) as in theorem 3.2, and let Q, η > 0. Then for all β ≤ min{1/2, η/(4Q), η²/64} and all D ≥ 1, we have (with P_0^N-probability 1)
    sup_{θ∈B_Q} P_N(θ, {ϑ : ||θ − ϑ||_{R^D} > η/2}) ≤ e^{−D/2}.
  • (ii)
    Suppose Π = N(0, Σ_α) as in theorem 3.3, and let Q, η > 0. There exists some c > 0 such that for all β ≤ min{1/2, η/(4Q), cη²/D} and all D ≥ 1, we have (with P_0^N-probability 1)
    sup_{θ∈B_Q} P_N(θ, {ϑ : ||θ − ϑ||_{R^D} > η/2}) ≤ e^{−D/2}.
Proof of lemma 4.9. —

We begin with the proof of part (ii). Let ||ϑ_k||_{R^D} ≤ Q. Then, using the definition of pCN and that |√(1−β²) − 1| ≤ β for any β ∈ [0, 1] (Taylor expanding around 1), we obtain that for any β ≤ min{1/2, η/(4Q)},

Pr(||ϑ_{k+1} − ϑ_k||_{R^D} > η/2) ≤ Pr(||p_{k+1} − ϑ_k||_{R^D} > η/2) ≤ Pr(|√(1−β²) − 1| ||ϑ_k||_{R^D} + β||ξ_k||_{R^D} > η/2) ≤ Pr(||ξ_k||_{R^D} > (η/2 − βQ)/β) ≤ Pr(||ξ_k||_{R^D} > η/(4β)) = Pr(||ξ_k||_{R^D} − E||ξ_k||_{R^D} > η/(4β) − E||ξ_k||_{R^D}).

The variables ξ_k are equal in law to a vector with components (i^{−α/d} g_i : i ≤ D) for g_i i.i.d. N(0, 1), and hence E||ξ_k||_{R^D} ≤ (E||ξ_k||²_{R^D})^{1/2} ≤ C(α, d) < ∞ for α > d/2. Then, for β ≤ cη²/D with some sufficiently small c > 0 (noting that then also β ≤ cη²), it holds that

Pr(||ϑ_{k+1} − ϑ_k||_{R^D} > η/2) ≤ Pr(||ξ_k||_{R^D} − E||ξ_k||_{R^D} > η/(8β)) ≤ exp(−η²/(64β)) ≤ exp(−D/2), (4.31)

using, e.g. theorem 2.5.8 in [61] (and representing the ||||RD-norm by duality as a supremum). This completes the proof of part (ii).

The proof of part (i) is similar, albeit simpler, whence we leave some details to the reader. Arguing similarly as before, we obtain that for any β ≤ min{1/2, η/(4Q)},

Pr(||ϑ_{k+1} − ϑ_k||_{R^D} > η/2) ≤ Pr(||ξ_k||_{R^D} > (η/2 − βQ)/β) ≤ Pr(||g_k||_{R^D} > η√D/(4β)),

where g_k is a N(0, I_D) random vector. The latter probability is bounded by a standard deviation inequality for Gaussians, see, e.g. theorem 2.5.7 in [61]. Indeed, noting that E||g_k||_{R^D} ≤ (E[||g_k||²_{R^D}])^{1/2} = √D, and that the one-dimensional variances satisfy E⟨g_k, v⟩² = ||v||²_{R^D} = 1 for any ||v||_{R^D} = 1, we obtain

Pr(||g_k||_{R^D} > η√D/(4β)) ≤ Pr(||g_k||_{R^D} − E||g_k||_{R^D} > √D(η/(4β) − 1)) ≤ exp(−(D/2)(η/(4β) − 1)²) ≤ exp(−D/2).
Proof of theorem 3.6. —

We begin with part (ii). Let s_b be as in theorem 3.3 and set η = η_N = s_bN^{−b}/2 as well as Q = Q_N = C√N, where C is as in theorem 3.3. With these choices, lemma 4.9(ii) implies that assumption 3.1 is fulfilled with c_0 = κ/2, so long as β satisfies

β ≲ min{1/2, s_bN^{−b}/(8C√N), c s_b² N^{−2b}/(4D)} ≍ N^{−2b} D^{−1} ≍ N^{−1−2b}.

Hence, the desired result immediately follows from an application of theorem 3.3 (iv).

Part (i) of theorem 3.6 similarly follows from verifying assumption 3.1 with s(0,1/3), Q from theorem 3.2, η=s/2 and for small enough β<c1 (with c1 determined by lemma 4.9 (i)), and subsequently applying theorem 3.2 (iv).

(ii) . Proofs for MALA

Theorem 3.7 is proved by verifying the hypotheses of theorems 3.2 and 3.3, respectively. A key difference between pCN and MALA is that for MALA the proposal kernel itself, and not just the acceptance probability, depends on the data Z^{(N)}. Again, we begin by examining part (ii), which concerns N(0, Σ_α) priors.

Proof of theorem 3.7, part (ii). —

We begin by deriving a bound for the gradient ∇log π(·|Z^{(N)}). For Lebesgue-a.e. θ ∈ R^D, recalling that vol(X) = 1, we have that

E_0^N[∇ℓ_N(θ)] = −(N/2)(w²)′(||θ||) (θ/||θ||) ||g||²_{L²}

and

∇ℓ_N(θ) = Σ_{i=1}^N (ε_i − w(||θ||)g(X_i)) w′(||θ||) (θ/||θ||) g(X_i) = w′(||θ||) (θ/||θ||) Σ_{i=1}^N ε_i g(X_i) − w(||θ||) w′(||θ||) (θ/||θ||) Σ_{i=1}^N g²(X_i).

For any r ∈ (0, t/2) ∪ (t/2, t) ∪ (t, L) ∪ (L, ∞), recalling the choices for T, t, ρ in (4.26), we see that

w(r) w′(r) ≲ 1 + T ||w||_∞ ≲ 1 + N^{b}, (4.32)

where we used that ||w||_∞ = O(1) for our choice of constants and that T = T_bN^{b}. Similarly, we have

||w′||_∞ ≲ T + ρ ≲ N^{b}.

Combining the above and using Chebyshev’s inequality, it follows that

sup_{θ∈R^D} ||∇ℓ_N(θ)||_{R^D} ≲ N^{b} (|Σ_{i=1}^N ε_i g(X_i)| + Σ_{i=1}^N g²(X_i)) = N^{b}(O_P(√N) + O(N)) = O_P(N^{1+b}).

Thus, the event

A := {sup_{θ∈R^D} ||∇ℓ_N(θ)||_{R^D} ≤ C N^{1+b}},

for some large enough C>0, has probability P0N(A)1 as N. We also verify that

∇log π(θ) = ∇(−(1/2) θᵀ Σ_α^{−1} θ) = −Σ_α^{−1} θ, (4.33)

so that with Q = Q_N = C√N (for C as in theorem 3.3) and recalling that Σ_α = diag(1, 2^{−2α}, …, D^{−2α}), we obtain

sup_{||θ||≤Q} ||∇log π(θ)||_{R^D} = sup_{||θ||≤Q} ||Σ_α^{−1} θ||_{R^D} ≤ D^{2α} √N ≲ N^{2α+1}.

Now, let s_b also be as in theorem 3.3 and set η = η_N = (1/2)s_bN^{−b} (note that this is a permissible choice in theorem 3.3). Furthermore, for a small enough constant c > 0, let γ ≤ cN^{−1−2α−b}. Then, since α > b, we also have that

γ ≲ min{N^{−1−2α−b}, N^{−1−2b}, N^{−1/2−b}}. (4.34)

Hence, on the event A and whenever ||θ||_{R^D} ≤ Q,

γ ||∇log π(ϑ_k|Z^{(N)})||_{R^D} ≲ γ (N^{1+b} + N^{1+2α}) ≤ η.

Using this, (4.34) and choosing c>0 small enough, conditional on the event A the probability Pr() under the Markov chain satisfies

Pr(||p_{k+1} − ϑ_k|| > η/2) ≤ Pr(γ ||∇log π(ϑ_k|Z^{(N)})||_{R^D} > η/4) + Pr(√(2γ) ||ξ_{k+1}||_{R^D} > η/4) ≤ Pr(||ξ_{k+1}||_{R^D} > η/(4√(2γ))) ≤ Pr(||ξ_{k+1}||_{R^D} − E||ξ_{k+1}||_{R^D} > √N) ≤ exp(−N/2),

where the last inequality is proved as in (4.31) above, using theorem 2.5.8 in [61]. Thus, assumption 3.1 is satisfied with c0=1 and the proof is complete.

Proof of theorem 3.7, part (i). —

The proof of part (i) proceeds along the same lines, except that (4.32) and (4.33) are replaced with the bound

||w w′||_∞ + ||w′||_∞ ≤ C,

for some constant C independent of N, as well as the bound

∇log π(θ) = ∇(−(D/2)||θ||²) = −Dθ, sup_{||θ||≤Q} ||∇log π(θ)||_{R^D} ≤ DQ ≲ NQ.

Then letting s(0,1/3) and Q>0 be as in theorem 3.2, and fixing an arbitrary η(0,s/2), the above implies that for sufficiently small constant c>0 and for any γc/N, it holds that

Pr(||p_{k+1} − ϑ_k|| > η/2) ≤ Pr(γ ||∇log π(ϑ_k|Z^{(N)})||_{R^D} > η/4) + Pr(√(2γ) ||ξ_{k+1}|| > η/4) ≤ Pr(||ξ_{k+1}|| > η/(4√(2γ))) ≤ Pr(||ξ_{k+1}|| > η√(D/κ)/(4√(2c))).

Thus, choosing c>0 small enough and arguing exactly as in the last step of the proof of theorem 3.6, part (i), assumption 3.1 is satisfied with c0=1 and the proof is complete.

Acknowledgements

R.N. would like to thank the Forschungsinstitut für Mathematik (FIM) at ETH Zürich for their hospitality during a sabbatical visit in spring 2022 where this research was initiated.

Appendix A. Proofs of §2

Proof of corollary 2.4. —

We fix K = 1 and place ourselves on the event of proposition 2.3, and we denote s = s(λ) and t = t(λ). Since Π[S_s|Y] = 1 − Π[T_s|Y], we can decompose:

Π(T_s|Y) = [1 + Π(S_s|Y)/Π(T_s|Y)]^{−1}.

Moreover, Π(S_s|Y)/Π(T_s|Y) = Π(S_s|Y)/[Π(T_t|Y) + Π(W_{s,t}|Y)] ≤ Π(S_s|Y)/Π(T_t|Y). Using proposition 2.3, for n ≥ n_0(λ, Y) we have Π(S_s|Y)/Π(T_s|Y) ≤ exp{−n}. Therefore, Π[T_s|Y] ≥ (1 + exp{−n})^{−1}, which ends the proof.

Proof. —

The rest of this section is devoted to proving proposition 2.3. We use a uniform bound on the injective norm of Gaussian tensors:

Lemma A.1 —

For all $p\ge 3$ there exists a constant $C_p$ such that:

$$\limsup_{n\to\infty}\Big\{n^{-1/2}\max_{x\in S^{n-1}}|\langle x^{\otimes p}, Z\rangle|\Big\} \le C_p, \quad \text{almost surely.} \qquad (A\,1)$$

This lemma is a very crude version of much finer results: in particular, the exact value of the constant $\mu_p$ such that (w.h.p.) $\max_{x\in S^{n-1}}|\langle x^{\otimes p}, Z\rangle| = \sqrt{n}\,\mu_p(1+o_n(1))$ was first computed non-rigorously in [62], and proven in full generality in [63] (see also discussions in [33,34]). In the rest of this proof, we condition on the almost-sure event of equation (A 1). For any $0\le s<t\le 1$, we have for $n\ge n_0(Y)$:

$$\frac{\Pi(S_s|Y)}{\Pi(T_t|Y)} = \frac{\int_{S_s}\exp(Y(x))\,d\Pi(x)}{\int_{T_t}\exp(Y(x))\,d\Pi(x)} \le e^{n\lambda C_p}\,\frac{\int_{S_s}\exp((n/2)\lambda^2\langle x,x_0\rangle^p)\,d\Pi(x)}{\int_{T_t}\exp((n/2)\lambda^2\langle x,x_0\rangle^p)\,d\Pi(x)} \le \exp\Big(n\lambda C_p + \frac{n\lambda^2}{2}\big[s^p - t^p\big]\Big)\,\frac{\Pi(S_s)}{\Pi(T_t)}. \qquad (A\,2)$$

We upper bound $\Pi(S_s) \le \Pi(S^{n-1}) = 1$. To lower bound $\Pi(T_t)$, we use the elementary fact (which is easy to prove using spherical coordinates):

$$\Pi(T_t) = c_p\, I_{(1-t)/2}\Big(\frac{n-1}{2}, \frac{n-1}{2}\Big), \qquad (A\,3)$$

in which $I_x(a,b) = \int_0^x u^{a-1}(1-u)^{b-1}\,du \,\big/ \int_0^1 u^{a-1}(1-u)^{b-1}\,du$ is the regularized incomplete beta function, and $c_p=1$ for odd $p$ and $c_p=2$ for even $p$. It is then elementary analysis (cf. e.g. [34]) that

$$\lim_{n\to\infty}\frac{1}{n}\log\Pi(T_t) = \frac{1}{2}\log(1-t^2), \qquad (A\,4)$$

uniformly in $t\in[0,1)$. Coming back to equation (A 2), this implies that we have, for any $s<t<1$:

$$\limsup_{n\to\infty}\frac{1}{n}\log\frac{\Pi(S_s|Y)}{\Pi(T_t|Y)} \le \lambda C_p + \frac{\lambda^2}{2}\big[s^p - t^p\big] - \frac{1}{2}\log(1-t^2). \qquad (A\,5)$$

Let $K>0$. It is then elementary to see that it is possible to construct $0\le s(\lambda) < t(\lambda) < 1$ with $\lim_{\lambda\to\infty}s(\lambda) = \lim_{\lambda\to\infty}t(\lambda) = 1$, and such that the right-hand side of equation (A 5) becomes smaller than $-K$ as $\lambda\to\infty$.
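For concreteness, one admissible (non-optimized) choice, stated here as an illustration rather than as the construction of the original argument, is the following:

```latex
% Take, for \lambda large,
t(\lambda) = 1 - \frac{1}{\lambda}, \qquad
s(\lambda) = 1 - \frac{1}{\lambda}\Bigl(1 + \frac{4C_p}{p}\Bigr),
% so that both tend to 1 and, since (d/dx)\,x^p = p at x = 1,
t(\lambda)^p - s(\lambda)^p = \frac{4C_p}{\lambda}\bigl(1 + o_\lambda(1)\bigr).
% Using 1 - t(\lambda)^2 = (2\lambda-1)/\lambda^2, the right-hand side of (A 5) behaves as
\lambda C_p - 2\lambda C_p\bigl(1 + o_\lambda(1)\bigr)
  + \frac{1}{2}\log\frac{\lambda^2}{2\lambda - 1}
  = -\lambda C_p\bigl(1 + o_\lambda(1)\bigr) \longrightarrow -\infty,
% so it is eventually below -K.
```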

Appendix B. Small ball estimates for isotropic Gaussians

Let $\Pi=N(0,I_D/D)$. In this section, we prove equation (4.30); more precisely, we show:

Lemma B.1. —

Let $a\in(0,1)$. Then for all $D \ge D_0(a)$ large enough, one has for all $z\in(0,1-a)$:

$$\frac{1}{D}\log\Pi(\|\theta\|_2 \le z) \le -\frac{1}{2}\Big(\frac{z^2}{2} - \log z - \frac{1}{2}\Big). \qquad (B\,1)$$

Proof of lemma B.1. —

Let $f(x)=-x^2/2+\log x+1/2$, so that $f$ reaches its maximum at $x=1$, with $f(1)=0$. By decomposition into spherical coordinates and isotropy of the Gaussian measure, one has directly:

$$\Pi(\|\theta\|_2 \le z) = \frac{\mathrm{vol}(S^{D-1})}{(2\pi/D)^{D/2}}\int_0^z dr\; e^{-(Dr^2/2) + (D-1)\log r}. \qquad (B\,2)$$

Recall that $\mathrm{vol}(S^{D-1})=2\pi^{D/2}/\Gamma(D/2)$, so one reaches easily:

$$c_D = \frac{1}{D}\log\frac{\mathrm{vol}(S^{D-1})}{(2\pi/D)^{D/2}} - \frac{1}{2} = \frac{\log D}{2D} + O\Big(\frac{1}{D}\Big). \qquad (B\,3)$$

In particular, one has for all D large enough (not depending on z):

$$\frac{1}{D}\log\Pi(\|\theta\|_2 \le z) \le \frac{1}{D}\log\int_0^z dr\; e^{-(r^2/2) + (D-1)f(r)} + c_D. \qquad (B\,4)$$

Since f is increasing on (0,1), we have for large enough D:

$$\frac{1}{D}\log\Pi(\|\theta\|_2 \le z) \le \Big(1-\frac{1}{D}\Big)f(z) + c_D + \frac{1}{D}\log\int_0^\infty dr\; e^{-r^2/2} \qquad (B\,5)$$
$$\le \Big(1-\frac{1}{D}\Big)f(z) + \frac{\log D}{D}. \qquad (B\,6)$$

Since $f(1-a)<0$, let $D \ge D_0(a)$ be large enough such that $f(1-a) \le -2\log D/(D-2)$. Then for all $z\le 1-a$, one has $f(z) \le -2\log D/(D-2)$. Plugging this into the inequality above, we reach that for all $z\in(0,1-a)$:

$$\frac{1}{D}\log\Pi(\|\theta\|_2 \le z) \le \frac{1}{2}f(z). \qquad (B\,7)$$
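Lemma B.1 can be sanity-checked numerically: under $\Pi = N(0, I_D/D)$, $\|\theta\|^2$ is distributed as $\chi^2_D/D$, so $\Pi(\|\theta\|_2 \le z) = \Pr(\chi^2_D \le Dz^2)$. A rough sketch with hypothetical values $D=500$, $z=0.8$, evaluating the chi-square lower tail by a log-space Riemann sum:

```python
import math

def log_chi2_cdf(D, x_max, m=40000):
    """Crude log of Pr(chi^2_D <= x_max), via a Riemann sum in log-space."""
    du = x_max / m
    # log chi^2_D density at u: (D/2-1)log u - u/2 - (D/2)log 2 - log Gamma(D/2)
    logs = [(D / 2 - 1) * math.log(i * du) - (i * du) / 2
            - (D / 2) * math.log(2) - math.lgamma(D / 2)
            for i in range(1, m + 1)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs)) + math.log(du)

def f(x):
    # the function of lemma B.1, maximized at x = 1 with f(1) = 0
    return -x ** 2 / 2 + math.log(x) + 0.5

D, z = 500, 0.8
# {||theta||_2 <= z} under N(0, I_D/D) is the event {chi^2_D <= D z^2}
r = log_chi2_cdf(D, D * z ** 2) / D
print(r, 0.5 * f(z))  # r lies below the bound (1/2) f(z) < 0 from (B 7)
```

The lemma is indeed crude: the true exponential rate of this lower tail is $f(z)$ itself, twice the bound $(1/2)f(z)$, which is ample for the purposes of (4.30).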

Footnotes

1

As is classical in statistical physics, we call the negative of the free energy the free entropy.

2

In a physical system, these regions would correspond respectively to a region including a meta-stable state, a region including the globally stable state and a free energy barrier.

3

Note that we assume here that the statistician has access to the distribution $\Pi(\cdot|Y)$ (and in particular to $\lambda$), a setting sometimes called Bayes-optimal in the literature.

4

Often the $o_n(1)$ term will be exponentially small, but we will not require such strong control.

5

As we will detail in the following sections (see assumption 3.1), the statements remain true if the change is allowed to exceed the required maximum with exponentially small probability.

Data accessibility

This article has no additional data.

Authors' contributions

R.N.: conceptualization, writing—original draft; A.S.B.: conceptualization, writing—original draft; A.M.: conceptualization, writing—original draft; S.W.: conceptualization, writing—original draft.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

R.N. was supported by the EPSRC programme grant on the Mathematics of Deep Learning, project EP/V026259.

References

1. Anderson PW. 1989. Spin glass VI: spin glass as cornucopia. Phys. Today 42, 9.
2. Mézard M, Montanari A. 2009. Information, physics, and computation. Oxford, UK: Oxford University Press.
3. Zdeborová L, Krzakala F. 2016. Statistical physics of inference: thresholds and algorithms. Adv. Phys. 65, 453-552.
4. Ben Arous G, Gheissari R, Jagannath A. 2020. Algorithmic thresholds for tensor PCA. Ann. Probab. 48, 2052-2087. (doi:10.1214/19-AOP1415)
5. Ben Arous G, Wein AS, Zadik I. 2020. Free energy wells and overlap gap property in sparse PCA. In Conf. on Learning Theory, Graz, Austria, 9–12 July 2020, pp. 479–482. PMLR.
6. Gibbs JW. 1873. A method of geometrical representation of the thermodynamic properties of substances by means of surfaces. Trans. Conn. Acad. Arts Sci. 2, 382–404.
7. Bandeira AS, Alaoui AE, Hopkins SB, Schramm T, Wein AS, Zadik I. 2022. The Franz-Parisi criterion and computational trade-offs in high dimensional statistics. In Advances in Neural Information Processing Systems (eds AH Oh, A Agarwal, D Belgrave, K Cho). See https://openreview.net/forum?id=mzze3bubjk.
8. Jerrum M. 2003. Counting, sampling and integrating: algorithms and complexity. Lectures in Mathematics ETH Zürich. Basel, Switzerland: Birkhäuser Verlag.
9. Kunisky D, Wein AS, Bandeira AS. 2022. Notes on computational hardness of hypothesis testing: predictions using the low-degree likelihood ratio. In ISAAC Congress (International Society for Analysis, its Applications and Computation), Aveiro, Portugal, 29 July–2 Aug 2019, pp. 1–50. Springer.
10. Dalalyan AS. 2017. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B Stat. Methodol. 79, 651-676. (doi:10.1111/rssb.12183)
11. Durmus A, Moulines E. 2019. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25, 2854-2882. (doi:10.3150/18-BEJ1073)
12. Nickl R, Wang S. 2020. On polynomial-time computation of high-dimensional posterior measures by Langevin-type algorithms. J. Eur. Math. Soc.
13. Bohr J, Nickl R. 2021. On log-concave approximations of high-dimensional posterior measures and stability properties in non-linear inverse problems. (http://arxiv.org/abs/2105.07835)
14. Nickl R. 2022. Bayesian non-linear statistical inverse problems. ETH Zurich Lecture Notes.
15. Altmeyer R. 2022. Polynomial time guarantees for sampling based posterior inference in high-dimensional generalised linear models. (http://arxiv.org/abs/2208.13296)
16. Nickl R. 2020. Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger equation. J. Eur. Math. Soc. 22, 2697-2750. (doi:10.4171/JEMS/975)
17. Monard F, Nickl R, Paternain GP. 2019. Efficient nonparametric Bayesian inference for X-ray transforms. Ann. Stat. 47, 1113-1147. (doi:10.1214/18-AOS1708)
18. Monard F, Nickl R, Paternain GP. 2021. Statistical guarantees for Bayesian uncertainty quantification in nonlinear inverse problems with Gaussian process priors. Ann. Stat. 49, 3255-3298. (doi:10.1214/21-AOS2082)
19. Monard F, Nickl R, Paternain GP. 2021. Consistent inversion of noisy non-Abelian X-ray transforms. Commun. Pure Appl. Math. 74, 1045-1099. (doi:10.1002/cpa.21942)
20. Stuart AM. 2010. Inverse problems: a Bayesian perspective. Acta Numer. 19, 451-559. (doi:10.1017/S0962492910000061)
21. Fearnhead P, Bierkens J, Pollock M, Roberts GO. 2018. Piecewise deterministic Markov processes for continuous-time Monte Carlo. Stat. Sci. 33, 386-412. (doi:10.1214/18-STS648)
22. Bouchard-Côté A, Vollmer SJ, Doucet A. 2018. The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. 113, 855-867.
23. Bierkens J, Grazzi S, Kamatani K, Roberts G. 2020. The boomerang sampler. In Int. Conf. on Machine Learning, Online, 12–18 July 2020, pp. 908–918. PMLR.
24. Wu C, Robert CP. 2020. Coordinate sampler: a non-reversible Gibbs-like MCMC sampler. Stat. Comput. 30, 721-730. (doi:10.1007/s11222-019-09913-w)
25. Scalliet C, Guiselin B, Berthier L. 2022. Thirty milliseconds in the life of a supercooled liquid. Phys. Rev. X 12, 041028. (doi:10.1103/PhysRevX.12.041028)
26. Grigera TS, Parisi G. 2001. Fast Monte Carlo algorithm for supercooled soft spheres. Phys. Rev. E 63, 045102. (doi:10.1103/PhysRevE.63.045102)
27. Jerrum M. 1992. Large cliques elude the Metropolis process. Random Struct. Algorithms 3, 347-359. (doi:10.1002/rsa.3240030402)
28. Gamarnik D, Zadik I. 2019. The landscape of the planted clique problem: dense subgraphs and the overlap gap property. (http://arxiv.org/abs/1904.07174)
29. Angelini MC, Fachin P, de Feo S. 2021. Mismatching as a tool to enhance algorithmic performances of Monte Carlo methods for the planted clique model. J. Stat. Mech: Theory Exp. 2021, 113406. (doi:10.1088/1742-5468/ac3657)
30. Chen Z, Mossel E, Zadik I. 2023. Almost-linear planted cliques elude the Metropolis process. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Florence, Italy, 22–25 January 2023, pp. 4504–4539. Philadelphia, PA: SIAM. (doi:10.1137/1.9781611977554.ch171)
31. Hukushima K, Nemoto K. 1996. Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Jpn. 65, 1604-1608. (doi:10.1143/JPSJ.65.1604)
32. Angelini MC. 2018. Parallel tempering for the planted clique problem. J. Stat. Mech: Theory Exp. 2018, 073404.
33. Richard E, Montanari A. 2014. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems 27, Montreal, Canada, 8–13 Dec 2014. Red Hook, NY: Curran Associates.
34. Perry A, Wein AS, Bandeira AS. 2020. Statistical limits of spiked tensor models. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 56, 230–264.
35. Lesieur T, Miolane L, Lelarge M, Krzakala F, Zdeborová L. 2017. Statistical and computational phase transitions in spiked tensor estimation. In 2017 IEEE Int. Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017, pp. 511–515. New York, NY: IEEE.
36. Jagannath A, Lopatto P, Miolane L. 2020. Statistical thresholds for tensor PCA. Ann. Appl. Probab. 30, 1910-1933. (doi:10.1214/19-AAP1547)
37. Baik J, Ben Arous G, Péché S. 2005. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33, 1643-1697.
38. Wein AS, El Alaoui A, Moore C. 2019. The Kikuchi hierarchy and tensor PCA. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), Baltimore, MD, 9–12 November 2019, pp. 1446–1468. New York, NY: IEEE.
39. Hopkins SB, Shi J, Steurer D. 2015. Tensor principal component analysis via sum-of-square proofs. In Conf. on Learning Theory, Paris, France, 3–6 July 2015, pp. 956–1006. PMLR.
40. Hopkins SB, Schramm T, Shi J, Steurer D. 2016. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proc. of the Forty-Eighth Annual ACM Symposium on Theory of Computing, Cambridge, MA, 19–21 June 2016, pp. 178–191. New York, NY: ACM.
41. Kim C, Bandeira AS, Goemans MX. 2017. Community detection in hypergraphs, spiked tensor models, and sum-of-squares. In 2017 International Conference on Sampling Theory and Applications (SampTA), Bordeaux, France, 8–12 July 2017, pp. 124–128. New York, NY: IEEE.
42. Sarao Mannelli S, Biroli G, Cammarota C, Krzakala F, Zdeborová L. 2019. Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. In Advances in Neural Information Processing Systems 32, Vancouver, Canada, 8–14 Dec 2019. Red Hook, NY: Curran Associates.
43. Sarao Mannelli S, Krzakala F, Urbani P, Zdeborová L. 2019. Passed & spurious: descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pp. 4333-4342. PMLR.
44. Biroli G, Cammarota C, Ricci-Tersenghi F. 2020. How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor PCA. J. Phys. A: Math. Theor. 53, 174003. (doi:10.1088/1751-8121/ab7b1f)
45. Ben Arous G, Gheissari R, Jagannath A. 2020. Bounding flows for spherical spin glass dynamics. Commun. Math. Phys. 373, 1011-1048. (doi:10.1007/s00220-019-03649-4)
46. Ben Arous G, Gheissari R, Jagannath A. 2021. Online stochastic gradient descent on non-convex losses from high-dimensional inference. J. Mach. Learn. Res. 22, paper no. 106.
47. Rasmussen CE, Williams CKI. 2006. Gaussian processes for machine learning. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press.
48. Ghosal S, van der Vaart AW. 2017. Fundamentals of nonparametric Bayesian inference. New York, NY: Cambridge University Press.
49. van der Vaart AW, van Zanten JH. 2008. Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Stat. 36, 1435-1463. (doi:10.1214/009053607000000613)
50. Cotter SL, Roberts GO, Stuart AM, White D. 2013. MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28, 424-446. (doi:10.1214/13-STS421)
51. Beskos A, Girolami M, Lan S, Farrell PE, Stuart AM. 2017. Geometric MCMC for infinite-dimensional inverse problems. J. Comput. Phys. 335, 327-351. (doi:10.1016/j.jcp.2016.12.041)
52. Hairer M, Stuart AM, Vollmer SJ. 2014. Spectral gaps for a Metropolis-Hastings algorithm in infinite dimensions. Ann. Appl. Probab. 24, 2455-2490. (doi:10.1214/13-AAP982)
53. Hairer M, Mattingly J, Scheutzow M. 2011. Asymptotic coupling and a general form of Harris' theorem with applications to stochastic delay equations. Probab. Theory Relat. Fields 149, 223-259. (doi:10.1007/s00440-009-0250-6)
54. Chewi S, Lu C, Ahn K, Cheng X, Le Gouic T, Rigollet P. 2021. Optimal dimension dependence of the Metropolis-adjusted Langevin algorithm. In Conf. on Learning Theory, Boulder, CO, 15–19 August 2021. PMLR.
55. Roberts GO, Rosenthal JS. 2001. Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 16, 351-367. (doi:10.1214/ss/1015346320)
56. Breyer LA, Piccioni M, Scarlatti S. 2004. Optimal scaling of MaLa for nonlinear regression. Ann. Appl. Probab. 14, 1479-1505. (doi:10.1214/105051604000000369)
57. Mattingly JC, Pillai NS, Stuart AM. 2012. Diffusion limits of the random walk Metropolis algorithm in high dimensions. Ann. Appl. Probab. 22, 881-930. (doi:10.1214/10-AAP754)
58. Franz S, Parisi G. 1995. Recipes for metastable states in spin glasses. J. Phys. I 5, 1401-1415.
59. Edmunds DE, Triebel H. 1996. Function spaces, entropy numbers, differential operators. Cambridge Tracts in Mathematics, vol. 120. Cambridge, UK: Cambridge University Press.
60. Li WV, Linde W. 1999. Approximation, metric entropy and small ball estimates for Gaussian measures. Ann. Probab. 27, 1556-1578. (doi:10.1214/aop/1022677459)
61. Giné E, Nickl R. 2016. Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics. New York, NY: Cambridge University Press.
62. Crisanti A, Sommers HJ. 1992. The spherical p-spin interaction spin glass model: the statics. Zeitschrift für Physik B Condensed Matter 87, 341-354. (doi:10.1007/BF01309287)
63. Subag E. 2017. The complexity of spherical p-spin models – a second moment approach. Ann. Probab. 45, 3385-3450. (doi:10.1214/16-AOP1139)



Articles from Philosophical transactions. Series A, Mathematical, physical, and engineering sciences are provided here courtesy of The Royal Society
