On free energy barriers in Gaussian priors and failure of cold start MCMC for high-dimensional unimodal distributions

Afonso S Bandeira; Antoine Maillard; Richard Nickl; Sven Wang

doi:10.1098/rsta.2022.0150

. 2023 Mar 27;381(2247):20220150. doi: 10.1098/rsta.2022.0150

On free energy barriers in Gaussian priors and failure of cold start MCMC for high-dimensional unimodal distributions

Afonso S Bandeira ¹, Antoine Maillard ¹, Richard Nickl ^2,^✉, Sven Wang ³

PMCID: PMC10041355 PMID: 36970818

Abstract

We exhibit examples of high-dimensional unimodal posterior distributions arising in nonlinear regression models with Gaussian process priors for which Markov chain Monte Carlo (MCMC) methods can take an exponential run-time to enter the regions where the bulk of the posterior measure concentrates. Our results apply to worst-case initialized (‘cold start’) algorithms that are local in the sense that their step sizes cannot be too large on average. The counter-examples hold for general MCMC schemes based on gradient or random walk steps, and the theory is illustrated for Metropolis–Hastings adjusted methods such as preconditioned Crank–Nicolson and Metropolis-adjusted Langevin algorithm.

This article is part of the theme issue ‘Bayesian inference: challenges, perspectives, and prospects’.

Keywords: MCMC, Bayesian inference, Gaussian processes, computational hardness

1. Introduction

Markov chain Monte Carlo (MCMC) methods are the workhorse of Bayesian computation when closed formulae for estimators or probability distributions are not available. For this reason they have been central to the development and success of high-dimensional Bayesian statistics in the last decades, where one attempts to generate samples from some posterior distribution $Π (\cdot | data)$ arising from a prior $Π$ on $D$ -dimensional Euclidean space and the observed data vector. MCMC methods tend to perform well in a large variety of problems, are very flexible and user-friendly, and enjoy many theoretical guarantees. Under mild assumptions, they are known to converge to their stationary ‘target’ distributions as a consequence of the ergodic theorem, albeit perhaps at a slow speed, requiring a large number of iterations to provide numerically accurate algorithms. When the target distribution is log-concave, MCMC algorithms are known to mix rapidly, even in high dimensions. But for general $D$ -dimensional densities, we have only a restricted understanding of the scaling of the mixing time of Markov chains with $D$ or with the ‘informativeness’ (sample size or noise level) of the data vector.

A classical source of difficulty for MCMC algorithms are multi-modal distributions. When there is a deep well in the posterior density between the starting point of an MCMC algorithm and the location where the posterior is concentrated, many MCMC algorithms are known to take an exponential time—proportional to the depth of the well—when attempting to reach the target region, even in low-dimensional settings, see figure 1a and also the discussion surrounding proposition 4.2 below. However, for distributions with a single mode and when the dimension $D$ is fixed, MCMC methods can usually be expected to perform well.

Figure 1. — Two possible sources of MCMC hardness in high dimensions: multi-modal likelihoods and entropic barriers. (a) In low dimensions (here $D = 1$ ), MCMC hardness usually arises because of a non-unimodal likelihood, creating an ‘energy barrier’, even though the maximum likelihood is attained at $θ = θ_{0}$ . The MCMC algorithm is assumed to be initialized in the set $S$ containing a local maximum of the likelihood. (b) Illustration of the arising of entropic (or volumetric) difficulties, here in dimension $D = 3$ : the set of points close to $θ_{0}$ has much less volume than the set of points far away. As $D$ increases, this phenomenon is amplified: all ratios of volumes of the three sets $T, W, S$ scale exponentially with $D$ . (Online version in colour.)

In essence this article is an attempt to explain how, in high dimensions, wells can be formed without multi-modality of a given posterior distribution. The difficulty in this case is volumetric, also referred to as entropic: while the target region contains most of the posterior mass, its (prior) volume is so small compared to the rest of the space that an MCMC algorithm may take an exponential time to find it, see figure 1b. This competition between ‘energy’—here represented by the log-likelihood $ℓ_{N}$ in the posterior distribution $d Π (\cdot | data) = \exp {ℓ_{N} + \log d π}$ —and ‘entropy’ (related to the prior term $π$ ) has also been exploited in recent work on statistical aspects of MCMC in various high dimensional inference and statistical physics models [1–5]. These ideas somewhat date back to the nineteenth century foundations of statistical mechanics [6] and the notion of free energy, consisting of a sum of energetic and entropic contributions which the system spontaneously attempts to minimize. The ‘MCMC-hardness’ phenomenon described above is then akin to the meta-stable behaviour of thermodynamical systems, such as glasses or supercooled liquids. As the temperature decreases, such systems can undergo a ‘first-order’ phase transition, in which a global free energy minimum (analoguous to the target region above) abruptly appears, while the system remains trapped in a suboptimal local minimum of the free energy (the starting region of the MCMC algorithm). For the system to go to thermodynamic equilibrium it must cross an extensive free energy barrier: such a crossing requires an exponentially long time, so that the system appears equilibrated on all relevant timescales, similarly to the MCMC stuck in the starting region. Classical examples include glasses and the popular experiment of rapid freezing of supercooled water (i.e. water that remained liquid at negative temperatures) after introducing a perturbation.

Inspired by recent work [4,5,7], let us illustrate some of the volumetric phenomena which are key to our results below. We separate the parameter space into three regions (see figures 1 and 2), which we name by common MCMC terminology. Firstly a starting (or initialization) region $S$ , where an algorithm starts, secondly a target region $T$ where both the bulk of the posterior mass and the ground truth are situated, and thirdly an intermediate free-entropy well¹ $W$ that separates $S$ from $T$ .² In our theorems, these regions will be characterized by their Euclidean distance to the ground truth parameter $θ_{0}$ generating the data. The prior volumes of the $ϵ$ -annuli ${θ : r - ϵ < | | θ - θ_{0} | |_{2} \leq r}, r > 0,$ closer to the ground truth are smaller than those further out as illustrated in figure 1b, and in high dimensions this effect becomes quantitative in an essential way. Specifically, the trade-off between the entropic and energetic terms can happen such that the following three statements are simultaneously true.

(i)
$T$ contains ‘almost all’ of the posterior mass.
(ii)
As one gets closer to $T$ (and thus the ground truth $θ_{0}$ ), the log-likelihood is strictly monotonically increasing.
(iii)
Yet $S$ still possesses exponentially more posterior mass than $W$ .

Using ‘bottleneck’ arguments from Markov chain theory (ch. 7 in [8]), this means that an MCMC algorithm that starts in $S$ is expected to take an exponential time to visit $W$ . If the step size is such that it cannot ‘jump over’ $W$ , this also implies an exponential hitting time lower bound for reaching $T$ . This is illustrated in figure 2 for an averaged version of the model described in §2.

In the situation described above, the MCMC iterates never visit the region where the posterior is statistically informative, and hence yield no better inference than a random number generator. One could regard this as a ‘hardness’ result about computation of posterior distributions in high dimensions by MCMC. In this work we show that such situations can occur generically and establish hitting time lower bounds for common gradient or random walk based MCMC schemes in model problems with nonlinear regression and Gaussian process priors. Before doing this, we briefly review some important results of Ben Arous et al. [4] for the problem of principle component analysis (PCA) in tensor models, from which the inspiration for our work was drawn. This technique to establish lower bounds for MCMC algorithms has also recently been leveraged in [5] in the context of sparse PCA, and in [7] to establish connections between MCMC lower bounds and the Low Degree Method for algorithmic hardness predictions (see [9] for an expository note on this technique).

When the target distribution is globally log-concave, pictures such as in figure 2 are ruled out (see also remark 4.7) and polynomial-time mixing bounds have been shown for a variety of commonly used MCMC methods. While an exhaustive discussion would be beyond the scope of this paper, we mention here the seminal works [10,11] which were among the first to demonstrate high-dimensional mixing of discretized Langevin methods (even upon ‘cold-start’ initializations like the ones assumed in the present paper). In concrete nonlinear regression models, polynomial-time computation guarantees were given in [12] under a general ‘gradient stability’ condition on the regression map which guarantees that the posterior is (with high probability) locally log-concave on a large enough region including $θ_{0}$ . While this condition can be expected to hold under natural injectivity hypotheses and was verified for an inverse problem with the Schrödinger equation in [12], for non-Abelian X-ray transforms in [13], the ‘Darcy flow’ model involving elliptic partial differential equations (PDE) in [14] and for generalized linear models in [15], all these results hinge on the existence of a suitable initializer of the gradient MCMC scheme used. These results form part of a larger research programme [14,16–19] on algorithmic and statistical guarantees for Bayesian inversion methods [20] applied to problems with partial differential equations. The present article shows that the hypothesis of existence of a suitable initializer is—at least in principle—essential in these results if $D / N \to κ > 0$ , and that at most ‘moderately’ high-dimensional ( $D = O (N)$ ) MCMC implementations of Gaussian process priors may be preferable to bypass computational bottlenecks.

Our negative results apply to (worst-case initialized) Markov chains whose step sizes cannot be too large with high probability. As we show this includes many commonly used algorithms (such as preconditioned Crank–Nicolson (pCN) and Metropolis-adjusted Langevin algorithm (MALA)) whose dynamics are of a ‘local’ nature. There are a variety of MCMC methods developed recently, such as piece-wise deterministic Markov processes, boomerang or zig-zag samplers [21–24] which may not fall into our framework. While we are not aware of any rigorous results that would establish polynomial hitting or mixing times of these algorithms for high-dimensional posterior distributions such as those exhibited here, it is of great interest to study whether our computational hardness barriers can be overcome by ‘non-local’ methods. There is some empirical evidence that this may be possible. For instance, in the numerical simulation of models of supercooled liquids [25], methods such as swap Monte Carlo [26] have been observed to equilibrate to low-temperature distributions which were not reachable by local approaches. Another example is given by the planted clique problem [27]: this model is conjectured to possess a large algorithmically hard phase, and local Monte Carlo methods are known to fail far from the conjectured algorithmic threshold [28–30]. On the other hand, non-local exchange Monte Carlo methods (such as parallel tempering [31]) have been numerically observed to perform significantly better [32].

2. The spiked tensor model: an illustrative example

In this section, we present (a simplified version of) results obtained mostly in [4]. First some notation. For any $n \geq 1$ , we denote by $S^{n - 1} = {θ \in R^{n} : | | θ | |_{2} = 1}$ the Euclidean unit sphere in $n$ dimensions. For $θ, θ^{'} \in R^{n}$ we denote $θ \otimes θ^{'} = {(θ_{i} θ_{j}^{'})}_{1 \leq i, j \leq n} \in R^{n^{2}}$ their tensor product.

Spiked tensor estimation is a synthetic model to study tensor PCA, and corresponds to a Gaussian Additive Model with a low-rank prior. More formally, it can be defined as follows [33].

Definition 2.1 (Spiked tensor model). —

Let $p \geq 3$ denote the order of the tensor. The observations $Y$ and the parameter $θ$ are generated according to the following joint probability distribution:

$d Q (Y, θ) = \frac{1}{{(2 π)}^{n^{p} / 2}} \exp {- \frac{1}{2} {| | Y - \sqrt{n} λ θ^{\otimes p} | |}_{2}^{2}} d Π (θ) d Y .$ 2.1

Here, $d Y$ denotes the Lebesgue measure on the space ${(R^{n})}^{\otimes p} = R^{n^{p}}$ of $p$ -tensors of size $n$ . $Π$ is the uniform probability measure on $S^{n - 1}$ , and $λ \geq 0$ is the signal-to-noise ratio (SNR) parameter. In particular, the posterior distribution $Π (θ | Y)$ is

$d Π (θ | Y) = \frac{1}{Z_{Y}} \exp (ℓ_{n, Y} (θ)) d Π (θ),$ 2.2

in which $Z_{Y}$ is a normalization, and we defined the log-likelihood (up to additive constants) as

$ℓ_{n, Y} (θ) = \frac{1}{2} \sqrt{n} λ ⟨ θ^{\otimes p}, Y ⟩ .$ 2.3

In the following, we study the model from definition 2.1 via the prism of statistical inference. In particular, we will study the posterior $Π (θ | Y)$ for a fixed³ ‘data tensor’ $Y$ . Since such a tensor was generated according to the marginal of (2.1), we parameterize it as $Y = λ \sqrt{n} θ_{0}^{\otimes p} + Z$ , with $Z$ a $p$ -tensor with i.i.d. $N (0, 1)$ coordinates, and $θ_{0}$ a ‘ground truth’ vector uniformly sampled in $S^{n - 1}$ . The goal of our inference task is to recover information on the low-rank perturbation $θ_{0}^{\otimes p}$ (or equivalently on the vector $θ_{0}$ , possibly up to a global sign depending on the parity of $p$ ) from the posterior distribution $Π (\cdot | Y)$ .

Crucially, we are interested in the limit of the model of definition 2.1 as $n \to \infty$ . In particular, all our statements, although sometimes non-asymptotic, are to be interpreted as $n$ grows. We say that an event occurs ‘with high probability’ (w.h.p.) when its probability is $1 - O_{n} (1)$ .⁴ Moreover, by rotation invariance, all statements are uniform over $θ_{0} \in S^{n - 1}$ , so that said probabilities only refer to the noise tensor $Z$ . Finally, throughout our discussion we will work with latitude intervals (or bands) on the sphere, with the North Pole taken to be $θ_{0}$ . We characterize them using inner products (correlations) $⟨ θ, θ_{0} ⟩$ for odd $p$ , and $| ⟨ θ, θ_{0} ⟩ |$ for even $p$ (since in this case $θ_{0}$ and $- θ_{0}$ are indistinguishable from the point of view of the observer).

Definition 2.2 (Latitude intervals). —

Assume that $p \geq 3$ is even. For $0 \leq s < t \leq 1$ we define:

—
$S_{s} = {θ \in S^{n - 1} : | ⟨ θ, θ_{0} ⟩ | \leq s}$ ,

—
$W_{s, t} = {θ \in S^{n - 1} : s < | ⟨ θ, θ_{0} ⟩ | \leq t}$ ,

—
$T_{t} = {θ \in S^{n - 1} : t < | ⟨ θ, θ_{0} ⟩ |}$ .

If $p$ is odd, we define these sets similarly, replacing $| ⟨ θ, θ_{0} ⟩ |$ by $⟨ θ, θ_{0} ⟩$ .

Note that these sets can also be characterized using the distance to the ground truth, e.g. $S_{s} = {θ \in S^{n - 1} : min {| | θ - θ_{0} | |_{2}^{2}, | | θ + θ_{0} | |_{2}^{2}} | \geq 2 (1 - s)}$ when $p$ is even.

(a) . Posterior contraction

We can use uniform concentration of the likelihood to show that as $λ \to \infty$ (after taking the limit $n \to \infty$ ) the posterior contracts in a region infinitesimally close to the ground truth $θ_{0}$ . We first show that a region arbitrarily close to the ground truth exponentially dominates a very large starting region:

Proposition 2.3. —

For any $K > 0$ there exists $λ_{0} > 0$ and functions ${s (λ), t (λ)} \in [0, 1)$ such that $s (λ) < t (λ)$ , ${s (λ), t (λ)} \to 1$ as $λ \to \infty$ and for all $λ \geq λ_{0}$ :

$\underset{n \to \infty}{lim sup} \frac{1}{n} \log \frac{Π (S_{s (λ)} | Y)}{Π (T_{t (λ)} | Y)} \leq - K, almost surely .$ 2.4

Posterior contraction is the content of the following result:

Corollary 2.4 (Posterior contraction). —

There exists $λ_{0} > 0$ and a function $s (λ) \in [0, 1)$ satisfying $s (λ) \to 1$ as $λ \to \infty$ , such that for all $λ \geq λ_{0}$ :

$lim_{n \to \infty} Π [T_{s (λ)} | Y] = 1, almost surely.$ 2.5

The proofs of proposition 2.3 and corollary 2.4 are given in appendix A.

Remark 2.5 (Suboptimality of uniform bounds). —

Stronger than corollary 2.4, it is known that there exists a sharp threshold $λ^{⋆} (p)$ such that for any $λ > λ^{⋆} (p)$ the posterior mean, as well as the maximum likelihood estimator, sit w.h.p. in $T_{s (λ)}$ , with $s (λ) > 0$ , while such a statement is false for $λ \leq λ^{⋆} (p)$ [34–36]. The $λ_{0}$ given by corollary 2.4 is, on the other hand, clearly not sharp, because of the crude uniform bound used in the proof. This can easily be understood in the $p = 2$ case, corresponding to rank-one matrix estimation: uniform bounds such as the ones used here would show posterior contraction for $λ = ω (1)$ , while it is known through the celebrated BBP transition that the maximum likelihood estimator is already correlated with the signal for any $λ > 1$ [37]. With more refined techniques from the study of random matrices and spin glass theory of statistical physics it is often possible to obtain precise constants for such relevant thresholds.

(b) . Algorithmic bottleneck for MCMC

Simple volume arguments, associated with an ingenious use of Markov’s inequality due to Ben Arous et al. [4] and of the rotation invariance of the noise tensor $Z$ , allow us to get a computational hardness result for MCMC algorithms, even though the posterior contracts infinitesimally close to the ground truth as we saw in corollary 2.4. In the context of the spiked tensor model, these computational hardness results can be found in [4] (see in particular §7). We will state similar results for general nonlinear regression models in §3: in this context we will not need to use the Markov’s inequality-based technique of Ben Arous et al. [4], and will solely rely on concentration arguments.

Recall that by §2(a), we can find $s (λ)$ such that $s (λ) \to 1$ as $λ \to \infty$ and for all $λ$ large enough $Π (T_{s (λ)} | Y) = 1 - O_{n} (1)$ . Here, we show that escaping the ‘initialization’ region of the MCMC algorithm is hard in a large range of $λ$ (possibly diverging with $n$ ). In what follows, the step size of the algorithm denotes the maximal change $| | x^{t + 1} - x^{t} | |_{2}$ allowed in any iteration.⁵ We first state this bottleneck result informally.

Proposition 2.6 (MCMC bottleneck, informal). —

Assume that $λ = O (n^{(p - 2) / 4 + η})$ for all $η > 0$ . Then any MCMC algorithm whose invariant distribution is $Π (\cdot | Y)$ , and with a step size bounded by $δ = O ({[n λ^{2}]}^{- 1 / p})$ , will take an exponential time to get out of the ‘initialization’ region.

Note that the step size condition of proposition 2.6 is always meaningful, since our hypothesis on $λ$ implies ${[n λ^{2}]}^{- 1 / p} = ω (n^{- 1 / 2})$ , and many MCMC algorithms (e.g. any procedure in which a number $O (1)$ of coordinates of the current iterate are changed in a single iteration) will have a step size $O (n^{- 1 / 2})$ .

Remark 2.7. —

The results of Ben Arous et al. [4] are stated when considering for the invariant distribution of the MCMC a more general ‘Gibbs-type’ distribution $G_{β, Y} (d x) \propto e^{β H (x)} d Π (x)$ , with $H (x) = (\sqrt{n} / 2) ⟨ x^{\otimes p}, Y ⟩$ . The case we consider here is the ‘Bayes-optimal’ $β = λ$ , for which $G_{λ, Y} = Π (\cdot | Y)$ . For the general distribution $G_{β, Y}$ the conditions of proposition 2.6 become $β λ = O (n^{(p - 2) / 2 + η})$ and $δ = O [{(n β λ)}^{- 1 / p}]$ . The authors of Ben Arous et al. [4] usually consider $β = O (1)$ , so that they show the bottleneck under the condition $λ = O (n^{(p - 2) / 2 + η})$ .

More generally, $λ ≪ n^{(p - 2) / 4}$ is conjectured to be a regime in which all polynomial-time algorithms fail to recover $θ_{0}$ [33,38–41]. On the other hand, ‘local’ methods (such as gradient-based algorithms [42–46], message-passing iterations [35] or natural MCMC algorithms such as the ones of previous remark) are conjectured or known to fail in the larger range $λ ≪ n^{(p - 2) / 2}$ . Proposition 2.6 shows that ‘Bayes-optimal’ MCMC algorithms fail for $λ ≪ n^{(p - 2) / 4}$ . To the best of our knowledge, analysing this class of algorithms in the regime $n^{(p - 2) / 4} ≪ λ ≪ n^{(p - 2) / 2}$ is still open.

Let us now state formally the key ingredient behind proposition 2.6. It is a rewriting of the ‘free energy wells’ result of Ben Arous et al. [4].

Lemma 2.8 (Bottleneck, formal). —

Assume that $λ = O (n^{(p - 2) / 4 + η})$ for all $η > 0$ , and let $δ = O ({[n λ^{2}]}^{- 1 / p})$ . Let $r (ε) = n^{- 1 / 2 + ε}$ . Then for any $ε > 0$ small enough, there exists $c, C > 0$ such that for large enough $n$ , with probability at least $1 - \exp (- c n^{2 ε})$ we have:

$\frac{Π (S_{r (ε)} | Y)}{Π (W_{r (ε), r (ε) + δ} | Y)} \geq \exp {C n^{2 ε}} .$ 2.6

Note that by simple volume arguments, $Π (S_{r (ε)}) = 1 - O_{n} (1)$ , so that $S_{r (ε)}$ contains ‘almost all’ the mass of the uniform distribution.

One can then deduce from lemma 2.8 hitting time lower bounds for MCMCs using a folklore bottleneck argument—see Jerrum [8]—that we recall here in a simplified form (see also [5], as well as proposition 4.4, where we will detail it further along with a short proof).

Proposition 2.9. —

We fix any $Y$ and $n$ , and let any $0 < s < t < 1$ . Let $θ^{(0)}, θ^{(1)}, \dots$ be a Markov chain on $S^{n - 1}$ with stationary distribution $Π (\cdot | Y)$ , and initialized from $θ^{(0)} \sim Π_{S_{s}} (\cdot | Y)$ , the posterior distribution conditioned on $S_{s}$ . Let $τ_{t} = inf {k \in N : θ^{(k)} \in T_{t}}$ be the hitting time of the Markov chain onto $T_{t}$ . Then, for any $k \geq 1$ ,

$\Pr (τ_{t} \leq k) \leq k \frac{Π (W_{s, t} | Y)}{Π (S_{s} | Y)} .$ 2.7

Remark 2.10 (MCMC initialization). —

Note that lemma 2.8, combined with proposition 2.9, shows hardness of MCMC initialized in points drawn from $Π_{S_{r (ϵ)}} (\cdot | Y)$ . In particular, it is easy to see that this implies (via the probabilistic method) the existence of such ‘hard’ initializing points. While one might hope to show such negative results for more general initialization, this remains an open problem. On the other hand, Ben Arous et al. [4] shows that there exists initializers in $S_{r (ϵ)}$ for which vanilla Langevin dynamics achieve non-trivial recovery of the signal even for $λ = Θ_{n} (1)$ (a phenomenon they call ‘equatorial passes’).

3. Main results for nonlinear regression with Gaussian priors

We now turn to the main contribution of this article, which is to exhibit some of the phenomena described in §2 in the context of nonlinear regression models. All the theorems of this section are proven in detail in §4.

Consider data $Z^{(N)} =^{iid} {(Y_{i}, X_{i})}_{i = 1}^{N}$ from the random design regression model

Y_{i} = G (θ) (X_{i}) + ε_{i}, ε_{i} \sim N (0, 1), i = 1, \dots, N,

3.1

where $G : Θ \to L_{μ}^{2} (X)$ is a regression map taking values in the space $L^{2} (X) = L_{μ}^{2} (X)$ on some bounded subset $X$ of $R^{d}$ , and where the $X_{i} \sim^{iid} μ$ are drawn uniformly on $X$ . For convenience, we assume that $X$ has Lebesgue measure $\int_{X} d x = 1$ . The law of the data $d P_{θ}^{N} (z_{1}, \dots, z_{N}) = \prod_{i = 1}^{N} d P_{θ} (z_{i})$ is a product measure on ${(R \times X)}^{N}$ , with associated expectation operator $E_{θ}^{N}$ . Here $θ$ varies in some parameter space

Θ \subseteq R^{D}, \frac{D}{N} ≃ κ \geq 0,

and $θ_{0} \in Θ$ is a ‘ground truth’ (we could use ‘mis-specified’ $θ_{0}$ and project it onto $Θ$ ). We will primarily consider the case where $κ > 0$ and $Θ = R^{D}$ , and consider high-dimensional asymptotics where $D$ (and then also $N$ ) diverge to infinity, even though some aspects of our proofs do not rely on these assumptions. We will say that events $A_{N}$ hold with high probability if $P_{θ_{0}}^{N} (A_{N}) \to 1$ as $N \to \infty$ , and we will use the same terminology later when it involves the law of some Markov chain.

Let $Π$ be a prior (Borel probability measure) on $Θ$ so that given the data $Z^{(N)}$ the posterior measure is the ‘Gibbs’-type distribution

d Π (θ | Z^{(N)}) = \frac{e^{ℓ_{N} (θ)} d Π (θ)}{\int_{Θ} e^{ℓ_{N} (θ)} d Π (θ)}, θ \in Θ,

3.2

where

ℓ_{N} (θ) = - \frac{1}{2} \sum_{i = 1}^{N} | Y_{i} - G (θ) (X_{i}) |^{2}, ℓ (θ) = E_{θ_{0}}^{N} ℓ_{N} (θ), θ \in Θ .

(a) . Hardness examples for posterior computation with Gaussian priors

We are concerned here with the question of whether one can sample from the Gibbs’ measure (3.2) by MCMC algorithms. The priors will be Gaussian, so the ‘source’ of the difficulty will arise from the log-likelihood function $ℓ_{N}$ . On the one hand, recent work [10–14] has demonstrated that if $ℓ_{N} (θ)$ is ‘on average’ (under $E_{θ_{0}}$ ) log-concave, possibly only just locally near the ground truth $θ_{0}$ , then MCMC methods that are initialized into the area of log-concavity can mix towards $Π (\cdot | Z^{(N)})$ in polynomial time even in high-dimensional ( $D \to \infty$ ) and ‘informative’ ( $N \to \infty$ ) settings. In the absence of such structural assumptions, however, posterior computation may be intractable, and the purpose of this section is to give some concrete examples for this with choices of $G$ that are representative for nonlinear regression models.

We will provide lower bounds on the run-time of ‘worst case’ initialized MCMC in settings where the average posterior surface is not globally log-concave but still unimodal. Both the log-likelihood function and posterior density exhibit linear growth towards their modes, and the average log-likelihood is locally log-concave at $θ_{0}$ . In particular, the Fisher information is well defined and non-singular at the ground truth.

The computational hardness does not arise from a local optimum (‘multi-modality’), but from the difficulty MCMC encounters in ‘choosing’ among many high-dimensional directions when started away from the bulk of the support of the posterior measure. That such problems occur in high dimensions is related to the probabilistic structure of the prior $Π$ , and the manifestation of ‘free energy barriers’ in the posterior distribution.

In many applications of Bayesian statistics, such as in machine learning or in nonlinear inverse problems with PDEs, Gaussian process priors are commonly used for inference. To connect to such situations we illustrate the key ideas that follow with two canonical examples where the prior on $R^{D}$ is the law

(a) θ \sim N (0, \frac{I_{D}}{D}), or (b) θ \sim N (0, Σ_{α}),

3.3

where $Σ_{α}$ is the covariance matrix arising from the law of a $D$ -dimensional Whittle–Matérn-type Gaussian random field (see §4(d)(i) for a detailed definition). These priors represent widely popular choices in Bayesian statistical inference [47,48] and can be expected to yield consistent statistical solutions of regression problems even when $D / N \geq κ > 0$ , see [48,49]. In (b), we can also accommodate a further ‘rescaling’ ( $N$ -dependent shrinkage) of the prior similar to what has been used in recent theory for nonlinear inverse problems [12,13,18], see remark 4.6 for details.

We will present our main results for the case where the ground truth is $θ_{0} = 0$ . This streamlines notation while also being the ‘hardest’ case for negative results, since the priors from (a) and (b) are then already centred at the correct parameter.

To formalize our results, let us define balls

B_{r} = {θ \in R^{D} : | | θ | |_{R^{D}} \leq r}, r > 0,

3.4

centred at $θ_{0} = 0$ . We will also require the annuli

Θ_{r, ε} = {θ \in R^{D} : | | θ | |_{R^{D}} \in (r, r + ε)},

3.5

for $r, ε > 0$ to be chosen. To connect this to the notation in the preceding sections, the sets $Θ_{r, ε}$ will play the role of the initialization (or starting) region $S$ , while $B_{s}$ (for suitable $s$ ) corresponds to the target region $T$ where the posterior mass concentrates. The ‘intermediate’ region $W = Θ_{s, η}$ representing the ‘free-energy barrier’ is constructed in the proofs of the theorems to follow.

Our results hold for general Markov chains whose invariant measure equals the posterior measure (3.2), and which admit a bound on their ‘typical’ step sizes. As step sizes can be random, this assumption needs to be accommodated in the probabilistic framework describing the transition probabilities of the chain. Let $P_{N} (θ, A), N \in N,$ (for $θ \in R^{D}$ and Borel sets $A \subseteq R^{D}$ ), denote a sequence of Markov kernels describing the Markov chain dynamics employed for the computation of the posterior distribution $Π (\cdot | Z^{(N)})$ . Recall that a probability measure $μ$ on $R^{D}$ is called invariant for $P_{N}$ if $\int_{R^{D}} P_{N} (θ, A) d μ (θ) = μ (A)$ for all Borel sets $A$ .

Assumption 3.1. —

Let $P_{N} (\cdot, \cdot)$ be a sequence of Markov kernels satisfying the following:

(i)
$P_{N} (\cdot, \cdot)$ has invariant distribution $Π (\cdot | Z^{(N)})$ from (3.2).

(ii)
For some fixed $c_{0} > 0$ and for sequences $Q = Q_{N} > 0$ , $η = η_{N} > 0$ , with $P_{0}^{N}$ -probability approaching $1$ as $N \to \infty$ ,
$sup_{θ \in B_{Q}} P_{N} (θ, {ϑ : | | θ - ϑ | |_{R^{D}} \geq \frac{η}{2}}) \leq e^{- c_{0} N}, N \geq 1.$

This assumption states that typical steps of the Markov chain are, with high probability (both under the law of the Markov chain and the randomness of the invariant ‘target’ measure), concentrated in an area of size $η / 2$ around the current state $θ$ , uniformly in a ball of radius $Q$ around $θ_{0} = 0$ . For standard MCMC algorithms (such as pCN, MALA) whose proposal steps are based on the discretization of some continuous-time diffusion process, such conditions can be checked, as we will show in the next section.

Theorem 3.2. —

Let $D / N ≃ κ > 0$ , consider the posterior (3.2) arising from the model (3.1) and a $N (0, I_{D} / D)$ prior of density $π$ , and let $θ_{0} = 0$ . Then there exists $G$ and a fixed constant $s \in (0, 1 / 3)$ for which the following statements hold true.

(i)
The expected likelihood $ℓ (θ)$ is unimodal with mode $0$ , locally log-concave near $0$ , radially symmetric, Lipschitz continuous and monotonically decreasing in $| | θ | |_{R^{D}}$ on $R^{D}$ .

(ii)
For any fixed $r > 0$ , with high probability the log-likelihood $ℓ_{N} (θ)$ and the posterior density $π (\cdot | Z^{(N)})$ are monotonically decreasing in $| | θ | |_{R^{D}}$ on the set ${θ : | | θ | |_{R^{D}} \geq r}$ .

(iii)
We have that $Π (B_{s} | Z^{(N)}) \overset{N \to \infty}{\to} 1$ in probability.

(iv)
There exists $ε > 0$ such that for any (sequence of) Markov kernels $P_{N}$ on $R^{D}$ and associated chains $(ϑ_{k} : k \geq 1)$ that satisfy assumption 3.1 for some $c_{0} > 0$ , $Q = 1 + ε$ , sequence $η_{N} \in (0, s)$ and all $N \geq 1$ large enough, we can find an initialization point $ϑ_{0} \in Θ_{2 / 3, ε}$ such that with high probability (under the law of $Z^{(N)}$ and the Markov chain), the hitting time $τ_{B_{s}}$ for $ϑ_{k}$ to reach $B_{s}$ (with $s$ as in (iii)) is lower bounded as
$τ_{B_{s}} \geq \exp (min {c_{0}, 1} N / 2) .$

The interpretation is that despite the posterior being strictly increasing in the radial variable $| | θ | |_{R^{D}}$ (at least for $| | θ | |_{R^{D}} > r$ , any $r > 0$ —note that maximizers of the posterior density may deviate from the ‘ground truth’ $θ_{0} = 0$ by some asymptotically vanishing error, cf. also proposition 4.1), MCMC algorithms started in $Θ_{2 / 3, ε}$ will still take an exponential time before visiting the region $B_{s}$ where the posterior mass concentrates. This is true for small enough step size independently of $D, N$ . The result holds also for $ϑ_{0}$ drawn from an absolutely continuous distribution on $Θ_{2 / 3, ε}$ as inspection of the proof shows. Finally, we note that at the expense of more cumbersome notation, the above high probability results (and similarly in theorem 3.3) could be made non-asymptotic, in the sense that for all $δ > 0$ all statements hold with probability at least $1 - δ$ for all $N \geq N_{0} (δ)$ large enough, where the dependency of $N_{0}$ on $δ$ can be made explicit.

For ‘ellipsoidally supported’ $α$ -regular priors (b), the idea is similar but the geometry of the problem changes as the prior now ‘prefers’ low-dimensional subspaces of $R^{D}$ , forcing the posterior closer towards the ground truth $θ_{0} = 0$ . We show that if the step size is small compared to a scaling $N^{- b}$ for $b > 0$ determined by $α$ , then the same hardness phenomenon persists. Note that ‘small’ is only ‘polynomially small’ in $N$ and hence algorithmic hardness does not come from exponentially small step sizes.

Theorem 3.3. —

Let $D / N ≃ κ > 0$ , consider the posterior (3.2) arising from the model (3.1) and a $N (0, Σ_{α})$ prior of density $π$ for some $α > d / 2$ , and let $θ_{0} = 0$ . Define $b = (α / d) - (1 / 2) > 0$ . Then there exists $G$ and some fixed constant $s_{b} \in (0, 1 / 2)$ for which the following statements hold true.

(i)
The expected likelihood $ℓ (θ)$ is unimodal with mode $0$ , locally log-concave near $0$ , radially symmetric, Lipschitz continuous and monotonically decreasing in $| | θ | |_{R^{D}}$ on $R^{D}$ .

(ii)
For any fixed $r > 0$ , with high probability $ℓ_{N} (θ)$ is radially symmetric and decreasing in $| | θ | |_{R^{D}}$ on the set ${θ : | | θ | |_{R^{D}} \geq r N^{- b}}$ .

(iii)
Defining $s = s_{b} N^{- b}$ , we have $Π (B_{s} | Z^{(N)}) \overset{N \to \infty}{\to} 1$ in probability.

(iv)
There exist positive constants $ε, C > 0$ and $ν = ν (κ, α, d) > 0$ such that for any (sequence of) Markov kernels $P_{N}$ on $R^{D}$ and associated chains $(ϑ_{k} : k \geq 1)$ that satisfy assumption 3.1 for some $c_{0} > 0$ , $Q = Q_{N} = C \sqrt{N}$ , sequence $η = η_{N} \in (0, s_{b} N^{- b})$ and all $N \geq 1$ large enough, we can find an initialization point $ϑ_{0} \in Θ_{N^{- b}, ε N^{- b}}$ such that with high probability (under the law of $Z^{(N)}$ and the Markov chain), the hitting time $τ_{B_{s}}$ for $ϑ_{k}$ to reach $B_{s}$ is lower bounded as
$τ_{B_{s}} \geq \exp (min {c_{0}, ν} N / 2) .$

Again, (iv) holds as well for $ϑ_{0}$ drawn from an absolutely continuous distribution on $Θ_{N^{- b}, ε N^{- b}}$ . We also note that $ε$ depends only on $α, κ, d$ and the choice of $G$ but not on any other parameters.

Remark 3.4. —

As opposed to theorem 3.2, due to the anisotropy of the prior density $π$ , the posterior distribution is no longer radially symmetric in the preceding theorem, whence part (ii) differs from theorem 3.2. But a slightly weaker form of monotonicity of the posterior density $π (\cdot | Z^{(N)})$ still holds: the same arguments employed to prove part (ii) of theorem 3.2 show that $π (\cdot | Z^{(N)})$ is decreasing on ${θ : | | θ | |_{R^{D}} \geq r N^{- b}}$ (any $r > 0$ ) along the half-lines through $0$ , i.e.

$P_{0}^{N} (π (v e | Z^{(N)}) \leq π (v^{'} e | Z^{(N)}) for all v \geq v^{'} \geq r, e \in R^{D}, | | e | |_{R^{D}} = N^{- b}) \overset{N \to \infty}{\to} 1.$ 3.6

We note that this notion precludes the possibility of $π (\cdot | Z^{(N)})$ having extremal points outside of the region of dominant posterior mass, and implies that moving toward the origin will always increase the posterior density. As a result, many typical Metropolis–Hastings would be encouraged to accept such ‘radially inward’ moves, if they arise as a proposal. Thus, crucially, our exponential hitting time lower bound in part (iv) arises not through multi-modality, but merely through volumetric properties of high-dimensional Gaussian measures.

Remark 3.5 (On the step size condition). —

One may wonder whether larger step sizes can help to overcome the negative result presented in the last theorem. If the step sizes are ‘time-homogeneous’ and $≫ N^{- b}$ on average, then we may hit the region where the posterior is supported at some time. This would happen ‘by chance’ and not because the data (via $ℓ_{N}$ ) would suggest to move there, and future proposals will likely be outside of that bulk region, so that the chain will either exit the relevant region again or become deterministic because an accept/reject step refuses to move into such directions. In this sense, a negative result for (polynomially) small step sizes gives fundamental limitations on the ability of the chain to explore the precise characteristics of the posterior distribution. We also remark that the Lipschitz constants of $\nabla ℓ (θ)$ are of order $D$ or $D^{1 + b}$ in the preceding theorems, respectively. A Markov chain obtained from discretizing a continuous diffusion process (such as MALA discussed in the next section) will generally require step sizes that are inversely proportional to that Lipschitz constant in order to inherit the dynamics from the continuous process. For such examples, assumption 3.1 is natural. But as discussed at the end of the introduction, there exists a variety of ‘non-local’ MCMC algorithms for which this step size assumption may not be satisfied.

(b) . Implications for common MCMC methods with ‘cold-start’

The preceding general hitting time bounds apply to commonly used MCMC methods in high-dimensional statistics. We focus in particular on algorithms that are popular with PDE models and inverse problems, see, e.g. [50,51] and also [14] for many more references. We illustrate this for two natural examples with Metropolis–Hastings adjusted random walk and gradient algorithms. Other examples can be generated without difficulty.

(i) . Preconditioned Crank–Nicolson

We first give some hardness results for the popular pCN algorithm. A dimension-free convergence analysis for pCN was given in the important paper by Hairer et al. [52] based on ideas from Hairer et al. [53]. The results in the present section show that while the mixing bounds from Hairer et al. [52] are in principle uniform in $D$ , the implicit dependence of the constants on the conditions on the log-likelihood-function in [52] can re-introduce exponential scaling when one wants to apply the results from Hairer et al. [52] to concrete ( $N$ -dependent) posterior distributions. This confirms a conjecture about pCN made in Section 1.2.1 of Nickl & Wang [12].

Let $C$ denote the covariance of some Gaussian prior on $R^{D}$ with density $π$ . Then the pCN algorithm for sampling from some posterior density $π (θ | Z^{(N)}) \propto e^{ℓ_{N} (θ)} π (θ)$ is given as follows. Let $(ξ_{k} : k \geq 1)$ be an i.i.d. sequence of $N (0, C)$ random vectors. For initializer $ϑ_{0} \in R^{D}$ , step size $β > 0$ and $k \geq 1$ , the MCMC chain is then given by

1.
Proposal: $p_{k} \sim \sqrt{1 - β} ϑ_{k - 1} + \sqrt{β} ξ_{k}$ ,
2.
Accept–reject: Set
$ϑ_{k} = {\begin{cases} p_{k} & w . p . min {1, e^{ℓ_{N} (p_{k}) - ℓ_{N} (ϑ_{k - 1})}}, \\ ϑ_{k - 1} & else . \end{cases}$ 3.7

By standard Markov chain arguments one verifies (see [52] or Ch.1 in [14]) that the (unique) invariant density of $(ϑ_{k} : k \geq 1)$ equals $π (\cdot | Z^{(N)})$ .

We now give a hitting time lower bound for the pCN algorithm which holds true in the regression setting for which the main theorems 3.2 and 3.3 (for generic Markov chains) were derived. In particular, we emphasize that the lower bounds to follow hold for the choice of regression ‘forward’ map $G$ constructed in the proofs of theorems 3.2 and 3.3. As for the general results, we treat the two cases of $C = I_{D} / D$ or $C = Σ_{α}$ separately.

Theorem 3.6. —

Let $ϑ_{k}$ denote the pCN Markov chain from (3.7).

(i)
Assume the setting of theorem 3.2 with $C = I_{D} / D$ , and let $G$ be as in theorem 3.2. Then there exist constants $c_{1}, c_{2}, ε > 0$ such that for any $β \leq c_{1}$ , there is an initialization point $ϑ_{0} \in Θ_{2 / 3, ε}$ such that the hitting time $τ_{B_{s}} = inf {k : ϑ_{k} \in B_{s}}$ (for $B_{s}$ as in (3.4)) satisfies with high probability (under the law of the data and of the Markov chain) as $N \to \infty$ that $τ_{B_{s}} \geq \exp (c_{2} D)$ .

(ii)
Assume the setting of theorem 3.3 with $C = Σ_{α}$ for $α > d / 2$ , and let $G$ be as in theorem 3.2. Then there exist constants $c_{1}, c_{2}, ε > 0$ such that if $β \leq c_{1} N^{- 1 - 2 b}$ there is an initialization point $ϑ_{0} \in Θ_{N^{- b}, ε N^{- b}}$ such that the hitting time $τ_{B_{s}} = inf {k : ϑ_{k} \in B_{s}}$ satisfies with high probability that $τ_{B_{s}} \geq \exp (c_{2} D)$ .

(ii) . Gradient-based Langevin algorithms

We now turn to gradient-based Langevin algorithms which are based on the discretization of continuous-time diffusion processes [10,50]. A polynomial time convergence analysis for the unadjusted Langevin algorithm in the strongly log-concave case has been given in [10,11] and also in [54] for the Metropolis-adjusted case (MALA). We show here that for unimodal but not globally log-concave distributions, the MCMC scheme can take an exponential time to reach the bulk of the posterior distribution. For simplicity we focus on the Metropolis-adjusted Langevin algorithm which is defined as follows. Let $(ξ_{k} : k \geq 1)$ be a sequence of i.i.d. $N (0, I_{D})$ variables, and let $γ > 0$ be a step size.

1.
Proposal: $p_{k} = ϑ_{k - 1} + γ \nabla \log π (ϑ_{k - 1} | Z^{(N)}) + \sqrt{2 γ} ξ_{k}$ .
2.
Accept–reject: Set
$ϑ_{k} = {\begin{cases} p_{k} & w . p . min {1, \frac{π (p_{k} | Z^{(N)}) \exp (- | | ϑ_{k - 1} - p_{k} - γ \nabla \log π (p_{k} | Z^{(N)}) | |^{2})}{π (ϑ_{k - 1} | Z^{(N)}) \exp (- | | p_{k} - ϑ_{k - 1} - γ \nabla \log π (ϑ_{k - 1} | Z^{(N)}) | |^{2})}}, \\ ϑ_{k - 1} & else . \end{cases}$ 3.8

Again, standard Markov chain arguments show that $Π (\cdot | Z^{(N)})$ is indeed the (unique) invariant distribution of $(ϑ_{k} : k \geq 1)$ . We note here that for the forward $G$ featuring in our results to follows, $\nabla \log π$ may only be well-defined (Lebesgue-) almost everywhere on $R^{D}$ due to our piece-wise smooth choice of $w$ , see (4.6) below. However, since all proposal densities involved possess a Lebesgue density, this specification almost everywhere suffices in order to propagate the Markov chain with probability $1$ . Alternatively one could also straightforwardly avoid this technicality by smoothing our choice of function $w$ in (4.6), which we refrain from for notational ease.

Theorem 3.7. —

Let $ϑ_{k}$ denote the MALA Markov chain from (3.8).

(i)
Assume the setting of theorem 3.2, with $N (0, I_{D} / D)$ prior, and let $G$ also be as in theorem 3.2. There exists some $c_{1}, c_{2}, ε > 0$ such that if the step size of $(ϑ_{k} : k \geq 1)$ satisfies $γ \leq c_{1} / N$ , then there is an initialization point $ϑ_{0} \in Θ_{2 / 3, ε}$ such that the hitting time $τ_{B_{s}} = inf {k : ϑ_{k} \in B_{s}}$ (for $B_{s}$ as in (3.4)) satisfies with high probability (under the law of the data and of the Markov chain) as $N \to \infty$ that $τ_{B_{s}} \geq \exp (c_{2} D)$ .

(ii)
Assume the setting of theorem 3.3, with a $N (0, Σ_{α})$ prior, and let $G$ also be as in theorem 3.3. Then there exist some constant $c_{1}, c_{2}, ε > 0$ such that whenever $γ \leq c_{1} N^{- 1 - b - 2 α}$ , there is an initialization point $ϑ_{0} \in Θ_{N^{- b}, ε N^{- b}}$ , such that the hitting time $τ_{B_{s}} = inf {k : ϑ_{k} \in B_{s}}$ satisfies with high probability (under the law of the data and of the Markov chain) that $τ_{B_{s}} \geq \exp (c_{2} D)$ .

As mentioned in remark 3.5, a bound on the step size that is inversely proportional to the Lipschitz constant of $\nabla ℓ$ is natural for algorithms like MALA that arise from discretization of a continuous-time Markov process, see e.g. [11,54]. We emphasize again that these Lipschitz constants are $D$ - and $N$ -dependent, so that the required bounds on $γ$ are not unnatural. ‘Optimal’ step size prescriptions for MALA [54–57] derived for Gaussian and log-concave targets or, more generally, mean-field limits (in which the posterior distribution possesses a product or mean-field structure, unlike in the models considered here) would need to be adjusted to our model classes to be comparable.

4. Proofs of the main theorems

We begin in §4a by constructing the family of regression maps $G$ underlying our results from §3. Section 4b,c reduce the hitting time bounds from theorems 3.2 and 3.3 (for general Markov chains) to hitting time bounds for intermediate ‘free energy barriers’ that the Markov chain needs to travel through. Subsequently, theorems 3.3 and 3.2 are proved in §4d,e, respectively. Finally, the proofs for pCN (theorem 3.6) and MALA (theorem 3.7) are contained in §4f.

(a) . Radially symmetric choices of G

We start with our parameterization of the map $G$ . In our regression model and since $E ε^{2} = 1$ ,

ℓ (θ) = - \frac{N}{2} E_{θ_{0}}^{1} | Y - G (θ) (X) |^{2} = - \frac{N}{2} | | G (θ_{0}) - G (θ) | |_{L^{2}}^{2} - \frac{N}{2}, θ \in R^{D} .

4.1

We have $θ_{0} = 0$ and by subtracting a fixed function $G (0)$ from $G (θ)$ if necessary we can also assume that $G (θ_{0}) = 0$ . In this case, since $vol (X) = 1$ ,

ℓ (θ) = - \frac{N}{2} | | G (θ) | |_{L^{2}}^{2} - \frac{N}{2} .

4.2

Take a bounded continuous function $w : [0, \infty] \to [0, | | w | |_{\infty})$ with a unique minimizer $w (0) = 0$ and take $G$ of the ‘radial’ form

G (θ) = \sqrt{w (| | θ | |_{R^{D}})} \times g (x), θ \in R^{D}, x \in X,

where

g : X \to [g_{min}, g_{max}], 0 < g_{min} < g_{max} < \infty, | | g | |_{L_{μ}^{2} (X)} = 1.

The assumption $G (θ_{0}) = 0$ implies $Y_{i} = 0 + ε_{i}$ under $P_{θ_{0}}^{N}$ , so that we have

\begin{aligned} ℓ_{N} (θ) & = - \frac{1}{2} \sum_{i = 1}^{N} | ε_{i} - \sqrt{w (| | θ | |)} g (X_{i}) |^{2} \\ = - \frac{w (| | θ | |_{R^{D}})}{2} \sum_{i = 1}^{N} g^{2} (X_{i}) - \frac{1}{2} \sum_{i = 1}^{N} ε_{i}^{2} + \sqrt{w (| | θ | |)} \sum_{i = 1}^{N} ε_{i} g (X_{i}), \end{aligned}

4.3

and the average log-likelihood is

ℓ (θ) = E_{θ_{0}}^{N} ℓ_{N} (θ) = - \frac{N}{2} w (| | θ | |_{R^{D}}) - \frac{N}{2}, θ \in R^{D} .

4.4

Define $ϵ$ -annuli of Euclidean space

Θ_{r, ϵ} = {θ \in R^{D} : | | θ | |_{R^{D}} \in (r, r + ϵ)}, r \geq 0.

4.5

We then also set, for any $s \geq 0, ϵ > 0$ ,

w_{-} (r, ϵ) = inf_{s \in (r, r + ϵ)} w (s), w_{+} (r, ϵ) = sup_{s \in (r, r + ϵ)} w (s) .

For our main theorems the map $w$ will be monotone increasing and the preceding notation $w_{-}, w_{+}$ is then not necessary, but proposition 4.2 is potentially also useful in non-monotone settings (as remarked after its proof), hence the slightly more general notation here.

The choice that $G$ is radial is convenient in the proofs, but means that the model is only identifiable up to a rotation for $θ \neq 0$ . One could easily make it identifiable by more intricate choices of $G$ , but the main point for our negative results is that the function $ℓ$ has a unique mode at the ground truth parameter $θ_{0}$ and is identifiable there.

(i) . A locally log-concave, globally monotone choice of w

Define for $t < L$ and any $r > 0$ the function $w : [0, \infty) \to R$ as

4.6

where $T > ρ,$ are fixed constants to be chosen. Note that $w$ is monotone increasing and

| | w | |_{\infty} = {(T t)}^{2} + (\frac{T t}{2}) + ρ (L - t) < \infty .

4.7

The function $w$ is quadratic near its minimum at the origin up until $t / 2$ , from when onwards it is piece-wise linear. In the linear regime it initially has a ‘steep’ ascent of gradient $T$ until $t$ , then grows more slowly with small gradient $ρ$ from $t$ until $L$ , and from then on is constant. The function $w$ is not $C^{\infty}$ at the points $r = t / 2$ , $r = t$ , $r = L$ , but we can easily make it smooth by convolving with a smooth function supported in small neighbourhoods of its breakpoints $r$ without changing the findings that follow. We abstain from this to simplify notation.

The following proposition summarizes some monotonicity properties of the empirical log-likelihood function arising from the above choice of $w$ .

Proposition 4.1. —

Let $w$ be as in (4.6). Then there exists $C > 0$ such that for any $r_{0} > 0$ and $N \geq 1$ , we have

$P_{0}^{N} (sup_{r_{0} \leq r < s \leq L} sup_{| | θ_{s} | | = s, | | θ_{r} | | = r} \frac{ℓ_{N} (θ_{s}) - ℓ_{N} (θ_{r})}{w (s) - w (r)} \leq - \frac{N}{4}) \geq 1 - \frac{C}{N} - \frac{C}{N w (r_{0})} .$

In particular, if $r_{0} < t / 2$ is such that ${(T r_{0})}^{2} N \to \infty$ as $N \to \infty$ then the r.h.s. is $1 - o (1)$ .

Proof. —

Recalling (4.4), (4.3) and since $w$ is monotonically increasing, we bound

$\begin{aligned} P_{0}^{N} (ℓ_{N} (θ_{s}) - ℓ_{N} (θ_{r}) > (\frac{N}{4}) (w (r) - w (s))) \\ = P_{0}^{N} (ℓ_{N} (θ_{s}) - ℓ_{N} (θ_{r}) - (ℓ (θ_{s}) - ℓ (θ_{r})) > - \frac{N}{4} (w (r) - w (s))) \\ = Pr (\frac{(w (r) - w (s))}{2} \sum_{i = 1}^{N} (g^{2} (X_{i}) - 1) + (\sqrt{w (s)} - \sqrt{w (r)}) \sum_{i = 1}^{N} ε_{i} g (X_{i}) > \frac{N}{4} (w (s) - w (r))) \\ = Pr (- \sum_{i = 1}^{N} (g^{2} (X_{i}) - E g^{2} (X)) / 2 + \frac{1}{\sqrt{w (s)} + \sqrt{w (r)}} \sum_{i = 1}^{N} ε_{i} g (X_{i}) > \frac{N}{4}) \\ \leq Pr (| \sum_{i = 1}^{N} (g^{2} (X_{i}) - E g^{2} (X)) | > \frac{N}{4}) + Pr (| \sum_{i = 1}^{N} ε_{i} g (X_{i}) | > \frac{2 N \sqrt{w (r_{0})}}{8}) \\ = O (\frac{1}{N}) + O (\frac{1}{N w (r_{0})}), \end{aligned}$

using Chebyshev’s inequality in the last step. Since the events in the penultimate step do not depend on $r < s \in [r_{0}, L]$ , the result follows.

(b) . Bounds for posterior ratios of annuli

A key quantity in the proofs to follow will be to obtain asymptotic $(N \to \infty)$ bounds of the following functional (recalling the definition of the Euclidean annuli $Θ_{r, ε}$ from (4.5)),

F_{N} (r, ε) = \frac{1}{N} \log \int_{Θ_{r, ε}} e^{ℓ_{N} (θ)} d Π (θ), r \geq 0, ε > 0,

4.8

in terms of the map $w$ . As a side note, we remark that this functional has a long history in the statistical physics of glasses, in which it is often referred to as the Franz–Parisi potential [7,58].

Proposition 4.2. —

Consider the regression model (3.1) with radially symmetric choice of $G$ from §4a such that $| | w | |_{\infty} \leq W$ for some fixed $W < \infty$ (independent of $D, N$ ), and let $Π = Π_{N}$ denote a sequence of prior probability measures on $R^{D}$ .

(i)
Suppose that for some radii $0 < s < σ$ , constants $ε, η, ν > 0$ and for all $N \geq 1$ large enough, we have
$\frac{1}{N} \log \frac{Π (Θ_{s, η})}{Π (Θ_{σ, ε})} \leq - 2 ν - \frac{(w_{+} (σ, ε) - w_{-} (s, η))}{2} .$ 4.9
Then the posterior distribution $Π (\cdot | Z^{(N)})$ from (3.2) arising in the model (3.1) satisfies that with high $P_{0}^{N}$ -probability as $N \to \infty$ ,
$\frac{Π (Θ_{s, η} | Z^{(N)})}{Π (Θ_{σ, ε} | Z^{(N)})} \leq e^{- ν N} .$ 4.10

(ii)
If in addition $w$ is monotone increasing on $[0, \infty)$ and if for some $Q > 1 + ε$ ,
$\frac{1}{N} \log \frac{Π (B_{Q}^{c})}{Π (Θ_{σ, ε})} \leq - 2 ν,$ 4.11
then the posterior distribution $Π (\cdot | Z^{(N)})$ also satisfies (with high probability as $N \to \infty$ ) that
$\frac{Π (B_{Q}^{c} | Z^{(N)})}{Π (Θ_{σ, ε} | Z^{(N)})} \leq e^{- ν N} .$ 4.12

Remark 4.3 (The prior condition for w from (4.6)). —

If $σ > s > t$ , for $w$ from (4.6), the ‘likelihood’ term in proposition 4.2 is

$\frac{w_{+} (σ, ε)}{2} - \frac{w_{-} (s, η)}{2} \leq \frac{ρ (σ + ε - t) - ρ (s - t)}{2} = \frac{ρ}{2} (σ + ε - s) > 0,$ 4.13

so that if we also assume

$T t + ρ L = O (\sqrt{N}),$ 4.14

to control $ω_{N}, ω_{N}^{'}$ in the proof that follows, then to verify (4.9) it suffices to check

$\frac{1}{N} \log \frac{Π (Θ_{s, η})}{Π (Θ_{σ, ε})} \leq - 2 ν - \frac{ρ}{2} (σ + ε - s),$ 4.15

for all large enough $N$ .

Proof. —

Proof of part (i). From the definition of $ℓ_{N}$ in (4.3) we first note that for all $r \geq 0, ϵ > 0$ ,

$inf_{θ \in Θ_{r, ϵ}} ℓ_{N} (θ) \geq - \frac{1}{2} \sum_{i = 1}^{N} ε_{i}^{2} - \frac{1}{2} w_{+} (r, ϵ) \sum_{i = 1}^{N} g^{2} (X_{i}) - \sqrt{w_{+} (r, ϵ)} | \sum_{i = 1}^{N} ε_{i} g (X_{i}) |$

and

$sup_{θ \in Θ_{r, ϵ}} ℓ_{N} (θ) \leq - \frac{1}{2} \sum_{i = 1}^{N} ε_{i}^{2} - \frac{1}{2} w_{-} (r, ϵ) \sum_{i = 1}^{N} g^{2} (X_{i}) + \sqrt{w_{+} (r, ϵ)} | \sum_{i = 1}^{N} ε_{i} g (X_{i}) | .$

We can now further bound, for our $G$ ,

$\begin{aligned} \frac{1}{N} \log \int_{Θ_{r, ϵ}} e^{ℓ_{N} (θ)} d Π (θ) \leq - \frac{1}{2 N} \sum_{i = 1}^{N} ε_{i}^{2} \\ - \frac{w_{-} (r, ϵ)}{2 N} \sum_{i = 1}^{N} g^{2} (X_{i}) + \frac{\sqrt{w_{+} (r, ϵ)}}{N} | \sum_{i = 1}^{N} ε_{i} g (X_{i}) | + \frac{\log Π (Θ_{r, ϵ})}{N} \end{aligned}$

and

$\begin{aligned} \frac{1}{N} \log \int_{Θ_{r, ϵ}} e^{ℓ_{N} (θ)} d Π (θ) \geq - \frac{1}{2 N} \sum_{i = 1}^{N} ε_{i}^{2} \\ - \frac{w_{+} (r, ϵ)}{2 N} \sum_{i = 1}^{N} g^{2} (X_{i}) - \frac{\sqrt{w_{+} (r, ϵ)}}{N} | \sum_{i = 1}^{N} ε_{i} g (X_{i}) | + \frac{\log Π (Θ_{r, ϵ})}{N} . \end{aligned}$

We estimate $\sqrt{w_{+} (r, ϵ)} \leq \bar{w} (r, ϵ) = max (w_{+} (r, ϵ), 1)$ , and noting that

$E ε_{i}^{2} = 1 = E g^{2} (X_{i}) and E ε_{i} g (X_{i}) = 0,$

we can use Chebyshev’s (or Bernstein’s) inequality to construct an event of high probability such that the functional $F_{N}$ from (4.8) is bounded as

$F_{N} (r, ϵ) \leq - \frac{1}{2} - \frac{w_{-} (r, ϵ)}{2} + \frac{\log Π (Θ_{r, ϵ})}{N} + ω_{N} (s, η)$ 4.16

and

$F_{N} (r, ϵ) \geq - \frac{1}{2} - \frac{w_{+} (r, ϵ)}{2} + \frac{\log Π (Θ_{r, ϵ})}{N} + ω_{N}^{'} (r, ϵ),$ 4.17

where

$ω_{N} (r, ϵ) = O (1 + \frac{w_{-} (r, ϵ) + \bar{w} (r, ϵ)}{\sqrt{N}}), ω_{N}^{'} (s) = O (1 + \frac{w_{+} (r, ϵ) + \bar{w} (r, ϵ)}{\sqrt{N}}),$ 4.18

and this is uniform in all $(r, ϵ)$ since $| | w | |_{\infty} \leq W$ is bounded. Using the above with $(r, ϵ)$ chosen as $(s, η)$ and $(σ, ε)$ respectively, we then obtain

$\begin{aligned} \frac{1}{N} \log \frac{Π (Θ_{s, η} | Z^{(N)})}{Π (Θ_{σ, ε} | Z^{(N)})} = F_{N} (s, η) - F_{N} (σ, ε) \\ \leq - \frac{w_{-} (s, η)}{2} + \frac{\log Π (Θ_{s, η})}{N} + \frac{w_{+} (σ, ε)}{2} - \frac{\log Π (Θ_{σ, ε})}{N} + ω_{N} (s, η) - ω_{N}^{'} (σ, ε) \\ = - \frac{w_{-} (s, η)}{2} + \frac{w_{+} (σ, ε)}{2} + \frac{1}{N} \log \frac{Π (Θ_{s, η})}{Π (Θ_{σ, ε})} + ω_{N} (s, η) - ω_{N}^{'} (σ, ε), \end{aligned}$ 4.19

with high $P_{θ_{0}}^{N}$ -probability. The result now follows from the hypothesis (4.9) and since the terms $ω_{N}, ω_{N}^{'}$ are $O (1)$ .

(Proof of part ii). The proof of part (ii) follows from an obvious modification of the previous arguments.

In the case where $Π (Θ_{s, η})$ and $Π (Θ_{σ, ε})$ are comparable (so that the l.h.s. in (4.9) converges to zero), a local optimum at $σ$ in the function $w$ away from zero can verify the last inequality for ‘intermediate’ $s$ such that $w (s) - w (σ) \leq - 2 ν$ . This can be used to give computational hardness results for MCMC of multi-modal distributions. But we are interested in the more challenging case of ‘unimodal’ examples $w$ from (4.6). Before we turn to this, let us point out what can be said about the hitting times of Markov chains if the conclusion (4.10) of proposition 4.2 holds.

(c) . Bounds for Markov chain hitting times

(i) . Hitting time bounds for intermediate sets $Θ_{s, η}$

In (4.10), we can think of $Θ_{σ, ε}$ as the ‘initialization region’ (further away from $θ_{0}$ ) and $Θ_{s, η}$ for intermediate $s$ is the ‘barrier’ before we get close to $θ_{0} = 0$ . The last bound permits the following classic hitting time argument, taken from Ben Arous et al. [5], see also [8].

Proposition 4.4. —

Consider any Markov chain $(ϑ_{k} : k \in N)$ with invariant measure $μ = Π (\cdot | Z^{(N)})$ for which (4.10) holds. For constants $η < σ - s$ , suppose $ϑ_{0}$ is started in $Θ_{σ, ε}$ , $μ (Θ_{σ, ε}) > 0$ , drawn from the conditional distribution $μ (\cdot | Θ_{σ, ε})$ , and denote by $τ_{s}$ the hitting time of the Markov chain onto $Θ_{s, η}$ , that is, the number $τ_{s}$ of iterates required until $ϑ_{k}$ visits the set $Θ_{s, η}$ . Then

$Pr (τ_{s} \leq K) \leq K e^{- ν N}, K > 0.$

Similarly, on the event where (4.12) holds we have that

$Pr (τ_{B_{L}^{c}} \leq K) \leq K e^{- ν N}, K > 0.$

Proof of proposition 4.4. —

We have

$\begin{aligned} Pr (τ_{s} \leq K) & = Pr (ϑ_{k} \in Θ_{s, η} for some 1 \leq k \leq K | ϑ_{0} \in Θ_{σ, ε}) \\ = \frac{Pr (ϑ_{0} \in Θ_{σ, ε}, ϑ_{k} \in Θ_{s, η} for some 1 \leq k \leq K)}{μ (Θ_{σ, ε})} \leq \frac{\sum_{k \leq K} Pr (ϑ_{k} \in Θ_{s, η})}{μ (Θ_{σ, ε})} \\ \leq K \frac{μ (Θ_{s, η})}{μ (Θ_{σ, ε})} \leq K e^{- ν N} . \end{aligned}$

The second claim is proved analogously.

The last proposition holds ‘on average’ for initializers $ϑ_{0} \sim μ (\cdot | Θ_{σ, ε})$ , and since $Pr = E_{μ (\cdot | Θ_{σ, ε})} \underset{ϑ_{0}}{Pr}$ where $\underset{ϑ_{0}}{Pr}$ is the law of the Markov chain started at $ϑ_{0}$ , the hitting time inequality holds at least for one point in $Θ_{σ, ε}$ since $inf_{ϑ_{0}} \underset{ϑ_{0}}{Pr} \leq E_{μ (\cdot | Θ_{σ, ε})} \underset{ϑ_{0}}{Pr}$ .

(ii) . Reducing hitting times for $B_{s}$ to ones for $Θ_{s, η}$

We now reduce part (iv) of theorems 3.2 and 3.3, i.e. bounds on the hitting time of the region $B_{s}$ in which the posterior contracts, to a bound for the hitting time $τ_{s}$ for the annulus $Θ_{s, η}$ , which is controlled in proposition 4.4. To this end, in the case of theorem 3.2, we suppose that propositions 4.2 and 4.4 are verified with $ν = 1 σ = 2 / 3$ some $ε > 0$ and $Q, s, η$ as in the theorem, and in the case of theorem 3.3, we assume the same with choice $σ = N^{- b}$ and $ν > 0$ given after (4.27) below. For $c_{0}$ from assumption 3.1, define the events

A_{N} := {\forall k \leq e^{(ν \land c_{0}) N / 2} : | | ϑ_{k + 1} - ϑ_{k} | |_{R^{D}} \leq \frac{η}{2}} .

We can then estimate, using assumption 3.1, that on the frequentist event on which proposition 4.4 holds (which we apply with $K = e^{(ν \land c_{0}) N / 2} \leq e^{ν N / 2}$ ), under the probability law of the Markov chain we have

\begin{aligned} Pr (τ_{B_{s}} \leq e^{(ν \land c_{0}) N / 2}) \leq Pr (τ_{B_{s}} \leq e^{(ν \land c_{0}) N / 2}, A_{N}) + Pr (A_{N}^{c}) \\ \leq Pr (τ_{s} \leq e^{(ν \land c_{0}) N / 2}) + Pr (A_{N}^{c}, τ_{B_{Q}^{c}} > e^{(ν \land c_{0}) N / 2}) + Pr (τ_{B_{Q}^{c}} \leq e^{(ν \land c_{0}) N / 2}) \\ \leq 2 e^{- ν N / 2} + e^{(ν \land c_{0}) N / 2} sup_{θ \in B_{Q}} P_{N} (θ, {ϑ : | | θ - ϑ | |_{R^{D}} \geq \frac{η}{2}}) \\ \leq 2 e^{- (ν \land c_{0}) N / 2} + e^{(ν \land c_{0}) N / 2 - c_{0} N} \leq 3 e^{- (ν \land c_{0}) N / 2}, \end{aligned}

where in the second inequality we have used that on the events $A_{N}$ , the Markov chain $ϑ_{k}$ , when started in $Θ_{2 / 3, ε}$ , needs to pass through $Θ_{s, η}$ in order to reach $B_{s}$ .

(d) . Proof of theorem 3.3

In this section, we use the results derived in the previous part of §4 to finish the proof of theorem 3.3. Parts (i) and (ii) of the theorem follow from proposition 4.1 and our choice of $w$ in (4.6). We therefore concentrate on the proofs of part (iii) and (iv). We start with proving a key lemma on small ball estimates for truncated $α$ -regular Gaussian priors.

(i) . Small ball estimates for $α$ -regular priors

Let us first define precisely the notion of $α$ -regular Gaussian priors. For some fixed $α > d / 2$ , the prior $Π$ arises as the truncated law $L a w (θ)$ of an $α$ -regular Gaussian process with RKHS $H = H^{α}$ , a Sobolev space over some bounded domain/manifold $X$ , see e.g. section 6.2.1 in [14] for details. Equivalently (under the Parseval isometry) we take a Gaussian Borel measure on the usual sequence space $ℓ_{2} ≃ L^{2}$ with RKHS equal to

h^{α} = {{(θ_{i})}_{i = 1}^{\infty} : \sum_{i = 1}^{\infty} i^{2 (α / d)} θ_{i}^{2} = | | θ | |_{H^{α}}^{2} < \infty}, α > \frac{d}{2} .

The prior $Π$ is the truncated law of $θ_{D} = (θ_{1}, \dots, θ_{D}), D \in N$ .

Lemma 4.5. —

Fix $z > 0$ , $α > d / 2$ and $κ > 0$ , and set

$b = \frac{α}{d} - \frac{1}{2}, τ = \frac{1}{b} = \frac{2 d}{2 α - d} .$

Then if $D / N ≃ κ > 0$ , there exist constants ${\bar{c}}_{0} > c_{0}$ (depending on $b, κ$ ) such that for all $N$ ( $\geq N_{0} (z, b)$ ) large enough:

$c_{0} {(z + κ^{- α / d} z^{- τ / 2})}^{- τ} \leq - \frac{1}{N} \log Π (| | θ | |_{R^{D}} \leq z N^{- b}) \leq {\bar{c}}_{0} z^{- τ} .$ 4.20

Proof of lemma 4.5. —

Note first that the $L^{2}$ -covering numbers of the ball $h (α, B)$ of radius $B$ in $H^{α}$ satisfy the well-known two-sided estimate

$\log N (δ, | | \cdot | |_{L^{2}}, h (α, B)) ≃ (\frac{A B}{δ})^{d / α}, 0 < δ < A B$ 4.21

for equivalence constants in $≃$ depending only on $d, α$ . The upper bound is given in proposition 6.1.1 in [14] and a lower bound can be found as well in the literature [59] (by injecting $H^{α} (X_{0})$ into ${\tilde{H}}^{α} (X)$ for some strict sub-domain $X_{0} \subset X$ , and using metric entropy lower bounds for the injection $H^{α} (X_{0}) \to L^{2} (X_{0})$ ).

Using the results about small deviation asymptotics for Gaussian measures in Banach space [60]—specifically theorem 6.2.1 in [14] with $a = 2 d / (2 α - d)$ —and assuming $α > d / 2$ , this means that the concentration function of the ’untruncated prior’ satisfies the two-sided estimate

$- \log Π (| | θ | |_{L^{2}} \leq γ) ≃ γ^{- (2 d / (2 α - d))} = γ^{- τ}, γ \to 0.$ 4.22

Here, restricting to $γ \in (0, 1)$ , the two-sided equivalence constants depend only on $α, d$ . Setting

$γ = z N^{- b}, z > 0,$ 4.23

and noting that $b τ = 1$ , we hence obtain that for some constants $c_{l}, c_{u} > 0$ ,

$e^{- c_{l} z^{- τ} N} \leq Π (| | θ | |_{L^{2}} \leq z N^{- b}) \leq e^{- c_{u} z^{- τ} N}, any z > 0.$ 4.24

We now show that as long as $D / N \approx κ > 0$ , one may use the above asymptotics to derive the desired small ball probabilities for the projected prior on $R^{D}$ .

We obviously have, by set inclusion and projection,

$Π (| | θ | |_{R^{D}} \leq z N^{- b}) \geq Π (| | θ | |_{L^{2}} \leq z N^{- b}),$

and hence it only remains to show the first inequality in equation (4.20). The Gaussian isoperimetric theorem (theorem 2.6.12 in [61]) and (4.24) imply that for $m \geq 4 \sqrt{c_{l}}$ and some $c > 0$ , we have that (with $Φ$ denoting the c.d.f. for $N (0, 1)$ )

$\begin{aligned} Π (θ = θ_{1} + θ_{2}, | | θ_{1} | |_{L^{2}} \leq z N^{- b}, | | θ_{2} | |_{h^{α}} \leq m z^{- τ / 2} \sqrt{N}) \\ \geq Φ (Φ^{- 1} (Π ({θ : | | θ | | \leq z N^{- b}})) + m z^{- τ / 2} \sqrt{N}) \\ \geq Φ (- \sqrt{2 c_{l}} z^{- τ / 2} \sqrt{N} + m z^{- τ / 2} \sqrt{N}) \geq 1 - e^{- c z^{- τ} N}, \end{aligned}$

(see also the proof of lemma 5.17 in [19] for a similar calculation). Then if the event in the last probability is denoted by $I$ we have

$Π (| | θ_{D} | |_{R^{D}} \leq z N^{- b}) \leq Π (| | θ_{D} | |_{R^{D}} \leq z N^{- b}, I) + e^{- c z^{- τ} N} .$

On $I$ , if $D / N \to κ > 0$ and by the usual tail estimate for vectors in $h^{α}$ , we have for some $c^{'} > 0$ the bound

$| | θ - θ_{D} | |_{L^{2}} \leq | | θ_{1} | |_{L^{2}} + c^{'} D^{- α / d} z^{- τ / 2} \sqrt{N} \leq z N^{- b} + c^{'} κ^{- α / d} z^{- τ / 2} N^{- b},$

so that for any $z > 0$ ,

$\begin{aligned} Π (| | θ_{D} | |_{R^{D}} \leq z N^{- b}) \leq Π (| | θ | |_{L^{2}} \leq z N^{- b} + | | θ - θ_{D} | |_{L^{2}}, I) + e^{- c z^{- τ} N} \\ \leq Π (| | θ | |_{L^{2}} \leq (2 z + c^{'} κ^{- α / d} z^{- τ / 2}) N^{- b}) + e^{- c z^{- τ} N} \\ \leq e^{- c_{u} {(2 z + c^{'} κ^{- α / d} z^{- τ / 2})}^{- τ} N} + e^{- c z^{- τ} N}, \end{aligned}$

and hence the lemma follows by appropriately choosing $c_{0} > 0$ .

Remark 4.6. —

For statistical consistency proofs in nonlinear inverse problems, often rescaled Gaussian priors are used to provide additional regularization [12,13,19]. For these priors a computation analogous to the previous lemma is valid: specifically if we rescale $θ$ by $\sqrt{N} δ_{N}$ , where $δ_{N} = N^{- α / (2 α + d)}$ so that $\sqrt{N} δ_{N} = N^{(d / 2) / (2 α + d)} = N^{k}$ , then we just take $N^{- β + k} = N^{- b}$ in the above small ball computation, that is $- b = - β + k$ or $b = β - k$ , and the same bounds (as well as the proof to follow) apply.

(ii) . Proof of theorem 3.3, part (iv)

Lemma 4.5 and the hypotheses on $η$ immediately imply

\begin{aligned} Π (θ \in Θ_{s, η}) = Π (| | θ | |_{R^{D}} \in (s_{b} N^{- b}, s_{b} N^{- b} + η)) \\ \leq Π (| | θ | |_{R^{D}} \leq 2 s_{b} N^{- b}) \leq e^{- c_{0} N {(2 s_{b} + κ^{- α / d} {(2 s_{b})}^{- τ / 2})}^{- τ}} . \end{aligned}

To lower bound $Π (Θ_{N^{- b}, ε N^{- b}})$ , we choose $ε$ large enough such that

{\bar{c}}_{0} {(1 + ε)}^{- τ} < c_{0} {(1 + κ^{- α / d})}^{- τ},

which implies for all $N$ large enough that

\begin{aligned} Π (| | θ | |_{R^{D}} \in (N^{- b}, (1 + ε) N^{- b})) & = Π (| | θ | |_{R^{D}} \leq (1 + ε) N^{- b}) - Π (| | θ | |_{R^{D}} \leq N^{- b})) \\ \geq e^{- {\bar{c}}_{0} {(1 + ε)}^{- τ} N} - e^{- c_{0} {(1 + κ^{- α / d})}^{- τ} N} \\ \geq e^{- 2 {\bar{c}}_{0} {(1 + ε)}^{- τ} N} . \end{aligned}

4.25

Now, for $w$ from (4.6), we set

t = t_{b} N^{- b}, ρ \in (0, 1], 0 < t_{b} < s_{b} < \frac{1}{2} < L < \infty, T = T_{b} N^{b},

4.26

for $T_{b}$ to be chosen and $ρ, L$ , $s_{b}$ , $t_{b}$ fixed constants, so that $| | w | |_{\infty}$ is bounded (uniformly in $N$ ) by a constant which depends only on $T_{b}$ , $L$ , $ρ$ , whence (4.14) holds. Now the key inequality (4.15) with $s = s_{b} N^{- b}$ and with our choice of $η$ , $ε$ , $σ = N^{- b}$ will be satisfied if

c_{0} {(2 s_{b} + κ^{- α / d} {(2 s_{b})}^{- τ / 2})}^{- τ} \geq 2 {\bar{c}}_{0} {(1 + ε)}^{- τ} + 2 ν + \frac{ρ}{2} N^{- b} (1 + ε - s_{b}) .

4.27

We define $ν$ to equal to $1 / 3$ of the l.h.s. so that (4.27) will follow for the given $s_{b}, κ, α, d$ by choosing $ε$ large enough and whenever $N$ is large enough.

Finally, let us note that with $Q = C \sqrt{N}$ for some $C \geq 2 E [| | θ | |_{ℓ_{2}}]$ , where $θ$ is the infinite Gaussian vector with RKHS $h^{α}$ , we can deduce from theorem 2.1.20 and exercise 2.1.5 in [61] that

Pr (| | θ | |_{R^{D}} \geq Q) \leq 2 \exp (- c C^{2} N / 2), some c > 0.

Thus, using also (4.25), choosing $C$ large enough verifies (4.11). Since (4.25) and the a.s. boundedness of $sup_{θ} | ℓ_{N} (θ) |$ for $ℓ_{N}$ from (4.3) imply that $Π (Θ_{N^{- b}, ε N^{- b}} | Z^{(N)}) > 0$ a.s., proposition 4.2 and then also proposition 4.4 apply for this prior, and the arguments from §4c(ii) yield the desired result.

(iii) . Proof of theorem 3.3, part (iii)

We finish the proof of the theorem by showing point $(iii)$ . We use the setting and choices from the previous section. Let us write $G (A) = \int_{A} e^{ℓ_{N} (θ)} d Π (θ)$ for any measurable set $A$ . Recall the notation $B_{r} = {θ : | | θ | |_{R^{D}} \leq r}, r > 0$ . Repeating the argument leading to (4.17) with $B_{t / 2}$ in place of $Θ_{r, ϵ}$ , and using lemma 4.5, we have with high probability

\frac{1}{N} \log G (B_{t / 2}) \geq - \frac{1}{2} - \frac{sup_{r \leq t_{b} N^{- b} / 2} w (r)}{2} - {\bar{c}}_{0} (\frac{t_{b}}{2})^{- τ} + ω_{N}^{'} (t / 2),

where $ω_{N}^{'} (t / 2) = O (| | w | |_{\infty} / \sqrt{N}) = O (1)$ . Likewise, we also have

\frac{1}{N} \log G (B_{s}^{c}) \leq - \frac{1}{2} - \frac{inf_{r \geq s_{b} N^{- b}} w (r)}{2} + \frac{1}{N} \log Π (B_{s}^{c}) + ω_{N}^{″} (s),

where $ω_{N}^{″} (s) = O (| | w | |_{\infty} / \sqrt{N}) = O (1)$ . We can assume that $G (B_{s}^{c}) > 0$ . Hence, since $Π (B_{s}^{c}) \to 1$ in view of lemma 4.5,

\begin{aligned} \frac{1}{N} \log \frac{G (B_{t / 2})}{G (B_{s}^{c})} \geq - \frac{{(T t)}^{2}}{2} - c_{0} (\frac{t_{b}}{2})^{- τ} + \frac{{(T t)}^{2} + (T t / 2) + ρ (s - t)}{2} + \frac{1}{N} \log Π (B_{s}^{c}) + O (1) \\ \geq \frac{T_{b} t_{b}}{4} - c_{0} (\frac{t_{b}}{2})^{- τ} + O (1) . \end{aligned}

4.28

Now, for $t_{b} < s_{b}$ fixed we can choose $T_{b}$ large enough such that the last quantity exceeds $1$ with high probability (in particular this retrospectively justifies the last $O (1)$ as then $| | w | |_{\infty} = O (1)$ for our choice of $T_{b}$ ). Therefore, again with high probability

\frac{G (B_{t / 2})}{G (B_{s}^{c})} \geq e^{N} \times (1 + O (1)) .

4.29

For $M_{t, s} = {θ : t / 2 < | | θ | |_{R^{D}} \leq s}$ this further implies that with high probability

\frac{G (B_{t / 2}) + G (M_{t, s})}{G (B_{s}^{c})} \geq e^{N} \times (1 + O (1)),

and then,

\begin{aligned} Π (B_{s} | Z^{(N)}) & = \frac{G (B_{t / 2}) + G (M_{t, s})}{G (B_{t / 2}) + G (M_{t, s}) + G (B_{s}^{c})} \\ = \frac{G (B_{t / 2}) + G (M_{t, s})}{(G (B_{t / 2}) + G (M_{t, s})) (1 + (G (B_{s}^{c}) / G (B_{t / 2})) + G (M_{t, s}))} \to 1, \end{aligned}

again with high probability, which is what we wanted to show.

Remark 4.7. —

If the map $w$ is globally convex, say $w (s) = T s^{2} / 2$ for all $s > 0$ , then a ‘large enough’ choice of $ε$ after (4.27) is not possible. It is here where global log-concavity of the likelihood function helps, as it enforces a certain ‘uniform’ spread of the posterior across its support via a global coercivity constant $T$ . By contrast the above example of $w$ is not convex, rather it is very spiked on $(0, t / 2)$ and then ‘flattens out’.

(e) . Proof of theorem 3.2

The proof of theorem 3.2 proceeds along the same lines as that of theorem 3.3, with scaling $t, L, ρ, s, η$ constant in $N$ , corresponding to $b = 0$ in $N^{- b}$ , and replacing the volumetric lemma 4.5 by the following basic result.

Lemma 4.8. —

Let $θ \sim N (0, I_{D} / D)$ . Let $a \in (0, 1 / 2)$ . Then for all $D \geq D_{0} (a)$ large enough,

$- \frac{1}{D} \log Π (| | θ | |_{R^{D}} \leq z) \geq \frac{1}{2} (\frac{z^{2}}{2} - \log z - \frac{1}{2}), any z \in (0, 1 - a) .$ 4.30

A proof of (4.30) is sketched in appendix B. As a consequence of the previous lemma

\frac{1}{N} \log Π (Θ_{s, η}) \leq \frac{1}{N} \log Π (B_{2 s}) \leq \frac{κ}{2} (\log 2 s - 2 s^{2} + \frac{1}{2}) .

Moreover, to lower bound $Π (Θ_{2 / 3, ε})$ , we choose $ε > 2 / 3$ . Then, using theorem 2.5.7 in [61] as well as $E | | θ | | \leq E {(| | θ | |^{2})}^{1 / 2} = 1$ , and then also (4.30) with $z = 2 / 3$ , we obtain that

\begin{aligned} Π (Θ_{2 / 3, ε}) & \geq Π (| | | θ | |_{R^{D}} - 1 | \leq \frac{1}{3}) \\ \geq 1 - Π (| | θ | |_{R^{D}} \geq E | | θ | |_{R^{D}} + \frac{1}{3}) - Π (| | θ | |_{R^{D}} \leq \frac{2}{3}) \\ \geq 1 - \exp (- D / 18) - \exp (- c D), \end{aligned}

for some fixed constant $c > 0$ given by (4.30), whence $Π (Θ_{2 / 3, ε}) \to 1$ and also $N^{- 1} \log Π (Θ_{2 / 3, ε}) \to 0$ . Therefore, the key inequality (4.15) with $σ = 2 / 3$ , $ν = 1$ holds whenever we choose $s = s_{0}$ small enough such that

- \log 2 s_{0} > 2 κ^{- 1} [2 + \frac{ρ}{2} (s_{0} - \frac{2}{3} - ε)] + 2 s_{0}^{2} + \frac{1}{2} .

The rest of the detailed derivations follow the same pattern as in the proof of theorem 3.3 and are left to the reader, including verification of (4.11) via an application of theorem 2.5.7 in [61]. In particular, the proof of part (iii) follows the same arguments (suppressing the $N^{- b}$ scaling everywhere) as in theorem 3.3.

(f) . Proofs for §3b

In this section, we prove the results of §3b which detail the consequences of the general theorems 3.2 and 3.3 for practical MCMC algorithms.

(i) . Proofs for pCN

Theorem 3.6 is proved by verifying the assumption 3.1 for suitable choices of $η$ and $L$ , and for $c_{0} = κ / 2 > 0$ .

Lemma 4.9. —

Let $P_{N}$ denote the transition kernel of pCN from (3.7) with parameter $β > 0$ .

(i)
Suppose $Π = N (0, I_{D} / D)$ as in theorem 3.2, and let $Q, η > 0$ . Then for all $β \leq min {1 / 2, η / 4 Q, η^{2} / 64}$ and all $D \geq 1$ , we have (with $P_{0}^{N}$ -probability 1)
$sup_{θ \in B_{Q}} P_{N} (θ, {ϑ : | | θ - ϑ | |_{R^{D}} \geq \frac{η}{2}}) \leq e^{- D / 2} .$

(ii)
Suppose $Π = N (0, Σ_{α})$ as in theorem 3.3, and let $Q, η > 0$ . There exists some $c > 0$ such that for all $β \leq min {1 / 2, η / (4 Q), c η^{2} / D}$ and all $D \geq 1$ , we have (with $P_{0}^{N}$ -probability 1)
$sup_{θ \in B_{Q}} P_{N} (θ, {ϑ : | | θ - ϑ | |_{R^{D}} \geq \frac{η}{2}}) \leq e^{- D / 2} .$

Proof of lemma 4.9. —

We begin with the proof of part (ii). Let $| | ϑ_{k} | |_{R^{D}} \leq Q$ . Then using the definition of pCN and that $| \sqrt{1 - β} - 1 | \leq β$ for any $β \in [0, 1]$ (Taylor expanding $\sqrt{\cdot}$ around $1$ ), we obtain that for any $β \leq min {1 / 2, η / 4 Q}$ ,

$\begin{aligned} Pr (| | ϑ_{k + 1} - ϑ_{k} | |_{R^{D}} \geq \frac{η}{2}) \leq Pr (| | p_{k + 1} - ϑ_{k} | |_{R^{D}} \geq \frac{η}{2}) \\ \leq Pr (| | (\sqrt{1 - β} - 1) ϑ_{k} | |_{R^{D}} + \sqrt{β} | | ξ_{k} | |_{R^{D}} \geq \frac{η}{2}) \\ \leq Pr (| | ξ_{k} | |_{R^{D}} \geq \frac{(η / 2 - β Q)}{\sqrt{β}}) \\ \leq Pr (| | ξ_{k} | |_{R^{D}} \geq \frac{η}{4 \sqrt{β}}) \\ = Pr (| | ξ_{k} | |_{R^{D}} - E | | ξ_{k} | |_{R^{D}} \geq \frac{η}{4 \sqrt{β}} - E | | ξ_{k} | |_{R^{D}}) . \end{aligned}$

The variables $ξ_{k}$ are equal in law to a vector with components $(i^{- α / d} g_{i} : i \leq D)$ for $g_{i}$ iid $N (0, 1)$ and hence $E | | ξ_{k} | |_{R^{D}} \leq {(E | | ξ_{k} | |_{R^{D}}^{2})}^{1 / 2} \leq C (α, d) < \infty$ for $α > d / 2$ . Then, for $β \leq c η^{2} / D$ with some sufficiently small $c > 0$ (noting that then also $β \leq c η^{2}$ ), it holds that

$\begin{aligned} Pr (| | ϑ_{k + 1} - ϑ_{k} | |_{R^{D}} \geq \frac{η}{2}) & \leq Pr (| | ξ_{k} | |_{R^{D}} - E | | ξ_{k} | |_{R^{D}} \geq \frac{η}{8 \sqrt{β}}) \\ \leq \exp (- \frac{η^{2}}{64 β}) \leq \exp (\frac{- D}{2}), \end{aligned}$ 4.31

using, e.g. theorem 2.5.8 in [61] (and representing the $| | \cdot | |_{R^{D}}$ -norm by duality as a supremum). This completes the proof of part (ii).

The proof of part (i) is similar, albeit simpler, whence we leave some details to the reader. Arguing similarly as before, we obtain that for any $β \leq min {1 / 2, η / 64 Q}$ ,

$Pr (| | ϑ_{k + 1} - ϑ_{k} | |_{R^{D}} \geq \frac{η}{2}) \leq Pr (| | ξ_{k} | |_{R^{D}} \geq \frac{(η / 2 - β L)}{\sqrt{β}}) \leq Pr (| | g_{k} | |_{R^{D}} \geq \frac{η \sqrt{D}}{4 \sqrt{β}}),$

where $g_{k}$ is a $N (0, I_{D})$ random variable. The latter probability is bounded by a standard deviation inequality for Gaussians, see, e.g. theorem 2.5.7 in [61]. Indeed, noting that $E | | ξ_{k} | |_{R^{D}} \leq {(E [| | ξ_{k} | |_{R^{D}}^{2}])}^{1 / 2} = \sqrt{D}$ , and that the one-dimensional variances satisfy $E {⟨ g_{k}, v ⟩}^{2} = | | v | |_{R^{D}}^{2} = 1$ for any $| | v | |_{R^{D}} = 1$ , we obtain

$\begin{aligned} Pr (| | g_{k} | |_{R^{D}} \geq \frac{η \sqrt{D}}{4 \sqrt{β}}) & \leq Pr (| | | ξ_{k} | |_{R^{D}} - E | | ξ_{k} | |_{R^{D}} | \geq \sqrt{D} (\frac{η}{4 \sqrt{β}} - 1)) \\ \leq \exp (- \frac{D}{2} {(\frac{η}{4 \sqrt{β}} - 1)}^{2}) \leq \exp (- \frac{D}{2}) . \end{aligned}$

Proof of theorem 3.6. —

We begin with part (ii). Let $s_{b}$ be as in theorem 3.3 and set $η = η_{N} = s_{b} N^{- b} / 2$ as well as $Q = Q_{N} C \sqrt{N}$ , where $C$ is as in theorem 3.3. With those choices, lemma 4.9 (ii) implies that assumption 3.1 is fulfilled with $c_{0} = κ / 2$ , so long as $β$ satisfies

$β \leq min {\frac{1}{2}, \frac{s_{b} N^{- b}}{8 C \sqrt{N}}, \frac{c s_{b}^{2} N^{- 2 b}}{4 D}} ≲ N^{- 2 b} D^{- 1} ≃ N^{- 1 - 2 b} .$

Hence, the desired result immediately follows from an application of theorem 3.3 (iv).

Part (i) of theorem 3.6 similarly follows from verifying assumption 3.1 with $s \in (0, 1 / 3)$ , $Q$ from theorem 3.2, $η = s / 2$ and for small enough $β < c_{1}$ (with $c_{1}$ determined by lemma 4.9 (i)), and subsequently applying theorem 3.2 (iv).

(ii) . Proofs for MALA

Theorem 3.7 is proved by verifying the hypotheses of theorems 3.2 and 3.3, respectively. A key difference between pCN and MALA is that the proposal kernels for MALA, not just its acceptance probabilities, depend on the data $Z^{(N)}$ itself. Again, we begin by examining part (ii) which regards $N (0, Σ_{α})$ priors.

Proof of theorem 3.7, part (ii). —

We begin by deriving a bound for the gradient $\nabla \log π (\cdot | Z^{(N)})$ . For Lebesgue-a.e. $θ \in R^{D}$ , recalling that $vol (X) = 1$ , we have that

$E_{0}^{N} [\nabla ℓ_{N} (θ)] = - \frac{N}{2} w^{'} (| | θ | |) \frac{θ}{| | θ | |} | | g | |_{L^{2}}^{2}$

and

$\begin{aligned} \nabla ℓ_{N} (θ) & = \frac{1}{2} \sum_{i = 1}^{N} (ε_{i} - \sqrt{w (| | θ | |)} g (X_{i})) \frac{w^{'} (| | θ | |)}{2 \sqrt{w (| | θ | |)}} \frac{θ}{| | θ | |} g (X_{i}), \\ = \frac{w^{'} (| | θ | |)}{4 \sqrt{w (| | θ | |)}} \frac{θ}{| | θ | |} \sum_{i = 1}^{N} ε_{i} g (X_{i}) - \frac{w^{'} (| | θ | |)}{4} \frac{θ}{| | θ | |} \sum_{i = 1}^{N} g^{2} (X_{i}) . \end{aligned}$

For any $r \in (0, t / 2) \cup (t / 2, t) \cup (t, L) \cup (L, \infty)$ , recalling the choices for $T, t, ρ$ in (4.26) we see that

$\begin{aligned} \frac{w^{'} (r)}{\sqrt{w (r)}} & = \frac{8 T r}{2 T r} 1_{(0, t / 2)} (r) + \frac{T}{\sqrt{w (r)}} 1_{(t / 2, t)} (r) + \frac{ρ}{\sqrt{w (r)}} 1_{(t, L)} (r), \\ ≲ 1 + N^{b} + 1, \end{aligned}$ 4.32

where, to bound the second and third term, we used that $\sqrt{w (r)} \geq T t = t_{b} T_{b} > 0$ is bounded away from zero uniformly in $N$ on $(t / 2, \infty)$ . Similarly, we have

$| | w^{'} | |_{\infty} \leq \frac{T t}{2} + T + ρ ≲ N^{b} .$

Combining the above and using Chebyshev’s inequality, it follows that

$\begin{aligned} sup_{θ \in R^{D}} | | \nabla ℓ_{N} (θ) | |_{R^{D}} & ≲ N^{b} (| \sum_{i = 1}^{N} ε_{i} g (X_{i}) | + \sum_{i = 1}^{N} g^{2} (X_{i})) \\ \leq N^{b} (| | g | |_{\infty} | \sum_{i = 1}^{N} ε_{i} | + \sum_{i = 1}^{N} (g^{2} (X_{i}) - | | g | |_{L^{2}}^{2}) + N | | g | |_{L^{2}}^{2}) \\ \leq N^{b} (O_{P} (\sqrt{N}) + O (N)) \\ = O (N^{1 + b}) + O (N^{1 + b}) . \end{aligned}$

Thus, the event

$A := {sup_{θ \in R^{D}} | | \nabla ℓ_{N} (θ) | |_{R^{D}} \leq C^{'} N^{1 + b}},$

for some large enough $C^{'} > 0$ , has probability $P_{0}^{N} (A) \to 1$ as $N \to \infty$ . We also verify that

$\nabla \log π (θ) = - \frac{1}{2} \nabla θ^{T} Σ_{α}^{- 1} θ = - Σ_{α}^{- 1} θ,$ 4.33

so that with $Q = Q_{N} = C \sqrt{N}$ (for $C$ as in theorem 3.3) and recalling that $Σ_{α} = diag (1, \dots, D^{- 2 α})$ , we obtain

$sup_{| | θ | | \leq Q} | | \nabla \log π (θ) | |_{R^{D}} = sup_{| | θ | | \leq Q} | | Σ_{α}^{- 1} θ | |_{R^{D}} ≲ D^{2 α} \sqrt{N} ≃ N^{2 α + 1} .$

Now, let $s_{b}$ also be as in theorem 3.3 and set $η = η_{N} = (1 / 2) s_{b} N^{- b}$ (note that this is a permissible choice in theorem 3.3). Furthermore, for a small enough constant $c > 0$ , let $γ \leq c N^{- 1 - 2 α - b}$ . Then since $α > b$ , we also have that

$γ ≲ min {N^{- 1 - 2 α - b}, N^{- 1 - 2 b}, N^{- 1 / 2 - b}} .$ 4.34

Hence, on the event $A$ and whenever $| | θ | |_{R^{D}} \leq Q$ ,

$γ | | \nabla \log π (ϑ_{k} | Z^{(N)}) | |_{R^{D}} ≲ γ (N^{1 + b} + N^{1 + 2 α}) ≲ η .$

Using this, (4.34) and choosing $c > 0$ small enough, conditional on the event $A$ the probability $Pr (\cdot)$ under the Markov chain satisfies

$\begin{aligned} Pr (| | p_{k + 1} - ϑ_{k} | | \geq \frac{η}{2}) & \leq Pr (γ | | \nabla \log π (ϑ_{k} | Z^{(N)}) | |_{R^{D}} \geq \frac{η}{4}) + Pr (\sqrt{2 γ} | | ξ_{k + 1} | |_{R^{D}} \geq \frac{η}{4}) \\ \leq Pr (| | ξ_{k + 1} | |_{R^{D}} \geq \frac{η}{4 \sqrt{2 γ}}) \\ \leq Pr (| | ξ_{k + 1} | |_{R^{D}} - E | | ξ_{k + 1} | |_{R^{D}} \geq \sqrt{N}) \leq \exp (- \frac{N}{2}), \end{aligned}$

where the last inequality is proved as in (4.31) above, using theorem 2.5.8 in [61]. Thus, assumption 3.1 is satisfied with $c_{0} = 1$ and the proof is complete.

Proof of theorem 3.7, part (i). —

The proof of part (i) proceeds along the same lines, except that (4.32) and (4.33) are replaced with the bound

$| | \frac{w^{'}}{\sqrt{w}} | | s_{\infty} + | | w^{'} | |_{\infty} < C,$

for some constant $C$ independent of $N$ , as well as the bound

$\nabla \log π (θ) = - \frac{D}{2} \nabla | | θ | |^{2} = - D θ, sup_{| | θ | | \leq Q} | | \nabla \log π (θ) | |_{R^{D}} ≃ N Q .$

Then letting $s \in (0, 1 / 3)$ and $Q > 0$ be as in theorem 3.2, and fixing an arbitrary $η \in (0, s / 2)$ , the above implies that for sufficiently small constant $c > 0$ and for any $γ \leq c / N$ , it holds that

$\begin{aligned} Pr (| | p_{k + 1} - ϑ_{k} | | \geq \frac{η}{2}) & \leq Pr (γ | | \nabla \log π (ϑ_{k} | Z^{(N)}) | |_{R^{D}} \geq \frac{η}{4}) + Pr (\sqrt{2 γ} ξ_{k + 1} \geq \frac{η}{4}) \\ \leq Pr (ξ_{k + 1} \geq \frac{η}{4 \sqrt{2 γ}}) \\ \leq Pr (ξ_{k + 1} \geq \frac{η \sqrt{κ D}}{4 \sqrt{2 c}}) . \end{aligned}$

Thus, choosing $c > 0$ small enough and arguing exactly as in the last step of the proof of theorem 3.6, part (i), assumption 3.1 is satisfied with $c_{0} = 1$ and the proof is complete.

Acknowledgements

R.N. would like to thank the Forschungsinstitut für Mathematik (FIM) at ETH Zürich for their hospitality during a sabbatical visit in spring 2022 where this research was initiated.

Appendix A. Proofs of §2

Proof of corollary 2.4. —

We fix $K = 1$ and place ourselves under the event of proposition 2.3, and we denote $s = s (λ)$ and $t = t (λ)$ . We can decompose, since $Π [S_{s} | Y] = 1 - Π [T_{s} | Y]$ :

$Π (T_{s} | Y) = [1 + \frac{Π (S_{s} | Y)}{Π (T_{s} | Y)}]^{- 1} .$

Moreover, $Π (S_{s} | Y) / Π (T_{s} | Y) = Π (S_{s} | Y) / [Π (T_{t} | Y) + Π (W_{s, t} | Y)] \leq Π (S_{s} | Y) / Π (T_{t} | Y)$ . Using proposition 2.3, for $n \geq n_{0} (λ, Y)$ we have $Π (S_{s} | Y) / Π (T_{s} | Y) \leq \exp {- n}$ . Therefore, $Π [T_{s} | Y] \geq {(1 + \exp {- n})}^{- 1}$ , which ends the proof.

Proof. —

The rest of this section is devoted to proving proposition 2.3. We use a uniform bound on the injective norm of Gaussian tensors:

Lemma A.1 —

For all $p \geq 3$ there exists a constant $C_{p}$ , such that:

$\underset{n \to \infty}{lim sup} {n^{- 1 / 2} max_{x \in S^{n - 1}} | ⟨ x^{\otimes p}, Z ⟩ |} \leq C_{p}, almost surely .$ A 1

This lemma is a very crude version of much finer results: in particular the exact value of the constant $μ_{p}$ such that (w.h.p.) $max_{x \in S^{n - 1}} | ⟨ x^{\otimes p}, Z ⟩ | = \sqrt{n} μ_{p} (1 + O_{n} (1))$ has been first computed non-rigorously in [62], and proven in full generality in [63] (see also discussions in [33,34]). In the rest of this proof, we assume to have conditioned on equation (A 1). For any $0 \leq s < t \leq 1$ , we have for $n \geq n_{0} (Y)$ :

$\begin{aligned} \frac{Π (S_{s} | Y)}{Π (T_{t} | Y)} & = \frac{\int_{S_{s}} \exp (ℓ_{Y} (x)) d Π (x)}{\int_{T_{t}} \exp (ℓ_{Y} (x)) d Π (x)} \\ \leq e^{n λ C_{p}} \frac{\int_{S_{s}} \exp ((n / 2) λ^{2} {⟨ x, x_{0} ⟩}^{p}) d Π (x)}{\int_{T_{t}} \exp ((n / 2) λ^{2} {⟨ x, x_{0} ⟩}^{p}) d Π (x)}, \\ \leq \exp (n λ C_{p} + \frac{n λ^{2}}{2} [s^{p} - t^{p}]) \frac{Π (S_{s})}{Π (T_{t})} . \end{aligned}$ A 2

We upper bound $Π (S_{s}) \leq Π (S^{n - 1}) = 1$ . To lower bound $Π (T_{t})$ , we use the elementary fact (which is easy to prove using spherical coordinates):

$Π (T_{t}) = c_{p} I_{(1 - t) / 2} [\frac{(n - 1)}{2}, \frac{(n - 1)}{2}],$ A 3

in which $I_{x} (a, b) = \int_{0}^{x} u^{a - 1} {(1 - u)}^{b - 1} d u / \int_{0}^{1} u^{a - 1} {(1 - u)}^{b - 1} d u$ is the incomplete beta function, and $c_{p} = 1$ for odd $p$ and $c_{p} = 2$ for even $p$ . It is then elementary analysis (cf. e.g. [34]) that

$lim_{n \to \infty} \frac{1}{n} \log Π (T_{t}) = \frac{1}{2} \log (1 - t^{2}),$ A 4

uniformly in $t \in [0, 1)$ . Coming back to equation (A 2), this implies that we have, for any $s < t < 1$ :

$\underset{n \to \infty}{lim sup} \frac{1}{n} \log \frac{Π (S_{s} | Y)}{Π (T_{t} | Y)} \leq λ C_{p} + \frac{λ^{2}}{2} [s^{p} - t^{p}] - \frac{1}{2} \log (1 - t^{2}) .$ A 5

Let $K > 0$ . It is then elementary to see that it is possible to construct $0 \leq s (λ) < t (λ) < 1$ with $lim_{λ \to \infty} {s (λ), t (λ)} = 1$ , and such that the right-hand side of equation (A 5) becomes smaller than $- K$ as $λ \to \infty$ .

Appendix B. Small ball estimates for isotropic Gaussians

Let $Π = N (0, I_{D} / D)$ . In this section, we prove equation (4.30), more precisely we show:

Lemma B.1. —

Let $a \in (0, 1)$ . Then for all $D \geq D_{0} (a)$ large enough, one has for all $z \in (0, 1 - a)$ :

$- \frac{1}{D} \log Π (| | θ | |_{2} \leq z) \geq \frac{1}{2} (\frac{z^{2}}{2} - \log z - \frac{1}{2}) .$ B 1

Proof of lemma B.1. —

Let $f (x) = - x^{2} / 2 + \log x + 1 / 2$ , so that $f$ reaches its maximum in $x = 1$ , with $f (1) = 0$ . By decomposition into spherical coordinates and isotropy of the Gaussian measure, one has directly:

$Π (| | θ | |_{2} \leq z) = \frac{vol (S^{D - 1})}{{(2 π / D)}^{D / 2}} \int_{0}^{z} d r e^{- (D r^{2} / 2) + (D - 1) \log r} .$ B 2

Recall that $vol (S^{D - 1}) = 2 π^{D / 2} / Γ (D / 2)$ , so one reaches easily:

$c_{D} = \frac{1}{D} \log \frac{vol (S^{D - 1})}{{(2 π / D)}^{D / 2}} - \frac{1}{2} = \frac{\log D}{2 D} + O (\frac{1}{D}) .$ B 3

In particular, one has for all $D$ large enough (not depending on $z$ ):

$\frac{1}{D} \log Π (| | θ | |_{2} \leq z) \leq \frac{1}{D} \log \int_{0}^{z} d r e^{- (r^{2} / 2) + (D - 1) f (r)} + c_{D} .$ B 4

Since $f$ is increasing on $(0, 1)$ , we have for large enough $D$ :

$\begin{aligned} \frac{1}{D} \log Π (| | θ | |_{2} \leq z) & \leq (1 - \frac{1}{D}) f (z) + c_{D} + \frac{1}{D} \log \int_{0}^{\infty} d r e^{- r^{2} / 2}, \end{aligned}$ B 5

$\begin{aligned} \leq (1 - \frac{1}{D}) f (z) + \frac{\log D}{D} . \end{aligned}$ B 6

Since $f (1 - a) < 0$ , let $D \geq D_{0} (a)$ large enough such that $f (1 - a) \leq - 2 \log D / (D - 2)$ . Then for all $z \leq 1 - a$ , one has $f (z) \leq - 2 \log D / (D - 2)$ . Plugging it in the inequality above, we reach that for all $z \in (0, 1 - a)$ :

$\frac{1}{D} \log Π (| | θ | |_{2} \leq z) \leq \frac{1}{2} f (z) .$ B 7

Footnotes

As classical in statistical physics, we call free entropy the negative of the free energy.

In a physical system, these regions would correspond respectively to a region including a meta-stable state, a region including the globally stable state and a free energy barrier.

Note that we assume here that the statistician has access to the distribution $Π (\cdot | Y)$ (and in particular to $λ$ ), a setting sometimes called Bayes-optimal in the literature.

⁴

Often the $O_{n} (1)$ term will be exponentially small, but we will not require such a strong control.

⁵

As we will detail in the following sections, see assumption 3.1, the statements remain true if the change is allowed to be higher than the required maximum with exponentially small probability.

Data accessibility

This article has no additional data.

Authors' contributions

R.N.: conceptualization, writing—original draft; A.S.B.: conceptualization, writing—original draft; A.M.: conceptualization, writing—original draft; S.W.: conceptualization, writing—original draft.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

RN was supported by the EPSRC programme grant on Mathematics of Deep Larning, project: EP/V026259.

References

1.Anderson PW. 1989. Spin glass VI: spin glass as cornucopia. Phys. Today 42, 9. [Google Scholar]
2.Mézard M, Montanari A. 2009. Information, physics, and computation. Oxford, UK: Oxford University Press. [Google Scholar]
3.Zdeborová L, Krzakala F. 2016. Statistical physics of inference: thresholds and algorithms. Adv. Phys. 65, 453-552. [Google Scholar]
4.Ben Arous G, Gheissari R, Jagannath A. 2020. Algorithmic thresholds for tensor PCA. Ann. Probab. 48, 2052-2087. ( 10.1214/19-AOP1415) [DOI] [Google Scholar]
5.Ben Arous G, Wein AS, Zadik I. 2020. Free energy wells and overlap gap property in sparse PCA. In Conf. on Learning Theory, Graz, Austria, 9–12 July 2020, pp. 479–482. PMLR.
6.Gibbs JW. 1873. A method of geometrical representation of the thermodynamic properties of substances by means of surfaces. Trans. Conn. Acad. Arts Sci. 2, 382–404.
7.Bandeira AS, Alaoui AE, Hopkins SB, Schramm T, Wein AS, Zadik I. 2022. The Franz-Parisi criterion and computational trade-offs in high dimensional statistics. In Advances in Neural Information Processing Systems (eds AH Oh, A Agarwal, D Belgrave, K Cho). See https://openreview.net/forum?id=mzze3bubjk.
8.Jerrum M. 2003. Counting, sampling and integrating: algorithms and complexity. Lectures in Mathematics ETH Zürich. Basel, Switzerland: Birkhäuser Verlag. [Google Scholar]
9.Kunisky D, Wein AS, Bandeira AS. 2022. Notes on computational hardness of hypothesis testing: predictions using the low-degree likelihood ratio. In ISAAC Congress (International Society for Analysis, its Applications and Computation), Aveiro, Portugal, 29 July–2 Aug 2019, pp. 1–50. Springer.
10.Dalalyan AS. 2017. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B Stat. Methodol. 79, 651-676. ( 10.1111/rssb.12183) [DOI] [Google Scholar]
11.Durmus A, Moulines E. 2019. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25, 2854-2882. ( 10.3150/18-BEJ1073) [DOI] [Google Scholar]
12.Nickl R, Wang S. 2020. On polynomial-time computation of high-dimensional posterior measures by Langevin-type algorithms. J. Eur. Math. Soc. [Google Scholar]
13.Bohr J, Nickl R. 2021. On log-concave approximations of high-dimensional posterior measures and stability properties in non-linear inverse problems. (http://arxiv.org/abs/2105.07835)
14.Nickl R. 2022. Bayesian non-linear statistical inverse problems. ETH Zurich Lecture Notes.
15.Altmeyer R. 2022. Polynomial time guarantees for sampling based posterior inference in high-dimensional generalised linear models. (http://arxiv.org/abs/2208.13296)
16.Nickl R. 2020. Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger equation. J. Eur. Math. Soc. 22, 2697-2750. ( 10.4171/JEMS/975) [DOI] [Google Scholar]
17.Monard F, Nickl R, Paternain GP. 2019. Efficient nonparametric Bayesian inference for $X$ -ray transforms. Ann. Stat. 47, 1113-1147. ( 10.1214/18-AOS1708) [DOI] [Google Scholar]
18.Monard F, Nickl R, Paternain GP. 2021. Statistical guarantees for Bayesian uncertainty quantification in nonlinear inverse problems with Gaussian process priors. Ann. Stat. 49, 3255-3298. ( 10.1214/21-AOS2082) [DOI] [Google Scholar]
19.Monard F, Nickl R, Paternain GP. 2021. Consistent inversion of noisy non-Abelian X-ray transforms. Commun. Pure Appl. Math. 74, 1045-1099. ( 10.1002/cpa.21942) [DOI] [Google Scholar]
20.Stuart AM. 2010. Inverse problems: a Bayesian perspective. Acta Numer. 19, 451-559. ( 10.1017/S0962492910000061) [DOI] [Google Scholar]
21.Fearnhead P, Bierkens J, Pollock M, Roberts GO. 2018. Piecewise deterministic Markov processes for continuous-time Monte Carlo. Stat. Sci. 33, 386-412. ( 10.1214/18-STS648) [DOI] [Google Scholar]
22.Bouchard-Côté A, Vollmer SJ, Doucet A. 2018. The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. 113, 855-867. [Google Scholar]
23.Bierkens J, Grazzi S, Kamatani K, Roberts G. 2020. The boomerang sampler. In Int. Conf. on Machine Learning, Online, 12–18 July 2020, pp. 908–918. PMLR.
24.Wu C, Robert CP. 2020. Coordinate sampler: a non-reversible Gibbs-like MCMC sampler. Stat. Comput. 30, 721-730. ( 10.1007/s11222-019-09913-w) [DOI] [Google Scholar]
25.Scalliet C, Guiselin B, Berthier L. 2022. Thirty milliseconds in the life of a supercooled liquid. Phys. Rev. X 12, 041028. ( 10.1103/PhysRevX.12.041028) [DOI] [PubMed]
26.Grigera TS, Parisi G. 2001. Fast Monte Carlo algorithm for supercooled soft spheres. Phys. Rev. E 63, 045102. ( 10.1103/PhysRevE.63.045102) [DOI] [PubMed] [Google Scholar]
27.Jerrum M. 1992. Large cliques elude the Metropolis process. Random Struct. Algorithms 3, 347-359. ( 10.1002/rsa.3240030402) [DOI] [Google Scholar]
28.Gamarnik D, Zadik I. 2019. The landscape of the planted clique problem: dense subgraphs and the overlap gap property. (http://arxiv.org/abs/1904.07174).
29.Angelini MC, Fachin P, de Feo S. 2021. Mismatching as a tool to enhance algorithmic performances of Monte Carlo methods for the planted clique model. J. Stat. Mech: Theory Exp. 2021, 113406. ( 10.1088/1742-5468/ac3657) [DOI] [Google Scholar]
30.Chen Z, Mossel E, Zadik I. 2023. Almost-linear planted cliques elude the metropolis process. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Florence, Italy, 22–25 January 2023, pp. 4504–4539. Philadelphia, PA: SIAM. ( 10.1137/1.9781611977554.ch171) [DOI]
31.Hukushima K, Nemoto K. 1996. Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Jpn. 65, 1604-1608. ( 10.1143/JPSJ.65.1604) [DOI] [Google Scholar]
32.Angelini MC. 2018. Parallel tempering for the planted clique problem. J. Stat. Mech: Theory Exp. 2018, 073404. [Google Scholar]
33.Richard E, Montanari A. 2014. A statistical model for tensor PCA. In Advances in neural information processing systems 27, Montreal, Canada, 8–13 Dec 2014. Red Hook, NY: Curran Associates.
34.Perry A, Wein AS, Bandeira AS. 2020. Statistical limits of spiked tensor models. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 56, 230–264.
35.Lesieur T, Miolane L, Lelarge M, Krzakala F, Zdeborová L. 2017. Statistical and computational phase transitions in spiked tensor estimation. In 2017 IEEE Int. Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017, pp. 511–515. New York, NY: IEEE.
36.Jagannath A, Lopatto P, Miolane L. 2020. Statistical thresholds for tensor PCA. Ann. Appl. Probab. 30, 1910-1933. ( 10.1214/19-AAP1547) [DOI] [Google Scholar]
37.Baik J, Ben Arous G, Péché S. 2005. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Appl. Probab. 33, 1643-1697. [Google Scholar]
38.Wein AS, El Alaoui A, Moore C. 2019. The Kikuchi hierarchy and tensor PCA. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), Baltimore, MA, 9–12 November 2019, pp. 1446–1468. New York, NY: IEEE.
39.Hopkins SB, Shi J, Steurer D. 2015. Tensor principal component analysis via sum-of-square proofs. In Conf. on Learning Theory, Paris, France, 3–6 July 2015, pp. 956–1006. PMLR.
40.Hopkins SB, Schramm T, Shi J, Steurer D. 2016. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proc. of the Forty-Eighth Annual ACM Symposium on Theory of Computing, Cambridge, MA, 19–21 June 2016, pp. 178–191. New York, NY: ACM.
41.Kim C, Bandeira AS, Goemans MX. 2017. Community detection in hypergraphs, spiked tensor models, and sum-of-squares. In 2017 International Conference on Sampling Theory and Applications (SampTA), Bordeaux, France, 8–12 July 2017, pp. 124–128. New York, NY: IEEE.
42.Sarao Mannelli S, Biroli G, Cammarota C, Krzakala F, Zdeborová L. 2019. Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. Advances in neural information processing systems 32, Vancouver, Canada, 8–14 Dec 2019. Red Hook, NY: Curran Associates.
43.Sarao Mannelli S, Krzakala F, Urbani P, Zdeborová L. 2019. Passed & spurious: descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pp. 4333-4342. PMLR.
44.Biroli G, Cammarota C, Ricci-Tersenghi F. 2020. How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor PCA. J. Phys. A: Math. Theor. 53, 174003. ( 10.1088/1751-8121/ab7b1f) [DOI] [Google Scholar]
45.Ben Arous G, Gheissari R, Jagannath A. 2020. Bounding flows for spherical spin glass dynamics. Commun. Math. Phys. 373, 1011-1048. ( 10.1007/s00220-019-03649-4) [DOI] [Google Scholar]
46.Ben Arous G, Gheissari R, Jagannath A. 2021. Online stochastic gradient descent on non-convex losses from high-dimensional inference. J. Mach. Learn. Res. 22, 106-1. [Google Scholar]
47.Rasmussen CE, Williams CKI. 2006. Gaussian processes for machine learning. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press. [Google Scholar]
48.Ghosal S, van der Vaart AW. 2017. Fundamentals of nonparametric Bayesian inference. New York, NY: Cambridge University Press. [Google Scholar]
49.van der Vaart A, van Zanten JH. 2008. Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Stat. 36, 1435-1463. ( 10.1214/009053607000000613) [DOI] [Google Scholar]
50.Cotter SL, Roberts GO, Stuart AM, White D. 2013. MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28, 424-446. ( 10.1214/13-STS421) [DOI] [Google Scholar]
51.Beskos A, Girolami M, Lan S, Farrell PE, Stuart AM. 2017. Geometric MCMC for infinite-dimensional inverse problems. J. Comput. Phys. 335, 327-351. ( 10.1016/j.jcp.2016.12.041) [DOI] [Google Scholar]
52.Hairer M, Stuart AM, Vollmer SJ. 2014. Spectral gaps for a Metropolis-Hastings algorithm in infinite dimensions. Ann. Appl. Probab. 24, 2455-2490. ( 10.1214/13-AAP982) [DOI] [Google Scholar]
53.Hairer M, Mattingly J, Scheutzow M. 2011. Asymptotic coupling and a general form of Harris’ theorem with applications to stochastic delay equations. Probab. Theory Relat. Fields 149, 223-259. ( 10.1007/s00440-009-0250-6) [DOI] [Google Scholar]
54.Chewi S, Lu C, Ahn K, Cheng X, Le Gouic T, Rigollet P. 2021. Optimal dimension dependence of the Metropolis-Adjusted Langevin Algorithm. In Conference on Learning Theory, Boulder, CO, 15–19 August 2021. PMLR.
55.Roberts GO, Rosenthal JS. 2001. Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 16, 351-367. ( 10.1214/ss/1015346320) [DOI] [Google Scholar]
56.Breyer LA, Piccioni M, Scarlatti S. 2004. Optimal scaling of MaLa for nonlinear regression. Ann. Appl. Probab. 14, 1479-1505. ( 10.1214/105051604000000369) [DOI] [Google Scholar]
57.Mattingly JC, Pillai NS, Stuart AM. 2012. Diffusion limits of the random walk Metropolis algorithm in high dimensions. Ann. Appl. Probab. 22, 881-930. ( 10.1214/10-AAP754) [DOI] [Google Scholar]
58.Franz S, Parisi G. 1995. Recipes for metastable states in spin glasses. J. Phys. I 5, 1401-1415. [Google Scholar]
59.Edmunds DE, Triebel H.. 1996. Function spaces, entropy numbers, differential operators. Cambridge Tracts in Mathematics, vol. 120. Cambridge, UK: Cambridge University Press. [Google Scholar]
60.Li WV, Linde W. 1999. Approximation, metric entropy and small ball estimates for Gaussian measures. Ann. Probab. 27, 1556-1578. ( 10.1214/aop/1022677459) [DOI] [Google Scholar]
61.Giné E, Nickl R.. 2016. Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics. New York, NY: Cambridge University Press. [Google Scholar]
62.Crisanti A, Sommers HJ. 1992. The spherical $p$ -spin interaction spin glass model: the statics. Zeitschrift für Physik B Condensed Matter 87, 341-354. ( 10.1007/BF01309287) [DOI] [Google Scholar]
63.Subag E. 2017. The complexity of spherical $p$ -spin models – a second moment approach. Ann. Probab. 45, 3385-3450. ( 10.1214/16-AOP1139) [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

This article has no additional data.

[RSTA20220150C1] 1.Anderson PW. 1989. Spin glass VI: spin glass as cornucopia. Phys. Today 42, 9. [Google Scholar]

[RSTA20220150C2] 2.Mézard M, Montanari A. 2009. Information, physics, and computation. Oxford, UK: Oxford University Press. [Google Scholar]

[RSTA20220150C3] 3.Zdeborová L, Krzakala F. 2016. Statistical physics of inference: thresholds and algorithms. Adv. Phys. 65, 453-552. [Google Scholar]

[RSTA20220150C4] 4.Ben Arous G, Gheissari R, Jagannath A. 2020. Algorithmic thresholds for tensor PCA. Ann. Probab. 48, 2052-2087. ( 10.1214/19-AOP1415) [DOI] [Google Scholar]

[RSTA20220150C5] 5.Ben Arous G, Wein AS, Zadik I. 2020. Free energy wells and overlap gap property in sparse PCA. In Conf. on Learning Theory, Graz, Austria, 9–12 July 2020, pp. 479–482. PMLR.

[RSTA20220150C6] 6.Gibbs JW. 1873. A method of geometrical representation of the thermodynamic properties of substances by means of surfaces. Trans. Conn. Acad. Arts Sci. 2, 382–404.

[RSTA20220150C7] 7.Bandeira AS, Alaoui AE, Hopkins SB, Schramm T, Wein AS, Zadik I. 2022. The Franz-Parisi criterion and computational trade-offs in high dimensional statistics. In Advances in Neural Information Processing Systems (eds AH Oh, A Agarwal, D Belgrave, K Cho). See https://openreview.net/forum?id=mzze3bubjk.

[RSTA20220150C8] 8.Jerrum M. 2003. Counting, sampling and integrating: algorithms and complexity. Lectures in Mathematics ETH Zürich. Basel, Switzerland: Birkhäuser Verlag. [Google Scholar]

[RSTA20220150C9] 9.Kunisky D, Wein AS, Bandeira AS. 2022. Notes on computational hardness of hypothesis testing: predictions using the low-degree likelihood ratio. In ISAAC Congress (International Society for Analysis, its Applications and Computation), Aveiro, Portugal, 29 July–2 Aug 2019, pp. 1–50. Springer.

[RSTA20220150C10] 10.Dalalyan AS. 2017. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B Stat. Methodol. 79, 651-676. ( 10.1111/rssb.12183) [DOI] [Google Scholar]

[RSTA20220150C11] 11.Durmus A, Moulines E. 2019. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25, 2854-2882. ( 10.3150/18-BEJ1073) [DOI] [Google Scholar]

[RSTA20220150C12] 12.Nickl R, Wang S. 2020. On polynomial-time computation of high-dimensional posterior measures by Langevin-type algorithms. J. Eur. Math. Soc. [Google Scholar]

[RSTA20220150C13] 13.Bohr J, Nickl R. 2021. On log-concave approximations of high-dimensional posterior measures and stability properties in non-linear inverse problems. (http://arxiv.org/abs/2105.07835)

[RSTA20220150C14] 14.Nickl R. 2022. Bayesian non-linear statistical inverse problems. ETH Zurich Lecture Notes.

[RSTA20220150C15] 15.Altmeyer R. 2022. Polynomial time guarantees for sampling based posterior inference in high-dimensional generalised linear models. (http://arxiv.org/abs/2208.13296)

[RSTA20220150C16] 16.Nickl R. 2020. Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger equation. J. Eur. Math. Soc. 22, 2697-2750. ( 10.4171/JEMS/975) [DOI] [Google Scholar]

[RSTA20220150C17] 17.Monard F, Nickl R, Paternain GP. 2019. Efficient nonparametric Bayesian inference for $X$ -ray transforms. Ann. Stat. 47, 1113-1147. ( 10.1214/18-AOS1708) [DOI] [Google Scholar]

[RSTA20220150C18] 18.Monard F, Nickl R, Paternain GP. 2021. Statistical guarantees for Bayesian uncertainty quantification in nonlinear inverse problems with Gaussian process priors. Ann. Stat. 49, 3255-3298. ( 10.1214/21-AOS2082) [DOI] [Google Scholar]

[RSTA20220150C19] 19.Monard F, Nickl R, Paternain GP. 2021. Consistent inversion of noisy non-Abelian X-ray transforms. Commun. Pure Appl. Math. 74, 1045-1099. ( 10.1002/cpa.21942) [DOI] [Google Scholar]

[RSTA20220150C20] 20.Stuart AM. 2010. Inverse problems: a Bayesian perspective. Acta Numer. 19, 451-559. ( 10.1017/S0962492910000061) [DOI] [Google Scholar]

[RSTA20220150C21] 21.Fearnhead P, Bierkens J, Pollock M, Roberts GO. 2018. Piecewise deterministic Markov processes for continuous-time Monte Carlo. Stat. Sci. 33, 386-412. ( 10.1214/18-STS648) [DOI] [Google Scholar]

[RSTA20220150C22] 22.Bouchard-Côté A, Vollmer SJ, Doucet A. 2018. The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. 113, 855-867. [Google Scholar]

[RSTA20220150C23] 23.Bierkens J, Grazzi S, Kamatani K, Roberts G. 2020. The boomerang sampler. In Int. Conf. on Machine Learning, Online, 12–18 July 2020, pp. 908–918. PMLR.

[RSTA20220150C24] 24.Wu C, Robert CP. 2020. Coordinate sampler: a non-reversible Gibbs-like MCMC sampler. Stat. Comput. 30, 721-730. ( 10.1007/s11222-019-09913-w) [DOI] [Google Scholar]

[RSTA20220150C25] 25.Scalliet C, Guiselin B, Berthier L. 2022. Thirty milliseconds in the life of a supercooled liquid. Phys. Rev. X 12, 041028. ( 10.1103/PhysRevX.12.041028) [DOI] [PubMed]

[RSTA20220150C26] 26.Grigera TS, Parisi G. 2001. Fast Monte Carlo algorithm for supercooled soft spheres. Phys. Rev. E 63, 045102. ( 10.1103/PhysRevE.63.045102) [DOI] [PubMed] [Google Scholar]

[RSTA20220150C27] 27.Jerrum M. 1992. Large cliques elude the Metropolis process. Random Struct. Algorithms 3, 347-359. ( 10.1002/rsa.3240030402) [DOI] [Google Scholar]

[RSTA20220150C28] 28.Gamarnik D, Zadik I. 2019. The landscape of the planted clique problem: dense subgraphs and the overlap gap property. (http://arxiv.org/abs/1904.07174).

[RSTA20220150C29] 29.Angelini MC, Fachin P, de Feo S. 2021. Mismatching as a tool to enhance algorithmic performances of Monte Carlo methods for the planted clique model. J. Stat. Mech: Theory Exp. 2021, 113406. ( 10.1088/1742-5468/ac3657) [DOI] [Google Scholar]

[RSTA20220150C30] 30.Chen Z, Mossel E, Zadik I. 2023. Almost-linear planted cliques elude the metropolis process. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Florence, Italy, 22–25 January 2023, pp. 4504–4539. Philadelphia, PA: SIAM. ( 10.1137/1.9781611977554.ch171) [DOI]

[RSTA20220150C31] 31.Hukushima K, Nemoto K. 1996. Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Jpn. 65, 1604-1608. ( 10.1143/JPSJ.65.1604) [DOI] [Google Scholar]

[RSTA20220150C32] 32.Angelini MC. 2018. Parallel tempering for the planted clique problem. J. Stat. Mech: Theory Exp. 2018, 073404. [Google Scholar]

[RSTA20220150C33] 33.Richard E, Montanari A. 2014. A statistical model for tensor PCA. In Advances in neural information processing systems 27, Montreal, Canada, 8–13 Dec 2014. Red Hook, NY: Curran Associates.

[RSTA20220150C34] 34.Perry A, Wein AS, Bandeira AS. 2020. Statistical limits of spiked tensor models. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques 56, 230–264.

[RSTA20220150C35] 35.Lesieur T, Miolane L, Lelarge M, Krzakala F, Zdeborová L. 2017. Statistical and computational phase transitions in spiked tensor estimation. In 2017 IEEE Int. Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017, pp. 511–515. New York, NY: IEEE.

[RSTA20220150C36] 36.Jagannath A, Lopatto P, Miolane L. 2020. Statistical thresholds for tensor PCA. Ann. Appl. Probab. 30, 1910-1933. ( 10.1214/19-AAP1547) [DOI] [Google Scholar]

[RSTA20220150C37] 37.Baik J, Ben Arous G, Péché S. 2005. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Appl. Probab. 33, 1643-1697. [Google Scholar]

[RSTA20220150C38] 38.Wein AS, El Alaoui A, Moore C. 2019. The Kikuchi hierarchy and tensor PCA. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), Baltimore, MA, 9–12 November 2019, pp. 1446–1468. New York, NY: IEEE.

[RSTA20220150C39] 39.Hopkins SB, Shi J, Steurer D. 2015. Tensor principal component analysis via sum-of-square proofs. In Conf. on Learning Theory, Paris, France, 3–6 July 2015, pp. 956–1006. PMLR.

[RSTA20220150C40] 40.Hopkins SB, Schramm T, Shi J, Steurer D. 2016. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proc. of the Forty-Eighth Annual ACM Symposium on Theory of Computing, Cambridge, MA, 19–21 June 2016, pp. 178–191. New York, NY: ACM.

[RSTA20220150C41] 41.Kim C, Bandeira AS, Goemans MX. 2017. Community detection in hypergraphs, spiked tensor models, and sum-of-squares. In 2017 International Conference on Sampling Theory and Applications (SampTA), Bordeaux, France, 8–12 July 2017, pp. 124–128. New York, NY: IEEE.

[RSTA20220150C42] 42.Sarao Mannelli S, Biroli G, Cammarota C, Krzakala F, Zdeborová L. 2019. Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. Advances in neural information processing systems 32, Vancouver, Canada, 8–14 Dec 2019. Red Hook, NY: Curran Associates.

[RSTA20220150C43] 43.Sarao Mannelli S, Krzakala F, Urbani P, Zdeborová L. 2019. Passed & spurious: descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pp. 4333-4342. PMLR.

[RSTA20220150C44] 44.Biroli G, Cammarota C, Ricci-Tersenghi F. 2020. How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor PCA. J. Phys. A: Math. Theor. 53, 174003. ( 10.1088/1751-8121/ab7b1f) [DOI] [Google Scholar]

[RSTA20220150C45] 45.Ben Arous G, Gheissari R, Jagannath A. 2020. Bounding flows for spherical spin glass dynamics. Commun. Math. Phys. 373, 1011-1048. ( 10.1007/s00220-019-03649-4) [DOI] [Google Scholar]

[RSTA20220150C46] 46.Ben Arous G, Gheissari R, Jagannath A. 2021. Online stochastic gradient descent on non-convex losses from high-dimensional inference. J. Mach. Learn. Res. 22, 106-1. [Google Scholar]

[RSTA20220150C47] 47.Rasmussen CE, Williams CKI. 2006. Gaussian processes for machine learning. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press. [Google Scholar]

[RSTA20220150C48] 48.Ghosal S, van der Vaart AW. 2017. Fundamentals of nonparametric Bayesian inference. New York, NY: Cambridge University Press. [Google Scholar]

[RSTA20220150C49] 49.van der Vaart A, van Zanten JH. 2008. Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Stat. 36, 1435-1463. ( 10.1214/009053607000000613) [DOI] [Google Scholar]

[RSTA20220150C50] 50.Cotter SL, Roberts GO, Stuart AM, White D. 2013. MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28, 424-446. ( 10.1214/13-STS421) [DOI] [Google Scholar]

[RSTA20220150C51] 51.Beskos A, Girolami M, Lan S, Farrell PE, Stuart AM. 2017. Geometric MCMC for infinite-dimensional inverse problems. J. Comput. Phys. 335, 327-351. ( 10.1016/j.jcp.2016.12.041) [DOI] [Google Scholar]

[RSTA20220150C52] 52.Hairer M, Stuart AM, Vollmer SJ. 2014. Spectral gaps for a Metropolis-Hastings algorithm in infinite dimensions. Ann. Appl. Probab. 24, 2455-2490. ( 10.1214/13-AAP982) [DOI] [Google Scholar]

[RSTA20220150C53] 53.Hairer M, Mattingly J, Scheutzow M. 2011. Asymptotic coupling and a general form of Harris’ theorem with applications to stochastic delay equations. Probab. Theory Relat. Fields 149, 223-259. ( 10.1007/s00440-009-0250-6) [DOI] [Google Scholar]

[RSTA20220150C54] 54.Chewi S, Lu C, Ahn K, Cheng X, Le Gouic T, Rigollet P. 2021. Optimal dimension dependence of the Metropolis-Adjusted Langevin Algorithm. In Conference on Learning Theory, Boulder, CO, 15–19 August 2021. PMLR.

[RSTA20220150C55] 55.Roberts GO, Rosenthal JS. 2001. Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 16, 351-367. ( 10.1214/ss/1015346320) [DOI] [Google Scholar]

[RSTA20220150C56] 56.Breyer LA, Piccioni M, Scarlatti S. 2004. Optimal scaling of MaLa for nonlinear regression. Ann. Appl. Probab. 14, 1479-1505. ( 10.1214/105051604000000369) [DOI] [Google Scholar]

[RSTA20220150C57] 57.Mattingly JC, Pillai NS, Stuart AM. 2012. Diffusion limits of the random walk Metropolis algorithm in high dimensions. Ann. Appl. Probab. 22, 881-930. ( 10.1214/10-AAP754) [DOI] [Google Scholar]

[RSTA20220150C58] 58.Franz S, Parisi G. 1995. Recipes for metastable states in spin glasses. J. Phys. I 5, 1401-1415. [Google Scholar]

[RSTA20220150C59] 59.Edmunds DE, Triebel H.. 1996. Function spaces, entropy numbers, differential operators. Cambridge Tracts in Mathematics, vol. 120. Cambridge, UK: Cambridge University Press. [Google Scholar]

[RSTA20220150C60] 60.Li WV, Linde W. 1999. Approximation, metric entropy and small ball estimates for Gaussian measures. Ann. Probab. 27, 1556-1578. ( 10.1214/aop/1022677459) [DOI] [Google Scholar]

[RSTA20220150C61] 61.Giné E, Nickl R.. 2016. Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics. New York, NY: Cambridge University Press. [Google Scholar]

[RSTA20220150C62] 62.Crisanti A, Sommers HJ. 1992. The spherical $p$ -spin interaction spin glass model: the statics. Zeitschrift für Physik B Condensed Matter 87, 341-354. ( 10.1007/BF01309287) [DOI] [Google Scholar]

[RSTA20220150C63] 63.Subag E. 2017. The complexity of spherical $p$ -spin models – a second moment approach. Ann. Probab. 45, 3385-3450. ( 10.1214/16-AOP1139) [DOI] [Google Scholar]

PERMALINK

On free energy barriers in Gaussian priors and failure of cold start MCMC for high-dimensional unimodal distributions

Afonso S Bandeira

Antoine Maillard

Richard Nickl

Sven Wang

Roles

Abstract

1. Introduction

Figure 1.

Figure 2.

2. The spiked tensor model: an illustrative example

Definition 2.1 (Spiked tensor model). —

Definition 2.2 (Latitude intervals). —

(a) . Posterior contraction

Proposition 2.3. —

Corollary 2.4 (Posterior contraction). —

Remark 2.5 (Suboptimality of uniform bounds). —

(b) . Algorithmic bottleneck for MCMC

Proposition 2.6 (MCMC bottleneck, informal). —

Remark 2.7. —

Lemma 2.8 (Bottleneck, formal). —

Proposition 2.9. —

Remark 2.10 (MCMC initialization). —

3. Main results for nonlinear regression with Gaussian priors

(a) . Hardness examples for posterior computation with Gaussian priors

Assumption 3.1. —

Theorem 3.2. —

Theorem 3.3. —

Remark 3.4. —

Remark 3.5 (On the step size condition). —

(b) . Implications for common MCMC methods with ‘cold-start’

(i) . Preconditioned Crank–Nicolson

Theorem 3.6. —

(ii) . Gradient-based Langevin algorithms

Theorem 3.7. —

4. Proofs of the main theorems

(a) . Radially symmetric choices of G

(i) . A locally log-concave, globally monotone choice of w

Proposition 4.1. —

Proof. —

(b) . Bounds for posterior ratios of annuli

Proposition 4.2. —

Remark 4.3 (The prior condition for w from (4.6)). —

Proof. —

(c) . Bounds for Markov chain hitting times

(i) . Hitting time bounds for intermediate sets Θs,η

Proposition 4.4. —

Proof of proposition 4.4. —

(ii) . Reducing hitting times for Bs to ones for Θs,η

(d) . Proof of theorem 3.3

(i) . Small ball estimates for α-regular priors

Lemma 4.5. —

Proof of lemma 4.5. —

Remark 4.6. —

(ii) . Proof of theorem 3.3, part (iv)

(iii) . Proof of theorem 3.3, part (iii)

Remark 4.7. —

(e) . Proof of theorem 3.2

Lemma 4.8. —

(f) . Proofs for §3b

(i) . Proofs for pCN

Lemma 4.9. —

Proof of lemma 4.9. —

Proof of theorem 3.6. —

(ii) . Proofs for MALA

Proof of theorem 3.7, part (ii). —

Proof of theorem 3.7, part (i). —

Acknowledgements

Appendix A. Proofs of §2

Proof of corollary 2.4. —

Proof. —

Lemma A.1 —

Appendix B. Small ball estimates for isotropic Gaussians

Lemma B.1. —

Proof of lemma B.1. —

Footnotes

Data accessibility

Authors' contributions

Conflict of interest declaration

(i) . Hitting time bounds for intermediate sets $Θ_{s, η}$

(ii) . Reducing hitting times for $B_{s}$ to ones for $Θ_{s, η}$

(i) . Small ball estimates for $α$ -regular priors