Sampling can be faster than optimization

Yi-An Ma; Yuansi Chen; Chi Jin; Nicolas Flammarion; Michael I Jordan

doi:10.1073/pnas.1820003116

. 2019 Sep 30;116(42):20881–20885. doi: 10.1073/pnas.1820003116

Sampling can be faster than optimization

Yi-An Ma ^a, Yuansi Chen ^b, Chi Jin ^a, Nicolas Flammarion ^a, Michael I Jordan ^a,^b,¹

PMCID: PMC6800351 PMID: 31570618

Significance

Modern large-scale data analysis and machine learning applications rely critically on computationally efficient algorithms. There are 2 main classes of algorithms used in this setting—those based on optimization and those based on Monte Carlo sampling. The folk wisdom is that sampling is necessarily slower than optimization and is only warranted in situations where estimates of uncertainty are needed. We show that this folk wisdom is not correct in general—there is a natural class of nonconvex problems for which the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.

Keywords: Langevin Monte Carlo, nonconvex optimization, computational complexity

Abstract

Optimization algorithms and Monte Carlo sampling algorithms have provided the computational foundations for the rapid growth in applications of statistical machine learning in recent years. There is, however, limited theoretical understanding of the relationships between these 2 kinds of methodology, and limited understanding of relative strengths and weaknesses. Moreover, existing results have been obtained primarily in the setting of convex functions (for optimization) and log-concave functions (for sampling). In this setting, where local properties determine global properties, optimization algorithms are unsurprisingly more efficient computationally than sampling algorithms. We instead examine a class of nonconvex objective functions that arise in mixture modeling and multistable systems. In this nonconvex setting, we find that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.

Machine learning and data science are fields that blend computer science and statistics so as to solve inferential problems whose scale and complexity require modern computational infrastructure. The algorithmic foundations on which these blends have been based rest on 2 general computational strategies, both which have their roots in mathematics—optimization and Markov chain Monte Carlo (MCMC) sampling. Research on these strategies has mostly proceeded separately, with research on optimization focused on estimation and prediction problems and research on sampling focused on tasks that require uncertainty estimates, such as forming credible intervals and conducting hypothesis tests. There is a trend, however, toward the use of common methodological elements within the 2 strands of research (1–12). In particular, both strands have focused on the use of gradients and stochastic gradients—rather than function values or higher-order derivatives—as providing a useful compromise between the computational complexity of individual algorithmic steps and the overall rate of convergence. Empirically, the effectiveness of this compromise is striking. However, the relative paucity of theoretical research linking optimization and sampling has limited the flow of ideas; in particular, the rapid recent advance of theory for optimization (see, e.g., ref. 13) has not yet translated into a similarly rapid advance of the theory for sampling. Accordingly, machine learning has remained limited in its inferential scope, with little concern for estimates of uncertainty.

Theoretical linkages have begun to appear in recent work (see, e.g., refs. 5–12), where tools from optimization theory have been used to establish rates of convergence—notably including nonasymptotic dimension dependence—for MCMC sampling. The overall message from these results is that sampling is slower than optimization—a message which accords with the folk wisdom that sampling approaches are warranted only if there is need for the stronger inferential outputs that they provide. These results are, however, obtained in the setting of convex functions. For convex functions, global properties can be assessed via local information. Not surprisingly, gradient-based optimization is well suited to such a setting.

Our focus is the nonconvex setting. We consider a broad class of problems that are strongly convex outside of a bounded region but nonconvex inside of it. Such problems arise, for example, in Bayesian mixture model problems (14, 15) and in the noisy multistable models that are common in statistical physics (16, 17). We find that when the nonconvex region has a constant and nonzero radius in $R^{d}$ , the MCMC methods converge to $ϵ$ accuracy in $\tilde{O} (d / ϵ)$ or $\tilde{O} (d^{2} \ln (1 / ϵ))$ steps, whereas any optimization approach converges in $\tilde{Ω} ({(1 / ϵ)}^{d})$ steps. Note, critically, the dimension dependence in these results. We see that, for this class of problems, sampling is more effective than optimization.

We obtain these polynomial convergence results for the MCMC algorithms in the nonconvex setting by working in continuous time and separating the problem into 2 subproblems: Given the target distribution we first exploit the properties of a weighted Sobolev space endowed with that target distribution to obtain convergence rates for the continuous dynamics, and we then discretize and find the appropriate step size to retain those rates for the discretized algorithm. This general framework allows us to strengthen recent results in the MCMC literature (18–21) and examine a broader class of algorithms including the celebrated Metropolis–Hastings method.

Polynomial Convergence of MCMC Algorithms

The Langevin algorithm is a family of gradient-based MCMC sampling algorithms (22–24). We present pseudocode for 2 variants of the algorithm in Algorithm 1, and, by way of comparison, we provide pseudocode for classical gradient descent (GD) in Algorithm 2. The variant of the Langevin algorithm which does not include the “if” statement is referred to as the ULA; as can be seen, it is essentially the same as GD, differing only in its

Algorithm 1:

The (Metropolis-adjusted) Langevin algorithm is a gradient-based MCMC algorithm. In each step, one simulates $ξ \sim N (0,2 h^{k} I)$ and $u \sim U [0,1]$ , a uniform random variable between 0 and 1. The conditional distribution $p (x^{k} | x^{k + 1})$ is the normal distribution centered at $x^{k} - h^{k} \nabla U (x^{k})$ and $p *$ is the target distribution. Without the Metropolis adjustment step, the algorithm is called the unadjusted Langevin algorithm (ULA). Otherwise, it is called the Metropolis-adjusted Langevin algorithm (MALA)

MALA
Input: $x^{0}$ , stepsizes ${h^{k}}$
for $k = 0,1,2, \dots, K - 1$ do
$x^{k + 1} \leftarrow x^{k} - h^{k} \nabla U (x^{k}) + ξ$
if $\frac{p (x^{k} \| x^{k + 1}) p^{} (x^{k})}{p (x^{k + 1} \| x^{k}) p^{} (x^{k + 1})} < u$ then
$x^{k + 1} \leftarrow x^{k}$ $⊳$ Metropolis Adjustment
Return $x^{K}$

Open in a new tab

incorporation of a random term $ξ \sim N (0,2 h^{k} I)$ in the update. The variant that includes the “if” statement is referred to as the MALA; it is the standard Metropolis–Hastings algorithm applied to the Langevin setting. It is worth noting that ULA differs from stochastic optimization algorithms in the scaling of the variance of the random term $ξ$ : In stochastic GD, the variance of $ξ$ scales as squared stepsize, ${(h^{k})}^{2}$ .

We consider sampling from a smooth target distribution $p^{*}$ that is strongly log-concave outside of a region. That is, for $p^{*} \propto e^{- U}$ , we assume that $U$ is $m$ -strongly convex outside of a region of radius $R$ and is $L$ -Lipschitz smooth.* (See SI Appendix, section A for a formal statement of the assumptions.) Let $κ = L / m$ denote the condition number of $U$ ; this is a parameter which measures how much $U$ deviates from an isotropic quadratic function outside of the region of radius $R$ . We prove convergence of the Langevin sampling algorithms for this target, establishing a convergence rate. Given an error tolerance $ϵ \in (0,1)$ and an initial distribution $p^{0}$ , define the $ϵ$ -mixing time in total variation distance as

τ (ϵ; p^{0}) = min \{k | {∥p^{k} - p^{*}∥}_{TV} \leq ϵ\} .

Theorem 1. Consider Algorithm 1 with initialization $p^{0} = N (0, \frac{1}{L} I_{d})$ and error tolerance $ϵ \in (0,1)$ . Then ULA with step size $h^{k} = O (e^{- 16 L R^{2}} κ^{- 1} L^{- 1} ϵ^{2} / d)$ satisfies

τ_{U L A} (ϵ, p^{0}) \leq O (e^{32 L R^{2}} κ^{2} \frac{d}{ϵ^{2}} ln (\frac{d}{ϵ^{2}})) .

[1]

For MALA with step size $h^{k} = O (e^{- 8 L R^{2}} κ^{- 1 / 2} L^{- 1} (d ln κ + {ln 1 / ϵ)}^{- 1 / 2} d^{- 1 / 2})$ ,

\begin{matrix} τ_{M A L A} (ϵ, p^{0}) \leq O (\frac{e^{40 L R^{2}}}{m} κ^{3 / 2} d^{1 / 2} {(d ln κ + ln (\frac{1}{ϵ}))}^{3 / 2}) . \end{matrix}

[2]

Comparing Eq. 1 with Eq. 2, we see that the Metropolis adjustment improves the mixing time of ULA to a logarithmic dependence in $ϵ$ , while sacrificing a factor of dimension $d$ . (Note, however, that these are upper bounds, and they depend on our specific setting and our assumptions. It should not be inferred from our results that ULA is generically faster than MALA in terms of dimension dependence.) Comparing Eqs. 1 and 2 with previous results in the literature that provide upper bounds on the mixing time of ULA and MALA for strongly convex potentials $U$ (5–12), we find that the local nonconvexity results in an extra factor of $e^{O (L R^{2})}$ . Thus, when the Lipschitz smoothness $L$ and radius of the nonconvex region $R$ satisfy $L R^{2}$ is $O (\log d)$ , the computational complexity is polynomial in dimension $d$ .

Our proof of Theorem 1 involves a 2-step framework that applies more widely than our specific setting. We first use properties of $p^{*} \propto e^{- U}$ to establish linear convergence of a continuous stochastic process that underlies Algorithm 1. We then discretize, finding an appropriate step size for the algorithm to converge to the desired accuracy. These 2 parts can be tackled independently. In this section, we provide an overview of the first part of the argument in the case of the MALA algorithm. The details, as well as a presentation of the second part of the argument, are provided in SI Appendix, section B.

Letting $t = \sum_{i = 1}^{k} h^{i}$ , assumed finite, a standard limiting process yields the following stochastic differential equation (SDE) as a continuous-time limit of Algorithm 1: $d X_{t} = - \nabla U (X_{t}) d t + \sqrt{2} d B_{t}$ , where $B_{t}$ is a Brownian motion. To assess the rate of convergence of this SDE, we make use of the Kullback–Leibler (KL) divergence, which upper bounds the total variation distance and allows us to obtain strong convergence guarantees that include dimension dependence. Denoting the probability distribution of $X_{t}$ as ${\tilde{p}}_{t}$ , we obtain (see the derivation in SI Appendix, section B.2) the following time derivative of the divergence of ${\tilde{p}}_{t}$ to the target distribution $p^{*}$ :

\frac{d}{d t} KL ({\tilde{p}}_{t} ∥ p^{*}) = - E_{{\tilde{p}}_{t}} [{∥\nabla \ln (\frac{{\tilde{p}}_{t} (x)}{p^{*} (x)})∥}^{2}] .

[3]

The property of $p^{*} \propto e^{- U}$ that we require to turn this time derivative into a convergence rate is that it satisfies a log-Sobolev inequality. Considering the Sobolev space defined by the weighted $L^{2}$ norm, $\int g {(x)}^{2} p^{*} (x) d x$ , we say that $p^{*}$ satisfies a log-Sobolev inequality if there exists a constant $ρ > 0$ such that for any smooth function $g$ on $R^{d}$ , satisfying $\int_{R^{d}} g (x) p^{*} (x) d x = 1$ , we have

\int g (x) \ln g (x) \cdot p^{*} (x) d x \leq \frac{1}{2 ρ} \int \frac{{∥\nabla g (x)∥}^{2}}{g (x)} p^{*} (x) d x .

The largest $ρ$ for which this inequality holds is said to be the log-Sobolev constant for the objective $U$ . We denote it as $ρ_{U}$ . Taking $g = {\tilde{p}}_{t} / p^{*}$ , we obtain

\begin{align} KL ({\tilde{p}}_{t} ∥ p^{*}) & = E_{{\tilde{p}}_{t}} [\ln (\frac{{\tilde{p}}_{t} (x)}{p^{*} (x)})] \\ \leq \frac{1}{2 ρ_{U}} E_{{\tilde{p}}_{t}} [{∥\nabla \ln (\frac{{\tilde{p}}_{t} (x)}{p^{*} (x)})∥}^{2}] . \end{align}

[4]

Note the resemblance of this bound to the Polyak–Łojasiewicz condition (26) used in optimization theory for studying the

Algorithm 2:

GD is a classical gradient-based optimization algorithm which updates $x$ along the negative gradient direction

GD
Input: $x^{0}$ , stepsizes ${h^{k}}$
for $k = 0,1,2, \dots, K - 1$ do
$x^{k + 1} \leftarrow x^{k} - h^{k} \nabla U (x^{k})$
Return $x^{K}$

Open in a new tab

convergence of smooth and strongly convex objective functions—in both cases the difference from the current iterate to the optimum is upper-bounded by the norm of the gradient squared. Combining Eq. 3 with Eq. 4, we derive the promised linear convergence rate for the continuous process:

\frac{d}{d t} KL ({\tilde{p}}_{t} ∥ p^{*}) \leq - 2 ρ_{U} KL ({\tilde{p}}_{t} ∥ p^{*}) .

In SI Appendix, section B.2 we present similar results for the ULA algorithm, again using the KL divergence.

The next step is to bound $ρ_{U}$ in terms of the basic smoothness and local nonconvexity assumptions in our problem. We first require an approximation result:

Lemma 1. For $U m$ -strongly convex outside of a region of radius $R$ and $L$ -Lipschitz smooth, there exists $Û \in C^{1} (R^{d})$ such that $Û$ is $m / 2$ strongly convex on $R^{d}$ , and has a Hessian that exists everywhere on $R^{d}$ . Moreover, we have $sup (Û (x) - U (x)) - inf (Û (x) - U (x)) \leq 16 L R^{2}$ .

The proof of this lemma is presented in SI Appendix, section B.1. The existence of the smooth approximation established in this lemma can now be used to bound the log-Sobolev constant using standard results.

Proposition 1. For $p^{*} \propto e^{- U}$ , where $U$ is $m$ -strongly convex outside of a region of radius $R$ and $L$ -Lipschitz smooth,

ρ_{U} \geq \frac{m}{2} e^{- 16 L R^{2}} .

[5]

Proof: For $m / 2$ -strongly convex $Û \in C^{1} (R^{d})$ whose Hessian $\nabla^{2} Û (x)$ exists everywhere on $R^{d}$ , the distribution $e^{- Û (x)}$ satisfies the Bakry–Emery criterion (27) for a strongly log-concave density, which yields

ρ_{Û} \geq \frac{m}{2} .

[6]

We use the Holley–Stroock theorem (28) to obtain

ρ_{U} \geq \frac{m}{2} e^{- |sup (Û (x) - U (x)) - inf (Û (x) - U (x))|} \geq \frac{m}{2} e^{- 16 L R^{2}} .

[7]

We see from this proof outline that our approach enables one to adapt existing literature on the convergence of diffusion processes (29–31) to work out suitable log-Sobolev bounds and thereby obtain sharp convergence rates in terms of distance measures such as the KL divergence and total variation. This contributes to the existing literature on convergence of MCMC (32–36) by providing nonasymptotic guarantees on computational complexity. The detailed proof also reveals that the log-Sobolev constant $ρ_{U}$ is largely determined by the global qualities of $U$ where most of the probability mass is concentrated; local properties of $U$ have limited influence on $ρ_{U}$ . Since this is a property of the Sobolev space defined by the $p^{*}$ -weighted $L^{2}$ norm, the favorable convergence rates of the Langevin algorithms can be expected to generalize to other sampling algorithms (see, e.g., ref. 37).

Exponential Dependence on Dimension for Optimization

It is well known that finding global minima of a general nonconvex optimization problem is NP-hard (38). Here we demonstrate that it is also hard to find an approximation to the optimum of a Lipschitz-smooth, locally nonconvex objective function $U$ , for any algorithm in a general class of optimization algorithms.

Specifically, we consider a general iterative algorithm family $A$ which, at every step $k$ , is allowed to query not only the function value of $U$ but also its derivatives up to any fixed order at a chosen point $x^{k}$ . Thus, the algorithm has access to the vector $({U (x^{k}), \nabla U (x^{k}), \dots, \nabla^{n} U (x^{k})})$ , for any fixed $n \in N$ . Moreover, the algorithm can use the entire query history to determine the next point $x^{k + 1}$ , and it can do so randomly or deterministically. In the following theorem, we prove that the number of iterations for any algorithm in $A$ to approximate the minimum of $U$ is necessarily exponential in the dimension $d$ .

Theorem 2 (Lower Bound for Optimization). For any $R > 0$ , $L \geq 2 m > 0$ , and $ϵ \leq O (L R^{2})$ , there exists an objective function, $U : R^{d} \to R$ , which is $m$ -strongly convex outside of a region of radius $R$ and $L$ -Lipschitz smooth, such that any algorithm in $A$ requires at least $K = Ω ({(L R^{2} / ϵ)}^{d / 2})$ iterations to guarantee that ${min}_{k \leq K} | U (x^{K}) - U (x^{*}) | < ϵ$ with constant probability.

We remark that Theorem 2 is an information-theoretic result based on the class of iterative algorithms $A$ and the forms of the queries to this class. It is thus an unconditional statement that does not depend on conjectures such as $P \neq N P$ in complexity theory. We also note that if the goal is only to find stationary points instead of the optimum, then the problem becomes easier, requiring only $Ω {(1 / ϵ)}^{2}$ gradient queries to converge (39).

A depiction of an example that achieves this computational lower bound is provided in Fig. 1. The idea is that we can pack exponentially many balls of radius less than $R / 3$ inside a region of radius $R$ . We can arbitrarily assign the minimum $x^{*}$ to 1 of the balls, assigning a larger constant value to the other balls. We show that the number of queries needed to find the specific ball containing the minimum is exponential in $d$ . Moreover, the difference from $U (x^{*})$ to any other point outside of the ball is $O (L R^{2})$ , which can be significant.

This example suggests that the lower-bound scenario will be realized in cases in which regions of attraction are small around a global minimum and behavior within each region of attraction is relatively autonomous. This phenomenon is not uncommon in multistable physical systems. Indeed, in nonequilibrium statistical physics, there are examples where the global behavior of a system can be treated approximately as a set of local behaviors within stable regimes plus Markov transitions among stable regimes (40). In such cases, when the regions of attraction are small, the computational complexity to find the global minimum can be combinatorial. In section 3, we explicitly demonstrate that this combinatorial complexity holds for a Gaussian mixture model.

Why Can’t One Optimize in Polynomial Time Using the Langevin Algorithm?

Consider the rescaled density function $q_{β}^{*} \propto e^{- β U}$ . A line of research beginning with simulated annealing (41) uses a sampling algorithm to sample from $q_{β}^{*}$ , doing so for increasing values of $β$ , and uses the resulting samples to approximate $x^{*} = \arg min_{x \in R^{d}} U (x)$ . In particular, simply returning 1 of the samples obtained for sufficiently large $β$ yields an output that is close to the optimum with high probability. This suggests the following question: Can we use the Langevin algorithm to generate samples from $q_{β}^{*}$ , and thereby obtain an approximation to $x^{*}$ in a number of steps polynomial in $d$ ?

In the following Corollary 1, we demonstrate that this is not possible: We need $β = \tilde{Ω} (d / ϵ)$ so that a sample $x$ from $q_{β}^{*}$ will satisfy $∥x - x^{*}∥ \leq ϵ$ with constant probability. (Here $\tilde{Ω}$ means we have omitted logarithmic factors.) This requires the Lipschitz smoothness of $U$ to scale with $d$ , which in turn causes the sampling complexity to scale exponentially with $d$ , as established in Eqs. 1 and 2.

Corollary 1. There exists an objective function $U$ that is $m$ -strongly convex outside of a region of radius $2 R$ and $L$ -Lipschitz smooth, such that, for $\hat{x} \sim q_{β}^{*}$ , it is necessary that $β = \tilde{Ω} (d / ϵ)$ in order to have $U (\hat{x}) - U (x^{*}) < ϵ$ with constant probability. Moreover, the number of iterations required for the Langevin algorithms to achieve $U (x^{K}) - U (x^{*}) < ϵ$ with constant probability is $K = e^{\tilde{O} (d \cdot L R^{2} / ϵ)}$ .

It should be noted that this upper bound for the Langevin algorithms agrees with the lower bound for optimization algorithms in Theorem 2 up to a factor of $L R^{2} / ϵ$ in the exponent. Intuitively this is because in the lower bound for optimization complexity we are considering the most optimistic scenario for optimization algorithms, where a hypothetical algorithm can determine whether one region of radius $\sqrt{ϵ / L}$ (as depicted in Fig. 1) contains the global minimum or not with only 1 query (of the function value and $n$ -th order derivatives). When using the Langevin algorithms, more steps are required to explore each local region to a constant level of confidence.

Parameter Estimation from Gaussian Mixture Model: Sampling versus Optimization

We have seen that for problems with local nonconvexity the computational complexity for the Langevin algorithm is polynomial in dimension, whereas it is exponential in dimension for optimization algorithms. These are, however, worst-case guarantees. It is important to consider whether they also hold for natural statistical problem classes and for specific optimization algorithms. In this section, we study the Gaussian mixture model, comparing Langevin sampling and the popular expectation-maximization (EM) optimization algorithm.

Consider the problem of inferring the mean parameters of a Gaussian mixture model, $μ = {μ_{1}, \dots, μ_{M}} \in R^{d \times M}$ , when $N$ data points are sampled from that model. Letting $y = {y_{1}, \dots, y_{N}}$ denote the data, we have

\begin{align} p (y_{n} | μ) & = \sum_{i = 1}^{M} \frac{λ_{i}}{Z_{i}} \exp (- \frac{1}{2} {(y_{n} - μ_{i})}^{T} Σ_{i}^{- 1} (y_{n} - μ_{i})) \\ + (1 - \sum_{i = 1}^{M} λ_{i}) p_{0} (y_{n}), \end{align}

[8]

where $Z_{i}$ are normalization constants and $\sum_{i = 1}^{M} λ_{i} \leq 1$ . $p_{0} (y_{n})$ represents general constraints on the data (e.g., data may be distributed inside a region or may have sub-Gaussian tail behavior). The objective function is given by the log posterior distribution: $U (μ) = - \log p (μ) - \sum_{n = 1}^{N} \log p (y_{n} | μ)$ . Assume data are distributed in a bounded region ( $∥y_{n}∥ \leq R$ ) and take $p_{0} (y_{n}) = 1 \{‖ y_{n} ‖ \leq R\} / Z_{0}$ .

We prove in SI Appendix, section D that for a suitable choice of the prior $p (μ)$ and weights ${λ_{i}}$ , the objective function is Lipschitz-smooth and strongly convex for $∥μ∥ \geq 2 R \sqrt{M}$ . Therefore, taking $M R^{2} = O (\log d)$ , the ULA and MALA algorithms converge to $ϵ$ accuracy within $K \leq \tilde{O} (d^{3} / ϵ)$ and $K \leq \tilde{O} (d^{3} \ln^{2} (1 / ϵ))$ steps, respectively.

The EM algorithm updates the value of $μ$ in 2 steps. In the expectation (E) step a weight is computed for each data point and each mixture component, using the current parameter value $μ_{k}$ . In the maximization (M) step the value of $μ_{k + 1}$ is updated as a weighted sample mean (see SI Appendix, section D.2 for a more detailed description). It is standard to initialize the EM algorithm by randomly selecting $M$ data points (sometimes with small perturbations) to form $μ_{0}$ . We demonstrate in SI Appendix, section D.2 that under the condition that $M R^{2} = O (\log d)$ there exists a dataset ${y_{1}, \dots, y_{N}}$ and covariances ${Σ_{1}, \dots, Σ_{M}}$ , such that the EM algorithm requires more than $K \geq min {O (d^{1 / ϵ}), O (d^{d})}$ queries to converge if one initializes the algorithm close to the given data points. That is, for large $ϵ$ , the computational complexity of the EM algorithm depends on $d$ with arbitrarily high order (depending on $ϵ$ ); for small $ϵ$ , the computational complexity of the EM algorithm scales exponentially with $d$ . The latter case corresponds to our lower bound in Theorem 2 when taking the radius of the convex region of $μ$ to scale with $\sqrt{\log d}$ . Therefore, it is significantly harder for the EM algorithm to converge if we initialize the algorithm close to the given data points. This accords with practical implementations of EM algorithms, where heuristic, problem-dependent methods are often employed during initialization with the aim of decreasing the overall computation burden (42).

We also investigated this dichotomy experimentally. We generated data ${y_{1}, \dots, y_{N}}$ with sparse entries, letting the nonzero entries be distributed uniformly on $[- 1,1]$ . We inferred the mean parameters $μ = {μ_{1}, \dots, μ_{M}}$ with the EM algorithm and ULA algorithm to obtain maximum a posteriori (MAP) and mean estimates, respectively. Accuracy of the MAP estimate was measured in terms of the objective $U$ , while that of the mean estimates was measured in terms of both the function value $E [U (μ)]$ and the expected mean parameters $E [μ]$ . See SI Appendix, section E for detailed experimental settings. In Fig. 2, we show the scaling of the number of gradient queries required to converge as a function of the dimension $d$ . We observe that EM with random initialization from the data requires exponentially many gradient queries to converge, while ULA converges in an approximately linear number of gradient queries, corroborating our theoretical analysis.

Fig. 2. — Experimental results: scaling of number of gradient queries required for EM and ULA algorithms to converge with respect to the dimension $d$ . When $d \geq 10$ , too many gradient queries are required for EM to converge, so that an estimate of convergence time is not feasible. When $d = 32$ , ULA converges within 1,500 gradient queries (not shown in the figure).

Many mixture models with strongly log-concave priors fall into the assumed class of distributions with local nonconvexity. If data are distributed relatively close to each other, sampling these distributions can often be easier than searching for their global minima. This scenario is also common in the setting of the noisy multistable models arising in statistical physics [e.g., where the negative log likelihood is the potential energy of a classical particle system in an external field (17)] and related fields.

Discussion

We have shown that there is a natural family of nonconvex functions for which sampling algorithms have polynomial complexity in dimension whereas optimization algorithms display exponential complexity. The intuition behind these results is that computational complexity for optimization algorithms depends heavily on the local properties of the objective function $U$ . This is consistent with a related phenomenon that has been studied in optimization—local strong convexity near the global optimum can improve the convergence rate of convex optimization (43). On the other hand, sampling complexity depends more heavily on the global properties of $U$ . This is also consistent with existing literature; for example, it is known that the dimension dependence of the ULA upper bounds deteriorates when $U$ changes from strongly convex to weakly convex. This corresponds to the fact that the sub-Gaussian tails for strongly log-concave distributions are easier to explore than the subexponential tails for log-concave distributions.

A scrutiny of the relative scale between radius of the nonconvex region $R$ and the dimension $d$ is interesting (for constant Lipschitz smoothness $L$ ): When $R = 0$ , the problem is reduced to the Lipschitz-smooth and strongly convex case, where GD converges in $κ \log (1 / ϵ)$ steps (44) and ULA converges in $κ^{2} d / ϵ^{2}$ steps; when $R = O (\log d)$ , sampling is generally easier than optimization; when $0 < R \leq \sqrt{d}$ , the convergence upper bound for sampling is still slightly smaller than the optimization complexity lower bound; when $\sqrt{d} < R < d$ , the comparison is indeterminate; and the converse is true if $R \geq d$ .

The relatively rapid advance of the theory of gradient-based optimization has been due in part to the development of lower bounds, of the kind exhibited in our Theorem 2, for broad classes of algorithms. It is of interest to develop such lower bounds for MCMC algorithms, particularly bounds that capture dimension dependence. It is also of interest to develop both lower bounds and upper bounds for other forms of nonconvexity. For example, there has been recent work studying strongly dissipative functions (45). Here the worst-case convergence bounds have exponential dependence on the dimension, but $p^{*} \propto e^{- U}$ has a sub-Gaussian tail; further exploration of this setting may yield milder conditions on $U$ that allow MCMC algorithms to have polynomial convergence rates.

Supplementary Material

Supplementary File

pnas.1820003116.sapp.pdf^{(754.8KB, pdf)}

Footnotes

The authors declare no conflict of interest.

*U being $L$ -Lipschitz smooth means that $\nabla U$ is $L$ -Lipschitz continuous. Smoothness is crucial for the convergence of gradient-based methods (25).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1820003116/-/DCSupplemental.

References

1.Amit Y., Grenander U., Comparing sweep strategies for stochastic relaxation. J. Multivar. Anal. 37, 197–222 (1991). [Google Scholar]
2.Amit Y., On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions. J. Multivar. Anal. 38, 82–99 (1991). [Google Scholar]
3.Roberts G. O., Sahu S. K., Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. Roy. Stat. Soc. B 59, 291–317 (1997). [Google Scholar]
4.Dempster A., Laird X., Rubin D., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–38 (1977). [Google Scholar]
5.Dalalyan A. S., Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. Roy. Stat. Soc. B 79, 651–676 (2017). [Google Scholar]
6.Durmus A., Moulines E., Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm. arXiv:1605.01559 (5 May 2016).
7.Dalalyan A. S., Karagulyan A. G., User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095 (29 September 2017).
8.Cheng X., Chatterji N. S., Bartlett P. L., Jordan M. I., “Underdamped Langevin MCMC: A non-asymptotic analysis” in Proceedings of the 31st Conference on Learning Theory (COLT) (Association for Computational Learning, 2018), pp. 300–323. [Google Scholar]
9.Cheng X., Bartlett P. L., “Convergence of Langevin MCMC in KL-divergence” in Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT) (Association for Computing Machinery, 2018), pp. 186–211. [Google Scholar]
10.Dwivedi R., Chen Y., Wainwright M. J., Yu B., Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv:1801.02309 (8 January 2018).
11.Mangoubi O., Smith A., Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv:1708.07114 (23 August 2017).
12.Mangoubi O., Vishnoi N. K., Dimensionally tight running time bounds for second-order Hamiltonian Monte Carlo. arXiv:1802.08898 (24 February 2018).
13.Nesterov Y., Introductory Lectures on Convex Optimization: A Basic Course (Kluwer, Boston, 2004). [Google Scholar]
14.McLachlan G. J., Peel D., Finite Mixture Models (Wiley, Chichester, UK, 2000). [Google Scholar]
15.Marin J.-M., Mengersen K., Robert C. P., Bayesian Modelling and Inference on Mixtures of Distributions (Springer-Verlag, New York, 2005). [Google Scholar]
16.Kramers H. A., Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7, 284–304 (1940). [Google Scholar]
17.Landau L. D., Lifshitz E. M., Statistical Physics (Pergamon, Oxford, ed. 3, 1980). [Google Scholar]
18.Eberle A., Guillin A., Zimmer R., Couplings and quantitative contraction rates for Langevin dynamics. arXiv:1703.01617 (5 March 2017).
19.Bou-Rabee N., Eberle A., Zimmer R., Coupling and convergence for Hamiltonian Monte Carlo. arXiv:1805.00452 (1 May 2018).
20.Cheng X., Chatterji N. S., Abbasi-Yadkori Y., Bartlett P. L., Jordan M. I., Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648 (4 May 2018).
21.Majka M. B., Mijatović A., Szpruch L., Non-asymptotic bounds for sampling algorithms without log-concavity. arXiv:1808.07105 (21 August 2018).
22.Rossky P. J., Doll J. D., Friedman H. L., Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys. 69, 4628–4633 (1978). [Google Scholar]
23.Roberts G. O., Stramer O., Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab. 4, 337–357 (2002). [Google Scholar]
24.Durmus A., Moulines E., Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 27, 1551–1587 (2017). [Google Scholar]
25.Roberts G. O., Tweedie R. L., Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83, 95–110 (1996). [Google Scholar]
26.Polyak B. T., Gradient methods for minimizing functionals. Zh Vychisl Mat Mat Fiz 3, 643–653 (1963). [Google Scholar]
27.Bakry D., Emery M., “Diffusions hypercontractives” in Séminaire de Probabilités XIX 1983/84, J. Azema, M. Yor, Eds. (Springer, 1985), pp. 177–206.
28.Holley R., Stroock D., Logarithmic Sobolev inequalities and stochastic ising models. J. Stat. Phys. 46, 1159–1194 (1987). [Google Scholar]
29.Ledoux M., The geometry of Markov diffusion generators. Ann. Fac. Sci. Toulouse Math. 9, 305–366 (2000). [Google Scholar]
30.Villani C., Optimal Transport: Old and New (Springer, Berlin, 2009). [Google Scholar]
31.Wibisono A., Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. arXiv:1802.08089 (22 February 2018).
32.Frieze A., Kannan R., Polson N., Sampling from log-concave distributions. Ann. Appl. Probab. 4, 812–837 (1994). [Google Scholar]
33.Rosenthal J. S., Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Am. Stat. Assoc. 90, 558–566 (1995). [Google Scholar]
34.Rosenthal J., Quantitative convergence rates of Markov chains: A simple account. Electron. Commun. Probab. 7, 123–128 (2002). [Google Scholar]
35.Roberts G. O., Rosenthal J. S., Optimal scaling for various Metropolis-Hastings algorithms. Statist. Sci. 16, 351–367 (2001). [Google Scholar]
36.Roberts G. O., Rosenthal J. S., Complexity bounds for Markov chain Monte Carlo algorithms via diffusion limits. J. Appl. Probab. 53, 410–420 (2016). [Google Scholar]
37.Ma Y.-A., et al. , Is there an analog of Nesterov acceleration for MCMC? arXiv:1902.00996 (4 February 2019).
38.Jain P., Kar P., Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10, 142–336 (2017). [Google Scholar]
39.Carmon Y., Duchi J. C., Hinder O., Sidford A., Lower bounds for finding stationary points I. arXiv:1710.11606 (31 October 2017).
40.Ge H., Qian H., Landscapes of non-gradient dynamics without detailed balance: Stable limit cycles and multiple attractors. Chaos 22, 023140 (2012). [DOI] [PubMed] [Google Scholar]
41.Kirkpatrick S., Gelatt C. D., Vecchi M. P., Optimization by simulated annealing. Science 220, 671–680 (1983). [DOI] [PubMed] [Google Scholar]
42.Vempala S., Wang G., A spectral algorithm for learning mixture models. J. Comput. Syst. Sci. 68, 841–860 (2004). [Google Scholar]
43.Bach F., Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014). [Google Scholar]
44.Bubeck S., Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8, 231–357 (2015). [Google Scholar]
45.Raginsky M., Rakhlin A., Telgarsky M., “Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis” in Proceedings of the 30th Conference on Learning Theory (COLT) (Association for Computational Learning, 2017), pp. 1674–1703.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary File

pnas.1820003116.sapp.pdf^{(754.8KB, pdf)}

[r1] 1.Amit Y., Grenander U., Comparing sweep strategies for stochastic relaxation. J. Multivar. Anal. 37, 197–222 (1991). [Google Scholar]

[r2] 2.Amit Y., On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions. J. Multivar. Anal. 38, 82–99 (1991). [Google Scholar]

[r3] 3.Roberts G. O., Sahu S. K., Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. Roy. Stat. Soc. B 59, 291–317 (1997). [Google Scholar]

[r4] 4.Dempster A., Laird X., Rubin D., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–38 (1977). [Google Scholar]

[r5] 5.Dalalyan A. S., Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. Roy. Stat. Soc. B 79, 651–676 (2017). [Google Scholar]

[r6] 6.Durmus A., Moulines E., Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm. arXiv:1605.01559 (5 May 2016).

[r7] 7.Dalalyan A. S., Karagulyan A. G., User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095 (29 September 2017).

[r8] 8.Cheng X., Chatterji N. S., Bartlett P. L., Jordan M. I., “Underdamped Langevin MCMC: A non-asymptotic analysis” in Proceedings of the 31st Conference on Learning Theory (COLT) (Association for Computational Learning, 2018), pp. 300–323. [Google Scholar]

[r9] 9.Cheng X., Bartlett P. L., “Convergence of Langevin MCMC in KL-divergence” in Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT) (Association for Computing Machinery, 2018), pp. 186–211. [Google Scholar]

[r10] 10.Dwivedi R., Chen Y., Wainwright M. J., Yu B., Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv:1801.02309 (8 January 2018).

[r11] 11.Mangoubi O., Smith A., Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv:1708.07114 (23 August 2017).

[r12] 12.Mangoubi O., Vishnoi N. K., Dimensionally tight running time bounds for second-order Hamiltonian Monte Carlo. arXiv:1802.08898 (24 February 2018).

[r13] 13.Nesterov Y., Introductory Lectures on Convex Optimization: A Basic Course (Kluwer, Boston, 2004). [Google Scholar]

[r14] 14.McLachlan G. J., Peel D., Finite Mixture Models (Wiley, Chichester, UK, 2000). [Google Scholar]

[r15] 15.Marin J.-M., Mengersen K., Robert C. P., Bayesian Modelling and Inference on Mixtures of Distributions (Springer-Verlag, New York, 2005). [Google Scholar]

[r16] 16.Kramers H. A., Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7, 284–304 (1940). [Google Scholar]

[r17] 17.Landau L. D., Lifshitz E. M., Statistical Physics (Pergamon, Oxford, ed. 3, 1980). [Google Scholar]

[r18] 18.Eberle A., Guillin A., Zimmer R., Couplings and quantitative contraction rates for Langevin dynamics. arXiv:1703.01617 (5 March 2017).

[r19] 19.Bou-Rabee N., Eberle A., Zimmer R., Coupling and convergence for Hamiltonian Monte Carlo. arXiv:1805.00452 (1 May 2018).

[r20] 20.Cheng X., Chatterji N. S., Abbasi-Yadkori Y., Bartlett P. L., Jordan M. I., Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648 (4 May 2018).

[r21] 21.Majka M. B., Mijatović A., Szpruch L., Non-asymptotic bounds for sampling algorithms without log-concavity. arXiv:1808.07105 (21 August 2018).

[r22] 22.Rossky P. J., Doll J. D., Friedman H. L., Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys. 69, 4628–4633 (1978). [Google Scholar]

[r23] 23.Roberts G. O., Stramer O., Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab. 4, 337–357 (2002). [Google Scholar]

[r24] 24.Durmus A., Moulines E., Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 27, 1551–1587 (2017). [Google Scholar]

[r25] 25.Roberts G. O., Tweedie R. L., Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83, 95–110 (1996). [Google Scholar]

[r26] 26.Polyak B. T., Gradient methods for minimizing functionals. Zh Vychisl Mat Mat Fiz 3, 643–653 (1963). [Google Scholar]

[r27] 27.Bakry D., Emery M., “Diffusions hypercontractives” in Séminaire de Probabilités XIX 1983/84, J. Azema, M. Yor, Eds. (Springer, 1985), pp. 177–206.

[r28] 28.Holley R., Stroock D., Logarithmic Sobolev inequalities and stochastic ising models. J. Stat. Phys. 46, 1159–1194 (1987). [Google Scholar]

[r29] 29.Ledoux M., The geometry of Markov diffusion generators. Ann. Fac. Sci. Toulouse Math. 9, 305–366 (2000). [Google Scholar]

[r30] 30.Villani C., Optimal Transport: Old and New (Springer, Berlin, 2009). [Google Scholar]

[r31] 31.Wibisono A., Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. arXiv:1802.08089 (22 February 2018).

[r32] 32.Frieze A., Kannan R., Polson N., Sampling from log-concave distributions. Ann. Appl. Probab. 4, 812–837 (1994). [Google Scholar]

[r33] 33.Rosenthal J. S., Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Am. Stat. Assoc. 90, 558–566 (1995). [Google Scholar]

[r34] 34.Rosenthal J., Quantitative convergence rates of Markov chains: A simple account. Electron. Commun. Probab. 7, 123–128 (2002). [Google Scholar]

[r35] 35.Roberts G. O., Rosenthal J. S., Optimal scaling for various Metropolis-Hastings algorithms. Statist. Sci. 16, 351–367 (2001). [Google Scholar]

[r36] 36.Roberts G. O., Rosenthal J. S., Complexity bounds for Markov chain Monte Carlo algorithms via diffusion limits. J. Appl. Probab. 53, 410–420 (2016). [Google Scholar]

[r37] 37.Ma Y.-A., et al. , Is there an analog of Nesterov acceleration for MCMC? arXiv:1902.00996 (4 February 2019).

[r38] 38.Jain P., Kar P., Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10, 142–336 (2017). [Google Scholar]

[r39] 39.Carmon Y., Duchi J. C., Hinder O., Sidford A., Lower bounds for finding stationary points I. arXiv:1710.11606 (31 October 2017).

[r40] 40.Ge H., Qian H., Landscapes of non-gradient dynamics without detailed balance: Stable limit cycles and multiple attractors. Chaos 22, 023140 (2012). [DOI] [PubMed] [Google Scholar]

[r41] 41.Kirkpatrick S., Gelatt C. D., Vecchi M. P., Optimization by simulated annealing. Science 220, 671–680 (1983). [DOI] [PubMed] [Google Scholar]

[r42] 42.Vempala S., Wang G., A spectral algorithm for learning mixture models. J. Comput. Syst. Sci. 68, 841–860 (2004). [Google Scholar]

[r43] 43.Bach F., Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014). [Google Scholar]

[r44] 44.Bubeck S., Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8, 231–357 (2015). [Google Scholar]

[r45] 45.Raginsky M., Rakhlin A., Telgarsky M., “Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis” in Proceedings of the 30th Conference on Learning Theory (COLT) (Association for Computational Learning, 2017), pp. 1674–1703.

PERMALINK

Sampling can be faster than optimization

Yi-An Ma

Yuansi Chen

Chi Jin

Nicolas Flammarion

Michael I Jordan

Significance

Abstract

Polynomial Convergence of MCMC Algorithms

Algorithm 1:

Algorithm 2:

Exponential Dependence on Dimension for Optimization

Fig. 1.

Why Can’t One Optimize in Polynomial Time Using the Langevin Algorithm?

Parameter Estimation from Gaussian Mixture Model: Sampling versus Optimization

Fig. 2.

Discussion

Supplementary Material

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Sampling can be faster than optimization

Yi-An Ma

Yuansi Chen

Chi Jin

Nicolas Flammarion

Michael I Jordan

Significance

Abstract

Polynomial Convergence of MCMC Algorithms

Algorithm 1:

Algorithm 2:

Exponential Dependence on Dimension for Optimization

Fig. 1.

Why Can’t One Optimize in Polynomial Time Using the Langevin Algorithm?

Parameter Estimation from Gaussian Mixture Model: Sampling versus Optimization

Fig. 2.

Discussion

Supplementary Material

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases