Efficiency of quantum vs. classical annealing in nonconvex learning problems

Carlo Baldassi; Riccardo Zecchina

doi:10.1073/pnas.1711456115

Proc Natl Acad Sci USA. 2018 Jan 30;115(7):1457–1462.

PMCID: PMC5816144 PMID: 29382764

Significance

Quantum annealers are physical quantum devices designed to solve optimization problems by finding low-energy configurations of an appropriate energy function by exploiting cooperative tunneling effects to escape local minima. Classical annealers use thermal fluctuations for the same computational purpose, and Markov chains based on this principle are among the most widespread optimization techniques. The fundamental mechanism underlying quantum annealing consists of exploiting a controllable quantum perturbation to generate tunneling processes. The computational potentialities of quantum annealers are still under debate, since few ad hoc positive results are known. Here, we identify a wide class of large-scale nonconvex optimization problems for which quantum annealing is efficient while classical annealing gets stuck. These problems are of central interest to machine learning.

Keywords: nonconvex optimization, machine learning, quantum annealing, neural networks, statistical physics

Abstract

Quantum annealers aim at solving nonconvex optimization problems by exploiting cooperative tunneling effects to escape local minima. The underlying idea consists of designing a classical energy function whose ground states are the sought optimal solutions of the original optimization problem and adding a controllable quantum transverse field to generate tunneling processes. A key challenge is to identify classes of nonconvex optimization problems for which quantum annealing remains efficient while thermal annealing fails. We show that this happens for a wide class of problems which are central to machine learning. Their energy landscapes are dominated by local minima that cause exponential slowdown of classical thermal annealers, while simulated quantum annealing converges efficiently to rare dense regions of optimal solutions.

Quantum annealing (QA) aims at finding low-energy configurations of nonconvex optimization problems by a controlled quantum adiabatic evolution, where a time-dependent many-body quantum system that encodes the optimization problem evolves toward its ground states so as to escape local minima through multiple tunneling events (1–5). Classical simulated annealing (SA) uses thermal fluctuations for the same computational purpose, and Markov chains based on this principle are among the most widespread optimization techniques across science (6). Quantum fluctuations are qualitatively different from thermal fluctuations, and in principle, QA algorithms could lead to extremely powerful alternative computational devices.

In the QA approach, a time-dependent quantum transverse field is added to the classical energy function, leading to an interpolating Hamiltonian that may take advantage of correlated fluctuations mediated by tunneling. Starting with a high transverse field, the quantum model system can be initialized in its ground state, i.e., all spins aligned in the direction of the field. The adiabatic theorem then ensures that by slowly reducing the transverse field, the system remains in the ground state of the interpolating Hamiltonian. At the end of the process, the transverse field vanishes, and the system ends up in the sought ground state of the classical energy function. The original optimization problem would then be solved if the overall process could take place in a time bounded by some low-degree polynomial in the size of the problem. Unfortunately, the adiabatic process can become extremely slow. The adiabatic theorem requires the rate of change of the Hamiltonian to be smaller than the square of the gap between the ground state and the first excited state (7–9). For small gaps, the process can thus become inefficient. Exponentially small gaps are not only possible in worst-case scenarios, but have also been found to exist in typical random systems where comparative studies between quantum and classical annealing have so far failed to display a quantum exponential speed-up, e.g., at first-order phase transitions in quantum spin glasses (10, 11) or in 2D spin-glass systems (12–14). More positive results have been found for ad hoc energy functions in which global minima are planted in such a way that tunneling cascades can become more efficient than thermal fluctuations (4, 15). As far as the physical implementations of quantum annealers are concerned, studies have focused on detecting the presence of quantum effects rather than on their computational effectiveness (16–18).

Consequently, a key open question is to identify classes of relevant optimization problems for which QA can be shown to be exponentially faster than its classical thermal counterpart.

Here, we give an answer to this question by providing analytic and simulation evidence of exponential speed-up of quantum vs. classical SA for a representative class of random nonconvex optimization problems of basic interest in machine learning. The simplest example of this class is the problem of training binary neural networks (described in detail below): Very schematically, the variables of the problem are the (binary) connection weights, while the energy measures the training error over a given dataset.

These problems have been very recently found to possess a rather distinctive geometrical structure of ground states (19–22): The free-energy landscape has been shown to be characterized by the existence of an exponentially large number of metastable states and isolated ground states and a few regions where the ground states are dense. These dense regions, which had previously escaped the equilibrium statistical physics analysis (23, 24), are exponentially rare, but still possess a very high local internal entropy: They are composed of ground states that are surrounded, at extensive but relatively small distances, by exponentially many other ground states. Under these circumstances, classical SA (as any Markov chain satisfying detailed balance) gets trapped in the metastable states, suffering ergodicity breaking and exponential slowing down toward the low-energy configurations. These problems have been considered to be intractable for decades and display deep similarities with disordered spin-glass models, which are known to never reach equilibrium.

The large deviation analysis that has unveiled the existence of the rare dense regions has led to several novel algorithms, including a Monte Carlo scheme defined over an appropriate objective function (20) that bears close similarities with a quantum Monte Carlo (QMC) technique based on the Suzuki–Trotter transformation (5). Motivated by this analytical mapping and by the geometrical structure of the dense and degenerate ground states which is expected to favor zero-temperature kinetic processes (25, 26), we have conducted a full analytical and numerical statistical physics study of the QA problem, reaching the conclusion that in the quantum limit, the QMC process, i.e., simulated QA (SQA), can equilibrate efficiently, while the classical SA gets stuck in high-energy metastable states. These results generalize to multilayered networks.

While it is known that other quasioptimal classical algorithms for the same problems exist (20, 27, 28), here, we focus on the physical speed-up that a QA approach could provide in finding rare regions of ground states. We provide physical arguments and numerical results supporting the conjecture that the real-time QA dynamics behaves similarly to SQA.

As far as machine learning is concerned, dense regions of low-energy configurations (i.e., quasiflat minima over macroscopic length scales) are of fundamental interest, as they are particularly well-suited for making predictions given the learned data: On the one hand, these regions are by definition robust with respect to fluctuations in a sizable fraction of the weight configurations and, as such, are less prone to fit the noise. On the other hand, an optimal Bayesian estimate, resulting from a weighted consensus vote on all configurations, would receive a major contribution from one of such regions, compared with a narrow minimum; the centroid of the region (computed according to any reasonable metric which correlates the distance between configurations with the network outcomes) would act as a representative of the region as a whole (29). In this respect, it is worth mentioning that in deep learning (30), all of the learning algorithms which lead to good prediction performance always include effects of a systematically injected noise in the learning phase, a fact that makes the equilibrium Gibbs measure not the stationary measure of the learning protocols and drives the systems toward wide minima. We expect that these results can be generalized to many other classes of nonconvex optimization problems where local entropy plays a role, ranging from robust optimization to physical disordered systems.

Quantum gate-based algorithms for machine learning exist; however, the possibility of a physical implementation remains a critical issue (31).

Energy Functions

As a working example, we first consider the problem of learning random patterns in a single-layer neural network with binary weights, the so-called binary perceptron problem (23). This network maps vectors of $N$ inputs $ξ \in \{-1, +1\}^{N}$ to binary outputs $τ = \pm 1$ through the nonlinear function $τ = \mathrm{sgn}(σ \cdot ξ)$, where $σ \in \{-1, +1\}^{N}$ is the vector of synaptic weights. Given $α N$ input patterns $\{ξ^{μ}\}_{μ = 1}^{α N}$ and their corresponding desired outputs $\{τ^{μ}\}_{μ = 1}^{α N}$, the learning problem consists in finding $σ$ such that all input patterns are simultaneously classified correctly, i.e., $\mathrm{sgn}(σ \cdot ξ^{μ}) = τ^{μ}$ for all $μ$. Both the components of the input vectors $ξ_{i}^{μ}$ and the outputs $τ^{μ}$ are independent identically distributed unbiased random variables ($P(x) = \frac{1}{2} δ(x - 1) + \frac{1}{2} δ(x + 1)$). In the binary framework, the procedure for writing a spin Hamiltonian whose ground states are the sought optimal solutions of the original optimization problem is well known (32). The energy $E$ of the binary perceptron is proportional to the number of classification errors and can be written as

$$E(\{σ_{j}\}) = \sum_{μ = 1}^{α N} (-Δ_{μ})^{n}\, Θ(-Δ_{μ}), \qquad Δ_{μ} \doteq \frac{τ^{μ}}{\sqrt{N}} \sum_{j = 1}^{N} ξ_{j}^{μ} σ_{j} \tag{1}$$

where $Θ(x)$ is the Heaviside step function: $Θ(x) = 1$ if $x > 0$, $Θ(x) = 0$ otherwise. When the argument of the $Θ$ function is positive, the perceptron is implementing the wrong input–output mapping. The exponent $n \in \{0, 1\}$ defines two different forms of the energy functions which have the same zero-energy ground states and different structures of local minima. The equilibrium analysis of the binary perceptron problem shows that in the large size limit, and for $α < α_{c} ≃ 0.83$ (23), the energy landscape is dominated by an exponential number of local minima and of zero-energy ground states that are typically geometrically isolated (33), i.e., they have extensive mutual Hamming distances. For both choices of $n$, the problem is computationally hard for SA processes (34): In the large $N$ limit, a detailed balanced stochastic search process gets stuck in metastable states at energy levels of order $O(N)$ above the ground states.
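As a minimal sketch of how the energy of Eq. 1 can be evaluated, the following function counts misclassified patterns for $n = 0$ and, for $n = 1$, sums the magnitude $|Δ_{μ}|$ of the violated stabilities, so that the energy is non-negative and vanishes exactly on the solutions. The function name `perceptron_energy` and the list-of-lists data layout are assumptions made for illustration.

```python
import math

def perceptron_energy(sigma, xi, tau, n=0):
    """Energy of Eq. 1 (illustrative layout, not the authors' code).

    sigma: list of N weights in {-1, +1}
    xi:    list of patterns, each a list of N inputs in {-1, +1}
    tau:   list of desired outputs in {-1, +1}
    n:     0 counts errors, 1 sums the magnitude of violated stabilities
    """
    N = len(sigma)
    energy = 0.0
    for pattern, target in zip(xi, tau):
        # Stability Delta_mu: positive when the pattern is classified correctly.
        delta = target * sum(x * s for x, s in zip(pattern, sigma)) / math.sqrt(N)
        if delta < 0:  # Theta(-Delta_mu) = 1
            energy += 1.0 if n == 0 else -delta
    return energy
```

With an odd $N$ the stabilities never vanish, so the strict inequality in the step function raises no tie-breaking issue.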

Following the standard SQA approach, we identify the binary variables $σ$ with one of the components of physical quantum spins, say, $σ^{z}$ , and we introduce the Hamiltonian operator of a model of $N$ quantum spins with the perceptron term of Eq. 1 acting in the longitudinal direction $z$ and a magnetic field $Γ$ acting in the transverse direction $x$ . The interpolating Hamiltonian reads:

$$\hat{H} = E(\{\hat{σ}_{j}^{z}\}) - Γ \sum_{j = 1}^{N} \hat{σ}_{j}^{x} \tag{2}$$

where $\hat{σ}_{j}^{z}$ and $\hat{σ}_{j}^{x}$ are the spin operators (Pauli matrices) in the $z$ and $x$ directions. For $Γ = 0$, one recovers the classical optimization problem. The QA procedure consists of initializing the system at large $β$ and $Γ$, and slowly decreasing $Γ$ to $0$. To analyze the low-temperature phase diagram of the model, we need to study the average of the logarithm of the partition function $Z = \mathrm{Tr}(e^{-β \hat{H}})$. This can be done by using the Suzuki–Trotter transformation, which leads to the study of a classical effective Hamiltonian acting on a system of $y$ interacting Trotter replicas of the original classical system coupled in an extra dimension:

$$H_{\mathrm{eff}}(\{σ_{j}^{a}\}_{j, a}) = \frac{1}{y} \sum_{a = 1}^{y} E(\{σ_{j}^{a}\}_{j}) - \frac{γ}{β} \sum_{a = 1}^{y} \sum_{j = 1}^{N} σ_{j}^{a} σ_{j}^{a + 1} - \frac{N K}{β} \tag{3}$$

where the $σ_{j}^{a} = \pm 1$ are Ising spins, $a \in \{1, \dots, y\}$ is a replica index with periodic boundary conditions $σ_{j}^{y + 1} \equiv σ_{j}^{1}$, $γ = \frac{1}{2} \log \coth\left(\frac{β Γ}{y}\right)$ and $K = \frac{1}{2} y \log\left(\frac{1}{2} \sinh\left(2 \frac{β Γ}{y}\right)\right)$.
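The couplings entering Eq. 3 are simple to evaluate numerically. The sketch below computes $γ$ and $K$ and assembles $H_{eff}$ for a given set of replicas; the names `trotter_couplings`, `effective_energy`, `gamma_field` (standing for $Γ$), and `energy_fn` are hypothetical, and the classical energy is passed in as a callable.

```python
import math

def trotter_couplings(beta, gamma_field, y):
    """Return (gamma, K) of Eq. 3 for transverse field gamma_field > 0."""
    x = beta * gamma_field / y
    coupling = 0.5 * math.log(1.0 / math.tanh(x))      # gamma = 1/2 log coth(x)
    K = 0.5 * y * math.log(0.5 * math.sinh(2.0 * x))
    return coupling, K

def effective_energy(replicas, energy_fn, beta, gamma_field):
    """H_eff of Eq. 3; replicas is a list of y spin lists, periodic in a."""
    y, N = len(replicas), len(replicas[0])
    coupling, K = trotter_couplings(beta, gamma_field, y)
    classical = sum(energy_fn(r) for r in replicas) / y
    # Nearest-neighbour coupling along the Trotter dimension (a + 1 wraps to 1).
    chain = sum(replicas[a][j] * replicas[(a + 1) % y][j]
                for a in range(y) for j in range(N))
    return classical - coupling / beta * chain - N * K / beta
```

A useful sanity check is the trivial case $E = 0$ with a single spin, where summing $e^{-β H_{eff}}$ over all replica configurations should give $2 \cosh(β Γ)$ for any $y$. The coupling $γ$ grows without bound as $Γ \to 0$, so the function is only meant for strictly positive fields.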

The replicated system needs to be studied in the limit $y \to \infty$ to recover the so-called path integral continuous quantum limit and to make the connection with the behavior of quantum devices (14). The SQA dynamical process samples configurations from an equilibrium distribution, and it is not necessarily equivalent to the real-time Schrödinger equation evolution of the system. A particularly dangerous situation occurs if the ground states of the system encounter first-order phase transitions which are associated with exponentially small gaps (10, 35, 36) at finite $N$ . As discussed below, this appears not to be the case for the class of models we are considering.

Connection with the Local Entropy Measure

The effective Hamiltonian Eq. 3 can be interpreted as many replicas of the original systems coupled through one-dimensional periodic chains, one for each original spin (Fig. 1B). Note that the interaction term $γ$ diverges as the transverse field $Γ$ goes to $0$. This geometrical structure is very similar to that of the robust ensemble (RE) formalism (20), where a probability measure that gives higher weight to rare dense regions of low-energy states is introduced. There, the main idea is to maximize $Φ(σ^{⋆}) = \log \sum_{\{σ\}} e^{-β E(σ) + λ \sum_{j = 1}^{N} σ_{j} σ_{j}^{⋆}}$, i.e., a “local free entropy” where $λ$ is a Lagrange parameter that controls the extensive size of the region around a reference configuration $σ^{⋆}$. One can then build a new Gibbs distribution $P(σ^{⋆}) \propto e^{y Φ(σ^{⋆})}$, where $-Φ$ has the role of an energy and $y$ of an inverse temperature: In the limit of large $y$, this distribution concentrates on the maxima of $Φ$. Upon restricting the values of $y$ to be integer (and large), $P(σ^{⋆})$ takes a factorized form yielding a replicated probability measure $P_{RE}(σ^{⋆}, σ^{1}, \dots, σ^{y}) \propto e^{-β H_{eff}^{RE}(σ^{⋆}, \{σ_{j}^{a}\})}$ where the effective energy is given by

$$H_{\mathrm{eff}}^{RE}(σ^{⋆}, \{σ_{j}^{a}\}_{j, a}) = \sum_{a = 1}^{y} E(\{σ_{j}^{a}\}_{j}) - \frac{λ}{β} \sum_{a = 1}^{y} \sum_{j = 1}^{N} σ_{j}^{a} σ_{j}^{⋆} \tag{4}$$

As in the Suzuki–Trotter formalism, $H_{eff}^{RE}(σ^{⋆}, \{σ_{j}^{a}\}_{j, a})$ corresponds to a system with an overall energy given by the sum of $y$ individual “real replica energies” plus a geometric coupling term; in this case, however, the replicas interact with the “reference” configuration $σ^{⋆}$ rather than among themselves (Fig. 1C).
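The step from the local free entropy to the replicated measure can be spelled out in one line. Assuming integer $y$, and writing the coupling in $Φ$ with the sign that favors configurations close to $σ^{⋆}$ (the sign that matches the coupling term of Eq. 4), the $y$-th power of the sum over $σ$ becomes a sum over $y$ independent copies $σ^{1}, \dots, σ^{y}$:

$$e^{y Φ(σ^{⋆})} = \left( \sum_{\{σ\}} e^{-β E(σ) + λ \sum_{j} σ_{j} σ_{j}^{⋆}} \right)^{y} = \sum_{\{σ^{1}\}, \dots, \{σ^{y}\}} \exp\left( -β \sum_{a = 1}^{y} E(\{σ_{j}^{a}\}_{j}) + λ \sum_{a = 1}^{y} \sum_{j = 1}^{N} σ_{j}^{a} σ_{j}^{⋆} \right) = \sum_{\{σ^{1}\}, \dots, \{σ^{y}\}} e^{-β H_{eff}^{RE}(σ^{⋆}, \{σ_{j}^{a}\}_{j, a})}$$

The exponent in the middle expression is $-β$ times the effective energy of Eq. 4, which is why the replicas couple only to $σ^{⋆}$ and not to each other.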

Fig. 1. Topology of the Suzuki–Trotter vs. robust ensemble (RE) representations. (A) The classical objective function we wish to optimize, which depends on $N$ discrete variables $\{σ_{j}\}$ ($N = 5$ in the picture). (B) Suzuki–Trotter interaction topology: $y$ replicas of the classical system ($y = 7$ in the picture) are coupled by periodic one-dimensional chains, one for each classical spin. (C) RE interaction topology: $y$ replicas are coupled through a centroid configuration. In the limit of large $N$ and large $y$ (quantum limit) and for strong interaction couplings, all replicas are forced to be close, and the behavior of the two effective models is expected to be similar.

The Suzuki–Trotter representation and the RE formalism differ in the topology of the interactions between replicas and in the scaling of the interactions, but for both cases, there is a classical limit, $Γ \to 0$ and $λ \to \infty$, respectively, in which the replicated systems are forced to correlate and eventually coalesce in identical configurations. For nonconvex problems, these will not in general correspond to configurations dominating the original classical Gibbs measure.

For the sake of clarity, we recall that in the classical limit and for $α < α_{c}$, our model presents an exponential number of far-apart isolated ground states which dominate the Gibbs measure. At the same time, there exist rare clusters of ground states with a density close to its maximum possible value (high local entropy) for small but still macroscopic cluster sizes (19). This fact has several consequences: No further subdivision of the clusters into states is possible; the ground states are typically connected by $O(1)$ spin flips (19); and a trade-off between tunneling events and an exponential number of destination states within the cluster is possible.

Phase Diagram: Analytical and Numerical Results

Thanks to the mean field nature of the energetic part of the system, Eq. 3, we can resort to the replica method for calculating analytically the phase diagram. As discussed in SI Appendix, this can be done under the so-called static approximation, which consists of using a single parameter $q_{1}$ to represent the overlaps along the Trotter dimension, $q_{1}^{a b} = \langle \frac{1}{N} \sum_{j = 1}^{N} σ_{j}^{a} σ_{j}^{b} \rangle \approx q_{1}$. Although this approximation crudely neglects the dependence of $q_{1}^{a b}$ on $|a - b|$, the resulting predictions show a remarkable agreement with numerical simulations.
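In a simulation, the quantity compared with $q_{1}$ can be estimated from a sampled set of Trotter replicas by averaging the overlap over all replica pairs. The helper below is a sketch under that assumption; the name `mean_replica_overlap` is hypothetical, and a single call gives a one-configuration estimate that would still need to be averaged over Monte Carlo samples to approximate the thermal average.

```python
from itertools import combinations

def mean_replica_overlap(replicas):
    """Average of (1/N) sum_j s_j^a s_j^b over all pairs a < b (needs y >= 2)."""
    N = len(replicas[0])
    pairs = list(combinations(range(len(replicas)), 2))
    total = sum(sum(x * z for x, z in zip(replicas[a], replicas[b])) / N
                for a, b in pairs)
    return total / len(pairs)
```

Averaging over all pairs mirrors the static approximation, which ignores how the overlap depends on the separation $|a - b|$ along the Trotter dimension.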

In Fig. 2, we report the analytical predictions for the average classical component of the energy of the quantum model as a function of the transverse field $Γ$. We compare the results with the outcome of extensive simulations performed with the reduced-rejection-rate (RRR) Monte Carlo method (37), in which $Γ$ is initialized at $2.5$ and gradually brought down to $0$ in regular small steps, at constant temperature, and fixing the total simulation time to $τ N y \cdot 10^{4}$ (so as to keep the number of Monte Carlo sweeps constant when varying $N$ and $y$). Additional details are reported in Materials and Methods and SI Appendix. The size of the systems, the number of samples, and the number of Trotter replicas are scaled up to large values, so that both finite size effects and the quantum limit are kept under control. A key point is to observe that the results do not degrade with the number of Trotter replicas: The average ground-state energy approaches a limiting value, close to the theoretical prediction, in the large $y$ quantum limit. The results appear to be rather insensitive to both $N$ and the simulation time-scaling parameter $τ$. This indicates that Monte Carlo appears to be able to equilibrate efficiently, in a constant (or almost constant) number of sweeps, at each $Γ$. The analytical prediction for the classical energy only appears to display a relatively small systematic offset (due to the static approximation) at intermediate values of $Γ$, while it is very precise at both large and small $Γ$; the expectation of the total Hamiltonian, on the other hand, is in excellent agreement with the simulations (SI Appendix).

In the same plot, we display the behavior of classical SA simulated with a standard Metropolis–Hastings scheme, under an annealing protocol in $β$ that would follow the same theoretical curve as SQA if the system were able to equilibrate (Materials and Methods and SI Appendix): As expected (34), SA gets trapped at very high energies (increasing with problem size; in the thermodynamic limit, it is expected that SA would remain stuck at the initial value $0.5 N$ of the energy for times which scale exponentially with $N$ ). Alternative annealing protocols yield analogous results; the exponential scaling with $N$ of SA on binary perceptron models had also been observed experimentally in previous results, e.g., in refs. 21 and 38.

In Fig. 2 Inset, we report the analytical prediction for the transverse overlap parameter $q_{1}$ , which quite remarkably reproduces fairly well the average overlap as measured from simulations.

In Fig. 3, we provide the profiles of the classical energy minima found for different values of $Γ$ in the case of SQA and different temperatures for SA. These results are computed analytically by the cavity method (see Materials and Methods and SI Appendix for details) by evaluating which is the most probable energy found at a normalized Hamming distance $d$ from a given configuration. As it turns out, throughout the annealing process, SQA follows a path corresponding to wide valleys, while SA gets stuck in steep metastable states. The quantum fluctuations reproduced by the SQA process drive the system to converge toward wide flat regions, despite the fact that they are exponentially rare compared with the narrow minima.

The physical interpretation of these results is that quantum fluctuations lower the energy of a cluster proportionally to its size or, in other words, that quantum fluctuations allow the system to lower its kinetic energy by delocalizing; see refs. 25, 26, and 39 for related results. Along the process of reduction of the transverse field, we do not observe any phase transition which could induce a critical slowing down of the QA process, and we expect SQA and QA to behave similarly (11, 36).

This is in agreement with the results of a direct comparison between the real-time quantum dynamics and the SQA on small systems ($N = 21$): As reported in SI Appendix, we have performed extensive numerical studies of properly selected small instances of the binary perceptron problem, comparing the results of SQA and QA and analyzing the results of the QA process and the properties of the Hamiltonian. To reproduce the conditions that are known to exist at large values of $N$, we have selected instances for which SA with a fast annealing schedule gets trapped at some positive fraction of violated constraints, and yet the problems display a sufficiently high number of solutions. We found that the agreement between SQA and QA on each sample is excellent. The measurements on the final configurations reached by QA qualitatively confirm the scenario described above, that QA is attracted toward dense, low-energy regions without getting stuck during the annealing process. Finally, the analysis of the gap between the ground state of the system and the first excited state as $Γ$ decreases shows no signs of the kind of phenomena which would typically hamper the performance of QA in other models: There are no vanishingly small gaps at finite $Γ$ (compare discussion in the introduction). We benchmarked all these results with “randomized” versions of the same samples, in which we randomly permuted the classical energies associated with each spin configuration, so as to keep the distribution of the classical energy levels while destroying the geometric structure of the states. Indeed, for these randomized samples, we found that the gaps nearly close at finite $Γ ≃ 0.4$, and that, correspondingly, the QA process fails to track the ground state of the system, resulting in a much-reduced probability of finding a solution to the problem.
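For systems small enough to be diagonalized densely, the gap of the Hamiltonian of Eq. 2 can be computed directly. The following sketch assumes NumPy and a list `energies` holding the classical energy of each of the $2^{N}$ configurations, indexed by the integer whose binary digits encode the spins; the function names are hypothetical, and dense diagonalization is an illustrative choice that becomes costly well before $N = 21$.

```python
import numpy as np

def hamiltonian(energies, N, gamma_field):
    """Dense matrix of Eq. 2 in the sigma^z basis.

    energies[k] is the classical energy of the configuration encoded by
    the bits of k; gamma_field is the transverse field.
    """
    H = np.diag(np.asarray(energies, dtype=float))
    for k in range(2 ** N):
        for j in range(N):
            H[k, k ^ (1 << j)] -= gamma_field  # sigma^x_j flips bit j
    return H

def spectral_gap(energies, N, gamma_field):
    """Difference between the two lowest eigenvalues."""
    vals = np.linalg.eigvalsh(hamiltonian(energies, N, gamma_field))
    return vals[1] - vals[0]
```

Scanning `spectral_gap` over a grid of field values, and repeating the scan after randomly permuting `energies`, gives the kind of comparison described in the text: the permutation preserves the energy levels but destroys their geometric arrangement.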

As concluding remarks, we report that the models with $n = 0$ and $n = 1$ have phase diagrams which are qualitatively very similar (for the sake of simplicity, here we reported the $n = 0$ case only). The former presents at very small positive values of $Γ$ a collapse of the density matrix onto the classical one, whereas the latter ends up in the classical state only at $Γ = 0$ .

For the sake of completeness, we have checked that the performance of SQA in the $y \to \infty$ quantum limit extends to more complex architectures which include hidden layers; the details are reported in SI Appendix.

Conclusions

We conclude by noticing that, at variance with other studies on spin-glass models in which the evidence for QA outperforming classical annealing was limited to finite values of $y$, thereby just defining a different type of classical SA algorithm, in our case the quantum limit coincides with the optimal behavior of the algorithm itself. We believe that these results could play a role in many optimization problems in which optimality of the cost function needs to also meet robustness conditions (i.e., wide minima). As far as learning problems are concerned, it is worth mentioning that for the best-performing artificial neural networks, the so-called deep networks (30), there is numerical evidence for the existence of rare flat minima (40) and that all of the effective algorithms always include effects of systematic injected noise in the learning phase (41), which implies that the equilibrium Gibbs measure is not the stationary measure of the learning protocols. For the sake of clarity, we should remark that our results are intended to suggest that QA can equilibrate efficiently, whereas SA cannot; i.e., our notion of quantum speed-up is relative to the same algorithmic scheme that runs on classical hardware. Other classical algorithms for the same class of problems, besides the above-mentioned ones based on the RE and the SQA itself, have been discovered (27, 38, 42–44); however, all of these algorithms are qualitatively different from QA, which can provide a huge speed-up by manipulating single physical bits in parallel. Thus, the overall solving time in a physical QA implementation (neglecting any other technological considerations) would have, at worst, only a mild dependence on $N$.

Our results provide further evidence that learning can be achieved through different types of correlated fluctuations, among which quantum tunneling could be a relevant example for physical devices.

Materials and Methods

Simulated QA Protocol.

All SQA simulations were performed by using the RRR Monte Carlo method (37). We fixed the total number of spin flip attempts at $τ N y \cdot 10^{4}$ and followed a linear protocol (divided into $30 τ$ steps) for the annealing of $Γ$. In Fig. 2, we show the results for $N = 4001$ and $τ = 4$; the results for $N = 1001, 2001$ and for $τ = 1, 2$ were essentially indistinguishable at that level of detail.
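The structure of such a simulation can be sketched with a plain Metropolis update on the effective Hamiltonian of Eq. 3. Note that this substitutes standard Metropolis for the RRR method used in the paper, uses the $n = 0$ energy, and picks illustrative values for the inverse temperature, the number of replicas, and the number of sweeps; all names and defaults below are assumptions, not the authors' settings.

```python
import math
import random

def sqa_anneal(xi, tau, y=8, beta=5.0, gamma0=2.5, steps=30, sweeps=20, seed=0):
    """Path-integral Monte Carlo on H_eff (Eq. 3) with plain Metropolis moves.

    Returns the lowest number of errors among the y replicas (y >= 2) at the
    end of a linear schedule for the transverse field.
    """
    rng = random.Random(seed)
    M, N = len(xi), len(xi[0])
    spins = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(y)]
    # stab[a][m] = tau^m * sum_j xi^m_j sigma^a_j (unnormalised stability)
    stab = [[tau[m] * sum(xi[m][j] * s[j] for j in range(N)) for m in range(M)]
            for s in spins]
    for step in range(steps):
        # Linear schedule; it stops at the last positive value because the
        # replica coupling diverges when the field reaches zero.
        field = gamma0 * (1.0 - step / steps)
        coupling = 0.5 * math.log(1.0 / math.tanh(beta * field / y))
        for _ in range(sweeps * N * y):
            a, j = rng.randrange(y), rng.randrange(N)
            s = spins[a][j]
            d_err = 0  # change in the number of errors of replica a
            for m in range(M):
                old = stab[a][m]
                new = old - 2 * tau[m] * xi[m][j] * s
                d_err += (new < 0) - (old < 0)
            neighbours = spins[(a - 1) % y][j] + spins[(a + 1) % y][j]
            # Change of beta * H_eff caused by flipping spin (a, j).
            d_action = beta * d_err / y + 2.0 * coupling * s * neighbours
            if d_action <= 0 or rng.random() < math.exp(-d_action):
                spins[a][j] = -s
                for m in range(M):
                    stab[a][m] -= 2 * tau[m] * xi[m][j] * s
    return min(sum(v < 0 for v in row) for row in stab)
```

Keeping the stabilities of every replica up to date makes each proposed flip cost $O(αN)$ operations rather than a full recomputation of the energy.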

Classical SA Protocol.

The results for SA presented in Fig. 2 used an annealing protocol in $β$ designed to make a direct comparison with QA: We found analytically a curve $β_{equiv}(Γ)$ such that the classical equilibrium energy would be equal to the longitudinal component of the quantum system energy. The classical equilibrium energy was computed from the equations in ref. 23. The result is shown in SI Appendix, Fig. S1. The SA protocol thus consisted of setting $β = β_{equiv}(Γ)$ and decreasing $Γ$ linearly from $2.5$ to $0$, as in the QA case. We fixed the total number of spin flip attempts at $τ N \cdot 10^{4}$ and used $τ = 4, 8, 16$; as in the QA case, the annealing process was divided into $30 τ$ steps. If the system were able to equilibrate, it would follow the theoretical curve (dashed black line in Fig. 2), which it does only for high temperatures.

Other more standard annealing protocols (e.g., linear, exponential, or logarithmic) yielded very similar qualitative results, as expected from the analysis of ref. 34.
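A standard protocol of this kind can be sketched as follows, assuming a linear schedule in $β$ (not the $β_{equiv}(Γ)$ curve, which requires the analytical results of ref. 23), the $n = 0$ energy, and single-spin-flip Metropolis moves. The function name and all default parameters are hypothetical.

```python
import math
import random

def simulated_annealing(xi, tau, beta_max=5.0, steps=30, sweeps=20, seed=0):
    """Metropolis SA on the n = 0 perceptron energy with a linear beta schedule.

    Returns the number of misclassified patterns at the end of the run.
    """
    rng = random.Random(seed)
    M, N = len(xi), len(xi[0])
    sigma = [rng.choice((-1, 1)) for _ in range(N)]
    # stab[m] = tau^m * sum_j xi^m_j sigma_j (unnormalised stability)
    stab = [tau[m] * sum(xi[m][j] * sigma[j] for j in range(N)) for m in range(M)]
    for step in range(1, steps + 1):
        beta = beta_max * step / steps  # linear schedule in beta
        for _ in range(sweeps * N):
            j = rng.randrange(N)
            new = [stab[m] - 2 * tau[m] * xi[m][j] * sigma[j] for m in range(M)]
            d_err = sum(v < 0 for v in new) - sum(v < 0 for v in stab)
            if d_err <= 0 or rng.random() < math.exp(-beta * d_err):
                sigma[j] = -sigma[j]
                stab = new
    return sum(v < 0 for v in stab)
```

On very small instances a run like this often reaches zero errors; the trapping discussed in the text is a large-$N$ effect.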

Estimation of the Local Energy and Entropy Landscapes.

To compute the local landscapes of the energy and the entropy around a reference configuration, Fig. 3, we used the belief propagation algorithm. We added an external field in the direction of the configuration of interest to focus on regions surrounding that configuration. The strength of the field allowed us to control the size of the region (parameter $d$ in Fig. 3). Typical energies are computed by setting the temperature to infinity, while local entropies are computed by setting the temperature to $0$ . The details of the algorithm are presented in SI Appendix.
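For very small $N$, the local entropy around a reference configuration can be cross-checked by exhaustive enumeration. The sketch below is such a brute-force substitute, not the belief propagation computation used for Fig. 3: it counts solutions at each Hamming distance from the reference. The names `local_entropy_profile` and `is_solution` are hypothetical, and the normalized distance of Fig. 3 corresponds to $d = k / N$.

```python
import math
from itertools import combinations

def local_entropy_profile(reference, is_solution, max_flips):
    """Brute-force local entropy around a reference configuration.

    For each k <= max_flips, returns log(#solutions at Hamming distance k) / N,
    or None when no solution exists at that distance. The cost grows
    combinatorially with N, so this is only viable for small systems.
    """
    N = len(reference)
    profile = []
    for k in range(max_flips + 1):
        count = 0
        for flips in combinations(range(N), k):
            candidate = list(reference)
            for j in flips:
                candidate[j] = -candidate[j]
            count += bool(is_solution(candidate))
        profile.append(math.log(count) / N if count else None)
    return profile
```

Here `is_solution` would typically test whether the perceptron energy of the candidate is zero.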

Real-Time QA Simulations on Small Instances.

The real-time quantum dynamics simulations on small systems were performed by solving the time-dependent Schrödinger equation for the Hamiltonian of Eq. 2 by using the short iterative Lanczos method (45), which consists of computing the evolution with the Lanczos algorithm, at fixed $Γ$ for a short time interval $Δ t$ , then lowering $Γ$ by a small fixed amount $Δ Γ$ , and iterating until $Γ = 0$ .
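The scheme just described can be sketched as follows, assuming NumPy, a dense Hermitian matrix for the Hamiltonian, and a user-supplied callable `build_h` that returns the matrix of Eq. 2 at a given field. The Krylov dimension `m`, the time step, the number of field steps, and all function names are illustrative assumptions rather than the settings of ref. 45 or of the paper.

```python
import numpy as np

def lanczos_step(H, psi, dt, m=10):
    """Approximate exp(-i H dt) psi in an m-dimensional Krylov subspace."""
    norm = np.linalg.norm(psi)
    V = np.zeros((m, psi.size), dtype=complex)
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    V[0] = psi / norm
    k = m
    for i in range(m):
        w = H @ V[i]
        alpha[i] = np.vdot(V[i], w).real
        w = w - alpha[i] * V[i]
        if i > 0:
            w = w - beta[i - 1] * V[i - 1]
        if i == m - 1:
            break
        beta[i] = np.linalg.norm(w)
        if beta[i] < 1e-12:  # invariant subspace reached early
            k = i + 1
            break
        V[i + 1] = w / beta[i]
    # Tridiagonal representation of H in the Krylov basis.
    T = np.diag(alpha[:k]) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
    evals, evecs = np.linalg.eigh(T)
    # exp(-i T dt) applied to the first basis vector.
    coeffs = evecs @ (np.exp(-1j * evals * dt) * evecs[0])
    return norm * (coeffs @ V[:k])

def real_time_anneal(build_h, dim, gamma0=2.5, n_steps=50, dt=0.5, m=10):
    """Evolve at fixed field for dt, lower the field, and repeat.

    Starts from the uniform superposition, the ground state of the
    transverse-field term; the last step uses the field gamma0 / n_steps.
    """
    psi = np.full(dim, 1.0 / np.sqrt(dim), dtype=complex)
    for step in range(n_steps):
        gamma = gamma0 * (1.0 - step / n_steps)
        psi = lanczos_step(build_h(gamma), psi, dt, m)
    return psi
```

The squared moduli of the returned amplitudes give the probability of each classical configuration at the end of the anneal. No reorthogonalization is performed, which is usually acceptable for the small Krylov dimensions used in short-time propagation.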

Acknowledgments

We thank G. Santoro, B. Kappen, and F. Becca for discussions. This work was supported by Office of Naval Research Grant N00014-17-1-2569.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1711456115/-/DCSupplemental.

References

1.Ray P, Chakrabarti BK, Chakrabarti A. Sherrington-Kirkpatrick model in a transverse field: Absence of replica symmetry breaking due to quantum fluctuations. Phys Rev B. 1989;39:11828–11832. doi: 10.1103/physrevb.39.11828. [DOI] [PubMed] [Google Scholar]
2.Finnila A, Gomez M, Sebenik C, Stenson C, Doll J. Quantum annealing: A new method for minimizing multidimensional functions. Chem Phys Lett. 1994;219:343–348. [Google Scholar]
3.Kadowaki T, Nishimori H. Quantum annealing in the transverse ising model. Phys Rev E. 1998;58:5355–5363. [Google Scholar]
4.Farhi E, et al. A quantum adiabatic evolution algorithm applied to random instances of an np-complete problem. Science. 2001;292:472–475. doi: 10.1126/science.1057726. [DOI] [PubMed] [Google Scholar]
5.Das A, Chakrabarti BK. Colloquium: Quantum annealing and analog quantum computation. Rev Mod Phys. 2008;80:1061–1081. [Google Scholar]
6.Moore C, Mertens S. The Nature of Computation. Oxford Univ Press; Oxford: 2011. [Google Scholar]
7.Born M, Fock V. Beweis des adiabatensatzes. Zeitschrift Phys A Hadrons Nuclei. 1928;51:165–180. [Google Scholar]
8.Landau L. Zur theorie der energieubertragung. II. Phys Z Sowjetunion. 1932;2:1–13. [Google Scholar]
9.Zener C. Non-adiabatic crossing of energy levels. Proc R Soc Lond A Math Phys Eng Sci. 1932;137:696–702. [Google Scholar]
10.Altshuler B, Krovi H, Roland J. Anderson localization makes adiabatic quantum optimization fail. Proc Natl Acad Sci USA. 2010;107:12446–12450. doi: 10.1073/pnas.1002116107.
11.Bapst V, Foini L, Krzakala F, Semerjian G, Zamponi F. The quantum adiabatic algorithm applied to random optimization problems: The quantum spin glass perspective. Phys Rep. 2013;523:127–205.
12.Santoro GE, Martoňák R, Tosatti E, Car R. Theory of quantum annealing of an Ising spin glass. Science. 2002;295:2427–2430. doi: 10.1126/science.1068774.
13.Martoňák R, Santoro GE, Tosatti E. Quantum annealing by the path-integral Monte Carlo method: The two-dimensional random Ising model. Phys Rev B. 2002;66:094203.
14.Heim B, Rønnow TF, Isakov SV, Troyer M. Quantum versus classical annealing of Ising spin glasses. Science. 2015;348:215–217. doi: 10.1126/science.aaa4170.
15.Rønnow TF, et al. Defining and detecting quantum speedup. Science. 2014;345:420–424. doi: 10.1126/science.1252319.
16.Johnson MW, et al. Quantum annealing with manufactured spins. Nature. 2011;473:194–198. doi: 10.1038/nature10012.
17.Boixo S, et al. Evidence for quantum annealing with more than one hundred qubits. Nat Phys. 2014;10:218–224.
18.Langbein W, et al. Control of fine-structure splitting and biexciton binding in In_xGa_{1-x}As quantum dots by annealing. Phys Rev B. 2004;69:161301.
19.Baldassi C, Ingrosso A, Lucibello C, Saglietti L, Zecchina R. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys Rev Lett. 2015;115:128101. doi: 10.1103/PhysRevLett.115.128101.
20.Baldassi C, et al. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proc Natl Acad Sci USA. 2016;113:E7655–E7662. doi: 10.1073/pnas.1608103113.
21.Baldassi C, Ingrosso A, Lucibello C, Saglietti L, Zecchina R. Local entropy as a measure for sampling solutions in constraint satisfaction problems. J Stat Mech Theor Exp. 2016;2016:P023301.
22.Baldassi C, Gerace F, Lucibello C, Saglietti L, Zecchina R. Learning may need only a few bits of synaptic precision. Phys Rev E. 2016;93:052313. doi: 10.1103/PhysRevE.93.052313.
23.Krauth W, Mézard M. Storage capacity of memory networks with binary couplings. J Phys France. 1989;50:3057–3066.
24.Sompolinsky H, Tishby N, Seung HS. Learning from examples in large neural networks. Phys Rev Lett. 1990;65:1683–1686. doi: 10.1103/PhysRevLett.65.1683.
25.Foini L, Semerjian G, Zamponi F. Solvable model of quantum random optimization problems. Phys Rev Lett. 2010;105:167204. doi: 10.1103/PhysRevLett.105.167204.
26.Biroli G, Zamponi F. A tentative replica theory of glassy helium 4. J Low Temp Phys. 2012;168:101–116.
27.Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y. 2016. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv:1609.07061.
28.Courbariaux M, Bengio Y, David JP. Advances in Neural Information Processing Systems. Vol 28. Curran Associates; Red Hook, NY: 2015. BinaryConnect: Training deep neural networks with binary weights during propagations; pp. 3105–3113.
29.MacKay DJ. Information Theory, Inference and Learning Algorithms. Cambridge Univ Press; Cambridge, UK: 2003.
30.LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539.
31.Aaronson S. Read the fine print. Nat Phys. 2015;11:291–293.
32.Barahona F. On the computational complexity of Ising spin glass models. J Phys A Math Gen. 1982;15:3241–3253.
33.Huang H, Kabashima Y. Origin of the computational hardness for learning with binary synapses. Phys Rev E. 2014;90:052813. doi: 10.1103/PhysRevE.90.052813.
34.Horner H. Dynamics of learning for the binary perceptron problem. Zeitschrift Physik B Condens Matter. 1992;86:291–308.
35.Bapst V, Semerjian G. On quantum mean-field models and their quantum annealing. J Stat Mech Theor Exp. 2012;2012:P06007.
36.Bapst V, Semerjian G. Thermal, quantum and simulated quantum annealing: Analytical comparisons for simple models. J Phys Conf Ser. 2013;473:012011.
37.Baldassi C. A method to reduce the rejection rate in Monte Carlo Markov chains. J Stat Mech Theor Exp. 2017;2017:033301.
38.Baldassi C, Braunstein A, Brunel N, Zecchina R. Efficient supervised learning in networks with binary synapses. Proc Natl Acad Sci USA. 2007;104:11079–11084. doi: 10.1073/pnas.0700324104.
39.Markland TE, et al. Quantum fluctuations can promote or inhibit glass formation. Nat Phys. 2011;7:134–137.
40.Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, Tang PTP. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv:1609.04836.
41.Bottou L, Curtis FE, Nocedal J. 2016. Optimization methods for large-scale machine learning. arXiv:1606.04838.
42.Braunstein A, Zecchina R. Learning by message-passing in neural networks with material synapses. Phys Rev Lett. 2006;96:030201. doi: 10.1103/PhysRevLett.96.030201.
43.Baldassi C. Generalization learning in a perceptron with binary synapses. J Stat Phys. 2009;136:902–916.
44.Baldassi C, Braunstein A. A max-sum algorithm for training discrete neural networks. J Stat Mech Theor Exp. 2015;2015:P08008.
45.Schneider BI, Guan X, Bartschat K. Chapter five: Time propagation of partial differential equations using the short iterative Lanczos method and finite-element discrete variable representation. Adv Quan Chem. 2016;72:95–127.

Associated Data

Supplementary Materials

Supplementary File

pnas.1711456115.sapp.pdf (2.3 MB, PDF)
