Proceedings of the National Academy of Sciences of the United States of America. 2018 Jul 27;115(33):E7665–E7671. doi: 10.1073/pnas.1806579115

A mean field view of the landscape of two-layer neural networks

Song Mei a, Andrea Montanari b,c,1, Phan-Minh Nguyen b
PMCID: PMC6099898  PMID: 30054315

Significance

Multilayer neural networks have proven extremely successful in a variety of tasks, from image classification to robotics. However, the reasons for this practical success and its precise domain of applicability are unknown. Learning a neural network from data requires solving a complex optimization problem with millions of variables. This is done by stochastic gradient descent (SGD) algorithms. We study the case of two-layer networks and derive a compact description of the SGD dynamics in terms of a limiting partial differential equation. Among other consequences, this shows that SGD dynamics does not become more complex when the network size increases.

Keywords: neural networks, stochastic gradient descent, gradient flow, Wasserstein space, partial differential equations

Abstract

Multilayer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a nonconvex high-dimensional objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the former case, does this happen because local minima are absent or because SGD somehow avoids them? In the latter, why do local minima reached by SGD have good generalization properties? In this paper, we consider a simple case, namely two-layer neural networks, and prove that—in a suitable scaling limit—SGD dynamics is captured by a certain nonlinear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.


Multilayer neural networks are one of the oldest approaches to statistical machine learning, dating back at least to the 1960s (1). Over the last 10 years, under the impulse of increasing computer power and larger data availability, they have emerged as a powerful tool for a wide variety of learning tasks (2, 3).

In this paper, we focus on the classical setting of supervised learning, whereby we are given data points $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, indexed by $i \in \mathbb{N}$, which are assumed to be independent and identically distributed from an unknown distribution $\mathbb{P}$ on $\mathbb{R}^d \times \mathbb{R}$. Here $x_i \in \mathbb{R}^d$ is a feature vector (e.g., a set of descriptors of an image), and $y_i \in \mathbb{R}$ is a label (e.g., labeling the object in the image). Our objective is to model the dependence of the label $y_i$ on the feature vector $x_i$ to assign labels to previously unlabeled examples. In a two-layer neural network, this dependence is modeled as

$$\hat{y}(x;\theta) = \frac{1}{N}\sum_{i=1}^{N}\sigma_*(x;\theta_i). \qquad [1]$$

Here, $N$ is the number of hidden units (neurons), $\sigma_*:\mathbb{R}^d\times\mathbb{R}^D\to\mathbb{R}$ is an activation function, and $\theta_i\in\mathbb{R}^D$ are parameters, which we collectively denote by $\theta=(\theta_1,\dots,\theta_N)$. The factor $1/N$ is introduced for convenience and can be eliminated by redefining the activation. Often $\theta_i=(a_i,b_i,w_i)$ and

$$\sigma_*(x;\theta_i) = a_i\,\sigma(\langle w_i,x\rangle + b_i), \qquad [2]$$

for some $\sigma:\mathbb{R}\to\mathbb{R}$. Ideally, the parameters $\theta=(\theta_i)_{i\le N}$ should be chosen so as to minimize the risk (generalization error) $R_N(\theta)=\mathbb{E}\{\ell(y,\hat{y}(x;\theta))\}$, where $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is a certain loss function. For the sake of simplicity, we will focus on the square loss $\ell(y,\hat{y})=(y-\hat{y})^2$, but more general choices can be treated along the same lines.
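For concreteness, the following minimal sketch (ours, not the authors' code) spells out Eqs. 1 and 2 in Python; the choice $\sigma=\tanh$ is purely illustrative, as are all names.

```python
import numpy as np

def predict(x, a, w, b):
    """yhat(x; theta) = (1/N) sum_i a_i * sigma(<w_i, x> + b_i), Eqs. 1-2.

    Assumes sigma = tanh for illustration. Shapes: x (d,); w (N, d); a, b (N,).
    """
    return np.mean(a * np.tanh(w @ x + b))
```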

In practice, the parameters of neural networks are learned by stochastic gradient descent (SGD) (4) or its variants. In the present case, this amounts to the iteration

$$\theta_i^{k+1} = \theta_i^k + 2 s_k\,\big(y_k - \hat{y}(x_k;\theta^k)\big)\,\nabla_{\theta_i}\sigma_*(x_k;\theta_i^k). \qquad [3]$$

Here $\theta^k=(\theta_i^k)_{i\le N}$ denotes the parameters after $k$ iterations, $s_k$ is a step size, and $(x_k,y_k)$ is the $k$th example. Throughout the paper, we make the following "One-Pass Assumption": Training examples are never revisited. Equivalently, $\{(x_k,y_k)\}_{k\ge1}$ are independent and identically distributed, with $(x_k,y_k)\sim\mathbb{P}$.

In large-scale applications, this is not far from the truth: The data are so large that each example is visited at most a few times (5). Further, theoretical guarantees suggest that there is limited advantage to be gained from multiple passes (6). For recent work deriving scaling limits under such an assumption (in different problems), see ref. 7.
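As a companion to Eq. 3, here is a sketch of one one-pass SGD update, continuing the illustrative tanh parameterization above (each pair $(x,y)$ is consumed once and never revisited):

```python
def sgd_step(a, w, b, x, y, s):
    """One update of Eq. 3 for sigma_*(x; theta_i) = a_i * tanh(<w_i,x> + b_i)."""
    z = np.tanh(w @ x + b)               # sigma(<w_i, x> + b_i), shape (N,)
    err = y - np.mean(a * z)             # y_k - yhat(x_k; theta^k)
    gz = a * (1.0 - z ** 2)              # derivative through the tanh
    a = a + 2 * s * err * z              # gradient of sigma_* in a_i
    b = b + 2 * s * err * gz             # gradient of sigma_* in b_i
    w = w + 2 * s * err * gz[:, None] * x[None, :]
    return a, w, b
```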

Understanding the optimization landscape of two-layer neural networks is largely an open problem, even when we have access to an infinite number of examples—that is, to the population risk $R_N(\theta)$. Several studies have focused on special choices of the activation function $\sigma_*$ and of the data distribution $\mathbb{P}$, proving that the population risk has no bad local minima (8–10). This type of analysis requires delicate calculations that are somewhat sensitive to the specific choice of the model. Another line of work proposes new algorithms with theoretical guarantees (11–16), which use initializations based on tensor factorization.

In this paper, we prove that—in a suitable scaling limit—the SGD dynamics admits an asymptotic description in terms of a certain nonlinear partial differential equation (PDE). This PDE has a remarkable mathematical structure, in that it corresponds to a gradient flow in the metric space $(\mathcal{P}(\mathbb{R}^D),W_2)$: the space of probability measures on $\mathbb{R}^D$, endowed with the Wasserstein metric. This gradient flow minimizes an asymptotic version of the population risk, which is defined for $\rho\in\mathcal{P}(\mathbb{R}^D)$ and will be denoted by $R(\rho)$. This description simplifies the analysis of the landscape of two-layer neural networks, for instance by exploiting underlying symmetries. We illustrate this by obtaining results on several concrete examples as well as a general convergence result for "noisy SGD." In the next section, we provide an informal outline, focusing on basic intuitions rather than on formal results. We then present the consequences of these ideas on a few specific examples and subsequently state our general results.

An Informal Overview

A good starting point is to rewrite the population risk $R_N(\theta)=\mathbb{E}\{[y-\hat{y}(x;\theta)]^2\}$ as

$$R_N(\theta) = R_\# + \frac{2}{N}\sum_{i=1}^{N} V(\theta_i) + \frac{1}{N^2}\sum_{i,j=1}^{N} U(\theta_i,\theta_j), \qquad [4]$$

where we defined the potentials $V(\theta) = -\mathbb{E}\{y\,\sigma_*(x;\theta)\}$ and $U(\theta_1,\theta_2)=\mathbb{E}\{\sigma_*(x;\theta_1)\,\sigma_*(x;\theta_2)\}$. In particular, $U(\cdot,\cdot)$ is a symmetric positive-semidefinite kernel. The constant $R_\#=\mathbb{E}\{y^2\}$ is the risk of the trivial predictor $\hat{y}=0$.
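Since $V$ and $U$ are defined as expectations over $\mathbb{P}$, they can be approximated by plain Monte Carlo. A sketch, where `sigma_star` and a sampler for $\mathbb{P}$ are assumed to be given as Python callables (both names are ours):

```python
def estimate_potentials(theta1, theta2, sigma_star, sample, n=100_000):
    """Monte Carlo estimates of V(theta1) and U(theta1, theta2) from Eq. 4."""
    xs, ys = sample(n)                           # n i.i.d. draws (x, y) ~ P
    s1 = np.array([sigma_star(x, theta1) for x in xs])
    s2 = np.array([sigma_star(x, theta2) for x in xs])
    return -np.mean(ys * s1), np.mean(s1 * s2)   # (V, U)
```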

Notice that $R_N(\theta)$ depends on $\theta_1,\dots,\theta_N$ only through their empirical distribution $\hat{\rho}^{(N)} = N^{-1}\sum_{i=1}^{N}\delta_{\theta_i}$. This suggests considering a risk function defined for $\rho\in\mathcal{P}(\mathbb{R}^D)$ [we denote by $\mathcal{P}(\Omega)$ the space of probability distributions on $\Omega$]:

$$R(\rho) = R_\# + 2\int V(\theta)\,\rho(d\theta) + \int U(\theta_1,\theta_2)\,\rho(d\theta_1)\,\rho(d\theta_2). \qquad [5]$$

Formal relationships can be established between $R_N(\theta)$ and $R(\rho)$. For instance, under mild assumptions, $\inf_\theta R_N(\theta) = \inf_\rho R(\rho) + O(1/N)$. We refer to the next sections for mathematical statements of this type.

Roughly speaking, $R(\rho)$ corresponds to the population risk when the number of hidden units goes to infinity and the empirical distribution of parameters $\hat{\rho}^{(N)}$ converges to $\rho$. Since $U(\cdot,\cdot)$ is positive semidefinite, the risk becomes convex in this limit. The fact that learning can be viewed as convex optimization in an infinite-dimensional space was indeed pointed out in the past (17, 18). Does this mean that the landscape of the population risk simplifies for large $N$ and descent algorithms will converge to a unique (or nearly unique) global optimum?

The answer to the latter question is generally negative, and a physics analogy can explain why. Think of $\theta_1,\dots,\theta_N$ as the positions of $N$ particles in a $D$-dimensional space. When $N$ is large, the behavior of such a "gas" of particles is effectively described by a density $\rho_t(\theta)$ (with $t$ indexing time). However, not all "small" changes of this density profile can be realized in the actual physical dynamics: The dynamics conserves mass locally, because particles cannot move discontinuously. For instance, if $\mathrm{supp}(\rho_t)\subseteq S_1\cup S_2$ for two disjoint compact sets $S_1,S_2\subseteq\mathbb{R}^D$ and all $t\in[t_1,t_2]$, then the total mass in each of these regions cannot change over time—that is, $\rho_t(S_1)=1-\rho_t(S_2)$ does not depend on $t\in[t_1,t_2]$.

We will prove that SGD is well approximated (in a precise quantitative sense described below) by a continuum dynamics that enforces this local mass conservation principle. Namely, assume that the step size in SGD is given by $s_k = \varepsilon\,\xi(k\varepsilon)$, for $\xi:\mathbb{R}_{\ge0}\to\mathbb{R}_{\ge0}$ a sufficiently regular function. Denoting by $\hat{\rho}^{(N)}_k = N^{-1}\sum_{i=1}^{N}\delta_{\theta_i^k}$ the empirical distribution of parameters after $k$ SGD steps, we prove that

$$\hat{\rho}^{(N)}_{\lfloor t/\varepsilon\rfloor} \Rightarrow \rho_t, \qquad [6]$$

when $N\to\infty$, $\varepsilon\to0$ (here $\Rightarrow$ denotes weak convergence). The asymptotic dynamics of $\rho_t$ is defined by the following PDE, which we shall refer to as distributional dynamics (DD):

$$\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)\big), \qquad [7]$$

and

$$\Psi(\theta;\rho) \equiv V(\theta) + \int U(\theta,\theta')\,\rho(d\theta'). \qquad [8]$$

[Here, $\nabla_\theta\cdot v(\theta)$ denotes the divergence of the vector field $v(\theta)$.] This should be interpreted as an evolution equation in $\mathcal{P}(\mathbb{R}^D)$. While we described the convergence to this dynamics in asymptotic terms, the results in the next sections provide explicit nonasymptotic bounds. In particular, $\rho_t$ is a good approximation of $\hat{\rho}^{(N)}_k$, $k=\lfloor t/\varepsilon\rfloor$, as soon as $\varepsilon\ll 1/D$ and $N\gg D$.
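A useful way to read Eq. 7 is as the $N\to\infty$ limit of $N$ particles transported along the velocity field $-2\xi(t)\nabla_\theta\Psi(\theta;\rho_t)$, with $\rho_t$ replaced by the particles' empirical distribution. The sketch below takes one Euler step of this interacting-particle dynamics; the model-dependent gradients `grad_V` and `grad1_U` (the gradient of $U$ in its first argument) are assumed given:

```python
def dd_particle_step(thetas, grad_V, grad1_U, xi_t, dt):
    """One Euler step of the particle dynamics underlying Eq. 7.

    thetas: (N, D) array; grad Psi = grad V + average of grad1_U over particles.
    """
    N = len(thetas)
    new = np.empty_like(thetas)
    for i in range(N):
        gU = sum(grad1_U(thetas[i], thetas[j]) for j in range(N)) / N
        new[i] = thetas[i] - 2.0 * xi_t * dt * (grad_V(thetas[i]) + gU)
    return new
```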

Using these results, analyzing learning in two-layer neural networks reduces to analyzing the PDE (Eq. 7). While this is far from being an easy task, the PDE formulation leads to several simplifications and insights. First, it factors out the invariance of the risk (Eq. 4) (and of the SGD dynamics; Eq. 3) with respect to permutations of the units $\{1,\dots,N\}$.

Second, it allows us to exploit symmetries in the data distribution $\mathbb{P}$. If $\mathbb{P}$ is left invariant under a group of transformations (e.g., rotations), we can look for a solution $\rho_t$ of the DD (Eq. 7) that enjoys the same symmetry, hence reducing the dimensionality of the problem. This is impossible for the finite-$N$ dynamics (Eq. 3), since no arrangement of the points $\{\theta_1,\dots,\theta_N\}\subseteq\mathbb{R}^D$ is left invariant, say, under rotations. We will provide examples of this approach in the next sections.

Third, there is a rich mathematical literature on the PDE (Eq. 7), motivated by the study of interacting particle systems in mathematical physics. As mentioned above, a key structure exploited in this line of work is that Eq. 7 can be viewed as a gradient flow for the cost function $R(\rho)$ in the space $(\mathcal{P}(\mathbb{R}^D),W_2)$ of probability measures on $\mathbb{R}^D$ endowed with the Wasserstein metric (19–21). Roughly speaking, this means that the trajectory $t\mapsto\rho_t$ attempts to minimize the risk $R(\rho)$ while maintaining the "local mass conservation" constraint. Recall that the Wasserstein distance is defined as

$$W_2(\rho_1,\rho_2) = \left(\inf_{\gamma\in\mathcal{C}(\rho_1,\rho_2)} \int \|\theta_1-\theta_2\|_2^2\,\gamma(d\theta_1,d\theta_2)\right)^{1/2}, \qquad [9]$$

where the infimum is taken over the set $\mathcal{C}(\rho_1,\rho_2)$ of all couplings of $\rho_1$ and $\rho_2$. Informally, the fact that $\rho_t$ is a gradient flow means that Eq. 7 is equivalent, for small $\tau$, to

$$\rho_{t+\tau} \approx \arg\min_{\rho\in\mathcal{P}(\mathbb{R}^D)} \left\{ R(\rho) + \frac{1}{2\xi(t)\tau}\, W_2(\rho,\rho_t)^2 \right\}. \qquad [10]$$

Powerful tools from the mathematical literature on gradient flows in measure spaces (20) can be exploited to study the behavior of Eq. 7.
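As an aside on Eq. 9: for two empirical measures with the same number of atoms and uniform weights, the infimum over couplings is attained at a permutation, so $W_2$ can be computed exactly by solving an assignment problem. A self-contained sketch (ours, for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_empirical(A, B):
    """Exact W2 between (1/n) sum_i delta_{A_i} and (1/n) sum_j delta_{B_j}.

    A, B: (n, D) arrays of atoms; the optimal coupling is a permutation,
    found with the Hungarian algorithm.
    """
    cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].mean())
```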

Most importantly, the scaling limit elucidates the dependence of the landscape of two-layer neural networks on the number of hidden units N.

A remarkable feature of neural networks is the observation that, while they might be dramatically overparametrized, this does not lead to performance degradation. In the case of bounded activation functions, this phenomenon was clarified in the 1990s for empirical risk minimization algorithms (see, e.g., ref. 22). The present work provides analogous insight for the SGD dynamics: Roughly speaking, our results imply that the landscape remains essentially unchanged as $N$ grows, provided $N\gg D$. In particular, assume that the PDE (Eq. 7) converges close to an optimum in time $t_*(D)$. This might depend on $D$ but does not depend on the number of hidden units $N$ (which does not appear in the DD PDE; Eq. 7). If $t_*(D)=O_D(1)$, we can then take $N$ arbitrarily large (as long as $N\gg D$) and will achieve a population risk that is independent of $N$ (and corresponds to the optimum), using $k=\lfloor t_*/\varepsilon\rfloor = O(D)$ samples.

Our analysis can accommodate some important variants of SGD, a particularly interesting one being noisy SGD:

$$\theta_i^{k+1} = (1-2\lambda s_k)\,\theta_i^k + 2 s_k\,(y_k-\hat{y}_k)\,\nabla_{\theta_i}\sigma_*(x_k;\theta_i^k) + \sqrt{2 s_k/\beta}\; g_i^k, \qquad [11]$$

where $g_i^k\sim\mathsf{N}(0,I_D)$ and $\hat{y}_k=\hat{y}(x_k;\theta^k)$. (The term $-2\lambda s_k\theta_i^k$ corresponds to an $\ell_2$ regularization and will be useful for our analysis below.) The resulting scaling limit differs from Eq. 7 by the addition of a diffusion term:

$$\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi_\lambda(\theta;\rho_t)\big) + 2\xi(t)\beta^{-1}\,\Delta_\theta\rho_t, \qquad [12]$$

where $\Psi_\lambda(\theta;\rho)=\Psi(\theta;\rho)+(\lambda/2)\|\theta\|_2^2$, and $\Delta_\theta f(\theta)=\sum_{i=1}^{D}\partial_{\theta_i}^2 f(\theta)$ denotes the usual Laplacian. This can be viewed as a gradient flow for the free energy $F_{\beta,\lambda}(\rho)=\frac{1}{2}R(\rho)+\frac{\lambda}{2}\int\|\theta\|_2^2\,\rho(d\theta)-\beta^{-1}\,\mathrm{Ent}(\rho)$, where $\mathrm{Ent}(\rho)=-\int\rho(\theta)\log\rho(\theta)\,d\theta$ is the entropy of $\rho$ [by definition, $\mathrm{Ent}(\rho)=-\infty$ if $\rho$ is singular]. $F_{\beta,\lambda}(\rho)$ is an entropy-regularized risk, which penalizes strongly nonuniform $\rho$.
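A sketch of one step of Eq. 11 follows; `grad_sigma` (per-unit gradients of $\sigma_*$) and `yhat` are placeholders of our own for model-specific routines:

```python
def noisy_sgd_step(thetas, grad_sigma, yhat, x, y, s, lam, beta, rng):
    """One update of Eq. 11. thetas: (N, D); grad_sigma(x, thetas): (N, D)."""
    err = y - yhat(x, thetas)
    noise = rng.standard_normal(thetas.shape)        # g_i^k ~ N(0, I_D)
    return ((1.0 - 2.0 * lam * s) * thetas           # l2 regularization
            + 2.0 * s * err * grad_sigma(x, thetas)  # SGD drift, as in Eq. 3
            + np.sqrt(2.0 * s / beta) * noise)       # Langevin noise
```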

We will prove below that, for $\beta<\infty$, the evolution (Eq. 12) generically converges to the minimizer of $F_{\beta,\lambda}(\rho)$, hence implying global convergence of noisy SGD in a number of steps independent of $N$.

Examples

In this section, we discuss some simple applications of the general approach outlined above. Let us emphasize that these examples are not realistic. First, the data distribution P is extremely simple: We made this choice to be able to carry out explicit calculations. Second, the activation function σ*(x;θ) is not necessarily optimal: We made this choice to illustrate some interesting phenomena.

Centered Isotropic Gaussians.

One-neuron neural networks perform well with (nearly) linearly separable data. The simplest classification problem that requires multilayer networks is, arguably, that of distinguishing two Gaussians with the same mean. Assume the joint law $\mathbb{P}$ of $(y,x)$ to be as follows:

with probability 1/2: $y=+1$, $x\sim\mathsf{N}(0,(1+\Delta)^2 I_d)$; and

with probability 1/2: $y=-1$, $x\sim\mathsf{N}(0,(1-\Delta)^2 I_d)$.

(This example will be generalized later.) Of course, optimal classification in this model becomes entirely trivial if we compute the feature $h(x)=\|x\|_2$. However, it is nontrivial that an SGD-trained neural network will succeed.

We choose an activation function without offset or output weights, namely $\sigma_*(x;\theta_i)=\sigma(\langle w_i,x\rangle)$. While qualitatively similar results are obtained for other choices of $\sigma$, we will use a simple piecewise linear function as a running example: $\sigma(t)=s_1$ if $t\le t_1$, $\sigma(t)=s_2$ if $t\ge t_2$, and $\sigma(t)$ interpolated linearly for $t\in(t_1,t_2)$. In simulations, we use $t_1=0.5$, $t_2=1.5$, $s_1=-2.5$, and $s_2=7.5$.
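A sketch of this data distribution and of the piecewise linear activation, with the parameter values quoted above (the function and argument names are ours):

```python
def sample_isotropic(d, Delta, n, seed=0):
    """Draw n pairs (x, y): y uniform on {+1, -1}, x ~ N(0, (1 + y*Delta)^2 I_d)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([1.0, -1.0], size=n)
    scale = np.where(y > 0, 1.0 + Delta, 1.0 - Delta)   # per-class std
    return scale[:, None] * rng.standard_normal((n, d)), y

def sigma_pw(t, t1=0.5, t2=1.5, s1=-2.5, s2=7.5):
    """s1 for t <= t1, s2 for t >= t2, linear interpolation in between."""
    return np.clip(s1 + (s2 - s1) * (t - t1) / (t2 - t1), s1, s2)
```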

We run SGD with initial weights $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\rho_0$, where $\rho_0$ is spherically symmetric. Fig. 1 reports the result of such an experiment. Due to the symmetry of the distribution $\mathbb{P}$, the distribution $\rho_t$ remains spherically symmetric for all $t$ and hence is completely determined by the distribution $\bar\rho_t$ of the norm $r=\|w\|_2$. This distribution satisfies a one-dimensional reduced DD:

$$\partial_t\bar\rho_t = 2\xi(t)\,\partial_r\big(\bar\rho_t\,\partial_r\psi(r;\bar\rho_t)\big), \qquad [13]$$

where the form of $\psi(r;\bar\rho)$ can be derived from $\Psi(\theta;\rho)$. This reduced PDE can be solved numerically in an efficient manner; see SI Appendix for technical details. As illustrated by Fig. 1, the empirical results closely match the predictions produced by this PDE.
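Eq. 13 is a one-dimensional conservation law, so a standard upwind finite-volume discretization conserves mass exactly. The sketch below is ours (it is not the scheme of SI Appendix, whose details we do not reproduce); the model-specific $\partial_r\psi$ is assumed given as `grad_psi`:

```python
def reduced_dd_step(rho, r, grad_psi, xi_t, dt):
    """One explicit time step of d/dt rho = 2 xi(t) d/dr [rho * d/dr psi].

    rho: cell masses on a uniform grid r; grad_psi(r, rho): d/dr psi at r.
    """
    dr = r[1] - r[0]
    v = -2.0 * xi_t * grad_psi(r, rho)     # transport velocity of the mass
    # upwind flux through the interface between cells i and i+1
    flux = np.where(v[:-1] > 0.0, rho[:-1] * v[:-1], rho[1:] * v[1:])
    rho_new = rho.copy()
    rho_new[:-1] -= dt / dr * flux         # mass leaving cell i to the right
    rho_new[1:] += dt / dr * flux          # mass entering cell i+1 from the left
    return rho_new                         # total mass is conserved
```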

Fig. 1.

Evolution of the radial distribution $\bar\rho_t$ for the isotropic Gaussian model, with $\Delta=0.8$. Histograms are obtained from SGD experiments with $d=40$, $N=800$, initial weight distribution $\rho_0=\mathsf{N}(0,0.8^2/d\,I_d)$, step size $\varepsilon=10^{-6}$, and $\xi(t)=1$. Continuous lines correspond to a numerical solution of the DD (Eq. 13).

In Fig. 2, we compare the asymptotic risk achieved by SGD with the prediction obtained by minimizing $R(\rho)$ (cf. Eq. 5) over spherically symmetric distributions. It turns out that, for certain values of $\Delta$, the minimum is achieved by the uniform distribution over the sphere of radius $\|w\|_2=r_*$, to be denoted by $\rho_{r_*}^{\mathrm{unif}}$. The value of $r_*$ is computed by minimizing

$$\bar R_d^{(1)}(r) = 1 + 2 v(r) + u_d(r,r), \qquad [14]$$

where expressions for $v(r)$ and $u_d(r_1,r_2)$ can be readily derived from $V(w)$ and $U(w_1,w_2)$ and are given in SI Appendix.
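Given numerical routines for $v$ and $u_d$ (assumed available as Python callables; their closed forms are in SI Appendix), the minimization of Eq. 14 is a one-dimensional problem; a grid-search sketch:

```python
def best_radius(v, u_d, r_grid):
    """Minimize the single-delta risk 1 + 2 v(r) + u_d(r, r) over a grid."""
    risks = np.array([1.0 + 2.0 * v(r) + u_d(r, r) for r in r_grid])
    i = int(np.argmin(risks))
    return r_grid[i], risks[i]             # (r_star, predicted risk)
```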

Fig. 2.

Population risk in the problem of separating two isotropic Gaussians, as a function of the separation parameter $\Delta$. We use a two-layer network with piecewise linear activation, no offset, and output weights equal to 1. Empirical results obtained by SGD (a single run per data point) are marked "+." Continuous lines are theoretical predictions obtained by numerically minimizing $R(\rho)$ (see SI Appendix for details). Dashed lines are theoretical predictions from the single-delta ansatz of Eq. 14. Notice that this ansatz is incorrect for $\Delta>\Delta_d^{\mathrm{h}}$, which is marked as a solid round dot. Here, $N=800$.

Lemma 1:

Let $r_*$ be a global minimizer of $r\mapsto\bar R_d^{(1)}(r)$. Then $\rho_{r_*}^{\mathrm{unif}}$ is a global minimizer of $\rho\mapsto R(\rho)$ if and only if $v(r)+u_d(r,r_*)\ge v(r_*)+u_d(r_*,r_*)$ for all $r\ge0$.

Checking this condition numerically yields that $\rho_{r_*}^{\mathrm{unif}}$ is a global minimizer for $\Delta$ in an interval $[\Delta_d^{\mathrm{l}},\Delta_d^{\mathrm{h}}]$, where $\lim_{d\to\infty}\Delta_d^{\mathrm{l}}=0$ and $\lim_{d\to\infty}\Delta_d^{\mathrm{h}}=\Delta_\infty\approx0.47$.

Fig. 2 shows good quantitative agreement between empirical results and theoretical predictions and suggests that SGD achieves a value of the risk that is close to optimum. Can we prove that this is indeed the case and that the SGD dynamics does not get stuck in local minima? It turns out that we can use our general theory (see next section) to prove that this is the case for large $d$. To state this result, we need to introduce a class of good uninformative initializations $\mathcal{P}_{\mathrm{good}}\subseteq\mathcal{P}(\mathbb{R}_{\ge0})$ for which convergence to the optimum takes place. For $\bar\rho\in\mathcal{P}(\mathbb{R}_{\ge0})$, we let $\bar R_d(\bar\rho)\equiv R(\bar\rho\times\mathrm{Unif}(\mathbb{S}^{d-1}))$. This risk has a well-defined limit $\bar R_\infty(\bar\rho)$ as $d\to\infty$. We say that $\bar\rho\in\mathcal{P}_{\mathrm{good}}$ if (i) $\bar\rho$ is absolutely continuous with respect to the Lebesgue measure, with bounded density, and (ii) $\bar R_\infty(\bar\rho)<1$.

Theorem 1:

For any $\eta,\Delta,\delta>0$ and $\bar\rho_0\in\mathcal{P}_{\mathrm{good}}$, there exist $d_0=d_0(\eta,\bar\rho_0,\Delta)$, $T=T(\eta,\bar\rho_0,\Delta)$, and $C_0=C_0(\eta,\bar\rho_0,\Delta,\delta)$ such that the following holds for the problem of classifying isotropic Gaussians. For any dimension $d\ge d_0$ and number of neurons $N\ge C_0 d$, consider SGD initialized with $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\bar\rho_0\times\mathrm{Unif}(\mathbb{S}^{d-1})$ and step size $\varepsilon\in[1/N^{10},\,1/(C_0 d)]$. Then we have

$$R_N(\theta^k) \le \inf_{\theta\in\mathbb{R}^{N\times d}} R_N(\theta) + \eta \qquad [15]$$

for any $k\in[T/\varepsilon,\,10T/\varepsilon]$, with probability at least $1-\delta$.

In particular, if we set $\varepsilon=1/(C_0 d)$, then the number of SGD steps is $k\in[(C_0T)d,\,(10\,C_0T)d]$: The number of samples used by SGD does not depend on the number of hidden units $N$ and is only linear in the dimension. Unfortunately, the proof does not provide the dependence of $T$ on $\eta$, but Theorem 6 below suggests exponential local convergence.

While we stated Theorem 1 for the piecewise linear sigmoid above, SI Appendix presents technical conditions under which it holds for a general monotone function $\sigma:\mathbb{R}\to\mathbb{R}$.

Centered Anisotropic Gaussians.

We can generalize the previous result to a problem in which the network needs to select a subset of relevant nonlinear features out of many a priori equivalent ones. We assume the joint law of (y,x) to be as follows:

with probability 1/2: $y=+1$, $x\sim\mathsf{N}(0,\Sigma_+)$; and

with probability 1/2: $y=-1$, $x\sim\mathsf{N}(0,\Sigma_-)$.

Given a linear subspace $\mathcal{V}\subseteq\mathbb{R}^d$ of dimension $s_0\le d$, we assume that $\Sigma_+$ and $\Sigma_-$ differ uniquely along $\mathcal{V}$: $\Sigma_\pm = I_d + (\tau_\pm^2-1)P_{\mathcal{V}}$, where $\tau_\pm=(1\pm\Delta)$ and $P_{\mathcal{V}}$ is the orthogonal projector onto $\mathcal{V}$. In other words, the projection of $x$ onto the subspace $\mathcal{V}$ is distributed according to an isotropic Gaussian with variance $\tau_+^2$ (if $y=+1$) or $\tau_-^2$ (if $y=-1$). The projection orthogonal to $\mathcal{V}$ has instead the same variance in the two classes. A successful classifier must be able to learn the relevant subspace $\mathcal{V}$. We assume the same class of activations $\sigma_*(x;\theta)=\sigma(\langle w,x\rangle)$ as for the isotropic case.

The distribution $\mathbb{P}$ is invariant under the reduced symmetry group $O(s_0)\times O(d-s_0)$. As a consequence, letting $r_1\equiv\|P_{\mathcal{V}}w\|_2$ and $r_2\equiv\|(I_d-P_{\mathcal{V}})w\|_2$, it is sufficient to consider distributions $\rho$ that are uniform, conditional on the values of $r_1$ and $r_2$. If we initialize $\rho_0$ to be uniform conditional on $(r_1,r_2)$, this property is preserved by the evolution (Eq. 7). As in the isotropic case, we can use our general theory to prove convergence to a near-optimum if $d$ is large enough.

Theorem 2:

For any $\eta,\Delta,\delta>0$ and $\bar\rho_0\in\mathcal{P}_{\mathrm{good}}$, there exist $d_0=d_0(\eta,\bar\rho_0,\Delta,\gamma)$, $T=T(\eta,\bar\rho_0,\Delta,\gamma)$, and $C_0=C_0(\eta,\bar\rho_0,\Delta,\delta,\gamma)$ such that the following holds for the problem of classifying anisotropic Gaussians with $s_0=\gamma d$, $\gamma\in(0,1)$ fixed. For any dimension parameters $s_0=\gamma d\ge d_0$ and number of neurons $N\ge C_0 d$, consider SGD with initialization $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\bar\rho_0\times\mathrm{Unif}(\mathbb{S}^{d-1})$ and step size $\varepsilon\in[1/N^{10},\,1/(C_0 d)]$. Then we have $R_N(\theta^k)\le\inf_{\theta\in\mathbb{R}^{N\times d}}R_N(\theta)+\eta$ for any $k\in[T/\varepsilon,\,10T/\varepsilon]$, with probability at least $1-\delta$.

Even with this reduced degree of symmetry, SGD converges to a network with nearly optimal risk after using a number of samples $k=O(d)$ that is independent of the number of hidden units $N$.

A Better Activation Function.

Our previous examples use activation functions $\sigma_*(x;\theta)=\sigma(\langle w,x\rangle)$ without output weights or offsets, to simplify the analysis and illustrate some interesting phenomena. Here we consider instead a standard rectified linear unit (ReLU) activation and fit both the output weight and the offset: $\sigma_*(x;\theta)=a\,\sigma_{\mathrm{ReLU}}(\langle w,x\rangle+b)$, where $\sigma_{\mathrm{ReLU}}(x)=\max(x,0)$. Hence, $\theta=(w,a,b)\in\mathbb{R}^{d+2}$.

We consider the same data distribution introduced in the last section (anisotropic Gaussians). Fig. 3 reports the evolution of the risk $R_N(\theta^k)$ for three experiments with $d=320$, $s_0=60$, and different values of $\Delta$. SGD is initialized by setting $a_i=1$, $b_i=1$, and $w_i^0\sim_{\mathrm{iid}}\mathsf{N}(0,0.8^2/d\,I_d)$ for $i\le N$. We observe that SGD converges to a network with very small risk, but this convergence has a nontrivial structure and presents long flat regions.

Fig. 3.

Evolution of the population risk for the variable selection problem, using a two-layer neural network with ReLU activations. Here $d=320$, $s_0=60$, and $N=800$, and we used $\xi(t)=t^{-1/4}$ and $\varepsilon=2\times10^{-4}$ to set the step size. Numerical simulations using SGD (one run per data point) are marked "+," and curves are solutions of the reduced PDE with $d=\infty$. (Inset) Evolution of three parameters of the reduced distribution $\bar\rho_t$ (average output weight $a$, average offset $b$, and average $\ell_2$ norm $r_1$ in the relevant subspace) for the same setting.

The empirical results are well captured by our predictions based on the continuum limit. In this case, we obtain a reduced PDE for the joint distribution of the four quantities $\bar r=(a,\,b,\,r_1=\|P_{\mathcal{V}}w\|_2,\,r_2=\|(I_d-P_{\mathcal{V}})w\|_2)$, denoted by $\bar\rho_t$. The reduced PDE is analogous to Eq. 13, albeit in four dimensions rather than one. In Fig. 3, we consider the evolution of the risk alongside three properties of the distribution $\bar\rho_t$: the means of the output weight $a$, of the offset $b$, and of $r_1$.

Predicting Failure.

SGD does not always converge to a near-global optimum. Our analysis allows us to construct examples in which SGD fails. For instance, Fig. 4 reports results for the isotropic Gaussians problem. We violate the assumptions of Theorem 1 by using a nonmonotone activation function. Namely, we use $\sigma_*(x;\theta)=\sigma(\langle w,x\rangle)$, where $\sigma(t)=-2.5$ for $t\le0$, $\sigma(t)=7.5$ for $t\ge1.5$, and $\sigma(t)$ linearly interpolates from $(0,-2.5)$ to $(0.5,-4)$ and from $(0.5,-4)$ to $(1.5,7.5)$.
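For reference, this nonmonotone activation can be sketched in one line via linear interpolation through the stated breakpoints (constant outside $[0,1.5]$):

```python
def sigma_nonmonotone(t):
    """Piecewise linear through (0, -2.5), (0.5, -4), (1.5, 7.5); flat outside."""
    return np.interp(t, [0.0, 0.5, 1.5], [-2.5, -4.0, 7.5])
```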

Fig. 4.

Separating two isotropic Gaussians, with a nonmonotone activation function (see Predicting Failure for details). Here $N=800$, $d=320$, and $\Delta=0.5$. The main frame presents the evolution of the population risk along the SGD trajectory, starting from two different initializations $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\mathsf{N}(0,\kappa^2/d\,I_d)$, for either $\kappa=0.1$ or $\kappa=0.4$. (Inset) Evolution of the average of $\|w\|_2$ for the same conditions. Symbols are empirical results. Continuous lines are predictions obtained with the reduced PDE (Eq. 13).

Depending on the initialization, SGD converges to two different limits, one with small risk and one with high risk. Again, this behavior is well tracked by solving a one-dimensional PDE for the distribution $\bar\rho_t$ of $r=\|w\|_2$.

General Results

In this section, we return to the general supervised learning problem described in the Introduction and describe our general results. Proofs are deferred to SI Appendix.

First, we note that the minimum of the asymptotic risk R(ρ) of Eq. 5 provides a good approximation of the minimum of the finite-N risk RN(θ).

Proposition 1:

Assume that either one of the following conditions holds: (a) $\inf_\rho R(\rho)$ is achieved by a distribution $\rho_*$ such that $\int U(\theta,\theta)\,\rho_*(d\theta)\le K$; (b) there exists $\varepsilon_0>0$ such that, for any $\rho\in\mathcal{P}(\mathbb{R}^D)$ with $R(\rho)\le\inf_\rho R(\rho)+\varepsilon_0$, we have $\int U(\theta,\theta)\,\rho(d\theta)\le K$. Then

$$\Big|\inf_\theta R_N(\theta) - \inf_\rho R(\rho)\Big| \le K/N. \qquad [16]$$

Further, assume that $\theta\mapsto V(\theta)$ and $(\theta_1,\theta_2)\mapsto U(\theta_1,\theta_2)$ are continuous, with $U$ bounded below. A probability measure $\rho_*$ is a global minimum of $R$ if $\inf_{\theta\in\mathbb{R}^D}\Psi(\theta;\rho_*)>-\infty$ and

$$\mathrm{supp}(\rho_*) \subseteq \arg\min_{\theta\in\mathbb{R}^D} \Psi(\theta;\rho_*). \qquad [17]$$

We next consider the DDs (Eqs. 7 and 12). These should be interpreted to hold in a weak sense (cf. SI Appendix). To establish that these PDEs indeed describe the limit of the SGD dynamics, we make the following assumptions:

  • A1. $t\mapsto\xi(t)$ is bounded Lipschitz: $\|\xi\|_\infty,\|\xi\|_{\mathrm{Lip}}\le K_1$, with $\int_0^\infty\xi(t)\,dt=\infty$.

  • A2. The activation function $(x,\theta)\mapsto\sigma_*(x;\theta)$ is bounded, with sub-Gaussian gradient: $\|\sigma_*\|_\infty\le K_2$, $\|\nabla_\theta\sigma_*(x;\theta)\|_{\psi_2}\le K_2$. Labels are bounded: $|y_k|\le K_2$.

  • A3. The gradients $\theta\mapsto\nabla_\theta V(\theta)$ and $(\theta_1,\theta_2)\mapsto\nabla_{\theta_1}U(\theta_1,\theta_2)$ are bounded and Lipschitz continuous [namely, $\|\nabla_\theta V(\theta)\|_2,\|\nabla_{\theta_1}U(\theta_1,\theta_2)\|_2\le K_3$; $\|\nabla_\theta V(\theta)-\nabla_\theta V(\theta')\|_2\le K_3\|\theta-\theta'\|_2$; and $\|\nabla_{\theta_1}U(\theta_1,\theta_2)-\nabla_{\theta_1}U(\theta_1',\theta_2')\|_2\le K_3\|(\theta_1,\theta_2)-(\theta_1',\theta_2')\|_2$].

We also introduce the following error term that quantifies in a nonasymptotic sense the accuracy of our PDE model:

$$\mathrm{err}_{N,D}(z) \equiv \sqrt{1/N \vee \varepsilon}\cdot\Big[\sqrt{D+\log(N/\varepsilon)} + z\Big]. \qquad [18]$$

The convergence of the SGD process to the PDE model is an example of a phenomenon that is known in probability theory as propagation of chaos (23).

Theorem 3:

Assume that conditions A1, A2, and A3 hold. For $\rho_0\in\mathcal{P}(\mathbb{R}^D)$, consider SGD with initialization $(\theta_i^0)_{i\le N}\sim_{\mathrm{iid}}\rho_0$ and step size $s_k=\varepsilon\,\xi(k\varepsilon)$. For $t\ge0$, let $\rho_t$ be the solution of the PDE (Eq. 7). Then, for any fixed $k$, $\hat\rho_k^{(N)}\Rightarrow\rho_{k\varepsilon}$ almost surely along any sequence $(N,\varepsilon=\varepsilon_N)$ such that $N/\log(1/\varepsilon_N)\to\infty$ and $\varepsilon_N\to0$. Further, there exists a constant $C$ (depending uniquely on the parameters $K_i$ of conditions A1–A3) such that, for any $f:\mathbb{R}^D\to\mathbb{R}$ with $\|f\|_\infty,\|f\|_{\mathrm{Lip}}\le1$, and $\varepsilon\le1$,

$$\sup_{k\in[0,T/\varepsilon]\cap\mathbb{N}} \Big|\frac{1}{N}\sum_{i=1}^{N} f(\theta_i^k) - \int f(\theta)\,\rho_{k\varepsilon}(d\theta)\Big| \le C e^{CT}\,\mathrm{err}_{N,D}(z),$$
$$\sup_{k\in[0,T/\varepsilon]\cap\mathbb{N}} \big|R_N(\theta^k) - R(\rho_{k\varepsilon})\big| \le C e^{CT}\,\mathrm{err}_{N,D}(z), \qquad [19]$$

with probability at least $1-e^{-z^2}$. The same statements hold for noisy SGD (Eq. 11), provided Eq. 7 is replaced by Eq. 12, and provided $\beta\ge1$, $\lambda\le1$, and $\rho_0$ is $K_0$ sub-Gaussian for some $K_0>0$.

Notice that the dependence of the error terms on $N$ and $D$ is rather benign. On the other hand, the error grows exponentially with the time horizon $T$, which limits its applicability to cases in which the DD converges rapidly to a good solution. We do not expect this behavior to be improvable within the general setting of conditions A1–A3, which a priori includes cases in which the dynamics is unstable.

We can regard $J(\theta;\rho_t)=\rho_t(\theta)\,\nabla_\theta\Psi(\theta;\rho_t)$ as a current. The fixed points of the continuum dynamics are densities that correspond to zero current, as stated below.

Proposition 2:

Assume $V(\cdot)$ and $U(\cdot,\cdot)$ to be differentiable with bounded gradients. If $\rho_t$ is a solution of the PDE (Eq. 7), then $R(\rho_t)$ is nonincreasing. Further, a probability distribution $\rho$ is a fixed point of the PDE (Eq. 7) if and only if

$$\mathrm{supp}(\rho) \subseteq \big\{\theta:\ \nabla_\theta\Psi(\theta;\rho)=0\big\}. \qquad [20]$$

Note that global optimizers of $R(\rho)$, defined by condition Eq. 17, are fixed points, but the set of fixed points is, in general, larger than the set of optimizers. Our next proposition provides an analogous characterization of the fixed points of the diffusion DD (Eq. 12) (see ref. 21 for related results).

Proposition 3:

Assume that conditions A1–A3 hold and that $\rho_0$ is absolutely continuous with respect to the Lebesgue measure, with $F_{\beta,\lambda}(\rho_0)<\infty$. If $(\rho_t)_{t\ge0}$ is a solution of the diffusion PDE (Eq. 12), then $\rho_t$ is absolutely continuous. Further, there is at most one fixed point $\rho_*=\rho_*^{\beta,\lambda}$ of Eq. 12 satisfying $F_{\beta,\lambda}(\rho_*)<\infty$. This fixed point is absolutely continuous, and its density satisfies

$$\rho_*(\theta) = \frac{1}{Z(\beta)}\,\exp\big\{-\beta\,\Psi_\lambda(\theta;\rho_*)\big\}. \qquad [21]$$
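Eq. 21 suggests a simple numerical scheme on a discretized parameter space: iterate the Gibbs map with damping until $\rho$ stabilizes. A sketch (illustrative only; this is not the construction used in the proofs), where `psi_lambda(grid, rho)` evaluates $\Psi_\lambda$ at the grid points under the current $\rho$:

```python
def gibbs_fixed_point(psi_lambda, grid, beta, damp=0.5, iters=500):
    """Damped iteration rho <- (1-damp) rho + damp * exp(-beta Psi_lambda)/Z."""
    rho = np.full(len(grid), 1.0 / len(grid))        # uniform initialization
    for _ in range(iters):
        g = np.exp(-beta * psi_lambda(grid, rho))    # unnormalized Gibbs density
        rho = (1.0 - damp) * rho + damp * g / g.sum()
    return rho
```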

In the next sections, we state our results about convergence of the DD to its fixed point. In the case of noisy SGD [and for the diffusion PDE (Eq. 12)], a general convergence result can be established (although at the cost of an additional regularization). For noiseless SGD (and the continuity equation, Eq. 7), we do not have such a general result. However, we obtain a stability condition for a fixed point containing one point mass, which is useful to characterize possible limiting points (and is used in treating the examples in the previous section).

Convergence: Noisy SGD.

Remarkably, the diffusion PDE (Eq. 12) generically admits a unique fixed point, which is the global minimum of $F_{\beta,\lambda}(\rho)$, and the evolution (Eq. 12) converges to it if initialized so that $F_{\beta,\lambda}(\rho_0)<\infty$. This statement requires some qualifications. First, we introduce a sufficient regularity assumption to guarantee the existence of sufficiently smooth solutions of Eq. 12:

  • A4. $V\in C^4(\mathbb{R}^D)$, $U\in C^4(\mathbb{R}^D\times\mathbb{R}^D)$, and $\nabla_{\theta_1}^k U(\theta_1,\theta_2)$ is uniformly bounded for $0\le k\le4$.

Next, notice that the right-hand side of the fixed-point equation (Eq. 21) is not necessarily normalizable [for instance, it is not when $V(\cdot)$ and $U(\cdot,\cdot)$ are bounded]. To ensure the existence of a fixed point, we need $\lambda>0$.

Theorem 4:

Assume that conditions A1–A4 hold, and $1/K_0\le\lambda\le K_0$ for some $K_0>0$. Then $F_{\beta,\lambda}(\rho)$ has a unique minimizer, denoted by $\rho_*^{\beta,\lambda}$, which satisfies

$$R(\rho_*^{\beta,\lambda}) \le \inf_{\theta\in\mathbb{R}^{N\times D}} R_N(\theta) + C\,D/\beta, \qquad [22]$$

where $C$ is a constant depending on $K_0,K_1,K_2,K_3$. Further, letting $\rho_t$ be a solution of the diffusion PDE (Eq. 12) with initialization satisfying $F_{\beta,\lambda}(\rho_0)<\infty$, we have, as $t\to\infty$,

$$\rho_t \Rightarrow \rho_*^{\beta,\lambda}. \qquad [23]$$

The proof of this theorem is based on the following formula that describes the free-energy decrease along the trajectories of the DD (Eq. 12):

$$\frac{d F_{\beta,\lambda}(\rho_t)}{dt} = -2\xi(t)\int_{\mathbb{R}^D} \big\|\nabla_\theta\Psi_\lambda(\theta;\rho_t) + \beta^{-1}\nabla_\theta\log\rho_t(\theta)\big\|_2^2\,\rho_t(\theta)\,d\theta. \qquad [24]$$

(A key technical hurdle is, of course, proving that this expression makes sense, which we do by showing the existence of strong solutions.) It follows that the right-hand side must vanish as $t\to\infty$, from which we prove that (eventually taking subsequences) $\rho_t\Rightarrow\rho_*$, where $\rho_*$ must satisfy $\beta\,\Psi_\lambda(\theta;\rho_*)+\log\rho_*(\theta)=\mathrm{const}$. This in turn means that $\rho_*$ is a solution of the fixed-point condition Eq. 21 and is in fact a global minimum of $F_{\beta,\lambda}$ by convexity.

This result can be used in conjunction with Theorem 3 to analyze the regularized noisy SGD algorithm (Eq. 11).

Theorem 5:

Assume that conditions A1–A4 hold. Let $\rho_0\in\mathcal{P}(\mathbb{R}^D)$ be absolutely continuous with $F_{\beta,\lambda}(\rho_0)<\infty$ and $K_0$ sub-Gaussian. Consider regularized noisy SGD (cf. Eq. 11) at inverse temperature $\beta<\infty$ and regularization $1/K_0\le\lambda\le K_0$, with initialization $(\theta_i^0)_{i\le N}\sim_{\mathrm{iid}}\rho_0$. Then, for any $\eta>0$, there exists $K=K(\eta,\{K_i\})$ such that, setting $\beta\ge KD$, there exist $T=T(\eta,V,U,\{K_i\},D,\beta)<\infty$ and $C_0=C_0(\eta,\{K_i\},\delta)$ (independent of the dimension $D$ and temperature $\beta$) such that the following happens for $N,(1/\varepsilon)\ge C_0 e^{C_0 T} D$ and $\varepsilon\ge1/N^{10}$: For any $k\in[T/\varepsilon,\,10T/\varepsilon]$, we have, with probability at least $1-\delta$,

$$R_N(\theta^k) \le \inf_{\rho\in\mathcal{P}(\mathbb{R}^D)} R_\lambda(\rho) + \eta. \qquad [25]$$

Let us emphasize that the convergence time $T$ in the last theorem can depend on the dimension $D$ and on the data distribution $\mathbb{P}$ but is independent of the number of hidden units $N$. As illustrated by the examples in the previous section, understanding the dependence of $T$ on $D$ requires further analysis, but examining the proof of this theorem suggests $T=e^{O(D)}$ quite generally [examples in which $T=O(1)$ or $T=e^{\Theta(D)}$ can be constructed]. We expect that our techniques could be pushed to investigate the dependence of $T$ on $\eta$ (see SI Appendix, Discussion). In highly structured cases, the dimension $D$ can be of constant order and much smaller than $d$.

Convergence: Noiseless SGD.

The next theorems provide necessary and sufficient conditions for distributions containing a single point mass to be a stable fixed point of the evolution. This result is useful to characterize the large-time asymptotics of the dynamics (Eq. 7). Here, we write $\nabla_1 U(\theta_1,\theta_2)$ for the gradient of $U$ with respect to its first argument and $\nabla^2_{1,1}U$ for the corresponding Hessian. Further, for a probability distribution $\rho_*$, we define

$$H_0(\rho_*) = \nabla^2 V(\theta_*) + \int \nabla^2_{1,1} U(\theta_*,\theta)\,\rho_*(d\theta). \qquad [26]$$

Note that $H_0(\rho_*)$ is nothing but the Hessian of $\theta\mapsto\Psi(\theta;\rho_*)$ at $\theta_*$.

Theorem 6:

Assume $V$ and $U$ to be twice differentiable, with bounded gradients and bounded continuous Hessians. Let $\theta_*\in\mathbb{R}^D$ be given. Then $\rho_*=\delta_{\theta_*}$ is a fixed point of the evolution (Eq. 7) if and only if $\nabla V(\theta_*)+\nabla_1 U(\theta_*,\theta_*)=0$.

Define $H_0(\delta_{\theta_*})\in\mathbb{R}^{D\times D}$ as per Eq. 26. If $\lambda_{\min}(H_0(\delta_{\theta_*}))>0$, then there exists $r_0>0$ such that, if $\mathrm{supp}(\rho_{t_0})\subseteq B(\theta_*;r_0)\equiv\{\theta:\|\theta-\theta_*\|_2\le r_0\}$, then $\rho_t\Rightarrow\rho_*$ as $t\to\infty$. In fact, convergence is exponentially fast, namely $\int\|\theta-\theta_*\|_2^2\,\rho_t(d\theta)\le e^{-\lambda(t-t_0)}$ for some $\lambda>0$.

Theorem 7:

Under the same assumptions as Theorem 6, let $\rho_*=p_*\delta_{\theta_*}+(1-p_*)\tilde\rho_*\in\mathcal{P}(\mathbb{R}^D)$ be a fixed point of the dynamics (Eq. 7), with $p_*\in(0,1]$ and $\nabla_\theta\Psi(\theta_*;\rho_*)=0$ (which, in particular, is implied by the fixed-point condition, Eq. 20). Define the level sets $L(\eta)\equiv\{\theta:\Psi(\theta;\rho_*)\le\Psi(\theta_*;\rho_*)+\eta\}$, and make the following assumptions: (B1) the eigenvalues of $H_0=H_0(\rho_*)$ are all different from 0, with $\lambda_{\min}(H_0)<0$; (B2) $\tilde\rho_*(L(\eta))\to1$ as $\eta\to0$; and (B3) there exists $\eta_0>0$ such that the sets $L(\eta)$ are compact for all $\eta\in(0,\eta_0)$.

If $\rho_0$ has a bounded density with respect to the Lebesgue measure, then $\rho_t$ cannot converge weakly to $\rho_*$ as $t\to\infty$.

Discussion and Future Directions

In this paper, we developed an approach to the analysis of two-layer neural networks. Using a propagation-of-chaos argument, we proved that—if the number of hidden units satisfies $N\gg D$—SGD dynamics is well approximated by the PDE in Eq. 7, while noisy SGD is well approximated by Eq. 12. Both of these asymptotic descriptions correspond to Wasserstein gradient flows for certain energy (or free-energy) functionals. While empirical risk minimization is known to be insensitive to overparametrization (22), the present work clarifies that the SGD behavior is also independent of the number of hidden units, as soon as this is large enough.

We illustrated our approach on several concrete examples by proving convergence of SGD to a near-global optimum. This type of analysis provides a mechanism for avoiding the perils of nonconvexity. We do not prove that the finite-$N$ risk $R_N(\theta)$ has a unique local minimum or that all local minima are close to each other. Such claims have often been the target of earlier work but might be too strong for the case of neural networks. We prove instead that the PDE (Eq. 7) converges to a near-global optimum when initialized with a bounded density. This effectively gets rid of some exceptional stationary points of $R_N(\theta)$ and merges multiple finite-$N$ stationary points that correspond to similar distributions $\rho$.

In the case of noisy SGD (Eq. 11), we prove that it converges generically to a near-global minimum of the regularized risk, in time independent of the number of hidden units.

We emphasize that while we focused here on the case of square loss, our approach should be generalizable to other loss functions as well (cf. SI Appendix).

The present work opens the way to several interesting research directions. We will mention two of them: (i) The PDE (Eq. 7) corresponds to a gradient flow in the Wasserstein metric for the risk $R(\rho)$ (see ref. 20). Building on this remark, tools from optimal transportation theory can be used to prove convergence. (ii) Multiple finite-$N$ local minima can correspond to the same minimizer $\rho_*$ of $R(\rho)$ in the limit $N\to\infty$. Ideas from glass theory (24) might be useful to investigate this structure.

Let us finally mention that, after a first version of this paper appeared as a preprint, several other groups obtained results that are closely related to Theorem 3 (2527).

Supplementary Material

Supplementary File

Acknowledgments

This work was partially supported by NSF Grants DMS-1613091, CCF-1714305, and IIS-1741162. S.M. was partially supported by an Office of Technology Licensing Stanford Graduate Fellowship. P.-M.N. was partially supported by a William R. Hewlett Stanford Graduate Fellowship.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1806579115/-/DCSupplemental.

References

  • 1.Rosenblatt F. Principles of Neurodynamics. Spartan Books; Washington, DC: 1962. [Google Scholar]
  • 2.Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Vardi MY, editor. Advances in Neural Information Processing Systems. Association for Computing Machinery; New York: 2012. pp. 1097–1105. [Google Scholar]
  • 3.Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep Learning. Vol 1 MIT Press; Cambridge: 2016. [Google Scholar]
  • 4.Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22:400–407. [Google Scholar]
  • 5.Bottou L. Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y, Saporta G, editors. Proceedings of COMPSTAT’2010. Physica; Heidelberg: 2010. pp. 177–186. [Google Scholar]
  • 6.Shalev-Shwartz S, Ben-David S. Understanding Machine Learning: From Theory to Algorithms. Cambridge Univ Press; New York: 2014. [Google Scholar]
  • 7.Wang C, Mattingly J, Lu YM. 2017. Scaling limit: Exact and tractable analysis of online learning algorithms with applications to regularized regression and PCA. arXiv:1712.04332.
  • 8.Soltanolkotabi M, Javanmard A, Lee JD. 2017. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv:1707.04926.
  • 9.Ge R, Lee JD, Ma T. 2017. Learning one-hidden-layer neural networks with landscape design. arXiv:1711.00501.
  • 10.Brutzkus A, Globerson A. 2017. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv:1702.07966.
  • 11.Arora S, Bhaskara A, Ge R, Ma T. 2014. Provable bounds for learning some deep representations. Proceedings of International Conference on Machine Learning (ICML). Available at https://arxiv.org/abs/1310.6343. Accessed July 18, 2018.
  • 12.Sedghi H, Anandkumar A. 2015. Provable methods for training neural networks with sparse connectivity. Proceedings of International Conference on Learning Representation (ICLR). Available at https://arxiv.org/abs/1412.2693. Accessed July 18, 2018.
  • 13.Janzamin M, Sedghi H, Anandkumar A. 2015. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv:1506.08473.
  • 14.Zhang Y, Lee JD, Jordan MI. 2016. L1-regularized neural networks are improperly learnable in polynomial time. Proceedings of International Conference on Machine Learning (ICML). Available at https://arxiv.org/abs/1510.03528. Accessed July 18, 2018.
  • 15.Tian Y. 2017. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. International Conference on Learning Representation (ICLR). Available at https://openreview.net/forum?id=Hk85q85ee. Accessed July 18, 2018.
  • 16.Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS. 2017. Recovery guarantees for one-hidden-layer neural networks. arXiv:1706.03175.
  • 17.Sun Lee W, Bartlett PL, Williamson RC. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans Inf Theor. 1996;42:2118–2132. [Google Scholar]
  • 18.Bengio Y, Roux NL, Vincent P, Delalleau O, Marcotte P. Convex neural networks. In: Weiss Y, Schölkopf B, Platt JC, editors. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA: 2006. pp. 123–130. [Google Scholar]
  • 19.Jordan R, Kinderlehrer D, Otto F. The variational formulation of the Fokker–Planck equation. SIAM J Math Anal. 1998;29:1–17. [Google Scholar]
  • 20.Ambrosio L, Gigli N, Savaré G. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Birkhäuser; Basel: 2008. [Google Scholar]
  • 21.Carrillo JA, McCann RJ, Villani C. Kinetic equilibration rates for granular media and related equations: Entropy dissipation and mass transportation estimates. Rev Mat Iberoam. 2003;19:971–1018. [Google Scholar]
  • 22.Bartlett PL. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans Inf Theor. 1998;44:525–536. [Google Scholar]
  • 23.Sznitman A-S. Topics in propagation of chaos. In: Hennequin PL, editor. Ecole d’été de probabilités de Saint-Flour XIX—1989. Springer; Berlin: 1991. pp. 165–251. [Google Scholar]
  • 24.Mézard M, Parisi G. Thermodynamics of glasses: A first principles computation. J Phys Condens Matter. 1999;11:A157–A165. [Google Scholar]
  • 25.Rotskoff GM, Vanden-Eijnden E. 2018. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915.
  • 26.Sirignano J, Spiliopoulos K. 2018. Mean field analysis of neural networks. arXiv:1805.01053.
  • 27.Chizat L, Bach F. 2018. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv:1805.09545.
