Proceedings of the National Academy of Sciences of the United States of America. 2018 Jul 27;115(33):E7665–E7671. doi: 10.1073/pnas.1806579115

A mean field view of the landscape of two-layer neural networks

Song Mei a, Andrea Montanari b,c,1, Phan-Minh Nguyen b
PMCID: PMC6099898  PMID: 30054315

Significance

Multilayer neural networks have proven extremely successful in a variety of tasks, from image classification to robotics. However, the reasons for this practical success and its precise domain of applicability are unknown. Learning a neural network from data requires solving a complex optimization problem with millions of variables. This is done by stochastic gradient descent (SGD) algorithms. We study the case of two-layer networks and derive a compact description of the SGD dynamics in terms of a limiting partial differential equation. Among other consequences, this shows that SGD dynamics does not become more complex when the network size increases.

Keywords: neural networks, stochastic gradient descent, gradient flow, Wasserstein space, partial differential equations

Abstract

Multilayer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a nonconvex high-dimensional objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the former case, does this happen because local minima are absent or because SGD somehow avoids them? In the latter, why do local minima reached by SGD have good generalization properties? In this paper, we consider a simple case, namely two-layer neural networks, and prove that—in a suitable scaling limit—SGD dynamics is captured by a certain nonlinear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.


Multilayer neural networks are one of the oldest approaches to statistical machine learning, dating back at least to the 1960s (1). Over the last 10 years, under the impulse of increasing computer power and larger data availability, they have emerged as a powerful tool for a wide variety of learning tasks (2, 3).

In this paper, we focus on the classical setting of supervised learning, whereby we are given data points $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, indexed by $i \in \mathbb{N}$, which are assumed to be independent and identically distributed from an unknown distribution $\mathbb{P}$ on $\mathbb{R}^d \times \mathbb{R}$. Here $x_i \in \mathbb{R}^d$ is a feature vector (e.g., a set of descriptors of an image), and $y_i \in \mathbb{R}$ is a label (e.g., labeling the object in the image). Our objective is to model the dependence of the label $y_i$ on the feature vector $x_i$ to assign labels to previously unlabeled examples. In a two-layer neural network, this dependence is modeled as

$$\hat{y}(x;\theta) = \frac{1}{N}\sum_{i=1}^{N}\sigma_*(x;\theta_i). \qquad [1]$$

Here, $N$ is the number of hidden units (neurons), $\sigma_*:\mathbb{R}^d\times\mathbb{R}^D\to\mathbb{R}$ is an activation function, and $\theta_i\in\mathbb{R}^D$ are parameters, which we collectively denote by $\theta=(\theta_1,\dots,\theta_N)$. The factor $1/N$ is introduced for convenience and can be eliminated by redefining the activation. Often $\theta_i=(a_i,b_i,w_i)$ and

$$\sigma_*(x;\theta_i) = a_i\,\sigma(\langle w_i,x\rangle + b_i), \qquad [2]$$

for some $\sigma:\mathbb{R}\to\mathbb{R}$. Ideally, the parameters $\theta=(\theta_i)_{i\le N}$ should be chosen so as to minimize the risk (generalization error) $R_N(\theta)=\mathbb{E}\{\ell(y,\hat{y}(x;\theta))\}$, where $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is a certain loss function. For the sake of simplicity, we will focus on the square loss $\ell(y,\hat{y})=(y-\hat{y})^2$, but more general choices can be treated along the same lines.
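For concreteness, the following minimal sketch (ours, not the authors' code) spells out Eqs. 1 and 2 in Python; the choice $\sigma=\tanh$ is purely illustrative, as are all names.

```python
import numpy as np

def predict(x, a, w, b):
    """yhat(x; theta) = (1/N) sum_i a_i * sigma(<w_i, x> + b_i), Eqs. 1-2.

    Assumes sigma = tanh for illustration. Shapes: x (d,); w (N, d); a, b (N,).
    """
    return np.mean(a * np.tanh(w @ x + b))
```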

In practice, the parameters of neural networks are learned by stochastic gradient descent (SGD) (4) or its variants. In the present case, this amounts to the iteration

$$\theta_i^{k+1} = \theta_i^k + 2 s_k\,\big(y_k - \hat{y}(x_k;\theta^k)\big)\,\nabla_{\theta_i}\sigma_*(x_k;\theta_i^k). \qquad [3]$$

Here $\theta^k=(\theta_i^k)_{i\le N}$ denotes the parameters after $k$ iterations, $s_k$ is a step size, and $(x_k,y_k)$ is the $k$th example. Throughout the paper, we make the following "One-Pass Assumption": Training examples are never revisited. Equivalently, $\{(x_k,y_k)\}_{k\ge1}$ are independent and identically distributed, with $(x_k,y_k)\sim\mathbb{P}$.

In large-scale applications, this is not far from the truth: The data are so large that each example is visited at most a few times (5). Further, theoretical guarantees suggest that there is limited advantage to be gained from multiple passes (6). For recent work deriving scaling limits under such an assumption (in different problems), see ref. 7.
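As a companion to Eq. 3, here is a sketch of one one-pass SGD update, continuing the illustrative tanh parameterization above (each pair $(x,y)$ is consumed once and never revisited):

```python
def sgd_step(a, w, b, x, y, s):
    """One update of Eq. 3 for sigma_*(x; theta_i) = a_i * tanh(<w_i,x> + b_i)."""
    z = np.tanh(w @ x + b)               # sigma(<w_i, x> + b_i), shape (N,)
    err = y - np.mean(a * z)             # y_k - yhat(x_k; theta^k)
    gz = a * (1.0 - z ** 2)              # derivative through the tanh
    a = a + 2 * s * err * z              # gradient of sigma_* in a_i
    b = b + 2 * s * err * gz             # gradient of sigma_* in b_i
    w = w + 2 * s * err * gz[:, None] * x[None, :]
    return a, w, b
```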

Understanding the optimization landscape of two-layer neural networks is largely an open problem, even when we have access to an infinite number of examples—that is, to the population risk $R_N(\theta)$. Several studies have focused on special choices of the activation function $\sigma_*$ and of the data distribution $\mathbb{P}$, proving that the population risk has no bad local minima (8–10). This type of analysis requires delicate calculations that are somewhat sensitive to the specific choice of the model. Another line of work proposes new algorithms with theoretical guarantees (11–16), which use initializations based on tensor factorization.

In this paper, we prove that—in a suitable scaling limit—the SGD dynamics admits an asymptotic description in terms of a certain nonlinear partial differential equation (PDE). This PDE has a remarkable mathematical structure, in that it corresponds to a gradient flow in the metric space $(\mathcal{P}(\mathbb{R}^D),W_2)$: the space of probability measures on $\mathbb{R}^D$, endowed with the Wasserstein metric. This gradient flow minimizes an asymptotic version of the population risk, which is defined for $\rho\in\mathcal{P}(\mathbb{R}^D)$ and will be denoted by $R(\rho)$. This description simplifies the analysis of the landscape of two-layer neural networks, for instance by exploiting underlying symmetries. We illustrate this by obtaining results on several concrete examples as well as a general convergence result for "noisy SGD." In the next section, we provide an informal outline, focusing on basic intuitions rather than on formal results. We then present the consequences of these ideas on a few specific examples and subsequently state our general results.

An Informal Overview

A good starting point is to rewrite the population risk $R_N(\theta)=\mathbb{E}\{[y-\hat{y}(x;\theta)]^2\}$ as

$$R_N(\theta) = R_\# + \frac{2}{N}\sum_{i=1}^{N} V(\theta_i) + \frac{1}{N^2}\sum_{i,j=1}^{N} U(\theta_i,\theta_j), \qquad [4]$$

where we defined the potentials $V(\theta) = -\mathbb{E}\{y\,\sigma_*(x;\theta)\}$ and $U(\theta_1,\theta_2)=\mathbb{E}\{\sigma_*(x;\theta_1)\,\sigma_*(x;\theta_2)\}$. In particular, $U(\cdot,\cdot)$ is a symmetric positive-semidefinite kernel. The constant $R_\#=\mathbb{E}\{y^2\}$ is the risk of the trivial predictor $\hat{y}=0$.
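Since $V$ and $U$ are defined as expectations over $\mathbb{P}$, they can be approximated by plain Monte Carlo. A sketch, where `sigma_star` and a sampler for $\mathbb{P}$ are assumed to be given as Python callables (both names are ours):

```python
def estimate_potentials(theta1, theta2, sigma_star, sample, n=100_000):
    """Monte Carlo estimates of V(theta1) and U(theta1, theta2) from Eq. 4."""
    xs, ys = sample(n)                           # n i.i.d. draws (x, y) ~ P
    s1 = np.array([sigma_star(x, theta1) for x in xs])
    s2 = np.array([sigma_star(x, theta2) for x in xs])
    return -np.mean(ys * s1), np.mean(s1 * s2)   # (V, U)
```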

Notice that $R_N(\theta)$ depends on $\theta_1,\dots,\theta_N$ only through their empirical distribution $\hat{\rho}^{(N)} = N^{-1}\sum_{i=1}^{N}\delta_{\theta_i}$. This suggests considering a risk function defined for $\rho\in\mathcal{P}(\mathbb{R}^D)$ [we denote by $\mathcal{P}(\Omega)$ the space of probability distributions on $\Omega$]:

$$R(\rho) = R_\# + 2\int V(\theta)\,\rho(d\theta) + \int U(\theta_1,\theta_2)\,\rho(d\theta_1)\,\rho(d\theta_2). \qquad [5]$$

Formal relationships can be established between $R_N(\theta)$ and $R(\rho)$. For instance, under mild assumptions, $\inf_\theta R_N(\theta) = \inf_\rho R(\rho) + O(1/N)$. We refer to the next sections for mathematical statements of this type.

Roughly speaking, $R(\rho)$ corresponds to the population risk when the number of hidden units goes to infinity and the empirical distribution of parameters $\hat{\rho}^{(N)}$ converges to $\rho$. Since $U(\cdot,\cdot)$ is positive semidefinite, the risk becomes convex in this limit. The fact that learning can be viewed as convex optimization in an infinite-dimensional space was indeed pointed out in the past (17, 18). Does this mean that the landscape of the population risk simplifies for large $N$ and descent algorithms will converge to a unique (or nearly unique) global optimum?

The answer to the latter question is generally negative, and a physics analogy can explain why. Think of $\theta_1,\dots,\theta_N$ as the positions of $N$ particles in a $D$-dimensional space. When $N$ is large, the behavior of such a "gas" of particles is effectively described by a density $\rho_t(\theta)$ (with $t$ indexing time). However, not all "small" changes of this density profile can be realized in the actual physical dynamics: The dynamics conserves mass locally, because particles cannot move discontinuously. For instance, if $\mathrm{supp}(\rho_t)\subseteq S_1\cup S_2$ for two disjoint compact sets $S_1,S_2\subseteq\mathbb{R}^D$ and all $t\in[t_1,t_2]$, then the total mass in each of these regions cannot change over time—that is, $\rho_t(S_1)=1-\rho_t(S_2)$ does not depend on $t\in[t_1,t_2]$.

We will prove that SGD is well approximated (in a precise quantitative sense described below) by a continuum dynamics that enforces this local mass conservation principle. Namely, assume that the step size in SGD is given by $s_k = \varepsilon\,\xi(k\varepsilon)$, for $\xi:\mathbb{R}_{\ge0}\to\mathbb{R}_{\ge0}$ a sufficiently regular function. Denoting by $\hat{\rho}^{(N)}_k = N^{-1}\sum_{i=1}^{N}\delta_{\theta_i^k}$ the empirical distribution of parameters after $k$ SGD steps, we prove that

$$\hat{\rho}^{(N)}_{\lfloor t/\varepsilon\rfloor} \Rightarrow \rho_t, \qquad [6]$$

when $N\to\infty$, $\varepsilon\to0$ (here $\Rightarrow$ denotes weak convergence). The asymptotic dynamics of $\rho_t$ is defined by the following PDE, which we shall refer to as distributional dynamics (DD):

$$\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)\big), \qquad [7]$$

and

$$\Psi(\theta;\rho) \equiv V(\theta) + \int U(\theta,\theta')\,\rho(d\theta'). \qquad [8]$$

[Here, $\nabla_\theta\cdot v(\theta)$ denotes the divergence of the vector field $v(\theta)$.] This should be interpreted as an evolution equation in $\mathcal{P}(\mathbb{R}^D)$. While we described the convergence to this dynamics in asymptotic terms, the results in the next sections provide explicit nonasymptotic bounds. In particular, $\rho_t$ is a good approximation of $\hat{\rho}^{(N)}_k$, $k=\lfloor t/\varepsilon\rfloor$, as soon as $\varepsilon\ll 1/D$ and $N\gg D$.
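A useful way to read Eq. 7 is as the $N\to\infty$ limit of $N$ particles transported along the velocity field $-2\xi(t)\nabla_\theta\Psi(\theta;\rho_t)$, with $\rho_t$ replaced by the particles' empirical distribution. The sketch below takes one Euler step of this interacting-particle dynamics; the model-dependent gradients `grad_V` and `grad1_U` (the gradient of $U$ in its first argument) are assumed given:

```python
def dd_particle_step(thetas, grad_V, grad1_U, xi_t, dt):
    """One Euler step of the particle dynamics underlying Eq. 7.

    thetas: (N, D) array; grad Psi = grad V + average of grad1_U over particles.
    """
    N = len(thetas)
    new = np.empty_like(thetas)
    for i in range(N):
        gU = sum(grad1_U(thetas[i], thetas[j]) for j in range(N)) / N
        new[i] = thetas[i] - 2.0 * xi_t * dt * (grad_V(thetas[i]) + gU)
    return new
```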

Using these results, analyzing learning in two-layer neural networks reduces to analyzing the PDE (Eq. 7). While this is far from being an easy task, the PDE formulation leads to several simplifications and insights. First, it factors out the invariance of the risk (Eq. 4) (and of the SGD dynamics; Eq. 3) with respect to permutations of the units $\{1,\dots,N\}$.

Second, it allows us to exploit symmetries in the data distribution $\mathbb{P}$. If $\mathbb{P}$ is left invariant under a group of transformations (e.g., rotations), we can look for a solution $\rho_t$ of the DD (Eq. 7) that enjoys the same symmetry, hence reducing the dimensionality of the problem. This is impossible for the finite-$N$ dynamics (Eq. 3), since no arrangement of the points $\{\theta_1,\dots,\theta_N\}\subseteq\mathbb{R}^D$ is left invariant, say, under rotations. We will provide examples of this approach in the next sections.

Third, there is a rich mathematical literature on the PDE (Eq. 7), motivated by the study of interacting particle systems in mathematical physics. As mentioned above, a key structure exploited in this line of work is that Eq. 7 can be viewed as a gradient flow for the cost function $R(\rho)$ in the space $(\mathcal{P}(\mathbb{R}^D),W_2)$ of probability measures on $\mathbb{R}^D$ endowed with the Wasserstein metric (19–21). Roughly speaking, this means that the trajectory $t\mapsto\rho_t$ attempts to minimize the risk $R(\rho)$ while maintaining the "local mass conservation" constraint. Recall that the Wasserstein distance is defined as

$$W_2(\rho_1,\rho_2) = \left(\inf_{\gamma\in\mathcal{C}(\rho_1,\rho_2)} \int \|\theta_1-\theta_2\|_2^2\,\gamma(d\theta_1,d\theta_2)\right)^{1/2}, \qquad [9]$$

where the infimum is taken over the set $\mathcal{C}(\rho_1,\rho_2)$ of all couplings of $\rho_1$ and $\rho_2$. Informally, the fact that $\rho_t$ is a gradient flow means that Eq. 7 is equivalent, for small $\tau$, to

$$\rho_{t+\tau} \approx \arg\min_{\rho\in\mathcal{P}(\mathbb{R}^D)} \left\{ R(\rho) + \frac{1}{2\xi(t)\tau}\, W_2(\rho,\rho_t)^2 \right\}. \qquad [10]$$

Powerful tools from the mathematical literature on gradient flows in measure spaces (20) can be exploited to study the behavior of Eq. 7.
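As an aside on Eq. 9: for two empirical measures with the same number of atoms and uniform weights, the infimum over couplings is attained at a permutation, so $W_2$ can be computed exactly by solving an assignment problem. A self-contained sketch (ours, for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_empirical(A, B):
    """Exact W2 between (1/n) sum_i delta_{A_i} and (1/n) sum_j delta_{B_j}.

    A, B: (n, D) arrays of atoms; the optimal coupling is a permutation,
    found with the Hungarian algorithm.
    """
    cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].mean())
```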

Most importantly, the scaling limit elucidates the dependence of the landscape of two-layer neural networks on the number of hidden units N.

A remarkable feature of neural networks is the observation that, while they might be dramatically overparametrized, this does not lead to performance degradation. In the case of bounded activation functions, this phenomenon was clarified in the 1990s for empirical risk minimization algorithms (see, e.g., ref. 22). The present work provides analogous insight for the SGD dynamics: Roughly speaking, our results imply that the landscape remains essentially unchanged as $N$ grows, provided $N\gg D$. In particular, assume that the PDE (Eq. 7) converges close to an optimum in time $t_*(D)$. This might depend on $D$ but does not depend on the number of hidden units $N$ (which does not appear in the DD PDE; Eq. 7). If $t_*(D)=O_D(1)$, we can then take $N$ arbitrarily large (as long as $N\gg D$) and will achieve a population risk that is independent of $N$ (and corresponds to the optimum), using $k=\lfloor t_*/\varepsilon\rfloor = O(D)$ samples.

Our analysis can accommodate some important variants of SGD, a particularly interesting one being noisy SGD:

$$\theta_i^{k+1} = (1-2\lambda s_k)\,\theta_i^k + 2 s_k\,(y_k-\hat{y}_k)\,\nabla_{\theta_i}\sigma_*(x_k;\theta_i^k) + \sqrt{2 s_k/\beta}\; g_i^k, \qquad [11]$$

where $g_i^k\sim\mathsf{N}(0,I_D)$ and $\hat{y}_k=\hat{y}(x_k;\theta^k)$. (The term $-2\lambda s_k\theta_i^k$ corresponds to an $\ell_2$ regularization and will be useful for our analysis below.) The resulting scaling limit differs from Eq. 7 by the addition of a diffusion term:

$$\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi_\lambda(\theta;\rho_t)\big) + 2\xi(t)\beta^{-1}\,\Delta_\theta\rho_t, \qquad [12]$$

where $\Psi_\lambda(\theta;\rho)=\Psi(\theta;\rho)+(\lambda/2)\|\theta\|_2^2$, and $\Delta_\theta f(\theta)=\sum_{i=1}^{D}\partial_{\theta_i}^2 f(\theta)$ denotes the usual Laplacian. This can be viewed as a gradient flow for the free energy $F_{\beta,\lambda}(\rho)=\frac{1}{2}R(\rho)+\frac{\lambda}{2}\int\|\theta\|_2^2\,\rho(d\theta)-\beta^{-1}\,\mathrm{Ent}(\rho)$, where $\mathrm{Ent}(\rho)=-\int\rho(\theta)\log\rho(\theta)\,d\theta$ is the entropy of $\rho$ [by definition, $\mathrm{Ent}(\rho)=-\infty$ if $\rho$ is singular]. $F_{\beta,\lambda}(\rho)$ is an entropy-regularized risk, which penalizes strongly nonuniform $\rho$.
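A sketch of one step of Eq. 11 follows; `grad_sigma` (per-unit gradients of $\sigma_*$) and `yhat` are placeholders of our own for model-specific routines:

```python
def noisy_sgd_step(thetas, grad_sigma, yhat, x, y, s, lam, beta, rng):
    """One update of Eq. 11. thetas: (N, D); grad_sigma(x, thetas): (N, D)."""
    err = y - yhat(x, thetas)
    noise = rng.standard_normal(thetas.shape)        # g_i^k ~ N(0, I_D)
    return ((1.0 - 2.0 * lam * s) * thetas           # l2 regularization
            + 2.0 * s * err * grad_sigma(x, thetas)  # SGD drift, as in Eq. 3
            + np.sqrt(2.0 * s / beta) * noise)       # Langevin noise
```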

We will prove below that, for $\beta<\infty$, the evolution (Eq. 12) generically converges to the minimizer of $F_{\beta,\lambda}(\rho)$, hence implying global convergence of noisy SGD in a number of steps independent of $N$.

Examples

In this section, we discuss some simple applications of the general approach outlined above. Let us emphasize that these examples are not realistic. First, the data distribution P is extremely simple: We made this choice to be able to carry out explicit calculations. Second, the activation function σ*(x;θ) is not necessarily optimal: We made this choice to illustrate some interesting phenomena.

Centered Isotropic Gaussians.

One-neuron neural networks perform well with (nearly) linearly separable data. The simplest classification problem that requires multilayer networks is, arguably, that of distinguishing two Gaussians with the same mean. Assume the joint law $\mathbb{P}$ of $(y,x)$ to be as follows:

with probability 1/2: $y=+1$, $x\sim\mathsf{N}(0,(1+\Delta)^2 I_d)$; and

with probability 1/2: $y=-1$, $x\sim\mathsf{N}(0,(1-\Delta)^2 I_d)$.

(This example will be generalized later.) Of course, optimal classification in this model becomes entirely trivial if we compute the feature $h(x)=\|x\|_2$. However, it is nontrivial that an SGD-trained neural network will succeed.

We choose an activation function without offset or output weights, namely $\sigma_*(x;\theta_i)=\sigma(\langle w_i,x\rangle)$. While qualitatively similar results are obtained for other choices of $\sigma$, we will use a simple piecewise linear function as a running example: $\sigma(t)=s_1$ if $t\le t_1$, $\sigma(t)=s_2$ if $t\ge t_2$, and $\sigma(t)$ interpolated linearly for $t\in(t_1,t_2)$. In simulations, we use $t_1=0.5$, $t_2=1.5$, $s_1=-2.5$, and $s_2=7.5$.
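A sketch of this data distribution and of the piecewise linear activation, with the parameter values quoted above (the function and argument names are ours):

```python
def sample_isotropic(d, Delta, n, seed=0):
    """Draw n pairs (x, y): y uniform on {+1, -1}, x ~ N(0, (1 + y*Delta)^2 I_d)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([1.0, -1.0], size=n)
    scale = np.where(y > 0, 1.0 + Delta, 1.0 - Delta)   # per-class std
    return scale[:, None] * rng.standard_normal((n, d)), y

def sigma_pw(t, t1=0.5, t2=1.5, s1=-2.5, s2=7.5):
    """s1 for t <= t1, s2 for t >= t2, linear interpolation in between."""
    return np.clip(s1 + (s2 - s1) * (t - t1) / (t2 - t1), s1, s2)
```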

We run SGD with initial weights $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\rho_0$, where $\rho_0$ is spherically symmetric. Fig. 1 reports the result of such an experiment. Due to the symmetry of the distribution $\mathbb{P}$, the distribution $\rho_t$ remains spherically symmetric for all $t$ and hence is completely determined by the distribution $\bar\rho_t$ of the norm $r=\|w\|_2$. This distribution satisfies a one-dimensional reduced DD:

$$\partial_t\bar\rho_t = 2\xi(t)\,\partial_r\big(\bar\rho_t\,\partial_r\psi(r;\bar\rho_t)\big), \qquad [13]$$

where the form of $\psi(r;\bar\rho)$ can be derived from $\Psi(\theta;\rho)$. This reduced PDE can be solved numerically in an efficient manner; see SI Appendix for technical details. As illustrated by Fig. 1, the empirical results closely match the predictions produced by this PDE.
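Eq. 13 is a one-dimensional conservation law, so a standard upwind finite-volume discretization conserves mass exactly. The sketch below is ours (it is not the scheme of SI Appendix, whose details we do not reproduce); the model-specific $\partial_r\psi$ is assumed given as `grad_psi`:

```python
def reduced_dd_step(rho, r, grad_psi, xi_t, dt):
    """One explicit time step of d/dt rho = 2 xi(t) d/dr [rho * d/dr psi].

    rho: cell masses on a uniform grid r; grad_psi(r, rho): d/dr psi at r.
    """
    dr = r[1] - r[0]
    v = -2.0 * xi_t * grad_psi(r, rho)     # transport velocity of the mass
    # upwind flux through the interface between cells i and i+1
    flux = np.where(v[:-1] > 0.0, rho[:-1] * v[:-1], rho[1:] * v[1:])
    rho_new = rho.copy()
    rho_new[:-1] -= dt / dr * flux         # mass leaving cell i to the right
    rho_new[1:] += dt / dr * flux          # mass entering cell i+1 from the left
    return rho_new                         # total mass is conserved
```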

Fig. 1.

Evolution of the radial distribution $\bar\rho_t$ for the isotropic Gaussian model, with $\Delta=0.8$. Histograms are obtained from SGD experiments with $d=40$, $N=800$, initial weight distribution $\rho_0=\mathsf{N}(0,0.8^2/d\,I_d)$, step size $\varepsilon=10^{-6}$, and $\xi(t)=1$. Continuous lines correspond to a numerical solution of the DD (Eq. 13).

In Fig. 2, we compare the asymptotic risk achieved by SGD with the prediction obtained by minimizing $R(\rho)$ (cf. Eq. 5) over spherically symmetric distributions. It turns out that, for certain values of $\Delta$, the minimum is achieved by the uniform distribution over the sphere of radius $\|w\|_2=r_*$, to be denoted by $\rho_{r_*}^{\mathrm{unif}}$. The value of $r_*$ is computed by minimizing

$$\bar R_d^{(1)}(r) = 1 + 2 v(r) + u_d(r,r), \qquad [14]$$

where expressions for $v(r)$ and $u_d(r_1,r_2)$ can be readily derived from $V(w)$ and $U(w_1,w_2)$ and are given in SI Appendix.
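Given numerical routines for $v$ and $u_d$ (assumed available as Python callables; their closed forms are in SI Appendix), the minimization of Eq. 14 is a one-dimensional problem; a grid-search sketch:

```python
def best_radius(v, u_d, r_grid):
    """Minimize the single-delta risk 1 + 2 v(r) + u_d(r, r) over a grid."""
    risks = np.array([1.0 + 2.0 * v(r) + u_d(r, r) for r in r_grid])
    i = int(np.argmin(risks))
    return r_grid[i], risks[i]             # (r_star, predicted risk)
```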

Fig. 2.

Population risk in the problem of separating two isotropic Gaussians, as a function of the separation parameter $\Delta$. We use a two-layer network with piecewise linear activation, no offset, and output weights equal to 1. Empirical results obtained by SGD (a single run per data point) are marked "+." Continuous lines are theoretical predictions obtained by numerically minimizing $R(\rho)$ (see SI Appendix for details). Dashed lines are theoretical predictions from the single-delta ansatz of Eq. 14. Notice that this ansatz is incorrect for $\Delta>\Delta_d^{\mathrm{h}}$, which is marked as a solid round dot. Here, $N=800$.

Lemma 1:

Let $r_*$ be a global minimizer of $r\mapsto\bar R_d^{(1)}(r)$. Then $\rho_{r_*}^{\mathrm{unif}}$ is a global minimizer of $\rho\mapsto R(\rho)$ if and only if $v(r)+u_d(r,r_*)\ge v(r_*)+u_d(r_*,r_*)$ for all $r\ge0$.

Checking this condition numerically yields that $\rho_{r_*}^{\mathrm{unif}}$ is a global minimizer for $\Delta$ in an interval $[\Delta_d^{\mathrm{l}},\Delta_d^{\mathrm{h}}]$, where $\lim_{d\to\infty}\Delta_d^{\mathrm{l}}=0$ and $\lim_{d\to\infty}\Delta_d^{\mathrm{h}}=\Delta_\infty\approx0.47$.

Fig. 2 shows good quantitative agreement between empirical results and theoretical predictions and suggests that SGD achieves a value of the risk that is close to optimum. Can we prove that this is indeed the case and that the SGD dynamics does not get stuck in local minima? It turns out that we can use our general theory (see next section) to prove that this is the case for large $d$. To state this result, we need to introduce a class of good uninformative initializations $\mathcal{P}_{\mathrm{good}}\subseteq\mathcal{P}(\mathbb{R}_{\ge0})$ for which convergence to the optimum takes place. For $\bar\rho\in\mathcal{P}(\mathbb{R}_{\ge0})$, we let $\bar R_d(\bar\rho)\equiv R(\bar\rho\times\mathrm{Unif}(\mathbb{S}^{d-1}))$. This risk has a well-defined limit $\bar R_\infty(\bar\rho)$ as $d\to\infty$. We say that $\bar\rho\in\mathcal{P}_{\mathrm{good}}$ if (i) $\bar\rho$ is absolutely continuous with respect to the Lebesgue measure, with bounded density, and (ii) $\bar R_\infty(\bar\rho)<1$.

Theorem 1:

For any $\eta,\Delta,\delta>0$ and $\bar\rho_0\in\mathcal{P}_{\mathrm{good}}$, there exist $d_0=d_0(\eta,\bar\rho_0,\Delta)$, $T=T(\eta,\bar\rho_0,\Delta)$, and $C_0=C_0(\eta,\bar\rho_0,\Delta,\delta)$ such that the following holds for the problem of classifying isotropic Gaussians. For any dimension $d\ge d_0$ and number of neurons $N\ge C_0 d$, consider SGD initialized with $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\bar\rho_0\times\mathrm{Unif}(\mathbb{S}^{d-1})$ and step size $\varepsilon\in[1/N^{10},\,1/(C_0 d)]$. Then we have

$$R_N(\theta^k) \le \inf_{\theta\in\mathbb{R}^{N\times d}} R_N(\theta) + \eta \qquad [15]$$

for any $k\in[T/\varepsilon,\,10T/\varepsilon]$, with probability at least $1-\delta$.

In particular, if we set $\varepsilon=1/(C_0 d)$, then the number of SGD steps is $k\in[(C_0T)d,\,(10\,C_0T)d]$: The number of samples used by SGD does not depend on the number of hidden units $N$ and is only linear in the dimension. Unfortunately, the proof does not provide the dependence of $T$ on $\eta$, but Theorem 6 below suggests exponential local convergence.

While we stated Theorem 1 for the piecewise linear sigmoid above, SI Appendix presents technical conditions under which it holds for a general monotone function $\sigma:\mathbb{R}\to\mathbb{R}$.

Centered Anisotropic Gaussians.

We can generalize the previous result to a problem in which the network needs to select a subset of relevant nonlinear features out of many a priori equivalent ones. We assume the joint law of (y,x) to be as follows:

with probability 1/2: $y=+1$, $x\sim\mathsf{N}(0,\Sigma_+)$; and

with probability 1/2: $y=-1$, $x\sim\mathsf{N}(0,\Sigma_-)$.

Given a linear subspace $\mathcal{V}\subseteq\mathbb{R}^d$ of dimension $s_0\le d$, we assume that $\Sigma_+$ and $\Sigma_-$ differ uniquely along $\mathcal{V}$: $\Sigma_\pm = I_d + (\tau_\pm^2-1)P_{\mathcal{V}}$, where $\tau_\pm=(1\pm\Delta)$ and $P_{\mathcal{V}}$ is the orthogonal projector onto $\mathcal{V}$. In other words, the projection of $x$ onto the subspace $\mathcal{V}$ is distributed according to an isotropic Gaussian with variance $\tau_+^2$ (if $y=+1$) or $\tau_-^2$ (if $y=-1$). The projection orthogonal to $\mathcal{V}$ has instead the same variance in the two classes. A successful classifier must be able to learn the relevant subspace $\mathcal{V}$. We assume the same class of activations $\sigma_*(x;\theta)=\sigma(\langle w,x\rangle)$ as for the isotropic case.

The distribution $\mathbb{P}$ is invariant under the reduced symmetry group $O(s_0)\times O(d-s_0)$. As a consequence, letting $r_1\equiv\|P_{\mathcal{V}}w\|_2$ and $r_2\equiv\|(I_d-P_{\mathcal{V}})w\|_2$, it is sufficient to consider distributions $\rho$ that are uniform, conditional on the values of $r_1$ and $r_2$. If we initialize $\rho_0$ to be uniform conditional on $(r_1,r_2)$, this property is preserved by the evolution (Eq. 7). As in the isotropic case, we can use our general theory to prove convergence to a near-optimum if $d$ is large enough.

Theorem 2:

For any $\eta,\Delta,\delta>0$ and $\bar\rho_0\in\mathcal{P}_{\mathrm{good}}$, there exist $d_0=d_0(\eta,\bar\rho_0,\Delta,\gamma)$, $T=T(\eta,\bar\rho_0,\Delta,\gamma)$, and $C_0=C_0(\eta,\bar\rho_0,\Delta,\delta,\gamma)$ such that the following holds for the problem of classifying anisotropic Gaussians with $s_0=\gamma d$, $\gamma\in(0,1)$ fixed. For any dimension parameters $s_0=\gamma d\ge d_0$ and number of neurons $N\ge C_0 d$, consider SGD with initialization $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\bar\rho_0\times\mathrm{Unif}(\mathbb{S}^{d-1})$ and step size $\varepsilon\in[1/N^{10},\,1/(C_0 d)]$. Then we have $R_N(\theta^k)\le\inf_{\theta\in\mathbb{R}^{N\times d}}R_N(\theta)+\eta$ for any $k\in[T/\varepsilon,\,10T/\varepsilon]$, with probability at least $1-\delta$.

Even with this reduced degree of symmetry, SGD converges to a network with nearly optimal risk after using a number of samples $k=O(d)$ that is independent of the number of hidden units $N$.

A Better Activation Function.

Our previous examples use activation functions $\sigma_*(x;\theta)=\sigma(\langle w,x\rangle)$ without output weights or offsets, to simplify the analysis and illustrate some interesting phenomena. Here we consider instead a standard rectified linear unit (ReLU) activation and fit both the output weight and the offset: $\sigma_*(x;\theta)=a\,\sigma_{\mathrm{ReLU}}(\langle w,x\rangle+b)$, where $\sigma_{\mathrm{ReLU}}(x)=\max(x,0)$. Hence, $\theta=(w,a,b)\in\mathbb{R}^{d+2}$.

We consider the same data distribution introduced in the last section (anisotropic Gaussians). Fig. 3 reports the evolution of the risk $R_N(\theta^k)$ for three experiments with $d=320$, $s_0=60$, and different values of $\Delta$. SGD is initialized by setting $a_i=1$, $b_i=1$, and $w_i^0\sim_{\mathrm{iid}}\mathsf{N}(0,0.8^2/d\,I_d)$ for $i\le N$. We observe that SGD converges to a network with very small risk, but this convergence has a nontrivial structure and presents long flat regions.

Fig. 3.

Evolution of the population risk for the variable selection problem, using a two-layer neural network with ReLU activations. Here $d=320$, $s_0=60$, and $N=800$, and we used $\xi(t)=t^{-1/4}$ and $\varepsilon=2\times10^{-4}$ to set the step size. Numerical simulations using SGD (one run per data point) are marked "+," and curves are solutions of the reduced PDE with $d=\infty$. (Inset) Evolution of three parameters of the reduced distribution $\bar\rho_t$ (average output weight $a$, average offset $b$, and average $\ell_2$ norm $r_1$ in the relevant subspace) for the same setting.

The empirical results are well captured by our predictions based on the continuum limit. In this case, we obtain a reduced PDE for the joint distribution of the four quantities $\bar r=(a,\,b,\,r_1=\|P_{\mathcal{V}}w\|_2,\,r_2=\|(I_d-P_{\mathcal{V}})w\|_2)$, denoted by $\bar\rho_t$. The reduced PDE is analogous to Eq. 13, albeit in four dimensions rather than one. In Fig. 3, we consider the evolution of the risk alongside three properties of the distribution $\bar\rho_t$: the means of the output weight $a$, of the offset $b$, and of $r_1$.

Predicting Failure.

SGD does not always converge to a near-global optimum. Our analysis allows us to construct examples in which SGD fails. For instance, Fig. 4 reports results for the isotropic Gaussians problem. We violate the assumptions of Theorem 1 by using a nonmonotone activation function. Namely, we use $\sigma_*(x;\theta)=\sigma(\langle w,x\rangle)$, where $\sigma(t)=-2.5$ for $t\le0$, $\sigma(t)=7.5$ for $t\ge1.5$, and $\sigma(t)$ linearly interpolates from $(0,-2.5)$ to $(0.5,-4)$ and from $(0.5,-4)$ to $(1.5,7.5)$.
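For reference, this nonmonotone activation can be sketched in one line via linear interpolation through the stated breakpoints (constant outside $[0,1.5]$):

```python
def sigma_nonmonotone(t):
    """Piecewise linear through (0, -2.5), (0.5, -4), (1.5, 7.5); flat outside."""
    return np.interp(t, [0.0, 0.5, 1.5], [-2.5, -4.0, 7.5])
```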

Fig. 4.

Separating two isotropic Gaussians, with a nonmonotone activation function (see Predicting Failure for details). Here $N=800$, $d=320$, and $\Delta=0.5$. The main frame presents the evolution of the population risk along the SGD trajectory, starting from two different initializations $(w_i^0)_{i\le N}\sim_{\mathrm{iid}}\mathsf{N}(0,\kappa^2/d\,I_d)$, for either $\kappa=0.1$ or $\kappa=0.4$. (Inset) Evolution of the average of $\|w\|_2$ for the same conditions. Symbols are empirical results. Continuous lines are predictions obtained with the reduced PDE (Eq. 13).

Depending on the initialization, SGD converges to two different limits, one with small risk and one with high risk. Again, this behavior is well tracked by solving a one-dimensional PDE for the distribution $\bar\rho_t$ of $r=\|w\|_2$.

General Results

In this section, we return to the general supervised learning problem described in the Introduction and describe our general results. Proofs are deferred to SI Appendix.

First, we note that the minimum of the asymptotic risk R(ρ) of Eq. 5 provides a good approximation of the minimum of the finite-N risk RN(θ).

Proposition 1:

Assume that either one of the following conditions holds: (a) $\inf_\rho R(\rho)$ is achieved by a distribution $\rho_*$ such that $\int U(\theta,\theta)\,\rho_*(d\theta)\le K$; (b) there exists $\varepsilon_0>0$ such that, for any $\rho\in\mathcal{P}(\mathbb{R}^D)$ with $R(\rho)\le\inf_\rho R(\rho)+\varepsilon_0$, we have $\int U(\theta,\theta)\,\rho(d\theta)\le K$. Then

$$\Big|\inf_\theta R_N(\theta) - \inf_\rho R(\rho)\Big| \le K/N. \qquad [16]$$

Further, assume that $\theta\mapsto V(\theta)$ and $(\theta_1,\theta_2)\mapsto U(\theta_1,\theta_2)$ are continuous, with $U$ bounded below. A probability measure $\rho_*$ is a global minimum of $R$ if $\inf_{\theta\in\mathbb{R}^D}\Psi(\theta;\rho_*)>-\infty$ and

$$\mathrm{supp}(\rho_*) \subseteq \arg\min_{\theta\in\mathbb{R}^D} \Psi(\theta;\rho_*). \qquad [17]$$

We next consider the DDs (Eqs. 7 and 12). These should be interpreted to hold in a weak sense (cf. SI Appendix). To establish that these PDEs indeed describe the limit of the SGD dynamics, we make the following assumptions:

  • A1. $t\mapsto\xi(t)$ is bounded Lipschitz: $\|\xi\|_\infty,\|\xi\|_{\mathrm{Lip}}\le K_1$, with $\int_0^\infty\xi(t)\,dt=\infty$.

  • A2. The activation function $(x,\theta)\mapsto\sigma_*(x;\theta)$ is bounded, with sub-Gaussian gradient: $\|\sigma_*\|_\infty\le K_2$, $\|\nabla_\theta\sigma_*(x;\theta)\|_{\psi_2}\le K_2$. Labels are bounded: $|y_k|\le K_2$.

  • A3. The gradients $\theta\mapsto\nabla_\theta V(\theta)$ and $(\theta_1,\theta_2)\mapsto\nabla_{\theta_1}U(\theta_1,\theta_2)$ are bounded and Lipschitz continuous [namely, $\|\nabla_\theta V(\theta)\|_2,\|\nabla_{\theta_1}U(\theta_1,\theta_2)\|_2\le K_3$; $\|\nabla_\theta V(\theta)-\nabla_\theta V(\theta')\|_2\le K_3\|\theta-\theta'\|_2$; and $\|\nabla_{\theta_1}U(\theta_1,\theta_2)-\nabla_{\theta_1}U(\theta_1',\theta_2')\|_2\le K_3\|(\theta_1,\theta_2)-(\theta_1',\theta_2')\|_2$].

We also introduce the following error term that quantifies in a nonasymptotic sense the accuracy of our PDE model:

$$\mathrm{err}_{N,D}(z) \equiv \sqrt{1/N \vee \varepsilon}\cdot\Big[\sqrt{D+\log(N/\varepsilon)} + z\Big]. \qquad [18]$$

The convergence of the SGD process to the PDE model is an example of a phenomenon that is known in probability theory as propagation of chaos (23).

Theorem 3:

Assume that conditions A1, A2, and A3 hold. For $\rho_0\in\mathcal{P}(\mathbb{R}^D)$, consider SGD with initialization $(\theta_i^0)_{i\le N}\sim_{\mathrm{iid}}\rho_0$ and step size $s_k=\varepsilon\,\xi(k\varepsilon)$. For $t\ge0$, let $\rho_t$ be the solution of the PDE (Eq. 7). Then, for any fixed $k$, $\hat\rho_k^{(N)}\Rightarrow\rho_{k\varepsilon}$ almost surely along any sequence $(N,\varepsilon=\varepsilon_N)$ such that $N/\log(1/\varepsilon_N)\to\infty$ and $\varepsilon_N\to0$. Further, there exists a constant $C$ (depending uniquely on the parameters $K_i$ of conditions A1–A3) such that, for any $f:\mathbb{R}^D\to\mathbb{R}$ with $\|f\|_\infty,\|f\|_{\mathrm{Lip}}\le1$, and $\varepsilon\le1$,

$$\sup_{k\in[0,T/\varepsilon]\cap\mathbb{N}} \Big|\frac{1}{N}\sum_{i=1}^{N} f(\theta_i^k) - \int f(\theta)\,\rho_{k\varepsilon}(d\theta)\Big| \le C e^{CT}\,\mathrm{err}_{N,D}(z),$$
$$\sup_{k\in[0,T/\varepsilon]\cap\mathbb{N}} \big|R_N(\theta^k) - R(\rho_{k\varepsilon})\big| \le C e^{CT}\,\mathrm{err}_{N,D}(z), \qquad [19]$$

with probability at least $1-e^{-z^2}$. The same statements hold for noisy SGD (Eq. 11), provided Eq. 7 is replaced by Eq. 12, and provided $\beta\ge1$, $\lambda\le1$, and $\rho_0$ is $K_0$ sub-Gaussian for some $K_0>0$.

Notice that the dependence of the error terms on $N$ and $D$ is rather benign. On the other hand, the error grows exponentially with the time horizon $T$, which limits its applicability to cases in which the DD converges rapidly to a good solution. We do not expect this behavior to be improvable within the general setting of conditions A1–A3, which a priori includes cases in which the dynamics is unstable.

We can regard $J(\theta;\rho_t)=\rho_t(\theta)\,\nabla_\theta\Psi(\theta;\rho_t)$ as a current. The fixed points of the continuum dynamics are densities that correspond to zero current, as stated below.

Proposition 2:

Assume $V(\cdot)$ and $U(\cdot,\cdot)$ to be differentiable with bounded gradients. If $\rho_t$ is a solution of the PDE (Eq. 7), then $R(\rho_t)$ is nonincreasing. Further, a probability distribution $\rho$ is a fixed point of the PDE (Eq. 7) if and only if

$$\mathrm{supp}(\rho) \subseteq \big\{\theta:\ \nabla_\theta\Psi(\theta;\rho)=0\big\}. \qquad [20]$$

Note that global optimizers of $R(\rho)$, defined by condition Eq. 17, are fixed points, but the set of fixed points is, in general, larger than the set of optimizers. Our next proposition provides an analogous characterization of the fixed points of the diffusion DD (Eq. 12) (see ref. 21 for related results).

Proposition 3:

Assume that conditions A1–A3 hold and that $\rho_0$ is absolutely continuous with respect to the Lebesgue measure, with $F_{\beta,\lambda}(\rho_0)<\infty$. If $(\rho_t)_{t\ge0}$ is a solution of the diffusion PDE (Eq. 12), then $\rho_t$ is absolutely continuous. Further, there is at most one fixed point $\rho_*=\rho_*^{\beta,\lambda}$ of Eq. 12 satisfying $F_{\beta,\lambda}(\rho_*)<\infty$. This fixed point is absolutely continuous, and its density satisfies

$$\rho_*(\theta) = \frac{1}{Z(\beta)}\,\exp\big\{-\beta\,\Psi_\lambda(\theta;\rho_*)\big\}. \qquad [21]$$
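Eq. 21 suggests a simple numerical scheme on a discretized parameter space: iterate the Gibbs map with damping until $\rho$ stabilizes. A sketch (illustrative only; this is not the construction used in the proofs), where `psi_lambda(grid, rho)` evaluates $\Psi_\lambda$ at the grid points under the current $\rho$:

```python
def gibbs_fixed_point(psi_lambda, grid, beta, damp=0.5, iters=500):
    """Damped iteration rho <- (1-damp) rho + damp * exp(-beta Psi_lambda)/Z."""
    rho = np.full(len(grid), 1.0 / len(grid))        # uniform initialization
    for _ in range(iters):
        g = np.exp(-beta * psi_lambda(grid, rho))    # unnormalized Gibbs density
        rho = (1.0 - damp) * rho + damp * g / g.sum()
    return rho
```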

In the next sections, we state our results about convergence of the DD to its fixed point. In the case of noisy SGD [and for the diffusion PDE (Eq. 12)], a general convergence result can be established (although at the cost of an additional regularization). For noiseless SGD (and the continuity equation, Eq. 7), we do not have such a general result. However, we obtain a stability condition for a fixed point containing one point mass, which is useful to characterize possible limiting points (and is used in treating the examples in the previous section).

Convergence: Noisy SGD.

Remarkably, the diffusion PDE (Eq. 12) generically admits a unique fixed point, which is the global minimum of $F_{\beta,\lambda}(\rho)$, and the evolution (Eq. 12) converges to it if initialized so that $F_{\beta,\lambda}(\rho_0)<\infty$. This statement requires some qualifications. First, we introduce a sufficient regularity assumption to guarantee the existence of sufficiently smooth solutions of Eq. 12:

  • A4. $V\in C^4(\mathbb{R}^D)$, $U\in C^4(\mathbb{R}^D\times\mathbb{R}^D)$, and $\nabla_{\theta_1}^k U(\theta_1,\theta_2)$ is uniformly bounded for $0\le k\le4$.

Next, notice that the right-hand side of the fixed-point equation (Eq. 21) is not necessarily normalizable [for instance, it is not when $V(\cdot)$ and $U(\cdot,\cdot)$ are bounded]. To ensure the existence of a fixed point, we need $\lambda>0$.

Theorem 4:

Assume that conditions A1–A4 hold, and $1/K_0\le\lambda\le K_0$ for some $K_0>0$. Then $F_{\beta,\lambda}(\rho)$ has a unique minimizer, denoted by $\rho_*^{\beta,\lambda}$, which satisfies

$$R(\rho_*^{\beta,\lambda}) \le \inf_{\theta\in\mathbb{R}^{N\times D}} R_N(\theta) + C\,D/\beta, \qquad [22]$$

where $C$ is a constant depending on $K_0,K_1,K_2,K_3$. Further, letting $\rho_t$ be a solution of the diffusion PDE (Eq. 12) with initialization satisfying $F_{\beta,\lambda}(\rho_0)<\infty$, we have, as $t\to\infty$,

$$\rho_t \Rightarrow \rho_*^{\beta,\lambda}. \qquad [23]$$

The proof of this theorem is based on the following formula that describes the free-energy decrease along the trajectories of the DD (Eq. 12):

$$\frac{d F_{\beta,\lambda}(\rho_t)}{dt} = -2\xi(t)\int_{\mathbb{R}^D} \big\|\nabla_\theta\Psi_\lambda(\theta;\rho_t) + \beta^{-1}\nabla_\theta\log\rho_t(\theta)\big\|_2^2\,\rho_t(\theta)\,d\theta. \qquad [24]$$

(A key technical hurdle is, of course, proving that this expression makes sense, which we do by showing the existence of strong solutions.) It follows that the right-hand side must vanish as $t\to\infty$, from which we prove that (eventually taking subsequences) $\rho_t\Rightarrow\rho_*$, where $\rho_*$ must satisfy $\beta\,\Psi_\lambda(\theta;\rho_*)+\log\rho_*(\theta)=\mathrm{const}$. This in turn means that $\rho_*$ is a solution of the fixed-point condition Eq. 21 and is in fact a global minimum of $F_{\beta,\lambda}$ by convexity.

This result can be used in conjunction with Theorem 3 to analyze the regularized noisy SGD algorithm (Eq. 11).

Theorem 5:

Assume that conditions A1–A4 hold. Let $\rho_0\in\mathcal{P}(\mathbb{R}^D)$ be absolutely continuous with $F_{\beta,\lambda}(\rho_0)<\infty$ and $K_0$ sub-Gaussian. Consider regularized noisy SGD (cf. Eq. 11) at inverse temperature $\beta<\infty$ and regularization $1/K_0\le\lambda\le K_0$, with initialization $(\theta_i^0)_{i\le N}\sim_{\mathrm{iid}}\rho_0$. Then, for any $\eta>0$, there exists $K=K(\eta,\{K_i\})$ such that, setting $\beta\ge KD$, there exist $T=T(\eta,V,U,\{K_i\},D,\beta)<\infty$ and $C_0=C_0(\eta,\{K_i\},\delta)$ (independent of the dimension $D$ and temperature $\beta$) such that the following happens for $N,(1/\varepsilon)\ge C_0 e^{C_0 T} D$ and $\varepsilon\ge1/N^{10}$: For any $k\in[T/\varepsilon,\,10T/\varepsilon]$, we have, with probability at least $1-\delta$,

$$R_N(\theta^k) \le \inf_{\rho\in\mathcal{P}(\mathbb{R}^D)} R_\lambda(\rho) + \eta. \qquad [25]$$

Let us emphasize that the convergence time $T$ in the last theorem can depend on the dimension $D$ and on the data distribution $\mathbb{P}$ but is independent of the number of hidden units $N$. As illustrated by the examples in the previous section, understanding the dependence of $T$ on $D$ requires further analysis, but examining the proof of this theorem suggests $T=e^{O(D)}$ quite generally [examples in which $T=O(1)$ or $T=e^{\Theta(D)}$ can be constructed]. We expect that our techniques could be pushed to investigate the dependence of $T$ on $\eta$ (see SI Appendix, Discussion). In highly structured cases, the dimension $D$ can be of constant order and much smaller than $d$.

Convergence: Noiseless SGD.

The next theorems provide necessary and sufficient conditions for distributions containing a single point mass to be a stable fixed point of the evolution. This result is useful to characterize the large-time asymptotics of the dynamics (Eq. 7). Here, we write $\nabla_1 U(\theta_1,\theta_2)$ for the gradient of $U$ with respect to its first argument and $\nabla^2_{1,1}U$ for the corresponding Hessian. Further, for a probability distribution $\rho_*$, we define

$$H_0(\rho_*) = \nabla^2 V(\theta_*) + \int \nabla^2_{1,1} U(\theta_*,\theta)\,\rho_*(d\theta). \qquad [26]$$

Note that $H_0(\rho_*)$ is nothing but the Hessian of $\theta\mapsto\Psi(\theta;\rho_*)$ at $\theta_*$.

Theorem 6:

Assume $V$ and $U$ to be twice differentiable, with bounded gradients and bounded continuous Hessians. Let $\theta_*\in\mathbb{R}^D$ be given. Then $\rho_*=\delta_{\theta_*}$ is a fixed point of the evolution (Eq. 7) if and only if $\nabla V(\theta_*)+\nabla_1 U(\theta_*,\theta_*)=0$.

Define $H_0(\delta_{\theta_*})\in\mathbb{R}^{D\times D}$ as per Eq. 26. If $\lambda_{\min}(H_0(\delta_{\theta_*}))>0$, then there exists $r_0>0$ such that, if $\mathrm{supp}(\rho_{t_0})\subseteq B(\theta_*;r_0)\equiv\{\theta:\|\theta-\theta_*\|_2\le r_0\}$, then $\rho_t\Rightarrow\rho_*$ as $t\to\infty$. In fact, convergence is exponentially fast, namely $\int\|\theta-\theta_*\|_2^2\,\rho_t(d\theta)\le e^{-\lambda(t-t_0)}$ for some $\lambda>0$.

Theorem 7:

Under the same assumptions as Theorem 6, let $\rho_*=p_*\delta_{\theta_*}+(1-p_*)\tilde\rho_*\in\mathcal{P}(\mathbb{R}^D)$ be a fixed point of the dynamics (Eq. 7), with $p_*\in(0,1]$ and $\nabla_\theta\Psi(\theta_*;\rho_*)=0$ (which, in particular, is implied by the fixed-point condition, Eq. 20). Define the level sets $L(\eta)\equiv\{\theta:\Psi(\theta;\rho_*)\le\Psi(\theta_*;\rho_*)+\eta\}$, and make the following assumptions: (B1) the eigenvalues of $H_0=H_0(\rho_*)$ are all different from 0, with $\lambda_{\min}(H_0)<0$; (B2) $\tilde\rho_*(L(\eta))\to1$ as $\eta\to0$; and (B3) there exists $\eta_0>0$ such that the sets $L(\eta)$ are compact for all $\eta\in(0,\eta_0)$.

If $\rho_0$ has a bounded density with respect to the Lebesgue measure, then $\rho_t$ cannot converge weakly to $\rho_*$ as $t\to\infty$.

Discussion and Future Directions

In this paper, we developed an approach to the analysis of two-layer neural networks. Using a propagation-of-chaos argument, we proved that—if the number of hidden units satisfies $N\gg D$—SGD dynamics is well approximated by the PDE in Eq. 7, while noisy SGD is well approximated by Eq. 12. Both of these asymptotic descriptions correspond to Wasserstein gradient flows for certain energy (or free-energy) functionals. While empirical risk minimization is known to be insensitive to overparametrization (22), the present work clarifies that the SGD behavior is also independent of the number of hidden units, as soon as this is large enough.

We illustrated our approach on several concrete examples by proving convergence of SGD to a near-global optimum. This type of analysis provides a mechanism for avoiding the perils of nonconvexity. We do not prove that the finite-$N$ risk $R_N(\theta)$ has a unique local minimum or that all local minima are close to each other. Such claims have often been the target of earlier work but might be too strong for the case of neural networks. We prove instead that the PDE (Eq. 7) converges to a near-global optimum when initialized with a bounded density. This effectively gets rid of some exceptional stationary points of $R_N(\theta)$ and merges multiple finite-$N$ stationary points that correspond to similar distributions $\rho$.

In the case of noisy SGD (Eq. 11), we prove that it converges generically to a near-global minimum of the regularized risk, in time independent of the number of hidden units.

We emphasize that while we focused here on the case of square loss, our approach should be generalizable to other loss functions as well (cf. SI Appendix).

The present work opens the way to several interesting research directions. We will mention two of them: (i) The PDE (Eq. 7) corresponds to a gradient flow in the Wasserstein metric for the risk $R(\rho)$ (see ref. 20). Building on this remark, tools from optimal transportation theory can be used to prove convergence. (ii) Multiple finite-$N$ local minima can correspond to the same minimizer $\rho_*$ of $R(\rho)$ in the limit $N\to\infty$. Ideas from glass theory (24) might be useful to investigate this structure.

Let us finally mention that, after a first version of this paper appeared as a preprint, several other groups obtained results that are closely related to Theorem 3 (2527).

Supplementary Material

Supplementary File

Acknowledgments

This work was partially supported by NSF Grants DMS-1613091, CCF-1714305, and IIS-1741162. S.M. was partially supported by an Office of Technology Licensing Stanford Graduate Fellowship. P.-M.N. was partially supported by a William R. Hewlett Stanford Graduate Fellowship.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1806579115/-/DCSupplemental.

References

  • 1.Rosenblatt F. Principles of Neurodynamics. Spartan Books; Washington, DC: 1962. [Google Scholar]
  • 2.Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Vardi MY, editor. Advances in Neural Information Processing Systems. Association for Computing Machinery; New York: 2012. pp. 1097–1105. [Google Scholar]
  • 3.Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep Learning. Vol 1 MIT Press; Cambridge: 2016. [Google Scholar]
  • 4.Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22:400–407. [Google Scholar]
  • 5.Bottou L. Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y, Saporta G, editors. Proceedings of COMPSTAT’2010. Physica; Heidelberg: 2010. pp. 177–186. [Google Scholar]
  • 6.Shalev-Shwartz S, Ben-David S. Understanding Machine Learning: From Theory to Algorithms. Cambridge Univ Press; New York: 2014. [Google Scholar]
  • 7.Wang C, Mattingly J, Lu YM. 2017. Scaling limit: Exact and tractable analysis of online learning algorithms with applications to regularized regression and PCA. arXiv:1712.04332.
  • 8.Soltanolkotabi M, Javanmard A, Lee JD. 2017. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv:1707.04926.
  • 9.Ge R, Lee JD, Ma T. 2017. Learning one-hidden-layer neural networks with landscape design. arXiv:1711.00501.
  • 10.Brutzkus A, Globerson A. 2017. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv:1702.07966.
  • 11.Arora S, Bhaskara A, Ge R, Ma T. 2014. Provable bounds for learning some deep representations. Proceedings of International Conference on Machine Learning (ICML). Available at https://arxiv.org/abs/1310.6343. Accessed July 18, 2018.
  • 12.Sedghi H, Anandkumar A. 2015. Provable methods for training neural networks with sparse connectivity. Proceedings of International Conference on Learning Representation (ICLR). Available at https://arxiv.org/abs/1412.2693. Accessed July 18, 2018.
  • 13.Janzamin M, Sedghi H, Anandkumar A. 2015. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv:1506.08473.
  • 14.Zhang Y, Lee JD, Jordan MI. 2016. L1-regularized neural networks are improperly learnable in polynomial time. Proceedings of International Conference on Machine Learning (ICML). Available at https://arxiv.org/abs/1510.03528. Accessed July 18, 2018.
  • 15.Tian Y. 2017. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. International Conference on Learning Representation (ICLR). Available at https://openreview.net/forum?id=Hk85q85ee. Accessed July 18, 2018.
  • 16.Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS. 2017. Recovery guarantees for one-hidden-layer neural networks. arXiv:1706.03175.
  • 17.Sun Lee W, Bartlett PL, Williamson RC. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans Inf Theor. 1996;42:2118–2132. [Google Scholar]
  • 18.Bengio Y, Roux NL, Vincent P, Delalleau O, Marcotte P. Convex neural networks. In: Weiss Y, Schölkopf B, Platt JC, editors. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA: 2006. pp. 123–130. [Google Scholar]
  • 19.Jordan R, Kinderlehrer D, Otto F. The variational formulation of the Fokker–Planck equation. SIAM J Math Anal. 1998;29:1–17. [Google Scholar]
  • 20.Ambrosio L, Gigli N, Savaré G. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Birkhäuser; Basel: 2008. [Google Scholar]
  • 21.Carrillo JA, McCann RJ, Villani C. Kinetic equilibration rates for granular media and related equations: Entropy dissipation and mass transportation estimates. Rev Mat Iberoam. 2003;19:971–1018. [Google Scholar]
  • 22.Bartlett PL. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans Inf Theor. 1998;44:525–536. [Google Scholar]
  • 23.Sznitman A-S. Topics in propagation of chaos. In: Hennequin PL, editor. Ecole d’été de probabilités de Saint-Flour XIX—1989. Springer; Berlin: 1991. pp. 165–251. [Google Scholar]
  • 24.Mézard M, Parisi G. Thermodynamics of glasses: A first principles computation. J Phys Condens Matter. 1999;11:A157–A165. [Google Scholar]
  • 25.Rotskoff GM, Vanden-Eijnden E. 2018. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915.
  • 26.Sirignano J, Spiliopoulos K. 2018. Mean field analysis of neural networks. arXiv:1805.01053.
  • 27.Chizat L, Bach F. 2018. On the global convergence of gradient descent for over-parameterized models using optimal transport. arXiv:1805.09545.
