Proceedings of the National Academy of Sciences of the United States of America
. 2024 Feb 20;121(9):e2316301121. doi: 10.1073/pnas.2316301121

On the different regimes of stochastic gradient descent

Antonio Sclocchi a,1, Matthieu Wyart a,1
PMCID: PMC10907278  PMID: 38377198

Significance

The success of deep learning contrasts with its limited understanding. One example is stochastic gradient descent, the main algorithm used to train neural networks. It depends on hyperparameters whose choice has little theoretical guidance and relies on expensive trial-and-error procedures. In this work, we clarify how these hyperparameters affect the training dynamics of neural networks, leading to a phase diagram with distinct dynamical regimes. Our results explain the surprising observation that these hyperparameters strongly depend on the number of data available. We show that this dependence is controlled by the difficulty of the task being learned.

Keywords: stochastic gradient descent, phase diagram, critical batch size, implicit bias

Abstract

Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size B, and the step size, or learning rate η. For small B and large η, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the “temperature” T = η/B. Yet this description is observed to break down for sufficiently large batches B ≫ B*, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the B-η plane that separates three dynamical phases: i) a noise-dominated SGD governed by temperature, ii) a large-first-step-dominated SGD, and iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size B* separating regimes (i) and (ii) scales with the size P of the training set, with an exponent that characterizes the hardness of the classification problem.


Stochastic gradient descent (SGD), with its variations, has been the algorithm of choice to minimize the loss and train neural networks since the introduction of back-propagation (1–3). When minimizing an empirical loss on a training set of size P, SGD consists in estimating the loss gradient using a mini-batch of the data selected randomly at each step. When the size B of the batch is small, SGD becomes a noisy version of gradient descent (GD), and the magnitude of this noise is controlled by the “temperature” T = η/B, with η the learning rate (4). This temperature is not a thermodynamic quantity, since in general no thermal equilibrium is reached before the end of the dynamics. However, understanding the scale of hyperparameters at which this description holds and SGD noise matters has remained a challenge. Specific questions include: i) below which temperature T_c does noise become irrelevant, so that the dynamics corresponds to gradient descent? ii) What determines the critical batch size B*, beyond which SGD is observed not to be controlled by temperature in a variety of settings (5, 6)? This question is of practical importance: after searching for an optimal temperature, practitioners can maximize the batch size up to B* while keeping the temperature fixed, as large batches can lead to faster computations in practice (7, 8). iii) It was observed that the variation of the network weights during training increases as a power law of both T and P, both for deep nets and for simple models like the perceptron (9). Yet, a quantitative explanation of this phenomenon is lacking.

SGD noise has attracted significant research interest, in particular for its connections to loss-landscape properties and performance. Recent works describing SGD via a stochastic differential equation (SDE) have emphasized several effects of this noise, including its ability to escape saddles (10–16), to bias the dynamics toward broader minima (17–21), or toward sparser distributions of the network parameters (22–26). Yet, most of these works assume that fresh samples are drawn at each time step, and are thus unable to explain the dependence of SGD on the finite size P of the training set. Here, we study this dependence in classification problems and identify three regimes of SGD in deep learning. Following ref. 9, we first consider a perceptron setting where classes are given by a teacher y(x) = sign(w*·x), where w* is a unit vector, and are learned by a student f(x) = w·x. We solve this problem analytically for the hinge loss, which vanishes if all data are fitted by some margin κ. Our central results, summarized in Fig. 1A and the phase diagrams therein, are as follows.

Fig. 1.


SGD phase diagrams for different data and architectures. The datasets considered are a teacher perceptron model with P = 8,192, dimension d = 128, and the data distribution of Eq. 1 with χ = 1 (A.1); P = 32,768 images of MNIST (B.1); and P = 16,384 images of CIFAR10 (C.1). The neural network architectures trained on these datasets correspond respectively to (A.2) a perceptron, for which the output is linear in both the input x and the weights w, trained with hinge-loss margin κ = 2⁻⁷; (B.2) a fully connected network with five hidden layers, 128 hidden neurons per layer, and margin κ = 2⁻¹⁵; (C.2) a CNN made of several blocks composed of depth-wise, point-wise, and standard convolutions plus residual connections (more details in SI Appendix, section 1), with margin κ = 2⁻¹⁵. Panels (A.3), (B.3), and (C.3) display the alignment after training in the (η, B) phase diagram. The black dots correspond to diverging trainings where the algorithm does not converge. We can distinguish the noise-dominated SGD regime, for which the alignment is constant along the diagonals η/B = T. Within the first-step-dominated SGD regime, instead, the alignment is constant at constant η. For small η, one enters the gradient descent (GD) regime, where the alignment does not depend on η and B. Taking the value m_GD of the alignment in this regime as a reference, the black dashed line η_c(B) delimiting the GD region corresponds to the alignment taking the value 2m_GD. The vertical black dashed line guides the eye to indicate the critical batch size B*. Panels (A.4), (B.4), and (C.4) display the test error, again as a function of (η, B). As expected, the test error is constant along the diagonals η/B = T for noise-dominated SGD and constant in the GD regime. For first-step-dominated SGD, the test error can be affected by both η and B, and can improve at very large batches (Discussion in Sec. 2.A).

Noise-Dominated SGD. For small batches and large learning rates, the dynamics is well described by an SDE with a noise scale T = η/B. We show that this noise controls the components w⊥ of the student weights orthogonal to the teacher, such that at the end of training w⊥ ∼ T. Using this result, together with considerations on the angle of the predictor that can fit the entire training set, implies that the weight magnitude after training indeed depends both on T and P, as ‖w‖ ∼ T P^γ, thus explaining the observation of Sclocchi et al. (9). The exponent γ characterizes the difficulty of the classification problem.

Gradient Descent. This regime breaks down when the temperature, and therefore the magnitude of the orthogonal weights w⊥, is too small. Then, fitting all data no longer corresponds to a constraint on the angle of the predictor w/‖w‖, but instead to a constraint on ‖w‖ to satisfy the margin. This latter constraint is the most stringent one for T ≲ T_c ∼ κ. In that regime, temperature can be neglected and the dynamics corresponds to gradient descent, thus answering point (i) above.

First-Step-Dominated SGD. The noise-dominated SGD regime also breaks down when the batch size and learning rate are increased at fixed T. Indeed, the first step will increase the weight magnitude to ‖w‖ ∼ η, which is larger than the noise-dominated prediction ‖w‖ ∼ T P^γ if B ≫ B* ∼ P^γ, answering (ii) above.

Our second central result is empirical: we find this phase diagram also in deep neural networks. We train using the hinge loss, as in the perceptron case (using the cross-entropy loss with early stopping leads to similar performance and weight increase (9, 27)). In this case, it is useful to introduce the quantity ⟨y(x)f(x)⟩_x, characterizing the magnitude with which the output aligns with the task (for the perceptron, it is simply proportional to the overlap w₁ with the teacher direction). Fig. 1 shows this alignment in the (η, B) diagram for fully connected (B) and convolutional networks (C), revealing in each case the three regimes also obtained for the perceptron. As shown below, in both cases, B* depends as a power law on the training set size, as predicted.

Related Works.

SGD descriptions. In the high-dimensional setting, several works have analyzed online SGD in teacher-student problems (28–32). It has been rigorously shown (15) that, for online SGD in high dimensions, the effective dynamics of summary statistics obey an ordinary differential equation with correction terms given by SGD stochasticity. Our analysis of the perceptron falls into this regime, with fixed points of the dynamics depending explicitly on T = η/B. We further introduce the effects of a finite training set P by studying when this online description breaks down. Other studies (33, 34) have analyzed the correlations of SGD noise with the training set using dynamical mean-field theory. They consider the regime where the batch size B is extensive in the training set size P, while our online description considers the regime where B is much smaller than P.

In finite-dimensional descriptions of SGD, the theoretical justification for approximating SGD with an SDE has generally relied upon vanishing learning rates (13, 35). This condition is too restrictive for standard deep learning settings, and ref. 36 has shown empirically that the applicability of the SDE approximation can be extended to finite learning rates. Other SDE approximations assume isotropic and homogeneous noise (10, 37–39) or a noise covariance proportional to the Hessian of the loss (4, 14, 19, 40), while in the online limit of the perceptron model, we can compute the exact dependence of the noise covariance on the model weights. Other studies have observed a link between the effects of SGD noise and the size of the training set (41). In our work, we show that these effects do not depend on a direct increase of the noise scale with the training set size, but rather on the fact that a larger P implies that larger weights are needed to fit the data, a process that depends on SGD noise.

Critical batch size. An empirical study of large-batch training at fixed training set size has observed, in several learning settings, that B* is inversely proportional to the signal-to-noise ratio of loss gradients across training samples (8). Our work is consistent with these findings, but most importantly it further predicts and tests a nontrivial power-law dependence of B* on P.

1. Noise-Dominated Regime

A. Perceptron Model.

We consider data x ∈ ℝ^d, with d ≫ 1, that are linearly separable. Without loss of generality, we choose the class label y(x) = ±1 to be given by the sign of the first component: y(x) = sign(x₁). {e_i}_{i=1,...,d} is the canonical basis, and e₁ corresponds to the teacher direction. The informative component x₁ of each datum is independently sampled from the probability distribution

ρ(x₁) = |x₁|^χ e^{−x₁²/2}/Z, [1]

where Z is a normalization constant and χ > −1 (42). The other d−1 components x⊥ = [x_i]_{i=2,...,d} are distributed as a standard multivariate Gaussian, i.e., x⊥ ∼ N(0, I_{d−1}), with I_{d−1} the (d−1)-dimensional identity matrix. The parameter χ controls the data distribution near the decision boundary x₁ = 0 and how “hard” the classification problem is. Smaller χ corresponds to harder problems, as more points lie close to the boundary. The case χ = 0 corresponds to the Gaussian case.*
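For readers who want to experiment with this set-up, the distribution of Eq. 1 is easy to sample: the density of u = x₁² is proportional to u^{(χ−1)/2} e^{−u/2}, so x₁² is Gamma-distributed with shape (χ+1)/2 and scale 2, up to a random sign. A minimal NumPy sketch (the function name and sizes are illustrative, not from the paper):

```python
import numpy as np

def sample_teacher_data(P, d, chi, rng):
    """Sample P points in d dimensions from the distribution of Eq. 1.

    The informative coordinate x1 has density ~ |x1|^chi * exp(-x1^2/2);
    equivalently x1^2 is Gamma((chi+1)/2, scale=2), with a random sign.
    The remaining d-1 coordinates are standard Gaussians.
    """
    x1 = rng.choice([-1.0, 1.0], size=P) * np.sqrt(
        rng.gamma(shape=(chi + 1) / 2, scale=2.0, size=P))
    x_perp = rng.standard_normal((P, d - 1))
    X = np.column_stack([x1, x_perp])
    y = np.sign(X[:, 0])          # teacher labels: y(x) = sign(x1)
    return X, y

rng = np.random.default_rng(0)
X, y = sample_teacher_data(100_000, 8, chi=1.0, rng=rng)
# For chi = 1, E[x1^2] = shape * scale = 2
print(np.mean(X[:, 0] ** 2))
```

For χ = 0 this reduces to sampling x₁ from a standard Gaussian, consistent with the text above.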

As an architecture, we chose the perceptron f(w, x) = w·x/√d. The weights are trained by minimizing the hinge loss L(w) = (1/P) Σ_{μ=1}^P (κ − y^μ f(w, x^μ))₊, where (x)₊ = max(0, x), κ > 0 is the margin, and {(x^μ, y^μ = y(x^μ))}_{μ=1,...,P} is the training set, of size P ≫ d. We denote by w_t the weights obtained at time t, defined as the number of training steps times the learning rate, starting from the initialization w₀ = 0.

The hinge loss is minimized with an SGD algorithm, in which the batch is randomly selected at each time step among all the P data. The learning rate η is kept constant during training. The end of training is reached at the time t* when L(w_{t*}) = n(w_{t*}) = 0, where n(w_t) indicates the fraction of data not fitted at time t: n(w) = (1/P) Σ_{μ=1}^P θ(κ − y^μ f(w, x^μ)), with θ(·) the Heaviside step function.
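The authors' actual experiment code is linked in the Data availability section; the sketch below is an independent minimal NumPy version of this training protocol. The toy dataset keeps points away from the decision boundary (a gap |x₁| ≥ 0.5) so the run terminates quickly; sizes and hyperparameters are illustrative, not those of the paper's experiments.

```python
import numpy as np

def train_perceptron_sgd(X, y, eta, B, kappa, max_steps=50_000, seed=0):
    """SGD on the hinge loss L(w) = (1/P) sum_mu (kappa - y^mu w.x^mu/sqrt(d))_+,
    with a batch of size B drawn at random at every step and constant learning
    rate eta, starting from w_0 = 0. Stops when all margins are satisfied
    (zero training loss) or when max_steps is reached."""
    rng = np.random.default_rng(seed)
    P, d = X.shape
    w = np.zeros(d)
    for step in range(max_steps):
        if np.all(y * (X @ w) / np.sqrt(d) >= kappa):
            break                               # n(w) = 0: end of training
        idx = rng.integers(0, P, size=B)        # random mini-batch
        xb, yb = X[idx], y[idx]
        unfit = yb * (xb @ w) / np.sqrt(d) < kappa
        # batch hinge-loss gradient: -(1/B) sum over unfitted points of y x/sqrt(d)
        grad = -(yb[unfit, None] * xb[unfit]).sum(axis=0) / (B * np.sqrt(d))
        w -= eta * grad
    return w, step

# toy separable data: |x1| >= 0.5 keeps points away from the boundary
rng = np.random.default_rng(1)
P, d = 1000, 16
x1 = rng.choice([-1.0, 1.0], P) * (0.5 + np.abs(rng.standard_normal(P)))
X = np.column_stack([x1, rng.standard_normal((P, d - 1))])
y = np.sign(X[:, 0])
w, steps = train_perceptron_sgd(X, y, eta=1.0, B=20, kappa=0.01)
print(steps, np.mean(y * (X @ w) / np.sqrt(d) >= 0.01))  # fraction fitted
```

Sampling the batch with replacement at every step, rather than cycling through epochs, matches the protocol described above.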

B. Stochastic Differential Equation.

In the limit of small batches B ≪ P, SGD can be described with a stochastic differential equation (SDE) (35):

dw_t = −∇L(w_t) dt + √(T Σ_{w_t}) dW_t, [2]

where W_t is a d-dimensional Wiener process (Itô’s convention) and Σ_w/B is the covariance of the mini-batch gradients (43), given in SI Appendix, section 1.A. We first make the approximation of substituting the empirical gradient ∇L(w_t) and this covariance with their population averages ∇L(w_t) ≈ g_t = E_x[∇_w ℓ(w_t, x)] and Σ_t ≈ E_x[∇_w ℓ(w_t, x) ∇_w ℓ(w_t, x)^⊤] − g_t g_t^⊤. This approximation does not include finite-training-set effects. Our strategy below is to estimate the time when such a simplified description breaks down; the solution of the SDE at this time provides the correct asymptotic behaviors for the network at the end of training, as we observe experimentally.

B.1. Asymptotic online-SDE dynamics.

We decompose the student weights as w = w₁e₁ + w⊥, and we study the dynamics of both the scalar variable w₁,t (the “overlap” between the student and the teacher (28)) and the magnitude w⊥,t ≡ ‖w⊥‖ of the weights in the perpendicular directions. Given that we consider a data distribution that is radially symmetric in x⊥, but not in x, w₁ and w⊥ are the natural order parameters of our problem. Therefore, the d-dimensional dynamics of w can be studied through two-dimensional summary statistics.

One finds that the online-SDE dynamics of w₁,t and w⊥,t, using Itô’s lemma (SI Appendix, section 6.B), can be written as

dw₁,t = g₁,t dt + √(T/d) σ₁,t dW̃₁,t
dw⊥,t = (g⊥,t + (T/(2w⊥,t)) n_t) dt + √(T/d) σ₂,t dW̃₂,t, [3]

where W̃₁,t and W̃₂,t are Wiener processes, and the expressions for g₁,t, g⊥,t, n_t, σ₁,t, σ₂,t, reported in SI Appendix, section 6.A, are functions of w₁,t and w⊥,t through the time-dependent ratios:

λ = w₁,t/w⊥,t,  r = κ√d/w⊥,t. [4]

The quantity λ measures the angle θ between the student and the teacher directions, since θ = arctan(λ⁻¹). The ratio r compares the hinge-loss margin κ with the magnitude of the orthogonal components w⊥,t.

In the limit d ≫ 1, the stochastic part of Eq. 3 is negligible, as we show in SI Appendix, section 4.B. The variables w₁,t and w⊥,t, therefore, have a deterministic time evolution given by the drift part of Eq. 3. This result has been proven more generally for the summary statistics of different learning tasks in high dimensions (15). Note that, even if the stochastic fluctuations are negligible, the SGD noise affects the dynamics through the term (T/(2w⊥,t)) n_t in the evolution of w⊥,t.

The term g₁,t (SI Appendix, section 6.A) is always positive and vanishes in the limit λ → ∞, which determines the fixed point of Eq. 3. Therefore, we consider its vicinity

λ = w₁,t/w⊥,t ≫ 1, [5]

corresponding to a vanishing angle θ = arctan(λ⁻¹) ≈ λ⁻¹ ≪ 1 between the student and teacher directions. In addition, we consider the limit

r = κ√d/w⊥,t → 0, [6]

and argue below why this limit corresponds to the noise-dominated regime of SGD.

Under the conditions of Eqs. 5 and 6, the deterministic part of Eq. 3 reads:

dw₁,t = (c₁/√d) λ^{−χ−2} dt
dw⊥,t = λ^{−χ−1} (−√(2/π)/√d + (T/(2w⊥,t)) c_n) dt, [7]

with constants c₁, c_n (SI Appendix, section 6.A). Solving Eq. 7 gives

w⊥,t ∼ T√d,  w₁,t ∼ T√d (t/(Td))^{1/(3+χ)}. [8]

Therefore, the orthogonal component w⊥,t tends to a steady state proportional to the SGD temperature T, while the informative one, w₁,t, grows as a power law of time. Eq. 8 implies that λ ≫ 1 is reached in the large-time limit t ≫ Td, while r ≪ 1 corresponds to T ≫ κ, and therefore holds at sufficiently high temperatures.
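The power-law growth of w₁,t can be checked by integrating the deterministic dynamics numerically. The sketch below pins w⊥ at its steady-state scale T√d from Eq. 8 and sets the undetermined constant c₁ to one, so only the predicted exponent 1/(3+χ) is meaningful, not the prefactors (an illustrative sanity check, not the paper's computation):

```python
import numpy as np

# Euler integration of dw1/dt = (c1/sqrt(d)) * lambda^(-chi-2), with
# lambda = w1/w_perp and w_perp held at its steady-state scale T*sqrt(d)
# (Eq. 8). Constants are set to 1, so only the exponent is meaningful.
chi, T, d = 1.0, 0.1, 100
w_perp = T * np.sqrt(d)          # steady-state scale of the orthogonal weights
w1 = w_perp                      # start at lambda ~ 1
dt, t = 0.05, 0.0
ts, w1s = [], []
for step in range(400_000):
    w1 += dt * (w1 / w_perp) ** (-chi - 2) / np.sqrt(d)
    t += dt
    if step % 20_000 == 0:
        ts.append(t)
        w1s.append(w1)

# late-time log-log slope: predicted w1 ~ t^(1/(3+chi)), i.e. 0.25 for chi = 1
slope = np.log(w1s[-1] / w1s[-5]) / np.log(ts[-1] / ts[-5])
print(round(float(slope), 3))
```

At late times the measured slope approaches 1/(3+χ), independently of the choice of prefactors, as expected from the scaling argument above.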

B.2. SDE breakdown and solution obtained at the end of training.

Due to the finiteness of P, the online solution of Eq. 8 is expected to be no longer valid beyond some time t̂ < t*. In SI Appendix, section 5, we provide an argument, confirmed empirically, predicting that t̂ is reached when the number of training points contributing to the hinge-loss gradient is of order O(d). This is obtained by assuming that the student perceptron follows the dynamics of Eq. 8 and applying the central limit theorem to the empirical gradient ∇L(w_t). One finds that the magnitude of its population average follows ‖E_x[∇L(w_t)]‖ ∼ n(w_t)/√d, while that of its finite-P fluctuations is given by √(n(w_t)/P). For P finite but much larger than d, the average gradient is much larger than its fluctuations as long as the fraction of unfitted training points n(w_t) satisfies n(w_t) ≫ d/P. As the training progresses, more data points get fitted and n(w_t) decreases until reaching the condition

n(w_{t̂}) ∼ O(d/P). [9]

After this point, the empirical gradient and the population one greatly differ, and the online dynamics is no longer a valid description of the training dynamics. This is the time beyond which test and train errors start to differ. At leading order in λ, as r → 0, we have that n(w_t) ∼ λ^{−χ−1}. In fact, for a density of points ρ(x₁) ∼ x₁^χ at small x₁, n(w_t) ∼ θ^{χ+1} ∼ λ^{−χ−1} is the fraction of points with coordinate |x₁| smaller than θ ≈ λ⁻¹.

Therefore, using Eq. 8,

n(w_t) ∼ λ^{−χ−1} ∼ (t/(Td))^{−(χ+1)/(χ+3)}. [10]

Note that, for t ≳ t̂, n(w_t) corresponds to the test error. Neglecting the dependence on d, we finally obtain:

t̂ ∼ T P^b,  ‖w_{t̂}‖ ≈ w₁,t̂ ∼ T P^γ, [11]

where b = 1 + 2/(1+χ) and γ = 1/(1+χ).

In SI Appendix, Fig. S9, we show experimentally that the asymptotic solution of the online SDE is a valid description of SGD up to t̂, as predicted by Eq. 9. In SI Appendix, Figs. S10 and S11, we observe empirically that the power-law scalings in T and P of the stopping time t* and of w_{t*} are the same as those of t̂ and w_{t̂}; therefore, Eq. 11 also holds to characterize the end of training.
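The two exponents of Eq. 11 are simple functions of χ, and a small helper (hypothetical, for illustration) makes the predictions easy to tabulate. For χ = 1 it gives b = 2 and γ = 1/2; for χ = 4, the value estimated for parity MNIST in Sec. 3, it gives γ = 0.2, matching the measured B* ∼ P^0.2 reported there.

```python
def sgd_exponents(chi):
    """Exponents of Eq. 11: t_hat ~ T P^b and ||w|| ~ T P^gamma,
    with b = 1 + 2/(1+chi) and gamma = 1/(1+chi)."""
    return 1 + 2 / (1 + chi), 1 / (1 + chi)

for chi in (0.0, 1.0, 4.0):
    b, gamma = sgd_exponents(chi)
    print(f"chi={chi}: b={b:.2f}, gamma={gamma:.2f}")
```

Note that both exponents decrease with χ: easier tasks (fewer points near the boundary) require smaller weight growth and shorter training.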

C. Condition on T for the Noise-Dominated Regime.

To fit a data point (x^μ, y^μ), the perceptron weights must satisfy the condition y^μ w₁ x₁^μ/√d + y^μ w⊥·x⊥^μ/√d ≥ κ, which corresponds to (ref. 9):

w₁/w⊥ ≥ (1/|x₁^μ|) (κ√d/w⊥ + c^μ), [12]

where we have defined the random quantity c^μ = −y^μ (w⊥/‖w⊥‖)·x⊥^μ = O(1), and where κ√d/w⊥ ∼ κ/T in the noise-dominated regime where Eq. 8 applies.

For T ≫ κ, κ√d/w⊥ is negligible with respect to c^μ = O(1), and Eq. 12 becomes w₁/w⊥ ≳ c^μ/|x₁^μ|, independent of the margin κ. In this case, therefore, fitting (x^μ, y^μ) is constrained by the SGD noise, which inflates the noninformative weight component w⊥. For T ≪ κ, instead, fitting (x^μ, y^μ) is constrained by the margin κ and the SGD noise is negligible in Eq. 12, implying that the temperature delimiting the noise-dominated regime of SGD follows:

T_c ∼ κ. [13]

For T ≪ T_c, the final magnitude of w is independent of T, as expected. Instead, one finds that it is proportional to κ (SI Appendix, Fig. S11B).

2. Critical Batch Size B*

In the small-batch regime B ≪ B*, for a given temperature T = η/B, different SGD simulations converge to the asymptotic trajectory given by the online SDE, as observed in Fig. 2, which shows SGD trajectories in the (w₁, w⊥) plane at fixed T and varying batch size. This is not the case in the large-batch regime B ≫ B*. In fact, at fixed T, a larger batch size corresponds to a larger learning rate and therefore a larger initial step. If the initial step is too large, SGD converges to a zero-loss solution before being able to reach the online-SDE asymptotic trajectory. This situation is vividly apparent in Fig. 2. Therefore, we can estimate the scale of the critical batch size B* by comparing the first step w₁,η ∼ η with the final value w₁,t* ∼ (η/B) P^γ in the noise-dominated regime. When w₁,η ≪ w₁,t*, SGD converges to the asymptotic online SDE and the final minimum depends on T = η/B. When w₁,η ≫ w₁,t*, SGD does not converge to the asymptotic online SDE and the final weight magnitude depends only on w₁,η, as shown in Fig. 3. The condition w₁,η ∼ w₁,t*, that is, η ∼ (η/B) P^γ, gives

B* ∼ P^γ, [14]

Fig. 2.


Dynamical trajectories of SGD in the (w₁, w⊥) plane, at fixed T and varying batch size as indicated in the caption. Black circles indicate the first step of SGD; black stars indicate the last one. For small enough batches (and therefore small learning rates), trajectories converge to the online SDE solution (black dashed line). For large batches, this is not true anymore, and the final magnitude of the weights increases with batch size. The location of the stopping weights corresponds to zero loss, which can be approximately determined by measuring the hinge-loss values L_train(w₁, w⊥) (shown in color), computed as a function of the perceptron weights w = w₁e₁ + w⊥ξ. Here, ξ is a (d−1)-dimensional Gaussian random vector. The white area corresponds to interpolating solutions L_train = 0 in this simplified set-up. For full batch, we observe that w can land directly in the white area and therefore fit the data within at most a few steps. This behavior affects the test error when η is large (Fig. 1A4). Data correspond to P = 16,384, d = 128, κ = 0.01, χ = 1, T = 2.

Fig. 3.


For large learning rates, the dynamics of the alignment is different in the small-batch and large-batch regimes. (Main panel) Perceptron: in this case, the alignment is proportional to the student component w₁; data for fixed η = 512, same setting as Fig. 1. For small B, w₁ grows during the training dynamics, while, for large B, its final value is reached after a single training step. (Inset) Fully connected network on MNIST, small margin (κ = 2⁻¹⁵), fixed η = 16, same setting as Fig. 1: for small and large batch, the alignment shows a dynamics similar to the perceptron case, although for large batch it reaches its final value after some training steps (and not just a single step).

with γ = 1/(1+χ) depending on the data distribution. The relationship of Eq. 14 is well verified by the empirical data of Fig. 4A.

Fig. 4.


The critical batch size B* depends on the size of the training set as predicted by Eq. 14. (A) w₁ at the end of training for the perceptron. Inset: w₁ depends on η, B, and P for small B, while it only depends on η for large B. We observe that the cross-over between small and large B depends on P. Main panel: the curves collapse onto one curve by rescaling w₁ by η and B by B* ∼ P^γ, γ = 1/(1+χ). This is consistent with w₁ ∼ (η/B) P^γ for small B (Eq. 11) and w₁ ∼ η for large B (Section 2). (B) Fully connected network on parity MNIST. Alignment ⟨y(x)f(x)⟩_x at the end of training (measured as in Eq. 15). Inset: as for w₁ in panel (A), ⟨y(x)f(x)⟩_x depends on η, B, and P for small B, and it only depends on η for large B. Main panel: rescaling B by P^{0.2} aligns the cross-over batch sizes at different P, suggesting a dependence B* ∼ P^{0.2}. The curves are approximately collapsed by rescaling the y-axis as ⟨y(x)f(x)⟩_x/η.

This finding is consistent with B* being inversely proportional to the signal-to-noise ratio of loss gradients across training samples, as proposed in ref. 8. In fact, in our setting, increasing the gradient noise scale by a factor σ corresponds to the substitution T → σT, which transforms Eq. 14 into B* ∼ σ P^γ. Moreover, at large P, increasing χ reduces the exponent γ and B*. In fact, larger χ corresponds to fewer training points close to the decision boundary and therefore a larger signal-to-noise ratio of the gradients.
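Collecting Eqs. 13 and 14, the phase diagram can be summarized by a schematic decision rule. The sketch below sets all O(1) prefactors to one, so the boundaries are only correct up to constants (a toy classifier illustrating the scaling relations, not a fit to the experiments):

```python
def sgd_regime(eta, B, P, chi, kappa):
    """Schematic phase-diagram rule for the perceptron (prefactors set to 1):
    - T = eta/B below ~kappa -> gradient-descent-like regime (Eq. 13)
    - B above ~P^gamma       -> first-step-dominated SGD (Eq. 14)
    - otherwise              -> noise-dominated SGD (temperature T rules)
    """
    gamma = 1 / (1 + chi)
    T = eta / B
    if T < kappa:
        return "GD"
    if B > P ** gamma:
        return "first-step-dominated"
    return "noise-dominated"

# examples for chi = 1 (gamma = 1/2), P = 16384 -> B* ~ 128, kappa = 0.01
print(sgd_regime(eta=1.0, B=16, P=16384, chi=1.0, kappa=0.01))
print(sgd_regime(eta=16.0, B=1024, P=16384, chi=1.0, kappa=0.01))
print(sgd_regime(eta=0.001, B=16, P=16384, chi=1.0, kappa=0.01))
```

The three example points land in the noise-dominated, first-step-dominated, and GD regions, respectively, mirroring the layout of the phase diagrams in Fig. 1.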

A. Performance in Large Step SGD.

The first training step leads to w₁,η = (η/B) Σ_{μ∈B₁} |x₁^μ|/√d and w⊥,η = (η/B) Σ_{μ∈B₁} y^μ x⊥^μ/√d, where B₁ is the first sampled batch. w⊥,η is a zero-mean random vector of norm ‖w⊥,η‖ = O(η/√B), while w₁,η = O(η). We can distinguish several regimes:

  • (i)

    If w₁,η |x₁^μ| ∼ η |x₁^μ| ≫ κ for all μ, then the margin κ can be neglected. From extreme value statistics, min_μ |x₁^μ| = O(P^{−1/(1+χ)}); thus this condition is saturated when η ∼ κ P^{1/(1+χ)}. It corresponds to the horizontal dashed line in the diagrams of Fig. 1.

  • (ii)

    Above this line, in the regime η ≫ κ P^{1/(1+χ)}, the fraction of unfitted points as well as the test error after one step are given by the angle of the predictor: λ_η = w₁,η/‖w⊥,η‖ ∼ O(√B). Following Eq. 10, n(w_η) ∼ (λ_η)^{−χ−1} ∼ B^{−(χ+1)/2}. If n(w_η) ≪ 1/P, or equivalently B ≫ P^{2/(1+χ)}, then with high probability all points are fitted after one step and the dynamics stops; the final test error is then of order ϵ ∼ B^{−(χ+1)/2}. Since B ≤ P, this condition can only occur for χ ≥ 1. Note that the error is then smaller than in the noise-dominated regime, where ϵ ∼ 1/P.

For the marginal case χ=1 in Fig. 2, we observe that the full batch case B = 16,384 reaches (nearly) zero training loss after the first training step. Correspondingly, in the phase diagram of Fig. 1A4, the lowest test error is achieved for full batch in the first-step-dominated regime.
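The first-step magnitudes quoted at the start of this section are easy to verify by Monte Carlo: at fixed η, w₁,η should be independent of B, while ‖w⊥,η‖ should shrink as 1/√B. A minimal check with Gaussian inputs (the χ = 0 case; sizes are illustrative):

```python
import numpy as np

def first_step(eta, B, d, rng):
    """One SGD step from w = 0: every point is unfitted, so the update is
    w_eta = (eta/B) sum_mu y^mu x^mu / sqrt(d). Returns (w1, ||w_perp||)."""
    x = rng.standard_normal((B, d))          # chi = 0 (Gaussian) inputs
    y = np.sign(x[:, 0])
    w = eta * (y[:, None] * x).sum(axis=0) / (B * np.sqrt(d))
    return w[0], np.linalg.norm(w[1:])

rng = np.random.default_rng(1)
d, eta, trials = 64, 4.0, 2000
stats = {}
for B in (16, 64, 256):
    samples = [first_step(eta, B, d, rng) for _ in range(trials)]
    w1_mean = float(np.mean([s[0] for s in samples]))
    wp_mean = float(np.mean([s[1] for s in samples]))
    stats[B] = (w1_mean, wp_mean)
    # w1 = O(eta), independent of B; ||w_perp|| * sqrt(B) should be constant
    print(B, round(w1_mean, 2), round(wp_mean * np.sqrt(B), 2))
```

The printed second column is flat in B while the rescaled third column is also flat, confirming w₁,η = O(η) and ‖w⊥,η‖ = O(η/√B).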

B. Notes on Variations of SGD.

Our analysis does not consider variations of SGD often used in practice, such as the addition of momentum, adaptive learning rates, or weight decay. A detailed extension of the present approach to these cases is an interesting direction left for future work. In SI Appendix, section 2, we provide evidence supporting that adding momentum is akin to rescaling the learning rate by a momentum-dependent factor. By contrast, adding weight decay in the noise-dominated regime is akin to stopping the dynamics early, and it therefore lowers performance in the case of the perceptron with no label noise. For large batch sizes, the first-step-dominated dynamics eventually converges to the noise-dominated one on a time scale depending on the weight-decay coefficient (SI Appendix, section 2.B).

3. Experiments for Deep Networks

We consider a 5-hidden-layer fully connected architecture with GELU activation functions (44), which classifies the parity MNIST dataset (even vs. odd digits), and a deep convolutional architecture (MobileNet), which classifies CIFAR10 images (animals vs. the rest). As before, training corresponds to minimizing the hinge loss with SGD at constant η until reaching zero training loss; see SI Appendix, section 1 for further details.

For deep neural networks, the alignment between the network output f(x) and the data labels y(x) ∈ {+1, −1} cannot simply be written as a projection w₁ of the weights along some direction. Instead, we define it as

⟨y(x)f(x)⟩_x = (1/|S_test|) Σ_{x^ν ∈ S_test} y(x^ν) f(x^ν), [15]

where the average is taken over the test set S_test. For the perceptron, w₁ and ⟨y(x)f(x)⟩_x are proportional.
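Eq. 15 is straightforward to implement for any predictor f; the sketch below checks it on a toy linear “student,” for which the alignment reduces to w₁ E|x₁|/√d, proportional to the overlap w₁ as stated above (all names and sizes are illustrative):

```python
import numpy as np

def alignment(f, X_test, y_test):
    """Alignment <y(x) f(x)>_x of Eq. 15: average of y(x) f(x) over the test set."""
    return float(np.mean(y_test * f(X_test)))

# toy check on a linear "student": for f(x) = w.x/sqrt(d), the alignment
# equals w1 * E|x1| / sqrt(d), i.e. it is proportional to the overlap w1
rng = np.random.default_rng(0)
d, n = 32, 50_000
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                        # teacher labels y(x) = sign(x1)
w = np.zeros(d)
w[0] = 3.0                                  # student aligned with the teacher
f = lambda X: X @ w / np.sqrt(d)
print(alignment(f, X, y))
```

For a deep network, `f` would instead be the trained model's scalar output on a batch of test inputs; the definition is otherwise unchanged.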

A. Small Margin κ.

An interesting parameter to vary is the margin κ, which is equivalent to changing the initialization scale of the network. For very small κ and gradient flow, tiny changes of the weights can fit the data, corresponding to the lazy regime where neural nets behave as kernel methods (45–47). However, cranking up the learning rate allows one to escape this regime (48), which is the case in our set-up (Fig. S3 in SI Appendix, section 3). Our central results are as follows:

  • (i)

    The phase diagram predicted for the perceptron holds true, as shown in Fig. 1B3 and C3, which represent the alignment in the (B, η) plane. We observe a noise-dominated SGD regime, where the alignment depends on the ratio T = η/B; a first-step-dominated SGD regime, where the alignment is constant at a given η; and a GD regime at small η, where the alignment depends on neither η nor B.

  • (ii)

    These three regimes also delineate different behaviors of the test error, as shown in Fig. 1B4 and C4.

  • (iii)

    Fig. 3, Inset confirms the prediction that in the first-step-dominated SGD regime, the alignment builds up in a few steps. By contrast, in the noise-dominated SGD regime, it builds up slowly over time.

  • (iv)

    The critical batch size B* separating these two regimes indeed depends on a power of the training set size P. This result is revealed in Fig. 4B, reporting the alignment at the end of training as a function of the batch size B. The Inset shows that the alignment depends on the batch size for small B, while it only depends on η for large B. The cross-over B* between the two regimes depends on P. It is estimated in the Main panel by rescaling the x-axis by P^{0.2} so as to align these cross-overs, indicating a dependence B* ∼ P^{0.2}. The same phenomenology is observed for the perceptron in Fig. 4A, with the relationship B* ∼ P^{1/(1+χ)} given by Eq. 14. We can use this measurement of B* in neural networks to estimate a value of χ for the corresponding dataset. From the data of Fig. 4B, we obtain χ_MNIST ≈ 4 for the MNIST dataset, while from SI Appendix, Fig. S8A, χ_CIFAR ≈ 1.5 for CIFAR10. These estimates are relatively close to the values 6 and 1.5, for MNIST and CIFAR10 respectively, independently measured in a study of kernel ridge regression (42). In that context, the value of χ directly affects the kernel eigenvalue spectrum and therefore the task difficulty. These estimates suggest that our toy model with χ > 0 captures some properties of image classification tasks.

B. Large Margin κ.

In the alternative case where the margin is large, the predictor has to grow in time, or “inflate,” in order to become of the order of the margin and fit the data (45, 47, 49). As a result, the weights and the magnitude of the gradients initially grow in time. In our set-up, this regime is obtained by choosing κ = 1, because at initialization the output function is small. Such an inflation is not captured by the perceptron model. As a consequence, reasoning on the magnitude of the first gradient step may not be appropriate, and finding (iii) above is not observed, as shown in SI Appendix, Fig. S7A. Yet, after inflation occurs, as for the perceptron, the output still needs to grow even further in the good direction to overcome the noise from SGD, the more so the larger the training set. This effect is observed both for CNNs and fully connected nets on different datasets and, in the latter case, strongly correlates with performance (9).

For fully connected nets learning MNIST, observations are indeed similar to the small-margin case: (i, ii) a phase diagram suggesting three regimes affecting performance is shown in SI Appendix, Fig. S5, and (iv) a critical batch size B* again appears to follow a power law of the training set size, as B* ∼ P^{0.4}, as shown in SI Appendix, Fig. S7B. Interestingly, using early stopping or not does not affect performance, as shown in SI Appendix, Fig. S5D. This is consistent with our analysis of weight changes, which for the perceptron are similar at early stopping and at the end of training. For CNNs learning CIFAR10, the picture is different. Although three learning regimes may be identifiable in SI Appendix, Fig. S6, most of the dependence of performance on η and B is gone when using early stopping, as shown in SI Appendix, Fig. S6D. This suggests that effects other than the growth of weights induced by SGD control performance in this example, as discussed below.

Conclusion

In deep nets, the effects of SGD on learning are known to depend on the size of the training set. We used a simple toy model, the perceptron, to explain these observations and relate them to the difficulty of the task. SGD noise increases the dependence of the predictor along incorrect input directions, a phenomenon that must be compensated by aligning more strongly toward the true function. As a result, alignment and weight changes depend on both T and P. If temperature is too small, this alignment is instead fixed by overcoming the margin and SGD is equivalent to GD. If the batch size is larger than a P-dependent B, the weight changes are instead governed by a the first few steps of gradient descent. As one would expect, the alignment magnitude correlates with performance. It was observed in several cases in ref. 9 and is also reflected by our observation that different alignment regimes correspond to different regimes of performance. In SI Appendix, section 3, we discuss why it is so in the simple example of a teacher perceptron learned by a multi-layer network. As shown in SI Appendix, Fig. S4, as weights align more strongly in the true direction as SGD noise increases, the network becomes less sensitive to irrelevant directions in input space, thus performing better. Yet, depending on the data structure, a strong alignment could be beneficial or not- a versatility of outcome that is observed (5, 9, 40, 50). Obviously, other effects of SGD independent on training set size may affect further performance, such as escaping saddles (1016), biasing the dynamics toward broader minima (1721, 43, 5058) or finding sparser solutions (22, 2426). We have shown examples where the alignment effects appear to dominate, and others where SGD instead is akin to a regularization similar to early stopping, which is the case for the CNN in SI Appendix, Fig. S6D—a situation predicted in some theoretical approaches, see e.g., ref. 24. 
Determining which effect of SGD most strongly affects performance, given the structure of the task and the architecture, is an important practical question for the future. Our work illustrates that subtle aspects enter this equation, such as the size of the training set or the density of points near the decision boundary.

Supplementary Material

Appendix 01 (PDF)

pnas.2316301121.sapp.pdf (834.3KB, pdf)

Acknowledgments

We thank F. Cagnetta, A. Favero, M. Geiger, B. Göransson, F. Krzakala, L. Petrini, U. Tomasini, and L. Zdeborova for discussion. M.W. acknowledges support from the Simons Foundation Grant (No. 454953 M.W.).

Author contributions

A.S. and M.W. designed research; A.S. performed research; A.S. and M.W. contributed new reagents/analytic tools; A.S. analyzed data; and A.S. and M.W. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

*Although the theory is valid for any χ > −1, estimates of χ in image datasets correspond to χ > 0 (42) (Discussion in Section 3).

A training point (x, y) has a nonzero hinge loss and contributes to n(w) if y f(x) < κ, that is |x₁| < (‖w⊥‖/w₁)(c + κ√d/‖w⊥‖), with c = −y w⊥·x⊥/‖w⊥‖, which becomes |x₁| < c λ⁻¹ for r = κ√d/‖w⊥‖ → 0. In the limit λ → ∞, the condition |x₁| < c λ⁻¹ corresponds to the scaling relationship n(w) ∼ ∫₀^{λ⁻¹} ρ(x₁) dx₁ ∼ λ^{−(χ+1)}, since ρ(x₁) ∼ x₁^χ for x₁ → 0.
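The scaling n(w) ∼ λ^{−(χ+1)} above can be checked numerically: sampling x₁ from a density ρ(x₁) ∝ x₁^χ on [0, 1] by inverse-CDF sampling, the fraction of points with |x₁| < λ⁻¹ should decay as λ^{−(χ+1)}. A minimal sketch, with an arbitrary illustrative value of χ:

```python
import numpy as np

rng = np.random.default_rng(0)
chi = 1.5          # illustrative exponent of rho(x1) ~ x1^chi near x1 = 0
n_samples = 10**6

# Inverse-CDF sampling from the normalized density rho(x1) = (chi+1) x1^chi on [0, 1]:
# the CDF is F(x) = x^(chi+1), so x = u^(1/(chi+1)) with u uniform in [0, 1].
x1 = rng.random(n_samples) ** (1.0 / (chi + 1.0))

for lam in [10.0, 20.0, 40.0]:
    frac = np.mean(x1 < 1.0 / lam)       # empirical fraction below the margin scale
    predicted = lam ** (-(chi + 1.0))    # predicted scaling lambda^{-(chi+1)}
    print(f"lambda = {lam:5.1f}   empirical = {frac:.2e}   predicted = {predicted:.2e}")
```

The empirical fractions should track the predicted power law up to sampling noise, which grows at large λ as fewer points fall below λ⁻¹.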

We chose the NTK initialization (45), for which the output at initialization behaves as 1/h where h is the width. Here h = 128 for both the fully connected and the CNN architectures (SI Appendix, section 1).

Contributor Information

Antonio Sclocchi, Email: antonio.sclocchi@epfl.ch.

Matthieu Wyart, Email: matthieu.wyart@epfl.ch.

Data, Materials, and Software Availability

Code for the experiments is available at https://github.com/pcsl-epfl/regimes_of_SGD (59).

References

1. Robbins H., Monro S., A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951).
2. LeCun Y., Bottou L., Orr G. B., Müller K. R., Efficient Backprop in Neural Networks: Tricks of the Trade (Springer, 2002), pp. 9–50.
3. L. Bottou, “Large-scale machine learning with stochastic gradient descent” in Proceedings of COMPSTAT 2010: 19th International Conference on Computational Statistics, Paris, France, August 22–27, 2010, Keynote, Invited and Contributed Papers (Springer, 2010), pp. 177–186.
4. S. Jastrzebski et al., Three factors influencing minima in SGD. arXiv [Preprint] (2017). http://arxiv.org/abs/1711.04623 (Accessed 8 November 2022).
5. Shallue C. J., et al., Measuring the effects of data parallelism on neural network training. J. Mach. Learn. Res. 20, 1–49 (2019).
6. S. Smith, E. Elsen, S. De, “On the generalization benefit of noise in stochastic gradient descent” in International Conference on Machine Learning, H. Daumé III, A. Singh, Eds. (PMLR, 2020), pp. 9058–9067.
7. P. Goyal et al., Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv [Preprint] (2017). http://arxiv.org/abs/1706.02677 (Accessed 21 November 2022).
8. S. McCandlish, J. Kaplan, D. Amodei, OpenAI Dota Team, An empirical model of large-batch training. arXiv [Preprint] (2018). http://arxiv.org/abs/1812.06162 (Accessed 21 November 2022).
9. A. Sclocchi, M. Geiger, M. Wyart, “Dissecting the effects of SGD noise in distinct regimes of deep learning” in Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, A. Krause et al., Eds. (PMLR, 2023), vol. 202, pp. 30381–30405.
10. R. Ge, F. Huang, C. Jin, Y. Yuan, “Escaping from saddle points – online stochastic gradient for tensor decomposition” in Proceedings of The 28th Conference on Learning Theory, Proceedings of Machine Learning Research, Paris, France, P. Grünwald, E. Hazan, S. Kale, Eds. (PMLR, 2015), vol. 40, pp. 797–842.
11. C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, M. I. Jordan, “How to escape saddle points efficiently” in Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, D. Precup, Y. W. Teh, Eds. (PMLR, 2017), vol. 70, pp. 1724–1732.
12. H. Daneshmand, J. Kohler, A. Lucchi, T. Hofmann, “Escaping saddles with stochastic gradients” in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, J. Dy, A. Krause, Eds. (PMLR, 2018), vol. 80, pp. 1155–1164.
13. Hu W., Li C. J., Li L., Liu J. G., On the diffusion approximation of nonconvex stochastic gradient descent. Ann. Math. Sci. Appl. 4, 3–32 (2019).
14. Z. Zhu, J. Wu, B. Yu, L. Wu, J. Ma, “The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects” in International Conference on Machine Learning, K. Chaudhuri, R. Salakhutdinov, Eds. (PMLR, 2019), pp. 7654–7663.
15. G. Ben Arous, R. Gheissari, A. Jagannath, High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. Adv. Neural. Inf. Process. Syst. 35, 25349–25362 (2022).
16. L. Arnaboldi, L. Stephan, F. Krzakala, B. Loureiro, From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. arXiv [Preprint] (2023). http://arxiv.org/abs/2302.05882 (Accessed 15 August 2023).
17. Zhang Y., Saxe A. M., Advani M. S., Lee A. A., Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. Mol. Phys. 116, 3214–3223 (2018).
18. Wu L., et al., How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. Adv. Neural. Inf. Process. Syst. 31 (2018).
19. Z. Xie, I. Sato, M. Sugiyama, “A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima” in International Conference on Learning Representations (OpenReview.net, 2021).
20. Feng Y., Tu Y., The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima. Proc. Natl. Acad. Sci. U.S.A. 118, e2015617118 (2021).
21. Yang N., Tang C., Tu Y., Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Phys. Rev. Lett. 130, 237101 (2023).
22. Pesme S., Pillaud-Vivien L., Flammarion N., Implicit bias of SGD for diagonal linear networks: A provable benefit of stochasticity. Adv. Neural. Inf. Process. Syst. 34, 29218–29230 (2021).
23. L. P. Vivien, J. Reygner, N. Flammarion, “Label noise (stochastic) gradient descent implicitly solves the lasso for quadratic parametrisation” in Conference on Learning Theory, P.-L. Loh, M. Raginsky, Eds. (PMLR, 2022), pp. 2127–2159.
24. F. Chen, D. Kunin, A. Yamamura, S. Ganguli, Stochastic collapse: How gradient noise attracts SGD dynamics towards simpler subnetworks. arXiv [Preprint] (2023). http://arxiv.org/abs/2306.04251 (Accessed 10 July 2023).
25. M. Andriushchenko, A. V. Varre, L. Pillaud-Vivien, N. Flammarion, “SGD with large step sizes learns sparse features” in Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, A. Krause et al., Eds. (PMLR, 2023), vol. 202, pp. 903–925.
26. N. Ghosh, S. Frei, W. Ha, B. Yu, The effect of SGD batch size on autoencoder learning: Sparsity, sharpness, and feature learning. arXiv [Preprint] (2023). http://arxiv.org/abs/2308.03215 (Accessed 30 August 2023).
27. Spigler S., et al., A jamming transition from under-to over-parametrization affects generalization in deep learning. J. Phys. A: Math. Theor. 52, 474001 (2019).
28. Engel A., Van den Broeck C., Statistical Mechanics of Learning (Cambridge University Press, 2001).
29. Saad D., Solla S., Dynamics of on-line gradient descent learning for multilayer neural networks. Adv. Neural. Inf. Process. Syst. 8 (1995).
30. Saad D., Solla S. A., On-line learning in soft committee machines. Phys. Rev. E 52, 4225 (1995).
31. Goldt S., Advani M., Saxe A. M., Krzakala F., Zdeborová L., Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Adv. Neural. Inf. Process. Syst. 32 (2019).
32. Veiga R., Stephan L., Loureiro B., Krzakala F., Zdeborová L., Phase diagram of stochastic gradient descent in high-dimensional two-layer neural networks. Adv. Neural. Inf. Process. Syst. 35, 23244–23255 (2022).
33. Mignacco F., Krzakala F., Urbani P., Zdeborová L., Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification. Adv. Neural. Inf. Process. Syst. 33, 9540–9550 (2020).
34. Mignacco F., Urbani P., The effective noise of stochastic gradient descent. J. Stat. Mech: Theory Exp. 2022, 083405 (2022).
35. Li Q., Tai C., Weinan E., Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. J. Mach. Learn. Res. 20, 1474–1520 (2019).
36. Li Z., Malladi S., Arora S., On the validity of modeling SGD with stochastic differential equations (SDEs). Adv. Neural. Inf. Process. Syst. 34, 12712–12725 (2021).
37. M. Welling, Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), L. Getoor, T. Scheffer, Eds. (Omnipress, 2011), pp. 681–688.
38. A. Neelakantan et al., Adding gradient noise improves learning for very deep networks. arXiv [Preprint] (2015). http://arxiv.org/abs/1511.06807 (Accessed 13 June 2023).
39. M. Raginsky, A. Rakhlin, M. Telgarsky, “Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis” in Conference on Learning Theory, S. Kale, O. Shamir, Eds. (PMLR, 2017), pp. 1674–1703.
40. Hoffer E., Hubara I., Soudry D., Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. Adv. Neural. Inf. Process. Syst. 30 (2017).
41. S. L. Smith, Q. V. Le, “A Bayesian perspective on generalization and stochastic gradient descent” in International Conference on Learning Representations (OpenReview.net, 2018).
42. U. M. Tomasini, A. Sclocchi, M. Wyart, “Failure and success of the spectral bias prediction for Laplace kernel ridge regression” in International Conference on Machine Learning, K. Chaudhuri et al., Eds. (PMLR, 2022), pp. 21548–21583.
43. P. Chaudhari, S. Soatto, “Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks” in 2018 Information Theory and Applications Workshop (ITA) (IEEE, 2018), pp. 1–10.
44. D. Hendrycks, K. Gimpel, Gaussian error linear units (GELUs). arXiv [Preprint] (2016). http://arxiv.org/abs/1606.08415 (Accessed 30 August 2023).
45. Jacot A., Gabriel F., Hongler C., Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural. Inf. Process. Syst. 31 (2018).
46. Chizat L., Oyallon E., Bach F., On lazy training in differentiable programming. Adv. Neural. Inf. Process. Syst. 32 (2019).
47. Geiger M., Spigler S., Jacot A., Wyart M., Disentangling feature and lazy training in deep neural networks. J. Stat. Mech: Theory Exp. 2020, 113301 (2020).
48. A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, G. Gur-Ari, The large learning rate phase of deep learning: The catapult mechanism. arXiv [Preprint] (2020). http://arxiv.org/abs/2003.02218 (Accessed 13 June 2023).
49. A. Jacot, F. Ged, B. Şimşek, C. Hongler, F. Gabriel, Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv [Preprint] (2021). http://arxiv.org/abs/2106.15933 (Accessed 21 November 2022).
50. M. Andriushchenko, F. Croce, M. Müller, M. Hein, N. Flammarion, A modern look at the relationship between sharpness and generalization. arXiv [Preprint] (2023). http://arxiv.org/abs/2302.07011 (Accessed 15 August 2023).
51. G. E. Hinton, D. Van Camp, “Keeping the neural networks simple by minimizing the description length of the weights” in Proceedings of the Sixth Annual Conference on Computational Learning Theory, L. Pitt, Ed. (Association for Computing Machinery, 1993), pp. 5–13.
52. Hochreiter S., Schmidhuber J., Flat minima. Neural Comput. 9, 1–42 (1997).
53. Baldassi C., et al., Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proc. Natl. Acad. Sci. U.S.A. 113, E7655–E7662 (2016).
54. N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang, On large-batch training for deep learning: Generalization gap and sharp minima. arXiv [Preprint] (2016). http://arxiv.org/abs/1609.04836 (Accessed 8 November 2022).
55. Neyshabur B., Bhojanapalli S., McAllester D., Srebro N., Exploring generalization in deep learning. Adv. Neural. Inf. Process. Syst. 30 (2017).
56. L. Wu et al., “Towards understanding generalization of deep learning: Perspective of loss landscapes” in International Conference of Machine Learning Workshop, D. Precup, Y. W. Teh, Eds. (PMLR, 2017).
57. Chaudhari P., et al., Entropy-SGD: Biasing gradient descent into wide valleys. J. Stat. Mech: Theory Exp. 2019, 124018 (2019).
58. Baldassi C., Pittorino F., Zecchina R., Shaping the learning landscape in neural networks around wide flat minima. Proc. Natl. Acad. Sci. U.S.A. 117, 161–170 (2020).
59. A. Sclocchi, regimes of SGD. GitHub. https://github.com/pcsl-epfl/regimes_of_SGD. Deposited 1 September 2023.

