Significance
The population loss of trained deep neural networks has been empirically observed to improve as a power law in a variety of large models and datasets. We investigate the origins of such “scaling laws” and provide a taxonomy for different scaling regimes. Our findings are based on derivations in linear random feature models—which, in addition to being a simple yet fruitful model, also describe the wide-network limit of deep neural networks. We further formulate and verify aspects of scaling based on smoothness in interpolating a data manifold. We support our theory with empirical results in realistic settings. Our work provides insights into scaling laws and bridges the large gap between theory and experiment in modern deep learning.
Keywords: deep neural networks, machine learning, statistical physics
Abstract
The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origin and relationships between scaling exponents.
For a large variety of models and datasets, neural network performance has been empirically observed to scale as a power law with model size and dataset size (1–4). These exponents determine how quickly performance, as measured by the population loss, improves with more data and larger models. We would like to understand why these power laws emerge. For example, what features of the data and models determine the values of the power-law exponents? Is there a taxonomy behind scaling regimes, where regimes are governed by different underlying mechanisms? Are dataset and model size scaling exponents related in any way? And finally, which aspects of scaling behavior might exhibit universal signatures, and which aspects are strongly dependent on the “microscopic” aspects of the problem? A theoretically and empirically grounded understanding of these questions could provide guidance for machine learning in the modern era of large models and training data.
In this work, we present a theoretical framework for understanding scaling laws in trained deep neural networks. We identify four related scaling regimes with respect to the number of model parameters P and the dataset size D. With respect to each of D and P, we define both a “variance-limited” regime and a “resolution-limited” regime.
1. Scaling Laws for Deep Neural Networks
1.1. Variance-Limited Regime.
In the limit of infinite data or an arbitrarily wide model, some aspects of neural network training simplify. Specifically, if we fix one of D or P and study scaling with respect to the other parameter as it becomes arbitrarily large, then the difference between the finite test loss and its limiting value scales as 1/x, i.e., as a power law with exponent 1, with x = D or x = width in deep networks and x = D or x = P in linear models.
1.2. Resolution-Limited Regime.
In this regime, one of D or P is effectively infinite, and we study scaling as the other parameter increases. In this case, a variety of works have empirically observed power-law scalings L ∝ D^{−α_D} or L ∝ P^{−α_P}, typically with 0 < α < 1 for both P and D. We derive exponents in this regime precisely in the setting of random feature models (cf. the next section). Empirically, we find that our theoretical predictions for exponents hold in pretrained, fine-tuned models even though these lie outside our theoretical setting.
For more general nonlinear models, we propose a refinement of naive bounds into estimates via expansions that hold asymptotically. These rely on the idea that additional data (in the infinite model-size limit) or added model parameters (in the infinite data limit) are used by the model to carve up the data manifold into smaller components. For a smooth manifold, loss, and network, the test loss will depend on the linear size of a subregion, while it is the d-dimensional subregion volume that scales inversely with P or D, giving rise to α_D, α_P ∝ 1/d.* To test this empirically, we make measurements of the resolution-limited exponents in neural networks and of the intrinsic dimension of the data manifold, shown in Fig. 1B.
Fig. 1.

(A) Four scaling regimes. Here, we exhibit the four regimes we focus on in this work. (Top-Left and Bottom-Right) Variance-limited scaling of underparameterized models with dataset size and of overparameterized models with number of parameters (width) exhibits universal exponents (α_D = α_W = 1), independent of the architecture or underlying dataset. (Top-Right and Bottom-Left) Resolution-limited overparameterized models with dataset size or underparameterized models with model size exhibit scaling with exponents that depend on the details of the data distribution. These four regimes are also found in random feature (Fig. 2A) and pretrained models (SI Appendix). (B) Resolution-limited models interpolate the data manifold. (Top) Linear interpolation between two training points in a four-dimensional input space. We show a teacher model and four student models, each trained on a different-sized dataset. In all cases, teacher and student approximately agree on the training endpoints, but as the training set size increases they increasingly match everywhere. (Bottom) We show the measured exponent α_D versus the data manifold dimension d (input dimension for teacher-student models, intrinsic dimension for standard datasets). We find that the teacher-student models follow the 4/d prediction (dark dashed line), while the relationship for a four-layer convolutional neural network (solid) and Wide ResNet architecture (hollow) on standard datasets is less clear.
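The intrinsic dimensions used for the standard datasets in Fig. 1B require an estimator of the data manifold dimension, such as the maximum likelihood estimator of ref. 47. The sketch below is a minimal numpy implementation of that estimator (with the common inverse-averaging variant), applied to synthetic data whose true dimension is known; the function name, the choice of k = 20 neighbors, and the synthetic test are illustrative assumptions rather than the exact pipeline used for the figure.

```python
import numpy as np
from scipy.spatial import cKDTree

def mle_intrinsic_dimension(x, k=20):
    """Levina-Bickel MLE of intrinsic dimension from k-nearest-neighbor distances."""
    tree = cKDTree(x)
    dists, _ = tree.query(x, k=k + 1)          # distances to k nearest neighbors + self
    dists = dists[:, 1:]                       # drop the zero self-distance
    # per-point inverse-dimension estimate: mean_j log(T_k / T_j), j < k
    log_ratios = np.log(dists[:, -1][:, None] / dists[:, :-1])
    inv_d = log_ratios.mean(axis=1)
    return 1.0 / inv_d.mean()                  # average the inverse estimates, then invert

rng = np.random.default_rng(0)

# data on a d-dimensional flat manifold embedded in 20 ambient dimensions
for d in [2, 4, 8]:
    latent = rng.uniform(size=(4000, d))
    embed = rng.normal(size=(d, 20)) / np.sqrt(d)
    x = latent @ embed
    print(f"true d = {d}, estimated d = {mle_intrinsic_dimension(x):.2f}")
```

Such estimators are only approximate, particularly at larger d, which is one reason we treat the intrinsic dimensions of standard datasets as imperfect proxies (Section 4.1).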
1.3. Explicit Derivation.
We derive the scaling laws for these four regimes explicitly in the setting of random feature teacher-student models, which also applies to neural networks in the large width limit. This setting allows us to solve for the test error directly in terms of the feature covariance (kernel). The scaling of the test loss then follows from the asymptotic decay of the spectrum of the covariance matrix. For generic continuous kernels on a d-dimensional manifold, we can further relate this to the dimension of the data manifold.
1.4. Summary of Contributions.
We propose four scaling regimes for neural networks. The variance-limited and resolution-limited regimes originate from different mechanisms, which we identify. To our knowledge, this categorization has not been previously exhibited. We provide empirical support for all four regimes in deep networks on standard datasets.
We derive the variance-limited regime under simple yet general assumptions (Theorem 1).
We present a hypothesis for resolution-limited scaling in general nonlinear models through a refinement of naive bounds (Theorems 2 and 3). We empirically test the dependence of the resulting estimates on the intrinsic dimension of the data manifold for deep networks on standard datasets (Fig. 1B).
In the setting of random feature teacher-student networks, we derive both variance-limited and resolution-limited scaling exponents exactly. In the latter case, we relate this to the spectral decay of kernels. We identify a duality that exists between model and dataset size scaling.
We empirically investigate predictions from the random features setting in pretrained, fine-tuned models on standard datasets and find they give excellent agreement.
We study the dependence of the scaling exponent on changes in architecture and data, finding that i) changing the input distribution via switching datasets and ii) the addition of noise have strong effects on the exponent, while iii) changing the target task via superclassing does not.
1.5. Related Works.
There have been a number of recent works demonstrating empirical scaling laws (1–5) in deep neural networks, including scaling laws with model size, dataset size, compute, and other observables such as mutual information and pruning. Some precursors (6, 7) can be found in earlier literature. Recently, scaling laws have also played a significant role in motivating work on the largest models that have yet been developed (8, 9).
There has been comparatively little work on theoretical ideas (10, 11) that match and explain empirical findings in generic deep neural networks. In the particular case of large width, deep neural networks behave as random feature models (12–17), and known results on the loss scaling of kernel methods can be applied (18, 19). Although not present in its original version, later versions of ref. 19 analyze resolution-limited dataset size scaling for power-law spectra. The decay of test error in ridge regression under certain settings has been studied in prior work, including refs. 20–22.
During the completion of this work, Hutter (23) presented a specific solvable model of learning exhibiting nontrivial power-law scaling for power-law (Zipf) distributed features. This does not directly relate to the settings studied in this work or present bounds that supersede our results. Concurrent to our work, Bisla et al. (11) presented a derivation of the resolution-limited scaling with dataset size, also stemming from nearest-neighbor distance scaling on data manifolds. However, they do not discuss requirements on model versus dataset size or how this scaling behavior fits into other asymptotic scaling regimes. A few recent works, appearing after the completion of this manuscript, also investigate the scaling of test error in related settings. Cui et al. (24) study the decay of test error with dataset size for kernel regression in a high-dimensional limit with Gaussian design. Maloney et al. (25) examine further a teacher-student framework similar to ours, deriving joint scaling laws using techniques from random matrix theory. Finally, Wei et al. (26) theoretically examine kernel regression through classical statistical estimators and random matrix theory.
In the variance-limited regime, scaling laws in the context of random feature models (27–29), deep linear models (30, 31), one-hidden-layer networks (32–34), and wide neural networks treated as Gaussian processes or trained in the NTK regime (16, 17, 35, 36) have been studied. In particular, this behavior was used in ref. 2 to motivate a particular ansatz for simultaneous scaling with data and model size. The resolution-limited analysis can perhaps be viewed as an attempt to quantify the ideal-world generalization error of ref. 37.
This work makes use of classic results connecting the spectrum of a smooth kernel to the geometry it is defined over (38–41) and on the scaling of iteratively refined approximations to smooth manifolds (42–44).
2. Four Scaling Regimes
Throughout this work, we will be interested in how the average test loss L depends on the dataset size D and the number of model parameters P. Unless otherwise noted, L denotes the test loss averaged over initialization of the parameters and draws of a size-D training set. Some of our results only pertain directly to the scaling with width w, but we expect many of the intuitions apply more generally. We use the notation α_D, α_P, and α_W to indicate scaling exponents with respect to dataset size, parameter count, and width. All proofs appear in SI Appendix.
2.1. Variance-Limited Exponents.
In the limit of large D, the outputs of an appropriately trained network approach a limiting form, with corrections which scale as 1/D. Similarly, recent work shows that wide networks have a smooth large-width limit (15), where fluctuations in the output have variance scaling as 1/w. If the loss is sufficiently smooth, then its value will approach the asymptotic loss with corrections proportional to the variance (1/D or 1/w). In Theorem 1 we present sufficient conditions on the loss to ensure this variance-dominated scaling. We note that these conditions are satisfied by mean squared error and cross-entropy loss, though we conjecture the result holds even more generally.
Theorem 1.
Let L(f) = E_x[ℓ(f(x))] be the test loss as a function of the network output, f(x), and let f_T be the network output after T training steps, thought of as a random variable over weight initialization, draws of the training dataset, and optimization seed. Further, let f_T be concentrating, with E[(f_T − E[f_T])^2] = O(1/x) for a scaling variable x (here x = D or x = w). If ℓ is a finite-degree polynomial, or has bounded second derivative, or is 2-Hölder, then E[L(f_T)] − L(E[f_T]) = O(1/x).
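The content of Theorem 1 is essentially a second-order Taylor expansion: for a sufficiently smooth ℓ, the gap between E[L(f_T)] and L(E[f_T]) is controlled by the variance of the output. The sketch below checks this numerically for a quadratic and a logistic loss, modeling the output (purely for illustration) as a Gaussian scalar whose variance decays as 1/x; the Gaussian model, the mean value, and the use of Gauss-Hermite quadrature are our own assumptions, not part of the theorem.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Gauss-Hermite quadrature: E[g(mu + sigma Z)] ~= sum_i w_i g(mu + sqrt(2) sigma t_i) / sqrt(pi)
nodes, weights = hermgauss(80)

def gaussian_expectation(g, mu, sigma):
    return np.sum(weights * g(mu + np.sqrt(2.0) * sigma * nodes)) / np.sqrt(np.pi)

losses = {
    "quadratic": lambda f: (f - 1.0) ** 2,           # exact gap equals Var(f)
    "logistic": lambda f: np.log1p(np.exp(-f)),       # smooth, bounded second derivative
}

mu = 0.3                                              # mean output (arbitrary illustration)
xs = np.logspace(1, 4, 10)                            # stand-in for D or width
for name, ell in losses.items():
    gaps = []
    for x in xs:
        sigma = 1.0 / np.sqrt(x)                      # Var(f) = 1/x, as in the hypothesis
        gaps.append(gaussian_expectation(ell, mu, sigma) - ell(mu))
    slope = np.polyfit(np.log(xs), np.log(gaps), 1)[0]
    print(f"{name:9s}: fitted exponent {slope:+.2f}  (expected: -1)")
```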
2.1.1. Dataset scaling.
Consider a neural network f_θ and its associated training loss L_train(θ). For every value of the weights, the training loss, thought of as a random variable over draws of a training set of size D, concentrates around the population loss, with a variance which scales as 1/D. As a result, if the optimization procedure is sufficiently smooth, the trained weights and network output (and their higher moments) will approach their infinite-D values, with E_D[(f_D(x) − E_D[f_D(x)])^2] = O(1/D). Here, the subscript D on the expectation indicates an average over draws of the training set. This scaling, together with Theorem 1, gives the variance-limited scaling of loss with dataset size.
This concentration result with respect to dataset size has appeared for linear models in ref. 27 and for single hidden layer networks with high-dimensional input data in refs. 32–34. In SI Appendix, we prove this for GD and SGD with polynomial loss as well as present informal arguments more generally. Additionally, we present examples violating the smoothness assumption and exhibiting different scaling.
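A minimal sketch of the variance-limited dataset regime, using ordinary least squares with explicit label noise rather than a deep network: the test loss exceeds its infinite-data value (the noise floor) by an amount that scales as 1/D. All sizes and noise levels below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, noise = 10, 0.5                                    # parameter count and label-noise scale
beta = rng.normal(size=p)
x_test = rng.normal(size=(20000, p))
y_test_clean = x_test @ beta

def excess_test_loss(D, trials=50):
    """Average test MSE minus its D -> infinity limit (the irreducible noise variance)."""
    excesses = []
    for _ in range(trials):
        x = rng.normal(size=(D, p))
        y = x @ beta + noise * rng.normal(size=D)
        beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
        # test loss on noisy labels is this estimation error plus noise**2,
        # so the excess over the infinite-data loss is the estimation error itself
        excesses.append(np.mean((x_test @ beta_hat - y_test_clean) ** 2))
    return np.mean(excesses)

Ds = np.array([50, 100, 200, 400, 800, 1600, 3200])
excess = np.array([excess_test_loss(D) for D in Ds])
slope = np.polyfit(np.log(Ds), np.log(excess), 1)[0]
print("excess-loss exponent:", round(slope, 2), "(variance-limited prediction: -1)")
```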
2.1.2. Large width scaling.
We can make a very similar argument in the large width limit. It has been shown that the predictions from an infinitely wide network, either under Bayesian inference (12, 13) or when trained via gradient descent (15, 16), approach a limiting distribution at large width equivalent to a linear model with random features. Furthermore, corrections to the infinite-width behavior are controlled by the variance of the full model around the linear model predictions. This variance (and higher moments) has been shown to scale as 1/w (17, 35, 45). Theorem 1 then implies the loss will differ from its limit by a term proportional to 1/w.
We note that there has also been work studying the combined large depth and large width limit, where ref. 46 found a well-defined infinite size limit with controlled fluctuations in randomly initialized deep neural networks. In any such context where the trained model predictions concentrate, we expect the loss to scale with the variance of the model output. In the case of linear models, studied below, the variance is 1/P rather than 1/w, and we see the associated 1/P scaling in this case.
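The quantity whose fluctuations matter in this linear-model picture is the empirical kernel built from P random features, which concentrates around its infinite-feature limit with variance proportional to 1/P. The sketch below verifies this for random ReLU features evaluated on a fixed pair of inputs; the feature distribution and normalization are illustrative assumptions, and this is a statement about random feature kernels rather than about trained finite-width networks.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
x1, x2 = rng.normal(size=d), rng.normal(size=d)

def empirical_kernel(P):
    """K_P(x1, x2) = (1/P) sum_i relu(w_i . x1) relu(w_i . x2), with w_i ~ N(0, I/d)."""
    w = rng.normal(size=(P, d)) / np.sqrt(d)
    return np.mean(np.maximum(w @ x1, 0.0) * np.maximum(w @ x2, 0.0))

Ps = np.array([32, 64, 128, 256, 512, 1024, 2048])
variances = np.array([np.var([empirical_kernel(P) for _ in range(2000)]) for P in Ps])
slope = np.polyfit(np.log(Ps), np.log(variances), 1)[0]
print("kernel-fluctuation variance exponent:", round(slope, 2), "(prediction: -1)")
```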
2.2. Resolution-Limited Exponents.
In this section, we consider training and test data drawn uniformly from a compact d-dimensional manifold, M, and targets given by some smooth function F(x) on this manifold.
2.2.1. Overparameterized dataset scaling.
Consider the double limit of an overparameterized model with a large training set, P ≫ D ≫ 1. We further consider well-trained models, i.e., models that interpolate all training data. The goal is to understand L(D) ≡ lim_{P→∞} L(D, P). If we assume that the learned model f is sufficiently smooth, then the dependence of the loss on D can be bounded in terms of the dimension d of the data manifold M.
Informally, if our train and test data are drawn i.i.d. from the same manifold, then the distance from a test point to the closest training point decreases as we add more training data. In particular, this distance scales as D^{−1/d} (47). Furthermore, if f and F are both sufficiently smooth, they cannot differ too much over this distance. If, in addition, the loss function ℓ is a smooth function vanishing when f = F, we have L = O(D^{−1/d}). This is summarized in the following theorem.
Theorem 2.
Let F, f, and ℓ be Lipschitz with constants K_F, K_f, and K_ℓ. Further, let D_train be a training dataset of size D sampled i.i.d. from M, and let f(x) = F(x) for all x ∈ D_train. Then L = O(D^{−1/d}), with the implicit constant depending on K_F, K_f, and K_ℓ.
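The D^{−1/d} nearest-neighbor distance underlying Theorem 2 can be checked directly by Monte Carlo. The sketch below samples training sets on a d-dimensional unit cube, a stand-in for the data manifold, and fits the decay of the mean distance from held-out points to their nearest training point; the cube geometry, sizes, and seed are arbitrary choices.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
Ds = np.array([250, 500, 1000, 2000, 4000, 8000])

for d in [2, 4, 8]:
    test = rng.uniform(size=(500, d))
    mean_nn = []
    for D in Ds:
        train = rng.uniform(size=(D, d))
        dist, _ = cKDTree(train).query(test)      # distance to nearest training point
        mean_nn.append(dist.mean())
    slope = np.polyfit(np.log(Ds), np.log(mean_nn), 1)[0]
    print(f"d = {d}: fitted exponent {slope:+.3f}   (prediction -1/d = {-1/d:+.3f})")
```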
2.2.2. Underparameterized parameter scaling.
We will again assume that F varies smoothly on an underlying compact d-dimensional manifold M. We can obtain a bound on L(P) ≡ lim_{D→∞} L(D, P) if we imagine that f approximates F as a piecewise function with roughly P regions (see ref. 10). Here, we instead make use of the argument from the overparameterized, resolution-limited regime above. If we construct a sufficiently smooth estimator for F by interpolating among P randomly chosen points from the (arbitrarily large) training set, then by the argument above the loss will be bounded by O(P^{−1/d}).
Theorem 3.
Let F, f, and ℓ be Lipschitz with constants K_F, K_f, and K_ℓ. Further, let f(x) = F(x) for P points sampled i.i.d. from M. Then L = O(P^{−1/d}), with the implicit constant depending on K_F, K_f, and K_ℓ.
2.2.3. From bounds to estimates.
Theorems 2 and 3 are phrased as bounds, but we expect the stronger statement that these bounds also generically serve as estimates, so that, e.g., the loss itself scales as a power of D^{−1/d} at large D, and similarly for parameter scaling. If we assume that F and f are analytic functions on M and that the loss function is analytic in f − F and minimized at f = F, then the loss at a given test input, x, can be expanded around the nearest training point, x̂, as L(x) = Σ_{n ≥ n_min} c_n |x − x̂|^n,† where the first term is of finite order n_min ≥ 1 because the loss vanishes at the training point. As the typical distance between nearest-neighbor points scales as D^{−1/d} on a d-dimensional manifold (an observation also made in ref. 18), the loss will be dominated by the leading term, L ∝ D^{−n_min/d}, at large D. Note that if the model provides an accurate piecewise linear approximation, we will generically find α_D = 4/d.
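The role of the leading order n_min is easiest to see in one dimension, where a piecewise linear interpolant of a smooth target has pointwise error of order h^2 on a mesh of spacing h ∼ 1/D, so the squared-error loss scales as D^{−4}, i.e., α_D = 4/d with d = 1. The sketch below checks this with np.interp on an arbitrary smooth target; an evenly spaced grid is used instead of random samples purely to keep the estimate clean.

```python
import numpy as np

target = lambda t: np.sin(2 * np.pi * t) + 0.3 * np.cos(5 * t)     # any smooth 1-d "teacher"
t_test = np.linspace(0.0, 1.0, 20001)

Ds, mses = np.array([8, 16, 32, 64, 128, 256]), []
for D in Ds:
    t_train = np.linspace(0.0, 1.0, D)                    # mesh spacing h ~ 1/D
    f_hat = np.interp(t_test, t_train, target(t_train))   # piecewise linear "student"
    mses.append(np.mean((f_hat - target(t_test)) ** 2))

slope = np.polyfit(np.log(Ds), np.log(mses), 1)[0]
print("MSE exponent:", round(slope, 2), "(prediction: -4 for d = 1)")
```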
2.3. Explicit Realization in Linear Random Feature Models.
In the preceding sections, we have conjectured typical-case scaling relations for a model’s test loss. We have further given intuitive arguments for this behavior which relied on smoothness assumptions on the loss and training procedure. In this section, we provide a concrete realization of all four scaling regimes within the context of linear models constructed from random features. Of particular interest is the resolution-limited regime, where the scaling of the loss is a consequence of the linear model kernel spectrum—the scaling of overparameterized models with dataset size and of underparameterized models with parameters follows from a classic result, originally due to ref. 38, bounding the spectrum of sufficiently smooth kernel functions in terms of the dimension of the manifold they act on.
Linear predictors serve as a model system for learning. Such models are used frequently in practice when more expressive models are unnecessary or infeasible (48–50) and also serve as an instructive test bed to study training dynamics (28, 31, 51). Furthermore, in the large width limit, deep neural networks behave as Gaussian processes (12–14, 52–54) and in the low-learning rate regime of gradient-descent optimization (16, 55, 56), deep neural networks behave as a particular class of linear models (15, 16, 57). Hence, linear predictors constructed from random features provide an accurate description of deep neural networks in the large width limit.
Here, we discuss linear models in general terms, though the results immediately hold for the special cases of wide, deep neural networks. We focus on teacher-student models, in which the teacher generates samples from which the student model learns. We will assume student weights initialized to zero and trained with mean squared error (MSE) loss to their global optimum.
We consider a linear teacher F and student f,
F(x) = Σ_M ω_M e_M(x),   f(x) = Σ_{μ=1}^{P} θ_μ ê_μ(x).   [1]
Here, the e_M(x) are a (potentially infinite) collection of features. The teacher weights, ω_M, are sampled from a normal distribution and are averaged over in the test loss. The student has learnable parameters θ_μ and is built out of a subset of the teacher features. To vary the number of parameters in this simple model, we construct P features, ê_μ(x) = Σ_M P_{μM} e_M(x), by introducing a projector P_{μM} onto a P-dimensional subspace of the teacher features. We train by sampling a training set of size D and minimizing the MSE loss, L_train = (1/D) Σ_{a=1}^{D} (f(x_a) − F(x_a))^2. We are interested in the test loss averaged over draws of our teacher weights and training dataset.
The infinite data test loss, L(P) ≡ lim_{D→∞} L(D, P), takes the form,
L(P) = Tr[ C − C P^T (P C P^T)^{−1} P C ].   [2]
Here, we have introduced the feature-feature second moment matrix, C_{MN} ≡ E_x[e_M(x) e_N(x)]. If the teacher and student features had the same span, this loss would vanish, but due to the mismatch, the loss is nonzero.
On the other hand, if we keep a finite number of training points but allow the student to use all of the teacher features, the test loss, L(D) ≡ lim_{P→∞} L(D, P), takes the form,
L(D) = E_x[ K(x, x) − K_{xD} K_{DD}^{−1} K_{Dx} ].   [3]
Here, K(x, x′) ≡ Σ_M e_M(x) e_M(x′) is the data-data second moment matrix, K_{xD} indicates restricting one argument to the D training points, while K_{DD} indicates restricting both. This test loss vanishes as the number of training points becomes infinite but is nonzero for finite training set size.
We present a full derivation of these expressions in SI Appendix. In the remainder of this section, we explore the scaling of the test loss with dataset and model size.
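As a numerical sanity check on Eq. 3, the sketch below compares the closed-form loss E_x[K(x, x) − K_{xD} K_{DD}^{−1} K_{Dx}] with the test error of a student trained directly, namely the minimum-norm least-squares solution reached by gradient descent from zero initialization, averaged over teacher draws. The random ReLU features, two-dimensional inputs, and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, D, n_test, n_teachers = 2, 2000, 40, 400, 200

# random ReLU features e_m(x) = relu(w_m . x + b_m)
W, b = rng.normal(size=(M, d)), rng.normal(size=M)
features = lambda x: np.maximum(x @ W.T + b, 0.0)       # shape (n_points, M)

x_train, x_test = rng.normal(size=(D, d)), rng.normal(size=(n_test, d))
Phi_tr, Phi_te = features(x_train), features(x_test)

# closed form (Eq. 3): average "posterior variance" of the kernel K = Phi Phi^T
K_dd = Phi_tr @ Phi_tr.T + 1e-8 * np.eye(D)             # small jitter for numerical stability
K_td = Phi_te @ Phi_tr.T
K_tt_diag = np.sum(Phi_te**2, axis=1)
loss_formula = np.mean(K_tt_diag - np.sum(K_td * np.linalg.solve(K_dd, K_td.T).T, axis=1))

# direct simulation: train the overparameterized student on each sampled teacher
errs = []
for _ in range(n_teachers):
    omega = rng.normal(size=M)                          # teacher weights
    y_tr, y_te = Phi_tr @ omega, Phi_te @ omega
    theta = Phi_tr.T @ np.linalg.solve(K_dd, y_tr)      # minimum-norm least-squares student
    errs.append(np.mean((Phi_te @ theta - y_te) ** 2))

print("closed form :", loss_formula)
print("simulation  :", np.mean(errs), "+/-", np.std(errs) / np.sqrt(n_teachers))
```

The two numbers agree up to Monte Carlo error because, for fixed features and training inputs, the average over teacher weights can be carried out exactly and reproduces the projector structure behind Eq. 3.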
2.3.1. Variance-limited scaling.
To derive the limiting expressions Eqs. 2 and 3 for the loss, one makes use of the fact that the sample second moment matrix over the finite dataset, as well as over the finite feature set, is close to the full covariance: the empirical matrices concentrate around K and C, with zero-mean fluctuations whose variance scales as 1/D and 1/P, respectively, where expectations are taken over draws of a dataset of size D and over feature sets. Using these expansions yields the variance-limited scaling, L(D, P) − L(P) ∝ 1/D and L(D, P) − L(D) ∝ 1/P, in the underparameterized and overparameterized settings, respectively.
In Fig. 2A, we see evidence of these scaling relations for features built from randomly initialized ReLU deep neural networks on coarse-grained versions of the MNIST dataset obtained by local averaging over the images. We see that in the variance-limited regimes, the scaling exponent is independent of the modification to the training data. In SI Appendix, we provide an in-depth derivation of this behavior and expressions for the leading 1/D and 1/P corrections.
Fig. 2.

(A) Random feature models exhibit all four scaling regimes. Here, we consider linear teacher-student models with random features trained with MSE loss to convergence. We see both variance-limited scaling (Top-Left and Bottom-Right) and resolution-limited scaling (Top-Right and Bottom-Left). Data are varied by downsampling MNIST by the specified pool size. (B) Duality and spectra in random feature models. Here, we show the relation between the decay of the kernel spectra, λ_n ∝ n^{−(1+α_K)}, and the scaling of the loss with number of data points, α_D, and with number of parameters, α_P. (Top) The spectra of kernels derived from random fully connected deep neural networks on pooled MNIST appear well described by a power-law decay. (Bottom) The measured exponents α_D and α_P track the spectral decay; the theoretical relation α_D = α_P = α_K is given by the dashed black line.
2.3.2. Resolution-limited scaling.
We now would like to analyze the scaling behavior of our linear model in the resolution-limited regimes, that is, the scaling with P when D → ∞ and the scaling with D when P → ∞. In these cases, the scaling is controlled by the shared spectrum of C and K. This spectrum is often well described by a power law, where the eigenvalues, ordered by decreasing size, satisfy λ_n ∝ n^{−(1+α_K)}. See Fig. 2B for example spectra on pooled MNIST.
In this case, we will argue that the losses also obey a power-law scaling, with the exponents controlled by the spectral decay factor α_K,
L(D) ∝ D^{−α_K},   L(P) ∝ P^{−α_K}.   [4]
In other words, in this setting, α_D = α_P = α_K. This is supported empirically in Fig. 2B. For other derivations of dataset scaling for kernel regression, see refs. 18 and 19. We then argue that when the kernel function is sufficiently smooth on a manifold of dimension d, α_K is bounded below by a quantity proportional to 1/d, thus realizing the more general resolution-limited picture described above.
2.3.3. From spectra to scaling laws for the loss.
To be concrete, let us focus on the overparameterized loss. If we introduce the notation φ_i for the eigenfunctions of K, with eigenvalues λ_i, and u_α for the eigenvectors of K_{DD}, with eigenvalues λ̂_α, the loss becomes,
L(D) = Σ_i λ_i − Σ_{i,α} (λ_i^2 / λ̂_α) (u_α · φ_i)^2,   [5]
where φ_i here denotes the vector of eigenfunction values on the D training points. Before discussing the general asymptotic behavior of Eq. 5, we can gain some intuition by considering the case in which the eigenvalues are well separated relative to the sampling fluctuations. In this case, the eigenvectors of K_{DD} align with the leading eigenfunctions evaluated on the training set, with λ̂_α ≈ D λ_α (see, e.g., ref. 58), and we can simplify Eq. 5 to,
L(D) ≈ Σ_{i > D} λ_i.   [6]
More generally, in SI Appendix, following refs. 19 and 59, we use replica theory methods to derive L(D) and L(P) without requiring this simplifying limit.
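The route from a power-law spectrum to a power-law learning curve can also be traced numerically by evaluating Eq. 3 in a synthetic eigenbasis. The sketch below uses a Gaussian-design stand-in for the eigenfunctions (each drawn i.i.d. standard normal across samples), eigenvalues λ_i ∝ i^{−(1+α_K)}, and a finite feature cutoff; these are illustrative assumptions, and finite-size effects mean the fitted exponent only approximately equals α_K.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_K, n_features, n_test, reps = 1.0, 4000, 300, 5
lam = np.arange(1, n_features + 1) ** (-(1.0 + alpha_K))    # power-law spectrum
sqrt_lam = np.sqrt(lam)

def loss_eq3(D):
    """Monte Carlo average of K(x,x) - K_xD K_DD^{-1} K_Dx with Gaussian-design eigenfunctions."""
    vals = []
    for _ in range(reps):
        Phi_tr = sqrt_lam * rng.normal(size=(D, n_features))     # rows: sqrt(lam_i) * phi_i(x_a)
        Phi_te = sqrt_lam * rng.normal(size=(n_test, n_features))
        K_dd = Phi_tr @ Phi_tr.T + 1e-10 * np.eye(D)
        K_td = Phi_te @ Phi_tr.T
        quad = np.sum(K_td * np.linalg.solve(K_dd, K_td.T).T, axis=1)
        vals.append(np.mean(np.sum(Phi_te**2, axis=1) - quad))
    return np.mean(vals)

Ds = np.array([25, 50, 100, 200, 400, 800])
losses = np.array([loss_eq3(D) for D in Ds])
fitted = -np.polyfit(np.log(Ds), np.log(losses), 1)[0]
print("fitted alpha_D:", round(fitted, 2), "  spectral alpha_K:", alpha_K)
```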
2.3.4. Data manifolds and kernels.
In Section 2.2, we discussed a simple argument that resolution-limited exponents scale as 1/d, where d is the dimension of the data manifold. Our goal now is to explain how this connects with the linearized models and kernels discussed above: how does the spectrum of eigenvalues of a kernel relate to the dimension of the data manifold?
The key point is that sufficiently smooth kernels must have an eigenvalue spectrum with a bounded tail. Specifically, a kernel with s continuous derivatives on a d-dimensional space must have eigenvalues λ_n ≲ n^{−1−s/d} (40). In the generic case where the covariance matrices we have discussed can be interpreted as kernels on a manifold, and they have spectra saturating this bound, linearized models will inherit scaling exponents governed by the dimension of the manifold.
As a simple example, consider a d-torus. In this case, we can study the Fourier series decomposition and examine the case of a translation-invariant kernel K(x, y) = K(x − y). This must take the form K(x − y) = Σ_n c_n e^{i n·(x−y)}, where the n are integer-valued vectors and the c_n are the Fourier coefficients, which are also the eigenvalues of the kernel. To guarantee that K has s continuous derivatives, we must have c_n ≲ |n|^{−(d+s)}. Relabeling the eigenvalues as λ_m in decreasing order, the index m counts the number of lattice vectors up to a given magnitude, so m ∼ |n|^d. But this means that in this simple case, the tail eigenvalues of the kernel must be bounded by λ_m ≲ m^{−1−s/d} as m → ∞.
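The claim that the spectral tail is governed by the manifold dimension rather than the ambient dimension can be illustrated numerically. The sketch below estimates the eigenvalue decay of a Laplace (exponential) kernel, which has only finite smoothness at coincident points, for samples on a circle (d = 1) and on a 3-sphere (d = 3), both embedded in the same five-dimensional ambient space. The Nyström approximation of the spectrum, the kernel choice, and the index window for the fit are our own assumptions; the qualitative expectation is a visibly steeper power-law tail for the lower-dimensional manifold.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ambient = 1500, 5

def sphere_points(d):
    """n points uniform on the unit sphere S^d, zero-padded into R^ambient."""
    g = rng.normal(size=(n, d + 1))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return np.hstack([g, np.zeros((n, ambient - (d + 1)))])

def spectral_decay(x, fit_lo=20, fit_hi=300):
    sq = np.sum(x**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    K = np.exp(-np.sqrt(d2))                          # Laplace kernel: only finitely smooth
    eig = np.sort(np.linalg.eigvalsh(K / n))[::-1]    # Nystrom estimate of the operator spectrum
    idx = np.arange(fit_lo, fit_hi)
    return np.polyfit(np.log(idx), np.log(eig[idx]), 1)[0]

print("circle   (d=1): eigenvalue decay exponent", round(spectral_decay(sphere_points(1)), 2))
print("3-sphere (d=3): eigenvalue decay exponent", round(spectral_decay(sphere_points(3)), 2))
```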
2.4. Duality.
We argued above that for kernels with pure power-law spectra, the asymptotic scaling of the underparameterized loss with respect to model size and of the overparameterized loss with respect to dataset size share a common exponent. In the linear setup at hand, the relation between the underparameterized parameter dependence and the overparameterized dataset dependence is even stronger. The underparameterized and overparameterized losses are directly related by exchanging the projection onto random features with the projection onto random training points. Note that the sample-wise double descent observed in ref. 51 is a concrete realization of this duality for a simple data distribution. In SI Appendix, we present examples exhibiting the duality of the loss dependence on model and dataset size outside of the asymptotic regime.
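A stripped-down numerical illustration of this exchange symmetry: when the feature matrix has i.i.d. Gaussian entries and the pools of candidate features and candidate data points have equal size, so that the matrix and its transpose are identically distributed, restricting the student to n random features (with unlimited data) and restricting the training set to n random points (with all features) give the same average loss. The setup below is a deliberately idealized assumption, not the MNIST-based experiment of Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(0)
S = N = 300                                            # total features = total data points

def residual_trace(Phi, idx, axis):
    """Average loss after projecting onto a subset of data points or of features."""
    A = Phi[:, idx] if axis == "data" else Phi[idx, :].T        # vectors spanning the fit
    Q, _ = np.linalg.qr(A)                                      # orthonormal basis of the span
    B = Phi if axis == "data" else Phi.T
    return (np.sum(B**2) - np.sum((Q.T @ B) ** 2)) / N          # Tr[B^T (I - QQ^T) B] / N

ns, trials = [10, 20, 40, 80, 160], 40
for n in ns:
    L_data, L_param = [], []
    for _ in range(trials):
        Phi = rng.normal(size=(S, N))                           # feature matrix e_M(x_a)
        L_data.append(residual_trace(Phi, rng.choice(N, n, replace=False), "data"))
        L_param.append(residual_trace(Phi, rng.choice(S, n, replace=False), "feature"))
    print(f"n = {n:4d}:  L(D=n, all features) = {np.mean(L_data):7.2f}   "
          f"L(all data, P=n) = {np.mean(L_param):7.2f}")
```

Away from this idealized case the two curves need not coincide pointwise, which is why in general we only claim a shared asymptotic exponent.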
3. Experiments
3.1. Deep Teacher-Student Models.
Our theory can be tested very directly in the teacher-student framework, in which a teacher deep neural network generates synthetic data used to train a student network. Here, it is possible to generate unlimited training samples and, crucially, to controllably tune the dimension of the data manifold. We accomplish the latter by scanning over the dimension of the inputs to the teacher. We have found that when scanning over both model size and dataset size, the interpolation exponents closely match the prediction of 4/d. The dataset size scaling is shown in Fig. 1, while model size scaling experiments appear in SI Appendix and have previously been observed in ref. 10.
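A much smaller-scale version of this experiment can be sketched in a few dozen lines of PyTorch; the architecture sizes, optimizer settings, dataset grid, and training budget below are our own arbitrary choices (far smaller than the runs behind Fig. 1), so the fitted exponent should only loosely track the 4/d prediction. The run takes a few minutes on a CPU.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 4                                                  # teacher input (data manifold) dimension

def mlp(widths):
    layers = []
    for w_in, w_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(w_in, w_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])                 # drop the trailing ReLU

teacher = mlp([d, 96, 96, 1])                          # frozen, randomly initialized teacher
for p in teacher.parameters():
    p.requires_grad_(False)

x_test = torch.randn(10000, d)
y_test = teacher(x_test)

def student_test_loss(D, steps=3000):
    x = torch.randn(D, d)
    y = teacher(x)
    student = mlp([d, 256, 256, 1])                    # overparameterized student
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        ((student(x) - y) ** 2).mean().backward()
        opt.step()
    with torch.no_grad():
        return ((student(x_test) - y_test) ** 2).mean().item()

Ds = np.array([256, 512, 1024, 2048, 4096])
losses = np.array([student_test_loss(D) for D in Ds])
alpha_D = -np.polyfit(np.log(Ds), np.log(losses), 1)[0]
print("fitted alpha_D:", round(alpha_D, 2), "  4/d prediction:", 4 / d)
```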
3.2. Variance-Limited Scaling in the Wild.
Variance-limited scaling (Section 2.1) can be universally observed in real datasets. Fig. 1A (Top-Left and Bottom-Right) measures the variance-limited dataset scaling exponent α_D and width scaling exponent α_W. In both cases, we find striking agreement with the theoretically predicted value of 1 across a variety of datasets, neural network architectures, batch sizes in stochastic gradient descent, and loss types. Our testbed includes deep fully connected and convolutional networks with ReLU or Erf nonlinearities and MSE or cross-entropy losses. SI Appendix contains experimental details.
3.3. Resolution-Limited Scaling in the Wild.
In addition to teacher-student models, we explored resolution-limited scaling behavior in the context of standard classification datasets. Wide ResNet (WRN) models (60) were trained for a fixed number of steps with cosine decay. In Fig. 1B we also include data from a four hidden layer convolutional neural network (CNN) detailed in SI Appendix. As detailed above, we find dataset-dependent scaling behavior in this context.
We further investigated the effect of the data distribution on the resolution-limited exponent, α_D, by tuning the number of target classes and the input noise (Fig. 3). To probe the effect of the number of classes, we constructed tasks derived from CIFAR-100 by grouping classes into broader semantic categories. We found that performance depends on the number of categories, but α_D is insensitive to this number. In contrast, the addition of Gaussian noise had a more pronounced effect on α_D. This suggests a picture in which the neural network learns to model the input data manifold, independent of the classification task, consistent with observations in refs. 61 and 62.
Fig. 3.

Effect of data distribution on scaling exponents. For CIFAR-100 superclassed to N classes (Left), we find that the number of target classes does not have a visible effect on the scaling exponent. (Right) For CIFAR-10 with the addition of Gaussian noise to inputs, we find the strength of the noise has a strong effect on performance scaling with dataset size. All models are WRN-28-10.
We also explored the effect of aspect ratio on dataset scaling, finding that the exponent magnitude increases with width up to a critical width, while the dependence on depth is milder (SI Appendix).
4. Discussion
We have presented a framework for categorizing neural network scaling laws, along with derivations that help to explain their origins. Crucially, our predictions agree with empirical findings in settings which have often proven challenging for theory: deep neural networks trained on real datasets. The variance-limited regime yields, for smooth test losses, a universal prediction of α_D = 1 (for dataset scaling at fixed model size) and α_W = 1 (for width scaling at fixed dataset size). The resolution-limited regime yields exponents whose numerical value is variable and data and model dependent.
There are a number of intriguing directions for future work. The invariance of the dataset scaling exponent to superclassing (Fig. 3) suggests that deep networks may be largely learning properties of the input data manifold—akin to unsupervised learning—rather than significant task-specific structure, which may shed light on the versatility of learned deep network representations for different downstream tasks. Another direction for future research is to more explicitly derive within the theory the effects of “feature learning.” While the random feature linear models we have discussed are in exact correspondence with deep neural networks in the large-width limit and have been a useful theoretical testbed across a variety of problems, the kernels associated with networks of finite depth and finite width evolve dynamically during the course of training.
4.1. Limitations.
One limitation is that our theoretical results are asymptotic, while experiments are performed with finite models and datasets. This is apparent in the resolution-limited regime, which requires a hierarchy (P ≫ D or D ≫ P). In Figs. 1A and 2A (Top-Right and Bottom-Left), we see a breakdown of the predicted scaling behavior as D becomes large and the hierarchy is lost. Furthermore, in the resolution-limited regime for deep networks, our theoretical tools rely on positing the existence of a data manifold. A precise definition of the data manifold, however, is lacking, forcing us to use imperfect proxies, such as the nearest-neighbor distances of final embedding layers from a trained network.
5. Outlook
Modern deep learning, in the era of large datasets, models, and computational power, has often made progress through extensive amounts of experimentation and trial-and-error. A theoretical understanding of deep learning that is grounded in experiments and strives to bridge the gap between mathematically rigorous theory on the one hand, and realistic settings on the other, could be scientifically important for guiding the field. Our treatment of neural scaling laws in this work touches on classic aspects of generalization within learning theory but derives results through realistic data assumptions and identifying and deriving a taxonomy for scaling regimes. Our approach is guided by the theoretical simplicity, realistic modeling, and experimental verification that is characteristic of theory construction in physics; we also leverage results from statistical physics approaches to deep learning in our derivations.
Looking further afield, it is an interesting question whether qualitatively new behavior can emerge in large neural models trained on rich datasets, or whether such models are natural extensions of smaller-scale models. The exploration of so-called emergent abilities within neural-based language models is an active area of research. Further investigation into these questions through the theoretical methods and scientific approaches of physics—calling for a realistic modeling of data and neural representations—may help shed light on our understanding of learning in deep neural networks.
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
We would like to thank Guy Gur-Ari, Boris Hanin, Tom Henighan, Danny Hernandez, Aitor Lewkowycz, Sam McCandlish, Preetum Nakkiran, Behnam Neyshabur, Jeffrey Pennington, Vinay Ramasesh, Dan Roberts, Jonathan Rosenfeld, Jascha Sohl-Dickstein, and Lechao Xiao for conversations during the completion of this work. U.S. completed a portion of this work during an internship at Google. J.K. and U.S. were supported in part by Open Philanthropy.
Author contributions
Y.B., E.D., J.K., J.L., and U.S. designed the research; performed theoretical and computational aspects of the research; and wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission. Y.T. is a guest editor invited by the Editorial Board.
*A visualization of this successively better approximation with dataset size is shown in Fig. 1B for models trained to predict data generated by a random fully connected network.
†For simplicity we have used a very compressed notation for multitensor contractions in higher order terms.
Data, Materials, and Software Availability
All study data are included in the article and/or SI Appendix.
References
- 1. J. Hestness et al., Deep learning scaling is predictable, empirically. arXiv [Preprint] (2017). 10.48550/arXiv.1712.00409 (Accessed 1 January 2021).
- 2. J. Kaplan et al., Scaling laws for neural language models. arXiv [Preprint] (2020). 10.48550/arXiv.2001.08361 (Accessed 1 January 2021).
- 3. J. S. Rosenfeld, A. Rosenfeld, Y. Belinkov, N. Shavit, “A constructive prediction of the generalization error across scales” in International Conference on Learning Representations (2020).
- 4. T. Henighan et al., Scaling laws for autoregressive generative modeling. arXiv [Preprint] (2020). 10.48550/arXiv.2010.14701 (Accessed 1 January 2021).
- 5. J. S. Rosenfeld, J. Frankle, M. Carbin, N. Shavit, “On the predictability of pruning across scales” in International Conference on Machine Learning (PMLR, 2021), pp. 9075–9083.
- 6. S. Ahmad, G. Tesauro, “Scaling and generalization in neural networks: A case study” in Advances in Neural Information Processing Systems (1989), pp. 160–168.
- 7. D. Cohn, G. Tesauro, “Can neural networks do better than the Vapnik–Chervonenkis bounds?” in Advances in Neural Information Processing Systems (1991), pp. 911–917.
- 8. T. B. Brown et al., Language models are few-shot learners. arXiv [Preprint] (2020). 10.48550/arXiv.2005.14165 (Accessed 1 January 2021).
- 9. J. Hoffmann et al., An empirical analysis of compute-optimal large language model training. Adv. Neural Inf. Process. Syst. 35, 30016–30030 (2022).
- 10. U. Sharma, J. Kaplan, Scaling laws from the data manifold dimension. J. Mach. Learn. Res. 23, 1–34 (2022).
- 11. D. Bisla, A. N. Saridena, A. Choromanska, A theoretical-empirical approach to estimating sample complexity of DNNs. arXiv [Preprint] (2021). 10.48550/arXiv.2105.01867 (Accessed 1 January 2021).
- 12. R. M. Neal, “Bayesian learning for neural networks,” PhD thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada (1994).
- 13. J. Lee et al., “Deep neural networks as Gaussian processes” in International Conference on Learning Representations (2018).
- 14. A. Matthews, J. Hron, M. Rowland, R. E. Turner, Z. Ghahramani, “Gaussian process behaviour in wide deep neural networks” in International Conference on Learning Representations (2018).
- 15. A. Jacot, F. Gabriel, C. Hongler, “Neural Tangent Kernel: Convergence and generalization in neural networks” in Advances in Neural Information Processing Systems (2018).
- 16. J. Lee et al., “Wide neural networks of any depth evolve as linear models under gradient descent” in Advances in Neural Information Processing Systems (2019).
- 17. E. Dyer, G. Gur-Ari, “Asymptotics of wide networks from Feynman diagrams” in International Conference on Learning Representations (2020).
- 18. S. Spigler, M. Geiger, M. Wyart, Asymptotic learning curves of kernel methods: Empirical data versus teacher-student paradigm. J. Stat. Mech.: Theory Exp. 2020, 124001 (2020).
- 19. B. Bordelon, A. Canatar, C. Pehlevan, “Spectrum dependent learning curves in kernel regression and wide neural networks” in International Conference on Machine Learning (PMLR, 2020), pp. 1024–1034.
- 20. A. Caponnetto, E. De Vito, Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7, 331–368 (2007).
- 21. I. Steinwart et al., “Optimal rates for regularized least squares regression” in COLT (2009), pp. 79–93.
- 22. S. Fischer, I. Steinwart, Sobolev norm learning rates for regularized least-squares algorithms. J. Mach. Learn. Res. 21, 8464–8501 (2020).
- 23. M. Hutter, Learning curve theory. arXiv [Preprint] (2021). 10.48550/arXiv.2102.04074 (Accessed 10 February 2021).
- 24. H. Cui, B. Loureiro, F. Krzakala, L. Zdeborová, Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Adv. Neural Inf. Process. Syst. 34, 10131–10143 (2021).
- 25. A. Maloney, D. A. Roberts, J. Sully, A solvable model of neural scaling laws. arXiv [Preprint] (2022). 10.48550/arXiv.2210.16859 (Accessed 1 November 2022).
- 26. A. Wei, W. Hu, J. Steinhardt, “More than a toy: Random matrix models predict how real-world neural representations generalize” in International Conference on Machine Learning (PMLR, 2022), pp. 23549–23588.
- 27. A. Rahimi, B. Recht, Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Adv. Neural Inf. Process. Syst. 21, 1313–1320 (2008).
- 28. T. Hastie, A. Montanari, S. Rosset, R. J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation. arXiv [Preprint] (2019). 10.48550/arXiv.1903.08560 (Accessed 1 January 2021).
- 29. S. d’Ascoli, M. Refinetti, G. Biroli, F. Krzakala, “Double trouble in double descent: Bias and variance(s) in the lazy regime” in International Conference on Machine Learning (PMLR, 2020), pp. 2280–2290.
- 30. M. S. Advani, A. M. Saxe, High-dimensional dynamics of generalization error in neural networks. arXiv [Preprint] (2017). 10.48550/arXiv.1710.03667 (Accessed 1 January 2021).
- 31. M. S. Advani, A. M. Saxe, H. Sompolinsky, High-dimensional dynamics of generalization error in neural networks. Neural Networks 132, 428–446 (2020).
- 32. S. Mei, A. Montanari, The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv [Preprint] (2019). 10.48550/arXiv.1908.05355 (Accessed 1 January 2021).
- 33. B. Adlam, J. Pennington, “The Neural Tangent Kernel in high dimensions: Triple descent and a multi-scale theory of generalization” in International Conference on Machine Learning (PMLR, 2020), pp. 74–84.
- 34. B. Adlam, J. Pennington, Understanding double descent requires a fine-grained bias-variance decomposition. Adv. Neural Inf. Process. Syst. 33, 11022–11032 (2020).
- 35. A. Andreassen, E. Dyer, Asymptotics of wide convolutional neural networks. arXiv [Preprint] (2020). 10.48550/arXiv.2008.08675 (Accessed 1 January 2021).
- 36. M. Geiger et al., Scaling description of generalization with number of parameters in deep learning. J. Stat. Mech.: Theory Exp. 2020, 023401 (2020).
- 37. P. Nakkiran, B. Neyshabur, H. Sedghi, “The deep bootstrap framework: Good online learners are good offline generalizers” in International Conference on Learning Representations (2021).
- 38. H. Weyl, Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Math. Ann. 71, 441–479 (1912).
- 39. J. Reade, Eigenvalues of positive definite kernels. SIAM J. Math. Anal. 14, 152–157 (1983).
- 40. T. Kühn, Eigenvalues of integral operators with smooth positive definite kernels. Arch. Math. 49, 525–534 (1987).
- 41. J. Ferreira, V. Menegatto, Eigenvalues of integral operators defined by smooth positive definite kernels. Int. Eq. Oper. Theory 64, 61–81 (2009).
- 42. M. L. Stein, Interpolation of Spatial Data: Some Theory for Kriging (Springer Science & Business Media, 1999).
- 43. P. J. Bickel et al., “Local polynomial regression on unknown manifolds” in Complex Datasets and Inverse Problems (Institute of Mathematical Statistics, 2007), pp. 177–186.
- 44. D. de Laat, “Approximating manifolds by meshes: Asymptotic bounds in higher codimension,” Thesis, University of Groningen, Groningen, The Netherlands (2011).
- 45. S. Yaida, “Non-Gaussian processes and neural networks at finite widths” in Mathematical and Scientific Machine Learning Conference (2020).
- 46. B. Hanin, M. Nica, “Finite depth and width corrections to the neural tangent kernel” in International Conference on Learning Representations (2020).
- 47. E. Levina, P. J. Bickel, “Maximum likelihood estimation of intrinsic dimension” in Advances in Neural Information Processing Systems (2005), pp. 777–784.
- 48. P. McCullagh, J. A. Nelder, Generalized Linear Models (CRC Press, 1989), vol. 37.
- 49. R. M. Rifkin, R. A. Lippert, “Notes on regularized least squares” (Tech. Rep. MIT-CSAIL-TR-2007-025, MIT Computer Science and Artificial Intelligence Laboratory, MA, 2007).
- 50. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Science & Business Media, 2009).
- 51. P. Nakkiran, More data can hurt for linear regression: Sample-wise double descent. arXiv [Preprint] (2019). 10.48550/arXiv.1912.07242 (Accessed 1 January 2021).
- 52. R. Novak et al., “Bayesian deep convolutional networks with many channels are Gaussian processes” in International Conference on Learning Representations (2019).
- 53. A. Garriga-Alonso, L. Aitchison, C. E. Rasmussen, “Deep convolutional networks as shallow Gaussian processes” in International Conference on Learning Representations (2019).
- 54. G. Yang, Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv [Preprint] (2019). 10.48550/arXiv.1902.04760 (Accessed 1 January 2021).
- 55. A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, G. Gur-Ari, The large learning rate phase of deep learning: The catapult mechanism. arXiv [Preprint] (2020). 10.48550/arXiv.2003.02218 (Accessed 1 January 2021).
- 56. W. Huang, W. Du, R. Y. Da Xu, C. Liu, Implicit bias of deep linear networks in the large learning rate phase. arXiv [Preprint] (2020). 10.48550/arXiv.2011.12547 (Accessed 1 January 2021).
- 57. L. Chizat, E. Oyallon, F. Bach, “On lazy training in differentiable programming” in Advances in Neural Information Processing Systems (2019), pp. 2937–2947.
- 58. A. Loukas, “How close are the eigenvectors of the sample and actual covariance matrices?” in International Conference on Machine Learning (PMLR, 2017), pp. 2228–2237.
- 59. A. Canatar, B. Bordelon, C. Pehlevan, Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nat. Commun. 12, 1–12 (2021).
- 60. S. Zagoruyko, N. Komodakis, “Wide residual networks” in British Machine Vision Conference (2016).
- 61. P. Nakkiran, Y. Bansal, Distributional generalization: A new kind of generalization. arXiv [Preprint] (2020). 10.48550/arXiv.2009.08092 (Accessed 1 January 2021).
- 62. W. Grathwohl et al., “Your classifier is secretly an energy based model and you should treat it like one” in International Conference on Learning Representations (2020).
