Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

Mengjia Xu; Akshay Rangamani; Qianli Liao; Tomer Galanti; Tomaso Poggio

doi:10.34133/research.0024

. 2023 Mar 8;6:0024. doi: 10.34133/research.0024

Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

Mengjia Xu ^1,², Akshay Rangamani ¹, Qianli Liao ¹, Tomer Galanti ¹, Tomaso Poggio ^1,^*

PMCID: PMC10202460 PMID: 37223467

Abstract

We overview several properties—old and new—of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit networks. We study the convergence to a solution with the absolute minimum ρ, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers is used together with weight decay under different forms of gradient descent. A main property of the minimizers that bound their expected error for a specific network architecture is ρ. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by stochastic gradient descent in the presence of weight decay have a bias toward low-rank weight matrices, which should improve generalization. The same analysis predicts the existence of an inherent stochastic gradient descent noise for deep networks. In both cases, we verify our predictions experimentally. We then predict neural collapse and its properties without any specific assumption—unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for problems that are appropriate for sparse deep architectures such as convolutional neural networks. The reason is that compositionally sparse target functions can be approximated well by “sparse” deep networks without incurring in the curse of dimensionality.

Introduction

A widely held belief in the last few years has been that the cross-entropy loss is superior to the square loss when training deep networks for classification problems. As such, the attempts at understanding the theory of deep learning have been largely focused on exponential-type losses [1,2], such as the cross-entropy. For these losses, the predictive ability of deep networks depends on the implicit complexity control of gradient descent (GD) algorithms that lead to asymptotic maximization of the classification margin on the training set [1,3,4]. Recently, however, Hui and Belkin [5] have empirically demonstrated that it is possible to achieve a similar level of performance, if not better, using the square loss, paralleling older results for support vector machines [6]. Can a theoretical analysis explain when and why regression should work well for classification? This question was the original motivation for this paper and preliminary versions of it [7,8].

In deep learning binary classification, unlike the case of linear networks, we expect from previous results (in the absence of regularization) several global minima with zero square loss, thus corresponding to interpolating solutions (in general degenerate, see [9,10] and reference therein), because of overparametrization. Although all the interpolating solutions are optimal solutions to the regression problem, they will generally correspond to different (normalized) margins and to different expected classification performances. In other words, zero square loss does not imply by itself neither large margin nor good classification on a test set. When can we expect the solutions to the regression problem obtained by GD to have a large margin?

We introduce a simplified model of the training procedure that uses square loss, binary classification, gradient flow (GF), and Lagrange multipliers (LMs) for normalizing the weights. With this model, we show that obtaining large margin interpolating solutions depends on the scale of initialization of the weights close to zero, in the absence of regularization [also called weight decay (WD)]. Assuming convergence, we describe the qualitative dynamics of the deep network’s parameters and show that ρ, which is the product of the Frobenius norms of the weight matrices, grows nonmonotonically until a large margin, which is small ρ solution, is found reached. Assuming that local minima and saddle points can be avoided, this analysis suggests that with WD (or sometimes with just small initialization), GD techniques may yield convergence to a minimum with a ρ biased to be small.

In the presence of WD, perfect interpolation of all data points cannot occur and is replaced by quasi-interpolation of the labels. In the special case of binary classification case in which y_n = ±1, quasi-interpolation is defined as ∀ n:|f(x_n) − y_n | ≤ ϵ, where ϵ > 0 is small. Our experiments and analysis of the dynamics show that in the presence of regularization, there is a weaker dependence on initial conditions, as has been observed in [5]. We show that WD helps stabilize normalization of the weights, in addition to its role in the dynamics of the norm.

We then apply basic bounds on expected error to the solutions provided by stochastic gradient descent (SGD) (for WD λ > 0), which have locally minimum ρ. For normal training set sizes, the bounds are still vacuous but much closer to the test error than previous estimates. This is encouraging because in our setup, large overparametrization, corresponding to interpolation of the training data [11], coexists with a relatively small Rademacher complexity because of the sparsity induced by the locality of the convolutional kernel. [By several orders of magnitude.]

We then turn to show that the quasi-interpolating solutions satisfy the recently discovered neural collapse (NC) phenomenon [12], assuming SGD with minibatches. According to NC, a dramatic simplification of deep network dynamics takes place—not only do all the margins become very similar to each other, but the last layer classifiers and the penultimate layer features also form the geometrical structure of a simplex equiangular tight frame (ETF). Here, we prove the emergence of NC for the square loss for the networks that we study—without any additional assumption (such as unconstrained features).

Finally, the study of SGD reveals surprising differences relative to GD. In particular, in the presence of regularization, SGD does not converge to a perfect equilibrium: There is always, at least generically, SGD noise. The underlying reason is a rank constraint that depends on the size of the minibatches. This also implies an SGD bias toward low-rank solutions that reinforces a similar bias due to maximization of the margin under normalization (which can be inferred from [13]).

Contributions

The main original contributions in this paper are as follows:

• We analyze the dynamics of deep network parameters, their norm, and the margins under GF on the square loss, using Lagrange normalization (LN). We describe the evolution of ρ and the role of WD and normalization in the training dynamics. The analysis in terms of the “polar” coordinates ρ, V_k is new, and many of the observed properties are not. Arguably, our analysis of the bias toward minimum ρ and its dynamics with and without WD is an original contribution.

• Our norm-based generalization bounds for convolutional neural networks (CNNs) are new. We outline in this paper a derivation for the case of nonoverlapping convolutional patches. The extension to the general case follows naturally and will be described in a forthcoming paper. The bounds show that generalization for CNNs can be orders of magnitude better than that for dense networks. In the experiments that we describe, the bounds turn out to be loose but close to nonvacuous. They appear to be much better than the other empirical tests of generalization bounds—all for dense networks—that we know of. The main reason for this, in addition to the relatively simple task (binary classification in CIAFR10), is the sparsity of the convolutional network, which is the low dimensionality (or locality) of the kernel.

• We prove that convergence of GD optimization with WD and normalization yields NC for deep networks trained with square loss in the binary and in the multiclass classification case. Experiments verify the predictions. Our proof is free of any assumption—unlike other recent papers that depend on the “unconstrained feature assumption”.

• We prove that training the network using SGD with WD induces a bias toward low-rank weight matrices. As we will describe in a separate paper, low rank can yield better generalization bounds.

• The same theoretical observation that predicts a low-rank-bias also predicts the existence of an intrinsic SGD noise in the weight matrices and in the margins.

Related Work

There has been much recent work on the analysis of deep networks and linear models trained using exponential-type losses for classification. The implicit bias of GD toward margin maximizing solutions under exponential-type losses was shown for linear models with separable data in [14] and for deep networks in [1,2,15,16]. Recent interest in using the square loss for classification has been spurred by the experiments in [5], although the practice of using the square loss is much older [6]. Muthukumar et al. [17] recently showed for linear models that interpolating solutions for the square loss are equivalent to the solutions to the hard margin support vector machine problem (see also [7]). Recent work also studied interpolating kernel machines [18,19] that use the square loss for classification.

In the recent past, there have been a number of papers analyzing deep networks trained with the square loss. These include the works of Zhong et al. [20] and Soltanolkotabi et al. [21] that show how to recover the parameters of a neural network by training on data sampled from it. The square loss has also been used in analyzing convergence of training in the neural tangent kernel (NTK) regime [22–24]. Detailed analyses of 2-layer neural networks such as [25–27] typically use the square loss as an objective function. However, these papers do not specifically consider the task of classification.

A large effort has been spent in understanding generalization in deep networks. The main focus has been solving the puzzle of how overparameterized deep networks (with more parameters than data) are able to generalize. An influential paper [11] showed that overparameterized deep networks that usually fit randomly labeled data also generalize well when they trained on correctly labeled data. Thus, the training error does not give any information about test error: There is no uniform convergence of training error to test error. This is related to another property of overparametrization: Standard Vapnik–Chervonenkis bounds are always vacuous when the number of parameters is larger than the number of data. Although often forgotten, it is, however, well known that another type of bounds—on the norm of parameters—may provide generalization even if there are more parameters than data. This point was made convincingly in [28], which provides norm-based bounds for deep networks. [The focus of this paper on ρ is directly related.] Bartlett bounds and related ones [29,30] in practice turn out to be very loose. Empirical studies such as [31] found little evidence so far that norms and margins correlate well with generalization.

NC [12] is a recently discovered empirical phenomenon that occurs when training deep classifiers using the cross-entropy loss. Since its discovery, there have been a few papers analytically proving its emergence when training deep networks. Mixon et al. [32] show NC in the regime of “unconstrained features”. Recent results in [33] perform a more comprehensive analysis of NC in the unconstrained features paradigm. There have been a series of papers analytically showing the emergence of NC when using the cross-entropy loss [34–36]. In the study of the emergence of NC when training using the square loss, Ergen and Pilanci [37] (see also [38]) derived it through a convex dual formulation of deep networks. In addition to that, Han et al. [39] and Zhou et al. [40] show the emergence of NC in the unconstrained features regime. Our independent derivation is different from these approaches and shows that NC emerges in the presence of normalization and WD.

Several papers in recent years have studied the relationship between implicit regularization in linear neural networks and rank minimization. A main focus was on the matrix factorization problem, which corresponds to training a depth-2 linear neural network with multiple outputs with respect to the square loss (see references in [13]). Beyond factorization problems, it was shown that in linear networks of output dimension 1, GF with respect to exponential-type loss functions converges to networks where the weight matrix of every layer is of rank 1. However, for nonlinear neural networks, things are less clear. Empirically, several studies (see references in [13]) showed that replacing the weight matrices by low-rank approximations results in only a small drop in accuracy. This suggests that the weight matrices in practice are not too far from being low rank.

Problem Setup

In this section, we describe the training settings considered in our work. We study training deep neural network with rectified linear unit (ReLU) nonlinearity using square loss minimization for classification problems. In the proposed analysis, we apply a specific normalization technique: weight normalization (WN), which is equivalent to LM, and regularization (also called WD), since such mechanisms seem commonly used for reliably training deep networks using GD techniques [5,41].

Assumptions

Throughout the theoretical analysis, we make, in some places, simplifying assumptions relative to standard practice in deep neural networks. We mostly consider that the case of binary classification though our analysis of NC includes multiclass classification. We restrict ourselves to the square loss. We consider GD techniques, but we assume different forms of them in various sections of the paper. In the first part, we assume continuous GF instead of GD or SGD. GF is the limit of discrete GD algorithm with the learning rate being infinitesimally small (we describe an approximation of GD within a GF approach in [8]). SGD is specifically considered and shown to bias rank and induce asymptotic noise that is unique to it. The analysis of NC is carried out using SGD with small learning rates. Furthermore, we assume WN by an LM term added to the loss function, which normalizes the weight matrices. This is equivalent to WN but is not equivalent to the more commonly used batch normalization (BN).

We also assume throughout that the network is overparameterized and so that there is convergence to global minima with appropriate initialization, parameter values, and data.

Classification with square loss minimization

In this work, we consider a square loss minimization for classification along with regularization and WN. We consider a binary classification problem, given a training dataset $S = {\{(x_{n}, y_{n})\}}_{n = 1}^{N}$ , where x_n ∈ ℝ^d is the input (normalized such that ∥x_n ∥ ≤ 1) and y_n ∈ {±1} is the label. We use deep rectified homogeneous networks with L layers to solve this problem. For simplicity, we consider networks f_W : ℝ^d → ℝ^p of the following form f_W(x) = W_Lσ(W_{L − 1}…σ(W₁x)…), where x ∈ ℝ^d is the input to the network and σ : ℝ → ℝ, σ(x) = max (0, x) is the ReLU activation function that is applied coordinate-wise at each layer. The last layer of the network is linear (see Fig. 1).

Fig. 1. — An illustration of 2 parametrizations of *f_W*(x). In (A), we decompose each layer’s weight matrix *W_i* into its norm *ρ_i* and its normalized version *V_i*. In (B), we normalize each layer except for the top layer’s matrix *W_L* that is decomposed into a global ρ and the last layer *V_L*. Normalizing the weight matrices, as WN (equivalent to LN) does, is different from BN, although both normalization techniques capture the relevant property of normalization—to make the dot product invariant to scale.

Because of the positive homogeneity of ReLU [i.e., σ(αx) = ασ(x) for all x ∈ ℝ and α > 0], one can reparametrize f_W(x) by considering normalized weight matrices $V_{k} = \frac{W_{k}}{∥ W_{k} ∥}$ and define ρ_k = ∥ W_k∥, obtaining f_W(x) = ρ_LV_Lσ(ρ_{L − 1}…σ(ρ₁V₁x)…). [We choose the Frobenius norm here.] Because of homogeneity of the ReLU, it is possible to pull out the product of the layer norms as ρ = ∏_k ρ_k and write f_W(x) = ρf_V(x) = ρV_Lσ(V_{L − 1}…σ(V₁x)…). Notice that the 2 networks—f_W(x) and ρf_V(x)—are equivalent reparameterizations of the same function (if ρ = ∏_k ρ_k) but their optimization generally differ. We define f_n ≔ f_V(x_n).

We adopt in our definition the convention that the norm ρ_j of the convolutional layers is the norm of their filters and not the norm of their associated Toeplitz matrices. The reason is that this is what our novel bounds for CNNs state (see also section 3.3 in [42,43]). The total ρ calculated in this way is the quantity that enters the generalization bounds of Generalization: Rademacher Complexity of Convolutional Layers.

In practice, certain normalization techniques are used to train neural networks. This is usually performed using either BN or, less frequently, WN. BN consists of standardizing the output of the units in each layer to have zero mean and unit variance with respect to training set. WN normalizes the weight matrices (section 10 in [4]). In our analysis, we model normalization by normalizing the weight matrices, using an LM term added to the loss function. This approach is equivalent to WN.

In the presence of normalization, we assume that all layers are normalized, except for the last one, via the added LM. Thus, the weight matrices ${\{V_{k}\}}_{k = 1}^{L}$ are constrained by the LM term to be close to, and eventually converge to, unit norm matrices (in fact, to fixed norm matrices); notice that normalizing V_L and then multiplying the output by ρ are equivalent to letting W_L = ρV_L be unnormalized. Thus, f_V is the network that, at convergence, has L − 1 normalized layers (see Fig. 1B).

We can write the Lagrangian corresponding to the minimization of the regularized loss function under the constraint ∥V_k∥² = 1 in the following manner:

\begin{matrix} L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L}) : & = \frac{1}{N} \sum_{n} {(ρ f_{n} - y_{n})}^{2} + \sum_{k = 1}^{L} ν_{k} (∥ V_{k} ∥^{2} - 1) + λ ρ^{2} \\ = \frac{1}{N} \sum_{n} {(1 - ρ {\bar{f}}_{n})}^{2} + \sum_{k = 1}^{L} ν_{k} (∥ V_{k} ∥^{2} - 1) + λ ρ^{2} \end{matrix}

(1)

where ν_k values are the LMs and λ > 0 is a predefined parameter.

Separability and margins

Two important aspects of classification are separability and margins. For a given sample (x, y) (train or test sample) and model f_W, we say that f_W correctly classifies x, if ${\bar{f}}_{n} = y_{n} f_{n} > 0$ . In addition, for a given dataset $S = {\{(x_{n}, y_{n})\}}_{n = 1}^{N}$ , separability is defined as the condition in which all training samples are classified correctly, $\forall n \in [N] : {\bar{f}}_{n} > 0$ . Furthermore, when $\sum_{n = 1}^{N} {\bar{f}}_{n} > 0$ , we say that average separability is satisfied. The minimum of $L_{S}$ for λ = 0 is usually zero under our assumption of overparametrization. This corresponds to separability.

Notice that if f_W is a zero loss solution of the regression problem, then ∀n : f_W(x_n) = y_n, which is also equivalent to ρf_n = y_n, where we call $y_{n} f_{n} = {\bar{f}}_{n}$ the margin for x_n. By multiplying both sides of this equation by y_n and summing both sides over n ∈ [N], we obtain that $ρ \sum_{n} {\bar{f}}_{n} = N$ . Thus, the norm ρ of a minimizer is inversely proportional to its average margin μ in the limit of λ = 0, with $μ = \frac{1}{N} \sum_{n} {\bar{f}}_{n}$ . It is also useful to define the margin variance σ² = M − μ² with $M = \frac{1}{N} \sum_{n} {\bar{f}}_{n}^{2}$ . Notice that $M = \frac{1}{N} \sum_{n} {\bar{f}}_{n}^{2} = σ^{2} + μ^{2}$ and that both M and σ² are not negative. [Notice that the term “margin” is usually defined as ${min}_{n \in [N]} {\bar{f}}_{n}$ . Instead, we use the term “margin for x_n” to distinguish our definition from the usual one.]

Interpolation and quasi-interpolation

Assume that the weights V_k are normalized at convergence. Then

Lemma 1.

For λ = 0, there are solutions that interpolate all data points with the same margin and achieve zero loss. For λ > 0, there are no solutions that have the same margins and interpolate. However, there are solutions with the same margins that quasi-interpolate and are critical points of the gradient.

Proof. Consider the loss $L_{S} = \frac{1}{N} \sum_{n} {(1 - ρ {\bar{f}}_{n})}^{2} + λ ρ^{2} = 1 - 2 ρ μ + ρ^{2} M + λ ρ^{2}$ . For λ = 0, a zero of the loss $L_{S} = 0$ implies $\forall n \in [N] : μ = {\bar{f}}_{n}$ and $μ = \frac{1}{ρ}$ . However, for λ > 0, the assumption that all ${\bar{f}}_{n}$ values are equal yields M = μ² and, thus, $L_{S} = ρ^{2} μ^{2} - 2 ρ μ + (1 + λ ρ^{2})$ . Setting $L_{S} = 0$ gives a second-order equation in ρ that does not have real-valued solutions for λ > 0. Thus, in the presence of regularization, there exist no solutions that have the same margin for all points and reach zero empirical loss. However, solutions that have the same margin for all points and correspond to zero gradient with respect to ρ exist. To see this, assume σ = 0, setting the gradient of $L_{S}$ with respect to ρ equal to zero, yielding ρμ² − μ + λρ = 0. This gives $ρ = \frac{μ}{μ^{2} + λ}$ . This solution yields ρμ < 1, which corresponds to noninterpolating solutions.

The Neural collapse section shows that the margins [which are never interpolating; interpolation is equivalent to ρy_nf(x_n) = 1] tend to become equal to each other as predicted from the lemma during convergence.

Experiments

We performed binary classification experiments using the standard CIFAR10 dataset [44]. Image samples with class labels 1 and 2 were extracted for the binary classification task. A total number of training and test data points are 10,000 and 2,000, respectively. The model architecture in Fig. 1B contains 4 convolutional layers and 2 fully connected layers with hidden sizes of 1,024 and 2. A number of channels for the 4 convolutional layers are 32, 64, 128, and 128, and the filter size is 3 × 3. The first fully connected layer has 3,200 × 1,024 = 3,276,800 weights, and the very last layer has 1,024 × 2 = 2,048 weights. At the top layer of our model, there is a learnable parameter ρ (Fig. 1B). In our experiments, instead of using LMs, we used the equivalent (see proof of the equivalence [2]) WN algorithm, freezing the weights of the WN parameter “g” [45] and normalizing the ${\{V_{k}\}}_{k = 1}^{L - 1}$ matrices at each layer with respect to their Frobenius norm, while the top layer’s norm is denoted by ρ and is the only parameter entering in the regularization term (see Eq. 11).

Landscape of the empirical risk

As a next step, we establish key properties of the loss landscape. The landscape of the empirical loss contains a set of degenerate zero-loss global minima (for λ = 0) that under certain overparametrization, assumptions may be connected in a single zero-loss degenerate valley for ρ ≥ ρ₀. Figure 2 shows a landscape that has a saddle for ρ = 0 and then goes to zero loss (zero crossing level, that is the coastline) for different values of ρ (look at the boundary of the mountain). As we will see in our analysis of the GF, the descent from ρ = 0 can encounter local minima and saddles with nonzero loss. Furthermore, although the valley of zero loss may be connected, the point of absolute minimum ρ may be unreachable by GF from another point of zero loss even in the presence of λ > 0, because of the possible nonconvex profile of the coastline (see inset of Fig. 2).

Fig. 2. — A speculative view of the landscape of the unregularized loss—which is for λ = 0. Think of the loss as the mountain emerging from the water with zero loss being the water level. ρ is the radial distance from the center of the mountain as shown in the inset, whereas the *V_k* can be thought as multidimensional angles in this “polar” coordinate system. There are global degenerate valleys for ρ ≥ ρ₀ with V₁ and V₂ weights of unit norm. The coastline of the loss marks the boundary of the zero-loss degenerate minimum where $L = 0$ in the high-dimensional space of ρ and *V_k* ∀ k = 1, ⋯, L. The degenerate global minimum is shown here as a connected valley outside the coastline. The red arrow marks the minimum loss with minimum ρ. Notice that, depending on the shape of the multidimensional valley, regularization with a term λρ² added to the loss biases the solution toward small ρ but does not guarantee convergence to the minimum ρ solution, unlike the case of a linear network.

If we assume overparameterized networks with d ≫ n, where d is the number of parameters and N is the number of data points, the study of Cooper [10] proved that the global minima of the unregularized loss function $L_{S} = \sum_{i = 1}^{N} {(f_{W} (x_{i}) - y_{i})}^{2}$ are highly degenerate with dimension d − N. [This result is also what one expects from Bezout theorem for a deep polynomial network. As mentioned in T. Tao’s blog “from the general “soft” theory of algebraic geometry, we know that the algebraic set V is a union of finitely many algebraic varieties, each of dimension at least d − N, with none of these components contained in any other. In particular, in the underdetermined case N < d, there are no zero-dimensional components of V , and, thus, V is either empty or infinite”(see references in [46]).]

Theorem 1

([46], informal). We assume an overparameterized neural network f_W with smooth ReLU activation functions and square loss. Then, the minimizers W^∗ achieve zero loss and are highly degenerate with dimension d − N.

Furthermore, for “large” overparametrization, all the global minima—associated with interpolating solutions—are connected within a unique, large valley. The argument is based on Theorem 5.1 of [47]:

Theorem 2

([47], informal). If the first layer of the network has at least 2N neurons, where N is the number of training data, and if the number of neurons in each subsequent layer decreases, then every sublevel set of the loss is connected.

In particular, the theorem implies that zero-square-loss minima with different values of ρ are connected. A connected single valley of zero loss does not, however, guarantee that SGD with WD will converge to the global minimum, which is now >0, independently of initial conditions.

For large ρ, we expect many solutions. The existence of several solutions for large ρ is based on the following intuition: The last linear layer is enough—if the layer before the linear classifier has more units than the number of training points—to provide solutions for a given set of random weights in the previous layers (for large ρ and small f_i). This also means that the intermediate layers do not need to change much under GD in the iterations immediately after initialization. The emerging picture is a landscape in which there are no zero-loss minima for ρ smaller than a certain minimum ρ, which is network and data dependent. With increasing ρ from ρ = 0, there will be a continuous set of zero-square-loss degenerate minima with the minimizer representing an interpolating (for λ = 0) or almost interpolating solution (for λ > 0). We expect that λ > 0 results in a “pull” toward the minimum ρ₀ within the local degenerate minimum of the loss.

Landscape for λ > 0

In the case of λρ² > 0, the landscape may become a Morse–Bott or Morse function with shallow almost zero-loss minima. The question is open because the regularizer is not the sum of squares.

Gradient dynamics

GF equations

The GF equations are as follows (see also [8]):

\begin{matrix} \overset{\cdot}{ρ} = - \frac{\partial L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial ρ} = \frac{2}{N} \sum_{n} (1 - ρ {\bar{f}}_{n}) {\bar{f}}_{n} - 2 λ ρ, \\ {\overset{\cdot}{V}}_{k} = - \frac{\partial L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial V_{k}} = \frac{2}{N} \sum_{n} (1 - ρ {\bar{f}}_{n}) ρ \frac{\partial {\bar{f}}_{n}}{\partial V_{k}} - 2 ν_{k} V_{k} \end{matrix}

(2)

In the second equation, we can use the unit norm constraint on the ∥V_k∥ to determine the LMs ν_k, using the following structural property of the gradient:

Lemma 2

(Lemma 2.1 of [48]). Let f_W(x) be a ReLU neural network, f_W(x) = W_Lσ(W_{L − 1}…σ(W₁x)) : ℝ^d → ℝ. Then, we can write:

$\forall x \in ℝ^{d} : \sum_{i, j} \frac{\partial f_{W} (x)}{\partial W_{k}^{i, j}} W_{k}^{i, j} = 〈W_{k}, \frac{\partial f_{W} (x)}{\partial W_{k}}〉 = f_{W} (x)$ (3)

The constraint ∥V_k∥² = 1 implies using the lemma above $\frac{\partial ∥ V_{k} ∥^{2}}{\partial t} = V_{k}^{T} {\overset{\cdot}{V}}_{k} = 0$ , which gives

ν_{k} = \frac{1}{N} \sum_{n} (ρ {\bar{f}}_{n} - ρ^{2} f_{n}^{2}) = \frac{1}{N} \sum_{n} ρ {\bar{f}}_{n} (1 - ρ f_{n})

(4)

Thus, the GF is the following dynamical system

\overset{\cdot}{ρ} = \frac{2}{N} [\sum_{n} {\bar{f}}_{n} - \sum_{n} ρ {({\bar{f}}_{n})}^{2}] - 2 λ ρ and {\overset{\cdot}{V}}_{k} = \frac{2}{N} ρ \sum_{n} [(1 - ρ {\bar{f}}_{n}) (- V_{k} {\bar{f}}_{n} + \frac{\partial {\bar{f}}_{n}}{\partial V_{k}})]

(5)

In particular, we can also write

\overset{\cdot}{ρ} = 2 (μ - ρ (M + λ))

(6)

Hence, at critical points (when $\overset{\cdot}{ρ} = 0$ and ${\overset{\cdot}{V}}_{k} = 0$ ), we used the definitions of μ and M,

ρ = ρ_{eq} ≔ \frac{\frac{1}{N} \sum_{n} {\bar{f}}_{n}}{λ + \frac{1}{N} \sum_{n} {\bar{f}}_{n}^{2}} = \frac{μ}{M + λ}

(7)

Thus, the gap to interpolation due to λ > 0 is $ϵ = (ρ_{λ = 0} - ρ_{λ}) μ = 1 - \frac{μ}{M + λ} μ$ that gives

ϵ = 1 - \frac{μ^{2}}{μ^{2} + σ^{2} + λ} = \frac{σ^{2} + λ}{μ^{2} + σ^{2} + λ}

(8)

Notice that since the V_k values are bounded functions, they must take their maximum and minimum values on their compact domain—the sphere—because of the extremum value theorem. In addition, notice that for normalized V_k, $V_{k}^{T} {\overset{\cdot}{V}}_{k} = 0$ always that is V_k can only rotate. If ${\overset{\cdot}{V}}_{k} = 0$ , then the weights V_k are given by

V_{k} = \frac{\sum_{n} ℓ_{n} \frac{\partial f_{n}}{\partial V_{k}}}{\sum ℓ_{n} f_{n}}

(9)

where $ℓ_{n} = 1 - ρ {\bar{f}}_{n}$ . [This overdetermined system of equations—with as many equations as weights—can also be used to reconstruct the training set from the V_k, the y_n, and the f_n.]

Convergence

A favorable property of optimization of the square loss is the convergence of the relevant parameters. With GD, the loss function cannot increase, while the trainable parameters may potentially diverge. A typical scenario of this kind happens with cross-entropy minimization, where the weights typically tend to infinity. In light of the theorems in the Landscape of the empirical risk section, we could hypothetically think of training dynamics in which the loss function’s value $L (ρ, {\{V_{k}\}}_{k = 1}^{L})$ decreases, while ρ oscillates periodically within some interval. As we show next, this is impossible when the loss function’s value converges to zero.

Lemma 3.

Let f_W(x) = ρf_V(x) be a neural network and λ = 0. Assume that during training time, we have ${lim}_{t \to \infty} L (ρ, {\{V_{k}\}}_{k = 1}^{L}) = 0$ and ∀k ∈ [L] : ∥ V_k ∥ = 1. Then, ρ and V_k converge (i.e., $\overset{\cdot}{ρ} \to 0$ and ${\overset{\cdot}{V}}_{k} \to 0$ ).

Proof. Note that if ${lim}_{t \to \infty} L (ρ, {\{V_{k}\}}_{k = 1}^{L}) = 0$ , then, for all n ∈ [N], we have (ρf_n − y_n)² → 0. In particular, ρf_n → y_n and $ρ {\bar{f}}_{n} \to 1$ . Hence, we conclude that μρ → 1. Therefore, by Lemma 4, $ρ \overset{\cdot}{ρ} \to 0$ . We note that ρ → 0 would imply ρf_n → 0 that contradicts $L (ρ, {\{V_{k}\}}_{k = 1}^{L}) \to 0$ , since the labels y_n are nonzero. Therefore, we conclude that $\overset{\cdot}{ρ} \to 0$ . To see why ${\overset{\cdot}{V}}_{k} \to 0$ , we recall that

{\overset{\cdot}{V}}_{k} = \frac{2}{N} ρ \sum_{n} [(1 - ρ {\bar{f}}_{n}) (- V_{k} {\bar{f}}_{n} + \frac{\partial {\bar{f}}_{n}}{\partial V_{k}})]

(10)

We note that ∥V_k ∥ = 1, $∣ {\bar{f}}_{n} ∣ = 1$ , and $\frac{\partial {\bar{f}}_{n}}{\partial V_{k}}$ is bounded (assuming that ∀n ∈ [N] : ∥ x_n ∥ ≤ 1 and ∀k ∈ [L] : ∥ V_k ∥ = 1). Hence, since ρ converges, $ρ {\bar{f}}_{n} \to 1$ , implying (for λ = 0) ${\overset{\cdot}{V}}_{k} \to 0$ .

So far, we have assumed convergence of GF, GD, or SGD to zero loss. Convergence does not seem too far-fetched given overparametrization and the associated high degeneracy of the global minima (see Landscape of the empirical risk section and theorems there). Proofs of convergence of descent methods have been, however, lacking until a recent paper [49] presented a new criterion for convergence of GD and used to show that GD with proper initialization converges to a global minimum. The result has technical limitations that are likely to be lifted in the future: It assumes that the activation function is smooth, that the input dimension is greater than or equal to the number of data points, and that the descent method is GF or GD.

Qualitative dynamics

We consider the dynamics of model in Fig. 1B. During training the norm of each layer, weight matrix is kept constant by the LM constraint that is applied to all layers but the last one, thus leaving ρ at the top to change depending on the dynamics. Recall that $\forall n \in [N] : 0 \leq ∣ {\bar{f}}_{n} ∣ \leq 1$ because the assumption ∥x ∥ ≤ 1 yields ∥f(x) ∥ ≤ 1 by taking into account the definition of ReLUs and the fact that matrix norms are submultiplicative. Depending on the number of layers, the maximum margin that the network can achieve for a given dataset is usually much smaller than the upper bound 1, because the weight matrices have unit norm and the bound ≤1 is conservative. Thus, to guarantee interpolation, namely, ρf_ny_n = 1, ρ must be substantially larger than 1. For instance, in the experiments plotted in this paper, the maximal ${\bar{f}}_{n}$ is ≈0.002, and, thus, the ρ needed for interpolation (for λ = 0) is in the order of 500. We assume then that for a given dataset, there is a maximal value of y_nf_n that allows interpolation. Correspondingly, there is a minimum value of ρ that we call, as mentioned earlier, ρ₀.

We now provide some intuition for the dynamics of the model. Notice that ρ(t) = 0 and f_V(x) = 0 (if all weights are zero) are critical unstable points. A small perturbation will either result in $\overset{\cdot}{ρ} < 0$ with ρ going back to zero or in ρ growing if the average margin is just positive, that is, μ > λρ > 0.

Small ρ initialization

First, we consider the case where the neural network is initialized with a smallish ρ, that is, ρ < ρ₀. Assume then that at some time t, μ > 0, that is, average separability holds. Notice that if the f_n values were zero-mean, random variables, then there would be a 50% chance for average separability to hold. Then, Eq. 5 shows that $\overset{\cdot}{ρ} > 0$ . If full separability takes place, that is, ∀n : f_n > 0, then $\overset{\cdot}{ρ}$ remains positive at least until ρ = 1. This is because Eq. 5 implies that $\overset{\cdot}{ρ} \geq 2 (μ - ρ μ)$ since M ≤ μ. In general, assuming eventual convergence, ρ may grow nonmonotonically, that is, there may oscillations in ρ for “short” intervals, until it converges to ρ₀.

To see this, consider the following lemma that gives a representation of the loss function in terms of ρ, $\overset{\cdot}{ρ}$ , and μ.

Lemma 4.

Let f_W(x) = ρf_V(x) be a neural network, with ∀k ∈ [L] : ∥ V_k ∥ = 1. The square loss can be written as $L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L}) = 1 - ρ (\frac{1}{2} \overset{\cdot}{ρ} + μ)$ .

Proof. First, we consider that

\begin{matrix} L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L}) & = \frac{1}{N} \sum_{n} {(ρ f_{n} - y_{n})}^{2} + \sum_{k = 1}^{L} ν_{k} (∥ V_{k} ∥^{2} - 1) + λ ρ^{2} \\ = \frac{1}{N} (ρ^{2} f_{n}^{2} - 2 y_{n} ρ f_{n} + y_{n}^{2}) + λ ρ^{2} \\ = 1 - 2 ρ μ + ρ^{2} M + λ ρ^{2} \end{matrix}

(11)

where the second equation follows from ∀k ∈ [L] : ∥ V_k ∥ = 1 and the third equation follows from $y_{n}^{2} = 1$ , using the previous definitions $μ = \frac{1}{N} \sum_{n} {\bar{f}}_{n}$ and $M = \frac{1}{N} \sum_{n} {\bar{f}}_{n}^{2}$ . On the other hand, by Eq. 6, $\overset{\cdot}{ρ} = 2 μ - 2 ρ M - 2 λ ρ$ that gives $2 ρ M = 2 μ - 2 λ ρ - \overset{\cdot}{ρ}$ . Therefore, we conclude that $L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L}) = 1 - \frac{1}{2} ρ \overset{\cdot}{ρ} - ρ μ = 1 - ρ (\frac{1}{2} \overset{\cdot}{ρ} + μ)$ as desired.

Following this lemma, if $\overset{\cdot}{ρ}$ becomes negative during training, then the average margin μ must increase since GD cannot increase but only decrease $L$ . In particular, this implies that $\overset{\cdot}{ρ}$ cannot be negative for long periods of time. Notice that short periods of decreasing ρ are “good” since they increase the average margin.

If $\overset{\cdot}{ρ}$ turns negative, then it means that it has crossed $\overset{\cdot}{ρ} = 0$ . This may be a critical point for the system if the values of V_k corresponding to ${\overset{\cdot}{V}}_{k} = 0$ are compatible (since the matrices ${\{V_{k}\}}_{k = 1}^{L}$ determine the value of ${\bar{f}}_{n}$ ). We assume that this critical point—either a local minimum or a saddle—can be avoided by the randomness of SGD or by an algorithm that restarts optimization when a critical point is reached for which $L > 0$ .

Thus, ρ grows (nonmonotonically) until it reaches an equilibrium value, close to ρ₀. Recall that for λ = 0, this corresponds to a degenerate global minimum $L = 0$ , usually resulting in a large attractive basin in the loss landscape. For λ = 0, a zero value of the loss ( $L = 0$ ) implies interpolation: Thus, all the f_n have the same value, that is, all the margins are the same.

Large ρ initialization

If we initialize a network with large norm ρ > ρ₀, then Eq. 1 shows that $\overset{\cdot}{ρ} < 0$ . This implies that the norm of the network will decrease until, eventually, an equilibrium is reached. In fact, since ρ ≫ 1, it is likely that there exists an interpolating (or near interpolating) solution with ρ that is very close to the initialization. In fact, for large ρ, it is usually empirically possible to find a set of weights V_L, such that $ρ {\bar{f}}_{n} \approx 1$ . To understand why this may be true, recall that if there are at least N units in the top layer of the network (layer L) with given activities and ρ ≫ ρ₀, then there exist values of V_L that yield interpolation due to Theorem 2. In other words, it is easy for the network to interpolate with small values ${\bar{f}}_{n}$ . These large ρ, small ${\bar{f}}_{n}$ solutions are reminiscent of the NTK solutions [24], where the parameters do not move too far from their initialization. A formal version of the same argument is based on the following result.

We now assume that the network in the absence of WD has converged to an interpolating solution

Lemma 5.

Let f_V be a neural network with weights ${\{V_{k}\}}_{k = 1}^{L}$ , such that, $\forall n \in [N] : ρ {\bar{f}}_{n} = ρ μ^{*} = 1$ . Further assume that the classifier V_L and the last layer features h are aligned, i.e., y_n〈V_L, h(x_n)〉 = ‖h(x_n)‖₂, where the vector h denotes the activities of the units in the last layer. Then, perturbing V_L into another unit-norm vector $V_{L}^{'} \in ℝ^{p}$ , such that $V_{L}^{T} V_{L}^{'} = α \in (0, 1)$ yields a neural network $\hat{f} (x) = 〈 V_{L}^{'}, h (x) 〉$ with the property that $\frac{ρ}{α} \hat{f}$ is an interpolating solution, corresponding to a critical point of the gradient but with a larger ρ.

Proof. Consider the margins of the network $\hat{f} (x) = 〈 V_{L}^{'}, h (x) 〉$ . We conclude that ${\bar{\hat{f}}}_{n} = y_{n} 〈 V_{L}^{'}, h (x_{n}) 〉$ . Since the classifier weights and the last layer features are aligned (as it may happen for λ → 0), we have that y_nh(x_n) = ‖h(x_n)‖ × V_L. This means ${\bar{\hat{f}}}_{n} = ‖ h (x_{n}) ‖ \times 〈 V_{L}^{'}, V_{L} 〉$ . We also have from the interpolating condition that $ρ {\bar{f}}_{n} = ρ μ^{*} = 1$ , which means $‖ h (x_{n}) ‖ = \frac{1}{ρ}$ . Putting all this together, we have $\frac{ρ}{α} {\bar{\hat{f}}}_{n} = 1$ , which concludes the proof.

Thus, if a network exists providing an interpolating solution with a minimum ρ and V_L ∝ h, there exist networks that differ only in the last V_L layer and are also interpolating but with larger ρ. As a consequence, there is a continuum of solutions that differ only in the weights V_L of the last layer.

Of course, there may be interpolating solutions corresponding to different sets of weights in layers below L, to which the above statement does not apply. These observations suggest that there is a valley of minimizers for increasing ρ, starting from a zero-loss minimizer that has the NC property (see Neural Collapse).

In Fig. 3, we show the dynamics of ρ alongside train loss and test error. We show results with and without WD in the top and bottom rows of Fig. 3, respectively. $L_{S}$ decreases with μ increasing and σ decreasing. The figures show that in our experiments, the large margins of some of the data points decrease during GD, contributing to a decrease in σ. Furthermore, Eq. 11 suggests that for small ρ, the term dominating the decrease in $L_{S}$ is −2ρμ. For larger ρ, the term ρ²M = ρ²(σ² + μ²) becomes important: Eventually, $L_{S}$ decreases because σ² decreases. The regularization term, for standard small values of λ, is relevant only in the final phase, when ρ is in the order of ρ₀. For λ = 0, the loss at the global equilibrium (which happens at ρ = ρ₀) is $L_{S} = 0$ (since $μ = \frac{1}{ρ_{0}}$ , M = μ², and σ² = 0).

To sum up, starting from small initialization, gradient techniques will explore critical points with ρ growing from zero. Thus, quasi-interpolating solutions with small ρ (corresponding to large margin solutions) may be found before the many large ρ quasi-interpolating solutions that have worse margins (see Fig. 3, top and bottom rows). This dynamics can take place even in the absence of regularization; however, λ > 0 makes the process more robust and bias it toward small ρ.

Generalization: Rademacher Complexity of Convolutional Layers

Classical Rademacher bounds

In this section, we analyze the test performance of the learned neural network. Following the standard learning setting, we assume that there is some underlying distribution P of labeled samples (x, y) and the training data $S = {\{(x_{i}, y_{i})\}}_{i = 1}^{N}$ consist of N independent and identically distributed samples from P. The model f_W is assumed to perfectly fit the training samples, i.e., f_W(x_i) = y_i = ± 1.

We would like to upper bound the classification error $err (f_{W}) ≔ E_{(x, y) \sim P} [I [sign (f_{W} (x)) \neq y]]$ of the learned function f_W in terms of the number of samples N and the norm ρ of f_W.

This analysis is based on the following data-dependent measure of the complexity of a class of functions.

Definition

Rademacher complexity. Let ℍ be a set of real-valued functions $h : X \to ℝ$ defined over a set $X$ . Given a fixed sample $S \in X^{m}$ , the empirical Rademacher complexity of ℍ is defined as follows:

$R_{S} (ℍ) ≔ \frac{2}{m} E_{σ} [sup_{h \in ℍ} | \sum_{i = 1}^{m} σ_{i} h (x_{i}) |]$

The expectation is taken over σ = (σ₁, …, σ_m), where, σ_i ∈ {±1} are independent and identically distributed and uniformly distributed samples.

The Rademacher complexity measures the ability of a class of functions to fit noise. The empirical Rademacher complexity has the added advantage that it is data dependent and can be measured from finite samples.

Theorem 3.

Let P be a distribution over ℝ^d × {±1}. Let $F = \{f_{W}| \prod_{i = 1}^{L} ∥ W_{i} ∥ \leq 1\}$ . Let $S = {\{(x_{i}, y_{i})\}}_{i = 1}^{N}$ be a dataset of independent and identically distributed samples selected from P. Then, with probability at least 1 − δ over the selection of $S$ , for any f_W that perfectly fits the data (i.e., f_W(x_i) = y_i), we have

$er r_{P} (f_{W}) \leq 2 (ρ + 1) \cdot R_{S} (F) + 3 \sqrt{\frac{log (2 {(ρ + 1)}^{2} / δ)}{2 N}}$ (12)

Proof. Let t ∈ ℕ ∪ {0} and $G_{t} = \{f_{W}| \prod_{i = 1}^{L} ∥ W_{i} ∥_{2} \in [t, t + 1]\}$ We consider the ramp loss function

ℓ_{ramp} (y, y^{'}) = \{\begin{matrix} 1, & if {yy}^{'} \leq 0, \\ 1 - {yy}^{'}, & if 0 \leq {yy}^{'} \leq 1, \\ 0, & if {yy}^{'} \geq 1 \end{matrix}

By Theorem 3.3 in [50], for any t ∈ ℕ ∪ {0}, with probability at least $1 - \frac{δ}{t (t + 1)}$ , for any function $f_{W} \in G_{t}$ , we have

E_{(x, y)} [ℓ_{ramp} (f_{W} (x), y)] \leq \frac{1}{N} \sum_{i = 1}^{N} ℓ_{ramp} (f_{W} (x_{i}), y_{i}) + 2 R_{S} (G_{t}) + 3 \sqrt{\frac{log (2 {(t + 1)}^{2} / δ)}{2 N}}

(13)

We note that for any function f_W for which f_W(x_i) = y_i = ± 1, we have ℓ_ramp(f_W(x_i), y_i) = 0. In addition, for any function f_W and pair (x, y), we have ℓ_ramp(f_W(x), y) ≥ I[sign(f_W(x)) ≠ y]. Therefore, we conclude that with probability at least $1 - \frac{δ}{t (t + 1)}$ , for any function $f_{W} \in G_{t}$ , we have

er r_{P} (f_{W}) \leq 2 R_{S} (G_{t}) + 3 \sqrt{\frac{log (2 {(t + 1)}^{2} / δ)}{2 N}}

(14)

We notice that by the homogeneity of ReLU neural networks, we have $R_{S} (G_{t}) \leq (t + 1) \cdot R_{S} (F)$ . By union bound over all t ∈ ℕ ∪ {0}, Eq. 14 holds uniformly for all t ∈ ℕ ∪ {0} and $f_{W} \in G_{t}$ with probability at least 1 − δ. For each f_W with $\prod_{i = 1}^{L} ∥ W_{i} ∥_{2} = ρ$ , we can apply the bound with t = ⌊ρ⌋ since $f_{W} \in G_{t}$ and obtain the desired bound,

\begin{matrix} er r_{P} (f_{W}) & \leq 2 (t + 1) \cdot R_{S} (G_{t}) + 3 \sqrt{\frac{log (2 {(t + 1)}^{2} / δ)}{2 N}} \\ \leq 2 (ρ + 1) \cdot R_{S} (F) + 3 \sqrt{\frac{log (2 {(ρ + 1)}^{2} / δ)}{2 N}} \end{matrix}

(15)

The above theorem provides an upper bound on the classification error of the trained network f_W that perfectly fits the training samples. The upper bound is decomposed into 2 main terms. The first term is proportional to the norm of the trained model ρ and the Rademacher complexity of $F$ that is the set of the normalized neural networks and the second term scales as $\sqrt{log (ρ / δ) / N}$ . As shown in Theorem 1 in [51], this term is upper bounded by $R_{S} (F) \leq (\sqrt{2 log (2) L} + 1) / \sqrt{{N}}$ , assuming that the samples are taken from the d-dimensional ball $B_{d}$ of radius 1. The overall bound is then (assuming zero training error)

er r_{P} (f_{W}) \leq \frac{2 (ρ + 1) (\sqrt{2 log (2) L} + 1)}{\sqrt{N}} + 3 \sqrt{\frac{log (2 {(log (ρ) + 1)}^{2} / δ)}{2 N}}

(16)

We note that while the mentioned bound on $ℝ_{N} (F)$ depends on the architecture of the network, it does not depend in an explicit way on the training set. However, as shown in Eq. 6 in [51], the bound may be improved further if the matrices’ stable rank is low, which happens with low rank of the weight matrices. In practice, the value of $ℝ_{N} (F)$ depends not only on the network architecture (e.g., convolutional) but also on the underlying optimization (e.g., L₂ versus L₁) and on the data (e.g., rank).

Relative generalization

We now consider 2 solutions with zero empirical loss of the square loss regression problem obtained with the same ReLU deep network and corresponding to 2 different minima with 2 different ρ values. Let us call them g^a(x) = ρ_af^a(x) and g^b(x) = ρ_bf^b(x). Using the notation of this paper, the functions f_a and f_b correspond to networks with normalized weight matrices at each layer.

Let us assume that ρ_a < ρ_b.

We now use Eq. 16 and the fact that the empirical ${\hat{L}}_{γ}$ for both functions is the same to write $L_{0} (f^{a}) = L_{0} (F^{a}) \leq c_{1} ρ_{a} ℝ_{N} (\tilde{F}) + c_{2} \sqrt{\frac{ln (\frac{1}{δ})}{2 N}}$ and $L_{0} (f^{b}) = L_{0} (F^{b}) \leq c_{1} ρ_{b} ℝ_{N} (\tilde{F}) + c_{2} \sqrt{\frac{ln (\frac{1}{δ})}{2 N}}$ . The bounds have the form

L_{0} (f^{a}) \leq A ρ_{a} + ϵ,

(17)

and

L_{0} (f^{b}) \leq A ρ_{b} + ϵ .

(18)

Thus, the upper bound for the expected error L₀(f^a) is better than the bound for L₀(f^b). Of course, this is just an upper bound. As a consequence, this result does not guarantee that a solution with smaller ρ will always have a smaller expected error than a solution with larger ρ.

Notice that this generalization claim is just a relative claim about different solutions obtained with the same network trained on the same training set.

Figure 4 shows clearly that increasing the percentage of random labels increases the ρ that is needed to maintain interpolation—thus decreasing the margin—and that, at the same time, the test error increases, as expected. This monotonic relation between margin and accuracy at test seems to break down for small differences in margin as shown in Fig. 5, although the significance of the effect is unclear. Of course, this kind of behavior is not inconsistent with an upper bound.

Fig. 4. — Mean 1/ρ and test error results over 10 runs for binary classification on CIFAR10 trained with LM and different percentages of random labels (r = 20%, 40%, 60%, and 80%), initialization scale of 1, and WD of 0.001. As mentioned in the text, the norm of the convolutional layers is just the norm of the filters. (Note that this network fails to get convergence with 100% random labels.)

Fig. 5. — Scatter plots for 1/ρ and mean test accuracy based on 10 runs for binary classification on CIFAR10 using LM normalization (LN), square loss, and WD (left) and without WD (right). In the left figure, the network was trained with different initialization scales (init. = [0.9, 1, 1.2, 1.3]) and with WD (λ = 1 × 10⁻³), while in the right figure, the network was trained with init. = [0.8, 0.9, 1, 1.3, 1.5] and no WD (λ = 0). The horizontal and vertical error bars correspond to the standard deviations of 1/ρ and mean test accuracy computed over 10 runs for different initializations, while the square dots correspond to the mean values. When λ > 0, the coefficient (R²), P value and slope for linear regression between 1/ρ and mean test accuracy are: R² = 0.94, P = 0.031, and slope = −18.968; when λ = 0, the coefficient R² = 0.004, P = 0.92, and slope = −2.915.

Novel bounds for sparse networks

In the Classical Rademacher bounds section, we describe generic bounds on the Rademacher complexity of deep neural networks. In these cases, ρ measures the product of the Frobenius norms of the network’s weight matrices in each layer. For convolutional networks, however, the operation in each layer is computed with a kernel, described by the vector w, that acts on each patch of the input separately. Therefore, a convolutional layer is represented by a Toeplitz matrix W, whose blocks are each given by w. A naive application of Eq. 16 to convolutional networks give a large bound, where the Frobenius norm of the Toeplitz matrix is equivalent to norm of the kernel multiplied by the number of patches.

In this section, we provide an informal analysis of the Rademacher complexity, showing that it can be reduced by exploiting the first one of the 2 properties of convolutional layers: (a) the locality of the convolutional kernels and (b) weight sharing. These properties allow us to bound the Rademacher complexity by taking the products of the norms of the kernel w instead of the norm of the associated Toeplitz matrix W. Here, we outline the results with more precise statements and proofs to be published separately.

We consider the case of one-dimensional convolutional networks with nonoverlapping patches and one channel per layer. For simplicity, we assume that the input of the network lies in ℝ^d, with d = 2^L and the stride and the kernel of each layer are 2. The analysis can be easily extended to kernels of different sizes. This means that the network h(x) can be represented as a binary tree, where the output neuron is computed as $W^{L} \cdot σ (v_{1}^{L} (x), v_{2}^{L} (x))$ , $v_{1}^{L} (x) = W^{L - 1} \cdot σ (v_{1}^{L - 1} (x), v_{2}^{L - 1} (x))$ , $v_{2}^{L} (x) = W^{L - 1} \cdot σ (v_{3}^{L - 1} (x), v_{4}^{L - 1} (x))$ , and so on. This means that we can write the ith row of the Toeplitz matrix of the lth layer (0, …, 0, −W^l−, 0…, 0), where W^l appears on the 2ⁱ − 1 and 2ⁱ coordinates. We define a set $H$ of neural networks of this form, where each layer is followed by a ReLU activation function and $\prod_{l = 1}^{L} W^{l} \leq ρ$ .

Theorem 4.

Let $H$ be the set of binary-tree-structured neural networks over ℝ^d, with d = 2^L for some natural number L. Let X = {x₁, …, x_N} ⊂ ℝ^d be a set of samples. Then,

$R_{X} (H) \leq \frac{2^{L} ρ \sqrt{\sum_{i = 1}^{N} ∥ x_{i} ∥^{2}}}{N}$ (19)

Proof sketch. First, we rewrite the Rademacher complexity in the following manner:

\begin{matrix} ℛ_{X} (ℋ) & = E_{ϵ} sup_{h \in ℋ} |\frac{1}{N} \sum_{i = 1}^{N} ϵ_{i} \cdot h (x_{i})| \\ = E_{ϵ} sup_{h \in ℋ} \frac{1}{N} |\sum_{i = 1}^{N} ϵ_{i} \cdot W^{L} \cdot σ (v_{1} (x), v_{2} (x))| \\ = E_{ϵ} sup_{h \in ℋ} \frac{1}{N} \sqrt{{|\sum_{i = 1}^{N} ϵ_{i} \cdot W^{L} \cdot σ (v_{1} (x), v_{2} (x))|}^{2}} \end{matrix}

(20)

Next, by the proof of Lemma 1 in [51], we obtain that

\begin{matrix} R_{X} (H) & \leq 2 E_{ϵ} sup_{h \in H} \frac{1}{N} \sqrt{∥ W^{L} ∥^{2} \cdot ∥ \sum_{i = 1}^{N} ϵ_{i} (v_{1} (x), v_{2} (x)) ∥^{2}} \\ = E_{ϵ} sup_{h \in H} \frac{1}{N} \sqrt{∥ W^{L} ∥^{2} \cdot \sum_{j = 1}^{2} ∥ \sum_{i = 1}^{N} ϵ_{i} v_{j} (x_{i}) ∥^{2}} \end{matrix}

(21)

By applying this peeling process L times, we obtain the following inequality:

\begin{matrix} R_{X} (H) \leq 2^{L - 1} E_{ϵ} sup_{h \in H} \frac{1}{N} \sqrt{\prod_{l = 1}^{L} ∥ W^{l} ∥^{2} \cdot \sum_{j = 1}^{d} ∥ \sum_{i = 1}^{N} ϵ_{i} x_{ij} ∥^{2}} \\ = 2^{L - 1} E_{ϵ} sup_{h \in H} \frac{1}{N} \sqrt{\prod_{l = 1}^{L} ∥ W^{l} ∥^{2} \cdot ∥ \sum_{i = 1}^{N} ϵ_{i} x_{i} ∥^{2}} \\ \leq \frac{2^{L - 1} ρ E_{ϵ} ∥ \sum_{i = 1}^{N} ϵ_{i} x_{i} ∥}{N} \\ \leq \frac{2^{L - 1} ρ \sqrt{\sum_{i = 1}^{N} ∥ x_{i} ∥^{2}}}{N} \end{matrix}

(22)

where the factor 2^{L − 1} is obtained because the last layer is linear (see [52]). We note that a better bound can achieved when using the reduction introduced in [51], which would give a factor of $\sqrt{2 log (2) L} + 1$ instead of 2^{L − 1}.

Thus, one ends up with a bound scaling as the product of the norms of the kernel at each layer. The constants may change depending on the architecture, the number of patches, the size of the patches, and their overlap.

This special nonoverlapping case can be extended to the general convolutional case. In fact, a proof of the following conjecture will be provided in [53].

Conjecture 1.

If a convolutional layer has overlap among its patches, then the nonoverlap bound

$R_{N} (H_{L}) \leq 2^{L - 1} ρ ∥ x ∥$ (23)

where ρ is the product of the norms of the kernels at each layer becomes

$R_{N} (H_{L}) \leq 2^{L - 1} ρ \sqrt{\frac{K}{K - O}} ∥ x ∥,$ (24)

where K is the size of the kernel (number of components) and O is the size of the overlap.

Sketch proof. Call P the number of patches and O the overlap. With no overlap, then PK = D, where D is the dimensionality of the input to the layer. In general, $P = \frac{D - O}{K - O}$ . It follows that a layer with the most overlap can add at most $< ∥ x ∥ \sqrt{K}$ to the bound. Notice that we assume that each component of x_i averaged across i will have norm $\sqrt{\frac{1}{d}}$ .

The bound is surprisingly small

In this section, we have derived bounds for convolutional networks that may potentially be orders of magnitude smaller than equivalent similar bounds for dense networks. We note that a naive application of Corollary 2 in [29] for the network that we used in Theorem 4 would require treating the network as if it were a dense network. In this case, the bound would be proportional to the product of the norms of each of the Toeplitz matrices in the network individually. In this case, the total bound becomes

\frac{2^{L} \sqrt{\prod_{l = 1}^{L} (2^{l})} ρ \sqrt{\sum_{i = 1}^{N} ∥ x_{i} ∥^{2}}}{N} = \frac{2^{0.25 L^{2} + 1.25 L} ρ \sqrt{\sum_{i = 1}^{N} ∥ x_{i} ∥^{2}}}{N}

(25)

which is much larger than the bound we obtained earlier. The key point is that the Rademacher bounds achievable for sparse networks are much smaller than for dense networks. This suggests that convolutional network with local kernels may generalize much better than dense network, which is consistent in spirit with approximation theory results (compositionally sparse target functions can be approximated by sparse networks without incurring in the curse of dimensionality, whereas generic functions cannot be approximated by dense networks without the curse). They also confirm the empirical success of convolutional networks compared to densely connected networks.

It is also important to observe that the bounds we obtained may be nonvacuous in the overparameterized case, unlike Vapnik–Chervonenkis bounds that depend on the number of weights and are therefore always vacuous in overparameterized situations. With our norm-based bounds, it is, in principle, possible to have overparametrization and interpolation simultaneously with nonvacuous generalization bounds: This is suggested by Fig. 6. Figure 7 shows the case of a 3-layer convolutional network with a total number of parameters of ≈20,000.

Fig. 6. — Product norm (ρ) and test error with respect to different training data sizes (N) for the 6-layer model trained with LM and square loss. The initialization scale is 0.1, WD λ = 10⁻³, no biases, the initial learning rate is 0.03 with cosine annealing scheduler; we used the SGD optimizer (momentum =0.9) and test data size =2, 000 in a binary classification task on CIFAR10 dataset. (A) The table shows the product norm ρ, mean training errors, mean test errors (average over the last 100 epochs), and generalization upper bound for different N. (B) A bar plot for the generalization gap for different N. (C) Generalization error upper bound is proportional to ( $\frac{ρ}{\sqrt{N}}$ ). The bounds are vacuous but “only” by an order of magnitude, while other bounds based on the number of parameters (here, 3,519,335) are typically much looser.

Fig. 7. — Product norm (ρ) and test error with respect to different training data sizes (N) for the 3-layer model (with nonoverlapped convolutional image patches, kernel size = 3 × 3, and stride = 3) trained with LM and square loss. The initialization scale is 0.1, WD λ = 0.001, no biases, batch size is 32, and the initial learning rate is 0.03 with cosine annealing scheduler; we used the SGD optimizer (momentum = 0.9) and test data size = 2,000 in a binary classification task on CIFAR10 dataset. (A) The table shows the product norm ρ, mean training errors, mean test errors (average over the last 100 epochs), and generalization upper bound for different N . (B) A bar plot for the generalization gap for different N . (C) Generalization error upper bound is a constant (see text) times ( $\frac{ρ}{\sqrt{N}}$ ). The bounds are almost not vacuous depending on the constant (see text).

Neural Collapse

A recent paper [12] described 4 empirical properties of the terminal phase of training (TPT) deep networks, using the cross-entropy loss function. TPT begins at the epoch where training error first vanishes. During TPT, the training error stays effectively zero, while training loss progressively decreases. Direct empirical measurements expose an inductive bias that they call NC, involving 4 interconnected phenomena. Informally, (NC1) cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class means. (NC2) The class means collapse to the vertices of a simplex ETF. (NC3) Up to rescaling, the last-layer classifiers collapse to the class means or, in other words, to the simplex ETF (i.e., to a self-dual configuration). (NC4) For a given activation, the classifier’s decision collapses to simply choose whichever class has the closest train class mean (i.e., the nearest class center decision rule).

We now formally define the 4 NC conditions. We consider a network f_W(x) = W_Lh(x), where h(x) ∈ ℝ^p denotes the last layer feature embedding of the network and W_L ∈ ℝ^{C × p} contains the parameters of the classifier. The network is trained on a C-class classification problem on a balanced dataset $S = {\{(x_{cn}, y_{cn})\}}_{n = 1, c = 1}^{N, C}$ with N samples per class. We can compute the per-class mean of the last layer features as follows:

μ_{c} = \frac{1}{N} \sum_{n = 1}^{N} h (x_{cn})

(26)

The global mean of all features as follows:

μ_{G} = \frac{1}{C} \sum_{c} μ_{c} = \frac{1}{NC} \sum_{c = 1, n = 1}^{C, N} h (x_{cn}) .

Furthermore, the second-order statistics of the last layer features are computed as follows:

\begin{matrix} Σ_{W} = \frac{1}{C} \sum_{c = 1}^{C} \frac{1}{N} \sum_{n = 1} (h (x_{cn}) - μ_{c}) {(h (x_{cn}) - μ_{c})}^{⊤} \\ Σ_{B} = \frac{1}{C} \sum_{c = 1}^{C} (μ_{c} - μ_{G}) {(μ_{c} - μ_{G})}^{⊤} \\ Σ_{T} = \frac{1}{NC} \sum_{c = 1, n = 1}^{C, N} (h (x_{cn}) - μ_{G}) {(h (x_{cn}) - μ_{G})}^{⊤} \end{matrix}

(27)

Here, Σ_W measures the within-class covariance of the features, Σ_B is the between-class covariance, and Σ_T is the total covariance of the features (Σ_T = Σ_W + Σ_B).

We can now list the formal conditions for NC:

•
NC1 (variability collapse). Variability collapse states that the variance of the feature embeddings of samples from the same class tends to zero, or formally, Tr(Σ_W) → 0.
•
NC2 (convergence to simplex ETF). |∥μ_c − μ_G∥₂ − ∥ μ_c′ − μ_G∥₂| → 0, or the centered class means of the last layer features become equinorm. Moreover, if we define ${\tilde{μ}}_{c} = \frac{μ_{c} - μ_{G}}{∥ μ_{c} - μ_{G} ∥_{2}}$ , then we have $〈{\tilde{μ}}_{c}, {\tilde{μ}}_{c'}〉 = - \frac{1}{C - 1}$ for c ≠ c^′, or the centered class means are also equiangular. The equinorm condition also implies that $\sum_{c} {\tilde{μ}}_{c} = 0$ , i.e., the centered features lie on a simplex.
•
NC3 (self-duality). If we collect the centered class means into a matrix M = [μ_c − μ_G], then we have $||\frac{W^{⊤}}{∥ W ∥_{F}} - \frac{M}{∥ M ∥_{F}}|| \to 0$ , or the classifier W and the last layer feature means M become duals of each other.
•
NC4 (nearest center classification). The classifier implemented by the deep network eventually boils down to choosing the closest mean last layer feature $argma x_{c} 〈W_{L}^{c}, h (x)〉 \to argmi n_{c} ∥ h (x) - μ_{c} ∥_{2}$ .

Related Work on NC

Since the empirical observation of NC was made in [12], a number of papers have studied the phenomenon in the so-called unconstrained features regime [32–34,39,40]. The basic assumption underlying these proofs is that the features of a deep network at the last layer can essentially be treated as free optimization variables, which converts the problem of finding the parameters of a deep network that minimize the training loss, into a matrix factorization problem of factoring one-hot class labels Y ∈ ℝ^{C × CN} into the last layer weights W ∈ ℝ^{C × p} and the last layer features H ∈ ℝ^{p × CN}. In the case of the squared loss, the problem that they study is min_{W, H} ∥ WH − Y∥² + λ_W ∥ W∥² + λ_H ∥ H∥².

In this section, we show instead that we can predict the existence of NC and its properties as a consequence of our analysis of the dynamics of SGD on deep binary classifiers trained on the square loss function with LN and WD without any additional assumption. We first consider the case of binary classification and show that NC follows from the analysis of the dynamics of the square loss in the previous sections. The loss function is the same one defined in Eq. 1, and we consider minimization using SGD with a batch size of 1. After establishing NC in this familiar setting, we consider the multiclass setting where we derive the conditions of NC from an analysis of the squared loss function with WD and WN.

Binary classification

We prove in this section that NC follows from the following property of the landscape of the squared loss that we analyzed in the previous section:

Property 1

[symmetric quasi-interpolation (binary classification)]. Consider a binary classification problem with inputs in a feature space $X$ and label space {+1, −1}. A classifier $f_{W} : X \to ℝ$ symmetrically quasi-interpolates a training dataset $S = {\{(x_{n}, y_{n})\}}_{n = 1}^{N}$ if, for all training examples, ${\bar{f}}_{W}_{n} = y_{n} f_{W} (x_{n}) = 1 - ϵ$ , where ϵ is the interpolation gap.

We prove first that the property follows without any assumption at convergence from our previous analysis of the landscape of the squared loss for binary classification.

Lemma 6.

An overparameterized deep ReLU network for binary classification trained to convergence under the squared loss in the presence of WD and WN satisfies the symmetric quasi-interpolation property. Furthermore, the gap to interpolation of the regularized network is $ϵ = \frac{λ}{μ^{2} + λ}$ where $μ = \frac{1}{N} \sum_{i} {\bar{f}}_{i}$ .

Proof. Consider the regularized square loss $L_{S} = \frac{1}{N} \sum_{i = 1}^{N} {(ρ {\bar{f}}_{i} - 1)}^{2} + λ ρ^{2}$ . We recall the definitions made earlier in in the Classification with square loss minimization" section of the margin ${\bar{f}}_{i} = y_{i} f_{i}$ , and the first- and second-order sample statistics of the margin $μ = \frac{1}{N} \sum_{i = 1}^{N} {\bar{f}}_{i}, M = \frac{1}{N} \sum_{i = 1}^{N} {\bar{f}}_{i}^{2}, σ^{2} = M - μ^{2}$ . We consider deep networks that are sufficiently overparameterized to attain 100% accuracy on the training dataset, which means ${\bar{f}}_{i} > 0$ . Since the weights of the deep network ${\{V_{k}\}}_{k = 1}^{L}$ are normalized and the data x_i lie within the unit norm ball, we have that $∣ {\bar{f}}_{i} ∣ \leq 1$ . Although ${\bar{f}}_{i}$ could take values close to 1, the typically observed values of ${\bar{f}}_{i}$ in our experiments are approximately 5 × 10⁻³. For our purposes, it suffices to note that there exists a maximum possible margin, such that $0 < {\bar{f}}_{i} \leq \bar{μ}$ for all training examples for a given dataset and network architecture.

Using these definitions, we can rewrite the deep network training problem as follows:

{min}_{ρ, {\{V_{k}\}}_{k = 1}^{L}} L_{S} = (M + λ) ρ^{2} - 2 ρ μ + 1 .

(28)

All critical points (including minima) need to satisfy $\frac{\partial L_{S}}{\partial ρ} = 0$ , from which we get $ρ = \frac{μ}{M + λ}$ . If we plug this back into the loss, then our minimization problem becomes:

\begin{matrix} {min}_{{\{V_{k}\}}_{k = 1}^{L}} (M + λ) {(\frac{μ}{M + λ})}^{2} - 2 \frac{μ^{2}}{M + λ} + 1 \\ = {min}_{{\{V_{k}\}}_{k = 1}^{L}} 1 - \frac{μ^{2}}{M + λ} \\ = {min}_{{\{V_{k}\}}_{k = 1}^{L}} \frac{σ^{2} + λ}{μ^{2} + σ^{2} + λ} \\ = {min}_{{\{V_{k}\}}_{k = 1}^{L}} \frac{1}{1 + \frac{μ^{2}}{σ^{2} + λ}} \end{matrix}

(29)

Hence, to minimize the loss, we have to find ${\{V_{k}\}}_{k = 1}^{L}$ that maximizes μ² and minimizes σ². Since we assumed that the network is expressive enough to attain any value, the loss is minimized when σ² = 0 and $μ = \bar{μ}$ . Thus, all training examples have the same margin.

If σ² → 0, then all margins tend to the same value, ${\bar{f}}_{i} \to \bar{μ}$ , and the optimum value of ρ is $ρ = \frac{\bar{μ}}{{\bar{μ}}^{2} + λ}$ . This means that the gap to interpolation is $ϵ = 1 - ρ \bar{μ} = \frac{λ}{λ + {\bar{μ}}^{2}}$ .

The prediction σ → 0 has empirical support: we show in Fig. 8 that all the margins converge to be roughly equal. Once within-class variability disappears and for all training samples, the last layer features collapse to their mean. The outputs and margins then also collapse to the same value. We can see this in the left plot of Fig. 10 where all of the margin histograms are concentrated around a single value. We visualize the evolution of the training margins over the training epochs in Fig. 8, which shows that the margin distribution concentrates over time. At the final epoch, the margin distribution (colored in yellow) is much narrower than at any intermediate epochs. Notice that our analysis of the origin of the SGD noise shows that “strict” NC1 never happens with SGD, in the sense that the margins are never, not even asymptotically, exactly equal to each other but just very close.

Fig. 10. — Training margins computed over 10 runs for binary classification on CIFAR10 trained with square loss, LM normalization, and WD λ = 0.001 (left) and without WD (right, λ = 0) for different initializations (init. = 0.8, 0.9, 1, 1.2, 1.3, and 1.5) with SGD and minibatch size of 128. The margin distribution is Gaussian-like with standard deviation ≈10⁻⁴ over the training set (N = 10⁴). The margins without WD result in a range of smaller margin values, each with essentially zero variance. As mentioned in the text, the norms of the convolutional layers are just the norm of the filters.

We now prove that NC follows from Property 1.

Theorem 5.

Let $S = {\{(x_{n}, y_{n})\}}_{n = 1}^{N}$ be a dataset. Let (ρ, V) be the parameters of a ReLU network f, such that V_L has converged when training using SGD with batches of size 1 on the square loss with LN + WD. Let $μ_{+} = \frac{1}{N} \sum_{n = 1, y_{n} = 1}^{N} h (x_{n})$ , $μ_{-} = \frac{1}{N} \sum_{n = 1, y_{n} = - 1}^{N} h (x_{n})$ . Consider points of convergence of SGD that satisfy Property 1. Those points also satisfy the conditions of NC as described below.

• NC1: μ₊ = h(x_n) for all n ∈ [N], y_n = 1, μ₋ = h(x_n) for all n ∈ [N], y_n = −1.

• NC2: μ₊ = −μ₋, which is the structure of an ETF with 2 vectors.

• NC3: V_L ∝ μ₊, μ₋.

• NC4: sign(ρf_V(x)) = arg min_{c ∈ {+1, −1}} ∥ μ_c − h(x)∥.

Proof. The update equations for SGD on the square loss function with LN+WD are given by:

\begin{matrix} V_{L}^{(t + 1)} = V_{L}^{(t)} - η \frac{\partial L}{\partial V_{L}^{(t)}} \\ ⟹ V_{L}^{(t + 1)} = V_{L}^{(t)} - η \times (2 ρ (ρ {\bar{f}}_{n} - 1) y_{n} h (x_{n}) + 2 ν_{L}^{(t)} V_{L}^{(t)}) \end{matrix}

(30)

We can apply the unit norm constraints ${‖V_{L}^{(t + 1)}‖}^{2} = 1$ and ${‖V_{L}^{(t)}‖}^{2} = 1$ and ignore all terms that are O(η²) to compute $ν_{L}^{(t)}$ as:

\begin{matrix} 2 ν_{L}^{(t)} = 2 ρ y_{n} V_{L}^{(t) ⊤} h (x_{n}) (1 - ρ {\bar{f}}_{n}) \\ ⟹ ν_{L}^{(t)} = ρ {\bar{f}}_{n} (1 - ρ {\bar{f}}_{n}) \end{matrix} .

(31)

This gives us the following SGD update:

V_{L}^{(t + 1)} = V_{L}^{(t)} - η \times 2 ρ y_{n} (ρ {\bar{f}}_{n} - 1) (h (x_{n}) - f_{n} V_{L}^{(t)}) .

(32)

Using Property 1, we can see that for every training sample in class y_n = 1, $h (x_{n}) = \frac{(1 - ϵ)}{ρ} V_{L}$ and for every training sample in class y_n = −1, $h (x_{n}) = \frac{(- 1 + ϵ)}{ρ} V_{L}$ . This shows that within-class variability has collapsed and that all last layer features collapse to their mean, which is the condition for NC1. We can also see that μ₊ = −μ₋, which is the condition for NC2 when there are 2 vectors in the simplex ETF. The SGD convergence condition also tells us that V_L ∝ μ₊ and V_L ∝ μ₋, which gives us the NC3 condition. NC4 follows then from NC1 to NC2, as shown by theorems in [12].

Multiclass classification

In the previous section, we proved the emergence of NC in the case of a binary classifier with scalar outputs, to be consistent with our framework in Problem Setup. The phenomenon of NC was, however, defined in [12] for the case of multiclass classification with deep networks. In this section, we describe how NC emerges in this setting from the minimization of the squared loss with WN and WD regularization. We also show in Fig. 9 that our networks show NC, similar to experiments reported in [12].

Fig. 9. — NC occurs during training for binary classification. This figure is similar to other published results on NC, such as for instance [12] for the case of exponential-type loss function. The key conditions for NC are: (a) NC1—variability collapse, which is measured by $Tr (Σ_{W} Σ_{B}^{- 1})$ , where Σ_W and Σ_B are the within and between class covariances, (b) NC2—equinorm and equiangularity of the mean features {*μ_c*} and classifiers {*W_c*}. We measure the equinorm condition by the standard deviation of the norms of the means (in red) and classifiers (in blue) across classes, divided by the average of the norms, and the equiangularity condition by the standard deviation of the inner products of the normalized means (in red) and the normalized classifiers (in blue), divided by the average inner product (this figure is similar to Fig. 4 in [12]; notice the small scale of the fluctuations), and (c) NC3—self-duality or the distance between the normalized classifiers and mean features. This network was trained on 2 classes of CIFAR10 with WN and WD = 5 × 10⁻⁴ and learning rate of 0.067, for 750 epochs with a stepped learning rate decay schedule.

We consider a classification problem with C classes with a balanced training dataset $S = \cup_{c = 1}^{C} S_{c} = \cup_{c = 1}^{C} {\{(x_{cn}, c)\}}_{n = 1}^{N} = \{(x_{n}, y_{n})\}$ that has N training examples $S_{c} = {\{(x_{cn}, c)\}}_{n = 1}^{N}$ per-class c ∈ [C]. The labels are represented by the unit vectors ${\{e_{c}\}}_{c = 1}^{C}$ in ℝ^C. Since we consider deep homogeneous networks that do not have bias vectors, we center the one-hot labels and scale them so that they have maximum output 1. We denote the resulting labels (for a class-balanced dataset) as ${\tilde{e}}_{c} = [\frac{- 1}{C - 1,} \dots \frac{- 1}{C - 1} 1 \frac{- 1}{C - 1} \dots \frac{- 1}{C - 1}]$ , where the cth coordinate is 1. We consider a deep ReLU network f_W : ℝ^d → ℝ^C, which takes the form of f_W(x) = W_Lσ(W_{L − 1}…W₂σ(W₁x)…). However, we stick to the normalized reparameterization of the deep ReLU network as f(x) = ρV_Lσ(V_{L − 1}…V₂σ(V₁x)…). We train this normalized network with SGD on the square loss with LMs and WD. This architecture differs from the one considered the Gradient dynamics section in that it has C outputs instead of a scalar output. Let the output of the network be $ρ f_{V} (x) = {[ρ f_{V}^{(1)} (x) \dots ρ f_{V}^{(C)} (x)]}^{⊤}$ and the target vectors be $y_{n} = {[y_{n}^{(1)} \dots y_{n}^{(C)}]}^{⊤}$ . We will also follow the notation of [12] and use h : ℝ^d → ℝ^p to denote the last layer features of the deep network. This means that $f_{V}^{(c)} (x) = 〈V_{L}^{c}, h (x)〉$ . The squared loss function with WD is written as $L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L}) = \frac{1}{NC} \sum_{c = 1}^{C} \sum_{n = 1}^{N} {‖y_{cn} - ρ f_{V} (x_{cn})‖}^{2} + λ ρ^{2}$ .

Property 2

[symmetric quasi-interpolation (multiclass classification)]. Consider a C-class classification problem with inputs in a feature space $X$ and label space ℝ^C. A classifier $f : X \to ℝ^{C}$ symmetrically quasi-interpolates a training dataset $S = \cup_{c = 1}^{C} S_{c} = \cup_{c = 1}^{C} {\{(x_{cn}, y_{cn})\}}_{n = 1}^{N}$ if, for all training examples, x_cn, $f (x_{cn}) \propto {\tilde{e}}_{c}$ .

Similar to the binary classification case, we show that this property arises from an analysis of the squared loss landscape for multiclass classification.

Lemma 7.

An overparameterized deep ReLU classifier trained to convergence under the squared loss in the presence of WD and WN satisfies the symmetric quasi-interpolation property

Proof. Consider the regularized square loss $L_{S} = \frac{1}{CN} \sum_{c = 1}^{C} \sum_{n = 1}^{N} ∥ ρ f_{V} (x_{cn}) - {\tilde{e}}_{c} ∥^{2} + λ ρ^{2}$ . In the multiclass case, we define the first-order statistics of the output of the normalized network as $μ = \frac{1}{CN} \sum_{c = 1}^{C} \sum_{n = 1}^{N} 〈f_{V} (x_{cn}), {\tilde{e}}_{c}〉$ and $M = \frac{1}{CN} \sum_{c = 1}^{C} \sum_{n = 1}^{N} ∥ f_{V} (x_{cn}) ∥^{2}$ . We consider deep networks that are overparameterized enough to attain 100% accuracy on the training dataset, which means $〈f_{V} (x_{cn}), {\tilde{e}}_{c}〉 > 0$ . Since the weights of the deep network ${\{V_{k}\}}_{k = 1}^{L}$ are normalized and the data x_cn lie within the unit norm ball, we also have that ∥f_V(x_cn) ∥ ≤ 1. However, similar to the binary case, we observe that the norm of f_V(x_cn) takes values of the order of 10⁻³.

Using these definitions, we can rewrite the deep network training problem as:

{min}_{ρ, {\{V_{k}\}}_{k = 1}^{L}} L_{S} = (M + λ) ρ^{2} - 2 ρ μ + \frac{C}{C - 1}

(33)

\begin{matrix} {min}_{{\{V_{k}\}}_{k = 1}^{L}} (M + λ) \times {(\frac{μ}{M + λ})}^{2} - 2 \frac{μ^{2}}{M + λ} + \frac{C}{C - 1} \\ = {min}_{{\{V_{k}\}}_{k = 1}^{L}} \frac{C}{C - 1} - \frac{μ^{2}}{M + λ} \end{matrix}

(34)

Hence, to minimize the loss we have to find ${\{V_{k}\}}_{k = 1}^{L}$ that maximizes $\frac{μ^{2}}{M + λ}$ . Since the network is expressive enough to attain any value and the norm of f_V(x_cn) is bounded, we see that the loss is minimized when μ² is maximized. That is, when $f (x_{cn}) \propto {\tilde{e}}_{c}$ for all training examples.

We now consider the optimization of the squared loss on deep networks with WN and WD:

L_{S} (ρ, {\{V_{k}\}}_{k = 1}^{L}) = \frac{1}{NC} \sum_{c = 1}^{C} \sum_{n = 1}^{N} {‖y_{cn} - ρ f_{V} (x_{cn})‖}^{2} + \sum_{k = 1}^{L} ν_{k} ({‖V_{k}‖}^{2} - 1) + λ ρ^{2}

(35)

At each time point t, the optimization process selects a random class-balanced batch $S^{'} = \cup_{c = 1}^{C} \cup_{n = 1}^{b} S_{c}^{'}$ including B samples per-class from $S_{c}^{'} \subset S_{c}$ and updates the scale and weights of the network is the following manner $V \leftarrow V - η \frac{\partial L_{S'} (ρ, V)}{\partial V}$ and $ρ \leftarrow ρ - η \frac{\partial L_{S'} (ρ, V)}{\partial ρ}$ , where η > 0 is a predefined learning rate and b is a divisor of N. A convergence point of the optimization process is a point (ρ, V) that will never be updated by any possible sequence of steps taken by the optimization algorithm. Specifically, the convergence points of the proposed method are all points ρ, V for which $\frac{\partial L_{S'} (ρ, V)}{\partial V} = 0$ and $\frac{\partial L_{S'} (ρ, V)}{\partial ρ} = 0$ for all class-balanced batches $S^{'} \subset S$ .

Theorem 6.

Let $S = \cup_{c = 1}^{C} {\{(x_{cn}, c)\}}_{n = 1}^{N}$ be a dataset and B be a divisor of N. Let (ρ, V) be the parameters of a ReLU network f_W, such that V_L has converged when training using SGD with balanced batches of size B = bC on the square loss with LN + WD. Let $μ_{c} = \frac{1}{N} \sum_{n = 1}^{N} h (x_{cn})$ , $μ_{G} = \frac{1}{C} \sum_{c = 1}^{C} μ_{c}$ , and M = […μ_c − μ_G…] ∈ ℝ^{p × C}. Consider points of convergence of SGD that satisfy Property 2. Then, those points also satisfy the conditions of NC as described below.

• NC1: μ_c = h(x_cn) for all n ∈ [N].

• NC2: The vectors ${\{μ_{c} - μ_{G}\}}_{c = 1}^{C}$ form an ETF.

• NC3: $V_{L}^{⊤} = \frac{M}{∥ M ∥_{F}}$ .

• NC4: $arg {max}_{c \in [C]} f_{W}^{c} (x) = arg {min}_{c \in [C]} ∥ μ_{c} - h (x) ∥$ .

Proof. Our training objective is the loss function described in Eq. 35. The network is trained using SGD along with LN and WD. We use SGD with balanced batches to train the network. Each step taken by SGD takes the form $- η \frac{\partial L_{S^{'}}}{\partial V}$ , where $S^{'} \subset S$ is a balanced batch containing exactly b samples per class. We consider limit points of the learning procedure, meaning that $\frac{\partial L_{S^{'}}}{\partial V} = 0$ for all balanced batches $S^{'}$ . Let $S^{'} = \cup_{c = 1}^{C} \cup_{n = 1}^{b} \{({\hat{x}}_{cn}, {\hat{y}}_{cn})\}$ be such a balanced batch. We use SGD, where, at each time t, the batch $S^{'}$ is drawn at random from $S$ , to study the time evolution of the normalized parameters V_L in the limit η → 0.

\begin{matrix} V_{L}^{(t + 1)} = V_{L}^{(t)} - η \frac{\partial L_{S'}}{\partial V_{L}^{(t)}} \\ V_{L}^{(t + 1)} = V_{L}^{(t)} - η \times (\frac{1}{B} \sum_{c' = 1}^{C} \sum_{n = 1}^{b} 2 ρ (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) h {(x_{c' n})}^{⊤} + 2 ν_{L}^{(t)} V_{L}^{(t)}) \end{matrix}

(36)

We can apply the unit norm constraints ${‖V_{L}^{(t + 1)}‖}_{F}^{2} =$ and and ignore all terms that are O(η²) to compute $ν_{L}^{(t)}$ as:

\begin{matrix} 2 ν_{L}^{(t)} & = - \frac{1}{B} \sum_{c' = 1}^{C} \sum_{n = 1}^{b} 2 ρ tr (V_{L}^{(t) ⊤} (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) h {(x_{c' n})}^{⊤}) \\ ⟹ ν_{L}^{(t)} & = - \frac{1}{B} \sum_{c' = 1}^{C} \sum_{n = 1}^{b} ρ tr ({(V_{L}^{(t)} h (x_{c' n}))}^{⊤} (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'})) \\ = - \frac{1}{B} \sum_{c' = 1}^{C} \sum_{n = 1}^{b} ρ f_{V} {(x_{c' n})}^{⊤} (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) \end{matrix}

(37)

This means that the (stochastic) gradient of the loss with respect to the last layer V_L and each classifier vector $V_{L}^{c}$ with LN can be written as (we drop the time index t for clarity):

\begin{matrix} \frac{\partial L_{S'}}{\partial V_{L}} = \frac{- 2 ρ}{B} \sum_{c' = 1}^{C} \sum_{n = 1}^{b} f_{V} {(x_{c' n})}^{⊤} (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) V_{L} - (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) h {(x_{c' n})}^{⊤} \\ \frac{\partial L_{S'}}{\partial V_{L}^{c}} = \frac{- 2 ρ}{B} \sum_{c' = 1}^{C} \sum_{n = 1}^{b} f_{V} {(x_{c' n})}^{⊤} (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) V_{L}^{c} - (ρ f_{V}^{(c)} (x_{c' n}) - {\tilde{e}}_{c'}^{(c)}) h (x_{c' n}) \end{matrix}

(38)

Let us analyze the equilibrium parameters at the last layer, considering each classifier vector $V_{L}^{c}$ of V_L, separately:

\begin{matrix} 0 = \frac{\partial L_{S'}}{\partial V_{L}^{c}} = \frac{- 2 ρ}{B} \sum_{c' = 1}^{C} \sum_{n = 1}^{b} f_{V} {(x_{c' n})}^{⊤} (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) V_{L}^{c} - (ρ f_{V}^{(c)} (x_{c' n}) - {\tilde{e}}_{c'}^{(c)}) h (x_{c' n}) \\ = \frac{- 2 ρ}{B} \sum_{n = 1}^{b} f_{V} {(x_{cn})}^{⊤} (ρ f_{V} (x_{cn}) - {\tilde{e}}_{c}) V_{L}^{c} - (ρ f_{V}^{(c)} (x_{cn}) - 1) h (x_{cn}) \\ - \frac{2 ρ}{B} \sum_{c' \in [C] \ \{c\}} \sum_{n = 1}^{b} f_{V} {(x_{c' n})}^{⊤} (ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}) V_{L}^{c} - (ρ f_{V}^{(c)} (x_{c' n}) + \frac{1}{C - 1}) h (x_{c' n}) \end{matrix}

(39)

Using Property 2 and considering solutions that achieve symmetric quasi-interpolation, with $ρ f_{V} ({\hat{x}}_{cn}) = α {\tilde{e}}_{c}$ , we have

\frac{2 ρ}{B} \sum_{n = 1}^{b} (α - 1) h (x_{cn}) - \frac{2 ρ}{B} \sum_{c' \in [C] \ \{c\}} \sum_{n = 1}^{b} \frac{α - 1}{C - 1} h (x_{c' n}) - \frac{2 α (α - 1) C}{C - 1} V_{L}^{c} = 0

(40)

In addition, consider a second batch $S^{''}$ that differs from $S^{'}$ by only one sample x′_cn instead of x_cn from class c. By applying the previous Eq. 40 for $S^{'}$ and $S^{''}$ , we can obtain h(x_cn) = h(x′_cn), which proves NC1.

Let $S = \cup_{i = 1}^{k} S^{i}$ be a partition of $S$ into k = N/b (an integer) disjoint batches. Since our data are balanced, we obtain that

\begin{matrix} 0 & = \frac{1}{k} \sum_{i = 1}^{k} \frac{\partial L_{S^{i}} (ρ, V)}{\partial V_{L}^{c}} \\ = \frac{\partial L_{S} (ρ, V)}{\partial V_{L}^{c}} \\ = \frac{2 ρ}{NC} \sum_{c^{'} = 1}^{C} \sum_{n = 1}^{N} f_{V} {(x_{c^{'} n})}^{⊤} (ρ f_{V} (x_{c^{'} n}) - {\tilde{e}}_{c^{'}}) V_{L}^{c} - (ρ f_{V}^{(c)} (x_{c^{'} n}) - {\tilde{e}}_{c^{'}}^{(c)}) h (x_{c^{'} n}) \\ = \frac{2 ρ}{NC} \sum_{n = 1}^{N} (α - 1) h (x_{cn}) - \frac{2 ρ}{NC} \sum_{c^{'} \in [C] \ \{c\}} \sum_{n = 1}^{N} \frac{α - 1}{C - 1} h (x_{c^{'} n}) - \frac{2 α (α - 1) C}{C - 1} V_{L}^{c} \end{matrix}

(41)

Under the conditions of NC1, we can simply write μ_c = h(x_cn) for all n ∈ [N] and c ∈ [C]. Let us denote the global feature mean by $μ_{G} = \frac{1}{C} \sum_{c = 1}^{C} μ_{c}$ . This means we have:

\frac{\partial L_{S} (ρ V)}{\partial V_{L}^{c}} = 0 ⟹ V_{L}^{c} = \frac{ρ}{αC} \cdot (μ_{c} - μ_{G})

(42)

This implies that the last layer parameters V_L are a scaled version of the centered class-wise feature matrix M = […μ_c − μ_G…]. Thus, at equilibrium, with quasi-interpolation of the training labels, we obtain $\frac{V_{L}^{⊤}}{∥ V_{L} ∥_{F}} = \frac{M}{{∥ M|}_{F}}$ .

From the SGD equations, we can also see that at equilibrium, with quasi-interpolation, all classifier vectors in the last layer ( $V_{L}^{c}$ and, hence, μ_c − μ_G) have the same norm:

\begin{matrix} ∥ V_{L}^{c} ∥_{2}^{2} = \frac{\frac{1}{NC} \sum_{c' = 1}^{C} \sum_{n = 1}^{N} (ρ f_{V}^{(c)} (x_{c' n}) - {\tilde{e}}_{c'}^{(c)}) ρ f_{V}^{(c)} (x_{c' n})}{\frac{1}{NC} \sum_{c' = 1}^{C} \sum_{n = 1}^{N} 〈ρ f_{V} (x_{c' n}) - {\tilde{e}}_{c'}, ρ f_{V} (x_{c' n})〉} \\ = \frac{\frac{α (α - 1)}{C} + \frac{α (α - 1)}{C (C - 1)}}{α (α - 1) \times \frac{C}{C - 1}} = \frac{1}{C} \end{matrix}

(43)

From the quasi-interpolation of the correct class label, we have that $〈V_{L}^{c}, μ_{c}〉 = \frac{α}{ρ}$ , which means $〈V_{L}^{c}, μ_{G}〉 + 〈V_{L}^{c}, μ_{c} - μ_{G}〉 = \frac{α}{ρ}$ . Now using Eq. 42

\begin{matrix} 〈V_{L}^{c}, μ_{G}〉 = \frac{α}{ρ} - \frac{αC}{ρ} ∥ V_{L}^{c} ∥_{2}^{2} \\ = \frac{α}{ρ} - \frac{αC}{ρ} \times \frac{1}{C} = 0 \end{matrix}

(44)

From the quasi-interpolation of the incorrect class labels, we have that $〈V_{L}^{c}, μ_{c'}〉 = \frac{- α}{ρ (C - 1)}$ , which means $〈V_{L}^{c}, μ_{c'} - μ_{G}〉 + 〈V_{L}^{c}, μ_{G}〉 = \frac{- α}{ρ (C - 1)}$ . Plugging in the previous result and using Eq. 43 yields

\begin{matrix} \frac{αC}{ρ} \times 〈V_{L}^{c}, V_{L}^{c'}〉 & = \frac{- α}{ρ (C - 1)} \\ ⟹ 〈{\tilde{V}}_{L}^{c}, {\tilde{V}}_{L}^{c'}〉 & = \frac{1}{∥ V_{L}^{c} ∥_{2}^{2}} \times \frac{- 1}{C (C - 1)} = - \frac{1}{C - 1} \end{matrix}

(45)

Here, ${\tilde{V}}_{L}^{c} = \frac{V_{L}^{c}}{∥ V_{L}^{c} ∥_{2}}$ , and we use the fact that all the norms $∥ V_{L}^{c} ∥_{2}$ are equal. This completes the proof that the normalized classifier parameters form an ETF. Moreover, since $V_{L}^{c} \propto μ_{c} - μ_{G}$ and all the proportionality constants are independent of c, we obtain $\sum_{c} V_{L}^{c} = 0$ . This completes the proof of the NC2 condition. NC4 follows then from NC1 to NC2, as shown by theorems in [12].

Remarks

• The analyses of the loss landscape and the qualitative dynamics under the square loss in the Qualitative dynamics and Landscape of the empirical risk sections imply that all quasi-interpolating solutions with ρ ≥ ρ₀ and λ > 0 that satisfy assumption 2 yield NC and have its 4 properties.

• SGD is a necessary requirement in our proof of NC1.

• Our analysis implies that there is no direct relation between NC and generalization. In fact, a careful look at our derivation suggests that NC1 to NC4 should take place for any quasi-interpolating solutions (in the square loss case), including solutions that do not have a large margin. In particular, our analysis predicts NC for datasets with fully random labels—a prediction that has been experimentally verified.

SGD Bias toward Low-Rank Weight Matrices and Intrinsic SGD Noise

In the previous sections, we assumed that ρ and V_k are trained using GF. In this section, we consider a slightly different setting where SGD is applied instead of GF. Specifically, V_k and ρ are first initialized and then iteratively updated simultaneously in the following manner

\begin{matrix} ρ \leftarrow ρ - η \frac{\partial L_{S'} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial ρ} = ρ - η \frac{2}{B} \sum_{(x_{n}, y_{n}) \in S'} (1 - ρ {\bar{f}}_{n}) {\bar{f}}_{n} - 2 ηλρ \\ V_{k} \leftarrow V_{k} - \frac{\partial L_{S'} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial V_{k}} = V_{k} - η \frac{2}{B} \sum_{(x_{n}, y_{n}) \in S'} (1 - ρ {\bar{f}}_{n}) ρ \frac{\partial {\bar{f}}_{n}}{\partial V_{k}} - 2 η ν_{k} V_{k} \end{matrix}

(46)

where $S^{'}$ is selected uniformly as a subset of $S$ of size B, η > 0 is the learning rate, and ν_k is computed according to Eq. 4 with $S$ replaced by $S^{'}$ .

Low-rank bias

An intriguing argument for low-rank weight matrices is the following observation that follows from Eq. 5 (see also [7]). The Lemma 8 shows that, in practice, SGD cannot achieve zero gradient for all the minibatches of size smaller than N, because, otherwise, all the weight matrices would have very low rank that is incompatible, for generic datasets, with quasi-interpolation.

Lemma 8.

Let f_W be a neural network. Assume that we iteratively train ρ and ${\{V_{k}\}}_{k = 1}^{L}$ using the process described above with WD λ > 0. Suppose that training converges, that is $\frac{\partial L_{S'} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial ρ} = 0$ and $\forall k \in [L] : \frac{\partial L_{S'} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial V_{k}} = 0$ for all minibatches $S^{'} \subset S$ of size $B < ∣ S ∣$ . Assume that $\forall n \in [N] : {\bar{f}}_{n} \neq 0$ . Then, the ranks of the matrices V_k are at most ≤ 2.

Proof. Let f_V(x) = V_Lσ(V_{L − 1}…σ(V₁x)…) be the normalized neural network, where V_l ∈ ℝ^{d_{l + 1} × d_l} and ∥V_l ∥ = 1 for all l ∈ [L]. We would like to show that the matrix $\frac{\partial f_{V} (x)}{\partial V_{k}}$ is of rank ≤1. We note that for any given vector z ∈ ℝ^d, we have σ(v) = diag (σ^′(v)) · v (where σ is the ReLU activation function). Therefore, for any input vector x ∈ ℝⁿ, the output of f_V can be written as follows,

\begin{matrix} f_{V} (x) & = V_{L} σ (V_{L - 1} \dots σ (V_{1} x) \dots) \\ = V_{L} \cdot D_{L - 1} (x; V) \dots D_{1} (x; V) \cdot V_{1} \cdot x \end{matrix}

(47)

where D_l(x; V) = diag [σ^′(u_l(x; V)))] and u_l(x; V) = V_lσ(V_{l − 1}…σ(V₁x)…). We denote u_{l, i}(x; V) as the ith coordinate of the vector u_l(x; V). We note that u_l(x; V) are continuous functions of V. Therefore, assuming that none of the coordinates u_{l, i}(x; V) are zero, there exists a sufficiently small ball around V for which u_{l, i}(x; V) does not change its sign. Hence, within this ball, σ^′(u_{l, i}(x; V)) is constant. We define sets $V ≔ \{V| \forall l \leq L : ∥ V_{l} ∥ = 1\}$ and $V_{l, i} = \{V \in V| u_{l, i} (x; V) = 0\}$ . We note that as long as x ≠ 0, the set $V_{l, i}$ is negligible within $V$ . Since there is a finite set of indices l, i, the set $⋃_{l, i} V_{l, i}$ is also negligible within $V$ .

Let V be a set of matrices for which none of the coordinates u_{l, i}(x; V) are zero. Then, the matrices ${\{D_{l} (x; V)\}}_{l = 1}^{L - 1}$ are constant in the neighborhood of V, and therefore, their derivative with respect to V_k are zero. Let a^⊤ = V_L · D_{L − 1}(x; V)V_{L − 1}⋯V_{k + 1}D_k(x; V) and b = D_{k − 1}(x) · V_{k − 1}⋯V₁x. We can write f_V(x) = a(x; V)^⊤ · V_k · b(x; V). Since the derivatives of a(x; V) and b(x; V) with respect to V_k are zero, by applying $\frac{\partial a^{⊤} Xb}{X} = a b^{⊤}$ , we have $\frac{\partial f_{V} (x)}{\partial V_{k}} = a (x; V) \cdot b {(x; V)}^{⊤}$ that is a matrix of rank at most 1. Therefore, $\frac{\partial {\bar{f}}_{n}}{\partial V_{k}} = y_{n} \frac{\partial f_{V} (x_{n})}{\partial V_{k}}$ is a matrix of rank at most 1. Therefore, for any input x_n ≠ 0, with measure 1, $\frac{\partial {\bar{f}}_{n}}{\partial V_{k}}$ is a matrix of rank at most 1.

Since $\forall k \in [L] : \frac{\partial L_{S'} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial V_{k}} = 0$ for all minibatches $S^{'} = {\{(x_{i_{j}}, y_{i_{j}})\}}_{j = 1}^{B} \subset S$ of size $B < ∣ S ∣$ , we have

\frac{\partial L_{S'} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial V_{k}} = \frac{2}{B} ρ \sum_{j = 1}^{B} [(1 - ρ {\bar{f}}_{i_{j}}) (- V_{k} {\bar{f}}_{i_{j}} + \frac{\partial {\bar{f}}_{i_{j}}}{\partial V_{k}})] = 0

(48)

Since interpolation is impossible when training with λ > 0, there exists at least one n ∈ [N] for which $ρ {\bar{f}}_{n} \neq 1$ . We consider 2 batches $S_{i}^{'}$ and $S_{j}^{'}$ of size B that differ by sample, (x_i, y_i) and (x_j, y_j). We have

\begin{matrix} \forall i, j \in [N] : 0 & = \frac{\partial L_{S'_{i}} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial V_{k}} - \frac{\partial L_{S'_{j}} (ρ, {\{V_{k}\}}_{k = 1}^{L})}{\partial V_{k}} \\ = \frac{2}{B} \cdot ρ [(1 - ρ {\bar{f}}_{i}) (- V_{k} {\bar{f}}_{i} + \frac{\partial {\bar{f}}_{i}}{\partial V_{k}}) - (1 - ρ {\bar{f}}_{j}) (- V_{k} {\bar{f}}_{j} + \frac{\partial {\bar{f}}_{j}}{\partial V_{k}})] \end{matrix}

(49)

Assume that there exists a pair i, j ∈ [N] for which $(1 - ρ {\bar{f}}_{i}) {\bar{f}}_{i} \neq (1 - ρ {\bar{f}}_{j}) {\bar{f}}_{j}$ . Then, we can write

V_{k} = \frac{[(1 - ρ {\bar{f}}_{i}) \cdot \frac{\partial {\bar{f}}_{i}}{\partial V_{k}} + (1 - ρ {\bar{f}}_{j}) \cdot \frac{\partial {\bar{f}}_{j}}{\partial V_{k}}]}{[(1 - ρ {\bar{f}}_{i}) {\bar{f}}_{i} - (1 - ρ {\bar{f}}_{j}) {\bar{f}}_{j}]}

(50)

Since $\frac{\partial {\bar{f}}_{i}}{\partial V_{k}}$ and $\frac{\partial {\bar{f}}_{j}}{\partial V_{k}}$ are matrices of rank ≤1 (see the analysis above), we obtain that V_k is of rank ≤2. Otherwise, assume that for all pairs i, j ∈ [N], we have $α = (1 - ρ {\bar{f}}_{i}) {\bar{f}}_{i} = (1 - ρ {\bar{f}}_{j}) {\bar{f}}_{j}$ . In this case, we obtain that for all i, j ∈ [N], we have

(1 - ρ {\bar{f}}_{i}) \cdot \frac{\partial {\bar{f}}_{i}}{\partial V_{k}} = (1 - ρ {\bar{f}}_{j}) \cdot \frac{\partial {\bar{f}}_{j}}{\partial V_{k}} = U

(51)

Therefore, since $α = (1 - ρ {\bar{f}}_{i}) {\bar{f}}_{i} = (1 - ρ {\bar{f}}_{j}) {\bar{f}}_{j}$ , by Eq. 48,

0 = \frac{2}{B} ρ \sum_{j = 1}^{B} [(1 - ρ {\bar{f}}_{i_{j}}) (- V_{k} {\bar{f}}_{i_{j}} + \frac{\partial {\bar{f}}_{i_{j}}}{\partial V_{k}})] = - 2 ρα V_{k} + 2 ρU

(52)

Since the network cannot perfectly fit the dataset when trained with λ > 0, we obtain that there exists i ∈ [N] for which $(1 - ρ {\bar{f}}_{i}) \neq 0$ . Since ${\bar{f}}_{i} \neq 0$ for all i ∈ [N], this implies that α ≠ 0. We conclude that V_k is proportional to U, which is of rank ≤1.

All GD methods try to converge to points in parameter space that have zero or very small gradient; in other words, they try to minimize $∥ {\overset{\cdot}{V}}_{k} ∥, \forall k$ . Assuming separability, $ℓ_{n} = (1 - ρ {\bar{f}}_{n}) > 0, \forall n$ . Equation 10 then implies

∥ {\overset{\cdot}{V}}_{k} ∥ = \frac{2 ρ}{N} \sum_{n \in B} ℓ_{n} ∥ \frac{\partial {\bar{f}}_{n}}{\partial V_{k}} - f_{n} V_{k} ∥

(53)

which predicts that the norm of the SGD updates at layer k should reflect, asymptotically, the rank of V_k.

Is low-rank bias related to generalization?

An obvious question is whether a deep ReLU network that fits the data generalizes better than another one if the rank of its weight matrices is lower. The following result is stated in [8]:

Theorem 7.

Let f_V be a normalized neural network, trained with SGD under square loss in the presence of WN. Assume that the weight matrix V_k of dimensionality (n, n) has rank r < n. Then, its contribution to the Rademacher complexity of the network will be $\sqrt{\frac{r}{n}}$ (instead of 1 as in the typical bound).

Origin of SGD noise

Lemma 8 shows that there cannot be convergence to a unique set of weights ${\{V_{k}\}}_{k = 1}^{L}$ that satisfy equilibrium for all minibatches. More details of the argument are illustrated in [54,55]. When λ = 0, interpolation of all data points is expected: In this case, the GD equilibrium can be reached without any constraint on the weights. This is also the situation in which SGD noise is expected to essentially disappear: Compare the histograms on the left and the right hand side of Fig. 10. Thus, during training, the solution ${\{V_{k}\}}_{k = 1}^{L}$ is not the same for all samples: There is no convergence to a unique solution but instead fluctuations between solutions during training. The absence of convergence to a unique solution is not surprising for SGD when the landscape is not convex.

Summary

The dynamics of GF

In this paper, we have considered a model of the dynamics of, first, GF, and then stochastic GD in overparameterized ReLU neural networks trained for square loss minimization. Under the assumption of convergence to zero loss minima, we have shown that solutions have a bias toward small ρ, defined as the product of the Frobenius norms of each layer’s (unnormalized) weight matrix. We assume that during training, there is normalization using an LM of each layer weight matrix but the last one, together with WD with the regularization parameter λ. Without WD, the best solution would be the interpolating solution with minimum ρ that may be achieved with appropriate initial conditions that are appropriate.

Remarks

• The bias toward small ρ solutions induced by regularization with λ > 0 may be replaced—when λ = 0—by an implicit bias induced by small initialization. With appropriate parameter values, small initialization allows convergence to the first quasi-interpolating solution for increasing ρ from ≈ 0 to ρ₀. For λ = 0, we have empirically observed solutions with large ρ that are suboptimal and probably similar to the NTK regime.

• A puzzle that remains open is why BN leads to better solutions than LN and WN, despite similarities between them. WN is easier to formalize mathematically as LN, which is the main reason for the role it plays in this paper.

Generalization and bounds

Building on our analysis of the dynamics of ρ, we derive new norm-based generalization bounds for CNNs for the special case of nonoverlapping convolutional patches. These bounds show (a) that generalization for CNNs can be orders of magnitude better than for dense networks and (b) that these bounds can be empirically loose but nonvacuous despite overparametrization.

Remarks

• For λ > 0, a main property of the minimizers that upper bounds their expected error is ρ, which is the inverse of the margin: We prove that among all the quasi-interpolating solutions, the ones associated with smaller ρ have better bounds on the expected classification error.

• The situation here is somewhat similar to the linear case: For overparameterized networks, the best solution in terms of generalization is the minimum norm solution toward which GD is biased.

• Large margin is usually associated with good generalization [56]; in the meantime, however, it is also broadly recognized that margin alone does not fully account for generalization in deep nets [28,31,57]. Margin, in fact, provides an upper bound on generalization error, as shown in Generalization: Rademacher Complexity of Convolutional Layers. Larger margin gives a better upper bound on the generalization error for the same network trained on the same data. We have empirically verified this property by varying the margin using different degrees of random labels in a binary classification task. While training gives perfect classification and zero square loss, the margin on the training set together with the test error decreases with the increase in the percentage of random labels. Of course, large margin in our theoretical analysis is associated with regularization that helps minimizing ρ. Since ρ is the product of the Frobenius norm, its minimization is directly related to minimizing a Bayes prior [58], which is itself directly related to minimum description length principles.

• We do not believe that flat minima directly affect generalization. As we described in the Interpolation and quasi-interpolation section, degenerate minima correspond to solutions that have zero empirical loss (for λ = 0). Minimizing the empirical loss is a (almost) necessary condition for good generalization. It is not, however, sufficient since minimization of the expected error also requires a solution with low complexity.

• The upper bound given in Generalization: Rademacher Complexity of Convolutional Layers, however, does not explain by itself details of the generalization behavior that we observe for different initializations (see Fig. 4), where small differences in margin are actually anticorrelated with small differences in test error. We conjecture that margin (related to ρ) together with sparsity of $F$ may be sufficient to explain generalization.

Neural collapse

Another consequence of our analysis is a proof of NC for deep networks trained with square loss in the binary classification case without any assumption. In particular, we prove that training the network using SGD with WD, induces a bias toward low-rank weight matrices and yields SGD noise in the weight matrices and in the margins, which makes exact convergence impossible, even asymptotically.

Remarks

• A natural question is whether NC is related to solutions with good generalization. Our analysis suggests that this is not the case, at least not directly: NC is a property of the dynamics, independently of the size of the margin that provides an upper bound on the expected error. In fact, our prediction of NC for randomly labeled CIFAR10 was confirmed originally in then preliminary experiments by our collaborators (Papyan et al. [12]) and more recently in other papers (see for instance, [33]).

• Margins, however, do converge to each other but only within a small ϵ, implying that the first condition for NC [12] is satisfied only in this approximate sense. This is equivalent to saying that that SGD does not converge to a unique solution that corresponds to zero gradient for all data point.

Conclusion

Finally, we would like to emphasize that the analysis of this paper supports the idea that the advantage of deep networks relative to other standard classifiers is greater for the problems to which sparse architectures such as CNNs can be applied. The reason is that CNNs reflect the function graph of target functions that are compositionally sparse and, thus, can be approximated well by sparse networks without incurring in the curse of dimensionality. Despite overparametrization, the compositionally sparse networks can then show good generalization.

Acknowledgments

We thank L. Rosasco, Y. Copper, E. Malach, and S. Ullman for many relevant discussions. Funding: This material is based on the work supported by the Center for Minds, Brains and Machines (CBMM), funded by NSF STC award CCF-1231216. This research was also sponsored by grants from the National Science Foundation (NSF-0640097 and NSF-0827427) and AFSOR-THRL (FA8650-05-C-7262). Competing interests: The authors declare that they have no competing interests.

Data Availability

The experimental dataset we used in this paper is a public CIFAR10 dataset, and it can be accessed and downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.

References

1.Lyu K, Li J. Gradient descent maximizes the margin of homogeneous neural networks. arXiv. 2019. 10.48550/arXiv.1906.05890 [DOI]
2.Poggio T, Banburski A, Liao Q. Theoretical issues in deep networks. Proc Natl Acad Sci USA. 2020;117(48):30039–30045. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Nacson MS, Gunasekar S, Lee J, Srebro N, Soudry D. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. arXiv. 2019. https://arxiv.org/abs/1905.07325
4.Banburski A, Liao Q, Miranda B, Poggio T, Rosasco L, Hidary J, De La Torre F. Theory of deep learning III: Dynamics and generalization in deep networks. Center for Brains, Minds and Machines (CBMM) Memo No. 90. 2019.
5.Hui L, Belkin M. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. arXiv. 2020. https://arxiv.org/abs/2006.07322
6.Rifkin RM. Everything old is new again: A fresh look at historical approaches to machine learning [PhD thesis]. [Cambridge (MA)]: Massachusetts Institute of Technology; 2002. [Google Scholar]
7.Poggio T, Liao Q. Generalization in deep network classifiers trained with the square loss. Center for Brains, Minds and Machines (CBMM) Memo No. 112. 2019.
8.Xu M, Rangamani A, Banburski A, Liao Q, Galanti T, Poggio T. Deep classifiers trained with the square loss. Center for Brains, Minds and Machines (CBMM) Memo No. 117. 2022.
9.Poggio T, Cooper Y. Loss landscape: SGD has a better view. CBMM Memo No. 107. 2020.
10.Cooper Y. Global minima of overparameterized neural networks. SIAM J Math Data Sci. 2021;3(2):676–691. [Google Scholar]
11.Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv. 2016. 10.48550/arXiv.1611.03530 [DOI]
12.Papyan V, Han XY, Donoho DL. Prevalence of neural collapse during the terminal phase of deep learning training. Proc Natl Acad Sci USA. 2020;117(40):24652–24663. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Timor N, Vardi G, Shamir O. Implicit regularization towards rank minimization in relu networks. arXiv. 2022. 10.48550/arXiv.2201.12760 [DOI]
14.Soudry D, Hoffer E, Nacson MS, Gunasekar S, Srebro N. The implicit bias of gradient descent on separable data. J Mach Learn Res. 2018;19(1):2822–2878. [Google Scholar]
15.Chizat L, Bach F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. Paper presented at: Conference on Learning Theory, Proceedings of Machine Learning Research; 2020. p. 1305–1338.
16.Xu T, Zhou Y, Ji K, Liang Y. When will gradient methods converge to max-margin classifier under relu models? Stat. 2021;10(1):Article e354. [Google Scholar]
17.Muthukumar V, Narang A, Subramanian V, Belkin M, Hsu D, Sahai A. Classification vs regression in overparameterized regimes: Does the loss function matter? arXiv. 2020. https://arxiv.org/abs/2005.08054
18.Liang T, Rakhlin A. Just interpolate: Kernel “Ridgeless” Regression can generalize. arXiv. 2018. https://arxiv.org/abs/1808.00387
19.Liang T, Recht B. Interpolating classifiers make few mistakes. arXiv. 2021. https://arxiv.org/abs/2101.11815
20.Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS. Recovery guarantees for one-hidden-layer neural networks. Paper presented at: International Conference on Machine Learning, Proceedings of Machine Learning Research; 2017. p. 4140–4149.
21.Soltanolkotabi M, Javanmard A, Lee JD. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans Inf Theory. 2018;65(2):742–769. [Google Scholar]
22.Du SS, Zhai X, Poczos B, Singh A. Gradient descent provably optimizes over-parameterized neural networks. Paper presented at: International Conference on Learning Representations; 2019. p. 1–19.
23.Chizat L, Oyallon E, Bach F. On lazy training in differentiable programming. arXiv. 2018. https://arxiv.org/abs/1812.07956
24.Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. arXiv. 2018. https://arxiv.org/abs/1806.07572
25.Mei S, Montanari A, Nguyen P. A mean field view of the landscape of two-layer neural networks. Proc Natl Acad Sci USA. 2018;115(33):E7665–E7671. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Chen Z, Rotskoff GM, Bruna J, Vanden-Eijnden E. A dynamical central limit theorem for shallow neural networks. arXiv. 2020. https://arxiv.org/abs/2008.09623
27.Arora S, Du S, Hu W, Li Z, Wang R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Paper presented at: International Conference on Machine Learning, Proceedings of Machine Learning Research; 2019. p. 322–332.
28.Bartlett P, Foster DJ, Telgarsky M. Spectrally-normalized margin bounds for neural networks. Paper presented at: Advances in Neural Information Processing Systems (NeurIPS). 2017 June; Long Beach, CA.
29.Neyshabur B, Tomioka R, Srebro N. Norm-Based Capacity Control in Neural Networks. Paper presented at: Conference on Learning Theory, Proceedings of Machine Learning Research; 2015. p. 1376–1401.
30.Golowich N, Rakhlin A, Shamir O. Size-independent sample complexity of neural networks. arXiv. 2017. https://arxiv.org/abs/1712.06541
31.Jiang Y, Neyshabur B, Mobahi H, Krishnan D, Bengio S. Fantastic generalization measures and where to find them. arXiv. 2019. https://arxiv.org/abs/1912.02178
32.Mixon DG, Parshall H, Pi J. Neural collapse with unconstrained features. arXiv. 2020. 10.48550/arXiv.2011.11619 [DOI]
33.Zhu Z, Ding T, Zhou J, Li X, You C, Sulam J, Qu Q. A geometric analysis of neural collapse with unconstrained features. Adv Neural Inf Proces Syst. 2021;34:29820–29834. [Google Scholar]
34.Lu J, Steinerberger S. Neural collapse with cross-entropy loss. arXiv. 2020. 10.48550/arXiv.2012.08465 [DOI]
35.Fang C, He H, Long Q, Su WJ. Layer-peeled model: Toward understanding well-trained deep neural networks. arXiv. 2021. 10.48550/arXiv.2101.12699 [DOI]
36.W. E and Wojtowytsch S. On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers. arXiv. 2020. https://arxiv.org/abs/2012.05420v1
37.Ergen T, Pilanci M. Revealing the structure of deep neural networks via convex duality. arXiv. 2020. https://arxiv.org/abs/2002.09773
38.Poggio T, Liao Q. Generalization in deep network classifiers trained with the square loss. Center for Brains, Minds and Machines (CBMM) Memo No. 112. 2021.
39.Han XY, Papyan V, Donoho DL. Neural collapse under mse loss: Proximity to and dynamics on the central path. arXiv. 2021. https://arxiv.org/abs/2106.02073v1
40.Zhou J, Li X, Ding T, You C, Qu Q, Zhu Z. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. arXiv. 2022. https://arxiv.org/abs/2203.01238
41.Arora S, Li Z, Lyu K. Theoretical analysis of auto rate-tuning by batch normalization. arXiv. 2018. 10.48550/arXiv.1812.03981 [DOI]
42.Anselmi F, Rosasco L, Poggio T. On invariance and selectivity in representation learning. arXiv. 2015. 10.48550/arXiv.1503.05938 [DOI]
43.Ledent A, Lei Y, Kloft M. Improved generalisation bounds for deep learning through l∞ covering numbers. arXiv. 2019. 10.48550/arXiv.1905.12430 [DOI]
44.Krizhevsky A. Learning multiple layers of features from tiny images. Technical Report, 2009.
45.Salimans T, Kingm DP. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Paper presented at: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016 December; Barcelona, Spain.
46.Poggio T, Cooper Y. Loss landscape: SGD can have a better view than GD. Center for Brains, Minds and Machines (CBMM) Memo No. 107. 2020.
47.Nguyen Q. On connected sublevel sets in deep learning. Paper presented at: Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research; 2019;97. p. 4790–4799.
48.Liang T, Poggio T, Rakhlin A, Stokes J. Fisher-rao metric, geometry, and complexity of neural networks. arXiv. 2017. 10.48550/arXiv.1711.01530 [DOI]
49.Chatterjee S. Convergence of gradient descent for deep neural networks. arXiv. 2022. https://arxiv.org/abs/2203.16462
50.Mohri M, Rostamizadeh A, Talwalkar A. Foundations of machine learning. 2nd ed. Cambridge (MA): MIT Press; 2018.
51.Golowich N, Rakhlin A, Shamir O. Size-independent sample complexity of neural networks. arXiv. 2017. 10.48550/arXiv.1712.06541 [DOI]
52.Rebeschini P. Algorithmic foundations of learning lecture 3: Rademacher complexity. 2020. https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/slides03.pdf
53.Xu M, Poggio T, Galanti T. Complexity bounds for sparse networks. Center for Brains, Minds and Machines (CBMM) Memo No. 1XX. 2022.
54.Galanti T, Poggio T. SGD noise and implicit low-rank bias in deep neural networks. Center for Brains, Minds and Machines (CBMM) Memo No. 134. 2022.
55.Galanti T, Siegel ZS, Gupte A, Poggio T. SGD and weight decay provably induce a low-rank bias in neural networks. arXiv. 2022. https://arxiv.org/abs/2206.05794
56.Bousquet O, Boucheron S, Lugosi G. Introduction to statistical learning theory. In: Bousquet O, von Luxburg U, Rätsch G, editors. Summer school on machine learning. Berlin (Heidelberg): Springer; 2003. p. 169–207.
57.Jiang Y, Krishnan D, Mobahi H, Bengio S. Predicting the generalization gap in deep networks with margin distributions. arXiv. 2018. https://arxiv.org/abs/1810.00113
58.Evgeniou T, Pontil M, Poggio T. Regularization networks and support vector machines. Adv Comput Math. 2000;13:1–50. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The experimental dataset we used in this paper is a public CIFAR10 dataset, and it can be accessed and downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.

[B1] 1.Lyu K, Li J. Gradient descent maximizes the margin of homogeneous neural networks. arXiv. 2019. 10.48550/arXiv.1906.05890 [DOI]

[B2] 2.Poggio T, Banburski A, Liao Q. Theoretical issues in deep networks. Proc Natl Acad Sci USA. 2020;117(48):30039–30045. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] 3.Nacson MS, Gunasekar S, Lee J, Srebro N, Soudry D. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. arXiv. 2019. https://arxiv.org/abs/1905.07325

[B4] 4.Banburski A, Liao Q, Miranda B, Poggio T, Rosasco L, Hidary J, De La Torre F. Theory of deep learning III: Dynamics and generalization in deep networks. Center for Brains, Minds and Machines (CBMM) Memo No. 90. 2019.

[B5] 5.Hui L, Belkin M. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. arXiv. 2020. https://arxiv.org/abs/2006.07322

[B6] 6.Rifkin RM. Everything old is new again: A fresh look at historical approaches to machine learning [PhD thesis]. [Cambridge (MA)]: Massachusetts Institute of Technology; 2002. [Google Scholar]

[B7] 7.Poggio T, Liao Q. Generalization in deep network classifiers trained with the square loss. Center for Brains, Minds and Machines (CBMM) Memo No. 112. 2019.

[B8] 8.Xu M, Rangamani A, Banburski A, Liao Q, Galanti T, Poggio T. Deep classifiers trained with the square loss. Center for Brains, Minds and Machines (CBMM) Memo No. 117. 2022.

[B9] 9.Poggio T, Cooper Y. Loss landscape: SGD has a better view. CBMM Memo No. 107. 2020.

[B10] 10.Cooper Y. Global minima of overparameterized neural networks. SIAM J Math Data Sci. 2021;3(2):676–691. [Google Scholar]

[B11] 11.Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv. 2016. 10.48550/arXiv.1611.03530 [DOI]

[B12] 12.Papyan V, Han XY, Donoho DL. Prevalence of neural collapse during the terminal phase of deep learning training. Proc Natl Acad Sci USA. 2020;117(40):24652–24663. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B13] 13.Timor N, Vardi G, Shamir O. Implicit regularization towards rank minimization in relu networks. arXiv. 2022. 10.48550/arXiv.2201.12760 [DOI]

[B14] 14.Soudry D, Hoffer E, Nacson MS, Gunasekar S, Srebro N. The implicit bias of gradient descent on separable data. J Mach Learn Res. 2018;19(1):2822–2878. [Google Scholar]

[B15] 15.Chizat L, Bach F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. Paper presented at: Conference on Learning Theory, Proceedings of Machine Learning Research; 2020. p. 1305–1338.

[B16] 16.Xu T, Zhou Y, Ji K, Liang Y. When will gradient methods converge to max-margin classifier under relu models? Stat. 2021;10(1):Article e354. [Google Scholar]

[B17] 17.Muthukumar V, Narang A, Subramanian V, Belkin M, Hsu D, Sahai A. Classification vs regression in overparameterized regimes: Does the loss function matter? arXiv. 2020. https://arxiv.org/abs/2005.08054

[B18] 18.Liang T, Rakhlin A. Just interpolate: Kernel “Ridgeless” Regression can generalize. arXiv. 2018. https://arxiv.org/abs/1808.00387

[B19] 19.Liang T, Recht B. Interpolating classifiers make few mistakes. arXiv. 2021. https://arxiv.org/abs/2101.11815

[B20] 20.Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS. Recovery guarantees for one-hidden-layer neural networks. Paper presented at: International Conference on Machine Learning, Proceedings of Machine Learning Research; 2017. p. 4140–4149.

[B21] 21.Soltanolkotabi M, Javanmard A, Lee JD. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans Inf Theory. 2018;65(2):742–769. [Google Scholar]

[B22] 22.Du SS, Zhai X, Poczos B, Singh A. Gradient descent provably optimizes over-parameterized neural networks. Paper presented at: International Conference on Learning Representations; 2019. p. 1–19.

[B23] 23.Chizat L, Oyallon E, Bach F. On lazy training in differentiable programming. arXiv. 2018. https://arxiv.org/abs/1812.07956

[B24] 24.Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. arXiv. 2018. https://arxiv.org/abs/1806.07572

[B25] 25.Mei S, Montanari A, Nguyen P. A mean field view of the landscape of two-layer neural networks. Proc Natl Acad Sci USA. 2018;115(33):E7665–E7671. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] 26.Chen Z, Rotskoff GM, Bruna J, Vanden-Eijnden E. A dynamical central limit theorem for shallow neural networks. arXiv. 2020. https://arxiv.org/abs/2008.09623

[B27] 27.Arora S, Du S, Hu W, Li Z, Wang R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Paper presented at: International Conference on Machine Learning, Proceedings of Machine Learning Research; 2019. p. 322–332.

[B28] 28.Bartlett P, Foster DJ, Telgarsky M. Spectrally-normalized margin bounds for neural networks. Paper presented at: Advances in Neural Information Processing Systems (NeurIPS). 2017 June; Long Beach, CA.

[B29] 29.Neyshabur B, Tomioka R, Srebro N. Norm-Based Capacity Control in Neural Networks. Paper presented at: Conference on Learning Theory, Proceedings of Machine Learning Research; 2015. p. 1376–1401.

[B30] 30.Golowich N, Rakhlin A, Shamir O. Size-independent sample complexity of neural networks. arXiv. 2017. https://arxiv.org/abs/1712.06541

[B31] 31.Jiang Y, Neyshabur B, Mobahi H, Krishnan D, Bengio S. Fantastic generalization measures and where to find them. arXiv. 2019. https://arxiv.org/abs/1912.02178

[B32] 32.Mixon DG, Parshall H, Pi J. Neural collapse with unconstrained features. arXiv. 2020. 10.48550/arXiv.2011.11619 [DOI]

[B33] 33.Zhu Z, Ding T, Zhou J, Li X, You C, Sulam J, Qu Q. A geometric analysis of neural collapse with unconstrained features. Adv Neural Inf Proces Syst. 2021;34:29820–29834. [Google Scholar]

[B34] 34.Lu J, Steinerberger S. Neural collapse with cross-entropy loss. arXiv. 2020. 10.48550/arXiv.2012.08465 [DOI]

[B35] 35.Fang C, He H, Long Q, Su WJ. Layer-peeled model: Toward understanding well-trained deep neural networks. arXiv. 2021. 10.48550/arXiv.2101.12699 [DOI]

[B36] 36.W. E and Wojtowytsch S. On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers. arXiv. 2020. https://arxiv.org/abs/2012.05420v1

[B37] 37.Ergen T, Pilanci M. Revealing the structure of deep neural networks via convex duality. arXiv. 2020. https://arxiv.org/abs/2002.09773

[B38] 38.Poggio T, Liao Q. Generalization in deep network classifiers trained with the square loss. Center for Brains, Minds and Machines (CBMM) Memo No. 112. 2021.

[B39] 39.Han XY, Papyan V, Donoho DL. Neural collapse under mse loss: Proximity to and dynamics on the central path. arXiv. 2021. https://arxiv.org/abs/2106.02073v1

[B40] 40.Zhou J, Li X, Ding T, You C, Qu Q, Zhu Z. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. arXiv. 2022. https://arxiv.org/abs/2203.01238

[B41] 41.Arora S, Li Z, Lyu K. Theoretical analysis of auto rate-tuning by batch normalization. arXiv. 2018. 10.48550/arXiv.1812.03981 [DOI]

[B42] 42.Anselmi F, Rosasco L, Poggio T. On invariance and selectivity in representation learning. arXiv. 2015. 10.48550/arXiv.1503.05938 [DOI]

[B43] 43.Ledent A, Lei Y, Kloft M. Improved generalisation bounds for deep learning through l∞ covering numbers. arXiv. 2019. 10.48550/arXiv.1905.12430 [DOI]

[B44] 44.Krizhevsky A. Learning multiple layers of features from tiny images. Technical Report, 2009.

[B45] 45.Salimans T, Kingm DP. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Paper presented at: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016 December; Barcelona, Spain.

[B46] 46.Poggio T, Cooper Y. Loss landscape: SGD can have a better view than GD. Center for Brains, Minds and Machines (CBMM) Memo No. 107. 2020.

[B47] 47.Nguyen Q. On connected sublevel sets in deep learning. Paper presented at: Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research; 2019;97. p. 4790–4799.

[B48] 48.Liang T, Poggio T, Rakhlin A, Stokes J. Fisher-rao metric, geometry, and complexity of neural networks. arXiv. 2017. 10.48550/arXiv.1711.01530 [DOI]

[B49] 49.Chatterjee S. Convergence of gradient descent for deep neural networks. arXiv. 2022. https://arxiv.org/abs/2203.16462

[B50] 50.Mohri M, Rostamizadeh A, Talwalkar A. Foundations of machine learning. 2nd ed. Cambridge (MA): MIT Press; 2018.

[B51] 51.Golowich N, Rakhlin A, Shamir O. Size-independent sample complexity of neural networks. arXiv. 2017. 10.48550/arXiv.1712.06541 [DOI]

[B52] 52.Rebeschini P. Algorithmic foundations of learning lecture 3: Rademacher complexity. 2020. https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/slides03.pdf

[B53] 53.Xu M, Poggio T, Galanti T. Complexity bounds for sparse networks. Center for Brains, Minds and Machines (CBMM) Memo No. 1XX. 2022.

[B54] 54.Galanti T, Poggio T. SGD noise and implicit low-rank bias in deep neural networks. Center for Brains, Minds and Machines (CBMM) Memo No. 134. 2022.

[B55] 55.Galanti T, Siegel ZS, Gupte A, Poggio T. SGD and weight decay provably induce a low-rank bias in neural networks. arXiv. 2022. https://arxiv.org/abs/2206.05794

[B56] 56.Bousquet O, Boucheron S, Lugosi G. Introduction to statistical learning theory. In: Bousquet O, von Luxburg U, Rätsch G, editors. Summer school on machine learning. Berlin (Heidelberg): Springer; 2003. p. 169–207.

[B57] 57.Jiang Y, Krishnan D, Mobahi H, Bengio S. Predicting the generalization gap in deep networks with margin distributions. arXiv. 2018. https://arxiv.org/abs/1810.00113

[B58] 58.Evgeniou T, Pontil M, Poggio T. Regularization networks and support vector machines. Adv Comput Math. 2000;13:1–50. [Google Scholar]

PERMALINK

Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

Mengjia Xu

Akshay Rangamani

Qianli Liao

Tomer Galanti

Tomaso Poggio

Abstract

Introduction

Contributions

Related Work

Problem Setup

Assumptions

Classification with square loss minimization

Fig. 1.

Separability and margins

Interpolation and quasi-interpolation

Lemma 1.

Experiments

Landscape of the empirical risk

Fig. 2.

Theorem 1

Theorem 2

Landscape for λ > 0

Gradient dynamics

GF equations

Lemma 2

Convergence

Lemma 3.

Qualitative dynamics

Small ρ initialization

Lemma 4.

Large ρ initialization

Lemma 5.

Fig. 3.

Generalization: Rademacher Complexity of Convolutional Layers

Classical Rademacher bounds

Definition

Theorem 3.

Relative generalization

Fig. 4.

Fig. 5.

Novel bounds for sparse networks

Theorem 4.

Conjecture 1.

The bound is surprisingly small

Fig. 6.

Fig. 7.

Neural Collapse

Related Work on NC

Binary classification

Property 1

Lemma 6.

Fig. 8.

Fig. 10.

Theorem 5.

Multiclass classification

Fig. 9.

Property 2

Lemma 7.

Theorem 6.

Remarks

SGD Bias toward Low-Rank Weight Matrices and Intrinsic SGD Noise

Low-rank bias

Lemma 8.

Is low-rank bias related to generalization?

Theorem 7.

Origin of SGD noise

Summary

The dynamics of GF

Remarks

Generalization and bounds

Remarks

Neural collapse

Remarks

Conclusion

Acknowledgments

Data Availability

References

Associated Data