Proceedings of the National Academy of Sciences of the United States of America 117 (48), 30063–30070 (24 April 2020). doi: 10.1073/pnas.1907378117

Benign overfitting in linear regression

Peter L Bartlett a,b,1, Philip M Long c, Gábor Lugosi d,e,f, Alexander Tsigler a

Abstract

The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space vs. when the data lie in a finite-dimensional space with dimension that grows faster than the sample size.

Keywords: statistical learning theory, overfitting, linear regression, interpolation


Deep learning methodology has revealed a surprising statistical phenomenon: overfitting can perform well. The classical perspective in statistical learning theory is that there should be a tradeoff between the fit to the training data and the complexity of the prediction rule. Whether complexity is measured in terms of the number of parameters, the number of nonzero parameters in a high-dimensional setting, the number of neighbors averaged in a nearest neighbor estimator, the scale of an estimate in a reproducing kernel Hilbert space, or the bandwidth of a kernel smoother, this tradeoff has been ubiquitous in statistical learning theory. Deep learning seems to operate outside the regime where results of this kind are informative since deep neural networks can perform well even with a perfect fit to the training data.

As one example of this phenomenon, consider the experiment illustrated in figure 1C in ref. 1: standard deep network architectures and stochastic gradient algorithms, run until they perfectly fit a standard image classification training set, give respectable prediction performance, even when significant levels of label noise are introduced. The deep networks in the experiments reported in ref. 1 achieved essentially zero cross-entropy loss on the training data. In statistics and machine learning textbooks, an estimate that fits every training example perfectly is often presented as an illustration of overfitting [“…interpolating fits…[are] unlikely to predict future data well at all” (ref. 2, p. 37)]. Thus, to arrive at a scientific understanding of the success of deep learning methods, it is a central challenge to understand the performance of prediction rules that fit the training data perfectly.

In this paper, we consider perhaps the simplest setting where we might hope to witness this phenomenon: linear regression. That is, we consider quadratic loss and linear prediction rules, and we assume that the dimension of the parameter space is large enough that a perfect fit is guaranteed. We consider data in an infinite-dimensional space (a separable Hilbert space), but our results apply to a finite-dimensional subspace as a special case. There is an ideal value of the parameters, θ*, corresponding to the linear prediction rule that minimizes the expected quadratic loss. We ask when it is possible to fit the data exactly and still compete with the prediction accuracy of θ*. Since we require more parameters than the sample size in order to fit exactly, the solution might be underdetermined, and therefore, there might be many interpolating solutions. We consider the most natural: choose the parameter vector θ^ with the smallest norm among all vectors that give perfect predictions on the training sample. (This corresponds to using the pseudoinverse to solve the normal equations; see below.) We ask when it is possible to overfit in this way—and embed all of the noise of the labels into the parameter estimate θ^—without harming prediction accuracy.

Our main result is a finite sample characterization of when overfitting is benign in this setting. The linear regression problem depends on the optimal parameters θ* and the covariance Σ of the covariates x. The properties of Σ turn out to be crucial since the magnitude of the variance in different directions determines both how the label noise gets distributed across the parameter space and how errors in parameter estimation in different directions in parameter space affect prediction accuracy. There is a classical decomposition of the excess prediction error into two terms. The first is rather standard: provided that the scale of the problem (that is, the sum of the eigenvalues of Σ) is small compared with the sample size n, the contribution to θ^ that we can view as coming from θ* is not too distorted. The second term is more interesting since it reflects the impact of the noise in the labels on prediction accuracy. We show that this part is small if and only if the effective rank of Σ in the subspace corresponding to low-variance directions is large compared with n. This necessary and sufficient condition of a large effective rank can be viewed as a property of significant overparameterization: fitting the training data exactly but with near-optimal prediction accuracy occurs if and only if there are many low-variance (and hence, unimportant) directions in parameter space where the label noise can be hidden.

The details are more complicated. The characterization depends in a specific way on two notions of effective rank, r and R; the smaller one, r, determines a split of Σ into large and small eigenvalues, and the excess prediction error depends on the effective rank as measured by the larger notion R of the subspace corresponding to the smallest eigenvalues. For the excess prediction error to be small, the smallest eigenvalues of Σ must decay slowly.

Studying the patterns of eigenvalues that allow benign overfitting reveals an interesting role for large but finite dimensions: in an infinite-dimensional setting, benign overfitting occurs only for a narrow range of decay rates of the eigenvalues. On the other hand, it occurs with any suitably slowly decaying eigenvalue sequence in a finite-dimensional space with dimension that grows faster than the sample size. Thus, for linear regression, data that lie in a large but finite-dimensional space exhibit the benign overfitting phenomenon with a much wider range of covariance properties than data that lie in an infinite-dimensional space.

The phenomenon of interpolating prediction rules has been an object of study by several authors over the last two years since it emerged as an intriguing mystery at the Simons Institute program on Foundations of Machine Learning in the spring of 2017. Belkin et al. (3) described an experimental study demonstrating that this phenomenon of accurate prediction for functions that interpolate noisy data also occurs for prediction rules chosen from reproducing kernel Hilbert spaces and explained the mismatch between this phenomenon and classical generalization bounds. Belkin et al. (4) gave an example of an interpolating decision rule—simplicial interpolation—with an asymptotic consistency property as the input dimension gets large. That work and subsequent work of Belkin et al. (5) studied kernel smoothing methods based on singular kernels that both interpolate and, with suitable bandwidth choice, give optimal rates for nonparametric estimation [building on earlier consistency results (6) for these unusual kernels]. Liang and Rakhlin (7) considered minimum norm interpolating kernel regression with kernels defined as nonlinear functions of the Euclidean inner product and showed that, with certain properties of the training sample (expressed in terms of the empirical kernel matrix), these methods can have good prediction accuracy. Belkin et al. (8) studied experimentally the excess risk as a function of the dimension of a sequence of parameter spaces for linear and nonlinear classes.

Subsequent to our work, ref. 9 considered the properties of the interpolating linear prediction rule with minimal expected squared error. After this work was presented at the NAS Colloquium on the Science of Deep Learning (10), we became aware of the concurrent work of Belkin et al. (11) and of Hastie et al. (12). Belkin et al. (11) calculated the excess risk for certain linear models (a regression problem with identity covariance and sparse optimal parameters, both with and without noise, and a problem with random Fourier features with no noise), and Hastie et al. (12) considered linear regression in an asymptotic regime, where sample size n and input dimension p go to infinity together with asymptotic ratio $p/n \to \gamma$. They assumed that, as p gets large, the empirical spectral distribution of Σ (the discrete measure on its set of eigenvalues) converges to a fixed measure, and they applied random matrix theory to explore the range of behaviors of the asymptotics of the excess prediction error as γ, the noise variance, and the eigenvalue distribution vary. They also studied the asymptotics of a model involving random nonlinear features. In contrast, we give upper and lower bounds on the excess prediction error for arbitrary finite sample size, for arbitrary covariance matrices, and for data of arbitrary dimension.

The next section introduces notation and definitions used throughout the paper, including definitions of the problem of linear regression and of various notions of effective rank of the covariance operator. The following section gives the characterization of benign overfitting, illustrates why the effective rank condition corresponds to significant overparameterization, and presents several examples of patterns of eigenvalues that allow benign overfitting, suggesting that slowly decaying covariance eigenvalues in input spaces of growing but finite dimension are the generic example of benign overfitting. Then we discuss the connections between these results and the benign overfitting phenomenon in deep neural networks and outline the proofs of the results.

Definitions and Notation

We consider linear regression problems, where a linear function of covariates x from a (potentially infinite-dimensional) Hilbert space H is used to predict a real-valued response variable y. We use vector notation, so that $x^\top\theta$ denotes the inner product between x and θ, and $x z^\top$ denotes the tensor product of $x, z \in H$.

Definition 1 (Linear Regression): A linear regression problem in a separable Hilbert space H is defined by a random covariate vector $x \in H$ and outcome $y \in \mathbb{R}$. We define

1) the covariance operator $\Sigma = \mathbb{E}\left[x x^\top\right]$ and

2) the optimal parameter vector $\theta^* \in H$, satisfying $\mathbb{E}\left(y - x^\top\theta^*\right)^2 = \min_\theta \mathbb{E}\left(y - x^\top\theta\right)^2$.

We assume that

1) x and y are mean zero;

2) $x = V \Lambda^{1/2} z$, where $\Sigma = V \Lambda V^\top$ is the spectral decomposition of Σ and z has components that are independent $\sigma_x^2$-sub-Gaussian, with $\sigma_x$ a positive constant; that is, for all $\lambda \in H$,

\[ \mathbb{E}\left[\exp\left(\lambda^\top z\right)\right] \le \exp\left(\sigma_x^2 \|\lambda\|^2 / 2\right), \]

where $\|\cdot\|$ is the norm in the Hilbert space H;

3) the conditional noise variance is bounded below by some constant $\sigma^2$,

\[ \mathbb{E}\left[\left(y - x^\top\theta^*\right)^2 \,\middle|\, x\right] \ge \sigma^2; \]

4) $y - x^\top\theta^*$ is $\sigma_y^2$-sub-Gaussian conditionally on x; that is, for all $\lambda \in \mathbb{R}$,

\[ \mathbb{E}\left[\exp\left(\lambda\left(y - x^\top\theta^*\right)\right) \,\middle|\, x\right] \le \exp\left(\sigma_y^2 \lambda^2 / 2\right) \]

(note that this implies $\mathbb{E}[y \mid x] = x^\top\theta^*$); and

5) almost surely, the projection of the data X on the space orthogonal to any eigenvector of Σ spans a space of dimension n.

Given a training sample $(x_1, y_1), \ldots, (x_n, y_n)$ of n independent pairs with the same distribution as (x, y), an estimator returns a parameter estimate $\theta \in H$. The excess risk of the estimator is defined as

\[ R(\theta) := \mathbb{E}_{x,y}\left[\left(y - x^\top\theta\right)^2 - \left(y - x^\top\theta^*\right)^2\right], \]

where $\mathbb{E}_{x,y}$ denotes the conditional expectation given all random quantities other than x, y (in this case, given the estimate θ). Define the vectors $y \in \mathbb{R}^n$ with entries $y_i$ and $\varepsilon \in \mathbb{R}^n$ with entries $\varepsilon_i = y_i - x_i^\top\theta^*$. We use infinite matrix notation: X denotes the linear map from H to $\mathbb{R}^n$ corresponding to $x_1, \ldots, x_n \in H$, so that $X\theta \in \mathbb{R}^n$ has ith component $x_i^\top\theta$. We use similar notation for the linear map $X^\top$ from $\mathbb{R}^n$ to H.

Notice that Assumptions 1 to 5 are satisfied when x and y are jointly Gaussian with zero mean and rank(Σ)>n.

We shall be concerned with situations where an estimator θ can fit the data perfectly: that is, $X\theta = y$. Typically, this implies that there are many such vectors. We consider the interpolating estimator with minimal norm in H. We use $\|\cdot\|$ to denote both the Euclidean norm of a vector in $\mathbb{R}^n$ and the Hilbert space norm.

Definition 2 (Minimum Norm Estimator): Given data $X \in H^n$ and $y \in \mathbb{R}^n$, the minimum norm estimator $\hat\theta$ solves the optimization problem

\[ \min_{\theta \in H} \ \|\theta\|^2 \quad \text{such that} \quad \|X\theta - y\|^2 = \min_{\beta} \|X\beta - y\|^2. \]

By the projection theorem, parameter vectors that solve the least squares problem $\min_\beta \|X\beta - y\|^2$ solve the normal equations, and therefore, we can equivalently write $\hat\theta$ as the minimum norm solution to the normal equations

\[ \hat\theta = \arg\min\left\{\|\theta\|^2 : X^\top X \theta = X^\top y\right\} = \left(X^\top X\right)^\dagger X^\top y = X^\top\left(X X^\top\right)^\dagger y, \]

where $\left(X^\top X\right)^\dagger$ denotes the pseudoinverse of the bounded linear operator $X^\top X$ (for infinite-dimensional H, the existence of the pseudoinverse is guaranteed because $X^\top X$ is bounded and has a closed range) (13). When H has dimension p with $p < n$ and X has rank p, there is a unique solution to the normal equations. In contrast, Assumption 5 in Definition 1 implies that we can find many solutions $\theta \in H$ to the normal equations that achieve $X\theta = y$. The minimum norm solution is given by

\[ \hat\theta = X^\top\left(X X^\top\right)^{-1} y. \tag{1} \]
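As a concrete illustration of Eq. 1, here is a minimal numpy sketch (ours, not part of the paper; the dimensions, noise level, and variable names are arbitrary choices) that computes the minimum norm interpolator via the pseudoinverse and checks that it fits the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                         # sample size n, ambient dimension p > n

# Synthetic data: rows of X are the covariate vectors x_i, y is the response vector.
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Minimum norm solution to the normal equations: theta_hat = X^T (X X^T)^+ y.
theta_hat = X.T @ np.linalg.pinv(X @ X.T) @ y
# Equivalently: np.linalg.pinv(X) @ y, or np.linalg.lstsq(X, y, rcond=None)[0].

assert np.allclose(X @ theta_hat, y)   # a perfect fit to the (noisy) training data
print(np.linalg.norm(theta_hat))       # the smallest norm among all interpolators
```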

Our main result gives tight bounds on the excess risk of this minimum norm estimator in terms of certain notions of effective rank of the covariance that are defined in terms of its eigenvalues.

We use $\mu_1(\Sigma) \ge \mu_2(\Sigma) \ge \cdots$ to denote the eigenvalues of Σ in descending order, and we denote the operator norm of Σ by $\|\Sigma\|$. We use I to denote the identity operator on H and $I_n$ to denote the $n \times n$ identity matrix.

Definition 3 (Effective Ranks): For the covariance operator Σ, define $\lambda_i = \mu_i(\Sigma)$ for $i = 1, 2, \ldots$. If $\sum_{i=1}^{\infty}\lambda_i < \infty$ and $\lambda_{k+1} > 0$ for $k \ge 0$, define

\[ r_k(\Sigma) = \frac{\sum_{i>k}\lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\left(\sum_{i>k}\lambda_i\right)^2}{\sum_{i>k}\lambda_i^2}. \]
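To make the definition concrete, the following small numpy sketch (ours; the eigenvalue sequence, the truncation, and the constant b = 1 are illustrative assumptions) computes $r_k(\Sigma)$ and $R_k(\Sigma)$ from a nonincreasing eigenvalue sequence, together with the index $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\}$ that appears in Theorem 1 below.

```python
import numpy as np

def effective_ranks(lams, k):
    """r_k(Sigma) and R_k(Sigma) for a nonincreasing array of eigenvalues (lams[k] > 0)."""
    tail = lams[k:]                               # lambda_{k+1}, lambda_{k+2}, ...
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()

def k_star(lams, n, b=1.0):
    """Smallest k with r_k(Sigma) >= b * n, or None if no such k exists."""
    for k in range(len(lams)):
        if lams[k] <= 0:
            break
        if effective_ranks(lams, k)[0] >= b * n:
            return k
    return None

# Example: lambda_i = 1 / (i * log^2(i + 1)), truncated to 10^5 terms.
i = np.arange(1, 100_001)
lams = 1.0 / (i * np.log(i + 1) ** 2)
print(effective_ranks(lams, 0))                   # (r_0, R_0)
print(k_star(lams, n=100))                        # k* for sample size n = 100
```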

Main Results

The following theorem establishes nearly matching upper and lower bounds for the risk of the minimum norm interpolating estimator.

Theorem 1. For any σx, there are b,c,c1>1 for which the following holds. Consider a linear regression problem from Definition 1. Define

\[ k^* = \min\left\{k \ge 0 : r_k(\Sigma) \ge b n\right\}, \]

where the minimum of the empty set is defined as $\infty$. Suppose that $\delta < 1$ with $\log(1/\delta) < n/c$. If $k^* \ge n/c_1$, then $\mathbb{E}\,R(\hat\theta) \ge \sigma^2/c$. Otherwise,

\[ R(\hat\theta) \le c\,\|\theta^*\|^2\,\|\Sigma\|\,\max\left\{\sqrt{\frac{r_0(\Sigma)}{n}},\ \frac{r_0(\Sigma)}{n},\ \sqrt{\frac{\log(1/\delta)}{n}}\right\} + c\,\log(1/\delta)\,\sigma_y^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right) \]

with probability at least $1 - \delta$, and

\[ \mathbb{E}\,R(\hat\theta) \ge \frac{\sigma^2}{c}\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right). \]

Moreover, there are universal constants $a_1, a_2, n_0$ such that, for all $n \ge n_0$, for all Σ, and for all $t \ge 0$, there is a θ* with $\|\theta^*\| = t$ such that, for $x \sim N(0, \Sigma)$ and $y \mid x \sim N\left(x^\top\theta^*, \|\theta^*\|^2\|\Sigma\|\right)$, with probability at least 1/4,

\[ R(\hat\theta) \ge \frac{1}{a_1}\,\|\theta^*\|^2\,\|\Sigma\|\;\mathbb{1}\!\left[\frac{r_0(\Sigma)}{n\log\left(1 + r_0(\Sigma)\right)} \ge a_2\right]. \]
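The following small simulation (ours, not from the paper) illustrates the noise part of this characterization under Gaussian data: it estimates the excess risk of the minimum norm interpolator by Monte Carlo and compares it with $k^*/n + n/R_{k^*}(\Sigma)$. The eigenvalue profile, the sample size, the noise level, and the choice b = 1 (the theorem's constant b is unspecified) are illustrative assumptions, so the comparison is only meant to hold up to constant and logarithmic factors.

```python
import numpy as np

rng = np.random.default_rng(1)

def min_norm_excess_risk(lams, n, theta_star, noise_sd, trials=50):
    """Monte Carlo estimate of the excess risk of the minimum norm interpolator
    for Gaussian data with covariance diag(lams) and E[y|x] = x^T theta_star."""
    p = len(lams)
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, p)) * np.sqrt(lams)   # rows x_i ~ N(0, diag(lams))
        y = X @ theta_star + noise_sd * rng.standard_normal(n)
        theta_hat = X.T @ np.linalg.solve(X @ X.T, y)     # minimum norm interpolator
        delta = theta_hat - theta_star
        risks.append(float(np.sum(lams * delta ** 2)))    # excess risk = delta^T Sigma delta
    return float(np.mean(risks))

def r_k(lams, k):
    return lams[k:].sum() / lams[k]

def R_k(lams, k):
    return lams[k:].sum() ** 2 / (lams[k:] ** 2).sum()

# One large direction plus many small, nearly flat ones (p >> n); all choices here are ours.
n, p = 100, 2000
lams = np.full(p, 1e-2)
lams[0] = 1.0
theta_star = np.zeros(p)
theta_star[0] = 1.0

k_star = next(k for k in range(p) if r_k(lams, k) >= n)   # b set to 1 for illustration
bound_scale = k_star / n + n / R_k(lams, k_star)
print(min_norm_excess_risk(lams, n, theta_star, noise_sd=1.0), bound_scale)
# The two printed numbers should be of the same order once the bias term is small.
```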

Effective Ranks and Overparameterization.

In order to understand the implications of Theorem 1, we now study relationships between the two notions of effective rank, rk and Rk, and establish sufficient and necessary conditions for the sequence {λi} of eigenvalues to lead to small excess risk.

The following lemma shows that the two notions of effective rank are closely related. SI Appendix, section H has its proof and other properties of rk and Rk.

Lemma 1. $r_k(\Sigma) \ge 1$, $r_k^2(\Sigma) = r_k(\Sigma^2)\,R_k(\Sigma)$, and

\[ r_k(\Sigma^2) \le r_k(\Sigma) \le R_k(\Sigma) \le r_k^2(\Sigma). \]

Notice that $r_0(I_p) = R_0(I_p) = p$. More generally, if all of the nonzero eigenvalues of Σ are identical, then $r_0(\Sigma) = R_0(\Sigma) = \mathrm{rank}(\Sigma)$. For Σ with finite rank, we can express both $r_0(\Sigma)$ and $R_0(\Sigma)$ as a product of the rank and a notion of symmetry. In particular, for $\mathrm{rank}(\Sigma) = p$, we can write

\[ r_0(\Sigma) = \mathrm{rank}(\Sigma)\, s(\Sigma), \qquad R_0(\Sigma) = \mathrm{rank}(\Sigma)\, S(\Sigma), \quad\text{with}\quad s(\Sigma) = \frac{1}{p}\sum_{i=1}^{p}\frac{\lambda_i}{\lambda_1}, \qquad S(\Sigma) = \frac{\left(\frac{1}{p}\sum_{i=1}^{p}\lambda_i\right)^2}{\frac{1}{p}\sum_{i=1}^{p}\lambda_i^2}. \]

Both notions of symmetry s and S lie between 1/p (when $\lambda_2 \to 0$) and 1 (when the $\lambda_i$ are all equal).

Theorem 1 shows that, for the minimum norm estimator to have near-optimal prediction accuracy, r0(Σ) should be small compared with the sample size n (from the first term) and rk*(Σ) and Rk*(Σ) should be large compared with n. Together, these conditions imply that overparameterization is essential for benign overfitting in this setting: the number of nonzero eigenvalues should be large compared with n, they should have a small sum compared with n, and there should be many eigenvalues no larger than λk*. If the number of these small eigenvalues is not much larger than n, then they should be roughly equal, but they can be more asymmetric if there are many more of them.

The following theorem shows that the kind of overparameterization that is essential for benign overfitting requires Σ to have a heavy tail. (The proof, along with some other examples illustrating the boundary of benign overfitting, is in SI Appendix, section I.) In particular, if we fix Σ in an infinite-dimensional Hilbert space and ask when the excess risk of the minimum norm estimator approaches zero as $n \to \infty$, this imposes tight restrictions on the eigenvalues of Σ. However, there are many other possibilities for these asymptotics if Σ can change with n. Since rescaling X affects the accuracy of the least norm interpolant in an obvious way, we may assume without loss of generality that $\|\Sigma\| = 1$. If we restrict our attention to this case, then informally, Theorem 1 implies that, when the covariance operator for data with n examples is $\Sigma_n$, the excess risk of the least norm interpolant converges to zero if $r_0(\Sigma_n)/n \to 0$, $k_n^*/n \to 0$, and $n/R_{k_n^*}(\Sigma_n) \to 0$, and only if $r_0(\Sigma_n)/\left(n\log(1+r_0(\Sigma_n))\right) \to 0$, $k_n^*/n \to 0$, and $n/R_{k_n^*}(\Sigma_n) \to 0$, where $k_n^* = \min\{k \ge 0 : r_k(\Sigma_n) \ge bn\}$ for the universal constant b in Theorem 1. This motivates the following definition.

Definition 4: A sequence of covariance operators $\Sigma_n$ with $\|\Sigma_n\| = 1$ is benign if

\[ \lim_{n\to\infty}\frac{r_0(\Sigma_n)}{n} = \lim_{n\to\infty}\frac{k_n^*}{n} = \lim_{n\to\infty}\frac{n}{R_{k_n^*}(\Sigma_n)} = 0. \]

We give some examples of benign and nonbenign settings.

Theorem 2.

1) If $\mu_k(\Sigma) = k^{-\alpha}\ln^{-\beta}\left((k+1)e/2\right)$, then Σ is benign if and only if $\alpha = 1$ and $\beta > 1$.

2) If

\[ \mu_k(\Sigma_n) = \begin{cases} \gamma_k + \epsilon_n & \text{if } k \le p_n, \\ 0 & \text{otherwise}, \end{cases} \]

and $\gamma_k = \Theta\left(\exp(-k/\tau)\right)$, then $\Sigma_n$ with $\|\Sigma_n\| = 1$ is benign if and only if $p_n = \omega(n)$ and $n e^{-o(n)} = \epsilon_n p_n = o(n)$. Furthermore, for $p_n = \Omega(n)$ and $\epsilon_n p_n = n e^{-o(n)}$,

\[ R(\hat\theta) = O\!\left(\frac{\epsilon_n p_n + 1}{n} + \frac{\ln\left(n/(\epsilon_n p_n)\right)}{n} + \max\left\{\frac{1}{n}, \frac{n}{p_n}\right\}\right). \]

Compare the situations described by Theorem 2.1 and 2.2. Theorem 2.1 shows that, for infinite-dimensional data with a fixed covariance, benign overfitting occurs if and only if the eigenvalues of the covariance operator decay just slowly enough for their sum to remain finite. Theorem 2.2 shows that the situation is very different if the data have finite dimension and a small amount of isotropic noise is added to the covariates. In that case, even if the eigenvalues of the original covariance operator (before the addition of isotropic noise) decay very rapidly, benign overfitting occurs if and only if both the dimension is large compared with the sample size and the isotropic component of the covariance is sufficiently small—but not exponentially small—compared with the sample size.

These examples illustrate the tension between the slow decay of eigenvalues that is needed for k/n+n/Rk to be small and the summability of eigenvalues that is needed for r0(Σ)/n to be small. There are two ways to resolve this tension. First, in the infinite-dimensional setting, slow decay of the eigenvalues suffices—decay just fast enough to ensure summability—as shown by Theorem 2.1. (SI Appendix, section I gives another example—Theorem S14.2—where the eigenvalue decay is allowed to vary with n; in that case, Σn is benign iff the decay rate gets close—but not too close—to 1/k as n increases.) Second, the other way to resolve the tension is to consider a finite-dimensional setting (which ensures that the eigenvalues are summable), and in this case, arbitrarily slow decay is possible. Theorem 2.2 gives an example of this: eigenvalues that are all at least as large as a small constant. SI Appendix, section I gives other examples with a similar flavor, including a truncated infinite series that decays sufficiently slowly that the sum does not converge (SI Appendix, section I, Theorem S14.3). Theorem 2.1 shows that a very specific decay rate is required in infinite dimensions, which suggests that this is an unusual phenomenon in that case. The more generic scenario where benign overfitting will occur is demonstrated by Theorem 2.2, with eigenvalues that are either at least a constant or slowly decaying in a very high—but finite-dimensional—space.
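As a rough numerical companion to Theorem 2 (ours, not from the paper), the sketch below evaluates the three quantities of Definition 4 for truncated versions of the two eigenvalue profiles. The truncation level, the decay scale τ = 5, and the particular p_n and ε_n are illustrative assumptions chosen to satisfy the conditions of Theorem 2.2.

```python
import numpy as np

def benign_quantities(lams, n, b=1.0):
    """The three quantities of Definition 4: r_0/n, k*/n, n/R_{k*} (b plays the role
    of the unspecified constant in Theorem 1; here it is set to 1)."""
    lams = np.sort(np.asarray(lams, dtype=float))[::-1]
    lams = lams / lams[0]                                  # normalize ||Sigma|| = 1
    tail = np.cumsum(lams[::-1])[::-1]                     # tail[k] = sum of lams[k:]
    tail_sq = np.cumsum((lams ** 2)[::-1])[::-1]
    r = tail / lams                                        # r[k] = r_k(Sigma)
    qualifying = np.nonzero(r >= b * n)[0]
    if len(qualifying) == 0:
        return r[0] / n, np.inf, np.inf
    k_star = int(qualifying[0])
    R_kstar = tail[k_star] ** 2 / tail_sq[k_star]
    return r[0] / n, k_star / n, n / R_kstar

for n in (100, 1000, 10_000):
    # Theorem 2.1-style infinite-dimensional spectrum, truncated for computation.
    i = np.arange(1, 10 ** 6 + 1)
    lams_inf = i ** (-1.0) * np.log((i + 1) * np.e / 2) ** (-2.0)
    # Theorem 2.2-style spectrum: exponential decay plus a small isotropic component
    # eps_n in dimension p_n; these p_n and eps_n are our illustrative choices with
    # p_n = omega(n) and n e^{-o(n)} <= eps_n * p_n = o(n).
    p_n = int(n ** 1.5)
    eps_n = 1.0 / (n * np.log(n))
    lams_fin = np.exp(-np.arange(p_n) / 5.0) + eps_n
    print(n, benign_quantities(lams_inf, n), benign_quantities(lams_fin, n))
```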

Proof

Throughout the proofs, we treat $\sigma_x$ (the sub-Gaussian norm of the covariates) as a constant. Therefore, we use the symbols $b, c, c_1, c_2, \ldots$ to refer to constants that only depend on $\sigma_x$. Their values are suitably large (and always at least one) but do not depend on any parameters of the problems that we consider other than $\sigma_x$. For universal constants that do not depend on any parameters of the problem at all, we use the symbol a. Also, whenever we sum over eigenvectors of Σ, the sum is restricted to eigenvectors with nonzero eigenvalues.

Outline.

The first step is a standard decomposition of the excess risk into two pieces, a term that corresponds to the distortion that is introduced by viewing θ* through the lens of the finite sample and a term that corresponds to the distortion introduced by the noise $\varepsilon = y - X\theta^*$. The impact of both sources of error in $\hat\theta$ on the excess risk is modulated by the covariance Σ, which gives different weight to different directions in parameter space.

Lemma 2. The excess risk of the minimum norm estimator satisfies $R(\hat\theta) \le 2\,\theta^{*\top} B\,\theta^* + c\,\sigma_y^2\log(1/\delta)\,\mathrm{tr}(C)$ with probability at least $1 - \delta$ over ε, and $\mathbb{E}_\varepsilon R(\hat\theta) \ge \theta^{*\top} B\,\theta^* + \sigma^2\,\mathrm{tr}(C)$, where

\[ B = \left(I - X^\top\left(XX^\top\right)^{-1}X\right)\Sigma\left(I - X^\top\left(XX^\top\right)^{-1}X\right), \qquad C = \left(XX^\top\right)^{-1}X\Sigma X^\top\left(XX^\top\right)^{-1}. \]

The proof of this lemma is in SI Appendix, section A. SI Appendix, sections J and K give bounds on the term $\theta^{*\top} B\,\theta^*$. The heart of the proof is controlling $\mathrm{tr}(C)$.

Before continuing with the proof, let us make a quick digression to note that Lemma 2 already begins to give an idea that many low-variance directions are necessary for the least norm interpolator to succeed. Let us consider the extreme case that $p = n$ and $\Sigma = I$. In this case, $C = \left(XX^\top\right)^{-1}$. For Gaussian data, for instance, standard random matrix theory implies that, with high probability, the eigenvalues of $XX^\top$ are all at most a constant factor times n, which implies that $\mathrm{tr}(C)$ is bounded below by a constant, and then, Lemma 2 implies that the least norm interpolant's excess risk is at least a constant.
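A quick numerical illustration of this digression (our sketch; the dimensions and the particular tail eigenvalue 1/n are arbitrary choices): with p = n and Σ = I, tr(C) does not vanish, while adding many additional low-variance directions drives it down.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Case 1: p = n, Sigma = I. Here C = (X X^T)^{-1} and tr(C) does not vanish.
X = rng.standard_normal((n, n))
print(np.trace(np.linalg.inv(X @ X.T)))        # bounded away from 0 (often much larger)

# Case 2: many additional low-variance directions (p >> n, small tail eigenvalues).
p = 20 * n
lams = np.concatenate([np.ones(n), np.full(p - n, 1.0 / n)])
X = rng.standard_normal((n, p)) * np.sqrt(lams)
A_inv = np.linalg.inv(X @ X.T)
C = A_inv @ (X * lams) @ X.T @ A_inv           # C = (XX^T)^{-1} X Sigma X^T (XX^T)^{-1}
print(np.trace(C))                             # typically much smaller than in case 1
```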

To prove that tr(C) can be controlled for suitable Σ, the first step is to express it in terms of sums of outer products of unit-covariance, independent, sub-Gaussian random vectors. We show that, when there is a k* with k*/n small and rk*(Σ)/n large, all of the smallest eigenvalues of these matrices are suitably concentrated, and this implies that tr(C) is bounded above by

\[ \min_{l \le k^*}\left(\frac{l}{n} + n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2}\right). \]

(Later, we show that the minimizer is l=k*.) Next, we show that this expression is also a lower bound on tr(C) provided that there is such a k*. Conversely, we show that, for any k for which rk(Σ) is not large compared with n, tr(C) is at least as big as a constant times min(1,k/n). Combining shows that, when k*/n is small, tr(C) is upper and lower bounded by constant factors times

\[ \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}. \]

Unit Variance Sub-Gaussians.

Our assumptions allow the trace of C to be expressed as a function of many independent sub-Gaussian vectors.

Lemma 3. Consider a covariance operator Σ with $\lambda_i = \mu_i(\Sigma)$ and $\lambda_n > 0$. Write its spectral decomposition $\Sigma = \sum_j \lambda_j v_j v_j^\top$, where the orthonormal $v_j \in H$ are the eigenvectors corresponding to the $\lambda_j$. For i with $\lambda_i > 0$, define $z_i = X v_i / \sqrt{\lambda_i}$. Then,

\[ \mathrm{tr}(C) = \sum_i \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i, \]

and these $z_i \in \mathbb{R}^n$ are independent $\sigma_x^2$-sub-Gaussian. Furthermore, for any i with $\lambda_i > 0$, we have

\[ \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i = \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1 + \lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2}, \]

where $A_{-i} = \sum_{j\ne i}\lambda_j z_j z_j^\top$.

Proof: By Assumption 2 in Definition 1, the random variables $x^\top v_i/\sqrt{\lambda_i}$ are independent $\sigma_x^2$-sub-Gaussian. We consider X in the basis of eigenvectors of Σ, $X v_i = \sqrt{\lambda_i}\, z_i$, to see that

\[ X X^\top = \sum_i \lambda_i z_i z_i^\top, \qquad X \Sigma X^\top = \sum_i \lambda_i^2 z_i z_i^\top, \]

and therefore, we can write

\[ \mathrm{tr}(C) = \mathrm{tr}\left(\left(XX^\top\right)^{-1} X\Sigma X^\top \left(XX^\top\right)^{-1}\right) = \sum_i \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i. \]

For the second part, we use SI Appendix, section B, Lemma S3, which is a consequence of the Sherman–Morrison–Woodbury formula:

\[ \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i = \lambda_i^2\, z_i^\top\left(\lambda_i z_i z_i^\top + A_{-i}\right)^{-2} z_i = \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1 + \lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2}, \]

by SI Appendix, section B, Lemma S3 for the case $k = 1$ and $Z = \sqrt{\lambda_i}\, z_i$. Note that $A_{-i}$ is invertible by Assumption 5 in Definition 1.

The weighted sum of outer products of these sub-Gaussian vectors plays a central role in the rest of the proof. Define

\[ A = \sum_i \lambda_i z_i z_i^\top, \qquad A_{-i} = \sum_{j \ne i}\lambda_j z_j z_j^\top, \qquad A_k = \sum_{i > k}\lambda_i z_i z_i^\top, \]

where the $z_i \in \mathbb{R}^n$ are the independent vectors with independent $\sigma_x^2$-sub-Gaussian coordinates with unit variance defined in Lemma 3. Note that the vector $z_i$ is independent of the matrix $A_{-i}$, and therefore, in the last part of Lemma 3, all of the random quadratic forms are independent of the points where those forms are evaluated.
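The identity in Lemma 3 and its leave-one-out form are easy to check numerically. The sketch below (ours; the spectrum and the dimensions are arbitrary choices, picked only to keep the matrices well conditioned) verifies both on synthetic Gaussian $z_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 200
lams = 1.0 / np.sqrt(np.arange(1, p + 1))        # a slowly decaying spectrum (our choice)
Z = rng.standard_normal((p, n))                  # row i of Z plays the role of z_i

A = (lams[:, None] * Z).T @ Z                    # A = sum_i lambda_i z_i z_i^T
A_inv = np.linalg.inv(A)

# tr(C) written as in Lemma 3.
tr_C = sum(lams[i] ** 2 * Z[i] @ A_inv @ A_inv @ Z[i] for i in range(p))

# The same quantity computed directly from C = (XX^T)^{-1} X Sigma X^T (XX^T)^{-1}.
X = (np.sqrt(lams)[:, None] * Z).T               # n x p with columns sqrt(lambda_i) z_i
XXT_inv = np.linalg.inv(X @ X.T)
C = XXT_inv @ (X * lams) @ X.T @ XXT_inv
assert np.isclose(tr_C, np.trace(C))

# Leave-one-out identity from the last display of Lemma 3, checked for i = 0.
i = 0
A_mi = A - lams[i] * np.outer(Z[i], Z[i])        # A_{-i}
A_mi_inv = np.linalg.inv(A_mi)
lhs = lams[i] ** 2 * Z[i] @ A_inv @ A_inv @ Z[i]
rhs = (lams[i] ** 2 * Z[i] @ A_mi_inv @ A_mi_inv @ Z[i]
       / (1 + lams[i] * Z[i] @ A_mi_inv @ Z[i]) ** 2)
assert np.isclose(lhs, rhs)
print(tr_C)
```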

Concentration of A.

The next step is to show that the eigenvalues of A, $A_{-i}$, and $A_k$ are concentrated. The proof of the following inequality is in SI Appendix, section C. Recall that $\mu_1(A)$ and $\mu_n(A)$ denote the largest and the smallest eigenvalues of the $n \times n$ matrix A.

Lemma 4. There is a constant c such that, for any $k \ge 0$, with probability at least $1 - 2e^{-n/c}$,

\[ \frac{1}{c}\sum_{i>k}\lambda_i - c\,\lambda_{k+1} n \;\le\; \mu_n(A_k) \;\le\; \mu_1(A_k) \;\le\; c\left(\sum_{i>k}\lambda_i + \lambda_{k+1} n\right). \]

The following lemma uses this result to give bounds on the eigenvalues of $A_k$, which in turn give bounds on some eigenvalues of $A_{-i}$ and A. For these upper and lower bounds to match up to a constant factor, the sum of the eigenvalues of $A_k$ should dominate the term involving its leading eigenvalue, which is a condition on the effective rank $r_k(\Sigma)$. The lemma shows that, once $r_k(\Sigma)$ is sufficiently large, all of the eigenvalues of $A_k$ are identical up to a constant factor.

Lemma 5. There are constants $b, c \ge 1$ such that, for any $k \ge 0$, with probability at least $1 - 2e^{-n/c}$:

1) for all $i \ge 1$,

\[ \mu_{k+1}(A_{-i}) \le \mu_{k+1}(A) \le \mu_1(A_k) \le c\left(\sum_{j>k}\lambda_j + \lambda_{k+1} n\right); \]

2) for all $1 \le i \le k$,

\[ \mu_n(A) \ge \mu_n(A_{-i}) \ge \mu_n(A_k) \ge \frac{1}{c}\sum_{j>k}\lambda_j - c\,\lambda_{k+1} n; \]

and

3) if $r_k(\Sigma) \ge bn$, then

\[ \frac{1}{c}\,\lambda_{k+1}\, r_k(\Sigma) \le \mu_n(A_k) \le \mu_1(A_k) \le c\,\lambda_{k+1}\, r_k(\Sigma). \]

Proof: By Lemma 4, we know that, with probability at least $1 - 2e^{-n/c_1}$,

\[ \frac{1}{c_1}\sum_{j>k}\lambda_j - c_1\lambda_{k+1} n \;\le\; \mu_n(A_k) \;\le\; \mu_1(A_k) \;\le\; c_1\left(\sum_{j>k}\lambda_j + \lambda_{k+1} n\right). \]

First, the matrix $A - A_k$ has rank at most k (as a sum of k matrices of rank 1). Thus, there is a linear space L of dimension $n - k$ such that, for all $v \in L$, $v^\top A v = v^\top A_k v \le \mu_1(A_k)\|v\|^2$, and therefore, $\mu_{k+1}(A) \le \mu_1(A_k)$.

Second, by the Courant–Fischer–Weyl theorem, for all i and j, $\mu_j(A_{-i}) \le \mu_j(A)$ (SI Appendix, section G, Lemma S11). On the other hand, for $i \le k$, $A_{-i} \succeq A_k$, and therefore, all of the eigenvalues of $A_{-i}$ are lower bounded by $\mu_n(A_k)$.

Finally, if $r_k(\Sigma) \ge bn$,

\[ \sum_{j>k}\lambda_j + \lambda_{k+1} n = \lambda_{k+1}\, r_k(\Sigma) + \lambda_{k+1} n \le \left(1 + \frac{1}{b}\right)\lambda_{k+1}\, r_k(\Sigma), \]
\[ \frac{1}{c_1}\sum_{j>k}\lambda_j - c_1\lambda_{k+1} n = \frac{1}{c_1}\lambda_{k+1}\, r_k(\Sigma) - c_1\lambda_{k+1} n \ge \left(\frac{1}{c_1} - \frac{c_1}{b}\right)\lambda_{k+1}\, r_k(\Sigma). \]

Choosing $b > c_1^2$ and $c > \max\left\{c_1 + 1/c_1,\ \left(1/c_1 - c_1/b\right)^{-1}\right\}$ gives the third claim of the lemma.
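A quick numerical illustration of the third claim (ours; the flat tail spectrum and the dimensions are arbitrary choices with $r_k(\Sigma)$ much larger than n): all n eigenvalues of $A_k$ land within a modest factor of $\lambda_{k+1}\, r_k(\Sigma)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 100, 10, 50_000

# A tail with r_k(Sigma) >> n: lambda_{k+1}, ..., lambda_p all equal.
lams_tail = np.full(p - k, 1e-3)
r_k = lams_tail.sum() / lams_tail[0]             # = p - k, much larger than n

Z = rng.standard_normal((p - k, n))              # the tail vectors z_i
A_k = (lams_tail[:, None] * Z).T @ Z             # A_k = sum_{i>k} lambda_i z_i z_i^T
evals = np.linalg.eigvalsh(A_k)
print(r_k / n, evals.min(), evals.max(), lams_tail[0] * r_k)
# evals.min() and evals.max() are close to lambda_{k+1} * r_k(Sigma).
```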

Upper Bound on the Trace Term.

Lemma 6. There are constants $b, c \ge 1$ such that, if $0 \le k \le n/c$, $r_k(\Sigma) \ge bn$, and $l \le k$, then with probability at least $1 - 7e^{-n/c}$,

\[ \mathrm{tr}(C) \le c\left(\frac{l}{n} + n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\sum_{i>k}\lambda_i\right)^2}\right). \]

The proof uses the following lemma and its corollary. Their proofs are in SI Appendix, section C.

Lemma 7. Suppose that $\{\lambda_i\}_{i\ge 1}$ is a nonincreasing sequence of nonnegative numbers such that $\sum_{i=1}^{\infty}\lambda_i < \infty$ and that $\{\xi_i\}_{i=1}^{\infty}$ are independent centered $\sigma$-subexponential random variables. Then, for some universal constant a and any $t > 0$, with probability at least $1 - 2e^{-t}$,

\[ \left|\sum_i \lambda_i \xi_i\right| \le a\,\sigma\,\max\left\{t\lambda_1,\ \sqrt{t\sum_i\lambda_i^2}\right\}. \]

Corollary 1. Suppose that $z \in \mathbb{R}^n$ is a centered random vector with independent $\sigma^2$-sub-Gaussian coordinates with unit variances, L is a random subspace of $\mathbb{R}^n$ of codimension k, and L is independent of z. Then, for some universal constant a and any $t > 0$, with probability at least $1 - 3e^{-t}$,

\[ \|z\|^2 \le n + a\,\sigma^2\left(t + \sqrt{nt}\right), \qquad \|\Pi_L z\|^2 \ge n - a\,\sigma^2\left(k + t + \sqrt{nt}\right), \]

where $\Pi_L$ is the orthogonal projection on L.

Proof of Lemma 6: Fix b to its value in Lemma 5. By Lemma 3,

\[ \mathrm{tr}(C) = \sum_i \lambda_i^2\, z_i^\top A^{-2} z_i = \sum_{i=1}^{l}\frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} + \sum_{i>l}\lambda_i^2\, z_i^\top A^{-2} z_i. \tag{2} \]

First, consider the sum up to l. If $r_k(\Sigma) \ge bn$, Lemma 5 shows that, with probability at least $1 - 2e^{-n/c_1}$, for all $i \le k$, $\mu_n(A_{-i}) \ge \lambda_{k+1}\, r_k(\Sigma)/c_1$, and, for all i, $\mu_{k+1}(A_{-i}) \le c_1\lambda_{k+1}\, r_k(\Sigma)$. The lower bounds on the $\mu_n(A_{-i})$ imply that, for all $z \in \mathbb{R}^n$ and $1 \le i \le l$,

\[ z^\top A_{-i}^{-2} z \le \frac{c_1^2\,\|z\|^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}, \]

and the upper bounds on the $\mu_{k+1}(A_{-i})$ give

\[ z^\top A_{-i}^{-1} z \ge \left(\Pi_{L_i} z\right)^\top A_{-i}^{-1}\left(\Pi_{L_i} z\right) \ge \frac{\|\Pi_{L_i} z\|^2}{c_1\lambda_{k+1}\, r_k(\Sigma)}, \]

where $L_i$ is the span of the $n-k$ eigenvectors of $A_{-i}$ corresponding to its smallest $n-k$ eigenvalues. Therefore, for $i \le l$,

\[ \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} \le \frac{z_i^\top A_{-i}^{-2} z_i}{\left(z_i^\top A_{-i}^{-1} z_i\right)^2} \le \frac{c_1^4\,\|z_i\|^2}{\|\Pi_{L_i} z_i\|^4}. \tag{3} \]

Next, we apply Corollary 1 l times together with a union bound to show that, with probability at least $1 - 3e^{-t}$, for all $1 \le i \le l$,

\[ \|z_i\|^2 \le n + a\,\sigma_x^2\left(t + \ln k + \sqrt{n(t+\ln k)}\right) \le c_2\, n, \tag{4} \]
\[ \|\Pi_{L_i} z_i\|^2 \ge n - a\,\sigma_x^2\left(k + t + \ln k + \sqrt{n(t+\ln k)}\right) \ge n/c_3, \tag{5} \]

provided that $t < n/c_0$ and $c > c_0$ for some sufficiently large $c_0$ (note that $c_2$ and $c_3$ only depend on $c_0$, a, and $\sigma_x$, and we can still take c large enough in the end without changing $c_2$ and $c_3$). Combining Eqs. 3–5, with probability at least $1 - 5e^{-n/c_0}$,

\[ \sum_{i=1}^{l}\frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} \le \frac{c_4\, l}{n}. \]

Second, consider the second sum in Eq. 2. Lemma 5 shows that, on the same high-probability event that we considered in bounding the first half of the sum, $\mu_n(A) \ge \lambda_{k+1}\, r_k(\Sigma)/c_1$. Hence,

\[ \sum_{i>l}\lambda_i^2\, z_i^\top A^{-2} z_i \le c_1^2\,\frac{\sum_{i>l}\lambda_i^2\,\|z_i\|^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}. \]

Notice that i>lλi2zi2 is a weighted sum of σx2-subexponential random variables, with the weights given by the λi2 in blocks of size n. Lemma 7 implies that, with probability at least 12et,

\[ \sum_{i>l}\lambda_i^2\,\|z_i\|^2 \le n\sum_{i>l}\lambda_i^2 + a\,\sigma_x^2\max\left\{\lambda_{l+1}^2\, t,\ \sqrt{tn\sum_{i>l}\lambda_i^4}\right\} \le n\sum_{i>l}\lambda_i^2 + a\,\sigma_x^2\max\left\{t\sum_{i>l}\lambda_i^2,\ \sqrt{tn}\sum_{i>l}\lambda_i^2\right\} \le c_5\, n\sum_{i>l}\lambda_i^2 \]

because t<n/c0. Combining the above gives

\[ \sum_{i>l}\lambda_i^2\, z_i^\top A^{-2} z_i \le c_6\, n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}. \]

Finally, putting both parts together and taking c>max{c0,c4,c6} gives the lemma.

Lower Bound on the Trace Term.

We first give a bound on a single term in the expression for tr(C) in Lemma 3 that holds regardless of rk(Σ). The proof is in SI Appendix, section D.

Lemma 8. There is a constant c such that, for any $i \ge 1$ with $\lambda_i > 0$ and any $0 \le k \le n/c$, with probability at least $1 - 5e^{-n/c}$,

\[ \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} \ge \frac{1}{c\,n}\left(1 + \frac{\sum_{j>k}\lambda_j + n\lambda_{k+1}}{n\lambda_i}\right)^{-2}. \]

We can extend these bounds to a lower bound on tr(C) using the following lemma. The proof is in SI Appendix, section E.

Lemma 9. Suppose that $\{\eta_i\}_{i=1}^n$ is a sequence of nonnegative random variables and that $\{t_i\}_{i=1}^n$ is a sequence of nonnegative real numbers (at least one of which is strictly positive) such that, for some $\delta \in (0,1)$ and any $i \le n$, $\Pr(\eta_i > t_i) \ge 1 - \delta$. Then,

\[ \Pr\left(\sum_{i=1}^{n}\eta_i \ge \frac{1}{2}\sum_{i=1}^{n} t_i\right) \ge 1 - 2\delta. \]

These two lemmas imply the following lower bound.

Lemma 10. There is a constant c such that, for any $0 \le k \le n/c$ and any $b > 1$, with probability at least $1 - 10e^{-n/c}$:

1) if $r_k(\Sigma) < bn$, then

\[ \mathrm{tr}(C) \ge \frac{k+1}{c\,b^2\, n}; \]

and

2) if $r_k(\Sigma) \ge bn$, then

\[ \mathrm{tr}(C) \ge \frac{1}{c\,b^2}\min_{l \le k}\left(\frac{l}{n} + b^2 n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right). \]

In particular, if all choices of $k \le n/c$ give $r_k(\Sigma) < bn$, then $r_{\lfloor n/c\rfloor}(\Sigma) < bn$ implies that, with probability at least $1 - 10e^{-n/c}$, $\mathrm{tr}(C) = \Omega_{\sigma_x}(1)$.

Proof: From Lemmas 3, 8, and 9, with probability at least $1 - 10e^{-n/c_1}$,

\[ \mathrm{tr}(C) \ge \frac{1}{c_1 n}\sum_i\left(1 + \frac{\sum_{j>k}\lambda_j + n\lambda_{k+1}}{n\lambda_i}\right)^{-2} \ge \frac{1}{c_2 n}\sum_i\min\left\{1,\ \frac{n^2\lambda_i^2}{\left(\sum_{j>k}\lambda_j\right)^2},\ \frac{\lambda_i^2}{\lambda_{k+1}^2}\right\} \ge \frac{1}{c_2 b^2 n}\sum_i\min\left\{1,\ \left(\frac{bn}{r_k(\Sigma)}\right)^2\frac{\lambda_i^2}{\lambda_{k+1}^2},\ \frac{\lambda_i^2}{\lambda_{k+1}^2}\right\}. \]

Now, if rk(Σ)<bn, then the second term in the minimum is always bigger than the third term, and in that case,

tr(C)1c2b2nimin1,λi2λk+12k+1c2b2n.

On the other hand, if $r_k(\Sigma) \ge bn$,

\[ \mathrm{tr}(C) \ge \frac{1}{c_2 b^2}\sum_i\min\left\{\frac{1}{n},\ b^2 n\,\frac{\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right\} = \frac{1}{c_2 b^2}\min_{l \le k}\left(\frac{l}{n} + b^2 n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right), \]

where the equality follows from the fact that the λi are nonincreasing.

A Simple Choice of l.

Recall that $\sigma_x$ is a constant. If no $k \le n/c$ has $r_k(\Sigma) \ge bn$, then Lemmas 2 and 10 imply that the expected excess risk is $\Omega(\sigma^2)$, which proves the first paragraph of Theorem 1 for large k*. If some $k \le n/c$ does have $r_k(\Sigma) \ge bn$, then the upper and lower bounds of Lemmas 6 and 10 are constant multiples of

\[ \min_{l \le k}\left(\frac{l}{n} + n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right). \]

It might seem surprising that any suitable choice of k suffices to give upper and lower bounds: what prevents one choice of k from giving an upper bound that falls below the lower bound that arises from another choice of k? However, the freedom to choose k is somewhat illusory: Lemma 5 shows that, for any qualifying value of k, the smallest eigenvalue of A is within a constant factor of $\lambda_{k+1}\, r_k(\Sigma)$. Thus, any two choices of k satisfying $k \le n/c$ and $r_k(\Sigma) \ge bn$ must have values of $\lambda_{k+1}\, r_k(\Sigma)$ within constant factors. The smallest such k simplifies the bound on $\mathrm{tr}(C)$, as the following lemma shows. The proof is in SI Appendix, section F.

Lemma 11. For any $b \ge 1$ and $k^* = \min\{k : r_k(\Sigma) \ge bn\}$, if $k^* < \infty$, we have

\[ \min_{l \le k^*}\left(\frac{l}{bn} + bn\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2}\right) = \frac{k^*}{bn} + bn\,\frac{\sum_{i>k^*}\lambda_i^2}{\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2} = \frac{k^*}{bn} + \frac{bn}{R_{k^*}(\Sigma)}. \]
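The claim of Lemma 11 is easy to probe numerically. The sketch below (ours; the heavy-tailed spectrum, the constant b = 2, and the sample size are arbitrary choices) evaluates the objective over $l \le k^*$ and checks that the minimizer coincides with $l = k^*$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, b = 200, 2.0
lams = np.sort(rng.pareto(1.5, size=5000) + 1e-3)[::-1]   # an arbitrary heavy-tailed spectrum
tail = np.cumsum(lams[::-1])[::-1]                         # tail[k] = sum_{i>k} lambda_i
tail_sq = np.cumsum((lams ** 2)[::-1])[::-1]               # tail_sq[k] = sum_{i>k} lambda_i^2

r = tail / lams                                            # r[k] = r_k(Sigma)
k_star = int(np.nonzero(r >= b * n)[0][0])                 # smallest k with r_k >= b n

denom = (lams[k_star] * r[k_star]) ** 2                    # (lambda_{k*+1} r_{k*}(Sigma))^2
ls = np.arange(k_star + 1)
f = ls / (b * n) + b * n * tail_sq[ls] / denom             # objective of Lemma 11 over l <= k*
print(int(f.argmin()), k_star)                             # the two indices coincide
```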

Finally, we can finish the proof of Theorem 1. Set b in Lemma 10 and Theorem 1 to the constant b from Lemma 6. Take c1 to be the maximum of the constants c from Lemmas 6 and 10.

By Lemma 10, if $k^* \ge n/c_1$, then with high probability $\mathrm{tr}(C) \ge 1/c_2$. However, by Lemma 10.2 and by Lemma 6, if $k^* < n/c_1$, then with high probability $\mathrm{tr}(C)$ is within a constant factor of $\min_{l \le k^*}\left(l/n + n\sum_{i>l}\lambda_i^2/\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2\right)$, which, by Lemma 11, is within a constant factor of $k^*/n + n/R_{k^*}(\Sigma)$. Taking c sufficiently large and combining these results with Lemma 2 and with the upper bound on the term $\theta^{*\top} B\,\theta^*$ in SI Appendix, section J completes the proof of the first paragraph of Theorem 1.

The proof of the second paragraph is in SI Appendix, section K.

Deep Neural Networks

How relevant are Theorems 1 and 2 to the phenomenon of benign overfitting in deep neural networks? One connection appears by considering regimes where deep neural networks are well approximated by linear functions of their parameters. This so-called neural tangent kernel (NTK) viewpoint has been vigorously pursued recently in an attempt to understand the optimization properties of deep learning methods. Very wide neural networks, trained with gradient descent from a suitable random initialization, can be accurately approximated by linear functions in an appropriate Hilbert space, and in this case, gradient descent finds an interpolating solution quickly (14–19). (Note that these papers do not consider prediction accuracy, except when there is no noise; for example, ref. 14, Assumption A1 implies that the network can compute a suitable real-valued response exactly, and the data-dependent bound of ref. 19, Theorem 5.1 becomes vacuous when independent noise is added to the $y_i$.) The eigenvalues of the covariance operator in this case can have a heavy tail under reasonable assumptions on the data distribution (20, 21), and the dimension is very large but finite as required for benign overfitting. However, the assumptions of Theorem 1 do not apply in this case. In particular, the assumption that the random elements of the Hilbert space are a linearly transformed vector with independent components is not satisfied. Thus, our results are not directly applicable in this—somewhat unrealistic—setting. Note that the slow decay of the eigenvalues of the NTK is in contrast to the case of the Gaussian and other smooth kernels, where the eigenvalues decay nearly exponentially quickly (22).

The phenomenon of benign overfitting was first observed in deep neural networks. Theorems 1 and 2 are steps toward understanding this phenomenon by characterizing when it occurs in the simple setting of linear regression. Those results suggest that covariance eigenvalues that are constant or slowly decaying in a high (but finite)-dimensional space might be important in the deep network setting also. Some authors have suggested viewing neural networks as finite-dimensional approximations to infinite-dimensional objects (23–25), and there are generalization bounds—although not for the overfitting regime—that are applicable to infinite-width deep networks with parameter norm constraints (26–30). However, the intuition from the linear setting suggests that truncating to a finite-dimensional space might be important for good statistical performance in the overfitting regime. Confirming this conjecture by extending our results to the setting of prediction in deep neural networks is an important open problem.

Conclusions and Further Work

Our results characterize when the phenomenon of benign overfitting occurs in high-dimensional linear regression with Gaussian data and more generally. We give finite sample excess risk bounds that reveal the covariance structure that ensures that the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization depends on two notions of the effective rank of the data covariance operator. It shows that overparameterization (that is, the existence of many low-variance and hence, unimportant directions in parameter space) is essential for benign overfitting and that data that lie in a large but finite-dimensional space exhibit the benign overfitting phenomenon with a much wider range of covariance properties than data that lie in an infinite-dimensional space.

There are several natural future directions. Our main theorem requires the conditional expectation E[y|x] to be a linear function of x, and it is important to understand whether the results are also true in the misspecified setting, where this assumption is not true. Our main result also assumes that the covariates are distributed as a linear function of a vector of independent random variables. We would like to understand the extent to which this assumption can be relaxed since it rules out some important examples, such as infinite-dimensional reproducing kernel Hilbert spaces with continuous kernels defined on finite-dimensional spaces. We would also like to understand how our results extend to loss functions other than squared error and what we can say about overfitting estimators beyond the minimum norm interpolating estimator. The most interesting future direction is understanding how these ideas could apply to nonlinearly parameterized function classes, such as neural networks, the methodology that uncovered the phenomenon of benign overfitting.

Data Availability.

There are no data associated with this manuscript.

Supplementary Material

Supplementary File
pnas.1907378117.sapp.pdf (381.1KB, pdf)

Acknowledgments

We acknowledge the support of NSF Grant IIS-1619362 and of a Google research award. G.L. was supported by the Spanish Ministry of Economy and Competitiveness, Grant PGC2018-101643-B-I00; “High-dimensional problems in structured probabilistic models - Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017”; and Google Focused Award “Algorithms and Learning for AI.” Part of this work was done as part of the fall 2018 program on Foundations of Data Science at the Simons Institute for the Theory of Computing.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission. R.B. is a guest editor invited by the Editorial Board.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “The Science of Deep Learning,” held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/science-of-deep-learning.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1907378117/-/DCSupplemental.

References

1. Zhang C., Bengio S., Hardt M., Recht B., Vinyals O., "Understanding deep learning requires rethinking generalization" in 5th International Conference on Learning Representations. https://openreview.net/forum?id=Sy8gdB9xx. Accessed 30 March 2020.
2. Hastie T., Tibshirani R., Friedman J. H., Elements of Statistical Learning (Springer, 2001).
3. Belkin M., Ma S., Mandal S., "To understand deep learning we need to understand kernel learning" in Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, 2018), vol. 80, pp. 540–548.
4. Belkin M., Hsu D., Mitra P., "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate" in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, S. Bengio et al., Eds. (NIPS, 2018), pp. 2306–2317.
5. Belkin M., Rakhlin A., Tsybakov A. B., Does data interpolation contradict statistical optimality? arXiv:1806.09471 (25 June 2018).
6. Devroye L., Györfi L., Krzyżak A., The Hilbert kernel regression estimate. J. Multivariate Anal. 65, 209–227 (1998).
7. Liang T., Rakhlin A., Just interpolate: Kernel "ridgeless" regression can generalize. arXiv:1808.00387 (1 August 2018).
8. Belkin M., Hsu D., Ma S., Mandal S., Reconciling modern machine learning and the bias-variance trade-off. arXiv:1812.11118 (28 December 2018).
9. Muthukumar V., Vodrahalli K., Sahai A., Harmless interpolation of noisy data in regression. arXiv:1903.09139 (21 March 2019).
10. Bartlett P. L., "Accurate prediction from interpolation: A new challenge for statistical learning theory (presentation at the National Academy of Sciences workshop, The Science of Deep Learning)" (video recording, 2019). https://www.youtube.com/watch?v=1y2sB38T6FU&feature=youtu.be. Accessed 14 March 2019.
11. Belkin M., Hsu D., Xu J., Two models of double descent for weak features. arXiv:1903.07571 (18 March 2019).
12. Hastie T., Montanari A., Rosset S., Tibshirani R. J., Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560 (19 March 2019).
13. Desoer C. A., Whalen B. H., A note on pseudoinverses. J. Soc. Ind. Appl. Math. 11, 442–446 (1963).
14. Li Y., Liang Y., Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv:1808.01204 (3 August 2018).
15. Du S. S., Poczós B., Zhai X., Singh A., Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054 (4 October 2018).
16. Du S. S., Lee J. D., Li H., Wang L., Zhai X., Gradient descent finds global minima of deep neural networks. arXiv:1811.03804 (9 November 2018).
17. Zou D., Cao Y., Zhou D., Gu Q., Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv:1811.08888 (21 November 2018).
18. Jacot A., Gabriel F., Hongler C., "Neural tangent kernel: Convergence and generalization in neural networks" in 32nd Conference on Neural Information Processing Systems, Bengio S. et al., Eds. (NeurIPS, 2018), pp. 8580–8589.
19. Arora S., Du S. S., Hu W., Li Z., Wang R., Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv:1901.08584 (24 January 2019).
20. Xie B., Liang Y., Song L., Diverse neural network learns true target functions. arXiv:1611.03131 (9 November 2016).
21. Cao Y., Fang Z., Wu Y., Zhou D. X., Gu Q., Towards understanding the spectral bias of deep learning. arXiv:1912.01198 (3 December 2019).
22. Belkin M., "Approximation beats concentration? An approximation view on inference with smooth radial kernels" in Conference on Learning Theory, 2018, Stockholm, Sweden, 6–9 July 2018, S. Bubeck, V. Perchet, P. Rigollet, Eds. (PMLR, 2018), vol. 75, pp. 1348–1361.
23. Lee W. S., Bartlett P. L., Williamson R. C., Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inf. Theor. 42, 2118–2132 (1996).
24. Bengio Y., Roux N. L., Vincent P., Delalleau O., Marcotte P., "Convex neural networks" in Advances in Neural Information Processing Systems 18, Weiss Y., Schölkopf B., Platt J. C., Eds. (MIT Press, Cambridge, MA, 2006), pp. 123–130.
25. Bach F., Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18, 1–53 (2017).
26. Bartlett P. L., The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theor. 44, 525–536 (1998).
27. Bartlett P. L., Mendelson S., Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002).
28. Neyshabur B., Tomioka R., Srebro N., "Norm-based capacity control in neural networks" in Proceedings of the 28th Conference on Learning Theory, Proceedings of Machine Learning Research, Grünwald P., Hazan E., Kale S., Eds. (PMLR, Paris, France, 2015), vol. 40, pp. 1376–1401.
29. Bartlett P., Foster D., Telgarsky M., "Spectrally-normalized margin bounds for neural networks" in Advances in Neural Information Processing Systems 30, Guyon I., et al., Eds. (Curran Associates, Inc., 2017), pp. 6240–6249.
30. Golowich N., Rakhlin A., Shamir O., "Size-independent sample complexity of neural networks" in Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research, Bubeck S., Perchet V., Rigollet P., Eds. (PMLR, 2018), vol. 75, pp. 297–299.
