Proceedings of the National Academy of Sciences of the United States of America 117 (48), 30063–30070 (24 April 2020). doi: 10.1073/pnas.1907378117

Benign overfitting in linear regression

Peter L Bartlett a,b,1, Philip M Long c, Gábor Lugosi d,e,f, Alexander Tsigler a

Abstract

The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space vs. when the data lie in a finite-dimensional space with dimension that grows faster than the sample size.

Keywords: statistical learning theory, overfitting, linear regression, interpolation


Deep learning methodology has revealed a surprising statistical phenomenon: overfitting can perform well. The classical perspective in statistical learning theory is that there should be a tradeoff between the fit to the training data and the complexity of the prediction rule. Whether complexity is measured in terms of the number of parameters, the number of nonzero parameters in a high-dimensional setting, the number of neighbors averaged in a nearest neighbor estimator, the scale of an estimate in a reproducing kernel Hilbert space, or the bandwidth of a kernel smoother, this tradeoff has been ubiquitous in statistical learning theory. Deep learning seems to operate outside the regime where results of this kind are informative since deep neural networks can perform well even with a perfect fit to the training data.

As one example of this phenomenon, consider the experiment illustrated in figure 1C in ref. 1: standard deep network architectures and stochastic gradient algorithms, run until they perfectly fit a standard image classification training set, give respectable prediction performance, even when significant levels of label noise are introduced. The deep networks in the experiments reported in ref. 1 achieved essentially zero cross-entropy loss on the training data. In statistics and machine learning textbooks, an estimate that fits every training example perfectly is often presented as an illustration of overfitting [“…interpolating fits…[are] unlikely to predict future data well at all” (ref. 2, p. 37)]. Thus, to arrive at a scientific understanding of the success of deep learning methods, it is a central challenge to understand the performance of prediction rules that fit the training data perfectly.

In this paper, we consider perhaps the simplest setting where we might hope to witness this phenomenon: linear regression. That is, we consider quadratic loss and linear prediction rules, and we assume that the dimension of the parameter space is large enough that a perfect fit is guaranteed. We consider data in an infinite-dimensional space (a separable Hilbert space), but our results apply to a finite-dimensional subspace as a special case. There is an ideal value of the parameters, θ*, corresponding to the linear prediction rule that minimizes the expected quadratic loss. We ask when it is possible to fit the data exactly and still compete with the prediction accuracy of θ*. Since we require more parameters than the sample size in order to fit exactly, the solution might be underdetermined, and therefore, there might be many interpolating solutions. We consider the most natural: choose the parameter vector θ^ with the smallest norm among all vectors that give perfect predictions on the training sample. (This corresponds to using the pseudoinverse to solve the normal equations; see below.) We ask when it is possible to overfit in this way—and embed all of the noise of the labels into the parameter estimate θ^—without harming prediction accuracy.

Our main result is a finite sample characterization of when overfitting is benign in this setting. The linear regression problem depends on the optimal parameters θ* and the covariance Σ of the covariates x. The properties of Σ turn out to be crucial since the magnitude of the variance in different directions determines both how the label noise gets distributed across the parameter space and how errors in parameter estimation in different directions in parameter space affect prediction accuracy. There is a classical decomposition of the excess prediction error into two terms. The first is rather standard: provided that the scale of the problem (that is, the sum of the eigenvalues of Σ) is small compared with the sample size n, the contribution to θ^ that we can view as coming from θ* is not too distorted. The second term is more interesting since it reflects the impact of the noise in the labels on prediction accuracy. We show that this part is small if and only if the effective rank of Σ in the subspace corresponding to low-variance directions is large compared with n. This necessary and sufficient condition of a large effective rank can be viewed as a property of significant overparameterization: fitting the training data exactly but with near-optimal prediction accuracy occurs if and only if there are many low-variance (and hence, unimportant) directions in parameter space where the label noise can be hidden.

The details are more complicated. The characterization depends in a specific way on two notions of effective rank, r and R; the smaller one, r, determines a split of Σ into large and small eigenvalues, and the excess prediction error depends on the effective rank as measured by the larger notion R of the subspace corresponding to the smallest eigenvalues. For the excess prediction error to be small, the smallest eigenvalues of Σ must decay slowly.

Studying the patterns of eigenvalues that allow benign overfitting reveals an interesting role for large but finite dimensions: in an infinite-dimensional setting, benign overfitting occurs only for a narrow range of decay rates of the eigenvalues. On the other hand, it occurs with any suitably slowly decaying eigenvalue sequence in a finite-dimensional space with dimension that grows faster than the sample size. Thus, for linear regression, data that lie in a large but finite-dimensional space exhibit the benign overfitting phenomenon with a much wider range of covariance properties than data that lie in an infinite-dimensional space.

The phenomenon of interpolating prediction rules has been an object of study by several authors over the last two years since it emerged as an intriguing mystery at the Simons Institute program on Foundations of Machine Learning in the spring of 2017. Belkin et al. (3) described an experimental study demonstrating that this phenomenon of accurate prediction for functions that interpolate noisy data also occurs for prediction rules chosen from reproducing kernel Hilbert spaces and explained the mismatch between this phenomenon and classical generalization bounds. Belkin et al. (4) gave an example of an interpolating decision rule—simplicial interpolation—with an asymptotic consistency property as the input dimension gets large. That work and subsequent work of Belkin et al. (5) studied kernel smoothing methods based on singular kernels that both interpolate and, with suitable bandwidth choice, give optimal rates for nonparametric estimation [building on earlier consistency results (6) for these unusual kernels]. Liang and Rakhlin (7) considered minimum norm interpolating kernel regression with kernels defined as nonlinear functions of the Euclidean inner product and showed that, with certain properties of the training sample (expressed in terms of the empirical kernel matrix), these methods can have good prediction accuracy. Belkin et al. (8) studied experimentally the excess risk as a function of the dimension of a sequence of parameter spaces for linear and nonlinear classes.

Subsequent to our work, ref. 9 considered the properties of the interpolating linear prediction rule with minimal expected squared error. After this work was presented at the NAS Colloquium on the Science of Deep Learning (10), we became aware of the concurrent work of Belkin et al. (11) and of Hastie et al. (12). Belkin et al. (11) calculated the excess risk for certain linear models (a regression problem with identity covariance and sparse optimal parameters, both with and without noise, and a problem with random Fourier features with no noise), and Hastie et al. (12) considered linear regression in an asymptotic regime, where sample size n and input dimension p go to infinity together with asymptotic ratio $p/n \to \gamma$. They assumed that, as p gets large, the empirical spectral distribution of Σ (the discrete measure on its set of eigenvalues) converges to a fixed measure, and they applied random matrix theory to explore the range of behaviors of the asymptotics of the excess prediction error as γ, the noise variance, and the eigenvalue distribution vary. They also studied the asymptotics of a model involving random nonlinear features. In contrast, we give upper and lower bounds on the excess prediction error for arbitrary finite sample size, for arbitrary covariance matrices, and for data of arbitrary dimension.

The next section introduces notation and definitions used throughout the paper, including definitions of the problem of linear regression and of various notions of effective rank of the covariance operator. The following section gives the characterization of benign overfitting, illustrates why the effective rank condition corresponds to significant overparameterization, and presents several examples of patterns of eigenvalues that allow benign overfitting, suggesting that slowly decaying covariance eigenvalues in input spaces of growing but finite dimension are the generic example of benign overfitting. Then we discuss the connections between these results and the benign overfitting phenomenon in deep neural networks and outline the proofs of the results.

Definitions and Notation

We consider linear regression problems, where a linear function of covariates x from a (potentially infinite-dimensional) Hilbert space H is used to predict a real-valued response variable y. We use vector notation, so that $x^\top\theta$ denotes the inner product between x and θ, and $x z^\top$ denotes the tensor product of $x, z \in H$.

Definition 1 (Linear Regression): A linear regression problem in a separable Hilbert space H is defined by a random covariate vector $x \in H$ and outcome $y \in \mathbb{R}$. We define

1) the covariance operator $\Sigma = \mathbb{E}\left[x x^\top\right]$ and

2) the optimal parameter vector $\theta^* \in H$, satisfying $\mathbb{E}\left(y - x^\top\theta^*\right)^2 = \min_\theta \mathbb{E}\left(y - x^\top\theta\right)^2$.

We assume that

1) x and y are mean zero;

2) $x = V \Lambda^{1/2} z$, where $\Sigma = V \Lambda V^\top$ is the spectral decomposition of Σ and z has components that are independent $\sigma_x^2$-sub-Gaussian, with $\sigma_x$ a positive constant; that is, for all $\lambda \in H$,

\[ \mathbb{E}\left[\exp\left(\lambda^\top z\right)\right] \le \exp\left(\sigma_x^2 \|\lambda\|^2 / 2\right), \]

where $\|\cdot\|$ is the norm in the Hilbert space H;

3) the conditional noise variance is bounded below by some constant $\sigma^2$,

\[ \mathbb{E}\left[\left(y - x^\top\theta^*\right)^2 \,\middle|\, x\right] \ge \sigma^2; \]

4) $y - x^\top\theta^*$ is $\sigma_y^2$-sub-Gaussian conditionally on x; that is, for all $\lambda \in \mathbb{R}$,

\[ \mathbb{E}\left[\exp\left(\lambda\left(y - x^\top\theta^*\right)\right) \,\middle|\, x\right] \le \exp\left(\sigma_y^2 \lambda^2 / 2\right) \]

(note that this implies $\mathbb{E}[y \mid x] = x^\top\theta^*$); and

5) almost surely, the projection of the data X on the space orthogonal to any eigenvector of Σ spans a space of dimension n.

Given a training sample $(x_1, y_1), \ldots, (x_n, y_n)$ of n independent pairs with the same distribution as (x, y), an estimator returns a parameter estimate $\theta \in H$. The excess risk of the estimator is defined as

\[ R(\theta) := \mathbb{E}_{x,y}\left[\left(y - x^\top\theta\right)^2 - \left(y - x^\top\theta^*\right)^2\right], \]

where $\mathbb{E}_{x,y}$ denotes the conditional expectation given all random quantities other than x, y (in this case, given the estimate θ). Define the vectors $y \in \mathbb{R}^n$ with entries $y_i$ and $\varepsilon \in \mathbb{R}^n$ with entries $\varepsilon_i = y_i - x_i^\top\theta^*$. We use infinite matrix notation: X denotes the linear map from H to $\mathbb{R}^n$ corresponding to $x_1, \ldots, x_n \in H$, so that $X\theta \in \mathbb{R}^n$ has ith component $x_i^\top\theta$. We use similar notation for the linear map $X^\top$ from $\mathbb{R}^n$ to H.

Notice that Assumptions 1 to 5 are satisfied when x and y are jointly Gaussian with zero mean and rank(Σ)>n.

We shall be concerned with situations where an estimator θ can fit the data perfectly: that is, $X\theta = y$. Typically, this implies that there are many such vectors. We consider the interpolating estimator with minimal norm in H. We use $\|\cdot\|$ to denote both the Euclidean norm of a vector in $\mathbb{R}^n$ and the Hilbert space norm.

Definition 2 (Minimum Norm Estimator): Given data $X \in H^n$ and $y \in \mathbb{R}^n$, the minimum norm estimator $\hat\theta$ solves the optimization problem

\[ \min_{\theta \in H} \ \|\theta\|^2 \quad \text{such that} \quad \|X\theta - y\|^2 = \min_{\beta} \|X\beta - y\|^2. \]

By the projection theorem, parameter vectors that solve the least squares problem $\min_\beta \|X\beta - y\|^2$ solve the normal equations, and therefore, we can equivalently write $\hat\theta$ as the minimum norm solution to the normal equations

\[ \hat\theta = \arg\min\left\{\|\theta\|^2 : X^\top X \theta = X^\top y\right\} = \left(X^\top X\right)^\dagger X^\top y = X^\top\left(X X^\top\right)^\dagger y, \]

where $\left(X^\top X\right)^\dagger$ denotes the pseudoinverse of the bounded linear operator $X^\top X$ (for infinite-dimensional H, the existence of the pseudoinverse is guaranteed because $X^\top X$ is bounded and has a closed range) (13). When H has dimension p with $p < n$ and X has rank p, there is a unique solution to the normal equations. In contrast, Assumption 5 in Definition 1 implies that we can find many solutions $\theta \in H$ to the normal equations that achieve $X\theta = y$. The minimum norm solution is given by

\[ \hat\theta = X^\top\left(X X^\top\right)^{-1} y. \tag{1} \]
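As a concrete illustration of Eq. 1, here is a minimal numpy sketch (ours, not part of the paper; the dimensions, noise level, and variable names are arbitrary choices) that computes the minimum norm interpolator via the pseudoinverse and checks that it fits the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                         # sample size n, ambient dimension p > n

# Synthetic data: rows of X are the covariate vectors x_i, y is the response vector.
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Minimum norm solution to the normal equations: theta_hat = X^T (X X^T)^+ y.
theta_hat = X.T @ np.linalg.pinv(X @ X.T) @ y
# Equivalently: np.linalg.pinv(X) @ y, or np.linalg.lstsq(X, y, rcond=None)[0].

assert np.allclose(X @ theta_hat, y)   # a perfect fit to the (noisy) training data
print(np.linalg.norm(theta_hat))       # the smallest norm among all interpolators
```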

Our main result gives tight bounds on the excess risk of this minimum norm estimator in terms of certain notions of effective rank of the covariance that are defined in terms of its eigenvalues.

We use $\mu_1(\Sigma) \ge \mu_2(\Sigma) \ge \cdots$ to denote the eigenvalues of Σ in descending order, and we denote the operator norm of Σ by $\|\Sigma\|$. We use I to denote the identity operator on H and $I_n$ to denote the $n \times n$ identity matrix.

Definition 3 (Effective Ranks): For the covariance operator Σ, define $\lambda_i = \mu_i(\Sigma)$ for $i = 1, 2, \ldots$. If $\sum_{i=1}^{\infty}\lambda_i < \infty$ and $\lambda_{k+1} > 0$ for $k \ge 0$, define

\[ r_k(\Sigma) = \frac{\sum_{i>k}\lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\left(\sum_{i>k}\lambda_i\right)^2}{\sum_{i>k}\lambda_i^2}. \]
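To make the definition concrete, the following small numpy sketch (ours; the eigenvalue sequence, the truncation, and the constant b = 1 are illustrative assumptions) computes $r_k(\Sigma)$ and $R_k(\Sigma)$ from a nonincreasing eigenvalue sequence, together with the index $k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bn\}$ that appears in Theorem 1 below.

```python
import numpy as np

def effective_ranks(lams, k):
    """r_k(Sigma) and R_k(Sigma) for a nonincreasing array of eigenvalues (lams[k] > 0)."""
    tail = lams[k:]                               # lambda_{k+1}, lambda_{k+2}, ...
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()

def k_star(lams, n, b=1.0):
    """Smallest k with r_k(Sigma) >= b * n, or None if no such k exists."""
    for k in range(len(lams)):
        if lams[k] <= 0:
            break
        if effective_ranks(lams, k)[0] >= b * n:
            return k
    return None

# Example: lambda_i = 1 / (i * log^2(i + 1)), truncated to 10^5 terms.
i = np.arange(1, 100_001)
lams = 1.0 / (i * np.log(i + 1) ** 2)
print(effective_ranks(lams, 0))                   # (r_0, R_0)
print(k_star(lams, n=100))                        # k* for sample size n = 100
```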

Main Results

The following theorem establishes nearly matching upper and lower bounds for the risk of the minimum norm interpolating estimator.

Theorem 1. For any σx, there are b,c,c1>1 for which the following holds. Consider a linear regression problem from Definition 1. Define

\[ k^* = \min\left\{k \ge 0 : r_k(\Sigma) \ge b n\right\}, \]

where the minimum of the empty set is defined as $\infty$. Suppose that $\delta < 1$ with $\log(1/\delta) < n/c$. If $k^* \ge n/c_1$, then $\mathbb{E}\,R(\hat\theta) \ge \sigma^2/c$. Otherwise,

\[ R(\hat\theta) \le c\,\|\theta^*\|^2\,\|\Sigma\|\,\max\left\{\sqrt{\frac{r_0(\Sigma)}{n}},\ \frac{r_0(\Sigma)}{n},\ \sqrt{\frac{\log(1/\delta)}{n}}\right\} + c\,\log(1/\delta)\,\sigma_y^2\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right) \]

with probability at least $1 - \delta$, and

\[ \mathbb{E}\,R(\hat\theta) \ge \frac{\sigma^2}{c}\left(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\right). \]

Moreover, there are universal constants $a_1, a_2, n_0$ such that, for all $n \ge n_0$, for all Σ, and for all $t \ge 0$, there is a θ* with $\|\theta^*\| = t$ such that, for $x \sim N(0, \Sigma)$ and $y \mid x \sim N\left(x^\top\theta^*, \|\theta^*\|^2\|\Sigma\|\right)$, with probability at least 1/4,

\[ R(\hat\theta) \ge \frac{1}{a_1}\,\|\theta^*\|^2\,\|\Sigma\|\;\mathbb{1}\!\left[\frac{r_0(\Sigma)}{n\log\left(1 + r_0(\Sigma)\right)} \ge a_2\right]. \]
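The following small simulation (ours, not from the paper) illustrates the noise part of this characterization under Gaussian data: it estimates the excess risk of the minimum norm interpolator by Monte Carlo and compares it with $k^*/n + n/R_{k^*}(\Sigma)$. The eigenvalue profile, the sample size, the noise level, and the choice b = 1 (the theorem's constant b is unspecified) are illustrative assumptions, so the comparison is only meant to hold up to constant and logarithmic factors.

```python
import numpy as np

rng = np.random.default_rng(1)

def min_norm_excess_risk(lams, n, theta_star, noise_sd, trials=50):
    """Monte Carlo estimate of the excess risk of the minimum norm interpolator
    for Gaussian data with covariance diag(lams) and E[y|x] = x^T theta_star."""
    p = len(lams)
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, p)) * np.sqrt(lams)   # rows x_i ~ N(0, diag(lams))
        y = X @ theta_star + noise_sd * rng.standard_normal(n)
        theta_hat = X.T @ np.linalg.solve(X @ X.T, y)     # minimum norm interpolator
        delta = theta_hat - theta_star
        risks.append(float(np.sum(lams * delta ** 2)))    # excess risk = delta^T Sigma delta
    return float(np.mean(risks))

def r_k(lams, k):
    return lams[k:].sum() / lams[k]

def R_k(lams, k):
    return lams[k:].sum() ** 2 / (lams[k:] ** 2).sum()

# One large direction plus many small, nearly flat ones (p >> n); all choices here are ours.
n, p = 100, 2000
lams = np.full(p, 1e-2)
lams[0] = 1.0
theta_star = np.zeros(p)
theta_star[0] = 1.0

k_star = next(k for k in range(p) if r_k(lams, k) >= n)   # b set to 1 for illustration
bound_scale = k_star / n + n / R_k(lams, k_star)
print(min_norm_excess_risk(lams, n, theta_star, noise_sd=1.0), bound_scale)
# The two printed numbers should be of the same order once the bias term is small.
```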

Effective Ranks and Overparameterization.

In order to understand the implications of Theorem 1, we now study relationships between the two notions of effective rank, rk and Rk, and establish sufficient and necessary conditions for the sequence {λi} of eigenvalues to lead to small excess risk.

The following lemma shows that the two notions of effective rank are closely related. SI Appendix, section H has its proof and other properties of rk and Rk.

Lemma 1. $r_k(\Sigma) \ge 1$, $r_k^2(\Sigma) = r_k(\Sigma^2)\,R_k(\Sigma)$, and

\[ r_k(\Sigma^2) \le r_k(\Sigma) \le R_k(\Sigma) \le r_k^2(\Sigma). \]

Notice that $r_0(I_p) = R_0(I_p) = p$. More generally, if all of the nonzero eigenvalues of Σ are identical, then $r_0(\Sigma) = R_0(\Sigma) = \mathrm{rank}(\Sigma)$. For Σ with finite rank, we can express both $r_0(\Sigma)$ and $R_0(\Sigma)$ as a product of the rank and a notion of symmetry. In particular, for $\mathrm{rank}(\Sigma) = p$, we can write

\[ r_0(\Sigma) = \mathrm{rank}(\Sigma)\, s(\Sigma), \qquad R_0(\Sigma) = \mathrm{rank}(\Sigma)\, S(\Sigma), \quad\text{with}\quad s(\Sigma) = \frac{1}{p}\sum_{i=1}^{p}\frac{\lambda_i}{\lambda_1}, \qquad S(\Sigma) = \frac{\left(\frac{1}{p}\sum_{i=1}^{p}\lambda_i\right)^2}{\frac{1}{p}\sum_{i=1}^{p}\lambda_i^2}. \]

Both notions of symmetry s and S lie between 1/p (when $\lambda_2 \to 0$) and 1 (when the $\lambda_i$ are all equal).

Theorem 1 shows that, for the minimum norm estimator to have near-optimal prediction accuracy, r0(Σ) should be small compared with the sample size n (from the first term) and rk*(Σ) and Rk*(Σ) should be large compared with n. Together, these conditions imply that overparameterization is essential for benign overfitting in this setting: the number of nonzero eigenvalues should be large compared with n, they should have a small sum compared with n, and there should be many eigenvalues no larger than λk*. If the number of these small eigenvalues is not much larger than n, then they should be roughly equal, but they can be more asymmetric if there are many more of them.

The following theorem shows that the kind of overparameterization that is essential for benign overfitting requires Σ to have a heavy tail. (The proof, along with some other examples illustrating the boundary of benign overfitting, is in SI Appendix, section I.) In particular, if we fix Σ in an infinite-dimensional Hilbert space and ask when the excess risk of the minimum norm estimator approaches zero as $n \to \infty$, this imposes tight restrictions on the eigenvalues of Σ. However, there are many other possibilities for these asymptotics if Σ can change with n. Since rescaling X affects the accuracy of the least norm interpolant in an obvious way, we may assume without loss of generality that $\|\Sigma\| = 1$. If we restrict our attention to this case, then informally, Theorem 1 implies that, when the covariance operator for data with n examples is $\Sigma_n$, the excess risk of the least norm interpolant converges to zero if $r_0(\Sigma_n)/n \to 0$, $k_n^*/n \to 0$, and $n/R_{k_n^*}(\Sigma_n) \to 0$, and only if $r_0(\Sigma_n)/\left(n\log(1+r_0(\Sigma_n))\right) \to 0$, $k_n^*/n \to 0$, and $n/R_{k_n^*}(\Sigma_n) \to 0$, where $k_n^* = \min\{k \ge 0 : r_k(\Sigma_n) \ge bn\}$ for the universal constant b in Theorem 1. This motivates the following definition.

Definition 4: A sequence of covariance operators $\Sigma_n$ with $\|\Sigma_n\| = 1$ is benign if

\[ \lim_{n\to\infty}\frac{r_0(\Sigma_n)}{n} = \lim_{n\to\infty}\frac{k_n^*}{n} = \lim_{n\to\infty}\frac{n}{R_{k_n^*}(\Sigma_n)} = 0. \]

We give some examples of benign and nonbenign settings.

Theorem 2.

1) If $\mu_k(\Sigma) = k^{-\alpha}\ln^{-\beta}\left((k+1)e/2\right)$, then Σ is benign if and only if $\alpha = 1$ and $\beta > 1$.

2) If

\[ \mu_k(\Sigma_n) = \begin{cases} \gamma_k + \epsilon_n & \text{if } k \le p_n, \\ 0 & \text{otherwise}, \end{cases} \]

and $\gamma_k = \Theta\left(\exp(-k/\tau)\right)$, then $\Sigma_n$ with $\|\Sigma_n\| = 1$ is benign if and only if $p_n = \omega(n)$ and $n e^{-o(n)} = \epsilon_n p_n = o(n)$. Furthermore, for $p_n = \Omega(n)$ and $\epsilon_n p_n = n e^{-o(n)}$,

\[ R(\hat\theta) = O\!\left(\frac{\epsilon_n p_n + 1}{n} + \frac{\ln\left(n/(\epsilon_n p_n)\right)}{n} + \max\left\{\frac{1}{n}, \frac{n}{p_n}\right\}\right). \]

Compare the situations described by Theorem 2.1 and 2.2. Theorem 2.1 shows that, for infinite-dimensional data with a fixed covariance, benign overfitting occurs if and only if the eigenvalues of the covariance operator decay just slowly enough for their sum to remain finite. Theorem 2.2 shows that the situation is very different if the data have finite dimension and a small amount of isotropic noise is added to the covariates. In that case, even if the eigenvalues of the original covariance operator (before the addition of isotropic noise) decay very rapidly, benign overfitting occurs if and only if both the dimension is large compared with the sample size and the isotropic component of the covariance is sufficiently small—but not exponentially small—compared with the sample size.

These examples illustrate the tension between the slow decay of eigenvalues that is needed for k/n+n/Rk to be small and the summability of eigenvalues that is needed for r0(Σ)/n to be small. There are two ways to resolve this tension. First, in the infinite-dimensional setting, slow decay of the eigenvalues suffices—decay just fast enough to ensure summability—as shown by Theorem 2.1. (SI Appendix, section I gives another example—Theorem S14.2—where the eigenvalue decay is allowed to vary with n; in that case, Σn is benign iff the decay rate gets close—but not too close—to 1/k as n increases.) Second, the other way to resolve the tension is to consider a finite-dimensional setting (which ensures that the eigenvalues are summable), and in this case, arbitrarily slow decay is possible. Theorem 2.2 gives an example of this: eigenvalues that are all at least as large as a small constant. SI Appendix, section I gives other examples with a similar flavor, including a truncated infinite series that decays sufficiently slowly that the sum does not converge (SI Appendix, section I, Theorem S14.3). Theorem 2.1 shows that a very specific decay rate is required in infinite dimensions, which suggests that this is an unusual phenomenon in that case. The more generic scenario where benign overfitting will occur is demonstrated by Theorem 2.2, with eigenvalues that are either at least a constant or slowly decaying in a very high—but finite-dimensional—space.
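As a rough numerical companion to Theorem 2 (ours, not from the paper), the sketch below evaluates the three quantities of Definition 4 for truncated versions of the two eigenvalue profiles. The truncation level, the decay scale τ = 5, and the particular p_n and ε_n are illustrative assumptions chosen to satisfy the conditions of Theorem 2.2.

```python
import numpy as np

def benign_quantities(lams, n, b=1.0):
    """The three quantities of Definition 4: r_0/n, k*/n, n/R_{k*} (b plays the role
    of the unspecified constant in Theorem 1; here it is set to 1)."""
    lams = np.sort(np.asarray(lams, dtype=float))[::-1]
    lams = lams / lams[0]                                  # normalize ||Sigma|| = 1
    tail = np.cumsum(lams[::-1])[::-1]                     # tail[k] = sum of lams[k:]
    tail_sq = np.cumsum((lams ** 2)[::-1])[::-1]
    r = tail / lams                                        # r[k] = r_k(Sigma)
    qualifying = np.nonzero(r >= b * n)[0]
    if len(qualifying) == 0:
        return r[0] / n, np.inf, np.inf
    k_star = int(qualifying[0])
    R_kstar = tail[k_star] ** 2 / tail_sq[k_star]
    return r[0] / n, k_star / n, n / R_kstar

for n in (100, 1000, 10_000):
    # Theorem 2.1-style infinite-dimensional spectrum, truncated for computation.
    i = np.arange(1, 10 ** 6 + 1)
    lams_inf = i ** (-1.0) * np.log((i + 1) * np.e / 2) ** (-2.0)
    # Theorem 2.2-style spectrum: exponential decay plus a small isotropic component
    # eps_n in dimension p_n; these p_n and eps_n are our illustrative choices with
    # p_n = omega(n) and n e^{-o(n)} <= eps_n * p_n = o(n).
    p_n = int(n ** 1.5)
    eps_n = 1.0 / (n * np.log(n))
    lams_fin = np.exp(-np.arange(p_n) / 5.0) + eps_n
    print(n, benign_quantities(lams_inf, n), benign_quantities(lams_fin, n))
```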

Proof

Throughout the proofs, we treat $\sigma_x$ (the sub-Gaussian norm of the covariates) as a constant. Therefore, we use the symbols $b, c, c_1, c_2, \ldots$ to refer to constants that only depend on $\sigma_x$. Their values are suitably large (and always at least one) but do not depend on any parameters of the problems that we consider other than $\sigma_x$. For universal constants that do not depend on any parameters of the problem at all, we use the symbol a. Also, whenever we sum over eigenvectors of Σ, the sum is restricted to eigenvectors with nonzero eigenvalues.

Outline.

The first step is a standard decomposition of the excess risk into two pieces, a term that corresponds to the distortion that is introduced by viewing θ* through the lens of the finite sample and a term that corresponds to the distortion introduced by the noise $\varepsilon = y - X\theta^*$. The impact of both sources of error in $\hat\theta$ on the excess risk is modulated by the covariance Σ, which gives different weight to different directions in parameter space.

Lemma 2. The excess risk of the minimum norm estimator satisfies $R(\hat\theta) \le 2\,\theta^{*\top} B\,\theta^* + c\,\sigma_y^2\log(1/\delta)\,\mathrm{tr}(C)$ with probability at least $1 - \delta$ over ε, and $\mathbb{E}_\varepsilon R(\hat\theta) \ge \theta^{*\top} B\,\theta^* + \sigma^2\,\mathrm{tr}(C)$, where

\[ B = \left(I - X^\top\left(XX^\top\right)^{-1}X\right)\Sigma\left(I - X^\top\left(XX^\top\right)^{-1}X\right), \qquad C = \left(XX^\top\right)^{-1}X\Sigma X^\top\left(XX^\top\right)^{-1}. \]

The proof of this lemma is in SI Appendix, section A. SI Appendix, sections J and K give bounds on the term $\theta^{*\top} B\,\theta^*$. The heart of the proof is controlling $\mathrm{tr}(C)$.

Before continuing with the proof, let us make a quick digression to note that Lemma 2 already begins to give an idea that many low-variance directions are necessary for the least norm interpolator to succeed. Let us consider the extreme case that $p = n$ and $\Sigma = I$. In this case, $C = \left(XX^\top\right)^{-1}$. For Gaussian data, for instance, standard random matrix theory implies that, with high probability, the eigenvalues of $XX^\top$ are all at most a constant factor times n, which implies that $\mathrm{tr}(C)$ is bounded below by a constant, and then, Lemma 2 implies that the least norm interpolant's excess risk is at least a constant.
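A quick numerical illustration of this digression (our sketch; the dimensions and the particular tail eigenvalue 1/n are arbitrary choices): with p = n and Σ = I, tr(C) does not vanish, while adding many additional low-variance directions drives it down.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Case 1: p = n, Sigma = I. Here C = (X X^T)^{-1} and tr(C) does not vanish.
X = rng.standard_normal((n, n))
print(np.trace(np.linalg.inv(X @ X.T)))        # bounded away from 0 (often much larger)

# Case 2: many additional low-variance directions (p >> n, small tail eigenvalues).
p = 20 * n
lams = np.concatenate([np.ones(n), np.full(p - n, 1.0 / n)])
X = rng.standard_normal((n, p)) * np.sqrt(lams)
A_inv = np.linalg.inv(X @ X.T)
C = A_inv @ (X * lams) @ X.T @ A_inv           # C = (XX^T)^{-1} X Sigma X^T (XX^T)^{-1}
print(np.trace(C))                             # typically much smaller than in case 1
```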

To prove that tr(C) can be controlled for suitable Σ, the first step is to express it in terms of sums of outer products of unit-covariance, independent, sub-Gaussian random vectors. We show that, when there is a k* with k*/n small and rk*(Σ)/n large, all of the smallest eigenvalues of these matrices are suitably concentrated, and this implies that tr(C) is bounded above by

\[ \min_{l \le k^*}\left(\frac{l}{n} + n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2}\right). \]

(Later, we show that the minimizer is l=k*.) Next, we show that this expression is also a lower bound on tr(C) provided that there is such a k*. Conversely, we show that, for any k for which rk(Σ) is not large compared with n, tr(C) is at least as big as a constant times min(1,k/n). Combining shows that, when k*/n is small, tr(C) is upper and lower bounded by constant factors times

\[ \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}. \]

Unit Variance Sub-Gaussians.

Our assumptions allow the trace of C to be expressed as a function of many independent sub-Gaussian vectors.

Lemma 3. Consider a covariance operator Σ with $\lambda_i = \mu_i(\Sigma)$ and $\lambda_n > 0$. Write its spectral decomposition $\Sigma = \sum_j \lambda_j v_j v_j^\top$, where the orthonormal $v_j \in H$ are the eigenvectors corresponding to the $\lambda_j$. For i with $\lambda_i > 0$, define $z_i = X v_i / \sqrt{\lambda_i}$. Then,

\[ \mathrm{tr}(C) = \sum_i \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i, \]

and these $z_i \in \mathbb{R}^n$ are independent $\sigma_x^2$-sub-Gaussian. Furthermore, for any i with $\lambda_i > 0$, we have

\[ \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i = \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1 + \lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2}, \]

where $A_{-i} = \sum_{j\ne i}\lambda_j z_j z_j^\top$.

Proof: By Assumption 2 in Definition 1, the random variables $x^\top v_i/\sqrt{\lambda_i}$ are independent $\sigma_x^2$-sub-Gaussian. We consider X in the basis of eigenvectors of Σ, $X v_i = \sqrt{\lambda_i}\, z_i$, to see that

\[ X X^\top = \sum_i \lambda_i z_i z_i^\top, \qquad X \Sigma X^\top = \sum_i \lambda_i^2 z_i z_i^\top, \]

and therefore, we can write

\[ \mathrm{tr}(C) = \mathrm{tr}\left(\left(XX^\top\right)^{-1} X\Sigma X^\top \left(XX^\top\right)^{-1}\right) = \sum_i \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i. \]

For the second part, we use SI Appendix, section B, Lemma S3, which is a consequence of the Sherman–Morrison–Woodbury formula:

\[ \lambda_i^2\, z_i^\top\Big(\sum_j \lambda_j z_j z_j^\top\Big)^{-2} z_i = \lambda_i^2\, z_i^\top\left(\lambda_i z_i z_i^\top + A_{-i}\right)^{-2} z_i = \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1 + \lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2}, \]

by SI Appendix, section B, Lemma S3 for the case $k = 1$ and $Z = \sqrt{\lambda_i}\, z_i$. Note that $A_{-i}$ is invertible by Assumption 5 in Definition 1.

The weighted sum of outer products of these sub-Gaussian vectors plays a central role in the rest of the proof. Define

\[ A = \sum_i \lambda_i z_i z_i^\top, \qquad A_{-i} = \sum_{j \ne i}\lambda_j z_j z_j^\top, \qquad A_k = \sum_{i > k}\lambda_i z_i z_i^\top, \]

where the $z_i \in \mathbb{R}^n$ are the independent vectors with independent $\sigma_x^2$-sub-Gaussian coordinates with unit variance defined in Lemma 3. Note that the vector $z_i$ is independent of the matrix $A_{-i}$, and therefore, in the last part of Lemma 3, all of the random quadratic forms are independent of the points where those forms are evaluated.
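The identity in Lemma 3 and its leave-one-out form are easy to check numerically. The sketch below (ours; the spectrum and the dimensions are arbitrary choices, picked only to keep the matrices well conditioned) verifies both on synthetic Gaussian $z_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 200
lams = 1.0 / np.sqrt(np.arange(1, p + 1))        # a slowly decaying spectrum (our choice)
Z = rng.standard_normal((p, n))                  # row i of Z plays the role of z_i

A = (lams[:, None] * Z).T @ Z                    # A = sum_i lambda_i z_i z_i^T
A_inv = np.linalg.inv(A)

# tr(C) written as in Lemma 3.
tr_C = sum(lams[i] ** 2 * Z[i] @ A_inv @ A_inv @ Z[i] for i in range(p))

# The same quantity computed directly from C = (XX^T)^{-1} X Sigma X^T (XX^T)^{-1}.
X = (np.sqrt(lams)[:, None] * Z).T               # n x p with columns sqrt(lambda_i) z_i
XXT_inv = np.linalg.inv(X @ X.T)
C = XXT_inv @ (X * lams) @ X.T @ XXT_inv
assert np.isclose(tr_C, np.trace(C))

# Leave-one-out identity from the last display of Lemma 3, checked for i = 0.
i = 0
A_mi = A - lams[i] * np.outer(Z[i], Z[i])        # A_{-i}
A_mi_inv = np.linalg.inv(A_mi)
lhs = lams[i] ** 2 * Z[i] @ A_inv @ A_inv @ Z[i]
rhs = (lams[i] ** 2 * Z[i] @ A_mi_inv @ A_mi_inv @ Z[i]
       / (1 + lams[i] * Z[i] @ A_mi_inv @ Z[i]) ** 2)
assert np.isclose(lhs, rhs)
print(tr_C)
```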

Concentration of A.

The next step is to show that the eigenvalues of A, $A_{-i}$, and $A_k$ are concentrated. The proof of the following inequality is in SI Appendix, section C. Recall that $\mu_1(A)$ and $\mu_n(A)$ denote the largest and the smallest eigenvalues of the $n \times n$ matrix A.

Lemma 4. There is a constant c such that, for any $k \ge 0$, with probability at least $1 - 2e^{-n/c}$,

\[ \frac{1}{c}\sum_{i>k}\lambda_i - c\,\lambda_{k+1} n \;\le\; \mu_n(A_k) \;\le\; \mu_1(A_k) \;\le\; c\left(\sum_{i>k}\lambda_i + \lambda_{k+1} n\right). \]

The following lemma uses this result to give bounds on the eigenvalues of $A_k$, which in turn give bounds on some eigenvalues of $A_{-i}$ and A. For these upper and lower bounds to match up to a constant factor, the sum of the eigenvalues of $A_k$ should dominate the term involving its leading eigenvalue, which is a condition on the effective rank $r_k(\Sigma)$. The lemma shows that, once $r_k(\Sigma)$ is sufficiently large, all of the eigenvalues of $A_k$ are identical up to a constant factor.

Lemma 5. There are constants $b, c \ge 1$ such that, for any $k \ge 0$, with probability at least $1 - 2e^{-n/c}$:

1) for all $i \ge 1$,

\[ \mu_{k+1}(A_{-i}) \le \mu_{k+1}(A) \le \mu_1(A_k) \le c\left(\sum_{j>k}\lambda_j + \lambda_{k+1} n\right); \]

2) for all $1 \le i \le k$,

\[ \mu_n(A) \ge \mu_n(A_{-i}) \ge \mu_n(A_k) \ge \frac{1}{c}\sum_{j>k}\lambda_j - c\,\lambda_{k+1} n; \]

and

3) if $r_k(\Sigma) \ge bn$, then

\[ \frac{1}{c}\,\lambda_{k+1}\, r_k(\Sigma) \le \mu_n(A_k) \le \mu_1(A_k) \le c\,\lambda_{k+1}\, r_k(\Sigma). \]

Proof: By Lemma 4, we know that, with probability at least $1 - 2e^{-n/c_1}$,

\[ \frac{1}{c_1}\sum_{j>k}\lambda_j - c_1\lambda_{k+1} n \;\le\; \mu_n(A_k) \;\le\; \mu_1(A_k) \;\le\; c_1\left(\sum_{j>k}\lambda_j + \lambda_{k+1} n\right). \]

First, the matrix $A - A_k$ has rank at most k (as a sum of k matrices of rank 1). Thus, there is a linear space L of dimension $n - k$ such that, for all $v \in L$, $v^\top A v = v^\top A_k v \le \mu_1(A_k)\|v\|^2$, and therefore, $\mu_{k+1}(A) \le \mu_1(A_k)$.

Second, by the Courant–Fischer–Weyl theorem, for all i and j, $\mu_j(A_{-i}) \le \mu_j(A)$ (SI Appendix, section G, Lemma S11). On the other hand, for $i \le k$, $A_{-i} \succeq A_k$, and therefore, all of the eigenvalues of $A_{-i}$ are lower bounded by $\mu_n(A_k)$.

Finally, if $r_k(\Sigma) \ge bn$,

\[ \sum_{j>k}\lambda_j + \lambda_{k+1} n = \lambda_{k+1}\, r_k(\Sigma) + \lambda_{k+1} n \le \left(1 + \frac{1}{b}\right)\lambda_{k+1}\, r_k(\Sigma), \]
\[ \frac{1}{c_1}\sum_{j>k}\lambda_j - c_1\lambda_{k+1} n = \frac{1}{c_1}\lambda_{k+1}\, r_k(\Sigma) - c_1\lambda_{k+1} n \ge \left(\frac{1}{c_1} - \frac{c_1}{b}\right)\lambda_{k+1}\, r_k(\Sigma). \]

Choosing $b > c_1^2$ and $c > \max\left\{c_1 + 1/c_1,\ \left(1/c_1 - c_1/b\right)^{-1}\right\}$ gives the third claim of the lemma.
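A quick numerical illustration of the third claim (ours; the flat tail spectrum and the dimensions are arbitrary choices with $r_k(\Sigma)$ much larger than n): all n eigenvalues of $A_k$ land within a modest factor of $\lambda_{k+1}\, r_k(\Sigma)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 100, 10, 50_000

# A tail with r_k(Sigma) >> n: lambda_{k+1}, ..., lambda_p all equal.
lams_tail = np.full(p - k, 1e-3)
r_k = lams_tail.sum() / lams_tail[0]             # = p - k, much larger than n

Z = rng.standard_normal((p - k, n))              # the tail vectors z_i
A_k = (lams_tail[:, None] * Z).T @ Z             # A_k = sum_{i>k} lambda_i z_i z_i^T
evals = np.linalg.eigvalsh(A_k)
print(r_k / n, evals.min(), evals.max(), lams_tail[0] * r_k)
# evals.min() and evals.max() are close to lambda_{k+1} * r_k(Sigma).
```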

Upper Bound on the Trace Term.

Lemma 6. There are constants $b, c \ge 1$ such that, if $0 \le k \le n/c$, $r_k(\Sigma) \ge bn$, and $l \le k$, then with probability at least $1 - 7e^{-n/c}$,

\[ \mathrm{tr}(C) \le c\left(\frac{l}{n} + n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\sum_{i>k}\lambda_i\right)^2}\right). \]

The proof uses the following lemma and its corollary. Their proofs are in SI Appendix, section C.

Lemma 7. Suppose that $\{\lambda_i\}_{i\ge 1}$ is a nonincreasing sequence of nonnegative numbers such that $\sum_{i=1}^{\infty}\lambda_i < \infty$ and that $\{\xi_i\}_{i=1}^{\infty}$ are independent centered $\sigma$-subexponential random variables. Then, for some universal constant a and any $t > 0$, with probability at least $1 - 2e^{-t}$,

\[ \left|\sum_i \lambda_i \xi_i\right| \le a\,\sigma\,\max\left\{t\lambda_1,\ \sqrt{t\sum_i\lambda_i^2}\right\}. \]

Corollary 1. Suppose that $z \in \mathbb{R}^n$ is a centered random vector with independent $\sigma^2$-sub-Gaussian coordinates with unit variances, L is a random subspace of $\mathbb{R}^n$ of codimension k, and L is independent of z. Then, for some universal constant a and any $t > 0$, with probability at least $1 - 3e^{-t}$,

\[ \|z\|^2 \le n + a\,\sigma^2\left(t + \sqrt{nt}\right), \qquad \|\Pi_L z\|^2 \ge n - a\,\sigma^2\left(k + t + \sqrt{nt}\right), \]

where $\Pi_L$ is the orthogonal projection on L.

Proof of Lemma 6: Fix b to its value in Lemma 5. By Lemma 3,

\[ \mathrm{tr}(C) = \sum_i \lambda_i^2\, z_i^\top A^{-2} z_i = \sum_{i=1}^{l}\frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} + \sum_{i>l}\lambda_i^2\, z_i^\top A^{-2} z_i. \tag{2} \]

First, consider the sum up to l. If $r_k(\Sigma) \ge bn$, Lemma 5 shows that, with probability at least $1 - 2e^{-n/c_1}$, for all $i \le k$, $\mu_n(A_{-i}) \ge \lambda_{k+1}\, r_k(\Sigma)/c_1$, and, for all i, $\mu_{k+1}(A_{-i}) \le c_1\lambda_{k+1}\, r_k(\Sigma)$. The lower bounds on the $\mu_n(A_{-i})$ imply that, for all $z \in \mathbb{R}^n$ and $1 \le i \le l$,

\[ z^\top A_{-i}^{-2} z \le \frac{c_1^2\,\|z\|^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}, \]

and the upper bounds on the $\mu_{k+1}(A_{-i})$ give

\[ z^\top A_{-i}^{-1} z \ge \left(\Pi_{L_i} z\right)^\top A_{-i}^{-1}\left(\Pi_{L_i} z\right) \ge \frac{\|\Pi_{L_i} z\|^2}{c_1\lambda_{k+1}\, r_k(\Sigma)}, \]

where $L_i$ is the span of the $n-k$ eigenvectors of $A_{-i}$ corresponding to its smallest $n-k$ eigenvalues. Therefore, for $i \le l$,

\[ \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} \le \frac{z_i^\top A_{-i}^{-2} z_i}{\left(z_i^\top A_{-i}^{-1} z_i\right)^2} \le \frac{c_1^4\,\|z_i\|^2}{\|\Pi_{L_i} z_i\|^4}. \tag{3} \]

Next, we apply Corollary 1 l times together with a union bound to show that, with probability at least $1 - 3e^{-t}$, for all $1 \le i \le l$,

\[ \|z_i\|^2 \le n + a\,\sigma_x^2\left(t + \ln k + \sqrt{n(t+\ln k)}\right) \le c_2\, n, \tag{4} \]
\[ \|\Pi_{L_i} z_i\|^2 \ge n - a\,\sigma_x^2\left(k + t + \ln k + \sqrt{n(t+\ln k)}\right) \ge n/c_3, \tag{5} \]

provided that $t < n/c_0$ and $c > c_0$ for some sufficiently large $c_0$ (note that $c_2$ and $c_3$ only depend on $c_0$, a, and $\sigma_x$, and we can still take c large enough in the end without changing $c_2$ and $c_3$). Combining Eqs. 3–5, with probability at least $1 - 5e^{-n/c_0}$,

\[ \sum_{i=1}^{l}\frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} \le \frac{c_4\, l}{n}. \]

Second, consider the second sum in Eq. 2. Lemma 5 shows that, on the same high-probability event that we considered in bounding the first half of the sum, $\mu_n(A) \ge \lambda_{k+1}\, r_k(\Sigma)/c_1$. Hence,

\[ \sum_{i>l}\lambda_i^2\, z_i^\top A^{-2} z_i \le c_1^2\,\frac{\sum_{i>l}\lambda_i^2\,\|z_i\|^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}. \]

Notice that i>lλi2zi2 is a weighted sum of σx2-subexponential random variables, with the weights given by the λi2 in blocks of size n. Lemma 7 implies that, with probability at least 12et,

\[ \sum_{i>l}\lambda_i^2\,\|z_i\|^2 \le n\sum_{i>l}\lambda_i^2 + a\,\sigma_x^2\max\left\{\lambda_{l+1}^2\, t,\ \sqrt{tn\sum_{i>l}\lambda_i^4}\right\} \le n\sum_{i>l}\lambda_i^2 + a\,\sigma_x^2\max\left\{t\sum_{i>l}\lambda_i^2,\ \sqrt{tn}\sum_{i>l}\lambda_i^2\right\} \le c_5\, n\sum_{i>l}\lambda_i^2 \]

because t<n/c0. Combining the above gives

\[ \sum_{i>l}\lambda_i^2\, z_i^\top A^{-2} z_i \le c_6\, n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}. \]

Finally, putting both parts together and taking c>max{c0,c4,c6} gives the lemma.

Lower Bound on the Trace Term.

We first give a bound on a single term in the expression for tr(C) in Lemma 3 that holds regardless of rk(Σ). The proof is in SI Appendix, section D.

Lemma 8. There is a constant c such that, for any $i \ge 1$ with $\lambda_i > 0$ and any $0 \le k \le n/c$, with probability at least $1 - 5e^{-n/c}$,

\[ \frac{\lambda_i^2\, z_i^\top A_{-i}^{-2} z_i}{\left(1+\lambda_i\, z_i^\top A_{-i}^{-1} z_i\right)^2} \ge \frac{1}{c\,n}\left(1 + \frac{\sum_{j>k}\lambda_j + n\lambda_{k+1}}{n\lambda_i}\right)^{-2}. \]

We can extend these bounds to a lower bound on tr(C) using the following lemma. The proof is in SI Appendix, section E.

Lemma 9. Suppose that $\{\eta_i\}_{i=1}^n$ is a sequence of nonnegative random variables and that $\{t_i\}_{i=1}^n$ is a sequence of nonnegative real numbers (at least one of which is strictly positive) such that, for some $\delta \in (0,1)$ and any $i \le n$, $\Pr(\eta_i > t_i) \ge 1 - \delta$. Then,

\[ \Pr\left(\sum_{i=1}^{n}\eta_i \ge \frac{1}{2}\sum_{i=1}^{n} t_i\right) \ge 1 - 2\delta. \]

These two lemmas imply the following lower bound.

Lemma 10. There is a constant c such that, for any $0 \le k \le n/c$ and any $b > 1$, with probability at least $1 - 10e^{-n/c}$:

1) if $r_k(\Sigma) < bn$, then

\[ \mathrm{tr}(C) \ge \frac{k+1}{c\,b^2\, n}; \]

and

2) if $r_k(\Sigma) \ge bn$, then

\[ \mathrm{tr}(C) \ge \frac{1}{c\,b^2}\min_{l \le k}\left(\frac{l}{n} + b^2 n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right). \]

In particular, if all choices of $k \le n/c$ give $r_k(\Sigma) < bn$, then $r_{\lfloor n/c\rfloor}(\Sigma) < bn$ implies that, with probability at least $1 - 10e^{-n/c}$, $\mathrm{tr}(C) = \Omega_{\sigma_x}(1)$.

Proof: From Lemmas 3, 8, and 9, with probability at least $1 - 10e^{-n/c_1}$,

\[ \mathrm{tr}(C) \ge \frac{1}{c_1 n}\sum_i\left(1 + \frac{\sum_{j>k}\lambda_j + n\lambda_{k+1}}{n\lambda_i}\right)^{-2} \ge \frac{1}{c_2 n}\sum_i\min\left\{1,\ \frac{n^2\lambda_i^2}{\left(\sum_{j>k}\lambda_j\right)^2},\ \frac{\lambda_i^2}{\lambda_{k+1}^2}\right\} \ge \frac{1}{c_2 b^2 n}\sum_i\min\left\{1,\ \left(\frac{bn}{r_k(\Sigma)}\right)^2\frac{\lambda_i^2}{\lambda_{k+1}^2},\ \frac{\lambda_i^2}{\lambda_{k+1}^2}\right\}. \]

Now, if rk(Σ)<bn, then the second term in the minimum is always bigger than the third term, and in that case,

tr(C)1c2b2nimin1,λi2λk+12k+1c2b2n.

On the other hand, if $r_k(\Sigma) \ge bn$,

\[ \mathrm{tr}(C) \ge \frac{1}{c_2 b^2}\sum_i\min\left\{\frac{1}{n},\ b^2 n\,\frac{\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right\} = \frac{1}{c_2 b^2}\min_{l \le k}\left(\frac{l}{n} + b^2 n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right), \]

where the equality follows from the fact that the λi are nonincreasing.

A Simple Choice of l.

Recall that $\sigma_x$ is a constant. If no $k \le n/c$ has $r_k(\Sigma) \ge bn$, then Lemmas 2 and 10 imply that the expected excess risk is $\Omega(\sigma^2)$, which proves the first paragraph of Theorem 1 for large k*. If some $k \le n/c$ does have $r_k(\Sigma) \ge bn$, then the upper and lower bounds of Lemmas 6 and 10 are constant multiples of

\[ \min_{l \le k}\left(\frac{l}{n} + n\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k+1}\, r_k(\Sigma)\right)^2}\right). \]

It might seem surprising that any suitable choice of k suffices to give upper and lower bounds: what prevents one choice of k from giving an upper bound that falls below the lower bound that arises from another choice of k? However, the freedom to choose k is somewhat illusory: Lemma 5 shows that, for any qualifying value of k, the smallest eigenvalue of A is within a constant factor of $\lambda_{k+1}\, r_k(\Sigma)$. Thus, any two choices of k satisfying $k \le n/c$ and $r_k(\Sigma) \ge bn$ must have values of $\lambda_{k+1}\, r_k(\Sigma)$ within constant factors. The smallest such k simplifies the bound on $\mathrm{tr}(C)$, as the following lemma shows. The proof is in SI Appendix, section F.

Lemma 11. For any $b \ge 1$ and $k^* = \min\{k : r_k(\Sigma) \ge bn\}$, if $k^* < \infty$, we have

\[ \min_{l \le k^*}\left(\frac{l}{bn} + bn\,\frac{\sum_{i>l}\lambda_i^2}{\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2}\right) = \frac{k^*}{bn} + bn\,\frac{\sum_{i>k^*}\lambda_i^2}{\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2} = \frac{k^*}{bn} + \frac{bn}{R_{k^*}(\Sigma)}. \]
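The claim of Lemma 11 is easy to probe numerically. The sketch below (ours; the heavy-tailed spectrum, the constant b = 2, and the sample size are arbitrary choices) evaluates the objective over $l \le k^*$ and checks that the minimizer coincides with $l = k^*$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, b = 200, 2.0
lams = np.sort(rng.pareto(1.5, size=5000) + 1e-3)[::-1]   # an arbitrary heavy-tailed spectrum
tail = np.cumsum(lams[::-1])[::-1]                         # tail[k] = sum_{i>k} lambda_i
tail_sq = np.cumsum((lams ** 2)[::-1])[::-1]               # tail_sq[k] = sum_{i>k} lambda_i^2

r = tail / lams                                            # r[k] = r_k(Sigma)
k_star = int(np.nonzero(r >= b * n)[0][0])                 # smallest k with r_k >= b n

denom = (lams[k_star] * r[k_star]) ** 2                    # (lambda_{k*+1} r_{k*}(Sigma))^2
ls = np.arange(k_star + 1)
f = ls / (b * n) + b * n * tail_sq[ls] / denom             # objective of Lemma 11 over l <= k*
print(int(f.argmin()), k_star)                             # the two indices coincide
```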

Finally, we can finish the proof of Theorem 1. Set b in Lemma 10 and Theorem 1 to the constant b from Lemma 6. Take c1 to be the maximum of the constants c from Lemmas 6 and 10.

By Lemma 10, if $k^* \ge n/c_1$, then with high probability $\mathrm{tr}(C) \ge 1/c_2$. However, by Lemma 10.2 and by Lemma 6, if $k^* < n/c_1$, then with high probability $\mathrm{tr}(C)$ is within a constant factor of $\min_{l \le k^*}\left(l/n + n\sum_{i>l}\lambda_i^2/\left(\lambda_{k^*+1}\, r_{k^*}(\Sigma)\right)^2\right)$, which, by Lemma 11, is within a constant factor of $k^*/n + n/R_{k^*}(\Sigma)$. Taking c sufficiently large and combining these results with Lemma 2 and with the upper bound on the term $\theta^{*\top} B\,\theta^*$ in SI Appendix, section J completes the proof of the first paragraph of Theorem 1.

The proof of the second paragraph is in SI Appendix, section K.

Deep Neural Networks

How relevant are Theorems 1 and 2 to the phenomenon of benign overfitting in deep neural networks? One connection appears by considering regimes where deep neural networks are well approximated by linear functions of their parameters. This so-called neural tangent kernel (NTK) viewpoint has been vigorously pursued recently in an attempt to understand the optimization properties of deep learning methods. Very wide neural networks, trained with gradient descent from a suitable random initialization, can be accurately approximated by linear functions in an appropriate Hilbert space, and in this case, gradient descent finds an interpolating solution quickly (14–19). (Note that these papers do not consider prediction accuracy, except when there is no noise; for example, ref. 14, Assumption A1 implies that the network can compute a suitable real-valued response exactly, and the data-dependent bound of ref. 19, Theorem 5.1 becomes vacuous when independent noise is added to the $y_i$.) The eigenvalues of the covariance operator in this case can have a heavy tail under reasonable assumptions on the data distribution (20, 21), and the dimension is very large but finite as required for benign overfitting. However, the assumptions of Theorem 1 do not apply in this case. In particular, the assumption that the random elements of the Hilbert space are a linearly transformed vector with independent components is not satisfied. Thus, our results are not directly applicable in this—somewhat unrealistic—setting. Note that the slow decay of the eigenvalues of the NTK is in contrast to the case of the Gaussian and other smooth kernels, where the eigenvalues decay nearly exponentially quickly (22).

The phenomenon of benign overfitting was first observed in deep neural networks. Theorems 1 and 2 are steps toward understanding this phenomenon by characterizing when it occurs in the simple setting of linear regression. Those results suggest that covariance eigenvalues that are constant or slowly decaying in a high (but finite)-dimensional space might be important in the deep network setting also. Some authors have suggested viewing neural networks as finite-dimensional approximations to infinite-dimensional objects (23–25), and there are generalization bounds—although not for the overfitting regime—that are applicable to infinite-width deep networks with parameter norm constraints (26–30). However, the intuition from the linear setting suggests that truncating to a finite-dimensional space might be important for good statistical performance in the overfitting regime. Confirming this conjecture by extending our results to the setting of prediction in deep neural networks is an important open problem.

Conclusions and Further Work

Our results characterize when the phenomenon of benign overfitting occurs in high-dimensional linear regression with Gaussian data and more generally. We give finite sample excess risk bounds that reveal the covariance structure that ensures that the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization depends on two notions of the effective rank of the data covariance operator. It shows that overparameterization (that is, the existence of many low-variance and hence, unimportant directions in parameter space) is essential for benign overfitting and that data that lie in a large but finite-dimensional space exhibit the benign overfitting phenomenon with a much wider range of covariance properties than data that lie in an infinite-dimensional space.

There are several natural future directions. Our main theorem requires the conditional expectation E[y|x] to be a linear function of x, and it is important to understand whether the results are also true in the misspecified setting, where this assumption is not true. Our main result also assumes that the covariates are distributed as a linear function of a vector of independent random variables. We would like to understand the extent to which this assumption can be relaxed since it rules out some important examples, such as infinite-dimensional reproducing kernel Hilbert spaces with continuous kernels defined on finite-dimensional spaces. We would also like to understand how our results extend to loss functions other than squared error and what we can say about overfitting estimators beyond the minimum norm interpolating estimator. The most interesting future direction is understanding how these ideas could apply to nonlinearly parameterized function classes, such as neural networks, the methodology that uncovered the phenomenon of benign overfitting.

Data Availability.

There are no data associated with this manuscript.

Supplementary Material

Supplementary File
pnas.1907378117.sapp.pdf (381.1KB, pdf)

Acknowledgments

We acknowledge the support of NSF Grant IIS-1619362 and of a Google research award. G.L. was supported by the Spanish Ministry of Economy and Competitiveness, Grant PGC2018-101643-B-I00; “High-dimensional problems in structured probabilistic models - Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017”; and Google Focused Award “Algorithms and Learning for AI.” Part of this work was done as part of the fall 2018 program on Foundations of Data Science at the Simons Institute for the Theory of Computing.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission. R.B. is a guest editor invited by the Editorial Board.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “The Science of Deep Learning,” held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/science-of-deep-learning.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1907378117/-/DCSupplemental.

References

1. Zhang C., Bengio S., Hardt M., Recht B., Vinyals O., "Understanding deep learning requires rethinking generalization" in 5th International Conference on Learning Representations. https://openreview.net/forum?id=Sy8gdB9xx. Accessed 30 March 2020.
2. Hastie T., Tibshirani R., Friedman J. H., Elements of Statistical Learning (Springer, 2001).
3. Belkin M., Ma S., Mandal S., "To understand deep learning we need to understand kernel learning" in Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, 2018), vol. 80, pp. 540–548.
4. Belkin M., Hsu D., Mitra P., "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate" in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, S. Bengio et al., Eds. (NIPS, 2018), pp. 2306–2317.
5. Belkin M., Rakhlin A., Tsybakov A. B., Does data interpolation contradict statistical optimality? arXiv:1806.09471 (25 June 2018).
6. Devroye L., Györfi L., Krzyżak A., The Hilbert kernel regression estimate. J. Multivariate Anal. 65, 209–227 (1998).
7. Liang T., Rakhlin A., Just interpolate: Kernel "ridgeless" regression can generalize. arXiv:1808.00387 (1 August 2018).
8. Belkin M., Hsu D., Ma S., Mandal S., Reconciling modern machine learning and the bias-variance trade-off. arXiv:1812.11118 (28 December 2018).
9. Muthukumar V., Vodrahalli K., Sahai A., Harmless interpolation of noisy data in regression. arXiv:1903.09139 (21 March 2019).
10. Bartlett P. L., "Accurate prediction from interpolation: A new challenge for statistical learning theory (presentation at the National Academy of Sciences workshop, The Science of Deep Learning)" (video recording, 2019). https://www.youtube.com/watch?v=1y2sB38T6FU&feature=youtu.be. Accessed 14 March 2019.
11. Belkin M., Hsu D., Xu J., Two models of double descent for weak features. arXiv:1903.07571 (18 March 2019).
12. Hastie T., Montanari A., Rosset S., Tibshirani R. J., Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560 (19 March 2019).
13. Desoer C. A., Whalen B. H., A note on pseudoinverses. J. Soc. Ind. Appl. Math. 11, 442–446 (1963).
14. Li Y., Liang Y., Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv:1808.01204 (3 August 2018).
15. Du S. S., Poczós B., Zhai X., Singh A., Gradient descent provably optimizes over-parameterized neural networks. arXiv:1810.02054 (4 October 2018).
16. Du S. S., Lee J. D., Li H., Wang L., Zhai X., Gradient descent finds global minima of deep neural networks. arXiv:1811.03804 (9 November 2018).
17. Zou D., Cao Y., Zhou D., Gu Q., Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv:1811.08888 (21 November 2018).
18. Jacot A., Gabriel F., Hongler C., "Neural tangent kernel: Convergence and generalization in neural networks" in 32nd Conference on Neural Information Processing Systems, Bengio S. et al., Eds. (NeurIPS, 2018), pp. 8580–8589.
19. Arora S., Du S. S., Hu W., Li Z., Wang R., Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv:1901.08584 (24 January 2019).
20. Xie B., Liang Y., Song L., Diverse neural network learns true target functions. arXiv:1611.03131 (9 November 2016).
21. Cao Y., Fang Z., Wu Y., Zhou D. X., Gu Q., Towards understanding the spectral bias of deep learning. arXiv:1912.01198 (3 December 2019).
22. Belkin M., "Approximation beats concentration? An approximation view on inference with smooth radial kernels" in Conference on Learning Theory, 2018, Stockholm, Sweden, 6–9 July 2018, S. Bubeck, V. Perchet, P. Rigollet, Eds. (PMLR, 2018), vol. 75, pp. 1348–1361.
23. Lee W. S., Bartlett P. L., Williamson R. C., Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inf. Theor. 42, 2118–2132 (1996).
24. Bengio Y., Roux N. L., Vincent P., Delalleau O., Marcotte P., "Convex neural networks" in Advances in Neural Information Processing Systems 18, Weiss Y., Schölkopf B., Platt J. C., Eds. (MIT Press, Cambridge, MA, 2006), pp. 123–130.
25. Bach F., Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18, 1–53 (2017).
26. Bartlett P. L., The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theor. 44, 525–536 (1998).
27. Bartlett P. L., Mendelson S., Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2002).
28. Neyshabur B., Tomioka R., Srebro N., "Norm-based capacity control in neural networks" in Proceedings of the 28th Conference on Learning Theory, Proceedings of Machine Learning Research, Grünwald P., Hazan E., Kale S., Eds. (PMLR, Paris, France, 2015), vol. 40, pp. 1376–1401.
29. Bartlett P., Foster D., Telgarsky M., "Spectrally-normalized margin bounds for neural networks" in Advances in Neural Information Processing Systems 30, Guyon I., et al., Eds. (Curran Associates, Inc., 2017), pp. 6240–6249.
30. Golowich N., Rakhlin A., Shamir O., "Size-independent sample complexity of neural networks" in Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research, Bubeck S., Perchet V., Rigollet P., Eds. (PMLR, 2018), vol. 75, pp. 297–299.
