Published in final edited form as: Biometrika. 2015 May 4;102(2):479–485. doi: 10.1093/biomet/asv019

Effective degrees of freedom: a flawed metaphor

Lucas Janson 1, William Fithian 1, Trevor J Hastie 1

Summary

To most applied statisticians, a fitting procedure’s degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly. We exhibit and theoretically explore various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings. We show that the degrees of freedom for any non-convex projection method can be unbounded.

Some key words: Model complexity, Number of parameters, Optimism

1. Introduction

1.1. A motivating example: best-subsets regression

Consider observing data y = μ + ε with μ ∈ ℝn and independent errors ε ~ N (0, σ2 In), and producing an estimate of μ. A critical property of any estimator μ̂ is its so-called model complexity; informally, how flexibly it is able to conform to the observed response y. Commonly we set μ = X β for some n × p design matrix X.

For the ordinary-least-squares estimator μ̂, the most natural measure of complexity is the number p of fitted parameters, i.e., the degrees of freedom. For more general estimators, the effective degrees of freedom of Efron (1986), defined as $\sigma^{-2}\sum_{i=1}^{n}\mathrm{cov}(y_i,\hat\mu_i)$, has emerged as a popular and convenient measuring stick for comparing the complexity of very different fitting procedures. The name suggests that a method with p degrees of freedom has a complexity comparable to linear regression on p predictor variables, for which the effective degrees of freedom is p.

Now, suppose that instead of fitting a linear model using all p predictors, we fit the best-subsets regression of size k. That is, we find the best linear model that uses only a subset of k predictors, with k < p. What is the effective degrees of freedom, henceforth just degrees of freedom, of this method?

A simple, intuitive, and wrong argument predicts that the degrees of freedom, which depends on μ, is somewhere between k and p. We certainly expect it to be greater than k, since we use the data to select the best model of size k among all the possibilities. However, we have only p free parameters at our disposal, of which p − k must be set to zero, so best-subsets regression with k parameters is still less complex than the saturated model with all p parameters and no constraints.

As convincing as this argument may seem, it is contradicted by a simple simulation with n = 50 and p = 15. Here μ = X β, where the entries of X are independent N (0, 1) variates and the coefficients βj are independent N (0, 4) variates, and σ2 = 1. Figure 1 shows that the degrees of freedom for both best-subsets regression and another method, forward selection, can exceed the ambient dimension p for values of k < p. The degrees of freedom of best-subsets regression exceeded p for some k < p in 179 of 200 realizations of μ, or about 90% of the time. To understand why our intuition should lead us astray here, we must first review why effective degrees of freedom is defined as it is, and what classical concepts the definition is meant to generalize.
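Code for the full simulation is provided in the Supplementary Material. The sketch below is not that code, but a minimal illustration of how such a Monte Carlo estimate of degrees of freedom can be obtained directly from the covariance definition above; the dimensions (n = 20, p = 8), the number of replications, and the exhaustive subset search are assumed choices made to keep the runtime modest rather than to reproduce Figure 1 exactly.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2, n_rep = 20, 8, 1.0, 500

X = rng.normal(size=(n, p))            # fixed design with N(0, 1) entries
beta = rng.normal(scale=2.0, size=p)   # coefficients drawn N(0, 4), as in the text
mu = X @ beta                          # true mean, held fixed across replications

def best_subset_fit(X, y, k):
    # Exhaustive search over all size-k subsets; return the fitted vector
    # of the subset with the smallest residual sum of squares.
    best_rss, best_fit = np.inf, np.zeros(len(y))
    for S in itertools.combinations(range(X.shape[1]), k):
        XS = X[:, list(S)]
        coef, *_ = np.linalg.lstsq(XS, y, rcond=None)
        fit = XS @ coef
        rss = np.sum((y - fit) ** 2)
        if rss < best_rss:
            best_rss, best_fit = rss, fit
    return best_fit

# Simulate many responses around the fixed mu, record y and muhat(k),
# then estimate sum_i cov(y_i, muhat_i(k)) / sigma^2 for each subset size k.
ys = np.empty((n_rep, n))
fits = np.empty((p, n_rep, n))
for r in range(n_rep):
    y = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
    ys[r] = y
    for k in range(1, p + 1):
        fits[k - 1, r] = best_subset_fit(X, y, k)

for k in range(1, p + 1):
    cov_sum = sum(np.cov(ys[:, i], fits[k - 1, :, i])[0, 1] for i in range(n))
    print(f"k = {k}: estimated degrees of freedom = {cov_sum / sigma2:.2f}")

For k = p the estimate should be close to p; whether intermediate values of k exceed p depends on the particular realization of μ, as in Figure 1.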

Fig. 1.


Estimated degrees of freedom versus subset size for 10 realizations of μ, with one highlighted for clarity, in the simulation described in Section 1.1 using best-subsets regression on the left, and forward selection on the right. Simulation details, including code, are provided in the Supplementary Material. Standard errors for all points are below 0.5, and are not shown. Dashed lines show the constant ambient dimension p and the subset size k, for reference.

1.2. Degrees of freedom in classical statistics

The original meaning of degrees of freedom, the number of dimensions in which a random vector may vary, plays a central role in classical statistics. In ordinary linear regression with full-rank n × p predictor matrix X, the fitted response μ̂ = X β̂ is the orthogonal projection of y onto the p-dimensional column space of X, and the residual r = y − μ̂ is the projection onto its orthogonal complement, whose dimension is n − p. We say this linear model has p model degrees of freedom, with n − p residual degrees of freedom.

If the error variance is σ2, then r is constrained to have zero projection in p directions, and is free to vary, with variance σ2, in the remaining n − p orthogonal directions. In particular, if the model is correct, so that E(y) = X β, then the residual sum of squares $\|r\|_2^2$ has distribution $\|r\|_2^2 \sim \sigma^2\chi^2_{n-p}$, leading to the unbiased variance estimate $\hat\sigma^2 = \|r\|_2^2/(n-p)$. More generally, t-tests and F-tests are based on comparing lengths of n-variate Gaussian random vectors after projecting onto appropriate linear subspaces.

In linear regression, the model degrees of freedom, henceforth just degrees of freedom, serves to quantify multiple related properties of the fitting procedure. The degrees of freedom coincides with the number of non-redundant free parameters in the model, and thus constitutes a natural measure of model complexity or overfitting. In addition, the total variance of the fitted response μ̂ is exactly σ2 p, which depends only on the number of linearly independent predictors and not on their size or correlation with each other.

The degrees of freedom also quantifies the optimism of the residual sum of squares as an estimate of out-of-sample prediction error. In linear regression, one can easily show that the residual sum of squares underestimates mean squared prediction error by 2σ2p on average. Mallows (1973) exploits this identity as a means of model selection, by computing the Cp statistic $\|r\|_2^2 + 2\hat\sigma^2 p$, an unbiased estimate of prediction error, for several models, and selecting the model with the smallest estimated test error. In this case the degrees of freedom of a model contributes a penalty to account for that model’s complexity.
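To make the identity concrete, here is a small, self-contained sketch, not taken from the paper, that computes the Cp statistic for a sequence of nested ordinary-least-squares models; the design, coefficients, and nesting order are assumed for the example, and σ2 is estimated from the residuals of the largest model.

import numpy as np

rng = np.random.default_rng(1)
n, p_full = 100, 10
X = rng.normal(size=(n, p_full))
beta = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p_full - 3)])
y = X @ beta + rng.normal(size=n)       # sigma^2 = 1

def rss(X_sub, y):
    # Residual sum of squares of the least-squares fit on the given columns.
    coef, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    return np.sum((y - X_sub @ coef) ** 2)

sigma2_hat = rss(X, y) / (n - p_full)   # unbiased estimate from the full model
for p in range(1, p_full + 1):          # nested models: the first p columns of X
    cp = rss(X[:, :p], y) + 2 * sigma2_hat * p
    print(f"p = {p:2d}: Cp = {cp:7.1f}")

The smallest Cp is typically attained near the true number of nonzero coefficients, here three.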

1.3. Effective degrees of freedom

For more general fitting procedures such as smoothing splines, generalized additive models, lasso, or ridge regression, the number of free parameters is often an inappropriate measure of model complexity. Typically these methods have a tuning parameter, but it is not clear a priori how to compare, e.g., a lasso fit with Lagrange parameter λ = 3 to a local regression fit with window width 0.5. When comparing different methods, or the same method with different tuning parameters, it can be quite useful to have some measure of complexity with a consistent meaning across a range of algorithms. To this end, various authors have proposed alternative more general definitions for the effective degrees of freedom of a method; see Buja et al. (1989) and references therein.

If the method is linear, that is, if μ̂ = Hy for some fixed hat matrix H, then the trace of H serves as a natural generalization. For linear regression H is a p-dimensional projection, so tr(H) = p, coinciding with the original definition. Intuitively, when H is not a projection, tr(H) accumulates fractional degrees of freedom for directions of y that are shrunk, but not entirely eliminated, in computing μ̂.
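As a brief illustration, with an assumed design and assumed penalty values, ridge regression with a full-rank design has hat matrix H = X(XᵀX + λI)⁻¹Xᵀ, so tr(H) equals p at λ = 0 and shrinks towards zero as λ grows.

import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 5
X = rng.normal(size=(n, p))

def ridge_df(X, lam):
    # Effective degrees of freedom tr(H) with H = X (X'X + lam I)^{-1} X'.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    return np.trace(H)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda = {lam:6.1f}: tr(H) = {ridge_df(X, lam):.3f}")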

For nonlinear methods, further generalization is necessary. The most popular definition, due to Efron (1986) and given in Equation (1), defines degrees of freedom in terms of the optimism of residual sum of squares as an estimate of test error, and applies to any fitting method.

Measuring or estimating optimism is a worthy goal in and of itself. But to justify our intuition that the degrees of freedom offers a consistent way to quantify model complexity, a bare requirement is that the degrees of freedom be monotone in model complexity when considering a fixed method. The term model complexity is itself rather metaphorical when describing arbitrary fitting algorithms, but has a concrete meaning for methods that minimize residual sum of squares subject to the fit μ̂ belonging to a closed constraint set M, a model. Commonly, some tuning parameter γ indexes a nested set of models Mγ, with γ1 ≤ γ2 implying Mγ1 ⊆ Mγ2 ⊆ ℝn. Then the fitted vector for a tuning parameter γ is

\hat\mu(\gamma) = \arg\min_{z \in M_\gamma} \|y - z\|_2^2.

Examples include the lasso (Tibshirani, 1996) and ridge regression (Hoerl, 1962) in their constraint formulation, as well as best-subsets regression. The model Mk for best-subsets regression with k variables is a union of k-dimensional subspaces.

Because estimates from larger models conform more closely to the observed data, one naturally expects degrees of freedom to be monotone with respect to model inclusion. However, as we have already seen in Figure 1, monotonicity is far from guaranteed even in very simple examples, and the degrees of freedom can rise above p. Surprisingly, degrees of freedom need not be monotone even for methods projecting onto convex sets, including ridge regression and the lasso, although the degrees of freedom cannot exceed the dimension of the convex set. The non-monotonicity of degrees of freedom for such convex methods was discovered independently by Kaufman & Rosset (2014), who give a thorough account. Among other results, they prove that the degrees of freedom of the projection onto a convex set can never exceed the dimension of that set. In contrast, we show that projection onto any closed non-convex set can have arbitrarily large degrees of freedom, regardless of the dimensions of M and y.

Fig. 2.


Heatmap of the degrees of freedom for 1-best-subset regression fit to data from the model y ~ N (μ, I2σ2), as a function of the true mean vector μ ∈ ℝ2.

2. Preliminaries

We consider fitting techniques with some tuning parameter, discrete or continuous, that can be used to vary a model from less to more constrained. In best-subsets regression, the tuning parameter k determines how many predictor variables are retained in the model. For a general fitting technique, we will use the notation μ̂(k) for the fitted response produced using tuning parameter k.

As mentioned in the introduction, a general formula for degrees of freedom can be motivated by the following relationship between expected prediction error, which we will represent by the symbol EPE, and residual sum of squares for ordinary least squares (Mallows, 1973):

\mathrm{EPE} = E(\|r\|_2^2) + 2\sigma^2 p.

Analogously, once a fitting technique and tuning parameter k are chosen for fixed data y, the degrees of freedom functional, which we will denote by DF, is defined by:

E\Big\{\sum_{i=1}^{n} (y_i^* - \hat\mu_i(k))^2\Big\} = E\Big\{\sum_{i=1}^{n} (y_i - \hat\mu_i(k))^2\Big\} + 2\sigma^2\,\mathrm{DF}(\mu, \sigma^2, k),   (1)

where σ2 is the variance of the εi, assumed finite, and $y_i^*$ is a new independent copy of $y_i$ with mean μi. Thus degrees of freedom is defined as a measure of the optimism of residual sum of squares. This definition in turn leads to a simple closed-form expression for degrees of freedom under very general conditions, as shown by the following theorem.

Theorem 1 (Efron (1986))

For i ∈ {1, …, n}, let yi = μi + εi, where the μi are non-random and the εi have mean zero and finite variance. Let μ̂i, i ∈ {1, …, n}, denote estimates of μi from some fitting technique based on a fixed realization of the yi, and let $y_i^*$, i ∈ {1, …, n}, be independent of and identically distributed as the yi. Then

E\Big\{\sum_{i=1}^{n} (y_i^* - \hat\mu_i)^2\Big\} - E\Big\{\sum_{i=1}^{n} (y_i - \hat\mu_i)^2\Big\} = 2 \sum_{i=1}^{n} \mathrm{cov}(y_i, \hat\mu_i).

Remark 1

For independent and identically distributed errors with finite variance σ2, Theorem 1 implies that,

\mathrm{DF}(\mu, \sigma^2, k) = \sigma^{-2}\,\mathrm{tr}\{\mathrm{cov}(y, \hat\mu(k))\} = \sigma^{-2} \sum_{i=1}^{n} \mathrm{cov}(y_i, \hat\mu_i(k)).   (2)

When using a linear fitting method with hat matrix H, Equation (2) reduces to DF(μ, σ2, k) = tr(H).
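The following small simulation, an assumed ridge-regression example rather than one from the paper, illustrates this reduction: a Monte Carlo estimate of the covariance sum in Equation (2) agrees with tr(H) for a linear fitting method.

import numpy as np

rng = np.random.default_rng(3)
n, p, lam, sigma2, n_rep = 40, 6, 5.0, 1.0, 20000
X = rng.normal(size=(n, p))
mu = X @ rng.normal(size=p)
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge hat matrix

# Each row of ys is one simulated response; the corresponding fit is H y.
ys = mu + rng.normal(scale=np.sqrt(sigma2), size=(n_rep, n))
fits = ys @ H.T
cov_sum = sum(np.cov(ys[:, i], fits[:, i])[0, 1] for i in range(n))
print("tr(H)               :", np.trace(H))
print("Monte Carlo Eq. (2) :", cov_sum / sigma2)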

3. Unbounded degrees of freedom for non-convex models

Before stating our main result, we present a very simple example illustrating how large the degrees of freedom can be. Consider estimating a no-intercept linear regression model with design matrix X = I2 and response y ~ N(μ, I2), with μ ∈ ℝ2. Suppose further that, in order to obtain a more parsimonious model than the full bivariate regression, we instead estimate the best fitting of the two univariate models, in other words, best-subsets regression with model size k = 1.

Figure 2 shows the effective degrees of freedom for this model, plotted as a function of μ. As before, the degrees of freedom can exceed the ambient dimension of 2. However, the plot shows the degrees of freedom growing steadily as μ moves diagonally away from the origin, raising the question of how large it can get.

For μ = (a, a)T and a large, y falls in the positive quadrant with high probability and the best univariate model chooses the larger of the two response variables. Figure 3(a) illustrates the fit for several realizations of y. For i ∈ {1, 2}, $\hat\mu_i(1)$ is either 0 or approximately a depending on small changes in y. As a result, the variance of y is far smaller than that of μ̂(1). Since the correlation between $y_i$ and $\hat\mu_i(1)$ is around 0.5, $\sum_{i=1}^{n}\mathrm{cov}(y_i, \hat\mu_i(1))$ is also much larger than the variance of the yi, and the large degrees of freedom can be inferred from Equation (2).

Fig. 3.


(a) Sketch of the example described in Section 3. The square is μ and the solid black lines are the coordinate axes. Some realizations of y are shown as circles, along with a few of their best-subset projections μ̂(1), shown as triangles. The dashed line separates the points y according to which axis they are closer to. (b) Sketch of a regression problem with n = 2 and a non-convex constraint set. The filled area is the constraint set, the point is the true mean vector μ, and the circles are the contours of the least squares objective function.

We see below that the degrees of freedom actually diverges as a → ∞,

\begin{aligned}
\frac{1}{a}\,\mathrm{DF}\{(a,a)^{\mathrm T},1,1\}
&= \frac{1}{2} E\Big[\frac{1}{a}\sum_{i=1}^{2}\big\{(y_i^{*}-\hat\mu_i(1))^2-(y_i-\hat\mu_i(1))^2\big\}\Big]\\
&= \frac{1}{2} E\Big[\frac{1}{a}\sum_{i=1}^{2}\big\{(y_i^{*}-\hat\mu_i(1))^2-(y_i-\hat\mu_i(1))^2\big\} I_{y\in Q_1}\Big]
 + \frac{1}{2} E\Big[\frac{1}{a}\sum_{i=1}^{2}\big\{(y_i^{*}-\hat\mu_i(1))^2-(y_i-\hat\mu_i(1))^2\big\} I_{y\notin Q_1}\Big]\\
&= \frac{1}{2} E\Big(\frac{1}{a}\Big[a^2+2a\varepsilon_1^{*}+(\varepsilon_1^{*})^2+\{\varepsilon_2^{*}-\max(\varepsilon_1,\varepsilon_2)\}^2
 - a^2 - 2a\min(\varepsilon_1,\varepsilon_2)-\{\min(\varepsilon_1,\varepsilon_2)\}^2\Big] I_{y\in Q_1}\Big)+o(1)\\
&\to \frac{1}{2} E\{2\varepsilon_1^{*}-2\min(\varepsilon_1,\varepsilon_2)\}=E\{\max(\varepsilon_1,\varepsilon_2)\}, \qquad a\to\infty,
\end{aligned}

where Q1 is the first quadrant of ℝ2, IS is the indicator function of the set S, εi = yi − a is the noise in the ith coordinate of y, and ε1*, ε2* are independent copies of the noise, independent of one another and of y, so that $y_i^* = a + \varepsilon_i^*$. The o(1) term comes from the fact that $a^{-1}\sum_{i=1}^{2}\{(y_i^* - \hat\mu_i(1))^2 - (y_i - \hat\mu_i(1))^2\}$ is $O_p(a)$ while pr(y ∉ Q1) shrinks exponentially fast in a, as it is a Gaussian tail probability. The convergence in the last line follows by the dominated convergence theorem applied to the first term in the preceding line. For standard Gaussian errors, $E\{\max(\varepsilon_1, \varepsilon_2)\} = \pi^{-1/2} \approx 0.56$, giving DF{(a, a)T, 1, 1} ≈ 0.56a for large a. Equivalently, the degrees of freedom would also diverge if a were held fixed and σ2 → 0.
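A Monte Carlo check, with assumed values of a and of the number of replications, reproduces this rate numerically: the estimated degrees of freedom divided by a approaches π−1/2 ≈ 0.56.

import numpy as np

rng = np.random.default_rng(4)
n_rep = 200000

def best_subset_k1(y):
    # 1-best-subset fit in two dimensions: keep the coordinate with the larger
    # magnitude and set the other to zero (rows of y are replications).
    fit = np.zeros_like(y)
    rows = np.arange(len(y))
    idx = np.argmax(np.abs(y), axis=1)
    fit[rows, idx] = y[rows, idx]
    return fit

for a in [2.0, 5.0, 10.0, 20.0]:
    y = np.array([a, a]) + rng.normal(size=(n_rep, 2))    # sigma^2 = 1
    fit = best_subset_k1(y)
    df = sum(np.cov(y[:, i], fit[:, i])[0, 1] for i in range(2))
    print(f"a = {a:5.1f}: df = {df:6.2f}, df / a = {df / a:.3f} "
          f"(limit 1 / sqrt(pi) = {1 / np.pi ** 0.5:.3f})")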

The phenomenon we have just illustrated is not an idiosyncratic pathology of best-subsets regression, but in fact can occur whenever we project onto a non-convex model.

Theorem 2

For a fitting technique that minimizes squared error subject to a non-convex, closed constraint μ̂(k)Mk ⊂ ℝn, consider the model,

y = \mu + \sigma\varepsilon, \qquad \varepsilon_i \sim F \quad (i = 1, \ldots, n),

where the εi are independent and F is a mean-zero distribution with finite variance supported on an open neighborhood of 0. Then there exists some μ* such that DF(μ*, σ2, k) → ∞ as σ2 → 0.

Proof of Theorem 2

An intuitive proof sketch follows below, while a rigorous proof is deferred to the Supplementary Material.

Best-subsets regression has a non-convex constraint set for k < p, and our toy example gave some insight for why the degrees of freedom can be much greater than the ambient dimension. We now give intuition for how the theorem generalizes to any non-convex constraint set.

Place the true mean at a point with non-unique projection onto the constraint set; see Figure 3(b). The salient feature of the figure is that the spherical contour spans the gap of the divot where it meets the constraint set. A point with non-unique projection must exist by the Motzkin–Bunt theorem (Motzkin, 1935; Kritikos, 1938). The constraint set for μ̂ = X β̂ is just an affine transformation of the constraint set for β̂, and thus a non-convex constraint on β̂ is equivalent to a non-convex constraint on μ̂. Then the fit depends sensitively on the noise process, even when the noise is very small, since y is projected onto multiple well-separated sections of the constraint set. Thus as the magnitude of the noise, σ, goes to zero, the variance of μ̂ remains roughly constant. Equation (2) then tells us that degrees of freedom can be made arbitrarily large, as it will be roughly proportional to σ−1. By inserting that rate into Equation (1), we also see that, for σ2 → 0, the difference between the expected prediction error and the residual sum of squares converges to zero, with both approaching the expected squared bias of the model.

Theorem 2 and its proof have practical ramifications for the use of degrees of freedom in model selection criteria. The Akaike information criterion and the Bayes information criterion use degrees of freedom to mitigate overfitting of a model. Since theoretical results for these criteria, such as conditions for model consistency, rely solely on the definition of degrees of freedom that we have also used, those results still hold in the settings we consider here, but only so long as the correct degrees of freedom is used. Our result sheds light on the dangers of using a naïve estimate of degrees of freedom based on model size. For instance, using k as the degrees of freedom for best-subsets regression with subset size k can be arbitrarily far from the truth, resulting in values of model selection criteria that are also arbitrarily wrong.

Our proof also clarifies that the degrees of freedom is most likely to be large when the true mean is nearly equidistant from two or more well-separated parts of the model constraint set. For methods like best-subsets or forward selection that explicitly seek a parsimonious model fit, this could occur if there are several different parsimonious models that describe μ almost equally well. Thus, if we use a parsimony-seeking model because we know the truth to be parsimonious, we would not expect the degrees of freedom to misbehave very often. By contrast, if we force the fit to be parsimonious even when the truth is not, the degrees of freedom may well misbehave, as it does in our simulation examples.

4. Discussion

The common intuition that effective degrees of freedom serves as a consistent and interpretable measure of model complexity merits some skepticism. Our results, combined with those of Kaufman & Rosset (2014), demonstrate that for many widely used convex and non-convex fitting techniques, the degrees of freedom can be non-monotone with respect to model nesting. Furthermore, in the non-convex case, the degrees of freedom can exceed the dimension of the model space by an arbitrarily large amount, and may do so in run-of-the-mill datasets.

In light of the above, the term degrees of freedom seems misleading, as it is suggestive of a quantity corresponding to model size or complexity. It is also misleading to consider degrees of freedom as a measure of overfitting, or how flexibly the model conforms to the data, since a model is always at least as flexible as a submodel. By definition, the effective degrees of freedom of Efron (1983) measures optimism of in-sample error as an estimate of out-of-sample error, but we should not be too quick to carry over our intuition from linear models.


Acknowledgments

We thank the editor, associate editor, and two reviewers for helpful comments that significantly improved the paper. Lucas Janson was partially supported by the National Institutes of Health. William Fithian was partially supported by the National Science Foundation and the Gerald J. Lieberman Fellowship. Trevor Hastie was partially supported by the National Institutes of Health and the National Science Foundation.

Footnotes

Supplementary material

Supplementary material available at Biometrika online includes the proof of Theorem 2 and further explanation, including code, for the examples in Sections 1.1 and 3.

Contributor Information

Lucas Janson, Email: ljanson@stanford.edu.

William Fithian, Email: wfithian@stanford.edu.

Trevor J. Hastie, Email: hastie@stanford.edu.

References

  1. Buja A, Hastie TJ, Tibshirani RJ. Linear smoothers and additive models. The Annals of Statistics. 1989;17:453–510.
  2. Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association. 1983;78:316–331.
  3. Efron B. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association. 1986;81:461–470.
  4. Hoerl AE. Application of ridge analysis to regression problems. Chemical Engineering Progress. 1962;58:54–59.
  5. Kaufman S, Rosset S. When does more regularization imply fewer degrees of freedom? Sufficient conditions and counter examples from lasso and ridge regression. Biometrika. 2014; to appear.
  6. Kritikos MN. Sur quelques propriétés des ensembles convexes. Bulletin Mathématique de la Société Roumaine des Sciences. 1938;40:87–92.
  7. Mallows CL. Some comments on Cp. Technometrics. 1973;15:661–675.
  8. Motzkin T. Sur quelques propriétés caractéristiques des ensembles convexes. Atti Accad Naz Lincei Rend Cl Sci Fis Mat Natur. 1935;21:562–567.
  9. Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.
