In their thought-provoking paper, Belkin et al. (1) illustrate and discuss the shape of risk curves in the context of modern high-complexity learners. Given a fixed training sample size n, such curves show the risk of a learner as a function of some (approximate) measure of its complexity N. With N the number of features, these curves are also referred to as feature curves. A salient observation in ref. 1 is that these curves can display what they call double descent: With increasing N, the risk initially decreases, attains a minimum, and then increases until N equals n, where the training data are fitted perfectly. Increasing N even further, the risk decreases a second and final time, creating a peak at N = n. This twofold descent may come as a surprise, but, as opposed to what ref. 1 reports, it has not been overlooked historically. Our letter draws attention to some original earlier findings of interest to contemporary machine learning.
Already in 1989, using artificial data, Vallet et al. (2) experimentally demonstrated double descent for learning curves of classifiers trained through minimum norm linear regression (MNLR; see ref. 3), termed the pseudo-inverse solution in ref. 2. In learning curves the risk is displayed as a function of n, as opposed to N for risk curves. What intuitively matters in learning behavior, however, is the sample size relative to the measure of complexity. This idea is made explicit in various physics papers on learning (e.g., refs. 2, 4, and 5), where the risk is often plotted against α = n/N. A first theoretical result on double descent, indeed using such α, is given by Opper et al. (4). They prove that in particular settings, for N going to infinity, the pseudo-inverse solution improves as soon as one moves away from the peak at α = 1.
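To make the pseudo-inverse picture concrete, the following minimal sketch (not taken from refs. 2 or 4) estimates the test risk of the minimum norm least-squares solution as a function of α = n/N, for a fixed number of features N and a noisy linear target; the dimensionality, noise level, and number of repetitions are illustrative assumptions. The estimated risk should rise to a peak near α = 1 and descend again on either side, the learning-curve analog of the double-descent shape discussed above.

```python
# A minimal sketch (not from refs. 2 or 4): estimated test risk of the
# minimum norm (pseudo-inverse) least-squares solution versus alpha = n/N,
# for a fixed number of features N and a noisy linear target. N, the noise
# level, and the number of repetitions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 50                                    # number of features (kept fixed)
w_true = rng.normal(size=N) / np.sqrt(N)  # target weights, norm roughly 1
noise = 0.1                               # standard deviation of label noise

def risk_mnlr(n, reps=50, n_test=2000):
    """Average squared test risk of the minimum norm fit at sample size n."""
    risks = []
    for _ in range(reps):
        X = rng.normal(size=(n, N))
        y = X @ w_true + noise * rng.normal(size=n)
        w_hat = np.linalg.pinv(X) @ y     # pseudo-inverse (minimum norm) solution
        X_te = rng.normal(size=(n_test, N))
        y_te = X_te @ w_true + noise * rng.normal(size=n_test)
        risks.append(np.mean((X_te @ w_hat - y_te) ** 2))
    return float(np.mean(risks))

for alpha in (0.25, 0.5, 0.8, 1.0, 1.25, 2.0, 4.0):
    n = int(round(alpha * N))
    print(f"alpha = n/N = {alpha:.2f}: risk ~ {risk_mnlr(n):.3f}")
```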
Employing a so-called pseudo-Fisher linear discriminant (PFLD, equivalent to MNLR), Duin (6) was the first to show feature curves on real-world data quite similar to the double-descent curves in ref. 1. Compare, for instance, figure 2 in ref. 1 with figures 6 and 7 in ref. 6. Skurichina and Duin (7) demonstrate experimentally that increasing PFLD’s complexity simply by adding random features can improve performance when n ≤ N (i.e., α ≤ 1). The benefit of some form of regularization had already been shown in ref. 2. For semisupervised PFLD, Krijthe and Loog (8) demonstrate that unlabeled data can regularize but also worsen the peak in the curve. Their work builds on the original analysis of double descent for the supervised PFLD by Raudys and Duin (9).
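As a rough illustration of the random-feature effect reported in ref. 7 (not a reproduction of their experiments), the sketch below trains a pseudo-inverse discriminant, fit here as minimum norm regression on ±1 labels, at the peak n = N on a synthetic two-class Gaussian problem, and then appends k purely uninformative random features; the class setup, dimensionalities, and sample sizes are assumptions chosen for illustration. For the moderate values of k shown, the test error should be markedly lower than at k = 0, mimicking the regularizing effect described above; adding ever more random features would eventually dilute the signal again.

```python
# A rough sketch of the random-feature effect (not the experiment of ref. 7):
# a pseudo-inverse linear discriminant, fit as minimum norm regression on
# +/-1 labels, trained at the peak n = N, with k uninformative random
# features appended to both training and test data. The two-class Gaussian
# problem and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
D, n_per_class, n_test_per_class = 30, 15, 1000  # so n = 30 = D when k = 0
mean = np.zeros(D)
mean[:5] = 1.0                  # class means differ in the first 5 features

def sample(n_each):
    """Draw n_each points per class from unit-variance Gaussians at -mean and +mean."""
    X = np.vstack([rng.normal(size=(n_each, D)) - mean,
                   rng.normal(size=(n_each, D)) + mean])
    y = np.r_[-np.ones(n_each), np.ones(n_each)]
    return X, y

def pfld_error(k, reps=30):
    """Mean test error of the pseudo-inverse discriminant with k added noise features."""
    errs = []
    for _ in range(reps):
        X_tr, y_tr = sample(n_per_class)
        X_te, y_te = sample(n_test_per_class)
        # append k independent standard-normal (purely random) features
        X_tr = np.hstack([X_tr, rng.normal(size=(X_tr.shape[0], k))])
        X_te = np.hstack([X_te, rng.normal(size=(X_te.shape[0], k))])
        w = np.linalg.pinv(X_tr) @ y_tr          # minimum norm / pseudo-inverse fit
        errs.append(np.mean(np.sign(X_te @ w) != y_te))
    return float(np.mean(errs))

for k in (0, 30, 120, 470):
    print(f"{k:4d} added random features (N = {D + k}): error ~ {pfld_error(k):.3f}")
```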
Interestingly, results from refs. 4–7 suggest that some losses may not exhibit double descent in the first place. In refs. 6 and 7, the linear support vector machine (SVM) shows regular monotonic behavior. Analytic results from refs. 4 and 5 show the same for the so-called perceptron of optimal (or maximal) stability, which is closely related to the SVM (5).
The findings in ref. 1 go significantly beyond those for the MNLR. Also shown, for instance, is double descent for two-layer neural networks and random forests. Combined with observations such as those from Loog et al. (10), which show striking multiple-descent learning curves (even in the underparameterized regime), this makes evident the need to further our understanding of such rudimentary learning behavior.
Footnotes
The authors declare no competing interest.
References
1. Belkin M., Hsu D., Ma S., Mandal S., Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. U.S.A. 116, 15849–15854 (2019).
2. Vallet F., Cailton J.-G., Refregier Ph., Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315–320 (1989).
3. Penrose R., On best approximate solutions of linear matrix equations. Math. Proc. Cambridge Philos. Soc. 52, 17–19 (1956).
4. Opper M., Kinzel W., Kleinz J., Nehl R., On the ability of the optimal perceptron to generalise. J. Phys. A Math. Gen. 23, L581–L586 (1990).
5. Watkin T. L. H., Rau A., Biehl M., The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499 (1993).
6. Duin R. P. W., “Classifiers in almost empty spaces” in Proceedings of the 15th International Conference on Pattern Recognition (IEEE, 2000), vol. 2, pp. 1–7.
7. Skurichina M., Duin R. P. W., “Regularization by adding redundant features” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR) (Springer, 1998), pp. 564–572.
8. Krijthe J. H., Loog M., “The peaking phenomenon in semi-supervised learning” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR) (Springer, 2016), pp. 299–309.
9. Raudys Š., Duin R. P. W., Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognit. Lett. 19, 385–392 (1998).
10. Loog M., Viering T., Mey A., “Minimizers of the empirical risk and risk monotonicity” in Advances in Neural Information Processing Systems (MIT Press, 2019), pp. 7476–7485.