Proc. Natl. Acad. Sci. U.S.A. 2022 Dec 29;120(1):e2214972120. doi: 10.1073/pnas.2214972120

On cheap entropy-sparsified regression learning

Illia Horenko a,1, Edoardo Vecchi b, Juraj Kardoš b, Andreas Wächter c, Olaf Schenk b, Terence J O’Kane d, Patrick Gagliardini e, Susanne Gerber f
PMCID: PMC9910478  PMID: 36580592

Abstract

Regression learning is one of the long-standing problems in statistics, machine learning, and deep learning (DL). We show that writing this problem as a probabilistic expectation over (unknown) feature probabilities (thus increasing the number of unknown parameters and seemingly making the problem more complex) actually leads to its simplification and allows incorporating the physical principle of entropy maximization. It helps decompose a very general setting of this learning problem (including discretization, feature selection, and learning multiple piece-wise linear regressions) into an iterative sequence of simple substeps, which are either analytically solvable or cheaply computable through an efficient second-order numerical solver with sublinear cost scaling. This leads to the computationally cheap and robust non-DL second-order Sparse Probabilistic Approximation for Regression Task Analysis (SPARTAn) algorithm, which can be efficiently applied to problems with millions of feature dimensions on a commodity laptop, whereas state-of-the-art learning tools would require supercomputers. SPARTAn is compared to a range of commonly used regression learning tools on synthetic problems and on the prediction of the El Niño Southern Oscillation, the dominant interannual mode of tropical climate variability. The obtained SPARTAn learners provide more predictive, sparse, and physically explainable data descriptions, clearly discerning the important role of ocean temperature variability at the thermocline in the equatorial Pacific. SPARTAn provides an easily interpretable description of the timescales on which these thermocline temperature features evolve and eventually express at the surface, thereby enabling enhanced predictability of the key drivers of the interannual climate.

Keywords: entropy, supervised learning, climate prediction, numerics


The problem of regression learning is one of the oldest questions in data science, seeking relationships between the D-dimensional explanatory feature information X_t = (X_{1,t}, …, X_{D,t}) and the predicted M-dimensional real-valued variable Y_t = (Y_{1,t}, …, Y_{M,t}) in the available statistics of data pairs {X_t, Y_t}, with t = 1, …, T. In its most popular linear formulation, this relationship is described by means of a linear function and implemented as a convex numerical inference of the a priori unknown M × D matrix of regression coefficients Λ from:

Y_{m,t} = \sum_{d=1}^{D} \Lambda_{m,d}\, X_{d,t} + \text{noise}, \qquad m = 1, \ldots, M. \quad [1]

Identification of the relevant/significant subset of features is usually performed by l_2, l_1, elastic net (a linear combination of l_2 and l_1), or garrote-like sparsification approaches, which attempt to force the elements of the regression coefficient matrix Λ to zero.
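As a point of reference, the following minimal scikit-learn sketch (our own illustration, not code from this paper) fits these three standard sparsification baselines to a small synthetic instance of Eq. 1; all variable names and parameter values are arbitrary assumptions.

```python
# Illustrative sketch (not from the paper): standard sparsification baselines
# (l2, l1, elastic net) fitted to synthetic data generated from Eq. 1.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
T, D = 200, 50                        # statistics size and feature dimension
Lambda_true = np.zeros(D)
Lambda_true[:3] = [3.0, 1.5, 2.0]     # only three relevant features
X = rng.normal(size=(T, D))
Y = X @ Lambda_true + 0.5 * rng.normal(size=T)   # Eq. 1 with Gaussian noise

for model in (Ridge(alpha=1.0),                        # l2 regularization
              Lasso(alpha=0.1),                        # l1 regularization
              ElasticNet(alpha=0.1, l1_ratio=0.5)):    # combination of l2 and l1
    model.fit(X, Y)
    nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    print(type(model).__name__, "nonzero coefficients:", nonzero)
```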

Nonlinear extensions of this regression problem constitute the foundations of modern deep learning, aiming at the inference of superpositions ϕ ∘ ϕ ∘ ϕ(…) of generalized functions ϕ (e.g., perceptrons) with multiple linear regressions (Eq. 1), by means of first-order numerical algorithms like stochastic gradient descent (SGD). This learning approach rests on the mathematical resolution of Hilbert's thirteenth problem by Arnold and Kolmogorov (1, 2). This proof and further results guarantee that any continuous mapping from the D-dimensional variable X_t to Y_t can be learned as an appropriate superposition ϕ ∘ ϕ ∘ ϕ(…) of continuous one-dimensional (1D) functions.

Despite its generality and wide acceptance, this common paradigm of regression learning bears several limitations, specifically: (i) popular feature selection approaches in regression learning, for example, l_2, l_1, elastic net, and garrote-like regularizations (3, 4), are characterized by a polynomial growth C(D) = cD^a of computational and memory costs C(D) with the feature dimension D, with exponent a significantly larger than 1; (ii) these approaches do not allow a direct probabilistic interpretation, consequently prohibiting the application of very general and cheap probabilistic feature selection paradigms from physics and information theory, like, for example, the maximum entropy principle; (iii) the nonlinear extensions of regression learning in DL, which rely on universal approximations by means of regression superpositions ϕ ∘ ϕ ∘ ϕ(…) of generalized linear functions ϕ, are characterized by intractability of higher-order derivatives (beyond the first derivative) of the loss function. In particular, issue (iii) results in a very limited applicability of second- and higher-order numerical methods (like the generalized Newton and interior point optimization methods) that could provide much higher convergence rates than first-order numerical algorithms (like SGD). In many realistic learning applications, gradient descent (GD) and SGD exhibit extremely slow convergence, taking hundreds of thousands of expensive iterations, not allowing any mathematically justified stopping criteria, and requiring high-performance computing (HPC) facilities.

The central message of this brief report is that these three challenges (i) to (iii) can be addressed by making the regression problem seemingly more complex, extending the set of tunable parameters with a vector of D feature probabilities w = (w_1, …, w_D). In the linear case, this allows formulating the regression as an expectation value with respect to this unknown feature probability distribution:

Y_t = \mathbb{E}_w\left[\Lambda X_t\right] + \text{noise} = \sum_{d=1}^{D} w_d\, \Lambda_{:,d}\, X_{d,t} + \text{noise}. \quad [2]
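The following short NumPy check (our own illustration; all names and sizes are arbitrary) makes the expectation in Eq. 2 concrete: the columns of Λ are weighted by w, and a uniform w recovers the plain linear regression of Eq. 1 up to the constant factor 1/D.

```python
# Tiny check of Eq. 2 (illustration only): column-weighted regression; with a
# uniform w = (1/D, ..., 1/D) it equals Eq. 1 scaled by 1/D.
import numpy as np

rng = np.random.default_rng(0)
D, M, T = 4, 2, 3
Lambda = rng.normal(size=(M, D))
X = rng.normal(size=(D, T))
w = np.full(D, 1.0 / D)                  # uniform feature probabilities

Y_prob = (Lambda * w[None, :]) @ X       # Eq. 2: sum_d w_d Lambda[:, d] X[d, t]
Y_lin = Lambda @ X / D                   # Eq. 1 scaled by the constant 1/D
assert np.allclose(Y_prob, Y_lin)
```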

The common linear regression (Eq. 1) then becomes a particular case of this more general probabilistic regression (Eq. 2), with w being the uniform distribution w = (1/D, …, 1/D). Generalization to nonlinear regression learning can also be achieved in the same probabilistic way and actually simplifies the problem, as will be shown below: Instead of dwelling on Hilbert's thirteenth problem and on superpositions ϕ ∘ ϕ ∘ ϕ(…) of perceptrons (or other similar generalized linear model functions), which can lead to intractable high-order derivatives and slow convergence of GD and SGD, one can formulate the general nonlinear regression as a problem of piece-wise linear learning of K linear regressions, coupled to entropic feature sparsification (5–8) and confined to K feature space discretization boxes with centers C_{:,k} = (C_{1,k}, …, C_{D,k}), where γ_{k,t} is the probability for a feature vector X_t to belong to discretization box k (9). Such piece-wise linear probabilistic regression learning can be implemented as a numerical minimization of the following function:

L(w, \Lambda, \gamma, C) = \underbrace{\frac{1}{TK} \sum_{t,d,k}^{T,D,K} w_d\, \gamma_{k,t} \left(X_{d,t} - C_{d,k}\right)^2}_{\text{loss of feature discretization in } K \text{ boxes}} + \underbrace{\epsilon_{l2} \sum_{m,d,k=1}^{M,D,K} \Lambda_{d,m,k}^2}_{\text{loss of } l_2 \text{ regression regularization}} + \underbrace{\epsilon_w \sum_{d=1}^{D} w_d \log w_d}_{\text{loss of entropic feature sparsification}} + \underbrace{\epsilon_r \frac{1}{TM} \sum_{t,m,k}^{T,M,K} \gamma_{k,t} \left(Y_{m,t} - \Lambda_{0,m,k} - \sum_{d=1}^{D} w_d\, \Lambda_{d,m,k}\, X_{d,t}\right)^2}_{\text{loss of } K \text{ box-wise linear regression approximations}}. \quad [3]
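For concreteness, here is a minimal NumPy sketch (our own illustration; the array shapes, names, and the small regularizer inside the logarithm are our assumptions, not the released SPARTAn code) that evaluates the four terms of Eq. 3 for given (w, Λ, γ, C):

```python
# Minimal evaluation of the SPARTAn loss in Eq. 3 (illustration only).
import numpy as np

def spartan_loss(X, Y, w, Lam0, Lam, gamma, C, eps_r, eps_w, eps_l2):
    # X: (T, D) features, Y: (T, M) targets, w: (D,) feature probabilities,
    # gamma: (K, T) box probabilities, C: (D, K) box centers,
    # Lam0: (M, K) intercepts, Lam: (D, M, K) regression coefficients.
    T, D = X.shape
    M = Y.shape[1]
    K = C.shape[1]

    # loss of feature discretization in K boxes
    sq = (X[:, :, None] - C[None, :, :]) ** 2                # (T, D, K)
    discr = np.einsum('d,kt,tdk->', w, gamma, sq) / (T * K)

    # loss of l2 regression regularization
    reg_l2 = eps_l2 * np.sum(Lam ** 2)

    # loss of entropic feature sparsification
    entropy = eps_w * np.sum(w * np.log(w + 1e-15))

    # loss of K box-wise linear regression approximations
    pred = Lam0[None, :, :] + np.einsum('td,d,dmk->tmk', X, w, Lam)   # (T, M, K)
    resid = (Y[:, :, None] - pred) ** 2
    regr = eps_r * np.einsum('kt,tmk->', gamma, resid) / (T * M)

    return discr + reg_l2 + entropy + regr
```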

Learning the regression model parameters (w, Λ, γ, C) from a given set of data pairs {X_t, Y_t}, t = 1, …, T, can be achieved with the monotonically convergent Sparse Probabilistic Approximation for Regression Task Analysis (SPARTAn) algorithm, reported in SI Appendix together with the complete problem formulation. For any fixed set of the four parameters (K, ϵ_r, ϵ_w, ϵ_{l2}), problem (Eq. 3) can be solved analytically for each of the variables (Λ, γ, C), while keeping the remaining variables fixed. Applying the Euler–Lagrange principle to Eq. 3 yields a convex optimization problem for the variable w, with a cheaply computable matrix of second derivatives. It turns out that, for D ≤ 2, this problem admits an analytical solution for w, given by the Lambert W-functions that are widely used in quantum gravitation theory, in solutions of the Schrödinger equation, in polymer physics, and in materials science (10). For D > 2, this convex problem can be solved numerically, adjusting the interior point Panua-Ipopt algorithm (11) to exploit the sparsity and low-rank structure of the Hessian in problem (Eq. 3). As can be seen from Fig. 1 A and B, this structural adaptation of the algorithm plays a crucial role in achieving the sublinear scaling of the overall computational and memory costs: Whereas the application of common optimizers (like the commercial fmincon solver from MathWorks and the general-purpose Ipopt algorithm) is characterized by polynomial cost scaling C(D) = cD^a (with a > 2 for computational cost and a > 1.3 for memory scaling), Panua-Ipopt adapted to the low-rank and sparsity patterns of the Hessian in (Eq. 3) has a < 1 both for the overall memory and computational complexity scalings. Practically, this means that solving (Eq. 3) for w, with one thousand data points T and one million feature dimensions D (a typical situation in many realistic applications), requires only 13 min and around 16 GB of memory on a commodity laptop with 8 cores, against the roughly 336 y and 30 TB required by state-of-the-art optimization tools.
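To illustrate the w-substep described above, the following sketch solves the convex subproblem of Eq. 3 over the probability simplex with a generic SciPy SLSQP call; this is our own stand-in for the structure-exploiting Panua-Ipopt interior point solver used in the paper, and all names, tolerances, and bounds are assumptions.

```python
# Schematic w-subproblem of Eq. 3 (illustration only): for fixed (Lam0, Lam,
# gamma, C), minimize over w on the probability simplex with a generic solver.
import numpy as np
from scipy.optimize import minimize

def solve_w_subproblem(X, Y, Lam0, Lam, gamma, C, eps_r, eps_w, w0=None):
    T, D = X.shape
    M = Y.shape[1]
    K = C.shape[1]

    def loss_w(w):
        # only the terms of Eq. 3 that depend on w
        sq = (X[:, :, None] - C[None, :, :]) ** 2
        discr = np.einsum('d,kt,tdk->', w, gamma, sq) / (T * K)
        entropy = eps_w * np.sum(w * np.log(w + 1e-15))
        pred = Lam0[None, :, :] + np.einsum('td,d,dmk->tmk', X, w, Lam)
        regr = eps_r * np.einsum('kt,tmk->', gamma, (Y[:, :, None] - pred) ** 2) / (T * M)
        return discr + entropy + regr

    w0 = np.full(D, 1.0 / D) if w0 is None else w0
    res = minimize(loss_w, w0, method='SLSQP',
                   bounds=[(1e-12, 1.0)] * D,
                   constraints=[{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}])
    return res.x
```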

Fig. 1. Comparison of SPARTAn to common tools for synthetic model data (A–E) and for the predictions of NINO3.4 from sea surface and bulk ocean temperatures (F–K), for the mean squared prediction error (MSE) and other prediction quality metrics.

Fig. 1 A–E summarizes numerical experiments comparing SPARTAn to common regression learning and feature selection tools on randomly generated synthetic data sets from the popular scalable (in a broad range of dimensions D and statistics sizes T) regression learning example 3 from Tibshirani's seminal paper on l_1 regularization (3). All of the compared algorithms were provided with the same information and run on the same hardware and software; the convergence tolerance tol was set to 10^{-12} for all of the compared methods. Fifty cross-validations were performed in every experiment with every learning tool, to visualize the obtained average performances in Fig. 1 D and E. These results demonstrate that SPARTAn allows a marked and robust improvement of parameter identification quality, improved sparsity pattern detection, and low computational costs for very high-dimensional regression learning problems. Please note that the regression parameters in SPARTAn are learned as products with the feature probabilities w_d, i.e., α_d = w_d Λ_d. The uncertainty of α_d for a fixed w_d is expressed as Var(α_d) = w_d^2 Var(Λ_d); i.e., a factor-of-10 reduction of the feature probability w_d results in a factor-of-100 reduction of the variance of the estimated α_d (Fig. 1C).
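The following sketch (our own illustration) shows the repeated cross-validation protocol in generic form, with a placeholder sparse-regression data generator rather than Tibshirani's example 3 itself; see the linked repository for the actual experimental setup and the full set of compared methods.

```python
# Generic repeated cross-validation protocol (illustration only); the data
# generator below is a placeholder, not Tibshirani's example 3.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
T, D = 500, 200
beta = np.zeros(D)
beta[:3] = [3.0, 1.5, 2.0]                    # sparse ground-truth coefficients

mses = []
for split in range(50):                       # 50 cross-validations
    X = rng.normal(size=(T, D))
    Y = X @ beta + rng.normal(size=T)
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3,
                                              random_state=split)
    model = Lasso(alpha=0.1).fit(X_tr, Y_tr)
    mses.append(mean_squared_error(Y_te, model.predict(X_te)))

print("mean test MSE over 50 splits:", np.mean(mses))
```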

Next, we consider a learning problem from climate research aiming at predictions of the El Niño Southern Oscillation (ENSO) encoded by the NINO3.4 climate index, a climate phenomenon governing the interannual variation of sea surface temperatures in the Pacific Ocean. The periodicity of this phenomenon is highly irregular, but its prediction has recently gained growing importance due to its recognized major impact on global climate and economy (12). Predictions of the NINO3.4 index with modern AI tools are currently receiving increasing attention both in the geosciences and in the AI/ML communities (13, 14). The predicted variable Y_t represents a historical sequence of deseasonalized monthly NINO3.4 index values. The feature values X_t represent a series of dominant deseasonalized climate proxies in the months preceding Y_t, consisting of the 100 dominant Empirical Orthogonal Function (EOF) projections of the global sea surface temperatures resimulated by the National Oceanic and Atmospheric Administration in the United States and the 100 dominant ACCESS-O resimulated global sea surface and deep ocean water temperature EOFs in the equatorial Pacific Ocean. A detailed explanation of this data set is provided in ref. 7.

To achieve a fair comparison and to avoid overfitting, in Fig. 1 F–K we deployed a multiple cross-validation procedure for the comparison of SPARTAn to common regression tools, with all of the methods trained, validated, and tested on the same historical data. Each of the models was trained for a range of meta-parameters (for example, the DL LSTM algorithm was repeatedly trained with various network architectures and various numbers of neurons in the hidden layer, and SPARTAn was trained with various values of the parameters (K, ϵ_r, ϵ_w, ϵ_{l2})); details about the trained meta-parameter ranges are available in the README file (see the link in the Data, Materials, and Software Availability section). Then, the best-performing meta-parameters for each of the methods were determined on the validation data and, finally, the best performers from each method family were compared in Fig. 1 F–H on the test historical data between 1992 and 2007, which were used neither in training nor in validation.
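Schematically, the meta-parameter selection protocol described above can be sketched as follows (our own illustration, with a hypothetical ridge-regression grid standing in for the compared method families; the actual meta-parameter grids are listed in the README file):

```python
# Train/validate/test meta-parameter selection protocol (illustration only):
# each candidate configuration is fit on the training period, scored on the
# validation period, and only the best one is evaluated once on the test period.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def select_and_test(X, Y, train_idx, val_idx, test_idx,
                    alphas=(0.01, 0.1, 1.0, 10.0)):   # hypothetical grid
    best_alpha, best_val_mse = None, np.inf
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X[train_idx], Y[train_idx])
        val_mse = mean_squared_error(Y[val_idx], model.predict(X[val_idx]))
        if val_mse < best_val_mse:
            best_alpha, best_val_mse = alpha, val_mse
    best = Ridge(alpha=best_alpha).fit(X[train_idx], Y[train_idx])
    test_mse = mean_squared_error(Y[test_idx], best.predict(X[test_idx]))
    return best_alpha, test_mse

# hypothetical usage with contiguous train/validation/test periods
rng = np.random.default_rng(2)
X, Y = rng.normal(size=(600, 20)), rng.normal(size=600)
idx = np.arange(600)
print(select_and_test(X, Y, idx[:400], idx[400:500], idx[500:]))
```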

In Fig. 1 F–H, we see that SPARTAn comprehensively outperforms a wide range of machine learning algorithms, including the CNN (LSTM) that performed best on the training and validation data, in predicting the NINO3.4 index on the test data. We also observe in Fig. 1G that SPARTAn is able to utilize additional subsurface ocean information at lead times of 5 mo, with a twofold reduction in MSE from > 0.012 to just over 0.006. None of the other methods shows comparable improvements in skill, and only the DL LSTM is able to approach, but not match, the low MSE values of SPARTAn, even out to 15-mo lead times. In Fig. 1 I and J, the relative contributions of the surface temperature (SST) and the subsurface vertical temperature gradients (dT/dz), a proxy for thermocline variability, are clearly demonstrated: subseasonal predictability is largely determined by surface temperatures (SST EOF 2 in Fig. 1I), whereas predictability beyond a season is strongly dependent on the stability of the thermocline (dT/dz EOF mode 3 in Fig. 1J). That the coherent structures of the relevant modes are captured in the SPARTAn features is reflected in the weights shown in Fig. 1K.

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We acknowledge the suggestions, helpful hints, and comments by Rupert Klein (FU Berlin).

Author contributions

I.H. designed research; I.H., E.V., J.K., A.W., O.S., T.O., P.G., and S.G. performed research; I.H., J.K., A.W., and O.S. contributed new reagents/analytic tools; I.H., E.V., J.K., A.W., O.S., and T.O. analyzed data; and I.H., E.V., T.O., and S.G. wrote the paper.

Competing interest

The authors declare no competing interest.

Data, Materials, and Software Availability

Code reproducing the comparison of numerical solvers in Fig. 1 A and B is available at https://github.com/goghino/scalable_regression. Data and code reproducing the comparison of NINO3.4 regression learning and prediction results (in Fig. 1 FK ) are provided for academic use at https://github.com/horenkoi/SPARTAn. All study data are included in the article and/or SI Appendix .

References

1. Arnold V., Shimura G., "Superposition of algebraic functions, mathematical developments arising from Hilbert's problems" in AMS Proceedings of Symposia in Pure and Applied Mathematics (1976), vol. 28, pp. 45–46.
2. Amari S., Backpropagation and stochastic gradient descent method. Neurocomputing 5, 185–196 (1993).
3. Tibshirani R., Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996).
4. Yuan M., Lin Y., On the non-negative garrotte estimator. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 69, 143–161 (2007).
5. Meister C., Salesky E., Cotterell R., "Generalized entropy regularization or: There's nothing special about label smoothing" in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, 2020), pp. 6870–6886.
6. Horenko I., On a scalable entropic breaching of the overfitting barrier for small data problems in machine learning. Neural Comput. 32, 1563–1579 (2020).
7. Vecchi E., Pospíšil L., Albrecht S., O'Kane T. J., Horenko I., eSPA+: Scalable entropy-optimal machine learning classification for small data problems. Neural Comput. 34, 1220–1255 (2022).
8. Horenko I., Cheap robust learning of data anomalies with analytically solvable entropic outlier sparsification. Proc. Natl. Acad. Sci. U.S.A. 119, e2119659119 (2022).
9. Gerber S., Pospisil L., Navandar M., Horenko I., Low-cost scalable discretization, prediction, and feature selection for complex systems. Sci. Adv. 6, eaaw0961 (2020).
10. Veberič D., Lambert W function for applications in physics. Comput. Phys. Commun. 183, 2622–2628 (2012).
11. Wächter A., Biegler L. T., On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming. Math. Program. 106, 25–57 (2006).
12. McPhaden M. J., Zebiak S. E., Glantz M. H., ENSO as an integrating concept in Earth science. Science 314, 1740–1745 (2006).
13. Timmermann A., et al., El Niño–Southern Oscillation complexity. Nature 559, 535–545 (2018).
14. Ham Y. G., Kim J. H., Luo J. J., Deep learning for multi-year ENSO forecasts. Nature 573, 568–572 (2019).
