Author manuscript; available in PMC: 2024 Mar 19.
Published in final edited form as: Adv Neural Inf Process Syst. 2023 Dec;36:41076–41258.

OKRidge: Scalable Optimal k-Sparse Ridge Regression

Jiachang Liu 1, Sam Rosen 1, Chudi Zhong 1, Cynthia Rudin 1
PMCID: PMC10950455  NIHMSID: NIHMS1976784  PMID: 38505104

Abstract

We consider an important problem in scientific discovery, namely identifying sparse governing equations for nonlinear dynamical systems. This involves solving sparse ridge regression problems to provable optimality in order to determine which terms drive the underlying dynamics. We propose a fast algorithm, OKRidge, for sparse ridge regression, using a novel lower bound calculation involving, first, a saddle point formulation, and from there, either solving (i) a linear system or (ii) using an ADMM-based approach, where the proximal operators can be efficiently evaluated by solving another linear system and an isotonic regression problem. We also propose a method to warm-start our solver, which leverages a beam search. Experimentally, our methods attain provable optimality with run times that are orders of magnitude faster than those of the existing MIP formulations solved by the commercial solver Gurobi.

1. Introduction

We are interested in identifying sparse and interpretable governing differential equations arising from nonlinear dynamical systems. These are scientific machine learning problems whose solution involves sparse linear regression. Specifically, these problems require the exact solution of sparse regression problems, with the most basic being sparse ridge regression:

$$\min_{\beta}\ \|y-X\beta\|_2^2+\lambda_2\|\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0\le k, \tag{1}$$

where k specifies the number of nonzero coefficients for the model. This formulation is general, but in the case of nonlinear dynamical systems, the outcome y is a derivative (usually time or spatial) of each dimension x. Here, we assume that the practitioner has included the true variables, along with many other possibilities, and is looking to determine which terms (which transformations of the variables) are real and which are not. This problem is NP-hard [49], and is more challenging in the presence of highly correlated features. Selection of correct features is vital in this context, as many solutions may give good results on training data, but will quickly deviate from the true dynamics when extrapolating past the observed data due to the chaotic nature of complex dynamical systems.
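To make Problem (1) concrete, here is a brute-force reference solver (our illustrative sketch, not the paper's algorithm): it enumerates every size-k support and refits ridge regression on each. This is only feasible for tiny p, but it is useful as a ground truth when checking faster solvers.

```python
import itertools
import numpy as np

def ksparse_ridge_bruteforce(X, y, k, lam2):
    """Exhaustively solve min ||y - X b||^2 + lam2 ||b||^2 s.t. ||b||_0 <= k.

    Only feasible for tiny p; shown here to make Equation (1) concrete.
    """
    n, p = X.shape
    best_loss, best_beta = np.inf, np.zeros(p)
    for support in itertools.combinations(range(p), k):
        S = list(support)
        XS = X[:, S]
        # Ridge solution restricted to the support S.
        bS = np.linalg.solve(XS.T @ XS + lam2 * np.eye(k), XS.T @ y)
        loss = np.sum((y - XS @ bS) ** 2) + lam2 * np.sum(bS ** 2)
        if loss < best_loss:
            best_loss = loss
            best_beta = np.zeros(p)
            best_beta[S] = bS
    return best_beta, best_loss
```

Since the loop visits all $\binom{p}{k}$ supports, the returned loss is the exact optimum of Problem (1), which is what a certifiable solver must match.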

Both heuristic and optimal algorithms have been proposed to solve these problems. Heuristic methods include greedy sequential adding of features [25, 16, 54, 22] or ensemble [31] methods. These methods are fast, but often get stuck in local minima, and there is no way to assess solution quality due to the lack of a lower bound on performance. Optimal methods provide an alternative, but are slow since they must prove optimality. MIOSR [9], a mixed-integer programming (MIP) approach, has been able to certify optimality of solutions given enough time. Slow solvers cause difficulty in performing cross-validation on $\lambda_2$ (the $\ell_2$ regularization coefficient) and $k$ (the sparsity level).

We aim to solve sparse ridge regression to certifiable optimality, but in a fraction of the run time. We present a fast branch-and-bound (BnB) formulation, OKRidge. A crucial challenge is obtaining a tight and feasible lower bound for each node in the BnB tree. It is possible to calculate the lower bound via the SOS1 [9], big-M [10], or the perspective formulations (also known as the rotated second-order cone constraints) [32, 4, 59]; the mixed-integer problems can then be solved by a MIP solver. However, these formulations do not consider the special mathematical structure of the regression problem. To calculate a lower bound more efficiently, we first propose a new saddle point formulation for the relaxed sparse ridge regression problem. Based on the new saddle point formulation, we propose two novel methods to calculate the lower bound. The first method is extremely efficient and relies on solving only a linear system of equations. The second method is based on ADMM and can tighten the lower bound given by the first method. Together, these methods give us a tight lower bound, used to prune nodes and provide a small optimality gap. Additionally, we propose a method based on beam-search [58] to get a near-optimal solution quickly, which can be a starting point for both our algorithm and other MIP formulations. Unlike previous methods, our method uses a dynamic programming approach so that previous solutions in the BnB tree can be used while exploring the current node, giving a massive speedup. In summary, our contributions are:

  1. We develop a highly efficient customized branch-and-bound framework for achieving optimality in k-sparse ridge regression, using a novel lower bound calculation and heuristic search.

  2. To compute the lower bound, we introduce a new saddle point formulation, from which we derive two efficient methods (one based on solving a linear system and the other on ADMM).

  3. Our warm-start method is based on beam-search and implemented in a dynamic programming fashion, avoiding redundant calculations. We prove that our warm-start method is an approximation algorithm with an exponential factor tighter than previous work.

On benchmarks, OKRidge certifies optimality orders of magnitude faster than the commercial solver Gurobi. For dynamical systems, our method outperforms the state-of-the-art certifiable method by finding superior solutions, particularly in high-dimensional feature spaces.

2. Preliminary: Dual Formulation via the Perspective Function

There is an extensive literature on this topic, and a longer review of related work is in Appendix A. If we ignore the constant term $y^Ty$, we can rewrite the loss objective in Equation (1) as:

$$\mathcal{L}^{\mathrm{ridge}}(\beta) \coloneqq \beta^T X^T X \beta - 2y^T X\beta + \lambda_2 \sum_{j=1}^{p} \beta_j^2, \tag{2}$$

with p as the number of features. We are interested in the following optimization problem:

$$\min_{\beta}\ \mathcal{L}^{\mathrm{ridge}}(\beta) \quad \text{s.t.} \quad (1-z_j)\beta_j=0\ \ \forall j,\quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in\{0,1\}, \tag{3}$$

where $k$ is the number of nonzero coefficients. With the sparsity constraint, the problem is NP-hard. The constraint $(1-z_j)\beta_j=0$ in Problem (3) can be reformulated with the SOS1, big-M, or the perspective formulation (with quadratic cone constraints), which can then be solved by a MIP solver. Since commercial solvers do not exploit the special structure of the problem, we develop a customized branch-and-bound framework.

For any function $f(a)$, the perspective function is $g(a,b) \coloneqq b\,f(a/b)$ on the domain $b>0$ [32, 34, 26] and $g(a,b)=0$ otherwise. Applying this to $f(a)=a^2$, we obtain the function $g(a,b)=a^2/b$. As shown by [4], replacing the loss term $\beta_j^2$ and the constraint $(1-z_j)\beta_j=0$ with the perspective term $\beta_j^2/z_j$ in Problem (3) does not change the optimal solution. By the Fenchel conjugate [4], $g(\cdot,\cdot)$ can be rewritten as $g(a,b)=\max_c\left(ac-\frac{c^2}{4}b\right)$. If we define a new perspective loss as:

$$\mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}}(\beta,z,c) \coloneqq \beta^T X^T X \beta - 2y^T X\beta + \lambda_2 \sum_{j=1}^{p}\left(\beta_j c_j - \frac{c_j^2}{4} z_j\right), \tag{4}$$

then we can reformulate Problem (3) as:

$$\min_{\beta,z}\max_{c}\ \mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}}(\beta,z,c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in\{0,1\}. \tag{5}$$

If we relax the binary constraint $\{0,1\}$ to the interval $[0,1]$ and swap the max and min (there is no duality gap, as pointed out by [4]), we obtain the dual formulation for the convex relaxation of Problem (5):

$$\max_{c}\min_{\beta,z}\ \mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}}(\beta,z,c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1]. \tag{6}$$

While [4] uses the perspective formulation for safe feature screening, we use it to calculate a lower bound for Problem (3). However, directly solving the max-min problem is computationally challenging. In Section 3.1, we propose two efficient methods to do so.

3. Methodology

We propose a custom BnB framework to solve Problem (3). We process each node in the BnB tree in three steps. First, we calculate a lower bound for the node, using the two algorithms proposed in the next subsection. If the lower bound equals or exceeds the loss of the current best solution, the node cannot lead to an optimal solution, so we prune it. Otherwise, we go to Step 2, where we perform beam-search to find a near-optimal solution. In Step 3, we use the solution from Step 2 and propose a branching strategy to create new nodes in the BnB tree. We continue until reaching the optimality gap tolerance. Below, we elaborate on each step. In Appendix E, we provide visual illustrations of BnB and beam search as well as complete pseudocode for our algorithms.

3.1. Lower Bound Calculation

Tight Saddle Point Formulation

We first rewrite Equation (2) with a new hyperparameter λ:

$$\mathcal{L}^{\mathrm{ridge}\text{-}\lambda}(\beta,z) \coloneqq \beta^T Q_\lambda \beta - 2y^T X\beta + (\lambda_2+\lambda)\sum_{j=1}^{p}\beta_j^2, \tag{7}$$

where $Q_\lambda \coloneqq X^TX-\lambda I$. We restrict $\lambda\in\left[0,\lambda_{\min}(X^TX)\right]$, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a matrix. This guarantees that $Q_\lambda$ is positive semidefinite, so the first term remains convex. This trick is related to the optimal perspective formulation [62, 28, 37], but we set the diagonal matrix $\mathrm{diag}(d)$ in [28] to be $\lambda I$. We call this trick the eigen-perspective formulation. The optimal perspective formulation requires solving semidefinite programming (SDP) problems, which have been shown not to scale to high dimensions [28], and MI-SDP is not supported by Gurobi.
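As a quick numerical sanity check of the eigen-perspective construction (our illustration, on made-up data), shifting $\lambda=\lambda_{\min}(X^TX)$ out of the quadratic term leaves $Q_\lambda$ positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))

# Eigen-perspective trick: move lambda * I from the quadratic term into the
# separable sum, with lambda at most the smallest eigenvalue of X^T X.
lam = np.linalg.eigvalsh(X.T @ X).min()
Q_lam = X.T @ X - lam * np.eye(8)

# Q_lam stays positive semidefinite, so beta^T Q_lam beta remains convex.
assert np.linalg.eigvalsh(Q_lam).min() >= -1e-8
```

Any smaller $\lambda \ge 0$ also works; larger $\lambda$ values (up to this limit) yield tighter lower bounds downstream.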

Solving Problem (3) is equivalent to solving the following problem:

$$\min_{\beta,z}\ \mathcal{L}^{\mathrm{ridge}\text{-}\lambda}(\beta,z) \quad \text{s.t.} \quad (1-z_j)\beta_j=0\ \ \forall j,\quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in\{0,1\}. \tag{8}$$

We get a continuous relaxation of Problem (3) if we relax {0, 1} to [0, 1].

We can now define a new loss analogous to the loss defined in Equation (4):

$$\mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}\text{-}\lambda}(\beta,z,c) \coloneqq \beta^T Q_\lambda \beta - 2y^T X\beta + (\lambda_2+\lambda)\sum_{j=1}^{p}\left(\beta_j c_j - \frac{c_j^2}{4} z_j\right). \tag{9}$$

Then, the dual formulation analogous to Problem (6) is:

$$\max_{c}\min_{\beta,z}\ \mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}\text{-}\lambda}(\beta,z,c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1]. \tag{10}$$

Solving Problem (10) provides us with a lower bound to Problem (8). More importantly, this lower bound becomes tighter as λ increases. This novel formulation is the starting point for our work.

We next propose a reparametrization trick to simplify the optimization problem above. For the inner optimization problem in Problem (10), given any c, the optimality condition for β is (take the gradient with respect to β and set the gradient to 0):

$$c=\frac{2}{\lambda_2+\lambda}\left(X^Ty-Q_\lambda\beta\right). \tag{11}$$
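For completeness, the stationarity computation behind Equation (11): differentiating Equation (9) with respect to $\beta$ and setting the gradient to zero gives

```latex
\nabla_{\beta}\,\mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}\text{-}\lambda}(\beta,z,c)
  = 2Q_\lambda\beta - 2X^{T}y + (\lambda_2+\lambda)\,c = 0
\quad\Longrightarrow\quad
c = \frac{2}{\lambda_2+\lambda}\left(X^{T}y - Q_\lambda\beta\right).
```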

Inspired by this optimality condition, we have the following theorem:

Theorem 3.1. If we reparameterize $c=\frac{2}{\lambda_2+\lambda}\left(X^Ty-Q_\lambda\gamma\right)$ with a new parameter $\gamma$, then Problem (10) is equivalent to the following saddle point optimization problem:

$$\max_{\gamma}\min_{z}\ \mathcal{L}^{\mathrm{saddle}}_{\mathrm{ridge}\text{-}\lambda}(\gamma,z) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1], \tag{12}$$

where

$$\mathcal{L}^{\mathrm{saddle}}_{\mathrm{ridge}\text{-}\lambda}(\gamma,z) \coloneqq -\gamma^T Q_\lambda \gamma - \frac{1}{\lambda_2+\lambda}\left(X^Ty-Q_\lambda\gamma\right)^T \mathrm{diag}(z)\left(X^Ty-Q_\lambda\gamma\right), \tag{13}$$

and diag(z) is a diagonal matrix with z on the diagonal.

To our knowledge, this is the first time this formulation has been given. Solving the saddle point formulation in Problem (12) to optimality gives a tight lower bound; however, doing so is computationally hard.

Our insight is that we can solve Problem (12) approximately while still obtaining a feasible lower bound. Let us define a new function h(γ) as short-hand for the inner minimization in Problem (12):

$$h(\gamma)=\min_{z}\ \mathcal{L}^{\mathrm{saddle}}_{\mathrm{ridge}\text{-}\lambda}(\gamma,z) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1]. \tag{14}$$

For any $\gamma\in\mathbb{R}^p$, $h(\gamma)$ is a valid lower bound for Problem (3). We should choose $\gamma$ such that this lower bound $h(\gamma)$ is tight. Below, we provide two efficient methods to calculate such a $\gamma$.

Fast Lower Bound Calculation

First, we provide a fast way to choose γ. The choice of γ is motivated by the following theorem:

Theorem 3.2. The function h(γ) defined in Equation (14) is lower bounded by

$$h(\gamma) \ge -\gamma^T Q_\lambda \gamma - \frac{1}{\lambda_2+\lambda}\left\|X^Ty-Q_\lambda\gamma\right\|_2^2. \tag{15}$$

Furthermore, the right-hand side of Equation (15) is maximized at $\gamma=\hat\gamma\coloneqq\arg\min_{\alpha}\mathcal{L}^{\mathrm{ridge}}(\alpha)$, in which case $h(\gamma)$ evaluated at $\hat\gamma$ becomes

$$h(\hat\gamma)=\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)+(\lambda_2+\lambda)\,\mathrm{SumBottom}_{p-k}\!\left(\left\{\hat\gamma_j^2\right\}\right), \tag{16}$$

where $\mathrm{SumBottom}_{p-k}(\cdot)$ denotes the sum of the smallest $p-k$ terms of a given set.

Here we provide an intuitive explanation of why $h(\hat\gamma)$ is a valid lower bound. Note that the ridge regression loss is strongly convex. Assuming the strong convexity parameter is $\mu$ (see Appendix B), we have, for any $\gamma\in\mathbb{R}^p$,

$$\mathcal{L}^{\mathrm{ridge}}(\gamma)\ge\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)+\nabla\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)^T(\gamma-\hat\gamma)+\frac{\mu}{2}\|\gamma-\hat\gamma\|_2^2. \tag{17}$$

Because $\hat\gamma$ minimizes $\mathcal{L}^{\mathrm{ridge}}(\cdot)$, we have $\nabla\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)=0$. Over all $k$-sparse vectors $\gamma$ with $\|\gamma\|_0\le k$, the right-hand side of Inequality (17) is minimized by setting $\gamma_j=\hat\gamma_j$ on the $k$ largest terms of the $\hat\gamma_j^2$'s, which ensures the bound applies to all $k$-sparse $\gamma$. Thus, the $k$-sparse ridge regression loss is lower bounded by

$$\mathcal{L}^{\mathrm{ridge}}(\gamma)\ge\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)+\frac{\mu}{2}\,\mathrm{SumBottom}_{p-k}\!\left(\left\{\hat\gamma_j^2\right\}\right)$$

for $\gamma\in\mathbb{R}^p$ with $\|\gamma\|_0\le k$. For ridge regression, the strong convexity parameter $\mu$ can be chosen from $\left(0,\ 2\left(\lambda_2+\lambda_{\min}(X^TX)\right)\right]$. If we let $\mu=2(\lambda_2+\lambda)$, we obtain $h(\hat\gamma)$ in Theorem 3.2.

The lower bound $h(\hat\gamma)$ can be calculated extremely efficiently by solving the ridge regression problem (i.e., solving the linear system $(X^TX+\lambda_2 I)\gamma=X^Ty$ for $\gamma$) and adding the extra $p-k$ terms. However, this bound is not the tightest we can achieve. In the next subsection, we discuss how to apply ADMM to maximize $h(\gamma)$ further, based on Equation (14).
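The fast bound can be sketched in a few lines of NumPy (our helper names, not the released code): solve one dense ridge system, then add $(\lambda_2+\lambda)$ times the sum of the $p-k$ smallest squared coefficients, as in Equation (16). Note that, matching Equation (2), the constant $y^Ty$ is dropped from the loss.

```python
import numpy as np

def fast_lower_bound(X, y, k, lam2, lam):
    """Sketch of the closed-form bound h(gamma_hat) from Theorem 3.2."""
    n, p = X.shape
    # Dense ridge regression: solve (X^T X + lam2 I) gamma = X^T y.
    gamma = np.linalg.solve(X.T @ X + lam2 * np.eye(p), X.T @ y)
    # Ridge loss without the constant y^T y, as in Equation (2).
    ridge_loss = np.sum((y - X @ gamma) ** 2) + lam2 * np.sum(gamma ** 2) - y @ y
    # Sum of the p - k smallest squared coefficients.
    sum_bottom = np.sort(gamma ** 2)[: p - k].sum()
    return ridge_loss + (lam2 + lam) * sum_bottom
```

With $\lambda \le \lambda_{\min}(X^TX)$, the returned value should never exceed the loss of any $k$-sparse coefficient vector.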

Tight Lower Bound via ADMM

Let us define $p \coloneqq X^Ty-Q_\lambda\gamma$. Starting from Problem (12), if we minimize over $z$ in the inner optimization under the constraints $\sum_{j=1}^{p} z_j\le k$ and $z_j\in[0,1]$ for all $j$, we have $z_j=1$ for the top $k$ terms of the $p_j^2$'s and $z_j=0$ otherwise. Then, Problem (12) can be reformulated as follows:

$$-\min_{\gamma,p}\ F(\gamma)+G(p) \quad \text{s.t.} \quad Q_\lambda\gamma+p=X^Ty, \tag{18}$$

where $F(\gamma)\coloneqq\gamma^T Q_\lambda\gamma$ and $G(p)\coloneqq\frac{1}{\lambda_2+\lambda}\mathrm{SumTop}_{k}\!\left(\left\{p_j^2\right\}\right)$. The solution to this problem is a dense vector that can be used to provide a lower bound on the original $k$-sparse problem. This problem can be solved by the alternating direction method of multipliers (ADMM) [17]. Here, we apply the iterative algorithm with the scaled dual variable $q$ [33]:

$$\gamma^{t+1}=\arg\min_{\gamma}\ F(\gamma)+\frac{\rho}{2}\left\|Q_\lambda\gamma+p^{t}-X^Ty+q^{t}\right\|_2^2 \tag{19}$$
$$\theta^{t+1}=2\alpha Q_\lambda\gamma^{t+1}-(1-2\alpha)\left(p^{t}-X^Ty\right) \tag{20}$$
$$p^{t+1}=\arg\min_{p}\ G(p)+\frac{\rho}{2}\left\|\theta^{t+1}+p-X^Ty+q^{t}\right\|_2^2 \tag{21}$$
$$q^{t+1}=q^{t}+\theta^{t+1}+p^{t+1}-X^Ty, \tag{22}$$

where α is the relaxation factor, and ρ is the step size.

It is known that ADMM suffers from slow convergence when the step size is not properly chosen. Following [33], to obtain the optimal linear convergence rate bound, we can pick $\alpha=1$ and $\rho=2\left(\sqrt{\lambda_{\max}(Q_\lambda)\,\lambda_{\min>0}(Q_\lambda)}\right)^{-1}$, where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix, and $\lambda_{\min>0}(\cdot)$ denotes the smallest positive eigenvalue of a matrix.

Having settled the choices for the relaxation factor α and the step size ρ, we are left with the task of solving Equation (19) and Equation (21) (also known as evaluating the proximal operators [52]). Interestingly, Equation (19) can be evaluated by solving a linear system while Equation (21) can be evaluated by recasting the problem as an isotonic regression problem.

Theorem 3.3. Let $F(\gamma)=\gamma^TQ_\lambda\gamma$ and $G(p)=\frac{1}{\lambda_2+\lambda}\mathrm{SumTop}_{k}\!\left(\left\{p_j^2\right\}\right)$. Then the solution to the problem $\gamma^{t+1}=\arg\min_{\gamma} F(\gamma)+\frac{\rho}{2}\left\|Q_\lambda\gamma+p^{t}-X^Ty+q^{t}\right\|_2^2$ is

$$\gamma^{t+1}=\left(\frac{2}{\rho}I+Q_\lambda\right)^{-1}\left(X^Ty-p^{t}-q^{t}\right). \tag{23}$$

Furthermore, let $a=X^Ty-\theta^{t+1}-q^{t}$ and $\mathcal{J}$ be the indices of the top $k$ terms of the $|a_j|$'s. The solution to the problem $p^{t+1}=\arg\min_{p} G(p)+\frac{\rho}{2}\left\|\theta^{t+1}+p-X^Ty+q^{t}\right\|_2^2$ is $p_j^{t+1}=\mathrm{sign}(a_j)\,\hat v_j$, where

$$\hat v=\arg\min_{v}\ \sum_{j=1}^{p} w_j\left(v_j-b_j\right)^2 \quad \text{s.t.} \quad v_i\ge v_l \text{ if } |a_i|\ge|a_l|, \qquad w_j=\begin{cases}1+\frac{2}{\rho(\lambda_2+\lambda)} & \text{if } j\in\mathcal{J}\\[2pt] 1 & \text{otherwise,}\end{cases} \qquad b_j=\frac{|a_j|}{w_j}. \tag{24}$$

Problem (24) is an isotonic regression problem and can be efficiently solved in linear time [12, 21].
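The isotonic subproblem can be solved with the classic pool-adjacent-violators algorithm (PAVA). Below is a self-contained weighted sketch (our expository code, not the optimized linear-time implementations of [12, 21]); to apply it to Problem (24), one would first sort the coordinates by $|a_j|$ so the order constraint becomes a single chain.

```python
import numpy as np

def weighted_pava(b, w):
    """Weighted isotonic regression by pool-adjacent-violators (PAVA).

    Returns v minimizing sum_j w[j] * (v[j] - b[j])**2 subject to
    v[0] <= v[1] <= ... <= v[-1].
    """
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for bj, wj in zip(b, w):
        blocks.append([bj, wj, 1])
        # Merge while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    return np.concatenate([np.full(nb, m) for m, _, nb in blocks])
```

For example, `weighted_pava([3, 1, 2], [1, 1, 1])` pools the violating pair `(3, 1)` into their mean, yielding the monotone fit `[2, 2, 2]`.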

3.2. Beam-Search as a Heuristic

After finishing the lower bound calculation in Section 3.1, we next explain how to quickly reduce the upper bound in the BnB tree. We discuss how to add features, keep good solutions, and use dynamic programming to improve efficiency. Lastly, we give a theoretical guarantee on the quality of our solution.

Starting from the vector 0, we add one coordinate at a time into our support until we reach a solution with support size k. At each iteration, we pick the coordinate that results in the largest decrease in the ridge regression loss while keeping coefficients in the existing support fixed:

$$j^{*}\in\arg\min_{j}\min_{\alpha}\ \mathcal{L}^{\mathrm{ridge}}(\beta+\alpha e_j) \quad\Longleftrightarrow\quad j^{*}\in\arg\max_{j}\ \frac{\left(\nabla_j\mathcal{L}^{\mathrm{ridge}}(\beta)\right)^2}{\|X_{:j}\|_2^2+\lambda_2}, \tag{25}$$

where $X_{:j}$ denotes the $j$-th column of $X$, and the right-hand side uses the analytical solution of the line search over $\alpha$. This is similar to the sparse-simplex algorithm [6]. However, after adding a feature, we adjust the coefficients restricted to the new support by minimizing the ridge regression loss.
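As a numerical sanity check (on synthetic data of our choosing), the closed-form score on the right of Equation (25) ranks coordinates identically to an explicit line search; the score differs from the achievable loss decrease only by a constant factor of 4, so the argmax is the same.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam2 = 0.5
beta = rng.standard_normal(5)

def loss(b):
    return np.sum((y - X @ b) ** 2) + lam2 * np.sum(b ** 2)

# Gradient of the ridge loss at beta.
grad = 2 * (X.T @ (X @ beta - y) + lam2 * beta)
# Closed-form per-coordinate score from Equation (25).
scores = grad ** 2 / (np.sum(X ** 2, axis=0) + lam2)

decreases = []
for j in range(5):
    e = np.zeros(5)
    e[j] = 1.0
    # Analytical minimizer of the 1-D quadratic loss(beta + alpha * e_j).
    alpha = -grad[j] / (2 * (X[:, j] @ X[:, j] + lam2))
    decreases.append(loss(beta) - loss(beta + alpha * e))

# The score's argmax matches the coordinate with the largest loss decrease.
assert np.argmax(scores) == int(np.argmax(decreases))
```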

The above idea does not handle highly correlated features well. Once a feature is added, it cannot be removed [61]. To alleviate this problem, we use beam-search [58, 43], keeping the best B solutions at each stage of support expansion:

$$j^{*}\in\operatorname{argBottom-}B_{j}\ \min_{\alpha}\ \mathcal{L}^{\mathrm{ridge}}(\beta+\alpha e_j), \tag{26}$$

where $j^{*}\in\operatorname{argBottom-}B_{j}$ means that $j^{*}$ belongs to the set of candidates whose losses are among the $B$ smallest. Afterwards, we finetune the solution on the newly expanded support and choose the best $B$ solutions for the next stage of support expansion. A visual illustration of beam search can be found in Figure 6 in Appendix E, which also contains the detailed algorithm.

Although many methods have been proposed for sparse ridge regression, none of them have been designed with the BnB tree structure in mind. Our approach is to take advantage of the search history of past nodes to speed up the search process for a current node. To achieve this, we follow a dynamic programming approach by saving the solutions of already explored support sets. Therefore, whenever we need to adjust coefficients on the new support during beam search, we can simply retrieve the coefficients from the history if a support has been explored in the past. Essentially, we trade memory space for computational efficiency.
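The beam search with its dynamic-programming cache can be sketched as follows (a simplification of the description above; the function names are ours, and the real implementation additionally ties the cache into BnB node processing):

```python
import numpy as np

def ridge_fit(X, y, lam2, support):
    """Refit ridge regression restricted to a given support."""
    S = list(support)
    XS = X[:, S]
    bS = np.linalg.solve(XS.T @ XS + lam2 * np.eye(len(S)), XS.T @ y)
    return bS, np.sum((y - XS @ bS) ** 2) + lam2 * np.sum(bS ** 2)

def beam_search(X, y, k, lam2, B=5, cache=None):
    """Expand supports one coordinate at a time, keeping the best B per stage.

    `cache` maps a frozenset support to its refitted (coefs, loss), so
    supports explored at earlier nodes are never refit twice.
    """
    if cache is None:
        cache = {}
    p = X.shape[1]
    beam = [frozenset()]
    for _ in range(k):
        candidates = {s | {j} for s in beam for j in range(p) if j not in s}
        scored = []
        for s in candidates:
            if s not in cache:  # dynamic programming: reuse past solutions
                cache[s] = ridge_fit(X, y, lam2, sorted(s))
            scored.append((cache[s][1], s))
        scored.sort(key=lambda t: t[0])
        beam = [s for _, s in scored[:B]]
    best = beam[0]
    return sorted(best), cache[best][1]
```

Passing the same `cache` dictionary across calls is what trades memory for computation when many related nodes are explored.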

3.2.1. Provable Guarantee

Lastly, using methods similar to [29], we quantify the gap between our heuristic solution $\hat\beta$ and the optimal solution $\beta^{*}$ in Theorem 3.4. Compared with Theorem 5 in [29], we improve the factor in the exponent from $m_{2k}/M_{2k}$ to $m_{2k}/M_{1}$ (since $M_{1}\le M_{2k}$, where $M_{1}$ and $M_{2k}$ are defined in [29]).

Theorem 3.4. Let us define a $k$-sparse vector pair domain $\Omega_k\coloneqq\left\{(x,y)\in\mathbb{R}^p\times\mathbb{R}^p:\|x\|_0\le k,\ \|y\|_0\le k,\ \|x-y\|_0\le k\right\}$. Any $M_1$ satisfying $f(y)\le f(x)+\nabla f(x)^T(y-x)+\frac{M_1}{2}\|y-x\|_2^2$ for all $(x,y)\in\Omega_1$ is called a restricted smoothness parameter with support size 1, and any $m_{2k}$ satisfying $f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac{m_{2k}}{2}\|y-x\|_2^2$ for all $(x,y)\in\Omega_{2k}$ is called a restricted strong convexity parameter with support size $2k$. If $\hat\beta$ is the heuristic solution found by our beam-search method and $\beta^{*}$ is the optimal solution, then:

$$\mathcal{L}^{\mathrm{ridge}}(\beta^{*})\le\mathcal{L}^{\mathrm{ridge}}(\hat\beta)\le\left(1-e^{-m_{2k}/M_{1}}\right)\mathcal{L}^{\mathrm{ridge}}(\beta^{*}). \tag{27}$$

3.3. Branching and Queuing

Branching:

The most common branching techniques include most-infeasible branching and strong branching [2, 1, 15, 7]. However, these two techniques require having fractional values for the binary variables $z_j$, which we do not compute in our framework. Instead, we propose a new branching strategy based on our heuristic solution $\hat\beta$: we branch on the coordinate whose coefficient, if set to 0, would result in the largest increase in the ridge regression loss $\mathcal{L}^{\mathrm{ridge}}$ (see Appendix E for details):

$$j^{*}=\arg\max_{j}\ \mathcal{L}^{\mathrm{ridge}}\!\left(\hat\beta-\hat\beta_j e_j\right). \tag{28}$$

The intuition is that the coordinate with the largest increase in $\mathcal{L}^{\mathrm{ridge}}$ potentially plays a significant role, so we want to fix such a coordinate as early as possible in the BnB tree.
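The branching rule of Equation (28) can be sketched as follows (our hypothetical helper, assuming a dense heuristic solution restricted to its support):

```python
import numpy as np

def branching_coordinate(X, y, lam2, beta_hat):
    """Pick the support coordinate whose removal increases the loss most."""
    def ridge_loss(b):
        return np.sum((y - X @ b) ** 2) + lam2 * np.sum(b ** 2)

    best_j, best_increase = None, -np.inf
    for j in np.nonzero(beta_hat)[0]:
        b = beta_hat.copy()
        b[j] = 0.0  # zero out coordinate j, keeping the rest fixed
        increase = ridge_loss(b) - ridge_loss(beta_hat)
        if increase > best_increase:
            best_j, best_increase = int(j), increase
    return best_j
```

A coordinate carrying most of the fit (a large loss increase when removed) is fixed first, so the BnB tree commits early to the decisions that matter most.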

Queuing:

Besides the branching strategy, we need a queue to pick a node to explore among newly created nodes. Here, we use a breadth-first approach, evaluating nodes in the order they are created.

4. Experiments

We test the effectiveness of OKRidge on synthetic benchmarks and on sparse identification of nonlinear dynamical systems (SINDy) [19]. Our main focuses are assessing how well our proposed lower bound calculation speeds up certification (Section 4.1) and evaluating the solution quality of OKRidge on challenging applications (Section 4.2). Additional extensive experiments are in Appendices G and H. Our algorithms are written in Python; any improvements we see over commercial MIP solvers, which are coded in C/C++, are solely due to our specialized algorithms.

4.1. Assessing How Well Our Proposed Lower Bound Calculation Speeds Up Certification

Here, we demonstrate the speed of OKRidge for certifying optimality compared to existing MIPs solved by Gurobi [35]. We set a 1-hour time limit and an optimality gap of relative tolerance $10^{-4}$. We use $\lambda_2=0.001$. Our 4 baselines are MIPs with the SOS1, big-M ($M=50$, chosen to prevent cutting off optimal solutions), perspective [4], and eigen-perspective ($\lambda=\lambda_{\min}(X^TX)$) [28] formulations. In the main text, we use plots to present the results; in Appendix G, we present the results in tables. Additionally, in Appendix G, we conduct perturbation studies on $\lambda_2$ ($\lambda_2=0.1$ and $\lambda_2=10$) and $M$ ($M=20$ and $M=5$). Finally, also in Appendix G, we compare OKRidge with other MIP approaches, including the MOSEK solver [3], SubsetSelectionCIO [11], and L0BNB [39].

Similar to the data generation process in [11, 48], we first sample $x_i\in\mathbb{R}^p$ from a Gaussian distribution $\mathcal{N}(0,\Sigma)$ with mean 0 and covariance matrix $\Sigma$, where $\Sigma_{ij}=\rho^{|i-j|}$. The variable $\rho$ controls the feature correlation. Then, we create the coefficient vector $\beta^{*}$ with $k$ nonzero entries, where $\beta_j^{*}=1$ if $j \bmod (p/k)=0$. Next, we construct the prediction $y_i=x_i^T\beta^{*}+\epsilon_i$, where $\epsilon_i\overset{\text{i.i.d.}}{\sim}\mathcal{N}\!\left(0,\ \|X\beta^{*}\|_2^2/\mathrm{SNR}\right)$, and SNR stands for the signal-to-noise ratio. We choose $\mathrm{SNR}=5$ in all our experiments.
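The generation recipe above can be sketched in NumPy (our hypothetical helper; the noise variance follows the formula as written, $\|X\beta^{*}\|_2^2/\mathrm{SNR}$):

```python
import numpy as np

def make_benchmark(n, p, k, rho, snr, seed=0):
    """Synthetic data: Sigma_ij = rho**|i-j|, k evenly spaced unit coefficients,
    Gaussian noise scaled by the target signal-to-noise ratio."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:: p // k] = 1.0  # nonzero where j mod (p/k) == 0
    signal = X @ beta
    sigma2 = signal @ signal / snr  # noise variance per the stated formula
    y = signal + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y, beta
```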

In the first setting, we fix the number of samples at $n=100{,}000$ and vary the number of features $p\in\{100, 500, 1000, 3000, 5000\}$ and correlation levels $\rho\in\{0.1, 0.5, 0.9\}$ (see Appendix G for $\rho=0.3$ and $\rho=0.7$). We warm-started the MIP solvers with our beam-search solutions. The results can be seen in Figure 1: OKRidge outperforms all existing MIPs solved by Gurobi, usually by orders of magnitude.

Figure 1: Comparison of running time (top row) and optimality gap (bottom row) between our method and baselines, varying the number of features, for three correlation levels $\rho=0.1, 0.5, 0.9$ ($n=100{,}000$, $k=10$). Time is on the log scale. Our method is generally orders of magnitude faster than other approaches and achieves the smallest optimality gap, especially when the feature correlation $\rho$ is high.

In the second setting, we fix the number of features at $p=3000$ and vary the number of samples $n\in\{3000, 4000, 5000, 6000, 7000\}$ and the correlation levels $\rho\in\{0.1, 0.5, 0.9\}$ (see Appendix G for $\rho=0.3$ and $\rho=0.7$). As in the first setting, we warm-started the MIP solvers with our beam-search solutions. The results are in Figure 2. When $n$ is close to $p$ or the correlation is high ($\rho=0.9$), no method can finish within the 1-hour time limit, but OKRidge prunes the search space well and achieves the smallest optimality gap. When $n$ becomes larger in the cases $\rho=0.1$ and $\rho=0.5$, OKRidge runs orders of magnitude faster than all baselines.

Figure 2: Comparison of running time (top row) and optimality gap (bottom row) between our method and baselines, varying sample sizes, for three correlation levels $\rho=0.1, 0.5, 0.9$ ($p=3000$, $k=10$). Time is on the log scale. When $\rho=0.1$ and $\rho=0.5$, OKRidge is generally orders of magnitude faster than other approaches. In the case $\rho=0.9$, we achieve the smallest optimality gap, as shown in the bottom row.

4.2. Evaluating Solution Quality of OKRidge on Challenging Applications

On previous synthetic benchmarks, many heuristics (including our beam search method) can find the optimal solution without branch-and-bound. In this subsection, we work on more challenging scenarios (sparse identification of differential equations). We replicate the experiments in [9] using three dynamical systems from the PySINDy library [27, 42]: the Lorenz System, the Hopf Bifurcation, and a magnetohydrodynamical (MHD) model [24]. The Lorenz System is a 3-D system with the nonlinear differential equations:

dx/dt = -σx + σy,  dy/dt = ρx - y - xz,  dz/dt = xy - βz,

where we use standard parameters σ=10,β=8/3,ρ=28. The true sparsities for each dimension are (2, 3, 2). The Hopf Bifurcation is a 2-D system with nonlinear differential equations:

dx/dt = μx + ωy - Ax^3 - Axy^2,  dy/dt = -ωx + μy - Ax^2y - Ay^3,

where we use the standard parameters μ=-0.05,ω=1,A=1. The true sparsities for each dimension are (4, 4). Finally, the MHD is a 6-D system with the nonlinear differential equations:

dV1/dt = 4V2V3 - 4B2B3,  dV2/dt = -7V1V3 + 7B1B2,  dV3/dt = 3V1V2 - 3B1B2,
dB1/dt = 2B3V2 - 2V3B2,  dB2/dt = 5V3B1 - 5B3V1,  dB3/dt = 9V1B2 - 9B1V2.

The true sparsities for each dimension are (2, 2, 2, 2, 2, 2).

We use all monomial features (candidate functions) up to 5th-order interactions. This results in 56 candidate functions for the Lorenz System, 21 for the Hopf Bifurcation, and 462 for the MHD model. Due to the high-order interaction terms, the features are highly correlated, resulting in poor performance of heuristic methods.

4.2.1. Baselines and Experimental Setup

In addition to MIOSR (which relies on the SOS1 formulation), we also compare with three common baselines in the SINDy literature: STLSQ [54], SSR [16], and E-STLSQ [31]. The baseline SR3 [25] is not included since previous literature [9] shows it performs poorly. We compare OKRidge with the other baselines using the SINDy library [27, 42]. We follow the experimental setups in [9] for model selection, hyperparameter choices, and evaluation metrics (please see Appendix F for details). In Appendix H, we provide additional experiments on Gurobi with different MIP formulations and comparisons with more heuristic baselines.

4.2.2. Results

Figure 3 displays the results. OKRidge (red curves) outperforms all baselines, including MIOSR (blue curves), across evaluation metrics. On the Lorenz System, all methods recover the true feature support when the training trajectory is long enough; when the training trajectory is short (the left part of each subplot, or equivalently, when the number of samples is small), OKRidge performs uniformly better than all other baselines. On the Hopf Bifurcation, all heuristic methods fail to recover the true support, resulting in poor performance. On the MHD system, OKRidge maintains the top performance and outperforms MIOSR on the true positivity rate. This demonstrates the effectiveness of OKRidge, which incurs lower runtimes and yields better metric scores in high-dimensional settings. The highest runtimes are incurred on the MHD system (with 462 candidate functions/features), as shown in Figure 4.

Figure 3: Results on discovering sparse differential equations. On various metrics, OKRidge outperforms all other methods, including MIOSR, which uses a commercial (proprietary) MIP solver.

Figure 4: Running time comparison between OKRidge and MIOSR on the MHD system with 462 candidate functions. OKRidge is significantly faster than the previous state of the art.

Limitations of OKRidge

When the feature dimension is low (on the order of 100 or fewer), Gurobi can solve the problem to optimality faster than OKRidge. This is observed on the synthetic benchmarks ($p=100$) and also on the Hopf Bifurcation ($p=21$). Since Gurobi is a commercial proprietary solver, we cannot inspect the details of its sophisticated implementation; it may resort to an enumeration/brute-force approach, which could be faster than spending time calculating lower bounds in the BnB tree. That said, OKRidge remains competitive with Gurobi in the low-dimensional setting, and OKRidge scales favorably in high-dimensional settings.

5. Conclusion

We presented a method for optimal sparse ridge regression that leverages a novel tight lower bound on the objective. We showed that the method is both faster and more accurate than existing approaches for learning differential equations, a key problem in scientific discovery. This tool (unlike its main competitor) does not require proprietary software with expensive licenses and can have a significant impact on various regression applications.

Supplementary Material


Acknowledgements

The authors gratefully acknowledge funding support from grants NSF IIS-2130250, NSF-NRT DGE-2022040, NSF OAC-1835782, DOE DE-SC0023194, and NIH/NIDA R01 DA054994. The authors would also like to thank the anonymous reviewers for their insightful comments.

Footnotes

Code Availability

Implementations of OKRidge discussed in this paper are available at https://github.com/jiachangliu/OKRidge.

1. [33] also considers matrix preconditioning when computing the step size, but this is computationally expensive when the number of features is large, so we ignore matrix rescaling by letting E be the identity matrix in Section VI, Subsection A of [33].

References

  • [1].Achterberg T, Koch T, and Martin A. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005. [Google Scholar]
  • [2].Applegate D, Bixby R, Chvátal V, and Cook W. On the solution of traveling salesman problems. Documenta Mathematica, pages 645–656, 1998. [Google Scholar]
  • [3].ApS M. Mosek optimizer API for python. Version, 9(17):6–4, 2022. [Google Scholar]
  • [4].Atamturk A and Gómez A. Safe screening rules for 10-regression from perspective relaxations. In International Conference on Machine Learning, pages 421–430. PMLR, 2020. [Google Scholar]
  • [5].Atamtürk A, Gómez A, and Han S. Sparse and smooth signal estimation: Convexification of 10-formulations. Journal of Machine Learning Research, 22:52–1, 2021. [Google Scholar]
  • [6].Beck A and Eldar YC. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization, 23(3):1480–1509, 2013. [Google Scholar]
  • [7].Belotti P, Kirches C, Leyffer S, Linderoth J, Luedtke J, and Mahajan A. Mixed-integer nonlinear optimization. Acta Numerica, 22:1–131, 2013. [Google Scholar]
  • [8].Bertsekas D. Convex optimization theory, volume 1. Athena Scientific, 2009. [Google Scholar]
  • [9].Bertsimas D and Gurnee W. Learning sparse nonlinear dynamics via mixed-integer optimization. Nonlinear Dynamics, Jan 2023. [Google Scholar]
  • [10].Bertsimas D, King A, and Mazumder R. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016. [Google Scholar]
  • [11].Bertsimas D, Pauphilet J, and Van Parys B. Sparse regression: Scalable algorithms and empirical performance. Statistical Science, 35(4):555–578, 2020. [Google Scholar]
  • [12].Best MJ and Chakravarti N. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47(1–3):425–439, 1990. [Google Scholar]
  • [13].Blumensath T and Davies ME. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370–2382, 2008. [Google Scholar]
  • [14].Blumensath T and Davies ME. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009. [Google Scholar]
  • [15].Bonami P, Lee J, Leyffer S, and Wächter A. More branch-and-bound experiments in convex nonlinear integer programming. Preprint ANL/MCS-P1949–0911, Argonne National Laboratory, Mathematics and Computer Science Division, 91, 2011. [Google Scholar]
  • [16].Boninsegna L, Nüske F, and Clementi C. Sparse learning of stochastic dynamical equations. The Journal of Chemical Physics, 148(24):241723, June 2018. [DOI] [PubMed] [Google Scholar]
  • [17].Boyd S, Parikh N, Chu E, Peleato B, Eckstein J, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011. [Google Scholar]
  • [18].Boyd S, Xiao L, and Mutapcic A. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2004:2004–2005, 2003. [Google Scholar]
  • [19].Brunton SL, Proctor JL, and Kutz JN. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Bubeck S et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015. [Google Scholar]
  • [21].Busing FM. Monotone regression: A simple and fast o(n) PAVA implementation. Journal of Statistical Software, 102:1–25, 2022. [Google Scholar]
  • [22].Cai TT and Wang L. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688, 2011. [Google Scholar]
  • [23].Camerini PM, Fratta L, and Maffioli F. On improving relaxation methods by modified gradient techniques. In Nondifferentiable Optimization, pages 26–34. Springer; Berlin Heidelberg, 1975. [Google Scholar]
  • [24].Carbone V and Veltri P. Relaxation processes in magnetohydrodynamics - A triad-interaction model. Astronomy and Astrophysics, 259(1):359–372, June 1992. [Google Scholar]
  • [25].Champion K, Zheng P, Aravkin AY, Brunton SL, and Kutz JN. A unified sparse optimization framework to learn parsimonious physics-informed models from data. IEEE Access, 8:169259–169271, 2020. [Google Scholar]
  • [26].Combettes PL. Perspective functions: Properties, constructions, and examples. Set-Valued and Variational Analysis, 26(2):247–264, 2018 [Google Scholar]
  • [27].de Silva B, Champion K, Quade M, Loiseau J-C, Kutz J, and Brunton S. Pysindy: A python package for the sparse identification of nonlinear dynamical systems from data. Journal of Open Source Software, 5(49):2104, 2020. [Google Scholar]
  • [28].Dong H, Chen K, and Linderoth J. Regularization vs. relaxation: A conic optimization perspective of statistical variable selection. arXiv preprint arXiv:1510.06083, 2015. [Google Scholar]
  • [29].Elenberg ER, Khanna R, Dimakis AG, and Negahban S. Restricted strong convexity implies weak submodularity. The Annals of Statistics, 46(6B):3539–3568, 2018. [Google Scholar]
  • [30].Eriksson A, Thanh Pham T, Chin T-J, and Reid I. The k-support norm and convex envelopes of cardinality and rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3349–3357, 2015. [Google Scholar]
  • [31].Fasel U, Kutz JN, Brunton BW, and Brunton SL. Ensemble-SINDy: Robust sparse model discovery in the low-data, high-noise limit, with active learning and control. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 478(2260), Apr. 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Frangioni A and Gentile C. Perspective cuts for a class of convex 0–1 mixed integer programs. Mathematical Programming, 106(2):225–236, 2006. [Google Scholar]
  • [33].Giselsson P and Boyd S. Linear convergence and metric selection for douglas-rachford splitting and admm. IEEE Transactions on Automatic Control, 62(2):532–544, 2016. [Google Scholar]
  • [34].Günlük O and Linderoth J. Perspective reformulations of mixed integer nonlinear programs with indicator variables. Mathematical Programming, 124(1):183–205, 2010. [Google Scholar]
  • [35].Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
  • [36].Haemers WH. Interlacing eigenvalues and graphs. Linear Algebra and its Applications, 226:593–616, 1995. [Google Scholar]
  • [37].Han S, Gómez A, and Atamtürk A. The equivalence of optimal perspective formulation and Shor’s SDP for quadratic programs with indicator variables. Operations Research Letters, 50(2):195–198, 2022. [Google Scholar]
  • [38].Hazimeh H and Mazumder R. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5):1517–1537, 2020. [Google Scholar]
  • [39].Hazimeh H, Mazumder R, and Saab A. Sparse regression at scale: Branch-and-bound rooted in first-order optimization. Mathematical Programming, 196(1):347–388, 2022. [Google Scholar]
  • [40].Jain P, Tewari A, and Kar P. On iterative hard thresholding methods for high-dimensional m-estimation. Advances in Neural Information Processing Systems, 27, 2014. [Google Scholar]
  • [41].Kaheman K, Kutz JN, and Brunton SL. SINDy-PI: a robust algorithm for parallel implicit sparse identification of nonlinear dynamics. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 476(2242), Oct. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Kaptanoglu AA, de Silva BM, Fasel U, Kaheman K, Goldschmidt AJ, Callaham J, Delahunt CB, Nicolaou ZG, Champion K, Loiseau J-C, Kutz JN, and Brunton SL. Pysindy: A comprehensive python package for robust sparse system identification. Journal of Open Source Software, 7(69):3994, 2022. [Google Scholar]
  • [43].Liu J, Zhong C, Li B, Seltzer M, and Rudin C. FasterRisk: Fast and accurate interpretable risk scores. In Advances in Neural Information Processing Systems, 2022. [Google Scholar]
  • [44].Liu J, Zhong C, Seltzer M, and Rudin C. Fast sparse classification for generalized linear and additive models. In Proceedings of Artificial Intelligence and Statistics (AISTATS), 2022. [PMC free article] [PubMed] [Google Scholar]
  • [45].Mangan NM, Kutz JN, Brunton SL, and Proctor JL. Model selection for dynamical systems via sparse regression and information criteria. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2204):20170009, Aug. 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Messenger DA and Bortz DM. Weak sindy for partial differential equations. Journal of Computational Physics, 443:110525, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [47].Messenger DA and Bortz DM. Weak sindy: Galerkin-based data-driven model selection. Multiscale Modeling & Simulation, 19(3):1474–1497, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Moreau T, Massias M, Gramfort A, Ablin P, Bannier P-A, Charlier B, Dagréou M, Dupré la Tour T, Durif G, Dantas CF, Klopfenstein Q, Larsson J, Lai E, Lefort T, Malézieux B, Moufad B, Nguyen BT, Rakotomamonjy A, Ramzi Z, Salmon J, and Vaiter S. Benchopt: Reproducible, efficient and collaborative optimization benchmarks. In Advances in Neural Information Processing Systems, 2022. [Google Scholar]
  • [49].Natarajan BK. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995. [Google Scholar]
  • [50].Needell D and Tropp JA. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009. [Google Scholar]
  • [51].Needell D and Vershynin R. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, 4(2):310–316, 2010. [Google Scholar]
  • [52].Parikh N, Boyd S, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014. [Google Scholar]
  • [53].Pilanci M, Wainwright MJ, and El Ghaoui L. Sparse learning via boolean relaxations. Mathematical Programming, 151(1):63–87, 2015. [Google Scholar]
  • [54].Rudy SH, Brunton SL, Proctor JL, and Kutz JN. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [55].Sander ME, Puigcerver J, Djolonga J, Peyré G, and Blondel M. Fast, differentiable and sparse top-k: a convex analysis perspective. In International Conference on Machine Learning, pages 29919–29936. PMLR, 2023. [Google Scholar]
  • [56].Tropp J. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004. [Google Scholar]
  • [57].Vreugdenhil R, Nguyen VA, Eftekhari A, and Esfahani PM. Principal component hierarchy for sparse quadratic programs. In International Conference on Machine Learning, pages 10607–10616. PMLR, 2021. [Google Scholar]
  • [58].Wiseman S and Rush AM. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas, Nov. 2016. Association for Computational Linguistics. [Google Scholar]
  • [59].Xie W and Deng X. Scalable algorithms for the sparse ridge regression. SIAM Journal on Optimization, 30(4):3359–3386, 2020. [Google Scholar]
  • [60].Yuan G, Shen L, and Zheng W-S. A block decomposition algorithm for sparse optimization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 275–285, 2020. [Google Scholar]
  • [61].Zhang T. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011. [Google Scholar]
  • [62].Zheng X, Sun X, and Li D. Improving the performance of MIQP solvers for quadratic programs with cardinality and minimum threshold constraints: A semidefinite program approach. INFORMS Journal on Computing, 26(4):690–703, 2014. [Google Scholar]
  • [63].Zhu J, Wen C, Zhu J, Zhang H, and Wang X. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52):33117–33123, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
