Author manuscript; available in PMC: 2024 Mar 19.
Published in final edited form as: Adv Neural Inf Process Syst. 2023 Dec;36:41076–41258.

OKRidge: Scalable Optimal k-Sparse Ridge Regression

Jiachang Liu 1, Sam Rosen 1, Chudi Zhong 1, Cynthia Rudin 1
PMCID: PMC10950455  NIHMSID: NIHMS1976784  PMID: 38505104

Abstract

We consider an important problem in scientific discovery, namely identifying sparse governing equations for nonlinear dynamical systems. This involves solving sparse ridge regression problems to provable optimality in order to determine which terms drive the underlying dynamics. We propose a fast algorithm, OKRidge, for sparse ridge regression, using a novel lower bound calculation involving, first, a saddle point formulation, and from there, either solving (i) a linear system or (ii) using an ADMM-based approach, where the proximal operators can be efficiently evaluated by solving another linear system and an isotonic regression problem. We also propose a method to warm-start our solver, which leverages a beam search. Experimentally, our methods attain provable optimality with run times that are orders of magnitude faster than those of the existing MIP formulations solved by the commercial solver Gurobi.

1. Introduction

We are interested in identifying sparse and interpretable governing differential equations arising from nonlinear dynamical systems. These are scientific machine learning problems whose solution involves sparse linear regression. Specifically, these problems require the exact solution of sparse regression problems, with the most basic being sparse ridge regression:

$$\min_{\beta}\ \|y-X\beta\|_2^2+\lambda_2\|\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0\le k, \tag{1}$$

where k specifies the number of nonzero coefficients for the model. This formulation is general, but in the case of nonlinear dynamical systems, the outcome y is a derivative (usually time or spatial) of each dimension x. Here, we assume that the practitioner has included the true variables, along with many other possibilities, and is looking to determine which terms (which transformations of the variables) are real and which are not. This problem is NP-hard [49], and is more challenging in the presence of highly correlated features. Selection of correct features is vital in this context, as many solutions may give good results on training data, but will quickly deviate from the true dynamics when extrapolating past the observed data due to the chaotic nature of complex dynamical systems.
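To make Problem (1) concrete, here is a brute-force reference solver (our illustrative sketch, not the paper's algorithm): it enumerates every size-k support and refits ridge regression on each. This is only feasible for tiny p, but it is useful as a ground truth when checking faster solvers.

```python
import itertools
import numpy as np

def ksparse_ridge_bruteforce(X, y, k, lam2):
    """Exhaustively solve min ||y - X b||^2 + lam2 ||b||^2 s.t. ||b||_0 <= k.

    Only feasible for tiny p; shown here to make Equation (1) concrete.
    """
    n, p = X.shape
    best_loss, best_beta = np.inf, np.zeros(p)
    for support in itertools.combinations(range(p), k):
        S = list(support)
        XS = X[:, S]
        # Ridge solution restricted to the support S.
        bS = np.linalg.solve(XS.T @ XS + lam2 * np.eye(k), XS.T @ y)
        loss = np.sum((y - XS @ bS) ** 2) + lam2 * np.sum(bS ** 2)
        if loss < best_loss:
            best_loss = loss
            best_beta = np.zeros(p)
            best_beta[S] = bS
    return best_beta, best_loss
```

Since the loop visits all $\binom{p}{k}$ supports, the returned loss is the exact optimum of Problem (1), which is what a certifiable solver must match.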

Both heuristic and optimal algorithms have been proposed to solve these problems. Heuristic methods include greedy sequential adding of features [25, 16, 54, 22] or ensemble [31] methods. These methods are fast, but often get stuck in local minima, and there is no way to assess solution quality due to the lack of a lower bound on performance. Optimal methods provide an alternative, but are slow since they must prove optimality. MIOSR [9], a mixed-integer programming (MIP) approach, has been able to certify optimality of solutions given enough time. Slow solvers cause difficulty in performing cross-validation on $\lambda_2$ (the $\ell_2$ regularization coefficient) and $k$ (the sparsity level).

We aim to solve sparse ridge regression to certifiable optimality, but in a fraction of the run time. We present a fast branch-and-bound (BnB) formulation, OKRidge. A crucial challenge is obtaining a tight and feasible lower bound for each node in the BnB tree. It is possible to calculate the lower bound via the SOS1 [9], big-M [10], or the perspective formulations (also known as the rotated second-order cone constraints) [32, 4, 59]; the mixed-integer problems can then be solved by a MIP solver. However, these formulations do not consider the special mathematical structure of the regression problem. To calculate a lower bound more efficiently, we first propose a new saddle point formulation for the relaxed sparse ridge regression problem. Based on the new saddle point formulation, we propose two novel methods to calculate the lower bound. The first method is extremely efficient and relies on solving only a linear system of equations. The second method is based on ADMM and can tighten the lower bound given by the first method. Together, these methods give us a tight lower bound, used to prune nodes and provide a small optimality gap. Additionally, we propose a method based on beam-search [58] to get a near-optimal solution quickly, which can be a starting point for both our algorithm and other MIP formulations. Unlike previous methods, our method uses a dynamic programming approach so that previous solutions in the BnB tree can be used while exploring the current node, giving a massive speedup. In summary, our contributions are:

  1. We develop a highly efficient customized branch-and-bound framework for achieving optimality in k-sparse ridge regression, using a novel lower bound calculation and heuristic search.

  2. To compute the lower bound, we introduce a new saddle point formulation, from which we derive two efficient methods (one based on solving a linear system and the other on ADMM).

  3. Our warm-start method is based on beam-search and implemented in a dynamic programming fashion, avoiding redundant calculations. We prove that our warm-start method is an approximation algorithm with an exponential factor tighter than previous work.

On benchmarks, OKRidge certifies optimality orders of magnitude faster than the commercial solver Gurobi. For dynamical systems, our method outperforms the state-of-the-art certifiable method by finding superior solutions, particularly in high-dimensional feature spaces.

2. Preliminary: Dual Formulation via the Perspective Function

There is an extensive literature on this topic, and a longer review of related work is in Appendix A. If we ignore the constant term $y^Ty$, we can rewrite the loss objective in Equation (1) as:

$$\mathcal{L}^{\mathrm{ridge}}(\beta) \coloneqq \beta^T X^T X \beta - 2y^T X\beta + \lambda_2 \sum_{j=1}^{p} \beta_j^2, \tag{2}$$

with p as the number of features. We are interested in the following optimization problem:

$$\min_{\beta}\ \mathcal{L}^{\mathrm{ridge}}(\beta) \quad \text{s.t.} \quad (1-z_j)\beta_j=0\ \ \forall j,\quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in\{0,1\}, \tag{3}$$

where $k$ is the number of nonzero coefficients. With the sparsity constraint, the problem is NP-hard. The constraint $(1-z_j)\beta_j=0$ in Problem (3) can be reformulated with the SOS1, big-M, or the perspective formulation (with quadratic cone constraints), which can then be solved by a MIP solver. Since commercial solvers do not exploit the special structure of the problem, we develop a customized branch-and-bound framework.

For any function $f(a)$, the perspective function is $g(a,b) \coloneqq b\,f(a/b)$ on the domain $b>0$ [32, 34, 26] and $g(a,b)=0$ otherwise. Applying this to $f(a)=a^2$, we obtain the function $g(a,b)=a^2/b$. As shown by [4], replacing the loss term $\beta_j^2$ and the constraint $(1-z_j)\beta_j=0$ with the perspective term $\beta_j^2/z_j$ in Problem (3) does not change the optimal solution. By the Fenchel conjugate [4], $g(\cdot,\cdot)$ can be rewritten as $g(a,b)=\max_c\left(ac-\frac{c^2}{4}b\right)$. If we define a new perspective loss as:

$$\mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}}(\beta,z,c) \coloneqq \beta^T X^T X \beta - 2y^T X\beta + \lambda_2 \sum_{j=1}^{p}\left(\beta_j c_j - \frac{c_j^2}{4} z_j\right), \tag{4}$$

then we can reformulate Problem (3) as:

$$\min_{\beta,z}\max_{c}\ \mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}}(\beta,z,c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in\{0,1\}. \tag{5}$$

If we relax the binary constraint $\{0,1\}$ to the interval $[0,1]$ and swap the max and min (there is no duality gap, as pointed out by [4]), we obtain the dual formulation for the convex relaxation of Problem (5):

$$\max_{c}\min_{\beta,z}\ \mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}}(\beta,z,c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1]. \tag{6}$$

While [4] uses the perspective formulation for safe feature screening, we use it to calculate a lower bound for Problem (3). However, directly solving the max-min problem is computationally challenging. In Section 3.1, we propose two efficient methods to do so.

3. Methodology

We propose a custom BnB framework to solve Problem (3). We process each node in the BnB tree in three steps. First, we calculate a lower bound for the node, using the two algorithms proposed in the next subsection. If the lower bound equals or exceeds the loss of the current best solution, the node cannot lead to an optimal solution, so we prune it. Otherwise, we go to Step 2, where we perform beam-search to find a near-optimal solution. In Step 3, we use the solution from Step 2 and propose a branching strategy to create new nodes in the BnB tree. We continue until reaching the optimality gap tolerance. Below, we elaborate on each step. In Appendix E, we provide visual illustrations of BnB and beam search as well as complete pseudocode for our algorithms.

3.1. Lower Bound Calculation

Tight Saddle Point Formulation

We first rewrite Equation (2) with a new hyperparameter λ:

$$\mathcal{L}^{\mathrm{ridge}\text{-}\lambda}(\beta,z) \coloneqq \beta^T Q_\lambda \beta - 2y^T X\beta + (\lambda_2+\lambda)\sum_{j=1}^{p}\beta_j^2, \tag{7}$$

where $Q_\lambda \coloneqq X^TX-\lambda I$. We restrict $\lambda\in\left[0,\lambda_{\min}(X^TX)\right]$, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a matrix. This guarantees that $Q_\lambda$ is positive semidefinite, so the first term remains convex. This trick is related to the optimal perspective formulation [62, 28, 37], but we set the diagonal matrix $\mathrm{diag}(d)$ in [28] to be $\lambda I$. We call this trick the eigen-perspective formulation. The optimal perspective formulation requires solving semidefinite programming (SDP) problems, which have been shown not to scale to high dimensions [28], and MI-SDP is not supported by Gurobi.
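As a quick numerical sanity check of the eigen-perspective construction (our illustration, on made-up data), shifting $\lambda=\lambda_{\min}(X^TX)$ out of the quadratic term leaves $Q_\lambda$ positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))

# Eigen-perspective trick: move lambda * I from the quadratic term into the
# separable sum, with lambda at most the smallest eigenvalue of X^T X.
lam = np.linalg.eigvalsh(X.T @ X).min()
Q_lam = X.T @ X - lam * np.eye(8)

# Q_lam stays positive semidefinite, so beta^T Q_lam beta remains convex.
assert np.linalg.eigvalsh(Q_lam).min() >= -1e-8
```

Any smaller $\lambda \ge 0$ also works; larger $\lambda$ values (up to this limit) yield tighter lower bounds downstream.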

Solving Problem (3) is equivalent to solving the following problem:

$$\min_{\beta,z}\ \mathcal{L}^{\mathrm{ridge}\text{-}\lambda}(\beta,z) \quad \text{s.t.} \quad (1-z_j)\beta_j=0\ \ \forall j,\quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in\{0,1\}. \tag{8}$$

We get a continuous relaxation of Problem (3) if we relax {0, 1} to [0, 1].

We can now define a new loss analogous to the loss defined in Equation (4):

$$\mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}\text{-}\lambda}(\beta,z,c) \coloneqq \beta^T Q_\lambda \beta - 2y^T X\beta + (\lambda_2+\lambda)\sum_{j=1}^{p}\left(\beta_j c_j - \frac{c_j^2}{4} z_j\right). \tag{9}$$

Then, the dual formulation analogous to Problem (6) is:

$$\max_{c}\min_{\beta,z}\ \mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}\text{-}\lambda}(\beta,z,c) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1]. \tag{10}$$

Solving Problem (10) provides us with a lower bound to Problem (8). More importantly, this lower bound becomes tighter as λ increases. This novel formulation is the starting point for our work.

We next propose a reparametrization trick to simplify the optimization problem above. For the inner optimization problem in Problem (10), given any c, the optimality condition for β is (take the gradient with respect to β and set the gradient to 0):

$$c=\frac{2}{\lambda_2+\lambda}\left(X^Ty-Q_\lambda\beta\right). \tag{11}$$
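For completeness, the stationarity computation behind Equation (11): differentiating Equation (9) with respect to $\beta$ and setting the gradient to zero gives

```latex
\nabla_{\beta}\,\mathcal{L}^{\mathrm{Fenchel}}_{\mathrm{ridge}\text{-}\lambda}(\beta,z,c)
  = 2Q_\lambda\beta - 2X^{T}y + (\lambda_2+\lambda)\,c = 0
\quad\Longrightarrow\quad
c = \frac{2}{\lambda_2+\lambda}\left(X^{T}y - Q_\lambda\beta\right).
```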

Inspired by this optimality condition, we have the following theorem:

Theorem 3.1. If we reparameterize $c=\frac{2}{\lambda_2+\lambda}\left(X^Ty-Q_\lambda\gamma\right)$ with a new parameter $\gamma$, then Problem (10) is equivalent to the following saddle point optimization problem:

$$\max_{\gamma}\min_{z}\ \mathcal{L}^{\mathrm{saddle}}_{\mathrm{ridge}\text{-}\lambda}(\gamma,z) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1], \tag{12}$$

where

$$\mathcal{L}^{\mathrm{saddle}}_{\mathrm{ridge}\text{-}\lambda}(\gamma,z) \coloneqq -\gamma^T Q_\lambda \gamma - \frac{1}{\lambda_2+\lambda}\left(X^Ty-Q_\lambda\gamma\right)^T \mathrm{diag}(z)\left(X^Ty-Q_\lambda\gamma\right), \tag{13}$$

and diag(z) is a diagonal matrix with z on the diagonal.

To our knowledge, this is the first time this formulation has been given. Solving the saddle point formulation in Problem (12) to optimality gives a tight lower bound; however, doing so is computationally hard.

Our insight is that we can solve Problem (12) approximately while still obtaining a feasible lower bound. Let us define a new function h(γ) as short-hand for the inner minimization in Problem (12):

$$h(\gamma)=\min_{z}\ \mathcal{L}^{\mathrm{saddle}}_{\mathrm{ridge}\text{-}\lambda}(\gamma,z) \quad \text{s.t.} \quad \sum_{j=1}^{p} z_j \le k,\quad z_j\in[0,1]. \tag{14}$$

For any $\gamma\in\mathbb{R}^p$, $h(\gamma)$ is a valid lower bound for Problem (3). We should choose $\gamma$ such that this lower bound $h(\gamma)$ is tight. Below, we provide two efficient methods to calculate such a $\gamma$.

Fast Lower Bound Calculation

First, we provide a fast way to choose γ. The choice of γ is motivated by the following theorem:

Theorem 3.2. The function h(γ) defined in Equation (14) is lower bounded by

$$h(\gamma) \ge -\gamma^T Q_\lambda \gamma - \frac{1}{\lambda_2+\lambda}\left\|X^Ty-Q_\lambda\gamma\right\|_2^2. \tag{15}$$

Furthermore, the right-hand side of Equation (15) is maximized at $\gamma=\hat\gamma\coloneqq\arg\min_{\alpha}\mathcal{L}^{\mathrm{ridge}}(\alpha)$, in which case $h(\gamma)$ evaluated at $\hat\gamma$ becomes

$$h(\hat\gamma)=\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)+(\lambda_2+\lambda)\,\mathrm{SumBottom}_{p-k}\!\left(\left\{\hat\gamma_j^2\right\}\right), \tag{16}$$

where $\mathrm{SumBottom}_{p-k}(\cdot)$ denotes the sum of the smallest $p-k$ terms of a given set.

Here we provide an intuitive explanation of why $h(\hat\gamma)$ is a valid lower bound. Note that the ridge regression loss is strongly convex. Assuming the strong convexity parameter is $\mu$ (see Appendix B), we have, for any $\gamma\in\mathbb{R}^p$,

$$\mathcal{L}^{\mathrm{ridge}}(\gamma)\ge\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)+\nabla\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)^T(\gamma-\hat\gamma)+\frac{\mu}{2}\|\gamma-\hat\gamma\|_2^2. \tag{17}$$

Because $\hat\gamma$ minimizes $\mathcal{L}^{\mathrm{ridge}}(\cdot)$, we have $\nabla\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)=0$. Over all $k$-sparse vectors $\gamma$ with $\|\gamma\|_0\le k$, the right-hand side of Inequality (17) is minimized by setting $\gamma_j=\hat\gamma_j$ on the $k$ largest terms of the $\hat\gamma_j^2$'s, which ensures the bound applies to all $k$-sparse $\gamma$. Thus, the $k$-sparse ridge regression loss is lower bounded by

$$\mathcal{L}^{\mathrm{ridge}}(\gamma)\ge\mathcal{L}^{\mathrm{ridge}}(\hat\gamma)+\frac{\mu}{2}\,\mathrm{SumBottom}_{p-k}\!\left(\left\{\hat\gamma_j^2\right\}\right)$$

for $\gamma\in\mathbb{R}^p$ with $\|\gamma\|_0\le k$. For ridge regression, the strong convexity parameter $\mu$ can be chosen from $\left(0,\ 2\left(\lambda_2+\lambda_{\min}(X^TX)\right)\right]$. If we let $\mu=2(\lambda_2+\lambda)$, we obtain $h(\hat\gamma)$ in Theorem 3.2.

The lower bound $h(\hat\gamma)$ can be calculated extremely efficiently by solving the ridge regression problem (i.e., solving the linear system $(X^TX+\lambda_2 I)\gamma=X^Ty$ for $\gamma$) and adding the extra $p-k$ terms. However, this bound is not the tightest we can achieve. In the next subsection, we discuss how to apply ADMM to maximize $h(\gamma)$ further, based on Equation (14).
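The fast bound can be sketched in a few lines of NumPy (our helper names, not the released code): solve one dense ridge system, then add $(\lambda_2+\lambda)$ times the sum of the $p-k$ smallest squared coefficients, as in Equation (16). Note that, matching Equation (2), the constant $y^Ty$ is dropped from the loss.

```python
import numpy as np

def fast_lower_bound(X, y, k, lam2, lam):
    """Sketch of the closed-form bound h(gamma_hat) from Theorem 3.2."""
    n, p = X.shape
    # Dense ridge regression: solve (X^T X + lam2 I) gamma = X^T y.
    gamma = np.linalg.solve(X.T @ X + lam2 * np.eye(p), X.T @ y)
    # Ridge loss without the constant y^T y, as in Equation (2).
    ridge_loss = np.sum((y - X @ gamma) ** 2) + lam2 * np.sum(gamma ** 2) - y @ y
    # Sum of the p - k smallest squared coefficients.
    sum_bottom = np.sort(gamma ** 2)[: p - k].sum()
    return ridge_loss + (lam2 + lam) * sum_bottom
```

With $\lambda \le \lambda_{\min}(X^TX)$, the returned value should never exceed the loss of any $k$-sparse coefficient vector.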

Tight Lower Bound via ADMM

Let us define $p \coloneqq X^Ty-Q_\lambda\gamma$. Starting from Problem (12), if we minimize over $z$ in the inner optimization under the constraints $\sum_{j=1}^{p} z_j\le k$ and $z_j\in[0,1]$ for all $j$, we have $z_j=1$ for the top $k$ terms of the $p_j^2$'s and $z_j=0$ otherwise. Then, Problem (12) can be reformulated as follows:

$$-\min_{\gamma,p}\ F(\gamma)+G(p) \quad \text{s.t.} \quad Q_\lambda\gamma+p=X^Ty, \tag{18}$$

where $F(\gamma)\coloneqq\gamma^T Q_\lambda\gamma$ and $G(p)\coloneqq\frac{1}{\lambda_2+\lambda}\mathrm{SumTop}_{k}\!\left(\left\{p_j^2\right\}\right)$. The solution to this problem is a dense vector that can be used to provide a lower bound on the original $k$-sparse problem. This problem can be solved by the alternating direction method of multipliers (ADMM) [17]. Here, we apply the iterative algorithm with the scaled dual variable $q$ [33]:

$$\gamma^{t+1}=\arg\min_{\gamma}\ F(\gamma)+\frac{\rho}{2}\left\|Q_\lambda\gamma+p^{t}-X^Ty+q^{t}\right\|_2^2 \tag{19}$$
$$\theta^{t+1}=2\alpha Q_\lambda\gamma^{t+1}-(1-2\alpha)\left(p^{t}-X^Ty\right) \tag{20}$$
$$p^{t+1}=\arg\min_{p}\ G(p)+\frac{\rho}{2}\left\|\theta^{t+1}+p-X^Ty+q^{t}\right\|_2^2 \tag{21}$$
$$q^{t+1}=q^{t}+\theta^{t+1}+p^{t+1}-X^Ty, \tag{22}$$

where α is the relaxation factor, and ρ is the step size.

It is known that ADMM suffers from slow convergence when the step size is not properly chosen. Following [33], to obtain the optimal linear convergence rate bound, we can pick $\alpha=1$ and $\rho=2\left(\sqrt{\lambda_{\max}(Q_\lambda)\,\lambda_{\min>0}(Q_\lambda)}\right)^{-1}$, where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix, and $\lambda_{\min>0}(\cdot)$ denotes the smallest positive eigenvalue of a matrix.

Having settled the choices for the relaxation factor α and the step size ρ, we are left with the task of solving Equation (19) and Equation (21) (also known as evaluating the proximal operators [52]). Interestingly, Equation (19) can be evaluated by solving a linear system while Equation (21) can be evaluated by recasting the problem as an isotonic regression problem.

Theorem 3.3. Let $F(\gamma)=\gamma^TQ_\lambda\gamma$ and $G(p)=\frac{1}{\lambda_2+\lambda}\mathrm{SumTop}_{k}\!\left(\left\{p_j^2\right\}\right)$. Then the solution to the problem $\gamma^{t+1}=\arg\min_{\gamma} F(\gamma)+\frac{\rho}{2}\left\|Q_\lambda\gamma+p^{t}-X^Ty+q^{t}\right\|_2^2$ is

$$\gamma^{t+1}=\left(\frac{2}{\rho}I+Q_\lambda\right)^{-1}\left(X^Ty-p^{t}-q^{t}\right). \tag{23}$$

Furthermore, let $a=X^Ty-\theta^{t+1}-q^{t}$ and $\mathcal{J}$ be the indices of the top $k$ terms of the $|a_j|$'s. The solution to the problem $p^{t+1}=\arg\min_{p} G(p)+\frac{\rho}{2}\left\|\theta^{t+1}+p-X^Ty+q^{t}\right\|_2^2$ is $p_j^{t+1}=\mathrm{sign}(a_j)\,\hat v_j$, where

$$\hat v=\arg\min_{v}\ \sum_{j=1}^{p} w_j\left(v_j-b_j\right)^2 \quad \text{s.t.} \quad v_i\ge v_l \text{ if } |a_i|\ge|a_l|, \qquad w_j=\begin{cases}1+\frac{2}{\rho(\lambda_2+\lambda)} & \text{if } j\in\mathcal{J}\\[2pt] 1 & \text{otherwise,}\end{cases} \qquad b_j=\frac{|a_j|}{w_j}. \tag{24}$$

Problem (24) is an isotonic regression problem and can be efficiently solved in linear time [12, 21].
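The isotonic subproblem can be solved with the classic pool-adjacent-violators algorithm (PAVA). Below is a self-contained weighted sketch (our expository code, not the optimized linear-time implementations of [12, 21]); to apply it to Problem (24), one would first sort the coordinates by $|a_j|$ so the order constraint becomes a single chain.

```python
import numpy as np

def weighted_pava(b, w):
    """Weighted isotonic regression by pool-adjacent-violators (PAVA).

    Returns v minimizing sum_j w[j] * (v[j] - b[j])**2 subject to
    v[0] <= v[1] <= ... <= v[-1].
    """
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for bj, wj in zip(b, w):
        blocks.append([bj, wj, 1])
        # Merge while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    return np.concatenate([np.full(nb, m) for m, _, nb in blocks])
```

For example, `weighted_pava([3, 1, 2], [1, 1, 1])` pools the violating pair `(3, 1)` into their mean, yielding the monotone fit `[2, 2, 2]`.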

3.2. Beam-Search as a Heuristic

After finishing the lower bound calculation in Section 3.1, we next explain how to quickly reduce the upper bound in the BnB tree. We discuss how to add features, keep good solutions, and use dynamic programming to improve efficiency. Lastly, we give a theoretical guarantee on the quality of our solution.

Starting from the vector 0, we add one coordinate at a time into our support until we reach a solution with support size k. At each iteration, we pick the coordinate that results in the largest decrease in the ridge regression loss while keeping coefficients in the existing support fixed:

$$j^{*}\in\arg\min_{j}\min_{\alpha}\ \mathcal{L}^{\mathrm{ridge}}(\beta+\alpha e_j) \quad\Longleftrightarrow\quad j^{*}\in\arg\max_{j}\ \frac{\left(\nabla_j\mathcal{L}^{\mathrm{ridge}}(\beta)\right)^2}{\|X_{:j}\|_2^2+\lambda_2}, \tag{25}$$

where $X_{:j}$ denotes the $j$-th column of $X$, and the right-hand side uses the analytical solution of the line search over $\alpha$. This is similar to the sparse-simplex algorithm [6]. However, after adding a feature, we adjust the coefficients restricted to the new support by minimizing the ridge regression loss.
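As a numerical sanity check (on synthetic data of our choosing), the closed-form score on the right of Equation (25) ranks coordinates identically to an explicit line search; the score differs from the achievable loss decrease only by a constant factor of 4, so the argmax is the same.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam2 = 0.5
beta = rng.standard_normal(5)

def loss(b):
    return np.sum((y - X @ b) ** 2) + lam2 * np.sum(b ** 2)

# Gradient of the ridge loss at beta.
grad = 2 * (X.T @ (X @ beta - y) + lam2 * beta)
# Closed-form per-coordinate score from Equation (25).
scores = grad ** 2 / (np.sum(X ** 2, axis=0) + lam2)

decreases = []
for j in range(5):
    e = np.zeros(5)
    e[j] = 1.0
    # Analytical minimizer of the 1-D quadratic loss(beta + alpha * e_j).
    alpha = -grad[j] / (2 * (X[:, j] @ X[:, j] + lam2))
    decreases.append(loss(beta) - loss(beta + alpha * e))

# The score's argmax matches the coordinate with the largest loss decrease.
assert np.argmax(scores) == int(np.argmax(decreases))
```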

The above idea does not handle highly correlated features well. Once a feature is added, it cannot be removed [61]. To alleviate this problem, we use beam-search [58, 43], keeping the best B solutions at each stage of support expansion:

$$j^{*}\in\operatorname{argBottom-}B_{j}\ \min_{\alpha}\ \mathcal{L}^{\mathrm{ridge}}(\beta+\alpha e_j), \tag{26}$$

where $j^{*}\in\operatorname{argBottom-}B_{j}$ means that $j^{*}$ belongs to the set of candidates whose losses are among the $B$ smallest. Afterwards, we finetune the solution on the newly expanded support and choose the best $B$ solutions for the next stage of support expansion. A visual illustration of beam search can be found in Figure 6 in Appendix E, which also contains the detailed algorithm.

Although many methods have been proposed for sparse ridge regression, none of them have been designed with the BnB tree structure in mind. Our approach is to take advantage of the search history of past nodes to speed up the search process for a current node. To achieve this, we follow a dynamic programming approach by saving the solutions of already explored support sets. Therefore, whenever we need to adjust coefficients on the new support during beam search, we can simply retrieve the coefficients from the history if a support has been explored in the past. Essentially, we trade memory space for computational efficiency.
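The beam search with its dynamic-programming cache can be sketched as follows (a simplification of the description above; the function names are ours, and the real implementation additionally ties the cache into BnB node processing):

```python
import numpy as np

def ridge_fit(X, y, lam2, support):
    """Refit ridge regression restricted to a given support."""
    S = list(support)
    XS = X[:, S]
    bS = np.linalg.solve(XS.T @ XS + lam2 * np.eye(len(S)), XS.T @ y)
    return bS, np.sum((y - XS @ bS) ** 2) + lam2 * np.sum(bS ** 2)

def beam_search(X, y, k, lam2, B=5, cache=None):
    """Expand supports one coordinate at a time, keeping the best B per stage.

    `cache` maps a frozenset support to its refitted (coefs, loss), so
    supports explored at earlier nodes are never refit twice.
    """
    if cache is None:
        cache = {}
    p = X.shape[1]
    beam = [frozenset()]
    for _ in range(k):
        candidates = {s | {j} for s in beam for j in range(p) if j not in s}
        scored = []
        for s in candidates:
            if s not in cache:  # dynamic programming: reuse past solutions
                cache[s] = ridge_fit(X, y, lam2, sorted(s))
            scored.append((cache[s][1], s))
        scored.sort(key=lambda t: t[0])
        beam = [s for _, s in scored[:B]]
    best = beam[0]
    return sorted(best), cache[best][1]
```

Passing the same `cache` dictionary across calls is what trades memory for computation when many related nodes are explored.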

3.2.1. Provable Guarantee

Lastly, using methods similar to [29], we quantify the gap between our heuristic solution $\hat\beta$ and the optimal solution $\beta^{*}$ in Theorem 3.4. Compared with Theorem 5 in [29], we improve the factor in the exponent from $m_{2k}/M_{2k}$ to $m_{2k}/M_{1}$ (since $M_{1}\le M_{2k}$, where $M_{1}$ and $M_{2k}$ are defined in [29]).

Theorem 3.4. Let us define a $k$-sparse vector pair domain $\Omega_k\coloneqq\left\{(x,y)\in\mathbb{R}^p\times\mathbb{R}^p:\|x\|_0\le k,\ \|y\|_0\le k,\ \|x-y\|_0\le k\right\}$. Any $M_1$ satisfying $f(y)\le f(x)+\nabla f(x)^T(y-x)+\frac{M_1}{2}\|y-x\|_2^2$ for all $(x,y)\in\Omega_1$ is called a restricted smoothness parameter with support size 1, and any $m_{2k}$ satisfying $f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac{m_{2k}}{2}\|y-x\|_2^2$ for all $(x,y)\in\Omega_{2k}$ is called a restricted strong convexity parameter with support size $2k$. If $\hat\beta$ is the heuristic solution found by our beam-search method and $\beta^{*}$ is the optimal solution, then:

$$\mathcal{L}^{\mathrm{ridge}}(\beta^{*})\le\mathcal{L}^{\mathrm{ridge}}(\hat\beta)\le\left(1-e^{-m_{2k}/M_{1}}\right)\mathcal{L}^{\mathrm{ridge}}(\beta^{*}). \tag{27}$$

3.3. Branching and Queuing

Branching:

The most common branching techniques include most-infeasible branching and strong branching [2, 1, 15, 7]. However, these two techniques require having fractional values for the binary variables $z_j$, which we do not compute in our framework. Instead, we propose a new branching strategy based on our heuristic solution $\hat\beta$: we branch on the coordinate whose coefficient, if set to 0, would result in the largest increase in the ridge regression loss $\mathcal{L}^{\mathrm{ridge}}$ (see Appendix E for details):

$$j^{*}=\arg\max_{j}\ \mathcal{L}^{\mathrm{ridge}}\!\left(\hat\beta-\hat\beta_j e_j\right). \tag{28}$$

The intuition is that the coordinate with the largest increase in $\mathcal{L}^{\mathrm{ridge}}$ potentially plays a significant role, so we want to fix such a coordinate as early as possible in the BnB tree.
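The branching rule of Equation (28) can be sketched as follows (our hypothetical helper, assuming a dense heuristic solution restricted to its support):

```python
import numpy as np

def branching_coordinate(X, y, lam2, beta_hat):
    """Pick the support coordinate whose removal increases the loss most."""
    def ridge_loss(b):
        return np.sum((y - X @ b) ** 2) + lam2 * np.sum(b ** 2)

    best_j, best_increase = None, -np.inf
    for j in np.nonzero(beta_hat)[0]:
        b = beta_hat.copy()
        b[j] = 0.0  # zero out coordinate j, keeping the rest fixed
        increase = ridge_loss(b) - ridge_loss(beta_hat)
        if increase > best_increase:
            best_j, best_increase = int(j), increase
    return best_j
```

A coordinate carrying most of the fit (a large loss increase when removed) is fixed first, so the BnB tree commits early to the decisions that matter most.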

Queuing:

Besides the branching strategy, we need a queue to pick a node to explore among newly created nodes. Here, we use a breadth-first approach, evaluating nodes in the order they are created.

4. Experiments

We test the effectiveness of OKRidge on synthetic benchmarks and on sparse identification of nonlinear dynamical systems (SINDy) [19]. Our main focuses are assessing how well our proposed lower bound calculation speeds up certification (Section 4.1) and evaluating the solution quality of OKRidge on challenging applications (Section 4.2). Additional extensive experiments are in Appendices G and H. Our algorithms are written in Python; any improvements we see over commercial MIP solvers, which are coded in C/C++, are solely due to our specialized algorithms.

4.1. Assessing How Well Our Proposed Lower Bound Calculation Speeds Up Certification

Here, we demonstrate the speed of OKRidge for certifying optimality compared to existing MIPs solved by Gurobi [35]. We set a 1-hour time limit and an optimality gap of relative tolerance $10^{-4}$. We use $\lambda_2=0.001$. Our 4 baselines are MIPs with the SOS1, big-M ($M=50$, chosen to prevent cutting off optimal solutions), perspective [4], and eigen-perspective ($\lambda=\lambda_{\min}(X^TX)$) [28] formulations. In the main text, we use plots to present the results; in Appendix G, we present the results in tables. Additionally, in Appendix G, we conduct perturbation studies on $\lambda_2$ ($\lambda_2=0.1$ and $\lambda_2=10$) and $M$ ($M=20$ and $M=5$). Finally, also in Appendix G, we compare OKRidge with other MIP approaches, including the MOSEK solver [3], SubsetSelectionCIO [11], and L0BNB [39].

Similar to the data generation process in [11, 48], we first sample $x_i\in\mathbb{R}^p$ from a Gaussian distribution $\mathcal{N}(0,\Sigma)$ with mean 0 and covariance matrix $\Sigma$, where $\Sigma_{ij}=\rho^{|i-j|}$. The variable $\rho$ controls the feature correlation. Then, we create the coefficient vector $\beta^{*}$ with $k$ nonzero entries, where $\beta_j^{*}=1$ if $j \bmod (p/k)=0$. Next, we construct the prediction $y_i=x_i^T\beta^{*}+\epsilon_i$, where $\epsilon_i\overset{\text{i.i.d.}}{\sim}\mathcal{N}\!\left(0,\ \|X\beta^{*}\|_2^2/\mathrm{SNR}\right)$, and SNR stands for the signal-to-noise ratio. We choose $\mathrm{SNR}=5$ in all our experiments.
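The generation recipe above can be sketched in NumPy (our hypothetical helper; the noise variance follows the formula as written, $\|X\beta^{*}\|_2^2/\mathrm{SNR}$):

```python
import numpy as np

def make_benchmark(n, p, k, rho, snr, seed=0):
    """Synthetic data: Sigma_ij = rho**|i-j|, k evenly spaced unit coefficients,
    Gaussian noise scaled by the target signal-to-noise ratio."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:: p // k] = 1.0  # nonzero where j mod (p/k) == 0
    signal = X @ beta
    sigma2 = signal @ signal / snr  # noise variance per the stated formula
    y = signal + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y, beta
```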

In the first setting, we fix the number of samples at $n=100{,}000$ and vary the number of features $p\in\{100, 500, 1000, 3000, 5000\}$ and correlation levels $\rho\in\{0.1, 0.5, 0.9\}$ (see Appendix G for $\rho=0.3$ and $\rho=0.7$). We warm-started the MIP solvers with our beam-search solutions. The results can be seen in Figure 1: OKRidge outperforms all existing MIPs solved by Gurobi, usually by orders of magnitude.

Figure 1: Comparison of running time (top row) and optimality gap (bottom row) between our method and baselines, varying the number of features, for three correlation levels $\rho=0.1, 0.5, 0.9$ ($n=100{,}000$, $k=10$). Time is on the log scale. Our method is generally orders of magnitude faster than other approaches and achieves the smallest optimality gap, especially when the feature correlation $\rho$ is high.

In the second setting, we fix the number of features at $p=3000$ and vary the number of samples $n\in\{3000, 4000, 5000, 6000, 7000\}$ and the correlation levels $\rho\in\{0.1, 0.5, 0.9\}$ (see Appendix G for $\rho=0.3$ and $\rho=0.7$). As in the first setting, we warm-started the MIP solvers with our beam-search solutions. The results are in Figure 2. When $n$ is close to $p$ or the correlation is high ($\rho=0.9$), no method can finish within the 1-hour time limit, but OKRidge prunes the search space well and achieves the smallest optimality gap. When $n$ becomes larger in the cases $\rho=0.1$ and $\rho=0.5$, OKRidge runs orders of magnitude faster than all baselines.

Figure 2: Comparison of running time (top row) and optimality gap (bottom row) between our method and baselines, varying sample sizes, for three correlation levels $\rho=0.1, 0.5, 0.9$ ($p=3000$, $k=10$). Time is on the log scale. When $\rho=0.1$ and $\rho=0.5$, OKRidge is generally orders of magnitude faster than other approaches. In the case $\rho=0.9$, we achieve the smallest optimality gap, as shown in the bottom row.

4.2. Evaluating Solution Quality of OKRidge on Challenging Applications

On previous synthetic benchmarks, many heuristics (including our beam search method) can find the optimal solution without branch-and-bound. In this subsection, we work on more challenging scenarios (sparse identification of differential equations). We replicate the experiments in [9] using three dynamical systems from the PySINDy library [27, 42]: the Lorenz System, the Hopf Bifurcation, and a magnetohydrodynamical (MHD) model [24]. The Lorenz System is a 3-D system with the nonlinear differential equations:

dx/dt = -σx + σy,  dy/dt = ρx - y - xz,  dz/dt = xy - βz,

where we use standard parameters σ=10,β=8/3,ρ=28. The true sparsities for each dimension are (2, 3, 2). The Hopf Bifurcation is a 2-D system with nonlinear differential equations:

dx/dt = μx + ωy - Ax^3 - Axy^2,  dy/dt = -ωx + μy - Ax^2y - Ay^3,

where we use the standard parameters μ=-0.05,ω=1,A=1. The true sparsities for each dimension are (4, 4). Finally, the MHD is a 6-D system with the nonlinear differential equations:

dV1/dt = 4V2V3 - 4B2B3,  dV2/dt = -7V1V3 + 7B1B2,  dV3/dt = 3V1V2 - 3B1B2,
dB1/dt = 2B3V2 - 2V3B2,  dB2/dt = 5V3B1 - 5B3V1,  dB3/dt = 9V1B2 - 9B1V2.

The true sparsities for each dimension are (2, 2, 2, 2, 2, 2).

We use all monomial features (candidate functions) up to 5th-order interactions. This results in 56 candidate functions for the Lorenz System, 21 for the Hopf Bifurcation, and 462 for the MHD model. Due to the high-order interaction terms, the features are highly correlated, resulting in poor performance of heuristic methods.

4.2.1. Baselines and Experimental Setup

In addition to MIOSR (which relies on the SOS1 formulation), we also compare with three common baselines in the SINDy literature: STLSQ [54], SSR [16], and E-STLSQ [31]. The baseline SR3 [25] is not included since previous literature [9] shows it performs poorly. We compare OKRidge with the other baselines using the SINDy library [27, 42]. We follow the experimental setups in [9] for model selection, hyperparameter choices, and evaluation metrics (please see Appendix F for details). In Appendix H, we provide additional experiments on Gurobi with different MIP formulations and comparisons with more heuristic baselines.

4.2.2. Results

Figure 3 displays the results. OKRidge (red curves) outperforms all baselines, including MIOSR (blue curves), across evaluation metrics. On the Lorenz System, all methods recover the true feature support when the training trajectory is long enough; when the training trajectory is short (the left part of each subplot, or equivalently, when the number of samples is small), OKRidge performs uniformly better than all other baselines. On the Hopf Bifurcation, all heuristic methods fail to recover the true support, resulting in poor performance. On the MHD system, OKRidge maintains the top performance and outperforms MIOSR on the true positivity rate. This demonstrates the effectiveness of OKRidge, which incurs lower runtimes and yields better metric scores in high-dimensional settings. The highest runtimes are incurred on the MHD system (with 462 candidate functions/features), as shown in Figure 4.

Figure 3: Results on discovering sparse differential equations. On various metrics, OKRidge outperforms all other methods, including MIOSR, which uses a commercial (proprietary) MIP solver.

Figure 4: Running time comparison between OKRidge and MIOSR on the MHD system with 462 candidate functions. OKRidge is significantly faster than the previous state of the art.

Limitations of OKRidge

When the feature dimension is low (on the order of 100 or fewer), Gurobi can solve the problem to optimality faster than OKRidge. This is observed on the synthetic benchmarks ($p=100$) and also on the Hopf Bifurcation ($p=21$). Since Gurobi is a commercial proprietary solver, we cannot inspect the details of its sophisticated implementation; it may resort to an enumeration/brute-force approach, which could be faster than spending time calculating lower bounds in the BnB tree. That said, OKRidge remains competitive with Gurobi in the low-dimensional setting, and OKRidge scales favorably in high-dimensional settings.

5. Conclusion

We presented a method for optimal sparse ridge regression that leverages a novel tight lower bound on the objective. We showed that the method is both faster and more accurate than existing approaches for learning differential equations, a key problem in scientific discovery. This tool (unlike its main competitor) does not require proprietary software with expensive licenses and can have a significant impact on various regression applications.

Supplementary Material


Acknowledgements

The authors gratefully acknowledge funding support from grants NSF IIS-2130250, NSF-NRT DGE-2022040, NSF OAC-1835782, DOE DE-SC0023194, and NIH/NIDA R01 DA054994. The authors would also like to thank the anonymous reviewers for their insightful comments.

Footnotes

Code Availability

Implementations of OKRidge discussed in this paper are available at https://github.com/jiachangliu/OKRidge.

1. [33] also considers matrix preconditioning when computing the step size, but this is computationally expensive when the number of features is large, so we ignore matrix rescaling by letting E be the identity matrix in Section VI, Subsection A of [33].

References

  • [1].Achterberg T, Koch T, and Martin A. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005. [Google Scholar]
  • [2].Applegate D, Bixby R, Chvátal V, and Cook W. On the solution of traveling salesman problems. Documenta Mathematica, pages 645–656, 1998. [Google Scholar]
  • [3].ApS M. Mosek optimizer API for python. Version, 9(17):6–4, 2022. [Google Scholar]
  • [4].Atamturk A and Gómez A. Safe screening rules for 10-regression from perspective relaxations. In International Conference on Machine Learning, pages 421–430. PMLR, 2020. [Google Scholar]
  • [5].Atamtürk A, Gómez A, and Han S. Sparse and smooth signal estimation: Convexification of 10-formulations. Journal of Machine Learning Research, 22:52–1, 2021. [Google Scholar]
  • [6].Beck A and Eldar YC. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization, 23(3):1480–1509, 2013. [Google Scholar]
  • [7].Belotti P, Kirches C, Leyffer S, Linderoth J, Luedtke J, and Mahajan A. Mixed-integer nonlinear optimization. Acta Numerica, 22:1–131, 2013. [Google Scholar]
  • [8].Bertsekas D. Convex optimization theory, volume 1. Athena Scientific, 2009. [Google Scholar]
  • [9].Bertsimas D and Gurnee W. Learning sparse nonlinear dynamics via mixed-integer optimization. Nonlinear Dynamics, Jan 2023. [Google Scholar]
  • [10].Bertsimas D, King A, and Mazumder R. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016. [Google Scholar]
  • [11].Bertsimas D, Pauphilet J, and Van Parys B. Sparse regression: Scalable algorithms and empirical performance. Statistical Science, 35(4):555–578, 2020. [Google Scholar]
  • [12].Best MJ and Chakravarti N. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47(1–3):425–439, 1990. [Google Scholar]
  • [13].Blumensath T and Davies ME. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370–2382, 2008. [Google Scholar]
  • [14].Blumensath T and Davies ME. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009. [Google Scholar]
  • [15].Bonami P, Lee J, Leyffer S, and Wächter A. More branch-and-bound experiments in convex nonlinear integer programming. Preprint ANL/MCS-P1949–0911, Argonne National Laboratory, Mathematics and Computer Science Division, 91, 2011. [Google Scholar]
  • [16].Boninsegna L, Nüske F, and Clementi C. Sparse learning of stochastic dynamical equations. The Journal of Chemical Physics, 148(24):241723, June 2018. [DOI] [PubMed] [Google Scholar]
  • [17].Boyd S, Parikh N, Chu E, Peleato B, Eckstein J, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011. [Google Scholar]
  • [18].Boyd S, Xiao L, and Mutapcic A. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2004:2004–2005, 2003. [Google Scholar]
  • [19].Brunton SL, Proctor JL, and Kutz JN. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Bubeck S et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015. [Google Scholar]
  • [21].Busing FM. Monotone regression: A simple and fast o(n) PAVA implementation. Journal of Statistical Software, 102:1–25, 2022. [Google Scholar]
  • [22].Cai TT and Wang L. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688, 2011. [Google Scholar]
  • [23].Camerini PM, Fratta L, and Maffioli F. On improving relaxation methods by modified gradient techniques. In Nondifferentiable Optimization, pages 26–34. Springer; Berlin Heidelberg, 1975. [Google Scholar]
  • [24].Carbone V and Veltri P. Relaxation processes in magnetohydrodynamics - A triad-interaction model. Astronomy and Astrophysics, 259(1):359–372, June 1992. [Google Scholar]
  • [25].Champion K, Zheng P, Aravkin AY, Brunton SL, and Kutz JN. A unified sparse optimization framework to learn parsimonious physics-informed models from data. IEEE Access, 8:169259–169271, 2020. [Google Scholar]
  • [26].Combettes PL. Perspective functions: Properties, constructions, and examples. Set-Valued and Variational Analysis, 26(2):247–264, 2018 [Google Scholar]
  • [27].de Silva B, Champion K, Quade M, Loiseau J-C, Kutz J, and Brunton S. Pysindy: A python package for the sparse identification of nonlinear dynamical systems from data. Journal of Open Source Software, 5(49):2104, 2020. [Google Scholar]
  • [28].Dong H, Chen K, and Linderoth J. Regularization vs. relaxation: A conic optimization perspective of statistical variable selection. arXiv preprint arXiv:1510.06083, 2015. [Google Scholar]
  • [29].Elenberg ER, Khanna R, Dimakis AG, and Negahban S. Restricted strong convexity implies weak submodularity. The Annals of Statistics, 46(6B):3539–3568, 2018. [Google Scholar]
  • [30].Eriksson A, Thanh Pham T, Chin T-J, and Reid I. The k-support norm and convex envelopes of cardinality and rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3349–3357, 2015. [Google Scholar]
  • [31].Fasel U, Kutz JN, Brunton BW, and Brunton SL. Ensemble-SINDy: Robust sparse model discovery in the low-data, high-noise limit, with active learning and control. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 478(2260), Apr. 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Frangioni A and Gentile C. Perspective cuts for a class of convex 0–1 mixed integer programs. Mathematical Programming, 106(2):225–236, 2006. [Google Scholar]
  • [33].Giselsson P and Boyd S. Linear convergence and metric selection for douglas-rachford splitting and admm. IEEE Transactions on Automatic Control, 62(2):532–544, 2016. [Google Scholar]
  • [34].Günlük O and Linderoth J. Perspective reformulations of mixed integer nonlinear programs with indicator variables. Mathematical Programming, 124(1):183–205, 2010. [Google Scholar]
  • [35].Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
  • [36].Haemers WH. Interlacing eigenvalues and graphs. Linear Algebra and its Applications, 226:593–616, 1995. [Google Scholar]
  • [37].Han S, Gómez A, and Atamtürk A. The equivalence of optimal perspective formulation and Shor’s SDP for quadratic programs with indicator variables. Operations Research Letters, 50(2):195–198, 2022. [Google Scholar]
  • [38].Hazimeh H and Mazumder R. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5):1517–1537, 2020. [Google Scholar]
  • [39].Hazimeh H, Mazumder R, and Saab A. Sparse regression at scale: Branch-and-bound rooted in first-order optimization. Mathematical Programming, 196(1):347–388, 2022. [Google Scholar]
  • [40].Jain P, Tewari A, and Kar P. On iterative hard thresholding methods for high-dimensional m-estimation. Advances in Neural Information Processing Systems, 27, 2014. [Google Scholar]
  • [41].Kaheman K, Kutz JN, and Brunton SL. SINDy-PI: a robust algorithm for parallel implicit sparse identification of nonlinear dynamics. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 476(2242), Oct. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Kaptanoglu AA, de Silva BM, Fasel U, Kaheman K, Goldschmidt AJ, Callaham J, Delahunt CB, Nicolaou ZG, Champion K, Loiseau J-C, Kutz JN, and Brunton SL. Pysindy: A comprehensive python package for robust sparse system identification. Journal of Open Source Software, 7(69):3994, 2022. [Google Scholar]
  • [43].Liu J, Zhong C, Li B, Seltzer M, and Rudin C. FasterRisk: Fast and accurate interpretable risk scores. In Advances in Neural Information Processing Systems, 2022. [Google Scholar]
  • [44].Liu J, Zhong C, Seltzer M, and Rudin C. Fast sparse classification for generalized linear and additive models. In Proceedings of Artificial Intelligence and Statistics (AISTATS), 2022. [PMC free article] [PubMed] [Google Scholar]
  • [45].Mangan NM, Kutz JN, Brunton SL, and Proctor JL. Model selection for dynamical systems via sparse regression and information criteria. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2204):20170009, Aug. 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Messenger DA and Bortz DM. Weak sindy for partial differential equations. Journal of Computational Physics, 443:110525, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [47].Messenger DA and Bortz DM. Weak sindy: Galerkin-based data-driven model selection. Multiscale Modeling & Simulation, 19(3):1474–1497, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Moreau T, Massias M, Gramfort A, Ablin P, Bannier P-A, Charlier B, Dagréou M, Dupré la Tour T, Durif G, Dantas CF, Klopfenstein Q, Larsson J, Lai E, Lefort T, Malézieux B, Moufad B, Nguyen BT, Rakotomamonjy A, Ramzi Z, Salmon J, and Vaiter S. Benchopt: Reproducible, efficient and collaborative optimization benchmarks. In Advances in Neural Information Processing Systems, 2022. [Google Scholar]
  • [49].Natarajan BK. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995. [Google Scholar]
  • [50].Needell D and Tropp JA. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009. [Google Scholar]
  • [51].Needell D and Vershynin R. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, 4(2):310–316, 2010. [Google Scholar]
  • [52].Parikh N, Boyd S, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014. [Google Scholar]
  • [53].Pilanci M, Wainwright MJ, and El Ghaoui L. Sparse learning via boolean relaxations. Mathematical Programming, 151(1):63–87, 2015. [Google Scholar]
  • [54].Rudy SH, Brunton SL, Proctor JL, and Kutz JN. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [55].Sander ME, Puigcerver J, Djolonga J, Peyré G, and Blondel M. Fast, differentiable and sparse top-k: a convex analysis perspective. In International Conference on Machine Learning, pages 29919–29936. PMLR, 2023. [Google Scholar]
  • [56].Tropp J. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004. [Google Scholar]
  • [57].Vreugdenhil R, Nguyen VA, Eftekhari A, and Esfahani PM. Principal component hierarchy for sparse quadratic programs. In International Conference on Machine Learning, pages 10607–10616. PMLR, 2021. [Google Scholar]
  • [58].Wiseman S and Rush AM. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas, Nov. 2016. Association for Computational Linguistics. [Google Scholar]
  • [59].Xie W and Deng X. Scalable algorithms for the sparse ridge regression. SIAM Journal on Optimization, 30(4):3359–3386, 2020. [Google Scholar]
  • [60].Yuan G, Shen L, and Zheng W-S. A block decomposition algorithm for sparse optimization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 275–285, 2020. [Google Scholar]
  • [61].Zhang T. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011. [Google Scholar]
  • [62].Zheng X, Sun X, and Li D. Improving the performance of MIQP solvers for quadratic programs with cardinality and minimum threshold constraints: A semidefinite program approach. INFORMS Journal on Computing, 26(4):690–703, 2014. [Google Scholar]
  • [63].Zhu J, Wen C, Zhu J, Zhang H, and Wang X. A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52):33117–33123, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
