Abstract
Many problems in classification involve huge numbers of irrelevant features. Variable selection reveals the crucial features, reduces the dimensionality of feature space, and improves model interpretation. In the support vector machine literature, variable selection is achieved by lasso penalties. These convex relaxations seriously bias parameter estimates toward 0 and tend to admit too many irrelevant features. The current paper presents an alternative that replaces penalties by sparse-set constraints. Penalties still appear, but serve a different purpose. The proximal distance principle takes a loss function $L(\beta)$ and adds the penalty $\frac{\rho}{2}\operatorname{dist}(\beta, S_k)^2$ capturing the squared Euclidean distance of the parameter vector $\beta$ to the sparsity set $S_k$, where at most $k$ components of $\beta$ are nonzero. If $\beta_\rho$ represents the minimum of the objective $L(\beta) + \frac{\rho}{2}\operatorname{dist}(\beta, S_k)^2$, then $\beta_\rho$ tends to the constrained minimum of $L(\beta)$ over $S_k$ as $\rho$ tends to $\infty$. We derive two closely related algorithms to carry out this strategy. Our simulated and real examples vividly demonstrate how the algorithms achieve better sparsity without loss of classification power.
Keywords: sparsity, discriminant analysis, unsupervised learning, Julia
1. Introduction
Support vector machines (SVMs) are powerful pattern recognition tools (Cortes and Vapnik, 1995) with a wide range of applications across machine learning and statistics. Success stories in supervised learning include optical character recognition (Decoste and Schölkopf, 2002), image segmentation (Barghout, 2015), text categorization (Joachims, 1998; Pradhan et al., 2004), protein structure prediction (Dunbrack, 2006), and early detection and classification of human cancers (Sewak et al., 2007). It is possible to extend SVM to online algorithms for resource-limited computing environments (Cauwenberghs and Poggio, 2000; Laskov et al., 2006) and beyond classification to unsupervised learning problems (Ben-Hur et al., 2002).
Part of the success of SVMs is due to their flexible decision boundaries. SVMs are often sparse in the sense that they depend on relatively few training samples. Sparsity enhances the ability of SVMs to label unlabeled cases. Existing research on fitting SVMs largely focuses on this strength by devising methods that improve prediction quality or accelerate the fitting process on large-scale datasets. Speed and prediction are worthy goals, but some application domains also require model interpretability. Indeed, in biological sciences and biomedical applications most variables are uninformative. One naturally desires to extract the informative variables because these drive further discovery and hypothesis generation. Thus, variable selection for SVMs is a topic of considerable interest in the literature. Lasso penalization is the primary tool for inducing sparsity (Zhu et al., 2003). Unfortunately, lasso penalization also induces shrinkage in model selection and tends to admit many irrelevant features. In contrast to penalization and shrinkage, screening methods originally developed for lasso problems eliminate variables prior to model fitting (El Ghaoui et al., 2012; Tibshirani et al., 2012; Wang et al., 2013). Ogawa et al. (2013) implement screening in SVMs, and Jaggi (2014) forges further connections between SVMs and the lasso.
The aim of this work is to present a flexible framework that addresses parsimony in both variable selection and prediction. We focus on (a) primal problems under the squared-hinge loss, (b) novel proximal distance algorithms for parameter fitting and model selection (Chi et al., 2014; Lange, 2016), and (c) extensions to kernel machines for nonlinear data. Our key strategy replaces sparsity-inducing penalties with projection onto sparsity constraint sets. The SVM proximal distance algorithm comes in two closely related flavors. Both exhibit predictive accuracy comparable, and often superior, to existing SVM methods. Although the proximal distance algorithms are sometimes slower than competing methods, their output is both easier to interpret and better at revealing sparse signals hidden in high-dimensional data.
2. Sparse SVMs via Distance Penalization
In this section we develop the theory underpinning our sparse SVM algorithms. We begin with a brief overview of SVM loss models, specifically L2 SVMs, and derive a quadratic surrogate for squared hinge losses. We also add a squared distance penalty to the L2 SVM loss to guide fitting toward sparsity. By projecting model coefficients onto different sparsity sets, one can directly control the number of active features in a model. We explore this idea in linear classifiers to select potentially informative features and in nonlinear classifiers to select potentially informative support vectors. These vectors define a classifier's decision boundary. Readers may consult Keys et al. (2019) for a broad overview of the proximal distance principle and its connection to proximal methods in convex optimization.
2.1. Background
Many variations on supervised SVMs have been proposed. For example, the class of soft-margin Lp SVMs is based on convex programs of the form

$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n \xi_i^p \quad \text{subject to} \quad y_i(x_i^\top w + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \;\; i = 1, \ldots, n,$$

where $i = 1, \ldots, n$ indexes the samples, the polarities $y_i \in \{-1, +1\}$ serve as class labels, and features (predictors) are denoted $x_i \in \mathbb{R}^p$. Here the parameters $(w, b)$ determine a hyperplane separating two classes, and the slack variables $\xi_i$ quantify margins in inactive constraints $y_i(x_i^\top w + b) \geq 1$. Typically, $C > 0$ is treated as a hyperparameter that mediates a compromise between maximizing the separating margin ($2/\|w\|_2$) and minimizing empirical risk. Taking $C \to \infty$ recovers a hard-margin model and requires data to be linearly separable.
Rewriting the margin constraints as $\xi_i \geq \max\{0,\, 1 - y_i(x_i^\top w + b)\}$ and suitably rescaling the objective allows one to state the constrained minimization problem as an equivalent unconstrained problem

$$f(\beta) = \frac{1}{n}\sum_{i=1}^n R_i(\beta) + \frac{\lambda}{2}\|w\|_2^2, \qquad R_i(\beta) = \max\{0,\, 1 - y_i(x_i^\top w + b)\}^p, \tag{1}$$
with model parameters $\beta = (w, b)$, penalty $\frac{\lambda}{2}\|w\|_2^2$, and misclassification cost $R_i(\beta)$ per sample. We adopt this version of SVM fitting and note a few connections to the existing literature. First, under this formulation the loss can be interpreted as a measure of empirical risk when the functional form of $R_i$ is chosen appropriately. When $p = 1$ the SVM is the classic soft-margin classifier, sometimes called the L1 SVM, in which the penalty term on model coefficients enables classification even when data are not linearly separable. The ridge penalty $\frac{\lambda}{2}\|w\|_2^2$ can be replaced by a lasso penalty $\lambda\|w\|_1$ to induce variable selection (Zhu et al., 2003). Other loss functions, such as quadratic and Huber hinge errors, have been proposed as alternatives to the hinge loss to promote better prediction and robustness to outliers (Groenen et al., 2008). Taking $p = 2$ leads to the L2 SVM family, which achieves differentiability and strict convexity at the expense of overemphasizing misclassification errors in outliers. In addition, clever formulations of convex primal programs often lead to dual programs that are easier to solve and thus accelerate the fitting process. The last point is especially pertinent to the L2 family, as the dual program reduces to maximization of a quadratic over a simplex. This brief review is hardly exhaustive. Nevertheless, in all their variations, the defining property of SVMs is their parsimonious decision boundaries driven by a small number of support vectors.
2.2. Mathematical Formulation
Our starting point is an L2 SVM with the squared hinge loss $\max\{0,\, 1 - y_i x_i^\top \beta\}^2$. The SVM takes the form

$$f(\beta) = \frac{1}{n}\sum_{i=1}^n \max\{0,\, 1 - y_i x_i^\top \beta\}^2 + \frac{\lambda}{2}\|w\|_2^2. \tag{2}$$

Here the $n$ labeled samples consist of a binary label $y_i \in \{-1, +1\}$ and a feature vector $x_i$ occupying a row of the matrix $X$. The parameter vector $\beta \in \mathbb{R}^{p+1}$ defines a hyperplane separating the two classes, with the first $p$ components corresponding to weights $w$ and the last component $b$ representing an intercept. The last component of each $x_i$ is accordingly 1. Our notation is insensitive to the inclusion or exclusion of an intercept term in describing the primal model (2).
Rather than directly minimize $f(\beta)$ defined by equation (2), we turn to the MM principle (Lange et al., 2000; Lange, 2016) and invoke the quadratic majorization

$$\max\{0,\, 1 - u\}^2 \;\leq\; (u - z_m)^2, \qquad z_m = \max\{u_m,\, 1\},$$

at iteration $m$ suggested by Groenen et al. (2008). Note that the two sides of the majorization agree when $u = u_m$. Given that all $y_i^2 = 1$, the term-by-term application of the majorization with $u = y_i x_i^\top \beta$ and $u_m = y_i x_i^\top \beta_m$ creates the overall quadratic surrogate

$$g(\beta \mid \beta_m) = \frac{1}{n}\sum_{i=1}^n (x_i^\top \beta - y_i z_{mi})^2 + \frac{\lambda}{2}\|w\|_2^2, \qquad z_{mi} = \max\{y_i x_i^\top \beta_m,\, 1\}. \tag{3}$$
This maneuver reduces the original minimization problem to a sequence of easier minimization problems that can be solved by repeated least squares. Nguyen and McLachlan (2017) also apply the MM principle to support vector machines and make a connection to iteratively reweighted least squares (IRLS), albeit with a different majorization. The surrogate (3) is appealing compared to surrogate (12) of Nguyen and McLachlan (2017) because the former avoids changing weights. In any case, the MM principle implies that every iteration decreases the objective function (2) via the chain of inequalities

$$f(\beta_{m+1}) \;\leq\; g(\beta_{m+1} \mid \beta_m) \;\leq\; g(\beta_m \mid \beta_m) \;=\; f(\beta_m).$$
2.3. Variable Selection
The objective (2) employs the quadratic penalty $\frac{\lambda}{2}\|w\|_2^2$ to control the size of a SVM's separating margin, $2/\|w\|_2$, but the penalty shrinks parameters rather than selects them. Thus, we append an explicit sparsity constraint to (2) and consider constrained problems of the form

$$\min_{\beta} \; f(\beta) \quad \text{subject to} \quad \|w\|_0 \leq k,$$

where $\|w\|_0$ counts the number of nonzero components in $w$. Note that the inequality restriction $\|w\|_0 \leq k$ makes the sparsity set

$$S_k = \{w \in \mathbb{R}^p : \|w\|_0 \leq k\}$$

closed. For any dataset with $p$ predictors, $k \leq p$. Fortunately, Euclidean projection onto $S_k$ is straightforward. The projection operator $P_{S_k}(w)$ sets to zero all but the $k$ largest entries in magnitude of $w$. This goal can be achieved efficiently by a partial sort of these entries. Figure 1 illustrates the Euclidean projection that selects one variable in a two-dimensional problem. The projection may fail to be unique when some or all coefficients are tied, but such events occur on lower dimensional subspaces and therefore have measure zero.
In light of the simplicity of projections onto sparsity sets, we follow the penalty method of constrained optimization (Beltrami, 1970; Courant, 1943) and minimize the unconstrained objective
$$h_\rho(\beta) = f(\beta) + \frac{\rho}{2}\operatorname{dist}(w, S_k)^2 \tag{4}$$

for a large value of the annealing parameter $\rho$. The squared distance penalty $\operatorname{dist}(w, S_k)^2 = \|w - P_{S_k}(w)\|_2^2$ enforces near sparsity and is continuously differentiable wherever the underlying projection is single valued. In the limit as $\rho$ tends to $\infty$, the solution vector tends to a solution of the constrained problem. Combining our previous majorization (3) with the distance majorization

$$\operatorname{dist}(w, S_k)^2 \;\leq\; \|w - P_{S_k}(w_m)\|_2^2$$

yields the sum of squares surrogate

$$g_\rho(\beta \mid \beta_m) = \frac{1}{n}\sum_{i=1}^n (x_i^\top \beta - y_i z_{mi})^2 + \frac{\lambda}{2}\|w\|_2^2 + \frac{\rho}{2}\|w - P_{S_k}(w_m)\|_2^2. \tag{5}$$
Note that no sparsity is imposed on the intercept; when present, projection does not touch the $(p+1)$-th component of $\beta$. Sparsity constraints permit identification of features driving a classifier's decision boundary in spite of sacrificing convexity. Our previous experience supports the value of the proximal distance principle in building parsimonious models with nonconvex set constraints (Xu et al., 2017; Keys et al., 2019; Landeros et al., 2022).
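To make the ingredients of (4) concrete, the following sketch evaluates the distance-penalized objective under our notation. The $1/n$ normalization and the helper names are our own choices, not the package's API.

```julia
using LinearAlgebra

# Evaluate h_ρ(β) = squared-hinge loss + (λ/2)‖w‖² + (ρ/2) dist(w, S_k)²,
# assuming the last column of X is all ones so that β = (w, b).
function penalized_objective(β, X, y, λ, ρ, k)
    p = length(β) - 1
    w = β[1:p]
    r = max.(0, 1 .- y .* (X * β))                      # hinge residuals
    loss = sum(abs2, r) / length(y)
    idx = partialsortperm(abs.(w), 1:min(k, p), rev=true)
    Pw = zeros(p); Pw[idx] = w[idx]                     # projection of w onto S_k
    return loss + (λ / 2) * sum(abs2, w) + (ρ / 2) * sum(abs2, w - Pw)
end
```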
2.4. Support Vector Selection
While soft-margin classifiers are known to perform decently even on datasets that are not linearly separable, they do not generalize well to inherently nonlinear data. Transforming the feature vectors into an abstract feature space via an implicit mapping, $\varphi$, is sufficient to induce nonlinear decision boundaries. Careful design of a primal problem then allows one to invoke the kernel trick (Schölkopf and Smola, 2018), which reduces further calculations to the formation of inner products in a Hilbert space $\mathcal{H}$. Thus, the functional form of the nonlinear transformation is immaterial, and one can instead focus on choosing a positive semidefinite kernel $\kappa(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$ induced by the transformation.
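As a concrete example, the Gaussian (radial basis) kernel is one positive semidefinite choice; the sketch below tabulates a kernel matrix, with the bandwidth $\gamma$ and the function names being illustrative rather than taken from the paper.

```julia
using LinearAlgebra

gaussian_kernel(u, v; γ=1.0) = exp(-γ * sum(abs2, u - v))

# Kernel matrix K with K[i, j] = κ(xᵢ, xⱼ), where the rows of X are the samples.
function kernel_matrix(X::AbstractMatrix; γ=1.0)
    n = size(X, 1)
    K = [gaussian_kernel(X[i, :], X[j, :]; γ=γ) for i in 1:n, j in 1:n]
    return Symmetric(K)
end
```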
The L2 SVM considered in (2) has a well-known dual program (Frieß and Harrison, 1998a,b; Mangasarian and Musicant, 2001)
(6)

where $K_{ij} = \kappa(x_i, x_j)$ and $Y = \operatorname{diag}(y_1, \ldots, y_n)$. The choice $\kappa(x_i, x_j) = x_i^\top x_j$ corresponds to the standard linear kernel, but in practice one may substitute any positive semidefinite kernel satisfying Mercer's condition (Mercer, 1909; Kimeldorf and Wahba, 1971; Cortes and Vapnik, 1995; Schölkopf et al., 2001). One can show that the original parameters are given by a linear combination of the training features,

$$w = \sum_{i=1}^n \alpha_i y_i x_i.$$
The relationship between $w$ and $\alpha$ suggests that, in general, one cannot hope to induce sparsity in $w$ using a simple distance penalty in the dual problem. Instead, we note that the constraints of the dual program already imply some level of sparsity in $\alpha$. Thus, let $w = \sum_{i=1}^n \alpha_i y_i \varphi(x_i)$ for some coefficients $\alpha_i$. Substituting in the squared hinge term of (2) leads to the alternative model

$$f(\alpha, b) = \frac{1}{n}\sum_{i=1}^n \max\{0,\, 1 - y_i([K Y \alpha]_i + b)\}^2 + \frac{\lambda}{2}\|\alpha\|_2^2 \quad \text{subject to} \quad \|\alpha\|_0 \leq k, \tag{7}$$

which directly controls the number of support vectors in a SVM's decision boundary. The set of support vectors equals $\{i : \alpha_i \neq 0\}$. This nonlinear model combines the advantages of the squared hinge in the primal problem (2) with the kernel trick but, crucially, it is not anchored by any duality theory or related to the dual problem (6). Nevertheless, we will see that this model successfully fits nonlinear decision boundaries with additional flexibility in selecting support vectors.
Adding a distance penalty to (7) leads to the objective
$$h_\rho(\alpha, b) = \frac{1}{n}\sum_{i=1}^n \max\{0,\, 1 - y_i([K Y \alpha]_i + b)\}^2 + \frac{\lambda}{2}\|\alpha\|_2^2 + \frac{\rho}{2}\operatorname{dist}(\alpha, S_k)^2. \tag{8}$$

To recapitulate, define $Y = \operatorname{diag}(y_1, \ldots, y_n)$ as a diagonal matrix and majorize the squared hinge and distance penalty to arrive at the quadratic surrogate

$$g_\rho(\alpha, b \mid \alpha_m, b_m) = \frac{1}{n}\sum_{i=1}^n ([K Y \alpha]_i + b - y_i z_{mi})^2 + \frac{\lambda}{2}\|\alpha\|_2^2 + \frac{\rho}{2}\|\alpha - P_{S_k}(\alpha_m)\|_2^2, \tag{9}$$

where $z_{mi} = \max\{y_i([K Y \alpha_m]_i + b_m),\, 1\}$. One can verify this surrogate by identifying $K Y$ with $X$ and $\alpha$ with $\beta$ in the function (3) majorizing the squared hinge.
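Once a sparse $(\alpha, b)$ is fit, prediction touches only the support vectors, since zeroed coefficients drop out of the kernel expansion implied by $w = \sum_i \alpha_i y_i \varphi(x_i)$. A small sketch in our notation:

```julia
# Predict the label of a new point x from fitted coefficients (α, b): only samples
# with αᵢ ≠ 0 (the support vectors) contribute to the kernel expansion.
function predict_nonlinear(x, α, b, y, X, κ)
    s = b
    for i in findall(!iszero, α)
        s += α[i] * y[i] * κ(X[i, :], x)
    end
    return sign(s)
end
```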
3. Algorithms
In this section we derive practical algorithms for minimizing the criteria (4) and (8) via their surrogate functions (5) and (9), respectively. We begin by outlining a general strategy, proximal distance iteration, that overcomes the technical requirement that $\rho$ should be sufficiently large in minimizing constrained objective functions. Convergence results are discussed. Next, we derive iterative maps specific to minimization of (4) and (8) that apply to fitting sparse linear and nonlinear classifiers, respectively. Finally, we conclude by discussing strategies to tune the level of sparsity via cross validation.
3.1. Proximal Distance Iteration
Let us briefly discuss proximal distance iteration in general. Driving the distance penalty in $h_\rho$ towards 0 requires setting $\rho$ sufficiently large. Propositions 1 through 4 of Keys et al. (2019) provide guidance on how large $\rho$ should be. Unfortunately, taking $\rho$ arbitrarily large necessarily slows convergence. Thus, in practice one minimizes a sequence of penalized objectives $h_{\rho_t}$ parameterized by an increasing sequence of penalty coefficients $\rho_t$. The annealing path is typically continuous in $\rho$, so that warm starting the current subproblem from the solution of the previous subproblem accelerates convergence. One can assess convergence by checking the criteria

$$\|\nabla h_{\rho}(x_m)\| \leq \epsilon_1 \tag{10}$$

$$\operatorname{dist}(x_m, S_k) \leq \epsilon_2 \tag{11}$$

for positive tolerances $\epsilon_1$ and $\epsilon_2$. Note that in the initial phase of proximal distance iteration, when condition (11) is violated, one does not need strict satisfaction of the gradient condition (10) before increasing $\rho$. On the other hand, enforcing the gradient check (10) compounds the benefits of warm starts. Algorithm 1 summarizes proximal distance iteration applied to a generic problem. According to the MM principle, the algorithm map guarantees descent for fixed $\rho$. Note that the final projection step is justified when condition (11) is satisfied.
Algorithm 1.
Require: An objective $h_\rho$, its gradient $\nabla h_\rho$, and an algorithm map $M(x)$.
1: | Set tolerances $\epsilon_1$ and $\epsilon_2$; fix hyperparameters $\lambda$ and $k$. |
2: | Set the maximum number of $\rho$ values, $t_{\max}$; maximum inner iterations $m_{\max}$. |
3: | Initialize the estimate $x$ and define an increasing sequence $\rho_1 < \rho_2 < \cdots < \rho_{t_{\max}}$. |
4: | for $t = 1, 2, \ldots, t_{\max}$ do |
5: | Warm start the subproblem using the current estimate $x$ and take $\rho = \rho_t$. |
6: | for $m = 1, 2, \ldots, m_{\max}$ do |
7: | Iterate the algorithm map, $x \leftarrow M(x)$. |
8: | if $\|\nabla h_\rho(x)\| \leq \epsilon_1$ then |
9: | Break. |
10: | end if |
11: | end for |
12: | Update the current estimate $x$. |
13: | if $\operatorname{dist}(x, S_k) \leq \epsilon_2$ then |
14: | Break. |
15: | end if |
16: | end for |
17: | Project the final estimate, $x \leftarrow P_{S_k}(x)$. |
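A schematic Julia version of this loop, under our notation, looks as follows. The user-supplied `step!`, `grad`, and `dist_to_Sk` closures, the geometric annealing schedule, and the default tolerances are all illustrative choices on our part rather than the package's API.

```julia
using LinearAlgebra

# Proximal distance iteration: anneal ρ upward, running inner iterations of the
# algorithm map at each fixed ρ, and stop once the distance to S_k is small.
function proximal_distance!(x, step!, grad, dist_to_Sk;
                            ρ=1.0, mult=1.2, tmax=100, mmax=100,
                            gtol=1e-4, dtol=1e-4)
    for t in 1:tmax
        for m in 1:mmax
            step!(x, ρ)                            # one MM or steepest descent update
            norm(grad(x, ρ)) <= gtol && break      # gradient condition (10)
        end
        dist_to_Sk(x) <= dtol && break             # distance condition (11)
        ρ *= mult                                  # anneal the penalty constant
    end
    return x                                       # caller projects onto S_k afterwards
end
```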
3.2. Iteration Maps
We now derive algorithm maps for minimizing the penalized loss (4) using Algorithm 1. The MM strategy repeatedly minimizes the surrogate (5) anchored at the current estimate $\beta_m$. Updating $\beta$ amounts to solving the linear system

$$\Big[\tfrac{2}{n} X^\top X + (\lambda + \rho) J\Big]\beta_{m+1} = \tfrac{2}{n} X^\top t_m + \rho \begin{bmatrix} P_{S_k}(w_m) \\ 0 \end{bmatrix}, \tag{12}$$

where $t_m$ has components $t_{mi} = y_i z_{mi}$ and $J = \operatorname{diag}(I_p, 0)$ exempts the intercept from the ridge and distance penalties. Observe that the solution is non-unique if there are multiple admissible projections of $w_m$ onto $S_k$. Fortunately, the linear system can be solved efficiently using a single thin singular value decomposition (SVD) of $X$ across all iterates. This approach may be expensive if the number of samples $n$ or predictors $p$ is large, but it keeps the factorization independent of $\rho$ and $\lambda$. We document our implementation of this strategy in Appendix A.
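The following simplified sketch (ours, ignoring the intercept exemption in (12) and assuming the $1/n$ normalization used above) illustrates why a single thin SVD suffices: once $X = U \Sigma V^\top$ is computed, the ridge-type system can be solved for any shift $c = \lambda + \rho$ without refactorizing.

```julia
using LinearAlgebra

# Solve (2/n · XᵀX + c·I) β = r from a single thin SVD X = U Σ Vᵀ, since XᵀX = V Σ² Vᵀ.
function ridge_solve(V, σ, r, c, n)
    d = 1 ./ (2 .* σ .^ 2 ./ n .+ c)               # shrinkage along right singular vectors
    Vtr = V' * r
    return V * (d .* Vtr) .+ (r .- V * Vtr) ./ c   # component ⟂ to row(X) scales by 1/c
end

# F = svd(X)                       # computed once, reused for every (λ, ρ) and iterate
# β = ridge_solve(F.V, F.S, rhs, λ + ρ, n)
```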
Alternatively, differentiability of $h_\rho$ implies the first-order tangency condition $\nabla g_\rho(\beta_m \mid \beta_m) = \nabla h_\rho(\beta_m)$, which in turn suggests implementing gradient descent. Minimizing the surrogate in the direction $v_m = \nabla h_\rho(\beta_m)$ leads to the update

$$\beta_{m+1} = \beta_m - \gamma_m v_m, \qquad \gamma_m = \frac{\|v_m\|_2^2}{\tfrac{2}{n}\|X v_m\|_2^2 + (\lambda + \rho)\|J v_m\|_2^2}, \tag{13}$$

derived in the Appendix. A single steepest descent step cannot, in general, achieve exact minimization of $g_\rho$, but one step always drives the surrogate downhill. In any event, the iterates generated by the MM update (12) are generally different from those generated by the steepest descent update (13). Multiple steps of steepest descent can achieve exact minimization of $g_\rho$ and bring steepest descent into alignment with MM. However, exact minimization via steepest descent is costly and unlikely to offer any advantages in precision or computational cost over minimizing the surrogate by conjugate gradients. Fortunately, the algorithm maps (12) and (13) are easily adapted to the nonlinear setting. Simply substitute $K Y$ and $\alpha$ for $X$ and $\beta$, respectively.
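A sketch of one such steepest descent step, written against our reconstruction of the surrogate and step size above (names and layout are our own assumptions):

```julia
using LinearAlgebra

# One exact line-search gradient step on the surrogate at the anchor β = β_m.
# tm holds the working responses yᵢ z_mᵢ, Pw the projection of the current weights
# onto S_k, and X carries a trailing column of ones for the intercept.
function sd_step!(β, X, tm, Pw, λ, ρ)
    n, q = size(X); p = q - 1
    Jβ = vcat(β[1:p], 0.0); JP = vcat(Pw, 0.0)       # J masks the intercept
    v = (2 / n) .* (X' * (X * β .- tm)) .+ λ .* Jβ .+ ρ .* (Jβ .- JP)   # ∇g_ρ(β | β_m)
    γ = sum(abs2, v) / ((2 / n) * sum(abs2, X * v) + (λ + ρ) * sum(abs2, v[1:p]))
    β .-= γ .* v
    return β
end
```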
3.3. Convergence Theory
In this section we briefly address the convergence properties of our proximal distance algorithms for SVM. Convergence theory for gradient descent is more standard and is omitted. Readers mainly interested in applications can skip this discussion and return to it later as desired.
The loss $f$ defined in (2) is convex by the standard closure properties of convex functions. Without loss of generality, let us assume that the intercept $b = 0$ and identify $\beta$ with $w$. In this regard note that the scalar function $u \mapsto \max\{0,\, 1 - u\}^2$ is convex in $u$. To prove that $f$ is coercive, one can invoke the fact that a convex function on $\mathbb{R}^p$ is coercive if and only if it is coercive along all nontrivial rays emanating from the origin (Lange et al., 2000). It is obvious that $f(tv)$ tends to $\infty$ as $t$ tends to $\infty$ if and only if at least one vector $x_i$ satisfies $y_i x_i^\top v < 0$. In other words, $f$ is coercive if and only if the polar cone

$$\{v \in \mathbb{R}^p : y_i x_i^\top v \geq 0 \text{ for all } i\}$$

consists of the trivial vector 0 alone. Under these circumstances $X$ also has full column rank. In practice, the polar cone condition is difficult to check, so we rely on the ridge penalty to enforce coercivity. The SVM loss is also continuously differentiable because $u \mapsto \max\{0, u\}^2$ is so, with derivative 0 at $u = 0$. Furthermore, the gradient

$$\nabla f(\beta) = -\frac{2}{n}\sum_{i=1}^n \max\{0,\, 1 - y_i x_i^\top \beta\}\, y_i x_i$$

is locally bounded and Lipschitz because $u \mapsto \max\{0, u\}$ is so and the class of such functions is closed under the formation of sums and functional compositions. It follows that the surrogate (3) is continuous, strongly convex, and satisfies the Lipschitz condition

$$\|\nabla g(\beta_1 \mid \beta_m) - \nabla g(\beta_2 \mid \beta_m)\| \leq L \|\beta_1 - \beta_2\|$$
for some $L$ on the compact set $\{\beta : f(\beta) \leq f(\beta_0)\}$ whenever $f$ is coercive. Strong convergence results apply in the convex setting, as summarized in the survey of Lange et al. (2021) and in Propositions 5 through 11 of Keys et al. (2019).
Even if the SVM loss functions (2) and (7) are strictly convex, the addition of a nonconvex set constraint $S_k$ sacrifices convexity in the penalized losses (4) and (8). Thus, convergence is governed by nonconvex theory. One issue is the multivalent nature of projections onto sparsity sets. Fortunately, such pathological behavior occurs only on a set of Lebesgue measure 0. Furthermore, Zangwill's Global Convergence Theorem also covers multivalent algorithm maps (Luenberger, 1984, see Section 7.7). To establish convergence to a stationary point, we rely on results from our companion papers (Keys et al., 2019; Landeros et al., 2022). There we define the necessary concepts of semialgebraic sets and functions and stationary points. Proposition 4.1 of Landeros et al. (2022) proves that the sparsity set $S_k$ is semialgebraic. In the current context the fusion matrix $D$ is just the identity $I$. Moreover, our arguments there show that our current loss and penalty are semialgebraic. Because our surrogates are continuous, $\mu$-strongly convex, and $L$-smooth, Proposition 4.4 of Landeros et al. (2022) demonstrates a linear rate of convergence to a stationary point when $w$ has $k$ unambiguous largest components in magnitude. The complementary set of points with $k$ ambiguous largest components in magnitude has Lebesgue measure 0.
3.4. Decision Functions and Multiclass SVM
In linear binary classification, SVMs typically use the decision rule

$$\hat{y}(x) = \operatorname{sign}(x^\top w + b)$$

to assign an instance $x$ a binary label using the SVM's coefficients $w$. The signed decision rule can be extended to multiclass problems using the One-Versus-One (OVO) paradigm (Hsu and Lin, 2002). In OVO, one partitions a dataset with $c$ classes into $\binom{c}{2}$ subsets such that each subset only contains samples from two classes. One then fits SVMs to discriminate between two classes within each subset. After fitting models to each subproblem, one constructs a decision rule through a voting system so that the assigned label of $x$ is given by

$$\hat{y}(x) = \underset{j \in \{1, \ldots, c\}}{\arg\max} \sum_{a < b} \mathbb{1}\{\hat{y}_{ab}(x) = j\},$$

where $\hat{y}_{ab}(x)$ is the label assigned by the binary SVM for classes $a$ and $b$ using, for example, the signed decision rule. Thus, the class receiving the most votes is used as the predicted label for $x$.
Alternatively, one may interpret $x^\top w + b$ as a confidence value indicating the strength of a SVM's prediction. This is useful in the One-Versus-Rest (OVR) paradigm, which constructs $c$ SVMs to distinguish one class from the rest (Hsu and Lin, 2002). In this setting, one assigns a label by aggregating confidence values from each SVM,

$$\hat{y}(x) = \underset{j \in \{1, \ldots, c\}}{\arg\max} \; \big(x^\top w_j + b_j\big).$$

In this case, OVR assigns a class based on the highest confidence value. OVR reduces the number of SVMs required to fit a multiclass classifier and therefore also reduces the number of model parameters, but is known to suffer from worse ambiguity issues when ties occur compared to OVO (van den Burg and Groenen, 2016, see Figure 1).
In our classifiers we favor the OVO paradigm using a weighted voting system based on confidence values. We prefer to avoid the potentially ambiguous decision boundaries of OVR, although it often yields similar results to OVO (Hsu and Lin, 2002).
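Concretely, the confidence-weighted OVO vote we favor can be sketched as follows; the data structure for the collection of binary machines is our own illustrative choice, not the package's interface.

```julia
using LinearAlgebra

# One-versus-one prediction with confidence-weighted voting. `machines` maps a class
# pair (a, b) to coefficients (w, bias) of the binary SVM separating a (+1) from b (-1).
function predict_ovo(x, machines, classes)
    score = Dict(c => 0.0 for c in classes)
    for ((a, b), (w, bias)) in machines
        f = dot(w, x) + bias                                  # signed confidence value
        f >= 0 ? (score[a] += abs(f)) : (score[b] += abs(f))
    end
    return reduce((a, b) -> score[a] >= score[b] ? a : b, classes)   # top-scoring class
end
```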
3.5. Tuning Sparsity with Cross Validation
Without prior knowledge of the true number of causal features, one must tune the sparsity level of the solutions. This amounts to selecting $k$ to restrict the number of active variables through the sparsity constraints $\|w\|_0 \leq k$ or $\|\alpha\|_0 \leq k$ in fitting linear or nonlinear SVMs, respectively. Note that decreasing $k$ increases the number of structural zeros as measured by the sparsity level $s = 1 - k/p$.
We examine the viability of composing K-fold cross validation (CV) with our algorithms. In each fold we initialize model parameters and then minimize the criterion (4) or the criterion (8) on a training set with fixed $\lambda$ but no sparsity penalty ($k = p$). We then gradually increase sparsity to construct a solution path from the fully dense models ($s = 0$) to fully sparse models ($s = 1$). Fitted classifiers in cross validation are parameterized by an ordered pair $(\lambda, s)$. Order is important here because each $\lambda$ is associated with a solution path over $s$. We do not consider $\rho$ as a hyperparameter in cross validation because it is already tuned by proximal distance iteration.
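Under the mapping $s = 1 - k/p$ assumed above, a grid of sparsity levels translates into integer model sizes as follows (a tiny helper of our own).

```julia
# Convert sparsity levels s into integer model sizes k, assuming s = 1 - k/p
# (replace p with n when constraining α in the nonlinear model).
sparsity_to_k(s, p) = round(Int, (1 - s) * p)

sparsity_to_k.(0.0:0.25:1.0, 500)   # -> [500, 375, 250, 125, 0]
```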
In principle there are various performance metrics that may be used to evaluate a classifier. A natural criterion is the minimum value of the regularized empirical risk, but this has two issues. First, it becomes difficult to compare across different loss models such as (2) versus (4) or even (8). Second, it is well known that penalty methods reach a compromise between minimizing the loss and minimizing the penalty in a penalized objective (Lange, 2016). In fact, we expect that minimization of (4) or (8) will inflate the original loss model (2) or (7), respectively. This may even be desirable, since our estimators for model parameters are necessarily biased compared to the standard L2 SVM. Thus, we focus on maximizing prediction accuracy. The following describes how we fit and evaluate classifiers for each pair of hyperparameters $(\lambda, s)$.
Any feature standardization or normalization is based on the training set and applied to the training, validation, and test subsets. For example, if one wishes to standardize the data, then we estimate the mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$ of a feature from the training subset and apply the transformation $x \mapsto (x - \hat{\mu})/\hat{\sigma}$ to all instances of that feature in each of the three data subsets. This is essential to ensure both the validity of cross validation results and the fitting of SVMs, because the latter is not invariant under affine transformations.
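A minimal sketch of this train-then-apply pattern, with our own helper names:

```julia
using Statistics

# Estimate standardization parameters on the training subset only, then apply the
# same column-wise affine transformation to training, validation, and test subsets.
function fit_standardizer(Xtrain)
    μ = mean(Xtrain, dims=1)
    σ = std(Xtrain, dims=1)
    return (μ = μ, σ = map(x -> x == 0 ? one(x) : x, σ))   # guard constant columns
end

standardize(X, t) = (X .- t.μ) ./ t.σ

# t = fit_standardizer(Xtrain)
# Xtrain, Xval, Xtest = standardize(Xtrain, t), standardize(Xval, t), standardize(Xtest, t)
```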
In each fold, a training set is used to fit model parameters. The fitted classifier is evaluated on a validation set and a test set. The training and validation sets are shuffled between folds, whereas the test set is fixed and never participates in fitting.
- In a linear SVM the size of a separating margin is given by $2/\|w\|_2$. Thus, smaller values of $\lambda$ induce smaller margins and may lead to overfitting. Moderate values of $\lambda$ are desirable and tend to avoid overfitting. Average prediction accuracy on validation sets is measured by

$$\mathrm{CV}(\lambda, s) = \frac{1}{K}\sum_{j=1}^{K} \frac{1}{n_j}\sum_{i=1}^{n_j} \mathbb{1}\{L_{ij} = \hat{L}_{ij}\},$$

where $n_j$ is the number of validation samples in fold $j$. Here $L_{ij}$ is the label of a validation sample $i$ in fold $j$ and $\hat{L}_{ij}$ is its predicted label. To select an optimal pair $(\lambda, s)$, we maximize $\mathrm{CV}(\lambda, s)$ as a function of $s$ and $\lambda$. This choice prioritizes maximizing prediction accuracy over parsimony ($s$) and generalizability ($\lambda$).
In addition, repeated K-fold cross validation can be used to assess stability of variable selection. This involves shuffling samples between the training and validation sets and repeating the K-fold cross validation procedure. Ideally, each replicate should select a similar pair $(\lambda, s)$ with similar prediction accuracies on validation and test subsets. Algorithm 2 summarizes the flow of our cross validation procedure.
Algorithm 2.
1: | Split the dataset into test and cross-validation subsets. |
2: | for each replicate do |
3: | Shuffle the cross-validation subset. |
4: | for each fold $j = 1, \ldots, K$ do |
5: | Split the cross-validation subset into training and validation subsets. |
6: | Estimate transformation parameters from the training subset. |
7: | Apply the transformation to the training, validation, and test subsets to standardize/normalize the data. |
8: | for each $\lambda$ do |
9: | Initialize model parameters by fitting an L2 SVM using (2) or (7). |
10: | for each $s$ do |
11: | Fit a sparse classifier with $k$ active features corresponding to sparsity level $s$. |
12: | Evaluate prediction accuracy on the training, validation, and test subsets. |
13: | end for |
14: | end for |
15: | end for |
16: | Evaluate average scores over folds for each subset. |
17: | Rank models by their augmented scores. |
18: | Identify and record the optimal hyperparameters $s$, $k$, and $\lambda$. |
19: | end for |
4. Numerical Experiments
We assess the proposed SVM algorithms by testing them on both our own simulated examples (see Appendix B) and datasets from the UCI Machine Learning Repository (Dua and Graff, 2019). The chosen datasets, which address both overdetermined and underdetermined problems, are listed in Table 1. We use the DataDeps.jl package (White et al., 2019) to process datasets and make them reproducible in our experiments. We begin by demonstrating that our sparse SVMs can recover sparse models in various high-dimensional settings. Next, we report repeated cross validation results on selected datasets and compare against SVMs fitted using all or a reduced subset of variables. Finally, we conclude with a comparison to classical SVMs implemented in LIBSVM and LIBLINEAR.
Table 1.
Dataset | Classes | Samples | Features | Model | Replicates | Folds | Train | Test |
---|---|---|---|---|---|---|---|---|
synthetic | 2 | 1000 | 500 | Linear | 10 | 5 | 800 | 200 |
synthetic-hard | 2 | 1000 | 500 | Linear | 10 | 5 | 800 | 200
iris | 3 | 150 | 4 | Linear | 10 | 3 | 120 | 30
bcw | 2 | 683 | 9 | Linear | 10 | 3 | 546 | 137
splice | 3 | 3176 | 180 | Linear | 10 | 5 | 2541 | 635
optdigits | 10 | 5620 | 64 | Linear | 10 | 5 | 4496 | 1124
letters | 26 | 20000 | 16 | Linear | 1 | 5 | 16000 | 4000
TCGA-HiSeq | 5 | 801 | 20531 | Linear | 1 | 3 | 641 | 160
spiral | 3 | 1000 | 2 | Nonlinear | 10 | 5 | 800 | 200 |
spiral-hard | 3 | 1000 | 2 | Nonlinear | 10 | 5 | 800 | 200
4.1. Sparse Recovery in High-Dimensional Scenarios
Let us first characterize the behavior of our sparse SVMs in two regimes, overdetermined and underdetermined, by a simulation study. Given target sample and variable sizes $n$ and $p$, respectively, we generate $X$ by sampling row vectors from a standard multivariate normal distribution, $\mathcal{N}(0, I_p)$, and set $w_j \neq 0$
for $k$ randomized components in the support of $w$; that is, $\|w\|_0 = k$. Binary labels are then assigned via the linear SVM decision rule $y_i = \operatorname{sign}(x_i^\top w)$. Overdetermined and underdetermined scenarios are generated by fixing $n = 500$ or $p = 500$ and then varying the remaining problem dimension. We also fix $\lambda = 1$ along with the algorithm's control parameters. For each simulation, we fit multiple classifiers by varying the target number of active features $k$, using both MM and SD, and measure iterations to convergence, wall time, and prediction accuracy on a separate test set. Fitted solutions $\hat{w}$ are compared against the ground truth $w$ in terms of accuracy, positive predictive value (PPV), and negative predictive value (NPV). In this context we have the following definitions of true and false positives/negatives (TP/FP/TN/FN)
 | $\hat{w}_j \neq 0$ | $\hat{w}_j = 0$
---|---|---
$w_j \neq 0$ | TP | FN
$w_j = 0$ | FP | TN
from which we derive PPV and NPV. Prevalence adjustment is important for our simulation study because we fix k = 50 and expect the proportion of causal variables to shrink.
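An illustrative version of this simulation design and of the support-recovery counts is sketched below. The nonzero coefficient value, the RNG seed, and the unadjusted (no prevalence correction) PPV/NPV are our own simplifications.

```julia
using Random

# Gaussian features, a k-sparse weight vector, labels from the linear decision rule.
function simulate(n, p, k; rng=Xoshiro(1234))
    X = randn(rng, n, p)                      # rows xᵢ ~ N(0, I_p)
    w = zeros(p)
    w[randperm(rng, p)[1:k]] .= 1.0           # k randomized causal components
    y = sign.(X * w)                          # labels from the decision rule
    return X, y, w
end

# Support-recovery confusion counts and the derived PPV/NPV.
function recovery_rates(w, w_hat)
    TP = count((w .!= 0) .& (w_hat .!= 0));  FN = count((w .!= 0) .& (w_hat .== 0))
    FP = count((w .== 0) .& (w_hat .!= 0));  TN = count((w .== 0) .& (w_hat .== 0))
    return (PPV = TP / (TP + FP), NPV = TN / (TN + FN))
end
```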
Figure 2 summarizes our results. In the two cases $n \geq p$ and $n < p$, it is clear that our classifiers achieve acceptable performance on test sets, regardless of which algorithm is used to fit models. Moreover, in the former case, maximal prediction accuracy is achieved near $k = 50$, which also aligns with high PPV and NPV in identifying the correct subset of causal variables. These positive results are more difficult to achieve when $p$ exceeds $n$, but the outlook is not so pessimistic. The fact that PPV increases monotonically as the number of causal variables shrinks suggests that failure to recover all the causal variables may be due to their exchangeability. In terms of scaling behavior, it is somewhat remarkable that the number of required iterations decreases as one increases $n$. However, the wall time does increase consistently. It should be noted that we do not include the cost of extracting the SVD of $X$ in Algorithm MM, which would shift its time curve upwards.
4.2. Cross Validation Results
We report our results of using Algorithm 2 in combination with Algorithm 1 to select sparse SVMs across selected datasets. Table 2 illustrates the performance of Algorithm MM on the synthetic example in detail, and Table 3 reports the highlights across all 10 examples. Results for Algorithm SD are similar. Appendix D records distributions of the hyperparameters $s$, $k$, and $\lambda$ along with validation prediction accuracy over 10 replicates of K-fold cross validation.
Table 2.
k | Iterations | Loss | Support Vectors | Train Accuracy (%) | Validation Accuracy (%) | Test Accuracy (%)
---|---|---|---|---|---|---
500 | 7 | 0.300 | 623 | 98 | 77 | 72 |
(0) | (0.004) | (1) | (0) | (2) | (1) | |
375 | 210 | 0.301 | 622 | 98 | 77 | 73 |
(4) | (0.004) | (2) | (0) | (2) | (1) | |
250 | 421 | 0.313 | 622 | 98 | 77 | 74 |
(7) | (0.004) | (1) | (0) | (2) | (1) | |
125 | 599 | 0.351 | 619 | 97 | 79 | 76 |
(9) | (0.004) | (2) | (0) | (1) | (1) | |
73 | 696 | 0.388 | 622 | 97 | 82 | 79 |
(11) | (0.004) | (1) | (0) | (2) | (1) | |
43 | 747 | 0.421 | 625 | 96 | 85 | 82 |
(4) | (0.005) | (1) | (0) | (2) | (1) | |
25 | 813 | 0.450 | 627 | 95 | 86 | 85 |
(4) | (0.004) | (1) | (0) | (2) | (1) | |
15 | 820 | 0.471 | 626 | 95 | 87 | 88 |
(6) | (0.003) | (1) | (0) | (2) | (1) | |
9 | 866 | 0.488 | 627 | 95 | 91 | 91 |
(9) | (0.002) | (1) | (0) | (2) | (1) | |
5 | 885 | 0.503 | 628 | 96 | 94 | 93 |
(3) | (0.002) | (1) | (0) | (2) | (0) | |
3 | 891 | 0.513 | 628 | 97 | 96 | 95 |
(2) | (0.002) | (1) | (0) | (1) | (0) | |
2 | 896 | 0.519 | 627 | 99 | 100 | 100 |
(3) | (0.002) | (1) | (0) | (0) | (0) | |
1 | 946 | 0.747 | 640 | 76 | 73 | 76 |
(3) | (0.002) | (0) | (0) | (0) | (1) |
Table 3.
Example | Model | λ | k | Total Active Variables | Average Active Variables | SVs | Margin | Train Accuracy (%) | Test Accuracy (%)
---|---|---|---|---|---|---|---|---|---
synthetic | (i) | 0.10 | 2 | 2 | 2 | 518 | 0.95 | 100 | 100 |
(ii) | 0.10 | 2 | 2 | 2 | 518 | 0.94 | 100 | 100 | |
(iii) | 0.10 | 500 | 500 | 500 | 527 | 0.96 | 100 | 76 | |
synthetic-hard | (i) | 10.00 | 2 | 2 | 2 | 800 | 15.30 | 94 | 96 |
(ii) | 10.00 | 2 | 2 | 2 | 800 | 15.28 | 94 | 96 | |
(iii) | 10.00 | 500 | 500 | 500 | 800 | 10.69 | 90 | 70 | |
bcw | (i) | 1.00 | 9 | 9 | 9 | 215 | 6.09 | 97 | 98 |
(ii) | 1.00 | 9 | 9 | 9 | 215 | 6.09 | 97 | 98 | |
(iii) | 1.00 | 9 | 9 | 9 | 215 | 6.09 | 97 | 98 | |
iris | (i) | 0.10 | 1 | 2 | 1 | 117 | 0.98 | 98 | 90 |
(ii) | 0.10 | 1 | 2 | 2 | 109 | 1.14 | 98 | 93 | |
(iii) | 0.10 | 4 | 4 | 4 | 98 | 1.24 | 98 | 93 | |
splice | (i) | 0.10 | 145 | 209 | 145 | 2027 | 1.04 | 97 | 96 |
(ii) | 0.10 | 145 | 209 | 209 | 2031 | 1.04 | 98 | 95 | |
(iii) | 0.10 | 240 | 240 | 240 | 2025 | 1.04 | 98 | 95 | |
optdigits | (i) | 1.00 | 48 | 56 | 48 | 1544 | 8.71 | 99 | 96 |
(ii) | 1.00 | 48 | 56 | 56 | 1544 | 8.71 | 99 | 96 | |
(iii) | 1.00 | 64 | 64 | 64 | 1544 | 8.71 | 99 | 96 | |
letters | (i) | 0.10 | 16 | 16 | 16 | 13945 | 1.80 | 84 | 82 |
(ii) | 0.10 | 16 | 16 | 16 | 13945 | 1.80 | 84 | 82 | |
(iii) | 0.10 | 16 | 16 | 16 | 13945 | 1.80 | 84 | 82 | |
TCGA-HiSeq | (i) | 10.00 | 5066 | 15778 | 5066 | 320 | 38.58 | 100 | 100 |
(ii) | 10.00 | 5066 | 15778 | 15778 | 271 | 41.85 | 100 | 100 | |
(iii) | 10.00 | 20264 | 20258 | 20258 | 273 | 42.90 | 100 | 100 | |
spiral | (i) | 1.00 | 400 | 561 | 267 | 561 | 2.97 | 97 | 98 |
(ii) | 1.00 | 400 | 561 | 374 | 561 | 2.99 | 97 | 98 | |
(iii) | 1.00 | 800 | 800 | 533 | 800 | 3.32 | 97 | 98 | |
spiral-hard | (i) | 1.00 | 669 | 771 | 446 | 771 | 3.82 | 89 | 90 |
(ii) | 1.00 | 669 | 771 | 514 | 771 | 3.85 | 90 | 89 | |
(iii) | 1.00 | 800 | 800 | 533 | 800 | 3.84 | 89 | 90 |
In Table 2 it is clear that, for fixed $\lambda$, including extraneous variables may lead to overfitting training data in binary classification. The summary in Table 3 compares results for (i) our sparse classifiers that target a specific number of active variables $k$, (ii) a reduced model using the same active variables as the optimal sparse model, and (iii) the full L2 SVM using all available variables. Rows corresponding to the synthetic, synthetic-hard, and bcw examples cover binary classification with a linear model. In these examples all 3 models deliver model coefficients of similar magnitude based on the size of the margin $2/\|w\|_2$, but our sparse classifiers better generalize to classifying novel instances when a sparse signal is present.
In multiclass classification, one must partition data into subsets to fit binary SVMs discriminating between two classes. This increases the total number of model parameters and makes it possible for the number of active variables to exceed the target model size $k$. With this in mind, our examples of multiclass problems with linear classifiers (iris, splice, optdigits, letters, and TCGA-HiSeq) echo the results for linear binary classification. However, it is now more challenging to successfully recover a sparse model because each nested binary SVM targets the same number of variables $k$. Comparing the total number of active variables across all SVMs against the average number of active variables per SVM illustrates this point in the iris, splice, optdigits, and TCGA-HiSeq examples. The splice junction data, in which variables are binary indicators for nucleotides T, C, G, and A within DNA sequences of 60 base pairs, shows that a uniform sparsity constraint only modestly reduces the number of selected variables from 240 to 209 because there is only some overlap between SVMs (145 variables per SVM). In image data such as the optdigits example, dropped variables correspond mainly to pixels at the boundary of images that are more likely to be black pixels. Behavior in the gene expression example, TCGA-HiSeq, is somewhat disappointing: repeated CV selects the model size k = 5066 per SVM, resulting in 15778 genes discriminating between 5 cancer types. This is still a large number of genes to consider compared to the original set of 20264 genes. To our credit, Figure 3 illustrates that it may be possible to select significantly smaller values for $k$, thereby decreasing the total number of model parameters, in exchange for small decreases in prediction accuracy. Finally, it is interesting to note that our nonlinear SVM model successfully controls the number of support vectors while maintaining good classification accuracy in the spiral examples despite suffering from the same sparsity deficiency in multiclass problems.
4.3. Comparison to Existing Models
We compare our sparsity-based models to well-known classics using LIBLINEAR (Fan et al., 2008) through the Julia wrapper package LIBSVM.jl. Specifically, we compare to
L2R, an option from LIBLINEAR for the standard L2 SVM (2), and
L1R, an option from LIBLINEAR for the L1-regularized version of (2).
The models used in LIBLINEAR impose regularization on the hinge term, rather than the penalty term, through a tuning constant $C$.
Table 4 reports our results in repeated cross validation. Our SVM training algorithms are comparable to existing approaches, albeit slower across all examples. To our credit the synthetic example underscores the superiority of sparsity constraints over shrinkage-based penalties. Namely, algorithms MM and SD successfully fit sparse classifiers with fewer variables and achieve superior prediction capability as reflected in validation and test accuracies. Specifically, our sparse approach compares favorably against the L2-regularized SVM in our synthetic datasets, designed to have only two informative variables, even when the data are not linearly separable. Moreover, it is clear that our approach gives similar classification predictions to the L1-regularized SVM on the synthetic examples. Similarities in performance scores disappear on multiclass problems, indicating that our algorithms converge to distinct solutions. It is not immediately clear whether our L0 approach recovers sparser solutions, controls false positives, or mitigates shrinkage compared to the lasso. The L1-regularized SVM is arguably too aggressive in selecting variables on the splice and letters examples but it achieves superior selection on the TCGA-HiSeq example. This suggests that, while the L1 classifier also imposes a uniform penalty on each nested binary SVM, the continuous penalty can be more flexible on some problems. Thus, our sparse classifiers are conservative compared to shrinkage-based methods.
Table 4.
Example | Algorithm | λ | k | Total Time [s] | Total Active Variables | Average Active Variables | SVs | Train Accuracy (%) | Validation Accuracy (%) | Test Accuracy (%)
---|---|---|---|---|---|---|---|---|---|---
synthetic | MM | 10 | 2 | 3.98 | 2 | 2 | 573 | 99 | 99 | 100 |
SD | 0.1 | 2 | 4.05 | 2 | 2 | 506 | 99 | 99 | 100 | |
L2 | 0.55 | 500 | 0.05 | 500 | 500 | 529 | 99 | 78 | 74 | |
L1 | 0.1 | 500 | 0.025 | 271 | 271 | 471 | 99 | 99 | 100 | |
synthetic-hard | MM | 10 | 2 | 4.13 | 2 | 2 | 640 | 94 | 94 | 96 |
SD | 10 | 2 | 4.16 | 2 | 2 | 640 | 94 | 94 | 96 | |
L2 | 1 | 500 | 0.055 | 500 | 500 | 627 | 97 | 74 | 70 | |
L1 | 0.1 | 500 | 0.021 | 271 | 271 | 567 | 94 | 94 | 96 | |
bcw | MM | 0.56 | 6 | 0.64 | 6 | 6 | 156 | 97 | 97 | 96 |
SD | 0.32 | 8 | 2.35 | 8 | 8 | 129 | 98 | 98 | 95 | |
L2 | 0.78 | 9 | 0.007 | 9 | 9 | 387 | 87 | 87 | 87 | |
L1 | 0.04 | 9 | 0.02 | 9 | 9 | 369 | 87 | 86 | 87 | |
iris | MM | 0.02 | 4 | 0.254 | 4 | 3 | 45 | 99 | 98 | 94 |
SD | 0.03 | 3 | 0.282 | 4 | 3 | 47 | 98 | 98 | 95 | |
L2 | 0.02 | 4 | 0.002 | 4 | 4 | 80 | 86 | 85 | 85 | |
L1 | 0.01 | 4 | 0.006 | 4 | 3 | 80 | 85 | 85 | 84 | |
splice | MM | 0.1 | 144 | 9.51 | 208 | 150 | 1628 | 98 | 96 | 95 |
SD | 0.1 | 102 | 12.5 | 178 | 104 | 1661 | 97 | 96 | 97 | |
L2 | 0.1 | 240 | 0.125 | 240 | 240 | 1779 | 97 | 96 | 95 | |
L1 | 0.1 | 240 | 0.056 | 30 | 11 | 2032 | 85 | 84 | 85 | |
optdigits | MM | 0.1 | 36 | 17.1 | 55 | 36 | 901 | 100 | 98 | 96 |
SD | 1 | 34 | 192 | 54 | 36 | 1082 | 100 | 98 | 96 | |
L2 | 0.1 | 64 | 0.216 | 62 | 62 | 1693 | 98 | 96 | 95 | |
L1 | 0.1 | 64 | 0.534 | 43 | 17 | 3022 | 95 | 94 | 92 | |
letters | MM | 0.1 | 16 | 362 | 16 | 16 | 11157 | 84 | 83 | 82 |
SD | 0.1 | 14 | 3000 | 16 | 14 | 11174 | 84 | 83 | 82 | |
L2 | 0.1 | 16 | 2.38 | 16 | 16 | 12800 | 67 | 67 | 65 | |
L1 | 0.1 | 16 | 8.81 | 14 | 6 | 12800 | 46 | 46 | 45 | |
TCGA-HiSeq | MM | 10 | 5066 | 9980 | 15878 | 5066 | 249 | 100 | 100 | 100 |
SD | 0.1 | 1964 | 7710 | 8727 | 1964 | 240 | 100 | 100 | 100 | |
L2 | 10 | 20264 | 72 | 20237 | 20237 | 427 | 100 | 100 | 100 | |
L1 | 0.1 | 20264 | 2.78 | 79 | 16 | 427 | 100 | 100 | 100 |
5. Discussion
We have demonstrated the benefits of conceptually simple proximal distance algorithms for binary and multiclass classification problems on both linear SVMs and nonlinear kernel SVMs. The proximal distance principle makes it possible to attack parsimony directly through squared distance penalties. This direct approach (a) restores differentiability via quadratic surrogate functions, (b) potentially avoids the shrinkage inherent in lasso-based algorithms, (c) identifies sparser models with good predictive power, and (d) substitutes a discrete interpretable sparsity level for the continuous hyperparameters of competing methods. To our surprise, the more expensive Algorithm MM scales better on high-dimensional data due to its ability to quickly drive solutions close to a desired sparsity set.
Algorithm acceleration is essential in overcoming the unfortunate cost of (repeated) cross validation. We found experimentally that inclusion of ridge regularization is essential in preventing coefficients from diverging to $\infty$. Fortunately, addition of the ridge penalty convexifies our distance-penalized objectives and accelerates convergence overall. We noticed a few other tactics that lower computational costs.
In multiclass classification, Algorithm MM greatly benefits from the OVO paradigm because it reduces the dimensions of each required singular value decomposition. The OVR paradigm only requires a single, albeit large, decomposition, provided the classifier is linear. It is also noteworthy that our implementation uses dense linear algebra operations, so it may be possible to speed up model fitting by tracking the active parameter set. In contrast, it is not clear whether the SD variant is truly a viable alternative to MM unless the required SVD is prohibitively expensive to compute. We observe that SD benefits from Nesterov acceleration in cutting down the number of iterations, but further work is needed to make it useful in cross validation. Other computational tricks may further lower computational costs (Schölkopf and Smola, 2018); these warrant further experimentation.
While we are pleased with our results, particularly for binary classification tasks, much is left to be desired for multiclass problems. Relying on multiple SVMs to handle multiclass problems introduces $\binom{c}{2}$ subproblems for $c$ classes under OVO. Furthermore, different decision boundaries in the OVO paradigm may be driven by different features, obscuring the universal features that discriminate between classes. Hence, it is natural to investigate multiclass methods beyond hyperplane separation. Our previous research on multivertex discriminant analysis (MVDA) (Lange and Wu, 2008) explored a multiclass model that represents classes geometrically as vertices of a regular simplex embedded in Euclidean space rather than binary choices from $\{-1, +1\}$. MVDA takes advantage of $\epsilon$-insensitive norms and generalizes to nonlinear classification via the kernel trick (Wu and Lange, 2010). We plan to revisit MVDA and incorporate sparsity based on the proximal distance principle and possibly Huber hinge errors (van den Burg and Groenen, 2016). Given the length of the current paper and the many unresolved challenges ahead, this goal is best left to a future paper.
Supplementary Material
Acknowledgments
The authors gratefully acknowledge USPHS grants R35 GM141798 and HG006139
Footnotes
SUPPLEMENTARY MATERIAL
Appendix: The file “appendix.pdf” provides derivations for both Algorithm MM and Algorithm SD, a description of simulated datasets, implementation details, and stability results for variable selection. (.pdf)
Julia code: The file “SparseSVM.zip” contains Julia code to reproduce our numerical experiments. Software is also available at https://github.com/alanderos91/SparseSVM.jl. Contents are structured as a Julia project to handle software and data dependencies in an automated fashion. See the project’s README for details. (.zip)
References
- Barghout L (2015). Spatial-Taxon Information Granules as Used in Iterative Fuzzy-Decision-Making for Image Segmentation. In Granular Computing and Decision-Making: Interactive and Iterative Approaches, Studies in Big Data, pp. 285–318. Springer International Publishing.
- Beltrami EJ (1970). An Algorithmic Approach to Nonlinear Analysis and Optimization. Academic Press.
- Ben-Hur A, Horn D, Siegelmann HT, and Vapnik V (2002). Support vector clustering. The Journal of Machine Learning Research 2, 125–137.
- Cauwenberghs G and Poggio T (2000). Incremental and decremental support vector machine learning. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS'00, pp. 388–394. MIT Press.
- Chi EC, Zhou H, and Lange K (2014). Distance majorization and its applications. Mathematical Programming 146 (1), 409–436.
- Cortes C and Vapnik V (1995). Support-vector networks. Machine Learning 20 (3), 273–297.
- Courant R (1943). Variational Methods for the Solution of Problems of Equilibrium and Vibrations. Verlag Nicht Ermittelbar.
- Decoste D and Schölkopf B (2002). Training Invariant Support Vector Machines. Machine Learning 46 (1), 161–190.
- Dua D and Graff C (2019). UCI Machine Learning Repository.
- Dunbrack RL (2006). Sequence comparison and protein structure prediction. Current Opinion in Structural Biology 16 (3), 374–384.
- El Ghaoui L, Viallon V, and Rabbani T (2012). Safe feature elimination for the lasso and sparse supervised learning problems. Pacific Journal of Optimization 8 (4), 667–698.
- Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, and Lin C-J (2008). LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (61), 1871–1874.
- Frieß T-T and Harrison RF (1998a). The Kernel Adatron with Bias Unit: Analysis of the Algorithm (Part 1). ACSE Research Report 729, University of Sheffield Department of Automatic Control and Systems Engineering.
- Frieß T-T and Harrison RF (1998b). The Kernel Adatron with Bias Unit: Analysis of the Algorithm (Part 2). ACSE Research Report 728, University of Sheffield Department of Automatic Control and Systems Engineering.
- Groenen PJF, Nalbantov G, and Bioch JC (2008). SVM-Maj: A majorization approach to linear support vector machines with different hinge errors. Advances in Data Analysis and Classification 2 (1), 17–43.
- Hsu C-W and Lin C-J (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.
- Jaggi M (2014). An Equivalence between the Lasso and Support Vector Machines. In Regularization, Optimization, Kernels, and Support Vector Machines. Chapman and Hall/CRC.
- Joachims T (1998). Text categorization with Support Vector Machines: Learning with many relevant features. In Machine Learning: ECML-98, Lecture Notes in Computer Science, pp. 137–142. Springer.
- Keys KL, Zhou H, and Lange K (2019). Proximal Distance Algorithms: Theory and Practice. Journal of Machine Learning Research 20 (66), 1–38.
- Kimeldorf G and Wahba G (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 33 (1), 82–95.
- Landeros A, Padilla OHM, Zhou H, and Lange K (2022). Extensions to the Proximal Distance Method of Constrained Optimization. Journal of Machine Learning Research 23 (182), 1–45.
- Lange K (2016). MM Optimization Algorithms. SIAM-Society for Industrial and Applied Mathematics.
- Lange K, Hunter DR, and Yang I (2000). Optimization Transfer Using Surrogate Objective Functions. Journal of Computational and Graphical Statistics 9 (1), 1–20.
- Lange K, Won J-H, Landeros A, and Zhou H (2021). Nonconvex Optimization via MM Algorithms: Convergence Theory. In Wiley StatsRef: Statistics Reference Online, pp. 1–22. John Wiley & Sons, Ltd.
- Lange K and Wu TT (2008). An MM Algorithm for Multicategory Vertex Discriminant Analysis. Journal of Computational and Graphical Statistics 17 (3), 527–544.
- Laskov P, Gehl C, Krüger S, and Müller K-R (2006). Incremental Support Vector Learning: Analysis, Implementation and Applications. Journal of Machine Learning Research 7 (69), 1909–1936.
- Luenberger DG (1984). Linear and Nonlinear Programming. Addison-Wesley.
- Mangasarian OL and Musicant DR (2001). Lagrangian support vector machines. The Journal of Machine Learning Research 1, 161–177.
- Mercer J (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209 (441–458), 415–446.
- Nguyen HD and McLachlan GJ (2017). Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization–Minimization Algorithm Approach. In Proceedings of the 2017 Future Technologies Conference, pp. 439–446. The Science and Information Organization.
- Ogawa K, Suzuki Y, and Takeuchi I (2013). Safe Screening of Non-Support Vectors in Pathwise SVM Computation. In Proceedings of the 30th International Conference on Machine Learning, pp. 1382–1390. PMLR.
- Pradhan S, Ward W, Hacioglu K, Martin JH, and Jurafsky D (2004). Shallow semantic parsing using support vector machines. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 233–240. Association for Computational Linguistics.
- Schölkopf B, Herbrich R, and Smola AJ (2001). A Generalized Representer Theorem. In Computational Learning Theory, Lecture Notes in Computer Science, pp. 416–426. Springer.
- Schölkopf B and Smola AJ (2018). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press.
- Sewak M, Vaidya P, Chan C-C, and Duan Z-H (2007). SVM approach to breast cancer classification. In Second International Multi-Symposiums on Computer and Computational Sciences (IMSCCS 2007), pp. 32–37.
- Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, and Tibshirani RJ (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (2), 245–266.
- van den Burg GJJ and Groenen PJF (2016). GenSVM: A Generalized Multiclass Support Vector Machine. Journal of Machine Learning Research 17 (224), 1–42.
- Wang J, Zhou J, Wonka P, and Ye J (2013). Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, Volume 26. Curran Associates, Inc.
- White L, Togneri R, Liu W, and Bennamoun M (2019). DataDeps.jl: Repeatable Data Setup for Reproducible Data Science. Journal of Open Research Software 7 (1), 33.
- Wu TT and Lange K (2010). Multicategory vertex discriminant analysis for high-dimensional data. The Annals of Applied Statistics 4 (4), 1698–1721.
- Xu J, Chi E, and Lange K (2017). Generalized Linear Model Regression under Distance-to-set Penalties. In Advances in Neural Information Processing Systems, Volume 30. Curran Associates, Inc.
- Zhu J, Rosset S, Hastie T, and Tibshirani R (2003). 1-norm support vector machines. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, pp. 49–56. MIT Press.