Abstract
The support vector machine (SVM) is a popular learning method for binary classification. Standard SVMs treat all the data points equally, but in some practical problems it is more natural to assign different weights to observations from different classes. This leads to a broader class of learning methods, the so-called weighted SVMs (WSVMs), and one of their important applications is to estimate class probabilities in addition to learning the classification boundary. There are two parameters associated with the WSVM optimization problem: one is the regularization parameter and the other is the weight parameter. In this paper we first establish that the WSVM solutions are jointly piecewise-linear with respect to both the regularization and the weight parameter. We then develop a state-of-the-art algorithm that can compute the entire trajectory of the WSVM solutions for every pair of regularization parameter and weight parameter at a feasible computational cost. The derived two-dimensional solution surface provides theoretical insight into the behavior of the WSVM solutions. Numerically, the algorithm can greatly facilitate the implementation of the WSVM and automate the selection of the optimal regularization parameter. We illustrate the new algorithm on various examples.
Keywords: binary classification, probability estimation, solution surface, support vector machine, weighted support vector machine
1 Introduction
Frequently encountered in real applications is binary classification, in which we are given a training set {(xi, yi), i = 1, ···, n} of size n and the goal is to learn a classification rule. Here xi ∈ ℝp and yi ∈ {−1, 1} denote a p-dimensional predictor and a binary response (or class label), respectively, for the ith example. The primary goal of binary classification is to construct a classifier that predicts the class of a new object from its predictors. Among many existing methods for binary classification, the support vector machine (SVM) (Vapnik; 1996) is one of the most well-known classifiers, and it has gained much popularity since its introduction. It originates from the simple idea of finding an optimal hyperplane to separate two classes, where the hyperplane is optimal in the sense that the geometric margin between the two classes is maximized. Later it was shown by Wahba (1999) that the SVM can be cast in the general regularization framework by solving
min_{f ∈ ℱ} Σ_{i=1}^n H1(yif(xi)) + λ J(f),    (1)
where H1(u) = max(1 − u, 0) is the hinge loss function, J(f) denotes the roughness penalty of a function f(·) in a function space ℱ, and the sign of f(x) for a given predictor x will be used for class prediction. Here λ > 0 is a regularization parameter which balances data fitting measured by the hinge loss and model complexity measured by the roughness penalty. Lin (2002) shows that the hinge loss is Fisher consistent. See Liu (2007) for a more detailed discussion on Fisher consistency of different loss functions. A common choice of the penalty is the squared ℓ2-norm penalty. Other choices include the l1-norm penalty (Zhu et al.; 2003; Wang and Shen; 2007) and the SCAD penalty (Zhang et al.; 2006). In general, we set ℱ to be the Reproducing Kernel Hilbert Space (RKHS, Wahba; 1990) ℋK, generated by a non-negative definite kernel K(·,·). By the Representer Theorem (Kimeldorf and Wahba; 1971), the optimizer of (1) has a finite-dimensional representation given by
f(x) = b + Σ_{i=1}^n θiK(x, xi).    (2)
Using (2), it can be shown that the roughness penalty of f in ℋK is Σ_{i=1}^n Σ_{j=1}^n θiθjK(xi, xj). Then the SVM estimates f(x) by solving
min_{b,θ} Σ_{i=1}^n H1(yif(xi)) + (λ/2) Σ_{i=1}^n Σ_{j=1}^n θiθjK(xi, xj),    (3)
with f given by (2).
For the SVM (3), Hastie et al. (2004) showed that the optimizers b, θ1, ···, θn change piecewise-linearly when the regularization parameter λ changes and proposed an efficient solution path algorithm. From now on, we refer to this path as a λ-path.
Note that, in the standard SVM, each observation is treated equally no matter which class it belongs to. Yet this may not always be optimal. In some situations, it is desirable to assign different weights to the observations from different classes; one such example is when one type of misclassification induces a larger cost than the other. Motivated by this, Lin et al. (2004) considered the weighted SVM (WSVM) obtained by solving
min_{b,θ} Σ_{i=1}^n πiH1(yif(xi)) + (λ/2) Σ_{i=1}^n Σ_{j=1}^n θiθjK(xi, xj),    (4)
where the weights πi are given by

πi = 1 − π if yi = 1, and πi = π if yi = −1,

with π ∈ (0, 1). Each point (xi, yi) is thus associated with one weight πi, whose value depends on the label yi. According to Lin et al. (2004), the WSVM classifier provides a consistent estimate of sign(p(x) − π) for any x, where p(x) is the conditional class probability p(x) = P(Y = 1|X = x). Using this fact, Wang et al. (2008) proposed the weighted SVM for probability estimation. The basic idea is to divide and conquer: the probability estimation problem is converted into many classification sub-problems, each assigned a different weight parameter π, and these sub-problems are solved separately. The solutions are then assembled to construct the final probability estimator. A more detailed description is as follows. Consider a sequence 0 < π(1) < ··· < π(M) < 1. For each m = 1, ···, M, solve (4) with π = π(m) and denote the solution by f̂m(·). Finally, for any x, construct the probability estimator as p̂(x) = {π(m+) + π(m−)}/2, where m+ = max{m : f̂m(x) > 0} and m− = min{m : f̂m(x) < 0}. Advantages of the weighted SVM include its flexibility and its capability of handling high-dimensional data.
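To make the divide-and-conquer construction concrete, the following R sketch assembles the probability estimate at a single query point from the decision values of the M sub-problems. The names (assemble.prob, fhat, pi.grid) are hypothetical, and the treatment of the boundary cases in which all decision values share the same sign is a simple convention of ours rather than part of the original proposal.

```r
## Sketch of the divide-and-conquer probability estimator of Wang et al.
## (2008) at a single point x.  'pi.grid' is the sequence pi_(1) < ... < pi_(M)
## and 'fhat' the corresponding decision values f-hat_m(x); names are ours.
assemble.prob <- function(fhat, pi.grid) {
  pos <- which(fhat > 0)                   # sub-problems labelling x as +1
  neg <- which(fhat < 0)                   # sub-problems labelling x as -1
  ## f-hat_m(x) > 0 estimates p(x) > pi_(m), so p(x) is bracketed by the
  ## largest such pi and the smallest pi whose decision value is negative.
  pi.plus  <- if (length(pos) > 0) pi.grid[max(pos)] else 0   # convention if all negative
  pi.minus <- if (length(neg) > 0) pi.grid[min(neg)] else 1   # convention if all positive
  (pi.plus + pi.minus) / 2
}

## Usage: decision values that change sign between pi = 0.6 and pi = 0.7
pi.grid <- seq(0.1, 0.9, by = 0.1)
fhat    <- c(2.1, 1.8, 1.2, 0.9, 0.5, 0.2, -0.3, -0.8, -1.5)
assemble.prob(fhat, pi.grid)               # (0.6 + 0.7) / 2 = 0.65
```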
One main concern about the probability estimation scheme proposed by Wang et al. (2008) is its computational cost. The cost comes from two sources: first, there are multiple sub-problems to solve, one for each value of the weight parameter π in (0, 1); second, each sub-problem has its own regularization parameter λ, which needs to be adaptively tuned over (0, ∞). To facilitate the computation, the λ-path algorithm of the standard SVM (Hastie et al.; 2004) can be extended to the WSVM for any fixed π ∈ (0, 1). In addition, Wang et al. (2008) developed the π-path algorithm for any fixed λ > 0. Both the λ-path and the π-path are piecewise-linear. However, it is largely unknown how the WSVM solution fλ,π changes when the regularization parameter λ and the weight parameter π vary together. The main purpose of our two-dimensional solution surface is to reduce the computation and tuning burden by obtaining the solutions for all possible pairs (π, λ) automatically with an efficient algorithm.
One main motivation of this paper is to study the behavior of the entire set of WSVM solutions and to characterize them in a simple representation through their relationship to π and λ. We use subscripts to emphasize that the WSVM solution fλ,π is a function of λ and π, and we sometimes omit the subscripts when they are clear from the context. Another motivation for the solution surface is to automate the selection of the regularization parameter and improve the efficiency of the search process. Although Wang et al. (2008)'s conditional class probability estimator performs well, as demonstrated by their numerical examples, its performance depends heavily on λ. They proposed to tune λ by a grid search in their numerical illustrations. Yet it is well known that such a grid search can be computationally inefficient and, in addition, its performance depends on how fine the grid is. These considerations motivate us to develop a two-dimensional solution surface (rather than a one-dimensional path) as a continuous function of both λ and π, in the same way that computing the entire λ-regularization path (Hastie et al.; 2004) resolved the inefficiency of the grid search for selecting the regularization parameter λ of the SVM. From now on, we refer to the new two-dimensional solution surface of the WSVM as the WSVM solution surface.
In order to illustrate the difficulties in tuning the regularization parameter for probability estimation (Wang et al.; 2008) and to motivate our new tuning method based on the WSVM solution surface, we use a univariate toy example generated from a Gaussian mixture: xi is randomly drawn from the standard normal distribution if yi = 1 and from N(1, 1) otherwise, with five points from each class. The linear kernel (K(xi, xj) = xixj) is employed for the WSVM, and its solution is then given by f(x) = b + βx with β = λ−1 Σ_{i=1}^n yiαixi. In order to describe the behavior of f(x), we plot λβ based on the obtained WSVM solution surface (or path), since λβ, instead of β, is piecewise-linear due to our parametrization. In Figure 1, the top two panels depict the solution paths of λβ as functions of π for λ fixed at 0.2, 0.4, 0.6, 0.8, 1.0 (left) and the corresponding estimates p̂(·) as functions of x (right); the bottom two panels plot the entire two-dimensional joint solution surface (left) and the corresponding probability estimate p̂(·) as a function of both λ and x (right). We note that, although all five π-paths are piecewise-linear in π, they have quite different shapes for different values of λ (see (a)). Thus the corresponding probability estimates can be quite different even for the same x (see (b)), suggesting the importance of selecting an optimal λ. By using the proposed WSVM solution surface, we can completely recover the WSVM solutions on the whole (λ × π)-plane (see (c)), which enables us to produce the corresponding conditional probability estimators at a given x for every λ with very little computational expense (see (d)). We will shortly demonstrate that it is computationally efficient to extract marginal paths (λ-path or π-path) once the WSVM solution surface is obtained.
In this example, we use a grid of five equally-spaced values of λ, which is very coarse. In practice, it is typically not known a priori how fine the grid should be or what the appropriate range of the grid is. If the data are large or complicated, the grid one chooses may not be fine enough to capture the variation of the WSVM solution, and the subsequent probability estimation will lose efficiency. The proposed WSVM solution surface provides a complete portrait of the WSVM solution corresponding to any pair of λ and π and therefore naturally overcomes these practical difficulties, in addition to the gain in computational efficiency.
In this article, we show that the WSVM solution is jointly piecewise-linear in both λ and π and propose an efficient algorithm to construct the entire solution surface on the (λ × π)-plane by taking advantage of the established joint piecewise-linearity. As a straightforward application, an adaptive grid for tuning the regularization parameter of the probability estimation scheme (Wang et al.; 2008) is proposed. We finally remark that the WSVM solution surface has broad applications in addition to the probability estimation.
The rest of the article is organized as follows. In Section 2, the WSVM problem is formulated in detail in order to develop the joint solution surface of the WSVM. In Section 3 we establish the joint piecewise-linearity of the WSVM solution on the (λ × π)-plane. An efficient algorithm that computes the WSVM solution surface by taking advantage of this piecewise-linearity is developed in Section 4, and its computational complexity is explored in Section 5. The proposed WSVM solution surface algorithm is illustrated on the kyphosis data in Section 6. It is then applied to probability estimation problems on several real data sets in Section 7. Some concluding remarks are given in Section 8.
2 Problem Setup
By introducing nonnegative slack variables ξi, i = 1, ···, n, and using inequality constraints, the WSVM problem (4) can be equivalently rewritten as

min_{b,θ,ξ} Σ_{i=1}^n πiξi + (λ/2) Σ_{i=1}^n Σ_{j=1}^n θiθjK(xi, xj)
subject to ξi ≥ 0 and yif(xi) ≥ 1 − ξi, i = 1, ···, n,

with f given by (2).
The corresponding Lagrangian primal function is constructed as
LP = Σ_{i=1}^n πiξi + (λ/2) Σ_{i=1}^n Σ_{j=1}^n θiθjK(xi, xj) + Σ_{i=1}^n αi{1 − ξi − yif(xi)} − Σ_{i=1}^n γiξi,    (5)
where αi ≥ 0 and γi ≥ 0 are the Lagrange multipliers. To derive the corresponding dual problem, we set the partial derivatives of LP in (5) with respect to the primal variables to zero, which gives
θj = yjαj/λ, j = 1, ···, n,    (6)
Σ_{i=1}^n yiαi = 0,    (7)
πi − αi − γi = 0, i = 1, ···, n,    (8)
along with the Karush–Kuhn–Tucker (KKT) conditions
αi{1 − ξi − yif(xi)} = 0, i = 1, ···, n,    (9)
γiξi = 0, i = 1, ···, n.    (10)
Notice from (6) that the function (2) can be rewritten as
f(x) = b + λ−1 Σ_{i=1}^n yiαiK(x, xi).    (11)
By combining (6–10), the dual problem of the WSVM is given by
max_{α} Σ_{i=1}^n αi − (2λ)−1 Σ_{i=1}^n Σ_{j=1}^n αiαjyiyjK(xi, xj)
subject to 0 ≤ αi ≤ πi, i = 1, ···, n, and Σ_{i=1}^n yiαi = 0,    (12)
and this can be readily solved via quadratic programming (QP) for any given λ and π. However, the two parameters λ and π may not be known in practice. A classical approach is to repeatedly solve the QP problem (12) for different pairs (λ, π) in order to select the desired λ and π. This is computationally intensive for large data sets because the QP itself is a numerical method whose computational complexity increases polynomially in n.
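For concreteness, the following sketch solves the dual (12) for one fixed pair (λ, π) with the off-the-shelf QP solver from the R package quadprog; this is exactly the repeated computation that the solution surface algorithm is designed to avoid. All names are ours, the weights follow the definition of πi in Section 1, and a small ridge is added only because solve.QP() requires a strictly positive-definite quadratic term.

```r
library(quadprog)

## Sketch: solve the WSVM dual (12) for a single fixed (lambda, pi) pair by
## quadratic programming.  Variable names are ours; the ridge keeps the
## quadratic term positive definite as required by solve.QP().
wsvm.qp <- function(K, y, lambda, pi0, ridge = 1e-6) {
  n  <- length(y)
  wi <- ifelse(y == 1, 1 - pi0, pi0)          # class weights pi_i (see Section 1)
  Q  <- (y %o% y) * K                         # Q_ij = y_i y_j K(x_i, x_j)
  Dmat <- Q / lambda + ridge * diag(n)
  dvec <- rep(1, n)
  ## Constraints: sum_i y_i alpha_i = 0 (equality), then 0 <= alpha_i <= pi_i.
  Amat <- cbind(y, diag(n), -diag(n))
  bvec <- c(0, rep(0, n), -wi)
  alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
  ## Recover the intercept from an elbow point (0 < alpha_i < pi_i), where
  ## y_i f(x_i) = 1; fall back to the most "interior" point if none exists.
  el <- which(alpha > 1e-6 & alpha < wi - 1e-6)
  el <- if (length(el) > 0) el[1] else which.max(pmin(alpha, wi - alpha))
  b  <- y[el] - sum(y * alpha * K[el, ]) / lambda
  list(alpha = alpha, b = b,
       decision = function(Knew) b + as.vector(Knew %*% (y * alpha)) / lambda)
}

## Toy usage with the univariate Gaussian mixture from the Introduction
## (linear kernel K(x_i, x_j) = x_i x_j):
set.seed(1)
x <- c(rnorm(5), rnorm(5, mean = 1)); y <- rep(c(1, -1), each = 5)
fit <- wsvm.qp(K = x %o% x, y = y, lambda = 0.5, pi0 = 0.5)
```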
For the standard SVM (equivalent to a special case of the WSVM with the weight parameter π = 0.5), Hastie et al. (2004) showed the piecewise-linearity of the αi in λ and developed an efficient algorithm to compute the entire piecewise-linear solution path. The same idea can be extended straightforwardly to the WSVM with any fixed weight parameter or any fixed individual weights. In addition, Wang et al. (2008) showed that the WSVM solutions αi are piecewise-linear in the weight parameter π while the regularization parameter λ is kept fixed. However, it is largely unknown how the WSVM solution αi changes with respect to the two parameters jointly. In this article, we show that the WSVM solutions αi, as functions of both λ and π, form a continuous piecewise-linear solution surface on the (λ × π)-plane, and we propose an efficient algorithm to compute the entire solution surface.
Similar to the idea of Hastie et al. (2004), we categorize all the examples i = 1, ···, n into three disjoint sets:

ℰ = {i : yif(xi) = 1, 0 ≤ αi ≤ πi} (elbow),
ℒ = {i : yif(xi) < 1, αi = πi} (left),
ℛ = {i : yif(xi) > 1, αi = 0} (right).
It is easy to see that the above three sets are always uniquely defined by the conditions (6)–(10). The set names come from the particular shape of the hinge loss function (Hastie et al.; 2004). Note that {αi, i ∈ ℰ} contains most of the information on how the WSVM solution changes on the (λ × π)-plane, since the remaining solutions {αi, i ∈ ℒ ∪ ℛ} are trivially determined by the definitions of the sets.
As λ and π change, the sets may change; whenever this happens, we call it an event. All the solution surfaces αi, i = 1, ···, n, are continuous, and hence no element of ℒ can move directly to ℛ or vice versa. Therefore there are only three possible events to consider. The first event occurs when some αi, i ∈ ℰ, reaches πi, and the corresponding index i exits ℰ for ℒ (event 1). Similarly, an αi, i ∈ ℰ, can reach the other boundary 0, in which case the index moves from ℰ to ℛ (event 2). The last event happens when an element i of ℒ ∪ ℛ satisfies yif(xi) = 1 and enters ℰ (event 3).
3 Joint piecewise-linearity
In this section, we study the behavior of the WSVM solutions from a theoretical point of view. One major discovery is that αi, i ∈ ℰ, and hence all the αi, i = 1, ···, n, change in a jointly piecewise-linear manner when λ and π vary. The following theorem describes how the αi move as λ and π change. For simplicity, we define α0 = λb.
Theorem 1
(Joint Piecewise-Linearity) Suppose we have a point (λℓ, πℓ) in the (λ × π)-plane. Let ℰℓ, ℒℓ, ℛℓ, and α0,ℰℓ denote the associated sets and solution obtained at (λℓ, πℓ), respectively. Now consider a subset 𝒫ℓ of the (λ × π)-plane which contains (λℓ, πℓ) and within which no event happens. In other words, for all (λ, π) ∈ 𝒫ℓ, the associated three sets ℰ, ℒ, and ℛ remain the same as ℰℓ, ℒℓ, and ℛℓ, respectively. Then the solution αi, i ∈ {0} ∪ ℰℓ, written in vector form as α0,ℰ, moves within 𝒫ℓ as follows:
α0,ℰ = α0,ℰℓ + GℓΔ,    (13)
where α0,ℰℓ denotes the solution vector at (λℓ, πℓ) and Δ = (Δλ, Δπ)T = (λ − λℓ, π − πℓ)T. The gradient matrix Gℓ is given by
Gℓ = Aℓ−1Bℓ,    (14)
where Aℓ = [0, yℓT; yℓ, Qℓ] and Bℓ = [0, |ℒℓ|; 1, diag(yℓ)Kℰℒ1] (rows separated by semicolons); Qℓ = {yiyjK(xi, xj)}i,j∈ℰℓ; Kℰℒ = {K(xi, xj)}i∈ℰℓ, j∈ℒℓ; yℓ = {yi : i ∈ ℰℓ}T; |A| denotes the cardinality of a set A; and 1 is the one vector of length |ℒℓ|.
Proof
We can rewrite (11) as:
f(x) = λ−1{α0 + Σ_{i=1}^n yiαiK(x, xi)}.    (15)
Moreover, within 𝒫ℓ we have yif(xi) = yifℓ(xi) = 1 for all i ∈ ℰℓ, which leads to
yiν0 + Σ_{j∈ℰℓ} yiyjK(xi, xj)νj = Δλ + yiΔπ Σ_{j∈ℒℓ} K(xi, xj), i ∈ ℰℓ,    (16)
where νi = αi − αiℓ, ∀i ∈ {0} ∪ ℰℓ, and αiℓ denotes the value of αi at (λℓ, πℓ). In addition, by the definitions of the sets, the condition (7) gives
Σ_{i∈ℰℓ} yiνi = |ℒℓ| Δπ.    (17)
We thus have (|ℰℓ| + 1) linear equations from (16) and (17), which can be expressed in matrix form as Aℓν = BℓΔ, where ν = {νi : i ∈ {0} ∪ ℰℓ}T = α0,ℰ − α0,ℰℓ. The desired result follows by assuming Aℓ to be invertible.
We remark that Aℓ is rarely singular in practice, and related discussions can be found in Hastie et al. (2004). It is worthwhile to point out that the joint piecewise-linearity of the solution guarantees the marginal piecewise-linearity as presented in Corollary 2, but not vice versa. Therefore, Theorem 1 implies the piecewise-linearity of the marginal solution paths as a function of either λ or π, which were separately explored by Hastie et al. (2004) and Wang et al. (2008).
Corollary 2
(Marginal Piecewise-Linearity) For a given π0, the solution α0,ℰ moves linearly in λ ∈ {λ : (λ, π0) ∈ 𝒫ℓ} as follows:

α0,ℰ(λ, π0) = α0,ℰ(λℓ, π0) + g1ℓ(λ − λℓ).

Similarly, for a given λ0, α0,ℰ changes linearly in π ∈ {π : (λ0, π) ∈ 𝒫ℓ} as follows:

α0,ℰ(λ0, π) = α0,ℰ(λ0, πℓ) + g2ℓ(π − πℓ),

where g1ℓ and g2ℓ denote the first and second column vectors of Gℓ in (14), respectively.
The classification function f(x) can be conveniently updated by plugging (13) into (15), which gives
f(x) = λ−1{λℓfℓ(x) + h1(x)Δλ + h2(x)Δπ},    (18)

where h1(x) = g01 + Σ_{i∈ℰℓ} yigi1K(x, xi), h2(x) = g02 + Σ_{i∈ℰℓ} yigi2K(x, xi) − Σ_{i∈ℒℓ} K(x, xi), and (gi1, gi2) denotes the row of Gℓ corresponding to index i ∈ {0} ∪ ℰℓ. We observe from (18) that f(x) is not jointly piecewise-linear in (λ, π), while it is marginally piecewise-linear in λ−1 and in π, respectively.
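As a small illustration of Theorem 1 and the update (18), the following R sketch assembles the linear system (16)–(17) as reconstructed above, solves it for the gradient matrix Gℓ, and moves the solution to a new point inside the same polygon. The matrix layout follows our derivation and all names (grad.matrix, update.alpha, E, L) are ours; this is a sketch, not the implementation in the wsvmsurf package.

```r
## Sketch of one Theorem-1 step under the system (16)-(17) as reconstructed
## above.  E and L are index vectors of the elbow and left sets, K the full
## kernel matrix, y the labels; names are ours.
grad.matrix <- function(K, y, E, L) {
  yE <- y[E]
  Q  <- (yE %o% yE) * K[E, E, drop = FALSE]       # y_i y_j K(x_i, x_j), i, j in E
  hE <- if (length(L) > 0) rowSums(K[E, L, drop = FALSE]) else rep(0, length(E))
  A  <- rbind(c(0, yE), cbind(yE, Q))             # left-hand side of (16)-(17)
  B  <- rbind(c(0, length(L)), cbind(1, yE * hE)) # right-hand side of (16)-(17)
  solve(A, B)                                     # G = A^{-1} B, size (|E|+1) x 2
}

## Move (alpha_0, alpha_E) from (lambda.l, pi.l) to (lambda.new, pi.new)
## inside the same polygon, following the update (13).
update.alpha <- function(alpha0, alphaE, G, lambda.l, pi.l, lambda.new, pi.new) {
  new <- c(alpha0, alphaE) + as.vector(G %*% c(lambda.new - lambda.l, pi.new - pi.l))
  list(alpha0 = new[1], alphaE = new[-1])
}
```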
4 Solution Surface Algorithm
In this section, we propose an efficient algorithm to compute the entire solution surface of the WSVM on the (λ × π)-plane by using the joint piecewise-linearity established in Theorem 1.
4.1 Initialization
Denote the index sets I+ = {i : yi = 1} and I− = {i : yi = −1}. We initialize the algorithm at π0 = |I+|/n. Notice that π0 satisfies the so-called balanced condition, which requires Σi∈I+ πi(π0) = Σi∈I− πi(π0). With π = π0 it is easy to verify that, for a sufficiently large λ, every αi sits at its upper bound πi. Following the idea of Hastie et al. (2004), the initial values of λ and α0, denoted by λ0 and α00 respectively, are given by
where . The indices i+ and i− are defined as
It is possible to initialize the algorithm at any π between 0 and 1 rather than at π0; however, we empirically observe that the initial λ is largest when π = π0. Notice that the corresponding solution is trivial, with αi = πi for all i = 1, ···, n, for any λ larger than λ0. Therefore, the proposed algorithm focuses only on the non-trivial solutions obtained on 𝒟 = {(λ, π) : 0 ≤ λ ≤ λ0, 0 ≤ π ≤ 1}. Finally, the three sets initialized at (λ0, π0), denoted by ℰ0, ℒ0 and ℛ0 respectively, are given by
ℰ0 = {i+, i−}, ℒ0 = {1, ···, n} \ {i+, i−}, and ℛ0 = ∅, where ∅ denotes the empty set.
4.2 Update
Recall that, for any point (λℓ, πℓ), no event occurs if (λ, π) ∈ 𝒫ℓ, and the WSVM solution can be updated by applying Theorem 1 for any (λ, π) ∈ 𝒫ℓ. Therefore, it is essential to make 𝒫ℓ as large as possible for any (λℓ, πℓ). We demonstrate next that the set 𝒫ℓ can be explicitly determined by a collection of linear constraints.
Note that event 1 happens when αi reaches πi for some i ∈ ℰℓ. Based on (13), we have the following inequality constraints to prevent event 1 from happening:

gi1Δλ + (gi2 + 1)Δπ ≤ πiℓ − αiℓ, i ∈ ℰ+ℓ,    (19)
gi1Δλ + (gi2 − 1)Δπ ≤ πiℓ − αiℓ, i ∈ ℰ−ℓ,    (20)

where αiℓ and πiℓ denote the values of αi and πi at (λℓ, πℓ), ℰ+ℓ = {i ∈ ℰℓ : yi = 1}, and ℰ−ℓ = {i ∈ ℰℓ : yi = −1}.
In a similar way, we have the following inequalities to prevent event 2:

gi1Δλ + gi2Δπ ≥ −αiℓ, i ∈ ℰℓ.    (21)
In order to prevent event 3, we need yif(xi) < 1 for all i ∈ ℒℓ and yif(xi) > 1 for all i ∈ ℛℓ. Therefore, by noting (15) and (18), we have

{yih1(xi) − 1}Δλ + yih2(xi)Δπ ≤ λℓ{1 − yifℓ(xi)}, i ∈ ℒℓ,    (22)
{yih1(xi) − 1}Δλ + yih2(xi)Δπ ≥ λℓ{1 − yifℓ(xi)}, i ∈ ℛℓ,    (23)
where h1(xi) and h2(xi) are as defined in (18). We remark that the inequalities need not be strict since an event is an instantaneous transition. Recall that it suffices to consider the solution on 𝒟, and hence the additional constraints 0 ≤ λ ≤ λ0 and 0 ≤ π ≤ 1 are imposed by default. In summary, 𝒫ℓ is the subregion of the (λ × π)-plane that satisfies all of the constraints (19)–(23). Figure 2 illustrates the 𝒫ℓ generated from the initial point (λ0, π0) for the toy example in Section 1. We remark that 𝒫ℓ forms a polygon which can be uniquely expressed by its vertices, since the constraints are all linear.
We describe next how to determine the vertices of 𝒫ℓ in an efficient manner. First, compute all the pairwise intersection points of the boundaries of (19)–(23); this gives nc(nc − 1)/2 intersection points, where nc = n + |ℰℓ| denotes the number of constraints in (19)–(23). The left panel of Figure 2 shows all the intersections of the boundaries of the obtained constraints. Then, we can define the vertices by identifying the intersections that satisfy all the constraints (19)–(23), as illustrated in Figure 2-(b). We denote the vertices of 𝒫ℓ by {vrℓ, r = 1, ···, nv}, where nv is the number of vertices.
There are a couple of issues to clarify here. First, based on our limited experience, we observe that nv is small and typically does not exceed eight. Hence it is not efficient to compute all the intersections, due to the computational intensity involved. Note also that some constraints are dominated by others on 𝒟 and are thus automatically satisfied whenever the other constraints hold. Consequently, we can save a substantial amount of computational time by excluding those constraints which are dominated by others, especially when n is large. We also need to order the vertices for the set-updates discussed next. Here the ordering means that a vertex vrℓ, r = 1, ···, nv, is adjacent to vr−1ℓ and vr+1ℓ, where we set v0ℓ = vnvℓ and vnv+1ℓ = v1ℓ (see Figure 2-(b)). The updating of α0 as well as α1, ···, αn at the vertices of 𝒫ℓ is then straightforward by Theorem 1.
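The vertex computation just described can be sketched as follows: we store each constraint in the generic linear form a·Δλ + b·Δπ ≤ c, intersect the boundary lines pairwise, discard infeasible intersections, and order the surviving vertices around their centroid. The function name, the input layout, and the centroid-based ordering are our own choices; a minimal usage example with four box constraints follows the definition.

```r
## Sketch: vertices of the polygon defined by linear constraints
## a * dlambda + b * dpi <= c, supplied as rows of 'con' = cbind(a, b, c).
polygon.vertices <- function(con, tol = 1e-10) {
  nc <- nrow(con); pts <- NULL
  for (i in 1:(nc - 1)) for (j in (i + 1):nc) {
    A <- con[c(i, j), 1:2]
    if (abs(det(A)) > tol) {                # skip parallel boundary lines
      v <- solve(A, con[c(i, j), 3])        # intersection of the two boundaries
      if (all(con[, 1] * v[1] + con[, 2] * v[2] <= con[, 3] + tol))
        pts <- rbind(pts, v)                # keep only feasible intersections
    }
  }
  pts <- unique(round(pts, 10))
  ctr <- colMeans(pts)                      # order vertices around the centroid
  pts[order(atan2(pts[, 2] - ctr[2], pts[, 1] - ctr[1])), , drop = FALSE]
}

## Usage: the unit square written as four half-plane constraints.
con <- rbind(c(1, 0, 1), c(-1, 0, 0), c(0, 1, 1), c(0, -1, 0))
polygon.vertices(con)
```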
Figure 3 illustrates how to extend the polygons over 𝒟 from the current 𝒫ℓ in Figure 2-(b). Notice that the sides of 𝒫ℓ are determined by the boundaries of some of the constraints (19)–(23) and thus represent the corresponding events. Each middle point of two adjacent vertices can be used as a new starting point (λℓ+1, πℓ+1) in Theorem 1 to compute a new polygon 𝒫ℓ+1. Computing the middle points, denoted by m̄rℓ+1, and the corresponding solutions ᾱrℓ+1, r = 1, ···, nv, is trivial due to the piecewise-linearity of the solutions. The bar sign is used to emphasize that these quantities are obtained at middle points, not vertices. At each middle point, the corresponding three sets, denoted by ℰℓ+1, ℒℓ+1 and ℛℓ+1 respectively, can be updated based on which boundary line the middle point lies on. For example, as shown in Figure 3-(a), the three sets obtained at m1 are updated from ℰℓ, ℒℓ and ℛℓ by moving the element 4 from the elbow set to the right set, because m1 lies on the boundary representing exactly that event (see Figure 2-(a)). Now we have all the information required to generate a new polygon 𝒫ℓ+1 from the middle point m1, and 𝒫ℓ+1 can be computed by treating m1 as a new updating point (this is why quantities at middle points carry the superscript ℓ + 1). Figure 3 shows the four newly-created polygons 𝒫rℓ+1, r = 1, ···, 4, generated from the four middle points m1, ···, m4 of the sides of 𝒫ℓ in Figure 2-(b). Notice that the right vertical line represents λ = λ0 and we have no interest beyond it. Finally, the proposed algorithm continues to extend the polygons being searched over 𝒟 and terminates when the complete solution surface is recovered on 𝒟.
4.3 Resolving Empty Elbow
Note that either event 1 or event 2 may leave ℰ empty, a situation we call the empty elbow. This may cause a problem in the update described in Section 4.2, because Theorem 1 cannot be applied when ℰ is empty.
We suppose the empty elbow occurs at (λo, πo) and use the superscript ‘o’ to denote any quantity defined at (λo, πo). In order to resolve the empty elbow, we first notice that the objective function (4) is differentiable with respect to b and αi, i = 1, ···, n, whenever the elbow set is empty, since in this case there is no example satisfying yif(xi) = 1. Taking derivatives of (4) with respect to b and the αi, we obtain two conditions to be satisfied under the empty elbow: i) Σi∈ℒo yiπi = 0, and ii) the αi are uniquely determined (αi = πi for i ∈ ℒo and αi = 0 for i ∈ ℛo) while α0 is not. Moreover, α0 can be any value in the following interval,
[aL, aU],    (24)
where aL and aU denote, respectively, the largest lower bound and the smallest upper bound on α0 implied by the constraints yif(xi) ≤ 1 for i ∈ ℒo and yif(xi) ≥ 1 for i ∈ ℛo. Recall that α0 is continuous, and hence the empty elbow can be resolved only by α0 touching one of the two boundaries aL and aU. Notice that α0o is regarded as the starting value at which the empty elbow begins, and hence it must equal one of aL and aU. Without loss of generality, suppose α0o = aL; then the empty elbow is resolved when α0 reaches the other boundary aU. At that moment the index attaining the bound aU satisfies yif(xi) = 1, so ℰ is updated to contain exactly that index, and ℒ and ℛ are updated accordingly since the index leaves one of the two sets. The case α0o = aU is handled in the same way, with the roles of aL and aU exchanged.
4.4 Pseudo Algorithm
Combining the previous several subsections, we now summarize our WSVM solution surface algorithm in Algorithm 1. We denote α = (α0, α1, ···, αn)T for simplicity. We note that the proposed algorithm computes the complete WSVM solutions α for any (λ, π) ∈ 𝒟 without involving any numerical optimization. We have implemented the algorithm in the R language, and the wsvmsurf package is available from the authors upon request (and will be available on CRAN soon).
Algorithm 1.
Input: A training data set (xi, yi), i = 1, ···, n, and a non-negative definite kernel function K(x, x′).
Output: The entire (piecewise-linear) solution surface on 𝒟 of αi, i = 0, ···, n, the optimizer of (4).
1. (Initialization) Compute (λ0, π0), the initial solution, and the initial sets ℰ0, ℒ0, ℛ0 as in Section 4.1.
2. (Update) At the current point (λℓ, πℓ), compute Gℓ by Theorem 1 (resolving the empty elbow as in Section 4.3 if ℰℓ is empty), construct the constraints (19)–(23), and determine the polygon 𝒫ℓ with its ordered vertices as in Section 4.2.
3. (Extension) For each side of 𝒫ℓ not yet covered, take its middle point as a new starting point (λℓ+1, πℓ+1), update the sets according to the event that side represents, and return to Step 2.
4. (Termination) Stop when the polygons cover 𝒟; return all vertices and the associated solutions α.
Table 1. Average computing time (in seconds), average number of pieces (polygons), and computing time per piece for the Gaussian and linear kernels, based on 100 independent repetitions; standard errors are given in parentheses.

| n | Gaussian: time (s) | Gaussian: pieces | Gaussian: time/piece | Linear: time (s) | Linear: pieces | Linear: time/piece |
|---|---|---|---|---|---|---|
| 10 | 0.53 (0.01) | 164.3 (3.58) | 0.00325 | 0.44 (0.01) | 140.98 (2.70) | 0.00311 |
| 30 | 5.78 (0.08) | 2056.6 (27.16) | 0.00281 | 4.36 (0.06) | 1648.05 (21.24) | 0.00265 |
| 50 | 19.08 (0.22) | 6036.8 (55.29) | 0.00316 | 14.38 (0.16) | 4870.05 (52.40) | 0.00295 |
| 100 | 108.00 (1.07) | 25653.0 (193.50) | 0.00421 | 80.72 (0.67) | 20087.13 (148.05) | 0.00402 |
5 Computational Complexity
The essential part of the proposed algorithm involves several steps. First, we solve the linear system (14) of size 1 + |ℰℓ|, which involves O((1 + |ℰℓ|)3) computations. We empirically observe that |ℰℓ|, the number of support vectors at the elbow, depends on n but is usually much smaller than n during the algorithm. Next, in order to determine the set 𝒫ℓ, we compute the constraints (19)–(23) and the quantities h1(xi) and h2(xi), which respectively require O(n|ℰℓ|) and O(n2) computations. Now we have nc = n + |ℰℓ| constraints and need to find the intersection points which satisfy all the constraints (19)–(23) simultaneously. This is the most computationally expensive step, since it requires evaluating nc(nc − 1)/2 intersection points and checking nc constraints for each point, which takes at least O(nc3) computations in total.
Our limited numerical experience suggests that the final number of vertices of 𝒫ℓ is less than or equal to eight, which implies that most of the constraints are dominated by others. This leads us to add a refining step to the algorithm which removes those constraints dominated by others, as follows. First, since all the constraints have the common linear form aλ + bπ ≥ c, with different values of a, b and c for different constraints, we can classify the constraints into two types depending on the sign of b, positive or negative, assuming b ≠ 0 for simplicity. See Figure 4-(a) for an illustration. If b < 0 for a constraint, then the region represented by the constraint lies below its boundary (dashed lines); otherwise it lies above the boundary (dotted lines). For each type of constraint, with either b > 0 or b < 0, we compute the values of π on the boundary at λ = 0 and λ = λ0 (vertical lines); then we can exclude most of, but not all, dominated boundaries by sorting the π values obtained. The refining step is computationally cheap since it is based only on sorting (a Shellsort variant). After the refining step, we have only ñc constraints to consider for computing 𝒫ℓ, and ñc is much smaller than nc.
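Our reading of this refining step can be sketched as follows: write each constraint as aλ + bπ ≥ c, split the constraints by the sign of b into lower and upper bounds on π, evaluate each boundary at λ = 0 and λ = λ0, and drop any boundary dominated at both endpoints with a single sorting sweep. The Pareto-style sweep and all names are ours, and, as noted above, the step removes most but not necessarily all dominated constraints.

```r
## Sketch of the refining step: constraints a*lambda + b*pi >= c (b != 0) are
## lower bounds on pi when b > 0 and upper bounds when b < 0.  A lower bound
## whose boundary lies below another one at both lambda = 0 and lambda = lambda0
## can never be active on the domain, and symmetrically for upper bounds.
refine.constraints <- function(con, lambda0) {
  pi.at <- function(l) (con[, 3] - con[, 1] * l) / con[, 2]   # boundary value of pi
  p0 <- pi.at(0); p1 <- pi.at(lambda0)
  keep <- logical(nrow(con))
  for (s in c(1, -1)) {                      # s = 1: lower bounds, s = -1: upper bounds
    idx <- which(sign(con[, 2]) == s)
    if (length(idx) == 0) next
    ord  <- idx[order(s * p0[idx], decreasing = TRUE)]
    best <- -Inf
    for (i in ord)                           # keep only non-dominated boundaries
      if (s * p1[i] > best) { keep[i] <- TRUE; best <- s * p1[i] }
  }
  con[keep, , drop = FALSE]
}
```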
In order to demonstrate the benefit of the refining step, we consider data sets generated from a univariate Gaussian mixture for different sample sizes n. Data are generated as xi ~ N(0, 1) if i ∈ I+ and xi ~ N(1, 1) otherwise, with |I+| = |I−| = n/2. Employing the Gaussian kernel with σ = 1, Figure 4 illustrates how effective the refining step is. We observe that only about 20% of the constraints are used to compute 𝒫ℓ for relatively large n, suggesting a dramatic computational saving in evaluating 𝒫ℓ (only about 0.2³ = 0.8% of the computations are required, compared to the case where the refining step is not employed). The right panel (b) shows nc (black dashed line) versus ñc (red solid line) for different n; the computational saving is substantial when n is large.
In order to obtain the entire solution surface we need to iterate the aforementioned updating steps. It is quite challenging to rigorously quantify the number of required iterations as a function of n, since it depends not only on the data but also on the kernel function employed (and hence its parameter). We instead evaluate the empirical computing time for different sample sizes n based on 100 independent repetitions (Table 1). The data are simulated from a bivariate Gaussian mixture with the same number of observations from each class (i.e., |I+| = |I−| = n/2). In particular, if i ∈ I+ then xi ~ N((0, 0)T, I2) and otherwise xi ~ N((2, 2)T, I2), where I2 is the two-dimensional identity matrix. In Table 1, we report the average computing time, the average number of pieces, and the computing time per piece for both the Gaussian kernel and the linear kernel. The kernel parameter σ of the Gaussian kernel is assumed fixed and is set to the median of the pairwise distances between the two classes (Jaakkola et al.; 1999). Our limited numerical experience suggests that the computing time per piece does not depend severely on n, while the total time increases proportionally to n² since the number of polygons produced does. In Table 1, the numbers in parentheses are the corresponding standard errors.
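The kernel-parameter rule used above can be made explicit with a short sketch: σ is set to the median of the pairwise distances between observations from the two classes (Jaakkola et al.; 1999). The Gaussian-kernel parameterization below (with the 2σ² factor) is one common convention and is our assumption, as are the function names.

```r
## Sketch: Gaussian-kernel parameter set to the median pairwise distance
## between the two classes (Jaakkola et al.; 1999); the 2*sigma^2 scaling in
## the kernel is one common convention and an assumption of ours.
median.sigma <- function(X, y) {
  D <- as.matrix(dist(X))                   # all pairwise Euclidean distances
  median(D[y == 1, y == -1])                # keep only between-class pairs
}
gauss.kernel <- function(X, sigma)
  exp(-as.matrix(dist(X))^2 / (2 * sigma^2))
```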
Finally, one may desire to extract the marginal path, either a λ-path or a π-path, from the WSVM solution surface. This can be easily done after obtaining all the polygons and the corresponding WSVM solutions. It is straightforward to interpolate the WSVM solution for any pair (λ, π) using the WSVM solutions corresponding to vertices of all those polygons produced from the WSVM solution surface. Let ‘d’ be the total number of vertices. As aforementioned, d is proportional to n2. Then it requires only O(d) computations to obtain the marginal solution paths, since the vertices of each polygon are properly ordered already.
6 Illustration
In this section, we illustrate how the proposed WSVM solution surface algorithm works by using the kyphosis data (Chambers and Hastie; 1992) available in the R package rpart. The data contain the status (absence or presence of kyphosis) of n = 81 children who had corrective spinal surgery. The class variable yi is coded −1 for absence and +1 for presence of kyphosis after the operation. Three predictors are recorded for each child (i.e., p = 3): the age of the patient (in months), the number of vertebrae involved, and the number of the first vertebra operated on. The Gaussian kernel is used with kernel parameter σ set to 0.01. The entire solution surface consists of 7,502 polygons (or pieces) and the total computing time is 34.90 seconds. In Figure 5, panel (a) plots all the vertices of the polygons produced (red dots), and panel (b) depicts the entire surface of α18. There are 81 surfaces in total, one for each αi, i = 1, ···, 81, corresponding to the ith observation; for the purpose of illustration, we only show the one for i = 18. The solution surfaces for the remaining 80 αi's can be visualized in a similar way. Once we have the entire two-dimensional solution surface, the marginal solution paths can be readily obtained for a fixed value of either λ or π. Panels (c) and (d) show, respectively, the marginal regularization paths with π fixed at 0.2 and the marginal paths with λ fixed at 0.5, extracted from the WSVM solution surface. Notice that there are 81 observations in the data, and hence each piecewise-linear path in (c) and (d) represents the marginal solution path of one αi, i = 1, ···, 81.
7 Applications to Probability Estimation
In this section, we revisit the motivating application of the proposed joint solution surface algorithm to probability estimation (Wang et al.; 2008), introduced in Section 1. In Wang et al. (2008), the regularization parameter λ is selected by a grid search over the interval [10−3, 103] with ten equally-spaced points in each sub-interval (10j, 10j+1], j = −3, ···, 2. In practice, it can be difficult to choose an appropriate range for the parameter search or to determine the right level of grid coarseness. A general rule is: if the WSVM solution varies rapidly with λ, we need a fine grid; otherwise, a coarse grid is more appropriate in terms of computational efficiency. However, without a priori information, a subjective choice of the range could be either too wide or too narrow, and the grid could be too fine or too coarse. The proposed WSVM solution surface algorithm provides natural and informative guidance for tackling these issues. We propose to use all the distinct λ values obtained from the WSVM solution surface algorithm as the grid, namely the set of all λ values corresponding to the vertices of the polygons obtained by the WSVM solution surface. We would like to point out that the proposed grid of λ is adaptive to the solution variation, in the sense that the grid is finer where the solution surface is complicated and coarser otherwise.
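In code, the adaptive grid is simply the sorted set of distinct λ coordinates of all polygon vertices, as the one-line sketch below (with our own names) makes explicit.

```r
## Sketch: the adaptive lambda grid is the set of distinct lambda coordinates
## of all polygon vertices produced by the surface algorithm.  'vertices' is
## assumed to be a matrix whose first column holds lambda.
adaptive.grid <- function(vertices) sort(unique(vertices[, 1]))
```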
We compare these two grid search methods using two microarray datasets available at the LIBSVM Data website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). For both datasets, the sample sizes are much smaller than the number of predictors. The Gaussian kernel is employed to train the WSVM, and the associated kernel parameter σ is set to the median of the pairwise distances between the two classes. In order to evaluate the performance of the probability estimator under the two different grids, the fixed grid employed in Wang et al. (2008) and the proposed adaptive grid, the cross entropy error (CRE) is used. The CRE over a set of data {(xi, yi) : i = 1, 2, ···, n} is defined as follows.
CRE = −n−1 Σ_{i=1}^n [1{yi = 1} log p̂(xi) + 1{yi = −1} log{1 − p̂(xi)}].    (25)
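A direct sketch of (25) is given below; the names are ours and the small numerical guard against log(0) is our addition.

```r
## Sketch of the cross-entropy error (25).  'phat' holds estimated
## probabilities P(Y = 1 | x_i), 'y' the labels in {-1, 1}; the clipping
## by 'eps' is our guard against log(0).
cre <- function(y, phat, eps = 1e-12) {
  phat <- pmin(pmax(phat, eps), 1 - eps)
  -mean(ifelse(y == 1, log(phat), log(1 - phat)))
}
```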
Table 2 contains summary information on the two well-known microarray data sets used, the Duke cancer data and the Colon cancer data. One merit of the proposed probability estimator is that it is not only distribution-free but also insensitive to the dimension of the predictors p, since it exploits the kernel trick. For tuning λ, we employ cross-validation (leave-one-out and 10-fold) instead of an independent tuning set, since the data sets have relatively small sample sizes. Table 3 shows the test CREs for the microarray examples. Clear improvements of the proposed adaptive grid over the fixed grid are observed for the microarray examples. This is because the joint solution surfaces of the microarray examples are complicated, and the fixed grid fails to reflect this complicated pattern of the solutions.
Table 2. Summary of the microarray data sets.

| Name | train (n) | test (ñ) | p | Source |
|---|---|---|---|---|
| Duke Cancer | 38 | 4 | 7219 | West et al. (2001) |
| Colon Cancer | 40 | 22 | 2000 | Alon et al. (1999) |
Table 3. Test CREs for the microarray examples under the fixed and adaptive grids (LOOCV: leave-one-out cross-validation).

| Data | Tuning | Fixed | Adaptive | Improvement |
|---|---|---|---|---|
| Duke | LOOCV | 0.4991 | 0.4794 | 3.94% |
| Duke | 10-fold | 0.4846 | 0.4729 | 2.42% |
| Colon | LOOCV | 0.3058 | 0.2511 | 17.89% |
| Colon | 10-fold | 0.3058 | 0.2646 | 13.45% |
8 Concluding Remarks
In this article, we first establish that the WSVM solution is jointly piecewise-linear in λ and π, and we then develop an efficient algorithm to compute the entire WSVM solution surface on the (λ × π)-plane by taking advantage of the joint piecewise-linearity. To demonstrate its practical value, we propose an adaptive tuning scheme for the probability estimation method based on the WSVM solution surface. Illustrations using real data sets show that the proposed adaptive tuning grid from the WSVM solution surface improves the performance of the probability estimator, especially when the solution surfaces are complicated. We remark that the proposed solution surface provides complete information about the WSVM solution for an arbitrary pair of λ and π and therefore potentially has wide applicability in various statistical problems.
The piecewise-linearity of the marginal λ-path, or regularization path, has been well studied in the literature. See Rosset and Zhu (2007) for a unified analysis under various models, where a general characterization of the loss and penalty functions that give piecewise-linear coefficient paths is derived. We would like to point out that it is possible to extend the notion of joint piecewise-linearity to more general problems beyond the weighted SVM. For example, we can show that the kernel quantile regression solution also enjoys the joint piecewise-linearity property in terms of its regularization parameter and the quantile parameter. However, it is not yet clear what general sufficient conditions for the joint piecewise-linearity are. We find that the sufficient conditions for the marginal (one-dimensional) piecewise-linearity, given in Rosset and Zhu (2007), generally do not imply the joint (two-dimensional) piecewise-linearity. For example, support vector regression (Vapnik; 1996) has piecewise-linear marginal paths with respect to the regularization parameter or the insensitivity parameter, but its solution does not have the joint piecewise-linearity property. Based on our experience, the two-dimensional solution surface is much more complex than the one-dimensional marginal paths, mainly due to the intricate roles played by the two parameters and their relationship. This is an important topic and worth further investigation.
Acknowledgments
The authors would like to thank Professor Richard A. Levine, the associate editor, and two reviewers for their constructive comments and suggestions that led to significant improvement of the article. This work is partially supported by NIH/NCI grants R01 CA-149569 (Shin and Wu), R01 CA-085848 (Zhang), and NSF Grant DMS-0645293 (Zhang). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH or NCI.
Footnotes
Online Supplement: The online supplement materials contain the computer code and the instructions on how to build and install our newly developed R package wsvmsurf. All the files are available via a single zipped file “Supplement.rar”. After unzipping the file, one can retrieve one folder (“wsvmsurf”) and two text files (“example.R” and “readme.txt”). The folder “wsvmsurf” contains all the source files (which are properly structured) for building the R package “wsvmsurf”. The file “readme.txt” gives detailed instructions for installing the package. For windows machines, the package can be built by using R-tools available at www.r-project.org. For Linux machines, the installation can be done by using the R command R CMD build wsvmsurf. The file “example.R” contains the code to run two examples in the paper.
Contributor Information
Seung Jun Shin, Email: sshin@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695.
Yichao Wu, Email: wu@stat.ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695.
Hao Helen Zhang, Email: hzhang@math.arizona.edu, Department of Mathematics, University of Arizona, Tucson, AZ 85718.
References
- Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences. 1999;96:6745–6750. doi: 10.1073/pnas.96.12.6745.
- Chambers J, Hastie T. Statistical Models in S. Pacific Grove, CA: Wadsworth and Brooks/Cole; 1992.
- Hastie T, Rosset S, Tibshirani R, Zhu J. The entire regularization path for the support vector machine. Journal of Machine Learning Research. 2004;5:1391–1415.
- Jaakkola T, Diekhans M, Haussler D. Using the Fisher kernel method to detect remote protein homologies. In: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology; Heidelberg, Germany. AAAI Press; 1999. pp. 149–158.
- Kimeldorf G, Wahba G. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications. 1971;33:82–95.
- Lin Y. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery. 2002;6:259–275.
- Lin Y, Lee Y, Wahba G. Support vector machines for classification in nonstandard situations. Machine Learning. 2004;46:191–202.
- Liu Y. Fisher consistency of multicategory support vector machines. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics; 2007. pp. 289–296.
- Rosset S, Zhu J. Piecewise linear regularized solution paths. Annals of Statistics. 2007;35(3):1012–1030.
- Vapnik V. The Nature of Statistical Learning Theory. Springer-Verlag; 1996.
- Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM; 1990.
- Wahba G. Support vector machines, reproducing kernel Hilbert spaces, and the randomized GACV. In: Schölkopf B, Burges CJC, Smola AJ, editors. Advances in Kernel Methods: Support Vector Learning. MIT Press; 1999. pp. 125–143.
- Wang J, Shen X, Liu Y. Probability estimation for large-margin classifiers. Biometrika. 2008;95:149–167.
- Wang L, Shen X. On l1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association. 2007;102:595–602.
- West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks J, Nevins J. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences. 2001;98:11462–11467. doi: 10.1073/pnas.201162998.
- Zhang HH, Ahn J, Lin X, Park C. Gene selection using support vector machines with nonconvex penalty. Bioinformatics. 2006;22(1):88–95. doi: 10.1093/bioinformatics/bti736.
- Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm support vector machines. Advances in Neural Information Processing Systems. 2003;16.