Author manuscript; available in PMC: 2014 Oct 3.
Published in final edited form as: J Comput Graph Stat. 2014 Apr 28;23(2):383–402. doi: 10.1080/10618600.2012.761139

Two-Dimensional Solution Surface for Weighted Support Vector Machines

Seung Jun Shin, Yichao Wu, Hao Helen Zhang
PMCID: PMC4060833  NIHMSID: NIHMS533706  PMID: 24955001

Abstract

The support vector machine (SVM) is a popular learning method for binary classification. Standard SVMs treat all data points equally, but in some practical problems it is more natural to assign different weights to observations from different classes. This leads to a broader class of learning methods, the so-called weighted SVMs (WSVMs), and one of their important applications is to estimate class probabilities in addition to learning the classification boundary. There are two parameters associated with the WSVM optimization problem: the regularization parameter and the weight parameter. In this paper we first establish that the WSVM solutions are jointly piecewise-linear with respect to both the regularization and weight parameters. We then develop a state-of-the-art algorithm that can compute the entire trajectory of WSVM solutions for every pair of regularization and weight parameters at a feasible computational cost. The derived two-dimensional solution surface provides theoretical insight into the behavior of the WSVM solutions. Numerically, the algorithm can greatly facilitate the implementation of the WSVM and automate the selection of the optimal regularization parameter. We illustrate the new algorithm on various examples.

Keywords: binary classification, probability estimation, solution surface, support vector machine, weighted support vector machine

1 Introduction

Binary classification is frequently encountered in real applications: we are given a training set {(xi, yi), i = 1, ···, n} of size n, and the goal is to learn a classification rule. Here xi ∈ ℝp and yi ∈ {−1, 1} denote a p-dimensional predictor and a binary response (or class label), respectively, for the ith example. The primary goal of binary classification is to construct a classifier that can be used to predict the class of a new object from its predictors. Among the many existing methods for binary classification, the support vector machine (SVM) (Vapnik, 1996) is one of the best-known classifiers and has gained much popularity since its introduction. It originates from the simple idea of finding an optimal hyperplane to separate two classes, optimal in the sense that the geometric margin between the two classes is maximized. Later, Wahba (1999) showed that the SVM can be cast in the general regularization framework by solving

$$\min_{f \in \mathcal{F}} \sum_{i=1}^{n} H_1\big(y_i f(x_i)\big) + \frac{\lambda}{2} J(f), \qquad (1)$$

where H1(u) = max(1 − u, 0) is the hinge loss function, J(f) denotes the roughness penalty of a function f(·) in a function space $\mathcal{F}$, and the sign of f(x) for a given predictor x is used for class prediction. Here λ > 0 is a regularization parameter that balances data fitting, measured by the hinge loss, against model complexity, measured by the roughness penalty. Lin (2002) shows that the hinge loss is Fisher consistent; see Liu (2007) for a more detailed discussion of Fisher consistency of different loss functions. A common choice of the penalty is $J(f) = \|f\|_{\mathcal{F}}^2$. Other choices include the $l_1$-norm penalty (Zhu et al., 2003; Wang and Shen, 2007) and the SCAD penalty (Zhang et al., 2006). In general, we take $\mathcal{F}$ to be the reproducing kernel Hilbert space (RKHS; Wahba, 1990) $\mathcal{H}_K$ generated by a non-negative definite kernel K(·, ·). By the Representer Theorem (Kimeldorf and Wahba, 1971), the optimizer of (1) has the finite-dimensional representation

$$f(x) = b + \sum_{i=1}^{n} \theta_i K(x, x_i). \qquad (2)$$

Using (2), it can be shown that the roughness penalty of f in $\mathcal{H}_K$ is $\sum_{i=1}^{n}\sum_{j=1}^{n} \theta_i \theta_j K(x_i, x_j)$. The SVM then estimates f(x) by solving

$$\min_{b, \theta_1, \cdots, \theta_n} \sum_{i=1}^{n} H_1\Big(y_i\big[b + \sum_{j=1}^{n} \theta_j K(x_i, x_j)\big]\Big) + \frac{\lambda}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \theta_i \theta_j K(x_i, x_j). \qquad (3)$$

For the SVM (3), Hastie et al. (2004) showed that the optimizers b, θ1, ···, θn change piecewise-linearly when the regularization parameter λ changes and proposed an efficient solution path algorithm. From now on, we refer to this path as a λ-path.

Note that, in the standard SVM, each observation is treated equally no matter which class it belongs to. Yet this may not always be optimal. In some situations it is desirable to assign different weights to observations from different classes; one such example is when one type of misclassification incurs a larger cost than the other. Motivated by this, Lin et al. (2004) considered the weighted SVM (WSVM), which solves

$$\min_{b, \theta_1, \cdots, \theta_n} \sum_{i=1}^{n} \pi_i H_1\Big(y_i\big[b + \sum_{j=1}^{n} \theta_j K(x_i, x_j)\big]\Big) + \frac{\lambda}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \theta_i \theta_j K(x_i, x_j), \qquad (4)$$

where the weights πi are given by

$$\pi_i \equiv \pi(y_i) = \begin{cases} 1 - \pi & \text{if } y_i = +1 \\ \pi & \text{if } y_i = -1 \end{cases}$$

with π ∈ (0, 1). Each point (xi, yi) is associated with a weight πi whose value depends on the label yi. According to Lin et al. (2004), the WSVM classifier provides a consistent estimate of sign(p(x) − π) for any x, where p(x) is the conditional class probability p(x) = P(Y = 1|X = x). Using this fact, Wang et al. (2008) proposed the weighted SVM for probability estimation. The basic idea is to divide and conquer by converting a probability estimation problem into many classification sub-problems. Each classification sub-problem is assigned a different weight parameter π, and these sub-problems are solved separately. The solutions are then assembled to construct the final probability estimator. A more detailed description is as follows. Consider a sequence 0 < π(1) < ··· < π(M) < 1. For each m = 1, ···, M, solve (4) with π(m) and denote the solution by f̂m(·). Finally, for any x, construct the probability estimator as p̂(x) = {π(m+) + π(m−)}/2, where m+ = max{m : f̂m(x) > 0} and m− = min{m : f̂m(x) < 0}. Advantages of the weighted SVM include its flexibility and its capability of handling high-dimensional data.
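To make the assembly step concrete, the following is a minimal R sketch of this construction (R is the language of the paper's implementation; see Section 4.4). The inputs fhat, a vector of fitted decision values f̂m(x) at a single point x, and pi_grid are hypothetical names, and the two boundary conventions for the no-sign-change cases are our own assumption rather than part of the scheme.

    # Sketch: assemble the probability estimate at one point x from the WSVM
    # solutions fitted over a grid of weight parameters pi_(1) < ... < pi_(M).
    estimate_prob <- function(fhat, pi_grid) {
      pos <- which(fhat > 0)                 # sub-problems classifying x as +1
      neg <- which(fhat < 0)                 # sub-problems classifying x as -1
      if (length(pos) == 0) return(pi_grid[1] / 2)                     # assumed boundary convention
      if (length(neg) == 0) return((pi_grid[length(pi_grid)] + 1) / 2) # assumed boundary convention
      m_plus  <- max(pos)                    # largest m with fhat_m(x) > 0
      m_minus <- min(neg)                    # smallest m with fhat_m(x) < 0
      (pi_grid[m_plus] + pi_grid[m_minus]) / 2
    }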

One main concern about the probability estimation scheme proposed by Wang et al. (2008) is its computational cost, which comes from two sources: there are multiple sub-problems to solve, since the weight parameter π varies in (0, 1), and each sub-problem is associated with a regularization parameter λ that needs to be adaptively tuned over (0, ∞). To facilitate the computation, the λ-path algorithm of the standard SVM (Hastie et al., 2004) can be extended to the WSVM for any fixed π ∈ (0, 1). In addition, Wang et al. (2008) developed the π-path algorithm for any fixed λ > 0. Both the λ-path and the π-path are piecewise-linear. However, it is largely unknown how the WSVM solution fλ,π changes when the regularization parameter λ and the weight parameter π vary together. The main purpose of our two-dimensional solution surface is to reduce the computation and tuning burden by obtaining the solutions for all possible (π, λ) automatically with an efficient algorithm.

One main motivation of this paper is to study the behavior of the entire set of WSVM solutions and to characterize them in a simple representation through their relationship to π and λ. We use subscripts to emphasize that the WSVM solution f is a function of λ and π, and sometimes omit the subscripts when they are clear from the context. Another motivation for the solution surface is to automate the selection of the regularization parameter and improve the efficiency of the search process. Although Wang et al. (2008)'s conditional class probability estimator performs well, as demonstrated by their numerical examples, its performance depends heavily on λ. They proposed to tune λ by a grid search in their numerical illustrations. Yet it is well known that such a grid search can be computationally inefficient and that its performance depends on how fine the grid is. These considerations motivate us to develop a two-dimensional solution surface (rather than a one-dimensional path) as a continuous function of both λ and π, analogous to the way the entire λ-regularization path resolved the inefficiency of grid search for selecting the regularization parameter λ of the standard SVM (Hastie et al., 2004). From now on, we refer to the new two-dimensional solution surface of the WSVM as the WSVM solution surface.

To show the difficulties in tuning the regularization parameter for probability estimation (Wang et al., 2008) and to motivate our new tuning method based on the WSVM solution surface, we use a univariate toy example generated from a Gaussian mixture: xi is randomly drawn from the standard normal distribution if yi = 1 and from N(1, 1) otherwise, with five points from each class. The linear kernel K(xi, xj) = xixj is employed for the WSVM, whose solution is then f(x) = b + βx with $\beta = \sum_{i=1}^{n} \theta_i x_i$. To describe the behavior of f(x), we plot λβ based on the obtained WSVM solution surface (or path), since λβ, rather than β, is piecewise-linear in λ under our parametrization. In Figure 1, the top two panels depict the solution paths of λβ for λ fixed at 0.2, 0.4, 0.6, 0.8, 1.0 (left) and the corresponding estimates p̂(·) as functions of x (right); the bottom two panels plot the entire two-dimensional joint solution surface (left) and the corresponding probability estimate p̂(·) as a function of λ as well as x (right). We note that, although all five π-paths are piecewise-linear in π, they have quite different shapes for different values of λ (see (a)). Consequently, the corresponding probability estimates can differ considerably even at the same x (see (b)), underscoring the importance of selecting an optimal λ. Using the proposed WSVM solution surface, we can completely recover the WSVM solutions on the whole (λ × π)-plane (see (c)), which enables us to produce the corresponding conditional probability estimates at a given x for every λ with very little computational expense (see (d)). We will shortly demonstrate that extracting marginal paths (λ-paths or π-paths) is computationally efficient once the WSVM solution surface is obtained.

Figure 1.

Simulated toy example: the top two panels depict the solution λβ and the estimate p̂(x) given by five marginal π-solution paths, with λ fixed at 0.2, 0.4, 0.6, 0.8, 1.0. The bottom two panels plot the joint solution surface of λβ and the corresponding surface of p̂(x). Notice that the horizontal axis of (b) is x; similarly, the surface in (d) lies on the (x × λ)-plane.

In this example we use a grid of five equally-spaced λ values, which is very coarse. In practice, it is typically not known a priori how fine the grid should be or what the appropriate range of the grid is. If the data are large or complicated, the chosen grid may not be fine enough to capture the variation of the WSVM solution, and the subsequent probability estimation will lose efficiency. The proposed WSVM solution surface provides a complete portrait of the WSVM solution for any pair of λ and π and therefore naturally overcomes these practical difficulties, in addition to the gain in computational efficiency.

In this article, we show that the WSVM solution is jointly piecewise-linear in both λ and π and propose an efficient algorithm that constructs the entire solution surface on the (λ × π)-plane by exploiting the established joint piecewise-linearity. As a straightforward application, we propose an adaptive grid for tuning the regularization parameter of the probability estimation scheme (Wang et al., 2008). We finally remark that the WSVM solution surface has broad applications beyond probability estimation.

The rest of the article is organized as follows. In Section 2, the WSVM problem is formulated in detail to develop the joint solution surface of the WSVM. In Section 3, we establish the joint piecewise-linearity of the WSVM solution on the (λ × π)-plane. An efficient algorithm to compute the WSVM solution surface, which takes advantage of this piecewise-linearity, is developed in Section 4; its computational complexity is explored in Section 5. The proposed WSVM solution surface algorithm is illustrated on the kyphosis data in Section 6 and applied to probability estimation problems on several real data sets in Section 7. Some concluding remarks are given in Section 8.

2 Problem Setup

By introducing nonnegative slack variables ξi, i = 1, ···, n and using inequality constraints, the WSVM problem (4) can be equivalently rewritten as

$$\min_{b, \theta_1, \cdots, \theta_n} \sum_{i=1}^{n} \pi_i \xi_i + \frac{\lambda}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \theta_i \theta_j K(x_i, x_j) \quad \text{subject to} \quad 1 - y_i\Big[b + \sum_{j=1}^{n} \theta_j K(x_i, x_j)\Big] \le \xi_i \ \text{ and } \ \xi_i \ge 0, \quad i = 1, 2, \cdots, n.$$

The corresponding Lagrangian primal function is constructed as

$$L_P: \quad \sum_{i=1}^{n} \pi_i \xi_i + \frac{\lambda}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \theta_i \theta_j K(x_i, x_j) + \sum_{i=1}^{n} \alpha_i \big(1 - y_i f(x_i) - \xi_i\big) - \sum_{i=1}^{n} \gamma_i \xi_i, \qquad (5)$$

where αi ≥ 0 and γi ≥ 0 are the Lagrange multipliers. To derive the corresponding dual problem, we set the partial derivatives of LP in (5) with respect to the primal variables to zero, which gives

$$\frac{\partial}{\partial \theta_i}: \quad \sum_{j=1}^{n} \theta_j K(x_i, x_j) = \frac{1}{\lambda} \sum_{j=1}^{n} \alpha_j y_j K(x_i, x_j), \qquad (6)$$
$$\frac{\partial}{\partial b}: \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad (7)$$
$$\frac{\partial}{\partial \xi_i}: \quad \alpha_i = \pi_i - \gamma_i, \qquad (8)$$

along with the Karush-Kuhn-Tucker (KKT) conditions

$$\alpha_i\big[1 - y_i f(x_i) - \xi_i\big] = 0, \qquad (9)$$
$$\gamma_i \xi_i = 0. \qquad (10)$$

Notice from (6) that the function (2) can be rewritten as

$$f(x) = b + \frac{1}{\lambda} \sum_{i=1}^{n} \alpha_i y_i K(x, x_i). \qquad (11)$$

By combining (6)–(10), the dual problem of the WSVM is given by

$$\max_{\alpha_1, \cdots, \alpha_n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2\lambda} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} y_i \alpha_i = 0 \ \text{ and } \ 0 \le \alpha_i \le \pi_i, \quad i = 1, \cdots, n, \qquad (12)$$

and it can be readily solved via quadratic programming (QP) for any given λ and π. However, the two parameters λ and π are typically not known in practice. A classical approach is to repeatedly solve the QP problem (12) for different pairs (λ, π) in order to select the desired λ and π. This is computationally intensive for large data sets, because the QP itself is a numerical method whose computational complexity grows polynomially in n.
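For a single fixed pair (λ, π), the dual (12) can be handed to an off-the-shelf QP routine. Below is a hedged R sketch using quadprog::solve.QP; the small ridge added to the quadratic term is our own numerical safeguard (solve.QP requires a positive definite matrix), and all variable names are illustrative rather than taken from the paper's software.

    # Sketch: solve the WSVM dual (12) once, for fixed lambda ('lam') and
    # weight parameter 'ppi', given the n x n kernel matrix K and labels y.
    library(quadprog)
    wsvm_dual <- function(K, y, lam, ppi) {
      n    <- length(y)
      pvec <- ifelse(y == +1, 1 - ppi, ppi)   # individual weights pi_i
      Q    <- (y %o% y) * K                   # Q_ij = y_i y_j K(x_i, x_j)
      Dmat <- Q / lam + diag(1e-8, n)         # ridge keeps Dmat positive definite
      dvec <- rep(1, n)                       # linear term: sum_i alpha_i
      # Constraints: y'alpha = 0 (equality), alpha >= 0, -alpha >= -pvec.
      Amat <- cbind(y, diag(n), -diag(n))
      bvec <- c(0, rep(0, n), -pvec)
      solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution   # the alpha_i
    }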

For the standard SVM (a special case of the WSVM with weight parameter π = 0.5), Hastie et al. (2004) showed the piecewise-linearity of the αi in λ and developed an efficient algorithm to compute the entire piecewise-linear solution path. The same idea extends straightforwardly to the WSVM with any fixed weight parameter or any fixed individual weights. In addition, Wang et al. (2008) showed that the WSVM solutions αi are piecewise-linear in the weight parameter π with the regularization parameter λ held fixed. However, it is largely unknown how the WSVM solution αi changes with respect to the two parameters jointly. In this article, we show that the WSVM solutions αi, as functions of both λ and π, form a continuous piecewise-linear solution surface on the (λ × π)-plane, and we propose an efficient algorithm to compute the entire surface.

Similar to the idea of Hastie et al. (2004), we categorize the examples i = 1, ···, n into three disjoint sets:

  • 𝓔 = {i : yi f(xi) = 1, 0 ≤ αi ≤ πi} (elbow),

  • 𝓛 = {i : yi f(xi) < 1, αi = πi} (left),

  • 𝓡 = {i : yi f(xi) > 1, αi = 0} (right).

It is easy to see that these three sets are always uniquely defined by the conditions (6)–(10). The set names come from the shape of the hinge loss function (Hastie et al., 2004). Note that {αi, i ∈ 𝓔} contains most of the information on how the WSVM solution changes over the (λ × π)-plane, since the remaining solutions {αi, i ∈ 𝓛 ∪ 𝓡} are trivially determined by the definitions of the sets.

As λ and π change, the sets may change; whenever this happens, we call it an event. All the solution surfaces for αi, i = 1, ···, n are continuous, and hence no element of 𝓛 can move directly to 𝓡 or vice versa. Therefore, only three possible events need to be considered. The first occurs when some αi, i ∈ 𝓔 reaches πi and the corresponding index i exits 𝓔 for 𝓛 (event 1). Similarly, an αi, i ∈ 𝓔 can reach the other boundary 0, in which case the index moves from 𝓔 to 𝓡 (event 2). The last event happens when an element i of 𝓛 ∪ 𝓡 satisfies yi f(xi) = 1 and enters 𝓔 (event 3).
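The categorization itself is simple to code. A minimal R sketch, where f holds the fitted values f(xi) and the numerical tolerance eps is our own convention:

    # Sketch: split indices 1..n into the elbow, left, and right sets.
    classify_sets <- function(f, y, eps = 1e-8) {
      yf <- y * f
      list(elbow = which(abs(yf - 1) <= eps),   # y_i f(x_i) = 1, 0 <= alpha_i <= pi_i
           left  = which(yf < 1 - eps),         # y_i f(x_i) < 1, alpha_i = pi_i
           right = which(yf > 1 + eps))         # y_i f(x_i) > 1, alpha_i = 0
    }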

3 Joint piecewise-linearity

In this section, we study the behavior of the WSVM solutions from a theoretical point of view. One major discovery is that the αi, i ∈ 𝓔, and hence all the αi, i = 1, ···, n, change in a jointly piecewise-linear manner as λ and π vary. The following theorem describes how the αi move as λ and π change. For simplicity, we define α0 = λb.

Theorem 1

(Joint Piecewise-Linearity) Suppose we have a point (λℓ, πℓ) in the (λ × π)-plane. Let 𝓔ℓ, 𝓛ℓ, 𝓡ℓ, αℓ = (α1ℓ, ···, αnℓ)T, and α0ℓ denote the associated sets and solutions obtained at (λℓ, πℓ). Now consider a subset 𝓢ℓ of the (λ × π)-plane that contains (λℓ, πℓ) and within which no event happens; in other words, for all (λ, π) ∈ 𝓢ℓ, the associated three sets 𝓔, 𝓛, and 𝓡 remain the same as 𝓔ℓ, 𝓛ℓ, and 𝓡ℓ, respectively. Then the solution αi, i ∈ {0} ∪ 𝓔ℓ, denoted in vector form by α0,𝓔, moves in 𝓢ℓ as follows:

$$\alpha_{0,\mathcal{E}} \equiv \alpha_{0,\mathcal{E}}(\lambda, \pi) = \alpha_{0,\mathcal{E}}^{\ell} + G\Delta, \qquad (\lambda, \pi) \in \mathcal{S}^{\ell}, \qquad (13)$$

where α0,𝓔 = {αi : i ∈ {0} ∪ 𝓔ℓ} and Δ = (Δλ, Δπ)T = (λ − λℓ, π − πℓ)T. The gradient matrix G is given by

$$G = A^{-1}B = \begin{pmatrix} 0 & y^T \\ y & K \end{pmatrix}^{-1} \begin{pmatrix} 0 & |\mathcal{L}^{\ell}| \\ \mathbf{1} & -k \end{pmatrix}, \qquad (14)$$

where $K = \{y_i y_j K(x_i, x_j) : i, j \in \mathcal{E}^{\ell}\}$; $k = \{\sum_{j \in \mathcal{L}^{\ell}} y_i K(x_i, x_j) : i \in \mathcal{E}^{\ell}\}^T$; $y = \{y_i : i \in \mathcal{E}^{\ell}\}^T$; $|A|$ denotes the cardinality of a set A; and $\mathbf{1}$ is the vector of ones of length $|\mathcal{E}^{\ell}|$.

Proof

We can rewrite (11) as:

$$f(x) = \Big[f(x) - \frac{\lambda^{\ell}}{\lambda} f^{\ell}(x)\Big] + \frac{\lambda^{\ell}}{\lambda} f^{\ell}(x) = \frac{1}{\lambda}\Big[(\alpha_0 - \alpha_0^{\ell}) + \sum_{j \in \mathcal{E}^{\ell}} (\alpha_j - \alpha_j^{\ell})\, y_j K(x, x_j) + (\pi - \pi^{\ell}) \sum_{j \in \mathcal{L}^{\ell}} K(x, x_j)\Big] + \frac{\lambda^{\ell}}{\lambda} f^{\ell}(x). \qquad (15)$$

Moreover, in 𝓢ℓ we have yi f(xi) = yi fℓ(xi) = 1 for i ∈ 𝓔ℓ, which leads to

$$\Delta\lambda = \nu_0 y_i + \sum_{j \in \mathcal{E}^{\ell}} \nu_j y_i y_j K(x_i, x_j) + \Delta\pi \sum_{j \in \mathcal{L}^{\ell}} y_i K(x_i, x_j), \qquad (16)$$

where νi = αi − αiℓ for all i ∈ {0} ∪ 𝓔ℓ. In addition, the definitions of the sets give $\sum_{i=1}^{n} y_i \alpha_i = \sum_{i \in \mathcal{E}^{\ell}} y_i \alpha_i + \sum_{i \in \mathcal{L}^{\ell}} y_i \pi_i$, which is linear in π, and the condition (7) then gives

$$\sum_{i \in \mathcal{E}^{\ell}} y_i \nu_i = |\mathcal{L}^{\ell}|\, \Delta\pi. \qquad (17)$$

Finally, we have (|𝓔ℓ| + 1) linear equations from (16) and (17), which can be expressed in matrix form as Aν = BΔ, where ν = {νi : i ∈ {0} ∪ 𝓔ℓ}T = α0,𝓔 − α0,𝓔ℓ. The desired result follows by assuming A to be invertible.

We remark that A is rarely singular in practice; related discussions can be found in Hastie et al. (2004). It is worth pointing out that the joint piecewise-linearity of the solution guarantees the marginal piecewise-linearity presented in Corollary 2, but not vice versa. Therefore, Theorem 1 implies the piecewise-linearity of the marginal solution paths as functions of either λ or π, which were separately explored by Hastie et al. (2004) and Wang et al. (2008).
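For concreteness, the gradient matrix G of (14) can be assembled in a few lines. A hedged R sketch under the theorem's assumption that A is invertible; Kmat is the full kernel matrix, E and L are the current elbow and left index sets, and all names are ours:

    # Sketch: build A and B of (14) and return G = A^{-1} B.
    gradient_G <- function(Kmat, y, E, L) {
      yE <- y[E]
      Ke <- (yE %o% yE) * Kmat[E, E, drop = FALSE]      # {y_i y_j K(x_i, x_j)}
      k  <- yE * rowSums(Kmat[E, L, drop = FALSE])      # {sum_{j in L} y_i K(x_i, x_j)}
      A  <- rbind(c(0, yE), cbind(yE, Ke))
      B  <- rbind(c(0, length(L)), cbind(1, -k))
      solve(A, B)                                       # (|E| + 1) x 2 gradient matrix
    }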

Corollary 2

(Marginal Piecewise-Linearity) For a given π0, the solution α0,𝓔 moves linearly in λ ∈ {λ : (λ, π0) ∈ 𝓢ℓ} as follows:

$$\alpha_{0,\mathcal{E}} = \alpha_{0,\mathcal{E}}^{\ell} + g_1 \Delta\lambda.$$

Similarly, α0,𝓔 changes linearly in π ∈ {π : (λ0, π) ∈ 𝓢ℓ} for a given λ0 as follows:

$$\alpha_{0,\mathcal{E}} = \alpha_{0,\mathcal{E}}^{\ell} + g_2 \Delta\pi,$$

where g1 and g2 denote the first and second column vectors of G in (14), respectively.

The classification function f(x) can be conveniently updated by plugging (13) into (15), which gives

$$f(x) = \frac{\lambda^{\ell}}{\lambda}\big[f^{\ell}(x) - h_1(x)\big] + h_1(x) + \frac{\pi - \pi^{\ell}}{\lambda}\, h_2(x), \qquad (18)$$

where

$$h_1(x) = g_{01} + \sum_{i \in \mathcal{E}^{\ell}} g_{i1}\, y_i K(x, x_i), \qquad h_2(x) = g_{02} + \sum_{i \in \mathcal{E}^{\ell}} g_{i2}\, y_i K(x, x_i) + \sum_{i \in \mathcal{L}^{\ell}} K(x, x_i),$$

and (gi1, gi2) denotes the row of G corresponding to i ∈ {0} ∪ 𝓔ℓ. We observe from (18) that f(x) is not jointly piecewise-linear in (λ, π), although it is marginally piecewise-linear in λ⁻¹ and in π, respectively.
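The update (18) is equally direct to code. A short R sketch, where f_old is fℓ(x) at the anchor (lam0, pi0), kx is the vector of K(x, xi) for i = 1, ···, n, and G comes from the previous sketch (all names ours):

    # Sketch: evaluate f(x) at a new (lam, ppi) inside the same polygon via (18).
    update_f <- function(f_old, lam0, pi0, lam, ppi, G, kx, y, E, L) {
      h1 <- G[1, 1] + sum(G[-1, 1] * y[E] * kx[E])                # h_1(x)
      h2 <- G[1, 2] + sum(G[-1, 2] * y[E] * kx[E]) + sum(kx[L])   # h_2(x)
      (lam0 / lam) * (f_old - h1) + h1 + (ppi - pi0) / lam * h2
    }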

4 Solution Surface Algorithm

In this section, we propose an efficient algorithm to compute the entire solution surface of the WSVM on the (λ × π)-plane by using the joint piecewise-linearity established in Theorem 1.

4.1 Initialization

Denote the index sets I+ = {i : yi = 1} and I− = {i : yi = −1}. We initialize the algorithm at π0 = |I+|/n. Notice that π0 satisfies the so-called balanced condition Σi∈I+ πi(π0) = Σi∈I− πi(π0). With π = π0, it is easy to verify that, for a sufficiently large λ, αi = πi for all i; that is, αi = 1 − π0 if i ∈ I+ and αi = π0 otherwise. Following the idea of Hastie et al. (2004), the initial values of λ and α0, denoted by λ0 and α00 respectively, are given by

$$\lambda_0 = \frac{1}{2}\Big[(1 - \pi_0) \sum_{i \in I_+} (K_{i+} - K_{i-}) + \pi_0 \sum_{i \in I_-} (K_{i+} - K_{i-})\Big] \quad \text{and} \quad \alpha_0^0 = -\frac{1}{2}\Big[(1 - \pi_0) \sum_{i \in I_+} (K_{i+} - K_{i-}) - \pi_0 \sum_{i \in I_-} (K_{i+} - K_{i-})\Big],$$

where $K_{i+} = K(x_i, x_{i_+})$ and $K_{i-} = K(x_i, x_{i_-})$. The indices i+ and i− are defined as

$$i_+ = \arg\max_{i \in I_+} \Big\{(1 - \pi_0) \sum_{j \in I_+} K(x_i, x_j) - \pi_0 \sum_{j \in I_-} K(x_i, x_j)\Big\} \quad \text{and} \quad i_- = \arg\min_{i \in I_-} \Big\{(1 - \pi_0) \sum_{j \in I_+} K(x_i, x_j) - \pi_0 \sum_{j \in I_-} K(x_i, x_j)\Big\}.$$

It is possible to initialize the algorithm at any π between 0 and 1 rather than at π0; however, we empirically observe that the initial λ is largest when π = π0. Notice that the corresponding solution is trivial, αi = πi for all i = 1, ···, n, for any λ larger than λ0. Therefore, the proposed algorithm focuses only on the non-trivial solutions obtained on 𝓓 = {(λ, π) : 0 ≤ λ ≤ λ0, 0 ≤ π ≤ 1}. Finally, the three sets initialized at (λ0, π0), denoted by 𝓔0, 𝓛0 and 𝓡0 respectively, are given by

$$\mathcal{E}^0 = \{i_+, i_-\}, \qquad \mathcal{L}^0 = \{1, \cdots, n\} \setminus \mathcal{E}^0, \qquad \mathcal{R}^0 = \emptyset,$$

where ∅ denotes the empty set.
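The initialization can be coded directly from these formulas. A hedged R sketch (function and variable names are ours):

    # Sketch: compute pi_0, lambda_0, and the initial elbow pair (i_plus, i_minus).
    initialize_surface <- function(Kmat, y) {
      Ip  <- which(y == +1); Im <- which(y == -1)
      pi0 <- length(Ip) / length(y)
      # g_i = (1 - pi0) sum_{j in I+} K(x_i, x_j) - pi0 sum_{j in I-} K(x_i, x_j)
      g   <- (1 - pi0) * rowSums(Kmat[, Ip, drop = FALSE]) -
                  pi0  * rowSums(Kmat[, Im, drop = FALSE])
      ip  <- Ip[which.max(g[Ip])]            # i_plus  = argmax over I+
      im  <- Im[which.min(g[Im])]            # i_minus = argmin over I-
      lam0 <- 0.5 * ((1 - pi0) * sum(Kmat[Ip, ip] - Kmat[Ip, im]) +
                          pi0  * sum(Kmat[Im, ip] - Kmat[Im, im]))
      list(pi0 = pi0, lambda0 = lam0, elbow = c(ip, im))
    }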

4.2 Update

Recall that no event occurs within 𝓢ℓ, so the WSVM solution can be updated by applying Theorem 1 for any (λ, π) ∈ 𝓢ℓ. Therefore, it is essential to know how to make 𝓢ℓ as large as possible for any anchor point (λℓ, πℓ). We demonstrate next that the set 𝓢ℓ can be explicitly determined by a collection of linear constraints.

Note that event 1 happens when αi reaches πi for some i ∈ 𝓔ℓ. Based on (13), we have the following inequality constraints that prevent event 1 from happening:

$$g_{i1}\lambda + (g_{i2} + 1)\pi \le t_i + 1, \quad i \in \mathcal{E}_+^{\ell}, \qquad (19)$$
$$g_{i1}\lambda + (g_{i2} - 1)\pi \le t_i, \quad i \in \mathcal{E}_-^{\ell}, \qquad (20)$$

where $t_i = g_{i1}\lambda^{\ell} + g_{i2}\pi^{\ell} - \alpha_i^{\ell}$, $\mathcal{E}_+^{\ell} = \mathcal{E}^{\ell} \cap I_+$, and $\mathcal{E}_-^{\ell} = \mathcal{E}^{\ell} \cap I_-$. In a similar way, we have the following inequalities to prevent event 2:

$$g_{i1}\lambda + g_{i2}\pi \ge t_i, \quad i \in \mathcal{E}^{\ell}. \qquad (21)$$

To prevent event 3, we need yi f(xi) < 1 for i ∈ 𝓛ℓ and yi f(xi) > 1 for i ∈ 𝓡ℓ. Therefore, noting (18), we have

$$\big(y_i h_1(x_i) - 1\big)\lambda + y_i h_2(x_i)\,\pi \ge s_i, \quad i \in \mathcal{R}^{\ell}, \qquad (22)$$
$$\big(y_i h_1(x_i) - 1\big)\lambda + y_i h_2(x_i)\,\pi \le s_i, \quad i \in \mathcal{L}^{\ell}, \qquad (23)$$

where $s_i = y_i\big(h_1(x_i) - f^{\ell}(x_i)\big)\lambda^{\ell} + y_i h_2(x_i)\,\pi^{\ell}$. We remark that the inequalities need not be strict, since an event is an instantaneous transition. Recall that it suffices to consider the solution on 𝓓, and hence the additional constraints 0 ≤ λ ≤ λ0 and 0 ≤ π ≤ 1 are imposed by default. In summary, 𝓢ℓ is the subregion of the (λ × π)-plane that satisfies all the constraints (19)–(23). Figure 2 illustrates the 𝓢ℓ generated from the initial point (λ0, π0) for the toy example of Section 1. We remark that 𝓢ℓ forms a polygon that can be uniquely represented by its vertices, since the constraints are all linear.

Figure 2.

The illustration of determining 𝓢ℓ generated from the initial point (λ0, π0) for the toy example of Section 1: each line in the left panel (a) represents a constraint boundary on which an event occurs, as labeled. For example, the upper right label represents the event in which the fourth example moves from 𝓔 to 𝓡 on the boundary. The set 𝓢ℓ obtained from the boundaries in (a) is depicted in (b).

We next describe how to determine the vertices of 𝓢ℓ efficiently. First, compute all pairwise intersection points of the boundaries of (19)–(23); this yields nc(nc − 1)/2 intersection points, where nc = n + |𝓔ℓ| denotes the number of constraints in (19)–(23). The left panel of Figure 2 shows all the intersections of the boundaries of the obtained constraints. We then identify the vertices as those intersections that satisfy all the constraints (19)–(23), as illustrated in panel (b) of Figure 2. We denote the vertices of 𝓢ℓ by {v1ℓ, ···, vnvℓ}, where vrℓ = (λrℓ, πrℓ), r = 1, ···, nv.

There are a couple of issues to clarify here. First, based on our limited experience, nv is small and typically does not exceed eight. Hence it is inefficient to compute all nc(nc − 1)/2 intersections, given the computational burden involved. Note also that some constraints are dominated by others on 𝓓 and are automatically satisfied whenever the others hold. Consequently, we can save a large amount of computing time by excluding the constraints dominated by others, especially when n is large. We also need to order the vertices for the set-updates discussed next. Here the ordering means that vertex vrℓ, r = 1, ···, nv is adjacent to vr−1ℓ and vr+1ℓ, where we set v0ℓ = vnvℓ and vnv+1ℓ = v1ℓ (see panel (b) of Figure 2). Updating α1, ···, αn as well as α0 at the vertices of 𝓢ℓ is then straightforward by Theorem 1.
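A brute-force R sketch of the vertex search, with the constraints stored as rows (a, b, c) of a matrix C, each row meaning aλ + bπ ≥ c; this is the O(nc³) version whose cost motivates the refining step discussed in Section 5, and the representation is our own:

    # Sketch: intersect all boundary pairs and keep the feasible intersections.
    polygon_vertices <- function(C, tol = 1e-9) {
      nc <- nrow(C); V <- NULL
      for (r in seq_len(nc - 1)) for (s in (r + 1):nc) {
        A2 <- rbind(C[r, 1:2], C[s, 1:2])
        if (abs(det(A2)) < tol) next                    # parallel boundaries
        v <- solve(A2, c(C[r, 3], C[s, 3]))             # boundary intersection
        if (all(C[, 1] * v[1] + C[, 2] * v[2] >= C[, 3] - tol))
          V <- rbind(V, v)                              # satisfies all constraints
      }
      unique(round(V, 10))                              # the vertices of S
    }

The vertices returned can then be put in adjacent order, for instance by sorting them by angle around their centroid.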

Figure 3 illustrates how to extend the polygons over 𝓓 from the current 𝓢ℓ in Figure 2(b). Notice that the sides of 𝓢ℓ are determined by boundaries of the constraints (19)–(23) and hence represent the corresponding events. Each midpoint of two adjacent vertices can be used as a new starting point (λℓ+1, πℓ+1) in Theorem 1 to compute a new polygon 𝓢ℓ+1. Computing the midpoints, denoted by $m_r^{\ell} = (\lambda_r^{\ell+1}, \pi_r^{\ell+1})$, and the corresponding solutions $\bar{\alpha}_r^{\ell+1} = \{(\alpha_0, \alpha_1, \cdots, \alpha_n) \text{ obtained at } m_r^{\ell}\}^T$, r = 1, ···, nv, is trivial due to the piecewise-linearity of the solutions. The bar in $\bar{\alpha}_r^{\ell+1}$ emphasizes that these quantities are obtained at midpoints, not vertices. At each midpoint, the three sets, denoted by $\mathcal{E}_r^{\ell+1}$, $\mathcal{R}_r^{\ell+1}$, and $\mathcal{L}_r^{\ell+1}$ respectively, are updated according to which boundary the midpoint $m_r^{\ell}$ lies on. For example, as shown in panel (a) of Figure 3, the three sets obtained at $m_1^{\ell}$ are updated as $\mathcal{E}_1^{\ell+1} = \mathcal{E}^{\ell} \setminus \{4\}$, $\mathcal{R}_1^{\ell+1} = \mathcal{R}^{\ell} \cup \{4\}$, and $\mathcal{L}_1^{\ell+1} = \mathcal{L}^{\ell}$, because $m_1^{\ell}$ lies on the boundary representing the event in which element 4 moves from the elbow set to the right set (see panel (a) of Figure 2). We now have all the information required to generate a new polygon $\mathcal{S}_1^{\ell+1}$ from the midpoint $m_1^{\ell}$, and $\mathcal{S}_1^{\ell+1}$ can be computed by treating $m_1^{\ell}$ as a new updating point (this is why we use the superscript ℓ + 1 for midpoints). Figure 3 shows the four newly created polygons $\mathcal{S}_r^{\ell+1}$, r = 1, ···, 4, generated from the four midpoints $m_1^{\ell}, \cdots, m_4^{\ell}$ of the sides of 𝓢ℓ in Figure 2(b). Notice that the rightmost vertical line represents λ = λ0, beyond which the solution is of no interest. Finally, the algorithm continues extending the polygons over 𝓓 and terminates when the complete solution surface has been recovered on 𝓓.

Figure 3.

The updated polygons $\mathcal{S}_r^{\ell+1}$, r = 1, ···, 4 from the midpoints $m_1^{\ell}, \cdots, m_4^{\ell}$ of the sides of 𝓢ℓ in Figure 2(b): dotted lines in each subfigure depict the boundaries of the constraints used to obtain the polygon.

4.3 Resolving Empty Elbow

Note that either event 1 or event 2 can leave 𝓔 empty, a situation we call the empty elbow. This causes a problem for the update described in Section 4.2, because Theorem 1 cannot be applied when 𝓔 is empty.

Suppose the empty elbow occurs at (λo, πo), and use the superscript 'o' to denote any quantity defined at (λo, πo). To resolve the empty elbow, we first notice that the objective function (4) is differentiable with respect to b and αi, i = 1, ···, n whenever the elbow set is empty, since in this case no example satisfies yi f(xi) = 1. Taking derivatives of (4) with respect to b and the αi, we obtain two conditions that hold under the empty elbow: (i) $\pi^o = |\mathcal{L}_+^o| / |\mathcal{L}^o|$, where $\mathcal{L}_+^o = \mathcal{L}^o \cap I_+$; and (ii) the αi are unique while α0 is not. Moreover, α0 can take any value in the interval

$$[a_L, a_U] \equiv \Big[\max_{i \in \mathcal{L}_-^o \cup \mathcal{R}_+^o} m_i^o, \ \min_{i \in \mathcal{L}_+^o \cup \mathcal{R}_-^o} m_i^o\Big], \qquad (24)$$

where $m_i^o = y_i \lambda^o - \sum_{j=1}^{n} \alpha_j^o y_j K(x_i, x_j)$, and $\mathcal{L}_-^o$, $\mathcal{R}_+^o$, and $\mathcal{R}_-^o$ are defined analogously to $\mathcal{L}_+^o$. Recall that α0 is continuous, and hence the empty elbow can be resolved only when α0 touches one of the two boundaries aL and aU. Notice that α0o is the value at which the empty elbow begins, and hence it must equal one of aL and aU. Without loss of generality, suppose α0o = aL; then the empty elbow is resolved when α0 reaches the other boundary aU. Let $i_L^o = \arg\max_{i \in \mathcal{L}_-^o \cup \mathcal{R}_+^o} m_i^o$ and $i_U^o = \arg\min_{i \in \mathcal{L}_+^o \cup \mathcal{R}_-^o} m_i^o$; then 𝓔 is updated to {iUo}, and 𝓛 and 𝓡 are updated accordingly, since the index iUo enters 𝓔 from one of the two sets. In the case α0o = aU, we update in a similar way, which leads to 𝓔 = {iLo}.

4.4 Pseudo Algorithm

Combining the previous subsections, we summarize our WSVM solution surface algorithm as Algorithm 1. We denote α = (α0, α1, ···, αn)T for simplicity. We note that the proposed algorithm computes the complete WSVM solution α for any (λ, π) ∈ 𝓓 without involving any numerical optimization. We have implemented the algorithm in the R language, and the wsvmsurf package is available from the authors upon request (and will be available on CRAN soon).

Algorithm 1.

The WSVM solution surface algorithm

Input: a training data set (xi, yi), i = 1, ···, n, and a non-negative definite kernel function K(·, ·).
Output: the entire (piecewise-linear) solution surface on 𝓓 of αi, i = 0, ···, n, the optimizers of (4).
  1. Initialize $Input^{(1)} \equiv \{in_1^{(1)}\} = \{\lambda_0, \pi_0, \alpha^0, \mathcal{E}^0, \mathcal{L}^0, \mathcal{R}^0\}^T$ from Section 4.1 and set k ← 1.

  2. While $Input^{(k)} = \{in_1^{(k)}, \cdots, in_{J_k}^{(k)}\}$ is not empty (i.e., Jk ≠ 0),

    • 2-1

      For j = 1, ···, Jk

      1. Set λ, π, α, 𝓔, 𝓛, and 𝓡 from $in_j^{(k)}$.

      2. If 𝓔 is empty, resolve the empty elbow as described in Section 4.3; otherwise skip this step and go directly to step 3.

      3. Determine 𝓢 by identifying its vertices $v_1, \cdots, v_{n_v}$ using (19)–(23), and update the solution $\alpha_r$ at each vertex $v_r$, r = 1, ···, nv via (13).

        Set $out_j \leftarrow \{out_{j,1}, \cdots, out_{j,n_v}\}$, where $out_{j,r} \equiv \{v_r, \alpha_r\}$ for r = 1, ···, nv.

      4. Compute the midpoints $m_1, \cdots, m_{n_v}$ and the corresponding solutions $\bar{\alpha}_r$ at $m_r$, r = 1, ···, nv.

        Update $\mathcal{E}_r$, $\mathcal{L}_r$, and $\mathcal{R}_r$ at $m_r = (\lambda_r, \pi_r)$, r = 1, ···, nv.

        Set $in_j \leftarrow \{in_{j,1}, \cdots, in_{j,n_v}\}$, where $in_{j,r} \equiv \{\lambda_r, \pi_r, \bar{\alpha}_r, \mathcal{E}_r, \mathcal{L}_r, \mathcal{R}_r\}^T$ for r = 1, ···, nv.

      end

    • 2-2

      Set $Output^{(k)} \leftarrow \{out_1, \cdots, out_{J_k}\}$.

    • 2-3

      Set $Input^{(k+1)} \leftarrow \{in_1, \cdots, in_{J_k}\}$.

    • 2-4

      Refine $Input^{(k+1)}$ by deleting any element that revisits an old midpoint, and set k ← k + 1.

    end

  3. Set Output ← {Output(1), ···, Output(k)}.

Table 1.

Empirical computing time based on 100 independent repetitions: the machine used is equipped with an Intel Core(TM) i3 550 @ 3.20GHz CPU and 4GB of memory.

n     Gaussian                                       Linear
      time(s)         pieces             time/piece  time(s)        pieces              time/piece
10    0.53 (0.01)     164.3 (3.58)       0.00325     0.44 (0.01)    140.98 (2.70)       0.00311
30    5.78 (0.08)     2056.6 (27.16)     0.00281     4.36 (0.06)    1648.05 (21.24)     0.00265
50    19.08 (0.22)    6036.8 (55.29)     0.00316     14.38 (0.16)   4870.05 (52.40)     0.00295
100   108.00 (1.07)   25653.0 (193.50)   0.00421     80.72 (0.67)   20087.13 (148.05)   0.00402

5 Computational Complexity

The essential part of the proposed algorithm involves several steps. First, we solve the linear system (14) of size 1 + |𝓔|, which requires O((1 + |𝓔|)³) computations. We empirically observe that |𝓔|, the number of elbow support vectors, depends on n but is usually much smaller than n throughout the algorithm. Next, to determine the set 𝓢, we compute the constraints (19)–(23) together with h1(xi) and h2(xi), which requires O(n|𝓔|) and O(n²) computations, respectively. We then have nc = n + |𝓔| constraints and need to find their intersection points that satisfy all the constraints (19)–(23) simultaneously. This is the most computationally expensive step, since it requires evaluating nc(nc − 1)/2 intersection points and checking nc constraints for each, which takes at least O(nc³) computations in total.

Our limited numerical experience suggests that the final number of vertices of 𝓢 is at most about eight, which implies that most of the constraints are dominated by others. This leads us to add to the algorithm a refining step that removes the constraints dominated by others, as follows. First, since all the constraints share the common linear form aλ + bπ ≥ c, with different values of a, b, and c for different constraints, we can classify the constraints into two types according to the sign of b, positive or negative, assuming b ≠ 0 for simplicity. See Figure 4(a) for an illustration. If b < 0 for a constraint, then the region represented by the constraint lies below its boundary (dashed lines); otherwise it lies above the boundary (dotted lines). For each type of constraint, with either b > 0 or b < 0, we compute the values of π at λ = 0 and λ = λ0 (the vertical lines); we can then exclude most, though not all, dominated boundaries by sorting the π values obtained. The refining step requires at most $O(n_c^{4/3})$ computations, since it is based on a sorting algorithm (a Shellsort variant). After the refining step, only ñc constraints remain to consider when computing 𝓢, and ñc is much smaller than nc.
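A hedged R sketch of this refining idea, under the same constraint representation aλ + bπ ≥ c as in the earlier sketch; the cutoff keep is our own heuristic (motivated by nv rarely exceeding eight), and a constraint binding only in the interior of [0, λ0] could still slip through, so the exact pruning rule used in the paper may differ:

    # Sketch: evaluate each boundary pi = (c - a*lambda)/b at lambda = 0 and
    # lambda = lambda0, then keep only the lines nearest the relevant envelope.
    refine_constraints <- function(C, lambda0, keep = 8) {
      a <- C[, 1]; b <- C[, 2]; cc <- C[, 3]
      p0 <- cc / b                           # boundary pi at lambda = 0
      p1 <- (cc - a * lambda0) / b           # boundary pi at lambda = lambda0
      envelope <- function(idx, decreasing) {
        if (length(idx) == 0) return(integer(0))
        m <- min(keep, length(idx))
        union(idx[order(p0[idx], decreasing = decreasing)[1:m]],
              idx[order(p1[idx], decreasing = decreasing)[1:m]])
      }
      # b > 0: region above the boundary, so the highest boundaries bind;
      # b < 0: region below the boundary, so the lowest boundaries bind.
      kept <- c(envelope(which(b > 0), TRUE), envelope(which(b < 0), FALSE))
      C[sort(kept), , drop = FALSE]
    }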

Figure 4.

Illustration of the refining step and its effect for the simulated example: the left panel (a) shows all the boundaries of the constraints. Dashed (red) lines represent boundaries of constraints with b < 0; only two of them (bold) are used to define 𝓢, and the two excluded ones are identified by comparing the π values at λ = 0 and λ0. Dotted (blue) lines are those with b > 0, and we can exclude dominated constraints in a similar manner. The right panel (b) shows the dramatic saving in computation from the refining step, comparing the numbers of intersection points of the unrefined constraints (nc) and the refined ones (ñc) as functions of the sample size n.

To demonstrate the benefit of the refining step, we consider data sets generated from a univariate Gaussian mixture for different sample sizes n. Data are generated as xi ~ N(0, 1) if i ∈ I+ and xi ~ N(1, 1) otherwise, with |I+| = |I−| = n/2. Employing the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|_2^2 / \sigma^2)$ with σ = 1, Figure 4(b) illustrates how effective the refining step is. We observe that only about 20% of the constraints are needed to compute 𝓢 for relatively large n, suggesting a dramatic computational saving in evaluating 𝓢 (only about 0.2³ = 0.8% of the computations are required, compared to the case without the refining step). The right panel (b) shows nc³ (black dashed line) versus ñc³ (red solid line) for different n; the computational saving is substantial when n is large.

To obtain the entire solution surface, we iterate the aforementioned updating steps. It is quite challenging to rigorously quantify the number of required iterations as a function of n, since it depends not only on the data but also on the kernel function employed (and hence its parameters). We instead report empirical computing times for different sample sizes n based on 100 independent repetitions (Table 1). The data are simulated from a bivariate Gaussian mixture with the same number of observations from each class (i.e., |I+| = |I−| = n/2). In particular, if i ∈ I+ then xi is drawn from N2((0, 0)T, I2), and otherwise xi ~ N2((2, 2)T, I2), where I2 is the two-dimensional identity matrix. In Table 1, we report the average computing time, the average number of pieces, and the computing time per piece for both the Gaussian and linear kernels. The kernel parameter σ of the Gaussian kernel is fixed and set to the median of the pairwise distances between the two classes (Jaakkola et al., 1999). Our limited numerical experience suggests that the computing time per piece (polygon 𝓢) does not depend heavily on n, while the total time grows proportionally to n², since the number of polygons produced does. In Table 1, the numbers in parentheses are the corresponding standard errors.

Finally, one may wish to extract a marginal path, either a λ-path or a π-path, from the WSVM solution surface. This is easily done once all the polygons and the corresponding WSVM solutions are obtained: it is straightforward to interpolate the WSVM solution for any pair (λ, π) from the solutions at the vertices of the polygons produced by the algorithm. Let d be the total number of vertices; as noted above, d is proportional to n². Then only O(d) computations are needed to obtain a marginal solution path, since the vertices of each polygon are already properly ordered.
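Because the solution is affine within each polygon, the interpolation needs only three non-collinear vertices of the polygon containing (λ, π). A minimal R sketch (names ours), where V holds the polygon's vertices row-wise as (λ, π) and Avert holds the corresponding solution vectors row-wise:

    # Sketch: recover the affine map alpha(lambda, pi) = c0 + c1*lambda + c2*pi
    # from three vertices and evaluate it at the query point.
    interpolate_alpha <- function(V, Avert, lam, ppi) {
      X    <- cbind(1, V[1:3, 1], V[1:3, 2])          # affine basis at 3 vertices
      coef <- solve(X, Avert[1:3, , drop = FALSE])    # one column per alpha_i
      as.numeric(c(1, lam, ppi) %*% coef)
    }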

6 Illustration

In this section, we illustrate how the proposed WSVM solution surface algorithm works using the kyphosis data (Chambers and Hastie, 1992), available in the R package rpart. The data contain the status (absence or presence of kyphosis) of n = 81 children who underwent corrective spinal surgery. The class variable yi is coded −1 for absence and +1 for presence of kyphosis after the operation. Three predictors are recorded for each child (i.e., p = 3): the age of the patient (in months), the number of vertebrae involved, and the number of the first vertebra operated on. The Gaussian kernel is used with kernel parameter σ set to 0.01. The entire solution surface consists of 7,502 polygons (or pieces), and the total computing time is 34.90 seconds. In Figure 5, panel (a) plots all the vertices of the polygons produced (denoted by red dots), and panel (b) depicts the entire surface of α18. There are 81 αi-surfaces in total, one for each observation i = 1, ···, 81; for the purpose of illustration, we show only the one for i = 18. The solution surfaces for the remaining 80 αi's can be visualized in the same way. Once the entire two-dimensional solution surface is available, the marginal solution paths can be readily obtained for a fixed value of either λ or π. Panels (c) and (d) show, respectively, the marginal regularization paths for fixed π = 0.2 and the marginal paths for fixed λ = 0.5, both extracted from the WSVM solution surface. Notice that there are 81 observations in the data, and hence each piecewise-linear path in (c) and (d) represents the marginal solution path of one αi, i = 1, ···, 81.

Figure 5.

Real data illustration on the kyphosis data.

7 Applications to Probability Estimation

In this section, we revisit the motivating application of the proposed joint solution surface algorithm to probability estimation (Wang et al., 2008), introduced in Section 1. In Wang et al. (2008), the regularization parameter λ is selected by a grid search over the interval [10⁻³, 10³], with ten equally-spaced points in each sub-interval (10^j, 10^{j+1}], j = −3, ···, 2. In practice, it can be difficult to choose an appropriate range for the parameter search or to determine the right level of grid coarseness. A general rule is: if the WSVM solution varies rapidly with λ, we need a fine grid; otherwise a coarse grid is more appropriate in terms of computational efficiency. However, without a priori information, a subjective choice of range could be too wide or too narrow, and the grid could be too fine or too coarse. The proposed WSVM solution surface algorithm provides natural and informative guidance on these issues. We propose to use as the grid all the distinct λ values obtained from the WSVM solution surface algorithm, namely the λ-coordinates of all vertices of the polygons produced by the WSVM solution surface. We point out that the proposed grid of λ is adaptive to the solution variation, in the sense that the grid is finer where the solution surface is complicated and coarser otherwise.
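Extracting this adaptive grid from the computed surface is a one-liner. A trivial R sketch, where vertices is a hypothetical two-column matrix of (λ, π) vertex coordinates collected from all polygons:

    # Sketch: the adaptive tuning grid is the set of distinct vertex lambdas.
    adaptive_lambda_grid <- function(vertices) {
      sort(unique(vertices[, 1]))
    }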

We compare the two grid search methods using two microarray data sets available at the LIBSVM Data website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). For both data sets, the sample sizes are much smaller than the numbers of predictors. The Gaussian kernel is employed to train the WSVM, and the associated kernel parameter σ is set to the median of the pairwise distances between the two classes. To evaluate the performance of the probability estimator under the two grids, the fixed grid employed in Wang et al. (2008) and the proposed adaptive grid, we use the cross entropy error (CRE). The CRE over a set of data {(xi, yi) : i = 1, 2, ···, n} is defined as

$$\mathrm{CRE} = -\frac{1}{n} \sum_{i=1}^{n} \Big[\frac{1}{2}(1 + y_i) \log \hat{p}(x_i) + \frac{1}{2}(1 - y_i) \log\{1 - \hat{p}(x_i)\}\Big]. \qquad (25)$$
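In code, with labels in {−1, +1}, the CRE of (25) reads as follows; the clipping of p̂ away from 0 and 1 is our own numerical safeguard:

    # Sketch: cross entropy error over a data set.
    cre <- function(y, phat, eps = 1e-12) {
      phat <- pmin(pmax(phat, eps), 1 - eps)   # avoid log(0)
      -mean((1 + y) / 2 * log(phat) + (1 - y) / 2 * log(1 - phat))
    }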

Table 2 summarizes the two well-known microarray data sets used, the Duke cancer and colon cancer data. One merit of the proposed probability estimator is that it is not only distribution-free but also insensitive to the dimension of the predictors p, since it exploits the kernel trick. For tuning λ, we employ cross-validation (leave-one-out and 10-fold) rather than an independent tuning set, since the data sets have relatively small sample sizes. Table 3 shows the test CREs for the microarray examples. Clear improvements of the proposed adaptive grid over the fixed grid are observed. This is because the joint solution surfaces of the microarray examples are complicated, and the fixed grid fails to reflect this complicated pattern of the solutions.

Table 2.

Sources of the microarray data used for illustration: the numbers of predictors (p) are much larger than the sample sizes (n).

Name train(n) test(ñ) p Source
Duke Cancer 38 4 7219 West et al. (2001)
Colon Cancer 40 22 2000 Alon et al. (1999)

Table 3.

Test CREs for Microarray data sets (p > n).

data    tuning    fixed    adaptive   improvement
Duke    LOOCV     0.4991   0.4794     3.94%
Duke    10-fold   0.4846   0.4729     2.42%
Colon   LOOCV     0.3058   0.2511     17.89%
Colon   10-fold   0.3058   0.2646     13.45%

8 Concluding Remarks

In this article, we first establish that the WSVM solution is jointly piecewise-linear in λ and π, and we then develop an efficient algorithm to compute the entire WSVM solution surface on the (λ × π)-plane by taking advantage of the joint piecewise-linearity. To demonstrate its practical value, we propose an adaptive tuning scheme for the probability estimation method based on the WSVM solution surface. Illustrations on real data sets show that the proposed adaptive tuning grid from the WSVM solution surface improves the performance of the probability estimator, especially when the solution surface is complicated. We remark that the proposed solution surface provides complete information about the WSVM solution for an arbitrary pair of λ and π and therefore has potentially wide applicability in various statistical applications.

The piecewise-linearity of the marginal λ-paths, or regularization paths, has been well studied in the literature. See Rosset and Zhu (2007) for a unified analysis under various models, where a general characterization of the loss and penalty functions that give piecewise-linear coefficient paths is derived. We would like to point out that it is possible to extend the notion of joint piecewise-linearity to more general problems beyond the weighted SVM. For example, we can show that the kernel quantile regression solution also enjoys the joint piecewise-linearity property in terms of its regularization parameter and the quantile parameter. However, it is not yet clear what the general sufficient conditions for joint piecewise-linearity are. We find that the sufficient conditions for the marginal (one-dimensional) piecewise-linearity, given in Rosset and Zhu (2007), generally do not imply the joint (two-dimensional) piecewise-linearity. For example, support vector regression (Vapnik, 1996) has piecewise-linear marginal paths with respect to the regularization parameter or the insensitivity parameter, but its solution does not have the joint piecewise-linearity property. Based on our experience, the two-dimensional solution surface is much more complex than one-dimensional marginal paths, mainly due to the intricate roles played by the two parameters and their relationship. This is an important topic and worth further investigation.

Acknowledgments

The authors would like to thank Professor Richard A. Levine, the associate editor, and two reviewers for their constructive comments and suggestions that led to significant improvement of the article. This work is partially supported by NIH/NCI grants R01 CA-149569 (Shin and Wu), R01 CA-085848 (Zhang), and NSF Grant DMS-0645293 (Zhang). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH or NCI.

Footnotes

Online Supplement: The online supplement materials contain the computer code and the instructions on how to build and install our newly developed R package wsvmsurf. All the files are available via a single zipped file “Supplement.rar”. After unzipping the file, one can retrieve one folder (“wsvmsurf”) and two text files (“example.R” and “readme.txt”). The folder “wsvmsurf” contains all the source files (which are properly structured) for building the R package “wsvmsurf”. The file “readme.txt” gives detailed instructions for installing the package. For windows machines, the package can be built by using R-tools available at www.r-project.org. For Linux machines, the installation can be done by using the R command R CMD build wsvmsurf. The file “example.R” contains the code to run two examples in the paper.

Contributor Information

Seung Jun Shin, Email: sshin@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695.

Yichao Wu, Email: wu@stat.ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695.

Hao Helen Zhang, Email: hzhang@math.arizona.edu, Department of Mathematics, University of Arizona, Tucson, AZ 85718.

References

  1. Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences. 1999;96:6745–6750. doi: 10.1073/pnas.96.12.6745.
  2. Chambers J, Hastie T. Statistical Models in S. Wadsworth and Brooks/Cole; Pacific Grove, CA: 1992.
  3. Hastie T, Rosset S, Tibshirani R, Zhu J. The entire regularization path for the support vector machine. Journal of Machine Learning Research. 2004;5:1391–1415.
  4. Jaakkola T, Diekhans M, Haussler D. Using the Fisher kernel method to detect remote protein homologies. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology; Heidelberg, Germany: AAAI; 1999. pp. 149–158.
  5. Kimeldorf G, Wahba G. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications. 1971;33:82–95.
  6. Lin Y. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery. 2002;6:259–275.
  7. Lin Y, Lee Y, Wahba G. Support vector machines for classification in nonstandard situations. Machine Learning. 2004;46:191–202.
  8. Liu Y. Fisher consistency of multicategory support vector machines. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics; 2007. pp. 289–296.
  9. Rosset S, Zhu J. Piecewise linear regularized solution paths. Annals of Statistics. 2007;35(3):1012–1030.
  10. Vapnik V. The Nature of Statistical Learning Theory. Springer-Verlag; 1996.
  11. Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM; 1990.
  12. Wahba G. Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In: Schölkopf B, Burges CJC, Smola AJ, editors. Advances in Kernel Methods: Support Vector Learning. MIT Press; 1999. pp. 125–143.
  13. Wang J, Shen X, Liu Y. Probability estimation for large-margin classifiers. Biometrika. 2008;95:149–167.
  14. Wang L, Shen X. On l1-norm multi-class support vector machines: methodology and theory. Journal of the American Statistical Association. 2007;102:595–602.
  15. West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks JR, Nevins JR. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences. 2001;98:11462–11467. doi: 10.1073/pnas.201162998.
  16. Zhang HH, Ahn J, Lin X, Park C. Gene selection using support vector machines with nonconvex penalty. Bioinformatics. 2006;22(1):88–95. doi: 10.1093/bioinformatics/bti736.
  17. Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm support vector machines. Advances in Neural Information Processing Systems. 2003;16.
