Utility-based Weighted Multicategory Robust Support Vector Machines

Yufeng Liu; Yichao Wu; Qinying He

doi:10.4310/sii.2010.v3.n4.a5

Author manuscript; available in PMC: 2013 Jul 25.

Published in final edited form as: Stat Interface. 2010 Oct 1;3(4):465–476. doi: 10.4310/sii.2010.v3.n4.a5

PMCID: PMC3722909 NIHMSID: NIHMS250316 PMID: 23894688

Abstract

The Support Vector Machine (SVM) has been an important classification technique in both the machine learning and statistics communities. The robust SVM (RSVM) is an improved version of the SVM whose resulting classifier is less sensitive to outliers. In many practical problems, it may be advantageous to use different weights for different types of misclassification. However, the existing RSVM treats different kinds of misclassification equally. In this paper, we propose the weighted RSVM, as an extension of the standard SVM. We show that, surprisingly, the cost-based weights do not work well for weighted extensions of the RSVM. To solve this problem, we propose novel utility-based weights for the weighted RSVM. Both theoretical and numerical studies are presented to investigate the performance of the proposed weighted multicategory RSVM.

Key Words and Phrases: Multicategory Classification, Robustness, SVM, Utility, Weighted Learning

1 Introduction

In supervised learning, one important goal is to build predictive models using a training dataset for future prediction. Among various learning tasks, classification plays an important role, both theoretically and practically. It has been applied in a wide range of disciplines such as medicine, engineering, and bioinformatics.

There are numerous classification techniques proposed in the literature. In particular, several machine learning approaches have become popular and have been increasingly studied in both the machine learning and statistics communities. Important examples include the Support Vector Machine (SVM, Boser et al. (1992); Cortes and Vapnik (1995)), Boosting (Freund and Schapire, 1997; Friedman et al., 2000), and others. See Hastie et al. (2009) for a comprehensive survey of different learning techniques. Many proposals on further improvements of these methods have been made in recent years. For example, Lee et al. (2004) and Zhu et al. (2009) introduced multicategory versions of SVM and AdaBoost, respectively. A more recent learning method, ψ-learning, was first proposed by Shen et al. (2003) as a competitive binary large margin classifier. Liu and Shen (2006) developed a multicategory extension of ψ-learning. Wu and Liu (2007) further generalized ψ-learning to robust SVMs. Other examples of machine learning classification methods include the Import Vector Machine (IVM, Zhu and Hastie (2005)) and Distance Weighted Discrimination (DWD, Marron et al. (2007); Qiao et al. (2010)).

To measure the performance of a classifier, one can quantify its prediction accuracy. One commonly used measure is the probability of misclassification. This measure treats all types of misclassification equally. Specifically, the cost of misclassifying a subject of class one into class two is the same as the cost of misclassifying a subject of class two into class one. This treatment may not be appropriate for many applications. One common example is the application of tumor classification. Clearly, misclassifying a cancer patient as normal is much more severe than the other type of misclassification. A misclassification on a normal patient can be corrected using further diagnosis. However, a wrong diagnosis on a cancer patient will delay the necessary treatment and can be life threatening. Another example is learning with samples having minority classes. In that case, the resulting classifier tends to sacrifice the minority classes and try to classify the training points in majority classes correctly. Sometimes the classifier may misclassify all points of a minority class but still give high overall classification accuracy (Qiao and Liu, 2009). Therefore, unequal cost assignments on different types of misclassification are needed.

To handle the problem of unequal cost assignments, Lin et al. (2004) generalized the original SVM to nonstandard situations. The nonstandard SVM allows unequal cost assignments on the two types of misclassification. Lee et al. (2004) extended this idea further for multicategory problems. For ψ-learning and robust SVM, the available methods can only deal with standard binary and multicategory classification. In this paper, we develop a general robust SVM technique which allows unequal cost assignments on different types of misclassification. The proposed technique covers the binary ψ-learning in Shen et al. (2003), multicategory ψ-learning in Liu and Shen (2006), and robust SVM (RSVM) in Wu and Liu (2007) as special cases.

The optimization problem of the standard RSVM involves nonconvex minimization. Wu and Liu (2007) proposed to decompose the nonconvex objective function into the difference of two convex functions and then use the difference convex (DC) algorithm, through iterative convex minimization, to obtain a local solution. Since the existing weighted SVMs in the literature are based on unequal costs for different types of misclassification, it is natural for us to use the same idea for the RSVM. Surprisingly, the resulting weighted RSVM using the unequal cost approach cannot be solved using the DC algorithm. To overcome this difficulty, we propose a novel utility-based weighted RSVM which can be implemented via the DC algorithm. We show the equivalence of cost and utility under the 0–1 loss. Numerical examples are provided to illustrate the effectiveness of the new methodology.

The rest of this paper is organized as follows. In Section 2, we review the standard ψ-learning technique. In Section 3, we propose the weighted multicategory RSVM methodology. A computational algorithm using the DC algorithm is provided in Section 4, followed by numerical examples in Section 5. Some discussions are provided in Section 6. The Appendix contains the technical proof of the theorem.

2 Standard ψ-Learning and Robust SVM

2.1 General Framework

Consider a k-class classification problem. Let {(x_i, y_i); i = 1, …, n} denote a training dataset. The n pairs of observations (x_i, y_i)’s are assumed to be independent realizations of a random pair (X, Y), which has an unknown probability distribution P(x, y). Here x ∈ S ⊂ IR^d denotes an input vector and y ∈ {1, …, k} represents an output (class) variable. Throughout the paper, we use X and Y to denote random variables and x and y to represent corresponding observations.

Define f = (f₁, …, f_k), each f_j being a mapping from S to IR, as a decision function vector. These k functions represent k different classes with f_j corresponding to class j; j = 1, …, k. Once f is obtained from the training dataset, the classifier ŷ = argmax_{j=1,…,k} f_j(x) is employed to predict the class of any input vector x ∈ S. In other words, f_ŷ(x) is the maximum among the k values of f(x). One important goal of multicategory classification is to find a classifier which minimizes the probability of misclassifying a new input vector X, namely the generalization error (GE), Err(f) = P[Y ≠ argmax_j f_j(X)]. Denote the multiple comparison vector of class y versus the rest as g(f(x), y) = (f_y(x) − f₁(x), …, f_y(x) − f_{y−1}(x), f_y(x) − f_{y+1}(x), …, f_y(x) − f_k(x)). Then f produces correct classification for (x, y) if min(g(f(x), y)) > 0. Using the notation of generalized functional margin min(g(f(x), y)), we can rewrite the classification error rate on the training dataset as $(1 / n) \sum_{i = 1}^{n} I (min (g (f (x_{i}), y_{i})) \leq 0)$ , where I(·) is an indicator function.
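As an illustration of the notation above, the following minimal sketch computes the multiple comparison vector, the generalized functional margin, the argmax classifier, and the training error rate. The function names and the numerical score vectors are hypothetical and chosen only for this example; classes are indexed from 1 as in the text.

```python
from typing import List, Sequence

def margin_vector(f: Sequence[float], y: int) -> List[float]:
    """g(f(x), y): the differences f_y - f_j for all j != y (1-based classes)."""
    return [f[y - 1] - fj for j, fj in enumerate(f, start=1) if j != y]

def generalized_margin(f: Sequence[float], y: int) -> float:
    """min(g(f(x), y)); positive exactly when class y has the strictly largest score."""
    return min(margin_vector(f, y))

def predict(f: Sequence[float]) -> int:
    """argmax_j f_j(x), returned as a 1-based class label."""
    return max(range(1, len(f) + 1), key=lambda j: f[j - 1])

def training_error(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """(1/n) * sum_i I(min(g(f(x_i), y_i)) <= 0)."""
    errors = sum(generalized_margin(f, y) <= 0 for f, y in zip(scores, labels))
    return errors / len(labels)

# Hypothetical score vectors for two observations, both with true class 1.
scores = [[0.5, -0.2, -0.3], [0.1, 0.3, -0.4]]
labels = [1, 1]
print(predict(scores[0]), generalized_margin(scores[0], 1))
print(training_error(scores, labels))
```

The first observation has a positive margin and is classified correctly; the second has a negative margin for class 1 and counts as an error.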

2.2 Standard ψ-Learning

After replacing the indicator function, also known as the 0–1 loss, by a ψ-loss function, the standard multicategory ψ-learning proposed by Liu and Shen (2006) solves the following minimization problem:

min_{f} (λ \sum_{j = 1}^{k} J (f_{j}) + \frac{1}{n} \sum_{i = 1}^{n} ψ (min (g (f (x_{i}), y_{i})))) subject to \sum_{j = 1}^{k} f_{j} (x) = 0 \forall x \in S,

(1)

where ψ(u) ∈ (0, 1] if u ∈ (0, τ) and ψ(u) = I(u ≤ 0) otherwise. The first term $\sum_{j = 1}^{k} J (f_{j})$ in the objective function in (1) can be viewed as a roughness penalty of f. For example, in linear learning where each f_j is a linear function, a common choice of J(f_j) is the squared L₂ norm of the corresponding linear coefficients for x. The penalty term also helps to enforce maximum separation of the data in the separable case (Cristianini and Shawe-Taylor, 2000; Liu and Shen, 2006). The second term $\frac{1}{n} \sum_{i = 1}^{n} ψ (min (g (f (x_{i}), y_{i})))$ in the objective function is a measure of goodness of fit of f on the training dataset. The reason to use the ψ-loss instead of the 0–1 loss is that problem (1) would be ill-posed if we replace ψ(·) by I(·). In fact, the solution ${argmin}_{f} (λ \sum_{j = 1}^{k} J (f_{j}) + \frac{1}{n} \sum_{i = 1}^{n} I (min (g (f (x_{i}), y_{i})) \leq 0))$ is essentially 0 since for any c ∈ (0, 1), cf yields the same training error as f, but a smaller penalty than f. The positive values of ψ(u) when u ∈ (0, 1] aim to eliminate the scaling problem of I(·) and make (1) a well-defined optimization problem. One particular ψ-loss suggested by Liu and Shen (2006) is piecewise linear with ψ(u) = 1 if u ≤ 0, 1 − u if u ∈ (0, 1], and 0 otherwise. The tuning parameter λ balances the penalty term and the data fit term. The sum-to-zero constraint $\sum_{j = 1}^{k} f_{j} (x) = 0$ in (1) helps to solve the potential identifiability problem of f and reduce the dimension of the optimization problem.

Shen et al. (2003) and Liu and Shen (2006) showed that ψ-learning is robust to outliers and delivers high classification accuracy by using the ψ-loss, a loss that resembles the 0–1 loss closely. However, the methods they proposed only allow penalizing different types of misclassification equally.

2.3 Standard Robust SVM

Wu and Liu (2007) proposed the robust SVM (RSVM) by truncating the unbounded hinge loss of the SVM. Notice that the hinge loss H₁(u) = [1 − u]₊ grows linearly as u decreases for u ≤ 1. This implies that a point with large 1 − min g(f(x), y) results in large H₁ and, as a consequence, greatly influences the final solution. Such points are typically far away from their own classes and tend to deteriorate the SVM performance. The RSVM utilizes the truncated hinge loss function to reduce the influence of outliers. In particular, the truncated hinge loss function can be expressed as T_s(u) = H₁(u) − H_s(u), where H_s(u) = [s − u]₊.

The value of s of the truncated loss T_s(u) specifies the location of truncation. We set s ≤ 0 since a truncated loss with s > 0 is constant for u ∈ [−s, s] and cannot distinguish those correctly classified points with min g(f(x), y) ∈ (0, s] from those wrongly classified points with min g(f(x), y) ∈ [−s, 0]. When s = −∞, no truncation has been performed and T_s(u) = H₁(u). When s = 0, the truncated loss T₀(u) becomes the ψ loss. As shown in Wu and Liu (2007), the choice of s is important and affects the performance of the RSVM. Interestingly, the numerical examples in Wu and Liu (2007) suggest that s = 0 for ψ-learning is not the optimal choice. The best value of s appears to be −1/(k − 1). The corresponding truncated loss enjoys Fisher consistency and also delivers most accurate classification results compared to the truncated loss functions with other values of s.
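The truncated hinge loss can be sketched in a few lines. The code below is only an illustration with hypothetical function names; it implements H_s, T_s = H₁ − H_s, and the piecewise linear ψ-loss of Section 2.2, so that one can check numerically that T_s is capped at 1 − s and that T₀ coincides with the ψ-loss.

```python
def hinge(u: float, s: float = 1.0) -> float:
    """H_s(u) = [s - u]_+ ; s = 1 gives the usual hinge loss H_1."""
    return max(s - u, 0.0)

def truncated_hinge(u: float, s: float = 0.0) -> float:
    """T_s(u) = H_1(u) - H_s(u) for a truncation location s <= 0."""
    return hinge(u, 1.0) - hinge(u, s)

def psi_loss(u: float) -> float:
    """Piecewise linear psi-loss: 1 if u <= 0, 1 - u on (0, 1], and 0 otherwise."""
    if u <= 0:
        return 1.0
    return 1.0 - u if u <= 1 else 0.0

# Far to the left of s, the hinge loss keeps growing while T_s stays at 1 - s.
print(hinge(-10.0), truncated_hinge(-10.0, s=-0.5))

# With s = 0 the truncated hinge loss reproduces the psi-loss.
for u in (-3.0, -0.5, 0.0, 0.25, 1.0, 2.0):
    print(u, truncated_hinge(u, s=0.0), psi_loss(u))
```

The cap at 1 − s is what limits the influence of a single badly misclassified point on the fitted classifier.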

In Section 3, we develop a weighted RSVM methodology which permits flexible treatments on different misclassifications. Since ψ-learning is a special case of the RSVM, our method offers weighted ψ-learning as a byproduct.

3 Weighted RSVM

To extend the standard RSVM, one can use weights on different types of misclassification. A common technique is to use misclassification costs as weights. For example, Lin et al. (2004); Lee et al. (2004) used costs for extension of the standard SVM to non-standard situations. In Section 3.1, we explore the use of costs for possible extension of the multicategory SVM based on the generalized functional margin min(g(f(x), y)). In view of the difficulty on the implementation of the corresponding method, in Section 3.2, we propose a new way of weighting using utilities, instead of costs.

3.1 Challenge with the Cost Formulation

In the standard learning case, we do not distinguish different types of misclassification. The 0–1 loss can be represented as I(min(g(f(x_i), y_i)) ≤ 0). A natural way to extend binary margin loss functions into multicategory case is to use the generalized functional margin min(g(f(x_i), y_i)) as its argument. For example, the multicategory SVM proposed by Liu and Shen (2006) uses the loss [1 − min(g(f(x_i), y_i))]₊ and solves the following optimization problem

min_{f} (λ \sum_{j = 1}^{k} J (f_{j}) + \frac{1}{n} \sum_{i = 1}^{n} {[1 - min (g (f (x_{i}), y_{i}))]}_{+}) subject to \sum_{j = 1}^{k} f_{j} (x) = 0 .

(2)

To extend standard learning via weighting, we need to consider different types of misclassification. Let ϕ(x) : IR^d → {1, …, k} be a classifier and C_{yϕ(x)} represent the cost of misclassifying input x with class y into class ϕ(x). We set C_{yϕ(x)} = 0 if ϕ(x) = y and C_{yϕ(x)} > 0 otherwise. That is, no penalty is given for correct classification and a positive cost is incurred when an error occurs. Under the same framework as in Section 2, our goal is to obtain f which minimizes the GE Err(f) = E[C_{Yϕ(X)}].

Notice that $E [C_{Y ϕ (X)}] = E [\sum_{j = 1}^{k} C_{Y j} I (ϕ (X) = j)] = E [\sum_{j = 1}^{k} C_{Y j} I (min (g (f (X), j)) > 0)]$ . Then an empirical version of Err(f) based on the training dataset can be written as

\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{k} C_{y_{i} j} I (min (g (f (x_{i}), j)) > 0) .

(3)

Because of the scaling problem of I(·) as discussed in Section 2, (3) cannot be used for learning directly. However, one can use a convex approximation to replace the indicator function. For example, a natural approximation is to replace I(min(g(f(x_i), j)) > 0) by [1 + min(g(f(x_i), j))]₊. Then we can get the following empirical minimization problem

min_{f} (λ \sum_{j = 1}^{k} J (f_{j}) + \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{k} C_{y_{i} j} {[1 + min (g (f (x_{i}), j))]}_{+}) subject to \sum_{j = 1}^{k} f_{j} (x) = 0 .

(4)

Through the use of costs, (4) can be viewed as a weighted extension of the multicategory SVM formulation in (2). Surprisingly, unlike (2), (4) is not a convex minimization anymore. To see that, we can first introduce slack variables ξ_ij, as commonly done in the SVM optimization, to replace [1 + min(g(f(x_i), j))]₊ with constraints ξ_ij ≥ 0 and ξ_ij ≥ 1 + min(g(f(x_i), j)). Note that the region satisfying the constraints min(g(f(x_i), j)) ≤ ξ_ij−1 is not convex. As a result, (4) cannot be directly implemented using convex optimization.
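A small numerical counterexample may help to see the loss of convexity. The sketch below, with hypothetical function names and score vectors, evaluates the cost-based surrogate [1 + min(g(f, j))]₊ at two score vectors and at their midpoint; all three vectors satisfy the sum-to-zero constraint. For comparison it evaluates the surrogate [1 − min(g(f, j))]₊ at the same points.

```python
def min_g(f, j):
    """Generalized functional margin min(g(f, j)) for a 1-based class j."""
    return min(f[j - 1] - fm for m, fm in enumerate(f, start=1) if m != j)

def cost_surrogate(f, j):
    """[1 + min(g(f, j))]_+ , the surrogate appearing in the cost formulation (4)."""
    return max(1.0 + min_g(f, j), 0.0)

def margin_hinge(f, j):
    """[1 - min(g(f, j))]_+ , the hinge loss of the generalized margin as in (2)."""
    return max(1.0 - min_g(f, j), 0.0)

# Two hypothetical score vectors and their midpoint (each sums to zero).
fa = [0.0, 1.0, -1.0]
fb = [0.0, -1.0, 1.0]
mid = [(a + b) / 2.0 for a, b in zip(fa, fb)]

# Convexity would require value(mid) <= average of the endpoint values.
print(cost_surrogate(mid, 1), (cost_surrogate(fa, 1) + cost_surrogate(fb, 1)) / 2.0)
print(margin_hinge(mid, 1), (margin_hinge(fa, 1) + margin_hinge(fb, 1)) / 2.0)
```

The cost surrogate takes the value 1 at the midpoint but 0 at both endpoints, which violates the convexity inequality, whereas the hinge loss of the margin satisfies it at these points.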

3.2 The New Utility Formulation

In contrast to the cost formulation discussed in Section 3.1, we explore the use of utility in this section. In particular, we assign the utility amount U_{yϕ(x)} for classifying input x with class y into class ϕ(x). Naturally, one should set U_jj to be larger than the other U_jl for l ≠ j. As a remark, both our costs and utilities are nonnegative.

To generalize the 0–1 loss, instead of minimizing the total cost, one should maximize the utility $E [U_{Y ϕ (X)}] = E [\sum_{j = 1}^{k} U_{Y j} I (ϕ (X) = j)] = E [\sum_{j = 1}^{k} U_{Y j} I (min (g (f (X), j)) > 0)]$ . Then an empirical version of the total utility is as follows

\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} I (min (g (f (x_{i}), j)) > 0) .

(5)

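Notice that maximizing E[U_{Yϕ(X)}] is equivalent to minimizing E[−U_{Yϕ(X)}]. Moreover, since I(min(g(f(x_i), j)) > 0) = 1 − I(min(g(f(x_i), j)) ≤ 0), the quantity in (5) equals the constant $\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j}$ minus the same sum with the indicator I(· ≤ 0). Then it is straightforward to see that maximizing the total utility (5) is equivalent to minimizing the following quantity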

\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} I (min (g (f (x_{i}), j)) \leq 0) .

(6)

Therefore, the utility-based multicategory SVM extension can be written as follows

min_{f} (λ \sum_{j = 1}^{k} J (f_{j}) + \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} {[1 - min (g (f (x_{i}), j))]}_{+}) subject to \sum_{j = 1}^{k} f_{j} (x) = 0 .

(7)

It is easy to see that problem (7) is a natural generalization of the standard SVM problem (2) with weights U_{y_i j}. Similarly, our proposed weighted RSVM solves the following problem

min_{f} (λ \sum_{j = 1}^{k} J (f_{j}) + \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} T_{s} (min (g (f (x_{i}), j)))) subject to \sum_{j = 1}^{k} f_{j} (x) = 0 .

(8)

Our weighted RSVM uses the multicategory weighted truncated hinge loss function $\sum_{j = 1}^{k} U_{y_{i} j} T_{s} (min (g (f (x_{i}), j)))$ . To further explore the proposed loss, we study its consistency. In particular, we first define weighted Fisher consistency and then study Fisher consistency of general truncated loss functions. Note that the Bayes rule ϕ*(X) that maximizes the expected utility $E [U_{Y ϕ (X)}]$ is $ϕ^{*} (x) = {argmax}_{j} \sum_{l = 1}^{k} U_{lj} p_{l} (x)$ , where p_l(x) = P(Y = l|X = x). Assume that the function ℓ(·) is non-increasing and ℓ′(0) < 0 exists. Then the weighted Fisher consistency is defined as follows.

Definition 3.1. Weighted Fisher Consistency

Denote $f^{*} (x) = {argmin}_{f} E [\sum_{j = 1}^{k} U_{Y j} ℓ (min (g (f (X), j))) | X = x]$ . Then the corresponding weighted loss function is weighted Fisher consistent if ${argmax}_{j} f_{j}^{*} (x) = {argmax}_{j} \sum_{l = 1}^{k} U_{lj} p_{l} (x)$ .

As a remark, we note that our weighted Fisher consistency definition and results are more general than those of Wu et al. (2010). Essentially, the weights imposed by Wu et al. (2010) can be viewed as a special case of our utility-based learning with a diagonal utility matrix.

Let ℓ_{T_s}(·) = min(ℓ(·), ℓ(s)) with s ≤ 0. The following theorem, as an extension of the results in Wu and Liu (2007), states the weighted Fisher consistency results for the weighted truncated loss $\sum_{j = 1}^{k} U_{Y j} ℓ_{T_{s}} (min (g (f (x), j)))$ .

Theorem 3.1. Assume that the function ℓ(·) is non-increasing and ℓ′(0) < 0 exists. Then a sufficient condition for the loss $\sum_{j = 1}^{k} U_{yj} ℓ_{T_{s}} (min (g (f (x), j)))$ with k > 2 to be weighted Fisher consistent is that the truncation location s satisfies that sup_{{u:u≥−s≥0}} (ℓ(0)−ℓ(u))/(ℓ(s) − ℓ(0)) ≥ (k − 1). This condition is also necessary if ℓ(·) is convex.

Similar to the standard learning, the truncation value s given in Theorem 3.1 depends on the class number k. For ℓ(u) = H₁(u), e^−u, and log(1 + e^−u), weighted Fisher consistency for ℓ_{T_s} (min g(f(x), y)) can be guaranteed for $s \in [- \frac{1}{k - 1}, 0]$ , $s \in [log (1 - \frac{1}{k}), 0]$ , and $s \in [- log (2^{\frac{k}{k - 1}} - 1), 0]$ , respectively. For the implementation of our weighted RSVM, we recommend choosing $s = - \frac{1}{k - 1}$ , the same as for the standard RSVM.
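The three truncation intervals can be checked against the condition of Theorem 3.1 numerically. The sketch below is a sanity check rather than a proof. It assumes that the supremum over u ≥ −s of ℓ(0) − ℓ(u) equals ℓ(0), because each of the three losses decreases to 0 (the hinge loss reaches 0 at u = 1, the other two approach it as u grows). The dictionary and function names are hypothetical.

```python
import math

# The three losses considered in the text.
LOSSES = {
    "hinge": lambda u: max(1.0 - u, 0.0),
    "exponential": lambda u: math.exp(-u),
    "logistic": lambda u: math.log(1.0 + math.exp(-u)),
}

def truncation_lower_bounds(k: int) -> dict:
    """Left endpoints of the truncation intervals stated for each loss."""
    return {
        "hinge": -1.0 / (k - 1),
        "exponential": math.log(1.0 - 1.0 / k),
        "logistic": -math.log(2.0 ** (k / (k - 1.0)) - 1.0),
    }

def sup_ratio(loss, s: float) -> float:
    """sup_{u >= -s} (l(0) - l(u)) / (l(s) - l(0)), using inf_u l(u) = 0 (assumption)."""
    return loss(0.0) / (loss(s) - loss(0.0))

for k in (3, 4, 5):
    bounds = truncation_lower_bounds(k)
    for name, loss in LOSSES.items():
        # At the left endpoint the ratio should equal k - 1 up to rounding error.
        print(k, name, round(bounds[name], 4), sup_ratio(loss, bounds[name]))
```

For any s closer to 0 than the left endpoint, ℓ(s) − ℓ(0) shrinks and the ratio grows, so the condition continues to hold on the whole interval.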

3.3 Construction of the Utility Matrix

As we mentioned earlier, using costs to extend standard to nonstandard learning is very common. Weighted loss functions using costs extend the standard 0–1 loss by allowing unequal costs on different types of misclassification. Typically, one sets costs for correct classification to be zero and sets various costs for different types of misclassification depending on the problem and context. In this section, we describe one way of constructing a utility matrix based on a predefined cost matrix and then show their equivalence.

With a prespecified cost matrix {C_jl; j, l = 1, …, k}, we can construct the utility matrix with

(\begin{matrix} U_{11} & \dots & U_{1 k} \\ \dots & \dots & \dots \\ U_{k 1} & \dots & U_{kk} \end{matrix}) = max_{j l} {C_{jl}} 1_{k} 1_{k}^{T} - (\begin{matrix} C_{11} & \dots & C_{1 k} \\ \dots & \dots & \dots \\ C_{k 1} & \dots & C_{kk} \end{matrix}),

where 1_k is a vector of ones with length k.

Next we demonstrate that this choice of utility is reasonable by showing that

{argmin}_{f} E [\sum_{j = 1}^{k} C_{Y j} I (min g (f, j) > 0)] = {argmax}_{f} E [\sum_{j = 1}^{k} U_{Y j} I (min g (f, j) > 0)] .

(9)

The equality (9) implies that using our choice of utility matrix, the solution f, which minimizes the expected cost, also maximizes the expected utility. This justifies the usage of our utility matrix.

To show (9), we define P_j = P(Y = j|X = x). Then we have

\begin{matrix} {argmin}_{f} E [\sum_{j = 1}^{k} C_{Y j} I (min g (f, j) > 0)] \\ = {argmin}_{f} \sum_{l = 1}^{k} P_{l} \sum_{j = 1}^{k} [C_{lj} I (min g (f, j) > 0)] \\ = {argmax}_{f} \sum_{l = 1}^{k} P_{l} \sum_{j = 1}^{k} [(max_{j^{*} l^{*}} C_{j^{*} l^{*}} - C_{lj}) I (min g (f, j) > 0)] \\ = {argmax}_{f} E [\sum_{j = 1}^{k} U_{Yj} I (min g (f, j) > 0)] . \end{matrix}

Thus, (9) is proved. As a result, one can construct the utility matrix directly once the cost matrix is given for the proposed weighted RSVM.
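The construction can be illustrated with a short sketch. The cost matrix and class probabilities below are hypothetical and serve only as an example; rows index the true class and columns the predicted class (0-based in the code). The sketch builds U from C and checks that the class minimizing the expected cost also maximizes the expected utility.

```python
def utility_from_cost(C):
    """U = max_{jl} C_jl * 1 1^T - C, computed entry by entry."""
    c_max = max(max(row) for row in C)
    return [[c_max - c for c in row] for row in C]

def expected(M, p, j):
    """sum_l M[l][j] * p[l]: expected cost (or utility) of predicting class j."""
    return sum(M[l][j] * p[l] for l in range(len(p)))

def bayes_min_cost(C, p):
    return min(range(len(p)), key=lambda j: expected(C, p, j))

def bayes_max_utility(U, p):
    return max(range(len(p)), key=lambda j: expected(U, p, j))

# Hypothetical cost matrix: misclassifying the second class as the first is expensive.
C = [[0, 1, 1],
     [4, 0, 1],
     [2, 1, 0]]
U = utility_from_cost(C)
print(U)

for p in ([0.5, 0.3, 0.2], [0.8, 0.1, 0.1]):
    print(p, bayes_min_cost(C, p), bayes_max_utility(U, p))
```

With p = (0.5, 0.3, 0.2) both rules select the second class even though the first class is the most probable, because predicting the first class risks the large cost C₂₁ = 4.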

4 Nonconvex Minimization via Difference Convex Algorithm

In this section, we develop a difference convex algorithm to solve (8). Note that the loss function in (8) is piecewise linear. Consequently it can also be solved by developing a mixed integer programming (MIP) algorithm as in Liu and Wu (2006). However due to the computational intensity of the MIP, we use the DC algorithm. The DC algorithm solves the nonconvex minimization problem via minimizing a sequence of convex subproblems (An and Tao, 1997; Liu et al., 2005). In particular, to apply the DC algorithm, we first rewrite the nonconvex objective function as a difference of two convex functions. Then we solve the original nonconvex optimization problem via iterative convex minimization problems. Each convex minimization subproblem serves as an approximation of the original problem.

For simplicity, we only focus on linear learning. The DC algorithm for nonlinear learning can be derived using kernel formulation as well as the idea of iterative approximation in linear learning. More details on the implementation of nonlinear learning can be found in Wu and Liu (2007). For linear learning, we set $f_{j} (x) = w_{j}^{T} x + b_{j}$ ; w_j ∈ ℜ^d, b_j ∈ ℜ, and b = (b₁, b₂, ⋯, b_k)^T ∈ ℜ^k, where w_j = (w_1j, w_2j, ⋯, w_dj)^T, and W = (w₁, w₂, ⋯, w_k). With the two-norm penalty $J (f_{j}) = \frac{1}{2} {‖ w_{j} ‖}_{2}^{2}$ , (8) simplifies to

min_{W, b} \frac{1}{2} \sum_{j = 1}^{k} {‖ w_{j} ‖}_{2}^{2} + C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} T_{s} (min g (f (x_{i}), j)) subject to \sum_{j = 1}^{k} w_{mj} = 0, m = 1, 2, \dots, d; \sum_{j = 1}^{k} b_{j} = 0,

(10)

where $C = \frac{1}{n λ}$ and the constraints are adopted to avoid non-identifiability issue of the solution.

We denote parameters (W, b) by Θ. By noting the fact that T_s = H₁ − H_s, the objective function in (10) can be decomposed as

\begin{matrix} Q (Θ) & = \frac{1}{2} \sum_{j = 1}^{k} {‖ w_{j} ‖}_{2}^{2} + C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} H_{1} (min g (f (x_{i}), j)) - C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} H_{s} (min g (f (x_{i}), j)) \\ & = Q_{vex} (Θ) + Q_{cav} (Θ), \end{matrix}

where

Q_{vex} (Θ) = \frac{1}{2} \sum_{j = 1}^{k} {‖ w_{j} ‖}_{2}^{2} + C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} H_{1} (min g (f (x_{i}), j))

and

Q_{cav} (Θ) = - C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} H_{s} (min g (f (x_{i}), j))

denote the convex and concave parts, respectively.

Note that $\frac{\partial}{\partial w_{j}} Q_{cav} (Θ)$ and $\frac{\partial}{\partial b_{j}} Q_{cav} (Θ)$ can be written respectively as follows

- C \sum_{i = 1}^{n} [U_{y_{i} j} (- I_{{min g (f (x_{i}), j) < s}}) + \sum_{j' \neq j} U_{y_{i} j'} (I_{{j = argmax (f_{m} (x_{i}) : m \neq j'), f_{j'} (x_{i}) - f_{j} (x_{i}) < s}})] x_{i}^{T}

and

- C \sum_{i = 1}^{n} [U_{y_{i} j} (- I_{{min g (f (x_{i}), j) < s}}) + \sum_{j' \neq j} U_{y_{i} j'} (I_{{j = argmax (f_{m} (x_{i}) : m \neq j'), f_{j'} (x_{i}) - f_{j} (x_{i}) < s}})],

where I_{A} = 1 if event A is true, and 0 otherwise.

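Define, for j′ ≠ j, β_ijj′ = C if $f_{j}^{t} (x_{i}) - f_{j'}^{t} (x_{i}) < s$ with $j' = argmax (f_{m}^{t} (x_{i}) : m \neq j)$ and 0 otherwise, where the superscript t indicates evaluation at the current DC iterate. With the help of β_ijj′, we have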

\frac{\partial}{\partial w_{j}} Q_{cav} (Θ) = \sum_{i = 1}^{n} (U_{y_{i} j} \sum_{j' \neq j} β_{ijj'} - \sum_{j' \neq j} U_{y_{i} j'} β_{ij' j}) x_{i}^{T},

and

\frac{\partial}{\partial b_{j}} Q_{cav} (Θ) = \sum_{i = 1}^{n} (U_{y_{i} j} \sum_{j' \neq j} β_{ijj'} - \sum_{j' \neq j} U_{y_{i} j'} β_{ij' j}) .
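The β representation of the derivative with respect to b_j can be checked numerically. The sketch below is an illustration under simplifying assumptions: the score vectors f(x_i) are supplied directly, so that a shift in b_j is a shift in f_j, classes are 0-based, and the utility matrix, scores, and function names are hypothetical. It computes the gradient from β and compares it with a central finite difference of Q_cav.

```python
def beta(f, s, C):
    """B[j][jp] = C if f_j - f_jp < s and jp = argmax_{m != j} f_m, else 0."""
    k = len(f)
    B = [[0.0] * k for _ in range(k)]
    for j in range(k):
        jp = max((m for m in range(k) if m != j), key=lambda m: f[m])
        if f[j] - f[jp] < s:
            B[j][jp] = C
    return B

def grad_b_qcav(scores, labels, U, s, C):
    """dQ_cav/db_j = sum_i (U_{y_i j} sum_{j'} beta_ijj' - sum_{j'} U_{y_i j'} beta_ij'j)."""
    k = len(U)
    grad = [0.0] * k
    for f, y in zip(scores, labels):
        B = beta(f, s, C)
        for j in range(k):
            own = U[y][j] * sum(B[j][jp] for jp in range(k) if jp != j)
            other = sum(U[y][jp] * B[jp][j] for jp in range(k) if jp != j)
            grad[j] += own - other
    return grad

def q_cav(scores, labels, U, s, C):
    """Q_cav = -C * sum_i sum_j U_{y_i j} H_s(min g(f(x_i), j))."""
    total = 0.0
    for f, y in zip(scores, labels):
        for j in range(len(f)):
            mg = f[j] - max(f[m] for m in range(len(f)) if m != j)
            total += U[y][j] * max(s - mg, 0.0)
    return -C * total

def finite_difference(scores, labels, U, s, C, j, eps=1e-6):
    """Central difference of Q_cav when every f_j is shifted by +/- eps."""
    up = [[v + eps if m == j else v for m, v in enumerate(f)] for f in scores]
    dn = [[v - eps if m == j else v for m, v in enumerate(f)] for f in scores]
    return (q_cav(up, labels, U, s, C) - q_cav(dn, labels, U, s, C)) / (2.0 * eps)

U = [[1.0, 0.3, 0.2], [0.3, 1.0, 0.2], [0.2, 0.3, 1.0]]  # hypothetical utilities
scores, labels = [[-1.0, 0.9, 0.1]], [0]                  # one observation
g = grad_b_qcav(scores, labels, U, s=-0.5, C=1.0)
print(g, [finite_difference(scores, labels, U, -0.5, 1.0, j) for j in range(3)])
```

The two gradients agree away from the kinks of the piecewise linear loss; at a kink, β selects one particular subgradient.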

We now describe the iterative procedure of the DC algorithm for minimizing the nonconvex objective function in (10). For initialization, we can use the solution of (10) when T_s(u) = H₁(u), i.e., when no truncation is performed on the loss function with s = −∞. In that case, problem (10) becomes a convex minimization problem. Denote Θ_t to be the solution at the end of step t. At the step (t + 1), we apply the linear approximation to the concave part and then the objective function becomes

Q (Θ) = \frac{1}{2} \sum_{j = 1}^{k} {‖ w_{j} ‖}_{2}^{2} + C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} H_{1} (min g (f (x_{i}), j)) + \sum_{j = 1}^{k} 〈 \frac{\partial}{\partial w_{j}} Q_{cav} (Θ_{t}), w_{j} 〉 + \sum_{j = 1}^{k} b_{j} \frac{\partial}{\partial b_{j}} Q_{cav} (Θ_{t}) .

Using slack variable ξ_ij’s for the hinge loss function, the optimization problem at step (t + 1) becomes

\begin{matrix} {min}_{W, b, ξ} & \frac{1}{2} \sum_{j = 1}^{k} {‖ w_{j} ‖}_{2}^{2} + C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} ξ_{ij} + \sum_{j = 1}^{k} 〈 \frac{\partial}{\partial w_{j}} Q_{cav} (Θ_{t}), w_{j} 〉 + \sum_{j = 1}^{k} b_{j} \frac{\partial}{\partial b_{j}} Q_{cav} (Θ_{t}) \\ subject to & ξ_{ij} \geq 0 i = 1, 2, \dots, n; j = 1, 2, \dots, k \\ ξ_{ij} \geq 1 - [x_{i}^{T} w_{j} + b_{j}] + [x_{i}^{T} w_{j'} + b_{j'}], i = 1, 2, \dots, n; j' \neq j . \end{matrix}

The corresponding Lagrangian is

L (W, b, ξ) = \frac{1}{2} \sum_{j = 1}^{k} {‖ w_{j} ‖}_{2}^{2} + C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} ξ_{ij} - \sum_{i = 1}^{n} \sum_{j = 1}^{k} u_{ij} ξ_{ij} - \sum_{i = 1}^{n} \sum_{j = 1}^{k} \sum_{j' \neq j} α_{ijj'} (x_{i}^{T} w_{j} + b_{j} - x_{i}^{T} w_{j'} - b_{j'} + ξ_{ij} - 1) + \sum_{j = 1}^{k} 〈 \frac{\partial}{\partial w_{j}} Q_{cav} (Θ_{t}), w_{j} 〉 + \sum_{j = 1}^{k} b_{j} \frac{\partial}{\partial b_{j}} Q_{cav} (Θ_{t}),

(11)

subject to

\frac{\partial}{\partial w_{j}} L = w_{j}^{T} - [\sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ijj'} - U_{y_{i} j} β_{ijj'}) x_{i}^{T} - \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ij' j} - U_{y_{i} j'} β_{ij' j}) x_{i}^{T}] = 0

(12)

\frac{\partial}{\partial b_{j}} L = - [\sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ijj'} - U_{y_{i} j} β_{ijj'}) - \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ij' j} - U_{y_{i} j'} β_{ij' j})] = 0

(13)

\frac{\partial}{\partial ξ_{ij}} L = C U_{y_{i} j} - u_{ij} - \sum_{j' \neq j} α_{ijj'} = 0,

(14)

where the Lagrangian multipliers are u_ij ≥ 0 and α_ijj′ ≥ 0 for any i = 1, 2, ⋯, n, j = 1, 2, ⋯, k, j′ ≠ j. Substituting (12)–(14) into (11) yields the corresponding dual problem

\begin{matrix} {min}_{α} & \frac{1}{2} \sum_{j = 1}^{k} {‖ \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ijj'} - U_{y_{i} j} β_{ijj'}) x_{i}^{T} - \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ij' j} - U_{y_{i} j'} β_{ij' j}) x_{i}^{T} ‖}_{2}^{2} - \sum_{i = 1}^{n} \sum_{j = 1}^{k} \sum_{j' \neq j} α_{ijj'} \\ subject to & \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ijj'} - U_{y_{i} j} β_{ijj'}) - \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ij' j} - U_{y_{i} j'} β_{ij' j}) = 0, j = 1, 2, \dots, k \\ & 0 \leq \sum_{j' \neq j} α_{ijj'} \leq C U_{y_{i} j}, i = 1, 2, \dots, n; j = 1, 2, \dots, k \\ & α_{ijj'} \geq 0, i = 1, 2, \dots, n; j = 1, 2, \dots, k; j' \neq j; \end{matrix}

where β_ijj′ is defined as above. Note that the above dual problem is a quadratic programming (QP) problem similar to that of the standard SVM. Many optimization software packages can be used to solve the above dual problem. Once the solution is obtained, the coefficients w_j’s can be recovered as follows,

w_{j} = \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ijj'} - U_{y_{i} j} β_{ijj'}) x_{i}^{T} - \sum_{i = 1}^{n} \sum_{j' \neq j} (α_{ij' j} - U_{y_{i} j'} β_{ij' j}) x_{i}^{T},

(15)

which satisfies the sum-to-zero constraint $\sum_{j = 1}^{k} w_{mj} = 0$ for each 1 ≤ m ≤ d automatically.

After the solution of W is derived, b can be obtained via solving either a sequence of KKT conditions as used in the standard SVM or a linear programming (LP) problem. Denote ${\tilde{f}}_{j} (x_{i}) = x_{i}^{T} w_{j}$ . Then b can be obtained through the following LP problem:

\begin{matrix} min_{η, b} C \sum_{i = 1}^{n} \sum_{j = 1}^{k} U_{y_{i} j} η_{ij} + \sum_{j = 1}^{k} [\sum_{i = 1}^{n} (U_{y_{i} j} \sum_{j' \neq j} β_{ijj'} - \sum_{j' \neq j} U_{y_{i} j'} β_{ij' j})] b_{j} \\ subject to & η_{ij} \geq 0, i = 1, 2, \dots, n \\ η_{ij} \geq 1 - ({\tilde{f}}_{j} (x_{i}) + b_{j}) + {\tilde{f}}_{j'} (x_{i}) + b_{j'}, i = 1, 2, \dots, n; j' \neq j \\ \sum_{j = 1}^{k} b_{j} = 0 . \end{matrix}

We continue iterating the above convex optimization steps until convergence. Convergence of the objective value is guaranteed because the objective function is bounded below by zero and its value does not increase from one iteration to the next.
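The structure of the DC iteration can be sketched on a much simpler problem than (10). The code below is not the multicategory algorithm of this section: it assumes a binary problem with a single coefficient w and no intercept, with objective Q(w) = w²/2 + C Σ_i T_s(m_i w), where m_i = y_i x_i, and it solves each convex subproblem by ternary search instead of the dual QP. All names and data are hypothetical. The sketch starts from the hinge-loss solution, linearizes the concave part at the current iterate, and records the objective value after each step.

```python
def hinge(u, s=1.0):
    """H_s(u) = [s - u]_+."""
    return max(s - u, 0.0)

def argmin_convex(h, lo, hi, iters=200):
    """Ternary search for the minimizer of a strictly convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if h(m1) <= h(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def dc_truncated_hinge(margins, C=1.0, s=-1.0, n_iter=20):
    """DC iterations for Q(w) = 0.5*w^2 + C * sum_i T_s(m_i * w)."""
    def q_vex(w):
        return 0.5 * w * w + C * sum(hinge(m * w, 1.0) for m in margins)

    def q_cav(w):
        return -C * sum(hinge(m * w, s) for m in margins)

    def grad_cav(w):
        # d/dw of -C*[s - m*w]_+ is C*m where m*w < s, and 0 elsewhere.
        return C * sum(m for m in margins if m * w < s)

    # The subproblem minimizer satisfies |w| <= 2*C*sum|m_i|, so this interval suffices.
    bound = 2.0 * C * sum(abs(m) for m in margins) + 1.0
    w = argmin_convex(q_vex, -bound, bound)  # initialization: no truncation
    history = [q_vex(w) + q_cav(w)]
    for _ in range(n_iter):
        g = grad_cav(w)  # slope of the linearized concave part at the current w
        w_new = argmin_convex(lambda v: q_vex(v) + g * v, -bound, bound)
        history.append(q_vex(w_new) + q_cav(w_new))
        converged = abs(w_new - w) < 1e-8
        w = w_new
        if converged:
            break
    return w, history

# Six well-separated points (positive m_i) and one outlier (m_i = -3).
margins = [2.0, 1.5, 1.0, 1.0, 1.5, 2.0, -3.0]
w_hat, history = dc_truncated_hinge(margins)
print(w_hat, history)
```

In this toy problem the hinge-loss start is pulled toward the outlier, and one DC step moves the coefficient away from it, after which the iterate no longer changes. The recorded objective values are non-increasing, as the convergence argument requires.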

5 Numerical Examples

In this section, we investigate the performance of our weighted RSVM through simulated examples in Section 5.1. We use three simulated examples to show the effect of the utility matrix and the improvement of weighted RSVM over the weighted SVM. A handwritten digit recognition example is presented in Section 5.2 to further demonstrate the use of our proposed weighted RSVM. For all examples discussed in this section, the tuning parameter λ is selected over a grid using an independent tuning data set.

5.1 Simulation

We consider three simulated examples. For Examples 1 and 2, the underlying Bayes decision boundary is piecewise linear. We perform both linear and kernel nonlinear learning in Example 1 to demonstrate the change of the decision boundary with the utility matrix. Example 2 is used to show the advantage of the weighted RSVM over the weighted SVM when there are outliers in the data. For Example 3, the underlying Bayes decision boundary is nonlinear and we study the performance of our nonlinear weighted RSVM.

Example 1: This example is used to illustrate how the decision boundary moves when we change the utility matrix. The data of this piecewise linear example are generated as follows: the class response Y takes the values 1, 2, or 3 with equal probabilities; conditional on Y = y, X ~ N(μ_y, 0.7²I₂) where I₂ is a 2 × 2 identity matrix, μ₁ = (1, 0)^T, $μ_{2} = {(- 1 / 2, \sqrt{3} / 2)}^{T}$ , and $μ_{3} = {(- 1 / 2, - \sqrt{3} / 2)}^{T}$ .
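The data generation scheme of Example 1 can be sketched as follows. The function name, the seed argument, and the use of Python's standard random module are assumptions made for illustration only; since the covariance is 0.7²I₂, the two coordinates are drawn independently.

```python
import math
import random

# Class means as specified in Example 1.
MEANS = {
    1: (1.0, 0.0),
    2: (-0.5, math.sqrt(3) / 2.0),
    3: (-0.5, -math.sqrt(3) / 2.0),
}

def generate_example1(n, sigma=0.7, seed=0):
    """Draw n pairs (x, y): y uniform on {1, 2, 3}, x | y ~ N(mu_y, sigma^2 I_2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice((1, 2, 3))
        mx, my = MEANS[y]
        x = (rng.gauss(mx, sigma), rng.gauss(my, sigma))
        data.append((x, y))
    return data

sample = generate_example1(1600)
print(len(sample), sample[0])
```

The three means are equally spaced on the unit circle, which is why the Bayes decision boundary under equal utilities consists of three rays meeting at the origin.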

To study the effect of the utility matrix, we consider four different configurations of the utility matrix as follows:

U_{0} = (\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}), U_{1, a} = (\begin{matrix} 1 & a & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}), U_{2, a} = (\begin{matrix} 1 & a & a \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}), U_{3, a} = U_{2, a}^{T},

(16)

where 0 ≤ a < 1.

Since the true decision boundary is piecewise linear, we apply both linear learning and nonlinear kernel learning. For linear learning, the sample size is set to be n = 1600. Figure 1 shows how the decision boundaries change as we change the utility matrix. For the utility matrix U_1,a, the utility of classifying points in class 1 into class 2 increases as a increases. As a result, the region for class 1 becomes smaller while the region for class 2 becomes larger as a increases, as shown on the left panel of Figure 1. For the utility matrix U_2,a, the utilities of classifying points in class 1 into class 2 or 3 increase as a increases. Thus, the regions for classes 2 and 3 become larger and the region for class 1 becomes smaller as a increases, as shown on the middle panel of Figure 1. In contrast to U_2,a, for U_3,a, the utilities of classifying points in class 2 or 3 into class 1 increase as a increases. Consequently, the right panel of Figure 1 shows that the regions for classes 2 and 3 become smaller and the region for class 1 becomes larger as a increases. Overall, the results match our expectation when we change the utility matrix.

Figure 1. Boundaries for the three configurations of the utility matrix in (16) with different values of a for Example 1. The left, middle, and right panels correspond to the utility matrices U_1,a, U_2,a, and U_3,a, respectively.

Next we apply nonlinear learning via the Gaussian kernel to see how the boundaries move as the utility matrix changes. In this case, we set the sample size to n = 800. We show the boundaries for one typical realization in Figure 2. The first, second, and third rows of Figure 2 show how the decision boundaries change as we change a in (16) for U_1,a, U_2,a, and U_3,a, respectively. The patterns are similar to those of linear learning. In particular, for the first row, the region for class 2 gets larger while that for class 1 gets smaller. For the second row, the regions for both classes 2 and 3 get larger while that for class 1 gets smaller. Interestingly, as a gets larger, the region for class 1 becomes mixed with the regions for classes 2 and 3. This is possibly due to U₁₂ and U₁₃ having the same value. As a gets larger, the chances of classifying a point in class 1 into each of the three classes are approximately the same, and thus the decision is close to random. For the third row, the region for class 1 gets larger at the expense of smaller regions for classes 2 and 3.

Figure 2. Boundaries for the three utility matrix configurations with different values of a.

Example 2: This example demonstrates the performance of the weighted SVM and the weighted RSVM when there are outliers in the data. For simplicity, we use a data generation scheme similar to that of Example 1. We first generate (X, Ỹ) as in Example 1. Then we perform a data contamination step to generate outliers. Specifically, conditional on Ỹ = ỹ, we randomly flip Ỹ to get the final response Y by setting Y = j with probability P(j, ỹ), where j = 1, 2, 3. We set P(j, ỹ) = 0.85 when j = ỹ and 0.075 otherwise. With outliers in the data, we want to examine the effect of truncating the hinge loss for the weighted RSVM.
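
The contamination step can be sketched as follows. This is only one possible way to sample from P(j, ỹ); the function name `flip_labels`, its arguments, and the inverse-CDF sampling are assumptions made for illustration.

```python
import numpy as np

def flip_labels(y_clean, keep_prob=0.85, n_classes=3, seed=0):
    """Flip 1-based labels: keep with prob. keep_prob, else move to another class."""
    rng = np.random.default_rng(seed)
    y_clean = np.asarray(y_clean)
    # Off-diagonal probability: 0.075 when keep_prob = 0.85 and n_classes = 3.
    off = (1.0 - keep_prob) / (n_classes - 1)
    # Row (ytilde - 1) holds the distribution of Y given Ytilde = ytilde.
    P = np.full((n_classes, n_classes), off)
    np.fill_diagonal(P, keep_prob)
    # Inverse-CDF sampling, one uniform draw per observation.
    cdf = np.cumsum(P[y_clean - 1], axis=1)
    u = rng.random(len(y_clean))
    y = (u[:, None] > cdf).sum(axis=1) + 1
    # Guard against rounding in the last cumulative sum.
    return np.minimum(y, n_classes)

y_clean = np.repeat([1, 2, 3], 20000)
y_noisy = flip_labels(y_clean)
agreement = float(np.mean(y_noisy == y_clean))  # should be close to 0.85
```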

For this example, the sample size is set to n = 400. We generate an independent test set of size n_test = 100n. For a utility matrix U = (u_ij), we report the average utility on the test set, defined as $\sum_{i=1}^{n_{test}} u_{y_{i} \hat{y}_{i}} / n_{test}$, where ŷ_i is the prediction for the test point (x_i, y_i). Means and standard deviations of the average utilities over 100 repetitions are reported for the different methods in Table 1. We can see that using truncation can help to produce classifiers with higher utilities than those without truncation in most cases.
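
The average-utility criterion is a simple lookup of u_{y_i ŷ_i} followed by a mean. A minimal sketch, with a hypothetical helper `average_utility` and a made-up four-point test set, is:

```python
import numpy as np

def average_utility(u, y_true, y_pred):
    """Mean of U[y_i, yhat_i] over the test set; labels are 1-based."""
    u = np.asarray(u, dtype=float)
    rows = np.asarray(y_true) - 1
    cols = np.asarray(y_pred) - 1
    return float(u[rows, cols].mean())

# U_{2,0.4} from (16).
u2 = np.array([[1.0, 0.4, 0.4],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
y_true = np.array([1, 1, 2, 3])
y_pred = np.array([1, 2, 2, 1])
# The four utilities are 1, 0.4, 1, and 0, so the average is 2.4 / 4.
avg = average_utility(u2, y_true, y_pred)
```

Under the identity matrix U_0 this quantity reduces to the usual classification accuracy.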

Table 1.

Utility comparison for the weighted SVM (WSVM) and RSVM (WRSVM) for Example 2.

	Utility	a = 0	a = 0.2	a = 0.4	a = 0.6	a = 0.8
WSVM	U_1,a	70.67 (0.24)	71.58 (0.36)	72.60 (0.31)	73.90 (0.42)	74.79 (0.94)
	U_2,a		72.53 (0.34)	74.57 (0.45)	76.42 (0.99)	79.99 (0.29)
	U_3,a		72.82 (0.28)	75.44 (0.31)	77.39 (1.84)	86.59 (0.09)

WRSVM	U_1,a	70.62 (0.28)	71.32 (0.47)	72.13 (0.49)	72.88 (0.93)	73.96 (0.71)
	U_2,a		71.98 (0.70)	71.88 (1.04)	74.46 (0.38)	80.04 (0.26)
	U_3,a		72.60 (0.43)	74.35 (1.08)	75.90 (1.58)	86.61 (0.08)

Note: All table entries are multiplied by 100.

Example 3: Examples 1 and 2 have piecewise linear Bayes decision boundaries. For Example 3, we consider a case with a nonlinear Bayes decision boundary. In particular, the predictors X = (X₁, X₂)^T are generated with X₁ ~ Uniform[−3, 3] and X₂ ~ Uniform[−6, 6]. Conditional on X = x, the initial response Ỹ takes value j with probability $\exp(f_{j}(x)) / \sum_{m=1}^{3} \exp(f_{m}(x))$, with $f_{1}(x) = -20x_{1} + 2x_{2}^{2} - x_{2}^{2} + 2$, $f_{2}(x) = -4x_{1}^{2} + 2x_{2}^{2} - 4$, and $f_{3}(x) = 20x_{1} + 2x_{1}^{2} - x_{2}^{2} + 2$. Conditional on Ỹ = ỹ, we randomly flip Ỹ to get the final response Y by setting Y = j with probability P(j, ỹ), where j = 1, 2, 3. We set P(j, ỹ) = 0.85 when j = ỹ and 0.075 otherwise.
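
The first stage of this scheme (before label flipping) amounts to sampling from a softmax over the three score functions. The sketch below transcribes f₁, f₂, and f₃ exactly as printed above; the function names, the seed, and the max-shift used for numerical stability are illustrative assumptions.

```python
import numpy as np

def scores(x):
    """Score functions f_1, f_2, f_3, transcribed as printed in the text."""
    x1, x2 = x[:, 0], x[:, 1]
    f1 = -20 * x1 + 2 * x2**2 - x2**2 + 2
    f2 = -4 * x1**2 + 2 * x2**2 - 4
    f3 = 20 * x1 + 2 * x1**2 - x2**2 + 2
    return np.column_stack([f1, f2, f3])

def generate_example3_clean(n, seed=0):
    """Draw (X, Ytilde) for Example 3; the label-flipping step is not included."""
    rng = np.random.default_rng(seed)
    x = np.column_stack([rng.uniform(-3, 3, n), rng.uniform(-6, 6, n)])
    f = scores(x)
    # Softmax probabilities; subtracting the row maximum avoids overflow.
    p = np.exp(f - f.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Sample one class per row by inverse-CDF lookup.
    cdf = np.cumsum(p, axis=1)
    u = rng.random(n)
    y = np.minimum((u[:, None] > cdf).sum(axis=1) + 1, 3)
    return x, y, p

x, y_tilde, p = generate_example3_clean(200)
```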

We choose n = 200, n_test = 10n, and three specific utility matrices U_1,0.4, U_2,0.4, and U_3,0.4. Average utility results are given in Table 2. In general, loss truncation for the weighted RSVM helps to deliver classifiers with larger utilities except for the case of U_2,0.4.

Table 2.

Utility comparison for the weighted SVM (WSVM) and RSVM (WRSVM) for Example 3.

	U_1,0.4	U_2,0.4	U_3,0.4
WSVM	43.47 (1.88)	47.09 (1.11)	57.62 (3.10)
WRSVM	43.10 (1.76)	47.28 (1.09)	55.13 (2.80)

Note: All table entries are multiplied by 100.

5.2 Hand Written Digit Recognition

In this section, we use one real data example to further illustrate our proposed method. The dataset is the “Pen-Based Recognition of Handwritten Digits” data, available online at the UCI Machine Learning Repository. See Alimoglu and Alpaydin (1996) for more information on this dataset.

For this digit dataset, the response variable is multicategory with class codes being 0, 1, 2, ⋯, 9, representing corresponding digits. To simplify the task, we focus on three classes with class codes of 3, 6, and 9, which are labeled as class 1, 2, and 3, respectively, in our weighted RSVM. After combining both the training data and the testing data, we have 3166 observations with class codes 3, 6, and 9. There are 16 predictors available, which are first standardized to have mean zero and standard deviation one. For each replication, we randomly select 50 observations from each class to be used as the training data, another 50 observations for each class from the remaining to be used as the tuning data, and the rest as the test data. We repeat this random splitting 20 times.
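
The splitting protocol can be sketched as below. Random placeholder arrays stand in for the 3166 digit observations, since loading the UCI data is outside the scope of this sketch; `standardize` and `stratified_split` are hypothetical helper names, and the use of the population standard deviation is an assumption.

```python
import numpy as np

def standardize(x):
    """Center and scale each column to mean 0 and standard deviation 1."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def stratified_split(y, n_train=50, n_tune=50, seed=0):
    """Indices for training, tuning, and test sets with fixed per-class counts."""
    rng = np.random.default_rng(seed)
    train, tune, test = [], [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        train.append(idx[:n_train])
        tune.append(idx[n_train:n_train + n_tune])
        test.append(idx[n_train + n_tune:])   # everything left over
    return np.concatenate(train), np.concatenate(tune), np.concatenate(test)

# Placeholder data with the same shape as the three-class digit subset.
rng = np.random.default_rng(1)
x = standardize(rng.normal(size=(3166, 16)))
y = rng.integers(1, 4, size=3166)

# One of the 20 random splits; changing the seed gives the others.
train_idx, tune_idx, test_idx = stratified_split(y, seed=0)
```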

We report the results based on these 20 random splits in Table 3. For each utility matrix, we calculate the proportion of observations of each class that are predicted into each of the classes, averaged over the random splits. For the purpose of illustration, we only report the results for the utility type U_2,a in Table 3. For the case of standard learning with utility U_2,0, we can see that the correct classification rates for the three classes are all around 0.99. As we increase a, more and more data points in class 1 are classified into classes 2 and 3, since the design of this utility matrix encourages this kind of misclassification.
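
Each block of Table 3 is a confusion matrix whose rows are normalized to sum to one. A minimal sketch for a single split, with a hypothetical helper `confusion_rates` and made-up labels, is given below; it assumes every class appears at least once among the true labels.

```python
import numpy as np

def confusion_rates(y_true, y_pred, n_classes=3):
    """Row-normalized confusion matrix: entry (l, j) is the share of class l predicted as j."""
    counts = np.zeros((n_classes, n_classes))
    # Accumulate one count per (true, predicted) pair; labels are 1-based.
    np.add.at(counts, (np.asarray(y_true) - 1, np.asarray(y_pred) - 1), 1)
    return counts / counts.sum(axis=1, keepdims=True)

y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 2, 2, 3, 1]
rates = confusion_rates(y_true, y_pred)
```

Averaging such matrices over the random splits gives a table with the layout of Table 3.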

Table 3.

Classification results for the hand written digit recognition data

		Percentage

		ŷ = 1	ŷ = 2	ŷ = 3
U_2,0	y = 1	0.9929	0.0005	0.0066
	y = 2	0.0001	0.9989	0.0010
	y = 3	0.0130	0.0002	0.9868

U_2,0.2	y = 1	0.9886	0.0034	0.0080
	y = 2	0.0001	0.9990	0.0009
	y = 3	0.0109	0.0015	0.9875

U_2,0.6	y = 1	0.4545	0.1573	0.3882
	y = 2	0	0.9990	0.0010
	y = 3	0	0.0002	0.9998

6 Discussion

Multicategory classification is an important statistical problem in practice. In this paper, we propose a weighted extension of the multicategory RSVM. A novel utility-based weighting method is proposed. The resulting weighted RSVM can be implemented using the DC algorithm. The connections between the cost matrix and the utility matrix are explored. Our numerical examples demonstrate the effectiveness of our weighted RSVM.

Our method requires a predefined utility matrix in order to apply the weighted learning. Although we show that the utility matrix can be constructed using the cost matrix, the method requires the users to specify the cost or utility matrix. How to best assign sensible costs or utilities for multicategory classification is an important topic. Further investigation is necessary.

Acknowledgments

The authors would like to thank the reviewers for their constructive comments and suggestions. The authors are partially supported by NSF Grants DMS-0747575 (Liu) and DMS-0905561 (Wu), NIH/NCI Grants R01 CA-149569 (Liu and Wu) and P01 CA142538 (Liu).

Appendix*

Proof of Theorem 1: Note that

$$E\left[\sum_{j=1}^{k} U_{Yj}\, \ell_{T_{s}}(\min g(f(X), j))\right] = E\left[\sum_{l=1}^{k} \sum_{j=1}^{k} U_{lj}\, \ell_{T_{s}}(\min g(f(X), j))\, p_{l}(X)\right].$$

For any given x, we need to minimize $\sum_{l=1}^{k} \sum_{j=1}^{k} U_{lj} \ell_{T_{s}}(g_{j}) p_{l}(x)$, where $g_{j} = \min g(f(x), j)$. By definition and the fact that $\sum_{j=1}^{k} f_{j} = 0$, we can conclude that $\max_{j} g_{j} \geq 0$ and that at most one of the g_j’s is positive. Assume that $j_{p} = \operatorname{argmax}_{j} \sum_{l=1}^{k} U_{lj} p_{l}(x)$ is unique. Then, using the non-increasing property of ℓ_{T_s} and ℓ′(0) < 0, the minimizer f* satisfies $g_{j_{p}}^{*} \geq 0$.

We are now left to show $g_{j_{p}}^{*} \neq 0$, or equivalently that 0 cannot be a minimizer. For simplicity, denote $A = \sum_{l=1}^{k} U_{l j_{p}} p_{l}(x)$, $B = \sum_{j \neq j_{p}} \sum_{l=1}^{k} U_{lj} p_{l}(x)$, and C = A + B. Note that A > C/k due to the uniqueness of j_p. Then it is sufficient to show that there exists a solution with g_{j_p} > 0. By assumption, there exists u₁ > 0 such that u₁ ≥ −s and (ℓ(0) − ℓ(u₁))/(ℓ(s) − ℓ(0)) ≥ k − 1. Consider a solution f⁰ with $f_{j_{p}}^{0} = u_{1}(k-1)/k$ and $f_{j}^{0} = -u_{1}/k$ for $j \neq j_{p}$. We want to show that f⁰ yields a smaller expected loss than 0, i.e., Aℓ_{T_s}(u₁) + Bℓ_{T_s}(−u₁) < ℓ_{T_s}(0)C. Equivalently, (ℓ(0) − ℓ(u₁))/(ℓ(s) − ℓ(0)) > B/A, which holds because B/A < k − 1. This implies sufficiency of the condition.
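
As a worked instance of the sufficiency condition, take the hinge loss ℓ(u) = (1 − u)₊ with a truncation point s < 0. The choice of this particular ℓ is an assumption made here for illustration; the argument above applies to a general convex ℓ.

```latex
\begin{aligned}
&\ell(0) = 1, \qquad \ell(s) = 1 - s, \qquad \ell(u_1) = 0 \ \text{ for } u_1 \ge 1, \\
&\frac{\ell(0) - \ell(u_1)}{\ell(s) - \ell(0)} = \frac{1}{-s} \quad \text{for } u_1 \ge 1, \\
&\frac{1}{-s} \ge k - 1 \iff s \ge -\frac{1}{k-1}.
\end{aligned}
```

For k = 3, for example, any s in [−1/2, 0) satisfies the condition with u₁ = 1, and the requirement u₁ ≥ −s holds automatically in that range.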

To prove necessity of the condition, it is sufficient to show that if (ℓ(0) − ℓ(u))/(ℓ(s) − ℓ(0)) < k − 1 for all u with −u ≤ s ≤ 0, then 0 is a minimizer of $\sum_{l=1}^{k} \sum_{j=1}^{k} U_{lj} \ell_{T_{s}}(g_{j}) p_{l}(x)$. Equivalently, we need to show that there exists (p₁, …, p_k) such that $\sum_{l=1}^{k} \sum_{j=1}^{k} U_{lj} \ell_{T_{s}}(g_{j}) p_{l}(x) \geq \ell_{T_{s}}(0) C$ for all f. Without loss of generality, assume that j_p = k and f₁ ≤ f₂ ≤ ⋯ ≤ f_k. Then

$$\begin{aligned} \sum_{l=1}^{k} \sum_{j=1}^{k} U_{lj} \ell_{T_{s}}(g_{j}) p_{l}(x) &= \sum_{j=1}^{k-1} \sum_{l=1}^{k} U_{lj} \ell_{T_{s}}(f_{j} - f_{k}) p_{l}(x) + \sum_{l=1}^{k} U_{lk} \ell_{T_{s}}(f_{k} - f_{k-1}) p_{l}(x) \\ &\geq \ell_{T_{s}}(f_{k-1} - f_{k}) \sum_{j=1}^{k-1} \sum_{l=1}^{k} U_{lj} p_{l}(x) + \ell_{T_{s}}(f_{k} - f_{k-1}) \sum_{l=1}^{k} U_{lk} p_{l}(x) \\ &= \ell_{T_{s}}(f_{k-1} - f_{k}) B + \ell_{T_{s}}(f_{k} - f_{k-1}) A \end{aligned}$$

since ℓ_{T_s} is non-increasing. Thus it is sufficient to show ℓ_{T_s}(f_k − f_{k−1})A + ℓ_{T_s}(f_{k−1} − f_k)B > ℓ_{T_s}(0)C, that is, B(ℓ_{T_s}(−u) − ℓ(0)) > A(ℓ(0) − ℓ(u)) for all u > 0. Since ℓ(·) is convex, B(ℓ(−u) − ℓ(0)) > A(ℓ(0) − ℓ(u)) holds for all 0 < u ≤ −s with ℓ_{T_s}(−u) = ℓ(−u). For u ≥ −s, it is equivalent to show B(ℓ(s) − ℓ(0)) > A(ℓ(0) − ℓ(u)). By assumption, we can set ℓ(s) − ℓ(0) = (ℓ(0) − ℓ(u))/(k − 1) + a for some a > 0. Denote W = ℓ(0) − ℓ(u). Then we need to have B(W/(k − 1) + a) > AW. Let A = C/k + ε. Then it becomes ((k − 1)C/k − ε)(W/(k − 1) + a) > (C/k + ε)W, or equivalently,

$$aC \frac{k-1}{k\varepsilon} > \frac{k}{k-1} W + a. \tag{17}$$

For any given a > 0, C ≥ A > 0 and W > 0, we can always find a small ε > 0 to have (17) satisfied. The desired result then follows.
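
For readers who want the intermediate algebra behind (17), the steps can be written out as follows, using B = C − A = (k − 1)C/k − ε and ε > 0. This is a sketch of the expansion, not part of the original proof text.

```latex
\begin{aligned}
&\left(\tfrac{k-1}{k}C - \varepsilon\right)\left(\tfrac{W}{k-1} + a\right) > \left(\tfrac{C}{k} + \varepsilon\right) W \\
\iff\;& \tfrac{CW}{k} + \tfrac{a(k-1)C}{k} - \tfrac{\varepsilon W}{k-1} - \varepsilon a > \tfrac{CW}{k} + \varepsilon W \\
\iff\;& \tfrac{a(k-1)C}{k} > \varepsilon\left(\tfrac{k}{k-1}\,W + a\right) \\
\iff\;& aC\,\tfrac{k-1}{k\varepsilon} > \tfrac{k}{k-1}\,W + a .
\end{aligned}
```

The left-hand side of the last line grows without bound as ε decreases to 0 while the right-hand side does not depend on ε, which is why a sufficiently small ε always works.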

Contributor Information

Yufeng Liu, Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599 (yfliu@email.unc.edu).

Yichao Wu, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (wu@stat.ncsu.edu).

Qinying He, Research Institute of Economics and Management, Southwestern University of Finance and Economics, Chengdu, Sichuan, China, 610074 (he@swufe.edu.cn).

References

Alimoglu F, Alpaydin E. Methods of Combining Multiple Classifiers Based on Different Representations for Pen-based Handwriting Recognition. Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96); Istanbul, Turkey. 1996.
An LTH, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by d.c. algorithms. Journal of Global Optimization. 1997;11:253–285.
Boser B, Guyon I, Vapnik VN. A training algorithm for optimal margin classifiers. The Fifth Annual Conference on Computational Learning Theory; Pittsburgh ACM. 1992. pp. 142–152.
Cortes C, Vapnik VN. Support-Vector Networks. Machine Learning. 1995;20:273–279.
Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press; 2000.
Freund Y, Schapire R. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences. 1997;55:119–139.
Friedman JH, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting. Annals of Statistics. 2000;28:337–407.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. New York: Springer-Verlag; 2009.
Lee Y, Lin Y, Wahba G. Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association. 2004;99:67–81.
Lin Y, Lee Y, Wahba G. Support vector machines for classification in nonstandard situations. Machine Learning. 2004;46:191–202.
Liu Y, Shen X. Multicategory ψ-learning. Journal of the American Statistical Association. 2006;101:500–509.
Liu Y, Shen X, Doss H. Multicategory ψ-learning and support vector machines: computational tools. Journal of Computational and Graphical Statistics. 2005;14:219–236.
Liu Y, Wu Y. Optimizing ψ-learning via mixed integer programming. Statistica Sinica. 2006;16:441–457.
Marron JS, Todd M, Ahn J. Distance weighted discrimination. Journal of the American Statistical Association. 2007;102:1267–1271.
Qiao X, Liu Y. Adaptive weighted learning for unbalanced multicategory classification. Biometrics. 2009;65:159–168. doi: 10.1111/j.1541-0420.2008.01017.x.
Qiao X, Zhang HH, Liu Y, Todd MJ, Marron JS. Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association. 2010;105:401–414. doi: 10.1198/jasa.2010.tm08487.
Shen X, Tseng GC, Zhang X, Wong WH. On ψ-learning. Journal of the American Statistical Association. 2003;98:724–734.
Wu Y, Liu Y. Robust truncated-hinge-loss support vector machines. Journal of the American Statistical Association. 2007;102:974–983.
Wu Y, Zhang HH, Liu Y. Robust Model-free Multiclass Probability Estimation. Journal of the American Statistical Association. 2010;105:424–436. doi: 10.1198/jasa.2010.tm09107.
Zhu J, Hastie T. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational and Graphical Statistics. 2005;14:185–205.
Zhu J, Zou H, Rosset S, Hastie T. Multi-class adaboost. Statistics and Its Interface. 2009;2:349–360.
