Abstract
The Support Vector Machine (SVM) has been an important classification technique in both the machine learning and statistics communities. The robust SVM (RSVM) is an improved version of the SVM whose resulting classifier is less sensitive to outliers. In many practical problems, it may be advantageous to use different weights for different types of misclassification. However, the existing RSVM treats different kinds of misclassification equally. In this paper, we propose the weighted RSVM, as an extension of the standard RSVM. We show that, surprisingly, cost-based weights do not work well for weighted extensions of the RSVM. To solve this problem, we propose novel utility-based weights for the weighted RSVM. Both theoretical and numerical studies are presented to investigate the performance of the proposed weighted multicategory RSVM.
Key Words and Phrases: Multicategory Classification, Robustness, SVM, Utility, Weighted Learning
1 Introduction
In supervised learning, one important goal is to build predictive models using a training dataset for future prediction. Among various learning tasks, classification plays an important role, both theoretically and practically. It has been applied in a wide range of disciplines such as medicine, engineering, and bioinformatics.
Numerous classification techniques have been proposed in the literature. In particular, several machine learning approaches have become popular and have been increasingly studied in both the machine learning and statistics communities. Important examples include the Support Vector Machine (SVM, Boser et al. (1992); Cortes and Vapnik (1995)), Boosting (Freund and Schapire, 1997; Friedman et al., 2000), and others. See Hastie et al. (2009) for a comprehensive survey of different learning techniques. Many proposals on further improvements of these methods have been made in recent years. For example, Lee et al. (2004) and Zhu et al. (2009) introduced multicategory versions of the SVM and AdaBoost, respectively. A more recent learning method, ψ-learning, was first proposed by Shen et al. (2003) as a competitive binary large margin classifier. Liu and Shen (2006) developed a multicategory extension of ψ-learning. Wu and Liu (2007) further generalized ψ-learning to robust SVMs. Other examples of machine learning classification methods include the Import Vector Machine (IVM, Zhu and Hastie (2005)) and Distance Weighted Discrimination (DWD, Marron et al. (2007); Qiao et al. (2010)).
To measure the performance of a classifier, one can quantify its prediction accuracy. One commonly used measure is the probability of misclassification. This measure treats all types of misclassification equally. Specifically, the cost of misclassifying a subject of class one into class two is the same as the cost of misclassifying a subject of class two into class one. This treatment may not be appropriate for many applications. One common example is the application of tumor classification. Clearly, misclassifying a cancer patient as normal is much more severe than the other type of misclassification. A misclassification on a normal patient can be corrected using further diagnosis. However, a wrong diagnosis on a cancer patient will delay the necessary treatment and can be life threatening. Another example is learning with samples having minority classes. In that case, the resulting classifier tends to sacrifice the minority classes and try to classify the training points in majority classes correctly. Sometimes the classifier may misclassify all points of a minority class but still give high overall classification accuracy (Qiao and Liu, 2009). Therefore, unequal cost assignments on different types of misclassification are needed.
To handle the problem of unequal cost assignments, Lin et al. (2004) generalized the original SVM to nonstandard situations. The nonstandard SVM allows unequal cost assignments on the two types of misclassification. Lee et al. (2004) extended this idea further for multicategory problems. For ψ-learning and robust SVM, the available methods can only deal with standard binary and multicategory classification. In this paper, we develop a general robust SVM technique which allows unequal cost assignments on different types of misclassification. The proposed technique covers the binary ψ-learning in Shen et al. (2003), multicategory ψ-learning in Liu and Shen (2006), and robust SVM (RSVM) in Wu and Liu (2007) as special cases.
The optimization problem of the standard RSVM involves nonconvex minimization. Wu and Liu (2007) proposed to decompose the nonconvex objective function into the difference of two convex functions and then use the difference convex (DC) algorithm, which proceeds through iterative convex minimization, to obtain a local solution. Since the existing weighted SVMs in the literature are based on unequal costs for different types of misclassification, it is natural to use the same idea for the RSVM. Surprisingly, the resulting weighted RSVM using the unequal cost approach cannot be solved using the DC algorithm. To overcome this difficulty, we propose a novel utility-based weighted RSVM which can be implemented via the DC algorithm. We show the equivalence of cost and utility under the 0–1 loss. Numerical examples are provided to illustrate the effectiveness of the new methodology.
The rest of this paper is organized as follows. In Section 2, we review the standard ψ-learning technique. In Section 3, we propose the weighted multicategory RSVM methodology. A computational algorithm using the DC algorithm is provided in Section 4, followed by numerical examples in Section 5. Some discussions are provided in Section 6. The Appendix contains the technical proof of the theorem.
2 Standard ψ-Learning and Robust SVM
2.1 General Framework
Consider a k-class classification problem. Let $\{(x_i, y_i);\ i = 1, \ldots, n\}$ denote a training dataset. The n pairs of observations $(x_i, y_i)$ are assumed to be independent realizations of a random pair (X, Y), which has an unknown probability distribution P(x, y). Here $x \in S \subset \mathbb{R}^d$ denotes an input vector and $y \in \{1, \ldots, k\}$ represents an output (class) variable. Throughout the paper, we use X and Y to denote random variables and x and y to represent corresponding observations.
Define $f = (f_1, \ldots, f_k)$, each $f_j$ being a mapping from S to $\mathbb{R}$, as a decision function vector. These k functions represent k different classes with $f_j$ corresponding to class j; j = 1, …, k. Once f is obtained from the training dataset, the classifier $\hat{y} = \operatorname{argmax}_{j=1,\ldots,k} f_j(x)$ is employed to predict the class of any input vector x ∈ S. In other words, $f_{\hat{y}}(x)$ is the maximum among the k values of f(x). One important goal of multicategory classification is to find a classifier which minimizes the probability of misclassifying a new input vector X, namely the generalization error (GE), $\mathrm{Err}(f) = P[Y \neq \operatorname{argmax}_j f_j(X)]$. Denote the multiple comparison vector of class y versus the rest as $g(f(x), y) = (f_y(x) - f_1(x), \ldots, f_y(x) - f_{y-1}(x), f_y(x) - f_{y+1}(x), \ldots, f_y(x) - f_k(x))$. Then f produces correct classification for (x, y) if $\min(g(f(x), y)) > 0$. Using the notion of the generalized functional margin $\min(g(f(x), y))$, we can rewrite the classification error rate on the training dataset as $\frac{1}{n}\sum_{i=1}^{n} I(\min(g(f(x_i), y_i)) \le 0)$, where I(·) is an indicator function.
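The margin-based view of the training error can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names and the decision-function values in `F` are hypothetical and are not taken from the paper.

```python
def margin_vector(f, y):
    """g(f(x), y): the differences f_y - f_j for all j != y (classes are 1-based)."""
    fy = f[y - 1]
    return [fy - fj for j, fj in enumerate(f, start=1) if j != y]


def predict(f):
    """argmax_j f_j(x), returned as a 1-based class label."""
    return max(range(1, len(f) + 1), key=lambda j: f[j - 1])


def training_error(F, labels):
    """(1/n) * sum_i I(min g(f(x_i), y_i) <= 0)."""
    errors = sum(1 for f, y in zip(F, labels) if min(margin_vector(f, y)) <= 0)
    return errors / len(labels)


# Hypothetical decision-function values f(x_i) for three inputs with k = 3.
F = [[0.9, -0.4, -0.5], [0.1, 0.3, -0.4], [-0.2, -0.3, 0.5]]
labels = [1, 1, 3]
print(training_error(F, labels))  # only the second point has a non-positive margin
```

A point is counted as an error exactly when its generalized functional margin is non-positive, which coincides with the argmax rule whenever there are no ties.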
2.2 Standard ψ-Learning
After replacing the indicator function, also known as the 0–1 loss, by a ψ-loss function, the standard multicategory ψ-learning proposed by Liu and Shen (2006) solves the following minimization problem:
| (1) |
where ψ(u) ∈ (0, 1] if u ∈ (0, τ) and ψ(u) = I(u ≤ 0) otherwise. The first term in the objective function in (1) can be viewed as a roughness penalty of f. For example, in linear learning where each fj is a linear function, a common choice of J(fj) is the squared L2 norm of the corresponding linear coefficients for x. The penalty term also helps to enforce maximum separation of the data in the separable case (Cristianini and Shawe-Taylor, 2000; Liu and Shen, 2006). The second term in the objective function is a measure of goodness of fit of f on the training dataset. The reason to use the ψ-loss instead of the 0–1 loss is that problem (1) would be ill-posed if we replaced ψ(·) by I(·). In fact, the solution would essentially be 0, since for any c ∈ (0, 1), cf yields the same training error as f but a smaller penalty than f. The positive values of ψ(u) for u ∈ (0, τ) aim to eliminate the scaling problem of I(·) and make (1) a well-defined optimization problem. One particular ψ-loss suggested by Liu and Shen (2006) is piecewise linear with ψ(u) = 1 if u ≤ 0, 1 − u if u ∈ (0, 1], and 0 otherwise. The tuning parameter λ balances the penalty term and the data fit term. The sum-to-zero constraint in (1) helps to solve the potential identifiability problem of f and to reduce the dimension of the optimization problem.
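The scaling argument can be checked numerically. The sketch below implements the piecewise linear ψ-loss stated in the text and compares it with the 0–1 loss when all margins are shrunk by a factor c; the margin values and the factor are hypothetical.

```python
def psi_loss(u):
    """Piecewise linear psi-loss: 1 if u <= 0, 1 - u if 0 < u <= 1, and 0 otherwise."""
    if u <= 0:
        return 1.0
    if u <= 1:
        return 1.0 - u
    return 0.0


def zero_one_loss(u):
    """0-1 loss written as a function of the margin: I(u <= 0)."""
    return 1.0 if u <= 0 else 0.0


# Hypothetical generalized functional margins min g(f(x_i), y_i) for three points.
margins = [2.0, 0.5, -1.0]
c = 0.1  # shrinking f to c*f shrinks every margin by the same factor

print(sum(zero_one_loss(m) for m in margins), sum(zero_one_loss(c * m) for m in margins))
print(sum(psi_loss(m) for m in margins), sum(psi_loss(c * m) for m in margins))
```

The 0–1 total is unchanged by the shrinkage, so the penalty term alone would drive f towards zero, whereas the ψ-loss total increases and therefore resists it.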
Shen et al. (2003) and Liu and Shen (2006) showed that ψ-learning is robust to outliers and delivers high classification accuracy by using the ψ-loss, a loss that resembles the 0–1 loss closely. However, the methods they proposed only allow penalizing different types of misclassification equally.
2.3 Standard Robust SVM
Wu and Liu (2007) proposed the robust SVM (RSVM) via truncating the unbounded hinge loss of the SVM. Notice that the hinge loss H1(u) = [1 − u]+ grows linearly when u decreases with u ≤ 1. This implies that a point with large 1 − min g(f(x), y) results in large H1 and, as a consequence, greatly influences the final solution. Such points are typically far away from their own classes and tend to deteriorate the SVM performance. The RSVM utilizes the truncated hinge loss function to reduce the influence of outliers. In particular, the truncated hinge loss function can be expressed as Ts(u) = H1(u) − Hs(u), where Hs(u) = [s − u]+.
The value of s of the truncated loss Ts(u) specifies the location of truncation. We set s ≤ 0 since a truncated loss with s > 0 is constant for u ∈ [−s, s] and cannot distinguish those correctly classified points with min g(f(x), y) ∈ (0, s] from those wrongly classified points with min g(f(x), y) ∈ [−s, 0]. When s = −∞, no truncation is performed and Ts(u) = H1(u). When s = 0, the truncated loss T0(u) becomes the ψ-loss. As shown in Wu and Liu (2007), the choice of s is important and affects the performance of the RSVM. Interestingly, the numerical examples in Wu and Liu (2007) suggest that s = 0 for ψ-learning is not the optimal choice. The best value of s appears to be −1/(k − 1). The corresponding truncated loss enjoys Fisher consistency and also delivers the most accurate classification results compared to the truncated loss functions with other values of s.
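As a quick check of these definitions, the following sketch evaluates the truncated hinge loss as the difference of two hinge functions. The function names are illustrative; the value s = −0.5 is simply −1/(k − 1) for k = 3.

```python
def hinge(u, s=1.0):
    """H_s(u) = [s - u]_+ ; with s = 1 this is the usual hinge loss H_1."""
    return max(s - u, 0.0)


def truncated_hinge(u, s):
    """T_s(u) = H_1(u) - H_s(u) for a truncation location s <= 0."""
    return hinge(u, 1.0) - hinge(u, s)


k = 3
s = -1.0 / (k - 1)  # the truncation location recommended in the text
for u in [-3.0, -0.5, 0.0, 0.4, 1.0, 2.5]:
    # T_s caps the hinge loss at 1 - s; T_0 coincides with the piecewise linear psi-loss.
    print(u, hinge(u), truncated_hinge(u, s), truncated_hinge(u, 0.0))
```

Equivalently, Ts(u) = min(H1(u), 1 − s), so a badly misclassified point contributes at most 1 − s to the objective however far it lies from its class.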
In Section 3, we develop a weighted RSVM methodology which permits flexible treatments on different misclassifications. Since ψ-learning is a special case of the RSVM, our method offers weighted ψ-learning as a byproduct.
3 Weighted RSVM
To extend the standard RSVM, one can use weights on different types of misclassification. A common technique is to use misclassification costs as weights. For example, Lin et al. (2004) and Lee et al. (2004) used costs to extend the standard SVM to nonstandard situations. In Section 3.1, we explore the use of costs for a possible extension of the multicategory SVM based on the generalized functional margin min(g(f(x), y)). In view of the difficulty of implementing the corresponding method, in Section 3.2 we propose a new way of weighting that uses utilities instead of costs.
3.1 Challenge with the Cost Formulation
In the standard learning case, we do not distinguish different types of misclassification. The 0–1 loss can be represented as I(min(g(f(xi), yi)) ≤ 0). A natural way to extend binary margin loss functions to the multicategory case is to use the generalized functional margin min(g(f(xi), yi)) as the argument. For example, the multicategory SVM proposed by Liu and Shen (2006) uses the loss [1 − min(g(f(xi), yi))]+ and solves the following optimization problem
| (2) |
To extend standard learning via weighting, we need to consider different types of misclassification. Let ϕ(x) : IRd → {1, …, k} be a classifier and Cyϕ(x) represent the cost of misclassifying input x with class y into class ϕ(x). We set Cyϕ(x) = 0 if ϕ(x) = y and Cyϕ(x) > 0 otherwise. That implies no penalty shall be given for correct classification and a positive cost can be used when an error occurs. Under the same framework as in Section 2, our goal is to obtain f which minimizes the GE Err(f) = E[CYϕ(X)].
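Notice that $C_{Y\phi(X)} = \sum_{j=1}^{k} C_{Yj}\, I(\phi(X) = j)$, and that $\phi(X) = j$ when $\min(g(f(X), j)) > 0$. Then an empirical version of Err(f) based on the training dataset can be written as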
| (3) |
Because of the scaling problem of I(·) as discussed in Section 2, (3) cannot be used for learning directly. However, one can use a convex approximation to replace the indicator function. For example, a natural approximation is to replace I(min(g(f(xi), j)) > 0) by [1 + min(g(f(xi), j))]+. Then we can get the following empirical minimization problem
| (4) |
Through the use of costs, (4) can be viewed as a weighted extension of the multicategory SVM formulation in (2). Surprisingly, unlike (2), (4) is no longer a convex minimization problem. To see this, we can first introduce slack variables ξij, as commonly done in the SVM optimization, to replace [1 + min(g(f(xi), j))]+ with the constraints ξij ≥ 0 and ξij ≥ 1 + min(g(f(xi), j)). Note that the region satisfying the constraints min(g(f(xi), j)) ≤ ξij − 1 is not convex. As a result, (4) cannot be directly implemented using convex optimization.
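The loss of convexity can be seen from a single midpoint check. The sketch below, with hypothetical decision vectors for k = 3, compares the cost-type surrogate [1 + min g(f, j)]+ with the SVM-type surrogate [1 − min g(f, j)]+ at two points and at their average.

```python
def min_margin(f, j):
    """min g(f, j): smallest difference f_j - f_l over l != j (classes are 1-based)."""
    fj = f[j - 1]
    return min(fj - fl for l, fl in enumerate(f, start=1) if l != j)


def cost_surrogate(f, j):
    """[1 + min g(f, j)]_+, the surrogate that appears in the cost formulation."""
    return max(1.0 + min_margin(f, j), 0.0)


def svm_surrogate(f, j):
    """[1 - min g(f, j)]_+, the surrogate used by the multicategory SVM."""
    return max(1.0 - min_margin(f, j), 0.0)


# Two hypothetical decision vectors (both sum to zero) and their midpoint.
f_a = [0.0, 1.0, -1.0]
f_b = [0.0, -1.0, 1.0]
f_mid = [(a + b) / 2.0 for a, b in zip(f_a, f_b)]

print(cost_surrogate(f_a, 1), cost_surrogate(f_b, 1), cost_surrogate(f_mid, 1))
print(svm_surrogate(f_a, 1), svm_surrogate(f_b, 1), svm_surrogate(f_mid, 1))
```

The cost surrogate is 0 at both endpoints but 1 at the midpoint, which violates the convexity inequality. The SVM surrogate satisfies it, because 1 − min g is a maximum of linear functions.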
3.2 The New Utility Formulation
In contrast to the cost formulation discussed in Section 3.1, we explore the use of utility in this section. In particular, we assign the utility amount Uyϕ(x) for classifying input x with class y into class ϕ(x). Naturally, one should set Ujj to be larger than the other Ujl for l ≠ j. As a remark, both our costs and utilities are nonnegative.
To generalize the 0–1 loss, instead of minimizing the total cost, one should maximize the utility $E[U_{Y\phi(X)}]$. Then an empirical version of the total utility is as follows
| (5) |
Notice that maximizing E[UYϕ(X)] is equivalent to minimizing E[−UYϕ(X)]. Then it is straightforward to see that maximizing the total utility (5) is equivalent to minimizing the following quantity
| (6) |
Therefore, the utility-based multicategory SVM extension can be written as follows
| (7) |
It is easy to see that problem (7) is a natural generalization of the standard SVM problem (2) with weights Uyij. Similarly, our proposed weighted RSVM solves the following problem
| (8) |
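To illustrate how the utility weights enter the loss, the sketch below evaluates a weighted truncated hinge loss for one observation. It assumes the loss has the form Σj Uyj Ts(min g(f(x), j)), which is how the text describes the utility weights; the displayed problem (8) is not reproduced here, and the utility matrix `U` and decision vector `f` are hypothetical.

```python
def min_margin(f, j):
    """min g(f, j): smallest difference f_j - f_l over l != j (classes are 1-based)."""
    fj = f[j - 1]
    return min(fj - fl for l, fl in enumerate(f, start=1) if l != j)


def truncated_hinge(u, s):
    """T_s(u) = [1 - u]_+ - [s - u]_+."""
    return max(1.0 - u, 0.0) - max(s - u, 0.0)


def weighted_truncated_loss(f, y, U, s):
    """Assumed form: sum_j U[y][j] * T_s(min g(f, j)), with U a k x k nested list."""
    k = len(f)
    return sum(U[y - 1][j - 1] * truncated_hinge(min_margin(f, j), s)
               for j in range(1, k + 1))


s = -0.5                                   # -1/(k - 1) for k = 3
f = [1.0, -0.2, -0.8]                      # hypothetical f(x)
U_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
U = [[1.0, 0.4, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical utilities

print(weighted_truncated_loss(f, 1, U_id, s))  # reduces to T_s(min g(f, y))
print(weighted_truncated_loss(f, 1, U, s))     # adds 0.4 * T_s(min g(f, 2))
```

With the identity matrix only the term for the true class survives, so the unweighted truncated hinge loss is recovered as a special case of this assumed form.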
Our weighted RSVM uses the multicategory weighted truncated hinge loss function $\sum_{j=1}^{k} U_{yj}\, T_s(\min g(f(x), j))$. To further explore the proposed loss, we study its consistency. In particular, we first define weighted Fisher consistency and then study Fisher consistency of general truncated loss functions. Note that the Bayes rule that maximizes the expected utility is $\phi^*(x) = \operatorname{argmax}_j \sum_{l=1}^{k} p_l(x) U_{lj}$, where $p_l(x) = P(Y = l \mid X = x)$. Assume that the function ℓ(·) is non-increasing and ℓ′(0) < 0 exists. Then the weighted Fisher consistency is defined as follows.
Definition 3.1. Weighted Fisher Consistency
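Denote $f^* = \operatorname{argmin}_f E\big[\sum_{j=1}^{k} U_{Yj}\, \ell(\min g(f(X), j)) \mid X = x\big]$. Then the corresponding weighted loss function is weighted Fisher consistent if $\operatorname{argmax}_j f_j^* = \operatorname{argmax}_j \sum_{l=1}^{k} p_l U_{lj}$.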
As a remark, we note our weighted Fisher consistency definition and results are more general than that of Wu et al. (2010). Essentially, the weights imposed by Wu et al. (2010) can be viewed as a special case of our utility-based learning with a diagonal utility matrix.
Let ℓTs(·) = min(ℓ(·), ℓ(s)) with s ≤ 0. The following theorem, as an extension of the results in Wu and Liu (2007), states the weighted Fisher consistency results for the weighted truncated loss $\sum_{j=1}^{k} U_{yj}\, \ell_{T_s}(\min g(f(x), j))$.
Theorem 3.1. Assume that the function ℓ(·) is non-increasing and ℓ′(0) < 0 exists. Then a sufficient condition for the loss $\sum_{j=1}^{k} U_{yj}\, \ell_{T_s}(\min g(f(x), j))$ with k > 2 to be weighted Fisher consistent is that the truncation location s satisfies $\sup_{\{u:\, u \ge -s \ge 0\}} \frac{\ell(0) - \ell(u)}{\ell(s) - \ell(0)} \ge k - 1$. This condition is also necessary if ℓ(·) is convex.
As in standard learning, the truncation value s given in Theorem 3.1 depends on the class number k. For $\ell(u) = H_1(u)$, $e^{-u}$, and $\log(1 + e^{-u})$, weighted Fisher consistency for ℓTs(min g(f(x), y)) can be guaranteed for $s \in [-\frac{1}{k-1}, 0]$, $[\log(1 - \frac{1}{k}), 0]$, and $[-\log(2^{k/(k-1)} - 1), 0]$, respectively. For the implementation of our weighted RSVM, we recommend choosing $s = -\frac{1}{k-1}$, the same as for the standard RSVM.
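The condition in Theorem 3.1 can be checked numerically for the three losses mentioned above. The sketch below uses the fact that each of these losses decreases to 0 as u grows, so the supremum of ℓ(0) − ℓ(u) equals ℓ(0); the lower endpoints coded in `s_lower_bounds` are obtained by solving the inequality of the theorem with equality, and the function names are illustrative.

```python
import math


def hinge(u):
    return max(1.0 - u, 0.0)


def exponential(u):
    return math.exp(-u)


def logistic(u):
    return math.log(1.0 + math.exp(-u))


losses = {"hinge": hinge, "exponential": exponential, "logistic": logistic}


def sup_ratio(loss, s):
    """sup_u (l(0) - l(u)) / (l(s) - l(0)); here inf_u l(u) = 0 for all three losses."""
    return loss(0.0) / (loss(s) - loss(0.0))


def s_lower_bounds(k):
    """Most negative truncation location s for which the ratio is still >= k - 1."""
    return {
        "hinge": -1.0 / (k - 1),
        "exponential": math.log(1.0 - 1.0 / k),
        "logistic": -math.log(2.0 ** (k / (k - 1)) - 1.0),
    }


for k in (3, 4, 5):
    for name, loss in losses.items():
        s_min = s_lower_bounds(k)[name]
        print(k, name, round(s_min, 4), round(sup_ratio(loss, s_min), 6))
```

At each lower endpoint the ratio equals k − 1 (up to rounding). Moving s towards 0 increases the ratio, and moving it further below the endpoint makes the condition fail.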
3.3 Construction of the Utility Matrix
As we mentioned earlier, using costs to extend standard to nonstandard learning is very common. Weighted loss functions using costs extend the standard 0–1 loss by allowing unequal costs on different types of misclassification. Typically, one sets costs for correct classification to be zero and sets various costs for different types of misclassification depending on the problem and context. In this section, we describe one way of constructing a utility matrix based on a predefined cost matrix and then show their equivalence.
With a prespecified cost matrix {Cjl; j, l = 1, …, k}, we can construct the utility matrix with
where 1 is a vector of 1 with length k.
Next we demonstrate that this choice of utility is reasonable by showing that
| (9) |
The equality (9) implies that using our choice of utility matrix, the solution f, which minimizes the expected cost, also maximizes the expected utility. This justifies the usage of our utility matrix.
To show (9), we define Pj = P(Y = j|X = x). Then we have
Thus, (9) is proved. As a result, one can construct the utility matrix directly once the cost matrix is given for the proposed weighted RSVM.
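The equivalence in (9) can be illustrated numerically under one assumed construction. The display defining the utility matrix is not reproduced above, so the form U = c·11ᵀ − C with c equal to the largest cost is an assumption made only for this sketch, chosen because it involves the vector 1 and keeps all utilities nonnegative. The cost matrix and class probabilities are hypothetical.

```python
def utility_from_cost(C):
    """Assumed construction: U_jl = c_max - C_jl, where c_max is the largest cost."""
    c_max = max(max(row) for row in C)
    return [[c_max - c for c in row] for row in C]


def expected_by_prediction(M, p):
    """For each predicted class j, return sum_l p_l * M[l][j]."""
    k = len(M)
    return [sum(p[l] * M[l][j] for l in range(k)) for j in range(k)]


# Hypothetical cost matrix (zero diagonal) and conditional class probabilities P_j.
C = [[0.0, 1.0, 4.0],
     [2.0, 0.0, 1.0],
     [1.0, 3.0, 0.0]]
p = [0.5, 0.3, 0.2]

U = utility_from_cost(C)
expected_cost = expected_by_prediction(C, p)
expected_utility = expected_by_prediction(U, p)

best_by_cost = min(range(len(p)), key=lambda j: expected_cost[j])
best_by_utility = max(range(len(p)), key=lambda j: expected_utility[j])
print(expected_cost, expected_utility, best_by_cost, best_by_utility)
```

Because the probabilities sum to one, each expected utility equals c minus the corresponding expected cost, so the minimizer of the expected cost and the maximizer of the expected utility coincide under this assumed form.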
4 Nonconvex Minimization via Difference Convex Algorithm
In this section, we develop a difference convex algorithm to solve (8). Note that the loss function in (8) is piecewise linear. Consequently it can also be solved by developing a mixed integer programming (MIP) algorithm as in Liu and Wu (2006). However due to the computational intensity of the MIP, we use the DC algorithm. The DC algorithm solves the nonconvex minimization problem via minimizing a sequence of convex subproblems (An and Tao, 1997; Liu et al., 2005). In particular, to apply the DC algorithm, we first rewrite the nonconvex objective function as a difference of two convex functions. Then we solve the original nonconvex optimization problem via iterative convex minimization problems. Each convex minimization subproblem serves as an approximation of the original problem.
For simplicity, we only focus on linear learning. The DC algorithm for nonlinear learning can be derived using the kernel formulation together with the same idea of iterative approximation as in linear learning. More details on the implementation of nonlinear learning can be found in Wu and Liu (2007). For linear learning, we set $f_j(x) = w_j^T x + b_j$; $w_j \in \mathbb{R}^d$, $b_j \in \mathbb{R}$, and $b = (b_1, b_2, \ldots, b_k)^T \in \mathbb{R}^k$, where $w_j = (w_{1j}, w_{2j}, \ldots, w_{dj})^T$, and $W = (w_1, w_2, \ldots, w_k)$. With the two-norm penalty on the coefficient vectors $w_j$, (8) simplifies to
| (10) |
where and the constraints are adopted to avoid non-identifiability issue of the solution.
We denote parameters (W, b) by Θ. By noting the fact that Ts = H1 − Hs, the objective function in (10) can be decomposed as
where
and
denote the convex and concave parts, respectively.
Note that can be written respectively as follows
and
where I{A} = 1 if event A is true, and 0 otherwise.
Define, for j′ ≠ j, βijj′ = C if and 0 otherwise. With the help of βij, we have
and
We now describe the iterative procedure of the DC algorithm for minimizing the nonconvex objective function in (10). For initialization, we can use the solution of (10) when Ts(u) = H1(u), i.e., when no truncation is performed on the loss function with s = −∞. In that case, problem (10) becomes a convex minimization problem. Denote Θt to be the solution at the end of step t. At the step (t + 1), we apply the linear approximation to the concave part and then the objective function becomes
Using slack variable ξij’s for the hinge loss function, the optimization problem at step (t + 1) becomes
The corresponding Lagrangian is
| (11) |
subject to
| (12) |
| (13) |
| (14) |
where the Lagrangian multipliers are uij ≥ 0 and αijj′ ≥ 0 for any i = 1, 2, ⋯, n, j = 1, 2, ⋯, k, j′ ≠ j. Substituting (12)–(14) into (11) yields the corresponding dual problem
where βijj′ is defined as above. Note that the above dual problem is a quadratic programming (QP) problem similar to that of the standard SVM. Many optimization software packages can be used to solve the above dual problem. Once the solution is obtained, the coefficients wj’s can be recovered as follows,
| (15) |
which satisfies the sum-to-zero constraint for each 1 ≤ m ≤ d automatically.
After the solution of W is derived, b can be obtained via solving either a sequence of KKT conditions as used in the standard SVM or a linear programming (LP) problem. Denote . Then b can be obtained through the following LP problem:
We continue iterating the above convex optimization steps until convergence. Convergence of the objective values is guaranteed because the objective function value does not increase from one iteration to the next and is bounded below by zero.
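The iteration can be sketched on a deliberately small problem. The code below is a toy analogue, not the multicategory QP described above: it assumes a binary, unweighted problem with a single coefficient θ and no intercept, so that the objective is ½θ² + C Σi Ts(zi θ) with zi standing for yi xi. The convex subproblem is solved by ternary search instead of the dual QP, and the data `z` (with one outlier) are hypothetical.

```python
def hinge(u, s=1.0):
    """H_s(u) = [s - u]_+."""
    return max(s - u, 0.0)


def objective(theta, z, C, s):
    """Nonconvex toy objective 0.5*theta^2 + C * sum_i T_s(z_i * theta)."""
    return 0.5 * theta ** 2 + C * sum(hinge(zi * theta) - hinge(zi * theta, s) for zi in z)


def convex_step(theta_t, z, C, s, lo=-10.0, hi=10.0, iters=200):
    """One DC step: linearise the concave part at theta_t, then minimise."""
    # Derivative of the concave part -C * sum_i H_s(z_i * theta) at theta_t.
    slope = C * sum(zi for zi in z if zi * theta_t < s)

    def surrogate(theta):
        return 0.5 * theta ** 2 + C * sum(hinge(zi * theta) for zi in z) + slope * theta

    # Ternary search on [lo, hi]; valid because the surrogate is convex in theta.
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if surrogate(a) <= surrogate(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


def dc_algorithm(z, C=1.0, s=-1.0, max_iter=50, tol=1e-10):
    # Initialise with the untruncated hinge-loss solution (s = -infinity).
    theta = convex_step(0.0, z, C, float("-inf"))
    history = [objective(theta, z, C, s)]
    for _ in range(max_iter):
        theta = convex_step(theta, z, C, s)
        history.append(objective(theta, z, C, s))
        if history[-2] - history[-1] < tol:
            break
    return theta, history


z = [1.0, 1.5, 2.0, 0.8, 1.2, -3.0]  # the last value plays the role of an outlier
theta, history = dc_algorithm(z)
print(theta, history)
```

In this toy run the hinge-loss start is pulled towards zero by the outlier; once the outlier falls in the truncated region its influence is cancelled by the linearized concave part, and the objective values recorded in `history` never increase.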
5 Numerical Examples
In this section, we investigate the performance of our weighted RSVM through simulated examples in Section 5.1. We use three simulated examples to show the effect of the utility matrix and the improvement of weighted RSVM over the weighted SVM. A hand written digit recognition example is presented in Section 5.2 to further demonstrate the use of our proposed weighted RSVM. For all examples discussed in this section, the tuning parameter λ is selected over a grid using an independent tuning data set.
5.1 Simulation
We consider three simulated examples. For Examples 1 and 2, the underlying Bayes decision boundary is piecewise linear. We perform both linear and kernel nonlinear learning in Example 1 to demonstrate the change of the decision boundary with the utility matrix. Example 2 is used to show the advantage of the weighted RSVM over the weighted SVM when there are outliers in the data. For Example 3, the underlying Bayes decision boundary is nonlinear and we study the performance of our nonlinear weighted RSVM.
Example 1: This example is used to illustrate how the decision boundary moves when we change the utility matrix. The data of this piecewise linear example are generated as follows: the class response Y takes the values 1, 2, or 3 with equal probabilities; conditional on Y = y, X ~ N(μy, 0.7²I2), where I2 is the 2 × 2 identity matrix, μ1 = (1, 0)T, .
To study the effect of the utility matrix, we consider three different configurations of the utility matrix as follows:
| (16) |
where 0 ≤ a < 1.
Since the true decision boundary is piecewise linear, we apply both linear learning and nonlinear kernel learning. For linear learning, the sample size is set to be n = 1600. Figure 1 shows how the decision boundaries change as we change the utility matrix. For the utility matrix U1,a, the utility of classifying points in class 1 into class 2 increases as a increases. As a result, the region for class 1 becomes smaller while the region for class 2 becomes larger as a increases, as shown on the left panel of Figure 1. For the utility matrix U2,a, the utilities of classifying points in class 1 into class 2 or 3 increase as a increases. Thus, the regions for classes 2 and 3 become larger and the region for class 1 becomes smaller as a increases, as shown on the middle panel of Figure 1. In contrast to U2,a, for U3,a, the utilities of classifying points in class 2 or 3 into class 1 increase as a increases. Consequently, the right panel of Figure 1 shows that the regions for classes 2 and 3 become smaller and the region for class 1 becomes larger as a increases. Overall, the results match our expectation when we change the utility matrix.
Figure 1.
Boundaries for three configurations of the utility matrix with different a’s in (16) for Example 1. The left, middle, and right panels correspond to utility matrices U1,a, U2,a, U3,a respectively.
Next we apply nonlinear learning with the Gaussian kernel and examine how the boundaries move as the utility matrix changes. In this case, we set the sample size n = 800. We show the boundaries for one typical realization in Figure 2. The first, second, and third rows of Figure 2 correspond to the changes of decision boundaries as we change a in (16). The patterns are similar to those of linear learning. In particular, for the first row, the region for class 2 gets larger while that for class 1 gets smaller. For the second row, the regions for both classes 2 and 3 get larger while that for class 1 gets smaller. Interestingly, as a gets larger, the region for class 1 is mixed with the regions for classes 2 and 3. This is possibly due to the same value for U12 and U13. As a gets larger, the chances of classifying a point in class 1 into one of the three classes are approximately the same and thus the decision is close to random. For the third row, the region for class 1 gets larger at the expense of smaller regions for classes 2 and 3.
Figure 2.
Boundaries for three utility matrix configurations with different a.
Example 2: This example is used to demonstrate the performance of the weighted SVM and the weighted RSVM when there are outliers in the data. For simplicity, we use a similar data generation scheme as in Example 1. We first generate (X, Ỹ) as in Example 1. Then we perform a data contamination step to generate outliers. Specifically, conditional on Ỹ = ỹ, we randomly flip Ỹ to get the final response Y by setting Y = j with probability P(j, ỹ), where j = 1, 2, 3. We set P(j, ỹ) = 0.85 when j = ỹ and 0.075 otherwise. With outliers existing in the data, we want to examine the effect of truncation of the hinge loss for the weighted RSVM.
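The contamination step can be sketched as follows. The function name, the seeds, and the clean label vector are hypothetical; only the flipping probabilities 0.85 and 0.075 come from the text.

```python
import random


def flip_labels(labels, keep_prob=0.85, k=3, rng=None):
    """Keep each label with probability keep_prob; otherwise move it to one of the
    other k - 1 classes uniformly at random (0.075 each when k = 3)."""
    rng = rng or random.Random(0)
    flipped = []
    for y in labels:
        if rng.random() < keep_prob:
            flipped.append(y)
        else:
            flipped.append(rng.choice([j for j in range(1, k + 1) if j != y]))
    return flipped


clean = [1, 2, 3] * 4  # hypothetical uncontaminated labels
noisy = flip_labels(clean, rng=random.Random(0))
print(clean)
print(noisy)
```

Splitting the remaining probability 0.15 evenly over the two other classes reproduces P(j, ỹ) = 0.075 for j ≠ ỹ.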
For this example, the sample size is set to be n = 400. We generate an independent test set of size ntest = 100n. For a utility matrix U = (uij), we report the average utility on the test set, defined as $\frac{1}{n_{\mathrm{test}}} \sum_{i} u_{y_i \hat{y}_i}$, where the sum runs over every (xi, yi) in the test set with corresponding prediction ŷi. Means and standard deviations of the average utilities over 100 repetitions are reported for different methods in Table 1. We can see that using truncation can help to produce classifiers with higher utilities than those without truncation in most cases.
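The evaluation criterion amounts to looking up one utility entry per test observation and averaging. A minimal sketch follows; the utility matrix and the label vectors are hypothetical, since the configurations in (16) are not reproduced here.

```python
def average_utility(U, y_true, y_pred):
    """Mean of U[y_i][yhat_i] over the test set (class labels are 1-based)."""
    return sum(U[y - 1][yh - 1] for y, yh in zip(y_true, y_pred)) / len(y_true)


# Hypothetical utility matrix: predicting class 2 for a class-1 point still earns 0.4.
U = [[1.0, 0.4, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 1]
print(average_utility(U, y_true, y_pred))
```

With an identity utility matrix this quantity reduces to the usual classification accuracy.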
Table 1.
Utility comparison for the weighted SVM (WSVM) and RSVM (WRSVM) for Example 2.
| Method | Utility | a = 0 | a = 0.2 | a = 0.4 | a = 0.6 | a = 0.8 |
|---|---|---|---|---|---|---|
| WSVM | U1,a | 70.67 (0.24) | 71.58 (0.36) | 72.60 (0.31) | 73.90 (0.42) | 74.79 (0.94) |
| | U2,a | | 72.53 (0.34) | 74.57 (0.45) | 76.42 (0.99) | 79.99 (0.29) |
| | U3,a | | 72.82 (0.28) | 75.44 (0.31) | 77.39 (1.84) | 86.59 (0.09) |
| WRSVM | U1,a | 70.62 (0.28) | 71.32 (0.47) | 72.13 (0.49) | 72.88 (0.93) | 73.96 (0.71) |
| | U2,a | | 71.98 (0.70) | 71.88 (1.04) | 74.46 (0.38) | 80.04 (0.26) |
| | U3,a | | 72.60 (0.43) | 74.35 (1.08) | 75.90 (1.58) | 86.61 (0.08) |
Note: All table entries are multiplied by 100.
Example 3: Examples 1 and 2 have piecewise linear Bayes decision boundaries. For Example 3, we consider an example with a nonlinear Bayes decision boundary. In particular, predictors X = (X1, X2)T are generated with X1 ~ Uniform[−3, 3] and X2 ~ Uniform[−6, 6]. Conditional on X = x, the initial response Ỹ takes value j with probability with . Conditional on Ỹ = ỹ, we randomly flip Ỹ to get the final response Y by setting Y = j with probability P(j, ỹ), where j = 1, 2, 3. We set P(j, ỹ) = 0.85 when j = ỹ and 0.075 otherwise.
We choose n = 200, ntest = 10n, and three specific utility matrices U1,0.4, U2,0.4, and U3,0.4. Average utility results are given in Table 2. In general, loss truncation for the weighted RSVM helps to deliver classifiers with larger utilities except for the case of U2,0.4.
Table 2.
Utility comparison for the weighted SVM (WSVM) and RSVM (WRSVM) for Example 3.
| Method | U1,0.4 | U2,0.4 | U3,0.4 |
|---|---|---|---|
| WSVM | 43.47 (1.88) | 47.09 (1.11) | 57.62 (3.10) |
| WRSVM | 43.10 (1.76) | 47.28 (1.09) | 55.13 (2.80) |
Note: All table entries are multiplied by 100.
5.2 Hand Written Digit Recognition
In this section, we use one real data example to further illustrate our proposed method. The real dataset we use is the “Pen-Based Recognition of Handwritten Digits” dataset, available online at the UCI Machine Learning Repository. See Alimoglu and Alpaydin (1996) for more information related to this data set.
For this digit dataset, the response variable is multicategory with class codes being 0, 1, 2, ⋯, 9, representing corresponding digits. To simplify the task, we focus on three classes with class codes of 3, 6, and 9, which are labeled as class 1, 2, and 3, respectively, in our weighted RSVM. After combining both the training data and the testing data, we have 3166 observations with class codes 3, 6, and 9. There are 16 predictors available, which are first standardized to have mean zero and standard deviation one. For each replication, we randomly select 50 observations from each class to be used as the training data, another 50 observations for each class from the remaining to be used as the tuning data, and the rest as the test data. We repeat this random splitting 20 times.
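The splitting scheme can be sketched as below. The function name and the label vector are hypothetical (the class sizes are made up for illustration and are not the sizes in the digit data); only the 50/50/rest design per class comes from the text.

```python
import random


def split_by_class(labels, n_train=50, n_tune=50, seed=0):
    """Return index lists (train, tune, test): n_train and n_tune observations are
    drawn at random from each class, and the remainder of each class is test data."""
    rng = random.Random(seed)
    train, tune, test = [], [], []
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        train += idx[:n_train]
        tune += idx[n_train:n_train + n_tune]
        test += idx[n_train + n_tune:]
    return train, tune, test


# Hypothetical labels with made-up class sizes.
labels = [1] * 120 + [2] * 130 + [3] * 110
train, tune, test = split_by_class(labels, seed=1)
print(len(train), len(tune), len(test))
```

Repeating the call with 20 different seeds would give 20 random splits of the kind described above.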
We report the results based on these 20 random splits in Table 3. For each utility matrix and each random split, we calculate the average percentage of observations of each class that are predicted into each of the classes. For the purpose of illustration, we only report the results for the utility type U2,a in Table 3. For the case of standard learning with utility U2,0, we can see that the correct classification rates for the three classes are all around 0.99. When we increase a, more and more data points in class 1 are classified into classes 2 and 3, as the design of this utility matrix encourages such kinds of misclassification.
Table 3.
Classification results for the hand written digit recognition data: percentage of observations of each true class y assigned to each predicted class ŷ.
| Utility | True class | ŷ = 1 | ŷ = 2 | ŷ = 3 |
|---|---|---|---|---|
| U2,0 | y = 1 | 0.9929 | 0.0005 | 0.0066 |
| | y = 2 | 0.0001 | 0.9989 | 0.0010 |
| | y = 3 | 0.0130 | 0.0002 | 0.9868 |
| U2,0.2 | y = 1 | 0.9886 | 0.0034 | 0.0080 |
| | y = 2 | 0.0001 | 0.9990 | 0.0009 |
| | y = 3 | 0.0109 | 0.0015 | 0.9875 |
| U2,0.6 | y = 1 | 0.4545 | 0.1573 | 0.3882 |
| | y = 2 | 0 | 0.9990 | 0.0010 |
| | y = 3 | 0 | 0.0002 | 0.9998 |
6 Discussion
Multicategory classification is an important statistical problem in practice. In this paper, we propose a weighted extension of the multicategory RSVM. A novel utility-based weighting method is proposed. The resulting weighted RSVM can be implemented using the DC algorithm. The connections between the cost matrix and the utility matrix are explored. Our numerical examples demonstrate the effectiveness of our weighted RSVM.
Our method requires a predefined utility matrix in order to apply the weighted learning. Although we show that the utility matrix can be constructed using the cost matrix, the method requires the users to specify the cost or utility matrix. How to best assign sensible costs or utilities for multicategory classification is an important topic. Further investigation is necessary.
Acknowledgments
The authors would like to thank the reviewers for their constructive comments and suggestions. The authors are partially supported by NSF Grants DMS-0747575 (Liu) and DMS-0905561 (Wu), NIH/NCI Grants R01 CA-149569 (Liu and Wu) and P01 CA142538 (Liu).
Appendix*
Proof of Theorem 3.1: Note that
For any given x, we need to minimize where gj = min g(f(x), j). By definition and the fact that , we can conclude that maxj gj ≥ 0 and at most one of gj’s is positive. Assume is unique. Then using the non-increasing property of ℓTs and ℓ′(0) < 0, the minimizer f* satisfies that .
We are now left to show , equivalently that 0 cannot be a minimizer. For simplicity, denote , and C = A + B. Note A > C/k due to the uniqueness of jp. Then it is sufficient to show that there exists a solution with gjp > 0. By assumption, there exists u1 > 0 such that u1 ≥ −s and (ℓ(0) − ℓ(u1))/(ℓ(s) − ℓ(0)) ≥ k − 1. Consider a solution f0 with . We want to show that f0 yields a smaller expected loss than 0, i.e., AℓTs (u1) + BℓTs (−u1) < ℓTs(0)C. Equivalently, (ℓ(0) − ℓ(u1))/(ℓ(s) − ℓ(0)) > B/A, which holds due to the fact that B/A < (k − 1). This implies sufficiency of the condition.
To prove necessity of the condition, it is sufficient to show that if (ℓ(0)−ℓ(u))/(ℓ(s)−ℓ(0)) < (k − 1) for all u with −u ≤ s ≤ 0, then 0 is a minimizer of . Equivalently, we need to show that there exists (p1, …, pk) such that for all f. Without loss of generality, assume that jp = k and f1 ≤ f2 ≤ ⋯ ≤ fk. Then since ℓTs is non-increasing. Thus it is sufficient to show ℓTs(fk − fk−1)A + ℓTs(fk−1 − fk)B > ℓTs(0)C, that is, B(ℓTs(−u)−ℓ(0)) > A(ℓ(0)−ℓ(u)) for all u > 0. Since ℓ(·) is convex, B(ℓ(−u) − ℓ(0)) > A(ℓ(0) − ℓ(u)) holds for all 0 < u ≤ −s with ℓTs(−u) = ℓ(−u). For u ≥ −s, it is equivalent to showing B(ℓ(s)−ℓ(0)) > A(ℓ(0) − ℓ(u)). By assumption, we can set (ℓ(s) − ℓ(0)) = (ℓ(0) − ℓ(u))/(k − 1) + a for some a > 0. Denote (ℓ(0)−ℓ(u)) = W. Then we need to have B(W/(k−1)+a) > AW. Let A = C/k + ε, so that B = C − A = (k − 1)C/k − ε. The inequality then becomes ((k − 1)C/k − ε)(W/(k − 1) + a) > (C/k + ε)W, equivalently,
(k − 1)aC/k > ε(kW/(k − 1) + a). (17)
For any given a > 0, C ≥ A > 0 and W > 0, we can always find a small ε > 0 to have (17) satisfied. The desired result then follows.
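The final step of the necessity argument can be checked numerically. The sketch below tests the inequality B(W/(k − 1) + a) > AW directly, with A = C/k + ε and B = C − A as in the proof. The threshold on ε comes from expanding that inequality; it is algebra carried out here, not a formula quoted from the paper. The function names and the numerical values are hypothetical.

```python
def inequality_holds(a, C, W, k, eps):
    """Check B * (W/(k-1) + a) > A * W with A = C/k + eps and B = C - A."""
    A = C / k + eps
    B = C - A
    return B * (W / (k - 1) + a) > A * W


def eps_threshold(a, C, W, k):
    """Value of eps at which both sides of the inequality are equal.

    Expanding the inequality gives (k-1)*a*C/k > eps * (k*W/(k-1) + a),
    so any eps below this threshold satisfies it.
    """
    return ((k - 1) * a * C / k) / (k * W / (k - 1) + a)


if __name__ == "__main__":
    a, C, W, k = 0.1, 1.0, 1.0, 3  # illustrative values only
    thr = eps_threshold(a, C, W, k)
    print(thr, inequality_holds(a, C, W, k, 0.5 * thr),
          inequality_holds(a, C, W, k, 2.0 * thr))
```

Since the threshold is strictly positive whenever a > 0, C > 0, W > 0 and k ≥ 2, a sufficiently small ε > 0 always exists, in line with the statement above.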
Contributor Information
Yufeng Liu, Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599 (yfliu@email.unc.edu).
Yichao Wu, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (wu@stat.ncsu.edu).
Qinying He, Research Institute of Economics and Management, Southwestern University of Finance and Economics, Chengdu, Sichuan, China, 610074 (he@swufe.edu.cn).
References
- Alimoglu F, Alpaydin E. Methods of Combining Multiple Classifiers Based on Different Representations for Pen-based Handwriting Recognition. Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96); Istanbul, Turkey. 1996.
- An LTH, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by d.c. algorithms. Journal of Global Optimization. 1997;11:253–285.
- Boser B, Guyon I, Vapnik VN. A training algorithm for optimal margin classifiers. The Fifth Annual Conference on Computational Learning Theory; Pittsburgh: ACM. 1992. pp. 144–152.
- Cortes C, Vapnik VN. Support-Vector Networks. Machine Learning. 1995;20:273–297.
- Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press; 2000.
- Freund Y, Schapire R. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences. 1997;55:119–139.
- Friedman JH, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting. Annals of Statistics. 2000;28:337–407.
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. New York: Springer-Verlag; 2009.
- Lee Y, Lin Y, Wahba G. Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association. 2004;99:67–81.
- Lin Y, Lee Y, Wahba G. Support vector machines for classification in nonstandard situations. Machine Learning. 2002;46:191–202.
- Liu Y, Shen X. Multicategory ψ-learning. Journal of the American Statistical Association. 2006;101:500–509.
- Liu Y, Shen X, Doss H. Multicategory ψ-learning and support vector machines: computational tools. Journal of Computational and Graphical Statistics. 2005;14:219–236.
- Liu Y, Wu Y. Optimizing ψ-learning via mixed integer programming. Statistica Sinica. 2006;16:441–457.
- Marron JS, Todd M, Ahn J. Distance weighted discrimination. Journal of the American Statistical Association. 2007;102:1267–1271.
- Qiao X, Liu Y. Adaptive weighted learning for unbalanced multicategory classification. Biometrics. 2009;65:159–168. doi: 10.1111/j.1541-0420.2008.01017.x.
- Qiao X, Zhang HH, Liu Y, Todd MJ, Marron JS. Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association. 2010;105:401–414. doi: 10.1198/jasa.2010.tm08487.
- Shen X, Tseng GC, Zhang X, Wong WH. On ψ-learning. Journal of the American Statistical Association. 2003;98:724–734.
- Wu Y, Liu Y. Robust truncated-hinge-loss support vector machines. Journal of the American Statistical Association. 2007;102:974–983.
- Wu Y, Zhang HH, Liu Y. Robust Model-free Multiclass Probability Estimation. Journal of the American Statistical Association. 2010;105:424–436. doi: 10.1198/jasa.2010.tm09107.
- Zhu J, Hastie T. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational and Graphical Statistics. 2005;14:185–205.
- Zhu J, Zou H, Rosset S, Hastie T. Multi-class adaboost. Statistics and Its Interface. 2009;2:349–360.


