Summary
Large-margin classifiers are popular methods for classification. Among existing simultaneous multicategory large-margin classifiers, a common approach is to learn k different functions for a k-class problem with a sum-to-zero constraint. Such a formulation can be inefficient. We propose a new multicategory angle-based large-margin classification framework. The proposed angle-based classifiers consider a simplex-based prediction rule without the sum-to-zero constraint, and enjoy more efficient computation. Many binary large-margin classifiers can be naturally generalized for multicategory problems through the angle-based framework. Theoretical and numerical studies demonstrate the usefulness of the angle-based methods.
Some key words: Hard classification, Probability estimation, Soft classification, Support vector machine
1. Introduction
Classification is an important supervised learning technique with numerous applications. Given a training dataset with subjects having both covariates and class labels, the goal is to build a classifier that predicts the label for a new subject. Existing techniques for classification include Fisher linear discriminant analysis, logistic regression, support vector machines and boosting; see Hastie et al. (2009).
Large-margin classifiers in machine learning have attracted a lot of attention in recent years. Typically a large-margin classification method can be written in the regularization framework of loss plus penalty. The loss term measures the goodness of fit of the resulting classifier, and the penalty term controls model complexity and helps to prevent overfitting. Many existing binary classification methods belong to the large-margin classification framework, such as adaboost (Freund & Schapire, 1997), penalized logistic regression (Lin et al., 2000), import vector machines (Zhu & Hastie, 2005), support vector machines (Boser et al., 1992), ψ-learning (Shen et al., 2003), and large-margin unified machines (Liu et al., 2011).
Although binary large-margin classifiers are popular, extensions are needed for multicategory problems. One simple approach is to deal with a multicategory classification problem via a sequence of binary classifiers. Both the one-versus-one and one-versus-rest approaches are sequential binary methods, which are invariant with respect to the order of the class labels. These methods can be suboptimal in certain situations. For example, the one-versus-rest support vector machine method can be inconsistent when there is no dominating class (Lee et al., 2004; Liu & Yuan, 2011). Hence, it is desirable to study frameworks that consider all classes simultaneously for multicategory classification problems.
For a simultaneous classification problem with k > 2 classes, a common approach in the literature is to map the covariates to a classification function vector with length k. In this framework, one associates each class with a coordinate in the function vector. The prediction rule assigns the class label to the category that corresponds to the maximum element within the function vector. A sum-to-zero constraint on the function vector is commonly used to reduce the parameter space as well as to obtain desirable theoretical properties. Many proposed methods follow this framework. See, for example, Vapnik (1998), Crammer & Singer (2001), Lee et al. (2004), Zhu & Hastie (2005), Liu & Shen (2006), Zhu et al. (2009) and Liu & Yuan (2011). However, as binary large-margin classifiers estimate a single function to perform classification using its sign, for a k-class problem, it should be sufficient to consider a (k − 1)-dimensional classification function space. Therefore, existing simultaneous classifiers with the sum-to-zero constraint can be inefficient. The corresponding computation is more involved as well. Apart from inefficiency, the geometric interpretation of maximum separation for binary support vector machines is well understood, while the extension to multicategory support vector machines is much less clear.
To overcome these difficulties, we propose a new multicategory angle-based large-margin classification technique that not only provides a natural generalization of binary large-margin methods, but also overcomes the disadvantages of the existing methods mentioned above. In particular, we consider a k-vertex simplex structure in a Euclidean space whose dimension is k − 1, one less than the number of categories (Hill & Doucet, 2007; Lange & Wu, 2008). The simplex we use is centred at the origin and has equal pairwise distances among its vertices. The classification function vector maps the covariates of a given instance to the Euclidean space wherein the simplex lies. In the (k − 1)-dimensional space, each vertex of the simplex represents one class and is a (k − 1)-dimensional vector. The prediction rule assigns the given instance to the category whose corresponding vertex vector has the smallest angle with respect to the mapped classification function vector. The details of class label representations using a simplex and the prediction rule are introduced in § 2.
Compared to the regular simultaneous multicategory classification methods, the angle-based method has several attractive properties. First, the geometric interpretation of the least angle prediction rule is very easy to understand through our newly defined classification regions; see Fig. 1(a) in § 2. The corresponding functional margin can be directly generalized from the binary to the multicategory case; see Chapter 6 of Cristianini & Shawe-Taylor (2000) for the definition of the functional margin. The minimizers of the new multicategory large-margin unified machines using the angle-based structure help us to better understand the transition behaviours from soft to hard classifiers (Wahba, 2002; Liu et al., 2011). Second, our angle-based method enjoys a lower computational cost than the regular procedure with the sum-to-zero constraint, as we implicitly transfer the constraint onto our newly defined functional margins. Furthermore, we show that the angle-based method has many desirable theoretical properties. In particular, we provide some theoretical insights on the advantages of our angle-based method in § 3·5. Even with the sum-to-zero constraint, using k classification functions instead of k − 1 can introduce extra variability in the estimated classifier. This can reduce the classification performance of the regular simultaneous classifiers. In linear learning problems with high-dimensional predictors, this extra variability can be large. Consequently, the angle-based methods can significantly outperform the regular simultaneous methods. We confirm this insight numerically by comparing our angle-based methods with some existing simultaneous multicategory classifiers. We also show that the numerical performance of our angle-based method is competitive for nonlinear problems using kernel learning.
Fig. 1.
Illustration for the angle-based classification with k = 3. (a) The classification regions for k = 3. The mapped observation f̂ is predicted as class 1, because θ1 < θ2 < θ3, where θj (j = 1, 2, 3) are shown in the figure. Observe that ℝ2 is naturally divided into three classification regions, together with the classification boundaries. (b) The plot of f* with P = (0·55, 0·25, 0·2) and a = 1. Observe that f* moves along the dotted line as c changes from 0 to ∞. For c < ∞, f* remains in C1, and hence the angle-based large-margin unified machine is Fisher consistent. When c → ∞, f* is on the boundary, and consequently the angle-based support vector machine with hinge loss is not Fisher consistent.
2. Methodology
2·1. Binary large-margin classifiers
Suppose we are given a training dataset, {(x1, y1), …, (xn, yn)}, obtained from an unknown underlying distribution P(x, y). In classification problems, one important goal is to find a classifier so that the corresponding misclassification rate is minimized. Our focus in this paper is on large-margin classifiers. We first review binary large-margin classifiers in this section, and then introduce the new angle-based methodology for multicategory problems in § 2·2.
For simplicity, we first discuss binary problems, with y ∈ {1, −1}. In that case, we have a single classification function f with sign(f) as the corresponding prediction rule. Specifically, for an instance with covariates x, the predicted class is ŷ = 1 if f(x) ≥ 0, and ŷ = −1 otherwise. Correct classification occurs if and only if yf(x) > 0. This quantity yf(x) is known as the functional margin. Given the classification function f, the theoretical 0–1 loss can be expressed as L{yf(x)} = I[y ≠ sign{f(x)}], where I(·) is the indicator function. Because L is discontinuous and nonconvex, direct minimization of the total empirical 0–1 loss is difficult. To overcome this challenge, a common approach is to use a surrogate convex loss function in place of L. Let ℓ(·) be a convex upper bound of L. A large-margin classifier solves the minimization problem min_{f ∈ F} n^{−1} ∑_{i=1}^{n} ℓ{yi f(xi)} + λJ(f), or alternatively min_{f ∈ F} n^{−1} ∑_{i=1}^{n} ℓ{yi f(xi)}, subject to J(f) ≤ s, where F is the functional space of interest, J(f) is the penalty term on f that controls its complexity and consequently prevents the resulting classifier from overfitting, and λ, s are tuning parameters that balance the goodness of fit and the penalty terms.
The use of the loss ℓ circumvents the difficulty of minimizing the 0–1 loss L. In the literature, many binary classification methods use a nonincreasing ℓ to encourage a large functional margin yf(x). For example, the support vector machine uses the hinge loss ℓ(yf) = (1 − yf)+ (Vapnik, 1998; Wahba, 1999), logistic regression uses the deviance loss ℓ(yf) = log{1 + exp(−yf)} (Lin et al., 2000; Zhu & Hastie, 2005), boosting approximately uses the exponential loss ℓ(yf) = exp(−yf) (Freund & Schapire, 1997), and the large-margin unified machine uses the large-margin unified loss function, indexed by parameters a and c (Liu et al., 2011). See the Supplementary Material for details. This family provides a convenient platform for studying the transition behaviour from soft to hard binary classifiers (Wahba, 2002). When c = 0, the large-margin unified machine loss corresponds to a typical soft classifier, and when c → ∞, it becomes the hinge loss of the standard support vector machine, which is a typical hard classifier.
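To make the loss functions above concrete, the following R sketch codes them as functions of the functional margin u = yf(x); the explicit large-margin unified machine form is written out only for illustration, assuming the parameterization of Liu et al. (2011), since the paper defers its definition to the Supplementary Material, and the function names are ours.

```r
# Common binary large-margin losses, written as functions of the functional
# margin u = y f(x).
hinge_loss    <- function(u) pmax(1 - u, 0)        # support vector machine
deviance_loss <- function(u) log(1 + exp(-u))      # penalized logistic regression
exp_loss      <- function(u) exp(-u)               # boosting

# Large-margin unified machine loss with parameters a > 0 and c >= 0 (form
# assumed from Liu et al., 2011): c = 0 gives a typical soft classifier,
# and c -> infinity recovers the hinge loss.
lum_loss <- function(u, a = 1, c = 1) {
  out <- 1 - u
  idx <- u >= c / (1 + c)
  out[idx] <- (1 / (1 + c)) * (a / ((1 + c) * u[idx] - c + a))^a
  out
}
```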
2·2. Multicategory angle-based large-margin classifiers
Multicategory problems are prevalent in practice. In that case, it is desirable to extend binary large-margin classifiers to multicategory versions.
For multicategory problems, let Y ∈ {1, …, k}, where k > 2 is the number of classes. The regular simultaneous procedure maps x to f(x) ∈ ℝk, and the corresponding prediction rule is ŷ = argmaxj fj(x), where fj is the jth element of f. A sum-to-zero constraint on f is typically imposed, under which many statistical properties such as Fisher consistency can be derived. This constraint is commonly used in the literature; see, for example, Lee et al. (2004), Wang & Shen (2007), Zhu et al. (2009), Liu & Yuan (2011) and Zhang & Liu (2013). However, since only one function is needed for binary classification, it should be sufficient to use k − 1 functions for k-class problems. Hence, constructing k functions and then removing the redundancy through the sum-to-zero constraint can be inefficient. Furthermore, the corresponding optimization problem can be more challenging as well; see the discussion in § 4 for more details. The angle-based method we introduce next overcomes these difficulties. We give some geometric explanations of how angle-based classifiers work for multicategory classification problems. Without the sum-to-zero constraint, the computational speed of our angle-based methods can be significantly faster than that of the regular methods.
To begin with, we define a specific simplex in a (k − 1)-dimensional space, as studied in Lange & Wu (2008). We propose large-margin classifiers using the simplex coding, while the classifiers introduced in Lange & Wu (2008) and Wu & Wu (2012) are regression-based methods. In this setting, a simplex is defined as a k-regular polyhedron in a (k − 1)-dimensional Euclidean space. For example, when k = 3, the simplex is an equilateral triangle in ℝ², and when k = 4, it is a regular tetrahedron in ℝ³. For a multicategory classification problem with k classes, define W as the collection of k vectors in ℝ^{k−1} with elements

Wj = (k − 1)^{−1/2} ζ  (j = 1),
Wj = −(1 + k^{1/2})/(k − 1)^{3/2} ζ + {k/(k − 1)}^{1/2} e_{j−1}  (2 ≤ j ≤ k),

where ζ is a vector of length k − 1 with every element equal to 1, and ej is a vector in ℝ^{k−1} whose elements are all 0, except that the jth element is 1. One can check that W = {W1, …, Wk} forms a simplex with k vertices in the (k − 1)-dimensional space. The centre of W is at the origin, and each Wj has Euclidean norm 1.
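As a concrete illustration, the R sketch below constructs the vertex set W from the displayed formula and numerically checks the stated properties (unit norms, centre at the origin, equal pairwise angles); the helper name simplex_vertices is ours.

```r
# Simplex coding: the rows of W are the k vertices W_1, ..., W_k in R^{k-1}.
simplex_vertices <- function(k) {
  W <- matrix(0, nrow = k, ncol = k - 1)
  W[1, ] <- 1 / sqrt(k - 1)
  for (j in 2:k) {
    W[j, ] <- -(1 + sqrt(k)) / (k - 1)^1.5 +
      sqrt(k / (k - 1)) * (seq_len(k - 1) == j - 1)
  }
  W
}

W <- simplex_vertices(3)
rowSums(W^2)   # each vertex has Euclidean norm 1
colSums(W)     # the vertices sum to the origin (numerically zero)
W %*% t(W)     # off-diagonal entries all equal -1/(k-1): equal pairwise angles
```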
The set {W1, …, Wk} defines k directions in ℝk−1, and the angles between any two directions are equal. Every vector in ℝk−1 generates k different angles with respect to {W1, …, Wk}. Assume that the angles are in [0, π]. Let Wj represent class j. For a fitted f̂, our method is to map x to f̂(x) ∈ ℝk−1, and predict ŷ to be the class whose corresponding angle is the smallest. In particular, we let ŷ = argminj ∠(Wj, f̂), where ∠(·, ·) denotes the angle between two vectors. Based on the prediction rule, we can split the space ℝk−1 into k disjoint classification regions, whose definition is given below, together with a boundary set.
Definition 1
A classification region with respect to class j, Cj (j = 1, …, k), is the subset of ℝk−1, such that if f̂ ∈ Cj, then ŷ = j.
The classification regions are closely connected to the prediction rule. In Fig. 1(a) we plot the classification regions for the angle-based method with k = 3. For a given vector f ∈ ℝ^{k−1}, ∠(Wj, f) is the smallest if and only if the projection of f on Wj is the largest for j = 1, …, k. In other words, ∠(Wj, f) being the smallest is equivalent to 〈f, Wj〉 > 〈f, Wl〉 for l ≠ j, where 〈x1, x2〉 = x1T x2 is the inner product of the vectors x1 and x2, and xT is the transpose of x. Moreover, the smaller ∠(Wj, f) is, the larger 〈f, Wj〉 is. Therefore, for any binary large-margin loss function ℓ(·), we propose to solve the following optimization problem for the angle-based classifier
min_{f ∈ F} n^{−1} ∑_{i=1}^{n} ℓ{〈f(xi), Wyi〉} + λJ(f).  (1)
By duality with convex ℓ and J, (1) is equivalent to the optimization problem
min_{f ∈ F} n^{−1} ∑_{i=1}^{n} ℓ{〈f(xi), Wyi〉},  (2)
subject to J(f) ≤ s. Our prediction rule is now equivalent to ŷ = argmaxj 〈f̂(x), Wj〉. The value 〈f(x), Wy〉, which is equivalent to the projected length of f(x) on Wy, can be viewed as a generalized functional margin of (x, y), and ℓ{〈f(x), Wy〉} measures the loss of assigning the label y to x, in terms of the inner product between f(x) and Wy. From this perspective, the loss function in (1) and (2) encourages ∠{Wyi, f(xi)} to be as small as possible, or equivalently, it encourages the projection of f(xi) on Wyi to be as large as possible. Notice that ∑_{j=1}^{k} 〈Wj, f〉 = 〈∑_{j=1}^{k} Wj, f〉 = 0, which means we implicitly transfer the sum-to-zero constraint on the classification function vector f in the regular simultaneous multicategory classification to our new functional margins 〈Wj, f〉. Hence, without an explicit constraint on the classification function f, we reduce the computational burden of the optimization problem. As we will see in §§ 5 and 6, it is much more efficient to compute the angle-based classification solution than the regular multicategory classifiers with k functions.
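A minimal sketch of the resulting prediction rule and generalized functional margin, reusing simplex_vertices from the sketch above; the function names are ours, and the input layout (fitted values stored as rows of a matrix) is an assumption made for illustration.

```r
# Angle-based prediction: y_hat = argmax_j <f_hat(x), W_j>, equivalently the
# class whose vertex forms the smallest angle with f_hat(x).
predict_angle <- function(F_hat, W) {
  # F_hat: n x (k-1) matrix whose ith row is f_hat(x_i); W: k x (k-1) vertices
  scores <- F_hat %*% t(W)                  # n x k matrix of inner products
  max.col(scores, ties.method = "first")    # predicted class labels in 1..k
}

# Generalized functional margin of (x_i, y_i): the projection of f_hat(x_i)
# onto the vertex of its own class y_i.
functional_margin <- function(F_hat, y, W) rowSums(F_hat * W[y, , drop = FALSE])
```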
The representation of the multicategory class label using the simplex structure has been used previously. Some special cases with certain loss functions have been studied using the simplex class coding, such as the ε-insensitive loss (Lange & Wu, 2008; Wu & Wu, 2012), boosting (Saberian & Vasconcelos, 2011) and support vector machines (Hill & Doucet, 2007). Our method is more general and covers general large-margin classifiers. The methods proposed by Saberian & Vasconcelos (2011) and Hill & Doucet (2007) can be viewed as special cases of the angle-based structure.
3. Statistical properties
3·1. Fisher consistency and theoretical minimizer
Fisher consistency is also known as classification calibration (Bartlett et al., 2006; Tewari & Bartlett, 2007), or infinite sample consistency (Zhang, 2004a). Intuitively, when the functional space F is large enough and we have infinitely many samples, the prediction rule for a Fisher consistent large-margin classifier should yield the minimum misclassification rate. It is a fundamental requirement of a classification loss function.
Define the class conditional probabilities Pj = pr(Y = j | X = x) (j = 1, …, k). Under our angle-based prediction rule, Fisher consistency requires that for a given x such that Py > Pj (j ≠ y) for some y ∈ {1, …, k}, the vector f*(x) that minimizes E[ℓ{〈f(X), WY〉} | X = x] satisfies that y = argmaxj 〈f*(x), Wj〉 and such an argmax is unique. In other words, Fisher consistency requires that f* ∈ Cy, where f* is the theoretical minimizer. Next we show that the angle-based method is Fisher consistent if the loss ℓ has a negative derivative function.
Theorem 1
The angle-based classification loss function ℓ in (1) and (2) is Fisher consistent if the derivative ℓ′ exists and ℓ′(x) < 0 for all x.
The logistic loss, the exponential loss, and the large-margin unified machine family with c < ∞ meet the conditions of Theorem 1, and thus are all Fisher consistent. For the support vector machine, the hinge loss is not differentiable and hence does not satisfy the condition of Theorem 1. As the large-margin unified machine family includes the hinge loss as a special case with c → ∞, we show that the multicategory angle-based support vector machine is not Fisher consistent, by studying the theoretical minimizer f* of the entire large-margin unified machine family. The next theorem gives the explicit expression of f* in terms of Pj (j = 1, …, k).
Theorem 2
Consider the angle-based method with ℓ in the large-margin unified machine family. Assume that P1 > ⋯ > Pk > 0. The theoretical minimizer f* satisfies

〈Wj, f*〉 = (c + 1)^{−1}[c + a{(Pj/Pk)^{1/(a+1)} − 1}]  (j = 1, …, k − 1)

for a ∈ (0, ∞) and c ∈ [0, ∞]. Hence 〈Wk, f*〉 = −∑_{j=1}^{k−1} 〈Wj, f*〉. For a = ∞ and c ∈ [0, +∞], we have that

〈Wj, f*〉 = (c + 1)^{−1}{c + log(Pj/Pk)}  (j = 1, …, k − 1).

In this case, 〈Wk, f*〉 = −∑_{j=1}^{k−1} 〈Wj, f*〉 as well.
For any fixed (P1, …, Pk) in Theorem 2 and a < ∞, we have 〈W1, f*〉 − 〈W2, f*〉 = a{(P1/Pk)^{1/(a+1)} − (P2/Pk)^{1/(a+1)}}/(c + 1). As c increases, the difference in the functional margins 〈W1, f*〉 − 〈W2, f*〉 decreases, and consequently the difference between ∠(W1, f*) and ∠(W2, f*) decreases. Hence, one can verify that as c increases, the theoretical minimizer f* stays in C1 and moves closer to the boundary separating different classification regions. When c → ∞ with ℓ the hinge loss, the theoretical minimizer f* satisfies 〈Wj, f*〉 = 1 (j = 1, …, k − 1) and 〈Wk, f*〉 = −k + 1. Therefore, f* is on the boundary set, and consequently the multicategory angle-based support vector machine is not Fisher consistent. See Fig. 1(b) for an illustration. The situation is similar with a → ∞. To overcome this difficulty, we propose to approximate the hinge loss with a large-margin unified machine loss function with large c. Under the angle-based structure, we call this large-margin unified machine loss with a large but finite c the approximate support vector machine loss. By Theorem 2, the angle-based support vector machine using the approximate support vector machine loss is Fisher consistent.
In Theorem 2 we assume that P1 > ⋯ > Pk > 0 only for simplicity of expression. When Pi and Pi+1 are the same, the theoretical minimizer is not unique. When Pk = 0, f* is unbounded. We discuss this further in the Supplementary Material.
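To visualize this behaviour numerically, the sketch below computes f* directly by minimizing ∑j Pj ℓ{〈Wj, f〉} over f ∈ ℝ^{k−1} with a general-purpose optimizer, reusing lum_loss and simplex_vertices from the sketches in § 2; the probabilities match Fig. 1(b), and the optimizer settings are illustrative choices.

```r
# Numerical theoretical minimizer for the angle-based risk:
#   f* = argmin_f sum_j P_j * lum_loss(<W_j, f>, a, c).
theoretical_minimizer <- function(P, W, a = 1, c = 1) {
  risk <- function(f) sum(P * lum_loss(as.vector(W %*% f), a = a, c = c))
  optim(rep(0, ncol(W)), risk, method = "BFGS")$par
}

P <- c(0.55, 0.25, 0.20)      # class conditional probabilities, as in Fig. 1(b)
W <- simplex_vertices(3)
for (c_val in c(0, 1, 5, 50)) {
  f_star <- theoretical_minimizer(P, W, a = 1, c = c_val)
  cat("c =", c_val, " margins <W_j, f*>:",
      round(as.vector(W %*% f_star), 3), "\n")
}
# The largest margin stays with class 1, so f* remains in C_1, while the gap
# between classes 1 and 2 shrinks as c increases, pushing f* towards the boundary.
```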
3·2. Class conditional probability estimation
Recall the definition of class conditional probability Pj (j = 1, …, k) in § 3·1. In practice, after we build the classifier, it is important to estimate Pj accordingly (Wang et al., 2008; Wu et al., 2010). In this section, we explore the relationship between the theoretical minimizer f* and the probability Pj. In practice, after obtaining the solution f̂ to (1) or (2), one can replace f* by f̂ to estimate Pj, as in the next theorem.
Theorem 3
In the angle-based classification structure, suppose the loss function ℓ is differentiable. For a given class conditional probability vector (P1, …, Pk) and its corresponding theoretical minimizer f*, the class probabilities can be expressed as Pj = [ℓ′{〈Wj, f*〉}]^{−1} / ∑_{i=1}^{k} [ℓ′{〈Wi, f*〉}]^{−1} (j = 1, …, k).
From Theorem 3, given the fitted f̂, the estimated probabilities are P̂j = [ℓ′{〈Wj, f̂〉}]^{−1} / ∑_{i=1}^{k} [ℓ′{〈Wi, f̂〉}]^{−1} (j = 1, …, k). If ℓ has a negative derivative ℓ′, as in Theorem 1, one can verify that the estimated class conditional probabilities are proper, in the sense that 0 ≤ P̂j ≤ 1 (j = 1, …, k) and ∑_{j=1}^{k} P̂j = 1.
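A sketch of this estimator for the large-margin unified machine loss is given below; the derivative formula follows from the loss form assumed in the § 2·1 sketch, and estimate_prob implements the normalization of Theorem 3.

```r
# Derivative of the large-margin unified machine loss (form assumed from the
# sketch in Section 2.1): -1 below the threshold c/(1+c), and
# -(a / ((1+c)u - c + a))^(a+1) above it.
lum_deriv <- function(u, a = 1, c = 1) {
  out <- rep(-1, length(u))
  idx <- u >= c / (1 + c)
  out[idx] <- -(a / ((1 + c) * u[idx] - c + a))^(a + 1)
  out
}

# Class probability estimates from a fitted f_hat in R^{k-1}, following
# Theorem 3: P_hat_j is proportional to 1 / l'(<W_j, f_hat>).
estimate_prob <- function(f_hat, W, a = 1, c = 1) {
  inv <- 1 / lum_deriv(as.vector(W %*% f_hat), a = a, c = c)
  inv / sum(inv)
}

# Example: at the theoretical minimizer the estimates approximately recover P,
# e.g. estimate_prob(theoretical_minimizer(P, W, c = 1), W, c = 1).
```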
In binary classification problems, it is known that the support vector machine does not provide class conditional probability estimation (Wang et al., 2008), because the hinge loss is not differentiable at 1. Liu et al. (2011) showed that when c → ∞, the estimated pr(Y = 1 | X = x) becomes a step function, and thus cannot provide specific class conditional probability information.
In the multicategory case, probability estimation becomes more involved because we have a classification function vector instead of a single classification function. The probability estimation formula derived from Theorem 3 depends on f̂ only through the derivative function ℓ′. For i ≠ j, P̂i = P̂j if ℓ′(〈Wi, f̂〉) = ℓ′(〈Wj, f̂〉). For the large-margin unified machine loss, ℓ′(u) = −1 for u ≤ c/(1 + c). Thus, if f̂ satisfies that 〈Wi, f̂〉 ≤ c/(1 + c) and 〈Wj, f̂〉 ≤ c/(1 + c), then the estimated class conditional probabilities for classes i and j are the same. As c → ∞, c/(1 + c) → 1, and the estimated probabilities are all 1/k for f̂ with a norm small enough. For i ≠ j, when 〈Wj, f̂〉 is large and 〈Wi, f̂〉 is small, we have P̂j = 1 and P̂i = 0. We show this phenomenon in Fig. 2 with k = 3. We see that in the multicategory angle-based large-margin unified machine case, as c increases, the probability function becomes closer to a step function and thus the accuracy of probability estimation deteriorates. Our angle-based support vector machine defined in § 3·1 provides some probability information, while the angle-based classifier with the hinge loss does not, due to its nondifferentiability.
Fig. 2.
Visualization of the relationship between 〈W1, f̂〉 and P1, with k = 3, a = 1, and c ∈ {0, 1, 5} and c → ∞. As c increases, the function becomes closer to a step function, with more probability information lost.
3·3. Asymptotic results
In this section, we extend the notions of excess ℓ-risk and excess risk used by Bartlett et al. (2006) from the binary to the multicategory case, and study their convergence rates as the sample size n → ∞. We focus on linear learning with a diverging number of covariates p, under the form (2). The penalty J(f) we consider is a linear combination of the L1 and L2 penalties, similar to the elastic net penalty (Zou & Hastie, 2005). For simplicity, we assume that the intercepts are included in J(f) by adding a variable to x whose value is always 1. In this case, we can write J(f) = ∑_{j=1}^{k−1} {α‖βj‖_1 + (1 − α)‖βj‖_2^2}, where β1, …, βk−1 are vectors of length p ≥ 2 that include the intercepts. Here ‖·‖_1 and ‖·‖_2 are the regular L1 norm and L2 norm of a vector, and the parameter α ∈ [0, 1] controls the relative proportion of the L1 and L2 penalties.
The L1 penalty is well known for its variable selection property, and we show that the convergence rates of the excess ℓ-risk for α > 0 and α = 0 are different. In particular, we show that when α > 0, we have convergence provided p = o{exp(n)}, and when α = 0, the convergence requires that p = o(n). Furthermore, we study the convergence rate of β̂j to its best value, under the uniform metric. We then study the convergence rate of the excess risk using a conversion condition. Because we introduce some assumptions for the theory, we give an illustrative example in the Supplementary Material, where the assumptions are verified and the convergence rates are obtained.
For the remainder of § 3, we assume that the number of covariates including the intercept is p = pn, and the penalty is J(f) ≤ s = sn, where both pn and sn may depend on n. We assume that each coordinate x(l) ∈ [0, 1] (l = 1, …, pn), and the loss function ℓ is Lipschitz with Lipschitz constant 1. A function ℓ(·) is said to be Lipschitz with constant γ if for any x1 and x2 in its domain, |ℓ(x1) − ℓ(x2)| ≤ γ‖x1 − x2‖, where ‖x1 − x2‖ is some fixed norm in the domain of ℓ. Our theory can be analogously extended to other bounded domains and Lipschitz functions with different Lipschitz constants. Because pn may become unbounded as n → ∞, we assume that the underlying distribution P(X, Y) is defined on ([0, 1]∞ × {1, …, k}, σ∞([0, 1]∞) × 2{1,…,k}), with σ∞([0, 1]∞) being the σ-field generated by the open balls introduced by the uniform metric d∞(x1, x2) = sup_{l ≥ 1} |x1(l) − x2(l)|.
Before stating our theory, we first introduce some notation and definitions. For linear learning, we have that fj(x) = βjT x (j = 1, …, k − 1). The intercepts are included in the βj's, as discussed above. Let F(pn, sn) = {f : fj(x) = βjT x (j = 1, …, k − 1), J(f) ≤ sn}, and let F(pn) = ∪_{0 ≤ sn < ∞} F(pn, sn) be the full pn-dimensional model. Suppose f̂ is the empirical minimizer of the optimization problem (2), and f(pn) = arginf_{f ∈ F(pn)} E[ℓ{〈WY, f(X)〉}] represents the best model under the full pn-dimensional model. Here f(pn) may not belong to F(pn).
Let eℓ(f, f(pn)) = E{ℓ(〈WY, f〉)} − E{ℓ(〈WY, f(pn)〉)}. We call eℓ(f, f(pn)) the excess ℓ-risk, and name e(f, f(pn)) = E{L(Y, f)} − E{L(Y, f(pn))} the excess risk, where L is the 0–1 loss. Hence E{L(Y, f)} is the generalization error of f. Our main result is as follows.
Theorem 4
Assume τn = {log(pn)/n}^{1/2} → 0 as n → ∞. If α > 0, then eℓ(f̂, f(pn)) = O(snτn/α) + dn, almost surely under P. If α = 0, then eℓ(f̂, f(pn)) = O{(pnsn)^{1/2}τn} + dn, almost surely under P. Here dn = inf_{f ∈ F(pn,sn)} eℓ(f, f(pn)) is the approximation error between F(pn, sn) and F(pn).
In Theorem 4, the balance between the approximation error and the estimation error is represented by dn and sn. In particular, as sn increases, the function class F(pn, sn) becomes larger, the approximation error dn decreases and the estimation error increases. In this view, the best tradeoff takes place when dn is of the same order as snτn/α for α > 0, and of the same order as (pnsn)^{1/2}τn for α = 0. If the model depends on finitely many predictors, we have a simpler version of Theorem 4.
Assumption 1
There exists a finite s*, such that f(pn) ∈ F(pn, s*) for all pn.
Under Assumption 1, dn is strictly zero for large enough n and sn, leading to Corollary 1.
Corollary 1
When Assumption 1 is met, we have sn = s* for all large n. Consequently, if α > 0, then eℓ(f̂, f(pn)) = O(s*τn/α), almost surely under P. If α = 0, then eℓ(f̂, f(pn)) = O{(pns*)^{1/2}τn}, almost surely under P.
There are important differences between the cases α > 0 and α = 0. For the former case, the convergence of eℓ(f̂, f(pn)) requires that τn = {log(pn)/n}^{1/2} converges to zero, or in other words, that pn grows no faster than exp(n). In the pure L2 penalty case where α = 0, the convergence of eℓ(f̂, f(pn)) requires that pn = o(n), which is a much stronger assumption on the divergence speed of pn. Theorem 4 and Corollary 1 shed some light on the effectiveness of the L1 penalty when the true model contains many noise covariates.
We have established the convergence rate for the excess ℓ-risk. As we focus on linear learning, the convergence rates for the estimated parameters β̂1, …, β̂k−1 are of interest as well. Next, we explore the convergence rate of B̂ − B(pn), where B = (β1, …, βk−1) is a pn by k − 1 matrix whose columns are the parameters, B̂ = (β̂1, …, β̂k−1) is the estimated parameter matrix, and B(pn) is the matrix whose columns are the parameters of the best model f(pn) under the full pn-dimensional model. In this section we assume that pn depends on n and can go to infinity, hence we study the convergence rate with respect to the uniform metric, d(B1, B2) = sup_{i,j} |(B1)_{i,j} − (B2)_{i,j}|, where B_{i,j} is the (i, j) element of the matrix B. First we introduce an assumption that is valid for many applications. For this assumption, the notation X does not include the intercept term.
Assumption 2
The marginal distribution of X is absolutely continuous with respect to the Lebesgue measure on [0, 1]∞.
Under Assumption 2, we study the convergence rate of d(B̂, B(pn)) in the next theorem.
Theorem 5
Suppose ℓ is differentiable and convex, and Assumption 2 holds. We have: (a) if the theoretical minimizer f* ∈ F(pn) for some pn, then d(B̂, B(pn)) = O{(snτn/α)^{1/2}} for α > 0 and d(B̂, B(pn)) = O[{(pnsn)^{1/2}τn}^{1/2}] for α = 0, almost surely under P; (b) if f* ∉ F(∞) and Assumption 1 holds, then d(B̂, B(pn)) = O(snτn/α) for α > 0 and d(B̂, B(pn)) = O{(pnsn)^{1/2}τn} for α = 0, almost surely under P.
Theorem 5 indicates that the convergence rate of B̂ to B(pn) depends on whether the function class F is large enough. In practice, the situation that f* ∈ F(pn) for some pn rarely happens, and the convergence rate of B̂ to B(pn) is the same as the excess ℓ-risk for most of the cases, if the true model is indeed sparse. We give an illustrative example in the Supplementary Material, where f* ∈ F(3), and the convergence rate of B̂ to B(pn) is slower than that of the excess ℓ-risk.
We have studied the convergence rate of the excess ℓ-risk. In practice, the convergence rate of the excess risk is also of interest, as it measures whether the classification accuracy of a given model converges to the best possible accuracy. In other words, we are interested in the consistency and convergence rate of the excess risk. Next, we establish a relationship between eℓ(f̂, f(pn)) and |e(f̂, f(pn))|, where e(f̂, f(pn)) is the difference between the misclassification rates of f̂ and f(pn). The technique of studying the excess risk through the excess ℓ-risk in the binary case was previously used in Zhang (2004b) and Bartlett et al. (2006). Define the L2 metric on F as d2(f1, f2) = [E{‖f1(X) − f2(X)‖_2^2}]^{1/2}. The following conversion assumption controls the behaviours of the excess ℓ-risk and the excess risk in a small neighbourhood of f(pn), introduced by the L2 metric. It was previously used in Wang & Shen (2007) and Shen & Wang (2007).
Assumption 3
There exist constants γ1 ≥ 1 and γ2 > 0 such that for all small ε > 0 and all f ∈ F(pn) with d2(f, f(pn)) ≤ ε,

|e(f, f(pn))| ≤ ξ1(pn){d2(f, f(pn))}^{γ1},  (3)

eℓ(f, f(pn)) ≥ ξ2(pn){d2(f, f(pn))}^{γ2},  (4)

where ξ1(pn) and ξ2(pn) may depend on pn.
Corollary 2
Under Assumption 3, assume that τn → 0 as n → ∞. Then |e(f̂, f(pn))| = O[ξ1(pn){ξ2(pn)^{−1}(snτn/α + dn)}^{γ1/γ2}], almost surely under P if α > 0, and |e(f̂, f(pn))| = O[ξ1(pn){ξ2(pn)^{−1}((pnsn)^{1/2}τn + dn)}^{γ1/γ2}], almost surely under P if α = 0. Moreover, suppose Assumption 1 is also met. If α > 0, then |e(f̂, f(pn))| = O[ξ1(pn){ξ2(pn)^{−1}s*τn/α}^{γ1/γ2}], almost surely under P; if α = 0, then |e(f̂, f(pn))| = O[ξ1(pn){ξ2(pn)^{−1}(pns*)^{1/2}τn}^{γ1/γ2}], almost surely under P.
In the Supplementary Material, we give an example in which Assumptions 1, 2 and 3 are satisfied, and γ1, γ2 can be calculated explicitly.
3·4. Finite sample error bound
We use the surrogate loss ℓ as an upper bound of the 0–1 loss to make the optimization problem tractable and the corresponding theoretical analysis easier to handle. We saw that Assumption 3 builds a relationship between the excess ℓ-risk and the excess risk, so that the convergence rate of the excess risk can be established from that of the excess ℓ-risk. In this section, we study the finite sample bound on the expectation of the ℓ-risk given f̂, namely E{ℓ(〈WY, f̂〉)}, which can be regarded as a bound on the misclassification rate of f̂.
Theorem 6
The solution f̂ to (2) satisfies that, with probability at least 1 − δ,

E{ℓ(〈WY, f̂〉)} ≤ n^{−1} ∑_{i=1}^{n} ℓ{〈Wyi, f̂(xi)〉} + B(n, pn, sn, α, δ),

where B(n, pn, sn, α, δ) = (sn/α)(A1 + A2) if α > 0 and B(n, pn, sn, α, δ) = (pnsn)^{1/2}(A1 + A2) if α = 0,

with A1 = 6{log(2/δ)/n}^{1/2} and A2 = 4n^{−1/4}[log{(e + 2epnk − 2epn)n^{−1/2}}]^{1/2}.
Theorem 6 gives an upper bound on E{ℓ(〈WY, f̂〉)} that depends only on n, sn, pn, α and the training sample, so it is directly computable from the data and the model we choose.
Remark 1
When solving the optimization (1), to calculate the result in Theorem 6, one may replace sn/α by ∑_{j=1}^{k−1} ‖β̂j‖_1 if α > 0, and replace pnsn by pn ∑_{j=1}^{k−1} ‖β̂j‖_2^2 if α = 0, where the β̂j are computed from the solution to (1).
3·5. Comparison to existing methods with k classification functions
In this section, we provide some theoretical insight on the comparison between our angle-based method and regular multicategory large-margin classifiers using k classification functions with the sum-to-zero constraint. We show that if the true signal in linear learning is sparse, then our angle-based method can enjoy a smaller estimation error, with the approximation error fixed at zero. Consequently, our method can give more accurate prediction.
We focus on comparing the complexity of the functional classes of the corresponding optimization problems. To illustrate the idea, we use the method in Lee et al. (2004) as an example of the regular simultaneous methods. The proofs and conclusions are analogous for many other simultaneous classifiers. To begin with, we introduce some notation. Let t(pn, sn) = sn/α if α > 0 and (pnsn)^{1/2} if α = 0. Recall the definition of F(pn, sn) in § 3·3. For our angle-based method, let f(pn,sn) = argmin_{f ∈ F(pn,sn)} E{ℓ(〈f, WY〉)}. Define hf(·) = {2t(pn, sn)}^{−1}{ℓ(〈f, W·〉) − ℓ(〈f(pn,sn), W·〉)}, and let H̄ = {hf : f ∈ F(pn, sn)}. For the multicategory support vector machine of Lee et al. (2004), define G(pn, sn) as the counterpart of F(pn, sn) that uses k linear classification functions with the sum-to-zero constraint. Let fL(pn,sn) = argmin_{f ∈ G(pn,sn)} E{L̄(Y, f)}, where L̄ is the multicategory support vector machine loss used by Lee et al. (2004). Analogously define hL,f(·) = {2t(pn, sn)}^{−1}{L̄(·, f) − L̄(·, fL(pn,sn))}, and H̄L = {hL,f : f ∈ G(pn, sn)}. The next proposition provides the comparison between H̄ and H̄L in terms of their uniform covering numbers N(ε, ·). More details about uniform covering numbers are provided in the Supplementary Material.
Proposition 1
For positive ε small enough, log{N(ε, H̄)} is bounded above by (2/ε^2) log[e + e{2pn(k − 1)}ε^2] + log(k − 1), and log{N(ε, H̄L)} is bounded above by (2/ε^2) log{e + e(2pnk)ε^2} + log(k).
From Proposition 1 and its proof, we can conclude that for fixed α, the upper bound of N(ε, H̄) of the angle-based methods is smaller than that of N(ε, H̄L) with k functions, because our angle-based classifiers use only k − 1 classification functions. In the Supplementary Material, we give an example where the upper bound of N(ε, H̄L) is almost tight. Furthermore, assume the true classification signal depends only on a finite number of predictors. In order to have the approximation error dn = 0, one can verify that our angle-based method requires a smaller sn compared to the multicategory support vector machines in Lee et al. (2004). As a result, for a classification problem with sparse signal, our angle-based method can have a functional class smaller than that of Lee et al. (2004). Consequently, the angle-based method can have a smaller estimation error, which can lead to a better classification performance. See the proof of Theorem 4 in the Supplementary Material for more details on how the covering number affects the estimation error. Intuitively, the regular simultaneous classifiers with k functions can introduce extra variability in the estimated classifier, which can reduce the corresponding classification performance. Our angle-based method with k − 1 functions circumvents this difficulty and is more efficient.
As a remark, we point out that from Proposition 1, the difference of the uniform covering numbers becomes larger as the dimensionality pn increases. This suggests that the difference in classification performance between our angle-based methods and the regular multicategory large-margin classifiers can be large for high dimensional problems. We confirm this finding in § 5. In particular, we observe that for a classification problem with fixed and sparse signal, the difference in classification performance increases when the dimensionality increases.
4. Computational algorithms and tuning procedures
For a convex surrogate loss ℓ and a convex penalty term J(f), (1) and (2) are convex problems and can be solved by many standard optimization tools (Boyd & Vandenberghe, 2004). For instance, with the squared loss ℓ(u) = (1 − u)^2 and the L2 penalty, there exist explicit solutions to the problem (1), or one can obtain an equivalent solution by solving the Karush–Kuhn–Tucker conditions generated from (2). If one applies the generalized hinge loss with a reject option (Bartlett & Wegkamp, 2008) and the L1 penalty, then linear programming can be used to solve (2). In this section, we demonstrate how to implement (1) using the large-margin unified machine loss with finite c and the L2 penalty, by the coordinate descent algorithm (Friedman et al., 2010).
For simplicity we assume the intercepts are included in the parameters, as in § 3·3. Let the estimated parameter matrix at the mth step be B̂(m) = (β̂1(m), …, β̂k−1(m)). For clarification, the notation B̂(m) is different from B(pn) defined in § 3·3. At the (m + 1)th step, suppose we update the lth element of βj, denoted by βj,l, with all the other parameters fixed at their current values. For z ∈ ℝ, i = 1, …, n, j = 1, …, k − 1 and l = 1, …, p, let gi,j,l(z) denote the functional margin 〈f(xi), Wyi〉 evaluated with βj,l replaced by z and all the other parameters held at their most recently updated values, so that gi,j,l(z) is a linear function of z. Set

β̂j,l(m+1) = argmin_{z ∈ ℝ} n^{−1} ∑_{i=1}^{n} ℓ{gi,j,l(z)} + λz^2.  (5)

The optimization (5) is a one-dimensional problem, and so can be solved efficiently. We update all the elements of B̂ in this manner until convergence.
We summarize the above coordinate descent algorithm in Algorithm 1.
Algorithm 1.
| Step 1. | Initialize the algorithm with β̂j = (0, …, 0)T, for all j = 1, …, k − 1. |
| Step 2. | For the mth loop, with the current solution B̂(m−1) given, update the elements of β̂1 sequentially. After β̂1 has been updated, update β̂2, …, β̂k−1 in the same manner. |
| Step 3. | Repeat Step 2 until convergence. |
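The sketch below implements Algorithm 1 in R for linear learning with the large-margin unified machine loss and the L2 penalty, solving each one-dimensional problem (5) numerically with optimize(); it reuses simplex_vertices and lum_loss from the § 2 sketches, and the search interval, iteration limit and tolerance are illustrative choices rather than part of the algorithm.

```r
# Coordinate descent for the angle-based classifier (1) with linear learning:
#   min over B of  n^{-1} sum_i lum_loss(<B^T x_i, W_{y_i}>) + lambda * sum(B^2),
# where B is p x (k-1) and X already contains an intercept column.
angle_cd <- function(X, y, k, lambda, a = 1, c = 1, max_iter = 100, tol = 1e-6) {
  p <- ncol(X)
  W <- simplex_vertices(k)
  B <- matrix(0, nrow = p, ncol = k - 1)
  margins <- rowSums((X %*% B) * W[y, , drop = FALSE])   # <f(x_i), W_{y_i}>
  obj <- function(mg, Bmat) mean(lum_loss(mg, a, c)) + lambda * sum(Bmat^2)
  obj_old <- obj(margins, B)
  for (iter in seq_len(max_iter)) {
    for (j in seq_len(k - 1)) {
      for (l in seq_len(p)) {
        w <- W[y, j] * X[, l]          # coefficient of B[l, j] in each margin
        r <- margins - w * B[l, j]     # margins with B[l, j] removed
        one_dim <- function(z) mean(lum_loss(r + w * z, a, c)) + lambda * z^2
        z_new <- optimize(one_dim, interval = c(-20, 20))$minimum   # problem (5)
        margins <- r + w * z_new
        B[l, j] <- z_new
      }
    }
    obj_new <- obj(margins, B)
    if (abs(obj_old - obj_new) < tol) break
    obj_old <- obj_new
  }
  B
}
```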
For the regular classifiers with the sum-to-zero constraint, one can remove the constraint and reparameterize one function as the negative sum of the others. This can be computationally inefficient compared to the angle-based method. We take the coordinate descent algorithm as an example. Without loss of generality we assume that fk is reparameterized. To update a parameter in f1, one needs to involve all the terms that depend on f1 and fk to calculate the next updated value. This can be much slower than the angle-based method, which requires only the terms that depend on f1. For other optimization tools, such as the linear or quadratic programming for the simultaneous support vector machines (Lee et al., 2004; Wang & Shen, 2007; Liu & Yuan, 2011), reparameterizing can be even slower than keeping the sum-to-zero constraint in the optimization. Therefore, our angle-based method can be much faster.
In practice, a proper choice of the tuning parameter λ or s is crucial for the accuracy of our classifier. Different tuning parameters correspond to different prediction models. There are various tuning techniques in the literature. Here, we briefly discuss the crossvalidation procedure, which is commonly used in practice. For the crossvalidation approach, one partitions the training dataset into q subsets of observations whose sizes are roughly the same. Each time a single subset of observations is used for tuning, and the remaining q − 1 subsets are used for training. One repeats the process q times, so that each subset is used once for tuning, and the q prediction results are combined to select the best tuning parameter.
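As an illustration, a generic q-fold crossvalidation wrapper might look as follows; fit_fun and predict_fun are hypothetical placeholders for a fitting routine (for example, angle_cd above) and its associated prediction rule.

```r
# q-fold crossvalidation over a grid of tuning parameters; fit_fun(X, y, lambda)
# and predict_fun(model, X) are placeholders supplied by the user.
cv_tune <- function(X, y, lambdas, q = 5, fit_fun, predict_fun) {
  n <- nrow(X)
  fold <- sample(rep(seq_len(q), length.out = n))       # random fold labels
  cv_error <- sapply(lambdas, function(lam) {
    mean(sapply(seq_len(q), function(v) {
      fit <- fit_fun(X[fold != v, , drop = FALSE], y[fold != v], lam)
      pred <- predict_fun(fit, X[fold == v, , drop = FALSE])
      mean(pred != y[fold == v])                        # misclassification rate
    }))
  })
  lambdas[which.min(cv_error)]                          # best tuning parameter
}
```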
5. Simulation examples
In this section, we use five simulated examples to explore the performance of our angle-based method. For each example, we apply angle-based penalized logistic regression, the angle-based support vector machine approximated by a large-margin unified machine with large c, and the tuned angle-based large-margin unified machine (Liu et al., 2011). We also implement other existing methods with the sum-to-zero constraint, including several multicategory support vector machine formulations (Vapnik, 1998; Crammer & Singer, 2001; Lee et al., 2004) and the penalized logistic regression proposed by Zhu & Hastie (2005). We show that our angle-based classifiers can give more accurate classification, and that their computational cost can be significantly smaller than that of these alternative methods.
In the first four simulated examples, we generate datasets whose signal depends linearly on a few covariates, and we add additional covariates that are pure noise variables. The total dimensions are set to be 10, 50, 100 and 200, and we observe the pattern of change in classification performance. For the fifth example, we perform Gaussian kernel learning. Instead of letting the dimensionality grow, we fix the dimension of this example and let the number of observations increase.
For classification performance, we compare the test error rates and probability estimation using the mean absolute error, E(|p − p̂|), on a test set of size 10^5. By Theorem 3, our angle-based approximate support vector machine can provide class conditional probability estimation. Within each replication, the model is built on a training dataset, and the optimal tuning parameter is chosen by minimizing the classification error rate over an independent tuning set with a grid search. For all the classifiers, including our angle-based methods and the existing ones, we choose the L2 penalty as the regularization term J(f). Through 1000 replications, we record the total computational time for training a model, tuning it over 30 different values of the tuning parameter, and making predictions on the test set. We report the average time in seconds as a measurement of speed. All simulations are done using R (R Development Core Team, 2014), on a 2·80 GHz Intel processor. We report the simulation settings and results of Examples 2 to 5 in the Supplementary Material.
Example 1
We generate a three-class dataset, where pr(Y = j) = 1/3 and (X | Y = j) ~ N(μj, σ^2 I2) (j = 1, 2, 3). Here the μj are equally spaced on the unit circle, and σ is chosen such that the Bayes error is 0·1. The noise covariates are independent and identically distributed as N(0, 0·5). Both the training and tuning datasets are of size 100.
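A generator in the spirit of Example 1 is sketched below; σ is left as an argument rather than calibrated to a Bayes error of 0·1, the noise law N(0, 0·5) is interpreted as having variance 0·5, and the function name is ours.

```r
# Three classes with means equally spaced on the unit circle, two signal
# covariates, and p_noise pure-noise covariates (total dimension 2 + p_noise).
gen_example1 <- function(n, p_noise = 8, sigma = 0.5) {
  y <- sample(1:3, n, replace = TRUE)                       # pr(Y = j) = 1/3
  mu <- cbind(cos(2 * pi * (1:3) / 3), sin(2 * pi * (1:3) / 3))
  signal <- mu[y, ] + matrix(rnorm(2 * n, sd = sigma), n, 2)
  noise <- matrix(rnorm(n * p_noise, sd = sqrt(0.5)), n, p_noise)
  list(x = cbind(signal, noise), y = y)
}
```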
Table 1 reports the behaviours of the classifiers. The angle-based classifiers have smaller error rates overall, while the tuned angle-based large-margin unified machine works best. For the four linear learning examples, the difference in classification accuracy between angle-based methods and the regular classifiers is small when the dimension is low. When the dimension gets higher, the difference becomes larger. This is consistent with the theoretical insights provided in § 3·5. For the kernel learning example, the angle-based classifiers are also very competitive. For all the examples, the computational costs of the angle-based methods are significantly less than those of the other methods. In terms of probability estimation, the approaches of Vapnik (1998), Crammer & Singer (2001) and Lee et al. (2004) do not provide class conditional probability information. In contrast, our angle-based methods can estimate class conditional probabilities, and yield more accurate results than the penalized logistic classifier.
Table 1.
Summary of the classification results for Example 1
| | | Dimension 10 | | | Dimension 50 | | | Dimension 100 | | | Dimension 200 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Example 1 | | Error | Prob | Time | Error | Prob | Time | Error | Prob | Time | Error | Prob | Time |
| Existing methods | SVM1 | 12·5 | NA | 14 | 24·9 | NA | 20 | 34·4 | NA | 27 | 43·3 | NA | 37 |
| | SVM2 | 12·3 | NA | 12 | 24·6 | NA | 21 | 34·2 | NA | 28 | 43·3 | NA | 37 |
| | SVM3 | 12·6 | NA | 13 | 25·1 | NA | 24 | 34·4 | NA | 27 | 42·7 | NA | 41 |
| | Logi | 12·0 | 9·2 | 16 | 24·4 | 16·2 | 21 | 34·0 | 17·7 | 30 | 41·2 | 19·5 | 39 |
| Angle based | ALogi | 11·2 | 7·3 | 7 | 17·8 | 10·1 | 9 | 23·0 | 12·0 | 12 | 33·3 | 12·9 | 17 |
| | ASVM | 11·9 | 12·9 | 9 | 19·0 | 13·7 | 11 | 24·6 | 15·5 | 16 | 34·5 | 16·2 | 22 |
| | ALUM | 11·3 | 7·3 | 15 | 17·7 | 11·2 | 19 | 22·9 | 11·7 | 27 | 33·1 | 12·9 | 38 |
The standard deviations of the mean error rates range from 0·04 to 0·23. SVM1, support vector machines of Vapnik (1998); SVM2, support vector machines of Crammer & Singer (2001); SVM3, support vector machines of Lee et al. (2004); Logi, multicategory logistic regression of Zhu & Hastie (2005); ALogi, angle-based logistic regression; ASVM, angle-based support vector machine using approximated hinge loss; ALUM, angle-based tuned large-margin unified machines of Liu et al. (2011); Error, classification error percentage; Prob, the mean percentage of absolute error on class conditional probability estimation.
6. Real data examples
In this section, we demonstrate the performance of our angle-based classifiers on four datasets: CNAE-9, Semeion Handwritten Digit and Vehicle from the University of California Irvine machine learning repository, and a recent Glioblastoma Multiforme Cancer gene expression dataset. See the Supplementary Material for more information on the Glioblastoma Multiforme Cancer data. To select the best tuning parameter, we split each dataset into six groups of observations whose sizes are roughly the same, choose one group as the testing data, and perform 5-fold crossvalidation on the remaining observations. We perform linear learning on the CNAE-9, Semeion Handwritten Digit and Glioblastoma data, and we use second-order polynomial kernel learning for the Vehicle data. To reduce the computational cost for the Glioblastoma dataset, we choose the 1000 genes with the largest median absolute deviation values, computed on the training sample within each replication.
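The gene screening step can be coded in a few lines, as sketched below; the code assumes the expression matrix stores genes in columns and uses R's built-in mad function.

```r
# Keep the 1000 genes with the largest median absolute deviation, computed on
# the training sample only, and apply the same selection to the test sample.
top_mad_genes <- function(X_train, n_keep = 1000) {
  order(apply(X_train, 2, mad), decreasing = TRUE)[seq_len(n_keep)]
}
# idx <- top_mad_genes(X_train); X_train <- X_train[, idx]; X_test <- X_test[, idx]
```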
The results are reported in Table 2. The classification accuracy of our angle-based methods is better than that of the other methods, and the angle-based tuned large-margin unified machine loss works the best overall. Furthermore, the computational time for the angle-based methods is considerably shorter, consistent with the simulation results.
Table 2.
Summary of the classification results for the real datasets
| | | CNAE-9 | | Semeion | | Vehicle | | Glioblastoma | |
|---|---|---|---|---|---|---|---|---|---|
| | | Error | Time | Error | Time | Error | Time | Error | Time |
| Existing methods | SVM1 | 14·1 | 241 | 14·0 | 110 | 25·4 | 122 | 20·3 | 877 |
| | SVM2 | 13·6 | 266 | 14·2 | 120 | 24·9 | 116 | 20·1 | 910 |
| | SVM3 | 13·1 | 241 | 14·6 | 131 | 26·0 | 142 | 19·7 | 883 |
| | Logi | 15·7 | 203 | 15·1 | 105 | 26·7 | 109 | 20·0 | 677 |
| Angle based | ALogi | 12·5 | 101 | 12·9 | 51 | 23·9 | 44 | 18·1 | 359 |
| | ASVM | 12·2 | 138 | 12·6 | 69 | 23·3 | 67 | 17·9 | 497 |
| | ALUM | 12·2 | 199 | 12·5 | 103 | 23·5 | 104 | 17·7 | 810 |
The standard deviations of the mean error rates range from 0·04 to 0·15. See Table 1 for abbreviations.
Acknowledgement
The authors thank the editor, the associate editor and two reviewers for their helpful suggestions. This work was supported in part by the U.S. National Institutes of Health and National Science Foundation. Yufeng Liu is also affiliated with the Carolina Center for Genome Sciences and the Department of Biostatistics at the University of North Carolina.
Footnotes
Supplementary material
Supplementary material available at Biometrika online includes the large-margin unified loss function, proofs of the theorems, more simulation examples and results, details for the Glioblastoma Multiforme Cancer data, and a summary of attributes of the real datasets.
Contributor Information
Chong Zhang, Email: chongz@live.unc.edu.
Yufeng Liu, Email: yfliu@email.unc.edu.
References
1. Bartlett PL, Jordan MI, McAuliffe JD. Convexity, classification, and risk bounds. J. Am. Statist. Assoc. 2006;101:138–156.
2. Bartlett PL, Wegkamp MH. Classification with a reject option using a hinge loss. J. Mach. Learn. Res. 2008;9:1823–1840.
3. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Haussler D, editor. Proc. 5th Ann. Workshop Comp. Learn. Theory (COLT ’92). New York: Association for Computing Machinery; 1992. pp. 144–152.
4. Boyd SP, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.
5. Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2001;2:265–292.
6. Cristianini N, Shawe-Taylor JS. An Introduction to Support Vector Machines. Cambridge: Cambridge University Press; 2000.
7. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comp. Syst. Sci. 1997;55:119–139.
8. Friedman JH, Hastie TJ, Tibshirani RJ. Regularization paths for generalized linear models via coordinate descent. J. Statist. Software. 2010;33:1–22.
9. Hastie TJ, Tibshirani RJ, Friedman JH. The Elements of Statistical Learning. 2nd ed. New York: Springer; 2009.
10. Hill SI, Doucet A. A framework for kernel-based multi-category classification. J. Artif. Intel. Res. 2007;30:525–564.
11. Lange K, Wu TT. An MM algorithm for multicategory vertex discriminant analysis. J. Comp. Graph. Statist. 2008;17:527–544.
12. Lee Y, Lin Y, Wahba G. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. J. Am. Statist. Assoc. 2004;99:67–81.
13. Lin X, Wahba G, Xiang D, Gao F, Klein R, Klein B. Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Ann. Statist. 2000;28:1570–1600.
14. Liu Y, Shen X. Multicategory ψ-learning. J. Am. Statist. Assoc. 2006;101:500–509.
15. Liu Y, Yuan M. Reinforced multicategory support vector machines. J. Comp. Graph. Statist. 2011;20:901–919.
16. Liu Y, Zhang HH, Wu Y. Soft or hard classification? Large margin unified machines. J. Am. Statist. Assoc. 2011;106:166–177. doi: 10.1198/jasa.2011.tm10319.
17. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. ISBN 3-900051-07-0. http://www.R-project.org.
18. Saberian MJ, Vasconcelos N. Multiclass boosting: Theory and algorithms. In: Shawe-Taylor JS, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ, editors. Adv. Neural Info. Proces. Syst. Vol. 24. 2011. pp. 2124–2132.
19. Shen X, Tseng GC, Zhang X, Wong WH. On ψ-learning. J. Am. Statist. Assoc. 2003;98:724–734.
20. Shen X, Wang L. Generalization error for multi-class margin classification. Electron. J. Statist. 2007;1:307–330.
21. Tewari A, Bartlett PL. On the consistency of multiclass classification methods. J. Mach. Learn. Res. 2007;8:1007–1025.
22. Vapnik VN. Statistical Learning Theory. New York: Wiley; 1998.
23. Wahba G. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Adv. Kernel Meth. Support Vector Learn. 1999;6:69–87.
24. Wahba G. Soft and hard classification by reproducing kernel Hilbert space methods. Proc. Nat. Acad. Sci. 2002;99:16524–16530. doi: 10.1073/pnas.242574899.
25. Wang J, Shen X, Liu Y. Probability estimation for large margin classifiers. Biometrika. 2008;95:149–167.
26. Wang L, Shen X. On L1-norm multi-class support vector machines: Methodology and theory. J. Am. Statist. Assoc. 2007;102:595–602.
27. Wu TT, Wu Y. Nonlinear vertex discriminant analysis with reproducing kernels. Statist. Anal. Data Mining. 2012;5:167–176. doi: 10.1002/sam.11137.
28. Wu Y, Zhang HH, Liu Y. Robust model-free multiclass probability estimation. J. Am. Statist. Assoc. 2010;105:424–436. doi: 10.1198/jasa.2010.tm09107.
29. Zhang C, Liu Y. Multicategory large-margin unified machines. J. Mach. Learn. Res. 2013;14:1349–1386.
30. Zhang T. Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res. 2004a;5:1225–1251.
31. Zhang T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 2004b;32:56–85.
32. Zhu J, Hastie TJ. Kernel logistic regression and the import vector machine. J. Comp. Graph. Statist. 2005;14:185–205.
33. Zhu J, Zou H, Rosset S, Hastie TJ. Multi-class Adaboost. Statist. Interf. 2009;2:349–360.
34. Zou H, Hastie TJ. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B. 2005;67:301–320.