Abstract
A fundamental question in multiclass classification concerns understanding the consistency properties of surrogate risk minimization algorithms, which minimize an (often convex) surrogate to the multiclass 0–1 loss. In particular, the framework of calibrated surrogates has played an important role in analyzing Bayes consistency of such algorithms, i.e. in studying convergence to a Bayes optimal classifier (Zhang, 2004; Tewari and Bartlett, 2007). However, follow-up work has suggested this framework can be of limited value when studying H-consistency; in particular, concerns have been raised that even when the data comes from an underlying linear model, minimizing certain convex calibrated surrogates over linear scoring functions fails to recover the true model (Long and Servedio, 2013). In this paper, we investigate this apparent conundrum. We find that while some calibrated surrogates can indeed fail to provide H-consistency when minimized over a natural-looking but naïvely chosen scoring function class, the situation can potentially be remedied by minimizing them over a more carefully chosen class of scoring functions. In particular, for the popular one-vs-all hinge and logistic surrogates, both of which are calibrated (and therefore provide Bayes consistency) under realizable models, but were previously shown to pose problems for realizable H-consistency, we derive a form of scoring function class that enables H-consistency. When H is the class of linear models, this class consists of certain piecewise linear scoring functions that are characterized by the same number of parameters as in the linear case, and minimization over which can be performed using an adaptation of the min-pooling idea from neural network training. Our experiments confirm that the one-vs-all surrogates, when trained over this class of nonlinear scoring functions, yield better linear multiclass classifiers than when trained over standard linear scoring functions.
1. Introduction and Background
Consider a standard multiclass classification problem, with instance space X ⊆ R^d, label space [n] = {1, …, n} with n > 2 classes, and standard 0–1 loss given by ℓ0–1(y, ŷ) = 1(ŷ ≠ y). There is an unknown probability distribution D on X × [n]; given a training sample S = ((x1, y1), …, (xm, ym)) containing m examples drawn i.i.d. from D, the goal is to learn a classifier h : X → [n] with small 0–1 generalization error on new examples drawn from D:
er_D[h] = P_(x,y)∼D( h(x) ≠ y ) = E_(x,y)∼D[ ℓ0–1(y, h(x)) ].    (1)
A Bayes consistent algorithm is one which, given enough training examples, learns a classifier whose generalization error approaches the Bayes optimal error:
er_D[h_S] →P inf_h er_D[h]   as m → ∞,    (2)

where the infimum is over all classifiers h : X → [n] and →P denotes convergence in probability.
On the other hand, for a class of models H, an H-consistent algorithm is one which, given enough training examples, learns a classifier whose generalization error approaches the optimal error in H:
er_D[h_S] →P inf_{h∈H} er_D[h]   as m → ∞.    (3)
Since minimizing the discrete 0–1 loss directly is generally computationally hard, a popular approach to multiclass classification is to learn n real-valued scoring functions f1, …, fn : X → R, one for each class, by minimizing an (often convex) surrogate loss, and then, given a new test point x, to predict a class y with highest score fy(x). Specifically, given a training sample S as above, a surrogate loss ψ : [n] × R^n → R+, and a class F of (vector) scoring functions f = (f1, …, fn) : X → R^n, the (ψ, F) surrogate risk minimization algorithm finds a vector of n scoring functions f_S ∈ F by solving
f_S ∈ argmin_{f∈F} ∑_{i=1}^m ψ(yi, f(xi)),    (4)
and then returns a classifier h_S given by
h_S(x) ∈ argmax_{y∈[n]} f_{S,y}(x).    (5)
This approach includes several popular multiclass learning algorithms, such as multiclass logistic regression, various forms of multiclass SVMs [15, 4, 6, 5], one-vs-all logistic regression, and one-vs-all SVM; see Table 1 for a summary of the surrogate losses used by some of these algorithms.
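To make the recipe in Eqs. (4)–(5) concrete, the following is a minimal illustrative sketch (ours, not the authors' code) of surrogate risk minimization with linear scoring functions and the multiclass logistic surrogate; the function names and training settings are our own choices, and X is assumed to be a float tensor of shape (m, d) with Y a long tensor of labels.

```python
import torch
import torch.nn.functional as F

def surrogate_risk_minimization(X, Y, n_classes, epochs=200, lr=0.1):
    """Learn linear scorers f_y(x) = w_y.x + b_y by minimizing a convex surrogate (Eq. (4))."""
    m, d = X.shape
    W = torch.zeros(n_classes, d, requires_grad=True)
    b = torch.zeros(n_classes, requires_grad=True)
    opt = torch.optim.SGD([W, b], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        scores = X @ W.t() + b                 # scores[i, y] = f_y(x_i)
        loss = F.cross_entropy(scores, Y)      # multiclass logistic surrogate psi_mlog
        loss.backward()
        opt.step()
    return W.detach(), b.detach()

def predict(W, b, X):
    """Return h_S(x) in argmax_y f_{S,y}(x) for each row of X (Eq. (5))."""
    return (X @ W.t() + b).argmax(dim=1)
```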
Table 1:
Examples of convex surrogate losses used by various multiclass classification algorithms, together with a summary of some previous consistency results (here z+ = max(0, z)). In this paper, we show that one-vs-all surrogates can in fact achieve H-consistency if minimized over the right scoring function class.
| Algorithm | Surrogate loss | Universally calibrated? | Realizable calibrated? | Realizable H-consistent? |
|---|---|---|---|---|
| Multiclass logistic regression | ψmlog(y,u) = log ∑y' euy' − uy | ✓ | ✓ | ✓ |
| Crammer-Singer multiclass SVM | ψCS(y,u) = maxy'≠y(1 − (uy − uy'))+ | × | ✓ | ✓ |
| One-vs-all logistic regression | ψOvA,log(y,u) = log(1 + e−uy) + ∑y'≠y log(1 + euy') | ✓ | ✓ | × (we give a fix) |
| One-vs-all SVM | ψOvA,hinge(y,u) = (1 − uy)+ + ∑y'≠y(1 + uy')+ | × | ✓ | × (we give a fix) |
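For concreteness, here is a small PyTorch sketch (ours) of the standard forms of the four surrogates in Table 1, vectorized over a batch; u has shape (m, n) with u[i, y'] the score of class y' on example i, and y holds the integer labels.

```python
import torch
import torch.nn.functional as F

def _ova_signs(u, y):
    # s[i, y'] = +1 if y' == y_i else -1 (the one-vs-all reduction of a multiclass label)
    one_hot = F.one_hot(y, num_classes=u.shape[-1]).to(u.dtype)
    return 2.0 * one_hot - 1.0

def psi_mlog(u, y):
    # Multiclass logistic: log sum_{y'} exp(u_{y'}) - u_y
    return torch.logsumexp(u, dim=-1) - u.gather(-1, y.unsqueeze(-1)).squeeze(-1)

def psi_cs(u, y):
    # Crammer-Singer: max_{y' != y} (1 - (u_y - u_{y'}))_+
    u_y = u.gather(-1, y.unsqueeze(-1))
    margins = (1.0 - (u_y - u)).clamp(min=0.0)
    margins = margins.masked_fill(F.one_hot(y, u.shape[-1]).bool(), 0.0)  # drop y' = y
    return margins.max(dim=-1).values

def psi_ova_log(u, y):
    # One-vs-all logistic: log(1 + e^{-u_y}) + sum_{y' != y} log(1 + e^{u_{y'}})
    return F.softplus(-_ova_signs(u, y) * u).sum(dim=-1)

def psi_ova_hinge(u, y):
    # One-vs-all hinge: (1 - u_y)_+ + sum_{y' != y} (1 + u_{y'})_+
    return (1.0 - _ova_signs(u, y) * u).clamp(min=0.0).sum(dim=-1)
```

Averaging any of these over a training sample and minimizing over a parametrized scoring function class yields the corresponding algorithm in Eq. (4).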
A natural question then is: Under what conditions do such surrogate risk minimization algorithms provide Bayes consistency or, for various model classes H of interest, H-consistency, for the target 0–1 loss?
Surrogate losses and Bayes consistency.
For Bayes consistency, the above question is answered by the notion of calibrated surrogates [2, 17, 16, 14, 13, 11]. Specifically, if a surrogate loss ψ is calibrated w.r.t. the 0–1 loss, then for any universal function class F, the (ψ, F) surrogate risk minimization algorithm (implemented with suitable regularization) is a Bayes consistent algorithm for ℓ0–1.¹ Among the surrogate losses shown in Table 1, ψmlog and ψOvA,log are universally calibrated for ℓ0–1 (calibrated for all probability distributions), while ψCS and ψOvA,hinge are calibrated under the so-called ‘dominant-label’ condition (calibrated for distributions in which the conditional distributions p(y|x) assign probability at least 1/2 to one of the n classes) [16].
Surrogate losses and H-consistency.
For H-consistency, the situation is more complex [3, 7]. In particular, Long and Servedio [7] showed the following results:²
- (1) Realizable H-consistency of the Crammer-Singer surrogate for closed-under-scaling models. Let F be any class of (vector) scoring functions that is closed under scaling, and let H denote the class of multiclass classification models induced by F (via argmax). Long and Servedio [7] showed that if the data distribution D is H-realizable (i.e. the data is labeled according to a true model h∗ ∈ H), then minimizing the Crammer-Singer surrogate ψCS over F is H-consistent, i.e. the (ψCS, F) surrogate risk minimization algorithm is H-consistent. This was viewed as surprising in light of the fact that ψCS is not (universally) calibrated for ℓ0–1.
- (2) Lack of realizable H_lin-consistency of the one-vs-all logistic surrogate for linear models. Let F_lin be the class of linear (vector) scoring functions and H_lin the class of linear multiclass classification models:

F_lin = { f : X → R^n | f_y(x) = w_y·x ∀y, for some w_1, …, w_n ∈ R^d }    (6)

H_lin = { h : X → [n] | h(x) ∈ argmax_{y∈[n]} w_y·x ∀x, for some w_1, …, w_n ∈ R^d }    (7)
Long and Servedio [7] showed that even if the data distribution D is H_lin-realizable (i.e. the data is labeled according to a true linear model h∗ ∈ H_lin), minimizing the one-vs-all logistic surrogate ψOvA,log over F_lin fails to give an H_lin-consistent algorithm, i.e. the (ψOvA,log, F_lin) surrogate risk minimization algorithm is not H_lin-consistent, even though ψOvA,log is universally calibrated for ℓ0–1.
Our contributions.
As discussed by Long and Servedio [7] and summarized above, it seems peculiar that the Crammer-Singer surrogate ψCS, which is not universally calibrated for ℓ0–1, provides realizable H_lin-consistency (and more generally, realizable H-consistency for models H induced by closed-under-scaling scoring function classes F), while the one-vs-all logistic surrogate, ψOvA,log, which is universally calibrated for ℓ0–1, fails to provide realizable H_lin-consistency. In this paper, we investigate this apparent conundrum.
First, regarding result (1) of Long and Servedio [7] above, we note that any realizable distribution D (i.e. a distribution that labels data points x according to a deterministic model y = h(x)) trivially satisfies the dominant-label condition (for each x, one class y has conditional probability p(y|x) = 1), and therefore the Crammer-Singer surrogate ψCS is in fact calibrated for any such distribution. Therefore, in the realizable setting studied by Long and Servedio [7], the surrogate ψCS is in fact calibrated for ℓ0–1 (the paper emphasizes that ψCS is not calibrated/consistent for ℓ0–1, implicitly referring to universal calibration, and misses the fact that it is indeed calibrated for the setting studied). So, while result (1) is still interesting and non-trivial, it should be kept in mind that under the realizable setting studied in [7], all the surrogates studied by the authors are in fact calibrated for ℓ0–1.
Second, and more importantly, we look into result (2) of Long and Servedio [7] above. We know that minimizing the one-vs-all logistic surrogate ψOvA,log over a universal scoring function class gives Bayes consistency for all distributions D. Therefore, for H_lin-realizable distributions D, for which the Bayes optimal error is attained by a model in H_lin and therefore Bayes consistency is equivalent to H_lin-consistency, we have that minimizing the ψOvA,log surrogate over such a class gives an H_lin-consistent algorithm. So why does minimizing the same surrogate over the class F_lin of linear scoring functions fail in this regard? On closer inspection, we find that an important part of the answer lies in the form of the decision boundaries induced by a linear (or more generally, affine) multiclass classification model. As an example, Figure 1 shows an affine 4-class model in a 2-dimensional instance space; specifically, the figure shows the 4 affine scoring functions for the 4 classes, and the corresponding decision regions. As can be seen, the one-vs-all boundaries induced by such a model are not linear! Indeed, in general, each class region is a convex polytope, separated from the rest of the classes by a piecewise linear decision boundary (where the boundaries for different classes include shared pieces). Therefore, when a one-vs-all classifier is forced to separate each class from the rest using a linear decision boundary, it can end up learning a suboptimal separator.
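The following small numpy sketch (ours, not from the paper) illustrates this point on a fixed 4-class linear model in 2 dimensions: the points of class 0 that lie near its decision boundary compete against more than one other class, so the one-vs-all boundary of class 0 is made up of pieces of different hyperplanes rather than a single hyperplane.

```python
import numpy as np

# A fixed linear 4-class model in 2-D: class y has score w_y.x (zero biases for simplicity).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

xs = np.linspace(-1.0, 1.0, 401)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
scores = grid @ W.T                             # scores[i, y] = w_y . x_i
labels = scores.argmax(axis=1)

# For points of class 0 lying near its decision boundary (small gap between the top two
# scores), record the runner-up class: more than one runner-up class means the one-vs-all
# boundary of class 0 is made of pieces of different hyperplanes, i.e. is piecewise linear.
masked = scores.copy()
masked[np.arange(len(labels)), labels] = -np.inf
gap = scores.max(axis=1) - masked.max(axis=1)
near = (labels == 0) & (gap < 0.02)
print(sorted(set(masked[near].argmax(axis=1).tolist())))   # prints [1, 3]
```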
Figure 1:
Example in d = 2 dimensions with n = 4 classes. Top row: True linear 4-class classifier h∗ ∈ H_lin. The first 4 plots show contours of the 4 linear scoring functions (darker shades represent higher values); the 5th plot shows the decision regions of the classifier h∗. As can be seen, each class region is a convex polytope, separated from the rest by a piecewise linear decision boundary. Middle row: Contours of scoring functions and decision regions learned from training data labeled according to h∗ by minimizing the one-vs-all logistic surrogate ψOvA,log over the linear scoring function class F_lin. The generalization accuracy is 0.886. Bottom row: Contours of scoring functions and decision regions learned from the same training data by minimizing the same one-vs-all logistic surrogate ψOvA,log over the ‘shared’ piecewise linear scoring function class introduced in Section 3. The decision regions are closer to the true model, and the generalization accuracy is 0.986. Since the piecewise linear functions have shared pieces, the number of parameters to be learned is the same as in the linear case; moreover, the resulting model can also be transformed to a linear model if desired (see Theorem 3 and Corollary 4). See Section 5 for details of the experimental setup.
In the rest of the paper, we use the above insight to design a special class of ‘shared’ piecewise linear scoring functions such that minimizing the one-vs-all logistic surrogate ψOvA,log over this class yields an H_lin-consistent algorithm. We will see that this class is characterized by the same number of parameters as F_lin; in fact, it will also be parametrized by n weight vectors w_1, …, w_n ∈ R^d.³ In order to minimize ψOvA,log over this scoring function class, we will make use of an adaptation of the min-pooling idea from neural network training. The same idea can be applied to other one-vs-all surrogates as well; in our experiments, we consider both ψOvA,log and the one-vs-all SVM surrogate ψOvA,hinge, and find that in both cases, while minimizing these surrogates over the class F_lin of linear scoring functions fails to provide H_lin-consistency, minimizing them over the nonlinear piecewise linear scoring function class does indeed provide H_lin-consistency.
An additional interesting aspect of this scoring function class is that, while the individual scoring functions in it are nonlinear (specifically, piecewise linear), the classification models obtained by taking the highest-scoring class according to these scoring functions can also be expressed as linear models. Therefore, having learned a classifier by minimizing a one-vs-all surrogate over this nonlinear scoring function class, one can then convert the learned model to a linear model in H_lin.
We believe our study can pave the way for a more thorough understanding of the role of surrogate losses in H-consistency. In particular, our results suggest that, when studying H-consistency, one needs to carefully take into account the interplay between surrogate losses and the scoring function class over which they are minimized, and that this can lead to unexpected improvements to learning algorithms used in practice.
Organization.
We start by giving various formal definitions in Section 2. We then describe the class of ‘shared’ piecewise linear scoring functions and give our associated consistency result in Section 3. We discuss how to minimize one-vs-all surrogates over this scoring function class in practice in Section 4, and describe our numerical experiments in Section 5. Section 6 concludes with a brief summary. Additional details/proofs are provided in the supplementary material.
2. Formal Definitions: Consistency, Calibration, Realizability
Consistency.
We start with formal definitions of Bayes consistency and H-consistency:
Definition 1 (Bayes consistency).
We say a multiclass learning algorithm that maps training samples S to multiclass models h_S : X → [n] is Bayes consistent (w.r.t. ℓ0–1) for a distribution D if for all 𝜖 > 0,

P_{S∼D^m}( er_D[h_S] > inf_{h : X→[n]} er_D[h] + 𝜖 ) → 0  as m → ∞.
If an algorithm is Bayes consistent for all distributions D, we say it is universally Bayes consistent.
Definition 2 (H-consistency).

Let H be a class of multiclass models h : X → [n]. We say a multiclass learning algorithm that maps training samples S to multiclass models h_S : X → [n] is H-consistent (w.r.t. ℓ0–1) for a distribution D if for all 𝜖 > 0,

P_{S∼D^m}( er_D[h_S] > inf_{h∈H} er_D[h] + 𝜖 ) → 0  as m → ∞.
Note that we do not require the algorithm to produce a model in H; we only require that as m → ∞, the performance of the model it learns approaches that of the best model in H. If an algorithm is H-consistent for all distributions D, we say it is universally H-consistent.
Calibration.
Next, we give the standard definition of calibration of a surrogate loss that is useful for studying Bayes consistency of surrogate risk minimization algorithms, followed by a definition of calibration w.r.t. a model class H that is useful for studying H-consistency of such algorithms. To give these definitions, for a surrogate loss ψ : [n] × R^n → R+, we need the following notions of ψ-generalization error, Bayes optimal ψ-error, and optimal ψ-error in a scoring function class F:
er^ψ_D[f] = E_(x,y)∼D[ ψ(y, f(x)) ];   er^{ψ,*}_D = inf_{f : X→R^n} er^ψ_D[f];   er^{ψ,*}_{D,F} = inf_{f∈F} er^ψ_D[f].    (8)
Definition 3 (Calibration (standard definition)).
We say a surrogate loss ψ : [n] × R^n → R+ is calibrated w.r.t. ℓ0–1 for a distribution D if there exists a strictly increasing function g : R+ → R+ that is continuous at 0 with g(0) = 0 such that for all f : X → R^n,

er_D[argmax ∘ f] − inf_{h : X→[n]} er_D[h] ≤ g( er^ψ_D[f] − er^{ψ,*}_D ),

where h ≡ argmax ∘ f denotes a classifier that satisfies h(x) ∈ argmax_{y∈[n]} f_y(x) ∀x. If ψ is calibrated w.r.t. ℓ0–1 for all distributions D, we say ψ is universally calibrated w.r.t. ℓ0–1.
Definition 4 (Calibration w.r.t. H).

For a class of multiclass models H, a surrogate loss ψ : [n] × R^n → R+, and a scoring function class F, we say the pair (ψ, F) is calibrated w.r.t. H for a distribution D if there exists a strictly increasing function g : R+ → R+ that is continuous at 0 with g(0) = 0 such that for all f ∈ F,

er_D[argmax ∘ f] − inf_{h∈H} er_D[h] ≤ g( er^ψ_D[f] − er^{ψ,*}_{D,F} ),

where h ≡ argmax ∘ f denotes a classifier that satisfies h(x) ∈ argmax_{y∈[n]} f_y(x) ∀x. If (ψ, F) is calibrated w.r.t. H for all distributions D, we say (ψ, F) is universally calibrated w.r.t. H.
Realizability and realizable calibration/consistency.
Finally, we give formal definitions of realizable and H-realizable distributions, realizable calibration, and Long and Servedio’s definition of realizable H-consistency.
Definition 5 (Realizability and H-realizability).

We say a distribution D over X × [n] is realizable if (almost surely) it labels points according to a deterministic model, i.e. if there exists h : X → [n] such that P_(x,y)∼D( y = h(x) ) = 1. For a class H of multiclass models, we say a distribution D over X × [n] is H-realizable if (almost surely) it labels points according to a deterministic model in H, i.e. if there exists h ∈ H such that P_(x,y)∼D( y = h(x) ) = 1.
Definition 6 (Realizable calibration).
We say a surrogate loss ψ is realizable calibrated (w.r.t. ℓ0–1) if it is calibrated (w.r.t. ℓ0–1) for all realizable distributions.⁴
Definition 7 (Long and Servedio’s definition of realizable H-consistency [7]).

Let F be a class of (vector) scoring functions, and let H be the class of multiclass models induced by F (via argmax). A surrogate loss ψ is realizable H-consistent if (ψ, F) is calibrated w.r.t. H for all H-realizable distributions.⁵,⁶
3. Minimizing One-vs-All Surrogates over a Class of ‘Shared’ Piecewise Linear Scoring Functions is H_lin-Consistent
As discussed in Section 1, even though the one-vs-all logistic surrogate ψOvA,log is universally calibrated for ℓ0–1, Long and Servedio [7] showed that the (ψOvA,log, F_lin) surrogate risk minimization algorithm, which minimizes ψOvA,log over the class of linear scoring functions F_lin, is not H_lin-consistent even when the data distribution D is H_lin-realizable. In this section, we remedy this situation by showing how to minimize the same surrogate loss ψOvA,log (as well as other one-vs-all surrogate losses) over a different, nonlinear scoring function class such that the resulting algorithm is H_lin-consistent for all H_lin-realizable distributions D.
Linear models.
For the remainder of the paper, we will re-define the classes of linear scoring functions F_lin and linear classification models H_lin to allow for the inclusion of bias/offset terms:
F_lin = { f : X → R^n | f_y(x) = w_y·x + b_y ∀y, for some w_1, …, w_n ∈ R^d, b_1, …, b_n ∈ R }    (9)

H_lin = { h : X → [n] | h(x) ∈ argmax_{y∈[n]} (w_y·x + b_y) ∀x, for some w_1, …, w_n ∈ R^d, b_1, …, b_n ∈ R }    (10)
Our conclusions will apply both in this more general setting, and in the special case where by = 0 ∀y.
‘Shared’ piecewise linear scoring functions.
To motivate the scoring function class we will construct, consider again the example in Figure 1. As this example makes clear, under a linear classification model in H_lin defined by weight vectors w_1, …, w_n and bias terms b_1, …, b_n, the decision region corresponding exclusively to class y ∈ [n] is the (open) convex polytope given by
C_y = { x ∈ X : w_y·x + b_y > w_{y'}·x + b_{y'} ∀ y' ≠ y }    (11)

    = { x ∈ X : (w_y − w_{y'})·x + (b_y − b_{y'}) > 0 ∀ y' ≠ y }    (12)

    = { x ∈ X : min_{y'≠y} ( (w_y − w_{y'})·x + (b_y − b_{y'}) ) > 0 }.    (13)
We use this observation to construct the following special class F_pw of ‘shared’ piecewise linear scoring functions:

F_pw = { f : X → R^n | f_y(x) = min_{y'≠y} ( (w_y − w_{y'})·x + (b_y − b_{y'}) ) ∀y, for some w_1, …, w_n ∈ R^d, b_1, …, b_n ∈ R }.    (14)
Clearly, this class is parametrized by the same number of parameters as F_lin. The reason that the class F_pw is useful is that the scoring functions in this class allow for learning precisely the form of one-vs-all decision boundaries that are induced by linear multiclass models. In particular, we have the following result:
Lemma 1
(Scoring functions in F_pw capture correct one-vs-all decision boundaries for linear multiclass models). Let f ∈ F_pw be parametrized by w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R. Then for each y ∈ [n],

{ x ∈ X : f_y(x) > 0 } = C_y.    (15)
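As a quick numerical sanity check (ours, not part of the paper), the following numpy snippet verifies on random data that the score in Eq. (14) is positive exactly at points where class y has the strictly highest linear score w_y·x + b_y.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 3, 1000
W, b = rng.standard_normal((n, d)), rng.standard_normal(n)
X = rng.standard_normal((m, d))

g = X @ W.T + b                                  # g[i, y] = w_y.x_i + b_y
diffs = g[:, :, None] - g[:, None, :]            # diffs[i, y, y'] = g_y(x_i) - g_{y'}(x_i)
diffs[:, np.arange(n), np.arange(n)] = np.inf    # exclude y' = y from the min
f = diffs.min(axis=2)                            # f[i, y] as in Eq. (14)

# f_y(x) > 0 exactly when class y attains the (strictly) highest linear score at x:
assert np.array_equal(f > 0, g.argmax(axis=1)[:, None] == np.arange(n))
```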
Since one-vs-all surrogates effectively learn scoring functions that aim to separate points x with label y from points with other labels according to whether fy(x) ≥ 0, Lemma 1 implies that minimizing such surrogates over the class F_pw should allow learning precisely the form of one-vs-all separation boundaries induced by linear multiclass models. Formally, we have the following H_lin-consistency result:
Theorem 2
(H_lin-consistency of the (ψOvA,log, F_pw) surrogate risk minimization algorithm). The pair (ψOvA,log, F_pw) is calibrated w.r.t. H_lin for all H_lin-realizable distributions.
Remark 1
(Generalization to other one-vs-all surrogates). The above H_lin-consistency result can be generalized to other one-vs-all surrogates, such as the one-vs-all hinge surrogate ψOvA,hinge.
Remark 2
(Loss of ‘independence’ of one-vs-all binary classifiers). Since the n components of the (vector) scoring functions in F_pw share parameters, they can no longer be learned independently by training separate binary classifiers in parallel; while minimizing a one-vs-all surrogate over F_pw still amounts to learning binary separators for each of the classes versus the rest, these separators must be learned together in an “all-in-one” multiclass learning algorithm.
Remark 3
(Non-convexity of resulting optimization problems). Although the one-vs-all surrogates ψOvA,log and ψOvA,hinge are convex, minimizing these surrogates over the function class F_pw results in non-convex optimization problems. In order to solve these optimization problems, our implementation makes use of an adaptation of the min-pooling idea from neural network training (see Section 4). Additional details regarding the behavior of this approach in our experiments are discussed in Section 5.
We also have the following result, which shows that the classification models induced by the nonlinear (vector) scoring functions in F_pw are in fact equivalent to those in the class of linear classification models H_lin:
Theorem 3
(Scoring functions in F_pw induce linear multiclass classifiers). Let H_pw be the class of multiclass classifiers induced by F_pw:

H_pw = { argmax ∘ f : f ∈ F_pw }.    (16)

Then H_pw = H_lin.
Indeed, the following corollary shows that once we have learned a nonlinear (vector) scoring function in F_pw, we can easily transform it into a linear classification model in H_lin:
Corollary 4
(Converting a nonlinear scoring function in F_pw to a linear classification model in H_lin). Let f ∈ F_pw be parametrized by w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R. Then for all x ∈ X,

argmax_{y∈[n]} f_y(x) = argmax_{y∈[n]} ( w_y·x + b_y ).    (17)
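In practice this conversion is immediate because the piecewise linear scorer and the linear model share the same parameters (w_y, b_y); the following small numpy check (ours) confirms that their argmax predictions coincide, with random parameters standing in for learned ones.

```python
import numpy as np

def to_linear_classifier(W, b):
    """Given the parameters (w_y, b_y) of a learned scoring function in the piecewise class,
    return the equivalent linear multiclass classifier of Corollary 4."""
    return lambda X: (X @ W.T + b).argmax(axis=1)

rng = np.random.default_rng(3)
W, b, X = rng.standard_normal((6, 4)), rng.standard_normal(6), rng.standard_normal((2000, 4))
g = X @ W.T + b
diffs = g[:, :, None] - g[:, None, :]
diffs[:, np.arange(6), np.arange(6)] = np.inf
assert np.array_equal(diffs.min(axis=2).argmax(axis=1), to_linear_classifier(W, b)(X))
```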
Margin interpretation of F_pw.
We note that the scoring functions in F_pw can also be viewed as computing a multiclass ‘margin’ vector over the underlying linear functions defining the shared piecewise linear scores. Specifically, recall that a (vector) scoring function f ∈ F_pw has the form

f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ),  where g_y(x) = w_y·x + b_y,    (18)

for some w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R. This suggests that for each y, the score fy(x) effectively computes the ‘margin’ of separation between g_y(x) and max_{y'≠y} g_{y'}(x); if this margin is non-negative, then y is a highest-scoring class under the underlying linear model, and if it is negative, then it is not.
Generalization to other multiclass models.

The above construction can be generalized beyond H_lin to other classes of multiclass models defined in terms of a class G of real-valued scoring functions g : X → R. Specifically, for any such class G, let

H_G = { h : X → [n] | h(x) ∈ argmax_{y∈[n]} g_y(x) ∀x, for some g_1, …, g_n ∈ G }.    (19)

(Thus H_lin is a special case with G = { g : g(x) = w·x + b, w ∈ R^d, b ∈ R }.) Define the class of ‘shared’ piecewise-difference-of-G scoring functions as follows:

F_{G,pw} = { f : X → R^n | f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ) ∀y, for some g_1, …, g_n ∈ G }.    (20)

Then, similarly to the linear case, it can be shown that minimizing either of the one-vs-all surrogates ψOvA,log or ψOvA,hinge over F_{G,pw} is H_G-consistent for all H_G-realizable distributions; a generic sketch of the construction in Eq. (20) follows below.
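The construction in Eq. (20) needs only the base scores themselves, so it can be written generically; the following numpy sketch (ours, with purely illustrative base scorers) builds the shared piecewise-difference scores from arbitrary callables g_1, …, g_n.

```python
import numpy as np

def shared_pairwise_min_scores(base_scorers, X):
    """base_scorers: n callables, each mapping an (m, d) array to an (m,) array of scores."""
    G = np.stack([g(X) for g in base_scorers], axis=1)   # G[i, y] = g_y(x_i)
    n = G.shape[1]
    diffs = G[:, :, None] - G[:, None, :]                # g_y(x_i) - g_{y'}(x_i)
    diffs[:, np.arange(n), np.arange(n)] = np.inf        # drop y' = y
    return diffs.min(axis=2)                             # f[i, y] as in Eq. (20)

# Illustration with three arbitrary (non-linear) base scorers:
X = np.random.default_rng(2).standard_normal((5, 2))
gs = [lambda X, c=c: -((X - c) ** 2).sum(axis=1) for c in (0.0, 1.0, -1.0)]
F_scores = shared_pairwise_min_scores(gs, X)
print(F_scores.argmax(axis=1))   # matches the argmax over the base scores g_y themselves
```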
4. Implementation of One-vs-All Surrogate Risk Minimization over F_pw
In order to implement surrogate risk minimization over the scoring function class F_pw, we make use of an adaptation of the min-pooling idea from neural network training. Figure 2 shows a summary of the architecture we use to implement scoring functions f in F_pw.
Figure 2:
Neural network-like architecture implementing scoring functions in F_pw. To find parameters minimizing a surrogate loss ψ on the training data, we use a backpropagation-like procedure on this architecture.
Specifically, given an input point x ∈ X, the first layer computes the n linear functions

g_y(x) = w_y·x + b_y,   y ∈ [n].

The second layer then computes the n scoring function components fy(x) in terms of minima of the relevant functions from the first layer (see Eq. (14)):

f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ),   y ∈ [n].
To fit the parameters to training data, we then use a backpropagation-like procedure to minimize the surrogate loss of interest. Any existing neural network training library can be easily modified to perform this minimization; in our experiments, we implemented this approach using PyTorch [10].
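The paper’s implementation uses PyTorch; the following is our own minimal sketch of such an architecture and training loop (the class and function names, the full-batch training, and the hyperparameter defaults are our choices), using the one-vs-all logistic surrogate as the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPiecewiseLinearScorer(nn.Module):
    """Sketch of the two-layer 'min-pooling' architecture: a linear first layer computes
    g_y(x) = w_y.x + b_y; the second layer outputs f_y(x) = min_{y' != y} (g_y(x) - g_{y'}(x))."""

    def __init__(self, d, n):
        super().__init__()
        self.linear = nn.Linear(d, n)                    # holds the n weight vectors and biases

    def forward(self, x):                                # x: (batch, d)
        g = self.linear(x)                               # (batch, n)
        diffs = g.unsqueeze(2) - g.unsqueeze(1)          # diffs[:, y, y'] = g_y - g_{y'}
        mask = torch.eye(g.shape[1], dtype=torch.bool, device=g.device)
        diffs = diffs.masked_fill(mask, float('inf'))    # exclude y' = y from the min
        return diffs.min(dim=2).values                   # (batch, n) shared piecewise scores

def train_ova_logistic(model, X, Y, epochs=50, lr=1e-2, weight_decay=0.0):
    """Minimize the one-vs-all logistic surrogate over the above scoring function class."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        opt.zero_grad()
        f = model(X)                                      # f[i, y] = f_y(x_i)
        s = 2.0 * F.one_hot(Y, f.shape[1]).float() - 1.0  # +1 for the true class, -1 otherwise
        loss = F.softplus(-s * f).sum(dim=1).mean()       # psi_OvA,log averaged over the sample
        loss.backward()
        opt.step()
    return model
```

Swapping the softplus term for a hinge term (1 − s·f)+ gives the corresponding ψOvA,hinge version; in both cases the objective is non-convex in the parameters, as noted in Remark 3.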
5. Experiments
We conducted two sets of experiments. In the first set, we generated synthetic data from a true linear model (i.e. a known H_lin-realizable distribution) and tested the H_lin-consistency of minimizing one-vs-all surrogates over F_pw. In the second set, we implemented the approach on various real benchmark data sets to test its practical behavior. In all cases, we implemented a total of 6 multiclass algorithms: all 4 algorithms shown in Table 1 with surrogate risk minimized over the linear scoring function class F_lin, and the two one-vs-all algorithms with surrogate risk minimized over F_pw. All algorithms were implemented in PyTorch and used the AdamW optimizer [8].⁷,⁸
5.1. Synthetic Data: Consistency Behavior on Linear Models
We generated two synthetic data sets. The first data set had d = 2 features and n = 4 classes. A true model h∗ ∈ H_lin was created by choosing w_1, …, w_4 ∈ R^2 and b_1, …, b_4 as follows: the elements of w_1, …, w_4 were drawn i.i.d. at random and the vectors subsequently scaled so that ||wy||2 = 1 ∀y; the bias terms b1, …, b4 were set to 0.2, 0.1, −0.1, −0.2 (decision regions of the resulting model h∗ are shown in Figure 1). Instances x were then drawn uniformly at random from a disk of radius 0.5 centered at (0.3, −0.1), and labeled according to h∗. We ran all 6 algorithms (using AdamW with zero weight decay factor) on increasingly large training samples (up to 30,000 data points) generated in this manner, and measured the generalization accuracy on a large test set of 10,000 data points generated in the same manner. The results are shown in Figure 3 (left); an illustration of some of the models learned from 10,000 data points is also shown in Figure 1.
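A sketch (ours) of this data-generating process is given below; the distribution of the raw weight entries before normalization is not specified above, so the standard normal used here is an assumption, and the function name is our own.

```python
import numpy as np

def make_synthetic_2d(m, seed=0):
    """Generate m labeled points from a 4-class linear model of the kind described above."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((4, 2))                     # assumed raw distribution (see note above)
    W /= np.linalg.norm(W, axis=1, keepdims=True)       # scale so that ||w_y||_2 = 1
    b = np.array([0.2, 0.1, -0.1, -0.2])
    r = 0.5 * np.sqrt(rng.uniform(size=m))              # uniform over a disk of radius 0.5
    theta = rng.uniform(0.0, 2.0 * np.pi, size=m)
    X = np.stack([0.3 + r * np.cos(theta), -0.1 + r * np.sin(theta)], axis=1)
    Y = (X @ W.T + b).argmax(axis=1)                    # label according to the true model h*
    return X, Y, W, b
```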
Figure 3:
Convergence behavior of various multiclass surrogate risk minimization algorithms on synthetic data generated from a true linear model in H_lin. Left: d = 2, n = 4. Right: d = 100, n = 10. In both cases, the ψOvA,log and ψOvA,hinge surrogates fail to converge to the optimal performance in H_lin when minimized over standard linear scoring functions F_lin, but successfully do so when minimized over the class F_pw. (In the right plot, the curves for all four H_lin-consistent algorithms overlap.) See Section 5.1 for details.
The second data set had d = 100 features and n = 10 classes. A true model h∗ ∈ H_lin was created in the same manner as above, except that in this case we set by = 0 ∀y ∈ [10]. Instances x were drawn uniformly at random from X = [−1, 1]^100, and labeled according to h∗. We ran all 6 algorithms on increasingly large training samples (up to 40,000 data points) and measured accuracy on a large test set of 10,000 data points. The results are shown in Figure 3 (right).
In both cases, the one-vs-all surrogates fail to give H_lin-consistency when minimized over the linear scoring functions F_lin, but successfully do so when minimized over the scoring function class F_pw.
5.2. Real Data: Practical Behavior
We evaluated the performance of all 6 algorithms on various benchmark multiclass classification data sets drawn from the UCI repository and the LIBSVM data repository. Details of the data sets are provided in the supplementary material; the number of features d ranges from 16 to 3072, and the number of classes n ranges from 7 to 26. Several of the data sets come with prescribed train/validation/test splits; for the others, we randomly chose a 3:1:1 split. For all algorithms, we used AdamW with a weight decay factor λ; the factor λ was chosen from {10^−3, 10^−2, …, 10^2} to maximize 0–1 accuracy on the validation set.
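A minimal sketch (ours) of this validation-based selection of the weight decay factor is shown below; `fit` and `accuracy` are hypothetical placeholders for the actual training and evaluation routines.

```python
def select_weight_decay(fit, accuracy, train_data, val_data,
                        grid=(1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2)):
    """Pick the AdamW weight decay factor that maximizes 0-1 accuracy on the validation set."""
    best = (None, float('-inf'), None)                  # (lambda, accuracy, model)
    for lam in grid:
        model = fit(train_data, weight_decay=lam)       # e.g. AdamW(..., weight_decay=lam)
        acc = accuracy(model, val_data)
        if acc > best[1]:
            best = (lam, acc, model)
    return best
```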
The results are shown in Table 2. For each data set, the best-performing algorithms within the group of logistic surrogates and within the group of hinge surrogates are shown in bold font; the best overall is enclosed in asterisks. For hinge surrogates, consistent with previous results [5], we find that ψOvA,hinge, when minimized over F_lin, does slightly poorer than ψCS, but minimizing it over F_pw brings it in line with (and even slightly exceeds) ψCS. For logistic surrogates, the results are more mixed, although minimization over F_pw frequently outperforms minimization over F_lin. Overall, despite the good performance, we do not necessarily advocate minimizing one-vs-all surrogates over F_pw as a practical strategy, as training is 2–3 times slower than for ψCS or ψmlog, which generally give comparable results. Our primary interest is in the H_lin-consistency of this scheme under H_lin-realizable data distributions; the main purpose of the experiments on real data was to serve as a sanity check and ensure that this does not come at a huge price in terms of practical applicability of the resulting algorithms.
Table 2:
Results (in terms of test accuracy) on various real multiclass data sets. See Section 5.2 for details.
| Data set | ψmlog (over F_lin) | ψOvA,log (over F_lin) | ψOvA,log (over F_pw) | ψCS (over F_lin) | ψOvA,hinge (over F_lin) | ψOvA,hinge (over F_pw) |
|---|---|---|---|---|---|---|
| Covertype (50K) | 0.6606 | 0.6943 | 0.6607 | 0.7186 | 0.7069 | *0.7193* |
| Digits | 0.8985 | 0.8696 | 0.8982 | 0.9025 | 0.8819 | *0.9042* |
| USPS | *0.9153* | 0.9138 | 0.9148 | 0.9128 | 0.9063 | 0.9148 |
| MNIST (70K) | 0.9270 | 0.9200 | 0.9271 | 0.9307 | 0.9216 | *0.9317* |
| CIFAR10 | 0.4000 | *0.4066* | 0.3763 | 0.3831 | 0.3686 | 0.4006 |
| Sensorless | *0.8266* | 0.6539 | 0.7918 | 0.7703 | 0.5381 | 0.7791 |
| Letter | 0.7644 | 0.7126 | 0.7662 | 0.7738 | 0.6058 | *0.7804* |
6. Conclusion
Our study shows that when studying H-consistency of surrogate risk minimization algorithms, the interplay between the surrogate loss and the scoring function class can play an important role. In particular, for ψOvA,log and ψOvA,hinge, we found that minimization over the suitably designed nonlinear function class F_pw gives H_lin-consistency where standard minimization over the linear scoring functions F_lin fails to do so.
Supplementary Material
Broader Impact.
The primary goal of this paper is to better understand the statistical consistency properties of surrogate risk minimization algorithms in machine learning. The insights and results of the paper will benefit readers who wish to be aware of these properties when designing or selecting learning algorithms.
We do not expect this research to put anyone at a disadvantage. Nevertheless, issues related to data bias and fairness can potentially affect any algorithm that learns models from data [9], and users should keep this in mind when applying the ideas discussed here to domains where such issues may be important. In the future, it may also be of interest to consider incorporating fairness constraints in the types of algorithms discussed here.
Acknowledgments and Disclosure of Funding
Thanks to Avrim Blum for early discussions related to this work. Part of the motivation for this work also came from discussions following a talk by SA at a workshop on machine learning theory held at Google NYC in September 2019; thanks to all the participants of the workshop for stimulating discussions. We also thank the anonymous referees for helpful comments.
This material is based upon work supported in part by the US National Science Foundation (NSF) under Grant No. 1934876. SA is also supported in part by the US National Institutes of Health (NIH) under Grant No. U01CA214411. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or NIH.
Appendix
A. Proof of Lemma 1
Proof.
This essentially follows from the definition of F_pw. In particular, for any x ∈ X and y ∈ [n], we have:

f_y(x) > 0 ⟺ min_{y'≠y} ( (w_y − w_{y'})·x + (b_y − b_{y'}) ) > 0 ⟺ w_y·x + b_y > w_{y'}·x + b_{y'} ∀ y' ≠ y ⟺ x ∈ C_y.
B. Proof of Theorem 2
Proof.
Let D be an H_lin-realizable distribution. Then there exists h∗ ∈ H_lin such that P_(x,y)∼D( y = h∗(x) ) = 1, and therefore inf_{h∈H_lin} er_D[h] = 0. Thus our goal is to show that ∃ a strictly increasing function g : R+ → R+ that is continuous at 0 with g(0) = 0 such that for all f ∈ F_pw,

er_D[argmax ∘ f] ≤ g( er^{ψOvA,log}_D[f] − er^{ψOvA,log,*}_{D,F_pw} ).
We will do this in two parts:
We will show that .
We will show that for all , .
Putting these together will then give that for all ,
Part 1.
We will show that for any sufficiently small such that ; this will establish that .
Let 0 < ϵ < 2n ln(2). Since , we have such that
Define as
Then we have
Therefore ∃κ > 0 such that
Define as
Then it can be verified that
and moreover,
This gives
Part 2.
Let , and let be such that
Define such that
Then we have
C. Proof of Theorem 3
Proof.
Let x ∈ X, and let f ∈ F_pw be parametrized by w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R, so that f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ) with g_y(x) = w_y·x + b_y.

We will show that

argmax_{y∈[n]} f_y(x) = argmax_{y∈[n]} g_y(x) = argmax_{y∈[n]} ( w_y·x + b_y );

this will establish the result.

To see that the above claim is true, notice that we can write

f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ) = g_y(x) − max_{y'≠y} g_{y'}(x).

In other words, fy(x) is the difference between g_y(x) and the largest value of g_{y'}(x) among y′ ≠ y. Clearly, this difference is largest when y ∈ argmax_{y'∈[n]} g_{y'}(x) (in particular, in this case the difference is non-negative; in all other cases, the difference is negative, and therefore smaller). Thus

argmax_{y∈[n]} f_y(x) = argmax_{y∈[n]} g_y(x).

This proves the claim.
D. Proof of Corollary 4
This follows directly from the proof of Theorem 3.
E. Details of Real Data Sets Used in Experiments in Section 5.2
Table 3:
Multiclass classification data sets used in experiments in Section 5.2.
| Data set | # train | # validation | # test | # classes (n) | # features (d) |
|---|---|---|---|---|---|
| Covertype (50K) | 30000 | 10000 | 10000 | 7 | 54 |
| Digits | 5620 | 1874 | 3498 | 10 | 16 |
| USPS | 5468 | 1823 | 2007 | 10 | 256 |
| MNIST (70K) | 45000 | 15000 | 10000 | 10 | 780 |
| CIFAR10 | 37500 | 12500 | 10000 | 10 | 3072 |
| Sensorless | 35105 | 11702 | 11702 | 11 | 48 |
| Letter | 10500 | 4500 | 5000 | 26 | 16 |
Notes:
Subsampling: For Covertype, we used a random subsample of the original data set containing 50,000 examples (the original data set has 581,012 examples).
Image data sets with pixel features: The versions of the USPS and MNIST datasets that we used came with features scaled to the ranges [−1, 1] and [0, 1], respectively. For CIFAR10, we similarly scaled the features to the range [0, 1] by dividing all features by 255.
Footnotes
1. A universal function class is one that can approximate any continuous function; such classes can be obtained, for example, via reproducing kernel Hilbert spaces (RKHSs) associated with Gaussian kernels [12], or via sufficiently flexible neural networks [1].
2. Long and Servedio [7] presented the results slightly differently; in particular, in their case, the scoring function class refers to a class of real-valued functions from which individual scoring functions are drawn, and consistency is defined in terms of this class. We describe the results here in terms of our notation and terminology.
3. More generally, we will allow both the linear and piecewise linear classes to be characterized by n weight vectors w_1, …, w_n ∈ R^d and n bias/offset terms b_1, …, b_n ∈ R.
4. This is the sense used in Table 1, column 4.
5. Technically, Long and Servedio’s definition [7] applies to scoring function classes F for which individual scoring function components come independently from a common fixed class, i.e. for which there is a class G of real-valued functions such that each f ∈ F has components f_1, …, f_n ∈ G, and they would refer to such a surrogate as realizable G-consistent. We modify the terminology slightly to better fit our presentation of ideas, and the definition we give is slightly more general (in that it allows for more general scoring function classes F).
6. This is the sense used in Table 1, column 5.
7. As noted above, the minimization over F_pw is non-convex; we found that for most (but not all) data sets, the results were fairly stable under different random initializations. The results we report are for a single random initialization; our results could potentially be improved by starting the optimizer from multiple random initializations, and keeping the model with the best training objective value.
8. In all cases, the optimizer was run for 50 epochs over the training sample; the learning rate parameter α was initially set to 0.01 and was halved at the end of every 5 epochs.
Contributor Information
Mingyuan Zhang, University of Pennsylvania, Philadelphia, PA 19104.
Shivani Agarwal, University of Pennsylvania, Philadelphia, PA 19104.
References
- [1]. Anthony Martin and Bartlett Peter L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
- [2]. Bartlett Peter L., Jordan Michael, and McAuliffe Jon. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- [3]. Ben-David Shai, Loker David, Srebro Nathan, and Sridharan Karthik. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
- [4]. Crammer Koby and Singer Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
- [5]. Doğan Ürün, Glasmachers Tobias, and Igel Christian. A unified view on multi-class support vector classification. Journal of Machine Learning Research, 17(45):1–32, 2016.
- [6]. Lee Yoonkyung, Lin Yi, and Wahba Grace. Multicategory support vector machines: Theory and application to the classification of microarray data. Journal of the American Statistical Association, 99(465):67–81, 2004.
- [7]. Long Philip M. and Servedio Rocco A. Consistency versus realizable H-consistency for multiclass classification. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
- [8]. Loshchilov Ilya and Hutter Frank. Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR), 2019.
- [9]. Mehrabi Ninareh, Morstatter Fred, Saxena Nripsuta, Lerman Kristina, and Galstyan Aram. A survey on bias and fairness in machine learning. CoRR, abs/1908.09635, 2019.
- [10]. Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, and Chintala Soumith. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
- [11]. Pires Bernardo Ávila and Szepesvári Csaba. Multiclass classification calibration functions. CoRR, abs/1609.06385, 2016.
- [12]. Steinwart Ingo. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
- [13]. Steinwart Ingo. How to compare different loss functions and their risks. Constructive Approximation, 26:225–287, 2007.
- [14]. Tewari Ambuj and Bartlett Peter L. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
- [15]. Weston Jason and Watkins Chris. Support vector machines for multi-class pattern recognition. In Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN), 1999.
- [16]. Zhang Tong. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.
- [17]. Zhang Tong. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–134, 2004.