Abstract
A fundamental question in multiclass classification concerns understanding the consistency properties of surrogate risk minimization algorithms, which minimize an (often convex) surrogate to the multiclass 0–1 loss. In particular, the framework of calibrated surrogates has played an important role in analyzing Bayes consistency of such algorithms, i.e. in studying convergence to a Bayes optimal classifier (Zhang, 2004; Tewari and Bartlett, 2007). However, follow-up work has suggested this framework can be of limited value when studying H-consistency; in particular, concerns have been raised that even when the data comes from an underlying linear model, minimizing certain convex calibrated surrogates over linear scoring functions fails to recover the true model (Long and Servedio, 2013). In this paper, we investigate this apparent conundrum. We find that while some calibrated surrogates can indeed fail to provide H-consistency when minimized over a natural-looking but naïvely chosen scoring function class, the situation can potentially be remedied by minimizing them over a more carefully chosen class of scoring functions. In particular, for the popular one-vs-all hinge and logistic surrogates, both of which are calibrated (and therefore provide Bayes consistency) under realizable models, but were previously shown to pose problems for realizable H-consistency, we derive a form of scoring function class that enables H-consistency. When H is the class of linear models, this class consists of certain piecewise linear scoring functions that are characterized by the same number of parameters as in the linear case, and minimization over which can be performed using an adaptation of the min-pooling idea from neural network training. Our experiments confirm that the one-vs-all surrogates, when trained over this class of nonlinear scoring functions, yield better linear multiclass classifiers than when trained over standard linear scoring functions.
1. Introduction and Background
Consider a standard multiclass classification problem, with instance space X ⊆ R^d, label space [n] = {1, …, n} with n > 2 classes, and standard 0–1 loss given by ℓ0–1(y, ŷ) = 1(ŷ ≠ y). There is an unknown probability distribution D on X × [n]; given a training sample S = ((x1, y1), …, (xm, ym)) containing m examples drawn i.i.d. from D, the goal is to learn a classifier h : X → [n] with small 0–1 generalization error on new examples drawn from D:
er_D[h] = P_(x,y)∼D( h(x) ≠ y ) = E_(x,y)∼D[ ℓ0–1(y, h(x)) ].    (1)
A Bayes consistent algorithm is one which, given enough training examples, learns a classifier whose generalization error approaches the Bayes optimal error:
er_D[h_S] →P inf_h er_D[h]   as m → ∞,    (2)

where the infimum is over all classifiers h : X → [n] and →P denotes convergence in probability.
On the other hand, for a class of models H, an H-consistent algorithm is one which, given enough training examples, learns a classifier whose generalization error approaches the optimal error in H:
er_D[h_S] →P inf_{h∈H} er_D[h]   as m → ∞.    (3)
Since minimizing the discrete 0–1 loss directly is generally computationally hard, a popular approach to multiclass classification is to learn n real-valued scoring functions f1, …, fn : X → R, one for each class, by minimizing an (often convex) surrogate loss, and then, given a new test point x, to predict a class y with highest score fy(x). Specifically, given a training sample S as above, a surrogate loss ψ : [n] × R^n → R+, and a class F of (vector) scoring functions f = (f1, …, fn) : X → R^n, the (ψ, F) surrogate risk minimization algorithm finds a vector of n scoring functions f_S ∈ F by solving
f_S ∈ argmin_{f∈F} ∑_{i=1}^m ψ(yi, f(xi)),    (4)
and then returns a classifier h_S given by
h_S(x) ∈ argmax_{y∈[n]} f_{S,y}(x).    (5)
This approach includes several popular multiclass learning algorithms, such as multiclass logistic regression, various forms of multiclass SVMs [15, 4, 6, 5], one-vs-all logistic regression, and one-vs-all SVM; see Table 1 for a summary of the surrogate losses used by some of these algorithms.
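To make the recipe in Eqs. (4)–(5) concrete, the following is a minimal illustrative sketch (ours, not the authors' code) of surrogate risk minimization with linear scoring functions and the multiclass logistic surrogate; the function names and training settings are our own choices, and X is assumed to be a float tensor of shape (m, d) with Y a long tensor of labels.

```python
import torch
import torch.nn.functional as F

def surrogate_risk_minimization(X, Y, n_classes, epochs=200, lr=0.1):
    """Learn linear scorers f_y(x) = w_y.x + b_y by minimizing a convex surrogate (Eq. (4))."""
    m, d = X.shape
    W = torch.zeros(n_classes, d, requires_grad=True)
    b = torch.zeros(n_classes, requires_grad=True)
    opt = torch.optim.SGD([W, b], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        scores = X @ W.t() + b                 # scores[i, y] = f_y(x_i)
        loss = F.cross_entropy(scores, Y)      # multiclass logistic surrogate psi_mlog
        loss.backward()
        opt.step()
    return W.detach(), b.detach()

def predict(W, b, X):
    """Return h_S(x) in argmax_y f_{S,y}(x) for each row of X (Eq. (5))."""
    return (X @ W.t() + b).argmax(dim=1)
```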
Table 1:
Examples of convex surrogate losses used by various multiclass classification algorithms, together with a summary of some previous consistency results (here z+ = max(0, z)). In this paper, we show that one-vs-all surrogates can in fact achieve H-consistency if minimized over the right scoring function class.
| Algorithm | Surrogate loss | Universally calibrated? | Realizable calibrated? | Realizable H-consistent? |
|---|---|---|---|---|
| Multiclass logistic regression | ψmlog(y,u) = log ∑y' euy' − uy | ✓ | ✓ | ✓ |
| Crammer-Singer multiclass SVM | ψCS(y,u) = maxy'≠y(1 − (uy − uy'))+ | × | ✓ | ✓ |
| One-vs-all logistic regression | ψOvA,log(y,u) = log(1 + e−uy) + ∑y'≠y log(1 + euy') | ✓ | ✓ | × (we give a fix) |
| One-vs-all SVM | ψOvA,hinge(y,u) = (1 − uy)+ + ∑y'≠y(1 + uy')+ | × | ✓ | × (we give a fix) |
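For concreteness, here is a small PyTorch sketch (ours) of the standard forms of the four surrogates in Table 1, vectorized over a batch; u has shape (m, n) with u[i, y'] the score of class y' on example i, and y holds the integer labels.

```python
import torch
import torch.nn.functional as F

def _ova_signs(u, y):
    # s[i, y'] = +1 if y' == y_i else -1 (the one-vs-all reduction of a multiclass label)
    one_hot = F.one_hot(y, num_classes=u.shape[-1]).to(u.dtype)
    return 2.0 * one_hot - 1.0

def psi_mlog(u, y):
    # Multiclass logistic: log sum_{y'} exp(u_{y'}) - u_y
    return torch.logsumexp(u, dim=-1) - u.gather(-1, y.unsqueeze(-1)).squeeze(-1)

def psi_cs(u, y):
    # Crammer-Singer: max_{y' != y} (1 - (u_y - u_{y'}))_+
    u_y = u.gather(-1, y.unsqueeze(-1))
    margins = (1.0 - (u_y - u)).clamp(min=0.0)
    margins = margins.masked_fill(F.one_hot(y, u.shape[-1]).bool(), 0.0)  # drop y' = y
    return margins.max(dim=-1).values

def psi_ova_log(u, y):
    # One-vs-all logistic: log(1 + e^{-u_y}) + sum_{y' != y} log(1 + e^{u_{y'}})
    return F.softplus(-_ova_signs(u, y) * u).sum(dim=-1)

def psi_ova_hinge(u, y):
    # One-vs-all hinge: (1 - u_y)_+ + sum_{y' != y} (1 + u_{y'})_+
    return (1.0 - _ova_signs(u, y) * u).clamp(min=0.0).sum(dim=-1)
```

Averaging any of these over a training sample and minimizing over a parametrized scoring function class yields the corresponding algorithm in Eq. (4).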
A natural question then is: Under what conditions do such surrogate risk minimization algorithms provide Bayes consistency or, for various model classes H of interest, H-consistency, for the target 0–1 loss?
Surrogate losses and Bayes consistency.
For Bayes consistency, the above question is answered by the notion of calibrated surrogates [2, 17, 16, 14, 13, 11]. Specifically, if a surrogate loss ψ is calibrated w.r.t. the 0–1 loss, then for any universal function class F, the (ψ, F) surrogate risk minimization algorithm (implemented with suitable regularization) is a Bayes consistent algorithm for ℓ0–1.¹ Among the surrogate losses shown in Table 1, ψmlog and ψOvA,log are universally calibrated for ℓ0–1 (calibrated for all probability distributions), while ψCS and ψOvA,hinge are calibrated under the so-called ‘dominant-label’ condition (calibrated for distributions in which the conditional distributions p(y|x) assign probability at least 1/2 to one of the n classes) [16].
Surrogate losses and H-consistency.
For H-consistency, the situation is more complex [3, 7]. In particular, Long and Servedio [7] showed the following results:²
- (1) Realizable H-consistency of the Crammer-Singer surrogate for closed-under-scaling models. Let F be any class of (vector) scoring functions that is closed under scaling, and let H denote the class of multiclass classification models induced by F (via argmax). Long and Servedio [7] showed that if the data distribution D is H-realizable (i.e. the data is labeled according to a true model h∗ ∈ H), then minimizing the Crammer-Singer surrogate ψCS over F is H-consistent, i.e. the (ψCS, F) surrogate risk minimization algorithm is H-consistent. This was viewed as surprising in light of the fact that ψCS is not (universally) calibrated for ℓ0–1.
- (2) Lack of realizable H_lin-consistency of the one-vs-all logistic surrogate for linear models. Let F_lin be the class of linear (vector) scoring functions and H_lin the class of linear multiclass classification models:

F_lin = { f : X → R^n | f_y(x) = w_y·x ∀y, for some w_1, …, w_n ∈ R^d }    (6)

H_lin = { h : X → [n] | h(x) ∈ argmax_{y∈[n]} w_y·x ∀x, for some w_1, …, w_n ∈ R^d }    (7)
Long and Servedio [7] showed that even if the data distribution D is H_lin-realizable (i.e. the data is labeled according to a true linear model h∗ ∈ H_lin), minimizing the one-vs-all logistic surrogate ψOvA,log over F_lin fails to give an H_lin-consistent algorithm, i.e. the (ψOvA,log, F_lin) surrogate risk minimization algorithm is not H_lin-consistent, even though ψOvA,log is universally calibrated for ℓ0–1.
Our contributions.
As discussed by Long and Servedio [7] and summarized above, it seems peculiar that the Crammer-Singer surrogate ψCS, which is not universally calibrated for ℓ0–1, provides realizable H_lin-consistency (and more generally, realizable H-consistency for models H induced by closed-under-scaling scoring function classes F), while the one-vs-all logistic surrogate, ψOvA,log, which is universally calibrated for ℓ0–1, fails to provide realizable H_lin-consistency. In this paper, we investigate this apparent conundrum.
First, regarding result (1) of Long and Servedio [7] above, we note that any realizable distribution D (i.e. a distribution that labels data points x according to a deterministic model y = h(x)) trivially satisfies the dominant-label condition (for each x, one class y has conditional probability p(y|x) = 1), and therefore the Crammer-Singer surrogate ψCS is in fact calibrated for any such distribution. Therefore, in the realizable setting studied by Long and Servedio [7], the surrogate ψCS is in fact calibrated for ℓ0–1 (the paper emphasizes that ψCS is not calibrated/consistent for ℓ0–1, implicitly referring to universal calibration, and misses the fact that it is indeed calibrated for the setting studied). So, while result (1) is still interesting and non-trivial, it should be kept in mind that under the realizable setting studied in [7], all the surrogates studied by the authors are in fact calibrated for ℓ0–1.
Second, and more importantly, we look into result (2) of Long and Servedio [7] above. We know that minimizing the one-vs-all logistic surrogate ψOvA,log over a universal scoring function class gives Bayes consistency for all distributions D. Therefore, for H_lin-realizable distributions D, for which the Bayes optimal error is attained by a model in H_lin and therefore Bayes consistency is equivalent to H_lin-consistency, we have that minimizing the ψOvA,log surrogate over such a class gives an H_lin-consistent algorithm. So why does minimizing the same surrogate over the class F_lin of linear scoring functions fail in this regard? On closer inspection, we find that an important part of the answer lies in the form of the decision boundaries induced by a linear (or more generally, affine) multiclass classification model. As an example, Figure 1 shows an affine 4-class model in a 2-dimensional instance space; specifically, the figure shows the 4 affine scoring functions for the 4 classes, and the corresponding decision regions. As can be seen, the one-vs-all boundaries induced by such a model are not linear! Indeed, in general, each class region is a convex polytope, separated from the rest of the classes by a piecewise linear decision boundary (where the boundaries for different classes include shared pieces). Therefore, when a one-vs-all classifier is forced to separate each class from the rest using a linear decision boundary, it can end up learning a suboptimal separator.
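The following small numpy sketch (ours, not from the paper) illustrates this point on a fixed 4-class linear model in 2 dimensions: the points of class 0 that lie near its decision boundary compete against more than one other class, so the one-vs-all boundary of class 0 is made up of pieces of different hyperplanes rather than a single hyperplane.

```python
import numpy as np

# A fixed linear 4-class model in 2-D: class y has score w_y.x (zero biases for simplicity).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

xs = np.linspace(-1.0, 1.0, 401)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
scores = grid @ W.T                             # scores[i, y] = w_y . x_i
labels = scores.argmax(axis=1)

# For points of class 0 lying near its decision boundary (small gap between the top two
# scores), record the runner-up class: more than one runner-up class means the one-vs-all
# boundary of class 0 is made of pieces of different hyperplanes, i.e. is piecewise linear.
masked = scores.copy()
masked[np.arange(len(labels)), labels] = -np.inf
gap = scores.max(axis=1) - masked.max(axis=1)
near = (labels == 0) & (gap < 0.02)
print(sorted(set(masked[near].argmax(axis=1).tolist())))   # prints [1, 3]
```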
Figure 1:
Example in d = 2 dimensions with n = 4 classes. Top row: True linear 4-class classifier h∗ ∈ H_lin. The first 4 plots show contours of the 4 linear scoring functions (darker shades represent higher values); the 5th plot shows the decision regions of the classifier h∗. As can be seen, each class region is a convex polytope, separated from the rest by a piecewise linear decision boundary. Middle row: Contours of scoring functions and decision regions learned from training data labeled according to h∗ by minimizing the one-vs-all logistic surrogate ψOvA,log over the linear scoring function class F_lin. The generalization accuracy is 0.886. Bottom row: Contours of scoring functions and decision regions learned from the same training data by minimizing the same one-vs-all logistic surrogate ψOvA,log over the ‘shared’ piecewise linear scoring function class introduced in Section 3. The decision regions are closer to the true model, and the generalization accuracy is 0.986. Since the piecewise linear functions have shared pieces, the number of parameters to be learned is the same as in the linear case; moreover, the resulting model can also be transformed to a linear model if desired (see Theorem 3 and Corollary 4). See Section 5 for details of the experimental setup.
In the rest of the paper, we use the above insight to design a special class of ‘shared’ piecewise linear scoring functions such that minimizing the one-vs-all logistic surrogate ψOvA,log over this class yields an H_lin-consistent algorithm. We will see that this class is characterized by the same number of parameters as F_lin; in fact, it will also be parametrized by n weight vectors w_1, …, w_n ∈ R^d.³ In order to minimize ψOvA,log over this scoring function class, we will make use of an adaptation of the min-pooling idea from neural network training. The same idea can be applied to other one-vs-all surrogates as well; in our experiments, we consider both ψOvA,log and the one-vs-all SVM surrogate ψOvA,hinge, and find that in both cases, while minimizing these surrogates over the class F_lin of linear scoring functions fails to provide H_lin-consistency, minimizing them over the nonlinear piecewise linear scoring function class does indeed provide H_lin-consistency.
An additional interesting aspect of this scoring function class is that, while the individual scoring functions in it are nonlinear (specifically, piecewise linear), the classification models obtained by taking the highest-scoring class according to these scoring functions can also be expressed as linear models. Therefore, having learned a classifier by minimizing a one-vs-all surrogate over this nonlinear scoring function class, one can then convert the learned model to a linear model in H_lin.
We believe our study can pave the way for a more thorough understanding of the role of surrogate losses in H-consistency. In particular, our results suggest that, when studying H-consistency, one needs to carefully take into account the interplay between surrogate losses and the scoring function class over which they are minimized, and that this can lead to unexpected improvements to learning algorithms used in practice.
Organization.
We start by giving various formal definitions in Section 2. We then describe the class of ‘shared’ piecewise linear scoring functions and give our associated consistency result in Section 3. We discuss how to minimize one-vs-all surrogates over this scoring function class in practice in Section 4, and describe our numerical experiments in Section 5. Section 6 concludes with a brief summary. Additional details/proofs are provided in the supplementary material.
2. Formal Definitions: Consistency, Calibration, Realizability
Consistency.
We start with formal definitions of Bayes consistency and H-consistency:
Definition 1 (Bayes consistency).
We say a multiclass learning algorithm that maps training samples S to multiclass models h_S : X → [n] is Bayes consistent (w.r.t. ℓ0–1) for a distribution D if for all 𝜖 > 0,

P_{S∼D^m}( er_D[h_S] > inf_{h : X→[n]} er_D[h] + 𝜖 ) → 0  as m → ∞.
If an algorithm is Bayes consistent for all distributions D, we say it is universally Bayes consistent.
Definition 2 (H-consistency).

Let H be a class of multiclass models h : X → [n]. We say a multiclass learning algorithm that maps training samples S to multiclass models h_S : X → [n] is H-consistent (w.r.t. ℓ0–1) for a distribution D if for all 𝜖 > 0,

P_{S∼D^m}( er_D[h_S] > inf_{h∈H} er_D[h] + 𝜖 ) → 0  as m → ∞.
Note that we do not require the algorithm to produce a model in H; we only require that as m → ∞, the performance of the model it learns approaches that of the best model in H. If an algorithm is H-consistent for all distributions D, we say it is universally H-consistent.
Calibration.
Next, we give the standard definition of calibration of a surrogate loss that is useful for studying Bayes consistency of surrogate risk minimization algorithms, followed by a definition of calibration w.r.t. a model class H that is useful for studying H-consistency of such algorithms. To give these definitions, for a surrogate loss ψ : [n] × R^n → R+, we need the following notions of ψ-generalization error, Bayes optimal ψ-error, and optimal ψ-error in a scoring function class F:
er^ψ_D[f] = E_(x,y)∼D[ ψ(y, f(x)) ];   er^{ψ,*}_D = inf_{f : X→R^n} er^ψ_D[f];   er^{ψ,*}_{D,F} = inf_{f∈F} er^ψ_D[f].    (8)
Definition 3 (Calibration (standard definition)).
We say a surrogate loss ψ : [n] × R^n → R+ is calibrated w.r.t. ℓ0–1 for a distribution D if there exists a strictly increasing function g : R+ → R+ that is continuous at 0 with g(0) = 0 such that for all f : X → R^n,

er_D[argmax ∘ f] − inf_{h : X→[n]} er_D[h] ≤ g( er^ψ_D[f] − er^{ψ,*}_D ),

where h ≡ argmax ∘ f denotes a classifier that satisfies h(x) ∈ argmax_{y∈[n]} f_y(x) ∀x. If ψ is calibrated w.r.t. ℓ0–1 for all distributions D, we say ψ is universally calibrated w.r.t. ℓ0–1.
Definition 4 (Calibration w.r.t. H).

For a class of multiclass models H, a surrogate loss ψ : [n] × R^n → R+, and a scoring function class F, we say the pair (ψ, F) is calibrated w.r.t. H for a distribution D if there exists a strictly increasing function g : R+ → R+ that is continuous at 0 with g(0) = 0 such that for all f ∈ F,

er_D[argmax ∘ f] − inf_{h∈H} er_D[h] ≤ g( er^ψ_D[f] − er^{ψ,*}_{D,F} ),

where h ≡ argmax ∘ f denotes a classifier that satisfies h(x) ∈ argmax_{y∈[n]} f_y(x) ∀x. If (ψ, F) is calibrated w.r.t. H for all distributions D, we say (ψ, F) is universally calibrated w.r.t. H.
Realizability and realizable calibration/consistency.
Finally, we give formal definitions of realizable and H-realizable distributions, realizable calibration, and Long and Servedio’s definition of realizable H-consistency.
Definition 5 (Realizability and H-realizability).

We say a distribution D over X × [n] is realizable if (almost surely) it labels points according to a deterministic model, i.e. if there exists h : X → [n] such that P_(x,y)∼D( y = h(x) ) = 1. For a class H of multiclass models, we say a distribution D over X × [n] is H-realizable if (almost surely) it labels points according to a deterministic model in H, i.e. if there exists h ∈ H such that P_(x,y)∼D( y = h(x) ) = 1.
Definition 6 (Realizable calibration).
We say a surrogate loss ψ is realizable calibrated (w.r.t. ℓ0–1) if it is calibrated (w.r.t. ℓ0–1) for all realizable distributions.⁴
Definition 7 (Long and Servedio’s definition of realizable H-consistency [7]).

Let F be a class of (vector) scoring functions, and let H be the class of multiclass models induced by F (via argmax). A surrogate loss ψ is realizable H-consistent if (ψ, F) is calibrated w.r.t. H for all H-realizable distributions.⁵,⁶
3. Minimizing One-vs-All Surrogates over a Class of ‘Shared’ Piecewise Linear Scoring Functions is H_lin-Consistent
As discussed in Section 1, even though the one-vs-all logistic surrogate ψOvA,log is universally calibrated for ℓ0–1, Long and Servedio [7] showed that the (ψOvA,log, F_lin) surrogate risk minimization algorithm, which minimizes ψOvA,log over the class of linear scoring functions F_lin, is not H_lin-consistent even when the data distribution D is H_lin-realizable. In this section, we remedy this situation by showing how to minimize the same surrogate loss ψOvA,log (as well as other one-vs-all surrogate losses) over a different, nonlinear scoring function class such that the resulting algorithm is H_lin-consistent for all H_lin-realizable distributions D.
Linear models.
For the remainder of the paper, we will re-define the classes of linear scoring functions F_lin and linear classification models H_lin to allow for the inclusion of bias/offset terms:
F_lin = { f : X → R^n | f_y(x) = w_y·x + b_y ∀y, for some w_1, …, w_n ∈ R^d, b_1, …, b_n ∈ R }    (9)

H_lin = { h : X → [n] | h(x) ∈ argmax_{y∈[n]} (w_y·x + b_y) ∀x, for some w_1, …, w_n ∈ R^d, b_1, …, b_n ∈ R }    (10)
Our conclusions will apply both in this more general setting, and in the special case where by = 0 ∀y.
‘Shared’ piecewise linear scoring functions.
To motivate the scoring function class we will construct, consider again the example in Figure 1. As this example makes clear, under a linear classification model in H_lin defined by weight vectors w_1, …, w_n and bias terms b_1, …, b_n, the decision region corresponding exclusively to class y ∈ [n] is the (open) convex polytope given by
C_y = { x ∈ X : w_y·x + b_y > w_{y'}·x + b_{y'} ∀ y' ≠ y }    (11)

    = { x ∈ X : (w_y − w_{y'})·x + (b_y − b_{y'}) > 0 ∀ y' ≠ y }    (12)

    = { x ∈ X : min_{y'≠y} ( (w_y − w_{y'})·x + (b_y − b_{y'}) ) > 0 }.    (13)
We use this observation to construct the following special class F_pw of ‘shared’ piecewise linear scoring functions:

F_pw = { f : X → R^n | f_y(x) = min_{y'≠y} ( (w_y − w_{y'})·x + (b_y − b_{y'}) ) ∀y, for some w_1, …, w_n ∈ R^d, b_1, …, b_n ∈ R }.    (14)
Clearly, this class is parametrized by the same number of parameters as F_lin. The reason that the class F_pw is useful is that the scoring functions in this class allow for learning precisely the form of one-vs-all decision boundaries that are induced by linear multiclass models. In particular, we have the following result:
Lemma 1
(Scoring functions in F_pw capture correct one-vs-all decision boundaries for linear multiclass models). Let f ∈ F_pw be parametrized by w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R. Then for each y ∈ [n],

{ x ∈ X : f_y(x) > 0 } = C_y.    (15)
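As a quick numerical sanity check (ours, not part of the paper), the following numpy snippet verifies on random data that the score in Eq. (14) is positive exactly at points where class y has the strictly highest linear score w_y·x + b_y.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 3, 1000
W, b = rng.standard_normal((n, d)), rng.standard_normal(n)
X = rng.standard_normal((m, d))

g = X @ W.T + b                                  # g[i, y] = w_y.x_i + b_y
diffs = g[:, :, None] - g[:, None, :]            # diffs[i, y, y'] = g_y(x_i) - g_{y'}(x_i)
diffs[:, np.arange(n), np.arange(n)] = np.inf    # exclude y' = y from the min
f = diffs.min(axis=2)                            # f[i, y] as in Eq. (14)

# f_y(x) > 0 exactly when class y attains the (strictly) highest linear score at x:
assert np.array_equal(f > 0, g.argmax(axis=1)[:, None] == np.arange(n))
```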
Since one-vs-all surrogates effectively learn scoring functions that aim to separate points x with label y from points with other labels according to whether fy(x) ≥ 0, Lemma 1 implies that minimizing such surrogates over the class F_pw should allow learning precisely the form of one-vs-all separation boundaries induced by linear multiclass models. Formally, we have the following H_lin-consistency result:
Theorem 2
(H_lin-consistency of the (ψOvA,log, F_pw) surrogate risk minimization algorithm). The pair (ψOvA,log, F_pw) is calibrated w.r.t. H_lin for all H_lin-realizable distributions.
Remark 1
(Generalization to other one-vs-all surrogates). The above H_lin-consistency result can be generalized to other one-vs-all surrogates, such as the one-vs-all hinge surrogate ψOvA,hinge.
Remark 2
(Loss of ‘independence’ of one-vs-all binary classifiers). Since the n components of the (vector) scoring functions in F_pw share parameters, they can no longer be learned independently by training separate binary classifiers in parallel; while minimizing a one-vs-all surrogate over F_pw still amounts to learning binary separators for each of the classes versus the rest, these separators must be learned together in an “all-in-one” multiclass learning algorithm.
Remark 3
(Non-convexity of resulting optimization problems). Although the one-vs-all surrogates ψOvA,log and ψOvA,hinge are convex, minimizing these surrogates over the function class F_pw results in non-convex optimization problems. In order to solve these optimization problems, our implementation makes use of an adaptation of the min-pooling idea from neural network training (see Section 4). Additional details regarding the behavior of this approach in our experiments are discussed in Section 5.
We also have the following result, which shows that the classification models induced by the nonlinear (vector) scoring functions in F_pw are in fact equivalent to those in the class of linear classification models H_lin:
Theorem 3
(Scoring functions in F_pw induce linear multiclass classifiers). Let H_pw be the class of multiclass classifiers induced by F_pw:

H_pw = { argmax ∘ f : f ∈ F_pw }.    (16)

Then H_pw = H_lin.
Indeed, the following corollary shows that once we have learned a nonlinear (vector) scoring function in F_pw, we can easily transform it into a linear classification model in H_lin:
Corollary 4
(Converting a nonlinear scoring function in F_pw to a linear classification model in H_lin). Let f ∈ F_pw be parametrized by w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R. Then for all x ∈ X,

argmax_{y∈[n]} f_y(x) = argmax_{y∈[n]} ( w_y·x + b_y ).    (17)
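In practice this conversion is immediate because the piecewise linear scorer and the linear model share the same parameters (w_y, b_y); the following small numpy check (ours) confirms that their argmax predictions coincide, with random parameters standing in for learned ones.

```python
import numpy as np

def to_linear_classifier(W, b):
    """Given the parameters (w_y, b_y) of a learned scoring function in the piecewise class,
    return the equivalent linear multiclass classifier of Corollary 4."""
    return lambda X: (X @ W.T + b).argmax(axis=1)

rng = np.random.default_rng(3)
W, b, X = rng.standard_normal((6, 4)), rng.standard_normal(6), rng.standard_normal((2000, 4))
g = X @ W.T + b
diffs = g[:, :, None] - g[:, None, :]
diffs[:, np.arange(6), np.arange(6)] = np.inf
assert np.array_equal(diffs.min(axis=2).argmax(axis=1), to_linear_classifier(W, b)(X))
```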
Margin interpretation of F_pw.
We note that the scoring functions in F_pw can also be viewed as computing a multiclass ‘margin’ vector over the underlying linear functions defining the shared piecewise linear scores. Specifically, recall that a (vector) scoring function f ∈ F_pw has the form

f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ),  where g_y(x) = w_y·x + b_y,    (18)

for some w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R. This suggests that for each y, the score fy(x) effectively computes the ‘margin’ of separation between g_y(x) and max_{y'≠y} g_{y'}(x); if this margin is non-negative, then y is a highest-scoring class under the underlying linear model, and if it is negative, then it is not.
Generalization to other multiclass models.

The above construction can be generalized beyond H_lin to other classes of multiclass models defined in terms of a class G of real-valued scoring functions g : X → R. Specifically, for any such class G, let

H_G = { h : X → [n] | h(x) ∈ argmax_{y∈[n]} g_y(x) ∀x, for some g_1, …, g_n ∈ G }.    (19)

(Thus H_lin is a special case with G = { g : g(x) = w·x + b, w ∈ R^d, b ∈ R }.) Define the class of ‘shared’ piecewise-difference-of-G scoring functions as follows:

F_{G,pw} = { f : X → R^n | f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ) ∀y, for some g_1, …, g_n ∈ G }.    (20)

Then, similarly to the linear case, it can be shown that minimizing either of the one-vs-all surrogates ψOvA,log or ψOvA,hinge over F_{G,pw} is H_G-consistent for all H_G-realizable distributions; a generic sketch of the construction in Eq. (20) follows below.
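The construction in Eq. (20) needs only the base scores themselves, so it can be written generically; the following numpy sketch (ours, with purely illustrative base scorers) builds the shared piecewise-difference scores from arbitrary callables g_1, …, g_n.

```python
import numpy as np

def shared_pairwise_min_scores(base_scorers, X):
    """base_scorers: n callables, each mapping an (m, d) array to an (m,) array of scores."""
    G = np.stack([g(X) for g in base_scorers], axis=1)   # G[i, y] = g_y(x_i)
    n = G.shape[1]
    diffs = G[:, :, None] - G[:, None, :]                # g_y(x_i) - g_{y'}(x_i)
    diffs[:, np.arange(n), np.arange(n)] = np.inf        # drop y' = y
    return diffs.min(axis=2)                             # f[i, y] as in Eq. (20)

# Illustration with three arbitrary (non-linear) base scorers:
X = np.random.default_rng(2).standard_normal((5, 2))
gs = [lambda X, c=c: -((X - c) ** 2).sum(axis=1) for c in (0.0, 1.0, -1.0)]
F_scores = shared_pairwise_min_scores(gs, X)
print(F_scores.argmax(axis=1))   # matches the argmax over the base scores g_y themselves
```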
4. Implementation of One-vs-All Surrogate Risk Minimization over F_pw
In order to implement surrogate risk minimization over the scoring function class F_pw, we make use of an adaptation of the min-pooling idea from neural network training. Figure 2 shows a summary of the architecture we use to implement scoring functions f in F_pw.
Figure 2:
Neural network-like architecture implementing scoring functions in F_pw. To find parameters minimizing a surrogate loss ψ on the training data, we use a backpropagation-like procedure on this architecture.
Specifically, given an input point x ∈ X, the first layer computes the n linear functions

g_y(x) = w_y·x + b_y,   y ∈ [n].

The second layer then computes the n scoring function components fy(x) in terms of minima of the relevant functions from the first layer (see Eq. (14)):

f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ),   y ∈ [n].
To fit the parameters to training data, we then use a backpropagation-like procedure to minimize the surrogate loss of interest. Any existing neural network training library can be easily modified to perform this minimization; in our experiments, we implemented this approach using PyTorch [10].
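The paper’s implementation uses PyTorch; the following is our own minimal sketch of such an architecture and training loop (the class and function names, the full-batch training, and the hyperparameter defaults are our choices), using the one-vs-all logistic surrogate as the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPiecewiseLinearScorer(nn.Module):
    """Sketch of the two-layer 'min-pooling' architecture: a linear first layer computes
    g_y(x) = w_y.x + b_y; the second layer outputs f_y(x) = min_{y' != y} (g_y(x) - g_{y'}(x))."""

    def __init__(self, d, n):
        super().__init__()
        self.linear = nn.Linear(d, n)                    # holds the n weight vectors and biases

    def forward(self, x):                                # x: (batch, d)
        g = self.linear(x)                               # (batch, n)
        diffs = g.unsqueeze(2) - g.unsqueeze(1)          # diffs[:, y, y'] = g_y - g_{y'}
        mask = torch.eye(g.shape[1], dtype=torch.bool, device=g.device)
        diffs = diffs.masked_fill(mask, float('inf'))    # exclude y' = y from the min
        return diffs.min(dim=2).values                   # (batch, n) shared piecewise scores

def train_ova_logistic(model, X, Y, epochs=50, lr=1e-2, weight_decay=0.0):
    """Minimize the one-vs-all logistic surrogate over the above scoring function class."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        opt.zero_grad()
        f = model(X)                                      # f[i, y] = f_y(x_i)
        s = 2.0 * F.one_hot(Y, f.shape[1]).float() - 1.0  # +1 for the true class, -1 otherwise
        loss = F.softplus(-s * f).sum(dim=1).mean()       # psi_OvA,log averaged over the sample
        loss.backward()
        opt.step()
    return model
```

Swapping the softplus term for a hinge term (1 − s·f)+ gives the corresponding ψOvA,hinge version; in both cases the objective is non-convex in the parameters, as noted in Remark 3.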
5. Experiments
We conducted two sets of experiments. In the first set, we generated synthetic data from a true linear model (i.e. a known H_lin-realizable distribution) and tested the H_lin-consistency of minimizing one-vs-all surrogates over F_pw. In the second set, we implemented the approach on various real benchmark data sets to test its practical behavior. In all cases, we implemented a total of 6 multiclass algorithms: all 4 algorithms shown in Table 1 with surrogate risk minimized over the linear scoring function class F_lin, and the two one-vs-all algorithms with surrogate risk minimized over F_pw. All algorithms were implemented in PyTorch and used the AdamW optimizer [8].⁷,⁸
5.1. Synthetic Data: Consistency Behavior on Linear Models
We generated two synthetic data sets. The first data set had d = 2 features and n = 4 classes. A true model h∗ ∈ H_lin was created by choosing w_1, …, w_4 ∈ R^2 and b_1, …, b_4 as follows: the elements of w_1, …, w_4 were drawn i.i.d. at random and the vectors subsequently scaled so that ||wy||2 = 1 ∀y; the bias terms b1, …, b4 were set to 0.2, 0.1, −0.1, −0.2 (decision regions of the resulting model h∗ are shown in Figure 1). Instances x were then drawn uniformly at random from a disk of radius 0.5 centered at (0.3, −0.1), and labeled according to h∗. We ran all 6 algorithms (using AdamW with zero weight decay factor) on increasingly large training samples (up to 30,000 data points) generated in this manner, and measured the generalization accuracy on a large test set of 10,000 data points generated in the same manner. The results are shown in Figure 3 (left); an illustration of some of the models learned from 10,000 data points is also shown in Figure 1.
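A sketch (ours) of this data-generating process is given below; the distribution of the raw weight entries before normalization is not specified above, so the standard normal used here is an assumption, and the function name is our own.

```python
import numpy as np

def make_synthetic_2d(m, seed=0):
    """Generate m labeled points from a 4-class linear model of the kind described above."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((4, 2))                     # assumed raw distribution (see note above)
    W /= np.linalg.norm(W, axis=1, keepdims=True)       # scale so that ||w_y||_2 = 1
    b = np.array([0.2, 0.1, -0.1, -0.2])
    r = 0.5 * np.sqrt(rng.uniform(size=m))              # uniform over a disk of radius 0.5
    theta = rng.uniform(0.0, 2.0 * np.pi, size=m)
    X = np.stack([0.3 + r * np.cos(theta), -0.1 + r * np.sin(theta)], axis=1)
    Y = (X @ W.T + b).argmax(axis=1)                    # label according to the true model h*
    return X, Y, W, b
```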
Figure 3:
Convergence behavior of various multiclass surrogate risk minimization algorithms on synthetic data generated from a true linear model in H_lin. Left: d = 2, n = 4. Right: d = 100, n = 10. In both cases, the ψOvA,log and ψOvA,hinge surrogates fail to converge to the optimal performance in H_lin when minimized over standard linear scoring functions F_lin, but successfully do so when minimized over the class F_pw. (In the right plot, the curves for all four H_lin-consistent algorithms overlap.) See Section 5.1 for details.
The second data set had d = 100 features and n = 10 classes. A true model h∗ ∈ H_lin was created in the same manner as above, except that in this case we set by = 0 ∀y ∈ [10]. Instances x were drawn uniformly at random from X = [−1, 1]^100, and labeled according to h∗. We ran all 6 algorithms on increasingly large training samples (up to 40,000 data points) and measured accuracy on a large test set of 10,000 data points. The results are shown in Figure 3 (right).
In both cases, the one-vs-all surrogates fail to give H_lin-consistency when minimized over the linear scoring functions F_lin, but successfully do so when minimized over the scoring function class F_pw.
5.2. Real Data: Practical Behavior
We evaluated the performance of all 6 algorithms on various benchmark multiclass classification data sets drawn from the UCI repository and the LIBSVM data repository. Details of the data sets are provided in the supplementary material; the number of features d ranges from 16 to 3072, and the number of classes n ranges from 7 to 26. Several of the data sets come with prescribed train/validation/test splits; for the others, we randomly chose a 3:1:1 split. For all algorithms, we used AdamW with a weight decay factor λ; the factor λ was chosen from {10^−3, 10^−2, …, 10^2} to maximize 0–1 accuracy on the validation set.
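A minimal sketch (ours) of this validation-based selection of the weight decay factor is shown below; `fit` and `accuracy` are hypothetical placeholders for the actual training and evaluation routines.

```python
def select_weight_decay(fit, accuracy, train_data, val_data,
                        grid=(1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2)):
    """Pick the AdamW weight decay factor that maximizes 0-1 accuracy on the validation set."""
    best = (None, float('-inf'), None)                  # (lambda, accuracy, model)
    for lam in grid:
        model = fit(train_data, weight_decay=lam)       # e.g. AdamW(..., weight_decay=lam)
        acc = accuracy(model, val_data)
        if acc > best[1]:
            best = (lam, acc, model)
    return best
```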
The results are shown in Table 2. For each data set, the best-performing algorithms within the group of logistic surrogates and within the group of hinge surrogates are shown in bold font; the best overall is enclosed in asterisks. For hinge surrogates, consistent with previous results [5], we find that ψOvA,hinge, when minimized over F_lin, does slightly poorer than ψCS, but minimizing it over F_pw brings it in line with (and even slightly exceeds) ψCS. For logistic surrogates, the results are more mixed, although minimization over F_pw frequently outperforms minimization over F_lin. Overall, despite the good performance, we do not necessarily advocate minimizing one-vs-all surrogates over F_pw as a practical strategy, as training is 2–3 times slower than for ψCS or ψmlog, which generally give comparable results. Our primary interest is in the H_lin-consistency of this scheme under H_lin-realizable data distributions; the main purpose of the experiments on real data was to serve as a sanity check and ensure that this does not come at a huge price in terms of practical applicability of the resulting algorithms.
Table 2:
Results (in terms of test accuracy) on various real multiclass data sets. See Section 5.2 for details.
| Data set | ψmlog (over F_lin) | ψOvA,log (over F_lin) | ψOvA,log (over F_pw) | ψCS (over F_lin) | ψOvA,hinge (over F_lin) | ψOvA,hinge (over F_pw) |
|---|---|---|---|---|---|---|
| Covertype (50K) | 0.6606 | 0.6943 | 0.6607 | 0.7186 | 0.7069 | *0.7193* |
| Digits | 0.8985 | 0.8696 | 0.8982 | 0.9025 | 0.8819 | *0.9042* |
| USPS | *0.9153* | 0.9138 | 0.9148 | 0.9128 | 0.9063 | 0.9148 |
| MNIST (70K) | 0.9270 | 0.9200 | 0.9271 | 0.9307 | 0.9216 | *0.9317* |
| CIFAR10 | 0.4000 | *0.4066* | 0.3763 | 0.3831 | 0.3686 | 0.4006 |
| Sensorless | *0.8266* | 0.6539 | 0.7918 | 0.7703 | 0.5381 | 0.7791 |
| Letter | 0.7644 | 0.7126 | 0.7662 | 0.7738 | 0.6058 | *0.7804* |
6. Conclusion
Our study shows that when studying H-consistency of surrogate risk minimization algorithms, the interplay between the surrogate loss and the scoring function class can play an important role. In particular, for ψOvA,log and ψOvA,hinge, we found that minimization over the suitably designed nonlinear function class F_pw gives H_lin-consistency where standard minimization over the linear scoring functions F_lin fails to do so.
Supplementary Material
Broader Impact.
The primary goal of this paper is to better understand the statistical consistency properties of surrogate risk minimization algorithms in machine learning. The insights and results of the paper will benefit readers who wish to be aware of these properties when designing or selecting learning algorithms.
We do not expect this research to put anyone at a disadvantage. Nevertheless, issues related to data bias and fairness can potentially affect any algorithm that learns models from data [9], and users should keep this in mind when applying the ideas discussed here to domains where such issues may be important. In the future, it may also be of interest to consider incorporating fairness constraints in the types of algorithms discussed here.
Acknowledgments and Disclosure of Funding
Thanks to Avrim Blum for early discussions related to this work. Part of the motivation for this work also came from discussions following a talk by SA at a workshop on machine learning theory held at Google NYC in September 2019; thanks to all the participants of the workshop for stimulating discussions. We also thank the anonymous referees for helpful comments.
This material is based upon work supported in part by the US National Science Foundation (NSF) under Grant No. 1934876. SA is also supported in part by the US National Institutes of Health (NIH) under Grant No. U01CA214411. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or NIH.
Appendix
A. Proof of Lemma 1
Proof.
This essentially follows from the definition of F_pw. In particular, for any x ∈ X and y ∈ [n], we have:

f_y(x) > 0 ⟺ min_{y'≠y} ( (w_y − w_{y'})·x + (b_y − b_{y'}) ) > 0 ⟺ w_y·x + b_y > w_{y'}·x + b_{y'} ∀ y' ≠ y ⟺ x ∈ C_y.
B. Proof of Theorem 2
Proof.
Let D be an H_lin-realizable distribution. Then there exists h∗ ∈ H_lin such that P_(x,y)∼D( y = h∗(x) ) = 1, and therefore inf_{h∈H_lin} er_D[h] = 0. Thus our goal is to show that ∃ a strictly increasing function g : R+ → R+ that is continuous at 0 with g(0) = 0 such that for all f ∈ F_pw,

er_D[argmax ∘ f] ≤ g( er^{ψOvA,log}_D[f] − er^{ψOvA,log,*}_{D,F_pw} ).
We will do this in two parts:
We will show that .
We will show that for all , .
Putting these together will then give that for all ,
Part 1.
We will show that for any sufficiently small such that ; this will establish that .
Let 0 < ϵ < 2n ln(2). Since , we have such that
Define as
Then we have
Therefore ∃κ > 0 such that
Define as
Then it can be verified that
and moreover,
This gives
Part 2.
Let , and let be such that
Define such that
Then we have
C. Proof of Theorem 3
Proof.
Let x ∈ X, and let f ∈ F_pw be parametrized by w_1, …, w_n ∈ R^d and b_1, …, b_n ∈ R, so that f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ) with g_y(x) = w_y·x + b_y.

We will show that

argmax_{y∈[n]} f_y(x) = argmax_{y∈[n]} g_y(x) = argmax_{y∈[n]} ( w_y·x + b_y );

this will establish the result.

To see that the above claim is true, notice that we can write

f_y(x) = min_{y'≠y} ( g_y(x) − g_{y'}(x) ) = g_y(x) − max_{y'≠y} g_{y'}(x).

In other words, fy(x) is the difference between g_y(x) and the largest value of g_{y'}(x) among y′ ≠ y. Clearly, this difference is largest when y ∈ argmax_{y'∈[n]} g_{y'}(x) (in particular, in this case the difference is non-negative; in all other cases, the difference is negative, and therefore smaller). Thus

argmax_{y∈[n]} f_y(x) = argmax_{y∈[n]} g_y(x).

This proves the claim.
D. Proof of Corollary 4
This follows directly from the proof of Theorem 3.
E. Details of Real Data Sets Used in Experiments in Section 5.2
Table 3:
Multiclass classification data sets used in experiments in Section 5.2.
| Data set | # train | # validation | # test | # classes (n) | # features (d) |
|---|---|---|---|---|---|
| Covertype (50K) | 30000 | 10000 | 10000 | 7 | 54 |
| Digits | 5620 | 1874 | 3498 | 10 | 16 |
| USPS | 5468 | 1823 | 2007 | 10 | 256 |
| MNIST (70K) | 45000 | 15000 | 10000 | 10 | 780 |
| CIFAR10 | 37500 | 12500 | 10000 | 10 | 3072 |
| Sensorless | 35105 | 11702 | 11702 | 11 | 48 |
| Letter | 10500 | 4500 | 5000 | 26 | 16 |
Notes:
Subsampling: For Covertype, we used a random subsample of the original data set containing 50,000 examples (the original data set has 581,012 examples).
Image data sets with pixel features: The versions of the USPS and MNIST datasets that we used came with features scaled to the ranges [−1, 1] and [0, 1], respectively. For CIFAR10, we similarly scaled the features to the range [0, 1] by dividing all features by 255.
Footnotes
1. A universal function class is one that can approximate any continuous function; such classes can be obtained, for example, via reproducing kernel Hilbert spaces (RKHSs) associated with Gaussian kernels [12], or via sufficiently flexible neural networks [1].
2. Long and Servedio [7] presented the results slightly differently; in particular, in their case, the scoring function class refers to a class of real-valued functions from which individual scoring functions are drawn, and consistency is defined in terms of this class. We describe the results here in terms of our notation and terminology.
3. More generally, we will allow both the linear and piecewise linear classes to be characterized by n weight vectors w_1, …, w_n ∈ R^d and n bias/offset terms b_1, …, b_n ∈ R.
4. This is the sense used in Table 1, column 4.
5. Technically, Long and Servedio’s definition [7] applies to scoring function classes F for which individual scoring function components come independently from a common fixed class, i.e. for which there is a class G of real-valued functions such that each f ∈ F has components f_1, …, f_n ∈ G, and they would refer to such a surrogate as realizable G-consistent. We modify the terminology slightly to better fit our presentation of ideas, and the definition we give is slightly more general (in that it allows for more general scoring function classes F).
6. This is the sense used in Table 1, column 5.
7. As noted above, the minimization over F_pw is non-convex; we found that for most (but not all) data sets, the results were fairly stable under different random initializations. The results we report are for a single random initialization; our results could potentially be improved by starting the optimizer from multiple random initializations, and keeping the model with the best training objective value.
8. In all cases, the optimizer was run for 50 epochs over the training sample; the learning rate parameter α was initially set to 0.01 and was halved at the end of every 5 epochs.
Contributor Information
Mingyuan Zhang, University of Pennsylvania, Philadelphia, PA 19104.
Shivani Agarwal, University of Pennsylvania, Philadelphia, PA 19104.
References
- [1]. Anthony Martin and Bartlett Peter L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
- [2]. Bartlett Peter L., Jordan Michael, and McAuliffe Jon. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- [3]. Ben-David Shai, Loker David, Srebro Nathan, and Sridharan Karthik. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
- [4]. Crammer Koby and Singer Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
- [5]. Doğan Ürün, Glasmachers Tobias, and Igel Christian. A unified view on multi-class support vector classification. Journal of Machine Learning Research, 17(45):1–32, 2016.
- [6]. Lee Yoonkyung, Lin Yi, and Wahba Grace. Multicategory support vector machines: Theory and application to the classification of microarray data. Journal of the American Statistical Association, 99(465):67–81, 2004.
- [7]. Long Philip M. and Servedio Rocco A. Consistency versus realizable H-consistency for multiclass classification. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
- [8]. Loshchilov Ilya and Hutter Frank. Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR), 2019.
- [9]. Mehrabi Ninareh, Morstatter Fred, Saxena Nripsuta, Lerman Kristina, and Galstyan Aram. A survey on bias and fairness in machine learning. CoRR, abs/1908.09635, 2019.
- [10]. Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Kopf Andreas, Yang Edward, DeVito Zachary, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, and Chintala Soumith. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
- [11]. Pires Bernardo Ávila and Szepesvári Csaba. Multiclass classification calibration functions. CoRR, abs/1609.06385, 2016.
- [12]. Steinwart Ingo. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
- [13]. Steinwart Ingo. How to compare different loss functions and their risks. Constructive Approximation, 26:225–287, 2007.
- [14]. Tewari Ambuj and Bartlett Peter L. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
- [15]. Weston Jason and Watkins Chris. Support vector machines for multi-class pattern recognition. In Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN), 1999.
- [16]. Zhang Tong. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.
- [17]. Zhang Tong. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–134, 2004.