Robust Model-Free Multiclass Probability Estimation

Yichao Wu; Hao Helen Zhang; Yufeng Liu

doi:10.1198/jasa.2010.tm09107

. Author manuscript; available in PMC: 2010 Nov 24.

Published in final edited form as: J Am Stat Assoc. 2010 Mar 1;105(489):424–436. doi: 10.1198/jasa.2010.tm09107

Robust Model-Free Multiclass Probability Estimation

Yichao Wu ¹, Hao Helen Zhang ¹, Yufeng Liu ¹

PMCID: PMC2990887 NIHMSID: NIHMS250286 PMID: 21113386

Abstract

Classical statistical approaches for multiclass probability estimation are typically based on regression techniques such as multiple logistic regression, or density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These methods often make certain assumptions on the form of probability functions or on the underlying distributions of subclasses. In this article, we develop a model-free procedure to estimate multiclass probabilities based on large-margin classifiers. In particular, the new estimation scheme is employed by solving a series of weighted large-margin classifiers and then systematically extracting the probability information from these multiple classification rules. A main advantage of the proposed probability estimation technique is that it does not impose any strong parametric assumption on the underlying distribution and can be applied for a wide range of large-margin classification methods. A general computational algorithm is developed for class probability estimation. Furthermore, we establish asymptotic consistency of the probability estimates. Both simulated and real data examples are presented to illustrate competitive performance of the new approach and compare it with several other existing methods.

Keywords: Fisher consistency, Hard classification, Multicategory classification, Probability estimation, Soft classification, SVM

1. INTRODUCTION

Multiclass probability estimation is an important problem in statistics and data mining. Suppose we are given a sample {(x_i, y_i), i = 1, 2, …, n} consisting of iid observations from some unknown probability distribution P(X, Y), where x_i ∈ Inline graphic ⊂ ℜ^d denotes the input vector, y_i ∈ {1, 2, …, K} denotes the label, n is the sample size, d is the dimensionality of the input space, and K denotes the number of classes. The main goal is to estimate the conditional probabilities p_k(x) = P(Y = k|X = x), k = 1, …, K. This problem is also known as soft classification, since the estimated p_k’s can be used to determine the classification boundary among K classes and to predict class labels for future samples collected from the same population.

Traditionally, the probability estimation problem is commonly tackled by regression techniques such as multiple logistic regression, or the density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Agresti and Coull (1998) gave a thorough review on these methods. These methods often make certain model assumptions on the function forms of p_k’s (or their transformations) or on the underlying distributions of subclasses. For example, multiple logistic regression assumes that the logarithms of the odd ratios are linear in x,

log \frac{P (Y = k ∣ X = x)}{P (Y = 1 ∣ X = x)} = β_{k 0} + x^{T} β_{k}, \forall k \geq 2,

where class 1 is chosen as the baseline class. On the other hand, both LDA and QDA assume that the covariates X associated with each subclass follow a multivariate Gaussian distribution and construct the probability estimates as

P (Y = k ∣ X = x) = \frac{φ (x; μ_{k}, \sum_{k}) α_{k}}{\sum_{j = 1}^{K} φ (x; μ_{j}, \sum_{j}) α_{j}}, k = 1, \dots, K,

where φ(x; μ, Σ) is the density function of the multivariate Gaussian distribution associated with the mean μ and the covariance Σ and α_k = P(Y = k) is known as the prior probability of class k. These methods are widely used in practice. In many real applications, however, it is difficult to justify the assumption of the linear effects of covariates in the multiple logistic regression. Moreover, it is often difficult to validate the Gaussian assumption for multivariate data. If the distribution is very skewed, some proper transformation is needed to make data approximately Gaussian, which can be nontrivial for multivariate data. These issues become even more challenging for high-dimensional data.

In this article, we propose a new class of model-free methods for estimating multiclass probabilities. The new method does not make any assumption on the forms of p_k’s or the distribution for each subclass. Different from the traditional methods, we tackle the soft classification problem by solving a series of hard classification problems and combining these decision rules to construct the probability estimates. The main difference between soft classification and hard classification is their estimation target, with the former directly estimating p_k(x)’s and the latter estimating arg max_k_=1,…,_K p_k(x). For many complicated problems, estimation of the classification rule arg max_k_=1,…,_K p_k(x) may be a relatively easier task than estimating the probability functions. Many successful large-margin classifiers such as the support vector machine (SVM) can estimate arg max_k_=1,…,_K p_k(x) with high accuracy without estimating p_k(x)’s at all. This motivates us to take advantage of good classification performance of hard classifiers and try to extract the probability information contained in them.

Wang, Shen, and Liu (2008) recently explored class probability estimation for binary large-margin classifiers. In particular, they made use of the property that the theoretical minimizer of a consistent weighted binary large-margin loss function is sign[p₁(x) − π], where π ∈ (0, 1). Although a particular weighted binary large-margin classifier only estimates whether p₁(x) is larger than π or not, one can obtain a good estimate of p₁(x) if a sequence of weighted classifiers are calculated for many different π’s. As shown in Wang, Shen, and Liu (2008), this method indeed works well for binary class probability estimation. However, the simultaneous generalization from K = 2 to K ≥ 3 is nontrivial and largely unknown due to the increased level of problem complexity. Wu, Lin, and Weng (2004) proposed a pairwise coupling method for multi-class probability estimation by solving many binary problems. In this article, we develop a new multiclass probability estimation scheme by utilizing the proposed concept of border weight for large-margin classifiers. As a result, the (K − 1)-dimensional probability estimation reduces to the search of border weight. We propose two estimation schemes for probability estimation, the direct scheme and the indirect scheme. Furthermore, we focus on the truncated hinge loss (Wu and Liu 2007) for demonstration of our proposed probability estimation technique. The technique is, however, applicable to other large-margin classifiers as well.

The rest of our article is structured as follows. Section 2 presents the idea of weighted classification and its Fisher consistency properties. Section 3 introduces the main methodology, along with two estimation schemes and the theoretical properties of the resulting probability estimator. Section 4 discusses the computational algorithm and tuning method. Section 5 and Section 6 contain numerous simulated and real examples to illustrate the numerical performance of the new approach, which is followed by the concluding section. The Appendix collects proofs for the theoretical results as well as the derivation of our algorithm.

2. WEIGHTED CLASSIFICATION AND FISHER CONSISTENCY

In this section, we give a brief review on an important class of hard classifiers, SVM’s (Cortes and Vapnik 1995; Vapnik 1998). We start with the simple binary classification problems and then discuss the multiclass extensions. We will, in particular, discuss the extension of SVM’s by minimizing a weighted loss function.

2.1 Weighted Binary Classification

When K = 2, the class label y is often coded as {−1, +1} for notational convenience. The binary SVM classifier can be fit in the following regularization framework

min_{f \in F} n^{- 1} \sum_{i = 1}^{n} H_{1} (y_{i} f (x_{i})) + λ J (f),

(1)

where the function H₁(z) = (1 − z)₊ ≡ max{1 − z, 0} is the so-called hinge loss, J(f) is a penalty term for model complexity, and Inline graphic is some functional space. Let p₁(x) = P(Y = 1|X = x). Lin (2002) showed that the SVM solution f̂ to Equation (1) targets directly at $sign [p_{1} (x) - \frac{1}{2}]$ . Therefore, sign[f̂(x)] approximates the Bayes classification rule without estimating p₁(x).

Since the SVM showed good classification accuracy in many applications, a natural question to ask is whether it is possible to extract any information about p₁(x) from the SVM solution. Recently, Wang, Shen, and Liu (2008) proposed training a series of binary SVM’s by minimizing a weighted loss function and then constructing p̂₁(x) by combining multiple SVM classification rules. In particular, by assigning a weight π to all the samples from class −1 and assigning 1 − π to all the samples from class +1, one can solve the regularization problem based on the weighted hinge loss

min_{f \in F} n^{- 1} [(1 - π) \sum_{y_{i} = 1} H_{1} {y_{i} f (x_{i})} + π \sum_{y_{i} = - 1} H_{1} {y_{i} f (x_{i})}] + λ J (f),

(2)

where 0 ≤ π ≤ 1. Wang, Shen, and Liu (2008) showed that the minimizer to Equation (2) is a consistent estimate of sign[p₁(x) − π]. Therefore, one can repeatedly solve Equation (2) using different π values, say, 0 = π₁ < ··· < π_m₊₁ = 1 and search ĵ such that π_ĵ and π_ĵ₊₁ satisfy sign[p₁(x) − π_ĵ] ≠ sign[p₁(x) − π_ĵ₊₁]. The probability estimate can be estimated as ${\hat{p}}_{1} (x) = \frac{1}{2} (π_{\hat{j}} + π_{\hat{j} + 1})$ . More technical details can be found in their article.

2.2 Weighted Multiclass Classification

Now consider the multiclass problems with K ≥ 2. In this setup, we code y as {1, 2, …, K}. A classifier seeks the function vector f = (f₁, f₂, …, f_K), where f_k is a map from the input domain Inline graphic to ℜ (the set of all real numbers) representing the class k; k = 1, …, K. To ensure uniqueness of the solution, a sum-to-zero constraint $\sum_{k = 1}^{K} f_{k} = 0$ is usually employed. For any new input vector x, its label is estimated via a decision rule ŷ = arg max_k_=1,2,…,_K f_k(x). Clearly, the argmax rule is equivalent to the sign function used in the binary case.

Various loss functions were proposed to extend the binary SVM to multiclass problems, such as Weston and Watkins (1999), Lee, Lin, and Wahba (2004), and Liu (2007). Here we focus on the notion of the 0–1 loss. Note that a point (x, y) is misclassified by f if y ≠ arg max_k f_k(x), that is, if min g(f(x), y) ≤ 0, where

g (f (x), y) = {f_{y} (x) - f_{k} (x), k \neq y} .

The quantity min g(f(x), y) is known as the generalized functional margin and can be reduced to yf (x) in the binary case with y ∈ {±1} (Liu and Shen 2006). With the generalized functional margin, the 0–1 loss can be expressed as I(min g(f(x), y) ≤ 0). As in the binary case, one can replace the indicator function in the 0–1 loss by some other loss ℓ. Typically, to assure that a misclassified sample induces a larger loss than a correctly classified sample, the loss function ℓ is nonincreasing and satisfies that ℓ′(0) < 0. Once the loss ℓ(·) is given, the decision vector can be obtained by solving the following regularization problem

\begin{array}{l} min_{f} n^{- 1} \sum_{i = 1}^{n} ℓ (min g (f (x_{i}), y_{i})) + λ \sum_{k = 1}^{K} J (f_{k}) \\ subject to \sum_{k = 1}^{K} f_{k} (x) = 0 . \end{array}

(3)

Motivated by Wang, Shen, and Liu (2008), we propose a new approach to estimate the class probabilities by solving a series of weighted multiclass problems and then combining multiple classification rules. In the article, we focus on the class of losses based on the functional margin ℓ (min g(f(X), Y)), as they provide a natural extension from two-class to multiclass problems. For the weighted learning, we assign a weight 0 ≤ π_k ≤ 1 to samples from class k, k = 1, …, K, where π₁ + ··· + π_K = 1 to assure identifiability. Define the unit K-cube hyperplane as

A_{K} = {(π_{1}, \dots, π_{k}) : \sum_{k = 1}^{K} π_{k} = 1, π_{k} \geq 0, k = 1, 2, \dots, K} .

For any given π ∈ A_K, we can train a weighted hard classifier by minimizing the objective function using a weighted loss function

\begin{array}{l} min_{f} n^{- 1} \sum_{i = 1}^{n} π_{y_{i}} ℓ (min g (f (x_{i}), y_{i})) + λ \sum_{k = 1}^{K} J (f_{k}) \\ subject to \sum_{k = 1}^{K} f_{k} (x) = 0 . \end{array}

(4)

Compared with the binary case, extracting the probability information from the constructed classifiers becomes much more challenging for K > 2. In particular, instead of estimating only one probability function as in K = 2, we need to estimate multiple functions p₁(x), …, p_K₋₁(x) when K > 2. As a result, a substantially different formulation from the binary case is required for multiclass probability estimation.

In the binary case, the standard SVM is shown to be Fisher-consistent for estimating the Bayes classification rule $sign (p_{1} (x) - \frac{1}{2})$ . To estimate conditional class probabilities, Wang, Shen, and Liu (2008)’s method requires that the weighted SVM in Equation (2) is Fisher-consistent for estimating weighted Bayes classification rule sign(p₁(x) − π). To proceed with the multicategory probability estimation, we need to extend the definition of weighted Fisher-consistency. To construct a good probability estimate from the classification rules, we require that the loss function ℓ in Equation (4) is consistent in the following sense.

Definition 1

A functional margin based loss ℓ is called weighted Fisher-consistent for the weighted classification problem if the minimizer f^* of E[π_Y ℓ(min g(f(X), Y))|X = x] satisfies

\underset{k = 1, \dots, K}{arg max} f_{k}^{*} (x) = \underset{k = 1, \dots, K}{arg max} π_{k} p_{k} (x), \forall x, \forall π \in A_{K} .

In a standard multiclass classification problem, the misclassification costs are all equal, i.e., C(Y, f(X)) = I(Y ≠ f(X)), and the Bayes rule minimizing E[C(Y, f (X))] is arg max_k_=1,2,…,_K p_k(x). A loss ℓ is Fisher-consistent if the decision rule induced from f^* = arg min E[ℓ (min g(f(X), Y))|X = x] is the same as the Bayes rule, i.e., $arg {max}_{k = 1, \dots, K} f_{k}^{*} (x) = arg {max}_{k = 1, \dots, K} p_{k} (x)$ for all x. For a weighted learning problem, the weighted loss E[π_Y ℓ(min g(f(X), Y))] implies that unequal costs C_π (Y, f (X)) = π_Y I(Y ≠ f (X)) are used for incorrect decisions. It is straightforward to show that the Bayes rule minimizing E[C_π (Y, f (X)] is arg max_k_=1,…,_K π_kp_k(x). In this context, we say ℓ is weighted Fisher-consistent if arg max_k_=1,…,_K f*(x) = arg max_k_=1,…,_K π_kp_k(x) for all π and x. This is also known as classification calibrated (Bartlett, Jordan, and Mcauliffe 2006) and infinite-sample consistent (Zhang 2004). Therefore, the weighted Fisher-consistency can be regarded as an equivalent formulation of Fisher-consistency for weighted classification problems.

It turns out that not all functional margin based loss ℓ(min g(f(x), y)) satisfying ℓ′ (0) < 0 is weighted Fisher-consistent for multicategory problems, as shown in the next proposition.

Proposition 1

Let ℓ(·) be a nonincreasing loss function satisfying ℓ′ (0) < 0. For any given positive weights π ∈ A_K, the minimizer f^* of E[π_Y ℓ(min g(f(X), Y))|X = x] has the following properties:

If $\frac{{max}_{k = 1, \dots, K} π_{k} p_{k}}{\sum_{k = 1}^{K} π_{k} p_{k}} > 1 / 2$ , then $arg {max}_{k} f_{k}^{*} = arg {max}_{k = 1, \dots, K} π_{k} p_{k}$ .
If ℓ(·) is convex and $\frac{{max}_{k = 1, \dots, K} π_{k} p_{k}}{\sum_{k = 1}^{K} π_{k} p_{k}} \leq 1 / 2$ , then f^* = 0 is a minimizer.

Proposition 1 suggests that one sufficient condition for the weighted loss π_yℓ(min g(f(x), y)) to be weighted Fisher-consistent is $\frac{{max}_{k} π_{k} p_{k}}{\sum_{k} π_{k} p_{k}} > 1 / 2$ , i.e., there exists a “dominating” class in the weighted sense. This condition is always satisfied for a binary problem except at the Bayes boundary {x: π₁p₁(x) = π₂p₂(x)}, but not for K > 2 as we require $\frac{{max}_{k = 1, \dots, K} π_{k} p_{k}}{\sum_{k = 1}^{K} π_{k} p_{k}} > 1 / 2$ to hold for all π ∈ A_K. When K > 2 and $\frac{{max}_{k = 1, \dots, K} π_{k} p_{k}}{\sum_{k = 1}^{K} π_{k} p_{k}} \leq 1 / 2$ , f^* = 0 can be a minimizer and consequently $arg {max}_{k = 1, \dots, K} f_{k}^{*} (x)$ is not uniquely determined. As a result, the weighted loss π_yℓ(min g(f(x), y)) is not weighted Fisher-consistent in such cases. By Theorem 1, the weighted hinge loss π_yH₁(min g(f(x), y)) is not weighted Fisher-consistent.

Interestingly, although the weighted loss π_yℓ(min g(f(x), y)) may not be weighted Fisher-consistent, the corresponding truncated version can be weighted Fisher-consistent. Specifically, for any ℓ(·), we define its truncated loss at a location s ≤ 0 by

ℓ_{T_{s}} (\cdot) = min (ℓ (\cdot), ℓ (s)) .

The following theorem shows that the truncated loss ℓ_{T_s} is weighted Fisher-consistent.

Theorem 1

Let ℓ(·) be a nonincreasing loss function satisfying ℓ′ (0) < 0. Then a sufficient condition for the weighted truncated loss π_yℓ_{T_s} (min g(f(x), y)) with K > 2 and s ≤ 0 to be weighted Fisher-consistent for estimating arg max_j π_jp_j is that the truncation location s satisfies ${sup}_{{u : u \geq - s \geq 0}} \frac{ℓ (0) - ℓ (u)}{ℓ (s) - ℓ (0)} \geq K - 1$ . This condition is also necessary if ℓ(·) is convex.

Remark 1

As pointed out by one referee, the condition ℓ′ (0) < 0 requires differentiability of ℓ at 0 and thus excludes nondifferentiable loss functions such as the ψ loss. In the following, we show how the condition can be relaxed for nondifferentiable losses. Note that ℓ′ (0) is used to assure

(\sum_{k \neq k_{π p}} π_{k} p_{k} (x)) (- K) ℓ^{'} (0) + π_{k_{π p}} p_{k_{π p}} (x) K ℓ^{'} (0) < 0,

(5)

when π_{k_πp} p_{k_πp} (x) > Σ_k_≠_{k_πp} π_kp_k(x). If ℓ′ (0) does not exist, the term in Equation (5) can be simply replaced by

(\sum_{k \neq k_{π p}} π_{k} p_{k} (x)) (- K) ℓ^{'} (0^{-}) + π_{k_{π p}} p_{k_{π p}} (x) K ℓ^{'} (0^{+}) < 0,

(6)

where ℓ′ (0⁻) and ℓ′ (0⁺) denote the left and right derivatives, respectively. Therefore, the condition ℓ′ (0) < 0 can be relaxed as ℓ′ (0⁺) < ℓ′ (0⁻) ≤ 0.

We now use two common loss examples to illustrate how to check the condition and find a proper truncating location s to assure the weighted Fisher-consistency of a truncated loss.

The hinge loss, ℓ(u) = [1 − u]₊: In this case, ${sup}_{{u : u \geq - s \geq 0}} \frac{ℓ (0) - ℓ (u)}{ℓ (s) - ℓ (0)} = \frac{1 - 0}{- s}$ and the condition becomes $s \in [- \frac{1}{K - 1}, 0]$ by noting that s ≤ 0.
The logistic loss, ℓ(u) = log(1 + e^u): In this case, ${sup}_{{u : u \geq - s \geq 0}} \frac{ℓ (0) - ℓ (u)}{ℓ (s) - ℓ (0)} = \frac{log 2 - 0}{log (1 + e^{- s}) - log 2}$ , which leads to condition s ∈ [− log(2^K^/(^K⁻¹⁾ − 1), 0].

Both of these two loss functions satisfy ℓ′ (0) < 0. Although they are not weighted Fisher-consistent themselves, they can become weighted Fisher-consistent after truncating them at s (with s satisfying the above condition).

Proposition 1 and Theorem 1 are weighted extensions of the results of Wu and Liu (2007). Furthermore, we note that the truncating location s given in Theorem 1 depends on the class number K. The larger K is the more truncation is needed to ensure Fisher consistency. This is due to the fact that the difficulty of no “dominating” class becomes more severe as K increases. The more truncation is the closer the truncated loss is to the 0–1 loss. For the hinge loss H₁(u), exponential loss e⁻^u and logistic loss log(1 + e⁻^u), their truncated versions are guaranteed to be weighted Fisher-consistent for $s \in [- \frac{1}{K - 1}, 0], [log (1 - \frac{1}{K}), 0]$ , and [− log(2^K^/(^K⁻¹⁾ − 1), 0], respectively. Note that the ψ loss used in ψ-learning can be viewed as special examples of truncated loss functions (Shen et al. 2003; Liu and Shen 2006). Theoretically, different truncation may give a different performance. Empirically, the numerical examples in Wu and Liu (2007) indicated that minimum truncation appears to work better for the unweighted case. In this article we proceed with the minimum truncation required to achieve a weighted Fisher-consistency. By minimal truncation we mean truncation with the smallest s to make the corresponding truncated loss weighted Fisher-consistent: $s = - \frac{1}{K - 1}, log (1 - \frac{1}{K})$ , and − log(2^K^/(^K⁻¹⁾ − 1) for the hinge loss, exponential loss, and logistic loss, respectively. In Figure 1 we plot these three truncated loss functions for K = 3 with the minimal truncation.

Plots of weighted Fisher-consistent truncated loss functions with minimal truncation for K = 3. The left, middle, and right panels correspond to the hinge, exponential, and logistic loss functions. A color version of this figure is available in the electronic version of this article.

3. METHODOLOGY

In this section, we derive our methodology for multiclass probability estimation based on hard classifiers. In particular, we propose training a series of weighted classifiers and using them to construct the probability estimates. For demonstration, we focus on the hinge and truncated hinge loss functions. However, our estimation schemes are applicable to general large-margin classifiers.

3.1 Direct Scheme for Probability Recovery

Define the truncated hinge loss as

H_{T_{s}} (u) = min (H_{1} (s), H_{1} (u)),

where $s = - \frac{1}{K - 1}$ corresponds to the minimum truncation required by Theorem 1 to guarantee H_{T_s} to be weighted Fisher-consistent. Denote f̂ ^π as the solution of the π -weighted truncated-hinge-loss SVM, obtained by solving the following optimization problem

\begin{array}{l} min_{f} n^{- 1} \sum_{i = 1}^{n} π_{y_{i}} H_{T_{s}} (min g (f (x_{i}), y_{i})) + λ \sum_{k = 1}^{K} J (f_{k}) \\ subject to \sum_{k = 1}^{K} f_{k} (x) = 0, \end{array}

(7)

where π = (π₁, …, π_K ) ∈ A_K. By Theorem 1, we have that $arg {max}_{k = 1, \dots, K} {\hat{f}}_{k}^{π}$ converges to arg max_k_=1,…,_K π_kp_k as n → ∞ and λ → 0.

The following proposition gives a key result for estimating the probabilities for each x ∈ Inline graphic .

Proposition 2

For any given x ∈ Inline graphic satisfying min_k p_k(x) > 0, there exists a unique weight vector π̃ (x) = (π₁(x), π₂(x), …, π_K (x)) ∈ A_K such that

{\tilde{π}}_{1} (x) p_{1} (x) = {\tilde{π}}_{2} (x) p_{2} (x) = \dots = {\tilde{π}}_{K} (x) p_{K} (x) .

Proposition 2 shows that for any x ∈ Inline graphic with min_k p_k(x) > 0, there is a unique weight vector so that the corresponding weighted probabilities π̃_j(x)p_j(x) are identical for all j. We call the point π̃ (x) ∈ A_K as the border weight for x since the K weighted probabilities meet at this point.

Interestingly, the result in Proposition 2 can help us to estimate the conditional probabilities p_k(x). In particular, for a given point x, using the property of weighted Fisher-consistency, the corresponding Bayes rule is arg max_k_=1,…,_K π_kp_k for any π ∈ A_K. Then one can vary the weight vector π ∈ A_K to search for the border weight. To illustrate this further, we consider a simple case of K = 3. In Figure 2, we plot the classification results of a particular point x for K = 3 when we change the weight vector. In this case, A₃ is an equilateral triangle with the three vertices being (1, 0, 0), (0, 1, 0), and (0, 0, 1). Theoretically, for any x, the weighted Bayes rule arg max_k π_kp_k(x) assigns x to class k when π_k is close to 1, and consequently, the whole region A₃ can be divided into three subregions R₁, R₂, and R₃ with R_k = {π ∈ A₃ : k = arg max_j π_jp_j(x)} for k = 1, 2, 3. Since the vertex (1, 0, 0) represents imposing the weight 1 to points from class 1 and the weight zero to points from the other classes, the region R₁ around (1, 0, 0) corresponds to the set of π with prediction arg max_k π_kp_k(x) = 1. The argument is similar for the other two vertices. Note that there is a special point in the center that borders all three subregions. This is the border weight satisfying π̃₁(x)p₁(x) = π̃₂(x)p₂(x) = π̃₃(x)p₃(x).

A plot of the weighted Bayes classification rule for all combinations of π for a certain point x when K = 3.

To estimate p_j(x) it is enough to estimate π̃ (x) because, once the estimate of π̃ (x) is given, we can estimate p_k(x) by the following proposition.

Proposition 3

For any given x ∈ Inline graphic , we assume that its associated border weight is estimated as $\hat{\tilde{π}} (x)$ . Then its class probabilities can be estimated as

{\hat{p}}_{k} (x) = \frac{{\hat{\tilde{π}}}_{k} {(x)}^{- 1}}{{\hat{\tilde{π}}}_{1} {(x)}^{- 1} + {\hat{\tilde{π}}}_{2} {(x)}^{- 1} + \dots + {\hat{\tilde{π}}}_{K} {(x)}^{- 1}}, k = 1, \dots K .

Propositions 2 and 3 suggest that identifying the border weight for each x is a key step to estimating the conditional probabilities p_k(x) for k = 1, …, K. To that end, a general scheme is needed to search for π̃ ∈ A_K for each x. Without loss of generality, we assume for the moment that the tuning parameter λ for Equation (7) is properly chosen. In the following, we outline the probability estimation scheme for general cases.

Direct Scheme

Define a fine grid of π within A_K. Let the grid size be d_π. Any grid point π takes the form (m₁d_π, m₂d_π, …, m_K d_π ) with nonnegative integers m₁, m₂, …, m_K satisfying $\sum_{k = 1}^{K} m_{k} d_{π} = 1$ .
Solve Equation (7) over the above grid using the properly chosen tuning parameter λ.
Form all possible K-vertex polyhedrons of (side) length d_π using the available grid points. Here each K-vertex polyhedron corresponds to K adjacent grid points.
For any x ∈ , identify the K-vertex polyhedron such that its K vertices all belong to K distinct classes. The average of the coordinates corresponding to these vertices is defined as the estimate of the border weight π̃ (x) for x. The probability estimate can then be calculated using Proposition 3.

In the following, we demonstrate how the direct scheme works in the case of K = 3. To search for the border weight in Figure 2, we define a fine grid of π within the triangle as in Figure 3. Let the grid size be d_π with 1/d_π being an integer. To estimate probabilities at any point x, we need to identify some π such that its three neighboring combinations are of the form

Left: Classification rule over a grid of π for a point x for K = 3, where the circle ○ denotes being classified as class 1, the square □ denotes being classified as class 2, and the asterisk * denotes being classified as class 3. Right: Another possible configuration of neighboring three-class classifiers.

{π_{1} (x) = (π_{1}, π_{2}, π_{3}), π_{2} (x) = (π_{1} - d_{π}, π_{2} + d_{π}, π_{3}), π_{3} (x) = (π_{1} - d_{π}, π_{2}, π_{3} + d_{π})},

(8)

which classify Y into three distinct classes as shown on the left panel of Figure 3.

3.1.1 Numerical Challenges in Implementing Direct Scheme

Now we provide some discussion on the numerical coherence of multiple decisions resulting from training multiple weighted classification problems. Let us start with the two-class problems. Assume that π and 1 − π are the costs for the negative and positive classes, then it is known that the minimizer f̂_π (x) of Equation (2) gives a consistent estimator of sign[P(Y = +1|X = x) − π]. When an increasing sequence of weights 0 = π₁ < π₂ < ···< π_m₊₁ = 1 are used, we expect the decision sequence sign[f̂_{π_j} (x)] to be monotonically changed for a fixed x due to their consistency properties. Though this is true in theory (or when n goes to infinity), the monotonic property of sign[f̂_π] may not always hold in finite sampling situations, mainly due to numerical variations. In this case, the probability p(x) can be estimated by taking the average of π_* = min{π_j : sign[f̂_{π_j} (x)] = −1} and π^* = max{π_j : sign[f̂_{π_j} (x)] = 1} (Wang, Shen, and Liu 2008).

For multiclass problems, a similar issue can occur even more frequently due to the increased complexity of the optimization problem. Take the three-class problem as an example. Each nonnegative weight vector is π = (π₁, π₂, π₃) satisfying $\sum_{k = 1}^{3} π_{k} = 1$ . It is known that the minimization of Equation (5) satisfies that arg max_k f̂_k(x) = arg max_k π_kp_k(x) asymptotically. This suggests that the weight vectors change (partially) monotonically, the decision rule arg max_k f̂_k(x) should satisfy some constraints. For example, for a given x, if π = (π₁, π₂, π₃) satisfies π₁p₁(x) > max(π₂p₂(x), π₃p₃(x)), then we have arg max_k f̂_k(x) = 1 asymptotically. Now if the weight is changed to $π^{'} = (π_{1}^{'}, π_{2}^{'}, π_{3}^{'}) = (π_{1} + d, π_{2} - d, π_{3})$ , the inequality $π_{1}^{'} p_{1} (x) > max (π_{2}^{'} p_{2} (x), π_{3}^{'} p_{3} (x))$ still holds, implying that $arg {max}_{k} {\hat{f}}_{k}^{'} (x) = 1$ asymptotically as well. Though this is true in theory, the relationship $arg {max}_{k} {\hat{f}}_{k} (x) = arg {max}_{k} {\hat{f}}_{k}^{'} (x)$ does not necessarily hold in finite-sample results. Therefore, in practice, with a finite sample, those three neighboring combinations do not always take the form of Equation (8). Other variations are possible. For example, it can also be of the form

{π_{1} (x) = (π_{1}, π_{2}, π_{3}), π_{2} (x) = (π_{1} - d_{π}, π_{2}, π_{3} + d_{π}), π_{3} (x) = (π_{1}, π_{2} - d_{π}, π_{3} + d_{π})},

as shown on the right panel of Figure 3. This corresponds to the monotonicity violation in the binary case as discussed in the first paragraph of section 2.2 of Wang, Shen, and Liu (2008). Our selection criterion is to select three neighboring combinations corresponding to three distinct classes. The average (π₁(x) + π₂(x) + π₃(x))/3 of these three neighboring combinations, denoted by π̂ (x) = (π̂₁(x), π̂₂(x), π̂₃(x)), serves as our estimate of the border weight. Using the estimated border weight π̂ (x), our estimate is given by

{\hat{p}}_{k} (x) = \frac{{\hat{π}}_{k} {(x)}^{- 1}}{{\hat{π}}_{1} {(x)}^{- 1} + {\hat{π}}_{2} {(x)}^{- 1} + {\hat{π}}_{3} {(x)}^{- 1}} .

For the finite-sample case, it is possible to have more than one possibility of three neighboring combinations corresponding to three distinct classes. Each possibility leads to one estimated border weight. When this happens, averaging all the estimated border weights is necessary to proceed our probability estimation. The nonuniqueness of border weights encountered in practice adds some challenges in the implementation of the direct scheme. As the number of classes gets larger, this may become more severe. Furthermore, the border weights are identified through counting multiple decisions. The process is discrete and tends to be slow and unstable. These challenges motivate us to develop another scheme that is more continuous and of high stability.

3.2 Indirect Scheme for Probability Estimation

In this section, we provide an alternative scheme to recover probabilities. Instead of directly targeting the probabilities as the direct scheme does, the new scheme estimates some continuous functions of probabilities, which can be easier to estimate, and then inverts those functions to recover probabilities.

Note that the total volume (or area when K = 3) of A_K is given by

\int_{0}^{1} d π_{1} \int_{0}^{1 - π_{1}} d π_{2} \dots \int_{0}^{1 - π_{1} - \dots - π_{K - 2}} d π_{K - 1} \sqrt{K} = \frac{\sqrt{K}}{(K - 1)!} .

The collection of π representing class k is given by R_k = {π : π_kp_k ≥ π_jp_j for j ≠ k}, which can be represented as

R_{k} = \cup r_{j_{1} j_{2} \dots j_{K - 1}}^{k},

where $r_{j_{1} j_{2} \dots j_{K - 1}}^{k} = {π : π_{j_{1}} p_{j_{1}} \leq π_{j_{2}} p_{j_{2}} \leq \dots \leq π_{j_{K - 1}} p_{j_{K - 1}} \leq π_{k} p_{k}}$ and the union is over all permutations (j₁, j₂, …, j_K₋₁) of {1, 2, …, K} \ k. When K = 3, Figure 4 demonstrates how A₃ is partitioned into different parts using the notation $r_{j_{1} j_{2} \dots j_{K - 1}}^{k}$ corresponding to Figure 2. For permutation (j₁, j₂, …, j_K₋₁), the volume (or area) of $r_{j_{1} j_{2} \dots j_{K - 1}}^{k}$ is given by

\sqrt{K} \int_{0}^{B_{j_{1}}} d π_{j_{1}} \int_{p_{j_{1}} π_{j_{1}} / p_{j_{2}}}^{B_{j_{2}}} d π_{j_{2}} \dots \int_{p_{j_{K - 2}} π_{j_{K - 2}} / p_{j_{K - 1}}}^{B_{j_{K - 1}}} d π_{j_{K - 1}},

where $B_{j_{1}} = (1 - \sum_{m = 1}^{k - 1} π_{j_{m}}) (1 / p_{j_{i}}) / ((1 / p_{k}) + \sum_{m = i}^{K - 1} (1 / p_{j_{m}}))$ ; i = 1, …, K − 1 with the convention $\sum_{m = 1}^{0} π_{j_{m}} = 0$ . We can then sum over the area (volume) according to all permutations (j₁, j₂, …, j_K₋₁) of {1, 2, …, K} \ k to give a formula for the volume (or area) of R_k.

Naturally the proportion of grid points of π leading to prediction of k, denoted by prop_k, estimates the volume (or area) ratio of $\frac{Volume (R_{k})}{Volume (A_{K})}$ , which is a function of p₁, p₂, …, p_K and denoted by h_k(p₁, p₂, …, p_K). Then by solving the equation system of K equations

{prop}_{k} = h_{k} (p_{1}, p_{2}, \dots, p_{K}); k = 1, 2, \dots, K,

(9)

we can obtain the estimated probabilities. In particular when K = 3, area of A₃ is $\sqrt{3} / 2$ and area of R₃ is $\frac{\sqrt{3}}{2} p_{3}^{2} (p_{2} p_{3} + p 1 p_{3} + 2 p_{1} p_{2}) / ((p_{1} + p_{3}) (p_{2} + p_{3}) (p_{1} p_{2} + p_{1} p_{3} + p_{2} p_{3}))$ . Consequently $h_{3} (p_{1}, p_{2}, p_{3}) = p_{3}^{2} (p_{2} p_{3} + p 1 p_{3} + 2 p_{1} p_{2}) / ((p_{1} + p_{3}) (p_{2} + p_{3}) (p_{1} p_{2} + p_{1} p_{3} + p_{2} p_{3}))$ .

The indirect scheme can be summarized as follows:

Indirect Scheme

Same as those of the Direct Scheme.
For any x ∈ , calculate the grid percentage prop_k for k = 1, 2, …, K.
Solve the equation system in Equation (9) to recover the estimation of (p₁(x), p₂(x), …, p_K (x)).

In Section 5, we will illustrate the performance of both schemes. Our empirical results suggest that the indirect scheme is indeed faster and more accurate.

3.3 Theoretical Properties

The next theorem establishes the consistency of our class probability estimation.

Theorem 2

For any nonincreasing loss function ℓ(·) with ℓ′(0) < 0, if the truncation location s is chosen such that ${sup}_{{u : u \geq - s \geq 0}} \frac{ℓ (0) - ℓ (u)}{ℓ (s) - ℓ (0)} \geq K - 1$ . When λ → 0 and the grid size d_π → 0 as n → ∞, our estimate p̂_k(x) based on the truncated loss ℓ_{T_s} is asymptotically consistent, i.e., p̂_k(x) → p_k(x) for k = 1, 2, …, K as n → ∞.

The consistency result in Theorem 2 provides theoretical justification of our proposed method. It can be straightforward to extend the consistency to our indirect probability recovery scheme as Theorem 1 implies that prop_k is consistent for estimating h_k(p₁, p₂, …, p_K ) and inversion will inherit the consistency. Although our probability estimation method is model-free, it converges to the true probability asymptotically. As shown in our simulation studies in Section 5, our method indeed provides competitive probability estimation compared to several other existing techniques.

4. COMPUTATION ALGORITHMS

As shown on the right panel of Figure 5, the function H_{T_s}(·) is not convex, thus solving Equation (7) involves a nonconvex minimization problem. However, we note that H_{T_s}( u) can be decomposed as the difference of two convex functions,

The left, middle, and right panels display functions H₁(u), *H_s*(u), and *H_{T_s}*(u), respectively. A color version of this figure is available in the electronic version of this article.

H_{T_{s}} (u) = min (H_{1} (u), H_{1} (s)) = H_{1} (u) - H_{s} (u),

where H_s(u) = (s − u)₊. Figure 5 displays the three functions H₁(u), H_s(u), and H_{T_s}(u).

Using this property of the truncated hinge loss function, we apply the difference convex (d.c.) algorithm (An and Tao 1997; Liu, Shen, and Doss 2005; Wu and Liu 2007) to solve the non-convex optimization problem of the weighted truncated-hinge-loss SVM. The d.c. algorithm solves the nonconvex minimization problem via minimizing a sequence of convex subproblems (see Algorithm 1). We derive the d.c. algorithm for linear learning in Section 4.1 and then generalize it to the case of nonlinear learning via kernel mapping in Section 4.2.

Algorithm 1

[The Difference Convex Algorithm for minimizing Q(Θ) = Q_vex(Θ) + Q_cav(Θ)].

Initialize Θ₀.
Repeat $Θ_{t = 1} = arg {min}_{Θ} (Q_{vex} (Θ) + 〈 Q_{cav}^{'} (Θ_{t}), Θ - Θ_{t} 〉)$ until convergence of Θ_t.

4.1 Linear Learning

Let $f_{k} (x) = w_{k}^{T} x + b_{k}$ ; w_k ∈ ℜ^d, b_k ∈ ℜ, and b = (b₁, b₂, …, b_k)^T ℜ^K, where w_k = (w₁_k, w₂_k, …, w_dk)^T and W = (w₁, w₂, …, w_K ). With ℓ = H_{T_s}, Equation (7) becomes

\begin{array}{l} min_{W, b} \frac{1}{2} \sum_{k = 1}^{K} {| | w_{k} | |}_{2}^{2} + C \sum_{i = 1}^{n} π_{y_{i}} H_{T_{s}} (min g (f (x_{i}), y_{i})) \\ subject to \sum_{k = 1}^{K} w_{j k} = 0; j = 1, 2, \dots, d; \sum_{k = 1}^{K} b_{k} = 0, \end{array}

(10)

where the constraints are adopted to avoid the nonidentifiability issue of the solution. Note that Equation (10) is equivalent to the other representation of Equation (4) by setting C = 1/λ. Thus we will use them interchangeably.

Denote Θ as (W, b). Applying the fact that H_{T_s} = H₁ − H_s, the objective function in Equation (10) can be decomposed as

\begin{array}{l} Q^{s} (Θ) = \frac{1}{2} \sum_{k = 1}^{K} {| | w_{k} | |}_{2}^{2} + C \sum_{i = 1}^{n} π_{y_{i}} H_{1} (min g (f (x_{i}), y_{i})) - C \sum_{i = 1}^{n} π_{y_{i}} H_{s} (min g (f (x_{i}), y_{i})) \\ = Q_{vex}^{s} (Θ) + Q_{cav}^{s} (Θ), \end{array}

where

Q_{vex}^{s} (Θ) = \frac{1}{2} \sum_{k = 1}^{K} {| | w_{k} | |}_{2}^{2} + C \sum_{i = 1}^{n} π_{y_{i}} H_{1} (min g (f (x_{i}), y_{i}))

and

Q_{cav}^{s} (Θ) = - C \sum_{i = 1}^{n} π_{y_{i}} H_{s} (min g (f (x_{i}), y_{i})),

denote the convex and concave parts, respectively.

Define

β_{i k} = {\begin{array}{l} C π_{y_{i}} & if k = arg max (f_{k^{'}}^{t} : k^{'} \neq y_{i}), f_{y_{i}}^{t} - f_{k}^{t} < s \\ 0 & otherwise, \end{array}

where $f^{t} = {(f_{1}^{t} (\cdot), f_{2}^{t}, \dots, f_{K}^{t})}^{T}$ denotes the solution at the tth iteration. It is shown in the Appendix that the dual problem of the convex optimization at the (t + 1)th iteration, given the solution f^t at the tth iteration, is as follows

\begin{array}{l} min_{α} \frac{1}{2} \sum_{k = 1}^{K} {‖ \sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) x_{i}^{T} - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k}) x_{i}^{T} ‖}_{2}^{2} - \sum_{i = 1}^{n} \sum_{k^{'} \neq y_{i}} α_{{i k}^{'}} \\ subject to \sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k}) = 0, \\ k = 1, 2, \dots, K, \\ 0 \leq \sum_{j \neq y_{i}} α_{i j} \leq C π_{y_{i}}, i = 1, 2, \dots, n, \\ α_{i k} \geq 0, i = 1, 2, \dots, n; k \neq y_{i} . \end{array}

This dual problem is a quadratic programming (QP) problem similar to that of the standard SVM and can be solved by many optimization software. Once the solution is obtained, the coefficients w_k’s can be recovered as follows,

w_{k} = \sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) x_{i} - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k}) x_{i} .

(11)

It is interesting to note that representation of w_k’s given in Equation (11) automatically satisfies that $\sum_{k = 1}^{K} w_{j k} = 0$ for each 1 ≤ j ≤ d. Moreover, we can see that coefficients w_k’s are determined only by those data points whose corresponding α_ik − β_ik is not zero for some 1 ≤ k ≤ K and these data points are the SV’s of the weighted truncated-hinge-loss SVM. The set of SV’s of the weighted truncated-hinge-loss SVM using the d.c. algorithm is only a subset of the set of SV’s of the original weighted SVM. Basically the weighted truncated-hinge-loss SVM tries to remove points satisfying $f_{y_{i}}^{t} - f_{k}^{t} < s$ with $k = arg max (f_{k^{'}}^{t} : k^{'} \neq y_{i})$ from the original set of SV’s, and consequently, eliminate the effects of outliers. This provides an intuitive algorithmic explanation of the robustness of the weighted truncated-hinge-loss SVM to outliers. A similar conclusion was provided by Wu and Liu (2007) for the unweighted version.

After the solution of W is derived, b can be obtained via solving either a sequence of KKT conditions as used in the standard SVM or a linear programming (LP) problem. Denote ${\tilde{f}}_{k} (x_{i}) = x_{i}^{T} w_{k}$ . Then b can be obtained through the following LP problem:

\begin{array}{l} min_{η, b} C \sum_{i = 1}^{n} π_{y_{i}} η_{i} + \sum_{k = 1}^{K} (\sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} β_{{i k}^{'}} - \sum_{i : y_{i} = k} β_{i k}) b_{k} \\ subject to η_{i} \geq 0, i = 1, 2, \dots, n, \\ η_{i} \geq 1 - ({\tilde{f}}_{y_{i}} (x_{i}) + b_{y_{i}}) + {\tilde{f}}_{k} (x_{i}) + b_{k}, \\ i = 1, 2, \dots, n; k \neq y_{i}, \sum_{k = 1}^{K} b_{k} = 0 . \end{array}

4.2 Nonlinear Learning

For nonlinear learning, each decision function f_k(x) is represented by h_k(x) + b_k with h_k(x) ∈ Inline graphic , where is a reproducing kernel Hilbert space (RKHS). Here the kernel R(·, ·) is a positive definite function mapping from × to ℜ. Due to the representer theorem of Kimeldorf and Wahba (1971) (also see Wahba 1999), the nonlinear problem can be reduced to finding finite dimensional coefficients v_ik’s and h_k(x) can be represented as $\sum_{i = 1}^{n} R (x, x_{i}) v_{i k}$ ; k = 1,2, …, K.

Denote v_k = (v₁_k, v₂_k, …, v_nk)^T, V = (v₁, v₂, …, v_K ), and R to be an n × n matrix whose (i₁, i₂) entry is R(x_i₁, x_i₂). Let R_i be the ith column of R and denote the standard basis of the n-dimensional space by e_i = (0, 0, …, 1, …, 0)^T with 1 for its ith component and 0 for other components.

A similar derivation as in the linear case leads to the following dual problem for nonlinear learning

\begin{array}{l} min_{α} \frac{1}{2} \sum_{k = 1}^{K} 〈 \sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) R_{i} - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k}) R_{i}, \sum_{i : y_{i} = j} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) e_{i} - \sum_{i : y_{i} \neq k} (α_{{i k}^{'}} - β_{i k}) e_{i} 〉 - \sum_{i = 1}^{n} \sum_{k^{'} \neq y_{i}} α_{{i k}^{'}} \\ subject to \sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k}) = 0, \\ k = 1, 2, \dots, K, \\ 0 \leq \sum_{k \neq y_{i}} α_{i j} \leq C π_{y_{i}}, i = 1, 2, \dots, n, \\ α_{i k} \geq 0, i = 1, 2, \dots, n; k \neq y_{i}, \end{array}

where β_ik’s are defined similarly as in the linear case. After solving the above QP problem, we can recover the coefficients v_k’s as follows

v_{k} = \sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) e_{i} - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k}) e_{i} .

The intercepts b_k’s can be solved using LP as in the linear learning.

4.3 Parameter Tuning

So far we assume that we selected the optimal tuning parameter λ. In practice, the tuning parameter selection can be done using an independent validation set or cross-validation. In this article, we choose to select the parameter using an independent set of size ñ. Theoretically speaking, the larger ñ is the better the tuning effect and the better classifier we may obtain. This is related to Shao (1993)’s results on cross-validation in the context of linear model selection: the proportion of tuning set size over the size of all available data points, namely ñ/(n + ñ), should go to 1 as (n + ñ) → ∞ to ensure the asymptotically correct selection. However, how to split the dataset into the training part and the tuning part is always a trade-off between model training and parameter tuning since a large training set is also desired for better model fitting. A commonly accepted procedure is to use one half for training and the other half for tuning, i.e., n = ñ.

Now we detail the approach using an independent tuning set of size ñ. We first obtain probability estimates ${\hat{p}}_{j}^{(λ_{m})} ({\tilde{x}}_{i})$ , j = 1, 2, …, k for any x˜_i in the tuning set {(x˜_i, ỹ_i): i = 1, 2, …, ñ} over a grid {λ₁, λ₂, …, λ_M} of the tuning parameter. Then we can evaluate the log-likelihood $L (λ_{m}) = \sum_{i = 1}^{\tilde{n}} log ({\hat{p}}_{y_{i}}^{(λ_{m})} ({\tilde{x}}_{i}))$ of the tuning set for each λ_m. Let m̂ = arg max_m L(λ_m). The optimal tuning parameter is selected to be λ_m̂.

5. SIMULATIONS

In this section we use four simulation examples to illustrate the methodological power of our new multiclass probability estimation scheme by comparing it to some existing methods. We consider five alternative methods: cumulative logit model (CLM), baseline logit model (BLM), kernel multicategory logistic regression (KMLR), classification tree (TREE), and random forest (RF). Both CLM and BLM make certain assumptions on the forms of the transformed probabilities. In particular, the CLM assumes that $log \frac{\sum_{i = 1}^{k} p_{i} (x)}{1 - \sum_{i = 1}^{k} p_{i} (x)} = β_{k 0} + x^{T} β_{k}$ for k = 1, 2, …, K − 1 while the BLM assumes that $log \frac{p_{k} (x)}{p_{K} (x)} = β_{k 0} + x^{T} β_{k}$ for k = 1, 2, …, K − 1. KMLR refers to the one proposed by Zhu and Hastie (2005) with Gaussian kernel $R (x_{1}, x_{2}) = e^{- {| | x_{1} - x_{2} | |}_{2}^{2} / σ^{2}}$ . Ten separate datasets are generated to tune the data width parameter σ among a grid of {1/4, 1/2, 3/4, 1, 5/4, 3/2}σ_m, where σ_m is the median pairwise Euclidean distance defined as median{||x_i − x_j||: y_i ≠ y_j}. Among these methods, CLM and BLM are essentially parametric models while our methods, KMLR, TREE, and RF are nonparametric. Denote the size of the training set by n. Fivefold cross-validation is used to select the tuning parameter. For the TREE-based method, we use the R package “Tree” and its build-in cross-validation function is used to prune trees with fold number set as 10. Similarly we use the build-in tuning for RF provided in the R package.

In simulations, the true conditional probability functions p_k(·), k = 1, 2, …, K are known. To measure the estimation accuracy of the conditional probabilities, we use various scores evaluated on the testing set (of size 10n):

1-norm error $\frac{1}{10 n} \sum_{i = 1}^{10 n} \sum_{k = 1}^{K} ∣ {\hat{p}}_{k} ({\bar{x}}_{i}) - p_{k} ({\bar{x}}_{i}) ∣$
2-norm error $\frac{1}{10 n} \sum_{i = 1}^{10 n} \sum_{k = 1}^{K} {({\hat{p}}_{k} ({\bar{x}}_{i}) - p_{k} ({\bar{x}}_{i}))}^{2}$
Empirical generalized Kullback–Leibler (EGKL) loss $\frac{1}{10 n} \sum_{i = 1}^{10 n} \sum_{k = 1}^{K} p_{k} ({\bar{x}}_{i}) log \frac{p_{k} ({\bar{x}}_{i})}{{\hat{p}}_{k} ({\bar{x}}_{i})}$ .

Here x̄_i denotes the predictor vector of the ith observation in the testing set. The average errors over 100 replications and the corresponding standard deviations (in parentheses) are reported. Whenever appropriate, our method is employed with minimal truncation with s = −1/(K − 1). We implement our method using linear learning for Examples 1, 3, and 4 while using the Gaussian kernel $R (x_{1}, x_{2}) = e^{- {| | x_{1} - x_{2} | |}_{2}^{2} / σ^{2}}$ for Example 2. The grid size d_π is chosen to be 0.02 for our three-class examples and 0.05 for the five-class example. In addition to tuning parameter λ and truncation location s, different grid size gives different performance. See Table 5 in Section 5.2 for the effect of different grid sizes in the discussion following these numerical examples.

Table 5.

Probability estimation errors on the test set for Example 1 with different d_π

	d_π = 0.1	d_π = 0.04	d_π = 0.02	d_π = 0.01
1-norm	17.26 (2.55)	12.49 (2.50)	11.03 (2.29)	11.76 (2.18)
2-norm	2.37 (0.81)	1.31 (0.56)	0.90 (0.34)	0.93 (0.33)
EGKL	12.50 (2.67)	4.81 (1.63)	2.56 (0.79)	2.43 (0.64)

Open in a new tab

5.1 Numerical Examples

Example 1

We consider a three-class linear learning example. Our data are generated in two steps: (1) Y is uniformly distributed over {1, 2, 3}; (2) Conditional on Y = y, the two-dimension predictor X is generated from N(μ(y), Σ), where μ(y) = (cos(2yπ/3), sin(2yπ/3))^T and Σ = 0.7²I₂ with I₂ being the 2 × 2 identity matrix. The sample size n is 400. Table 1 reports the average test errors and the corresponding standard deviations (in parentheses) over 100 replications for various methods. Note that in this example, the BLM specifies the correct parametric model and hence fits the true (oracle) model, while the CLM corresponds to a model misspecification. A tuning of σ in Gaussian kernel for the KMLR selects σ_m as the best. As shown in Table 1, the oracle BLM performs the best while the CLM performs the worst. Except for the oracle BLM, our method with either the direct probability recovery scheme or the indirect probability recovery scheme consistently outperforms all the other methods with significant improvement. Between these two different probability recovery schemes the indirect scheme works much better. Hence in our later examples, we will only report results of our new method with the indirect scheme. Here the TREE-based methods (TREE or RF) lead to infinity (denoted by Inf in Table 1) for EGKL because it returns zero probability for some point x and some classes. The corresponding standard deviation does not make sense and we denote by NaN, which stands for Not A Number. This is one property of TREE-type methods.

Table 1.

Probability estimation errors on the test set for Example 1

	Our method		CLM	KMLR	TREE	RF	BLM (Oracle)
	Direc	Indirect	CLM	KMLR	TREE	RF	BLM (Oracle)
1-norm	18.46 (3.83)	11.03 (2.29)	57.35 (1.06)	52.92 (2.80)	27.48 (3.34)	24.02 (1.40)	6.19 (1.95)
2-norm	3.34 (1.74)	0.90 (0.34)	20.53 (0.19)	11.68 (1.20)	5.99 (1.21)	5.01 (0.64)	0.36 (0.23)
EGKL	6.73 (2.21)	2.56 (0.79)	31.28 (0.33)	23.72 (1.77)	Inf (NaN)	Inf (NaN)	0.78 (0.48)

Open in a new tab

NOTE: All table entries are multiplied by 100. Numbers in parentheses are the corresponding standard deviations. See the description at the end of Example 1 for the meaning and reasons for Inf (NaN). The same explanation applies to the results of other examples.

Example 2

In this example, we study a three-class nonlinear example. For any x = (x₁, x₂)^T, define $f_{1} (x) = - x_{1} + 0.1 x_{1}^{2} - 0.05 x_{2}^{2} + 0.1, f_{2} (x) = - 0.2 x_{1}^{2} + 0.1 x_{2}^{2} - 0.2$ , and $f_{3} (x) = x_{1} + 0.1 x_{1}^{2} - 0.05 x_{2}^{2} + 0.1$ . Set $p_{k} (x) = P (Y = k ∣ X = x) = exp (f_{k} (x)) / (\sum_{m = 1}^{3} exp (f_{m} (x)))$ for k = 1, 2, 3. Each pair of data point (x, y) is generated in two steps: we first generate x₁ ~ Uniform[−3, 3] and x₂ ~ Uniform[−6, 6]; conditional on X = x, the class response Y takes value k with probability p_k(x) for k = 1, 2, 3. The sample size is chosen to be n = 100. A similar example was previously used by Zhang et al. (2008).

In this example, we consider basis expansion for the parametric methods CLM and BLM by also including the quadratic terms $x_{1}^{2}$ and $x_{2}^{2}$ . Consequently, the BLM is again the oracle model. Results over 100 repetitions in the same format of Example 1 are reported in Table 2. Column Indirect corresponds to our method with the indirect probability recovery scheme. The tuning of σ in Gaussian kernel selects 5σ_m/4 and σ_m/2 as the best for KMLR and our new method, respectively. Similar to Example 1, we again observe that the new method gives smaller errors than all the other methods except RF for the 1-norm error and the oracle.

Table 2.

Probability estimation errors on the test set for Example 2

	Indirect	CLM	KMLR	TREE	RF	BLM (Oracle)
1-norm	36.37 (4.35)	45.42 (2.38)	46.41 (9.82)	60.08 (10.57)	34.10 (3.68)	19.38 (5.21)
2-norm	7.85 (2.06)	13.46 (1.07)	10.08 (3.78)	22.23 (4.75)	8.84 (1.90)	3.19 (1.90)
EGKL	13.60 (2.91)	19.91 (2.04)	18.88 (6.00)	Inf (NaN)	Inf (NaN)	9.11 (9.23)

Open in a new tab

NOTE: All table entries are multiplied by 100.

Example 3

In Examples 1 and 2 the BLM takes the true model form, so it is not surprising that the BLM shows better performance than our method. In this example, we design an experiment so that none of the parametric methods corresponds to the oracle. This will provide a fair comparison between them.

The two-dimensional predictor X is uniformly distributed over the disc { $x : x_{1}^{2} + x_{2}^{2} \leq 100$ }. Define functions $h_{1} (x) = - 5 x_{1} \sqrt{3} + 5 x_{2}, h_{2} (x) = - 5 x_{1} \sqrt{3} - 5 x_{2}$ , and h₃(x) = 0. Apply a transformation f_k(x) = Φ⁻₁(T₂(h_k(x))), where Φ(·) and T₂(·) are the cumulative distribution functions of the standard normal distribution and t distribution with degrees of freedom 2, respectively. We set probabilities $p_{k} (x) = P (Y = k ∣ X = x) = exp (f_{k} (x)) / (\sum_{j = 1}^{3} exp (f_{j} (x)))$ for k = 1, 2, 3 as in Example 2. Because of the nonlinear transformation Φ⁻₁(T₂(·)), BLM is no longer the oracle model. Our multiclass probability with linear kernel is not the oracle model either. The training set size is n = 600. The tuning of KMLR selects σ = σ_m/4 as the data width parameter. Table 3 shows clearly that our method is consistently better than the BLM and performs best among all the approaches under comparison.

Table 3.

Probability estimation errors on the test set for Example 3

	Indirect	CLM	KMLR	TREE	RF	BLM
1-norm	21.78 (2.20)	67.88 (0.82)	59.31 (1.94)	24.44 (3.29)	24.47 (1.20)	31.02 (1.07)
2-norm	4.47 (1.04)	25.80 (0.29)	14.42 (0.88)	7.69 (1.35)	5.55 (0.54)	6.85 (0.27)
EGKL	11.79 (2.58)	38.51 (0.28)	28.48 (1.32)	Inf (NaN)	Inf (NaN)	12.72 (0.40)

Open in a new tab

NOTE: All table entries are multiplied by 100.

Example 4

In this five-class example, the data are generated similarly as in Example 1. Response Y is uniformly distributed over {1, 2, 3, 4, 5}. Conditional on Y = y, the two-dimensional predictor X is generated from N(μ(y), Σ), where μ(y) = (cos(2yπ/5), sin(2yπ/5))^T and Σ = 0.7²I₂. The sample size n is 1000. The tuning of KMLR selects 5σ_m/4 as the best. Simulation results are reported in Table 4. A similar improvement is observed for our new method.

Table 4.

Probability estimation errors on the test set for Example 4

	Indirect	CLM	KMLR	TREE	RF	BLM (Oracle)
1-norm	22.61 (2.52)	81.71 (0.29)	57.91 (2.14)	38.05 (1.90)	42.54 (1.10)	7.33 (1.72)
2-norm	2.64 (0.64)	24.00 (0.25)	9.77 (0.74)	6.78 (0.70)	8.93 (0.50)	0.28 (0.13)
EGKL	7.36 (0.97)	49.05 (0.23)	27.02 (1.49)	Inf (NaN)	Inf (NaN)	0.63 (0.27)

Open in a new tab

NOTE: All table entries are multiplied by 100.

Among the six procedures considered above, BLM and CLM are parametric methods, while our method, KMLR, TREE, and RF are nonparametric procedures that do not make explicit assumptions on the form of the true probability functions. Our simulated results suggest that, if the parametric assumption is correct, then the associated parametric estimator is essentially the oracle and performs the best among all. This explains why BLM gives the smallest errors in Examples 1, 2, and 4. However, if the parametric assumption is incorrect, then the parametric estimators can perform poorly, as shown for the BLM in Example 3 and the CLM in all the settings. By contrast, model-free methods do not rely on the model assumption and show more robust performance. For complicated problems, some of the nonparametric methods can outperform the parametric ones. As shown in Example 3, our method and RF are the top two performers. Furthermore, it is noticed that our method performs competitively among the three nonparametric procedures.

In practice, sometimes it is difficult to determine or validate the parametric assumption on the function forms, especially when data are complicated or high-dimensional, then a good nonparametric procedure will provide a useful alternative tool for estimating multiclass probabilities.

5.2 Empirical Computation Cost

The total computation cost of the proposed procedure is mainly determined by three factors: the computation cost of solving one weighted optimization problem, the number of optimization problems corresponding to different weight vectors, and the scheme for recovering probabilities from multiple decision rules. As shown in the article, each optimization problem involves a nonconvex minimization problem and the proposed DCA-based algorithm seems quite efficient. For example, it takes 0.4827, 5.4086, 0.5118, and 2.3549 s, on average, to solve an individual optimization problem for Examples 1, 2, 3, and 4, respectively. Since Example 2 deals with more complicated nonlinear classification problems it takes a little longer.

The second factor is controlled by the size of d_π. In a three-class problem, if d_π = 0.02, we need to solve 1176 optimization problems; and if d_π = 0.1, we only need to solve 24 problems. The effects of d_π are important: the smaller d_π is the better the estimation result will be. On the other hand, this accuracy gain is obtained at the cost of computational time. To illustrate the effects of d_π on the procedure, we now present the performance of our procedure in Example 1 with different values of d_π = 0.1, 0.04, 0.02, 0.01 used.

From Table 5, it is clear that smaller d_π values, in general, lead to better accuracy in probability estimation. However, this accuracy gain levels off as d_π becomes very small. For example, the accuracy improvement from d_π = 0.1 to d_π = 0.04 is substantial, but the difference among d_π = 0.04, 0.02, 0.01 is quite small. It is worth pointing out that the computational time grows faster as d_π gets smaller. For example, the computation time for d_π = 0.01 is about 25-folds of that for d_π = 0.04 and about four-folds of that for d_π = 0.02. So there is a trade-off between computational cost and estimation accuracy when choosing d_π. In our simulations, we find d_π = 0.02 works pretty well in various three-class problem settings.

To recover the probabilities, we propose two schemes in the article. The numerical results suggest that the indirect scheme is faster and produces better estimation accuracy. In practice, we recommend using the indirect scheme.

To conclude our simulation studies, we plot in Figure 6 a randomly chosen training set from each example to show what our training data look like. Note that the sample size is 1000 for Example 4. However to make Figure 6 look nicer, we only use a random sample of size 200 while plotting the right bottom panel for Example 4.

Plots of a randomly selected training set from each simulation example. The solid lines indicate Bayes boundaries for un-weighted classification.

6. REAL DATA

In this section, we apply our new multiclass probability estimation scheme to the wine data by comparing it to those four alternative methods considered in the previous section. The wine data are available online at the University of California, Irvine (UCI) Machine Learning Repository by following the URL http://archive.ics.uci.edu/ml/datasets/Wine. In addition to the categorical response variable Wine Type, it has 13 attributes available. They are Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. All these 13 attributes are continuous. Before applying any probability estimation scheme, we standardize each attribute to have mean zero and standard deviation one. Wine Type belongs to one of three classes with class distribution of 59 in class 1, 71 in class 2, and 48 in class 3. The total number of observations is n = 178. We randomly select 19 observations from class 1, 23 from class 2, and 16 from class 3 to be set aside as the testing set. The remaining 120 observations are used as the training dataset. We randomly divide those 120 observations in the training set into eight folds with each fold containing five observations in class 1, six in class 2, and four in class 3 so that an eight-fold cross-validation is used to select any tuning parameter over a grid if necessary. The tuning selects σ_m as the best data width parameter for KMLR. For simplicity, our method is implemented with the linear kernel.

For any estimate p̂_j(·), we define its log-likelihood over the testing set as $\sum_{i = 1}^{58} log {\hat{p}}_{y_{i}} (x_{i})$ , where (x_i, y_i), i = 1, 2, …, 58 denotes observations of the testing set. The corresponding test error is defined by $\sum_{i = 1}^{58} I (y_{i} \neq arg {max}_{j} {\hat{p}}_{j} (x_{i}))$ , where I(·) is the indicator function taking value 1 if its argument is true and 0 otherwise. We report both the log-likelihood of the testing set and the test error in Table 6 for all five methods we include for comparison. The same reason as in the simulation examples applies to why both CLM and TREE lead to negative infinity. According to Table 6, our new method with linear learning performs competitively in terms of either the log-likelihood of the testing set or the test error.

Table 6.

Results of Wine Example

	Direct	Indirect	CLM	KMLR	TREE	RF	BLM
Test log-likelihood	−9.5487	−6.4817	−Inf	−26.7858	−Inf	−11.2876	−12.7390
Test error	2/58	1/58	8/58	0/58	5/58	0/58	3/58

Open in a new tab

7. CONCLUSION

In this work, we propose a model-free multiclass probability estimation approach. It is achieved by solving a series of weighted hard classification problems and then combining these decision rules to construct the probability estimates. Both theoretical and numerical results are provided to demonstrate the competitive performance of our estimation procedure. Our numerical results show favorable performance of our new probability estimation procedure in comparison to several other existing approaches.

Our probability estimation procedure requires computation of weighted classifiers over a fine grid of the K-vertex polyhedron. The computational cost can be high when the class number K gets large. To further improve the computational efficiency, one possible solution is to investigate an efficient solution path over the grid. Further investigation is needed.

Acknowledgments

The authors thank the editor, the associate editor, and two referees for their helpful suggestions that led to significant improvement of the article. The authors are supported in part by NSF grants DMS-0905561 (Wu), DMS-0645293 (Zhang), DMS-0747575 (Liu), and DMS-0606577 (Liu), and NIH/NCI grants R01-CA-085848 (Zhang) and R01-CA-149569 (Liu and Wu).

APPENDIX

Proofs of Proposition 1 and Theorem 1

For any x and any π ∈ A_K, define ${\tilde{p}}_{k} = π_{k} p_{k} (x) / (\sum_{m = 1}^{K} π_{k} p_{k} (x))$ . Then p̃_k satisfies p˜_k ≥ 0 and $\sum_{k = 1}^{K} {\tilde{p}}_{k} = 1$ . Thus, we can treat p˜_k as a new conditional class probability, and as a result, Fisher-consistency implies weighted Fisher-consistency in that p˜_k covers all possibilities as we vary x in the whole domain. Proofs of Proposition 1 and Theorem 1 will be parallel to the proofs of proposition 2 and theorem 1 of Wu and Liu (2007). Thus, we skip them to save space.

Proof of Proposition 2

The unique vector π̃(x) is given by (π̃₁(x), π̃₂(x), …, π̃_K (x)) with ${\tilde{π}}_{k} (x) = 1 / p_{k} (x) / (\sum_{m = 1}^{K} 1 / p_{m} (x))$ .

Proof of Proposition 3

The result is straightforward by Proposition 2.

Proof of Theorem 2

Note that, as n → ∞ and λ → ∞, the penalty consequently does not contribute to the objective function. Thus asymptotically we are solving

\begin{array}{l} min_{f} \frac{1}{n} \sum_{i = 1}^{n} π_{y_{i}} ℓ_{T_{s}} (min g (f (x_{i}), y_{i})) \\ subject to \sum_{k = 1}^{K} f_{k} (x) = 0, \end{array}

for each π.

Note first that, for any x with positive probability density within a small neighborhood B(x, r) = {x˜: ||x˜ − x|| ≤ r} of radius r > 0 [i.e., P_X(x˜) > 0 for any x˜ ∈ B(x, r)], the average of π_{y_i}ℓ_{T_s}(min g(f(x_i), y_i)) over x_i ∈ B(x, r) converges to E(π_Y ℓ_{T_s}(min g(f(X), Y))|X = x) as n → ∞ and the radius r shrinks to zero. Thus, by Theorem 1, it is guaranteed that there exists a set of neighboring π₁(x) = (π₁(x), π₂(x), …, π_K (x)), π₂(x) = (π₁(x) − d_π, π₂(x) + d_π, …, π_K (x)), …, π_K (x) = (π₁(x) − d_π, π₂(x), …, π_K (x) + d_π) such that the weighted truncated large-margin classifiers with π₁(x), π₂(x), …, and π_K (x) classify x to class 1, 2, …, and K, respectively, for some π(x) = (π₁(x), π₂(x), …, π_K (x)). Consequently our $\hat{π} (x) = \sum_{k = 1}^{K} π_{k} (x) / K$ is well defined for any x.

For any π = (π₁, π₂, …, π_K ), ${| | π | |}_{1} = \sum_{k = 1}^{K} ∣ π_{k} ∣$ denotes its 1-norm. Next, we prove consistency by contradiction. If there exists an x such that π̂(x) does not converge to π̃(x) satisfying π̃_k(x)p_k(x) = π̃_k_′(x)p_k_′(x) for 1 ≤ k ≠ k′ ≤ K, then, as d_π → 0 and n → ∞, the classification rules for π₁(x), π₂(x), …, π_K (x) do not union to the set {1, 2, …, K} due to the consistency established in Theorem 1 and the fact that ${max}_{k = 1}^{K} {| | π_{k} (x) - \hat{π} (x) | |}_{1} = d_{π} / K \to 0$ . This violates our criterion on selecting π₁(x), π₂(x), …, π_K (x). As a result π̂(x) → π̃(x) for any x, which in turn implies that p̂_k(x) → p_k(x) for k = 1, 2, …, K for any x.

Derivation of the Dual Problem in Section 4.1

Note that $\frac{\partial}{\partial w_{k}} Q_{cav}^{s} (Θ)$ and $\frac{\partial}{\partial b_{k}} Q_{cav}^{s} (Θ)$ can be written, respectively, as follows

\begin{array}{l} - C [\sum_{i : y_{i} = k} π_{y_{i}} (- I_{{min g (f (x_{i}), y_{i}) < s}}) x_{i}^{T} + \sum_{i : y_{i} \neq k} π_{y_{i}} (I_{{j = arg max (f_{k^{'}} (x_{i}) : k^{'} \neq y_{i}), f_{y_{i}} (x_{i}) - f_{k} (x_{i}) < s}}) x_{i}^{T}], \\ - C [\sum_{i : y_{i} = k} π_{y_{i}} (- I_{{min g (f (x_{i}), y_{i}) < s}}) + \sum_{i : y_{i} \neq k} π_{y_{i}} (I_{{k = arg max (f_{k^{'}} (x_{i}) : k^{'} \neq y_{i}), f_{y_{i}} (x_{i}) - f_{k} (x_{i}) < s}})], \end{array}

where I_{_A_} = 1 if event A is true and 0 otherwise.

Using the definition of β_ij, we have

\frac{\partial}{\partial w_{k}} Q_{cav}^{s} (Θ) = \sum_{i : y_{i} = k} (\sum_{k^{'} \neq y_{i}} β_{{i k}^{'}}) x_{i}^{T} - \sum_{i : y_{i} \neq k} β_{i k} x_{i}^{T}

and

\frac{\partial}{\partial b_{k}} Q_{cav}^{s} (Θ) = \sum_{i : y_{i} = k} (\sum_{k^{'} \neq y_{i}} β_{{i k}^{'}}) - \sum_{i : y_{i} \neq k} β_{i k} .

Applying the first-order approximation to the concave part, the objective function at step (t + 1) becomes

Q^{s} (Θ) = \frac{1}{2} \sum_{k = 1}^{K} {| | w_{k} | |}_{2}^{2} + C \sum_{i = 1}^{n} π_{y_{i}} H_{1} (min g (f (x_{i}), y_{i})) + \sum_{k = 1}^{K} 〈 \frac{\partial}{\partial w_{k}} Q_{cav}^{s} (Θ_{t}), w_{k} 〉 + \sum_{k = 1}^{K} b_{k} \frac{\partial}{\partial b_{k}} Q_{cav}^{s} (Θ_{t}),

where Θ_t is the current solution.

Using slack variable ξ_i’s for the hinge loss function, the optimization problem at step (t + 1) becomes

\begin{array}{l} min_{W, b, ξ} \frac{1}{2} \sum_{k = 1}^{K} {| | w_{k} | |}_{2}^{2} + C \sum_{i = 1}^{n} π_{y_{i}} ξ_{i} + \sum_{k = 1}^{K} 〈 \frac{\partial}{\partial w_{k}} Q_{cav}^{s} (Θ_{t}), w_{k} 〉 + \sum_{k = 1}^{K} b_{k} \frac{\partial}{\partial b_{k}} Q_{cav}^{s} (Θ_{t}) \\ subject to ξ_{i} \geq 0, i = 1, 2, \dots, n, \\ ξ_{i} \geq 1 - [x_{i}^{T} w_{y_{i}} + b_{y_{i}}] + [x_{i}^{T} w_{k} + b_{k}], \\ i = 1, 2, \dots, n; k \neq y_{i} . \end{array}

The corresponding Lagrangian is

L (W, b, ξ) = \frac{1}{2} \sum_{k = 1}^{K} {| | w_{k} | |}_{2}^{2} + C \sum_{i = 1}^{n} π_{y_{i}} ξ_{i} - \sum_{i = 1}^{n} u_{i} ξ_{i} - \sum_{i = 1}^{n} \sum_{k^{'} \neq y_{i}} α_{{i k}^{'}} (x_{i}^{T} w_{y_{i}} + b_{y_{i}} - x_{i}^{T} w_{k^{'}} - b_{k^{'}} + ξ_{i} - 1) + \sum_{k = 1}^{K} 〈 \frac{\partial}{\partial w_{k}} Q_{cav}^{s} (Θ_{t}), w_{k} 〉 + \sum_{k = 1}^{K} b_{k} \frac{\partial}{\partial b_{k}} Q_{cav}^{s} (Θ_{t}),

(A.1)

subject to

\frac{\partial}{\partial w_{k}} L = w_{k}^{T} - [\sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) x_{i}^{T} - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k}) x_{i}^{T}] = 0,

(A.2)

\frac{\partial}{\partial b_{k}} L = - [\sum_{i : y_{i} = k} \sum_{k^{'} \neq y_{i}} (α_{{i k}^{'}} - β_{{i k}^{'}}) - \sum_{i : y_{i} \neq k} (α_{i k} - β_{i k})] = 0,

(A.3)

\frac{\partial}{\partial ξ_{i}} L = C π_{y_{i}} - u_{i} - \sum_{k \neq y_{i}} α_{i k} = 0,

(A.4)

where the Lagrangian multipliers are u_i ≥ 0 and α_ik_′ ≥ 0 for any i = 1, 2, …, n, k′ ≠ y_i. Substituting Equations (A.2) through (A.4) into Equation (A.1) yields the desired dual problem in Section 4.1.

Contributor Information

Yichao Wu, Email: wu@stat.ncsu.edu.

Hao Helen Zhang, Email: hzhang2@stat.ncsu.edu.

Yufeng Liu, Email: yfliu@email.unc.edu.

References

Agresti A, Coull B. Approximate Is Better Than ‘Exact’ for Interval Estimation of Binomial Proportions. The American Statistician. 1998;52:119–126. [Google Scholar]
An LTH, Tao PD. Solving a Class of Linearly Constrained Indefinite Quadratic Problems by d.c. Algorithms. Journal of Global Optimization. 1997;11:253–285. [Google Scholar]
Bartlett PL, Jordan MI, Mcauliffe JD. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association. 2006;101:138–156. [Google Scholar]
Cortes C, Vapnik V. Support Vector Networks. Machine Learning. 1995;20:273–297. [Google Scholar]
Kimeldorf G, Wahba G. Some Results on Tchebycheffian Spline Functions. Journal of Mathematical Analysis and Applications. 1971;33:82–95. [Google Scholar]
Lee Y, Lin Y, Wahba G. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. Journal of the American Statistical Association. 2004;99:67–81. [Google Scholar]
Lin Y. Support Vector Machines and the Bayes Rule in Classification. Data Minning and Knoweldge Discovery. 2002;6:259–275. [Google Scholar]
Liu Y. Fisher Consistency of Multicategory Support Vector Machines. Eleventh International Conference on Artificial Intelligence and Statistics. 2007:289–296. Available at http://www.stat.umn.edu/~aistat/proceedings/start.htm.
Liu Y, Shen X. Multicategory ψ-Learning. Journal of the American Statistical Association. 2006;101:500–509. [Google Scholar]
Liu Y, Shen X, Doss H. Multicategory ψ-Learning and Support Vector Machine: Computational Tools. Journal of Computational and Graphical Statistics. 2005;14:219–236. [Google Scholar]
Shao J. Linear Model Selection by Cross-Validation. Journal of the American Statistical Association. 1993;88:486–494. [Google Scholar]
Shen X, Tseng G, Zhang X, Wong W. On ψ-Learning. Journal of the American Statistical Association. 2003;98:724–734. [Google Scholar]
Vapnik V. Statistical Learning Theory. New York: Wiley; 1998. [Google Scholar]
Wahba G. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV. In: Schoelkopf B, Burges C, Smola A, editors. Advances in Kernel Methods Support Vector Learning. Cambridge, MA: MIT Press; 1999. pp. 69–88. [Google Scholar]
Wang J, Shen X, Liu Y. Probability Estimation for Large Margin Classifiers. Biometrika. 2008;95:149–167. [Google Scholar]
Weston J, Watkins C. Support Vector Machines for Multi-Class Pattern Recognition. In: Verleysen M, editor. Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN-99) Bruges; Belgium: 1999. pp. 219–224. Available at http://www.informatik.uni-trier.de/~ley/db/conf/esann/esann1999.html. [Google Scholar]
Wu TF, Lin CJ, Weng RC. Probability Estimates for Multi-Class Classification by Pairwise Coupling. Journal of Machine Learning Research. 2004;5:975–1005. [Google Scholar]
Wu Y, Liu Y. Robust Truncated-Hinge-Loss Support Vector Machines. Journal of the American Statistical Association. 2007;102:974–983. [Google Scholar]
Zhang HH, Liu Y, Wu Y, Zhu J. Variable Selection for the Multicategory SVM via Adaptive Sup-Norm Regularization. Eletronic Journal of Statistics. 2008;2:149–167. [Google Scholar]
Zhang T. Statistical Analysis of Some Multi-Category Large Margin Classification Methods. Journal of Machine Learning Research. 2004;5:1225–1251. [Google Scholar]
Zhu J, Hastie T. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational and Graphical Statistics. 2005;14:185–205. [Google Scholar]

[R1] Agresti A, Coull B. Approximate Is Better Than ‘Exact’ for Interval Estimation of Binomial Proportions. The American Statistician. 1998;52:119–126. [Google Scholar]

[R2] An LTH, Tao PD. Solving a Class of Linearly Constrained Indefinite Quadratic Problems by d.c. Algorithms. Journal of Global Optimization. 1997;11:253–285. [Google Scholar]

[R3] Bartlett PL, Jordan MI, Mcauliffe JD. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association. 2006;101:138–156. [Google Scholar]

[R4] Cortes C, Vapnik V. Support Vector Networks. Machine Learning. 1995;20:273–297. [Google Scholar]

[R5] Kimeldorf G, Wahba G. Some Results on Tchebycheffian Spline Functions. Journal of Mathematical Analysis and Applications. 1971;33:82–95. [Google Scholar]

[R6] Lee Y, Lin Y, Wahba G. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. Journal of the American Statistical Association. 2004;99:67–81. [Google Scholar]

[R7] Lin Y. Support Vector Machines and the Bayes Rule in Classification. Data Minning and Knoweldge Discovery. 2002;6:259–275. [Google Scholar]

[R8] Liu Y. Fisher Consistency of Multicategory Support Vector Machines. Eleventh International Conference on Artificial Intelligence and Statistics. 2007:289–296. Available at http://www.stat.umn.edu/~aistat/proceedings/start.htm.

[R9] Liu Y, Shen X. Multicategory ψ-Learning. Journal of the American Statistical Association. 2006;101:500–509. [Google Scholar]

[R10] Liu Y, Shen X, Doss H. Multicategory ψ-Learning and Support Vector Machine: Computational Tools. Journal of Computational and Graphical Statistics. 2005;14:219–236. [Google Scholar]

[R11] Shao J. Linear Model Selection by Cross-Validation. Journal of the American Statistical Association. 1993;88:486–494. [Google Scholar]

[R12] Shen X, Tseng G, Zhang X, Wong W. On ψ-Learning. Journal of the American Statistical Association. 2003;98:724–734. [Google Scholar]

[R13] Vapnik V. Statistical Learning Theory. New York: Wiley; 1998. [Google Scholar]

[R14] Wahba G. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV. In: Schoelkopf B, Burges C, Smola A, editors. Advances in Kernel Methods Support Vector Learning. Cambridge, MA: MIT Press; 1999. pp. 69–88. [Google Scholar]

[R15] Wang J, Shen X, Liu Y. Probability Estimation for Large Margin Classifiers. Biometrika. 2008;95:149–167. [Google Scholar]

[R16] Weston J, Watkins C. Support Vector Machines for Multi-Class Pattern Recognition. In: Verleysen M, editor. Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN-99) Bruges; Belgium: 1999. pp. 219–224. Available at http://www.informatik.uni-trier.de/~ley/db/conf/esann/esann1999.html. [Google Scholar]

[R17] Wu TF, Lin CJ, Weng RC. Probability Estimates for Multi-Class Classification by Pairwise Coupling. Journal of Machine Learning Research. 2004;5:975–1005. [Google Scholar]

[R18] Wu Y, Liu Y. Robust Truncated-Hinge-Loss Support Vector Machines. Journal of the American Statistical Association. 2007;102:974–983. [Google Scholar]

[R19] Zhang HH, Liu Y, Wu Y, Zhu J. Variable Selection for the Multicategory SVM via Adaptive Sup-Norm Regularization. Eletronic Journal of Statistics. 2008;2:149–167. [Google Scholar]

[R20] Zhang T. Statistical Analysis of Some Multi-Category Large Margin Classification Methods. Journal of Machine Learning Research. 2004;5:1225–1251. [Google Scholar]

[R21] Zhu J, Hastie T. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational and Graphical Statistics. 2005;14:185–205. [Google Scholar]

PERMALINK

Robust Model-Free Multiclass Probability Estimation

Yichao Wu

Hao Helen Zhang

Yufeng Liu

Roles

Abstract

1. INTRODUCTION

2. WEIGHTED CLASSIFICATION AND FISHER CONSISTENCY

2.1 Weighted Binary Classification

2.2 Weighted Multiclass Classification

Definition 1

Proposition 1

Theorem 1

Remark 1

Figure 1.

3. METHODOLOGY

3.1 Direct Scheme for Probability Recovery

Proposition 2

Figure 2.

Proposition 3

Direct Scheme

Figure 3.

3.1.1 Numerical Challenges in Implementing Direct Scheme

3.2 Indirect Scheme for Probability Estimation

Figure 4.

Indirect Scheme

3.3 Theoretical Properties

Theorem 2

4. COMPUTATION ALGORITHMS

Figure 5.

Algorithm 1

4.1 Linear Learning

4.2 Nonlinear Learning

4.3 Parameter Tuning

5. SIMULATIONS

Table 5.

5.1 Numerical Examples

Example 1

Table 1.

Example 2

Table 2.

Example 3

Table 3.

Example 4

Table 4.

5.2 Empirical Computation Cost

Figure 6.

6. REAL DATA

Table 6.

7. CONCLUSION

Acknowledgments

APPENDIX

Proofs of Proposition 1 and Theorem 1

Proof of Proposition 2

Proof of Proposition 3

Proof of Theorem 2

Derivation of the Dual Problem in Section 4.1

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases