Author manuscript; available in PMC: 2013 Jul 21.
Published in final edited form as: Neural Comput. 2012 May 17;24(9):2473–2507. doi: 10.1162/NECO_a_00321

Inhibition in Multiclass Classification

Ramón Huerta 1, Shankar Vembu 2, José M Amigó 3, Thomas Nowotny 4, Charles Elkan 5
PMCID: PMC3717401  NIHMSID: NIHMS476923  PMID: 22594829

Abstract

The role of inhibition is investigated in a multiclass support vector machine formalism inspired by the brain structure of insects. The so-called mushroom bodies have a set of output neurons, or classification functions, that compete with each other to encode a particular input. Strongly active output neurons depress or inhibit the remaining outputs without knowing which is correct or incorrect. Accordingly, we propose to use a classification function that embodies unselective inhibition and train it in the large margin classifier framework. Inhibition leads to more robust classifiers in the sense that they perform well over larger regions of appropriate hyperparameters when assessed with leave-one-out strategies. We also show that the classifier with inhibition is a tight bound to probabilistic exponential models and is Bayes consistent for 3-class problems. These properties make this approach useful for data sets with a limited number of labeled examples. For larger data sets, there is no significant comparative advantage over other multiclass SVM approaches.

1 Introduction

The question of what algorithms neural media use to solve challenging pattern recognition problems remains one of the most fascinating and elusive problems in the neurosciences, as well as in artificial intelligence. Perceptrons and artificial neural networks were originally inspired by neural computation, but thereafter, a new generation of powerful algorithms for pattern recognition returned to Fisher discriminant ideas and addressed the fundamental question of minimizing the generalization error by using statistical principles. Kernel-based methods, in particular support vector machines (SVMs), became prevalent due to the convenience and simplicity of their algorithms. These methods became standard, and the original inspiration from neural computation faded away. The heuristics of neural integration, neural networks, plasticity in the form of Hebbian learning, and the regulatory effect of inhibitory neurons were less needed, and the fields of neuroscience and AI grew increasingly distant from each other.

We seek to bridge this gap and identify the similarities and, in some cases, equivalence between neural information processing and large margin classifiers. We use the large margin classifier formalism and attempt to identify a correspondence to neural mechanisms for pattern recognition, putting emphasis on the role of inhibition (Huerta, Nowotny, Garcia-Sanchez, Abarbanel, & Rabinovich, 2004; Huerta & Nowotny, 2009; O'Reilly, 2001). We use insect olfaction as our biological model system for two main reasons: (1) the simplicity and consistency of the structural organization of the olfactory pathway in many species and its similarity to the structure of an SVM and (2) the large body of knowledge concerning the location of learning in insects during odor conditioning, which matches the location of plasticity in SVMs.

The mushroom bodies in the brains of insects contain many classifiers that compete with each other. The mechanism to organize this competition such that a single winner (class) emerges is inhibition (Cassenaer & Laurent, 2012; Huerta et al., 2004; Nowotny, Huerta, Abarbanel, & Rabinovich, 2005; Huerta et al., 2009; O’Reilly, 2001). Each individual classifier exerts downward pressure on the rest, with a strength that has to be regulated. The SVM formalism provides a framework in which to understand the consequences of inhibition in multiclass classification problems.

Solving for the value of the inhibition within the SVM formalism leads to a unique solution that is robust to parameter variations and is a tight bound of probabilistic exponential models. We also present simple sequential algorithms to solve the problem using sequential minimal optimization (Platt, 1999a, 1999b; Keerthi, Shevade, Bhattacharyya, & Murthy, 2001) and stochastic gradient descent (Chapelle, 2007; Kivinen, Smola, & Williamson, 2010). We provide efficient software for both algorithms, written in C/C++, for others to experiment with (http://inls.ucsd.edu/~huerta/ISVM.tar.gz).

We present extensive experimental results using a collection of easy and difficult data sets, some with heavily unbalanced classes. The data sets are from the UCI repository, except for the MNIST digits data set. Results show that the inhibitory SVM framework generalizes better than the leading alternative methods with a small number of training examples. The mechanism of inhibition provides robustness: the inhibitory models outperform 1-versus-all SVMs and Weston-Watkins multiclass SVMs (Weston & Watkins, 1999) over a large sample of metaparameters. For large data sets, when there is sufficient data to estimate the metaparameters by leave-one-out strategies, the ISVM does not provide a significant advantage. Moreover, in terms of Bayes consistency (Tewari & Bartlett, 2007), the inhibitory SVM is better than other methods with the exception of Lee, Lin, and Wahba (2004).

This letter starts by explaining the notation and the insect-inspired formalism of the inhibitory classifier, followed by a comparison to previous methods using the same notation. Then we solve the formulation to write efficient and simple algorithms. We conclude with experimental results.

2 Insect Brain Anatomy

The three areas of the insect brain involved in olfaction are the olfactory receptor cells or sensors, the antennal lobe (AL) or feature extraction device, and the mushroom body (MB) or classifier (see Figure 1). When a gas is present, olfactory receptor cells feed this information into the AL, which extracts the features that will be classified by the MB.

Figure 1. Illustration of the correspondence between the insect brain and kernel classification. (Left) Anatomical picture of the honeybee brain (courtesy of Robert Brandt, Paul Szyszka, and Giovanni Galizia). The antennal lobe is circled in dashed yellow, and the MB is circled in red. The projection neurons (in green) send direct synapses to the Kenyon cells in the calyx. The Kenyon cells carry the connections w that are the equivalent of the SVM hyperplane. (Right) Equivalent circuit representation in SVM language.

The input, and hence the evoked feature pattern x in the AL, can be associated with either a reward (+1) or a punishment (−1) at the level of the output of the MB, which we denote by y. Given N inputs, the problem consists of training the MB to correctly match yi = f(xi) for i = 1, …, N.

The MB function consists of two phases (Heisenberg, 2003; Laurent, 2002): (1) a projection into an explicit high-dimensional space Φ(x), named the calyx and consisting of hundreds of thousands of Kenyon cell (KC) neurons, and (2) a perceptron-like layer in the MB lobes (Huerta & Nowotny, 2009) where the classification function of each output neuron is implemented, fk(x) = 〈wk, Φ(x)〉 = Σj wkj Φj(x). The inner product reflects the synaptic integration of KC outputs in MB lobe neurons. Huerta and Nowotny (2009) and Huerta et al. (2004) showed that simple Hebbian rules can solve discrimination and classification problems because they closely resemble the learning obtained by calculating the subgradient in an SVM framework. In particular, it can be shown that the change in the synaptic connections, Δwkj, is proportional to Φj(x). These rules are also equivalent to the perceptron algorithm, as Freund and Schapire (1999) showed.

In addition, the MB lobes contain hundreds of neurons that operate in parallel and compete via synaptic inhibition that they receive from each other, in addition to the input Φ(x) from the calyx. The output neurons can, in principle, code for different stimulus classes. They can be situated in different MB lobes specializing in different functions, and they are modulated by neuromodulators like dopamine, octopamine, and others that are the focus of intense research in neuroscience.

The concept of inhibition does not directly appear in the SVM literature, although a fairly large body of research on multiclass SVMs uses similar concepts. Our goal here is to directly integrate the concept of inhibition into the SVM formalism in order to provide a simple algorithm for multiclass classification.

3 The Inhibitory Classifier

Consider a training set of data points xi for i = 1, …, N, where N is the number of data points. Each point i belongs to a known class ŷi whose value is an integer in the range [1, L]. We first make a change of variables from the vector ŷ to the N × L matrix y (called a coding matrix by Dietterich & Bakiri, 1995) defined by

yij = { 1 if ŷi = j; −1 otherwise }, (3.1)

that is, yij is 1 if the data point xi belongs to the class j; otherwise the entry is −1.

Next, we create a vector χi as L concatenations of xi, that is,

χi = (xi, xi, …, xi)  (L times). (3.2)

If xi ∈ ℜM, then the number of components of χi is L · M. More generally, given an arbitrary data point x ∈ ℜM, define 𝒞(ℜM) ⊂ ℜLM to be the subspace of intrinsic dimension M built by vectors of the form χ = (x, x, …, x) (x repeated L times). We sometimes say that χ = (χ1, …, χLM) is the embedding of x into 𝒞(ℜM). The inverse relation is given by x = (χkM+1, χkM+2, …, χ(k+1)M) for any k = 0, 1, …, L − 1.
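The embedding of equation 3.2 and its inverse can be sketched numerically (our illustration, not part of the authors' ISVM software):

```python
import numpy as np

# Sketch (ours) of the embedding of equation 3.2: chi repeats x L times,
# and any block of length M recovers x.
M, L = 4, 3
x = np.arange(1.0, M + 1.0)            # a data point in R^M
chi = np.tile(x, L)                    # chi = (x, x, ..., x), L*M components

for k in range(L):                     # the inverse relation, block by block
    assert np.allclose(chi[k * M:(k + 1) * M], x)
```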

When discussing SVMs, it is common to assume a nonlinear transformation Φ : ℜM → ℱ from the original data space ℜM to a feature vector space ℱ in order to facilitate the separability of data points. Moreover, we assume that ℱ is endowed with a dot product 〈·, ·〉 : ℱ × ℱ → ℜ. The inhibitory SVM proposed here uses a feature space that is the Cartesian product ℱL = ℱ × ··· × ℱ (L times). Correspondingly, we extend Φ to a nonlinear transformation Ψ : 𝒞(ℜM) → 𝒞(ℱ), where 𝒞(ℱ) ⊂ ℱL is the subspace of dimension dim ℱ built analogously as before, by repeated concatenation of the first dim ℱ components, and

Ψ(χ) = (Φ(x), Φ(x), …, Φ(x)), (3.3)

where χ is the embedding of x into 𝒞(ℜM). Furthermore, let Ψj : 𝒞(ℜM) → ℱL be the composition of Ψ with the projection operator onto the jth coordinate subspace of ℱL corresponding to the class j, that is,

Ψj(χ) = (0, …, 0, Φ(x), 0, …, 0), (3.4)

with Φ(x) in the jth position.

To ease the notation, the indices i, i′ will refer henceforth to data points in ℜM, while the indices j, j′ will refer to the classification classes. Their ranges are thus i, i′ ∈ {1, …, N} and j, j′ ∈ {1, …, L}.

The new inhibitory classifier for a data point xi and class j, fj : 𝒞(ℜM) → ℜ, has the form

fj(χi) = 〈w, Ψj(χi)〉 − μ〈w, Ψ(χi)〉 = 〈w, Ψj(χi) − μΨ(χi)〉, (3.5)

where w ∈ ℱL, w ≠ 0, is a hyperplane. Here 〈·, ·〉 is the dot product in ℱL, defined as the sum of the dot products of the corresponding projections onto each factor space ℱ. The scalar μ is the inhibitory factor and is the key novelty compared to other multiclass SVM methods because it is directly used in the evaluation of the classification function. As we will show, the value of the inhibitory factor μ can be derived directly from the minimization of the Lagrangian form and is data set independent. Note that

Σj fj(χi) = Σj 〈w, Ψj(χi)〉 − μL〈w, Ψ(χi)〉 = 〈w, Ψ(χi)〉 − μL〈w, Ψ(χi)〉 = (1 − μL)〈w, Ψ(χi)〉 (3.6)

for all i = 1, …, N.

The transformations Ψ and Ψj inherit many properties from the transformation function of standard SVMs, Φ : ℜM → ℱ. In particular (see equations 3.3 and 3.4),

〈Ψ(χi), Ψ(χi′)〉 = L · 〈Φ(xi), Φ(xi′)〉, (3.7)
〈Ψj(χi), Ψ(χi′)〉 = 〈Φ(xi), Φ(xi′)〉, (3.8)
〈Ψj(χi), Ψj′(χi′)〉 = I(j = j′)〈Φ(xi), Φ(xi′)〉, (3.9)

where the dot product 〈·, ·〉 on the left-hand side of equations 3.7 to 3.9 is taken in the product space ℱL, while the dot product on the right-hand side is taken in ℱ, and the indicator function I(j = j′) is 1 if j = j′ and 0 otherwise. The dot product 〈Φ(xi), Φ(xi′)〉 can be computed efficiently by a standard SVM kernel evaluation Kii′ = K(xi, xi′) = 〈Φ(xi), Φ(xi′)〉. Thus, we can develop the inhibitory multiclass SVM formulation using the standard kernel trick.
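Equations 3.7 to 3.9 can be checked numerically with a linear feature map Φ(x) = x, for which the kernel is the ordinary dot product (our sketch under that simplifying assumption, not the general kernel case):

```python
import numpy as np

# Numerical check (ours) of equations 3.7-3.9 with the linear feature map
# Phi(x) = x, so that K(x, x') = <x, x'>.
rng = np.random.default_rng(0)
M, L = 4, 3
x, xp = rng.normal(size=M), rng.normal(size=M)

def Psi(x):                     # Psi(chi) = (Phi(x), ..., Phi(x))
    return np.tile(x, L)

def Psi_j(x, j):                # Phi(x) in the j-th block, zeros elsewhere
    out = np.zeros(L * M)
    out[j * M:(j + 1) * M] = x
    return out

K = x @ xp
assert np.isclose(Psi(x) @ Psi(xp), L * K)                    # eq. 3.7
assert np.isclose(Psi_j(x, 1) @ Psi(xp), K)                   # eq. 3.8
assert all(np.isclose(Psi_j(x, j) @ Psi_j(xp, jp), (j == jp) * K)
           for j in range(L) for jp in range(L))              # eq. 3.9
```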

The basic idea behind equation 3.5 is to train fj classifiers that inhibit each other by a factor μ, which is data set independent. In the current form, we seek a single winner by virtue of the matrix yij. However, the approach can be used with data points assigned to multiple classes. All the subclassifiers fj must adjust, using the inhibitory factor, to classify the whole training set as well as possible. The conditions to have all the training points properly classified are

yij fj(χi) ≥ 1 − ηij,

where the ηij ≥ 0 are N · L slack variables.

Inhibition is not a new concept in machine learning. In particular, it has already been proposed in the context of energy-based learning via the so-called generalized margin loss (GML) function (LeCun, Chopra, Hadsell, Ranzato, & Jie, 2006). The word inhibition is not used explicitly in LeCun et al., but there are manifest similarities. The GML function represents the distance between the correct answer and the most offending incorrect answer. GML learning algorithms must change parameter values in order to make this distance exceed a margin m. One can express the GML using our notation as

fjGML(χi) = 〈w, Ψj(χi)〉 − maxj′≠j {〈w, Ψj′(χi)〉}.

The goal of training is to achieve yij fjGML(χi) ≥ m − ηij for all yij = 1, where m is an arbitrary margin value. The inhibitory formulation that we propose replaces the max operation by a summation and a multiplicative factor μ. Thus, we retain differentiability, which is advantageous for subsequent developments. A second difference is that the SVM formulation requires margin constraints to be satisfied for yij = −1. As we will see in the next few sections, these modifications allow us to create an effective, straightforward version of inhibition for SVMs.
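A small numeric sketch (with arbitrary illustrative scores, ours) contrasts the two scores: the GML subtracts the single most offending competitor, whereas the inhibitory classifier subtracts μ times the sum of all outputs:

```python
import numpy as np

# Illustrative scores <w, Psi_j(chi)> for j = 1..L (arbitrary values, ours).
scores = np.array([2.0, 1.0, -1.0])
L, j = len(scores), 0                 # evaluate the first class (index 0)

# GML: subtract the most offending competitor (a max, not differentiable).
f_gml = scores[j] - max(s for jp, s in enumerate(scores) if jp != j)
# Inhibitory classifier: subtract mu times the sum of all outputs,
# with mu = 1/L; a smooth, differentiable alternative.
f_inh = scores[j] - (1.0 / L) * scores.sum()
```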

Regular SVMs have been related to probabilistic exponential models (Canu & Smola, 2005; Pletscher, Soon Ong, & Buhmann, 2010). The inhibitory SVM can, remarkably, also be connected to log-linear models. Using our notation in a log-linear model, the probability of the label j given the data point χ and parameters w is

p(j | χ; w) = exp(〈w, Ψj(χ)〉) / Σk exp(〈w, Ψk(χ)〉),

where the indices j and k run over the classes 1 to L. Taking the logarithm of the previous expression gives

log p(j | χ; w) = 〈w, Ψj(χ)〉 − log Σk exp(〈w, Ψk(χ)〉).

Lemma 1

Given f = (f1, …, fL) ∈ ℜL, then

a: log Σk=1..L exp fk − (1/L) Σk=1..L fk − log L ≥ 0,
b: log Σk=1..L exp fk − (1/L) Σk=1..L fk − log L = 0 (3.10)

for f1 = ··· = fL only.

The proof can be found in appendix A. By applying lemma 1, one can write

log p(j | χ; w) ≤ 〈w, Ψj(χ)〉 − (1/L)〈w, Ψ(χ)〉 − log L, (3.11)

which is an equality if and only if fj := 〈w, Ψj (χ)〉 = 〈w, Ψk(χ)〉=: fk, for all 1 ≤ j, kL.

Note that most of the values of 〈w, Ψj(χ)〉 will be in the range [−1, 1] due to the large margin optimization of yij fj(χi) ≥ 1 − ηij. That means that the right-hand side of equation 3.11 is a close bound on log p(j | χ; w) for most of the χi. This bound is similar to equation 3.5, with μ in this case equal to 1/L, as shown below in the derivation. The inhibitory factor is thus universal: the idea of inhibition can be expressed by a normalization factor that depends on the outcome of all classifiers.
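Lemma 1 can be verified numerically (our sketch; `gap` is a hypothetical helper name, not from the paper's software):

```python
import numpy as np

# Numerical sketch (ours) of lemma 1: the gap
#   log sum_k exp(f_k) - (1/L) sum_k f_k - log L
# is nonnegative and vanishes exactly when all f_k coincide.
def gap(f):
    f = np.asarray(f, dtype=float)
    return np.log(np.exp(f).sum()) - f.mean() - np.log(f.size)

rng = np.random.default_rng(1)
assert all(gap(rng.normal(size=5)) >= 0 for _ in range(1000))  # part a
assert np.isclose(gap([0.7, 0.7, 0.7]), 0.0)                   # part b
```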

4 The Primal Problem

The primal objective function is the sum of the loss on each training example and a regularization term that reduces the complexity of the solution (Vapnik, 1995; Muller, Mika, Ratsch, Tsuda, & Schölkopf, 2001). The relative weight of the regularization term is controlled by a constant C > 0. The primal optimization problem can be expressed as

minimize E(w, μ) = (1/2)||w||² + C Σij ηij(w, μ)
subject to (i) ηij(w, μ) ≥ 0, (ii) yij fj(χi) − 1 + ηij(w, μ) ≥ 0. (4.1)

Thus, we have L · dim ℱ + 1 variables (w ∈ ℱL \ {0} and μ ∈ ℜ) and 2NL constraints. This problem is not convex in general due to the dependence of ηij on w and μ. Observe that fj(χi) also depends on w and μ (see equation 3.5). If dom η = ∩ij dom ηij denotes the common domain of the maps ηij, then the domain of the problem, equation 4.1, is 𝒟 = ((ℱL \ {0}) × ℜ) ∩ dom η. Moreover, we assume that all ηij are continuously differentiable. For practical purposes, the latter condition can be relaxed to hold except on a zero-measure set.

Consider the Lagrangian associated with equation 4.1:

ℒ(w, μ, α, β) = (1/2)||w||² + C Σij ηij − Σij βij ηij (4.2)
− Σij αij (yij [〈w, Ψj(χi)〉 − μ〈w, Ψ(χi)〉] − 1 + ηij), (4.3)

where α = (αi j) ∈ ℜNL, β = (βi j) ∈ ℜNL are the Lagrange multipliers. The Lagrange dual function (Boyd & Vandenberghe, 2004),

G(α, β) = inf(w,μ)∈𝒟 ℒ(w, μ, α, β), (4.4)

then yields a lower bound on the optimal value p* of the primal problem, equation 4.1, for all αi j ≥ 0 and βi j ≥ 0.

Thus, G(α, β) is determined by the critical points of ℒ(w, μ, α, β) for each value of α and β. Since ℒ is a C1 function of all its variables, we take the partial derivatives of ℒ with respect to w and μ and equate them to zero in order to get its critical points:

w − Σij (βij − C + αij) ∂w ηij − Σij αij yij [Ψj(χi) − μΨ(χi)] = 0 (4.5)
− Σij (βij − C + αij) ∂μ ηij + Σij αij yij 〈w, Ψ(χi)〉 = 0. (4.6)

According to the implicit function theorem, the solutions of equations 4.5 and 4.6 provide local functions w = wcrit (α, β) and μ = μcrit (α, β), except possibly for a zero measure set (actually a manifold) comprising those αi j, βi j values that make the Jacobian determinant vanish:

detJ(w,μ,α,β)=0. (4.7)

Moreover, these functions are continuously differentiable on account of all functional dependencies in equations 4.5 and 4.6 being continuously differentiable. Note that the infimum in equation 4.4 is taken over points (w, μ) ∈ 𝒟, but (wcrit(α, β), μcrit(α, β)) need not be in 𝒟 for all values of α and β that parameterize the implicit solutions. This being the case, we have that

G(α, β) = ℒ(wcrit(α, β), μcrit(α, β), α, β) (4.8)

for all α, β such that det J(w, μ, α, β) ≠ 0 and (wcrit (α, β), μcrit (α, β)) ∈ Inline graphic.

For our purposes, it will suffice to study the critical points on the NL-dimensional plane α + β − C = 0 (the intersection of the NL hyperplanes αij + βij = C), where C = (Cij) ∈ ℜNL with Cij = C > 0 for all i, j.

Lemma 2

From equations 4.5 and 4.6, it follows that

μcrit(α, C − α) = 1/L (4.9)

and

wcrit(α, C − α) = Σij αij yij [Ψj(χi) − (1/L)Ψ(χi)] (4.10)

for all α=(αi j) ∈ ℜNL such that Σi j αi j yi j Ψ (χi) ≠ 0.

The proof can be found in appendix B. Note that C in equation 4.9 is fixed but arbitrary. It follows that μcrit(α, β) does not depend on either α or β; hence,

μcrit(α, β) = 1/L. (4.11)

Theorem 1

Let E(w*, μ*) be the optimal value of the primal problem, equation 4.1. Then

μ* = 1/L.

The proof can be found in appendix C. The optimal solution μ* = 1/L renders the inhibitory term the average output of all subclassifiers: (1/L)〈w, Ψ(χi)〉 = (1/L) Σj 〈w, Ψj(χi)〉. The inhibitory factor turns out to be data set independent. Furthermore, from equation 3.6, it follows that Σj fj(χi) = 0.

The next step consists of putting all the constraints back into the classifier given by equation 3.5 to obtain

fj(χ) = Σi=1..N Σj′=1..L αij′ yij′ K(xi, x)(I(j = j′) − 1/L) ≡ fj(x), (4.12)

where χ = (x, x, …, x) ∈ Inline graphic(ℜM). To decide which class to choose for a given data point x, one uses the same decision function as in Weston and Watkins (1999) and Crammer and Singer (2001):

argmaxj fj(x). (4.13)

It is important to note that during classification, all of the fj (x) can be simplified because they are shifted by the same amount, that is,

fj(χ) = Σi=1..N Σj′=1..L αij′ yij′ K(xi, x) I(j = j′) − (1/L) Σi=1..N Σj′=1..L αij′ yij′ K(xi, x) = f̃j(x) + G(x). (4.14)

We can simplify the evaluation on the test set by just calculating

f̃j(x) = Σi=1..N αij yij K(xi, x) (4.15)

and selecting the class as

argmaxj fj(x) = argmaxj f̃j(x). (4.16)
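The equivalence of the full classifier of equation 4.12 and the simplified scores of equation 4.15 can be sketched with stand-in multipliers and an RBF kernel (illustrative values of our choosing, not trained ones):

```python
import numpy as np

# Sketch (ours) of the prediction step: the classifier of equation 4.12
# and the simplified scores of equation 4.15 differ by a shift common to
# all classes, so they select the same class (equation 4.16).
rng = np.random.default_rng(2)
N, L, M = 6, 3, 2
X = rng.normal(size=(N, M))                        # training points
y = -np.ones((N, L))
y[np.arange(N), rng.integers(0, L, N)] = 1.0       # coding matrix, eq. 3.1
alpha = rng.uniform(0, 1, size=(N, L))             # stand-in multipliers

x = rng.normal(size=M)                             # test point
k = np.exp(-((X - x) ** 2).sum(axis=1))            # RBF kernel evaluations

f_simple = (alpha * y * k[:, None]).sum(axis=0)    # eq. 4.15
f_full = f_simple - f_simple.sum() / L             # eq. 4.12

assert np.argmax(f_full) == np.argmax(f_simple)    # eq. 4.16
```

Note that f_full sums to zero over the classes, as implied by equation 3.6 with μ = 1/L.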

5 Previous Integrated Multiclass Formulations

This section places the new inhibitory SVM in the context of previous work. As described in section 1, the most common approach to multiclass classification is to combine models trained for a set of separate binary problems. A few previous approaches have integrated all classes into a single formulation. Generally, for class j, the output of the integrated approaches uses the classification function

fj(χ) = 〈w, Ψj(χ)〉 + bj,

where bj is a bias term, with decision function 4.13. Weston and Watkins (1999) were the first to put multiclass SVM classification into a single formulation. Using our notation, they solved the problem

min(w,η) E = (1/2)||w||² + C Σ{i,j s.t. yij = −1} ηij, (5.1)

but with different constraints,

〈w, Ψj(χi) − Ψj′(χi)〉 + (bj − bj′) ≥ 2 − ηij′

for the j such that yij = 1 and for all j′ such that yij′ = −1, where bj and bj′ are bias terms and ηij′ ≥ 0. The constraints imply that the SVM score of each data point for its own class needs to exceed its score for every other class by the margin (see appendix E for details).

The large number of constraints hinders solving the quadratic programming problem. Crammer and Singer (2001) proposed to reduce the number of slack variables by solving

min(w,η) E = (1/2)||w||² + C Σi ηi (5.2)

with constraints

〈w, Ψj(χi) − Ψj′(χi)〉 + I(yij = yij′) ≥ 1 − ηi

for the j such that yij = 1, for all j′ ≠ j, and for all data points i. The main difference with respect to Weston and Watkins (1999) is the reduced number of slack variables (see appendix F for details).

Tsochantaridis, Joachims, Hofmann, and Altun (2005) propose solving a similar problem as in equation 5.2 by rescaling the slack as

〈w, Ψj(χi) − Ψj′(χi)〉 ≥ 1 − ηi/Δ(yij, yij′)

for all j such that yi j = 1. The function Δ(yi j, yi j′) allows the loss to be penalized in a flexible manner, with Δ(1, 1) = 0. A second version proposes rescaling the margin as

〈w, Ψj(χi) − Ψj′(χi)〉 ≥ Δ(yij, yij′) − ηi.

Both approaches lead to similar accuracies on test sets, as shown in Tsochantaridis et al. (2005).

A remarkable approach is the formalism proposed by Lee et al. (2004), where the authors rewrite the constraints to match the Bayes decision rule (see section 10 for details) such that the most probable class of a particular example χ is the same as the one obtained by minimizing the primal problem. Lee and coauthors pose the constraints as

−〈w, Ψj(χi)〉 ≥ 1/(L − 1) − ηij

for all j in the set {j ∈ {1, …, L} s.t. yij ≠ 1}, with the additional constraint 〈w, Ψ(χi)〉 = 0. These constraints pose a cumbersome optimization problem but yield Bayes consistency (Tewari & Bartlett, 2007).

Table 1 presents a summary of the constraints used in each of the described methods. The main difference between our inhibitory multiclass method and the methods just described is in the way the classifier for class j is compared to the other classifiers. The inhibitory method essentially compares to the average of the outputs of all classifiers, while the previous methods perform pairwise comparisons. The second important difference of the inhibitory method is that inhibition is incorporated directly into the classification function itself.

Table 1.

Summary of the Constraints for Several Integrated SVM Multiclass Formulations.

Method | Constraints | Number of Constraints | Bayes Consistency
Weston and Watkins, 1999 | 〈w, Ψj(χi) − Ψj′(χi)〉 + (bj − bj′) ≥ 2 − ηij | N · (L − 1) | L < 3
Crammer and Singer, 2001 | 〈w, Ψj(χi) − Ψj′(χi)〉 + I(yij = yij′) ≥ 1 − ηi | N · L | L < 3
Tsochantaridis et al., 2005, slack rescaling | 〈w, Ψj(χi) − Ψj′(χi)〉 ≥ 1 − ηi/Δ(yij, yij′) | N · L | L < 3
Tsochantaridis et al., 2005, margin rescaling | 〈w, Ψj(χi) − Ψj′(χi)〉 ≥ Δ(yij, yij′) − ηi | N · L | L < 3
Lee et al., 2004 | −〈w, Ψj(χi)〉 ≥ 1/(L − 1) − ηij and 〈w, Ψ(χi)〉 = 0 | N · L | L ≥ 2
Inhibitory multiclass (ISVM) | yij〈w, Ψj(χi) − μΨ(χi)〉 ≥ 1 − ηij | N · L | L < 4

6 The Dual Problem of the Inhibitory Multiclass Problem

The dual problem is obtained by substituting the critical-point solutions of equations 4.9 and 4.10, with μ = 1/L, into the Lagrangian, equations 4.2 and 4.3, which yields the dual cost function W. This cost function has to be maximized with respect to the Lagrange multipliers αij as follows:

maxα W = Σij αij − (1/2) Σij Σi′j′ αij yij αi′j′ yi′j′ Kii′ [I(j = j′) − 1/L], subject to 0 ≤ αij ≤ C.

The double index notation in αi j and elsewhere is inconvenient to compare with previous published work and with the primal formulation explained in the following sections. Thus, we change the notation from i, j to a new index k running from 1 to N · L. Thus, we order the αi j ’s lexicographically: α1,1, …, α1,L, α2,1, …, α(N−1),L, αN,1, …, αN,L. With the new notation, we can write the dual problem as

maxα W = Σk αk − (1/2) Σk Σk′ αk yk αk′ yk′ Gkk′ (6.1)
subject to 0 ≤ αk ≤ C, (6.2)

where k, k′ = 1, …, N · L and

Gkk′ = K⌊(k−1)/L⌋+1, ⌊(k′−1)/L⌋+1 [I([k mod L] = [k′ mod L]) − 1/L]. (6.3)

If one uses C-language-style indexing with i = 0, …, N − 1, j = 0, …, L − 1, and k = 0, …, NL − 1, then the following kernel call is suggested:

Gkk′ = K⌊k/L⌋, ⌊k′/L⌋ [I([k mod L] = [k′ mod L]) − 1/L]. (6.4)
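Equation 6.4 with C-style indexing can be sketched in code (our illustration; `build_G` is a hypothetical helper, and the toy kernel matrix is arbitrary):

```python
import numpy as np

# Sketch (ours) of equation 6.4: flat index k maps to data point k // L
# and class k % L, all 0-based.
def build_G(K, L):
    NL = K.shape[0] * L
    G = np.empty((NL, NL))
    for k in range(NL):
        for kp in range(NL):
            same = 1.0 if (k % L) == (kp % L) else 0.0
            G[k, kp] = K[k // L, kp // L] * (same - 1.0 / L)
    return G

K = np.array([[1.0, 0.2],
              [0.2, 1.0]])               # toy 2-point kernel matrix
G = build_G(K, L=3)
assert np.isclose(G[0, 0], 1.0 * (1 - 1 / 3))   # same point, same class
assert np.isclose(G[0, 1], -1.0 / 3)            # same point, other class
assert np.isclose(G[0, 3], 0.2 * (1 - 1 / 3))   # other point, same class
```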

The Karush-Kuhn-Tucker (KKT) conditions for this problem can be calculated by constructing the Lagrangian from the dual as in Keerthi et al. (2001):

ℒ = −Σk αk + (1/2) Σk Σk′ αk yk αk′ yk′ Gkk′ − Σk uk αk − Σk lk (C − αk), with 0 ≤ αk ≤ C and uk, lk ≥ 0,

which leads to

yk Ek − uk + lk = 0, uk, lk ≥ 0, αk uk = 0, lk (C − αk) = 0,

where Ek = fk − yk and fk = Σk′ αk′ yk′ Gk′k. We obtain the standard KKT conditions for the SVM training problem:

yk Ek ≥ 0 for αk = 0, (6.5)
yk Ek = 0 for 0 < αk < C, (6.6)
yk Ek ≤ 0 for αk = C. (6.7)

It is useful to define a new variable Vk = yk Ek that indicates the proximity to the margin and saves computation time.

7 Stochastic Sequential Minimal Optimization

Prior to the first sequential minimal optimization (SMO) methods (Platt, 1999a, 1999b), the quadratic programming algorithms available at the time made SVMs infeasible for large-scale problems. The straightforward implementation of SMO enabled significant developments and improvements (Keerthi et al., 2001). The multiclass problem investigated in equations 6.1 and 6.2 has an advantage due to the absence of the constraint Σk αk yk = 0, which is typical in the dual SVM formulation. This constraint appears after solving the primal problem for the bias b of the classifier. It is avoidable in the multiclass problem due to the mutual competition among the classifiers by means of the inhibitory factor μ.

The idea of optimizing the quadratic function for a pair of multipliers is needed because one cannot modify the value of a single multiplier without violating the constraint Σk αk yk = 0 (Platt, 1999a, 1999b). In the inhibitory SVM, a single multiplier can be modified at a time. The analytical solution for a single multiplier i is derived from

W = constant + αi − (1/2) Gii αi² − αi yi [fi − αiold yi Gii],

whose solution is obtained from ∂W/∂αi = 0 to yield

1 − Gii αi − yi [fi − αiold yi Gii] = 0.

This can be rewritten as

αi = αiold + (1/Gii)(1 − yi fi) = αiold − Vi/Gii, (7.1)

where αiold is the value of the multiplier in the previous iteration. Every time an αi is updated, each error is updated according to Ej(t+1) = Ej(t) + (αi − αiold) yi Gij. In terms of the margin variable Vj, one can write

Vj(t+1) = Vj(t) + (αi − αiold) yi yj Gij for j = 1, …, NL. (7.2)
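The incremental update of equation 7.2 can be checked against a full recomputation of Vj (our sketch, using a random symmetric stand-in for the matrix G rather than a real kernel):

```python
import numpy as np

# Check (ours) of the incremental margin update of equation 7.2 against a
# full recomputation of V_j = y_j f_j - 1 with f_j = sum_k alpha_k y_k G_kj.
rng = np.random.default_rng(3)
NL = 8
A = rng.normal(size=(NL, NL))
G = A @ A.T                            # symmetric positive semidefinite
y = rng.choice([-1.0, 1.0], size=NL)
alpha = rng.uniform(0, 1, size=NL)

def V_full(alpha):
    f = G @ (alpha * y)                # f_j = sum_k alpha_k y_k G_kj
    return y * f - 1.0

V = V_full(alpha)
i, alpha_new = 2, 0.75                 # change a single multiplier
V = V + (alpha_new - alpha[i]) * y[i] * y * G[i, :]   # equation 7.2
alpha[i] = alpha_new
assert np.allclose(V, V_full(alpha))
```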

The randomized SMO algorithm is given in algorithm 1. One can improve the performance of the algorithm by remembering the indices of the multipliers that violate the KKT conditions. Then, instead of choosing among all possible multipliers, one chooses among those that need to be changed. The KKT distance function in algorithm 1 is

Algorithm 1.

Stochastic SMO Algorithm.

t := 1
αi:= 0 and Vi := −1 for i = 1, …, N L
do {
 Choose one index from k ∈ [1, …, N L].
αnew = αk − Vk/Gkk
αnew = max{0, αnew} and αnew = min{C, αnew}
 Initialize the KKT distance: KKT := 0
 loop over all i = 1, …, N L
  Vi := Vi + (αnewαk) yiykGik
  KKT := KKT + KKT distance(Vi, αi)
 end loop
αk = αnew
KKT := KKT/(N L)
t := t + 1
 } while (KKT >θ)

Note: N is the number of data points, L is the number of classes, and θ is the termination threshold, which we generally set to the same value as the tolerance T (10−3).

KKTdistance(Vi, αi) =
 −Vi if Vi < −T and αi < ε,
 Vi if Vi > T and αi > C − ε,
 |Vi| if |Vi| > T and ε ≤ αi ≤ C − ε,
 0 in the rest of the cases.

Above, T is the resolution of the proximity to the KKT condition, which we typically fix to 10−3 as originally proposed by Platt, and ε is the numerical resolution, which depends on the machine precision and which we typically set to 10−6. Generally, for all data sets tested, one can stop the algorithm early without impairing accuracy significantly.
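The complete loop can be sketched as follows (a minimal Python rendition of algorithm 1 on a synthetic three-class problem; this is our sketch, not the authors' C/C++ ISVM package, and the toy data and parameter choices are ours):

```python
import numpy as np

# Minimal sketch (ours) of the stochastic SMO of algorithm 1 on a
# synthetic, well-separated three-class problem with a gaussian kernel.
rng = np.random.default_rng(4)
L, M = 3, 2
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(10, M)) for c in centers])
labels = np.repeat(np.arange(L), 10)
N = X.shape[0]

# Kernel matrix, coding matrix y (equation 3.1) flattened to length N*L,
# and the inhibitory kernel G of equation 6.4 (0-based flat index k, with
# data point k // L and class k % L).
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
y = np.where(np.arange(L)[None, :] == labels[:, None], 1.0, -1.0).ravel()
i_of, j_of = np.arange(N * L) // L, np.arange(N * L) % L
G = K[np.ix_(i_of, i_of)] * ((j_of[:, None] == j_of[None, :]) - 1.0 / L)

C, T, eps, theta = 10.0, 1e-3, 1e-6, 1e-3
alpha, V = np.zeros(N * L), -np.ones(N * L)
for t in range(20000):
    k = rng.integers(0, N * L)
    a_new = np.clip(alpha[k] - V[k] / G[k, k], 0.0, C)   # equation 7.1
    V += (a_new - alpha[k]) * y[k] * y * G[k, :]          # equation 7.2
    alpha[k] = a_new
    # average KKT distance as the stopping criterion
    kkt = np.where((V < -T) & (alpha < eps), -V,
          np.where((V > T) & (alpha > C - eps), V,
          np.where((np.abs(V) > T) & (alpha > eps) & (alpha < C - eps),
                   np.abs(V), 0.0)))
    if kkt.mean() < theta:
        break

# Training accuracy via the simplified scores of equation 4.15.
F = K @ (alpha * y).reshape(N, L)
print((F.argmax(axis=1) == labels).mean())
```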

8 Stochastic Gradient Descent in Hilbert Space

Synaptic changes do not occur in a deterministic manner (Harvey & Svoboda, 2007; Abbott & Regehr, 2004). Axons are believed to make additional connections to dendrites of other neurons in a stochastic manner, suggesting that the formation or removal of synapses to strengthen or weaken a connection between two neurons is best described as a stochastic process (Seung, 2003; Abbott & Regehr, 2004). Meanwhile, variants of stochastic gradient descent (SGD) have recently been used to solve the SVM problem in the primal formulation (Bottou & Bousquet, 2008; Zhang, 2004; Shalev-Shwartz, Singer, & Srebro, 2007). The algorithms obtained for the modification of the synaptic weights w closely resemble Hebbian learning or perceptron rules. Since we are primarily dealing with nonlinear kernels, let us bridge the dual formulation with stochastic gradient descent using a Hilbert space.

Let us rewrite the primal formulation in equation 4.1 using a reproducing kernel Hilbert space (RKHS) as proposed in Chapelle (2007) and Kivinen et al. (2010). Let S be the training data set. For our specific problem, the RKHS ℋ = {f : S → ℜ} has a kernel G : S × S → ℜ with a dot product 〈·, ·〉ℋ such that 〈G(·, χ), f〉ℋ = f(χ), with χ ∈ S and f ∈ ℋ. The primal formulation then can be expressed as

min f∈ℋ E = min f∈ℋ [(1/2)||f||ℋ² + C Σi=1..NL max{0, 1 − yi f(χi)}]
      = min f∈ℋ [(1/2)〈f, f〉ℋ + C Σi=1..NL max{0, 1 − yi 〈f, G(χi, ·)〉ℋ}]. (8.1)

The formal expression of f is a linear combination of the kernel functions, f(χ) = Σi=1..NL α̂i G(χi, χ). In appendix D we show how the updating rule is derived as

α̂i(t+1) = (1 − η) α̂i(t) + η C yi 1̄(yi ft(χi) − 1), (8.2)

with

1̄(u) = { 1 if u < 0; 0 if u > 0; [0, 1] if u = 0 }, (8.3)

and η is the learning rate. For the evaluation of ft (χ) we use the kernel derived from the Lagrange multipliers function given by equation 6.3 because we know from the minimization of the Lagrangian that μ = 1/L. The corresponding i index of χ is the one that verifies χ= χi in the training set. For stochastic updating, it is convenient to track the evolution of the margin proximity variable Vi = yi ft (χi) − 1 every time a coefficient α̂i is changed:

Vj(t+1) = Vj(t) + yj (α̂i(t+1) − α̂i(t)) G(χi, χj) for j = 1, …, NL,

which is very similar to equation 7.2 obtained in the dual form.

Many approaches using stochastic gradient descent use a scaling factor in the learning rate proportional to (1/iteration number) in order to guarantee convergence (Zhang, 2004; Shalev-Shwartz et al., 2007). We propose here a different approach that leads to an algorithm that is almost equivalent to the stochastic SMO method. As in that method, we make use of the KKT conditions, which requires computing the current state of training at each iteration. Note that the variable Vi provides guidance concerning distance to the margin.

If the algorithm chooses the index k, then the change α̂k (t + 1) − α̂k (t) = Δk is derived from

0 = Vk(t) + yk Δk G(χk, χk),

so

Δk = −Vk(t) / (yk G(χk, χk)), (8.4)

assuming that G(χk, χk) ≠ 0. We combine equations 8.4 and 8.2 to obtain the learning rate η that would take the data point k exactly to the margin as

η(t) = Vk / (yk G(χk, χk)(λ α̂k(t) − yk 1̄(Vk))).

To avoid the computation inherent in the previous formula one can change Δk to

Δk = −ηeff Vk(t) / (yk G(χk, χk)). (8.5)

When ηeff = 1, the update takes the data point exactly to the margin, and we recover the SMO solution given in equation 7.1. The corresponding SGD algorithm is presented in algorithm 2. Algorithms 1 and 2 are almost identical. C++ implementations of both algorithms can be found in the software package ISVM.

Algorithm 2.

Stochastic Gradient Descent (SGD) with Endogenous Learning Rate.

t := 1
α̂_i := 0 and V_i := −1 for i = 1, …, NL
do {
 Choose one index k ∈ [1, …, NL].
 α_new := α̂_k − η_eff V_k(t)/(y_k G_kk)
 α_new := min{C, α_new} and α_new := max{−C, α_new}
 Initialize the KKT distance: KKT := 0
 loop over all i = 1, …, NL
  V_i := V_i + y_i(α_new − α̂_k)G_ik
  KKT := KKT + KKT_distance(V_i, y_i α̂_i)
 end loop
 KKT := KKT/(NL)
 α̂_k := α_new
 t := t + 1
} while (KKT > θ)

Note: N is the number of data points, L is the number of classes, ηeff is the learning rate, and θ is the stopping criterion. Note that this algorithm needs to compute the Vi values.
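Algorithm 2 can be sketched in Python as below. The paper's reference implementation is the C++ code in the ISVM package; here the toy data are ours, and since the text does not spell out KKT_distance, the surrogate `kkt_violation` (how far each pair (V_i, y_i α̂_i) is from satisfying one of the three KKT cases) is our own assumption.

```python
import numpy as np

def kkt_violation(V, ya, C, tol=1e-6):
    # Hypothetical surrogate for the paper's KKT_distance.
    if ya <= tol:           # lower bound y_i*alpha_i = 0: need V >= 0
        return max(0.0, -V)
    if ya >= C - tol:       # upper bound y_i*alpha_i = C: need V <= 0
        return max(0.0, V)
    return abs(V)           # interior: need V == 0

def sgd_endogenous(G, y, C=1.0, eta_eff=1.0, theta=1e-3, max_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)
    V = -np.ones(n)                          # V_i = y_i f(x_i) - 1 with f = 0
    for _ in range(max_iter):
        k = rng.integers(n)
        a_new = alpha[k] - eta_eff * V[k] / (y[k] * G[k, k])
        a_new = min(C, max(-C, a_new))       # clip the coefficient to [-C, C]
        V += y * (a_new - alpha[k]) * G[k]   # refresh all margin variables
        alpha[k] = a_new
        kkt = np.mean([kkt_violation(V[i], y[i] * alpha[i], C) for i in range(n)])
        if kkt < theta:
            break
    return alpha

# Toy separable problem: the learned sign pattern should match the labels.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
G = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
alpha = sgd_endogenous(G, y, C=1.0)
assert np.all(np.sign(G @ alpha) == y)
```

With η_eff = 1 each update is a coordinate-wise "jump to the margin," which on a positive-definite Gram matrix behaves like randomized Gauss-Seidel on the system V = 0.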

When making a prediction for a test example using $f_j(\chi)=\sum_{i=1}^{NL}\hat{\alpha}_i G(\chi_i,\chi)$, we replace $G(\chi_i, \chi)$ by $K(x_{i'}, x)\big(I(j = j') - 1/L\big) \equiv f_j(x)$, which means that we need to make L evaluations for each data point, j = 1, …, L, and select the one with the largest margin. This procedure is equivalent to equations 4.12 and 4.13.
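The per-class scoring can be sketched as follows, assuming a gaussian kernel. The data, class tags, and placeholder coefficients here are illustrative (in a trained ISVM the coefficients come from the optimization, not a vector of ones).

```python
import numpy as np

def isvm_predict(x, X_train, j_train, alpha_hat, L, gamma=1.0):
    """Score a test point once per candidate class j; the largest score wins."""
    K = np.exp(-gamma * ((X_train - x) ** 2).sum(axis=1))
    scores = [float(np.sum(alpha_hat * K * ((j_train == j) - 1.0 / L)))
              for j in range(L)]
    return int(np.argmax(scores))

X_train = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1], [10.0, 10.0]])
j_train = np.array([0, 0, 1, 1, 2])      # class tag of each training point
alpha_hat = np.ones(5)                   # placeholder coefficients
assert isvm_predict(np.array([0.0, 0.0]), X_train, j_train, alpha_hat, L=3) == 0
assert isvm_predict(np.array([5.0, 5.0]), X_train, j_train, alpha_hat, L=3) == 1
```

Note how the factor I(j = j') − 1/L makes every training point of a different class contribute negatively, which is the inhibition at prediction time.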

The primal and the dual formalism lead to an almost identical algorithm for the inhibitory multiclass problem. A major appealing feature of the algorithms is the simplicity of their implementation.

9 Experimental Robustness

In this section we show experimentally that the inhibitory SVM (ISVM) method generally achieves better generalization than other multiclass SVM methods for small training set sizes. With large training sets, all methods converge to similar levels of accuracy, and no clear distinction between methods can be made. Rifkin and Klautau (2004) and Hsu and Lin (2002) showed that the one-versus-all and one-versus-one approaches often perform well, with faster training times than the other methods.

For this investigation, we use a gaussian kernel $\exp(-\gamma\,\|x - x'\|^2/M)$, which leaves a pair of metaparameters C > 0 and γ > 0 to investigate. The key issue, in terms of robustness, is to determine whether the inhibitory SVM leads to better average performance than the 1-versus-all and Weston-Watkins multiclass approaches over pairs (C, γ). It is obviously not possible to cover the whole space of metaparameters, but one can sample it and obtain estimates. Our sampling methodology picks the best models at different percentile cuts (10%, 25%, and 50%) because one expects to explore parameter areas with a higher likelihood of achieving better performance. Thus, we ran an empirical leave-one-out verification strategy scanning the hyperparameter values γ = 5, 10 and varying C from 0.1 to 100 in steps of 0.5. The lower bound C = 0.1 is set because for small data sets, the SVM evaluation functions hardly reach the margin, and the performance drops considerably for all the methods. Note also that since we discard all solutions below the 50% performance cut, we do not explore those solutions. We used the same stochastic SMO algorithm and the same C++ implementation for 1-versus-all, Weston-Watkins, and ISVM. The only difference among the methods is the factor multiplying the kernel: $K(x_i, x_{i'})\big(I(j=j') - 1/L\big)$ for ISVM, $K(x_i, x_{i'})\,I(j=j')$ for SVM, and $K(x_i, x_{i'})\left(\sum_{k=1}^{L}\frac{y_{ik}+1}{2}\frac{y_{i'k}+1}{2} + I(j=j')\right)$ for Weston-Watkins.
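The feature-normalized gaussian kernel used throughout the experiments can be sketched as below; the division by M (the number of features) is what lets a single γ grid transfer across data sets of different dimensionality. Data here are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Xp, gamma):
    """exp(-gamma * ||x - x'||^2 / M), with M the number of features."""
    M = X.shape[1]
    d2 = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2 / M)

X = np.array([[0.0, 0.0], [1.0, 1.0]])
K = gaussian_kernel(X, X, gamma=5.0)
assert np.allclose(np.diag(K), 1.0)          # unit self-similarity
assert np.isclose(K[0, 1], np.exp(-5.0))     # squared distance = 2, M = 2
```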

In order to demonstrate the higher robustness of inhibition in a systematic manner, we ran comparisons on 14 data sets for several sizes of the training set, Ns = 50, 100, 150, 200, 500 (see Table 2), averaging over 100 random samples of size Ns from each data set. In Table 3, we report the results of pooling the leave-one-out performances over a grid of metaparameters using the gaussian kernel $\exp(-\gamma\,\|x - x'\|^2/M)$. The 10% best models were pooled and their average calculated. The same procedure was carried out for the 25% and 50% best models to illustrate the drop in performance as the explored area of the parameter set increases.

Table 2.

Summary of the Data Sets Used for Robustness Calculation.

Data Set Number of Examples Number of Classes Base Performance
Abalone 4,177 6 (age/5)a 36%
DNA 3,186 3 52%
E. coli 332 6b 43%
Glass Identification 214 7 35%
Iris 150 3 33.33%
Image Segmentation 330 7 14%
Landsat Satellite 6,435 6 23.8%
Letter 20,000 26 4%
MNIST 60,000 10 10%
Shuttle 58,000 7 78%
Vehicle 946 4 25.7%
Vowel Recognition 528 11 9%
Wine Recognition 178 3 40%
Yeast 1,462 10 30%

Notes: We indicate the number of examples, the number of classes, and the worst possible performance by choosing as the default answer the most probable class in the data sets.

a

This data set predicts age from 1 to 29 and is more of a regression problem; thus, we predict age bands by dividing age by 5.

b

The imL and imS classes were removed because they have only two examples each.

Table 3.

Average Performance Comparison for ISVM, 1-Versus-All and Weston-Watkins Using 14 Data Sets and Running LOO on 100 Random Samples for Each Data Set.

Data Set NS Inhibitory SVM
1-Versus-All
Weston-Watkins
10% 25% 50% 10% 25% 50% 10% 25% 50%
Abalone 50 61.43% 60.69% 60.09% 60.83% 59.69% 58.85% 60.07% 59.47% 59.10%
Abalone 100 66.55 65.91 65.16 65.47 64.13 63.12 64.18 64.12 64.06
Abalone 200 67.00 66.61 66.08 65.97 65.07 64.08 63.63 63.61 63.58
Abalone 500 67.77 67.48 67.09 67.63 66.87 65.79 64.24 64.23 64.22
DNA 50 49.59 49.25 49.14 49.28 49.13 49.08 49.77 49.18 47.99
DNA 100 54.08 54.04 54.02 54.04 54.04 54.01 52.78 52.29 52.02
DNA 200 56.24 56.19 56.16 56.20 56.18 56.13 53.33 53.18 52.66
DNA 500 60.90 60.87 60.82 60.92 60.90 60.84 53.57 53.57 53.56
E. coli 50 82.05 81.24 80.25 80.60 79.04 78.44 81.02 80.83 80.58
E. coli 100 84.06 83.60 82.78 83.23 81.56 80.55 82.97 82.90 82.78
E. coli 200 87.02 86.52 85.78 86.32 84.85 83.41 85.34 85.32 85.30
Glass 50 64.52 64.36 64.13 63.82 63.29 62.83 61.00 60.97 60.92
Glass 100 71.92 71.80 71.35 71.08 70.79 70.41 63.99 63.97 63.94
Glass 200 75.23 74.78 74.37 75.27 74.79 74.12 65.79 65.79 65.76
Iris 50 89.45 89.37 89.26 89.31 89.14 88.91 87.19 86.54 85.81
Iris 100 91.94 91.86 91.65 91.88 91.55 91.38 90.88 90.20 89.13
Iris 140 93.16 93.03 92.81 92.95 92.71 92.54 92.39 92.27 91.95
L. Sat 50 82.43 82.24 81.99 81.91 81.56 81.44 82.37 82.30 82.24
L. Sat 100 83.00 82.88 82.64 82.61 82.33 82.24 83.00 82.97 82.93
L. Sat 200 85.49 85.38 85.16 85.17 84.86 84.75 84.81 84.80 84.79
L. Sat 500 89.08 88.74 88.47 88.58 88.34 88.26 85.94 85.93 85.93
Letter 50 30.68 30.65 30.61 30.64 30.64 30.63 30.00 30.00 30.00
Letter 100 40.69 40.57 40.27 39.95 39.93 39.91 39.98 39.98 39.98
Letter 200 51.53 51.46 51.35 50.96 50.95 50.94 52.41 52.41 52.40
Letter 500 66.54 66.45 66.27 64.57 64.44 64.39 68.09 68.08 68.08
MNIST 50 53.76 53.76 53.38 53.80 53.80 53.72 51.88 51.86 51.85
MNIST 100 67.22 67.22 66.50 67.18 67.18 67.03 64.58 64.58 64.58
MNIST 200 77.53 77.53 76.76 77.51 77.51 77.34 75.40 75.40 75.40
MNIST 500 85.82 85.80 85.08 85.62 85.61 85.44 83.65 83.65 83.65
Segment 50 77.72 77.63 77.53 77.71 77.58 77.46 75.35 75.02 74.65
Segment 100 83.74 83.71 83.61 83.90 83.82 83.67 81.86 81.82 81.77
Segment 200 87.86 87.79 87.63 87.85 87.82 87.74 85.48 85.46 85.44
Shuttle 50 90.85 90.84 90.83 90.76 90.76 90.72 90.22 90.15 90.08
Shuttle 100 94.31 94.29 94.18 93.95 93.94 93.92 93.91 93.89 93.84
Shuttle 200 97.02 97.01 96.95 96.88 96.85 96.81 96.29 96.28 96.28
Shuttle 500 98.60 98.53 98.41 98.40 98.30 98.25 97.68 97.68 97.67
Vehicle 50 61.06 61.02 60.70 60.91 60.89 60.69 58.13 57.86 57.56
Vehicle 100 66.28 66.28 65.99 66.14 66.14 66.03 63.36 63.14 62.86
Vehicle 200 70.13 70.07 69.85 70.01 69.97 69.88 67.63 67.58 67.51
Vehicle 500 75.26 74.36 73.89 74.66 74.13 73.95 71.23 71.16 71.03
Vowel 50 46.61 46.61 46.48 46.60 46.60 46.57 46.76 46.76 46.76
Vowel 100 61.61 61.58 61.37 61.48 61.48 61.45 62.08 62.07 62.06
Vowel 200 77.73 77.65 77.54 77.78 77.77 77.74 77.76 77.75 77.75
Vowel 500 95.00 94.87 94.83 95.20 95.20 95.16 94.52 94.52 94.52
Wine 50 93.17 93.17 93.12 93.16 93.16 93.12 93.32 93.30 93.25
Wine 100 94.26 94.23 94.21 94.21 94.20 94.20 94.24 94.22 94.20
Wine 150 95.29 95.29 95.27 95.29 95.29 95.28 94.85 94.82 94.81
Yeast 50 48.36 47.60 46.98 47.21 46.39 46.09 47.99 47.71 47.44
Yeast 100 52.57 51.63 50.74 50.66 49.11 48.56 51.58 51.57 51.55
Yeast 200 55.00 54.28 53.16 53.01 50.55 49.60 53.06 53.05 53.02
Yeast 500 60.26 59.27 57.75 55.92 51.95 49.88 54.89 54.89 54.89

Notes: The kernel used is $\exp(-\gamma\,\|x - x'\|^2/M)$, such that the radial basis functions are normalized to the number of features. The performance shown is based on the leave-one-out calculation over Ns samples run over 100 different realizations. The performances of all explored metaparameters for C = 0.1 to 50 and γ = 5, 10 are pooled and sorted. The table shows the average performance of the 10%, 25%, and 50% best models. In most of the cases, the inhibitory SVM outperforms the rest, with Weston-Watkins being competitive for smaller sizes and 1-versus-all becoming competitive for Ns ≥ 200.

The main conclusion from this assessment is that the average performance over the areas of parameter values providing near-optimal performance is higher for the ISVM than for 1-versus-all and Weston-Watkins. In general, the performance of the ISVM is better for small data sets, although this advantage diminishes as the number of examples grows. The Weston-Watkins method is competitive for small data sets but loses performance for larger numbers of samples. Overall, the ISVM demonstrates better robustness and performance for small data sets. To summarize the results and add interpretation to the table, we tested the null hypothesis H0 that either the SVM or the WW method has average performance better than or equal to the ISVM method. We performed a maximum likelihood ratio test (Dempster, 1997; Rodriguez & Huerta, 2009), as it has, according to the Neyman-Pearson lemma, optimal power for a given significance level (Neyman & Pearson, 1933). For the 14-trial (data set) test, H0 can be rejected at significance level 5% if the likelihood ratio L is larger than c = 3.77. Table 4 summarizes the results, showing that most of the time we can reject the H0 hypothesis. If, on the other hand, one runs the test against the alternative hypothesis H1, "ISVM is better than or equal to SVM or WW," it cannot be rejected in any of the cases.

Table 4.

Likelihood Ratio Values Using the 14 Data Sets.

H0: SVM Better Than ISVM
H0: WW Better Than ISVM
NS 10% 25% 50% 10% 25% 50%
50 446** 446** 3.77* 11.35* 3.77* 1.78
100 446** 446** 3.77* 446** 11.35** 3.77*
200 52** 3.77* 1.78 52** 52** 52**
500 4.35* 4.35* 1.05 22.17** 22.17** 22.17**

Notes: c-values ≥ 3.77 reflect a significance level of Pr(L ≥ c | H0) ≤ 0.05 (*) and c-values ≥ 11.35 reflect a significance level of Pr(L ≥ c | H0) ≤ 0.01 (**). For the 9 data sets with size 500, the rejection thresholds are 4.35 and 22.17. Thus, the null hypothesis can be rejected in most cases. If the null hypothesis is reversed (ISVM better than SVM and ISVM better than WW), then we cannot reject it in any of the cases.

In terms of training time, the Weston-Watkins algorithm is the fastest of the three methods: on the leave-one-out task from C = 0.1 to 50 over all the data sets, it runs eight times faster than the ISVM and two times faster than 1-versus-all. The three methods were implemented using the same code and the same stochastic SMO, so the better performance and robustness of the ISVM come at a cost in training time, although there is no significant difference in execution time.

10 Bayes Consistency

Our overall goal is to find a classification function f with a minimal probability of misclassification R(f) (Lugosi & Vayatis, 2004). In a multiclass setting (Tewari & Bartlett, 2007), given the posterior probabilities $p_j \equiv p(o = y_j \mid x)$, with j labeling all L output classes, and given the outputs $f_j(x)$ after training, $\arg\max_j f_j$ must match $\arg\max_j p_j$. In other words, the classifier function $f = \{f_1, \dots, f_L\}$ must select the most probable class (or the most probable classes if several classes have equal probability). This condition is called classification calibration, and theorem 2 in Tewari and Bartlett (2007) asserts that classification calibration is necessary and sufficient for convergence to the optimal Bayes risk. Tewari and Bartlett use

$\inf_f R(f) = \inf_f \sum_j p_j\, h(f_j),$ (10.1)

where h(fj) is the cost function without the regularization term. The inhibitory SVM has

$h(f_j) = \big[1 - (f_j - \hat{f})\big]_+ + \sum_{i \ne j}\big[1 + (f_i - \hat{f})\big]_+,$

where $\hat{f} = \frac{1}{L}\sum_i f_i$ and $f_j \in \Re$. The problem in equation 10.1 is thus equivalent to solving a linear problem $\inf_z \sum_j p_j z_j$, where z takes all the admissible values induced by $f \in \Re^L$. The consistency condition is

$\arg\min_i z_i = \arg\max_i p_i.$

Tewari and Bartlett (2007) analyze the consistency of several multiclass classifiers, which requires characterizing the sets of z induced by f. Because the proofs can be cumbersome due to the topological complexity of the intersecting hyperplanes induced by f, Monte Carlo simulation is a viable alternative to quickly evaluate the consistency of a classifier. Algorithm 3 gives a straightforward implementation.

Algorithm 3.

Monte Carlo Algorithm to Check Bayes Consistency.

c := 0, N := 1, L := L*
do {
 Choose p ∈ (ℜ+)^L and normalize p_i := p_i/Σ_j p_j
 Find the infimum of Σ_i p_i h(f_i)
 if arg min_i h(f_i) = arg max_i p_i then c := c + 1
 N := N + 1
} while (N ≤ N_max)
risk := 1 − c/N_max
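Algorithm 3 can be sketched in Python as follows. The surrogate loss h is taken directly from section 10; the infimum over f, which the paper leaves to a numerical solver, is approximated here by crude random search (our own simplification), so the estimated risk is only indicative.

```python
import numpy as np

def h(f):
    """ISVM surrogate: h(f_j) = [1-(f_j-fbar)]_+ + sum_{i!=j} [1+(f_i-fbar)]_+."""
    fbar = f.mean()
    L = len(f)
    return np.array([
        max(0.0, 1.0 - (f[j] - fbar))
        + sum(max(0.0, 1.0 + (f[i] - fbar)) for i in range(L) if i != j)
        for j in range(L)
    ])

def consistency_risk(L, n_trials=50, n_search=1000, seed=0):
    """Monte Carlo estimate of the fraction of random p with inconsistent minimizers."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_trials):
        p = rng.random(L)
        p /= p.sum()
        best_f, best_val = None, np.inf
        for f in rng.normal(scale=2.0, size=(n_search, L)):  # crude search for inf
            val = float(p @ h(f))
            if val < best_val:
                best_val, best_f = val, f
        if np.argmin(h(best_f)) != np.argmax(p):
            errors += 1
    return errors / n_trials

# Sanity checks: h at f = (1, 0, 0), and the risk being a valid proportion.
assert np.allclose(h(np.array([1.0, 0.0, 0.0])), [5/3, 11/3, 11/3])
assert 0.0 <= consistency_risk(3, n_trials=10, n_search=300) <= 1.0
```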

Table 5 lists the consistency risks observed. An advantage of the ISVM is its consistency for 3-class problems and a lower probability of reaching inconsistencies for L > 3.

Table 5.

Monte Carlo Simulation of Consistency Using 100,000 Runs.

L Regular SVM ISVM Weston-Watkins
2 0% 0% 0%
3 5% 0% 15%
4 25% 10% 39%
5 37% 17% 48%

Notes: Not surprisingly, we found 0% consistency errors for binary problems. The ISVM is also consistent for L = 3 but becomes inconsistent beyond that. Note that the probability of encountering a harder problem increases with the number of classes.

11 Conclusion

In this letter, we have developed a new variation on the support vector machine theme using the concept of inhibition, which is widespread in animal neural systems (Cassenaer & Laurent, 2012). The main engineering advantage of inhibition is the ability to achieve better average accuracy over a broad metaparameter space with a small number of training examples, as shown across multiple learning tasks. This success of the inhibitory SVM method is reminiscent of the low number of examples that insects need to learn odor recognition (Smith, Abramson, & Tobin, 1991; Smith, Wright, & Daly, 2005).

The underlying reason that ISVMs perform better in the cases reported here appears to be that inhibition widens the area of the hyperparameters C and γ that is close to the optimum, making good hyperparameters easier to find. The consistency analysis shows that ISVMs are consistent for 3-class problems and show a smaller percentage of inconsistencies overall. The ISVM can be made consistent by eliminating the positive examples $y_{ij} = 1$ from the primal function, but this point is left for further analysis. Finally, it is important to emphasize that by using lemma 1, we show that log-linear models are almost equivalent to the inhibitory SVM framework, reflecting the universality of inhibition across different classification formalisms.

Acknowledgments

We acknowledge partial support by ONR N00014-07-1-0741, NIDCD R01DC011422-01, JPL 1396686, U.S. Army Medical and Material Command number W81XWH-10-C-004 (in collaboration with Elintrix) and TIN 2007-65989 (Spain). J.M.A. was funded by grant MTM2009-11820 (Spain). We thank Carlos Santa Cruz for discussions and comments on this work.

Appendix A: Proof of Lemma 1

  1. Jensen’s inequality for convex functions applied to the exponential map reads (see section 3.1.8 of Boyd & Vandenberghe, 2004)
    $\frac{1}{L}\sum_{k=1}^{L}\exp f_k \ge \exp\!\left(\frac{1}{L}\sum_{k=1}^{L} f_k\right)$ (A.1)
    for all $f_1, \dots, f_L \in \Re$. Use the increasing monotonicity of the logarithm function to derive
    $\log\frac{1}{L} + \log\!\left(\sum_{k=1}^{L}\exp f_k\right) \ge \frac{1}{L}\sum_{k=1}^{L} f_k,$

    which is equation 3.10.

  2. From the graphical interpretation of Jensen’s inequality, it is plain that the equality in equation A.1 holds if and only if f1 = ···= fL, that is, if all the components of f are equal.
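The inequality and its equality case are easy to verify numerically. The following sketch (data of our choosing, L = 5) checks the log-sum-exp bound on random vectors and the equality when all components coincide.

```python
import numpy as np

# Check lemma 1: log(1/L) + log(sum_k exp f_k) >= (1/L) sum_k f_k,
# with equality iff f_1 = ... = f_L.
rng = np.random.default_rng(1)
L = 5
for _ in range(100):
    f = rng.normal(size=L)
    lhs = np.log(1 / L) + np.log(np.exp(f).sum())
    assert lhs >= f.mean() - 1e-12      # strict unless all f_k equal

f_eq = np.full(L, 0.7)                  # equality case: all components equal
lhs = np.log(1 / L) + np.log(np.exp(f_eq).sum())
assert np.isclose(lhs, f_eq.mean())
```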

Appendix B: Proof of Lemma 2

Since $\alpha_{ij}$ and $\beta_{ij}$ are arbitrary in equations 4.5 and 4.6, set $\beta_{ij} = C - \alpha_{ij}$ to get the simplified expressions

$w - \sum_{ij}\alpha_{ij} y_{ij}\big[\Psi_j(\chi_i) - \mu\Psi(\chi_i)\big] = 0,$ (B.1)
$\sum_{ij}\alpha_{ij} y_{ij}\langle w, \Psi(\chi_i)\rangle = 0.$ (B.2)

Next solve for w in equation B.1 and replace in equation B.2 to obtain

$0 = \sum_{ij}\sum_{i'j'} \alpha_{ij} y_{ij}\, \alpha_{i'j'} y_{i'j'} \big\langle \Psi_j(\chi_i) - \mu\Psi(\chi_i),\, \Psi(\chi_{i'}) \big\rangle$
$= \sum_{ij}\sum_{i'j'} \alpha_{ij} y_{ij}\, \alpha_{i'j'} y_{i'j'} \big[\langle \Psi_j(\chi_i), \Psi(\chi_{i'})\rangle - \mu\langle \Psi(\chi_i), \Psi(\chi_{i'})\rangle\big]$
$= \sum_{ij}\sum_{i'j'} \alpha_{ij} y_{ij}\, \alpha_{i'j'} y_{i'j'} \big[\langle \Phi(x_i), \Phi(x_{i'})\rangle - \mu L \langle \Phi(x_i), \Phi(x_{i'})\rangle\big]$
$= (1 - \mu L)\left\langle \sum_{ij}\alpha_{ij} y_{ij} \Phi(x_i),\, \sum_{i'j'}\alpha_{i'j'} y_{i'j'} \Phi(x_{i'}) \right\rangle$
$= (1 - \mu L)\left\|\sum_{ij}\alpha_{ij} y_{ij} \Phi(x_i)\right\|^2,$ (B.3)

where we employed equations 3.7 to 3.9. Hence $\mu = \frac{1}{L}$ if $\sum_{ij}\alpha_{ij} y_{ij}\Phi(x_i) \ne 0$. Finally, note that the latter inequality holds true if and only if $\sum_{ij}\alpha_{ij} y_{ij}\Psi(\chi_i) \ne 0$, by virtue of equation 3.3.

Appendix C: Proof of Theorem 1

Let E(w*, μ *) be the optimal value of the primal problem, equation 4.1.

  1. In the generic case, det J(w*, μ*, α*, β*) ≠ 0. Then w* = w_crit(α*, β*) and
    $\mu^* = \mu_{\mathrm{crit}}(\alpha^*, \beta^*) = \frac{1}{L}$

    because μ_crit(α, β) is the constant 1/L, equation 4.11.

  2. If, otherwise, det J(w*, μ*, α*, β*) = 0, then an argument based on the continuity of the Jacobian determinant with respect to all of its variables leads to the same conclusion. Indeed, let $(\alpha_n)_{n\ge 1}$ and $(\beta_n)_{n\ge 1}$ be sequences such that det J(w*, μ*, α_n, β_n) ≠ 0, α_n → α*, and β_n → β*. (This is always possible because the solutions of det J(w, μ, α, β) = 0 build an (L dim F + 2NL)-dimensional manifold in an (L dim F + 2NL + 1)-dimensional domain.) Then w_crit(α_n, β_n) → w* and μ_crit(α_n, β_n) → μ*. Since μ_crit(α_n, β_n) = 1/L for all n ≥ 1, it follows that μ* = 1/L.

Appendix D: Stochastic Gradient Descent on the RKHS

Let us calculate the minimum by taking the gradient of E in equation 8.1 with respect to f. To this end, note that the partial derivative of max{0, 1 − y_i f(χ_i)} does not exist uniquely at y_i f(χ_i) = 1 but is bounded between 0 and 1. If $\bar{1}(\cdot)$ is the function defined as

$\bar{1}(u)=\begin{cases}1 & \text{if } u<0\\ 0 & \text{if } u>0\\ [0,1] & \text{if } u=0,\end{cases}$ (D.1)

then

$\partial_f E = f - C\sum_{i=1}^{NL} y_i\,\bar{1}\big(y_i f(\chi_i) - 1\big)\, G(\chi_i, \cdot).$ (D.2)

We are looking for a solution of the form $f(\chi) = \sum_{i=1}^{NL}\hat{\alpha}_i G(\chi_i, \chi)$ such that $\partial_f E = 0$. Therefore, we insert f(χ) into equation D.2 to obtain

$0 = \sum_{i=1}^{NL}\hat{\alpha}_i G(\chi_i, \chi) - C\sum_{i=1}^{NL} y_i\,\bar{1}\big(y_i f(\chi_i) - 1\big) G(\chi_i, \chi),$
$0 = \sum_{i=1}^{NL}\left\{\hat{\alpha}_i - C y_i\,\bar{1}\big(y_i f(\chi_i) - 1\big)\right\} G(\chi_i, \chi),$

which leads to

$\hat{\alpha}_i y_i = C\,\bar{1}\big(y_i f(\chi_i) - 1\big)$

for 1 ≤ iNL. From the previous equation, we distinguish three types of solution:

$y_i\big(f(\chi_i) - y_i\big) \ge 0 \quad\text{for } y_i\hat{\alpha}_i = 0,$
$y_i\big(f(\chi_i) - y_i\big) = 0 \quad\text{for } 0 < y_i\hat{\alpha}_i < C,$
$y_i\big(f(\chi_i) - y_i\big) \le 0 \quad\text{for } y_i\hat{\alpha}_i = C,$

which are identical to the KKT conditions obtained in the dual problem and shown in equations 6.5 to 6.7. The gradient rule for the whole system ft+1 = ftηf E is then

$f_{t+1} = (1 - \eta\lambda) f_t + \eta C \sum_{i=1}^{NL} y_i\,\bar{1}\big(y_i f_t(\chi_i) - 1\big) G(\chi_i, \cdot) = \sum_{i=1}^{NL}\left\{(1 - \eta\lambda)\hat{\alpha}_i(t) + \eta C y_i\,\bar{1}\big(y_i f_t(\chi_i) - 1\big)\right\} G(\chi_i, \cdot),$

which leads to the updating rule,

$\hat{\alpha}_i(t+1) = (1 - \eta)\,\hat{\alpha}_i(t) + \eta C y_i\,\bar{1}\big(y_i f_t(\chi_i) - 1\big).$ (D.3)

Appendix E: Weston-Watkins Method

The Weston-Watkins method can be written in our notation as

minimize $E(w, \eta_{ij}) = \frac{1}{2}\|w\|^2 + C\sum_{i,\, j \in j^*(i)} \eta_{ij}$
subject to (i) $\eta_{ij} \ge 0$,
(ii) $\left\langle w, \left[\sum_{k=1}^{L}\Phi_k(\chi_i)\frac{y_{ik}+1}{2}\right] - \Phi_j(\chi_i)\right\rangle \ge 1 - \eta_{ij}$,
(iii) $j^*(i) = \{j = 1, \dots, L \text{ s.t. } y_{ij} = -1\}$. (E.1)

Note that in Weston-Watkins, the margin value is 2 but we replaced it by 1 for consistency with other methods. After building the Lagrangian and taking all the necessary steps, one can express the solution as

$w = \sum_{i,\, j \in j^*(i)} \alpha_{ij}\left(\left[\sum_{k=1}^{L}\Phi_k(\chi_i)\frac{y_{ik}+1}{2}\right] - \Phi_j(\chi_i)\right), \qquad \alpha_{ij} \in [0, C].$ (E.2)

Using property 3.9, one obtains the dual problem for Weston-Watkins as

maximize $W = \sum_{i=1}^{N}\sum_{j \in j^*(i)} \alpha_{ij} - \frac{1}{2}\sum_{i,i'=1}^{N}\,\sum_{j \in j^*(i),\, j' \in j^*(i')} \alpha_{ij}\alpha_{i'j'}\, G_{iji'j'}$ subject to $\alpha_{ij} \in [0, C]$, (E.3)

where the kernel is expressed as

$G_{iji'j'} = K(x_i, x_{i'})\left(\sum_{k=1}^{L}\frac{y_{ik}+1}{2}\frac{y_{i'k}+1}{2} - \frac{y_{ij'}+1}{2} - \frac{y_{i'j}+1}{2} + I(j = j')\right) = K(x_i, x_{i'})\left(\sum_{k=1}^{L}\frac{y_{ik}+1}{2}\frac{y_{i'k}+1}{2} + I(j = j')\right) = G_{i'j'ij},$

with $j \in j^*(i)$, $j' \in j^*(i')$, and the KKT conditions are

$-1 + \sum_{i',\, j' \in j^*(i')} G_{iji'j'}\,\alpha_{i'j'} \ge 0 \quad\text{for } \alpha_{ij} = 0,$
$-1 + \sum_{i',\, j' \in j^*(i')} G_{iji'j'}\,\alpha_{i'j'} = 0 \quad\text{for } 0 < \alpha_{ij} < C,$
$-1 + \sum_{i',\, j' \in j^*(i')} G_{iji'j'}\,\alpha_{i'j'} \le 0 \quad\text{for } \alpha_{ij} = C.$

Defining the margin variables as $V_{ij} = -1 + \sum_{i',\, j' \in j^*(i')} G_{iji'j'}\,\alpha_{i'j'}$, we can directly apply the stochastic SMO algorithm described in the main text.
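The simplified Weston-Watkins kernel factor can be sketched as follows for one-of-L coded labels, where the sum over k reduces to an indicator of whether the two examples share a class. The coding matrix and entries here are our own illustration.

```python
import numpy as np

def ww_kernel_entry(K_val, Y, i, j, ip, jp):
    """G_{ij,i'j'} = K(x_i, x_i') * (sum_k ((y_ik+1)/2)((y_i'k+1)/2) + I(j == j'))."""
    s = float(np.sum(((Y[i] + 1) / 2) * ((Y[ip] + 1) / 2)))
    return K_val * (s + float(j == jp))

Y = np.array([[ 1, -1, -1],   # example 0 belongs to class 0, so j*(0) = {1, 2}
              [-1,  1, -1]])  # example 1 belongs to class 1, so j*(1) = {0, 2}

assert ww_kernel_entry(1.0, Y, 0, 1, 0, 1) == 2.0   # same example, j == j'
assert ww_kernel_entry(1.0, Y, 0, 2, 1, 2) == 1.0   # different classes, j == j'
assert ww_kernel_entry(1.0, Y, 0, 1, 1, 0) == 0.0   # different classes, j != j'
```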

Appendix F: Crammer-Singer Method

The Crammer-Singer multiclass problem can be written as

minimize $E(w, \eta_i) = \frac{1}{2}\|w\|^2 + C\sum_i \eta_i$
subject to (i) $\eta_i \ge 0$,
(ii) $\left\langle w, \left[\sum_{k=1}^{L}\Phi_k(\chi_i)\frac{y_{ik}+1}{2}\right] - \Phi_j(\chi_i)\right\rangle + \frac{y_{ij}+1}{2} \ge 1 - \eta_i$. (F.1)

Note the similarity with the Weston-Watkins method, except for the number of constraints and slack variables. Since the constraints in (ii) are always satisfied for $y_{ij} = 1$, we can restrict the j index to the set $j^*(i)$ as defined in equation E.1. The problem can be expressed through the Lagrangian

$\mathcal{L} = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\eta_i - \sum_{i=1}^{N}\kappa_i\eta_i - \sum_{i=1}^{N}\sum_{j \in j^*(i)} \alpha_{ij}\left(\left\langle w, \left[\sum_{k=1}^{L}\Phi_k(\chi_i)\frac{y_{ik}+1}{2}\right] - \Phi_j(\chi_i)\right\rangle + \frac{y_{ij}+1}{2} - 1 + \eta_i\right)$
subject to (i) $\kappa_i \ge 0$, (ii) $\alpha_{ij} \ge 0$. (F.2)

By calculating the gradient with respect to w and η_m,

$w = \sum_{i,\, j \in j^*(i)} \alpha_{ij}\left(\left[\sum_{k=1}^{L}\Phi_k(\chi_i)\frac{y_{ik}+1}{2}\right] - \Phi_j(\chi_i)\right), \qquad \sum_{j \in j^*(m)} \alpha_{mj} = C - \kappa_m,$ (F.3)

replacing the two previous equations back into the Lagrangian and using property 3.9, one obtains the dual problem

maximize $W = \sum_{i=1}^{N}\sum_{j \in j^*(i)} \alpha_{ij} - \frac{1}{2}\sum_{i,i'=1}^{N}\,\sum_{j \in j^*(i),\, j' \in j^*(i')} \alpha_{ij}\alpha_{i'j'}\, G_{iji'j'}$ subject to $\sum_{j \in j^*(i)} \alpha_{ij} \in [0, C]$, (F.4)

where the multiclass kernel is exactly the same as for Weston-Watkins:

$G_{iji'j'} = K(x_i, x_{i'})\left(\sum_{k=1}^{L}\frac{y_{ik}+1}{2}\frac{y_{i'k}+1}{2} - \frac{y_{ij'}+1}{2} - \frac{y_{i'j}+1}{2} + I(j = j')\right) = K(x_i, x_{i'})\left(\sum_{k=1}^{L}\frac{y_{ik}+1}{2}\frac{y_{i'k}+1}{2} + I(j = j')\right) = G_{i'j'ij}.$

This problem is nearly identical to the Weston-Watkins approach, with minor differences in the constraints on the Lagrange multipliers due to the lower number of slack variables. Note also that constraint F.4 differs from the one used in Crammer and Singer (2001), where η_i ≥ 0 was not enforced in the Lagrangian (see Tsochantaridis et al., 2005, for an appropriate derivation).

Footnotes

1

Note the distinction to the standard kernel trick with an implicit mapping of inputs. Explicit mapping of inputs into a high-dimensional feature space was recently considered in Chang, Hsieh, Chang, Ringgaard, and Lin (2010) to speed up the training of nonlinear SVMs.

2

There is a proposed generalization of the coding matrix (Allwein, Schapire, & Singer, 2000). For simplicity, we prefer to solve the problem of inhibitory classifiers in the framework of Dietterich and Bakiri (1995). The extension proposed by Allwein et al. (2000) is a possible generalization for the future.

Contributor Information

Ramón Huerta, Email: ramon.huerta@gmail.com, BioCircuits Institute, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A.

Shankar Vembu, Email: shankar.vembu@gmail.com, BioCircuits Institute, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A.

José M. Amigó, Email: jm.amigo@umh.es, Department of Statistics, Mathematics, and Computer Science, Universidad Miguel Hernandez, Elche 03202, Spain

Thomas Nowotny, Email: t.nowotny@sussex.ac.uk, School of Informatics, University of Sussex, Falmer, Brighton BN1 9QJ, U.K.

Charles Elkan, Email: elkan@ucsd.edu, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093-0404, U.S.A.

References

  1. Abbott LF, Regehr WG. Synaptic computation. Nature. 2004;431:796–803. doi: 10.1038/nature03010. [DOI] [PubMed] [Google Scholar]
  2. Allwein EL, Schapire RE, Singer Y. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research. 2000;1:113–141. [Google Scholar]
  3. Bottou L, Bousquet O. The tradeoffs of large scale learning. In: Platt JC, Koller D, Singer Y, Roweis S, editors. Advances in neural information processing systems. Vol. 20. Cambridge, MA: MIT Press; 2008. pp. 161–168. [Google Scholar]
  4. Boyd S, Vandenberghe L. Convex optimization. Cambridge: Cambridge University Press; 2004. [Google Scholar]
  5. Canu S, Smola A. Kernel methods and the exponential family. Neurocomputing. 2005;69:714–720. [Google Scholar]
  6. Cassenaer S, Laurent G. Conditional modulation of spike-timing dependent plasticity for olfactory learning. Nature. 2012;482:47–52. doi: 10.1038/nature10776. [DOI] [PubMed] [Google Scholar]
  7. Chang YW, Hsieh CJ, Chang KW, Ringgaard M, Lin CJ. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research. 2010;11:1471–1490. [Google Scholar]
  8. Chapelle O. Training a support vector machine in the primal. Neural Computation. 2007;19:1155–1178. doi: 10.1162/neco.2007.19.5.1155. [DOI] [PubMed] [Google Scholar]
  9. Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research. 2001;2:265–292. [Google Scholar]
  10. Dempster AP. The direct use of likelihood for significance testing. Stat Comput. 1997;7:242–252. [Google Scholar]
  11. Dietterich TG, Bakiri G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research. 1995;2:263–286. [Google Scholar]
  12. Freund Y, Schapire RE. Large margin classification using the perceptron algorithm. Machine Learning. 1999;37:277–296. [Google Scholar]
  13. Harvey CD, Svoboda K. Locally dynamic synaptic learning rules in pyramidal neuron dendrites. Nature. 2007;450:1195–1200. doi: 10.1038/nature06416. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Heisenberg M. Mushroom body memoir: From maps to models. Nat Rev Neurosci. 2003;4:266–275. doi: 10.1038/nrn1074. [DOI] [PubMed] [Google Scholar]
  15. Hsu CW, Lin CJ. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks. 2002;13:415–425. doi: 10.1109/72.991427. [DOI] [PubMed] [Google Scholar]
  16. Huerta R, Nowotny T. Fast and robust learning by reinforcement signals: Explorations in the insect brain. Neural Computation. 2009;21:2123–2151. doi: 10.1162/neco.2009.03-08-733. [DOI] [PubMed] [Google Scholar]
  17. Huerta R, Nowotny T, Garcia-Sanchez M, Abarbanel HDI, Rabinovich MI. Learning classification in the olfactory system of insects. Neural Computation. 2004;16:1601–1640. doi: 10.1162/089976604774201613. [DOI] [PubMed] [Google Scholar]
  18. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy C. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation. 2001;13:637–650. [Google Scholar]
  19. Kivinen J, Smola AJ, Williamson RC. Online learning with kernels. IEEE Transactions on Signal Processing. 2010;100:1–12. [Google Scholar]
  20. Laurent G. Olfactory network dynamics and the coding of multidimensional signals. Nat Rev Neurosci. 2002;3:884–895. doi: 10.1038/nrn964. [DOI] [PubMed] [Google Scholar]
  21. LeCun Y, Chopra S, Hadshell R, Ranzato M, Jie H-F. A tutorial on energy-based learning. In: Bakir G, Hofmann T, Schölkopf B, Smola A, Taskar B, editors. Predicting structured data. Cambridge, MA: MIT Press; 2006. [Google Scholar]
  22. Lee Y, Lin Y, Wahba G. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association. 2004;99:67–81. [Google Scholar]
  23. Lugosi G, Vayatis N. On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics. 2004;32:30–55. [Google Scholar]
  24. Muller KR, Mika S, Ratsch G, Tsuda K, Schölkopf B. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks. 2001;12:181–202. doi: 10.1109/72.914517. [DOI] [PubMed] [Google Scholar]
  25. Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. Phil Trans R Soc Lond Ser A. 1933;231:289–337. [Google Scholar]
  26. Nowotny T, Huerta R, Abarbanel HDI, Rabinovich MI. Self-organization in the olfactory system: One shot odor recognition in insects. Biol Cybern. 2005;93:436–446. doi: 10.1007/s00422-005-0019-7. [DOI] [PubMed] [Google Scholar]
  27. O’Reilly RC. Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation. 2001;13:1199–1241. doi: 10.1162/08997660152002834. [DOI] [PubMed] [Google Scholar]
  28. Platt JC. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A, editors. Advances in kernel methods: Support vector machines. Cambridge, MA: MIT Press; 1999a. pp. 185–208. [Google Scholar]
  29. Platt JC. Using analytic QP and sparseness to speed training of support vector machines. In: Kearns MS, Solla SA, Cohn DA, editors. Advances in neural information processing Systems. Vol. 11. Cambridge, MA: MIT Press; 1999b. pp. 557–563. [Google Scholar]
  30. Pletscher P, Soon Ong C, Buhmann JM. Entropy and margin maximization for structured output learning. Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part III; Berlin: Springer-Verlag; 2010. [Google Scholar]
  31. Rifkin R, Klautau A. In defense of one-vs-all classification. Journal of Machine Learning Research. 2004;5:101–141. [Google Scholar]
  32. Rodriguez FB, Huerta R. Techniques for temporal detection of neural sensitivity to external stimulation. Biol Cybern. 2009;100:289–297. doi: 10.1007/s00422-009-0297-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Seung HS. Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron. 2003;40:1063–1073. doi: 10.1016/s0896-6273(03)00761-x. [DOI] [PubMed] [Google Scholar]
  34. Shalev-Shwartz S, Singer Y, Srebro N. Pegasos: Primal estimated sub-gradient solver for SVM. In: Ghahramani Z, editor. Proceedings of the 24th international Conference on Machine Learning. New York: ACM; 2007. pp. 807–814. [Google Scholar]
  35. Smith BH, Abramson CI, Tobin TR. Conditional withholding of proboscis extension in honeybees (Apis mellifera) during discriminative punishment. J Comp Psychol. 1991;105:345–356. doi: 10.1037/0735-7036.105.4.345. [DOI] [PubMed] [Google Scholar]
  36. Smith BH, Wright GA, Daly KC. Learning-based recognition and discrimination of floral odors. In: Dudareva N, Pichersky E, editors. Biology of floral scent. Boca Raton, FL: CRC Press; 2005. pp. 263–295. [Google Scholar]
  37. Tewari A, Bartlett PL. On the consistency of multiclass classification methods. Journal of Machine Learning Research. 2007;8:1007–1025. [Google Scholar]
  38. Tsochantaridis I, Joachims T, Hofmann T, Altun Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research. 2005;6:1453–1484. [Google Scholar]
  39. Vapnik VN. The nature of statistical learning theory. Berlin: Springer-Verlag; 1995. [Google Scholar]
  40. Weston J, Watkins C. Proceedings of the European Symposium on Artificial Neural Networks. Bruges: D-facto; 1999. Support vector machines for multiclass pattern recognition; pp. 219–224. [Google Scholar]
  41. Zhang T. Proceedings of the Twenty-First International Conference on Machine Learning. New York: ACM; 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. [Google Scholar]
