Published in final edited form as: Appl Numer Math. 2023 Feb 15;187:138–157. doi: 10.1016/j.apnum.2023.02.011

Sparse Machine Learning in Banach Spaces

Yuesheng Xu *

Abstract

The aim of this expository paper is to explain to graduate students and beginning researchers in the fields of mathematics, statistics, and engineering the fundamental concept of sparse machine learning in Banach spaces. In particular, we use binary classification as an example to explain the essence of learning in a reproducing kernel Hilbert space and sparse learning in a reproducing kernel Banach space (RKBS). We then utilize the Banach space $\ell_1(\mathbb{N})$ to illustrate the basic concepts of the RKBS in an elementary yet rigorous fashion. This paper reviews existing results from the author's perspective to reflect the state of the art of the field of sparse learning, and includes new theoretical observations on the RKBS. Several open problems critical to the theory of the RKBS are also discussed at the end of this paper.

Keywords: sparse machine learning, reproducing kernel Banach space

AMS subject classifications: 46B45, 46N10, 90C30

1. Introduction

Most machine learning methods aim to learn a function from available data. Mathematically, we need a hypothesis space to hold the functions to be learned [13]. The choice of the hypothesis space is critical for learning outcomes. Traditionally, a reproducing kernel Hilbert space (RKHS), a Hilbert space in which the point-evaluation functionals are continuous, is chosen, due to the existence of an inner product that can be used to measure similarity in data and of continuous point-evaluation functionals that are often used to sample data.

While learning in an RKHS offers attractive features, it suffers from a major drawback. To explain it, we define the notion of sparsity. When we say a vector or a sequence is sparse, we mean that most of its components are zero. When it is not sparse, we call it dense. We stress here that this definition of density is peculiar to the data science community; it does not mean (topologically) dense in the usual sense. Sparsity also refers to a representation of a function under a basis when the coefficient vector is sparse. A major drawback of learning in a Hilbert space is that a learned solution tends to be dense. This feature is a consequence of the smoothness of Hilbert spaces used as hypothesis spaces. Such a dense learned solution leads to high computational costs when it is used in prediction or other decision making procedures, since once a function is learned it will be used repeatedly many times. The burden is even more severe in the context of big data analytics. For example, Google would want its search engine to be able to make instant recommendations. Due to increasingly large amounts of data and related model sizes, demands for more competent data processing models have emerged. As pointed out in [21], the future of machine learning is sparsity. It is immensely desirable to learn sparse solutions in the sense that they have substantially fewer nonzero components or fewer terms in their representation.

Most data sets have intrinsic sparsity. As a matter of fact, the data that we encounter often have certain embedded sparsity structures, in the sense that if they are represented in certain ways, their intrinsic characteristics concentrate on a few scattered spots. To construct sparse solutions, we require that the norm of the hypothesis space of the learning model have a certain sparsity promoting property.

With a sparsity promoting norm for the hypothesis space, when a solution of a learning model is represented in an appropriate basis, the solution can have a sparse representation. It is well-known that the norm of a usual Hilbert space does not promote sparsity due to its smoothness. Norms of certain nonsmooth Banach spaces, such as the $\ell_1$-norm, have the ability to induce sparsity in solutions of learning methods that use these spaces as hypothesis spaces. For this reason, some Banach spaces have been used as hypothesis spaces for sparse machine learning methods. With sparse solutions, machine learning models can considerably reduce storage, computing time and communication time.

We are interested in a class of Banach spaces which are spaces of functions, because function values are useful in machine learning. In particular, we would like the point-evaluation functionals on such spaces to be continuous. This gives rise to a special class of Banach spaces: the reproducing kernel Banach spaces (RKBSs), which were introduced in [58]. During the last decade, great interest has been paid to the understanding of this class of function spaces and their uses in applications. We will discuss these function spaces and learning in them.

The goal of this expository paper is to review recent advances in sparse machine learning in Banach spaces. We will use binary classification, a basic problem in data science, as an example to motivate the notion of sparse learning in Banach spaces. In Section 2, we present a brief review of classification. Section 3 is devoted to a presentation of learning in an RKHS. In Section 4 we illustrate the theory of the RKBS by the example of the space $\ell_1(\mathbb{N})$. We discuss in Section 5 learning in a Banach space. We present a numerical example in Section 6 to demonstrate a sparse classification method in a Banach space. In Section 7, we elaborate on several unsolved mathematical problems related to the theory of the RKBS and sparse learning.

2. A Review of Classification

In this section, we review the classical support vector machine for binary classification which was discussed in [52].

A typical binary classification problem may be described as follows: given two sets of points in the Euclidean space $\mathbb{R}^d$, we wish to find a surface to separate them. The simplest way to separate two sets of points is to use a hyperplane, since among all types of surfaces a hyperplane is the easiest to construct and to work with. For this purpose, we wish to construct the hyperplane

$w \cdot x - b = 0, \quad x \in \mathbb{R}^d$, (2.1)

where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are to be determined so that the hyperplane (2.1) has the largest distance to both of the two sets. Specifically, we let the training data $\mathcal{D} := \{(x_k, y_k) : k = 1, 2, \ldots, N\}$ be composed of input data points

$X \doteq \{x_k : k = 1, 2, \ldots, N\} \subset \mathbb{R}^d$

and output data values

$Y \doteq \{y_k : k = 1, 2, \ldots, N\} \subset \{-1, 1\}.$

We intend to find a hyperplane determined by the linear function

$s(x) \doteq w_s \cdot x - b_s, \quad x \in \mathbb{R}^d$, (2.2)

which separates the training data $\mathcal{D}$ into two groups, one with label $y_k = 1$, which corresponds to $s(x_k) > 0$, and another with label $y_k = -1$, which corresponds to $s(x_k) < 0$. The parameters $(w_s, b_s) \in \mathbb{R}^d \times \mathbb{R}$ in equation (2.2) are chosen such that the hyperplane maximizes its distance to the points in $\mathcal{D}$. This gives us a decision rule to predict the labels for new points, that is,

$r(x) \doteq \mathrm{sign}(s(x)), \quad \text{for } x \in \mathbb{R}^d.$

This method is called the support vector machine (SVM) [12].

Specifically, as illustrated in Figure 2.1, we select two parallel hyperplanes

$w \cdot x - b = 1$ and $w \cdot x - b = -1$

that separate the two classes of data so that the distance between the two hyperplanes is as large as possible. The region bounded by these two hyperplanes is called the margin. We compute the distance $d$ between these two planes. Let $x_0 \in \mathbb{R}^d$ be a point on the first plane. That is, $x_0$ satisfies the equation

$w \cdot x_0 - b = 1$. (2.3)

The distance $d$ between these two planes is given by

$d = \dfrac{|w \cdot x_0 - b + 1|}{\|w\|_2} = \dfrac{2}{\|w\|_2}$.

Maximizing the distance $d$ is equivalent to minimizing $\frac{1}{d} = \frac{1}{2}\|w\|_2$. To prevent data points from falling into the margin, we require

$w \cdot x_k - b \ge 1$ if $y_k = 1$, or $w \cdot x_k - b \le -1$ if $y_k = -1$.

These constraints state that each data point must lie on the correct side of the margin, and they may be rewritten as

$y_k(w \cdot x_k - b) \ge 1, \quad k = 1, 2, \ldots, N$. (2.4)

The pair $(w_s, b_s)$ is obtained by solving the constrained minimization problem

$\min\left\{\tfrac{1}{2}\|w\|_2 : (w, b) \in \mathbb{R}^d \times \mathbb{R}\right\}$ (2.5)

subject to (2.4). Let

$C \doteq \{(w, b) \in \mathbb{R}^d \times \mathbb{R} : y_k(w \cdot x_k - b) \ge 1,\ k = 1, 2, \ldots, N\}$. (2.6)

It can be verified that $C$ is a convex set. The minimization problem (2.5) with constraints (2.4) may be rewritten as

$\min\left\{\tfrac{1}{2}\|w\|_2 : (w, b) \in C\right\}$, (2.7)

where the constraint set $C$ is defined by (2.6). Model (2.7) is called the hard margin support vector machine (SVM), which is a constrained minimization problem.

Figure 2.1: Classification (Courtesy: Wikipedia)

It is computationally advantageous to reformulate the constrained minimization problem (2.7) as an unconstrained one. This leads to the soft margin SVM model. Specifically, we rewrite the constraints (2.4) as

$1 - y_k(w \cdot x_k - b) \le 0, \quad k = 1, 2, \ldots, N$, (2.8)

and make use of the ReLU (Rectified Linear Unit) function

$\sigma(s) \doteq \max\{0, s\}, \quad \text{for } s \in \mathbb{R}$

to define the hinge loss function

$L(y, t) \doteq \sigma(1 - yt), \quad \text{for } y \in \{-1, 1\} \text{ and } t \in \mathbb{R}.$

When (2.8) is satisfied, we have that $L(y_k, w \cdot x_k - b) = 0$, and when (2.8) is not satisfied, we have that

$L(y_k, w \cdot x_k - b) = 1 - y_k(w \cdot x_k - b) = |1 - y_k(w \cdot x_k - b)| > 0$,

which is the case that we would like to prevent from occurring. For this reason, we would like to minimize the nonnegative fidelity term

$\frac{1}{N}\sum_{k=1}^{N} L(y_k, w \cdot x_k - b).$

Combining this with the minimization (2.5) leads to the regularization problem

$\min\left\{\frac{1}{N}\sum_{k=1}^{N} L(y_k, w \cdot x_k - b) + \lambda\|w\|_2^2 : (w, b) \in \mathbb{R}^d \times \mathbb{R}\right\}$, (2.9)

where $\lambda > 0$ is a regularization parameter. Regularization problem (2.9) is called the soft margin SVM. Upon solving minimization problem (2.9), we obtain the pair $(w_s, b_s)$, which defines the function $s$ for classification.
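To make the soft margin model (2.9) concrete, the following is a minimal NumPy sketch that minimizes the hinge loss plus the squared $\ell_2$ penalty by subgradient descent. The synthetic data, step sizes, and parameter values are illustrative choices and are not taken from the paper.

```python
import numpy as np

# Subgradient-descent sketch for the soft margin SVM (2.9):
# minimize (1/N) sum_k max(0, 1 - y_k (w.x_k - b)) + lambda ||w||_2^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

lam, w, b = 0.01, np.zeros(2), 0.0
for t in range(1, 2001):
    margins = y * (X @ w - b)
    active = margins < 1.0                       # points violating the margin
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / len(y) + 2 * lam * w
    grad_b = y[active].sum() / len(y)            # d/db of -y_k (w.x_k - b) is +y_k
    step = 1.0 / t
    w -= step * grad_w
    b -= step * grad_b

print("training accuracy:", (np.sign(X @ w - b) == y).mean())
```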

3. Learning in Reproducing Kernel Hilbert Spaces

We discuss in this section the notion of feature maps in machine learning. In particular, we show the necessity of introducing feature maps in machine learning, which leads to learning in an RKHS.

We first motivate the introduction of feature maps in machine learning by returning to the classification problem reviewed in the last section. When given data sets can be separated by a hyperplane, the hard/soft margin SVM discussed in the last section will do a reasonably good job for classification. However, in most cases of practical applications, it is not possible to separate two sets of points entirely by a hyperplane in the same space. Certain misclassification may occur. In general, we might allow a low degree of misclassification but do not tolerate a high degree of misclassification. We are required to separate two sets with a low degree of misclassification.

One way to alleviate misclassification is to map the data sets to a higher (or even an infinite) dimensional space and perform classification in the new space. Note that sparse data sets are easier to separate by a hyperplane with a low degree of misclassification. Non-sparse data sets in a lower dimensional space may be made sparse if they are projected to a higher or even an infinite dimensional space by an appropriate map. A natural idea is to find a feature map $\Phi$ that maps lower dimensional data sets into a higher dimensional space to gain sparsity for the resulting data sets. The sparsity of the mapped data sets in the higher dimensional space can significantly reduce the degree of misclassification. In other words, the feature map $\Phi$ transfers the classification problem in a lower dimensional space to one in a higher dimensional space, where it is possible to separate the point sets mapped from $\mathbb{R}^d$ by a "hyperplane", with a lower degree of misclassification. This is illustrated in Figure 3.2. A crucial issue is the choice of the feature map $\Phi$. From a purely theoretical standpoint, "any" function $\Phi$ is a feature map. The choice of the feature map $\Phi$ depends largely on the specific application. Feature maps lead to the notion of kernel based learning, or learning in RKHSs.

Figure 3.2: An illustration of data mapping

We next demonstrate how a feature map defines a kernel. A feature map has a well-known remarkable property, which we show below. We recall that $x_j \in \mathbb{R}^d$, $j = 1, 2, \ldots, n$, are the $n$ data points.

Proposition 3.1 Let $\mathbb{H}$ be a real Hilbert space with inner product $\langle\cdot,\cdot\rangle$. If $\Phi : \mathbb{R}^d \to \mathbb{H}$, then for all $n \in \mathbb{N}$ and $X_n \doteq \{x_j \in \mathbb{R}^d : j = 1, 2, \ldots, n\}$, the matrices

$\Phi(X_n) \doteq [\langle\Phi(x_i), \Phi(x_j)\rangle : i, j = 1, 2, \ldots, n]$

are positive semi-definite.

Proof: For all $n \in \mathbb{N}$, suppose that $c_j \in \mathbb{R}$, $j = 1, 2, \ldots, n$, are arbitrarily given. Let $c \doteq [c_1, c_2, \ldots, c_n]^\top$. For all $x_j \in \mathbb{R}^d$, $j = 1, 2, \ldots, n$, there holds

$c^\top \Phi(X_n) c = \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j \langle\Phi(x_i), \Phi(x_j)\rangle = \left\langle \sum_{i=1}^{n} c_i\Phi(x_i), \sum_{j=1}^{n} c_j\Phi(x_j)\right\rangle = \left\|\sum_{i=1}^{n} c_i\Phi(x_i)\right\|^2 \ge 0.$

This confirms that the matrices $\Phi(X_n)$ are positive semi-definite. □

The feature map $\Phi : \mathbb{R}^d \to \mathbb{H}$ naturally leads to a function

$K(x, y) \doteq \langle\Phi(x), \Phi(y)\rangle, \quad x, y \in \mathbb{R}^d$, (3.1)

which is a kernel according to Proposition 3.1 and the following definition.

Definition 3.2 Let $X \subseteq \mathbb{R}^d$. A function $K : X \times X \to \mathbb{R}$ is called a kernel if it is symmetric and positive semi-definite, that is,

$\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0, \quad \text{for all } n \in \mathbb{N},\ x_j \in X,\ c_j \in \mathbb{R},\ j = 1, 2, \ldots, n.$

Kernels were studied over 100 years ago in the context of integral equations [29]; see also [4]. Kernels can define distances for data sets to measure their similarity.
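As a concrete illustration (not from the paper), the short script below exhibits an explicit feature map for the polynomial kernel (3.3) in the special case $c = 0$, $r = 2$, $d = 2$, and numerically checks Definition 3.2 for a Gaussian kernel matrix by inspecting its eigenvalues.

```python
import numpy as np

# For the polynomial kernel (3.3) with c = 0, r = 2, d = 2, i.e. K(x,y) = (x.y)^2,
# an explicit feature map is Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), since
# <Phi(x), Phi(y)> = x1^2 y1^2 + x2^2 y2^2 + 2 x1 x2 y1 y2 = (x.y)^2.
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(y), (x @ y) ** 2))        # True

# Numerical check of Definition 3.2 for the Gaussian kernel (3.4):
# the kernel matrix on random points has no negative eigenvalues (up to round-off).
theta = 1.6
points = rng.normal(size=(40, 2))
sqdist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-theta ** 2 * sqdist)
print(np.linalg.eigvalsh(K).min() >= -1e-10)            # True
```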

Function values are often used in machine learning. Hence, it would be desirable to require that the Hilbert space of functions defined on X that we work with has the property that the point-evaluation functionals are continuous in the space. Precisely, we have the following definition.

Definition 3.3 A Hilbert space $\mathbb{H}$ of functions defined on $X$ is called an RKHS if for every point-evaluation functional $\delta_x$, $x \in X$, there exists a constant $M_x$ such that for all $f \in \mathbb{H}$

$|\delta_x f| = |f(x)| \le M_x \|f\|.$

We now illustrate Definition 3.3 by the simplest example. To this end, we consider the space $\ell_2(\mathbb{N})$ of real sequences $f \doteq [f_1, f_2, \ldots]$ such that $\|f\|_2 := \left(\sum_{k \in \mathbb{N}} |f_k|^2\right)^{1/2} < \infty$. Clearly, elements of $\ell_2(\mathbb{N})$ are functions defined on the set $X \doteq \mathbb{N}$ and $\ell_2(\mathbb{N})$ is a Hilbert space, and thus, it is a Hilbert space of functions on $\mathbb{N}$. Moreover, $\ell_2(\mathbb{N})$ is isometrically isomorphic to $(\ell_2(\mathbb{N}))^*$. The point-evaluation functional has the form $\delta_x(f) = \delta_j(f) \doteq f_j$, for $x \doteq j \in \mathbb{N}$. It follows that

$|\delta_j(f)| = |f_j| \le \|f\|_2, \quad \text{for all } f \in \ell_2(\mathbb{N}),$

with $M_x$ in Definition 3.3 being $M_j \doteq 1$, for all $j \in \mathbb{N}$. Therefore, according to Definition 3.3, $\ell_2(\mathbb{N})$ is an RKHS on $\mathbb{N}$ with the kernel $K := [\delta_{i,j} : i, j \in \mathbb{N}]$, where $\delta_{i,j} \doteq 1$ if $i = j$ and $0$ otherwise, for $i, j \in \mathbb{N}$.

Immediately from Definition 3.3, we observe that the closeness of two functions in an RKHS in the norm of the space implies their pointwise closeness.

An RKHS is associated with a kernel. The following well-known result can be found in [1].

Theorem 3.4 If a Hilbert space $\mathbb{H}$ of functions on $X$ is an RKHS, then there exists a unique kernel $K : X \times X \to \mathbb{R}$ such that $K(x, \cdot) \in \mathbb{H}$, $x \in X$, and for all $f \in \mathbb{H}$,

$f(x) = \langle f(\cdot), K(x, \cdot)\rangle, \quad x \in X$. (3.2)

Conversely, if $K : X \times X \to \mathbb{R}$ is a kernel, then there exists a unique RKHS $\mathbb{H}$ on $X$ such that $K(x, \cdot) \in \mathbb{H}$, $x \in X$, and for all $f \in \mathbb{H}$, the reproducing property (3.2) holds true.

Proof: Suppose that $\mathbb{H}$ is an RKHS of functions on $X$. Since the point-evaluation functional $\delta_x f \doteq f(x)$, for each $x \in X$, is continuous, by the Riesz representation theorem, there exists $k_x \in \mathbb{H}$ such that $f(x) = \delta_x f = \langle f, k_x\rangle$ for all $f \in \mathbb{H}$. We define a function $K : X \times X \to \mathbb{R}$ by $K(x, y) \doteq k_x(y)$ for $x, y \in X$. Clearly, we have that for all $x \in X$, $K(x, \cdot) \in \mathbb{H}$ and (3.2) holds true. The uniqueness of such a function $K$ can be verified. It remains to show that $K$ is a kernel. The symmetry of $K$ can be seen from the computation

$K(x, y) = k_x(y) = \langle k_x(\cdot), k_y(\cdot)\rangle = \langle K(x, \cdot), K(y, \cdot)\rangle = \langle K(y, \cdot), K(x, \cdot)\rangle = K(y, x).$

It remains to prove that it is positive semi-definite. For any $n \in \mathbb{N}$, $x_j \in X$, $c_j \in \mathbb{R}$, $j = 1, 2, \ldots, n$, we let

$c \doteq [c_1, c_2, \ldots, c_n]^\top$ and $K_n \doteq [K(x_i, x_j) : i, j = 1, 2, \ldots, n].$

By the reproducing property and symmetry of the kernel, we have that

$c^\top K_n c = \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j K(x_i, x_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j \langle K(x_j, \cdot), K(x_i, \cdot)\rangle = \left\langle \sum_{j=1}^{n} c_j K(x_j, \cdot), \sum_{i=1}^{n} c_i K(x_i, \cdot)\right\rangle = \left\|\sum_{i=1}^{n} c_i K(x_i, \cdot)\right\|^2 \ge 0.$

Thus, K is a kernel. We omit the proof for the converse. □

Theorem 3.4 clearly states that every function in the RKHS can be reproduced by the underlying kernel. Equation (3.2) is called the reproducing property. For this reason, the kernel is also called the reproducing kernel. Moreover, the kernel (3.1) defined by a feature map $\Phi$ determines an RKHS $\mathbb{H}$. The mapped RKHS $\mathbb{H}$ is more complex than the original space $\mathbb{R}^d$, and thus by using it as a hypothesis space of a learning method we can better represent the learned function. We expect that learning outcomes in this space will be more accurate than those in the original Euclidean space. In particular, data sets mapped from $\mathbb{R}^d$ to $\mathbb{H}$ via the feature map are likely sparser than the original data sets, and thus classification of the mapped data sets may result in substantially less misclassification.

The two most popular kernels in machine learning are the polynomial kernel of degree $r$,

$K(x, y) \doteq (x \cdot y + c)^r, \quad \text{for } x, y \in \mathbb{R}^d$, (3.3)

where $c$ is a constant, and the Gaussian kernel

$K(x, y) \doteq e^{-\theta^2\|x - y\|_2^2}, \quad \text{for } x, y \in \mathbb{R}^d$. (3.4)

Google Scholar shows that the term "polynomial kernel" is used in titles of 53,900 articles and the term "Gaussian kernel" is used in titles of 649,000 articles. Gaussian kernels are widely used in applications due to their remarkable flexibility in fitting data. In Figure 3.3, we illustrate that by altering the parameter $\theta$ we can control the shape of the graph of the Gaussian kernel.

Figure 3.3: $K(x, 0)$ with $\theta = 1.6$ (left); with $\theta = 3$ (right)

Following the above discussion, we map the data points $x_j \in \mathbb{R}^d$ to the RKHS $\mathbb{H}$ by the feature map $\Phi$ and seek a function $f$ that defines a decision rule for classification. This discussion leads us to extend the regularization problem (2.9) for classification to

$\min\left\{\frac{1}{n}\sum_{k=1}^{n} L(y_k, \langle f, \Phi(x_k)\rangle) + \lambda\|f\|^2 : f \in \mathbb{H}\right\}$ (3.5)

to determine a decision function $f$ in the RKHS $\mathbb{H}$. It follows from the reproducing property that

$\langle f, \Phi(x_k)\rangle = \langle f, K(x_k, \cdot)\rangle = f(x_k).$

Thus, the regularization problem (3.5) for classification is rewritten as

$\min\left\{\frac{1}{n}\sum_{k=1}^{n} L(y_k, f(x_k)) + \lambda\|f\|^2 : f \in \mathbb{H}\right\}$. (3.6)

Clearly, the regularization problem (2.9) with $b = 0$ is a special case of (3.6). To see this, we introduce the reproducing kernel

$K(x, y) \doteq \langle x, y\rangle, \quad \text{for all } x, y \in \mathbb{R}^d.$

The corresponding RKHS $\mathbb{H}$ has the form $\{\langle\cdot, x\rangle : x \in \mathbb{R}^d\}$, with $\langle\langle\cdot, x\rangle, \langle\cdot, y\rangle\rangle_{\mathbb{H}} := \langle x, y\rangle$ for all $x, y \in \mathbb{R}^d$, as its inner product. According to the reproducing property, for each $f \doteq \langle\cdot, w\rangle$ with $w \in \mathbb{R}^d$, we deduce that $f(x) = \langle x, w\rangle = w \cdot x$, for all $x \in \mathbb{R}^d$, and $\|f\|_{\mathbb{H}} = \|w\|_2$. Thus, problem (3.6) reduces to (2.9) with $b = 0$.

The above discussion of classification motivates us to consider learning methods in RKHSs. For example, given a pair $\{X, Y\}$ of data sets $X \doteq \{x_j \in \mathbb{R}^d : j = 1, 2, \ldots, m\}$ and $Y \doteq \{y_j \in \mathbb{R} : j = 1, 2, \ldots, m\}$, we wish to learn a function $f^*$ from an RKHS $\mathbb{H}$. This learning method may be described as the minimization problem

$f^* = \arg\min\{F(X, Y, f) + \lambda\|f\|_{\mathbb{H}} : f \in \mathbb{H}\}$, (3.7)

where F denotes the empirical fidelity term which measures the closeness of a learned solution to the given data and λ>0 is a regularization parameter. The learning model (3.7) covers many learning problems including minimal norm interpolation, regularization network, support vector machines, kernel principal component analysis, kernel regression and deep regularized learning. Moreover, the recent work [33] established a framework where distributions are mapped into an RKHS in which the kernel methods can be extended to probability measures.

Great success has been achieved by learning in Hilbert spaces. At the same time, the theory of the RKHS has undergone coordinated development [19, 56, 57, 59]. In particular, motivated by the need to develop multiscale bases for an RKHS, refinable kernels were investigated in [56, 57, 59]. A mathematical foundation of learning in Hilbert spaces may be found in [13].

Two fundamental mathematical problems were raised regarding learning in an RKHS. The first one concerns the form of a solution of the learning problem (3.7), which seeks a solution in an infinite dimensional space $\mathbb{H}$. The second is about the approximation property of the learning methods: does the learned solution converge to the "true solution" of the practical problem as data points become dense and their number tends to infinity?

The answer to the first fundamental problem is the celebrated representer theorem [40]. A solution of the learning problem (3.7), which seeks a solution in an RKHS $\mathbb{H}$, can be represented as a linear combination of a finite number of the kernel sections, that is, the kernel $K$ evaluated at the input training data points, namely,

$K(x_k, x), \quad k = 1, 2, \ldots, m$. (3.8)

That is, a solution of the learning problem (3.7) has the form

$f^*(x) = \sum_{k=1}^{m} c_k K(x_k, x)$, (3.9)

for some suitable parameters $c_k \in \mathbb{R}$, where $K$ is the kernel associated with the space $\mathbb{H}$. Here the integer $m$ is the number of the data points involved in the fidelity term of (3.7). This result is called the representer theorem for the learning method (3.7). The most remarkable feature of the representer theorem lies in the finite number of kernel sections that represent a learned solution in the infinite dimensional space $\mathbb{H}$. In other words, even though we seek a learned solution in the infinite dimensional space $\mathbb{H}$, hoping it will provide much more richness in representing a learned solution, the learned solution is in fact located in the finite dimensional subspace of $\mathbb{H}$ spanned by the kernel sections (3.8). Although the modern form of the representer theorem was given in [40], its origin dates back to [5] and [23], which appeared fifty years ago.
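As a small illustration of the representer theorem (not taken from the paper), consider the square-loss instance of (3.7) with the squared RKHS norm as regularizer, often called kernel ridge regression or a regularization network. In that case the coefficients in (3.9) can be computed from the linear system $(K_m + m\lambda I)c = y$, where $K_m$ is the kernel matrix on the training points. The data below are synthetic, and the function name gaussian_kernel is a local helper introduced for this sketch.

```python
import numpy as np

# Kernel ridge regression: a square-loss instance of (3.7) with the squared
# RKHS norm as regularizer.  By the representer theorem the solution is
# f*(x) = sum_k c_k K(x_k, x), with coefficients from (K_m + m*lambda*I) c = y.
def gaussian_kernel(A, B, theta=1.6):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta ** 2 * sq)

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=30)

lam = 1e-3
K_m = gaussian_kernel(x_train, x_train)
c = np.linalg.solve(K_m + len(y_train) * lam * np.eye(len(y_train)), y_train)

# Evaluate the learned function on new points via the kernel sections (3.8).
x_test = np.linspace(-3, 3, 7)[:, None]
f_star = gaussian_kernel(x_test, x_train) @ c
print(np.round(f_star, 3))
```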

An answer to the second fundamental problem for the learning method (3.7) is the universality of a kernel. According to the representer theorem (3.9), a solution $f^*$ of the learning method (3.7) is expressed as a linear combination of a finite number of the kernel sections. This motivates us to ask the following question: can a continuous function on a compact set be approximated arbitrarily well by the kernel sections as the data points become dense in the set? It is well-known from the Weierstrass Theorem in real analysis [38] that a continuous function on a compact domain can be approximated arbitrarily well by polynomials. However, this is not always true when polynomials are replaced by kernel sections of an arbitrary kernel. To explain this, we assume that the input space $X$ is a Hausdorff topological space and that all kernels to be considered are continuous on $X \times X$. Moreover, we let $Z$ be an arbitrarily fixed compact subset of $X$ and let $C(Z)$ denote the space of all continuous real-valued (or complex-valued) functions on $Z$ equipped with the maximum norm $\|\cdot\|_Z$. For simplicity, we restrict our discussion to real-valued functions.

We first examine the polynomial kernel (3.3). We have the following negative result.

Proposition 3.5 Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a polynomial kernel. There exists a continuous function defined on a compact set $Z \subset \mathbb{R}^d$ which cannot be approximated in the uniform norm arbitrarily closely by any linear combination of the kernel sections $K(x_j, \cdot)$, $j = 1, 2, \ldots, m$, for any $m \in \mathbb{N}$ and any $\{x_j : j = 1, 2, \ldots, m\} \subset Z$.

Proof: Since a polynomial kernel has the form (3.3), we will construct a function $f \in C(Z)$ so that $f$ cannot be approximated in the uniform norm arbitrarily closely by

$f^*(x) \doteq \sum_{j=1}^{m} c_j (x_j \cdot x + c)^r, \quad x \in Z$, (3.10)

for any positive integer $m$ and any $\{x_j : j = 1, 2, \ldots, m\} \subset Z$.

In fact, for simplicity, we choose $f$ as a polynomial of degree $r + 1$ with leading coefficient 1. No matter how large the integer $m$ is and no matter what $x_j$, $j = 1, 2, \ldots, m$, are chosen, $f^*$ defined by equation (3.10) is a polynomial of degree at most $r$. The assertion is proved by the fact that a nondegenerate polynomial of degree $r + 1$ cannot be approximated in the uniform norm arbitrarily closely by a polynomial of degree $r$. For example, we can verify this statement for the special case when $d = 1$, $r = 1$, $Z = [-1, 1]$, and $f(x) \doteq x^2$. In this case, we have that $f^*(x) = ax + b$, $a, b \in \mathbb{R}$, $x \in [-1, 1]$. Let $e(x) \doteq f(x) - f^*(x)$, for $x \in [-1, 1]$. Clearly, $e(x) = x^2 - ax - b$. By the characterization of the best uniform approximation by linear polynomials [36], we see that

$\inf\{\|e\|_{[-1,1]} : a, b \in \mathbb{R}\} = \max\left\{\left|x^2 - \tfrac{1}{2}\right| : x \in [-1, 1]\right\} = \tfrac{1}{2} > 0.$

Hence, no matter how large the positive integer $m$ is chosen, and no matter what points $x_j$, $j = 1, 2, \ldots, m$, are chosen, the corresponding error satisfies $\|e\|_{[-1,1]} \ge \tfrac{1}{2}$. □

This example motivated us to introduce in [32] the notion of universality of a kernel. We now describe the definition of a universal kernel. Given a kernel $K : X \times X \to \mathbb{R}$, we introduce the function $K_y : X \to \mathbb{R}$ defined at every $x \in X$ by the equation $K_y(x) \doteq K(x, y)$ and form the space of kernel sections $\mathcal{K}(Z) \doteq \overline{\mathrm{span}}\{K_y : y \in Z\}$. The set $\mathcal{K}(Z)$ consists of all functions in $C(Z)$ which are uniform limits of functions of the form (3.9) with $\{x_j : j = 1, 2, \ldots, m\} \subset Z$. We say that a kernel $K : X \times X \to \mathbb{R}$ is universal if for any prescribed compact subset $Z$ of $X$, any positive number $\epsilon$ and any function $f \in C(Z)$ there is a function $g \in \mathcal{K}(Z)$ such that

$\|f - g\|_Z \le \epsilon.$

A characterization of a universal kernel was originally established in [32]. Moreover, convenient sufficient conditions of universal kernels were given there and several classes of commonly used kernels including the Gaussian kernel defined by equation (3.4) were shown to be universal. However, it follows from the definition of the universal kernel and Proposition 3.5 that the polynomial kernels are not universal.

Although a solution of a learning method in an RKHS has nice features, it suffers from denseness in its representation. To illustrate this point, we consider a simple minimum norm interpolation problem in $\ell_2(\mathbb{N})$. We seek $x^* \in \ell_2(\mathbb{N})$ such that

$\|x^*\|_2 = \inf\{\|x\|_2 : x \in \ell_2(\mathbb{N}),\ \langle a_i, x\rangle_{\ell_2} = y_i,\ i = 1, 2\}$, (3.11)

where

$y_1 \doteq 3, \quad y_2 \doteq 4, \quad a_1 \doteq \left(\tfrac{1}{n} : n \in \mathbb{N}\right), \quad a_2 \doteq \left(\tfrac{1}{(-2)^{n-1}} : n \in \mathbb{N}\right)$. (3.12)

According to [10], the solution of problem (3.11)-(3.12) is given by

$x^* = \left(\frac{0.4924584}{n} + \frac{2.7004714}{(-2)^{n-1}} : n \in \mathbb{N}\right)$. (3.13)

Clearly, every component of the solution x* is nonzero and thus, the solution is dense.

In the context of big data analytics, the dimension of data is large. It is desirable to obtain a sparse solution, a solution with a substantial number of zero components. We next illustrate the intrinsic characteristic of a space that may or may not lead to a sparse solution. To this end, we consider a learning problem in a Hilbert space,

$\min\{\|x\|_2 : x \in \mathbb{R}^d\}$, subject to $Ax = b$,

and one in a Banach space,

$\min\{\|x\|_1 : x \in \mathbb{R}^d\}$, subject to $Ax = b$.

Figures 3.4 and 3.5 illustrate that the minimum norm interpolations in the $\ell_1$ space are sparse, while those in the $\ell_2$ space are dense. This is because the geometries of the unit balls of these two norms are different: the unit balls of $\ell_2$ are smooth, which does not promote sparsity, while those of $\ell_1$ have corners on the coordinate axes, which promote sparsity.

Figure 3.4: The $\ell_2$-norm vs the $\ell_1$-norm: a two-dimensional illustration.

Figure 3.5: The $\ell_2$-norm vs the $\ell_1$-norm: a three-dimensional illustration.
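The contrast can also be seen numerically. The sketch below (an illustration, not from the paper) solves both finite dimensional problems above for random data: the $\ell_2$ problem via the pseudoinverse and the $\ell_1$ problem as a linear program with the standard splitting $x = u - v$, $u, v \ge 0$.

```python
import numpy as np
from scipy.optimize import linprog

# Minimum-norm interpolation subject to Ax = b in R^d.
# l2: pseudoinverse solution.  l1: linear program with x = u - v, u, v >= 0,
# so that ||x||_1 = sum(u + v).
rng = np.random.default_rng(0)
m, d = 5, 30
A = rng.normal(size=(m, d))
b = rng.normal(size=m)

x_l2 = np.linalg.pinv(A) @ b                      # min ||x||_2 s.t. Ax = b

c = np.ones(2 * d)                                # objective: sum(u) + sum(v)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b, bounds=[(0, None)] * (2 * d))
x_l1 = res.x[:d] - res.x[d:]

print("nonzeros in l2 solution:", np.sum(np.abs(x_l2) > 1e-8))   # typically d
print("nonzeros in l1 solution:", np.sum(np.abs(x_l1) > 1e-8))   # typically <= m
```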

We now return to the classification problem discussed in Section 2. To obtain a sparse solution, we replace the $\ell_2$ norm in (2.9) by the $\ell_1$ norm. This gives rise to the $\ell_1$ soft margin SVM model

$\min\left\{\frac{1}{N}\sum_{k=1}^{N} L(y_k, w \cdot x_k - b) + \lambda\|w\|_1 : (w, b) \in \mathbb{R}^d \times \mathbb{R}\right\}$. (3.14)
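For a quick experiment with an $\ell_1$-penalized linear SVM, scikit-learn's LinearSVC can be used; note that it pairs penalty='l1' with the squared hinge loss rather than the hinge loss in (3.14), so the sketch below illustrates the sparsity effect rather than reproducing model (3.14) exactly. The data are synthetic.

```python
import numpy as np
from sklearn.svm import LinearSVC

# l1-penalized linear SVM, a close relative of model (3.14).  scikit-learn
# requires loss='squared_hinge' and dual=False when penalty='l1'.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 20)), rng.normal(1.0, 1.0, (100, 20))])
y = np.hstack([-np.ones(100), np.ones(100)])

clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.05)
clf.fit(X, y)
print("nonzero weights:", np.sum(np.abs(clf.coef_) > 1e-8), "of", X.shape[1])
```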

To close this section, we summarize the pros and cons of using RKHSs for machine learning. The pros include:

  • The canonical inner product of an RKHS provides a convenient way of defining a similarity measure for comparison, which is an indispensable operation in machine learning.

  • A representer theorem of a machine learning method in an RKHS reduces an infinite dimensional problem to a finite dimensional problem, with a canonical finite dimensional space determined by the training data.

  • Universal kernels guarantee the approximation property.

The cons mainly concern the following aspect:

  • Learning methods in an RKHS result in dense solutions. It is computationally expensive to use dense solutions in prediction and other practical applications.

4. Reproducing Kernel Banach Spaces for Machine Learning

The discussion in the previous section regarding the denseness of a learned solution in a Hilbert space leads us to explore Banach spaces as hypothesis spaces for machine learning, hoping to gain sparsity of the learned solutions in such spaces. This is because the class of Banach spaces, which includes Hilbert spaces as special cases, offers more choices of hypothesis space for a machine learning method. In particular, certain Banach spaces with special geometric features may lead to sparse learning solutions. Aiming at developing sparse learning methods, the notion of the RKBS was first introduced in [58] in 2009. In this section, we review briefly the development of the RKBS during the last decade.

As pointed out earlier, the aim of most machine learning methods is to construct functions whose values can be used for prediction or other decision making purposes. For this reason, it would be desirable to consider a Banach space of functions defined over a prescribed set as our hypothesis space for learning. A Banach space $\mathcal{B}$ is called a space of functions on a prescribed set $X$ if $\mathcal{B}$ is composed of functions defined on $X$ and, for each $f \in \mathcal{B}$, $\|f\| = 0$ implies that $f(x) = 0$ for all $x \in X$. This implies that in a Banach space of functions, the pointwise function evaluation $f(x)$ must be well-defined for all $x \in X$ and all $f \in \mathcal{B}$. The standard $L_p(X)$ spaces are Banach function spaces, but not Banach spaces of functions, since their elements are equivalence classes rather than functions, and the pointwise function evaluation is not defined in these spaces. Because function values are crucial in decision making, we would expect them to be stable with respect to the functions chosen in the space. For this purpose, we would prefer the point-evaluation functionals $\delta_x : \mathcal{B} \to \mathbb{R}$ defined by

$\delta_x(f) \doteq f(x), \quad \text{for all } x \in X,$

to be continuous with respect to functions in the space $\mathcal{B}$. This is exactly the way in which the RKHS was defined [1]. This consideration gives rise to the following definition of the RKBS.

Definition 4.1 A Banach space $\mathcal{B}$ of functions defined on a prescribed set $X$ is called an RKBS if the point-evaluation functionals $\delta_x$, for all $x \in X$, are continuous on $\mathcal{B}$, that is, for each $x \in X$ there exists a constant $c_x > 0$ such that

$|\delta_x(f)| \le c_x\|f\|_{\mathcal{B}}, \quad \text{for all } f \in \mathcal{B}.$

The original definition of an RKBS was given in Definition 1 of [59] in a somewhat restricted version, where $\mathcal{B}$ was assumed to be reflexive; see [27] for further generalizations. In Definition 4.1, we do not impose this restriction. It follows from Definition 4.1 that if $\mathcal{B}$ is an RKBS on $X$, and $f, f_n \in \mathcal{B}$, for $n \in \mathbb{N}$, then $\|f_n - f\|_{\mathcal{B}} \to 0$ implies $f_n(x) \to f(x)$, as $n \to \infty$, for each $x \in X$.

In the RKHS setting, due to the well-known Riesz representation theorem, one can identify a function in the same space to represent each point-evaluation functional. That is, a Hilbert space is isometrically isomorphic to its dual space, by which we refer to the space of all continuous linear functionals on the Hilbert space. This naturally leads to the unique reproducing kernel associated with the RKHS. The kernel in conjunction with the inner product of the Hilbert space yields the reproducing property. However, in general, a Banach space is not isometrically isomorphic to its dual space. Moreover, continuity of the point-evaluation functionals does not guarantee the existence of a kernel. For this reason, we need to pay special attention to the notion of the reproducing kernel for an RKBS.

For a Banach space $\mathcal{B}$, we denote by $\mathcal{B}^*$ its dual space, the space of all continuous linear functionals on $\mathcal{B}$. It is well-known that the dual space of a Banach space is again a Banach space. Definition 4.1 of the RKBS ensures that when $\mathcal{B}$ is an RKBS on $X$, the point-evaluation functionals satisfy

$\delta_x \in \mathcal{B}^*, \quad \text{for all } x \in X.$ (4.1)

When $\mathcal{B}$ is an RKBS on $X$, we let $\mathcal{B}^\delta$ denote the completion of the linear span $\tilde{\mathcal{B}}^\delta$ of all the point-evaluation functionals $\delta_x$ on $\mathcal{B}$, $x \in X$, under the norm of $\mathcal{B}^*$. It follows from (4.1) that $\tilde{\mathcal{B}}^\delta \subseteq \mathcal{B}^*$. Thus, $\mathcal{B}^\delta \subseteq \mathcal{B}^*$ since $\mathcal{B}^*$ is complete. Moreover, $\mathcal{B}^\delta$ is the smallest Banach space that contains all point-evaluation functionals on $\mathcal{B}$. We will call $\mathcal{B}^\delta$ the $\delta$-dual space of $\mathcal{B}$. Because in machine learning we are interested in Banach spaces of functions, we further suppose that the $\delta$-dual space $\mathcal{B}^\delta$ is isometrically isomorphic to a Banach space $\mathcal{B}^\#$ of functions on a set $X'$. This hypothesis is satisfied for most examples that we encounter in applications. For instance, the space $\ell_1(\mathbb{N})$ that we will discuss later in this section has this property. In the rest of this paper, we will not distinguish $\mathcal{B}^\delta$ and $\mathcal{B}^\#$.

For a Banach space $\mathcal{B}$ of functions defined on the set $X$, we let $\langle\cdot,\cdot\rangle_{\mathcal{B}\times\mathcal{B}^\delta}$ denote the dual bilinear form on $\mathcal{B} \times \mathcal{B}^\delta$ induced by restricting the dual bilinear form on $\mathcal{B} \times \mathcal{B}^*$ to $\mathcal{B} \times \mathcal{B}^\delta$. When there is no ambiguity, we will write it as $\langle\cdot,\cdot\rangle$. We now define a reproducing kernel for an RKBS, which provides a closed-form function representation for the point-evaluation functionals.

Definition 4.2 Let $\mathcal{B}$ be an RKBS on a set $X$ with the $\delta$-dual space $\mathcal{B}^\delta$. Suppose that $\mathcal{B}^\delta$ is isometrically isomorphic to a Banach space of functions on a set $X'$. A function $K : X \times X' \to \mathbb{R}$ is called a reproducing kernel for $\mathcal{B}$ if $K(x, \cdot) \in \mathcal{B}^\delta$ for all $x \in X$, and

$f(x) = \langle f(\cdot), K(x, \cdot)\rangle, \quad \text{for all } f \in \mathcal{B}.$ (4.2)

If in addition $\mathcal{B}^\delta$ is an RKBS on $X'$, $K(\cdot, y') \in \mathcal{B}$ for all $y' \in X'$, and

$g(y') = \langle K(\cdot, y'), g(\cdot)\rangle, \quad \text{for all } g \in \mathcal{B}^\delta,$ (4.3)

we call $\mathcal{B}^\delta$ an adjoint RKBS of $\mathcal{B}$. In this case, the function

$K'(x', y) \doteq K(y, x'), \quad \text{for } x' \in X',\ y \in X,$

is a reproducing kernel for $\mathcal{B}^\delta$, and we call $\mathcal{B}$, $\mathcal{B}^\delta$ a pair of RKBSs.

The original definition of a reproducing kernel appeared in [58] with the choice $\mathcal{B}^\delta = \mathcal{B}^*$, which was extended in [55]. See also [46, 49] for further developments.

Equation (4.2) furnishes a representation of the point-evaluation functional in terms of the kernel and the dual bilinear form. The present form in Definition 4.2 of a reproducing kernel for an RKBS is slightly different from the various forms known in the literature. We feel that the form described in Definition 4.2 better captures the essence of reproducing kernels. Unlike a reproducing kernel for an RKHS, a reproducing kernel for an RKBS is not necessarily symmetric or positive semi-definite. In a special case when an RKBS satisfies the following hypothesis, we can establish the positive semi-definiteness of its kernel.

Hypothesis (H1): The spaces $\mathcal{B}$, $\mathcal{B}^\delta$ are a pair of RKBSs on a set $X$, the $\delta$-dual space $\mathcal{B}^\delta$ is isometrically isomorphic to a Banach space of functions on $X$, and a reproducing kernel $K : X \times X \to \mathbb{R}$ for $\mathcal{B}$ is symmetric.

Suppose that Hypothesis (H1) is satisfied. By symmetry of the kernel $K$, we have that $K(\cdot, x) \in \mathcal{B}$ for all $x \in X$ and $K(y, \cdot) \in \mathcal{B}^\delta$ for all $y \in X$. In this case, $\langle K(\cdot, x), K(\cdot, x)\rangle_{\mathcal{B}\times\mathcal{B}^\delta}$ is well-defined for $x \in X$. We further require that the following hypothesis be satisfied.

Hypothesis (H2): There exists a positive constant $A$ such that

$\frac{\langle f_n, f_n\rangle_{\mathcal{B}\times\mathcal{B}^\delta}}{\|f_n\|^2} \ge A, \quad \text{for all } f_n \doteq \sum_{i=1}^{n} c_i K(\cdot, x_i) \ne 0,\ n \in \mathbb{N},\ x_i \in X,\ c_i \in \mathbb{R},\ i = 1, 2, \ldots, n.$ (4.4)

In a loose sense, inequality (4.4) has a geometric interpretation: the "cosine" of the "angle" formed by $f_n$ with itself across the two spaces, measured by $\langle f_n, f_n\rangle_{\mathcal{B}\times\mathcal{B}^\delta}/\|f_n\|^2$, is uniformly bounded away from zero, for all $f_n \ne 0$, $n \in \mathbb{N}$, $x_i \in X$, $c_i \in \mathbb{R}$, $i = 1, 2, \ldots, n$. In other words, the spaces $\mathcal{B}$ and $\mathcal{B}^\delta$ cannot be "perpendicular" to each other with respect to the dual bilinear form $\langle\cdot,\cdot\rangle_{\mathcal{B}\times\mathcal{B}^\delta}$. When $\mathcal{B}$ is a Hilbert space, inequality (4.4) reduces to an equality with $A = 1$, since in such a case $\mathcal{B}^\delta = \mathcal{B}$. Under Hypotheses (H1) and (H2), we can establish the positive semi-definiteness of a reproducing kernel.

Theorem 4.3 Suppose that a pair $\mathcal{B}$, $\mathcal{B}^\delta$ of RKBSs satisfies Hypothesis (H1) with a reproducing kernel $K : X \times X \to \mathbb{R}$ and Hypothesis (H2). For all $n \in \mathbb{N}$ and all $x_j \in X$, $j = 1, 2, \ldots, n$, let

$K_n \doteq [K(x_i, x_j) : i, j = 1, 2, \ldots, n].$ (4.5)

Then, for all $n \in \mathbb{N}$ and all $x_j \in X$, $j = 1, 2, \ldots, n$, the matrices $K_n$ are positive semi-definite.

Proof: It suffices to show for all $c \doteq [c_1, c_2, \ldots, c_n]^\top \in \mathbb{R}^n$ that $c^\top K_n c \ge 0$. For $y \in X$, we introduce the notation $k_y \doteq K(\cdot, y)$. By the reproducing property (4.2) and symmetry of the kernel $K$, we observe that

$K(x_i, x_j) = k_{x_j}(x_i) = \langle k_{x_j}(\cdot), K(x_i, \cdot)\rangle = \langle K(\cdot, x_j), K(x_i, \cdot)\rangle = \langle K(\cdot, x_j), K(\cdot, x_i)\rangle.$

Therefore, according to Hypotheses (H1) and (H2), for all $c \doteq [c_1, c_2, \ldots, c_n]^\top \in \mathbb{R}^n$, we obtain that

$c^\top K_n c = \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j K(x_i, x_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j \langle K(\cdot, x_j), K(\cdot, x_i)\rangle = \left\langle \sum_{j=1}^{n} c_j K(\cdot, x_j), \sum_{i=1}^{n} c_i K(\cdot, x_i)\right\rangle \ge A\left\|\sum_{i=1}^{n} c_i K(\cdot, x_i)\right\|^2 \ge 0,$

proving the positive semi-definiteness of $K_n$. □

In general, the kernel matrix (4.5) is not well-defined unless $X' = X$.

We next illustrate the notion of the RKBS and its kernel with an example. For this purpose, we consider the space $\ell_1(\mathbb{N})$ of real sequences $f \doteq [f_1, f_2, \ldots]$ such that $\|f\|_1 \doteq \sum_{k \in \mathbb{N}} |f_k| < \infty$. Elements of $\ell_1(\mathbb{N})$ are functions defined on the set $X \doteq \mathbb{N}$. It is well-known that $\ell_1(\mathbb{N})$ is a Banach space, and thus, it is a Banach space of functions on $\mathbb{N}$. For each $j \in \mathbb{N}$, the point-evaluation functional $\delta_j$ has the form

$\delta_j(f) = f_j.$ (4.6)

We first establish that $\ell_1(\mathbb{N})$ is an RKBS, which is stated in the next theorem.

Theorem 4.4 The space $\ell_1(\mathbb{N})$ is an RKBS on $\mathbb{N}$.

Proof: It suffices to verify that the point-evaluation functionals on the space $\ell_1(\mathbb{N})$ are continuous. All point-evaluation functionals on the space $\ell_1(\mathbb{N})$ are $\delta_j$, $j \in \mathbb{N}$. Hence, for all $j \in \mathbb{N}$, it follows from (4.6) that

$|\delta_j(f)| = |f_j| \le \|f\|_1, \quad \text{for all } f \in \ell_1(\mathbb{N}).$

This ensures that for all $j \in \mathbb{N}$, the point-evaluation functionals $\delta_j$ are continuous on the space $\ell_1(\mathbb{N})$. By Definition 4.1, the space $\ell_1(\mathbb{N})$ is an RKBS on $\mathbb{N}$. □

We next identify a reproducing kernel for the RKBS $\ell_1(\mathbb{N})$. By $\ell_\infty(\mathbb{N})$ we denote the Banach space of real bounded sequences on $\mathbb{N}$ under the supremum norm. Namely, for any $a \doteq [a_1, a_2, \ldots] \in \ell_\infty(\mathbb{N})$, we have that $\|a\|_\infty \doteq \sup\{|a_k| : k \in \mathbb{N}\} < \infty$. We further denote by $c_0(\mathbb{N})$ the set of real sequences that converge to zero, in the sense that for all $a \doteq [a_1, a_2, \ldots] \in c_0(\mathbb{N})$, there holds $\lim_{k\to\infty} a_k = 0$. The set $c_0(\mathbb{N})$ is a Banach space under the supremum norm defined on $\mathbb{N}$. Thus, for all $a \in c_0(\mathbb{N})$, there holds $\|a\|_\infty < +\infty$. This ensures that $c_0(\mathbb{N}) \subseteq \ell_\infty(\mathbb{N})$. Moreover, since a nonzero constant sequence is in $\ell_\infty(\mathbb{N})$ but not in $c_0(\mathbb{N})$, $c_0(\mathbb{N})$ is a proper subspace of $\ell_\infty(\mathbb{N})$. It is well-known [37] that $c_0^*(\mathbb{N}) = \ell_1(\mathbb{N})$ and $\ell_1^*(\mathbb{N}) = \ell_\infty(\mathbb{N})$. It follows that $(\ell_1^*(\mathbb{N}))^* = (\ell_\infty(\mathbb{N}))^* \supsetneq c_0^*(\mathbb{N}) = \ell_1(\mathbb{N})$. Thus, $(\ell_1^*(\mathbb{N}))^* \ne \ell_1(\mathbb{N})$. That is, the Banach space $\ell_1(\mathbb{N})$ is not reflexive, and its predual is $c_0(\mathbb{N})$.

The dual bilinear form on $\ell_1(\mathbb{N}) \times \ell_\infty(\mathbb{N})$ has a concrete form. Specifically, for any $f \in \ell_1(\mathbb{N})$ and $a \in \ell_\infty(\mathbb{N})$, we define

$\langle f, a\rangle_{\ell_1(\mathbb{N})} := \sum_{k \in \mathbb{N}} a_k f_k.$ (4.7)

Restricting the definition (4.7) to the subspace $c_0(\mathbb{N})$ of $\ell_\infty(\mathbb{N})$ yields the dual bilinear form on $\ell_1(\mathbb{N}) \times c_0(\mathbb{N})$. It follows from (4.7), for any $f \in \ell_1(\mathbb{N})$ and $a \in c_0(\mathbb{N})$, that $\langle f, a\rangle_{\ell_1(\mathbb{N})} = \langle a, f\rangle_{c_0(\mathbb{N})}$. For this reason, we will drop the subscript and simply write $\langle f, a\rangle_{\ell_1(\mathbb{N})} = \langle f, a\rangle$ and $\langle a, f\rangle_{c_0(\mathbb{N})} = \langle a, f\rangle$. Moreover, since $\langle f, a\rangle = \langle a, f\rangle$ according to (4.7), we will use $\langle f, a\rangle$ and $\langle a, f\rangle$ interchangeably when no ambiguity may arise.

Proposition 4.5 If $(\ell_1(\mathbb{N}))^\delta$ denotes the $\delta$-dual space of $\ell_1(\mathbb{N})$, then $(\ell_1(\mathbb{N}))^\delta = c_0(\mathbb{N})$.

Proof: We first show that $c_0(\mathbb{N}) \subseteq (\ell_1(\mathbb{N}))^\delta$. Let $a \doteq [a_1, a_2, \ldots] \in c_0(\mathbb{N})$ be arbitrary. For all $f \doteq [f_1, f_2, \ldots] \in \ell_1(\mathbb{N})$, we have that

$\langle a, f\rangle = \sum_{j \in \mathbb{N}} a_j f_j = \sum_{j \in \mathbb{N}} a_j \delta_j(f).$

This implies that $a = \sum_{j \in \mathbb{N}} a_j \delta_j$. Namely, $a$ is the limit, in the norm of $\ell_1^*(\mathbb{N})$, of finite linear combinations of the point-evaluation functionals $\delta_j$, $j \in \mathbb{N}$. Hence, $c_0(\mathbb{N}) \subseteq (\ell_1(\mathbb{N}))^\delta$.

Conversely, we establish that $(\ell_1(\mathbb{N}))^\delta \subseteq c_0(\mathbb{N})$. Let $a \in (\ell_1(\mathbb{N}))^\delta$ be arbitrary. Since $(\ell_1(\mathbb{N}))^\delta$ is the completion of the linear span of all the point-evaluation functionals $\delta_j$ on $\ell_1(\mathbb{N})$, $j \in \mathbb{N}$, under the norm of $\ell_\infty(\mathbb{N})$, there exist $k_j^n \in \mathbb{N}$ and $\gamma_{k_j^n} \in \mathbb{R}$, for $j = 1, 2, \ldots, K_n$, $n \in \mathbb{N}$, such that the sequences $a_n \doteq \sum_{j=1}^{K_n} \gamma_{k_j^n}\delta_{k_j^n}$ satisfy

$\lim_{n\to\infty}\|a - a_n\|_\infty = 0.$ (4.8)

Clearly, $a_n \in c_0(\mathbb{N})$ for all $n \in \mathbb{N}$. Since $c_0(\mathbb{N})$ is a Banach space, equation (4.8) ensures that $a \in c_0(\mathbb{N})$, which implies that $(\ell_1(\mathbb{N}))^\delta \subseteq c_0(\mathbb{N})$. □

Proposition 4.5 guarantees that $(\ell_1(\mathbb{N}))^\delta = c_0(\mathbb{N})$, which allows us to identify a reproducing kernel for the RKBS $\ell_1(\mathbb{N})$. To this end, we define a function $K : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$, which is a semi-infinite matrix. Specifically, for $i, j \in \mathbb{N}$, we let $\delta_{i,j} \doteq 1$ if $i = j$ and $0$ otherwise. We then define

$K \doteq [K_{i,j} : i, j \in \mathbb{N}], \quad \text{where } K_{i,j} \doteq \delta_{i,j}, \text{ for } i, j \in \mathbb{N}.$ (4.9)

Theorem 4.6 The function $K \doteq [K_{i,j} : i, j \in \mathbb{N}]$ defined by (4.9) is a reproducing kernel for the RKBS $\ell_1(\mathbb{N})$.

Proof: According to Proposition 4.5, we have that $(\ell_1(\mathbb{N}))^\delta = c_0(\mathbb{N})$, which is a Banach space of functions on $X' \doteq \mathbb{N}$. By the definition (4.9) of the function $K$, we find that $K(i, \cdot) = K_{i,\cdot} = \delta_{i,\cdot} \in c_0(\mathbb{N})$, for all $i \in \mathbb{N}$. Moreover, for all $f \doteq [f_1, f_2, \ldots] \in \ell_1(\mathbb{N})$, we have for all $i \in \mathbb{N}$ that

$f_i = \sum_{j \in \mathbb{N}} \delta_{i,j} f_j = \sum_{j \in \mathbb{N}} K(i, j) f_j = \langle K(i, \cdot), f\rangle.$

This proves the reproducing property. By Definition 4.2 with $\mathcal{B} \doteq \ell_1(\mathbb{N})$ and $\mathcal{B}^\delta \doteq c_0(\mathbb{N})$, we conclude that the function $K$ is a reproducing kernel for $\ell_1(\mathbb{N})$. □

Note that Theorem 4.6 gives the simplest example of the RKBS, since any finite principal sub-matrix of the (infinite) matrix $K$ is the identity matrix. This result is not covered by the setting described in [58]. In the same manner, we consider the predual space $c_0(\mathbb{N})$.

Theorem 4.7 The space $c_0(\mathbb{N})$ is an RKBS on $\mathbb{N}$.

Proof: We establish that the point-evaluation functionals on $c_0(\mathbb{N})$ are continuous. All point-evaluation functionals on the space $c_0(\mathbb{N})$ are $\delta_j$, $j \in \mathbb{N}$. Hence, we observe for all $a \in c_0(\mathbb{N})$ that

$|\delta_j(a)| = |a_j| \le \|a\|_\infty, \quad \text{for all } j \in \mathbb{N}.$

This ensures that the point-evaluation functionals on $c_0(\mathbb{N})$ are continuous. Again, by Definition 4.1, $c_0(\mathbb{N})$ is an RKBS on $\mathbb{N}$. □

We next identify a reproducing kernel for the RKBS $c_0(\mathbb{N})$ of functions defined on $X' \doteq \mathbb{N}$.

Theorem 4.8 The function $K \doteq [K_{i,j} : i, j \in \mathbb{N}]$ defined by (4.9) is a reproducing kernel for $c_0(\mathbb{N})$. The space $c_0(\mathbb{N})$ is an adjoint RKBS of $\ell_1(\mathbb{N})$, the spaces $\ell_1(\mathbb{N})$, $c_0(\mathbb{N})$ form a pair of RKBSs, and the kernel $K$ is symmetric and positive semi-definite.

Proof: By Theorem 4.7, $c_0(\mathbb{N})$ is an RKBS on $\mathbb{N}$. Furthermore, we have that $K(\cdot, j) \doteq K_{\cdot,j} = \delta_{\cdot,j} \in \ell_1(\mathbb{N})$, for all $j \in \mathbb{N}$. Moreover, we have the dual reproducing property that for all $a \in c_0(\mathbb{N})$,

$a_j = \sum_{i \in \mathbb{N}} \delta_{i,j} a_i = \sum_{i \in \mathbb{N}} K(i, j) a_i = \langle K(\cdot, j), a\rangle.$

It follows from Definition 4.2 that the function $K$ defined by (4.9) is a reproducing kernel for the RKBS $c_0(\mathbb{N})$. Therefore, $c_0(\mathbb{N})$ is an adjoint RKBS of $\ell_1(\mathbb{N})$.

Clearly, the spaces $\ell_1(\mathbb{N})$, $c_0(\mathbb{N})$ with the kernel $K$ satisfy Hypotheses (H1)-(H2). The positive semi-definiteness of $K$ can be seen immediately since for any $n \in \mathbb{N}$ and distinct $x_i \in \mathbb{N}$, $i = 1, 2, \ldots, n$, the kernel matrix $K_n \doteq [K(x_i, x_j) : i, j = 1, 2, \ldots, n]$ is the identity matrix. □

Several nontrivial RKBSs may be found in [46, 53, 55, 58]. In particular, an RKBS isometrically isomorphic to the space $\ell_1(\mathbb{N})$ was constructed in [46]. We now briefly describe an RKBS built on a continuous function space. We choose $X$ to be a locally compact Hausdorff space. By $C_0(X)$ we denote the space of all continuous functions $f : X \to \mathbb{R}$ such that for each $\epsilon > 0$, the set $\{x \in X : |f(x)| \ge \epsilon\}$ is compact. We equip $C_0(X)$ with the maximum norm, that is,

$\|f\|_\infty \doteq \sup_{x \in X}|f(x)|, \quad \text{for all } f \in C_0(X).$

The Riesz-Markov representation theorem states that the dual space of $C_0(X)$ is isometrically isomorphic to the space $\mathfrak{M}(X)$ of the real-valued regular Borel measures on $X$ endowed with the total variation norm $\|\cdot\|_{TV}$. We suppose that $K : X \times X \to \mathbb{R}$ satisfies $K(x, \cdot) \in C_0(X)$ for all $x \in X$ and the density condition $\overline{\mathrm{span}}\{K(x, \cdot) : x \in X\} = C_0(X)$. We then introduce the space $\mathcal{B}$ of functions on $X$ by

$\mathcal{B} \doteq \left\{f_\mu \doteq \int_X K(\cdot, x)\, d\mu(x) : \mu \in \mathfrak{M}(X)\right\}$ (4.10)

equipped with $\|f_\mu\|_{\mathcal{B}} \doteq \|\mu\|_{TV}$. Clearly, $\mathcal{B}$ defined by (4.10) is a Banach space of functions defined on $X$. It was established in [53] that $\mathcal{B}$ is an RKBS and that the $\delta$-dual of $\mathcal{B}$ is isometrically isomorphic to the space $C_0(X)$. Moreover, the function $K$ is a reproducing kernel for the RKBS $\mathcal{B}$ in the sense of Definition 4.2. More information about the space $\mathcal{B}$ may be found in [3, 27, 47, 53].

The last decade has witnessed rapid development of the theory of the RKBS and its applications since the publication of the original paper [58]. Proper definition and construction of the RKBS and related theoretical issues remain research topics of great interest [18, 20, 22, 45, 46, 50, 51, 55, 60, 61, 62]. In particular, a construction of RKBSs using Mercer's kernels was proposed in [55]. A class of vector-valued reproducing kernel Banach spaces with the $\ell_1$ norm was constructed in [26] and used in multi-task learning. Separability of RKBSs was studied in [34]. Statistical margin error bounds were given in [8] for $L_1$-norm support vector machines. Lipschitz implicit function theorems in RKBSs were established in [43]. A converse sampling theorem in RKBSs was given in [7]. The notion of the RKBS has been used in statistics [39, 44], game theory [48], and machine learning [3, 11, 17, 25, 35, 52].

5. Learning in a Banach Space

Most learning methods may be formulated as a minimum norm interpolation problem or a regularization problem. We consider both cases in this section, focusing on the representation of solutions of these learning problems.

Recently, representer theorems for learning methods in Banach spaces have received considerable attention. In the framework of a semi-inner-product RKBS [58], the representer theorem was derived from the dual elements and the semi-inner-product [58, 61]. In [55], for a reflexive and smooth RKBS, a representer theorem for a solution of the regularized learning problem was established using the Gâteaux derivative of the norm function and the reproducing kernel. In addition, the representer theorem was generalized to non-reflexive and non-smooth Banach spaces which have a predual space [22, 50, 51, 52]. Due to the lack of a Gâteaux derivative, other tools such as the subdifferential of the norm function and the duality mapping need to be used to describe the representer theorem.

In this section, we assume that $\mathcal{B}$ is a general Banach space with the dual space $\mathcal{B}^*$ or predual space $\mathcal{B}_*$ unless stated otherwise. We first study the minimum norm interpolation problem. Given $y \doteq [y_1, y_2, \ldots, y_m] \in \mathbb{R}^m$ and $\nu_j \in \mathcal{B}^*$, $j = 1, 2, \ldots, m$, we seek $f^* \in \mathcal{B}$ such that

$\|f^*\| = \inf\{\|f\| : f \in \mathcal{B},\ \langle\nu_j, f\rangle = y_j,\ j = 1, 2, \ldots, m\}.$ (5.1)

While minimum norm interpolation in a Hilbert space is a classical topic [14, 15], its counterpart in a Banach space has drawn much attention in the literature [9, 10, 30, 50, 52] due to its connection with compressed sensing [6, 16]. An existence theorem of a solution of the minimum norm interpolation problem was established in Proposition 1 of [52].

The minimum norm interpolation problem (5.1) with $\mathcal{B} \doteq \ell_1(\mathbb{N})$ was investigated in [10]; it may be stated as follows: given $y_j \in \mathbb{R}$ and $a_j \in c_0(\mathbb{N})$, $j = 1, 2, \ldots, m$, we seek $x_1^* \in \ell_1(\mathbb{N})$ such that

$\|x_1^*\|_1 = \inf\{\|x\|_1 : x \in \ell_1(\mathbb{N}),\ \langle a_j, x\rangle = y_j,\ j = 1, 2, \ldots, m\}.$ (5.2)

Problem (5.2) is an infinite dimensional version of the compressed sensing problem considered in [6, 16]. According to [10], it may be reduced to a finite dimensional linear programming problem by a duality approach, leading to a sparse solution. Sparse learning methods in $\ell_1(\mathbb{N})$ were studied in [2, 10, 50, 52].

Problem (5.2) with the interpolation conditions (3.12) has a sparse solution, given by $x_1^* = \left(\tfrac{7}{2}, -1, 0, 0, \ldots\right)$ (see [10] for details about this example). This solution has only two nonzero components, while the solution of the minimum $\ell_2$ norm interpolation problem (3.11) with the same interpolation conditions has infinitely many nonzero components. The solution $x_1^*$ is sparse because the Banach space $\ell_1(\mathbb{N})$ promotes sparsity, while the solution $x^*$, given by (3.13), of problem (3.11) is dense because the Hilbert space $\ell_2(\mathbb{N})$ does not promote sparsity. Several forms of the representer theorem for a solution of problem (5.1) were established in [52].
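A quick numerical sanity check of this example (an illustration, not the duality method of [10]): truncating the sequences to their first $n$ entries turns (5.2) with the data (3.12) into a finite linear program, whose solution already displays the two-term support.

```python
import numpy as np
from scipy.optimize import linprog

# Truncated check of the l1 minimum norm interpolation (5.2) with data (3.12):
# minimize ||x||_1 subject to <a1, x> = 3 and <a2, x> = 4, where
# a1 = (1/n) and a2 = (1/(-2)^(n-1)).  We keep the first n_trunc entries and
# use the splitting x = u - v, u, v >= 0.
n_trunc = 50
n = np.arange(1, n_trunc + 1)
A = np.vstack([1.0 / n, 1.0 / (-2.0) ** (n - 1)])
b = np.array([3.0, 4.0])

c = np.ones(2 * n_trunc)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * n_trunc))
x = res.x[:n_trunc] - res.x[n_trunc:]
support = np.flatnonzero(np.abs(x) > 1e-8)
print(support + 1, np.round(x[support], 6))   # expected support {1, 2} with values 3.5 and -1
```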

We now turn to investigating the regularized learning problem in a Banach space. Such a problem may be formulated from the ill-posed problem

$\mathcal{L}(f) = y,$ (5.3)

where $y \in \mathbb{R}^m$ is given data and $\mathcal{L}$ represents either a physical system or a learning system. We define a data fidelity term $\mathcal{Q}_y(\mathcal{L}(f))$ from (5.3) by using a loss function $\mathcal{Q}_y : \mathbb{R}^m \to \mathbb{R}_+$ and choose a regularization term $\lambda\varphi(\|f\|_{\mathcal{B}})$ with a regularizer $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$. We then solve the regularization problem

$\inf\{\mathcal{Q}_y(\mathcal{L}(f)) + \lambda\varphi(\|f\|_{\mathcal{B}}) : f \in \mathcal{B}\},$ (5.4)

where $\lambda$ is a positive regularization parameter. Here, $\mathcal{Q}_y(u)$, for $u \in \mathbb{R}^m$, measures the deviation of $u$ from the given data $y$. For example, one may choose $\mathcal{Q}_y(u) \doteq \|u - y\|_2$. Regularized learning problems in Banach spaces originated in [30], and since then, representer theorems for such learning problems have received considerable attention in the literature. For existence results for a solution of problem (5.4), the reader is referred to [52].

When the operator $\mathcal{L} : \mathcal{B} \to \mathbb{R}^m$ in (5.3) is defined by

$\mathcal{L}(f) \doteq [\langle\nu_j, f\rangle : j = 1, 2, \ldots, m],$ (5.5)

the regularization problem (5.4) is intimately related to the minimum norm interpolation problem (5.1). Their relation is described in the next proposition.

Proposition 5.1 Suppose that $\mathcal{B}$ is a Banach space with the dual space $\mathcal{B}^*$, $\nu_j \in \mathcal{B}^*$, $j = 1, 2, \ldots, m$, and $\mathcal{L}$ is defined by (5.5). For a given $y_0 \in \mathbb{R}^m$, let $\mathcal{Q}_{y_0} : \mathbb{R}^m \to \mathbb{R}_+$ be a loss function, $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ be an increasing regularizer and $\lambda > 0$. Let $\hat{f}$ be a solution of the regularization problem (5.4) with $y \doteq y_0$. Then the following statements hold true:

  1. A solution $\hat{g}$ of problem (5.1) with $y \doteq \mathcal{L}(\hat{f})$ is a solution of the regularization problem (5.4) with $y \doteq y_0$.

  2. If $\varphi$ is strictly increasing, then $\hat{f}$ is a solution of problem (5.1) with $y \doteq \mathcal{L}(\hat{f})$.

Statement (ii) of Proposition 5.1 was claimed in [30] without a detailed proof. A complete proof of Proposition 5.1 may be found in [52].

We are interested in characterizing a solution of the regularization problem (5.4) with an operator $\mathcal{L}$ defined by (5.5), having the adjoint operator $\mathcal{L}^* : \mathbb{R}^m \to \mathcal{B}_*$. The following two representer theorems are due to [52]. For a given $y_0 \in \mathbb{R}^m$, let $\mathcal{Q}_{y_0} : \mathbb{R}^m \to \mathbb{R}_+$ be a loss function. We first consider the case when the Banach space has a smooth predual.

Theorem 5.2 Suppose that $\mathcal{B}$ is a Banach space having the smooth predual space $\mathcal{B}_*$, and $\nu_j \in \mathcal{B}_*$, $j = 1, 2, \ldots, m$. Let $\mathcal{G}_*(\nu)$ denote the Gâteaux derivative of the norm $\|\cdot\|_{\mathcal{B}_*}$ at $\nu \in \mathcal{B}_*$. Let $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ be a strictly increasing regularizer and $\lambda > 0$. Then

$f_0 \doteq \|\mathcal{L}^*(\hat{c})\|_{\mathcal{B}_*}\, \mathcal{G}_*(\mathcal{L}^*(\hat{c})), \quad \hat{c} \in \mathbb{R}^m,$ (5.6)

is a solution of the regularization problem (5.4) with $y \doteq y_0$ if and only if $\hat{c} \in \mathbb{R}^m$ is a solution of the finite dimensional minimization problem

$\inf\left\{\mathcal{Q}_{y_0}\!\left(\|\mathcal{L}^*(c)\|_{\mathcal{B}_*}\, \mathcal{L}(\mathcal{G}_*(\mathcal{L}^*(c)))\right) + \lambda\varphi\!\left(\|\mathcal{L}^*(c)\|_{\mathcal{B}_*}\right) : c \in \mathbb{R}^m\right\}.$ (5.7)

We now consider the case when the predual of the Banach space is not necessarily smooth.

Theorem 5.3 Suppose that $\mathcal{B}$ is a Banach space having the predual space $\mathcal{B}_*$, and $\nu_j \in \mathcal{B}_*$, $j = 1, 2, \ldots, m$. Let $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ be a regularizer and $\lambda > 0$.

  1. If $\varphi$ is increasing, then there exists a solution $f_0$ of the regularization problem (5.4) with $y \doteq y_0$ such that
    $f_0 \in \rho\, \partial\|\cdot\|_{\mathcal{B}_*}\!\left(\sum_{j=1}^{m} c_j\nu_j\right),$ (5.8)
    for some $c_j \in \mathbb{R}$, $j = 1, 2, \ldots, m$, with $\rho \doteq \left\|\sum_{j=1}^{m} c_j\nu_j\right\|_{\mathcal{B}_*}$, where $\partial\|\cdot\|_{\mathcal{B}_*}$ denotes the subdifferential of the norm.
  2. If $\varphi$ is strictly increasing, then every solution $f_0$ of the regularization problem (5.4) with $y \doteq y_0$ satisfies (5.8) for some $c_j \in \mathbb{R}$, $j = 1, 2, \ldots, m$.

Item (i) of Theorem 5.3 indicates that there exists a solution $f_0$ of the regularization problem (5.4) with $y \doteq y_0$ that satisfies (5.8), which is a generalization of the stationary point condition to minimization problems with a non-differentiable objective function. The essence of Theorems 5.2 and 5.3 is that a solution $f_0$, defined by (5.6) or (5.8), of the infinite dimensional regularization problem (5.4) is determined completely by a finite number of parameters. The values of these parameters can be obtained by solving either the finite dimensional optimization problem (5.7) or a nonlinear system. When the space $\mathcal{B}$ is a Hilbert space, the nonlinear system reduces to a linear one. In particular, when $\mathcal{B}$ is an RKBS with a kernel $K$, the functionals $\nu_j$ in Theorems 5.2 and 5.3 have convenient representations in terms of the kernel $K$. We will demonstrate this point later in this section. These representer theorems serve as a basis for further development of efficient numerical solvers for the regularization problem (5.4). One may find more discussion of representer theorems in Banach spaces in [41].

Not all Banach spaces produce sparse learning solutions; only Banach spaces with certain geometric features can lead to a sparse learning solution. It has been shown in [10, 50] that minimum norm interpolation and regularization problems in $\ell_1(\mathbb{N})$ have sparse solutions. As shown in Theorems 4.4 and 4.6, the space $\ell_1(\mathbb{N})$ is an RKBS on $\mathbb{N}$ with the kernel $K$ defined by (4.9). Hence, a learning solution in the RKBS $\ell_1(\mathbb{N})$ can be expressed in terms of the kernel $K$. Combining Theorems 4.4 and 4.6 with results known in the literature [50], we have the following sparse representer theorem for a learning solution in $\ell_1(\mathbb{N})$: for any learning solution $f \in \ell_1(\mathbb{N})$, there exists a vector of positive integers $[n_1, \ldots, n_q] \in \mathbb{N}^q$ with $q \le m$ such that

$f = \sum_{i=1}^{q} c_i K(\cdot, n_i).$ (5.9)

Clearly, the solution $f$ has only $q$ nonzero components. Here, the vector $[n_1, \ldots, n_q] \in \mathbb{N}^q$ represents the support of the learning solution $f$ defined by (5.9), and the coefficients $c_i$ are determined by the fidelity data. The paper [10] suggests that we can first convert the minimum norm interpolation problem to its dual problem, which allows us to identify the positive integer $q$ and the support $[n_1, \ldots, n_q]$, and then solve a linear system to obtain the coefficients $c_i$. The idea of [10] has been extended in [9] to solve a class of regularization problems.

We now return to the binary classification problem and describe a model based on an RKBS for it. We choose the RKBS $\mathcal{B}$ defined by (4.10) with $X \doteq \mathbb{R}^d$ and

$K(x, y) \doteq \exp\left(-\frac{\|x - y\|_2^2}{2\mu^2}\right), \quad x, y \in \mathbb{R}^d,$

as the hypothesis space. The classification problem in the RKBS may be described by

$\min\left\{\frac{1}{n}\sum_{k=1}^{n} L(y_k, f(x_k)) + \lambda\|f\|_{\mathcal{B}} : f \in \mathcal{B}\right\},$ (5.10)

where the fidelity term is the same as in problem (3.6). A representer theorem for a solution of the learning method (5.10) was obtained in [53]. To state it, for a function $f : \mathbb{R}^d \to \mathbb{R}$ with $\|f\|_\infty < \infty$, we define the set $\mathcal{N}(f) \doteq \{x \in \mathbb{R}^d : |f(x)| = \|f\|_\infty\}$. We denote by $h^*$ a solution of the minimization problem (5.10) and let $y \doteq [h^*(x_k) : k = 1, 2, \ldots, n] \in \mathbb{R}^n$. It was proved in [53] that there exists a solution of (5.10) having the form

$f^* \doteq \sum_{k=1}^{n}\alpha_k K(\cdot, z_k), \quad \text{for some } z_k \in \mathcal{N}\!\left(\sum_{j=1}^{n} c_j^* K(x_j, \cdot)\right) \text{ and } \alpha_k \in \mathbb{R},\ k = 1, 2, \ldots, n,$ (5.11)

where $c^* \in \mathbb{R}^n$ is a solution of the optimization problem

$\max\left\{y^\top c : c = [c_j : j = 1, 2, \ldots, n] \in \mathbb{R}^n,\ \left\|\sum_{j=1}^{n} c_j K(x_j, \cdot)\right\|_\infty = 1\right\}.$ (5.12)

Although the points $z_k$, $k = 1, 2, \ldots, n$, may not be the same as the points $x_k$, $k = 1, 2, \ldots, n$, motivated by (5.11) we seek a solution of the regularization problem (5.10) in the form $\tilde{f}^* \doteq \sum_{k=1}^{n}\alpha_k K(\cdot, x_k)$. By plugging $\tilde{f}^*$ into (5.10) and adding a bias term $b$, we obtain the $\ell_1$-SVM classification model

$\min\{L_{\mathcal{D}}(\alpha, b) + \lambda\|\alpha\|_1 : \alpha \in \mathbb{R}^n,\ b \in \mathbb{R}\},$ (5.13)

where

$L_{\mathcal{D}}(\alpha, b) \doteq \sum_{j=1}^{n}\max\left\{1 - y_j\left(\sum_{k=1}^{n}\alpha_k K(x_k, x_j) + b\right),\ 0\right\}.$

The discrete minimization problem (5.13) has the same form as (3.14). We comment that the space $\mathbb{R}^n$ with the $\ell_1$-norm is a finite dimensional RKBS, by the same argument as in Theorem 4.4.

Next, we elaborate on the numerical solution of the regularized learning problem. Due to the use of a Banach space as the hypothesis space in learning, we inevitably face solving the regularized learning problem (5.4) with a Banach norm. Major challenges for solving this problem include its infinite dimensionality, nonlinearity and nondifferentiability. The regularization problem (5.4) is by nature infinite dimensional, since we seek a solution in the hypothesis space, which is an infinite dimensional Banach space, even though the fidelity term is determined by a finite number of data points. The use of a Banach norm as a regularization term leads to nonlinearity. Moreover, employing a sparsity promoting norm results in a non-differentiable optimization problem. The representer theorems presented in this section for a solution of this problem indicate that the regularization problem (5.4) is essentially of finite dimension. However, how the representer theorems can be used in developing practical algorithms requires further investigation. A recent paper [9] makes an attempt in this direction. Specifically, we reformulate the regularization problem (5.4) in the following way: when the fidelity term of the regularization problem (5.4) is expressed in one Banach norm and the regularization term in another Banach norm, we construct a direct sum space based on the two Banach spaces for the data fidelity term and the regularization term, and then recast the objective function as the norm of a suitable quotient space of the direct sum space. In this way, we express the original regularized problem as a best approximation problem in the direct sum space, which is in turn reformulated as a dual optimization problem in the dual space of the direct sum space. The dual problem is to find the maximum of a linear function on a convex polytope, which is of finite dimension and may be solved by numerical methods such as linear programming. The solution of the dual optimization problem provides related extremal properties of norming functionals, by which the original problem is reduced to a finite dimensional optimization problem, which is then reformulated as a finite dimensional fixed-point equation [24, 31] and solved by iterative algorithms.

6. Numerical Results

This section is devoted to a numerical example concerning classification of handwritten digits by using the $\ell_1$-SVM model (5.13), with a comparison to the $\ell_2$-SVM model. Handwritten digit classification using the $\ell_1$-SVM was studied in [28].

In our experiment, we use the MNIST database of handwritten digits, which is originally composed of 60,000 training samples and 10,000 testing samples of the digits "0" through "9". We consider classifying the two handwritten digits "7" and "9" taken from MNIST. The digits "7" and "9" are more easily confused with each other than most other pairs of digits. We take 8,141 training samples and 2,037 testing samples of these two digits from the database. Specifically, we consider the training data set $\mathcal{D} \doteq \{(x_j, y_j) : j = 1, 2, \ldots, n\} \subset \mathbb{R}^d \times \{\pm 1\}$, where the $x_j$ are images of the digits 7 or 9. We wish to find a function that defines a hyperplane separating the data $\mathcal{D}$ into two groups with labels $y_j = 1$ (digit 7) and $y_j = -1$ (digit 9).

We employ the model (5.10) and its related discrete form (5.13), described in Section 5, for our experiment. For comparison purposes, we also test the learning model (5.13) with the squared loss function

$L_{\mathcal{D}}(\alpha, b) \doteq \frac{1}{2}\sum_{j=1}^{n}\left(\sum_{k=1}^{n}\alpha_k K(x_k, x_j) + b - y_j\right)^2$

as a fidelity term in problem (5.13). We compare the classification performance and solution sparsity of problem (5.13) with those of its Hilbert space counterpart (3.6), which leads to problem (5.13) with the $\ell_1$-norm replaced by the $\ell_2$-norm.

The objective functions of both the minimization problem (5.13) and its $\ell_2$-norm counterpart are not differentiable, and thus they cannot be solved by a gradient-based iteration algorithm. Instead, we employ the Fixed Point Proximity Algorithm (FPPA) developed in [24, 31] to solve these non-smooth minimization problems. Specifically, we first use the Fermat rule to reformulate a solution of the minimization problem (5.13) as a solution of an inclusion relation defined by the subdifferential of the objective function of (5.13). We then rewrite the inclusion relation as an equivalent fixed-point equation involving the proximity operators of the functions $L_{\mathcal{D}}$ and $\|\cdot\|_1$ which appear in the objective function of (5.13). The fixed-point equation leads to the FPPA for solving (5.13), which is guaranteed to converge under a mild condition. We refer the interested reader to [24, 31] for more information on the FPPA.
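For readers who want a minimal working stand-in for these experiments: with the squared loss above and the bias term dropped, problem (5.13) becomes a lasso-type problem, and a basic proximal-gradient (ISTA) iteration using the soft-thresholding proximity operator of $\lambda\|\cdot\|_1$ already produces sparse coefficient vectors. This sketch is a simplification for illustration, not the FPPA of [24, 31], and it uses synthetic data rather than MNIST.

```python
import numpy as np

# Proximal-gradient (ISTA) sketch for the square-loss, bias-free variant of
# (5.13): minimize (1/2)||K alpha - y||_2^2 + lam * ||alpha||_1.
def gaussian_kernel(A, B, mu=4.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * mu ** 2))

def soft_threshold(v, t):
    # Proximity operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (60, 5)), rng.normal(1.0, 1.0, (60, 5))])
y = np.hstack([-np.ones(60), np.ones(60)])

K = gaussian_kernel(X, X)
lam = 0.5
step = 1.0 / np.linalg.norm(K, 2) ** 2        # 1/L, L = Lipschitz constant of the gradient
alpha = np.zeros(len(y))
for _ in range(500):
    grad = K.T @ (K @ alpha - y)
    alpha = soft_threshold(alpha - step * grad, step * lam)

pred = np.sign(K @ alpha)
print("nonzeros (#NZ):", np.sum(np.abs(alpha) > 1e-8), "training accuracy:", (pred == y).mean())
```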

Numerical results are reported in Tables I-IV. In these tables, $\lambda$ denotes the value of the regularization parameter, #NZ the sparsity level (number of nonzero entries), and TRA and TEA the learning accuracy (correct classification rate) on the training and testing data sets, respectively. Specifically,

$\mathrm{TRA} \doteq \frac{NP_{\mathrm{train}}}{N_{\mathrm{train}}} \quad \text{and} \quad \mathrm{TEA} \doteq \frac{NP_{\mathrm{test}}}{N_{\mathrm{test}}},$

where $NP_{\mathrm{train}}$ and $NP_{\mathrm{test}}$ denote the numbers of predicted labels equal to the given labels of the training and testing data points, respectively, and $N_{\mathrm{train}}$ and $N_{\mathrm{test}}$ denote the total numbers of training and testing data points, respectively.

Table I:

Classification using the Gaussian kernel with μ = 4.8: results for the ℓ¹-SVM model with the square loss function.

λ 0.5 0.7 1.4 1.8 4.0 6.0 487.7
#NZ 1762 880 481 340 187 142 0
TRA 99.46% 99.19% 98.61% 98.34% 97.57% 97.16% 50.67%
TEA 98.77% 98.82% 98.48% 98.38% 97.79% 97.40% 50.47%

Table IV:

Classification using the Gaussian kernel with μ = 4: results for the ℓ²-SVM model with the hinge loss function.

λ 0.001 0.2 2 12 22 47 128
#NZ 8,141 8,141 8,141 8,141 8,141 8,141 8,141
TRA 99.90% 99.84% 99.18% 98.13% 97.68% 97.07% 96.31%
TEA 99.02% 98.97% 98.33% 97.89% 97.59% 97.05% 96.41%

Explanations of the choice of the regularization parameter λ are in order. In Table I, we randomly selected seven different values of λ from the interval (0, λ_max], where λ_max is determined by Remark 4.12 in [28]. In each of the remaining tables, we selected seven values of λ in increasing order such that the ℓ¹-SVM classification model (5.13) and the related ℓ² model produce the same accuracy. In this way, we can compare the sparsity of the corresponding solutions of the two models. Moreover, the largest λ in Table I is the smallest value for which #NZ = 0, since for the ℓ¹-SVM model with the square loss function, Remark 4.12 of [28] determines such a λ value. However, for Table III, which is for the ℓ¹-SVM model with the hinge loss function, we do not have a theoretical result similar to Remark 4.12 of [28] to determine the smallest value ensuring #NZ = 0; hence, we empirically chose a λ value such that #NZ = 0.
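For orientation, the following sketch computes the smallest λ for which the zero coefficient vector (with the optimal bias) solves the ℓ¹-regularized square-loss problem. This is the standard lasso-with-intercept bound and is offered only as a rough stand-in for the value prescribed by Remark 4.12 of [28], which may differ.

```python
import numpy as np

def lambda_max_square_loss(K, y):
    # Smallest lambda for which alpha = 0 (with bias b = mean(y)) minimizes
    #   0.5 * ||K alpha + b*1 - y||^2 + lambda * ||alpha||_1.
    # A standard optimality-condition computation; not necessarily the exact
    # expression given in Remark 4.12 of [28].
    b = np.mean(y)
    return np.max(np.abs(K.T @ (y - b)))
```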

Table III:

Classification using the Gaussian kernel with μ = 4: results for the ℓ¹-SVM model with the hinge loss function.

λ 0.1 0.2 1 2 4 10 435
#NZ 552 481 167 92 56 34 0
TRA 99.99% 99.99% 99.08% 98.17% 97.53% 96.30% 50.67%
TEA 98.72% 98.77% 98.38% 98.09% 97.45% 96.27% 50.47%

The numerical results presented in Tables I–IV confirm that the two models produce learning results of comparable accuracy, while the solution of the ℓ¹-SVM model attains different levels of sparsity depending on the value of λ, whereas that of the ℓ²-SVM model is always dense regardless of the choice of λ. In passing, we point out that the hinge loss function outperforms the square loss function as a fidelity term for this classification problem.

7. Remarks on Future Research Problems

In this section, we discuss several open mathematical problems critical to the theory of RKBSs and to machine learning methods in Banach spaces.

It has been demonstrated that learning in certain Banach spaces leads to a sparse solution, while learning in a Hilbert space yields a dense solution. Sparse learning solutions can save tremendous computing effort and storage when they are used in decision-making processes. However, learning in Banach spaces introduces new challenges. First of all, the theory of the RKBS is far from complete. Construction of reproducing kernels for RKBSs suitable for different applications is needed. For example, related to the definition of the reproducing kernel, we have the following specific problem. Suppose that B is a Banach space of functions on X whose point-evaluation functionals are all continuous, and let the δ-dual of B be the completion of the linear span of all point-evaluation functionals on B under the norm of the dual space B*. In general, a Banach space of functions is not isometrically isomorphic to its dual or to its δ-dual. Take ℓ¹(ℕ) as an example: we know from section 4 that its dual is ℓ∞(ℕ) and its δ-dual is c₀(ℕ), and clearly ℓ¹(ℕ) is isometrically isomorphic to neither of them. In Definition 4.2 of a kernel, we need the assumption that the δ-dual is isometrically isomorphic to a Banach space of functions on some set. This assumption holds true for ℓ¹(ℕ), since c₀(ℕ) is a Banach space of functions (sequences) defined on ℕ. We are interested in knowing to what extent this is true in general.

Theorem 4.3 reveals that, under Hypotheses (H1) and (H2), a reproducing kernel as defined in Definition 4.2 is positive semi-definite. Can the two hypotheses be weakened? We are also curious to know how a given function of two arguments can lead to a pair of RKBSs.

A problem related to sparse learning in a Banach space may be stated as follows. It has been understood that the Banach space ℓ¹(ℕ) promotes sparsity of a learning solution in the space. What is the essential geometric feature of a general Banach space of functions that can lead to a sparse learning solution in the space?

Finally, practical numerical algorithms for learning a function from an RKBS require systematic investigation.

Answers to these challenging questions would contribute to completing the theory of the RKBS and to making it a useful hypothesis space for machine learning, hopefully leading to practical numerical methods for learning in an RKBS.

Table II:

Classification using the Gaussian kernel with μ = 4.8: results for the ℓ²-SVM model with the square loss function.

λ 0.005 1.0 2.3 6.0 43.0 96.0 490
#NZ 8,141 8,141 8,141 8,141 8,141 8,141 8,141
TRA 100% 99.50% 99.18% 98.69% 97.56% 97.16% 95.93%
TEA 99.21% 98.87% 98.72% 98.43% 97.69% 97.25% 96.41%

Acknowledgement:

This paper is based on the author’s plenary invited lecture delivered online at the international conference “Functional Analysis, Approximation Theory and Numerical Analysis”, held in Matera, Italy, July 5–8, 2022. The author is indebted to Professor Rui Wang for helpful discussions on issues related to the notion of the reproducing kernel Banach space, and to Professor Raymond Cheng for carefully reading the manuscript and providing a list of corrections. The author is grateful to two anonymous referees whose constructive comments improved the presentation of this paper. The author is supported in part by the US National Science Foundation under grants DMS-1912958 and DMS-2208386, and by the US National Institutes of Health under grant R21CA263876.


References

• [1] Aronszajn N, Theory of reproducing kernels, Transactions of the American Mathematical Society, 68 (1950), 337–404.
• [2] Aziznejad S and Unser M, Multikernel regression with sparsity constraint, SIAM Journal on Mathematics of Data Science, 3 (2021), doi:10.1137/20M1318882.
• [3] Bartolucci F, De Vito E, Rosasco L and Vigogna S, Understanding neural networks with reproducing kernel Banach spaces, Applied and Computational Harmonic Analysis, 62 (2023), 194–236.
• [4] Bochner S, Hilbert distances and positive definite functions, Annals of Mathematics, 42 (1941), 647–656.
• [5] de Boor C and Lynch RE, On splines and their minimum properties, Journal of Mathematics and Mechanics, 15 (1966), 953–969.
• [6] Candès EJ, Romberg J and Tao T, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2006), 489–509.
• [7] Centeno H and Medina JM, A converse sampling theorem in reproducing kernel Banach spaces, Sampling Theory, Signal Processing, and Data Analysis, 20 (2022), no. 8, 19 pages.
• [8] Chen L and Zhang H, Statistical margin error bounds for L1-norm support vector machines, Neurocomputing, 339 (2019), 210–216.
• [9] Cheng R, Wang R and Xu Y, A duality approach to regularization learning problems in Banach spaces, preprint, 2022.
• [10] Cheng R and Xu Y, Minimum norm interpolation in the ℓ¹(ℕ) space, Analysis and Applications, 19 (2021), 21–42.
• [11] Combettes PL, Salzo S and Villa S, Regularized learning schemes in feature Banach spaces, Analysis and Applications, 16 (2018), 1–54.
• [12] Cortes C and Vapnik V, Support-vector networks, Machine Learning, 20 (1995), 273–297.
• [13] Cucker F and Smale S, On the mathematical foundations of learning, Bulletin of the American Mathematical Society, 39 (2002), 1–49.
• [14] Deutsch F, Best Approximation in Inner Product Spaces, Springer, New York, 2001.
• [15] Deutsch F, Ubhaya VA, Ward JD and Xu Y, Constrained best approximation in Hilbert space III. Applications to n-convex functions, Constructive Approximation, 12 (1996), 361–384.
• [16] Donoho DL, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), 1289–1306.
• [17] Fasshauer GE, Hickernell FJ and Ye Q, Solving support vector machines in reproducing kernel Banach spaces with positive definite functions, Applied and Computational Harmonic Analysis, 38 (2015), 115–139.
• [18] Fukumizu K, Lanckriet G and Sriperumbudur BK, Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint, Advances in Neural Information Processing Systems 24 (NIPS 2011).
• [19] Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B and Smola A, A kernel two-sample test, Journal of Machine Learning Research, 13 (2012), 723–773.
• [20] Georgiev PG, Sánchez-González L and Pardalos PM, Construction of pairs of reproducing kernel Banach spaces, in “Constructive Nonsmooth Analysis and Related Topics”, Springer Optimization and Its Applications, vol. 87, Springer, pp. 39–57, 2013.
• [21] Hoefler T, Alistarh DA, Dryden N and Ben-Nun T, The future of deep learning will be sparse, SIAM News, May 3, 2021.
• [22] Huang L, Liu C, Tan L and Ye Q, Generalized representer theorems in Banach spaces, Analysis and Applications, 19 (2021), 125–146.
• [23] Kimeldorf GS and Wahba G, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Annals of Mathematical Statistics, 41 (1970), 495–502.
• [24] Li Q, Shen L, Xu Y and Zhang N, Multi-step fixed-point proximity algorithms for solving a class of optimization problems arising from image processing, Advances in Computational Mathematics, 41 (2015), 387–422.
• [25] Li Z, Xu Y and Ye Q, Sparse support vector machines in reproducing kernel Banach spaces, in “Contemporary Computational Mathematics – A Celebration of the 80th Birthday of Ian Sloan”, pp. 869–887, Springer, 2018.
• [26] Lin R, Song G and Zhang H, Multi-task learning in vector-valued reproducing kernel Banach spaces with the ℓ¹ norm, Journal of Complexity, 63 (2021), 101514.
• [27] Lin R, Zhang H and Zhang J, On reproducing kernel Banach spaces: Generic definitions and unified framework of constructions, Acta Mathematica Sinica, 38 (2022), 1459–1483.
• [28] Liu Q, Wang R, Xu Y and Yan M, Parameter choices for sparse regularization with the ℓ¹ norm, Inverse Problems, 39 (2023), 025004 (34pp).
• [29] Mercer J, Functions of positive and negative type and their connection with the theory of integral equations, Philosophical Transactions of the Royal Society of London, Series A, 209 (1909), 415–446.
• [30] Micchelli CA and Pontil M, A function representation for learning in Banach spaces, in “Learning Theory”, Lecture Notes in Computer Science 3120, pp. 255–269, Springer, Berlin, 2004.
• [31] Micchelli CA, Shen L and Xu Y, Proximity algorithms for image models: denoising, Inverse Problems, 27 (2011), 045009.
• [32] Micchelli CA, Xu Y and Zhang H, Universal kernels, Journal of Machine Learning Research, 7 (2006), 2651–2667.
• [33] Muandet K, Fukumizu K, Sriperumbudur B and Schölkopf B, Kernel mean embedding of distributions: A review and beyond, Foundations and Trends in Machine Learning, 10 (2017), 1–141.
• [34] Owhadi H and Scovel C, Separability of reproducing kernel spaces, Proceedings of the American Mathematical Society, 145 (2017), 2131–2138.
• [35] Parhi R and Nowak RD, Banach space representer theorems for neural networks and ridge splines, Journal of Machine Learning Research, 22 (2021), 1–40.
• [36] Powell MJD, Approximation Theory and Methods, Cambridge University Press, Cambridge, 1981.
• [37] Reed M and Simon B, Functional Analysis, Academic Press, New York, 1980.
• [38] Royden HL, Real Analysis, 3rd Edition, Macmillan Publishing Company, New York, 1988.
• [39] Salzo S and Suykens JAK, Generalized support vector regression: Duality and tensor-kernel representation, Analysis and Applications, 18 (2020), 149–183.
• [40] Schölkopf B, Herbrich R and Smola AJ, A generalized representer theorem, in “Computational Learning Theory”, Lecture Notes in Computer Science 2111, pp. 416–426, 2001.
• [41] Schlegel K, When is there a representer theorem?, Advances in Computational Mathematics, 47 (2021), 54.
• [42] Schuster T, Kaltenbacher B, Hofmann B and Kazimierski KS, Regularization Methods in Banach Spaces, Radon Series on Computational and Applied Mathematics, vol. 10, De Gruyter, Berlin, 2012, doi:10.1515/9783110255720.
• [43] Shannon C, On Lipschitz implicit function theorems in Banach spaces and applications, Journal of Mathematical Analysis and Applications, 494 (2021), 124589.
• [44] Sheng B and Zuo L, Error analysis of the kernel regularized regression based on refined convex losses and RKBSs, International Journal of Wavelets, Multiresolution and Information Processing, 19 (2021), 2150012.
• [45] Song G and Zhang H, Reproducing kernel Banach spaces with the ℓ¹ norm II: error analysis for regularized least square regression, Neural Computation, 23 (2011), 2713–2729.
• [46] Song G, Zhang H and Hickernell FJ, Reproducing kernel Banach spaces with the ℓ¹ norm, Applied and Computational Harmonic Analysis, 34 (2013), 96–116.
• [47] Spek L, Heeringa TJ and Brune C, Duality for neural networks through reproducing kernel Banach spaces, arXiv preprint arXiv:2211.05020, 2022.
• [48] Sridharan K and Tewari A, Convex games in Banach spaces, in “Proceedings of the 23rd Annual Conference on Learning Theory”, pp. 1–13, Omnipress, 2010.
• [49] Sriperumbudur BK, Fukumizu K and Lanckriet GRG, Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint, Advances in Neural Information Processing Systems, MIT Press, Cambridge, pp. 1773–1781, 2011.
• [50] Unser M, Representer theorems for sparsity-promoting ℓ¹ regularization, IEEE Transactions on Information Theory, 62 (2016), 5167–5180.
• [51] Unser M, A unifying representer theorem for inverse problems and machine learning, Foundations of Computational Mathematics, 21 (2021), 941–960.
• [52] Wang R and Xu Y, Representer theorems in Banach spaces: Minimum norm interpolation, regularized learning and semi-discrete inverse problems, Journal of Machine Learning Research, 22 (2021), 1–65.
• [53] Wang R, Xu Y and Yan M, Representer theorems for sparse learning in Banach spaces, preprint, 2023.
• [54] Wang W, Lu S, Hofmann B and Cheng J, Tikhonov regularization with ℓ⁰-term complementing a convex penalty: ℓ¹-convergence under sparsity constraints, Journal of Inverse and Ill-Posed Problems, 27 (2019), 575–590.
• [55] Xu Y and Ye Q, Generalized Mercer kernels and reproducing kernel Banach spaces, Memoirs of the American Mathematical Society, 258 (2019), no. 1243, 122 pages.
• [56] Xu Y and Zhang H, Refinable kernels, Journal of Machine Learning Research, 8 (2007), 2083–2120.
• [57] Xu Y and Zhang H, Refinement of reproducing kernels, Journal of Machine Learning Research, 10 (2009), 107–140.
• [58] Zhang H, Xu Y and Zhang J, Reproducing kernel Banach spaces for machine learning, Journal of Machine Learning Research, 10 (2009), 2741–2775.
• [59] Zhang H, Xu Y and Zhang Q, Refinement of operator-valued reproducing kernels, Journal of Machine Learning Research, 13 (2012), 91–136.
• [60] Zhang H and Zhang J, Frames, Riesz bases, and sampling expansions in Banach spaces via semi-inner products, Applied and Computational Harmonic Analysis, 31 (2011), 1–25.
• [61] Zhang H and Zhang J, Regularized learning in Banach spaces as an optimization problem: representer theorems, Journal of Global Optimization, 54 (2012), 235–250.
• [62] Zhang H and Zhang J, Vector-valued reproducing kernel Banach spaces with applications to multi-task learning, Journal of Complexity, 29 (2013), 195–215.
