Abstract
The aim of this expository paper is to explain to graduate students and beginning researchers in mathematics, statistics and engineering the fundamental concept of sparse machine learning in Banach spaces. In particular, we use binary classification as an example to explain the essence of learning in a reproducing kernel Hilbert space and sparse learning in a reproducing kernel Banach space (RKBS). We then utilize the sequence space $\ell_1(\mathbb{N})$ to illustrate the basic concepts of the RKBS in an elementary yet rigorous fashion. This paper reviews existing results from the author's perspective to reflect the state of the art of the field of sparse learning, and includes new theoretical observations on the RKBS. Several open problems critical to the theory of the RKBS are also discussed at the end of this paper.
Keywords: sparse machine learning, reproducing kernel Banach space
AMS subject classifications: 46B45, 46N10, 90C30
1. Introduction
Most machine learning methods aim to learn a function from available data. Mathematically, we need a hypothesis space to hold the functions to be learned [13]. The choice of the hypothesis space is critical for learning outcomes. Traditionally, a reproducing kernel Hilbert space (RKHS), a Hilbert space in which the point-evaluation functionals are continuous, is chosen, due to the existence of an inner product that can be used to measure similarity in data and of continuous point-evaluation functionals that are often used to sample data.
While learning in an RKHS offers attractive features, it suffers from a major drawback. To explain it, we define the notion of sparsity. When we say that a vector or a sequence is sparse, we mean that most of its components are zero. When it is not sparse, we call it dense. We stress here that this definition of density is peculiar to the data science community, and it does not mean (topologically) dense in the usual sense. Sparsity also refers to a representation of a function under a basis when the coefficient vector is sparse. A major drawback of learning in a Hilbert space is that a learned solution tends to be dense. This feature is a consequence of the smoothness of Hilbert spaces used as hypothesis spaces. Such a dense learned solution leads to high computational costs when it is used in prediction or other decision making procedures, since once a function is learned it will be used repeatedly many times. The issue is even more severe in the context of big data analytics. For example, Google would want its search engine to be able to make instant recommendations. Due to increasingly large amounts of data and related model sizes, demands for more competent data processing models have emerged. As pointed out in [21], the future of machine learning is sparsity. It is highly desirable to learn sparse solutions, in the sense that they have substantially fewer nonzero components or fewer terms in their representation.
Most data sets have intrinsic sparsity. As a matter of fact, data that we encounter often have certain embedded sparsity structures, in the sense that if they are represented in certain ways, their intrinsic characteristics concentrate on a few scattered spots. To construct sparse solutions, we require that the norm of the hypothesis space for the learning model have a certain sparsity-promoting property.
With a sparsity-promoting norm for the hypothesis space, when a solution of a learning model is represented in an appropriate basis, the solution can have a sparse representation. It is well known that the norm of a usual Hilbert space does not promote sparsity due to its smoothness. Norms of certain nonsmooth Banach spaces, such as the 1-norm, have the ability to induce sparsity in solutions of learning methods that use these spaces as hypothesis spaces. For this reason, some Banach spaces have been used as hypothesis spaces for sparse machine learning methods. With sparse solutions, machine learning models can considerably reduce storage, computing time and communication time.
We are interested in a class of Banach spaces which are spaces of functions, because function values are useful in machine learning. In particular, we would like the point-evaluation functionals on such spaces to be continuous. This gives rise to a special class of Banach spaces: the reproducing kernel Banach spaces (RKBSs), which were introduced in [58]. During the last decade, great interest has been paid to the understanding of this class of function spaces and their uses in applications. We will discuss these function spaces and learning in them.
The goal of this expository paper is to review recent advances in sparse machine learning in Banach spaces. We will use binary classification, a basic problem in data science, as an example to motivate the notion of sparse learning in Banach spaces. In Section 2, we present a brief review of classification. Section 3 is devoted to a presentation of learning in an RKHS. In Section 4 we illustrate the theory of the RKBS by the example of the sequence space $\ell_1(\mathbb{N})$. We discuss in Section 5 learning in a Banach space. We present a numerical example in Section 6 to demonstrate a sparse classification method in a Banach space. In Section 7, we elaborate several unsolved mathematical problems related to the theory of the RKBS and sparse learning.
2. A Review of Classification
In this section, we review the classical support vector machine for binary classification which was discussed in [52].
A typical binary classification problem may be described as follows: Given two sets of points in the Euclidean space $\mathbb{R}^d$, we wish to find a surface to separate them. The simplest way to separate two sets of points is to use a hyperplane, since among all types of surfaces a hyperplane is the easiest to construct and to work with. For this purpose, we wish to construct the hyperplane

$$\langle w, x\rangle + b = 0, \quad x \in \mathbb{R}^d, \qquad (2.1)$$

where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are to be determined so that the hyperplane (2.1) has the largest distance to both of the two sets. Specifically, we let the training data be composed of input data points

$$\mathbb{X} := \{x_j : j \in \mathbb{N}_n\} \subset \mathbb{R}^d, \quad \text{where } \mathbb{N}_n := \{1, 2, \dots, n\},$$

and output data values

$$\mathbb{Y} := \{y_j : j \in \mathbb{N}_n\} \subset \{-1, +1\}.$$

We intend to find a hyperplane determined by the linear function

$$f(x) := \langle w, x\rangle + b, \quad x \in \mathbb{R}^d, \qquad (2.2)$$

which separates the training data into two groups, one with label $+1$, which corresponds to $f(x_j) > 0$, and another with label $-1$, which corresponds to $f(x_j) < 0$. The parameters $w, b$ in equation (2.2) are chosen such that the hyperplane maximizes its distance to the points in $\mathbb{X}$. This gives us a decision rule to predict the label of a new point $x$, that is,

$$y := \operatorname{sign}(\langle w, x\rangle + b).$$

This method is called the support vector machine (SVM) [12].
Specifically, as illustrated in Figure 2.1, we select two parallel hyperplanes

$$\langle w, x\rangle + b = 1 \quad \text{and} \quad \langle w, x\rangle + b = -1$$

that separate the two classes of data so that the distance between the two hyperplanes is as large as possible. The region bounded by these two hyperplanes is called the margin. We compute the distance between these two planes. Let $x_0$ be a point on the first plane. That is, $x_0$ satisfies the equation

$$\langle w, x_0\rangle + b = 1. \qquad (2.3)$$

The distance between these two planes is given by

$$\frac{2}{\|w\|_2}.$$

Maximizing this distance is equivalent to minimizing $\|w\|_2$. To prevent data points from falling into the margin, we require

$$\langle w, x_j\rangle + b \ge 1 \ \text{ if } y_j = 1, \quad \text{and} \quad \langle w, x_j\rangle + b \le -1 \ \text{ if } y_j = -1, \quad j \in \mathbb{N}_n.$$

These constraints state that each data point must lie on the correct side of the margin, and they may be rewritten as

$$y_j\big(\langle w, x_j\rangle + b\big) \ge 1, \quad j \in \mathbb{N}_n. \qquad (2.4)$$

The pair $(w, b)$ is solved from the constrained minimization problem

$$\min\left\{\tfrac{1}{2}\|w\|_2^2 : w \in \mathbb{R}^d, \ b \in \mathbb{R}\right\} \qquad (2.5)$$

subject to (2.4). Let

$$C := \left\{(w, b) \in \mathbb{R}^d \times \mathbb{R} : y_j\big(\langle w, x_j\rangle + b\big) \ge 1, \ j \in \mathbb{N}_n\right\}. \qquad (2.6)$$

It can be verified that $C$ is a convex set. The minimization problem (2.5) with constraints (2.4) may be rewritten as

$$\min\left\{\tfrac{1}{2}\|w\|_2^2 : (w, b) \in C\right\}, \qquad (2.7)$$

where the constraint set $C$ is defined by (2.6). Model (2.7) is called the hard margin support vector machine (SVM), which is a constrained minimization problem.
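To make the hard margin model (2.7) concrete, the following minimal sketch solves it as a small constrained quadratic program. The solver choice (SciPy's SLSQP), the toy data and the variable names are illustrative assumptions and are not taken from the paper.

```python
# A minimal sketch of the hard margin SVM (2.7) solved as a constrained QP.
import numpy as np
from scipy.optimize import minimize

# Two linearly separable point clouds in R^2 with labels +1 and -1 (toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)), rng.normal([-2, -2], 0.3, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

def objective(p):
    w = p[:-1]
    return 0.5 * np.dot(w, w)            # (1/2)||w||_2^2

def margin_constraints(p):
    w, b = p[:-1], p[-1]
    return y * (X @ w + b) - 1.0          # y_j(<w, x_j> + b) - 1 >= 0, see (2.4)

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin width =", 2.0 / np.linalg.norm(w))
```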
Figure 2.1: Classification (Courtesy: Wikipedia).
It is computationally advantageous to reformulate the constrained minimization problem (2.7) as an unconstrained one. This leads to the soft margin SVM model. Specifically, we rewrite the constraints (2.4) as

$$1 - y_j\big(\langle w, x_j\rangle + b\big) \le 0, \quad j \in \mathbb{N}_n, \qquad (2.8)$$

and make use of the ReLU (Rectified Linear Unit) function

$$\mathrm{ReLU}(t) := \max\{t, 0\}, \quad t \in \mathbb{R},$$

to define the hinge loss functions

$$h_j(w, b) := \mathrm{ReLU}\big(1 - y_j(\langle w, x_j\rangle + b)\big), \quad j \in \mathbb{N}_n.$$

When (2.8) is satisfied, we have that $h_j(w, b) = 0$, and when (2.8) is not satisfied, we have that

$$h_j(w, b) = 1 - y_j\big(\langle w, x_j\rangle + b\big) > 0,$$

which is the case that we would like to prevent from occurring. For this reason, we would like to minimize the nonnegative fidelity term

$$\sum_{j \in \mathbb{N}_n} \mathrm{ReLU}\big(1 - y_j(\langle w, x_j\rangle + b)\big).$$

Combining this with the minimization (2.5) leads to the regularization problem

$$\min\left\{\sum_{j \in \mathbb{N}_n} \mathrm{ReLU}\big(1 - y_j(\langle w, x_j\rangle + b)\big) + \lambda \|w\|_2^2 : w \in \mathbb{R}^d, \ b \in \mathbb{R}\right\}, \qquad (2.9)$$

where $\lambda > 0$ is a regularization parameter. Regularization problem (2.9) is called the soft margin SVM. Upon solving minimization problem (2.9), we obtain the pair $(w, b)$, which defines the function $f$ used for classification.
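As a minimal illustration of the soft margin model (2.9), the following sketch minimizes the hinge loss plus $\lambda\|w\|_2^2$ by plain subgradient descent. The toy data, step size and iteration count are illustrative assumptions.

```python
# A minimal sketch of the soft margin SVM (2.9) by subgradient descent.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1.5, 1.5], 0.8, (30, 2)), rng.normal([-1.5, -1.5], 0.8, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

lam, step, n_iter = 0.1, 0.01, 2000
w, b = np.zeros(2), 0.0
for _ in range(n_iter):
    margins = y * (X @ w + b)
    active = margins < 1.0                      # points violating the margin
    # Subgradient of sum_j max(0, 1 - y_j(<w,x_j>+b)) + lam*||w||_2^2.
    grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2.0 * lam * w
    grad_b = -y[active].sum()
    w -= step * grad_w
    b -= step * grad_b

pred = np.sign(X @ w + b)
print("training accuracy:", np.mean(pred == y))
```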
3. Learning in Reproducing Kernel Hilbert Spaces
We discuss in this section the notion of feature maps in machine learning. In particular, we show the necessity of introducing feature maps in machine learning, which leads to learning in an RKHS.
We first motivate the introduction of feature maps in machine learning by returning to the classification problem reviewed in the last section. When the given data sets can be separated by a hyperplane, the hard/soft margin SVM discussed in the last section will do a reasonably good job of classification. However, in most cases of practical applications, it is not possible to separate two sets of points entirely by a hyperplane in the same space, and certain misclassification may occur. In general, we might allow a low degree of misclassification but cannot tolerate a high degree of misclassification. We are thus required to separate the two sets with a low degree of misclassification.
One way to alleviate misclassification is to map the data sets to a higher (or even an infinite) dimensional space and perform classification in the new space. Note that sparse data sets are easier to separate by a hyperplane with a low degree of misclassification. Non-sparse data sets in a lower dimensional space may be made sparse if they are projected to a higher or even an infinite dimensional space by an appropriate map. A natural idea is therefore to find a feature map $\Phi$ that maps lower dimensional data sets into a higher dimensional space to gain sparsity for the resulting data sets. The sparsity of the mapped data sets in the higher dimensional space can significantly reduce the degree of misclassification. In other words, the feature map transfers the classification problem in a lower dimensional space to one in a higher dimensional space, where it is possible to separate the mapped point sets by a "hyperplane" with a lower degree of misclassification. This is illustrated in Figure 3.2. A crucial issue is the choice of the feature map $\Phi$. From a purely theoretical standpoint, "any" function is a feature map. The choice of the feature map depends largely on specific applications. Feature maps lead to the notion of kernel based learning, or learning in RKHSs.
Figure 3.2: An illustration of data mapping by a feature map.
We next demonstrate how a feature map defines a kernel. A feature map has a well-known remarkable property, which we show below. We recall that $x_j$, $j \in \mathbb{N}_n$, are the data points.

Proposition 3.1 Let $\mathbb{W}$ be a real Hilbert space with inner product $\langle\cdot,\cdot\rangle_{\mathbb{W}}$. If $\Phi: \mathbb{R}^d \to \mathbb{W}$, then for all $n \in \mathbb{N}$ and all $x_j \in \mathbb{R}^d$, $j \in \mathbb{N}_n$, the matrices

$$G := \big[\langle \Phi(x_j), \Phi(x_k)\rangle_{\mathbb{W}}\big]_{j,k \in \mathbb{N}_n}$$

are positive semi-definite.

Proof: For all $n \in \mathbb{N}$, suppose that $x_j \in \mathbb{R}^d$, $j \in \mathbb{N}_n$, are arbitrarily given. Let $c := [c_j : j \in \mathbb{N}_n] \in \mathbb{R}^n$ be arbitrary. There holds

$$c^\top G c = \sum_{j \in \mathbb{N}_n}\sum_{k \in \mathbb{N}_n} c_j c_k \langle \Phi(x_j), \Phi(x_k)\rangle_{\mathbb{W}} = \Big\|\sum_{j \in \mathbb{N}_n} c_j \Phi(x_j)\Big\|_{\mathbb{W}}^2 \ge 0.$$

This confirms that the matrices $G$ are positive semi-definite. □
The feature map $\Phi$ naturally leads to a function

$$K(x, y) := \langle \Phi(x), \Phi(y)\rangle_{\mathbb{W}}, \quad x, y \in \mathbb{R}^d, \qquad (3.1)$$

which is a kernel according to Proposition 3.1 and the following definition.
Definition 3.2 Let $X$ be a set. A function $K: X \times X \to \mathbb{R}$ is called a kernel if it is symmetric and positive semi-definite, that is, $K(x, y) = K(y, x)$ for all $x, y \in X$ and, for all $n \in \mathbb{N}$, $x_j \in X$ and $c_j \in \mathbb{R}$, $j \in \mathbb{N}_n$,

$$\sum_{j \in \mathbb{N}_n}\sum_{k \in \mathbb{N}_n} c_j c_k K(x_j, x_k) \ge 0.$$

Kernels were studied over 100 years ago in the context of integral equations [29]; see also [4]. Kernels can define distances of data sets to measure their similarity.
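The following small numerical check illustrates Proposition 3.1 and Definition 3.2: a Gram matrix built from inner products of feature vectors is symmetric and positive semi-definite. The particular feature map used here (low-order monomials) is an illustrative assumption.

```python
# Numerical check that a feature-map Gram matrix is symmetric and PSD.
import numpy as np

def phi(x):
    # Feature map R^2 -> R^5 made of low-order monomials.
    return np.array([1.0, x[0], x[1], x[0] * x[1], x[0] ** 2 + x[1] ** 2])

rng = np.random.default_rng(2)
points = rng.normal(size=(8, 2))
features = np.array([phi(x) for x in points])
G = features @ features.T                  # G[j, k] = <Phi(x_j), Phi(x_k)>

print("symmetric:", np.allclose(G, G.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())  # >= 0 up to round-off
```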
Function values are often used in machine learning. Hence, it would be desirable to require that the Hilbert space of functions defined on $X$ that we work with has the property that the point-evaluation functionals are continuous on the space. Precisely, we have the following definition.

Definition 3.3 A Hilbert space $\mathcal{H}$ of functions defined on $X$ is called an RKHS if every point-evaluation functional $\delta_x(f) := f(x)$, $x \in X$, is continuous, that is, there exists a constant $C_x > 0$ such that for all $f \in \mathcal{H}$,

$$|f(x)| \le C_x \|f\|_{\mathcal{H}}.$$
We now illustrate Definition 3.3 by a simplest example. To this end, we consider the space $\ell_2(\mathbb{N})$ of real sequences $x := [x_j : j \in \mathbb{N}]$ such that $\sum_{j \in \mathbb{N}} x_j^2 < \infty$. Clearly, elements of $\ell_2(\mathbb{N})$ are functions defined on the set $\mathbb{N}$ and $\ell_2(\mathbb{N})$ is a Hilbert space, and thus it is a Hilbert space of functions on $\mathbb{N}$. Moreover, $\ell_2(\mathbb{N})$ is isometrically isomorphic to its dual space. The point-evaluation functional has the form $\delta_j(x) = x_j$, for $x \in \ell_2(\mathbb{N})$ and $j \in \mathbb{N}$. It follows that

$$|\delta_j(x)| = |x_j| \le \|x\|_2,$$

with the constant $C_j$ in Definition 3.3 being $1$, for all $j \in \mathbb{N}$. Therefore, according to Definition 3.3, $\ell_2(\mathbb{N})$ is an RKHS on $\mathbb{N}$ with the kernel $K(j, k) := \delta_{j,k}$, where $\delta_{j,k} := 1$ if $j = k$ and $0$ otherwise, for $j, k \in \mathbb{N}$.
Immediately from Definition 3.3, we observe that the closeness of two functions in an RKHS in the norm of the space implies their pointwise closeness.
An RKHS is associated with a kernel. The following well-known result can be found in [1].
Theorem 3.4 If a Hilbert space $\mathcal{H}$ of functions on $X$ is an RKHS, then there exists a unique kernel $K: X \times X \to \mathbb{R}$ such that $K(x, \cdot) \in \mathcal{H}$ for all $x \in X$, and for all $f \in \mathcal{H}$ and $x \in X$,

$$f(x) = \langle f, K(x, \cdot)\rangle_{\mathcal{H}}. \qquad (3.2)$$

Conversely, if $K$ is a kernel, then there exists a unique RKHS $\mathcal{H}$ on $X$ such that $K(x, \cdot) \in \mathcal{H}$ for all $x \in X$, and for all $f \in \mathcal{H}$ and $x \in X$, the reproducing property (3.2) holds true.
Proof: Suppose that $\mathcal{H}$ is an RKHS of functions on $X$. Since the point-evaluation functional $\delta_x$, for each $x \in X$, is continuous, by the Riesz representation theorem there exists $g_x \in \mathcal{H}$ such that $\delta_x(f) = \langle f, g_x\rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. For each $x \in X$, we define a function by $K(x, y) := g_x(y)$, for $y \in X$. Clearly, we have that $K(x, \cdot) = g_x \in \mathcal{H}$ for all $x \in X$, and (3.2) holds true. The uniqueness of such a function can be verified. It remains to show that $K$ is a kernel. The symmetry of $K$ can be seen from the computation

$$K(x, y) = g_x(y) = \langle g_x, g_y\rangle_{\mathcal{H}} = \langle g_y, g_x\rangle_{\mathcal{H}} = g_y(x) = K(y, x), \quad x, y \in X.$$

It remains to prove that it is positive semi-definite. For any $n \in \mathbb{N}$, $x_j \in X$ and $c_j \in \mathbb{R}$, $j \in \mathbb{N}_n$, we let

$$f := \sum_{j \in \mathbb{N}_n} c_j K(x_j, \cdot).$$

By the reproducing property and symmetry of the kernel, we have that

$$\sum_{j \in \mathbb{N}_n}\sum_{k \in \mathbb{N}_n} c_j c_k K(x_j, x_k) = \langle f, f\rangle_{\mathcal{H}} = \|f\|_{\mathcal{H}}^2 \ge 0.$$

Thus, $K$ is a kernel. We omit the proof of the converse. □
Theorem 3.4 clearly states that every function in the RKHS can be reproduced by the underlying kernel. Equation (3.2) is called the reproducing property. For this reason, the kernel is also called the reproducing kernel. Moreover, the kernel (3.1) defined by a feature map determines an RKHS $\mathcal{H}_K$. The mapped RKHS is more complex than the original space, and thus by using it as a hypothesis space of a learning method we can better represent the learned function. We expect that learning outcomes in this space will be more accurate than those in the original Euclidean space. In particular, data sets mapped from $\mathbb{R}^d$ to the feature space via the feature map are likely more sparse than the original data sets, and thus classification of the mapped data sets may result in substantially less misclassification.
Two of the most popular kernels in machine learning are the polynomial kernel of degree $p$,

$$K(x, y) := (\langle x, y\rangle + c)^p, \quad x, y \in \mathbb{R}^d, \qquad (3.3)$$

where $c \ge 0$ is a constant, and the Gaussian kernel

$$K(x, y) := \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right), \quad x, y \in \mathbb{R}^d, \qquad (3.4)$$

where $\sigma > 0$ is a width parameter.
Google Scholar shows that the term "polynomial kernel" is used in titles of 53,900 articles and the term "Gaussian kernel" is used in titles of 649,000 articles. The Gaussian kernel is widely used in applications due to its remarkable flexibility in fitting data. In Figure 3.3, we illustrate that by altering the parameter $\sigma$ we can control the shape of the graph of the Gaussian kernel.
Figure 3.3: Graphs of the Gaussian kernel for two different values of the parameter $\sigma$ (left and right).
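For readers who wish to experiment, the following sketch evaluates the polynomial kernel (3.3) and the Gaussian kernel (3.4) on point sets stored as rows of arrays. The parameter values $c$, $p$ and $\sigma$ are illustrative assumptions.

```python
# Sketches of the polynomial kernel (3.3) and the Gaussian kernel (3.4).
import numpy as np

def polynomial_kernel(X, Y, c=1.0, p=3):
    # K(x, y) = (<x, y> + c)^p
    return (X @ Y.T + c) ** p

def gaussian_kernel(X, Y, sigma=1.0):
    # K(x, y) = exp(-||x - y||_2^2 / (2 sigma^2)); smaller sigma gives a narrower bump.
    sq_dist = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq_dist / (2.0 * sigma**2))

X = np.random.default_rng(3).normal(size=(5, 2))
print(gaussian_kernel(X, X, sigma=0.5).round(3))
print(polynomial_kernel(X, X).round(3))
```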
Following the above discussion, we map the data points $x_j$, $j \in \mathbb{N}_n$, to the RKHS $\mathcal{H}_K$ via the feature map and seek a function $f \in \mathcal{H}_K$ that defines a decision rule for classification. This discussion leads us to extend the regularization problem (2.9) for classification to

$$\min\left\{\sum_{j \in \mathbb{N}_n} \mathrm{ReLU}\big(1 - y_j f(x_j)\big) + \lambda \|f\|_{\mathcal{H}_K}^2 : f \in \mathcal{H}_K\right\} \qquad (3.5)$$

to determine a decision function $f$ in the RKHS $\mathcal{H}_K$. It follows from the reproducing property that

$$f(x_j) = \langle f, K(x_j, \cdot)\rangle_{\mathcal{H}_K}, \quad j \in \mathbb{N}_n.$$

Thus, the regularization problem (3.5) for classification may be rewritten as

$$\min\left\{\sum_{j \in \mathbb{N}_n} \mathrm{ReLU}\big(1 - y_j \langle f, K(x_j, \cdot)\rangle_{\mathcal{H}_K}\big) + \lambda \|f\|_{\mathcal{H}_K}^2 : f \in \mathcal{H}_K\right\}. \qquad (3.6)$$

Clearly, the regularization problem (2.9) with $b = 0$ is a special case of (3.6). To see this, we introduce the reproducing kernel

$$K(x, y) := \langle x, y\rangle, \quad x, y \in \mathbb{R}^d.$$

The corresponding RKHS has the form $\mathcal{H}_K = \{\langle w, \cdot\rangle : w \in \mathbb{R}^d\}$, with $\langle \langle w, \cdot\rangle, \langle v, \cdot\rangle\rangle_{\mathcal{H}_K} := \langle w, v\rangle$, for all $w, v \in \mathbb{R}^d$, as its inner product. According to the reproducing property, for each $f \in \mathcal{H}_K$ with $f = \langle w, \cdot\rangle$, we deduce that $f(x_j) = \langle w, x_j\rangle$, for all $j \in \mathbb{N}_n$, and $\|f\|_{\mathcal{H}_K} = \|w\|_2$. Thus, problem (3.6) reduces to (2.9) with $b = 0$.
The above discussion of classification motivates us to consider learning methods in RKHSs. For example, given a pair of data sets $\mathbb{X} := \{x_j : j \in \mathbb{N}_n\} \subset \mathbb{R}^d$ and $\mathbb{Y} := \{y_j : j \in \mathbb{N}_n\} \subset \mathbb{R}$, we wish to learn a function from an RKHS $\mathcal{H}_K$. This learning method may be described as the minimization problem

$$\min\left\{\mathcal{Q}\big(\{f(x_j) : j \in \mathbb{N}_n\}\big) + \lambda \|f\|_{\mathcal{H}_K}^2 : f \in \mathcal{H}_K\right\}, \qquad (3.7)$$

where $\mathcal{Q}$ denotes the empirical fidelity term, which measures the closeness of a learned solution to the given data, and $\lambda > 0$ is a regularization parameter. The learning model (3.7) covers many learning problems, including minimal norm interpolation, regularization networks, support vector machines, kernel principal component analysis, kernel regression and deep regularized learning. Moreover, the recent work [33] established a framework in which distributions are mapped into an RKHS so that kernel methods can be extended to probability measures.
A great success of learning in Hilbert spaces has been achieved. At the same time, the theory of the RKHS has undergone coordinated development [19, 56, 57, 59]. In particular, motivated by the need to develop multiscale bases for an RKHS, refinable kernels were investigated in [56, 57, 59]. A mathematical foundation of learning in Hilbert spaces may be found in [13].
Two fundamental mathematical problems were raised regarding learning in an RKHS. The first one regards the form of a solution of the learning problem (3.7), which seeks a solution in the infinite dimensional space $\mathcal{H}_K$. The second is about the approximation property of the learning methods. Does the learned solution converge to the "true solution" of the practical problem as data points become dense and their number tends to infinity?
The answer to the first fundamental problem is the celebrated representer theorem [40]. A solution of the learning problem (3.7) can be represented as a linear combination of a finite number of kernel sections, that is, the kernel evaluated at the input training data points, namely,

$$K(x_j, \cdot), \quad j \in \mathbb{N}_n. \qquad (3.8)$$

That is, a solution of the learning problem (3.7) has the form

$$f = \sum_{j \in \mathbb{N}_n} c_j K(x_j, \cdot) \qquad (3.9)$$

for some suitable parameters $c_j \in \mathbb{R}$, $j \in \mathbb{N}_n$, where $K$ is the kernel associated with the space $\mathcal{H}_K$. Here the integer $n$ is the number of the data points involved in the fidelity term of (3.7). This result is called the representer theorem for the learning method (3.7). The most remarkable feature of the representer theorem lies in the finite number of kernel sections that represent a learned solution in the infinite dimensional space $\mathcal{H}_K$. In other words, even though we seek a learned solution in the infinite dimensional space hoping that it will provide much more richness in representing a learned solution, the learned solution is in fact located in the finite dimensional subspace of $\mathcal{H}_K$ spanned by the kernel sections (3.8). Although the modern form of the representer theorem was found in [40], its origin dates back to [5] and [23], which appeared fifty years ago.
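The following sketch illustrates the representer theorem (3.9) for a simple instance of (3.7), regularized least squares (kernel ridge regression) with a Gaussian kernel: the learned function is a finite linear combination of kernel sections, and its coefficients solve a finite dimensional linear system. The data and parameter values are illustrative assumptions.

```python
# Kernel ridge regression: a concrete instance of the representer theorem (3.9).
import numpy as np

def gauss(x, z, sigma=0.3):
    return np.exp(-(x - z) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
x_train = np.linspace(0.0, 1.0, 15)
y_train = np.sin(2.0 * np.pi * x_train) + 0.1 * rng.normal(size=15)

lam = 1e-3
K = gauss(x_train[:, None], x_train[None, :])        # kernel matrix [K(x_j, x_k)]
c = np.linalg.solve(K + lam * np.eye(15), y_train)    # representer coefficients c_j

def f(x):
    # Learned function: a finite linear combination of kernel sections K(x_j, .).
    return gauss(np.atleast_1d(x)[:, None], x_train[None, :]) @ c

print("max error at the training points:", np.max(np.abs(f(x_train) - y_train)))
```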
An answer to the second fundamental problem for the learning method (3.7) is the universality of a kernel. According to the representer theorem (3.9), a solution of the learning method (3.7) is expressed as a linear combination of a finite number of kernel sections. This motivates us to ask the following question: Can a continuous function on a compact set be approximated arbitrarily well by kernel sections as the data points become dense in the set? It is well known from the Weierstrass theorem in real analysis [38] that a continuous function on a compact domain can be approximated arbitrarily well by polynomials. However, this is not always true when polynomials are replaced by kernel sections of an arbitrary kernel. To explain this, we assume that the input space $X$ is a Hausdorff topological space and that all kernels to be considered are continuous on $X \times X$. Moreover, we let $\mathcal{Z}$ be an arbitrarily fixed compact subset of $X$ and let $C(\mathcal{Z})$ denote the space of all continuous functions from $\mathcal{Z}$ to $\mathbb{R}$ or $\mathbb{C}$, equipped with the maximum norm $\|\cdot\|_\infty$. For simplicity, we restrict our discussion to real-valued functions.
We first examine the polynomial kernel (3.3). We have the following negative result.
Proposition 3.5 Let $K$ be a polynomial kernel. There exists a continuous function defined on a compact set which cannot be approximated arbitrarily closely in the uniform norm by any linear combination of the kernel sections $K(x_j, \cdot)$, $j \in \mathbb{N}_n$, for any $n \in \mathbb{N}$ and any choice of the points $x_j$.

Proof: Since a polynomial kernel has the form (3.3), we will construct a function $f$ so that $f$ cannot be approximated arbitrarily closely in the uniform norm by

$$g := \sum_{j \in \mathbb{N}_n} c_j K(x_j, \cdot) \qquad (3.10)$$

for any positive integer $n$ and any choice of the coefficients $c_j$ and points $x_j$.

In fact, for simplicity, we choose $f$ as a polynomial of degree $p + 1$ with leading coefficient 1. No matter how large the integer $n$ is and no matter what coefficients and points are chosen, the function $g$ defined by equation (3.10) is a polynomial of degree at most $p$. The assertion is proved by the fact that a nondegenerate polynomial of degree $p + 1$ cannot be approximated arbitrarily closely in the uniform norm by a polynomial of degree $p$. For example, this statement can be verified in a special low-degree case by the characterization of the best uniform approximation by linear polynomials [36]. Hence, no matter how large the positive integer $n$ is chosen, and no matter what points $x_j$, $j \in \mathbb{N}_n$, are chosen, the corresponding approximation error remains bounded away from zero. □
This example motivates us to introduce in [32] the notion of universality of a kernel. We now describe the definition of a universal kernel. Given a kernel $K$, we introduce for each $x \in X$ the function $K_x$ defined at every $y \in X$ by the equation $K_x(y) := K(x, y)$, and form the space of kernel sections $\mathcal{K}(\mathcal{Z}) := \overline{\mathrm{span}}\,\{K_x : x \in \mathcal{Z}\}$. The set $\mathcal{K}(\mathcal{Z})$ consists of all functions in $C(\mathcal{Z})$ which are uniform limits of functions of the form (3.9) with $x_j \in \mathcal{Z}$. We say that a kernel is universal if for any prescribed compact subset $\mathcal{Z}$ of $X$, any positive number $\epsilon$ and any function $f \in C(\mathcal{Z})$ there is a function $g \in \mathcal{K}(\mathcal{Z})$ such that

$$\|f - g\|_\infty \le \epsilon.$$
A characterization of a universal kernel was originally established in [32]. Moreover, convenient sufficient conditions of universal kernels were given there and several classes of commonly used kernels including the Gaussian kernel defined by equation (3.4) were shown to be universal. However, it follows from the definition of the universal kernel and Proposition 3.5 that the polynomial kernels are not universal.
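The contrast between universal and non-universal kernels can be observed numerically. The sketch below fits the target $f(x) = |x|$ on $[-1, 1]$ by least squares with linear combinations of kernel sections of a Gaussian kernel and of a degree-one polynomial kernel; the grid sizes and kernel parameters are illustrative assumptions, and the experiment is only suggestive of the approximation behavior, not a proof.

```python
# Approximating |x| on [-1, 1] by kernel sections: Gaussian vs degree-1 polynomial.
import numpy as np

centers = np.linspace(-1.0, 1.0, 25)       # points x_j at which kernel sections are taken
grid = np.linspace(-1.0, 1.0, 401)         # where the approximation error is measured
target = np.abs(grid)

def fit_error(kernel):
    A = kernel(grid[:, None], centers[None, :])          # columns are kernel sections
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.max(np.abs(A @ coef - target))

gauss = lambda x, z: np.exp(-(x - z) ** 2 / (2 * 0.1 ** 2))
poly1 = lambda x, z: x * z + 1.0                         # degree-1 polynomial kernel

print("Gaussian sections, max error:", fit_error(gauss))             # small
print("degree-1 polynomial sections, max error:", fit_error(poly1))  # stuck near best affine fit
```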
Although a solution of a learning method in an RKHS has nice features, it suffers from denseness in its representation. To illustrate this point, we consider a simple minimum norm interpolation problem in $\ell_2(\mathbb{N})$. We seek $x \in \ell_2(\mathbb{N})$ such that
| (3.11) |
where
| (3.12) |
According to [10], the solution of problem (3.11)-(3.12) is given by
| (3.13) |
Clearly, every component of the solution is nonzero and thus, the solution is dense.
In the context of big data analytics, the dimension of data is large. It is desirable to obtain a sparse solution, that is, a solution with a substantial number of zero components. We next illustrate the intrinsic characteristic of a space that may or may not lead to a sparse solution. To this end, we consider a learning problem in the Hilbert space $\ell_2$ and in the Banach space $\ell_1$. Figures 3.4 and 3.5 illustrate that minimum norm interpolations in the $\ell_1$ space are sparse but in the $\ell_2$ space they are dense. This is because the geometry of the unit balls of these two norms is different: the unit balls of the $\ell_2$-norm are smooth, which does not promote sparsity, while those of the $\ell_1$-norm have corners at the coordinate axes, which promote sparsity.
Figure 3.4: The $\ell_1$-norm vs the $\ell_2$-norm: a two-dimensional illustration.

Figure 3.5: The $\ell_1$-norm vs the $\ell_2$-norm: a three-dimensional illustration.
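The geometric contrast between the two norms can also be seen computationally. The sketch below compares minimum norm interpolation in $\ell_2$ and in $\ell_1$ for a small underdetermined linear system: the $\ell_2$ solution is computed from the pseudoinverse, and the $\ell_1$ solution from the standard linear programming reformulation. The matrix and right-hand side are randomly generated illustrative assumptions.

```python
# Minimum norm interpolation: ell_2 (dense) vs ell_1 (sparse) for A x = b.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
m, d = 4, 12
A = rng.normal(size=(m, d))
b = rng.normal(size=m)

x_l2 = np.linalg.pinv(A) @ b                 # argmin ||x||_2 subject to A x = b

# argmin ||x||_1 subject to A x = b:
# minimize sum(u) + sum(v) with A(u - v) = b, u >= 0, v >= 0, then x = u - v.
c = np.ones(2 * d)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * d), method="highs")
x_l1 = res.x[:d] - res.x[d:]

print("nonzeros, ell_2 solution:", np.sum(np.abs(x_l2) > 1e-8))   # typically all d
print("nonzeros, ell_1 solution:", np.sum(np.abs(x_l1) > 1e-8))   # typically at most m
```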
We now return to the classification problem discussed in Section 2. To obtain a sparse solution, we replace the $\ell_2$-norm in (2.9) by the $\ell_1$-norm. This gives rise to the soft margin SVM model

$$\min\left\{\sum_{j \in \mathbb{N}_n} \mathrm{ReLU}\big(1 - y_j(\langle w, x_j\rangle + b)\big) + \lambda \|w\|_1 : w \in \mathbb{R}^d, \ b \in \mathbb{R}\right\}. \qquad (3.14)$$
To close this section, we summarize the pros and cons of using RKHSs for machine learning. The pros include:
The canonical inner product of an RKHS provides a convenient approach to defining a measure for comparison which is an inevitable operation in machine learning.
A representer theorem of a machine learning method in an RKHS reduces an infinite dimensional problem to a finite dimensional problem, with a canonical finite dimensional space determined by the training data.
Universal kernels guarantee the approximation property.
The cons mainly concern the following aspect:
Learning methods in an RKHS result in dense solutions. It is computationally expensive to use dense solutions in prediction and other practical applications.
4. Reproducing Kernel Banach Spaces for Machine Learning
The discussion in the previous section regarding the denseness of a learned solution in a Hilbert space leads us to explore Banach spaces as hypothesis spaces for machine learning, in the hope of gaining sparsity for learned solutions in such spaces. This is because the class of Banach spaces, which includes Hilbert spaces as special cases, offers more choices for a hypothesis space for a machine learning method. In particular, certain Banach spaces with special geometric features may lead to sparse learning solutions. Aiming at developing sparse learning methods, the notion of the RKBS was first introduced in [58] in 2009. In this section, we review briefly the development of the RKBS during the last decade.
As pointed out earlier, the aim of most machine learning methods is to construct functions whose values can be used for prediction or other decision making purposes. For this reason, it would be desirable to consider a Banach space of functions defined over a prescribed set as our hypothesis space for learning. A Banach space $\mathcal{B}$ is called a space of functions on a prescribed set $X$ if $\mathcal{B}$ is composed of functions defined on $X$ and, for each $f \in \mathcal{B}$, $\|f\|_{\mathcal{B}} = 0$ implies that $f(x) = 0$ for all $x \in X$. This implies that in a Banach space of functions, the pointwise function evaluation $f(x)$ is well-defined for all $f \in \mathcal{B}$ and all $x \in X$. The standard $L_p$ spaces are Banach function spaces, but not Banach spaces of functions, since their elements are equivalence classes rather than functions, and the pointwise function evaluation is not defined in these spaces. Because function values are crucial in decision making, we would expect them to be stable with respect to functions chosen in the space. For this purpose, we would prefer the point-evaluation functionals, defined by

$$\delta_x(f) := f(x), \quad f \in \mathcal{B}, \ x \in X,$$

to be continuous with respect to functions in the space $\mathcal{B}$. This is exactly the way in which the RKHS was defined [1]. This consideration gives rise to the following definition of the RKBS.
Definition 4.1 A Banach space $\mathcal{B}$ of functions defined on a prescribed set $X$ is called an RKBS if the point-evaluation functionals $\delta_x$, for all $x \in X$, are continuous on $\mathcal{B}$, that is, for each $x \in X$ there exists a constant $C_x > 0$ such that

$$|f(x)| \le C_x \|f\|_{\mathcal{B}}, \quad \text{for all } f \in \mathcal{B}.$$
The original definition of an RKBS was given in Definition 1 of [59] in a somewhat restricted version, where $\mathcal{B}$ was assumed to be reflexive; see [27] for a further generalization. In Definition 4.1, we do not have this restriction. It follows from Definition 4.1 that if $\mathcal{B}$ is an RKBS on $X$ and $f_n, f \in \mathcal{B}$, for $n \in \mathbb{N}$, then $\|f_n - f\|_{\mathcal{B}} \to 0$ implies $f_n(x) \to f(x)$, as $n \to \infty$, for each $x \in X$.
In the RKHS setting, due to the well-known Riesz representation theorem, one can identify a function in the same space to represent each point-evaluation functional. That is, a Hilbert space is isometrically isomorphic to its dual space, by which we refer to the space of all continuous linear functionals on the Hilbert space. This naturally leads to the unique reproducing kernel associated with the RKHS. The kernel in conjunction with the inner product of the Hilbert space yields the reproducing property. However, in general, a Banach space is not isometrically isomorphic to its dual space. Moreover, continuity of the point-evaluation functionals does not guarantee the existence of a kernel. For this reason, we need to pay special attention to the notion of the reproducing kernel for an RKBS.
For a Banach space $\mathcal{B}$, we denote by $\mathcal{B}^*$ its dual space, the space of all continuous linear functionals on $\mathcal{B}$. It is well known that the dual space of a Banach space is again a Banach space. Definition 4.1 of the RKBS ensures that when $\mathcal{B}$ is an RKBS on $X$, the point-evaluation functionals satisfy

$$\delta_x \in \mathcal{B}^*, \quad \text{for all } x \in X. \qquad (4.1)$$

When $\mathcal{B}$ is an RKBS on $X$, we let $\mathcal{B}^\delta$ denote the completion of the linear span of all the point-evaluation functionals $\delta_x$ on $\mathcal{B}$, for $x \in X$, under the norm of $\mathcal{B}^*$. It follows from (4.1) that $\mathrm{span}\{\delta_x : x \in X\} \subseteq \mathcal{B}^*$. Thus, $\mathcal{B}^\delta \subseteq \mathcal{B}^*$, since $\mathcal{B}^*$ is complete. Moreover, $\mathcal{B}^\delta$ is the smallest Banach space that contains all point-evaluation functionals on $\mathcal{B}$. We will call $\mathcal{B}^\delta$ the $\delta$-dual space of $\mathcal{B}$. Because in machine learning we are interested in Banach spaces of functions, we further suppose that the $\delta$-dual space $\mathcal{B}^\delta$ is isometrically isomorphic to a Banach space $\mathcal{B}'$ of functions on a set $X'$. This hypothesis is satisfied for most examples that we encounter in applications. For instance, the space $\ell_1(\mathbb{N})$ that we will discuss later in this section has this property. In the rest of this paper, we will not distinguish $\mathcal{B}^\delta$ and $\mathcal{B}'$.
For a Banach space $\mathcal{B}$ of functions defined on the set $X$, we let $\langle \cdot, \cdot\rangle_{\mathcal{B} \times \mathcal{B}'}$ denote the dual bilinear form on $\mathcal{B} \times \mathcal{B}'$ induced by the restriction of the dual bilinear form on $\mathcal{B} \times \mathcal{B}^*$ to $\mathcal{B} \times \mathcal{B}^\delta$. When there is no ambiguity, we will write it as $\langle \cdot, \cdot\rangle$. We now define a reproducing kernel, which provides a closed-form function representation of the point-evaluation functionals, for an RKBS.

Definition 4.2 Let $\mathcal{B}$ be an RKBS on a set $X$ with the $\delta$-dual space $\mathcal{B}^\delta$. Suppose that $\mathcal{B}^\delta$ is isometrically isomorphic to a Banach space $\mathcal{B}'$ of functions on a set $X'$. A function $K: X \times X' \to \mathbb{R}$ is called a reproducing kernel for $\mathcal{B}$ if $K(x, \cdot) \in \mathcal{B}'$ for all $x \in X$, and

$$f(x) = \langle f, K(x, \cdot)\rangle, \quad \text{for all } f \in \mathcal{B}, \ x \in X. \qquad (4.2)$$

If in addition $\mathcal{B}'$ is an RKBS on $X'$, $K(\cdot, x') \in \mathcal{B}$ for all $x' \in X'$, and

$$g(x') = \langle K(\cdot, x'), g\rangle, \quad \text{for all } g \in \mathcal{B}', \ x' \in X', \qquad (4.3)$$

we call $\mathcal{B}'$ an adjoint RKBS of $\mathcal{B}$. In this case, the function

$$K'(x', x) := K(x, x'), \quad x' \in X', \ x \in X,$$

is a reproducing kernel for $\mathcal{B}'$, and we call $\mathcal{B}$, $\mathcal{B}'$ a pair of RKBSs.
The original definition of a reproducing kernel appeared in [58] with a choice , which was extended in [55]. Also, see [46, 49] for further development.
Equation (4.2) furnishes a representation of the point-evaluation functional in terms of the kernel and the dual bilinear form. The present form in Definition 4.2 of a reproducing kernel for an RKBS is slightly different from the various forms known in the literature. We feel that the form described in Definition 4.2 better captures the essence of reproducing kernels. Unlike a reproducing kernel for an RKHS, a reproducing kernel for an RKBS is not necessarily symmetric or positive semi-definite. In a special case when an RKBS satisfies the following hypothesis, we can establish the positive semi-definiteness of its kernel.
Hypothesis (H1): Spaces $\mathcal{B}$, $\mathcal{B}'$ are a pair of RKBSs on a set $X$, the $\delta$-dual space $\mathcal{B}^\delta$ is isometrically isomorphic to the Banach space $\mathcal{B}'$ of functions on $X$, and a reproducing kernel $K$ for $\mathcal{B}$ is symmetric.
Suppose that Hypothesis (H1) is satisfied. By the symmetry of the kernel $K$, we have that $K(x, \cdot) \in \mathcal{B}'$ for all $x \in X$ and $K(\cdot, y) \in \mathcal{B}$ for all $y \in X$. In this case, $\langle K(\cdot, y), K(x, \cdot)\rangle$ is well-defined for all $x, y \in X$. We further require that the following hypothesis be satisfied.
Hypothesis (H2): There exists a positive constant such that
| (4.4) |
In a loose sense, inequality (4.4) has a geometric interpretation: “cosine” of the “angle” formed by is uniformly above zero, for all , , , , . In other words, the spaces and cannot be “perpendicular” to each other, with respect to the dual bilinear form . When is a Hilbert space, inequality (4.4) reduces to an equality with , since in such a case . Under Hypotheses (H1) and (H2), we can establish the positive semi-definiteness of a reproducing kernel.
Theorem 4.3 Suppose that a pair , of RKBSs satisfy Hypotheses (H1) with a reproducing kernel and (H2). For all , all , , let
| (4.5) |
Then, for all and all , , the matrices are positive semi-definite.
Proof: It suffices to show for all that . For , we introduce the notation . By the reproducing property (4.2) and symmetry of the kernel , we observe that
Therefore, according to Hypotheses (H1) and (H2), for all , we obtain that
proving the positive semi-definiteness of . □
In general, the kernel matrix (4.5) is not well-defined unless .
We next illustrate the notion of the RKBS and its kernel with an example. For this purpose, we consider the space $\ell_1(\mathbb{N})$ of real sequences $x := [x_j : j \in \mathbb{N}]$ such that $\|x\|_1 := \sum_{j \in \mathbb{N}} |x_j| < \infty$. Elements of $\ell_1(\mathbb{N})$ are functions defined on the set $\mathbb{N}$. It is well known that $\ell_1(\mathbb{N})$ is a Banach space, and thus it is a Banach space of functions on $\mathbb{N}$. For each $j \in \mathbb{N}$, the point-evaluation functional has the form

$$\delta_j(x) = x_j, \quad x \in \ell_1(\mathbb{N}). \qquad (4.6)$$
We first establish that $\ell_1(\mathbb{N})$ is an RKBS, which is stated in the next theorem.

Theorem 4.4 The space $\ell_1(\mathbb{N})$ is an RKBS on $\mathbb{N}$.

Proof: It suffices to verify that the point-evaluation functionals on the space $\ell_1(\mathbb{N})$ are continuous. All point-evaluation functionals on the space $\ell_1(\mathbb{N})$ are of the form (4.6), for $j \in \mathbb{N}$. Hence, for all $x \in \ell_1(\mathbb{N})$ and all $j \in \mathbb{N}$, it follows from (4.6) that

$$|\delta_j(x)| = |x_j| \le \|x\|_1.$$

This ensures that for all $j \in \mathbb{N}$, the point-evaluation functionals are continuous on the space $\ell_1(\mathbb{N})$. By Definition 4.1, the space $\ell_1(\mathbb{N})$ is an RKBS on $\mathbb{N}$. □
We next identify a reproducing kernel for the RKBS $\ell_1(\mathbb{N})$. By $\ell_\infty(\mathbb{N})$ we denote the Banach space of real bounded sequences on $\mathbb{N}$ under the supremum norm; namely, for any $u \in \ell_\infty(\mathbb{N})$, we have that $\|u\|_\infty := \sup_{j \in \mathbb{N}} |u_j| < \infty$. We further denote by $c_0$ the set of real sequences that converge to zero, in the sense that $u \in c_0$ if $u_j \to 0$ as $j \to \infty$. The set $c_0$ is a Banach space under the supremum norm. Thus, for all $u \in c_0$, there holds that $\|u\|_\infty < \infty$. This ensures that $c_0 \subseteq \ell_\infty(\mathbb{N})$. Moreover, since a nonzero constant sequence is in $\ell_\infty(\mathbb{N})$ but not in $c_0$, $c_0$ is a proper subspace of $\ell_\infty(\mathbb{N})$. It is well known [37] that $(c_0)^* = \ell_1(\mathbb{N})$ and $(\ell_1(\mathbb{N}))^* = \ell_\infty(\mathbb{N})$. It follows that $(\ell_1(\mathbb{N}))^{**} = (\ell_\infty(\mathbb{N}))^* \ne \ell_1(\mathbb{N})$. That is, the Banach space $\ell_1(\mathbb{N})$ is not reflexive and its predual is $c_0$.
The dual bilinear form on $\ell_1(\mathbb{N}) \times \ell_\infty(\mathbb{N})$ has a concrete form. Specifically, for any $x \in \ell_1(\mathbb{N})$ and $u \in \ell_\infty(\mathbb{N})$, we define

$$\langle x, u\rangle_{\ell_1 \times \ell_\infty} := \sum_{j \in \mathbb{N}} x_j u_j. \qquad (4.7)$$

Restricting the definition (4.7) to the subspace $c_0$ of $\ell_\infty(\mathbb{N})$ yields the dual bilinear form on $\ell_1(\mathbb{N}) \times c_0$. It follows from (4.7), for any $x \in \ell_1(\mathbb{N})$ and $u \in c_0$, that the two bilinear forms coincide. For this reason, we will drop the subscript and simply write $\langle x, u\rangle$ and $\langle u, x\rangle$. Moreover, since $\langle x, u\rangle = \langle u, x\rangle$ according to (4.7), we will use the two forms interchangeably when no ambiguity may arise.
Proposition 4.5 If $(\ell_1(\mathbb{N}))^\delta$ denotes the $\delta$-dual space of $\ell_1(\mathbb{N})$, then $(\ell_1(\mathbb{N}))^\delta = c_0$.
Proof: We first show that $c_0 \subseteq (\ell_1(\mathbb{N}))^\delta$. Let $u \in c_0$ be arbitrary. For all $m \in \mathbb{N}$, we have that

$$\Big\|u - \sum_{j \in \mathbb{N}_m} u_j \delta_j\Big\|_\infty = \sup_{j > m} |u_j| \to 0, \quad \text{as } m \to \infty.$$

This implies that $u$ is the limit, in the norm of $\ell_\infty(\mathbb{N})$, of the truncations $\sum_{j \in \mathbb{N}_m} u_j \delta_j$. Namely, each truncation is a linear combination of the point-evaluation functionals $\delta_j$, for $j \in \mathbb{N}_m$. Hence, $u \in (\ell_1(\mathbb{N}))^\delta$.
Conversely, we establish that $(\ell_1(\mathbb{N}))^\delta \subseteq c_0$. Let $u \in (\ell_1(\mathbb{N}))^\delta$ be arbitrary. Since $(\ell_1(\mathbb{N}))^\delta$ is the completion of the linear span of all the point-evaluation functionals $\delta_j$ on $\ell_1(\mathbb{N})$, for $j \in \mathbb{N}$, under the norm of $\ell_\infty(\mathbb{N})$, there exist finite linear combinations $u_n$ of the functionals $\delta_j$ such that

$$\|u_n - u\|_\infty \to 0, \quad \text{as } n \to \infty. \qquad (4.8)$$

Clearly, $u_n \in c_0$ for all $n \in \mathbb{N}$. Since $c_0$ is a Banach space, equation (4.8) ensures that $u \in c_0$, which implies that $(\ell_1(\mathbb{N}))^\delta \subseteq c_0$. □
Proposition 4.5 guarantees that $(\ell_1(\mathbb{N}))^\delta = c_0$, which allows us to identify a reproducing kernel for the RKBS $\ell_1(\mathbb{N})$. To this end, we define a function $K: \mathbb{N} \times \mathbb{N} \to \mathbb{R}$, which may be viewed as a semi-infinite matrix. Specifically, for $j, k \in \mathbb{N}$, we let $\delta_{j,k} := 1$ if $j = k$ and $0$ otherwise. We then define

$$K(j, k) := \delta_{j,k}, \quad j, k \in \mathbb{N}. \qquad (4.9)$$
Theorem 4.6 The function $K$ defined by (4.9) is a reproducing kernel for the RKBS $\ell_1(\mathbb{N})$.

Proof: According to Proposition 4.5, we have that $(\ell_1(\mathbb{N}))^\delta = c_0$, which is a Banach space of functions on $\mathbb{N}$. By the definition (4.9) of the function $K$, we find that $K(j, \cdot) \in c_0$, for all $j \in \mathbb{N}$. Moreover, for all $x \in \ell_1(\mathbb{N})$, we have for all $j \in \mathbb{N}$ that

$$\langle x, K(j, \cdot)\rangle = \sum_{k \in \mathbb{N}} x_k \delta_{j,k} = x_j = \delta_j(x).$$

This proves the reproducing property. By Definition 4.2 with $\mathcal{B} := \ell_1(\mathbb{N})$ and $\mathcal{B}' := c_0$, we conclude that the function $K$ is a reproducing kernel for $\ell_1(\mathbb{N})$. □
Note that Theorem 4.6 gives the simplest example of an RKBS, since any principal (and finite) sub-matrix of the kernel $K$ (viewed as an infinite matrix) is an identity matrix. This result is not covered by the setting described in [58]. In the same manner, we consider the predual space $c_0$.
Theorem 4.7 The space $c_0$ is an RKBS on $\mathbb{N}$.

Proof: We establish that the point-evaluation functionals on $c_0$ are continuous. All point-evaluation functionals on the space $c_0$ are of the form $\delta_k(u) = u_k$, for $k \in \mathbb{N}$. Hence, we observe for all $u \in c_0$ and all $k \in \mathbb{N}$ that

$$|\delta_k(u)| = |u_k| \le \|u\|_\infty.$$

This ensures that the point-evaluation functionals on $c_0$ are continuous. Again, by Definition 4.1, $c_0$ is an RKBS on $\mathbb{N}$. □
We next identify a reproducing kernel for the RKBS $c_0$ of functions defined on $\mathbb{N}$.

Theorem 4.8 The function $K$ defined by (4.9) is a reproducing kernel for $c_0$. The space $c_0$ is an adjoint RKBS of $\ell_1(\mathbb{N})$, the spaces $\ell_1(\mathbb{N})$, $c_0$ form a pair of RKBSs, and the kernel $K$ is symmetric and positive semi-definite.

Proof: By Theorem 4.7, $c_0$ is an RKBS on $\mathbb{N}$. Furthermore, we have that $K(\cdot, k) \in \ell_1(\mathbb{N})$, for all $k \in \mathbb{N}$. Moreover, we have the dual reproducing property that for all $u \in c_0$,

$$\langle K(\cdot, k), u\rangle = \sum_{j \in \mathbb{N}} \delta_{j,k} u_j = u_k = \delta_k(u), \quad k \in \mathbb{N}.$$

It follows from Definition 4.2 that the function $K$ defined by (4.9) is a reproducing kernel for the RKBS $c_0$. Therefore, $c_0$ is an adjoint RKBS of $\ell_1(\mathbb{N})$.

Clearly, the spaces $\ell_1(\mathbb{N})$, $c_0$ with the kernel $K$ satisfy Hypotheses (H1)-(H2). The positive semi-definiteness of $K$ can be seen immediately since for any $n \in \mathbb{N}$ and distinct indices $j_1, j_2, \dots, j_n \in \mathbb{N}$, the kernel matrix $[K(j_p, j_q)]_{p,q \in \mathbb{N}_n}$ is the identity matrix. □
Several nontrivial RKBSs may be found in [46, 53, 55, 58]. In particular, an RKBS isometrically isomorphic to the space was constructed in [46]. We now briefly describe an RKBS on a continuous function space. We choose as a locally compact Hausdorff space. By we denote the space of all continuous functions such that for each , the set is compact. We equip the maximum norm on , that is,
The Riesz-Markov representation theorem states that the dual space of is isometrically isomorphic to the space of the real-valued regular Borel measures on endowed with the total variation norm . We suppose that satisfies for all and the density condition . We then introduce the space of functions on by
| (4.10) |
equipped with . Clearly, defined by (4.10) is a Banach space of functions defined on . It was established in [53] that is an RKBS, the -dual of is isometrically isomorphic to the space . Moreover, the function is a reproducing kernel for the RKBS in the sense of Definition 4.2. More information about the space may be found in [3, 27, 47, 53].
The last decade has witnessed the rapid development of the theory of the RKBS and its applications since the publication of the original paper [58]. Properly defining and constructing RKBSs, and understanding related theoretical issues, remain research topics of great interest [18, 20, 22, 45, 46, 50, 51, 55, 60, 61, 62]. In particular, a construction of RKBSs by using Mercer kernels was proposed in [55]. A class of vector-valued reproducing kernel Banach spaces with the $\ell_1$-norm was constructed in [26] and used in multi-task learning. Separability of RKBSs was studied in [34]. Statistical margin error bounds were given in [8] for $\ell_1$-norm support vector machines. Lipschitz implicit function theorems in RKBSs were established in [43]. A converse sampling theorem in RKBSs was given in [7]. The notion of the RKBS has been used in statistics [39, 44], game theory [48], and machine learning [3, 11, 17, 25, 35, 52].
5. Learning in a Banach Space
Most learning methods may be formulated as a minimum norm interpolation problem or a regularization problem. We consider both cases in this section. We will focus on the solution representations of these learning problems.
Representer theorems for learning methods in Banach spaces have recently received considerable attention. In the framework of a semi-inner-product RKBS [58], the representer theorem was derived from dual elements and the semi-inner-product [58, 61]. In [55], for a reflexive and smooth RKBS, a representer theorem for a solution of the regularized learning problem was established using the Gâteaux derivative of the norm function and the reproducing kernel. In addition, the representer theorem was generalized to non-reflexive and non-smooth Banach spaces which have a predual space [22, 50, 51, 52]. Due to the lack of the Gâteaux derivative, other tools such as the subdifferential of the norm function and the duality mapping need to be used to describe the representer theorem.
In this section, we assume that $\mathcal{B}$ is a general Banach space with dual space $\mathcal{B}^*$ or predual space $\mathcal{B}_*$ unless stated otherwise. We first study the minimum norm interpolation problem. Given continuous linear functionals $\nu_j$ on $\mathcal{B}$ and values $y_j \in \mathbb{R}$, $j \in \mathbb{N}_m$, we seek $f \in \mathcal{B}$ such that

$$f \in \operatorname{argmin}\left\{\|g\|_{\mathcal{B}} : g \in \mathcal{B}, \ \nu_j(g) = y_j, \ j \in \mathbb{N}_m\right\}. \qquad (5.1)$$
While minimum norm interpolation in a Hilbert space is a classical topic [14, 15], its counterpart in a Banach space has drawn much attention in the literature [9, 10, 30, 50, 52] due to its connection with compressed sensing [6, 16]. An existence theorem of a solution of the minimum norm interpolation problem was established in Proposition 1 of [52].
The minimum norm interpolation problem (5.1) with $\mathcal{B} := \ell_1(\mathbb{N})$ was investigated in [10], and it may be stated as follows: Given $u_j \in c_0$ and $y_j \in \mathbb{R}$, $j \in \mathbb{N}_m$, we seek $x \in \ell_1(\mathbb{N})$ such that

$$x \in \operatorname{argmin}\left\{\|z\|_1 : z \in \ell_1(\mathbb{N}), \ \langle u_j, z\rangle = y_j, \ j \in \mathbb{N}_m\right\}. \qquad (5.2)$$
Problem (5.2) is an infinite dimensional version of the compressed sensing problem considered in [6, 16]. According to [10], it may be reduced to a finite dimensional linear programming problem by a duality approach, leading to a sparse solution. Sparse learning methods in $\ell_1(\mathbb{N})$ were studied in [2, 10, 50, 52].
Problem (5.2) with the interpolation conditions (3.12) has a sparse solution (see [10] for details about this example). This solution has only two nonzero components, while the solution of the minimum norm interpolation problem (3.11) with the same interpolation conditions has infinitely many nonzero components. The solution of (5.2) is sparse because the Banach space $\ell_1(\mathbb{N})$ promotes sparsity, while the solution, given by (3.13), of problem (3.11) is dense because the Hilbert space $\ell_2(\mathbb{N})$ does not promote sparsity. Several forms of the representer theorem for a solution of problem (5.1) were established in [52].
We now turn to investigating the regularized learning problem in a Banach space. Such a problem may be formulated from the ill-posed problem

$$\mathcal{A}f = y, \qquad (5.3)$$

where $y$ is given data and the operator $\mathcal{A}$ represents either a physical system or a learning system. We define a data fidelity term from (5.3) by using a loss function $\mathcal{L}$ and choose a regularization term with a regularizer $\varphi$. We then solve the regularization problem

$$\min\left\{\mathcal{L}(\mathcal{A}f, y) + \lambda\, \varphi\big(\|f\|_{\mathcal{B}}\big) : f \in \mathcal{B}\right\}, \qquad (5.4)$$

where $\lambda$ is a positive regularization parameter. Here, $\mathcal{L}(\mathcal{A}f, y)$ measures the loss of $\mathcal{A}f$ from the given data $y$; for example, one may choose $\mathcal{L}(\mathcal{A}f, y) := \|\mathcal{A}f - y\|_2^2$. Regularized learning problems in Banach spaces originated in [30] and, since then, representer theorems for such learning problems have received considerable attention in the literature. For existence results for a solution of problem (5.4), the readers are referred to [52].
When the operator $\mathcal{A}$ in (5.3) is defined by

$$\mathcal{A}f := [\nu_j(f) : j \in \mathbb{N}_m], \quad f \in \mathcal{B}, \qquad (5.5)$$

for the functionals $\nu_j$ appearing in (5.1), the regularization problem (5.4) is intimately related to the minimum norm interpolation problem (5.1). Their relation is described in the next proposition.
Proposition 5.1 Suppose that $\mathcal{B}$ is a Banach space with the dual space $\mathcal{B}^*$, and $\mathcal{A}$ is defined by (5.5). For a given $y$, let $\mathcal{L}$ be a loss function, $\varphi$ be an increasing regularizer and $\lambda > 0$. Let $f_\lambda$ be a solution of the regularization problem (5.4). Then the following statements hold true:

(i) A solution of problem (5.1) with the data values $\mathcal{A}f_\lambda$ is a solution of the regularization problem (5.4).

(ii) If $\varphi$ is strictly increasing, then $f_\lambda$ is a solution of problem (5.1) with the data values $\mathcal{A}f_\lambda$.
Statement (ii) of Proposition 5.1 was claimed in [30] without details of proof. A complete proof for Proposition 5.1 may be found in [52].
We are interested in characterizing a solution of the regularization problem (5.4) with an operator defined by (5.5) having an adjoint operator . The following two representer theorems are due to [52]. For a given , let be a loss function. We first consider the case when the Banach space has a smooth predual.
Theorem 5.2 Suppose that is a Banach space having the smooth predual space , and , . Let denote the Gâteaux derivative of the norm at . Let be a strictly increasing regularizer and . Then
| (5.6) |
is a solution of the regularization problem (5.4) with if and only if is a solution of the finite dimensional minimization problem
| (5.7) |
We now consider the case when the predual of the Banach space is not necessarily smooth.
Theorem 5.3 Suppose that $\mathcal{B}$ is a Banach space having the predual space $\mathcal{B}_*$. Let $\varphi$ be a regularizer and $\lambda > 0$.

(i) If $\varphi$ is increasing, then there exists a solution of the regularization problem (5.4) such that

$$\qquad (5.8)$$

holds for some coefficients, where $\partial\|\cdot\|$ denotes the subdifferential of the norm.

(ii) If $\varphi$ is strictly increasing, then every solution of the regularization problem (5.4) satisfies (5.8) for some coefficients.
Item (i) of Theorem 5.3 indicates that there exists a solution of the regularization problem (5.4) that satisfies (5.8), which is a generalization of the stationary point condition to minimization problems with a non-differentiable objective function. The essence of Theorems 5.2 and 5.3 is that a solution, defined by (5.6) or (5.8), of the infinite dimensional regularization problem (5.4) is determined completely by a finite number of parameters. The values of these parameters can be obtained by solving either the finite dimensional optimization problem (5.7) or a nonlinear system. When the space $\mathcal{B}$ is a Hilbert space, the nonlinear system reduces to a linear one. In particular, when $\mathcal{B}$ is an RKBS with a kernel $K$, the functionals in Theorems 5.2 and 5.3 have convenient representations in terms of the kernel $K$. We will demonstrate this point later in this section. These representer theorems serve as a basis for further development of efficient numerical solvers for the regularization problem (5.4). One may find more discussions about representer theorems in Banach spaces in [41].
Not all Banach spaces will produce sparse learning solutions. Only Banach spaces with certain geometric features can lead to a sparse learning solution. It has been shown in [10, 50] that minimum norm interpolation and regularization problems in $\ell_1(\mathbb{N})$ have sparse solutions. As shown in Theorems 4.4 and 4.6, the space $\ell_1(\mathbb{N})$ is an RKBS on $\mathbb{N}$ with the kernel $K$ defined by (4.9). Hence, a learning solution in the RKBS $\ell_1(\mathbb{N})$ can be expressed in terms of the kernel $K$. Integrating Theorems 4.4 and 4.6 with results known in the literature [50], we have the following sparse representer theorem for a learning solution in $\ell_1(\mathbb{N})$: for any learning solution $x \in \ell_1(\mathbb{N})$, there exist a positive integer $s$ with $s \le m$ and indices $k_1 < k_2 < \dots < k_s$ such that

$$x = \sum_{i \in \mathbb{N}_s} c_i K(k_i, \cdot) \qquad (5.9)$$

for some nonzero coefficients $c_i$, $i \in \mathbb{N}_s$. Clearly, the solution has only $s$ nonzero components. Here, the index vector $[k_i : i \in \mathbb{N}_s]$ represents the support of the learning solution defined by (5.9), and the coefficients $c_i$ are determined by the fidelity data. The paper [10] suggests that we can first convert the minimum norm interpolation problem to its dual problem, which allows us to identify the positive integer $s$ and the support, and then solve a linear system to obtain the coefficients $c_i$. The idea of [10] has been extended in [9] to solve a class of regularization problems.
We now return to the binary classification problem and describe a model based on an RKBS for it. We choose the RKBS defined by (4.10) with and
as the hypothesis space. The classification problem in the RKBS may be described by
| (5.10) |
where the fidelity term is the same as in problem (3.6). A representer theorem for a solution of the learning method (5.10) was obtained in [53]. To state it, for a function with , we define the set . We denote by a solution of minimization problem (5.10) and let . It was proved in [53] that there exists a solution of (5.10) having the form
| (5.11) |
where is a solution of optimization problem
| (5.12) |
Although the points appearing in (5.11) may not be the same as the training points $x_j$, $j \in \mathbb{N}_n$, motivated by (5.11) we seek a solution of the regularization problem (5.10) of the form $f := \sum_{k \in \mathbb{N}_n} c_k K(\cdot, x_k)$. By plugging this form into (5.10) and adding a bias term $b$, we obtain the $\ell_1$-SVM classification model, which has the form

$$\min\left\{\sum_{j \in \mathbb{N}_n} \mathrm{ReLU}\Big(1 - y_j\big((\mathbf{K}c)_j + b\big)\Big) + \lambda \|c\|_1 : c \in \mathbb{R}^n, \ b \in \mathbb{R}\right\}, \qquad (5.13)$$

where $\mathbf{K} := [K(x_j, x_k)]_{j,k \in \mathbb{N}_n}$ denotes the kernel matrix on the training inputs. The discrete minimization problem (5.13) has the same form as (3.14). We comment that the space $\mathbb{R}^n$ with the $\ell_1$-norm is a finite dimensional RKBS according to Theorem 4.4.
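The following sketch implements a discrete $\ell_1$-SVM of the form (5.13) on toy data, with the classifier parametrized by kernel coefficients and a bias. Plain subgradient descent is used here only for illustration and is not the algorithm advocated in the paper; exact sparsity is better obtained with the proximity-operator methods discussed in Section 6. Data and parameters are illustrative assumptions.

```python
# A minimal sketch of a discrete ell_1-SVM of the form (5.13).
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal([1, 1], 0.7, (40, 2)), rng.normal([-1, -1], 0.7, (40, 2))])
y = np.hstack([np.ones(40), -np.ones(40)])
n = len(y)

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

K = gaussian_kernel(X, X)
lam, step = 0.5, 1e-3
c, b = np.zeros(n), 0.0
for _ in range(5000):
    margins = y * (K @ c + b)
    active = margins < 1.0
    # Subgradient of the hinge loss plus lam * ||c||_1.
    grad_c = -(K[active].T @ y[active]) + lam * np.sign(c)
    grad_b = -np.sum(y[active])
    c -= step * grad_c
    b -= step * grad_b

print("training accuracy:", np.mean(np.sign(K @ c + b) == y))
print("coefficients above 1e-3 (rough sparsity proxy):", np.sum(np.abs(c) > 1e-3))
```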
Next, we elaborate on the numerical solution of the regularized learning problem. Due to the use of a Banach space as the hypothesis space in learning, we inevitably face solving the regularized learning problem (5.4) with a Banach norm. Major challenges in solving this problem include infinite dimensionality, nonlinearity and nondifferentiability. The regularization problem (5.4) is by nature infinite dimensional, since we seek a solution in the hypothesis space, which is an infinite dimensional Banach space, even though the fidelity term is determined by a finite number of data points. The use of a Banach norm as a regularization term leads to nonlinearity. Moreover, employing a sparsity-promoting norm results in a non-differentiable optimization problem. The representer theorems presented in this section for a solution of this problem indicate that the regularization problem (5.4) is essentially of finite dimension. However, how the representer theorems can be used in developing practical algorithms requires further investigation. A recent paper [9] makes an attempt in this direction. Specifically, when the data fidelity term of the regularization problem (5.4) is expressed in one Banach norm and the regularization term in another Banach norm, we construct a direct sum space based on the two Banach spaces for the data fidelity term and the regularization term, and then recast the objective function as the norm of a suitable quotient space of the direct sum space. In this way, we express the original regularized problem as a best approximation problem in the direct sum space, which is in turn reformulated as a dual optimization problem in the dual space of the direct sum space. The dual problem is to find the maximum of a linear function on a convex polytope, which is of finite dimension and may be solved by numerical methods such as linear programming. The solution of the dual optimization problem provides related extremal properties of norming functionals, by which the original problem is reduced to a finite dimensional optimization problem, which is then reformulated as a finite dimensional fixed-point equation [24, 31] and solved by iterative algorithms.
6. Numerical Results
This section is devoted to a numerical example concerning the classification of handwritten digits by using the $\ell_1$-SVM model (5.13), with a comparison to the $\ell_2$-SVM model. Classification of handwritten digits by using the $\ell_1$-SVM was studied in [28].
In our experiment, we use the MNIST database of handwritten digits, which is originally composed of 60,000 training samples and 10,000 testing samples of the digits "0" through "9". We consider classifying the two handwritten digits "7" and "9" taken from MNIST. The handwritten digits "7" and "9" are more easily confused with each other than most other pairs of digits. We take 8,141 training samples and 2,037 testing samples of these two digits from the database. Specifically, we consider the training data set $\{(x_j, y_j) : j \in \mathbb{N}_n\}$, where the $x_j$ are images of the digits 7 or 9. We wish to find a function that defines a hyperplane separating the data into two groups, with labels $+1$ and $-1$ corresponding to the digits 7 and 9, respectively.
We employ the model (5.10) and its related discrete form (5.13), described in Section 5, for our experiment. For comparison purposes, we also test the learning model (5.13) with the squared loss function in place of the hinge loss as a fidelity term in problem (5.13). We compare the classification performance and solution sparsity of problem (5.13) with those of its Hilbert space counterpart (3.6), which leads to problem (5.13) with the $\ell_1$-norm replaced by the $\ell_2$-norm.
The objective functions of both the minimization problem (5.13) and its $\ell_2$-norm counterpart are not differentiable and thus cannot be solved by a gradient-based iterative algorithm. Instead, we employ the Fixed Point Proximity Algorithm (FPPA) developed in [24, 31] to solve these non-smooth minimization problems. Specifically, we first use the Fermat rule to reformulate a solution of the minimization problem (5.13) as a solution of an inclusion relation defined by the subdifferential of the objective function of (5.13). We then rewrite the inclusion relation as an equivalent fixed-point equation involving the proximity operators of the functions that appear in the objective function of (5.13). The fixed-point equation leads to the FPPA for solving (5.13), which is guaranteed to converge under a mild condition. We refer the interested readers to [24, 31] for more information on the FPPA.
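The following sketch conveys the flavor of the proximity-operator approach on the square-loss variant of (5.13): each iteration takes a gradient step on the smooth part and then applies the proximity operator of $\lambda\|\cdot\|_1$ (soft thresholding), which produces exact zeros. This is a standard proximal gradient iteration written for illustration, not the exact FPPA of [24, 31]; data and parameters are illustrative assumptions.

```python
# Proximal gradient sketch for: minimize 0.5*||K c + b - y||_2^2 + lam*||c||_1.
import numpy as np

def soft_threshold(z, tau):
    # Proximity operator of tau * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([1, 1], 0.7, (40, 2)), rng.normal([-1, -1], 0.7, (40, 2))])
y = np.hstack([np.ones(40), -np.ones(40)])
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)                            # Gaussian kernel matrix

lam = 0.5
n = len(y)
step = 1.0 / (np.linalg.norm(K, 2) ** 2 + n)     # safe step size for the smooth part
c, b = np.zeros(n), 0.0
for _ in range(3000):
    r = K @ c + b - y                            # residual of the square loss
    c = soft_threshold(c - step * (K.T @ r), step * lam)
    b -= step * np.sum(r)

print("training accuracy:", np.mean(np.sign(K @ c + b) == y))
print("nonzero coefficients:", np.sum(np.abs(c) != 0.0))
```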
Numerical results are reported in Tables I–IV. In these tables, $\lambda$ represents values of the regularization parameter, #NZ the sparsity level (number of nonzero entries of the learned coefficient vector), and TRA and TEA the learning accuracy (correct classification rate) for the training and testing datasets, respectively. Specifically, TRA and TEA are the ratios of the number of predicted labels equal to the given labels to the total number of data points in the training and testing sets, respectively.
Table I:
Classification using the Gaussian kernel (fixed kernel parameter): results for the $\ell_1$-SVM model with the square loss function.
| λ | 0.5 | 0.7 | 1.4 | 1.8 | 4.0 | 6.0 | 487.7 |
| #NZ | 1762 | 880 | 481 | 340 | 187 | 142 | 0 |
| TRA | 99.46% | 99.19% | 98.61% | 98.34% | 97.57% | 97.16% | 50.67% |
| TEA | 98.77% | 98.82% | 98.48% | 98.38% | 97.79% | 97.40% | 50.47% |
Table IV:
Classification using the Gaussian kernel (fixed kernel parameter): results for the $\ell_2$-SVM model with the hinge loss function.
| λ | 0.001 | 0.2 | 2 | 12 | 22 | 47 | 128 |
| #NZ | 8,141 | 8,141 | 8,141 | 8,141 | 8,141 | 8,141 | 8,141 |
| TRA | 99.90% | 99.84% | 99.18% | 98.13% | 97.68% | 97.07% | 96.31% |
| TEA | 99.02% | 98.97% | 98.33% | 97.89% | 97.59% | 97.05% | 96.41% |
Some explanations on the choice of the regularization parameter $\lambda$ are in order. In Table I, we randomly selected seven different values of $\lambda$ from an interval whose right endpoint is determined by Remark 4.12 in [28]. In each of the remaining tables, we selected seven different values of $\lambda$ in increasing order such that the $\ell_1$-SVM classification model (5.13) and the related $\ell_2$-SVM model produce the same accuracy. In this way, we can compare the sparsity of the corresponding solutions of these two models. Moreover, the largest $\lambda$ in Table I is given by the smallest value of $\lambda$ for which the learned solution vanishes (#NZ = 0), since for the $\ell_1$-SVM model with the square loss function we have Remark 4.12 of [28] to determine such a value. However, in Table III, which is for the $\ell_1$-SVM model with the hinge loss function, we do not have a theoretical result similar to Remark 4.12 of [28] to determine the smallest value of $\lambda$ that ensures a zero solution, and hence we empirically choose the value of $\lambda$ such that #NZ = 0.
Table III:
Classification using the Gaussian kernel (fixed kernel parameter): results for the $\ell_1$-SVM model with the hinge loss function.
| λ | 0.1 | 0.2 | 1 | 2 | 4 | 10 | 435 |
| #NZ | 552 | 481 | 167 | 92 | 56 | 34 | 0 |
| TRA | 99.99% | 99.99% | 99.08% | 98.17% | 97.53% | 96.30% | 50.67% |
| TEA | 98.72% | 98.77% | 98.38% | 98.09% | 97.45% | 96.27% | 50.47% |
The numerical results presented in Tables I–IV confirm that the two models generate learning results with comparable accuracy, while the solution of the $\ell_1$-SVM model can have different levels of sparsity according to the value of $\lambda$, but that of the $\ell_2$-SVM model is always dense regardless of the choice of $\lambda$. In passing, we point out that the hinge loss function outperforms the square loss function as a fidelity term for this classification problem.
7. Remarks on Future Research Problems
In this section, we elaborate several open mathematical problems critical to the theory of RKBSs and machine learning methods on a Banach space.
It has been demonstrated that learning in certain Banach spaces leads to a sparse solution while learning in a Hilbert space yields a dense solution. Sparse learning solutions can save tremendous computing effort and storage when they are used in decision making processes. However, learning in Banach spaces introduces new challenges. First of all, the theory of the RKBS is far from complete. Constructions of reproducing kernels for RKBSs suitable for different applications are needed. For example, related to the definition of the reproducing kernel, we have the following specific problem: Suppose that $\mathcal{B}$ is a Banach space of functions on $X$ and the point-evaluation functionals on $\mathcal{B}$ are all continuous. Let $\mathcal{B}^\delta$ denote the completion of the linear span of all the point-evaluation functionals on $\mathcal{B}$ under the norm of $\mathcal{B}^*$. In general, a Banach space of functions is not isometrically isomorphic to its dual $\mathcal{B}^*$ or its $\delta$-dual $\mathcal{B}^\delta$. We take $\ell_1(\mathbb{N})$ as an example. We know from Section 4 that $(\ell_1(\mathbb{N}))^* = \ell_\infty(\mathbb{N})$ and $(\ell_1(\mathbb{N}))^\delta = c_0$. Clearly, the space $\ell_1(\mathbb{N})$ is isometrically isomorphic to neither $\ell_\infty(\mathbb{N})$ nor $c_0$. In Definition 4.2 of a kernel, we need the assumption that $\mathcal{B}^\delta$ is isometrically isomorphic to a Banach space of functions on some set $X'$. This assumption holds true when $\mathcal{B} = \ell_1(\mathbb{N})$, since $c_0$ is a Banach space of functions (sequences) defined on $\mathbb{N}$. We are interested in knowing to what extent this is true in general.
Theorem 4.3 reveals that under Hypotheses (H1) and (H2), a reproducing kernel as defined in Definition 4.2 is positive semi-definite. Can the two hypotheses be weakened? We are curious to know how a given function of two arguments will lead to a pair of RKBSs.
A problem related to sparse learning in a Banach space may be stated as follows: It has been understood that the Banach space $\ell_1(\mathbb{N})$ promotes sparsity for a learning solution in the space. What is the essential geometric feature of a general Banach space of functions that can lead to a sparse learning solution in the space?
Finally, practical numerical algorithms for learning a function from an RKBS require systematic investigation.
Answers to these challenging issues would contribute to completing the theory of the RKBS and making it a useful hypothesis space for machine learning, hopefully leading to practical numerical methods for learning in an RKBS.
Table II:
Classification using the Gaussian kernel (fixed kernel parameter): results for the $\ell_2$-SVM model with the square loss function.
| λ | 0.005 | 1.0 | 2.3 | 6.0 | 43.0 | 96.0 | 490 |
| #NZ | 8,141 | 8,141 | 8,141 | 8,141 | 8,141 | 8,141 | 8,141 |
| TRA | 100% | 99.50% | 99.18% | 98.69% | 97.56% | 97.16% | 95.93% |
| TEA | 99.21% | 98.87% | 98.72% | 98.43% | 97.69% | 97.25% | 96.41% |
Acknowledgement:
This paper is based on the author's plenary invited lecture delivered online at the international conference "Functional Analysis, Approximation Theory and Numerical Analysis" that took place in Matera, Italy, July 5–8, 2022. The author is indebted to Professor Rui Wang for helpful discussions on issues related to the notion of the reproducing kernel Banach space, and to Professor Raymond Cheng for careful reading of the manuscript and providing a list of corrections. The author is grateful to two anonymous referees whose constructive comments improved the presentation of this paper. The author is supported in part by the US National Science Foundation under grants DMS-1912958 and DMS-2208386, and by the US National Institutes of Health under grant R21CA263876.
References
- [1].Aronszajn N, Theory of reproducing kernels, Transactions of the American Mathematical Society, 68 (1950), 337–404.
- [2].Aziznejad S and Unser M, Multikernel regression with sparsity constraint, SIAM Journal on Mathematics of Data Science, 3 (2021), 10.1137/20M1318882.
- [3].Bartolucci F, De Vito E, Rosasco L and Vigogna S, Understanding neural networks with reproducing kernel Banach spaces, Applied and Computational Harmonic Analysis, 62 (2023), 194–236.
- [4].Bochner S, Hilbert distances and positive definite functions, Annals of Mathematics, 42 (1941), 647–656.
- [5].de Boor C and Lynch RE, On splines and their minimum properties, Journal of Mathematics and Mechanics, 15 (1966), 953–969.
- [6].Candès EJ, Romberg J and Tao T, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2006), 489–509.
- [7].Centeno H and Medina JM, A converse sampling theorem in reproducing kernel Banach spaces, Sampling Theory, Signal Processing, and Data Analysis, 20 (2022), article 8, 19 pages.
- [8].Chen L and Zhang H, Statistical margin error bounds for $\ell_1$-norm support vector machines, Neurocomputing, 339 (2019), 210–216.
- [9].Cheng R, Wang R and Xu Y, A duality approach to regularization learning problems in Banach spaces, preprint, 2022.
- [10].Cheng R and Xu Y, Minimum norm interpolation in the $\ell_1(\mathbb{N})$ space, Analysis and Applications, 19 (2021), 21–42.
- [11].Combettes PL, Salzo S and Villa S, Regularized learning schemes in feature Banach spaces, Analysis and Applications, 16 (2018), 1–54.
- [12].Cortes C and Vapnik V, Support-vector networks, Machine Learning, 20 (1995), 273–297.
- [13].Cucker F and Smale S, On the mathematical foundations of learning, Bulletin of the American Mathematical Society, 39 (2002), 1–49.
- [14].Deutsch F, Best Approximation in Inner Product Spaces, Springer, New York, 2001.
- [15].Deutsch F, Ubhaya VA, Ward JD and Xu Y, Constrained best approximation in Hilbert space III. Applications to $n$-convex functions, Constructive Approximation, 12 (1996), 361–384.
- [16].Donoho DL, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), 1289–1306.
- [17].Fasshauer GE, Hickernell FJ and Ye Q, Solving support vector machines in reproducing kernel Banach spaces with positive definite functions, Applied and Computational Harmonic Analysis, 38 (2015), 115–139.
- [18].Fukumizu K, Lanckriet G and Sriperumbudur BK, Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint, Advances in Neural Information Processing Systems 24 (NIPS 2011).
- [19].Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B and Smola A, A kernel two-sample test, Journal of Machine Learning Research, 13 (2012), 723–773.
- [20].Georgiev PG, Sánchez-González L and Pardalos PM, Construction of pairs of reproducing kernel Banach spaces, in “Constructive Nonsmooth Analysis and Related Topics”, Springer Optimization and Its Applications, vol. 87, Springer, 2013, pp. 39–57.
- [21].Hoefler T, Alistarh DA, Dryden N and Ben-Nun T, The future of deep learning will be sparse, SIAM News, May 3, 2021.
- [22].Huang L, Liu C, Tan L and Ye Q, Generalized representer theorems in Banach spaces, Analysis and Applications, 19 (2021), 125–146.
- [23].Kimeldorf GS and Wahba G, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Annals of Mathematical Statistics, 41 (1970), 495–502.
- [24].Li Q, Shen L, Xu Y and Zhang N, Multi-step fixed-point proximity algorithms for solving a class of optimization problems arising from image processing, Advances in Computational Mathematics, 41 (2015), 387–422.
- [25].Li Z, Xu Y and Ye Q, Sparse support vector machines in reproducing kernel Banach spaces, in “Contemporary Computational Mathematics - A Celebration of the 80th Birthday of Ian Sloan”, Springer, 2018, pp. 869–887.
- [26].Lin R, Song G and Zhang H, Multi-task learning in vector-valued reproducing kernel Banach spaces with the $\ell_1$ norm, Journal of Complexity, 63 (2021), 101514.
- [27].Lin R, Zhang H and Zhang J, On reproducing kernel Banach spaces: Generic definitions and unified framework of constructions, Acta Mathematica Sinica, 38 (2022), 1459–1483.
- [28].Liu Q, Wang R, Xu Y and Yan M, Parameter choices for sparse regularization with the $\ell_1$ norm, Inverse Problems, 39 (2023), 025004 (34 pages).
- [29].Mercer J, Functions of positive and negative type and their connection with the theory of integral equations, Philosophical Transactions of the Royal Society of London, Series A, 209 (1909), 415–446.
- [30].Micchelli CA and Pontil M, A function representation for learning in Banach spaces, in Learning Theory, Lecture Notes in Computer Science 3120, Springer, Berlin, 2004, pp. 255–269.
- [31].Micchelli CA, Shen L and Xu Y, Proximity algorithms for image models: denoising, Inverse Problems, 27 (2011), 045009.
- [32].Micchelli CA, Xu Y and Zhang H, Universal kernels, Journal of Machine Learning Research, 7 (2006), 2651–2667.
- [33].Muandet K, Fukumizu K, Sriperumbudur B and Schölkopf B, Kernel mean embedding of distributions: A review and beyond, Foundations and Trends in Machine Learning, 10 (2017), 1–141.
- [34].Owhadi H and Scovel C, Separability of reproducing kernel spaces, Proceedings of the American Mathematical Society, 145 (2017), 2131–2138.
- [35].Parhi R and Nowak RD, Banach space representer theorems for neural networks and ridge splines, Journal of Machine Learning Research, 22 (2021), 1–40.
- [36].Powell MJD, Approximation Theory and Methods, 1st Edition, Cambridge University Press, Cambridge, 1981.
- [37].Reed M and Simon B, Functional Analysis, Academic Press, New York, 1980.
- [38].Royden HL, Real Analysis, 3rd Edition, Macmillan Publishing Company, New York, 1988.
- [39].Salzo S and Suykens JAK, Generalized support vector regression: Duality and tensor-kernel representation, Analysis and Applications, 18 (2020), 149–183.
- [40].Schölkopf B, Herbrich R and Smola AJ, A generalized representer theorem, Computational Learning Theory, Lecture Notes in Computer Science, 2111 (2001), 416–426.
- [41].Schlegel K, When is there a representer theorem? Advances in Computational Mathematics, 47 (2021), article 54.
- [42].Schuster T, Kaltenbacher B, Hofmann B and Kazimierski KS, Regularization Methods in Banach Spaces, Vol. 10 in “Radon Series on Computational and Applied Mathematics”, De Gruyter, Berlin, 2012 (10.1515/9783110255720).
- [43].Shannon C, On Lipschitz implicit function theorems in Banach spaces and applications, Journal of Mathematical Analysis and Applications, 494 (2021), 124589.
- [44].Sheng B and Zuo L, Error analysis of the kernel regularized regression based on refined convex losses and RKBSs, International Journal of Wavelets, Multiresolution and Information Processing, 19 (2021), 2150012.
- [45].Song G and Zhang H, Reproducing kernel Banach spaces with the $\ell_1$ norm II: Error analysis for regularized least square regression, Neural Computation, 23 (2011), 2713–2729.
- [46].Song G, Zhang H and Hickernell FJ, Reproducing kernel Banach spaces with the $\ell_1$ norm, Applied and Computational Harmonic Analysis, 34 (2013), 96–116.
- [47].Spek L, Heeringa TJ and Brune C, Duality for neural networks through reproducing kernel Banach spaces, arXiv preprint arXiv:2211.05020, 2022.
- [48].Sridharan K and Tewari A, Convex games in Banach spaces, in “Proceedings of the 23rd Annual Conference on Learning Theory”, Omnipress, 2010, pp. 1–13.
- [49].Sriperumbudur BK, Fukumizu K and Lanckriet GRG, Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint, Advances in Neural Information Processing Systems, MIT Press, Cambridge, 2011, pp. 1773–1781.
- [50].Unser M, Representer theorems for sparsity-promoting $\ell_1$ regularization, IEEE Transactions on Information Theory, 62 (2016), 5167–5180.
- [51].Unser M, A unifying representer theorem for inverse problems and machine learning, Foundations of Computational Mathematics, 21 (2021), 941–960.
- [52].Wang R and Xu Y, Representer theorems in Banach spaces: Minimum norm interpolation, regularized learning and semi-discrete inverse problems, Journal of Machine Learning Research, 22 (2021), 1–65.
- [53].Wang R, Xu Y and Yan M, Representer theorems for sparse learning in Banach spaces, preprint, 2023.
- [54].Wang W, Lu S, Hofmann B and Cheng J, Tikhonov regularization with $\ell^0$-term complementing a convex penalty: $\Gamma$-convergence under sparsity constraints, Journal of Inverse and Ill-Posed Problems, 27 (2019), 575–590.
- [55].Xu Y and Ye Q, Generalized Mercer kernels and reproducing kernel Banach spaces, Memoirs of the American Mathematical Society, 258 (2019), no. 1243, 122 pages.
- [56].Xu Y and Zhang H, Refinable kernels, Journal of Machine Learning Research, 8 (2007), 2083–2120.
- [57].Xu Y and Zhang H, Refinement of reproducing kernels, Journal of Machine Learning Research, 10 (2009), 107–140.
- [58].Zhang H, Xu Y and Zhang J, Reproducing kernel Banach spaces for machine learning, Journal of Machine Learning Research, 10 (2009), 2741–2775.
- [59].Zhang H, Xu Y and Zhang Q, Refinement of operator-valued reproducing kernels, Journal of Machine Learning Research, 13 (2012), 91–136.
- [60].Zhang H and Zhang J, Frames, Riesz bases, and sampling expansions in Banach spaces via semi-inner products, Applied and Computational Harmonic Analysis, 31 (2011), 1–25.
- [61].Zhang H and Zhang J, Regularized learning in Banach spaces as an optimization problem: representer theorems, Journal of Global Optimization, 54 (2012), 235–250.
- [62].Zhang H and Zhang J, Vector-valued reproducing kernel Banach spaces with applications to multi-task learning, Journal of Complexity, 29 (2013), 195–215.
