Computational Science – ICCS 2020. 2020 May 22;12139:312–326. doi: 10.1007/978-3-030-50420-5_23

Learning Functions Using Data-Dependent Regularization: Representer Theorem Revisited

Qing Zou
Editors: Valeria V Krzhizhanovskaya, Gábor Závodszky, Michael H Lees, Jack J Dongarra, Peter M A Sloot, Sérgio Brissos, João Teixeira
PMCID: PMC7304014

Abstract

We introduce a data-dependent regularization problem which uses the geometric structure of the data to learn functions from incomplete data. In the course of introducing the problem, we give an alternative proof of the standard representer theorem. At the end of the paper, two applications in image processing are used to illustrate the function learning framework.

Keywords: Function learning, Manifold structure, Representer theorem

Introduction

Background

Many machine learning problems involve learning multidimensional functions from incomplete training data. For example, a classification problem can be viewed as learning a function whose values give the classes that the inputs belong to. The direct representation of a function in a high-dimensional space often suffers from the curse of dimensionality: the large number of parameters in the function representation translates into the need for extensive training data, which is expensive to obtain. However, researchers have found that many natural datasets exhibit considerable intrinsic structure, usually known as manifold structure, and this structure can be used to improve the learning results. Nowadays, the assumption that data lie on or close to a manifold has become common in machine learning; it is called the manifold assumption. Although the theoretical reasons why datasets exhibit manifold structure are not fully understood, exploiting it in supervised learning often gives excellent performance. In this work, we exploit the manifold structure to learn functions from incomplete training data.

A Motivating Example

One of the main problems in numerical analysis is function approximation. Over the last several decades, researchers have typically considered the following problem when applying the theory of function approximation to real-world problems:

\min_{f} \|Lf\|^2 \quad \text{subject to} \quad f(x_i) = y_i, \; i = 1, \dots, n, \qquad (1)

where $L$ is some linear operator, $\{(x_i, y_i)\}_{i=1}^{n}$ are the $n$ accessible observations, and $X$ is the input space. We can use the method of Lagrange multipliers to solve Problem (1). Assume that the search space for the function $f$ is large enough (for example, an $L^2$ space). Then the Lagrangian function $C(f)$ is given by

C(f) = \|Lf\|^2 + \sum_{i=1}^{n} \eta_i \big( f(x_i) - y_i \big).

Taking the gradient of the Lagrangian function with respect to the function $f$ gives us

\nabla C(f) = 2\,L^{*}Lf + \sum_{i=1}^{n} \eta_i\, \delta_{x_i},

where $\delta_{x_i}(\cdot) = \delta(\cdot - x_i)$ is the delta function and $L^{*}$ is the adjoint operator of $L$. Setting $\nabla C(f) = 0$, we have

2\,L^{*}Lf = -\sum_{i=1}^{n} \eta_i\, \delta_{x_i},

which gives us

f = -\frac{1}{2}\,(L^{*}L)^{-1} \sum_{i=1}^{n} \eta_i\, \delta_{x_i}.

This implies $f(x) = \sum_{i=1}^{n} \alpha_i\, G(x, x_i)$ for some $\alpha_i \in \mathbb{R}$, where $G$ is the Green's function of $L^{*}L$.
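For illustration, the following minimal numerical sketch mimics this conclusion in a discrete setting: a Gaussian bump is used as a stand-in for the Green's function $G$ (an assumption made purely for this example), and the coefficients $\alpha_i$ are obtained by enforcing the interpolation constraints $f(x_i) = y_i$.

# A minimal sketch of the conclusion above: the constrained problem is solved by an
# expansion f(x) = sum_i alpha_i G(x, x_i). A Gaussian bump stands in for the Green's
# function G (an illustrative assumption); the coefficients follow from f(x_i) = y_i.
import numpy as np

def green(x, z, sigma=0.5):
    # stand-in Green's function G(x, z)
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2))

# toy observations (x_i, y_i)
x_train = np.array([0.0, 0.3, 0.6, 1.0])
y_train = np.sin(2 * np.pi * x_train)

# Enforcing f(x_i) = y_i with f = sum_j alpha_j G(., x_j) gives the linear system
# G_mat @ alpha = y, where G_mat[i, j] = G(x_i, x_j).
G_mat = green(x_train[:, None], x_train[None, :])
alpha = np.linalg.solve(G_mat, y_train)

# Evaluate the learned function on a grid and check the constraints.
x_grid = np.linspace(0, 1, 101)
f_grid = green(x_grid[:, None], x_train[None, :]) @ alpha
print(np.allclose(G_mat @ alpha, y_train))  # True: the constraints are satisfied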

Kernels and Representer Theorem

With the rapid development of machine learning in recent years, kernel methods [1] have received much attention. Researchers found that working directly in the original data space often does not perform well, so the data are mapped into a high-dimensional space (the feature space) using a non-linear mapping (the feature map), where tasks such as classification become easier. A concept that cannot be avoided when talking about feature maps is the kernel, which, loosely speaking, is the inner product of the features. With a positive definite kernel, we obtain a corresponding reproducing kernel Hilbert space (RKHS) [2] $\mathcal{H}_K$. We can then solve a problem similar to (1) in the RKHS:

\min_{f \in \mathcal{H}_K} \|f\|_{\mathcal{H}_K}^2 \quad \text{subject to} \quad f(x_i) = y_i, \; i = 1, \dots, n.

A more feasible way is to consider a regularization problem in the RKHS:

\min_{f \in \mathcal{H}_K} \; \sum_{i=1}^{n} \big( f(x_i) - y_i \big)^2 + \lambda\, \|f\|_{\mathcal{H}_K}^2. \qquad (2)

The search space for $f$ then becomes $\mathcal{H}_K$, which is a Hilbert space. Before solving Problem (2), we recall some basic concepts about the RKHS. Suppose we have a positive definite kernel $K : X \times X \to \mathbb{R}$, i.e.,

\sum_{i=1}^{m} \sum_{j=1}^{m} c_i c_j K(x_i, x_j) \ge 0 \quad \text{for all } m \in \mathbb{N}, \; x_1, \dots, x_m \in X, \; c_1, \dots, c_m \in \mathbb{R};

then $\mathcal{H}_K$ is the Hilbert space corresponding to the kernel $K$. It is defined by all possible linear combinations of the kernel, i.e., $\mathcal{H}_K = \overline{\operatorname{span}}\{ K(\cdot, u) : u \in X \}$. Thus, for any $f \in \mathcal{H}_K$, there exist coefficients $\alpha_i \in \mathbb{R}$ and points $u_i \in X$ such that

f(x) = \sum_{i} \alpha_i\, K(x, u_i).

Since $\mathcal{H}_K$ is a Hilbert space, it is equipped with an inner product. The guiding principle in defining the inner product is to let each point $x$ have a representer $K(\cdot, x)$ that behaves like the delta function for functions in $\mathcal{H}_K$ (note that the delta function itself is not in $\mathcal{H}_K$). In other words, we want an analogue of the following formula:

\int f(y)\, \delta(y - x)\, dy = f(x).

This is called the reproducing relation or reproducing property. In $\mathcal{H}_K$, we want to define the inner product so that the reproducing relation holds:

\langle f, K(\cdot, x) \rangle_{\mathcal{H}_K} = f(x).

To achieve this goal, we can define

\Big\langle \sum_{i} \alpha_i K(\cdot, u_i), \; \sum_{j} \beta_j K(\cdot, v_j) \Big\rangle_{\mathcal{H}_K} = \sum_{i} \sum_{j} \alpha_i \beta_j\, K(u_i, v_j).

Then we have

\langle f, K(\cdot, x) \rangle_{\mathcal{H}_K} = \Big\langle \sum_{i} \alpha_i K(\cdot, u_i), \; K(\cdot, x) \Big\rangle_{\mathcal{H}_K} = \sum_{i} \alpha_i\, K(x, u_i) = f(x).

With the kernel, the feature map $\Phi : X \to \mathcal{H}_K$ can be defined as

\Phi(x) = K(\cdot, x).

With this knowledge about the RKHS, we can now look at the solution of Problem (2). It is characterized by the well-known representer theorem, which states that the solution of Problem (2) has the form

f(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i).

The standard proof of the representer theorem is well known and can be found in the literature, see for example [3, 4]. A drawback of the standard proof, however, is that it does not provide an expression for the coefficients $\alpha_i$. In the first part of this work, we give another proof of the representer theorem. As a by-product, we also build a relation between Problem (1) and Problem (2).
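As a concrete illustration of the representer theorem, the following sketch solves the squared-loss version of Problem (2) numerically; the Gaussian kernel, the toy data, and the value of $\lambda$ are illustrative assumptions. For the squared loss, the coefficients satisfy the linear system $(K + \lambda I)\alpha = y$, where $K$ here denotes the kernel Gram matrix on the training points (the classical kernel ridge regression solution).

# Kernel ridge regression sketch: the minimizer of the squared-loss version of Problem (2)
# has the representer form f(x) = sum_i alpha_i K(x, x_i), with (K + lambda*I) alpha = y.
import numpy as np

def gaussian_kernel(a, b, sigma=0.3):
    # K(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(20)

lam = 1e-2
K = gaussian_kernel(x_train, x_train)                               # Gram matrix on the training points
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)    # representer coefficients

x_test = np.linspace(0, 1, 200)
f_test = gaussian_kernel(x_test, x_train) @ alpha                   # f(x) = sum_i alpha_i K(x, x_i)
print(f_test[:5])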

Another Proof of Representer Theorem

To give another proof of the representer theorem, we first build some relations between $\mathcal{H}_K$ and $L^2(X)$. We endow the dataset $X$ with a measure $\mu$. The corresponding $L^2$ inner product is then given by

\langle f, g \rangle_{L^2} = \int_X f(x)\, g(x)\, d\mu(x).

Consider an operator L on f with respect to the kernel K:

(Lf)(x) = \int_X K(x, y)\, f(y)\, d\mu(y), \qquad (3)

which is a Hilbert-Schmidt integral operator [5]. This operator is self-adjoint, bounded, and compact. By the spectral theorem [6], the eigenfunctions $\{\phi_k\}$ of the operator form an orthonormal basis of $L^2(X)$, i.e.,

\langle \phi_k, \phi_l \rangle_{L^2} = \delta_{kl}.

With the operator $L$ defined as in (3), we can look at the relations between $\mathcal{H}_K$ and $L^2(X)$. Suppose $\{\phi_k\}$ are the eigenfunctions of the operator $L$ and $\{\lambda_k\}$ are the corresponding eigenvalues; then

(L\phi_k)(x) = \int_X K(x, y)\, \phi_k(y)\, d\mu(y) = \lambda_k\, \phi_k(x). \qquad (4)

But by the reproducing relation, we have

\langle \phi_k, K(\cdot, x) \rangle_{\mathcal{H}_K} = \phi_k(x).

Now let us look at how to represent $K(x, y)$ in terms of the eigenfunctions. We have

K(x, y) = \sum_{k} c_k(y)\, \phi_k(x),

and the coefficients $c_k(y)$ can be computed by

c_k(y) = \langle K(\cdot, y), \phi_k \rangle_{L^2} = (L\phi_k)(y) = \lambda_k\, \phi_k(y).

To see that $K(x, y) = \sum_k \lambda_k \phi_k(x)\phi_k(y)$, we can plug it into (4) to verify it:

\int_X \Big( \sum_k \lambda_k\, \phi_k(x)\, \phi_k(y) \Big) \phi_l(y)\, d\mu(y) = \lambda_l\, \phi_l(x).

Since the eigenfunctions of $L$ form an orthonormal basis of $L^2(X)$, any $f \in L^2(X)$ can be written as $f = \sum_k a_k \phi_k$. So we have

\|f\|_{\mathcal{H}_K}^2 = \Big\langle \sum_k a_k \phi_k, \sum_l a_l \phi_l \Big\rangle_{\mathcal{H}_K} = \sum_k \frac{a_k^2}{\lambda_k},

while for the $L^2$ norm, we have

\|f\|_{L^2}^2 = \sum_k a_k^2.

Next we show that the basis functions $\phi_k$ lie in $\mathcal{H}_K$. Note that

\phi_k(x) = \frac{1}{\lambda_k}\, (L\phi_k)(x) = \frac{1}{\lambda_k} \int_X K(x, y)\, \phi_k(y)\, d\mu(y),

which implies

\langle \phi_k, \phi_k \rangle_{\mathcal{H}_K} = \frac{1}{\lambda_k^2} \int_X \int_X K(y, z)\, \phi_k(y)\, \phi_k(z)\, d\mu(y)\, d\mu(z) = \frac{1}{\lambda_k^2} \langle L\phi_k, \phi_k \rangle_{L^2}.

So we can get

\|\phi_k\|_{\mathcal{H}_K}^2 = \frac{1}{\lambda_k^2}\, \lambda_k = \frac{1}{\lambda_k} < \infty.

Therefore, we get $\phi_k \in \mathcal{H}_K$.

We now investigate, for a given $f = \sum_k a_k \phi_k \in L^2(X)$, when we have $f \in \mathcal{H}_K$. For $f$ to be in $\mathcal{H}_K$, we need $\|f\|_{\mathcal{H}_K} < \infty$. So

\|f\|_{\mathcal{H}_K}^2 = \sum_k \frac{a_k^2}{\lambda_k} < \infty.

This means that for $f \in \mathcal{H}_K$, we need $\sum_k a_k^2 / \lambda_k < \infty$ [7].

Combining these analyses, we obtain the following relation between $\mathcal{H}_K$ and $L^2(X)$:

\mathcal{H}_K = \Big\{ f = \sum_k a_k \phi_k \in L^2(X) \;:\; \sum_k \frac{a_k^2}{\lambda_k} < \infty \Big\}.

Based on this relation, we can give another proof of the representer theorem.

Proof

Suppose $\{\phi_k\}$ are the eigenfunctions of the operator $L$. Then we can write the solution as $f = \sum_k a_k \phi_k$. For $f$ to lie in $\mathcal{H}_K$, we require $\sum_k a_k^2/\lambda_k < \infty$.

We consider here a more general form of Problem (2):

\min_{f \in \mathcal{H}_K} \; E\big(f(x_1), \dots, f(x_n)\big) + \lambda\, \|f\|_{\mathcal{H}_K}^2,

where $E$ is an error function that is differentiable with respect to each $f(x_i)$. We use the tools of the $L^2$ space to obtain the solution.

The cost function of the regularization problem is

C(f) = E\big(f(x_1), \dots, f(x_n)\big) + \lambda\, \|f\|_{\mathcal{H}_K}^2.

Substituting $f = \sum_k a_k \phi_k$ into the cost function, we have

C(f) = E\Big( \sum_k a_k \phi_k(x_1), \dots, \sum_k a_k \phi_k(x_n) \Big) + \lambda \sum_k \frac{a_k^2}{\lambda_k}.

Since

\frac{\partial f(x_i)}{\partial a_k} = \phi_k(x_i),

differentiating $C(f)$ with respect to each $a_k$ and setting the result equal to zero gives

\sum_{i=1}^{n} \frac{\partial E}{\partial f(x_i)}\, \phi_k(x_i) + \frac{2\lambda\, a_k}{\lambda_k} = 0.

Solving for $a_k$, we get

a_k = -\frac{\lambda_k}{2\lambda} \sum_{i=1}^{n} \frac{\partial E}{\partial f(x_i)}\, \phi_k(x_i).

Since $f = \sum_k a_k \phi_k$ and $K(x, x_i) = \sum_k \lambda_k \phi_k(x)\phi_k(x_i)$, we have

f(x) = \sum_k a_k\, \phi_k(x) = \sum_{i=1}^{n} \Big( -\frac{1}{2\lambda} \frac{\partial E}{\partial f(x_i)} \Big) \sum_k \lambda_k\, \phi_k(x)\, \phi_k(x_i) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i), \qquad \alpha_i = -\frac{1}{2\lambda} \frac{\partial E}{\partial f(x_i)}.

This proves the representer theorem.

Note that this argument not only proves the representer theorem but also gives the expression of the coefficients: $\alpha_i = -\frac{1}{2\lambda} \frac{\partial E}{\partial f(x_i)}$.
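As a quick sanity check of this expression, the following sketch assumes the squared loss $E = \sum_i (f(x_i) - y_i)^2$, for which $\partial E / \partial f(x_i) = 2\,(f(x_i) - y_i)$, and verifies numerically that the kernel ridge coefficients satisfy $\alpha_i = -\frac{1}{2\lambda}\, \partial E / \partial f(x_i)$; the kernel and the toy data are illustrative choices.

# Numerical check of the derived coefficient expression, assuming the squared loss:
# alpha_i should equal -(1/(2*lam)) * dE/df(x_i) = (y_i - f(x_i)) / lam.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, (15, 2))
y = np.sin(x[:, 0]) + np.cos(x[:, 1])
lam = 0.1

# Laplacian-type kernel on the training points (an illustrative choice).
dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
K = np.exp(-dists)

alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)   # solve the regularized problem
f_at_train = K @ alpha                                  # f(x_i) = sum_j alpha_j K(x_i, x_j)
dE_df = 2.0 * (f_at_train - y)                          # derivative of the squared loss

print(np.allclose(alpha, -dE_df / (2.0 * lam)))         # True: matches the derived formula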

With the operator $L$, we can also build a relation between Problem (1) and Problem (2). Take the operator in Problem (1) to be the inverse of the Hilbert-Schmidt integral operator; a discussion of the inverse of the Hilbert-Schmidt integral operator can be found in [8]. Note that for the delta function, we have

(L\delta_{x_i})(x) = \int_X K(x, y)\, \delta(y - x_i)\, d\mu(y) = K(x, x_i).

Then, by the derivation in the motivating example, the solution of Problem (1) satisfies $L^{-1}f = -\frac{1}{2}\sum_{i=1}^{n}\eta_i\,\delta_{x_i}$, so we have $L^{-1}f = \sum_{i=1}^{n}\alpha_i\,\delta_{x_i}$ with $\alpha_i = -\eta_i/2$. Applying $L$ on both sides gives

f = \sum_{i=1}^{n} \alpha_i\, L\delta_{x_i},

by which we obtain

f(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i).
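To make the operator $L$ and the norm relation above more tangible, the following discrete sketch takes $\mu$ to be the empirical measure on a set of sample points (an assumption made for illustration), so that $L$ becomes the scaled Gram matrix $\frac{1}{n}K$; its eigenvectors and eigenvalues play the role of $\phi_k$ and $\lambda_k$, and the two norms of a function $f = \sum_k a_k \phi_k$ are compared.

# Discrete sketch of the Hilbert-Schmidt operator (3) under the empirical measure and of
# the relation between the L2 and H_K norms derived above.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-1, 1, n)

K_gram = np.exp(-np.abs(x[:, None] - x[None, :]))     # Laplacian kernel Gram matrix (assumed kernel)
L_mat = K_gram / n                                    # discretization of L under the empirical measure
lam_k, phi = np.linalg.eigh(L_mat)                    # eigenvalues lambda_k, orthonormal eigenvectors

idx = np.argsort(lam_k)[::-1][:20]                    # keep the 20 largest eigenvalues (positive for a PD kernel)
a = rng.standard_normal(20)                           # coefficients a_k of a test function f = sum_k a_k phi_k
f_vals = phi[:, idx] @ a                              # values of f at the sample points

l2_norm_sq = np.sum(a ** 2)                           # discrete analogue of ||f||_{L2}^2  = sum_k a_k^2
hk_norm_sq = np.sum(a ** 2 / lam_k[idx])              # discrete analogue of ||f||_{H_K}^2 = sum_k a_k^2 / lambda_k
print(np.allclose(np.sum(f_vals ** 2), l2_norm_sq))   # True: the eigenvectors are orthonormal
print(l2_norm_sq, hk_norm_sq)                         # the H_K norm penalizes components with small lambda_k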

Data-Dependent Regularization

So far, we have introduced the standard representer theorem. As discussed at the very beginning, however, many natural datasets exhibit manifold structure. Based on the classical Problem (2), we therefore introduce a new learning problem that exploits the manifold structure of the data; we call it the data-dependent regularization problem. Regularization has a long history going back to Tikhonov [9], who proposed Tikhonov regularization to solve ill-posed inverse problems.

To exploit the manifold structure of the data, we divide a function into two parts: the restriction of the function to the manifold and the restriction of the function to the complement of the manifold. The problem can then be formulated as

\min_{f \in \mathcal{H}_K} \; \sum_{i=1}^{n} \big( f(x_i) - y_i \big)^2 + \lambda_1 \|f_{\mathcal{M}}\|_{\mathcal{M}}^2 + \lambda_2 \|f_{\mathcal{M}^c}\|_{\mathcal{M}^c}^2, \qquad (5)

where $f_{\mathcal{M}} = f|_{\mathcal{M}}$ and $f_{\mathcal{M}^c} = f|_{\mathcal{M}^c}$. The norms $\|\cdot\|_{\mathcal{M}}$ and $\|\cdot\|_{\mathcal{M}^c}$ will be explained in detail later. $\lambda_1$ and $\lambda_2$ are two parameters that control how strongly the energy of the function on the manifold and off the manifold is penalized. We will show later that, by controlling the two balancing parameters (setting $\lambda_1 = \lambda_2$), the standard representer theorem becomes a special case of Problem (5).
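Although the norms in (5) are only made precise later in this section, the following hedged sketch shows one way such a two-term problem can be solved numerically with squared loss, assuming the two penalties correspond to two positive definite kernels $K_1$ and $K_2$ (stand-ins for the restricted kernels introduced below). Writing $f = f_1 + f_2$ and expanding both parts at the training points, the optimality conditions lead to a single linear system involving the weighted kernel $K_1/\lambda_1 + K_2/\lambda_2$.

# Hedged sketch of a two-term regularization problem of the form (5) with squared loss.
# With f = f1 + f2 expanded at the training points, the optimality conditions give
# coefficients r solving (G1/lam1 + G2/lam2 + I) r = y, and
# f(x) = sum_j [K1(x, x_j)/lam1 + K2(x, x_j)/lam2] r_j.
import numpy as np

def laplacian(a, b, sigma=1.0):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def gaussian(a, b, sigma=1.0):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (30, 2))
y = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])

lam1, lam2 = 0.1, 10.0                      # lam2 >> lam1 penalizes the "off-manifold" part more
G1, G2 = laplacian(X, X), gaussian(X, X)    # illustrative stand-ins for the two restricted kernels
r = np.linalg.solve(G1 / lam1 + G2 / lam2 + np.eye(len(X)), y)

def predict(X_new):
    return (laplacian(X_new, X) / lam1 + gaussian(X_new, X) / lam2) @ r

print(predict(X[:5]))                       # fitted values at the first five training points

Note that when $\lambda_1 = \lambda_2 = \lambda$ and $K_1 + K_2 = K$, the system reduces to $(K + \lambda I)\alpha = y$, which is consistent with the reduction to Problem (2) discussed at the end of this section.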

We now discuss the functions $f_{\mathcal{M}}$ and $f_{\mathcal{M}^c}$. Consider the ambient space $X$ (or $\mathbb{R}^d$) and a positive definite kernel $K$. Let us first look at the restriction of $K$ to the manifold $\mathcal{M}$. The restriction is again a positive definite kernel [2] and therefore has a corresponding Hilbert space. We use the relation between the RKHS $\mathcal{H}_K$ and the restricted RKHS to explain the norms $\|\cdot\|_{\mathcal{M}}$ and $\|\cdot\|_{\mathcal{M}^c}$.

Lemma 1

([10]). Suppose $K : X \times X \to \mathbb{R}$ (or $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$) is a positive definite kernel. Let $\mathcal{M}$ be a subset of $X$ (or $\mathbb{R}^d$), and let $\mathcal{F}(\mathcal{M})$ denote all the functions defined on $\mathcal{M}$. Then the RKHS given by the restricted kernel $K_{\mathcal{M}} = K|_{\mathcal{M} \times \mathcal{M}}$ is

\mathcal{H}_{K_{\mathcal{M}}} = \big\{ g \in \mathcal{F}(\mathcal{M}) \;:\; g = f|_{\mathcal{M}} \text{ for some } f \in \mathcal{H}_K \big\}, \qquad (6)

with the norm defined as

\|g\|_{\mathcal{H}_{K_{\mathcal{M}}}} = \min \big\{ \|f\|_{\mathcal{H}_K} \;:\; f \in \mathcal{H}_K, \; f|_{\mathcal{M}} = g \big\}.

Proof

Define the set

F_g = \big\{ f \in \mathcal{H}_K \;:\; f|_{\mathcal{M}} = g \big\}.

We first show that the minimum of $\|f\|_{\mathcal{H}_K}$ over $F_g$ is attained for any $g$. Choose a minimizing sequence $\{f_n\} \subset F_g$. The sequence is bounded in the Hilbert space $\mathcal{H}_K$, so by the Banach-Alaoglu theorem [11] we may assume that $\{f_n\}$ converges weakly. Weak convergence implies pointwise convergence by the reproducing property, so the weak limit still restricts to $g$ on $\mathcal{M}$ and attains the minimum.

We further define $\|g\|_{\mathcal{H}_{K_{\mathcal{M}}}} = \min_{f \in F_g} \|f\|_{\mathcal{H}_K}$. We show that the resulting space is a Hilbert space by verifying the parallelogram law. In other words, we are going to show that for all $g_1, g_2 \in \mathcal{H}_{K_{\mathcal{M}}}$,

\|g_1 + g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + \|g_1 - g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 = 2\|g_1\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + 2\|g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2.

Since the minimum is attained, for every $g \in \mathcal{H}_{K_{\mathcal{M}}}$ there exists $f \in \mathcal{H}_K$ such that

f|_{\mathcal{M}} = g \quad \text{and} \quad \|f\|_{\mathcal{H}_K} = \|g\|_{\mathcal{H}_{K_{\mathcal{M}}}}.

By the definition of the norm, we can choose $f_1, f_2 \in \mathcal{H}_K$ such that

f_1|_{\mathcal{M}} = g_1, \qquad \|f_1\|_{\mathcal{H}_K} = \|g_1\|_{\mathcal{H}_{K_{\mathcal{M}}}},

and

f_2|_{\mathcal{M}} = g_2, \qquad \|f_2\|_{\mathcal{H}_K} = \|g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}.

Thus, since $(f_1 + f_2)|_{\mathcal{M}} = g_1 + g_2$ and $(f_1 - f_2)|_{\mathcal{M}} = g_1 - g_2$, we have

\|g_1 + g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + \|g_1 - g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 \le \|f_1 + f_2\|_{\mathcal{H}_K}^2 + \|f_1 - f_2\|_{\mathcal{H}_K}^2 = 2\|f_1\|_{\mathcal{H}_K}^2 + 2\|f_2\|_{\mathcal{H}_K}^2 = 2\|g_1\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + 2\|g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2.

For the reverse inequality, we first choose $h_1, h_2 \in \mathcal{H}_K$ such that $h_1|_{\mathcal{M}} = g_1 + g_2$, $\|h_1\|_{\mathcal{H}_K} = \|g_1 + g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}$ and $h_2|_{\mathcal{M}} = g_1 - g_2$, $\|h_2\|_{\mathcal{H}_K} = \|g_1 - g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}$. Then, since $\tfrac{1}{2}(h_1 + h_2)|_{\mathcal{M}} = g_1$ and $\tfrac{1}{2}(h_1 - h_2)|_{\mathcal{M}} = g_2$,

2\|g_1\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + 2\|g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 \le \tfrac{1}{2}\|h_1 + h_2\|_{\mathcal{H}_K}^2 + \tfrac{1}{2}\|h_1 - h_2\|_{\mathcal{H}_K}^2 = \|h_1\|_{\mathcal{H}_K}^2 + \|h_2\|_{\mathcal{H}_K}^2 = \|g_1 + g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + \|g_1 - g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2.

Therefore, we get

\|g_1 + g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + \|g_1 - g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 = 2\|g_1\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2 + 2\|g_2\|_{\mathcal{H}_{K_{\mathcal{M}}}}^2.

Next, we show (6) by showing that for all $g \in \mathcal{H}_{K_{\mathcal{M}}}$ and $p \in \mathcal{M}$,

g(p) = \langle g, K_{\mathcal{M}}(\cdot, p) \rangle_{\mathcal{H}_{K_{\mathcal{M}}}},

where $K_{\mathcal{M}}(\cdot, p) = K(\cdot, p)|_{\mathcal{M}}$.

Choose $f_g \in \mathcal{H}_K$ such that $f_g|_{\mathcal{M}} = g$ and $\|f_g\|_{\mathcal{H}_K} = \|g\|_{\mathcal{H}_{K_{\mathcal{M}}}}$. This is possible because of the analysis above. In particular, we have

g(p) = f_g(p) = \langle f_g, K(\cdot, p) \rangle_{\mathcal{H}_K}.

Now, for any function $h \in \mathcal{H}_K$ such that $h|_{\mathcal{M}} = 0$, we have

\langle f_g, h \rangle_{\mathcal{H}_K} = 0,

since otherwise $f_g - t h$ for a suitable $t \in \mathbb{R}$ would be an extension of $g$ with smaller norm. Thus, decomposing $K(\cdot, p)$ into the minimal-norm extension of $K_{\mathcal{M}}(\cdot, p)$ plus a function vanishing on $\mathcal{M}$,

\langle g, K_{\mathcal{M}}(\cdot, p) \rangle_{\mathcal{H}_{K_{\mathcal{M}}}} = \langle f_g, K(\cdot, p) \rangle_{\mathcal{H}_K} = g(p).

This completes the proof of the lemma.

With this lemma, the solution of Problem (5) becomes easy to obtain. Since the training points $x_i$ lie on the manifold, the data-fit term only involves $f_{\mathcal{M}}$, and by the representer theorem mentioned above we know that the function satisfying

\min_{f_{\mathcal{M}} \in \mathcal{H}_{K_{\mathcal{M}}}} \; \sum_{i=1}^{n} \big( f_{\mathcal{M}}(x_i) - y_i \big)^2 + \lambda_1 \|f_{\mathcal{M}}\|_{\mathcal{M}}^2

is $f_{\mathcal{M}}(x) = \sum_{i=1}^{n} \alpha_i K_{\mathcal{M}}(x, x_i)$. Since we have

\|f_{\mathcal{M}}\|_{\mathcal{M}} = \min \big\{ \|f\|_{\mathcal{H}_K} \;:\; f \in \mathcal{H}_K, \; f|_{\mathcal{M}} = f_{\mathcal{M}} \big\},
\|f_{\mathcal{M}^c}\|_{\mathcal{M}^c} = \min \big\{ \|f\|_{\mathcal{H}_K} \;:\; f \in \mathcal{H}_K, \; f|_{\mathcal{M}^c} = f_{\mathcal{M}^c} \big\},

we can conclude that the solution of (5) is exactly

f(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i),

where the coefficients $\alpha_i$ are controlled by the parameters $\lambda_1$ and $\lambda_2$.

With the norms $\|\cdot\|_{\mathcal{M}}$ and $\|\cdot\|_{\mathcal{M}^c}$ well-defined, we would like to find the relation between $\|\cdot\|_{\mathcal{M}}$, $\|\cdot\|_{\mathcal{M}^c}$ and $\|\cdot\|_{\mathcal{H}_K}$. Before stating the relation, we restate some of the notation to make the statement clearer. Let

\mathcal{M}^c = X \setminus \mathcal{M},

and

K_{\mathcal{M}} = K|_{\mathcal{M} \times \mathcal{M}}, \qquad K_{\mathcal{M}^c} = K|_{\mathcal{M}^c \times \mathcal{M}^c},
\mathcal{H}_{\mathcal{M}} = \mathcal{H}_{K_{\mathcal{M}}}, \qquad \|\cdot\|_{\mathcal{M}} = \|\cdot\|_{\mathcal{H}_{K_{\mathcal{M}}}},
\mathcal{H}_{\mathcal{M}^c} = \mathcal{H}_{K_{\mathcal{M}^c}}, \qquad \|\cdot\|_{\mathcal{M}^c} = \|\cdot\|_{\mathcal{H}_{K_{\mathcal{M}^c}}}.

To find the relation between $\|\cdot\|_{\mathcal{M}}$, $\|\cdot\|_{\mathcal{M}^c}$ and $\|\cdot\|_{\mathcal{H}_K}$, we need to pull the restricted kernels $K_{\mathcal{M}}$ and $K_{\mathcal{M}^c}$ back to the original space. To do so, define

\widetilde{K}_{\mathcal{M}}(x, y) = \begin{cases} K(x, y), & x, y \in \mathcal{M}, \\ 0, & \text{otherwise}, \end{cases}
\qquad
\widetilde{K}_{\mathcal{M}^c}(x, y) = \begin{cases} K(x, y), & x, y \in \mathcal{M}^c, \\ 0, & \text{otherwise}. \end{cases}

Then we have $K = \widetilde{K}_{\mathcal{M}} + \widetilde{K}_{\mathcal{M}^c}$. The corresponding Hilbert spaces for $\widetilde{K}_{\mathcal{M}}$ and $\widetilde{K}_{\mathcal{M}^c}$ are

\mathcal{H}_{\widetilde{K}_{\mathcal{M}}} = \big\{ f : X \to \mathbb{R} \;:\; f|_{\mathcal{M}} \in \mathcal{H}_{K_{\mathcal{M}}}, \; f|_{\mathcal{M}^c} = 0 \big\},
\mathcal{H}_{\widetilde{K}_{\mathcal{M}^c}} = \big\{ f : X \to \mathbb{R} \;:\; f|_{\mathcal{M}^c} \in \mathcal{H}_{K_{\mathcal{M}^c}}, \; f|_{\mathcal{M}} = 0 \big\}.

It is straightforward to define

\|f\|_{\mathcal{H}_{\widetilde{K}_{\mathcal{M}}}} = \|f|_{\mathcal{M}}\|_{\mathcal{M}}, \qquad \|f\|_{\mathcal{H}_{\widetilde{K}_{\mathcal{M}^c}}} = \|f|_{\mathcal{M}^c}\|_{\mathcal{M}^c}.

The following lemma shows the relation between $\mathcal{H}_{\widetilde{K}_{\mathcal{M}}}$, $\mathcal{H}_{\widetilde{K}_{\mathcal{M}^c}}$ and $\mathcal{H}_K$, which also reveals the relation between $\|\cdot\|_{\mathcal{M}}$, $\|\cdot\|_{\mathcal{M}^c}$ and $\|\cdot\|_{\mathcal{H}_K}$ by the Moore-Aronszajn theorem [12].

Lemma 2

Suppose $K_1, K_2 : X \times X \to \mathbb{R}$ (or $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$) are two positive definite kernels. If $K = K_1 + K_2$, then

\mathcal{H}_K = \big\{ f = f_1 + f_2 \;:\; f_1 \in \mathcal{H}_{K_1}, \; f_2 \in \mathcal{H}_{K_2} \big\}

is a Hilbert space with the norm defined by

\|f\|_{\mathcal{H}_K}^2 = \min \big\{ \|f_1\|_{\mathcal{H}_{K_1}}^2 + \|f_2\|_{\mathcal{H}_{K_2}}^2 \;:\; f = f_1 + f_2, \; f_1 \in \mathcal{H}_{K_1}, \; f_2 \in \mathcal{H}_{K_2} \big\}.

The idea of the proof of this lemma is exactly the same as the one for Lemma 1. Thus we omit it here.

A direct corollary of this lemma is:

Corollary 1

Under the assumptions of Lemma 2, if $\mathcal{H}_{K_1}$ and $\mathcal{H}_{K_2}$ have no function in common except the zero function, then the norm of $f = f_1 + f_2 \in \mathcal{H}_K$ is given simply by

\|f\|_{\mathcal{H}_K}^2 = \|f_1\|_{\mathcal{H}_{K_1}}^2 + \|f_2\|_{\mathcal{H}_{K_2}}^2.

Going back to our scenario, Corollary 1 gives the following result:

\|f\|_{\mathcal{H}_K}^2 = \|f_{\mathcal{M}}\|_{\mathcal{M}}^2 + \|f_{\mathcal{M}^c}\|_{\mathcal{M}^c}^2.

This means that if we set $\lambda_1 = \lambda_2 = \lambda$ in Problem (5), it reduces to Problem (2). Therefore, the standard representer theorem is a special case of our data-dependent regularization problem (5).

Applications

As stated in the introduction, many engineering problems can be viewed as learning multidimensional functions from incomplete data. In this section, we show two applications of function learning: image interpolation and patch-based image denoising.

Image Interpolation

Image interpolation tries to approximate the color and intensity of a pixel from the values at surrounding pixels; see Fig. 1 for an illustration. From the function learning perspective, image interpolation learns a function from the known pixel positions to their values.

Fig. 1.

Illustration of image interpolation. We want to enlarge the original image to a larger grid, so the blue shaded positions are unknown. Using image interpolation, we can find the values at these positions. (Color figure online)

We use the Lena image shown in Fig. 2(a) to give an example of image interpolation using the proposed framework; the zoomed image is shown in Fig. 2(d). In this example, the two balancing parameters are set to be equal and the Laplacian kernel [13] is used:

K(x, y) = \exp\Big( -\frac{\|x - y\|}{\sigma} \Big).

Note that other kernels, for example the polynomial kernel or the Gaussian kernel, could also be used for image interpolation. Choosing the right kernel is an interesting problem in itself, but we do not have space to compare different kernels in this paper.
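The following sketch illustrates how such an interpolation experiment can be set up as function learning; the image content, the subsampling pattern, the kernel width, and the regularization parameter are illustrative assumptions rather than the settings used in the paper.

# Image interpolation as function learning: known pixels and their (row, col) coordinates
# are training data, a Laplacian kernel regressor is fitted, and missing positions are
# filled in by evaluating the learned function.
import numpy as np

def laplacian_kernel(A, B, sigma=2.0):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / sigma)

rng = np.random.default_rng(4)
img = rng.uniform(0, 1, (32, 32))            # stand-in for the original grayscale image

rows, cols = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
values = img.ravel()

known = (rows % 3 == 0) & (cols % 3 == 0)    # keep every third pixel (downsampling by 3)
Xk, yk = coords[known.ravel()], values[known.ravel()]

lam = 1e-3
alpha = np.linalg.solve(laplacian_kernel(Xk, Xk) + lam * np.eye(len(Xk)), yk)
interpolated = (laplacian_kernel(coords, Xk) @ alpha).reshape(32, 32)
print(np.abs(interpolated - img)[known].max())   # small error at the known pixels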

Fig. 2.

Illustration of image interpolation. The original image is downsampled by a factor of 3 in each direction. We use the proposed function learning framework to obtain the interpolation function from the downsampled image. From the results, we can see that the proposed framework works for image interpolation.

In Fig. 2(b), the original image is downsampled by a factor of 3 in each direction. The zoomed image is shown in Fig. 2(e). The interpolation result and its zoomed version are shown in Fig. 2(c) and Fig. 2(f).

Patch-Based Image Denoising

From the function learning point of view, patch-based image denoising can be viewed as learning a function from noisy patches to their "noise-free" center pixels. See Fig. 3 for an illustration.

Fig. 3.

Illustration of patch-based image denoising. It can be viewed as learning a function from the noisy patches to the clean center pixels.

In the patch-based image denoising application, we also use the Laplacian kernel. We assume that the noisy patches lie close to some manifold, so we set the balancing parameter that controls the energy off the manifold to be large. We use the images in Fig. 4 as known data to learn the function; for a given noisy image, the learned function is then used for denoising. To speed up the learning process, we randomly choose only 10% of the known data to learn the function.
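A rough sketch of this setup is given below. The patch size, noise level, kernel width, and the use of the single-kernel Problem (2) in place of the full data-dependent Problem (5) are simplifying assumptions for illustration; only the random 10% subsampling mirrors the procedure described above.

# Patch-based denoising as function learning: map each noisy patch to an estimate of its
# clean center pixel using kernel regression on a 10% subsample of training patches.
import numpy as np

def laplacian_kernel(A, B, sigma=5.0):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def extract_patches(img, p=5):
    # return all p x p patches (as rows) and their center pixels
    H, W = img.shape
    patches, centers = [], []
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            patches.append(img[i:i + p, j:j + p].ravel())
            centers.append(img[i + p // 2, j + p // 2])
    return np.array(patches), np.array(centers)

rng = np.random.default_rng(5)
clean = rng.uniform(0, 1, (40, 40))                    # stand-in for a training image
noisy = clean + 0.05 * rng.standard_normal(clean.shape)

noisy_patches, _ = extract_patches(noisy)
_, clean_centers = extract_patches(clean)

keep = rng.random(len(noisy_patches)) < 0.10           # random 10% of the known data
Xtr, ytr = noisy_patches[keep], clean_centers[keep]

lam = 1e-2
alpha = np.linalg.solve(laplacian_kernel(Xtr, Xtr) + lam * np.eye(len(Xtr)), ytr)

def denoise_patch(patch):                              # estimate of the clean center pixel
    return laplacian_kernel(patch.reshape(1, -1), Xtr) @ alpha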

Fig. 4.

Four training images. We use noisy images and clean pixels to learn the denoising function.

We use the image Baboon to test the learned denoising function. The denoising results are shown in Fig. 5. Each column shows the result corresponding to one noise level.

Fig. 5.

Illustration of the denoising results.

Conclusion and Future Work

In this paper, we introduced a framework for learning functions from partial data. We posed a data-dependent regularization problem that lets us learn a function using the manifold structure of the data, and we used two applications to illustrate the framework. These two applications, however, cover only part of the framework: they are special cases of the data-dependent regularization problem. For general applications, we need to calculate the norms $\|f_{\mathcal{M}}\|_{\mathcal{M}}$ and $\|f_{\mathcal{M}^c}\|_{\mathcal{M}^c}$, which is hard to do since we only have partial data. We therefore need to approximate these norms from incomplete data and to propose a new learning algorithm so that the framework can be used in general applications; this is part of our future work. Another line of future work is theoretical. We showed that the solution of the data-dependent regularization problem is a linear combination of kernel functions, so it can be viewed as a function approximation result, and one can then consider the error analysis of the approximated function.

Contributor Information

Valeria V. Krzhizhanovskaya, Email: V.Krzhizhanovskaya@uva.nl

Gábor Závodszky, Email: G.Zavodszky@uva.nl.

Michael H. Lees, Email: m.h.lees@uva.nl

Jack J. Dongarra, Email: dongarra@icl.utk.edu

Peter M. A. Sloot, Email: p.m.a.sloot@uva.nl

Sérgio Brissos, Email: sergio.brissos@intellegibilis.com.

João Teixeira, Email: joao.teixeira@intellegibilis.com.

Qing Zou, Email: zou-qing@uiowa.edu.

References

  • 1. Schölkopf B, Smola AJ, Bach F. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge: MIT Press; 2002.
  • 2. Aronszajn N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950;68(3):337–404. doi: 10.1090/S0002-9947-1950-0051437-7.
  • 3. Schölkopf B, Herbrich R, Smola AJ. A generalized representer theorem. In: Helmbold D, Williamson B, editors. Computational Learning Theory. Heidelberg: Springer; 2001. pp. 416–426.
  • 4. Argyriou A, Micchelli CA, Pontil M. When is there a representer theorem? Vector versus matrix regularizers. J. Mach. Learn. Res. 2009;10:2507–2529.
  • 5. Gohberg I, Goldberg S, Kaashoek MA. Hilbert-Schmidt operators. In: Classes of Linear Operators, vol. I. Basel: Birkhäuser; 1990. pp. 138–147.
  • 6. Helmberg G. Introduction to Spectral Theory in Hilbert Space. New York: Courier Dover Publications; 2008.
  • 7. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 2006;7:2399–2434.
  • 8. Pipkin AC. A Course on Integral Equations. New York: Springer; 1991.
  • 9. Tikhonov AN. Regularization of incorrectly posed problems. Soviet Math. Doklady. 1963;4(6):1624–1627.
  • 10. Saitoh S, Sawano Y. Theory of Reproducing Kernels and Applications. Singapore: Springer; 2016.
  • 11. Rudin W. Functional Analysis. Boston, MA: McGraw-Hill; 1991.
  • 12. Awan DA, Cavalcante RLG, Yukawa M, Stanczak S. Adaptive learning for symbol detection: a reproducing kernel Hilbert space approach. In: Machine Learning for Future Wireless Communications. 2020. pp. 197–211.
  • 13. Kernel Functions for Machine Learning Applications. http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/
