J Math Imaging Vis. 2016 Jun 1;57(1):1–25. doi: 10.1007/s10851-016-0662-8

Bilevel Parameter Learning for Higher-Order Total Variation Regularisation Models

J. C. De los Reyes, C.-B. Schönlieb, T. Valkonen
PMCID: PMC7175605  PMID: 32355410

Abstract

We consider a bilevel optimisation approach for parameter learning in higher-order total variation image reconstruction models. Apart from the least squares cost functional, naturally used in bilevel learning, we propose and analyse an alternative cost based on a Huber-regularised TV seminorm. Differentiability properties of the solution operator are verified and a first-order optimality system is derived. Based on the adjoint information, a combined quasi-Newton/semismooth Newton algorithm is proposed for the numerical solution of the bilevel problems. Numerical experiments are carried out to show the suitability of our approach and the improved performance of the new cost functional. The bilevel optimisation framework also enables a detailed comparison between TGV2 and ICTV, showing the advantages and shortcomings of both regularisers depending on the structure of the processed images and their noise level.

Keywords: Bilevel optimisation, Total variation regularisers, Image quality measures

Introduction

In this paper, we propose a bilevel optimisation approach for parameter learning in higher-order total variation regularisation models for image restoration. The reconstruction of an image from imperfect measurements is essential for all research which relies on the analysis and interpretation of image content. Mathematical image reconstruction approaches aim to maximise the information gain from acquired image data by intelligent modelling and mathematical analysis.

A variational image reconstruction model can be formalised as follows: Given data f which is related to an image (or to certain image information, e.g. a segmented or edge detected image) u through a generic forward operator (or function) K, the task is to retrieve u from f. In most realistic situations, this retrieval is complicated by the ill-posedness of K as well as random noise in f. A widely accepted method that approximates this ill-posed problem by a well-posed one and counteracts the noise is the method of Tikhonov regularisation. That is, an approximation to the true image is computed as a minimiser of

\alpha R(u) + d(K(u), f),     (1.1)

where R is a regularising energy that models a-priori knowledge about the image u, d(·,·) is a suitable distance function that models the relation of the data f to the unknown u, and α>0 is a parameter that balances our trust in the forward model against the need of regularisation. The parameter α, in particular, depends on the amount of ill-posedness in the operator K and the amount (amplitude) of the noise present in f. A key issue in imaging inverse problems is the correct choice of α, image priors (regularisation functionals R), fidelity terms d and (if applicable) the choice of what to measure (the linear or non-linear operator K). Depending on this choice, different reconstruction results are obtained.
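To fix ideas, the following minimal sketch (our own illustration, not taken from the paper) instantiates (1.1) in its simplest quadratic form, with K = Id, d the squared L2 distance and R(u) = ½‖∇u‖²; the minimiser then solves the linear system (I − αΔ)u = f, assembled here with finite differences.

```python
import numpy as np
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import spsolve

def laplacian_1d(n):
    # 1-D finite-difference Laplacian with Neumann-type boundary rows
    main = -2.0 * np.ones(n)
    main[0] = main[-1] = -1.0
    return diags([np.ones(n - 1), main, np.ones(n - 1)], [-1, 0, 1])

def tikhonov_denoise(f, alpha):
    # Solve (I - alpha * Laplacian) u = f, the optimality system of
    # min_u alpha/2 ||grad u||^2 + 1/2 ||u - f||^2  (a quadratic instance of (1.1))
    ny, nx = f.shape
    lap = kron(identity(ny), laplacian_1d(nx)) + kron(laplacian_1d(ny), identity(nx))
    A = (identity(nx * ny) - alpha * lap).tocsc()
    return spsolve(A, f.ravel()).reshape(f.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.zeros((64, 64)); clean[16:48, 16:48] = 1.0
    noisy = clean + 0.1 * rng.standard_normal(clean.shape)
    denoised = tikhonov_denoise(noisy, alpha=2.0)
    print(np.linalg.norm(denoised - clean) / np.linalg.norm(noisy - clean))
```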

While functional modelling (1.1) constitutes a mathematically rigorous and physical way of setting up the reconstruction of an image—providing reconstruction guarantees in terms of error and stability estimates—it is limited with respect to its adaptivity for real data. On the other hand, data-based modelling of reconstruction approaches is set up to produce results which are optimal with respect to the given data. However, in general, it neither offers insights into the structural properties of the model nor provides comprehensible reconstruction guarantees. Indeed, we believe that for the development of reliable, comprehensible and at the same time effective models (1.1), it is essential to aim for a unified approach that seeks tailor-made regularisation and data models by combining model- and data-based approaches.

To do so, we focus on a bilevel optimisation strategy for finding an optimal setup of variational regularisation models (1.1). That is, for a given training pair of noisy and original clean images (f,f0), respectively, we consider a learning problem of the form

\min_{\alpha} F(u) = \mathrm{cost}(u, f_0) \quad \text{subject to} \quad u \in \operatorname*{arg\,min}_{u} \left\{\alpha R(u) + d(K(u), f)\right\},     (1.2)

where F is a generic cost functional that measures the fitness of u to the training image f0. The argument of the minimisation problem will depend on the specific setup (i.e. the degrees of freedom) in the constraint problem (1.1). In particular, we propose a bilevel optimisation approach for learning optimal parameters in higher-order total variation regularisation models for image reconstruction in which the arguments of the optimisation constitute parameters in front of the first- and higher-order regularisation terms.
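A minimal sketch of the bilevel structure (1.2): the lower-level problem is solved by a denoiser (here the hypothetical tikhonov_denoise from the previous sketch) and the upper level is reduced, for illustration only, to a brute-force search over α with the L2 cost; the derivative-based method developed later in the paper replaces this search.

```python
import numpy as np

def learn_alpha(f, f0, denoiser, alphas):
    # Upper level: pick the alpha whose lower-level reconstruction u(alpha)
    # is closest to the ground truth f0 in the squared L2 cost.
    best_alpha, best_cost = None, np.inf
    for alpha in alphas:
        u = denoiser(f, alpha)              # lower-level problem (1.1)
        cost = 0.5 * np.sum((u - f0) ** 2)  # upper-level cost in (1.2)
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha, best_cost

# Usage, assuming tikhonov_denoise and the images from the previous sketch:
# alpha_opt, _ = learn_alpha(noisy, clean, tikhonov_denoise, np.geomspace(0.01, 10.0, 20))
```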

Rather than working on the discrete problem, as is done in standard parameter learning and model optimisation methods, we optimise the regularisation models in infinite-dimensional function space. The resulting problems are difficult to treat due to the non-smooth structure of the lower level problem, which makes it impossible to verify standard constraint qualification conditions for Karush–Kuhn–Tucker (KKT) systems. Therefore, in order to obtain characterising first-order necessary optimality conditions, alternative analytical approaches have emerged, in particular regularisation techniques [4, 20, 28]. We consider such an approach here and study the related regularised problem in depth. In particular, we prove the Fréchet differentiability of the regularised solution operator, which enables to obtain an optimality condition for the problem under consideration and an adjoint state for the efficient numerical solution of the problem. The bilevel problems under consideration are related to the emerging field of generalised mathematical programmes with equilibrium constraints (MPEC) in function space. Let us remark that even for finite-dimensional problems, there are few recent references dealing with stationarity conditions and solution algorithms for this type of problems (see, e.g. [18, 30, 33, 34, 38]).

Let us give an account to the state of the art of bilevel optimisation for model learning. In machine learning, bilevel optimisation is well established. It is a semi-supervised learning method that optimally adapts itself to a given dataset of measurements and desirable solutions. In [15, 23, 43], for instance, the authors consider bilevel optimisation for finite-dimensional Markov random field models. In inverse problems, the optimal inversion and experimental acquisition setup is discussed in the context of optimal model design in works by Haber, Horesh and Tenorio [25, 26], as well as Ghattas et al. [3, 9]. Recently, parameter learning in the context of functional variational regularisation models (1.1) also entered the image processing community with works by the authors [10, 22], Kunisch, Pock and co-workers [14, 33], Chung et al. [16] and Hintermüller et al. [30].

Apart from the work of the authors [10, 22], all approaches so far are formulated and optimised in the discrete setting. Our subsequent modelling, analysis and optimisation will be carried out in function space rather than on a discretisation of (1.1). While digitally acquired image data are of course discrete, the aim of high-resolution image reconstruction and processing is always to compute an image that is close to the real (analogue, infinite dimensional) world. Hence, it makes sense to seek images which have certain properties in an infinite dimensional function space. That is, we aim for a processing method that accentuates and preserves qualitative properties in images independent of the resolution of the image itself [45]. Moreover, optimisation methods conceived in function space potentially result in numerical iterative schemes which are resolution and mesh independent upon discretisation [29].

Higher-order total variation regularisation has been introduced as an extension of the standard total variation regulariser in image processing. As the Total Variation (TV) [41] and many more contributions in the image processing community have proven, a non-smooth first-order regularisation procedure results in a non-linear smoothing of the image, smoothing more in homogeneous areas of the image domain and preserving characteristic structures such as edges. In particular, the TV regulariser is tuned towards the preservation of edges and performs very well if the reconstructed image is piecewise constant. The drawback of such a regularisation procedure becomes apparent as soon as images or signals (in 1D) are considered which do not only consist of constant regions and jumps but also possess more complicated, higher-order structures, e.g. piecewise linear parts. The artefact introduced by TV regularisation in this case is called staircasing [40]. One possibility to counteract such artefacts is the introduction of higher-order derivatives in the image regularisation. Chambolle and Lions [11], for instance, propose a higher-order method by means of an infimal convolution of the TV and the TV of the image gradient called Infimal Convolution Total Variation (ICTV) model. Other approaches to combine first- and second-order regularisation originate, for instance, from Chan et al. [12] who consider total variation minimisation together with weighted versions of the Laplacian, the Euler-elastica functional [13, 37], which combines total variation regularisation with curvature penalisation, and many more [35, 39] just to name a few. Recently, Bredies et al. have proposed Total Generalized Variation (TGV) [5] as a higher-order variant of TV regularisation.

In this work, we mainly concentrate on two second-order total variation models: the recently proposed TGV [5] and the ICTV model of Chambolle and Lions [11]. We focus on second-order TV regularisation only since this is the one which seems to be most relevant in imaging applications [6, 31]. For ΩR2 open and bounded and uBV(Ω), the ICTV regulariser reads

\mathrm{ICTV}_{\alpha,\beta}(u) := \min_{v\in W^{1,1}(\Omega),\, \nabla v\in \mathrm{BV}(\Omega)} \alpha\|Du - \nabla v\|_{\mathcal{M}(\Omega;\mathbb{R}^2)} + \beta\|D\nabla v\|_{\mathcal{M}(\Omega;\mathbb{R}^{2\times 2})}.     (1.3)

On the other hand, second-order TGV [7, 8] for uBV(Ω) reads

\mathrm{TGV}^2_{\alpha,\beta}(u) := \min_{w\in \mathrm{BD}(\Omega)} \alpha\|Du - w\|_{\mathcal{M}(\Omega;\mathbb{R}^2)} + \beta\|Ew\|_{\mathcal{M}(\Omega;\mathrm{Sym}^2(\mathbb{R}^2))}.     (1.4)

Here

\|Du\|_{\mathcal{M}(\Omega;\mathbb{R}^2)} = \sup\left\{\int_\Omega u\, \nabla\!\cdot g\,dx \;:\; g \in C_0^\infty(\Omega;\mathbb{R}^2),\ \|g\|_\infty \le 1\right\}     (1.5)

stands for the total variation of u in Ω, BD(Ω) := {w ∈ L¹(Ω;ℝⁿ) : ‖Ew‖_{M(Ω;ℝⁿˣⁿ)} < ∞} is the space of vector fields of bounded deformation on Ω, E denotes the symmetrised gradient and Sym²(ℝ²) denotes the space of symmetric tensors of order 2 with arguments in ℝ². The parameters α, β are fixed and positive, and will constitute the arguments in the special learning problem à la (1.2) that we consider in this paper. The main difference between (1.3) and (1.4) is that in (1.4) we do not generally have w = ∇v for any function v. This results in some qualitative differences between ICTV and TGV regularisation, compare for instance [1]. Substituting αR(u) in (1.1) by TGVα,β2(u) or ICTVα,β(u) gives the TGV image reconstruction model and the ICTV image reconstruction model, respectively. In this paper, we only consider the case K = Id (the identity) and d(u,f) = ‖u − f‖²_{L²(Ω)} in (1.1), which corresponds to an image denoising model for removing Gaussian noise. With our choice of regulariser, the former scalar α in (1.1) has been replaced by a vector (α, β) of two parameters in (1.3) and (1.4). The entries of this vector do not only determine the overall strength of the regularisation (depending on the properties of K and the noise level); they also balance between the different orders of regularity of the function u, and their choice is indeed crucial for the image reconstruction result. Large β will give regularised solutions that are close to TV regularised reconstructions, compare Fig. 1. Large α will result in TV2 type solutions, that is, solutions that are regularised with TV of the gradient [27, 39], compare Fig. 2. With our approach described in the next section, we propose a learning approach for choosing those parameters optimally, in particular optimally for particular types of images.

Fig. 1.

Fig. 1

Effect of β on TGV2 denoising with optimal α

Fig. 2.

Fig. 2

Effect of choosing α too large in TGV2 denoising

For the existence analysis of an optimal solution as well as for the derivation of an optimality system for the corresponding learning problem (1.2), we will consider a smoothed version of the constraint problem (1.1)—which is the one in fact used in the numerics. That is, we replace R(u)—being TV, TGV or ICTV in this paper—by a Huber-regularised version and add an H1 regularisation with a small weight to (1.1). In this setting and under the special assumption of box constraints on α and β, we provide a simple existence proof for an optimal solution. A more general existence result that holds also for the original non-smooth problem and does not require box constraints is derived in [19], and we refer the reader to this paper for a more sophisticated analysis on the structure of solutions.

A main challenge in the setup of such a learning approach is to decide what is the best way to measure fitness (optimality) of the model. In our setting this amounts to choosing an appropriate distance F in (1.2) that measures the fitness of reconstructed images to the ‘perfect’, noise-free images in an appropriate training set. We have to formalise what we mean by an optimal reconstruction model. Classically, the difference between the original, noise-free image f0 and its regularised version uα,β is computed with an L22 cost functional

F_{L^2_2}(u_{\alpha,\beta}) := \|u_{\alpha,\beta} - f_0\|^2_{L^2(\Omega)},     (1.6)

which is closely related to the PSNR quality measure. Apart from this, we propose in this paper an alternative cost functional based on a Huberised total variation cost

F_{L^1_\eta}(u_{\alpha,\beta}) := \int_\Omega |D(u_{\alpha,\beta} - f_0)|_\gamma\,dx,     (1.7)

where the Huber regularisation |·|γ will be defined later on in Definition 2.1. We will see that the choice of this cost functional is indeed crucial for the qualitative properties of the reconstructed image.

The proposed bilevel approach has an important indirect consequence: It establishes a basis for the comparison of the different total variation regularisers employed in image denoising tasks. In the last part of this paper, we exhaustively compare the performance of TV, TGV2 and ICTV for various image datasets. The parameters are chosen optimally, according to the proposed bilevel approach, and different quality measures (like PSNR and SSIM) are considered for the comparison. The obtained results are enlightening about when to use each one of the considered regularisers. In particular, ICTV appears to behave better for images with arbitrary structure and moderate noise levels, whereas TGV2 behaves better for images with large smooth areas.

Outline of the paper In Sect. 2, we state the bilevel learning problem for the two higher-order total variation regularisation models, TGV and ICTV, and prove existence of an optimal parameter pair α,β. The bilevel optimisation problem is analysed in Sect. 3, where existence of Lagrange multipliers is proved and an optimality system, as well as a gradient formula, is derived. Based on the optimality condition, a BFGS algorithm for the bilevel learning problem is devised in Sect. 4.1. For the numerical solution of each denoising problem, an infeasible semismooth Newton method is considered. Finally, we discuss the performance of the parameter learning method by means of several examples for the denoising of natural photographs in Sect. 5. Therein, we also present a statistical analysis on how TV, ICTV and TGV regularisation compare in terms of returned image quality, carried out on 200 images from the Berkeley segmentation dataset BSDS300.

Problem Statement and Existence Analysis

We strive to develop a parameter learning method for higher-order total variation regularisation models that maximises the fit of the reconstructed images to training images simulated for an application at hand. For a given noisy image fL2(Ω), ΩR2 open and bounded, we consider

\min_u\; \mathcal{R}_{\alpha,\beta}(u) + \frac{1}{2}\|u - f\|^2_{L^2(\Omega)},     (2.1)

where α, β ∈ ℝ. We focus on TGV2,

\mathcal{R}_{\alpha,\beta}(u) = \mathrm{TGV}^2_{\alpha,\beta}(u) := \min_{w\in \mathrm{BD}(\Omega)} \alpha\|Du - w\|_{\mathcal{M}(\Omega;\mathbb{R}^2)} + \beta\|Ew\|_{\mathcal{M}(\Omega;\mathrm{Sym}^2(\mathbb{R}^2))},

and ICTV,

\mathcal{R}_{\alpha,\beta}(u) = \mathrm{ICTV}_{\alpha,\beta}(u) := \min_{v\in W^{1,1}(\Omega),\, \nabla v\in \mathrm{BV}(\Omega)} \alpha\|Du - \nabla v\|_{\mathcal{M}(\Omega;\mathbb{R}^2)} + \beta\|D\nabla v\|_{\mathcal{M}(\Omega;\mathbb{R}^{2\times 2})},

for u ∈ BV(Ω). For these models, we want to determine the optimal choice of α, β, given a particular type of images and a fixed noise level. More precisely, we consider a training pair (f, f0), where f is a noisy image corrupted by normally distributed noise with a fixed variance, and the image f0 represents the ground truth or an image that approximates the ground truth within a desirable tolerance. Then, we determine the optimal choice of α, β by solving the following problem:

\min_{(\alpha,\beta)\in\mathbb{R}^2} F(u_{\alpha,\beta}) \quad \text{s.t.} \quad \alpha,\beta \ge 0,     (2.2)

where F equals the L22 cost (1.6) or the Huberised TV cost (1.7) and uα,β for a given f solves a regularised version of the minimisation problem (2.1) that will be specified in the next section, compare problem (2.3b). This regularisation of the problem is a technical requirement for solving the bilevel problem that will be discussed in the sequel. In contrast to learning α,β in (2.1) in finite dimensional parameter spaces (as is the case in machine learning), we consider optimisation techniques in infinite dimensional function spaces.

Formal Statement

Let Ω ⊂ ℝⁿ be an open bounded domain with Lipschitz boundary. This will be our image domain. Usually Ω = (0,w)×(0,h) for w and h the width and height of a two-dimensional image, although no such assumptions are made in this work. Our data f and f0 are assumed to lie in L2(Ω).

In our learning problem, we look for parameters (α,β) that for some cost functional F : H1(Ω) → ℝ solve the problem

\min_{(\alpha,\beta)\in\mathbb{R}^2} F(u_{\alpha,\beta})     (2.3a)

subject to

u_{\alpha,\beta} \in \operatorname*{arg\,min}_{u\in H^1(\Omega)} J^{\gamma,\mu}(u;\alpha,\beta),     (2.3b)
\alpha,\beta \ge 0,     (2.3c)

where

J^{\gamma,\mu}(u;\alpha,\beta) := \frac{1}{2}\|u-f\|^2_{L^2(\Omega)} + \mathcal{R}^{\gamma,\mu}_{\alpha,\beta}(u).

Here Jγ,μ(·;α,β) is the regularised denoising functional that amends the regularisation term in (2.1) by a Huber-regularised version of it with parameter γ>0, and an elliptic regularisation term with parameter μ>0. In the case of TGV2, the modified regularisation term Rα,βγ,μ(u) then reads, for uH1(Ω),

\mathrm{TGV}^{2,\gamma,\mu}_{\alpha,\beta}(u) := \min_{w\in H^1(\Omega)} \int_\Omega \alpha|Du - w|_\gamma\,dx + \int_\Omega \beta|Ew|_\gamma\,dx + \frac{\mu}{2}\left(\|u\|^2_{H^1(\Omega)} + \|w\|^2_{H^1(\Omega)}\right),

and in the case of ICTV, we have

\mathrm{ICTV}^{\gamma,\mu}_{\alpha,\beta}(u) := \min_{v\in W^{1,1}(\Omega),\, \nabla v\in \mathrm{BV}(\Omega;\mathbb{R}^n)\cap H^1(\Omega)} \int_\Omega \alpha|Du - \nabla v|_\gamma\,dx + \int_\Omega \beta|D\nabla v|_\gamma\,dx + \frac{\mu}{2}\left(\|u\|^2_{H^1(\Omega)} + \|\nabla v\|^2_{H^1(\Omega)}\right).

Here, H1(Ω)=H1(Ω;Rn) and the Huber regularisation |·|γ is defined as follows.

Definition 2.1

Given γ ∈ (0, ∞], we define, for the norm ‖·‖₂ on ℝᵐ, the Huber regularisation

|g|_\gamma = \begin{cases} \|g\|_2 - \frac{1}{2\gamma}, & \|g\|_2 \ge 1/\gamma, \\ \frac{\gamma}{2}\|g\|_2^2, & \|g\|_2 < 1/\gamma, \end{cases}

and its derivative, given by

h_\gamma(g) := \frac{\gamma g}{\max(1, \gamma|g|)}.     (2.4)
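For reference, a direct numpy transcription of Definition 2.1 and of the derivative (2.4); applying the functions pointwise over the last axis of an array of vectors is our own convention for this sketch.

```python
import numpy as np

def huber(g, gamma):
    # |g|_gamma from Definition 2.1, evaluated pointwise; the vector components
    # of g are assumed to live on the last axis
    norm = np.linalg.norm(g, axis=-1)
    return np.where(norm >= 1.0 / gamma,
                    norm - 1.0 / (2.0 * gamma),
                    0.5 * gamma * norm ** 2)

def h_gamma(g, gamma):
    # the derivative (2.4): h_gamma(g) = gamma * g / max(1, gamma * |g|)
    norm = np.linalg.norm(g, axis=-1, keepdims=True)
    return gamma * g / np.maximum(1.0, gamma * norm)
```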

For the cost functional F, given noise-free data f0 ∈ L2(Ω) and a regularised solution u ∈ H1(Ω), we consider in particular the L2 cost

F_{L^2_2}(u) = \frac{1}{2}\|f_0 - u\|^2_{L^2(\Omega;\mathbb{R}^d)},

as well as the Huberised total variation cost

F_{L^1_\eta}(u) = \int_\Omega |D(f_0 - u)|_\gamma\,dx

with noise-free data f0 ∈ BV(Ω).
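Discretised versions of the two cost functionals can then be sketched as follows, using forward differences with unit pixel spacing for D (an assumption of this illustration) and the huber function from the previous sketch.

```python
import numpy as np

def forward_gradient(u):
    # forward differences with replicated last row/column, stacked on the last axis
    dx = np.diff(u, axis=1, append=u[:, -1:])
    dy = np.diff(u, axis=0, append=u[-1:, :])
    return np.stack((dx, dy), axis=-1)

def cost_l2(u, f0):
    # F_{L^2_2}(u) = 1/2 ||f0 - u||^2
    return 0.5 * np.sum((f0 - u) ** 2)

def cost_huber_tv(u, f0, gamma=100.0):
    # F_{L^1_eta}(u): sum over pixels of |D(f0 - u)|_gamma  (Huberised TV cost)
    return np.sum(huber(forward_gradient(f0 - u), gamma))
```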

Remark 2.1

Please note that in our formulation of the bilevel problem (2.3), we only impose a non-negativity constraint on the parameters α and β, i.e. we do not strictly bound them away from zero. There are two reasons for that. First, for the existence analysis of the smoothed problem, the case α=β=0 is not critical since compactness can be secured by the H1 term in the functional, compare Sect. 2.2. Second, in [19], we indeed prove that even for the non-smooth problem (as μ → 0), under appropriate assumptions on the given data, the optimal α,β are guaranteed to be strictly positive.

Existence of an Optimal Solution

The existence of an optimal solution for the learning problem (2.3) is a special case of the class of bilevel problems considered in [19], where the existence of optimal parameters in (0, +∞]^{2N} is proven. For convenience of the reader, we provide a simplified proof for the case where additional box constraints on the parameters are imposed. We start with an auxiliary lower semicontinuity result for the Huber-regularised functionals.

Lemma 2.1

Let u, v ∈ L^p(Ω), 1 ≤ p < ∞. Then the functional u ↦ ∫_Ω |u − v|_γ dx, where |·|_γ is the Huber regularisation in Definition 2.1, is lower semicontinuous with respect to weak* convergence in M(Ω;ℝ^d).

Proof

Recall that for g ∈ ℝᵐ, the Huber-regularised norm may be written in dual form as

|g|_\gamma = \sup\left\{\langle q, g\rangle - \tfrac{1}{2\gamma}\|q\|_2^2 \;:\; \|q\|_2 \le 1\right\}.

Therefore, we find that

G(u) := \int_\Omega |u - v|_\gamma\,dx = \sup\left\{\int_\Omega (u(x)-v(x))\cdot\varphi(x)\,dx - \int_\Omega \tfrac{1}{2\gamma}\|\varphi(x)\|_2^2\,dx \;:\; \varphi \in C_c^\infty(\Omega),\ \|\varphi(x)\|_2 \le 1 \text{ for every } x \in \Omega\right\}.

The functional G is thus of the form G(u) = sup_φ {⟨u, φ⟩ − G*(φ)}, where G* is the convex conjugate of G. Now, let {u_i}_{i=1}^∞ converge to u weakly* in M(Ω;ℝ^d). Taking a supremising sequence {φ_j}_{j=1}^∞ for this functional at any point u, we easily see lower semicontinuity by considering the sequences {⟨u_i, φ_j⟩ − G*(φ_j)}_{i=1}^∞ for each j.

Our main existence result is the following.

Theorem 2.1

We consider the learning problem (2.3) for TGV2 and ICTV regularisation, optimising over parameters (α,β) such that 0 ≤ α ≤ ᾱ and 0 ≤ β ≤ β̄. Here (ᾱ, β̄), with ᾱ, β̄ < ∞, is an arbitrary but fixed vector in ℝ² that defines a box constraint on the parameter space. There exists an optimal solution (α̂, β̂) ∈ ℝ² for this problem for both choices of cost functionals, F = FL22 and F = FLη1.

Proof

Let (αn, βn) ∈ ℝ² be a minimising sequence. Due to the box constraints, the sequence (αn, βn) is bounded in ℝ². Moreover, we get for the corresponding sequence of states un := u(αn, βn) that

J^{\gamma,\mu}(u_n;\alpha_n,\beta_n) \le J^{\gamma,\mu}(u;\alpha_n,\beta_n), \quad \forall u \in H^1(\Omega),

in particular this holds for u=0. Hence,

\frac{1}{2}\|u_n - f\|^2_{L^2(\Omega)} + \mathcal{R}^{\gamma,\mu}_{\alpha_n,\beta_n}(u_n) \le \frac{1}{2}\|f\|^2_{L^2(\Omega)}.     (2.5)

Exemplarily, we consider here the case for the TGV regulariser, that is Rαn,βnγ,μ=TGVαn,βn2,γ,μ. The proof for the ICTV regulariser can be done in a similar fashion. Inequality (2.5) in particular gives

\|u_n\|^2_{H^1(\Omega)} + \|w_n\|^2_{H^1(\Omega)} \le \frac{1}{\mu}\|f\|^2_{L^2(\Omega)},

where wn is the optimal w for un. This gives that (un, wn) is uniformly bounded in H1(Ω)×H1(Ω) and that there exists a subsequence {(αn, βn, un, wn)} which converges weakly in ℝ²×H1(Ω)×H1(Ω) to a limit point (α̂, β̂, û, ŵ). Moreover, un → û strongly in L^p(Ω) and wn → ŵ strongly in L^p(Ω;ℝⁿ). Using the continuity of the L2 fidelity term with respect to strong convergence in L2, and the weak lower semicontinuity of the H1 term with respect to weak convergence in H1 and of the Huber-regularised functional even with respect to weak convergence in M (cf. Lemma 2.1), we get

\frac{1}{2}\|\hat u - f\|^2_{L^2(\Omega)} + \int_\Omega \hat\alpha|D\hat u - \hat w|_\gamma\,dx + \int_\Omega \hat\beta|E\hat w|_\gamma\,dx + \frac{\mu}{2}\left(\|\hat u\|^2_{H^1(\Omega)} + \|\hat w\|^2_{H^1(\Omega)}\right)
\le \liminf_{n\to\infty}\left[ \frac{1}{2}\|u_n - f\|^2_{L^2(\Omega)} + \int_\Omega \hat\alpha|Du_n - w_n|_\gamma\,dx + \int_\Omega \hat\beta|Ew_n|_\gamma\,dx + \frac{\mu}{2}\left(\|u_n\|^2_{H^1(\Omega)} + \|w_n\|^2_{H^1(\Omega)}\right)\right]
\le \liminf_{n\to\infty}\left[ \frac{1}{2}\|u_n - f\|^2_{L^2(\Omega)} + \int_\Omega \alpha_n|Du_n - w_n|_\gamma\,dx + \int_\Omega \beta_n|Ew_n|_\gamma\,dx + \frac{\mu}{2}\left(\|u_n\|^2_{H^1(\Omega)} + \|w_n\|^2_{H^1(\Omega)}\right)\right],

where in the last step we have used the boundedness of the sequence Rαn,βnγ,μ(un) from (2.5) and the convergence of (αn, βn) in ℝ². This shows that the limit point û is an optimal solution for (α̂, β̂). Moreover, due to the weak lower semicontinuity of the cost functional F and the fact that the set {(α,β) : 0 ≤ α ≤ ᾱ, 0 ≤ β ≤ β̄} is closed, we have that (α̂, β̂, û) is optimal for (2.3).

Remark 2.2

  • Using the existence result in [19], in principle we could allow infinite values for α and β. This would include both TV2 and TV as possible optimal regularisers in our learning problem.

  • In [19], in the case of the L2 cost and assuming that
    Rα,βγ(f)>Rα,βγ(f0),
    we moreover show that the parameters (α,β) are strictly larger than 0. In the case of the Huberised TV cost, this is proven in a discretised setting. Please see [19] for details.
  • The existence of solutions with μ=0, that is without elliptic regularisation, is also proven in [19]. Note that here, we focus on the μ>0 case since the elliptic regularity is required for proving the existence of Lagrange multipliers in the next section.

Remark 2.3

In [19], it was shown that the solution map of our bilevel problem is outer semicontinuous. This implies, in particular, that the minimisers of the regularised bilevel problems converge towards the minimiser of the original one.

Lagrange Multipliers

In this section, we prove the existence of Lagrange multipliers for the learning problem (2.3) and derive an optimality system that characterises stationary points. Moreover, a gradient formula for the reduced cost functional is obtained, which plays an important role in the development of fast solution algorithms for the learning problems (see Sect. 4.1).

In what follows, all proofs are presented for the TGV2 regularisation case, that is Rα,βγ=TGVα,β2,γ. However, possible modifications to cope with the ICTV model will also be commented on. Moreover, throughout this section we consider a smoother variant of the Huber regularisation, given by

|g|_\gamma = \begin{cases} |g| - U_\gamma + A_\gamma U_\gamma + \frac{B_\gamma}{2}U_\gamma^2 + \frac{C_\gamma}{3}U_\gamma^3 + D_\gamma & \text{if } \gamma|g| \ge 1 + \frac{1}{2\gamma}, \\[4pt] A_\gamma|g| + \frac{B_\gamma}{2}|g|^2 + \frac{C_\gamma}{3}|g|^3 + D_\gamma & \text{if } 1 - \frac{1}{2\gamma} \le \gamma|g| \le 1 + \frac{1}{2\gamma}, \\[4pt] \frac{\gamma}{2}|g|^2 & \text{if } \gamma|g| \le 1 - \frac{1}{2\gamma}, \end{cases}

with

U_\gamma = \frac{1}{\gamma}\left(1 + \frac{1}{2\gamma}\right),\quad L_\gamma = \frac{1}{\gamma}\left(1 - \frac{1}{2\gamma}\right),\quad A_\gamma = 1 - \frac{\gamma}{2}\left(\frac{2\gamma+1}{2\gamma}\right)^2,\quad B_\gamma = \frac{\gamma}{2}(2\gamma+1),\quad C_\gamma = -\frac{\gamma^3}{2},\quad D_\gamma = -\frac{\gamma^3}{3}L_\gamma^3 - A_\gamma L_\gamma.

This modified Huber function is required in order to get differentiability of the solution operator, a matter which is investigated next.

Differentiability of the Solution Operator

We recall that the TGV2 denoising problem can be rewritten as

y = (u,w) = \operatorname*{arg\,min}_{(u,w)\in \mathrm{BV}(\Omega)\times\mathrm{BD}(\Omega)} \frac{1}{2}\int_\Omega |u-f|^2\,dx + \int_\Omega \alpha|Du - w|_\gamma + \int_\Omega \beta|Ew|_\gamma.

Using an elliptic regularisation, we then get

y = \operatorname*{arg\,min}_{(u,w)\in H^1(\Omega)\times H^1(\Omega)} \frac{1}{2}a(y,y) + \frac{1}{2}\int_\Omega |u-f|^2\,dx + \int_\Omega \alpha|Du - w|_\gamma + \int_\Omega \beta|Ew|_\gamma,

where a(y,y) = μ(‖u‖²_{H¹} + ‖w‖²_{H¹}). A necessary and sufficient optimality condition for the latter is then given by the following variational equation:

a(y,\Psi) + \int_\Omega \alpha\, h_\gamma(Du-w)\cdot(D\phi - \varphi)\,dx + \int_\Omega \beta\, h_\gamma(Ew) : E\varphi\,dx + \int_\Omega (u-f)\phi\,dx = 0, \quad \text{for all } \Psi \in Y,     (3.1)

where Ψ=(ϕ,φ), Y=H1(Ω)×H1(Ω) and

h_\gamma(g) = \begin{cases} \frac{g}{|g|} & \text{if } \gamma|g| \ge 1 + \frac{1}{2\gamma}, \\[4pt] \frac{g}{|g|}\left(1 - \frac{\gamma}{2}\left(1 - \gamma|g| + \frac{1}{2\gamma}\right)^2\right) & \text{if } 1 - \frac{1}{2\gamma} \le \gamma|g| \le 1 + \frac{1}{2\gamma}, \\[4pt] \gamma g & \text{if } \gamma|g| \le 1 - \frac{1}{2\gamma}. \end{cases}     (3.2)
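A pointwise numpy transcription of the C1 function hγ in (3.2), under the same vector-field conventions as the earlier sketches; the branch thresholds follow our reading of the reconstructed formula.

```python
import numpy as np

def h_gamma_smooth(g, gamma):
    # C^1 variant of h_gamma from (3.2), applied pointwise on the last axis of g
    norm = np.maximum(np.linalg.norm(g, axis=-1, keepdims=True), 1e-15)
    unit = g / norm
    upper = (1.0 + 0.5 / gamma) / gamma     # gamma*|g| >= 1 + 1/(2*gamma)
    lower = (1.0 - 0.5 / gamma) / gamma     # gamma*|g| <= 1 - 1/(2*gamma)
    middle = unit * (1.0 - 0.5 * gamma * (1.0 - gamma * norm + 0.5 / gamma) ** 2)
    return np.where(norm >= upper, unit,
                    np.where(norm <= lower, gamma * g, middle))
```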

Theorem 3.1

The solution operator S : ℝ² → Y, which assigns to each pair (α,β) ∈ ℝ² the corresponding solution to the denoising problem (3.1), is Fréchet differentiable and its derivative is characterised by the unique solution z = S′(α,β)[θ1,θ2] ∈ Y of the following linearised equation:

a(z,\Psi) + \int_\Omega \theta_1\, h_\gamma(Du-w)\cdot(D\phi-\varphi)\,dx + \int_\Omega \alpha\, h_\gamma'(Du-w)(Dz_1-z_2)\cdot(D\phi-\varphi)\,dx + \int_\Omega \theta_2\, h_\gamma(Ew):E\varphi\,dx + \int_\Omega \beta\, h_\gamma'(Ew)Ez_2:E\varphi\,dx + \int_\Omega z_1\phi\,dx = 0, \quad \text{for all } \Psi \in Y.     (3.3)

Proof

Thanks to the ellipticity of a(·,·) and the monotonicity of hγ, the existence of a unique solution to the linearised equation follows from the Lax-Milgram theorem.

Let ξ := y⁺ − y − z, where y = S(α,β) and y⁺ = S(α+θ1, β+θ2). Our aim is to prove that ‖ξ‖_Y = o(|θ|). Combining the equations for y⁺, y and z, we get that

a(\xi,\Psi) + \int_\Omega (\alpha+\theta_1)\, h_\gamma(Du^+-w^+)\cdot(D\phi-\varphi)\,dx - \int_\Omega \alpha\, h_\gamma(Du-w)\cdot(D\phi-\varphi)\,dx - \int_\Omega \theta_1\, h_\gamma(Du-w)\cdot(D\phi-\varphi)\,dx - \int_\Omega \alpha\, h_\gamma'(Du-w)(Dz_1-z_2)\cdot(D\phi-\varphi)\,dx + \int_\Omega (\beta+\theta_2)\, h_\gamma(Ew^+):E\varphi\,dx - \int_\Omega \beta\, h_\gamma(Ew):E\varphi\,dx - \int_\Omega \theta_2\, h_\gamma(Ew):E\varphi\,dx - \int_\Omega \beta\, h_\gamma'(Ew)Ez_2:E\varphi\,dx + \int_\Omega \xi_1\phi\,dx = 0, \quad \text{for all } \Psi \in Y,

where ξ:=(ξ1,ξ2)H1(Ω)×H1(Ω). Adding and subtracting the terms

\int_\Omega \alpha\, h_\gamma'(Du-w)(D\delta_u-\delta_w)\cdot(D\phi-\varphi)\,dx

and

\int_\Omega \beta\, h_\gamma'(Ew)E\delta_w : E\varphi\,dx,

where δu:=uα+θ-u and δw:=wα+θ-w, we obtain that

a(\xi,\Psi) + \int_\Omega \alpha\, h_\gamma'(Du-w)(D\xi_1-\xi_2)\cdot(D\phi-\varphi) + \int_\Omega \beta\, h_\gamma'(Ew)E\xi_2 : E\varphi\,dx + \int_\Omega \xi_1\phi\,dx
= -\int_\Omega \alpha\left[h_\gamma(Du^+-w^+) - h_\gamma(Du-w) - h_\gamma'(Du-w)(D\delta_u-\delta_w)\right]\cdot(D\phi-\varphi)
\quad - \int_\Omega \theta_1\left[h_\gamma(Du^+-w^+) - h_\gamma(Du-w)\right]\cdot(D\phi-\varphi)\,dx
\quad - \int_\Omega \beta\left[h_\gamma(Ew^+) - h_\gamma(Ew) - h_\gamma'(Ew)E\delta_w\right] : E\varphi\,dx
\quad - \int_\Omega \theta_2\left[h_\gamma(Ew_{\alpha+\theta}) - h_\gamma(Ew)\right] : E\varphi\,dx, \quad \text{for all } \Psi \in Y.

Testing with Ψ=ξ and using the monotonicity of hγ(·), we get that

\|\xi\|_Y \le C\Big( |\alpha|\,\big\|h_\gamma(Du^+-w^+) - h_\gamma(Du-w) - h_\gamma'(Du-w)(D\delta_u-\delta_w)\big\|_{L^2} + |\theta_1|\,\big\|h_\gamma(Du^+-w^+) - h_\gamma(Du-w)\big\|_{L^2} + |\beta|\,\big\|h_\gamma(Ew^+) - h_\gamma(Ew) - h_\gamma'(Ew)E\delta_w\big\|_{L^2} + |\theta_2|\,\big\|h_\gamma(Ew_{\alpha+\theta}) - h_\gamma(Ew)\big\|_{L^2}\Big),

for some generic constant C>0. Considering the differentiability and Lipschitz continuity of hγ(·), it then follows that

\|\xi\|_Y \le C\Big( |\alpha|\, o(\|y^+ - y\|_{1,p}) + |\theta_1|\,\|y_{\alpha+\theta} - y\|_Y + |\beta|\, o(\|w^+ - w\|_{1,p}) + |\theta_2|\,\|w_{\alpha+\theta} - w\|_{H^1(\Omega)}\Big),     (3.4)

where ‖·‖_{1,p} stands for the norm in the space W^{1,p}(Ω). From regularity results for second-order systems (see [24, Theorem 1, Remark 14]), it follows that

\|y^+ - y\|_{1,p} \le L|\theta|\left( \|\mathrm{Div}\, h_\gamma(Du-w)\|_{-1,p} + \|h_\gamma(Du-w)\|_{-1,p} + \|\mathrm{Div}\, h_\gamma(Ew)\|_{-1,p} \right) \le L|\theta|\left( 2\|h_\gamma(Du-w)\|_{L^\infty} + \|h_\gamma(Ew)\|_{L^\infty} \right) \le \tilde L|\theta|,

since |hγ(·)| ≤ 1. Inserting the latter in estimate (3.4), we finally get that

\|\xi\|_Y = o(|\theta|).

Remark 3.1

The extra regularity result for second-order systems used in the last proof and due to Gröger [24, Thm. 1, Rem. 14] relies on the properties of the domain Ω. The result was originally proved for C2 domains. However, the regularity of the domain (in the sense of Gröger) may also be verified for convex Lipschitz bounded domains [17], which is precisely our image domain case.

Remark 3.2

The Fréchet differentiability proof makes use of the quasilinear structure of the TGV2 variational form, making it difficult to extend to the ICTV model without further regularisation terms. For the latter, however, a Gâteaux differentiability result may be obtained using the same proof technique as in [22].

The Adjoint Equation

Next, we use the Lagrangian formalism for deriving the adjoint equations for both the TGV2 and ICTV learning problems. The existence of a solution to the adjoint equation follows from the Lax-Milgram theorem.

Defining the Lagrangian associated to the TGV2 learning problem by

\mathcal{L}(u,w,\alpha,\beta,p_1,p_2) = F(u) + \mu(u,p_1)_{H^1} + \mu(w,p_2)_{H^1} + \int_\Omega \alpha\, h_\gamma(Du-w)\cdot(Dp_1-p_2) + \int_\Omega \beta\, h_\gamma(Ew):Ep_2 + \int_\Omega (u-f)p_1,

and taking the derivative with respect to the state variable (u, w), we get the necessary optimality condition

\mathcal{L}_{(u,w)}(u,w,\alpha,\beta,p_1,p_2)[(\delta_u,\delta_w)] = F'(u)\delta_u + \mu(p_1,\delta_u)_{H^1} + \mu(p_2,\delta_w)_{H^1} + \int_\Omega \alpha\, h_\gamma'(Du-w)(D\delta_u-\delta_w)\cdot(Dp_1-p_2) + \int_\Omega \beta\, h_\gamma'(Ew)E\delta_w:Ep_2 + \int_\Omega p_1\delta_u = 0.

If δw=0, then

\mu(p_1,\delta_u)_{H^1} + \int_\Omega \alpha\, h_\gamma'(Du-w)(Dp_1-p_2)\cdot D\delta_u + \int_\Omega p_1\delta_u = -F'(u)\delta_u, \quad \text{for all } \delta_u \in H^1(\Omega),     (3.5)

whereas if δu=0, then

\mu(p_2,\delta_w)_{H^1} - \int_\Omega \alpha\, h_\gamma'(Du-w)(Dp_1-p_2)\cdot\delta_w + \int_\Omega \beta\, h_\gamma'(Ew)Ep_2:E\delta_w = 0, \quad \text{for all } \delta_w \in H^1(\Omega).     (3.6)

Theorem 3.2

Let (u,w) ∈ H1(Ω)×H1(Ω). There exists a unique solution Π = (p1,p2) ∈ Y = H1(Ω)×H1(Ω) to the adjoint system

\mu(\Pi,\delta_y)_Y + \int_\Omega \alpha\, h_\gamma'(Du-w)(D\delta_u-\delta_w)\cdot(Dp_1-p_2) + \int_\Omega \beta\, h_\gamma'(Ew)E\delta_w:Ep_2 + \int_\Omega p_1\delta_u = -F'(u)\delta_u, \quad \text{for all } \delta_y \in Y.     (3.7)

The corresponding solution is called the adjoint state associated with (u, w).

Proof

We have to show that the left-hand side of equation (3.7) constitutes a bilinear, continuous and coercive form on Y×Y. Linearity and continuity follow immediately. For the coercivity, let us take δy = Π. Since hγ is a monotone function, the terms ∫_Ω α hγ′(Du−w)(Dp1−p2)·(Dp1−p2) and ∫_Ω β hγ′(Ew)Ep2 : Ep2 are nonnegative, yielding

\mu\|\Pi\|_Y^2 + \int_\Omega \alpha\, h_\gamma'(Du-w)(Dp_1-p_2)\cdot(Dp_1-p_2) + \int_\Omega \beta\, h_\gamma'(Ew)Ep_2:Ep_2 + \int_\Omega p_1^2 \ge \mu\|\Pi\|_Y^2.

Thus, coercivity holds and, using Lax-Milgram theorem, we conclude that there exists a unique solution to the adjoint system (3.7).

Remark 3.3

For the ICTV model, it is possible to proceed formally with the Lagrangian approach. We recall that a necessary and sufficient optimality condition for the ICTV functional is given by

\mu(u,\phi)_{H^1} + \mu(v,\varphi)_{H^1} + \int_\Omega \alpha\, h_\gamma(Du-v)\cdot(D\phi-\varphi) + \int_\Omega \beta\, h_\gamma(Dv):D\varphi + \int_\Omega (u-f)\phi = 0, \quad \text{for all } (\phi,\varphi) \in H^1(\Omega)\times H^1(\Omega),     (3.8)

and the correspondent Lagrangian functional L is given by

\mathcal{L}(u,v,\alpha,\beta,p_1,p_2) = F(u) + \mu(u,p_1)_{H^1} + \mu(v,p_2)_{H^1} + \int_\Omega \alpha\, h_\gamma(Du-v)\cdot(Dp_1-p_2) + \int_\Omega \beta\, h_\gamma(Dv):Dp_2 + \int_\Omega (u-f)p_1.

Differentiating the Lagrangian with respect to the state variables (u, v) and setting the derivative equal to zero yields

\mathcal{L}_{(u,v)}(u,v,\alpha,\beta,p_1,p_2)[(\delta_u,\delta_v)] = F'(u)\delta_u + \mu(p_1,\delta_u)_{H^1} + \mu(p_2,\delta_v)_{H^1} + \int_\Omega \alpha\, h_\gamma'(Du-v)(D\delta_u-\delta_v)\cdot(Dp_1-p_2) + \int_\Omega \beta\, h_\gamma'(Dv)D\delta_v:Dp_2 + \int_\Omega p_1\delta_u = 0.

By taking successively δv=0 and δu=0, the following adjoint system is obtained

\mu(p_1,\delta_u)_{H^1} + \int_\Omega \alpha\, h_\gamma'(Du-v)(Dp_1-p_2)\cdot D\delta_u + \int_\Omega p_1\delta_u = -F'(u)\delta_u,     (3.9a)
\mu(p_2,\delta_v)_{H^1} - \int_\Omega \alpha\, h_\gamma'(Du-v)(Dp_1-p_2)\cdot\delta_v + \int_\Omega \beta\, h_\gamma'(Dv)Dp_2:D\delta_v = 0.     (3.9b)

Optimality Condition

Using the differentiability of the solution operator and the well-posedness of the adjoint equation, we derive next an optimality system for the characterisation of local minima of the bilevel learning problem. Besides the optimality condition itself, a gradient formula arises as byproduct, which is of importance in the design of solution algorithms for the learning problems.

Theorem 3.3

Let (ᾱ, β̄) ∈ ℝ₊² be a local optimal solution for problem (2.3). Then there exist Lagrange multipliers Π ∈ Y := H1(Ω)×H1(Ω) and λ1, λ2 ∈ ℝ such that the following system holds:

a(y,\Psi) + \alpha\int_\Omega h_\gamma(Du-w)\cdot(D\phi-\varphi)\,dx + \beta\int_\Omega h_\gamma(Ew):E\varphi\,dx + \int_\Omega (u-f)\phi\,dx = 0, \quad \text{for all } \Psi=(\phi,\varphi)\in Y,     (3.10a)
a(\Pi,\Psi) + \alpha\int_\Omega h_\gamma'(Du-w)(Dp_1-p_2)\cdot(D\phi-\varphi)\,dx + \beta\int_\Omega h_\gamma'(Ew)Ep_2:E\varphi\,dx + \int_\Omega p_1\phi\,dx = -F_u(u)[\phi], \quad \text{for all } \Psi=(\phi,\varphi)\in Y,     (3.10b)
\lambda_1 = \int_\Omega h_\gamma(Du-w)\cdot(Dp_1-p_2),     (3.10c)
\lambda_2 = \int_\Omega h_\gamma(Ew):Ep_2,     (3.10d)
\lambda_1 \ge 0, \quad \lambda_2 \ge 0,     (3.10e)
\lambda_1\cdot\bar\alpha = \lambda_2\cdot\bar\beta = 0.     (3.10f)

Proof

Consider the reduced cost functional F(α,β)=F(u(α,β)). The bilevel optimisation problem can then be formulated as

\min_{(\alpha,\beta)\in C} F(\alpha,\beta),

where F : ℝ² → ℝ and C corresponds to the positive orthant in ℝ². From [47, Thm. 3.1], there exist multipliers λ1, λ2 ∈ ℝ such that

\lambda_1 = \partial_\alpha F(\bar\alpha,\bar\beta), \quad \lambda_2 = \partial_\beta F(\bar\alpha,\bar\beta), \quad \lambda_1 \ge 0, \quad \lambda_2 \ge 0, \quad \lambda_1\cdot\bar\alpha = \lambda_2\cdot\bar\beta = 0.

By taking the derivative with respect to (α,β) and denoting by z the solution to the linearised equation (3.3), we get, together with the adjoint equation (3.10b), that

F'(\alpha,\beta)[\theta_1,\theta_2] = F_u(u)z_1 = -a(\Pi,z) - \alpha\int_\Omega h_\gamma'(Du-w)(Dp_1-p_2)\cdot(Dz_1-z_2) - \beta\int_\Omega h_\gamma'(Ew)Ep_2:Ez_2 - \int_\Omega p_1 z_1 = -a(z,\Pi) - \alpha\int_\Omega h_\gamma'(Du-w)(Dz_1-z_2)\cdot(Dp_1-p_2) - \beta\int_\Omega h_\gamma'(Ew)Ez_2:Ep_2 - \int_\Omega z_1 p_1,

which, taking into account the linearised equation, yields

F'(\alpha,\beta)[\theta_1,\theta_2] = \theta_1\int_\Omega h_\gamma(Du-w)\cdot(Dp_1-p_2) + \theta_2\int_\Omega h_\gamma(Ew):Ep_2.     (3.11)

Altogether we proved the result.
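Once discrete state and adjoint variables are available, the gradient formula (3.11) reduces to two inner products. The following sketch assumes the forward_gradient and h_gamma helpers from the earlier sketches and adds a symmetrised gradient; all discretisation choices here are ours, not the paper's.

```python
import numpy as np

def sym_gradient(w):
    # symmetrised gradient Ew of a field w with shape (ny, nx, 2); the four
    # entries (E11, E12, E21, E22) are stacked so that the Euclidean norm on
    # the last axis equals the Frobenius norm
    dx = lambda v: np.diff(v, axis=1, append=v[:, -1:])
    dy = lambda v: np.diff(v, axis=0, append=v[-1:, :])
    e11, e22 = dx(w[..., 0]), dy(w[..., 1])
    e12 = 0.5 * (dy(w[..., 0]) + dx(w[..., 1]))
    return np.stack((e11, e12, e12, e22), axis=-1)

def reduced_gradient(u, w, p1, p2, gamma):
    # Discrete version of the gradient formula (3.11):
    #   dF/dalpha = sum h_gamma(Du - w) . (Dp1 - p2)
    #   dF/dbeta  = sum h_gamma(Ew) : Ep2
    g_alpha = np.sum(h_gamma(forward_gradient(u) - w, gamma)
                     * (forward_gradient(p1) - p2))
    g_beta = np.sum(h_gamma(sym_gradient(w), gamma) * sym_gradient(p2))
    return np.array([g_alpha, g_beta])
```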

Remark 3.4

From the existence result (see Remark 2.2), we actually know that, under some assumptions on F, ᾱ and β̄ are strictly greater than zero. This implies that the multipliers λ1 and λ2 vanish, and the problem becomes an unconstrained one. This plays an important role in the design of solution algorithms, since only a mild treatment of the constraints has to be taken into account, as shown in Sect. 4.

Numerical Algorithms

In this section, we propose a second-order quasi-Newton method for the solution of the learning problem with scalar regularisation parameters. The algorithm is based on a BFGS update, preserving the positivity of the iterates through the line search strategy and updating the matrix cyclically depending on the satisfaction of the curvature condition. For the solution of the lower level problem, a semismooth Newton method with a properly modified Jacobi matrix is considered. Moreover, warm initialisation strategies have to be taken into account in order to get convergence for the TGV2 problem.

BFGS Algorithm

Thanks to the gradient characterisation obtained in Theorem 3.3, we next devise a BFGS algorithm to solve the bilevel learning problems with higher-order regularisers. We employ a few technical tricks to ensure convergence of the classical method. In particular, we limit the step length to get at most a fraction closer to the boundary. As shown in [19], the solution is in the interior for the regularisation and cost functionals we are interested in.

Moreover, the good behaviour of the BFGS method depends upon the BFGS matrix staying positive definite. This would be ensured by the Wolfe conditions, but because of our step length limitation, the curvature condition is not necessarily satisfied. (The Wolfe conditions are guaranteed to be satisfied for some step length σ if our domain is unbounded, but the range of steps satisfying the criterion may lie beyond our maximum step length, and the criterion is not necessarily satisfied closer to the current point.) Instead, we skip the BFGS update if the curvature is negative.

Overall, our learning algorithm may be written as follows:

Algorithm 4.1

(BFGS for denoising parameter learning) Pick Armijo line search constant c, and target residual ρ. Pick initial iterate (α0,β0). Solve the denoising problem (2.3b) for (α,β)=(α0,β0), yielding u0. Initialise B1=I. Set i:=0, and iterate the following steps:

  1. Solve the adjoint equation (3.10b) for Πi, and calculate ∇F(αi,βi) from (3.11).

  2. If i ≥ 2, do the following:
    1. Set s := (αi,βi) − (αi−1,βi−1), and r := ∇F(αi,βi) − ∇F(αi−1,βi−1).
    2. Perform the BFGS update
      B_i := \begin{cases} B_{i-1}, & s^T r \le 0, \\ B_{i-1} - \dfrac{(B_{i-1}s)(B_{i-1}s)^T}{s^T B_{i-1}s} + \dfrac{r r^T}{s^T r}, & s^T r > 0. \end{cases}
  3. Compute δα,β from
    B_i δα,β = −∇F(αi,βi).
  4. Initialise σ := min{1, σmax/2}, where
    σmax := max{σ ≥ 0 ∣ (αi,βi) + σδα,β ≥ 0}.
    Repeat the following:
    1. Let (ασ,βσ) := (αi,βi) + σδα,β, and solve the denoising problem (2.3b) for (α,β) = (ασ,βσ), yielding uσ.
    2. If the residual ‖(ασ,βσ) − (αi,βi)‖/‖(ασ,βσ)‖ < ρ, do the following:
      • (i)
        If min_σ F(ασ,βσ) < F(αi,βi) over all σ tried, choose the minimising σ, set (αi+1,βi+1) := (ασ,βσ), ui+1 := uσ, and continue from Step 5.
      • (ii)
        Otherwise end the algorithm with solution (α,β) := (αi,βi).
    3. Otherwise, if the Armijo condition F(ασ,βσ) ≤ F(αi,βi) + σc∇F(αi,βi)^T δα,β holds, set (αi+1,βi+1) := (ασ,βσ), ui+1 := uσ, and continue from Step 5.
    4. In all other cases, set σ := σ/2 and continue from Step 4a.
  5. If the residual ‖(αi+1,βi+1) − (αi,βi)‖/‖(αi+1,βi+1)‖ < ρ, end the algorithm with (α,β) := (αi+1,βi+1). Otherwise continue from Step 1 with i := i+1.

Step (4) ensures that the iterates remain feasible, without making use of a projection step.
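The following is a simplified Python sketch of Algorithm 4.1 with the problem-specific parts abstracted into callbacks (denoise, gradient and cost are hypothetical interfaces standing for the SSN solver, the adjoint-based gradient (3.11) and the cost functional F); it keeps the three ingredients highlighted above: the step-length limitation, the Armijo backtracking and the skipped BFGS update on negative curvature.

```python
import numpy as np

def bfgs_learn(alpha0, denoise, gradient, cost, c=1e-4, rho=1e-5, max_iter=50):
    # denoise(alpha) solves the lower-level problem (2.3b), gradient(alpha, u)
    # evaluates the reduced gradient (3.11) via the adjoint state, cost(u)
    # evaluates the cost functional F.
    alpha = np.asarray(alpha0, dtype=float)
    u, B = denoise(alpha), np.eye(len(alpha))
    g = gradient(alpha, u)
    for _ in range(max_iter):
        d = -np.linalg.solve(B, g)
        # limit the step so that the parameters stay strictly positive
        sigma_max = min((-alpha[k] / d[k] for k in range(len(alpha)) if d[k] < 0),
                        default=np.inf)
        sigma = min(1.0, 0.5 * sigma_max)
        while True:
            alpha_new = alpha + sigma * d
            u_new = denoise(alpha_new)
            if cost(u_new) <= cost(u) + c * sigma * g.dot(d) or sigma < 1e-12:
                break                       # Armijo condition (or step too small)
            sigma *= 0.5
        g_new = gradient(alpha_new, u_new)
        s, r = alpha_new - alpha, g_new - g
        if s.dot(r) > 0:                    # skip the update on negative curvature
            Bs = B.dot(s)
            B = B - np.outer(Bs, Bs) / s.dot(Bs) + np.outer(r, r) / s.dot(r)
        if np.linalg.norm(alpha_new - alpha) < rho * np.linalg.norm(alpha_new):
            return alpha_new
        alpha, u, g = alpha_new, u_new, g_new
    return alpha
```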

An Infeasible Semismooth Newton Method

In this section, we consider semismooth Newton methods for solving the TGV2 and the ICTV denoising problems. Semismooth Newton methods feature a local superlinear convergence rate and have been previously successfully applied to image processing problems (see, e.g. [21, 29, 32]). The primal-dual algorithm we use here is an extension of the method proposed in [29] to the case of higher-order regularisers.

In variational form, the TGV2 denoising problem can be written as

\mu\int_\Omega (Du\cdot D\phi + u\phi) + \int_\Omega \alpha\, h_\gamma(Du-w)\cdot D\phi + \int_\Omega (u-f)\phi = 0, \quad \forall \phi \in H^1(\Omega),
\mu\int_\Omega (Ew:E\varphi + w\cdot\varphi) - \int_\Omega \alpha\, h_\gamma(Du-w)\cdot\varphi + \int_\Omega \beta\, h_\gamma(Ew):E\varphi = 0, \quad \forall \varphi \in H^1(\Omega),

or, in general abstract primal-dual form, as

Ly + \sum_{j=1}^N A_j^* q_j = f \quad \text{in } \Omega,     (4.1a)
\max\{1/\gamma, |[A_j y](x)|_2\}\, q_j(x) - \alpha_j [A_j y](x) = 0 \quad \text{a.e. in } \Omega,\ j=1,\dots,N,     (4.1b)

where L ∈ 𝓛(H1(Ω;ℝᵐ), H1(Ω;ℝᵐ)*) is a second-order linear elliptic operator, A_j, j = 1,…,N, are linear operators acting on y, and q_j(x), j = 1,…,N, correspond to the dual multipliers.

Let us set

m_j(y) := \max\{1/\gamma, |[A_j y](x)|_2\}.

Let us also define the diagonal application D(y) : L²(Ω;ℝᵐ) → L²(Ω;ℝᵐ) by

[D(y)q](x) = y(x)q(x), \quad (x \in \Omega).

We may derive ∇_y[D(m_j(y))q_j], which is given by

\nabla_y[D(m_j(y))q_j] = N(A_j y)D(q_j)A_j, \quad\text{where}\quad N(z)(x) := \begin{cases} 0, & |z(x)|_2 < 1/\gamma, \\ \dfrac{z(x)}{|z(x)|_2}, & |z(x)|_2 \ge 1/\gamma. \end{cases}

Then (4.1a), (4.1b) may be written as

Ly + \sum_{j=1}^N A_j^* q_j = f \quad \text{in } \Omega, \qquad D(m_j(y))q_j - \alpha_j A_j y = 0 \quad \text{a.e. in } \Omega,\ (j=1,\dots,N).

Linearising, we obtain the system

\begin{pmatrix} L & A_1^* & \cdots & A_N^* \\ -\alpha_1 A_1 + N(A_1 y)D(q_1)A_1 & D(m_1(y)) & 0 & 0 \\ \vdots & 0 & \ddots & 0 \\ -\alpha_N A_N + N(A_N y)D(q_N)A_N & 0 & 0 & D(m_N(y)) \end{pmatrix} \begin{pmatrix} \delta y \\ \delta q_1 \\ \vdots \\ \delta q_N \end{pmatrix} = R,     (SSN-1)

where

R := \begin{pmatrix} -Ly - \sum_{j=1}^N A_j^* q_j + f \\ \alpha_1 A_1 y - D(m_1(y))q_1 \\ \vdots \\ \alpha_N A_N y - D(m_N(y))q_N \end{pmatrix}.

The semismooth Newton method solves (SSN-1) at a current iterate (yⁱ, q₁ⁱ, …, q_Nⁱ). It then updates

(y^{i+1}, \tilde q_1^{i+1}, \dots, \tilde q_N^{i+1}) := (y^i + \tau\delta y,\ q_1^i + \tau\delta q_1,\ \dots,\ q_N^i + \tau\delta q_N),     (SSN-2)

for a suitable step length τ, allowing q̃^{i+1} to become infeasible in the process. That is, it may hold that |q̃_j^{i+1}(x)|₂ > α_j, which may lead to non-descent directions. In order to globalise the method, one projects

q_j^{i+1} := P(\tilde q_j^{i+1}; \alpha_j), \quad\text{where}\quad P(q,\alpha)(x) := \operatorname{sgn}(q(x))\min\{\alpha, |q(x)|\},     (SSN-3)

in the building of the Jacobi matrix. Following [29, 42], it can be shown that a discrete version of the method (SSN-1)–(SSN-3) converges globally, and locally superlinearly near any point where the subdifferentials of the operator on (y, q₁, …, q_N) corresponding to (4.1) are non-singular. Further dampening as in [29] guarantees local superlinear convergence at any point. We do not present the proof, as going into the discretisation and dampening details would expand this work considerably.
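Two of the pointwise building blocks of the method, m_j(y) and the projection (SSN-3), can be sketched in numpy as follows; the vector-valued reading sgn(q) = q/|q| of the projection is our interpretation.

```python
import numpy as np

def m_j(Ay, gamma):
    # m_j(y) = max{1/gamma, |A_j y|_2}, pointwise on the last axis
    return np.maximum(1.0 / gamma, np.linalg.norm(Ay, axis=-1, keepdims=True))

def project_dual(q, alpha):
    # (SSN-3): project a possibly infeasible dual variable back onto the
    # pointwise ball of radius alpha before building the Jacobi matrix
    norm = np.maximum(np.linalg.norm(q, axis=-1, keepdims=True), 1e-15)
    return q / norm * np.minimum(alpha, norm)
```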

Remark 4.1

The system (SSN-1) can be further simplified, which is crucial to obtain acceptable performance with TGV2. Indeed, observe that L is invertible, so we may solve δy from

L\delta y = R_1 - \sum_{j=1}^N A_j^*\delta q_j.     (4.2)

Thus, we may eliminate δy from (SSN-1) and only solve for δq₁, …, δq_N using a reduced system matrix. Finally, we calculate δy from (4.2).

For the denoising sub-problem (2.3b), we use the method (SSN-1)–(SSN-3) with the reduced system matrix of Remark 4.1. Here, in the case of TGV2, y denotes the variables

y=(u,w),

and in the case of ICTV

y=(u,v).

For the calculation of the step length τ, we use Armijo line search with parameter c=1E-4. We end the SSN iterations when

\frac{\tau\|\delta Y^i\|}{\max\{1, \|Y^i\|\}} \le \text{1E-5},

where δYⁱ := (δyⁱ, δq₁ⁱ, …, δq_Nⁱ) and Yⁱ := (yⁱ, q₁ⁱ, …, q_Nⁱ).

Warm Initialisation

In our numerical experimentation, we generally found Algorithm 4.1 to perform well for learning the regularisation parameter for TV denoising as was done in [22]. For learning the two (or even more) regularisation parameters for TGV2 denoising, we found that a warm initialisation is needed to obtain convergence. More specifically, we use TV as an aid for discovering both the initial iterate (α0,β0) as well as the initial BFGS matrix B1. This is outlined in the following algorithm:

Algorithm 4.2

(BFGS initialisation for TGV2 parameter learning) Pick a heuristic factor δ0>0. Then do the following:

  1. Solve the corresponding problem for TV using Algorithm 4.1. This yields the optimal TV denoising parameter αTV, as well as the BFGS estimate BTV for ∇²F(αTV).

  2. Run Algorithm 4.1 for TGV2 with initialisation (α0,β0):=(αTVδ0,αTV), and initial BFGS matrix B1:=diag(BTVδ0,BTV).

With Ω = (0,1)², we pick δ0 = 1/ℓ, where the original discrete image has ℓ×ℓ pixels. This corresponds to the heuristic [2, 44] that if ℓ = 128 or 256, and the discrete image is mapped into the corresponding domain Ω = (0,ℓ)² directly (corresponding to a spatial step size of one in the discrete gradient operator), then β ∈ (α, 1.5α) tends to be a good choice. We will later verify this through the use of our algorithms. Now, if f ∈ BV((0,ℓ)²) is rescaled to BV((0,1)²), i.e. f̃(x) := f(x/δ0), then with ũ(x) := u(x/δ0) and w̃(x) := w(x/δ0)/δ0, we have the theoretical equivalence

\frac{1}{2}\|f - u\|^2_{L^2((0,\ell)^2)} + \alpha\|Du - w\|_{\mathcal{M}((0,\ell)^2;\mathbb{R}^2)} + \beta\|Ew\|_{\mathcal{M}((0,\ell)^2;\mathbb{R}^{2\times 2})}     (4.3)
= \ell^2\,\frac{1}{2}\|\tilde f - \tilde u\|^2_{L^2((0,1)^2)} + \ell\alpha\|D\tilde u - \tilde w\|_{\mathcal{M}((0,1)^2;\mathbb{R}^2)} + \ell^2\beta\|E\tilde w\|_{\mathcal{M}((0,1)^2;\mathbb{R}^{2\times 2})}.     (4.4)

This introduces the factor 1/ℓ = |Ω|^{−1/2} between the rescaled α and β.
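In code, the warm initialisation of Algorithm 4.2 on the unit square therefore amounts to the following small helper (a sketch; the numeric value of αTV in the usage comment is only a placeholder).

```python
def tgv2_initialisation(alpha_tv, ell):
    # Warm initialisation of Algorithm 4.2 on Omega = (0, 1)^2:
    # delta_0 = 1/ell, so (alpha_0, beta_0) = (alpha_tv * delta_0, alpha_tv)
    delta_0 = 1.0 / ell
    return (alpha_tv * delta_0, alpha_tv)

# e.g. for a 256 x 256 image with a (hypothetical) optimal TV parameter alpha_tv:
# alpha0, beta0 = tgv2_initialisation(alpha_tv=0.057 / 256, ell=256)
```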

Experiments

In this section, we present some numerical experiments to verify the theoretical properties of the bilevel learning problems and the efficiency of the proposed solution algorithms. In particular, we exhaustively compare the performance of the new proposed cost functional with respect to well-known quality measures, showing a better behaviour of the new cost for the chosen tested images. The performance of the proposed BFGS algorithm, combined with the semismooth Newton method for the lower level problem, is also examined.

Moreover, on the basis of the proposed learning setting, a thorough comparison between TGV2 and ICTV is carried out. The use of higher-order regularisers in image denoising is rather recent, and the question of whether TGV2 or ICTV performs better has remained open. We target that question and, on the basis of the bilevel learning approach, we are able to give some partial answers.

Gaussian Denoising

We tested Algorithm 4.1 for TV and Algorithm 4.2 for TGV2 Gaussian denoising parameter learning on various images. Here we report the results for two images, the parrot image in Fig. 4a and the geometric image in Fig. 5. We applied synthetic noise to the original images, such that the PSNR of the parrot image is 24.7 and the PSNR of the geometric image is 24.8.

Fig. 4.

Fig. 4

Optimal denoising results for initial guess α = (αTV/ℓ, αTV) for TGV2 and α = 0.1/ℓ for TV

Fig. 5.

Fig. 5

Optimal denoising results for initial guess α = (αTV/ℓ, αTV) for TGV2 and α = 0.2/ℓ for TV

In order to learn the regularisation parameter α for TV, we picked the initial value α0 = 0.1/ℓ. For TGV2, initialisation by TV was used as in Algorithm 4.2. We chose the other parameters of Algorithm 4.1 as c = 1E-4, ρ = 1E-5, θ = 1E-8, and Θ = 10. For the SSN denoising method, the parameters γ = 100 and μ = 1E-10 were chosen.

We have included results for both the L2-squared cost functional L22 and the Huberised total variation cost functional Lη1. The learning results are reported in Table 1 for the parrot image and in Table 2 for the geometric image. The denoising results with the discovered parameters are shown in Figs. 4 and 5. We report the resulting optimal parameter values, the cost functional value, PSNR, SSIM [46], as well as the number of iterations taken by the outer BFGS method.

Table 1.

Quantified results for the parrot image (ℓ = 256 = image width/height in pixels)

Denoise | Cost | Initial (α,β) | Result (α*, β*) | Cost | SSIM | PSNR | Its. | Fig.
TGV2 | Lη1 | (αTV/ℓ, αTV) | (0.069/ℓ², 0.051/ℓ) | 6.615 | 0.897 | 31.720 | 12 | 4c
TGV2 | L22 | (αTV/ℓ, αTV) | (0.058/ℓ², 0.041/ℓ) | 6.412 | 0.890 | 31.992 | 11 | 4d
ICTV | Lη1 | (αTV/ℓ, αTV) | (0.068/ℓ², 0.051/ℓ) | 6.656 | 0.895 | 31.667 | 16 | 4e
ICTV | L22 | (αTV/ℓ, αTV) | (0.051/ℓ², 0.041/ℓ) | 6.439 | 0.887 | 31.954 | 7 | 4f
TV | Lη1 | 0.1/ℓ | 0.057/ℓ | 6.944 | 0.887 | 31.298 | 10 | 4g
TV | L22 | 0.1/ℓ | 0.042/ℓ | 6.623 | 0.879 | 31.710 | 12 | 4h

Table 2.

Quantified results for the synthetic image (ℓ = 256 = image width/height in pixels)

Denoise | Cost | Initial (α,β) | Result (α*, β*) | Value | SSIM | PSNR | Its. | Fig.
TGV2 | Lη1 | (αTV/ℓ, αTV) | (0.453/ℓ², 0.071/ℓ) | 3.769 | 0.989 | 36.606 | 17 | 5c
TGV2 | L22 | (αTV/ℓ, αTV) | (0.307/ℓ², 0.055/ℓ) | 3.603 | 0.986 | 36.997 | 19 | 5d
ICTV | Lη1 | (αTV/ℓ, αTV) | (0.505/ℓ², 0.103/ℓ) | 4.971 | 0.970 | 34.201 | 23 | 5e
ICTV | L22 | (αTV/ℓ, αTV) | (0.056/ℓ², 0.049/ℓ) | 3.947 | 0.965 | 36.206 | 7 | 5f
TV | Lη1 | 0.1/ℓ | 0.136/ℓ | 5.521 | 0.966 | 33.291 | 6 | 5g
TV | L22 | 0.1/ℓ | 0.052/ℓ | 4.157 | 0.948 | 35.756 | 7 | 5h

Our first observation is that all approaches successfully learn a denoising parameter that gives a good-quality denoised image. Secondly, we observe that the gradient cost functional Lη1 performs visually and in terms of SSIM significantly better for TGV2 parameter learning than the cost functional L22. In terms of PSNR, the roles are reversed, as is to be expected, since the L22 cost is equivalent to PSNR. This again confirms that PSNR is a poor quality measure for images. For TV, there is no significant difference between the different cost functionals in terms of visual quality, although the PSNR and SSIM differ.

We also observe that the optimal TGV2 parameters (α, β) generally satisfy β/α ∈ (0.75, 1.5)·ℓ. This confirms the earlier observed heuristic that if ℓ = 128, 256, then β ∈ (1, 1.5)α tends to be a good choice. As we can observe from Figs. 4 and 5, this optimal TGV2 parameter choice also avoids the staircasing effect that can be observed with TV in the results.

In Fig. 3, we have plotted as a red star the discovered regularisation parameters (α, β) reported in Fig. 4. Studying the location of the red star, we may conclude that Algorithms 4.1 and 4.2 manage to find a nearly optimal parameter in very few BFGS iterations.

Fig. 3.

Fig. 3

Cost functional value versus (α,β) for TGV2 denoising, for the parrot test images, for both L22 and Lη1 cost functionals. The illustrations are contour plots of function value versus (α,β)

Statistical Testing

To obtain a statistically significant outlook on the performance of the different regularisers and cost functionals, we made use of the Berkeley segmentation dataset BSDS300 [36], displayed in Fig. 6. We resized each image to 128 pixels on its shortest edge and took the 128×128 top left square of the image. To this dataset, we applied pixelwise Gaussian noise of variance σ² = 2, 10, and 20. We tested the performance of both cost functionals, Lη1 and L22, as well as the TGV2, ICTV, and TV regularisers on this dataset, for all noise levels. In the first instance, reported in Figs. 7, 8, 9 and 10 (noise levels σ² = 2, 20 only) and Tables 3, 4 and 5, we applied the proposed bilevel learning model to each image individually, learning the optimal parameters specifically for that image and a corresponding noisy image, for each of the noise levels separately. For the algorithm, we use the same parametrisation as presented in Sect. 5.1.

Fig. 6.

Fig. 6

The 200 images of the Berkeley segmentation dataset BSDS300 [36], cropped to be rectangular, keeping top left corner, and resized to 128×128

Fig. 7.

Fig. 7

Ordering of regularisers with individual learning, Lη1 cost, and noise variance σ2=2, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Fig. 8.

Fig. 8

Ordering of regularisers with individual learning, L22 cost, and noise variance σ2=2, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Fig. 9.

Fig. 9

Ordering of regularisers with individual learning, Lη1 cost, and noise variance σ2=20, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Fig. 10.

Fig. 10

Ordering of regularisers with individual learning, L22 cost, and noise variance σ2=20, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Table 3.

Regulariser performance with individual learning, L22 and Lη1 costs and noise variance σ2= 2; BSDS300 dataset, resized

SSIM PSNR Value
Mean Std Med Best Mean Std Med Best Mean Std Med Best
Noisy data 0.978 0.015 0.981 0 41.56 0.86 41.95 0 2.9E4 3.1E2 2.9E4 0
   Lη1-TV 0.988 0.005 0.989 1 42.57 1.10 42.46 5 2.4E4 3.7E3 2.5E4 1
   Lη1-ICTV 0.989 0.005 0.990 141 42.74 1.16 42.62 143 2.3E4 3.9E3 2.4E4 137
   Lη1-TGV2 0.989 0.005 0.989 58 42.70 1.17 42.55 52 2.4E4 4.0E3 2.5E4 62
95 % t test ICTV>TGV2>TV ICTV>TGV2>TV ICTV>TGV2>TV
   L22-TV 0.988 0.005 0.988 2 42.64 1.14 42.50 2 0.41 0.08 0.43 2
   L22-ICTV 0.988 0.005 0.989 142 42.79 1.18 42.64 148 0.39 0.08 0.41 148
   L22-TGV2 0.988 0.005 0.989 56 42.76 1.19 42.58 50 0.40 0.08 0.42 50
95 % t test ICTV>TGV2>TV ICTV>TGV2>TV ICTV>TGV2>TV

Table 4.

Regulariser performance with individual learning, L22 and Lη1 costs and noise variance σ2= 10; BSDS300 dataset, resized

SSIM PSNR Value
Mean Std Med Best Mean Std Med Best Mean Std Med Best
Noisy data 0.731 0.120 0.744 0 27.72 0.88 28.09 0 1.4E5 2.5E3 1.4E5 0
   Lη1-TV 0.898 0.036 0.900 4 31.28 1.63 30.97 8 7.3E4 2.2E4 7.3E4 1
   Lη1-ICTV 0.906 0.034 0.909 139 31.54 1.68 31.21 142 7.1E4 2.2E4 7.1E4 121
   Lη1-TGV2 0.905 0.035 0.907 57 31.47 1.72 31.10 50 7.1E4 2.2E4 7.1E4 78
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV ICTV > TGV2> TV
   L22-TV 0.897 0.033 0.898 9 31.54 1.76 31.15 2 5.52 1.89 5.51 2
   L22-ICTV 0.903 0.032 0.903 131 31.72 1.76 31.33 148 5.30 1.81 5.35 148
   L22-TGV2 0.902 0.033 0.903 60 31.67 1.80 31.28 50 5.38 1.87 5.39 50
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV ICTV > TGV2> TV

Table 5.

Regulariser performance with individual learning, L22 and Lη1 costs and noise variance σ2=20; BSDS300 dataset, resized

SSIM PSNR Value
Mean Std Med Best Mean Std Med Best Mean Std Med Best
Noisy data 0.505 0.143 0.516 0 21.80 0.92 22.14 0 2.8E5 7.9E3 2.8E5 0
   Lη1-TV 0.795 0.063 0.799 7 27.27 1.64 27.02 11 1.0E5 3.5E4 9.7E4 1
   Lη1-ICTV 0.810 0.061 0.814 120 27.52 1.66 27.24 125 9.7E4 3.4E4 9.6E4 79
   Lη1-TGV2 0.808 0.062 0.814 73 27.50 1.74 27.15 64 9.8E4 3.5E4 9.5E4 120
95 % t test ICTV > TGV2> TV ICTV, TGV2> TV ICTV, TGV2> TV
   L22-TV 0.802 0.056 0.804 8 27.70 1.93 27.28 0 13.65 5.53 13.14 0
   L22-ICTV 0.811 0.056 0.816 126 27.86 1.91 27.45 138 13.14 5.22 12.62 138
   L22-TGV2 0.810 0.057 0.814 66 27.83 1.94 27.41 62 13.28 5.38 12.77 62
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV ICTV > TGV2> TV

The figures display the noisy images and indicate by colour coding the best result as judged by the structural similarity measure SSIM [46], PSNR and the objective function value (Lη1 or L22 cost). These criteria are, respectively, the top, middle and bottom rows of colour-coding squares. Red square indicates that TV performed the best, green square indicates that ICTV performed the best and blue square indicates that TGV2 performed the best—this is naturally for the optimal parameters for the corresponding regulariser and cost functional discovered by our algorithms.

In the tables, we report the information in a more concise numerical fashion, indicating the mean, standard deviation and median for all the different criteria (SSIM, PSNR and cost functional value), as well as the number of images for which each regulariser performed the best. We recall that SSIM is normalised to [0, 1], with higher values better. Moreover, we perform a statistical 95 % paired t-test on each of the criteria and each pair of regularisers, to see whether the pair can be ordered. If so, this is indicated in the last row of each of the tables.
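The paired t-tests reported in the last row of each table can be reproduced with scipy once the 200 per-image scores are available; the following sketch, with placeholder score arrays, shows the comparison we have in mind (the function name and the significance handling are our own).

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_regularisers(scores_a, scores_b, significance=0.05):
    # Paired t-test on per-image scores (e.g. SSIM over the 200 BSDS300 images).
    # Returns +1 if a is significantly better, -1 if b is, 0 if no ordering.
    t, p = ttest_rel(scores_a, scores_b)
    if p < significance:
        return 1 if np.mean(scores_a) > np.mean(scores_b) else -1
    return 0

# Usage with hypothetical arrays ssim_ictv, ssim_tgv2 of length 200:
# order = compare_regularisers(ssim_ictv, ssim_tgv2)
```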

Overall, studying the t-test and other data, the ordering of the regularisers appears to be

ICTV>TGV2>TV.

This is rather surprising, as in many specific examples TGV2 has been observed to perform better than ICTV, see Figs. 4 and 5, as well as [1, 5]. Only when the noise is high does TGV2 appear to come on par with ICTV with the Lη1 cost functional, see Fig. 9 and Table 5.

A more detailed study of the results in Figs. 7, 8, 9 and 10 seems to indicate that TGV2 performs better than ICTV when the image contains large smooth areas, but ICTV generally seems to perform better for images with more complicated and varying contents. This observation agrees with the results in Figs. 4 and 5, as well as [1, 5], where the images are of the former type.

One possible reason for the better performance of ICTV could be that TGV2 has more degrees of freedom—in ICTV we essentially constrain w = ∇v for some function v—and therefore overfits to the noisy data, until the noise level becomes so high that overfitting would be too strong for any parameter choice. To see whether this is true, we also performed batch learning, learning a single set of parameters for all images with the same noise level. That is, we studied the model

\min_{\alpha} \sum_{i=1}^N F_i(u_{i,\alpha}) \quad \text{s.t.} \quad u_{i,\alpha} \in \operatorname*{arg\,min}_{u \in H^1(\Omega)} \frac{1}{2}\|f_i - u\|^2_{L^2(\Omega)} + \mathcal{R}^{\gamma,\mu}_{\alpha}(u),

with

F_i(u) = \frac{1}{2}\|f_{0,i} - u\|^2_{L^2(\Omega)}, \quad\text{or}\quad F_i(u) = \int_\Omega |\nabla(f_{0,i} - u)|_\gamma\,dx,

where α = (α, β), f₁, …, f_N are the N = 200 noisy images with the same noise level, and f_{0,1}, …, f_{0,N} are the original noise-free images.

The results are shown in Figs. 11, 12, 13 and 14 (noise levels σ² = 2, 20 only), and Tables 6, 7 and 8. The results are still roughly the same as with individual learning. Again, only with high noise (Table 8) does TGV2 not lose to ICTV. Another interesting observation is that TV starts to be frequently the best regulariser for individual images, although it still does statistically worse than either ICTV or TGV2.

Fig. 11.

Fig. 11

Ordering of regularisers with batch learning, Lη1 cost, and noise variance σ2=2, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Fig. 12.

Fig. 12

Ordering of regularisers with batch learning, L22 cost, and noise variance σ2=2, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Fig. 13.

Fig. 13

Ordering of regularisers with batch learning, Lη1 cost, and noise variance σ2=20, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Fig. 14.

Fig. 14

Ordering of regularisers with batch learning, L22, cost, and noise variance σ2=20, on the 200 images of the BSDS300 dataset, resized. Best regulariser: red TV, green ICTV, blue TGV2; top SSIM, middle PSNR, bottom objective value

Table 6.

Regulariser performance with batch learning, Lη1 and L22 costs, noise variance σ2= 2; BSDS300 dataset, resized

SSIM PSNR Value
Mean Std Med Best Mean Std Med Best Mean Std Med Best
Noisy data 0.978 0.015 0.981 16 41.56 0.86 41.95 24 2.9E4 3.1E2 2.9E4 16
   Lη1-TV 0.987 0.006 0.988 23 42.43 1.07 42.37 21 2.5E4 3.4E3 2.5E4 20
   Lη1-ICTV 0.988 0.006 0.989 119 42.56 1.06 42.51 135 2.4E4 3.5E3 2.5E4 113
   Lη1-TGV2 0.987 0.006 0.989 42 42.51 1.09 42.44 20 2.4E4 3.6E3 2.5E4 51
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV ICTV > TGV2> TV
   L22-TV 0.986 0.007 0.987 13 42.46 0.95 42.43 17 0.42 0.07 0.43 17
   L22-ICTV 0.987 0.007 0.988 139 42.57 0.95 42.56 128 0.41 0.07 0.42 128
   L22-TGV2 0.987 0.007 0.988 38 42.53 0.97 42.51 40 0.41 0.07 0.42 40
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV ICTV > TGV2> TV

Table 7.

Regulariser performance with batch learning, Lη1 and L22 costs, noise variance σ2= 10; BSDS300 dataset, resized

SSIM PSNR Value
Mean Std Med Best Mean Std Med Best Mean Std Med Best
Noisy data 0.731 0.120 0.744 8 27.72 0.88 28.09 2 1.4E5 2.5E3 1.4E5 0
   Lη1-TV 0.893 0.035 0.897 23 31.24 1.87 30.94 23 7.5E4 2.2E4 7.3E4 18
   Lη1-ICTV 0.897 0.034 0.902 134 31.36 1.81 31.11 150 7.4E4 2.2E4 7.2E4 107
   Lη1-TGV2 0.896 0.035 0.901 35 31.31 1.88 31.01 25 7.4E4 2.3E4 7.2E4 75
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV ICTV, TGV2> TV
   L22-TV 0.887 0.035 0.889 29 31.31 1.50 31.15 25 5.72 1.91 5.51 25
   L22-ICTV 0.889 0.036 0.893 127 31.41 1.44 31.28 131 5.57 1.83 5.37 131
   L22-TGV2 0.888 0.035 0.891 44 31.38 1.50 31.20 44 5.64 1.90 5.44 44
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV ICTV > TGV2> TV

Table 8.

Regulariser performance with batch learning, Lη1 and L22 costs, noise variance σ2= 20; BSDS300 dataset, resized

SSIM PSNR Value
Mean Std Med Best Mean Std Med Best Mean Std Med Best
Noisy data 0.505 0.143 0.516 4 21.80 0.92 22.14 1 2.8E5 7.9E3 2.8E5 0
   Lη1-TV 0.789 0.067 0.798 18 27.37 2.13 26.98 24 1.0E5 3.7E4 9.8E4 14
   Lη1-ICTV 0.795 0.065 0.804 139 27.46 2.10 27.05 141 1.0E5 3.6E4 9.6E4 91
   Lη1-TGV2 0.794 0.066 0.804 39 27.44 2.12 27.04 34 1.0E5 3.7E4 9.6E4 95
95 % t test ICTV > TGV2> TV ICTV > TGV2> TV TGV2> ICTV > TV
   L22-TV 0.786 0.053 0.790 31 27.50 1.71 27.27 33 14.11 5.78 13.16 33
   L22-ICTV 0.790 0.054 0.790 123 27.56 1.64 27.37 119 13.84 5.54 12.75 119
   L22-TGV2 0.789 0.053 0.793 46 27.55 1.70 27.33 48 13.93 5.73 12.95 48
95 % t test ICTV, TGV2> TV ICTV, TGV2> TV ICTV > TGV2> TV

For the first image of the dataset, ICTV does better than TGV2 in all of Figs. 7, 8, 9, 10, 11, 12, 13 and 14, while for the second image, the situation is reversed. We have highlighted these two images for the Lη1 cost in Figs. 15, 16, 17 and 18, for both noise levels σ=2 and σ=20. In the case where ICTV does better, hardly any difference can be observed by eye, while for the second image, TGV2 clearly has less staircasing in the smooth areas of the image, especially with the noise level σ=20.

Fig. 15.

Fig. 15

Image for which ICTV performs better than TGV2, σ=2

Fig. 16.

Fig. 16

Image for which ICTV performs better than TGV2, σ=20

Fig. 17.

Fig. 17

Image for which TGV2 performs better than ICTV, σ=2

Fig. 18. Image for which TGV2 performs better than ICTV, σ=20

Based on this study, ICTV therefore appears to be the most reliable of the tested regularisers when the type of image being processed is unknown and a high SSIM or PSNR, or a low Lη1 cost functional value, is desired. As can be observed for individual images, however, within large smooth areas it can exhibit artefacts that are avoided by the use of TGV2.

The Choice of Cost Functional

The L22 cost functional naturally attains better PSNR than Lη1, since PSNR is a monotone transformation of the squared L2 error. Comparing the results for the two cost functionals in Tables 3, 4 and 5, we may however observe that for the low noise levels σ2=2,10, and generally for batch learning, Lη1 attains better (higher) SSIM. Since SSIM captures the visual quality of images better than PSNR [46], this recommends the use of our novel total variation cost functional Lη1. One might, of course, attempt to optimise the SSIM directly; this is however a non-convex functional, which would pose additional numerical challenges that are avoided by the convex total variation cost.
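
To make the relationship between these quality measures concrete, the sketch below computes, for a reconstruction u and ground truth u0, the squared L2 error, the PSNR derived from it, the SSIM, and a Huberised-TV-of-the-difference cost in the spirit of Lη1. The Huber parameter gamma and the forward-difference discretisation are illustrative assumptions, not the paper's implementation.

```python
# Quality measures for a reconstruction u against ground truth u0 (values in [0, 1]).
# The Lη1-style cost below is a simplified finite-difference sketch, not the
# paper's exact functional.
import numpy as np
from skimage.metrics import structural_similarity

def l2_cost(u, u0):
    return 0.5 * np.sum((u - u0) ** 2)

def psnr(u, u0, data_range=1.0):
    mse = np.mean((u - u0) ** 2)
    # PSNR is a monotone decreasing function of the mean squared error,
    # hence minimising the L2 cost maximises PSNR.
    return 10.0 * np.log10(data_range ** 2 / mse)

def huber_tv_cost(u, u0, gamma=1e-2):
    """Huber-regularised TV seminorm of the difference u - u0 (forward differences)."""
    d = u - u0
    dx = np.diff(d, axis=1, append=d[:, -1:])
    dy = np.diff(d, axis=0, append=d[-1:, :])
    mag = np.sqrt(dx ** 2 + dy ** 2)
    huber = np.where(mag <= gamma, mag ** 2 / (2 * gamma), mag - gamma / 2)
    return np.sum(huber)

# Example usage with a synthetic image and a noisy "reconstruction".
rng = np.random.default_rng(1)
u0 = np.clip(rng.random((64, 64)), 0, 1)
u = np.clip(u0 + 0.05 * rng.standard_normal(u0.shape), 0, 1)
print(l2_cost(u, u0), psnr(u, u0),
      structural_similarity(u, u0, data_range=1.0), huber_tv_cost(u, u0))
```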

Conclusion and Outlook

In this paper, we propose a bilevel optimisation approach in function space for learning the optimal choice of parameters in higher-order total variation regularisation. We present a rigorous analysis of this optimisation problem as well as a numerical discussion in the context of image denoising.

Analytically, we obtain existence results for the bilevel optimisation problem and prove the Fréchet differentiability of the solution operator. This leads to the existence of Lagrange multipliers and to a first-order optimality system characterising optimal solutions. In particular, the existence of an adjoint state allows us to derive a gradient formula for the cost functional, which is important for the design of efficient solution algorithms.
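
As a rough illustration of how such an adjoint-based gradient can be used algorithmically, the following sketch plugs hypothetical placeholders for the lower-level denoising solver, the adjoint equation and the gradient formula into a standard quasi-Newton outer loop. Only the outer optimisation scaffolding is concrete; solve_lower_level, solve_adjoint and cost_gradient are assumptions standing in for the quantities derived in the paper.

```python
# Sketch of an adjoint-based quasi-Newton outer loop for the bilevel problem.
# solve_lower_level, solve_adjoint and cost_gradient are hypothetical placeholders.
import numpy as np
from scipy.optimize import minimize

def solve_lower_level(alpha, f):
    raise NotImplementedError("Huber-regularised TV/TGV2/ICTV denoising solver")

def solve_adjoint(u, alpha, f, u_true):
    raise NotImplementedError("adjoint (linearised) state equation")

def cost_gradient(u, p, alpha):
    raise NotImplementedError("gradient formula evaluated via the adjoint state p")

def upper_level(alpha, f, u_true):
    u = solve_lower_level(alpha, f)           # reconstruction for current parameters
    cost = 0.5 * np.sum((u - u_true) ** 2)    # e.g. L2 cost; an Lη1 cost is handled analogously
    p = solve_adjoint(u, alpha, f, u_true)
    return cost, cost_gradient(u, p, alpha)

def learn_parameters(alpha0, f, u_true):
    # The quasi-Newton method only needs cost values and adjoint-based gradients.
    result = minimize(upper_level, alpha0, args=(f, u_true),
                      jac=True, method="L-BFGS-B",
                      bounds=[(1e-8, None)] * len(alpha0))
    return result.x
```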

We make use of the bilevel learning approach, and of the theoretical findings, to compare the performance, in terms of returned image quality, of TV, ICTV and TGV regularisation. A statistical analysis, carried out on a dataset of 200 images, suggests that ICTV performs slightly better than TGV, and both perform better than TV, on average. For denoising of images with a high noise level, ICTV and TGV score comparably well. For images with large smooth areas, TGV performs better than ICTV.

Moreover, we propose a new cost functional for the bilevel learning problem, which exhibits interesting theoretical properties and behaves better than the PSNR-related L2 cost used previously in the literature. This study raises the question of other, alternative cost functionals. For instance, one could be tempted to use the SSIM as cost, but its non-convexity would likely present analytical and numerical difficulties. The new cost functional proposed in this paper turns out to be a good compromise between an image quality measure and an analytically tractable cost term.

Acknowledgments

This research has been supported by King Abdullah University of Science and Technology (KAUST) Award No. KUK-I1-007-43, EPSRC grants Nr. EP/J009539/1 “Sparse & Higher-order Image Restoration” and Nr. EP/M00483X/1 “Efficient computational tools for inverse imaging problems”, Escuela Politécnica Nacional de Quito Award No. PIS 12-14, MATHAmSud project SOCDE “Sparse Optimal Control of Differential Equations” and the Leverhulme Trust project on “Breaking the non-convexity barrier”. While in Quito, T. Valkonen has moreover been supported by SENESCYT (Ecuadorian Ministry of Higher Education, Science, Technology and Innovation) under a Prometeo Fellowship.

Biographies

J. C. De los Reyes

is Director of the Ecuadorian Research Center on Mathematical Modelling (MODEMAT) and Full Professor at Escuela Politécnica Nacional (Ecuador). He obtained his degree in Mathematics at Escuela Politécnica Nacional in 2000 and his Ph.D. in Mathematics at the University of Graz (Austria) in 2003. He worked for one year (2005/2006) as a postdoctoral researcher at the Technical University of Berlin. In 2009 he was awarded an Alexander von Humboldt Fellowship for Experienced Researchers to carry out research in Berlin. He has held Visiting Professor positions at the Humboldt University of Berlin (2010) and at the University of Hamburg (2013). In 2010 he was also awarded a J.T. Oden Faculty Fellowship to carry out research at The University of Texas at Austin. In February 2015 he was appointed as a member of the Academy of Sciences of Ecuador (ACE).

C.-B. Schönlieb

is a Reader in Applied and Computational Mathematics and head of the Cambridge Image Analysis (CIA) group at the Department of Applied Mathematics and Theoretical Physics (DAMTP), University of Cambridge. Moreover, she is the Director of the Cantab Capital Institute for the Mathematics of Information, Co-Director of the EPSRC Centre for Mathematical and Statistical Analysis of Multimodal Clinical Imaging, a Fellow of Jesus College, Cambridge, and co-leader of the IMAGES network. Carola obtained her degree in Mathematics at the University of Salzburg in 2004 and her Ph.D. in Mathematics at the University of Cambridge in 2009. After one year of postdoctoral activity at the University of Goettingen (Germany), she became a Lecturer in DAMTP and was promoted to Reader in 2015.

T. Valkonen

is a Lecturer in applied mathematics at the University of Liverpool. He obtained his Ph.D. in Mathematics at the University of Jyväskylä (Finland) in 2008 and worked as a postdoctoral researcher at the University of Graz and at the University of Cambridge.

References

1. Benning M, Brune C, Burger M, Müller J. Higher-order TV methods—enhancement via Bregman iteration. J. Sci. Comput. 2013;54(2–3):269–310. doi:10.1007/s10915-012-9650-3
2. Benning M, Gladden L, Holland D, Schönlieb C-B, Valkonen T. Phase reconstruction from velocity-encoded MRI measurements—a survey of sparsity-promoting variational approaches. J. Magn. Reson. 2014;238:26–43. doi:10.1016/j.jmr.2013.10.003
3. Biegler L, Biros G, Ghattas O, Heinkenschloss M, Keyes D, Mallick B, Tenorio L, van Bloemen Waanders B, Willcox K, Marzouk Y. Large-Scale Inverse Problems and Quantification of Uncertainty. New York: Wiley; 2011
4. Bonnans JF, Tiba D. Pontryagin's principle in the control of semilinear elliptic variational inequalities. Appl. Math. Optim. 1991;23(1):299–312. doi:10.1007/BF01442403
5. Bredies K, Kunisch K, Pock T. Total generalized variation. SIAM J. Imaging Sci. 2011;3:492–526. doi:10.1137/090769521
6. Bredies K, Holler M. A total variation-based JPEG decompression model. SIAM J. Imaging Sci. 2012;5(1):366–393. doi:10.1137/110833531
7. Bredies K, Kunisch K, Valkonen T. Properties of L1-TGV2: the one-dimensional case. J. Math. Anal. Appl. 2013;398:438–454. doi:10.1016/j.jmaa.2012.08.053
8. Bredies K, Valkonen T. Inverse problems with second-order total generalized variation constraints. In: Proceedings of the 9th International Conference on Sampling Theory and Applications (SampTA), Singapore (2011)
9. Bui-Thanh T, Willcox K, Ghattas O. Model reduction for large-scale systems with high-dimensional parametric input space. SIAM J. Sci. Comput. 2008;30(6):3270–3288. doi:10.1137/070694855
10. Calatroni L, De los Reyes JC, Schönlieb C-B. Dynamic sampling schemes for optimal noise learning under multiple nonsmooth constraints. In: Poetzsche C, editor. System Modeling and Optimization. New York: Springer Verlag; 2014. pp. 85–95
11. Chambolle A, Lions P-L. Image recovery via total variation minimization and related problems. Numer. Math. 1997;76:167–188. doi:10.1007/s002110050258
12. Chan T, Marquina A, Mulet P. High-order total variation-based image restoration. SIAM J. Sci. Comput. 2000;22(2):503–516. doi:10.1137/S1064827598344169
13. Chan TF, Kang SH, Shen J. Euler's elastica and curvature-based inpainting. SIAM J. Appl. Math. 2002;63(2):564–592
14. Chen Y, Pock T, Bischof H. Learning ℓ1-based analysis and synthesis sparsity priors using bi-level optimization. In: Workshop on Analysis Operator Learning versus Dictionary Learning, NIPS 2012 (2012)
15. Chen Y, Ranftl R, Pock T. Insights into analysis operator learning: from patch-based sparse models to higher-order MRFs. IEEE Trans. Image Process. (2014) (to appear)
16. Chung J, Español MI, Nguyen T. Optimal regularization parameters for general-form Tikhonov regularization. arXiv preprint arXiv:1407.1911 (2014)
17. Dauge M. Neumann and mixed problems on curvilinear polyhedra. Integr. Equ. Oper. Theory. 1992;15(2):227–261. doi:10.1007/BF01204238
18. De los Reyes JC, Meyer C. Strong stationarity conditions for a class of optimization problems governed by variational inequalities of the second kind. J. Optim. Theory Appl. 2015;168(2):375–409. doi:10.1007/s10957-015-0748-2
19. De los Reyes JC, Schönlieb C-B, Valkonen T. The structure of optimal parameters for image restoration problems. J. Math. Anal. Appl. 2016;434(1):464–500. doi:10.1016/j.jmaa.2015.09.023
20. De los Reyes JC. Optimal control of a class of variational inequalities of the second kind. SIAM J. Control Optim. 2011;49(4):1629–1658. doi:10.1137/090764438
21. De los Reyes JC, Hintermüller M. A duality based semismooth Newton framework for solving variational inequalities of the second kind. Interfaces Free Bound. 2011;13(4):437–462. doi:10.4171/IFB/267
22. De los Reyes JC, Schönlieb C-B. Image denoising: learning the noise model via nonsmooth PDE-constrained optimization. Inverse Probl. Imaging. 2013;7(4):1139–1155. doi:10.3934/ipi.2013.7.1183
23. Domke J. Generic methods for optimization-based modeling. In: International Conference on Artificial Intelligence and Statistics, pp. 318–326 (2012)
24. Gröger K. A W1,p-estimate for solutions to mixed boundary value problems for second order elliptic differential equations. Math. Ann. 1989;283(4):679–687. doi:10.1007/BF01442860
25. Haber E, Tenorio L. Learning regularization functionals—a supervised training approach. Inverse Probl. 2003;19(3):611. doi:10.1088/0266-5611/19/3/309
26. Haber E, Horesh L, Tenorio L. Numerical methods for the design of large-scale nonlinear discrete ill-posed inverse problems. Inverse Probl. 2010;26(2):025002. doi:10.1088/0266-5611/26/2/025002
27. Hinterberger W, Scherzer O. Variational methods on the space of functions of bounded Hessian for convexification and denoising. Computing. 2006;76(1):109–133. doi:10.1007/s00607-005-0119-1
28. Hintermüller M, Laurain A, Löbhard C, Rautenberg CN, Surowiec TM. Elliptic mathematical programs with equilibrium constraints in function space: optimality conditions and numerical realization. In: Rannacher R, editor. Trends in PDE Constrained Optimization. Berlin: Springer International Publishing; 2014. pp. 133–153
29. Hintermüller M, Stadler G. An infeasible primal-dual algorithm for total bounded variation-based inf-convolution-type image restoration. SIAM J. Sci. Comput. 2006;28(1):1–23. doi:10.1137/040613263
30. Hintermüller M, Wu T. Bilevel optimization for calibrating point spread functions in blind deconvolution. Preprint (2014)
31. Knoll F, Bredies K, Pock T, Stollberger R. Second order total generalized variation (TGV) for MRI. Magn. Reson. Med. 2011;65(2):480–491. doi:10.1002/mrm.22595
32. Kunisch K, Hintermüller M. Total bounded variation regularization as a bilaterally constrained optimization problem. SIAM J. Imaging Sci. 2004;64(4):1311–1333
33. Kunisch K, Pock T. A bilevel optimization approach for parameter learning in variational models. SIAM J. Imaging Sci. 2013;6(2):938–983. doi:10.1137/120882706
34. Luo Z-Q, Pang J-S, Ralph D. Mathematical Programs with Equilibrium Constraints. Cambridge: Cambridge University Press; 1996
35. Lysaker M, Tai X-C. Iterative image restoration combining total variation minimization and a second-order functional. Int. J. Comput. Vis. 2006;66(1):5–18. doi:10.1007/s11263-005-3219-7
36. Martin D, Fowlkes C, Tal D, Malik J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of the 8th International Conference on Computer Vision, vol. 2, pp. 416–423 (2001). The database is available online at http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/BSDS300/html/dataset/images.html
37. Masnou S, Morel J-M. Level lines based disocclusion. In: 1998 IEEE International Conference on Image Processing (ICIP 98), pp. 259–263 (1998)
38. Outrata JV. A generalized mathematical program with equilibrium constraints. SIAM J. Control Optim. 2000;38(5):1623–1638. doi:10.1137/S0363012999352911
39. Papafitsoros K, Schönlieb C-B. A combined first and second order variational approach for image reconstruction. J. Math. Imaging Vis. 2014;48(2):308–338. doi:10.1007/s10851-013-0445-4
40. Ring W. Structural properties of solutions to total variation regularization problems. ESAIM. 2000;34:799–810. doi:10.1051/m2an:2000104
41. Rudin L, Osher S, Fatemi E. Nonlinear total variation based noise removal algorithms. Phys. D. 1992;60:259–268. doi:10.1016/0167-2789(92)90242-F
42. Sun D, Han J. Newton and quasi-Newton methods for a class of nonsmooth equations and related problems. SIAM J. Optim. 1997;7(2):463–480. doi:10.1137/S1052623494274970
43. Tappen MF. Utilizing variational optimization to learn Markov random fields. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), pp. 1–8 (2007)
44. Valkonen T, Bredies K, Knoll F. Total generalised variation in diffusion tensor imaging. SIAM J. Imaging Sci. 2013;6(1):487–525. doi:10.1137/120867172
45. Viola F, Fitzgibbon A, Cipolla R. A unifying resolution-independent formulation for early vision. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 494–501 (2012)
46. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004;13(4):600–612. doi:10.1109/TIP.2003.819861
47. Zowe J, Kurcyusz S. Regularity and stability for the mathematical programming problem in Banach spaces. Appl. Math. Optim. 1979;5(1):49–62. doi:10.1007/BF01442543
