J Math Imaging Vis. 2024 Mar 16;66(3):294–313. doi: 10.1007/s10851-024-01174-1

Stochastic Primal–Dual Hybrid Gradient Algorithm with Adaptive Step Sizes

Antonin Chambolle 1,2, Claire Delplancke 3, Matthias J Ehrhardt 4, Carola-Bibiane Schönlieb 5, Junqi Tang 6
PMCID: PMC11636721  PMID: 39669719

Abstract

In this work, we propose a new primal–dual algorithm with adaptive step sizes. The stochastic primal–dual hybrid gradient (SPDHG) algorithm with constant step sizes has become widely applied in large-scale convex optimization across many scientific fields due to its scalability. While the product of the primal and dual step sizes is subject to an upper bound in order to ensure convergence, the selection of the ratio of the step sizes is critical in applications. Up to now there has been no systematic and successful way of selecting the primal and dual step sizes for SPDHG. In this work, we propose a general class of adaptive SPDHG (A-SPDHG) algorithms and prove their convergence under weak assumptions. We also propose concrete parameter-updating strategies which satisfy the assumptions of our theory and thereby lead to convergent algorithms. Numerical examples on computed tomography demonstrate the effectiveness of the proposed schemes.

Introduction

The stochastic primal–dual hybrid gradient (SPDHG) algorithm introduced in [8] is a stochastic version of the primal–dual hybrid gradient (PDHG) algorithm, also known as the Chambolle–Pock algorithm [9]. SPDHG has proved more efficient than PDHG for a variety of problems in the framework of large-scale non-smooth convex inverse problems [13, 22, 24, 27]. Indeed, SPDHG only uses a subset of the data at each iteration, hence reducing the computational cost of evaluating the forward operator and its adjoint; as a result, for the same computational budget, SPDHG converges faster than PDHG. This is especially relevant in the context of medical imaging, where there is a need for algorithms whose convergence speed is compatible with clinical standards and which, at the same time, can deal with convex, non-smooth priors like total variation (TV), which are well suited to ill-posed imaging inverse problems but preclude recourse to scalable gradient-based methods.

Like PDHG, SPDHG is provably convergent under the assumption that the product of its primal and dual step sizes is bounded by a constant depending on the problem to solve. On the other hand, the ratio between the primal and dual step sizes is a free parameter, whose value needs to be chosen by the user. The value of this parameter, which can be interpreted as a control on the balance between primal and dual convergence, can have a severe impact on the convergence speed of PDHG, and the same holds true for SPDHG [12]. This leads to an important challenge in practice, as there is no known theoretical or empirical rule to guide the choice of the parameter. Manual tuning is computationally expensive, as it requires running and comparing the algorithm over a range of values, and there is no guarantee that a value leading to fast convergence for one dataset remains a good choice for another dataset. For PDHG, an online primal–dual balancing strategy, where the values of the step sizes evolve along the iterations, has been proposed in [14] to solve this issue. More generally, adaptive step sizes have been used for PDHG with backtracking in [14, 20] and for adapting to local smoothness in [25], and are widely used for a variety of other algorithms, including gradient methods [19], subgradient methods [3] and splitting methods [47, 18], to improve convergence speed and bypass the need for explicit model constants such as Lipschitz constants or operator norms. For SPDHG, an empirical adaptive scheme has been used for Magnetic Particle Imaging, but without convergence proof [27].

On the theoretical side, a standard procedure to prove the convergence of proximal-based algorithms for convex optimization is to use the notion of Fejér monotonicity [2]. Constant step sizes lead to a fixed metric setting, while adaptive step sizes lead to a variable metric setting. The work [11] establishes the convergence of deterministic Fejér-monotone sequences in the variable metric setting, while [10] is concerned with the convergence of random Fejér-monotone sequences in the fixed metric setting.

In this work, we introduce and study an adaptive version of SPDHG. More precisely:

  • We introduce a broad class of strategies to adaptively choose the step sizes of SPDHG. This class includes, but is not limited to, the adaptive primal–dual balancing strategy, where the ratio of the step sizes, which controls the balance between convergence of the primal and dual variables, is tuned online.

  • We prove the almost-sure convergence of SPDHG under the schemes of the class. In order to do so, we introduce the concept of C-stability, which generalizes the notion of Fejér monotonicity, and we prove the convergence of random C-stable sequences in a variable metric setting, hence generalizing results from [11] and [10]. We then show that our proposed algorithm falls within this novel theoretical framework by following strategies similar to those in the almost-sure convergence proofs of [1, 16].

  • We compare the performance of SPDHG under various adaptive schemes and the standard fixed step-size scheme on large-scale imaging inverse problems (sparse-view CT, limited-angle CT, low-dose CT). We observe that the primal–dual balancing adaptive strategy is always at least as fast as all the other strategies. In particular, it consistently leads to substantial gains in convergence speed over the fixed strategy when the fixed step sizes, while within the theoretical convergence range, are badly chosen. This is especially relevant as it is impossible to know whether the fixed step sizes are well or badly chosen without running expensive comparative tests. Even in the cases where SPDHG's fixed step sizes are well tuned, meaning that they lie in the range to which the adaptive step sizes are observed to converge, we observe that our adaptive scheme still provides convergence acceleration over standard SPDHG after a certain number of iterations. Finally, we pay special attention to the hyperparameters used in the adaptive schemes. These hyperparameters essentially control the degree of adaptivity of the algorithm; each of them has a clear interpretation and is easy to choose in practice. We observe in our extensive numerical tests that the convergence speed of our adaptive scheme is robust to the choice of these parameters within the empirical range we provide; hence the scheme can be applied directly to the problem at hand without fine-tuning and solves the step-size choice challenge encountered by the user.

The rest of the paper is organized as follows. In Sect. 2, we introduce SPDHG with adaptive step sizes, state the convergence theorem, and carry out the proof. In Sect. 3, we propose concrete schemes to implement the adaptivity, followed by numerical tests on CT data in Sect. 4. We conclude in Sect. 5. Finally, Sect. 6 collects some useful lemmas and proofs.

Theory

Convergence Theorem

The variational problem to solve takes the form:

$$\min_{x\in X}\ \sum_{i=1}^n f_i(A_ix)+g(x),$$

where $X$ and $(Y_i)_{i\in\llbracket1,n\rrbracket}$ are Hilbert spaces, $A_i:X\to Y_i$ are bounded linear operators, and $f_i:Y_i\to[0,+\infty]$ and $g:X\to[0,+\infty]$ are convex functions. We define $Y=Y_1\times\dots\times Y_n$ with elements $y=(y_1,\dots,y_n)$ and $A:X\to Y$ such that $Ax=(A_1x,\dots,A_nx)$. The associated saddle-point problem reads as

$$\min_{x\in X}\ \sup_{y\in Y}\ \sum_{i=1}^n\big(\langle A_ix,\,y_i\rangle-f_i^*(y_i)\big)+g(x), \tag{2.1}$$

where $f_i^*$ stands for the Fenchel conjugate of $f_i$. The set of solutions to (2.1) is denoted by $C$, the set of nonnegative integers by $\mathbb{N}$, and $\llbracket1,n\rrbracket$ stands for $\{1,\dots,n\}$. Elements $(x,y)$ of $C$ are called saddle points and are characterized by

$$A_ix\in\partial f_i^*(y_i),\quad i\in\llbracket1,n\rrbracket;\qquad -\sum_{i=1}^nA_i^*y_i\in\partial g(x). \tag{2.2}$$

In order to solve the saddle-point problem, we introduce the adaptive stochastic primal–dual hybrid gradient (A-SPDHG) algorithm in Algorithm 2.1. At each iteration $k\in\mathbb{N}$, A-SPDHG involves the following five steps:

  • update the primal step size $\tau^{k+1}$ and the dual step sizes $(\sigma_i^{k+1})_{i\in\llbracket1,n\rrbracket}$ (line 4);

  • update the primal variable $x^{k+1}$ by a proximal step with step size $\tau^{k+1}$ (line 5);

  • randomly choose an index $i$ with probability $p_i$ (line 6);

  • update the dual variable $y_i^{k+1}$ by a proximal step with step size $\sigma_i^{k+1}$ (line 7);

  • compute the extrapolated dual variable $\bar y^{k+1}$ (line 8).
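Under serial sampling, one iteration of these five steps can be sketched as follows. This is an illustrative sketch only: the names `prox_g`, `prox_fconj` and `update_steps` are hypothetical, and the proximal maps and the step-size rule are supplied by the user.

```python
import numpy as np

def a_spdhg_step(x, y, ybar, tau, sigma, A, p, prox_g, prox_fconj, update_steps, rng):
    """One A-SPDHG iteration (serial sampling), following the five steps above.

    A : list of n matrices A_i;  p : sampling probabilities p_i;
    prox_g(z, tau), prox_fconj(i, z, sigma_i) : user-supplied proximal maps;
    update_steps(tau, sigma, x, y) -> (tau, sigma) : the adaptive rule.
    """
    n = len(A)
    # step 1: update the step sizes from the current iterates
    tau, sigma = update_steps(tau, sigma, x, y)
    # step 2: primal proximal step using the extrapolated dual variable
    x = prox_g(x - tau * sum(A[j].T @ ybar[j] for j in range(n)), tau)
    # step 3: draw an index i with probability p_i
    i = rng.choice(n, p=p)
    # step 4: dual proximal step on the selected block only
    y_old = y[i].copy()
    y[i] = prox_fconj(i, y[i] + sigma[i] * (A[i] @ x), sigma[i])
    # step 5: extrapolate the updated dual block; other blocks stay unchanged
    ybar = [yj.copy() for yj in y]
    ybar[i] = y[i] + (1.0 / p[i]) * (y[i] - y_old)
    return x, y, ybar, tau, sigma
```

For instance, passing an `update_steps` that returns its arguments unchanged recovers standard SPDHG with constant step sizes.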

A-SPDHG is adaptive in the sense that the step-size values are updated at each iteration according to an update rule which takes into account the values of the primal and dual iterates $x^l$ and $y^l$ up to the current iteration. As the iterates are stochastic, the step sizes are themselves stochastic, which must be carefully accounted for in the theory.

Before turning to the convergence of A-SPDHG, let us recall some facts about the state-of-the-art SPDHG. Each iteration of SPDHG involves the selection of a random subset of $\llbracket1,n\rrbracket$. In the serial sampling case, where the random subset is a singleton, the SPDHG algorithm [8] is a special case of Algorithm 2.1 with the update rule

$$\sigma_i^{k+1}=\sigma_i^k\ (=\sigma_i),\quad i\in\llbracket1,n\rrbracket,\qquad \tau^{k+1}=\tau^k\ (=\tau),\qquad k\in\mathbb{N}.$$

Under the condition

$$\tau\sigma_i<\frac{p_i}{\|A_i\|^2},\quad i\in\llbracket1,n\rrbracket, \tag{2.3}$$

SPDHG iterates converge almost surely to a solution of the saddle-point problem (2.1) [1, 16].
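For illustration, constant step sizes satisfying (2.3) can be computed from the operator norms, e.g. by fixing $\tau$ and saturating the bound up to a safety factor $\beta<1$. The helper below is a hypothetical sketch, not from the paper; the remaining free choice of $\tau$ is precisely the ratio-tuning problem discussed in the introduction.

```python
import numpy as np

def constant_steps(A_list, p, tau, beta=0.99):
    """Given a primal step tau, return dual steps sigma_i such that
    tau * sigma_i * ||A_i||^2 / p_i = beta < 1, i.e. condition (2.3)
    holds with safety margin beta (illustrative helper)."""
    norms = [np.linalg.norm(A, 2) for A in A_list]   # spectral norms ||A_i||
    return [beta * pi / (tau * Li ** 2) for pi, Li in zip(p, norms)]
```

Note that rescaling $\tau$ by any $c>0$ rescales every $\sigma_i$ by $1/c$, so the product constraint stays satisfied while the primal–dual ratio changes freely.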

Let us now turn to the convergence of A-SPDHG. The main theorem, Theorem 2.1, gives conditions on the update rule under which A-SPDHG is provably convergent. Plainly speaking, these conditions are threefold:

  • (i)

    the step sizes for step $k+1$, $(\sigma_i^{k+1})_{i\in\llbracket1,n\rrbracket}$ and $\tau^{k+1}$, depend only on the iterates up to step $k$,

  • (ii)

    the step sizes satisfy a uniform version of condition (2.3),

  • (iii)

    the step-size sequences $(\tau^k)_{k\ge0}$ and $(\sigma_i^k)_{k\ge0}$ for $i\in\llbracket1,n\rrbracket$ do not decrease too fast. More precisely, they are uniformly almost surely quasi-increasing in the sense defined below.

In order to state the theorem rigorously, let us introduce some useful notation and definitions. For all $k\in\mathbb{N}$, the $\sigma$-algebra generated by the iterates up to step $k$, $\{(x^l,y^l),\ l\in\llbracket0,k\rrbracket\}$, is denoted by $\mathcal{F}^k$. We say that a sequence $(u^k)_{k\in\mathbb{N}}$ is $(\mathcal{F}^k)_{k\in\mathbb{N}}$-adapted if for all $k\in\mathbb{N}$, $u^k$ is measurable with respect to $\mathcal{F}^k$.

A positive real sequence $(u^k)_{k\in\mathbb{N}}$ is said to be quasi-increasing if there exists a sequence $(\eta^k)_{k\in\mathbb{N}}$ with values in $[0,1)$, called the control on $(u^k)_{k\in\mathbb{N}}$, such that $\sum_{k=1}^\infty\eta^k<\infty$ and:

$$u^{k+1}\ge(1-\eta^k)\,u^k,\quad k\in\mathbb{N}. \tag{2.4}$$

By extension, we call a random positive real sequence $(u^k)_{k\in\mathbb{N}}$ uniformly almost surely quasi-increasing if there exists a deterministic sequence $(\eta^k)_{k\in\mathbb{N}}$ with values in $[0,1)$ such that $\sum_{k=1}^\infty\eta^k<\infty$ and equation (2.4) above holds almost surely (a.s.).
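As a quick numerical illustration of definition (2.4), a finite-horizon check of the quasi-increasing property might look as follows (an illustrative helper; summability of the control must of course be verified separately for the infinite sequence):

```python
import numpy as np

def is_quasi_increasing(u, eta):
    """Check (2.4): u[k+1] >= (1 - eta[k]) * u[k] for all k in the finite window,
    with the control eta taking values in [0, 1)."""
    u, eta = np.asarray(u, float), np.asarray(eta, float)
    assert np.all((0 <= eta) & (eta < 1)), "control must take values in [0, 1)"
    return bool(np.all(u[1:] >= (1 - eta[:-1]) * u[:-1]))
```

A genuinely increasing sequence is quasi-increasing with zero control, while a sequence with a drop larger than its control is not.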

Theorem 2.1

(Convergence of A-SPDHG) Let $X$ and $Y$ be separable Hilbert spaces, $A_i:X\to Y_i$ bounded linear operators, and $f_i:Y_i\to[0,+\infty]$ and $g:X\to[0,+\infty]$ proper, convex and lower semi-continuous functions for all $i\in\llbracket1,n\rrbracket$. Assume that the set of saddle points $C$ is non-empty and that the sampling is proper, that is to say $p_i>0$ for all $i\in\llbracket1,n\rrbracket$. If the following conditions are met:

  • (i)

    the step-size sequences $(\tau^{k+1})_{k\in\mathbb{N}}$ and $(\sigma_i^{k+1})_{k\in\mathbb{N}}$, $i\in\llbracket1,n\rrbracket$, are $(\mathcal{F}^k)_{k\in\mathbb{N}}$-adapted,

  • (ii)
    there exists $\beta\in(0,1)$ such that for all indices $i\in\llbracket1,n\rrbracket$ and iterations $k\in\mathbb{N}$,
    $$\frac{\tau^k\sigma_i^k\|A_i\|^2}{p_i}\le\beta<1, \tag{2.5}$$
  • (iii)

    the initial step sizes $\tau^0$ and $\sigma_i^0$, $i\in\llbracket1,n\rrbracket$, are positive, and the step-size sequences $(\tau^k)_{k\in\mathbb{N}}$ and $(\sigma_i^k)_{k\in\mathbb{N}}$, $i\in\llbracket1,n\rrbracket$, are uniformly almost surely quasi-increasing,

then the sequence of iterates $(x^k,y^k)_{k\in\mathbb{N}}$ converges almost surely to an element of $C$.

While conditions (i)–(iii) are general enough to cover a large range of step-size update rules, we focus in practice on the primal–dual balancing strategy, which consists in scaling the primal and the dual step sizes by inverse factors at each iteration. In that case, the update rule depends on a random positive sequence $(\gamma^k)_{k\in\mathbb{N}}$ and reads:

$$\tau^{k+1}=\frac{\tau^k}{\gamma^k},\qquad \sigma_i^{k+1}=\gamma^k\sigma_i^k,\quad i\in\llbracket1,n\rrbracket. \tag{2.6}$$

Lemma 2.2

(Primal–dual balancing) Let the step-size sequences satisfy equation (2.6), and assume in addition that $(\gamma^k)_{k\in\mathbb{N}}$ is $(\mathcal{F}^k)_{k\in\mathbb{N}}$-adapted, that the initial step sizes satisfy

$$\frac{\tau^0\sigma_i^0\|A_i\|^2}{p_i}<1,\quad i\in\llbracket1,n\rrbracket,$$

and are positive, and that there exists a deterministic sequence $(\epsilon^k)_{k\in\mathbb{N}}$ with values in $[0,1)$ such that $\sum_k\epsilon^k<\infty$ and, for all $k\in\mathbb{N}$,

$$\min\{\gamma^k,(\gamma^k)^{-1}\}\ge1-\epsilon^k. \tag{2.7}$$

Then, the step-size sequences satisfy assumptions (i)–(iii) of Theorem 2.1.

Lemma 2.2 is proved in Sect. 6.
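The content of Lemma 2.2 is easy to check numerically: under (2.6) the product $\tau^k\sigma_i^k$ is invariant, so (2.5) holds for all $k$ as soon as it holds at $k=0$, and the bound (2.7) on $\gamma^k$ makes both step-size sequences quasi-increasing with control $(\epsilon^k)$. The sketch below assumes the form $\tau^{k+1}=\tau^k/\gamma^k$, $\sigma_i^{k+1}=\gamma^k\sigma_i^k$ of (2.6) and a single dual block.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2000
eps = 1.0 / (2.0 + np.arange(K)) ** 2                 # summable control in [0, 1)
# draw gamma^k respecting (2.7): min(gamma, 1/gamma) = exp(-|log gamma|) >= 1 - eps_k
gamma = np.exp(rng.uniform(-1, 1, K) * np.abs(np.log(1 - eps)))

tau = np.empty(K + 1)
sigma = np.empty(K + 1)
tau[0], sigma[0] = 0.3, 0.2
for k in range(K):                                    # balancing update (2.6)
    tau[k + 1] = tau[k] / gamma[k]
    sigma[k + 1] = gamma[k] * sigma[k]
```

The assertions below confirm the invariant product and the quasi-increase (2.4) with control $(\epsilon^k)$ for both sequences.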

Connection with the literature:

  • The primal–dual balancing strategy was introduced in [14] for PDHG, and indeed for $n=1$ we recover with Lemma 2.2 the non-backtracking algorithm presented in [14]. As a consequence, our theorem also implies the pointwise convergence of this algorithm, whose convergence was established in the sense of vanishing residuals in [14].

  • Still for PDHG, [20] proposes, without proof, an update rule where the ratio of the step sizes is either quasi-non-increasing or quasi-non-decreasing. This requirement is similar to, but not directly connected with, ours, where we ask the step sizes themselves to be quasi-increasing.

  • For SPDHG, the angular constraint step-size rule proposed without convergence proof in [27] satisfies assumptions (i)–(iii).

Outline of the proof: Theorem 2.1 is proved in the following subsections. We first define in Sect. 2.2 metrics on the primal–dual product space related to the algorithm's step sizes. As the step sizes are adaptive, we obtain a sequence of metrics. The proof of Theorem 2.1 is then similar in strategy to those of [1] and [16], but requires novel elements to deal with the variability of the metrics. In Theorem 2.5, we state convergence conditions for an abstract random sequence in a Hilbert space equipped with random variable metrics. In Sects. 2.4 and 2.5, we show that A-SPDHG falls within the scope of Theorem 2.5. We collect all elements and conclude the proof in Sect. 2.6.

Variable Metrics

For a Hilbert space $H$, we call $S(H)$ the set of bounded self-adjoint linear operators from $H$ to $H$, and for all $M\in S(H)$ we introduce the notation:

$$\|u\|_M^2=\langle Mu,\,u\rangle,\quad u\in H.$$

By an abuse of notation, we write $\|\cdot\|_\alpha^2=\|\cdot\|_{\alpha\,\mathrm{Id}}^2$ for a scalar $\alpha\in\mathbb{R}$. Notice that $\|\cdot\|_M$ is a norm on $H$ if $M$ is positive definite. Furthermore, we introduce the partial order $\preceq$ on $S(H)$ such that for $M,N\in S(H)$,

$$N\preceq M\quad\text{if}\quad\forall u\in H,\ \|u\|_N\le\|u\|_M.$$

We call $S_\alpha(H)$ the subset of $S(H)$ comprised of the operators $M$ such that $\alpha\,\mathrm{Id}\preceq M$. Furthermore, a random sequence $(M^k)_{k\in\mathbb{N}}$ in $S(H)$ is said to be uniformly almost surely quasi-decreasing if there exists a deterministic nonnegative sequence $(\eta^k)_{k\in\mathbb{N}}$ such that $\sum_{k=1}^\infty\eta^k<\infty$ and a.s.

$$M^{k+1}\preceq(1+\eta^k)\,M^k,\quad k\in\mathbb{N}.$$

Coming back to A-SPDHG, let us define for every iteration $k\in\mathbb{N}$ and every index $i\in\llbracket1,n\rrbracket$ two block operators in $S(X\times Y_i)$:

$$M_i^k=\begin{pmatrix}\frac{1}{\tau^k}\,\mathrm{Id} & -\frac{1}{p_i}A_i^*\\[4pt] -\frac{1}{p_i}A_i & \frac{1}{p_i\sigma_i^k}\,\mathrm{Id}\end{pmatrix},\qquad N_i^k=\begin{pmatrix}\frac{1}{\tau^k}\,\mathrm{Id} & 0\\[4pt] 0 & \frac{1}{p_i\sigma_i^k}\,\mathrm{Id}\end{pmatrix},$$

and a block operator in $S(X\times Y)$:

$$N^k=\operatorname{diag}\left(\frac{1}{\tau^k}\,\mathrm{Id},\ \frac{1}{p_1\sigma_1^k}\,\mathrm{Id},\ \dots,\ \frac{1}{p_n\sigma_n^k}\,\mathrm{Id}\right). \tag{2.8}$$
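In finite dimensions and for a single block, these operators are ordinary symmetric matrices, and the relation between condition (2.5) and the partial order can be checked numerically: with $\tau^k\sigma_i^k\|A_i\|^2/p_i=\beta$, one has $(1-\beta')N_i^k\preceq M_i^k$ for $\beta'=\sqrt{\beta}$ (the square root coming from a Schur-complement computation; the matrices and values below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
Ai = rng.standard_normal((3, 4))                 # a toy operator A_i : R^4 -> R^3
pi = 0.5
L2 = np.linalg.norm(Ai, 2) ** 2                  # ||A_i||^2
beta = 0.8
tau = 0.3
sigma_i = beta * pi / (tau * L2)                 # saturate (2.5): tau*sigma_i*L2/pi = beta

Id_x, Id_y = np.eye(4), np.eye(3)
Mi = np.block([[Id_x / tau,             -Ai.T / pi],
               [-Ai / pi,   Id_y / (pi * sigma_i)]])
Ni = np.block([[Id_x / tau,        np.zeros((4, 3))],
               [np.zeros((3, 4)), Id_y / (pi * sigma_i)]])

# M_i - (1 - sqrt(beta)) N_i should be positive semi-definite
eigs = np.linalg.eigvalsh(Mi - (1.0 - np.sqrt(beta)) * Ni)
```

The smallest eigenvalue of the difference is (numerically) nonnegative, and $M_i^k$ itself is positive definite since $\beta<1$, so both operators indeed induce metrics.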

The following lemma translates assumptions (i)–(iii) of Theorem 2.1 into properties of the variable metric sequences.

Lemma 2.3

(Variable metric properties)

  (a) Assumption (i) of Theorem 2.1 implies that $(M_i^{k+1})_{k\in\mathbb{N}}$, $(N_i^{k+1})_{k\in\mathbb{N}}$, $i\in\llbracket1,n\rrbracket$, and $(N^{k+1})_{k\in\mathbb{N}}$ are $(\mathcal{F}^k)_{k\in\mathbb{N}}$-adapted.

  (b) Assumption (ii) of Theorem 2.1 is equivalent to the existence of $\beta\in(0,1)$ such that for all indices $i\in\llbracket1,n\rrbracket$ and iterations $k\in\mathbb{N}$,
    $$(1-\beta)\,N_i^k\preceq M_i^k.$$
  (c) Assumptions (ii) and (iii) of Theorem 2.1 imply that $(M_i^k)_{k\in\mathbb{N}}$, $(N_i^k)_{k\in\mathbb{N}}$, $i\in\llbracket1,n\rrbracket$, and $(N^k)_{k\in\mathbb{N}}$ are uniformly a.s. quasi-decreasing.

  (d) Assumptions (ii) and (iii) of Theorem 2.1 imply that the sequences $(\tau^k)_{k\in\mathbb{N}}$ and $(\sigma_i^k)_{k\in\mathbb{N}}$, $i\in\llbracket1,n\rrbracket$, are a.s. bounded from above and from below by positive constants. In particular, this implies that there exists $\alpha>0$ such that $N_i^k\in S_\alpha(X\times Y_i)$ for all $i\in\llbracket1,n\rrbracket$ and $k\in\mathbb{N}$, or equivalently that $N^k\in S_\alpha(X\times Y)$ for all $k\in\mathbb{N}$.

Remark 2.4

(Step-size induced metrics on the primal–dual product space) The lemma implies that $M_i^k$, $N_i^k$ and $N^k$ are positive definite and hence induce metrics on the corresponding spaces. If $n=1$ and the step sizes are constant, $M_i^k$ corresponds to the metric used in [17], where PDHG is reformulated as a proximal-point algorithm for a non-trivial metric on the primal–dual product space.

Proof of Lemma 2.3

Assertion (a) of the lemma follows from the fact that for all $k\in\mathbb{N}$, the operators $M_i^{k+1}$, $N_i^{k+1}$ and $N^{k+1}$ are measurable with respect to the $\sigma$-algebra generated by $\tau^{k+1}$ and $\sigma_i^{k+1}$, $i\in\llbracket1,n\rrbracket$. Assertion (b) follows from equation (6.2) of Lemma 6.1, to be found in Sect. 6. The proof of assertion (c) is a bit more involved. Let us assume that assumption (iii) of Theorem 2.1 holds, and let $(\eta_0^k)_{k\in\mathbb{N}}$ and $(\eta_i^k)_{k\in\mathbb{N}}$ be the controls of $(\tau^k)_{k\in\mathbb{N}}$ and $(\sigma_i^k)_{k\in\mathbb{N}}$, $i\in\llbracket1,n\rrbracket$, respectively. We define the sequence $(\eta^k)_{k\in\mathbb{N}}$ by:

$$\eta^k=\max\{\eta_i^k,\ i\in\llbracket0,n\rrbracket\},\quad k\in\mathbb{N}, \tag{2.9}$$

which is a common control on $(\tau^k)_{k\in\mathbb{N}}$ and $(\sigma_i^k)_{k\in\mathbb{N}}$, $i\in\llbracket1,n\rrbracket$, as the maximum of a finite number of controls. Let us fix $k\in\mathbb{N}$ and $i\in\llbracket1,n\rrbracket$. Because the intersection of a finite number of measurable events of probability one is again a measurable event of probability one, it holds almost surely that for all $(x,y_i)\in X\times Y_i$,

$$\|(x,y_i)\|_{N_i^{k+1}}^2=\frac{1}{\tau^{k+1}}\|x\|^2+\frac{1}{p_i\sigma_i^{k+1}}\|y_i\|^2\le\frac{1}{1-\eta^k}\left(\frac{1}{\tau^k}\|x\|^2+\frac{1}{p_i\sigma_i^k}\|y_i\|^2\right)=\left(1+\frac{\eta^k}{1-\eta^k}\right)\|(x,y_i)\|_{N_i^k}^2.$$

Hence, the sequence $(N_i^k)_{k\in\mathbb{N}}$ is uniformly quasi-decreasing with control $(\eta^k(1-\eta^k)^{-1})_{k\in\mathbb{N}}$, which is indeed a nonnegative sequence with finite sum. (To see that $(\eta^k(1-\eta^k)^{-1})_{k\in\mathbb{N}}$ is summable, note that $(\eta^k)_{k\in\mathbb{N}}$ is summable, hence converges to $0$, hence is smaller than $1/2$ for all integers $k$ greater than a certain $K$; in turn, for all $k>K$, the term $\eta^k(1-\eta^k)^{-1}$ is bounded from below by $0$ and from above by $2\eta^k$, hence the sequence is summable.) One can see by a similar argument that $(N^k)_{k\in\mathbb{N}}$ is uniformly quasi-decreasing with the same control. To continue with the case of $(M_i^k)_{k\in\mathbb{N}}$, we have, as before:

$$M_i^{k+1}=\begin{pmatrix}\frac{1}{\tau^{k+1}}\,\mathrm{Id} & -\frac{1}{p_i}A_i^*\\[4pt] -\frac{1}{p_i}A_i & \frac{1}{p_i\sigma_i^{k+1}}\,\mathrm{Id}\end{pmatrix}\preceq M_i^k+\frac{\eta^k}{1-\eta^k}\,N_i^k\preceq\left(1+\frac{\eta^k}{1-\eta^k}\cdot\frac{1}{1-\beta}\right)M_i^k$$

thanks to (b).

Let us conclude with the proof of assertion (d). By assumption (iii), the sequences $(\tau^k)_{k\in\mathbb{N}}$ and $(\sigma_i^k)_{k\in\mathbb{N}}$ are uniformly a.s. quasi-increasing. We define a common control $(\eta^k)_{k\in\mathbb{N}}$ as in (2.9). Then, the sequences $(\tau^k)_{k\in\mathbb{N}}$ and $(\sigma_i^k)_{k\in\mathbb{N}}$ are a.s. bounded from below by the same deterministic constant $C=\min\{\tau^0,\sigma_i^0,\ i\in\llbracket1,n\rrbracket\}\prod_{j=0}^\infty(1-\eta^j)$, which is positive since the initial step sizes are positive and $(\eta^k)_{k\in\mathbb{N}}$ takes values in $[0,1)$ and has finite sum. Furthermore, by assumption (ii), the products $\tau^k\sigma_i^k$ are almost surely bounded from above; as a consequence, each sequence $(\tau^k)_{k\in\mathbb{N}}$ and $(\sigma_i^k)_{k\in\mathbb{N}}$ is a.s. bounded from above. The equivalence with $N_i^k\in S_\alpha(X\times Y_i)$ for all $i\in\llbracket1,n\rrbracket$, and with $N^k\in S_\alpha(X\times Y)$, is straightforward.

Convergence of Random C-stable Sequences in Random Variable Metrics

Let $H$ be a Hilbert space and $C\subset H$ a subset of $H$. Let $(\Omega,\sigma(\Omega),\mathbb{P})$ be a probability space. All random variables in the following are assumed to be defined on $\Omega$ and measurable with respect to $\sigma(\Omega)$ unless stated otherwise. Let $(Q^k)_{k\in\mathbb{N}}$ be a random sequence in $S(H)$.

A random sequence $(u^k)_{k\in\mathbb{N}}$ with values in $H$ is said to be stable with respect to the target $C$ relative to $(Q^k)_{k\in\mathbb{N}}$ if for all $u\in C$, the sequence $(\|u^k-u\|_{Q^k})_{k\in\mathbb{N}}$ converges almost surely. The following theorem states sufficient conditions for the convergence of such sequences.

Theorem 2.5

(Convergence of C-stable sequences) Let $H$ be a separable Hilbert space, $C$ a closed non-empty subset of $H$, $(Q^k)_{k\in\mathbb{N}}$ a random sequence in $S(H)$, and $(u^k)_{k\in\mathbb{N}}$ a random sequence in $H$. If the following conditions are met:

  • (i)

    $(Q^k)_{k\in\mathbb{N}}$ takes values in $S_\alpha(H)$ for a given $\alpha>0$ and is uniformly a.s. quasi-decreasing,

  • (ii)

    $(u^k)_{k\in\mathbb{N}}$ is stable with respect to the target $C$ relative to $(Q^k)_{k\in\mathbb{N}}$,

  • (iii)

    every weak sequential cluster point of $(u^k)_{k\in\mathbb{N}}$ is almost surely in $C$, meaning that there exists $\Omega_{(iii)}$, a measurable subset of $\Omega$ of probability one, such that for all $\omega\in\Omega_{(iii)}$, every weak sequential cluster point of $(u^k(\omega))_{k\in\mathbb{N}}$ is in $C$,

then $(u^k)_{k\in\mathbb{N}}$ converges almost surely weakly to a random variable taking values in $C$.

Stability with respect to a target set $C$ is implied by Fejér and quasi-Fejér monotonicity with respect to $C$, which have been studied either for random sequences [10] or in the framework of variable metrics [11], but, to the best of our knowledge, not both at the same time. The proof of Theorem 2.5 follows the same lines as [10, Proposition 2.3 (iii)] and uses two results from [11].

Proof

The set $C$ is a subset of the separable Hilbert space $H$, hence is separable. As $C$ is closed and separable, there exists a countable subset $\{c_n,\ n\in\mathbb{N}\}$ of $C$ whose closure is equal to $C$. Thanks to assumption (ii), for all $n\in\mathbb{N}$ there exists a measurable subset $\Omega_{(ii)}^n$ of $\Omega$ of probability one such that the sequence $(\|u^k(\omega)-c_n\|_{Q^k(\omega)})_{k\in\mathbb{N}}$ converges for all $\omega\in\Omega_{(ii)}^n$. Furthermore, let $\Omega_{(i)}$ be a measurable subset of $\Omega$ of probability one corresponding to the almost-sure property of assumption (i). Let

$$\tilde\Omega=\Big(\bigcap_{n\ge0}\Omega_{(ii)}^n\Big)\cap\Omega_{(i)}\cap\Omega_{(iii)}.$$

As the intersection of countably many measurable subsets of probability one, $\tilde\Omega$ is itself a measurable subset of $\Omega$ with $\mathbb{P}(\tilde\Omega)=1$. Fix $\omega\in\tilde\Omega$ for the rest of the proof.

The sequence $(Q^k(\omega))_{k\in\mathbb{N}}$ takes values in $S_\alpha(H)$ for $\alpha>0$ and is quasi-decreasing with control $(\eta^k)_{k\in\mathbb{N}}$. Furthermore, for all $k\in\mathbb{N}$,

$$Q^k(\omega)\preceq\prod_{j=0}^{k-1}(1+\eta^j)\,Q^0(\omega)\preceq\prod_{j=0}^{\infty}(1+\eta^j)\,Q^0(\omega),$$

where the product $\prod_{j=0}^\infty(1+\eta^j)$ is finite because $(\eta^k)_{k\in\mathbb{N}}$ is nonnegative and summable. By [11, Lemma 2.3], $(Q^k(\omega))_{k\in\mathbb{N}}$ converges pointwise strongly to some $Q(\omega)\in S_\alpha(H)$.

Furthermore, for all $x\in C$, there exists a sequence $(x_n)_{n\in\mathbb{N}}$ with values in $\{c_n,\ n\in\mathbb{N}\}$ converging strongly to $x$. By assumption, for all $n\in\mathbb{N}$, the sequence $(\|u^k(\omega)-x_n\|_{Q^k(\omega)})_{k\in\mathbb{N}}$ converges to a limit which we call $l_n(\omega)$. For all $n\in\mathbb{N}$ and $k\in\mathbb{N}$, we can write, thanks to the triangle inequality:

$$-\|x_n-x\|_{Q^k(\omega)}\le\|u^k(\omega)-x\|_{Q^k(\omega)}-\|u^k(\omega)-x_n\|_{Q^k(\omega)}\le\|x_n-x\|_{Q^k(\omega)}.$$

By taking the limit $k\to+\infty$, it follows that:

$$-\|x_n-x\|_{Q(\omega)}\le\liminf_{k\to\infty}\|u^k(\omega)-x\|_{Q^k(\omega)}-l_n(\omega)\le\limsup_{k\to\infty}\|u^k(\omega)-x\|_{Q^k(\omega)}-l_n(\omega)\le\|x_n-x\|_{Q(\omega)}.$$

Taking now the limit $n\to+\infty$ shows that the sequence $(\|u^k(\omega)-x\|_{Q^k(\omega)})_{k\in\mathbb{N}}$ converges for all $x\in C$. On the other hand, because $\omega\in\Omega_{(iii)}$, the weak cluster points of $(u^k(\omega))_{k\in\mathbb{N}}$ lie in $C$. Hence, by [11, Theorem 3.3], the sequence $(u^k(\omega))_{k\in\mathbb{N}}$ converges weakly to a point $u(\omega)\in C$.

We are now equipped to prove Theorem 2.1. We show in Sects. 2.4 and 2.5 that A-SPDHG satisfies points (ii) and (iii) of Theorem 2.5, respectively, and conclude the proof in Sect. 2.6. Interestingly, the proofs of points (ii) and (iii) rely on two different ways of apprehending A-SPDHG. Point (ii) relies on a convex optimization argument: by taking advantage of the measurability of the primal variable at step $k+1$ with respect to $\mathcal{F}^k$, one can write a contraction-type inequality relating the conditional expectation of the iterates' norm at step $k+1$ to the iterates' norm at step $k$. Point (iii) relies on monotone operator theory: we use the fact that the update from the half-shifted iterates $(y^k,x^{k+1})$ to $(y^{k+1},x^{k+2})$ can be interpreted as a step of a proximal-point algorithm on $X\times Y_i$, conditionally on $i$ being the index randomly selected at step $k$.

A-SPDHG is Stable with Respect to the Set of Saddle Points

In this section, we show that $(x^k,y^k)_{k\in\mathbb{N}}$ is stable with respect to $C$ relative to the variable metric sequence $(N^k)_{k\in\mathbb{N}}$ defined in equation (2.8) above. We introduce the operators $P\in S(Y)$ and $\Sigma^k\in S(Y)$ defined, respectively, by

$$(Py)_i=p_i\,y_i,\qquad(\Sigma^ky)_i=\sigma_i^k\,y_i,\quad i\in\llbracket1,n\rrbracket,$$

and the functionals $(U^k)_{k\in\mathbb{N}}$, $(V^k)_{k\in\mathbb{N}}$ defined for all $(x,y)\in X\times Y$ as:

$$U^k(y)=\|y\|_{(P\Sigma^k)^{-1}}^2,\qquad V^k(x,y)=\|x\|_{(\tau^k)^{-1}}^2-2\,\langle P^{-1}Ax,\,y\rangle+\|y\|_{(P\Sigma^k)^{-1}}^2.$$

We begin by recalling the cornerstone inequality satisfied by the iterates of SPDHG stated first in [8] and reformulated in [1].

Lemma 2.6

([1], Lemma 4.1) For every saddle point $(x,y)$, it holds a.s. that for all $k\in\mathbb{N}\setminus\{0\}$,

$$\mathbb{E}\big[V^{k+1}(x^{k+1}-x,\,y^{k+1}-y^k)+U^{k+1}(y^{k+1}-y)\,\big|\,\mathcal{F}^k\big]\le V^{k+1}(x^k-x,\,y^k-y^{k-1})+U^{k+1}(y^k-y)-V^{k+1}(x^{k+1}-x^k,\,y^k-y^{k-1}). \tag{2.10}$$

The second step is to relate the assumptions of Theorem 2.1 to properties of the functionals appearing in (2.10). Let us introduce $Y_{\mathrm{sparse}}\subset Y$, the set of elements $(y_1,\dots,y_n)$ having at most one non-vanishing component.

Lemma 2.7

(Properties of functionals of interest) Under the assumptions of Theorem 2.1, there exists a nonnegative, summable sequence $(\eta^k)_{k\in\mathbb{N}}$ such that a.s. for every $k\in\mathbb{N}$ and all $x\in X$, $y\in Y$, $z\in Y_{\mathrm{sparse}}$:

$$U^{k+1}(y)\le(1+\eta^k)\,U^k(y), \tag{2.11a}$$
$$V^{k+1}(x,z)\le(1+\eta^k)\,V^k(x,z), \tag{2.11b}$$
$$\|(x,z)\|_{N^k}^2\ge\alpha\,\|(x,z)\|^2, \tag{2.11c}$$
$$V^k(x,z)\ge(1-\beta)\,\|(x,z)\|_{N^k}^2, \tag{2.11d}$$
$$\langle P^{-1}Ax,\,z\rangle\le\beta^{1/2}\,\|x\|_{(\tau^k)^{-1}}\,\|z\|_{(P\Sigma^k)^{-1}}. \tag{2.11e}$$

Proof

Let $(\eta_i^k)_{k\in\mathbb{N}}$ and $(\tilde\eta_i^k)_{k\in\mathbb{N}}$ be the controls of $(M_i^k)_{k\in\mathbb{N}}$ and $(N_i^k)_{k\in\mathbb{N}}$, respectively, for all $i\in\llbracket1,n\rrbracket$. We define the common control $(\eta^k)_{k\in\mathbb{N}}$ by:

$$\eta^k=\max\{\eta_i^k,\tilde\eta_i^k,\ i\in\llbracket1,n\rrbracket\},\quad k\in\mathbb{N}. \tag{2.12}$$

For all $y\in Y$, we can write

$$U^{k+1}(y)=\sum_{i=1}^n\|(0,y_i)\|_{N_i^{k+1}}^2\le(1+\eta^k)\sum_{i=1}^n\|(0,y_i)\|_{N_i^k}^2=(1+\eta^k)\,U^k(y),$$

which proves (2.11a). Let us now fix $x\in X$, $z\in Y_{\mathrm{sparse}}$ and $k\in\mathbb{N}$. By definition, there exists $i\in\llbracket1,n\rrbracket$ such that $z_j=0$ for all $j\ne i$. We obtain the inequalities (2.11b)–(2.11d) by writing:

$$V^{k+1}(x,z)=\|(x,z_i)\|_{M_i^{k+1}}^2\le(1+\eta^k)\,\|(x,z_i)\|_{M_i^k}^2=(1+\eta^k)\,V^k(x,z),$$
$$\|(x,z)\|_{N^k}^2=\|(x,z_i)\|_{N_i^k}^2\ge\alpha\,\|(x,z_i)\|^2=\alpha\,\|(x,z)\|^2,$$
$$V^k(x,z)=\|(x,z_i)\|_{M_i^k}^2\ge(1-\beta)\,\|(x,z_i)\|_{N_i^k}^2=(1-\beta)\,\|(x,z)\|_{N^k}^2.$$

Finally, we obtain inequality (2.11e) by writing:

$$\langle P^{-1}Ax,\,z\rangle=\frac{1}{p_i}\langle A_ix,\,z_i\rangle\le\frac{\|A_i\|}{p_i}\,\|x\|\,\|z_i\|=\frac{\|A_i\|}{p_i}\,(p_i\tau^k\sigma_i^k)^{1/2}\,\|x\|_{(\tau^k)^{-1}}\,\|z\|_{(P\Sigma^k)^{-1}}\le\beta^{1/2}\,\|x\|_{(\tau^k)^{-1}}\,\|z\|_{(P\Sigma^k)^{-1}},$$

where the last inequality is a consequence of (2.5).

Lemma 2.8

(A-SPDHG is C-stable) Under the assumptions of Theorem 2.1,

  • (i)

    the sequence $(x^k,y^k)_{k\in\mathbb{N}}$ generated by Algorithm 2.1 is stable with respect to $C$ relative to $(N^k)_{k\in\mathbb{N}}$,

  • (ii)
    the following results hold:
    $$\mathbb{E}\sum_{k=1}^\infty\|(x^{k+1}-x^k,\,y^k-y^{k-1})\|^2<\infty\qquad\text{and a.s.}\quad x^{k+1}-x^k\to0.$$

Proof

Let us begin with the proof of point (i). By definition of A-SPDHG with serial sampling, the difference between two consecutive dual iterates is almost surely sparse:

$$\text{a.s.}\quad\forall k\in\mathbb{N}\setminus\{0\},\quad y^k-y^{k-1}\in Y_{\mathrm{sparse}}.$$

Let us define the sequences

$$a^k=V^k(x^k-x,\,y^k-y^{k-1})+U^k(y^k-y),\qquad b^k=V^{k+1}(x^{k+1}-x^k,\,y^k-y^{k-1}),$$

which are a.s. nonnegative thanks to (2.11c) and (2.11d). Notice that the primal iterates $x^l$ for $l\in\llbracket0,k+1\rrbracket$ are measurable with respect to $\mathcal{F}^k$, as are the dual iterates $y^l$ for $l\in\llbracket0,k\rrbracket$. Hence, $a^k$ and $b^k$ are measurable with respect to $\mathcal{F}^k$. Furthermore, inequalities (2.10), (2.11a) and (2.11b) imply that, almost surely, for all $k\in\mathbb{N}\setminus\{0\}$,

$$\mathbb{E}\big[a^{k+1}\,\big|\,\mathcal{F}^k\big]\le(1+\eta^k)\,a^k-b^k.$$

By the Robbins–Siegmund lemma [23], $(a^k)$ converges almost surely, $\sup_k\mathbb{E}\,a^k<\infty$ and $\sum_{k=1}^\infty\mathbb{E}\,b^k<\infty$. From the last point in particular, we can write, thanks to (2.11d) and the monotone convergence theorem:

$$\mathbb{E}\sum_{k=1}^\infty\|y^k-y^{k-1}\|_{(P\Sigma^{k+1})^{-1}}^2\le\mathbb{E}\sum_{k=1}^\infty\|(x^{k+1}-x^k,\,y^k-y^{k-1})\|_{N^{k+1}}^2\le(1-\beta)^{-1}\,\mathbb{E}\sum_{k=1}^\infty b^k=(1-\beta)^{-1}\sum_{k=1}^\infty\mathbb{E}\,b^k<\infty,$$

hence $\sum_{k=1}^\infty\|y^k-y^{k-1}\|_{(P\Sigma^{k+1})^{-1}}^2$ is almost surely finite; thus $(\|y^k-y^{k-1}\|_{(P\Sigma^{k+1})^{-1}}^2)_{k\in\mathbb{N}\setminus\{0\}}$, and in turn $(\|y^k-y^{k-1}\|_{(P\Sigma^{k+1})^{-1}})_{k\in\mathbb{N}\setminus\{0\}}$, converge almost surely to $0$. Furthermore, $\sup_k\mathbb{E}\,a^k<\infty$, hence $\sup_k\|x^k-x\|_{(\tau^k)^{-1}}^2$, and in turn $\sup_k\|x^k-x\|_{(\tau^k)^{-1}}$, are a.s. finite, and by (2.11e) one can write, for $k\in\mathbb{N}\setminus\{0\}$:

$$\langle P^{-1}A(x^k-x),\,y^k-y^{k-1}\rangle\le\beta^{1/2}\,\|x^k-x\|_{(\tau^{k+1})^{-1}}\,\|y^k-y^{k-1}\|_{(P\Sigma^{k+1})^{-1}}\le\beta^{1/2}(1+\eta^k)\,\|x^k-x\|_{(\tau^k)^{-1}}\,\|y^k-y^{k-1}\|_{(P\Sigma^{k+1})^{-1}}.$$

We know that $(\eta^k)_{k\in\mathbb{N}}$ is summable, hence converges to $0$. As a consequence,

$$\big|\langle P^{-1}A(x^k-x),\,y^k-y^{k-1}\rangle\big|\to0\quad\text{almost surely}.$$

To conclude, thanks to the identity

$$a^k=\|(x^k-x,\,y^k-y)\|_{N^k}^2-2\,\langle P^{-1}A(x^k-x),\,y^k-y^{k-1}\rangle+\|y^k-y^{k-1}\|_{(P\Sigma^k)^{-1}}^2,\quad k\in\mathbb{N}\setminus\{0\},$$

in which the last two terms converge a.s. to $0$, the almost-sure convergence of $(a^k)_{k\in\mathbb{N}}$ implies in turn that of $(\|(x^k-x,\,y^k-y)\|_{N^k}^2)_{k\in\mathbb{N}}$.

Let us now turn to point (ii). The first assertion is a straightforward consequence of

$$\mathbb{E}\sum_{k=1}^\infty b^k=\sum_{k=1}^\infty\mathbb{E}\,b^k<\infty$$

and the bounds (2.11c) and (2.11d). Furthermore, it implies that $\sum_{k=1}^\infty\|(x^{k+1}-x^k,\,y^k-y^{k-1})\|^2$ is a.s. finite, hence $(x^{k+1}-x^k,\,y^k-y^{k-1})$ a.s. converges to $0$, and so does $x^{k+1}-x^k$.

Weak Cluster Points of A-SPDHG are Saddle Points

The goal of this section is to prove that A-SPDHG satisfies point (iii) of Theorem 2.5. On the event $\{I^k=i\}$, the A-SPDHG update procedure can be rewritten as

$$y_i^{k+1}=\mathrm{prox}_{\sigma_i^{k+1}f_i^*}\!\big(y_i^k+\sigma_i^{k+1}A_ix^{k+1}\big),\qquad\bar y_i^{k+1}=y_i^{k+1}+\frac{1}{p_i}\big(y_i^{k+1}-y_i^k\big),\qquad\bar y_j^{k+1}=y_j^k,\ j\ne i,$$
$$x^{k+2}=\mathrm{prox}_{\tau^{k+2}g}\!\big(x^{k+1}-\tau^{k+2}A^*\bar y^{k+1}\big).$$

We define $T_i^{\sigma,\tau}:(x,y)\mapsto(\hat x,\hat y_i)$ by:

$$\hat y_i=\mathrm{prox}_{\sigma_if_i^*}\!\big(y_i+\sigma_iA_ix\big),\qquad\hat x=\mathrm{prox}_{\tau g}\!\left(x-\tau A^*y-\tau\,\frac{1+p_i}{p_i}\,A_i^*(\hat y_i-y_i)\right),$$

so that $(x^{k+2},y_i^{k+1})=T_i^{\sigma_i^{k+1},\tau^{k+2}}(x^{k+1},y^k)$ on the event $\{I^k=i\}$ (and $y_j^{k+1}=y_j^k$ for $j\ne i$).

Lemma 2.9

(Cluster points of A-SPDHG are saddle points) Let $(\bar x,\bar y)$ a.s. be a weak cluster point of $(x^k,y^k)_{k\in\mathbb{N}}$ (meaning that there exists a measurable subset $\bar\Omega$ of $\Omega$ of probability one such that for all $\omega\in\bar\Omega$, $(\bar x(\omega),\bar y(\omega))$ is a weak sequential cluster point of $(x^k(\omega),y^k(\omega))_{k\in\mathbb{N}}$), and assume that the assumptions of Theorem 2.1 hold. Then, $(\bar x,\bar y)$ is a.s. in $C$.

Proof

Thanks to Lemma 2.8-(ii) and the monotone convergence theorem,

$$\sum_{k=1}^\infty\mathbb{E}\,\|(x^{k+1}-x^k,\,y^k-y^{k-1})\|^2=\mathbb{E}\sum_{k=1}^\infty\|(x^{k+1}-x^k,\,y^k-y^{k-1})\|^2<\infty.$$

Now,

$$\sum_{k=1}^\infty\mathbb{E}\,\|(x^{k+1}-x^k,\,y^k-y^{k-1})\|^2=\sum_{k=1}^\infty\mathbb{E}\,\mathbb{E}\big[\|(x^{k+1}-x^k,\,y^k-y^{k-1})\|^2\,\big|\,I^{k-1}\big]=\sum_{k=1}^\infty\sum_{i=1}^n\mathbb{P}(I^{k-1}=i)\,\mathbb{E}\,\big\|T_i^{\sigma_i^k,\tau^{k+1}}(x^k,y^{k-1})-(x^k,y_i^{k-1})\big\|^2=\mathbb{E}\sum_{i=1}^np_i\sum_{k=1}^\infty\big\|T_i^{\sigma_i^k,\tau^{k+1}}(x^k,y^{k-1})-(x^k,y_i^{k-1})\big\|^2.$$

Hence, we can deduce that

$$\mathbb{E}\sum_{k=1}^\infty\sum_{i=1}^np_i\,\big\|T_i^{\sigma_i^k,\tau^{k+1}}(x^k,y^{k-1})-(x^k,y_i^{k-1})\big\|^2<\infty.$$

It follows that the series inside the expectation is a.s. finite, and since $p_i>0$ we deduce that, almost surely,

$$T_i^{\sigma_i^k,\tau^{k+1}}(x^k,y^{k-1})-(x^k,y_i^{k-1})\xrightarrow[k\to\infty]{}0 \tag{2.13}$$

for all $i\in\llbracket1,n\rrbracket$. We consider a sample path along which $(x^k,y^k)$ is bounded and (2.13) holds. We let, for each $i$, $(\hat x^{i,k+1},\hat y_i^{i,k})=T_i^{\sigma_i^k,\tau^{k+1}}(x^k,y^{k-1})$, so that $(\hat x^{i,k+1},\hat y_i^{i,k})-(x^k,y_i^{k-1})\to0$ for $i=1,\dots,n$. Then, one has

$$\partial f_i^*(\hat y_i^{i,k})\ni\frac{y_i^{k-1}-\hat y_i^{i,k}}{\sigma_i^k}+A_ix^k=:A_ix^k+\delta_y^{i,k},$$
$$\partial g(\hat x^{i,k+1})\ni\frac{x^k-\hat x^{i,k+1}}{\tau^{k+1}}-A^*y^{k-1}-\frac{1+p_i}{p_i}\,A_i^*\big(\hat y_i^{i,k}-y_i^{k-1}\big)=:-A^*y^{k-1}+\delta_x^{i,k},$$

where $\delta_x^{i,k},\delta_y^{i,k}\to0$ as $k\to\infty$. Given a test point $(x,y)$, one may write, for any $k$:

$$f_i^*(y_i)\ge f_i^*(\hat y_i^{i,k})+\langle A_ix^k,\,y_i-y_i^{k-1}\rangle+\langle A_ix^k,\,y_i^{k-1}-\hat y_i^{i,k}\rangle+\langle\delta_y^{i,k},\,y_i-\hat y_i^{i,k}\rangle,\quad i=1,\dots,n,$$
$$g(x)\ge g(\hat x^{1,k+1})-\langle A^*y^{k-1},\,x-x^k\rangle-\langle A^*y^{k-1},\,x^k-\hat x^{1,k+1}\rangle+\langle\delta_x^{1,k},\,x-\hat x^{1,k+1}\rangle,$$

and summing all these inequalities, we obtain:

$$g(x)+\sum_{i=1}^nf_i^*(y_i)\ge g(\hat x^{1,k+1})+\sum_{i=1}^nf_i^*(\hat y_i^{i,k})+\sum_{i=1}^n\langle A_ix^k,\,y_i\rangle-\langle A^*y^{k-1},\,x\rangle+\delta^k,$$

where $\delta^k\to0$ as $k\to\infty$. We deduce that if $(\bar x,\bar y)$ is the weak limit of a subsequence $(x^{k_l},y^{k_l-1})$ (as well as, of course, of $(x^{k_l},y^{k_l})$), then:

$$g(x)+\sum_{i=1}^nf_i^*(y_i)\ge g(\bar x)+\sum_{i=1}^nf_i^*(\bar y_i)+\sum_{i=1}^n\langle A_i\bar x,\,y_i\rangle-\langle A^*\bar y,\,x\rangle.$$

Since $(x,y)$ is arbitrary, we find that (2.2) holds for $(\bar x,\bar y)$.

Proof of Theorem 2.1

Under the assumptions of Theorem 2.1, the set $C$ of saddle points is closed and non-empty, and $X\times Y$ is a separable Hilbert space. By Lemma 2.3, the variable metric sequence $(N^k)_{k\in\mathbb{N}}$ defined in (2.8) satisfies condition (i) of Theorem 2.5. Furthermore, the iterates of Algorithm 2.1 comply with conditions (ii) and (iii) of Theorem 2.5 by Lemmas 2.8 and 2.9, respectively, and hence converge almost surely to a point in $C$.

Algorithmic Design and Practical Implementations

In this section, we present practical instances of our A-SPDHG algorithm, specifying step-size adjustment rules which satisfy the assumptions of our convergence theory. We extend the adaptive step-size balancing rule proposed by [14] for deterministic PDHG to our stochastic setting, with a minibatch approximation to minimize the computational overhead.

A-SPDHG Rule (a)—Tracking and Balancing the Primal–Dual Progress

Let us first briefly introduce the foundation of our first numerical scheme, which is built upon the deterministic adaptive PDHG algorithm proposed by Goldstein et al. [14], with the iterates:

$$x^{k+1}=\mathrm{prox}_{\tau^{k+1}g}\!\big(x^k-\tau^{k+1}A^*y^k\big),\qquad y^{k+1}=\mathrm{prox}_{\sigma^{k+1}f^*}\!\big(y^k+\sigma^{k+1}A(2x^{k+1}-x^k)\big).$$

In this foundational work, Goldstein et al. [14] proposed to evaluate two sequences (denoted here by $v^k$ and $d^k$) in order to track and balance the progress of the primal and dual iterates of deterministic PDHG:

$$v^k:=\left\|\frac{x^k-x^{k+1}}{\tau^{k+1}}-A^*(y^k-y^{k+1})\right\|_1,\qquad d^k:=\left\|\frac{y^k-y^{k+1}}{\sigma^{k+1}}-A(x^k-x^{k+1})\right\|_1. \tag{3.1}$$

These two sequences measure the lengths of the primal and dual subgradients for the objective $\min_{x\in X}\max_{y\in Y}\ g(x)+\langle Ax,\,y\rangle-f^*(y)$, as can be seen from the definition of the proximal operator. The primal update of deterministic PDHG can be written as:

$$x^{k+1}=\arg\min_x\ \frac12\big\|x-(x^k-\tau^{k+1}A^*y^k)\big\|_2^2+\tau^{k+1}g(x). \tag{3.2}$$

The optimality condition of this minimization reads:

$$0\in\partial g(x^{k+1})+A^*y^k+\frac{1}{\tau^{k+1}}(x^{k+1}-x^k). \tag{3.3}$$

By adding $A^*y^{k+1}$ to both sides and rearranging the terms, one can derive:

$$\frac{x^k-x^{k+1}}{\tau^{k+1}}-A^*(y^k-y^{k+1})\in\partial g(x^{k+1})+A^*y^{k+1}, \tag{3.4}$$

and similarly, for the dual update one can derive:

$$\frac{y^k-y^{k+1}}{\sigma^{k+1}}-A(x^k-x^{k+1})\in\partial f^*(y^{k+1})-Ax^{k+1}, \tag{3.5}$$

which indicates that the sequences $v^k$ and $d^k$ given by (3.1) effectively track the primal and dual progress of deterministic PDHG; hence, Goldstein et al. [14] propose to use them as the basis for balancing the primal and dual step sizes of PDHG.

In light of this, we propose our first practical implementation of A-SPDHG in Algorithm 3.1 as rule (a), where we use a single dual step size $\sigma^k=\sigma_j^k$ for all iterations $k$ and indices $j$, and where we estimate the progress toward optimality of the primal and dual variables via the two sequences $v^k$ and $d^k$, defined at each iteration $k$ with $I^k=i$ as:

$$v^{k+1}:=\left\|\frac{x^k-x^{k+1}}{\tau^{k+1}}-\frac{1}{p_i}A_i^*(y_i^k-y_i^{k+1})\right\|_1,\qquad d^{k+1}:=\left\|\frac{1}{p_i}\cdot\frac{y_i^k-y_i^{k+1}}{\sigma^{k+1}}-A_i(x^k-x^{k+1})\right\|_1, \tag{3.6}$$

which are minibatch extensions of (3.1) tailored to our stochastic setting. By balancing them on the fly via adjusting the primal–dual step-size ratio when appropriate, we encourage the algorithm to achieve similar progress in both the primal and dual steps and hence improve convergence. More specifically, as shown in Algorithm 3.1, the values of $v^k$ and $d^k$ are evaluated and compared at each iteration. If the value of $v^k$ (which tracks the primal subgradients) is significantly larger than $d^k$ (which tracks the dual subgradients), then the primal progress is slower than the dual progress, and the algorithm boosts the primal step size while shrinking the dual step size. If $v^k$ is noticeably smaller than $d^k$, the algorithm does the opposite.

Note that we adopt the 1-norm as the length measure for $v^k$ and $d^k$, following Goldstein et al. [14, 15], since we also observe numerically a benefit over the more intuitive choice of the 2-norm.

In the full-batch case ($n = 1$), the scheme reduces to the adaptive PDHG proposed in [14, 15]. We adjust the ratio between the primal and dual step sizes according to the ratio between $v^k$ and $d^k$, and whenever the step sizes change, we shrink $\alpha$ (which controls the amplitude of the changes) by a factor $\eta \in (0,1)$; we typically choose $\eta = 0.995$ in our experiments. For $s$, we choose $s = \|A\|$ as our default.1
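For illustration, the balancing logic of rule (a) can be sketched as follows. This is a minimal sketch, not the full Algorithm 3.1: the function name, the gap threshold `delta` and the exact multiplicative form of the update are our assumptions, but the update preserves the product of the step sizes, as required by the convergence theory, while shifting their ratio toward the lagging variable.

```python
def balance_step_sizes(tau, sigma, v, d, alpha, eta=0.995, delta=1.5):
    """One rule-(a) balancing step (illustrative sketch).

    tau, sigma -- current primal/dual step sizes
    v, d       -- primal/dual residual norms, cf. (3.6)
    alpha      -- adaptivity level, shrunk by eta after every change
    delta      -- gap threshold triggering a rebalance (hypothetical value)
    """
    if v > delta * d:
        # primal progress lags behind: boost tau, shrink sigma
        tau, sigma = tau / (1.0 - alpha), sigma * (1.0 - alpha)
        alpha *= eta
    elif d > delta * v:
        # dual progress lags behind: do the opposite
        tau, sigma = tau * (1.0 - alpha), sigma / (1.0 - alpha)
        alpha *= eta
    return tau, sigma, alpha
```

Note that either branch leaves the product $\tau\sigma$ unchanged; only the ratio $\tau/\sigma$ moves.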

Reducing the Overhead with Subsampling

Unlike the deterministic case, which needs no extra matrix–vector multiplication since $A^\top y^k$ and $Ax^k$ can be stored, our stochastic extension requires computing $A_i x^k$, since with high probability different subsets are sampled in consecutive iterations. With this strategy, the overhead is at most 50% in terms of FLOP count, which is modest compared to the significant acceleration it brings to SPDHG, especially when the primal–dual step-size ratio is suboptimal, as we demonstrate later in the experiments. Moreover, we found numerically that this overhead can be reduced significantly by approximation tricks such as subsampling:

$$d^{k+1} \approx \tfrac{\rho}{p_i}\Big\|S^k(y_i^k - y_i^{k+1})/\sigma_{k+1} - S^k A_i(x^k - x^{k+1})\Big\|_1 \tag{3.7}$$

with $S^k$ a random subsampling operator such that $\mathbb{E}[(S^k)^\top S^k] = \tfrac{1}{\rho}\,\mathrm{Id}$. In our experiments, we use 10% subsampling for this approximation, so the overhead is reduced from 50% to a negligible 5%, without compromising the convergence rates in practice.
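The subsampling estimate (3.7) can be realized, for instance, with a random binary mask: selecting each coordinate with probability $q$ and rescaling by $\rho = 1/q$ yields an unbiased estimate of the 1-norm. The following sketch is our own illustration of this construction, not the exact operator used in the experiments.

```python
import numpy as np

def subsampled_l1(r, q=0.1, rng=None):
    """Unbiased estimate of ||r||_1 from a random fraction q of entries.

    The binary mask plays the role of S^k: each coordinate is kept with
    probability q, so E[(S^k)^T S^k] = q * Id, i.e. rho = 1/q, and rescaling
    the partial sum by rho makes the estimate unbiased.
    """
    rng = np.random.default_rng(rng)
    mask = rng.random(r.shape) < q     # keep each entry with probability q
    return np.abs(r[mask]).sum() / q   # rescale by rho = 1/q
```

With $q = 0.1$, only 10% of the residual entries are touched, matching the reduced overhead reported above.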

A-SPDHG Rule (b)—Exploiting Angle Alignments

More recently, Yokota and Hontani [26] proposed a variant of the adaptive step-size balancing scheme for PDHG that uses the angle between the subgradient $\partial g(x^{k+1}) + A^\top y^{k+1}$ and the update difference $x^k - x^{k+1}$.

If these two directions are highly aligned, the primal step size can be increased for a bigger step; if they form a large angle, the primal step size should be shrunk. Extending this scheme to the stochastic setting yields another adaptive scheme for SPDHG.

We present this scheme in Algorithm 3.2 as rule (b). At iteration $k$ with $I_k = \{i\}$, compute:

$$q^{k+1} = (x^k - x^{k+1})/\tau_{k+1} - \tfrac{1}{p_i} A_i^\top (y_i^k - y_i^{k+1}), \tag{3.8}$$

as an estimate of an element of $\partial g(x^{k+1}) + A^\top y^{k+1}$, and then measure the cosine of the angle between it and $x^k - x^{k+1}$:

$$w^{k+1} = \frac{\langle x^k - x^{k+1},\, q^{k+1}\rangle}{\|x^k - x^{k+1}\|_2\,\|q^{k+1}\|_2}. \tag{3.9}$$

The threshold $c$ on the cosine value (which triggers an increase of the primal step size) typically needs to be very close to 1 (we use $c = 0.999$), since these algorithms are mostly applied to high-dimensional problems; this follows the choice in [26] for deterministic PDHG.
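As an illustration, the alignment measure (3.9) and the resulting threshold test can be sketched as follows. The cosine computation matches (3.9); the multiplicative step-size update and the preserved step-size product are our own assumptions for the sketch, not the exact mechanics of Algorithm 3.2.

```python
import numpy as np

def cosine_alignment(dx, q):
    """w^{k+1} in (3.9): cosine of the angle between dx = x^k - x^{k+1}
    and the subgradient estimate q = q^{k+1} from (3.8)."""
    return float(np.dot(dx, q) / (np.linalg.norm(dx) * np.linalg.norm(q)))

def rule_b_update(tau, sigma, w, alpha, c=0.999, eta=0.995):
    """Sketch of a rule-(b) step: grow tau when the directions are highly
    aligned (w > c), shrink it otherwise; the product tau*sigma is kept."""
    if w > c:
        tau, sigma = tau / (1.0 - alpha), sigma * (1.0 - alpha)
    else:
        tau, sigma = tau * (1.0 - alpha), sigma / (1.0 - alpha)
    return tau, sigma, alpha * eta
```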

Recently, Zdun et al. [27] proposed a heuristic similar to our rule (b), but they choose $q^{k+1}$ to approximate an element of $\partial g(x^{k+1})$ instead of $\partial g(x^{k+1}) + A^\top y^{k+1}$. Our choice follows the original scheme of Yokota and Hontani [26] more closely; we found numerically that the scheme of [27] is not competitive in our setting.

Numerical Experiments

In this section, we present numerical studies of the proposed schemes on a prototypical imaging inverse problem: computed tomography (CT). We compare A-SPDHG with the original SPDHG for different starting ratios of the primal and dual step sizes.

In our CT imaging example, we seek to reconstruct tomographic images from fanbeam X-ray measurement data by solving the TV-regularized objective:

$$x^\star \in \arg\min_{x\in\mathbb{R}^d} \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|Dx\|_1 \tag{4.1}$$

where $D$ denotes the 2D differential operator, $A \in \mathbb{R}^{m\times d}$ and $x \in \mathbb{R}^d$. We consider three fanbeam CT imaging modalities: sparse-view CT, low-dose CT and limited-angle CT. We test A-SPDHG and SPDHG on two images of different sizes (Example 1 on a phantom image of size $1024\times 1024$, Example 2 on an image from the Mayo Clinic Dataset [21] of size $512\times 512$) and on four different starting ratios ($10^{-3}$, $10^{-5}$, $10^{-7}$ and $10^{-9}$). For both algorithms, we partition the measurement data and operator into $n = 10$ interleaved minibatches. More specifically, we first list all X-ray measurements consecutively from 0 to 360 degrees to form the full $A$ and $b$, and then group every 10th measurement into one minibatch, forming the partition $\{A_i\}_{i=1}^{10}$ and $\{b_i\}_{i=1}^{10}$.
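The interleaved partition described above can be written compactly with strided slicing. This is a sketch under the assumption that the rows of $A$ and the entries of $b$ are ordered by projection angle.

```python
import numpy as np

def interleaved_partition(A, b, n=10):
    """Split the rows of A (and entries of b), ordered by angle, into n
    interleaved minibatches: minibatch i gets rows i, i+n, i+2n, ..."""
    return [A[i::n] for i in range(n)], [b[i::n] for i in range(n)]
```

Each minibatch then covers the full angular range, which keeps the partial forward operators $A_i$ well balanced.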

For A-SPDHG, we use the approximation of $d^k$ given in (3.7) with 10% subsampling, so the computational overhead is negligible in this experiment. We initialize all algorithms from a zero image.

We present our numerical results in Figs. 1, 2, 3 and 6. In these plots, we compare the convergence rates of the algorithms in terms of the number of iterations (the execution time per iteration is almost the same for all algorithms, as the overhead of A-SPDHG is numerically trivial). Figures 1 and 2 report results for large-scale sparse-view CT on a phantom image and on a lung CT image from the Mayo Clinic dataset [21]; Fig. 3 reports results for low-dose CT, where we simulate a large number of measurements corrupted by a significant amount of Poisson noise; and Fig. 6 reports results for limited-angle CT, in which only measurement angles from 0 to 150 degrees are present, while measurements from the remaining [150, 360] degree range are missing. In all these examples, we consistently observe that no matter how the primal–dual step-size ratio is initialized, A-SPDHG automatically adjusts it toward the optimal choice, which is around either $10^{-5}$ or $10^{-7}$ for these four CT problems, and significantly outperforms vanilla SPDHG when the starting ratio is away from the optimal range. Even when the starting ratio of SPDHG is near-optimal, we consistently observe in most examples that our scheme outperforms vanilla SPDHG locally after a certain number of iterations (highlighted by the vertical dashed lines in the relevant subfigures), which further indicates the benefit of adaptivity for this class of algorithms.2 Note that throughout these different examples we use a single fixed set of parameters for A-SPDHG, as suggested in the previous section, which again indicates the strong practicality of our scheme.

Fig. 1.

Fig. 1

Comparison between SPDHG and A-SPDHG on sparse-view CT (Example 1), with a variety of starting primal–dual step-size ratios. Here, the forward operator is $A \in \mathbb{R}^{m\times d}$ with dimensions $m = 368640$, $d = 1048576$. We include the images reconstructed by the algorithms at termination (50th epoch). In the first plot of each subfigure, the black circle indicates the starting step-size ratio for all the algorithms; the same applies to the following figures

Fig. 2.

Fig. 2

Comparison between SPDHG and A-SPDHG on sparse-view CT (Example 2), with a variety of starting primal–dual step-size ratios. Here, the forward operator is ARm×d with dimensions m=92160, d=262144. We include the images reconstructed by the algorithms at termination (50th epoch)

Fig. 3.

Fig. 3

Comparison between SPDHG and A-SPDHG on low-dose CT (where we use a large number of highly-noisy X-ray measurements), with a variety of starting primal–dual step-size ratios. Here, the forward operator is ARm×d with dimensions m=184320, d=65536. We resized the phantom image to 256 by 256. We include the images reconstructed by the algorithms at termination (50th epoch)

Fig. 6.

Fig. 6

Comparison between SPDHG and A-SPDHG on limited-angle CT (Example 2), with a variety of starting primal–dual step-size ratios. Here, the forward operator is ARm×d with dimensions m=92160, d=262144. We include the images reconstructed by the algorithms at termination (50th epoch)

For the low-dose CT example, we run two extra sets of experiments: one with a larger number of minibatches (40) in Fig. 4, and one with a warm start from a better initial image obtained via filtered backprojection in Fig. 5. In all these extra examples, we consistently observe superior performance of A-SPDHG over vanilla SPDHG, especially when the primal–dual step-size ratio is suboptimal. Interestingly, the warm start has no noticeable impact on the comparative performance of SPDHG and A-SPDHG. This is mainly because SPDHG with a suboptimal primal–dual step-size ratio converges very slowly in the high-accuracy regime in practice (see Fig. 5d, for example), so the warm start does not help much here.

Fig. 4.

Fig. 4

Comparison between SPDHG and A-SPDHG with the data split into 40 minibatches on low-dose CT. Compared with the results in Fig. 3, which used 10 minibatches, we obtain similar results and A-SPDHG continues to perform favorably compared with SPDHG

Fig. 5.

Fig. 5

Comparison between SPDHG and A-SPDHG with a warm start via FBP (filtered backprojection) on low-dose CT. Compared with the results in Fig. 3, obtained without warm start, our method compares even more favorably with a warm start. Note that the early jump in function value is expected due to the stochasticity of the algorithms. We include the images reconstructed by the algorithms at termination (50th epoch)

We should also note that, conceptually, all the hyperparameters in our adaptive schemes control the degree of adaptivity of the algorithm (extreme choices recover vanilla SPDHG). In Figs. 7 and 9, we present numerical studies on the hyperparameter choices for rules (a) and (b) of A-SPDHG, with a fixed starting primal–dual step-size ratio of $10^{-7}$. We found rule (a) to be robust to the choice of the initial shrinking rate $\alpha_0$, the shrinking speed $\eta$ and the gap $\delta$. Overall, these parameters have a weak impact on the convergence performance of rule (a) and are easy to choose.

Fig. 7.

Fig. 7

Test on different choices of parameters of A-SPDHG (rule-a) on X-ray low-dose fanbeam CT example, starting ratio of primal–dual step sizes: 10-7. We can observe that the performance of A-SPDHG has only minor dependence on these parameter choices

Fig. 9.

Fig. 9

Test on different choices of parameters of A-SPDHG (rule-b) on X-ray low-dose fanbeam CT example, starting ratio of primal–dual step sizes: 10-7

For rule (b), we found the performance more sensitive to the choice of the parameters $c$ and $\eta$ than for rule (a), although the dependence is still weak. Our numerical studies suggest that rule (a) performs better than rule (b), but each has mild weaknesses (rule (a) incurs a slight computational overhead, which can be partially addressed by the subsampling scheme, while rule (b) is often slower than rule (a)), which call for further study and improvement. Nevertheless, we emphasize that all these parameters essentially control the degree of adaptivity of the algorithms and are fairly easy to choose: for all the CT experiments, with varying sizes, dimensions and modalities, we use a single fixed set of hyperparameters in A-SPDHG, and we still consistently observe numerical improvements over vanilla SPDHG.

Conclusion

In this work, we propose a new framework (A-SPDHG) for adaptive step-size balancing in stochastic primal–dual hybrid gradient methods. We first derive sufficient conditions on the adaptive primal and dual step sizes that ensure convergence in the stochastic setting. We then propose a number of practical schemes that satisfy these conditions, and our numerical results on imaging inverse problems support the effectiveness of the proposed approach.

To our knowledge, this work constitutes the first theoretical analysis of adaptive step sizes for a stochastic primal–dual algorithm. Our ongoing work includes the theoretical analysis and algorithmic design of further accelerated stochastic primal–dual methods with line-search schemes for even faster convergence rates.

Fig. 8.

Fig. 8

Test of the default choice $s = \|A\|$ of A-SPDHG (rule-a) on the X-ray low-dose fanbeam CT example. Left figure: starting primal–dual step-size ratio $10^{-7}$. Right figure: starting primal–dual step-size ratio $10^{-5}$. We observe that our default choice of $s$ is indeed reasonable (at least near-optimal) in practice, and that deviating from it may lead to slower convergence

Complementary Material for Sect. 2

We begin with a useful lemma.

Lemma 6.1

Let $a, b$ be positive scalars, $\beta \in (0,1)$, and $P$ a bounded linear operator from a Hilbert space $X$ to a Hilbert space $Y$. Then,

$$\|(ab)^{-1/2} P\| \le 1 \iff \begin{pmatrix} a\,\mathrm{Id} & P^* \\ P & b\,\mathrm{Id} \end{pmatrix} \succeq 0, \tag{6.1}$$

$$\|(ab)^{-1/2} P\| \le \beta \iff \begin{pmatrix} a\,\mathrm{Id} & P^* \\ P & b\,\mathrm{Id} \end{pmatrix} \succeq (1-\beta)\begin{pmatrix} a\,\mathrm{Id} & 0 \\ 0 & b\,\mathrm{Id} \end{pmatrix}. \tag{6.2}$$

Proof

Let us call

$$M = \begin{pmatrix} a\,\mathrm{Id} & P^* \\ P & b\,\mathrm{Id} \end{pmatrix}.$$

For all $(x,y) \in X \times Y$,

$$\langle (x,y), M(x,y)\rangle \ge a\|x\|^2 + b\|y\|^2 - 2\|Px\|\,\|y\| \ge \|\sqrt{a}\,x\|^2 + \|\sqrt{b}\,y\|^2 - 2\|(ab)^{-1/2}P\|\,\|\sqrt{a}\,x\|\,\|\sqrt{b}\,y\|,$$

which proves the direct implication of (6.1). For the converse implication, consider $x \in X\setminus\{0\}$ such that $\|Px\| = \|P\|\,\|x\|$ and $y = -\lambda Px$ for a scalar $\lambda$. Then, the nonnegativity of the polynomial

$$\frac{\langle (x,y), M(x,y)\rangle}{\|x\|^2} = b\|P\|^2\lambda^2 - 2\|P\|^2\lambda + a$$

for all $\lambda \in \mathbb{R}$ implies that $\|P\|^4 - ab\|P\|^2 \le 0$, which is equivalent to the desired conclusion $\|(ab)^{-1/2}P\| \le 1$. Equivalence (6.2) is straightforward by noticing that

$$\begin{pmatrix} a\,\mathrm{Id} & P^* \\ P & b\,\mathrm{Id} \end{pmatrix} \succeq (1-\beta)\begin{pmatrix} a\,\mathrm{Id} & 0 \\ 0 & b\,\mathrm{Id} \end{pmatrix} \iff \begin{pmatrix} \beta a\,\mathrm{Id} & P^* \\ P & \beta b\,\mathrm{Id} \end{pmatrix} \succeq 0.$$
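Equivalence (6.1) is easy to check numerically in finite dimensions: with $a = b = 1$, the block matrix $M$ is positive semidefinite exactly when $\|P\| \le 1$. The following sketch is our own sanity check, not part of the proof.

```python
import numpy as np

def saddle_matrix(a, b, P):
    """The block matrix M = [[a I, P^T], [P, b I]] from Lemma 6.1,
    for a real m-by-n matrix P."""
    m, n = P.shape
    return np.block([[a * np.eye(n), P.T], [P, b * np.eye(m)]])

P = np.array([[0.0, 0.5], [0.0, 0.0]])    # operator norm 0.5 <= sqrt(a*b) = 1
M_ok = saddle_matrix(1.0, 1.0, P)         # positive semidefinite
M_bad = saddle_matrix(1.0, 1.0, 4.0 * P)  # operator norm 2 > 1: indefinite
```

For $a = b$, the eigenvalues of $M$ are $a \pm \sigma_i(P)$, so the smallest one is $a - \|P\|$, in line with the lemma.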

Let us now turn to the proof of Lemma 2.2.

Proof of Lemma 2.2

Let us assume that the step sizes satisfy the assumptions of the lemma. Then, Assumption (i) of Theorem 2.1 is straightforwardly satisfied. Moreover, for $i \in \{1,\ldots,n\}$, the product sequence $(\tau^k \sigma_i^k)_{k\in\mathbb{N}}$ is constant along the iterations by equation (2.6) and satisfies equation (2.5) at iterate $k = 0$, and thus satisfies (2.5) for all $k \in \mathbb{N}$ with $\beta = \max_i \tau^0 \sigma_i^0 \|A_i\|^2 / p_i$, which proves Assumption (ii). Finally, equation (2.7) implies that Assumption (iii) is satisfied.

Biographies

Antonin Chambolle

is a CNRS senior scientist at CEREMADE, CNRS and Paris-Dauphine University (PSL), France. He received a Ph.D. from U. Paris-Dauphine in 1993, in mathematics applied to image analysis, and a Habilitation in 2002. He worked at SISSA (Trieste), CEREMADE and CMAP (CNRS and École Polytechnique), before returning to CEREMADE. His main research topics relate to the calculus of variations, the theoretical and numerical analysis of variational and evolution problems involving discontinuities and boundaries, and numerical optimization, especially for non-smooth convex problems. He is also part of the INRIA team "Mokaplan", which studies numerical methods for optimal transportation problems.

Claire Delplancke

is a researcher and engineer at EDF Research & Development (EDF Lab Paris-Saclay). After studying at ENS Cachan, she received her Ph.D. in Applied Mathematics from the University of Toulouse in 2017. She held two postdoctoral positions, first at the University of Chile, then at the University of Bath, before joining EDF in 2022 as a researcher and engineer, and now project manager. Her research interests lie in stochastic algorithms and optimization for a variety of applications: inverse problems, medical imaging and, more recently, energy management.

Matthias J. Ehrhardt

received the Diploma degree (Hons.) in industrial mathematics from the University of Bremen, Germany, in 2011, and the Ph.D. degree in medical imaging from University College London, U.K., in 2015. He held a postdoctoral position with the Cambridge Image Analysis group, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, U.K., from 2016 to 2018. He moved to the University of Bath, U.K., as a Prize Fellow in 2018, and has been a Reader in the Department of Mathematical Sciences there since 2021. He heads the Bath Imaging Group, is a co-director of the Bath Centre for Mathematics and Algorithms for Data, and is the deputy director of the EPSRC Programme Grant on the Mathematics of Deep Learning. His research interests include optimisation, inverse problems, computational imaging and machine learning.

Carola-Bibiane Schönlieb

graduated from the Institute for Mathematics, University of Salzburg (Austria) in 2004. From 2004 to 2005, she held a teaching position in Salzburg. She received her Ph.D. degree from the University of Cambridge (UK) in 2009. After one year of postdoctoral work at the University of Göttingen (Germany), she became a Lecturer at Cambridge in 2010, was promoted to Reader in 2015 and to Professor in 2018. Since 2011, she has been a fellow of Jesus College Cambridge. She is currently Professor of Applied Mathematics at the University of Cambridge, where she heads the Cambridge Image Analysis group and is co-Director of the EPSRC Cambridge Mathematics of Information in Healthcare Hub. Her current research interests focus on variational methods, partial differential equations and machine learning for image analysis, image processing and inverse imaging problems.

Junqi Tang

is an Assistant Professor in the School of Mathematics, University of Birmingham. He received the M.Sc. and Ph.D. from the Institute for Digital Communications, University of Edinburgh, U.K., in 2015 and 2019, respectively. He worked as a postdoctoral research associate with the Department of Applied Mathematics and Theoretical Physics (DAMTP), University of Cambridge before joining the University of Birmingham in 2023. His research interests include machine learning, large-scale optimization and multi-agent systems, with applications in computational imaging and computational social science.

Author Contributions

CD, MJE and AC elaborated the proof strategy, and CD wrote parts 1, 2 and 6. JT worked on the algorithmic design, performed the numerical experiments and wrote parts 3-5. All authors reviewed the manuscript.

Funding

CD acknowledges support from the EPSRC (EP/S026045/1). MJE acknowledges support from the EPSRC (EP/S026045/1, EP/T026693/1, EP/V026259/1) and the Leverhulme Trust (ECF-2019-478). CBS acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC advanced career fellowship EP/V029428/1, EPSRC grants EP/S026045/1 and EP/T003553/1, EP/N014588/1, EP/T017961/1, the Wellcome Innovator Awards 215733/Z/19/Z and 221633/Z/20/Z, the European Union Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 777826 NoMADS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute.

Availability of data and materials

The related implementation of the algorithms and the image data used in the experiment will be made available on the website https://junqitang.com. For the phantom image example, we use the one in the experimental section of [8], while for the lung CT image example we use an image from the Mayo Clinic Dataset [21] which is publicly available.

Declarations

Conflict of interest

There are no competing interests to declare.

Ethical approval

This declaration is not applicable.

Footnotes

1

The choice of $s$ is crucial for the convergence behavior of rule (a), and we found numerically that it is better to scale it with the operator norm $\|A\|$ than to make it depend on the range of pixel values as suggested in [15].

2

The most typical example is Fig. 1b, where the optimal step-size ratio selected by the adaptive scheme at convergence is almost exactly $10^{-5}$, the ratio with which we ran SPDHG; we still observe the benefit of local convergence acceleration from our adaptive scheme.

CD was at the Department of Mathematical Sciences, University of Bath, while the research presented in this article was undertaken.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Claire Delplancke, Email: claire.delplancke@edf.fr.

Matthias J. Ehrhardt, Email: m.ehrhardt@bath.ac.uk

References

  • 1. Alacaoglu, A., Fercoq, O., Cevher, V.: On the convergence of stochastic primal-dual hybrid gradient. SIAM J. Optim. 32(2), 1288–1318 (2022)
  • 2. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, vol. 408. Springer (2011)
  • 3. Bonettini, S., Benfenati, A., Ruggiero, V.: Scaling techniques for epsilon-subgradient methods. SIAM J. Optim. 26(3), 1741–1772 (2016)
  • 4. Bonettini, S., Porta, F., Ruggiero, V., Zanni, L.: Variable metric techniques for forward-backward methods in imaging. J. Comput. Appl. Math. 385, 113192 (2021)
  • 5. Bonettini, S., Prato, M., Rebegoldi, S.: A block coordinate variable metric linesearch based proximal gradient method. Comput. Optim. Appl. 71(1), 5–52 (2018)
  • 6. Bonettini, S., Rebegoldi, S., Ruggiero, V.: Inertial variable metric techniques for the inexact forward-backward algorithm. SIAM J. Sci. Comput. 40(5), A3180–A3210 (2018)
  • 7. Bonettini, S., Ruggiero, V.: On the convergence of primal-dual hybrid gradient algorithms for total variation image restoration. J. Math. Imaging Vis. 44(3), 236–253 (2012)
  • 8. Chambolle, A., Ehrhardt, M.J., Richtárik, P., Schönlieb, C.-B.: Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 28(4), 2783–2808 (2018)
  • 9. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
  • 10. Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
  • 11. Combettes, P.L., Vũ, B.C.: Variable metric quasi-Fejér monotonicity. Nonlinear Anal.: Theory Methods Appl. 78, 17–31 (2013)
  • 12. Delplancke, C., Gurnell, M., Latz, J., Markiewicz, P.J., Schönlieb, C.-B., Ehrhardt, M.J.: Improving a stochastic algorithm for regularized PET image reconstruction. In: 2020 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), pp. 1–3. IEEE (2020)
  • 13. Ehrhardt, M.J., Markiewicz, P., Schönlieb, C.-B.: Faster PET reconstruction with non-smooth priors by randomization and preconditioning. Phys. Med. Biol. 64(22), 225019 (2019)
  • 14. Goldstein, T., Li, M., Yuan, X.: Adaptive primal-dual splitting methods for statistical learning and image processing. Adv. Neural Inf. Process. Syst. 28, 2089–2097 (2015)
  • 15. Goldstein, T., Li, M., Yuan, X., Esser, E., Baraniuk, R.: Adaptive primal-dual hybrid gradient methods for saddle-point problems. arXiv preprint arXiv:1305.0546 (2013)
  • 16. Gutiérrez, E.B., Delplancke, C., Ehrhardt, M.J.: On the convergence and sampling of randomized primal-dual algorithms and their application to parallel MRI reconstruction. arXiv preprint arXiv:2207.12291 (2022)
  • 17. He, B., Yuan, X.: Convergence Analysis of Primal-dual Algorithms for Total Variation Image Restoration. Technical report, Citeseer (2010)
  • 18. Malitsky, Y.: Golden ratio algorithms for variational inequalities. Math. Program. 184(1), 383–410 (2020)
  • 19. Malitsky, Y., Mishchenko, K.: Adaptive gradient descent without descent. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, pp. 6702–6712 (2020)
  • 20. Malitsky, Y., Pock, T.: A first-order primal-dual algorithm with linesearch. SIAM J. Optim. 28(1), 411–432 (2018)
  • 21. McCollough, C.: TU-FG-207A-04: overview of the low dose CT grand challenge. Med. Phys. 43(6Part35), 3759–3760 (2016)
  • 22. Papoutsellis, E., Ametova, E., Delplancke, C., Fardell, G., Jørgensen, J.S., Pasca, E., Turner, M., Warr, R., Lionheart, W.R.B., Withers, P.J.: Core imaging library-part II: multichannel reconstruction for dynamic and spectral tomography. Philos. Trans. R. Soc. A 379(2204), 20200193 (2021)
  • 23. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Elsevier (1971)
  • 24. Schramm, G., Holler, M.: Fast and memory-efficient reconstruction of sparse Poisson data in listmode with non-smooth priors with application to time-of-flight PET. Phys. Med. Biol. (2022)
  • 25. Vladarean, M.-L., Malitsky, Y., Cevher, V.: A first-order primal-dual method with adaptivity to local smoothness. Adv. Neural Inf. Process. Syst. 34, 6171–6182 (2021)
  • 26. Yokota, T., Hontani, H.: An efficient method for adapting step-size parameters of primal-dual hybrid gradient method in application to total variation regularization. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 973–979. IEEE (2017)
  • 27. Zdun, L., Brandt, C.: Fast MPI reconstruction with non-smooth priors by stochastic optimization and data-driven splitting. Phys. Med. Biol. 66(17), 175004 (2021)



Articles from Journal of Mathematical Imaging and Vision are provided here courtesy of Nature Publishing Group
