Abstract
In this work, we propose a new primal–dual algorithm with adaptive step sizes. The stochastic primal–dual hybrid gradient (SPDHG) algorithm with constant step sizes has become widely applied in large-scale convex optimization across many scientific fields due to its scalability. While the product of the primal and dual step sizes is subject to an upper bound in order to ensure convergence, the selection of the ratio of the step sizes is critical in applications. Up to now, there has been no systematic and successful way of selecting the primal and dual step sizes for SPDHG. In this work, we propose a general class of adaptive SPDHG (A-SPDHG) algorithms and prove their convergence under weak assumptions. We also propose concrete parameter-update strategies which satisfy the assumptions of our theory and thereby lead to convergent algorithms. Numerical examples on computed tomography demonstrate the effectiveness of the proposed schemes.
Introduction
The stochastic primal–dual hybrid gradient (SPDHG) algorithm introduced in [8] is a stochastic version of the primal–dual hybrid gradient (PDHG) algorithm, also known as Chambolle–Pock algorithm [9]. SPDHG has proved more efficient than PDHG for a variety of problems in the framework of large-scale non-smooth convex inverse problems [13, 22, 24, 27]. Indeed, SPDHG only uses a subset of the data at each iteration, hence reducing the computational cost of evaluating the forward operator and its adjoint; as a result, for the same computational burden, SPDHG attains convergence faster than PDHG. This is especially relevant in the context of medical imaging, where there is a need for algorithms whose convergence speed is compatible with clinical standards, and at the same time able to deal with convex, non-smooth priors like total variation (TV), which are well-suited to ill-posed imaging inverse problems, but preclude the recourse to scalable gradient-based methods.
Like PDHG, SPDHG is provably convergent under the assumption that the product of its primal and dual step sizes is bounded by a constant depending on the problem to solve. On the other hand, the ratio between the primal and dual step sizes is a free parameter, whose value needs to be chosen by the user. The value of this parameter, which can be interpreted as a control on the balance between primal and dual convergence, can have a severe impact on the convergence speed of PDHG, and the same holds true for SPDHG [12]. This leads to an important challenge in practice, as there is no known theoretical or empirical rule to guide the choice of the parameter. Manual tuning is computationally expensive, as it requires running and comparing the algorithm over a range of values, and there is no guarantee that a value leading to fast convergence for one dataset remains a good choice for another dataset. For PDHG, an online primal–dual balancing strategy, where the values of the step sizes evolve along the iterations, has been proposed in [14] to address this issue. More generally, adaptive step sizes have been used for PDHG with backtracking in [14, 20] and for adapting to local smoothness in [25], and they are widely used for a variety of other algorithms, such as gradient methods [19], subgradient methods [3] and splitting methods [4–7, 18], to improve convergence speed and bypass the need for explicit model constants, like Lipschitz constants or operator norms. For SPDHG, an empirical adaptive scheme has been used for magnetic particle imaging, but without a convergence proof [27].
On the theoretical side, a standard procedure to prove the convergence of proximal-based algorithms for convex optimization is to use the notion of Fejér monotonicity [2]. Constant step sizes lead to a fixed metric setting, while adaptive step sizes lead to a variable metric setting. The work [11] establishes the convergence of deterministic Fejér-monotone sequences in the variable metric setting, while [10] is concerned with the convergence of random Fejér-monotone sequences in the fixed metric setting.
In this work, we introduce and study an adaptive version of SPDHG. More precisely:
We introduce a broad class of strategies to adaptively choose the step sizes of SPDHG. This class includes, but is not limited to, the adaptive primal–dual balancing strategy, where the ratio of the step sizes, which controls the balance between convergence of the primal and dual variable, is tuned online.
We prove the almost-sure convergence of SPDHG under the schemes of this class. In order to do so, we introduce the concept of C-stability, which generalizes the notion of Fejér monotonicity, and we prove the convergence of random C-stable sequences in a variable metric setting, hence generalizing results from [11] and [10]. We then show that our proposed algorithm falls within this novel theoretical framework by following strategies similar to those in the almost-sure convergence proofs of [1, 16].
We compare the performance of SPDHG under various adaptive schemes and the standard fixed step-size scheme on large-scale imaging inverse problems (sparse-view CT, limited-angle CT, low-dose CT). We observe that the primal–dual balancing adaptive strategy is always as fast as or faster than all the other strategies. In particular, it consistently leads to substantial gains in convergence speed over the fixed strategy when the fixed step sizes, while in the theoretical convergence range, are badly chosen. This is especially relevant as it is impossible to know whether the fixed step sizes are well or badly chosen without running expensive comparative tests. Even in the cases where SPDHG’s fixed step sizes are well tuned, meaning that they lie in the range to which the adaptive step sizes are observed to converge, we observe that our adaptive scheme still provides convergence acceleration over standard SPDHG after a certain number of iterations. Finally, we pay special attention to the hyperparameters used in the adaptive schemes. These hyperparameters essentially control the degree of adaptivity of the algorithm, and each of them has a clear interpretation and is easy to choose in practice. We observe in our extensive numerical tests that the convergence speed of our adaptive scheme is robust to the choice of these parameters within the empirical range we provide; hence, the scheme can be applied directly to the problem at hand without fine-tuning, and solves the step-size selection challenge encountered by the user.
The rest of the paper is organized as follows. In Sect. 2, we introduce SPDHG with adaptive step sizes, state the convergence theorem, and carry out the proof. In Sect. 3, we propose concrete schemes to implement the adaptivity, followed by numerical tests on CT data in Sect. 4. We conclude in Sect. 5. Finally, Sect. 6 collects some useful lemmas and proofs.
Theory
Convergence Theorem
The variational problem to solve takes the form:
where X and are Hilbert spaces, are bounded linear operators, and and are convex functions. We define with elements and such that . The associated saddle-point problem reads as
| 2.1 |
where stands for the Fenchel conjugate of . The set of solutions to (2.1) is denoted by , and the set of nonnegative integers by , and stands for . Elements of are called saddle points and are characterized by
| 2.2 |
In order to solve the saddle-point problem, we introduce the adaptive stochastic primal–dual hybrid gradient (A-SPDHG) algorithm in Algorithm 2.1. At each iteration , A-SPDHG involves the following five steps:
update the primal step size and the dual step sizes (line 4);
update the primal variable by a proximal step with step size (line 5);
randomly choose an index i with probability (line 6);
update the dual variable by a proximal step with step size (line 7);
compute the extrapolated dual variable (line 8).
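The five steps above can be sketched in code. The following is a toy serial-sampling instance, not the paper's reference implementation: it assumes g(x) = ½‖x‖² and f_i(y) = ½‖y − b_i‖² so that both proximal maps have closed forms, uses uniform sampling probabilities, and takes a user-supplied `update_steps` rule (all names here are illustrative).

```python
import numpy as np

def prox_l2(v, step):
    """Prox of step * (0.5 ||.||^2), a toy choice for g."""
    return v / (1.0 + step)

def a_spdhg(A_blocks, b_blocks, x0, tau, sigmas, update_steps, n_iter=100, seed=0):
    """Toy serial-sampling A-SPDHG sketch (cf. Algorithm 2.1), assuming
    g(x) = 0.5||x||^2 and f_i(y) = 0.5||y - b_i||^2 so that both proximal
    maps are closed-form. `update_steps(k, tau, sigmas)` is a user-supplied
    adaptive rule returning the new (tau, sigmas)."""
    rng = np.random.default_rng(seed)
    n = len(A_blocks)
    probs = np.full(n, 1.0 / n)                   # uniform serial sampling
    x = x0.astype(float).copy()
    ys = [np.zeros(A.shape[0]) for A in A_blocks]
    z = np.zeros_like(x)                          # running A^T y
    zbar = np.zeros_like(x)                       # running A^T y_bar (extrapolated)
    for k in range(n_iter):
        tau, sigmas = update_steps(k, tau, sigmas)      # step 1: adapt step sizes
        x = prox_l2(x - tau * zbar, tau)                # step 2: primal prox step
        i = rng.choice(n, p=probs)                      # step 3: sample an index
        v = ys[i] + sigmas[i] * (A_blocks[i] @ x)
        y_new = (v - sigmas[i] * b_blocks[i]) / (1.0 + sigmas[i])  # step 4: dual prox
        dz = A_blocks[i].T @ (y_new - ys[i])
        ys[i] = y_new
        z += dz
        zbar = z + dz / probs[i]                        # step 5: extrapolation by 1/p_i
    return x
```

Maintaining the running adjoint `z = A^T y` and updating only the sampled block is what makes each iteration cheap relative to PDHG.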
A-SPDHG is adaptive in the sense that the step-size values are updated at each iteration according to an update rule which takes into account the value of the primal and dual iterates and up to the current iteration. As the iterates are stochastic, the step sizes are themselves stochastic, which must be carefully accounted for in the theory.
Before turning to the convergence of A-SPDHG, let us recall some facts about the state-of-the-art SPDHG. Each iteration of SPDHG involves the selection of a random subset of . In the serial sampling case where the random subset is a singleton, the SPDHG algorithm [8] is a special case of Algorithm 2.1 with the update rule
Under the condition
| 2.3 |
SPDHG iterates converge almost surely to a solution of the saddle-point problem (2.1) [1, 16].
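For concreteness, a commonly used fixed parametrization compatible with a condition of the form (2.3) under uniform serial sampling ($p_i = 1/n$) is the following; the exact constants in [8] may differ, so this is given only as an illustration:

```latex
\tau \;=\; \frac{\gamma}{\rho \, n \max_{j}\|A_j\|}, \qquad
\sigma_i \;=\; \frac{\gamma \rho}{\|A_i\|}, \qquad \gamma \in (0,1),
```

so that $\tau \sigma_i \|A_i\|^2 \le \gamma^2/n < 1/n = p_i$. The free parameter $\rho > 0$ is precisely the primal–dual ratio that the adaptive schemes of this paper tune online.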
Let us now turn to the convergence of A-SPDHG. The main theorem, Theorem 2.1, gives conditions on the update rule under which A-SPDHG is provably convergent. Plainly speaking, these conditions are threefold:
-
(i)
the step sizes for step , and , depend only on the iterates up to step k,
-
(ii)
the step sizes satisfy a uniform version of condition (2.3),
-
(iii)
the step-size sequences and for do not decrease too fast. More precisely, they are uniformly almost surely quasi-increasing in the sense defined below.
In order to state the theorem rigorously, let us introduce some useful notation and definitions. For all , the -algebra generated by the iterates up to point k, , is denoted by . We say that a sequence is -adapted if for all , is measurable with respect to .
A positive real sequence is said to be quasi-increasing if there exists a sequence with values in [0, 1), called the control on , such that and:
| 2.4 |
By extension, we call a random positive real sequence uniformly almost surely quasi-increasing if there exists a deterministic sequence with values in [0, 1) such that and equation (2.4) above holds almost surely (a.s.).
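As a sketch, assuming that (2.4) reads $\sigma_{k+1} \ge (1 - \varepsilon_k)\,\sigma_k$ with the control $(\varepsilon_k)$ in [0, 1) (a reconstruction from the surrounding text), the definition can be checked numerically:

```python
def is_quasi_increasing(seq, control):
    """Check the reconstructed condition (2.4):
    seq[k+1] >= (1 - control[k]) * seq[k] for all k,
    where the control takes values in [0, 1)."""
    assert all(0.0 <= e < 1.0 for e in control)
    return all(s_next >= (1.0 - e) * s
               for s_next, s, e in zip(seq[1:], seq, control))
```

For instance, the strictly decreasing sequence defined by multiplying by $(1 - 1/(k+2)^2)$ at each step is quasi-increasing with the summable control $\varepsilon_k = 1/(k+2)^2$, whereas geometric decay is not quasi-increasing for that control: it decreases too fast.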
Theorem 2.1
(Convergence of A-SPDHG) Let X and Y be separable Hilbert spaces, bounded linear operators, and proper, convex and lower semi-continuous functions for all . Assume that the set of saddle points is non-empty and the sampling is proper, that is to say for all . If the following conditions are met:
-
(i)
the step-size sequences are -adapted,
-
(ii)there exists such that for all indices and iterates ,
2.5 -
(iii)
the initial step sizes and for all indices are positive and the step-size sequences and for all indices are uniformly almost surely quasi-increasing,
then the sequence of iterates converges almost surely to an element of .
While the conditions (i)–(iii) are general enough to cover a large range of step-size update rules, we will focus in practice on the primal–dual balancing strategy, which consists in scaling the primal and the dual step sizes by an inverse factor at each iteration. In that case, the update rule depends on a random positive sequence and reads as:
| 2.6 |
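Under rule (2.6), the primal step size is scaled by a factor and each dual step size by its inverse. A minimal sketch, where the clipping interval is one plausible reading of the constraint (2.7) and the names `gamma`, `eps` stand in for the paper's symbols:

```python
def balanced_steps(tau, sigmas, gamma, eps):
    """Primal-dual balancing update (2.6): scale tau by gamma and every
    sigma_i by 1/gamma. gamma is first clipped to [1 - eps, 1/(1 - eps)]
    -- one plausible reading of condition (2.7) -- so that both step-size
    sequences stay quasi-increasing with a summable control."""
    gamma = min(max(gamma, 1.0 - eps), 1.0 / (1.0 - eps))
    return gamma * tau, [s / gamma for s in sigmas], gamma
```

Note that the products of primal and dual step sizes are invariant under (2.6), so a bound of the form (2.5) continues to hold whenever it holds for the initial step sizes.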
Lemma 2.2
(Primal–dual balancing) Let the step-size sequences satisfy equation (2.6) and assume in addition that is -adapted, that the initial step sizes satisfy
and are positive, that there exists a deterministic sequence with values in [0, 1) such that and for all and ,
| 2.7 |
Then, the step-size sequences satisfy assumptions (i)–(iii) of Theorem 2.1.
Lemma 2.2 is proved in Sect. 6.
Connection with the literature:
The primal–dual balancing strategy has been introduced in [14] for PDHG and indeed for we recover with Lemma 2.2 the non-backtracking algorithm presented in [14]. As a consequence, our theorem also implies the pointwise convergence of this algorithm, whose convergence was established in the sense of vanishing residuals in [14].
Still for PDHG, [20] proposes, without proof, an update rule where the ratio of the step sizes is either quasi-non-increasing or quasi-non-decreasing. This requirement is similar to, but not directly connected with, ours, where we ask the step sizes themselves to be quasi-increasing.
For SPDHG, the angular constraint step-size rule proposed without convergence proof in [27] satisfies assumptions (i)–(iii).
Outline of the proof: Theorem 2.1 is proved in the following subsections. We first define in Sect. 2.2 metrics related to the algorithm's step sizes on the primal–dual product space. As the step sizes are adaptive, we obtain a sequence of metrics. The proof of Theorem 2.1 is then similar in strategy to those of [1] and [16], but requires novel elements to deal with the variability of the metrics. In Theorem 2.5, we state convergence conditions for an abstract random sequence in a Hilbert space equipped with random variable metrics. In Sects. 2.4 and 2.5, we show that A-SPDHG falls within the scope of Theorem 2.5. We collect all elements and conclude the proof in Sect. 2.6.
Variable Metrics
For a Hilbert space H, we call the set of bounded self-adjoint linear operators from H to H, and for all we introduce the notation:
By an abuse of notation, we write for a scalar . Notice that is a norm on H if M is positive definite. Furthermore, we introduce the partial order on such that for ,
We call the subset of comprised of M such that . Furthermore, a random sequence in is said to be uniformly almost surely quasi-decreasing if there exists a deterministic nonnegative sequence such that and a.s.
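In finite dimensions, the weighted norm and the partial order introduced above can be sketched as follows (a numerical illustration, not part of the paper's formalism):

```python
import numpy as np

def m_norm(M, x):
    """Weighted norm ||x||_M = sqrt(<Mx, x>) for symmetric positive definite M."""
    return float(np.sqrt(x @ (M @ x)))

def loewner_leq(M1, M2, tol=1e-12):
    """Partial order M1 <= M2 in the Loewner sense: M2 - M1 is positive
    semidefinite (checked here via its smallest eigenvalue)."""
    return bool(np.linalg.eigvalsh(M2 - M1).min() >= -tol)
```

For a scalar multiple of the identity, the weighted norm reduces to a rescaled Euclidean norm, which is the abuse of notation mentioned above.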
Coming back to A-SPDHG, let us define for every iteration and every index two block operators of as:
and a block operator of as:
| 2.8 |
The following lemma translates assumptions (i)–(iii) of Theorem 2.1 into properties of the variable metric sequences.
Lemma 2.3
(Variable metric properties)
Assumption (i) of Theorem 2.1 implies that , and are -adapted.
- Assumption (ii) of Theorem 2.1 is equivalent to the existence of such that for all indices and iterates ,
Assumptions (ii) and (iii) of Theorem 2.1 imply that and are uniformly a.s. quasi-decreasing.
Assumption (ii) and (iii) of Theorem 2.1 imply that the sequences and for all are a.s. bounded from above and by below by positive constants. In particular, this implies that there exists such that for all and , or equivalently that for all .
Remark 2.4
(Step-size induced metrics on the primal–dual product space) The lemma implies that , and are positive definite and hence induce a metric on the corresponding spaces. If and for constant step sizes, corresponds to the metric used in [17], where PDHG is reformulated as a proximal-point algorithm for a non-trivial metric on the primal–dual product space.
Proof of Lemma 2.3
Assertion (a) of the lemma follows from the fact that for all iterates , the operators , and are measurable with respect to the -algebra generated by . Assertion (b) follows from equation (6.2) of Lemma 6.1 in Sect. 6. The proof of assertion (c) is a bit more involved. Let us assume that assumption (iii) of Theorem 2.1 holds and let and be the controls of and for , respectively. We define the sequence by:
| 2.9 |
which is a common control on and for as the maximum of a finite number of controls. Let us fix and . Because the intersection of a finite number of measurable events of probability one is again a measurable event of probability one, it holds almost surely that for all ,
Hence, the sequence is uniformly quasi-decreasing with control , which is indeed a nonnegative sequence with bounded sum. (To see that has a bounded sum, note that is summable, hence converges to 0, and is therefore smaller than 1/2 for all integers k larger than some K; in turn, for all integers k larger than K, the term is bounded from below by 0 and from above by , hence is summable.) A similar proof shows that is uniformly quasi-decreasing with the same control. Turning to the case of , we have, as before:
thanks to (b).
Let us conclude with the proof of assertion (d). By assumption (iii), the sequences and are uniformly a.s. quasi-increasing. We define a common control as in (2.9). Then, the sequences and are a.s. bounded from below by the same deterministic constant , which is positive since the initial step sizes are positive and the control takes values in [0, 1) and has finite sum. Furthermore, by assumption (ii), the product of the sequences and is almost surely bounded from above. As a consequence, each of the sequences and is a.s. bounded from above. The equivalence with for all , and with , is straightforward.
Convergence of Random C-stable Sequences in Random Variable Metrics
Let H be a Hilbert space and a subset of H. Let be a probability space. All random variables in the following are assumed to be defined on and measurable with respect to unless stated otherwise. Let be a random sequence of .
A random sequence with values in H is said to be stable with respect to the target C relative to if for all , the sequence converges almost surely. The following theorem then states sufficient conditions for the convergence of such sequences.
Theorem 2.5
(Convergence of C-stable sequences) Let H be a separable Hilbert space, C a closed non-empty subset of H, a random sequence of , and a random sequence of H. If the following conditions are met:
-
(i)
takes values in for a given and is uniformly a.s. quasi-decreasing,
-
(ii)
is stable with respect to the target C relative to ,
-
(iii)
every weak sequential cluster point of is almost surely in C, meaning that there exists a measurable subset of of probability one such that for all , every weak sequential cluster point of is in C.
then converges almost surely weakly to a random variable in C.
Stability with respect to a target set C is implied by Fejér and quasi-Fejér monotonicity with respect to C, which have been studied either for random sequences [10] or in the framework of variable metrics [11], but to the best of our knowledge not both at the same time. The proof of Theorem 2.5 follows the same lines as [10, Proposition 2.3 (iii)] and uses two results from [11].
Proof
The set C is a subset of the separable Hilbert space H, hence is separable. As C is closed and separable, there exists a countable subset of C whose closure is equal to C. Thanks to assumption (ii), there exists for all a measurable subset of with probability one such that the sequence converges for all . Furthermore, let be a measurable subset of of probability one corresponding to the almost-sure property of assumption (i). Let
As the intersection of a countable number of measurable subsets of probability one, is itself a measurable set of with . Fix for the rest of the proof.
The sequence takes values in for and is quasi-decreasing with control . Furthermore, for all ,
where the product is finite because is positive and summable. By [11, Lemma 2.3], converges pointwise strongly to some .
Furthermore, for all , there exists a sequence with values in converging strongly to x. By assumption, for all , the sequence converges to a limit which shall be called . For all and , we can write thanks to the triangle inequality:
By taking the limit , it follows that:
Taking now the limit shows that the sequence converges for all . On the other hand, because , the weak cluster points of lie in C. Hence, by [11, Theorem 3.3], the sequence converges almost surely to a point .
We are now equipped to prove Theorem 2.1. We show in Sects. 2.4 and 2.5 that A-SPDHG satisfies points (ii) and (iii) of Theorem 2.5, respectively, and conclude the proof in Sect. 2.6. Interestingly, the proofs of points (ii) and (iii) rely on two different ways of viewing A-SPDHG. Point (ii) relies on a convex optimization argument: by taking advantage of the measurability of the primal variable at step with respect to , one can write a contraction-type inequality relating the conditional expectation of the iterates’ norm at step to the iterates’ norm at step k. Point (iii) relies on monotone operator theory: we use the fact that the update from the half-shifted iterates to can be interpreted as a step of a proximal-point algorithm on , conditionally on i being the index randomly selected at step k.
A-SPDHG is Stable with Respect to the Set of Saddle Points
In this section, we show that is stable with respect to relative to the variable metrics sequence defined in equation (2.8) above. We introduce the operators and defined, respectively, by
and the functionals defined for all as:
We begin by recalling the cornerstone inequality satisfied by the iterates of SPDHG stated first in [8] and reformulated in [1].
Lemma 2.6
([1], Lemma 4.1) For every saddle point , it a.s. holds that for all ,
| 2.10 |
The second step is to relate the assumptions of Theorem 2.1 to properties of the functionals appearing in (2.10). Let us introduce the set of elements having at most one non-vanishing component.
Lemma 2.7
(Properties of functionals of interest) Under the assumptions of Theorem 2.1, there exists a nonnegative, summable sequence such that a.s. for every iterate and :
| 2.11a |
| 2.11b |
| 2.11c |
| 2.11d |
| 2.11e |
Proof
Let and be the controls of and , respectively, for all . We define the common control by:
| 2.12 |
For all , we can write
which proves (2.11a). Let us now fix , and . By definition, there exists such that for all . We obtain the inequalities (2.11b)–(2.11d) by writing:
Finally, we obtain inequality (2.11e) by writing:
where the last inequality is a consequence of (2.5).
Lemma 2.8
(A-SPDHG is -stable) Under the assumptions of Theorem 2.1,
-
(i)
The sequence of Algorithm 2.1 is stable with respect to relative to ,
-
(ii)the following results hold:
Proof
Let us begin with the proof of point (i). By definition of A-SPDHG with serial sampling, the difference between two consecutive dual iterates is almost surely sparse:
Let us define the sequences
which are a.s. nonnegative thanks to (2.11c) and (2.11d). Notice that the primal iterates from up to are measurable with respect to , whereas the dual iterates from up to are measurable with respect to . Hence, and are measurable with respect to . Furthermore, inequalities (2.10), (2.11a) and (2.11b) imply that almost surely for all ,
By Robbins–Siegmund lemma [23], converges almost surely, and . From the last point in particular, we can write thanks to (2.11d) and the monotone convergence theorem:
hence is almost surely finite, thus , and in turn , converge almost surely to 0. Furthermore, hence , and in turn , are finite, and by (2.11e), one can write that for ,
We know that is summable hence converges to 0. As a consequence,
To conclude, thanks to the identity
the almost-sure convergence of implies in turn that of .
Let us now turn to point (ii). The first assertion is a straightforward consequence of
and bounds (2.11c) and (2.11d). Furthermore, it implies that is a.s. finite, hence a.s. converges to 0, and so does .
Weak Cluster Points of A-SPDHG are Saddle Points
The goal of this section is to prove that A-SPDHG satisfies point (iii) of Theorem 2.5. On the event , the A-SPDHG update procedure can be rewritten as
We define by:
so that on the event (and for ).
Lemma 2.9
(Cluster points of A-SPDHG are saddle points) Let a.s. be a weak cluster point of (meaning that there exists a measurable subset of of probability one such that for all , is a weak sequential cluster point of ) and assume that the assumptions of Theorem 2.1 hold. Then, is a.s. in .
Proof
Thanks to Lemma 2.8-(ii) and the monotone convergence theorem,
Now,
Hence, we can deduce that
It follows that the series in the expectation is a.s. finite, and since we deduce that almost surely,
| 2.13 |
for all . We consider a sample which is bounded and such that (2.13) holds. We let for each i, , so that for . Then, one has
where as . Given a test point (x, y), one may write for any k:
and summing all these inequalities, we obtain:
where as . We deduce that if is the weak limit of a subsequence (as well as, of course, ), then:
Since (x, y) is arbitrary, we find that (2.2) holds for .
Proof of Theorem 2.1
Under the assumptions of Theorem 2.1, the set of saddle points is closed and non-empty and is a separable Hilbert space. By Lemma 2.3, the variable metrics sequence defined in (2.8) satisfies condition (i) of Theorem 2.5. Furthermore, the iterates of Algorithm 2.1 comply with condition (ii) and (iii) of Theorem 2.5 by Lemma 2.8 and Lemma 2.9, respectively, and hence converge almost surely to a point in .
Algorithmic Design and Practical Implementations
In this section, we present practical instances of our A-SPDHG algorithm, specifying step-size adjustment rules which satisfy the assumptions of our convergence proof. We extend the adaptive step-size balancing rule proposed in [14] for deterministic PDHG to our stochastic setting, with a minibatch approximation to minimize the computational overhead.
A-SPDHG Rule (a)—Tracking and Balancing the Primal–Dual Progress
Let us first briefly introduce the foundation of our first numerical scheme, which builds upon the deterministic adaptive PDHG algorithm proposed by Goldstein et al. [14], with the iterates:
In this foundational work, Goldstein et al. [14] proposed to evaluate two sequences in order to track and balance the progress of the primal and dual iterates of deterministic PDHG (denoted here as and ):
| 3.1 |
These two sequences measure the lengths of the primal and dual subgradients for the objective , as can be seen from the definition of proximal operators. The primal update of deterministic PDHG can be written as:
| 3.2 |
The optimality condition of the above objective reads:
| 3.3 |
By adding on both sides and rearranging the terms, one can derive:
| 3.4 |
and similarly for the dual update one can also derive:
| 3.5 |
which indicates that the sequences and given by (3.1) effectively track the primal and dual progress of deterministic PDHG; hence, Goldstein et al. [14] propose to use them as the basis for balancing the primal and dual step sizes of PDHG.
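These progress measures for deterministic PDHG can be sketched as follows; the sign conventions are assumed from (3.4)–(3.5), and the norm is left as a parameter since several norm choices are compared in the text:

```python
import numpy as np

def pd_residuals(A, x_prev, x, y_prev, y, tau, sigma, norm_ord=1):
    """Sketch of the progress measures of (3.1) for deterministic PDHG:
    by the optimality conditions (3.4)-(3.5), these are the lengths of a
    primal and a dual subgradient of the saddle-point objective at the
    current iterate (sign conventions assumed; norm choice is a parameter)."""
    p = np.linalg.norm((x_prev - x) / tau - A.T @ (y_prev - y), ord=norm_ord)
    d = np.linalg.norm((y_prev - y) / sigma - A @ (x_prev - x), ord=norm_ord)
    return float(p), float(d)
```

At a fixed point of the iteration both residuals vanish, which is what makes them usable as optimality/progress indicators.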
In light of this, we propose our first practical implementation of A-SPDHG in Algorithm 3.1 as rule (a), where we use a single dual step size for all iterates k and indices j, and where we estimate the progress toward optimality of the primal and dual variables via the two sequences and defined at each iteration k with as:
| 3.6 |
which are minibatch extensions of (3.1) tailored to our stochastic setting. By balancing them on the fly, adjusting the primal–dual step-size ratio when appropriate, we encourage the algorithm to achieve similar progress in the primal and dual steps and hence improve convergence. More specifically, as shown in Algorithm 3.1, in each iteration the values of and are evaluated and compared. If the value of (which tracks the primal subgradients) is significantly larger than (which tracks the dual subgradients), then the primal progress is slower than the dual progress, and the algorithm boosts the primal step size while shrinking the dual step size. If is noticeably smaller than , the algorithm does the opposite.
Note that here we adopt the -norm as the length measure for and , as done by Goldstein et al. [14, 15], since we also observe numerically its benefit over the more intuitive choice of the -norm.
In the full-batch case (), the scheme reduces to the adaptive PDHG proposed in [14, 15]. We adjust the ratio between the primal and dual step sizes according to the ratio between and , and whenever the step sizes change, we shrink (which controls the amplitude of the changes) by a factor ; we typically choose in our experiments. For the choice of s, we choose as our default.
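The balancing decision of rule (a) can be sketched as follows; `Delta`, `alpha` and `eta` are illustrative names standing in for the text's imbalance tolerance, change amplitude and shrinkage factor, and the default values are not the paper's exact constants:

```python
def balance_rule_a(tau, sigma, p, d, alpha, Delta=1.5, eta=0.95):
    """Hedged sketch of the rule-(a) balancing decision: if the primal
    measure p is much larger than the dual measure d, the primal variable
    is progressing more slowly, so boost tau and shrink sigma (and vice
    versa), damping the change amplitude alpha by eta after every change."""
    if p > Delta * d:         # primal lags: larger primal step, smaller dual step
        return tau * (1.0 + alpha), sigma / (1.0 + alpha), alpha * eta
    if d > Delta * p:         # dual lags: smaller primal step, larger dual step
        return tau / (1.0 + alpha), sigma * (1.0 + alpha), alpha * eta
    return tau, sigma, alpha  # balanced: leave the step sizes untouched
```

Since the product tau * sigma is invariant and the multiplicative change 1 + alpha tends to 1 geometrically, the induced control sequence is summable, in line with the requirements of Lemma 2.2.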
Reducing the Overhead with Subsampling
Note that, unlike the deterministic case, where no extra matrix–vector multiplication is needed since and can be stored, our stochastic extension requires the computation of , since with high probability different subsets are sampled in back-to-back iterations. When using this strategy, we incur at most a overhead in terms of FLOP count, which is numerically negligible compared to the significant acceleration it brings to SPDHG, especially when the primal–dual step-size ratio is suboptimal, as we demonstrate later in the experiments. Moreover, we found numerically that we can significantly reduce this overhead further via approximation tricks such as subsampling:
| 3.7 |
with being a random subsampling operator such that . In our experiments, we choose subsampling for this approximation; hence, the overhead is reduced from to only , which is negligible, without compromising the convergence rates in practice.
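One simple way to realize such a subsampling estimate is sketched below (an illustration in the spirit of (3.7), not the paper's exact operator): sample a fixed fraction of the coordinates of the residual vector and rescale.

```python
import numpy as np

def subsampled_norm_est(v, frac=0.1, rng=None):
    """Estimate ||v||_1 from a random subset of its entries: keep a
    fraction `frac` of the coordinates, chosen uniformly without
    replacement, and rescale by n/m. The estimate is unbiased for the
    1-norm, and only the sampled coordinates of v need to be formed."""
    rng = np.random.default_rng() if rng is None else rng
    m = max(1, int(frac * v.size))
    idx = rng.choice(v.size, size=m, replace=False)
    return float(np.abs(v[idx]).sum() * v.size / m)
```

In practice the saving comes from evaluating only the sampled rows of the matrix–vector product that defines the residual, rather than the full product.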
A-SPDHG Rule (b)—Exploiting Angle Alignments
More recently, Yokota and Hontani [26] proposed a variant of the adaptive step-size balancing scheme for PDHG, utilizing the angle between the subgradients and the difference of the updates .
If these two directions are highly aligned, the primal step size can be increased for a bigger step. If they form a large angle, the primal step size should be shrunk. By extending this scheme to the stochastic setting, we obtain another adaptive scheme for SPDHG.
We present this scheme in Algorithm 3.2 as our rule (b). At iteration k with , compute:
| 3.8 |
as an estimate of , and then measure the cosine of the angle between this quantity and :
| 3.9 |
The threshold c on the cosine value (which triggers the increase of the primal step size) typically needs to be very close to 1 (we use ), since we mostly apply this type of algorithm to high-dimensional problems; this follows the choice in [26], which was made for deterministic PDHG.
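A hedged sketch of the rule-(b) test follows; `c` and the rescaling factor `rho` are illustrative values, and the single-threshold form is a simplification of Algorithm 3.2:

```python
import numpy as np

def cosine_alignment(u, v, eps=1e-12):
    """Cosine of the angle between u and v, as in (3.9)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def rule_b_update(tau, sigma, cos_val, c=0.999, rho=1.05):
    """Single-threshold sketch of rule (b): if the two directions are
    highly aligned (cosine above the threshold c, close to 1), enlarge
    the primal step; otherwise shrink it. The dual step is rescaled
    inversely so that the product tau * sigma is preserved. c and rho
    are illustrative values, not the paper's constants."""
    if cos_val > c:
        return tau * rho, sigma / rho
    return tau / rho, sigma * rho
```

Preserving the product of the step sizes keeps a bound of the form (2.5) intact, while the geometric rescaling fits the quasi-increasing requirement when the factor is driven to 1.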
Recently, Zdun et al. [27] proposed a heuristic similar to our rule (b), but they choose to approximate an element of instead of . Our choice follows more closely the original scheme of Yokota and Hontani [26]. We found numerically that their scheme is not competitive in our settings.
Numerical Experiments
In this section, we present numerical studies of the proposed schemes on one of the most typical imaging inverse problems, computed tomography (CT). We compare the A-SPDHG algorithm with the original SPDHG for different choices of the starting ratio of the primal and dual step sizes.
In our CT imaging example, we seek to reconstruct tomographic images from fanbeam X-ray measurement data by solving the following TV-regularized objective:
| 4.1 |
where D denotes the 2D differential operator, and . We consider three fanbeam CT imaging modalities: sparse-view CT, low-dose CT and limited-angle CT. We test A-SPDHG and SPDHG on two images of different sizes (Example 1 uses a phantom image of size , while Example 2 uses an image from the Mayo Clinic dataset [21] of size ) and on 4 different starting ratios (, , and ). For both algorithms, we partition the measurement data and the operator into minibatches in an interleaved fashion. More specifically, we first collect all the X-ray measurement data and list them consecutively from 0 to 360 degrees to form the full A and b, and then group every 10th measurement into one minibatch, to form the partition and .
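The interleaved partition described above can be sketched as follows (index bookkeeping only; `n_views` and `n_batches` are illustrative parameter names):

```python
import numpy as np

def interleaved_partition(n_views, n_batches=10):
    """Interleaved minibatch partition used in the CT experiments:
    with projection views listed consecutively by angle, subset j
    collects views j, j + n_batches, j + 2*n_batches, ... so that
    every minibatch covers the full angular range."""
    return [np.arange(j, n_views, n_batches) for j in range(n_batches)]
```

Restricting the rows of A and the entries of b to subset j then yields the blocks forming the partition; the interleaving keeps each block's angular coverage, and hence its operator norm, roughly uniform across blocks.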
For A-SPDHG, we use the approximation step for presented in (3.7) with subsampling; hence, the computational overhead is negligible in this experiment. We initialize all algorithms from a zero image.
We present our numerical results in Figs. 1, 2, 3 and 6. In these plots, we compare the convergence rates of the algorithms in terms of the number of iterations (the execution time per iteration is almost the same for all algorithms, as the overhead of A-SPDHG is numerically trivial). Figures 1 and 2 report the results for large-scale sparse-view CT experiments on a phantom image and a lung CT image from the Mayo Clinic dataset [21]; Fig. 3 reports the results for low-dose CT experiments, where we simulate a large number of measurements corrupted with a significant amount of Poisson noise; and Fig. 6 reports the results for limited-angle CT, where only measurement angles in the range of 0 to 150 degrees are present, while the measurements from the remaining [150, 360] degrees are missing. In all these examples, we consistently observe that no matter how the primal–dual step-size ratio is initialized, A-SPDHG automatically adjusts the ratio toward the optimal choice, which is around either or for these four CT problems, and significantly outperforms the vanilla SPDHG when the starting ratio is away from the optimal range. Meanwhile, even when the starting ratio of the SPDHG algorithm is near-optimal, we observe consistently in most of these examples that our scheme outperforms the vanilla SPDHG locally after a certain number of iterations (highlighted by the vertical dashed lines in the relevant subfigures), which further indicates the benefit of adaptivity for this class of algorithms. Note that throughout all these examples we use only one fixed set of parameters for A-SPDHG, as suggested in the previous section, which again indicates the strong practicality of our scheme.
Fig. 1.
Comparison between SPDHG and A-SPDHG on sparse-view CT (Example 1), with a variety of starting primal–dual step-size ratios. We include the images reconstructed by the algorithms at termination (50th epoch). In the first plot of each subfigure, the black circle indicates the starting step-size ratio for all the algorithms; the same applies in the following figures
Fig. 2.
Comparison between SPDHG and A-SPDHG on sparse-view CT (Example 2), with a variety of starting primal–dual step-size ratios. We include the images reconstructed by the algorithms at termination (50th epoch)
Fig. 3.
Comparison between SPDHG and A-SPDHG on low-dose CT (where we use a large number of highly noisy X-ray measurements), with a variety of starting primal–dual step-size ratios. We resized the phantom image to 256 by 256. We include the images reconstructed by the algorithms at termination (50th epoch)
Fig. 6.
Comparison between SPDHG and A-SPDHG on limited-angle CT (Example 2), with a variety of starting primal–dual step-size ratios. We include the images reconstructed by the algorithms at termination (50th epoch)
For the low-dose CT example, we run two extra sets of experiments: one with a larger number of minibatch partitions (40) in Fig. 4, and one warm-started from a better initial image obtained via filtered backprojection in Fig. 5. In all these extra examples we consistently observe superior performance of A-SPDHG over vanilla SPDHG, especially when the primal–dual step-size ratio is suboptimal. Interestingly, we found that warm-starting has no noticeable impact on the comparative performance of SPDHG and A-SPDHG. This is mainly because SPDHG with a suboptimal primal–dual step-size ratio converges very slowly in the high-accuracy regime in practice (see Fig. 5d for example), hence the warm-start does not help much here.
Fig. 4.
Comparison between SPDHG and A-SPDHG on low-dose CT with the data split into 40 minibatches. Compared to the results presented in Fig. 3, which used 10 minibatches, we obtain similar results, and A-SPDHG continues to perform favorably compared to SPDHG.
Fig. 5.
Comparison between SPDHG and A-SPDHG with warm-start using FBP (filtered backprojection) on low-dose CT. Compared to the results shown in Fig. 3, which are without warm-start, our method in fact compares even more favorably with warm-start. Note that the early jump in function value is expected, due to the stochasticity of the algorithms. We include the images reconstructed by the algorithms at termination (50th epoch)
We should also note that, conceptually, all the hyperparameters in our adaptive schemes essentially control the degree of adaptivity of the algorithm (for extreme choices we recover vanilla SPDHG). In Figs. 7 and 9, we present numerical studies on the choices of hyperparameters for rule (a) and rule (b) of the A-SPDHG algorithm. We use a fixed starting primal–dual step-size ratio in these experiments. For rule (a), we found that it is robust to the choices of the starting shrinking rate, the shrinking speed and the gap. Overall, these parameters have a weak impact on the convergence performance of rule (a) and are easy to choose.
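As a purely illustrative sketch of a rule of this flavor (not the paper's exact rule (a): the names `eta0` for the starting shrinking rate, `speed` for the shrinking speed and `gap` for the gap are our hypothetical stand-ins for the quantities just discussed), one can rebalance the ratio based on primal and dual residuals, with the adjustment factor tending to 1 so that adaptivity vanishes and the ratio settles:

```python
# Hypothetical rule-(a)-style update.  The key design point is that the
# multiplicative shrink factor eta tends to 1 as k grows, so the step-size
# ratio eventually stops moving, consistent with the convergence theory.

def update_ratio_a(ratio, primal_res, dual_res, k,
                   eta0=0.95, speed=100.0, gap=1.5):
    eta = eta0 ** (1.0 / (1.0 + k / speed))  # shrink factor -> 1 as k -> inf
    if dual_res > gap * primal_res:          # dual lagging: grow sigma/tau
        return ratio / eta
    if primal_res > gap * dual_res:          # primal lagging: shrink sigma/tau
        return ratio * eta
    return ratio                             # residuals within the gap band
```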
Fig. 7.
Test of different choices of parameters of A-SPDHG (rule a) on the X-ray low-dose fan-beam CT example. We can observe that the performance of A-SPDHG has only minor dependence on these parameter choices
Fig. 9.
Test of different choices of parameters of A-SPDHG (rule b) on the X-ray low-dose fan-beam CT example
For rule (b), we found that the performance is more sensitive to the choice of the parameter c than for rule (a), although the dependence is still weak. Our numerical studies suggest that rule (a) is the better-performing choice, but each rule has certain mild weaknesses which call for further study and improvement: the first rule has a slight computational overhead, which can be partially addressed with a subsampling scheme, while the second rule is often slower than the first. Nevertheless, we emphasize that all these parameters essentially control the degree of adaptivity of the algorithms and are fairly easy to choose: for all these CT experiments with varying sizes, dimensions and modalities we use only one fixed set of hyperparameters in A-SPDHG, and we are already able to consistently observe numerical improvements over vanilla SPDHG.
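For intuition only, a rule-(b)-style update with a single sensitivity parameter c might nudge the log-ratio toward an observed primal/dual imbalance with a geometrically decaying weight, so that the cumulative multiplicative change stays finite (the kind of summability the convergence theory requires). All names here are illustrative stand-ins, not the paper's exact rule:

```python
import math

# Hypothetical rule-(b)-style sketch: the weight c**(k+1) sums to a finite
# value, so the total multiplicative change of the ratio is bounded.

def update_ratio_b(ratio, primal_res, dual_res, k, c=0.9):
    target = dual_res / max(primal_res, 1e-12)   # imbalance estimate
    weight = c ** (k + 1)                        # geometrically decaying weight
    return ratio * math.exp(weight * (math.log(target) - math.log(ratio)))
```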
Conclusion
In this work, we propose a new framework (A-SPDHG) for adaptive step-size balancing in stochastic primal–dual hybrid gradient methods. We first derive sufficient theoretical conditions on the adaptive primal and dual step sizes that ensure convergence in the stochastic setting. We then propose a number of practical schemes satisfying these conditions, and our numerical results on imaging inverse problems support the effectiveness of the proposed approach.
To our knowledge, this work constitutes the first theoretical analysis of adaptive step sizes for a stochastic primal–dual algorithm. Our ongoing work includes the theoretical analysis and algorithmic design of further accelerated stochastic primal–dual methods with line-search schemes for even faster convergence rates.
Fig. 8.
Test of the default choice of A-SPDHG (rule a) on the X-ray low-dose fan-beam CT example, for two different starting ratios of primal–dual step sizes (left and right figures). We can observe that our default choice of s is indeed a reasonable (at least near-optimal) choice in practice, and deviating from it may lead to slower convergence
Complementary Material for Sect. 2
We begin with a useful lemma.
Lemma 6.1
Let be positive scalars, , and P a bounded linear operator from a Hilbert space X to a Hilbert space Y. Then,
| 6.1 |
| 6.2 |
Proof
Let us call
For all ,
which proves the direct implication of (6.1). For the converse implication, consider such that and for a scalar . Then, the nonnegativity of the polynomial
for all implies that , which is equivalent to the desired conclusion. Equivalence (6.2) is straightforward by noticing that
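The symbols in Lemma 6.1 were lost in typesetting. For the reader's orientation, in the standard primal–dual setting a statement of this shape typically reads as follows; this is our hedged reconstruction under that assumption, to be checked against the published version:

```latex
% Likely form of (6.1), with step sizes \sigma, \tau > 0 and operator P:
\begin{pmatrix} \tfrac{1}{\tau} I & -P^* \\ -P & \tfrac{1}{\sigma} I \end{pmatrix} \succeq 0
\quad\Longleftrightarrow\quad
\sigma \tau \, \|P\|^2 \le 1 .
```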
Let us now turn to the proof of Lemma 2.2.
Proof of Lemma 2.2
Let us assume that the step sizes satisfy the assumptions of the lemma. Then, Assumption (i) of Theorem 2.1 is straightforwardly satisfied. Moreover, the product sequence is constant along the iterations by equation (2.6), and since it satisfies equation (2.5) at the initial iterate, it satisfies (2.5) for all iterates, which proves Assumption (ii). Finally, equation (2.7) implies that Assumption (iii) is satisfied.
Biographies
Antonin Chambolle
is a CNRS senior scientist at CEREMADE, CNRS and Paris-Dauphine University (PSL), France. He received a Ph.D. from U. Paris-Dauphine in 1993, in mathematics applied to image analysis, and a Habilitation in 2002. He has worked at SISSA (Trieste), CEREMADE, and CMAP (CNRS and École Polytechnique), before returning to CEREMADE. His main research topics are related to the calculus of variations, the theoretical and numerical analysis of variational and evolution problems involving discontinuities and boundaries, and numerical optimization, especially for non-smooth convex problems. He is also part of the INRIA team “Mokaplan”, which studies numerical methods for optimal transportation problems.
Claire Delplancke
is a researcher and engineer at EDF Research & Development (EDF Lab Paris-Saclay). After studying at ENS Cachan, she received her Ph.D. in Applied Mathematics from the University of Toulouse in 2017. She held two postdoctoral positions, first at the University of Chile, then at the University of Bath, before joining EDF in 2022, as a researcher, engineer and now project manager. Her research interests lie in stochastic algorithms and optimization for a variety of applications: inverse problems, medical imaging, and more recently energy management.
Matthias J. Ehrhardt
received the Diploma degree (Hons.) in industrial mathematics from the University of Bremen, Germany, in 2011, and the Ph.D. degree in medical imaging from University College London, U.K., in 2015. He held a postdoctoral position with the Cambridge Image Analysis group, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, U.K., from 2016 to 2018. He moved to the University of Bath, U.K., as a Prize Fellow in 2018, and since 2021 he has been a Reader at the Department of Mathematical Sciences. He heads the Bath Imaging Group, is a co-director of the Bath Centre for Mathematics and Algorithms for Data, and is the deputy director of the EPSRC Programme Grant on the Mathematics of Deep Learning. His research interests include optimisation, inverse problems, computational imaging, and machine learning.
Carola-Bibiane Schönlieb
graduated from the Institute for Mathematics, University of Salzburg (Austria) in 2004. From 2004 to 2005 she held a teaching position in Salzburg. She received her Ph.D. degree from the University of Cambridge (UK) in 2009. After one year of postdoctoral activity at the University of Göttingen (Germany), she became a Lecturer at Cambridge in 2010, was promoted to Reader in 2015 and to Professor in 2018. She has been a fellow of Jesus College, Cambridge, since 2011. She is currently Professor of Applied Mathematics at the University of Cambridge, where she heads the Cambridge Image Analysis group and is co-Director of the EPSRC Cambridge Mathematics of Information in Healthcare Hub. Her current research interests focus on variational methods, partial differential equations and machine learning for image analysis, image processing and inverse imaging problems.
Junqi Tang
is an Assistant Professor in the School of Mathematics, University of Birmingham. He received his M.Sc. and Ph.D. degrees from the Institute for Digital Communications, University of Edinburgh, U.K., in 2015 and 2019, respectively. He worked as a postdoctoral research associate with the Department of Applied Mathematics and Theoretical Physics (DAMTP), University of Cambridge, before joining the University of Birmingham in 2023. His research interests include machine learning, large-scale optimization and multi-agent systems, with applications in computational imaging and computational social science.
Author Contributions
CD, MJE and AC elaborated the proof strategy, and CD wrote parts 1, 2 and 6. JT worked on the algorithmic design, performed the numerical experiments and wrote parts 3-5. All authors reviewed the manuscript.
Funding
CD acknowledges support from the EPSRC (EP/S026045/1). MJE acknowledges support from the EPSRC (EP/S026045/1, EP/T026693/1, EP/V026259/1) and the Leverhulme Trust (ECF-2019-478). CBS acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC advanced career fellowship EP/V029428/1, EPSRC grants EP/S026045/1 and EP/T003553/1, EP/N014588/1, EP/T017961/1, the Wellcome Innovator Awards 215733/Z/19/Z and 221633/Z/20/Z, the European Union Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 777826 NoMADS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute.
Availability of data and materials
The related implementation of the algorithms and the image data used in the experiment will be made available on the website https://junqitang.com. For the phantom image example, we use the one in the experimental section of [8], while for the lung CT image example we use an image from the Mayo Clinic Dataset [21] which is publicly available.
Declarations
Conflict of interest
There are no competing interests to declare.
Ethical approval
This declaration is not applicable.
Footnotes
The choice of s is crucial for the convergence behavior of rule (a); we found numerically that it is better to scale it with the operator norm rather than with the range of pixel values as suggested in [15].
The most typical example here is Fig. 1b, where the optimal step-size ratio selected by the adaptive scheme at convergence is almost exactly the ratio with which we have set SPDHG to run. We can still observe the benefit of local convergence acceleration given by our adaptive scheme.
CD was at the Department of Mathematical Sciences, University of Bath, while the research presented in this article was undertaken.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Claire Delplancke, Email: claire.delplancke@edf.fr.
Matthias J. Ehrhardt, Email: m.ehrhardt@bath.ac.uk.
References
- 1. Alacaoglu, A., Fercoq, O., Cevher, V.: On the convergence of stochastic primal-dual hybrid gradient. SIAM J. Optim. 32(2), 1288–1318 (2022)
- 2. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, vol. 408. Springer (2011)
- 3. Bonettini, S., Benfenati, A., Ruggiero, V.: Scaling techniques for epsilon-subgradient methods. SIAM J. Optim. 26(3), 1741–1772 (2016)
- 4. Bonettini, S., Porta, F., Ruggiero, V., Zanni, L.: Variable metric techniques for forward-backward methods in imaging. J. Comput. Appl. Math. 385, 113192 (2021)
- 5. Bonettini, S., Prato, M., Rebegoldi, S.: A block coordinate variable metric linesearch based proximal gradient method. Comput. Optim. Appl. 71(1), 5–52 (2018)
- 6. Bonettini, S., Rebegoldi, S., Ruggiero, V.: Inertial variable metric techniques for the inexact forward-backward algorithm. SIAM J. Sci. Comput. 40(5), A3180–A3210 (2018)
- 7. Bonettini, S., Ruggiero, V.: On the convergence of primal-dual hybrid gradient algorithms for total variation image restoration. J. Math. Imaging Vis. 44(3), 236–253 (2012)
- 8. Chambolle, A., Ehrhardt, M.J., Richtárik, P., Schönlieb, C.-B.: Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 28(4), 2783–2808 (2018)
- 9. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
- 10. Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
- 11. Combettes, P.L., Vũ, B.C.: Variable metric quasi-Fejér monotonicity. Nonlinear Anal.: Theory Methods Appl. 78, 17–31 (2013)
- 12. Delplancke, C., Gurnell, M., Latz, J., Markiewicz, P.J., Schönlieb, C.-B., Ehrhardt, M.J.: Improving a stochastic algorithm for regularized PET image reconstruction. In: 2020 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), pp. 1–3. IEEE (2020)
- 13. Ehrhardt, M.J., Markiewicz, P., Schönlieb, C.-B.: Faster PET reconstruction with non-smooth priors by randomization and preconditioning. Phys. Med. Biol. 64(22), 225019 (2019)
- 14. Goldstein, T., Li, M., Yuan, X.: Adaptive primal-dual splitting methods for statistical learning and image processing. Adv. Neural. Inf. Process. Syst. 28, 2089–2097 (2015)
- 15. Goldstein, T., Li, M., Yuan, X., Esser, E., Baraniuk, R.: Adaptive primal-dual hybrid gradient methods for saddle-point problems. arXiv preprint arXiv:1305.0546 (2013)
- 16. Gutiérrez, E.B., Delplancke, C., Ehrhardt, M.J.: On the convergence and sampling of randomized primal-dual algorithms and their application to parallel MRI reconstruction. arXiv preprint arXiv:2207.12291 (2022)
- 17. He, B., Yuan, X.: Convergence Analysis of Primal-dual Algorithms for Total Variation Image Restoration. Technical report, Citeseer (2010)
- 18. Malitsky, Y.: Golden ratio algorithms for variational inequalities. Math. Program. 184(1), 383–410 (2020)
- 19. Malitsky, Y., Mishchenko, K.: Adaptive gradient descent without descent. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, vol. 119 of Proceedings of Machine Learning Research, pp. 6702–6712 (2020)
- 20. Malitsky, Y., Pock, T.: A first-order primal-dual algorithm with linesearch. SIAM J. Optim. 28(1), 411–432 (2018)
- 21. McCollough, C.: TU-FG-207A-04: overview of the low dose CT grand challenge. Med. Phys. 43(6Part35), 3759–3760 (2016)
- 22. Papoutsellis, E., Ametova, E., Delplancke, C., Fardell, G., Jørgensen, J.S., Pasca, E., Turner, M., Warr, R., Lionheart, W.R.B., Withers, P.J.: Core imaging library-part II: multichannel reconstruction for dynamic and spectral tomography. Philos. Trans. R. Soc. A 379(2204), 20200193 (2021)
- 23. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Elsevier (1971)
- 24. Schramm, G., Holler, M.: Fast and memory-efficient reconstruction of sparse Poisson data in listmode with non-smooth priors with application to time-of-flight PET. Phys. Med. Biol. (2022)
- 25. Vladarean, M.-L., Malitsky, Y., Cevher, V.: A first-order primal-dual method with adaptivity to local smoothness. Adv. Neural. Inf. Process. Syst. 34, 6171–6182 (2021)
- 26. Yokota, T., Hontani, H.: An efficient method for adapting step-size parameters of primal-dual hybrid gradient method in application to total variation regularization. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 973–979. IEEE (2017)
- 27. Zdun, L., Brandt, C.: Fast MPI reconstruction with non-smooth priors by stochastic optimization and data-driven splitting. Phys. Med. Biol. 66(17), 175004 (2021)