Forward-reflected-backward method with variance reduction

Ahmet Alacaoglu; Yura Malitsky; Volkan Cevher

doi:10.1007/s10589-021-00305-3

. 2021 Aug 19;80(2):321–346. doi: 10.1007/s10589-021-00305-3

Forward-reflected-backward method with variance reduction

Ahmet Alacaoglu ^1,^✉, Yura Malitsky ², Volkan Cevher ¹

PMCID: PMC8550342 PMID: 34720428

Abstract

We propose a variance reduced algorithm for solving monotone variational inequalities. Without assuming strong monotonicity, cocoercivity, or boundedness of the domain, we prove almost sure convergence of the iterates generated by the algorithm to a solution. In the monotone case, the ergodic average converges with the optimal O(1/k) rate of convergence. When strong monotonicity is assumed, the algorithm converges linearly, without requiring the knowledge of strong monotonicity constant. We finalize with extensions and applications of our results to monotone inclusions, a class of non-monotone variational inequalities and Bregman projections.

Keywords: Variational inequalities, Stochastic variance reduction, Finite-sum structure, Saddle point problems, Monotone inclusions

Introduction

We are interested in solving variational inequalities (VI)

\begin{matrix} Find z^{⋆} \in Z : ⟨ F (z^{⋆}), z - z^{⋆} ⟩ + g (z) - g (z^{⋆}) \geq 0, \forall z \in Z, \end{matrix}

where g is a proper lower semicontinuous convex function and F is a monotone operator also given as the finite sum $F = \frac{1}{n} \sum_{i = 1}^{n} F_{i}$ .

A special case of monotone VIs is the structural saddle point problem

\begin{matrix} min_{x} max_{y} Ψ (x, y) + f (x) - h (y), \end{matrix}

where f, h are proper lower semicontinuous convex functions and $Ψ$ is a smooth convex-concave function. Indeed, problem (2) can be formulated as (1) by setting

\begin{matrix} z = (x, y), F (z) = [\begin{matrix} \nabla_{x} Ψ (x, y) \\ - \nabla_{y} Ψ (x, y) \end{matrix}], g (z) = f (x) + h (y), \end{matrix}

and $F (z) = \frac{1}{n} \sum_{i = 1}^{n} F_{i} (z)$ (see [2, Section 2], [5, 7] for examples).

Another related problem is the monotone inclusion where the aim is to

\begin{matrix} find z^{⋆} \in Z such that 0 \in (A + F) (x), \end{matrix}

where $A : Z ⇉ Z$ and $F : Z \to Z$ are maximally monotone operators and F is Lipschitz continuous with finite sum form. Monotone inclusions generalize (1) and our results also extend to this setting as will be shown in Sect. 4.1. Due to convenient abstraction, it is the problem (1) that will be our main concern.

The case when $Ψ$ in (2) is convex-concave and, in particular when it is bilinear, has found numerous applications in machine learning, image processing and operations research, resulting in efficient methods being developed in the respective areas [6, 14, 15, 33]. As VI methods solve the formulation (1), they seamlessly apply to solve instances of (2) with nonbilinear $Ψ$ .

In addition to the potentially complex structure of $Ψ$ , the size of the data in modern learning tasks lead to development of stochastic variants of VI methods [4, 17, 28]. An important technique on this front is stochastic variance reduction [18] which exploits the finite sum structures in problems to match the convergence rates of deterministic algorithms.

In the specific case of convex minimization, variance reduction has been transformative over the last decade [13, 16, 18, 21]. As a result, there has been several works on developing variance reduced versions of the standard VI methods, including forward-backward [2], extragradient [7, 20], and mirror-prox [5, 27]. Despite recent remarkable advances in this field, these methods rely on strong assumptions such as strong monotonicity [2, 7] or boundedness of the domain [5] and have complicated structures for handling the cases with non-bilinear $Ψ$ [5].

Contributions In this work, we introduce a variance reduced method with a simple single loop structure, for monotone VIs. We prove its almost sure convergence under mere monotonicity; without any of the aforementioned assumptions. The new method achieves the O(1/k) convergence rate in the general monotone case and linear rate of convergence when strong monotonicity is assumed, without using strong monotonicity constant as a parameter. We also consider natural extensions of our algorithm to monotone inclusions, a class of non-monotone problems, and monotone VIs with general Bregman distances.

Related works

Most of the research in variance reduction has focused on convex minimization [13, 16, 18, 21], leading to efficient methods in both theory and practice. On the other hand, variance reduction for solving VIs is started to be investigated recently. One common technique for reducing the variance in stochastic VIs, is to use increasing mini-batch sizes, which leads to high per iteration costs and slower convergence rates in practice [4, 9, 17].

A different approach used in [25] was to use the same sample in both steps of stochastic extragradient method [19] to reduce the variance, which results in a slower $O (1 / \sqrt{k})$ rate. The results of [25] for bilinear problems on the other hand are limited to the case when the matrix is full rank. The most related to our work, in the sense how variance reduction is used, are [2, 5, 7] (see Table 1).

Table 1.

$^{*}$ We say that the algorithm is $μ$ -adaptive if it does not require strong monotonicity constant as a parameter to obtain linear convergence. [7]) obtains $μ$ -adaptivity if cocoercivity constant of the operator is of the same order as the Lipschitz constant and not in general (see [7, Table 1]). $^{†}$ Our complexity matches the rate of deterministic methods [23, 27], however due to worse dependence on n compared to [5], it does not improve deterministic method in bilinear cases

	Assumptions for convergence	$μ$ -adaptivity $^{*}$	Complexity with monotonicity
[2]	Strong monotonicity	✗	N/A
[7]	Strong monotonicity	✗	N/A
[5]	Monotonicity, bounded domains	✗	$O (\sqrt{n} L / ϵ)$
This work	Monotonicity	✓	$O {(n L / ϵ)}^{†}$

Open in a new tab

For the specific case of strongly monotone operators, [2] proposed algorithms based on SVRG and SAGA, with linear convergence rates. Two major questions for future work are posed in [2]: (i) obtaining convergence without strong monotonicity assumption and (ii) proving linear convergence without using strong monotonicity constant in the algorithm as a parameter.

The work by [7] proposed an algorithm based on extragradient method [20] and under strong monotonicity assumption, proved linear convergence of the method. The step size in this work depends on cocoercivity constant, which might depend on strong monotonicity constant as discussed in [7, Table 1]. Thus, the result of [7] gave a partial answer to the second question of [2] while leaving the first one unanswered.

An elegant recent work of [5] focused on matrix games and proposed a method based on the mirror prox [27]. The extension of the method of [5] for general min-max problems is also considered there. Unfortunately, this extension not only features a three loop structure, but also uses the bounded domain assumption actively and requires domain diameter as a parameter in the algorithm [5, Corollary 2]. This result has been an important step towards an answer for the first question of [2].

As highlighted in Table 1, our complexity bounds have a worse dependence on n compared to [5], and do not improve the complexity of deterministic VI methods for bilinear games, which was the case in [5]. On the other hand, to our knowledge, our result is the first to show the existence of a variance reduced method that converges under the same set of assumptions as the deterministic methods and also matches the complexity of these deterministic methods. Moreover, our result is also the first variance reduced method to solve monotone inclusions in finite sum form, without strong monotonicity, increasing mini-batch sizes or decreasing step sizes [2].

Finally, our work answers an open problem posed in [23] regarding a stochastic extensions of the forward-reflected-backward method. Our result improves the preliminary result in [23, Section 6], which still requires evaluating the full operator every iteration.

Preliminaries and notation

We work in Euclidean space $Z = R^{d}$ with scalar product $⟨ \cdot, \cdot ⟩$ and induced norm $‖ \cdot ‖$ . Domain of a function $g : Z \to R \cup {+ \infty}$ is defined as $dom g = {z \in Z : g (z) < + \infty}$ . Proximal operator of g is defined as

\begin{matrix} {prox}_{g} (u) = {argmin}_{z \in Z} {g (z) + \frac{1}{2} {‖ z - u ‖}^{2}} . \end{matrix}

We call an operator $F : K \to Z$ , where $K \subseteq Z$ ,

L-Lipschitz, for $L > 0$ , if $‖ F (u) - F (v) ‖ \leq L ‖ u - v ‖, \forall u, v \in K$ .
monotone, if $⟨ F (u) - F (v), u - v ⟩ \geq 0, \forall u, v \in K$ .
$ν$ -cocoercive, for $ν > 0$ , if $⟨ F (u) - F (v), u - v ⟩ \geq {ν ‖ F (u) - F (v) ‖}^{2}, \forall u, v \in K$ .
$μ$ -strongly monotone, for $μ > 0$ , if $⟨ F (u) - F (v), u - v ⟩ \geq {μ ‖ u - v ‖}^{2}, \forall u, v \in K$ .

For example, in the context of (2) and (1), F is (strongly) monotone when $Ψ$ is (strongly) convex- (strongly) concave. However, it is worth noting that both cocoercivity and strong monotonicity fail even for the simple bilinear case when $Ψ (x, y) = ⟨ A x, y ⟩$ in (2).

Given iterates ${z_{k}}_{k \geq 1}$ , ${w_{k}}_{k \geq 1}$ and the filtration $F_{k} = σ {z_{1}, \dots, z_{k}, w_{1}, \dots, w_{k - 1}}$ , we define $E_{k} [\cdot] = E [\cdot | F_{k}]$ as the conditional expectations with respect to $F_{k}$ .

Finally, we state our common assumptions for (1).

Assumption 1

$g : Z \to R \cup {+ \infty}$ is proper lower semicontinuous convex.
$F : dom g \to Z$ is monotone.
$F = \frac{1}{n} \sum_{i = 1}^{n} F_{i}$ , with L-Lipschitz $F_{i} : dom g \to Z$ , $\forall i$ .
The solution set of (1), denoted by $Z^{⋆}$ , is nonempty.

Algorithm

Our algorithm is a careful mixture of a recent deterministic algorithm for VIs, proposed by [23], with a special technique of using variance reduction in finite sum minimization given in [16] and [21].

It is clear that for $n = 1$ any stochastic variance reduced algorithm for VI reduces to some deterministic one. As a consequence, this immediately rules out the most obvious choice — the well-known forward-backward method (FB)

\begin{matrix} z_{k + 1} = {prox}_{τ g} (z_{k} - τ F (z_{k})), \end{matrix}

since its convergence requires either strong monotonicity or cocoercivity of F. The classical algorithms that work under mere monotonicity [20, 30, 34] have a more complicated structure, and thus, it is not clear how to meld them with a variance reduction technique for finite sum problems. Instead, we chose the recent forward-reflected-backward method (FoRB) [23]

\begin{matrix} z_{k + 1} = {prox}_{τ g} (z_{k} - τ (2 F (z_{k}) - F (z_{k - 1}))), \end{matrix}

which converges under Assumption 1 with $n = 1$ .

When $g = 0$ , this method takes its origin in the Popov’s algorithm [30]. In this specific case, FoRB is also equivalent to optimistic gradient ascent algorithm [12, 31] which became increasingly popular in machine learning literature recently [11, 12, 24, 26].

Among many variance reduced methods for solving finite sum problems ${min}_{z} f (z) : = \frac{1}{n} \sum_{i = 1}^{n} f_{i} (z)$ one of the simplest is the Loopless-SVRG method [21] (see also [16]),

\begin{matrix} z_{k + 1} = z_{k} - τ \nabla f (w_{k}) - τ (\nabla f_{i_{k}} (z_{k}) - \nabla f_{i_{k}} (w_{k})) \\ w_{k + 1} = \{\begin{matrix} z_{k}, with probability p, \\ w_{k}, with probability 1 - p, \end{matrix}) \end{matrix}

which can be seen as a randomized version of the gradient and hence forward-backward methods. The latter is the exact reason why we cannot extend this method directly to the variational inequality setting, without cocoercivity or strong monotonicity.

An accurate blending of [23] and [21], described above, results in Algorithm 1. Compared to Loopless-SVRG, the last evaluation of the operator at step 4 of Algorithm 1 is done at $w_{k - 1}$ , instead of $w_{k}$ . In the deterministic case when $n = 1$ or $p = 1$ , this modification reduces the method to FoRB (4) and not FB (3). The other change is that we use the most recent iterate $z_{k + 1}$ in the update of $w_{k + 1}$ , instead of $z_{k}$ in the Loopless-SVRG. Surprisingly, these two small distinctions result in the method which converges for general VIs without the restrictive assumptions of the previous works.

We note that we use uniform sampling for choosing $i_{k}$ in Algorithm 1 for simplicity. Our arguments directly extend to arbitrary sampling as in [2, 5] which is used for obtaining tighter Lipschitz constants.

Convergence analysis

We start with a key lemma that appeared in [23] for analyzing a general class of VI methods. The proof of this lemma is given in the appendix for completeness. The only change from [23] is that we consider the proximal operator, instead of a more general resolvent.

Lemma 3.1

[23, Proposition 2.3] Let $g : Z \to R \cup {+ \infty}$ be proper lower semicontinuous convex and let $x_{1}$ , $U_{0}, U_{1}, V_{1} \in Z$ be arbitrary points. Define $x_{2}$ as

\begin{matrix} x_{2} = {prox}_{g} (x_{1} - U_{1} - (V_{1} - U_{0})) . \end{matrix}

Then for all $x \in Z$ and $V_{2} \in Z$ , it holds

\begin{matrix} ‖ x_{2} {- x ‖}^{2} + 2 ⟨ V_{2} - U_{1}, x - x_{2} ⟩ + 2 ⟨ V_{2}, x_{2} - x ⟩ + 2 g (x_{2}) - 2 g (x) \\ \leq ‖ x_{1} {- x ‖}^{2} + 2 ⟨ V_{1} - U_{0}, x - x_{1} ⟩ + 2 ⟨ V_{1} - U_{0}, x_{1} - x_{2} ⟩ - {‖ x_{1} - x_{2} ‖}^{2} . \end{matrix}

The benefit of Lemma 3.1 is that it gives a candidate for a Lyapunov function that can be used to prove convergence. We will need a slight modification in this function due to randomization in Algorithm 1.

Convergence of the iterates

We start by proving the almost sure convergence of the iterates. Such a result states that the trajectories of the iterates generated by our algorithm converge to a point in the solution set. This type of result is the analogue of sequential convergence results for deterministic methods [23].

For the iterates ${z_{k}}$ , ${w_{k}}$ of Algorithm 1 and any $z \in dom g$ , $β > 0$ we define

\begin{matrix} Φ_{k + 1} (z) : = {‖ z_{k + 1} - z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ \\ + \frac{β}{2} ‖ z_{k} - w_{k} ‖^{2} + \frac{1}{2} {‖ z_{k + 1} - z_{k} ‖}^{2} \\ Θ_{k + 1} (z) : = ⟨ F (z_{k + 1}), z_{k + 1} - z ⟩ + g (z_{k + 1}) - g (z) . \end{matrix}

The first expression plays the role of a Lyapunov function and the second is essential for the rate.

Lemma 3.2

Let Assumption 1 hold, $τ < \frac{1 - \sqrt{1 - p}}{2 L}$ , $β = \frac{1}{\sqrt{1 - p}} - 1$ , and the iterates ${z_{k}}$ are generated by Algorithm 1. Then for any $z \in dom g$ ,

\begin{matrix} E_{k} [Φ_{k + 1} (z) + 2 τ Θ_{k + 1} (z)] \leq Φ_{k} (z) . \end{matrix}

This lemma is essential in establishing the convergence of iterates and sublinear convergence rates that we will derive in the next section. We now continue with the proof.

Proof

We set in Lemma 3.1 $U_{0} = τ F_{i} (w_{k - 1})$ , $U_{1} = τ F (w_{k})$ , $V_{1} = τ F_{i} (z_{k})$ , $V_{2} = τ F (z_{k + 1})$ , and $x_{1} = z_{k}$ , with $i_{k} = i$ . Then by (5) and step 4 of Algorithm 1, $x_{2} = z_{k + 1}$ , thus, by (6)

\begin{matrix} ‖ z_{k + 1} {- z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + 2 τ (⟨ F (z_{k + 1}), z_{k + 1} - z ⟩ \\ + g (z_{k + 1}) - g (z)) \leq {‖ z_{k} - z ‖}^{2} + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z - z_{k} ⟩ \\ + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ - {‖ z_{k + 1} - z_{k} ‖}^{2} . \end{matrix}

First, note that by Lipschitzness of $F_{i}$ , Cauchy-Schwarz and Young’s inequalities,

\begin{matrix} 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ & \leq 2 τ^{2} L^{2} ‖ z_{k} - w_{k - 1} ‖^{2} + \frac{1}{2} {‖ z_{k} - z_{k + 1} ‖}^{2} . \end{matrix}

Thus, it follows that

\begin{matrix} ‖ z_{k + 1} {- z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + \frac{1}{2} {‖ z_{k + 1} - z_{k} ‖}^{2} + 2 τ Θ_{k + 1} (z) \\ \leq ‖ z_{k} {- z ‖}^{2} + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z - z_{k} ⟩ + 2 τ^{2} L^{2} {‖ z_{k} - w_{k - 1} ‖}^{2} . \end{matrix}

Taking expectation conditioning on the knowledge of $z_{k}, w_{k - 1}$ and using that $E_{k} F_{i} (z_{k}) = F (z_{k})$ , $E_{k} F_{i} (w_{k - 1}) = F (w_{k - 1})$ , we obtain

\begin{matrix} E_{k} ‖ z_{k + 1} {- z ‖}^{2} + 2 τ E_{k} ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + \frac{1}{2} E_{k} {‖ z_{k + 1} - z_{k} ‖}^{2} \\ + 2 τ E_{k} Θ_{k + 1} (z) \leq {‖ z_{k} - z ‖}^{2} + 2 τ ⟨ F (z_{k}) - F (w_{k - 1}), z - z_{k} ⟩ \\ + 2 τ^{2} L^{2} {‖ z_{k} - w_{k - 1} ‖}^{2} . \end{matrix}

Adding

\begin{matrix} \frac{β}{2} E_{k} ‖ z_{k} - w_{k} ‖^{2} = \frac{β (1 - p)}{2} {‖ z_{k} - w_{k - 1} ‖}^{2}, \end{matrix}

which follows from the definition of $w_{k}$ , to (11), we obtain

\begin{matrix} E_{k} [Φ_{k + 1} (z) + 2 τ Θ_{k + 1} (z)] \leq Φ_{k} (z) \\ + (2 τ^{2} L^{2} + \frac{β (1 - p)}{2}) ‖ z_{k} - w_{k - 1} ‖^{2} - \frac{1}{2} {‖ z_{k} - z_{k - 1} ‖}^{2} \\ - \frac{β}{2} {‖ z_{k - 1} - w_{k - 1} ‖}^{2} . \end{matrix}

The proof will be complete, if we can show that the expression in the second and third lines are nonpositive. Due to our choice of $β$ and $τ$ this is a matter of a simple algebra. As $β + 1 = \frac{1}{\sqrt{1 - p}}$ , $\frac{β}{1 + β} = 1 - \sqrt{1 - p}$ , and $2 τ L < 1 - \sqrt{1 - p} = \frac{β}{1 + β}$ , we have

\begin{matrix} 2 τ^{2} L^{2} + \frac{β (1 - p)}{2} \leq \frac{1}{2} (\frac{β^{2}}{{(1 + β)}^{2}} + \frac{β}{{(1 + β)}^{2}}) = \frac{β}{2 (1 + β)} . \end{matrix}

Then we must show that

\begin{matrix} \frac{β}{1 + β} ‖ z_{k} - w_{k - 1} ‖^{2} \leq ‖ z_{k} - z_{k - 1} ‖^{2} + β {‖ z_{k - 1} - w_{k - 1} ‖}^{2}, \end{matrix}

which is a direct consequence of ${‖ u + v ‖}^{2} \leq (1 + \frac{1}{β}) {‖ u ‖}^{2} + (1 + β) {‖ v ‖}^{2}$ . The proof is complete. $□$

Theorem 3.1

Let Assumption 1 hold and let $τ < \frac{1 - \sqrt{1 - p}}{2 L}$ . Then for the iterates ${z_{k}}$ of Algorithm 1, almost surely there exists $z^{⋆} \in Z^{⋆}$ such that $z_{k} \to z^{⋆}$ .

Remark 3.1

It is interesting to observe that for $p = 1$ , i.e., when the algorithm becomes deterministic, the bound for the stepsize is $τ < \frac{1}{2 L}$ , which coincides with the one in [23] and is known to be tight. In this case analysis will be still valid if for convenience we assume that $\infty \cdot 0 = 0$ .

For small p we might use a simpler bound for the stepsize, as the following corollary suggests.

Corollary 3.1

Suppose that $p = \frac{1}{n}$ and $τ \leq \frac{p}{4 L} = \frac{1}{4 L n}$ . Then the statement of Theorem 3.1 holds.

Proof

We only have to check that $\frac{p}{2} \leq 1 - \sqrt{1 - p}$ , which follows from $\sqrt{1 - p} \leq 1 - \frac{p}{2}$ . $□$

Proof of Theorem 3.1

From Lemma 3.2 we have for any $z \in dom g$

\begin{matrix} E_{k} [Φ_{k + 1} (z) + 2 τ Θ_{k + 1} (z)] \leq Φ_{k} (z) . \end{matrix}

First, we show that $Φ_{k + 1} (z)$ is nonnegative for all $z \in dom g$ . This is straightforward but tedious. Recall that $1 - \sqrt{1 - p} = \frac{β}{1 + β}$ and hence $2 τ L \leq \frac{β}{1 + β}$ . Then by Cauchy-Schwarz and Young’s inequalities,

\begin{matrix} - 2 τ ⟨ & F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ \leq 2 τ L ‖ z_{k + 1} - w_{k} ‖ ‖ z_{k + 1} - z ‖ \\ \leq \frac{β}{2 (1 + β)} (‖ z_{k + 1} - w_{k} ‖^{2} + {‖ z_{k + 1} - z ‖}^{2}) \\ \leq \frac{β}{2 (1 + β)} ‖ z_{k + 1} {- z ‖}^{2} + \frac{β}{2 (1 + β)} ((1 + \frac{1}{β}) {‖ z_{k + 1} - z_{k} ‖}^{2} \\ + (1 + β) ‖ z_{k} - w_{k} ‖^{2}) \\ = \frac{β}{2 (1 + β)} ‖ z_{k + 1} {- z ‖}^{2} + \frac{1}{2} ‖ z_{k + 1} - z_{k} ‖^{2} + \frac{β}{2} {‖ z_{k} - w_{k} ‖}^{2} . \end{matrix}

Therefore, we deduce

\begin{matrix} Φ_{k + 1} (z) \geq ‖ z_{k + 1} {- z ‖}^{2} - \frac{β}{2 (1 + β)} ‖ z_{k + 1} {- z ‖}^{2} \geq \frac{1}{2} {‖ z_{k + 1} - z ‖}^{2} . \end{matrix}

Now let $z = \bar{z} \in Z^{⋆}$ . Then by monotonicity of F and (1),

\begin{matrix} Θ_{k + 1} (\bar{z}) = ⟨ F (z_{k + 1}), z_{k + 1} - \bar{z} ⟩ + g (z_{k + 1}) - g (\bar{z}) \\ \geq ⟨ F (\bar{z}), z_{k + 1} - \bar{z} ⟩ + g (z_{k + 1}) - g (\bar{z}) \geq 0 . \end{matrix}

Summing up, we have that $Θ_{k + 1} (\bar{z}) \geq 0$ , $Φ_{k} (\bar{z}) \geq 0$ and $E_{k} Φ_{k + 1} (\bar{z}) \leq Φ_{k} (\bar{z})$ . Unfortunately, this is still not sufficient for us, so we are going to strengthen this inequality by reexamining the proof of Lemma 3.2. In estimating the second line of inequality (13) we used that $2 τ L \leq 1 - \sqrt{1 - p}$ , however, both in the statements of Lemma 3.2 and Theorem 3.1 we assumed a strict inequality. Let

\begin{matrix} δ = \frac{β}{1 + β} - \frac{4 τ^{2} L^{2} (1 + β)}{β} \\ ⟺ 4 τ^{2} L^{2} = \frac{β^{2}}{{(1 + β)}^{2}} - \frac{δ β}{1 + β} . \end{matrix}

From $2 τ L < 1 - \sqrt{1 - p} = \frac{β}{1 + β}$ it follows that $δ > 0$ . Now, inequality (14) can be improved to equality as

\begin{matrix} 2 τ^{2} L^{2} + \frac{β (1 - p)}{2} = \frac{1}{2} (\frac{β^{2}}{{(1 + β)}^{2}} - \frac{δ β}{(1 + β)} + \frac{β}{{(1 + β)}^{2}}) = \frac{β (1 - δ)}{2 (1 + β)} . \end{matrix}

This change results in a slightly stronger version of (7)

\begin{matrix} E_{k} [Φ_{k + 1} (\bar{z}) + 2 τ Θ_{k + 1} (\bar{z})] \\ \leq Φ_{k} (\bar{z}) - \frac{δ}{2} (‖ z_{k} - z_{k - 1} ‖^{2} + β {‖ z_{k - 1} - w_{k - 1} ‖}^{2}) . \end{matrix}

As $Φ_{k + 1} (\bar{z}) \geq 0$ and $Θ_{k + 1} (\bar{z}) \geq 0$ , we can apply Robbins-Siegmund lemma [32] to conclude that ${Φ_{k + 1} (\bar{z})}$ converges almost surely and that

\begin{matrix} \sum_{k = 1}^{\infty} E [‖ z_{k} - z_{k - 1} ‖^{2} + {‖ z_{k - 1} - w_{k - 1} ‖}^{2}] < \infty . \end{matrix}

It then follows that almost surely, $‖ z_{k} - z_{k - 1} ‖^{2} \to 0$ and $‖ z_{k - 1} - w_{k - 1} ‖^{2} \to 0$ . Moreover, due to (16), ${z_{k}}$ is almost surely bounded and therefore by the definition of $Φ_{k}$ , continuity of F, and (21), we have that $‖ z_{k} {- \bar{z} ‖}^{2}$ converges almost surely.

More specifically, this means that for every $\bar{z} \in Z^{⋆}$ , there exists $Ω_{\bar{z}}$ with $P (Ω_{\bar{z}}) = 1$ such that $\forall ω \in Ω_{\bar{z}}$ , $‖ z_{k} {(ω) - \bar{z} ‖}^{2}$ converges. We can strengthen this result by using the arguments from [3, Proposition 9], [8, Proposition 2.3] to obtain that there exists $Ω$ with $P (Ω) = 1$ such that for every $\bar{z} \in Z^{⋆}$ and for every $ω \in Ω$ , $‖ z_{k} {(ω) - \bar{z} ‖}^{2}$ converges.

We now pick a realization $ω \in Ω$ and note that $z_{k} (ω) - z_{k - 1} (ω) \to 0$ and $z_{k - 1} (ω) - w_{k - 1} (ω) \to 0$ . Let us denote by $\tilde{z}$ a cluster point of the bounded sequence $z_{k} (ω)$ . By using the definition of $z_{k}$ and convexity of g, as in the proof of Lemma 3.1, we have for any $z \in Z$

\begin{matrix} g (z) \geq g (z_{k} (ω)) & + \frac{1}{τ} ⟨ z_{k - 1} (ω) - z_{k} (ω), z - z_{k} (ω) ⟩ - ⟨ F (w_{k - 1} (ω)), z - z_{k} (ω) ⟩ \\ - ⟨ F_{i_{k - 1}} (z_{k - 1} (ω)) - F_{i_{k - 1}} (w_{k - 2} (ω)), z - z_{k} (ω) ⟩ . \end{matrix}

Taking the limit as $k \to \infty$ and using that g is lower semicontinuous and $\forall i$ , $F_{i}$ is Lipschitz, $z_{k} (ω) - z_{k - 1} (ω) \to 0$ and $z_{k - 1} (ω) - w_{k - 1} (ω) \to 0$ , we get that $\tilde{z} \in Z^{⋆}$ . Then, as we have that $‖ z_{k} (ω) - \tilde{z} ‖^{2}$ converges and we have shown that $‖ z_{k} (ω) - \tilde{z} ‖^{2}$ converges to 0 at least on one subsequence, we conclude that the sequence $(z_{k} (ω))$ converges to some point $\tilde{z}$ , where $\tilde{z} \in Z^{⋆}$ . $□$

Convergence rate for the general case

In this section, we prove that the average of the iterates of the algorithm exhibits O(1/k) convergence rate which is optimal for solving monotone VIs [27]. The standard quantity to show sublinear rates for VIs is gap function which is defined as

\begin{matrix} G (\bar{z}) = sup_{z \in Z} ⟨ F (z), \bar{z} - z ⟩ + g (\bar{z}) - g (z) . \end{matrix}

As this quantity requires taking a supremum over the whole space $Z$ which is potentially unbounded, restricted versions of gap functions are used, for example in [22, 29]

\begin{matrix} G_{C} (\bar{z}) = sup_{z \in C} ⟨ F (z), \bar{z} - z ⟩ + g (\bar{z}) - g (z), \end{matrix}

where $C \subset dom g$ is an arbitrary bounded set. It is known that $G_{C} (\bar{z})$ is a valid merit function, as proven by [29, Lemma 1]. As we are concerned with randomized algorithms, we derive the rate of convergence for the expected gap function $E [G_{C}, (z_{k})]$ .

Theorem 3.2

Given ${z_{k}}$ generated by Algorithm 1, we define the averaged iterate $z_{K}^{av} = \frac{1}{K} \sum_{k = 1}^{K} z_{k}$ . Let $C \subset dom g$ be an arbitrary bounded set. Then under the hypotheses of Theorem 3.1 it holds that

\begin{matrix} E [G_{C}, (z_{K}^{av})] \leq \frac{1}{K} [\frac{1}{τ} sup_{z \in C} {‖ z_{0} - z ‖}^{2} + \frac{2 τ L^{2} (1 + β)}{δ β} dist {(z_{0}, Z^{⋆})}^{2}], \end{matrix}

where $δ = \frac{β}{1 + β} - \frac{4 τ^{2} L^{2} (1 + β)}{β}$ .

Remark 3.2

If we set $p = \frac{1}{n}$ , $τ = \frac{p}{3 \sqrt{2} L}$ , and $β = \frac{1}{\sqrt{1 - p}} - 1$ , the rate will be bounded by $\frac{nL}{K} (3 \sqrt{2} {sup}_{z \in C} {‖ z_{0} - z ‖}^{2} + 12 \sqrt{2} dist {(z_{0}, Z^{⋆})}^{2})$ , hence it is $O (\frac{nL}{K})$ .

The high level idea of the proof is that on top of Lemma 3.2 we sum the resulting inequality and accumulate terms $Θ_{k} (z)$ . Then we use Jensen’s inequality to obtain the result.

There are two intricate points that need attention in these kind of results. First, the convergence measure is the expected duality gap $E [G_{C} (z_{K}^{av})]$ that includes the expectation of the supremum. In a standard analysis, it is easy to obtain a bound for the supremum of expectation, however obtaining the former requires a technique, which is common in the literature for saddle point problems [1, 28]. Roughly, the idea is to use an auxiliary iterate to characterize the difference two quantities, and show that the error term does not degrade the rate.

Second, as duality gap requires taking a supremum over the domain, the rate might contain a diameter term as in [5]. The standard way to adjust this result for unbounded domains is to utilize a restricted merit function as in (22) on which the rate is obtained [29]. We note that the result in [5] not only involves the domain diameter in the final bound, but it also requires the domain diameter as a parameter for the algorithm in the general monotone case [5, Corollary 2].

Proof of Theorem 3.2

First, we collect some useful bounds. Consider (20) with a specific choice $\bar{z} = P_{Z^{⋆}} (z_{0})$ . Taking a full expectation and then summing that inequality, we get

\begin{matrix} \frac{δ}{2} \sum_{k = 0}^{\infty} E [‖ z_{k} - z_{k - 1} ‖^{2} + β {‖ z_{k - 1} - w_{k - 1} ‖}^{2}] \\ \leq ‖ z_{0} - P_{Z^{⋆}} (z_{0}) ‖^{2} = dist {(z_{0}, Z^{*})}^{2}, \end{matrix}

which also implies by Young’s inequality that

\begin{matrix} \frac{β δ}{2 (1 + β)} \sum_{k = 0}^{\infty} E {‖ z_{k} - w_{k - 1} ‖}^{2} \leq dist {(z_{0}, Z^{*})}^{2} . \end{matrix}

Next, we rewrite (10) as

\begin{matrix} 2 τ Θ_{k + 1} (z) + ‖ z_{k + 1} {- z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + \frac{1}{2} {‖ z_{k + 1} - z_{k} ‖}^{2} \\ \leq ‖ z_{k} {- z ‖}^{2} + 2 τ ⟨ F (z_{k}) - F (w_{k - 1}), z - z_{k} ⟩ + 2 τ^{2} L^{2} {‖ z_{k} - w_{k - 1} ‖}^{2} \\ + 2 τ ⟨ F_{i_{k}} (z_{k}) - F_{i_{k}} (w_{k - 1}) - (F (z_{k}) - F (w_{k - 1})), z - z_{k} ⟩ . \end{matrix}

Let $ν_{k} = τ (F_{i_{k}} (z_{k}) - F_{i_{k}} (w_{k - 1}) - (F (z_{k}) - F (w_{k - 1})))$ , then $E_{k} [ν_{k}] = 0$ . We define the process ${{\hat{z}}_{k}}$ by ${\hat{z}}_{0} = z_{0}$ and

\begin{matrix} {\hat{z}}_{k + 1} = {\hat{z}}_{k} + ν_{k} . \end{matrix}

Note that for $F_{k} = σ {z_{1}, \dots, z_{k}, w_{1}, \dots, w_{k - 1}}$ , ${\hat{z}}_{k}$ is $F_{k}$ -measurable. It also follows that $\forall z \in Z$

\begin{matrix} ‖ {\hat{z}}_{k + 1} {- z ‖}^{2} = ‖ {\hat{z}}_{k} {- z ‖}^{2} + 2 ⟨ ν_{k}, {\hat{z}}_{k} - z ⟩ + {‖ ν_{k} ‖}^{2}, \end{matrix}

which after summation over $k = 0, \dots, K - 1$ yields

\begin{matrix} \sum_{k = 0}^{K - 1} 2 ⟨ ν_{k}, z - {\hat{z}}_{k} ⟩ \leq ‖ z_{0} {- z ‖}^{2} + \sum_{k = 0}^{K - 1} {‖ ν_{k} ‖}^{2} . \end{matrix}

With the definition of $ν_{k}$ we can rewrite (25) as

\begin{matrix} 2 τ Θ_{k + 1} (z) + ‖ z_{k + 1} {- z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + \frac{1}{2} {‖ z_{k + 1} - z_{k} ‖}^{2} \\ \leq ‖ z_{k} {- z ‖}^{2} + 2 τ ⟨ F (z_{k}) - F (w_{k - 1}), z - z_{k} ⟩ + 2 τ^{2} L^{2} {‖ z_{k} - w_{k - 1} ‖}^{2} \\ + 2 ⟨ ν_{k}, z - {\hat{z}}_{k} ⟩ + 2 ⟨ ν_{k}, {\hat{z}}_{k} - z_{k} ⟩ . \end{matrix}

We use (12), the definition of $Φ_{k}$ , and the arguments in Lemma 3.2 to show that the last line of (13) is nonpositive, to obtain

\begin{matrix} 2 τ Θ_{k + 1} (z) + Φ_{k + 1} (z) + \frac{β}{2} (E_{k} ‖ z_{k} - w_{k} ‖^{2} - {‖ z_{k} - w_{k} ‖}^{2}) \\ \leq Φ_{k} (z) + 2 ⟨ ν_{k}, z - {\hat{z}}_{k} ⟩ + 2 ⟨ ν_{k}, {\hat{z}}_{k} - z_{k} ⟩ . \end{matrix}

Summing this inequality over $k = 0, \dots, K - 1$ and using bound (28) yields

\begin{matrix} 2 τ \sum_{k = 0}^{K - 1} Θ_{k + 1} (z) + Φ_{K} (z) + \frac{β}{2} \sum_{k = 0}^{K - 1} (E_{k} ‖ z_{k} - w_{k} ‖^{2} - {‖ z_{k} - w_{k} ‖}^{2}) \\ \leq Φ_{0} (z) + 2 \sum_{k = 0}^{K - 1} ⟨ ν_{k}, z - {\hat{z}}_{k} ⟩ + 2 \sum_{k = 0}^{K - 1} ⟨ ν_{k}, {\hat{z}}_{k} - z_{k} ⟩ \\ \leq Φ_{0} (z) + ‖ z_{0} {- z ‖}^{2} + 2 \sum_{k = 0}^{K - 1} {‖ ν_{k} ‖}^{2} + 2 \sum_{k = 0}^{K - 1} ⟨ ν_{k}, {\hat{z}}_{k} - z_{k} ⟩ \\ = 2 ‖ z_{0} {- z ‖}^{2} + 2 \sum_{k = 0}^{K - 1} {‖ ν_{k} ‖}^{2} + 2 \sum_{k = 0}^{K - 1} ⟨ ν_{k}, {\hat{z}}_{k} - z_{k} ⟩ . \end{matrix}

We now take the supremum of this inequality over $z \in C$ and then take a full expectation. As ${\hat{z}}_{k}$ is $F_{k}$ -measurable, $E [E_{k} [\cdot]] = E [\cdot]$ , and $E_{k} ν_{k} = 0$ , we have $E_{k} [⟨ ν_{k}, {\hat{z}}_{k} - z_{k} ⟩] = 0$ . Using this and that $Φ_{K} (z) \geq 0$ by (16), we arrive at

\begin{matrix} τ E [sup_{z \in C}, \sum_{k = 0}^{K - 1}, Θ_{k + 1}, (z)] \leq sup_{z \in C} ‖ z_{0} {- z ‖}^{2} + \sum_{k = 0}^{K - 1} E {‖ ν_{k} ‖}^{2} . \end{matrix}

It remains to estimate the last term $\sum_{k = 0}^{K - 1} E {‖ ν_{k} ‖}^{2}$ . For this, we use a standard inequality ${E ‖ X - E X ‖}^{2} \leq E {‖ X ‖}^{2}$ and Lipschitzness of $F_{i_{k}}$

\begin{matrix} \sum_{k = 0}^{K - 1} E {‖ ν_{k} ‖}^{2} & = \sum_{k = 0}^{K - 1} E τ^{2} {‖ F_{i_{k}} (z_{k}) - F_{i_{k}} (w_{k - 1}) - (F (z_{k}) - F (w_{k - 1})) ‖}^{2} \\ \leq τ^{2} \sum_{k = 0}^{K - 1} E ‖ F_{i_{k}} (z_{k}) - F_{i_{k}} (w_{k - 1}) ‖^{2} \leq τ^{2} L^{2} \sum_{k = 0}^{K - 1} E {‖ z_{k} - w_{k - 1} ‖}^{2} \\ \overset{(24)}{\leq} \frac{2 τ^{2} L^{2} (1 + β)}{δ β} dist {(z_{0}, Z^{⋆})}^{2} . \end{matrix}

Plugging this bound into (31), we obtain

\begin{matrix} τ E [sup_{z \in C}, \sum_{k = 0}^{K - 1}, Θ_{k + 1}, (z)] \\ \leq sup_{z \in C} {‖ z_{0} - z ‖}^{2} + \frac{2 τ^{2} L^{2} (1 + β)}{δ β} dist {(z_{0}, Z^{⋆})}^{2} . \end{matrix}

Finally, using monotonicity of F, followed by Jensen inequality, we deduce

\begin{matrix} sup_{z \in C} \sum_{k = 0}^{K - 1} Θ_{k + 1} (z) \\ \geq sup_{z \in C} \sum_{k = 1}^{K} (⟨ F (z), z_{k} - z ⟩ + g (z_{k}) - g (z)) \geq K G_{C} (z_{K}^{av}), \end{matrix}

which combined with (33) finishes the proof. $□$

It is worth mentioning that even though our method is simple and the convergence rate is O(1/k) as in [5], our complexity result has a worse dependence on n, compared to [5]. In particular, our complexity is $O (n / ϵ)$ instead of the $O (\sqrt{n} / ϵ)$ of [5]. This is because our step size has the factor of p which is of the order $\frac{1}{n}$ in general and it appears to be tight based on numerical experiments. This seems like the cost of handling a more general problem without bounded domain assumption. We leave it as an open question to derive a method that works under our general assumptions and features favorable complexity guarantees as in [5].

Convergence rate for strongly monotone case

We show that linear convergence is attained when strong monotonicity is assumed.

Theorem 3.3

Let Assumption 1 hold and let F be $μ$ -strongly monotone. Let $z^{⋆}$ be the unique solution of (1). Then for the iterates ${z_{k}}$ generated by Algorithm 1 with $τ = \frac{p}{4 \sqrt{2} L}$ , it holds that

\begin{matrix} E ‖ z_{k} - z^{⋆} ‖^{2} \leq {(1 - \frac{μ p}{8 \sqrt{2} L})}^{k} {‖ z_{0} - z^{⋆} ‖}^{2} . \end{matrix}

Remark 3.3

We analyzed the case when F is strongly monotone, however, the same analysis would go through when F is monotone and g is strongly convex. One can transfer strong convexity of g to make F strongly monotone.

Proof of Theorem 3.3

We start from (8) with $i_{k} = i$ ,

\begin{matrix} ‖ z_{k + 1} {- z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + 2 τ g (z_{k + 1}) - 2 τ g (z) \\ + 2 τ ⟨ F (z_{k + 1}), z_{k + 1} - z ⟩ \leq {‖ z_{k} - z ‖}^{2} + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z - z_{k} ⟩ \\ + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ - {‖ z_{k + 1} - z_{k} ‖}^{2} \end{matrix}

Setting $z = z^{⋆}$ and using strong monotonicity of F,

\begin{matrix} ⟨ F (z_{k + 1}), z_{k + 1} - z^{⋆} ⟩ + g (z_{k + 1}) - g (z^{⋆}) \geq ⟨ F (z^{⋆}), z_{k + 1} - z^{⋆} ⟩ + μ {‖ z_{k + 1} - z^{⋆} ‖}^{2} \\ + g (z_{k + 1}) - g (z^{⋆}) \geq μ {‖ z_{k + 1} - z^{⋆} ‖}^{2} . \end{matrix}

Hence, we have

\begin{matrix} (1 + 2 τ μ) ‖ z_{k + 1} - z^{⋆} ‖^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z^{⋆} - z_{k + 1} ⟩ + {‖ z_{k + 1} - z_{k} ‖}^{2} \\ \leq ‖ z_{k} - z^{⋆} ‖^{2} + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z^{⋆} - z_{k} ⟩ \\ + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ . \end{matrix}

Then, we continue as in the proof of Theorem 3.1 until we obtain a stronger version of (20) due to the strong monotonicity term

\begin{matrix} E_{k} [(1 + 2 μ τ) {‖ z_{k + 1} - z^{⋆} ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z^{⋆} - z_{k + 1} ⟩ \\ + \frac{β}{2} ‖ z_{k} - w_{k} ‖^{2} + \frac{1}{2} ‖ z_{k + 1} - z_{k} ‖^{2}] \leq {‖ z_{k} - z^{⋆} ‖}^{2} + 2 τ ⟨ F (z_{k}) - F (w_{k - 1}), z^{⋆} - z_{k} ⟩ \\ + \frac{β}{2} ‖ z_{k - 1} - w_{k - 1} ‖^{2} + \frac{1}{2} ‖ z_{k} - z_{k - 1} ‖^{2} - \frac{δ}{2} (‖ z_{k} - z_{k - 1} ‖^{2} + β {‖ z_{k - 1} - w_{k - 1} ‖}^{2}) . \end{matrix}

Let $a_{k + 1} = \frac{1}{2} {‖ z_{k + 1} - z^{⋆} ‖}^{2}$ and

\begin{matrix} b_{k + 1} & = \frac{1}{2} ‖ z_{k + 1} - z^{⋆} ‖^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z^{⋆} - z_{k + 1} ⟩ + \frac{1}{2} {‖ z_{k + 1} - z_{k} ‖}^{2} \\ + \frac{β}{2} {‖ z_{k} - w_{k} ‖}^{2} . \end{matrix}

Note that we have $b_{k + 1} + \frac{1}{2} ‖ z_{k + 1} - z^{⋆} ‖^{2} = Φ_{k + 1} (z^{⋆}) \geq \frac{1}{2} {‖ z_{k + 1} - z^{⋆} ‖}^{2}$ by (16), hence $b_{k + 1} \geq 0$ .

Using the definitions of $a_{k}$ and $b_{k}$ in (35), it follows that for any $ε \leq δ$ ,

\begin{matrix} E_{k} [(1 + 4 μ τ) a_{k + 1} + b_{k + 1}] \leq a_{k} + b_{k} - \frac{ε}{2} (‖ z_{k} - z_{k - 1} ‖^{2} + β {‖ z_{k - 1} - w_{k - 1} ‖}^{2}), \end{matrix}

Next, we derive

\begin{matrix} RHS of (36) = a_{k} + b_{k} - \frac{ε}{2} ‖ z_{k} - z_{k - 1} ‖^{2} - \frac{ε}{2} β {‖ z_{k - 1} - w_{k - 1} ‖}^{2} \end{matrix}

\begin{matrix} = (1 + \frac{ε}{2}) a_{k} + (1 - \frac{ε}{2}) b_{k} \\ - \frac{ε}{4} ‖ z_{k} - z_{k - 1} ‖^{2} - \frac{ε β}{4} {‖ z_{k - 1} - w_{k - 1} ‖}^{2} \\ + ε τ ⟨ F (z_{k}) - F (w_{k - 1}), z^{⋆} - z_{k} ⟩ \leq (1 + \frac{3 ε}{2}) a_{k} + (1 - \frac{ε}{2}) b_{k}, \end{matrix}

where the last inequality follows from (15) with a shifted index k. Then, (36) becomes

\begin{matrix} E_{k} [(1 + 4 μ τ) a_{k + 1} + b_{k + 1}] \leq (1 + \frac{3 ε}{2}) a_{k} + (1 - \frac{ε}{2}) b_{k} . \end{matrix}

Since $ε \leq δ$ is arbitrary, we can choose $ε$ such that $1 + 4 μ τ > 1 + \frac{3 ε}{2}$ . For instance, we can set

\begin{matrix} ε = min \{δ, 2 μ τ\}, \end{matrix}

that results in

\begin{matrix} E_{k} [(1 + 4 μ τ) a_{k + 1} + b_{k + 1}] \leq (1 + 3 μ τ) a_{k} + (1 - \frac{ε}{2}) b_{k} \\ = (1 - \frac{μ τ}{1 + 4 μ τ}) (1 + 4 μ τ) a_{k} + (1 - \frac{ε}{2}) b_{k} \\ \leq (1 - min \{\frac{μ τ}{1 + 4 μ τ}, \frac{ε}{2}\}) ((1 + 4 μ τ) a_{k} + b_{k}) . \end{matrix}

Taking a full expectation and using that $\frac{ε}{2} = min {\frac{δ}{2}, μ τ}$ and $b_{0} = 0$ , we obtain

\begin{matrix} E [(1 + 4 μ τ) a_{k + 1} + b_{k + 1}] & \leq (1 - min \{\frac{μ τ}{1 + 4 μ τ}, \frac{δ}{2}\}) E [(1 + 4 μ τ) a_{k} + b_{k}] \\ \leq {(1 - min \{\frac{μ τ}{1 + 4 μ τ}, \frac{δ}{2}\})}^{k + 1} (1 + 4 μ τ) a_{0} . \end{matrix}

Now it only remains to compute the contraction factor. By our choice of $τ$ , we have $τ L = \frac{p}{4 \sqrt{2}} \leq \frac{1 - \sqrt{1 - p}}{2 \sqrt{2}} = \frac{β}{2 \sqrt{2} (1 + β)}$ , and hence,

\begin{matrix} δ = \frac{β}{1 + β} - \frac{4 τ^{2} L^{2} (1 + β)}{β} \geq \frac{β}{2 (1 + β)} \geq \frac{1 - \sqrt{1 - p}}{2} \geq \frac{p}{4} . \end{matrix}

From $μ \leq L$ it follows that $4 μ τ = \frac{μ p}{\sqrt{2} L} \leq \frac{p}{\sqrt{2}} < 1$ and, hence, $\frac{μ τ}{1 + 4 μ τ} \geq \frac{μ τ}{2} = \frac{μ p}{8 \sqrt{2} L}$ . Thus, we obtain

\begin{matrix} min \{\frac{μ τ}{1 + 4 μ τ}, \frac{δ}{2}\} \geq min \{\frac{μ p}{8 \sqrt{2} L}, \frac{p}{8}\} = \frac{μ p}{8 \sqrt{2} L}, \end{matrix}

which finally implies

\begin{matrix} E ‖ z_{k + 1} - z^{⋆} ‖^{2} \leq {(1 - \frac{μ p}{8 \sqrt{2} L})}^{k + 1} {‖ z_{0} - z^{⋆} ‖}^{2} . \end{matrix}

$□$

A key characteristic of our result is that strong monotonicity constant is not required in the algorithm as a parameter to obtain the rate. This has been raised as an open question by [2] and a partial answer is studied by [7] (see Table 1). Our result gives a full answer to this question without using strong monotonicity constant in all cases.

We next discuss the dependence of $μ$ in the convergence rate. Our rate has a dependence of $\frac{1}{μ}$ compared to $\frac{1}{μ^{2}}$ of non-accelerated methods of [2] and the method of [7]. This difference is important especially when $μ$ is small. On the other hand, in terms of n, our complexity has a worse dependence compared to [5] and accelerated method of [2] as discussed before (see the discussions in Sect. 1.1 and Section 3.2).

Beyond monotonicity

Lastly, we illustrate that our method has convergence guarantees for a class of non-monotone problems. There exist several relaxations of monotonicity that are used in the literature [10, 17, 22, 24]. Among these, we assume the existence of the solutions to Minty variational inequality given as

\begin{matrix} \exists \hat{z} \in Z : ⟨ F (z), z - \hat{z} ⟩ + g (z) - g (\hat{z}) \geq 0, \forall z \in Z . \end{matrix}

Under (43), we can drop the monotonicity assumption and show almost sure subsequential convergence of the iterates of our method. Naturally, in this case one can no longer show sequential convergence as with monotonicity (see Theorem 3.1).

Theorem 3.4

Suppose that Assumption 1 (a), (c), (d) and the condition (43) hold. Then almost surely all cluster points of the sequence ${z_{k}}$ generated by Algorithm 1 are in $Z^{⋆}$ .

Proof

We will proceed as in Theorem 3.1 and [22, Theorem 6]. We note that Lemma 3.2 does not use monotonicity of F, thus its result follows in this case. In the inequality

\begin{matrix} E_{k} [Φ_{k + 1} (z) + 2 τ Θ_{k + 1} (z)] \leq Φ_{k} (z) . \end{matrix}

we plug in $z = \hat{z}$ for a point satisfying (43).

Then, by (43), we have

\begin{matrix} Θ_{k + 1} (\hat{z}) = ⟨ F (z_{k + 1}), z_{k + 1} - \hat{z} ⟩ + g (z_{k + 1}) - g (\hat{z}) \geq 0 . \end{matrix}

We then argue the same way as in Theorem 3.1 to conclude that almost surely, ${z_{k}}$ is bounded and cluster points of ${z_{k}}$ are in $Z^{⋆}$ .

Note that the steps in Theorem 3.1 for showing sequential convergence relies on the choice of z as an arbitrary point in $Z^{⋆}$ , which is not the case here, therefore, we can only use the arguments from Theorem 3.1 for showing subsequential convergence. $□$

Extensions

We illustrate extensions of our results to monotone inclusions and Bregman projections. The proofs for this section are given in the appendix in Section 7.

Monotone inclusions

We have chosen to focus on monotone VIs in the main part of the paper for being able to derive sublinear rates for the gap function. In this section, we show that our analysis extends directly for solving monotone inclusions. In this case, we are interested in finding z such that $0 \in (A + F) (z)$ , where A, F are monotone operators and each $F_{i}$ is Lipschitz with the form $F = \frac{1}{n} \sum_{i = 1}^{n} F_{i}$ . In this case, one changes the prox operator in the algorithm, to resolvent operator of A which is defined as $J_{τ A} (z) = {(I + τ A)}^{- 1} (z)$ . Then, one can use Lemma 3.1 as directly given in [23, Proposition 2.3] to prove an analogous result of Theorem 3.1 for solving monotone inclusions. Moreover, when $A + F$ is strongly monotone, one can prove an analogue of Theorem 3.3. We prove the former result and we note that the latter can be shown by applying the steps in Theorem 3.3 on top of Theorem 4.1, which we do not repeat for brevity.

Theorem 4.1

Let $A : Z ⇉ Z$ be maximally monotone and $F : Z \to Z$ be monotone with $F = \frac{1}{n} \sum_{i = 1}^{n} F_{i}$ , where $F_{i}$ is L-Lipschitz for all i. Assume that ${(A + F)}^{- 1} (0)$ is nonempty and let the iterates ${z_{k}}$ be generated by Algorithm 1 with the update for $z_{k + 1}$

\begin{matrix} z_{k + 1} = J_{τ A} (z_{k} - τ F (w_{k}) - τ (F_{i_{k}} (z_{k}) - F_{i_{k}} (w_{k - 1}))) . \end{matrix}

Then, for $τ < \frac{1 - \sqrt{1 - p}}{2 L}$ , almost surely there exist $z^{⋆} \in {(A + F)}^{- 1} (0)$ such that $z_{k} \to z^{⋆}$ .

Bregman distances

We developed our analysis in the Euclidean setting, relying on $ℓ_{2}$ -norm for simplicity. However, we can also generalize it to proximal operators involving Bregman distances. In this setting, we have a distance generating function $h : Z \to R$ , which is 1-strongly convex and continuous. We follow the standard convention to assume that subdifferential of h admits a continuous selection, which means that there exists a continuous function $\nabla h$ such that $\nabla h (x) \in \partial h (x)$ for all $x \in dom \partial h$ . We define the Bregman distance as $D_{h} (z, \bar{z}) = h (z) - h (\bar{z}) - ⟨ \nabla h (\bar{z}), z - \bar{z} ⟩$ . Then, we will change the proximal step 4 of Algorithm 1 with

\begin{matrix} z_{k + 1} = {argmin}_{z} {g (z) + ⟨ F (w_{k}) + F_{i_{k}} (z_{k}) - F_{i_{k}} (w_{k - 1}), z - z_{k} ⟩ + \frac{1}{τ} D_{h} (z, z_{k})} . \end{matrix}

We prove an analogue of Lemma 3.2 with Bregman distances from which the convergence rate results will follow.

Lemma 4.1

Let Assumption 1 hold and

\begin{matrix} Φ_{k + 1} (z) : = D_{h} (z, z_{k + 1}) + τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ & + \frac{β}{4} {‖ z_{k} - w_{k} ‖}^{2} \\ + \frac{1}{2} D_{h} (z_{k + 1}, z_{k}) . \end{matrix}

Moreover, suppose $τ < \frac{1 - \sqrt{1 - p}}{2 L}$ , $β = \frac{1}{\sqrt{1 - p}} - 1$ , and the iterates ${z_{k}}$ are generated by Algorithm 1 with the update (45) for $z_{k + 1}$ . Then for any $z \in dom g$ ,

\begin{matrix} E_{k} [Φ_{k + 1} (z) + τ Θ_{k + 1} (z)] \leq Φ_{k} (z) . \end{matrix}

Numerical verification

In this section, we include preliminary experimental results for our algorithm. We would like to note that these results are mainly for verifying our theoretical results and are not intended to serve as complete benchmarks. We suspect that for an extensive practical comparison, some practical enhancements of our method similar to proximal-point acceleration from [2] or restarting from [7] may be useful. We leave such investigations for future work.

First, we apply our method to the unconstrained bilinear problem. It was shown in [7] that this simple problem is particularly challenging for stochastic methods, due to unboundedness of the domain, where the standard methods, such as stochastic extragradient method [19], diverges. Our assumptions are general enough to cover this case and we now verify in practice that our method indeed converges for this problem by setting $d = n = 100$ and generating $A_{i} \in R^{d \times d}$ randomly with distribution $N (0, 1)$

\begin{matrix} min_{x \in R^{d}} max_{y \in R^{d}} \frac{1}{n} \sum_{i = 1}^{n} ⟨ A_{i} x, y ⟩ . \end{matrix}

For this experiment, we test the tightness of our step size rule by progressively increasing it. Recall that our step size is $τ = \frac{p}{cL}$ , where $c = 4$ is suggested in our analysis, see Corollary 3.1. We try the values of $c = {0.5, 1, 2, 4}$ and observe that for $c = 0.5$ the algorithm diverges, see Fig. 1(left). The message of this experiment is that even though slightly higher step sizes than what is allowed in our theory might work, it is not possible to significantly increase it.

Fig. 1 — Left: bilinear problem. Middle: Constrained minimization with data generated by normal distribution. Right: Constrained minimization with data generated by uniform distribution

The second problem we consider is constrained minimization, which is an instance where the dual domain is not necessarily bounded. We want to solve

\begin{matrix} min_{x \in C} f (x) s.t. h_{i} (x) \leq 0, i = 1, \dots, m, \end{matrix}

where $f (x) = \frac{1}{2} {‖ x - u ‖}^{2}$ for some $u \in Z$ and $h_{i} (x) = {‖ A_{i} x - b_{i} ‖}^{2} - δ_{i}$ for $A_{i} \in R^{d \times d}$ , $b_{i} \in R^{d}$ , $δ_{i} \in R_{+ +}$ , and C is a unit ball. In other words, we want to find a projection of u onto the intersection given by C and the constraint inequalities ${x : h_{i} (x) \leq 0}$ .

Introducing Lagrange multipliers $y_{i}$ for each constraint, we obtain (see Section 7.3 for further details)

\begin{matrix} min_{x \in R^{d}} max_{y \in R_{+}^{m}} f (x) + \sum_{i = 1}^{m} y_{i} h_{i} (x) . \end{matrix}

As the Lipschitz constant in this problem does not admit a closed-form expression, we first estimate the Lipschitz constant by finding an L such that deterministic method converges. Next, we note that even though we analyzed the algorithm with a single step size $τ$ for both primal and dual variables x, y, one can use different step sizes for primal and dual variables (see [22, Section 4.1]). Therefore, we tuned the scaling of primal and dual step sizes for both methods with one random instance and we used the same scaling for all tests for both methods.

We set $p = 1 / m$ . Every iteration, the deterministic method needs to go through all m constraints to compute $\sum_{i = 1}^{m} y_{i} \nabla h_{i} (x)$ , whereas our method computes $\nabla h_{i} (x)$ for only one i. First setup is with $m = 400$ , $d = 100$ , and the data is generated with the normal distribution $N (0, 1)$ . Second setup is with $m = 400$ , $d = 50$ , and the data is generated with the uniform distribution $U (- 1, 1)$ . We ran both setups with 10 different instances of randomly generated data and plotted all results, see Fig. 1. We observe that in one instance, the tuned scaling diverges for deterministic method, whereas our method with the same tuning converged in all cases.

Conclusion

In this work, we proposed a variance reduced algorithm for solving monotone VIs without assuming bilinearity, strong monotonicity, cocoercivity or boundedness. Even though our method is the first to converge under the same set of assumptions as deterministic methods, a drawback of our approach is the lack of complexity improvements.

In particular, previous approach of [5] showed complexity improvements for bilinear games, while needing more assumptions than deterministic methods to converge. Thus, an important open problem is to obtain a method that i) converges under the minimal set of assumptions as our algorithm, ii) features improved complexity guarantees compared to deterministic methods, while solving structured problems such as bilinear games such as [5] to obtain the best of both worlds.

7. Appendix

7.1 Proofs for Sect. 3

Proof of Lemma 3.1

By using the definition of proximal operator and convexity of g, we have for all $x \in Z$

\begin{matrix} g (x) & \geq g (x_{2}) + ⟨ x_{1} - U_{1} - (V_{1} - U_{0}) - x_{2}, x - x_{2} ⟩ \\ = g (x_{2}) + ⟨ x_{1} - x_{2}, x - x_{2} ⟩ - ⟨ U_{1}, x - x_{2} ⟩ - ⟨ V_{1} - U_{0}, x - x_{2} ⟩ . \end{matrix}

Since $2 ⟨ a, b ⟩ = {‖ a ‖}^{2} + {‖ b ‖}^{2} - {‖ a - b ‖}^{2}$ , $\forall a, b$ , it follows that

\begin{matrix} 2 ⟨ x_{1} - x_{2}, x - x_{2} ⟩ = ‖ x_{1} - x_{2} ‖^{2} + ‖ x - x_{2} ‖^{2} - {‖ x - x_{1} ‖}^{2} . \end{matrix}

Simple rearrangements give

\begin{matrix} - ⟨ U_{1}, x - x_{2} ⟩ = ⟨ V_{2} - U_{1}, x - x_{2} ⟩ - ⟨ V_{2}, x - x_{2} ⟩ \end{matrix}

and

\begin{matrix} - ⟨ V_{1} - U_{0}, x - x_{2} ⟩ = - ⟨ V_{1} - U_{0}, x - x_{1} ⟩ - ⟨ V_{1} - U_{0}, x_{1} - x_{2} ⟩ . \end{matrix}

Using the last three equalities in (47) completes the result. $□$

7.2 Proofs for Sect. 4

We first need a generalized version of Lemma 3.1. In fact, this is the exact form proven in [23], therefore we do not provide its proof.

Lemma 7.1

[23, Proposition 2.3] Let $A : Z ⇉ Z$ be maximally monotone and let $x_{1}, U_{0}, U_{1}$ , $V_{1} \in Z$ be arbitrary points. Define $x_{2}$ as

\begin{matrix} x_{2} = J_{A} (x_{1} - U_{1} - (V_{1} - U_{0})) . \end{matrix}

Then for all $x \in Z$ , $V_{2} \in Z$ , and $U \in - A (x)$ , we have

\begin{matrix} ‖ x_{2} {- x ‖}^{2} + 2 ⟨ V_{2} - U_{1}, x - x_{2} ⟩ + 2 ⟨ V_{2} - U, x_{2} - x ⟩ \\ \leq ‖ x_{1} {- x ‖}^{2} + 2 ⟨ V_{1} - U_{0}, x - x_{1} ⟩ + 2 ⟨ V_{1} - U_{0}, x_{1} - x_{2} ⟩ - {‖ x_{1} - x_{2} ‖}^{2} . \end{matrix}

7.2.1 Proof of Theorem 4.1

Proof

We will start similar to Lemma 3.2. After setting $U_{0} = τ F_{i} (w_{k - 1})$ , $U_{1} = τ F (w_{k})$ , $V_{1} = τ F_{i} (z_{k})$ , $V_{2} = τ F (z_{k + 1})$ , $x_{1} = z_{k}$ , $x_{2} = z_{k + 1}$ with $i_{k} = i$ and plugging into Lemma 7.1, we have

\begin{matrix} ‖ z_{k + 1} {- z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ \leq {‖ z_{k} - z ‖}^{2} \\ + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z - z_{k} ⟩ \\ + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ - {‖ z_{k + 1} - z_{k} ‖}^{2} \\ - 2 τ ⟨ F (z_{k + 1}) - F (z), z_{k + 1} - z ⟩ . \end{matrix}

We use monotonicity for the last term and get

\begin{matrix} ‖ z_{k + 1} {- z ‖}^{2} + 2 τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ \leq ‖ z_{k} {- z ‖}^{2} - {‖ z_{k + 1} - z_{k} ‖}^{2} \\ + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z - z_{k} ⟩ + 2 τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ . \end{matrix}

The rest of Lemma 3.2 follows in this case the same way with the lack of the terms with $Θ_{k + 1} (z)$ . Then, similar arguments as in Theorem 3.1 with the changes of i) not having $Θ_{k + 1} (z)$ , ii) using the definition of resolvent instead of proximal operator to show cluster points are solutions, will give the result (see also [23, Theorem 2.5]). $□$

We now present a version of Lemma 3.1 with the proximal operator using Bregman distance.

Lemma 7.2

Let $g : Z \to R \cup {+ \infty}$ be proper lower semicontinuous convex and let $x_{1}, U_{0}, U_{1}$ , $V_{1} \in Z$ be arbitrary points. Define $x_{2}$ as

\begin{matrix} x_{2} = {argmin}_{z \in Z} {g (z) + ⟨ U_{1} + (V_{1} - U_{0}), z - x_{1} ⟩ + D_{h} (z, x_{1})} . \end{matrix}

Then, for all $x \in Z$ , $V_{2} \in Z$ we have

\begin{matrix} D_{h} (x, x_{2}) + ⟨ V_{2} - U_{1}, z - x_{2} ⟩ + ⟨ V_{2}, x_{2} - x ⟩ + g (x_{2}) - g (x) \\ \leq D_{h} (x, x_{1}) + ⟨ V_{1} - U_{0}, x - x_{1} ⟩ + ⟨ V_{1} - U_{0}, x_{1} - x_{2} ⟩ - D_{h} (x_{2}, x_{1}) . \end{matrix}

Proof

By the definition of $x_{2}$ , it follows from [35, Property 1] that

\begin{matrix} g (x) \geq g (x_{2}) - ⟨ U_{1} + V_{1} - U_{0}, x - x_{2} ⟩ - D_{h} (x, x_{1}) + D_{h} (x, x_{2}) + D_{h} (x_{2}, x_{1}) . \end{matrix}

For the bilinear term, we argue the same as Lemma 3.1. $□$

7.2.2 Proof of Lemma 4.1

Proof

We will follow the proof of Lemma 3.2 with suitable changes for Bregman distances.

First, set $U_{0} = τ F_{i} (w_{k - 1})$ , $U_{1} = τ F (w_{k})$ , $V_{1} = τ F_{i} (z_{k})$ , $V_{2} = τ F (z_{k + 1})$ , $x_{1} = z_{k}$ , then $x_{2} = z_{k + 1}$ with $i_{k} = i$ and we plug these into (53) to get

\begin{matrix} D_{h} (z, z_{k + 1}) + τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + τ (⟨ F (z_{k + 1}), z_{k + 1} - z ⟩ \\ + g (z_{k + 1}) - g (z)) \leq D_{h} (z, z_{k}) + τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z - z_{k} ⟩ \\ + τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ - D_{h} (z_{k + 1}, z_{k}) . \end{matrix}

First, note that by Lipschitzness of $F_{i}$ , Cauchy-Schwarz, Young’s inequalities, and since $\frac{1}{2} {‖ z_{k} - z_{k - 1} ‖}^{2} \leq D_{h} (z_{k}, z_{k - 1})$ ,

\begin{matrix} τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z_{k} - z_{k + 1} ⟩ \\ \leq τ^{2} L^{2} ‖ z_{k} - w_{k - 1} ‖^{2} + \frac{1}{4} {‖ z_{k} - z_{k + 1} ‖}^{2} \\ \leq τ^{2} L^{2} {‖ z_{k} - w_{k - 1} ‖}^{2} + \frac{1}{2} D_{h} (z_{k + 1}, z_{k}) \end{matrix}

Thus, it follows that

\begin{matrix} D_{h} (z, z_{k + 1}) & + τ ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ + \frac{1}{2} D_{h} (z_{k + 1}, z_{k}) + τ Θ_{k + 1} (z) \\ \leq D_{h} (z, z_{k}) & + τ ⟨ F_{i} (z_{k}) - F_{i} (w_{k - 1}), z - z_{k} ⟩ + τ^{2} L^{2} {‖ z_{k} - w_{k - 1} ‖}^{2} . \end{matrix}

Taking expectation conditioning on the knowledge of $z_{k}, w_{k - 1}$ and using that $E_{k} F_{i} (z_{k}) = F (z_{k})$ , $E_{k} F_{i} (w_{k - 1}) = F (w_{k - 1})$ , we obtain

\begin{matrix} E_{k} D_{h} (z, z_{k + 1}) + τ E_{k} ⟨ F (z_{k + 1}) - F (w_{k}), z - z_{k + 1} ⟩ \\ + \frac{1}{2} E_{k} D_{h} (z_{k + 1}, z_{k}) + τ E_{k} Θ_{k + 1} (z) \\ \leq D_{h} (z, z_{k}) + τ ⟨ F (z_{k}) - F (w_{k - 1}), z - z_{k} ⟩ + τ^{2} L^{2} {‖ z_{k} - w_{k - 1} ‖}^{2} . \end{matrix}

Adding

\begin{matrix} \frac{β}{4} E_{k} ‖ z_{k} - w_{k} ‖^{2} = \frac{β (1 - p)}{4} {‖ z_{k} - w_{k - 1} ‖}^{2}, \end{matrix}

which follows by the definition of $w_{k}$ , to (56), we obtain

\begin{matrix} E_{k} [Φ_{k + 1} (z) + Θ_{k + 1} (z)] \leq Φ_{k} (z) \\ + (τ^{2} L^{2} + \frac{β (1 - p)}{4}) ‖ z_{k} - w_{k - 1} ‖^{2} - \frac{1}{2} D_{h} (z_{k}, z_{k - 1}) - \frac{β}{4} {‖ z_{k - 1} - w_{k - 1} ‖}^{2} . \end{matrix}

To show that the last line is nonpositive, we use (14), Young’s inequality as in Lemma 3.2 and $\frac{1}{2} {‖ z_{k} - z_{k - 1} ‖}^{2} \leq D_{h} (z_{k}, z_{k - 1})$ .

Nonnegativity of $Φ_{k}$ follows as in Theorem 3.1 after using $\frac{1}{2} {‖ z_{k} - z_{k - 1} ‖}^{2} \leq D_{h} (z_{k}, z_{k - 1})$ . $□$

7.3 Experiment details

Only for this section, we will use superscripts for iterates rather than subscripts that we have used up to now. Recall that our problem is

\begin{matrix} min_{x \in R^{d}} max_{y \in R_{+}^{m}} f (x) + \sum_{i = 1}^{m} y_{i} h_{i} (x) . \end{matrix}

This problem is equivalent to the following variational inequality

\begin{matrix} ⟨ F (z^{*}), z - z^{*} ⟩ \geq 0 \forall z \in C \times R_{+}^{m}, \end{matrix}

where

\begin{matrix} z = (x, y), F (z) = (\begin{matrix} \nabla f (x) + \sum_{i = 1}^{m} y_{i} \nabla h_{i} (x) \\ - h (x) \end{matrix}) = (\begin{matrix} F^{(1)} (z) \\ F^{(2)} (z) \end{matrix}) \end{matrix}

The notation reads: $h (x) = (h_{1} (x), \dots, h_{m} (x))$ and $\nabla h (x) = (\nabla h_{1} (x), \dots, \nabla h_{m} (x))$ . Let us note $h : R^{d} \to R^{m}$ , $\nabla h (x) \in R^{m \times d}$ . We note that the residual in y-axes of Fig. 1 is computed as $‖ x_{t} - {prox}_{g} (x_{t} - F (x_{t})) ‖$ .

We split F as follows

\begin{matrix} F (z) = \frac{1}{m} \sum_{i = 1}^{m} F_{i} (z) with F_{i} (z) = (\begin{matrix} f (x) + m y_{i} \nabla h_{i} (x) \\ - m h_{i} (x) e_{i} \end{matrix}) = (\begin{matrix} F_{i}^{(1)} (z) \\ F_{i}^{(2)} (z) \end{matrix}), \end{matrix}

where ${(e_{i})}_{i = 1}^{m}$ is a standard basis in $R^{m}$ .

Hence, Algorithm 1, with different step sizes for primal and dual, will be

\begin{matrix} \begin{matrix} x^{k + 1} = P_{C} (x^{k} - τ F^{(1)} (u^{k}, v^{k}) - τ (F_{i}^{(1)} (x^{k}, y^{k}) - F_{i}^{(1)} (u^{k - 1}, v^{k - 1}))) \\ y^{k + 1} = P_{R_{m}^{+}} (y^{k} - σ F^{(2)} (u^{k}, v^{k}) - σ (F_{i}^{(2)} (x^{k}, y^{k}) - F_{i}^{(2)} (u^{k - 1}, v^{k - 1}))) \\ (u^{k + 1}, v^{k + 1}) = \{\begin{matrix} (x^{k + 1}, y^{k + 1}) & with probability p \\ (u^{k}, v^{k}) & with probability 1 - p \end{matrix}) \end{matrix} \end{matrix}

Funding

Open Access funding provided by EPFL Lausanne. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 725594 - time-data), the Swiss National Science Foundation (SNSF) under grant number $200021_178865 / 1$ , the Department of the Navy, Office of Naval Research (ONR) under a grant number N62909-17-1-211, the Hasler Foundation Program: Cyber Human Systems (project number 16066), the Wallenberg Al, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation with project number 305286.

Declarations

Conflict of interest

The author declared that there is no conflict of interest.

Data availability

Not applicable.

Footnotes

Part of the work was done while Y. Malitsky was at EPFL.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Ahmet Alacaoglu, Email: ahmet.alacaoglu@epfl.ch.

Yura Malitsky, Email: yurii.malitskyi@liu.se.

Volkan Cevher, Email: volkan.cevher@epfl.ch.

References

1.Alacaoglu, A., Fercoq, O., Cevher, V. On the convergence of stochastic primal-dual hybrid gradient. arXiv preprint arXiv:1911.00799, (2019)
2.Balamurugan, P., and Bach. F. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, (2016)
3.Bertsekas DP. Incremental proximal methods for large scale convex optimization. Math. Prog. 2011;129(2):163. doi: 10.1007/s10107-011-0472-0. [DOI] [Google Scholar]
4.Boţ R. I., Mertikopoulos, P., Staudigl, M., and Vuong, P. T. Forward-backward-forward methods with variance reduction for stochastic variational inequalities. arXiv preprint arXiv:1902.03355, (2019)
5.Carmon, Y., Jin, Y., Sidford, A., and Tian. K. Variance reduction for matrix games. In Advances in Neural Information Processing Systems, pages 11377–11388, (2019)
6.Chambolle A, Pock T. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 2011;40(1):120–145. doi: 10.1007/s10851-010-0251-1. [DOI] [Google Scholar]
7.Chavdarova, T., Gidel G., Fleuret., F and Lacoste-Julien., S. Reducing noise in GAN training with variance reduced extragradient. In Advances in Neural Information Processing Systems, pages 391–401, (2019)
8.Combettes PL, Pesquet J-C. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 2015;25(2):1221–1248. doi: 10.1137/140971233. [DOI] [Google Scholar]
9.Cui., S and Shanbhag., UV. On the analysis of variance-reduced and randomized projection variants of single projection schemes for monotone stochastic variational inequality problems. arXiv preprint arXiv:1904.11076, (2019)
10.Dang CD, Lan G. On the convergence properties of non-euclidean extragradient methods for variational inequalities with generalized monotone operators. Comput. Opt. Appl. 2015;60(2):277–310. doi: 10.1007/s10589-014-9673-9. [DOI] [Google Scholar]
11.Daskalakis. C., and Panageas. I. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, (2018)
12.Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In International Conference on Learning Representations, (2018)
13.Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, (2014)
14.Esser E, Zhang X, Chan TF. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imag. Sci. 2010;3(4):1015–1046. doi: 10.1137/09076934X. [DOI] [Google Scholar]
15.Hamedani, E. Y., and Aybat, N. S. A primal-dual algorithm for general convex-concave saddle point problems. arXiv:1803.01401, (2018)
16.Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, (2015)
17.Iusem AN, Jofré A, Oliveira RI, Thompson P. Extragradient method with variance reduction for stochastic variational inequalities. SIAM J. Opt. 2017;27(2):686–724. doi: 10.1137/15M1031953. [DOI] [Google Scholar]
18.Johnson, R., and Zhang. T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, (2013)
19.Juditsky A, Nemirovski A, Tauvel C. Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 2011;1(1):17–58. doi: 10.1287/10-SSY011. [DOI] [Google Scholar]
20.Korpelevich G. The extragradient method for finding saddle points and other problems. Ekon. Mat. Metody. 1976;12:747–756. [Google Scholar]
21.Kovalev, D., Horvath, S., and Richtarik. P. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 451–467, (2020)
22.Malitsky Y. Golden ratio algorithms for variational inequalities. Math. Prog. 2019;184:383–410. doi: 10.1007/s10107-019-01416-w. [DOI] [Google Scholar]
23.Malitsky Y, Tam MK. A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 2020;30(2):1451–1472. doi: 10.1137/18M1207260. [DOI] [Google Scholar]
24.Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C. S., Chandrasekhar, V., and Piliouras. G. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In International Conference on Learning Representations, (2019)
25.Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., and Malitsky. Y.: Revisiting stochastic extragradient. In The 23rd International Conference on Artificial Intelligence and Statistics, pages 4573–4582. PMLR, (2020)
26.Mokhtari, A., Ozdaglar, A., and Pattathil. S.: A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507. PMLR, (2020)
27.Nemirovski A. Prox-method with rate of convergence $O (1 / t)$ for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Opt. 2004;15(1):229–251. doi: 10.1137/S1052623403425629. [DOI] [Google Scholar]
28.Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM J. Opt. 2009;19(4):1574–1609. doi: 10.1137/070704277. [DOI] [Google Scholar]
29.Nesterov Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Prog. 2007;109(2–3):319–344. doi: 10.1007/s10107-006-0034-z. [DOI] [Google Scholar]
30.Popov LD. A modification of the Arrow-Hurwicz method for search of saddle points. Math. Notes Acad. Sci. USSR. 1980;28(5):845–848. [Google Scholar]
31.Rakhlin, S., and Sridharan. K, Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, (2013)
32.Robbins, H., and Siegmund. D.: A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics, pages 233–257. Elsevier, (1971)
33.Shalev-Shwartz, S., and Zhang. T: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14(Feb):567–599, (2013)
34.Tseng P. A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Opt. 2000;38(2):431–446. doi: 10.1137/S0363012998338806. [DOI] [Google Scholar]
35.Tseng. P.: On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM J. Opt., 1, (2008)

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Not applicable.

[CR1] 1.Alacaoglu, A., Fercoq, O., Cevher, V. On the convergence of stochastic primal-dual hybrid gradient. arXiv preprint arXiv:1911.00799, (2019)

[CR2] 2.Balamurugan, P., and Bach. F. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, (2016)

[CR3] 3.Bertsekas DP. Incremental proximal methods for large scale convex optimization. Math. Prog. 2011;129(2):163. doi: 10.1007/s10107-011-0472-0. [DOI] [Google Scholar]

[CR4] 4.Boţ R. I., Mertikopoulos, P., Staudigl, M., and Vuong, P. T. Forward-backward-forward methods with variance reduction for stochastic variational inequalities. arXiv preprint arXiv:1902.03355, (2019)

[CR5] 5.Carmon, Y., Jin, Y., Sidford, A., and Tian. K. Variance reduction for matrix games. In Advances in Neural Information Processing Systems, pages 11377–11388, (2019)

[CR6] 6.Chambolle A, Pock T. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 2011;40(1):120–145. doi: 10.1007/s10851-010-0251-1. [DOI] [Google Scholar]

[CR7] 7.Chavdarova, T., Gidel G., Fleuret., F and Lacoste-Julien., S. Reducing noise in GAN training with variance reduced extragradient. In Advances in Neural Information Processing Systems, pages 391–401, (2019)

[CR8] 8.Combettes PL, Pesquet J-C. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 2015;25(2):1221–1248. doi: 10.1137/140971233. [DOI] [Google Scholar]

[CR9] 9.Cui., S and Shanbhag., UV. On the analysis of variance-reduced and randomized projection variants of single projection schemes for monotone stochastic variational inequality problems. arXiv preprint arXiv:1904.11076, (2019)

[CR10] 10.Dang CD, Lan G. On the convergence properties of non-euclidean extragradient methods for variational inequalities with generalized monotone operators. Comput. Opt. Appl. 2015;60(2):277–310. doi: 10.1007/s10589-014-9673-9. [DOI] [Google Scholar]

[CR11] 11.Daskalakis. C., and Panageas. I. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, (2018)

[CR12] 12.Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In International Conference on Learning Representations, (2018)

[CR13] 13.Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, (2014)

[CR14] 14.Esser E, Zhang X, Chan TF. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imag. Sci. 2010;3(4):1015–1046. doi: 10.1137/09076934X. [DOI] [Google Scholar]

[CR15] 15.Hamedani, E. Y., and Aybat, N. S. A primal-dual algorithm for general convex-concave saddle point problems. arXiv:1803.01401, (2018)

[CR16] 16.Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, (2015)

[CR17] 17.Iusem AN, Jofré A, Oliveira RI, Thompson P. Extragradient method with variance reduction for stochastic variational inequalities. SIAM J. Opt. 2017;27(2):686–724. doi: 10.1137/15M1031953. [DOI] [Google Scholar]

[CR18] 18.Johnson, R., and Zhang. T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, (2013)

[CR19] 19.Juditsky A, Nemirovski A, Tauvel C. Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 2011;1(1):17–58. doi: 10.1287/10-SSY011. [DOI] [Google Scholar]

[CR20] 20.Korpelevich G. The extragradient method for finding saddle points and other problems. Ekon. Mat. Metody. 1976;12:747–756. [Google Scholar]

[CR21] 21.Kovalev, D., Horvath, S., and Richtarik. P. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 451–467, (2020)

[CR22] 22.Malitsky Y. Golden ratio algorithms for variational inequalities. Math. Prog. 2019;184:383–410. doi: 10.1007/s10107-019-01416-w. [DOI] [Google Scholar]

[CR23] 23.Malitsky Y, Tam MK. A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 2020;30(2):1451–1472. doi: 10.1137/18M1207260. [DOI] [Google Scholar]

[CR24] 24.Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C. S., Chandrasekhar, V., and Piliouras. G. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In International Conference on Learning Representations, (2019)

[CR25] 25.Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., and Malitsky. Y.: Revisiting stochastic extragradient. In The 23rd International Conference on Artificial Intelligence and Statistics, pages 4573–4582. PMLR, (2020)

[CR26] 26.Mokhtari, A., Ozdaglar, A., and Pattathil. S.: A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507. PMLR, (2020)

[CR27] 27.Nemirovski A. Prox-method with rate of convergence $O (1 / t)$ for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Opt. 2004;15(1):229–251. doi: 10.1137/S1052623403425629. [DOI] [Google Scholar]

[CR28] 28.Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM J. Opt. 2009;19(4):1574–1609. doi: 10.1137/070704277. [DOI] [Google Scholar]

[CR29] 29.Nesterov Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Prog. 2007;109(2–3):319–344. doi: 10.1007/s10107-006-0034-z. [DOI] [Google Scholar]

[CR30] 30.Popov LD. A modification of the Arrow-Hurwicz method for search of saddle points. Math. Notes Acad. Sci. USSR. 1980;28(5):845–848. [Google Scholar]

[CR31] 31.Rakhlin, S., and Sridharan. K, Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, (2013)

[CR32] 32.Robbins, H., and Siegmund. D.: A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics, pages 233–257. Elsevier, (1971)

[CR33] 33.Shalev-Shwartz, S., and Zhang. T: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14(Feb):567–599, (2013)

[CR34] 34.Tseng P. A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Opt. 2000;38(2):431–446. doi: 10.1137/S0363012998338806. [DOI] [Google Scholar]

[CR35] 35.Tseng. P.: On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM J. Opt., 1, (2008)

PERMALINK

Forward-reflected-backward method with variance reduction

Ahmet Alacaoglu

Yura Malitsky

Volkan Cevher

Abstract

Introduction

Related works

Table 1.

Preliminaries and notation

Assumption 1

Algorithm

Convergence analysis

Lemma 3.1

Convergence of the iterates

Lemma 3.2

Proof

Theorem 3.1

Remark 3.1

Corollary 3.1

Proof

Proof of Theorem 3.1

Convergence rate for the general case

Theorem 3.2

Remark 3.2

Proof of Theorem 3.2

Convergence rate for strongly monotone case

Theorem 3.3

Remark 3.3

Proof of Theorem 3.3

Beyond monotonicity

Theorem 3.4

Proof

Extensions

Monotone inclusions

Theorem 4.1

Bregman distances

Lemma 4.1

Numerical verification

Fig. 1.

Conclusion

7. Appendix

7.1 Proofs for Sect. 3

Proof of Lemma 3.1

7.2 Proofs for Sect. 4

Lemma 7.1

7.2.1 Proof of Theorem 4.1

Proof

Lemma 7.2

Proof

7.2.2 Proof of Lemma 4.1

Proof

7.3 Experiment details

Funding

Declarations

Conflict of interest

Data availability

Footnotes

Contributor Information

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases