GDA-AM: ON THE EFFECTIVENESS OF SOLVING MINIMAX OPTIMIZATION VIA ANDERSON MIXING

Huan He; Shifan Zhao; Yuanzhe Xi; Joyce C Ho; Yousef Saad

Author manuscript; available in PMC: 2026 Apr 21.

Published in final edited form as: Int Conf Learn Represent. 2022 Apr;2022:13527–13557.


PMCID: PMC13095136 NIHMSID: NIHMS1846240 PMID: 42016826

Abstract

Many modern machine learning algorithms such as generative adversarial networks (GANs) and adversarial training can be formulated as minimax optimization. Gradient descent ascent (GDA) is the most commonly used algorithm due to its simplicity. However, GDA can converge to non-optimal minimax points. We propose a new minimax optimization framework, GDA-AM, that views the GDA dynamics as a fixed-point iteration and solves it using Anderson Mixing to converge to the local minimax. It addresses the diverging issue of simultaneous GDA and accelerates the convergence of alternating GDA. We show theoretically that the algorithm can achieve global convergence for bilinear problems under mild conditions. We also empirically show that GDA-AM solves a variety of minimax problems and improves adversarial training on several datasets. Code is available on GitHub ¹.

1. Introduction

Minimax optimization has received a surge of interest due to its wide range of applications in modern machine learning, such as generative adversarial networks (GAN), adversarial training and multi-agent reinforcement learning (Goodfellow et al., 2014; Madry et al., 2018; Li et al., 2019). Formally, given a bivariate function $f (x, y)$ , the objective is to find a stable solution where the players cannot improve their objective, i.e., to find the Nash equilibrium of the underlying game (von Neumann & Morgenstern, 1944):

\underset{x \in X}{arg min} \underset{y \in Y}{arg max} f (x, y) .

(1)

It is commonplace to use simple algorithms such as gradient descent ascent (GDA) to solve such problems, where both players take a gradient update simultaneously or alternately. Despite its simplicity, GDA is known to suffer from a generic issue for minimax optimization: it may cycle around a stable point, exhibit divergent behavior, or converge very slowly since it requires very small learning rates (Gidel et al., 2019a; Mertikopoulos et al., 2019). Given the widespread usage of gradient-based methods for solving machine learning problems, first-order optimization algorithms to solve minimax problems have gained considerable popularity in the last few years. Algorithms such as optimistic Gradient Descent Ascent (OG) (Daskalakis et al., 2018; Mertikopoulos et al., 2019) and extra-gradient (EG) (Gidel et al., 2019a) can alleviate the issue of GDA for some problems. Yet, it has been shown that these methods can still diverge or cycle around a stable point (Adolphs et al.; Mazumdar et al., 2019; Parker-Holder et al., 2020). For example, these algorithms even fail to find a local minimax (the set of local minimax is a superset of local Nash (Jin et al., 2020; Wang et al., 2020)) as shown in Figure 1. This leads to the following question: Can we design better algorithms for minimax problems? We answer this in the affirmative, by introducing GDA-AM. We cast the GDA dynamics as a fixed-point iteration problem and compute the iterates effectively using an advanced nonlinear extrapolation method. We show that indeed our algorithm has theoretical and empirical guarantees across a broad range of minimax problems, including GANs.

Figure 1: **Left:** $f (x, y) = (4 x^{2} - {(y - 3 x + 0.05 x^{3})}^{2} - 0.1 y^{4}) e^{- 0.01 (x^{2} + y^{2})}$ . **Middle:** $f (x, y) = - 3 x^{2} - y^{2} + 4 x y$ . **Right:** $f (x, y) = 2 x^{2} + y^{2} + 4 x y + \frac{4}{3} y^{3} - \frac{1}{4} y^{4}$ . We can observe that baseline methods fail to converge to a local minimax, whereas `GDA-AM` with table size $p = 3$ always exhibits desirable behaviors.

Our contributions:

In this paper, we propose a different approach to solve minimax optimization. Our starting point is to cast the GDA dynamics as a fixed-point iteration. We then highlight that the fixed-point iteration can be solved effectively by using advanced nonlinear extrapolation methods such as Anderson Mixing (Anderson, 1965); we name the resulting method GDA-AM. Although this idea was first mentioned in Azizian et al. (2020), to the best of our knowledge this is still the first work to investigate and improve the GDA dynamics by tapping into advanced fixed-point algorithms.

We demonstrate that GDA dynamics can benefit from Anderson Mixing. In particular, we study bilinear games and give a systematic analysis of GDA-AM for both simultaneous and alternating versions of GDA. We theoretically show that GDA-AM can achieve global convergence guarantees under mild conditions.

We complement our theoretical results with numerical simulations across a variety of minimax problems. We show that for some convex-concave and non-convex-concave functions, GDA-AM can converge to the optimal point with little hyper-parameter tuning whereas existing first-order methods are prone to divergence and cycling behaviors.

We also provide empirical results for GAN training across two different datasets, CIFAR10 and CelebA. Given the limited computational overhead of our method, the results suggest that an extrapolation add-on to GDA can lead to significant performance gains. Moreover, the convergence behavior across a variety of problems and the ease-of-use demonstrate the potential of GDA-AM to become the minimax optimization workhorse.

2. Preliminaries and background

2.1. Minimax optimization

Definition 1.

Point $(x^{*}, y^{*})$ is a local Nash equilibrium of $f$ if there exists $δ > 0$ such that for any $(x, y)$ satisfying $‖x - x^{*}‖ \leq δ$ and $‖y - y^{*}‖ \leq δ$ we have: $f (x^{*}, y) \leq f (x^{*}, y^{*}) \leq f (x, y^{*})$ .

To find the Nash equilibria, common algorithms including GDA, EG and OG, can be formulated as follows. For the two variants of GDA, simultaneous GDA (SimGDA) and alternating GDA (AltGDA), the updates have the following forms:

\begin{array}{l} Simultaneous : x_{t + 1} = x_{t} - η \nabla_{x} f (x_{t}, y_{t}), y_{t + 1} = y_{t} + η \nabla_{y} f (x_{t}, y_{t}) \\ Alternating : x_{t + 1} = x_{t} - η \nabla_{x} f (x_{t}, y_{t}), y_{t + 1} = y_{t} + η \nabla_{y} f (x_{t + 1}, y_{t}) . \end{array}

(2)

The EG update has the following form:

\begin{array}{ll} x_{t + \frac{1}{2}} = x_{t} - η \nabla_{x} f (x_{t}, y_{t}), & y_{t + \frac{1}{2}} = y_{t} + η \nabla_{y} f (x_{t}, y_{t}) \\ x_{t + 1} = x_{t} - η \nabla_{x} f (x_{t + \frac{1}{2}}, y_{t + \frac{1}{2}}), & y_{t + 1} = y_{t} + η \nabla_{y} f (x_{t + \frac{1}{2}}, y_{t + \frac{1}{2}}) . \end{array}

(3)

The OG update has the following form:

x_{t + 1} = x_{t} - η \nabla_{x} f (x_{t}, y_{t}) + \frac{η}{2} \nabla_{x} f (x_{t - 1}, y_{t - 1}), y_{t + 1} = y_{t} + η \nabla_{y} f (x_{t}, y_{t}) - \frac{η}{2} \nabla_{y} f (x_{t - 1}, y_{t - 1}) .

(4)
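To make the update rules concrete, the following is a minimal Python sketch of one step of SimGDA, AltGDA (equation 2) and EG (equation 3). The function names, the toy objective $f (x, y) = x y$, the step size and the starting point are illustrative assumptions and are not taken from the paper; OG is left out because it additionally carries the previous gradient as state.

```python
import numpy as np

def sim_gda(grad_x, grad_y, x, y, eta):
    """One simultaneous GDA step: both players use the old (x, y)."""
    return x - eta * grad_x(x, y), y + eta * grad_y(x, y)

def alt_gda(grad_x, grad_y, x, y, eta):
    """One alternating GDA step: y uses the freshly updated x."""
    x_new = x - eta * grad_x(x, y)
    return x_new, y + eta * grad_y(x_new, y)

def extra_gradient(grad_x, grad_y, x, y, eta):
    """One EG step: look ahead to a midpoint, then update from (x, y)."""
    x_half = x - eta * grad_x(x, y)
    y_half = y + eta * grad_y(x, y)
    return (x - eta * grad_x(x_half, y_half),
            y + eta * grad_y(x_half, y_half))

# Toy objective (assumption): f(x, y) = x * y, Nash equilibrium at (0, 0).
gx = lambda x, y: y   # df/dx
gy = lambda x, y: x   # df/dy

def run(step, n_steps=200, eta=0.1):
    """Distance to (0, 0) after n_steps, starting from (1, 1)."""
    x, y = 1.0, 1.0
    for _ in range(n_steps):
        x, y = step(gx, gy, x, y, eta)
    return float(np.hypot(x, y))

dist_sim = run(sim_gda)         # grows: simultaneous GDA spirals outward
dist_alt = run(alt_gda)         # stays bounded: alternating GDA cycles
dist_eg = run(extra_gradient)   # shrinks: EG spirals inward
```

On this toy problem the three methods show the behaviors discussed in the introduction: simultaneous GDA moves away from the equilibrium, alternating GDA orbits around it, and EG slowly contracts toward it.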

2.2. Fixed-Point Iteration and Anderson Mixing (AM)

Definition 2.

$w^{⋆}$ is a fixed point of the mapping $g$ if $w^{⋆} = g (w^{⋆})$ .

Consider the simple fixed-point iteration $w_{t + 1} = g (w_{t})$ which produces a sequence of iterates $\{w_{0}, w_{1}, \dots, w_{N}\}$ . In favorable cases, this converges to a fixed point $w^{⋆} = g (w^{⋆})$ . Take gradient descent as an example: it can be viewed as iteratively applying the operation $w_{t + 1} = g (w_{t}) ≜ w_{t} - α_{t} \nabla f (w_{t})$ , where the limit is the fixed point $w^{⋆} = g (w^{⋆})$ (i.e., $\nabla f (w^{⋆}) = 0$ ). SimGDA updates can be defined as the repeated application of a nonlinear operator:

w_{t + 1} = G_{η}^{(sim)} (w_{t}) ≜ w_{t} - η V (w_{t}) with w = [\begin{array}{l} x \\ y \end{array}], V (w) = [\begin{matrix} \nabla_{x} f (x, y) \\ - \nabla_{y} f (x, y) \end{matrix}]

Similarly, we can write AltGDA updates as $w_{t + 1} = G_{η}^{(alt)} (w_{t})$ . An issue with fixed-point iteration is that it does not always converge, and even when it does converge, it may do so very slowly. GDA is one such example: its iterates may approach a limit cycle instead of a single point. One way of dealing with these problems is to use acceleration methods, which can speed up convergence and in some cases even decrease the likelihood of divergence.

There are many different acceleration methods, but we focus on Anderson Mixing (also known as Anderson Acceleration). In short, Anderson Mixing (AM) shares the same idea as Nesterov’s acceleration. Given a fixed-point iteration $w_{t} = g (w_{t - 1})$ , Anderson Mixing argues that a good approximation to the final solution $w^{*}$ can be obtained as a linear combination of the previous iterates, $w_{t + 1} = \sum_{i = 0}^{p} β_{i} g (w_{t - p + i})$ , where $p$ is the table size. Since obtaining the proper coefficients $β_{i}$ is a nonlinear procedure, Anderson Mixing is also known as a nonlinear extrapolation method. The general form of Anderson Mixing is shown in Algorithm 1. For efficiency, we prefer a ‘restarted’ version with a small table size $p$ that cleans up the table $F$ every $p$ iterations because it avoids solving a linear system of increasing size.

Algorithm 1: Anderson Mixing Prototype (truncated version)
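The following is a minimal Python sketch of a restarted Anderson Mixing loop in the spirit of Algorithm 1. It is not the authors' implementation: the function name `anderson_mixing`, the restart rule (the history is cleared once it holds more than `p` residuals), the use of `numpy.linalg.lstsq`, and the linear test map are all assumptions made for illustration. The mixing coefficients are obtained from the unconstrained least-squares form built from differences of consecutive residuals.

```python
import numpy as np

def anderson_mixing(g, w0, p=5, n_iter=100, tol=1e-10):
    """Restarted Anderson Mixing for the fixed-point problem w = g(w)."""
    w = np.asarray(w0, dtype=float)
    g_hist, f_hist = [], []            # stored g(w_i) and residuals f_i = g(w_i) - w_i
    for _ in range(n_iter):
        gw = g(w)
        f = gw - w
        if np.linalg.norm(f) < tol:    # residual small enough: stop
            break
        if len(f_hist) > p:            # table full: restart with an empty history
            g_hist, f_hist = [], []
        g_hist.append(gw)
        f_hist.append(f)
        if len(f_hist) == 1:           # nothing to mix yet: plain fixed-point step
            w = gw
            continue
        # Columns are differences of consecutive residuals / g-values.
        dF = np.diff(np.column_stack(f_hist), axis=1)
        dG = np.diff(np.column_stack(g_hist), axis=1)
        gamma, *_ = np.linalg.lstsq(dF, f, rcond=None)
        w = gw - dG @ gamma            # extrapolated iterate
    return w

# Hypothetical linear test map g(w) = M w + b with fixed point (I - M)^{-1} b.
M = np.diag([0.9, 0.5])
b = np.array([1.0, 1.0])
w_am = anderson_mixing(lambda w: M @ w + b, np.zeros(2))
w_exact = np.linalg.solve(np.eye(2) - M, b)
```

For this two-dimensional linear map the extrapolated iterate reaches the fixed point after three evaluations of `g`, whereas the plain iteration contracts only by a factor of 0.9 per step along the slow direction.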

2.3. AM and Generalized Minimal Residual (GMRES)

Developed by Saad & Schultz (1986), the Generalized Minimal Residual method (GMRES) is a Krylov subspace method for solving linear systems of equations. The method approximates the solution by the vector in a Krylov subspace with minimal residual, as described below.

Definition 3.

Assume we have the linear system of equations $A x = b$ with $A \in ℝ^{n \times n}$ , $b \in ℝ^{n}$ and an initial guess $x_{0}$ . Then we denote the initial residual by $r_{0} = b - A x_{0}$ and define the tth Krylov subspace as $K_{t} = \mathrm{span} \{r_{0}, A r_{0}, \dots, A^{t - 1} r_{0}\}$ .

The tth iterate $x_{t}$ of GMRES minimizes the norm of the residual $r_{t} = b - A x_{t}$ over $x_{0} + K_{t}$ , that is, $x_{t}$ solves

\min_{x_{t} \in x_{0} + K_{t}} {‖b - A x_{t}‖}_{2} .

The following formulation is equivalent to the GMRES minimization problem and more convenient for implementation. It computes ${\hat{x}}_{t}$ such that

{\hat{x}}_{t} = \underset{{\hat{x}}_{t} \in K_{t}}{\arg \min} {‖b - A (x_{0} + {\hat{x}}_{t})‖}_{2} = \underset{{\hat{x}}_{t} \in K_{t}}{\arg \min} {‖r_{0} - A {\hat{x}}_{t}‖}_{2} .

Using a larger Krylov dimension will improve the convergence of the method, but will require more memory. For this reason, a smaller Krylov subspace dimension $t$ and ‘restarted’ versions of the method are used in practice (Saad, 2003).
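The minimal-residual property can be illustrated directly from the definition. The sketch below builds the Krylov subspace explicitly and solves the least-squares problem over it; this is only meant to mirror the definition on a tiny, made-up system and is not how GMRES is implemented in practice (practical codes use the Arnoldi process). The function name `gmres_iterates` and the 3-by-3 test matrix are assumptions for illustration.

```python
import numpy as np

def gmres_iterates(A, b, x0, t_max):
    """Return x_1, ..., x_{t_max}; x_t minimizes ||b - A x||_2 over x0 + K_t."""
    r0 = b - A @ x0
    basis = [r0]                                  # r0, A r0, A^2 r0, ...
    iterates = []
    for _ in range(t_max):
        K = np.column_stack(basis)                # columns span K_t
        Q, _ = np.linalg.qr(K)                    # orthonormal basis of K_t
        # Solve min_c ||r0 - A Q c||_2, then set x_t = x0 + Q c.
        c, *_ = np.linalg.lstsq(A @ Q, r0, rcond=None)
        iterates.append(x0 + Q @ c)
        basis.append(A @ basis[-1])               # extend to K_{t+1}
    return iterates

# Hypothetical nonsingular test system.
A = np.array([[4.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
xs = gmres_iterates(A, b, np.zeros(3), t_max=3)
res = [float(np.linalg.norm(b - A @ x)) for x in xs]
```

Because the subspaces are nested, the residual norms in `res` cannot increase with $t$, and for this system the third iterate already solves the system up to rounding error since $K_{3}$ spans all of $ℝ^{3}$.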

The convergence of GMRES can be studied through the magnitude of the residual polynomial.

Theorem 2.1

(Lemma 6.31 of Saad (2003)). Let ${\hat{x}}_{t}$ be the approximate solution obtained at the t-th iteration of GMRES applied to solve $A x = b$ , and denote the residual as $r_{t} = b - A {\hat{x}}_{t}$ . Then, $r_{t}$ is of the form

r_{t} = f_{t} (A) r_{0},

(5)

where

{‖r_{t}‖}_{2} = {‖f_{t} (A) r_{0}‖}_{2} = \min_{f \in P_{t}} {‖f (A) r_{0}‖}_{2},

(6)

where $P_{t}$ is the family of polynomials of degree at most $t$ such that $f (0) = 1$ for all $f \in P_{t}$ ; these are usually called residual polynomials.

Although GMRES is applied to a system of linear equations rather than a fixed-point problem, there is a strong connection between Anderson Mixing and GMRES. In AM we are looking for a fixed point $x$ of $g (x) = Gx + b$ , that is, a point such that $Gx + b - x = 0$ , and by rearranging this equation we get

b + (G - I) x = 0 \Leftrightarrow (I - G) x = b .

Theorem 2.2 shows that if GMRES is applied to the system $(I - G) x = b$ and AM is applied to $g (x) = Gx + b$ with the same initial guess and $I - G$ is non-singular, then these are equivalent in the sense that the iterates of each algorithm can be obtained directly from the iterates of the other algorithm.

Theorem 2.2

(Equivalence between AM with restart and GMRES (Walker & Ni, 2011a)). Consider the fixed point iteration $x = g (x)$ where $g (x) = Gx + b$ for $G \in ℝ^{n \times n}$ and $b \in ℝ^{n}$ . If $I - G$ is non-singular, Algorithm 1 produces exactly the same iterates as GMRES being applied to solve $(I - G) x = b$ when both algorithms start with the same initial guess.

Theorem 2.2 can also be generalized to the restarted versions of AM and GMRES.

3. `GDA-AM` : GDA with Anderson Mixing

We propose a novel minimax optimizer, called GDA-AM, that is inspired by recent advances in parameter (or weight) averaging (Wu et al., 2020; Yazici et al., 2019). We argue that a nonlinear adaptive average (combination) is a more appropriate choice for minimax optimization.

3.1. GDA with Naïve Anderson Mixing

We propose to exploit the dynamic information present in the GDA iterates to “smartly” combine the past iterates. This is in contrast to the classical averaging methods (moving averaging and exponential moving averaging) (Yang et al., 2019) that “blindly” combine past iterates. A naïve adoption of Anderson Mixing using the past $p$ GDA iterates for both SimGDA and AltGDA has the following form:

Anderson mixing : x_{t + 1} = \sum_{i = 0}^{p} β_{i} x_{t - p + i}, y_{t + 1} = \sum_{i = 0}^{p} β_{i} y_{t - p + i} .

Zhang et al. (2021) and Gidel et al. (2019b) show that AltGDA is superior to SimGDA in many aspects. We therefore summarize both Simultaneous and Alternating GDA-AM, in Algorithms 2 and 3 respectively, each built on the truncated Anderson Mixing of Algorithm 1 with table size $p$ .

Algorithm 2: Simultaneous GDA-AM

Algorithm 3: Alternating GDA-AM

It is important to note that the Anderson Mixing form shown in Algorithm 1 is for illustrative purposes and is not computationally efficient. For example, only one column of $F_{t}$ needs to be updated at each iteration. In addition, the least-squares problem in Algorithm 1 can be solved with a quick QR update scheme that costs $(2 n + 1) p^{2}$ (Walker & Ni, 2011a). Thus, from Algorithms 2 and 3, we can see that the major additional cost of GDA-AM compared to regular GDA arises from solving the linear least-squares problem at each iteration. Additional implementation details are provided in the Appendix.
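As a rough illustration of the idea behind Algorithm 3, the sketch below wraps a restarted Anderson Mixing loop around the alternating GDA map for a scalar bilinear game $f (x, y) = a x y + b x + c y$. Everything specific here is an assumption made for the example and not the authors' code: the names `alt_gda` and `gda_am`, the coefficients, the step size $η = 0.5$, the table size $p = 3$, and the naive least-squares solve (no QR updating).

```python
import numpy as np

a, b, c, eta = 1.0, 1.0, -2.0, 0.5         # f(x, y) = a*x*y + b*x + c*y (made up)
w_star = np.array([-c / a, -b / a])        # Nash equilibrium (x*, y*) = (2, -1)

def alt_gda(w):
    """One alternating GDA step on the scalar bilinear game."""
    x, y = w
    x_new = x - eta * (a * y + b)          # descent step in x
    y_new = y + eta * (a * x_new + c)      # ascent step in y, using the new x
    return np.array([x_new, y_new])

def gda_am(g, w0, p=3, n_iter=30, tol=1e-12):
    """Restarted Anderson Mixing wrapped around the GDA map g."""
    w = np.asarray(w0, dtype=float)
    g_hist, f_hist = [], []
    for _ in range(n_iter):
        gw = g(w)
        f = gw - w
        if np.linalg.norm(f) < tol:
            break
        if len(f_hist) > p:                # table full: restart
            g_hist, f_hist = [], []
        g_hist.append(gw)
        f_hist.append(f)
        if len(f_hist) == 1:               # empty history: plain GDA step
            w = gw
            continue
        dF = np.diff(np.column_stack(f_hist), axis=1)
        dG = np.diff(np.column_stack(g_hist), axis=1)
        gamma, *_ = np.linalg.lstsq(dF, f, rcond=None)
        w = gw - dG @ gamma
    return w

w_am = gda_am(alt_gda, np.zeros(2))

w_plain = np.zeros(2)
for _ in range(100):                       # plain alternating GDA for comparison
    w_plain = alt_gda(w_plain)

err_am = float(np.linalg.norm(w_am - w_star))
err_plain = float(np.linalg.norm(w_plain - w_star))
```

On this toy game plain alternating GDA keeps orbiting the equilibrium, so `err_plain` stays bounded away from zero, while the mixed iterates land on $(x^{*}, y^{*})$ after a handful of gradient evaluations.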

4. Convergence results for `GDA-AM`

In this section, we show that both the simultaneous and alternating versions of GDA-AM converge to the equilibrium for bilinear problems. First, we do not require the learning rate to be sufficiently small. Second, we explicitly provide a linear convergence rate that is faster than EG and OG. More importantly, we derive nonasymptotic rates from the spectrum analysis perspective because existing theoretical results cannot help us derive a convergent rate (see Appendix C.1).

4.1. Bilinear Games

Bilinear games are often regarded as an important simple example for theoretically analyzing and understanding new algorithms and techniques for solving general minimax problems (Gidel et al., 2019a; Mertikopoulos et al., 2019; Schaefer & Anandkumar, 2019). In this section, we analyze the convergence property of simultaneous GDA-AM and alternating GDA-AM schemes on the following zero-sum bilinear games:

\min_{x \in ℝ^{n}} \max_{y \in ℝ^{n}} f (x, y) = x^{T} Ay + b^{T} x + c^{T} y, A is full rank .

(7)

The Nash equilibrium of the above problem is given by $(x^{*}, y^{*}) = (- A^{- T} c, - A^{- 1} b)$ .

We also investigate bilinear-quadratic games from a spectrum analysis perspective. In addition, we show that analysis based on the numerical range (Bollapragada et al., 2018) can be also extended to such games, although it can not help derive a convergent bound for equation 7. Detailed discussion can be found in Appendix C.1 and C.4.1.

4.2. Simultaneous `GDA-AM`

Suppose $x_{0}$ and $y_{0}$ are the initial guesses for $x^{*}$ and $y^{*}$ , respectively. Then each iteration of simultaneous GDA can be written in the following matrix form:

[\begin{array}{l} x_{t + 1} \\ y_{t + 1} \end{array}] = \underset{G^{(S i m)}}{\underset{︸}{[\begin{matrix} I & - η A \\ η A^{T} & I \end{matrix}]}} \underset{w_{t}^{(S i m)}}{\underset{︸}{[\begin{array}{l} x_{t} \\ y_{t} \end{array}]}} - η \underset{b^{(S i m)}}{\underset{︸}{[\begin{array}{l} b \\ c \end{array}]}} .

(8)

It has been shown that the iteration in equation 8 often cycles and fails to converge for the bilinear problem due to the poor spectrum/numerical range of the fixed point operator $G^{(S i m)}$ (Gidel et al., 2019a; Azizian et al., 2020; Mokhtari et al., 2020a). Next we show that the convergence can be improved with Algorithm 2.
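The spectral obstruction can be checked numerically. In the sketch below, the random matrix `A`, the size `n` and the step size `eta` are arbitrary choices made for illustration. It builds $G^{(Sim)}$ from equation 8 and confirms two facts: every eigenvalue of $G^{(Sim)}$ has the form $1 \pm i η σ_{j}$ with $σ_{j}$ a singular value of $A$, so its modulus exceeds 1, and the eigenvalues of $I - G^{(Sim)}$ are purely imaginary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 4, 0.5                              # illustrative size and step size
A = rng.standard_normal((n, n))
I = np.eye(n)

# Fixed-point operator of simultaneous GDA on the bilinear game (equation 8).
G_sim = np.block([[I, -eta * A],
                  [eta * A.T, I]])

eig_G = np.linalg.eigvals(G_sim)
eig_shifted = np.linalg.eigvals(np.eye(2 * n) - G_sim)

sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A
spectral_radius = float(np.abs(eig_G).max())
predicted_radius = float(np.sqrt(1.0 + (eta * sigma.max()) ** 2))
```

Since `spectral_radius` is larger than 1 for every positive step size, the plain iteration cannot contract. The system matrix $I - G^{(Sim)}$ that GMRES sees is skew-symmetric with spectrum $\pm i η σ_{j}$.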

Theorem 4.1.

[Global convergence for simultaneous GDA-AM on bilinear problem] Denote the distance between the stationary point $w^{*}$ and current iterate $w_{(k + 1) p}$ of Algorithm 2 with table size $p$ as $N_{(k + 1) p} = ‖w^{*} - w_{(k + 1) p}‖$ . Then we have the following bound for $N_{t}$

N_{(k + 1) p}^{2} \leq ρ (A) N_{k p}^{2}

(9)

where $ρ (A) = {(\frac{1}{T_{p} (1 + \frac{2}{κ (A^{T} A) - 1})})}^{2}$ . Here, $T_{p}$ is the Chebyshev polynomial of the first kind of degree $p$ and $\frac{1}{T_{p} (1 + \frac{2}{κ (A^{T} A) - 1})} < 1$ since $1 + \frac{2}{κ (A^{T} A) - 1} > 1$ .

It is worth emphasizing that the convergence rate of Algorithm 2 is independent of the learning rate $η$ , while the convergence results of other methods like EG and OG depend on the learning rate.
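To get a feel for the size of the contraction factor in equation 9, the sketch below evaluates $ρ$ for a given condition number $κ$ and table size $p$, using the identity $T_{p} (z) = \cosh (p \, \mathrm{arccosh} \, z)$ for $z \geq 1$. The function name `sim_gda_am_rate` and the sample values of $κ$ and $p$ are assumptions chosen for illustration. Note that the factor applies to one restart cycle of $p$ iterations, not to a single iteration.

```python
import numpy as np

def sim_gda_am_rate(kappa, p):
    """rho = (1 / T_p(1 + 2 / (kappa - 1)))**2, valid for kappa > 1."""
    z = 1.0 + 2.0 / (kappa - 1.0)
    T_p = np.cosh(p * np.arccosh(z))   # Chebyshev polynomial T_p(z) for z >= 1
    return float((1.0 / T_p) ** 2)

# Contraction per restart cycle for a few table sizes at kappa = 100.
rates = {p: sim_gda_am_rate(kappa=100.0, p=p) for p in (1, 5, 10, 20)}
```

For $p = 1$ the factor reduces to ${((κ - 1) / (κ + 1))}^{2}$, and it decreases as $p$ grows, which matches the role of the table size described in the theorem.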

Remark 4.1.1.

Both EG and OG have the following form of convergence rate (Mokhtari et al., 2020a) for bilinear problem

N_{t + 1}^{2} \leq (1 - \frac{c}{κ (A^{T} A)}) N_{t}^{2},

where $c$ is a positive constant independent of the problem parameters.

4.3. Alternating `GDA-AM`

The underlying fixed point iteration in Algorithm 3 can be written in the following matrix form:

[\begin{array}{l} x_{t + 1} \\ y_{t + 1} \end{array}] = \underset{G^{(Alt)}}{\underset{︸}{[\begin{matrix} I & - η A \\ η A^{T} & I - η^{2} A^{T} A \end{matrix}]}} \underset{w_{t}^{(Alt)}}{\underset{︸}{[\begin{array}{l} x_{t} \\ y_{t} \end{array}]}} - η \underset{b^{(Alt)}}{\underset{︸}{[\begin{array}{l} b \\ c \end{array}]}} .

According to the equivalence between truncated Anderson acceleration and GMRES with restart, we can analyze the convergence of Algorithm 3 through the convergence analysis of applying GMRES to solve linear systems associated with $G = I - G^{(Alt)}$ :

G = [\begin{matrix} 0 & η A \\ - η A^{T} & η^{2} A^{T} A \end{matrix}] .

Theorem 4.2.

[Global convergence for alternating GDA-AM on bilinear problem] Denote the distance between the stationary point $w^{*}$ and current iterate $w_{(k + 1) p}$ of Algorithm 3 with table size $p$ as $N_{(k + 1) p} = ‖w^{*} - w_{(k + 1) p}‖$ . Assume $A$ is normalized such that its largest singular value is equal to 1. Then when the learning rate $η$ is less than 2, we have the following bound for $N_{t}$

N_{(k + 1) p}^{2} \leq \sqrt{1 + \frac{2 η}{2 - η}} {(\frac{r}{c})}^{p} N_{k p}^{2}

where $c$ and $r$ are the center and radius of a disk $D (c, r)$ which includes all the eigenvalues of $G$ . In particular, $\frac{r}{c} < 1$ .

Theorem 4.2 shows that when $p > \frac{\log \sqrt{\frac{2 - η}{2 + η}}}{\log \frac{r}{c}}$ , alternating GDA-AM will converge globally.
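The threshold on $p$ follows from requiring the factor in Theorem 4.2 to be below 1: since $\sqrt{1 + \frac{2 η}{2 - η}} = \sqrt{\frac{2 + η}{2 - η}}$ and $\log \frac{r}{c} < 0$, the condition $\sqrt{\frac{2 + η}{2 - η}} {(\frac{r}{c})}^{p} < 1$ rearranges to the stated inequality. The small sketch below evaluates it; the helper names and the sample values $η = 1$, $r = 0.8$, $c = 1$ are hypothetical and chosen only to show the arithmetic.

```python
import math

def alt_bound_factor(eta, r, c, p):
    """Factor multiplying N_{kp}^2 in Theorem 4.2 (needs 0 < eta < 2, 0 < r < c)."""
    return math.sqrt(1.0 + 2.0 * eta / (2.0 - eta)) * (r / c) ** p

def min_table_size(eta, r, c):
    """Smallest integer p with p > log(sqrt((2 - eta) / (2 + eta))) / log(r / c)."""
    threshold = math.log(math.sqrt((2.0 - eta) / (2.0 + eta))) / math.log(r / c)
    return math.floor(threshold) + 1

p_min = min_table_size(eta=1.0, r=0.8, c=1.0)              # hypothetical disk
factor_at_p_min = alt_bound_factor(1.0, 0.8, 1.0, p_min)
factor_below = alt_bound_factor(1.0, 0.8, 1.0, p_min - 1)
```

With these sample values the threshold is about 2.46, so `p_min` is 3: the factor is below 1 at $p = 3$ and above 1 at $p = 2$.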

4.4. Discussion of obtained rates

We first explain why the Chebyshev polynomial of degree $p$ is evaluated at the point $1 + \frac{2}{κ - 1}$ . We evaluate the Chebyshev polynomial at this specific point because the reciprocal of this value gives the minimal value of the infinity norm over all polynomials of degree $p$ , normalized to equal 1 at the origin, on the interval $\tilde{I} = [η^{2} σ_{min}^{2} (A), η^{2} σ_{max}^{2} (A)]$ , based on Theorem 6.25 (page 209) of Saad (2003). In other words, taking the function value at this point leads to the tight bound.

When comparing with existing bounds, we would like to point out that our derived bounds are hard to compare directly. The experiments in Figure 2b numerically verify that our bound is smaller than that of EG. We wanted to numerically compare our rate with EG with positive momentum. However, the bound of EG with positive momentum is asymptotic. Moreover, it does not specify the constants, so we cannot numerically compare them. We do provide an empirical comparison between GDA-AM and EG with positive momentum for bilinear problems in Appendix D.1, which shows that GDA-AM outperforms EG with positive momentum. Regarding alternating GDA-AM, we would like to note that the bound in Theorem 4.2 depends on the eigenvalue distribution of the matrix $G$ . The condition number is not directly related to the distribution of eigenvalues of a nonsymmetric matrix $G$ . Thus, the condition number is not a precise metric to characterize the convergence. If these eigenvalues are clustered, then our bound can be small. On the other hand, if these eigenvalues are evenly distributed in the complex plane, then the bound can be very close to 1.

Figure 2: (a) The blue line is the spectrum of matrix $G^{(Sim)}$ while the red line is the spectrum of matrix $I - G^{(Sim)}$ . Our method transforms the divergent problem into a convergent problem due to the transformed spectrum. (b) Convergence rate comparison between SimGDA-AM and EG for different condition numbers of $A$ and fixed table sizes $p = 10, 20, 50$ . (c) Convergence rate comparison between SimGDA-AM and EG for increasing table size on a matrix with condition number 100.

More importantly, we would like to stress several technical contributions.

Theorems 4.1 and 4.2 provide nonasymptotic guarantees, while most other results are asymptotic. For example, EG with positive momentum can achieve an asymptotic rate of $1 - O (1 / \sqrt{κ})$ under strong assumptions (Azizian et al., 2020).
Our contribution is not just to fix the convergence issue of GDA by applying Anderson Mixing; we also arrive at a convergent and tight bound for the original problem rather than merely adopting existing analyses. We developed Theorems 4.1 and 4.2 from a new perspective because applying existing theoretical results fails to give bounds that are either convergent or tight.
Theorems 4.1 and 4.2 require only mild conditions and reflect how the table size $p$ controls the convergence rate. Theorem 4.1 is independent of the learning rate $η$ . In contrast, the convergence results of other methods like EG and OG depend on the learning rate, which may yield less than desirable results for ill-specified learning rates.

5. Experiments

In this section, we conduct experiments to see whether GDA-AM improves GDA for minimax optimization from simple to practical problems. We first investigate performance of GDA-AM on bilinear games. In addition, we evaluate the efficacy of our approach on GANs.

5.1. Bilinear Problems

In this section, we answer the following questions. Q1: How does GDA-AM perform in terms of iteration number and running time? Q2: How does GDA-AM scale with problem size? Q3: How does GDA-AM perform with different table sizes $p$ ? Q4: Does GDA-AM converge for large step size $η$ ?

We compare the performance with SimGDA, AltGDA, EG, OG, and EG with Negative Momentum (Azizian et al., 2020) on the bilinear minimax games shown in equation 7 without any constraint.

$A$ , $b$ , $c$ , and the initial points are generated from normally distributed random numbers. We set the maximum iteration number to 1 × 10⁶ and the stopping criterion to 1 × 10⁻⁵, and depict convergence using the distance to the optimum, defined as $‖w^{*} - w_{t}‖$ . Similar to Azizian et al. (2020); Wei et al. (2021a), the step size is set to 1 after rescaling $A$ to have 2-norm 1. We present results for different settings in Figures 4, 5, and 6.

Figure 4: Comparison in terms of iteration: $\min_{x} \max_{y} f (x, y) = x^{T} Ay + b^{T} x + c^{T} y$ . We use different problem sizes and fix $p = 10$ , $η = 1$ for all experiments.

Figure 5: Comparison between methods in terms of time.

We first generate problems of different sizes $(n = 100, 1000, 5000)$ and present convergence in terms of iteration number in Figure 4. GDA-AM converges in far fewer iterations for all problem sizes. Note that EG, EG-NM, and OG converge in the end but require many iterations, so we plot only a portion for illustrative purposes. Figure 5 depicts the convergence of all methods in terms of time. The running time of GDA-AM is shorter than that of EG. Although each GDA-AM iteration is slower than an OG iteration, GDA-AM converges in much less time for all problems. Figure 4 and Figure 5 answer Q1 and Q2: although there is additional computation for GDA-AM, it does not outweigh the benefits of adopting Anderson Mixing. Even for a large problem size, GDA-AM still converges in much less time than the baselines.

Next, we run GDA-AM using different table sizes $p$ and show the results in Figure 6a and Figure 6b. Figure 6a indicates that increasing the table size results in faster convergence in terms of iteration number, which is consistent with Theorem 4.1. However, we also observe an increased running time when using a larger table size in Figure 6b. Further, we can see that $p = 50$ converges in a comparable time and number of iterations to $p = 100$ . Similar results are found in repeated experiments as well. As a result, our answer to Q3 is that although a larger $p$ means fewer iterations, a medium $p$ is sufficient and a small $p$ still outperforms the baselines. The optimal choice of $p$ is related to the condition number and step size, which is another interesting topic in the Anderson Mixing community.

Next, we answer Q4 on convergence under different step sizes. Although GDA-AM usually converges with a suitable step size, our theorem suggests it requires a larger table size when combined with an extremely aggressive step size. Figure 6c shows the convergence under such circumstances. We can observe that although a very large step size goes the wrong way in the beginning, Anderson Mixing can still bring it back on track except when $η > 1$ . This answers the question and confirms our claim that GDA-AM can achieve global convergence for bilinear problems for a large step size $η > 0$ .

5.2. GAN Experiments: Image Generation

We apply our method to the CIFAR10 dataset (Krizhevsky, 2009) and use the ResNet architecture with the WGAN-GP (Gulrajani et al., 2017) and SNGAN (Miyato et al., 2018) objectives. We also compare the performance of GDA-AM using cropped CelebA (64×64) (Liu et al., 2015) on WGAN-GP. We compare with Adam and extra-gradient with Adam (EG) as it offers significant improvement over OG. Models are evaluated using the inception score (IS) (Salimans et al., 2016) and FID (Heusel et al., 2017) computed on 50,000 samples. For a fair comparison, we fixed the same hyperparameters of Adam for all methods after an extensive search. Experiments were run with 5 random seeds. Table 1 reports the best IS and FID (averaged over 5 runs) achieved on these datasets by each method. We see that GDA-AM yields improvements over the baselines in terms of generation quality.

Table 1: Best inception scores (IS) and FID for CIFAR10 and FID for CelebA (IS is a less informative metric for CelebA).

| Method | WGAN-GP (ResNet), CIFAR10, IS ↑ | WGAN-GP (ResNet), CIFAR10, FID ↓ | WGAN-GP (ResNet), CelebA, FID ↓ | SNGAN (ResNet), CIFAR10, IS ↑ | SNGAN (ResNet), CIFAR10, FID ↓ |
| --- | --- | --- | --- | --- | --- |
| Adam | 7.76 ±.11 | 22.45 ±.65 | 8.43 ±.05 | 8.21 ±.05 | 20.81 ±.16 |
| EG | 7.83 ±.08 | 20.73 ±.22 | 8.15 ±.06 | 8.15 ±.07 | 21.12 ±.19 |
| Ours (`GDA-AM`) | 8.05 ±.06 | 19.32 ±.16 | 7.82 ±.06 | 8.38 ±.04 | 18.84 ±.13 |

6. Conclusion

We prove the convergence property of GDA-AM and obtain a faster convergence rate than EG and OG on the bilinear problem. Empirically, we verify our claim for such a problem and show the efficacy of GDA-AM in a deep learning setting as well. We believe our work is different from previous approaches and takes an important step towards understanding and improving minimax optimization by exploiting the GDA dynamic and reforming it with numerical techniques.

Figure 3: An illustration of the spectrum of $G$ (red) and the enclosing circle (blue) in Theorem 4.2.

Acknowledgments

This work was funded in part by NSF grants OAC 2003720 and IIS 1838200 and NIH grants 5R01LM013323-03 and 5K01LM012924-03.

A. Related work

There is a rich literature on different strategies to alleviate the issue of minimax optimization. A useful add-on technique, Momentum, has been shown to be effective for bilinear games and strongly-convex-strongly-concave settings (Zhang & Wang, 2021; Gidel et al., 2019b; Azizian et al., 2020). Several second-order methods (Adolphs et al.; Mescheder et al., 2017; Mazumdar et al., 2019; Parker-Holder et al., 2020) show that their stable fixed points are exactly either Nash equilibria or local minimax by incorporating second-order information. However, such methods are computationally expensive and thus unsuitable for large applications such as image generation. Focusing on variants of GDA, EG and OG are two widely studied algorithms on improving the GDA dynamics. EG proposed to apply extra-gradient to overcome the cycling behaviour of GDA. OG, originally proposed in Popov (1980) and rediscovered in Daskalakis et al. (2018); Mertikopoulos et al. (2019), is more efficient by storing and re-using the extrapolated gradient for the extrapolation step. Without projection, OG is equivalent to extrapolation from past. Mokhtari et al. (2020b) shows that both of these algorithms can be interpreted as approximations of the classical proximal point method and did a unified analysis for bilinear games. These approaches mentioned the GDA dynamics can be viewed as a fixed-point iteration, but none of them further provides a solution to improve it. In this work, we fill this gap by proposing the application of the extrapolation method directly on the entire GDA dynamics. Unlike OG, EG and their variants (Hsieh et al., 2019; Lei et al., 2021; Thekumparampil et al., 2019; Yang et al., 2019), which regard minimax problems as variational inequality problems (Bruck, 1977; Nemirovski, 2004), our work is from a new perspective and thus orthogonal to these previous approaches.

In addition, several recent works consider nonconvex-concave minimax problems. Zhang et al. (2020) introduced a “smoothing” scheme combined with GDA to stabilize the dynamics of GDA. Luo et al. (2020) proposed a method called Stochastic Recursive gradiEnt Descent Ascent (SREDA) for stochastic nonconvex-strongly-concave minimax problems, which estimates gradients recursively and reduces their variance. Lin et al. (2020) showed that two-timescale GDA can find a stationary point of nonconvex-concave minimax problems effectively. Ostrovskii et al. (2021) proposed a variant of Nesterov’s accelerated algorithm to find an $ϵ$ -first-order Nash equilibrium, which is a stronger criterion than the commonly used proximal gradient norm. Nouiehed et al. (2019) proposed an iterative method that finds an $ϵ$ -first-order Nash equilibrium in $O (ϵ^{- 2})$ iterations under the Polyak-Lojasiewicz (PL) condition. Focusing on nonconvex minimax problems, these works study an interesting and difficult problem. Since our work sheds light on the effectiveness of solving minimax optimization via Anderson Mixing, we expect that the extension of this algorithm to general nonconvex problems can be further investigated in the future.

B. Anderson Mixing Implementation Details

In this section, we discuss the efficient implementation of Anderson Mixing. We start with the generic Anderson Mixing prototype (Algorithm 4) and then present the idea of the Quick QR-update Anderson Mixing implementation as described in Walker & Ni (2011b), which is commonly used in practice. For each iteration $t \geq 0$ , the AM prototype solves a least-squares problem with a normalization constraint. The intuition is to minimize the norm of the weighted residuals of the previous $p$ iterates.

Algorithm 4: Anderson Mixing Prototype (truncated version)

The constrained linear least-squares problem in Algorithm 4 can be solved in a number of ways. Our preference is to recast it in the unconstrained form suggested in Fang & Saad (2009); Walker & Ni (2011b), which is straightforward to solve and convenient for efficient updating of the QR factors.

Define $f_{i} = g (w_{i}) - w_{i}$ and $Δ f_{i} = f_{i + 1} - f_{i}$ for each $i$ , and set $F_{t} = [f_{t - p_{t}}, \dots, f_{t}]$ and $\mathcal{F}_{t} = [Δ f_{t - p_{t}}, \dots, Δ f_{t - 1}]$ . Then solving the constrained least-squares problem ( $\min_{α} {‖F_{t} α‖}_{2}$ , s.t. $\sum_{i = 0}^{p_{t}} α_{i} = 1$ ) is equivalent to

\min_{γ = {(γ_{0}, \dots, γ_{p_{t} - 1})}^{T}} {‖f_{t} - \mathcal{F}_{t} γ‖}_{2}

(10)

where $α$ and $γ$ are related by $α_{0} = γ_{0}$ , $α_{i} = γ_{i} - γ_{i - 1}$ for $1 \leq i \leq p_{t} - 1$ , and $α_{p_{t}} = 1 - γ_{p_{t} - 1}$ .

The inner minimization subproblem can therefore be solved efficiently as an unconstrained least-squares problem after a simple variable elimination. This unconstrained least-squares problem leads to a modified form of Anderson Mixing

w_{t + 1} = g (w_{t}) - \sum_{i = 0}^{p_{t} - 1} γ_{i}^{(t)} [g (w_{t - p_{t} + i + 1}) - g (w_{t - p_{t} + i})] = g (w_{t}) - G_{t} γ^{(t)}

where $G_{t} = [Δ g_{t - p_{t}}, \dots, Δ g_{t - 1}]$ with $Δ g_{i} = g (w_{i + 1}) - g (w_{i})$ for each $i$ .

To obtain $γ^{(t)} = {(γ_{0}^{(t)}, \dots, γ_{p_{t} - 1}^{(t)})}^{T}$ from equation 10, we show how the successive least-squares problems can be solved efficiently by updating the factors in the QR decomposition $\mathcal{F}_{t} = Q_{t} R_{t}$ as the algorithm proceeds. We assume a thin QR decomposition, for which the solution of the least-squares problem is obtained by solving the $p_{t} \times p_{t}$ linear system $R γ = Q^{T} f_{t}$ . Each $\mathcal{F}_{t}$ is $n \times p_{t}$ and is obtained from $\mathcal{F}_{t - 1}$ by adding a column on the right and, if the resulting number of columns is greater than $p$ , also cleaning up (re-initializing) the table. That is, we never need to delete the left column, because cleaning up the table corresponds to a restarted version of AM. As a result, we only need to handle two cases: (1) the table is empty (cleaned); (2) the table is not full. When the table is empty, we initialize $\mathcal{F}_{1} = Q_{1} R_{1}$ with $Q_{1} = Δ f_{0} / {‖ Δ f_{0} ‖}_{2}$ and $R_{1} = {‖Δ f_{0}‖}_{2}$ . If the table size is smaller than $p$ , we add a column on the right of $\mathcal{F}_{t - 1}$ . Given $\mathcal{F}_{t - 1} = Q R$ , we update $Q$ and $R$ so that $\mathcal{F}_{t} = [\mathcal{F}_{t - 1}, Δ f_{t - 1}] = Q R$ . This is a single modified Gram–Schmidt sweep, described as follows:

Algorithm 5: QR-updating procedures

Note that we do not explicitly recompute the QR decomposition in each iteration. Instead, we update the factors ( $O (p^{2} n)$ ) and then solve a linear system by back substitution, which has a complexity of $O (p^{2})$ . Based on this complexity analysis, Anderson Mixing with the QR-updating scheme adds only limited computational overhead compared with GDA (or OG). This explains why GDA-AM is faster than EG but slower than OG in terms of running time per iteration.
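The QR-updating step and the restarted iteration described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `qr_append` and `restarted_am`, the stopping tolerance `tol`, and the choice to discard the newest column when the table is cleaned are assumptions made here, and `np.linalg.solve` is used for brevity where a dedicated back-substitution routine would exploit the triangular structure of $R$.

```python
import numpy as np

def qr_append(Q, R, v):
    """Extend thin QR factors of the difference table by one right-most column v."""
    v = np.array(v, dtype=float)            # work on a copy
    if Q is None:                           # case 1: the table is empty (cleaned)
        nrm = np.linalg.norm(v)
        return (v / nrm)[:, None], np.array([[nrm]])
    k = Q.shape[1]                          # case 2: the table is not full
    r = np.zeros(k + 1)
    for j in range(k):                      # single modified Gram-Schmidt sweep
        r[j] = Q[:, j] @ v
        v -= r[j] * Q[:, j]
    r[k] = np.linalg.norm(v)
    R_new = np.zeros((k + 1, k + 1))
    R_new[:k, :k] = R
    R_new[:, k] = r
    return np.hstack([Q, (v / r[k])[:, None]]), R_new

def restarted_am(g, w0, p, iters, tol=1e-12):
    """Restarted Anderson Mixing for the fixed-point map g (hypothetical helper)."""
    w = np.array(w0, dtype=float)
    gw = g(w)
    f = gw - w                              # residual f_t = g(w_t) - w_t
    Q = R = None
    dG = []                                 # columns Delta g_i = g(w_{i+1}) - g(w_i)
    for _ in range(iters):
        if np.linalg.norm(f) < tol:
            break
        if Q is None:                       # empty table: plain fixed-point step
            w_new = gw
        else:                               # solve R gamma = Q^T f_t, then mix
            gamma = np.linalg.solve(R, Q.T @ f)
            w_new = gw - np.column_stack(dG) @ gamma
        g_new = g(w_new)
        f_new = g_new - w_new
        if Q is not None and Q.shape[1] == p:
            Q, R, dG = None, None, []       # table full: clean up (restart)
        else:
            Q, R = qr_append(Q, R, f_new - f)
            dG.append(g_new - gw)
        w, gw, f = w_new, g_new, f_new
    return w

# Usage on a hypothetical contractive linear map g(w) = M w with fixed point 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
A /= np.linalg.norm(A, 2)
M = np.block([[0.5 * np.eye(3), -0.5 * A], [0.5 * A.T, 0.5 * np.eye(3)]])
w_final = restarted_am(lambda w: M @ w, rng.standard_normal(6), p=3, iters=40)
print(np.linalg.norm(w_final))
```

The test map above has a coefficient matrix $I - M$ with positive definite symmetric part, which avoids the stagnation that plain Anderson Mixing can inherit from GMRES on skew-symmetric systems.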

C. Theoretical Results

C.1. Difficulty of analysis on GDA with Anderson Mixing

In the analysis, we study the inherent structure of the dynamics of the fixed-point iteration and provide the convergence analysis for both simultaneous and alternating schemes. We want to emphasize that a direct application of existing convergence results for GMRES does not lead to a convergent bound. A recent paper, Bollapragada et al. (2018), studies convergence acceleration schemes for multi-step optimization algorithms using Regularized Nonlinear Acceleration. We also want to point out that a naïve application of Crouzeix’s bound to the minimax optimization problem cannot be used to derive a convergent bound.

Theorem C.1

(Fischer & Freund (1991)). Let $n \geq 5$ be an integer, $r > 1$ , and $c \in ℝ$ . Consider the following constrained polynomial minmax problem

\min_{p \in ℙ_{n} : p (c) = 1} \max_{z \in E_{r}} | p (z) |

(11)

where

E_{r} : = \{z \in ℂ \mid | z - 1 | + | z + 1 | \leq r + \frac{1}{r}\}

(12)

and $c \in ℂ ∖ E_{r}$ . Then this problem can be solved uniquely by

t_{n} (z; c) : = \frac{T_{n} (z)}{T_{n} (c)},

(13)

where

T_{n} (z) = \frac{1}{2} (v^{n} + \frac{1}{v^{n}}), z = \frac{1}{2} (v + \frac{1}{v})

(14)

provided that either $| c | \geq \frac{1}{2} (r^{\sqrt{2}} + r^{- \sqrt{2}})$ or $| c | \geq (1 / 2 a_{r}) (2 a_{r}^{2} - 1 + \sqrt{2 a_{r}^{4} - a_{r}^{2} + 1})$ , where $a_{r} : = \frac{1}{2} (r + \frac{1}{r})$ .

The naïve approach fails because the point 0, at which all residual polynomials take the fixed value 1, is included in the numerical range of the iteration matrix, which violates the assumption of Theorem C.1. As a result, this approach cannot be used to prove that the residual norm is decreasing. Instead, we show that although the coefficient matrix is non-normal, it is diagonalizable. We then give the convergence results based on the eigenvalues instead of the numerical range. More specifically, Anderson Mixing is equivalent to GMRES applied to the following linear system:

(I - G^{(A l t)}) w = b^{(A l t)}, with w_{0} = w_{0}^{(A l t)} .

(15)

Writing this linear system in the block form:

[\begin{matrix} 0 & η A \\ - η A^{T} & η^{2} A^{T} A \end{matrix}] w = b^{(A l t)} .

(16)

The residual norm bound for GMRES reads:

{‖r_{t}‖}_{2} = \min_{p \in ℙ_{t}^{1}} {‖p (I - G^{(A l t)}) r_{0}‖}_{2} .

(17)

Notice that the matrix $(I - G^{(A l t)})$ is non-normal. If we apply Crouzeix’s bound in Crouzeix & Palencia (2017) to our problem as Bollapragada et al. (2018) did, then we have the following bound

\frac{{‖r_{t}‖}_{2}}{{‖r_{0}‖}_{2}} \leq \min_{p \in ℙ_{t}^{1}} ‖p (I - G^{(A l t)})‖ \leq (1 + \sqrt{2}) \min_{p \in ℙ_{t}^{1}} \sup_{z \in W (I - G^{(A l t)})} ‖ p (z) ‖

(18)

where $W (I - G^{(A l t)}) = \{z^{*} (I - G^{(A l t)}) z : z \in ℂ^{2 n}, ‖ z ‖ = 1\}$ is the numerical range of $I - G^{(A l t)}$ . In order to simplify this upper bound, we study the numerical range of $I - G^{(A l t)}$ , similar to Bollapragada et al. (2018). Writing $z = [\begin{array}{l} z_{1} \\ z_{2} \end{array}]$ and computing the numerical range of $I - G^{(A l t)}$ explicitly yields:

[z_{1}^{*}, z_{2}^{*}] [\begin{matrix} 0 & η A \\ - η A^{T} & η^{2} A^{T} A \end{matrix}] [\begin{array}{l} z_{1} \\ z_{2} \end{array}] = η^{2} z_{2}^{*} A^{T} A z_{2} + η z_{1}^{*} A z_{2} - η z_{2}^{*} A^{T} z_{1} .

(19)

For a general matrix $A$ , the numerical range of $I - G^{(A l t)}$ has no special structure. However, when $A$ is symmetric, we can decompose $A$ as $A = \sum_{i = 1}^{n} λ_{i} v_{i} v_{i}^{T}$ , where ${\{λ_{i}\}}_{i = 1}^{n}$ are the eigenvalues of $A$ in decreasing order and ${\{v_{i}\}}_{i = 1}^{n}$ are the associated eigenvectors, and write $A^{T} A = \sum_{i = 1}^{n} λ_{i}^{2} v_{i} v_{i}^{T}$ . Then we can compute the numerical range of $I - G^{(A l t)}$ as follows:

\sum_{i = 1}^{n} [z_{1}^{*}, z_{2}^{*}] [\begin{matrix} 0 & η λ_{i} v_{i} v_{i}^{T} \\ - η λ_{i} v_{i} v_{i}^{T} & η^{2} λ_{i}^{2} v_{i} v_{i}^{T} \end{matrix}] [\begin{array}{l} z_{1} \\ z_{2} \end{array}] = \sum_{i = 1}^{n} [z_{1}^{*} v_{i}, z_{2}^{*} v_{i}] [\begin{matrix} 0 & η λ_{i} \\ - η λ_{i} & η^{2} λ_{i}^{2} \end{matrix}] [\begin{matrix} v_{i}^{T} z_{1} \\ v_{i}^{T} z_{2} \end{matrix}]

(20)

Following the techniques proposed in Bollapragada et al. (2018) to analyze the numerical range of general 2 × 2 matrices, we can show that the numerical range of $I - G^{(A l t)}$ is equal to the convex hull of the union of the numerical range of

G_{i} = [\begin{matrix} 0 & η λ_{i} \\ - η λ_{i} & η^{2} λ_{i}^{2} \end{matrix}], i = 1, \dots, n .

(21)

The boundary of the numerical range of $G_{i}$ is an ellipse whose axes are the line segments joining the points $x$ to $y$ and $w$ to $z$ , respectively, with

x = 0, y = η^{2} λ_{i}^{2}, w = \frac{η^{2} λ_{i}^{2}}{2} - \sqrt{- 1} η |λ_{i}|, z = \frac{η^{2} λ_{i}^{2}}{2} + \sqrt{- 1} η |λ_{i}| .

(22)

Thus, the numerical range of $I - G^{(A l t)}$ is the convex hull of the union of the numerical ranges of a set of 2-by-2 matrices, and the numerical range of each such 2-by-2 matrix is an ellipse. We can compute the center $o$ and focal distance $d$ of the ellipse generated by the numerical range of $I - G^{(A l t)}$ explicitly. A linear transformation then enables us to use Theorem C.1 to show that the near-best polynomial for the minimax problem on the numerical range of $I - G^{(A l t)}$ is given by $t_{n} (z; c) : = \frac{T_{n} (\frac{z - o}{d})}{T_{n} (\frac{c - o}{d})}$ if 0 is excluded from the numerical range of $I - G^{(A l t)}$ . However, according to equation 22, the numerical range includes the point 0, where the residual polynomial takes the value 1. The analysis based on the numerical range therefore cannot yield a convergent result, because the upper bound is not guaranteed to be less than 1.
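The obstruction described above can be checked numerically on a single 2-by-2 block $G_{i}$ from equation 21. The sketch below is illustrative only: the helper names `alt_block` and `sample_numerical_range`, and the values of `eta` and `lam`, are assumptions chosen for the example. It evaluates $z^{*} G_{i} z$ at the unit vector $e_{1}$, which gives exactly 0, and samples random unit vectors to confirm that the sampled values stay inside the bounding box implied by equation 22.

```python
import numpy as np

def alt_block(eta, lam):
    """2x2 block G_i = [[0, eta*lam], [-eta*lam, (eta*lam)^2]] from equation 21."""
    a = eta * lam
    return np.array([[0.0, a], [-a, a * a]])

def sample_numerical_range(M, n_samples=5000, seed=0):
    """Sample points z^* M z of the numerical range over random complex unit vectors z."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, 2)) + 1j * rng.standard_normal((n_samples, 2))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return np.einsum("ni,ij,nj->n", Z.conj(), M, Z)

eta, lam = 0.5, 1.2                  # illustrative values, not taken from the paper
G_i = alt_block(eta, lam)

e1 = np.array([1.0, 0.0])
origin_value = e1 @ G_i @ e1         # the (1,1) entry, so 0 lies in the numerical range

pts = sample_numerical_range(G_i)
print(origin_value, pts.real.min(), pts.real.max(), np.abs(pts.imag).max())
```

Because 0 is attained by a unit vector, no residual polynomial normalized to 1 at the origin can have maximum modulus below 1 on this set.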

C.2. Proofs of theorem

We first provide proof of Theorem 4.1.

Theorem C.2

(Global convergence for simultaneous GDA-AM on bilinear problem). Denote the distance between the stationary point $w^{*}$ and current iterate $w_{(k + 1) p}$ of Algorithm 2 with Anderson restart dimension $p$ as $N_{(k + 1) p} = d i s t (w^{*}, w_{(k + 1) p})$ . Then Algorithm 2 is unconditionally convergent, and we have the following bound for $N_{t}$ :

N_{(k + 1) p} \leq \frac{1}{T_{p} (1 + \frac{2}{κ (A^{T} A) - 1})} N_{k p}

(23)

where $T_{p}$ is the Chebyshev polynomial of the first kind of degree $p$ and $\frac{1}{T_{p} (1 + \frac{2}{κ (A^{T} A) - 1})} < 1$ since $1 + \frac{2}{κ (A^{T} A) - 1} > 1$ .

Proof of Theorem 4.1.

Note that $I - G^{(S i m)}$ is a normal matrix which will be denoted as $G$ for notational simplicity. Thus it admits the following eigendecomposition:

G = UΛ U^{T}, U U^{T} = I, Λ = diag (λ_{1}, \dots, λ_{2 n}) .

(24)

Based on the equivalence between GMRES and Anderson Mixing, we know that the convergence rate of simultaneous GDA-AM can be estimated by the spectrum of $G$ . Especially, it holds that

r_{(k + 1) p} = U f_{p} (Λ) U^{T} r_{k p}, \quad f_{p} \in P_{p}

(25)

where $P_{p}$ is the family of residual polynomials with degree p such that $f_{p} (0) = 1$ , $\forall f_{p} \in P_{p}$ . According to Lemma 2.1, we have the following estimation

{‖r_{(k + 1) p}‖}_{2} = \min_{f_{p} \in P_{p}} {‖f_{p} (G) r_{k p}‖}_{2} \leq \min_{f_{p} \in P_{p}} \max_{i} |f_{p} (λ_{i})| {‖r_{k p}‖}_{2} .

(26)

Due to the block structure of $G$ , the eigenvalues of $G$ can be computed explicitly as

\pm η σ_{i} \sqrt{- 1}, i = 1, \dots, n,

(27)

where $σ_{i}$ is the $i$ th largest singular value of the matrix $A$ . This shows that the eigenvalues of $G$ are $n$ pairs of purely imaginary numbers excluding 0, since $A$ has full rank.

Since the eigenvalues of $G$ are distributed in two intervals excluding the origin

I = [- η σ_{\max} (A) \sqrt{- 1}, - η σ_{\min} (A) \sqrt{- 1}] \cup [η σ_{\min} (A) \sqrt{- 1}, η σ_{\max} (A) \sqrt{- 1}],

it can be shown that the polynomial of degree $p$ with value 1 at the origin that has the minimal maximum deviation from 0 on $I$ is given by:

f_{p} (z) = \frac{T_{l} (q (\sqrt{- 1} z))}{T_{l} (q (0))}, q (\sqrt{- 1} z) = 1 - \frac{2 (\sqrt{- 1} z - η σ_{\min} (A)) (\sqrt{- 1} z + η σ_{\min} (A))}{{(η σ_{\max} (A))}^{2} - {(η σ_{\min} (A))}^{2}}

(28)

where $l = [\frac{p}{2}]$ and $T_{l}$ is the Chebyshev polynomial of first kind of degree $l$ . The function $q (\sqrt{- 1} z)$ maps I to [−1,1]. Thus the numerator of the polynomial $f_{p}$ is bounded by 1 on I. The size of denominator can be determined by the method discussed in Chapter 3 of Greenbaum (1997). Assume $q (0) = \frac{1}{2} (y + y^{- 1})$ , then $T_{l} (q (0)) = \frac{1}{2} (y^{l} + y^{- l})$ . Then y can be determined by solving

\frac{1}{2} (y + y^{- 1}) = q (0) = \frac{{(η σ_{\max} (A))}^{2} + {(η σ_{\min} (A))}^{2}}{{(η σ_{\max} (A))}^{2} - {(η σ_{\min} (A))}^{2}} .

(29)

The solutions to this equation are

y_{1} = \frac{η σ_{\max} (A) + η σ_{\min} (A)}{η σ_{\max} (A) - η σ_{\min} (A)} \quad \text{or} \quad y_{2} = \frac{η σ_{\max} (A) - η σ_{\min} (A)}{η σ_{\max} (A) + η σ_{\min} (A)} .

(30)

Then plugging the value of $q (0)$ into the polynomial $f_{p}$ yields

\begin{array}{l} \frac{‖r_{(k + 1) p}‖}{‖r_{k p}‖} \leq 2 {(\frac{\sqrt{η^{2} σ_{\max}^{2} (A)} - \sqrt{η^{2} σ_{\min}^{2} (A)}}{\sqrt{η^{2} σ_{\max}^{2} (A)} + \sqrt{η^{2} σ_{\min}^{2} (A)}})}^{l} \\ = 2 {(\frac{σ_{\max} (A) - σ_{\min} (A)}{σ_{\max} (A) + σ_{\min} (A)})}^{l} = 2 {(\frac{κ (A) - 1}{κ (A) + 1})}^{l} \end{array}

(31)

Note that $N_{t}$ and $r_{t}$ are related through $G (w_{t} - w^{*}) = r_{t}$ . Therefore,

\begin{array}{l} N_{(k + 1) p} = {‖w_{(k + 1) p} - w^{*}‖}_{2} = {‖G^{- 1} r_{(k + 1) p}‖}_{2} = \min_{f_{p} \in P_{p}} {‖G^{- 1} f_{p} (G) G (w_{k p} - w^{*})‖}_{2} \\ \leq \min_{f_{p} \in P_{p}} \max_{i} |f_{p} (λ_{i})| {‖w_{k p} - w^{*}‖}_{2} \leq 2 {(1 - \frac{2}{κ (A) + 1})}^{\frac{p}{2}} N_{k p} . \end{array}

(32)

In fact, a tighter bound can be proved by noting that the problem is essentially equivalent to a polynomial minmax problem on the interval

\tilde{I} = [η^{2} σ_{\min}^{2} (A), η^{2} σ_{\max}^{2} (A)] .

It is then well known that

\begin{array}{l} N_{(k + 1) p} \leq \min_{f_{p} \in P_{p}} \max_{λ_{i} \in [η^{2} σ_{\min}^{2} (A), η^{2} σ_{\max}^{2} (A)]} |f_{p} (λ_{i})| {‖w_{k p} - w^{*}‖}_{2} \leq \frac{1}{T_{p} (1 + 2 \frac{σ_{\min}^{2}}{σ_{\max}^{2} - σ_{\min}^{2}})} N_{k p} \\ = \frac{1}{T_{p} (1 + \frac{2}{κ (A^{T} A) - 1})} N_{k p} \end{array}

(33)

where $T_{p}$ is the Chebyshev polynomial of the first kind of degree $p$ and $\frac{1}{T_{p} (1 + \frac{2}{κ (A^{T} A) - 1})} < 1$ . Explicitly,

T_{p} (1 + \frac{2}{κ (A^{T} A) - 1}) = \frac{1}{2} [{(1 + \frac{2}{κ (A^{T} A) - 1} + \sqrt{{(1 + \frac{2}{κ (A^{T} A) - 1})}^{2} - 1})}^{p} + {(1 + \frac{2}{κ (A^{T} A) - 1} + \sqrt{{(1 + \frac{2}{κ (A^{T} A) - 1})}^{2} - 1})}^{- p}]

□
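The two ingredients of this proof, the purely imaginary spectrum $\pm η σ_{i} \sqrt{- 1}$ and the Chebyshev contraction factor, can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: the block form $[\begin{matrix} 0 & η A \\ - η A^{T} & 0 \end{matrix}]$ is taken as the matrix $G = I - G^{(S i m)}$ for the bilinear game $x^{T} A y$ , and the helper names `sim_system_matrix`, `chebyshev_T` and `contraction_factor` are hypothetical.

```python
import numpy as np

def sim_system_matrix(A, eta):
    """Assumed form of I - G^(Sim) for the bilinear game x^T A y."""
    n = A.shape[0]
    Z = np.zeros((n, n))
    return np.block([[Z, eta * A], [-eta * A.T, Z]])

def chebyshev_T(p, t):
    """Chebyshev polynomial of the first kind T_p(t) for t >= 1 (closed form)."""
    s = t + np.sqrt(t * t - 1.0)
    return 0.5 * (s ** p + s ** (-p))

def contraction_factor(A, p):
    """Per-cycle factor 1 / T_p(1 + 2 / (kappa(A^T A) - 1)); needs kappa > 1."""
    s = np.linalg.svd(A, compute_uv=False)
    kappa = (s[0] / s[-1]) ** 2          # condition number of A^T A
    return 1.0 / chebyshev_T(p, 1.0 + 2.0 / (kappa - 1.0))

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))          # illustrative full-rank matrix
eta = 0.3

eigs = np.linalg.eigvals(sim_system_matrix(A, eta))
sing = np.linalg.svd(A, compute_uv=False)
print(np.sort(np.abs(eigs.imag)))        # each eta * sigma_i appears twice
print(np.sort(np.repeat(eta * sing, 2)))
print([contraction_factor(A, p) for p in (1, 2, 4, 8)])
```

The factor depends only on $κ (A^{T} A)$ and the table size $p$ , and the learning rate $η$ does not enter the computation.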

Next, we give the proof of Theorem 4.2.

Theorem C.3

(Global convergence for alternating GDA-AM on bilinear problem). Denote the distance between the stationary point $w^{*}$ and current iterate $w_{(k + 1) p}$ of Algorithm 3 with Anderson restart dimension $p$ as $N_{(k + 1) p} = d i s t (w^{*}, w_{(k + 1) p})$ . Assume $A$ is normalized such that its largest singular value is equal to 1. Then when the learning rate $η$ is less than 2, we have the following bound for $N_{t}$

N_{(k + 1) p}^{2} \leq \sqrt{1 + \frac{2 η}{2 - η}} {(\frac{r}{c})}^{p} N_{k p}^{2}

where $c$ and $r$ are the center and radius of a disk $D (c, r)$ which includes all the eigenvalues of $G$ in equation 4.3. Especially, $\frac{r}{c} < 1$ .

Proof. Since the residual $r_{p}$ of AA at the $p$ th iteration has the form

r_{p} = (I - \sum_{i = 1}^{p} β_{i} G^{i}) r_{0},

and AA minimizes the residual, we have

{‖r_{(k + 1) p}‖}_{2}^{2} \leq \min_{β} {‖r_{k p} - \sum_{i = 1}^{p} β_{i} G^{i} r_{k p}‖}_{2}^{2} \leq \min_{f_{p} \in P_{p}} {‖f_{p} (G) r_{k p}‖}_{2}^{2},

where $P_{p}$ is the family of polynomials with degree p such that $f_{p} (0) = 1$ , $\forall f_{p} \in P_{p}$ . It’s easy to see that $G$ is unitarily similar to a block diagonal matrix $Λ$ with 2 × 2 blocks as follows:

[\begin{matrix} 0 & η σ_{i} \\ - η σ_{i} & {(η σ_{i})}^{2} \end{matrix}] \forall i \in [n] .

Thus the eigenvalues of $G$ can be easily identified as

λ_{\pm i} = \frac{(η σ_{i} (η σ_{i} \pm \sqrt{{(η σ_{i})}^{2} - 4}))}{2}, i \in [n] .

where $σ_{1} \geq σ_{2} \geq \dots \geq σ_{n}$ are the singular values of $A$ . Furthermore, the eigenvector and eigenvalue associated with each 2 × 2 diagonal block are

[\begin{matrix} 0 & η σ_{i} \\ - η σ_{i} & {(η σ_{i})}^{2} \end{matrix}] [\begin{matrix} 1 \\ \frac{λ_{\pm i}}{η σ_{i}} \end{matrix}] = λ_{\pm i} [\begin{matrix} 1 \\ \frac{λ_{\pm i}}{η σ_{i}} \end{matrix}]

Thus $G$ is diagonalizable; we denote by $X$ the matrix whose columns are the eigenvectors of $G$ . The real parts of the eigenvalues of $G$ satisfy

\Re (λ_{\pm i}) \geq \frac{{(η σ_{i})}^{2}}{2}, i \in [n] .

(34)

And since $|η σ_{i}| \geq |\sqrt{{(η σ_{i})}^{2} - 4}|$ , all the eigenvalues are included in a disk $D (c, r)$ that is contained in the right half-plane. Because this disk excludes the origin, its radius is smaller than the distance from its center to the origin, which gives $\frac{r}{c} < 1$ . We start from the following inequality:

\begin{array}{l} N_{(k + 1) p} = {‖w_{(k + 1) p} - w^{*}‖}_{2} = {‖G^{- 1} r_{(k + 1) p}‖}_{2} \leq \min_{f_{p} \in P_{p}} {‖G^{- 1} f_{p} (G) r_{k p}‖}_{2} \\ = \min_{f_{p} \in P_{p}} {‖G^{- 1} f_{p} (G) G (w_{k p} - w^{*})‖}_{2} = \min_{f_{p} \in P_{p}} {‖G^{- 1} G f_{p} (G) (w_{k p} - w^{*})‖}_{2} = \min_{f_{p} \in P_{p}} {‖f_{p} (G) (w_{k p} - w^{*})‖}_{2} \end{array}

(35)

We will use the eigendecomposition of $G$ and the special polynomial ${(\frac{c - t}{c})}^{p}$ to derive the inequality in Theorem C.3. We already know that $\frac{r}{c} < 1$ . If we choose $g_{p} (t) = {(\frac{c - t}{c})}^{p}$ , which satisfies $g_{p} (0) = 1$ , we obtain

\min_{f_{p} \in P_{p}} {‖f_{p} (G) (w_{k p} - w^{*})‖}_{2} \leq {‖g_{p} (G) (w_{k p} - w^{*})‖}_{2}

which implies

\min_{f_{p} \in P_{p}} {‖f_{p} (G) (w_{k p} - w^{*})‖}_{2} \leq ‖g_{p} (X Λ X^{- 1})‖ {‖(w_{k p} - w^{*})‖}_{2}

Since $G$ is diagonalizable (which has been shown above), we assume the eigendecomposition of $G$ is $G = X Λ X^{- 1}$ . Then

{‖g_{p} (G) (w_{k p} - w^{*})‖}_{2} \leq ‖ X ‖ ‖X^{- 1}‖ \max_{1 \leq i \leq 2 n} |g_{p} (λ_{i})| {‖w_{k p} - w^{*}‖}_{2} \leq κ_{G} {(\frac{r}{c})}^{p} {‖w_{k p} - w^{*}‖}_{2}

where $κ_{G}$ is the condition number of $X$ . The last inequality comes from Lemma 6.26 and Proposition 6.32 in Saad (2003). Since $G$ and $Λ$ are unitarily similar, $κ_{G}$ is equal to the condition number of the eigenvector matrix of $Λ$ . The eigenvector matrix of $Λ$ is a block diagonal matrix whose $i$ th block is $[\begin{matrix} 1 & 1 \\ \frac{λ_{+ i}}{η σ_{i}} & \frac{λ_{- i}}{η σ_{i}} \end{matrix}]$ . Thus the set of singular values of the eigenvector matrix of $Λ$ is the union of the singular values of these 2-by-2 blocks. Under the assumption that the largest singular value of $A$ is equal to 1 and the learning rate is less than 2, it is easy to verify that the singular values of the eigenvector matrix of $Λ$ are $\sqrt{2 \pm η σ_{i}}$ . Thus, $κ_{G} = \frac{\sqrt{2 + η σ_{\max}}}{\sqrt{2 - η σ_{\max}}} = \frac{\sqrt{2 + η}}{\sqrt{2 - η}} = \sqrt{1 + \frac{2 η}{2 - η}}$ . □
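The closed-form eigenvalues of the 2 × 2 blocks and the singular values $\sqrt{2 \pm η σ_{i}}$ used in this proof can be verified numerically. The sketch below is illustrative: `a` stands for the product $η σ_{i}$ with $0 < a < 2$ , and the helper names are hypothetical.

```python
import numpy as np

def alt_block(a):
    """2x2 diagonal block [[0, a], [-a, a^2]] with a = eta * sigma_i."""
    return np.array([[0.0, a], [-a, a * a]])

def alt_eigenvalues(a):
    """Closed-form eigenvalues a * (a +/- sqrt(a^2 - 4)) / 2 of the block."""
    root = np.sqrt(complex(a * a - 4.0))
    return a * (a + root) / 2.0, a * (a - root) / 2.0

def eigenvector_block(a):
    """Eigenvector matrix [[1, 1], [lam_+ / a, lam_- / a]] of the block."""
    lam_plus, lam_minus = alt_eigenvalues(a)
    return np.array([[1.0, 1.0], [lam_plus / a, lam_minus / a]])

def kappa_bound(eta):
    """kappa_G = sqrt((2 + eta) / (2 - eta)) when sigma_max = 1 and eta < 2."""
    return np.sqrt((2.0 + eta) / (2.0 - eta))

for a in (0.3, 1.0, 1.7):                      # illustrative values of eta * sigma_i
    X = eigenvector_block(a)
    svals = np.linalg.svd(X, compute_uv=False)
    print(a, alt_eigenvalues(a), svals, np.sqrt([2 + a, 2 - a]))
```

For $a < 2$ the two eigenvalues are complex conjugates with real part $a^{2} / 2$ , so the second row of each eigenvector has unit modulus, which is what produces the singular values $\sqrt{2 \pm a}$ .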

C.3. Discussion of obtained rates

We would like to point out that our derived bounds are hard to compare directly with existing bounds. Alternatively, we can derive another bound for simultaneous GDA-AM that allows such a comparison. If we use the inequality $T_{p} (t) \geq \frac{1}{2} ({(t + \sqrt{t^{2} - 1})}^{p})$ , we can obtain the bound $ρ (A) = 4 {(\frac{\sqrt{κ (A^{T} A)} - 1}{\sqrt{κ (A^{T} A)} + 1})}^{2} = 4 (1 - O (\frac{1}{\sqrt{κ (A^{T} A)}}))$ , which is in a form that is comparable with EG and can compete with EG with positive momentum. The numerical experiments in Figure 2b verify that our bound is smaller than that of EG. We wanted to numerically compare our rate with that of EG with positive momentum. However, the bound of EG with positive momentum is asymptotic and does not specify the constants, so we cannot compare the two numerically. We do provide an empirical comparison between GDA-AM and EG with positive momentum for bilinear problems in Appendix D.1, which shows that GDA-AM outperforms EG with positive momentum. Regarding alternating GDA-AM, we would like to note that the bound in Theorem 4.2 depends on the eigenvalue distribution of the matrix $G$ . The condition number is not directly related to the distribution of the eigenvalues of a nonsymmetric matrix $G$ . Thus, the condition number is not a precise metric to characterize the convergence. If these eigenvalues are clustered, then our bound can be small. On the other hand, if these eigenvalues are evenly distributed in the complex plane, then the bound can be very close to 1.
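The intermediate step behind the comparable form of the simultaneous bound can be written out as follows. This is a sketch of the algebra only, with the shorthand $κ = κ (A^{T} A) > 1$ and $t = 1 + \frac{2}{κ - 1}$ introduced here for readability; it shows the per-cycle factor implied by the Chebyshev inequality and does not reconstruct how the constant and exponent in $ρ (A)$ are chosen in the text.

```latex
% Shorthand (assumed here): \kappa = \kappa(A^{T}A) > 1,
% t = 1 + \frac{2}{\kappa - 1} = \frac{\kappa + 1}{\kappa - 1}.
\begin{align*}
t^{2} - 1 &= \frac{(\kappa + 1)^{2} - (\kappa - 1)^{2}}{(\kappa - 1)^{2}}
           = \frac{4\kappa}{(\kappa - 1)^{2}}, \\
t + \sqrt{t^{2} - 1} &= \frac{\kappa + 1 + 2\sqrt{\kappa}}{\kappa - 1}
           = \frac{(\sqrt{\kappa} + 1)^{2}}{(\sqrt{\kappa} - 1)(\sqrt{\kappa} + 1)}
           = \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}, \\
\frac{1}{T_{p}(t)} &\leq \frac{2}{\bigl(t + \sqrt{t^{2} - 1}\bigr)^{p}}
           = 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{p}.
\end{align*}
```

The last line uses $T_{p} (t) \geq \frac{1}{2} {(t + \sqrt{t^{2} - 1})}^{p}$ , and the ratio $\frac{\sqrt{κ} - 1}{\sqrt{κ} + 1}$ equals $1 - \frac{2}{\sqrt{κ} + 1}$ , which is the source of the $1 - O (1 / \sqrt{κ})$ dependence.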

More importantly, we would like to stress several technical contributions.

Theorems 4.1 and 4.2 provide nonasymptotic guarantees, while most other results are asymptotic. For example, EG with positive momentum can achieve an asymptotic rate of $1 - O (1 / \sqrt{κ})$ under strong assumptions (Azizian et al., 2020).
Our contribution is not only to fix the convergence issue of GDA by applying Anderson Mixing. We also arrive at a convergent and tight bound through an original analysis rather than by adopting existing ones. We developed Theorems 4.1 and 4.2 from a new perspective because applying existing theoretical results fails to give either convergent or tight bounds.
Theorems 4.1 and 4.2 require only mild conditions and reflect how the table size $p$ controls the convergence rate. The bound in Theorem 4.1 is independent of the learning rate $η$ . In contrast, the convergence results of other methods such as EG and OG depend on the learning rate, which may yield less than desirable results for ill-specified learning rates.

C.4. Convex-concave and general case

Given the widespread usage of minimax problems in applications of machine learning, it is natural to ask about the properties of GDA-AM when it is applied to general nonconvex-nonconcave settings. If $f$ is a nonconvex-nonconcave function, the problem of finding a global Nash equilibrium is NP-hard in general. Recently, Jin et al. (2020) showed that local or global Nash equilibria may not exist in nonconvex-nonconcave settings and proposed a new notion, local minimax, as defined below:

Definition 4.

A point $(x^{⋆}, y^{⋆})$ is said to be a local minimax point of $f$ , if there exists $δ_{0} > 0$ and a function $h$ satisfying $h (δ) \to 0$ as $δ \to 0$ , such that for any $δ \in (0, δ_{0}]$ , and any $(x, y)$ satisfying $‖x - x^{⋆}‖ \leq δ$ and $‖y - y^{⋆}‖ \leq δ$ , we have

f (x^{⋆}, y) \leq f (x^{⋆}, y^{⋆}) \leq \max_{y^{'} : ‖y^{'} - y^{⋆}‖ \leq h (δ)} f (x, y^{'}) .

Jin et al. (2020) also establishes the following first- and second-order conditions to characterize local minimax:

Proposition 1

(First-order Condition). Any local minimax point $(x^{*}, y^{*})$ satisfies $\nabla f (x^{*}, y^{*}) = 0$ .

Proposition 2

(Second-order Necessary Condition). Any local minimax point $(x^{*}, y^{*})$ satisfies $\nabla_{yy} f (x^{*}, y^{*}) ≼ 0$ and $\nabla_{xx} f (x^{*}, y^{*}) - \nabla_{xy} f (x^{*}, y^{*}) {(\nabla_{yy} f (x^{*}, y^{*}))}^{- 1} \nabla_{yx} f (x^{*}, y^{*}) ≽ 0$ .

Proposition 3

(Second-order Sufficient Condition). Any stationary point $(x^{*}, y^{*})$ satisfying $\nabla_{yy} f (x^{*}, y^{*}) ≺ 0$ and $\nabla_{xx} f (x^{*}, y^{*}) - \nabla_{xy} f (x^{*}, y^{*}) {(\nabla_{yy} f (x^{*}, y^{*}))}^{- 1} \nabla_{yx} f (x^{*}, y^{*}) ≻ 0$ is a local minimax point.

Given the second-order conditions for local minimax points, answering the above question turns out to be extremely challenging, since GDA-AM is a first-order method. Nevertheless, we can prove the following result for GDA-AM:

Theorem C.4

(Local minimax as subset of limiting points of GDA-AM). Consider a general objective function $f (x, y)$ . The set of limiting points of GDA-AM for minimax problem

\min_{x \in ℝ^{n}} \max_{y \in ℝ^{n}} f (x, y)

includes the local minimax points of this function.

The definition of local minimax is stronger than that of first order $ϵ$ point. The convergence analysis for complexity of finding $ϵ$ stationary point is included in the next section. The proof of Theorem C.4 needs the result from the following theorem.

Theorem C.5

(Calvetti et al. (2002)). Let $δ$ satisfy $0 < δ \leq δ_{0}$ for some constant $δ_{0} > 0$ (refer to Calvetti et al. (2002) for details), and let $b^{δ} \in X$ satisfy $‖b - b^{δ}‖ \leq δ$ . Let $k \leq ℓ$ and let $x_{k}^{δ}$ denote the kth iterate determined by the GMRES method applied to equation $A x = b^{δ}$ , with initial guess $x_{0}^{δ} = 0$ . Similarly, let $x_{k}$ denote the kth iterate determined by the GMRES method applied to equation $A x = b$ with initial guess $x_{0} = 0$ . Then, there are constants $σ_{k}$ independent of $δ$ , such that

‖x_{k} - x_{k}^{δ}‖ \leq σ_{k} δ, 1 \leq k \leq ℓ

Then, we give the proof of Theorem C.4.

Proof of Theorem C.4.

For notational simplicity, we will denote $\nabla_{xx} f (x^{*}, y^{*})$ , $\nabla_{xy} f (x^{*}, y^{*})$ and $\nabla_{yy} f (x^{*}, y^{*})$ by $H_{x^{*} x^{*}}$ , $H_{x^{*} y^{*}}$ and $H_{y^{*} y^{*}}$ , respectively. Simultaneous GDA can be written as

w_{t + 1} = [\begin{array}{l} x_{t + 1} \\ y_{t + 1} \end{array}] = [\begin{array}{l} x_{t} - η \nabla_{x} f (x_{t}, y_{t}) \\ y_{t} + η \nabla_{y} f (x_{t}, y_{t}) \end{array}] .

Since the function is differentiable, a Taylor expansion holds for $\nabla_{x} f (x_{t}, y_{t})$ and $\nabla_{y} f (x_{t}, y_{t})$ at a local minimax point $w^{*} = (x^{*}, y^{*})$ ,

\begin{array}{l} \nabla_{x} f (x_{t}, y_{t}) = \nabla_{x} f (x^{*}, y^{*}) + H_{x^{*} x^{*}} (x_{t} - x^{*}) + H_{x^{*} y^{*}} (y_{t} - y^{*}) + o ({‖w_{t} - w^{*}‖}_{2}) \\ \nabla_{y} f (x_{t}, y_{t}) = \nabla_{y} f (x^{*}, y^{*}) + H_{y^{*} y^{*}} (y_{t} - y^{*}) + H_{y^{*} x^{*}} (x_{t} - x^{*}) + o ({‖w_{t} - w^{*}‖}_{2}) . \end{array}

Use the fact that $\nabla f (x^{*}, y^{*}) = 0$ to simplify the above equations and obtain

\begin{array}{l} \nabla_{x} f (x_{t}, y_{t}) = H_{x^{*} x^{*}} (x_{t} - x^{*}) + H_{x^{*} y^{*}} (y_{t} - y^{*}) + o ({‖w_{t} - w^{*}‖}_{2}) \\ \nabla_{y} f (x_{t}, y_{t}) = H_{y^{*} y^{*}} (y_{t} - y^{*}) + H_{y^{*} x^{*}} (x_{t} - x^{*}) + o ({‖w_{t} - w^{*}‖}_{2}) . \end{array}

Inserting the above formulas into the iteration scheme, it yields

w_{t + 1} = [\begin{array}{l} x_{t + 1} \\ y_{t + 1} \end{array}] = [\begin{matrix} I - η H_{x^{*} x^{*}} & - η H_{x^{*} y^{*}} \\ η H_{y^{*} x^{*}} & I + η H_{y^{*} y^{*}} \end{matrix}] [\begin{array}{l} x_{t} \\ y_{t} \end{array}] + [\begin{matrix} η H_{x^{*} x^{*}} x^{*} + η H_{x^{*} y^{*}} y^{*} + ϵ \\ - η H_{y^{*} y^{*}} y^{*} - η H_{y^{*} x^{*}} x^{*} + ϵ \end{matrix}]

where $ϵ$ denotes the higher order error $o ({‖w_{t} - w^{*}‖}_{2})$ . According to Theorem 2.2, we know that simultaneous GDA-AM is equivalent to applying GMRES to solve the following linear system

(I - [\begin{matrix} (1 - α) I - η H_{x^{*} x^{*}} & - η H_{x^{*} y^{*}} \\ η H_{y^{*} x^{*}} & (1 - α) I + η H_{y^{*} y^{*}} \end{matrix}]) w = [\begin{matrix} α I + η H_{x^{*} x^{*}} & η H_{x^{*} y^{*}} \\ - η H_{y^{*} x^{*}} & α I - η H_{y^{*} y^{*}} \end{matrix}] w = b + ϵ

where $b = [\begin{matrix} η H_{x^{*} x^{*}} x^{*} + η H_{x^{*} y^{*}} y^{*} \\ - η H_{y^{*} y^{*}} y^{*} - η H_{y^{*} x^{*}} x^{*} \end{matrix}]$ . We now consider GMRES applied to the following linear system, in which the perturbation $ϵ$ is dropped:

[\begin{matrix} α I + η H_{x^{*} x^{*}} & η H_{x^{*} y^{*}} \\ - η H_{y^{*} x^{*}} & α I - η H_{y^{*} y^{*}} \end{matrix}] \tilde{w} = b

The symmetric part of the coefficient matrix of the above linear system is

[\begin{matrix} α I + η H_{x^{*} x^{*}} & 0 \\ 0 & α I - η H_{y^{*} y^{*}} \end{matrix}] .

According to Proposition 2, $α I - η H_{y^{*} y^{*}}$ is positive definite since $H_{y^{*} y^{*}} ≼ 0$ . If $H_{x^{*} x^{*}}$ is positive semidefinite, then $α I + η H_{x^{*} x^{*}}$ is positive definite and we’re done. Otherwise, assume $λ_{\min} (H_{x^{*} x^{*}}) < 0$ . Then for fixed $α$ , when $η < - \frac{α}{λ_{\min} (H_{x^{*} x^{*}})}$ , $α I + η H_{x^{*} x^{*}}$ will be positive definite. Then according to Theorem 2.2, we know GDA-AM indeed converges. Let’s create a new companion linear system as follows

[\begin{matrix} α I + η H_{x^{*} x^{*}} & η H_{x^{*} y^{*}} \\ - η H_{y^{*} x^{*}} & α I - η H_{y^{*} y^{*}} \end{matrix}] \hat{w} = b + α w^{*}

Note that $\hat{w} = w^{*}$ and GMRES on this companion linear system is convergent under a suitable choice of learning rate $η$ . Let the iterates of GMRES for $\tilde{w}$ , $\hat{w}$ , $w$ be denoted by ${\tilde{w}}_{t}$ , ${\hat{w}}_{t}$ , $w_{t}$ . Then $‖{\tilde{w}}_{t} - {\hat{w}}_{t}‖ \leq ‖{\tilde{w}}_{t} - w_{t}‖ + ‖{\hat{w}}_{t} - w_{t}‖$ . According to Theorem C.5, we also have $‖{\tilde{w}}_{t} - w_{t}‖ \leq σ_{k} ϵ$ , $1 \leq k \leq t$ . Furthermore, again according to Theorem C.5, we know $‖{\hat{w}}_{t} - w_{t}‖ \leq σ_{k} (α ‖w^{*}‖ + ϵ)$ . Starting from an initial point very close to $w^{*}$ and letting $t \to \infty$ and $α$ , $ϵ \to 0$ , ${\hat{w}}_{t}$ will converge to $w^{*} = (x^{*}, y^{*})$ , which means that the local minimax point $w^{*} = (x^{*}, y^{*})$ is a limiting point of GDA-AM. □
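The step-size condition used in this proof, $η < - α / λ_{\min} (H_{x^{*} x^{*}})$ when $H_{x^{*} x^{*}}$ has a negative eigenvalue, can be illustrated with a small numerical check. The sketch below is an example under stated assumptions: the helper names and the Hessian blocks are hypothetical, and $H_{y^{*} x^{*}}$ is taken as the transpose of $H_{x^{*} y^{*}}$ , so that the symmetric part of the coefficient matrix is block diagonal as stated in the proof.

```python
import numpy as np

def coefficient_matrix(H_xx, H_xy, H_yy, eta, alpha):
    """Coefficient matrix [[aI + eta H_xx, eta H_xy], [-eta H_yx, aI - eta H_yy]]."""
    n, m = H_xx.shape[0], H_yy.shape[0]
    return np.block([[alpha * np.eye(n) + eta * H_xx, eta * H_xy],
                     [-eta * H_xy.T, alpha * np.eye(m) - eta * H_yy]])

def step_size_threshold(H_xx, alpha):
    """Largest eta (exclusive) keeping alpha*I + eta*H_xx positive definite."""
    lam_min = np.linalg.eigvalsh(H_xx).min()
    return np.inf if lam_min >= 0 else -alpha / lam_min

def symmetric_part_is_pd(M):
    """True if (M + M^T) / 2 is positive definite."""
    return np.linalg.eigvalsh((M + M.T) / 2).min() > 0

# Illustrative Hessian blocks: H_xx is indefinite, H_yy is negative definite.
H_xx = np.diag([-2.0, 1.0])
H_xy = np.array([[1.0, 0.5], [0.0, 1.0]])
H_yy = np.diag([-1.0, -3.0])
alpha = 0.5

eta_max = step_size_threshold(H_xx, alpha)
print(eta_max)
print(symmetric_part_is_pd(coefficient_matrix(H_xx, H_xy, H_yy, 0.8 * eta_max, alpha)))
print(symmetric_part_is_pd(coefficient_matrix(H_xx, H_xy, H_yy, 1.2 * eta_max, alpha)))
```

Below the threshold the symmetric part is positive definite, which is the hypothesis needed to invoke the GMRES convergence result; above it the hypothesis fails for this example.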

Theorem C.6.

For a strongly-convex-strongly-concave function $f (x, y)$ , GDA-AM will converge to the Nash equilibrium of this function.

Proof: Since a strongly-convex-strongly-concave function $f (x, y)$ has a unique Nash equilibrium, which is also the unique minimax point, this minimax point must be the limiting point of GDA-AM according to Theorem C.4.

C.4.1. Bilinear-quadratic games

Moreover, we can further show that the GDA-AM converges on bilinear-quadratic games. Consider a quadratic problem as follows,

\min_{x \in ℝ^{n}} \max_{y \in ℝ^{n}} f (x, y) = x^{T} Ay + x^{T} Bx - y^{T} Cy + b^{T} x + c^{T} y,

(36)

where $A$ is full rank, $B$ and $C$ are both positive definite.

Theorem C.7.

[Global convergence for simultaneous GDA-AM on bilinear-quadratic problem] Let $r_{t}^{(S i m)}$ be the residual of Algorithm 2 being applied to problem equation 36. For some constant $ρ < 1$ ,

{‖r_{t}^{(S i m)}‖}_{2} \leq \underbrace{{(1 - \frac{{(λ_{\min} (J^{T} + J))}^{2}}{4 λ_{\max} (J^{T} J)})}^{t / 2}}_{ρ^{t / 2}} {‖r_{0}‖}_{2},

(37)

where $J = [\begin{matrix} η B & η A \\ - η A^{T} & η C \end{matrix}]$ and $λ_{\min}$ and $λ_{\max}$ denote the smallest and largest eigenvalue, respectively.

The convergence of GMRES is characterized by the following theorem, which we use to establish the convergence rate of GDA-AM for bilinear-quadratic games.

Theorem C.8

(Elman (1982)). Consider solving a linear system $Ex = b$ using GMRES. Let $r_{t} = b - E x_{t}$ be the residual at tth iteration. If the Hermitian part of $E$ is positive definite, then for some positive constant $ρ < 1$ , it holds that

{‖r_{t}‖}_{2} \leq \underbrace{{(1 - \frac{{(λ_{\min} (E^{H} + E))}^{2}}{4 λ_{\max} (E^{H} E)})}^{t / 2}}_{ρ^{t / 2}} {‖r_{0}‖}_{2} .

(38)

Proof of Theorem C.7.

Applying simultaneous GDA-AM to solve the above problem is equivalent to applying Anderson Mixing on the following fixed point iteration:

[\begin{array}{l} x_{t + 1} \\ y_{t + 1} \end{array}] = \underbrace{[\begin{matrix} I - η B & - η A \\ η A^{T} & I - η C \end{matrix}]}_{G^{(Q u a d - s i m)}} \underbrace{[\begin{array}{l} x_{t} \\ y_{t} \end{array}]}_{w_{t}^{(Q u a d - s i m)}} + \underbrace{[\begin{array}{l} - η b \\ - η c \end{array}]}_{b^{(Q u a d - s i m)}} .

(39)

We know that we need to study the convergence properties of GMRES for solving the following linear system

[\begin{matrix} η B & η A \\ - η A^{T} & η C \end{matrix}] w = b .

(40)

For notational simplicity, the superscripts have been dropped. Denote the coefficient matrix $[\begin{matrix} η B & η A \\ - η A^{T} & η C \end{matrix}]$ by $J$ . The symmetric part of $J$ is

\frac{J + J^{T}}{2} = [\begin{matrix} \frac{η}{2} (B + B^{T}) & 0 \\ 0 & \frac{η}{2} (C + C^{T}) \end{matrix}]

which is positive definite. Then, by Theorem C.8, the following convergence rate holds immediately: for some constant $0 < ρ < 1$ ,

{‖r_{t}‖}_{2} = \min_{p \in ℙ_{t}^{1}} {‖p (J) r_{0}‖}_{2} \leq \underbrace{{(1 - \frac{{(λ_{\min} (J + J^{T}))}^{2}}{4 λ_{\max} (J^{T} J)})}^{t / 2}}_{ρ^{t / 2}} {‖r_{0}‖}_{2} = ρ^{t / 2} {‖r_{0}‖}_{2}

(41)

Note that the convergence of GDA-AM for bilinear-quadratic games can also be analyzed via the numerical range, as shown in Bollapragada et al. (2018). Although we previously showed that an analysis based on the numerical range cannot yield a convergent bound for bilinear games, the analysis in Bollapragada et al. (2018) can be extended to bilinear-quadratic games. When $B$ and $C$ are positive definite, 1 is outside the numerical range of the matrix $G^{(Q u a d - s i m)}$ , as shown in Figure 7a. When $B$ or $C$ is not positive definite, 1 can be included in the numerical range of the matrix $G^{(Q u a d - s i m)}$ , as shown in Figure 7b. That is, an analysis of the bilinear-quadratic problem based on the numerical range (Crouzeix & Palencia, 2017; Bollapragada et al., 2018) leads to a convergent result when $B$ and $C$ are positive definite, but cannot yield convergent results when $B$ or $C$ is not positive definite.
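The rate constant in Theorem C.7 can be evaluated directly and compared against GMRES residuals on a small instance. The sketch below is illustrative and makes several assumptions: the matrices `A`, `B`, `C`, the step size `eta` and the right-hand side are randomly generated test data, and the GMRES residuals are computed by an explicit least-squares solve over a Krylov basis (adequate for a few steps on a tiny problem, not a production GMRES).

```python
import numpy as np

def elman_rho(J):
    """rho = 1 - lambda_min(J^T + J)^2 / (4 * lambda_max(J^T J))."""
    sym_min = np.linalg.eigvalsh(J + J.T).min()
    jtj_max = np.linalg.eigvalsh(J.T @ J).max()
    return 1.0 - sym_min ** 2 / (4.0 * jtj_max)

def gmres_residual_norms(J, b, steps):
    """Residual norms of GMRES with x0 = 0 via least squares over a Krylov basis."""
    norms = [np.linalg.norm(b)]
    K = np.empty((b.size, 0))
    v = b.copy()
    for _ in range(steps):
        K = np.column_stack([K, v / np.linalg.norm(v)])
        coef = np.linalg.lstsq(J @ K, b, rcond=None)[0]
        norms.append(np.linalg.norm(b - J @ K @ coef))
        v = J @ K[:, -1]                       # next Krylov direction
    return norms

rng = np.random.default_rng(2)
n, eta = 3, 0.1
Mb, Mc, A = (rng.standard_normal((n, n)) for _ in range(3))
B = Mb @ Mb.T + np.eye(n)                      # positive definite
C = Mc @ Mc.T + np.eye(n)                      # positive definite
J = eta * np.block([[B, A], [-A.T, C]])

rho = elman_rho(J)
rhs = rng.standard_normal(2 * n)
norms = gmres_residual_norms(J, rhs, steps=5)
for t, r in enumerate(norms):
    print(t, r, rho ** (t / 2) * norms[0])     # residual vs. bound
```

The bound is typically pessimistic, since it only uses the extreme eigenvalues of $J^{T} + J$ and $J^{T} J$ , but it guarantees a strict decrease whenever the symmetric part of $J$ is positive definite.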

C.5. Stochastic convex-nonconcave case

In this section, we study the convergence of GDA-AM for convex-nonconcave problems in the stochastic setting, under the same assumptions as Wei et al. (2021b) and Xu et al. (2021). The recent work of Wei et al. (2021b) proves the convergence of stochastic gradient descent with Anderson Mixing for minimization problems. The convergence of GDA-AM for minimax optimization builds on that result with several modifications. The minimax problem is equivalent to minimizing the function $Φ (\cdot) = \max_{y \in Y} f (\cdot, y)$ (Lin et al., 2020). We are interested in the complexity of finding a pair $(x, y)$ that is an $ϵ$ -stationary point, rather than in the analysis of a single point $x$ .

Definition 5.

(Lin et al., 2020) A pair of points $(x, y)$ is an $ϵ$ -stationary point $(ϵ \geq 0)$ of a differentiable function $Φ$ if

\begin{array}{l} ‖\nabla_{x} f (x, y)‖ \leq ϵ \\ ‖P_{Y} (y + (1 / ℓ) \nabla_{y} f (x, y)) - y‖ \leq ϵ / ℓ \end{array}
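The two conditions of Definition 5 can be checked numerically once the gradients are available. The sketch below is only illustrative: it assumes that $Y$ is a box $[lo, hi]^{n}$ , so that the projection $P_{Y}$ is a coordinate-wise clip, and the names `project_box` and `is_eps_stationary` are hypothetical.

```python
import math

def project_box(y, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (assumed shape of Y)."""
    return [min(max(v, lo), hi) for v in y]

def is_eps_stationary(grad_x, grad_y, y, ell, eps, lo=-1.0, hi=1.0):
    """Check ||grad_x f|| <= eps and ||P_Y(y + grad_y f / ell) - y|| <= eps / ell."""
    gx_norm = math.sqrt(sum(g * g for g in grad_x))
    stepped = [yi + gi / ell for yi, gi in zip(y, grad_y)]
    projected = project_box(stepped, lo, hi)
    move = math.sqrt(sum((p - yi) ** 2 for p, yi in zip(projected, y)))
    return gx_norm <= eps and move <= eps / ell

# Toy function f(x, y) = x * y, so grad_x f = y and grad_y f = x.
at_origin = is_eps_stationary(grad_x=[0.0], grad_y=[0.0], y=[0.0], ell=1.0, eps=0.1)
off_origin = is_eps_stationary(grad_x=[0.0], grad_y=[1.0], y=[0.0], ell=1.0, eps=0.1)
# On the boundary y = hi the ascent step is projected back, so the second condition holds.
on_boundary = is_eps_stationary(grad_x=[0.0], grad_y=[1.0], y=[1.0], ell=1.0, eps=0.1)
print(at_origin, off_origin, on_boundary)
```

The third call shows why the projected condition is used instead of a bound on $‖\nabla_{y} f‖$ : at a boundary point the ascent direction may be nonzero while no feasible improvement exists.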

Assumption 1.

$f : ℝ^{d} \mapsto ℝ$ is continuously differentiable. $f (x) \geq f^{l o w} > - \infty$ for any $x \in ℝ^{d}$ . $\nabla f$ is globally L-Lipschitz continuous; namely $‖ \nabla f (x) - \nabla f (y) ‖_{2} \leq L ‖ x - y ‖_{2}$ for any $x$ , $y \in ℝ^{d}$ .

Figure 7: — Numerical range of fixed-point operator (Simultaneous `GDA-AM`) $G = [\begin{matrix} I - η B & - η A \\ η A^{T} & I - η C \end{matrix}]$ for bilinear-quadratic games.

Assumption 2.

For any iteration $k$ , the stochastic gradient $\nabla f_{ξ_{k}} (x_{k})$ satisfies $E_{ξ_{k}} [\nabla f_{ξ_{k}} (x_{k})] = \nabla f (x_{k})$ , $E_{ξ_{k}} [{‖\nabla f_{ξ_{k}} (x_{k}) - \nabla f (x_{k})‖}_{2}^{2}] \leq σ^{2}$ , where $σ > 0$ , and $ξ_{k}$ , $k = 0, 1, \dots$ , are independent samples that are independent of ${\{x_{i}\}}^{k}$

Theorem C.9.

For a general convex-nonconcave function $f$ , suppose that Assumptions 1 and 2 hold. Let the batch size be $n_{t} = n$ for $t = 0, \dots, N - 1$ , and let $C > 0$ be a constant. Set $β_{t} = \frac{μ}{4 L (1 + C^{- 1})}$ , $δ_{t} \geq C β_{t}^{- 2}$ , and $0 \leq α_{t} \leq \min \{1, β_{t}^{\frac{1}{2}}\}$ , with $α_{t}$ chosen to ensure the positive definiteness of $H_{t}$ . Let $R$ be a random variable following $P_{R} (t) \overset{d e f}{=} Prob \{R = t\} = 1 / N$ , and let $\bar{N}$ be the total number of stochastic GDA-AM calls needed to calculate the stochastic gradients $\tilde{\nabla} f_{S_{t}} (w_{t})$ in our algorithm. To ensure $E [{‖\tilde{\nabla} f (w_{R})‖}_{2}] \leq ϵ$ , the total number $\bar{N}$ of such calls is $O (ϵ^{- 4})$ .

Recall that we can recast GDA scheme as the following fixed point iteration.

w_{t + 1} = G_{η}^{(sim)} (w_{t}) ≜ w_{t} + η V (w_{t}) with w = [\begin{array}{l} x \\ y \end{array}], V (w) = [\begin{matrix} - \nabla_{x} f (x, y) \\ \nabla_{y} f (x, y) \end{matrix}]
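To make the fixed-point view concrete, the following minimal sketch applies the map $G_{η}^{(sim)}$ to the toy bilinear function $f (x, y) = x y$ , which is an assumption chosen for illustration. For this function each simultaneous GDA step multiplies the norm of the iterate by $\sqrt{1 + η^{2}}$ , so the plain fixed-point iteration moves away from the solution $(0, 0)$ .

```python
import math

def V(w):
    """Ascent-descent field for the toy function f(x, y) = x * y."""
    x, y = w
    return (-y, x)  # (-grad_x f, +grad_y f)

def G_sim(w, eta):
    """One simultaneous GDA step written as the fixed-point map w + eta * V(w)."""
    v = V(w)
    return (w[0] + eta * v[0], w[1] + eta * v[1])

eta = 0.1
w = (1.0, 1.0)
norms = [math.hypot(*w)]
for _ in range(50):
    w = G_sim(w, eta)
    norms.append(math.hypot(*w))

print(norms[0], norms[-1])
```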

Ignoring the step size $η$ , let $W_{t}$ and $R_{t}$ record the first- and second-order differences of the most recent $m$ iterates:

W_{t} = [Δ w_{t - m}, Δ w_{t - m + 1}, \dots, Δ w_{t - 1}], R_{t} = [Δ V_{t - m}, Δ V_{t - m + 1}, \dots, Δ V_{t - 1}]

Similarly to Wei et al. (2021b), Anderson mixing can be decoupled into

{\bar{w}}_{t} = w_{t} - W_{t} Γ_{t}, (Projection step) w_{t + 1} = {\bar{w}}_{t} + β_{t} {\bar{V}}_{t}, (Mixing step)

where $β_{t}$ is the mixing parameter, ${\bar{V}}_{t} = V_{t} - R_{t} Γ_{t}$ , and $Γ_{t}$ is obtained by solving

Γ_{t} = \underset{Γ \in ℝ^{m}}{\arg \min} {‖V_{t} - R_{t} Γ‖}_{2} + δ_{t} ‖ Γ ‖_{2}

We argue that arguments similar to those in Wei et al. (2021b) can be applied to the problem here. To see why Anderson mixing works for minimax optimization, assume that the function $f$ is smooth. Then the Hessian matrix for $G_{η}^{(sim)}$ is

H = (\begin{matrix} - \nabla_{xx}^{2} f & - \nabla_{xy}^{2} f \\ \nabla_{yx}^{2} f & \nabla_{yy}^{2} f \end{matrix})

Notice that in a small neighborhood of $w_{t + 1}$ , we have

R_{t} = - H W_{t} = (\begin{matrix} \nabla_{xx}^{2} f & \nabla_{xy}^{2} f \\ - \nabla_{yx}^{2} f & - \nabla_{yy}^{2} f \end{matrix}) W_{t}

Thus ${‖V_{t} - R_{t} Γ‖}_{2} \approx {‖V_{t} + H W_{t} Γ‖}_{2}$ , which is equivalent to solving for a vector $p_{t}$ such that $H p_{t} = V_{t}$ . This is exactly a second-order method for the fixed-point iteration problem. Moreover, AM minimizes the residual at each step; AM is equivalent to GMRES for linear problems precisely because this quadratic approximation is then exact. Finally, we rewrite AM in the quasi-Newton framework, as Wei et al. (2021b) did: $w_{t + 1} = w_{t} + H_{t} V_{t}$ , where $H_{t}$ solves

\min_{H_{t}} {‖H_{t} - β_{t} I‖}_{F} subject to H_{t} R_{t} = - W_{t}

With the damping parameter $α_{t}$ , Anderson mixing takes the following form:

w_{t + 1} = w_{t} + β_{t} V_{t} - α_{t} (W_{t} + β_{t} R_{t}) Γ_{t}

(42)
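The damped update $w_{t + 1} = w_{t} + β_{t} V_{t} - α_{t} (W_{t} + β_{t} R_{t}) Γ_{t}$ can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy field `V` corresponds to the scalar bilinear-quadratic function $f (x, y) = \frac{b}{2} x^{2} + a x y - \frac{c}{2} y^{2}$ , the regularized least-squares problem for $Γ_{t}$ is solved in its squared-norm (ridge) form through the normal equations, and the names `gda_am`, `plain_gda`, and `solve` are hypothetical.

```python
import math

def V(w, a=2.0, b=1.0, c=1.0):
    """Ascent-descent field of the toy game f = b/2 x^2 + a x y - c/2 y^2."""
    x, y = w
    return [-(b * x + a * y), a * x - c * y]

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for i in range(n):
        piv = max(range(i, n), key=lambda k: abs(M[k][i]))
        M[i], M[piv] = M[piv], M[i]
        for k in range(i + 1, n):
            f = M[k][i] / M[i][i]
            for j in range(i, n + 1):
                M[k][j] -= f * M[i][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def gda_am(w0, m=2, beta=0.1, alpha=1.0, delta=1e-12, tol=1e-10, max_iter=500):
    """Simultaneous GDA with damped, regularized Anderson mixing (sketch)."""
    w, v = list(w0), V(w0)
    dW, dV = [], []  # columns of W_t (iterate differences) and R_t (residual differences)
    for t in range(max_iter):
        if math.hypot(*v) < tol:
            return w, t
        k = len(dW)
        gamma = []
        if k > 0:
            # Ridge normal equations: (R^T R + delta I) Gamma = R^T V
            A = [[sum(p * q for p, q in zip(dV[i], dV[j])) + (delta if i == j else 0.0)
                  for j in range(k)] for i in range(k)]
            rhs = [sum(p * q for p, q in zip(dV[i], v)) for i in range(k)]
            gamma = solve(A, rhs)
        w_new = [w[d] + beta * v[d]
                 - alpha * sum(g * (dW[i][d] + beta * dV[i][d]) for i, g in enumerate(gamma))
                 for d in range(len(w))]
        v_new = V(w_new)
        dW.append([p - q for p, q in zip(w_new, w)])
        dV.append([p - q for p, q in zip(v_new, v)])
        dW, dV = dW[-m:], dV[-m:]  # keep only the m most recent differences
        w, v = w_new, v_new
    return w, max_iter

def plain_gda(w0, eta=0.1, tol=1e-10, max_iter=500):
    """Baseline: simultaneous GDA without mixing."""
    w = list(w0)
    for t in range(max_iter):
        v = V(w)
        if math.hypot(*v) < tol:
            return w, t
        w = [wi + eta * vi for wi, vi in zip(w, v)]
    return w, max_iter

w_am, it_am = gda_am([3.0, 3.0])
w_gd, it_gd = plain_gda([3.0, 3.0])
print("GDA-AM iterations:", it_am, "plain GDA iterations:", it_gd)
```

On this two-dimensional linear problem a table size of $m = 2$ already spans the whole space, so the mixed iteration reaches the tolerance in far fewer steps than the plain iteration with the same step size.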

We can also apply very similar arguments to prove the key results of Lemma 1 and Lemma 2 in Wei et al. (2021b). There is, however, a key difference from Wei et al. (2021b): here we consider a minimax optimization problem, so our gradient is actually $V (w) = [\begin{matrix} - \nabla_{x} f (x, y) \\ \nabla_{y} f (x, y) \end{matrix}]$ rather than $\nabla f (w) = [\begin{matrix} \nabla_{x} f (x, y) \\ \nabla_{y} f (x, y) \end{matrix}]$ . This introduces some difficulty into the dynamics of the fixed-point iteration. However, note that $‖ V ‖ = ‖ \nabla f (w) ‖$ and

\begin{array}{l} f (w_{t + 1}) \leq f (w_{t}) + \nabla f {(w_{t})}^{T} (w_{t + 1} - w_{t}) + \frac{L}{2} {‖w_{t + 1} - w_{t}‖}_{2}^{2} \\ \leq f (w_{t}) + \tilde{\nabla} f {(w_{t})}^{T} (w_{t + 1} - w_{t}) + \frac{L}{2} {‖w_{t + 1} - w_{t}‖}_{2}^{2} \\ = f (w_{t}) + \tilde{\nabla} f {(w_{t})}^{T} H_{t} V_{t} + \frac{L}{2} {‖H_{t} V_{t}‖}_{2}^{2} \end{array}

(43)

where

\tilde{\nabla} f (w_{t}) = [\begin{matrix} - \nabla_{x} f (x, y) \\ \nabla_{y} f (x, y) \end{matrix}]

(44)

We call this the ascent-descent gradient (ADG), which is the gradient for the minimax optimization problem

\min_{x \in ℝ^{d}} \max_{y \in ℝ^{d}} f (x, y) .

To see why $\nabla f {(w_{t})}^{T} (w_{t + 1} - w_{t}) \leq \tilde{\nabla} f {(w_{t})}^{T} (w_{t + 1} - w_{t})$ , we consider their difference

(\tilde{\nabla} - \nabla) f {(w_{t})}^{T} (w_{t + 1} - w_{t}) = - 2 \nabla_{x} f {(x_{t}, y_{t})}^{T} (x_{t + 1} - x_{t}) .

For fixed $y_{t}$ , $f (x_{t + 1}, y_{t})$ has the Taylor expansion

f (x_{t + 1}, y_{t}) = f (x_{t}, y_{t}) + \nabla_{x} f {(x_{t}, y_{t})}^{T} (x_{t + 1} - x_{t}) + \frac{1}{2} {(x_{t + 1} - x_{t})}^{T} \nabla_{xx} f (x_{t} + θ (x_{t + 1} - x_{t}), y_{t}) (x_{t + 1} - x_{t})

for some $θ \in (0, 1)$ . If $f$ is convex with respect to $x$ , the second-order term is nonnegative, so applying a safeguard that ensures $f (x_{t + 1}, y_{t}) \leq f (x_{t}, y_{t})$ implies $\nabla_{x} f {(x_{t}, y_{t})}^{T} (x_{t + 1} - x_{t}) \leq 0$ and therefore guarantees $(\tilde{\nabla} - \nabla) f {(w_{t})}^{T} (w_{t + 1} - w_{t}) \geq 0$ . Applying the lemmas in Wei et al. (2021b), we can then derive the convergence of our method for general convex-nonconcave functions in a similar way.

D. Additional Experiments

D.1. Comparison with EG with Positive Momentum

In this section, we provide an additional comparison between GDA-AM and EG with positive momentum. GDA-AM has two major theoretical advantages over EG with positive momentum. First, the convergence of GDA-AM does not require strong assumptions on the choice of hyperparameters. Second, Theorems 4.1 and 4.2 provide nonasymptotic guarantees, whereas the convergence of EG with positive momentum is asymptotic. Experimental results are shown in Figure 8 and indicate that GDA-AM outperforms EG with positive momentum. Finding a good choice of the inner and outer step sizes and of the momentum term for EG is hard. For EG with positive momentum, we set the step size of the extrapolation step to 1, the step size of the update to 0.5, and the positive momentum term to 0.3 after a grid search, as shown in Figures 8b and 8c. In contrast, GDA-AM converges quickly for different step sizes without hyperparameter tuning.

Figure 8: — Additional Comparison between `GDA-AM` and EG with positive momentum

D.2. 1D Minimax functions

We begin by investigating the empirical performance of GDA-AM on 6 non-trivial 1D bivariate functions. We set the initial point to (3, 3) and $m$ to 20 or 5 for all functions. We use the optimal learning rate for each method on each problem. Results are shown in Figures 9, 10, 11, 12, 13 and 14. We observe that GDA-AM consistently outperforms all baselines and improves convergence. It is worth mentioning that the difference between GDA-AM and traditional averaging is twofold. First, traditional averaging does not involve an adaptive averaging scheme and thus blindly converges to (0, 0) for all 1D bivariate functions. In contrast, GDA-AM obtains optimal weights by solving a small linear system on past iterates. Using different weights at each iteration, GDA-AM is able to minimize the residual of past iterates and thus find the solution of a fixed-point iteration. More importantly, averaging does not change the GDA dynamics, because averaging generates a new sequence of parameters based on the GDA iterates. This means that averaging is independent of the base training algorithm (GDA here). GDA-AM, however, changes the dynamics directly by overwriting the latest iterate. Anderson Mixing therefore interacts with GDA, which is another major difference from averaging.

Figure 9: — $f (x, y) = (x - \frac{1}{2}) (y - \frac{1}{2}) + \frac{1}{3} e^{- {(x - 0.25)}^{2} - {(y - 0.75)}^{2}}$ . The optimum of this function is not (0, 0). Because averaging blindly converges to (0, 0), it can never find the correct solution.

D.3. Density estimation

To test our proposed method, we evaluate it on two low-dimensional density estimation problems: a mixture of 25 Gaussians and the Swiss roll. For both the generator and the discriminator, we use fully connected neural networks with 3 hidden layers and 128 hidden units in each layer. Except for the output layer of the discriminator, which uses a sigmoid activation, we use tanh activations for all layers. We run Adam and GDA-AM for 50000 steps. The learning rate is set to 2×10⁻⁴ with $β_{1} = 0$ , $β_{2} = 0.9$ after an extensive grid search, which is close to the maximal possible step size under which the methods rarely diverge. Figures 15 and 16 show the output after $\{1 K, 10 K, 30 K, 50 K\}$ iterations. It can be seen that our method converges faster to the target distribution and offers an improvement over Adam. In addition, we observe that the samples generated by our method gather around the circle and are less connected with other circles.

Figure 10: — $f (x, y) = (4 x^{2} - {(y - 3 x + 0.05 x^{3})}^{2} - 0.1 y^{4}) e^{- 0.01 (x^{2} + y^{2})}$ . All baselines except averaging cycle around the optimum. Averaging converges slowly.

Figure 11: — $f (x, y) = - 3 x^{2} - y^{2} + 4 x y$ . Baselines tend to diverge. Averaging is again converging slowly; it converges here because it can only blindly converge to (0, 0), which is the optimum of this function.

D.4. Robust Neural Network Training

In this section, we test the effectiveness of GDA-AM by training a robust neural network on MNIST data set against adversarial attacks (Madry et al., 2019; Goodfellow et al., 2015; Kurakin et al., 2017). The optimization formulation is

\min_{w} \sum_{i = 1}^{N} \max_{δ_{i}, s.t. {|δ_{i}|}_{\infty} \leq ε} ℓ (f (x_{i} + δ_{i}; w), y_{i})

(45)

where $w$ is the parameter of the neural network, the pair $(x_{i}, y_{i})$ denotes the $i$ -th data point, and $δ_{i}$ is the perturbation added to data point $i$ . The accuracy of our formulation against popular attacks, FGSM (Goodfellow et al., 2015) and PGD (Kurakin et al., 2017), is summarized in Table 2. Since solving such a problem is computationally challenging, Nouiehed et al. (2019) proposed an approximation of the above optimization problem with a new objective function, given by the following nonconvex-concave problem:

\min_{w} \sum_{i = 1}^{N} \max_{t \in T} \sum_{j = 0}^{9} t_{j} ℓ (f (x_{i j}^{K}; w), y_{i}), T = \{(t_{1}, \dots, t_{m}) ∣ \sum_{i = 1}^{m} t_{i} = 1, t_{i} \geq 0\}

(46)

where $K$ is a parameter in the approximation, and $x_{i j}^{K}$ is an approximated attack on sample $x_{i}$ by changing the output of the network to label $j$ . We use the publicly available implementation (Nouiehed et al., 2019) ². We apply our algorithm on top of (Nouiehed et al., 2019) and compare our results $(p = 50)$ with (Madry et al., 2019; Zhang et al., 2019; 2020; Nouiehed et al., 2019). Results are summarized in Table 2. We observe that GDA-AM achieves performance comparable to or slightly better than that of the other methods. In addition, GDA-AM does not exhibit a significant drop in accuracy when $ϵ$ is larger, which suggests that the learned model is more robust.
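For intuition about the inner maximization over $δ_{i}$ in (45), the following sketch applies a single FGSM step (Goodfellow et al., 2015), $x + ε \cdot sign (\nabla_{x} ℓ)$ , to a linear logistic model. The model, the weights, and the data point are toy assumptions made for illustration; they are unrelated to the network and data used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Binary cross-entropy of a linear logistic model (toy stand-in for the network)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, x, y, eps):
    """One FGSM step: move each input coordinate by eps in the sign of the loss gradient."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad_x = [(p - y) * wi for wi in w]  # gradient of the loss with respect to the input
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

w = [1.0, -2.0, 0.5]   # hypothetical model weights
x = [0.2, -0.1, 0.4]   # hypothetical input
y = 1                  # true label
eps = 0.1

x_adv = fgsm(w, x, y, eps)
clean_loss = loss(w, x, y)
adv_loss = loss(w, x_adv, y)
print(clean_loss, adv_loss)
```

For a linear model this single step already maximizes the loss over the $L_{\infty}$ ball of radius $ε$ ; for a deep network it is only a first-order approximation, which is why iterative attacks such as PGD are also reported in Table 2.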

Figure 12: — $f (x, y) = \frac{1}{3} x^{3} + y^{2} + 2 x y - 6 x - 3 y + 4$ .

Figure 13: — $f (x, y) = x^{3} - y^{3} - 2 x y + 6$ .

D.5. Image Generation

In this section, we provide additional experimental results that are not given in Section 5. Figure 18a and 18b show the Inception Score for CIFAR10 using WGAN-GP and SNGAN. It can be observed that our method consistently performs better than Adam and EG during training. Further, on CIFAR-10 using WGAN-GP and SNGAN, GDA-AM is slightly slower than Adam (about 110–115% computational time), but significantly faster than EG (about 65–75% computational time).

D.6. Details on the experiments

For our experiments, we used the PyTorch ³ deep learning framework. Experiments were run on one NVIDIA V100 GPU. The residual network architectures for the generator and the discriminator are summarized in Tables 3 and 4. We use a WGAN-GP loss with gradient penalty $λ = 10$ . When using the gradient penalty (WGAN-GP), we remove the batch normalization layers in the discriminator. When using SNGAN, we replace the batch normalization layers with spectral normalization. The hyperparameters of Adam are selected after a grid search. We use a learning rate of 2 × 10⁻⁴ and a batch size of 64. We set the table size of GDA-AM to 120 for CIFAR10 and 150 for CelebA. We set $β_{1} = 0.0$ and $β_{2} = 0.9$ , as we find that this gives better models than the default settings.

Table 2:

Test accuracies under FGSM and PGD attack. Trade refers to Zhang et al. (2019).

	Natural	FGSM $L_{\infty}$			PGD40 $L_{\infty}$

		$ε = 0.2$	$ε = 0.3$	$ε = 0.4$	$ε = 0.2$	$ε = 0.3$	$ε = 0.4$
Madry et al. (2019)	98.58%	96.09%	94.82%	89.84%	94.64%	91.41%	78.67%
Trade: $ε = 0.35$	97.37%	95.47%	94.86%	79.04%	94.41%	92.69%	85.74%
Trade: $ε = 0.40$	97.21%	96.19%	96.17%	96.14%	95.01%	94.36%	94.11%
Nouiehed et al. (2019)	98.20%	97.04%	96.66%	96.23%	96.00%	95.17%	94.22%
Zhang et al. (2020)	98.89%	97.87%	97.23%	95.81%	96.71%	95.62%	94.51%
`GDA-AM`	98.61%	97.75%	97.74%	97.75%	96.47%	95.91%	95.41%


Figure 14: — $f (x, y) = 2 x^{2} + y^{2} + 4 x y + \frac{4}{3} y^{3} - \frac{1}{4} y^{4}$ .

Figure 15: — **25 Gaussians:** Evolution plot of Adam and `GDA-AM`. Green dots are observed points and red dots are generated points.

Figure 16: — **Swiss roll:** Evolution plot of Adam and `GDA-AM`. Green dots are observed points and red dots are generated points.

Figure 17: — FID (lower or ↓ is better) for CIFAR 10

Figure 18: — **Left:** IS for CIFAR10 using WGANGP. **Middle:** IS for CIFAR10 using SNGAN. **Right:** FID for CelebA using WGANGP.

Figure 19: — Generated Images for CIFAR10 and CelebA using WGAN-GP(ResNet)

Table 3:

ResNet architecture used for our CIFAR-10 experiments.

Generator

Input: $z \in ℝ^{128} \sim N (0, I)$
Linear 128 → 256 × 4 × 4
ResBlock 128 → 128
ResBlock 256 → 256
ResBlock 256 → 256
Batch Normalization
ReLU
transposed conv. (256, kernel:3 × 3, stride:1, pad: 1)
tanh(·)


Discriminator

Input: $x \in ℝ^{3 \times 32 \times 32}$
Linear 128 → 128 × 4 × 4
ResBlock 128 → 128
ResBlock 128 → 128
ResBlock 128 → 128
Linear 128 → 1


Table 4:

ResNet architecture used for our CelebA (64 × 64) experiments.

Generator

Input: $z \in ℝ^{128} \sim N (0, I)$
Linear 128 → 512 × 8 × 8
ResBlock 512 → 256
ResBlock 256 → 128
ResBlock 128 → 64
Batch Normalization
ReLU
transposed conv. (64, kernel:3 × 3, stride:1, pad: 1)
tanh(·)


Discriminator

Input: $x \in ℝ^{3 \times 64 \times 64}$
Linear 128 → 128 × 4 × 4
ResBlock 128 → 128
ResBlock 128 → 256
ResBlock 256 → 512
Linear 512 → 1


Footnotes

https://github.com/hehuannb/GDA-AM

https://github.com/optimization-for-data-driven-science/Robust-NN-Training

https://pytorch.org/

Contributor Information

Huan He, Department of Computer Science Emory University Atlanta, GA 30329, USA.

Shifan Zhao, Department of Computer Science Emory University Atlanta, GA 30329, USA.

Yuanzhe Xi, Department of Computer Science Emory University Atlanta, GA 30329, USA.

Joyce C Ho, Department of Computer Science Emory University Atlanta, GA 30329, USA.

Yousef Saad, Department of Computer Science and Engineering University of Minnesota Minneapolis, MN 55455, USA.

References

Adolphs Leonard, Daneshmand Hadi, Lucchi Aurelien, and Hofmann Thomas. Local saddle point optimization: A curvature exploitation approach. Proceedings of Machine Learning Research. PMLR.
Anderson Donald G. Iterative procedures for nonlinear integral equations. 1965.
Azizian Waïss, Scieur Damien, Mitliagkas Ioannis, Lacoste-Julien Simon, and Gauthier Gidel Accelerating smooth games by manipulating spectral shapes. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26–28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pp. 1705–1715. PMLR, 2020.
Bollapragada Raghu, Scieur Damien, and d’Aspremont Alexandre Nonlinear acceleration of momentum and primal-dual algorithms. arXiv preprint arXiv:1810.04539, 2018.
Bruck Ronald E. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in hilbert space. Journal of Mathematical Analysis and Applications, 1977.
Calvetti Daniela, Lewis Bryan, and Reichel Lothar. On the regularizing properties of the gmres method. Numerische Mathematik, 91(4):605–625, 2002.
Crouzeix Michel and Palencia César. The numerical range is a (1+2)-spectral set. SIAM Journal on Matrix Analysis and Applications, 38(2):649–655, 2017.
Daskalakis C, Ilyas Andrew, Syrgkanis Vasilis, and Zeng Haoyang Training gans with optimism. ArXiv, abs/1711.00141, 2018.
Elman Howard C Iterative methods for large, sparse, nonsymmetric systems of linear equations. PhD thesis, Yale University New Haven, Conn, 1982.
Fang Haw-ren and Saad Yousef. Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3):197–221, 2009.
Fischer Bernd and Freund Roland. Chebyshev polynomials are not always optimal. Journal of Approximation Theory, 65(3):261–272, 1991.
Gidel Gauthier, Berard Hugo, Vignoud Gaëtan, Vincent Pascal, and Lacoste-Julien Simon A variational inequality perspective on generative adversarial networks. In 7th International Conference on Learning Representations, ICLR, 2019a.
Gidel Gauthier, Hemmat Reyhane Askari, Pezeshki Mohammad, Rémi Le Priol, Huang Gabriel, Lacoste-Julien Simon, and Mitliagkas Ioannis. Negative momentum for improved game dynamics. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2019b.
Goodfellow I, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair S. Courville Aaron C., and Bengio Yoshua. Generative adversarial nets. In NIPS, 2014.
Goodfellow Ian J., Shlens Jonathon, and Szegedy Christian. Explaining and harnessing adversarial examples, 2015.
Greenbaum Anne. Iterative methods for solving linear systems. SIAM, 1997.
Gulrajani Ishaan, Ahmed Faruk, Arjovsky Martin, Dumoulin Vincent, and Courville Aaron C Improved training of wasserstein gans. In Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R. (eds.), Advances in Neural Information Processing Systems, 2017.
Heusel Martin, Ramsauer Hubert, Unterthiner Thomas, Nessler Bernhard, and Hochreiter Sepp. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
Hsieh Yu-Guan, Iutzeler F, Malick J, and Mertikopoulos P. On the convergence of single-call stochastic extra-gradient methods. In NeurIPS, 2019.
Jin Chi, Netrapalli Praneeth, and Jordan Michael. What is local optimality in nonconvex-nonconcave minimax optimization? In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 4880–4889. PMLR, 2020.
Krizhevsky A. Learning multiple layers of features from tiny images. 2009.
Kurakin Alexey, Goodfellow Ian J., and Bengio Samy. Adversarial machine learning at scale. ArXiv, abs/1611.01236, 2017.
Lei Qi, Nagarajan Sai Ganesh, Panageas Ioannis, and Wang Xiao. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In AISTATS, 2021.
Li S, Wu Yi, Cui Xinyue, Dong Honghua, Fang Fei, and Russell Stuart J. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In AAAI, 2019.
Lin Tianyi, Jin Chi, and Jordan Michael I. On gradient descent ascent for nonconvex-concave minimax problems. In ICML, pp. 6083–6093, 2020. URL http://proceedings.mlr.press/v119/lin20a.html.
Liu Ziwei, Luo Ping, Wang Xiaogang, and Tang Xiaoou. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
Luo Luo, Ye Haishan, Huang Zhichao, and Zhang Tong. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. In Larochelle H, Ranzato M, Hadsell R, Balcan MF, and Lin H. (eds.), Advances in Neural Information Processing Systems, 2020.
Madry Aleksander, Makelov Aleksandar, Schmidt Ludwig, Tsipras Dimitris, and Vladu Adrian. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
Madry Aleksander, Makelov Aleksandar, Schmidt Ludwig, Tsipras Dimitris, and Vladu Adrian. Towards deep learning models resistant to adversarial attacks, 2019.
Mazumdar Eric V., Jordan Michael I., and Sastry S. On finding local nash equilibria (and only local nash equilibria) in zero-sum games. ArXiv, abs/1901.00838, 2019.
Mertikopoulos Panayotis, Lecouat Bruno, Zenati Houssam, Foo Chuan-Sheng, Chandrasekhar Vijay, and Piliouras Georgios. Optimistic mirror descent in saddle-point problems: Going the extra gradient mile. In 7th International Conference on Learning Representations, ICLR, 2019.
Mescheder Lars, Nowozin Sebastian, and Geiger Andreas. The numerics of gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 2017.
Miyato Takeru, Kataoka Toshiki, Koyama Masanori, and Yoshida Y. Spectral normalization for generative adversarial networks. ArXiv, abs/1802.05957, 2018.
Mokhtari Aryan, Ozdaglar Asuman, and Pattathil Sarath. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pp. 1497–1507. PMLR, 2020a.
Mokhtari Aryan, Ozdaglar Asuman, and Pattathil Sarath. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2020b.
Nemirovski A. Prox-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 2004.
Nouiehed Maher, Sanjabi Maziar, Huang Tianjian, Lee Jason D., and Razaviyayn Meisam. Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods. Curran Associates Inc., Red Hook, NY, USA, 2019.
Ostrovskii Dmitrii M., Lowy Andrew, and Razaviyayn Meisam. Efficient search of first-order nash equilibria in nonconvex-concave smooth min-max problems, 2021.
Jack Parker-Holder Luke Metz, Resnick Cinjon, Hu Hengyuan, Lerer Adam, Letcher Alistair, Peysakhovich Alexander, Pacchiano Aldo, and Foerster Jakob. Ridge rider: Finding diverse solutions by following eigenvectors of the hessian. In Advances in Neural Information Processing Systems, 2020.
Popov L. A modification of the arrow-hurwicz method for search of saddle points. Mathematical notes of the Academy of Sciences of the USSR, 1980.
Saad Youcef and Schultz Martin H. Gmres: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 1986.
Saad Yousef. Iterative methods for sparse linear systems. SIAM, 2003.
Salimans Tim, Goodfellow Ian, Zaremba Wojciech, Cheung Vicki, Radford Alec, Chen Xi, and Chen Xi. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.
Schaefer Florian and Anandkumar Anima. Competitive gradient descent. In Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R. (eds.), Advances in Neural Information Processing Systems, 2019.
Thekumparampil Kiran K, Jain Prateek, Netrapalli Praneeth, and Oh Sewoong. Efficient algorithms for smooth minimax optimization. In Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R. (eds.), Advances in Neural Information Processing Systems, 2019.
John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.
Walker Homer F and Ni Peng Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715–1735, 2011a.
Walker Homer F. and Ni Peng. Anderson acceleration for fixed-point iterations. 2011b.
Wang Yuanhao, Zhang Guodong, and Ba Jimmy. On solving minimax optimization locally: A follow-the-ridge approach. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020.
Wei Chen-Yu, Lee Chung-Wei, Zhang Mengxiao, and Luo Haipeng. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=dx11_7vm5_r.
Wei Fuchao, Bao Chenglong, and Liu Yang. Stochastic anderson mixing for nonconvex stochastic optimization. arXiv preprint arXiv:2110.01543, 2021b.
Wu Yue, Zhou Pan, Wilson A, Xing E, and Hu Zhiting Improving gan training with probability ratio clipping and sample reweighting. ArXiv, abs/2006.06900, 2020.
Xu Zi, Zhang Huiling, Xu Yang, and Lan Guanghui. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems, 2021.
Minghan Yang A. Milzarek Z. Wen, and Zhang T. A stochastic extra-step quasi-newton method for nonsmooth nonconvex optimization. arXiv: Optimization and Control, 2019.
Yazici Yasin, Foo Chuan-Sheng, Winkler Stefan, Yap Kim-Hui, Piliouras Georgios, and Chandrasekhar Vijay. The unusual effectiveness of averaging in GAN training. In 7th International Conference on Learning Representations, ICLR, 2019.
Zhang Guodong and Wang Yuanhao. On the suboptimality of negative momentum for minimax optimization. In AISTATS, 2021.
Zhang Guodong, Wang Yuanhao, Lessard Laurent, and Grosse Roger B. Don’t fix what ain’t broke: Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization. CoRR, 2021.
Zhang Hongyang, Yu Yaodong, Jiao Jiantao, Xing Eric P., El Ghaoui Laurent, and Jordan Michael I. Theoretically principled trade-off between robustness and accuracy. CoRR, 2019. URL http://arxiv.org/abs/1901.08573.
Zhang Jiawei, Xiao Peijun, Sun Ruoyu, and Luo Zhiquan. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. In Larochelle H, Ranzato M, Hadsell R, Balcan MF, and Lin H. (eds.), Advances in Neural Information Processing Systems, 2020.

[R1] Adolphs Leonard, Daneshmand Hadi, Lucchi Aurelien, and Hofmann Thomas. Local saddle point optimization: A curvature exploitation approach. Proceedings of Machine Learning Research. PMLR. [Google Scholar]

[R2] Anderson Donald G.. Iterative procedures for nonlinear integral equations. 1965. [Google Scholar]

[R3] Azizian Waïss, Scieur Damien, Mitliagkas Ioannis, Lacoste-Julien Simon, and Gauthier Gidel Accelerating smooth games by manipulating spectral shapes. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26–28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pp. 1705–1715. PMLR, 2020. [Google Scholar]

[R4] Bollapragada Raghu, Scieur Damien, and d’Aspremont Alexandre Nonlinear acceleration of momentum and primal-dual algorithms. arXiv preprint arXiv:1810.04539, 2018. [Google Scholar]

[R5] Bruck Ronald E.. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in hilbert space. Journal of Mathematical Analysis and Applications, 1977. [Google Scholar]

[R6] Calvetti Daniela, Lewis Bryan, and Reichel Lothar. On the regularizing properties of the gmres method. Numerische Mathematik, 91(4):605–625, 2002. [Google Scholar]

[R7] Crouzeix Michel and Palencia César. The numerical range is a (1+2)-spectral set. SIAM Journal on Matrix Analysis and Applications, 38(2):649–655, 2017. [Google Scholar]

[R8] Daskalakis C, Ilyas Andrew, Syrgkanis Vasilis, and Zeng Haoyang Training gans with optimism. ArXiv, abs/1711.00141, 2018. [Google Scholar]

[R9] Elman Howard C Iterative methods for large, sparse, nonsymmetric systems of linear equations. PhD thesis, Yale University New Haven, Conn, 1982. [Google Scholar]

[R10] Fang Haw-ren and Saad Yousef. Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3):197–221, 2009. [Google Scholar]

[R11] Fischer Bernd and Freund Roland. Chebyshev polynomials are not always optimal. Journal of Approximation Theory, 65(3):261–272, 1991. [Google Scholar]

[R12] Gidel Gauthier, Berard Hugo, Vignoud Gaëtan, Vincent Pascal, and Lacoste-Julien Simon A variational inequality perspective on generative adversarial networks. In 7th International Conference on Learning Representations, ICLR, 2019a. [Google Scholar]

[R13] Gidel Gauthier, Hemmat Reyhane Askari, Pezeshki Mohammad, Rémi Le Priol, Huang Gabriel, Lacoste-Julien Simon, and Mitliagkas Ioannis. Negative momentum for improved game dynamics. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2019b. [Google Scholar]

[R14] Goodfellow I, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair S. Courville Aaron C., and Bengio Yoshua. Generative adversarial nets. In NIPS, 2014. [Google Scholar]

[R15] Goodfellow Ian J., Shlens Jonathon, and Szegedy Christian. Explaining and harnessing adversarial examples, 2015.

[R16] Greenbaum Anne. Iterative methods for solving linear systems. SIAM, 1997.

[R17] Gulrajani Ishaan, Ahmed Faruk, Arjovsky Martin, Dumoulin Vincent, and Courville Aaron C. Improved training of Wasserstein GANs. In Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R. (eds.), Advances in Neural Information Processing Systems, 2017.

[R18] Heusel Martin, Ramsauer Hubert, Unterthiner Thomas, Nessler Bernhard, and Hochreiter Sepp. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

[R19] Hsieh Yu-Guan, Iutzeler F, Malick J, and Mertikopoulos P. On the convergence of single-call stochastic extra-gradient methods. In NeurIPS, 2019.

[R20] Jin Chi, Netrapalli Praneeth, and Jordan Michael. What is local optimality in nonconvex-nonconcave minimax optimization? In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 4880–4889. PMLR, 2020.

[R21] Krizhevsky A. Learning multiple layers of features from tiny images. 2009.

[R22] Kurakin Alexey, Goodfellow Ian J., and Bengio Samy. Adversarial machine learning at scale. ArXiv, abs/1611.01236, 2017.

[R23] Lei Qi, Nagarajan Sai Ganesh, Panageas Ioannis, and Wang Xiao. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In AISTATS, 2021.

[R24] Li S, Wu Yi, Cui Xinyue, Dong Honghua, Fang Fei, and Russell Stuart J. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In AAAI, 2019.

[R25] Lin Tianyi, Jin Chi, and Jordan Michael I. On gradient descent ascent for nonconvex-concave minimax problems. In ICML, pp. 6083–6093, 2020. URL http://proceedings.mlr.press/v119/lin20a.html.

[R26] Liu Ziwei, Luo Ping, Wang Xiaogang, and Tang Xiaoou. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

[R27] Luo Luo, Ye Haishan, Huang Zhichao, and Zhang Tong. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. In Larochelle H, Ranzato M, Hadsell R, Balcan MF, and Lin H. (eds.), Advances in Neural Information Processing Systems, 2020.

[R28] Madry Aleksander, Makelov Aleksandar, Schmidt Ludwig, Tsipras Dimitris, and Vladu Adrian. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, 2018.

[R29] Madry Aleksander, Makelov Aleksandar, Schmidt Ludwig, Tsipras Dimitris, and Vladu Adrian. Towards deep learning models resistant to adversarial attacks, 2019.

[R30] Mazumdar Eric V., Jordan Michael I., and Sastry S. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. ArXiv, abs/1901.00838, 2019.

[R31] Mertikopoulos Panayotis, Lecouat Bruno, Zenati Houssam, Foo Chuan-Sheng, Chandrasekhar Vijay, and Piliouras Georgios. Optimistic mirror descent in saddle-point problems: Going the extra gradient mile. In 7th International Conference on Learning Representations, ICLR, 2019.

[R32] Mescheder Lars, Nowozin Sebastian, and Geiger Andreas. The numerics of GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 2017.

[R33] Miyato Takeru, Kataoka Toshiki, Koyama Masanori, and Yoshida Y. Spectral normalization for generative adversarial networks. ArXiv, abs/1802.05957, 2018.

[R34] Mokhtari Aryan, Ozdaglar Asuman, and Pattathil Sarath. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pp. 1497–1507. PMLR, 2020a.

[R35] Mokhtari Aryan, Ozdaglar Asuman, and Pattathil Sarath. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR, 2020b.

[R36] Nemirovski A. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 2004.

[R37] Nouiehed Maher, Sanjabi Maziar, Huang Tianjian, Lee Jason D., and Razaviyayn Meisam. Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods. Curran Associates Inc., Red Hook, NY, USA, 2019.

[R38] Ostrovskii Dmitrii M., Lowy Andrew, and Razaviyayn Meisam. Efficient search of first-order Nash equilibria in nonconvex-concave smooth min-max problems, 2021.

[R39] Parker-Holder Jack, Metz Luke, Resnick Cinjon, Hu Hengyuan, Lerer Adam, Letcher Alistair, Peysakhovich Alexander, Pacchiano Aldo, and Foerster Jakob. Ridge rider: Finding diverse solutions by following eigenvectors of the Hessian. In Advances in Neural Information Processing Systems, 2020.

[R40] Popov L. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical notes of the Academy of Sciences of the USSR, 1980.

[R41] Saad Youcef and Schultz Martin H. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 1986.

[R42] Saad Yousef. Iterative methods for sparse linear systems. SIAM, 2003.

[R43] Salimans Tim, Goodfellow Ian, Zaremba Wojciech, Cheung Vicki, Radford Alec, and Chen Xi. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.

[R44] Schaefer Florian and Anandkumar Anima. Competitive gradient descent. In Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R. (eds.), Advances in Neural Information Processing Systems, 2019.

[R45] Thekumparampil Kiran K, Jain Prateek, Netrapalli Praneeth, and Oh Sewoong. Efficient algorithms for smooth minimax optimization. In Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R. (eds.), Advances in Neural Information Processing Systems, 2019.

[R46] von Neumann John and Morgenstern Oskar. Theory of Games and Economic Behavior. Princeton University Press, 1944.

[R47] Walker Homer F. and Ni Peng. Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715–1735, 2011a.

[R48] Walker Homer F. and Ni Peng. Anderson acceleration for fixed-point iterations. 2011b.

[R49] Wang Yuanhao, Zhang Guodong, and Ba Jimmy. On solving minimax optimization locally: A follow-the-ridge approach. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020.

[R50] Wei Chen-Yu, Lee Chung-Wei, Zhang Mengxiao, and Luo Haipeng. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=dx11_7vm5_r.

[R51] Wei Fuchao, Bao Chenglong, and Liu Yang. Stochastic Anderson mixing for nonconvex stochastic optimization. arXiv preprint arXiv:2110.01543, 2021b.

[R52] Wu Yue, Zhou Pan, Wilson A, Xing E, and Hu Zhiting. Improving GAN training with probability ratio clipping and sample reweighting. ArXiv, abs/2006.06900, 2020.

[R53] Xu Zi, Zhang Huiling, Xu Yang, and Lan Guanghui. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems, 2021.

[R54] Yang Minghan, Milzarek A., Wen Z., and Zhang T. A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. arXiv: Optimization and Control, 2019.

[R55] Yazici Yasin, Foo Chuan-Sheng, Winkler Stefan, Yap Kim-Hui, Piliouras Georgios, and Chandrasekhar Vijay. The unusual effectiveness of averaging in GAN training. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

[R56] Zhang Guodong and Wang Yuanhao. On the suboptimality of negative momentum for minimax optimization. In AISTATS, 2021.

[R57] Zhang Guodong, Wang Yuanhao, Lessard Laurent, and Grosse Roger B. Don’t fix what ain’t broke: Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization. CoRR, 2021.

[R58] Zhang Hongyang, Yu Yaodong, Jiao Jiantao, Xing Eric P., El Ghaoui Laurent, and Jordan Michael I. Theoretically principled trade-off between robustness and accuracy. CoRR, 2019. URL http://arxiv.org/abs/1901.08573.

[R59] Zhang Jiawei, Xiao Peijun, Sun Ruoyu, and Luo Zhiquan. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. In Larochelle H, Ranzato M, Hadsell R, Balcan MF, and Lin H. (eds.), Advances in Neural Information Processing Systems, 2020.
