ABSTRACT
Similar to the role of Markov decision processes in reinforcement learning, Markov games (also called stochastic games) lay down the foundation for the study of multi-agent reinforcement learning and sequential agent interactions. We introduce approximate Markov perfect equilibrium as a solution concept for the computational problem of finite-state stochastic games played over an infinite horizon, and prove that computing it is PPAD-complete. This solution concept preserves the Markov perfect property and opens up the possibility of extending the success of multi-agent reinforcement learning algorithms on static two-player games to dynamic multi-agent games, expanding the reach of the PPAD-complete class.
Keywords: Markov game, multi-agent reinforcement learning, Markov perfect equilibrium, PPAD-completeness, stochastic game
We prove that computing Markov perfect equilibria of general-sum stochastic games is PPAD-complete, laying an algorithmic-complexity foundation for multi-agent reinforcement learning methodology.
INTRODUCTION
Shapley [1] introduced stochastic games (SGs) to study dynamic non-cooperative multi-player games, in which each player simultaneously and independently chooses an action at each round for a reward. Given the current state and the chosen actions, the next state is determined by a probability distribution specified a priori. Shapley's work includes the first proof that, in two-player zero-sum SGs, there exists a stationary strategy profile under which no agent has an incentive to deviate. The existence of equilibrium in stationary strategies was later extended to multi-player general-sum SGs by Fink [2]. Such a solution concept (known as Markov perfect equilibrium (MPE) [3]) captures the dynamics of multi-player games.
Because of its generality, the framework of SGs has informed a sequence of studies [4] on a wide range of real-world applications, ranging from advertising and pricing [5], species interaction modeling in fisheries [6] and traveling inspection [7] to gaming AIs [8]. As a result, developing algorithms to compute MPE in SGs has become one of the key subjects in this extremely rich research domain, using approaches from applied mathematics, economics, operations research, computer science and artificial intelligence (see, e.g., [9]).
The concept of the SG underpins many AI and machine learning studies. Optimal policy making in Markov decision processes (MDPs) captures the central problem of a single agent interacting with its environment [10]. In multi-agent reinforcement learning (MARL) [11,12], the SG extends the MDP to incorporate the dynamics of multi-agent strategic interactions, supporting the study of optimal decision making and, subsequently, equilibria in multi-player games [13,14].
For two-player zero-sum (discounted) SGs, the game-theoretic equilibrium is closely related to the optimization problem in MDPs because the opponent is purely adversarial [15,16]. On the other hand, solving general-sum SGs has been possible only under strong assumptions [2,17]. Zinkevich et al. [18] demonstrated that the entire class of value iteration methods fails to find stationary Nash equilibrium (NE) policies in general-sum SGs. As a result, few existing MARL algorithms apply to general-sum SGs. Known approaches have either studied special cases of SGs [19,20] or ignored the dynamic nature of the game and limited the study to the weaker notion of Nash equilibrium [21].
Recently, Solan and Vieille [22] reconfirmed the importance of the existence of a stationary strategy profile, which has several philosophical implications. First, it is conceptually straightforward. Second, past play affects the players' future behavior only through the current state. Third, and most importantly, equilibrium behavior does not involve non-credible threats, a property that is stronger than the equilibrium property and viewed as highly desirable [23].
Surprisingly, despite its importance and although SGs were proposed more than sixty years ago, the complexity of finding an MPE in an SG remains an open problem. While fruitful studies have been conducted on zero-sum SGs, we still know little about the complexity of solving general-sum SGs. It is clear that solving MPE in (infinite-horizon) SGs is at least PPAD-hard, since solving a two-player NE in one-shot SGs is already complete for this computational class [24,25], defined by Papadimitriou [26]. This suggests that polynomial-time algorithms are unlikely even for two-player general-sum stochastic games. Yet, with the complications of the general-sum and dynamic settings, the unresolved challenge has been: is solving MPE in general-sum SGs complete for some computational complexity class?
We answer the above question in the positive, proving that computing an approximate MPE in SGs is equivalent, under polynomial-time reductions, to computing a Nash equilibrium in the single-state setting, and thereby establishing its PPAD-completeness. This opens up the possibility of developing MARL algorithms for general-sum SGs in the same way as for ordinary Nash equilibrium computation.
Intuitions and a sketch of our main ideas
Computational studies on problem solving build understanding on various types of reduction. After all, computations carried out on computers are eventually reduced to AND/OR/NOT gates on electronic circuits.
To prove that a problem is PPAD-complete, we need to prove that it is in the class and that it can be used as a base to solve any other problem in this class (for its hardness). More formally, the reductions need to be carried out in time polynomial in the input size. Nash equilibrium computation of two-player normal-form games [27] is arguably the most prominent PPAD-complete problem [24,25]. When a stochastic game has only one state and the discount factor is γ = 0, finding an MPE is equivalent to finding a Nash equilibrium of the corresponding normal-form game. The PPAD-hardness of finding an MPE follows immediately. Our main result is to prove the PPAD membership of computing an approximate MPE (Lemma 2 below).
Firstly, we construct a function f on the strategy profile space, such that each strategy profile is a fixed point of f if and only if it is an MPE of the stochastic game (Theorem 2 below). Furthermore, we prove that the function f is continuous (λ-Lipschitz by Lemma 3 below), so that fixed points are guaranteed to exist by the Brouwer fixed point theorem.
Secondly, we prove that the function f has some 'good' approximation properties. Let |G| denote the input size of the stochastic game G. If we can find an approximate fixed point π of f that is sufficiently accurate, i.e. a strategy profile π with ‖f(π) − π‖∞ bounded by a quantity polynomial in ε and in 1/|G|, then π is an ε-approximate MPE for the stochastic game (combining Lemma 5 and Lemma 6 below). So our goal converts to finding an approximate fixed point of a Lipschitz function.
Finally, our PPAD membership follows from the theorem that computation of the approximate Brouwer fixed point of a Lipschitz function is PPAD-complete, as shown in the seminal paper by Papadimitriou [26].
Related work
In practice, MARL methods are most often applied to compute the MPE of an SG based on the interactions between agents and the environment. Their uses can be classified in two different settings: online and offline. In the offline setting (also known as the batch setting [21]), the learning algorithm controls all players in a centralized way, hoping that the learning dynamics can eventually lead to an MPE by using a limited number of interaction samples. In the online setting, the learner controls only one of the players to play with an arbitrary group of opponents in the game, assuming unlimited access to the game environment. The central focus is often on the regret: the difference between the learner’s total reward during learning versus that of a benchmark measure in hindsight.
In the offline setting, two-player zero-sum (discounted) SGs have been extensively studied. Since the opponent is purely adversarial in zero-sum SGs, the process of seeking the worst-case optimality for each player can be thought of as solving MDPs. As a result, (approximate) dynamic programming methods [28,29] such as least-squares policy iteration [30] and fitted value iteration [31] or neural fitted Q iteration [32] can be adopted to solve SGs [33–36]. Under this setting, policy-based methods [37,38] can also be applied. However, directly applying existing MDP solvers on general-sum SGs is problematic. Since solving two-player NE in general-sum normal-form games (i.e. one-shot SGs) is well known to be PPAD-complete [24,25], the complexity of MPE in general-sum SGs is expected to be at least PPAD-hard. Although early attempts such as Nash-Q learning [39], correlated-Q learning [40], friend-or-foe Q-learning [41] have been made to solve general-sum SGs under strong assumptions, Zinkevich et al. [18] demonstrated that none in the entire class of value iteration methods can find stationary NE policies in general-sum SGs. The difficulties on both the complexity side and the algorithmic side have led to few existing MARL algorithms for general-sum SGs. Successful approaches either assume knowing the complete information of the SG such that solving MPE can be turned into an optimization problem [42], or prove the convergence of batch RL methods to a weaker notion of NE [21].
In the online setting, agents aim to minimize their regret by trial and error. One of the most well-known online algorithms is R-MAX [43], which studies (average-reward) zero-sum SGs and provides a polynomial (in game size and error parameter) regret bound while competing against an arbitrary opponent. Following the same regret definition, UCSG [44] improved on R-MAX and achieved a sublinear regret, but still in the two-player zero-sum SG setting. When it comes to MARL solutions, Littman [13] proposed a practical solution named Minimax-Q that replaces the max operator with the minimax operator. Asymptotic convergence results for Minimax-Q were developed both in the tabular case [45] and with value function approximation [46]. To avoid being overly pessimistic by playing the minimax value in general-sum SGs, WoLF [47] was proposed to take variable learning steps that exploit an opponent's suboptimal policy for a higher reward on a variety of stochastic games. AWESOME [48] further generalized WoLF and achieved NE convergence in multi-player general-sum repeated games. However, outside the scope of zero-sum SGs, the question [43] of whether a polynomial-time no-regret (near-optimal) MARL algorithm exists for general-sum SGs remains open.
Some recent works have studied the sample complexity of RL and MARL algorithms, most of which consider a finite horizon. Jin et al. [49] proved that a variant of Q-learning with upper-confidence-bound exploration can achieve near-optimal sample efficiency in the episodic MDP setting. Zhang et al. [50] proposed a learning algorithm for episodic MDPs with a regret bound close to the information-theoretic lower bound. Li et al. [51] proposed a probably approximately correct learning algorithm for episodic RL with a sample complexity independent of the planning horizon. For general-sum MARL, Chen et al. [52] proved an exponential lower bound on the sample complexity of approximate Nash equilibrium even in n-player normal-form games. In the same direction, Song et al. [53] showed that correlated equilibrium (CE) and coarse correlated equilibrium (CCE) can be learned with a sample complexity polynomial in the maximum size of a player's action set, rather than the size of the joint action space. Jin et al. [54] developed a decentralized MARL algorithm with polynomial sample complexity for learning CE and CCE.
DEFINITIONS AND THE MAIN THEOREM
Definition 1 (Stochastic game). A stochastic game is defined by a tuple of six elements G = (n, S, {Ai}_{i∈[n]}, P, {ri}_{i∈[n]}, γ).

By n we denote the number of agents.

By S we denote the finite set of environmental states, and we also write S for its cardinality |S|.

By Ai we denote the action space of agent i. Note that each agent i can choose different actions under different states. Without loss of generality, we assume that, for each agent i, the action space Ai under each state is the same. Here A = A1 × ⋅⋅⋅ × An is the set of agents' joint action vectors. Let Ai = |Ai| and Amax = max_{i∈[n]} Ai.

By P we denote the transition probability, that is, at each time step, given the agents' joint action vector a ∈ A, the transition probability from state s to state s′ in the next time step is P(s′|s, a).

By ri we denote the reward function of agent i, that is, when agents are at state s and play the joint action vector a, agent i gets reward ri(s, a). We assume that the rewards are bounded by rmax.

By γ ∈ [0, 1) we denote the discount factor that specifies the degree to which the agents' rewards are discounted over time.
Each agent aims to find a behavioral strategy with the Markov property, i.e. one conditioned only on the current state of the game.
The pure strategy space of agent i is the set of mappings from S to Ai, which means that agent i needs to select an action at each state. Note that the size of the pure strategy space of each agent is Ai^S, which is already exponential in the number of states. More generally, we define the mixed behavioral strategy as follows.
Definition 2 (Behavioral strategy). A behavioral strategy of agent i is a mapping πi that assigns a distribution over actions to each state. For all s ∈ S, πi(s) is a probability distribution on Ai.
In the rest of the paper, we focus on behavioral strategy and refer to it simply as a strategy for convenience. A strategy profile π is the Cartesian product of all agents’ strategies, i.e. π = π1 × ⋅⋅⋅ × πn. We denote the probability of agent i using action ai at state s by πi(s, ai). The strategy profile of all the agents other than agent i is denoted by π−i. We use πi, π−i to represent π, and ai, a−i to represent a.
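For concreteness, the following is a minimal sketch (ours, not from the paper) of a tabular two-player stochastic game and a behavioral strategy profile, assuming integer-indexed states and actions and numpy arrays; the names P, r, pi and the random payoffs are illustrative choices only.

```python
import numpy as np

# A tabular two-player stochastic game with S states and A actions per player.
# P[s, a1, a2, t] = transition probability P(t | s, (a1, a2))
# r[i, s, a1, a2] = reward r_i(s, (a1, a2)) of player i
rng = np.random.default_rng(0)
n, S, A, gamma = 2, 3, 2, 0.9

P = rng.random((S, A, A, S))
P /= P.sum(axis=-1, keepdims=True)              # each P(.|s, a) is a distribution
r = rng.uniform(-1.0, 1.0, size=(n, S, A, A))   # bounded rewards, here r_max = 1

# A behavioral strategy profile: pi[i, s] is player i's action distribution at state s.
pi = np.full((n, S, A), 1.0 / A)                # e.g. the uniform strategy profile
assert np.allclose(pi.sum(axis=-1), 1.0)
```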
Given π, the transition probability and the reward function depend only on the current state s ∈ S. Let ri,π(s) denote Σ_{a∈A} π(s, a) ri(s, a) and Pπ(s′|s) denote Σ_{a∈A} π(s, a) P(s′|s, a), where π(s, a) = Π_{j∈[n]} πj(s, aj).

Fix π−i; the transition probability and the reward function then depend only on the current state s and player i's action ai. Let ri,π−i(s, ai) denote Σ_{a−i} π−i(s, a−i) ri(s, ai, a−i) and Pπ−i(s′|s, ai) denote Σ_{a−i} π−i(s, a−i) P(s′|s, ai, a−i), where π−i(s, a−i) = Π_{j≠i} πj(s, aj).

For any positive integer m, let [m] = {1, 2, …, m}.

Then, for all s, s′ ∈ S, ri,π(s) = Σ_{ai∈Ai} πi(s, ai) ri,π−i(s, ai) and Pπ(s′|s) = Σ_{ai∈Ai} πi(s, ai) Pπ−i(s′|s, ai).
Definition 3 (Value function). A value function for agent i under strategy profile π, denoted vi,π, gives the expected sum of its discounted rewards when the starting state is s:

vi,π(s) = E[Σ_{k≥0} γ^k ri,π(sk) | s0 = s].

Here, s0, s1, … is the Markov chain whose transition matrix is Pπ, that is, Pr[sk+1 = s′ | sk = s] = Pπ(s′|s) for all k = 0, 1, …. Equivalently, the value function can be defined recursively via the Bellman policy equation

vi,π(s) = ri,π(s) + γ Σ_{s′∈S} Pπ(s′|s) vi,π(s′).
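Since the Bellman policy equation is linear in vi,π, the value function can be computed exactly with a linear solve, matching the matrix form vi,π = (I − γPπ)^{-1} ri,π used in Lemma 7 below. A minimal sketch for the tabular two-player representation of the earlier snippet (function and variable names are our own):

```python
import numpy as np

def induced_mdp(P, r_i, pi):
    """Marginalize the joint-action tables under a two-player profile pi.

    P[s, a1, a2, t] and r_i[s, a1, a2] are the transition and reward tables;
    pi[j, s] is player j's action distribution at state s.  Returns the induced
    P_pi[s, t] and r_{i,pi}[s] from the definitions above.
    """
    joint = np.einsum('sa,sb->sab', pi[0], pi[1])   # pi(s, a) = pi_1(s, a1) * pi_2(s, a2)
    P_pi = np.einsum('sab,sabt->st', joint, P)
    r_i_pi = np.einsum('sab,sab->s', joint, r_i)
    return P_pi, r_i_pi

def value_function(P, r_i, pi, gamma):
    """Solve the Bellman policy equation (I - gamma * P_pi) v = r_{i,pi} for v_{i,pi}."""
    P_pi, r_i_pi = induced_mdp(P, r_i, pi)
    S = P.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_i_pi)

# e.g. value_function(P, r[0], pi, gamma) gives player 0's values under the
# arrays from the previous sketch.
```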
Definition 4 (Markov perfect equilibrium). A behavioral strategy profile π is called a Markov perfect equilibrium if, for all s ∈ S, all i ∈ [n] and all behavioral strategies π′i of agent i,

vi,π(s) ≥ vi,(π′i,π−i)(s),

where vi,(π′i,π−i) is the value function of agent i when its strategy deviates to π′i while the strategy profile of the other agents is π−i.
The Markov perfect equilibrium is a solution concept within SGs in which the players’ strategies depend only on the current state but not on the game history.
Definition 5 (ε-approximate MPE). Given ε > 0, a behavioral strategy profile π is called an ε-approximate MPE if, for all s ∈ S, all i ∈ [n] and all behavioral strategies π′i of agent i,

vi,π(s) ≥ vi,(π′i,π−i)(s) − ε.
We use Approximate MPE to denote the computational problem of finding an approximate Markov perfect equilibrium in stochastic games, where the inputs and outputs are as follows. The input instance of problem Approximate MPE is a pair (G, L), where G is a stochastic game and L is a positive integer. The output of problem Approximate MPE is a strategy profile π, also dependent only on the current state but not on the history, such that π is a 1/L-approximate MPE of G. We use the notation |G| to denote the input size of the stochastic game G.
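By Lemma 4 and Claim 6 below, the best value player i can secure against a fixed π−i is the optimal value of the single-agent MDP obtained by marginalizing out π−i, so the output condition can be checked numerically by comparing vi,π with that optimal value. The following verification sketch for the tabular two-player representation used above is our own illustration, not an algorithm from the paper:

```python
import numpy as np

def is_approx_mpe(P, r, pi, gamma, eps, tol=1e-10):
    """Check the eps-approximate MPE condition for a tabular two-player game.

    Fixing pi_{-i}, player i faces an MDP with rewards r_{i,pi_{-i}}(s, a_i) and
    transitions P_{pi_{-i}}(t|s, a_i); its optimal value v_i* (the fixed point of
    the Bellman operator Phi_i) is the best any deviation can achieve, so pi is
    an eps-approximate MPE iff v_{i,pi}(s) >= v_i*(s) - eps for all i and s.
    """
    S = P.shape[0]
    joint = np.einsum('sa,sb->sab', pi[0], pi[1])
    P_pi = np.einsum('sab,sabt->st', joint, P)
    for i in range(2):
        # Value of player i under pi: solve the linear Bellman policy equation.
        r_pi = np.einsum('sab,sab->s', joint, r[i])
        v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

        # Induced MDP for player i when the opponent keeps playing pi_{-i}.
        if i == 0:
            r_dev = np.einsum('sb,sab->sa', pi[1], r[0])
            P_dev = np.einsum('sb,sabt->sat', pi[1], P)
        else:
            r_dev = np.einsum('sa,sab->sb', pi[0], r[1])
            P_dev = np.einsum('sa,sabt->sbt', pi[0], P)

        # Value iteration with the optimal Bellman operator (a gamma-contraction).
        v_star, delta = np.zeros(S), np.inf
        while delta >= tol:
            v_next = (r_dev + gamma * P_dev @ v_star).max(axis=1)
            delta = np.max(np.abs(v_next - v_star))
            v_star = v_next
        if np.any(v_pi < v_star - eps - tol):
            return False
    return True
```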
Theorem 1 (Main theorem). Approximate MPE is PPAD-complete.
We note that, when |S| = 1 and γ = 0, a stochastic game degenerates to an n-player normal-form game. In this case, any MPE of the stochastic game is a Nash equilibrium of the corresponding normal-form game. So we have the following hardness result immediately.
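To spell out the degeneration in one line (our own restatement of the argument): with a single state s and γ = 0, the value function collapses to the one-shot expected payoff, so the MPE condition is exactly the Nash equilibrium condition of the normal-form game with payoffs r1(s, ·), …, rn(s, ·).

```latex
% Single state s and discount factor \gamma = 0:
v_{i,\pi}(s) \;=\; r_{i,\pi}(s) \;=\; \sum_{a \in A} \Big(\prod_{j \in [n]} \pi_j(s, a_j)\Big)\, r_i(s, a),
\quad\text{so}\quad
v_{i,\pi}(s) \ge v_{i,(\pi_i',\pi_{-i})}(s)\ \ \forall\, i, \pi_i'
\;\Longleftrightarrow\;
\pi(s,\cdot)\ \text{is a Nash equilibrium of }\big(r_1(s,\cdot),\dots,r_n(s,\cdot)\big).
```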
Lemma 1. Approximate MPE is PPAD-hard.
To derive Theorem 1, we focus on the proof of PPAD membership of Approximate MPE in the rest of the paper.
Lemma 2. Approximate MPE is in PPAD.
ON THE EXISTENCE OF MPE
The original proof of the existence of MPE is from [2] and based on Kakutani’s fixed point theorem. Unfortunately, proofs that are based on Kakutani’s fixed point theorem in general cannot be turned into PPAD-membership results. We develop a proof that uses Brouwer’s fixed point theorem, based on which we also prove the PPAD membership of Approximate MPE.
Inspired by the continuous transformation defined by Nash to prove the existence of the equilibrium point [27], we define an updating function f on the space of behavioral strategy profiles that adjusts the strategy profile of the agents in a stochastic game; this function is used to establish the existence of MPE.
Let π be the behavioral strategy profile under discussion. Let vi,π(s, ai) denote the expected sum of discounted rewards of agent i if agent i uses pure action ai at state s at the first step and follows πi after that, while every other agent j maintains its strategy πj. Formally,

vi,π(s, ai) = ri,π−i(s, ai) + γ Σ_{s′∈S} Pπ−i(s′|s, ai) vi,π(s′).
For each player i ∈ [n], each state s ∈ S and each action ai ∈ Ai, we define the policy update of πi(s, ai) as

(f(π))i(s, ai) = (πi(s, ai) + max{0, vi,π(s, ai) − vi,π(s)}) / (1 + Σ_{bi∈Ai} max{0, vi,π(s, bi) − vi,π(s)}).
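The following is a minimal sketch of this update map for the tabular two-player representation used in the earlier snippets (our own code; the one-step deviation values vi,π(s, ai) are computed from the Bellman policy equation as above). A fixed point of this map leaves the profile unchanged, which is exactly the characterization established in Theorem 2 below.

```python
import numpy as np

def nash_style_update(P, r, pi, gamma):
    """One application of the update map f to a two-player strategy profile pi.

    (f(pi))_i(s, a_i) = (pi_i(s, a_i) + max(0, v_{i,pi}(s, a_i) - v_{i,pi}(s)))
                        / (1 + sum_b max(0, v_{i,pi}(s, b) - v_{i,pi}(s)))
    """
    S = P.shape[0]
    joint = np.einsum('sa,sb->sab', pi[0], pi[1])
    P_pi = np.einsum('sab,sabt->st', joint, P)
    new_pi = np.empty_like(pi)
    for i in range(2):
        r_pi = np.einsum('sab,sab->s', joint, r[i])
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)     # v_{i,pi}(s)
        if i == 0:                                              # v_{i,pi}(s, a_i)
            r_dev = np.einsum('sb,sab->sa', pi[1], r[0])
            P_dev = np.einsum('sb,sabt->sat', pi[1], P)
        else:
            r_dev = np.einsum('sa,sab->sb', pi[0], r[1])
            P_dev = np.einsum('sa,sabt->sbt', pi[0], P)
        q = r_dev + gamma * P_dev @ v
        gain = np.maximum(0.0, q - v[:, None])
        new_pi[i] = (pi[i] + gain) / (1.0 + gain.sum(axis=1, keepdims=True))
    return new_pi                     # each new_pi[i, s] is still a distribution
```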
We consider the infinite norm distance between two strategy profiles π1 and π2, denoted by ‖π1 − π2‖∞ = max_{i∈[n], s∈S, ai∈Ai} |π1,i(s, ai) − π2,i(s, ai)|.

We first prove that the function f satisfies a continuity property, namely that it is λ-Lipschitz with respect to this norm for an explicit constant λ determined by the parameters of the game.
Lemma 3. The function f is λ-Lipschitz, i.e. for every pair of strategy profiles π1, π2 such that ‖π1 − π2‖∞ ≤ δ, we have ‖f(π1) − f(π2)‖∞ ≤ λδ.
Proof. At any state s ∈ S, pick any player i ∈ [n]. For an action ai ∈ Ai, let M1(ai) denote max{0, vi,π1(s, ai) − vi,π1(s)} and M2(ai) the analogous quantity under π2. We will use the next claim (proof in the Appendix).
Claim 1. For any x, x′, y, y′, z, z′ ≥ 0 such that (x + y)/(1 + z) ≤ 1 and (x′ + y′)/(1 + z′) ≤ 1, it holds that |(x + y)/(1 + z) − (x′ + y′)/(1 + z′)| ≤ |x − x′| + |y − y′| + |z − z′|.
Take δ = ‖π1 − π2‖∞; then |π1,i(s, ai) − π2,i(s, ai)| ≤ δ for any i ∈ [n], s ∈ S and ai ∈ Ai. Next, for any ai ∈ Ai, we estimate |M1(ai) − M2(ai)|.
We should first derive an upper bound on
.
Claim 2. It holds that
This follows from the following claim (proof in the Appendix).
Claim 3. It holds that
Similarly, we have the following claim.
Claim 4. It holds that
To bound |vi,π1(s) − vi,π2(s)| for every s ∈ S, we denote by vi,π the column vector (vi,π(s))_{s∈S}, by ri,π the column vector (ri,π(s))_{s∈S} and by Pπ the matrix (Pπ(s′|s))_{s,s′∈S}. By the Bellman policy equation (Definition 3), we have vi,π = ri,π + γPπ vi,π, which means that vi,π = (I − γPπ)^{-1} ri,π.
We prove in Lemma 7 below that
for all
.
Now we are ready to give an upper bound on
for any
. We have
where the fourth line follows from the bound in Lemma 7 below.
Similarly, we establish a bound for
:
For any
, we have
Thus, for any
and any
, we obtain
This completes the proof of Lemma 3.□
Now we can establish the existence of MPE by the Brouwer fixed point theorem.
Theorem 2. For any stochastic game G, a strategy profile π is an MPE if and only if it is a fixed point of the function f, i.e. f(π) = π. Furthermore, the function f has at least one fixed point.
Proof. We first show that the function f has at least one fixed point. Brouwer’s fixed point theorem states that, for any continuous function mapping a compact convex set to itself, there is a fixed point. Note that f is a function mapping a compact convex set to itself. Also, f is continuous by Lemma 3. Therefore, the function f has at least one fixed point.
We then prove that a strategy profile π is an MPE if and only if it is a fixed point of f.
⇒: For the necessity, suppose that π is an MPE; then, by Definition 4, we have, for each player i ∈ [n], each state s ∈ S and each behavioral strategy π′i of agent i, vi,(π′i,π−i)(s) ≤ vi,π(s). By Lemma 4, to be proven next, we then have, for any action ai ∈ Ai, vi,π(s, ai) ≤ vi,π(s), which implies that max{0, vi,π(s, ai) − vi,π(s)} = 0. Then, for each player i ∈ [n], each state s ∈ S and each action ai ∈ Ai, (f(π))i(s, ai) = πi(s, ai). It follows that π is a fixed point of f.
⇐: For the proof of the sufficiency part, let π be a fixed point of f. Then, for each player i ∈ [n], each state s ∈ S and each action ai ∈ Ai,

πi(s, ai) = (πi(s, ai) + max{0, vi,π(s, ai) − vi,π(s)}) / (1 + Σ_{bi∈Ai} max{0, vi,π(s, bi) − vi,π(s)}).
We first provide the following claim given the condition that π is a fixed point.
Claim 5. For any s ∈ S, any i ∈ [n] and any ai ∈ Ai, vi,π(s, ai) ≤ vi,π(s).
Proof of Claim 5. Suppose for contradiction that there exist i ∈ [n], s ∈ S and an action di ∈ Ai such that vi,π(s, di) > vi,π(s). The above fixed point equation implies that πi(s, di) > 0.
Let
; then
. Note that, by the recursive definition of
, we have
Since
, there must exist some
such that
, because otherwise we have
for all
, which, combined with
, implies that
, a contradiction to the above equation. With some further calculation, we can have the equation
The above strict inequality follows because
as well as πi(s, ci) > 0.
This contradicts the assumption that π is a fixed point of f. Therefore, it holds for any ai ∈ Ai that vi,π(s, ai) ≤ vi,π(s).□
Combining Claim 5 and Lemma 4 (to be proven next), we find that, for any s ∈ S and any behavioral strategy π′i of agent i, vi,(π′i,π−i)(s) ≤ vi,π(s). Thus, π is an MPE by definition. This completes the proof of Theorem 2.□
Lemma 4. For any player i ∈ [n], given π−i and any strategy πi of agent i, the following two statements are equivalent:

1. for all s ∈ S and all ai ∈ Ai, vi,π(s, ai) ≤ vi,π(s);

2. for all s ∈ S and all behavioral strategies π′i of agent i, vi,(π′i,π−i)(s) ≤ vi,π(s).
Proof. Let Vi denote the space of value functions of agent i, i.e. real-valued functions on S, and define the l∞ norm of any v ∈ Vi as ‖v‖∞ = max_{s∈S} |v(s)|.
Pick any player i ∈ [n] and keep π−i fixed. Define the Bellman operator Φi on Vi such that, for any v ∈ Vi and any s ∈ S,

Φi(v)(s) = max_{ai∈Ai} { ri,π−i(s, ai) + γ Σ_{s′∈S} Pπ−i(s′|s, ai) v(s′) }.
Note that, for all
,
, since
.
We first prove the equivalence between statements 1 and 2, based on Claim 6 below, which will be proved next for completeness.
2⇒1: From statement 2, for all
,
, which is the fixed point of Φi by Claim 6 below. That is, for all
,
, by definition of the Bellman operator Φi. Statement 1 holds.
1⇒2: If statement 1 holds, we have, for all
,
. Since
by Claim 6 below, we get
.
, implying that
is a fixed point of Φi. By Claim 6, the unique fixed point of Φi is
. Therefore, for all
,
: statement 2 holds.
Claim 6. We have the following important properties.
It holds that Φi is a γ-contraction mapping with respect to the l∞ norm, and has a unique fixed point.
For any
and any
,
.
Let vi* denote the fixed point of Φi; then vi* is the optimal value function, i.e. for any behavioral strategy π′i of agent i and any s ∈ S, vi,(π′i,π−i)(s) ≤ vi*(s).
Proof of Claim 6. Define
We have
for all
and
.
We first prove that Φi is a γ-contraction mapping with respect to the l∞ norm. For all
, let
. We show that ‖Φi(v1) − Φi(v2)‖∞ ≤ γδ.
For all
and all
, observe that
so
.
Without loss of generality, one can suppose that Φi(v1)(s) ≥ Φi(v2)(s). Taking arbitrary
, we have
thus, |Φi(v1)(s) − Φi(v2)(s)| ≤ γδ. By symmetry, the claim holds for the case in which Φi(v1)(s) ≤ Φi(v2)(s). Therefore, it holds that ‖Φi(v1) − Φi(v2)‖∞ ≤ γδ. Thus, Φi is a γ-contraction mapping.
By the Banach fixed point theorem, we know that
has a unique fixed point
. Moreover, for any
, the point sequence v, Φi(v), Φi(Φi(v)), … converges to vi*, i.e. for all
, lim_{k→∞} (Φi)^{(k)}(v)(s) = vi*(s), where (Φi)^{(k)} = Φi ∘ (Φi)^{(k−1)} is defined recursively with (Φi)^{(1)} = Φi.
Next, for all
and all
,
, since
by definition.
Finally we prove that, for any
,
. For any
, define the operator
, such that, for any
and any
,
Note that, for any
,
is also a γ-contraction mapping. This is because, for any
such that ‖v1 − v2‖∞ = δ, we have shown that, for any
and any
,
, so
and then
.
For any
, we can observe that
by definition. By the Banach fixed point theorem, we know that
has a unique fixed point in
, so
is exactly the unique fixed point of
.
Now we arbitrarily take a policy
such that, for all
,
. It can be seen that, for any
,
It follows that
, so vki* is a fixed point of
. Since the unique fixed point of
is
, we have
. Thus, for any
,
.
To show that, for any
,
, we observe that, given
, if for all
, then, for any
and any
,
. Therefore,
. As we have, for all
and all
,
, we have, for any
and any
,
by induction. It follows that
. Let k → ∞; then we get
. Thus,
.
The claim that, for all
,
follows.
THE APPROXIMATION GUARANTEE
Theorem 2 states that π is a fixed point of
f if and only if π is an MPE for the stochastic game. Now we prove that
f has some good approximation properties: if we find an ε-approximate
fixed point π of f then it is also a -approximate
MPE for the stochastic game (combining the following Lemma 5 and Lemma 6). This
implies the PPAD-membership of Approximate MPE.
Lemma 5. Let ε > 0 and π be a strategy profile. If ‖f(π) − π‖∞ ≤ ε then, for each player i ∈ [n] and each state
, we have
where
Proof. Pick any player i ∈ [n] and any state
. For simplicity, for any
, define
and
.
First we give an upper bound on M(ai). For any
, it can be easily seen that
By the condition ‖f(π) − π‖∞ ≤ ε , for any
, we have
Set ε′ = (1 + Amax rmax /(1 − γ))ε; then we have the crucial inequality
(1) Let
denote
or, equivalently,
. Let
.
Case 1:
. By inequality (1) we have
Case 2:
. By inequality (1) we have
(2) As
and, for all
,
,
Therefore,
Moreover, observe that
. Substituting these into inequality (2), we get
It follows that
.
In conclusion, combining the two cases, we get
Thus, for each
, we have
which completes the proof.□
Lemma 6. Let ε > 0 and π be a strategy profile. If, for each player i ∈ [n] and each state s ∈ S, |Φi(vi,π)(s) − vi,π(s)| ≤ ε, then π is an ε/(1 − γ)-approximate MPE.
Proof. Recall the mapping Φi, defined as the Bellman operator, from the proof of Lemma 4. Let vi* be the unique fixed point of Φi and recall that, for all behavioral strategies π′i of agent i and all s ∈ S, vi,(π′i,π−i)(s) ≤ vi*(s).
Pick any player i ∈ [n]; by assumption, for each state
, we have
. On the other hand,
, so we have
, i.e.
.
Since Φi is a γ-contraction mapping,
In addition, by the triangle inequality we have
so it follows that
Thus, we have
It follows that, for any
and any
,
By definition, it follows that π is an ε/(1 − γ)-approximate MPE.□
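For readability, the contraction step of this proof can be summarized in a single chain (our own restatement): since Φi(vi*) = vi* and Φi is a γ-contraction,

```latex
\|v_i^* - v_{i,\pi}\|_\infty
\;\le\; \|\Phi_i(v_i^*) - \Phi_i(v_{i,\pi})\|_\infty + \|\Phi_i(v_{i,\pi}) - v_{i,\pi}\|_\infty
\;\le\; \gamma \|v_i^* - v_{i,\pi}\|_\infty + \varepsilon
\;\;\Longrightarrow\;\;
\|v_i^* - v_{i,\pi}\|_\infty \;\le\; \frac{\varepsilon}{1-\gamma},
```

and, since every deviation value vi,(π′i,π−i)(s) is at most vi*(s) by Claim 6, the ε/(1 − γ) bound on unilateral improvements follows.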
To conclude, Lemma 2 is proved by combining Lemma 5 and Lemma 6 with Lemma 3 and the PPAD-completeness of computing an approximate Brouwer fixed point of a Lipschitz function [26]; together with Lemma 1, this completes the proof of Theorem 1.
CONCLUSION
Solving an MPE in general-sum SGs has long been expected to be at least PPAD-hard. In this paper, we prove that computing an MPE in a finite-state infinite-horizon discounted SG is PPAD-complete. Our completeness result also immediately implies the PPAD-completeness of computing an MPE in action-free SGs and single-controller SGs. We hope that our results will encourage MARL researchers to study the computation of MPE in general-sum SGs and to propose sample-efficient MARL solutions, leading to richer algorithmic developments than those currently available for zero-sum SGs.
ACKNOWLEDGEMENT
We would like to thank Yuhao Li for his early work, when he was an undergraduate student at Peking University.
APPENDIX
Proof of Claim 1
We have

|(x + y)/(1 + z) − (x′ + y′)/(1 + z′)|
 = |(x + y)(1 + z′) − (x′ + y′)(1 + z)| / ((1 + z)(1 + z′))
 ≤ [|(x + y) − (x′ + y′)|(1 + z′) + (x′ + y′)|z − z′|] / ((1 + z)(1 + z′))
 ≤ [|(x + y) − (x′ + y′)|(1 + z′) + (1 + z′)|z − z′|] / ((1 + z)(1 + z′))
 = [|(x + y) − (x′ + y′)| + |z − z′|] / (1 + z)
 ≤ [|x − x′| + |y − y′| + |z − z′|] / (1 + z)
 ≤ |x − x′| + |y − y′| + |z − z′|.

The first and third inequalities follow by the triangle inequality, the second inequality holds because x′ + y′ ≤ 1 + z′ and the last inequality follows because 1 + z ≥ 1. Claim 1 follows immediately.
Proof of Claim 2
We have
![]() |
where the last inequality follows from the next claim.
Proof of Claim 3
We have
![]() |
Proof of Claim 4
We have
![]() |
Lemma 7 and its proof
Lemma 7. For every
such that ‖π1 − π2‖∞ ≤ δ, we have
for any
.
Proof. We first give an upper bound on
for any
:
Now we view Pπ as an S × S matrix. For any two S × S matrices M1, M2, we use ‖M1 − M2‖max to denote max i, j|M1(i, j) − M2(i, j)|, i.e. the max norm. Then we have
.
Let Q1 = (I − γPπ1)^{-1} and Q2 = (I − γPπ2)^{-1}. (Note that the inverse of (I − γPπ) must exist because γ < 1.) By definition, we have vi,π1 = Q1 ri,π1 and vi,π2 = Q2 ri,π2. Then, expanding ‖vi,π1 − vi,π2‖∞ = ‖Q1 ri,π1 − Q2 ri,π2‖∞ entrywise, we obtain a chain of bounds in which the sixth line follows from the following facts:
;
|Q1(k, j) − Q2(k, j)| ≤ max k|Q1(k, j) − Q2(k, j)|;
;
.
Note that . Since the
1-norm is submultiplicative, we have
![]() |
which leads to the fourth fact. So we have
![]() |
This completes the proof.
Contributor Information
Xiaotie Deng, Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing 100091, China; Center for Multi-Agent Research, Institute for AI, Peking University, Beijing 100091, China.
Ningyuan Li, Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing 100091, China.
David Mguni, Huawei UK, London WC1E 6BT, UK.
Jun Wang, Computer Science, University College London, London WC1E 6BT, UK.
Yaodong Yang, Center for Multi-Agent Research, Institute for AI, Peking University, Beijing 100091, China.
FUNDING
This work was partially supported by the Science and Technology Innovation 2030—‘New Generation of Artificial Intelligence’ Major Project (2018AAA0100901).
AUTHOR CONTRIBUTIONS
X.D., D.M. and J.W. designed the research; X.D. and D.M. identified the research problem; X.D. and N.L. performed the research; D.M. and Y.Y. coordinated the team; X.D., N.L. and Y.Y. wrote the paper.
Conflict of interest statement. None declared.
REFERENCES
- 1. Shapley LS. Stochastic games. Proc Natl Acad Sci USA 1953; 39: 1095–100. doi:10.1073/pnas.39.10.1095
- 2. Fink AM. Equilibrium in a stochastic n-person game. J Sci Hiroshima Univ Ser A-I Math 1964; 28: 89–93. doi:10.32917/hmj/1206139508
- 3. Maskin E, Tirole J. Markov perfect equilibrium: I. Observable actions. J Econ Theory 2001; 100: 191–219. doi:10.1006/jeth.2000.2785
- 4. Neyman A, Sorin S. Stochastic Games and Applications, vol. 570. New York: Springer Science and Business Media, 2003. doi:10.1007/978-94-010-0189-2
- 5. Albright SC, Winston W. A birth-death model of advertising and pricing. Adv Appl Probab 1979; 11: 134–52. doi:10.2307/1426772
- 6. Sobel MJ. Myopic solutions of Markov decision processes and stochastic games. Oper Res 1981; 29: 995–1009. doi:10.1287/opre.29.5.995
- 7. Filar J. Player aggregation in the traveling inspector model. IEEE Trans Automat Contr 1985; 30: 723–9. doi:10.1109/TAC.1985.1104060
- 8. Perez-Nieves N, Yang Y, Slumbers O et al. Modelling behavioural diversity for learning in open-ended games. In: Meila M, Zhang T (eds). Proceedings of the 38th International Conference on Machine Learning, vol. 139. Cambridge, MA: PMLR, 2021, 8514–24.
- 9. Filar J, Vrieze K. Competitive Markov Decision Processes. New York: Springer Science and Business Media, 2012.
- 10. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018.
- 11. Busoniu L, Babuska R, De Schutter B. A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern C Appl Rev 2008; 38: 156–72. doi:10.1109/TSMCC.2007.913919
- 12. Yang Y, Wang J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv: 2011.00583.
- 13. Littman ML. Markov games as a framework for multi-agent reinforcement learning. In: Cohen WW, Hirsh H (eds). Proceedings of the 11th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers, 1994, 157–63. doi:10.1016/B978-1-55860-335-6.50027-1
- 14. Hu J, Wellman MP. Multiagent reinforcement learning: theoretical framework and an algorithm. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers, 1998, 242–50.
- 15. Cesa-Bianchi N, Lugosi G. Prediction, Learning, and Games. Cambridge: Cambridge University Press, 2006.
- 16. Bubeck S, Cesa-Bianchi N. Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Delft: Now Publishers Inc, 2012. doi:10.1561/9781601986276
- 17. Takahashi M. Stochastic games with infinitely many strategies. J Sci Hiroshima Univ Ser A-I Math 1962; 26: 123–34. doi:10.32917/hmj/1206139732
- 18. Zinkevich M, Greenwald A, Littman M. Cyclic equilibria in Markov games. In: Proceedings of the 18th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2005, 1641–8.
- 19. Yang Y, Luo R, Li M et al. Mean field multi-agent reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80. Cambridge, MA: PMLR, 2018, 5571–80.
- 20. Guo X, Hu A, Xu R et al. Learning mean-field games. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2019, 4966–76.
- 21. Pérolat J, Strub F, Piot B et al. Learning Nash equilibrium for general-sum Markov games from batch data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54. Cambridge, MA: PMLR, 2017, 232–41.
- 22. Solan E, Vieille N. Stochastic games. Proc Natl Acad Sci USA 2015; 112: 13743–6. doi:10.1073/pnas.1513508112
- 23. Selten R. Reexamination of the perfectness concept for equilibrium points in extensive games. Internat J Game Theory 1975; 4: 25–55. doi:10.1007/BF01766400
- 24. Daskalakis C, Goldberg PW, Papadimitriou CH. The complexity of computing a Nash equilibrium. SIAM J Comput 2009; 39: 195–259. doi:10.1137/070699652
- 25. Chen X, Deng X, Teng SH. Settling the complexity of computing two-player Nash equilibria. J ACM 2009; 56: 1–57. doi:10.1145/1516512.1516516
- 26. Papadimitriou CH. On the complexity of the parity argument and other inefficient proofs of existence. J Comput Syst Sci 1994; 48: 498–532. doi:10.1016/S0022-0000(05)80063-7
- 27. Nash J. Non-cooperative games. Ann Math 1951; 54: 286–95. doi:10.2307/1969529
- 28. Bertsekas DP. Approximate dynamic programming. In: Sammut C, Webb GI (eds). Encyclopedia of Machine Learning. Boston, MA: Springer, 2010, 39.
- 29. Szepesvári C, Littman ML. Generalized Markov decision processes: dynamic-programming and reinforcement-learning algorithms. Technical Report. Brown University, 1997.
- 30. Lagoudakis MG, Parr R. Least-squares policy iteration. J Mach Learn Res 2003; 4: 1107–49.
- 31. Munos R, Szepesvári C. Finite-time bounds for fitted value iteration. J Mach Learn Res 2008; 9: 815–57.
- 32. Riedmiller M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In: Gama J, Camacho R, Brazdil PB et al. (eds). Machine Learning: ECML 2005, vol. 3720. Berlin: Springer, 2005, 317–28. doi:10.1007/11564096_32
- 33. Pérolat J, Piot B, Geist M et al. Softened approximate policy iteration for Markov games. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 48. Cambridge, MA: PMLR, 2016, 1860–8.
- 34. Lagoudakis MG, Parr R. Value function approximation in zero-sum Markov games. In: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan Kaufmann Publishers, 2002, 283–92.
- 35. Pérolat J, Scherrer B, Piot B et al. Approximate dynamic programming for two-player zero-sum Markov games. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37. Cambridge, MA: PMLR, 2015, 1321–9.
- 36. Sidford A, Wang M, Yang L et al. Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: Chiappa S, Calandra R (eds). Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, vol. 108. Cambridge, MA: PMLR, 2020, 2992–3002.
- 37. Daskalakis C, Foster DJ, Golowich N. Independent policy gradient methods for competitive reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2020, 5527–40.
- 38. Hansen TD, Miltersen PB, Zwick U. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J ACM 2013; 60: 1–16. doi:10.1145/2432622.2432623
- 39. Hu J, Wellman MP. Nash Q-learning for general-sum stochastic games. J Mach Learn Res 2003; 4: 1039–69.
- 40. Greenwald A, Hall K. Correlated-Q learning. In: Proceedings of the 20th International Conference on Machine Learning. Washington, DC: AAAI Press, 2003, 242–49.
- 41. Littman ML. Friend-or-foe Q-learning in general-sum games. In: Proceedings of the 18th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers, 2001, 322–8.
- 42. Prasad H, LA P, Bhatnagar S. Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2015, 1371–9.
- 43. Brafman RI, Tennenholtz M. R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. J Mach Learn Res 2002; 3: 213–31.
- 44. Wei CY, Hong YT, Lu CJ. Online reinforcement learning in stochastic games. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2017, 4994–5004.
- 45. Littman ML, Szepesvári C. A generalized reinforcement-learning model: convergence and applications. In: Proceedings of the 13th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers, 1996, 310–8.
- 46. Fan J, Wang Z, Xie Y et al. A theoretical analysis of deep Q-learning. In: Proceedings of the 2nd Conference on Learning for Dynamics and Control, vol. 120. Cambridge, MA: PMLR, 2020, 486–9.
- 47. Bowling M, Veloso M. Rational and convergent learning in stochastic games. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann Publishers, 2001, 1021–6.
- 48. Conitzer V, Sandholm T. AWESOME: a general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Mach Learn 2007; 67: 23–43. doi:10.1007/s10994-006-0143-1
- 49. Jin C, Allen-Zhu Z, Bubeck S et al. Is Q-learning provably efficient? In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2018, 4868–78.
- 50. Zhang Z, Zhou Y, Ji X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2020, 15198–207.
- 51. Li Y, Wang R, Yang LF. Settling the horizon-dependence of sample complexity in reinforcement learning. In: 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). Piscataway, NJ: IEEE Press, 2022, 965–76. doi:10.1109/FOCS52979.2021.00097
- 52. Chen X, Cheng Y, Tang B. Well-supported versus approximate Nash equilibria: query complexity of large games. In: Proceedings of the 2017 ACM Conference on Innovations in Theoretical Computer Science. New York, NY: Association for Computing Machinery, 2017, 57.
- 53. Song Z, Mei S, Bai Y. When can we learn general-sum Markov games with a large number of players sample-efficiently? In: International Conference on Learning Representations. La Jolla, CA: OpenReview, 2022.
- 54. Jin C, Liu Q, Wang Y et al. V-learning – a simple, efficient, decentralized algorithm for multiagent RL. arXiv: 2110.14555.