Reactive learning strategies for iterated games

Alex McAvoy; Martin A Nowak

doi:10.1098/rspa.2018.0819

. 2019 Mar 20;475(2223):20180819. doi: 10.1098/rspa.2018.0819

Reactive learning strategies for iterated games

Alex McAvoy ^1,^✉, Martin A Nowak ¹

PMCID: PMC6451968 PMID: 31007557

Abstract

In an iterated game between two players, there is much interest in characterizing the set of feasible pay-offs for both players when one player uses a fixed strategy and the other player is free to switch. Such characterizations have led to extortionists, equalizers, partners and rivals. Most of those studies use memory-one strategies, which specify the probabilities to take actions depending on the outcome of the previous round. Here, we consider ‘reactive learning strategies’, which gradually modify their propensity to take certain actions based on past actions of the opponent. Every linear reactive learning strategy, p*, corresponds to a memory one-strategy, p, and vice versa. We prove that for evaluating the region of feasible pay-offs against a memory-one strategy, $C (p)$ , we need to check its performance against at most 11 other strategies. Thus, $C (p)$ is the convex hull in $R^{2}$ of at most 11 points. Furthermore, if p is a memory-one strategy, with feasible pay-off region $C (p)$ , and p* is the corresponding reactive learning strategy, with feasible pay-off region $C (p^{*})$ , then $C (p^{*})$ is a subset of $C (p)$ . Reactive learning strategies are therefore powerful tools in restricting the outcomes of iterated games.

Keywords: adaptive strategy, iterated game, memory-one strategy, social dilemma

1. Introduction

Since the discovery of zero-determinant strategies for iterated games by Press & Dyson [1], there has been a growing interest in the set of possible pay-offs that can be achieved against a fixed strategy. Imagine that Alice uses a particular strategy, while Bob can try out any conceivable strategy. The resulting set of pay-offs for both Alice and Bob define the ‘feasible region’ of Alice's strategy. If Alice uses a so-called zero-determinant strategy [1], then the feasible region is a line. In general, the feasible region is a two-dimensional convex subset of the feasible pay-off region of the game (figure 1). Using the geometric intuition put forth by Press & Dyson [1], subsequent work has explored strategies that generate two-dimensional feasible regions, defined by linear inequalities rather than strict equations [2–4]. However, a general description of what this region looks like, as it relates to the type of strategy played, is currently not well understood. In this study, we characterize the feasible regions for the well-known class of memory-one strategies [5] and consider their relationships to those of a new class of ‘reactive learning strategies’.

Figure 1. — Feasible region (grey) for a strategy with p_•• = (0.7881, 0.8888, 0.4686, 0.0792) when R = 3, S = 0, T = 5 and P = 1. The light blue region depicts the set of all pay-off pairs that can be achieved in the iterated game, i.e. the convex hull of the points (R, R), (S, T), (P, P) and (T, S). The feasible region of p can be characterized as the convex hull of 11 points, corresponding to those opponent-strategies, q, appearing next to each black dot. In this instance, five of these points already fall inside of the convex hull of the remaining six. However, one cannot remove one of these 11 points without destroying this characterization for some game-strategy pair. (Online version in colour.)

Iterated games have many applications across the social sciences and biology, and with them has come a proliferation of strategy classes of various complexities [6–10]. The type of strategy a player uses for dealing with repeated encounters depends on many factors, including the cognitive capacity of the player and the nature of the underlying ‘one-shot’ (or ‘stage’) games. In applications to theoretical biology, the most well-studied type of strategy is known as ‘memory-one’ because it takes into account the outcome of only the previous encounter when determining how to play in the next round [5,11]. This class of strategies, while forming only a small subset of all possible ways to play an iterated game [12], has several advantages over more complicated strategies. They permit rich behaviour in iterated play, such as punishment for exploitation and reward for cooperation [5,13–18]; but, owing to their simple memory requirements, they are also straightforward to implement in practice and analyse mathematically.

Memory, however, can apply to more than just the players' actions in the previous round. Since the action a player chooses in any particular encounter is typically chosen stochastically rather than deterministically, a player can also take into account how they chose their previous action rather than just the result. In a social dilemma, for instance, each player chooses an action (‘cooperate’, C, or ‘defect’, D) in a given round and receives a pay-off for this action against that of the opponent. The distribution with which this action is chosen is referred to as a ‘mixed action’ and can be specified by a single number between 0 and 1, representing the tendency to cooperate. A standard memory-one strategy for player X is given by a five-tuple, (p₀, p_CC, p_CD, p_DC, p_DD), where p₀ is the probability of cooperation in the initial round and p_xy is the probability of cooperation following an outcome in which X uses action x and the opponent, Y , uses action y. We consider a variation on this theme, where instead of using x and y to determine the next mixed action, X uses the opponent's action, y, to update their own mixed action, σ_X∈[0, 1], that was used previously to generate x. We refer to a strategy of this form as a ‘reactive learning strategy’.

Such a strategy is ‘reactive’ because it takes into account the realized action of just the opponent, and it is ‘learning’ because it adapts to this external stimulus. Like a memory-one strategy, a reactive learning strategy for X requires knowledge of information one round into the past, namely X's mixed action, σ_X, and Y 's realized action, y. Unlike a memory-one strategy, in which the probability of cooperation is in the set {p₀, p_CC, p_CD, p_DC, p_DD} in every round of the game, a reactive learning strategy can result in a broad range of cooperation tendencies for X over the duration of an iterated game. Moreover, these tendencies can be gradually changed over the course of many rounds, resulting (for example) in high probabilities of cooperation only after the opponent has demonstrated a sufficiently long history of cooperating. Punishment for defection can be similarly realized over a number of interactions. Remembering a probability, σ_X, and an action, y, instead of just two actions, x and y, can thus lead to more complex behaviours.

This adaptive approach to iterated games is similar to the Bush–Mosteller reinforcement learning algorithm [19–21], but there are important distinctions. For one, a reactive learning strategy does not necessarily reinforce behaviour resulting in higher pay-offs. Furthermore, it completely disregards the focal player's realized action, using only that of the opponent in the update mechanism. But there are certainly reactive learning strategies that are more closely related to reinforcement learning, and we give an example using a variation on the memory-one strategy tit-for-tat (TFT), which we call ‘learning tit-for-tat (LTFT)’.

In this study, we establish some basic properties of reactive learning strategies relative to the memory-one space. We first characterize the feasible region of a memory-one strategy as the convex hull of at most 11 points. When then show that there is an embedding of the set of memory-one strategies in the set of reactive learning strategies with the following property: if p is a memory-one strategy and p* is the corresponding reactive learning strategy, then the feasible region of p contains the feasible region of p*. Moreover, the image of the map p↦p* is the set of linear reactive learning strategies, which consists of those strategies that send a player's mixed action, σ_X, to ασ_X + β for some α, β∈[0, 1]. As a consequence, if the goal of a player is to restrict the region of pay-offs attainable by the players, then this player should prefer using a linear reactive learning strategy over the corresponding memory-one strategy.

2. Memory-one strategies

Consider an iterated game between two players, X and Y . In every round, each player chooses an action from the set {C, D} (cooperate or defect). They receive pay-offs based on the values in the matrix

\begin{aligned} \begin{matrix} C & D \end{matrix} \\ \begin{matrix} C \\ D \end{matrix} (\begin{matrix} R & S \\ T & P \end{matrix}) \end{aligned}

2.1

Over many rounds, these pay-offs are averaged to arrive at an expected pay-off for each player.

Whereas an action specifies the behaviour of a player in one particular encounter, a strategy specifies how a player behaves over the course of many encounters. One of the simplest and best-studied strategies for iterated games is a memory-one strategy [5], which for player X is defined as follows: for every (x, y)∈{C, D}² observed as action outcomes of a given round, X devises a mixed action p_xy∈[0, 1] for the next round. The notation p_xy indicates that this mixed action depends on the (pure) actions of both players in the previous round, not how they arrived at those actions (e.g. by generating an action probabilistically). The term ‘strategy’ is reserved for the players' behaviours in the iterated game.

Let Mem¹_X be the space of all memory-one strategies for player X in an iterated game. With just two actions, C and D, we have Mem¹_X = [0, 1] × [0, 1]⁴, i.e. the space of all (p₀, p_CC, p_CD, p_DC, p_DD)∈[0, 1]⁵. A pair of memory-one strategies, p: = (p₀, p_CC, p_CD, p_DC, p_DD) and q: = (q₀, q_CC, q_CD, q_DC, q_DD), for X and Y , respectively, yield a Markov chain on the space of all action pairs, {C, D}², whose transition matrix is

\begin{aligned} \begin{matrix} C C & C D & D C & D D \end{matrix} \\ M (p, q) = \begin{matrix} C C \\ C D \\ D C \\ D D \end{matrix} (\begin{matrix} p_{C C} q_{C C} & p_{C C} (1 - q_{C C}) & (1 - p_{C C}) q_{C C} & (1 - p_{C C}) (1 - q_{C C}) \\ p_{C D} q_{D C} & p_{C D} (1 - q_{D C}) & (1 - p_{D C}) q_{D C} & (1 - p_{D C}) (1 - q_{D C}) \\ p_{D C} q_{C D} & p_{D C} (1 - q_{C D}) & (1 - p_{C D}) q_{C D} & (1 - p_{C D}) (1 - q_{C D}) \\ p_{D D} q_{D D} & p_{D D} (1 - q_{D D}) & (1 - p_{D D}) q_{D D} & (1 - p_{D D}) (1 - q_{D D}) \end{matrix}) \end{aligned}

2.2

and whose initial distribution is μ₀: = (p₀q₀, p₀(1 − q₀), (1 − p₀)q₀, (1 − p₀)(1 − q₀)). If p_xy, q_xy∈(0, 1) for every x, y∈{C, D}, then this chain is ergodic and has a unique stationary distribution, μ(p, q), which is independent of μ₀. In particular, the expected pay-offs, π_X(p, q) = μ(p, q) · (R, S, T, P) and π_Y(p, q) = μ(p, q) · (R, T, S, P), are independent of p₀ and q₀. In this case, π_X and π_Y are functions of just the response probabilities, p_••: = (p_CC, p_CD, p_DC, p_DD) and q_••: = (q_CC, q_CD, q_DC, q_DD).

A useful way of thinking about a strategy is through its feasible region, i.e. the set of all possible pay-off pairs (for X and Y ) that can be achieved against it. For any memory-one strategy p of X, let

C (p) := {(π_{Y} (p, q), π_{X} (p, q))}_{q \in {M e m}_{X}^{1}}

2.3

be this feasible region. (Note that, if X uses a memory-one strategy, then it suffices to assume that Y uses a memory-one strategy by the results of Press & Dyson [1].) This subset of the feasible region represents the ‘geometry’ of strategy p in the sense that it captures all possible pay-off pairs against an opponent.

In this section, we show that the feasible region for p∈Mem¹_X with p_••∈(0, 1)⁴ is characterized by playing p against the following 11 strategies: (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 0, 1), (1, 0, 1, 0), (1, 0, 1, 1) and (1, 1, 1, 1). In other words, $C (p)$ is the convex hull of 11 points (figure 1). Therefore, any p∈Mem¹_X generates a simple polygon in $R^{2}$ whose number of extreme points is uniformly bounded over all game-strategy pairs, ((R, S, T, P), p).

Lemma 2.1. —

For q∈Mem¹_X and x, y∈{C, D}, let (q;q_xy = q_xy′) be the strategy obtained from q by changing q_xy to q_xy′∈[0, 1]. If p_••∈(0, 1)⁴, q∈Mem¹_X and x, y∈{C, D}, then the point (π_Y (p, q), π_X(p, q)) falls on the line joining (π_Y (p, (q;q_xy = 0)), π_X(p, (q;q_xy = 0))) and (π_Y (p, (q;q_xy = 1)), π_X(p, (q;q_xy = 1))).

Proof. —

Let p_••∈(0, 1)⁴ and q∈Mem¹_X. Since the transition matrix of equation (2.2) is just 4 × 4, one can directly solve for its stationary distribution, μ(p, q) (e.g. by using Gaussian elimination or the determinant formula of Press & Dyson [1]). For example, suppose that x = y = C. Then, with

$L (q_{C C}) := \frac{(1 - q_{C C})}{1 + q_{C C} (\frac{\begin{matrix} p_{C C} - p_{C C} p_{C D} + p_{C C} p_{D D} - p_{C C} q_{C D} + p_{C C} q_{D D} + p_{D C} q_{C D} - p_{D D} q_{D D} + p_{C C} p_{C D} q_{C D} \\ - p_{C C} p_{C D} q_{D D} - p_{C C} p_{D D} q_{C D} - p_{C D} p_{D C} q_{C D} - p_{C C} p_{D C} q_{D C} + p_{C C} p_{D C} q_{D D} \\ + p_{C C} p_{D D} q_{D C} + p_{C D} p_{D C} q_{D C} - p_{C D} p_{D D} q_{D C} + p_{D C} p_{D D} q_{C D} + p_{C D} p_{D D} q_{D D} - p_{D C} p_{D D} q_{D D} \end{matrix}}{\begin{matrix} p_{D D} - p_{C D} - q_{C D} + q_{D D} + p_{C D} q_{C D} + p_{C D} q_{D C} + p_{D C} q_{C D} - p_{C D} q_{D D} - p_{D D} q_{C D} \\ - p_{D C} q_{D C} + p_{D C} q_{D D} + p_{D D} q_{D C} - p_{D D} q_{D D} - p_{C C} p_{C D} q_{D C} - p_{C D} p_{D C} q_{C D} + p_{C D} p_{D C} q_{D C} \\ + p_{C C} p_{D D} q_{D D} + p_{D C} p_{D D} q_{C D} - p_{D C} p_{D D} q_{D D} - p_{C D} q_{C D} q_{D C} + p_{D C} q_{C D} q_{D C} + p_{C D} q_{D C} q_{D D} \\ - p_{D D} q_{D C} q_{D D} + p_{C C} p_{C D} q_{C D} q_{D C} - p_{C C} p_{D C} q_{C D} q_{D C} - p_{C C} p_{C D} q_{D C} q_{D D} + p_{C C} p_{D C} q_{C D} q_{D D} \\ - p_{C C} p_{D D} q_{C D} q_{D D} - p_{C D} p_{D C} q_{C D} q_{D D} - p_{C D} p_{D D} q_{C D} q_{D C} + p_{C D} p_{D D} q_{C D} q_{D D} \\ + p_{C C} p_{D D} q_{D C} q_{D D} + p_{C D} p_{D C} q_{D C} q_{D D} + p_{D C} p_{D D} q_{C D} q_{D C} - p_{D C} p_{D D} q_{D C} q_{D D} + 1 \end{matrix}})},$ 2.4

one has

$\begin{aligned} (π_{Y} (p, q), π_{X} (p, q)) = L (q_{C C}) (π_{Y} (p, (q; q_{C C} = 0)), π_{X} (p, (q; q_{C C} = 0))) \\ + (1 - L (q_{C C})) (π_{Y} (p, (q; q_{C C} = 1)), π_{X} (p, (q; q_{C C} = 1))) . \end{aligned}$ 2.5

Provided (π_Y (p, (q;q_CC = 0)), π_X(p, (q;q_CC = 0)))≠(π_Y (p, (q;q_CC = 1)), π_X(p, (q;q_CC = 1))), we also have L(0) = 1 and L(1) = 0. Moreover, one can check that, under this condition, L′(q_CC) is nowhere equal to 0, and 0 ≤ slantL(q_CC) ≤ slant1 for every q_CC∈[0, 1]. The other cases with x, y∈{C, D} are analogous. ▪

Remark 2.2. —

Even when q_xy is uniformly distributed between 0 and 1, the corresponding points in the feasible region need not be uniformly distributed between the endpoints corresponding to q_xy = 0 and q_xy = 1, respectively (figure 2). This result is therefore somewhat different from the analogous situation of playing against a mixed action in a stage game, where, for a pay-off function $u : S_{X} \times S_{Y} \to R^{2}$ and mixed action σ_X∈Δ(S_X) and σ_Y ∈Δ(S_Y), one has $u (σ_{X}, σ_{Y}) = \int_{y \in S_{Y}} u (σ_{X}, y) d σ_{Y} (y)$ due to linearity.

Figure 2. — The set of points (π_Y (p, q), π_X(p, q)), where p_•• = (0.7876, 0.9856, 0.4095, 0.0301) and q_•• = (q_CC, 0.9963, 0.0166, 0.9879) as q_CC varies between 0 (green) and 1 (red) in uniform increments of 0.01. The resulting points all fall along a line; however, they are not uniformly distributed even though the distribution of q_CC is uniform. Parameters: R = 3, S = 0, T = 5 and P = 1. (Online version in colour.)

Proposition 2.3. —

For any p∈Mem¹_X with $p_{∙ ∙} \in {(0, 1)}^{4}, C (p)$ is the convex hull of the following 11 points:

$\begin{aligned} (\begin{matrix} π_{X}^{(0, 0, 0, 0)} \\ π_{Y}^{(0, 0, 0, 0)} \end{matrix}) & = (\begin{matrix} \frac{P - P p_{C D} + S p_{D D}}{p_{D D} - p_{C D} + 1} \\ \frac{P - P p_{C D} + T p_{D D}}{p_{D D} - p_{C D} + 1} \end{matrix}); \end{aligned}$ 2.6a

$\begin{aligned} (\begin{matrix} π_{X}^{(0, 0, 0, 1)} \\ π_{Y}^{(0, 0, 0, 1)} \end{matrix}) & = (\begin{matrix} \frac{P + T - P p_{C D} + R p_{D D} + S p_{D C} - T p_{C D} - T p_{D D} - R p_{C D} p_{D D} + S p_{C C} p_{D D} - S p_{D C} p_{D D} + T p_{C D} p_{D D}}{p_{D C} - 2 p_{C D} + p_{C C} p_{D D} - p_{D C} p_{D D} + 2} \\ \frac{P + S - P p_{C D} + R p_{D D} - S p_{C D} - S p_{D D} + T p_{D C} - R p_{C D} p_{D D} + S p_{C D} p_{D D} + T p_{C C} p_{D D} - T p_{D C} p_{D D}}{p_{D C} - 2 p_{C D} + p_{C C} p_{D D} - p_{D C} p_{D D} + 2} \end{matrix}); \end{aligned}$ 2.6b

$\begin{aligned} (\begin{matrix} π_{X}^{(0, 0, 1, 0)} \\ π_{Y}^{(0, 0, 1, 0)} \end{matrix}) & = (\begin{matrix} \frac{P - P p_{D C} + S p_{D D} + T p_{D D} - P p_{C C} p_{C D} + P p_{C D} p_{D C} + R p_{C D} p_{D D} - T p_{C D} p_{D D}}{2 p_{D D} - p_{D C} - p_{C C} p_{C D} + p_{C D} p_{D C} + 1} \\ \frac{P - P p_{D C} + S p_{D D} + T p_{D D} - P p_{C C} p_{C D} + P p_{C D} p_{D C} + R p_{C D} p_{D D} - S p_{C D} p_{D D}}{2 p_{D D} - p_{D C} - p_{C C} p_{C D} + p_{C D} p_{D C} + 1} \end{matrix}); \end{aligned}$ 2.6c

$\begin{aligned} (\begin{matrix} π_{X}^{(0, 0, 1, 1)} \\ π_{Y}^{(0, 0, 1, 1)} \end{matrix}) & = (\begin{matrix} \frac{\begin{matrix} P + T - P p_{D C} + R p_{D D} + S p_{D C} - T p_{D D} - P p_{C C} p_{C D} + P p_{C D} p_{D C} + R p_{C D} p_{D C} \\ - R p_{D C} p_{D D} + S p_{C C} p_{D D} - S p_{D C} p_{D D} - T p_{C C} p_{C D} + T p_{C C} p_{D D} \end{matrix}}{2 (p_{C C} p_{D D} - p_{C C} p_{C D} + p_{C D} p_{D C} - p_{D C} p_{D D} + 1)} \\ \frac{\begin{matrix} P + S - P p_{D C} + R p_{D D} - S p_{D D} + T p_{D C} - P p_{C C} p_{C D} + P p_{C D} p_{D C} + R p_{C D} p_{D C} \\ - R p_{D C} p_{D D} - S p_{C C} p_{C D} + S p_{C C} p_{D D} + T p_{C C} p_{D D} - T p_{D C} p_{D D} \end{matrix}}{2 (p_{C C} p_{D D} - p_{C C} p_{C D} + p_{C D} p_{D C} - p_{D C} p_{D D} + 1)} \end{matrix}); \end{aligned}$ 2.6d

$\begin{aligned} (\begin{matrix} π_{X}^{(0, 1, 0, 1)} \\ π_{Y}^{(0, 1, 0, 1)} \end{matrix}) & = (\begin{matrix} \frac{T + P p_{D C} + R p_{D C} - T p_{C D} - T p_{D D} - P p_{C D} p_{D C} - R p_{C D} p_{D C} + S p_{C C} p_{D C} + T p_{C D} p_{D D}}{2 p_{D C} - p_{C D} - p_{D D} + p_{C C} p_{D C} - 2 p_{C D} p_{D C} + p_{C D} p_{D D} + 1} \\ \frac{S + P p_{D C} + R p_{D C} - S p_{C D} - S p_{D D} - P p_{C D} p_{D C} - R p_{C D} p_{D C} + S p_{C D} p_{D D} + T p_{C C} p_{D C}}{2 p_{D C} - p_{C D} - p_{D D} + p_{C C} p_{D C} - 2 p_{C D} p_{D C} + p_{C D} p_{D D} + 1} \end{matrix}); \end{aligned}$ 2.6e

$\begin{aligned} (\begin{matrix} π_{X}^{(0, 1, 1, 0)} \\ π_{Y}^{(0, 1, 1, 0)} \end{matrix}) & = (\begin{matrix} \frac{P p_{D C} + T p_{D D} - P p_{C C} p_{D C} + R p_{D C} p_{D D} + S p_{D C} p_{D D} - T p_{C D} p_{D D}}{p_{D C} + p_{D D} - p_{C C} p_{D C} - p_{C D} p_{D D} + 2 p_{D C} p_{D D}} \\ \frac{P p_{D C} + S p_{D D} - P p_{C C} p_{D C} + R p_{D C} p_{D D} - S p_{C D} p_{D D} + T p_{D C} p_{D D}}{p_{D C} + p_{D D} - p_{C C} p_{D C} - p_{C D} p_{D D} + 2 p_{D C} p_{D D}} \end{matrix}); \end{aligned}$ 2.6f

$\begin{aligned} (\begin{matrix} π_{X}^{(0, 1, 1, 1)} \\ π_{Y}^{(0, 1, 1, 1)} \end{matrix}) & = (\begin{matrix} \frac{T + P p_{D C} + R p_{D C} - T p_{D D} - P p_{C C} p_{D C} + S p_{C C} p_{D C} - T p_{C C} p_{C D} + T p_{C C} p_{D D}}{2 p_{D C} - p_{D D} - p_{C C} p_{C D} + p_{C C} p_{D D} + 1} \\ \frac{S + P p_{D C} + R p_{D C} - S p_{D D} - P p_{C C} p_{D C} - S p_{C C} p_{C D} + S p_{C C} p_{D D} + T p_{C C} p_{D C}}{2 p_{D C} - p_{D D} - p_{C C} p_{C D} + p_{C C} p_{D D} + 1} \end{matrix}); \end{aligned}$ 2.6g

$\begin{aligned} (\begin{matrix} π_{X}^{(1, 0, 0, 1)} \\ π_{Y}^{(1, 0, 0, 1)} \end{matrix}) & = (\begin{matrix} - \frac{P + T - P p_{C C} - P p_{C D} + R p_{D D} + S p_{D C} - T p_{C C} - T p_{C D} + P p_{C C} p_{C D} - R p_{C D} p_{D D} - S p_{C C} p_{D C} + T p_{C C} p_{C D}}{2 p_{C C} + 2 p_{C D} - p_{D C} - p_{D D} - 2 p_{C C} p_{C D} + p_{C C} p_{D C} + p_{C D} p_{D D} - 2} \\ - \frac{P + S - P p_{C C} - P p_{C D} + R p_{D D} - S p_{C C} - S p_{C D} + T p_{D C} + P p_{C C} p_{C D} - R p_{C D} p_{D D} + S p_{C C} p_{C D} - T p_{C C} p_{D C}}{2 p_{C C} + 2 p_{C D} - p_{D C} - p_{D D} - 2 p_{C C} p_{C D} + p_{C C} p_{D C} + p_{C D} p_{D D} - 2} \end{matrix}); \end{aligned}$ 2.6h

$\begin{aligned} (\begin{matrix} π_{X}^{(1, 0, 1, 0)} \\ π_{Y}^{(1, 0, 1, 0)} \end{matrix}) & = (\begin{matrix} \frac{P - P p_{C C} - P p_{D C} + S p_{D D} + T p_{D D} + P p_{C C} p_{D C} + R p_{C D} p_{D D} - S p_{C C} p_{D D} - T p_{C C} p_{D D}}{2 p_{D D} - p_{D C} - p_{C C} + p_{C C} p_{D C} - 2 p_{C C} p_{D D} + p_{C D} p_{D D} + 1} \\ \frac{P - P p_{C C} - P p_{D C} + S p_{D D} + T p_{D D} + P p_{C C} p_{D C} + R p_{C D} p_{D D} - S p_{C C} p_{D D} - T p_{C C} p_{D D}}{2 p_{D D} - p_{D C} - p_{C C} + p_{C C} p_{D C} - 2 p_{C C} p_{D D} + p_{C D} p_{D D} + 1} \end{matrix}); \end{aligned}$ 2.6i

$\begin{aligned} (\begin{matrix} π_{X}^{(1, 0, 1, 1)} \\ π_{Y}^{(1, 0, 1, 1)} \end{matrix}) & = (\begin{matrix} \frac{P + T - P p_{C C} - P p_{D C} + R p_{D D} + S p_{D C} - T p_{C C} + P p_{C C} p_{D C} + R p_{C D} p_{D C} - R p_{D C} p_{D D} - S p_{C C} p_{D C}}{p_{D D} - 2 p_{C C} + p_{C D} p_{D C} - p_{D C} p_{D D} + 2} \\ \frac{P + S - P p_{C C} - P p_{D C} + R p_{D D} - S p_{C C} + T p_{D C} + P p_{C C} p_{D C} + R p_{C D} p_{D C} - R p_{D C} p_{D D} - T p_{C C} p_{D C}}{p_{D D} - 2 p_{C C} + p_{C D} p_{D C} - p_{D C} p_{D D} + 2} \end{matrix}); \end{aligned}$ 2.6j

$\begin{aligned} and (\begin{matrix} π_{X}^{(1, 1, 1, 1)} \\ π_{Y}^{(1, 1, 1, 1)} \end{matrix}) & = (\begin{matrix} \frac{T + R p_{D C} - T p_{C C}}{p_{D C} - p_{C C} + 1} \\ \frac{S + R p_{D C} - S p_{C C}}{p_{D C} - p_{C C} + 1} \end{matrix}) . \end{aligned}$ 2.6k

Proof. —

Press & Dyson [1] show that if X uses a memory-one strategy, p, then any strategy of the opponent, y, can be replaced by a memory-one strategy, q, without changing the pay-offs to X and Y ; thus, if X uses a memory-one strategy, one may assume without a loss of generality that Y also uses a memory-one strategy. If p_••∈(0, 1)⁴ and q∈Mem¹_X, the fact that (π_Y (p, q), π_X(p, q)) can be written as a convex combination of the 16 points {(π_Y (p, q′), π_X(p, q′))}_{q_••′∈{0, 1}⁴} then follows immediately from lemma 2.1. Moreover, the points corresponding to (0, 0, 0, 0), (0, 1, 0, 0) and (1, 0, 0, 0) are the same, as are the points corresponding to (1, 1, 0, 1), (1, 1, 1, 0) and (1, 1, 1, 1); thus, we can eliminate four points. Furthermore, we can remove the point associated with (1, 1, 0, 0) because it lies on the line connecting the points associated with (0, 0, 0, 0) and (1, 1, 1, 1). One can easily check that the remaining 11 points have the following property: if point i is removed, then there exist R, S, T, P and p for which $C (p)$ is not the convex hull of the 10 points different from i (table 1). Thus, for a general p and pay-off matrix, all 11 of these points are required. ▪

Table 1.

For each point, π^{(i₁, i₂, i₃, i₃)}_X,Y, the feasible region $C (p)$ cannot (in general) be expressed as the convex hull of the remaining 10 points different from π^{(i₁, i₂, i₃, i₃)}_X,Y. That is, each row gives (i) one of the 11 points of which $C$ is the convex hull and (ii) an example of a game-strategy pair for which π^{(i₁, i₂, i₃, i₃)}_X,Y is an extreme point of $C (p)$ .

point	$(\begin{matrix} R & S \\ T & P \end{matrix})$	p_••
π^(0,0,0,0)_X,Y	$(\begin{matrix} 4.5953 & - 3.5001 \\ - 0.1798 & 4.4972 \end{matrix})$	(0.0347, 0.8913, 0.9873, 0.1164)
π^(0,0,0,1)_X,Y	$(\begin{matrix} 3.5909 & 3.7183 \\ 3.1091 & 2.6508 \end{matrix})$	(0.3420, 0.5591, 0.0468, 0.9941)
π^(0,0,1,0)_X,Y	$(\begin{matrix} 0.1150 & 1.2677 \\ - 2.8725 & 1.4290 \end{matrix})$	(0.8937, 0.9211, 0.6995, 0.0052)
π^(0,0,1,1)_X,Y	$(\begin{matrix} - 0.1523 & 1.7642 \\ - 3.3334 & - 3.9907 \end{matrix})$	(0.5319, 0.4107, 0.9805, 0.0823)
π^(0,1,0,1)_X,Y	$(\begin{matrix} 2.1084 & 0.4235 \\ 4.5449 & - 4.5716 \end{matrix})$	(0.3897, 0.6428, 0.2422, 0.0300)
π^(0,1,1,0)_X,Y	$(\begin{matrix} 2.5627 & - 2.5701 \\ - 4.1353 & 4.0437 \end{matrix})$	(0.7502, 0.7603, 0.9999, 0.3161)
π^(0,1,1,1)_X,Y	$(\begin{matrix} 0.0600 & 1.1524 \\ 2.8660 & 1.3631 \end{matrix})$	(0.1145, 0.9494, 0.7587, 0.9214)
π^(1,0,0,1)_X,Y	$(\begin{matrix} - 4.4025 & 1.6813 \\ - 2.9162 & 1.1664 \end{matrix})$	(0.9629, 0.0020, 0.2554, 0.8444)
π^(1,0,1,0)_X,Y	$(\begin{matrix} 0.1167 & 2.5125 \\ - 0.3462 & - 4.6919 \end{matrix})$	(0.4121, 0.4373, 0.5380, 0.8915)
π^(1,0,1,1)_X,Y	$(\begin{matrix} - 0.3787 & 1.1357 \\ 1.5417 & 2.7617 \end{matrix})$	(0.2570, 0.5191, 0.1293, 0.9332)
π^(1,1,1,1)_X,Y	$(\begin{matrix} - 1.8211 & - 3.2300 \\ - 4.6281 & - 0.4609 \end{matrix})$	(0.0009, 0.4996, 0.4362, 0.9653)

Open in a new tab

Remark 2.4. —

p enforces a linear pay-off relationship if and only if these 11 points are collinear.

Remark 2.5. —

One needs all 11 of these points for general R, S, T, P and p. However, for any particular game-strategy pair, it is often the case that several of these points are unnecessary because they lie within the convex hull of some other subset of these 11 points; they are typically not all extreme points of $C (p)$ .

3. Reactive learning strategies

In a traditional memory-one strategy, X's probability of playing C depends on the realized actions of the two players, x and y. However, X can observe more than just their pure action against the opponent's; they also know how they arrived at x (i.e. they know the mixed action, σ_X, that resulted in x in the previous round). Of course, X need not be able to see Y 's mixed action, but they can still observe the pure action Y played. Therefore, an alternative notion of a memory-one strategy for player X could be defined as follows: after X plays σ_X∈[0, 1] and Y plays y, X then chooses a new action based on the distribution p*_{σ_Xy}∈[0, 1]. In this formulation, p* is a map from [0, 1] × {C, D} to [0, 1]. We refer to such a map, p*, together with X's initial probability of playing C, p₀, as a ‘reactive learning strategy’ for player X (figure 3).

Figure 3. — The space of memory-one strategies, **Mem**¹_X, as it relates to the space of reactive learning strategies, RL_X. Both sets contain the space of reactive strategies [22], which take into account only the last move, y, of the opponent. Whereas a memory-one strategy takes into account the last pure action of X as well, x, a reactive learning strategy uses X's last *mixed* action, σ_X∈[0, 1]. After each round, a reactive learning strategy uses y to update X's probability of cooperating. RL_X is ‘larger’ than **Mem**¹_X in the sense that there is an injective map ${M e m}_{X}^{1} \to {R L}_{X}$ that is not surjective. (Online version in colour.)

In other words, in contrast to Mem¹_X = [0, 1] × [0, 1]⁴, which can be alternatively described as

{M e m}_{X}^{1} = [0, 1] \times {p : {C, D} \times {C, D} \to [0, 1]},

3.1

we define the space of reactive learning strategies as

{R L}_{X} := [0, 1] \times {p^{*} : [0, 1] \times {C, D} \to [0, 1]},

3.2

where [0, 1] indicates the space of mixed actions for X and {C, D} indicates the action space for Y . Although [0, 1] is a much larger space than {C, D}, the updates of mixed actions can be easier to specify using reactive learning strategies since they allow for adaptive modification of an existing mixed action (without the need to devise a new mixed action from scratch after every observed history of play).

Example 3.1. —

Suppose that player X starts by playing C and D with equal probability, i.e. p₀ = 1/2. For fixed η∈[0, 1] (the ‘learning rate’), cooperation from the opponent leads to p*_{σ_XC} = (1 − η)σ_X + η while defection leads to p*_{σ_XD} = (1 − η)σ_X. Thus, a long pattern of exploitation by Y leads X to defect more often. On the other hand, X does not immediately forgive such behaviour but rather requires Y to cooperate repeatedly to bring X back up to higher levels of cooperation. For example, if X starts with p₀ and Y defects ℓ times in a row, then X subsequently cooperates with probability (1 − η)^ℓp₀. In order to bring X's probability of cooperation above p₀ once again, Y must then cooperate for T rounds, where

$T ⩾ \frac{\log ((1 - p_{0}) / (1 - {(1 - η)}^{ℓ} p_{0}))}{\log (1 - η)} .$ 3.3

We refer to this strategy as LTFT because it pushes a player's cooperation probability in the direction of the opponent's last move (figure 4). In this way, a reactive learning strategy can encode more complicated behaviour than a memory-one strategy. Conversely, memory-one strategies can also encode behaviour not captured by reactive learning strategies, which we discuss further in §3c.

Figure 4. — ‘Learning tit-for-tat (LTFT)’, an analogue of tit-for-tat (TFT) within the space of reactive learning strategies. LTFT is the function of two parameters, p₀ (the initial mixed action) and η (the learning rate). Player X initially plays C with probability p₀. In all subsequent rounds, if X played C with probability σ_X and Y played C (resp. D) in the previous round, in the next round X plays C with probability p*_{σ_XC} = (1 − η)σ_X + η (resp. p*_{σ_XD} = (1 − η)σ_X). At the corners lie the strategies ALLD (always defect), ALLC (always cooperate), TFT (tit-for-tat) and STFT (suspicious tit-for-tat).

(a). Linear reactive learning strategies

A pertinent question at this point is whether there is a ‘natural’ map from Mem¹_X to RL_X. Let (p₀, p_••) = (p₀, p_CC, p_CD, p_DC, p_DD) be a memory-one strategy. If (p₀′, p*) is the corresponding reactive learning strategy, then the first requirement we impose is p₀′ = p₀. If σ_X = 1, then X plays C with probability one. It is therefore reasonable to insist that p*_1y = p_Cy. Similarly, X plays D with probability one when σ_X = 0, and we insist that p*_0y = p_Dy. Suppose now that σ_X and σ_X′ are two mixed actions for X. If Y plays y∈{C, D}, then the responses for X corresponding to σ_X and σ_X′ are p*_{σ_Xy} and p*_{σ_X′y}, respectively. If X plays σ_X with probability w∈[0, 1] and σ_X′ with probability 1 − w, then it is also natural to insist that the response is p*_{σ_Xy} with probability w and p*_{σ_X′y} with probability 1 − w. Thus, for any σ_X∈[0, 1] and y∈{C, D}, with these requirements p* can be written uniquely in terms of p_•• as

p_{σ_{X} y}^{*} = σ_{X} p_{1 y}^{*} + (1 - σ_{X}) p_{0 y}^{*} = σ_{X} p_{C y} + (1 - σ_{X}) p_{D y} .

3.4

Using this map, one can naturally identify Mem¹_X with the set of linear reactive learning strategies, LRL_X⊆RL_X, consisting of those functions $p^{*} : [0, 1] \times {C, D} \to [0, 1]$ for which there exist $a, b, c, d \in R$ with

\begin{aligned} p_{σ_{X} C}^{*} & = σ_{X} a + (1 - σ_{X}) c \end{aligned}

3.5a

\begin{aligned} and p_{σ_{X} D}^{*} & = σ_{X} b + (1 - σ_{X}) d . \end{aligned}

3.5b

Clearly, any such a, b, c, d must lie in [0, 1] since p*_{σ_Xy}∈[0, 1] for every σ_X∈[0, 1] and y∈{C, D}.

Under this correspondence, the strategy of example 3.1 has parameters (1/2, 1, 1 − η, η, 0). But note that this map, ${M e m}_{X}^{1} \to {R L}_{X}$ , is not surjective due to the fact that not every reactive learning strategy is linear. For example, if (a, b, c, d)∈[0, 1]⁴ and p*∈RL_X is the quadratic response function defined by

\begin{aligned} p_{σ_{X} C}^{*} := {(σ_{X})}^{2} a + (1 - {(σ_{X})}^{2}) c \end{aligned}

3.6a

\begin{aligned} and p_{σ_{X} D}^{*} := {(σ_{X})}^{2} b + (1 - {(σ_{X})}^{2}) d, \end{aligned}

3.6b

then there exists no (p_CC, p_CD, p_DC, p_DD)∈[0, 1]⁴ mapping to p* provided a≠c or b≠d.

(b). Stationary distributions

Suppose that (p₀, p*) and (q₀, q*) are reactive learning strategies for X and Y , respectively. These strategies generate a Markov chain on the (infinite) space {C, D}² × [0, 1]² with transition probabilities between ((x, y), (σ_X, σ_Y)), ((x′, y′), (p*_{σ_Xy}, q*_{σ_Yx}))∈ {C, D}² × [0, 1]² given by

P_{((x, y), (σ_{X}, σ_{Y})) \to ((x^{'}, y^{'}), (p_{σ_{X} y}^{*}, q_{σ_{Y} x}^{*}))} := {\begin{cases} p_{σ_{X} y}^{*} q_{σ_{Y} x}^{*} & x^{'} = C, y^{'} = C, \\ p_{σ_{X} y}^{*} (1 - q_{σ_{Y} x}^{*}) & x^{'} = C, y^{'} = D, \\ (1 - p_{σ_{X} y}^{*}) q_{σ_{Y} x}^{*} & x^{'} = D, y^{'} = C, \\ (1 - p_{σ_{X} y}^{*}) (1 - q_{σ_{Y} x}^{*}) & x^{'} = D, y^{'} = D . \end{cases}

3.7

To simplify notation, we can also denote the right-hand side of this equation by p*_{σ_Xy}(x′)q*_{σ_Yx}(y′).

If ν is a stationary distribution of this chain, then, for any ((x, y), (σ_X, σ_Y))∈{C, D}² × [0, 1]²,

\begin{aligned} ν ((x, y), (σ_{X}, σ_{Y})) & = \int_{\begin{matrix} ((x^{'}, y^{'}), (σ_{X}^{'}, σ_{Y}^{'})) \\ (p_{σ_{X}^{'} y^{'}}^{*}, q_{σ_{Y}^{'} x^{'}}^{*}) = (σ_{X}, σ_{Y}) \end{matrix}} P_{((x^{'}, y^{'}), (σ_{X}^{'}, σ_{Y}^{'})) \to ((x, y), (σ_{X}, σ_{Y}))} d ν ((x^{'}, y^{'}), (σ_{X}^{'}, σ_{Y}^{'})) \\ = \int_{\begin{matrix} ((x^{'}, y^{'}), (σ_{X}^{'}, σ_{Y}^{'})) \\ (p_{σ_{X}^{'} y^{'}}^{*}, q_{σ_{Y}^{'} x^{'}}^{*}) = (σ_{X}, σ_{Y}) \end{matrix}} σ_{X} (x) σ_{Y} (y) d ν ((x^{'}, y^{'}), (σ_{X}^{'}, σ_{Y}^{'})) . \end{aligned}

3.8

In general, ν is difficult to give explicitly. However, it is possible to understand the marginal distributions on σ_X and σ_Y in more detail (see appendix A). In any case, having an explicit formula for ν is not necessary for obtaining our main result on feasible pay-off regions, which we turn to in the next section.

(c). Feasible pay-off regions

By looking at the feasible region of a strategy, we uncover a nice relationship between a memory-one strategy, p, and its corresponding (linear) reactive learning strategy, p*. Namely, for every p∈Mem¹_X, we have $C (p^{*}) \subseteq C (p)$ . In this section, we give a proof of this fact and illustrate some of its consequences.

For t≥1, let $H_{t} = {({C, D}^{2})}^{t}$ be the history of play from time 0 through time t − 1 [12]. When t = 0, $H_{0} = {\emptyset}$ , where $\emptyset$ denotes the ‘empty’ history, indicating that no play came before the present encounter. A behavioural strategy for a player specifies, for every possible history of play, a probability of using C in the next encounter. That is, if $H := ⊔_{t ⩾ 0} H_{t}$ , then a behavioural strategy is a map $H \to [0, 1]$ . The following lemma shows that when considering the feasible region of a memory-one or reactive learning strategy, one can assume without a loss of generality that the opponent is playing a Markov strategy.

Lemma 3.2. —

Let $M \subseteq B$ be the set of all Markov strategies, i.e.

$M := {y : {1, 2, \dots} \times {C, D}^{2} \to [0, 1]} .$ 3.9

For any x∈Mem¹_X∪RL_X, we have $C (x) = {(π_{Y} (x, y), π_{X} (x, y))}_{y \in M}$ .

Proof. —

When p∈Mem¹_X, the lemma follows from [[1], appendix A]. Specifically, when X plays p∈Mem¹_X against $y \in B$ , consider the time-t distributions μ_t on {C, D}² and ${\bar{μ}}_{t}$ on $H_{t}$ . For (x_t+1, y_t+1)∈{C, D}²,

$\begin{aligned} μ_{t + 1} (x_{t + 1}, y_{t + 1}) & = \sum_{h_{t + 1} \in H_{t + 1}} p_{x_{t} y_{t}} (x_{t + 1}) y_{h_{t + 1}} (y_{t + 1}) {\bar{μ}}_{t + 1} (h_{t + 1}) \\ = \sum_{h_{t + 1} \in H_{t + 1}} p_{x_{t} y_{t}} (x_{t + 1}) y_{(h_{t}, (x_{t}, y_{t}))} (y_{t + 1}) {\bar{μ}}_{t + 1} (h_{t + 1}) \\ = \sum_{(x_{t}, y_{t}) \in {C, D}^{2}} p_{x_{t} y_{t}} (x_{t + 1}) \sum_{h_{t} \in H_{t}} y_{(h_{t}, (x_{t}, y_{t}))} (y_{t + 1}) μ_{t} (x_{t}, y_{t} ∣ h_{t}) {\bar{μ}}_{t} (h_{t}) . \end{aligned}$ 3.10

Therefore, the same sequence of distributions {μ_t}_t≥0 arises when Y uses the Markov strategy defined by

$q_{x_{t} y_{t}}^{t + 1} (y_{t + 1}) := \frac{\sum_{h_{t} \in H_{t}} y_{(h_{t}, (x_{t}, y_{t}))} (y_{t + 1}) μ_{t} (x_{t}, y_{t} ∣ h_{t}) {\bar{μ}}_{t} (h_{t})}{\sum_{h_{t} \in H_{t}} μ_{t} (x_{t}, y_{t} ∣ h_{t}) {\bar{μ}}_{t} (h_{t})} .$ 3.11

If $p^{*} : [0, 1] \times {C, D} \to [0, 1]$ is a reactive learning strategy that X uses against $y \in B$ , then for every t≥0 there are distributions ν_t on {C, D}², χ_t on [0, 1], and ${\bar{ν}}_{t}$ on $H_{t} \times [0, 1]$ . For (x_t+1, y_t+1)∈{C, D}²,

$\begin{aligned} ν_{t + 1} (x_{t + 1}, y_{t + 1}) \\ = \int_{(h_{t + 1}, σ_{X}^{t}) \in H_{t + 1} \times [0, 1]} p_{σ_{X}^{t} y_{t}}^{*} (x_{t + 1}) y_{h_{t + 1}} (y_{t + 1}) d {\bar{ν}}_{t + 1} (h_{t + 1}, σ_{X}^{t}) \\ = \sum_{(x_{t}, y_{t}) \in {C, D}^{2}} \int_{σ_{X}^{t} \in [0, 1]} p_{σ_{X}^{t} y_{t}}^{*} (x_{t + 1}) \\ \times \int_{(h_{t}, σ_{X}^{t - 1}) \in H_{t} \times [0, 1]} y_{(h_{t}, (x_{t}, y_{t}))} (y_{t + 1}) d χ_{t} (σ_{X}^{t} ∣ (h_{t}, (x_{t}, y_{t})), σ_{X}^{t - 1}) d {\bar{ν}}_{t} (h_{t}, σ_{X}^{t - 1}) . \end{aligned}$ 3.12

Consider the Markov strategy for Y with $q_{0} := y_{\emptyset}$ and q¹_x₀y₀(y₁): = y_{(x₀, y₀)}(y₁). For t≥1, let

$q_{x_{t} y_{t}}^{t + 1} (y_{t + 1}) := \frac{\begin{matrix} \int_{σ_{X}^{t} \in [0, 1]} p_{σ_{X}^{t} y_{t}}^{*} (x_{t + 1}) \int_{(h_{t}, σ_{X}^{t - 1}) \in H_{t} \times [0, 1]} \\ y_{(h_{t}, (x_{t}, y_{t}))} (y_{t + 1}) d χ_{t} (σ_{X}^{t} ∣ (h_{t}, (x_{t}, y_{t})), σ_{X}^{t - 1}) d {\bar{ν}}_{t} (h_{t}, σ_{X}^{t - 1}) \end{matrix}}{\int_{σ_{X}^{t} \in [0, 1]} p_{σ_{X}^{t} y_{t}}^{*} (x_{t + 1}) d χ_{t} (σ_{X}^{t} ∣ x_{t}, y_{t}) ν_{t} (x_{t}, y_{t})} .$ 3.13

If ν_t′ and χ_t′ are the analogues of ν_t and χ_t for p* against {q^t}_t≥1, then clearly ν_t = ν_t′ and χ_t = χ_t′ for t = 0, 1. Suppose that for some t≥1, we have ν_t = ν_t′ and χ_t = χ_t′. It follows, then, that at time t + 1,

$\begin{aligned} ν_{t + 1}^{'} (x_{t + 1}, y_{t + 1}) & = \sum_{(x_{t}, y_{t}) \in {C, D}^{2}} q_{x_{t} y_{t}}^{t + 1} (y_{t + 1}) \int_{σ_{X}^{t} \in [0, 1]} p_{σ_{X}^{t} y_{t}}^{*} (x_{t + 1}) d χ_{t}^{'} (σ_{X}^{t} ∣ x_{t}, y_{t}) ν_{t}^{'} (x_{t}, y_{t}) \\ = \sum_{(x_{t}, y_{t}) \in {C, D}^{2}} q_{x_{t} y_{t}}^{t + 1} (y_{t + 1}) \int_{σ_{X}^{t} \in [0, 1]} p_{σ_{X}^{t} y_{t}}^{*} (x_{t + 1}) d χ_{t} (σ_{X}^{t} ∣ x_{t}, y_{t}) ν_{t} (x_{t}, y_{t}) \\ = ν_{t + 1} (x_{t + 1}, y_{t + 1}), \end{aligned}$ 3.14

which gives the desired result for x∈RL_X. ▪

This lemma leads to a straightforward proof of our main result:

Theorem 3.3. —

$C (p^{*}) \subseteq C (p)$ for every p∈Mem¹_X.

Proof. —

By lemma 3.2, for x∈RL_X, we may assume the opponent's strategy is Markovian, meaning that it has a memory of one round into the past but can depend on the current round, t. This dependence on t distinguishes a Markov strategy from a memory-one strategy, the latter of which also has memory of one round into the past but is independent of t. We denote by $M$ the set of all Markov strategies (equation (3.9)).

Let p* be a linear reactive learning strategy for X and suppose that $y \in M$ . For every t≥0, these strategies generate a distribution ν*_t over {C, D}² × [0, 1]. For any strategy q against p, there is a sequence of distributions μ_t on {C, D}² generated by these two strategies. We prove the proposition by finding ${q^{t}}_{t ⩾ 1} \in M$ such that μ_t(x_t, y_t) = ν*_t({(x_t, y_t)} × [0, 1]) for every (x_t, y_t)∈{C, D}² and t≥0.

Let χ_t be the (marginal) distribution on σ^t_X∈[0, 1] at time t. For y_t∈{C, D}, denote by χ_t( · |y_t) this distribution conditioned on Y using action y_t at time t. For t≥0, consider the strategy with $q_{0} := y_{\emptyset}$ and

$\begin{aligned} q_{C y_{t}}^{t + 1} (y_{t + 1}) & := \frac{\int_{σ_{X}^{t} \in [0, 1]} σ_{X}^{t} (σ_{X}^{t} y_{C y_{t}}^{t + 1} (y_{t + 1}) + (1 - σ_{X}^{t}) y_{D y_{t}}^{t + 1} (y_{t + 1})) d χ_{t} (σ_{X}^{t} ∣ y_{t})}{\int_{σ_{X}^{t} \in [0, 1]} σ_{X}^{t} d χ_{t} (σ_{X}^{t} ∣ y_{t})} \end{aligned}$ 3.15a

$\begin{aligned} and q_{D y_{t}}^{t + 1} (y_{t + 1}) & := \frac{\int_{σ_{X}^{t} \in [0, 1]} (1 - σ_{X}^{t}) (σ_{X}^{t} y_{C y_{t}}^{t + 1} (y_{t + 1}) + (1 - σ_{X}^{t}) y_{D y_{t}}^{t + 1} (y_{t + 1})) d χ_{t} (σ_{X}^{t} ∣ y_{t})}{\int_{σ_{X}^{t} \in [0, 1]} (1 - σ_{X}^{t}) d χ_{t} (σ_{X}^{t} ∣ y_{t})} . \end{aligned}$ 3.15b

Clearly, μ₀(x₀, y₀) = ν*₀({(x₀, y₀)} × [0, 1]) for every (x₀, y₀)∈{C, D}². Suppose, for some t≥, that μ_t(x_t, y_t) = ν*_t({(x_t, y_t)} × [0, 1]) for every (x_t, y_t)∈{C, D}². For (x_t+1, y_t+1)∈{C, D}², we then have

$\begin{aligned} μ_{t + 1} (x_{t + 1}, y_{t + 1}) \\ = \sum_{(x_{t}, y_{t}) \in {C, D}^{2}} p_{x_{t} y_{t}} (x_{t + 1}) q_{x_{t} y_{t}}^{t + 1} (y_{t + 1}) μ_{t} (x_{t}, y_{t}) \\ = \sum_{y_{t} \in {C, D}} (p_{C y_{t}} (x_{t + 1}) q_{C y_{t}}^{t + 1} (y_{t + 1}) μ_{t} (C, y_{t}) + p_{D y_{t}} (x_{t + 1}) q_{D y_{t}}^{t + 1} (y_{t + 1}) μ_{t} (D, y_{t})) \\ = \sum_{y_{t} \in {C, D}} p_{C y_{t}} (x_{t + 1}) \int_{σ_{X}^{t} \in [0, 1]} σ_{X}^{t} (σ_{X}^{t} y_{C y_{t}}^{t + 1} (y_{t + 1}) + (1 - σ_{X}^{t}) y_{D y_{t}}^{t + 1} (y_{t + 1})) d χ_{t} (σ_{X}^{t} ∣ y_{t}) \\ + \sum_{y_{t} \in {C, D}} p_{D y_{t}} (x_{t + 1}) \int_{σ_{X}^{t} \in [0, 1]} (1 - σ_{X}^{t}) (σ_{X}^{t} y_{C y_{t}}^{t + 1} (y_{t + 1}) + (1 - σ_{X}^{t}) y_{D y_{t}}^{t + 1} (y_{t + 1})) d χ_{t} (σ_{X}^{t} ∣ y_{t}) \\ = \sum_{(x_{t}, y_{t}) \in {C, D}^{2}} y_{x_{t} y_{t}}^{t + 1} (y_{t + 1}) \int_{σ_{X}^{t} \in [0, 1]} (σ_{X}^{t} p_{C y_{t}} + (1 - σ_{X}^{t}) p_{D y_{t}}) d ν_{t}^{*} ({(x_{t}, y_{t})} \times {σ_{X}^{t}}) \\ = \sum_{(x_{t}, y_{t}) \in {C, D}^{2}} \int_{σ_{X}^{t} \in [0, 1]} p_{σ_{X}^{t} y_{t}}^{*} (x_{t + 1}) y_{x_{t} y_{t}}^{t + 1} (y_{t + 1}) d ν_{t}^{*} ({(x_{t}, y_{t})} \times {σ_{X}^{t}}) \\ = ν_{t + 1}^{*} ({(x_{t + 1}, y_{t + 1})} \times [0, 1]) . \end{aligned}$ 3.16

Therefore, by induction and the definition of expected pay-off in an iterated game, $C (p^{*}) \subseteq C (p)$ . ▪

As a consequence of theorem 3.3, we see that p* enforces a linear pay-off relationship [1] whenever p does. However, the converse need not hold; figure 5(b) gives an example in which X's pay-off is a function of Y 's when X uses p* but not when X uses p. Although this example illustrates an extreme case of when the pay-off region collapses, perhaps the most interesting behaviour is illustrated by figure 5a,c,d. In these examples, we focus on the pay-off regions that can be obtained against memory-one opponents. Using p* instead of p can both bias pay-offs in favour of X and limit potential losses against a spiteful opponent.

For a memory-one strategy p∈Mem¹_X, we can ask how the region {(π_Y (p, q), π_X(p, q))}_{q∈Mem¹_X} compares to {(π_Y (p*, q*), π_X(p*, q*))}_{q∈Mem¹_X}. In other words, does the map p↦p* transform the feasible region of a strategy when the opponents are also subjected to this map? Figure 6 demonstrates that this map can significantly distort the distribution of pay-offs within the feasible region.

Figure 6. — Distortions in the distribution of pay-offs against reactive learning strategies. In both panels, the grey region is formed by playing 10⁵ randomly chosen strategies q∈**Mem**¹_X against a fixed strategy p∈**Mem**¹_X. The green region in (a) arises from simulating the pay-offs of p* against 10⁵ strategies q∈**Mem**¹_X. In (b), this same reactive learning strategy, p*, is simulated against 10⁵ strategies q*∈RL_X for q∈**Mem**¹_X. In both panels, the optimal outcome for Y is the black point when X uses p and the magenta point when X uses p*. The magenta point represents a much better outcome for X and only a slightly worse outcome for Y than the black point, indicating that p* is highly extortionate relative to p when played against a pay-off-maximizing opponent. In both panels, the parameters are p = (0.50, 0.99, 0.40, 0.01, 0.01) and R = 3, S = 0, T = 5 and P = 1. Each coordinate of q is chosen independently from an arcsine (i.e. Beta(1/2, 1/2)) distribution. (Online version in colour.)

(d). Optimization through mutation

Suppose that X uses a fixed reactive learning strategy, p*, for some p∈Mem¹_X. Starting from some random memory-one strategy, q, the opponent might seek to optimize his or her pay-off through a series of mutations. In other words, Y is subjected to the following process. First, sample a new strategy q′∈Mem¹_X. If the pay-off to Y for q′ against p* exceeds that of q against p*, switch to q′; otherwise, retain q. This step then repeats until Y has a sufficiently high pay-off (or else has not changed strategies in some fixed number of steps). From figure 6, one expects this process to give different results from the same update scheme when X plays the memory-one strategy p instead of p*.

As expected, figure 7 shows that this optimization process behaves quite differently against p* as it does against p. Whereas using p in this example results in equitable outcomes, using p* gives X a much higher pay-off than Y , indicating extortionate behaviour. One can also imagine other optimization procedures (not covered here), such as when q′ is always sufficiently close to q (i.e. local mutations). When X uses p*, a path from the red point to the magenta point in figure 6 through random local sampling of q typically requires Y to initially accept lower pay-offs. If Y uses q* instead of q, as in figure 6b, this effect is amplified.

4. Discussion

Our primary focus has been on the feasible region generated by a fixed strategy. This approach to studying X's strategy is inspired by the ‘zero-determinant’ strategies of Press & Dyson [1], which enforce linear subsets of the feasible region. This perspective has also been expanded to cover so-called partner and rival strategies [2–4], which have proven extremely useful in understanding repeated games from an evolutionary perspective. The feasible region of a memory-one strategy, p, is quite simple and can be characterized as the convex hull of at most 11 points. Furthermore, these points are all straightforward to write down explicitly in terms of the pay-off matrix and the entries of p (see equation (2.6)). The feasible region of a reactive learning strategy, in terms of its boundary and extreme points, is evidently more complicated in general.

Both memory-one and reactive learning strategies contain the set of all reactive strategies. For every memory-one strategy, p, there exists a corresponding linear reactive learning strategy, p*, and this correspondence defines an injective map ${M e m}_{X}^{1} \to {R L}_{X}$ . In general, however, p cannot be identified with its image, p*, unless p is reactive. We make this claim formally using the geometry of a strategy within the feasible region, $C (p)$ , which captures all possible pay-off pairs against an opponent. For any memory-one strategy, we have $C (p^{*}) \subseteq C (p)$ . Therefore, reactive learning strategies generally allow a player to impose greater control over where pay-offs fall within the feasible region than do traditional memory-one strategies. As illustrated in figure 5a, this added control can prevent a greedy, self-pay-off-maximizing opponent from obtaining more than X when X uses p*, even when such an opponent receives an unfair share of the pay-offs when X uses p instead. The proof of the containment $C (p^{*}) \subseteq C (p)$ also extends to discounted games, where each pay-off unit received t rounds into the future is valued at δ^t units at present for some ‘discounting factor’, δ∈[0, 1].

Another property of the map ${M e m}_{X}^{1} \to {R L}_{X}$ sending p to p* is that it distorts the distribution of pay-offs within the feasible region. Since Mem¹_X can be identified with the space of linear reactive learning strategies under this map, it is natural to compare the region of possible pay-offs when p plays against memory-one strategies to the one obtained from when p* plays against linear reactive learning strategies. These distortions, as illustrated in figure 6, are particularly relevant when X plays against an opponent who is using a process such as simulated annealing to optimize pay-off. One can see from this example that if Y initially has a low pay-off, then with localized strategy exploration they must be willing to accept lower pay-offs before they find a strategy that improves their initial pay-off. This concern is not relevant when Y can simply compute the best response to X's strategy, but it is highly pertinent to evolutionary settings in which the opponent's strategy is obtained through mutation and selection rather than ‘computation’.

Reactive learning strategies are also more intuitive than memory-one strategies in some ways. Rather than being a dictionary of mixed actions based on all possible observed outcomes, a reactive learning strategy is simply an algorithm for updating one's tendency to choose a certain action. It, therefore, allows a player to alter their behaviour (mixed action) over time in response to various stimuli (actions of the opponent). This strategic approach to iterated games is reminiscent of both the Bush–Mosteller model [19] and the weighted majority algorithm [23], although traditionally these models are not studied through the pay-off regions they generate in iterated games. There are several interesting directions for future research in this area. For one, we have mainly considered the space of linear reactive learning strategies, but the space RL_X is much larger and could potentially exhibit complicated evolutionary dynamics. Furthermore, one could relax the condition that these strategies be reactive and allow them to use X's realized action in addition to X's mixed action. But even without these complications, we have seen that linear reactive learning strategies have quite interesting relationships to traditional memory-one strategies.

Acknowledgements

The authors are grateful to Krishnendu Chatterjee, Christian Hilbe and Joshua Plotkin for many helpful conversations and for feedback on earlier versions of this work.

Appendix A. Convergence of mixed actions

Suppose that X and Y use strategies (p₀, p*) and (q₀, q*), respectively. Let σ⁰_X = p₀ and σ⁰_Y = q₀ be the initial distributions on {C, D} for X and Y , respectively. If these distributions are known at time t≥0, then, on average, the corresponding distributions at time t + 1 are given by the system of equations

\begin{aligned} σ_{X}^{t + 1} & := σ_{Y}^{t} p_{σ_{X}^{t} C}^{*} + (1 - σ_{Y}^{t}) p_{σ_{X}^{t} D}^{*} \end{aligned}

A 1a

\begin{aligned} and σ_{Y}^{t + 1} & := σ_{X}^{t} q_{σ_{Y}^{t} C}^{*} + (1 - σ_{X}^{t}) q_{σ_{Y}^{t} D}^{*} . \end{aligned}

A 1b

This system suggests a fixed-point analysis to determine whether the sequence {(σ^t_X, σ^t_Y)}_t≥0 converges.

Suppose that (σ_X, σ_Y) ∈[0, 1]² is a fixed point of this system, i.e.

\begin{aligned} σ_{X} & = σ_{Y} p_{σ_{X} C}^{*} + (1 - σ_{Y}) p_{σ_{X} D}^{*} \end{aligned}

A 2a

\begin{aligned} and σ_{Y} & = σ_{X} q_{σ_{Y} C}^{*} + (1 - σ_{X}) q_{σ_{Y} D}^{*} . \end{aligned}

A 2b

We consider this system for two types of linear reactive learning strategies: those coming from reactive strategies and those coming from general memory-one strategies under the map ${M e m}_{X}^{1} \to {R L}_{X}$ .

We first consider reactive strategies of the form (p_C, p_D), where p_C (resp. p_D) is the probability a player uses C after the opponent played C (resp. D). Let (p_C, p_D) and (q_C, q_D) be fixed strategies for X and Y . For these reactive strategies, the system equation (A1) takes the form

\begin{aligned} σ_{X}^{t + 1} & := σ_{Y}^{t} p_{C} + (1 - σ_{Y}^{t}) p_{D} \end{aligned}

A 3a

\begin{aligned} and σ_{Y}^{t + 1} & := σ_{X}^{t} q_{C} + (1 - σ_{X}^{t}) q_{D} . \end{aligned}

A 3b

One can easily check that this dynamical system has a unique fixed point, which Hofbauer & Sigmund [24] refer to as the ‘asymptotic C-level’ of (p_C, p_D) against (q_C, q_D), and which is given explicitly by

\begin{aligned} σ_{X} & = \frac{p_{C} q_{D} + p_{D} (1 - q_{D})}{1 - (p_{C} - p_{D}) (q_{C} - q_{D})} \end{aligned}

A 4a

\begin{aligned} and σ_{Y} & = \frac{p_{D} q_{C} + (1 - p_{D}) q_{D}}{1 - (p_{C} - p_{D}) (q_{C} - q_{D})} . \end{aligned}

A 4b

Furthermore, we have the following, straightforward convergence result.

If (p_C, p_D), (q_C, q_D)∈(0, 1)², and if (σ_X, σ_Y) ∈(0, 1)² is given by equation (A 4), then

$lim_{t \to \infty} (σ_{X}^{t}, σ_{Y}^{t}) = (σ_{X}, σ_{Y})$ A 5

for any initial condition, (p₀, q₀)∈[0, 1]².

Proof. —

For (p_C, p_D), (q_C, q_D)∈(0, 1)², consider the map

$\begin{aligned} f & : {[0, 1]}^{2} ⟶ {[0, 1]}^{2} \\ : (\begin{matrix} x \\ y \end{matrix}) ⟼ (\begin{matrix} y p_{C} + (1 - y) p_{D} \\ x q_{C} + (1 - x) q_{D} \end{matrix}) . \end{aligned}$ A 6

For (x, y), (x′, y′)∈[0, 1]², we have

$f (x, y) - f (x^{'}, y^{'}) = (\begin{matrix} (y - y^{'}) (p_{C} - p_{D}) \\ (x - x^{'}) (q_{C} - q_{D}) \end{matrix}) .$ A 7

It follows that ∥f(x, y) − f(x′, y′)∥ ≤ slantλ∥(x, y) − (x′, y′)∥, where $λ := max {| p_{C} - p_{D} |, | q_{C} - q_{D} |} < 1$ . By the contraction mapping theorem, there is then a unique fixed point (σ_X, σ_Y) ∈[0, 1]² such that

$lim_{t \to \infty} f^{t} (p_{0}, q_{0}) = (σ_{X}, σ_{Y})$ A 8

for any (p₀, q₀)∈[0, 1]². It is straightforward to check that equation (A 4) is a fixed point of equation (A 3). ▪

In particular, if μ: = (σ_Xσ_Y, σ_X(1 − σ_Y), (1 − σ_X)σ_Y, (1 − σ_X)(1 − σ_Y)), then a straightforward calculation shows that μ is the stationary distribution of M((p_C, p_D, p_C, p_D), (q_C, q_D, q_C, q_D)) (equation (2.2)).

Remark A.2. —

Proposition A.1 need not hold if p_y and q_x are not strictly between 0 and 1. For example, when X and Y both play TFT, f is a simple involution with f(x, y) = (y, x), which preserves distance.

Consider now the case of general memory-one strategies with p_••: = (p_CC, p_CD, p_DC, p_DD) for X and q_••: = (q_CC, q_CD, q_DC, q_DD) for Y . For these strategies, the system defined by equation (A1) has the form

\begin{aligned} σ_{X}^{t + 1} & := σ_{Y}^{t} (σ_{X}^{t} p_{C C} + (1 - σ_{X}^{t}) p_{D C}) + (1 - σ_{Y}^{t}) (σ_{X}^{t} p_{C D} + (1 - σ_{X}^{t}) p_{D D}) \end{aligned}

A 9a

\begin{aligned} and σ_{Y}^{t + 1} & := σ_{X}^{t} (σ_{Y}^{t} q_{C C} + (1 - σ_{Y}^{t}) q_{D C}) + (1 - σ_{X}^{t}) (σ_{Y}^{t} q_{C D} + (1 - σ_{Y}^{t}) q_{D D}) . \end{aligned}

A 9b

In the spirit of proposition A.1, for fixed p_••, q_••∈(0, 1)⁴, we could consider the map

\begin{aligned} F & : {[0, 1]}^{2} ⟶ {[0, 1]}^{2} \\ : (\begin{matrix} x \\ y \end{matrix}) ⟼ (\begin{matrix} y (x p_{C C} + (1 - x) p_{D C}) + (1 - y) (x p_{C D} + (1 - x) p_{D D}) \\ x (y q_{C C} + (1 - y) q_{D C}) + (1 - x) (y q_{C D} + (1 - y) q_{D D}) \end{matrix}) \end{aligned}

A 10

and analyse its fixed points. At this point, however, a couple of remarks are in order:

(i)
F need not be a contraction, even when p_•• and q_•• have entries strictly between 0 and 1. For example, with p_•• = (0.9566, 0.2730, 0.0056, 0.0095) and q_•• = (0.9922, 0.0918, 0.3217, 0.0054),
$\begin{aligned} 0.0441 & = ∥ F (0.7404, 0.6928) - F (0.8241, 0.8280) ∥ \\ > ∥ (0.7404, 0.6928) - (0.8241, 0.8280) ∥ = 0.0253 . \end{aligned}$ A 11
We would conjecture that this map is an eventual contraction, in which case the convergence result of proposition A.1 still holds (although the explicit formulae for σ_X and σ_Y differ from equation (A4)).
(ii)
A fixed point of F, (σ_X, σ_Y), even when it exists and is unique, generally does not have the property that μ(p, q) = (σ_Xσ_Y, σ_X(1 − σ_Y), (1 − σ_X)σ_Y, (1 − σ_X)(1 − σ_Y)), where μ is the stationary distribution of equation (2.2). Furthermore, the long-run mean-frequency distribution on {C, D}² can be distinct from both of these distributions, including when the opponent plays q against p* and when they play q* against p*. An example of when these four distributions are pairwise distinct is easy to write down, e.g. p = (0.01, 0.01, 0.01, 0.99, 0.01) and q = (0.99, 0.99, 0.01, 0.99, 0.99). All four distributions coincide when p and q are both reactive, but in general they can be distinct.

Data accessibility

This article does not contain any additional data.

Authors' contributions

All authors designed research, performed research and wrote the paper.

Competing interests

We declare we have no competing interests.

Funding

The authors gratefully acknowledge support from the Lifelong Learning Machines program from DARPA/MTO. Research was sponsored by the Army Research Laboratory (ARL) and was accomplished under cooperative agreement no. W911NF-18-2-0265. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the US Government. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

1.Press WH, Dyson FJ. 2012. Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. Proc. Natl Acad. Sci. USA 109, 10 409–10 413. ( 10.1073/pnas.1206569109) [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Akin E. 2015. What you gotta know to play good in the iterated prisoner's dilemma. Games 6, 175–190. ( 10.3390/g6030175) [DOI] [Google Scholar]
3.Hilbe C, Traulsen A, Sigmund K. 2015. Partners or rivals? Strategies for the iterated prisoner's dilemma. Games Econ. Behav. 92, 41–52. ( 10.1016/j.geb.2015.05.005) [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Hilbe C, Chatterjee K, Nowak MA. 2018. Partners and rivals in direct reciprocity. Nat. Human Behav. 2, 469–477. ( 10.1038/s41562-018-0320-9) [DOI] [PubMed] [Google Scholar]
5.Nowak M, Sigmund K. 1993. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner's dilemma game. Nature 364, 56–58. ( 10.1038/364056a0) [DOI] [PubMed] [Google Scholar]
6.Axelrod R. 1984. The evolution of cooperation. New York, NY: Basic Books. [Google Scholar]
7.Lehrer E. 1988. Repeated games with stationary bounded recall strategies. J. Econ. Theory 46, 130–144. ( 10.1016/0022-0531(88)90153-6) [DOI] [Google Scholar]
8.Hauert C, Schuster HG. 1997. Effects of increasing the number of players and memory size in the iterated Prisoner's Dilemma: a numerical approach. Proc. R. Soc. Lond. B 264, 513–519. ( 10.1098/rspb.1997.0073) [DOI] [Google Scholar]
9.Nowak MA. 2006. Five rules for the evolution of cooperation. Science 314, 1560–1563. ( 10.1126/science.1133755) [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Hilbe C, Martinez-Vaquero LA, Chatterjee K, Nowak MA. 2017. Memory-n strategies of direct reciprocity. Proc. Natl Acad. Sci. USA 114, 4715–4720. ( 10.1073/pnas.1621239114) [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Baek SK, Jeong H-C, Hilbe C, Nowak MA. 2016. Comparing reactive and memory-one strategies of direct reciprocity. Sci. Rep. 6, 25676 ( 10.1038/srep25676) [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Fudenberg D, Tirole J. 1991. Game theory. Cambridge, MA: MIT Press. [Google Scholar]
13.Posch M. 1999. Win–stay, lose–shift strategies for repeated games—memory length, aspiration levels and noise. J. Theor. Biol. 198, 183–195. ( 10.1006/jtbi.1999.0909) [DOI] [PubMed] [Google Scholar]
14.Dal Bó P. 2005. Cooperation under the shadow of the future: experimental evidence from infinitely repeated games. Am. Econ. Rev. 95, 1591–1604. ( 10.1257/000282805775014434) [DOI] [Google Scholar]
15.Nowak MA. 2006. Evolutionary dynamics: exploring the equations of life. Cambridge, MA: Belknap Press. [Google Scholar]
16.Barlo M, Carmona G, Sabourian H. 2009. Repeated games with one-memory. J. Econ. Theory 144, 312–336. ( 10.1016/j.jet.2008.04.003) [DOI] [Google Scholar]
17.Dal Bó P, Fréchette GR. 2011. The evolution of cooperation in infinitely repeated games: experimental evidence. Am. Econ. Rev. 101, 411–429. ( 10.1257/aer.101.1.411) [DOI] [Google Scholar]
18.Stewart AJ, Plotkin JB. 2016. Small groups and long memories promote cooperation. Sci. Rep. 6, 26889 ( 10.1038/srep26889) [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Bush RR, Mosteller F. 1953. A stochastic model with applications to learning. Ann. Math. Stat. 24, 559–585. ( 10.1214/aoms/1177728914) [DOI] [Google Scholar]
20.Roth AE, Erev I. 1995. Learning in extensive-form games: experimental data and simple dynamic models in the intermediate term. Games Econ. Behav. 8, 164–212. ( 10.1016/s0899-8256(05)80020-x) [DOI] [Google Scholar]
21.Izquierdo LR, Izquierdo SS. 2008. Dynamics of the Bush–Mosteller learning algorithm in 2x2 games. In Reinforcement learning. I-Tech Education and Publishing.
22.Nowak M, Sigmund K. 1990. The evolution of stochastic strategies in the prisoner's dilemma. Acta Applicandae Math. 20, 247–265. ( 10.1007/bf00049570) [DOI] [Google Scholar]
23.Littlestone N, Warmuth MK. 1989. The weighted majority algorithm. In 30th Annual Symp. on Foundations of Computer Science, Research Triangle Park, NC, USA, 30 October–1 November 1989, pp. 256–261 ( 10.1109/sfcs.1989.63487) [DOI]
24.Hofbauer J, Sigmund K. 1998. Evolutionary games and population dynamics. Cambridge, UK: Cambridge University Press; ( 10.1017/cbo9781139173179) [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

This article does not contain any additional data.

[RSPA20180819C1] 1.Press WH, Dyson FJ. 2012. Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. Proc. Natl Acad. Sci. USA 109, 10 409–10 413. ( 10.1073/pnas.1206569109) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSPA20180819C2] 2.Akin E. 2015. What you gotta know to play good in the iterated prisoner's dilemma. Games 6, 175–190. ( 10.3390/g6030175) [DOI] [Google Scholar]

[RSPA20180819C3] 3.Hilbe C, Traulsen A, Sigmund K. 2015. Partners or rivals? Strategies for the iterated prisoner's dilemma. Games Econ. Behav. 92, 41–52. ( 10.1016/j.geb.2015.05.005) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSPA20180819C4] 4.Hilbe C, Chatterjee K, Nowak MA. 2018. Partners and rivals in direct reciprocity. Nat. Human Behav. 2, 469–477. ( 10.1038/s41562-018-0320-9) [DOI] [PubMed] [Google Scholar]

[RSPA20180819C5] 5.Nowak M, Sigmund K. 1993. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner's dilemma game. Nature 364, 56–58. ( 10.1038/364056a0) [DOI] [PubMed] [Google Scholar]

[RSPA20180819C6] 6.Axelrod R. 1984. The evolution of cooperation. New York, NY: Basic Books. [Google Scholar]

[RSPA20180819C7] 7.Lehrer E. 1988. Repeated games with stationary bounded recall strategies. J. Econ. Theory 46, 130–144. ( 10.1016/0022-0531(88)90153-6) [DOI] [Google Scholar]

[RSPA20180819C8] 8.Hauert C, Schuster HG. 1997. Effects of increasing the number of players and memory size in the iterated Prisoner's Dilemma: a numerical approach. Proc. R. Soc. Lond. B 264, 513–519. ( 10.1098/rspb.1997.0073) [DOI] [Google Scholar]

[RSPA20180819C9] 9.Nowak MA. 2006. Five rules for the evolution of cooperation. Science 314, 1560–1563. ( 10.1126/science.1133755) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSPA20180819C10] 10.Hilbe C, Martinez-Vaquero LA, Chatterjee K, Nowak MA. 2017. Memory-n strategies of direct reciprocity. Proc. Natl Acad. Sci. USA 114, 4715–4720. ( 10.1073/pnas.1621239114) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSPA20180819C11] 11.Baek SK, Jeong H-C, Hilbe C, Nowak MA. 2016. Comparing reactive and memory-one strategies of direct reciprocity. Sci. Rep. 6, 25676 ( 10.1038/srep25676) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSPA20180819C12] 12.Fudenberg D, Tirole J. 1991. Game theory. Cambridge, MA: MIT Press. [Google Scholar]

[RSPA20180819C13] 13.Posch M. 1999. Win–stay, lose–shift strategies for repeated games—memory length, aspiration levels and noise. J. Theor. Biol. 198, 183–195. ( 10.1006/jtbi.1999.0909) [DOI] [PubMed] [Google Scholar]

[RSPA20180819C14] 14.Dal Bó P. 2005. Cooperation under the shadow of the future: experimental evidence from infinitely repeated games. Am. Econ. Rev. 95, 1591–1604. ( 10.1257/000282805775014434) [DOI] [Google Scholar]

[RSPA20180819C15] 15.Nowak MA. 2006. Evolutionary dynamics: exploring the equations of life. Cambridge, MA: Belknap Press. [Google Scholar]

[RSPA20180819C16] 16.Barlo M, Carmona G, Sabourian H. 2009. Repeated games with one-memory. J. Econ. Theory 144, 312–336. ( 10.1016/j.jet.2008.04.003) [DOI] [Google Scholar]

[RSPA20180819C17] 17.Dal Bó P, Fréchette GR. 2011. The evolution of cooperation in infinitely repeated games: experimental evidence. Am. Econ. Rev. 101, 411–429. ( 10.1257/aer.101.1.411) [DOI] [Google Scholar]

[RSPA20180819C18] 18.Stewart AJ, Plotkin JB. 2016. Small groups and long memories promote cooperation. Sci. Rep. 6, 26889 ( 10.1038/srep26889) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSPA20180819C19] 19.Bush RR, Mosteller F. 1953. A stochastic model with applications to learning. Ann. Math. Stat. 24, 559–585. ( 10.1214/aoms/1177728914) [DOI] [Google Scholar]

[RSPA20180819C20] 20.Roth AE, Erev I. 1995. Learning in extensive-form games: experimental data and simple dynamic models in the intermediate term. Games Econ. Behav. 8, 164–212. ( 10.1016/s0899-8256(05)80020-x) [DOI] [Google Scholar]

[RSPA20180819C21] 21.Izquierdo LR, Izquierdo SS. 2008. Dynamics of the Bush–Mosteller learning algorithm in 2x2 games. In Reinforcement learning. I-Tech Education and Publishing.

[RSPA20180819C22] 22.Nowak M, Sigmund K. 1990. The evolution of stochastic strategies in the prisoner's dilemma. Acta Applicandae Math. 20, 247–265. ( 10.1007/bf00049570) [DOI] [Google Scholar]

[RSPA20180819C23] 23.Littlestone N, Warmuth MK. 1989. The weighted majority algorithm. In 30th Annual Symp. on Foundations of Computer Science, Research Triangle Park, NC, USA, 30 October–1 November 1989, pp. 256–261 ( 10.1109/sfcs.1989.63487) [DOI]

[RSPA20180819C24] 24.Hofbauer J, Sigmund K. 1998. Evolutionary games and population dynamics. Cambridge, UK: Cambridge University Press; ( 10.1017/cbo9781139173179) [DOI] [Google Scholar]

PERMALINK

Reactive learning strategies for iterated games

Alex McAvoy

Martin A Nowak

Abstract

1. Introduction

Figure 1.

2. Memory-one strategies

Lemma 2.1. —

Proof. —

Remark 2.2. —

Figure 2.

Proposition 2.3. —

Proof. —

Table 1.

Remark 2.4. —

Remark 2.5. —

3. Reactive learning strategies

Figure 3.

Example 3.1. —

Figure 4.

(a). Linear reactive learning strategies

(b). Stationary distributions

(c). Feasible pay-off regions

Lemma 3.2. —

Proof. —

Theorem 3.3. —

Proof. —

Figure 5.

Figure 6.

(d). Optimization through mutation

Figure 7.

4. Discussion

Acknowledgements

Appendix A. Convergence of mixed actions

Proof. —

Remark A.2. —

Data accessibility

Authors' contributions

Competing interests

Funding

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases