Proceedings. Mathematical, Physical, and Engineering Sciences. 2019 Mar 20; 475(2223): 20180819. doi: 10.1098/rspa.2018.0819

Reactive learning strategies for iterated games

Alex McAvoy 1, Martin A Nowak 1
PMCID: PMC6451968  PMID: 31007557

Abstract

In an iterated game between two players, there is much interest in characterizing the set of feasible pay-offs for both players when one player uses a fixed strategy and the other player is free to switch. Such characterizations have led to extortionists, equalizers, partners and rivals. Most of those studies use memory-one strategies, which specify the probabilities of taking actions depending on the outcome of the previous round. Here, we consider ‘reactive learning strategies’, which gradually modify their propensity to take certain actions based on past actions of the opponent. Every linear reactive learning strategy, p*, corresponds to a memory-one strategy, p, and vice versa. We prove that, to evaluate the region of feasible pay-offs against a memory-one strategy p, denoted C(p), we need to check its performance against at most 11 other strategies. Thus, C(p) is the convex hull in ℝ2 of at most 11 points. Furthermore, if p is a memory-one strategy, with feasible pay-off region C(p), and p* is the corresponding reactive learning strategy, with feasible pay-off region C(p*), then C(p*) is a subset of C(p). Reactive learning strategies are therefore powerful tools for restricting the outcomes of iterated games.

Keywords: adaptive strategy, iterated game, memory-one strategy, social dilemma

1. Introduction

Since the discovery of zero-determinant strategies for iterated games by Press & Dyson [1], there has been growing interest in the set of possible pay-offs that can be achieved against a fixed strategy. Imagine that Alice uses a particular strategy, while Bob can try out any conceivable strategy. The resulting set of pay-offs for both Alice and Bob defines the ‘feasible region’ of Alice's strategy. If Alice uses a so-called zero-determinant strategy [1], then the feasible region is a line. In general, the feasible region is a two-dimensional convex subset of the feasible pay-off region of the game (figure 1). Using the geometric intuition put forth by Press & Dyson [1], subsequent work has explored strategies that generate two-dimensional feasible regions, defined by linear inequalities rather than strict equalities [2–4]. However, a general description of what this region looks like, as it relates to the type of strategy played, is currently lacking. In this study, we characterize the feasible regions for the well-known class of memory-one strategies [5] and consider their relationships to those of a new class of ‘reactive learning strategies’.

Figure 1.


Feasible region (grey) for a strategy with p•• = (0.7881, 0.8888, 0.4686, 0.0792) when R = 3, S = 0, T = 5 and P = 1. The light blue region depicts the set of all pay-off pairs that can be achieved in the iterated game, i.e. the convex hull of the points (R, R), (S, T), (P, P) and (T, S). The feasible region of p can be characterized as the convex hull of 11 points, corresponding to those opponent-strategies, q, appearing next to each black dot. In this instance, five of these points already fall inside of the convex hull of the remaining six. However, one cannot remove one of these 11 points without destroying this characterization for some game-strategy pair. (Online version in colour.)

Iterated games have many applications across the social sciences and biology, and with them has come a proliferation of strategy classes of various complexities [6–10]. The type of strategy a player uses for dealing with repeated encounters depends on many factors, including the cognitive capacity of the player and the nature of the underlying ‘one-shot’ (or ‘stage’) games. In applications to theoretical biology, the most well-studied type of strategy is known as ‘memory-one’ because it takes into account the outcome of only the previous encounter when determining how to play in the next round [5,11]. These strategies, while forming only a small subset of all possible ways to play an iterated game [12], have several advantages over more complicated strategies. They permit rich behaviour in iterated play, such as punishment for exploitation and reward for cooperation [5,13–18]; but, owing to their simple memory requirements, they are also straightforward to implement in practice and to analyse mathematically.

Memory, however, can apply to more than just the players' actions in the previous round. Since a player typically chooses an action stochastically rather than deterministically in any particular encounter, the player can also take into account how they chose their previous action rather than just its realized outcome. In a social dilemma, for instance, each player chooses an action (‘cooperate’, C, or ‘defect’, D) in a given round and receives a pay-off for this action against that of the opponent. The distribution with which this action is chosen is referred to as a ‘mixed action’ and can be specified by a single number between 0 and 1, representing the tendency to cooperate. A standard memory-one strategy for player X is given by a five-tuple, (p0, pCC, pCD, pDC, pDD), where p0 is the probability of cooperation in the initial round and pxy is the probability of cooperation following an outcome in which X uses action x and the opponent, Y, uses action y. We consider a variation on this theme, where instead of using x and y to determine the next mixed action, X uses the opponent's action, y, to update their own mixed action, σX∈[0, 1], that was used previously to generate x. We refer to a strategy of this form as a ‘reactive learning strategy’.

Such a strategy is ‘reactive’ because it takes into account the realized action of just the opponent, and it is ‘learning’ because it adapts to this external stimulus. Like a memory-one strategy, a reactive learning strategy for X requires knowledge of information one round into the past, namely X's mixed action, σX, and Y 's realized action, y. Unlike a memory-one strategy, in which the probability of cooperation is in the set {p0, pCC, pCD, pDC, pDD} in every round of the game, a reactive learning strategy can result in a broad range of cooperation tendencies for X over the duration of an iterated game. Moreover, these tendencies can be gradually changed over the course of many rounds, resulting (for example) in high probabilities of cooperation only after the opponent has demonstrated a sufficiently long history of cooperating. Punishment for defection can be similarly realized over a number of interactions. Remembering a probability, σX, and an action, y, instead of just two actions, x and y, can thus lead to more complex behaviours.

This adaptive approach to iterated games is similar to the Bush–Mosteller reinforcement learning algorithm [19–21], but there are important distinctions. For one, a reactive learning strategy does not necessarily reinforce behaviour resulting in higher pay-offs. Furthermore, it completely disregards the focal player's realized action, using only that of the opponent in the update mechanism. But there are certainly reactive learning strategies that are more closely related to reinforcement learning, and we give an example using a variation on the memory-one strategy tit-for-tat (TFT), which we call ‘learning tit-for-tat (LTFT)’.

In this study, we establish some basic properties of reactive learning strategies relative to the memory-one space. We first characterize the feasible region of a memory-one strategy as the convex hull of at most 11 points. We then show that there is an embedding of the set of memory-one strategies in the set of reactive learning strategies with the following property: if p is a memory-one strategy and p* is the corresponding reactive learning strategy, then the feasible region of p contains the feasible region of p*. Moreover, the image of the map p ↦ p* is the set of linear reactive learning strategies, which consists of those strategies that send a player's mixed action, σX, to ασX + β for some α, β∈[0, 1]. As a consequence, if the goal of a player is to restrict the region of pay-offs attainable by the players, then this player should prefer using a linear reactive learning strategy over the corresponding memory-one strategy.

2. Memory-one strategies

Consider an iterated game between two players, X and Y . In every round, each player chooses an action from the set {C, D} (cooperate or defect). They receive pay-offs based on the values in the matrix

        C   D
    C ( R   S )
    D ( T   P )   (2.1)

Over many rounds, these pay-offs are averaged to arrive at an expected pay-off for each player.

Whereas an action specifies the behaviour of a player in one particular encounter, a strategy specifies how a player behaves over the course of many encounters. One of the simplest and best-studied strategies for iterated games is a memory-one strategy [5], which for player X is defined as follows: for every (x, y)∈{C, D}2 observed as action outcomes of a given round, X devises a mixed action pxy∈[0, 1] for the next round. The notation pxy indicates that this mixed action depends on the (pure) actions of both players in the previous round, not how they arrived at those actions (e.g. by generating an action probabilistically). The term ‘strategy’ is reserved for the players' behaviours in the iterated game.

Let Mem1X be the space of all memory-one strategies for player X in an iterated game. With just two actions, C and D, we have Mem1X = [0, 1] × [0, 1]^4, i.e. the space of all (p0, pCC, pCD, pDC, pDD)∈[0, 1]^5. A pair of memory-one strategies, p := (p0, pCC, pCD, pDC, pDD) and q := (q0, qCC, qCD, qDC, qDD), for X and Y, respectively, yields a Markov chain on the space of all action pairs, {C, D}2, whose transition matrix is

M(p, q) =
              CC                  CD                  DC                  DD
    CC ( pCC qCC      pCC(1 − qCC)      (1 − pCC) qCC      (1 − pCC)(1 − qCC) )
    CD ( pCD qDC      pCD(1 − qDC)      (1 − pCD) qDC      (1 − pCD)(1 − qDC) )
    DC ( pDC qCD      pDC(1 − qCD)      (1 − pDC) qCD      (1 − pDC)(1 − qCD) )
    DD ( pDD qDD      pDD(1 − qDD)      (1 − pDD) qDD      (1 − pDD)(1 − qDD) )   (2.2)

and whose initial distribution is μ0: = (p0q0, p0(1 − q0), (1 − p0)q0, (1 − p0)(1 − q0)). If pxy, qxy∈(0, 1) for every x, y∈{C, D}, then this chain is ergodic and has a unique stationary distribution, μ(p, q), which is independent of μ0. In particular, the expected pay-offs, πX(p, q) = μ(p, q) · (R, S, T, P) and πY(p, q) = μ(p, q) · (R, T, S, P), are independent of p0 and q0. In this case, πX and πY are functions of just the response probabilities, p••: = (pCC, pCD, pDC, pDD) and q••: = (qCC, qCD, qDC, qDD).
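The stationary pay-off computation just described is easy to sketch numerically. The following is an illustrative implementation, not code from the paper; the least-squares method of solving for the stationary distribution and the default pay-off values are our own assumptions:

```python
import numpy as np

def transition_matrix(p, q):
    """Markov chain on outcomes (CC, CD, DC, DD), ordered from X's view.

    p and q are the response vectors (pCC, pCD, pDC, pDD) and
    (qCC, qCD, qDC, qDD) for X and Y, respectively.
    """
    pCC, pCD, pDC, pDD = p
    qCC, qCD, qDC, qDD = q
    # From state xy, X cooperates with probability p_xy; from Y's
    # perspective the state is yx, so Y cooperates with probability q_yx.
    pairs = [(pCC, qCC), (pCD, qDC), (pDC, qCD), (pDD, qDD)]
    return np.array([[a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)]
                     for a, b in pairs])

def stationary_distribution(M):
    # Solve mu M = mu together with sum(mu) = 1 as a least-squares system.
    A = np.vstack([M.T - np.eye(4), np.ones((1, 4))])
    b = np.zeros(5)
    b[4] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def expected_payoffs(p, q, R=3, S=0, T=5, P=1):
    mu = stationary_distribution(transition_matrix(p, q))
    return mu @ np.array([R, S, T, P]), mu @ np.array([R, T, S, P])  # (piX, piY)
```

For instance, against the unconditional defector q•• = (0, 0, 0, 0), the result agrees with the closed-form stationary pay-off πX = (P(1 − pCD) + S pDD)/(1 − pCD + pDD).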

A useful way of thinking about a strategy is through its feasible region, i.e. the set of all possible pay-off pairs (for X and Y ) that can be achieved against it. For any memory-one strategy p of X, let

C(p) := {(πY(p, q), πX(p, q)) : q ∈ Mem1X}   (2.3)

be this feasible region. (Note that, if X uses a memory-one strategy, then it suffices to assume that Y uses a memory-one strategy by the results of Press & Dyson [1].) This subset of the feasible region represents the ‘geometry’ of strategy p in the sense that it captures all possible pay-off pairs against an opponent.

In this section, we show that the feasible region for p ∈ Mem1X with p••∈(0, 1)^4 is characterized by playing p against the following 11 strategies: (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 0, 1), (1, 0, 1, 0), (1, 0, 1, 1) and (1, 1, 1, 1). In other words, C(p) is the convex hull of 11 points (figure 1). Therefore, any p ∈ Mem1X generates a convex polygon in ℝ2 whose number of extreme points is uniformly bounded over all game-strategy pairs, ((R, S, T, P), p).
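As an illustrative sketch (not from the paper), the 11 candidate vertices can be enumerated numerically; the default game values R = 3, S = 0, T = 5, P = 1 are assumptions matching figure 1:

```python
import numpy as np

# The 11 opponent response vectors from the proposition below.
ELEVEN = [(0,0,0,0), (0,0,0,1), (0,0,1,0), (0,0,1,1), (0,1,0,1), (0,1,1,0),
          (0,1,1,1), (1,0,0,1), (1,0,1,0), (1,0,1,1), (1,1,1,1)]

def payoff_point(p, q, R=3, S=0, T=5, P=1):
    """Return (piY, piX) for memory-one response vectors p and q."""
    pCC, pCD, pDC, pDD = p
    qCC, qCD, qDC, qDD = q
    # Outcome chain ordered (CC, CD, DC, DD) from X's perspective.
    M = np.array([[a*b, a*(1-b), (1-a)*b, (1-a)*(1-b)]
                  for a, b in [(pCC,qCC), (pCD,qDC), (pDC,qCD), (pDD,qDD)]])
    A = np.vstack([M.T - np.eye(4), np.ones((1, 4))])
    b_vec = np.zeros(5)
    b_vec[4] = 1.0
    mu = np.linalg.lstsq(A, b_vec, rcond=None)[0]  # stationary distribution
    return mu @ np.array([R, T, S, P]), mu @ np.array([R, S, T, P])

def candidate_vertices(p, **game):
    # C(p) is the convex hull of these 11 points.
    return [payoff_point(p, q, **game) for q in ELEVEN]
```

The last entry corresponds to the unconditional cooperator and can be checked against the closed-form stationary pay-offs.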

Lemma 2.1. —

For q ∈ Mem1X and x, y∈{C, D}, let (q; qxy = qxy′) be the strategy obtained from q by changing qxy to qxy′∈[0, 1]. If p••∈(0, 1)^4, q ∈ Mem1X and x, y∈{C, D}, then the point (πY(p, q), πX(p, q)) falls on the line segment joining (πY(p, (q; qxy = 0)), πX(p, (q; qxy = 0))) and (πY(p, (q; qxy = 1)), πX(p, (q; qxy = 1))).

Proof. —

Let p••∈(0, 1)^4 and q ∈ Mem1X. Since the transition matrix of equation (2.2) is just 4 × 4, one can directly solve for its stationary distribution, μ(p, q) (e.g. by using Gaussian elimination or the determinant formula of Press & Dyson [1]). For example, suppose that x = y = C. Then, with

L(qCC) := (1 − qCC)[1 + qCC K(p••, q••)]^−1,   (2.4)

where K is an explicit (lengthy) polynomial in the entries of p•• and in qCD, qDC and qDD, one has

(πY(p, q), πX(p, q)) = L(qCC)(πY(p, (q; qCC = 0)), πX(p, (q; qCC = 0))) + (1 − L(qCC))(πY(p, (q; qCC = 1)), πX(p, (q; qCC = 1))).   (2.5)

Provided (πY(p, (q; qCC = 0)), πX(p, (q; qCC = 0))) ≠ (πY(p, (q; qCC = 1)), πX(p, (q; qCC = 1))), we also have L(0) = 1 and L(1) = 0. Moreover, one can check that, under this condition, L′(qCC) is nowhere equal to 0, and 0 ⩽ L(qCC) ⩽ 1 for every qCC∈[0, 1]. The other cases with x, y∈{C, D} are analogous. ▪

Remark 2.2. —

Even when qxy is uniformly distributed between 0 and 1, the corresponding points in the feasible region need not be uniformly distributed between the endpoints corresponding to qxy = 0 and qxy = 1, respectively (figure 2). This result is therefore somewhat different from the analogous situation of playing against a mixed action in a stage game, where, for a pay-off function u : SX × SY → ℝ2 and mixed actions σX∈Δ(SX) and σY∈Δ(SY), one has u(σX, σY) = ∫_{y∈SY} u(σX, y) dσY(y) due to linearity.

Figure 2.


The set of points (πY (p, q), πX(p, q)), where p•• = (0.7876, 0.9856, 0.4095, 0.0301) and q•• = (qCC, 0.9963, 0.0166, 0.9879) as qCC varies between 0 (green) and 1 (red) in uniform increments of 0.01. The resulting points all fall along a line; however, they are not uniformly distributed even though the distribution of qCC is uniform. Parameters: R = 3, S = 0, T = 5 and P = 1. (Online version in colour.)

Proposition 2.3. —

For any p ∈ Mem1X with p••∈(0, 1)^4, C(p) is the convex hull of the following 11 points:

(πX(0, 0, 0, 0), πY(0, 0, 0, 0)) = ((P − PpCD + SpDD)/(pDD − pCD + 1), (P − PpCD + TpDD)/(pDD − pCD + 1));   (2.6a)

the points (2.6b)–(2.6j), corresponding to the nine intermediate opponents (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 0, 1), (1, 0, 1, 0) and (1, 0, 1, 1), whose coordinates are analogous (lengthier) explicit rational functions of p•• and R, S, T, P, obtained by solving for the stationary distribution of M(p, q) at the corresponding q••; and

(πX(1, 1, 1, 1), πY(1, 1, 1, 1)) = ((T + RpDC − TpCC)/(pDC − pCC + 1), (S + RpDC − SpCC)/(pDC − pCC + 1)).   (2.6k)

Proof. —

Press & Dyson [1] show that if X uses a memory-one strategy, p, then any strategy of the opponent, y, can be replaced by a memory-one strategy, q, without changing the pay-offs to X and Y; thus, if X uses a memory-one strategy, one may assume without loss of generality that Y also uses a memory-one strategy. If p••∈(0, 1)^4 and q ∈ Mem1X, the fact that (πY(p, q), πX(p, q)) can be written as a convex combination of the 16 points {(πY(p, q′), πX(p, q′)) : q••′∈{0, 1}^4} then follows immediately from lemma 2.1. Moreover, the points corresponding to (0, 0, 0, 0), (0, 1, 0, 0) and (1, 0, 0, 0) are the same, as are the points corresponding to (1, 1, 0, 1), (1, 1, 1, 0) and (1, 1, 1, 1); thus, we can eliminate four points. Furthermore, we can remove the point associated with (1, 1, 0, 0) because it lies on the line connecting the points associated with (0, 0, 0, 0) and (1, 1, 1, 1). One can easily check that the remaining 11 points have the following property: if point i is removed, then there exist R, S, T, P and p for which C(p) is not the convex hull of the 10 points different from i (table 1). Thus, for a general p and pay-off matrix, all 11 of these points are required. ▪

Table 1.

For each point, πX,Y(i1, i2, i3, i4), the feasible region C(p) cannot (in general) be expressed as the convex hull of the remaining 10 points. That is, each row gives (i) one of the 11 points of which C(p) is the convex hull and (ii) an example of a game-strategy pair for which πX,Y(i1, i2, i3, i4) is an extreme point of C(p).

point          | (R, S, T, P)                       | p••
πX,Y(0,0,0,0)  | (4.5953, 3.5001, 0.1798, 4.4972)   | (0.0347, 0.8913, 0.9873, 0.1164)
πX,Y(0,0,0,1)  | (3.5909, 3.7183, 3.1091, 2.6508)   | (0.3420, 0.5591, 0.0468, 0.9941)
πX,Y(0,0,1,0)  | (0.1150, 1.2677, 2.8725, 1.4290)   | (0.8937, 0.9211, 0.6995, 0.0052)
πX,Y(0,0,1,1)  | (0.1523, 1.7642, 3.3334, 3.9907)   | (0.5319, 0.4107, 0.9805, 0.0823)
πX,Y(0,1,0,1)  | (2.1084, 0.4235, 4.5449, 4.5716)   | (0.3897, 0.6428, 0.2422, 0.0300)
πX,Y(0,1,1,0)  | (2.5627, 2.5701, 4.1353, 4.0437)   | (0.7502, 0.7603, 0.9999, 0.3161)
πX,Y(0,1,1,1)  | (0.0600, 1.1524, 2.8660, 1.3631)   | (0.1145, 0.9494, 0.7587, 0.9214)
πX,Y(1,0,0,1)  | (4.4025, 1.6813, 2.9162, 1.1664)   | (0.9629, 0.0020, 0.2554, 0.8444)
πX,Y(1,0,1,0)  | (0.1167, 2.5125, 0.3462, 4.6919)   | (0.4121, 0.4373, 0.5380, 0.8915)
πX,Y(1,0,1,1)  | (0.3787, 1.1357, 1.5417, 2.7617)   | (0.2570, 0.5191, 0.1293, 0.9332)
πX,Y(1,1,1,1)  | (1.8211, 3.2300, 4.6281, 0.4609)   | (0.0009, 0.4996, 0.4362, 0.9653)

Remark 2.4. —

A memory-one strategy p enforces a linear pay-off relationship if and only if these 11 points are collinear.

Remark 2.5. —

One needs all 11 of these points for general R, S, T, P and p. However, for any particular game-strategy pair, it is often the case that several of these points are unnecessary because they lie within the convex hull of some other subset of these 11 points; they are typically not all extreme points of C(p).

3. Reactive learning strategies

In a traditional memory-one strategy, X's probability of playing C depends on the realized actions of the two players, x and y. However, X can observe more than just their pure action against the opponent's; they also know how they arrived at x (i.e. they know the mixed action, σX, that resulted in x in the previous round). Of course, X need not be able to see Y 's mixed action, but they can still observe the pure action Y played. Therefore, an alternative notion of a memory-one strategy for player X could be defined as follows: after X plays σX∈[0, 1] and Y plays y, X then chooses a new action based on the distribution p*σXy∈[0, 1]. In this formulation, p* is a map from [0, 1] × {C, D} to [0, 1]. We refer to such a map, p*, together with X's initial probability of playing C, p0, as a ‘reactive learning strategy’ for player X (figure 3).

Figure 3.


The space of memory-one strategies, Mem1X, as it relates to the space of reactive learning strategies, RLX. Both sets contain the space of reactive strategies [22], which take into account only the last move, y, of the opponent. Whereas a memory-one strategy takes into account the last pure action of X as well, x, a reactive learning strategy uses X's last mixed action, σX∈[0, 1]. After each round, a reactive learning strategy uses y to update X's probability of cooperating. RLX is ‘larger’ than Mem1X in the sense that there is an injective map Mem1X → RLX that is not surjective. (Online version in colour.)

In other words, in contrast to Mem1X = [0, 1] × [0, 1]4, which can be alternatively described as

Mem1X = [0, 1] × {p : {C, D} × {C, D} → [0, 1]},   (3.1)

we define the space of reactive learning strategies as

RLX := [0, 1] × {p* : [0, 1] × {C, D} → [0, 1]},   (3.2)

where [0, 1] indicates the space of mixed actions for X and {C, D} indicates the action space for Y . Although [0, 1] is a much larger space than {C, D}, the updates of mixed actions can be easier to specify using reactive learning strategies since they allow for adaptive modification of an existing mixed action (without the need to devise a new mixed action from scratch after every observed history of play).

Example 3.1. —

Suppose that player X starts by playing C and D with equal probability, i.e. p0 = 1/2. For fixed η∈[0, 1] (the ‘learning rate’), cooperation from the opponent leads to p*σXC = (1 − η)σX + η, while defection leads to p*σXD = (1 − η)σX. Thus, a long pattern of exploitation by Y leads X to defect more often. On the other hand, X does not immediately forgive such behaviour but rather requires Y to cooperate repeatedly to bring X back up to higher levels of cooperation. For example, if X starts with p0 and Y defects ℓ times in a row, then X subsequently cooperates with probability (1 − η)^ℓ p0. In order to bring X's probability of cooperation above p0 once again, Y must then cooperate for T rounds, where

T ⩾ log((1 − p0)/(1 − (1 − η)^ℓ p0))/log(1 − η).   (3.3)

We refer to this strategy as LTFT because it pushes a player's cooperation probability in the direction of the opponent's last move (figure 4). In this way, a reactive learning strategy can encode more complicated behaviour than a memory-one strategy. Conversely, memory-one strategies can also encode behaviour not captured by reactive learning strategies, which we discuss further in §3c.
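The recovery bound of equation (3.3) can be checked numerically. This is an illustrative sketch, not code from the paper; the parameter values p0 = 1/2, η = 0.2 and ℓ = 6 used in the check are assumptions:

```python
import math

def ltft_update(sigma, opp_action, eta):
    # LTFT: move sigma toward 1 after cooperation, toward 0 after defection.
    return (1 - eta) * sigma + (eta if opp_action == 'C' else 0.0)

def rounds_to_recover(p0, eta, ell):
    """Simulate ell defections by the opponent, then count the cooperations
    needed to push X's cooperation probability back above p0."""
    sigma = p0
    for _ in range(ell):
        sigma = ltft_update(sigma, 'D', eta)
    T = 0
    while sigma < p0:
        sigma = ltft_update(sigma, 'C', eta)
        T += 1
    return T
```

The simulated count agrees with the smallest integer T satisfying the logarithmic bound above.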

Figure 4.


‘Learning tit-for-tat (LTFT)’, an analogue of tit-for-tat (TFT) within the space of reactive learning strategies. LTFT is a function of two parameters, p0 (the initial mixed action) and η (the learning rate). Player X initially plays C with probability p0. In all subsequent rounds, if X played C with probability σX and Y played C (resp. D) in the previous round, in the next round X plays C with probability p*σXC = (1 − η)σX + η (resp. p*σXD = (1 − η)σX). At the corners lie the strategies ALLD (always defect), ALLC (always cooperate), TFT (tit-for-tat) and STFT (suspicious tit-for-tat).

(a). Linear reactive learning strategies

A pertinent question at this point is whether there is a ‘natural’ map from Mem1X to RLX. Let (p0, p••) = (p0, pCC, pCD, pDC, pDD) be a memory-one strategy. If (p0′, p*) is the corresponding reactive learning strategy, then the first requirement we impose is p0′ = p0. If σX = 1, then X plays C with probability one. It is therefore reasonable to insist that p*1y = pCy. Similarly, X plays D with probability one when σX = 0, and we insist that p*0y = pDy. Suppose now that σX and σX′ are two mixed actions for X. If Y plays y∈{C, D}, then the responses for X corresponding to σX and σX′ are p*σXy and p*σX′y, respectively. If X plays σX with probability w∈[0, 1] and σX′ with probability 1 − w, then it is also natural to insist that the response is p*σXy with probability w and p*σX′y with probability 1 − w. Thus, for any σX∈[0, 1] and y∈{C, D}, with these requirements p* can be written uniquely in terms of p•• as

p*σXy = σX p*1y + (1 − σX) p*0y = σX pCy + (1 − σX) pDy.   (3.4)

Using this map, one can naturally identify Mem1X with the set of linear reactive learning strategies, LRLX ⊆ RLX, consisting of those functions p* : [0, 1] × {C, D} → [0, 1] for which there exist a, b, c, d ∈ ℝ with

p*σXC = σX a + (1 − σX) c   (3.5a)
and p*σXD = σX b + (1 − σX) d.   (3.5b)

Clearly, any such a, b, c, d must lie in [0, 1] since p*σXy∈[0, 1] for every σX∈[0, 1] and y∈{C, D}.
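The correspondence of equation (3.4) can be sketched in code. This is illustrative only, not from the paper; the numeric response vector in the usage check is an assumption:

```python
def linear_reactive(p_response):
    """Map a memory-one response vector (pCC, pCD, pDC, pDD) to its
    linear reactive learning rule sigma, y -> p*_{sigma y}."""
    pCC, pCD, pDC, pDD = p_response
    def pstar(sigma, y):
        # p*_{sigma y} = sigma * p_{Cy} + (1 - sigma) * p_{Dy}
        pC, pD = (pCC, pDC) if y == 'C' else (pCD, pDD)
        return sigma * pC + (1 - sigma) * pD
    return pstar
```

The defining requirements are easy to verify: p*1y = pCy, p*0y = pDy, and the response to a mixture of mixed actions is the corresponding mixture of responses.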

Under this correspondence, the strategy of example 3.1 has parameters (1/2, 1, 1 − η, η, 0). But note that this map, Mem1X → RLX, is not surjective because not every reactive learning strategy is linear. For example, if (a, b, c, d)∈[0, 1]^4 and p*∈RLX is the quadratic response function defined by

p*σXC := (σX)^2 a + (1 − (σX)^2) c   (3.6a)
and p*σXD := (σX)^2 b + (1 − (σX)^2) d,   (3.6b)

then there exists no (pCC, pCD, pDC, pDD)∈[0, 1]^4 mapping to p* provided a ≠ c or b ≠ d.

(b). Stationary distributions

Suppose that (p0, p*) and (q0, q*) are reactive learning strategies for X and Y, respectively. These strategies generate a Markov chain on the (infinite) space {C, D}2 × [0, 1]2 with transition probabilities from ((x, y), (σX, σY)) to ((x′, y′), (p*σXy, q*σYx)) ∈ {C, D}2 × [0, 1]2 given by

P((x, y), (σX, σY)) → ((x′, y′), (p*σXy, q*σYx)) :=
  p*σXy q*σYx                  if x′ = C, y′ = C,
  p*σXy (1 − q*σYx)            if x′ = C, y′ = D,
  (1 − p*σXy) q*σYx            if x′ = D, y′ = C,
  (1 − p*σXy)(1 − q*σYx)       if x′ = D, y′ = D.   (3.7)

To simplify notation, we also denote the right-hand side of this equation by p*σXy(x′) q*σYx(y′).
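A single transition of this chain can be sampled as in the following sketch (illustrative; the strategy functions supplied by the caller, and the deterministic ALLC/ALLD rules in the usage check, are assumptions):

```python
import random

def chain_step(state, pstar, qstar, rng=random):
    """One transition of the chain on {C,D}^2 x [0,1]^2.

    pstar(sigma, y) and qstar(sigma, x) return the players' updated
    mixed actions; realized actions are then drawn from them.
    """
    (x, y), (sX, sY) = state
    sX_next, sY_next = pstar(sX, y), qstar(sY, x)
    x_next = 'C' if rng.random() < sX_next else 'D'
    y_next = 'C' if rng.random() < sY_next else 'D'
    return ((x_next, y_next), (sX_next, sY_next))
```

With degenerate update rules (mixed actions pinned to 0 or 1), the realized actions are deterministic, which makes the transition rule easy to verify.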

If ν is a stationary distribution of this chain, then, for any ((x′, y′), (σ′X, σ′Y))∈{C, D}2 × [0, 1]2,

ν((x′, y′), (σ′X, σ′Y)) = ∫_{((x, y), (σX, σY)) : (p*σXy, q*σYx) = (σ′X, σ′Y)} P((x, y), (σX, σY)) → ((x′, y′), (σ′X, σ′Y)) dν((x, y), (σX, σY))
= ∫_{((x, y), (σX, σY)) : (p*σXy, q*σYx) = (σ′X, σ′Y)} σ′X(x′) σ′Y(y′) dν((x, y), (σX, σY)).   (3.8)

In general, ν is difficult to give explicitly. However, it is possible to understand the marginal distributions on σX and σY in more detail (see appendix A). In any case, having an explicit formula for ν is not necessary for obtaining our main result on feasible pay-off regions, which we turn to in the next section.

(c). Feasible pay-off regions

By looking at the feasible region of a strategy, we uncover a nice relationship between a memory-one strategy, p, and its corresponding (linear) reactive learning strategy, p*. Namely, for every p ∈ Mem1X, we have C(p*) ⊆ C(p). In this section, we give a proof of this fact and illustrate some of its consequences.

For t ⩾ 1, let Ht = ({C, D}2)^t be the set of histories of play from time 0 through time t − 1 [12]. When t = 0, H0 = {∅}, where ∅ denotes the ‘empty’ history, indicating that no play came before the present encounter. A behavioural strategy for a player specifies, for every possible history of play, a probability of using C in the next encounter. That is, if H := ∪_{t⩾0} Ht, then a behavioural strategy is a map H → [0, 1]; we denote the set of all behavioural strategies by B. The following lemma shows that, when considering the feasible region of a memory-one or reactive learning strategy, one can assume without loss of generality that the opponent is playing a Markov strategy.

Lemma 3.2. —

Let M ⊆ B be the set of all Markov strategies, i.e.

M := {y : {1, 2, …} × {C, D}2 → [0, 1]}.   (3.9)

For any x ∈ Mem1X ∪ RLX, we have C(x) = {(πY(x, y), πX(x, y))}_{y∈M}.

Proof. —

When p ∈ Mem1X, the lemma follows from [1, appendix A]. Specifically, when X plays p ∈ Mem1X against y ∈ B, consider the time-t distributions μt on {C, D}2 and μ̄t on Ht. For (xt+1, yt+1)∈{C, D}2,

μt+1(xt+1, yt+1) = Σ_{ht+1∈Ht+1} pxtyt(xt+1) yht+1(yt+1) μ̄t+1(ht+1)
= Σ_{ht+1∈Ht+1} pxtyt(xt+1) y(ht, (xt, yt))(yt+1) μ̄t+1(ht+1)
= Σ_{(xt, yt)∈{C, D}2} pxtyt(xt+1) Σ_{ht∈Ht} y(ht, (xt, yt))(yt+1) μt(xt, yt | ht) μ̄t(ht).   (3.10)

Therefore, the same sequence of distributions {μt}_{t⩾0} arises when Y uses the Markov strategy defined by

q^{t+1}_{xt yt}(yt+1) := (Σ_{ht∈Ht} y(ht, (xt, yt))(yt+1) μt(xt, yt | ht) μ̄t(ht)) / (Σ_{ht∈Ht} μt(xt, yt | ht) μ̄t(ht)).   (3.11)

If p* : [0, 1] × {C, D} → [0, 1] is a reactive learning strategy that X uses against y ∈ B, then for every t ⩾ 0 there are distributions νt on {C, D}2, χt on [0, 1] and ν̄t on Ht × [0, 1]. For (xt+1, yt+1)∈{C, D}2,

νt+1(xt+1, yt+1) = ∫_{(ht+1, σXt)∈Ht+1×[0, 1]} p*σXtyt(xt+1) yht+1(yt+1) dν̄t+1(ht+1, σXt)
= Σ_{(xt, yt)∈{C, D}2} ∫_{σXt∈[0, 1]} p*σXtyt(xt+1) × ∫_{(ht, σXt−1)∈Ht×[0, 1]} y(ht, (xt, yt))(yt+1) dχt(σXt | (ht, (xt, yt)), σXt−1) dν̄t(ht, σXt−1).   (3.12)

Consider the Markov strategy for Y with q0 := y(∅) and q1_{x0 y0}(y1) := y(x0, y0)(y1). For t ⩾ 1, let

q^{t+1}_{xt yt}(yt+1) := (∫_{σXt∈[0, 1]} p*σXtyt(xt+1) ∫_{(ht, σXt−1)∈Ht×[0, 1]} y(ht, (xt, yt))(yt+1) dχt(σXt | (ht, (xt, yt)), σXt−1) dν̄t(ht, σXt−1)) / (∫_{σXt∈[0, 1]} p*σXtyt(xt+1) dχt(σXt | xt, yt) νt(xt, yt)).   (3.13)

If νt′ and χt′ are the analogues of νt and χt for p* against {q^t}_{t⩾1}, then clearly νt = νt′ and χt = χt′ for t = 0, 1. Suppose that, for some t ⩾ 1, we have νt = νt′ and χt = χt′. It follows, then, that at time t + 1,

νt+1′(xt+1, yt+1) = Σ_{(xt, yt)∈{C, D}2} q^{t+1}_{xt yt}(yt+1) ∫_{σXt∈[0, 1]} p*σXtyt(xt+1) dχt′(σXt | xt, yt) νt′(xt, yt)
= Σ_{(xt, yt)∈{C, D}2} q^{t+1}_{xt yt}(yt+1) ∫_{σXt∈[0, 1]} p*σXtyt(xt+1) dχt(σXt | xt, yt) νt(xt, yt)
= νt+1(xt+1, yt+1),   (3.14)

which gives the desired result for x ∈ RLX. ▪

This lemma leads to a straightforward proof of our main result:

Theorem 3.3. —

C(p*) ⊆ C(p) for every p ∈ Mem1X.

Proof. —

By lemma 3.2, for x ∈ RLX, we may assume the opponent's strategy is Markovian, meaning that it has a memory of one round into the past but can depend on the current round, t. This dependence on t distinguishes a Markov strategy from a memory-one strategy, the latter of which also has memory of one round into the past but is independent of t. We denote by M the set of all Markov strategies (equation (3.9)).

Let p* be a linear reactive learning strategy for X and suppose that y ∈ M. For every t ⩾ 0, these strategies generate a distribution ν*t over {C, D}2 × [0, 1]. For any strategy q against p, there is a sequence of distributions μt on {C, D}2 generated by these two strategies. We prove the theorem by finding {q^t}_{t⩾1} ∈ M such that μt(xt, yt) = ν*t({(xt, yt)} × [0, 1]) for every (xt, yt)∈{C, D}2 and t ⩾ 0.

Let χt be the (marginal) distribution on σXt∈[0, 1] at time t. For yt∈{C, D}, denote by χt( · | yt) this distribution conditioned on Y using action yt at time t. For t ⩾ 0, consider the strategy with q0 := y and

q^{t+1}_{C yt}(yt+1) := (∫_{σXt∈[0, 1]} σXt (σXt y^{t+1}_{C yt}(yt+1) + (1 − σXt) y^{t+1}_{D yt}(yt+1)) dχt(σXt | yt)) / (∫_{σXt∈[0, 1]} σXt dχt(σXt | yt))   (3.15a)

and

q^{t+1}_{D yt}(yt+1) := (∫_{σXt∈[0, 1]} (1 − σXt) (σXt y^{t+1}_{C yt}(yt+1) + (1 − σXt) y^{t+1}_{D yt}(yt+1)) dχt(σXt | yt)) / (∫_{σXt∈[0, 1]} (1 − σXt) dχt(σXt | yt)).   (3.15b)

Clearly, μ0(x0, y0) = ν*0({(x0, y0)} × [0, 1]) for every (x0, y0)∈{C, D}2. Suppose, for some t ⩾ 0, that μt(xt, yt) = ν*t({(xt, yt)} × [0, 1]) for every (xt, yt)∈{C, D}2. For (xt+1, yt+1)∈{C, D}2, we then have

μt+1(xt+1, yt+1) = Σ_{(xt, yt)∈{C, D}2} pxtyt(xt+1) q^{t+1}_{xt yt}(yt+1) μt(xt, yt)
= Σ_{yt∈{C, D}} (pCyt(xt+1) q^{t+1}_{C yt}(yt+1) μt(C, yt) + pDyt(xt+1) q^{t+1}_{D yt}(yt+1) μt(D, yt))
= Σ_{yt∈{C, D}} pCyt(xt+1) ∫_{σXt∈[0, 1]} σXt (σXt y^{t+1}_{C yt}(yt+1) + (1 − σXt) y^{t+1}_{D yt}(yt+1)) dχt(σXt | yt)
+ Σ_{yt∈{C, D}} pDyt(xt+1) ∫_{σXt∈[0, 1]} (1 − σXt) (σXt y^{t+1}_{C yt}(yt+1) + (1 − σXt) y^{t+1}_{D yt}(yt+1)) dχt(σXt | yt)
= Σ_{(xt, yt)∈{C, D}2} y^{t+1}_{xt yt}(yt+1) ∫_{σXt∈[0, 1]} (σXt pCyt + (1 − σXt) pDyt)(xt+1) dν*t({(xt, yt)} × {σXt})
= Σ_{(xt, yt)∈{C, D}2} ∫_{σXt∈[0, 1]} p*σXtyt(xt+1) y^{t+1}_{xt yt}(yt+1) dν*t({(xt, yt)} × {σXt})
= ν*t+1({(xt+1, yt+1)} × [0, 1]).   (3.16)

Therefore, by induction and the definition of expected pay-off in an iterated game, C(p*) ⊆ C(p). ▪

As a consequence of theorem 3.3, we see that p* enforces a linear pay-off relationship [1] whenever p does. However, the converse need not hold; figure 5(b) gives an example in which X's pay-off is a function of Y 's when X uses p* but not when X uses p. Although this example illustrates an extreme case of when the pay-off region collapses, perhaps the most interesting behaviour is illustrated by figure 5a,c,d. In these examples, we focus on the pay-off regions that can be obtained against memory-one opponents. Using p* instead of p can both bias pay-offs in favour of X and limit potential losses against a spiteful opponent.

Figure 5.


Simulated pay-offs against a fixed memory-one strategy, p (grey), and its corresponding reactive learning strategy, p* (green), as the opponent plays 10^5 randomly chosen strategies q ∈ Mem1X. (a) If the opponent is greedy and wishes to optimize his or her own pay-off only, then upon exploring the space Mem1X for sufficiently long, the pay-offs will end up at the black point when X uses p and at the magenta point when X uses p*. In this scenario, p favours Y having a higher pay-off than X, while p* favours X having a higher pay-off than Y. Thus, p* extorts a pay-off-maximizing opponent while p is more generous. (b) The pay-offs against p* (green) can fall along a line even when those against p (grey) form a two-dimensional region. In (c), by using p* instead of p, X can limit the pay-off the opponent receives from the black point to the magenta point. Similarly, in (d), X can limit the potential ‘punishment’ incurred from Y. When X uses p, the opponent can choose a strategy that gives X a negative pay-off (black point). When X uses p*, no such strategy of the opponent exists, and the worst pay-off X can possibly receive is positive (magenta point). The parameters used are (a) p = (0.90, 0.50, 0.01, 0.20, 0.90) and R = 2, S = −1, T = 1 and P = 1/2; (b) p = (1.0000, 0.6946, 0.0354, 0.1168, 0.3889) and R = 3, S = 1, T = 2 and P = 0; (c) p = (0.8623, 0.6182, 0.9528, 0.5601, 0.0001) and R = 3, S = 0, T = 5 and P = 1; and (d) p = (0.5626, 0.2381, 0.7236, 0.9537, 0.1496) and R = 1/2, S = −3/2, T = 2 and P = 3/2. Each coordinate of q is chosen independently from an arcsine (i.e. Beta(1/2, 1/2)) distribution. (Online version in colour.)

For a memory-one strategy p ∈ Mem_1^X, we can ask how the region {(π_Y(p, q), π_X(p, q)) : q ∈ Mem_1^X} compares to {(π_Y(p*, q*), π_X(p*, q*)) : q ∈ Mem_1^X}. In other words, how does the map p ↦ p* transform the feasible region of a strategy when the opponents are also subjected to this map? Figure 6 demonstrates that this map can significantly distort the distribution of pay-offs within the feasible region.

Figure 6. Distortions in the distribution of pay-offs against reactive learning strategies. In both panels, the grey region is formed by playing 10^5 randomly chosen strategies q ∈ Mem_1^X against a fixed strategy p ∈ Mem_1^X. The green region in (a) arises from simulating the pay-offs of p* against 10^5 strategies q ∈ Mem_1^X. In (b), this same reactive learning strategy, p*, is simulated against 10^5 strategies q* ∈ RL_X for q ∈ Mem_1^X. In both panels, the optimal outcome for Y is the black point when X uses p and the magenta point when X uses p*. The magenta point represents a much better outcome for X and only a slightly worse outcome for Y than the black point, indicating that p* is highly extortionate relative to p when played against a pay-off-maximizing opponent. In both panels, the parameters are p = (0.50, 0.99, 0.40, 0.01, 0.01) and R = 3, S = 0, T = 5 and P = 1. Each coordinate of q is chosen independently from an arcsine (i.e. Beta(1/2, 1/2)) distribution. (Online version in colour.)

(d) Optimization through mutation

Suppose that X uses a fixed reactive learning strategy, p*, for some p ∈ Mem_1^X. Starting from some random memory-one strategy, q, the opponent might seek to optimize his or her pay-off through a series of mutations. In other words, Y is subjected to the following process. First, sample a new strategy q′ ∈ Mem_1^X. If the pay-off to Y for q′ against p* exceeds that of q against p*, switch to q′; otherwise, retain q. This step then repeats until Y has a sufficiently high pay-off (or else has not changed strategies in some fixed number of steps). From figure 6, one expects this process to give different results from the same update scheme when X plays the memory-one strategy p instead of p*.
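The acceptance rule just described can be sketched as a simple mutation loop. This is our own generic illustration (the function and parameter names are ours), with `payoff` standing in for Y's long-run pay-off against X's fixed strategy and `sample` drawing a random strategy:

```python
import random

def optimize_by_mutation(payoff, sample, steps=10_000, patience=1_000):
    """Y's process: repeatedly sample a candidate strategy q' and adopt it
    only if it improves Y's pay-off; stop after `patience` straight rejections."""
    q = sample()
    best = payoff(q)
    rejected = 0
    for _ in range(steps):
        q_new = sample()
        value = payoff(q_new)
        if value > best:
            q, best = q_new, value
            rejected = 0
        else:
            rejected += 1
            if rejected >= patience:
                break                    # Y's strategy has stabilized
    return q, best
```

Replacing `sample` with a local perturbation of the current q gives the local-mutation variant mentioned below.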

As expected, figure 7 shows that this optimization process behaves quite differently against p* than it does against p. Whereas using p in this example results in equitable outcomes, using p* gives X a much higher pay-off than Y, indicating extortionate behaviour. One can also imagine other optimization procedures (not covered here), such as when q′ is always sufficiently close to q (i.e. local mutations). When X uses p*, a path from the red point to the magenta point in figure 6 through random local sampling of q typically requires Y to initially accept lower pay-offs. If Y uses q* instead of q, as in figure 6b, this effect is amplified.

Figure 7. (a) Optimization against a memory-one strategy and (b) the corresponding reactive learning strategy. In each panel, X's strategy is fixed with parameters p = (0.50, 0.99, 0.40, 0.01, 0.01). Y chooses an initial memory-one strategy, q, from an arcsine distribution. At each update step, Y samples another strategy, q′, from the same distribution. If Y's pay-off for playing q′ against X exceeds that of playing q against X, then Y replaces his or her current strategy with q′. Otherwise, q′ is discarded and Y retains q. Over time, this process generates a sequence of pay-off pairs for X and Y, shown in (a,b). Relative to p, the reactive learning strategy p* is highly extortionate. (Online version in colour.)

4. Discussion

Our primary focus has been on the feasible region generated by a fixed strategy. This approach to studying X's strategy is inspired by the ‘zero-determinant’ strategies of Press & Dyson [1], which enforce linear subsets of the feasible region. This perspective has also been expanded to cover so-called partner and rival strategies [2–4], which have proven extremely useful in understanding repeated games from an evolutionary perspective. The feasible region of a memory-one strategy, p, is quite simple and can be characterized as the convex hull of at most 11 points. Furthermore, these points are all straightforward to write down explicitly in terms of the pay-off matrix and the entries of p (see equation (2.6)). The feasible region of a reactive learning strategy, in terms of its boundary and extreme points, is evidently more complicated in general.

Both memory-one and reactive learning strategies contain the set of all reactive strategies. For every memory-one strategy, p, there exists a corresponding linear reactive learning strategy, p*, and this correspondence defines an injective map Mem_1^X → RL_X. In general, however, p cannot be identified with its image, p*, unless p is reactive. We make this claim formally using the geometry of a strategy within the feasible region, C(p), which captures all possible pay-off pairs against an opponent. For any memory-one strategy, we have C*(p) ⊆ C(p), where C*(p) denotes the feasible region of p*. Therefore, reactive learning strategies generally allow a player to impose greater control over where pay-offs fall within the feasible region than do traditional memory-one strategies. As illustrated in figure 5a, this added control can prevent a greedy, self-pay-off-maximizing opponent from obtaining more than X when X uses p*, even when such an opponent receives an unfair share of the pay-offs when X uses p instead. The proof of the containment C*(p) ⊆ C(p) also extends to discounted games, where each pay-off unit received t rounds into the future is valued at δ^t units at present for some ‘discounting factor’, δ ∈ [0, 1].

Another property of the map Mem_1^X → RL_X sending p to p* is that it distorts the distribution of pay-offs within the feasible region. Since Mem_1^X can be identified with the space of linear reactive learning strategies under this map, it is natural to compare the region of possible pay-offs when p plays against memory-one strategies to the one obtained when p* plays against linear reactive learning strategies. These distortions, as illustrated in figure 6, are particularly relevant when X plays against an opponent who is using a process such as simulated annealing to optimize pay-off. One can see from this example that if Y initially has a low pay-off, then with localized strategy exploration they must be willing to accept lower pay-offs before they find a strategy that improves their initial pay-off. This concern is not relevant when Y can simply compute a best response to X's strategy, but it is highly pertinent to evolutionary settings in which the opponent's strategy is obtained through mutation and selection rather than ‘computation’.

Reactive learning strategies are also more intuitive than memory-one strategies in some ways. Rather than being a dictionary of mixed actions indexed by all possible observed outcomes, a reactive learning strategy is simply an algorithm for updating one's tendency to choose a certain action. It therefore allows a player to alter their behaviour (mixed action) over time in response to various stimuli (actions of the opponent). This strategic approach to iterated games is reminiscent of both the Bush–Mosteller model [19] and the weighted majority algorithm [23], although traditionally these models are not studied through the pay-off regions they generate in iterated games. There are several interesting directions for future research in this area. For one, we have mainly considered the space of linear reactive learning strategies, but the space RL_X is much larger and could potentially exhibit complicated evolutionary dynamics. Furthermore, one could relax the condition that these strategies be reactive and allow them to use X's realized action in addition to X's mixed action. But even without these complications, we have seen that linear reactive learning strategies have quite interesting relationships to traditional memory-one strategies.

Acknowledgements

The authors are grateful to Krishnendu Chatterjee, Christian Hilbe and Joshua Plotkin for many helpful conversations and for feedback on earlier versions of this work.

Appendix A. Convergence of mixed actions

Suppose that X and Y use strategies (p_0, p*) and (q_0, q*), respectively. Let σ_X^0 = p_0 and σ_Y^0 = q_0 be the initial distributions on {C, D} for X and Y, respectively. If these distributions are known at time t ≥ 0, then, on average, the corresponding distributions at time t + 1 are given by the system of equations

σ_X^{t+1} := σ_Y^t p_{σ_X^t C} + (1 − σ_Y^t) p_{σ_X^t D} (A 1a)
and σ_Y^{t+1} := σ_X^t q_{σ_Y^t C} + (1 − σ_X^t) q_{σ_Y^t D}. (A 1b)

This system suggests a fixed-point analysis to determine whether the sequence {(σ_X^t, σ_Y^t)}_{t≥0} converges.

Suppose that (σ_X, σ_Y) ∈ [0, 1]^2 is a fixed point of this system, i.e.

σ_X = σ_Y p_{σ_X C} + (1 − σ_Y) p_{σ_X D} (A 2a)
and σ_Y = σ_X q_{σ_Y C} + (1 − σ_X) q_{σ_Y D}. (A 2b)

We consider this system for two types of linear reactive learning strategies: those coming from reactive strategies and those coming from general memory-one strategies under the map Mem_1^X → RL_X.

We first consider reactive strategies of the form (p_C, p_D), where p_C (resp. p_D) is the probability that a player uses C after the opponent played C (resp. D). Let (p_C, p_D) and (q_C, q_D) be fixed strategies for X and Y. For these reactive strategies, the system (A 1) takes the form

σ_X^{t+1} := σ_Y^t p_C + (1 − σ_Y^t) p_D (A 3a)
and σ_Y^{t+1} := σ_X^t q_C + (1 − σ_X^t) q_D. (A 3b)

One can easily check that this dynamical system has a unique fixed point, which Hofbauer & Sigmund [24] refer to as the ‘asymptotic C-level’ of (p_C, p_D) against (q_C, q_D), and which is given explicitly by

σ_X = (p_C q_D + p_D(1 − q_D)) / (1 − (p_C − p_D)(q_C − q_D)) (A 4a)
and σ_Y = (p_D q_C + (1 − p_D) q_D) / (1 − (p_C − p_D)(q_C − q_D)). (A 4b)
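As a numerical sanity check (our own sketch; the helper names are hypothetical), iterating the update (A 3) from an interior starting point converges to the closed form (A 4):

```python
def iterate_c_level(pC, pD, qC, qD, x0=0.5, y0=0.5, steps=200):
    """Iterate the reactive-strategy update (A 3) and return (sigma_X, sigma_Y)."""
    x, y = x0, y0
    for _ in range(steps):
        # simultaneous update of both players' cooperation levels
        x, y = y * pC + (1 - y) * pD, x * qC + (1 - x) * qD
    return x, y

def asymptotic_c_level(pC, pD, qC, qD):
    """Closed-form fixed point (A 4): the 'asymptotic C-level'."""
    den = 1 - (pC - pD) * (qC - qD)
    return ((pC * qD + pD * (1 - qD)) / den,
            (pD * qC + (1 - pD) * qD) / den)
```

For example, with (p_C, p_D) = (0.8, 0.3) and (q_C, q_D) = (0.6, 0.2), both routes give (σ_X, σ_Y) = (0.5, 0.4).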

Furthermore, we have the following straightforward convergence result.

Proposition A.1 —

If (p_C, p_D), (q_C, q_D) ∈ (0, 1)^2, and if (σ_X, σ_Y) ∈ (0, 1)^2 is given by equation (A 4), then

lim_{t→∞} (σ_X^t, σ_Y^t) = (σ_X, σ_Y) (A 5)

for any initial condition, (p_0, q_0) ∈ [0, 1]^2.

Proof. —

For (p_C, p_D), (q_C, q_D) ∈ (0, 1)^2, consider the map

f : [0, 1]^2 → [0, 1]^2 : (x, y) ↦ (y p_C + (1 − y) p_D, x q_C + (1 − x) q_D). (A 6)

For (x, y), (x′, y′) ∈ [0, 1]^2, we have

f(x, y) − f(x′, y′) = ((y − y′)(p_C − p_D), (x − x′)(q_C − q_D)). (A 7)

It follows that ∥f(x, y) − f(x′, y′)∥ ≤ λ∥(x, y) − (x′, y′)∥, where λ := max{|p_C − p_D|, |q_C − q_D|} < 1. By the contraction mapping theorem, there is then a unique fixed point (σ_X, σ_Y) ∈ [0, 1]^2 such that

lim_{t→∞} f^t(p_0, q_0) = (σ_X, σ_Y) (A 8)

for any (p_0, q_0) ∈ [0, 1]^2. It is straightforward to check that equation (A 4) is a fixed point of equation (A 3). ▪

In particular, if μ := (σ_X σ_Y, σ_X(1 − σ_Y), (1 − σ_X)σ_Y, (1 − σ_X)(1 − σ_Y)), then a straightforward calculation shows that μ is the stationary distribution of M((p_C, p_D, p_C, p_D), (q_C, q_D, q_C, q_D)) (equation (2.2)).
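This claim about μ can be checked numerically for reactive strategies. The sketch below is our own illustration; we assume the state order (CC, CD, DC, DD) of equation (2.2), with X's action listed first, and build the play chain directly:

```python
import numpy as np

pC, pD, qC, qD = 0.8, 0.3, 0.6, 0.2

# fixed point (A 4)
den = 1 - (pC - pD) * (qC - qD)
sX = (pC * qD + pD * (1 - qD)) / den
sY = (pD * qC + (1 - pD) * qD) / den
mu = np.array([sX * sY, sX * (1 - sY), (1 - sX) * sY, (1 - sX) * (1 - sY)])

# transition matrix over states (CC, CD, DC, DD): a reactive X cooperates
# with prob pC/pD according to Y's last move, and symmetrically for Y
px = [pC, pD, pC, pD]
py = [qC, qC, qD, qD]
M = np.array([[px[s] * py[s], px[s] * (1 - py[s]),
               (1 - px[s]) * py[s], (1 - px[s]) * (1 - py[s])]
              for s in range(4)])

assert np.allclose(mu @ M, mu)          # mu is the stationary distribution
```

Because each player's next action depends only on the opponent's last move, the stationary distribution factorizes into the product of the marginal cooperation levels, which is exactly what the assertion verifies.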

Remark A.2. —

Proposition A.1 need not hold if the entries of (p_C, p_D) and (q_C, q_D) are not strictly between 0 and 1. For example, when X and Y both play TFT, f is a simple involution with f(x, y) = (y, x), which preserves distance.

Consider now the case of general memory-one strategies with p_•• := (p_{CC}, p_{CD}, p_{DC}, p_{DD}) for X and q_•• := (q_{CC}, q_{CD}, q_{DC}, q_{DD}) for Y. For these strategies, the system defined by equation (A 1) has the form

σ_X^{t+1} := σ_Y^t (σ_X^t p_{CC} + (1 − σ_X^t) p_{DC}) + (1 − σ_Y^t)(σ_X^t p_{CD} + (1 − σ_X^t) p_{DD}) (A 9a)
and σ_Y^{t+1} := σ_X^t (σ_Y^t q_{CC} + (1 − σ_Y^t) q_{DC}) + (1 − σ_X^t)(σ_Y^t q_{CD} + (1 − σ_Y^t) q_{DD}). (A 9b)

In the spirit of proposition A.1, for fixed p_••, q_•• ∈ (0, 1)^4, we could consider the map

F : [0, 1]^2 → [0, 1]^2 : (x, y) ↦ (y(x p_{CC} + (1 − x) p_{DC}) + (1 − y)(x p_{CD} + (1 − x) p_{DD}), x(y q_{CC} + (1 − y) q_{DC}) + (1 − x)(y q_{CD} + (1 − y) q_{DD})) (A 10)

and analyse its fixed points. At this point, however, a couple of remarks are in order:

  • (i)

    F need not be a contraction, even when p_•• and q_•• have entries strictly between 0 and 1. For example, with p_•• = (0.9566, 0.2730, 0.0056, 0.0095) and q_•• = (0.9922, 0.0918, 0.3217, 0.0054),

    0.0441 = ∥F(0.7404, 0.6928) − F(0.8241, 0.8280)∥ > ∥(0.7404, 0.6928) − (0.8241, 0.8280)∥ = 0.0253. (A 11)

    We conjecture that this map is an eventual contraction, in which case the convergence result of proposition A.1 still holds (although the explicit formulae for σ_X and σ_Y differ from equation (A 4)).
  • (ii)

    A fixed point of F, (σ_X, σ_Y), even when it exists and is unique, generally does not have the property that μ(p, q) = (σ_X σ_Y, σ_X(1 − σ_Y), (1 − σ_X)σ_Y, (1 − σ_X)(1 − σ_Y)), where μ is the stationary distribution of equation (2.2). Furthermore, the long-run mean-frequency distribution on {C, D}^2 can be distinct from both of these distributions, including when the opponent plays q against p* and when they play q* against p*. An example in which these four distributions are pairwise distinct is easy to write down, e.g. p = (0.01, 0.01, 0.01, 0.99, 0.01) and q = (0.99, 0.99, 0.01, 0.99, 0.99). All four distributions coincide when p and q are both reactive, but in general they can be distinct.
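The numerical counterexample in remark (i) can be verified directly. The sketch below is our own; under the Euclidean norm the magnitudes differ from those printed in (A 11), which appear to be squared distances, but the failure of the contraction property is the same either way:

```python
import math

p = (0.9566, 0.2730, 0.0056, 0.0095)   # (pCC, pCD, pDC, pDD)
q = (0.9922, 0.0918, 0.3217, 0.0054)   # (qCC, qCD, qDC, qDD)

def F(x, y):
    """The map (A 10) for the fixed p and q above."""
    pCC, pCD, pDC, pDD = p
    qCC, qCD, qDC, qDD = q
    return (y * (x * pCC + (1 - x) * pDC) + (1 - y) * (x * pCD + (1 - x) * pDD),
            x * (y * qCC + (1 - y) * qDC) + (1 - x) * (y * qCD + (1 - y) * qDD))

def dist(u, v):
    """Euclidean distance between two points of [0, 1]^2."""
    return math.hypot(u[0] - v[0], u[1] - v[1])

a, b = (0.7404, 0.6928), (0.8241, 0.8280)
assert dist(F(*a), F(*b)) > dist(a, b)  # F moves this pair apart: no contraction
```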

Data accessibility

This article does not contain any additional data.

Authors' contributions

All authors designed research, performed research and wrote the paper.

Competing interests

We declare we have no competing interests.

Funding

The authors gratefully acknowledge support from the Lifelong Learning Machines program from DARPA/MTO. Research was sponsored by the Army Research Laboratory (ARL) and was accomplished under cooperative agreement no. W911NF-18-2-0265. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the US Government. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

  • 1. Press WH, Dyson FJ. 2012. Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. Proc. Natl Acad. Sci. USA 109, 10 409–10 413. (doi:10.1073/pnas.1206569109)
  • 2. Akin E. 2015. What you gotta know to play good in the iterated prisoner's dilemma. Games 6, 175–190. (doi:10.3390/g6030175)
  • 3. Hilbe C, Traulsen A, Sigmund K. 2015. Partners or rivals? Strategies for the iterated prisoner's dilemma. Games Econ. Behav. 92, 41–52. (doi:10.1016/j.geb.2015.05.005)
  • 4. Hilbe C, Chatterjee K, Nowak MA. 2018. Partners and rivals in direct reciprocity. Nat. Human Behav. 2, 469–477. (doi:10.1038/s41562-018-0320-9)
  • 5. Nowak M, Sigmund K. 1993. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner's dilemma game. Nature 364, 56–58. (doi:10.1038/364056a0)
  • 6. Axelrod R. 1984. The evolution of cooperation. New York, NY: Basic Books.
  • 7. Lehrer E. 1988. Repeated games with stationary bounded recall strategies. J. Econ. Theory 46, 130–144. (doi:10.1016/0022-0531(88)90153-6)
  • 8. Hauert C, Schuster HG. 1997. Effects of increasing the number of players and memory size in the iterated Prisoner's Dilemma: a numerical approach. Proc. R. Soc. Lond. B 264, 513–519. (doi:10.1098/rspb.1997.0073)
  • 9. Nowak MA. 2006. Five rules for the evolution of cooperation. Science 314, 1560–1563. (doi:10.1126/science.1133755)
  • 10. Hilbe C, Martinez-Vaquero LA, Chatterjee K, Nowak MA. 2017. Memory-n strategies of direct reciprocity. Proc. Natl Acad. Sci. USA 114, 4715–4720. (doi:10.1073/pnas.1621239114)
  • 11. Baek SK, Jeong H-C, Hilbe C, Nowak MA. 2016. Comparing reactive and memory-one strategies of direct reciprocity. Sci. Rep. 6, 25676. (doi:10.1038/srep25676)
  • 12. Fudenberg D, Tirole J. 1991. Game theory. Cambridge, MA: MIT Press.
  • 13. Posch M. 1999. Win–stay, lose–shift strategies for repeated games—memory length, aspiration levels and noise. J. Theor. Biol. 198, 183–195. (doi:10.1006/jtbi.1999.0909)
  • 14. Dal Bó P. 2005. Cooperation under the shadow of the future: experimental evidence from infinitely repeated games. Am. Econ. Rev. 95, 1591–1604. (doi:10.1257/000282805775014434)
  • 15. Nowak MA. 2006. Evolutionary dynamics: exploring the equations of life. Cambridge, MA: Belknap Press.
  • 16. Barlo M, Carmona G, Sabourian H. 2009. Repeated games with one-memory. J. Econ. Theory 144, 312–336. (doi:10.1016/j.jet.2008.04.003)
  • 17. Dal Bó P, Fréchette GR. 2011. The evolution of cooperation in infinitely repeated games: experimental evidence. Am. Econ. Rev. 101, 411–429. (doi:10.1257/aer.101.1.411)
  • 18. Stewart AJ, Plotkin JB. 2016. Small groups and long memories promote cooperation. Sci. Rep. 6, 26889. (doi:10.1038/srep26889)
  • 19. Bush RR, Mosteller F. 1953. A stochastic model with applications to learning. Ann. Math. Stat. 24, 559–585. (doi:10.1214/aoms/1177728914)
  • 20. Roth AE, Erev I. 1995. Learning in extensive-form games: experimental data and simple dynamic models in the intermediate term. Games Econ. Behav. 8, 164–212. (doi:10.1016/s0899-8256(05)80020-x)
  • 21. Izquierdo LR, Izquierdo SS. 2008. Dynamics of the Bush–Mosteller learning algorithm in 2×2 games. In Reinforcement learning. I-Tech Education and Publishing.
  • 22. Nowak M, Sigmund K. 1990. The evolution of stochastic strategies in the prisoner's dilemma. Acta Applicandae Math. 20, 247–265. (doi:10.1007/bf00049570)
  • 23. Littlestone N, Warmuth MK. 1989. The weighted majority algorithm. In 30th Annual Symp. on Foundations of Computer Science, Research Triangle Park, NC, USA, 30 October–1 November 1989, pp. 256–261. (doi:10.1109/sfcs.1989.63487)
  • 24. Hofbauer J, Sigmund K. 1998. Evolutionary games and population dynamics. Cambridge, UK: Cambridge University Press. (doi:10.1017/cbo9781139173179)


