Abstract
We examine the long-term implications of two models of learning with recency bias: recursive weights and limited memory. We show that both models generate similar beliefs and that both have a weighted universal consistency property. Using the limited-memory model we produce learning procedures that are both weighted universally consistent and converge with probability one to strict Nash equilibrium.
There is substantial evidence from both the laboratory and the field that people display “recency bias,” meaning that they react more heavily to recent observations and experiences than they do to older ones and continue to do so even once they have accumulated lots of data (for example, ref. 1 for evidence of recency in the field, ref. 2 for that in the laboratory, and ref. 3 for a recent survey of recency bias in single-agent decision experiments). Recency bias has been incorporated into both belief-based and reinforcement-based models of learning in games, by adding a parameter that controls the speed of informational discounting as in, for example, refs. 4–7 or by supposing that individuals retain only a finite sample in their memory as in ref. 8.
Our first result, Theorem 1, shows that informational discounting and a particular class of finite-memory models are similar. In Theorem 2 we show that when recency is slight, if the agent chooses each period’s action as a smooth best response to his current beliefs, he obtains, up to a small error, at least as much utility as he could have obtained had he known the sequence of outcomes in advance but been required to commit to a single action. This property, called weighted universal consistency, is a variation on the notion of universal or Hannan consistency in games (9, 10).
Finally in Theorem 3 we construct weighted universally consistent finite-memory learning procedures that converge with probability one to strict Nash equilibrium. This result differs from other results on procedures that always converge to Nash equilibrium, both because it allows for persistent small recency bias and because these procedures can be “decentralized” in the sense that the details of each agent’s learning rule can be set independently of the others. Earlier studies such as refs. 11 and 12 give convergence results, but do not show that the procedures have universal consistency properties, whereas the calibration rules of ref. 13 require that all agents base their forecasts on a shared deterministic “public forecasting rule,” which unlike our method requires that players coordinate the starting point of their learning procedures.
The Model
We consider a one-person decision problem. Each period the player chooses an action $a$ from a finite set of actions $A$ and then observes an outcome $y \in Y$, a finite set. The utility from action $a$ when the outcome is $y$ is denoted by $u(a, y)$. The space of probability distributions over a (finite) set $S$ is denoted by $\Delta(S)$. Mixed actions are denoted by $\alpha \in \Delta(A)$ and mixed outcomes by $\phi \in \Delta(Y)$. We write $u(\alpha, \phi)$ for the expected utility to mixed actions and mixed outcomes. A strategy for the player can depend only on the information available to him when he moves, namely the past values of his own play and the outcome. A history of play for the player is denoted by $h_t = (a_1, y_1, \ldots, a_t, y_t)$, with $h_0$ the null history and $H$ the space of all histories of play. A (behavior) strategy for the player is a map $\sigma: H \to \Delta(A)$, whereas an outcome function is a map $\rho: H \to \Delta(Y)$. Each strategy–outcome function pair $(\sigma, \rho)$ induces a stochastic process over action–outcome pairs, where given the history $h_{t-1}$ the conditional probability of $(a_t, y_t)$ is $\sigma(h_{t-1})(a_t)\,\rho(h_{t-1})(y_t)$. In other words, the player and nature must base their play only on the history of actions and outcomes. In some interpretations, the outcomes may be chosen by other players rather than by nature.
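To make the setup concrete, the following is a minimal simulation sketch of the induced stochastic process. It is our own illustration, not part of the paper: the two-element action and outcome sets, the placeholder strategy and outcome function, and all function names are assumptions made for the example.

```python
import random

A = ["L", "R"]          # finite action set
Y = ["l", "r"]          # finite outcome set

def u(a, y):            # utility u(a, y): reward for "matching" the outcome
    return 1.0 if (a == "L") == (y == "l") else 0.0

def sigma(history):     # behavior strategy: a map from histories to mixed actions
    return {"L": 0.5, "R": 0.5}   # placeholder: uniform play

def rho(history):       # outcome function: a map from histories to mixed outcomes
    return {"l": 0.7, "r": 0.3}   # placeholder: i.i.d. outcomes

def draw(dist):
    r, acc = random.random(), 0.0
    for x, p in dist.items():
        acc += p
        if r <= acc:
            return x
    return x

def simulate(T):
    """Generate (a_t, y_t) pairs: given h_{t-1}, a_t ~ sigma(h_{t-1}), y_t ~ rho(h_{t-1})."""
    h = []                       # h_0 is the null history
    for _ in range(T):
        a, y = draw(sigma(h)), draw(rho(h))
        h.append((a, y))
    return h

print(simulate(5))
```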
Notions of Recency
Belief-Based Strategies.
Because we explicitly construct Markovian belief-based learning rules that satisfy the long-term performance criterion of weighted universal consistency, we restrict attention from the outset to this class of rules. A Markov belief-based strategy consists of a prior belief $\hat\phi_1 \in \Delta(Y)$, a Markov kernel that specifies how beliefs are updated, and a map from beliefs at time $t$ to a mixed action at time $t$. One such map is the best-response map $BR(\phi) = \arg\max_{\alpha \in \Delta(A)} u(\alpha, \phi)$ (although this is not single-valued, we can make an arbitrary choice in the case of indifference); we also consider various smooth approximations to the best-response map. For the moment we focus on modeling the evolution of beliefs. In doing so, it is convenient to define $d_y(\tilde y)$ to be equal to 1 if $\tilde y = y$ and 0 otherwise and $d_\tau(y) \equiv d_y(y_\tau)$, the indicator function for whether the period-$\tau$ outcome was equal to $y$.
Recursive Weighting.
We are given a weight $\lambda \in (0, 1]$, an initial condition $\hat\phi_1 \in \Delta(Y)$, and the deterministic kernel $\hat\phi_{t+1} = (1 - \lambda)\hat\phi_t + \lambda d_t$, where $d_t$ is the point mass on the period-$t$ observation. An alternative representation of this belief process is to solve the difference equation, so that

$\hat\phi_{t+1} = (1 - \lambda)^t \hat\phi_1 + \lambda \sum_{\tau=1}^{t} (1 - \lambda)^{t - \tau} d_\tau.$

This shows that beliefs are a weighted average of past observations and the prior. Note that $\lambda = 1$ corresponds to the case in which only the most recent observation matters, whereas Bayesian updating with a Dirichlet prior corresponds to time-varying weights $\lambda_t = 1/(N + t)$, where $N$ is the weight of the prior sample. We now show how the recursive weighting process is related to several others.
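As a numerical illustration (ours, not the paper's), the following sketch checks that iterating the recursive kernel reproduces the solved weighted-average form; the variable names and the particular sequence of outcomes are arbitrary choices.

```python
Y = ["l", "r"]

def recursive_update(phi, y_t, lam):
    """One step of recursive weighting: phi_{t+1} = (1-lam)*phi_t + lam*d_t."""
    return {y: (1 - lam) * phi[y] + lam * (1.0 if y == y_t else 0.0) for y in Y}

def solved_form(prior, ys, lam):
    """Closed form: (1-lam)^t * prior + lam * sum_tau (1-lam)^(t-tau) * d_tau."""
    t = len(ys)
    phi = {y: (1 - lam) ** t * prior[y] for y in Y}
    for tau, y_tau in enumerate(ys, start=1):
        phi[y_tau] += lam * (1 - lam) ** (t - tau)
    return phi

prior, lam = {"l": 0.5, "r": 0.5}, 0.1
ys = ["l", "l", "r", "l", "r", "r"]
phi = dict(prior)
for y_t in ys:
    phi = recursive_update(phi, y_t, lam)
print(phi)
print(solved_form(prior, ys, lam))   # agrees with the iterated recursion
```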
Base Rate Neglect.
In the case where the agent believes that $\rho$ is generated by independent and identically distributed (i.i.d.) sampling, let $g_t$ be beliefs over $\Delta(Y)$. Bodoh-Creed et al.* propose a model of base rate neglect with the updating rule $g_{t+1}(\phi) \propto \phi(y_t)\, g_t(\phi)^{\beta}$ for $\beta \in (0, 1]$, so that the prior is underweighted relative to the likelihood. Note that if $\beta = 1$, this is ordinary Bayesian updating. If $\beta = 1$ and the prior is Dirichlet, the posterior mean is simply a weighted average of the prior sample and the observed frequencies. For $\beta < 1$ the posterior mean is difficult to compute, so we instead measure the central tendency by taking the maximum of the posterior density, that is, the posterior mode rather than the mean. To consider the maximum of this function with respect to $\phi$, we can ignore the denominator and maximize the logarithm $\sum_{\tau=1}^{t} \beta^{t-\tau} \log \phi(y_\tau) + \beta^{t} \log g_1(\phi)$. If we assume that the prior is such that $\log g_1(\phi) = \sum_{y} n_y \log \phi(y)$ for some weighted fictitious prior sample $(n_y)_{y \in Y}$, then we may write the log-likelihood with prior as $\sum_{y} \big( \sum_{\tau=1}^{t} \beta^{t-\tau} d_\tau(y) + \beta^{t} n_y \big) \log \phi(y)$, and the maximum is given by the weighted sample averages, so that if we take $\beta = 1 - \lambda$ and define $n_y = \hat\phi_1(y)/\lambda$, this is exactly the same point belief as is generated by recursive weighting.
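The equivalence sketched above can be checked numerically. The following is our own sketch, which assumes the base-rate-neglect rule as reconstructed here (prior raised to the power β) and a fictitious prior sample of total weight 1/λ; it verifies that the posterior mode coincides with the recursively weighted belief.

```python
Y = ["l", "r"]
lam = 0.2
beta = 1 - lam                      # base-rate-neglect parameter
prior = {"l": 0.3, "r": 0.7}
n = {y: prior[y] / lam for y in Y}  # fictitious prior sample with total weight 1/lam
ys = ["l", "r", "l", "l", "r"]
t = len(ys)

# Posterior mode of the base-rate-neglect model: maximize sum_y w_y * log phi(y)
w = {y: beta ** t * n[y] for y in Y}
for tau, y_tau in enumerate(ys, start=1):
    w[y_tau] += beta ** (t - tau)
total = sum(w.values())
map_belief = {y: w[y] / total for y in Y}

# Recursive weighting with weight lam and the same prior
phi = dict(prior)
for y_tau in ys:
    phi = {y: (1 - lam) * phi[y] + lam * (1.0 if y == y_tau else 0.0) for y in Y}

print(map_belief)
print(phi)   # the two coincide
```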
Limited Memory.
So far we have supposed in effect that there is unlimited memory for past observations or at least that any value of $\phi$ can be recorded in the memory. We now instead suppose that the memory has size $M$, where $M$ refers to the number of observations that can be stored. (Note that this is different from the limited-history processes considered by ref. 8, where the memory always contains the $M$ most recent observations.) To do this we define a $(k, p, M)$ procedure, where $1 \le k \le M$ and $p \in (0, 1]$, as follows: (i) choose randomly a subset of the memory of size $k$, (ii) discard each observation in the subset independently and randomly with probability $p$, and (iii) replace all of the discarded observations with the observation from the current period.
The simplest version of this procedure has $k = 1$ and $p = 1$; that is, choose one observation at random from memory and discard it. In this case when the signal $y$ is i.i.d., the ergodic distribution is multinomial.
The $(k, p, M)$ procedure allows us to largely separate the memory size $M$ from $\lambda$, while allowing the construction of procedures with arbitrary values of $\lambda$. Our goal is to show that when $M$ is large, the agent receives about the same utility as he would with infinite memory.
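A sketch of one period of the $(k, p, M)$ procedure as described above, in Python; the helper names and the parameter values are ours and purely illustrative.

```python
import random

def kpM_update(memory, y_t, k, p):
    """One period of the (k, p, M) procedure:
    (i) pick a random subset of the memory of size k,
    (ii) discard each chosen observation independently with probability p,
    (iii) write the current observation y_t into every discarded slot."""
    M = len(memory)
    chosen = random.sample(range(M), k)
    for i in chosen:
        if random.random() < p:
            memory[i] = y_t
    return memory

def belief(memory, Y):
    """Belief = empirical distribution of the M stored observations."""
    return {y: memory.count(y) / len(memory) for y in Y}

Y = ["l", "r"]
M, k, p = 20, 5, 0.4          # expected replacement weight lam = k*p/M = 0.1
memory = ["l"] * 10 + ["r"] * 10
for y_t in ["l", "r", "l", "l"]:
    memory = kpM_update(memory, y_t, k, p)
print(belief(memory, Y))
```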
Recursive Weighting vs. Limited Memory
The recursive weighting model has a deterministic transition kernel and an infinite state space. The limited-memory model has random transitions and a finite state space. The latter has some advantages in analyzing properties such as universal consistency. In the case where $\rho$ generates i.i.d. values of $y$, the stationary distribution of the recursive weighting model can be extremely complicated and need not have a density even when $y$ is binary. The stationary distribution of the limited-memory model always exists and in some cases is quite simple: In the case in which each observation in the memory is drawn from exactly the same distribution $\phi^*$, the belief each period is simply the empirical distribution of a multinomial sample of $M$ observations drawn from $\phi^*$. (The same is true in the limited-history model of ref. 8, where the agent always forgets the oldest observation in the memory. However, that model does not fit our framework, as it requires the state to keep track of the order in which the observations were acquired.)
We now want to relate the two models. Suppose that $\lambda = kp/M$, so that the expected weight that the limited-memory model gives to the most recent observation is the same as in the recursive model, and initialize the two systems so that both begin with the same prior. Fix any sequence of observations $y_1, y_2, \ldots$ and consider the deterministic sequence $\hat\phi_t$ of beliefs from recursive weighting and the random process $\phi^M_t$ of beliefs from limited memory. Then we have the following:
Theorem 1.
For any fixed $\lambda$, as $M \to \infty$ we have $E\big[\|\phi^M_t - \hat\phi_t\|^2\big] \to 0$, uniformly in $t$ and in the sequence of observations $y_1, y_2, \ldots$.
Remark:
The idea of the proof is that when $M$ is large and $\lambda$ is held fixed, both the average in the limited memory of size $M$ and the weighted average are close to the full sample average in terms of mean square distance.
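The following Monte Carlo sketch (our own, with k = M and p = λ so that λ = kp/M) illustrates the content of Theorem 1: for a fixed sequence of observations, the estimated mean-square distance between the limited-memory belief and the recursively weighted belief shrinks as M grows.

```python
import random

def msd(M, lam, ys, prior_y="l", reps=200):
    """Monte Carlo estimate of E||phi^M_t - phihat_t||^2 for a fixed outcome sequence."""
    p = lam                  # with k = M, each slot is replaced with probability p = lam
    total = 0.0
    for _ in range(reps):
        memory = [prior_y] * M                  # initialize memory at the prior
        phihat = {"l": 1.0, "r": 0.0}           # recursive weights, same point-mass prior
        for y_t in ys:
            for i in range(M):                  # the chosen subset is the whole memory
                if random.random() < p:
                    memory[i] = y_t
            phihat = {y: (1 - lam) * phihat[y] + lam * (1.0 if y == y_t else 0.0)
                      for y in ("l", "r")}
        phiM = {y: memory.count(y) / M for y in ("l", "r")}
        total += sum((phiM[y] - phihat[y]) ** 2 for y in ("l", "r"))
    return total / reps

ys = ["r", "l", "r", "r", "l", "r", "r", "r", "l", "r"]
for M in (10, 100, 1000):
    print(M, msd(M, lam=0.2, ys=ys))   # mean-square distance falls as M grows
```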
Proof:
Fix the sequence , define , and observe that , where . Hence to prove Theorem 1 it is sufficient to prove that .
Now let be the number of observations in updating to period t that are discarded and let be the frequencies in the remaining sample. Note that we can think of this as drawing observations from without replacement, and we arbitrarily define . Because , it is less than some . Simple algebra detailed in SI Appendix shows that
[1]
where the inequality follows from the facts that each of , , and the difference between and is between 0 and 1.
Next we observe that and hence it is enough to prove that each of the expectations on the right-hand side of [1] has square deviation that goes to zero. Examining first , recall that and , so we need only compute the variance of . The variance of is , so the variance of is .
Turning to the second term , observe that and that the variance is bounded above by sampling with replacement, which is at worst . Hence .□
Approximate Universal Consistency of Slightly Weighted Sampling
Ref. 9 showed that when $\lambda = 0$ (no recency at all), any smooth fictitious play is universally consistent, meaning that regardless of the probability law of the outcomes, the agent does at most $\varepsilon$ worse in terms of time-average payoff than if he knew the empirical distribution of play ahead of time. We show that this extends to recency, in the sense that recency adds an additional error term to the lower bound on payoffs that vanishes as recency goes to 0. We continue to let $\hat\phi_t$ denote the beliefs of the weighted sampling scheme and let $\bar\phi_t = \sum_{\tau=1}^{t} (1-\lambda)^{t-\tau} d_\tau \big/ \sum_{\tau=1}^{t} (1-\lambda)^{t-\tau}$ denote the weighted beliefs through and including the observation at time $t$, excluding the prior.
Fix a scale parameter $\zeta > 0$ and a function $\nu$ on probability distributions over actions that maps the interior of the simplex to the real numbers, is bounded, is smooth and strictly differentiably concave, and satisfies the boundary condition that as $\gamma$ approaches the boundary of the simplex the norm of the derivative becomes infinite. Now for $\gamma \in \Delta(A)$ we define $u^{\zeta}(\gamma, \phi) = u(\gamma, \phi) + \zeta \nu(\gamma)$. This perturbation of the utility function serves to induce mixing and allows the approximation of the best-response function by the smoothed best response $BR^{\zeta}(\phi) = \arg\max_{\gamma \in \Delta(A)} u^{\zeta}(\gamma, \phi)$. The function $BR^{\zeta}$ is Lipschitz, as shown in ref. 14, and from the implicit function theorem the Lipschitz constant has the form $B/\zeta$, where $B$ depends only on $\nu$.
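For the particular choice of ν as the entropy function (one admissible perturbation, mentioned again in the proof of Theorem 3), the smoothed best response has the familiar logit form. The sketch below is ours; the utility function and parameter values are illustrative.

```python
import math

def expected_utilities(u, phi, A, Y):
    """u(a, phi) for each pure action a, given belief phi over outcomes."""
    return {a: sum(phi[y] * u(a, y) for y in Y) for a in A}

def smooth_best_response(u, phi, A, Y, zeta):
    """Maximizer of u(gamma, phi) + zeta * entropy(gamma): the logit choice rule."""
    vals = expected_utilities(u, phi, A, Y)
    m = max(vals.values())                         # subtract max for numerical stability
    exps = {a: math.exp((vals[a] - m) / zeta) for a in A}
    z = sum(exps.values())
    return {a: exps[a] / z for a in A}

A, Y = ["L", "R"], ["l", "r"]
u = lambda a, y: 1.0 if (a == "L") == (y == "l") else 0.0
phi = {"l": 0.6, "r": 0.4}
for zeta in (1.0, 0.1, 0.01):
    print(zeta, smooth_best_response(u, phi, A, Y, zeta))  # approaches the best response as zeta -> 0
```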
For an arbitrary decision rule $\sigma$ let $U_t(\sigma) = \sum_{\tau=1}^{t} (1-\lambda)^{t-\tau} u(\sigma(h_{\tau-1}), d_\tau) \big/ \sum_{\tau=1}^{t} (1-\lambda)^{t-\tau}$ be the weighted expected utility received through period $t$, where $d_\tau$ is the distribution that places weight one on $y_\tau$. The notion of universal consistency can be motivated by considering what a procedure that is based only on frequencies can hope to achieve: Obviously if there are deterministic cycles, no procedure based on frequencies can hope to do as well as a procedure that is designed to recognize cycles. To reflect the fact that these procedures are expected to do well only against frequencies, let us assume that the entire sample is known, but restrict the strategy space to playing the same mixed action $\gamma$ each period (that is, not reacting to cycles). In this case we could be maximizing $\sum_{\tau=1}^{t} (1-\lambda)^{t-\tau} u(\gamma, d_\tau)$, and because $u$ is linear in $\gamma$, this is the same as maximizing $u(\gamma, \bar\phi_t)$. Recall that $\bar\phi_t$ is the weighted empirical distribution defined above, let $\bar u_t = \max_{\gamma} u(\gamma, \bar\phi_t)$ be the maximized value, and set $c_t = \bar u_t - U_t(\sigma)$. Then $c_t$ measures how much worse the player did when he used $\sigma$ relative to how well he might have done had he known the sample in advance (but been constrained to time-invariant strategies).
Definition:
A learning rule $\sigma$ is $\varepsilon$-universally consistent with respect to $\lambda$ if, with probability one, $\limsup_{t \to \infty} c_t \le \varepsilon$ regardless of the probability law generating the $y$s.
We provide conditions on $\zeta$ and $\mu$ so that playing a smooth best response to the weighted beliefs is $\varepsilon$-universally consistent with respect to $\lambda$. To see why this terminology makes sense, note that in the case $\lambda = 0$ this is the approximate universal consistency condition in the usual sense, as the weighted empirical distribution reduces to the ordinary empirical distribution. The condition with $\lambda > 0$ is stronger in that it places more weight on the next (unknown) observation than on past observations. Hence, conceptually, the bigger $\lambda$ is, the stronger the notion of universal consistency.
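A sketch (ours) of the weighted regret that the definition is built on, computed for a realized sequence of actions and outcomes with the normalization used above: the weighted payoff actually received is compared with the best weighted payoff achievable by a single fixed action.

```python
def weighted_regret(u, actions, outcomes, A, lam):
    """c_t = max_a weighted average of u(a, y_tau)  -  weighted average of u(a_tau, y_tau),
    with weight (1-lam)^(t-tau) on period tau."""
    t = len(outcomes)
    weights = [(1 - lam) ** (t - tau) for tau in range(1, t + 1)]
    W = sum(weights)
    realized = sum(w * u(a, y) for w, a, y in zip(weights, actions, outcomes)) / W
    best_fixed = max(sum(w * u(a, y) for w, y in zip(weights, outcomes)) / W for a in A)
    return best_fixed - realized

A, Y = ["L", "R"], ["l", "r"]
u = lambda a, y: 1.0 if (a == "L") == (y == "l") else 0.0
actions  = ["L", "L", "R", "L", "R"]
outcomes = ["l", "r", "r", "l", "l"]
print(weighted_regret(u, actions, outcomes, A, lam=0.1))
```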
Theorem 2.
For any ν there exists a constant such that for all utility functions the recursive memory model with parameters is consistent with respect to λ, with .
The proof, which adapts the method used in ref. 14 for the case $\lambda = 0$, can be found in SI Appendix. In light of Theorem 1 this result extends immediately to the finite-memory model, with an additional error term that goes to zero as $M \to \infty$.
Convergence to Strict Nash Equilibrium
We study simultaneous-move games with observable actions, so that the outcome observed by each player is the action profile of the other players. Say that a pure action profile is a $\delta$-strict Nash equilibrium if each player loses at least $\delta$ from deviating to any pure action. Then we can show the following:
Theorem 3.
For any $\varepsilon > 0$ and $\delta > 0$ there exist finite-memory learning procedures that are $\varepsilon$-universally consistent for each player such that if $\varepsilon$ is sufficiently small relative to $\delta$ and the game has a $\delta$-strict Nash equilibrium, then regardless of the initial samples of the individual players, with probability one the learning procedures converge to some strict Nash equilibrium.
Remark:
The coordinated learning procedures of ref. 13 require all players to have the same initial sample and so do not satisfy the conclusion of the theorem.
Proof:
The idea is to start with a finite-memory universally consistent procedure where $\varepsilon$ is very small and modify it so that when the sample shows that each other player assigns probability close to one to a pure strategy, and the smooth best response is to play a pure strategy with very high probability, the player plays that pure strategy with probability one instead of assigning small positive probabilities to other strategies. The modified procedure is also universally consistent, albeit for slightly larger $\varepsilon$, but it allows play to converge as it gets “stuck” at strict Nash equilibria.
Set and (and also smaller than 1/2). Define . For each player choose a $\nu$ such that implies (for example, the entropy function); note that these functions need not be the same for each player. Next choose $\zeta$ sufficiently small that two properties hold. First, . Second, note that as $\zeta \to 0$ the smoothed best response approaches the best response, so in particular the probability of a strict best response goes to 1. Hence we can also choose $\zeta$ small enough that if is any -strict best response, then . Then choose $\mu$ such that . This is universally consistent by Theorem 2. By Theorem 1 we can choose $M$ large enough that and suppose that $k = M$; that is, we potentially discard all observations. Then the procedure replacing the recursive weights with the limited memory is universally consistent, because the payoffs remain within that distance. Now define a modified procedure such that if all of the observations in the memory are identical and the smoothed best response assigns very high probability to some pure action, then that pure action is played with probability one (we call this the stuck state); otherwise, the smoothed best response is played. This procedure is no worse than universally consistent.
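A schematic of the “stuck state” modification, in the same illustrative style as the earlier sketches; the logit smoothing and the threshold value are placeholders of our own, not the constants chosen in the proof.

```python
import math

def modified_play(memory, A, Y, u, zeta, threshold=0.99):
    """Play a pure action with probability one when 'stuck': all stored observations
    are identical and the smoothed (logit) best response puts very high probability
    on a single pure action; otherwise play the smoothed best response itself."""
    M = len(memory)
    phi = {y: memory.count(y) / M for y in Y}            # sample belief
    vals = {a: sum(phi[y] * u(a, y) for y in Y) for a in A}
    m = max(vals.values())
    exps = {a: math.exp((vals[a] - m) / zeta) for a in A}
    z = sum(exps.values())
    gamma = {a: exps[a] / z for a in A}                  # smoothed best response
    a_star, p_star = max(gamma.items(), key=lambda kv: kv[1])
    if len(set(memory)) == 1 and p_star >= threshold:    # the "stuck" state
        return {a: (1.0 if a == a_star else 0.0) for a in A}
    return gamma

A, Y = ["L", "R"], ["l", "r"]
u = lambda a, y: 1.0 if (a == "L") == (y == "l") else 0.0
print(modified_play(["l"] * 20, A, Y, u, zeta=0.05))          # stuck: plays L for sure
print(modified_play(["l"] * 12 + ["r"] * 8, A, Y, u, zeta=0.05))
```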
Suppose that the action profile $a^*$ is strict and that each player's memory contains only observations of the other players' components of $a^*$. Then we see from the construction that such a state is absorbing.
Next suppose that all players are in the stuck state. Observe that all must play a strict best response, because the probability of a nonstrict best response is less than 1/2 and so less than in the original procedure. If the best responses are identical to the samples, we are absorbed in a strict equilibrium. If not, then the sample must change for all but one player, and in particular the next period at least one player is not in the stuck state.
Now suppose that at least one player is not in the stuck state. Then that player plays an action different from his last-period action with a positive probability bounded below by a bound that depends only on $\zeta$ and $\nu$, both of which are fixed, and with positive probability he remains unstuck. Hence the next period there is a positive probability that all players are unstuck. When all players are unstuck, there is a positive probability that they all play their part of the strict equilibrium, and there is a positive probability that all observations in all their samples are replaced with the corresponding observations of this profile, resulting in the absorbing state.
□
Note that we do not assert that all weighted universally consistent learning procedures converge to strict Nash equilibria. Proposition 7 of Benaïm et al. (7) suggests that this is unlikely to be the case in games that have both a stable Shapley cycle and a strict Nash equilibrium.
What about mixed equilibria or, because mixed equilibrium will be difficult to hit with a finite number of states, mixed approximate equilibrium? Ref. 12 gives uncoupled learning algorithms that converge with probability one to a mixed approximate equilibrium, but these procedures are not universally consistent because once in equilibrium play never changes regardless of the data. By contrast, ref. 11’s procedure continues learning and the set of approximate Nash equilibria does have probability near one in the ergodic distribution; we do not know whether their procedure is universally consistent. At this point the issue of what sorts of extensions of Theorem 3 apply to games with only mixed equilibria remains open; we hope to explore it in future work.
Conclusion
We examine two models of learning with recency, recursive weights and limited memory. These are similar in the sense that they generate the same mean beliefs and, with high probability, beliefs that are very close. We show that recursive weights with suitably smoothed best responses are weighted universally consistent and argue that this is a sensible criterion. It follows that limited memory has the same property provided the memory is large enough, so that the grid of possible beliefs is sufficiently fine. This is useful because it lets us produce limited-memory algorithms that are weighted universally consistent and also converge with probability one to strict Nash equilibrium. This is the first example of which we are aware of learning processes that are robust to initial conditions, converge globally to Nash equilibrium, and are also shown to satisfy some criterion of adequate responsiveness to individual incentives.
Acknowledgments
We are grateful to the National Science Foundation (Grants SES-08-51315 and 1258665) for financial support.
Footnotes
The authors declare no conflict of interest.
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “In the Light of Evolution VIII: Darwinian Thinking in the Social Sciences,” held January 10–11, 2014, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. The complete program and audio files of most presentations are available on the NAS website at www.nasonline.org/ILE-Darwinian-Thinking.
This article is a PNAS Direct Submission.
*Bodoh-Creed A, Benjamin D, Rabin M (2014) The dynamics of base rate neglect. Abstract available under “Works In Progress”: faculty.haas.berkeley.edu/acreed. Accessed May 9, 2014.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1400987111/-/DCSupplemental.
References
- 1. Agarwal S, Driscoll JC, Gabaix X, Laibson D. Learning in the Credit Card Market. Cambridge, MA: National Bureau of Economic Research; 2008.
- 2. Fudenberg D, Peysakhovich A. Recency, records, and recaps: The effect of feedback on behavior in a simple decision problem. Proceedings of the 15th ACM Conference on Economics and Computation. 2014 (June 8–12, 2014, Stanford, CA), in press.
- 3. Erev I, Haruvy E. Learning and the economics of small decisions. In: Roth AE, Kagel J, editors. The Handbook of Experimental Economics. Vol 2. Princeton: Princeton Univ Press; 2014. Available at www.utdallas.edu/~eeh017200/papers/LearningChapter.pdf. Accessed May 14, 2014.
- 4. Cheung Y, Friedman D. Individual learning in normal form games: Some laboratory results. Games Econ Behav. 1997;19(1):46–76.
- 5. Sutton R, Barto A. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
- 6. Camerer C, Ho T. Experience-weighted attraction learning in normal form games. Econometrica. 1999;67:827–874.
- 7. Benaïm M, Hofbauer J, Hopkins E. Learning in games with unstable equilibria. J Econ Theory. 2009;144:1694–1709.
- 8. Young HP. The evolution of conventions. Econometrica. 1993;61(1):57–84.
- 9. Fudenberg D, Levine DK. Consistency and cautious fictitious play. J Econ Dyn Control. 1995;19:1065–1089.
- 10. Hannan J. Approximation to Bayes risk in repeated plays. In: Dresher M, Tucker A, Wolfe P, editors. Contributions to the Theory of Games, Vol 3. Princeton: Princeton Univ Press; 1957. pp. 97–139.
- 11. Foster DP, Young HP. Regret testing: Learning to play Nash equilibrium without knowing you have an opponent. Theoretical Economics. 2006;1:341–367.
- 12. Hart S, Mas-Colell A. Simple Adaptive Strategies: From Regret-Matching to Uncoupled Dynamics. Singapore: World Scientific; 2013.
- 13. Kakade SM, Foster DP. Deterministic calibration and Nash equilibrium. J Comput Syst Sci. 2008;74:115–130.
- 14. Fudenberg D, Levine DK. Conditional universal consistency. Games Econ Behav. 1999;29(1-2):104–130.