Optimism and pessimism in optimised replay

Georgy Antonov; Christopher Gagne; Eran Eldar; Peter Dayan

doi:10.1371/journal.pcbi.1009634

. 2022 Jan 12;18(1):e1009634. doi: 10.1371/journal.pcbi.1009634

Optimism and pessimism in optimised replay

Georgy Antonov ^1,^2,^*, Christopher Gagne ¹, Eran Eldar ^3,⁴, Peter Dayan ^1,⁵

Editor: Daniel Bush⁶

PMCID: PMC8809607 PMID: 35020718

Abstract

The replay of task-relevant trajectories is known to contribute to memory consolidation and improved task performance. A wide variety of experimental data show that the content of replayed sequences is highly specific and can be modulated by reward as well as other prominent task variables. However, the rules governing the choice of sequences to be replayed still remain poorly understood. One recent theoretical suggestion is that the prioritization of replay experiences in decision-making problems is based on their effect on the choice of action. We show that this implies that subjects should replay sub-optimal actions that they dysfunctionally choose rather than optimal ones, when, by being forgetful, they experience large amounts of uncertainty in their internal models of the world. We use this to account for recent experimental data demonstrating exactly pessimal replay, fitting model parameters to the individual subjects’ choices.

Author summary

When animals are asleep or restfully awake, populations of neurons in their brains recapitulate activity associated with extended behaviourally-relevant experiences. This process is called replay, and it has been established for a long time in rodents, and very recently in humans, to be important for good performance in decision-making tasks. The specific experiences which are replayed during those epochs follow highly ordered patterns, but the mechanisms which establish their priority are still not fully understood. One promising theoretical suggestion is that each replay experience is chosen in such a way that the learning that ensues is most helpful for the subsequent performance of the animal. A very recent study reported a surprising result that humans who achieved high performance in a planning task tended to replay actions they found to be sub-optimal, and that this was associated with a useful deprecation of those actions in subsequent performance. In this study, we examine the nature of this pessimized form of replay and show that it is exactly appropriate for forgetful agents. We analyse the role of forgetting for replay choices of our model, and verify our predictions using human subject data.

Introduction

During periods of quiet restfulness and sleep, when humans and other animals are not actively engaged in calculating or executing the immediate solutions to tasks, the brain is nevertheless not quiet. Rather it entertains a seething foment of activity. The nature of this activity has been most clearly elucidated in the hippocampus of rodents, since decoding the spatial codes reported by large populations of simultaneously recorded place cells [1, 2] reveals ordered patterns. Rodents apparently re-imagine places and trajectories that they recently visited (’replay’) [3–6], or might visit in the future (’preplay’) [6–13], or are associated with unusually large amounts of reward [14–16]. However, replay is not only associated with the hippocampus; there is also a complex semantic and temporal coupling with dynamical states in the cortex [17–23].

In humans, the patterns of neural engagement during these restful periods have historically been classified in such terms as default mode or task-negative activity [24]. This activity has been of great value in elucidating functional connectivity in the brain [25–27]; however, its information content had for a long time been somewhat obscure. Recently, though, decoding neural signals from magnetoencephalographic (MEG) recordings in specific time periods associated with the solution of carefully designed cognitive tasks, has revealed contentful replay and preplay (for convenience, we will generally refer to both simply as ‘replay’) that bears some resemblance to the rodent recordings [28–32].

The obvious question is what computational roles, if any, are played by these informationally-rich signals. It is known that disrupting replay in rodents leads to deficits in a variety of tasks [19, 33–36], and there are various theoretical ideas about its associated functions. Although the notions are not completely accepted, it has been suggested that the brain uses off-line activity to build forms of inverse models—index extension and maintenance in the context of memory consolidation [37], recognition models in the case of unsupervised learning [38], and off-line planning in the context of decision-making [39, 40].

While appealing, these various suggestions concern replay in general, and have not explained the micro-structure of which pattern is replayed when. One particularly promising idea for this in the area of decision-making, suggested by Mattar and Daw (2018) [41], is that granular choices of replay experiences are optimized for off-line planning. The notion, which marries two venerable suggestions in reinforcement learning (RL) [42]: DYNA [39] and prioritized sweeping [43], is that each replayed experience changes the model-free value of an action in order to maximize the utility of the animal’s ensuing behaviour. It was shown that the resulting optimal choice of experience balances two forces: need, which quantifies the expected frequency with which the state involved in the experience is encountered, and gain, which quantifies the benefit of the change to the behaviour at that state occasioned by replaying the experience. This idea explained a wealth of replay phenomena in rodents.

Applying these optimizing ideas to humans has been hard, since, until recently the micro-structure of replay in humans had not been assessed. Liu et al. (2020) [32] offered one compelling test which well followed Mattar and Daw (2018) [41]. By contrast, in a simple planning task, Eldar et al. (2020) [31] showed an unexpected form of efficacious replay in humans that, on the surface, seemed only partially to align with this theory. In this task, subjects varied in the extent to which their decisions reflected the utilisation of a model of how the task was structured. The more model-based (MB) they were, the more they engaged in replay during inter-trial interval periods, in a way that appeared helpful for their behaviour. Strikingly, though, the replay was apparently pessimized–that is, subjects preferred to replay bad choices, which were then deprecated in the future.

In this paper, we consider this characteristic of replay, examining it from the perspective of optimality. We show that favouring bad choices is in fact appropriate in the face of substantial uncertainty about the transition structure of the environment—a form of uncertainty that arises, for instance, from forgetting. Moreover, we consider the costs and benefits of replay on task performance in the light of subjects’ subjective, and potentially deviant, knowledge of the task. Although, to be concrete, we focus on the task studied by Eldar et al. (2020) [31], the issues we consider are of general importance.

Results

Preamble

In the study of Eldar et al. (2020) [31], human subjects acted in a carefully designed planning task (Fig 1A). Each state was associated with an image that was normally seen by the subjects. The subjects started each trial in a pseudo-random state and were required to choose a move among the 4 possible directions: up, down, left or right (based on the toroidal connectivity shown). In most cases, they were then shown an image associated with the new state according to the chosen move and received a reward associated with that image (this was not displayed; however, the subjects had been extensively taught about the associations between images and reward). Some trials only allowed single moves and others required subjects to make an additional second move which provided a second reward from the final state. In most of these 2-move trials, in order to obtain maximal total reward, the subjects had to select a first move that would have been sub-optimal had it been a 1-move trial.

The subjects were not aware of the spatial arrangement of the state space, and thus had to learn about it by trial and error (which they did in the training phase that preceded the main task, see Methods for details). In order to collect additional data on subjects’ knowledge of the state space, no feedback was provided about which state was reached after performing an action in the first 12 trials of blocks two through five. This feedback was provided in the remaining 42 trials to allow ongoing learning in the face, for instance, of forgetting. After two blocks of trials, the subjects were taught a new set of associations between images and rewards; similarly, before the final block they were informed about a pair of re-arrangements in the transition structure of the state space (involving swapping the locations of two pairs of images).

In order to achieve high performance, the subjects had to a) adjust their choices according to whether the trial allowed 1 or 2 moves; and b) adapt to the introduced changes in the environment. Eldar et al. (2020) [31] calculated an individual flexibility (IF) index from the former adjustment, and showed that this measure correlated significantly with how well the subjects adapted to the changes in the environment (and with their ability to draw the two different state spaces after performing all the trials, as well as to perform 2-move trials in the absence of feedback about the result of the first move). The more flexible subjects therefore presumably utilised a model of the environment to plan and re-evaluate their choices accurately.

Eldar et al. (2020) [31] investigated replay by decoding MEG data to reveal which images (i.e., states) subjects were contemplating during various task epochs. They exploited the so-called sequenceness analysis [28] to show that, in subjects with high IF, the order of contemplation of states in the inter-trial intervals following the outcome of a move revealed the replay of recently visited transitions (as opposed to the less flexible, presumably model-free (MF), subjects); it was notable that the transitions they preferred to replay (as measured by high sequenceness) mostly led to sub-optimal outcomes. Nevertheless, those subjects clearly benefited from what we call pessimized replay, for after the replay of sub-optimal actions they were, correctly, less likely to choose these (Fig 1B) when faced with the same selection of choices later on in the task.

In Eldar et al. (2020) [31], and largely following conventional suggestions [44–47], subjects’ choices were modelled with a hybrid MF/MB algorithm. This fit the data better than algorithms that relied on either pure MF or MB learning strategies. However, the proposed model did not account for replay and therefore could not explain the preference for particular replays or their effect on choice.

Therefore, to gain further insights into the mechanisms that underpinned replay choices of human subjects (as well as their effect on behaviour), we constructed an agent that made purely MF choices, but whose MF values were adjusted by a form of MB replay [41] that was optimal according to the agent’s forgetful model of the task. The agent was therefore able to adapt its decision strategy flexibly by controlling the amount of influence maintained by (subjectively optimal) MB information over MF values. We simulated this agent in the same behavioural task with the free parameters fit to the subjects (Fig 1C), and examined the resulting replay preferences.

Modelling of subjects’ choices in the behavioural task

To model replay in the behavioural task, we used a DYNA-like agent [39] which learns on-line by observing the consequences of its actions, as well as off-line in the inter-trial intervals by means of generative replay (Fig 2A). On-line learning is used to update a set of MF Q-values [48] which determine the agent’s choices through a softmax policy, as well as to (re)learn a model of the environment (i.e., transition probabilities). During off-line periods, the agent uses its model of the transition structure of the environment to estimate Q-values (denoted as ${\hat{Q}}^{M B}$ ), and then evaluates these MB estimates for the potential improvements to MF policy that the agent uses to make decisions. This process of evaluation and improvement is the key difference between our model and that of Eldar et al. (2020) [31]: instead of having MB quantities affecting the agent’s choices directly, they only did so by informing optimized replay [41] that provided additional training for the MF values.

Fig 2 — (A) Schematic illustration of the algorithm in the behavioural task. Upon completing each trial, the algorithm uses its knowledge of the transition structure of the environment to replay the possible outcomes. Note that in 1-move trials the algorithm replays only single moves, while in 2-move trials it considers both single and coupled moves (thus optimizing this choice). (B) Effect of MF forgetting and replay on MF Q-values. After acting and learning on-line towards true reward R (white blocks; controlled by learning rate, η), the algorithm learns off-line by means of replay (green blocks). Immediately after each replay bout, the algorithm forgets its MF Q-values towards the average reward experienced from the beginning of the task (red blocks; controlled by MF forgetting rate, ϕ^MF). Note that after trials 5 and 6, the agent chooses to replay the objectively optimal action, whereas after trials 8 and 9 it replays the objectively sub-optimal action. (C) Left: without MB forgetting, the algorithm’s estimate of reward obtained for a given move corresponds to the true reward function. Right: with MB forgetting (controlled by MB forgetting rate, ϕ^MB), the algorithm’s estimate of reward becomes an expectation of the reward function under its state-transition model. The state-transition model’s probabilities for the transitions are shown as translucent lines. (D) Steady-state performance (proportion of available reward obtained) of the algorithm in the behavioural task as a function of MF forgetting, ϕ^MF, and MB forgetting, ϕ^MB. Note how the agent still achieves high performance with substantial MF forgetting (high ϕ^MF) when its state-transition model accurately represents the transition probabilities (low ϕ^MB).

Unlike a typical DYNA agent, or indeed the suggestion from Mattar and Daw (2018) [41], our algorithm performs a full model evaluation, and therefore MF Q-values are updated in a supervised manner towards the model-generated ${\hat{Q}}^{M B}$ values. It is crucial to note that subjects were found by Eldar et al. (2020) [31] (and also in our own model fitting) to be notably forgetful. Therefore, although the task is deterministic, replay can help and hurt, since the seemingly omniscient MB updates may in fact be useless, or even worse, harmful. To avoid extensive training of MF values, in the light of the potential harm an inaccurate MB system can accomplish, the agent thus only engages in replay as long as the potential MB updates are estimated to be sufficiently gainful (see below); this is controlled by a replay gain threshold, which was a free parameter that we fit to each individual subject.

Similarly to the best-fitting model in Eldar et al. (2020) [31], after experiencing a move, our agent forgets both MF Q-values and the state-transition model (for all allowable transitions). However, unlike that account, over the course of forgetting, MF Q-values tend towards the average reward the agent has experienced since the beginning of the task (Fig 2B), as opposed to tending towards what was a fixed subject-specific parameter. Insofar as the agent improves over the course of the task, the average reward it obtains increases with each trial. MF Q-values for sub-optimal actions, therefore, tend to rise towards this average experienced reward; MF Q-values for optimal actions, on the other hand, become devalued, as the agent is prone occasionally to choose sub-optimal actions due to its non-deterministic policy. In other words, because of MF forgetting, the agent forgets how good the optimal actions are and how bad the sub-optimal actions are. Similarly, because of MB forgetting, the agent gradually forgets what is specific about particular transitions, progressively assuming a uniform distribution over the potential states to which it can transition (Fig 2C). The agent therefore becomes uncertain over time about the consequences of actions it rarely experiences.

To disambiguate the contribution of each individual component of our agent to the resulting behaviour, in Fig 3 we compare agents of varying complexity in a simple task involving only two actions (see Methods for details). First, note how agents with optimised replay significantly outperform their non-replaying counterparts (Fig 3A), hence suggesting that replay can indeed improve performance. Fig 3B shows the effects of replay and forgetting on the evolution of MF Q-values over the course of learning. Without forgetting (Fig 3B; left), MF Q-values quickly converge and remain stable, thus obviating the need for replay (Fig 3C; left) for all but the very first few trials (such an unforgetful agent precisely implements optimised replay as suggested by Mattar and Daw (2018) [41]). By contrast, given MF forgetting, continual learning (implemented, for instance, via additional replay-based off-line training) is required to maintain sound performance (Fig 3A). Importantly, as we further detail below, the extent to which replay can help ameliorate MF forgetting is itself limited by the amount of MB forgetting in the agent’s model of the task (notice in Fig 3B that the agent with MB forgetting underestimates MF Q-value for the objectively optimal action when compared to the agent without MB forgetting).

The two forgetting mechanisms therefore significantly influence the agent’s behaviour—MF forgetting effectively decreases the value of each state by infusing the agent’s policy with randomness, whereas MB forgetting confuses the agent with respect to the individual action outcomes. From an optimality perspective, the question is then what and if the agent should replay at all, given the imperfect knowledge of the world and a forgetful MF policy. We find that at high MF forgetting, replay confers a noticeable performance advantage to the agent provided that MB forgetting is mild (as can be seen from the curvature of the contour lines in Fig 2D). This means that, by engaging in replay, the agent is able to (on average) improve its MF policy and increase the obtained reward rate as long as there is little uncertainty about the transition structure.

We then analysed the replay choices of human participants which, according to our model prediction (with the free parameters of our agent fit to data from individual subjects), engaged in replay. This revealed a significant preference to replay actions that led to sub-optimal outcomes (Fig 1D). We therefore considered the parameter regimes in our model that led the agent to make such pessimal choices, and whether the subjects’ apparent preference to replay sub-optimal actions was formally beneficial for improving their policies.

Exploration of parameter regimes

The analysis in Mattar and Daw (2018) [41] suggests that two critical factors, need and gain, should jointly determine the ordering of replay by which an (in their case, accurate) MB system should train an MF controller. Need quantifies the extent to which a state is expected to be visited according to the agent’s policy and transition dynamics of the environment. It is closely related to the successor representation across possible start states, which is itself a prediction of discounted future state occupancies [49]. Heterogeneity in need would come from biases in the initial states on each trial (5 of the 8 were more common; but which 5 changed after blocks 2 and 4) and the contribution of subject’s preferences for the first move on 2-move trials. However, we expected the heterogeneity itself to be modest and potentially hard for subjects to track, and so made the approximation that need was the same for all states.

Gain quantifies the expected local benefit at a state from the change to the policy that would be engendered by a replay. Importantly, gain only accrues when the behavioural policy changes. Thus, one reason that the replay of sub-optimal actions is favoured is that replays that strengthen an already apparently optimal action will not be considered very gainful—so to the extent that the agent already chooses the best action, it will have little reason to replay it (such as in Fig 3C and 3D; left). A second reason comes from considering why continuing learning is necessary in this context anyhow—i.e., forgetting of the MF Q-values. Since the agent learns on-line as well as off-line, it will have more opportunities to learn about optimal actions without replay. Conversely, the values of sub-optimal actions are forgotten without this compensation, and so can potentially benefit from off-line replay (Fig 3C and 3D; middle and right).

We follow Mattar and Daw (2018) [41] in assuming that the agent computes gain optimally. However, this optimal computation is conducted on the basis of the agent’s subjective model of the task, which will be imperfect given forgetfulness. Thus, a choice to replay a particular transition might seem to be suitably gainful, and so selected by the agent, but would actually be deleterious, damaging the agent’s ability to collect reward. In this section, we explore this tension.

In Fig 4A, we show the estimated gain for an objectively optimal and sub-optimal action in an example simulation with only two actions available to the agent (see Methods for details). The bar plot in Fig 4B illustrates the ‘centering’ effect MF forgetting has on sub-optimal MF Q-values (which is also evident for forgetful agents in Fig 3B); this effect, however, is not symmetric. As the agent learns the optimal policy, the average reward it experiences becomes increasingly similar to the average reward obtainable from optimal actions in the environment. As a result, there is little room for MF Q-values for optimal actions to change through forgetting; on the other hand, MF Q-values for sub-optimal actions are forgotten towards what is near to the value of optimal actions to a much more substantial degree. Thus forgetting towards the average reward the agent experiences (provided that it accumulates with learning) is optimistic in a way that favours replay of sub-optimal actions.

Objectively sub-optimal actions that our agent replays therefore mostly have negative temporal difference (TD) errors. Such pessimized replay reminds the agent that sub-optimal actions, according to its model, are actually worse than predicted by the current MF policy.

There is an additional, subtle, aspect of forgetting that decreases both the objective and subjective benefit of replay of what are objectively optimal actions (provided the agent correctly estimates such actions to be optimal). For the agent to benefit from this replay, its state-transition model must be appropriately accurate as to generate MB Q-values for optimal actions that are better than their current MF Q-values. Forgetting in the model of the effects of actions (i.e., transition probabilities) will tend to homogenize the expected values of the actions—and this exerts a particular toll on the actions that are in fact the best. Conversely, forgetting of the effects of sub-optimal actions if anything makes them more prone to be replayed. This is because any sub-optimal action considered for replay will have positive estimated gain as long as MB Q-value for that action (as estimated by the agent’s state-transition model) is less than its current MF Q-value. Due to MF forgetting, the agent’s MF Q-values for sub-optimal actions rise above the actual average reward of the environment, and therefore, even if the state-transition model is uniform—which is the limit of complete forgetting of the transition matrix, the agent is still able to generate MB Q-values for sub-optimal actions that are less than their current MF Q-values, and use those in replay to improve the MF policy.

To examine these, we quantified uncertainty in the agent’s state-transition model, for every state s and action m considered for replay, using the standard Shannon entropy [50]:

\begin{matrix} H (s, m) = \underset{s^{'} \sim T (s^{'} ∣ m, s)}{E} [- {log}_{2} T (s^{'} ∣ m, s)] \end{matrix}

(1)

over the potential states to which the agent could transition. Similarly, the agent’s uncertainty about 2-move transitions considered for replay was computed as the joint entropy of the two transitions (Eq 23). For any state s and action m we will henceforth refer to $H (s, m)$ as the action entropy (importantly, it is not the overall model entropy since the agent can be more or less uncertain about particular transitions). If an action that is considered for replay has high action entropy, the estimated MB Q-value of that action is corrupted by the possibility of transitioning into multiple states (Fig 5A, left); in fact, maximal action entropy corresponds to a uniform policy. For an action with low action entropy, on the other hand, the agent is able to estimate the MB Q-value of that action more faithfully (Fig 5A, right).

Thus, we examined how the estimated gain of objectively optimal and sub-optimal actions is determined by action entropy (Fig 5B). Indeed, we found that the estimated gain for optimal actions was positive only at low action entropy values (Fig 5B, right; see also S1 Fig), hence confirming that uncertainty in the agent’s state-transition model significantly limited its ability to benefit from the replay of optimal actions. By contrast, the estimated gain for sub-optimal actions was positive for a wider range of action entropy values, compared to that for optimal actions (Fig 5B, left; see also S1 Fig).

Of course, optimal actions are performed more frequently, thus occasioning more learning. This implies that the state-transition model will be more accurate for the effects of those actions than for sub-optimal actions, which will partly counteract the effect evident in Fig 5. On the other hand, as discussed before, the estimated gain for optimal actions will be, in general, lower, precisely because of on-line learning as a result of the agent frequenting such actions.

An additional relevant observation is that action entropy (due to MB forgetting), together with MF forgetting, determines the balance between MF and MB strategies to which the agent apparently resorts. This can be seen from the increasing x-intercept as a function of the strength of MF forgetting for optimal actions (vertical dashed lines in Fig 5B, right). In the case of weak MF forgetting, the agent can only benefit from the replay of optimal actions inasmuch as the state-transition model is sufficiently accurate—which requires action entropy values to be low. As MF forgetting becomes more severe, the agent can think itself to benefit from replay extracted from a worse state-transition model. Thus, we identify a range of parameter regimes which can lead agents to find replay subjectively beneficial and, therefore, allocate more influence to the MB system.

Fitting actual subjects

We fit the free parameters of our model to the individual subject choices from the study of Eldar et al. (2020) [31], striving to keep as close as possible to the experimental conditions, for instance by treating the algorithm’s adaptations to the image-reward association changes and the spatial re-arrangement in the same way as in the original study (see Methods for details). First, we examined whether our model correctly captured the varying degree of decision flexibility that was observed across subjects. Indeed, we found that the simulated IF values, as predicted by our agent with subject-tailored parameters, correlated significantly with the behavioural IF values (S5A Fig). Moreover, our agent predicted that some subjects would be engaging in potentially measurable replay (an average of more than 0.3 replays per trial, n = 20), and hence use MB knowledge when instructing their MF policies to make decisions. Indeed, we found that our agent predicted those subjects to be significantly less ignorant about the transition probabilities at the end of the training trials (2-sample 2-tailed t-test, t = −2.56, p = 0.014; S8 Fig), thus indicating the accumulation of MB knowledge. Further, those same subjects had significantly higher simulated IF values relative to the subjects for whom the agent did not predict sufficient replay—which is in line with the observation of Eldar et al. (2020) [31] that subjects with higher IF had higher MEG sequenceness following ‘surprising’ (measured by individually-fit state prediction errors) trial outcomes.

We then examined the replay patterns of those subjects, which we refer to as model-informed (MI) subjects, when modelled by our agent. There is an important technical difficulty in doing this exactly: our agent was modelled in an on-policy manner—i.e., making choices and performing replays based on its subjective gain, which, because of stochasticity, might not emulate those of the subjects, even if our model exactly captured the mechanisms governing choice and replay in the subjects. However, we can still hope for general illumination from the agent’s behaviour.

In Fig 6A, we show an example move in a 1-move trial in the final block of the task which was predicted by our agent with parameters fit to an MI subject. This example is useful for demonstrating how the agent’s forgetful state of knowledge (as discussed in the previous section) led it to prioritise the replay of certain experiences, and whether those choices were objectively optimal.

In this particular example, the agent chose a sub-optimal move. The agent’s state of knowledge of 1-move transitions before and after executing the same move is shown in Fig 6B and 6C. Note how the agent is less ignorant, on average, about the outcomes of 1-move optimal actions, as opposed to sub-optimal ones (Fig 6B). The agent chose a sub-optimal move because the MF Q-value for that move had been forgotten to an extent that the agent’s subjective knowledge incorrectly indicated it to be optimal (Fig 6C). After learning online that the chosen move was worse than predicted (Fig 6C, compare green and blue bars), the agent replayed that action at the end of the trial to incorporate the estimated MB Q-value for that action, as generated by the agent’s state-transition model (Fig 6C, pink bar), into its MF policy (Fig 6D). In this case, the MB knowledge that the agent decided to incorporate into its MF policy was more accurate as regards the true Q-values; such replay, therefore, made the agent less likely to re-choose the same sub-optimal move in the following trials. Therefore, in this example, our agent predicted that the subject engaged in significant replay of the recent sub-optimal single move experience.

As just discussed, in addition to replaying the just-experienced transition, the agent also engaged in the replay of other transitions. Indeed, the agent was not restricted in which transitions to replay—it chose to replay actions based solely on whether the magnitude of the estimated gain of each possible action is greater than a subject-specific gain threshold (see Methods for details). We were, therefore, able to see a much richer picture of the replay of all allowable transitions, in addition to the just-experienced ones. In the given example, it is easy to see how the agent estimated its replay choices to be gainful: because of the relatively strong MF forgetting and a sufficiently accurate state-transition model. To demonstrate a slightly different parameter regime, we also additionally show an example move predicted by our agent for another MI subject (S4 Fig). In that case, the agent’s MF and MB policies were more accurate; however, our parameter estimates indicated that the subject’s gain threshold for initiating replay was set lower, and hence the agent still engaged in replay, even though the gain it estimated was apparently minute.

To quantify the objective benefit of replay for the subject shown in Fig 6, we examined how the agent’s objective value function (relative to the true obtainable reward) at each state changed due to the replay at the end of this trial, which is a direct measure of the change in the expected reward the agent can obtain. We found that for each state where the replay occurred, the objective value function of that state increased (Fig 6E). To see whether such value function improvements held across the entire session, we examined the average trial-wise statistics for this subject (Fig 6F). We found that, on average, the replay after each trial (both 1-move and 2-move) improved the agent’s objective value function by 0.08 reward points (1-sample 2-tailed t-test, t = 6.34, p ≪ 0.0001).

We next looked at the average trial-wise replay statistics of the subject as predicted by our agent (Fig 6G and 6H). If we consider solely the just-experienced transitions (Fig 6G), we found that there were significantly more replays of sub-optimal actions per trial (Wilcoxon rank-sum test, W = 4.71, p = 2.51 ⋅ 10⁻⁶). Moreover, the replay of sub-optimal single actions was at significantly higher action entropy values (Wilcoxon rank-sum test, W = 5.58, p ≪ 0.0001), which is what one would expect given that the transitions that led to optimal actions were experienced more frequently. Since some optimal 1-move actions corresponded to sub-optimal first moves in 2-move trials, the agent received an additional on-line training about the latter transitions, and we therefore found no significant difference in action entropy values at which coupled optimal and sub-optimal actions were replayed (Wilcoxon rank-sum test, W = −0.11, p = 0.91).

In addition to the just-experienced transitions, we also separately analysed the replay of all other transitions (Fig 6H); this revealed a much broader picture, but we observed the same tendency for sub-optimal actions to be replayed more (Wilcoxon rank-sum test, W = 2.88, p = 0.004). In this case, we found that both 1-move (Wilcoxon rank-sum test, W = 8.31, p ≪ 0.0001) and 2-move (Wilcoxon rank-sum test, W = 24.7, p ≪ 0.0001) sub-optimal actions were replayed at higher action entropy values than the corresponding optimal actions. We also compared the replays of recent and ‘other’ transitions and their entropy values. We found that both optimal other (Wilcoxon rank-sum test, W = 19.5, p ≪ 0.0001) and sub-optimal other (Wilcoxon rank-sum test, W = 12.6, p ≪ 0.0001) transitions were replayed more than the recent ones. All 1-move and 2-move optimal and sub-optimal other transitions were replayed at significantly higher action entropy values than the corresponding recent transitions (1-move optimal recent vs 1-move optimal other, Wilcoxon rank-sum test, W = 9.25, p = ≪0.0001; 1-move sub-optimal recent vs 1-move sub-optimal other, Wilcoxon rank-sum test, W = 14.7, p ≪ 0.0001, 2-move optimal recent vs 2-move optimal other, Wilcoxon rank-sum test, W = 3.36, p = 7.77 ⋅ 10⁻⁴; 2-move sub-optimal recent vs 2-move sub-optimal other, Wilcoxon rank-sum test, W = 8.47, p ≪ 0.0001). The latter observation could potentially explain why these ‘distal’ replays of other transitions were not detectable in the study of Eldar et al. (2020) [31], since the content of highly entropic on-task replay events may have been improbable to identify with classifiers trained on data obtained during the pre-task stimulus exposure (while subjects were contemplating the images with certitude).

Overall, we found that the pattern of the replay choices selected by our agent with parameters fit to the data for the subject shown in Fig 6 was consistent with the observations reported by Eldar et al. (2020) [31]. Furthermore, this consistency was true across all subjects with significant replay (Fig 7). Each MI subject, on average, preferentially replayed sub-optimal actions (just-experienced transitions, Wilcoxon rank-sum test, W = 2.48, p = 0.012; other transitions, Wilcoxon rank-sum test, W = 2.33, p = 0.020). As for the subject in Fig 6, we assessed whether the action entropy associated with the just-experienced transition (be it optimal or sub-optimal) when it was replayed was lower than the action entropy associated with other optimal and sub-optimal actions when they were replayed. Overall, we found the same trend across all MI subjects but for all combinations: i.e., 1-move transitions (just-experienced 1-move optimal vs other 1-move optimal, Wilcoxon rank-sum test, W = 11.3, p ≪ 0.0001; just-experienced 1-move sub-optimal vs other 1-move sub-optimal, Wilcoxon rank-sum test, W = 11.8, p ≪ 0.0001) and 2-move transitions (just-experienced 2-move optimal vs other 2-move optimal, Wilcoxon rank-sum test, W = 14.2, p ≪ 0.0001; just-experienced 2-move sub-optimal vs other 2-move sub-optimal, Wilcoxon rank-sum test, W = 24.7, p ≪ 0.0001). Furthermore, our agent predicted that, in each trial, MI human subjects increased their (objective) average value function by 0.323 reward points (1-sample 2-tailed t-test, t = 3.47, p = 0.003, Fig 7C) as a result of replay. As a final measure of the replay benefit, we quantified the average trial-wise increase in the probability of choosing an optimal action as a result of replay (Fig 7D). This was done to provide a direct measure of the effect of replay on the subjects’ policy, as opposed to the proxy reported by Eldar et al. (2020) [31] in terms of the visitation frequency (shown in Fig 1B). On average, our modelling showed that replay increased the probability of choosing an optimal action in MI subjects by 5.11% (1-sample 2-tailed t-test, t = 3.93, p = 0.0008), which is very similar to the numbers reported by Eldar et al. (2020) [31].

Clearly, our modelling predicted quite a diverse extent to which MI subjects objectively benefited (or even hurt themselves) by replay (Fig 7C and 7D). This is because the subjects had to rely on their forgetful and imperfect state-transition models to estimate MB Q-values for each action, and as a result of MB forgetting, some subjects occasionally mis-estimated some sub-optimal actions to be optimal—and thus their average objective replay benefit was more modest (or even negative in the most extreme cases). Indeed, upon closer examination of our best-fitting parameter estimates for each MI subject, we noted that subjects who, on average, hurt themselves by replay (n = 2) had very high MB forgetting rates (S3D Fig), such that their forgetting-degraded models further exacerbated their knowledge of the degree to which each action was rewarding (the example subject shown in S5 Fig has this characteristic). Moreover, we found a significant anticorrelation of MB forgetting parameter and IF in MI subjects (multiple regression, β = −0.37, t = −2.73, p = 0.016; S3B Fig), which suggests that MI subjects with a more accurate state-transition model achieved higher decision flexibility by preferentially engaging in pessimized replay following each trial outcome.

We did not find any significant linear relationship between the gain threshold and objective replay benefit (multiple regression, β = 0.24, t = 1.86, p = 0.083; S3F Fig). This suggests that the effect is either non-linear or highly dependent on the current state of knowledge of the agent (due to the on-policy nature of replay in our algorithm), and thus higher MB forgetting rate could not have been ameliorated by simply having a stricter gain threshold. Moreover, we note that extensive training of MF values by a forgetful MB policy (as a result of setting the gain threshold too low) can be dangerous. For instance, our modelling predicted the example subject in S5 Fig to be close to complete ignorance about the transition structure of the task (S5B Fig), yet the subject, as our agent predicted, still chose to engage in replay which significantly hurt his performance (S5F Fig).

The apparent exuberance of distal replays that we discovered suggested that our model fit predicted the subjects to be notably more forgetful than found by Eldar et al. (2020) [31]. We thus compared our best-fitting MB and MF forgetting parameter estimates to those reported by Eldar et al. (2020) [31] for the best-fitting model. We found that our model predicted most subjects to be subject to far less MB forgetting (S2B Fig). On the other hand, our MF forgetting parameter estimates implied that the subjects remembered their MF Q-values significantly less well (S2A Fig). We (partially) attribute these differences to optimal replay: in Eldar et al. (2020) [31] the state-transition model explicitly affected every choice, whereas in our model, the MF (decision) policy was only informed by MB quantities so long as this influence was estimated to be gainful. Thus, the MF policy could be more forgetful, and yet be corrected by subjectively optimal MB information; this is also supported by the significant correlation of MF forgetting and objective replay benefit (multiple regression, β = 0.81, t = 9.33, p ≪ 0.0001; S3F Fig). Moreover, due to substantially larger MF forgetting predicted by our modelling, we predicted that the subjects would have been more ‘surprised’ by how rewarding each outcome was. The combination of this surprise with the (albeit reduced) suprise about the transition probabilities, arising from the (lower) MB forgetting, resulted in surplus distal replays.

Discussion

In summary, we studied the consequences of forgetting in a DYNA-like agent [39] with optimised replay [41]. The agent uses on-line experiences to train both its model-free policy (by learning about the rewards associated with each action) and its model-based system (by (re)learning the transition probabilities). It uses off-line experience to allow MF values to be trained by the MB system in a supervised manner. Behaviour is ultimately controlled exclusively by the MF system. The progressive inaccuracy of MF values (as a result of forgetting) can be ameliorated by MB replay, but only if the MB system has itself not become too inaccurate.

In particular, we showed that the structure of forgetting could favour the replay of sub-optimal rather than optimal actions. This arose through the interaction of several factors. One is that MF values relax to the mean experienced reward (meaning that sub-optimal actions come to look better, and optimal actions look worse than they actually are)–this can lead to the progressive choice of sub-optimal actions, and thus gain in suppressing them. A second is that optimal actions generally enjoy greater updating from actual on-line experience, since they are chosen more frequently. A third is that MB values relax towards the mean of all the rewards in the environment (because the transition probabilities relax towards uniformity), which is pessimistic relative to the experienced rewards. This makes the model relatively worse at elevating optimal actions in the MF system. We showed that the Shannon entropy of the transition distribution was a useful indicator of the status of forgetting.

We found that replay can both help and hurt, from an objective perspective—with the latter occurring when MB forgetting is too severe. This shows that replay can be dangerous when subjects lack meta-cognitive monitoring insight to be able to question the veracity of their model and thus the benefit of using it (via subjective gain estimates). This result has implications for sub-optimal replay as a potential computational marker of mental dysfunction [51, 52].

We studied these phenomena in the task of Eldar et al. (2020) [31], where it was initially identified. Forgetting had already been identified by Eldar et al. (2020) [31] (in fact, a number of previous studies had also noted the beneficial effect of value forgetting on model fits to behavioural data [53–55]); however, it played a more significant role in our model, because we closed the circle of having MB replay affect MF values and thus behaviour. By not doing this, forgetting may have been artificially downplayed in the original report (S2A and S2B Fig).

Additionally, we note that our forgetting mechanism first arose in the literature on computational and behavioural neuroscience; average reward rate, for instance, has been found to be an important facet of behaviour—being used as the opportunity cost of time in accounts of controlled vigour [56]. Unfortunately, the task is not very well adapted to address broader psychological questions of forgetting [57], since opportunities for decay and prospective and retrospective interference all abound in the trial-type sequences employed. Furthermore, longer-term consolidation was not a focus.

We fit the free parameters of our agent to the behavioural data from individual subjects, correctly capturing their behavioural flexibility (IF) (S3A and S3B Fig), and providing a mechanistic explanation for the replay choice preferences of those employing a hybrid MB/MF strategy. Our fits suggested that these MI subjects underwent significantly more MF forgetting than those whose behavior was more purely MF (since the latter could only rely on the MF system, see S3C Fig).

Our study has some limitations. First, we shared Mattar and Daw (2018)’s [41] assumption that the subjects could compute gain correctly in their resulting incorrect models of the task. How gain might actually be estimated and whether there might be systematic errors in the calculations involved are not clear. However, we note that spiking neural networks have recently proven particularly promising for the study of biologically plausible mechanisms that support inferential planning [58, 59], and they could therefore provide insights into the underlying computations that prioritise the replay of certain experiences.

Second, our model did not account for the need term [41] that was theorised to be another crucial factor for the replay choices. This was unnecessary for the current task. Need is closely related to the successor representation (SR) of sequential transitions [49] inasmuch as it predicts how often one expects each state to be visited given the current policy. Need was shown to mostly influence forward replay at the very outset of a trial as part of planning, something that is only just starting to be detectable using MEG [60].

The SR also has other uses in the context of planning, and is an important intermediate representation for various RL tasks. Furthermore, it has a similar computational structure to MF Q-values—in fact, it can be acquired through a form of MF Q learning with a particular collection of reward functions [49]. Thus, it is an intriguing possibility, suggested, for instance, in the SR-DYNA algorithm of Russek et al. (2017) [61] or by the provocative experiment of Carey et al. (2019) [62] that there may be forms of MB replay that are directed at maintaining the integrity and fidelity of the SR in the face of forgetting and environmental change. Indeed, grid cells in the superficial layers of entorhinal cortex were shown to engage in replay independently of the hippocampus, and thus could be a potential candidate for SR-only replay [63, 64]. More generally, other forms of off-line consolidation could be involved in tuning and nurturing cognitive maps of the environment, leading to spatially-coherent replay [65].

Third, Eldar et al. (2020) [31] also looked at replay during the 2-minute rest period that preceded each block, finding an anticorrelation between IF and replay. The most interesting such periods preceded blocks 3 and 5, after the changes to the reward and transition models. Before block 3, MF subjects replayed transitions that they had experienced in block 2 and preplayed transitions they were about to choose in block 3; MI subjects, by contrast, showed no such bias. Although we did show that even if we set the MF Q-values to zero before allowing the agent to engage in replay prior to starting blocks 3 and 5 (S7 Fig), we still recover our main conclusions, the fact that we do not know what sort of replay might be happening during retraining meant that we left modelling this rest-based replay to future work.

Finally, for consistency, we borrowed some of the assumptions made in the study of Eldar et al. (2020) [31]. In particular, we did not not consider other sources of uncertainty which could have arisen from, for instance, forgetting of image-reward associations (since subjects were extensively pre-trained and, furthermore, Eldar et al. (2020) [31] validated an explicit recall of those associations in all participants upon completion of the task). Nevertheless, we note that such a mechanism could have impacted the observed pessimized pattern of replay. Pessimistic replay would prevail so long as the stimulus-specific rewards are forgotten towards a value which is less than the average experienced reward (with our form of MF forgetting).

To summarize, in this work, we showed both in simulations and by fitting human data in a simple planning task that pessimized replay can have distinctly beneficial effects. We also showed the delicacy of the balance of and interaction between MB and MF systems when they are forgetful—something that will be of particular importance in more sophisticated and non-deterministic environments that involve partial observability and large state spaces [66, 67].

Methods

Task design

We simulated the agent in the same task environment and trial and reward structure as Eldar et al. (2020) [31]. The task space contained 8 states arranged in a 2 × 4 torus, with each state associated with a certain number of reward points. From each state, 4 actions were available to the agent—up, down, left, and right. If the state the agent transitioned to was revealed (trials with feedback), the agent was awarded reward points associated with the state to which it transitioned.

The agent started each trial in the same location as the corresponding human subject, which was originally determined in a pseudo-random fashion. The agent then chose among the available moves. Based on the chosen move, the agent transitioned to a new state and received the reward associated with that state. In 1-move trials, this reward signified the end of the trial, whereas in 2-move trials the agent then had to choose a second action either with or without feedback as to which state the first action had led. In addition, transitions back to the start state in 2-move trials were not allowed. Importantly, the design of the task ensured that optimal first moves in 2-move trials were usually different from optimal moves in 1-move trials.

As for each of the human subjects, the simulation consisted of 5 task blocks that were preceded by 6 training blocks. Each training block consisted of twelve 1-move trials from 1 of 2 possible locations. The final training block contained 48 trials where the agent started in any of the 8 possible locations. In the main task, each block comprised 3 epochs, each containing six 1-move trials followed by twelve 2-move trials, therefore giving in total 54 trials per block. Every 6 consecutive trials the agent started in a different location except for the first 24 2-move trials of the first block, in which Eldar et al. (2020) [31] repeated each starting location for two consecutive trials to promote learning of coupled moves. eginning with block 2 the agent did not receive any reward or transition feedback for the first 12 trials of each block. After block 2 the agent was instructed about changes in the reward associated with each state. Similarly, before starting block 5 the agent was informed about a rearrangement of the states in the torus. Eldar et al. (2020) [31] specified this such that the optimal first move in 15 out of consecutive 16 trials became different.

DYNA-like agent

Free parameters

The potential free parameters of our model are listed in Table 1.

Table 1. Free parameters of the algorithm.

η ^MF1	learning rate in 1-move trials and second moves in 2-move trials
η ^MF2	learning rate in first moves in 2-move trials
$η_{r}^{M F 1}$	learning rate for replay in 1-move trials and second moves in 2-move trials
$η_{r}^{M F 2}$	learning rate for replay in first moves in 2-move trials
$β_{1}^{M F 1}$	inverse temperature in 1-move trials
$β_{2}^{M F 1}$	inverse temperature for second moves in 2-move trials
$β_{2}^{M F 2}$	inverse temperature for first moves in 2-move trials
θ	initialisation mean for Q^MF values
γ _m	fixed bias for each action, subject to ∑_m γ_a = 0
η ^MB	state-transition model learning rate
ρ	fraction of the state-transition model learning rate that is used for updating opposite transitions
ϕ ^MB	state-transition model forgetting
ϕ′^MB	state-transition model forgetting upon spatial re-arrangement
ω	state-transition model re-arrangement success
ϕ ^MF	Q^MF values forgetting
ϕ′^MF	Q^MF values forgetting upon spatial re-arrangement
ξ	gain threshold for initiating replay

Open in a new tab

Choices

The agent has the same model-free mechanism for choice as designed by Eldar et al. (2020) [31]. However, unlike Eldar et al. (2020) [31], choices are determined exclusively by the MF system; the only role the MB system plays is via replay, updating the MF values.

The model-free system involves two sets of state-action or Q^MF values [48]: Q^MF1 for single moves (the only move in 1-move trials, and both first and second moves in 2-move trials), and Q^MF2 for coupled moves (2-move sequences) in 2-move trials.

In 1-move trials, starting from state s_t,1, the agent chooses actions according to the softmax policy:

\begin{matrix} π (m_{t, 1} = m ∣ s_{t, 1}) \propto e^{γ_{m} + β_{1}^{M F 1} Q_{t}^{M F 1} (s_{t, 1}, m)} \end{matrix}

(2)

where m ∈ {up, down, left, right}. In 2-move trials, the first move is chosen based on the combination of both sorts of Q^MF values:

\begin{matrix} π (m_{t, 1} = m ∣ s_{t, 1}) \propto e^{γ_{m} + β_{2}^{M F 1} Q_{t}^{M F 1} (s_{t, 1}, m) + β_{2}^{M F 2} Q_{t}^{M F 2} (s_{t, 1}, m)} \end{matrix}

(3)

where the individual Q^MF1 value is weighted by a different inverse temperature or strength parameter $β_{2}^{M F 1}$ , and the coupled move $Q_{t}^{M F 2} (s_{t, 1}, m)$ value comes from considering all possible second moves m_t,2, weighted by their probabilities:

\begin{matrix} Q_{t}^{M F 2} (s_{t, 1}, m) = \sum_{m^{'}} π (m_{t, 2} = m^{'} ∣ s_{t, 1}, m_{t, 1}) Q_{t}^{M F 2} (s_{t, 1}, m, m^{'}) \end{matrix}

(4)

The two-argument Q^MF2 value used when choosing a first move therefore corresponds to a two-move MF Q value where the second move options are marginalised out with respect to the agent’s policy.

When choosing a second move the agent takes into partial account (since Q^MF2 doesn’t) the state s_t,2 to which it transitioned on the first move:

\begin{matrix} π (m_{t, 2} = m ∣ s_{t, 1}, m_{t, 1}, s_{t, 2}) \propto e^{γ_{m} + β_{2}^{M F 1} Q_{t}^{M F 1} (s_{t, 2}, m) + β_{2}^{M F 2} Q_{t}^{M F 2} (s_{t, 1}, m_{t, 1}, m)} \end{matrix}

(5)

In trials without feedback, since the agent’s Q^MF values are indexed by state and the transition is not revealed, it is necessary in Eq 5 to average Q^MF1 over all the permitted s_t,2 (since returning to s_t,1 is disallowed):

\begin{matrix} ⟨ Q_{t}^{M F 1} (s_{t, 2}, m) ⟩ = \frac{1}{7} \sum_{s} Q_{t}^{M F 1} (s, m) \end{matrix}

(6)

The decision is then made according to Eq 5.

Model-free value learning

The reward R(s_t,2) received for a given move m_t and transition to state s_t,2 is used to update the agent’s Q^MF1 value for that move:

\begin{matrix} Q_{t + 1}^{M F 1} (s_{t, 1}, m_{t}) \leftarrow Q_{t}^{M F 1} (s_{t, 1}, m_{t}) + η^{M F 1} [R (s_{t, 2}) - Q_{t}^{M F 1} (s_{t, 1}, m_{t})] \end{matrix}

(7)

On 2-move trials, the same rule is applied for the second move, adjusting $Q_{t + 1}^{M F 1} (s_{t, 2}, m_{t, 2})$ according to the second reward R(s_t,3). Furthermore, on 2-move trials, Q^MF2 values are updated at the end of each trial based on the sum total reward obtained on that trial but with a different learning rate:

\begin{matrix} Q_{t + 1}^{M F 2} (s_{t, 1}, m_{t, 1}, m_{t, 2}) \leftarrow & Q_{t}^{M F 2} (s_{t, 1}, m_{t, 1}, m_{t, 2}) + \\ η^{M F 2} [R (s_{t, 2}) + R (s_{t, 3}) - Q_{t}^{M F 2} (s_{t, 1}, m_{t, 1}, m_{t, 2})] \end{matrix}

(8)

Model-based learning

Additionally, and also as in Eldar et al. (2020) [31], the agent’s model-based system learns about the transitions associated with every move it experiences (i.e. the only transition in 1-move trials and both transitions in 2-move trials):

\begin{matrix} T_{t + 1} (s_{t, 1}, m_{t}, s_{t, 2}) \leftarrow T_{t} (s_{t, 1}, m_{t}, s_{t, 2}) + η^{M B} [1 - T_{t} (s_{t, 1}, m_{t}, s_{t, 2})] \end{matrix}

(9)

and, given that actions are reversible, learning also happens for the opposite transitions to a degree that is controlled by parameter ρ:

\begin{matrix} T_{t + 1} (s_{t, 2}, {\tilde{m}}_{t}, s_{t, 1}) \leftarrow T_{t} (s_{t, 2}, {\tilde{m}}_{t}, s_{t, 1}) + ρ η^{M B} [1 - T_{t} (s_{t, 2}, {\tilde{m}}_{t}, s_{t, 1})] \end{matrix}

(10)

where ${\tilde{m}}_{t}$ is the transition opposite to m_t. Note that in trials with no feedback the agent does not receive any reward and the state it transitions to is uncued. Therefore, no learning occurs in such trials.

To ensure that the probabilities sum up to 1, the agent re-normalizes the state-transition model after every update and following the MB forgetting (see below) as:

\begin{matrix} \forall s T_{t + 1} (s_{t, 1}, m_{t}, s) \leftarrow \frac{T_{t + 1} (s_{t, 1}, m_{t}, s)}{\sum_{s^{'}} T_{t + 1} (s_{t, 1}, m_{t}, s^{'})} \end{matrix}

(11)

Replay

That our agent exploits replay is the critical difference from the model that Eldar et al. (2020) [31] used to characterize their subjects’ decision processes.

Our algorithm makes use of its (imperfect) knowledge of the transition structure of the environment to perform additional learning in the inter-trial intervals by means of generative replay. Specifically, the agent utilises its state-transition model T and reward function R(s) to estimate model-based ${\hat{Q}}^{M B}$ values for every possible action (that are allowable according to the model). These ${\hat{Q}}^{M B}$ values are then assessed for the potential MF policy improvements (see below).

${\hat{Q}}^{M B}$ values for 1-move trials and second moves in 2-move trials are estimated as follows:

\begin{matrix} {\hat{Q}}^{M B 1} (s_{1}, m_{1}) = \sum_{s^{'}} T (s_{1}, m_{1}, s^{'}) R (s^{'}) \end{matrix}

(12)

Similarly, ${\hat{Q}}^{M B}$ values for 2-move sequences are estimated as:

\begin{matrix} {\hat{Q}}^{M B 2} (s_{1}, m_{1}, m_{2}) = \sum_{s^{'}} T (s_{1}, m_{1}, s^{'}) [R (s^{'}) + \sum_{s^{''}} T (s^{'}, m_{2}, s^{''}) R (s^{''})] \end{matrix}

(13)

When summing over the potential outcomes for a second action in Eq 13, the agent additionally sets the probability of transitioning into the starting location s₁ to zero (since back-tracking was not allowed) and normalises the transition probabilities according to Eq 11. Note that reward function R(s) here is the true reward the agent would have received for transitioning into state s since we assume that the subjects have learnt the image-reward associations perfectly well. The model-generated ${\hat{Q}}^{M B}$ values therefore incorporate the agent’s uncertainty about the transition structure of the environment. If the agent is certain which state a given action would take it to, ${\hat{Q}}^{M B}$ value for that action would closely match the true reward function of that state. Otherwise, ${\hat{Q}}^{M B}$ values for uncertain transitions are corrupted by the possibility of ending up in different states.

The agent then uses all the generated ${\hat{Q}}^{M B}$ values to compute new hybrid Q^MF/MB values. These hybrid values correspond to the values that would have resulted had the current Q^MF values been updated towards the model-generated ${\hat{Q}}^{M B}$ values using replay-specific learning rates $η_{r}^{M F 1}, η_{r}^{M F 2}$ :

\begin{matrix} Q^{M F 1 / M B 1} (s_{1}, m_{1}) \leftarrow Q^{M F 1} (s_{1}, m_{1}) + η_{r}^{M F 1} [{\hat{Q}}^{M B 1} (s_{1}, m_{1}) - Q^{M F 1} (s_{1}, m_{1})] \end{matrix}

(14)

\begin{matrix} Q^{M F 2 / M B 2} (s_{1}, m_{1}, m_{2}) \leftarrow & Q^{M F 2} (s_{1}, m_{1}, m_{2}) + \\ η_{r}^{M F 2} [{\hat{Q}}^{M B 2} (s_{1}, m_{1}, m_{2}) - Q^{M F 2} (s_{1}, m_{1}, m_{2})] \end{matrix}

(15)

We note that the 2-move updates specified in Eq 15 are a form of supervised learning, and so differ from the RL/DYNA-based episodic replay suggested by Mattar and Daw (2018) [41]. As mentioned above, we chose this way of updating MF Q-values for coupled moves in replay to keep the algorithmic details as close to Eldar et al. (2020) [31] as possible. In principle, we could have also operationalised our 2-move replay in a DYNA fashion.

To assess whether any of the above updates improve the agent’s MF policy, the agent computes the expected value of every potential update [41]:

\begin{matrix} \hat{Gain} (s_{1}, m_{1}) = & \sum_{m} Q^{M F 1 / M B 1} (s_{1}, m) π_{n e w} (m ∣ s_{1}) - \\ \sum_{m} Q^{M F 1 / M B 1} (s_{1}, m) π_{o l d} (m ∣ s_{1}) \end{matrix}

(16)

Analogously, for a sequence of two moves {m₁, m₂}:

\begin{matrix} \hat{Gain} (s_{1}, m_{1}, m_{2}) = & \sum_{{m_{1}, m_{2}}} Q^{M F 2 / M B 2} (s_{1}, m_{1}, m_{2}) π_{n e w} ({m_{1}, m_{2}} ∣ s_{1}) - \\ \sum_{{m_{1}, m_{2}}} Q^{M F 2 / M B 2} (s_{1}, m_{1}, m_{2}) π_{o l d} ({m_{1}, m_{2}} ∣ s_{1}) (21) \end{matrix}

(17)

where the policy π was assumed to be unbiased and computed as in Eq 2; that is, the agent directly estimated the corresponding probabilities for each sequence of 2 actions in 2-move trials. Both of these expressions for estimating $\hat{Gain}$ use the full new model-free policy that would be implied by the update. Thus, as also in Mattar and Daw (2018) [41], the gain (Eqs 16 and 17) does not assess a psychologically-credible gain, since the new policy is only available after the replays are executed. Moreover, we emphasise that this same gain is the agent’s estimate, for the true gain is only accessible to an agent with perfect knowledge of the transition structure of the environment (which is infeasible in the presence of substantial forgetting).

Finally, the expected value of each backup (EVB), or replay, is computed as $\frac{1}{8} \cdot \hat{Gain} (\cdot)$ (since the agent starts each trial in a pseudorandom location, we assumed need to be uniform). Exactly as in Mattar and Daw (2018) [41], the priority of the potential updates is determined by the EVB value—if the greatest EVB value exceeds gain threshold ξ (for simplicity and due to the assumption of uniform need, we refer to ξ as gain threshold, rather than EVB threshold), then the agent executes the replay associated with that EVB towards the model-generated ${\hat{Q}}^{M B}$ value according to Eqs 14 or 15 (depending whether it is a single move or a two-move sequence), thus incorporating its MB knowledge into the current MF policy. Note that since this changes the agent’s MF policy and the generation of ${\hat{Q}}^{M B 2}$ is policy-dependent, the latter are re-generated following every executed backup. The replay proceeds until no potential updates have the EVB value greater than ξ.

Forgetting

The agent is assumed to forget both about the Q^MF values and the state-transition model T in trials where feedback is provided. Thus, after every update and following replay the agent forgets according to:

\begin{matrix} Q_{t + 1}^{M F} & \leftarrow (1 - ϕ^{M F}) Q_{t + 1}^{M F} + ϕ^{M F} {\bar{r}}_{t + 1} \end{matrix}

(18)

\begin{matrix} T_{t + 1} & \leftarrow (1 - ϕ^{M B}) T_{t + 1} + ϕ^{M B} 1 / 7 \end{matrix}

(19)

Note that we parameterize the above equations in terms of forgetting parameters ϕ. Eldar et al. (2020) [31] instead used τ as remembrance, or value retention, parameters. Therefore, our forgetting parameters ϕ are equivalent to Eldar et al. (2020) [31]’s 1 − τ.

The state-transition model therefore decays towards the uniform distribution over the potential states the agent can transition to given any pair of state and action.Q^MF values are forgotten towards the average reward experienced since the beginning of the task, ${\bar{r}}_{t + 1}$ . For Q^MF1 values, it is the average reward obtained in single moves:

\begin{matrix} {\bar{r}}_{t + 1} = \frac{1}{t} \sum_{n = 1}^{t} R (s_{t, 2 / 3}) \end{matrix}

(20)

and for Q^MF2 values it is the average reward obtained in coupled moves:

\begin{matrix} {\bar{r}}_{t + 1} = \frac{1}{t} \sum_{n = 1}^{t} [R (s_{t, 2}) + R (s_{t, 3})] \end{matrix}

(21)

where R(s) is the reward obtained for transitioning into state s. This differs from Eldar et al. (2020) [31], for which forgetting was to a constant value θ, which was a parameter of the model.

After blocks 2 and 4 the learnt Q^MF values are of little use due to the introduced changes to the environment, and, again as in Eldar et al. (2020) [31] the agent forgets both the Q^MF values and state-transition model T according to Eqs 18 and 19, but with different parameters ϕ′^MF and ϕ′^MB, respectively.

In addition to on-task replay, the agent also engages in off-task replay immediately before the blocks with changed image-reward associations and spatial re-arrangement. In the former case, the agent uses the new reward function R′(s) that corresponds to the new image-reward associations. In the latter case, the agent generates ${\hat{Q}}^{M B}$ values with the new reward function and a state-transition model rearranged according to the instructions albeit with limited success:

\begin{matrix} T \leftarrow (1 - ω) T + ω T^{rearranged} \end{matrix}

(22)

During these two off-task replay bouts, the agent uses the exact same subject-specific parameter values as in on-task replay. The effect of such model re-arrangement on the accuracy of ${\hat{Q}}^{M B}$ estimates of the re-arranged environment in MI human subjects, as modelled by our agent, is shown in S6 Fig.

Initialisations

Q^MF values are initialised to $Q^{M F} \sim N (θ, 1)$ , and the state-transition model T(s_t,1, m_t, s_t,2) is initialised to 1/7 (since self-transitions are not allowed). The agent, however, starts the main task with extensive training on the same training trials that the subjects underwent before entering the main task. In S8 Fig, we show how ignorant (according to our agent’s prediction) each subject was as regards the optimal transitions in the state-space after this extensive training.

Parameter fitting

To fit the aforementioned free parameters to the subjects’ behavioural data we used the Metropolis-Hastings sampling algorithm in the Approximate Bayesian Computation [68] framework (ABC). As a distance (negative log-likelihood) measure for ABC, we took the root-mean-squared deviation between the simulated and the subjects’ performance data, measured as the proportion of available reward collected in each epoch. The pseudocode for our fitting procedure is provided in algorithm 1.

Algorithm 1 Metropolis-Hastings ABC Algorithm for obtaining a point estimate from a mode of the posterior distribution in the parameter space given the initialisation distribution μ(θ), data D, and a model that simulates the data $M (D ∣ θ)$ . θ_t denotes the full multivariate parameter sample at iteration t.

Set the exponentially decreasing tolerance thresholds {ϵ_t}_0,…,T and the perturbation variances ${σ_{t}^{2}}_{0, . . ., T}$ for T iterations. The perturbation covariances are set as $Σ_{t} = σ_{t}^{2} I$ , where I is the identity matrix.

1: for iteration t = 0 do

2: while ρ(D, D*) > ϵ₀ do

3: Sample θ* from initialisation θ* ∼ μ(θ)

3: fro i in {1..5} do

3: Simulate data $D_{i}^{*} \sim M (D ∣ θ^{*})$

6: end for

7: Calculate average distance metric $ρ (D, D^{*}) = \frac{1}{5} \sum_{i = 1}^{5} ρ (D, D_{i}^{*})$

8: end while

9: end for

10: Set θ₀ ← θ*

11: for iteration 1 ≤ t ≤ T do

12: while ρ(D, D*) > ϵ_t do

13: Sample θ* from proposal $θ^{*} \sim N (θ_{t - 1}, Σ_{t})$

14: if any [θ*]_i is not within support of π([θ]_i)then

15: continue

16: end if

17: for i in {1..5} do

18: Simulate data $D_{i}^{*} \sim M (D ∣ θ^{*})$

19: end for

20: Calculate average distance metric $ρ (D, D^{*}) = \frac{1}{5} \sum_{i = 1}^{5} ρ (D, D_{i}^{*})$

21: end while

22: Set θ_t ← θ*

23: end for

The fitting procedure was performed for 55 iterations with the exponentially decreasing tolerance threshold ϵ_t ranging between 0.6 and 0.10. For covariance matrices, we used identity multiplied by a scalar variance with the exponential range between 0.5 and 0.02. To avoid spurious parameter samples being accepted, we simulated our model 5 times with each proposed parameter sample and then used the average over these simulations to compute the distance metric.

Note that the initialisation distribution μ(θ) is different from a prior distribution π(θ), since it did not play any role in the algorithm acceptance probability (in fact, we only sampled the full multivariate parameter sample from the initialisation distribution only once in the very first iteration). The acceptance probability for the canonical Metropolis-Hastings ABC algorithm is:

\begin{matrix} α = min (1, \frac{π (θ^{*}) q (θ_{t - 1} ∣ θ^{*})}{π (θ_{t - 1}) q (θ^{*} ∣ θ_{t - 1})}) if ρ (D, D^{*}) < ϵ_{t} and 0 otherwise \end{matrix}

Notice that if the prior distribution is chosen to be uniform, or uninformative, then the two π’s cancel out. Furthermore, if one uses a Gaussian (hence symmetric) proposal distribution q, then the implied acceptance probability of the Metropolis-Hastings ABC algorithm is:

\begin{matrix} α = 1 if ρ (D, D^{*}) < ϵ_{t} and 0 otherwise \end{matrix}

which is exactly how our algorithm 1 operates.

We chose uniform initialisation distributions with support between 0 and 1 for η^MF1, η^MF2, $η_{r}^{M F 1}$ , $η_{r}^{M F 2}$ , η^MB, ϕ′^MB, ϕ′^MF, ω. The initialisation distributions for on-line forgetting parameters ϕ^MF and ϕ^MB, as well as for the fraction of the learning rate for opposite transitions, ρ, were specified to be beta with α = 6 and β = 2. Parameters θ and γ_m were chosen to be Gaussian with mean 0 and variance 1. Inverse temperature parameters $β_{1}^{M F 1}$ , $β_{2}^{M F 1}$ , and $β_{2}^{M F 2}$ had gamma priors with location 1 and scale 1. Since our perturbation covariance matrices were identities multiplied by a scalar variance, and the values for ξ were in general very small compared to the other parameters—to allow for a similar perturbation scale for ξ we sampled it from a log-gamma distribution with location −1 and scale 0.01, and the following perturbations were also performed in the log-space. Our fitting algorithm therefore learnt log₁₀ ξ rather than ξ directly. Additionally, parameters with bounded support (such as the learning and forgetting rates) were constrained to remain within the support specified by their corresponding prior distributions.

The fitting procedure was fully parallelized thanks to the implemented python MPI module in a freely available python package astroabc [69]. The distribution of the resulting fitting errors is shown in S9 Fig.

Replay analysis

The objective replay benefit for Fig 6E was computed as the total accrued gain for all example 1-move replay events shown in Fig 6D. That is, we used Eq 16, where Q^MF/MB were taken to be the true MF Q-values (or the true reward obtainable for each action), and π_old and π_new were the policies before and after all replay events, respectively. The average replay benefit in Fig 6F was computed in the same way as above, but the total accrued gain was averaged over all states where replay occurred (or equivalently, where there was a policy change). Importantly, in Fig 6E we only show the objective replay benefit as a result of the 1-move replay events from Fig 6D. For the overall average, in 2-move trials the 2-move replay events were also taken into account, and the total accrued gain after a 2-move trial was computed as the average over the 1-move and 2-move value function improvements.

The subjective replay benefit shown in S5E Fig was computed in the exact same way as described above; however, for Q^MF/MB we took the agent’s updated MF Q-values after the replay events (or event) shown in S5D Fig. The average objective value function change from S5F Fig was computed in the same way as described above.

To analyse the agent’s preference to replay sub-optimal actions, we extracted the number of times the agent replayed sub-optimal and optimal actions at the end of each trial, which is shown in Fig 6G and 6H for the most recent and all other (or ‘distal’) transitions, respectively. Due to the torus-like design of the state space, some transitions led to the same outcomes (e.g. going ‘up’ or ‘down’ when at the top or bottom rows), and we therefore treated these as the same ‘experiences’. For instance, if the agent chose an optimal move ‘up’, then both ‘up’ and ‘down’ replays were counted as replays of the most recent optimal transition. In 2-move trials, a move was considered optimal if the whole sequence of moves was optimal. In such case, the replays of this 2-move sequence and the replays of the second move were counted towards the most recent optimal replays. If, however, the first move in a 2-move trial was sub-optimal, the second move could still have been optimal, and therefore both sub-optimal replays of the first move and optimal replays of the second move were considered in this case.

For all replays described above, we considered action entropy values at which these replays were executed (shown in Fig 6G and 6H, right column). For every 1-move replay, the corresponding action entropy was computed according to Eq 1. For every 2-move replay beginning at state s₁ and proceeding with a sequence of actions m₁, m₂, the corresponding action entropy was computed as the joint entropy:

\begin{matrix} H (s_{1}, m_{1}, m_{2}) = \underset{s_{2} \sim T (s_{2} ∣ s_{1}, m_{1}), s_{3} \sim T (s_{3} ∣ s_{2}, m_{2})}{E} [- {log}_{2} {T (s_{2} ∣ s_{1}, m_{1}) \cdot T (s_{3} ∣ s_{2}, m_{2})}] \end{matrix}

(23)

where in T(s₂∣s₁, m₁) the probability of transitioning from s₂ to s₁ was set to zero; similarly, in T(s₃∣s₂, m₂) the probabilities of transitioning from s₃ to s₂ and from s₂ to s₁ were set to zero (since back-tracking was not allowed), and the transition matrix was re-normalized as in Eq 11.

Example simulations

The contour plot in Fig 1D was generated by simulating the agent in 1-move trials in the main state-space on a regularly spaced grid of 150 ϕ^MF and ϕ^MB values ranging from 0 to 0.5. For each combination of ϕ^MF and ϕ^MB, the simulation was run 20 times for 300 trials (in each simulation the same sequence of randomly-generated starting states was used). Performance (proportion of available reward collected) in the last 100 trials of each simulation was averaged within and then across the simulations for the same combination of ϕ^MF and ϕ^MB to obtain a single point from the contour plot shown in Fig 1D. The final matrix was then smoothed with a 2D Gaussian kernel of std. 3.5. The model parameters used in these simulations are listed below:

\begin{array}{l} η^{M F 1} & 0.615 \\ η_{r}^{M F 1} & 0.615 \\ β_{1}^{M F 1} & 1.152 \\ η^{M B} & 0.98 \\ ρ & 0.70 \\ γ_{m} & [- 1.105, 1.088, 0, 0.017] \\ θ & 0.088 \\ ξ & 0.0001 \end{array}

The data from Figs 1B, 4 and 5, S1A and S1B Fig were obtained by simulating the agent in a simple environment with only two actions available; in each trial the agent had to choose between the two options for 100 trials in total. The agent received a reward of 1 for the sub-optimal action and 10 for the optimal action (except for S7 Fig, there the agent was simulated with reward values that ranged from 0 to 10 with linear increments of 0.1). The following parameters were used:

\begin{array}{l} η^{M F 1} & 0.20 \\ η_{r}^{M F 1} & 0.20 \\ β_{1}^{M F 1} & 0.90 \\ η^{M B} & 0.98 \\ ϕ^{M B} & 0.10 \\ ϕ^{M F} & 0.20 \\ ξ & 0.00001 \end{array}

Q^MF values were initialised to 0, the state-transition model was initialised to 0.5, and the bias parameter γ_m was set to 0 for each move.

The data from Fig 3 were obtained by simulating multiple agents in the same 2-action environment as described above. All agents were simulated 50 times in 150 trials (data shown only for first 60 trials), and Fig 3 shows average values over those 50 simulations along with the estimated confidence intervals. The following parameters were used:

\begin{array}{l} η^{M F 1} & 0.30 \\ η_{r}^{M F 1} & 0.70 \\ β_{1}^{M F 1} & 1.00 \\ η^{M B} & 1.00 \\ ϕ^{M B} & 0.50 \\ ϕ^{M F} & 0.50 \\ ξ & 0.001 \end{array}

All agents used the same parameters, wherever applicable (i.e., agents without replay had the relevant parameters from those listed above; agents without forgetting had ϕ^MF and ϕ^MB set to 0).

The estimated gain from Fig 4A was computed by using Eq 16 where MF Q-values were taken to be the agent’s MF Q-values after the final trial, and 700 ${\hat{Q}}^{M B}$ values were manually generated on the interval [−9, 12]. This procedure was repeated with the agent’s MF Q-values additionally decayed according to Eq 18 with ϕ^MF values of 0.3, 0.5 and 0.7, and the average reward obtained by the agent over those 100 trials. The x-axis was then limited to the appropriate range.

The estimated gain from Fig 5B was computed in the same way as described above, but instead of manually generating ${\hat{Q}}^{M B}$ values, 100 transition probabilities (with constant linear increments) were generated on the interval from [0.5, 1] for the correct transition and [0, 0.5] for the other transition (such that the two always summed up to 1). These multiple instances of the state-transition model were used to generate ${\hat{Q}}^{M B}$ values for computing the estimated gain.

The entropy range difference in S1A Fig was computed as the difference between the sub-optimal and optimal action entropy values at which the estimated gain became positive for each respective action after the last (100th) trial. These range differences for each combination of the reward values were then averaged over 20 simulations and plotted as a contour plot which was additionally smoothed with a 2D Gaussian kernel of 1.5 std.

The pessimism bias in S1B Fig was computed as the average difference between the number of sub-optimal and optimal replays at the end of each trial, averaged over the same 20 simulations. The resulting matrix was also smoothed with a 2D Gaussian kernel of 1.5 std.

The contour plot in S1C Fig was generated from the data obtained from the same simulation as for Fig 1D, and using the procedure outlined above (the bias for each combination of ϕ^MF and ϕ^MB values was averaged over the last 100 trials and then across 20 simulations). The resulting matrix was smoothed with a 2D Gaussian kernel of 3.5 std.

Supporting information

S1 Fig. Replay entropy range and pessimism bias.

(A) Difference in the range of sub-optimal and optimal action entropy values at which the estimated gain for the corresponding actions was positive. Higher values correspond to a higher action entropy range for the sub-optimal action; the agent can therefore benefit from the replay of sub-optimal actions at a wider range of action entropy values. Each datum is an average over 20 simulations where each simulation consisted of 100 trials. (B) Pessimism bias, defined here as the average (over 100 simulation trials) difference between the per-trial average number of replays of sub-optimal and optimal actions. Each datum is an average over the same 20 simulations. Negative values indicate a stronger pessimism bias. Note that both matrices are triangular, since the reward for a sub-optimal action has to be smaller than that for the optimal action (by definition); the labels on the diagonal therefore label rows. (C) Same quantity as in (B) but for the agent simulated in the behavioural task with varying MF and MB forgetting parameters.

(EPS)

Click here for additional data file.^{(2.8MB, eps)}

S2 Fig. Comparison of forgetting parameters.

(A) Comparison of our best-fitting estimates of the MF forgetting parameter, ϕ^MF, to that of Eldar et al. (2020) [31] for the hybrid MF/MB model. Our modelling predicted the subjects to be notably more forgetful as regards their MF Q-values (2-sample 2-tailed t test, t = 6.73, p = 2.49 ⋅ 10⁻⁹). (B) Same as before but for the MB forgetting parameter, ϕ^MB. By contrast to MF forgetting, we predicted that most of the subjects remembered their state-transition models better (2-sample 2-tailed t test, t = −3.92, p = 0.0009). *** p < 0.001.

(EPS)

Click here for additional data file.^{(2.2MB, eps)}

S3 Fig. Parameter equivalents of subjects’ decision flexibility.

(A) Linear regression of behavioural individual flexibility (IF) for all subjects, as measured by Eldar et al. (2020) [31], against IF (averaged over 100 simulations) computed from our agent with subject-specific parameters (Pearson’s coefficient of determination, R² = 0.73, p ≪ 0.0001). Subjects for which our model predicted sufficient replay (MI subjects) are highlighted in red, and MF subjects are shown in blue. Further, MI subjects who were found to mostly hurt themselves by replay (n = 2) are shown in green. IF of MI subjects, as predicted by our agent, was significantly higher than that of MF subjects (permutation test, $\bar{t} = 0.22 .$ , p ≪ 0.0001). (B) Multiple regression of all subjects’ best-fitting decision, learning and memory parameters against their respective IF values as predicted by our agent (due to our fitting procedure, we excluded all MB-relevant parameters for MF subjects that did not engage in replay). Vertical black lines show 95% confidence intervals. (C) MF forgetting parameter of MI subjects was found to be significantly higher that that of MF subjects (Wilcoxon rank-sum test, W = 3.08, p = 0.002). (D) Distribution of MB forgetting parameter, ϕ^MB, in MI subjects. (E) Best-fitting gain threshold statistics across MI subjects. (F) Same as in (B) for MI subjects but regressed against the average policy change for each subject due to replay (shown in Fig 7C). All regression models from (B) and (F) were selected using 5-fold cross-validation with recursive feature elimination (with a minimum of 5 features). * p < 0.05, **p < 0.01, *** p < 0.001, t test (unless specified otherwise).

(EPS)

Click here for additional data file.^{(2.2MB, eps)}

S4 Fig. Epistemology of replay: Another example.

The layout of the figure is similar to that of Fig 6. (A) An example move predicted by our agent with parameters fit to an MI subject. (B) State-transition knowledge of the agent after executing the move in (A). Note how the agent is less ignorant about the transitions compared to that in Fig 6. (C) MF and MB knowledge of the agent. The agent chose a sub-optimal move because its MF Q-values had been forgotten to an extent that made the agent believe it was optimal. Online learning (green shading) indeed helped the agent to remember that the chosen move was worse than predicted. (D) Replays executed by the agent after learning online about the move in (A). Note that, despite the agent’s accurate MF knowledge, it still chooses to engage in replay. This is because its state-transition model was exceptionally accurate; moreover, our parameter estimates indicated that this MI subject had a very low gain threshold. (E) Changes in the objective value function as a result of the replay in (D). (F) Overall, we found that this MI subject, as predicted by our agent, achieved an average objective value function improvement of 0.06 reward points as a result of replay (1-sample 2-tailed t test, t = 2.16, p = 0.03). (G-H) Same as Fig 6. * p < 0.05, ** p < 0.01, *** p < 0.001, ns: not significant.

(EPS)

Click here for additional data file.^{(5.1MB, eps)}

S5 Fig. How replay can hurt.

The layout of the figure is similar to that of Fig 6. (A) Actual move in a 1-move trial chosen by an MI subject that was also predicted by our agent with subject-specific parameters. (B) State-transition knowledge of the agent after executing the move in (A). Note the agent’s extreme ignorance about all transitions. (C) MF and MB knowledge of the agent. The agent chose a sub-optimal move because it s MF Q-values (including those for other moves) had been forgotten to an extent that made the agent more likely to choose that move. Note that, according to the agent’s MF Q-values, the subjectively optimal move (which the agent did not choose) was in fact objectively sub-optimal. Online learning (green shading) did not improve the situation, since the chosen sub-optimal move was found to be more rewarding than what the agent’s MF Q-values had indicated. The agent’s MF and MB knowledge after executing this move, therefore, still incorrectly indicated that the two most rewarding moves were ‘down’ and ‘up’ (as opposed to the objectively optimal move ‘left’). (D) Replays executed by the agent after learning online about the move in (A). The agent chose to replay all other moves at the state where the trial began, and such replay further exacerbated the agent’s knowledge, for it decreased the MF Q-value of the objectively optimal action (towards its model’s estimate, ${\hat{Q}}^{M B}$ , shown in pink bar). Such replay decreased the agent’s objective value function (as measured with respect to the true obtainable reward) of that state by 0.03 reward points (E). According to the agent’s estimate, however, it’s subjective value function of that state increased by 0.5 reward points. Note that the replay at other states (e.g., ‘spanner’ and ‘house’) was also harmful because of the similar confusion of the agent as regards the objectively optimal actions. (F) Overall, we found that this MI subject, as predicted by our agent, mostly hurt himself by replaying at the end of each trial (average change in objective value function was decreased by 0.075 reward points, 1-sample 2-tailed t test, t = −3.26, p = 0.001). ** p < 0.01.

(EPS)

Click here for additional data file.^{(4.6MB, eps)}

S6 Fig. Effect of state-transition model re-arrangement.

(A) Root-mean-square deviation between the MI subjects’ state-transition model (as predicted by our agent) reward estimates, ${\hat{Q}}^{M B}$ , and the true reward, $Q_{t r u e}^{M F}$ , of each transition in the final block of the task before and after the model was re-arranged as in Eq 22. The RMSD measure shows that the re-arrangement (even with limited success) made the state-transition model’s predictions significantly more accurate (2-sample 2-tailed t test, t = −4.48, p = 6.61 ⋅ 10⁻⁵). (B) Linear regression of the difference between the two quantities in (A) (before—after) against the ignorance about optimal transitions (measured as the average action entropy across objectively optimal actions). MI subjects with lower state-transition model entropy had significantly more accurate ${\hat{Q}}^{M B}$ estimates as a result of the model re-arrangement (Pearson’s coefficient of determination, R² = 0.95, p ≪ 0.0001.). (C) Simulated IF for MI subjects showed significant correlation to the RMSD measure (Pearson’s coefficient of determination, R² = 0.91, p ≪ 0.0001). (D) Additionally, we found that the MI subjects’ ignorance about objectively optimal action outcomes after the model re-arrangement correlated significantly with the simulated IF (Pearson’s coefficient of determination, R² = 0.81, p = ≪ 0.0001). This means that uncertainty in the subjects’ state-transition model was a good predictor of how well they adapted to the spatial re-arrangement that took place. ***p < 0.001.

(EPS)

Click here for additional data file.^{(2MB, eps)}

S7 Fig. On-line replay statistics with obliviated knowledge.

The layout of the figure is identical to that of Fig 7. Before engaging in off-task replay prior to blocks 3 and 5, each agent’s MF Q-values were zeroed-out to examine whether the overall on-line replay statistics would remain unaltered—and indeed they did. Wilcoxon rank-sum test in (A) and (B), 1-sample 2-tailed t test in (C) and (D). * p < 0.05, ** p < 0.02, *** p < 0.001.

(EPS)

Click here for additional data file.^{(2MB, eps)}

S8 Fig. Entropy of pre-trained state-transition models.

We assessed how ignorant each subject (as modelled by our agent) was about optimal 1-move transitions after learning in the training trials and immediately before entering the main task. The ignorance was computed as the average action entropy (Eq 1) across all objectively optimal transitions. MI subjects who, according to our modelling, engaged in replay (n = 20) were significantly less ignorant about objectively optimal transitions compared to MF subjects (2-sample 2-tailed t test, t = −2.56, p = 0.014). The horizontal dashed black line shows maximum ignorance. ** p < 0.01.

(EPS)

Click here for additional data file.^{(514.1KB, eps)}

S9 Fig. Fitting errors.

Distribution of the final fitting errors (RMSD of the proportion of available reward collected between the human subjects and the agent).

(EPS)

Click here for additional data file.^{(387KB, eps)}

Abbreviations

IF: Individual flexibility. A behavioural index introduced by Eldar et al. (2020) [31] which measures the extent to which subjects manage to adapt their choices to the design of the task space whereby optimal choices in a subset of trials are sub-optimal in another set of trials. Higher individual flexibility thus means greater behavioural flexibility according to the task rules. Eldar et al. (2020) [31] showed that this measure correlated with other measures of the extent to which subjects had and/or used a model of the task.
MB: Model-based. An algorithm which makes decisions based on prospective values estimated based on a learnt model of the environment. Decision-making via such algorithms is computationally expensive but the use of a model implies that choices can be revalued or devalued if the agent is informed about changes in the task structure.
MEG: Magnetoencephalography. A brain imaging technique that registers magnetic fields generated due to specific patterns of neural activity. Recently, this technique has been used to decode with high temporal precision the sequential progression of states during replay events.
MF: Model-free. An algorithm which makes decisions based on a set of retrospective cached values learnt directly from past experience. Decision-making via such algorithms is computationally cheap; however, the downside is that they do not learn any task structure and are therefore quite stubborn when the task rules change, since the cached values are no longer applicable and need to be re-learnt.
MI: Model-informed. A hybrid model-free / model-based algorithm which makes decisions based on a set of model-free values which, however, can be altered by information supplied by the algorithm’s model of the environment.
RL: Reinforcement learning. An area of machine learning in which by interacting with an environment, agents acquire systematic methods of acting (policies) that tend to maximize gains and minimize losses.

Data Availability

The code and data used to produce the results and analyses presented in this manuscript are available at https://github.com/geoant1/optimism_and_pessimism.

Funding Statement

GA, CG, and PD are funded by the Max Planck Society (https://www.mpg.de/en). PD is also funded by the Alexander von Humboldt Foundation (https://www.humboldt-foundation.de/en/). EE holds the National Institute of Health grants R01MH124092 and R01MH125564 (https://www.nih.gov/), and a United States-Israel Binational Science Foundation grant 2019801 (https://www.bsf.org.il/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. O’Keefe J, Dostrovsky J. The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain research. 1971. [DOI] [PubMed] [Google Scholar]
2. O’Keefe J, Nadel L. The hippocampus as a cognitive map. Oxford: Clarendon Press; 1978. [Google Scholar]
3. Wilson MA, McNaughton BL. Reactivation of hippocampal ensemble memories during sleep. Science. 1994;265(5172):676–679. doi: 10.1126/science.8036517 [DOI] [PubMed] [Google Scholar]
4. Lee AK, Wilson MA. Memory of sequential experience in the hippocampus during slow wave sleep. Neuron. 2002;36(6):1183–1194. doi: 10.1016/S0896-6273(02)01096-6 [DOI] [PubMed] [Google Scholar]
5. Foster DJ, Wilson MA. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature. 2006;440(7084):680–683. doi: 10.1038/nature04587 [DOI] [PubMed] [Google Scholar]
6. Diba K, Buzsáki G. Forward and reverse hippocampal place-cell sequences during ripples. Nature neuroscience. 2007;10(10):1241–1242. doi: 10.1038/nn1961 [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Dragoi G, Tonegawa S. Preplay of future place cell sequences by hippocampal cellular assemblies. Nature. 2011;469(7330):397–401. doi: 10.1038/nature09633 [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Dragoi G, Tonegawa S. Distinct preplay of multiple novel spatial experiences in the rat. Proceedings of the National Academy of Sciences. 2013;110(22):9100–9105. doi: 10.1073/pnas.1306031110 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Pfeiffer BE, Foster DJ. Hippocampal place-cell sequences depict future paths to remembered goals. Nature. 2013;497(7447):74–79. doi: 10.1038/nature12112 [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Grosmark AD, Buzsáki G. Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences. Science. 2016;351(6280):1440–1443. doi: 10.1126/science.aad1935 [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Silva D, Feng T, Foster DJ. Trajectory events across hippocampal place cells require previous experience. Nature neuroscience. 2015;18(12):1772–1779. doi: 10.1038/nn.4151 [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Eichenbaum H. Does the hippocampus preplay memories? Nature neuroscience. 2015;18(12):1701–1702. doi: 10.1038/nn.4180 [DOI] [PubMed] [Google Scholar]
13. Foster DJ. Replay comes of age. Annual review of neuroscience. 2017;40:581–602. doi: 10.1146/annurev-neuro-072116-031538 [DOI] [PubMed] [Google Scholar]
14. Singer AC, Frank LM. Rewarded outcomes enhance reactivation of experience in the hippocampus. Neuron. 2009;64(6):910–921. doi: 10.1016/j.neuron.2009.11.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Ólafsdóttir HF, Barry C, Saleem AB, Hassabis D, Spiers HJ. Hippocampal place cells construct reward related sequences through unexplored space. Elife. 2015;4:e06063. doi: 10.7554/eLife.06063 [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Ambrose RE, Pfeiffer BE, Foster DJ. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron. 2016;91(5):1124–1136. doi: 10.1016/j.neuron.2016.07.047 [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Sirota A, Csicsvari J, Buhl D, Buzsáki G. Communication between neocortex and hippocampus during sleep in rodents. Proceedings of the National Academy of Sciences. 2003;100(4):2065–2069. doi: 10.1073/pnas.0437938100 [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Sirota A, Montgomery S, Fujisawa S, Isomura Y, Zugaro M, Buzsáki G. Entrainment of neocortical neurons and gamma oscillations by the hippocampal theta rhythm. Neuron. 2008;60(4):683–697. doi: 10.1016/j.neuron.2008.09.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Jadhav SP, Rothschild G, Roumis DK, Frank LM. Coordinated excitation and inhibition of prefrontal ensembles during awake hippocampal sharp-wave ripple events. Neuron. 2016;90(1):113–127. doi: 10.1016/j.neuron.2016.02.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Maingret N, Girardeau G, Todorova R, Goutierre M, Zugaro M. Hippocampo-cortical coupling mediates memory consolidation during sleep. Nature neuroscience. 2016;19(7):959–964. doi: 10.1038/nn.4304 [DOI] [PubMed] [Google Scholar]
21. Rothschild G, Eban E, Frank LM. A cortical–hippocampal–cortical loop of information processing during memory consolidation. Nature neuroscience. 2017;20(2):251–259. doi: 10.1038/nn.4457 [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Shin JD, Tang W, Jadhav SP. Dynamics of awake hippocampal-prefrontal replay for spatial learning and memory-guided decision making. Neuron. 2019;104(6):1110–1125. doi: 10.1016/j.neuron.2019.09.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
23. Todorova R, Zugaro M. Isolated cortical computations during delta waves support memory consolidation. Science. 2019;366(6463):377–381. doi: 10.1126/science.aay0616 [DOI] [PubMed] [Google Scholar]
24. Raichle ME. The brain’s default mode network. Annual review of neuroscience. 2015;38:433–447. doi: 10.1146/annurev-neuro-071013-014030 [DOI] [PubMed] [Google Scholar]
25. Rissman J, Gazzaley A, D’Esposito M. Measuring functional connectivity during distinct stages of a cognitive task. Neuroimage. 2004;23(2):752–763. doi: 10.1016/j.neuroimage.2004.06.035 [DOI] [PubMed] [Google Scholar]
26. Greicius MD, Supekar K, Menon V, Dougherty RF. Resting-state functional connectivity reflects structural connectivity in the default mode network. Cerebral cortex. 2009;19(1):72–78. doi: 10.1093/cercor/bhn059 [DOI] [PMC free article] [PubMed] [Google Scholar]
27. Jolles DD, van Buchem MA, Crone EA, Rombouts SA. Functional brain connectivity at rest changes after working memory training. Human brain mapping. 2013;34(2):396–406. doi: 10.1002/hbm.21444 [DOI] [PMC free article] [PubMed] [Google Scholar]
28. Kurth-Nelson Z, Economides M, Dolan RJ, Dayan P. Fast sequences of non-spatial state representations in humans. Neuron. 2016;91(1):194–204. doi: 10.1016/j.neuron.2016.05.028 [DOI] [PMC free article] [PubMed] [Google Scholar]
29. Kurth-Nelson Z, Barnes G, Sejdinovic D, Dolan R, Dayan P. Temporal structure in associative retrieval. Elife. 2015;4:e04919. doi: 10.7554/eLife.04919 [DOI] [PMC free article] [PubMed] [Google Scholar]
30. Liu Y, Dolan RJ, Kurth-Nelson Z, Behrens TE. Human replay spontaneously reorganizes experience. Cell. 2019;178(3):640–652. doi: 10.1016/j.cell.2019.06.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
31. Eldar E, Lièvre G, Dayan P, Dolan RJ. The roles of online and offline replay in planning. ELife. 2020;9:e56911. doi: 10.7554/eLife.56911 [DOI] [PMC free article] [PubMed] [Google Scholar]
32. Liu Y, Mattar MG, Behrens TEJ, Daw ND, Dolan RJ. Experience replay is associated with efficient nonlocal learning. Science. 2021;372 (6544). doi: 10.1126/science.abf1357 [DOI] [PMC free article] [PubMed] [Google Scholar]
33. Ego-Stengel V, Wilson MA. Disruption of ripple-associated hippocampal activity during rest impairs spatial learning in the rat. Hippocampus. 2010;20(1):1–10. doi: 10.1002/hipo.20707 [DOI] [PMC free article] [PubMed] [Google Scholar]
34. Girardeau G, Benchenane K, Wiener SI, Buzsáki G, Zugaro MB. Selective suppression of hippocampal ripples impairs spatial memory. Nature neuroscience. 2009;12(10):1222–1223. doi: 10.1038/nn.2384 [DOI] [PubMed] [Google Scholar]
35. Jadhav SP, Kemere C, German PW, Frank LM. Awake hippocampal sharp-wave ripples support spatial memory. Science. 2012;336(6087):1454–1458. doi: 10.1126/science.1217230 [DOI] [PMC free article] [PubMed] [Google Scholar]
36. Gridchyn I, Schoenenberger P, O’Neill J, Csicsvari J. Assembly-specific disruption of hippocampal replay leads to selective memory deficit. Neuron. 2020. 10.1016/j.neuron.2020.01.021 [DOI] [PubMed] [Google Scholar]
37. Káli S, Dayan P. Off-line replay maintains declarative memories in a model of hippocampal-neocortical interactions. Nature neuroscience. 2004;7(3):286–294. doi: 10.1038/nn1202 [DOI] [PubMed] [Google Scholar]
38. Hinton GE, Dayan P, Frey BJ, Neal RM. The “wake-sleep” algorithm for unsupervised neural networks. Science. 1995;268(5214):1158–1161. doi: 10.1126/science.7761831 [DOI] [PubMed] [Google Scholar]
39.Sutton RS. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Machine learning proceedings 1990. Elsevier; 1990. p. 216–224.
40. Momennejad I, Otto AR, Daw ND, Norman KA. Offline replay supports planning in human reinforcement learning. Elife. 2018;7:e32548. doi: 10.7554/eLife.32548 [DOI] [PMC free article] [PubMed] [Google Scholar]
41. Mattar MG, Daw ND. Prioritized memory access explains planning and hippocampal replay. Nature neuroscience. 2018;21(11):1609–1617. doi: 10.1038/s41593-018-0232-z [DOI] [PMC free article] [PubMed] [Google Scholar]
42. Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018. [Google Scholar]
43. Moore AW, Atkeson CG. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning. 1993;13(1):103–130. doi: 10.1007/BF00993104 [DOI] [Google Scholar]
44. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience. 2005;8(12):1704–1711. doi: 10.1038/nn1560 [DOI] [PubMed] [Google Scholar]
45. Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans’ choices and striatal prediction errors. Neuron. 2011;69(6):1204–1215. doi: 10.1016/j.neuron.2011.02.027 [DOI] [PMC free article] [PubMed] [Google Scholar]
46. Lee SW, Shimojo S, O’Doherty JP. Neural computations underlying arbitration between model-based and model-free learning. Neuron. 2014;81(3):687–699. doi: 10.1016/j.neuron.2013.11.028 [DOI] [PMC free article] [PubMed] [Google Scholar]
47. Moran R, Keramati M, Dayan P, Dolan RJ. Retrospective model-based inference guides model-free credit assignment. Nature communications. 2019;10(1):1–14. doi: 10.1038/s41467-019-08662-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
48. Watkins CJCH. Learning from delayed rewards. 1989. [Google Scholar]
49. Dayan P. Improving generalization for temporal difference learning: The successor representation. Neural Computation. 1993;5(4):613–624. doi: 10.1162/neco.1993.5.4.613 [DOI] [Google Scholar]
50. Shannon CE. A mathematical theory of communication. The Bell system technical journal. 1948;27(3):379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x [DOI] [Google Scholar]
51. Montague PR, Dolan RJ, Friston KJ, Dayan P. Computational psychiatry. Trends in cognitive sciences. 2012;16(1):72–80. doi: 10.1016/j.tics.2011.11.018 [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Gagne C, Dayan P. Peril, Prudence and Planning as Risk, Avoidance and Worry. 2021. Available from: psyarxiv.com/tcn7e.
53. Barraclough DJ, Conroy ML, Lee D. Prefrontal cortex and decision making in a mixed-strategy game. Nature neuroscience. 2004;7(4):404–410. doi: 10.1038/nn1209 [DOI] [PubMed] [Google Scholar]
54. Ito M, Doya K. Validation of decision-making models and analysis of decision variables in the rat basal ganglia. Journal of Neuroscience. 2009;29(31):9861–9874. doi: 10.1523/JNEUROSCI.6157-08.2009 [DOI] [PMC free article] [PubMed] [Google Scholar]
55. Toyama A, Katahira K, Ohira H. Reinforcement learning with parsimonious computation and a forgetting process. Frontiers in human neuroscience. 2019;13:153. doi: 10.3389/fnhum.2019.00153 [DOI] [PMC free article] [PubMed] [Google Scholar]
56. Niv Y, Daw ND, Joel D, Dayan P. Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology. 2007;191(3):507–520. doi: 10.1007/s00213-006-0502-4 [DOI] [PubMed] [Google Scholar]
57. Wixted JT. The psychology and neuroscience of forgetting. Annu Rev Psychol. 2004;55:235–269. doi: 10.1146/annurev.psych.55.090902.141555 [DOI] [PubMed] [Google Scholar]
58. Friedrich J, Lengyel M. Goal-directed decision making with spiking neurons. Journal of Neuroscience. 2016;36(5):1529–1546. doi: 10.1523/JNEUROSCI.2854-15.2016 [DOI] [PMC free article] [PubMed] [Google Scholar]
59. Basanisi R, Brovelli A, Cartoni E, Baldassarre G. A generative spiking neural-network model of goal-directed behaviour and one-step planning. PLOS Computational Biology. 2020;16(12):e1007579. doi: 10.1371/journal.pcbi.1007579 [DOI] [PMC free article] [PubMed] [Google Scholar]
60. Schwartenbeck P, Baram A, Liu Y, Mark S, Muller T, Dolan R, et al. Generative replay for compositional visual understanding in the prefrontal-hippocampal circuit. bioRxiv. 2021; [Google Scholar]
61. Russek EM, Momennejad I, Botvinick MM, Gershman SJ, Daw ND. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS computational biology. 2017;13(9):e1005768. doi: 10.1371/journal.pcbi.1005768 [DOI] [PMC free article] [PubMed] [Google Scholar]
62. Carey AA, Tanaka Y, van Der Meer MA. Reward revaluation biases hippocampal replay content away from the preferred outcome. Nature neuroscience. 2019; p. 1–10. [DOI] [PubMed] [Google Scholar]
63. O’Neill J, Boccara C, Stella F, Schönenberger P, Csicsvari J. Superficial layers of the medial entorhinal cortex replay independently of the hippocampus. Science. 2017;355(6321):184–188. doi: 10.1126/science.aag2787 [DOI] [PubMed] [Google Scholar]
64. Stachenfeld KL, Botvinick MM, Gershman SJ. The hippocampus as a predictive map. Nature neuroscience. 2017;20(11):1643. doi: 10.1038/nn.4650 [DOI] [PubMed] [Google Scholar]
65. Babichev A, Morozov D, Dabaghian Y. Replays of spatial memories suppress topological fluctuations in cognitive map. Network Neuroscience. 2019;3(3):707–724. doi: 10.1162/netn_a_00076 [DOI] [PMC free article] [PubMed] [Google Scholar]
66. Kaelbling LP, Littman ML, Cassandra AR. Planning and acting in partially observable stochastic domains. Artificial intelligence. 1998;101(1-2):99–134. doi: 10.1016/S0004-3702(98)00023-X [DOI] [Google Scholar]
67. Silver D, Veness J. Monte-Carlo planning in large POMDPs. Neural Information Processing Systems; 2010. [Google Scholar]
68. Turner BM, Van Zandt T. A tutorial on approximate Bayesian computation. Journal of Mathematical Psychology. 2012;56(2):69–85. doi: 10.1016/j.jmp.2012.02.005 [DOI] [Google Scholar]
69. Jennings E, Madigan M. astroABC: an approximate Bayesian computation sequential Monte Carlo sampler for cosmological parameter estimation. Astronomy and computing. 2017;19:16–22. doi: 10.1016/j.ascom.2017.01.001 [DOI] [Google Scholar]

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009634.r001

Decision Letter 0

Daniel Bush, Daniele Marinazzo

2 Aug 2021

Dear Mr. Antonov,

Thank you very much for submitting your manuscript "Optimism and pessimism in optimised replay" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

In particular, the authors should ensure that this manuscript is presented in a straightforward and comprehensible manner, so that it can be understood by a wide audience in both theoretical and empirical neuroscience; and that the assumptions made and their wider implications for our understanding of replay and memory function are clearly described.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Daniel Bush

Associate Editor

PLOS Computational Biology

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thank you for the opportunity to review this article. I have been interested in an article like this since seeing the Carey et. al (2019) paper as I thought it would be interesting to see how consistent / inconsistent the Mattar & Daw (2018) model would be with those results.

The authors conduct an impressive modeling analysis of pessimized replay in a (previously published) human behavioral dataset, arguing that pessimized replay is indeed optimal in the face of forgetfulness.

My main critique has to do with some of the (potentially incorrect) assumptions in the article. I do think they can be addressed and thus hope to see this article published after clarification on these issues.

There are two critiques here. Upon further deliberation, I think they are intimately related, but I will sketch them out separately and then try to offer a unified account.

1. It's not clear to me that pessimized replay is inconsistent with the Mattar & Daw model as is (which the present authors discuss, albeit briefly, in lines 214-223). Consider the following case, in which there are 2 actions, L and R. The MB values are 1 and 10, respectively, whereas the MF values are 2 and 10. There is higher Gain (and thus EVB) when replaying the pessimistic action. Thus, replay can be used to devalue options, which was my interpretation of the Carey et. al (2019) finding. I consider this hypothesis to be simpler than the authors' forgetting hypothesis, and thus would like to see it ruled out first before considering more complex hypotheses.

2. Second, I'm not sure I understand the authors idea of MF forgetting. From what I do understand, the problem the authors are identifying is the following. Since TD learning operates under the assumption that the optimal policy will be chosen after the first step, once the optimal policy is reached via training, the value (and thus the probability, assuming a softmax policy) of suboptimal actions will be inflated because the agent will make good decisions afterwards.

The authors say "Insofar as the agent improves over the course of the task, the average reward it obtains increases with each trial. MF values for sub-optimal actions, therefore, tend to rise towards this average experienced reward; MF Q-values for optimal actions, on the other hand, become devalued, as the agent is prone occasionally to choose sub-optimal actions due to its non-deterministic policy. In other words, because of MF forgetting, the agent forgets how good the optimal actions are and how bad the sub-optimal actions are."

Is the claim that this optimistic estimate of MF values is considered 'MF Forgetting'? This seems to be a different definition of 'forgetting' than what is commonly thought.

Are the authors claiming that changing the temperature in order to inject more indeterminism into the policy is forgetting? This, I think, would be a more palatable interpretation of forgetting, but the rationale for this is lost on me, especially given that it runs counter to the concept of annealing.

My attempt at reconciliation in order to understand the authors' intuition (but this is surely to be incorrect, so please correct me) is that the authors have identified that the value of suboptimal actions becomes inflated due to the mathematical formalism of TD learning and thus replay may function as a mechanism to devalue these actions to prevent further use (and thus we see pessimistic replay). If this is indeed what is happening, I worry about using term 'forgetting' so prominently as I believe it has very different connotations than something like 'devaluation'. I understand I may be harping on terminology here, so I do apologize; I do think it's an important distinction, though.

If I am incorrect in understanding the authors' intuition, please correct my mistakes.

Regardless, this was a very impressive modeling paper, and I am grateful to have had the opportunity to review it. I hope my comments were constructive.

Reviewer #2: In this manuscript, Antonov and colleagues investigates the effect of forgetting in a DYNA agent that selects optimal experiences for replay. Through various simulations and model fits to subject behavioral data, the authors explore two types of forgetting: the forgetting of action values, and the forgetting of action outcomes. They find that, under a set of modeling assumptions, a forgetful DYNA agent will tend to prioritize the replay of suboptimal actions and not of optimal actions. They then re-interpret the findings of a previous MEG study to suggest that subjects, who exhibited substantial forgetting, might have been replaying optimally when replaying suboptimal actions.

The topic addressed is timely, given the large number of recent studies investigating the content and function of replay in rodents and humans. In particular, the elegant modeling exercise presented in the article might offer new light into existing replay data that, at first blush, may have appeared inconsistent with a normative view. However, while the authors clearly conducted a careful analysis of how forgetting might affect optimal replay, the presentation of their findings in the paper was rather confusing. Accordingly, below I present a few questions, comments, and suggestions that might improve the clarity of the paper to a broad audience.

MAJOR COMMENTS:

1. First and foremost, the paper uses many jargons and abbreviations. Please consider simplifying the language and reducing the number of abbreviations.

2. The paper presents an enourmous amount of information from the very beginning of the Results section. I suspect that any reader not familiar with both Eldar et al (2020) and Mattar & Daw (2018) would have difficulty understanding the model. This might be mitigated with a clearer and more organized structure. One possibility is to organize the simulations in increasing levels of complexity, such as the following: (i) no replay; (ii) no replay with forgetting of values; (iii) normative replay with no forgetting; (iv) normative replay with forgetting of values; (v) normative replay with forgetting of the transition function (i.e. where the discussion action entropy would fall); (vi) normative replay with both forgetting of value and of the transition function. A structure such as this might help the reader understand how each factor, individually, contributes a different pattern reflected in behavior. Other structures are also possible, but the general aim should be to introduce the model components slowly such that any naive reader is able to understand it.

3. Forgetting is modeled as (i) MF values decaying towards a global average; and (ii) Transitions becoming more uniform. My understanding was that, in the absense of forgetting, the proposed model would reduce to the optimal model. Is that correct? If so, it would beneficial to state that explicitly in the paper. If not, it would be important to explain why.

4. What would be the effect of considering also the forgetting of the (stimulus-specific) rewards?

5. Similarly, how important is the assumption that MF values decay towards a global average? Would the simulation results be different if the decay was, instead, towards a subject-specific baseline?

6. I confess that I did not understand the selection of the very peculiar form for the the softmax in 2-step choices (equation [3]). What is the reasoning behind using both Q-values in 2-move trials, instead of only the Q-value for 2-move sequences (which, according to [8], should converge to the total reward obtained in 2 steps)? How different would the simulation results be if only Q^{MF2} was used? Similarly, why is Q^{MF2} used in equation [5], instead of only Q^{MF1}? How dependent are the simulation results on this assumption?

7. In equation [13], the authors assume that Q^{MB} values for 2-move sequences are calculated on-policy. Why not assume optimal behavior at the 2nd action (i.e. using a max operator instead of pi in [13]), and how important is this assumption?

8. In line 781, the author state that the online forgetting parameters used a beta prior with alpha=6 and beta=2, which has a mean of 0.75 and a standard deviation of only 0.14. I found this choice of prior to be rather specific, contrasting sharply with the more uninformative choices for all other priors. I would have loves to see how the simulations would have turned out if a uniform prior had been used. Would subjects still have been found to be forgetful?

MINOR COMMENTS:

9. Line 192-196: "we find that at high MF forgetting, replay confers a noticeable performance advantage to the agent provided that MB forgetting is mild (as can be seen from the curvative of the contour lines in Fig 2D)". However, Fig 2D only compares replay models with different forgetting rates. Does replay really confer a performance advantage in comparison to an agent that performs no replay? In other words, does replay with substantial forgetting hurt performance, and why/how?

10. Fig 3A is extremely difficult to parse. Same for Fig 3B (e.g. what do the axes mean?).

11. Avoid using asterisks in actions (as in equation [4]), which in RL is generally used to indicate an optimal action.

12. Similarly, in equation [4] and others, it was not clear why Q^{MF2} sometimes had three arguments and sometimes two.

13. R(s) is used to indicate a reward model in equations [12-13], and a sampled reward in [7-8]. Consider using different notation in each case.

14. On line 782, the authors state: "All the other parameters had gamma priors with location 1 and scale 1". Please list all such parameters explicitly.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. 2022 Jan 12;18(1):e1009634. doi: 10.1371/journal.pcbi.1009634.r002

Author response to Decision Letter 0

1 Oct 2021

Attachment

Submitted filename: Responses.pdf

Click here for additional data file.^{(623.8KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009634.r003

Decision Letter 1

Daniel Bush, Daniele Marinazzo

25 Oct 2021

Dear Mr. Antonov,

Thank you very much for submitting your manuscript "Optimism and pessimism in optimised replay" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Specifically, if you could incorporate some additional details regarding your definition of 'forgetting', as specified by Reviewer 1.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Daniel Bush

Associate Editor

PLOS Computational Biology

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I thank the authors for addressing my comments and for a comprehensive revision.

I believe most of the confusion was due to the prominent use of 'forgetting' in a context I was unfamiliar with. I would ask the authors to

(1) cite other work that that uses this definition of forgetting (i.e. decay to average reward rate) since they claim it is the standard and

(2) comment on whether this is a proposal for what forgetting is defined as in a psychological context, or if this is specific jargon used in technical reinforcement learning settings that does not necessarily bear on (folk) psychological notions of forgetting

Reviewer #2: The structure and clarity of the paper has greatly improved. Thank you very much for taking the time to implement these changes. I am sure that this will increase the impact of the paper.

I have only a few final minor suggestions left:

* Line 111: The caption says "Unfilled hexagons show epochs which contained trials without feedback", but the figure shows no unfilled hexagons.

* Fig 3B: Please add the legend indicating the meaning of the yellow and blue lines (currently, there's only a label for the dashed lines)

* Line 237: In the previous sentence (line 235-236), the authors say "provided that MB forgetting is mild", followed by "This means that despite moderate uncertainty in the transition structure, the agent is still able to improve its MF policy and increase the obtained reward rate." Don't the authors mean: "as long as there is little uncertainty in the transition (...)"?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

Data Requirements:

Reproducibility:

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. 2022 Jan 12;18(1):e1009634. doi: 10.1371/journal.pcbi.1009634.r004

Author response to Decision Letter 1

10 Nov 2021

Attachment

Submitted filename: Responses2.pdf

Click here for additional data file.^{(53.4KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009634.r005

Decision Letter 2

Daniel Bush, Daniele Marinazzo

12 Nov 2021

Dear Mr. Antonov,

We are pleased to inform you that your manuscript 'Optimism and pessimism in optimised replay' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Daniel Bush

Associate Editor

PLOS Computational Biology

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009634.r006

Acceptance letter

Daniel Bush, Daniele Marinazzo

20 Dec 2021

PCOMPBIOL-D-21-01256R2

Optimism and pessimism in optimised replay

Dear Dr Antonov,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Fig. Replay entropy range and pessimism bias.

(EPS)

Click here for additional data file.^{(2.8MB, eps)}

S2 Fig. Comparison of forgetting parameters.

(EPS)

Click here for additional data file.^{(2.2MB, eps)}

S3 Fig. Parameter equivalents of subjects’ decision flexibility.

(EPS)

Click here for additional data file.^{(2.2MB, eps)}

S4 Fig. Epistemology of replay: Another example.

(EPS)

Click here for additional data file.^{(5.1MB, eps)}

S5 Fig. How replay can hurt.

(EPS)

Click here for additional data file.^{(4.6MB, eps)}

S6 Fig. Effect of state-transition model re-arrangement.

(EPS)

Click here for additional data file.^{(2MB, eps)}

S7 Fig. On-line replay statistics with obliviated knowledge.

(EPS)

Click here for additional data file.^{(2MB, eps)}

S8 Fig. Entropy of pre-trained state-transition models.

(EPS)

Click here for additional data file.^{(514.1KB, eps)}

S9 Fig. Fitting errors.

Distribution of the final fitting errors (RMSD of the proportion of available reward collected between the human subjects and the agent).

(EPS)

Click here for additional data file.^{(387KB, eps)}

Attachment

Submitted filename: Responses.pdf

Click here for additional data file.^{(623.8KB, pdf)}

Attachment

Submitted filename: Responses2.pdf

Click here for additional data file.^{(53.4KB, pdf)}

Data Availability Statement

The code and data used to produce the results and analyses presented in this manuscript are available at https://github.com/geoant1/optimism_and_pessimism.

[pcbi.1009634.ref001] 1. O’Keefe J, Dostrovsky J. The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain research. 1971. [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref002] 2. O’Keefe J, Nadel L. The hippocampus as a cognitive map. Oxford: Clarendon Press; 1978. [Google Scholar]

[pcbi.1009634.ref003] 3. Wilson MA, McNaughton BL. Reactivation of hippocampal ensemble memories during sleep. Science. 1994;265(5172):676–679. doi: 10.1126/science.8036517 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref004] 4. Lee AK, Wilson MA. Memory of sequential experience in the hippocampus during slow wave sleep. Neuron. 2002;36(6):1183–1194. doi: 10.1016/S0896-6273(02)01096-6 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref005] 5. Foster DJ, Wilson MA. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature. 2006;440(7084):680–683. doi: 10.1038/nature04587 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref006] 6. Diba K, Buzsáki G. Forward and reverse hippocampal place-cell sequences during ripples. Nature neuroscience. 2007;10(10):1241–1242. doi: 10.1038/nn1961 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref007] 7. Dragoi G, Tonegawa S. Preplay of future place cell sequences by hippocampal cellular assemblies. Nature. 2011;469(7330):397–401. doi: 10.1038/nature09633 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref008] 8. Dragoi G, Tonegawa S. Distinct preplay of multiple novel spatial experiences in the rat. Proceedings of the National Academy of Sciences. 2013;110(22):9100–9105. doi: 10.1073/pnas.1306031110 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref009] 9. Pfeiffer BE, Foster DJ. Hippocampal place-cell sequences depict future paths to remembered goals. Nature. 2013;497(7447):74–79. doi: 10.1038/nature12112 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref010] 10. Grosmark AD, Buzsáki G. Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences. Science. 2016;351(6280):1440–1443. doi: 10.1126/science.aad1935 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref011] 11. Silva D, Feng T, Foster DJ. Trajectory events across hippocampal place cells require previous experience. Nature neuroscience. 2015;18(12):1772–1779. doi: 10.1038/nn.4151 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref012] 12. Eichenbaum H. Does the hippocampus preplay memories? Nature neuroscience. 2015;18(12):1701–1702. doi: 10.1038/nn.4180 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref013] 13. Foster DJ. Replay comes of age. Annual review of neuroscience. 2017;40:581–602. doi: 10.1146/annurev-neuro-072116-031538 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref014] 14. Singer AC, Frank LM. Rewarded outcomes enhance reactivation of experience in the hippocampus. Neuron. 2009;64(6):910–921. doi: 10.1016/j.neuron.2009.11.016 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref015] 15. Ólafsdóttir HF, Barry C, Saleem AB, Hassabis D, Spiers HJ. Hippocampal place cells construct reward related sequences through unexplored space. Elife. 2015;4:e06063. doi: 10.7554/eLife.06063 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref016] 16. Ambrose RE, Pfeiffer BE, Foster DJ. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron. 2016;91(5):1124–1136. doi: 10.1016/j.neuron.2016.07.047 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref017] 17. Sirota A, Csicsvari J, Buhl D, Buzsáki G. Communication between neocortex and hippocampus during sleep in rodents. Proceedings of the National Academy of Sciences. 2003;100(4):2065–2069. doi: 10.1073/pnas.0437938100 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref018] 18. Sirota A, Montgomery S, Fujisawa S, Isomura Y, Zugaro M, Buzsáki G. Entrainment of neocortical neurons and gamma oscillations by the hippocampal theta rhythm. Neuron. 2008;60(4):683–697. doi: 10.1016/j.neuron.2008.09.014 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref019] 19. Jadhav SP, Rothschild G, Roumis DK, Frank LM. Coordinated excitation and inhibition of prefrontal ensembles during awake hippocampal sharp-wave ripple events. Neuron. 2016;90(1):113–127. doi: 10.1016/j.neuron.2016.02.010 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref020] 20. Maingret N, Girardeau G, Todorova R, Goutierre M, Zugaro M. Hippocampo-cortical coupling mediates memory consolidation during sleep. Nature neuroscience. 2016;19(7):959–964. doi: 10.1038/nn.4304 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref021] 21. Rothschild G, Eban E, Frank LM. A cortical–hippocampal–cortical loop of information processing during memory consolidation. Nature neuroscience. 2017;20(2):251–259. doi: 10.1038/nn.4457 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref022] 22. Shin JD, Tang W, Jadhav SP. Dynamics of awake hippocampal-prefrontal replay for spatial learning and memory-guided decision making. Neuron. 2019;104(6):1110–1125. doi: 10.1016/j.neuron.2019.09.012 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref023] 23. Todorova R, Zugaro M. Isolated cortical computations during delta waves support memory consolidation. Science. 2019;366(6463):377–381. doi: 10.1126/science.aay0616 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref024] 24. Raichle ME. The brain’s default mode network. Annual review of neuroscience. 2015;38:433–447. doi: 10.1146/annurev-neuro-071013-014030 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref025] 25. Rissman J, Gazzaley A, D’Esposito M. Measuring functional connectivity during distinct stages of a cognitive task. Neuroimage. 2004;23(2):752–763. doi: 10.1016/j.neuroimage.2004.06.035 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref026] 26. Greicius MD, Supekar K, Menon V, Dougherty RF. Resting-state functional connectivity reflects structural connectivity in the default mode network. Cerebral cortex. 2009;19(1):72–78. doi: 10.1093/cercor/bhn059 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref027] 27. Jolles DD, van Buchem MA, Crone EA, Rombouts SA. Functional brain connectivity at rest changes after working memory training. Human brain mapping. 2013;34(2):396–406. doi: 10.1002/hbm.21444 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref028] 28. Kurth-Nelson Z, Economides M, Dolan RJ, Dayan P. Fast sequences of non-spatial state representations in humans. Neuron. 2016;91(1):194–204. doi: 10.1016/j.neuron.2016.05.028 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref029] 29. Kurth-Nelson Z, Barnes G, Sejdinovic D, Dolan R, Dayan P. Temporal structure in associative retrieval. Elife. 2015;4:e04919. doi: 10.7554/eLife.04919 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref030] 30. Liu Y, Dolan RJ, Kurth-Nelson Z, Behrens TE. Human replay spontaneously reorganizes experience. Cell. 2019;178(3):640–652. doi: 10.1016/j.cell.2019.06.012 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref031] 31. Eldar E, Lièvre G, Dayan P, Dolan RJ. The roles of online and offline replay in planning. ELife. 2020;9:e56911. doi: 10.7554/eLife.56911 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref032] 32. Liu Y, Mattar MG, Behrens TEJ, Daw ND, Dolan RJ. Experience replay is associated with efficient nonlocal learning. Science. 2021;372 (6544). doi: 10.1126/science.abf1357 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref033] 33. Ego-Stengel V, Wilson MA. Disruption of ripple-associated hippocampal activity during rest impairs spatial learning in the rat. Hippocampus. 2010;20(1):1–10. doi: 10.1002/hipo.20707 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref034] 34. Girardeau G, Benchenane K, Wiener SI, Buzsáki G, Zugaro MB. Selective suppression of hippocampal ripples impairs spatial memory. Nature neuroscience. 2009;12(10):1222–1223. doi: 10.1038/nn.2384 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref035] 35. Jadhav SP, Kemere C, German PW, Frank LM. Awake hippocampal sharp-wave ripples support spatial memory. Science. 2012;336(6087):1454–1458. doi: 10.1126/science.1217230 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref036] 36. Gridchyn I, Schoenenberger P, O’Neill J, Csicsvari J. Assembly-specific disruption of hippocampal replay leads to selective memory deficit. Neuron. 2020. 10.1016/j.neuron.2020.01.021 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref037] 37. Káli S, Dayan P. Off-line replay maintains declarative memories in a model of hippocampal-neocortical interactions. Nature neuroscience. 2004;7(3):286–294. doi: 10.1038/nn1202 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref038] 38. Hinton GE, Dayan P, Frey BJ, Neal RM. The “wake-sleep” algorithm for unsupervised neural networks. Science. 1995;268(5214):1158–1161. doi: 10.1126/science.7761831 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref039] 39.Sutton RS. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Machine learning proceedings 1990. Elsevier; 1990. p. 216–224.

[pcbi.1009634.ref040] 40. Momennejad I, Otto AR, Daw ND, Norman KA. Offline replay supports planning in human reinforcement learning. Elife. 2018;7:e32548. doi: 10.7554/eLife.32548 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref041] 41. Mattar MG, Daw ND. Prioritized memory access explains planning and hippocampal replay. Nature neuroscience. 2018;21(11):1609–1617. doi: 10.1038/s41593-018-0232-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref042] 42. Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018. [Google Scholar]

[pcbi.1009634.ref043] 43. Moore AW, Atkeson CG. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning. 1993;13(1):103–130. doi: 10.1007/BF00993104 [DOI] [Google Scholar]

[pcbi.1009634.ref044] 44. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience. 2005;8(12):1704–1711. doi: 10.1038/nn1560 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref045] 45. Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans’ choices and striatal prediction errors. Neuron. 2011;69(6):1204–1215. doi: 10.1016/j.neuron.2011.02.027 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref046] 46. Lee SW, Shimojo S, O’Doherty JP. Neural computations underlying arbitration between model-based and model-free learning. Neuron. 2014;81(3):687–699. doi: 10.1016/j.neuron.2013.11.028 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref047] 47. Moran R, Keramati M, Dayan P, Dolan RJ. Retrospective model-based inference guides model-free credit assignment. Nature communications. 2019;10(1):1–14. doi: 10.1038/s41467-019-08662-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref048] 48. Watkins CJCH. Learning from delayed rewards. 1989. [Google Scholar]

[pcbi.1009634.ref049] 49. Dayan P. Improving generalization for temporal difference learning: The successor representation. Neural Computation. 1993;5(4):613–624. doi: 10.1162/neco.1993.5.4.613 [DOI] [Google Scholar]

[pcbi.1009634.ref050] 50. Shannon CE. A mathematical theory of communication. The Bell system technical journal. 1948;27(3):379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x [DOI] [Google Scholar]

[pcbi.1009634.ref051] 51. Montague PR, Dolan RJ, Friston KJ, Dayan P. Computational psychiatry. Trends in cognitive sciences. 2012;16(1):72–80. doi: 10.1016/j.tics.2011.11.018 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref052] 52.Gagne C, Dayan P. Peril, Prudence and Planning as Risk, Avoidance and Worry. 2021. Available from: psyarxiv.com/tcn7e.

[pcbi.1009634.ref053] 53. Barraclough DJ, Conroy ML, Lee D. Prefrontal cortex and decision making in a mixed-strategy game. Nature neuroscience. 2004;7(4):404–410. doi: 10.1038/nn1209 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref054] 54. Ito M, Doya K. Validation of decision-making models and analysis of decision variables in the rat basal ganglia. Journal of Neuroscience. 2009;29(31):9861–9874. doi: 10.1523/JNEUROSCI.6157-08.2009 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref055] 55. Toyama A, Katahira K, Ohira H. Reinforcement learning with parsimonious computation and a forgetting process. Frontiers in human neuroscience. 2019;13:153. doi: 10.3389/fnhum.2019.00153 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref056] 56. Niv Y, Daw ND, Joel D, Dayan P. Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology. 2007;191(3):507–520. doi: 10.1007/s00213-006-0502-4 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref057] 57. Wixted JT. The psychology and neuroscience of forgetting. Annu Rev Psychol. 2004;55:235–269. doi: 10.1146/annurev.psych.55.090902.141555 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref058] 58. Friedrich J, Lengyel M. Goal-directed decision making with spiking neurons. Journal of Neuroscience. 2016;36(5):1529–1546. doi: 10.1523/JNEUROSCI.2854-15.2016 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref059] 59. Basanisi R, Brovelli A, Cartoni E, Baldassarre G. A generative spiking neural-network model of goal-directed behaviour and one-step planning. PLOS Computational Biology. 2020;16(12):e1007579. doi: 10.1371/journal.pcbi.1007579 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref060] 60. Schwartenbeck P, Baram A, Liu Y, Mark S, Muller T, Dolan R, et al. Generative replay for compositional visual understanding in the prefrontal-hippocampal circuit. bioRxiv. 2021; [Google Scholar]

[pcbi.1009634.ref061] 61. Russek EM, Momennejad I, Botvinick MM, Gershman SJ, Daw ND. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS computational biology. 2017;13(9):e1005768. doi: 10.1371/journal.pcbi.1005768 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref062] 62. Carey AA, Tanaka Y, van Der Meer MA. Reward revaluation biases hippocampal replay content away from the preferred outcome. Nature neuroscience. 2019; p. 1–10. [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref063] 63. O’Neill J, Boccara C, Stella F, Schönenberger P, Csicsvari J. Superficial layers of the medial entorhinal cortex replay independently of the hippocampus. Science. 2017;355(6321):184–188. doi: 10.1126/science.aag2787 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref064] 64. Stachenfeld KL, Botvinick MM, Gershman SJ. The hippocampus as a predictive map. Nature neuroscience. 2017;20(11):1643. doi: 10.1038/nn.4650 [DOI] [PubMed] [Google Scholar]

[pcbi.1009634.ref065] 65. Babichev A, Morozov D, Dabaghian Y. Replays of spatial memories suppress topological fluctuations in cognitive map. Network Neuroscience. 2019;3(3):707–724. doi: 10.1162/netn_a_00076 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009634.ref066] 66. Kaelbling LP, Littman ML, Cassandra AR. Planning and acting in partially observable stochastic domains. Artificial intelligence. 1998;101(1-2):99–134. doi: 10.1016/S0004-3702(98)00023-X [DOI] [Google Scholar]

[pcbi.1009634.ref067] 67. Silver D, Veness J. Monte-Carlo planning in large POMDPs. Neural Information Processing Systems; 2010. [Google Scholar]

[pcbi.1009634.ref068] 68. Turner BM, Van Zandt T. A tutorial on approximate Bayesian computation. Journal of Mathematical Psychology. 2012;56(2):69–85. doi: 10.1016/j.jmp.2012.02.005 [DOI] [Google Scholar]

[pcbi.1009634.ref069] 69. Jennings E, Madigan M. astroABC: an approximate Bayesian computation sequential Monte Carlo sampler for cosmological parameter estimation. Astronomy and computing. 2017;19:16–22. doi: 10.1016/j.ascom.2017.01.001 [DOI] [Google Scholar]

PERMALINK

Optimism and pessimism in optimised replay

Georgy Antonov

Christopher Gagne

Eran Eldar

Peter Dayan

Roles

Abstract

Author summary

Introduction

Results

Preamble

Fig 1. Task structure and replay modelling.

Modelling of subjects’ choices in the behavioural task

Fig 2. Algorithm description and the effects of replay and forgetting on model performance.

Fig 3. Incremental model comparison.

Exploration of parameter regimes

Fig 4. How MF forgetting influences gain estimation.

Fig 5. Action entropy limits estimated gain.

Fitting actual subjects

Fig 6. Epistemology of replay.

Fig 7. Overall on-task replay statistics across MI subjects.

Discussion

Methods

Task design

DYNA-like agent

Free parameters

Table 1. Free parameters of the algorithm.

Choices

Model-free value learning

Model-based learning

Replay

Forgetting

Initialisations

Parameter fitting

Replay analysis

Example simulations

Supporting information

Abbreviations

Data Availability

Funding Statement

References

Decision Letter 0

Daniel Bush

Daniele Marinazzo

Roles

Author response to Decision Letter 0

Decision Letter 1

Daniel Bush

Daniele Marinazzo

Roles

Author response to Decision Letter 1

Decision Letter 2

Daniel Bush

Daniele Marinazzo

Roles

Acceptance letter

Daniel Bush

Daniele Marinazzo

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases