eLife. 2020 Jun 17;9:e56911. doi: 10.7554/eLife.56911

The roles of online and offline replay in planning

Eran Eldar 1,2,3, Gaëlle Lièvre 2,3, Peter Dayan 4,5, Raymond J Dolan 2,3

Editors: Thorsten Kahnt 6, Kate M Wassum 7
PMCID: PMC7299337  PMID: 32553110

Abstract

Animals and humans replay neural patterns encoding trajectories through their environment, both whilst they solve decision-making tasks and during rest. Both on-task and off-task replay are believed to contribute to flexible decision making, though how their relative contributions differ remains unclear. We investigated this question by using magnetoencephalography (MEG) to study human subjects while they performed a decision-making task that was designed to reveal the decision algorithms employed. We characterised subjects in terms of how flexibly each adjusted their choices to changes in temporal, spatial and reward structure. The more flexible a subject, the more they replayed trajectories during task performance, and this replay was coupled with re-planning of the encoded trajectories. The less flexible a subject, the more they replayed previously preferred trajectories during rest periods between task epochs. The data suggest that online and offline replay both participate in planning but support distinct decision strategies.

Research organism: Human

eLife digest

Studies show that humans and animals replay past experiences in their brain. To do this, the brain creates a pattern of electrical activity for each part of a multistep experience and then plays them back in order. Humans and other animals can replay scenarios either while the experience is still happening (i.e. online replay) or later when they are resting or sleeping (i.e. offline replay). Being able to replay an experience and its outcome may help a person or animal plan a better course of action in the future. However, it is poorly understood how online and offline replay each contribute to such planning.

To answer this question, Eldar et al. used a brain imaging tool called magnetoencephalography (MEG for short) to measure the electrical activity inside the brain. This technique was able to detect replays in the brain of individuals performing a particular task, and later whilst they were resting.

In the experiments, 40 healthy volunteers played a game in which each location in a space was associated with an image, for example a frog or a traffic sign, and each image was given a value. Participants got paid for moving to more valuable images in one or two steps. Eldar et al. found that people who replay their steps during a task are able to adjust their choices on the fly, whereas individuals who replay their choices during rests tend to approach a task with a less flexible, more preformed plan.

Eldar et al. suggest that replaying an experience too much during rest and not enough in real-time might contribute to more rigid behaviors, a theory that could shed light on the mechanisms behind certain behavioral disorders such as obsessive compulsive disorder. However, more studies are needed to determine if these two different replay strategies play a causal role in human behavior.

Introduction

Online and offline replay are both suggested to contribute to decision making (Behrens et al., 2018; Diba and Buzsáki, 2007; Foster, 2017; Foster and Wilson, 2006; Gupta et al., 2010; Ji and Wilson, 2007; Kurth-Nelson et al., 2016; Louie and Wilson, 2001; Ólafsdóttir et al., 2017; Pezzulo et al., 2014; Skaggs and McNaughton, 1996; Ólafsdóttir et al., 2015; Stachenfeld et al., 2017), but their precise contributions remain unclear. Replay of experienced and expected state transitions during a task, either immediately before choice or following outcome feedback, is particularly well suited to mediate on-the-fly planning, where choices are evaluated based on the states to which they lead (this is known as model-based planning). Off-task replay might serve a complementary role of consolidating a model of a state space, which specifies how each state can be reached from other states as well as the values of each state. According to this perspective, both types of replay help subjects make choices that are flexibly adapted to current circumstances.

However, an alternative possibility is that off-task replay also directly participates in planning, by calculating and storing a (so-called model-free) decision policy that specifies in advance what to do in each state (Gershman et al., 2014; Mattar and Daw, 2018; Momennejad et al., 2018; Sutton, 1991). Such a pre-formulated policy is inherently less flexible than a policy constructed on the fly, but it reduces the need for subsequent online planning when time might be limited. Thus, rather than online and offline replay both supporting the same form of planning, this latter perspective suggests a trade-off between them. In other words, online replay promotes on-the-fly (model-based) flexibility, whereas offline replay establishes a stable (model-free) policy.

Despite the wide-ranging behavioural implications of a distinction between model-based and model-free planning (Crockett, 2013; Everitt and Robbins, 2005; Gillan et al., 2017; Kurdi et al., 2019), and much theorising on the role of replay in one or the other form of planning, to date there is little data indicating whether online and offline replay have complementary or contrasting impacts in this regard. However, recent advances in magnetoencephalography (MEG) analysis have made it possible to study replay in human subjects in relation to learning and decision-making behaviour (Eldar et al., 2018; Kurth-Nelson et al., 2016; Liu et al., 2019). This methodology involves three key steps. First, MEG signals are recorded before, during, and after the subject performs a task of interest. Second, the MEG time series are decoded so as to estimate the moment-by-moment probability that a subject is neurally representing each task element, even when the sensory processing associated with that element has ceased. This is the minimal requirement for replay. However, a special feature of replay is the coordination between the non-sensory neural representations of multiple, related, task elements. Thus, finally, the relationships between elements’ representational probability time series are examined to determine whether pairs of elements tended to be represented sequentially, one after the other.
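As a concrete illustration of the decoding step, the sketch below trains a single multinomial classifier on sensor patterns recorded while each image was on screen and then produces a moment-by-moment probability for each task element. This is a minimal sketch on simulated data; the classifier type, regularisation, and data shapes are assumptions rather than the pipeline used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical shapes: sensor patterns (trials x sensors) recorded while each of the
# 8 images was on screen, plus an on-task MEG recording (timepoints x sensors).
rng = np.random.default_rng(0)
train_X = rng.standard_normal((400, 272))   # pre-task exposure patterns
train_y = rng.integers(0, 8, size=400)      # which image was on screen (0-7)
task_X = rng.standard_normal((1000, 272))   # one sensor pattern per task timepoint

# One 8-way classifier over the images (regularisation choice is illustrative).
clf = LogisticRegression(C=0.1, max_iter=1000)
clf.fit(train_X, train_y)

# Moment-by-moment probability that each image is being represented,
# even when that image is no longer on the screen.
p_image = clf.predict_proba(task_X)
print(p_image.shape)   # (1000, 8)
```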

Here, we use this methodology to test the relationship between both online and offline replay and key aspects of decision flexibility that dissociate model-free and model-based forms of planning (Daw et al., 2011). For this purpose, we first recorded MEG signals from human subjects during rest and while they navigated a specially designed state space. We next characterised each individual subject’s decision-making flexibility based on their task behaviour, and we then analysed their MEG signals to look for evidence of on-task (Eldar et al., 2018) and off-task (Kurth-Nelson et al., 2016; Liu et al., 2019) sequences of state representations.

Results

Individual differences in decision flexibility

We used distinct visual images to represent eight unique states, where occupancy of each state provided a different amount of reward (Figure 1a). Subjects started each trial at a random state and had to choose a movement direction in order to collect reward from subsequent states (Figure 1b). Subjects learnt beforehand how much reward was associated with each state (Figure 1—figure supplement 1), but they did not know initially where states were in relation to one another. The latter aspect of task structure had to be acquired through trial and error learning in order for subjects to be able to implement subsequent moves that delivered the maximal amount of reward. We assessed subjects’ flexibility in three ways. First, after the initial two blocks of trials, we changed the reward associated with each state (Figure 1a; grey numbers) such that persisting with previously optimal moves would result in below-chance performance. Second, after two additional blocks of trials, we informed subjects that two specified pairs of states had switched positions (Figure 1a; ‘switch’), again rendering the previously optimal policy now suboptimal. A flexible model-based planner would be capable of re-planning their moves perfectly following each of these instructed changes, since such a planner has acquired knowledge as to how each state can be reached. Conversely, a pure model-free planner would need to learn a new policy from scratch via trial and error each time there is a change, since such an agent only possesses a now outdated policy that specifies where to move from each state.

Figure 1. Subjects differed in decision flexibility.

(a) Experimental task space. Before performing the main task, subjects learned state-reward associations (numbers in black circles) and they were then gradually introduced to the state space in a training session. After performing the main task for two blocks of trials, subjects learned new state-reward associations (numbers in dark grey circles) and then returned to the main task. Before a final block of trials, subjects were informed of a structural task change such that ‘House’ switched position with ‘Tomato’, and ‘Traffic sign’ switched position with ‘Frog’. The bird’s eye view shown in the figure was never seen by subjects. Subjects only saw where they started from on each trial and, after completing a move, the state to which their move led. The map was connected as a torus (e.g., starting from ‘Tomato’, moving right led to ‘Traffic sign’, and moving up or down from the tomato led to ‘Pond’). (b) Each trial started from a pseudorandom location from whence subjects were allowed either one (‘1-move trial’) or two (‘2-move trial’) consecutive moves (signalled at the start of each set of six trials), before continuing to the next trial. Outcomes were presented as images alone, and the associated reward points were not shown. A key design feature of the map was that in 5 out of 6 trials the optimal (first) move was different depending on whether the trial allowed for one or two moves. For instance, given the initial image-reward associations (black) and image positions, the best single move from ‘Face’ is LEFT (9 points), but when two moves are allowed the best moves are RIGHT and then DOWN (5 + 9 points in total). Note that the optimal moves differed also given the second set of image-reward associations. On ‘no-feedback’ trials (which started all but the first block), outcome images were also not shown (i.e., in the depicted trials, the ‘Wrench’, ‘Tomato’ and ‘Pond’ would appear as empty circles). (c) The proportion of obtainable reward points collected by the experimental subjects, and by three simulated learning algorithms. Each data point corresponds to 18 trials (six 1-move and twelve 2-move trials), with 54 trials per block. The images to which subjects moved were not shown to subjects for the first 12 trials of Blocks II to V (the corresponding ‘Without feedback’ data points also include data from 6 initial trials with feedback wherein starting locations had not yet repeated, and thus, subjects’ choices still reflected little new information). All algorithms were allowed to forget information so as to account for post-change performance drops as best fitted subjects’ choices (see Materials and methods for details). Black dashed line: chance performance. Shaded area: SEM. (d) Proportion of first choices that would have allowed collecting maximal reward where one (‘1-optimal’) or two (‘2-optimal’) consecutive moves were allowed. Choices are shown separately for what were in actuality 1-move and 2-move trials. Subjects are colour coded from lowest (gold) to highest (red) degree of flexibility in adjusting to one vs. two moves (see text). Dashed line: chance performance (33%, since up and down choices always lead to the same outcome). (e,f) Decrease in collected reward following a reward-contingency (e) and spatial (f) change, as a function of the index of flexibility (IF) computed from panel d. Measures are corrected for the impact of pre-change performance level using linear regression. p value derived using a permutation test.

Figure 1—figure supplement 1. Image-reward training: Timeline of a trial.

Figure 1—figure supplement 2. Example sketches of the state space by a representative subject.

Subjects sketched the state space at the end of the experiment, recalling how it had been structured before (a) and after (b) the position change. On average, subjects sketched much of the state spaces accurately (correct state transitions: first map M=0.65, SEM=0.06; second map M=0.56, SEM=0.06; chance = 0.14, p<0.001, Bootstrap test). (a,b) Sketches by a representative subject with 0.58 accuracy for the state space before (a) and after (b) the spatial change. Erroneous transitions are marked in red. (c,d) The actual state spaces the subject navigated before (c) and after (d) the position change.
Figure 1—figure supplement 3. Evidence of advance prospective planning in flexible subjects.

n = 40 subjects. (a) Proportion of optimal choices in second moves for trials without feedback, as a function of individual index of flexibility (IF). In such trials, second moves were enacted without seeing the state they were made from. Measures are corrected using linear regression for accuracy of non-blind moves from the same phases of the experiment. (b) Validation of move decoder. The plot shows the decodability of chosen and unchosen moves from MEG data recorded during the main task. Decodability was computed as the probability assigned to the chosen move (right, left, up or down) by a 4-way classifier based on each timepoint’s spatial MEG pattern, minus the average probability assigned to the same moves at baseline (400 ms preceding trial onset). A separate decoder was trained for each subject on MEG data recorded outside of the main task, during the image-reward association training phases. (c,d) Decodability of second moves (the blue arrow in the bottom example cartoon) in 2-move trials during first move choice (c) and presentation of the first outcome (d), as a function of IF. For display purposes only, mean time series are shown separately for subjects with high (above median) and low (below median) IF. In all panels, dark gray bars indicate timepoints where the 95% Credible Interval excludes zero and Cohen’s d > 0.1 (Bayesian Gaussian Process analysis). Dashed lines: chance decodability level.
Figure 1—figure supplement 4. Individual flexibility reflected the balance between MB and MF planning.

n = 40 subjects. (a) Actual and simulated individual flexibility (IF). Task performance was simulated using subjects’ best-fitting parameter settings. IF was computed for each simulated subject and averaged over 100 simulations. (b) Relationship between IF and individually-fitted parameters. IF was regressed on subjects’ best-fitting parameter settings, including all learning (η), memory (τ), and inverse temperature (β) parameters. Parameters are color-coded by the component of the algorithm they enhance. Error bars: 95% CI.

Examining how subjects’ overall performance altered immediately following these changes revealed a decrement in average performance (Figure 1c). However, there were substantial individual differences in this regard, with some subjects seamlessly adapting to reward and position changes, and others showing drops in performance to chance levels. Subjects whose performance showed a strong decline following a reward change also tended to cope poorly with the position change (ρ=0.50, partial correlation controlling for performance levels before the changes; p=0.001, Permutation test).

As a third, more continuous, test of a different aspect of decision flexibility, we interleaved sets of six trials in which only a single move was allowed (‘1-move trials’) with trials which allowed two consecutive moves (‘2-move trials’; Figure 1b). In 2-move trials, subjects were rewarded for both states they visited, and thus, an optimal course of action often required subjects to move first to an initial low-reward state in order to gain access to a high-reward state with their second move. Thus, we defined an individual index (IF) of decision flexibility as the difference between the proportion of moves that were optimal given the actual number of allotted moves and the proportion of moves that would have been optimal given a different number of allotted moves (i.e., had 1-move trials instead involved two moves and 2-move trials involved one move). An IF value of zero implies no net adjustment, while positive IF values imply advantageous flexibility.
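For concreteness, the index of flexibility defined above can be computed per subject roughly as follows. This is a minimal sketch with hypothetical variable names, considering only the first choice on each trial.

```python
import numpy as np

def index_of_flexibility(trial_type, first_move, optimal_1move, optimal_2move):
    """trial_type: number of moves allowed on each trial (1 or 2).
    first_move: the first move chosen on each trial.
    optimal_1move / optimal_2move: the first move that would be optimal from that
    trial's starting state if one / two moves were allowed."""
    trial_type, first_move = np.asarray(trial_type), np.asarray(first_move)
    optimal_1move, optimal_2move = np.asarray(optimal_1move), np.asarray(optimal_2move)
    # Proportion of first moves optimal given the actual number of allotted moves
    actual = np.where(trial_type == 1, first_move == optimal_1move,
                      first_move == optimal_2move)
    # Proportion that would have been optimal given the other number of moves
    counterfactual = np.where(trial_type == 1, first_move == optimal_2move,
                              first_move == optimal_1move)
    return actual.mean() - counterfactual.mean()

# Toy usage with made-up choices: positive values imply advantageous flexibility.
trial_type = [1, 1, 2, 2, 2, 1]
first_move = ['L', 'L', 'R', 'R', 'L', 'R']
optimal_1  = ['L', 'L', 'L', 'L', 'L', 'R']
optimal_2  = ['R', 'R', 'R', 'R', 'R', 'L']
print(index_of_flexibility(trial_type, first_move, optimal_1, optimal_2))  # ~0.67
```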

The results indicate subjects adjusted their choices advantageously to the number of allotted moves (+0.21, SEM 0.05, p < 0.001, Bootstrap test), though there was again evidence of substantial individual differences (Figure 1d). Importantly, IF correlated with how well a subject coped with the reward-contingency (Figure 1e) and position (Figure 1f) changes, as well as with how accurately they could sketch maps of the state space at the end of the experiment (r=0.51, p<0.001, Permutation test; Figure 1—figure supplement 2), indicating that more flexible subjects acquired and utilised a model of the state space.

Planning two steps into the future

Having a cognitive model that specifies how states are spatially organised makes it possible to plan several steps into the future. To test whether subjects were able to do this, we challenged them with 12 ‘without-feedback’ trials at the beginning of each of the last four blocks, during which outcome images were not shown. This meant that in 2-move trials subjects had to choose their second move ‘blindly’, without having seen the image to which their previous move had led (e.g., the tomato in Figure 1b). We found that subjects performed above chance on these blind second moves (proportion of optimal choices: 0.56, SEM 0.03; chance = 0.45; p<0.001, Bootstrap test), and this was the case even immediately following position and reward changes, when subjects could not have relied on previously tested 2-move sequences (0.52, SEM 0.03; p=0.01, Bootstrap test). Most importantly, such blind-move success was correlated with IF (Figure 1—figure supplement 3a).

This result indicates that more flexible subjects were better able to plan two steps into the future when required. Examining response times suggested that flexibility was associated with advance planning even when it was not required. Thus, we found that IF correlated with quicker execution of second moves in general (Spearman correlation with median reaction time: r=-0.61, p<0.001, Permutation test). To determine whether advance planning was indeed generally associated with flexibility, we examined at what point during a trial subjects’ choices became decodeable from MEG signals. For this purpose, we trained a decoder to identify chosen moves from MEG signals recorded outside of the main task (see Materials and methods for details). Validating the decoder on MEG data from the main task showed that chosen moves became gradually more evident over the course of the trial, with decodability peaking 140 ms before a choice was made (Figure 1—figure supplement 3b).
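The decodability measure used here and in Figure 1—figure supplement 3b (the probability assigned to the chosen move, minus the average probability assigned to that move during a pre-trial baseline) can be expressed along the following lines. This is a minimal sketch with simulated decoder output; the baseline window and data shapes are assumptions.

```python
import numpy as np

def move_decodability(p_moves, chosen, baseline_window):
    """p_moves: (n_timepoints, 4) decoded probabilities for the four moves.
    chosen: index of the chosen move (right/left/up/down).
    baseline_window: slice of timepoints preceding trial onset.
    Returns, per timepoint, the probability assigned to the chosen move minus the
    average probability assigned to that move during the baseline window."""
    baseline = p_moves[baseline_window, chosen].mean()
    return p_moves[:, chosen] - baseline

# Toy usage with random 4-way decoder output (200 timepoints)
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(4), size=200)
d = move_decodability(p, chosen=2, baseline_window=slice(0, 40))
print(d.shape)   # (200,) -- values near zero for random data
```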

Thus, we used the move decoder to test whether second-move choices began to materialise in the MEG signal even before subjects observed the outcomes of their first moves. We found that chosen second moves were indeed decodeable already during first-move choices (decodability: M= 0.006, 95% Credible Interval = 0.004 to 0.008, Bayesian Gaussian Process analysis; Figure 1—figure supplement 3c) and prior to the appearance of the first outcome (decodability: M=0.004, 95% Credible Interval = 0.002 to 0.006; Figure 1—figure supplement 3d). Importantly, this early decodability was correlated with IF (β: M=0.29, 95% Credible Interval = 0.24 to 0.34). By contrast, later decodability, following the onset of the second image, did not correlate with IF (β: M=0.02, 95% Credible Interval = −0.02 to 0.05). Thus, neural and behavioural evidence concur with the notion that flexibility was associated with planning second moves prospectively.

Individual flexibility reflected MF-MB balance

These convergent results suggest that IF reflected deployment of a model-based planning strategy. To test this formally, we compared how well different model-free (MF) and model-based (MB) decision algorithms, as well as a combination of both, explained subjects’ choices. Importantly, we enhanced these algorithms to maximise their ability to mimic one another (see Materials and methods for details). Thus, for instance, the MF algorithm included separate 1-move and 2-move policies, which allow it to achieve optimal adjustment to trial type given sufficient experience.

We found that a hybrid of MF and MB algorithms substantially outperformed either algorithm alone (Bayesian Information Criterion [Bishop, 2006]: MF = 40821, MB = 43249, MF-MB hybrid = 39908), suggesting that subjects employed a mix of model-free and model-based planning strategies. Simulating task performance using the hybrid algorithm showed that it adequately captured behavioural differences between subjects (correlation between real and simulated IF: r=0.92, p<0.001, Permutation test; Figure 1—figure supplement 4a). When we examined each subject’s best-fitting parameter values to determine which of these covaried with IF, we found that 84% of inter-individual variance in IF was explained by three parameters that control the balance between model-based and model-free planning (Figure 1—figure supplement 4b). Importantly, less flexible subjects had comparable learning rates and a higher model-free inverse temperature parameter (in 2-move trials), indicating that lower flexibility did not reflect a non-specific impairment but rather was associated with enhanced deployment of a model-free algorithm. Thus, our index of flexibility specifically reflected the influence of model-based, as compared to model-free, planning.
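For reference, the BIC values reported above follow the standard formula (k·ln n − 2·log-likelihood; lower is better). The sketch below shows the computation with placeholder log-likelihoods and parameter counts, not the fitted values from the study.

```python
import numpy as np

def bic(log_likelihood, n_params, n_observations):
    # Bayesian Information Criterion: lower values indicate a better fit
    # after penalising for the number of free parameters.
    return n_params * np.log(n_observations) - 2.0 * log_likelihood

# Placeholder numbers for illustration only (not the fitted values).
n_choices = 40 * 270   # e.g. 40 subjects x ~270 scored choices each
candidates = {
    "MF":        dict(ll=-20000.0, k=40 * 10),
    "MB":        dict(ll=-21400.0, k=40 * 8),
    "MF-MB mix": dict(ll=-19300.0, k=40 * 14),
}
for name, c in candidates.items():
    print(name, round(bic(c["ll"], c["k"], n_choices)))
```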

On-task replay is induced by prediction errors and associated with high flexibility

In rodents, reinstatement of past states, potentially in the service of planning, is evident both prior to choices (Pfeiffer and Foster, 2013) and following observation of outcomes (Pezzulo et al., 2014). Thus, we first determined at what point states were neurally reinstated during our task. For this purpose, we trained MEG decoders to identify the images subjects were processing (Figure 2a). Such decoders robustly reveal stimulus representations that are reinstated from memory and contribute to decision processes (Eldar et al., 2018; Kurth-Nelson et al., 2015; Bishop, 2006; Pfeiffer and Foster, 2013). Crucially, image decoders were trained on MEG data collected prior to subjects having any knowledge about the task, ensuring that the decoding was free of confounds related to other task variables (Figure 2—figure supplement 1). Applying these decoders to MEG signals from the main task, we found no evidence of prospective representation of the outcome states (images) to which subjects would transition at choice (Figure 2—figure supplement 2a). Instead, we found strong evidence that following outcomes (corresponding to new states to which subjects had transitioned), subjects represented the states from which they had just moved (t=3.4, p=0.001, Permutation test; Figure 2—figure supplement 2b). Consequently, we examined in detail the MEG data recorded following each outcome for evidence of replay of state sequences that subjects had just traversed.

Figure 2. Sequenceness analysis.

(a) Validation of the image MEG decoder used for the sequenceness analyses. n = 40 subjects. The plot shows the decodability of starting images from MEG data recorded during the main task at trial onset. Decodability was computed as the probability assigned to the starting image by an 8-way classifier based on each timepoint’s spatial MEG pattern, minus chance probability (0.125). (b) Schematic depiction of the sequenceness analysis used to determine whether representational probabilities of pairs of elements followed one another in time. Sequenceness is computed as the difference between the cross-correlation of two time series with positive and negative time lags. Since it focuses on asymmetries in the cross correlation function, this measure is useful for detecting sequential relationships even between closely correlated (or anti-correlated) time series. Negative sequenceness indicates signals are ordered in reverse.

Figure 2—figure supplement 1. Decoding procedure.

(a) Pre-task stimulus exposure on which decoders were trained. Timeline of a trial. (b) Decoding contribution by sensor. Contribution was quantified as the Spearman correlation between MEG signal and decoder output within trials for each stimulus. Correlations were then averaged over stimuli, trials and subjects.
Figure 2—figure supplement 2. Previous, not subsequent, states were encoded in MEG.

(a) Decodability during choice, of the image to which the chosen move led subsequently. (b) Decodability following outcome, of images subjects had visited earlier in the trial. In panels a and b, the analysis excluded decoded probabilities assigned to the image presently on the screen. Dark gray bars indicate timepoints where the 95% Credible Interval excludes zero and Cohen’s d>0.1 (Bayesian Gaussian Process analysis). Dashed lines: chance decodability level. Example trials are shown below the plots with decoded elements marked in blue. (c) Decoding is shown separately for non-terminal states from which subjects moved (here face) and for trials’ terminal states, while aligning the offsets of both types of state. The top (blue) and bottom (grey) x axes denote time of non-terminal and terminal state decoding, respectively, given that the terminal state is the outcome of the non-terminal state. Non-terminal states were followed by outcomes (here tomato) within 1 to 1.1 s. By comparison, terminal states were followed by a new trial within 1.3 to 1.6 s. The top bar indicates significant post-outcome decodability (p < 0.001, permutation test). Examination of the entire sample (n = 40) shows post-outcome decodability was higher in subjects with greater evidence of on-task sequenceness (r = 0.32, p = 0.04, permutation test).

To test for evidence of replay, we applied a measure of ‘sequenceness’ to the decoded MEG time series, a metric we previously showed is sensitive in detecting replay of experienced and decision-related sequences of states (Eldar et al., 2018; Kurth-Nelson et al., 2016; Liu et al., 2019; Figure 2b). Importantly, sequenceness is not sensitive to simultaneous covariation, and thus, it is only found if stimulus representations follow one another in time (Eldar et al., 2018; as in previous work, we allowed for inter-stimulus lags of up to 200 ms). Thus, following each outcome, we computed sequenceness between the decoded representations of the preceding and the outcome state (Figure 3a). Additionally, MEG signals recorded following the second outcome in 2-move trials were also tested for sequenceness reflecting the trial’s first transition (i.e., between the starting state and first outcome; Figure 3b).
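To make the sequenceness measure concrete, the sketch below cross-correlates two decoded probability time series at each lag in both directions and averages the forward-minus-backward difference over lags. It is a minimal sketch on simulated data; the 100 Hz sampling rate implied by the 10-200 ms lag range is an assumption.

```python
import numpy as np

def sequenceness(p_a, p_b, max_lag):
    """Asymmetry of the cross-correlation between two decoded probability
    time series: positive values indicate that representations of A tend to
    precede representations of B; negative values indicate the reverse order."""
    vals = []
    for lag in range(1, max_lag + 1):
        forward = np.corrcoef(p_a[:-lag], p_b[lag:])[0, 1]   # A leads B by `lag` samples
        backward = np.corrcoef(p_b[:-lag], p_a[lag:])[0, 1]  # B leads A by `lag` samples
        vals.append(forward - backward)
    return np.mean(vals)

# Toy example: B's representation echoes A's 13 samples (~130 ms at 100 Hz) later.
rng = np.random.default_rng(2)
a = rng.random(600)
b = np.roll(a, 13) + 0.5 * rng.random(600)
print(sequenceness(a, b, max_lag=20))   # > 0: forward ("A then B") sequenceness
```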

Figure 3. On-task replay of state-to-state trajectories as a function of individual flexibility.

n = 40 subjects. (a) Sequenceness corresponding to a transition from the image the subject had just left (‘Start image’; in the cartoon at the bottom, the face) to the image to which they arrived (‘outcome image’; the tomato) following highly surprising outcomes (i.e., above-mean state prediction error). In the cartoon, the white arrow indicates the actual action taken on the trial; the blue arrow indicates the sequence that is being decoded. For display purposes alone, mean time series are shown separately for subjects with high (above median) and low (below median) IF. Positive sequenceness values indicate forward replay and negative values indicate backward replay. As in previous work (Eldar et al., 2018), sequenceness was averaged over all inter-image time lags from 10 ms to 200 ms, and each timepoint reflects a moving time window of 600 ms centred at the given time (e.g., the 1 s timepoint reflects MEG data from 0.7 s to 1.3 s following outcome). Dashed lines show mean data generated by a Bayesian Gaussian Process analysis, and the dark gray bars indicate timepoints where the 95% Credible Interval excludes zero and Cohen’s d > 0.1. The top plot shows IF as a function of sequenceness for the timepoint where the average over all subjects was maximal. p value derived using a permutation test. Dot colours denote flexibility rank. (b) Sequenceness following the conclusion of 2-move trials corresponding to a transition from the starting image to the first outcome image. (c) Difference in the probability of subsequently choosing a different transition as a function of sequenceness recorded at the transition’s conclusion. For display purposes only, sequenceness is divided into high (i.e., above mean) and low (i.e., below mean). A correlation analysis between sequenceness and probability of policy change showed a similar relationship (Spearman correlation: M=-0.04, SEM=0.02, p=0.04, Bootstrap test). Sequenceness was averaged over the first cluster of significant timepoints from panels a) and b), in subjects with non-negligible inferred sequenceness (more than the standard deviation divided by 10; n=25), for the first time the subject chose each trajectory. Probability of changing policy was computed as the frequency of choosing a different move when occupying precisely the same state again. 0 corresponds to the average probability of change (51%). Error bars: s.e.m.

Using an hierarchical Bayesian Gaussian Process approach (see Materials and methods for details) we tested for timepoints at which sequenceness was evident and correlated with individual flexibility. This method directly corrects for comparison across multiple timepoints by accounting for the dependency between them (Kruschke, 2014). Since replay is thought to be induced by surprising observations (Mattar and Daw, 2018; Momennejad et al., 2018; Moore and Atkeson, 1993; Peng, 1993), we also included surprise about the outcome (i.e. the state prediction error inferred by the hybrid algorithm) as a predictor of sequenceness. We found significant sequenceness encoding the last experienced state transition (from 50 to 330 ms and from 820 to 950 ms following outcome onset; Figure 3a; note that the median split is for display purposes alone; actual analyses depended on the continuous flexibility index) and, at the conclusion of 2-move trials, also the penultimate transition (from 130 ms before to 350 ms following outcome onset; Figure 3b). These sequences were accelerated in time, with an estimated lag of 130 ms between the images, and were encoded in a ‘forward’ direction corresponding to the order actually visited. Moreover, later in the post-outcome epoch, the penultimate transition was also replayed backwards (from 440 to 940 ms following outcome onset). As would be expected, the finding of sequenceness was associated with enhanced decoding of previously visited states during the post-outcome epoch (Figure 2—figure supplement 2c).

More importantly, we found this evidence of replay, across all timepoints, was correlated with IF (mean β=0.17, 95% Credible Interval = 0.13 to 0.20), with surprise about the outcome (mean β=0.06, CI = 0.03 to 0.10), and with the interaction of these two factors (mean β=0.19, CI = 0.15 to 0.22). Thus, sequenceness was predominantly evident following surprising outcomes in subjects with high index of flexibility. This result is consistent with online replay contributing to post-outcome model-based planning.

On-task replay is associated with changes of policy

Recent theorising regarding the role of replay in planning argues that replay is preferentially induced when there is benefit to updating one’s policy (Mattar and Daw, 2018). Although policy updates can either reinforce or inhibit a chosen move, in our experiment subsequently avoided choices were more likely to be followed by a policy update than subsequently repeated choices, since a substantial proportion of the latter already reflected a well-informed policy (as evidenced by subjects’ above-chance performance in Figure 1c). Thus, a role for replay in planning predicts that subjects should be more disposed to replay trajectories that they might not want to choose again, rather than trajectories whose choice reflects a firm policy. To determine whether decodable on-task replay was associated with behavioural change, we tested the relationship between sequenceness corresponding to each move that subjects chose, and the probability of making a different choice when occupying the same state later on. We found that moves after which high forward sequenceness was evident were less likely to be re-chosen subsequently (Figure 3c). These policy changes increased the proportion of obtained reward (M=+11.1%, SEM=1.5%, p=0.001). Thus, evidence of online replay of a chosen trajectory was coupled with advantageous re-evaluation of that choice.
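The probability-of-policy-change measure used here (how often a different move was chosen the next time the same state was occupied) can be computed roughly as follows. This is a minimal sketch with hypothetical data, ignoring the pairing with sequenceness and the distinction between first and second moves.

```python
import numpy as np

def policy_change_rate(states, moves):
    """For each visit to a state, did the subject choose a different move
    than on the next visit to the same state?"""
    states, moves = np.asarray(states), np.asarray(moves)
    changes = []
    for s in np.unique(states):
        idx = np.where(states == s)[0]
        # Compare each visit's move with the move at the following visit to s
        changes.extend(moves[idx[:-1]] != moves[idx[1:]])
    return np.mean(changes) if changes else np.nan

# Toy usage: state 'face' is visited three times; the move changes once.
states = ['face', 'pond', 'face', 'pond', 'face']
moves  = ['L',    'U',    'L',    'U',    'R']
print(policy_change_rate(states, moves))   # 1 change out of 3 comparisons ~ 0.33
```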

Off-task replay is induced by prediction errors and associated with low flexibility

We next studied off-task replay, examining MEG data recorded during the 2-minute rest period that preceded each experimental block. Since each block included five frequently repeating starting states, we computed sequenceness for the five most frequent image-to-image transitions subjects chose before and after each rest period (mean choice frequency = 8.4 repetitions per block). As a control analysis, we also examined sequenceness for the five least frequently chosen transitions from the same starting states (mean choice frequency = 1.0 repetitions per block). We found significant evidence for sequenceness throughout the rest periods for frequent transitions (M = 0.002, SEM = 0.001, p=0.01, Bootstrap test). By contrast, we found no evidence of sequenceness for the infrequent transitions (M < 0.001, SEM = 0.001, p=0.47, Bootstrap test). Frequent transitions were replayed in a forward direction, with an estimated time lag of 180 ms between images, and prioritised trajectories that induced more reward prediction errors in the previous block (correlation of sequenceness with sum of absolute model-free reward prediction errors inferred by the hybrid algorithm: M= 0.04, SEM=0.018, p=0.03, Bootstrap test). Most importantly, off-task sequenceness correlated negatively with IF (Figure 4). This association of sequenceness during rest with low flexibility is consistent with a proposed role for offline replay in establishing model-free policies (Gershman et al., 2014; Mattar and Daw, 2018; Momennejad et al., 2018; Sutton, 1991).
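For illustration, selecting the five most frequently chosen image-to-image transitions in a block (the candidates tested for rest-period sequenceness) might look like the sketch below; the transition counts shown are made up.

```python
from collections import Counter

def most_frequent_transitions(transitions, k=5):
    """transitions: list of (from_image, to_image) pairs chosen in a block.
    Returns the k most frequently chosen image-to-image transitions."""
    counts = Counter(transitions)
    return [pair for pair, _ in counts.most_common(k)]

# Toy usage with made-up choice counts
block = [("face", "tomato")] * 8 + [("pond", "frog")] * 6 + [("hand", "house")] * 1
print(most_frequent_transitions(block, k=2))   # [('face', 'tomato'), ('pond', 'frog')]
```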

Figure 4. Off-task replay of past and future trajectories.

n = 40 subjects. Individual flexibility as a function of sequenceness in rest MEG data for the five most frequently experienced image-to-image transitions. For each rest period, sequenceness was averaged over transitions from both the preceding and following blocks of trials. p value derived using a permutation test. Dot colours denote flexibility rank.

Off-task replay can predict subsequently chosen sequences

If offline replay is involved in planning, then its content should predict subjects’ subsequent choices. To test this, we dissociated the replay of experienced trajectories from that of planned trajectories, by focusing on the third rest period after which the optimal image-to-image transitions changed entirely (due to a change in state-reward associations; see Figure 1). As subjects had been taught about the reward change before this rest period, the rest afforded subjects an opportunity to re-plan their choices accordingly.

We first examined the behavioural effect of the state-reward change in more detail. The most frequently chosen transitions in the block that followed the third rest period differed from the transitions most frequently chosen in the preceding block (overlap: M=14%, SEM=3%), and this policy change was substantially greater than for the other rest periods (overlap: M=53%, SEM=2%). As expected, the newly chosen transitions from the following block were advantageous given the new state-reward associations (reward collected: M=71%, SEM=2%; chance = 60%) and disadvantageous given the state-reward associations that had so far applied (M=52%, SEM=2%).

Given the behavioural change, we focused our examination of the MEG data on evidence for sequenceness during this crucial third rest period. We found that subjects indeed replayed the transitions they subsequently chose (M=0.004, SEM=0.002, p=0.02, Bootstrap test). This replay of subsequently chosen moves indicates subjects utilised a model of the task to re-plan their moves offline (Momennejad et al., 2018; Gershman et al., 2014; Mattar and Daw, 2018). Our reasoning here is that re-planning in light of the new reward associations, before subjects experienced them in practice, requires a model that specifies how to navigate from one state to another. Indeed, multiple regression analysis showed that low IF was only associated with sequenceness encoding previously chosen transitions (β=-0.35, t37=2.25, p=0.03), whereas the replay of subsequently chosen transitions did not correlate with IF (β=0.004, t37=0.03, p=0.97). On the other hand, the lack of a flexibility enhancement associated with prospective offline replay might indicate, as might be expected, that offline planning is ill-suited to enhancing trial-to-trial flexibility.

Discussion

We find substantial differences in the behaviour of individual subjects in a simple state-based sequential decision-making task, differences that correspond to a distinction in the nature, and apparent effects, of MEG-recorded on- and off-task replay of state trajectories (Figure 5). These results bolster important behavioural dissociations, as well as provide substantial new insights into the control algorithms that subjects employ. The findings fit comfortably with an evolving literature that addresses human replay and preplay (Eldar et al., 2018; Kurth-Nelson et al., 2015; Kurth-Nelson et al., 2016; Liu et al., 2019; Schuck and Niv, 2019).

Figure 5. Individual flexibility and evidence of replay.

The figure illustrates typical results for individuals with low (i.e., below median) and high (i.e., above median) flexibility. More flexible subjects advantageously adjusted their choices to the number of allotted moves. Such flexibility was associated with evidence of primarily forward replay following surprising outcomes, encoding the chosen transitions that led to those outcomes, and coupled with a reevaluation of those choices. Less flexible subjects showed evidence of forward replay during rest, encoding previously and subsequently chosen transitions.

The distinction between model-based and model-free reasoning has intuitive appeal, and close associations with many well-established psychological distinctions (Kahneman, 2011; Stanovich and West, 2000). However, popular tasks for investigating this distinction (Daw et al., 2011; Decker et al., 2016; Gillan et al., 2017; Gläscher et al., 2010) have been criticised for offering a better grasp on model-based compared to model-free reasoning processes (Gillan et al., 2015; da Silva and Hare, 2019); for rewarding model-based reasoning indifferently (Kool et al., 2016); and for admitting complex model-free strategies that can masquerade as being model-based (Akam et al., 2017). In our new task, we show a convergence between superficially divergent methods for distinguishing model-based from model-free control – flexibility to immediate task demands (one-step versus two-step control), preserved performance in the face of changes in the location of rewards or in the task structure, and an ability to reproduce the transition structure explicitly after the fact. Furthermore, the task effectively incentivises flexible model-based reasoning, as this type of reasoning alone allows collection of substantial additional reward (93%) compared to our most successful MF algorithm (80%). These convergent observations suggest that the model-based and model-free distinction we infer from our task rests on solid behavioural grounds.

In human subjects, there is a growing number of observations of replay and/or preplay of potential trajectories of states that are associated with the structure of tasks that subjects are performing (Kurth-Nelson et al., 2015; Kurth-Nelson et al., 2016; Schuck and Niv, 2019). However, it has been relatively hard to relate these replay events to ongoing performance. By contrast, there is evidence that rodent preplay has at least some immediate behavioural function (Gupta et al., 2010; Pfeiffer and Foster, 2013), and there are elegant theories for how replay should be optimally sequenced and structured in the service of planning. In particular, a recent normative model of replay, which aims to account for both online and offline events, suggests forward replay should prioritise trajectories on which the agent might soon re-embark, and backward replay should prioritise trajectories for which one’s policy can be improved (Mattar and Daw, 2018). Our results are broadly consistent with this normative perspective, showing that evidence of forward replay prioritises frequently chosen trajectories, whereas evidence of forward and backward replay follows new observations that can inform one’s policy, and indeed predicts appropriate changes in policy.

Two main aspects of our results deviate from this normative perspective. First, we primarily find forward sequenceness following outcomes. Second, rather than preplay immediately prior to choice, we found evidence of on-task replay following feedback alone. One possible explanation of the latter result is that replay before choice is more variable in speed and direction, which would make it more difficult to detect. However, this result might also indicate that upon observing an outcome, subjects immediately decided what moves to make next time they start from the same state. Such a strategy could be computationally expedient in that it minimises the need for retrieval and computation later on, during choice time, when a subject might be under time pressure. This suggests a third potential factor impacting on the timing and content of replay – the need to minimise memory load by embedding new information in one’s policies as soon as it is received.

Critically, the timing and content of replay differed across individuals in a manner that links with their dominant mode of planning. More model-based subjects tended to replay trajectories during learning, predominantly reflecting choices they were likely to reconsider. There have been reports of preferential replay of deprecated trajectories in rodents (Gupta et al., 2010; Carey et al., 2019). However, those studies are consistent with a more general function for replay (e.g., maintaining the integrity of a map given a biased experience), whereas in our case, replay was closely related to future behaviour.

By contrast, the decodeable replay of more model-free subjects reinstated previously experienced trajectories during rest periods, when DYNA-like mechanisms (Sutton, 1991) are hypothesised to compile information about the environment to create an effective model-free policy. This replay of state-to-state transitions suggests that despite a general inability at the end of the task to draw a map accurately, model-free subjects do have implicit access to some form of model, though likely an incomplete one. In any case, the lack of association here between offline replay and ultimate winnings indicates that generating a policy offline might not be a good strategy for a task that requires trial-to-trial flexibility.

This aspect of the task might explain an apparent discrepancy with previous work on retrospective revaluation (Gershman et al., 2014; Momennejad et al., 2018), which indicates offline replay is associated with a greater degree of behavioural change. In the retrospective revaluation paradigm, behavioural change manifests between experimental phases that are separated in time by several minutes. The only way an algorithm like DYNA would be able to afford the trial-to-trial flexibility required in our task would be to furnish different stimulus-response mappings that can be summoned at will. It is possible such flexibility goes beyond the capabilities of any forms of DYNA that might be implemented in the brain.

Our work has a number of limitations. First, although our experimental design probes various facets of decision flexibility, it tests flexibility most extensively by interleaving 1- and 2-move trials. Modelling subjects’ choices shows this measure of flexibility captures individual differences in model-based and model-free planning. However, it is important to keep in mind that this measure does not capture all types of decision flexibility, as exemplified, for instance, by the different sort of flexibility that manifests in a retrospective revaluation paradigm. Second, our experiment was not ideally suited for inducing compound representations that link states with those that succeed them, since succession here changed frequently both within and between blocks. However, algorithms that utilise such representations mimic both model-free and model-based behaviour, and future work could utilise our methods to investigate whether and how these algorithms are aided by online and offline forms of replay (Russek et al., 2017). Third, the sequenceness measure that we use to determine replay suffers from a restriction of comparing forwards to backwards sequences. There is every reason to expect that forwards and backwards sequences co-exist, so focusing on a relative predominance of one or the other is likely to provide an incomplete picture. The problem with measuring forwards and backwards replay against an absolute standard is the large autocorrelation in the neural decoding, and better ways of correcting for this are desirable in future studies. Nevertheless, despite these shortcomings, the work we report is a further step towards revealing the rich and divergent structure of human choice in sequential decision-making tasks.

Materials and methods

Subjects

40 human subjects, aged 18–33 years, 25 female, were recruited from a subject pool at University College London. Exclusion criteria included age (younger than 18 or older than 35), neurological or psychiatric illness, and current psychoactive drug use. To allow sufficient statistical power for comparisons between subjects, we set the sample size to roughly double that used in recent magnetoencephalography (MEG) studies on dynamics of neural representations (Hunt et al., 2012; Kurth-Nelson et al., 2015), and in line with our previous study of individual differences using similar measurements (including ‘sequenceness’; Eldar et al., 2018). Subjects received monetary compensation for their time (£20) in addition to a bonus (between £10 and £20) reflecting how many reward points they earned in the experimental task. The experimental protocol was approved by the University College London local research ethics committee, and informed consent was obtained from all subjects.

Experimental design

To study flexibility in decision making, we designed a 2 × 4 state space where each location was identified by a unique image. Each image was associated with a known number of reward points, ranging between 0 and 10. Subjects’ goal was to collect as much reward as possible by moving to images associated with a high number of points. Subjects were never shown the whole structure of the state space, and thus had to learn by trial and error which moves led to higher reward.

Subjects were first told explicitly how many reward points were associated with each of the eight images. Subjects were then trained on these image-reward associations until they reliably chose the more rewarding image of any presented pair (see Image-reward training).

Next, the rules of the state-space task were explained (see State-space task), and multiple-choice questions were used to ensure that subjects understood these instructions. To facilitate learning, subjects were then gradually introduced to the state space, and were allowed one move at a time from a limited set of starting locations (see State-space training). Following this initial exposure, the rules governing two-move trials were explained and subjects completed a series of exercises testing their understanding of the distinction between one-move and two-move trials (see State-space exercise). Once these exercises were successfully completed, subjects played two full blocks of trials in the state space, which included both one-move and two-move trials.

We next tested how subjects adapted to a change in the rewards associated with images. For this purpose, we instructed and trained subjects on new image-reward associations (see State-space design). Subjects then played two additional state-space blocks with these modified rewards.

Finally, we tested how subjects adapted to changes in the spatial structure of the state space. For this purpose, we told subjects that two pairs of images would switch locations, informing them precisely which images these were (see State-space design). Multiple-choice questions were used to ensure that subjects understood these instructions. Subjects then played a final state-space block with this modified spatial map.

At the end of the experiment, we also tested subjects’ explicit knowledge, asking them to sketch maps of the state spaces and indicate how many points each image was associated with before, and after, the reward contingency changed.

Stimuli

To ensure robust decoding from MEG, we used eight images that differed in colour, shape, texture and semantic category (Hunt et al., 2012; Carlson et al., 2013; Cichy et al., 2014). These included: a frog, a face, a traffic sign, a tomato, a hand, a house, a pond, and a wrench.

State-space task

Subjects started each trial in a pseudorandom state, identified only by its associated image. Subjects then chose whether to move right, left, up, or down, and the chosen move was implemented on the screen, revealing the new state (i.e., its associated image) to which the move led. In ‘one-move’ trials, this marked the end of the trial, and was followed by a short inter-trial interval. The next trial then started from another pseudorandom location. In ‘two-move’ trials, subjects made an additional move from the location to which their first move had led. This second move could not backtrack the first move (e.g., moving right and then left). Subjects were informed they would be awarded the points associated with any image to which they moved. Thus, subjects won the points associated with a single image on one-move trials, and the combined value of the two images on two-move trials. The numbers of points awarded were never displayed during the main task. Every six trials, short text messages informed subjects what proportion of obtainable reward they had collected in the last six trials (message duration 2500 ms).

Each state-space block consisted of 54 trials: 18 one-move and 36 two-move trials. The first six trials were one-move, the next 12 were two-move trials, then the next six were again one-move trials, the next 12 two-move, and so on. Every six trials, short text messages informed subjects whether the next six trials were going to be one-move or two-move trials (message duration 2000 ms). Every six consecutive trials featured six different starting locations. The one exception to this was the first 24 two-move trials of the experiment, where, in order to facilitate learning, each starting location repeated for two consecutive trials (a similar measure was also implemented for one-move trials during training; see State-space training). Subjects’ performance improved substantially in the second of such pairs of trials (Δproportion of optimal first choices = +0.15, 95% CI = +0.11 to +0.18, p<0.001, Bootstrap test).

At the beginning of every block (except the first one), we tested how well subjects could do the task without additional information, based solely on the identity of the starting locations. For this purpose, images to which subjects’ moves led were not shown for the first 12 trials. In two-move trials, this meant subjects implemented a second move from an unrevealed image (i.e., state).

State-space design

The mapping of individual images to locations and rewards was randomly determined for each subject, but rewards were spatially organised in a similar manner for all subjects. To test whether subjects could flexibly adjust their choices, the state space was constructed such that there were five locations from which the optimal initial move was different depending on whether one or two moves were allowed. We tested subjects predominantly on these starting locations, using all five of them in every six consecutive trials. Following two blocks, the rewards associated with each image were changed, such that the optimal first moves in both 1-move and 2-move trials, given the new reward associations, were different from the optimal moves under the initial reward associations. The initial and modified reward associations were weakly anti-correlated across images (r=-0.37). Finally, before the last block, we switched the locations of two pairs of images, such that the optimal first move changed for 15 out of 16 trial types (1- and 2-move trials x 8 starting locations).
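The wrap-around (torus) structure of the 2 × 4 map can be made concrete with the small sketch below. The image layout shown is illustrative only: it matches the tomato/traffic sign/pond example from the Figure 1 caption, but the true image-to-location assignment was randomised per subject.

```python
# A 2 x 4 grid connected as a torus: moving off one edge re-enters from the opposite edge.
N_ROWS, N_COLS = 2, 4
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(row, col, move):
    dr, dc = MOVES[move]
    return (row + dr) % N_ROWS, (col + dc) % N_COLS

# Illustrative layout only; the real assignment was randomised for each subject.
layout = [["frog",  "tomato", "traffic sign", "face"],
          ["house", "pond",   "wrench",       "hand"]]

r, c = 0, 1                       # start at 'tomato'
nr, nc = step(r, c, "right")
print(layout[nr][nc])             # traffic sign
nr, nc = step(r, c, "up")
print(layout[nr][nc])             # pond (with only two rows, up and down reach the same state)
```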

State-space training

Subjects played six short training blocks, each consisting of 12 one-move trials starting in one of two possible locations. If a subject failed to collect 70% of the points available in one of these short blocks, the block was repeated. The majority of subjects (35 out of 40) had to repeat the first block, whereas only 12% of the remaining blocks were repeated (mean 0.6 blocks per subject, range 0 to 2). Very rarely, a block had to be repeated twice (a total of 5 out of 240 blocks for the whole group). Lastly, subjects played a final training block consisting of 48 one-move trials starting at any of the eight possible locations. To facilitate learning, during the first half of the block, each starting location was repeated for two consecutive trials. In the second half of the block, starting locations were fully interleaved.

State-space exercise

Following the state-space training, which only included one-move trials, we ensured subjects understood how choices should differ in one- and two-move trials by asking them to choose the optimal moves in a series of random, fully visible state spaces. Subjects were given a bird’s eye view of each state space, with each location showing the number of reward points with which it was associated. The starting location was indicated, in addition to whether one, or two, moves were available from which to collect reward. In all exercises, the optimal initial move was different depending on whether one or two moves were allowed. Every 10 consecutive exercises consisted of five one-move and five two-move trials. To illustrate the continuity of the state space, the exercises included one-move and two-move trials wherein the optimal move required the subject to move off the map and arrive at the other end (e.g., moving left from a leftmost location to arrive at the rightmost location). In another two-move trial, the optimal moves involved moving twice up or twice down, thereby returning to the starting location. Subjects continued to do the exercises until fulfilling a performance criterion of 9 correct answers in 10 consecutive exercises. This criterion was relaxed to eight correct answers if at least 60 exercises had been completed. Only one subject required 60 exercises to reach criterion (mean = 24.5 exercises, SD 9.3).

Image-reward training

To ensure subjects remembered how many points each image awarded, we required subjects to select the more rewarding image out of any pair of presented images. First, subjects were asked to memorise the number of points each image would award. Then, each round of training consisted of 28 trials, testing subjects on all 28 possible pairs of images (Figure 1—figure supplement 1). Each trial started with the presentation of one image, depicted on an arrow pointing either right, left, up or down. 800 ms later, another image appeared on an arrow pointing in a different direction. Subjects then had to press the button corresponding to the direction of the more rewarding image. Here, as throughout the experiment, subjects were instructed to press the ‘left’ and ‘up’ buttons with their left hand, and the ‘right’ and ‘down’ buttons with their right hand. During training, images were mapped to directions such that each of the four directions was equally associated with low- and high-reward images. Once subjects made their choice, the number of points associated with each of the two images appeared on the screen, and if the choice was correct the chosen move was implemented on the screen. Subjects repeated this training until they satisfied a performance criterion, based on how many points they missed consequent upon choosing less rewarding images. The initial performance criterion allowed four missed points, or fewer, in a whole training round (out of a maximum of 130 points). This criterion was gradually relaxed, to eight missed points in the second training round, to 12 missed points in the third training round, and to 16 missed points thereafter. Once subjects satisfied the performance criterion without a time limit, they repeated the training with only 1500 ms allowed to make each choice, until satisfying the same criterion, re-set and gradually relaxed as before. Overall, subjects required an average of 3.4 training rounds (SD 1.0) to learn the initial image-reward associations (1.3 rounds without, and then 2.1 rounds with, a 1500 ms time limit), and 4.3 rounds (SD 1.3) to learn the second set of image-reward associations (2.0 rounds without, and 2.3 rounds with, a time limit). Questioning at the end of the experiment validated that subjects had explicit recall for both sets of image-reward associations (mean error 0.36 pts, SEM = 0.07 pts; chance = 4.05 pts).

Modelling

To test which decision algorithm subjects employed, and in particular, whether they chose moves that had previously been most rewarding from the same starting location (model-free planning), or whether they learned how the state space is structured and used this information to plan ahead (model-based planning), we compared model-free and model-based algorithms in terms of how well each fitted subjects’ actual choices. These models were informed by previous work (Daw et al., 2005; Sutton and Barto, 1998), adjusted to the present task, and validated using model and parameter recovery tests on simulated data.

Model-free learning algorithm

Free parameters: $\eta^{MF1}$, $\eta^{MF2}$, $\tau^{MF}$, $\tau'^{MF}$, $\theta$, $\beta_1^{MF1}$, $\beta_2^{MF1}$, $\beta_2^{MF2}$, $\gamma_{up,down,left,right}$. This algorithm learns the expected value of performing a given move upon encountering a given image. To do this, the algorithm updates its expectation $Q^{MF}$ for move $m$ given image $s$ whenever this move is taken and its outcome is observed:

$Q_{t+1}^{MF1}(s_{t,1}, m_t) = Q_t^{MF1}(s_{t,1}, m_t) + \eta^{MF1} \delta_t^{MF1}$, (1)

where $s_{t,1}$ is trial $t$’s starting image, $\delta_t^{MF1}$ is the reward prediction error, and $\eta^{MF1}$ is a fixed learning rate between 0 and 1. Reward prediction errors are computed as the difference between actual and expected outcomes:

$\delta_t^{MF1} = R_g(s_{t,2}) - Q_t^{MF1}(s_{t,1}, m_t)$, (2)

where the actual outcome consists of the points associated with the new image to which the move led, $R_g(s_{t,2})$. $g=1$ refers to the initial image-reward associations, and $g=2$ refers to the second set of image-reward associations about which subjects were instructed in the middle of the experiment.

On 2-move trials, the algorithm also learns the expected reward for each pair of moves given each starting image. Thus, another set of Q values is maintained (QMF2), one for each possible pair of moves for each starting image, and these are updated every time a pair of moves is completed based on the total reward obtained by the two moves. This learning proceeds as described by Equations 1 and 2, but with a different learning rate (ηMF2).

All expected values are initialised to θ, and decay back to this initial value before every update:

$Q^{MF} \leftarrow \tau^{MF} Q^{MF} + (1 - \tau^{MF})\, \theta$, (3)

where $\tau^{MF}$ determines the degree of value retention. This allows learned expectations to be gradually forgotten.

Following instructed changes to the number of points associated with each image, or to the spatial arrangement of the images, previously learned Q values are of little use. Thus, we allow the Q values to return to $\theta$, as in Equation 3, but only for a single timestep and with a different, potentially lower, memory parameter $\tau'^{MF}$.
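
To make the learning and forgetting dynamics concrete, here is a minimal Python sketch of the single-move model-free update (Equations 1–3). The array shapes, function names and the explicit decay-then-update ordering are illustrative assumptions rather than the authors’ implementation.

```python
import numpy as np

N_IMAGES, N_MOVES = 8, 4

def init_q(theta):
    # All single-move expectations start at theta, the attractor of Equation 3.
    return np.full((N_IMAGES, N_MOVES), float(theta))

def decay(q, tau, theta):
    # Equation 3: gradual forgetting towards the initial value theta.
    return tau * q + (1 - tau) * theta

def mf_update(q, s, m, reward, eta, tau, theta):
    # Decay all values before the update, then learn from the reward prediction error.
    q = decay(q, tau, theta)
    delta = reward - q[s, m]      # Equation 2
    q[s, m] += eta * delta        # Equation 1
    return q
```

The same two functions can model the single-timestep reset after instructed changes by calling `decay` once with the alternative memory parameter.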

Finally, the algorithm chooses moves based on a combination of its learned expected values. On 1-move trials, only single-move Q values are considered:

$p(m_t = m \mid s_{t,1}) \propto e^{\gamma_m + \beta_1^{MF1} Q_t^{MF1}(s_{t,1}, m)}$, (4)

where $\gamma_m$ is a fixed bias in favor of move $m$ ($\sum_m \gamma_m = 0$), and $\beta_1^{MF1}$ is an inverse temperature parameter that weighs the impact of expected values on choice. On 2-move trials, both types of Q values are considered. Thus, the first move is chosen based on a weighted sum of the single-move Q values and the move-pair Q values:

$p(m_{t,1} = m \mid s_{t,1}) \propto e^{\gamma_m + \beta_2^{MF1} Q_t^{MF1}(s_{t,1}, m) + \beta_2^{MF2} Q_t^{MF2}(s_{t,1}, m)}$, (5)

wherein the latter are integrated over possible second moves each weighted by its probability:

$Q_t^{MF2}(s_{t,1}, m) = \sum_{m'} p(m_{t,2} = m' \mid s_{t,1}, m_{t,1} = m)\, Q_t^{MF2}(s_{t,1}, m, m')$ (6)

Then, in choosing the second move the algorithm takes into account the state to which the first move led:

$p(m_{t,2} = m \mid s_{t,1}, m_{t,1}, s_{t,2}) \propto e^{\gamma_m + \beta_2^{MF1} Q_t^{MF1}(s_{t,2}, m) + \beta_2^{MF2} Q_t^{MF2}(s_{t,1}, m_{t,1}, m)}$. (7)

However, when the newly reached image $s_{t,2}$ is not known (i.e., in trials without feedback, or when estimating $p(m_{t,2} = m' \mid s_{t,1}, m_{t,1})$ in Equation 6 before $s_{t,2}$ is reached), $Q^{MF1}$ values are averaged over all possible settings of $s_{t,2}$.
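
As an illustration of how the learned values are converted into first-move choice probabilities on 2-move trials (Equations 5 and 6), the sketch below marginalises the move-pair values over possible second moves and applies a biased softmax. Array layouts and names are assumptions made for clarity, not the authors’ code.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def marginal_pair_values(q2, p_second):
    # Equation 6: weight each move-pair value q2[s, m, m'] by the estimated
    # probability of the second move m', and sum over m'.
    return (p_second * q2).sum(axis=-1)          # shape: (n_images, n_moves)

def first_move_probs(q1, q2_bar, s, gamma, beta1, beta2):
    # Equation 5: biased softmax over the combined single-move and
    # (marginalised) move-pair values for starting image s.
    return softmax(gamma + beta1 * q1[s] + beta2 * q2_bar[s])
```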

Model-based learning algorithm

Free parameters: $\eta^{MB}$, $\tau^{MB}$, $\tau'^{MB}$, $\rho$, $\omega$, $\beta^{MB}$, $\kappa$, $\gamma_{up,down,left,right}$. This algorithm learns the probability of transitioning from one image to another following each move. To do this, the algorithm updates its probability estimates, $T$, whenever a move is made and a transition is observed:

$T_{t+1}(s_{t,1}, m_t, s_{t,2}) = T_t(s_{t,1}, m_t, s_{t,2}) + \eta^{MB} \delta_t^{MB}$, (8)

where $\delta_t^{MB}$ is the image-transition prediction error, and $\eta^{MB}$ is a fixed learning rate between 0 and 1. Image-transition prediction errors reflect the difference between actual and expected transitions:

$\delta_t^{MB} = 1 - T_t(s_{t,1}, m_t, s_{t,2})$. (9)

To ensure that transition probabilities sum to 1, the transition matrix is renormalised following every update:

$\forall s'\!: \quad T_{t+1}(s_{t,1}, m_t, s') \leftarrow \dfrac{T_{t+1}(s_{t,1}, m_t, s')}{\sum_{s''} T_{t+1}(s_{t,1}, m_t, s'')}$. (10)

Learning may also take place with respect to the opposite transition. For instance, if moving right from image $s_{t,1}$ leads to image $s_{t,2}$, the agent can infer that moving left from image $s_{t,2}$ would lead to image $s_{t,1}$. Such inference is modulated in the algorithm by the free parameter $\rho$:

$T_{t+1}(s_{t,2}, \tilde{m}_t, s_{t,1}) = T_t(s_{t,2}, \tilde{m}_t, s_{t,1}) + \rho\, \eta^{MB} \delta'^{MB}_t$, (11)

where $\tilde{m}_t$ is the opposite of $m_t$, and $\delta'^{MB}_t$ is the opposite-transition prediction error:

$\delta'^{MB}_t = 1 - T_t(s_{t,2}, \tilde{m}_t, s_{t,1})$. (12)

Self-transitions are impossible and thus their probability is initialised to 0. All other transitions are initialised with uniform probabilities, and these probabilities decay back to their initial values before every update:

$T \leftarrow \tau^{MB} T + (1 - \tau^{MB})\, \tfrac{1}{7}$, (13)

where $\tau^{MB}$ is the model-based memory parameter. A low $\tau^{MB}$ results in faster decay of expected transition probabilities towards uniform distributions, decreasing the impact of model-based knowledge on choice.
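
A minimal sketch of the transition-learning steps (Equations 8–13) is given below. The opposite-move lookup table, the move coding and the renormalisation after the opposite-transition update are assumptions made for illustration rather than a statement of the authors’ implementation.

```python
import numpy as np

N_IMAGES, N_MOVES = 8, 4
OPPOSITE = {0: 1, 1: 0, 2: 3, 3: 2}   # assumed move coding: up/down, left/right

def init_T():
    # Uniform over the 7 possible destinations; self-transitions are impossible.
    T = np.full((N_IMAGES, N_MOVES, N_IMAGES), 1 / 7)
    for s in range(N_IMAGES):
        T[s, :, s] = 0.0
    return T

def mb_update(T, s1, m, s2, eta, rho, tau):
    T = tau * T + (1 - tau) * init_T()                        # Equation 13 (forgetting)
    T[s1, m, s2] += eta * (1 - T[s1, m, s2])                  # Equations 8-9
    T[s1, m] /= T[s1, m].sum()                                # Equation 10 (renormalise)
    m_opp = OPPOSITE[m]                                       # opposite move
    T[s2, m_opp, s1] += rho * eta * (1 - T[s2, m_opp, s1])    # Equations 11-12
    T[s2, m_opp] /= T[s2, m_opp].sum()                        # renormalise again
    return T
```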

When instructed about changes to the image locations, the agent rearranges its transition probabilities based on the instructed changes with limited success, as indexed by the free parameter $\omega$:

$T \leftarrow (1 - \omega)\, T + \omega\, T_{rearranged}$. (14)

Since some subjects may simply reset their transition matrix following instructed changes, the algorithm also ‘forgets’ after such instruction, as in Equation 13, but only for a single time point and with a different memory parameter, $\tau'^{MB}$.

Finally, the probability the algorithm will choose a given move when encountering a given image depends on its model-based estimate of the move’s expected outcome:

$p(m_t = m \mid s_{t,1}) \propto e^{\gamma_m + \beta^{MB} Q_t^{MB}(s_{t,1}, m)}$. (15)

The algorithm estimates expected outcomes by multiplying the number of points associated with an image with the probability of transitioning to that image, integrating over all potential future images:

$Q_t^{MB}(s_{t,1}, m) = \sum_{s'} T_t(s_{t,1}, m, s')\, R_g(s')$. (16)

When two moves are allowed, the calculation also accounts for the number of points obtainable with the second move, $m_{t,2}$:

$Q_t^{MB}(s_{t,1}, m) = \sum_{s'} T_t(s_{t,1}, m, s') \left( R_g(s') + \kappa \max_{m'} \sum_{s''} T_t(s', m', s'')\, R_g(s'') \right)$, (17)

where $\kappa$ is a fractional parameter that determines the degree to which reward obtained by the second move is taken into account.

Following the first move, Equation 15 is used to choose a second move based on the observed new location ($s_{t,2}$). However, if the next location is not shown (i.e., in trials without feedback), the agent chooses its second move by integrating Equation 15 over the expected $s_{t,2}$, as determined by $T_t(s_{t,1}, m_{t,1}, s_{t,2})$.
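
The sketch below shows one way the model-based action values of Equations 16 and 17 can be computed from the transition matrix and the image rewards; it is an illustrative reading of the equations, with array conventions chosen for brevity, rather than the authors’ implementation.

```python
import numpy as np

def q_mb_one_move(T, R, s):
    # Equation 16: expected points of each move from image s.
    # T: (n_images, n_moves, n_images) transition estimates; R: (n_images,) rewards.
    return T[s] @ R                               # shape: (n_moves,)

def q_mb_two_moves(T, R, s, kappa):
    # Equation 17: add kappa times the best value attainable with the second
    # move from each possible successor image s'.
    best_second = np.max(T @ R, axis=1)           # optimal second-move value per image
    return T[s] @ (R + kappa * best_second)
```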

MF-MB hybrid algorithm

This algorithm employs both model-free (MF) and model-based (MB) planning, choosing moves based on a combination of the expected values estimated by the two learning processes. In 1-move trials this is implemented as:

$p(m_t = m \mid s_{t,1}) \propto e^{\gamma_m + \beta_1^{MF1} Q_t^{MF1}(s_{t,1}, m) + \beta^{MB} Q^{MB}(s_{t,1}, m)}$. (18)

In 2-move trials, the algorithm makes a choice based on a combination of the model-based Q values and both the single-move and two-move model-free Q values. For the first move, the combination is:

$p(m_{t,1} = m \mid s_{t,1}) \propto e^{\gamma_m + \beta_2^{MF1} Q_t^{MF1}(s_{t,1}, m) + \beta_2^{MF2} Q_t^{MF2}(s_{t,1}, m) + \beta^{MB} Q^{MB}(s_{t,1}, m)}$ (19)

with $Q^{MB}(s_{t,1}, m)$ computed according to Equation 17. For the second move, the choice is made according to:

$p(m_{t,2} = m \mid s_{t,2}) \propto e^{\gamma_m + \beta_2^{MF1} Q_t^{MF1}(s_{t,2}, m) + \beta_2^{MF2} Q_t^{MF2}(s_{t,1}, m_{t,1}, m) + \beta^{MB} Q^{MB}(s_{t,2}, m)}$. (20)

When the image is not shown following the first move (i.e., in a no-feedback trial), the agent averages the model-free values over all images.

Parameter fitting

To fit the free parameters of the different algorithms to subjects’ choices, we used an iterative hierarchical expectation-maximisation procedure (Bishop, 2006). We first sampled 10000 random settings of the parameters from predefined group-level prior distributions. Then, we computed the likelihood of observing subjects’ choices given each setting, and used the computed likelihoods as importance weights to re-fit the parameters of the group-level prior distributions. These steps were repeated iteratively until model evidence ceased to increase (see Algorithm comparison below for how model evidence was estimated). This procedure was then repeated with 31623 samples per iteration, and finally with 100000 samples per iteration. To derive the best-fitting parameters for each individual subject, we computed a weighted mean of the final batch of parameter settings, in which each setting was weighted by the likelihood it assigned to the subject’s choices. Fractional parameters ($\eta^{MF}$, $\tau^{MF}$, $\tau'^{MF}$, $\eta^{MB}$, $\tau^{MB}$, $\tau'^{MB}$, $\rho$, $\omega$, $\alpha$) were modelled with Beta distributions (initialised with shape parameters a = 1 and b = 1) and their values were log-transformed for the purpose of subsequent analysis. Initial Q values ($\theta$) and bias parameters ($\gamma_{up}$, $\gamma_{down}$, $\gamma_{left}$, $\gamma_{right}$) were modelled with normal distributions (initialised with µ = 0 and σ = 1) to allow for both positive and negative effects, and all other parameters were modelled with Gamma distributions (initialised with shape = 1, scale = 1).
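
For concreteness, here is a simplified sketch of a single iteration of the sampling-based hierarchical fit. The three callables (`log_lik_fn`, `sample_prior`, `refit_prior`) are placeholders standing in for model-specific code, and the details of how the group-level prior is re-fitted are an assumption rather than the authors’ exact procedure.

```python
import numpy as np

def fit_iteration(log_lik_fn, sample_prior, refit_prior, n_samples=10000):
    """One iteration of the sampling-based hierarchical fit sketched above.

    log_lik_fn(params) -> per-subject log-likelihood of the choices
    sample_prior(n)    -> n parameter settings drawn from the group-level prior
    refit_prior(samples, weights) -> updated group-level prior
    """
    samples = np.asarray(sample_prior(n_samples))             # (n_samples, n_params)
    log_lik = np.array([log_lik_fn(p) for p in samples])      # (n_samples, n_subjects)
    # Importance weights: normalise each subject's likelihoods over samples.
    w = np.exp(log_lik - log_lik.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    new_prior = refit_prior(samples, w)                       # re-fit group-level prior
    # Per-subject estimates: likelihood-weighted means of the sampled settings.
    subject_estimates = np.einsum('ns,np->sp', w, samples)
    return new_prior, subject_estimates
```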

Algorithm comparison

We compared pairs of algorithms, in terms of how well each accounted for subjects’ choices, by means of the integrated Bayesian Information Criterion (iBIC; Eldar et al., 2016; Huys et al., 2012). To do this, we estimated the evidence in favour of each model, $\mathcal{L}$, as the mean likelihood of the model given 100000 random parameter settings drawn from the fitted group-level priors. We then computed the iBIC by penalising the model evidence to account for algorithm complexity as follows: $\text{iBIC} = -2 \ln \mathcal{L} + k \ln n$, where $k$ is the number of fitted parameters and $n$ is the number of subject choices used to compute the likelihood. Lower iBIC values indicate a more parsimonious fit.
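
As a small illustration of the penalty computation, assuming the per-subject evidences combine multiplicatively across independent subjects (an assumption about the bookkeeping, not a statement of the authors’ code):

```python
import numpy as np

def ibic(mean_likelihood_per_subject, k, n):
    # Model evidence: product of per-subject mean likelihoods over prior samples,
    # so the log-evidence is the sum of their logs.
    log_evidence = np.sum(np.log(mean_likelihood_per_subject))
    return -2 * log_evidence + k * np.log(n)
```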

Algorithm and parameter recovery tests

We tested whether our dataset was sufficiently informative to distinguish between the MF, MB and hybrid algorithms and to recover the correct parameter values. For this purpose, we generated 10 simulated datasets using each algorithm and applied our fitting and comparison procedures to each dataset. To reduce processing time, only 10000 parameter settings were sampled. To maximise the chances of confusion between algorithms, we implemented all algorithms with the parameter values that best fitted subjects’ choices. Algorithm comparison identified the correct algorithm in each of the 30 simulated datasets, and the parameter values that best fitted the simulated data consistently correlated with the actual parameter values used to generate these data (Pearson’s r: M = 0.57, SEM = 0.05). This correlation was stronger for parameters whose values were used on multiple trials when computing the fit to data (e.g., learning rates and inverse temperature parameters; M = 0.67, SEM = 0.04).

Additional algorithms

To test whether the algorithms described above were suitable for describing subjects’ behaviour, we compared them to several additional algorithms, all of which failed to fit subjects’ choices as well as their counterparts above, and so we do not describe them in detail. These alternative algorithms included a MF algorithm that only learns single-move Q values, but employs temporal difference learning (O'Doherty et al., 2003) to backpropagate second-move outcomes in 2-move trials to the Q values of the starting location (iBIC = 41559); a MB algorithm that employs Bayesian inference with a uniform Dirichlet prior (Bishop, 2006) (whose concentration parameter was set to 1) to learn the multinomial distributions that compose the state transition matrix (iBIC = 43301); a MF-MB hybrid algorithm where state-transition expectations are only used to account for prospective second-move Q values when choosing the first move in 2-move trials (iBIC = 40920); and an algorithm that combines two MF algorithms with different parameters (iBIC = 40715).

MEG acquisition

MEG was recorded continuously at 600 samples/second using a whole-head 275-channel axial gradiometer system (CTF Omega, VSM MedTech, Canada), while subjects sat upright inside the scanner. A projector displayed the task on a screen ∼80 cm in front of the subject. Subjects made responses by pressing a button box, using their left hand for ‘left’ and ‘up’ choices and their right hand for ‘right’ and ‘down’ choices. Pupil size and eye gaze were recorded at 250 Hz using a desktop-mounted EyeLink II eyetracker (SR Research).

MEG preprocessing

Preprocessing was performed using the FieldTrip toolbox (Oostenveld et al., 2011) in MATLAB (MathWorks). Data from two sensors were not recorded due to a high level of noise detected in routine testing. Data were first manually inspected for jump artefacts. Then, independent component analysis was used to remove components that corresponded to eye blinks, eye movements and heart beats. Based on previous experience (Eldar et al., 2018), we expected stimuli to be represented in low-frequency fluctuations of the MEG signal. Therefore, to remove fast muscle artefacts and slow movement artefacts, we low-pass filtered the data with a 20 Hz cutoff frequency using a sixth-order Butterworth IIR filter, and we baseline-corrected each trial’s data by subtracting the mean signal recorded during the 400 ms preceding trial onset. Trials in which the average standard deviation of the signal across channels was at least three times greater than the median were excluded from analysis (0.4% of trials, SEM 0.2%). Finally, the data were resampled from 600 Hz to 100 Hz to conserve processing time and improve the signal-to-noise ratio. Thus, the data samples used for analysis were vectors of length 273 (one value per channel) spaced every 10 ms.
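
A minimal Python/SciPy sketch of the filtering, baseline-correction and downsampling steps is shown below for orientation (the original pipeline used FieldTrip in MATLAB). The use of zero-phase filtering and the exact argument choices are assumptions rather than a reproduction of the authors’ code.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

FS = 600  # original sampling rate (Hz)

def preprocess_trial(data, baseline_samples=240):
    """Low-pass filter, baseline-correct and downsample one trial.

    data: (n_channels, n_samples) array at 600 Hz; baseline_samples covers the
    400 ms preceding trial onset.
    """
    b, a = butter(6, 20 / (FS / 2), btype='low')          # 6th-order Butterworth, 20 Hz
    filtered = filtfilt(b, a, data, axis=-1)              # zero-phase (an assumption)
    baseline = filtered[:, :baseline_samples].mean(axis=-1, keepdims=True)
    return resample_poly(filtered - baseline, up=1, down=6, axis=-1)  # 600 -> 100 Hz
```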

Pre-task stimulus exposure

To allow decoding of images from MEG we instructed subjects to identify each of the images in turn (Figure 2—figure supplement 1a). On each trial, the target image was indicated textually (e.g., ‘FACE’) and then an image appeared on the screen. Subjects’ task was to report whether the image matched (LEFT button) or did not match (RIGHT button) the preceding text. 20% of presented images did not match the text. The task continued until subjects correctly identified each of the images at least 25 times. Subjects were highly accurate on both match (M = 97.2%, SEM = 0.4%) and no-match (M = 90.2%, SEM = 0.6%) trials. To ensure robust decoding from MEG, we chose eight images that differed in colour, shape, texture and semantic category (Isik et al., 2014; Carlson et al., 2013; Figure 1a). Importantly, at this point subjects had no knowledge as to what the main task would involve, nor that the images would be associated with state-space locations and rewards. This ensured that no task information could be represented in the MEG data at this stage.

MEG decoding

We used support vector machines (SVMs) to decode images and moves from MEG. All decoders were trained on MEG data recorded outside of the main state-space task and validated within the task. As in previous work (Eldar et al., 2018), we trained a separate decoder for each time bin between 150 and 600 ms following the relevant event, either image onset or move choice, resulting in 46 decoders whose output was averaged. Averaging over decoders trained at different time points reduces peak decodability following stimulus onset, but can increase decodability of stimuli that are being processed when not on the screen (Eldar et al., 2018). To avoid over-fitting, training and testing were performed on separate sets of trials following a 5-fold cross-validation scheme. These analyses were performed using LIBSVM’s implementation of the C-SVC algorithm with radial basis functions (Chang and Lin, 2011). Decoder training and testing were performed with each of 16 combinations of the algorithm’s cost parameter (10^-1, 10^0, 10^1, 10^2) and basis-function concentration parameter (10^-2/n, 10^-1/n, 10^0/n, 10^1/n), where n is the number of MEG features (273 channels). Where classes differed in the number of instances, instance weighting was applied so that all classes contributed equally.

To decode the probability of each of eight possible images being presented (8-way classification), we used MEG data recorded during pre-task stimulus exposure. Decoding was evaluated based on the mean probability the decoders assigned to the presented image. To decode the probability of each of the four possible moves (LEFT, RIGHT, UP, DOWN) being chosen (4-way classification), we used MEG data recorded during the image-reward training. For both types of decoder, the parameter combination of cost = 10^2 and concentration = 10^-2/n yielded the best cross-validated decoding performance and was thus used for all ensuing analyses, wherein decoders trained on pre-task stimulus exposure data were applied to main task data.
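
For illustration, the following is a simplified stand-in for training one time-bin decoder using scikit-learn (the original analysis used LIBSVM’s C-SVC in MATLAB). The grid values mirror those reported above, but the data handling and cross-validation details are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def train_decoder(X, y, n_features=273):
    """Grid-search an RBF-kernel SVM for one post-stimulus time bin.

    X: (n_trials, n_features) MEG samples; y: image (or move) labels.
    """
    param_grid = {
        'C': [1e-1, 1e0, 1e1, 1e2],
        'gamma': [g / n_features for g in (1e-2, 1e-1, 1e0, 1e1)],
    }
    svm = SVC(kernel='rbf', probability=True, class_weight='balanced')
    search = GridSearchCV(svm, param_grid,
                          cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
    search.fit(X, y)
    return search.best_estimator_    # use .predict_proba() on held-out task data
```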

Sequenceness measure

To investigate how representations of different images related to one another in time, we used a measure recently developed for detecting sequences of representations in MEG (Kurth-Nelson et al., 2016). ‘Sequenceness’ is computed as the difference between the cross-correlations of two images’ decodability time series at positive and negative time lags. By relying on asymmetries in the cross-correlation function, this measure detects sequential relationships even between closely correlated (or anti-correlated) time series, as we have previously demonstrated on simulated time series (Eldar et al., 2018). Positive values indicate that changes in the first time series are followed by similar changes in the second time series (‘forward sequenceness’), negative values indicate the reverse sequence (‘backward sequenceness’), and zero indicates no sequential relationship. As in previous work, cross-correlations were computed between the z-scored time series over 400 ms sliding windows with time lags of up to 200 ms. This timescale is sufficient for capturing the relationship between successive alpha cycles, which is important given the possibility that such oscillations may reflect temporal quanta of information processing (Busch and VanRullen, 2014).
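
The sketch below illustrates the asymmetry logic of the measure: cross-correlations at positive lags minus cross-correlations at negative lags within one window. The z-scoring, window handling and averaging over lags are simplified assumptions, not the published implementation.

```python
import numpy as np

def sequenceness(x, y, max_lag=20):
    """Forward-minus-backward cross-correlation between two decodability time
    series (100 Hz samples; max_lag=20 samples = 200 ms). Positive values mean
    changes in x tend to be followed by similar changes in y."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    forward = backward = 0.0
    for lag in range(1, max_lag + 1):
        forward += np.mean(x[:n - lag] * y[lag:])     # x leads y
        backward += np.mean(x[lag:] * y[:n - lag])    # y leads x
    return (forward - backward) / max_lag
```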

Bayesian hierarchical Gaussian process time series analysis

To determine whether sequenceness time series recorded following outcomes provided robust evidence of replay that correlated with the individual index of behavioural flexibility, we modelled mean sequenceness time series as Gaussian Processes with squared exponential kernels. Such Gaussian Processes explicitly account for dependencies between timepoints within a time series as a function of the length of time between those points. A group-level Gaussian Process captures a time series’ systematic deviations from zero, either on average or as a function of two predictors: subject IF and surprise about the outcome. The deviations of each individual time series from the predictions made by this group-level process can themselves exhibit dependencies between timepoints as a function of distance. Therefore, we accounted for these deviations by means of another set of Gaussian Processes, one for each modelled time series, which were added to the predictions of the group-level process.

In theory, this model can be fit using MCMC sampling to all trial-by-trial sequenceness time series. However, fitting the model to such an amount of data proved infeasible within a reasonable timeframe. Thus, we reduced the data to four mean sequenceness time series per subject: sequenceness encoding the last (as in Figure 3a) or penultimate (as in Figure 3b) transition, following highly or weakly surprising outcomes. High and low surprise were determined based on the state prediction error generated by the hybrid algorithm, whose parameters were fitted to the individual subject’s choices (i.e., high – above-mean prediction error, low – below-mean prediction error). Since we assumed last and penultimate transitions could be replayed at different timepoints, these two types of time series each had their own group-level Gaussian Process. To account for the factors of IF and surprise, for each time series, the group-level process was multiplied by a weighted linear combination of the two factors, their interaction, and an intercept (thus involving four parameters: $\beta$, $\beta_{subject}$, $\beta_{surprise}$, $\beta_{interaction}$). The group- and individual-level Gaussian Processes were parameterised by different length-scales ($\rho_{group}$, $\rho_{individual}$) and marginal standard deviations ($\alpha_{group}$, $\alpha_{individual}$), and a standard deviation parameter ($\sigma$) accounted for additional normally distributed noise across all observations.
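
To make the covariance structure concrete, here is a sketch of the squared exponential kernel and of how group- and individual-level processes might combine for a single time series. It ignores the predictor-dependent scaling of the group-level process, so it is an illustrative reading of the model description rather than the Stan code itself.

```python
import numpy as np

def sq_exp_kernel(times, alpha, rho):
    # Squared exponential covariance: alpha^2 * exp(-dt^2 / (2 * rho^2)).
    times = np.asarray(times, dtype=float)
    dt = times[:, None] - times[None, :]
    return alpha ** 2 * np.exp(-0.5 * (dt / rho) ** 2)

def series_covariance(times, alpha_group, rho_group, alpha_indiv, rho_indiv, sigma):
    # Covariance of one modelled time series: group-level GP plus
    # individual-level deviations plus independent observation noise.
    return (sq_exp_kernel(times, alpha_group, rho_group)
            + sq_exp_kernel(times, alpha_indiv, rho_indiv)
            + sigma ** 2 * np.eye(len(times)))
```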

Bayesian estimation was performed in R (R Development Core Team, 2018) using the STAN (Carpenter et al., 2017) package for Markov Chain Monte Carlo (MCMC) sampling. All predictor variables were standardised. All prior distributions were set so as to be weakly informative and have a broad range on the scale of the variables (Kruschke, 2014). Specifically, $\beta$ coefficients were drawn from normal distributions with a mean of zero and a standard deviation of 10; standard deviation parameters ($\alpha$, $\sigma$) were drawn from a normal distribution truncated to positive values, with a mean of zero and a standard deviation that matched the standard deviation of the predicted variable; and length-scales ($\rho$) were drawn from log-normal distributions whose mean was the geometric mean of two extremes: the distance in time between two successive timepoints, and the distance in time between the first and last timepoints. Half of the difference between these two values was used as the standard deviation of the length-scale priors. For the sake of identifiability, $\beta_{interaction}$ was limited to positive values. Note this does not limit the model to positive interactions, since all coefficients are multiplied by the group-level Gaussian Processes, which might be negative.

We ran six MCMC chains, each for 1400 iterations, with the initial 400 samples used for warmup. STAN’s defaults were used for all other settings. Examination of the results showed there were no divergent transitions, and all parameters were estimated with effective sample sizes larger than 1000 and shrink factors smaller than 1.1. Posterior predictive checks showed good correspondence between the real and generated data (Figure 1c; Figure 1—figure supplement 4).

Decodability time series analyses

Decodability was tested for difference from zero, and for covariance with individual flexibility, using the Bayesian Gaussian Process approach outlined above, excluding the surprise predictor, which is inapplicable to timepoints that precede outcome onset.

Other statistical methods

Significance tests were conducted using nonparametric methods that do not assume specific distributions. Differences from zero were tested using the bias-corrected and accelerated bootstrap (10000 samples, default MATLAB settings). Correlations and differences between groups were tested by comparison to null distributions generated by 10000 permutations of the pairing between the two variables of interest. All tests were two-tailed.
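
The logic of the permutation test for a correlation can be sketched as follows; the seed handling and two-tailed p-value computation are illustrative choices rather than the authors’ exact code.

```python
import numpy as np

def permutation_corr_test(x, y, n_perm=10000, seed=0):
    """Two-tailed permutation test of a correlation: the null distribution is
    built by shuffling the pairing between the two variables."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    null = np.array([np.corrcoef(rng.permutation(x), y)[0, 1]
                     for _ in range(n_perm)])
    p_value = np.mean(np.abs(null) >= abs(observed))
    return observed, p_value
```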

Acknowledgements

We thank Zeb Kurth-Nelson and Yunzhe Liu for helpful comments on a previous version of the manuscript. EE holds an Alon Fellowship from the Israeli Council for Higher Education. PD is funded by the Max Planck Society and the Humboldt Foundation. RJD holds a Wellcome Trust Investigator award (098362/Z/12/Z). The Max Planck UCL Centre for Computational Psychiatry and Ageing Research is a joint initiative supported by the Max Planck Society and University College London. The Wellcome Centre for Human Neuroimaging is supported by core funding from the Wellcome Trust (091593/Z/10/Z).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Eran Eldar, Email: eran.eldar@mail.huji.ac.il.

Thorsten Kahnt, Northwestern University, United States.

Kate M Wassum, University of California, Los Angeles, United States.

Funding Information

This paper was supported by the following grants:

  • Council for Higher Education Alon Fellowship to Eran Eldar.

  • Max Planck Society to Peter Dayan.

  • Alexander von Humboldt Foundation to Peter Dayan.

  • Wellcome Trust Wellcome Trust Investigator award (098362/Z/12/Z) to Raymond J Dolan.

  • Max Planck Society to Raymond J Dolan.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Data curation, Investigation, Methodology.

Formal analysis, Supervision, Methodology, Writing - review and editing.

Resources, Supervision, Funding acquisition, Project administration, Writing - review and editing.

Ethics

Human subjects: The experimental protocol was approved by the UCL Research Ethics Committee, under Project ID Number 9929/002, and informed consent was obtained from all subjects.

Additional files

Transparent reporting form

Data availability

The data and the custom code have been deposited in the Open Science Framework under https://doi.org/10.17605/OSF.IO/GUHJE.

The following dataset was generated:

Eldar E. 2020. The roles of online and offline replay. Open Science Framework.

References

  1. Akam T, Rodrigues-Vaz I, Zhang X, Pereira M, Oliveira R, Dayan P, Costa RM. Single-trial inhibition of anterior cingulate disrupts model-based reinforcement learning in a two-step decision task. bioRxiv. 2017. doi: 10.1101/126292.
  2. Behrens TEJ, Muller TH, Whittington JCR, Mark S, Baram AB, Stachenfeld KL, Kurth-Nelson Z. What is a cognitive map? Organizing knowledge for flexible behavior. Neuron. 2018;100:490–509. doi: 10.1016/j.neuron.2018.10.002.
  3. Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006.
  4. Busch N, VanRullen R. Is visual perception like a continuous flow or a series of snapshots? In: Arstila V, Lloyd D, editors. Subjective Time: The Philosophy, Psychology, and Neuroscience of Temporality. MIT Press; 2014.
  5. Carey AA, Tanaka Y, van der Meer MAA. Reward revaluation biases hippocampal replay content away from the preferred outcome. Nature Neuroscience. 2019;22:1450–1459. doi: 10.1038/s41593-019-0464-6.
  6. Carlson T, Tovar DA, Alink A, Kriegeskorte N. Representational dynamics of object vision: the first 1000 ms. Journal of Vision. 2013;13:1. doi: 10.1167/13.10.1.
  7. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A. Stan: a probabilistic programming language. Journal of Statistical Software. 2017;76:i01. doi: 10.18637/jss.v076.i01.
  8. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2:27. doi: 10.1145/1961189.1961199.
  9. Cichy RM, Pantazis D, Oliva A. Resolving human object recognition in space and time. Nature Neuroscience. 2014;17:455–462. doi: 10.1038/nn.3635.
  10. Crockett MJ. Models of morality. Trends in Cognitive Sciences. 2013;17:363–366. doi: 10.1016/j.tics.2013.06.005.
  11. da Silva CF, Hare T. Model-free or muddled models in the two-stage task? bioRxiv. 2019. doi: 10.1101/682922.
  12. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience. 2005;8:1704–1711. doi: 10.1038/nn1560.
  13. Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans' choices and striatal prediction errors. Neuron. 2011;69:1204–1215. doi: 10.1016/j.neuron.2011.02.027.
  14. Decker JH, Otto AR, Daw ND, Hartley CA. From creatures of habit to goal-directed learners: tracking the developmental emergence of model-based reinforcement learning. Psychological Science. 2016;27:848–858. doi: 10.1177/0956797616639301.
  15. Diba K, Buzsáki G. Forward and reverse hippocampal place-cell sequences during ripples. Nature Neuroscience. 2007;10:1241–1242. doi: 10.1038/nn1961.
  16. Eldar E, Hauser TU, Dayan P, Dolan RJ. Striatal structure and function predict individual biases in learning to avoid pain. PNAS. 2016;113:4812–4817. doi: 10.1073/pnas.1519829113.
  17. Eldar E, Bae GJ, Kurth-Nelson Z, Dayan P, Dolan RJ. Magnetoencephalography decoding reveals structural differences within integrative decision processes. Nature Human Behaviour. 2018;2:670–681. doi: 10.1038/s41562-018-0423-3.
  18. Everitt BJ, Robbins TW. Neural systems of reinforcement for drug addiction: from actions to habits to compulsion. Nature Neuroscience. 2005;8:1481–1489. doi: 10.1038/nn1579.
  19. Foster DJ. Replay comes of age. Annual Review of Neuroscience. 2017;40:581–602. doi: 10.1146/annurev-neuro-072116-031538.
  20. Foster DJ, Wilson MA. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature. 2006;440:680–683. doi: 10.1038/nature04587.
  21. Gershman SJ, Markman AB, Otto AR. Retrospective revaluation in sequential decision making: a tale of two systems. Journal of Experimental Psychology: General. 2014;143:182–194. doi: 10.1037/a0030844.
  22. Gillan CM, Otto AR, Phelps EA, Daw ND. Model-based learning protects against forming habits. Cognitive, Affective, & Behavioral Neuroscience. 2015;15:523–536. doi: 10.3758/s13415-015-0347-6.
  23. Gillan CM, Fineberg NA, Robbins TW. A trans-diagnostic perspective on obsessive-compulsive disorder. Psychological Medicine. 2017;47:1528–1548. doi: 10.1017/S0033291716002786.
  24. Gläscher J, Daw N, Dayan P, O'Doherty JP. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron. 2010;66:585–595. doi: 10.1016/j.neuron.2010.04.016.
  25. Gupta AS, van der Meer MA, Touretzky DS, Redish AD. Hippocampal replay is not a simple function of experience. Neuron. 2010;65:695–705. doi: 10.1016/j.neuron.2010.01.034.
  26. Hunt LT, Kolling N, Soltani A, Woolrich MW, Rushworth MF, Behrens TE. Mechanisms underlying cortical activity during value-guided choice. Nature Neuroscience. 2012;15:470–476. doi: 10.1038/nn.3017.
  27. Huys QJ, Eshel N, O'Nions E, Sheridan L, Dayan P, Roiser JP. Bonsai trees in your head: how the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLOS Computational Biology. 2012;8:e1002410. doi: 10.1371/journal.pcbi.1002410.
  28. Isik L, Meyers EM, Leibo JZ, Poggio T. The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology. 2014;111:91–102. doi: 10.1152/jn.00394.2013.
  29. Ji D, Wilson MA. Coordinated memory replay in the visual cortex and hippocampus during sleep. Nature Neuroscience. 2007;10:100–107. doi: 10.1038/nn1825.
  30. Kahneman D. Thinking, Fast and Slow. Macmillan; 2011.
  31. Kool W, Cushman FA, Gershman SJ. When does model-based control pay off? PLOS Computational Biology. 2016;12:e1005090. doi: 10.1371/journal.pcbi.1005090.
  32. Kruschke J. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press; 2014.
  33. Kurdi B, Gershman SJ, Banaji MR. Model-free and model-based learning processes in the updating of explicit and implicit evaluations. PNAS. 2019;116:6035–6044. doi: 10.1073/pnas.1820238116.
  34. Kurth-Nelson Z, Barnes G, Sejdinovic D, Dolan R, Dayan P. Temporal structure in associative retrieval. eLife. 2015;4:e04919. doi: 10.7554/eLife.04919.
  35. Kurth-Nelson Z, Economides M, Dolan RJ, Dayan P. Fast sequences of non-spatial state representations in humans. Neuron. 2016;91:194–204. doi: 10.1016/j.neuron.2016.05.028.
  36. Liu Y, Dolan RJ, Kurth-Nelson Z, Behrens TEJ. Human replay spontaneously reorganizes experience. Cell. 2019;178:640–652. doi: 10.1016/j.cell.2019.06.012.
  37. Louie K, Wilson MA. Temporally structured replay of awake hippocampal ensemble activity during rapid eye movement sleep. Neuron. 2001;29:145–156. doi: 10.1016/S0896-6273(01)00186-6.
  38. Mattar MG, Daw ND. Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience. 2018;21:1609–1617. doi: 10.1038/s41593-018-0232-z.
  39. Momennejad I, Otto AR, Daw ND, Norman KA. Offline replay supports planning in human reinforcement learning. eLife. 2018;7:e32548. doi: 10.7554/eLife.32548.
  40. Moore AW, Atkeson CG. Prioritized sweeping: reinforcement learning with less data and less time. Machine Learning. 1993;13:103–130. doi: 10.1007/BF00993104.
  41. O'Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38:329–337. doi: 10.1016/S0896-6273(03)00169-7.
  42. Ólafsdóttir HF, Barry C, Saleem AB, Hassabis D, Spiers HJ. Hippocampal place cells construct reward related sequences through unexplored space. eLife. 2015;4:e06063. doi: 10.7554/eLife.06063.
  43. Ólafsdóttir HF, Carpenter F, Barry C. Task demands predict a dynamic switch in the content of awake hippocampal replay. Neuron. 2017;96:925–935. doi: 10.1016/j.neuron.2017.09.035.
  44. Oostenveld R, Fries P, Maris E, Schoffelen JM. FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Computational Intelligence and Neuroscience. 2011;2011:1–9. doi: 10.1155/2011/156869.
  45. Peng J. Efficient learning and planning within the Dyna framework. IEEE International Conference on Neural Networks; 1993. pp. 168–174. doi: 10.1109/ICNN.1993.298551.
  46. Pezzulo G, van der Meer MA, Lansink CS, Pennartz CM. Internally generated sequences in learning and executing goal-directed behavior. Trends in Cognitive Sciences. 2014;18:647–657. doi: 10.1016/j.tics.2014.06.011.
  47. Pfeiffer BE, Foster DJ. Hippocampal place-cell sequences depict future paths to remembered goals. Nature. 2013;497:74–79. doi: 10.1038/nature12112.
  48. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2018. https://www.R-project.org
  49. Russek EM, Momennejad I, Botvinick MM, Gershman SJ, Daw ND. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLOS Computational Biology. 2017;13:e1005768. doi: 10.1371/journal.pcbi.1005768.
  50. Schuck NW, Niv Y. Sequential replay of nonspatial task states in the human hippocampus. Science. 2019;364:eaaw5181. doi: 10.1126/science.aaw5181.
  51. Skaggs WE, McNaughton BL. Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science. 1996;271:1870–1873. doi: 10.1126/science.271.5257.1870.
  52. Stachenfeld KL, Botvinick MM, Gershman SJ. The hippocampus as a predictive map. Nature Neuroscience. 2017;20:1643–1653. doi: 10.1038/nn.4650.
  53. Stanovich KE, West RF. Individual differences in reasoning: implications for the rationality debate? Behavioral and Brain Sciences. 2000;23:645–665. doi: 10.1017/S0140525X00003435.
  54. Sutton RS. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin. 1991;2:160–163. doi: 10.1145/122344.122377.
  55. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge: MIT Press; 1998.

Decision letter

Editor: Thorsten Kahnt1
Reviewed by: Samuel J Gershman2

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The findings presented in this paper describe the relationship between cognitive flexibility and replay in humans. A major strength of this study is that it shows a direct relationship between replay and individual differences in behavior. This is a major step toward a better understanding of how replay contributes to reinforcement learning and planning.

Decision letter after peer review:

Thank you for submitting your article "The roles of online and offline replay in planning" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Kate Wassum as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Samuel J Gershman (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

As the editors have judged that your manuscript is of interest, but as described below that additional experiments or analyses are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is 'in revision at eLife'. Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Summary:

This paper presents evidence for the relationship between cognitive flexibility and replay decoded from MEG signals. The work significantly extends prior results, by connecting replay with individual differences in behavior. All reviewers agreed that this work is technically solid and the experimental results compelling, distinguishing between forward and backward replay as well as online vs. offline replay. Reviewers also found the paper clearly written but somewhat dense. Reviewers also noted some issues that should be addressed with additional analyses and/or data before the manuscript can be accepted for publication.

Essential revisions:

1) Evidence for replay is based on decoding of the previous state at the time of outcome. How much of this decoding can be explained by sensory input or sensory processing of the previous image, rather than replay as a separate neural event? Figure 2A shows that decoding might still be above chance after image offset. And Figure 2—figure supplement 2B suggests that decoding of the previous state is already above chance at and likely before outcome onset. Can the authors show that decoding is indeed driven by a replay-like mechanisms rather than sensory/perceptual processing of the previous state? For instance, how does decoding evolve in the second(s) before outcome onset? If decoding is driven by replay at outcome onset, should we not expect decoding to peak after outcome onset? Also decoding at outcome onset should not depend on the length of the ISI between the offset of the previous image and outcome onset. (Alternatively, if decoding is driven by sensory processing, decodability should be higher for shorter compared to longer ISIs.) Finally, the authors could use data from a more passive viewing task to get an idea of how long decoding should be expected to remain above chance after image offset in the absence of replay. To support the main conclusions about replay, it would be important to show that decoding is not simply driven by residual processing of the previous image.

2) Where in the brain does replay occur? It would be interesting to see which of the 273 MEG channels contribute to successful decoding. If the authors could show that putative replay is localized near hippocampal sensors rather than, e.g., visual ones, this may also help to convince readers that decoding of the previous state is not simply based on visual processing (see essential revision point 1).

3) Some reviewers found the manuscript very hard to read. Results are consistently stated in terms of technical constructs and/or methods that require substantial acquaintance with the authors' previous work. For example, the Introduction cites a copious literature on rodent replay and preplay in hippocampus without justifying the premise that this can be successfully decoded from MEG in humans. Another example: subsection “Bayesian hierarchical Gaussian Process time series analysis”, second paragraph. This was considered far too dense, particularly since several critical modeling choices were not well justified. The concern is that this style forces non-specialist readers (and even some specialists) to do a lot of work that might be avoided with more motivation and explication. A better description of some of the key analysis methods (the sequenceness analysis) in the main text would be helpful. In this regard, it might also help to have some kind of conceptual figure relating individual flexibility, model-free vs. model-based learners, preplay/replay, on-task/off-task, and other relevant constructs so readers can see their relationships. The authors present a number of different results in different epochs pertaining to different claims, but it is hard to see the coherent whole, particularly for those not already heavily invested in these debates.

4) The authors invoke the recent Mattar and Daw theory of hippocampal replay, arguing that it predicts that "subjects should be more disposed to replay trajectories that they might not want to choose again, rather than trajectories whose choice reflects a firm policy". We gather that this is related to Figure 5 in Mattar and Daw. But aren't the theory's predictions more complicated? Mattar and Daw also point out that enhanced replay should occur when an agent is likely to repeat an action, for example when receiving a larger than expected reward. Another issue is that Mattar and Daw's theory specifically predicts an effect on reverse replay ("backward sequenceness" here) not forward replay ("forward sequenceness"), but the authors report an effect for forward sequenceness.

5) "off-task sequenceness negatively correlated with IF (Figure 3). This association of sequenceness during rest with low flexibility is consistent with a proposed role of offline replay in establishing model-free policies." The logic of these predictions is understandable based on algorithms like Dyna, but couldn't this also go in the opposite direction? For example, if one thinks about the Gershman, Markman and Otto, 2014 studies, their revaluation index is conceptually similar to the IF measure used here. But in that case large revaluation (i.e., greater flexibility) was used to argue in favor of more replay in a Dyna-like manner, and this hypothesis received more direct confirmation from the Momennejad eLife paper. How can we reconcile these two interpretations of flexibility (as reflecting more vs. less replay)?

eLife. 2020 Jun 17;9:e56911. doi: 10.7554/eLife.56911.sa2

Author response


Essential revisions:

1) Evidence for replay is based on decoding of the previous state at the time of outcome. How much of this decoding can be explained by sensory input or sensory processing of the previous image, rather than replay as a separate neural event? Figure 2A shows that decoding might still be above chance after image offset. And Figure 2—figure supplement 2B suggests that decoding of the previous state is already above chance at and likely before outcome onset. Can the authors show that decoding is indeed driven by a replay-like mechanisms rather than sensory/perceptual processing of the previous state? For instance, how does decoding evolve in the second(s) before outcome onset? If decoding is driven by replay at outcome onset, should we not expect decoding to peak after outcome onset? Also decoding at outcome onset should not depend on the length of the ISI between the offset of the previous image and outcome onset. (Alternatively, if decoding is driven by sensory processing, decodability should be higher for shorter compared to longer ISIs.) Finally, the authors could use data from a more passive viewing task to get an idea of how long decoding should be expected to remain above chance after image offset in the absence of replay. To support the main conclusions about replay, it would be important to show that decoding is not simply driven by residual processing of the previous image.

The reviewers raise an important question about the interpretation of previous-state decoding on which evidence of replay is based. We thank the reviewers for their helpful suggestions as to how to address this question. In a new analysis, we examine decodability of each state time-locked to its own offset. The results show that outcomes were indeed followed by increased decodability that correlated with evidence of post-outcome sequenceness (n = 40, r = 0.32, p = 0.04, permutation test). Furthermore, the experiment provides us with a natural comparison between non-terminal states that were followed by outcomes (blue in Figure 2—figure supplement 2C; those from which subjects moved) and terminal states that were not (grey; those that marked the end of a trial). Following the terminal states, we could decode the preceding non-terminal states (the black bar). However, crucially, we could not significantly decode terminal states this long after their offset (difference between non-terminal and terminal states: p = 0.03, Permutation test). These results indicate that decoded state representations did not only reflect sensory processing, and were well suited to contribute to post-outcome planning. We now report this new result in Figure 2—figure supplement 2C and note it in the main text:

“As would be expected, the finding of sequenceness was associated with enhanced decoding of previously visited states during the post-outcome epoch (Figure 2—figure supplement 2C).”

2) Where in the brain does replay occur? It would be interesting to see which of the 273 MEG channels contribute to successful decoding. If the authors could show that putative replay is localized near hippocampal sensors rather than, e.g., visual ones, this may also help to convince readers that decoding of the previous state is not simply based on visual processing (see essential revision point 1).

Understanding where in the brain decoded representations lie is indeed important. However, such an analysis has limited practical usefulness for the present study. First, localizing MEG activity to specific brain areas is highly imprecise in the absence of a proper MRI-based head model (unfortunately, subjects were not scanned using MRI), and even with precise head models, identifying hippocampal activity in MEG data is particularly challenging (Meyer et al., 2017). Second, hippocampal replay has been shown, at least in some settings, to be associated with coherent replay in sensory cortex (Ji and Wilson, 2007). Thus, a differentiation between hippocampal and cortical replay in our data might be unreliable and not necessarily expected.

Nevertheless, delineating the contribution of each sensor to decoding is useful for future reference. We now provide this information in a new sensor map (Figure 2—figure supplement 1B). Our support vector machine decoders do not provide easily interpretable parameters, and thus, we quantified a sensor’s contribution by measuring the correlation between its MEG signal and the decoder output, across all main task timepoints. The results indicate all sensors consistently contributed to decoding (most contributing channel: M = 0.269, SE = 0.009; least contributing channel: M = 0.156, SE = 0.004), with a small advantage for posterior sensors. The map has been added to the manuscript as Figure 2—figure supplement 1B and is referred to in the main text.

3) Some reviewers found the manuscript very hard to read. Results are consistently stated in terms of technical constructs and/or methods that require substantial acquaintance with the authors' previous work. For example, the Introduction cites a copious literature on rodent replay and preplay in hippocampus without justifying the premise that this can be successfully decoded from MEG in humans. Another example: subsection “Bayesian hierarchical Gaussian Process time series analysis”, second paragraph. This was considered far too dense, particularly since several critical modeling choices were not well justified. The concern is that this style forces non-specialist readers (and even some specialists) to do a lot of work that might be avoided with more motivation and explication. A better description of some of the key analysis methods (the sequenceness analysis) in the main text would be helpful. In this regard, it might also help to have some kind of conceptual figure relating individual flexibility, model-free vs. model-based learners, preplay/replay, on-task/off-task, and other relevant constructs so readers can see their relationships. The authors present a number of different results in different epochs pertaining to different claims, but it is hard to see the coherent whole, particularly for those not already heavily invested in these debates.

We apologise that the original version was hard for readers. We thank the reviewers for helpful suggestions as to how to make the paper more readable. We have substantially revised the manuscript in accordance with these suggestions, including adding the following changes:

– We now devote a paragraph in the Introduction to highlight how recent advances make it possible to detect evidence of replay in humans using MEG):

“Despite the wide-ranging behavioural implications of a distinction between model-based and model-free planning (Kurdi, Gershman and Banaji, 2019; Crockett, 2013; Everitt and Robbins, 2005; Gillan, Fineberg and Robbins, 2017), and much theorising on the role of replay in one or the other form of planning, to date there is little data indicating whether online and offline replay have complementary or contrasting impacts in this regard. […] Thus, finally, the relationships between elements’ representational probability time series are examined to determine whether pairs of elements tended to be represented sequentially, one after the other.”

– We explain the ‘sequenceness’ measure in a new schematic figure (Figure 2B).

– We clarify the details of, and motivation for, the different design choices made in the hierarchical Gaussian Process analysis. We reproduce below a key excerpt added to that effect:

“To determine whether sequenceness time-series recorded following outcomes provided robust evidence of replay that correlated with individual index of behavioural flexibility, we modelled mean sequenceness time-series as Gaussian Processes with squared exponential kernels. […] Therefore, we accounted for these deviations by means of another set of Gaussian Processes, one for each modelled time series, which were added to the predictions of the group-level Process.”

–We complement the discussion with a schematic figure that provides a coherent view of the main findings (Figure 5).

4) The authors invoke the recent Mattar and Daw theory of hippocampal replay, arguing that it predicts that "subjects should be more disposed to replay trajectories that they might not want to choose again, rather than trajectories whose choice reflects a firm policy". We gather that this is related to Figure 5 in Mattar and Daw. But aren't the theory's predictions more complicated? Mattar and Daw also point out that enhanced replay should occur when an agent is likely to repeat an action, for example when receiving a larger than expected reward. Another issue is that Mattar and Daw's theory specifically predicts an effect on reverse replay ("backward sequenceness" here) not forward replay ("forward sequenceness"), but the authors report an effect for forward sequenceness.

Mattar and Daw’s perspective indeed predicts enhanced replay also when an agent becomes more likely to repeat an action. This prediction holds only if an outcome is surprising, such that the information it provides is not already embedded in the agent’s policy. In our experiment, the most natural implication is that replay should be enhanced following surprising outcomes, and consistent with their perspective, our results show that more surprising outcomes were in fact followed by stronger sequenceness (subsection “On-task replay is induced by prediction errors and associated with high flexibility”, last paragraph). As noted by the reviewers, one critical difference from Mattar and Daw here is that we primarily find forward sequenceness following outcomes, where Mattar and Daw predict backward sequenceness. We now highlight this difference explicitly in the Discussion:

“Two main aspects of our results deviate from this normative perspective. First, we primarily find forward sequenceness following outcomes.”

That said, what we see as the most crucial element in Mattar and Daw (as well as in the work on retrospective revaluation; Gershman et al., 2014; Momennejad et al., 2018) is that replay contributes not only to knowledge about the structure of the state space, but also to planning (e.g., by updating action values). An association with prediction errors is not sufficient to substantiate this claim. Prediction errors are of course coupled with policy updates in our model of the task, but that is an assumption built into the model. Thus, what is needed is to examine whether sequenceness is associated with concrete evidence of policy updates.

A simple test for this prediction is provided by the fact that subjects’ policy updates at the beginning of each experimental phase are more likely to give rise to behavioral change than to repetition. This is because subjects begin each phase with partial knowledge acquired in the previous phase (as evidenced by above chance performance; Figure 1C), and then go on to improve their policies further as they gain additional experience. Therefore, at this point, some of a subject’s choices already reflect a well-informed policy, and such choices are less likely to induce policy updates and are more likely to be repeated. By contrast, subjects’ more poorly informed choices are more likely to induce policy updates and are less likely to be repeated. Such variance among choices exists in many bandit tasks, but it is likely to be especially pronounced in the present experiment since outcomes are deterministic but numerous, such that learning and forgetting are fast (the average modelled MB forgetting rate was 0.3). Thus, examining behavioral change provides a convenient marker of policy update that is not dependent on specific modeling assumptions. We now clarify this logic more explicitly in the main text:

“Recent theorising regarding the role of replay in planning argues that replay is preferentially induced when there is benefit to updating one’s policy (Mattar and Daw, 2018). […] Thus, a role for replay in planning predicts that subjects should be more disposed to replay trajectories that they might not want to choose again, rather than trajectories whose choice reflects a firm policy.”

5) "off-task sequenceness negatively correlated with IF (Figure 3). This association of sequenceness during rest with low flexibility is consistent with a proposed role of offline replay in establishing model-free policies." The logic of these predictions is understandable based on algorithms like Dyna, but couldn't this also go in the opposite direction? For example, if one thinks about the Gershman, Markman and Otto, 2014 studies, their revaluation index is conceptually similar to the IF measure used here. But in that case large revaluation (i.e., greater flexibility) was used to argue in favor of more replay in a Dyna-like manner, and this hypothesis received more direct confirmation from the Momennejad eLife paper. How can we reconcile these two interpretations of flexibility (as reflecting more vs. less replay)?

The reviewers raise an interesting question regarding the relationship between the present findings and previous findings on retrospective revaluation. Whereas previous findings indicate offline replay is associated with behavioural change, our findings suggest it is associated with decreased flexibility. We believe the two lines of work can be reconciled by considering the scale of flexibility each emphasizes. While retrospective revaluation manifests in the comparison between learning and test phases separated by additional phases several minutes long, our experiment requires that subjects implement different stimulus-response mappings in contiguous trials. An algorithm like Dyna would only be able to afford such flexibility if it could furnish multiple stimulus-response mappings that can be summoned by context. It is possible such flexibility goes beyond the capabilities of the forms of Dyna that might be implemented in the brain. We now discuss this issue in relation to previous work in the main text:

“This aspect of the task might explain an apparent discrepancy with previous work on retrospective revaluation (Momennejad et al., 2018; Gershman, Markman and Otto, 2014), which indicates offline replay is associated with a greater degree of behavioural change. […] It is possible such flexibility goes beyond the capabilities of any forms of DYNA that might be implemented in the brain.”

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Eldar E. 2020. The roles of online and offline replay. Open Science Framework. [DOI]

    Supplementary Materials

    Transparent reporting form

    Data Availability Statement

    The data and the custom code have been deposited in the Open Science Framework under https://doi.org/10.17605/OSF.IO/GUHJE.

    The following dataset was generated:

    Eldar E. 2020. The roles of online and offline replay. Open Science Framework.

