Abstract
Reward prediction errors track the extent to which rewards deviate from expectations, and aid in learning reward predictions. How do such errors in prediction interact with memory for the rewarding episode? Existing findings point to both cooperative and competitive interactions between learning and memory mechanisms. Here, we investigated whether learning about rewards in a high-risk context, with frequent, large prediction errors, gives rise to higher-fidelity memory traces for rewarding events than learning in a low-risk context. Experiment 1 showed that recognition was better for items associated with larger absolute prediction errors during reward learning. Larger prediction errors also led to higher rates of learning about rewards. Interestingly, we did not find a relationship between learning rate for reward and recognition-memory accuracy for items, suggesting that these two effects of prediction errors were due to separate underlying mechanisms. Experiment 2 replicated these results with a longer task that posed stronger memory demands and allowed for more learning. We also showed improved source and sequence memory for items within the high-risk context. Experiment 3 controlled for the difficulty of reward learning in the two risk environments, again replicating the previous results. Moreover, this control revealed that the high-risk context enhanced item-recognition memory beyond the effect of prediction errors. In summary, our results show that prediction errors boost both episodic item memory and incremental reward learning, but the two effects are likely mediated by distinct underlying systems.
Keywords: reward prediction error, risk, memory, reinforcement learning, surprise
If you receive a surprising reward, would you remember the event better or worse than if that same reward were expected? And what if it was a surprising punishment? Surprising rewards or punishments cause “prediction errors” that are important for learning which outcomes to expect in the future, but it is unclear how these prediction errors affect episodic memory for the details of the surprising event. Theories of learning suggest that outcomes are integrated across experiences, yielding an average expected value for the rewarding source. Alternatively, we could use distinct episodic memories of past events and their outcomes to help guide us towards rewarding experiences and away from punishing ones. Incremental learning and episodic memory systems can collaborate during decision making, for example, when both the expected value of an option and a distinct memory of a previously experienced outcome influence a decision (Biele, Erev, & Ert, 2009; Duncan & Shohamy, 2016). The two systems can also compete for processing resources: compromised feedback-based learning has been associated with enhanced episodic memory, both behaviorally and neurally (Foerde, Braun, & Shohamy, 2012; Wimmer, Braun, Daw, & Shohamy, 2014). Here, we study the nature of the interaction between incremental learning and episodic memory by investigating the role of reward prediction errors – rapid and transient reinforcement signals that track the difference between actual and expected outcomes – in the formation of episodic memory for rewarding events.
Reward prediction errors play a well-established role in updating stored information about the values of different choices, and are known to modulate dopamine release. When a reward is better than expected, there is an increase in the firing of dopamine neurons, and conversely, when the reward is worse than expected, there is a dip in dopaminergic firing (Schultz & Dickinson, 2000; Schultz, Dayan, & Montague, 1997). Dopamine, in turn, modulates plasticity in the hippocampus, a key structure for episodic memory (Lisman & Grace, 2005). This dopaminergic link therefore provides a potential neurobiological mechanism for reward prediction errors to affect episodic memory. However, there are several ways by which reward prediction errors could potentially influence episodic memory. First, if memory formation is affected by this signed prediction error, then we would expect an asymmetric effect on memory, such that a positive prediction error (leading to an increase in dopaminergic firing) would improve memory whereas a negative prediction error (leading to a decrease in dopaminergic firing) would worsen it.
A second possibility is that the magnitude of the prediction error could influence episodic memory regardless of the sign of the error, enhancing memory for events that are either much better or much worse than expected. Outside of reward learning, surprising feedback has been linked to better memory for both the content and source of feedback events in studies investigating the “hypercorrection” effect, where high confidence errors are more likely to be corrected and remembered (Butterfield & Mangels, 2003; Butterfield & Metcalfe, 2001; Fazio & Marsh, 2009, 2010). The same memory benefit for high-confidence errors has also been shown for low-confidence correct feedback, and one can envision both high-confidence errors and low-confidence correct trials as generating a large (unsigned) prediction error. These putative “high prediction-error events” have also been shown to modulate attention, as measured by impaired performance on a secondary task; the degree of this attentional capture in turn predicts subsequent memory enhancement for the feedback content (Butterfield & Metcalfe, 2006).
The effects of unsigned prediction errors are thought to be mediated by the locus-coeruleus-norepinephrine (LC-NE) system, which demonstrates a transient response to unexpected changes in stimulus-reinforcement contingencies in both reward and fear learning (that is, regardless of sign; for a review, see Sara, 2009), and modulates increases in learning rate, i.e. the extent to which a learner updates their values, following large unsigned prediction errors (Behrens, Woolrich, Walton, & Rushworth, 2007; McGuire, Nassar, Gold, & Kable, 2014; Nassar et al., 2012; Pearce & Hall, 1980). Importantly, recent evidence also indicates that the locus coeruleus co-releases dopamine with norepinephrine, giving rise to dopamine-dependent plasticity in the hippocampus (Kempadoo, Mosharov, Choi, Sulzer, & Kandel, 2016; Takeuchi et al., 2016). This latter pathway thereby provides a mechanism whereby unsigned prediction errors could affect episodic memory, by modulating hippocampal plasticity.
In the following experiments, we therefore tested whether signed or unsigned prediction errors influence learning rate and episodic memory, and whether these two effects are correlated. Correlated effects on learning of values and memory for events would suggest a common mechanism underlying both effects, whereas two uncorrelated effects are consistent with separate underlying mechanisms.
We also wanted to measure the effect of risk context (i.e., whether unsigned prediction errors were large or small, on average, in a particular environment) on episodic memory. Previous work on the effects of risk context shows that dopamine signals scale to the reward variance of the learning environment (Tobler, Fiorillo, & Schultz, 2005), allowing for greater sensitivity to prediction errors in lower variance contexts. Moreover, behavioral learning rate and BOLD responses in the dopaminergic midbrain and striatum reflect this adaptation, with higher learning rates and increased striatal response to prediction errors when the reward variance is lower (Diederen, Spencer, Vestergaard, Fletcher, & Schultz, 2016). We therefore expected higher learning rates in a low-risk context, but it was unclear whether this effect would interact with episodic memory. If anything, for memory we expected opposite effects, such that a high-risk context would induce better episodic memory, as salient feedback (like experiencing high magnitude prediction errors) is thought to increase autonomic arousal and encoding of those events (Clewett, Schoeke, & Mather, 2014). The mnemonic effects of higher magnitude prediction errors may also “spill over” to surrounding items, boosting memory for those items as well, again predicting better memory for events experienced in the high-risk context (Duncan, Sadanand, & Davachi, 2012; Mather, Clewett, Sakaki, & Harley, 2015).
To investigate the effect of prediction errors and risk context on the structure of memory, we asked participants to learn by trial and error which of two types of images, indoor or outdoor scenes, leads to larger rewards. Trial-unique indoor and outdoor images were presented in two different contexts or ‘rooms,’ with each room associated with a different degree of outcome variance. The average values of the scene categories in the two rooms were matched. Participants were instructed to learn the average (expected) value of each type of image (indoor or outdoor scenes), given the variable individual outcomes experienced for each scene, as is typically done in reinforcement learning tasks (e.g. O’Doherty, Dayan, Friston, Critchley, & Dolan, 2003; Wimmer et al., 2014).
Specifically, we asked participants to explicitly estimate, on each trial, the average value of the category of the current scene. The deviation between this estimate and the outcome on that trial defined the trial-specific subjective prediction error. These prediction errors were then used to calculate trial-by-trial learning rates for the average values of the categories, as well as to predict future memory for the specific scenes presented on each trial. At a later stage, memory for the individual scenes was assessed through recognition memory (‘item’ memory), identification of the room the item belonged to (‘source’ or context memory; Exp. 2–3), and the ordering of a pair of items (‘sequence’ memory). Given that both category-value learning and individual scene memory were hypothesized to depend on the same prediction errors, we also characterized the relationship between learning about the average rewards in the task and episodic memory for the individual rewarding events.
Experiment 1
In Experiment 1, we assessed whether reward prediction errors interact with episodic memory for rewarding episodes. Participants learned the average reward values of images from two categories (indoor or outdoor scenes) in two learning contexts (‘rooms’). The two learning contexts had the same mean reward, but different degrees of reward variance (‘risk’) such that the rewards associated with scenes in the ‘high-risk room’ gave rise to higher absolute prediction errors than in the ‘low-risk room’. We then assessed participants’ recognition for the different scenes in a surprise memory test, to test how prediction errors due to the reward associated with each episode affected memory for that particular scene.
Method
Participants
Two hundred participants initiated an online task using Amazon Mechanical Turk (MTurk), and 174 completed the task. We obtained informed consent online, and participants had to correctly answer questions checking their understanding of the instructions before proceeding (see supplementary material); procedures were approved by Princeton University’s Institutional Review Board. Participants were excluded if they (1) had a memory score (A′: sensitivity index in signal detection; Pollack & Norman, 1964) of less than 0.5 based on their hit rate and false alarm rate for item recognition memory, or (2) missed more than three trials. These criteria led to the exclusion of ten participants, resulting in a final sample of 164 participants. Although we do not have demographic information for the MTurk workers who completed these experiments, an online demographic tracker reports that during the time we collected data, the samples were approximately 55% female; 40% of workers were born before 1980, 40% between 1980 and 1990, and 20% between 1990 and 1999 (Difallah, Catasta, Demartini, Ipeirotis, & Cudré-Mauroux, 2015; Ipeirotis, 2010).
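The exact A′ formula is not spelled out in the paper; below is a minimal R sketch of one common non-parametric formulation (after Pollack & Norman, 1964), with variable names and the example hit and false-alarm rates chosen purely for illustration.

```r
# Sketch (not the authors' code): non-parametric A' from hit rate and false-alarm rate
a_prime <- function(hit, fa) {
  if (hit >= fa) {
    0.5 + ((hit - fa) * (1 + hit - fa)) / (4 * hit * (1 - fa))
  } else {
    0.5 - ((fa - hit) * (1 + fa - hit)) / (4 * fa * (1 - hit))
  }
}

# Example: a participant with 12/16 hits and 4/16 false alarms on recognition
a_prime(hit = 12/16, fa = 4/16)  # ~0.83, above the 0.5 exclusion cutoff
```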
Procedure
Participants learned by trial and error the average value of images from two categories (indoor or outdoor scenes) in two rooms defined by different background colors (see Figure 1). In each room, one type of scene was worth 40¢ on average (low-value category) and the other worth 60¢ (high-value category). The average values of the categories were matched across rooms, but the reward variance of the high-risk room was more than double that of the low-risk room (high-risk σ = 34.25, low-risk σ = 15.49). The order of the rooms (high-risk and low-risk) was randomized across participants. In an instruction phase, participants were explicitly told (through written instructions; see supplementary material) that in each room one scene category is worth more than the other (a ‘winning’ category) and were asked to indicate the winner after viewing all images in a room. They were not told the reward distributions of the rooms, nor that the rooms would have different levels of variance. In addition, to motivate participants to pay attention to individual scenes and their outcomes, participants were told that later in the experiment they would have the opportunity to choose between these same scenes and receive the rewards associated with them as per their choices.
After the two learning blocks (one high-risk and one low-risk), participants completed a risk attitude questionnaire (DOSPERT, Weber, Blais, & Betz, 2002) that served to create a 5–10 minute delay between learning and memory tests. Participants then completed a surprise item-recognition task (i.e., participants were never told that their memory for scenes would be tested, apart from instructions about the choice task as detailed above), as well as a sequence memory task. After the memory tests, participants made choices between previously seen images.
Learning
On each trial, participants were shown a trial-unique image (either an indoor or outdoor scene) for 2 seconds. Participants then had up to 5 seconds to estimate how much that type of scene is worth on average in that room (from 1 to 100 cents). In other words, participants were asked to provide their estimate of the average, or expected value, of the scene category based on the previous (variable) outcomes they had experienced from that scene category within the room. The scene was then presented again for 3 seconds along with its associated reward (see Figure 1A). In the instructions (see supplementary material), participants had been told that although trial-unique images can take on different rewards, each scene category had a stable mean reward, and on average one scene category was worth more than the other. Note that participants were not asked to estimate the exact outcome they would receive on that trial, but instead were estimating the average expected reward from that scene category. Accordingly, participants had also been told that their payment was not contingent on how accurate their guesses were relative to the reward on that trial. Instead, their payment was solely determined by the rewards they received, to ensure that rewards were meaningful for the participant. This task structure was chosen to ensure that participants would continue to experience prediction errors on each trial (i.e., for individual scenes) even after correctly estimating the expected values of the categories, as is commonly done in reinforcement learning tasks (e.g. Niv, Edlund, Dayan, & O’Doherty, 2012).
There were 16 trials in each room (8 outdoor and 8 indoor). Rewards were 20¢, 40¢, 80¢, 100¢ (twice each) for the high-risk–high-value category, 0¢, 20¢, 60¢, 80¢ for the high-risk–low-value category, 45¢, 55¢, 65¢, 75¢ for the low-risk–high-value category and 25¢, 35¢, 45¢, 55¢ for the low-risk–low-value category. All participants experienced the same sequence of rewards within each room, with the order of the rooms randomized.
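As a check on the risk manipulation, pooling the listed outcomes within each room (assuming each listed value occurred twice, i.e., 8 trials per category and 16 per room) reproduces the reported standard deviations; a brief R sketch:

```r
# Pooled outcomes per room in Experiment 1 (each value occurring twice)
high_risk <- rep(c(20, 40, 80, 100,    # high-value category (mean 60)
                   0, 20, 60, 80), 2)  # low-value category (mean 40)
low_risk  <- rep(c(45, 55, 65, 75,     # high-value category (mean 60)
                   25, 35, 45, 55), 2) # low-value category (mean 40)

sd(high_risk)  # ~34.25, as reported for the high-risk room
sd(low_risk)   # ~15.49, as reported for the low-risk room
```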
Memory
After completing the risk questionnaire, participants were presented with a surprise recognition memory test in which they were asked whether different scenes were old or new (Figure 1) as well as their confidence for that memory judgment (from 1 ‘guessing’ to 4 ‘completely certain’). There were 32 test trials, including 16 old images (8 from each room) and 16 foils. Participants were then asked to sequence 8 pairs of previously seen scenes (which were not included in the recognition memory test) by answering ‘which did you see first?’ (Figure 1) and by estimating how many trials apart the images had been from each other. Each pair belonged to either the low (4 pairs) or the high-risk room (4 pairs).
Choice
In the last phase of the experiment, to verify that participants had encoded and remembered the individual outcomes associated with different scenes, participants were asked to choose between pairs of previously seen scenes for a chance to receive their associated reward again (see Figure 1C). The pairs varied in either belonging to the same room or different rooms and some were matched for reward and/or average scene value in order to test for the effects of factors such as risk context on choice preference. The choices were presented without feedback.
Statistical Analysis
Analyses were conducted using paired t-tests, repeated measures ANOVAs, and generalized linear mixed-effects models (lme4 package in R; Bates et al., 2015). All results reported below (t-tests and ANOVAs) were confirmed using linear or generalized mixed-effects models treating participant as a random effect (for both the intercept and slope of the fixed effect in question). We note that in all experiments, our results held when controlling for the between-subjects variable of room order (for brevity, we only explicitly report these results in Experiment 1, see below).
Results
Learning
Participants learned the average values of the high- and low-value categories better in the low-risk than in the high-risk room, as assessed by the deviation of their value estimates from the true averages of the scene categories (t(163) = 14.52, p < 0.001; Figure 3A). We then calculated, for every scene, the prediction error (PEt) associated with that scene by subtracting participants’ value estimates (Vt) from the reward outcome they observed (Rt; see Figure 2). This showed that, as we had planned, there were more high-magnitude prediction errors in the high-risk room as compared to the low-risk room (t(163) = 36.77, p < 0.001, within-subject comparison of average absolute prediction errors between the two rooms; Figure 3B).
Moreover, there was an interaction between risk and scene category such that participants overestimated the value of low-value scene category (resulting in negative prediction errors, on average) and underestimated the value of high-value scene category (resulting in positive prediction errors, on average) to a greater extent in the high-risk room than in the low-risk room (F(1,163) = 141.2, p < 0.001 for a within-subject interaction of the effects of room and scene category on the average signed prediction error; Figure 3C). This demonstrates more difficulty in separating the values of the categories in the high-risk room, consistent with previous findings showing that when people estimate the means of two largely overlapping distributions, they tend to average across the two distributions, thereby grouping them into one category instead of separating them into two (Gershman & Niv, 2013). Despite greater difficulty in separating the values of the high and low value categories within the high-risk room, most participants correctly guessed the “winner”, or the high-value scene category, within both the high-risk (88%) and the low-risk (89%) rooms.
Memory by Risk and Prediction Error
We found that items within the high-risk room were recognized better than items within the low-risk room (z = 2.37, p = 0.02, β = 0.31; Figure 4A). To test the effect of reward prediction errors on item-recognition memory, we ran two separate mixed-effects logistic regression models of memory accuracy, one testing for the effect of signed and the other the effect of unsigned (absolute) prediction errors on recognition memory. Both models also included a risk-level regressor to test for the effects of risk and prediction error separately, and treated participants as a random effect. We did not find signed prediction errors to influence recognition memory beyond the effect of risk (signed prediction error (PEt): z = 0.71, p = n.s., β = 0.04; risk: z = 2.29, p = 0.02, β = 0.30). Instead, we found that larger prediction errors enhanced memory regardless of the sign of the prediction error, which also explained the modulation of memory by risk (absolute prediction error (|PEt|) : z = 3.36, p < 0.001, β = 0.23; risk: z = 0.9, p = n.s., β = 0.10; Figure 4B).
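The paper does not give model syntax; a minimal lme4 sketch of the unsigned prediction-error model described above, with hypothetical column names, simulated placeholder data, and a by-participant random intercept and slope, might look like:

```r
library(lme4)

# Sketch with assumed column names; `d` stands in for the trial-level data:
#   remembered (0/1 recognition accuracy), abs_pe (|PE_t|, standardized),
#   risk (factor: high- vs. low-risk room), subject (participant ID).
set.seed(1)
d <- data.frame(
  subject    = factor(rep(1:50, each = 32)),
  abs_pe     = rnorm(50 * 32),
  risk       = factor(rep(c("high", "low"), times = 50 * 16)),
  remembered = rbinom(50 * 32, 1, 0.7)
)

m_unsigned <- glmer(
  remembered ~ abs_pe + risk + (1 + abs_pe | subject),
  data = d, family = binomial
)
summary(m_unsigned)  # fixed-effect z and beta analogous to the values reported
```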
We ran two subsequent models testing for confounds, one including the effect of value estimates and the other the actual reward outcomes associated with the items, along with the effect of absolute prediction errors. Absolute prediction error had a significant effect on recognition memory when controlling for reward outcome (|PEt| : z = 3.94, p < 0.001, β = 0.26; Rt: z = 0.45, p = n.s., β = 0.02) and value estimates (|PEt| : z = 3.93, p < 0.001, β = 0.26; Vt: z = −0.09, p = n.s., β = −0.005). This effect also held when modeling recognition memory for items in the high and low-risk rooms separately (high-risk: z = 1.90, p = 0.05, β = 0.18; low-risk: z = 2.17, p = 0.03, β = 0.24), and in a model of the effects of absolute prediction errors on recognition memory that controlled for room order (|PEt| : z = 3.90, p < 0.001, β = 0.25; room order: z = 1.95, p = 0.05, β = 0.33). Although room order itself did affect recognition memory (participants who experienced the low-risk room first showed better memory accuracy overall), all of our main findings (including learning rate, see below) held when controlling for this effect.
Reward prediction errors therefore affected recognition memory, such that larger deviations from one’s predictions, in any direction, enhanced memory for items. Finally, we tested for the effect of risk on sequence memory (the correct ordering of two images seen during learning) and found no difference in sequence memory between pairs of images seen in the high and low-risk rooms (z = 0.11, p = n.s., β = 0.02).
Learning Rate by Risk and Prediction Error
We also examined the effects of risk and prediction errors on the reward learning process itself. For this we calculated a trial-by-trial learning rate αt as the proportion of the current prediction error PEt = Rt − Vt that was applied to update the value for the next encounter of the same type of scene, Vt+1 (see Figure 2 for schematic representing learning rate calculation). That is, we derived the trial-specific learning rate directly from the standard reinforcement-learning update equation Vt+1 = Vt + αt(Rt − Vt), as αt = (Vt+1 − Vt) / (Rt − Vt).
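To illustrate this calculation (hypothetical numbers, not participant data), the trial-specific learning rates for one scene category can be recovered directly from the sequence of value estimates and outcomes:

```r
# Illustration: alpha_t = (V_{t+1} - V_t) / (R_t - V_t) for one scene category
V <- c(50, 70, 58, 62)   # hypothetical successive value estimates (cents)
R <- c(80, 40, 65, 90)   # hypothetical reward outcomes (cents)

pe    <- R - V                                      # prediction errors
alpha <- (V[-1] - V[-length(V)]) / pe[-length(pe)]  # trial-by-trial learning rates
alpha  # e.g., first trial: (70 - 50) / (80 - 50) = 0.67
```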
In agreement with recent findings (e.g. Diederen et al., 2016), we found that average learning rate was higher in the low-risk room than in the high-risk room (t(163) = 3.37, p < 0.001 within-subjects test; Figure 5A). Moreover, higher absolute prediction errors increased trial-by-trial learning rates (αt) above and beyond the effect of risk (mixed-effects linear model, effect of absolute prediction error: t = 3.30, p = 0.001, β = 0.07; risk: t = 4.67, p < 0.001, β = 0.16; Figure 5B). We did not find participant room order to influence learning rate (t = 0.31, p = n.s., β = −0.03). These results show that larger absolute prediction errors enhance value updating, and further, that learning rates adapt to the reward variance such that there is greater sensitivity to prediction errors in a lower-risk environment.
We next ran a mixed-effects regression model to test whether trial-by-trial learning rates predicted recognition memory for scenes at test. Controlling for absolute prediction error, we did not find that learning rate on trial t predicted memory on that same trial (αt : z = 0.85, p = n.s., β = 0.08; |PEt| : z = 3.42, p < 0.001, β = 0.20), nor on the subsequent trial (effect of αt−1 on recognition memory for the scene on trial t: z = 0.56, p = n.s., β = 0.05; |PEt| : z = 3.06, p = 0.002, β = 0.19, where t enumerates over trials within a room). This demonstrates that increases in learning rate were not correlated with better (or worse) memory, even though both learning rate and recognition memory were enhanced by larger prediction errors.
Choice by Reward and Value Difference
Finally, as a manipulation check, participants were asked to make choices between pairs of previously seen scenes. Choices between scenes with different reward outcomes served to test whether participants encoded the rewards associated with the images. Participants chose the image associated with the larger outcome more often (mixed-effects logistic regression model predicting choice based on outcome: z = 6.40, p < 0.001, β = 0.54), suggesting that they did indeed encode and remember the rewards associated with the scenes.
Some choices were between items that were associated with the same outcome feedback. Here we sought to test whether features of the environment such as the risk context biased participants away from indifference. We did not find risk level, whether the scene was from the low rewarding or high rewarding category, or the difference in absolute prediction error between the images, to additionally influence choice preference. We instead found that participants were more likely to choose the scene that they had initially guessed a higher value for (z = 3.74, p < 0.001, β = 0.01). We additionally found that even when the two options had led to different reward outcomes, the difference in initial value estimates for the scene was a significant predictor of choice, above and beyond the difference in actual reward outcome (value estimate difference: z = 2.27, p = 0.02, β = 0.16; reward difference: z = 7.25, p < 0.001, β = 0.52). This suggests that participants remembered not only the outcomes for different scenes, but also their initial estimates.
Discussion
In Experiment 1, we showed that the greater the magnitude of the prediction error experienced during value learning, the more likely participants were to recognize items associated with those prediction errors. We also demonstrated that both risk context and absolute prediction errors influenced the extent to which people updated values for the scene categories, i.e. their item-by-item learning rate fluctuated according to prediction errors and was influenced by context. In particular, learning rate was higher in the low-risk environment, suggesting greater sensitivity to prediction errors when the variance of the environment was lower. Further, in both contexts, higher absolute prediction errors increased learning rate. Although absolute prediction errors enhanced both recognition memory and learning rate, we did not find learning rate to predict recognition memory, suggesting that absolute prediction errors affect learning and memory through distinct mechanisms.
Experiment 2
In Experiment 2, we allowed for more learning in both rooms, which posed stronger memory demands. We also tested for other types of episodic memory. Notably, in contrast to standard reinforcement-learning paradigms, Experiment 1 involved only 16 trials of learning in each context, 8 for each category. The initial phase of learning, which we were effectively testing, is characterized by increased prediction errors and uncertainty relative to later learning, which might affect the relationship between prediction errors and episodic memory. Additionally, participants in Experiment 1 all experienced the same reward sequence, which inadvertently introduced regularities in the learning curves that could have also influenced the initial learning and memory results. Finally, in this relatively short experiment, average recognition memory performance was near ceiling (A′ = 0.90). In Experiment 2, we therefore sought to replicate the results of Experiment 1 while increasing the number of learning and memory trials and randomizing the reward sequence. With more trials, we were also able to test sequence memory for items that were presented further apart in time, and we included a measure of source memory (i.e., memory for which room the probed item had appeared in), a marker of episodic memory for the context of an event.
Method
Participants
Two hundred participants initiated an online task run on Amazon Mechanical Turk, and 148 completed the task. Applying the same exclusion criteria as in Experiment 1, twelve participants were excluded from the analysis, leaving a final sample of 136 participants.
Procedure
The procedure was the same as in Experiment 1 but with some changes to the learning, memory, and choice phases. As in Experiment 1, rewards had a mean of 60¢ for the high-value category and 40¢ for the low-value category (high-risk–high-value scenes: 20¢, 40¢, 60¢, 80¢, 100¢; high-risk–low-value scenes: 0¢, 20¢, 40¢, 60¢, 80¢; low-risk–high-value scenes: 40¢, 50¢, 60¢, 70¢, 80¢; low-risk–low-value scenes: 20¢, 30¢, 40¢, 50¢, 60¢). However, we increased the number of learning trials from 16 to 30 trials per room, and we pseudo-randomized the reward sequence such that rewards were drawn at random, with each of a category’s five values sampled three times without replacement.
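Under our reading of this procedure (three without-replacement passes through the five outcomes of a category, yielding 15 outcomes per category per room), a reward sequence could be generated as in the illustrative sketch below; this is not the authors' code.

```r
# Illustrative sketch: one category's reward sequence in Experiment 2
set.seed(1)
values <- c(20, 40, 60, 80, 100)  # e.g., high-risk high-value scenes (cents)
# three without-replacement passes through the set, each in a fresh random order
outcome_sequence <- as.vector(replicate(3, sample(values)))
outcome_sequence  # 15 outcomes, each value occurring three times, mean 60¢
```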
During the item memory test, we also asked participants to indicate whether items identified as ‘old’ belonged to the first or second room (see Figure 1B), to measure source memory. Additionally, given that sequence memory improves with greater distance between events (DuBrow & Davachi, 2013), here we asked participants to order items that were as far as 13–14 trials apart, in contrast to the maximum of 8 trials apart in Experiment 1. Finally, satisfied by the manipulation check in the choice tasks in Experiment 1, we asked participants to choose only between pairs of scenes matched for reward outcome.
Results
Learning
As in Experiment 1, participants learned better in the low-risk than in the high-risk room (assessed by the average deviation of participants’ value estimates from the true means of the category values; t(135) = 13.11, p < 0.001; Figure 6A). They experienced larger absolute prediction errors in the high-risk room (t(135) = 39.65, p < 0.001; Figure 6B), and there was again an interaction between risk and scene category value such that in the high-risk room, participants overestimated the value of the low-value scene category and underestimated the value of the high-value scene category to a greater extent than in the low-risk room (F(1,135) = 77.5, p < 0.001; interaction of the effects of room and category on average prediction error experienced; Figure 6C). Again, participants guessed the high-value scene category at the end of each room equally well in the high-risk (90%) and low-risk (89%) rooms.
Memory by Risk and Prediction Error
By increasing the number of learning and memory trials, we significantly reduced average recognition memory performance from Experiment 1 (A′ = 0.86, t(275.23) = 3.04, p = 0.003 when comparing overall memory performance between Experiment 1 and 2). We nevertheless replicated the main results of Experiment 1: items from the high-risk room were better recognized than items from the low-risk room (z = 2.51, p = 0.01, β = 0.19 when testing for the effect of risk on item-recognition memory; Figure 7A). In a separate model, higher absolute prediction errors enhanced recognition memory for scenes, while again explaining the effect of risk (|PEt| : z = 3.44, p < 0.001, β = 0.16; risk: z = 1.76, p = 0.08, β = 0.14, Figure 7B). Like in Experiment 1, in subsequent models testing for potential confounds, this effect was significant when controlling for the outcomes associated with the items (|PEt| : z = 4.14, p < 0.001, β = 0.18; outcome Rt: z = −1.71, p = n.s., β = −0.06) as well as for the value estimate for the scene category (|PEt| : z = 4.15, p < 0.001, β = 0.19; estimate Vt: z = −1.16, p = n.s., β = −0.04).
In addition, for the scenes correctly identified as old, we found better source memory for scenes from the high-risk room (z = 2.05, p = 0.04, β = 0.25 in a mixed-effects logistic regression model testing for the effect of risk on source memory; Figure 7C). This effect was not modulated by absolute prediction error. Rather, it was a context effect: the source of a recognized image was better remembered if that item was seen in the high-risk room (absolute prediction errors: z = −0.60, p = n.s., β = −0.03; risk: z = 2.17, p = 0.03, β = 0.27). To verify that participants were not simply attributing remembered items to the high-risk context, we looked at the proportion of high-risk source judgments for recognition hits and false alarms separately. We did not find a greater proportion of high-risk source judgments for false alarms, indicating that participants were not biased to report that remembered items belonged to a high-risk context (for high-risk hits: mean = 0.57, standard error = 0.02; for false alarms: mean = 0.49, standard error = 0.04; chance response is 0.50).
Participants also exhibited better sequence memory for pairs from the high-risk room (z = 2.70, p = 0.007, β = 0.56 in a mixed-effects logistic regression model testing for the effect of risk on sequence memory; Figure 7D). Although we did not see this effect in Experiment 1, the longer training in Experiment 2 allowed us to test pairs that were more distant from each other (the most distant items were 13 and 14 trials apart). Indeed, in a model additionally testing for the effect of distance between tested pairs, greater distance predicted better sequence memory, controlling for risk (distance: z = 1.92, p = 0.05, β = 0.39; risk: z = 2.70, p = 0.006, β = 0.56). We therefore replicated our original results and further showed that other forms of episodic memory—source and sequence memory—were also enhanced in a high-risk context.
Learning Rate by Risk and Prediction Error
We replicated the results of Experiment 1 with respect to learning rates as well: participants had higher learning rates for the low-risk relative to the high-risk room, and higher absolute prediction errors additionally increased learning rates in a mixed-effects regression model testing for the effect of risk and absolute prediction error on learning rate (absolute prediction error: t = 5.12, p < 0.001, β = 0.09; risk: t = 7.01, p < 0.001, β = 0.18; Figure 8A–B). Controlling for absolute prediction error, we again did not find learning rate to predict recognition memory on the current trial (αt : z = −0.29, p = n.s., β = −0.01; |PEt| : z = 4.44, p < 0.001, β = 0.20), nor the subsequent trial (αt−1 : z = 0.68, p = n.s., β = 0.03; |PEt| : z = 3.53, p < 0.001, β = 0.17).
Choice by Value Difference
In this experiment, all choices were between images with matched reward outcomes. We replicated the results of Experiment 1 such that choice was predicted by the difference in participants’ initial value estimates for the scenes (z = 2.78, p = 0.005, β = 0.18, Figure 9). In particular, even in this better-powered test (12 choice trials as compared to 4 choice trials with matched outcomes in Experiment 1), there was no evidence for preference for images from one risk context versus the other (z = −1.56, p = n.s., β = −0.08).
Discussion
In Experiment 2, we doubled the number of training trials and replicated the results of Experiment 1, showing that large prediction errors increase learning rate and improve recognition memory, but that higher learning rates do not predict better item recognition. In fact, like in Experiment 1, learning rates were higher in the low-risk room, but item recognition was better in the high-risk room. Moreover, in this experiment, we demonstrated additional risk-context effects on episodic memory by showing better sequence and source memory for items that were encountered in the high-risk learning context. These results were separate from the effect of absolute prediction errors, but perhaps point to general memory enhancement for events occurring in a putatively more arousing environment.
Experiment 3
A possible confound of the effects of risk on memory and learning in Experiments 1 and 2 is that there was higher overlap between the outcomes for the two categories in the high-risk context as compared to the low-risk context. The distributions of outcomes for the indoor and outdoor scenes shared values from 20¢ to 80¢ (Exp. 1 & 2) in the high-risk room, but only 45¢ to 55¢ (Exp. 1) and 40¢ to 60¢ (Exp. 2) in the low-risk room. This greater overlap in the high-risk context could have made learning more difficult in comparison to the low-risk room, and therefore influenced the effects of absolute prediction error on subsequent memory. To test for this possibility, in Experiment 3 we made the learning conditions in the two rooms more similar by eliminating any overlap between the outcomes of the two categories.
Method
Participants
We conducted a simulation-based power analysis of the effect of absolute prediction errors on item-recognition memory. This revealed that we would have sufficient power (80% probability) to replicate the results of Experiments 1 and 2 with as few as 55 participants. As a result, we had 100 participants initiate the study, of whom 86 completed the task. Three participants were excluded based on our exclusion criteria (see Experiment 1), leaving a final sample of 83 participants.
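The paper does not detail the power simulation; the sketch below shows the general logic of such an analysis (simulate trial-level data under an assumed effect size, refit the memory model, and compute the proportion of significant replications). All parameter values, variable names, and distributional choices here are illustrative assumptions, not the authors' settings.

```r
library(lme4)

# Schematic power simulation under assumed parameters (illustrative only)
power_sim <- function(n_subj, n_trials = 60, beta_pe = 0.16, n_sims = 200) {
  hits <- replicate(n_sims, {
    d <- expand.grid(subject = factor(1:n_subj), trial = 1:n_trials)
    d$abs_pe <- rnorm(nrow(d))                 # standardized |PE|
    u <- rnorm(n_subj, sd = 0.5)[d$subject]    # by-participant random intercepts
    p <- plogis(1 + beta_pe * d$abs_pe + u)    # assumed effect of |PE| on memory
    d$remembered <- rbinom(nrow(d), 1, p)
    m <- glmer(remembered ~ abs_pe + (1 | subject),
               data = d, family = binomial)
    coef(summary(m))["abs_pe", "Pr(>|z|)"] < 0.05
  })
  mean(hits)  # proportion of simulated experiments detecting the effect
}

# e.g., power_sim(55) estimates power at the sample size considered here
```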
Procedure
We followed the same procedure as in Experiment 2 but changed the rewards such that they had a mean of 80¢ for the high-value category and 20¢ for the low-value category, and there was no overlap between the outcomes for scenes from the two categories (high-risk–high-value scenes: 60¢, 70¢, 80¢, 90¢, 100¢; high-risk–low-value scenes: 0¢, 10¢, 20¢, 30¢, 40¢; low-risk–high-value scenes: 70¢, 75¢, 80¢, 85¢, 90¢; low-risk–low-value scenes: 10¢, 15¢, 20¢, 25¢, 30¢).
Results
Learning
As in Experiments 1 and 2, participants learned better in the low-risk than in the high-risk room (t(82) = 6.28, p < 0.001 in a paired t-test comparing the average deviation of estimates from the true means of the categories across rooms; Figure 9A). However, learning in the two rooms was more similar here than in Experiment 2, as assessed by first computing the difference in learning (average deviation of estimates from the true means of the scene categories) between the high- and low-risk rooms for each participant, and then comparing this value between participants in Experiments 2 and 3 (t(148.98) = 1.84, p = 0.03). The range of prediction errors in the two rooms was also more similar than in Experiments 1 and 2 (Figure 9B), allowing us to better assess the effects of risk context on learning and memory when controlling for prediction errors (see below). As in the previous experiments, there was an interaction between risk and scene category such that participants overestimated the low-value category and underestimated the high-value category more in the high-risk than in the low-risk room (F(1,82) = 23.02, p < 0.001; Figure 9C). Nonetheless, participants correctly guessed the high-value category equally well (and at a higher proportion than in Experiments 1 and 2) in the high-risk (95%) and low-risk (96%) rooms.
Memory by Risk and Prediction Error
We replicated the results of Experiments 1 and 2, and further found separate effects of context and unsigned prediction error on recognition memory. A high-risk context and larger absolute prediction errors enhanced recognition memory for scenes, even with both predictors in the same model, indicating independent effects (|PEt| : z = 2.24, p = 0.02, β = 0.12; risk: z = 2.58, p = 0.009, β = 0.24, Figure 10A–B). This effect was again significant when controlling for reward outcome (|PEt| : z = 2.72, p = 0.007, β = 0.15; Rt: z = −0.38, p = n.s., β = −0.02) and value estimates (|PEt| : z = 2.70, p = 0.007, β = 0.15; Vt: z = −0.74, p = n.s., β = −0.03). Similar to Experiment 2, we again found better sequence memory for items within the high-risk context, while controlling for the effect of distance (risk: z = 2.47, p = 0.01, β = 0.57; distance: z = 2.36, p = 0.02, β = 0.55). For source memory, the difference was in the same direction as the effect observed in Experiment 2 but was not statistically significant, likely because we did not have sufficient power to detect it in this smaller sample.
It is worth noting here that there was a stronger effect of context in modulating recognition memory than in Experiments 1 and 2 (the context effect remained when controlling for absolute prediction errors, unlike in Experiments 1 and 2). That is, when learning was more similar in the two rooms, an independent effect of risk in increasing recognition memory became apparent. One possible explanation for this finding is that memory-boosting effects of reward prediction errors might “spill over” to adjacent trials, enhancing memory for those items as well. To test for these “spill over” effects in the high-risk context, we measured whether immediately previous and subsequent absolute prediction errors proactively or retroactively strengthened recognition memory for a scene, while controlling for the absolute prediction error experienced for that particular scene. We ran two mixed-effects logistic-regression models testing for the effect of adjacent absolute prediction errors (one for previous and one for subsequent prediction error) on recognition memory. We did not find any effect of adjacent prediction errors (|PEt−1| : z = −1.71, p = n.s., β = −0.13; |PEt+1| : z = −0.93, p = n.s., β = −0.08), suggesting that the memory-enhancing effect of the high-risk context may be due to general enhanced memory for items experienced in a high-risk, and potentially more arousing, environment.
Learning Rate by Risk and Prediction Error
As in Experiments 1 and 2, absolute prediction errors increased learning rates in both rooms, and there was a trend for higher learning rates in the low-risk room (|PEt|: t = 3.33, p < 0.001, β = 0.06; risk: t = 1.84, p = 0.06, β = 0.06; Figure 11A–B). We again did not find learning rate for values to predict recognition memory for the scene on the current trial (z = −0.26, p = n.s., β = −0.01), nor the subsequent trial (z = −1.22, p = n.s., β = −0.08), while controlling for the effect of absolute prediction error on the current trial.
Choice by Value Difference
As in Experiment 2, all choices (12 trials) were between scenes that had matched reward outcomes. Here too we replicated the results of Experiment 1 and 2, such that participants were more likely to choose the scene that they had initially guessed a higher value for (z = 3.98, p < 0.001, β = 0.29).
Discussion
In Experiment 3, we eliminated all overlap between the reward outcomes of the high and low-value categories in both rooms—a potential confound in Experiment 1 and 2—and replicated our previous results. Additionally, given the more similar range of prediction errors in the high and low-risk contexts, we were able to detect an independent effect of risk context on recognition memory. Improved recognition memory in the high-risk room, like the better source and sequence memory observed for high-risk events in Experiment 2, points to general memory enhancement for events experienced in an environment with greater reward variance.
General Discussion
Our aim was to determine how reward prediction errors influence episodic memory, above and beyond their known influence on learning. In Experiment 1, we demonstrated that unsigned, or absolute, prediction errors enhanced recognition memory for a rewarding episode. That is, trial-unique scenes that were accompanied by a large reward prediction error, whether positive (receiving much more reward than expected) or negative (receiving much less reward than expected), were better recognized in a subsequent surprise recognition test. We additionally found that risk context and absolute prediction errors modulated the trial-by-trial rate at which participants used the rewards to update their estimate of the general worth of that category of scenes. In particular, learning rate was higher in a low-risk environment, and there was more learning from rewards that generated larger prediction errors. Notably, although large prediction errors increased learning from rewards on that specific trial, and enhanced memory for the scene in that trial, we did not find a trial-by-trial relationship between learning rate and memory accuracy. In fact, the high-risk context led to lower learning rates but better recognition memory on average, suggesting separate mechanisms underlying these two effects of prediction errors.
In Experiment 2, we increased the number of trials therefore allowing for more learning in each context, and placing more demands on memory. We replicated all the effects from Experiment 1, and further showed that source and sequence-memory were better for images encountered in the high-risk context. In Experiment 3, we eliminated a potential confound by equating learning difficulty in the high-risk and low-risk contexts, again reproducing the original results. This manipulation also resulted in a more similar range of prediction errors in both risk contexts, which uncovered a separate effect of risk on episodic memory, above and beyond that of absolute prediction errors.
Previous work has shown both a collaboration between learning and memory systems, such as boosting of memory for items experienced during reward anticipation (Adcock et al., 2006), including oddball events (Murty & Adcock, 2014), as well as a competition between the systems, where the successful encoding of items experienced prior to reward outcome is thought to interfere with neural prediction errors (Wimmer et al., 2014). Here, in all three experiments, we showed that incremental learning and episodic memory systems collaborate in their response to learning signals. Specifically, large reward prediction errors both increase learning rate for the value of the rewarding source and enhance memory for the scene that led to the prediction error. However, the fact that the effects of prediction errors on learning rate and episodic memory were uncorrelated suggests that these effects are mediated by somewhat separate neural mechanisms.
Although we only tested behavior, the impetus for our experiments was neurobiological accounts adjudicating between the effects of signed and unsigned reward prediction errors on memory. Neurally, reward prediction error modulation of dopamine signaling provides a strong putative link between trial-and-error learning and dopamine-induced plasticity in the hippocampus. Such an effect of (signed) dopaminergic prediction errors from the ventral tegmental area (VTA) to the hippocampus would have predicted an asymmetric effect on memory, such that memories benefit from a positive prediction error (signaled by an increase in dopaminergic firing from the VTA), but not a negative prediction error (signaled by decreased dopaminergic firing). Instead, we found that the absolute magnitude of prediction errors, regardless of the sign, enhanced memory. This mechanism perhaps explains the finding that extreme outcomes are recalled first, are perceived as having occurred more frequently, and increase preference for a risky option (Ludvig, Madan, & Spetch, 2014; Madan, Ludvig, & Spetch, 2014).
In our task, each outcome was sampled with equal probability (uniform distributions), meaning that extreme outcomes were not rare. However, the mnemonic effects that we identified could potentially also contribute to the well-demonstrated phenomenon of nonlinear responses to reward probability in choice and in the brain, characterized by the overweighting of low-probability events and the underweighting of high-probability ones (Hsu, Krajbich, Zhao, & Camerer, 2009; Kahneman & Tversky, 1979). In particular, large prediction errors due to the occurrence of rare events would mean that these events affect learning and memory disproportionately strongly. Similarly, the underweighting of very common events could arise from the rare cases in which the common event does not occur, giving rise to large and influential prediction errors. Our results suggest that these distortions of weighting would be especially prominent when episodic memory is used in performing the task.
The influence of unsigned reward prediction errors on recognition memory is also reminiscent of work demonstrating better memory for surprising feedback outside of reinforcement learning, such as a recent study showing improved encoding of unexpected paired associates (Greve, Cooper, Kaula, Anderson, & Henson, 2017). Another potentially related paradigm is the hypercorrection effect (Butterfield & Metcalfe, 2001), where high-confidence errors and low-confidence correct feedback (both potentially generating large prediction errors) lead to greater attentional capture and improved memory (Butterfield & Metcalfe, 2006).
Neuroscientific work has linked surprising feedback to increases in arousal and the noradrenergic locus coeruleus (LC; Clewett et al., 2014; Mather et al., 2015; Miendlarzewska, Bavelier, & Schwartz, 2016). Our finding that absolute prediction errors influenced subsequent memory is in line with a mechanism (also described in the Introduction) whereby the LC-norepinephrine system responds to salient (surprising) events, and dopamine co-released with norepinephrine from LC neurons strengthens hippocampal memories (Kempadoo et al., 2016; Takeuchi et al., 2016). This proposed mechanism would seem to imply that increases in learning rate (previously linked to norepinephrine release) and enhanced episodic memory (linked to dopamine release) should be correlated across trials, given the hypothesized common cause of LC activation. However, we found that increases in learning rate were uncorrelated with enhanced memory, suggesting that the actual mechanism may involve additional (or different) steps from the one described above.
In our task, learning rate not only increased with the magnitude of prediction error, but also changed with the riskiness of the environment. In line with our results, recent work shows that learning rate scales inversely with reward variance, with higher learning rates in lower variance contexts (Diederen & Schultz, 2015; Diederen et al., 2016). Greater sensitivity to the same magnitude prediction errors in a low versus a high-variance environment demonstrates adaptation to reward statistics, where in a low-risk context, even small prediction errors are more relevant to learning than they would be when there is greater reward variance. This heightened sensitivity to unexpected rewards in the low-risk environment, however, was not associated with improved episodic memory in any of our experiments. In fact, in Experiment 3, we found that memory was better for items experienced in the high-risk context, even when controlling for the magnitude of trial-by-trial reward prediction errors. The opposing effects of risk on learning rate and episodic memory again suggest distinct underlying mechanisms, in agreement with work characterizing learning and memory systems as separate and even antagonistic (Foerde et al., 2012; Wimmer et al., 2014).
To explain the beneficial effect of high-risk environments on episodic memory, we hypothesized that better memory for large-prediction-error events could potentially “spill over” to surrounding items, in line with work showing that inducing an “encoding” state (such as through the presentation of novel items) introduces a lingering bias to encode subsequent items (Duncan & Shohamy, 2016; Duncan et al., 2012). These effects, however, did not explain how risk context modulated memory in our task, as we did not find prediction error events to additionally improve memory for adjacent items. Instead, we speculate that this context effect is due to improved encoding when in a putatively more aroused state, although future studies should more directly characterize the link between arousal and enhanced memory in risky environments.
Finally, we did not find effects of absolute prediction error or risk context on preferences in a later choice test. It remains, however, to be determined whether memories enhanced by large prediction errors may still bias decisions by prioritizing which experiences are sampled or reinstated during decision making.
In conclusion, we show that surprisingly large or small rewards and high-risk contexts improve memory, revealing that prediction errors and risk modulate episodic memory. We further demonstrated that absolute prediction errors have dissociable effects on learning rate and memory, pointing to separate influences on incremental learning and episodic memory processes.
Acknowledgments
This work was supported by the Ellison Foundation (Y.N.), grant R01MH098861 from the National Institute for Mental Health (Y.N.), grant W911NF-14-1-0101 from the Army Research Office (Y.N.), and the National Science Foundation’s Graduate Research Fellowship Program (N.R.).
References
- Adcock RA, Thangavel A, Whitfield-Gabrieli S, Knutson B, Gabrieli JDE. Reward-Motivated Learning: Mesolimbic Activation Precedes Memory Formation. Neuron. 2006;50(3):507–517. doi: 10.1016/j.neuron.2006.03.036.
- Bates D, Maechler M, Bolker B, Walker S, Christensen RHB, Singmann H, … Grothendieck G. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software. 2015;67(1):1–48. doi: 10.18637/jss.v067.i01.
- Behrens TEJ, Woolrich MW, Walton ME, Rushworth MFS. Learning the value of information in an uncertain world. Nature Neuroscience. 2007;10(9):1214–21. doi: 10.1038/nn1954.
- Biele G, Erev I, Ert E. Learning, risk attitude and hot stoves in restless bandit problems. Journal of Mathematical Psychology. 2009;53:155–167. doi: 10.1016/j.jmp.2008.05.006.
- Butterfield B, Mangels JA. Neural correlates of error detection and correction in a semantic retrieval task. Cognitive Brain Research. 2003;17(3):793–817. doi: 10.1016/s0926-6410(03)00203-9.
- Butterfield B, Metcalfe J. Errors committed with high confidence are hypercorrected. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2001;27(6):1491–1494. doi: 10.1037/0278-7393.27.6.1491.
- Butterfield B, Metcalfe J. The correction of errors committed with high confidence. Metacognition and Learning. 2006;1(1):69–84. doi: 10.1007/s11409-006-6894-z.
- Clewett D, Schoeke A, Mather M. Locus coeruleus neuromodulation of memories encoded during negative or unexpected action outcomes. Neurobiology of Learning and Memory. 2014;111:65–70. doi: 10.1016/j.nlm.2014.03.006.
- Diederen KMJ, Schultz W. Scaling prediction errors to reward variability benefits error-driven learning in humans. Journal of Neurophysiology. 2015;114:1628–1640. doi: 10.1152/jn.00483.2015.
- Diederen KMJ, Spencer T, Vestergaard MD, Fletcher PC, Schultz W. Adaptive Prediction Error Coding in the Human Midbrain and Striatum Facilitates Behavioral Adaptation and Learning Efficiency. Neuron. 2016;90:1127–1138. doi: 10.1016/j.neuron.2016.04.019.
- Difallah DE, Catasta M, Demartini G, Ipeirotis PG, Cudré-Mauroux P. The Dynamics of Micro-Task Crowdsourcing: The Case of Amazon MTurk. Proceedings of the 24th International Conference on World Wide Web; Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee; 2015. pp. 238–247.
- DuBrow S, Davachi L. The influence of context boundaries on memory for the sequential order of events. Journal of Experimental Psychology: General. 2013;142(4):1277–86. doi: 10.1037/a0034024.
- Duncan KD, Sadanand A, Davachi L. Memory’s Penumbra: Episodic Memory Decisions Induce Lingering Mnemonic Biases. Science. 2012;337. doi: 10.1126/science.1221936.
- Duncan KD, Shohamy D. Memory States Influence Value-Based Decisions. Journal of Experimental Psychology: General. 2016;145(9):3–9. doi: 10.1037/xge0000231.
- Fazio LK, Marsh EJ. Surprising feedback improves later memory. Psychonomic Bulletin & Review. 2009;16(1):88–92. doi: 10.3758/PBR.16.1.88.
- Fazio LK, Marsh EJ. Correcting False Memories. Psychological Science. 2010;21(6):801–803. doi: 10.1177/0956797610371341.
- Foerde K, Braun EK, Shohamy D. A Trade-Off between Feedback-Based Learning and Episodic Memory for Feedback Events: Evidence from Parkinson’s Disease. Neurodegenerative Diseases. 2012;11(2):93–101. doi: 10.1159/000342000.
- Gershman SJ, Niv Y. Perceptual estimation obeys Occam’s razor. Frontiers in Psychology. 2013;4:623. doi: 10.3389/fpsyg.2013.00623.
- Greve A, Cooper E, Kaula A, Anderson MC, Henson R. Does prediction error drive one-shot declarative learning? Journal of Memory and Language. 2017;94:149–165. doi: 10.1016/j.jml.2016.11.001.
- Hsu M, Krajbich I, Zhao C, Camerer CF. Neural Response to Reward Anticipation under Risk Is Nonlinear in Probabilities. Journal of Neuroscience. 2009;29(7):2231–2237. doi: 10.1523/JNEUROSCI.5296-08.2009.
- Ipeirotis PG. Analyzing the Amazon Mechanical Turk Marketplace. XRDS. 2010;17(2):16–21. doi: 10.1145/1869086.1869094.
- Kahneman D, Tversky A. Prospect theory: An analysis of decision under risk. Econometrica. 1979;47(2):263–292. Retrieved from http://www.jstor.org/stable/1914185.
- Kempadoo KA, Mosharov EV, Choi SJ, Sulzer D, Kandel ER. Dopamine release from the locus coeruleus to the dorsal hippocampus promotes spatial learning and memory. Proceedings of the National Academy of Sciences. 2016;113(51). doi: 10.1073/pnas.1616515114.
- Ludvig EA, Madan CR, Spetch ML. Extreme Outcomes Sway Risky Decisions from Experience. Journal of Behavioral Decision Making. 2014;27:146–156. doi: 10.1002/bdm.1792.
- Madan CR, Ludvig EA, Spetch ML. Remembering the best and worst of times: Memories for extreme outcomes bias risky decisions. Psychonomic Bulletin & Review. 2014;21(3):629–636. doi: 10.3758/s13423-013-0542-9.
- Mather M, Clewett D, Sakaki M, Harley CW. Norepinephrine ignites local hot spots of neuronal excitation: How arousal amplifies selectivity in perception and memory. Behavioral and Brain Sciences. 2015;1. doi: 10.1017/S0140525X15000667.
- McGuire JT, Nassar MR, Gold JI, Kable JW. Functionally Dissociable Influences on Learning Rate in a Dynamic Environment. Neuron. 2014;84(4):870–881. doi: 10.1016/j.neuron.2014.10.013.
- Miendlarzewska EA, Bavelier D, Schwartz S. Influence of reward motivation on human declarative memory. Neuroscience and Biobehavioral Reviews. 2016;61:156–176. doi: 10.1016/j.neubiorev.2015.11.015.
- Murty VP, Adcock RA. Enriched encoding: Reward motivation organizes cortical networks for hippocampal detection of unexpected events. Cerebral Cortex. 2014;24(8):2160–2168. doi: 10.1093/cercor/bht063.
- Nassar MR, Rumsey KM, Wilson RC, Parikh K, Heasly B, Gold JI. Rational regulation of learning dynamics by pupil-linked arousal systems. Nature Neuroscience. 2012;15(7):1040–1046. doi: 10.1038/nn.3130.
- Niv Y, Edlund JA, Dayan P, O’Doherty JP. Neural Prediction Errors Reveal a Risk-Sensitive Reinforcement-Learning Process in the Human Brain. Journal of Neuroscience. 2012;32(2):551–562. doi: 10.1523/JNEUROSCI.5498-10.2012.
- O’Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38(2):329–337. doi: 10.1016/S0896-6273(03)00169-7.
- Pearce JM, Hall G. A Model for Pavlovian Learning: Variations in the Effectiveness of Conditioned But Not of Unconditioned Stimuli. Psychological Review. 1980;87(6):532–552.
- Pollack I, Norman DA. A non-parametric analysis of recognition experiments. Psychonomic Science. 1964;1(1–12):125–126. doi: 10.3758/BF03342823.
- Sara SJ. The locus coeruleus and noradrenergic modulation of cognition. Nature Reviews Neuroscience. 2009;10. doi: 10.1038/nrn2573.
- Takeuchi T, Duszkiewicz AJ, Sonneborn A, Spooner PA, Yamasaki M, Watanabe M, … Morris RGM. Locus coeruleus and dopaminergic consolidation of everyday memory. Nature. 2016;537(7620):1–18. doi: 10.1038/nature19325.
- Tobler PN, Fiorillo CD, Schultz W. Adaptive Coding of Reward Value by Dopamine Neurons. Science. 2005;307(5715):1642–1645. doi: 10.1126/science.1105370.
- Weber EU, Blais AR, Betz NE. A domain-specific risk-attitude scale: measuring risk perceptions and risk behaviors. Journal of Behavioral Decision Making. 2002;15(4):263–290. doi: 10.1002/bdm.414.
- Wimmer GE, Braun EK, Daw ND, Shohamy D. Episodic memory encoding interferes with reward learning and decreases striatal prediction errors. The Journal of Neuroscience. 2014;34(45):14901–14912. doi: 10.1523/JNEUROSCI.0204-14.2014.