eTOC Blurb
Here, Chang et al. show that blocking dopamine transients prevents learning about unexpected rewards. Learning was prevented whether it was driven by the addition of a reward or by a valueless change in the flavor of the expected reward. This result is contrary to the proposal that dopamine transients act only as cached-value prediction errors.
Keywords: dopamine, reward prediction error, blocking, associative learning, rat
Summary
Prediction errors are critical for associative learning [1, 2]. Transient changes in dopamine neuron activity correlate with positive and negative reward prediction errors and can mimic their effects [3–15]. However, while causal studies show that dopamine transients of 1–2s are sufficient to drive learning about rewards, these studies do not address whether they are necessary [but see 11]. Further, the precise nature of this signal is not yet fully established. While it has been equated with the cached-value error signal proposed to support model-free reinforcement learning, cached-value errors are typically confounded with errors in the prediction of reward features [16]. Here we used optogenetic and transgenic approaches to prevent transient changes in midbrain dopamine neuron activity during the critical error-signaling period of two unblocking tasks. In one, learning was unblocked by increasing the number of rewards, a manipulation that induces errors in predicting both value and reward features. In another, learning was unblocked by switching from one reward to another, equally valued reward, a manipulation that induces errors only in reward feature prediction. Preventing dopamine neurons in the ventral tegmental area from firing for 5s beginning before and continuing until after the changes in reward prevented unblocking of learning in both tasks. A suppression of similar duration did not induce extinction when delivered during an expected reward, indicating it did not act independently as a negative prediction error. This result suggests that dopamine transients play a general role in error signaling rather than being restricted to signaling only errors in value.
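For reference, one conventional way to write the two kinds of error contrasted above is shown below. The notation is our own and is given only for illustration; it is not taken from this study or its references. The cached-value (temporal-difference) error is a scalar defined over reward value, whereas a feature (identity) error is defined over the sensory attributes of the expected reward, so an equally valued flavor switch leaves the first near zero while the second remains.

```latex
% Illustrative notation (ours, not from the study):
% scalar cached-value (TD) error vs. vector-valued feature/identity error
\[
\delta^{\text{value}}_t \;=\; r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\boldsymbol{\delta}^{\text{identity}}_t \;=\; \mathbf{x}_t - \hat{\mathbf{x}}_t ,
\]
% where $r_t$ is the scalar reward value, $V$ is the cached value estimate,
% $\gamma$ is a discount factor, and $\mathbf{x}_t$, $\hat{\mathbf{x}}_t$ are the
% observed and predicted feature vectors of the reward (e.g., flavor identity).
```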
Results
Fourteen rats expressing Cre recombinase from the tyrosine hydroxylase (TH) promoter served as subjects [17]; two were removed during the course of the experiment due to illness. Each rat received bilateral infusions of AAV-DIO-NpHR3.0-eYFP (NpHR) within VTA, and fiberoptics were implanted targeting this area bilaterally (Figure 1a). Postmortem immunohistochemical verification showed a high degree of co-localization between Cre-dependent NpHR expression and TH in VTA (Figure 1b); ~90% of NpHR-expressing cells in the VTA (519 of 567 cells counted in sections between −5.1 and −5.9 mm in the anterior-posterior plane, n = 4) were immunoreactive to anti-TH antisera. Rats with similar expression in prior work showed spontaneous and evoked firing in NpHR-expressing neurons in VTA that was uniformly sensitive to light [15]. After a 2-week period to allow viral expression, the rats were food-deprived to 85% of their baseline body weight, and then we began training on the unblocking tasks (Figure 1c).
Figure 1. Histological verification, task designs, and pellet preference test.
A) Fiber implants were localized in the vicinity of NpHR expression in VTA. The light orange shading represents the maximal spread of expression at each level, whereas the dark orange shading represents the minimal spread. B) Expression of NpHR showed a high degree of colocalization (~90%) with TH in VTA neurons. Green represents NpHR-eYFP, red represents TH. Scale bar, 500 μm. C) Design of the number (top) and identity (bottom) unblocking tasks. All rats were trained in both tasks; the order of training was counterbalanced. D) Preference test comparing consumption of the banana and chocolate pellets used in the identity unblocking task. During the test, the rats were given access to both banana and chocolate pellets (200 pellets each). The number of remaining pellets was assessed every 2.5 min, then every 5 min, and then every 10 min as the test progressed. There was no discernible difference in consumption rate between the two flavors over the course of the 60-min test (p > 0.32).
In one unblocking task, we tested whether preventing transient changes in dopamine neuron activity would block learning from increases in reward number. Adding an unexpected reward is well documented to cause a transient increase in dopamine neuron activity [3–6] and is also known to unblock learning [18, 19]. While these effects are typically interpreted as reflecting unexpected value, a change in reward number also represents a substantial change in the specific features of the reward. One might think of this as a change in “number”, but because an entirely new pellet is added, there are likely many unexpected features involved. Indeed, learning that is unblocked in this manner is sensitive to devaluation, suggesting that it involves the formation of associations with the reward’s specific features [18]. This sort of associative representation is beyond what is supported by cached-value errors in model-free reinforcement learning models [2]. Changes in features are thought to induce their own prediction errors independent of effects on general or cached value [20], and evidence of the operation of these sensory prediction errors can be seen in the unblocking of learning even in the absence of a value shift [19, 21]. If transient changes in dopamine activity reflect only cached-value errors, then they should not be necessary for unblocking caused by an increase in reward number, since the sensory-based error signal would remain to support learning. If, on the other hand, transient changes in dopamine convey both types of error signals (or a broader type of error signal encompassing both), then dopamine should be necessary for number unblocking. The design is described in detail below.
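To make the blocking logic concrete, here is a minimal Rescorla-Wagner-style sketch in Python. The parameters and the number of simulated trials are arbitrary choices of ours; this illustrates the cached-value account described above and is not a model fit to the present data.

```python
ALPHA = 0.3  # learning rate; arbitrary value chosen for illustration

def rw_update(V, cues, reward, alpha=ALPHA):
    """One Rescorla-Wagner trial: all cues present share a single cached-value error."""
    prediction = sum(V.get(c, 0.0) for c in cues)
    delta = reward - prediction          # cached-value prediction error
    for c in cues:
        V[c] = V.get(c, 0.0) + alpha * delta
    return delta

V = {}
# Conditioning: VB -> 2 pellets, VUB -> 1 pellet (reward coded in pellet units)
for _ in range(40):
    rw_update(V, {"VB"}, 2.0)
    rw_update(V, {"VUB"}, 1.0)

# Compound training: both compounds are followed by 2 pellets
for _ in range(16):
    rw_update(V, {"VB", "AB"}, 2.0)     # no error -> AB stays near 0 (blocked)
    rw_update(V, {"VUB", "AUB"}, 2.0)   # extra pellet -> error -> AUB gains value (unblocked)

print({cue: round(v, 2) for cue, v in V.items()})
# Approximate result: VB ~2.0, AB ~0.0, VUB ~1.5, AUB ~0.5.
# An equally valued flavor switch would generate no such value error, which is
# why identity unblocking (second task) requires an error over reward features.
```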
Optogenetic blockade of dopamine transients prevents learning induced by changes in reward number
Number unblocking consisted of three phases: conditioning, compound training, and probe testing (Figure 1c, top). During conditioning, the rats were trained to associate two novel 10s visual cues, VUB and VB, with either one or two plain sucrose pellets. After conditioning, the rats underwent compound training. These sessions were the same as conditioning sessions, except for the addition of trials on which the previously trained visual cues were presented with novel auditory cues to form compounds, VB/AB and VUB/AUB, paired with two pellets of reward. For VB/AB, both pellets were already predicted by VB, thus we expected learning for AB to be blocked. By contrast, VUB/AUB was paired with an additional, unexpected pellet not predicted by VUB, which we expected to unblock learning for AUB. The presence or absence of learning was assessed at the end of compound training in a probe test, in which the two auditory cues were presented alone, without reward. After completion of the probe test, the rats were retrained for two sessions on the visual cues and then underwent a second run of number unblocking with two new auditory cues.
To test whether transient changes in dopamine were necessary to support number unblocking, we delivered green light into VTA for 5s on VUB/AUB trials to prevent changes in dopamine neuron firing (Figure 1c, top). These 5s light pulses were delivered either beginning 0.5 s prior to delivery of the unexpected second pellet or, as a control for non-specific effects, in the middle of the inter-trial interval (ITI) after these trials. Unlike appropriately timed and transient (~1–2s) suppression of dopamine neurons, which we found to act as negative prediction errors to drive extinction learning [15], this longer period of suppression did not induce significant extinction when delivered at the time of an expected reward (see supplemental material, Figure S1). Each rat received each manipulation across the two runs of unblocking (order counterbalanced).
Behavioral data from all three phases of training are plotted in Figure 2. The rats developed responding to both visual cues across conditioning; ANOVA (cue × session) revealed a significant main effect of session (F(11,242) = 9.41, p < 0.0001). The rats also gradually learned to respond more to the cue predicting two pellets than to the cue predicting one pellet (Figure 2, conditioning). This difference was significant over the last 4 sessions of conditioning (F(1,22) = 7.11, p < 0.014) and also during the reminder training between the two runs of number unblocking (F(1,22) = 4.553, p < 0.044).
Figure 2. Optogenetic blockade of dopamine transients prevents learning induced by changes in reward number.
Design is illustrated in Figure 1c. Conditioned responding is shown to VB and VUB during conditioning and reconditioning (left), to VB/AB and VUB/AUB during compound training (middle), and to AB and AUB during the probe test (right). Conditioned responding is represented as the percentage of time the rats spent in the food cup during cue presentation. Top panels for compound training and probe test show data from the experimental run (Exp), when neurons were suppressed during delivery of the second pellet, and bottom panels show data from the ITI run (ITI), when neurons were suppressed during the intertrial interval. Insets show the percentage of time rats spent in the food cup during the reward period after termination of the cues. (See also Figure S1)
Conditioned responding was generally maintained during subsequent compound training (Figure 2, compound training). ANOVA (run × cue × session) found a main effect of cue (F(1,22) = 4.660, p < 0.042) but no other significant main effects nor any interactions (F’s < 2.624, p’s > 0.058). To evaluate whether there were any aversive effects of light delivery, we also measured the time rats spent in the food cup during the reward period itself (Figure 2, compound training, bar graph insets). ANOVA (run × cue) found no significant effects of light delivery during this period (F’s < 0.65, p’s > 0.42); the rats also ate all the pellets available in each session without exception.
Here it is worth noting that, despite the main effect of cue in our ANOVA, the difference in responding established during training to the two visual cues seemed to disappear during compound training, consistent with the presentation of two pellets on both trial types (Figure 2, compound training). This difference appeared to take longer to disappear during the experimental run, when light was delivered during the second pellet, than during the ITI run, when light was delivered during the intertrial interval. Although the ANOVA did not reveal the requisite 3-way interaction (run × cue × session), post-hoc testing did show that there was a significant difference in responding in the first two compound sessions in the experimental run (p’s < 0.05) that was not present for the later sessions or for any of the ITI sessions (p’s > 0.09).
Probe testing revealed significant effects that depended on the timing of light delivery during the earlier compound training. Specifically, when light had been delivered to prevent changes in dopamine firing during the second pellet on unblocking trials (VUB/AUB), there was no evidence of unblocked learning for AUB in the probe test, whereas the same rats showed substantial unblocking when light had been delivered later in the ITI (Figure 2, probe). This difference was most prominent on the very first trial and then disappeared across trials, consistent with the effects of ongoing extinction in the unrewarded probe test. ANOVA (run × cue × trial) revealed a significant main effect of trial (F(5,110) = 11.337, p < 0.000001) and a significant interaction between cue and run (F(1,22) = 13.958, p < 0.002). Post-hoc testing showed the latter effect was due to elevated responding to AUB after the ITI run (p’s < 0.025 versus other auditory cues on trial 1). This pattern was also evident during the reward period (Figure 2, probe, bar graph inset). Normally, responding during this period could reflect responding to the food; however, in the probe test no food was delivered, so any responding in this period can only reflect information provided by the preceding cue. Again, responding was significantly higher after AUB on the ITI run; ANOVA (cue × run) yielded a significant interaction (F(1,22) = 5.803, p < 0.025).
The effects described above suggest that transient increases in dopamine neuron firing, similar to those correlated with reward prediction errors and to those sufficient to unblock learning, are also necessary for learning to be unblocked by the addition of new rewards. This result is at odds with the hypothesis that dopamine transients signal only cached-value prediction errors, since the addition of a new reward should evoke errors both in predicting value and in predicting other features of the reward. Changes in reward features are known to drive unblocking even in the absence of value changes; therefore, if dopamine were only conveying the value signal, we would have expected significant unblocking even when dopamine signaling was prevented.
To provide additional evidence supporting this conclusion, we also ran the same rats in a second unblocking task (order counterbalanced, new cues), in which we directly tested whether preventing transient changes in dopamine neuron activity would block learning from valueless changes in reward features. The design of this task was intentionally similar to the number unblocking design, except that we used two equally-preferred flavored sucrose pellets as rewards and delivered two pellets at the end of every cue. Where we added a new pellet on VUB/AUB trials in number unblocking, here we simply switched the flavor of the second pellet. This resulted in a valueless shift in reward identity with the same timing as the change in number in the prior design. Again, we delivered green light into VTA for 5s across delivery of this second reward or in the subsequent ITI period to test whether transient changes in dopamine were necessary for learning induced by the identity shift. The design is described in detail below.
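As a concrete illustration of the distinction exploited here, the short sketch below contrasts the scalar value error with a feature-based error on a VUB/AUB trial when the second pellet's flavor is switched between two equally valued options. The feature coding scheme is our own, chosen for illustration; it is not a representation proposed by the study.

```python
import numpy as np

# Reward coded as [total value in pellet units, banana count, chocolate count];
# this scheme is illustrative only.
expected = np.array([2.0, 2.0, 0.0])   # VUB predicts two banana pellets
received = np.array([2.0, 1.0, 1.0])   # one banana pellet, then one chocolate pellet

value_error = received[0] - expected[0]       # 0.0 -> no cached-value error
identity_error = received[1:] - expected[1:]  # [-1.  1.] -> feature error remains

print(value_error, identity_error)
```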
Optogenetic blockade of dopamine transients prevents learning induced by changes in reward flavor
Identity unblocking consisted of three phases: conditioning, compound training, and probe testing (Figure 1c, bottom). During conditioning, the rats were trained to associate two novel 10s visual cues, VUB and VB, with two sucrose pellets of one of two equally preferred flavors (Figure 1d). After conditioning, the rats underwent compound training in which the visual cues were presented with novel auditory cues to form compounds, VB/AB and VUB/AUB. Each compound was paired with two pellets of reward. For VB/AB, both pellets were of the same flavor already predicted by VB, thus we expected learning for AB to be blocked. By contrast, the second pellet presented after VUB/AUB was of the opposite flavor to that predicted by VUB. Based on prior work [19, 21], we expected this manipulation to unblock learning for AUB due to the change in reward identity. The presence or absence of learning was assessed at the end of compound training in a probe test, in which the two auditory cues were presented alone, without reward. After completion of the probe test, the rats were retrained for two sessions on the visual cues and then underwent a second run of unblocking with two new auditory cues. Green light was delivered into VTA for 5s on VUB/AUB trials beginning either 0.5 s prior to delivery of the novel second pellet or, as a control for non-specific effects, in the middle of the inter-trial interval (ITI) on these trials. Each rat received each manipulation across the two runs of unblocking (order counterbalanced).
Behavioral data from all three phases of training are plotted in Figure 3. The rats developed conditioned responding to the visual cues across conditioning, and there were no differences between the two cues (Figure 3, conditioning). ANOVA (cue × session) revealed a significant main effect of session (F(11,264) = 17.598, p < 0.0001) but no main effect of cue nor any interaction with cue (F’s < 0.5, p’s > 0.6). There were also no effects of cue in the last 4 sessions of conditioning or in the reminder training between the two runs of identity unblocking (F’s < 0.7, p’s > 0.3).
Figure 3. Optogenetic blockade of dopamine transients prevents learning induced by changes in reward flavor.
Design is illustrated in Figure 1c. Conditioned responding is shown to VB and VUB during conditioning and reconditioning (left), to VB/AB and VUB/AUB during compound training (middle), and to AB and AUB during probe test (right). Conditioned responding is represented as the percentage of time the rats spent in the food cup during cue presentation. Top panels for compound training and probe test show data from the experimental run (Exp), when neurons were suppressed during delivery of the second pellet, and bottom panels show data from the ITI run (ITI), when neurons were suppressed during the intertrial interval. Insets show the percentage of time rats spent in the food cup during the reward period after termination of the cues.
Conditioned responding was also maintained during subsequent compound training, with no evidence of differences between cues in either run (Figure 3, compound training). ANOVA (run × cue × session) found neither significant main effects nor any interactions (F’s < 0.653, p’s > 0.136); time in the food cup during the reward period itself (Figure 3, compound training, bar graph insets) again revealed no significant effects of light delivery during this period (F’s < 0.343, p’s > 0.564), and the rats ate all the pellets available in each session without exception.
Here it is worth noting that there were no differences in responding to the cues, either during conditioning or compound training. This contrasts with the differences in conditioned responding that emerged with training in the number unblocking task for cues that predicted different numbers of pellets. The similar responding here provides additional evidence that the differently flavored pellets were similarly valued, both during conditioning, when they were delivered separately, and during compound training, when one of each was delivered on one type of trial. In neither case were there value differences sufficient to affect responding to the cues.
However, responding in the subsequent probe test again revealed significant effects that depended on the timing of light delivery in the earlier compound training. Specifically, when light had been delivered during presentation of the second pellet on VUB/AUB trials, there was no evidence of unblocked learning for AUB in the probe test, whereas the same rats showed substantial unblocking in the other run, when light had been delivered in the ITI (Figure 3, probe). Again, this difference was most prominent on the very first trial and then disappeared across trials, consistent with the effects of ongoing extinction. ANOVA (run × cue × trial) revealed a significant main effect of trial (F(5,120) = 16.459, p < 0.000001) and a significant interaction between cue and run (F(1,24) = 10.362, p < 0.004). Post-hoc testing showed that the latter effect was due to elevated responding to AUB after the ITI run (p’s < 0.005 versus other cues on trial 1). This pattern was also evident during the reward period after cue termination (Figure 3, probe, bar graph inset), where responding was significantly higher after AUB on the ITI run; ANOVA (run × cue) yielded a significant interaction (F(1,24) = 4.314, p < 0.049).
Discussion
Prediction errors are critical for associative learning [1, 2]. Transient changes in dopamine neuron activity correlate with positive and negative reward prediction errors [3–6], and artificially inducing such changes can mimic the normal effects of these errors [10–15, 22]. However, while these causal studies show that dopamine transients (1–2s increases or decreases) are sufficient to drive learning, they generally do not address whether such transients are necessary, since they either do not attempt to block the normal physiological signal or they do so with a manipulation capable of acting independently to drive learning. Further, these causal studies do not address whether dopamine transients only mediate learning in response to cached-value errors or whether these signals may also facilitate learning about rewards in settings where errors in predicting cached value are not responsible, as has been recently suggested for neutral cues [23].
Here we addressed these questions using optogenetic and transgenic approaches to prevent transient changes in midbrain dopamine neuron activity across the critical error-signaling period of two unblocking tasks. In one task, learning was unblocked by adding unexpected rewards. Unexpected rewards are well documented to transiently increase dopamine neuron firing [3–6] and are also known to unblock learning [18, 19]. Yet an additional reward should induce both a cached-value error and errors in the prediction of the valueless sensory features of the reward. Importantly, the latter type of error is capable of driving learning even in the absence of changes in value [18–21]. If dopamine transients signal only cached-value errors, then although they might be sufficient to unblock learning, as has been shown [11], they should not be necessary, since the feature-based error would remain to support learning. Contrary to this prediction, we found that suppressing dopamine for a 5s period when the unexpected reward was delivered abolished learning. This result suggests that dopamine transients are necessary for learning from both types of error signals. This proposal was confirmed by results in a second task in which we unblocked learning by changing the identity of the reward without changing its value. Suppressing dopamine transients at the time of the change in reward identity also abolished unblocking.
In evaluating these results, it is possible that suppressing the dopamine neurons affects motivation, is aversive, or acts as a negative prediction error to offset learning. We believe several aspects of the data make these explanations unlikely. For instance, there was little or no effect of suppressing the dopamine neurons on responding during compound training, when the suppression occurred, particularly during the identity task, in which reward value was not altered. If food cup entry were punished or resulted in negative prediction errors, we would have expected to see extinction of conditioned responding in this phase of training, which we did not. Critically, this includes both the cue period and the post-cue food delivery period, when the dopamine neurons were suppressed. In each period, we saw no effect on time in the food cup (Figures 2 and 3, compound training, bar graph insets) or on food pellet consumption. This suggests that suppressing the dopamine neurons neither diminished the motivation to eat nor was perceived as particularly punishing or aversive.
We also think it is unlikely that suppressing the dopamine neurons blocked learning by acting independently as a negative prediction error. In fact, negative prediction errors induced by reducing the number of rewards in unblocking tasks such as those used here typically cause increased rather than diminished responding [24–26], due to changes in the processing of the prior rewards. This is the opposite of the effect caused by the longer suppression used here, suggesting that it did not function as an endogenous negative prediction error normally would in this context. Further, in pilot testing, we found that a longer period of suppression, similar to that used here, was relatively ineffective at inducing extinction learning compared to several brief periods of suppression (see supplemental material, Figure S1).
The current result is consistent with recent fMRI and single unit data showing that errors in the prediction of information other than value, including sensory or identity, coexist in the midbrain generally and in dopamine neurons in particular [16, 27, 28]. While this might suggest yet another function for transient changes in dopamine neuron firing – that of signaling outcome identity or sensory errors, alongside errors in value, salience, information, and so on – an alternative proposal would be that dopamine transients signal changes in expected events more generally. While value would be a very important dimension (or combination of dimensions) in which to track unexpected events for hungry, well-trained subjects in most experimental paradigms, a similar signal might also be elicited by changes orthogonal to value in other settings, when that information is deemed relevant by the subject [29, 30]. Consistent with this idea, dopamine transients are sometimes elicited by the appearance of unexpected neutral cues [31–33]. While such firing changes have been explained as salience or a value bonus tied to novelty [31, 34], they might be viewed as an error evoked by the change in sensory input. Accordingly, we have recently found that dopamine transients are sufficient and likely necessary for learning to associate neutral cues [23]. Their role in this function was not tied in any obvious way to salience or changes in the cues’ associability.
Dopamine transients evoked by an unexpected reward may perform a similar function, at least in some downstream areas, causing associations to form between antecedent events and the unexpected sensory features that comprise a given reward. This would explain why learned behaviors unblocked by the addition of reward are sensitive to subsequent reward devaluation [18]. Indeed, empirical evidence such as that presented here may reveal only a small part of the role the dopamine system plays in learning, given recent proposals that dopamine transients reflect discrepancies in a variety of increasingly abstract constructs such as information and internal beliefs [28–30, 35].
In summary, the results presented here show that dopamine transients, in addition to being sufficient [11], are also necessary for error-based reward learning. Given the contribution of both cached value and other types of error signals in the unblocking tasks used here, the requirement for dopamine transients suggests they play a general role in signaling erroneous predictions about the events around us, rather than being restricted to only signaling errors in predicting cached value.
STAR Methods Text
Contact for Reagent and Resource Sharing
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Geoffrey Schoenbaum (geoffrey.schoenbaum@nih.gov).
Experimental Model and Subject Details
Subjects
Fourteen transgenic rats (5 male and 9 female) expressing Cre recombinase under control of the tyrosine hydroxylase (TH) promoter on a Long-Evans background (NIDA animal breeding facility) served as subjects [27, 36]. Two rats became ill and had to be removed from the study during the course of the experiment; as a result, 12 subjects contributed data on number unblocking and 13 contributed data on identity unblocking. The rats were maintained on a 12-h light/dark cycle with unlimited access to food and water, except during the behavioral experiment, when they were food restricted to maintain 85% of their baseline weight. All experimental procedures were conducted in accordance with the Institutional Animal Care and Use Committee guidelines of the US National Institutes of Health.
Method Details
Surgical Procedures
Rats (> 275 g) received bilateral infusions of AAV5-EF1α-DIO-NpHR3.0-eYFP into the VTA (AP: −5 mm referenced to bregma; ML: ±0.7 mm referenced to the midline; DV: −7.2 mm and −8 mm for males and −6.5 mm and −7.3 mm for females, referenced to the brain surface). Virus was obtained from the University of North Carolina at Chapel Hill Gene Therapy Center, courtesy of Dr. Karl Deisseroth. A total of 1–1.5 μl of virus with a titer of ≥ 10¹² vg/ml was injected at a rate of 0.1 μl/min per injection site. The rats were also implanted bilaterally with optic fibers (200 μm diameter, Thorlabs, NJ; AP: −5.3 mm, ML: ±2.61 mm, and DV: −6.8 mm for males and −6.2 mm for females, at a 15° angle toward the midline).
Apparatus
Training was conducted in 10 standard behavioral chambers from Coulbourn Instruments (Allentown, PA), each enclosed in a sound-resistant shell. A food cup was recessed in the center of one end wall. Entries were monitored by photobeam. A food dispenser containing 45 mg sucrose pellets (plain, banana-flavored, or chocolate-flavored; Bio-SERV, Frenchtown, NJ) allowed delivery of pellets into the food cup.
Number and Identity Unblocking
Unblocking based on changes in number or identity was run similarly to prior experiments [19, 21]. The main change from the procedures followed in that work was to train each rat on each type of unblocking separately, to avoid interference between tracking of identity and number, which weakened the effect relative to prior work [19]. Thus, while each rat was trained on each type of unblocking, this was done separately and with different cues. Approximately half the rats (n = 7) received number unblocking first and the remaining rats received identity unblocking first. There were no main effects nor any interactions involving order (F’s < 1.564, p’s > 0.240).
Number unblocking consisted of three phases: conditioning, compound training, and probe testing. Conditioning consisted of 12 sessions. In each session, the rats received presentations of two distinct 10s visual cues (VB and VUB, 6 W light bulb with on/off patterns consisting of 2.4 s on and 0.2 s off for a total of 10 s, or 0.2 s on and 2.4 s off, counterbalanced) paired with plain sucrose pellets, delivery of which commenced during the final second of the cue. VB was paired with two sucrose pellets, whereas VUB was paired with only one pellet. Each session consisted of 16 trials of VB and 16 trials of VUB, arranged in 8-trial blocks, the order of which varied from day to day for each rat. Trials were separated by inter-trial intervals that varied randomly among 95, 120, and 135 s. After conditioning, the rats began compound training. Before the first session, the rats received 8 presentations each of two novel 10s auditory cues (AB and AUB, ~76 dB, customized Arduino-based melodies, counterbalanced). Compound training consisted of 4 sessions. Trial structure was the same as during conditioning except that half the trials were reminder trials in which VB and VUB were presented followed by reward as in conditioning (6 trials per block for each visual cue, for a total of 16 trials across 2 blocks for each cue in a session) and the remaining half of the trials were converted into compound trials in which these visual cues were presented together with the new auditory cues. These compound cues were paired with two pellets delivered precisely as on VB trials. In addition, on VUB/AUB trials, green light (532 nm, final output ~16–18 mW, Shanghai Laser & Optics Century Co., Ltd.) was delivered to the VTA for 5 s beginning 0.5 s prior to delivery of the second pellet in some rats. This longer duration was chosen to achieve blockade of dopamine activity without inducing the extinction learning mediated by briefer suppression of dopamine neurons [15]; this dichotomy is consistent with recent suggestions that the dynamic change and not the level of dopamine is most relevant for negative error signaling [37]. In the remaining rats, light was delivered for 5 s in the middle of the ITI. After the final day of compound training, the rats underwent a probe test, in which AB and AUB were each presented 6 times in random order without any reward. Subsequently, all rats received 2 reminder conditioning sessions, after which we repeated the compound training and probe testing described above, but using 2 new auditory cues. Rats that had received light during the reward period received light during the ITI and vice versa.
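For readers who find a schematic helpful, the sketch below lays out the timing of a single compound-training trial as we read it from the description above. The function, the event labels, and the exact pellet and within-ITI timings are our own simplifications; the only parameters taken from the text are the 10 s cue, the 5 s light pulse beginning 0.5 s before the second pellet, and the 95/120/135 s ITIs.

```python
import random

ITIS = (95, 120, 135)   # inter-trial intervals (s), as stated above
CUE_DUR = 10            # cue duration (s)
LIGHT_DUR = 5           # duration of VTA light delivery (s)
LIGHT_LEAD = 0.5        # light onset this many seconds before the second pellet

def compound_trial(trial_type, light_condition):
    """Return a time-stamped event list (s from cue onset) for one trial.

    trial_type: 'VB/AB' or 'VUB/AUB'; light_condition: 'reward' or 'ITI'.
    Pellet timings are schematic: delivery commences in the final second of
    the cue, and the gap between pellets is assumed to be 1 s for illustration.
    """
    t_pellet1 = CUE_DUR - 0.5
    t_pellet2 = t_pellet1 + 1.0
    iti = random.choice(ITIS)
    events = [(0.0, f"{trial_type} compound cue on"),
              (t_pellet1, "pellet 1"),
              (CUE_DUR, "cue off"),
              (t_pellet2, "pellet 2"),
              (t_pellet2 + iti, "next trial")]
    if trial_type == "VUB/AUB":
        if light_condition == "reward":
            t_on = t_pellet2 - LIGHT_LEAD      # 0.5 s before the second pellet
        else:
            t_on = t_pellet2 + iti / 2.0       # middle of the ITI (schematic)
        events += [(t_on, "532 nm light on"), (t_on + LIGHT_DUR, "light off")]
    return sorted(events)

print(compound_trial("VUB/AUB", "reward"))
```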
Identity unblocking consisted of three phases: conditioning, compound training, and probe testing. Conditioning consisted of 12 sessions. In each session, the rats received presentations of two distinct 10s visual cues (VB and VUB, 6 W light bulb with on/off patterns consisting of 0.5 s on and 1.8 s off for a total of 10 s, or 1 s on and 1 s off, counterbalanced) paired with delivery of one of two equally preferred flavored sucrose pellets (banana and chocolate, pairing counterbalanced, Figure 1d). Each cue was paired with delivery of 2 pellets of one flavor, commencing during the final second of the cue. Each session consisted of 16 trials of VB and 16 trials of VUB, arranged in 8-trial blocks, the order of which varied from day to day for each rat. Trials were separated by inter-trial intervals that varied randomly among 95, 120, and 135 s. After conditioning, the rats began compound training. Before the first session, the rats received 8 presentations each of two novel 10s auditory cues (AB and AUB, ~76 dB, customized Arduino-based melodies, counterbalanced). Compound training consisted of 4 sessions. Trial structure was the same as during conditioning except that half the trials were reminder trials in which VB and VUB were presented followed by reward as in conditioning (6 trials per block for each visual cue, for a total of 16 trials across 2 blocks for each cue in a session) and the remaining half of the trials were converted into compound trials in which these visual cues were presented together with the new auditory cues. VB/AB was paired with two pellets of the same reward delivered on VB trials. VUB/AUB was also paired with two pellets, the first of which was identical to that predicted by VUB and the second of which was of the other flavor. In addition, on VUB/AUB trials, green light (532 nm, final output ~16–18 mW) was delivered to the VTA for 5 s beginning 0.5 s prior to delivery of the second pellet in some rats. Again, this longer duration was chosen to achieve blockade of dopamine activity without inducing the extinction learning mediated by briefer suppression of dopamine neurons [15]. In the remaining rats, light was delivered for 5 s in the middle of the ITI. After the final day of compound training, the rats underwent a probe test, in which AB and AUB were each presented 6 times without any reward. Subsequently, all rats received 2 reminder conditioning sessions, after which we repeated the compound training and probe testing described above, but using 2 new auditory cues. Rats that had received light during the reward period received light during the ITI and vice versa.
Histology and Immunohistochemistry
Rats that received viral infusions and fiber implants were euthanized with carbon dioxide and perfused with 1× phosphate-buffered saline (PBS) followed by 4% paraformaldehyde (Santa Cruz Biotechnology Inc., CA). Fixed brains were cut in 40 μm sections to examine fiber tip position under a fluorescence microscope (Olympus Microscopy, Japan). For immunohistochemistry, the brain slices were first blocked in 10% donkey serum made in 0.1% Triton X-100/1× PBS and then incubated in anti-tyrosine hydroxylase (TH) antisera (1:600, EMD Millipore, Billerica, MA) followed by Alexa 568 secondary antisera (1:1000, Invitrogen, Carlsbad, CA). Images of brain slices were acquired with a confocal microscope (Olympus FluoView 1000, Olympus America, Melville, NY) and later analyzed in Adobe Photoshop. The VTA, including its anterior (rostral and parabrachial pigmented area) and posterior (caudal, parabrachial pigmented area, paranigral nucleus, and medial substantia nigra pars medialis) portions, was analyzed in brain slices from AP −5.1 mm to −5.9 mm from 4 subjects. This region encompasses the location targeted by our fibers and likely to receive good light penetration. For quantification, the intensity of 4 random 40 μm × 40 μm square areas of background was averaged to provide a baseline, and positive staining was defined as signal 2.5 times this baseline intensity, with a cell diameter larger than 5 μm, co-localized within cells reactive to DAPI staining.
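The counting rule above can be summarized as a small sketch. The data structures and the assumption that candidate cells were pre-segmented are ours, introduced for illustration; the published analysis was performed manually in Adobe Photoshop.

```python
import numpy as np

def count_positive_cells(image, candidate_cells, background_rois,
                         threshold_factor=2.5, min_diameter_um=5.0):
    """Apply the criterion described above to pre-segmented candidate cells.

    image: 2-D array of fluorescence intensities (e.g., NpHR-eYFP channel).
    candidate_cells: list of dicts with 'mask' (boolean array matching image),
        'diameter_um' (float), and 'dapi_positive' (bool); segmentation is
        assumed to have been done elsewhere.
    background_rois: boolean masks for the four 40 x 40 um background squares.
    """
    baseline = np.mean([image[roi].mean() for roi in background_rois])
    positives = [
        cell for cell in candidate_cells
        if image[cell["mask"]].mean() >= threshold_factor * baseline
        and cell["diameter_um"] > min_diameter_um
        and cell["dapi_positive"]
    ]
    return len(positives)
```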
Quantification and statistical analysis
The primary measure of conditioning was the percentage of time that each rat spent with its head in the food cup during each CS presentation prior to food delivery, as indicated by disruption of the photocell beam. Food cup responding is strongest at the end of the cue, so for clarity we excluded the first 3s of the cue period from the data presented in the figures. However, a direct comparison of the data shown with data from this initial 3s period showed no interaction of period with any of the meaningful effects described in the main text (F’s < 1.184; p’s > 0.282). In addition, in some phases we also measured the amount of time the rats spent in the food cup during the post-CS period, starting at the time of the first food pellet delivery. This was done to test for any aversive or distracting effect of the optogenetic manipulation during compound training, or as an additional assessment of conditioned responding post-cue (in the absence of food) in the final probe test. All statistical analysis was performed with multi-factor analysis of variance (ANOVA) in STATISTICA (Statsoft, TIBCO Software Inc., Palo Alto, CA). Specifically, 2-factor ANOVA (cue × session) was used to analyze CS responding during the conditioning phase, whereas 3-factor ANOVA (run × cue × session/trial) was used to assess the effect of light inhibition on behavior in the compound and probe phases of both experiments. For the post-CS comparisons, 2-factor ANOVA (run × cue) was applied. Sample size was 12 rats for the number unblocking experiment and 13 for identity unblocking, as described in “Subjects”.
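For readers who wish to reproduce this style of analysis in open-source tools, the same repeated-measures designs can be run as sketched below. The file name, DataFrame layout, and column names are our own assumptions; the analyses reported here were carried out in STATISTICA.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Assumed long-format layout: one row per rat x cue x session, with columns
# 'rat', 'cue' ('VB'/'VUB'), 'session' (1-12), and 'pct_time_in_cup'.
df = pd.read_csv("conditioning_responding.csv")   # hypothetical file name

# 2-factor repeated-measures ANOVA (cue x session) with rats as subjects
result = AnovaRM(data=df, depvar="pct_time_in_cup",
                 subject="rat", within=["cue", "session"]).fit()
print(result)

# The 3-factor analyses (run x cue x session or trial) follow the same pattern,
# with 'run' added to the within-subject factor list.
```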
Supplementary Material
Highlights.
Dopamine transients are necessary for unblocking by changes in reward number
Dopamine transients are necessary for unblocking by changes in reward features
These results suggest that dopamine transients play a general role in error signaling
Acknowledgments
This work was supported by the Intramural Research Program at the National Institute on Drug Abuse. The authors would like to thank Dr. Karl Deisseroth and the Gene Therapy Center at the University of North Carolina at Chapel Hill for providing viral reagents, and Dr. Brandon Harvey and the NIDA Optogenetic and Transgenic Core for their assistance. The opinions expressed in this article are the authors’ own and do not reflect the view of the NIH/DHHS. The authors declare they have no financial conflicts of interest to report.
Footnotes
Author Contributions
C.Y.C. and G.S. conceived the experiment; C.Y.C., M.G., and M.T. carried out the experiment; C.Y.C. and G.S. analyzed the data and prepared the manuscript.
References
1. Rescorla RA, Wagner AR. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In: Black AH, Prokasy WF, editors. Classical Conditioning II: Current Research and Theory. Appleton-Century-Crofts; New York: 1972. pp. 64–99.
2. Sutton RS. Learning to predict by the methods of temporal differences. Machine Learning. 1988;3:9–44.
3. Mirenowicz J, Schultz W. Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology. 1994;72:1024–1027. doi: 10.1152/jn.1994.72.2.1024.
4. Roesch MR, Calu DJ, Schoenbaum G. Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nature Neuroscience. 2007;10:1615–1624. doi: 10.1038/nn2013.
5. Waelti P, Dickinson A, Schultz W. Dopamine responses comply with basic assumptions of formal learning theory. Nature. 2001;412:43–48. doi: 10.1038/35083500.
6. Pan WX, Schmidt R, Wickens JR, Hyland BI. Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. Journal of Neuroscience. 2005;25:6235–6242. doi: 10.1523/JNEUROSCI.1478-05.2005.
7. D'Ardenne K, McClure SM, Nystrom LE, Cohen JD. BOLD responses reflecting dopaminergic signals in the human ventral tegmental area. Science. 2008;319:1264–1267. doi: 10.1126/science.1150605.
8. Day JJ, Roitman MF, Wightman RM, Carelli RM. Associative learning mediates dynamic shifts in dopamine signaling in the nucleus accumbens. Nature Neuroscience. 2007;10:1020–1028. doi: 10.1038/nn1923.
9. Hart AS, Rutledge RB, Glimcher PW, Phillips PE. Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. Journal of Neuroscience. 2014;34:698–704. doi: 10.1523/JNEUROSCI.2489-13.2014.
10. Zweifel LS, et al. Disruption of NMDAR-dependent burst firing by dopamine neurons provides selective assessment of phasic dopamine-dependent behavior. Proceedings of the National Academy of Sciences. 2009;106:7281–7288. doi: 10.1073/pnas.0813415106.
11. Steinberg EE, Keiflin R, Boivin JR, Witten IB, Deisseroth K, Janak PH. A causal link between prediction errors, dopamine neurons and learning. Nature Neuroscience. 2013;16:966–973. doi: 10.1038/nn.3413.
12. Tsai HC, Zhang F, Adamantidis A, Stuber GD, Bonci A, de Lecea L, Deisseroth K. Phasic firing in dopamine neurons is sufficient for behavioral conditioning. Science. 2009;324:1080–1084. doi: 10.1126/science.1168878.
13. Kim KM, Baratta MV, Yang A, Lee D, Boyden ES, Fiorillo CD. Optogenetic mimicry of the transient activation of dopamine neurons by natural reward is sufficient for operant reinforcement. PLoS One. 2012;7:e33612. doi: 10.1371/journal.pone.0033612.
14. Stopper CM, Tse MT, Montes DR, Wiedman CR, Floresco SB. Overriding phasic dopamine signals redirects action selection during risk/reward decision making. Neuron. 2014;84:177–189. doi: 10.1016/j.neuron.2014.08.033.
15. Chang CY, Esber GR, Marrero-Garcia Y, Yau HJ, Bonci A, Schoenbaum G. Brief optogenetic inhibition of VTA dopamine neurons mimics the effects of endogenous negative prediction errors during Pavlovian over-expectation. Nature Neuroscience. 2016;19:111–116. doi: 10.1038/nn.4191.
16. Takahashi YK, Batchelor HM, Liu B, Khanna A, Morales M, Schoenbaum G. Dopamine neurons respond to errors in the prediction of sensory features of expected rewards. Neuron. 2017;95:1395–1405. doi: 10.1016/j.neuron.2017.08.025.
17. Witten IB, et al. Recombinase-driver rat lines: tools, techniques, and optogenetic application to dopamine-mediated reinforcement. Neuron. 2011;72:721–733. doi: 10.1016/j.neuron.2011.10.028.
18. Holland PC. Unblocking in Pavlovian appetitive conditioning. Journal of Experimental Psychology. 1984;10:476–497.
19. McDannald MA, Lucantonio F, Burke KA, Niv Y, Schoenbaum G. Ventral striatum and orbitofrontal cortex are both required for model-based, but not model-free, reinforcement learning. Journal of Neuroscience. 2011;31:2700–2705. doi: 10.1523/JNEUROSCI.5499-10.2011.
20. Glascher J, Daw N, Dayan P, O'Doherty JP. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron. 2010;66:585–595. doi: 10.1016/j.neuron.2010.04.016.
21. Burke KA, Franz TM, Miller DN, Schoenbaum G. The role of the orbitofrontal cortex in the pursuit of happiness and more specific rewards. Nature. 2008;454:340–344. doi: 10.1038/nature06993.
22. Frank MJ, Moustafa AA, Haughey HM, Curran T, Hutchinson KE. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proceedings of the National Academy of Sciences. 2007;104:16311–16316. doi: 10.1073/pnas.0706111104.
23. Sharpe MJ, Chang CY, Liu MA, Batchelor HM, Mueller LE, Jones JL, Niv Y, Schoenbaum G. Dopamine transients are sufficient and necessary for acquisition of model-based associations. Nature Neuroscience. 2017;20:735–742. doi: 10.1038/nn.4538.
24. Holland PC, Gallagher M. Amygdala central nucleus lesions disrupt increments, but not decrements, in conditioned stimulus processing. Behavioral Neuroscience. 1993;107:246–253. doi: 10.1037//0735-7044.107.2.246.
25. Holland PC, Gallagher M. Effects of amygdala central nucleus lesions on blocking and unblocking. Behavioral Neuroscience. 1993;107:235–245. doi: 10.1037//0735-7044.107.2.235.
26. Holland PC, Kenmuir C. Variations in unconditioned stimulus processing in unblocking. Journal of Experimental Psychology: Animal Behavior Processes. 2005;31:155–171. doi: 10.1037/0097-7403.31.2.155.
27. Boorman ED, Rajendran VG, O'Reilly JX, Behrens TE. Two anatomically and computationally distinct learning signals predict changes to stimulus-outcome associations in hippocampus. Neuron. 2016;89:1343–1354. doi: 10.1016/j.neuron.2016.02.014.
28. Iglesias S, Mathys C, Brodersen KH, Kasper L, Piccirelli M, den Ouden HE, Stephan KE. Hierarchical prediction errors in midbrain and basal forebrain during sensory learning. Neuron. 2013;80:519–530. doi: 10.1016/j.neuron.2013.09.009.
29. Lak A, Nomoto K, Keramati M, Sakagami M, Kepecs A. Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Current Biology. 2017;27:821–832. doi: 10.1016/j.cub.2017.02.026.
30. Schwartenbeck P, FitzGerald THB, Dolan R. Neural signals encoding shifts in beliefs. Neuroimage. 2016;125:578–586. doi: 10.1016/j.neuroimage.2015.10.067.
31. Kakade S, Dayan P. Dopamine: generalization and bonuses. Neural Networks. 2002;15:549–559. doi: 10.1016/s0893-6080(02)00048-5.
32. Tobler PN, Dickinson A, Schultz W. Coding of predicted reward omission by dopamine neurons in a conditioned inhibition paradigm. Journal of Neuroscience. 2003;23:10402–10410. doi: 10.1523/JNEUROSCI.23-32-10402.2003.
33. Horvitz JC, Stewart T, Jacobs BL. Burst activity of ventral tegmental dopamine neurons is elicited by sensory stimuli in the awake cat. Brain Research. 1997;759:251–258. doi: 10.1016/s0006-8993(97)00265-5.
34. Schultz W. Dopamine reward prediction-error signalling: a two-component response. Nature Reviews Neuroscience. 2016;17:183–195. doi: 10.1038/nrn.2015.26.
35. Bromberg-Martin ES, Hikosaka O. Midbrain dopamine neurons signal preference for advance information about upcoming rewards. Neuron. 2009;63:119–126. doi: 10.1016/j.neuron.2009.06.009.
36. Collins AL, Greenfield VY, Bye JK, Linker KE, Wang AS, Wassum KM. Dynamic mesolimbic dopamine signaling during action sequence learning and expectation violation. Scientific Reports. 2016;6:20231. doi: 10.1038/srep20231.
37. Hamid AA, Pettibone JR, Mabrouk OS, Hetrick VL, Schmidt R, Vander Weele CM, Kennedy RT, Aragona BJ, Berke JD. Mesolimbic dopamine signals the value of work. Nature Neuroscience. 2016;19:117–126. doi: 10.1038/nn.4173.