Attenuation of dopamine-modulated prefrontal value signals underlies probabilistic reward learning deficits in old age

Lieke de Boer; Jan Axelsson; Katrine Riklund; Lars Nyberg; Peter Dayan; Lars Bäckman; Marc Guitart-Masip

doi:10.7554/eLife.26424

eLife. 2017 Sep 5;6:e26424. doi: 10.7554/eLife.26424

Editor: Wolfram Schultz

PMCID: PMC5593512 PMID: 28870286

Abstract

Probabilistic reward learning is characterised by individual differences that become acute in aging. This may be due to age-related dopamine (DA) decline affecting neural processing in striatum, prefrontal cortex, or both. We examined this by administering a probabilistic reward learning task to younger and older adults, and combining computational modelling of behaviour, fMRI and PET measurements of DA D1 availability. We found that anticipatory value signals in ventromedial prefrontal cortex (vmPFC) were attenuated in older adults. The strength of this signal predicted performance beyond age and was modulated by D1 availability in nucleus accumbens. These results uncover that a value-anticipation mechanism in vmPFC declines in aging, and that this mechanism is associated with DA D1 receptor availability.

Research organism: Human

Introduction

In order to navigate an uncertain world successfully, humans and other animals are required to learn and update the values of available actions and switch between them appropriately. Compared with younger adults, older individuals are poor at probabilistic reward learning and subsequent optimal action selection (Eppinger et al., 2011; Mell et al., 2005). One common account of this deficit is an age-related deterioration of the dopamine (DA) system (Volkow et al., 1998), with two of its primary targets - striatum and prefrontal cortex (PFC) - being obvious culprits.

A wealth of animal literature demonstrates that DA signals from midbrain convey reward prediction errors (RPEs) (Bayer and Glimcher, 2005; Schultz et al., 1997), which are thought to act as signals that facilitate action selection in striatum (Niv and Montague, 2009; Pessiglione et al., 2006). Hence, one hypothesis states that aging leads to decreased striatal DA release in response to RPEs, leading to a comparatively less efficient learning signal (e.g. slower learning rate). Supporting this, previous studies reported lower correlations between RPEs generated from probabilistic reward learning tasks and nucleus accumbens (NAcc) BOLD signals in older compared with younger adults (Eppinger et al., 2013; Samanez-Larkin et al., 2014). By decomposing RPEs in a dynamic two-armed bandit task into their two subcomponents: obtained reward (R) and expected value (Q), Chowdhury et al. (2013) showed that, in older adults, neural activity in NAcc reflected just the former. Only after the dopaminergic system had been pharmacologically boosted could the expected value component be detected in NAcc. While these findings support the tenet that attenuation of DA-modulated expected value signals in NAcc underlies age-related performance deficits (Chowdhury et al., 2013), a younger comparison group was lacking.

Another hypothesis is that age-related decline in probabilistic reward learning may be related to impaired prefrontal functioning (Nyberg et al., 2010; Raz et al., 2005; Halfmann et al., 2016). Indeed, compromised DA projections to frontostriatal circuits are reported in aging (Dreher et al., 2008; Hämmerer and Eppinger, 2012). Anticipatory activity reflecting the value of the chosen option in the ventromedial PFC (vmPFC) is widely reported in decision-making tasks (Balleine and O'Doherty, 2010; Daw et al., 2006) and is modulated by DA (Jocham et al., 2011). Supporting an involvement of PFC in age-related decline in probabilistic reward learning, one previous study suggests decreased RPE signalling in vmPFC in older adults (Eppinger et al., 2015). Another study showed that, within a group of older adults, increased BOLD activity related to value anticipation predicted better performance on the Iowa gambling task (Halfmann et al., 2016). However, despite evidence suggesting that age-related decline in PFC value signals could be related to dopaminergic deterioration, there are no published data directly showing this.

Furthermore, there is little work comparing younger and older populations according to an additional factor that could influence performance in these tasks, namely the impact of uncertainty or confidence in the payoffs or values of the options on choice switching (Badre et al., 2012; Frank et al., 2009; Vinckier et al., 2016). Uncertainty should influence the trade-off between exploration and exploitation (Sutton and Barto, 1998) that an optimal policy should balance. However, how exploration and switching are modulated in aging and how they influence performance is unclear.

Our aim was to investigate the effect of age and DA availability on striatal and prefrontal mechanisms involved in probabilistic reward learning. We included samples of 30 older and 30 younger participants who performed a two-armed bandit task (TAB) previously used by Chowdhury et al. (2013) while fMRI data were acquired. All participants were healthy and cognitively high functioning (MMSE > 27). In brief, all participants performed 220 trials on the TAB (Figure 1a). On each trial, participants chose one of two bandits, represented by fractal images. After a variable interval, the outcome was presented as a green arrow pointing up signalling a reward, or a yellow horizontal bar signalling no reward. Probabilities of obtaining a reward varied over time for both bandits, according to independent Gaussian random walks (Figure 1c, left). This required the participants to update the expected value for each bandit continuously. Participants received monetary earnings of 1 Swedish Krona (SEK, ~$0.11) per rewarded trial. Behaviour was quantified with a Bayesian observer model augmented to capture the influence of variance and confidence on switch behaviour. This model outperformed a Rescorla-Wagner (RW) model that tracked expected value using simple RPEs. To investigate the relationship between the ability to learn about probabilistic rewards and the DA system, we collected PET data using the radioligand [11C]SCH23390 to measure striatal and prefrontal DA D1 receptor binding potential (D1 BP), as a proxy for integrity of the dopaminergic system. The chosen radioligand allows for reliable measurement of BP in striatum and PFC simultaneously (Hall et al., 1994), as opposed to alternative markers of dopaminergic function.

Figure 1. — (a) Schematic representation of a trial in the TAB. Participants were presented with two fractal images on each trial and selected one of them through a button press. The maximum response time was 2000 ms, meaning the trial would count as a miss if the response time exceeded this limit and the next trial would start immediately after the next inter-trial interval. If one stimulus was selected, this option was highlighted with a red frame. After 1000 ms, participants were presented with the outcome: either a green arrow pointing upwards, indicating an obtained reward of 1 SEK (≈$0.11), or a yellow horizontal bar, indicating no win. The position of the images on the screen varied randomly across the 2 × 110 trials of the experiment. Reward probabilities varied throughout the experiment. (b) Behavioural performance on the TAB, across age group. Younger participants earned more money on the TAB on average (top left, t(49) = 1.69, p(one-tailed)=0.048). Proportion of efficient choices differed significantly between the two groups (top right, Mann-Whitney U = 286.5, p(one-tailed)=0.029). Number of switches did not differ significantly between groups (p=0.19; bottom left), but the proportion of adaptive switches differed between age groups (bottom right, Mann-Whitney U = 271.0; p(two-tailed)=0.033). Data are represented as mean ±SEM. (c) Left pane: Varying reward probabilities for obtaining a reward for each bandit on the 220 trials of the experiment. Center/right pane: Model predictions (black lines) and observed behaviour (coloured lines). Model fit did not significantly differ between age groups (Mann-Whitney U = 353.0, p=0.406).

Figure 1—source data 1. Source data to Figure 1.

elife-26424-fig1-data1.zip (17.2 KB, zip)

DOI: 10.7554/eLife.26424.005

Figure 1—source code 1. Code that was used to perform simulation of behavioural data (figure 1c), as well as the creation of Figure 1.

elife-26424-fig1-code1.m (2.4 KB, m)

DOI: 10.7554/eLife.26424.006

Based on previous work, we hypothesised that, in younger participants, BOLD signal in NAcc would reflect both components of the RPE signal, whereas older participants would show a reduced expected-value component. Additionally, we expected an attenuated expected-value signal during choice in the older compared to the younger sample in PFC. We reasoned that the strength of these expected-value representations in both PFC and NAcc would show a relationship to DA D1 BP in either subcortical or prefrontal regions.

Results

Task performance

The goal of the analyses was to establish the neural mechanism underlying decreased probabilistic value learning in older participants. We did this by (1) assessing differences between age groups in the BOLD signal related to anticipatory expected value in the vmPFC, (2) assessing differences between age groups in the BOLD signal related to RPEs in the NAcc, and (3) investigating the relationship between these BOLD signals and DA D1 binding potentials (BP) in a set of predefined ROIs. To obtain the best estimate of expected value to use in our fMRI analysis, we fitted a range of computational models and used Bayesian model selection.

Younger adults outperformed older adults on the task (Figure 1b). There was a weak group difference in total amount of money earned (M_old = 125.9, SD = 11.4; M_young = 130.3, SD = 8.2; t(49), p(one-tailed)=0.050). Additionally, efficient choices, defined as the proportion of total choices that were more likely to be rewarded according to the actual (hidden) state of the Gaussian random walks, also differed between groups (M_old = 0.53, SD = 0.10; M_young = 0.59, SD = 0.08; Mann-Whitney U = 286.5, p(one-tailed)=0.029). We also investigated how switching between the two alternatives contributed to performance. The number of switches was negatively related to total monetary gains (r = −0.29, p=0.032, controlled for age). There was no evidence to suggest that the number of switches differed between age groups (M_old = 57.3, SD = 34.6; M_young = 68.1, SD = 29.8; Mann-Whitney U = 323.0, p=0.190).
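The definition of an efficient choice can be made concrete with a minimal sketch. The function names, the random-walk step size, the starting value and the clipping bounds below are illustrative assumptions, not the parameters used in the experiment; missed trials and exact ties are simply skipped, which is also an assumption.

```python
import random


def gaussian_random_walk(n_trials, start=0.5, step_sd=0.05, rng=None):
    """Reward probability that drifts by Gaussian steps, clipped to [0, 1].

    start and step_sd are hypothetical values chosen for illustration.
    """
    rng = rng or random.Random()
    p, walk = start, []
    for _ in range(n_trials):
        walk.append(p)
        p = min(1.0, max(0.0, p + rng.gauss(0.0, step_sd)))
    return walk


def efficient_choice_proportion(choices, probs):
    """Fraction of choices that picked the bandit with the higher hidden reward probability.

    choices: list of bandit indices (0 or 1), or None for a missed trial.
    probs:   pair of per-trial reward-probability lists, one list per bandit.
    """
    efficient = counted = 0
    for t, choice in enumerate(choices):
        if choice is None or probs[0][t] == probs[1][t]:
            continue  # skip misses and ties (assumption)
        best = 0 if probs[0][t] > probs[1][t] else 1
        efficient += (choice == best)
        counted += 1
    return efficient / counted if counted else float("nan")


# Usage: an agent choosing at random should land near 0.5.
rng = random.Random(0)
probs = (gaussian_random_walk(220, rng=rng), gaussian_random_walk(220, rng=rng))
random_choices = [rng.randrange(2) for _ in range(220)]
print(efficient_choice_proportion(random_choices, probs))
```

Against this chance level of about 0.5, the reported group means of 0.53 and 0.59 indicate modest but above-chance tracking of the better bandit.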

We assessed the ability of a variety of members of two broad families of models to capture trial-by-trial behaviour (see Materials and methods, SI for details). The first family includes variations on standard reinforcement learning (RL) models in which action values are learned through RPEs and the RW updating rule. The second family of models comprises variations on a Bayesian observer in which the probability distribution of obtaining a reward is updated after each outcome observation. Model comparison statistics are displayed in Table 1.

Table 1. Model comparison statistics for the different models.

The winning model, defined as the model with the lowest integrated BIC (iBIC), was the Bayesian observer model with five parameters. Parameters: β: inverse temperature parameter for softmax, α: learning rate for RW model, b: choice kernel, ϕ: forgetting rate for RW model, ω: learning rate for Bayesian model, λ: forgetting rate for Bayesian model, υ: variance weighting, κ: confidence weighting.

Family	Parameters	# Param	Likelihood	Pseudo-R²	iBIC
RW	β, α	2	−5636.8	0.336	11309
	β, α, b	3	−5317.8	0.374	10692
	β, α, b, ϕ	4	−5140.0	0.394	10355
Bayesian observer	β, ω	2	−5919.8	0.302	11877
	β, ω, λ	3	−5719.2	0.326	11495
	β, ω, λ, b	4	−5154.6	0.392	10385
	β, ω, λ, υ(chosen)	4	−5161.7	0.392	10399
	β, ω, λ, υ(unchosen)	4	−5130.0	0.395	10335
	β, ω, λ, κ	4	−5675.3	0.331	11426
	β, ω, λ, υ(unchosen), κ	5	−5082.5	0.401	10259
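
The selection rule stated in the table legend (lowest iBIC wins) can be checked directly against the tabulated values. This is a small sketch; the dictionary layout and variable names are assumptions made for illustration, and only the iBIC numbers come from Table 1.

```python
# iBIC values copied from Table 1, keyed by (model family, free parameters).
ibic = {
    ("RW", "β, α"): 11309,
    ("RW", "β, α, b"): 10692,
    ("RW", "β, α, b, ϕ"): 10355,
    ("Bayesian observer", "β, ω"): 11877,
    ("Bayesian observer", "β, ω, λ"): 11495,
    ("Bayesian observer", "β, ω, λ, b"): 10385,
    ("Bayesian observer", "β, ω, λ, υ(chosen)"): 10399,
    ("Bayesian observer", "β, ω, λ, υ(unchosen)"): 10335,
    ("Bayesian observer", "β, ω, λ, κ"): 11426,
    ("Bayesian observer", "β, ω, λ, υ(unchosen), κ"): 10259,
}

# The winning model is the one with the lowest iBIC.
winner = min(ibic, key=ibic.get)

# Best model within the RW family, for comparison.
best_rw = min((key for key in ibic if key[0] == "RW"), key=ibic.get)
delta_ibic = ibic[best_rw] - ibic[winner]

print(winner)      # ('Bayesian observer', 'β, ω, λ, υ(unchosen), κ')
print(delta_ibic)  # 96
```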

The most parsimonious account came from a five-parameter Bayesian observer model. This tracked the probability of obtaining a reward for each action as a beta distribution with parameters representing pseudo-counts of wins and win omissions. Pseudo-counts for the bandit that was chosen were updated according to the outcome, based on a learning rate of ω. Pseudo-counts for the bandit that was not chosen were relaxed towards neutral values based on a forgetting rate of λ. The beta distributions generated action propensities for the two bandits according to three weighted additive factors: one was the relative expected values (Q) of the bandits, calculated as the mean of the beta distributions (Figure 2 and Equation 7, Materials and methods). The other two depended on different forms of uncertainty and were associated with the choice between sticking with the previous choice or switching.

Figure 2. — All components that are used to model choice at trial 21 are marked in orange. The sequence of choices for this participant was [1 1 1 2 1 1 1 2 2 1 2 2 1 1 1 2 2 2 2 1], and the payout for these choices was [1 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1 0 1]. According to the participant’s individually fitted model parameters (ω = 0.72; λ = 0.28), and following this sequence of choices and outcomes, the beta distributions defining the subjective value of the bandits were $θ_{1} \sim \mathrm{Beta}(θ_{1}; 2.02, 1.08)$ and $θ_{2} \sim \mathrm{Beta}(θ_{2}; 1.26, 1.74)$ (see Equations 9–11, Materials and methods) at choice of trial 21. The expected value for each bandit was defined as the mean of the beta distribution (Q₁ = 0.65, Q₂ = 0.42; see Equation 7, Materials and methods). The variance of the unchosen option was equal to the variance of bandit 2, which was not chosen on trial 20 (V_uc = 0.05, see Equation 8, Materials and methods). Variance is schematically represented as a dotted line (note that this is an approximation because the beta distributions are not symmetrical). The 2-d plot shows the joint distribution $P(θ_{1}, θ_{2})$, where values of $θ_{1}$ are along the x-axis and $θ_{2}$ along the y-axis. Confidence was calculated based on the values of the distributions at choice on the previous trial. C₁ was defined as the probability that a random sample drawn from $θ_{1}$ at the time of choice at trial 20 was greater than a sample drawn from $θ_{2}$ (shaded area below the diagonal, where $θ_{1} > θ_{2}$; C₁ = 0.56, Equation 15, Materials and methods). C₂ could be defined as 1 − C₁ (shaded area above the diagonal, C₂ = 0.44, Equation 16, Materials and methods). $C^{rel}$ was equivalent to $C^{chosen} - C^{unchosen}$, in this case C₁ − C₂ ($C^{rel}$ = 0.12, Equation 17, Materials and methods). This relative confidence was scaled by $κ$ and then added to the action that was not chosen on the previous trial (in this case bandit 2).
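
The quantities in this caption can be illustrated with a short sketch that starts from the two beta distributions given for trial 21. The function names are hypothetical. The mean reproduces the reported Q values; the variance uses the textbook formula for a beta distribution, which may differ from the paper's Equation 8 (not shown in this section); and the confidence is estimated by Monte Carlo sampling. Because the caption computes C₁ from the distributions at trial 20, which are not listed, the confidence value obtained here from the trial-21 distributions is not expected to equal 0.56.

```python
import random


def beta_mean(a, b):
    """Expected value Q of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Textbook variance of Beta(a, b); the paper's Equation 8 may differ."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def prob_first_greater(beta1, beta2, n_samples=50_000, seed=0):
    """Monte Carlo estimate of P(theta_1 > theta_2) for two independent betas."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*beta1) > rng.betavariate(*beta2)
        for _ in range(n_samples)
    )
    return wins / n_samples


theta1, theta2 = (2.02, 1.08), (1.26, 1.74)  # pseudo-counts from the caption

q1, q2 = beta_mean(*theta1), beta_mean(*theta2)
print(round(q1, 2), round(q2, 2))  # 0.65 0.42

c1 = prob_first_greater(theta1, theta2)  # confidence that bandit 1 is better
c_rel = c1 - (1.0 - c1)                  # relative confidence, C1 - C2
print(c1, c_rel)
```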

The first determinant of switching was the current variance (V) of the option that was not chosen on the previous trial, calculated from its approximate beta distribution (Figure 2; Equation 8, Materials and methods). The variance was multiplied by a parameter υ and added to the propensity of this previously unchosen option. υ was negative in all but two participants (Table 2), reflecting the fact that increasing uncertainty about the unchosen option decreased its value, making it a less likely choice on the current trial. Hence, increased uncertainty about the previously unchosen option caused most subjects to stick to their current choice. This is the opposite of an exploration bonus. This model outperformed an account based on a more conventional choice kernel, according to which perseverating or switching was influenced only by previous choices themselves rather than something reflecting knowledge about those choices. This suggests that perseveration, which is commonly observed (Rutledge et al., 2009; Schönberg et al., 2007), may partly reflect uncertainty aversion.

Table 2. Summary statistics of the five parameters of the winning model.

	Minimum	25th percentile	Median	75th percentile	Maximum
β	1.436	7.017	12.280	17.730	64.750
ω	0.042	0.238	0.408	0.558	0.851
λ	0.055	0.139	0.202	0.270	0.544
υ	−3.372	−1.845	−1.069	−0.678	0.234
κ	−0.202	0.152	0.260	0.359	0.896

The second determinant of switching was a measure of the relative confidence in the choice that was made on the previous trial (see Materials and methods). We assumed that subjects used their approximate Bayesian posterior distributions over the values of the bandits to calculate this confidence, $C^{rel}$ , as their subjective probability that the option they chose was better (a calculation they made before observing the outcome on that trial). A term $C^{rel} κ$ was then added to the value of the action that was not chosen on that previous trial (see Materials and methods), where $κ$ was fitted to each participant. Thus, if $κ$ was positive, then a subject would be more likely to switch on trial $t$ if she had been more confident on trial $t - 1$ .

Note that relative confidence was calculated on the preceding trial because, at the time of the choice, the model has no information about the option that will be chosen on that trial. Perhaps surprisingly, κ was positive in 49 out of 57 participants (Table 2) – thus for the majority of subjects, the more sure they were that the chosen option was better, the more they sought to switch and try the alternative. It is important to acknowledge, however, that there are subtle interactions with the effects of the means and variances of the options with which relative confidence is partly correlated. Nevertheless, κ was negatively correlated with total monetary gains on the task (r(54) = −0.42, p=0.001, controlled for age; Supplementary file 1), with negative values of κ in those participants with the highest performance. This implies that κ has the expected effect on performance despite having an unexpected sign at the group level. The overall tendency for κ to be positive and υ to be negative does not stem from autocorrelation between the two, as the sign of these parameters is largely the same when the model is specified with only one of them (data not shown).

The final step to realizing choice was to feed the ultimate action propensities into a softmax with inverse temperature parameter β.
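
The three additive factors and the softmax can be combined in a minimal sketch of the choice rule. This is a reading of the verbal description above, not the authors' code: the function name is hypothetical, and it is an assumption that β multiplies the full propensity (expected value plus both uncertainty terms). The example mixes the expected values, variance and relative confidence from Figure 2 with the median parameters from Table 2 purely for illustration.

```python
import math


def choice_probabilities(q, prev_unchosen, v_unchosen, c_rel, beta, upsilon, kappa):
    """Softmax choice probabilities for two bandits.

    q:             expected values (Q) of the two bandits.
    prev_unchosen: index of the bandit NOT chosen on the previous trial.
    v_unchosen:    variance of that previously unchosen bandit.
    c_rel:         relative confidence computed on the previous trial.
    """
    propensity = list(q)
    # Both uncertainty terms are added to the previously unchosen option.
    propensity[prev_unchosen] += upsilon * v_unchosen + kappa * c_rel
    # Softmax with inverse temperature beta, shifted by the max for stability.
    top = max(propensity)
    weights = [math.exp(beta * (p - top)) for p in propensity]
    total = sum(weights)
    return [w / total for w in weights]


# Figure 2 quantities with Table 2 median parameters (illustrative mix).
probs = choice_probabilities(
    q=(0.65, 0.42), prev_unchosen=1, v_unchosen=0.05, c_rel=0.12,
    beta=12.280, upsilon=-1.069, kappa=0.260,
)
print(probs)  # probability of choosing bandit 1, then bandit 2
```

With a negative υ the variance term lowers the propensity of the previously unchosen bandit (favouring sticking), while a positive κ raises it (favouring switching), matching the signs discussed above.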

We found that variation in the number of switches was better accounted for by variation in the parameters of the winning model than by that in parameters of the best RW model (Supplementary file 1b). No single model parameter differed between age groups (Mann-Whitney test: β, U = 386.0, p=0.761; ω, U = 345.0, p=0.338; λ, U = 401.0, p=0.949; υ, U = 307.0, p=0.117; κ, U = 374.0, p=0.620). A multivariate analysis with the model parameters as independent variables and age group as a fixed factor did not yield any significant predictor of age group (F = 0.91, p=0.482). Model fit, defined by the individual log likelihood for each participant, also did not differ between age groups (Mann-Whitney U = 353.0, p=0.406). In the best-performing RW model (which fit the behavioural data less well), younger participants learned more quickly (Supplementary file 1c).

Because measures of successful performance differed between groups but the number of switches did not, we used our winning model to investigate the nature of switches separately in each group. We used the expected values from the winning model to assess the proportion of adaptive switches (to subjectively better bandits) versus maladaptive switches (to subjectively worse bandits) and found that young participants had a higher proportion of adaptive switches compared with old participants (M_old = 57.4, SD = 23.4; M_young = 68.3, SD = 18.2; Mann-Whitney test U = 271.5, p=0.033, Figure 1b, bottom right). The proportion of adaptive switches was positively associated with total monetary gains (r(54) = 0.49, p<0.001), suggesting that the age difference in performance partly resulted from differences in strategic switching.
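
The classification of switches can be sketched as follows. The function name and data layout are assumptions; the sketch takes the model's expected values held at the time of each choice, treats a switch as adaptive when the newly chosen bandit has the strictly higher expected value, and assumes missed trials were removed beforehand.

```python
def adaptive_switch_proportion(choices, q_values):
    """Proportion of switches that went to the subjectively better bandit.

    choices:  list of bandit indices (0 or 1), one per completed trial.
    q_values: list of (Q_0, Q_1) pairs, the model's expected values at the
              time of choice on each trial.
    """
    adaptive = switches = 0
    for t in range(1, len(choices)):
        new, old = choices[t], choices[t - 1]
        if new == old:
            continue  # participant stayed, so this is not a switch
        switches += 1
        if q_values[t][new] > q_values[t][old]:
            adaptive += 1  # switched towards the higher expected value
    return adaptive / switches if switches else float("nan")


# Usage with made-up data: two switches, one adaptive and one maladaptive.
demo_choices = [0, 0, 1, 1, 0]
demo_q = [(0.5, 0.5), (0.6, 0.4), (0.4, 0.7), (0.4, 0.7), (0.3, 0.8)]
print(adaptive_switch_proportion(demo_choices, demo_q))  # 0.5
```

Multiplying the result by 100 gives a percentage comparable to the group means reported above.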

Value anticipation in vmPFC

To investigate brain activity reflecting value anticipation, we estimated a GLM that included the chosen value Q on that trial as a regressor to be correlated with the BOLD signal at the time of choice (see Materials and methods, GLM 1). Clusters in vmPFC, bilateral hippocampus, visual cortex and bilateral precuneus showed a positive correlation between BOLD and Q at the time of choice at p(FWE-corrected)<0.05 (Supplementary file 3).

We next tested if the expected-value signal differed between age groups. We used the cluster showing a positive correlation between BOLD and Q at the time of choice in vmPFC (Figure 3, Supplementary file 3) as a functional ROI to extract the individual parameter estimates. A two-sample t-test, orthogonal to the test used to define this ROI, revealed that younger participants showed a stronger representation of Q in vmPFC compared to the older participants (M_old = 2.84, SD = 5.25; M_young = 6.44, SD = 6.07; t(55) = 2.38; p=0.021). This difference in vmPFC value signal did not arise because of the difference in learning performance: when we restricted our analysis to high performers as defined by a median split (13 old, 15 young), a difference in performance was no longer significant (p=0.60), but the strength of expected-value signal in vmPFC was correlated with age (r(26) = −0.39, p=0.040) and we found a marginally significant difference between age groups (M_old = 4.21, SD = 4.81; M_young = 8.29, SD = 5.72; t(26) = 2.03, p=0.054). For illustrative purposes, we plotted the time course of the expected-value signal in vmPFC over the course of a trial. This suggests that, on average, the expected-value signal was stronger and sustained for longer throughout the trial in younger compared with older adults (Figure 3c).

Figure 3. — (a) Cluster in vmPFC that shows expected value activity at the time of the choice. Peak voxel x,y,z = −5, 52, −6; p<0.05, FWE corrected. (b) Parameter estimates for younger and older participants extracted from the cluster in Figure 3a. Activity differs significantly between age groups (t(55) = 2.38; p=0.021). Error bars represent standard errors of the means. (c) Time-course visualisation of the expected value signal in vmPFC. Shaded areas indicate standard errors. The expected-value signal is significantly larger and prolonged in the younger compared to the older sample. (d) There is a positive relationship between expected-value signal magnitude and total monetary gains (r(53) = 0.37, p=0.006 when controlling for age and model fit). For display purposes, the correlations are shown with residuals after regressing out age and model fit. (e) DA D1 BP in NAcc is positively related to Q in vmPFC (r(53) = 0.28, p=0.038, when controlling for age). For display purposes the correlations are shown with residuals after regressing out age.

Figure 3—source data 1. Source data for Figure 3: cluster corresponding to Q in vmPFC at the time of choice.
Parameter estimates for all participants of Q in vmPFC at the time of choice. Timecourse data for young and old participants corresponding to Q in vmPFC. BPs in NAcc for all participants.

elife-26424-fig3-data1.zip (1.1 MB, zip)

DOI: 10.7554/eLife.26424.011

The parameter estimate for Q in vmPFC was positively related to total monetary gains (r(53) = 0.37, p=0.006, controlling for age and model fit in a partial correlation). This correlation remained significant without controlling for age, model fit or both. Q in vmPFC was a significant predictor of all measures of performance (bivariate correlations: total monetary gains: r(55) = 0.47, p<0.001; adaptive switches: r(55) = 0.39, p=0.003; efficient choices: r(55) = 0.38, p=0.004). Age was also a significant predictor of all measures of performance (bivariate correlations: total monetary gains: r(55) = −0.32, p=0.050; adaptive switches: r(55) = −0.26, p=0.052; efficient choices: r(55) = −0.32, p=0.015). Age was also a significant predictor of Q in vmPFC (r(55) = −0.32, p=0.016). Age was no longer a significant predictor of performance after controlling for Q in vmPFC (beta age = −0.12, −0.23, and −0.15; p=0.328, 0.086 and 0.255, for monetary gains, efficient choices and adaptive switches, respectively), whereas Q in vmPFC remained a significant predictor of all measures of performance (beta Q in vmPFC = 0.43, 0.30, and 0.34; p=0.001, 0.023 and 0.012 for monetary gains, efficient choices and adaptive switches, respectively). This is consistent with a full mediation of age effects on performance by Q in vmPFC. Note, however, that it is difficult to make inferences on mediation effects of age in a cross-sectional dataset (Lindenberger et al., 2011).
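
For readers unfamiliar with the partial correlations reported here, the following minimal sketch shows the standard first-order formula for the correlation between two variables after controlling for a single covariate such as age. The function names and the toy data are assumptions for illustration; the analysis above that controls for both age and model fit would need the two-covariate extension (for example, correlating residuals after regressing out both covariates).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (made up): y depends on x and on the covariate z.
x = [1, 3, 2, 5, 4, 6]
z = [1, 2, 3, 4, 5, 6]
y = [2 * a + 3 * b for a, b in zip(x, z)]

# y is an exact linear function of x once z is accounted for,
# so the partial correlation is 1 up to floating-point error.
print(partial_correlation(x, y, z))
```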

The results were not dependent on the use of the Bayesian model to estimate Q values (when using the RW model Q estimates and including both age and Q: beta age = −0.20, −0.22, −0.21, p=0.111, 0.093, 0.104; beta Q in vmPFC = 0.33, 0.28, 0.26, p=0.010, 0.030, 0.047 for monetary gains, efficient choices and adaptive switches, respectively).

RPE signals in striatum

RPEs are widely reported in NAcc (Behrens et al., 2008; Niv et al., 2012); but see also (Stenner et al., 2015; Wimmer et al., 2014). RPEs are thought to be a critical signal conveyed by dopaminergic neurons (Bayer and Glimcher, 2005; Hart et al., 2014) that guide action selection in probabilistic learning tasks (Pessiglione et al., 2006; Hart et al., 2014; Rolls et al., 2008) like the TAB. Although our winning computational model, a Bayesian observer model, does not use RPEs, we may expect the brain to, nonetheless, track RPEs as the discrepancy between outcomes observed and outcomes predicted by the model (Daw et al., 2011). When investigating RPE signals in fMRI data, a common approach is to identify regions in which activity is correlated with the RPE defined as a single regressor (R(t)–Q_a(t)). However, because R and RPE are correlated (Behrens et al., 2008; Niv et al., 2012; Li et al., 2011), when using this approach the amount of variance attributed to RPE may be overestimated (Behrens et al., 2008; Guitart-Masip et al., 2012) and the identified signals can be seen as putative RPEs. For this reason, it has been suggested that the effects of R and Q need to be estimated separately and only regions showing both signals with opposite signs can be considered as conveying a canonical RPE signal (Behrens et al., 2008).
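
The correlation between R and the single RPE regressor can be illustrated with a small simulation. Everything here is a hypothetical setup, not the study's data: the expected value Q follows an arbitrary slow sinusoid rather than a fitted model, and rewards are drawn as Bernoulli outcomes with probability Q.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
n_trials = 220

# Hypothetical slowly varying expected value of the chosen option.
q = [0.5 + 0.2 * math.sin(2 * math.pi * t / n_trials) for t in range(n_trials)]
# Binary outcomes: rewarded with probability Q on each trial.
r = [1.0 if rng.random() < q_t else 0.0 for q_t in q]
# Single-regressor prediction error, R(t) - Q(t).
rpe = [r_t - q_t for r_t, q_t in zip(r, q)]

print(pearson(r, rpe))  # close to 1: R dominates the variance of R - Q
print(pearson(q, rpe))  # much weaker, negative relationship with Q
```

Because binary outcomes vary far more than the slowly changing expected value, a region that responds to reward alone would still correlate strongly with the RPE regressor, which is why the R and Q effects are estimated separately.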

Following this approach, we first defined an ROI for NAcc in each hemisphere in which BOLD was correlated with the full RPE regressor at the time of outcome (Materials and methods, GLM 2, Figure 4a, MNI peak voxel coordinates x,y,z = 14, 12, −10; k = 72; z = 7.03 and x,y,z = −14, 8, −10; k = 47; z = 6.74 with p(FWE-corrected)<0.05). From these regions, we extracted parameter estimates for reward and expected value separately as estimated in a separate GLM (Materials and methods, GLM 3). We replicated previous findings in older adults (Chowdhury et al., 2013), as we saw a significant effect of R, but no significant negative effect of Q in both ROIs (Figure 4b). Contrary to our hypothesis, we did not observe a canonical prediction error in the young sample either. Again, we observed a positive effect of R, but no significantly negative effect of Q. Note that this is not inconsistent with the result reported by Chowdhury et al. (2013), where no fMRI data were collected for the young control group. No evidence for differences between the age groups’ mean activation for R or Q was found (p>0.29). In addition, when performing a less stringent test and extracting parameter estimates from this ROI for the full RPE, defined as one regressor (R-Q), we did not observe any differences between the groups’ mean activation (p>0.45). These negative results were not dependent on using the Bayesian observer model to generate Q as they were consistent across models (Supporting figure to Figure 4). There was no indication that the lack of expected value signal in the NAcc at the group level was caused by some participants showing poor learning of expected value, as the correlation between Q in NAcc and the different measures of performance (monetary gains, efficient choices, and adaptive switches) was not significant (p>0.25).

Figure 4. — (a) Regions in NAcc in which BOLD was correlated with the full RPE regressor at the time of outcome. These were selected as candidate regions to test for canonical RPE showing both a positive effect of reward and a negative effect of Q as calculated by the Bayesian observer model. (b) Extracted parameter estimates for R and Q as calculated by the Bayesian observer model from the regions shown in Figure 4a. Although we found a strong effect of reward bilaterally, no expected-value signal was observed for either age group (p>0.10).

Figure 4—source data 1. Activation cluster in ventral striatum as defined by the winning Bayesian model, as well as parameter estimates of R and Q in left and right ventral striatum.

elife-26424-fig4-data1.zip (3.7 KB, zip)

DOI: 10.7554/eLife.26424.015

In order to assess how well the RW model and the Bayesian observer model generated predictions of the BOLD signal, we estimated two comparable GLMs including only R and Q as generated by each model as parametric modulators at the time of the outcome. We then compared the residuals of the respective GLMs on specific ROIs. The RW model generated better predictions of the BOLD signal in NAcc (paired t-test comparing residuals of the respective GLMs within functional ROIs; t(56) = 5.69, p<0.001). This is in line with the extensive literature showing putative or canonical RPEs as being encoded in NAcc (Daw et al., 2011; McClure et al., 2003; O'Doherty et al., 2003), because the RW model used RPEs to learn the value of actions. On the other hand, the Bayesian observer model generated better predictions of the BOLD signal in the vmPFC when Q as generated by each model was included as a parametric modulator at the time of choice (paired t-test comparing residuals of the respective GLMs across all voxels in the respective vmPFC ROIs; t(56) = 5.62, p<0.001).

Relationship to D1 DA BP

We also investigated the relationship among DA D1 BP, age, brain function and performance. We collected PET data using [¹¹C]SCH23390 radiotracer that allows DA D1 BP to be measured across the whole brain. We calculated D1 BP in seven a priori ROIs. The selected ROIs were dlPFC, vlPFC, OFC, and vmPFC in cortex, and putamen, caudate and NAcc in striatum in each hemisphere. BP values were calculated and averaged across hemispheres. The selected regions were chosen based on their relevance to our task, as they have previously been reported to be important for various cognitive processes, ranging from value learning and reward sensitivity to working memory and cognitive flexibility (see SI for details). Younger participants had higher values for binding potentials in all ROIs considered (p<0.001 in all seven ROIs, Figure 1). BP in none of the ROIs was correlated with any measure of performance or any of the model parameters after controlling for multiple comparisons (p>0.02; adjusted threshold when controlling for 42 comparisons: p=0.001, Supplementary file 2a). D1 BP among ROIs was highly correlated after controlling for age (r(53) = 0.411–0.911, p<0.001, Supplementary file 2b).

The group difference in value signals in PFC could be a result of the well-documented age-related decline in DA availability (Volkow et al., 1998; Bäckman et al., 2010). To investigate this, we performed linear regressions predicting the strength of the link between Q and BOLD in vmPFC from DA D1 BP in all PET ROIs. Because of the high correlation between age and BP in all ROIs (r(56) > 0.73, p<0.001), we first examined the relationship between BP and Q in vmPFC without controlling for age. BP in NAcc and putamen were related to Q in vmPFC after correcting for multiple comparisons (corrected threshold considering seven ROIs p=0.007; NAcc: r(56) = 0.41, p=0.002; putamen: r(56) = 0.36, p=0.006). When controlling for age as a predictor of no interest, this correlation only survived for NAcc (r(53) = 0.28, p=0.038, Figure 3e). This result was confirmed by a mediation analysis: age was a significant predictor of both BP in NAcc (r(54) = -0.78, p<0.001) and Q in vmPFC (r(55) = -0.32, p=0.016). BP in NAcc was also a significant predictor of Q in vmPFC (r(54) = 0.41, p=0.001). Age was no longer a significant predictor of Q in vmPFC after controlling for BP in NAcc (beta age = −0.01, p=0.964; beta BP in NAcc = 0.42, p=0.038). This is consistent with a full mediation of age effects on Q in vmPFC by DA D1 BP in NAcc. Further, despite the main effect of age on D1 BP in NAcc, there was no significant interaction between age group and NAcc D1 BP (F(1,52) = 1.20; p=0.279) in modelling Q in vmPFC; thus, the relationship between DA D1 BP in NAcc and Q in vmPFC did not differ between age groups.
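As a rough check on the mediation pattern described above, the first-order partial correlation formula can be applied to the rounded bivariate correlations reported in this paragraph. This is only an illustrative sketch: the function name `partial_corr` is ours, the inputs are rounded values that were computed on slightly different numbers of participants, and the result is therefore not expected to reproduce the reported statistics exactly.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Rounded bivariate correlations taken from the text above.
r_bp_q = 0.41     # D1 BP in NAcc vs. Q in vmPFC
r_age_bp = -0.78  # age vs. D1 BP in NAcc
r_age_q = -0.32   # age vs. Q in vmPFC

# BP-Q relationship with age partialled out (the text reports 0.28).
bp_given_age = partial_corr(r_bp_q, r_age_bp, r_age_q)
# Age-Q relationship with BP partialled out (the text reports a beta near zero).
age_given_bp = partial_corr(r_age_q, r_age_bp, r_bp_q)

print(round(bp_given_age, 2), round(age_given_bp, 2))
```

With these rounded inputs the BP effect remains clearly positive after removing age, whereas the age effect all but vanishes after removing BP, which is the pattern the mediation analysis describes.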

We did not find any significant relationship between the representation of Q in NAcc at outcome time and D1 BP in any of the ROIs examined (p>0.11 in bivariate correlations; p>0.13 when controlling for age).

Discussion

We used a probabilistic reward learning task along with computational modelling, PET measures of the D1 system and fMRI in healthy, cognitively high functioning younger and older participants to investigate the effects of age on value-based decision making and its modulation by DA. We showed that probabilistic reward learning was impaired in older compared to younger participants. We also showed that value anticipation in vmPFC predicted performance beyond age and was attenuated in older participants. Furthermore, the value signal in vmPFC was modulated by D1 BP in NAcc. Finally, our computational model showed that the tendency for choice perseveration can be described as aversion to the variance of the unchosen option and that, for most participants, greater subjective confidence in a previous choice promoted switches away from that choice.

Dopamine, aging and value signals

An age-related impairment in probabilistic reward learning has been widely reported (Mell et al., 2005; Eppinger et al., 2013; Chowdhury et al., 2013; Eppinger et al., 2015; Samanez-Larkin et al., 2012). The age-related deterioration of the dopaminergic system (Volkow et al., 1998) has been hypothesised to underlie age-related cognitive decline (Volkow et al., 1998; Bäckman et al., 2010). One mechanism through which DA deficits can affect probabilistic learning performance in aging is by attenuation of value signals in the brain (Halfmann et al., 2016). Anticipatory value signals are commonly reported in vmPFC (Rolls et al., 2008; Kim et al., 2011) as well as in striatum (Schönberg et al., 2007; Behrens et al., 2008) and are modulated by DA (Pessiglione et al., 2006; Chowdhury et al., 2013; Schlagenhauf et al., 2013). Additionally, RPEs detected in NAcc (Wimmer et al., 2014; Samanez-Larkin et al., 2012; Kim et al., 2011) are thought to reflect dopaminergic signals from midbrain (Bayer and Glimcher, 2005; Schultz et al., 1997), supporting optimal action selection in probabilistic reward learning (Frank et al., 2004).

We found a robust value anticipation signal in vmPFC in both age groups, which is in keeping with neuroimaging findings across a range of similar tasks (Daw et al., 2006; Wimmer et al., 2014). As expected, this signal was attenuated in the older compared with the younger sample. Furthermore, the strength of the signal predicted performance on the task beyond age and was related to D1 BP in NAcc. Our results are consistent with a full mediation of the age effects on performance by Q in vmPFC, that is, age no longer predicts performance when controlling for the strength of BOLD that reflects Q in vmPFC. The same is true for the strength of Q in vmPFC: the effect of age can be explained by lower DA D1 BP in the older age group. Note, however, that it is difficult to make inferences on mediation effects of age in a cross-sectional dataset (Lindenberger et al., 2011). To the best of our knowledge, this is a novel finding demonstrating a relationship between integrity of the mesolimbic DA system and the prefrontal value signal supporting probabilistic learning in humans. This suggests that age-related deficits in probabilistic learning may reflect DA decline blurring value anticipation in vmPFC.

It is unsurprising that anticipatory value signals have a great impact on the ability to perform the present task, considering that damage to vmPFC/medial orbitofrontal cortex (mOFC) in humans and monkeys impairs value-guided decision making (Halfmann et al., 2016; Camille et al., 2011; Noonan et al., 2010; Rudebeck and Murray, 2014; Rushworth et al., 2011). The nature of this signal is still debated (Noonan et al., 2012), as is the cross-species generalizability for prefrontal regions (Neubert et al., 2015). Some have proposed that vmPFC tracks the value of items regardless of their nature, because vmPFC activation reflects the value across a range of tasks with different reward features from money to aesthetic and social rewards (Behrens et al., 2008; Kim et al., 2011; McNamee et al., 2013; O'Doherty, 2007; Philiastides et al., 2010). Others have proposed that vmPFC performs value comparisons, because neural signals represent the value difference between alternative options (Rushworth et al., 2011; Boorman et al., 2009; Chau et al., 2014). Regardless of its exact nature, our findings show that the signal is important not only for reward learning in general but that its attenuation is linked to age-related deficits in probabilistic learning. This notion fits with previous suggestions that age-related impairment in probabilistic learning relates to deficits in PFC function (Hämmerer and Eppinger, 2012; Samanez-Larkin and Knutson, 2015). Our results show that performance in the TAB is supported by the expected value signal in the vmPFC and that the strength of this signal explains the effects of age on performance. However, considering that the TAB can be seen as a noisy reversal learning task, it is possible that differences in executive functions - such as the ability to inhibit a response to a previously rewarded option - contribute to group differences in our task (Bari and Robbins, 2013).

Value anticipation in vmPFC was modulated by D1 BP in NAcc across both age groups and when controlling for age, again showing a full mediation of age effects on vmPFC signals by DA in NAcc. This finding is in agreement with the view that gating and selection of relevant information in cortex relies on processing within corticostriatal loops (Shipp, 2017), which is modulated by DA (Reynolds and Wickens, 2002). Pharmacological evidence in humans suggests that D2 receptors have a role in modulating gating of information in working memory (Cools and D'Esposito, 2011), but experiments studying this process with selective pharmacological manipulations of the D1 system in humans are lacking. However, computational work suggests a role for striatal D1 receptors in cortico-striatal gating (Gruber et al., 2006). The value representation in vmPFC might therefore emerge through this DA-modulated iterative gating process in NAcc. Although BPs are highly correlated across ROIs, a mediation analysis was only significant for the NAcc. This is compatible with the literature on reward processing in the corticostriatal loops. The critical nodes for processing of reward information and motivation are NAcc and the mOFC, including vmPFC (Haber and Knutson, 2010). Our data suggest that good performance, based on selection of adaptive actions, relies on D1 availability in NAcc, which in turn allows for robust value anticipation in vmPFC. Note, however, that the relationship between D1 BP and performance was not significant when controlling for age, which precludes inferences about a direct role of DA on performance.

Aside from considering expected value in vmPFC, one might have hypothesised that attenuated RPEs in NAcc of older participants would account for the age-related performance deficit (Chowdhury et al., 2013), because of the connection between DA and RPEs in NAcc. This hypothesis builds on the common observation that RPE signals in NAcc are present in younger adults (McClure et al., 2003; O'Doherty et al., 2003). In contrast to this, we did not observe neural activity reflecting a canonical RPE signal in NAcc in either age group. Although we found a significant effect of reward, we did not obtain a negative effect of expected value. Note that we did not find a canonical RPE in NAcc when using the best of the RW models either. This suggests that the lack of expected value signal in NAcc is not merely caused by generating expected value with the Bayesian ideal observer model, which does not make use of RPEs to update value representations.

The lack of canonical RPE signal in NAcc could stem from the fact that we used a very stringent test for RPEs. Previous studies using the same stringent method report mixed results. Whereas some studies report significant positive effects of reward obtainment and negative effects of expected value (Behrens et al., 2008; Niv et al., 2012), others do not find this canonical signal in NAcc (Chowdhury et al., 2013; Stenner et al., 2015; Wimmer et al., 2014; Li and Daw, 2011). The conditions under which a canonical RPE can be detected may depend on task characteristics. For example, if the RPE signal is not behaviourally relevant for the task at hand it may not be encoded in the NAcc. In our case, however, RPEs are behaviourally relevant because the choice between bandits is based on fine-grained differences in their values. However, for other paradigms, the lack of behavioural relevance of RPEs could potentially explain a negative result (Stenner et al., 2015; Guitart-Masip et al., 2012). Another important aspect may be the temporal proximity of the choice cues and the outcome presentation in the task. This may hinder the dissection of opposing responses to these events with fMRI. We cannot rule out the possibility that our negative result stems from this feature of our task design and for this reason, we cannot provide conclusive evidence on the lack of canonical RPE signal in the NAcc. Our results point, however, to the need for stringent tests in future studies of the neural underpinnings of RPEs with fMRI.

The lack of canonical RPEs in older participants has already been observed using the same task (Chowdhury et al., 2013). In that study, canonical RPEs were detected in NAcc of older participants after boosting the dopaminergic system with levodopa. These findings were interpreted as evidence that older participants had deficient RPE signals in NAcc due to DA decline, and that remediating this deficit could restore the RPE signal. However, no younger comparison group was scanned to confirm that the deficient expected value signal observed in the older participants on placebo was age-specific. Chowdhury et al. (2013), nevertheless, showed that the expected value signal in NAcc is sensitive to DA manipulations. Contrary to what one might expect from these data, the relationship between expected value (Q) as predicted by the winning model and NAcc BOLD signal was not modulated by D1 BP in any ROI considered. The reason for this negative result remains unknown. In striatum, D1 receptors have lower affinity to DA than D2 receptors and their stimulation is hypothesised to be dependent on phasic changes in DA (Maia and Frank, 2011). Because RPE in NAcc is thought to reflect phasic fluctuations of DA levels (Schultz et al., 1997), one would expect that D1 receptors would be sensitive to these fluctuations. Our results do not support this view. An alternative account is that the dopaminergic modulation of BOLD signal in NAcc observed by Chowdhury et al. (2013) after administration of levodopa is related to stimulation of D2 rather than D1 receptors. Supporting this view, recent evidence suggests that D2 receptors can encode both tonic and phasic DA signals in striatum (Marcott et al., 2014).

Computational mechanisms of switch behaviour

Using computational modelling, we explored different possible influences on the trade-off between exploration and exploitation in the probabilistic reward-learning task. We considered two families of computational models: variations of a standard RL model using RPEs to learn the mean expected value of the bandits, and variations of a Bayesian ideal observer model that tracked the probability of obtaining a reward for each bandit as a beta distribution. In both model families, including a parameter that promoted forgetting of the unchosen bandit improved model fit. Similarly, including a perseveration parameter to account for the tendency to repeat choices regardless of expected value (Rutledge et al., 2009; Schönberg et al., 2007; Lau and Glimcher, 2005) improved model fit in both families. However, a Bayesian model that modulated the expected value of the unchosen option by the variance of that option outperformed any model with perseveration. Across participants, the variance of the unchosen option had a negative impact on the value of that option. This is opposite to an exploration bonus or uncertainty-based exploration term that arises in various more or less normative accounts of exploration (Dayan and Sejnowski, 1996) and has been observed in some experiments (Badre et al., 2012; Wilson et al., 2014). However, many previous studies of decision-making have also shown that variance may be penalised as a form of risk sensitivity (Symmonds et al., 2011; Payzan-LeNestour et al., 2011; d'Acremont et al., 2013), and this is a cousin of the effect that we observed. Furthermore, our model comparison showed that uncertainty aversion is a better account of the perseveration typically observed in bandit tasks (Rutledge et al., 2009; Schönberg et al., 2007) than a choice kernel. This is a novel insight into the mechanism usually referred to as perseveration and suggests that aversion to the uncertainty about the option that was not chosen previously causes a tendency to stick to one's choices. Whether perseveration observed in other paradigms can be accounted for in the same way remains unknown.

Additionally, a Bayesian model that modulated the value of the most recent choice by the relative subjective confidence in that choice outperformed all other models. Increased relative confidence about the most recent choice resulted in increased attractiveness for the other option. This implies that participants were more likely to switch away from the most recent choices as their subjective confidence in those choices increased. This may appear counterintuitive, as one would expect that increased confidence would lead to choice repetition (Vinckier et al., 2016). However, performance improved as the effect of relative confidence decreased, and those participants showing the highest performance had the reverse effect of confidence on choice. In other words, these participants’ behaviour was consistent with a negative confidence parameter rather than a positive confidence parameter, implying that increased confidence in previous choices promoted staying with the previously chosen option. One reason for the unwarranted use of confidence in the majority of participants could stem from participants perceiving the task as highly volatile. As a result, they may infer that increasing confidence in the most recent choice indicates that the unchosen option has become better than the chosen option (Behrens et al., 2007; Mathys et al., 2011). Additionally, the observed effect of κ could reflect safe exploration: if the participant is convinced they have recently chosen the best option a lot (hence their confidence), they can afford to explore the more uncertain option. These possibilities provide interesting directions for future research.

Despite the performance difference, we did not find age differences in any single model parameter, precluding any conclusions about which computational process is affected in old age. In fact, it is likely that the process underlying age differences in performance is not parametrised in the winning Bayesian model. This stands in contrast with the less accurate but simpler RW model, in which the effect of aging was consistently manifested in the learning rate (Supplementary file 1c).

Conclusions

We measured brain activity in younger and older adults performing a probabilistic learning task and found that a signal in vmPFC at the time of choice reflecting expected value was correlated with successful performance. This activity was dependent on DA availability and age, providing support for age-related prefrontal and dopaminergic alterations as candidate mechanisms for impaired probabilistic reward learning and subsequent optimal action selection commonly reported in aging. These results provide insights into the neural and behavioural underpinnings of probabilistic learning and highlight the mechanisms by which age-related dopaminergic deterioration impacts decision making.

Materials and methods

Participants

Thirty healthy older adults aged 66–75 and thirty younger adults aged 19–32 were recruited through local newspaper advertisements in Umeå, Sweden. Sample size and power were calculated based on previous studies. One was a study of DA D1 BP differences between age groups (Rieckmann et al., 2011). The authors found clear differences in DA D1 BP after testing 20 participants in each age group: Cohen’s d = 3.00 (pooled SD = 0.04) for frontal and parietal areas, Cohen’s d = 1.60 (pooled SD = 0.21) for striatal ROIs. Assuming this difference, in order to obtain 90% power on a two-tailed independent sample t-test, 10 participants were needed in each age group. Additionally, to estimate the appropriate sample size for the behavioural task, we used the previous study by Chowdhury et al. (2013), who found a behavioural difference on the same task between younger and older participants of similar age ranges: Cohen’s d = 0.57 (pooled SD = 0.99). Assuming this difference, in order to obtain 70% power on a one-tailed t-test of a behavioural difference between two samples, 30 participants were needed in each group. Higher power could not be reached, due to financial constraints posed by the cost of PET scans.
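The sample-size reasoning above can be approximated with the standard normal-approximation formula for a two-sample comparison, n per group ≈ 2((z_α + z_power)/d)². The sketch below is illustrative only: the function name `n_per_group` is ours, the significance level of 0.05 is an assumption (it is not stated in the text), and the normal approximation slightly underestimates what an exact t-based calculation would give.

```python
import math
from statistics import NormalDist

def n_per_group(d, power, alpha=0.05, two_tailed=True):
    """Approximate participants per group for a two-sample comparison.

    d is Cohen's d; alpha = 0.05 is an assumed significance level.
    Uses the normal approximation rather than the exact t distribution.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_tailed else z(1 - alpha)
    z_power = z(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Behavioural effect (d = 0.57), one-tailed test, 70% power.
n_behaviour = n_per_group(0.57, 0.70, two_tailed=False)
# Striatal D1 BP effect (d = 1.60), two-tailed test, 90% power.
n_pet = n_per_group(1.60, 0.90)

print(n_behaviour, n_pet)
```

Under these assumptions the approximation gives 29 and 9 participants per group, just below the 30 and 10 quoted above, which is the direction expected when the t distribution is replaced by a normal one.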

The health of all potential participants was assessed before recruitment by a questionnaire administered by the research nurses. The questionnaire enquired about past and present neurologic or psychiatric conditions, head trauma, diabetes mellitus, arterial hypertension that required more than two medications, addiction to alcohol or other drugs, and bad eyesight. All participants were right-handed and provided written informed consent prior to commencing the study. Ethical approval was obtained from the Regional Ethical Review Board. Participants were paid 2000 SEK (~$225) for participation and earned up to 149 additional SEK (~$17) in the two-armed bandit task (TAB). Three older participants were excluded from the TAB analysis, one due to excessive head motion during fMRI scanning, one for only ever selecting one of the two stimuli in the task, and one due to a malfunctioning button box, resulting in no recorded responses. One additional older participant did not complete the full PET scan, but this participant's fMRI and task data are still included in the analysis where possible. This resulted in a total of 57 participants for fMRI and task analysis (27 old, 30 young) and 56 participants for PET analysis (26 old, 30 young).

Procedure

Participants completed a health questionnaire via telephone prior to recruitment. All participants performed the Mini Mental State Examination (MMSE). Scores ranged from 26 to 30 in the young sample (mean = 29.4, SD = 0.97) and from 27 to 30 in the older sample (mean = 29.4, SD = 0.77), with no difference between the two (p=0.89). PET and fMRI scanning were planned 2 days apart. However, due to a technical problem with the PET scanner, 12 participants were scanned at a longer delay apart (range 4–44 days apart). On the MRI scanning day, participants completed the TAB and another unrelated task inside the MRI scanner. Participants also completed a battery of tasks outside the scanner. Only results from the TAB will be discussed here.

Two-armed bandit task

The TAB (10) was presented in Cogent 2000 (Wellcome Trust for Neuroimaging, London, UK). Figure 1a depicts a schematic representation of one TAB trial. Participants were instructed to choose the fractal stimulus they thought to be most rewarding at each trial and were informed of the changing probability of obtaining a reward for each stimulus. These probabilities varied independently from one another. Probabilities were generated using a random Gaussian walk (Daw et al., 2006). Before scanning, participants were presented with five practice trials. The same set of random Gaussian walks was used for all participants, but assignment of random walk to stimulus identity was counterbalanced across participants.
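To make the reward schedule concrete, the following sketch generates one drifting reward probability as a bounded Gaussian random walk and samples binary outcomes from it. The step size, starting value, clipping at the bounds, number of trials and the function names are all hypothetical choices for illustration; the text specifies only that the probabilities followed independent random Gaussian walks.

```python
import random

def gaussian_walk(n_trials, start=0.5, step_sd=0.05, lower=0.0, upper=1.0, seed=None):
    """Reward probability per trial, drifting by Gaussian steps.

    step_sd, start and the clipping at [lower, upper] are assumptions,
    not values taken from the task description.
    """
    rng = random.Random(seed)
    p = start
    walk = []
    for _ in range(n_trials):
        walk.append(p)
        # Take a Gaussian step and clip the result to the allowed range.
        p = min(upper, max(lower, p + rng.gauss(0.0, step_sd)))
    return walk

def sample_outcomes(walk, seed=None):
    """Draw a binary reward (1 or 0) on each trial from the walk."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for p in walk]

# Two independent walks, one per bandit (100 trials is a placeholder).
walk_a = gaussian_walk(100, seed=1)
walk_b = gaussian_walk(100, seed=2)
rewards_a = sample_outcomes(walk_a, seed=3)
```

Using the same seeds for every simulated participant mirrors the design choice of presenting one fixed set of walks to all participants.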

Computational modelling of behavioural data

We built a variety of different models which can be classified into two main families. The first includes variations on standard RL models whereby action values are learned through reward prediction errors (RPEs) using the RW updating rule. The second family of models includes variations on a Bayesian ideal observer whereby the probability distribution of obtaining a reward is updated after each outcome observation. All models, regardless of family, use a softmax rule with an inverse temperature parameter β (with β > 0) to determine the probability that the participant chooses action a:

P(a(t) = a) = \frac{\exp[β m_{a}(t)]}{\exp[β m_{0}(t)] + \exp[β m_{1}(t)]}

(1)

Here, $m_{a}(t)$ is the propensity for selecting action a. The next section lays out how $m_{a}(t)$ is defined in the models we explored.
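The softmax rule in Equation 1 can be sketched as follows for the two-action case. The function name `softmax_choice_probs` is ours, and subtracting the larger propensity before exponentiating is a standard numerical precaution rather than something described in the text; it leaves the ratio in Equation 1 unchanged.

```python
import math

def softmax_choice_probs(m0, m1, beta):
    """Return (P(a = 0), P(a = 1)) for propensities m0, m1 (Equation 1)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Shift by the larger propensity so the exponentials cannot overflow.
    shift = max(m0, m1)
    e0 = math.exp(beta * (m0 - shift))
    e1 = math.exp(beta * (m1 - shift))
    total = e0 + e1
    return e0 / total, e1 / total

# Equal propensities give chance-level choice probabilities.
p0_equal, p1_equal = softmax_choice_probs(0.5, 0.5, beta=3.0)
# A higher inverse temperature makes choices more deterministic.
p0_soft, _ = softmax_choice_probs(0.7, 0.3, beta=1.0)
p0_sharp, _ = softmax_choice_probs(0.7, 0.3, beta=10.0)
```

Larger values of β amplify differences between propensities, so the same value difference yields a stronger preference.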

Reinforcement learning models

For RL models, expected values (Q) for trial t were calculated for each action a ∈ {0,1} (corresponding to each bandit). $Q_{a(t)}(t + 1)$ is calculated according to the standard RW updating rule:

Q_{a (t)} (t + 1) = Q_{a (t)} (t) + α δ (t)

(2)

δ(t) = R(t) - Q_{a(t)}(t)

(3)

$Q_{a(t)}(t)$ is the expected value of the option $a(t)$ selected on trial t. Q for both actions was set to 0.5 at the start of the experiment. δ(t) is the difference between the received reward (R) and the expected value on trial t. R is binary, taking the value 1 on rewarded trials and 0 on unrewarded trials. α is the learning rate, with 0 < α < 1, indicating the weight given to the RPE on the current trial. A greater value for α results in faster updating of Q.
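A single step of the RW update in Equations 2 and 3 can be sketched as below. The function name `rw_update` and the list-based representation of the two Q values are our own illustrative choices.

```python
def rw_update(q, choice, reward, alpha):
    """Apply Equations 2 and 3 to the chosen action.

    q is a two-element list of expected values; returns (new_q, delta).
    """
    delta = reward - q[choice]                  # Equation 3: prediction error
    new_q = list(q)
    new_q[choice] = q[choice] + alpha * delta   # Equation 2: value update
    return new_q, delta

# Both actions start at 0.5; action 1 is chosen and rewarded.
q_start = [0.5, 0.5]
q_next, delta = rw_update(q_start, choice=1, reward=1, alpha=0.3)
```

Here the prediction error is 0.5, so with α = 0.3 the chosen value moves from 0.5 to 0.65 while the unchosen value is left untouched.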

In the simplest model, $m_{a}(t) = Q_{a}(t)$. We included an additional parameter in the definition of $m_{a}(t)$: a perseveration parameter $b$ (with $-\infty < b < \infty$), reflecting the common observation that participants tend to either repeat their choices, or avoid repetition (Rutledge et al., 2009; Schönberg et al., 2007; Lau and Glimcher, 2005). This parameter raises or lowers the expected value of a stimulus if that stimulus was also chosen on the previous trial. Thus,

m_{a} (t) = Q_{a} (t) + b χ_{a = a (t - 1)}

(4)

where a positive value of $b$ reflects a tendency to perseverate (repeat the same choice), and a negative value reflects avoiding perseveration.

We considered another definition of $m_{a}(t)$, where in addition to the perseveration parameter $b$, we considered the possibility that the unchosen stimulus may decay in value each time it is not selected by the participant. This was instantiated by the inclusion of a ‘forget’ parameter $φ$ (with $0 < φ < 1$; Barch et al., 2003), so that the $Q$ value for the unchosen option relaxes towards 0.5. Thus,

Q_{a}(t + 1) = Q_{a}(t) + φ (0.5 - Q_{a}(t)) χ_{a \neq a(t - 1)}

(5)

In this model, the value of the chosen option is updated as described in Equation 2.
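The perseveration term of Equation 4 and the forgetting rule of Equation 5 can be combined in a short sketch. The function names are ours, and we assume that the decay towards 0.5 is applied to whichever option was not chosen on the trial being updated, while the chosen option follows Equation 2.

```python
def propensities(q, prev_choice, b):
    """Equation 4: add b to the action chosen on the previous trial.

    prev_choice is None on the first trial, when no bonus applies.
    """
    return [q[a] + (b if a == prev_choice else 0.0) for a in (0, 1)]

def rw_forget_update(q, choice, reward, alpha, phi):
    """Update the chosen value (Equation 2) and decay the other (Equation 5)."""
    new_q = list(q)
    new_q[choice] = q[choice] + alpha * (reward - q[choice])
    other = 1 - choice
    # Assumption: the option not chosen on this trial relaxes towards 0.5.
    new_q[other] = q[other] + phi * (0.5 - q[other])
    return new_q

q_after = rw_forget_update([0.8, 0.2], choice=0, reward=0, alpha=0.5, phi=0.1)
m = propensities(q_after, prev_choice=0, b=0.2)
```

In this example the chosen value drops from 0.8 to 0.4 after an unrewarded choice, the unchosen value drifts from 0.2 to 0.23, and a positive b then raises the propensity of the repeated option to 0.6.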

Bayesian observer models

Choice behaviour was modelled by representing the probability of obtaining a reward for each possible action $a \in \{0,1\}$ (corresponding to each bandit) as a beta distribution

θ_{a} ~ β (θ_{a}; γ_{a}, ε_{a})

(6)

that is updated upon observation of the outcome on each trial. On any given trial, these models generate expectations about the mean probability of obtaining a reward (which we will refer to as $Q_{a} (t)$ , for consistency with the RL models) and its variance ( $V_{a} (t)$ ):

Q_{a} (t) = \frac{γ_{a}}{(γ_{a} + ε_{a})}

(7)

V_{a} (t) = \frac{γ_{a} ε_{a}}{{(γ_{a} + ε_{a})}^{2} (γ_{a} + ε_{a} + 1)}

(8)

The parameters of the beta distributions were initialised at 1 ($γ_{a} = ε_{a} = 1$). This implies that $Q_{0}(1) = Q_{1}(1) = 0.5$ and maximum variance $V_{0}(1) = V_{1}(1) = 1/12 \approx 0.083$ (from Equation 8), reflecting an expectation of reward equal to chance for both bandits and a lack of knowledge about the underlying probability distributions. After getting a reward for choosing action $a$, $γ_{a}$ is increased by 1 and both $γ_{a}$ and $ε_{a}$ are relaxed towards 1. Conversely, after reward omission, $ε_{a}$ is increased by 1 and both $γ_{a}$ and $ε_{a}$ are relaxed towards 1. Hence,

\begin{array}{lr} γ_{a(t)}(t + 1) = (1 - ω) γ_{a(t)}(t) + ω + 1; & \text{and} \\ ε_{a(t)}(t + 1) = (1 - ω) ε_{a(t)}(t) + ω; & \text{if } R(t) = 1 \end{array}

(9)

\begin{array}{lr} γ_{a(t)}(t + 1) = (1 - ω) γ_{a(t)}(t) + ω; & \text{and} \\ ε_{a(t)}(t + 1) = (1 - ω) ε_{a(t)}(t) + ω + 1; & \text{if } R(t) = 0 \end{array}

(10)

For the unchosen bandit, both $γ_{a}$ and $ε_{a}$ are relaxed towards 1:

\begin{array}{lr} γ_{1 - a(t)}(t + 1) = (1 - λ) γ_{1 - a(t)}(t) + λ; & \text{and} \\ ε_{1 - a(t)}(t + 1) = (1 - λ) ε_{1 - a(t)}(t) + λ; \end{array}

(11)

$ω$ and $λ$ are individual participants' free parameters governing how fast reward probabilities are updated ($ω$, with $0 < ω < 1$) and forgotten ($λ$, with $0 < λ < 1$). In the simplest model we considered, $ω = λ$. We also considered the possibility that updating and forgetting mechanisms occurred at different speeds, hence allowing $ω$ and $λ$ to be different.
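The updating scheme in Equations 7 to 11 can be sketched as a single trial update. The function names and the nested-list representation of the beta parameters are our own illustrative choices.

```python
def beta_mean_var(gamma, eps):
    """Mean (Equation 7) and variance (Equation 8) of the beta distribution."""
    s = gamma + eps
    return gamma / s, gamma * eps / (s ** 2 * (s + 1))

def bayes_update(params, choice, reward, omega, lam):
    """One trial of Equations 9 to 11.

    params is [[gamma_0, eps_0], [gamma_1, eps_1]]; a new list is returned.
    """
    new = [list(p) for p in params]
    g, e = params[choice]
    # Chosen bandit: relax towards 1 at rate omega, then count the outcome.
    new[choice][0] = (1 - omega) * g + omega + (1 if reward == 1 else 0)
    new[choice][1] = (1 - omega) * e + omega + (0 if reward == 1 else 1)
    other = 1 - choice
    g, e = params[other]
    # Unchosen bandit: relax towards 1 at rate lam (Equation 11).
    new[other][0] = (1 - lam) * g + lam
    new[other][1] = (1 - lam) * e + lam
    return new

start = [[1.0, 1.0], [1.0, 1.0]]
after = bayes_update(start, choice=0, reward=1, omega=0.2, lam=0.1)
q0, v0 = beta_mean_var(*after[0])
```

Starting from the initial values, a rewarded choice of bandit 0 moves its parameters to (2, 1), so its expected reward probability rises from 0.5 to 2/3, while the unchosen bandit stays at (1, 1) because it is already at the value it relaxes towards.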

As stated previously, $m_{a} (t)$ reflects the propensity of selecting action $a$ , where the simplest definition of $m_{a} (t)$ is $m_{a} (t) = Q_{a} (t)$ as defined in Equation 7, either calculated from a model with one single update parameter $(ω = λ)$ or with two separate update parameters $(ω \neq λ)$ .

We then considered a variety of possible additions to $m_{a} (t)$ which reflected various factors that might influence choice. We tested different combinations of nested models using methods of model comparison. First was choice perseveration $b χ_{a = a (t - 1)}$ just as in Equation 4.

The second potential addition concerned the variance $V_{a} (t)$ of the beta distributions for the individual bandits. In principle, since the subjects might have framed their decision as being between sticking and switching, there could be separate influences associated with the bandit that was or was not chosen on the previous trial. Thus, we considered two separate contributions:

υ^{chosen} V_{a}(t) χ_{a = a(t - 1)} \quad \text{and}

(12)

υ^{unchosen} V_{a} (t) χ_{a = 1 - a (t - 1)} .

(13)

If $υ^{chosen}$ or $υ^{unchosen}$ are positive, then there is a tendency to choose in favour of high variance – a form of uncertainty or exploration bonus.
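The two variance contributions in Equations 12 and 13 can be added to the propensities as in the sketch below. The function name and argument layout are ours; `q` and `v` stand for the means and variances from Equations 7 and 8.

```python
def propensities_with_variance(q, v, prev_choice, nu_chosen, nu_unchosen):
    """Add the variance terms of Equations 12 and 13 to Q.

    prev_choice is None on the first trial, when neither term applies.
    """
    m = []
    for a in (0, 1):
        if prev_choice is None:
            bonus = 0.0
        elif a == prev_choice:
            bonus = nu_chosen * v[a]      # Equation 12
        else:
            bonus = nu_unchosen * v[a]    # Equation 13
        m.append(q[a] + bonus)
    return m

# A negative weight on the unchosen option's variance penalises uncertainty.
m_var = propensities_with_variance(
    q=[0.6, 0.5], v=[0.02, 0.08], prev_choice=0, nu_chosen=0.0, nu_unchosen=-2.0
)
```

With a negative weight on the previously unchosen option, its propensity falls from 0.5 to 0.34, which favours repeating the previous choice without any explicit perseveration parameter.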

Finally, we considered the possibility that subjective confidence that participants can calculate about the correctness of their choices might modulate choice. Based on Sanders et al. (2016), confidence (C) can be defined as:

C = P(\text{correct} \mid \text{observations}, \text{choice})

(14)

Given that our Bayesian observer model tracks subjective estimates of the mean and the variance of the probability distribution of obtaining a reward for each bandit, the probability in Equation 14 can be approximated by:

C_{1} (t) = P (θ_{1} > θ_{0}) = \int_{θ_{1} = 0}^{1} d θ_{1} β (θ_{1}; γ_{1}, ε_{1}) \int_{θ_{0} = 0}^{θ_{1}} d θ_{0} β (θ_{0}; γ_{0}, ε_{0})

(15)

C_{0} (t) = P (θ_{0} > θ_{1}) = 1 - C_{1} (t)

(16)

Given the simple relationship between these two confidences, there are various essentially equivalent ways of incorporating it into choice. We considered the relative confidence in the choice on a trial:

C^{rel} (t) = P (θ_{a (t)} > θ_{1 - a (t)}) - P (θ_{1 - a (t)} > θ_{a (t)}) = 2 P (θ_{a (t)} > θ_{1 - a (t)}) - 1

(17)

and assessed the extent to which the relative confidence on trial $t - 1$ encouraged switching on trial $t$ by adding a factor $κ C^{rel}(t - 1) χ_{a = 1 - a(t - 1)}$ to the propensity of the action that was not chosen on trial $t - 1$. Here, positive values of $κ$ make the subjects more likely to switch if they had been more confident.
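The double integral in Equation 15 has no simple closed form for non-integer parameters, so one way to evaluate it is numerically. The sketch below uses a midpoint rule on a regular grid; the grid size, the function names and the choice of integration scheme are our own assumptions and not a description of how the integral was evaluated in the study. It relies on both beta parameters being at least 1, as they are under Equations 9 to 11, so that the densities are finite on the grid.

```python
import math

def beta_pdf(x, g, e):
    """Density of a beta distribution with parameters g and e at 0 < x < 1."""
    log_norm = math.lgamma(g + e) - math.lgamma(g) - math.lgamma(e)
    return math.exp(log_norm + (g - 1) * math.log(x) + (e - 1) * math.log(1 - x))

def prob_greater(g1, e1, g0, e0, n=2000):
    """Midpoint-rule approximation of P(theta_1 > theta_0), Equation 15."""
    h = 1.0 / n
    cdf0 = 0.0   # probability mass of theta_0 below the current cell
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        mass0 = beta_pdf(x, g0, e0) * h
        # CDF of theta_0 at the cell midpoint: earlier cells plus half this one.
        total += beta_pdf(x, g1, e1) * h * (cdf0 + 0.5 * mass0)
        cdf0 += mass0
    return total

def relative_confidence(params, choice):
    """Equation 17 for params = [[gamma_0, eps_0], [gamma_1, eps_1]]."""
    g_c, e_c = params[choice]
    g_o, e_o = params[1 - choice]
    return 2.0 * prob_greater(g_c, e_c, g_o, e_o) - 1.0

# After one rewarded choice of bandit 0 from the initial state.
c_rel = relative_confidence([[2.0, 1.0], [1.0, 1.0]], choice=0)
```

In this example the chosen bandit has parameters (2, 1) and the other (1, 1), for which the integral can be done by hand: P = 2/3, so the relative confidence is 1/3.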

Model fitting and comparison

Model parameters were fitted using an expectation-maximisation approach (Guitart-Masip et al., 2012; Huys et al., 2011). We used a Laplacian approximation to obtain maximum a posteriori estimates for the parameters for each participant iteratively, starting with flat priors. After an iteration, the resulting group mean posterior and variance for each parameter were used as priors in the next iteration. This method was used to prevent the individuals’ parameters from taking on extreme values.

Models were compared using the integrated Bayesian Information Criterion (iBIC) (Guitart-Masip et al., 2012; Huys et al., 2011), where smaller iBIC values indicate a model that fits the data better after penalising for the number of parameters. Comparing iBIC values is akin to a likelihood ratio test.
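The penalised comparison can be sketched in a few lines. This is a minimal illustration and not the code in the supplementary scripts: it assumes the iBIC takes the form described by Huys et al. (2011), in which each participant's evidence is approximated by averaging the likelihood over parameter samples drawn from the fitted group prior, and the penalty is the number of group-level parameters times the log of the total number of choices. All function names and numbers are hypothetical.

```python
import math


def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    peak = max(log_values)
    mean_scaled = sum(math.exp(v - peak) for v in log_values) / len(log_values)
    return peak + math.log(mean_scaled)


def integrated_bic(sample_log_liks, n_hyperparameters, n_choices):
    """Integrated BIC under the assumptions stated above.

    sample_log_liks: one list per participant; entry k is the log-likelihood
        of that participant's choices under the k-th parameter sample drawn
        from the group prior.
    n_hyperparameters: number of fitted group-level parameters (assumed to be
        a mean and a variance for each model parameter).
    n_choices: total number of choices made by all participants.
    """
    log_evidence = sum(log_mean_exp(ll) for ll in sample_log_liks)
    return -2.0 * log_evidence + n_hyperparameters * math.log(n_choices)


# Hypothetical log-likelihoods for two participants, for illustration only.
samples = [[-10.0, -11.0], [-12.0, -12.5]]
ibic_small = integrated_bic(samples, n_hyperparameters=4, n_choices=200)
ibic_large = integrated_bic(samples, n_hyperparameters=8, n_choices=200)
print(ibic_small < ibic_large)  # equal fit, fewer parameters: smaller iBIC
```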

Statistical analysis of behaviour and brain variables

We calculated a number of behavioural measures: (1) the total monetary gains in Swedish Crowns (SEK), (2) percentage of efficient choices (the proportion of choices in which participants chose the option that was most likely to be rewarded according to the random Gaussian walks), (3) number of switches between bandits, and (4) percentage of adaptive switches, defined as switches to subjectively better bandits (according to the winning model) versus switches to subjectively worse bandits. Based on previously reported observations of impaired probabilistic reward learning performance in old age (Eppinger et al., 2011; Mell et al., 2005), we hypothesised that the older group mean would be lower than the young group mean, and therefore used one-tailed independent-samples t-tests to assess group differences in task performance. Non-parametric two-tailed Mann-Whitney tests for independent samples were used to assess group differences in model parameters and other variables that were non-normally distributed. Regular two-tailed two-sample t-tests were used elsewhere. Pearson's correlations were used to analyse the data further, controlling for age and model fit (defined as the participant’s log likelihood) where appropriate. Statistical analyses were performed in SPSS 22 and R 3.3.0.
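A correlation that controls for covariates such as age and model fit is a partial correlation. The sketch below illustrates the idea with the standard recursive formula; it is not the SPSS or R routine used in the study, and the function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, covariates):
    """Correlation between x and y after controlling for each covariate.

    Uses the recursion
        r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)),
    applied once per covariate.
    """
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[-1], covariates[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Toy data: x and y are related only through the shared covariate z.
z = [1, 1, 1, 1, -1, -1, -1, -1]
noise_x = [1, 1, -1, -1, 1, 1, -1, -1]
noise_y = [1, -1, 1, -1, 1, -1, 1, -1]
x = [a + b for a, b in zip(z, noise_x)]
y = [a + b for a, b in zip(z, noise_y)]
print(pearson(x, y), partial_corr(x, y, [z]))
```

In the toy data the raw correlation between `x` and `y` is driven entirely by `z`, so it vanishes once `z` is controlled for.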

MRI acquisition

Brain imaging data were acquired on a 3.0 T MR scanner (GE Medical Systems). T1-weighted 3D-SPGR images were acquired using a single-echo sequence (voxel size: 0.5 × 0.5 × 1 mm, TE = 3.20 ms, flip angle = 12 deg). Functional images were acquired using a T2*-sensitive gradient echo sequence (voxel size: 2 × 2 × 4 mm, TE = 30.0 ms, TR = 2000 ms, flip angle = 80 deg), and contained 37 slices of 3.4 mm thickness, with a 0.5 mm gap between slices. Volume acquisition occurred in an interleaved fashion. 330 volumes were obtained for each of the two functional runs. During acquisition of fMRI time series, heart rate and respiratory data were collected using a breathing belt and a pulse oximeter.

MR analysis

fMRI analysis was performed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm/software/spm8/). Preprocessing steps included (in this order): slice-time correction, realignment, coregistration to the T1-weighted image, movement correction using ArtRepair (see below), and normalisation to MNI space using a diffeomorphic registration algorithm (DARTEL) as implemented in SPM (Ashburner, 2007), with a spatial resolution after normalisation of 2 × 2 × 2 mm. Data were smoothed with a final Gaussian kernel equivalent to a standard 8 mm kernel. This kernel was achieved in two steps, including the ArtRepair motion correction (see below). The fMRI time series data were high-pass filtered with a cut-off of 128 s, and whitened with an AR(1) model. For each participant, the canonical hemodynamic response function was used to compute their statistical model.

The movement parameters showed that 15 participants moved >3 mm in any direction during functional runs. To correct for movement artefacts, we used the ArtRepair toolbox (Mazaika et al., 2005; Levy and Wagner, 2011). ArtRepair assesses the amount of motion between volume acquisitions from the mean intensity plot and linearly interpolates scans in which motion over a user-specified threshold is present. We set our threshold to the recommended value of 1.5% deviation of the mean intensity between scans. The average number of interpolated scans for our participants was 12.2 (1.8%) (SD = 19.6 (3.0%)) and one participant was excluded for showing movement >1.0 mm in >25% of scans, in line with default recommendations. ArtRepair requires smoothing of the individual subject data with a Gaussian smoothing kernel of 4 mm. A Gaussian kernel of 7 mm was then used after the normalisation to MNI space, resulting in a smoothed, normalised image equivalent to a more standard 8 mm smoothed normalised image.
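The equivalence of the two-step smoothing follows from a standard property of Gaussian kernels: applying two Gaussians in succession is the same as applying a single Gaussian whose variance is the sum of the two, so the widths combine in quadrature. As a worked check for the 4 mm and 7 mm kernels described above:

$$\mathrm{FWHM}_{\mathrm{total}} = \sqrt{4^{2} + 7^{2}}\ \mathrm{mm} = \sqrt{65}\ \mathrm{mm} \approx 8.06\ \mathrm{mm}$$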

We estimated three first-level models, in order to address the different goals of the study. GLM 1 was set up to study how value anticipation is represented in the brain, how this representation differs between age groups, and how it relates to task performance and DA D1 receptor density. GLM 2 and 3 were set up to investigate the differences in the expression of the RPE signal at the time of the outcome in the old and young sample and its relation to task performance and DA D1 receptor density as measured by PET. Note that our winning computational model does not use RPEs. However, because our Bayesian observer model generates value expectations, we may expect the brain to, nonetheless, track RPEs as the discrepancy between observed outcomes and outcomes predicted by the model. All GLMs (described in detail below) included a regressor specifying the time of choice and one specifying the time of outcome. These were parametrically modulated by various regressors that were calculated based on the winning computational model and the group posterior parameter means. These regressors are mean-centered by default (Mumford et al., 2015). The SPM motion regressors were also added to the design matrix as regressors of no interest, as well as 18 parameters correcting for physiological noise as recorded by a heartbeat detector and breathing belt during the scanning sessions. These were calculated using the PhysIO toolbox version r671 (https://www.tnu.ethz.ch/en/software/tapas.html).

GLM 1: Because the choice and outcome are close in time in each trial (maximum 3 s apart), including Q as a parametric modulator at both time points would result in highly correlated regressors. Therefore, to investigate brain activity reflecting value anticipation, we estimated a model that included Q at the time of the choice. R was included at the time of the outcome as a regressor of no interest. For each participant, we calculated a contrast image weighting the parametric modulators of interest (Q at choice) by 1. At the second level, we used this contrast image to perform a one-sample t-test across age groups. The second-level map was produced with a family-wise error (FWE) corrected threshold at p<0.05 and parameter estimates for Q were extracted from relevant surviving clusters to investigate the relationship between the signal, task performance and DA.

GLM 2 (putative RPE): When investigating RPE signals, a common approach is to identify regions in which activity is correlated with the RPE, defined as R(t)-Q_a(t), included as a single regressor in the GLM (Eppinger et al., 2013; Schönberg et al., 2007; McClure et al., 2003). Because R and RPE are correlated (Behrens et al., 2008; Niv et al., 2012; Li and Daw, 2011), when using this approach the amount of variance attributed to RPE may be overestimated and the identified signals can be seen as putative RPEs. For this reason, it has been suggested that the effects of R and Q need to be estimated separately and only regions showing both signals can be considered as conveying a canonical RPE signal. In order to identify regions potentially conveying a canonical RPE signal, we first identified regions conveying a putative RPE signal, by setting up a first-level GLM including the putative RPE regressor (R(t)-Q_a(t)) as a single parametric modulator at the time of outcome presentation. For each participant, we calculated a contrast image weighting this parametric modulator by 1. At the second level, we used these contrast images to perform a one-sample t-test across age groups. All second-level maps were produced with a family-wise error (FWE) corrected threshold at p<0.05. The bilateral NAcc, commonly reported to respond to RPEs, was identified in this analysis and used as functional ROIs for further analysis. To constrain these ROIs, we used the conjunction of the functional ROIs and the anatomical NAcc masks found in the PickAtlas (https://www.nitrc.org/projects/wfu_pickatlas/).
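The putative RPE regressor, and the reason it is hard to separate from R, can be illustrated with a small example. The sketch below is not taken from the study's scripts: the trial values are invented and the function names are hypothetical. It builds the modulator R(t)-Q_a(t), mean-centres it, and reports its correlation with R.

```python
import math


def putative_rpe(rewards, chosen_values):
    """Trial-wise putative RPE: R(t) - Q_a(t)."""
    return [r - q for r, q in zip(rewards, chosen_values)]


def mean_centre(values):
    """Subtract the mean, as is done for parametric modulators."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    cx, cy = mean_centre(x), mean_centre(y)
    sxy = sum(a * b for a, b in zip(cx, cy))
    return sxy / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))


# Invented values for five trials, for illustration only.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]        # R(t): outcome on each trial
chosen_values = [0.5, 0.6, 0.4, 0.7, 0.3]  # Q_a(t): expected value of the chosen bandit

rpe = putative_rpe(rewards, chosen_values)
modulator = mean_centre(rpe)
print(modulator)
print(pearson(rewards, rpe))  # close to 1 here: R dominates the RPE regressor
```

Because binary outcomes vary far more across trials than the expected values do, a region that responds to R alone would also correlate strongly with this regressor, which motivates estimating R and Q separately.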

GLM 3: To quantify the separate RPE components, we performed another first-level analysis in which R and Q were included as two independent parametric modulators at the time of the outcome in the design matrix. For each participant, we calculated a contrast image weighting these two independent parametric modulators by 1. Parameter estimates for R and Q were extracted from these contrast maps using the ROIs defined in the second-level analysis described in GLM 2 and were further analysed to look for a canonical RPE signal.

Time course extraction

The aim of this analysis was to visualise the effect of variables of interest on the BOLD signal, at the time of the choice and at the time of the outcome. Time courses of BOLD data from specified ROIs were extracted from the preprocessed, normalised EPI images. This BOLD signal was upsampled to one measurement every 200 ms. The upsampled time series was then divided into chunks of 15 s, corresponding to individual trials. Stimulus onset occurred at 0 s, choice between 0 and 2 s, and outcome at 3 s. A general linear model including the regressors of interest was estimated at each time point in each trial for each participant. In these models, the regressors of interest were allowed to compete for variance. At each time point, group mean effect sizes and standard errors were calculated and plotted separately for young and old.
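The upsampling and epoching steps can be sketched as follows. This is an illustration rather than the supplementary MATLAB script: linear interpolation is an assumption (the interpolation scheme is not stated above), and the function names and the toy signal are hypothetical.

```python
def upsample_linear(signal, tr_s, dt_s):
    """Linearly interpolate a signal sampled every tr_s seconds onto a dt_s grid.

    Assumption: tr_s is an integer multiple of dt_s (e.g. 2 s and 0.2 s).
    """
    factor = round(tr_s / dt_s)
    if abs(factor * dt_s - tr_s) > 1e-9:
        raise ValueError("tr_s must be an integer multiple of dt_s")
    upsampled = []
    for k in range((len(signal) - 1) * factor + 1):
        lower, step = divmod(k, factor)
        upper = min(lower + 1, len(signal) - 1)
        weight = step / factor
        upsampled.append(signal[lower] * (1.0 - weight) + signal[upper] * weight)
    return upsampled


def epoch(signal, onsets_s, dt_s, window_s):
    """Cut fixed-length windows starting at each trial onset.

    Windows that would run past the end of the signal are dropped.
    """
    n_samples = round(window_s / dt_s)
    epochs = []
    for onset in onsets_s:
        start = round(onset / dt_s)
        if start + n_samples <= len(signal):
            epochs.append(signal[start:start + n_samples])
    return epochs


# Toy ROI signal whose value equals its acquisition time (TR = 2 s, 20 volumes).
bold = [2.0 * i for i in range(20)]
fine = upsample_linear(bold, tr_s=2.0, dt_s=0.2)
trials = epoch(fine, onsets_s=[0.0, 10.0, 30.0], dt_s=0.2, window_s=15.0)
print(len(fine), len(trials), len(trials[0]))
```

With the epochs stacked trial by trial, a regression of the signal on the regressors of interest can then be run separately at each of the within-trial time points.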

PET image acquisition

PET images were acquired in 3D mode using a Discovery 690 PET/CT (General Electric, WI, US), at the Department of Nuclear Medicine, Norrland’s University Hospital. A low-dose helical CT scan (20 mA, 120 kV, 0.8 s/revolution) provided data for PET attenuation correction. Participants were injected with a bolus of 200 MBq [11C]SCH 23390. A 55-min dynamic acquisition commenced at time of injection (9 frames × 2 min, 3 frames × 3 min, 3 frames × 4.20 min, 3 frames × 5 min). Attenuation- and decay-corrected 256 × 256 pixel transaxial PET images were reconstructed to a 25 cm field-of-view employing the Sharp IR algorithm (6 iterations, 24 subsets, 3.0 mm Gaussian post filter). Sharp IR is an advanced version of the OSEM method for improving spatial resolution, in which detector system responses are included (Ross and Stearns, 2010). The full-width at half-maximum (FWHM) resolution is below 3 mm. The protocol resulted in 47 tomographic slices per time frame, yielding 0.977 × 0.977 × 3.27 mm³ voxels. Images were decay-corrected to the start of the scan. Images were de-identified using dicom2usb (http://dicom-port.com/). To minimise head movement during the imaging session, each participant’s head was fixated with an individually fitted thermoplastic mask (Positocasts Thermoplastic; CIVCO medical solutions, IA, USA).

PET analysis

PET data were analysed in a standard ROI-based protocol. This type of analysis requires a priori hypotheses about the regional specificity of dopaminergic modulation of observed behavioural or neuronal effects. All analyses were done with the use of in-house developed software (imlook4d version 3.5, https://dicom-port.com/product/imlook4d/).

Regions of interest for the ROI analysis were dorsolateral PFC (dlPFC), ventrolateral PFC (vlPFC), orbitofrontal cortex (OFC), and vmPFC in cortex, and putamen, caudate and NAcc in striatum across hemispheres. These regions were chosen based on their relevance to our task: dlPFC has previously been demonstrated to be involved in executive processes, working memory (WM) and cognitive flexibility (Barch et al., 2003; D'Esposito et al., 1995; Petrides, 2000; Plakke and Romanski, 2016), whereas vlPFC is thought to be important for goal-directed action and attention (Levy and Wagner, 2011). vmPFC has been shown to be responsive to reward magnitude and reward probability in a large number of studies (Rushworth et al., 2008). In addition, vmPFC and OFC are active during anticipation of rewards (Kim et al., 2011). Many connections exist between these regions and ventral striatum (VS), an important node in the mesolimbic dopamine system (Rushworth et al., 2011; Haber and Knutson, 2010; Salamone and Correa, 2012). VS consists of NAcc, and parts of the medial caudate nucleus and rostral putamen. Because of its connections with prefrontal areas relevant to this task, and because striatum is densely innervated by dopaminergic neurons, we segmented the different parts of striatum to use as separate ROIs. The cerebellum was segmented to be used as reference tissue because it is devoid of DA D1 receptors (Hall et al., 1994). FreeSurfer's recon-all function (Desikan et al., 2006) was used to segment the brain into cortical ROIs, and FSL's FIRST algorithm (Patenaude et al., 2011) was used to segment subcortical structures.

In order to obtain ROI BP values, the PET time series were first coregistered to the individual T1-weighted images and ROI images. The average time activity curves (TACs) were extracted across all voxels within each ROI, and binding potential (BP) was calculated by applying the Logan method (Logan et al., 1990) as implemented in imlook4d. This method was applied to each ROI using the cerebellum as reference tissue. BP values for all ROIs were averaged across hemispheres. We then investigated the relationship between DA D1 BP in the different ROIs and the Q signal in NAcc and vmPFC while controlling for age and model fit.
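The logic of a reference-tissue graphical analysis can be sketched as follows. This is a deliberately simplified illustration and not the imlook4d implementation: it assumes the simplified reference-region form of the Logan plot (without the reference-tissue efflux term), in which the late-time slope of the integrated ROI curve against the integrated reference curve, both normalised by the ROI activity, estimates the distribution volume ratio (DVR), and BP is taken as DVR minus 1. The function names, the start time of the linear fit and the curves are hypothetical.

```python
def cumulative_trapezoid(times, values):
    """Running integral from t = 0, assuming zero activity at t = 0."""
    total = 0.5 * times[0] * values[0]  # triangle from the origin to the first sample
    integrals = [total]
    for i in range(1, len(times)):
        total += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
        integrals.append(total)
    return integrals


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def logan_reference_bp(times, roi_tac, ref_tac, t_star):
    """BP = DVR - 1 from a simplified reference-tissue Logan plot.

    times: mid-frame times (min); roi_tac, ref_tac: time activity curves;
    t_star: time after which the plot is assumed to be linear (hypothetical).
    """
    int_roi = cumulative_trapezoid(times, roi_tac)
    int_ref = cumulative_trapezoid(times, ref_tac)
    x, y = [], []
    for t, activity, i_roi, i_ref in zip(times, roi_tac, int_roi, int_ref):
        if t >= t_star and activity > 0:
            x.append(i_ref / activity)
            y.append(i_roi / activity)
    return ols_slope(x, y) - 1.0


# Synthetic curves: the ROI holds exactly twice the reference activity,
# so the plot has slope 2 and the function returns a BP of 1.
times = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
reference = [1.0, 3.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5]
roi = [2.0 * c for c in reference]
print(logan_reference_bp(times, roi, reference, t_star=5.0))
```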

Acknowledgements

We thank Mats Erikson and Kajsa Burström for collecting the data.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Lieke de Boer, Email: liekelotte@gmail.com.

Marc Guitart-Masip, Email: marc.guitart-masip@ki.se.

Wolfram Schultz, University of Cambridge, United Kingdom.

Funding Information

This paper was supported by the following grants:

Vetenskapsrådet VR521-2013-2589 to Marc Guitart-Masip.
Gatsby Charitable Foundation to Peter Dayan.
Humboldt Research Award to Lars Bäckman.
af Jochnick Foundation to Lars Bäckman.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing.

Data curation, Software, Formal analysis, Writing—review and editing.

Conceptualization, Resources, Project administration, Writing—review and editing.

Conceptualization, Resources, Supervision, Investigation, Methodology, Project administration, Writing—review and editing.

Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing—review and editing.

Conceptualization, Resources, Supervision, Investigation, Methodology, Writing—review and editing.

Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing.

Ethics

Human subjects: Ethical approval was obtained from the Umeå Ethical Review Board, identifier DNR 2014-251-31M. All participants provided written informed consent prior to commencing the study.

Additional files

Source code 1. Computational modelling.

Scripts needed for the entire modelling routine used in the behavioural analysis. See the comments in the file fit_all_models_eLife.m for more details on each of the models and the procedure.

elife-26424-code1.zip (44KB, zip)

DOI: 10.7554/eLife.26424.016

Source code 2. fMRI analysis.

All MATLAB scripts required to set up preprocessing of fMRI data, create regressors for fMRI analysis, run the first level analysis and the second level analysis.

elife-26424-code2.zip (8.2KB, zip)

DOI: 10.7554/eLife.26424.017

Source code 3. PET analysis.

Scripts required to run the segmentation of T1 images, PET analysis and estimation of BPs for the different ROIs.

elife-26424-code3.zip (4.3KB, zip)

DOI: 10.7554/eLife.26424.018

Source code 4. Figures.

R script for ggplot for Figures 1b, 3b, d and e and 4b.

elife-26424-code4.r (15.6KB, r)

DOI: 10.7554/eLife.26424.019

Source code 5. Figure 2.

MATLAB script that creates joint probability distributions shown in Figure 2.

elife-26424-code5.m (1KB, m)

DOI: 10.7554/eLife.26424.020

Source code 6. Timecourse extraction.

MATLAB script that extracts the timecourse for expected value from vmPFC for young and old separately.

elife-26424-code6.m (14.1KB, m)

DOI: 10.7554/eLife.26424.021

Supplementary file 1. (A) Correlation coefficients between model parameters and performance.

Coefficients in italics represent significant correlations at p<0.05. Coefficients in bold represent significant correlations at p<0.002 (adjusted Bonferroni-corrected threshold). (B) Variance in number of switches as explained by the strongest RW model and winning model. When explaining the number of switches from the individual model parameters, the parameters that weighted V (υ), C_rel (κ) and forgetting rate ($λ$), in addition to the softmax temperature parameter (β), were found to be significant predictors. Age or other model predictors did not contribute significantly. This regression model explained the number of switches better than the RW model parameters, where only the perseveration parameter b and softmax temperature parameter β were significant predictors of number of switches. (C) Young participants have a higher learning rate in the winning Rescorla-Wagner model according to non-parametric tests. None of the other model parameters significantly differed between groups.

elife-26424-supp1.docx (14.7KB, docx)

DOI: 10.7554/eLife.26424.022

Supplementary file 2. (A) No significant correlations between model parameters and dopamine D1 receptor density in any ROI after controlling for age at Bonferroni-corrected threshold of 0.0014.

(B) Partial correlation matrix showing correlation coefficients between the binding potential in the different PET ROIs and their p-values after controlling for age.

elife-26424-supp2.docx (20.5KB, docx)

DOI: 10.7554/eLife.26424.023

Supplementary file 3. Coordinates of clusters responsive to Q at the time of choice.

elife-26424-supp3.docx (19.1KB, docx)

DOI: 10.7554/eLife.26424.024

Transparent reporting form

elife-26424-transrepform.docx (243KB, docx)

DOI: 10.7554/eLife.26424.025

References

Ashburner J. A fast diffeomorphic image registration algorithm. NeuroImage. 2007;38:95–113. doi: 10.1016/j.neuroimage.2007.07.007.
Badre D, Doll BB, Long NM, Frank MJ. Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron. 2012;73:595–607. doi: 10.1016/j.neuron.2011.12.025.
Balleine BW, O'Doherty JP. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology. 2010;35:48–69. doi: 10.1038/npp.2009.131.
Barch DM, Sheline YI, Csernansky JG, Snyder AZ. Working memory and prefrontal cortex dysfunction: specificity to schizophrenia compared with major depression. Biological Psychiatry. 2003;53:376–384. doi: 10.1016/S0006-3223(02)01674-8.
Bari A, Robbins TW. Inhibition and impulsivity: behavioral and neural basis of response control. Progress in Neurobiology. 2013;108:44–79. doi: 10.1016/j.pneurobio.2013.06.005.
Bayer HM, Glimcher PW. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron. 2005;47:129–141. doi: 10.1016/j.neuron.2005.05.020.
Behrens TE, Hunt LT, Woolrich MW, Rushworth MF. Associative learning of social value. Nature. 2008;456:245–249. doi: 10.1038/nature07538.
Behrens TE, Woolrich MW, Walton ME, Rushworth MF. Learning the value of information in an uncertain world. Nature Neuroscience. 2007;10:1214–1221. doi: 10.1038/nn1954.
Boorman ED, Behrens TE, Woolrich MW, Rushworth MF. How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron. 2009;62:733–743. doi: 10.1016/j.neuron.2009.05.014.
Bäckman L, Lindenberger U, Li SC, Nyberg L. Linking cognitive aging to alterations in dopamine neurotransmitter functioning: recent data and future avenues. Neuroscience & Biobehavioral Reviews. 2010;34:670–677. doi: 10.1016/j.neubiorev.2009.12.008.
Camille N, Griffiths CA, Vo K, Fellows LK, Kable JW. Ventromedial frontal lobe damage disrupts value maximization in humans. Journal of Neuroscience. 2011;31:7527–7532. doi: 10.1523/JNEUROSCI.6527-10.2011.
Chau BK, Kolling N, Hunt LT, Walton ME, Rushworth MF. A neural mechanism underlying failure of optimal choice with multiple alternatives. Nature Neuroscience. 2014;17:463–470. doi: 10.1038/nn.3649.
Chowdhury R, Guitart-Masip M, Lambert C, Dayan P, Huys Q, Düzel E, Dolan RJ. Dopamine restores reward prediction errors in old age. Nature Neuroscience. 2013;16:648–653. doi: 10.1038/nn.3364.
Cools R, D'Esposito M. Inverted-U-shaped dopamine actions on human working memory and cognitive control. Biological Psychiatry. 2011;69:e113–e125. doi: 10.1016/j.biopsych.2011.03.028.
d'Acremont M, Fornari E, Bossaerts P. Activity in inferior parietal and medial prefrontal cortex signals the accumulation of evidence in a probability learning task. PLoS Computational Biology. 2013;9:e1002895. doi: 10.1371/journal.pcbi.1002895.
D'Esposito M, Detre JA, Alsop DC, Shin RK, Atlas S, Grossman M. The neural basis of the central executive system of working memory. Nature. 1995;378:279–281. doi: 10.1038/378279a0.
Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans' choices and striatal prediction errors. Neuron. 2011;69:1204–1215. doi: 10.1016/j.neuron.2011.02.027.
Daw ND, O'Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006;441:876–879. doi: 10.1038/nature04766.
Dayan P, Sejnowski TJ. Exploration bonuses and dual control. Machine Learning. 1996;25:5–22. doi: 10.1007/BF00115298.
Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, Buckner RL, Dale AM, Maguire RP, Hyman BT, Albert MS, Killiany RJ. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006;31:968–980. doi: 10.1016/j.neuroimage.2006.01.021.
Dreher JC, Meyer-Lindenberg A, Kohn P, Berman KF. Age-related changes in midbrain dopaminergic regulation of the human reward system. PNAS. 2008;105:15106–15111. doi: 10.1073/pnas.0802127105.
Eppinger B, Heekeren HR, Li SC. Age-related prefrontal impairments implicate deficient prediction of future reward in older adults. Neurobiology of Aging. 2015;36:2380–2390. doi: 10.1016/j.neurobiolaging.2015.04.010.
Eppinger B, Hämmerer D, Li SC. Neuromodulation of reward-based learning and decision making in human aging. Annals of the New York Academy of Sciences. 2011;1235:1–17. doi: 10.1111/j.1749-6632.2011.06230.x.
Eppinger B, Schuck NW, Nystrom LE, Cohen JD. Reduced striatal responses to reward prediction errors in older compared with younger adults. Journal of Neuroscience. 2013;33:9905–9912. doi: 10.1523/JNEUROSCI.2942-12.2013.
Frank MJ, Doll BB, Oas-Terpstra J, Moreno F. The neurogenetics of exploration and exploitation: Prefrontal and striatal dopaminergic components. Nature Neuroscience. 2009;12:1062–1068. doi: 10.1038/nn.2342.
Frank MJ, Seeberger LC, O'Reilly RC. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science. 2004;306:1940–1943. doi: 10.1126/science.1102941.
Gruber AJ, Dayan P, Gutkin BS, Solla SA. Dopamine modulation in the basal ganglia locks the gate to working memory. Journal of Computational Neuroscience. 2006;20:153. doi: 10.1007/s10827-005-5705-x.
Guitart-Masip M, Huys QJ, Fuentemilla L, Dayan P, Duzel E, Dolan RJ. Go and no-go learning in reward and punishment: interactions between affect and effect. NeuroImage. 2012;62:154–166. doi: 10.1016/j.neuroimage.2012.04.024.
Haber SN, Knutson B. The reward circuit: linking primate anatomy and human imaging. Neuropsychopharmacology. 2010;35:4–26. doi: 10.1038/npp.2009.129.
Halfmann K, Hedgcock W, Kable J, Denburg NL. Individual differences in the neural signature of subjective value among older adults. Social Cognitive and Affective Neuroscience. 2016;11:1111–1120. doi: 10.1093/scan/nsv078.
Hall H, Sedvall G, Magnusson O, Kopp J, Halldin C, Farde L. Distribution of D1- and D2-dopamine receptors, and dopamine and its metabolites in the human brain. Neuropsychopharmacology. 1994;11:245–256. doi: 10.1038/sj.npp.1380111.
Hart AS, Rutledge RB, Glimcher PW, Phillips PE. Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. The Journal of Neuroscience. 2014;34:698–704. doi: 10.1523/JNEUROSCI.2489-13.2014.
Huys QJ, Cools R, Gölzer M, Friedel E, Heinz A, Dolan RJ, Dayan P. Disentangling the roles of approach, activation and valence in instrumental and pavlovian responding. PLoS Computational Biology. 2011;7:e1002028. doi: 10.1371/journal.pcbi.1002028.
Hämmerer D, Eppinger B. Dopaminergic and prefrontal contributions to reward-based learning and outcome monitoring during child development and aging. Developmental Psychology. 2012;48:862–874. doi: 10.1037/a0027342.
Jocham G, Klein TA, Ullsperger M. Dopamine-mediated reinforcement learning signals in the striatum and ventromedial prefrontal cortex underlie value-based choices. Journal of Neuroscience. 2011;31:1606–1613. doi: 10.1523/JNEUROSCI.3904-10.2011.
Kim H, Shimojo S, O'Doherty JP. Overlapping responses for the expectation of juice and money rewards in human ventromedial prefrontal cortex. Cerebral Cortex. 2011;21:769–776. doi: 10.1093/cercor/bhq145.
Lau B, Glimcher PW. Dynamic response-by-response models of matching behavior in rhesus monkeys. Journal of the Experimental Analysis of Behavior. 2005;84:555–579. doi: 10.1901/jeab.2005.110-04.
Levy BJ, Wagner AD. Cognitive control and right ventrolateral prefrontal cortex: reflexive reorienting, motor inhibition, and action updating. Annals of the New York Academy of Sciences. 2011;1224:40–62. doi: 10.1111/j.1749-6632.2011.05958.x.
Li J, Daw ND. Signals in human striatum are appropriate for policy update rather than value prediction. Journal of Neuroscience. 2011;31:5504–5511. doi: 10.1523/JNEUROSCI.6316-10.2011.
Li J, Schiller D, Schoenbaum G, Phelps EA, Daw ND. Differential roles of human striatum and amygdala in associative learning. Nature Neuroscience. 2011;14:1250–1252. doi: 10.1038/nn.2904.
Lindenberger U, von Oertzen T, Ghisletta P, Hertzog C. Cross-sectional age variance extraction: what's change got to do with it? Psychology and Aging. 2011;26:34–47. doi: 10.1037/a0020525. [DOI] [PubMed] [Google Scholar]
Logan J, Fowler JS, Volkow ND, Wolf AP, Dewey SL, Schlyer DJ, MacGregor RR, Hitzemann R, Bendriem B, Gatley SJ. Graphical analysis of reversible radioligand binding from time-activity measurements applied to [N-11C-methyl]-(-)-cocaine PET studies in human subjects. Journal of Cerebral Blood Flow and Metabolism : Official Journal of the International Society of Cerebral Blood Flow and Metabolism. 1990;10:740–747. doi: 10.1038/jcbfm.1990.127. [DOI] [PubMed] [Google Scholar]
Maia TV, Frank MJ. From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience. 2011;14:154–162. doi: 10.1038/nn.2723. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marcott PF, Mamaligas AA, Ford CP. Phasic dopamine release drives rapid activation of striatal D2-receptors. Neuron. 2014;84:164–176. doi: 10.1016/j.neuron.2014.08.058. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mathys C, Daunizeau J, Friston KJ, Stephan KE. A bayesian foundation for individual learning under uncertainty. Frontiers in Human Neuroscience. 2011;5:39. doi: 10.3389/fnhum.2011.00039. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mazaika P, Whitfield S, Cooper JC. Detection and Repair of Transient Artifacts in fMRI Data. Human Brain Mapping.2005. [Google Scholar]
McClure SM, Daw ND, Montague PR. A computational substrate for incentive salience. Trends in Neurosciences. 2003;26:423–428. doi: 10.1016/S0166-2236(03)00177-2. [DOI] [PubMed] [Google Scholar]
McNamee D, Rangel A, O'Doherty JP. Category-dependent and category-independent goal-value codes in human ventromedial prefrontal cortex. Nature Neuroscience. 2013;16:479–485. doi: 10.1038/nn.3337.
Mell T, Heekeren HR, Marschner A, Wartenburger I, Villringer A, Reischies FM. Effect of aging on stimulus-reward association learning. Neuropsychologia. 2005;43:554–563. doi: 10.1016/j.neuropsychologia.2004.07.010.
Mumford JA, Poline JB, Poldrack RA. Orthogonalization of regressors in FMRI models. PLoS One. 2015;10:e0126255. doi: 10.1371/journal.pone.0126255.
Neubert FX, Mars RB, Sallet J, Rushworth MF. Connectivity reveals relationship of brain areas for reward-guided learning and decision making in human and monkey frontal cortex. PNAS. 2015;112:E2695–E2704. doi: 10.1073/pnas.1410767112.
Niv Y, Edlund JA, Dayan P, O'Doherty JP. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. Journal of Neuroscience. 2012;32:551–562. doi: 10.1523/JNEUROSCI.5498-10.2012.
Niv Y, Montague PR. Theoretical and Empirical Studies of Learning. 2009.
Noonan MP, Kolling N, Walton ME, Rushworth MF. Re-evaluating the role of the orbitofrontal cortex in reward and reinforcement. European Journal of Neuroscience. 2012;35:997–1010. doi: 10.1111/j.1460-9568.2012.08023.x.
Noonan MP, Walton ME, Behrens TE, Sallet J, Buckley MJ, Rushworth MF. Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex. PNAS. 2010;107:20547–20552. doi: 10.1073/pnas.1012246107.
Nyberg L, Salami A, Andersson M, Eriksson J, Kalpouzos G, Kauppi K, Lind J, Pudas S, Persson J, Nilsson LG. Longitudinal evidence for diminished frontal cortex function in aging. PNAS. 2010;107:22682–22686. doi: 10.1073/pnas.1012651108.
O'Doherty JP. Lights, camembert, action! The role of human orbitofrontal cortex in encoding stimuli, rewards, and choices. Annals of the New York Academy of Sciences. 2007;1121:254–272. doi: 10.1196/annals.1401.036.
O'Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38:329–337. doi: 10.1016/S0896-6273(03)00169-7.
Patenaude B, Smith SM, Kennedy DN, Jenkinson M. A Bayesian model of shape and appearance for subcortical brain segmentation. NeuroImage. 2011;56:907–922. doi: 10.1016/j.neuroimage.2011.02.046.
Payzan-LeNestour E, Bossaerts P. Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLoS Computational Biology. 2011;7:e1001048. doi: 10.1371/journal.pcbi.1001048.
Pessiglione M, Seymour B, Flandin G, Dolan RJ, Frith CD. Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature. 2006;442:1042–1045. doi: 10.1038/nature05051.
Petrides M. The role of the mid-dorsolateral prefrontal cortex in working memory. Experimental Brain Research. 2000;133:44–54. doi: 10.1007/s002210000399.
Philiastides MG, Biele G, Heekeren HR. A mechanistic account of value computation in the human brain. PNAS. 2010;107:9430–9435. doi: 10.1073/pnas.1001732107.
Plakke B, Romanski LM. Neural circuits in auditory and audiovisual memory. Brain Research. 2016;1640:278–288. doi: 10.1016/j.brainres.2015.11.042.
Raz N, Lindenberger U, Rodrigue KM, Kennedy KM, Head D, Williamson A, Dahle C, Gerstorf D, Acker JD. Regional brain changes in aging healthy adults: general trends, individual differences and modifiers. Cerebral Cortex. 2005;15:1676–1689. doi: 10.1093/cercor/bhi044.
Reynolds JN, Wickens JR. Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks. 2002;15:507–521. doi: 10.1016/S0893-6080(02)00045-X.
Rieckmann A, Karlsson S, Karlsson P, Brehmer Y, Fischer H, Farde L, Nyberg L, Bäckman L. Dopamine D1 receptor associations within and between dopaminergic pathways in younger and elderly adults: links to cognitive performance. Cerebral Cortex. 2011;21:2023–2032. doi: 10.1093/cercor/bhq266.
Rolls ET, McCabe C, Redoute J. Expected value, reward outcome, and temporal difference error representations in a probabilistic decision task. Cerebral Cortex. 2008;18:652–663. doi: 10.1093/cercor/bhm097.
Ross S, Stearns C. SharpIR: White paper [Internet] [9, January 2017];2010 http://www3.gehealthcare.co.uk/~/media/downloads/uk/education/pet%20white%20papers/mi_emea_sharpir_white_paper_pdf_092010_doc0852276.pdf?Parent=%7BB66C9E27-1C45-4F6B-BE27-D2351D449B19%7D
Rudebeck PH, Murray EA. The orbitofrontal oracle: cortical mechanisms for the prediction and evaluation of specific behavioral outcomes. Neuron. 2014;84:1143–1156. doi: 10.1016/j.neuron.2014.10.049.
Rushworth MF, Behrens TE. Choice, uncertainty and value in prefrontal and cingulate cortex. Nature Neuroscience. 2008;11:389–397. doi: 10.1038/nn2066.
Rushworth MF, Noonan MP, Boorman ED, Walton ME, Behrens TE. Frontal cortex and reward-guided learning and decision-making. Neuron. 2011;70:1054–1069. doi: 10.1016/j.neuron.2011.05.014.
Rutledge RB, Lazzaro SC, Lau B, Myers CE, Gluck MA, Glimcher PW. Dopaminergic drugs modulate learning rates and perseveration in Parkinson's patients in a dynamic foraging task. Journal of Neuroscience. 2009;29:15104–15114. doi: 10.1523/JNEUROSCI.3524-09.2009.
Salamone JD, Correa M. The mysterious motivational functions of mesolimbic dopamine. Neuron. 2012;76:470–485. doi: 10.1016/j.neuron.2012.10.021.
Samanez-Larkin GR, Knutson B. Decision making in the ageing brain: changes in affective and motivational circuits. Nature Reviews Neuroscience. 2015;16:278–289. doi: 10.1038/nrn3917.
Samanez-Larkin GR, Levens SM, Perry LM, Dougherty RF, Knutson B. Frontostriatal white matter integrity mediates adult age differences in probabilistic reward learning. Journal of Neuroscience. 2012;32:5333–5337. doi: 10.1523/JNEUROSCI.5756-11.2012.
Samanez-Larkin GR, Worthy DA, Mata R, McClure SM, Knutson B. Adult age differences in frontostriatal representation of prediction error but not reward outcome. Cognitive, Affective, & Behavioral Neuroscience. 2014;14:672–682. doi: 10.3758/s13415-014-0297-4.
Sanders JI, Hangya B, Kepecs A. Signatures of a Statistical Computation in the Human Sense of Confidence. Neuron. 2016;90:499–506. doi: 10.1016/j.neuron.2016.03.025.
Schlagenhauf F, Rapp MA, Huys QJ, Beck A, Wüstenberg T, Deserno L, Buchholz HG, Kalbitzer J, Buchert R, Bauer M, Kienast T, Cumming P, Plotkin M, Kumakura Y, Grace AA, Dolan RJ, Heinz A. Ventral striatal prediction error signaling is associated with dopamine synthesis capacity and fluid intelligence. Human Brain Mapping. 2013;34:1490–1499. doi: 10.1002/hbm.22000.
Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593–1599. doi: 10.1126/science.275.5306.1593.
Schönberg T, Daw ND, Joel D, O'Doherty JP. Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. Journal of Neuroscience. 2007;27:12860–12867. doi: 10.1523/JNEUROSCI.2496-07.2007.
Shipp S. The functional logic of corticostriatal connections. Brain Structure and Function. 2017;222:669–706. doi: 10.1007/s00429-016-1250-9.
Stenner MP, Rutledge RB, Zaehle T, Schmitt FC, Kopitzki K, Kowski AB, Voges J, Heinze HJ, Dolan RJ. No unified reward prediction error in local field potentials from the human nucleus accumbens: evidence from epilepsy patients. Journal of Neurophysiology. 2015;114:781–792. doi: 10.1152/jn.00260.2015.
Sutton RS, Barto AG. Reinforcement learning: an introduction. [Internet] [17, October 2016];1998 http://site.ebrary.com/id/10225275
Symmonds M, Wright ND, Bach DR, Dolan RJ. Deconstructing risk: separable encoding of variance and skewness in the brain. NeuroImage. 2011;58:1139–1149. doi: 10.1016/j.neuroimage.2011.06.087.
Vinckier F, Gaillard R, Palminteri S, Rigoux L, Salvador A, Fornito A, Adapa R, Krebs MO, Pessiglione M, Fletcher PC. Confidence and psychosis: a neuro-computational account of contingency learning disruption by NMDA blockade. Molecular Psychiatry. 2016;21:946–955. doi: 10.1038/mp.2015.73.
Volkow ND, Gur RC, Wang GJ, Fowler JS, Moberg PJ, Ding YS, Hitzemann R, Smith G, Logan J. Association between decline in brain dopamine activity with age and cognitive and motor impairment in healthy individuals. The American Journal of Psychiatry. 1998;155:344–349. doi: 10.1176/ajp.155.3.344.
Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General. 2014;143:2074–2081. doi: 10.1037/a0038199.
Wimmer GE, Braun EK, Daw ND, Shohamy D. Episodic memory encoding interferes with reward learning and decreases striatal prediction errors. Journal of Neuroscience. 2014;34:14901–14912. doi: 10.1523/JNEUROSCI.0204-14.2014.

eLife. doi: 10.7554/eLife.26424.026

Decision letter

Editor: Wolfram Schultz¹

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Dopaminergic, neural and computational contributions to probabilistic reward learning in old age" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Sabine Kastner as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

The attenuation of probabilistic reward learning in older human participants was accompanied by reduced value signals in prefrontal cortex.

Essential revisions:

As you will see from the reviewers' comments, which are backed by similar concerns of the Reviewing Editor, the paper is far too complicated as it stands. I would suggest seriously reducing unnecessary analyses, and focusing and streamlining the paper, and its text, on the essentials related to the age of the participants. There is nothing wrong with removing parts of the data and/or analysis if that would lead to a much clearer message (note that the Abstract already is tough to read with too many distinct details). It should also be discussed why the usual reward prediction error signals were not found in the ventral striatum (this is not necessarily a bad thing, in particular when a stringent analysis has been applied as here, but readers want to know why).

We will need to send the paper back to the reviewers, but given their substantial difficulties with reading and commenting on the complex text, we can do this only once. This one-revision-only is also general policy of the Journal, and we will need to adhere to it given the complexity of the report.

Please reply with a succinct, simple, point-by-point text to the reviewers' comments. And please be aware of a general policy at eLife that we do not permit several rounds of revisions. Thus, we sincerely hope that you will be able to successfully revise your manuscript with the next round.

Reviewer #1:

This study examined the behavioral and neural bases of age differences in probabilistic reward learning, using fMRI and PET. A group of young and older adults performed a simple instrumental reward learning task (two-armed bandit) in an fMRI experiment. On each trial, participants chose between two cues, whose reward probabilities changed in Gaussian random-walk processes. DA D1 binding potential (BP) was also assessed in several brain regions. Young adults made more money and more efficient choices. The authors compared two families of behavioral models, one based on reinforcement learning through reward prediction errors (RPEs), and the other on a Bayesian observer, where reward probability is updated after each outcome. The best model was a Bayesian one, which included reward-probability updating for both the chosen and unchosen options, the variance of the option not chosen on the previous trial, and decision confidence. None of the model parameters differed between the young and older adult groups, but using the model's generated expected values the authors show that young adults made more "adaptive switches" – switches to the option of the higher value. The winning model was used in the analysis of the fMRI data. This analysis revealed stronger representation of the expected value of the chosen option in young compared to older adults in several brain areas. In the vmPFC, this parameter predicted earnings, and accounted for age differences in earnings. DA D1 BP in NAcc was correlated with the value parameter in vmPFC, but not in NAcc, and accounted for age differences in that parameter. The authors then searched for RPE signals. They did not settle for a simple correlation with RPE, but rather required separate correlations with expected and obtained reward, with opposite signs. This analysis did not identify RPE signals in NAcc in either young or older adults, but the authors show that a reinforcement-learning model fits the BOLD data there better than the Bayesian model. Finally, the authors also report activation in several areas that are related to decision confidence or to switches.
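
For readers unfamiliar with the task and the reinforcement-learning model family summarised in this review, the following is a minimal Python sketch, not the authors' implementation. It assumes independent, bounded Gaussian random walks for the two reward probabilities and a Rescorla-Wagner learner with a softmax choice rule. The step size `sd`, the bounds, the learning rate `alpha` and the inverse temperature `beta` are illustrative values chosen here; only the trial count of 220 comes from the author response.

```python
import numpy as np

def random_walk_probs(n_trials, sd=0.05, lo=0.1, hi=0.9, seed=0):
    """Reward probabilities of two options drifting as bounded Gaussian random walks.

    sd, lo and hi are assumed values, not taken from the manuscript.
    """
    rng = np.random.default_rng(seed)
    p = np.empty((n_trials, 2))
    p[0] = 0.5
    for t in range(1, n_trials):
        p[t] = np.clip(p[t - 1] + rng.normal(0.0, sd, size=2), lo, hi)
    return p

def simulate_rw_agent(p, alpha=0.3, beta=5.0, seed=1):
    """Rescorla-Wagner learner with softmax choice (alpha and beta are assumed)."""
    rng = np.random.default_rng(seed)
    n_trials = p.shape[0]
    q = np.zeros(2)                          # expected value of each option
    choices = np.empty(n_trials, dtype=int)
    rewards = np.empty(n_trials)
    rpes = np.empty(n_trials)
    for t in range(n_trials):
        # Two-option softmax reduces to a logistic function of the value difference.
        p_choose_1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        c = int(rng.random() < p_choose_1)
        r = float(rng.random() < p[t, c])    # binary reward
        delta = r - q[c]                     # reward prediction error
        q[c] += alpha * delta                # update the chosen option only
        choices[t], rewards[t], rpes[t] = c, r, delta
    return choices, rewards, rpes

probs = random_walk_probs(220)
choices, rewards, rpes = simulate_rw_agent(probs)
print(f"rewarded trials: {int(rewards.sum())} of {len(rewards)}")
```

The Bayesian observer family differs in that it tracks a distribution over each reward probability rather than a point estimate; that variant is not sketched here.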

This is an interesting study, which asks an important question. There is evidence for reduced reinforcement learning in aging, but the neural basis of this reduction is not clear. The paper has many strengths, including the use of computational modeling and model comparison, the combination of fMRI and DA D1 BP within subject, and the careful neural analysis. I have relatively minor comments, which are detailed below.

– In its current form the paper is somewhat difficult to follow. There are many questions and analyses, so the writing should be very clear in order to take the reader through the entire story. Sometimes the authors assume knowledge that a wide audience may not necessarily have. For example, it will be helpful if there is a brief description of the task either at the end of the Introduction or the beginning of the Results section. Then perhaps some overview of the main questions and the general analysis strategy. Next, the model descriptions should be clarified – the Materials and methods section provides detailed information, but it will be helpful if the brief description in the Results section is clearer – especially important is the definition of all the parameters in each model. Keeping all the parts of the PET analysis together will also be helpful. Finally, it seems that the main finding is the relationships between vmPFC activity and behavior and between NAcc DA BP and vmPFC activity. In both of these analyses, the correlation with age disappears when the physiological variable is taken into account. This is very interesting and should be clearly stated.

– It was a bit confusing to me that no parameter of the winning model reflected the age-related difference in behavior (unlike the RW model). It seems that the main point of using computational modeling is to uncover latent variables that affect behavior, but cannot be directly observed, in order to understand the differences in computations. Is it possible that the model fails to capture some latent variable that is of the most interest to this particular study? This is supported by the fact that using the switches, instead of the model parameters, yielded more informative results. The authors should clearly explain the utility of model fitting for their behavioral analysis. In particular, did the vmPFC results depend on the particular formulation of Q from the winning model?

– The lack of RPE encoding in NAcc in young adults is presented as an incidental finding, but it is of importance, independently from the aging research question. The authors are right that this may be due to their stringent criterion, but the fact that there was also no correlation with D1 BP, and that an RW model fit the activity better than the winning Bayesian model, makes me wonder if the Q estimates may be off? Also, is there an age difference if you consider the less stringent single-predictor RPE?

Reviewer #2:

In this article, the authors combine behavioral, fMRI, PET, and computational modeling approaches to understand the mechanisms of probabilistic reward learning, and how this learning changes with age. There are definitely some interesting results here. The relationship between D1 binding potential (BP) in NAcc and the neural correlate of chosen value in vmPFC seems particularly notable. However, several of the neural and computational modeling results are not yet as compelling as they could be. Below the major findings are discussed in turn; in some cases the paper might be best served by cutting certain analyses entirely, but suggestions and comments are provided nonetheless.

1) Probably the most novel and central findings concern predicted value (Q) signals in vmPFC. The correlation between Q and vmPFC activity is reduced in older adults and predicted by D1 BP in NAcc, and D1 BP fully accounts for the effect of age. The predicted value response in vmPFC also predicts performance on the task. This is an interesting set of findings. I'm aware of only one other report showing age effects on vmPFC value correlates (Halfmann et al., 2017, SCAN), but that report focused on individual differences within older adults, and the link to dopaminergic signal shown here provides a plausible mechanism for the effect. These findings could be strengthened in a few ways, however:

1a) A formal mediation analysis would further strengthen the claim that D1 BP accounts for the effects of age on value signals in vmPFC.

1b) These results depend on the parameter estimates in vmPFC extracted from the region showing a main effect of predicted value. It would be of interest to replicate these analyses in an independent ROI – e.g., the ROI from the Bartra et al., 2013 meta-analysis on subjective value. Though the age comparison is orthogonal to the original fMRI analysis, it is hard to know if and how the possible inflation of the parameter estimates might interact with other analyses such as the correlation with D1 BP.

1c) The effect of Q in vmPFC on performance in the task is significant when controlling for age and model fit, but does it hold when not controlling for these factors? I understand the need to control for age given differences in value-related vmPFC activity between the two groups, but the results without the control variables should at least be noted in the text.

1d) Given the high correlation between dopamine binding in different ROIs, theorizing of how D1 binding in the NAcc specifically could mediate vmPFC effects (Discussion section) seems somewhat premature.

1e) In these analyses, a negative correlation with predicted value is also noted in several prefrontal and parietal regions. In several previous studies, these regions have been associated with difficult choices, indexed by the absolute difference between chosen and unchosen value. Before concluding that these regions encode the inverse of chosen value, this alternative explanation would need to be ruled out.
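
The formal mediation analysis requested in point 1a can be illustrated with a product-of-coefficients estimate and a percentile bootstrap. The sketch below is a generic illustration on synthetic data, not the analysis the authors ran. The variable names (`older`, `d1_bp`, `q_signal`), the effect sizes and the noise levels are all assumptions made for the example.

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients with an intercept in position 0."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def mediation_effects(x, m, y):
    """Return (indirect effect a*b, direct effect c') for x -> m -> y."""
    a = ols_coefs(x, m)[1]                        # effect of x on the mediator
    coefs = ols_coefs(np.column_stack([m, x]), y)
    b, c_prime = coefs[1], coefs[2]               # mediator and direct paths
    return a * b, c_prime

def bootstrap_indirect_ci(x, m, y, n_boot=1000, seed=0):
    """Percentile bootstrap 95% interval for the indirect effect."""
    rng = np.random.default_rng(seed)
    n = len(x)
    est = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample participants
        est[i] = mediation_effects(x[idx], m[idx], y[idx])[0]
    return np.percentile(est, [2.5, 97.5])

# Synthetic data (assumed values): age group lowers the mediator,
# and the mediator alone drives the outcome.
rng = np.random.default_rng(0)
n = 60
older = np.repeat([0.0, 1.0], n // 2)
d1_bp = -2.0 * older + rng.normal(0.0, 0.5, n)
q_signal = 1.0 * d1_bp + rng.normal(0.0, 0.5, n)

indirect, direct = mediation_effects(older, d1_bp, q_signal)
ci_low, ci_high = bootstrap_indirect_ci(older, d1_bp, q_signal)
print(f"indirect={indirect:.2f}, direct={direct:.2f}, CI=[{ci_low:.2f}, {ci_high:.2f}]")
```

A bootstrap interval for the indirect effect that excludes zero, together with a direct effect near zero, is the pattern usually described as full mediation. With cross-sectional age groups the estimate remains correlational.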

2) Another imaging finding was that NAcc tracked received rewards, rather than reward prediction errors, in both young and old adults. This is an interesting finding and the methods here provide a nice warning about making strong conclusions about correlations with prediction error regressors, without examining responsivity to both components of the prediction error.

2a) One possibility, though, is that the lack of a prediction error signal is reflective of poor learning – i.e., subjects' expectancies are not accurate. Have the authors looked at the correlation between the representation of expectancy in NAcc and performance on the task?

2b) It looks like two different versions of Figure 4 were uploaded. Which one is correct needs to be clarified.
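
The two-component criterion discussed in point 2 follows from the definition of a prediction error, δ = r − Q: a signal that encodes δ should load positively on the obtained reward r and negatively on the expectation Q. The sketch below illustrates this on synthetic signals. The noise level, the trial count and the uniform distribution of expectations are assumptions, and a real analysis would fit these regressors to BOLD data and test each coefficient across participants.

```python
import numpy as np

def fit_reward_and_expectation(signal, reward, expected):
    """OLS of signal on intercept, reward and expectation; returns the two slopes."""
    X = np.column_stack([np.ones_like(reward), reward, expected])
    beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return beta[1], beta[2]

rng = np.random.default_rng(0)
n = 2000                                            # synthetic trials (assumed)
expected = rng.uniform(0.1, 0.9, size=n)            # expectation Q on each trial
reward = (rng.random(n) < expected).astype(float)   # binary outcome r
noise = rng.normal(0.0, 0.3, size=(2, n))

rpe_signal = reward - expected + noise[0]           # region encoding r - Q
reward_signal = reward + noise[1]                   # region encoding r only

b_rpe = fit_reward_and_expectation(rpe_signal, reward, expected)
b_rew = fit_reward_and_expectation(reward_signal, reward, expected)
print("RPE-coding signal:    reward %.2f, expectation %.2f" % b_rpe)
print("reward-coding signal: reward %.2f, expectation %.2f" % b_rew)
```

Both synthetic signals correlate positively with a single RPE regressor, because r − Q is itself correlated with r. Only the separate expectation coefficient distinguishes them.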

3) A final neural finding – which is more exploratory – is increased activity in frontoparietal brain regions on switch trials, which predicts performance in the task.

3a) Here again, the possible alternative that these regions respond to more difficult decisions (indexed, for example, by the difference in absolute value, or by reaction time), rather than switches per se, needs to be explored.

3b) In addition, the authors also show that activity in these regions is negatively related to the number of switches made by the subject, which is in turn, negatively related to performance. Does dlPFC or IPL activity predict performance after including the number of switches in the model? Without showing this, it is quite possible that the number of switches modulates both dlPFC/IPL activity and performance (i.e., the relationship between the brain and performance is driven by a third variable).

3c) While the fact that switch-related neural activity independently predicts performance when controlling for Q in vmPFC suggests that including the switch-related activity improves the predictive power of the model, a formal model comparison is needed to support this conclusion. I would like to see a formal model comparison between the following models for predicting performance: 1) age and Q in vmPFC; 2) age and switch-related activity in switch ROIs; and 3) age, Q in vmPFC, and switch-related activity in switch ROIs.
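
One way to carry out the comparison requested in point 3c is to fit the three regressions and compare them with the Bayesian information criterion (BIC). The sketch below uses the Gaussian-likelihood form of the BIC on synthetic data. The predictor names, the effect sizes and the choice of BIC over other criteria are assumptions made for illustration.

```python
import numpy as np

def gaussian_bic(X, y):
    """BIC of an ordinary least-squares fit with Gaussian errors (lower is better)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = float(np.sum((y - X1 @ beta) ** 2))
    k = X1.shape[1] + 1                  # regression coefficients plus noise variance
    return n * np.log(rss / n) + k * np.log(n)

# Synthetic participants (assumed values): both neural measures carry
# independent information about performance.
rng = np.random.default_rng(0)
n = 60
age = np.repeat([0.0, 1.0], n // 2)
q_vmpfc = rng.normal(size=n)
switch_act = rng.normal(size=n)
performance = -0.5 * age + q_vmpfc + switch_act + rng.normal(0.0, 0.3, n)

models = {
    "age + Q": np.column_stack([age, q_vmpfc]),
    "age + switch": np.column_stack([age, switch_act]),
    "age + Q + switch": np.column_stack([age, q_vmpfc, switch_act]),
}
bics = {name: gaussian_bic(X, performance) for name, X in models.items()}
best = min(bics, key=bics.get)
for name, value in bics.items():
    print(f"{name:18s} BIC = {value:7.1f}")
print("preferred model:", best)
```

Because the BIC penalises each extra predictor by ln(n), the combined model is preferred only when the additional regressor reduces the residual variance by more than that penalty.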

4) The results are not definitive on whether there are age differences in probabilistic learning and if so what the cause of these differences is.

4a) Performance differences between older and younger adults are only significant with a one-tailed t-test. This is weak evidence at best for any age effects in the task.

4b) None of the parameters in the authors' winning model differed between younger and older adults. The authors suggest that "correlated changes in the parameters may explain the age difference". However, if there were correlated changes, there should still be significant differences in the parameters – in fact, wouldn't you expect to see more significant differences? I suppose the authors could use some kind of multivariate analysis to look for age differences in model parameters, but the overall picture seems more consistent with subtle, if any, age effects.

5) There were several aspects of the computational modeling approach that were potentially problematic.

5a) In the authors' winning Bayesian model, Q values are initialized at 0.5 and the forgetting process relaxes these values back to 0.5. In the reinforcement learning models, Q values are initialized at 0 and the forgetting process relaxes these values back to 0. A fair comparison between the two classes of models would eliminate this structural difference. Though unlikely, it is possible that this aspect, rather than the details of updating, accounts for the difference in model performance between RL and Bayesian models.

5b) In the authors' winning model, switching is more likely when there is less uncertainty about the unchosen value and when there is greater relative confidence about the previous choice. These effects are counter-intuitive, and more evidence is needed for them to be convincing.

In the case of switching when there is less uncertainty, this is the opposite of normative exploration, the notion of an "exploration bonus." Wilson et al., 2014, for example, found evidence for directed exploration. Why do the authors think they see the opposite here?

5c) That previous trial relative confidence predicts switching is also surprising. It is hard to see how this could be a good feature for learning to have under general conditions, which makes me wonder if it is a byproduct of some particular aspect of the current task. For example, this behavior could be adaptive if there is a negative correlation between the values of the two options or a tendency for values to reverse over time. If this result is more of a byproduct of the task than a general phenomenon, then I would worry about making too much of this finding.

5d) In both of these cases, it would be very informative if the authors could identify the features of task performance that these two aspects of the model explain. This would increase confidence in the empirical finding, beyond the simple model comparisons. It might also provide further insight and modeling ideas; perhaps once the nature of switching behavior in this task is better understood the authors will discover that it can be even better explained by adding different, less counter-intuitive, features to the model.

5e) The potential interaction between the uncertainty and confidence effects needs to be examined. It would make sense that uncertainty about the unchosen value and relative confidence are negatively correlated, given that the former is one of the inputs needed to calculate the latter. This would seem to complicate any interpretation of the weights on these parameters when both are in the model. At a minimum, the authors should report the goodness of fit statistics and parameter weights for a model where only the relative confidence term, and not the uncertainty term, is included in the model.

5f) The authors refer to the effect of relative confidence as a "grass is greener" effect, but I do not think this analogy captures the effect accurately at all. For example, I could imagine also referring to an effect in the exact opposite direction (more switching when confidence is lower) as a "grass is greener" effect, so obviously the analogy is doing no work, and perhaps obscuring rather than enlightening.
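
The structural difference raised in point 5a can be made concrete with a generic forgetting rule in which a value relaxes a fraction φ of the way toward a baseline on each step. The sketch below is an assumed, simplified form and does not reproduce the manuscript's equations; the function names and parameter values are hypothetical. It shows that the baseline can be exposed as a single argument, so both model classes could be compared with the same initialisation and relaxation point.

```python
def relax_toward_baseline(q, baseline, phi):
    """Move q a fraction phi (0 <= phi <= 1) of the way back toward baseline."""
    return q + phi * (baseline - q)

def decay_trajectory(q0, baseline, phi, n_steps):
    """Values of an unchosen option over n_steps trials without new outcomes."""
    traj = [q0]
    for _ in range(n_steps):
        traj.append(relax_toward_baseline(traj[-1], baseline, phi))
    return traj

# Same starting value and forgetting rate, two different baselines.
to_half = decay_trajectory(0.9, baseline=0.5, phi=0.2, n_steps=50)
to_zero = decay_trajectory(0.9, baseline=0.0, phi=0.2, n_steps=50)
print(f"after 50 steps: baseline 0.5 -> {to_half[-1]:.3f}, baseline 0 -> {to_zero[-1]:.3f}")
```

With a baseline of 0, a long-unchosen option looks worthless, whereas with a baseline of 0.5 it looks like an unknown coin flip. That difference can change predicted switching independently of how outcomes are used for updating.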

Reviewer #3:

In this work, De Boer and colleagues examined the effects of age and dopamine (D1 receptor availability measured using PET) on the neural mechanisms underlying probabilistic reward learning (explored using fMRI). They isolated two main processes contributing to choice performance: 1) learned estimates of option values, 2) switching behavioral strategy. The first process was notably expressed in vmPFC activity and declined with age, this decline being related to nucleus accumbens dopamine. The second process was underpinned by a frontoparietal activity and was independent of age and dopamine.

The question is not really novel. In particular the last author (Marc Guitart-Masip) contributed to a Nature Neuroscience paper that already established the dopamine-dependency of age-related decline in reward learning. However, this new study brings further insights that help to refine our understanding of this phenomenon. Besides, the study has several strengths: it gathers a large dataset (60 participants) including behavioral, PET and fMRI data and takes a sophisticated analytical approach using computational modeling. Overall, I think this paper would nicely contribute to unraveling the determinants of reward learning in humans. Unfortunately, the number of different analyses and results sort of obscure the reading and dilute the main findings. My main suggestion would be to streamline the analysis so the results description would have a clearer structure.

In that regard, it would help to remove the Bayesian model, which does not seem to bring much to the main conclusions, unless I missed something. I appreciate the amount of effort that the authors must have invested in this modeling work, but I am not convinced it makes sense to keep this model and related analyses of brain activity. My reasons are 1) there is no principle justifying that participants should switch when confidence in the chosen option is high (I suspect this comes from correlation between parameters), 2) when comparison is fair (models without the confidence add-ons) the BIC of Rescorla-Wagner and Bayesian models are similar (compare third lines in Table 1), 3) unlike the RW model, the Bayesian model does not capture the difference in behavioral performance between young and older people, 4) the variables specific to the Bayesian model have only weak links with brain activity, contrary to the RW model-based predictions, on which main conclusions are built.

Besides, I have some other concerns:

– As far as I understand, a unique random walk was used to generate reward probabilities for all participants. From the plot in Figure 1 it looks like a noisy reversal, which raises the issue of possible age-related deficits in reversal per se, and of the anti-correlation between cues that may induce the belief that outcomes inform on both cues (subjects might normalize the two option values). These possibilities should be discussed.

– The difference in vmPFC value signal could artificially come from the difference in learning performance. This is because the variance of the value regressor in the GLM used to fit fMRI data depends on how much subjects learn about option values (no learning gives a flat regressor), unless regressors are z-scored (I could not find this info in the Materials and methods). This issue needs to be carefully addressed.

– The absence of (negative) correlation with expectation at outcome onset is interesting given the debate about prediction error encoding in the striatum. Yet I am unsure of how the authors interpret this. Is this an artifact from the design (cue and outcome onsets being too close in time), is it that true prediction errors are encoded in other brain regions, or is it that the brain does not encode prediction error at all? Perhaps the authors could clarify their position on this issue in the Discussion.
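
The concern about regressor variance raised above, namely that a participant who learns little yields a nearly flat value regressor, is commonly handled by z-scoring the parametric modulator within each participant. The sketch below illustrates that step only. The simulated value ranges are assumptions, and it is not known from this letter whether the authors' GLM used this normalisation.

```python
import numpy as np

def zscore_modulator(values, eps=1e-12):
    """Z-score a trial-wise modulator; a constant series maps to all zeros."""
    values = np.asarray(values, dtype=float)
    sd = values.std()
    if sd < eps:
        return np.zeros_like(values)     # no variance, so nothing to model
    return (values - values.mean()) / sd

rng = np.random.default_rng(0)
good_learner_q = rng.uniform(0.2, 0.8, size=220)    # wide range of learned values
poor_learner_q = rng.uniform(0.45, 0.55, size=220)  # values barely move
flat_q = np.full(220, 0.5)                          # no learning at all

for name, q in [("good", good_learner_q), ("poor", poor_learner_q), ("flat", flat_q)]:
    z = zscore_modulator(q)
    print(f"{name}: raw sd = {q.std():.3f}, z-scored sd = {z.std():.3f}")
```

After z-scoring, the fitted coefficient is expressed per standard deviation of each participant's own value estimates rather than per unit of value, which changes what a group difference in that coefficient means.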

eLife. 2017 Sep 5;6:e26424. doi: 10.7554/eLife.26424.027

Author response

Reviewer #1:

[…] I have relatively minor comments, which are detailed below.

– In its current form the paper is somewhat difficult to follow. There are many questions and analyses, so the writing should be very clear in order to take the reader through the entire story. Sometimes the authors assume knowledge that a wide audience may not necessarily have.

We apologise for the confusion caused by the amount of information included in the paper. We thank the reviewer for providing very helpful structuring comments. We have done our best to streamline the paper by cutting out some of the fMRI analyses, and address the points made by the reviewer, one by one, below:

For example, it will be helpful if there is a brief description of the task either at the end of the Introduction or the beginning of the Results section.

We have added a paragraph in the Introduction that explains the TAB in more detail:

“In brief, all participants performed 220 trials on the TAB (Figure 1A). […] Participants received monetary earnings of 1 Swedish Krona (SEK, ~$0.11) per rewarded trial.”

Then perhaps some overview of the main questions and the general analysis strategy.

We have added a paragraph at the beginning of the Results section outlining the goal of the analyses and main questions:

“The goal of the analyses was to establish the neural mechanism underlying decreased probabilistic value learning in older participants. […] To obtain the best estimate of expected value to use in our fMRI analysis, we fitted a range of computational models and used Bayesian model selection.”

Next, the model descriptions should be clarified – the Materials and methods section provides detailed information, but it will be helpful if the brief description in the Results section is clearer – especially important is the definition of all the parameters in each model.

We have clarified this section in the Results, which currently reads as follows:

“The first determinant of switching was the current variance (V) of the option that was not chosen on the previous trial calculated from its approximate β distribution (Figure 2; formula 8, Materials and methods). […] Hence, increased uncertainty about the previously unchosen option caused most subjects to stick to their current choice.”

And:

“The second determinant of switching was a measure of the relative confidence in the choice that was made on the previous trial (see Materials and methods). […] Thus, if κ was positive, then a subject would be more likely to switch on trial t if she had been more confident on trial t-1.”

We have also added a figure to clarify the components of the model (new figure 2).

Keeping all the parts of the PET analysis together will also be helpful.

We have restructured the Results section to first present the behavioural and computational modelling results, then the analysis of vmPFC activity, then RPEs in the NAcc, and last the PET results. The results concerning neural correlates of confidence and switch behaviour are no longer part of the manuscript.

Finally, it seems that the main finding is the relationships between vmPFC activity and behavior and between NAcc DA BP and vmPFC activity. In both of these analyses, the correlation with age disappears when the physiological variable is taken into account. This is very interesting and should be clearly stated.

We have emphasised this in the Results by presenting the result as a mediation analysis. We added:

“The parameter estimate for Q in vmPFC was positively related to total monetary gains (r(53)=0.37, p=0.006, controlling for age and model fit in a partial correlation). […] This is consistent with a full mediation of age effects on performance by Q in vmPFC. Note however, that it is difficult to make inferences on mediation effects of age in a cross-sectional dataset (1).”

And:

“This result was confirmed by a mediation analysis: […] This is consistent with a full mediation of age effects on Q in vmPFC by DA D1 BP in NAcc.”

Finally, we point this result out again in the Discussion:

“Our results are consistent with a full mediation of the age effects on performance by Q in vmPFC, that is, age no longer predicts performance when controlling for the strength of BOLD that reflects Q in vmPFC. The same is true for the strength of Q in vmPFC: the effect of age can be explained by lower DA D1 BP in the older age group. Note however, that it is difficult to make inferences on mediation effects of age in a cross-sectional dataset (1).”

– It was a bit confusing to me that no parameter of the winning model reflected the age-related difference in behavior (unlike the RW model). It seems that the main point of using computational modeling is to uncover latent variables that affect behavior, but cannot be directly observed, in order to understand the differences in computations. Is it possible that the model fails to capture some latent variable that is of the most interest to this particular study? This is supported by the fact that using the switches, instead of the model parameters, yielded more informative results. The authors should clearly explain the utility of model fitting for their behavioral analysis. In particular, did the vmPFC results depend on the particular formulation of Q from the winning model?

We agree that the lack of difference in model parameters is unsettling. However, we believe that the Bayesian model provides the best account of our data. We set out to do computational modelling with three aims in mind. First, we aimed to uncover behavioural mechanisms on this particular task, regardless of age. We were successful in this regard by demonstrating that this model provides a better account of choices, as indicated by Bayesian model comparison. Further, the model uncovers previously unclear contributions of uncertainty and confidence to choice on this task. For example, the model uncovers that perseveration is better accounted for by uncertainty aversion than by a choice kernel as previously thought. Our second goal was to generate predictors of neural responses. In this regard, we also observed that the Bayesian model provides a better predictor of expected value in the vmPFC, improving our ability to make inferences about the relationship between BOLD signal in this region and dopamine D1 receptor availability (see below). We showed this by building two equivalent GLMs for fMRI analysis: they both include reward at the time of outcome, and expected value (as calculated by each respective model) at the time of choice. A paired t-test between the residuals of both GLMs demonstrates that the estimates for expected value from the Bayesian model predict BOLD in vmPFC more accurately than expected value estimates from the RW model. Our third goal was to understand group differences in behaviour. In this respect, we were not successful in the sense that the Bayesian model failed to capture the group difference we observed in behaviour (that was captured by the RW model). We state this limitation in the Discussion (also see reviewer 2, points 4b and 6k, and reviewer 3, point 3):

In fact, it is likely that the process underlying age differences in performance is not parametrised in the winning Bayesian model.

We add the observation that the Bayesian model generates better predictions of BOLD in vmPFC at choice in the Results section along with the previously reported observation that the RW model generates better predictions of BOLD in NAcc at outcome (Also see reviewer 2 point 4b and reviewer 3 point 3):

“On the other hand, the Bayesian observer model generates better predictions of the BOLD signal in the vmPFC when Q as generated by each model was included as a parametric modulator at the time of choice (paired t-test comparing residuals of the respective GLM models across all voxels in the respective vmPFC ROIs; t(56)=5.62, p<0.001).”
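The residual comparison quoted above is a paired t-test. A minimal sketch of that statistic follows, using made-up residual values for five hypothetical participants; the function name and the numbers are illustrative assumptions, not the authors' data or code.

```python
import math
import statistics
from typing import List, Tuple


def paired_t(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Hypothetical mean residuals per participant for two GLMs (lower = better fit).
rw_residuals = [5.0, 6.0, 7.0, 6.0, 6.0]
bayes_residuals = [4.0, 4.0, 4.0, 4.0, 4.0]

t_value, dof = paired_t(rw_residuals, bayes_residuals)
print(t_value, dof)
```

The degrees of freedom equal the number of pairs minus one, which matches the reported t(56) for 57 participants.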

Our vmPFC result is not dependent on the choice of model. When expected value is estimated using the RW model, the group difference in anticipatory expected value in the vmPFC and its relationship to performance remains unchanged. We have added this result:

“The results were not dependent on the use of the Bayesian model to estimate Q values (when using the RW model Q estimates; when including both age and Q, β age = -0.20, -0.22, -0.21, p=0.111, 0.093, 0.104; β Q in vmPFC = 0.33, 0.28, 0.26, p=0.010, p=0.030, p=0.047 for monetary gains, efficient choices and adaptive switches, respectively).”

The relationship between this activity and DA D1 BP is still significant, but does not survive correction for age. Interestingly, the relationship between age and anticipatory activity in vmPFC does not survive the inclusion of DA D1 either. This does not allow for the disentanglement of age and DA D1 BP as contributors to vmPFC activity when using the RW model. However, since the BOLD activity in vmPFC is better accounted for by the Bayesian model, which is also a better model in terms of Bayesian model comparison, we see this as a good reason to keep the Bayesian model as a predictor of brain activity. Note that despite the RW generating better predictions for BOLD signal in the NAcc, the reported lack of RPE is not dependent on which model is used.

– The lack of RPE encoding in NAcc in young adults is presented as an incidental finding, but it is of importance, independently from the aging research question. The authors are right that this may be due to their stringent criterion, but the fact that there was also no correlation with D1 BP, and that an RW model fit the activity better than the winning Bayesian model, makes me wonder if the Q estimates may be off? Also, is there an age difference if you consider the less stringent single-predictor RPE?

There is no significant difference between the signal for Q in NAcc from the GLM where Q is estimated from the Bayesian model compared to the Rescorla-Wagner model (paired t-test, p>0.165 for both sides). In addition, there is no difference between the age groups when the standard more liberal analysis (using a single RPE regressor) is used and regardless of whether expected value is estimated using the RW model or the Bayesian model (p>0.45 in all comparisons). We have added this negative result in the Results section:

“In addition, when performing a less stringent test and extracting parameter estimates from this ROI for the full RPE, defined as one regressor (R-Q), we did not observe any differences between the groups’ mean activation (p>0.45).”
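The difference between the stringent and the single-regressor test can be sketched with a toy regression. A signal that encodes a full prediction error R-Q should load positively on R and negatively on Q when both are entered separately, whereas a reward-only signal loads on R alone. The trial values below are invented for illustration, and the helper is a plain two-predictor least-squares fit rather than an fMRI GLM.

```python
from typing import List, Tuple


def fit_two_regressors(y: List[float], x1: List[float],
                       x2: List[float]) -> Tuple[float, float]:
    """Least-squares slopes for y ~ intercept + b1*x1 + b2*x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    yc = [v - my for v in y]
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, yc))
    s2y = sum(a * b for a, b in zip(c2, yc))
    det = s11 * s22 - s12 * s12  # non-zero when x1 and x2 are not collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Hypothetical trials: reward R (0/1) and expected value Q.
reward = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
expected = [0.5, 0.6, 0.4, 0.7, 0.3, 0.5, 0.8, 0.2]

reward_only_signal = list(reward)
full_rpe_signal = [r - q for r, q in zip(reward, expected)]

print(fit_two_regressors(reward_only_signal, reward, expected))
print(fit_two_regressors(full_rpe_signal, reward, expected))
```

A single R-Q regressor would correlate with both toy signals, which is why entering R and Q separately is the more stringent test.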

Reviewer #2:

In this article, the authors combine behavioral, fMRI, PET, and computational modeling approaches to understand the mechanisms of probabilistic reward learning, and how this learning changes with age. There are definitely some interesting results here. The relationship between D1 binding potential (BP) in NAcc and the neural correlate of chosen value in vmPFC seems particularly notable. However, several of the neural and computational modeling results are not yet as compelling as they could be. Below the major findings are discussed in turn; in some cases the paper might be best served by cutting certain analyses entirely, but suggestions and comments are provided nonetheless.

We thank reviewer 2 for their insightful comments and for the very detailed comments on our manuscript. We apologise for any lack of clarity in the original version, and believe the quality of our manuscript has greatly improved thanks to these comments.

1) Probably the most novel and central findings concern predicted value (Q) signals in vmPFC. The correlation between Q and vmPFC activity is reduced in older adults and predicted by D1 BP in NAcc, and D1 BP fully accounts for the effect of age. The predicted value response in vmPFC also predicts performance on the task. This is an interesting set of findings. I'm aware of only one other report showing age effects on vmPFC value correlates (Halfmann et al., 2017, SCAN), but that report focused on individual differences within older adults, and the link to dopaminergic signal shown here provides a plausible mechanism for the effect. These findings could be strengthened in a few ways, however:

1a) A formal mediation analysis would further strengthen the claim that D1 BP accounts for the effects of age on value signals in vmPFC.

We have now presented this result as a formal mediation analysis (Also see reviewer 1, first concern, last bullet point).

“This result was confirmed by a mediation analysis: Age was a significant predictor of both BP in NAcc (r(54)=-0.78, p<0.001) and Q in vmPFC (r(55)=-0.32, p=0.016). BP in NAcc was also a significant predictor of Q in vmPFC (r(54)=0.41, p=0.001). Age was no longer a significant predictor of Q in vmPFC after controlling for BP in NAcc (β age=-0.01, p=0.964; β BP in NAcc=0.42, p=0.038).”
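The logic of this mediation result can be checked approximately from the three reported correlations alone, using the textbook formula for standardized coefficients in a two-predictor regression. This is only a sketch: the correlations are rounded and were computed on slightly different sample sizes, so exact agreement with the reported β values is not expected, and the function name is an illustrative assumption.

```python
from typing import Tuple


def standardized_betas(r_y1: float, r_y2: float, r_12: float) -> Tuple[float, float]:
    """Standardized slopes for y ~ x1 + x2, given the pairwise correlations."""
    denom = 1.0 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


# Correlations as reported in the quoted passage.
r_age_bp = -0.78   # age vs DA D1 BP in NAcc
r_age_q = -0.32    # age vs Q in vmPFC
r_bp_q = 0.41      # DA D1 BP in NAcc vs Q in vmPFC

beta_age, beta_bp = standardized_betas(r_age_q, r_bp_q, r_age_bp)
print(round(beta_age, 3), round(beta_bp, 3))
```

The resulting coefficients are close to zero for age and about 0.41 for BP, in line with the reported β age=-0.01 and β BP in NAcc=0.42.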

1b) These results depend on the parameter estimates in vmPFC extracted from the region showing a main effect of predicted value. It would be of interest to replicate these analyses in an independent ROI – e.g., the ROI from the Bartra et al., 2013 meta-analysis on subjective value. Though the age comparison is orthogonal to the original fMRI analysis, it is hard to know if and how the possible inflation of the parameter estimates might interact with other analyses such as the correlation with D1 BP.

We acknowledge that it is possible that these values are inflated. Therefore, we performed the suggested analysis, on the ROI resulting from their five-way conjunction analysis (Figure 9 in said paper), carrying a monotonic, modality-independent subjective value signal. This analysis demonstrated that the relationship between Q in the vmPFC and DA D1 BP indeed survived when using the suggested ROI: partial correlation between activity in the Bartra 2013 ROI and DA D1 BP: r(53)=0.318, p=0.018 (controlled for age and model fit, also survives without controlling for these variables). The correlation between Q in vmPFC and performance also survived when using the suggested ROI: partial correlation between activity in the Bartra 2013 ROI and monetary gains: r=0.337, p=0.011 (controlling for age and model fit, also survives without controlling for these).

1c) The effect of Q in vmPFC on performance in the task is significant when controlling for age and model fit, but does it hold when not controlling for these factors? I understand the need to control for age given differences in value-related vmPFC activity between the two groups, but the results without the control variables should at least be noted in the text.

The relationship between these variables holds when not controlling for these factors (r=0.46, p<0.001). We have added this in the text:

“The parameter estimate for Q in vmPFC was positively related to total monetary gains (r(53)=0.37, p=0.006, controlling for age and model fit in a partial correlation). This correlation remained significant without controlling for age, model fit or both.”

1d) Given the high correlation between dopamine binding in different ROIs, theorizing of how D1 binding in the NAcc specifically could mediate vmPFC effects (Discussion section) seems somewhat premature.

We thank the reviewer for pointing this out. It is correct that BP in the different ROIs are correlated and we added a note on that in the Discussion.

However, BP in NAcc is the only measure for which a mediation analysis is significant. This is consistent with the literature on reward processing in the corticostriatal loops. We therefore think this is important to discuss.

“Although BPs are highly correlated across ROIs, a mediation analysis was only significant for the NAcc. This is compatible with the literature on reward processing in the corticostriatal loops.”

1e) In these analyses, a negative correlation with predicted value is also noted in several prefrontal and parietal regions. In several previous studies, these regions have been associated with difficult choices, indexed by the absolute difference between chosen and unchosen value. Before concluding that these regions encode the inverse of chosen value, this alternative explanation would need to be ruled out.

This is indeed a very valid point. For the sake of clarity and brevity, we have taken out these analyses.

2) Another imaging finding was that NAcc tracked received rewards, rather than reward prediction errors, in both young and old adults. This is an interesting finding and the methods here provide a nice warning about making strong conclusions about correlations with prediction error regressors, without examining responsivity to both components of the prediction error.

2a) One possibility, though, is that the lack of a prediction error signal is reflective of poor learning – i.e., subjects' expectancies are not accurate. Have the authors looked at the correlation between the representation of expectancy in NAcc and performance on the task?

This is a good point. The correlation between performance and Q in NAcc is not significant (p>0.25 in all correlations, with or without controlling for age). We have added this negative result in the manuscript:

“There was no indication that the lack of expected value signal in the NAcc at the group level was caused by some participants showing poor learning of expected value, as the correlation between Q in NAcc and the different measures of performance (monetary gains, effective choices, and adaptive switches) was not significant (p>0.25).”

2b) It looks like two different versions of Figure 4 were uploaded. Which one is correct needs to be clarified.

These are the RW and the Bayesian RPE parameter estimates – one of them is a supporting figure. This is now clarified in the captions and text references.

“Extracted parameter estimates for R and Q as calculated by the Bayesian observer model from the regions shown in Figure 4A. Although we found a strong effect of reward bilaterally, no expected-value signal was observed for either age group (p>0.10).”

3) A final neural finding – which is more exploratory – is increased activity in frontoparietal brain regions on switch trials, which predicts performance in the task.

We would like to thank the reviewer for the insightful comments on this section, which has led us to cut out the bulk of these analyses.

3a) Here again, the possible alternative that these regions respond to more difficult decisions (indexed, for example, by the difference in absolute value, or by reaction time), rather than switches per se, needs to be explored.

We thank the reviewer for this important point. We have performed the suggested analysis (including both RT and value difference) and, although significant clusters are still obtained in IPL and OFC, the relationship between activity in these clusters and performance has disappeared. Therefore, and for the sake of improving and streamlining the paper, we have taken out this analysis.

3b) In addition, the authors also show that activity in these regions is negatively related to the number of switches made by the subject, which is in turn, negatively related to performance. Does dlPFC or IPL activity predict performance after including the number of switches in the model? Without showing this, it is quite possible that the number of switches modulates both dlPFC/IPL activity and performance (i.e., the relationship between the brain and performance is driven by a third variable).

Yes, even when controlling for switches this activity predicts performance. However, because the relationship does not hold when including measures of difficulty in the GLM, and for the sake of streamlining the paper, we have now taken out the switch analyses.

3c) While the fact that switch-related neural activity independently predicts performance when controlling for Q in vmPFC suggests that including the switch-related activity improves the predictive power of the model, a formal model comparison is needed to support this conclusion. I would like to see a formal model comparison between the following models for predicting performance: 1) age and Q in vmPFC; 2) age and switch-related activity in switch ROIs; and 3) age, Q in vmPFC, and switch-related activity in switch ROIs.

The switch analysis on BOLD has now been removed, and only the Q in vmPFC model is presented in the paper.

4) The results are not definitive on whether there are age differences in probabilistic learning and if so what the cause of these differences is.

We respectfully disagree with the reviewer and outline our reasons for this below.

4a) Performance differences between older and younger adults are only significant with a one-tailed t-test. This is weak evidence at best for any age effects in the task.

We apologise for the lack of clarity on the performance difference. Apart from the weak age group difference on total wins, we also find age group differences in efficient choices and adaptive switches (reported in the Results and Figure 1). We had chosen total monetary gains as an indicator of performance because it is the most intuitive indicator of performance. However, the mediation of the effect of age on performance holds when the other measures of performance that we present in the paper (efficient choices, adaptive switches) are used as the outcome variable. We have added a paragraph about this in the manuscript:

“Q in vmPFC was a significant predictor of all measures of performance (bivariate correlations: total monetary gains: r(55)=0.47, p<0.001; adaptive switches: r(55)=0.39, p=0.003; efficient choices: r(55)=0.38, p=0.004). […] Note however, that it is difficult to make inferences on mediation effects of age in a cross-sectional dataset (1).”

4b) None of the parameters in the authors' winning model differed between younger and older adults. The authors suggest that "correlated changes in the parameters may explain the age difference". However, if there were correlated changes, there should still be significant differences in the parameters – in fact, wouldn't you expect to see more significant differences? I suppose the authors could use some kind of multivariate analysis to look for age differences in model parameters, but the overall picture seems more consistent with subtle, if any, age effects.

We agree that the lack of difference in model parameters is unsettling. As the reviewer suggested, we performed a multivariate analysis with the parameters of the behavioural model as independent variables, and age group as a fixed factor. This analysis did not detect any effect of age group (F=0.91, p=0.482). This negative result is now reported:

“A multivariate analysis with the model parameters as independent variables, and age group as a fixed factor, did not yield any significant predictor of age group (F=0.91, p=0.482).”

However, we believe that the Bayesian model provides the best account of our data. We set out to do computational modelling with three aims in mind. First, we aimed to uncover behavioural mechanisms on this particular task, regardless of age. We were successful in this regard by demonstrating that this model provides a better account of choices, as indicated by Bayesian model comparison. Further, the model uncovers previously unclear contributions of uncertainty and confidence to choice on this task. For example, the model uncovers that perseveration is better accounted for by uncertainty aversion than by a choice kernel as previously thought. Our second goal was to generate predictors of neural responses. In this regard, we also observed that the Bayesian model provides a better predictor of expected value in the vmPFC, improving our ability to make inferences about the relationship between BOLD signal in this region and dopamine D1 receptor availability (see below). We showed this by building two equivalent GLMs for fMRI analysis: they both include reward at the time of outcome, and expected value (as calculated by each respective model) at the time of choice. A paired t-test between the residuals of both GLMs demonstrates that the estimates for expected value from the Bayesian model predict BOLD in vmPFC more accurately than expected value estimates from the RW model. Our third goal was to understand age group differences in behaviour. In this respect, we were not successful in the sense that the Bayesian model failed to capture the age group difference we observed in behaviour (that was captured by the RW model).

5) There were several aspects of the computational modeling approach that were potentially problematic.

5a) In the authors' winning Bayesian model, Q values are initialized at 0.5 and the forgetting process relaxes these values back to 0.5. In the reinforcement learning models, Q values are initialized at 0 and the forgetting process relaxes these values back to 0. A fair comparison between the two classes of models would eliminate this structural difference. Though unlikely, it is possible that this aspect, rather than the details of updating, accounts for the difference in model performance between RL and Bayesian models.

This is a good point. We have refitted the RW model with starting values 0.5 and allowing the values to relax back to 0.5 as well. This model is better than the originally reported RW model (BIC: 10355 vs 10388) and has now been added to the manuscript in substitution of the previously reported RW (Table 1). The Materials and methods section has been updated to reflect this. However, this new RW model shows no better fit than the Bayesian model including the effect of variance of the unchosen option (BIC: 10335) or the winning full Bayesian model including both the effect of variance of the unchosen option and relative confidence (BIC: 10259) (Table 1).
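The structural change described here, starting values at 0.5 and letting them relax back to 0.5, can be sketched as a single update step. The parameter names (`alpha`, `decay`), their values, and the choice to relax only the unchosen option are assumptions made for illustration; the exact equations are in the Materials and methods and are not reproduced here.

```python
from typing import List


def rw_update(q: List[float], choice: int, reward: float,
              alpha: float, decay: float, baseline: float = 0.5) -> List[float]:
    """One Rescorla-Wagner step with relaxation of unchosen values to a baseline."""
    updated = list(q)
    for option in range(len(q)):
        if option == choice:
            # Chosen option moves toward the obtained reward.
            updated[option] += alpha * (reward - updated[option])
        else:
            # Unchosen option relaxes toward the baseline (assumed forgetting rule).
            updated[option] += decay * (baseline - updated[option])
    return updated


q_values = [0.5, 0.5]  # both options initialised at 0.5
q_values = rw_update(q_values, choice=0, reward=1.0, alpha=0.4, decay=0.1)
after_first = list(q_values)
q_values = rw_update(q_values, choice=1, reward=0.0, alpha=0.4, decay=0.1)
print(after_first, q_values)
```

With a baseline of 0 instead of 0.5, an unchosen option would drift toward "never rewarded", which is the structural difference between model classes that the reviewer asked to remove.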

5b) In the authors' winning model, switching is more likely when there is less uncertainty about the unchosen value and when there is greater relative confidence about the previous choice. These effects are counter-intuitive, and more evidence is needed for them to be convincing.

In the case of switching when there is less uncertainty, this is the opposite of normative exploration, the notion of an "exploration bonus." Wilson et al., 2014, for example, found evidence for directed exploration. Why do the authors think they see the opposite here?

We do not think that the effect of uncertainty that we observed is that counterintuitive. The reviewer is right that the effect of uncertainty we observed is opposite to an exploration bonus or uncertainty based exploration as suggested by normative accounts of exploration (2) and supported by some experiments (3,4). However, in many other studies of decision-making, variance is penalised as a form of risk sensitivity (5-7) which is akin to the effect that we observed. Furthermore, our model comparison showed that uncertainty aversion is a better account of the perseveration typically observed in bandit tasks (8,9) than a choice kernel. We have added this rationale in the Discussion:

“This is opposite to an exploration bonus or uncertainty based exploration term that arises in various more or less normative accounts of exploration (2) and has been observed in some experiments (3,4). However, many previous studies of decision-making have also shown that variance may be penalised as a form of risk sensitivity (5–7), and this is a cousin of the effect that we observed. Furthermore, our model comparison showed that uncertainty aversion is a better account of the perseveration typically observed in bandit tasks (8,9) than a choice kernel.”

5c) That previous trial relative confidence predicts switching is also surprising. It is hard to see how this could be a good feature for learning to have under general conditions, which makes me wonder if it is a byproduct of some particular aspect of the current task. For example, this behavior could be adaptive if there is a negative correlation between the values of the two options or a tendency for values to reverse over time. If this result is more of a byproduct of the task than a general phenomenon, then I would worry about making too much of this finding.

The effects of relative confidence observed at the group level are indeed surprising (certainly, we had not predicted them), and could partly stem from a particular aspect of the task that we have yet to identify. However, when looking at individual differences we observed a negative correlation between κ and total monetary gains on the task (r=-0.42, p<0.001). Note that this correlation was wrongly described in the Results section before (our apologies for this, see response to point 6b for details). Further, those participants who had a negative κ (8 out of 57) performed best on the task. This implies that relative confidence has the expected effect on performance despite having an unexpected sign at the group level. This effect of κ on performance can be observed from simulated data where we explored the effects of varying κ when all other parameters were fixed at the median of all participants. We plotted the mean and standard error for total wins and proportion of efficient choices as a result of 100 iterations of 220 trials in the graph below.

We are also curious about what causes this unexpected use of confidence in most participants and are planning experiments where we will manipulate different task features such as the volatility of the environment and beliefs about task structure to tease apart the mechanism by which confidence operates. We have added clarifying text in the Results section:

“Nevertheless, κ was negatively correlated with the total monetary gains on the task (r(54)=-0.42, p=0.001, controlled for age; Supplementary Table 1), with negative values of κ in those participants with the highest performance. This implies that κ has the expected effect on performance despite having an unexpected sign at the group level.”

5d) In both of these cases, it would be very informative if the authors could identify the features of task performance that these two aspects of the model explain. This would increase confidence in the empirical finding, beyond the simple model comparisons. It might also provide further insight and modeling ideas; perhaps once the nature of switching behavior in this task is better understood the authors will discover that it can be even better explained by adding different, less counter-intuitive, features to the model.

From the model comparison (Table 1) it is evident that V (the variance of the unchosen option) captures perseveration better than RW or Bayesian models with a choice kernel. Therefore, the behavioural performance feature that is explained by V is perseveration. We provide a mechanistic account of perseveration beyond the commonly used choice kernel and show that, in our task, perseveration is steered by the variance of the previously unchosen option. We have highlighted this point in the current version of the manuscript:

“This is a novel insight into the mechanism behind what is usually referred to as perseveration and suggests that aversion to the uncertainty about the option that was not chosen previously causes a tendency to stick to one's choices.”

The performance feature captured by C^rel is less clear, and difficult to assess with the current dataset, which was not designed to find, let alone unpack, this quantity. Nevertheless, we have some ideas about it. One possibility is that κ depends on the perceived volatility of the task: if participants perceive the environment as being very volatile, an increased confidence in the currently chosen option can spur the belief that switching is a good idea. This could be assessed by manipulating volatility and/or regularly measuring experienced volatility, and observing the effect of κ on choice. Alternatively, it could be a belief in the Machiavellian nature of the experimenter (the surer a participant is that one option is better, the more likely they believe it will switch). Finally, the observed effect of κ could reflect a sort of safe exploration – in two senses: a) – the participant is convinced she has recently chosen the best option a lot (hence her confidence), so she can afford the odd exploratory trial; b) since the participant is relatively sure about the quality of A, they don't have to maintain a very precise assessment of by how much it bests B, so choosing B isn't informationally tricky for them in the sense of allowing a single outcome for B incorrectly to sway choice in favour of this option. In sum, we believe that this result is counterintuitive but very suggestive and points to new avenues for research. We have developed the discussion of this finding:

“One reason for the unwarranted use of confidence in the majority of participants could be that participants perceived the task as being highly volatile. As a result, they may have inferred that increasing confidence in the most recent choice indicates that the unchosen option has become better than the chosen option (10,11). Additionally, the observed effect of κ could reflect safe exploration: if the participant is convinced they have recently chosen the better option frequently (hence their confidence), they can afford to explore the more uncertain option. These possibilities provide interesting directions for future research.”

5e) The potential interaction between the uncertainty and confidence effects needs to be examined. It would make sense that uncertainty about the unchosen value and relative confidence are negatively correlated, given that the former is one of the inputs needed to calculate the latter. This would seem to complicate any interpretation of the weights on these parameters when both are in the model. At a minimum, the authors should report the goodness of fit statistics and parameter weights for a model where only the relative confidence term, and not the uncertainty term, is included in the model.

It is true that the correlation between υ(unchosen) and κ is negative, but this correlation is only significant at trend level (r(57)=-0.253, p=0.057). The models with only κ or only υ(unchosen) show poorer fit than the model with both υ(unchosen) and κ. We have added the model statistics for the model with κ only into Table 1 (likelihood: -5675.3, pseudo-R²: 0.331, iBIC: 11426). When the models are specified with one of the parameters alone, the sign of these parameters are largely the same as they are in the model with both parameters. In other words, if a Bayesian model is specified including κ only, κ is positive for 42 out of 57 participants (compared to 49 out of 57 in the current winning model), where all 42 participants with positive κ in the simpler model have positive κ in the winning model as well. If a Bayesian model is specified including υ only, υ is negative for 53 out of 57 participants (compared to 55 out of 57 in the current winning model), where all 53 participants with negative υ in the simpler model have negative υ in the winning model as well. This suggests that overall tendency for κ to be positive and υ to be negative does not stem from autocorrelation between the two. We have added this information in the Results (Also see reviewer 3, point 1):

“The overall tendency for κ to be positive and υ to be negative does not stem from autocorrelation between the two, as the sign of these parameters is largely the same when the model is specified with only one of these parameters (data not shown).”

5f) The authors refer to the effect of relative confidence as a "grass is greener" effect, but I do not think this analogy captures the effect accurately at all. For example, I could imagine also referring to an effect in the exact opposite direction (more switching when confidence is lower) as a "grass is greener" effect, so obviously the analogy is doing no work, and perhaps obscuring rather than enlightening.

We apologise for the confusion created and have removed the expression to avoid ambiguity.

Reviewer #3:

[…] The question is not really novel. In particular the last author (Marc Guitart-Masip) contributed to a Nature Neuroscience paper that already established the dopamine-dependency of age-related decline in reward learning. However, this new study brings further insights that help to refine our understanding of this phenomenon. Besides, the study has several strengths: it gathers a large dataset (60 participants) including behavioral, PET and fMRI data and takes a sophisticated analytical approach using computational modeling. Overall, I think this paper would nicely contribute to unraveling the determinants of reward learning in humans. Unfortunately, the number of different analyses and results sort of obscure the reading and dilute the main findings. My main suggestion would be to streamline the analysis so the results description would have a clearer structure.

We thank the reviewer for highlighting the strength of the results and the constructive criticism, which has been very helpful in the process of streamlining our analyses. We apologise for our lack of clarity. Below we address each of the separate points.

In that regard, it would help to remove the Bayesian model, which does not seem to bring much to the main conclusions, unless I missed something. I appreciate the amount of effort that the authors must have invested in this modeling work, but I am not convinced it makes sense to keep this model and related analyses of brain activity.

We have streamlined the results as suggested by the reviewer but, with due respect, decided to keep the Bayesian model. The reasons for this decision are outlined in response to each of the reviewer’s points pertaining to this concern.

My reasons are 1) There is no principle justifying that participants should switch when confidence in the chosen option is high (I suspect this comes from correlation between parameters),

When the models are specified with one of the parameters alone, the signs of these parameters are largely the same as they are in the model with both parameters. In other words, if a Bayesian model is specified including κ only, κ is positive for 42 out of 57 participants (compared to 49 out of 57 in the current winning model), where all 42 participants with positive κ in the simpler model have positive κ in the winning model as well. If a Bayesian model is specified including υ only, υ is negative for 53 out of 57 participants (compared to 55 out of 57 in the current winning model), where all 53 participants with negative υ in the simpler model have negative υ in the winning model as well. This suggests that the overall tendency for κ to be positive and υ to be negative does not stem from autocorrelation between the two. We have added this information in the Results (also see reviewer 2, point 5e).

2) When comparison is fair (models without the confidence add-ons) the BIC of Rescorla-Wagner and Bayesian models are similar (compare third lines in Table 1),

This is true and indicates that the reason why the Bayesian model is better is not dependent on the update rules. Instead, the Bayesian model provides additional information such as variance and confidence that can be used to make decisions. The Bayesian model has access to this information by tracking the probability distribution of the bandits; the simpler, RW, model does not (at least not simply). What makes the Bayesian model better, in our view, is that it provides the most parsimonious account of the behaviour we observed. In addition, it provides a mechanistic account of perseveration, commonly observed in bandit tasks. It also provides insight into the role of confidence, which is novel and can be interesting for future research. Finally, the BOLD activity in vmPFC is better accounted for by the Bayesian model (see response to point 4 for further discussion on this issue).
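The contrast drawn in this response, between a learner that keeps only a point estimate and one that tracks a full probability distribution, can be illustrated with a minimal sketch. It assumes binary outcomes and uses a generic Beta-Bernoulli learner as a stand-in for the Bayesian observer model described in the manuscript; the function names, learning rate and outcome sequence are hypothetical and are not taken from the authors' modelling code.

```python
# Illustrative only: a generic Beta-Bernoulli learner stands in for the
# manuscript's Bayesian observer model; all names and values are hypothetical.

def rw_update(q, reward, alpha):
    """Rescorla-Wagner: move the point estimate q towards the outcome."""
    return q + alpha * (reward - q)

def beta_update(a, b, reward):
    """Beta-Bernoulli: add the binary outcome to the win/loss counts."""
    return a + reward, b + (1 - reward)

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) belief about reward probability."""
    total = a + b
    return a / total, (a * b) / (total ** 2 * (total + 1))

outcomes = [1, 1, 0, 1]   # hypothetical outcomes for one bandit

q = 0.5                   # RW starting value (assumed)
a, b = 1, 1               # uniform Beta prior (assumed)
for r in outcomes:
    q = rw_update(q, r, alpha=0.5)
    a, b = beta_update(a, b, r)

mean, var = beta_mean_var(a, b)
print(f"RW point estimate: {q:.3f}")
print(f"Bayesian mean: {mean:.3f}, variance: {var:.4f}")
```

After the same four outcomes the RW learner holds a single number, whereas the distribution-tracking learner also carries a variance, which is the kind of additional quantity from which uncertainty and confidence terms could be derived.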

3) Unlike the RW model, the Bayesian model does not capture the difference in behavioral performance between young and older people,

This is true and unfortunate. Although one goal of using modelling is to better understand age group differences in behaviour, another goal of the modelling is to uncover behavioural mechanisms on this particular task, regardless of age. We believe that the Bayesian model provides a better account of choices, as indicated by Bayesian model comparison. Furthermore, the Bayesian model uncovers potentially interesting contributions of uncertainty and confidence to choice on this task, such as that perseveration is better accounted for by uncertainty aversion than by a choice kernel, as commonly modelled. We have added a sentence to the Discussion acknowledging that the components of our model may not be able to capture age differences (also see reviewer 2, point 6k):

“In fact, it is likely that the process underlying age differences in performance is not parametrised in the winning Bayesian model.”

4) The variables specific to the Bayesian model have only weak links with brain activity, contrary to the RW model-based predictions, on which main conclusions are built.

The reviewer is right that the variables specific to the Bayesian model have only weak links with brain activity. We have therefore taken out the analyses that show the relation of BOLD to confidence and variance. We have reanalysed our fMRI data using a GLM only including Q at choice (as opposed to Q, V and Crel). Our most interesting results relate to expected value and do not change using this new GLM. We now use the parameter estimates of this new model for expected value to analyse the relationship between Q, age and DA. We observed that the Bayesian model provides a better predictor of expected value in the vmPFC compared to the RW model, improving our ability to make inferences about the relationship between BOLD signal in this region and dopamine D1 receptor availability. This was demonstrated by a paired t-test between the residuals of both GLMs in their respective vmPFC ROIs. We have included this observation in the Results (Also see reviewer 1, second concern, and reviewer 2, point 4b).

“On the other hand, the Bayesian observer model generates better predictions of the BOLD signal in the vmPFC when Q as generated by each model was included as a parametric modulator at the time of choice (paired t-test comparing residuals of the respective GLM models across all voxels in the respective vmPFC ROI; t(56)=-5.62, p<0.001).”
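The form of the paired comparison quoted above can be sketched as follows. This is a minimal illustration under stated assumptions: one summary residual value per participant and per model, with made-up numbers for five hypothetical participants. The helper name `paired_t` and the data are not from the authors' analysis scripts.

```python
from math import sqrt
from statistics import fmean, stdev

def paired_t(x, y):
    """Paired t-statistic and degrees of freedom for matched samples x and y."""
    diffs = [xi - yi for xi, yi in zip(x, y)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical per-participant residual summaries (not real data).
resid_bayes = [0.90, 0.85, 0.95, 0.80, 0.88]
resid_rw = [1.00, 0.90, 1.05, 0.95, 0.93]

t_stat, df = paired_t(resid_bayes, resid_rw)
print(f"t({df}) = {t_stat:.2f}")
```

With this ordering of the arguments, a negative t-statistic corresponds to smaller residuals for the first model.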

Whereas the RW model provides a better predictor of expected value in the NAcc, the reported lack of RPE is not dependent on which model is used (Figure 4—figure supplement 1).

It is important to note that our vmPFC result is not dependent on the choice of model. When expected value is estimated using the RW model, the age group difference in anticipatory expected value in the vmPFC and its relationship to performance remains unchanged. The relationship between this activity and DA D1 BP is still significant, but does not survive correction for age. Nevertheless, the relationship between age and anticipatory activity in vmPFC does not survive the inclusion of DA D1 either. This does not allow for the disentanglement of age and DA D1 BP as contributors to vmPFC activity when using the RW model. By contrast, BOLD activity in vmPFC is better accounted for by the Bayesian model, which in turn is also a better model (taking appropriate account of the numbers of parameters). We suggest that these are good reasons to keep the Bayesian model as a predictor of brain activity.

Besides, I have some other concerns:

– As far as I understand, a unique random walk was used to generate reward probabilities for all participants. From the plot in Figure 1 it looks like a noisy reversal, which raises the issue of possible age-related deficits in reversal per se, and of the anti-correlation between cues that may induce the belief that outcomes inform on both cues (subject might normalize the two option values). These possibilities should be discussed.

The reviewer is right that the TAB task can be seen as a noisy reversal learning task whereby the key determinant of which stimulus is chosen is expected value. However, as shown by our winning model, choice is also modulated by uncertainty and relative confidence.

Our results demonstrate that performance in the task is at least partly supported by the expected value signal in the vmPFC and that the strength of this signal explains the effects of age on performance. However, as pointed out by the reviewer, other mechanisms such as executive control may be at play. We attempted to uncover one such alternative mechanism by looking at brain responses on switch trials. After controlling for RT and value difference (as suggested by reviewer 2), these brain responses were no longer correlated with measures of performance, suggesting that they may be related to the differences in value. We have removed these results from the manuscript but extended the Discussion to suggest that differences in executive functions, such as the ability to inhibit a response to a previously rewarded option, could also be at play in our task, because the task can be seen as a noisy reversal learning task:

“Our results show that performance in the TAB is supported by the expected value signal in the vmPFC and that the strength of this signal explains the effects of age on performance. However, considering that the TAB can be seen as a noisy reversal learning task, it is a possibility that differences in executive functions – such as the ability to inhibit a response to a previously rewarded option – contribute to age group differences in our task (19).”

– The difference in vmPFC value signal could artificially come from the difference in learning performance. This is because the variance of the value regressor in the GLM used to fit fMRI data depends on how much subjects learn about option values (no learning gives a flat regressor), unless regressors are z-scored (I could not find this info in the Materials and methods). This issue needs to be carefully addressed.

Thank you for this comment, which raises a valid concern. SPM mean-centres the regressors by default. We have added a reference to that point in the Materials and methods:

“These regressors are mean-centered by default (20).”
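The two transformations mentioned in this exchange, mean-centring and z-scoring, can be illustrated with a short sketch. The regressor values are hypothetical and the code makes no claim about SPM internals; it only shows what each transformation does to a regressor's mean and spread.

```python
from statistics import fmean, pstdev

def mean_centre(x):
    """Subtract the mean; the spread of the values is left unchanged."""
    m = fmean(x)
    return [v - m for v in x]

def z_score(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    m = fmean(x)
    s = pstdev(x)
    return [(v - m) / s for v in x]

# Hypothetical value regressors: one with a wide range, one nearly flat.
steep = [0.2, 0.4, 0.6, 0.8]
flat = [0.45, 0.50, 0.50, 0.55]

steep_c, flat_c = mean_centre(steep), mean_centre(flat)
steep_z, flat_z = z_score(steep), z_score(flat)

print(f"SD after mean-centring: {pstdev(steep_c):.3f} vs {pstdev(flat_c):.3f}")
print(f"SD after z-scoring:     {pstdev(steep_z):.3f} vs {pstdev(flat_z):.3f}")
```

Mean-centring sets both means to zero while preserving each regressor's spread, whereas z-scoring additionally equates the spread. The analysis restricted to high performers, described next, addresses the performance confound directly.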

To investigate the question in more detail we restricted our analysis to the high performers as defined by a median split (n=28, 13 old, 15 young). When we take this group, there is no longer an age group difference in total monetary gains (p=0.6). However, we found a correlation between Q in vmPFC and age (r(26)=-0.39; p=0.040) and a marginally significant group difference (t(26)=2.03; p=0.054). Therefore, it is unlikely that this difference comes only from the different performance in learning. This result has been added in the Results section:

“This difference in vmPFC value signal did not arise because of the difference in learning performance: when we restricted our analysis to high performers as defined by a median split (13 old, 15 young), a difference in performance was no longer significant (p=0.60), but the strength of expected-value signal in vmPFC was correlated with age (r(26)=-0.39, p=0.040) and we found a marginally significant difference between age groups (M_old=4.21, SD=4.81; M_young=8.29, SD=5.72; t(26)=2.03, p=0.054).”
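As a consistency check, the test statistics quoted above can be recomputed from the reported summary values. The sketch assumes a pooled-variance (Student) two-sample t-test, which matches the reported 26 degrees of freedom (13 + 15 − 2), and the standard conversion of a correlation coefficient to a t-statistic; the function names are illustrative.

```python
from math import sqrt

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Two-sample t-statistic with pooled variance, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

def r_to_t(r, df):
    """t-statistic corresponding to a Pearson correlation r with df degrees of freedom."""
    return r * sqrt(df / (1 - r ** 2))

# Reported values: young M=8.29, SD=5.72, n=15; old M=4.21, SD=4.81, n=13.
t_group, df_group = pooled_t(8.29, 5.72, 15, 4.21, 4.81, 13)
# Reported correlation between age and the vmPFC value signal: r(26) = -0.39.
t_corr = r_to_t(-0.39, 26)

print(f"group difference: t({df_group}) = {t_group:.2f}")
print(f"correlation:      t(26) = {t_corr:.2f}")
```

The group comparison comes out at roughly t(26) = 2.02, in line with the reported 2.03 given that the means and standard deviations are rounded to two decimals.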

– The absence of (negative) correlation with expectation at outcome onset is interesting given the debate about prediction error encoding in the striatum. Yet I am unsure of how the authors interpret this. Is this an artifact from the design (cue and outcome onsets being too close in time), is it that true prediction errors are encoded in other brain regions, or is it that the brain does not encode prediction error at all? Perhaps the authors could clarify their position in this issue in the discussion.

We have clarified our position in the Discussion. Although we would like to give a conclusive statement about whether and where the brain encodes RPEs, we do not think that our data can provide enough evidence one way or the other (Also see reviewer 2, point 6i).

“The lack of canonical RPE signal in NAcc could stem from the fact that we used a very stringent test for RPEs. Previous studies using the same stringent method report mixed results. Whereas some studies report significant positive effects of reward obtainment and negative effects of expected value (21,22), others do not find this canonical signal in NAcc (16,18,23,24). The conditions under which a canonical RPE can be detected may depend on task characteristics. For example, if the RPE signal is not behaviourally relevant for the task at hand it may not be encoded in the NAcc. In our case, however, RPEs are behaviourally relevant because the choice between bandits is based on fine-grained differences in their values. However, for other paradigms, the lack of behavioural relevance of RPEs could potentially explain a negative result (15,16,24). Another important aspect may be the temporal proximity of the choice cues and the outcome presentation in the task. This may hinder the dissection of opposing responses to these events with fMRI. We cannot rule out the possibility that our negative result stems from this feature of our task design and for this reason, we cannot provide conclusive evidence on the lack of canonical RPE signal in the NAcc. Our results point, however, to the need for stringent tests in future studies of the neural underpinnings of RPEs with fMRI.”

1. Lindenberger U, von Oertzen T, Ghisletta P, Hertzog C. Cross-sectional age variance extraction: what’s change got to do with it? Psychol Aging. 2011 Mar;26(1):34–47.

2. Dayan P, Sejnowski TJ. Exploration Bonuses and Dual Control. Mach Learn. 1996;25(1):5–22.

3. Badre D, Doll BB, Long NM, Frank MJ. Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron. 2012 Feb 9;73(3):595–607.

4. Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans use directed and random exploration to solve the explore-exploit dilemma. J Exp Psychol Gen. 2014 Dec;143(6):2074–81.

5. Symmonds M, Wright ND, Bach DR, Dolan RJ. Deconstructing risk: separable encoding of variance and skewness in the brain. NeuroImage. 2011 Oct 15;58(4):1139–49.

6. Payzan-LeNestour E, Bossaerts P. Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLoS Comput Biol. 2011 Jan 20;7(1):e1001048.

7. d’Acremont M, Fornari E, Bossaerts P. Activity in Inferior Parietal and Medial Prefrontal Cortex Signals the Accumulation of Evidence in a Probability Learning Task. PLoS Comput Biol. 2013;9(1):e1002895.

8. Schönberg T, Daw ND, Joel D, O’Doherty JP. Reinforcement Learning Signals in the Human Striatum Distinguish Learners from Nonlearners during Reward-Based Decision Making. J Neurosci. 2007 Nov 21;27(47):12860–7.

9. Rutledge RB, Lazzaro SC, Lau B, Myers CE, Gluck MA, Glimcher PW. Dopaminergic Drugs Modulate Learning Rates and Perseveration in Parkinson’s Patients in a Dynamic Foraging Task. J Neurosci. 2009 Dec 2;29(48):15104–14.

10. Behrens TEJ, Woolrich MW, Walton ME, Rushworth MFS. Learning the value of information in an uncertain world. Nat Neurosci. 2007 Sep;10(9):1214–21.

11. Mathys C, Daunizeau J, Friston KJ, Stephan KE. A bayesian foundation for individual learning under uncertainty. Front Hum Neurosci. 2011;5:39.

12. Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans’ choices and striatal prediction errors. Neuron. 2011 Mar 24;69(6):1204–15.

13. McClure SM, Daw ND, Montague PR. A computational substrate for incentive salience. Trends Neurosci. 2003 Aug;26(8):423–8.

14. O’Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ. Temporal Difference Models and Reward-Related Learning in the Human Brain. Neuron. 2003 Apr 24;38(2):329–37.

15. Guitart-Masip M, Huys QJM, Fuentemilla L, Dayan P, Duzel E, Dolan RJ. Go and no-go learning in reward and punishment: Interactions between affect and effect. NeuroImage. 2012 Aug 1;62(1):154–66.

16. Stenner M-P, Rutledge RB, Zaehle T, Schmitt FC, Kopitzki K, Kowski AB, et al. No unified reward prediction error in local field potentials from the human nucleus accumbens: evidence from epilepsy patients. J Neurophysiol. 2015 Aug;114(2):781–92.

17. Rieckmann A, Karlsson S, Karlsson P, Brehmer Y, Fischer H, Farde L, et al. Dopamine D1 receptor associations within and between dopaminergic pathways in younger and elderly adults: links to cognitive performance. Cereb Cortex. 2011 Sep;21(9):2023–32.

18. Chowdhury R, Guitart-Masip M, Lambert C, Dayan P, Huys Q, Düzel E, et al. Dopamine restores reward prediction errors in old age. Nat Neurosci. 2013 May;16(5):648–53.

19. Bari A, Robbins TW. Inhibition and impulsivity: Behavioral and neural basis of response control. Prog Neurobiol. 2013 Sep;108:44–79.

20. Mumford JA, Poline J-B, Poldrack RA. Orthogonalization of Regressors in fMRI Models. PLOS ONE. 2015 Apr 28;10(4):e0126255.

21. Behrens TEJ, Hunt LT, Woolrich MW, Rushworth MFS. Associative learning of social value. Nature. 2008 Nov 13;456(7219):245–9.

22. Niv Y, Edlund JA, Dayan P, O’Doherty JP. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. J Neurosci. 2012 Jan 11;32(2):551–62.

23. Wimmer GE, Braun EK, Daw ND, Shohamy D. Episodic Memory Encoding Interferes with Reward Learning and Decreases Striatal Prediction Errors. J Neurosci. 2014 Nov 5;34(45):14901–12.

24. Li J, Daw ND. Signals in Human Striatum Are Appropriate for Policy Update Rather than Value Prediction. J Neurosci. 2011 Apr 6;31(14):5504–11.

Associated Data

Supplementary Materials

Figure 1—source data 1. Source data to Figure 1.

elife-26424-fig1-data1.zip (17.2 KB, zip)

DOI: 10.7554/eLife.26424.005

Figure 1—source code 1. Code that was used to perform simulation of behavioural data (Figure 1c), as well as the creation of Figure 1.

elife-26424-fig1-code1.m (2.4 KB, m)

DOI: 10.7554/eLife.26424.006

Figure 1—figure supplement 1—source data 1. Binding potentials in seven ROIs for young and old participants.

elife-26424-fig1-figsupp1-data1.csv (5.3 KB, csv)

DOI: 10.7554/eLife.26424.004

Figure 3—source data 1. Source data for Figure 3: cluster corresponding to Q in vmPFC at the time of choice.

Parameter estimates for all participants of Q in vmPFC at the time of choice. Timecourse data for young and old participants corresponding to Q in vmPFC. BPs in NAcc for all participants.

elife-26424-fig3-data1.zip (1.1 MB, zip)

DOI: 10.7554/eLife.26424.011

Figure 4—source data 1. Activation cluster in ventral striatum as defined by the winning Bayesian model, as well as parameter estimates of R and Q in left and right ventral striatum.

elife-26424-fig4-data1.zip (3.7 KB, zip)

DOI: 10.7554/eLife.26424.015

Figure 4—figure supplement 1—source data 1. Activation cluster in ventral striatum as defined by the winning Rescorla-Wagner model, as well as parameter estimates of R and Q in left and right ventral striatum.

elife-26424-fig4-figsupp1-data1.zip (920.9 KB, zip)

DOI: 10.7554/eLife.26424.014

Source code 1. Computational modelling.

Scripts needed for the entire modelling routine used in the behavioural analysis. See the comments in the file fit_all_models_eLife.m for more details on each of the models and the procedure.

elife-26424-code1.zip (44 KB, zip)

DOI: 10.7554/eLife.26424.016

Source code 2. fMRI analysis.

All MATLAB scripts required to set up preprocessing of fMRI data, create regressors for fMRI analysis, run the first level analysis and the second level analysis.

elife-26424-code2.zip (8.2 KB, zip)

DOI: 10.7554/eLife.26424.017

Source code 3. PET analysis.

Scripts required to run the segmentation of T1 images, PET analysis and estimation of BPs for the different ROIs.

elife-26424-code3.zip (4.3 KB, zip)

DOI: 10.7554/eLife.26424.018

Source code 4. Figures.

R script for ggplot for Figures 1b, 3b, d and e and 4b.

elife-26424-code4.r (15.6 KB, r)

DOI: 10.7554/eLife.26424.019

Source code 5. Figure 2.

MATLAB script that creates joint probability distributions shown in Figure 2.

elife-26424-code5.m (1 KB, m)

DOI: 10.7554/eLife.26424.020

Source code 6. Timecourse extraction.

MATLAB script that extracts the timecourse for expected value from vmPFC for young and old separately.

elife-26424-code6.m (14.1 KB, m)

DOI: 10.7554/eLife.26424.021

Supplementary file 1. (A) Correlation coefficients between model parameters and performance.

elife-26424-supp1.docx (14.7 KB, docx)

DOI: 10.7554/eLife.26424.022

Supplementary file 2. (A) No significant correlations between model parameters and dopamine D1 receptor density in any ROI after controlling for age at Bonferroni-corrected threshold of 0.0014.

(B) Partial correlation matrix showing correlation coefficients between the binding potential in the different PET ROIs and their p-values after controlling for age.

elife-26424-supp2.docx (20.5 KB, docx)

DOI: 10.7554/eLife.26424.023

Supplementary file 3. Coordinates of clusters responsive to Q at the time of choice.

elife-26424-supp3.docx (19.1 KB, docx)

DOI: 10.7554/eLife.26424.024

Transparent reporting form

elife-26424-transrepform.docx (243 KB, docx)

DOI: 10.7554/eLife.26424.025

[bib1] Ashburner J. A fast diffeomorphic image registration algorithm. NeuroImage. 2007;38:95–113. doi: 10.1016/j.neuroimage.2007.07.007. [DOI] [PubMed] [Google Scholar]

[bib2] Badre D, Doll BB, Long NM, Frank MJ. Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron. 2012;73:595–607. doi: 10.1016/j.neuron.2011.12.025. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] Balleine BW, O'Doherty JP. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology. 2010;35:48–69. doi: 10.1038/npp.2009.131. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] Barch DM, Sheline YI, Csernansky JG, Snyder AZ. Working memory and prefrontal cortex dysfunction: specificity to schizophrenia compared with major depression. Biological Psychiatry. 2003;53:376–384. doi: 10.1016/S0006-3223(02)01674-8. [DOI] [PubMed] [Google Scholar]

[bib5] Bari A, Robbins TW. Inhibition and impulsivity: behavioral and neural basis of response control. Progress in Neurobiology. 2013;108:44–79. doi: 10.1016/j.pneurobio.2013.06.005. [DOI] [PubMed] [Google Scholar]

[bib6] Bayer HM, Glimcher PW. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron. 2005;47:129–141. doi: 10.1016/j.neuron.2005.05.020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] Behrens TE, Hunt LT, Woolrich MW, Rushworth MF. Associative learning of social value. Nature. 2008;456:245–249. doi: 10.1038/nature07538. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] Behrens TE, Woolrich MW, Walton ME, Rushworth MF. Learning the value of information in an uncertain world. Nature Neuroscience. 2007;10:1214–1221. doi: 10.1038/nn1954. [DOI] [PubMed] [Google Scholar]

[bib9] Boorman ED, Behrens TE, Woolrich MW, Rushworth MF. How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron. 2009;62:733–743. doi: 10.1016/j.neuron.2009.05.014. [DOI] [PubMed] [Google Scholar]

[bib10] Bäckman L, Lindenberger U, Li SC, Nyberg L. Linking cognitive aging to alterations in dopamine neurotransmitter functioning: recent data and future avenues. Neuroscience & Biobehavioral Reviews. 2010;34:670–677. doi: 10.1016/j.neubiorev.2009.12.008. [DOI] [PubMed] [Google Scholar]

[bib11] Camille N, Griffiths CA, Vo K, Fellows LK, Kable JW. Ventromedial frontal lobe damage disrupts value maximization in humans. Journal of Neuroscience. 2011;31:7527–7532. doi: 10.1523/JNEUROSCI.6527-10.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] Chau BK, Kolling N, Hunt LT, Walton ME, Rushworth MF. A neural mechanism underlying failure of optimal choice with multiple alternatives. Nature Neuroscience. 2014;17:463–470. doi: 10.1038/nn.3649. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] Chowdhury R, Guitart-Masip M, Lambert C, Dayan P, Huys Q, Düzel E, Dolan RJ. Dopamine restores reward prediction errors in old age. Nature Neuroscience. 2013;16:648–653. doi: 10.1038/nn.3364. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] Cools R, D'Esposito M. Inverted-U-shaped dopamine actions on human working memory and cognitive control. Biological Psychiatry. 2011;69:e113–e125. doi: 10.1016/j.biopsych.2011.03.028. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] d'Acremont M, Fornari E, Bossaerts P. Activity in inferior parietal and medial prefrontal cortex signals the accumulation of evidence in a probability learning task. PLoS Computational Biology. 2013;9:e1002895. doi: 10.1371/journal.pcbi.1002895. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] D'Esposito M, Detre JA, Alsop DC, Shin RK, Atlas S, Grossman M. The neural basis of the central executive system of working memory. Nature. 1995;378:279–281. doi: 10.1038/378279a0. [DOI] [PubMed] [Google Scholar]

[bib17] Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans' choices and striatal prediction errors. Neuron. 2011;69:1204–1215. doi: 10.1016/j.neuron.2011.02.027. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] Daw ND, O'Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006;441:876–879. doi: 10.1038/nature04766. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] Dayan P, Sejnowski TJ. Exploration bonuses and dual control. Machine Learning. 1996;25:5–22. doi: 10.1007/BF00115298. [DOI] [Google Scholar]

[bib20] Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, Buckner RL, Dale AM, Maguire RP, Hyman BT, Albert MS, Killiany RJ. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006;31:968–980. doi: 10.1016/j.neuroimage.2006.01.021. [DOI] [PubMed] [Google Scholar]

[bib21] Dreher JC, Meyer-Lindenberg A, Kohn P, Berman KF. Age-related changes in midbrain dopaminergic regulation of the human reward system. PNAS. 2008;105:15106–15111. doi: 10.1073/pnas.0802127105. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] Eppinger B, Heekeren HR, Li SC. Age-related prefrontal impairments implicate deficient prediction of future reward in older adults. Neurobiology of Aging. 2015;36:2380–2390. doi: 10.1016/j.neurobiolaging.2015.04.010. [DOI] [PubMed] [Google Scholar]

[bib23] Eppinger B, Hämmerer D, Li SC. Neuromodulation of reward-based learning and decision making in human aging. Annals of the New York Academy of Sciences. 2011;1235:1–17. doi: 10.1111/j.1749-6632.2011.06230.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] Eppinger B, Schuck NW, Nystrom LE, Cohen JD. Reduced striatal responses to reward prediction errors in older compared with younger adults. Journal of Neuroscience. 2013;33:9905–9912. doi: 10.1523/JNEUROSCI.2942-12.2013. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] Frank MJ, Doll BB, Oas-Terpstra J, Moreno F. The neurogenetics of exploration and exploitation: Prefrontal and striatal dopaminergic components. Nature Neuroscience. 2009;12:1062–1068. doi: 10.1038/nn.2342. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] Frank MJ, Seeberger LC, O'reilly RC. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science. 2004;306:1940–1943. doi: 10.1126/science.1102941. [DOI] [PubMed] [Google Scholar]

[bib27] Gruber AJ, Dayan P, Gutkin BS, Solla SA. Dopamine modulation in the basal ganglia locks the gate to working memory. Journal of Computational Neuroscience. 2006;20:153. doi: 10.1007/s10827-005-5705-x. [DOI] [PubMed] [Google Scholar]

[bib28] Guitart-Masip M, Huys QJ, Fuentemilla L, Dayan P, Duzel E, Dolan RJ. Go and no-go learning in reward and punishment: interactions between affect and effect. NeuroImage. 2012;62:154–166. doi: 10.1016/j.neuroimage.2012.04.024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] Haber SN, Knutson B. The reward circuit: linking primate anatomy and human imaging. Neuropsychopharmacology. 2010;35:4–26. doi: 10.1038/npp.2009.129. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] Halfmann K, Hedgcock W, Kable J, Denburg NL. Individual differences in the neural signature of subjective value among older adults. Social Cognitive and Affective Neuroscience. 2016;11:1111–1120. doi: 10.1093/scan/nsv078. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib31] Hall H, Sedvall G, Magnusson O, Kopp J, Halldin C, Farde L. Distribution of D1- and D2-dopamine receptors, and dopamine and its metabolites in the human brain. Neuropsychopharmacology. 1994;11:245–256. doi: 10.1038/sj.npp.1380111. [DOI] [PubMed] [Google Scholar]

[bib32] Hart AS, Rutledge RB, Glimcher PW, Phillips PE. Phasic dopamine release in the rat nucleus accumbens symmetrically encodes a reward prediction error term. The Journal of Neuroscience. 2014;34:698–704. doi: 10.1523/JNEUROSCI.2489-13.2014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] Huys QJ, Cools R, Gölzer M, Friedel E, Heinz A, Dolan RJ, Dayan P. Disentangling the roles of approach, activation and valence in instrumental and pavlovian responding. PLoS Computational Biology. 2011;7:e1002028. doi: 10.1371/journal.pcbi.1002028. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] Hämmerer D, Eppinger B. Dopaminergic and prefrontal contributions to reward-based learning and outcome monitoring during child development and aging. Developmental Psychology. 2012;48:862–874. doi: 10.1037/a0027342. [DOI] [PubMed] [Google Scholar]

[bib35] Jocham G, Klein TA, Ullsperger M. Dopamine-mediated reinforcement learning signals in the striatum and ventromedial prefrontal cortex underlie value-based choices. Journal of Neuroscience. 2011;31:1606–1613. doi: 10.1523/JNEUROSCI.3904-10.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] Kim H, Shimojo S, O'Doherty JP. Overlapping responses for the expectation of juice and money rewards in human ventromedial prefrontal cortex. Cerebral Cortex. 2011;21:769–776. doi: 10.1093/cercor/bhq145.

[bib37] Lau B, Glimcher PW. Dynamic response-by-response models of matching behavior in rhesus monkeys. Journal of the Experimental Analysis of Behavior. 2005;84:555–579. doi: 10.1901/jeab.2005.110-04.

[bib38] Levy BJ, Wagner AD. Cognitive control and right ventrolateral prefrontal cortex: reflexive reorienting, motor inhibition, and action updating. Annals of the New York Academy of Sciences. 2011;1224:40–62. doi: 10.1111/j.1749-6632.2011.05958.x.

[bib39] Li J, Daw ND. Signals in human striatum are appropriate for policy update rather than value prediction. Journal of Neuroscience. 2011;31:5504–5511. doi: 10.1523/JNEUROSCI.6316-10.2011.

[bib40] Li J, Schiller D, Schoenbaum G, Phelps EA, Daw ND. Differential roles of human striatum and amygdala in associative learning. Nature Neuroscience. 2011;14:1250–1252. doi: 10.1038/nn.2904.

[bib41] Lindenberger U, von Oertzen T, Ghisletta P, Hertzog C. Cross-sectional age variance extraction: what's change got to do with it? Psychology and Aging. 2011;26:34–47. doi: 10.1037/a0020525.

[bib42] Logan J, Fowler JS, Volkow ND, Wolf AP, Dewey SL, Schlyer DJ, MacGregor RR, Hitzemann R, Bendriem B, Gatley SJ. Graphical analysis of reversible radioligand binding from time-activity measurements applied to [N-11C-methyl]-(-)-cocaine PET studies in human subjects. Journal of Cerebral Blood Flow and Metabolism. 1990;10:740–747. doi: 10.1038/jcbfm.1990.127.

[bib43] Maia TV, Frank MJ. From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience. 2011;14:154–162. doi: 10.1038/nn.2723.

[bib44] Marcott PF, Mamaligas AA, Ford CP. Phasic dopamine release drives rapid activation of striatal D2-receptors. Neuron. 2014;84:164–176. doi: 10.1016/j.neuron.2014.08.058.

[bib45] Mathys C, Daunizeau J, Friston KJ, Stephan KE. A bayesian foundation for individual learning under uncertainty. Frontiers in Human Neuroscience. 2011;5:39. doi: 10.3389/fnhum.2011.00039.

[bib46] Mazaika P, Whitfield S, Cooper JC. Detection and Repair of Transient Artifacts in fMRI Data. Human Brain Mapping. 2005.

[bib47] McClure SM, Daw ND, Montague PR. A computational substrate for incentive salience. Trends in Neurosciences. 2003;26:423–428. doi: 10.1016/S0166-2236(03)00177-2.

[bib48] McNamee D, Rangel A, O'Doherty JP. Category-dependent and category-independent goal-value codes in human ventromedial prefrontal cortex. Nature Neuroscience. 2013;16:479–485. doi: 10.1038/nn.3337.

[bib49] Mell T, Heekeren HR, Marschner A, Wartenburger I, Villringer A, Reischies FM. Effect of aging on stimulus-reward association learning. Neuropsychologia. 2005;43:554–563. doi: 10.1016/j.neuropsychologia.2004.07.010.

[bib50] Mumford JA, Poline JB, Poldrack RA. Orthogonalization of regressors in FMRI models. PLoS One. 2015;10:e0126255. doi: 10.1371/journal.pone.0126255.

[bib51] Neubert FX, Mars RB, Sallet J, Rushworth MF. Connectivity reveals relationship of brain areas for reward-guided learning and decision making in human and monkey frontal cortex. PNAS. 2015;112:E2695–E2704. doi: 10.1073/pnas.1410767112.

[bib52] Niv Y, Edlund JA, Dayan P, O'Doherty JP. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. Journal of Neuroscience. 2012;32:551–562. doi: 10.1523/JNEUROSCI.5498-10.2012.

[bib53] Niv Y, Montague PR. Theoretical and Empirical Studies of Learning. 2009.

[bib54] Noonan MP, Kolling N, Walton ME, Rushworth MF. Re-evaluating the role of the orbitofrontal cortex in reward and reinforcement. European Journal of Neuroscience. 2012;35:997–1010. doi: 10.1111/j.1460-9568.2012.08023.x.

[bib55] Noonan MP, Walton ME, Behrens TE, Sallet J, Buckley MJ, Rushworth MF. Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex. PNAS. 2010;107:20547–20552. doi: 10.1073/pnas.1012246107.

[bib56] Nyberg L, Salami A, Andersson M, Eriksson J, Kalpouzos G, Kauppi K, Lind J, Pudas S, Persson J, Nilsson LG. Longitudinal evidence for diminished frontal cortex function in aging. PNAS. 2010;107:22682–22686. doi: 10.1073/pnas.1012651108.

[bib57] O'Doherty JP. Lights, camembert, action! The role of human orbitofrontal cortex in encoding stimuli, rewards, and choices. Annals of the New York Academy of Sciences. 2007;1121:254–272. doi: 10.1196/annals.1401.036.

[bib58] O'Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38:329–337. doi: 10.1016/S0896-6273(03)00169-7.

[bib59] Patenaude B, Smith SM, Kennedy DN, Jenkinson M. A Bayesian model of shape and appearance for subcortical brain segmentation. NeuroImage. 2011;56:907–922. doi: 10.1016/j.neuroimage.2011.02.046.

[bib60] Payzan-LeNestour E, Bossaerts P. Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLoS Computational Biology. 2011;7:e1001048. doi: 10.1371/journal.pcbi.1001048.

[bib61] Pessiglione M, Seymour B, Flandin G, Dolan RJ, Frith CD. Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature. 2006;442:1042–1045. doi: 10.1038/nature05051.

[bib62] Petrides M. The role of the mid-dorsolateral prefrontal cortex in working memory. Experimental Brain Research. 2000;133:44–54. doi: 10.1007/s002210000399.

[bib63] Philiastides MG, Biele G, Heekeren HR. A mechanistic account of value computation in the human brain. PNAS. 2010;107:9430–9435. doi: 10.1073/pnas.1001732107.

[bib64] Plakke B, Romanski LM. Neural circuits in auditory and audiovisual memory. Brain Research. 2016;1640:278–288. doi: 10.1016/j.brainres.2015.11.042.

[bib65] Raz N, Lindenberger U, Rodrigue KM, Kennedy KM, Head D, Williamson A, Dahle C, Gerstorf D, Acker JD. Regional brain changes in aging healthy adults: general trends, individual differences and modifiers. Cerebral Cortex. 2005;15:1676–1689. doi: 10.1093/cercor/bhi044.

[bib66] Reynolds JN, Wickens JR. Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks. 2002;15:507–521. doi: 10.1016/S0893-6080(02)00045-X.

[bib67] Rieckmann A, Karlsson S, Karlsson P, Brehmer Y, Fischer H, Farde L, Nyberg L, Bäckman L. Dopamine D1 receptor associations within and between dopaminergic pathways in younger and elderly adults: links to cognitive performance. Cerebral Cortex. 2011;21:2023–2032. doi: 10.1093/cercor/bhq266.

[bib68] Rolls ET, McCabe C, Redoute J. Expected value, reward outcome, and temporal difference error representations in a probabilistic decision task. Cerebral Cortex. 2008;18:652–663. doi: 10.1093/cercor/bhm097.

[bib69] Ross S, Stearns C. SharpIR: White paper [Internet]. 2010. http://www3.gehealthcare.co.uk/~/media/downloads/uk/education/pet%20white%20papers/mi_emea_sharpir_white_paper_pdf_092010_doc0852276.pdf?Parent=%7BB66C9E27-1C45-4F6B-BE27-D2351D449B19%7D [Accessed 9 January 2017].

[bib70] Rudebeck PH, Murray EA. The orbitofrontal oracle: cortical mechanisms for the prediction and evaluation of specific behavioral outcomes. Neuron. 2014;84:1143–1156. doi: 10.1016/j.neuron.2014.10.049.

[bib71] Rushworth MF, Behrens TE. Choice, uncertainty and value in prefrontal and cingulate cortex. Nature Neuroscience. 2008;11:389–397. doi: 10.1038/nn2066.

[bib72] Rushworth MF, Noonan MP, Boorman ED, Walton ME, Behrens TE. Frontal cortex and reward-guided learning and decision-making. Neuron. 2011;70:1054–1069. doi: 10.1016/j.neuron.2011.05.014.

[bib73] Rutledge RB, Lazzaro SC, Lau B, Myers CE, Gluck MA, Glimcher PW. Dopaminergic drugs modulate learning rates and perseveration in Parkinson's patients in a dynamic foraging task. Journal of Neuroscience. 2009;29:15104–15114. doi: 10.1523/JNEUROSCI.3524-09.2009.

[bib74] Salamone JD, Correa M. The mysterious motivational functions of mesolimbic dopamine. Neuron. 2012;76:470–485. doi: 10.1016/j.neuron.2012.10.021.

[bib75] Samanez-Larkin GR, Knutson B. Decision making in the ageing brain: changes in affective and motivational circuits. Nature Reviews Neuroscience. 2015;16:278–289. doi: 10.1038/nrn3917.

[bib76] Samanez-Larkin GR, Levens SM, Perry LM, Dougherty RF, Knutson B. Frontostriatal white matter integrity mediates adult age differences in probabilistic reward learning. Journal of Neuroscience. 2012;32:5333–5337. doi: 10.1523/JNEUROSCI.5756-11.2012.

[bib77] Samanez-Larkin GR, Worthy DA, Mata R, McClure SM, Knutson B. Adult age differences in frontostriatal representation of prediction error but not reward outcome. Cognitive, Affective, & Behavioral Neuroscience. 2014;14:672–682. doi: 10.3758/s13415-014-0297-4.

[bib78] Sanders JI, Hangya B, Kepecs A. Signatures of a Statistical Computation in the Human Sense of Confidence. Neuron. 2016;90:499–506. doi: 10.1016/j.neuron.2016.03.025.

[bib79] Schlagenhauf F, Rapp MA, Huys QJ, Beck A, Wüstenberg T, Deserno L, Buchholz HG, Kalbitzer J, Buchert R, Bauer M, Kienast T, Cumming P, Plotkin M, Kumakura Y, Grace AA, Dolan RJ, Heinz A. Ventral striatal prediction error signaling is associated with dopamine synthesis capacity and fluid intelligence. Human Brain Mapping. 2013;34:1490–1499. doi: 10.1002/hbm.22000.

[bib80] Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593–1599. doi: 10.1126/science.275.5306.1593.

[bib81] Schönberg T, Daw ND, Joel D, O'Doherty JP. Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. Journal of Neuroscience. 2007;27:12860–12867. doi: 10.1523/JNEUROSCI.2496-07.2007.

[bib82] Shipp S. The functional logic of corticostriatal connections. Brain Structure and Function. 2017;222:669–706. doi: 10.1007/s00429-016-1250-9.

[bib83] Stenner MP, Rutledge RB, Zaehle T, Schmitt FC, Kopitzki K, Kowski AB, Voges J, Heinze HJ, Dolan RJ. No unified reward prediction error in local field potentials from the human nucleus accumbens: evidence from epilepsy patients. Journal of Neurophysiology. 2015;114:781–792. doi: 10.1152/jn.00260.2015.

[bib84] Sutton RS, Barto AG. Reinforcement learning: an introduction [Internet]. 1998. http://site.ebrary.com/id/10225275 [Accessed 17 October 2016].

[bib85] Symmonds M, Wright ND, Bach DR, Dolan RJ. Deconstructing risk: separable encoding of variance and skewness in the brain. NeuroImage. 2011;58:1139–1149. doi: 10.1016/j.neuroimage.2011.06.087.

[bib86] Vinckier F, Gaillard R, Palminteri S, Rigoux L, Salvador A, Fornito A, Adapa R, Krebs MO, Pessiglione M, Fletcher PC. Confidence and psychosis: a neuro-computational account of contingency learning disruption by NMDA blockade. Molecular Psychiatry. 2016;21:946–955. doi: 10.1038/mp.2015.73.

[bib87] Volkow ND, Gur RC, Wang GJ, Fowler JS, Moberg PJ, Ding YS, Hitzemann R, Smith G, Logan J. Association between decline in brain dopamine activity with age and cognitive and motor impairment in healthy individuals. The American Journal of Psychiatry. 1998;155:344–349. doi: 10.1176/ajp.155.3.344.

[bib88] Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General. 2014;143:2074–2081. doi: 10.1037/a0038199.

[bib89] Wimmer GE, Braun EK, Daw ND, Shohamy D. Episodic memory encoding interferes with reward learning and decreases striatal prediction errors. Journal of Neuroscience. 2014;34:14901–14912. doi: 10.1523/JNEUROSCI.0204-14.2014.
