A Role Beyond Learning for NMDA Receptors in Reward-Based Decision-Making—a Pharmacological Study Using d-Cycloserine

Jacqueline Scholl; Jan Günthner; Nils Kolling; Elisa Favaron; Matthew FS Rushworth; Catherine J Harmer; Andrea Reinecke

doi:10.1038/npp.2014.144

Neuropsychopharmacology. 2014 Jul 9;39(12):2900–2909. doi: 10.1038/npp.2014.144

PMCID: PMC4200501 EMSID: EMS58901 PMID: 24924800

Abstract

N-methyl-D-aspartate (NMDA) receptors are known to fulfill crucial functions in many forms of learning and plasticity. More recently, however, biophysical models have suggested an additional role for NMDA receptors in evidence integration for decision-making, going beyond their role in learning. We designed a task to study the role of NMDA receptors in human reward-guided learning and decision-making. Human participants were assigned to receive either 250 mg of the partial NMDA agonist d-cycloserine (n=20) or matching placebo capsules (n=27). Reward-guided learning and decision-making were assessed using a task in which participants had to integrate learnt and explicitly shown value information to maximize their monetary wins and minimize their losses. To tease apart the effects of NMDA receptor activity on learning and on decision-making, we used simple learning models. D-cycloserine shifted decision-making towards a more optimal integration of the learnt and the explicitly shown information, in the absence of a direct learning effect. In conclusion, our results reveal a distinct role for NMDA receptors in reward-guided decision-making. We discuss these findings in the context of NMDA receptors' roles in neuronal super-additivity and in evidence integration for decisions.

INTRODUCTION

Animals and humans live in ever changing and complex environments. They need to continuously track and learn about the changing properties of their environment and use them to behave adaptively. The neural mechanisms engaged in learning about the environment depend on the type of information being tracked. Nevertheless, the plasticity underpinning several of these various forms of learning has been argued to depend on the same molecular mechanism, ie, N-methyl-D-aspartate (NMDA) subtypes of glutamate receptors. Accordingly a number of studies have investigated whether NMDA receptor manipulations affect learning and memory (Bannerman et al, 2012; Kuriyama et al, 2011; Bohn et al, 2003).

Most of the studies looking at the role of NMDA receptors in reward learning have focused on simple tasks, such as a single association between a stimulus and an outcome or its reversal. However, in more ecological, and thus complex, scenarios, learning and decision-making often involve consideration of positive and negative aspects of potential outcomes as well as the integration of learnt information with information that is explicitly cued. Importantly, it has recently been shown that integration of information conveyed by different dimensions during decision-making is an active process recruiting particular neural mechanisms (Burke et al, 2013; Stein and Stanford, 2008). Thus, to understand the full role of NMDA receptors in reward learning and decision-making, it might not be sufficient to only study a single simple component in isolation.

Interestingly, there is some, albeit indirect, evidence that NMDA receptors are not only important for learning, but also for integration of information: NMDA receptor blockade has been reported to affect the integration of multisensory information in cat superior colliculus (Binns and Salt, 1996) or the integration of reward and delay in rats (Floresco et al, 2008). However, to our knowledge, it has not been investigated whether NMDA receptors play a role in human value-based decision-making and learning.

To examine the influence of changes in NMDA receptor activity on complex learning and decision-making behaviors in humans, we used the partial NMDA agonist d-cycloserine. D-cycloserine binds to the glycine site of the NMDA receptor. Glycine is a co-agonist of the NMDA receptor, meaning that NMDA receptors only open when both glycine and glutamate bind. D-cycloserine can thus increase the probability that glutamate release opens NMDA receptors, which in turn enhances NMDA receptor-mediated activation. We designed a multi-attribute decision-making task in which participants learnt changing probabilities of gains and losses of two options. They made choices between options by integrating those learnt probabilities with explicitly cued information about gain and loss magnitudes. To assess potential effects of d-cycloserine on learning and decision-making, we used reinforcement-learning models.

Surprisingly, we did not find any evidence for a change in the rate at which participants learned about reward or punishment outcomes. However, we found that d-cycloserine improved decision-making. D-cycloserine led to a more optimal integration of the learnt probability information with the explicitly cued magnitude information.

METHODS

Participants

The study was approved by the local ethics committee. In total, 52 healthy volunteers (age 18–30) took part in the study (inclusion details in Supplementary Methods). The groups were well-matched on sociodemographic and personality parameters (Table 1). Five participants were excluded (Supplementary Methods). There remained 20 participants in the d-cycloserine and 27 participants in the placebo group.

Table 1. Sociodemographic and Questionnaire Measurements.

Demographics and questionnaire measurements
	Pla (n=27)	DCS (n=20)	P
Age	22.2±0.6	22.3±0.7	0.93
Gender, F : M	15 : 12	11 : 9	0.97
BDI	2.0±0.5	0.9±0.4	0.14
Education years	16.9±0.3	15.9±0.5	0.10
Trait anxiety	30.9±1.6	30.5±1.3	0.86
BMI	22.0±0.4	22.0±0.5	0.99
Neuroticism	5.2±0.9	5.6±0.9	0.33
ACS focusing	26.7±0.6	26.2±0.9	0.64
ACS shifting	34.7±0.8	34.5±1.0	0.31
BIS	16.0±0.8	16.7±0.8	0.59
BAS	24.4±5.8	23.6±4.8	0.59

VAS items	Pla, before	Pla, after	DCS, before	DCS, after	P, before	P, after	P, diff score
Anxious	7.6±1.6	3.6±0.6	6.7±1.5	5.7±1.8	0.70	0.25	0.24
Sleepy	28.4±4.0	21.3±4.3	24.1±3.3	13.7±2.5	0.43	0.13	0.54
Flushed	9.1±2.0	3.2±0.8	7.1±1.9	3.7±0.7	0.50	0.62	0.31
Tearful	3.3±0.9	2.9±0.8	3.1±0.7	2.8±0.5	0.84	0.87	0.95
Nauseous	3.3±0.9	3.2±0.9	2.9±0.7	3.5±0.6	0.72	0.83	0.41
Hopeless	3.3±0.7	2.5±0.5	4.1±1.5	2.7±0.5	0.61	0.85	0.65
Tremor	3.3±0.9	3.2±1.0	3.7±1.0	3.1±0.6	0.77	0.91	0.58
Sad	4.9±0.9	2.8±0.5	4.5±1.3	2.8±0.5	0.83	0.98	0.80
Dizzy	2.7±0.7	3.5±1.2	3.0±0.7	5.4±2.0	0.78	0.41	0.38
Depressed	2.9±0.5	2.6±0.5	3.1±0.8	2.5±0.5	0.85	0.90	0.65
Tachycardia	4.5±1.4	3.5±1.2	5.7±1.5	3.5±0.7	0.57	0.98	0.37
Alert	51.3±5.0	48.4±5.1	57.8±5.0	52.3±5.4	0.37	0.61	0.51

The following measurements were obtained before drug administration: age, gender, Beck's Depression Inventory (BDI, Beck et al, 1996), years of education at time of study, trait anxiety (Spielberger et al, 1983), body mass index (BMI, weight(kg)/height(m)²), neuroticism (Eysenck and Eysenck, 1994), attention control scale (ACS, Derryberry and Reed, 2002), behavioral inhibition, behavioral activation (BIS/BAS Scale, Carver and White, 1994). The values reported are mean values with standard errors and the P-values from between-subject t-tests (apart from the value for gender ratio, where a chi-squared test was used). Visual analog scales were given to participants before and after drug administration. For each of the listed items, they were asked to indicate how they were feeling by placing a tick mark on a 100 mm line, which was labeled ‘not at all’ on the left-hand side and ‘extremely’ on the right-hand side. The values reported are the mean and standard error for the tick mark positions (in mm). P-values were calculated for the group differences in the first (at baseline, ‘before’) and second measurement (after drug administration, ‘after’) and on the difference scores (baseline minus second measurement, ‘diff score’).

Procedure

In a double-blind, placebo-controlled design, participants were randomly allocated to a single dose of d-cycloserine (250 mg) or matching placebo capsule. They fasted for 2 h before the testing visit. A dose of 250 mg was chosen in agreement with recent studies (Klumpers et al, 2012; Onur et al, 2010). Participants were tested 3 h after drug administration. According to product information (King's Pharmaceutical), plasma peak levels are reached within 3–4 h; other studies (van Berckel et al, 1997, 1998; Patel et al, 2011) found that peak levels are reached within ∼1 h. However, given d-cycloserine's half-life of 8–12 h (product information) or 15 h (Patel et al, 2011), plasma levels would have been close to peak levels during testing under either time-to-peak estimate. To assess potential subjective changes following d-cycloserine, participants completed questionnaires (Table 1) before capsule intake and before testing.
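The claim that plasma levels were close to peak under either time-to-peak estimate can be checked with a rough calculation. The sketch below assumes simple first-order (exponential) elimination after the peak, which is an illustrative simplification and not a pharmacokinetic model taken from the cited studies; the function name is hypothetical.

```python
def fraction_of_peak(hours_since_peak: float, half_life_h: float) -> float:
    """Fraction of the peak level remaining, assuming first-order elimination."""
    return 0.5 ** (hours_since_peak / half_life_h)


# Testing started 3 h after dosing. If the peak occurred at about 1 h, roughly
# 2 h of elimination had passed; if it occurred at 3-4 h, levels were at or
# near peak during testing anyway.
for half_life_h in (8.0, 12.0, 15.0):
    remaining = fraction_of_peak(2.0, half_life_h)
    print(f"half-life {half_life_h:>4.0f} h -> {remaining:.0%} of peak")
```

Under these assumptions, even the earliest time-to-peak estimate leaves roughly 84% to 91% of the peak level at the start of testing, depending on the half-life used.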

Probabilistic Instrumental Learning Task

Participants performed a probabilistic learning task with monetary wins and losses (Figure 1a). Participants made repeated choices between two options with the aim to maximize their monetary pay-off.

Figure 1. (a) At the beginning of a trial two options appeared on the left (pink square) and right side (yellow square) of the screen. Throughout the experiment, the pink square was always on the left side and the yellow square was always on the right side. Reward (bars at the top) and loss magnitudes (circles at the bottom) were presented overlaid on the option symbols. After 500 ms, a question mark appeared after which the participants chose an option. After participants made their selection (there was no time-out), the outcomes of the gambles were shown (b), first for the chosen option (duration: 2.5 s; left option in the example), then also for the unchosen option (duration: 2.5 s). If the gamble outcome of the chosen option led to a reward, the reward bar was shown; otherwise the reward bar was not shown. Similarly, if the gamble outcome of the chosen option led to a loss, the loss circle was shown; otherwise, it was not shown. The sum of the reward and the loss incurred for the chosen option in a trial was added to a status bar at the bottom of the screen, allowing participants to keep track of their overall gains. Subsequently, the participants were shown the outcomes for the unchosen option in the same way, except that no points were added to the status bar. Importantly, presenting the outcomes of the chosen and the unchosen option ensured that participants had an equal chance to learn the probabilities of the chosen and the unchosen option. After an inter-trial interval of 1.5 s, the next trial started. (c) Example reward probabilities for the two options over the course of the experiment. The probabilities were either stable at 20% or 80%, or they drifted between 20 and 80%, taking between five and eight trials per drift. (d) Example reward probability for one of the options (solid line), together with the probability estimates from the Bayesian learner used (dotted line).

On each trial, participants had a choice between two options. Each option had four independent attributes: a reward magnitude, a loss magnitude, a reward probability, and a loss probability. The magnitudes determined how many points could be won and lost on this trial, while the probabilities determined how likely winning and losing were, respectively. After participants selected one of the options, they were shown the outcomes for both options. However, only the option they had chosen contributed to the participants' earnings. In trials where the chosen option incurred both a win and a loss, the participants' earnings in that trial were the sum of the two. Therefore, to maximize their overall gains, participants should on each trial choose the option that best combines a high reward utility (reward magnitude × reward probability) with a low loss utility (loss magnitude × loss probability).
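As an illustration of the choice rule just described, the sketch below scores each option by its reward utility minus its loss utility and picks the higher-scoring one. The equal weighting of rewards and losses, and the names `Option`, `expected_net_points`, and `best_option`, are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass
class Option:
    reward_magnitude: float    # points that can be won (1-100)
    reward_probability: float  # probability of winning
    loss_magnitude: float      # points that can be lost (1-100)
    loss_probability: float    # probability of losing


def expected_net_points(opt: Option) -> float:
    """Reward utility minus loss utility, with rewards and losses weighted equally."""
    reward_utility = opt.reward_magnitude * opt.reward_probability
    loss_utility = opt.loss_magnitude * opt.loss_probability
    return reward_utility - loss_utility


def best_option(left: Option, right: Option) -> str:
    """Return the side with the higher expected net points (ties go to the left)."""
    return "left" if expected_net_points(left) >= expected_net_points(right) else "right"


left = Option(reward_magnitude=80, reward_probability=0.2, loss_magnitude=30, loss_probability=0.8)
right = Option(reward_magnitude=40, reward_probability=0.8, loss_magnitude=50, loss_probability=0.2)
print(best_option(left, right))
```

Here the left option offers the larger reward magnitude, but once the probabilities are taken into account the right option has the better expected pay-off, which is why magnitudes alone are not a sufficient guide to choice.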

Reward and loss magnitudes were explicitly cued at the time of choice and were randomly drawn from a flat distribution between 1 and 100 points. In contrast, the probabilities were not explicitly shown and had to be learnt across trials by observing the outcomes. The outcomes for one option could either be a win and a loss, only a win, only a loss, or neither win nor loss. The independent reward and loss probabilities determined the probability of these outcomes. The probabilities varied over the course of the experiment between 20 and 80%, with only one of the four probabilities varying at any given time (Figure 1c).
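Because the reward and loss probabilities are independent, the probabilities of the four possible outcomes of an option follow directly from them. A minimal sketch, with a hypothetical function name:

```python
def outcome_distribution(p_reward: float, p_loss: float) -> dict:
    """Probability of each of the four outcomes, given independent win and loss draws."""
    return {
        "win and loss": p_reward * p_loss,
        "win only": p_reward * (1.0 - p_loss),
        "loss only": (1.0 - p_reward) * p_loss,
        "neither": (1.0 - p_reward) * (1.0 - p_loss),
    }


# Example: an option at the favourable ends of both ranges (80% win, 20% loss).
for outcome, prob in outcome_distribution(0.8, 0.2).items():
    print(f"{outcome:>12}: {prob:.2f}")
```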

Before the experiment, participants were instructed about the task (see Supplementary Methods). Participants first performed 30 training trials, followed by 381 test trials, which were included in the analysis. Each participant was tested using the same task schedule to allow for better group comparisons. At the start of the task, participants were given £4 (400 points) to ensure that they had sufficient funds to sustain losses incurred even at the beginning of the experiment.

Analysis

The behavioral analysis compared the effects of d-cycloserine vs placebo on (a) learning of reward and loss probabilities, and (b) integration of learnt (probabilities) and explicitly cued (magnitudes) information for guiding complex decisions. All analyses were performed in Matlab and SPSS.

Logistic regression analysis

To ensure that participants learnt the probabilities, we first assessed the impact of past outcomes (reward and loss) and the explicitly cued magnitudes (reward and loss) on choice, using a logistic regression analysis and normalized regressor estimates. We included regressors for the last five trial differences in reward and loss outcomes between the two options, as well as the differences in the explicitly cued magnitudes.

To investigate whether the groups differed in their learning speeds, the resulting regression weights for the past outcomes were entered into an ANOVA with group (d-cycloserine vs placebo) as a between-subject factor and time (1, 2, 3, 4, or 5 trials in the past) and valence (reward or loss) as within-subject factors.
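The regressors for this analysis can be laid out as one row per trial. The sketch below only builds the design matrix; it does not fit or normalize anything. Dropping the first five trials (which lack a full outcome history), the left-minus-right coding, and the function name are assumptions made for illustration.

```python
def build_design_matrix(reward_mag_diff, loss_mag_diff,
                        reward_out_diff, loss_out_diff, n_lags=5):
    """Build one row of regressors per trial t >= n_lags.

    All inputs are per-trial differences between the two options (left - right).
    Row layout: [reward magnitude diff, loss magnitude diff,
                 reward outcome diffs at lags 1..n_lags,
                 loss outcome diffs at lags 1..n_lags]
    """
    rows = []
    for t in range(n_lags, len(reward_mag_diff)):
        row = [reward_mag_diff[t], loss_mag_diff[t]]
        row += [reward_out_diff[t - lag] for lag in range(1, n_lags + 1)]
        row += [loss_out_diff[t - lag] for lag in range(1, n_lags + 1)]
        rows.append(row)
    return rows


# Toy data for seven trials; outcome differences take values -1, 0, or 1.
reward_mag_diff = [10, -20, 5, 40, -15, 30, -5]
loss_mag_diff = [-5, 15, 25, -30, 10, -10, 20]
reward_out_diff = [1, 0, -1, 1, 0, 1, -1]
loss_out_diff = [0, 1, 0, -1, 1, 0, 0]

X = build_design_matrix(reward_mag_diff, loss_mag_diff, reward_out_diff, loss_out_diff)
print(len(X), "rows of", len(X[0]), "regressors")
```

Each participant's choices (left vs right) would then be regressed on these rows, and the ten outcome weights (five lags × two valences) entered into the ANOVA described above.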

Modeling

To look at the learning effects more specifically and assess participants' strategies for the integration of information to make decisions, we used reinforcement-learning models to fit each participant's trial-by-trial behavior.

Each model consisted of three main components. First, the model had estimates about the probabilities underlying the outcomes of both options. These were updated on every trial using a reinforcement-learning algorithm. Second, the probability estimates were integrated with the explicitly cued magnitudes to calculate how valuable each of the two options was (ie, their utility). Third, these two utilities were compared to predict participants' choices. To determine the best parameter estimates for every participant, we used a standard log-likelihood maximization procedure.

When calculating how valuable each option is, participants might use different decision strategies for integrating learnt probabilities with explicit magnitudes. They could use a mathematically optimal strategy (utility as probability × magnitude). However, as this optimal strategy may be quite cognitively taxing, they could resort to a heuristic strategy (utility as a weighted sum of probability and magnitude). To test which decision strategy participants used, we fitted different models to the data. To test for differences in strategy, we then compared how well each of these models explained the groups' behavior. In addition, we fitted a third model, which directly estimated to what degree they used either decision strategy.

Optimal model

This model assumed that participants integrated the learnt probabilities optimally with the explicit magnitudes (magnitude × probability). The learning of probabilities was modeled using a standard reinforcement-learning rule. On each trial, the estimated probability of an attribute (eg, the probability that the left option yields a reward) was updated based on the trial's outcome, as a function of the prediction error (PE):

$$\hat{p}_{t+1} = \hat{p}_t + \alpha \cdot \mathrm{PE}_t$$

with

$$\mathrm{PE}_t = o_t - \hat{p}_t$$

where $\hat{p}_t$ is the estimated probability on trial $t$, $o_t$ is the observed outcome (1 if the win or loss occurred, 0 otherwise), and α is the learning rate. Thus, the learning rate is a measure of how much participants updated their probability estimate when the outcome associated with an attribute differed from their expectation. Separate learning rates were used for learning about wins and losses.
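The effect of the learning rate is easiest to see by tracking an estimate over a few outcomes. The following sketch implements the prediction-error update described above; the starting estimate of 0.5 and the function names are assumptions made for illustration.

```python
def update_probability(estimate: float, outcome: int, alpha: float) -> float:
    """Move the estimate towards the outcome (0 or 1) by a fraction alpha of the prediction error."""
    prediction_error = outcome - estimate
    return estimate + alpha * prediction_error


def track(outcomes, alpha: float, initial: float = 0.5):
    """Return the sequence of estimates, starting from `initial` (an assumed prior)."""
    estimates = [initial]
    for outcome in outcomes:
        estimates.append(update_probability(estimates[-1], outcome, alpha))
    return estimates


outcomes = [1, 1, 0, 1, 1, 1, 0, 1]
for alpha in (0.1, 0.5):
    print(alpha, [round(p, 3) for p in track(outcomes, alpha)])
```

A higher learning rate makes the estimate follow recent outcomes more closely, at the cost of larger swings after a single surprising outcome.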

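These predictions were combined optimally with the shown magnitudes:

$$U_{\text{reward}} = \hat{p}_{\text{reward}} \times m_{\text{reward}}$$

where $\hat{p}_{\text{reward}}$ is the learnt reward probability estimate and $m_{\text{reward}}$ is the explicitly cued reward magnitude. The loss utility was computed in the same way and combined with the reward utility into an overall utility, with a parameter λ determining how much participants weighted the prospect of rewards vs losses.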

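A standard soft-max decision rule was used to predict the probability of choosing an option:

$$P(\text{choose } A) = \frac{e^{\beta U_A}}{e^{\beta U_A} + e^{\beta U_B}}$$

where $U_A$ and $U_B$ are the overall utilities of the two options and β reflects a participant's ability to pick the option with higher utility.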

To assess the effect of d-cycloserine on learning about wins and losses, we compared their respective learning rates between groups.
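The three components of the optimal model (learning, utility computation, soft-max choice) can be combined into a log-likelihood for one participant's choices, which is the quantity a fitting routine would maximize. The sketch below is illustrative only and is not the authors' Matlab implementation. In particular, the way λ combines reward and loss utility (`lam * u_reward - (1 - lam) * u_loss`), the starting estimates of 0.5, and the trial data layout are assumptions.

```python
import math


def optimal_model_log_likelihood(trials, alpha_reward, alpha_loss, lam, beta):
    """Log-likelihood of one participant's choices under the optimal model.

    Each trial is a dict holding (left, right) pairs for reward_mag, loss_mag,
    reward_out and loss_out (outcomes coded 0/1), plus choice (0 = left, 1 = right).
    """
    p_reward = [0.5, 0.5]  # assumed starting estimates
    p_loss = [0.5, 0.5]
    log_lik = 0.0
    for trial in trials:
        utility = []
        for side in (0, 1):
            u_reward = p_reward[side] * trial["reward_mag"][side]
            u_loss = p_loss[side] * trial["loss_mag"][side]
            # Assumed combination rule: lam trades off rewards against losses.
            utility.append(lam * u_reward - (1.0 - lam) * u_loss)
        # Two-option soft-max, written as a logistic function of the utility difference.
        x = beta * (utility[0] - utility[1])
        x = max(-700.0, min(700.0, x))  # keep math.exp within floating-point range
        p_left = 1.0 / (1.0 + math.exp(-x))
        p_choice = p_left if trial["choice"] == 0 else 1.0 - p_left
        log_lik += math.log(max(p_choice, 1e-300))
        # Outcomes of both options were shown, so both sets of estimates are updated.
        for side in (0, 1):
            p_reward[side] += alpha_reward * (trial["reward_out"][side] - p_reward[side])
            p_loss[side] += alpha_loss * (trial["loss_out"][side] - p_loss[side])
    return log_lik


example_trials = [
    {"reward_mag": (90, 10), "loss_mag": (10, 90),
     "reward_out": (1, 0), "loss_out": (0, 1), "choice": 0},
    {"reward_mag": (20, 70), "loss_mag": (60, 15),
     "reward_out": (0, 1), "loss_out": (1, 0), "choice": 1},
]
print(optimal_model_log_likelihood(example_trials, 0.3, 0.3, 0.5, 0.1))
```

With β = 0 every choice has probability 0.5 regardless of utility, so the log-likelihood reduces to the number of trials times ln 0.5; this is a convenient sanity check for any implementation.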

Heuristic model

This model differed from the optimal model only in the decision strategy for integrating learnt probabilities with explicit magnitudes. Instead of being a product of probability and magnitude, utility was computed as a weighted sum:

$$U_{\text{reward}} = \mu \cdot \hat{p}_{\text{reward}} + (1-\mu) \cdot m_{\text{reward}}$$

where μ is the probability weighting factor, describing the relative importance of the learnt probability compared to the explicit magnitude. The loss utility was computed in the same way, sharing the same μ.

Again, we compared the groups' learning rates. Because a change in learning could also manifest as a changed reliance on learnt compared to explicit information, we also compared the probability weighting factors between the groups.
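To see how a weighted-sum strategy can lead to different choices than the multiplicative one, consider the sketch below. Rescaling magnitudes to the 0–1 range so that they are commensurate with probabilities, the exact form `mu * p + (1 - mu) * m`, and the function names are assumptions made for this example.

```python
def optimal_utility(probability: float, magnitude: float, max_magnitude: float = 100.0) -> float:
    """Multiplicative utility: probability x (rescaled) magnitude."""
    return probability * (magnitude / max_magnitude)


def heuristic_utility(probability: float, magnitude: float, mu: float,
                      max_magnitude: float = 100.0) -> float:
    """Weighted sum of probability and (rescaled) magnitude; mu weights the probability."""
    return mu * probability + (1.0 - mu) * (magnitude / max_magnitude)


# Option A: large but unlikely reward. Option B: moderate, more likely reward.
a = {"probability": 0.2, "magnitude": 100}
b = {"probability": 0.5, "magnitude": 50}

print("optimal:  ", optimal_utility(**a), optimal_utility(**b))
print("heuristic:", heuristic_utility(**a, mu=0.5), heuristic_utility(**b, mu=0.5))
```

In this example the multiplicative rule prefers option B while the additive rule with μ = 0.5 prefers option A, so the two strategies make different predictions on at least some trials.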

Hybrid model

We used the hybrid model to examine whether d-cycloserine affected how participants integrated information, shifting them towards a more optimal, less heuristic decision strategy. The hybrid model computed utility as a weighted sum of the utilities from the optimal and the heuristic model:

$$U_{\text{hybrid}} = \omega \cdot U_{\text{heuristic}} + (1-\omega) \cdot U_{\text{optimal}}$$

where ω is the heuristic weight factor, determining how much the overall utility resembles the heuristic or the optimal utility. The higher ω (between 0 and 1), the more a participant relied on a heuristic decision rule.
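A minimal sketch of the hybrid utility follows. As before, magnitudes are rescaled to 0–1 and the additive form of the heuristic utility is an assumption; only the mixing by ω is taken from the description above.

```python
def hybrid_utility(probability: float, magnitude: float, mu: float, omega: float,
                   max_magnitude: float = 100.0) -> float:
    """Mix of heuristic (weight omega) and optimal (weight 1 - omega) utility."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie between 0 and 1")
    m = magnitude / max_magnitude                       # assumed rescaling to 0-1
    optimal = probability * m                           # multiplicative integration
    heuristic = mu * probability + (1.0 - mu) * m       # assumed additive form
    return omega * heuristic + (1.0 - omega) * optimal


for omega in (0.0, 0.5, 1.0):
    print(omega, hybrid_utility(probability=0.8, magnitude=50, mu=0.5, omega=omega))
```

At ω = 0 the function reduces to the optimal utility and at ω = 1 to the heuristic utility, so a lower fitted ω in one group would indicate more multiplicative integration.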

Model Comparison

If the groups differ in their decision strategy, this should also be reflected in how well the models incorporating the different strategies can explain behavior. To assess this, we compared the model fits using the Akaike information criterion (AIC). We calculated for each participant the AIC differences between the optimal and the heuristic model and also between the heuristic and the hybrid model. This allowed us to compare whether the d-cycloserine group differed from the placebo group in how well their behavior was explained by the heuristic relative to the optimal model and by the heuristic relative to the hybrid model.
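The AIC is computed from a model's maximized log-likelihood and its number of free parameters as AIC = 2k − 2 ln L, with lower values indicating a better fit after penalizing complexity. The sketch below computes per-participant AIC differences; the parameter counts (4 for the optimal model, 5 for the heuristic model) are inferred from the model descriptions and the log-likelihood values are made up for illustration.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def aic_difference(ll_a: float, k_a: int, ll_b: float, k_b: int) -> float:
    """AIC of model A minus AIC of model B; positive values favour model B."""
    return aic(ll_a, k_a) - aic(ll_b, k_b)


# Hypothetical fits for one participant.
ll_optimal, k_optimal = -200.0, 4      # two learning rates, lambda, beta
ll_heuristic, k_heuristic = -195.0, 5  # as above, plus mu

print(aic_difference(ll_optimal, k_optimal, ll_heuristic, k_heuristic))
```

These per-participant difference scores are what would then be compared between the d-cycloserine and placebo groups.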

To confirm the results from the modeling analysis using a different method, we performed an additional regression analysis with regressors analogous to the components of the hybrid model. We included regressors for the explicit magnitude differences, for the learnt probability estimate differences, as well as for the difference in optimal utilities (magnitude × probability). The probability estimates were obtained using a Bayesian learner, like the one described in Behrens et al, 2007; also see Supplementary Methods. These Bayesian probability estimates are the most accurate estimates a participant could have given the past outcomes (Figure 1d). If participants' decision strategy is more heuristic, the main effects should have a larger impact on behavior. Conversely, if the decision strategy is more optimal, the interaction term (magnitude × probability) should have a higher impact.

Learning About the Unchosen Option

There is some evidence (Boorman et al, 2011) that different brain areas are used to learn about the chosen and the unchosen option, so learning about the two could be affected differently by d-cycloserine. We found that d-cycloserine did not affect the usage of the unchosen option's outcomes for decision-making or learning (see Supplementary Methods and Results).

RESULTS

General Performance

In the task, participants had to constantly track the independent win and loss probabilities of two options and integrate these with explicitly cued reward and loss magnitudes. The groups did not differ in the overall earnings or in the mean points won/lost (Table 2a).

Table 2a. Results of a General Behavioral Analysis.

Overall behavior	Pla	DCS	P
Total money won	12.9±1.1	13.0±0.9	0.66
Mean points won	35.4±1.1	35.4±1.1	0.89
Mean points lost	21.8±1.4	21.6±0.3	0.67
Mean reward magnitude chosen	60.7±0.4	60.7±0.6	0.92
Mean loss magnitude chosen	48.0±0.5	48.0±0.5	0.96
Mean reward gamble outcome	0.6±0.0	0.6±0.0	0.9
Mean loss gamble outcome	0.4±0.0	0.4±0.0	0.86

Logistic Regression

As a measure of learning, we assessed the impact of past reward/loss outcomes on choice, using a logistic regression analysis. We also included the currently displayed reward/loss magnitudes. Participants were more likely to pick options with higher reward (t(46)=18.6, P=5 × 10⁻²³) and lower loss magnitudes (t(46)=−14.9, P=3 × 10⁻¹⁹). They also chose options more frequently when they were associated with more past wins and fewer losses (Figure 2a), thus suggesting that they were able to learn from past outcomes.

Figure 2. (a) Decision weights (beta) for placebo (white) and d-cycloserine (gray), showing the decision impact of current magnitude differences (left–right) and past gamble outcome differences (left–right), for one to five trials in the past, on choice. (b) Decision weights (beta) based on a regression using shown magnitude differences, probability prediction differences (estimated using a Bayesian model), and their interaction for both groups. (c) AIC difference scores comparing the relative fit of the hybrid to the heuristic and of the heuristic to the optimal model in both groups. Error bars indicate SE. ⁺P=0.056, *P≤0.05, **P<0.001, ****P<10⁻⁴.

To test whether learning differed between the groups, possibly depending on reward and loss valence, we ran a 2 (group) × 5 (time point) × 2 (valence) ANOVA on the regression weights of the past outcomes (Figure 2a). That participants learnt the reward/loss probabilities over time was evidenced by the fact that recent reward/loss outcomes influenced choices more than outcomes further in the past (main effect of time, F(4,180)=80.6, P<10⁻⁶). This effect of time was stronger for losses than for rewards (interaction effect, time × valence, F(4,180)=6.2, P=3.8 × 10⁻⁴), suggesting that loss probabilities were learnt more quickly.

Importantly, the groups neither differed in overall learning speed (time × group: F(4,180)=0.9, P=0.42), nor in their relative learning speeds for wins and losses (time × valence × group, F(4,180)=0.3, P=0.89).