Abstract
Gaining a better understanding of the biological mechanisms underlying the individual variation observed in response to rewards and reward cues could help to identify and treat individuals more prone to disorders of impulse control, such as addiction. Variation in response to reward cues is captured in rats undergoing autoshaping experiments where the appearance of a lever precedes food delivery. Although no response is required for food to be delivered, some rats (goal-trackers) learn to approach and avidly engage the magazine until food delivery, whereas other rats (sign-trackers) come to approach and avidly engage the lever. The impulsive and often maladaptive characteristics of the latter response are reminiscent of addictive behaviour in humans. In a previous article, we developed a computational model accounting for a set of experimental data regarding sign-trackers and goal-trackers. Here we show new simulations of the model to draw experimental predictions that could help further validate or refute the model. In particular, we apply the model to new experimental protocols such as injecting flupentixol locally into the core of the nucleus accumbens rather than systemically, and lesioning the core of the nucleus accumbens before or after conditioning. In addition, we discuss the possibility of removing the food magazine during the inter-trial interval. The predictions from this revised model will help us better understand the role of different brain regions in the behaviours expressed by sign-trackers and goal-trackers.
Keywords: Reinforcement Learning, Dopamine, Pavlovian conditioning, Autoshaping, Model-based, Model-free, Factored representation, Sign-tracker, Goal-tracker, Conditioned approach
1. Introduction
A significant number of models have been developed since the 1970s to describe Pavlovian and instrumental phenomena. Early models mostly focused on reproducing the averaged behaviour expressed within a population, neglecting inter-individual variations and possibly smoothing the true behaviour of individuals (Gallistel et al., 2004), or even masking the variation in behaviour. However, this variation is of particular interest when trying to identify those individuals within a population who are prone to impulsive behaviours or have a higher risk of addiction (Flagel et al., 2011a; Saunders and Robinson, 2013; Huys et al., in press).
Recent studies have investigated such inter-individual variability among rats undergoing an autoshaping experiment (Flagel et al., 2007, 2009, 2011a,b; DiFeliceantonio and Berridge, 2012; Mahler and Berridge, 2009; Robinson and Flagel, 2009; Meyer et al., 2012; Fitzpatrick et al., 2013), where a lever (conditioned stimulus, CS) was presented for 8 seconds, followed immediately by delivery of a food pellet (unconditioned stimulus, US) into an adjacent food magazine. Although no response was required to receive the reward, with training, some rats (sign-trackers; STs) learned to rapidly approach and engage the lever-CS. However, others (goal-trackers; GTs) learned to approach the food magazine upon CS presentation, and made anticipatory head entries into it. Some rats (intermediate group; IG) presented a mixed behaviour, switching between lever and magazine during presentation of the CS, and sometimes engaging both during one trial. Furthermore, in STs, phasic dopamine release in the core of the nucleus accumbens, measured with Fast Scan Cyclic Voltammetry (FSCV), matched the pattern that would be predicted by reward prediction error (RPE) signalling, and dopamine was necessary for the acquisition of a sign-tracking conditioned response (CR). In contrast, although GTs acquired a Pavlovian conditioned approach response, this was not accompanied by the expected RPE-like dopamine signal, nor was the acquisition of a goal-tracking CR blocked by administration of a dopamine antagonist (see also Danna and Elmer (2010)). While the proportion of STs and GTs in the population varies (Fitzpatrick et al., 2013), both phenotypes are typically represented in an outbred population.
To our knowledge, only one model (Lesaint et al., 2014) accounts for these experimental results and has been validated with existing data. This model is built on a combination of model-free and model-based systems (Daw et al., 2005; Clark et al., 2012; Huys et al., in press) and extended with state factored representations. Combining multiple systems enables the model to express a large repertoire of behaviours and considering features within states enables the model to learn Pavlovian impetuses (Dayan et al., 2006) specific to the Pavlovian features within the task.
In this paper, we review the model described by Lesaint et al. (2014), extending it with a new tool to improve its reliability. We suggest new experimental protocols and some new analyses of the data that would further validate the model and strengthen its explanatory power, refine our understanding of the role of the nucleus accumbens in the described behaviours, and help clarify the impact of some choices made in the original protocol.
2. Material and methods
The model from which the present results are extracted is described in depth in a previous article (Lesaint et al., 2014). It is composed of two distinct reinforcement learning systems that collaborate to define the action to be selected at each step of the experiment (see Figure 1 A; Clark et al. (2012)).
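The first system, a model-based system (MB), incrementally learns a model of the world (a transition function T and a reward function R) from which it infers an action-value Q(s, a) and an advantage A(s, a) for each action in each situation, using the following classical formulas:
(1) $Q(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q(s',a')$
(2) $A(s,a) \leftarrow Q(s,a) - \max_{a'} Q(s,a')$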
where the discount rate 0 ≤ γ ≤ 1 classically represents the preference for immediate versus distant rewards. At each step, the most valued action is the most rewarding in the long run (e.g. approaching the magazine to be ready to consume the food as soon as it is delivered). The MB system favours goal-tracking because this is the shortest path towards the rewarding state (see Figure 1 B).
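As an illustration of how such a model-based system can turn a learned transition function and reward function into action values, here is a minimal Python sketch. The four-state toy trial, the array layout and the function name `model_based_values` are assumptions made for this example only; they are not the MDP of Figure 1 nor the implementation used for the simulations.

```python
import numpy as np

def model_based_values(T, R, gamma, n_sweeps=100):
    """Iterate Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a')."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(n_sweeps):
        # T has shape (states, actions, states); Q.max(axis=1) has shape (states,)
        Q = R + gamma * (T @ Q.max(axis=1))
    # Advantage of each action relative to the best action in the same state
    advantage = Q - Q.max(axis=1, keepdims=True)
    return Q, advantage

# Hypothetical toy trial (an assumption, not the MDP of Figure 1):
# states:  0 = lever appears, 1 = at magazine, 2 = at lever, 3 = trial over
# actions: 0 = go to magazine, 1 = go to lever
T = np.zeros((4, 2, 4))
R = np.zeros((4, 2))
T[0, 0, 1] = 1.0   # approaching the magazine
T[0, 1, 2] = 1.0   # approaching the lever
T[1, :, 3] = 1.0   # food is eaten at the magazine, the trial ends
R[1, :] = 1.0
T[2, :, 1] = 1.0   # from the lever, the rat still has to reach the magazine
T[3, :, 3] = 1.0   # absorbing end state

Q, A = model_based_values(T, R, gamma=0.9)
print("Q at lever appearance:", Q[0])
print("Advantage at lever appearance:", A[0])
```

In this toy case the detour through the lever costs one extra discounted step, which is why going straight to the magazine is valued more highly.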
The second system, a revised model-free system, learns values over features (e.g. food, lever or magazine). Contrary to the first system, which uses a classical abstract state representation, it relies on the features that compose these abstract states. In traditional reinforcement learning, each situation that can be encountered by the agent is defined as an abstract state (e.g. arbitrarily defined as s1, s2 … sx), such that similarities between situations (e.g. presence of a magazine) are lost. By using features, we reintroduce the capacity to use and benefit from these similarities. The second system is further defined as the feature model-free system (FMF). It relies on an RPE signal δ, computed and used as follows:
(3) $\delta \leftarrow r + \gamma \max_{a'} V(c(s',a')) - V(c(s,a))$, $\quad V(c(s,a)) \leftarrow V(c(s,a)) + \alpha\,\delta$
where c : S × A → {lever, magazine, food, ∅} is a feature-function that returns the feature c(s, a) that the action a was focusing on in state s (e.g. it returns the lever when the action was to engage with the lever), V is the value attributed to a feature and α is the learning rate. We hypothesized that, similarly to classical model-free systems, δ parallels phasic dopaminergic activity (Schultz, 1998). This signal makes it possible to revise and attribute values, seen as motivational, to features without the need for the internal model of the world used by the MB system. When an event is fully expected, there should be no RPE as its value is fully anticipated. When an event is positively surprising, there should be a positive RPE. Actions are then valued by the motivational value of the feature they are focusing on (e.g. engaging with the lever would be valued given the general motivational value of the lever). Hence, the FMF system favours actions that engage with the most motivational features. This may favour actions that are suboptimal with regard to maximizing rewards (e.g. engaging with the lever keeps the rat away from the soon to be rewarded magazine). The FMF system favours sign-tracking (a suboptimal path, see Figure 1 B) as the lever, being a full predictor of reward, earns a strong motivational value relative to the magazine.
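The feature-based update can be sketched in a few lines of Python. This is a minimal illustration under assumptions: values are stored in a dictionary keyed by feature name, the function name `fmf_update` is hypothetical, and the demo only shows a single feature (food) that is followed by one unit of reward and then the end of the trial.

```python
def fmf_update(V, feature, next_features, reward, alpha, gamma):
    """Apply one feature-model-free update and return the RPE (delta).

    V             : dict mapping a feature name to its motivational value
    feature       : feature the chosen action was focusing on
    next_features : features reachable by the actions of the next state
                    (empty when the trial ends)
    """
    best_next = max((V[f] for f in next_features), default=0.0)
    delta = reward + gamma * best_next - V[feature]
    V[feature] += alpha * delta
    return delta

# Demo (assumed values): the RPE for a reward that is always delivered
# shrinks as the reward becomes fully expected.
V = {"lever": 0.0, "magazine": 0.0, "food": 0.0}
deltas = [fmf_update(V, "food", [], reward=1.0, alpha=0.2, gamma=0.9)
          for _ in range(50)]
print("first RPE:", deltas[0], "last RPE:", deltas[-1])
```

The first update is maximally surprising, while after repeated pairings the RPE is close to zero, which matches the statement that a fully expected event produces no RPE.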
The model does not base its decision on a single system at a time. Rather, the advantage A(s, a) computed by the MB system and the feature value V(c(s, a)) computed by the FMF system are integrated through a weighted sum, so that a single decision is made at each time step through a form of cooperation between the two systems. The integrated values are then passed to a softmax action selection mechanism that converts them into probabilities of selecting each action in a given situation (see Figure 1 A). The integration is done as follows:
(4) $P(s,a) = (1-\omega)\,A(s,a) + \omega\,V(c(s,a))$
where 0 ≤ ω ≤ 1 is a combination parameter which defines the importance of each system in the overall model. Varying ω (while leaving the other parameters of the model unchanged) is sufficient to reproduce the characteristics of the different subgroups of rats (Lesaint et al., 2014). The previous experimental data could be reproduced by having STs give a stronger weight to the FMF system and GTs give a stronger weight to the MB system. The FMF and MB systems are then both updated according to the action a taken by the full model in state s (even if the systems would individually have favoured different actions), the resulting new state s’ and the retrieved reward r, as previously done in other computational models involving cooperation between model-free and model-based systems (Caluwaerts et al., 2012).
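The effect of the combination parameter can be illustrated with a short Python sketch of a weighted sum followed by a softmax. The numerical values, the two-action layout and the function name `action_probabilities` are assumptions chosen for illustration, and the temperature is assumed to divide the integrated values.

```python
import numpy as np

def action_probabilities(advantages, feature_values, omega, beta):
    """Weighted sum of MB and FMF values, then softmax with temperature beta."""
    integrated = (1.0 - omega) * np.asarray(advantages) \
                 + omega * np.asarray(feature_values)
    z = integrated / beta
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Assumed values for two actions: [go to magazine, go to lever]
advantages = [0.0, -0.09]      # MB system: the magazine is the shortest path
feature_values = [0.3, 0.9]    # FMF system: the lever has a higher value

p_fmf_dominant = action_probabilities(advantages, feature_values, omega=0.9, beta=0.1)
p_mb_dominant = action_probabilities(advantages, feature_values, omega=0.05, beta=0.1)
print("omega = 0.90:", p_fmf_dominant)
print("omega = 0.05:", p_mb_dominant)
```

With these assumed numbers, a high ω makes the lever the preferred target and a low ω makes the magazine the preferred target, without changing anything else.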
2.1. Simulations of experimental protocols
The experiment is described through an episodic Markov Decision Process (MDP) that represents one trial of the session (see Figure 1 B,C). The inter-trial interval (ITI), not being part of the MDP, is simulated between each run by revising the magazine value downward (the size of this revision, the ITI impact parameter uITI, being a parameter of the model). This simulates the hypothesis that the presence of the magazine in the absence of food delivery reduces its value. If the magazine were removed during the ITI, we would expect no revision of its value.
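The consequence of this between-trial revision can be explored with a deliberately simplified Python sketch. Several assumptions are made here and are not taken from the model description: the revision is multiplicative, V(magazine) ← (1 − u_iti) × V(magazine); each feature is treated as directly followed by one unit of reward; and the function name `simulate_values` is hypothetical.

```python
def simulate_values(n_trials, alpha, u_iti, magazine_present_in_iti=True):
    """Track lever and magazine values across trials with an ITI revision."""
    v_lever, v_magazine = 0.0, 0.0
    rpe_at_magazine = 0.0
    for _ in range(n_trials):
        # Simplification: each feature is directly followed by a reward of 1.
        v_lever += alpha * (1.0 - v_lever)
        rpe_at_magazine = 1.0 - v_magazine
        v_magazine += alpha * rpe_at_magazine
        if magazine_present_in_iti:
            # Assumed multiplicative form of the downward revision
            v_magazine *= (1.0 - u_iti)
    return v_lever, v_magazine, rpe_at_magazine

v_lev, v_mag, rpe = simulate_values(200, alpha=0.2, u_iti=0.1)
v_lev_rm, v_mag_rm, rpe_rm = simulate_values(
    200, alpha=0.2, u_iti=0.1, magazine_present_in_iti=False)
print("magazine present in ITI:", v_lev, v_mag, rpe)
print("magazine removed in ITI:", v_lev_rm, v_mag_rm, rpe_rm)
```

Under these assumptions the magazine value settles below the lever value and a positive RPE persists at reward time, whereas without the revision both values converge to the same level and the RPE vanishes.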
The model is used to simulate experiments that involved injections of flupentixol, an antagonist of dopamine, either systemically or within the core of the nucleus accumbens. In the case of local injections, assuming that the FMF system relies on the core of the nucleus accumbens, we simulate the impact of flupentixol on phasic dopamine by degrading the reward prediction errors as follows:
(5) |
where 0 ≤ f < 1 represents the impact of flupentixol. Its effect is defined such that flupentixol injections cannot lead to negative learning when the RPE is positive, but can at most block it. In the case of systemic injections, we also assume an additional impact on tonic dopamine (Humphries et al., 2012), which affects the action selection process. We simulate this impact by revising the temperature parameter. Hence, flupentixol favours random exploration over the use of learned values to make a decision.
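One way to satisfy the constraints stated above can be sketched in Python as follows. The exact functional forms are assumptions made for illustration: a positive RPE is reduced by f and floored at zero, a negative RPE is left untouched, and the systemic effect on action selection is modelled by dividing the temperature by (1 − f). The function names are hypothetical.

```python
def degrade_rpe(delta, f):
    """Reduce a positive RPE by f without ever making it negative (assumed form)."""
    if not 0.0 <= f < 1.0:
        raise ValueError("f must satisfy 0 <= f < 1")
    if delta > 0.0:
        return max(delta - f, 0.0)   # at most blocks learning, never reverses it
    return delta                     # assumption: negative RPEs are unchanged

def systemic_temperature(beta, f):
    """Raise the softmax temperature under systemic injection (assumed form)."""
    if not 0.0 <= f < 1.0:
        raise ValueError("f must satisfy 0 <= f < 1")
    return beta / (1.0 - f)

print(degrade_rpe(0.5, 0.3))           # partially reduced
print(degrade_rpe(0.2, 0.5))           # fully blocked
print(systemic_temperature(0.1, 0.5))  # higher temperature, more exploration
```

A higher temperature flattens the softmax distribution, which is one simple way to obtain the near-random action selection described for systemic injections.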
Some of the predictions presented here involve lesioning the core of the nucleus accumbens. Such a lesion is simulated by removing the FMF system from the model, i.e. all values that would have come from this system are replaced by 0. The rest of the model is left intact. Equation 4, with its FMF term set to 0 so that only the MB contribution A(s, a) remains, then reduces to:
(6) $P(s,a) = (1-\omega)\,A(s,a)$
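The simulated lesion can be illustrated with the following Python sketch, in which the FMF contribution is replaced by zeros before the softmax. The numerical values, the two-action layout and the function names are assumptions for illustration only.

```python
import math

def integrated_values(advantages, feature_values, omega, lesioned=False):
    """Weighted sum of MB and FMF values; a lesion replaces FMF values by 0."""
    fmf = [0.0] * len(feature_values) if lesioned else feature_values
    return [(1.0 - omega) * a + omega * v for a, v in zip(advantages, fmf)]

def softmax(values, beta):
    """Softmax with temperature beta (shifted by the max for stability)."""
    m = max(values)
    exps = [math.exp((v - m) / beta) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed values for two actions: [go to magazine, go to lever]
advantages = [0.0, -0.09]
feature_values = [0.3, 0.9]

p_st_intact = softmax(integrated_values(advantages, feature_values, 0.9), 0.1)
p_st_lesion = softmax(integrated_values(advantages, feature_values, 0.9, True), 0.1)
p_gt_lesion = softmax(integrated_values(advantages, feature_values, 0.05, True), 0.1)
print("high omega, intact:  ", p_st_intact)
print("high omega, lesioned:", p_st_lesion)
print("low omega, lesioned: ", p_gt_lesion)
```

With a high ω, the remaining MB term is scaled by a small factor (1 − ω), so the lesioned agent chooses almost indifferently between lever and magazine, while an agent with a low ω keeps a clear preference for the magazine.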
2.2. Index Score
Introduced by Meyer et al. (2012), the Pavlovian Conditioned Approach (PCA) Index Score provides a metric to categorize rats as STs, GTs or IGs independently of the rest of the population. That is, instead of ordering rats based on their engagement with the lever and splitting the population into 3 groups of approximately equal size, as done in previous studies (Flagel et al., 2007; Robinson and Flagel, 2009), classifying rats based on the PCA Index minimizes the chances of misclassification and allows one to compare across studies or populations of rats. The PCA Index relies on the number of contacts with the lever and the magazine, the probability of engaging with one versus the other and the latencies to act towards each (Table 1 in Meyer et al., 2012).
Table 1. Formulas for deriving the Index Score.
Response Bias(n) = (LeverPresses−MagazineEntries)/(LeverPresses + MagazineEntries) |
Probability Difference(n) = p(LeverPress)−p(MagazineEntry) |
Score(n) = [ResponseBias(n) + ProbabilityDifference(n)]/2 |
Index Score = [Score(6) + Score(7)]/2 |
We developed a similar Index Score as it provides a good metric for some of the predictions described here. Simulated rats whose score is > 0.5 are defined as STs. Simulated rats that have a score < −0.5 are defined as GTs. Remaining rats are defined as IGs. Table 1 explains how it is computed based on the last two sessions of simulations. Contrary to the PCA Index Score, it cannot use latencies as they are not accounted for by the model.
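The formulas of Table 1 and the classification thresholds can be written as a short Python sketch. The function names and the example counts are hypothetical, and returning 0 for the response bias when a simulated rat made no contact at all is an assumption added to avoid a division by zero.

```python
def session_score(lever_presses, magazine_entries, p_lever, p_magazine):
    """Score(n) = [ResponseBias(n) + ProbabilityDifference(n)] / 2."""
    total = lever_presses + magazine_entries
    # Assumption: a session without any contact has a neutral response bias.
    response_bias = (lever_presses - magazine_entries) / total if total else 0.0
    probability_difference = p_lever - p_magazine
    return (response_bias + probability_difference) / 2.0

def index_score(score_session_6, score_session_7):
    """Index Score = [Score(6) + Score(7)] / 2."""
    return (score_session_6 + score_session_7) / 2.0

def classify(index):
    """Apply the thresholds used for simulated rats."""
    if index > 0.5:
        return "ST"
    if index < -0.5:
        return "GT"
    return "IG"

# Hypothetical counts for one simulated rat over the last two sessions
s6 = session_score(40, 10, p_lever=0.9, p_magazine=0.1)
s7 = session_score(45, 5, p_lever=0.95, p_magazine=0.05)
idx = index_score(s6, s7)
print(idx, classify(idx))
```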
2.3. Estimation of model parameters
The model relies on a set of 8 parameters (a shared learning rate, a shared discount rate, a selection temperature, an integration parameter, an ITI impact parameter and 3 initial conditions) that need to be tuned for simulations to fit experimental data. We use the multi-objective algorithm NSGA-II (Deb et al., 2002; Mouret and Doncieux, 2010) to find the best values (solutions) for the parameters. This method is an efficient tool for exploring the high-dimensional parameter space while limiting the risk of getting trapped in local minima.
As in Lesaint et al. (2014), we search for a set of parameter values per group. The first two objectives of the fitness function are to fit the averaged behaviours of the simulated group to the averaged behaviours of the experimental group. More formally, for each group, we try to minimize the least-squares error between the probabilities of rats and simulated rats to engage with the magazine and the lever over time (see Table 2). This results in multiple solutions that are compromises between these two objectives. We subsequently select one of the solutions that is visually acceptable (no misclassification, and a good compromise between the two other criteria).
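The notion of solutions that are compromises between objectives can be made concrete with a small Pareto filter in Python. This is not NSGA-II itself, only a sketch of the non-dominance test on which such multi-objective methods rely; the error values and function names are made up for illustration.

```python
def dominates(a, b):
    """True if solution a is at least as good as b everywhere and better once.

    Both arguments are tuples of errors to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Hypothetical (magazine error, lever error) pairs for five parameter sets
errors = [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0), (3.0, 3.0), (4.0, 4.0)]
front = pareto_front(errors)
print(front)
```

Each remaining pair is a different trade-off between the two fits, which is why a final selection step among them is still needed.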
Table 2. Revised fitness function.
Objective | Formula |
---|---|
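Best fit magazine engagement | $\min \sum_{t} \left(p^{\text{rats}}_{t}(\text{magazine}) - p^{\text{sim}}_{t}(\text{magazine})\right)^2$ |
Best fit lever engagement | $\min \sum_{t} \left(p^{\text{rats}}_{t}(\text{lever}) - p^{\text{sim}}_{t}(\text{lever})\right)^2$ |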
Penalize parameters that lead to misclassification | $\min \sum_{a_i \in \text{animals}} \left| \text{refPCA}(\text{Group}) - \text{IndexScore}_{a_i}(\text{Group}) \right|$ |
We noticed, however, that without further constraints, as we are fitting averaged data, some of the resulting solutions could induce great variability of behaviour within a group, leading to misclassification. For example, a simulated rat classified as a GT by its parameters could have behaved as an ST and gone undetected, as its behaviour would have been diluted in the averaged behaviours of the simulated GT group.
The fitness function was therefore extended with a new criterion based on the Index Score (see Table 2), to favour sets of parameter values that lead to groups of rats without such errors, and hence without strong inter-individual variability. This is consistent with experimental data (Meyer et al., 2012). The resulting new sets of parameter values (see Table 3) did not affect the explanatory power of the model.
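The following Python sketch shows why a criterion based on individual Index Scores can detect what an averaged fit hides. The reference value, the individual scores and the function names are assumptions chosen for illustration.

```python
def least_squares_error(observed, simulated):
    """Sum of squared differences between two series of probabilities."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated))

def misclassification_penalty(reference_index, index_scores):
    """Sum over simulated animals of |reference index - individual index|."""
    return sum(abs(reference_index - score) for score in index_scores)

# Two hypothetical simulated ST groups with the same mean Index Score (0.8)
reference = 0.8
homogeneous_group = [0.8, 0.8, 0.8]
heterogeneous_group = [1.0, 1.0, 0.4]   # the last rat would be classified as IG

penalty_homogeneous = misclassification_penalty(reference, homogeneous_group)
penalty_heterogeneous = misclassification_penalty(reference, heterogeneous_group)
print(penalty_homogeneous, penalty_heterogeneous)
print(least_squares_error([0.2, 0.5], [0.1, 0.7]))
```

Both groups have the same average score, so a fit on averaged behaviour alone would not separate them, whereas the penalty is larger for the group that contains a rat below the sign-tracking threshold.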
Table 3. Parameters used to produce the presented results.
Type | ω | β | α | γ | uITI | |||
---|---|---|---|---|---|---|---|---|
ST | 0.501 | 0.243 | 0.027 | 0.946 | 0.845 | 0.263 | 0.272 | 0.344 |
IG | 0.095 | 0.241 | 0.885 | 0.989 | 0.840 | 0.059 | 0.142 | 0.732 |
GT | 0.081 | 0.063 | 0.033 | 0.483 | 0.893 | 0.936 | 0.022 | 0.099 |
This metric ensures, for example, that using a set of parameter values for sign-tracking will produce a sign-tracker when applying the model in a simulation reproducing the original experiment. Interestingly, it allows us to predict qualitatively what the behaviour of such a rat (ST in normal conditions) would be in new experimental conditions: for example, whether the acquisition or the expression of the behaviour would be blocked or shifted to intermediate or even a goal-tracking behaviour, according to the Index Score defined above.
Note that initial values have no impact on behaviours in the long run as they are revised by incremental learning during the simulation. Estimated β parameters are sufficient to generate exploration and avoid being permanently biased by such initial values. They mainly help in reproducing the initial tendencies of rats to interact with the experimental environment. They can reflect differences in traits (e.g. novelty-seeking traits) that seem to differ between STs and GTs.
3. Results
The model has already been validated on a set of behavioural, physiological and pharmacological data (Lesaint et al., 2014). Interestingly, while the model was only tuned to fit the behavioural data for each group, simulations of additional experiments without changing the parameters were consistent with the remaining experimental data.
The model accounts for the respective engagements of STs and GTs towards distinct specific features (Flagel et al., 2007, 2009, 2011a). It reproduces the difference in patterns of dopaminergic activity for GTs and STs (Flagel et al., 2011a). It also reproduces behaviours indicative of incentive salience attribution, including the conditioned reinforcement effect of the lever shown to a greater extent in STs than GTs (Robinson and Flagel, 2009), and the consumption-like engagement of the lever or magazine (Mahler and Berridge, 2009; DiFeliceantonio and Berridge, 2012). Finally, it also reproduces the impact of flupentixol injected either systemically prior to training (Flagel et al., 2011a), i.e. during acquisition, or locally after the rats have acquired their respective conditioned responses (Saunders and Robinson, 2012), i.e. expression.
In the following sections, taking inspiration from the set of studies used to validate the model, we generate predictions that new experiments or extended analyses of the data could confirm.
3.1. Dopaminergic patterns of activity
The model parallels the dopaminergic activity recorded in the core of the nucleus accumbens by Fast Scan Cyclic Voltammetry with the RPE signal used in the FMF system. At US time, the RPE signal within the FMF system comes from the difference between the value of the previously engaged cue and the value of the delivered food. At CS time, it mainly reflects the value of the most rewarding cue between the lever and the magazine.
The dopaminergic patterns of STs and GTs at CS and US time are very distinct (Flagel et al., 2011a). While we observe a clear propagation of the signal from US to CS in STs (as expected from the classical RPE theory (Schultz, 1998)), this is not the case for GTs, for which the CS and US signals are similar to one another and remain relatively constant across sessions (hence, in discrepancy with the classical theory).
In the model, the RPE signal depends on the feature previously focused on by the simulated rat. Thus, RPE patterns, averaged over sessions, strongly depend on the dominant path taken by the simulated rats before food delivery. Simulated STs, which mainly engage with the lever before food delivery, have an averaged signal that propagates from US to CS. This reflects the fact that any rat that engages with the lever eventually learns that it is a full predictor of food delivery. Simulated GTs, which mainly engage with the magazine before food delivery, have an averaged signal that does not show such a propagation. Indeed, the magazine is not fully informative of food delivery for any rat, hence a persistent reward prediction error remains at food delivery when engaging with the magazine during the CS.
In Flagel et al. (2011a), recordings of dopaminergic activity in outbred rats were made to parallel those of the selectively bred STs and GTs but no recordings were made in outbred IGs. We would expect that IGs, whose behaviour fluctuates between sign-tracking and goal-tracking, would have a kind of mixed signal, averaging between those following from sign-tracking and goal-tracking. The current parameter values used in the model suggest that we would expect a high signal at CS time that would converge to a certain point while, in the meantime, the signal at US time would keep fluctuating without fully disappearing (see Figure 2).
Note that the visual results of this prediction are not identical to those in Lesaint et al. (2014). Contrary to ST and GT behaviours, which rely heavily on the mechanisms of the model, IG results strongly depend on the parameter values, which are significantly different with the introduction of the new score. Experimental recordings could help us refine the appropriate set of values for further predictions.
The initial analysis (Flagel et al., 2011a) and its reproduction (Lesaint et al., 2014) were done without taking into account the features engaged by animals prior to food delivery, possibly averaging very distinct patterns. The model predicts that if we were to organize the data per group and action rather than only per group, we would observe patterns as shown in Figure 3. At the time the CS is presented, there should be no differences as all rats are exploring the world and not expecting the lever appearance, hence the positive RPE common to all rats. The difference would be at US time.
STs previously engaged with the lever (Figure 3 A) would show a classical propagation pattern, similar to the one of the initial analysis, as this condition dominates in the data. It reflects the fully predictive value of the lever. STs previously engaged with the magazine (Figure 3 C) would show a significant peak of DA activity, as they almost never engage with the magazine and hence attribute a low value to it, leading to an expected significant RPE.
GTs previously engaged with the magazine (Figure 3 D) would show an absence of propagation and patterns of DA activity that follow those at CS time, similar to the one of the initial analysis, as this condition dominates in the data. It reflects the difference between the value of the food delivered and the lower motivational value of the magazine. GTs previously engaged with the lever (Figure 3 B) would show a noisy dopaminergic activity that would decrease with time as the predictive value of the lever is learned.
3.2. Removal of magazine during the ITI
In the present model, the simulation of the ITI has a significant impact on the data. We hypothesize that the permanent presence of the magazine during the whole experiment leads animals to revise its associated motivational value, upward at lever retraction (i.e. food delivery) and downward during the ITI as there is no reward to be found then. Hence, on average, its presence does not guarantee access to food. In contrast, the time-locked presence of the lever before food delivery would lead animals to learn the motivational value of the lever and maintain it at a certain level, as its presence guarantees that food will be delivered.
First, by keeping the motivational value of the lever higher than that of the magazine in the FMF system, the ITI revision leads simulated rats that favour this system (STs) to follow a sign-tracking policy. The small contribution of the MB system, which would attract rats towards the magazine, does not compensate. Thus, the presence of the magazine during the ITI is central to the emergence of STs in the model.
Second, by revising the magazine value downward between episodes, the ITI maintains a discrepancy between the expectation (value) and the observation (reward) at food delivery in simulated rats that are engaged with the magazine. This leads to the persistent positive RPE at US time and prevents a full propagation of the signal to CS time. Thus, the presence of the magazine during the ITI is also central for the model to explain the distinct dopaminergic patterns of activity in STs versus GTs that have been observed in Flagel et al. (2011a).
Third, we also hypothesize that values of the FMF system account for the motivational engagement, i.e. incentive salience, observed in rats towards either the lever or the magazine. The higher motivational value of the lever relative to that of the magazine implies that simulated rats chew/bite the lever more than the magazine. While not central to the model, this is consistent with experimental observations (Mahler and Berridge, 2009; DiFeliceantonio and Berridge, 2012).
If no magazine were available during the ITI then, according to the model, the magazine would not lose its motivational value, as it would become a full predictor of food delivery and be highly valued. Hence, we would expect (1) an increased motivational engagement (chew/bite) towards the magazine, (2) a decreased tendency to sign-track within the population and (3) a different pattern of dopamine activity when goal-tracking for all rats.
As the motivational value of a feature accounts for the level of motivational engagement towards it, a higher motivational value of the magazine, relative to a control group, would necessarily lead to a relatively stronger motivational engagement towards it.
As the motivational value of the magazine would be as high as that of the lever, there should be no reason for rats relying mainly on the FMF system (STs) to favour one over the other. Hence, they would shift to behaviours similar to those of IGs and GTs (see Figure 4). GTs, relying mainly on the MB system, would not be deeply affected (see Figure 4 B).
Finally, as the presence of the magazine would be time-locked to the moments before the delivery of food, we would expect a propagation of the dopamine signal from US time to CS time (see Figure 5). At some point (after the value of the food has been fully learned) the signal at US time should start decreasing. Note that if we had used the same parameters (except for the weighting parameter) to simulate STs and GTs, we would have expected an identical RPE signal for STs and GTs, and we know this is not the case based on existing data (Flagel et al., 2011a).
The expected decreased tendency to sign-track within the simulated population does not mean that simulated rats would no longer be attracted by the lever. Simulated rats would indeed be attracted by both the lever and the magazine because their FMF system attributes a high motivational value to all signs preceding reward delivery. Combined with the contribution of the MB system, which attracts rats towards the magazine, this could make the simulated animal engage more with the magazine than with the lever. Thus, if the computational model is valid, this would mean that the tendency to sign-track in real animals can be gradually changed by affecting some of the signs or features present in the context of the task (here the magazine during the ITI).
3.3. Injections of flupentixol in the core of the nucleus accumbens
In the model, flupentixol, an antagonist of dopamine, is hypothesized to impact the RPE (hypothesized to parallel phasic dopamine) used in the FMF system, putatively based within the core of the nucleus accumbens. Flupentixol is also assumed to affect any action selection process relying on tonic dopamine (Humphries et al., 2012). Hence, under systemic injections of flupentixol, the learning process of the FMF system is disrupted and actions are picked almost randomly, barely using learned values.
With systemic injections of flupentixol (Flagel et al., 2011a), neither goal-tracking nor sign-tracking is expressed in the population. However, when subsequently released from flupentixol, GTs fully express goal-tracking, whereas STs behave as untrained rats.
The model accounts for the absence of behaviours under flupentixol by the hypothesized impact of flupentixol on the action selection process, blocking the expression of any acquired behaviour (Lesaint et al., 2014). The subsequent absence of sign-tracking on a last session free of flupentixol is explained by the disruption of the FMF system during the first 7 sessions, blocking behaviour acquisition. The full expression of goal-tracking as soon as flupentixol is removed relies on the unaffected learning process in the MB system, which is assumed to be dopamine-independent and hence keeps learning under flupentixol, but whose values are simply not used by the softmax function.
The model predicts that if flupentixol were injected locally in the core of the nucleus accumbens rather than systemically prior to acquisition, GTs would express their behaviour normally, as the action selection mechanism would not be disrupted and would make use of the values learned in the MB system, whereas STs’ behaviour would remain blocked because of the disruption of the FMF system (see Figure 6). This is indeed what happened when Saunders and Robinson (2012) locally injected flupentixol after the behaviours were already acquired.
3.4. Lesions of core of the nucleus accumbens
While we did not try to find all anatomical counterparts of the mechanisms involved in the model, the hypothesis that the FMF system relies mainly on the core of the nucleus accumbens is important for the model. Indeed, RPEs used in the FMF system are compared with the dopaminergic recordings (using FSCV) in the core of the nucleus accumbens. As already stated, the values learned by the FMF system are a key component in the emergence of sign-tracking behaviours within a population and are assumed to reflect the motivational engagement observed towards the magazine and the lever.
As stated in the previous section, Flagel et al. (2011a) studied the impact of systemic injection of flupentixol on the acquisition of sign-tracking and goal-tracking. They observed that the acquisition of a goal-tracking behaviour did not require a fully functional dopaminergic system contrary to sign-tracking. Another study (Saunders and Robinson, 2012) focused on the impact of local injections of flupentixol in the core of the nucleus accumbens on the expression of sign-tracking and goal-tracking, after 8 days of conditioning. On the last day, with a sufficient dose of flupentixol, they observed a decrease in the general tendency to sign-track in the overall population while leaving the level of goal-tracking unaffected.
By simulating injections of flupentixol in the core of the nucleus accumbens as a disruption of RPEs in the FMF system, and hence of its contribution to the decision, the model accounts for these last observations. The action selection mechanism remains functional and makes use of the MB system values, such that the behaviour of GTs is preserved while that of STs is disrupted, leading to a decrease in sign-tracking in the overall population.
We expect that lesions of the core of the nucleus accumbens would lead to similar effects as the above experiments.
Lesions of the core of the nucleus accumbens prior to the experiment would (1) block the expression of sign-tracking responses and (2) stop the motivational engagement towards the magazine or the lever during approaches.
Once the FMF system is disabled (all its values being set and kept to 0), it can no longer favour the lever over the magazine. STs would therefore act randomly, approaching lever and magazine indifferently, as observed in IGs. We would expect a shift towards goal-tracking similar to the one expected for removing the magazine during the ITI (as in Figure 4).
However, while removing the magazine would lead to an increase in motivational engagement, we expect such a lesion to block any consumption-like behaviour. In particular, we would expect GTs’ approach behaviour to remain similar to that of a control group, but without subsequent chewing and biting of the magazine.
We would expect that lesions of the core of the nucleus accumbens after the experiment would disrupt the tendency to sign-track in the overall population, while leaving the tendency to goal-track intact (see Figure 7). However, unlike flupentixol injections, which needed 35 min of infusion for a visible effect, we would expect the effect of a lesion to be immediate. Such a lesion would disrupt the FMF system, and hence (1) suppress any consumption-like engagement towards the features (motivational values being kept to 0) and (2) stop favouring engagement with the lever. The lesion would leave the MB system unaffected and have no impact on the general tendency to goal-track.
4. Discussion
Relying on a model that was previously validated using experimental data to account for variability in rats undergoing an autoshaping paradigm (Lesaint et al., 2014), we generate an additional set of behavioural, physiological and pharmacological predictions.
We predict that dopaminergic patterns for IGs should be a mixed signal between those observed for STs and GTs. We predict that looking separately at the DA patterns given the prior engagement towards either the lever or the magazine should lead to clearly distinct patterns. We predict that the removal of the magazine during the ITI should lead to an increased motivational engagement towards the magazine, a decreased tendency to sign-track within the population and a different pattern of dopaminergic activity when goal-tracking. Finally, we predict that local injections of flupentixol into the core of the nucleus accumbens would preserve goal-tracking and prevent the learning of a sign-tracking response, a result that should also be observed following lesions of the core of the nucleus accumbens prior to conditioning. Lesions after conditioning would only block the expression of the learned sign-tracking behaviour.
An important limitation of the present predictions is that most of them are based on the behaviour that is expected to emerge from naive rats trained in a revised protocol, assuming that they would have behaved in a specific manner in the standard protocol (e.g. expecting a supposed ST to goal-track). To overcome this difficulty, one must look at the population level rather than the individual level (Saunders and Robinson, 2012), which might be problematic as the proportion of GTs, STs and IGs is highly variable in a population (Meyer et al., 2012; Fitzpatrick et al., 2013). An alternative would be to use selectively bred rats that can more or less be ensured to behave as STs or GTs in experimental conditions (Flagel et al., 2011a).
Another limitation of the present predictions lies in the hypotheses on which they are based. It cannot be excluded that the core of the nucleus accumbens also contributes to the MB system, though not through its dopaminergic activity (Khamassi and Humphries, 2012; van der Meer and Redish, 2011; McDannald et al., 2011) (but see van der Meer et al. (2010); Bornstein and Daw (2012); Penner and Mizumori (2012)). Hence, completely disrupting it might unexpectedly affect goal-tracking. Validating these predictions would help to confirm this hypothesis. In the initial model (Lesaint et al., 2014) we interpreted the parameter which simulates the ITI as accounting for the engagement of the rats towards the magazine during the ITI. Preliminary analyses of experimental data (not shown), while still inconclusive, tend to temper such a strong hypothesis. Hence, in the current article, we only assume that the presence of the magazine during the ITI impacts its general motivational value within the experiment. Validating such predictions would help to clarify the impact of the ITI context on the expressed behaviours.
One could argue that, to some extent, describing STs with an MF system and GTs with an MB system could be sufficient to explain dominant behaviours (Clark et al., 2012). However, it would fail to explain the full and continuous spectrum of observed behaviours (Meyer et al., 2012). If the predictions that we make about IGs (which have an intermediate behaviour between STs and GTs) are correct, this would argue in favour of a continuum in the weighting between MB and MF systems rather than a pure dichotomy.
An alternative to the collaboration of both systems (through a weighted sum) would be a reciprocal inhibition, such that only one system would be working at a time. This would be sufficient to account for the previous point and may even be able to account for the absence of an RPE pattern in the dopamine signal measured in GTs (Flagel et al., 2011a) without requiring a revision of the magazine value during the ITI. The inhibition of the MF system in GTs would indeed prevent any RPE signal from being observed. However, it would be unable to properly account for the consumption-like engagement observed in both STs and GTs without some kind of extension (see Zhang et al. (2009) for a computational model of incentive salience). It would also fail to explain why the pharmacological disruption of one system does not seem to let the other take control (Flagel et al., 2011a).
Another possibility would be that the two systems run in parallel but that only one is used to make the decision during a trial. Assuming that one system leads to the lever and the other to the magazine, we would expect IGs to behave as STs when engaging with the lever and as GTs when engaging with the magazine. Experimental data go against such an interpretation. Meyer et al. (2012) observed that, contrary to STs or GTs, IGs tend to approach both the magazine and the lever during single trials. Some rats even hold on to the lever while putting their head into the magazine (which no model that selects a single action at a time can reproduce). While the task representation does not allow multiple engagements in a trial, this suggests that both systems are active and contribute to the behaviour at all times. We would also expect rats to behave differently when using one system over the other, such that, for example, rats would actively engage with the lever but quietly wait in front of the magazine, which is not the case. In addition, the recent literature seems consistent with multiple systems working in parallel and partially contributing to a global decision (e.g. Daw et al. (2011)). Hence, the evidence does not suggest a take-over competition between the systems. Trial-by-trial analyses (Daw, 2011) would allow us to definitively rule out such alternatives. Finally, if only the output of the MF system was inhibited, given that the lever appearance is fully predictive of food delivery, no classical MF system (relying on a classical state representation) would reproduce the differences observed in phasic dopaminergic patterns between STs and GTs, nor explain the differences in the features they focus on. Hence, the model suggests taking features into consideration.
The interest of the current computational model lies in its combination of simple concepts that are widely used and accepted in the field (dual reinforcement learning and factored representations) but rarely used together, to account for a variety of experimental data without resorting to arbitrary additions. As a result, the current model does not behave as state-of-the-art algorithms would on the same task and produces a suboptimal behaviour. This suboptimal behaviour is, however, in accordance with behavioural observations in rats.
Subsequent studies could benefit from a different approach to estimating parameters. We currently fit the model to the behavioural data per session and per group; trial-by-trial analyses could prove a better tool to fit the parameters at the individual level (Daw et al., 2011) and to support some choices in the architecture of the model.
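To make the idea of a trial-by-trial fit concrete, the sketch below computes the negative log-likelihood of one animal's sequence of approaches under a deliberately simplified two-system model and selects the weighting parameter by grid search. This is not the fitting procedure used for the present model: the value tables, the learning rule, the ITI revision factor `u_iti`, the grid and the two choice sequences are all hypothetical and chosen only for illustration.

```python
import math

MB_VALUES = {"lever": 0.4, "magazine": 1.0}  # hypothetical, fixed MB values


def negative_log_likelihood(choices, omega, beta=5.0, alpha=0.2, u_iti=0.5):
    """Sum of -log P(choice) over trials for one (hypothetical) rat."""
    fmf = {"lever": 0.0, "magazine": 0.0}  # feature values, learned per trial
    nll = 0.0
    for choice in choices:
        # Weighted integration of both systems, then softmax in log-space.
        integrated = {a: (1 - omega) * MB_VALUES[a] + omega * fmf[a]
                      for a in fmf}
        top = max(integrated.values())
        log_norm = beta * top + math.log(
            sum(math.exp(beta * (v - top)) for v in integrated.values())
        )
        nll -= beta * integrated[choice] - log_norm
        # Food follows every trial: move the engaged feature's value towards 1.
        fmf[choice] += alpha * (1.0 - fmf[choice])
        # Assumed downward revision of the magazine value during the ITI.
        fmf["magazine"] *= 1.0 - u_iti
    return nll


def fit_omega(choices, grid_size=21):
    """Grid search for the weight that minimises the negative log-likelihood."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda w: negative_log_likelihood(choices, w))


# Made-up choice sequences, one entry per trial.
sign_tracker_like = ["magazine"] * 3 + ["lever"] * 27
goal_tracker_like = ["magazine"] * 30

print("fitted omega (sign-tracker-like):", fit_omega(sign_tracker_like))
print("fitted omega (goal-tracker-like):", fit_omega(goal_tracker_like))
```

Because the likelihood is accumulated trial by trial for a single animal, the same procedure yields one parameter estimate per individual rather than one per group. A real analysis would replace the grid search with a proper optimiser and fit all free parameters jointly.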
It has been suggested that individuals for whom cues become powerful incentives (i.e. STs) are more prone to develop addiction (Saunders and Robinson, 2012). Thus, the current model and its predictions will allow us to further investigate and possibly identify the neural mechanisms that underlie addiction and related disorders. For example, the current model predicts that some manipulations could alter the behaviour of STs towards that of GTs, and the neurobiological targets of these manipulations may be used to alter drug-cue dependency and prevent relapse (for further discussion regarding the role of learning-related dopamine signals in addiction vulnerability, see Huys et al. (in press)).
To conclude, the current article refines the model previously described by Lesaint et al. (2014) with an additional metric that strengthens its explanatory power. Its main contribution is a set of predictions against which to further test the model. The proposed experiments would help to better localize the anatomical counterparts of the mechanisms involved and to disentangle their contributions to the observed behaviours. They would also help to refine the hypotheses and simplifications of the model and, we hope, confirm the value and necessity of considering features, rather than the general situations encountered by rats, when modelling this kind of phenomenon.
Highlights.
* We model goal-tracking and sign-tracking with a model-based/model-free combination
* We suggest that magazine in ITI is necessary for these distinct behaviours to emerge
* Phasic dopaminergic activity can be explained by previously engaged features
Acknowledgments
This work was supported by Grant ANR-11-BSV4-006 “Learning Under Uncertainty” from L’Agence Nationale de la Recherche, France (FL, OS, MK), and by Grant “HABOT” from the Ville de Paris Emergence(s) Program, France (MK).
The authors would like to thank Peter Dayan, Mehdi Keramati, Angelo Arleo, Etienne Coutureau and Geoffrey Schoenbaum for helpful discussions. The authors would also like to thank the reviewers for their valuable comments and suggestions that helped to improve the contents of this paper.
References
- Gallistel CR, Fairhurst S, Balsam P. The learning curve: Implications of a quantitative analysis. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(36):13124–13131. doi: 10.1073/pnas.0404965101.
- Flagel SB, Clark JJ, Robinson TE, Mayo L, Czuj A, Willuhn I, Akers CA, Clinton SM, Phillips PEM, Akil H. A selective role for dopamine in stimulus-reward learning. Nature. 2011a;469(7328):53–57. doi: 10.1038/nature09588.
- Saunders BT, Robinson TE. Individual variation in resisting temptation: implications for addiction. Neurosci Biobehav Rev. 2013;37(9):1955–1975. doi: 10.1016/j.neubiorev.2013.02.008.
- Huys QJM, Tobler PN, Hasler G, Flagel SB. Chapter 3 - The role of learning-related dopamine signals in addiction vulnerability. In: Diana M, Di Chiara G, Spano P, editors. Progress in Brain Research. Volume 211. Elsevier; 2014. pp. 31–77. ISSN 0079-6123, ISBN 9780444634252. doi: 10.1016/B978-0-444-63425-2.00003-9.
- Flagel SB, Watson SJ, Robinson TE, Akil H. Individual differences in the propensity to approach signals vs goals promote different adaptations in the dopamine system of rats. Psychopharmacology. 2007;191(3):599–607. doi: 10.1007/s00213-006-0535-8.
- Flagel SB, Akil H, Robinson TE. Individual differences in the attribution of incentive salience to reward-related cues: Implications for addiction. Neuropharmacology. 2009;56:139–148. doi: 10.1016/j.neuropharm.2008.06.027.
- Flagel SB, Cameron CM, Pickup KN, Watson SJ, Akil H, Robinson TE. A food predictive cue must be attributed with incentive salience for it to induce c-fos mRNA expression in corticostriatal-thalamic brain regions. Neuroscience. 2011b;196:80–96. doi: 10.1016/j.neuroscience.2011.09.004.
- DiFeliceantonio AG, Berridge KC. Which cue to ‘want’? Opioid stimulation of central amygdala makes goal-trackers show stronger goal-tracking, just as sign-trackers show stronger sign-tracking. Behav Brain Res. 2012;230(2):399–408. doi: 10.1016/j.bbr.2012.02.032.
- Mahler SV, Berridge KC. Which cue to “want?” Central amygdala opioid activation enhances and focuses incentive salience on a prepotent reward cue. J Neurosci. 2009;29(20):6500–6513. doi: 10.1523/JNEUROSCI.3875-08.2009.
- Robinson TE, Flagel SB. Dissociating the predictive and incentive motivational properties of reward-related cues through the study of individual differences. Biol Psychiatry. 2009;65(10):869–873. doi: 10.1016/j.biopsych.2008.09.006.
- Meyer PJ, Lovic V, Saunders BT, Yager LM, Flagel SB, Morrow JD, Robinson TE. Quantifying Individual Variation in the Propensity to Attribute Incentive Salience to Reward Cues. PLoS ONE. 2012;7(6):e38987. doi: 10.1371/journal.pone.0038987.
- Fitzpatrick CJ, Gopalakrishnan S, Cogan ES, Yager LM, Meyer PJ, Lovic V, Saunders BT, Parker CC, Gonzales NM, Aryee E. Variation in the Form of Pavlovian Conditioned Approach Behavior among Outbred Male Sprague-Dawley Rats from Different Vendors and Colonies: Sign-Tracking vs. Goal-Tracking. PLoS ONE. 2013;8(10):e75042. doi: 10.1371/journal.pone.0075042.
- Danna CL, Elmer GI. Disruption of conditioned reward association by typical and atypical antipsychotics. Pharmacol Biochem Behav. 2010;96(1):40–47. doi: 10.1016/j.pbb.2010.04.004.
- Lesaint F, Sigaud O, Flagel SB, Robinson TE, Khamassi M. Modelling Individual Differences in the Form of Pavlovian Conditioned Approach Responses: A Dual Learning Systems Approach with Factored Representations. PLoS Comput Biol. 2014;10(2):e1003466. doi: 10.1371/journal.pcbi.1003466.
- Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8(12):1704–1711. doi: 10.1038/nn1560.
- Clark JJ, Hollon NG, Phillips PEM. Pavlovian valuation systems in learning and decision making. Curr Opin Neurobiol. 2012;22(6):1054–1061. doi: 10.1016/j.conb.2012.06.004.
- Dayan P, Niv Y, Seymour B, Daw ND. The misbehavior of value and the discipline of the will. Neural Netw. 2006;19(8):1153–1160. doi: 10.1016/j.neunet.2006.03.002.
- Schultz W. Predictive Reward Signal of Dopamine Neurons. J Neurophysiol. 1998;80:1–27. doi: 10.1152/jn.1998.80.1.1.
- Caluwaerts K, Staffa M, N’Guyen S, Grand C, Dollé L, Favre-Félix A, Girard B, Khamassi M. A biologically inspired meta-control navigation system for the Psikharpax rat robot. Bioinspiration & Biomimetics. 2012;7(2):025009. doi: 10.1088/1748-3182/7/2/025009.
- Humphries MD, Khamassi M, Gurney K. Dopaminergic control of the exploration-exploitation trade-off via the basal ganglia. Front Neurosci. 2012;6:9. doi: 10.3389/fnins.2012.00009.
- Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput. 2002;6(2):182–197.
- Mouret J-B, Doncieux S. SFERESv2: Evolvin’ in the Multi-Core World. WCCI 2010 IEEE World Congress on Computational Intelligence, Congress on Evolutionary Computation (CEC). 2010. pp. 4079–4086.
- Saunders BT, Robinson TE. The role of dopamine in the accumbens core in the expression of Pavlovian-conditioned responses. Eur J Neurosci. 2012;36(4):2521–2532. doi: 10.1111/j.1460-9568.2012.08217.x.
- Khamassi M, Humphries MD. Integrating cortico-limbic-basal ganglia architectures for learning model-based and model-free navigation strategies. Front Behav Neurosci. 2012;6:79. doi: 10.3389/fnbeh.2012.00079.
- van der Meer M, Redish AD. Ventral striatum: a critical look at models of learning and evaluation. Curr Opin Neurobiol. 2011;21(3):387–392. doi: 10.1016/j.conb.2011.02.011.
- McDannald MA, Lucantonio F, Burke KA, Niv Y, Schoenbaum G. Ventral Striatum and Orbitofrontal Cortex Are Both Required for Model-Based, But Not Model-Free, Reinforcement Learning. J Neurosci. 2011;31(7):2700–2705. doi: 10.1523/JNEUROSCI.5499-10.2011.
- van der Meer M, Johnson A, Schmitzer-Torbert NC, Redish AD. Triple dissociation of information processing in dorsal striatum, ventral striatum, and hippocampus on a learned spatial decision task. Neuron. 2010;67(1):25–32. doi: 10.1016/j.neuron.2010.06.023.
- Bornstein AM, Daw ND. Dissociating hippocampal and striatal contributions to sequential prediction learning. Eur J Neurosci. 2012;35(7):1011–1023. doi: 10.1111/j.1460-9568.2011.07920.x.
- Penner MR, Mizumori SJY. Neural systems analysis of decision making during goal-directed navigation. Prog Neurobiol. 2012;96(1):96–135. doi: 10.1016/j.pneurobio.2011.08.010.
- Zhang J, Berridge KC, Tindell AJ, Smith KS, Aldridge JW. A neural computational model of incentive salience. PLoS Comput Biol. 2009;5(7):e1000437. doi: 10.1371/journal.pcbi.1000437.
- Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans’ choices and striatal prediction errors. Neuron. 2011;69(6):1204–1215. doi: 10.1016/j.neuron.2011.02.027.
- Daw ND. Trial-by-trial data analysis using computational models. In: Delgado MR, Phelps EA, Robbins TW, editors. Decision Making, Affect, and Learning: Attention and Performance XXIII. Vol. 23. Oxford University Press; 2011. chap. 1.