PLOS Computational Biology. 2021 Feb 10;17(2):e1008553. doi: 10.1371/journal.pcbi.1008553

Neural signatures of arbitration between Pavlovian and instrumental action selection

Samuel J Gershman 1,2,*, Marc Guitart-Masip 3,4, James F Cavanagh 5
Editor: Daniele Marinazzo
PMCID: PMC7901778  PMID: 33566831

Abstract

Pavlovian associations drive approach towards reward-predictive cues, and avoidance of punishment-predictive cues. These associations “misbehave” when they conflict with correct instrumental behavior. This raises the question of how Pavlovian and instrumental influences on behavior are arbitrated. We test a computational theory according to which Pavlovian influence will be stronger when inferred controllability of outcomes is low. Using a model-based analysis of a Go/NoGo task with human subjects, we show that theta-band oscillatory power in frontal cortex tracks inferred controllability, and that these inferences predict Pavlovian action biases. Functional MRI data revealed an inferior frontal gyrus correlate of action probability and a ventromedial prefrontal correlate of outcome valence, both of which were modulated by inferred controllability.

Author summary

Using a combination of computational modeling, neuroimaging (both EEG and fMRI), and behavioral analysis, we present evidence for a dual-process architecture in which Pavlovian and instrumental action values are adaptively combined through a Bayesian arbitration mechanism. Building on prior research, we find neural signatures of this arbitration mechanism in frontal cortex. In particular, we show that trial-by-trial changes in Pavlovian influences on action can be predicted by our computational model, and are reflected in midfrontal theta power, as well as inferior frontal and ventromedial prefrontal cortex fMRI responses.

Introduction

Approaching reward-predictive stimuli and avoiding punishment-predictive stimuli are useful heuristics adopted by many animal species. However, these heuristics can sometimes lead animals astray—a phenomenon known as “Pavlovian misbehavior” [1, 2]. For example, reward-predictive stimuli invigorate approach behavior even when such behavior triggers withdrawal of the reward [3, 4], or the delivery of punishment [5, 6]. Likewise, punishment-predictive stimuli inhibit approach behavior even when doing so results in reduced net reward [7–9].

A venerable interpretation of these and related findings is that they arise from the interaction between Pavlovian and instrumental learning processes [10]. The two-process interpretation has been bolstered by evidence from neuroscience that Pavlovian and instrumental influences on behavior are (at least to some extent) segregated anatomically [11]. In particular, the dorsal subdivision of the striatum (caudate and putamen in primates) is more closely associated with instrumental learning, whereas the ventral subdivision (nucleus accumbens) is more closely associated with Pavlovian learning [12, 13].

Any multi-process account of behavior naturally raises the question of arbitration: what decides the allocation of behavioral control to particular processes at any given point in time? One way to approach this question from a normative perspective is to analyze the computational trade-offs realized by different processes. The job of the arbitrator is to determine which process achieves the optimal trade-off for a particular situation. This approach has proven successful in understanding arbitration between different instrumental learning processes [14–17]. More recently, it has been used to understand arbitration between Pavlovian and instrumental processes [18]. The key idea is that instrumental learning is more statistically flexible, in the sense that it can learn reward predictions that are both action-specific and stimulus-specific, whereas Pavlovian learning can only learn stimulus-specific predictions. The cost of this flexibility is that instrumental learning is more prone to over-fitting: for any finite amount of data, there is some probability that the learned predictions will generalize incorrectly in the future, and this probability is larger for more flexible models, since they have more degrees of freedom with which to capture noise in the data. This account can be formalized in terms of Bayesian model comparison [18].

Dorfman and Gershman [18] tested a key prediction of this account, namely that the Pavlovian bias should be stronger when outcomes are less controllable, using a variant of the Go/NoGo paradigm, which has been widely employed as an assay of Pavlovian bias in human subjects [19–24]. We focus on this task in the present paper, while acknowledging that our conclusions may not generalize to other forms of Pavlovian bias, such as in Pavlovian-instrumental transfer paradigms. The Go/NoGo task crosses valence (winning reward vs. avoiding punishment) with action (Go vs. NoGo), resulting in four conditions: Go-to-Win, Go-to-Avoid, NoGo-to-Win, and NoGo-to-Avoid (Fig 1A). A key finding from this paradigm is that people make more errors on Go-to-Avoid trials compared to Go-to-Win trials, and this pattern reverses for NoGo trials, indicating that Pavlovian bias invigorates approach (the Go response) for reward-predictive cues, and inhibits approach for punishment-predictive cues. By introducing decoy trials in which rewards were either controllable or uncontrollable, Dorfman and Gershman showed that the Pavlovian bias was enhanced in the low controllability condition (see also [25]).

Fig 1. Experimental design and computational framework.


(A) Shown here is the experimental design used by [23] in their EEG study, which differed in several minor ways from the design used by [22] in their fMRI study (see Materials and methods). Subjects were instructed to respond to a target stimulus (white circle) by either pressing a button (Go) or withholding a button press (NoGo). Subjects had to learn the optimal action based on stimulus cues (shapes) and reward or punishment feedback. For all conditions, the optimal action yielded reward delivery or punishment avoidance with 70% probability; this probability was 30% for the suboptimal action. (B) Pavlovian and instrumental prediction and valuation combine into a single integrated decision value based on a weighting parameter (w) that represents the evidence for the uncontrollable environment (i.e., in favor of the Pavlovian predictor). Figure adapted from [18], with permission. See Materials and methods for technical details.

An important innovation of the Dorfman and Gershman model was the hypothesis that the balance between Pavlovian and instrumental influences on action is dynamically arbitrated, and hence can potentially vary within the course of a single experimental session. This contrasts with most modeling of the Go/NoGo task (starting with [22]), which has assumed that the balance is fixed across the experimental session. Dorfman and Gershman presented behavioral evidence for within-session variation of the Pavlovian bias. Neural data could potentially provide even more direct evidence, by revealing correlates of the arbitration process itself. We pursue this question here by carrying out a model-based analysis of two prior data sets, one from an electroencephalography (EEG) study [23], and one from a functional magnetic resonance imaging (fMRI) study [22].

Results

Modeling and behavioral results

We fit computational models to Go/NoGo data from two previously published studies. The tasks used in these two studies were very similar, with a few minor differences detailed in the Materials and methods. We first briefly summarize the models; full specifications are given in the Materials and methods.

In [18], a Bayesian framework was introduced that formalized action valuation in terms of probabilistic inference (Fig 1B). According to this framework, Pavlovian and instrumental processes correspond to distinct predictive models of reward (or punishment) outcomes. The Pavlovian process estimates outcome predictions based on stimulus information alone, whereas the instrumental process uses both stimulus and action information. These predictions are converted into action values in different ways. For the instrumental process, action valuation is straightforward—it is simply the expected outcome for a particular stimulus-action pair. The Pavlovian process, which does not have an action-dependent outcome expectation, instead relies on the heuristic that reward-predictive cues should elicit behavioral approach (Go actions in the Go/NoGo task), and punishment-predictive cues should elicit avoidance (NoGo).

Arbitration in the Bayesian framework corresponds to model comparison: the action values are weighted by the probability favoring each predictor. This computation yields the expected action value under model uncertainty. Thus, the Bayesian framework offers an interpretation of Pavlovian bias in terms of the probability favoring Pavlovian outcome prediction (denoted by w, which we refer to as the “Pavlovian weight”). The Pavlovian weight can also be interpreted as the subjective degree of belief in an uncontrollable environment, where actions do not influence the probability distribution over outcomes (and correspondingly, 1 − w is the degree of belief in a controllable environment).

Dorfman and Gershman [18] compared two versions of probabilistic arbitration. In the Fixed Bayesian model, the Pavlovian weight reflects a priori beliefs (i.e., prior to observing data). Thus, in the Fixed Bayesian model, the Pavlovian bias weight does not change with experience. In the Adaptive Bayesian model, the Pavlovian weight reflects a posteriori beliefs (i.e., after observing data), such that the weight changes across trials based on the observations. Finally, we compared both Bayesian models to a non-Bayesian reinforcement learning (RL) model that best described the data in [22]. This RL model is structurally similar to the Fixed Bayesian model, but posits a heuristic aggregation of Pavlovian and instrumental values. All models use an error-driven learning mechanism, but the Bayesian models assume that the learning rate decreases across stimulus repetitions.

We found that the Adaptive Bayesian model was favored in both data sets, with a protected exceedance probability greater than 0.7 (Fig 2A and 2B). To confirm that the Adaptive model fit the data well (see S1 Fig for further predictive checks), we plotted the go bias (difference in accuracy between Go and NoGo trials) as a function of weight quantile (Fig 2C and 2D). Consistent with the model and previous results [18], the go bias increased with weight for the Win condition [top vs. bottom quantile: t(31) = 2.41, p < 0.05 for the EEG data set, t(28) = 3.96, p < 0.001 for the fMRI data set]. In contrast, it remained essentially flat for the Avoid condition. This asymmetry arises from the fact that most subjects were best fit with an initial Pavlovian value greater than 0 (76% in the EEG data set, 63% in the fMRI data set). This means that the model actually predicts a positive go bias for the Avoid condition early during learning (when the Pavlovian weight is typically larger; see S2 Fig), which eventually should become a negative go bias. Consistent with this hypothesis, the go bias during the first 40 trials (across all conditions) in the fMRI data set was significantly greater than 0 for the Avoid condition [t(29) = 3.63, p < 0.002] and significantly less than 0 during the last 40 trials [t(29) = 2.24, p < 0.05] (the EEG data set had fewer trials, and hence it was harder to obtain a reliable test of this hypothesis, though the results were numerically in the same direction). This early preference for the go response is also consistent with earlier modeling of these data sets in which an unlearned go bias was incorporated into the decision value (see also [9]). Here we explain the same phenomenon in terms of the prior over Pavlovian values.

Fig 2. Behavioral results.


Top: Protected exceedance probabilities (PXPs) for 3 computational models fit to the EEG data set (A) and the fMRI data set (B). Bottom: Go bias (difference in accuracy between Go and NoGo trials) computed as a function of the Pavlovian weight for the EEG data set (C) and the fMRI data set (D). Lines show model fits, circles show means with standard errors.
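To make the quantile analysis concrete, here is a minimal Python sketch (with assumed variable names; the published analyses were run from the authors’ released code, linked under Data Availability) that bins trials by the fitted Pavlovian weight and computes the go bias within each bin:

```python
import numpy as np

def go_bias_by_weight(w, correct, is_go, n_quantiles=4):
    """Go bias (accuracy on Go trials minus accuracy on NoGo trials)
    within each quantile bin of the trial-by-trial Pavlovian weight w."""
    edges = np.quantile(w, np.linspace(0, 1, n_quantiles + 1))
    bins = np.digitize(w, edges[1:-1])          # bin index 0..n_quantiles-1
    bias = np.full(n_quantiles, np.nan)
    for q in range(n_quantiles):
        m = bins == q
        go, nogo = m & (is_go == 1), m & (is_go == 0)
        if go.any() and nogo.any():
            bias[q] = correct[go].mean() - correct[nogo].mean()
    return bias
```

Applied per subject, separately for Win and Avoid stimuli, and then averaged across subjects, this kind of computation yields curves like those in Fig 2C and 2D.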

In the next two sub-sections, we use the Adaptive model to generate model-based regressors for neural activity, in an effort to ground the hypothesized computational processes in neural data. In particular, we will focus on showing that neural signals covary with the Pavlovian weight, thereby demonstrating that this dynamically changing variable is encoded by the brain. Before proceeding to these analyses, it is important to show that this covariation is not confounded by other dynamic variables. In particular, while the Fixed and RL models lack a dynamic weight, the instrumental and Pavlovian values are dynamic in these models. To eliminate these variables as potential confounds, we correlated them with the Pavlovian weight for each subject. For both the EEG and fMRI data sets, the median correlation never exceeded 0.02 and never differed significantly from 0 (p > 0.1, signed rank test). To evaluate the positive evidence for the null hypothesis (correlation of 0), we also carried out Bayesian t-tests [26] on the Fisher z-transformed correlations, finding that the Bayes factors consistently favored the null hypothesis (ranging between 2 and 5). These results give us confidence that the neural covariation we report next is unconfounded by other dynamic variables in the computational model.
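A sketch of this confound check, under an assumed data layout (one array of trial-by-trial values per subject); the Bayes factors were computed with the Rouder et al. [26] Bayesian t-test, which is not re-implemented here:

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def weight_confound_test(w_by_subj, x_by_subj):
    """Per-subject correlation between the Pavlovian weight and another
    dynamic model variable, tested against zero at the group level."""
    r = np.array([pearsonr(w, x)[0] for w, x in zip(w_by_subj, x_by_subj)])
    z = np.arctanh(r)      # Fisher z-transform stabilizes the variance
    _, p = wilcoxon(z)     # signed-rank test of median zero
    return np.median(r), p
```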

EEG results

Following the template of our behavioral analyses, we examined midfrontal theta power as a function of the Pavlovian weight (see S3 Fig for results fully disaggregated across conditions). In previous work on this same data set [23], and in follow-up studies [25, 27], frontal theta was implicated in the suppression of the Pavlovian influence on choice. Consistent with these previous findings, we found that frontal theta power decreased with the Pavlovian weight [top vs. bottom quantile: t(31) = 2.09, p < 0.05; Fig 3]. Unlike these earlier studies, which incorporated the frontal theta signal as an input to the computational model, here we validate for the first time a model that treats frontal theta as an output (i.e., a signal predicted by the model).

Fig 3. EEG results.


(A) Montage showing region of interest, derived from [23]. (B) Stimulus-locked midfrontal theta power as a function of the Pavlovian weight. Error bars show standard error of the mean.

Our data cannot be explained by alternative theories about midfrontal theta. First, we can rule out an explanation in terms of reward prediction error [2830]. At the time of stimulus presentation, the reward prediction error is simply the estimate of stimulus value (a recency-weighted average of past prediction errors), which is uncorrelated with w (p = 0.29; signed rank test; Bayes factor favoring the null: 2.36, Bayesian t-test applied to Fisher z-transformed correlations) as well as with midfrontal theta (p = 0.23, signed rank test; Bayes factor favoring the null: 2.62). Second, our data are not adequately explained by choice confidence [31]. Although we do not measure subjective choice confidence, we can use as a proxy the expected accuracy under our model. This measure is negatively correlated with the Pavlovian weight (p < 0.001, signed rank test), as we would expect given that accuracy will tend to be lower under Pavlovian control. However, repeating our quantile comparison using this measure did not reveal a significant relationship between expected accuracy and midfrontal theta power (top vs. bottom quantile: p = 0.14).
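For concreteness, a small sketch (hypothetical variable names) of the expected-accuracy proxy: the probability that the fitted model assigns to the optimal action on each trial.

```python
import numpy as np

def expected_accuracy(p_go, optimal_is_go):
    """p_go: model-derived P(Go|s) on each trial; optimal_is_go: 1 where Go
    is the optimal action for that trial's stimulus, 0 otherwise."""
    p_go = np.asarray(p_go)
    optimal_is_go = np.asarray(optimal_is_go)
    return np.where(optimal_is_go == 1, p_go, 1.0 - p_go)
```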

fMRI results

We next re-analyzed fMRI data from [22], focusing on two frontal regions of interest: the inferior frontal gyrus (IFG; Fig 4A) and the ventromedial prefrontal cortex (vmPFC; Fig 4B). The key results are summarized in panels C and D in Fig 4 (see S4 Fig for fully disaggregated results, including results from the ventral striatum). When the Pavlovian weight is close to 0, the IFG response to the cue for Go and NoGo conditions is not significantly different (p = 0.32), but when the Pavlovian weight is close to 1, IFG responds significantly more to NoGo than to Go [t(28) = 3.91, p < 0.001, Fig 4C]. This is consistent with the hypothesis that IFG is responsible for the suppression of Go responses when the Pavlovian bias is strong, regardless of valence. Note that the NoGo>Go effect is unsurprising given that the IFG region of interest was selected based on the NoGo>Go contrast, but this selection criterion does not by itself explain the interaction between weight and NoGo vs. Go.

Fig 4. Functional MRI results.


(A,B) Regions of interest. IFG: inferior frontal gyrus; vmPFC: ventromedial prefrontal cortex. (C) IFG response to the stimulus cue as a function of Pavlovian weight, separated by Go and NoGo conditions. (D) vmPFC response to the stimulus cue as a function of Pavlovian weight, separated by Win and Avoid conditions.

The vmPFC scaled with the Pavlovian weight [top vs. bottom quantile: t(28) = 2.05, p < 0.05; Fig 4D], and responded more to Win vs. Avoid across weight quantiles [t(28) = 3.46, p < 0.002], but the interaction was not significant (p = 0.56). Thus, vmPFC appears to encode a combination of valence and Pavlovian bias (both main effects but no interaction). The relationship between Pavlovian weight and vmPFC cannot be explained by choice confidence (again using expected accuracy as a proxy): although expected accuracy should vary inversely with the Pavlovian weight (see above), vmPFC activity did not differ significantly between the top and bottom quantile of expected accuracy (p = 0.08). Nor can the relationship be explained by differences in instrumental values, which are uncorrelated with the Pavlovian weight (median r = 0.02, p = 0.36, signed rank test; Bayes factor favoring the null: 3.30, Bayesian t-test applied to Fisher z-transformed correlations).

Discussion

By re-analyzing two existing neuroimaging data sets, we have provided some of the first evidence for neural signals tracking beliefs about controllability during Go/NoGo task performance. These signals are theoretically significant, as they support the computational hypothesis that Pavlovian influences on choice behavior arise from a form of Bayesian model comparison between Pavlovian and instrumental outcome predictions [18]. Modeling of behavior further supported this hypothesis, showing that the behavioral data were best explained by a Bayesian model in which Pavlovian influence changes as a function of inferred controllability.

Our analyses focused on three regions, based on prior research. One strong point of our approach is that we did not select the regions of interest based on any of the analyses reported in this paper; thus, the results serve as relatively unbiased tests of our computational hypotheses.

First, we showed that midfrontal theta power tracked inferred controllability (i.e., inversely with the Pavlovian weight). This finding is consistent with the original report describing the data set [23], which showed that the Pavlovian weight governing action selection could be partially predicted from midfrontal theta power, a finding further supported by subsequent research [27]. A recent study [25] attempted to more directly link midfrontal theta to controllability using a “learned helplessness” design in which one group of subjects intermittently lost control over outcomes by “yoking” the outcomes to those observed by a control group. The control group exhibited the same relationship between Pavlovian weight and midfrontal theta observed in earlier studies, whereas the yoked group did not (however, it must be noted that a direct comparison did not yield strong evidence for group differences). More broadly, these results are consistent with the hypothesis that midfrontal theta (and its putative cortical generator in midcingulate / dorsal anterior cingulate cortex) is responsible for computing the “need for control” [32, 33] or the “expected value of control” [34]. Controllability is a necessary (though not sufficient) requirement for the exertion of cognitive control to have positive value. From this perspective, it is important to emphasize that we do not see midfrontal theta as exclusively signaling inferred controllability; rather, high inferred controllability signals need for control, which thereby evokes midfrontal theta activity. Other variables that signal need for control, such as reward prediction error [2830], can also evoke midfrontal theta activity, without requiring changes in inferred controllability.

Because of the partial volume acquisition in the imaging procedure (see Materials and methods), our fMRI data did not allow us to examine hemodynamic correlates in the midcingulate cortex. Instead, we examined two other regions of interest: IFG and vmPFC. IFG has been consistently linked to inhibition of prepotent responses [35, 36]. Accordingly, we found greater response to NoGo than to Go in IFG. However, this difference only emerged when inferred controllability (as determined by our computational model) was low. There is some previous evidence that IFG is sensitive to controllability. Romaniuk and colleagues [37] reported that the IFG response was stronger on free choice trials compared to forced choice trials, and was significantly correlated with self-reported ratings of personal autonomy. Similarly, IFG activity has been associated with illusions of control [38]. It is difficult to directly connect these previous findings with those reported here, since the studies did not compare Go and NoGo responses.

While our finding that vmPFC shows a stronger response to reward vs. punishment is consistent with previous findings [39, 40], the fact that vmPFC decreases with inferred controllability is rather surprising. If anything, the literature suggests that vmPFC increases with subjective and objective controllability [41–43], though at least one study found a greater reduction in vmPFC activity after a controllable punishment compared to an uncontrollable punishment [44]. Further investigation is needed to confirm the surprising inverse relationship between vmPFC and inferred controllability.

Our study is limited in a number of ways, which point toward promising directions for future research. First, as already mentioned, our fMRI data did not allow us to test the hypothesis that midcingulate cortex, as the putative generator of midfrontal theta, tracked inferred controllability. This limitation could be overcome in future studies using whole brain acquisition volumes. Second, we only analyzed neural data time-locked to the stimulus; future work could examine outcome-related activity. We chose not to do this because of our focus on action selection, and in particular how inferred controllability signals are related to Pavlovian biasing of actions. An important task for future work will be to identify the neural update signal for inferred controllability that drives dynamic changes in Pavlovian bias. Third, while our work is partly motivated by studies of Pavlovian-instrumental transfer [45], it is still an open empirical question to what extent that phenomenon is related to Pavlovian biases in Go/NoGo performance. It is also an open theoretical question to what extent the kind of Bayesian arbitration model proposed here can provide a comprehensive account of Pavlovian-instrumental transfer. Finally, in accordance with most prior analyses of midfrontal theta, we focused only on total theta power, neglecting any distinctions between phase-locked and non-phase-locked activity. However, several lines of research suggest that these two components of oscillatory activity carry different information [46, 47]. Distinguishing them may therefore support a finer-grained dissection of computational function.

Materials and methods

This section summarizes the methods used in the original studies [22, 23], which can be consulted for further details. The Bayesian models were first presented in [18], and that paper can be consulted for derivations of the equations.

Subjects

34 adults (18-34 years) participated in the EEG study [23], and 30 adults (18-35 years) participated in the fMRI study [22]. Subjects had normal or corrected-to-normal vision, and no history of neurological, psychiatric, or other relevant medical problems.

Ethics statement

All subjects provided written informed consent, and the study protocols were approved by the local ethics committees (the Institutional Review Boards at Brown University and University College London).

Experimental procedure

The experimental procedure was very similar across the two studies (see Fig 1A). Each trial began with a presentation of a visual stimulus (a colored shape in [23], a fractal in [22]) for 1000 ms. After a variable interval (250-2500 ms in [23], 250-2000 ms in [22]), a target circle appeared, at which point a response was elicited. In [23], the target appeared centrally and subjects simply decided whether or not to press a button (Go or NoGo); in [22], the target appeared laterally and subjects (if they chose to respond) indicated on which side of the screen the target appeared. After a 1000 ms delay, subjects received reward or punishment feedback. In [23], the optimal action yielded a positive outcome (reward delivery or punishment avoidance) with probability 0.7, and the suboptimal action yielded a positive outcome with probability 0.3; in [22] these probabilities were 0.8 and 0.2, respectively. Rewards were defined as monetary gains, and punishments were defined as monetary losses. Subjects were compensated based on their earnings/losses during the task.

There were 4 conditions, signaled by distinct stimuli: Go-to-Win reward, Go-to-Avoid punishment, NoGo-to-Win reward, NoGo-to-Avoid punishment. Note that subjects were not instructed about the meaning of the stimuli, so these contingencies needed to be learned from trial and error. The experimental session consisted of 40 trials for each condition in [23], 60 trials for each condition in [22], presented in a randomly intermixed order.

EEG methods

EEG was recorded using a 128-channel EGI system, recorded continuously with hardware filters set from 0.1 to 100 Hz, a sampling rate of 250 Hz, and an online vertex reference. The EEG data were then preprocessed to interpolate bad channels, remove eyeblink contaminants, and bandpass filter the signal. Finally, total spectral power (both phase-locked and non-phase-locked) was computed within the theta band (4-8 Hz, 175-350 ms post-stimulus) in a midfrontal region of interest (ROI; Fig 3A) based on previous studies [48].
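A minimal sketch of this kind of single-trial theta-power extraction (band-pass filtering plus a Hilbert envelope; the original study’s exact filtering and time-frequency pipeline may differ):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def theta_power(epochs, fs=250.0, band=(4.0, 8.0), window=(0.175, 0.350), t0=0.0):
    """epochs: (n_trials, n_samples) stimulus-locked data averaged over the
    midfrontal ROI channels; t0: epoch start time (s) relative to stimulus."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, epochs, axis=1)       # zero-phase band-pass
    power = np.abs(hilbert(filtered, axis=1)) ** 2  # squared analytic amplitude
    t = t0 + np.arange(epochs.shape[1]) / fs
    mask = (t >= window[0]) & (t <= window[1])
    return power[:, mask].mean(axis=1)              # one value per trial
```

Because power is computed on single trials before averaging, this captures total (phase-locked plus non-phase-locked) theta power.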

fMRI methods

Data were collected using a 3-Tesla Siemens Allegra magnetic resonance scanner (Siemens, Erlangen, Germany) with echo planar imaging of a partial volume that included the striatum and the midbrain (matrix: 128 × 128; 40 oblique axial slices per volume angled at −30° in the antero-posterior axis; spatial resolution: 1.5 × 1.5 × 1.5 mm; TR = 4100 ms; TE = 30 ms). This partial volume included the whole striatum, the substantia nigra, ventral tegmental area, the amygdala, and the ventromedial prefrontal cortex. It excluded the medial cingulate cortex, the supplementary motor areas, the superior frontal gyrus, and the middle frontal gyrus. The fMRI acquisition protocol was optimized to reduce susceptibility-induced BOLD sensitivity losses in inferior frontal and temporal lobe regions [49].

Data were preprocessed using SPM8 (Wellcome Trust Centre for Neuroimaging, UCL, London), with the following steps: realignment, unwarping using individual fieldmaps, spatial normalization to Montreal Neurological Institute (MNI) space, smoothing with a 6 mm full-width half maximum Gaussian kernel, temporal filtering (high-pass cutoff: 128 s), and whitening using a first-order autoregressive model. Finally, cue-evoked response amplitude was estimated with a general linear model (GLM), in which the event-related impulse was convolved with the canonical hemodynamic response function. The GLM also included movement regressors estimated from the realignment step.

To obtain a trial-by-trial estimate of the BOLD response at the time of the cue, we built a new GLM that included one regressor per trial at the time each cue was presented. In order to control for activity associated with the performance of the target detection task, we included a single regressor indicating the time at which the targets were presented, together with a parametric modulator indicating whether participants performed a Go (1) or a NoGo (-1) response. Similarly, to control for activity associated with the receipt of feedback, we included a single regressor indicating the time at which the outcome was presented, together with a parametric modulator indicating whether the outcome was a loss (-1), a neutral outcome (0), or a win (1). Finally, the model also included the movement regressors. Before estimation, all regressors (except the movement regressors) were convolved with the canonical hemodynamic response function. This analysis resulted in one image per trial summarizing the BOLD response on that trial for each available voxel. We then extracted the mean BOLD response within the two frontal ROIs. The IFG ROI was defined as the voxels that responded to NoGo>Go in learners in the original report, thresholded at p < 0.001 uncorrected. The vmPFC ROI was defined as the voxels that responded positively to the parametric modulator of outcome responses in the GLM reported above, thresholded at p < 0.001 uncorrected.
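A schematic “one regressor per trial” (beta-series) GLM, assuming a generic double-gamma HRF and a single ROI time series; the actual analysis used SPM’s canonical HRF, the nuisance regressors described above, and voxelwise estimation:

```python
import numpy as np
from scipy.stats import gamma

def hrf(tr, duration=32.0):
    """A simple double-gamma HRF sampled at the TR (a sketch, not SPM's exact basis)."""
    t = np.arange(0.0, duration, tr)
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

def beta_series(y, cue_onsets, tr):
    """y: (n_vols,) ROI time series; cue_onsets: cue times in seconds.
    Returns one beta per trial (nuisance regressors omitted for brevity)."""
    n_vols = len(y)
    h = hrf(tr)
    X = np.zeros((n_vols, len(cue_onsets)))
    for j, onset in enumerate(cue_onsets):
        stick = np.zeros(n_vols)
        stick[int(round(onset / tr))] = 1.0       # event impulse at cue onset
        X[:, j] = np.convolve(stick, h)[:n_vols]  # convolve with the HRF
    X = np.column_stack([X, np.ones(n_vols)])     # add an intercept
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return betas[:-1]
```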

Computational models

We compared three computational models of learning and choice. Each model was fit to data from individual subjects using maximum likelihood estimation (see S5 Fig for histograms of the parameter estimates) and compared using random-effects Bayesian model comparison with the Bayesian information criterion approximation of the marginal likelihood [50]. We summarize the model comparison results using protected exceedance probabilities, which express the posterior probability that a particular model is more frequent in the population than all other models, adjusting for the probability that the differences in model fit could have arisen from the null hypothesis (uniform model frequency in the population).
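As a reference point, a minimal sketch of the per-subject model score; the protected exceedance probabilities themselves were computed with the random-effects procedure of [50], which is not re-implemented here:

```python
import numpy as np

def bic(neg_log_lik, n_params, n_trials):
    """BIC = k*log(n) + 2*NLL (lower is better); -BIC/2 serves as the
    approximate log model evidence entering the group-level analysis."""
    return n_params * np.log(n_trials) + 2.0 * neg_log_lik
```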

Guitart-Masip and colleagues [22] compared several reinforcement learning models, finding the strongest support for one in which the action policy is defined by:

$$P(\mathrm{Go} \mid s) = \frac{\exp[V(s,\mathrm{Go})]}{\exp[V(s,\mathrm{Go})] + \exp[V(s,\mathrm{NoGo})]}\,(1-\xi) + \frac{\xi}{2} \tag{1}$$

where s denotes the stimulus, ξ is a lapse probability (capturing a baseline error rate), and V(s, a) is the integrated action value for action a in response to stimulus s:

$$V(s,\mathrm{Go}) = V_I(s,\mathrm{Go}) + \pi V_P(s,\mathrm{Go}) + b \tag{2}$$
$$V(s,\mathrm{NoGo}) = V_I(s,\mathrm{NoGo}) \tag{3}$$

The action value integrates the instrumental value $V_I$ and the Pavlovian value $V_P$, where the weighting parameter π captures a fixed Pavlovian approach bias towards reward-predictive cues, and an avoidance bias away from punishment-predictive cues. In addition, the parameter b captures a fixed Go bias. The values are updated according to an error-driven learning rule:

$$\Delta V_I(s,a) = \alpha[\rho r - V_I(s,a)] \tag{4}$$
$$\Delta V_P(s) = \alpha[\rho r - V_P(s)] \tag{5}$$

where α is a learning rate, ρ > 0 is an outcome scaling factor, and r is the outcome. For the sake of brevity, we will refer to this model simply as the “RL model” (but note that the models described next could be validly considered “Bayesian RL models” insofar as they estimate expectations about reward and punishment; see [51] for further discussion of this point).
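A compact Python sketch of this RL model (Eqs 1-5), with hypothetical array names (integer-coded stimuli and actions); the published fits used maximum likelihood estimation:

```python
import numpy as np

def rl_model_loglik(params, stimuli, actions, outcomes):
    """alpha: learning rate; rho: outcome scaling; pi_: Pavlovian weight;
    b: Go bias; xi: lapse. actions: 1 = Go, 0 = NoGo; outcomes r in {-1,0,1}."""
    alpha, rho, pi_, b, xi = params
    n_stim = int(stimuli.max()) + 1
    VI = np.zeros((n_stim, 2))   # instrumental values, columns [NoGo, Go]
    VP = np.zeros(n_stim)        # Pavlovian stimulus values
    loglik = 0.0
    for s, a, r in zip(stimuli, actions, outcomes):
        v_go = VI[s, 1] + pi_ * VP[s] + b            # Eq 2
        v_nogo = VI[s, 0]                            # Eq 3
        p_go = np.exp(v_go) / (np.exp(v_go) + np.exp(v_nogo))
        p_go = p_go * (1 - xi) + xi / 2              # Eq 1 (lapse)
        loglik += np.log(p_go if a == 1 else 1 - p_go)
        VI[s, a] += alpha * (rho * r - VI[s, a])     # Eq 4
        VP[s] += alpha * (rho * r - VP[s])           # Eq 5
    return loglik
```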

Subsequent modeling (e.g., [23]) has shown that this model can be improved by allowing differential sensitivity to rewards and punishments, but we do not pursue that extension here since it would also require us to develop an equivalent extension of the Bayesian models described next. Since our primary goal is to model the neural dynamics underlying variability in the Pavlovian bias, we did not feel that it was necessary to run a more elaborate horse race between the model classes.

Dorfman and Gershman [18] introduced two Bayesian models. The learner is modeled as occupying one of two possible environments (controllable or uncontrollable). In the controllable environment, outcomes depend on the combination of stimulus and action, as specified by a Bernoulli parameter $\theta_{sa}$. In the uncontrollable environment, outcomes depend only on the stimulus, as specified by the parameter $\theta_s$. Because these parameters are unknown at the outset, the learner must estimate them. The Bayes-optimal estimate, assuming a Beta prior on the parameters, can be computed using an error-driven learning rule similar to the one described above, with the difference that the learning rate declines according to $\alpha = 1/\eta_s$ for the Pavlovian model, where $\eta_s$ is the number of times stimulus s was encountered (the instrumental model follows the same idea, but using $\eta_{sa}$, the number of times action a was taken in response to stimulus s). The model is parametrized by the initial value of η and the initial parameter estimates, which together define Beta distribution priors (see [18] for a complete derivation). To convert the parameter estimates (denoted $\hat{\theta}_s$ and $\hat{\theta}_{sa}$) into action values, we assumed that the instrumental values are simply the parameter estimates, $V_I(s,a) = \hat{\theta}_{sa}$, while the Pavlovian value $V_P$ is 0 for a = NoGo and $\hat{\theta}_s$ for Go.
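A minimal sketch of the resulting error-driven estimator: with a Beta prior, the posterior mean can be tracked by a delta rule whose learning rate is 1/η, where η counts prior plus observed samples (shown here for the Pavlovian, stimulus-only case; the instrumental case is identical with per-(s,a) counters):

```python
class BetaEstimator:
    """Running posterior mean of a Bernoulli parameter under a Beta prior."""

    def __init__(self, theta0=0.5, eta0=2.0):
        self.theta = theta0   # prior mean (theta0=0.5, eta0=2 is a Beta(1,1) prior)
        self.eta = eta0       # prior "sample count" (prior confidence)

    def update(self, r):
        """r: binary outcome (1 = positive outcome, 0 = negative)."""
        self.eta += 1.0
        self.theta += (r - self.theta) / self.eta   # learning rate = 1/eta
        return self.theta
```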

The learner does not know with certainty which environment she occupies; her belief that she is in the controllable environment is specified by the probability w. The expected action value under environment uncertainty is then given by:

$$V(s,\mathrm{Go}) = (1-w)\,V_I(s,\mathrm{Go}) + w\,V_P(s,\mathrm{Go}) \tag{6}$$

which parallels the integration in the RL model (Eq 2), except that the integrated action value is now a convex combination of the instrumental and Pavlovian values. Unlike the RL model, the Fixed Bayesian model used an inverse temperature parameter instead of an outcome scaling parameter (though these parameters play essentially the same role), and did not model lapse probability or Go bias (because the extra complexity introduced by these parameters was not justified based on model comparison). Thus, the action policy is given by:

$$P(\mathrm{Go} \mid s) = \frac{\exp[\beta V(s,\mathrm{Go})]}{\exp[\beta V(s,\mathrm{Go})] + \exp[\beta V(s,\mathrm{NoGo})]} \tag{7}$$

where β is the inverse temperature, which controls action stochasticity.

In the Bayesian framework, the parameter w can be interpreted as a belief in the probability that the environment is uncontrollable (outcomes do not depend on actions). A critical property of the Fixed Bayesian model is that this parameter is fixed for a subject, under the assumption that the subject does not draw inferences about controllability during the experimental session. The Adaptive Bayesian model is essentially the same as the Fixed Bayesian model, but departs in one critical aspect: the Pavlovian weight parameter w is updated on each trial. Using the relation w = 1/(1 + exp(−L)), where L is the log-odds favoring the uncontrollable environment, we can describe the update rule as follows:

$$\Delta L = r\,\log\frac{\hat{\theta}_s}{\hat{\theta}_{sa}} + (1-r)\,\log\frac{1-\hat{\theta}_s}{1-\hat{\theta}_{sa}} \tag{8}$$

The initial value of L was set to 0 (a uniform distribution over environments).
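A sketch of the arbitration step (Eqs 6-8): each binary outcome nudges the log-odds L toward whichever predictor better anticipated it, and the resulting w = 1/(1 + exp(−L)) weights the value integration:

```python
import numpy as np

def update_log_odds(L, r, theta_s, theta_sa):
    """Eq 8: accumulate evidence for the uncontrollable (Pavlovian) predictor.
    theta_s, theta_sa are the current estimates, assumed to lie in (0, 1)."""
    return L + r * np.log(theta_s / theta_sa) \
             + (1 - r) * np.log((1 - theta_s) / (1 - theta_sa))

def integrated_go_value(L, v_inst_go, v_pav_go):
    """Eq 6: convex combination of instrumental and Pavlovian Go values."""
    w = 1.0 / (1.0 + np.exp(-L))   # Pavlovian weight
    return (1 - w) * v_inst_go + w * v_pav_go
```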

We verified that parameters of the adaptive Bayesian model are reasonably recoverable, by simulating the experimental design used in [22], with the same number of subjects, and then fitting the simulated data. Overall, the correlation between true and recovered parameters was r = 0.62 (p < 0.0001). One parameter (the prior confidence of the Pavlovian values) exhibited relatively poor recoverability, with r = 0.28; we do not make any specific claims about this parameter in the paper. In addition to parameter recoverability, we found good model recoverability: the protected exceedance probability assigned to the adaptive Bayesian model was close to 1.
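A schematic parameter-recovery loop (the simulate, fit, and sample_params functions are placeholders for the model-specific routines):

```python
import numpy as np
from scipy.stats import pearsonr

def recovery_check(simulate, fit, sample_params, n_subjects=30, seed=0):
    """Simulate synthetic subjects, refit them, and correlate true with
    recovered values, one correlation per parameter."""
    rng = np.random.default_rng(seed)
    true, recovered = [], []
    for _ in range(n_subjects):
        params = sample_params(rng)     # draw plausible ground-truth parameters
        data = simulate(params, rng)    # generate one synthetic subject
        true.append(params)
        recovered.append(fit(data))     # maximum-likelihood refit
    true, recovered = np.array(true), np.array(recovered)
    return [pearsonr(true[:, j], recovered[:, j])[0] for j in range(true.shape[1])]
```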

Supporting information

S1 Fig. Accuracy across conditions.

(Left) Data from Guitart-Masip et al. (2012). (Right) Simulations of the adaptive Bayesian model, using parameters fitted to the data. Error bars show standard error of the mean.

(PDF)

S2 Fig. Pavlovian weight dynamics.

The weight variable w is plotted across trial epochs, broken into quarters (note that the data sets have different numbers of trials). Error bars show standard error of the mean.

(PDF)

S3 Fig. Disaggregated EEG results.

Midfrontal theta power (z-scored within subject) as a function of Pavlovian weight quantile, separated by stimulus condition. Error bars show standard error of the mean.

(PDF)

S4 Fig. Disaggregated fMRI results.

BOLD response amplitude (z-scored within subject) as a function of Pavlovian weight quantile, separated by stimulus condition. Left: ventromedial prefrontal cortex. Middle: ventral striatum. Right: inferior frontal gyrus. Error bars show standard error of the mean.

(PDF)

S5 Fig. Parameter estimate histograms.

Estimates were aggregated across the EEG and fMRI data sets.

(PDF)

Data Availability

All code and data for reproducing the analyses and figures are available at https://github.com/sjgershm/GoNoGo-neural.

Funding Statement

SJG was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216 (https://nsf.gov/), and by the Office of Naval Research (N00014-17-1-2984; https://www.onr.navy.mil/). MGM was supported by a research grant awarded by the Swedish Research Council (VR-2018-02606; https://www.vr.se/english.html). JFC was supported by NIMH 1RO1MH119382-01 (https://www.nimh.nih.gov/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Breland K, Breland M. The misbehavior of organisms. American Psychologist. 1961;16:681–684. doi: 10.1037/h0040090
2. Dayan P, Niv Y, Seymour B, Daw ND. The misbehavior of value and the discipline of the will. Neural Networks. 2006;19:1153–1160. doi: 10.1016/j.neunet.2006.03.002
3. Williams DR, Williams H. Auto-maintenance in the pigeon: sustained pecking despite contingent non-reinforcement. Journal of the Experimental Analysis of Behavior. 1969;12:511–520. doi: 10.1901/jeab.1969.12-511
4. Hershberger WA. An approach through the looking-glass. Animal Learning & Behavior. 1986;14:443–451. doi: 10.3758/BF03200092
5. Grossen N, Kostansek D, Bolles R. Effects of appetitive discriminative stimuli on avoidance behavior. Journal of Experimental Psychology. 1969;81:340–343. doi: 10.1037/h0027780
6. Bull JA III. An interaction between appetitive Pavlovian CSs and instrumental avoidance responding. Learning and Motivation. 1970;1:18–26. doi: 10.1016/0023-9690(70)90124-4
7. Estes W, Skinner B. Some quantitative properties of anxiety. Journal of Experimental Psychology. 1941;29:390–400. doi: 10.1037/h0062283
8. Annau Z, Kamin L. The conditioned emotional response as a function of intensity of the US. Journal of Comparative and Physiological Psychology. 1961;54:428–432. doi: 10.1037/h0042199
9. Huys QJ, Eshel N, O’Nions E, Sheridan L, Dayan P, Roiser JP. Bonsai trees in your head: how the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Computational Biology. 2012;8. doi: 10.1371/journal.pcbi.1002410
10. Rescorla R, Solomon R. Two-process learning theory: relationships between Pavlovian conditioning and instrumental learning. Psychological Review. 1967;74:151–182. doi: 10.1037/h0024475
11. Guitart-Masip M, Duzel E, Dolan R, Dayan P. Action versus valence in decision making. Trends in Cognitive Sciences. 2014;18:194–202. doi: 10.1016/j.tics.2014.01.003
12. Joel D, Niv Y, Ruppin E. Actor–critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks. 2002;15:535–547. doi: 10.1016/S0893-6080(02)00047-3
13. O’Doherty J, Dayan P, Schultz J, Deichmann R, Friston K, Dolan RJ. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science. 2004;304:452–454. doi: 10.1126/science.1094285
14. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience. 2005;8:1704–1711. doi: 10.1038/nn1560
15. Keramati M, Dezfouli A, Piray P. Speed/accuracy trade-off between the habitual and the goal-directed processes. PLoS Computational Biology. 2011;7. doi: 10.1371/journal.pcbi.1002055
16. Lee SW, Shimojo S, O’Doherty JP. Neural computations underlying arbitration between model-based and model-free learning. Neuron. 2014;81:687–699. doi: 10.1016/j.neuron.2013.11.028
17. Kool W, Gershman SJ, Cushman FA. Cost-benefit arbitration between multiple reinforcement-learning systems. Psychological Science. 2017;28:1321–1333. doi: 10.1177/0956797617708288
18. Dorfman HM, Gershman SJ. Controllability governs the balance between Pavlovian and instrumental action selection. Nature Communications. 2019;10:1–8. doi: 10.1038/s41467-019-13737-7
19. Crockett MJ, Clark L, Robbins TW. Reconciling the role of serotonin in behavioral inhibition and aversion: acute tryptophan depletion abolishes punishment-induced inhibition in humans. Journal of Neuroscience. 2009;29:11993–11999. doi: 10.1523/JNEUROSCI.2513-09.2009
20. Guitart-Masip M, Fuentemilla L, Bach DR, Huys QJ, Dayan P, Dolan RJ, et al. Action dominates valence in anticipatory representations in the human striatum and dopaminergic midbrain. Journal of Neuroscience. 2011;31:7867–7875. doi: 10.1523/JNEUROSCI.6376-10.2011
21. Crockett MJ, Clark L, Apergis-Schoute AM, Morein-Zamir S, Robbins TW. Serotonin modulates the effects of Pavlovian aversive predictions on response vigor. Neuropsychopharmacology. 2012;37:2244–2252. doi: 10.1038/npp.2012.75
22. Guitart-Masip M, Huys QJ, Fuentemilla L, Dayan P, Duzel E, Dolan RJ. Go and no-go learning in reward and punishment: interactions between affect and effect. NeuroImage. 2012;62:154–166. doi: 10.1016/j.neuroimage.2012.04.024
23. Cavanagh JF, Eisenberg I, Guitart-Masip M, Huys Q, Frank MJ. Frontal theta overrides Pavlovian learning biases. Journal of Neuroscience. 2013;33:8541–8548. doi: 10.1523/JNEUROSCI.5754-12.2013
24. de Boer L, Axelsson J, Chowdhury R, Riklund K, Dolan RJ, Nyberg L, et al. Dorsal striatal dopamine D1 receptor availability predicts an instrumental bias in action learning. Proceedings of the National Academy of Sciences. 2019;116:261–270. doi: 10.1073/pnas.1816704116
25. Csifcsák G, Melsæter E, Mittner M. Intermittent absence of control during reinforcement learning interferes with Pavlovian bias in action selection. Journal of Cognitive Neuroscience. 2020;32:646–663. doi: 10.1162/jocn_a_01515
26. Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review. 2009;16:225–237. doi: 10.3758/PBR.16.2.225
27. Swart JC, Frank MJ, Määttä JI, Jensen O, Cools R, den Ouden HE. Frontal network dynamics reflect neurocomputational mechanisms for reducing maladaptive biases in motivated action. PLoS Biology. 2018;16:e2005979. doi: 10.1371/journal.pbio.2005979
28. Cavanagh JF, Frank MJ, Klein TJ, Allen JJ. Frontal theta links prediction errors to behavioral adaptation in reinforcement learning. NeuroImage. 2010;49:3198–3209. doi: 10.1016/j.neuroimage.2009.11.080
29. Cavanagh JF, Figueroa CM, Cohen MX, Frank MJ. Frontal theta reflects uncertainty and unexpectedness during exploration and exploitation. Cerebral Cortex. 2012;22:2575–2586. doi: 10.1093/cercor/bhr332
30. Van de Vijver I, Ridderinkhof KR, Cohen MX. Frontal oscillatory dynamics predict feedback learning and action adjustment. Journal of Cognitive Neuroscience. 2011;23:4106–4121. doi: 10.1162/jocn_a_00110
31. Lim K, Wang W, Merfeld DM. Frontal scalp potentials foretell perceptual choice confidence. Journal of Neurophysiology. 2020;123:1566–1577. doi: 10.1152/jn.00290.2019
32. Cavanagh JF, Frank MJ. Frontal theta as a mechanism for cognitive control. Trends in Cognitive Sciences. 2014;18:414–421. doi: 10.1016/j.tics.2014.04.012
33. Cavanagh JF, Shackman AJ. Frontal midline theta reflects anxiety and cognitive control: meta-analytic evidence. Journal of Physiology-Paris. 2015;109:3–15. doi: 10.1016/j.jphysparis.2014.04.003
34. Shenhav A, Botvinick MM, Cohen JD. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron. 2013;79:217–240. doi: 10.1016/j.neuron.2013.07.007
35. Rubia K, Smith AB, Brammer MJ, Taylor E. Right inferior prefrontal cortex mediates response inhibition while mesial prefrontal cortex is responsible for error detection. NeuroImage. 2003;20:351–358. doi: 10.1016/S1053-8119(03)00275-1
36. Aron AR, Poldrack RA. Cortical and subcortical contributions to stop signal response inhibition: role of the subthalamic nucleus. Journal of Neuroscience. 2006;26:2424–2433. doi: 10.1523/JNEUROSCI.4682-05.2006
37. Romaniuk L, Sandu AL, Waiter GD, McNeil CJ, Xueyi S, Harris MA, et al. The neurobiology of personal control during reward learning and its relationship to mood. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging. 2019;4:190–199. doi: 10.1016/j.bpsc.2018.09.015
38. Lorenz RC, Gleich T, Kühn S, Pöhland L, Pelz P, Wüstenberg T, et al. Subjective illusion of control modulates striatal reward anticipation in adolescence. NeuroImage. 2015;117:250–257. doi: 10.1016/j.neuroimage.2015.05.024
39. Blair K, Marsh AA, Morton J, Vythilingam M, Jones M, Mondillo K, et al. Choosing the lesser of two evils, the better of two goods: specifying the roles of ventromedial prefrontal cortex and dorsal anterior cingulate in object choice. Journal of Neuroscience. 2006;26:11379–11386. doi: 10.1523/JNEUROSCI.1640-06.2006
40. Monosov IE, Hikosaka O. Regionally distinct processing of rewards and punishments by the primate ventromedial prefrontal cortex. Journal of Neuroscience. 2012;32:10318–10330. doi: 10.1523/JNEUROSCI.1801-12.2012
41. Amat J, Baratta MV, Paul E, Bland ST, Watkins LR, Maier SF. Medial prefrontal cortex determines how stressor controllability affects behavior and dorsal raphe nucleus. Nature Neuroscience. 2005;8:365–371. doi: 10.1038/nn1399
42. Kerr DL, McLaren DG, Mathy RM, Nitschke JB. Controllability modulates the anticipatory response in the human ventromedial prefrontal cortex. Frontiers in Psychology. 2012;3:557. doi: 10.3389/fpsyg.2012.00557
43. Murayama K, Matsumoto M, Izuma K, Sugiura A, Ryan RM, Deci EL, et al. How self-determined choice facilitates performance: A key role of the ventromedial prefrontal cortex. Cerebral Cortex. 2015;25:1241–1251. doi: 10.1093/cercor/bht317
44. Bhanji JP, Delgado MR. Perceived control influences neural responses to setbacks and promotes persistence. Neuron. 2014;83:1369–1375. doi: 10.1016/j.neuron.2014.08.012
45. Holmes NM, Marchand AR, Coutureau E. Pavlovian to instrumental transfer: a neurobehavioural perspective. Neuroscience & Biobehavioral Reviews. 2010;34:1277–1295. doi: 10.1016/j.neubiorev.2010.03.007
46. Hajihosseini A, Holroyd CB. Frontal midline theta and N200 amplitude reflect complementary information about expectancy and outcome evaluation. Psychophysiology. 2013;50:550–562. doi: 10.1111/psyp.12040
47. Cohen MX, Donner TH. Midfrontal conflict-related theta-band power reflects neural oscillations that predict behavior. Journal of Neurophysiology. 2013;110:2752–2763. doi: 10.1152/jn.00479.2013
48. Cavanagh JF, Zambrano-Vazquez L, Allen JJ. Theta lingua franca: A common mid-frontal substrate for action monitoring processes. Psychophysiology. 2012;49:220–238. doi: 10.1111/j.1469-8986.2011.01293.x
49. Weiskopf N, Hutton C, Josephs O, Deichmann R. Optimal EPI parameters for reduction of susceptibility-induced BOLD sensitivity losses: a whole-brain analysis at 3 T and 1.5 T. NeuroImage. 2006;33:493–504. doi: 10.1016/j.neuroimage.2006.07.029
50. Rigoux L, Stephan KE, Friston KJ, Daunizeau J. Bayesian model selection for group studies—revisited. NeuroImage. 2014;84:971–985. doi: 10.1016/j.neuroimage.2013.08.065
51. Gershman SJ. A unifying probabilistic view of associative learning. PLoS Computational Biology. 2015;11. doi: 10.1371/journal.pcbi.1004567
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008553.r001

Decision Letter 0

Daniele Marinazzo

19 Aug 2020

Dear Dr. Gershman,

Thank you very much for submitting your manuscript "Neural signatures of arbitration between Pavlovian and instrumental action selection" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Some of them are quite substantial, and will need some particular attention to make the paper suitable for publication in PLOS CB.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology


***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The manuscript offers a refreshing new way to look at PIT from a computational perspective. The insight that uncertainty controls arbitration between two models – a Pavlovian and an instrumental – relies on important literature and applies it to PIT in an elegant way.

The main body of work that the authors use to motivate their model is reference [19]. They sketch two implications of the model, as follows:

• “First, the Bayesian arbitration mechanism preferentially allocates control to the Pavlovian process initially, when there are less data and hence less support for the more flexible model. This is broadly consistent with the finding that the Pavlovian bias on instrumental responding declines with the amount of instrumental training [19].” – I understand this to mean: MORE instrumental training, LESS PIT

• “Second, this initial preference should be stronger in relatively less controllable environments, where little predictive power is gained by conditionalizing predictions on action. Accordingly, Pavlovian bias increases with the amount of Pavlovian training [19].” I understand this to mean: MORE Pavlovian training, MORE PIT

I may well be misunderstanding something, and therefore stand to be corrected, but reference [19] (and indeed my knowledge of the literature on which it is based) appears to say something very different: the exact set-up of the conditioning phases (prior to the PIT phase) greatly influences the pattern found; so we ignore them at our peril. The findings agree with the authors’ interpretation only for a subset of experimental set-ups, but contradict them in many others.

Reference 19 makes 3 contributions. It provides a subjective review of the literature; a meta analysis; and new empirical data.

Quoting from the review: “Greater amounts of instrumental training facilitate PIT (Holland, 2004), leading some to consider that habitual responses are more susceptible to the general motivating influence of CSs (Holland, 2004, Yin and Knowlton, 2006; see also Dickinson and Balleine, 2001). “ I understand this to mean: MORE instrumental training, MORE PIT – the opposite of the authors’ interpretation.

From the empirical work: “Extensive Pavlovian conditioning produced more Pavlovian magazine visits and weaker PIT than moderate Pavlovian conditioning (Experiment 1)”. I understand this to mean: MORE Pavlovian training, LESS PIT – again, the opposite of the authors’ interpretation.

From the meta-analysis: “The amount of instrumental training clearly influences PIT scores in non-selective transfer studies and to some extent in selective transfer studies. The precise relationship between instrumental training and PIT scores furthermore depends on the order of the instrumental and Pavlovian conditioning phases. More instrumental training facilitates PIT when Pavlovian conditioning precedes instrumental training, but appears to be detrimental to PIT when the order of the two phases is reversed.” Here, half the studies agree with the authors’ interpretation, but the other half directly contradict it.

From the meta-analysis: “There were no relationships between PIT scores and amounts of Pavlovian conditioning for groups in non-selective PIT studies. …” In “Selective PIT studies…

When Pavlovian conditioning preceded instrumental training, there was a clear negative relationship between PIT scores and the amount of Pavlovian conditioning… However, when Pavlovian conditioning followed instrumental training, there was a positive relationship between PIT scores and the amount of Pavlovian conditioning for CS-different… but not for CS-same”. My understanding here is that non-selective studies contradict the authors’ interpretation; half of the selective studies contradict it (MORE Pavlovian conditioning, LESS PIT); and some of the other half of selective studies agree with it.

For the model to be useful to the community of PIT researchers, it needs to exhibit some of the agreed patterns in that literature. Of course, it’s expected that the model will make some novel predictions; but when it contradicts existing patterns, this needs to be very clearly spelled out. At present, as far as I see, the crucial reference the authors used highlights important differences between the experimental set-ups that give rise to different patterns of the relationship between training and PIT, which the current model does not have a mechanism to explain. It would be important, for example, to relate the key variable of interest here – controllability – to the order of training (Pavlovian vs. instrumental first). I would love to have the authors’ response to this query, because I do agree with the huge potential of their approach.

Reviewer #2: In this paper the authors reanalyze two previously published data sets (an EEG and an fMRI study) that used essentially the same valenced Go/NoGo task, which required either a Go or a NoGo response to either gain a monetary reward or avoid a monetary loss. The focus of the paper is the test of a computational model that negotiates between Pavlovian state values and instrumental state-action values using a time-varying linear combination of both value signals. The main finding from the EEG study is a dependence of frontal theta-band power on the Pavlovian weight, supporting the involvement of this region in the suppression of Pavlovian influences on behavior. In the fMRI data they report an interaction of behavioral response and Pavlovian weight, suggesting higher activation for NoGo responses with increasing weight.
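For readers unfamiliar with this class of model, the time-varying linear combination can be sketched as follows. This is a minimal illustration under assumed conventions (a logistic choice rule and a weight w in [0, 1] interpretable as the belief in an uncontrollable environment); the function and variable names are illustrative, not the authors’ implementation:

```python
import numpy as np

def go_probability(V_pav, Q_go, Q_nogo, w, beta=1.0, go_bias=0.0):
    """Probability of a Go response under a weighted mixture of the
    Pavlovian stimulus value and the instrumental action values.

    w            : Pavlovian weight in [0, 1] (higher = more Pavlovian control)
    V_pav        : Pavlovian value of the current stimulus
    Q_go, Q_nogo : instrumental values of the Go and NoGo actions
    """
    # The Pavlovian value biases the Go action; instrumental values are
    # action-specific. The weight w arbitrates between the two controllers.
    drive_go = w * V_pav + (1 - w) * Q_go + go_bias
    drive_nogo = (1 - w) * Q_nogo
    return 1.0 / (1.0 + np.exp(-beta * (drive_go - drive_nogo)))
```

A higher w thus pushes choices toward approach for reward-predictive stimuli and inhibition for punishment-predictive ones, regardless of the instrumental contingency.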

In general, the paper is well-written, with a clear research question and a dedicated computational model that is being tested. Model comparison with degenerate versions of the model and a previously published RL model is done with appropriate methods. I also like the idea of recycling and reanalyzing older data sets to uncover novel aspects of already established tasks. However, in the current paper, there are many elements missing that one would want to see in a computational modeling study. This dampens my enthusiasm for this paper considerably at this point.

Model simulation and parameter recovery. It would be good to know that the proposed model is able to recover true parameter values under MLE estimation (why was the estimation not done in a hierarchical manner?). In addition, it would be interesting to see how different Pavlovian weights change the model-free signature of the behavioral data (e.g., the interaction of response (Go/NoGo) and valence (win/neutral/loss)) that was reported in the original publications.
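To make this request concrete, a parameter-recovery check typically looks something like the following. This is a minimal sketch, not the authors’ pipeline; `sample_params`, `simulate`, and `negloglik` are hypothetical stand-ins for the model-specific pieces:

```python
import numpy as np
from scipy.optimize import minimize

def parameter_recovery(sample_params, simulate, negloglik,
                       n_subjects=30, n_starts=10):
    """Draw ground-truth parameters, simulate synthetic choice data,
    refit by maximum likelihood, and return (true, recovered) pairs."""
    trues, recovered = [], []
    for _ in range(n_subjects):
        theta = sample_params()          # ground-truth parameter vector
        data = simulate(theta)           # synthetic choices from the model
        best = None
        for _ in range(n_starts):        # multistart to avoid local minima
            res = minimize(negloglik, sample_params(), args=(data,),
                           method="Nelder-Mead")
            if best is None or res.fun < best.fun:
                best = res
        trues.append(theta)
        recovered.append(best.x)
    # Column-wise correlations between trues and recovered indicate
    # how identifiable each parameter is.
    return np.array(trues), np.array(recovered)
```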

Model accuracy and posterior predictive checks. We are presented with evidence that the flexible Bayesian learner fits the data best among the competing models, but we don’t actually see whether this model is able to generate data from the fitted parameters that are commensurate with the experimental data. For instance, how accurately does the model fit the original experimental data, and can the model replicate the finding in the original publications of an interaction of Go/NoGo and valence?
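A posterior predictive check along these lines might look like the sketch below, which simulates data from each subject’s fitted parameters and summarizes condition-wise Go rates for comparison with the empirical ones. `simulate` is again a hypothetical stand-in, assumed to return a data frame with 'condition' and 'go' columns:

```python
import pandas as pd

def posterior_predictive_go_rates(fitted_params, simulate, n_sims=100):
    """Simulate datasets from fitted parameters and return the average
    predicted Go rate per condition (e.g., Go-to-Win, NoGo-to-Avoid),
    to be plotted against the empirical Go rates."""
    rows = []
    for theta in fitted_params:               # one parameter vector per subject
        for _ in range(n_sims):
            sim = simulate(theta)             # columns: 'condition', 'go' (0/1)
            rows.append(sim.groupby("condition")["go"].mean())
    return pd.concat(rows, axis=1).mean(axis=1)
```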

Analysis of the fitted parameters. We never get to see the distribution of the model parameters across participants, or whether they are related to behavioral performance or to the imaging data.

Inconsistencies between DFs and participant numbers. There is an apparent inconsistency between the degrees of freedom in the paired t-tests (EEG df=31, fMRI df=28) and the number of participants in the two samples (EEG n=30, fMRI n=47). Were any subjects excluded from the modeling and, if yes, for what reasons? Please resolve this inconsistency.

1st-level fMRI GLM. I am confused about the 1st-level GLM in the fMRI data set. There were trial-specific regressors and, in addition, parametric regressors for Go/NoGo and Outcome. The latter model the experimental variance due to these two factors that is common across trials, which leaves the trial-specific regressors with the residual variance not explained by these common factorial regressors. How is it then possible that the ROI analysis based on the trial-specific beta images still shows effects of valence and behavioral response?
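The collinearity at the heart of this concern can be illustrated with a toy design matrix (ignoring HRF convolution and nuisance regressors; the numbers are hypothetical): a condition regressor is an exact linear combination of the trial-specific indicators, so the design is rank-deficient and the variance cannot be uniquely partitioned between them.

```python
import numpy as np

n_trials = 6
trial_regs = np.eye(n_trials)              # one indicator column per trial
condition = np.array([1, 1, 1, 0, 0, 0])   # e.g., Go vs. NoGo trials
X = np.column_stack([trial_regs, condition])

# The condition column equals the sum of the first three trial columns,
# so the 7-column design has rank 6:
print(np.linalg.matrix_rank(X), X.shape[1])   # -> 6 7
```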

Incomplete analysis of the data. The authors correctly mention the caveat that they only analyzed responses to cue presentation, and not responses following the outcome, which drive the value updates in the model. However, they have the data at hand to look at this question, and going the extra mile would give a comprehensive picture of the neural computations associated with controllability.

Figure Permission. Figure 1b is taken directly from the original Dorfman & Gershman, NCOMMS paper, which should be cited in the Figure legend.

Reviewer #3: In this paper, the authors test a computational theory according to which Pavlovian influence will be stronger when inferred controllability of outcomes is low. They use two prior datasets of a Go/NoGo task in humans and perform model-based analyses of behavior and neuroimaging data (both EEG and fMRI). They find that theta-band oscillatory power in frontal cortex tracks inferred controllability, and that these inferences predict Pavlovian action biases.

Overall, the underlying theory is very interesting and elegant, but it has already been published in Dorfman & Gershman (2018). The presented re-analyses of behavior, EEG, and fMRI data from two previous studies are a bit superficial in their current state. A number of potential confounds have been neglected (see detailed suggestions below). Unless much more thorough analyses are presented and drastic improvements of the manuscript are made, I have the impression that not much can be learned from the paper compared to the two previous studies from which the datasets were taken, and which already showed clear neural correlates of Pavlovian influence over behavior and its modulation/suppression.

Detailed comments

The authors found an initially positive Go bias even in the Avoid condition, in 76% and 63% of the subjects of the two datasets, respectively. This seems to contradict the hypothesis that Pavlovian biases should intrinsically favor Go for Win and NoGo for Avoid (Huys et al., Disentangling the roles of approach, activation and valence in instrumental and Pavlovian responding. PLoS Comput Biol 7, e1002028 (2011)). How do the authors reconcile their observation with the underlying theoretical hypothesis?

The tested model space is too small. Why not compare to a non-Bayesian RL model with an annealed learning rate, similar to the Bayesian model, so as to assess the specific role of this annealing process? It would also be interesting to compare the model to a fixed-amplitude change model rather than one based on RPE amplitude (to assess the variability of magnitude changes), and to a random walk model (to assess the consistency of direction changes).
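The requested annealed-learning-rate control could take, for example, a 1/n learning-rate schedule, which makes each Q-value a running average of rewards and mimics the posterior shrinkage of a Bayesian learner. A minimal sketch under that assumption (the 1/n schedule is one possible choice, not something specified in the manuscript):

```python
import numpy as np

def annealed_rl_update(Q, counts, s, a, r):
    """One update of a non-Bayesian RL model whose learning rate decays
    with experience.

    Q      : state-action value table, shape (n_states, n_actions)
    counts : per (state, action) visit counter of the same shape
    """
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]          # annealed learning rate
    Q[s, a] += alpha * (r - Q[s, a])    # Q becomes the sample mean of rewards
    return Q, counts
```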

The authors systematically plot variables of interest against weight quantile (e.g., Figs. 2, 3). This is interesting but not sufficient to evaluate the temporal evolution of these variables. In Dorfman & Gershman (2018; Fig. 6), the Go bias decreases rapidly in fewer than 10 trials, and then remains flat during the rest of the experiment. Is this also the case here? Is it also the case for the Pavlovian weight and for the variables of interest plotted in Figs. 2 and 3? Figure S1 shows the evolution of the Pavlovian weight through time. However, it is important to plot the trial-by-trial evolution, and not just averages over quarters of trials. Moreover, it is important to show distinct plots for different conditions and different groups of subjects (see next paragraph). Finally, because the Go bias and Pavlovian weights are expected to change substantially early in the experiment and then to remain nearly flat, it is important to verify that correlations with the model’s Pavlovian weight still hold when only considering the second half of each condition block.
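The requested trial-by-trial visualization is straightforward to produce; a minimal sketch, assuming per-subject weight trajectories have been extracted into arrays (names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weight_trajectories(w_by_condition):
    """Plot the trial-by-trial Pavlovian weight per condition,
    averaged across subjects (mean +/- s.e.m.).

    w_by_condition : dict mapping condition label ->
                     array of shape (n_subjects, n_trials)
    """
    for cond, w in w_by_condition.items():
        mean = w.mean(axis=0)
        sem = w.std(axis=0, ddof=1) / np.sqrt(w.shape[0])
        trials = np.arange(1, w.shape[1] + 1)
        plt.plot(trials, mean, label=cond)
        plt.fill_between(trials, mean - sem, mean + sem, alpha=0.3)
    plt.xlabel("Trial")
    plt.ylabel("Pavlovian weight w")
    plt.legend()
    plt.show()
```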

In Cavanagh et al. (2013), there are important performance differences between the four conditions as well as between learners and non-learners. Are there differences in model-fitting accuracy and in model parameters between conditions or between learners and non-learners? How does the evolution of the Pavlovian weight over time differ between these cases? And how does the correlation between frontal theta and the Pavlovian weight differ between these cases?

From the methods, it is not clear that the four conditions are presented in distinct blocks of trials. In contrast, this is clear from the original papers. I think this should be specified here too. Moreover, I think the results would be different if trials from the four conditions were intermixed, and this should be discussed here. Finally and more importantly, it is not clear to me whether the order between conditions was counterbalanced between subjects or not. This is important since some prior knowledge can be used by subjects and learning can be facilitated during late blocks (especially the fourth one) based on the previously encountered task rules. For instance, in the data of Guitart-Masip et al. (2011), how can one disentangle this effect from the Pavlovian effect to explain the better performance in the fourth block (NoGo to Avoid) than in the third block (NoGo to Win)? Were there significant differences in the initial values of the fitted model between conditions in any of the two datasets? And how did this affect learning and the evolution of the Pavlovian bias?

The authors verified that other model variables, like instrumental and Pavlovian values, do not correlate with the Pavlovian weight, and thus are not potential confounds. But other potential confounds should also be tested here, like choice confidence, reward uncertainty, and reward prediction errors, to make sure that frontal theta power here reflects only the suppression of the Pavlovian influence on choice. Along these lines, it has been shown previously that midfrontal theta may relate to reward prediction errors (Holroyd CB, Krigolson OE, Lee S (2011) Reward positivity elicited by predictive cues. Neuroreport 22:249–252), to behavioral slowing (Cavanagh JF, Frank MJ, Klein TJ, Allen JJ (2010) Frontal theta links prediction errors to behavioral adaptation in reinforcement learning. Neuroimage 49:3198–3209), and to switching (Cohen MX, Ranganath C (2007) Reinforcement learning signals predict future decisions. J Neurosci 27:371–378; van de Vijver I, Ridderinkhof KR, Cohen MX (2011) Frontal oscillatory dynamics predict feedback learning and action adjustment. J Cogn Neurosci 23:4106–4121). Could the authors control for these potential predictors of variations in frontal theta power, and test whether the Pavlovian bias in the model can account for these effects?
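One way to implement this control is a trial-wise multiple regression per subject, with theta power as the outcome and the Pavlovian weight entered alongside the candidate confounds; a surviving weight coefficient would suggest the theta-weight relationship is not driven by them. A minimal sketch (variable names are illustrative; in practice predictors would be z-scored and per-subject coefficients aggregated at the group level):

```python
import numpy as np
import statsmodels.api as sm

def theta_confound_regression(theta_power, w, rpe, uncertainty):
    """Regress trial-wise midfrontal theta power on the Pavlovian weight
    plus candidate confounds (here RPE and reward uncertainty)."""
    X = sm.add_constant(np.column_stack([w, rpe, uncertainty]))
    fit = sm.OLS(theta_power, X).fit()
    return fit.params, fit.pvalues   # coefficients for [const, w, rpe, uncertainty]
```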

About the EEG data, it is a bit unsatisfying to just show a basic correlation between frontal theta power and the Pavlovian bias. Cavanagh et al. (2013) found that ‘interindividual differences in theta power increases in response to Pavlovian conflict (across participants) correlated with intraindividual abilities to use theta (across trials) to overcome Pavlovian biases’. Could the authors here also assess the role of Pavlovian conflict and interindividual differences in the modulation of the correlation between frontal theta power and the Pavlovian bias?

The basic analyses of fMRI data presented here are not satisfying either. For instance, it has been shown that vMPFC encodes option values and confidence (M. Lebreton, R. Abitbol, J. Daunizeau, M. Pessiglione (2015) Automatic integration of confidence in the brain valuation signal, Nat. Neurosci., 18(8):1159–1167; B. De Martino, S. Bobadilla-Suarez, T. Nouguchi, T. Sharot, B. C. Love (2017) Social Information Is Integrated into Value and Confidence Judgments According to Its Reliability, J. Neurosci., 37(25):6066–6074). Could the authors check for these potential confounds of vMPFC activity and draw links with the model?

Some methodological information is missing. For instance, is the reported frontal theta power phase-locked or not, and what does this imply in terms of interpretation? In the fMRI data, to which events of the task are the IFG and vMPFC responses shown?

Typos

Page 6, differed fom 0 -> from 0.

Pages 7 and 11, please fix three occurrences of ‘NoGo¿Go’.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008553.r003

Decision Letter 1

Daniele Marinazzo

16 Oct 2020

Dear Dr. Gershman,

Thank you very much for submitting your manuscript "Neural signatures of arbitration between Pavlovian and instrumental action selection" for consideration at PLOS Computational Biology.

This version is much improved. Still, some important issues have not been addressed. We appreciate that you may disagree with these, but it would be good to discuss them, both before and after an eventual publication. In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology


***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I really like the author’s modelling of the neural data, and agree that their project makes sense as a follow up and consolidation of the earlier modelling of behavioural data. The results of the analysis of midfrontal theta and VMPFC effects are novel and would be useful to others who work in this field.

I regret that, in my view, the authors’ response to my critique was not sufficiently robust. Put simply, the original manuscript was motivated by behavioural findings X and Y; therefore, there needs to be a bit more reckoning when it is pointed out that the findings are actually ~X and ~Y.

While I agree with the authors that the Go/NoGo task could, in principle, rely on different mechanisms than the PIT task, and it is absolutely right to point out that any links between these tasks should be tested empirically, the logic of the model is based on the hypothesis that the task taps generic Pavlovian and instrumental processes. This is, indeed, how the abstract and introduction are written.

Therefore, when they say that the “Pavlovian weight can also be interpreted as the subjective degree of belief in an uncontrollable environment”, it makes sense for readers to reflect on situations where an agent is trained less well in the instrumental task, and whose belief in the controllability of the environment should therefore be lower. When the same agent is given an opportunity to behave in a way that can reflect both instrumental and Pavlovian biases, then, if I have understood the model correctly, the Pavlovian biases should be weaker when instrumental training was extensive, and stronger when instrumental training was less extensive. Yet in this situation, during a PIT test, the empirical finding is exactly the opposite. As the PIT situation is the best-studied example of the interaction of Pavlovian and instrumental processes, readers are likely to have this example in mind.

The work and the data remain interesting even if the model can only explain the Go/NoGo task, or tasks where the stimuli are spaced in particular ways; but this needs to be stated clearly in the introduction, with the boundaries laid out.

Minor comments

1. “At the time of stimulus presentation, the reward prediction error is simply the stimulus value –” Would participants not update the reward prediction based on their performance on previous trials, and the reward offered on previous trials, such that the reward they can predict at the start of a new trial varies somewhat?

2. In a number of places, the authors rule out alternative interpretations of the data by referring to null results of a signed-rank test – e.g., three places on p. 6 and one place at the top of p. 7. This is a worry because, while the sample may have been powered for the purpose of the original study, it may not have been sufficiently large to allow the detection of the effects the authors refer to here (e.g., if it was powered to detect differences between means, it may not be large enough to detect a significant correlation). Please could the authors state the statistical power they had in every case where they refer to a null result, e.g., “the median correlation never significantly differed from 0, although we only had power to detect effects that are medium in magnitude, or higher”.
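Such power statements for a signed-rank test are easy to obtain by simulation under an assumed effect size; a minimal sketch (the normal-effects assumption and the example numbers are illustrative, not taken from the manuscript):

```python
import numpy as np
from scipy.stats import wilcoxon

def signed_rank_power(n, effect_size, n_sims=5000, alpha=0.05, seed=0):
    """Monte Carlo power of a one-sample Wilcoxon signed-rank test,
    assuming normally distributed per-subject effects with mean
    effect_size (in SD units)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(effect_size, 1.0, size=n)
        hits += wilcoxon(x).pvalue < alpha
    return hits / n_sims

# e.g., signed_rank_power(30, 0.5) gives roughly 0.7 for a medium effect,
# suggesting limited sensitivity to smaller effects at this sample size.
```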

Reviewer #2: The authors have addressed all my questions sufficiently. I support the publication of this paper now.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008553.r005

Decision Letter 2

Daniele Marinazzo

23 Nov 2020

Dear Dr. Gershman,

We are pleased to inform you that your manuscript 'Neural signatures of arbitration between Pavlovian and instrumental action selection' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology


***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Although my own preference would be for another sentence in the discussion where the authors refer to PIT (perhaps pointing out that there are potential discrepancies to resolve in future work), I agree that the revised manuscript presents a fair, and far more cautious, interpretation of the results. The revised manuscript is far improved in my view, and I recommend that it now be published. I thank the authors for addressing my reservations.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008553.r006

Acceptance letter

Daniele Marinazzo

2 Feb 2021

PCOMPBIOL-D-20-00957R2

Neural signatures of arbitration between Pavlovian and instrumental action selection

Dear Dr Gershman,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Accuracy across conditions.

    (Left) Data from Guitart-Masip et al. (2012). (Right) Simulations of the adaptive Bayesian model, using parameters fitted to the data. Error bars show standard error of the mean.

    (PDF)

    S2 Fig. Pavlovian weight dynamics.

    The weight variable w is plotted across trial epochs, broken into quarters (note that the data sets have different numbers of trials). Error bars show standard error of the mean.

    (PDF)

    S3 Fig. Disaggregated EEG results.

    Midfrontal theta power (z-scored within subject) as a function of Pavlovian weight quantile, separated by stimulus condition. Error bars show standard error of the mean.

    (PDF)

    S4 Fig. Disaggregated fMRI results.

    BOLD response amplitude (z-scored within subject) as a function of Pavlovian weight quantile, separated by stimulus condition. Left: ventromedial prefrontal cortex. Middle: ventral striatum. Right: inferior frontal gyrus. Error bars show standard error of the mean.

    (PDF)

    S5 Fig. Parameter estimate histograms.

    Estimates were aggregated across the EEG and fMRI data sets.

    (PDF)

    Attachment

    Submitted filename: response.pdf

    Attachment

    Submitted filename: response.pdf

    Data Availability Statement

    All code and data for reproducing the analyses and figures are available at https://github.com/sjgershm/GoNoGo-neural.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS

    RESOURCES