eLife. 2017 Sep 15;6:e27430. doi: 10.7554/eLife.27430

A causal role for right frontopolar cortex in directed, but not random, exploration

Wojciech K Zajkowski 1, Malgorzata Kossut 2,3, Robert C Wilson 4,5
Editor: Michael J Frank 6
PMCID: PMC5628017  PMID: 28914605

Abstract

The explore-exploit dilemma occurs anytime we must choose between exploring unknown options for information and exploiting known resources for reward. Previous work suggests that people use two different strategies to solve the explore-exploit dilemma: directed exploration, driven by information seeking, and random exploration, driven by decision noise. Here, we show that these two strategies rely on different neural systems. Using transcranial magnetic stimulation to inhibit the right frontopolar cortex, we were able to selectively inhibit directed exploration while leaving random exploration intact. This suggests a causal role for right frontopolar cortex in directed, but not random, exploration and that directed and random exploration rely on (at least partially) dissociable neural systems.

Research organism: Human

Introduction

In an uncertain world adaptive behavior requires us to carefully balance the exploration of new opportunities with the exploitation of known resources. Finding the optimal balance between exploration and exploitation is a hard computational problem and there is considerable interest in understanding how humans and animals strike this balance in practice (Badre et al., 2012; Cavanagh et al., 2011; Cohen et al., 2007; Daw et al., 2006; Frank et al., 2009; Hills et al., 2015; Mehlhorn et al., 2015; Wilson et al., 2014). Recent work has suggested that humans use two distinct strategies to solve the explore-exploit dilemma: directed exploration, based on information seeking, and random exploration, based on decision noise (Wilson et al., 2014). Even though both of these strategies serve the same purpose, that is, balancing exploration and exploitation, it is likely they rely on different cognitive mechanisms. Directed exploration is driven by information and is thought to be computationally complex (Gittins and Jones, 1979; Auer et al., 2002; Gittins, 1974). On the other hand, random exploration can be implemented in a simpler fashion by using neural or environmental noise to randomize choice (Thompson, 1933).

A key question is whether these dissociable behavioral strategies rely on dissociable neural systems. Of particular interest is the frontopolar cortex (FPC) – an area that has been associated with a number of functions, such as tracking pending and/or alternate options (Koechlin and Hyafil, 2007; Boorman et al., 2009), strategies (Domenech and Koechlin, 2015) and goals (Pollmann, 2016) and that has been implicated in exploration itself (Badre et al., 2012; Cavanagh et al., 2011; Daw et al., 2006). Importantly, however, the exact role that FPC plays in exploration is unknown as how exploration is defined varies from paper to paper. In one line of work, exploration is defined as information seeking. Understood this way, exploration correlates with RFPC activity measured via fMRI (Badre et al., 2012) and a frontal theta component in EEG (Cavanagh et al., 2011), suggesting a role for RFPC in directed exploration. However, in another line of work, exploration is operationalized differently, as choosing the low value option, not the most informative. Such a measure of exploration is more consistent with random exploration where decision noise drives the sampling of low value options by chance. Defined in this way, exploratory choice correlates with lateral FPC activation (Daw et al., 2006) and stimulation and inhibition of RFPC with direct current (tDCS) can increase and decrease the frequency with which such exploratory choices occur (Raja Beharelle et al., 2015).

Taken together, these two sets of findings suggest that RFPC plays a crucial role in both directed and random exploration. However, we believe that such a conclusion is premature because of a subtle confound that arises between reward and information in most explore-exploit tasks. This confound arises because participants only gain information from the options they choose, yet are incentivized to choose more rewarding options. Thus, over many trials, participants gain more information about more rewarding options and the two ways of defining exploration, that is, choosing high information or low reward options, become confounded (Wilson et al., 2014). This makes it impossible to tell whether the link between RFPC and exploration is specific to either directed or random exploration, or whether it is general to both.

To distinguish these interpretations and investigate the causal role of FPC in directed and random exploration, we used continuous theta-burst TMS (Huang et al., 2005) to selectively inhibit right frontopolar cortex (RFPC) in participants performing the ‘Horizon Task’, an explore-exploit task specifically designed to separate directed and random exploration (Wilson et al., 2014). Using this task we find evidence that inhibition of RFPC selectively inhibits directed exploration while leaving random exploration intact.

Results

We used our previously published ‘Horizon Task’ (Figure 1) to measure the effects of TMS stimulation of RFPC on directed and random exploration. In this task, participants play a set of games in which they make choices between two slot machines (one-armed bandits) that pay out rewards from different Gaussian distributions. To maximize their rewards in each game, participants need to exploit the slot machine with the highest mean, but they cannot identify this best option without exploring both options first.

Figure 1. The horizon task.

Participants make a series of decisions between two one-armed bandits that pay out probabilistic rewards with unknown means. At the start of each game, ‘forced-choice’ trials give participants partial information about the mean of each option. We use the forced-choice trials to set up one of two information conditions: (A) an unequal (or [1 3]) condition in which participants see 1 play from one option and 3 plays from the other and (B) an equal (or [2 2]) condition in which participants see 2 plays from both options. A model-free measure of directed exploration is then defined as the change in information seeking with horizon in the unequal condition (A). Likewise, a model-free measure of random exploration is defined as the change in choosing the low mean option with horizon in the equal condition (B).

The Horizon Task has two key manipulations that allow us to measure directed and random exploration. The first manipulation is the horizon itself, i.e. the number of decisions remaining in each game. The idea behind this manipulation is that when the horizon is long (6 trials), participants should explore more frequently, because any information they acquire from exploring can be used to make better choices later on. In contrast, when the horizon is short (1 trial), participants should exploit the option they believe to be best. Thus, this task allows us to quantify directed and random exploration as changes in information seeking and behavioral variability that occur with horizon.

The second manipulation is the amount of information participants have about each option before making their first choice. This information manipulation is achieved by using four forced-choice trials, in which participants are told which option to pick, at the start of each game. We use these forced-choice trials to set up one of two information conditions: an unequal, or [1 3], condition, in which participants see 1 play from one option and 3 plays from the other option, and an equal, or [2 2], condition, in which participants see two outcomes from both options. By varying the amount of information participants have about each option independent of the mean payout of that option, this information manipulation allows us to remove the reward-information confound, at least on the first free-choice trial (Figure 2). After the first free-choice trial, however, participants tend to choose more rewarding options more frequently and reward and information are rapidly confounded. For this reason the bulk of our analyses are focussed on the first free-choice trial where the confound has been removed.

Figure 2. The reward-information confound.

The y-axis corresponds to the correlation between the sign of the difference in mean between options, $\mathrm{sgn}(\mu_{\mathrm{left}} - \mu_{\mathrm{right}})$, and the sign of the difference in the number of times each option has been played, $\mathrm{sgn}(n_{\mathrm{left}} - n_{\mathrm{right}})$. The forced trials are chosen such that the correlation is approximately zero on the first free-choice trial. After the first trial, however, a positive correlation quickly emerges as participants choose the more rewarding options more frequently. This strong confound between reward and information makes it difficult to dissociate directed and random exploration on later trials.
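The confound measure plotted in Figure 2 can be illustrated with a minimal sketch. It assumes the measure is a plain Pearson correlation across games between the two sign variables; the function name and the toy numbers are hypothetical and are not taken from the study's code or data.

```python
import numpy as np

def reward_info_correlation(mu_left, mu_right, n_left, n_right):
    """Pearson correlation, across games, between the sign of the mean
    difference and the sign of the play-count difference (left minus right)."""
    reward_sign = np.sign(np.asarray(mu_left, dtype=float) - np.asarray(mu_right, dtype=float))
    info_sign = np.sign(np.asarray(n_left, dtype=float) - np.asarray(n_right, dtype=float))
    return float(np.corrcoef(reward_sign, info_sign)[0, 1])

# Toy games (hypothetical numbers): the left option is better in games 1 and 3.
mu_left = [60, 40, 60, 40]
mu_right = [40, 60, 40, 60]

# Balanced forced-choice design: play counts are unrelated to which option is better.
r_balanced = reward_info_correlation(mu_left, mu_right, [3, 1, 1, 3], [1, 3, 3, 1])

# Confounded case: the better option has always been played more often.
r_confounded = reward_info_correlation(mu_left, mu_right, [3, 1, 3, 1], [1, 3, 1, 3])
```

In the balanced case the correlation is zero, and in the confounded case it is one. Note that `np.corrcoef` returns NaN if either sign variable is constant across games, so a real analysis would need to handle that case.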

RFPC stimulation selectively inhibits directed exploration on the first free-choice trial

In this section we analyze behavior on the first free-choice trial in each game. This way we are able to remove any effect of the reward-information confound and fairly compare behavior between horizon conditions. We analyze the data with both a model-free approach, using simple statistics of the data to quantify directed and random exploration, as well as a model-based approach, using a cognitive model of the behavior to draw more precise conclusions. Both analyses point to the same conclusion that RFPC stimulation selectively inhibits directed, but not random, exploration.

Model-free analysis

The two information conditions in the Horizon Task allow us to quantify directed and random exploration in a model-free way. In particular, directed exploration, which involves information seeking, can be quantified as the probability of choosing the high information option, p(high info) in the [1 3] condition, while random exploration, which involves decision noise, can be quantified as the probability of making a mistake, or choosing the low mean reward option, p(low mean), in the [2 2] condition.
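As a rough illustration of these two model-free measures, the sketch below computes them from per-game arrays. The function and argument names are hypothetical, and it assumes that "low mean" refers to the mean of the outcomes observed on the forced-choice trials and that games with tied means have already been dropped.

```python
import numpy as np

def p_high_info(chose_left, n_left, n_right):
    """Fraction of [1 3] games in which the less-sampled (more informative)
    option was chosen on the first free-choice trial."""
    chose_left = np.asarray(chose_left, dtype=bool)
    left_is_high_info = np.asarray(n_left) < np.asarray(n_right)
    return float(np.mean(chose_left == left_is_high_info))

def p_low_mean(chose_left, mean_left, mean_right):
    """Fraction of [2 2] games in which the option with the lower mean
    was chosen on the first free-choice trial."""
    chose_left = np.asarray(chose_left, dtype=bool)
    left_is_low_mean = np.asarray(mean_left) < np.asarray(mean_right)
    return float(np.mean(chose_left == left_is_low_mean))

# Toy data (hypothetical): four [1 3] games and four [2 2] games.
directed = p_high_info([1, 0, 1, 1], n_left=[1, 3, 3, 1], n_right=[3, 1, 1, 3])
random_expl = p_low_mean([1, 0, 0, 1], mean_left=[40, 40, 60, 60], mean_right=[60, 60, 40, 40])
```

Computing each measure separately for horizon 1 and horizon 6 games and taking the difference gives the horizon-related change that the analyses below rely on.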

Using these measures of exploration, we found that inhibiting the RFPC had a significant effect on directed exploration but not random exploration (Figure 3A,B). In particular, for directed exploration, a repeated measures ANOVA with horizon, TMS condition and order as factors revealed a significant interaction between stimulation condition and horizon (F(1,24)=4.96, p=0.036). Conversely, a similar analysis for random exploration revealed no effects of stimulation condition (main effect of stimulation condition, F(1,24)=0.88, p=0.36; interaction of stimulation condition with horizon, F(1,24)=1.24, p=0.28). Post hoc analyses revealed that the change in directed exploration was driven by changes in information seeking in horizon 6 (one-sided t-test, t(24)=2.62, p=0.008) and not in horizon 1 (two-sided t-test, t(24)=0.30).

Figure 3. Model-free analysis of the first free-choice trial shows that RFPC stimulation affects directed, but not random, exploration.

(A) In the control (vertex) condition, information seeking increases with horizon, consistent with directed exploration. When RFPC is stimulated, directed exploration is reduced, an effect that is entirely driven by changes in horizon 6 (* denotes p<0.02 and ** denotes p<0.005; error bars are ± s.e.m.). (B) Random exploration increases with horizon but is not affected by RFPC stimulation.

Model-based analysis

While the model-free analyses are intuitive, the model-free statistics, p(high info) and p(low mean), are not pure reflections of information seeking and behavioral variability and could be influenced by other factors such as spatial bias and learning. To account for these possibilities we performed a model-based analysis using a model that extends our earlier work (Wilson et al., 2014; Somerville et al., 2017; Krueger et al., 2017); see Materials and methods for a complete description. In this model, the level of directed and random exploration is captured by two parameters: an information bonus for directed exploration, and decision noise for random exploration. In addition, the model includes terms that capture spatial bias and learning.

Overview of model

Before presenting the results of the model-based analysis we begin with a brief overview of the most salient points of the model. A full description of the model can be found in the Methods and code to implement the model can be found in the Supplementary Material.

Conceptually, the model breaks the explore-exploit choice down into two components: a learning component, in which participants estimate the mean payoff of each option from the rewards they see, and a decision component, in which participants use this estimated payoff to guide their choice. The learning component assumes that participants compute an estimate of the average payoff for each slot machine, $R_t^i$, using a simple delta rule update equation (based on a Kalman filter (Kalman, 1960), see Materials and methods):

$$R_{t+1}^i = R_t^i + \alpha_t^i \left(r_t - R_t^i\right) \tag{1}$$

where $r_t$ is the reward on trial $t$ and $\alpha_t^i$ is the time-varying learning rate that determines the extent to which the prediction error, $(r_t - R_t^i)$, updates the estimate of the mean of bandit $i$. The learning process is described by three free parameters: the initial value of the estimated payoff, $R_0$, and two learning rates, the initial learning rate, $\alpha_1$, and the asymptotic learning rate, $\alpha_\infty$, which together describe the evolution of the actual learning rate, $\alpha_t$, over time. For simplicity, we assume that these parameters are independent of horizon and uncertainty condition (Table 1).
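The delta rule in Equation 1 can be sketched for a single bandit as follows. The schedule that takes the learning rate from its initial to its asymptotic value is described in the Methods and is not reproduced here, so this sketch simply takes the per-trial learning rates as an input; the function name and example rewards are hypothetical.

```python
def delta_rule(rewards, learning_rates, r0):
    """Apply R <- R + alpha * (r - R) once per observed reward and
    return the final estimate for one bandit."""
    estimate = r0
    for reward, alpha in zip(rewards, learning_rates):
        estimate = estimate + alpha * (reward - estimate)
    return estimate

rewards = [40.0, 60.0, 50.0, 70.0]

# With alpha_t = 1/t the prior is overwritten on the first trial and the
# estimate equals the running mean of the rewards.
running_mean = delta_rule(rewards, [1.0 / (t + 1) for t in range(len(rewards))], r0=50.0)

# With a constant learning rate, recent rewards are weighted more heavily.
recency_weighted = delta_rule(rewards, [0.5] * len(rewards), r0=50.0)
```

Here `running_mean` equals the sample mean of 55, whereas the constant-rate estimate is pulled toward the most recent reward of 70.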

Table 1. Model parameters.

Each subject’s behavior on the first free choice of each session is described by 13 free parameters. Three of these parameters ($R_0$, $\alpha_1$ and $\alpha_\infty$) describe the learning process and do not vary with horizon or uncertainty condition. Ten of these parameters ($A$, $B$ and $\sigma$ in the different horizon and information conditions) describe the decision process. All parameters are estimated for each subject in each stimulation condition and the key analysis asks whether parameters change between vertex and RFPC stimulation.

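| Parameter | Horizon dependent? | Uncertainty dependent? | TMS dependent? |
| --- | --- | --- | --- |
| prior mean, $R_0$ | no | no | yes |
| initial learning rate, $\alpha_1$ | no | no | yes |
| asymptotic learning rate, $\alpha_\infty$ | no | no | yes |
| information bonus, $A$ | yes | n/a | yes |
| spatial bias, $B$ | yes | yes | yes |
| decision noise, $\sigma$ | yes | yes | yes |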

The decision component of the model assumes that participants choose between the two options (left and right) probabilistically according to

$$p(\text{choose right}) = \frac{1}{1 + \exp\left(\dfrac{\Delta R + A \Delta I + B}{\sigma}\right)} \tag{2}$$

where $\Delta R = R_t^{\mathrm{left}} - R_t^{\mathrm{right}}$ is the difference in expected reward between left and right options and $\Delta I$ is the difference in information between left and right options (which we define as +1 when left is more informative, −1 when right is more informative, and 0 when both options convey equal information in the [2 2] condition). The decision process is described by three free parameters: the information bonus $A$, the spatial bias $B$, and the decision noise $\sigma$. We estimate separate values of the decision parameters for each horizon and (since the information bonus is only used in the [1 3] condition) separate values of only the bias and decision noise for each uncertainty condition.
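A minimal sketch of the choice rule in Equation 2 is given below, reading the equation as a logistic function of the summed reward difference, information bonus and spatial bias, divided by the decision noise. The function and argument names are hypothetical, and the parameter values are chosen only for illustration.

```python
import math

def p_choose_right(delta_r, delta_i, info_bonus, spatial_bias, noise):
    """Probability of choosing the right option.

    delta_r: expected reward of left minus right.
    delta_i: +1 if left is more informative, -1 if right is, 0 if equal.
    """
    return 1.0 / (1.0 + math.exp((delta_r + info_bonus * delta_i + spatial_bias) / noise))

# Left is worth 10 points more and information is equal: right is rarely chosen.
p_exploit = p_choose_right(delta_r=10.0, delta_i=0, info_bonus=10.0, spatial_bias=0.0, noise=5.0)

# Same reward gap, but right is more informative and the bonus is 10 points:
# the bonus cancels the reward difference and the choice becomes a coin flip.
p_directed = p_choose_right(delta_r=10.0, delta_i=-1, info_bonus=10.0, spatial_bias=0.0, noise=5.0)

# Larger decision noise pushes the choice toward 0.5 without any information bonus.
p_noisy = p_choose_right(delta_r=10.0, delta_i=0, info_bonus=10.0, spatial_bias=0.0, noise=50.0)
```

The two routes to exploration are visible here: raising the information bonus shifts choices toward the informative option specifically, while raising the noise flattens the choice curve for any option. For very large arguments `math.exp` can overflow, which a production implementation would need to guard against.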

Overall, each subject’s behavior in each session (vertex vs RFPC stimulation) is described by 13 free parameters (Table 1): three describing learning ($R_0$, $\alpha_1$ and $\alpha_\infty$) and 10 describing the decision process ($A$ in the two horizon conditions, $B$ and $\sigma$ in the four horizon-x-uncertainty conditions). These 13 parameters were fit to each subject in each stimulation condition using a hierarchical Bayesian approach (Lee and Wagenmakers, 2014) (see Materials and methods).
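The count of 13 parameters per subject and session can be checked by enumerating them. The labels below are hypothetical names used only for this illustration.

```python
from itertools import product

HORIZONS = (1, 6)
CONDITIONS = ("[1 3]", "[2 2]")

# Learning parameters do not vary with horizon or uncertainty condition.
learning = ["R0", "alpha_1", "alpha_inf"]
# The information bonus only applies in the [1 3] condition: one value per horizon.
info_bonus = [f"A(h={h})" for h in HORIZONS]
# Bias and noise get one value per horizon-by-uncertainty cell.
spatial_bias = [f"B(h={h}, {c})" for h, c in product(HORIZONS, CONDITIONS)]
decision_noise = [f"sigma(h={h}, {c})" for h, c in product(HORIZONS, CONDITIONS)]

parameters = learning + info_bonus + spatial_bias + decision_noise
print(len(parameters))  # 3 + 2 + 4 + 4 = 13
```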

Model fitting results

Posterior distributions over the group-level means are shown in the left column of Figure 4, while posteriors over the TMS-related change in parameters are shown in the right column. Both columns suggest a selective effect of RFPC stimulation on the information bonus in horizon 6.

Figure 4. Model-based analysis of the first free-choice trial showing the effect of RFPC stimulation on each of the 13 parameters.

Left column: posterior distributions over each parameter value for the RFPC and vertex stimulation conditions. Right column: posterior distributions over the change in each parameter between stimulation conditions. Note that, because information bonus, decision noise and spatial bias are all in units of points, we plot them on the same scale to facilitate comparison of effect size.

Focussing on the left column first, overall the parameter values seem reasonable. The prior mean is close to the generative mean of 50 used in the actual experiment, and the decision parameters are comparable to those found in our previous work (Wilson et al., 2014). The learning rate parameters, $\alpha_1$ and $\alpha_\infty$, were not included in our previous models and are worth discussing in more detail. As expected for Bayesian learning (Kalman, 1960; Nassar et al., 2010), the initial learning rate is higher than the asymptotic learning rate (95% of samples in the vertex condition, 94% in the RFPC condition). However, the actual values of the learning rates are quite far from their ‘optimal’ settings of $\alpha_1 = 1$ and $\alpha_\infty = 0$ that would correspond to perfectly computing the mean reward. This suggests a greater than optimal reliance on the prior ($\alpha_1 < 1$) and a pronounced recency bias ($\alpha_\infty > 0$) such that the most recent rewards are weighted more heavily in the computation of expected reward, $R_t^i$. Both of these findings are likely due to the fact that the version of the task we employed did not keep the outcomes of the forced trials on screen and instead relied on people’s memories to compute the expected value.

Turning to the right-hand column of Figure 4, we can see that the model-based analysis yields a similar result to the model-free analysis. In particular we see a reduction (of about 4.8 points) in the information bonus in horizon 6 (with 99% of samples showing a reduced information bonus in the RFPC stimulation condition) and no effect on decision noise in either horizon in either the [2 2] or [1 3] uncertainty conditions (with between 40% and 63% of samples below zero).

In addition to the effect on the information bonus in horizon 6, there is also a hint of an effect on the information bonus in horizon 1 (85% of samples less than zero) and on the prior mean R0 (88% of samples above zero). While these results may suggest that RFPC stimulation affects more than just information bonus in horizon 6, they more likely reflect an inherent tradeoff between prior mean and information bonus that is peculiar to this task. In particular, because the prior mean has a stronger effect on the more uncertain option, an increase in R0 increases the value of the more informative option in much the same way as an information bonus. Thus, when applied to this task, the model has a built-in tradeoff between prior mean and information bonus that can muddy the interpretation of both. Note that this tradeoff is not a general feature of the model and could be removed with a different task design that employed more forced-choice trials and hence more time for the effects of the prior to be removed.

Figure 5 exposes the tradeoff between $R_0$ and $A$ in more detail. Panels A and B plot samples from the posterior over the TMS-related change in information bonus, $A(\text{RFPC}) - A(\text{vertex})$, against the TMS-related change in prior mean, $R_0(\text{RFPC}) - R_0(\text{vertex})$. For both horizon conditions we see a strong negative correlation such that increasing $R_0$ decreases $A$. This negative correlation is especially problematic for the interpretation of the horizon 1 change in information bonus, where a sizable fraction of the posterior centers on no change in either variable. In contrast, the negative correlation between $A$ and $R_0$ does not affect our interpretation of the horizon 6 result, where the TMS-related change in $A$ is negative regardless of the change in $R_0$.

Figure 5. Correlation between TMS-induced changes in information bonus, A, and TMS-induced changes in the prior mean, R0.

(A, B) Samples from the posterior distributions over the TMS-related changes in prior mean, $R_0$, and TMS-related change in information bonus in horizon 1 (A) and horizon 6 (B). In both cases we see a negative correlation between the change in $R_0$ and the change in $A$, consistent with a tradeoff between these variables in the model. (C) Samples from the posterior over the effect of TMS stimulation on the horizon-related change in information bonus, $\Delta A = A(h=6) - A(h=1)$, plotted against samples from the TMS-related change in prior mean. Here we see no correlation between variables and the majority of $\Delta A(\text{RFPC}) - \Delta A(\text{vertex})$ samples below zero, consistent with an effect of RFPC stimulation on directed exploration.

Finally, we asked whether the horizon-dependent change in information seeking, i.e. ΔA=A(h=6)-A(h=1), was different in each TMS condition. As shown in Figure 5C, the TMS-related change in ΔA is about −3.1 points (94% of samples below 0) and is uncorrelated with the TMS-related change in R0. Taken together, this suggests that we can be fairly confident in our claim that RFPC stimulation has a selective effect on directed exploration.
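Statements such as "94% of samples below 0" summarize posterior samples of a difference. The sketch below shows one way such a summary could be computed, assuming the TMS-related change is taken as RFPC minus vertex so that a reduction under RFPC stimulation is negative. The function names and the toy sample values are hypothetical and are not the study's posterior samples.

```python
import numpy as np

def fraction_below_zero(samples):
    """Share of posterior samples that are negative."""
    return float(np.mean(np.asarray(samples, dtype=float) < 0))

def tms_change_in_horizon_effect(a_h1_vertex, a_h6_vertex, a_h1_rfpc, a_h6_rfpc):
    """Per-sample TMS-related change in Delta A = A(h=6) - A(h=1),
    computed as Delta A under RFPC minus Delta A under vertex (assumed convention)."""
    a_h1_vertex, a_h6_vertex, a_h1_rfpc, a_h6_rfpc = (
        np.asarray(x, dtype=float)
        for x in (a_h1_vertex, a_h6_vertex, a_h1_rfpc, a_h6_rfpc)
    )
    delta_a_vertex = a_h6_vertex - a_h1_vertex
    delta_a_rfpc = a_h6_rfpc - a_h1_rfpc
    return delta_a_rfpc - delta_a_vertex

# Two toy posterior samples (hypothetical values, in points).
change = tms_change_in_horizon_effect(
    a_h1_vertex=[2.0, 3.0], a_h6_vertex=[10.0, 9.0],
    a_h1_rfpc=[2.0, 3.0], a_h6_rfpc=[6.0, 10.0],
)
share_negative = fraction_below_zero(change)
```

With real MCMC output the inputs would be aligned draws from the joint posterior, so that each difference is taken within a single draw.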

The effect of RFPC stimulation on later trials

Our analyses so far have focussed on just the first free choice and have ignored the remaining five choices in the horizon six games. The reason for this is the reward-information confound, illustrated in Figure 2, which makes interpretation of the later trials more difficult. Despite this difficulty, we note that in Figure 2 the size of the confound is almost identical in the two stimulation conditions and so we proceed, with caution, to present a model-free analysis of the later trials below.

In Figure 6 we plot the model-free measures, p(high info) and p(low mean), as a function of trial number. Both measures show a decrease over the course of the horizon six games although, because of the confound, it is difficult to say whether these changes reflect a reduction in directed exploration, random exploration, or both. Interestingly, the differences in p(high info) between vertex and RFPC conditions on the first free-choice trial appear to persist into the second, a result that becomes more apparent when we plot the TMS-related change, that is, $p(\text{high info}, \text{RFPC}) - p(\text{high info}, \text{vertex})$ (Figure 6C,D). More formally, a repeated measures ANOVA with trial number and TMS condition as factors reveals a significant main effect of trial number (F(5,120)=126, $p < 10^{-45}$), no main effect of TMS condition (F(1,120)=1.17, p=0.29) and a near significant interaction between trial number and TMS condition (F(5,120)=2.26, p=0.053). A post hoc, one-sided t-test reveals a marginally significant reduction in p(high info) on the second trial (t(24)=1.61). In contrast, a similar analysis for random exploration shows no evidence for any effect of TMS condition on p(low mean) (main effect of TMS, F(1,120)=0.16, p=0.69; TMS x trial number, F(5,120)=0.69, p=0.63) although the main effect of trial number persists (F(5,120)=13.7, $p < 10^{-9}$). Thus, the analysis of later trials provides additional, albeit modest, support for the idea that RFPC stimulation selectively disrupts directed but not random exploration at long horizons.

Figure 6. Model-free analysis of all trials.

(A, B) Model-free measures of directed (A) and random (B) exploration as a function of trial number suggest a reduction in both directed and random exploration over the course of the game. (C, D) TMS-induced change in measures of directed and random exploration as a function of trial number. This suggests that the reduction in directed exploration on the first free-choice trial persists into the second trial of the game.

Discussion

In this work we used continuous theta-burst transcranial magnetic stimulation (cTBS) to investigate whether right frontopolar cortex (RFPC) is causally involved in directed and random exploration. Using a task that is able to behaviorally dissociate these two types of exploration, we found that inhibition of RFPC caused a selective reduction in directed, but not random exploration. To the best of our knowledge, this finding represents the first causal evidence that directed and random exploration rely on dissociable neural systems and is consistent with our recent findings showing that directed and random exploration have different developmental profiles (Somerville et al., 2017). This suggests that, contrary to the assumption underlying many contemporary studies (Daw et al., 2006; Badre et al., 2012), exploration is not a unitary process, but a dual process in which the distinct strategies of information seeking and choice randomization are implemented via distinct neural systems.

Such a dual-process view of exploration is consistent with the classical idea that there are multiple types of exploration (Berlyne, 1966). In particular Berlyne’s constructs of ‘specific exploration’, involving a drive for information and ‘diversive exploration’, involving a drive for variety, bear a striking resemblance to our definitions of directed and random exploration. Despite the importance of Berlyne’s work, more modern views of exploration tend not to make the distinction between different types of exploration, considering instead a single exploratory state or exploratory drive that controls information seeking across a wide range of tasks (Berlyne, 1966; Aston-Jones and Cohen, 2005; Hills et al., 2015; Kidd and Hayden, 2015). At face value, such unitary accounts seem at odds with a dual-process view of exploration. However, these two viewpoints can be reconciled if we allow for the possibility that, while directed and random exploration are implemented by different systems, their levels are set by a common exploratory drive.

Intriguingly, individual differences in behavior on the Horizon Task provide some support for the idea that directed and random exploration are driven by a common source. In particular, in a large behavioral data set of 277 people performing the Horizon Task, we find a positive correlation between the levels of directed and random exploration such that people with high levels of directed exploration also tend to have high levels of random exploration ($r(275) = 0.29$, $p < 10^{-5}$; Figure 7). This is consistent with the idea that the levels of directed and random exploration are set by the strength of an exploratory drive that varies as an individual difference between people.
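As a quick check on the reported statistic, the sketch below converts a Pearson correlation into its t statistic using the standard formula t = r·sqrt(df / (1 − r²)) with df = n − 2. The function name is hypothetical; the inputs are the sample size and correlation reported above.

```python
import math

def t_from_r(r, n):
    """t statistic and degrees of freedom for a Pearson correlation r
    computed from n paired observations."""
    df = n - 2
    t_stat = r * math.sqrt(df / (1.0 - r * r))
    return t_stat, df

t_stat, df = t_from_r(0.29, 277)
```

This gives 275 degrees of freedom, matching the r(275) notation, and a t statistic of roughly 5, which corresponds to a small p-value in line with the reported bound.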

Figure 7. Correlation between individual differences in the levels of directed and random exploration in a sample of 277 people performing the Horizon Task.

While the present study does allow us to conclude that directed and random exploration rely on different neural systems, the limited spatial specificity of TMS limits our ability to say exactly what those systems are. In particular, because the spatial extent of TMS is quite large, stimulation aimed at frontal pole may directly affect activity in nearby areas such as ventromedial prefrontal cortex (vmPFC) and orbitofrontal cortex (OFC), both areas that have been implicated in exploratory decision making and that may be contributing to our effect (Daw et al., 2006). In addition to such direct effects of TMS on nearby regions, indirect changes in areas that are connected to the frontal pole could also be driving our effect. For example, cTBS of left frontal pole has been associated with changes in blood perfusion in areas such as amygdala, fusiform gyrus and posterior parietal cortex (Volman et al., 2011) and with changes in the fMRI BOLD signal in OFC, insula and striatum (Hanlon et al., 2017). In addition, Volman et al. (2011) showed that unilateral cTBS of left frontal pole is associated with changes in blood perfusion to the right frontal pole. Indeed, such a bilateral effect of cTBS may explain why our intervention was effective at all given that a number of neuroimaging studies have shown bilateral activation of the frontal pole associated with exploration (Daw et al., 2006; Badre et al., 2012). Future work combining cTBS with neuroimaging will be necessary to shed light on these issues.

With the above caveats that our results may not be entirely due to disruption of frontal pole, the interpretation that RFPC plays a role in directed, but not random, exploration is consistent with a number of previous findings. For example, frontal pole has been associated with tracking the value of the best unchosen option (Boorman et al., 2009), inferring the reliability of alternate strategies (Boorman et al., 2009; Domenech and Koechlin, 2015), arbitrating between old and new strategies (Donoso et al., 2014; Mansouri et al., 2015), and reallocating cognitive resources among potential goals in underspecified situations (Pollmann, 2016). Taken together, these findings suggest a role for frontal pole in model-based decisions (Daw et al., 2006) that involve long-term planning and the consideration of alternative actions. From this perspective, it is perhaps not surprising that directed exploration relies on RFPC, since computing an information bonus relies heavily on an internal model of the world. It is also perhaps not surprising that random exploration is independent of RFPC, as this simpler strategy could be implemented without reference to an internal model. Indeed, the ability to explore effectively in a model-free manner, may be an important function of random exploration as it allows us to explore even when our model of the world is wrong.

More generally, it is unlikely that frontal pole is the only area involved in directed exploration, and more work will be needed to map out the areas involved in directed and random exploration and expose their causal relationship to explore-exploit behavior.

Materials and methods

Participants

Thirty-one healthy, right-handed adult volunteers (19 female, 12 male; ages 19–32) took part in the study. An initial sample size of 16 was chosen based on two studies using a very similar cTBS design that stimulated lateral FPC (Costa et al., 2011; Costa et al., 2013) and this was augmented to 31 on the basis of feedback from reviewers. Five participants (5 female, 0 male) were excluded from the analysis due to chance-level performance in both experimental sessions. One (female) participant failed to return for the second (vertex stimulation condition) session and is excluded from the model-free analyses but not from the model-based analyses, as the latter can handle missing data more gracefully. Thus our final data set consisted of 25 participants (13 female, 12 male, ages 19–32) with complete data and one participant (female, aged 20) with data from the RFPC session only.

All participants were informed about potential risks connected to TMS and gave written consent. The study was approved by the ethics committee of the University of Social Sciences and Humanities.

Procedure

There were two experimental TMS sessions and a preceding MRI session. In the MRI session, T1 structural images were acquired using a 3T Siemens TRIO scanner. The scanning session lasted up to 10 min. Before the first two sessions, participants filled in standard safety questionnaires regarding MRI scanning and TMS. During the experimental sessions, prior to the stimulation, participants went through 16 training games to get accustomed to the task. Afterwards, resting motor thresholds were obtained and the stimulation took place. Participants began the main task immediately after stimulation. The two experimental sessions were performed with an intersession interval of at least 5 days. The order of stimulation conditions was counterbalanced across subjects. All sessions took place at the Nencki Institute of Experimental Biology in Warsaw.

Stimulation site

The RFPC peak was defined as [x, y, z] = [35, 50, 15] in MNI (Montreal Neurological Institute) space. The coordinates were based on a number of fMRI findings that indicated RFPC involvement in exploration (Badre et al., 2012; Boorman et al., 2009; Daw et al., 2006) and constrained by the plausibility of stimulation (e.g. setting the z coordinate lower would result in the coil being placed uncomfortably close to the eyes). Vertex corresponded to the Cz position of the 10–20 EEG system. To locate the stimulation sites we used a frameless neuronavigation system (Brainsight software, Rogue Research, Montreal, Canada) with a Polaris Vicra infrared camera (Northern Digital, Waterloo, Ontario, Canada).

TMS protocol

We used continuous theta burst stimulation (cTBS) (Huang et al., 2005). In this protocol, 50 Hz stimulation is delivered at 80% of resting motor threshold. A 40 s train corresponds to 600 pulses and can decrease cortical excitability for up to 50 min (Wischnewski and Schutter, 2015).

Individual resting motor thresholds were assessed by stimulating the right motor knob and checking whether the stimulation caused an involuntary hand twitch in 50% of cases. We used a MagPro X100 stimulator (MagVenture, Hueckelhoven, Germany) with a 70 mm figure-eight coil. TMS was delivered in line with established safety guidelines (Rossi et al., 2009).

Limitations

Defining the stimulation target by peak coordinates taken from previous studies did not allow us to account for individual differences in either brain anatomy or the impact of TMS on brain networks (Gratton et al., 2013). However, a study by Volman and colleagues (Volman et al., 2011), which used the same theta-burst protocol on the left frontopolar cortex, showed bilateral inhibitory effects on blood perfusion in the frontal pole. This suggests that both right and left parts of the frontopolar cortex might have been inhibited in our experiment, which is consistent with imaging results indicating bilateral involvement of the frontal pole in exploratory decisions.

Task

The task was a modified version of the Horizon Task (Wilson et al., 2014). As in the original paper, the payoff distributions of the bandits were independent between games, and payoffs were drawn from Gaussian distributions with variable means and a fixed standard deviation of 8 points. Participants were informed that in every game one of the bandits was objectively ‘better’ (had a higher mean payoff). Differences between the mean payouts of the two slot machines were set to either 4, 8, 12 or 20. One of the means was always equal to either 40 or 60 and the second was set accordingly. The order of games was randomized. Mean sizes and order of presentation were counterbalanced. Participants played 160 games and the whole task lasted between 39 and 50 min (mean 43.4 min).
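
The payoff structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the task code used in the study: the helper names `make_game` and `draw_reward` are hypothetical, and the sketch assumes that the second mean may lie above or below the 40/60 anchor and that the anchor is placed on a random side.

```python
import random

REWARD_SD = 8                  # fixed standard deviation of payoffs (from the text)
BASE_MEANS = (40, 60)          # one bandit's mean is always 40 or 60
MEAN_DIFFS = (4, 8, 12, 20)    # possible gaps between the two means


def make_game(rng):
    """Draw the two bandit means for one game (hypothetical helper)."""
    base = rng.choice(BASE_MEANS)
    diff = rng.choice(MEAN_DIFFS)
    sign = rng.choice((-1, 1))        # assumption: the other bandit may be better or worse
    means = [base, base + sign * diff]
    rng.shuffle(means)                # assumption: the anchor bandit's side is random
    return {"left": means[0], "right": means[1]}


def draw_reward(mean, rng):
    """Sample one payoff from a Gaussian with the bandit's mean and SD 8."""
    return rng.gauss(mean, REWARD_SD)


rng = random.Random(0)
game = make_game(rng)
left_rewards = [draw_reward(game["left"], rng) for _ in range(4)]
```

Counterbalancing of mean sizes and presentation order, which the task enforces, is not reproduced here; the sketch samples each game independently.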

Each game consisted of 5 or 10 choices. Every game started with a screen saying ‘New game’ and information about whether it was a long or short horizon, followed by sequentially presented choices. Every choice was presented on a separate screen, so that participants had to keep the previous scores in memory. There was no time limit for decisions. During forced choices participants had to press the prompted key to move to the next choice. During free choices they could press either ‘z’ or ‘m’ to indicate their choice of left or right bandit. Responses made less than 200 ms after choice onset were not accepted, preventing participants from accidentally responding too soon. The score feedback was presented for 500 ms. A counter at the bottom of the screen indicated the number of choices left in a given game. The task was programmed using PsychoPy software v1.86 (Peirce, 2007).

Participants were rewarded based on points scored in two sessions. The payoff bounds were set between 50 and 80 zl (equivalent to approximately 12 and 19 euro). Participants were informed about their score and monetary reward after the second session.

Finally, the random seeds were not perfectly controlled between subjects. The first 16 subjects ran the task with identical random seeds and thus all 16 saw the same sequence of forced-choice trials in both vertex and RFPC sessions. For the remaining subjects the random seed was unique for each subject and each session, so these subjects had a unique series of forced-choice trials for each session. Despite this limitation we saw no evidence of different behavior across the two groups.

Data and code

Behavioral data as well as Matlab code to recreate the main figures from this paper can be found on the Dataverse website at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CZT6EE.

Model-based analysis

We modeled behavior on the first free choice of the Horizon Task using a version of the logistic choice model in Wilson et al. (2014) that was modified to include a learning component. In particular, we assume that participants use the outcomes of the forced-choice trials to learn an estimate of the mean reward of each option, before inputting that mean reward into a decision function that includes terms for directed and random exploration. This model naturally decomposes into a learning component and a decision component and we consider each of these components in turn.

Learning component

The learning component of the model assumes that participants use a Kalman filter to learn a value for the mean reward of each option. The Kalman filter (Kalman, 1960) has been used to model learning in other explore-exploit tasks (Daw et al., 2006) and is a popular model of Bayesian learning as it is both analytically tractable and easily relatable to the delta-rule update equations of reinforcement learning.

More specifically, the Kalman filter assumes a generative model in which the rewards from each bandit, $r_t$, are generated from a Gaussian distribution with a fixed standard deviation, $\sigma_r$, and a mean, $m_t^i$, that is different for each bandit and can vary over time. The time dependence of the mean is determined by a Gaussian random walk with mean 0 and standard deviation $\sigma_d$. Note that this generative model, assumed by the Kalman filter, is slightly different from the true generative model used in the Horizon Task, which assumes that the mean of each bandit is constant over time, that is, $\sigma_d = 0$. This mismatch between the assumed and actual generative models is quite deliberate and allows us to account for the suboptimal learning of the subjects. In particular, this mismatch introduces the possibility of a recency bias (when $\sigma_d > 0$) whereby more recent rewards are over-weighted in the computation of $R_t^i$.

The actual equations of the Kalman filter model are straightforward. The model keeps track of an estimate of both the mean reward, $R_t^i$, of each option, $i$, and the uncertainty in that estimate, $\sigma_t^i$. When option $i$ is played on trial $t$, these two parameters update according to

$$R_{t+1}^i = R_t^i + \frac{(\sigma_{t+1}^i)^2}{\sigma_r^2}\left(r_t - R_t^i\right) \tag{3}$$

and

$$\frac{1}{(\sigma_{t+1}^i)^2} = \frac{1}{(\sigma_t^i)^2 + \sigma_d^2} + \frac{1}{\sigma_r^2} \tag{4}$$

When option $i$ is not played on trial $t$ we assume that the estimate of the mean stays the same, but that the uncertainty in this estimate grows as the generative model assumes the mean drifts over time. Thus for unchosen option $j$ we have

$$R_{t+1}^j = R_t^j \quad \text{and} \quad (\sigma_{t+1}^j)^2 = (\sigma_t^j)^2 + \sigma_d^2 \tag{5}$$
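
As a minimal sketch of how Equations 3–5 could be applied trial by trial, the following Python function updates the estimates for all options after one choice. The function name `kalman_update`, the argument layout and the example numbers are illustrative assumptions and are not taken from the authors' Matlab/JAGS code.

```python
def kalman_update(R, var, chosen, reward, var_r, var_d):
    """One trial of the Kalman filter update (Equations 3-5).

    R, var : lists holding the mean estimate and its variance for each option
    chosen : index of the option played on this trial
    reward : observed reward r_t
    var_r  : reward noise variance, sigma_r^2
    var_d  : drift variance, sigma_d^2
    """
    new_R, new_var = list(R), list(var)
    for i in range(len(R)):
        prior_var = var[i] + var_d            # drift inflates the uncertainty
        if i == chosen:
            # Equation 4: precisions of prior and observation add
            new_var[i] = 1.0 / (1.0 / prior_var + 1.0 / var_r)
            # Equation 3: delta-rule update with gain sigma_{t+1}^2 / sigma_r^2
            new_R[i] = R[i] + (new_var[i] / var_r) * (reward - R[i])
        else:
            # Equation 5: mean unchanged, variance grows by the drift variance
            new_var[i] = prior_var
    return new_R, new_var


# Four forced-choice outcomes with made-up rewards, starting from a broad prior
R, var = [50.0, 50.0], [100.0, 100.0]
for chosen, reward in [(0, 62), (1, 45), (0, 58), (0, 66)]:
    R, var = kalman_update(R, var, chosen, reward, var_r=64.0, var_d=0.0)
```

With the drift variance set to zero, as in this example, the sequential updates reproduce the usual precision-weighted average of the prior mean and the observed rewards.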

When the option is played, the update Equation 3 for $R_t^i$ is essentially just a ‘delta rule’ (Rescorla and Wagner, 1972; Schultz et al., 1997), with the estimate of the mean being updated in proportion to the prediction error, $r_t - R_t^i$. This relationship to the reinforcement learning literature is made clearer by rewriting the learning equations in terms of the time-varying learning rate,

$$\alpha_t^i = \frac{(\sigma_{t+1}^i)^2}{\sigma_r^2} \tag{6}$$

Written in terms of this learning rate, Equations 3 and 4 become

$$R_{t+1}^i = R_t^i + \alpha_t^i\left(r_t - R_t^i\right) \tag{7}$$

and

$$\frac{1}{\alpha_t^i} = \frac{1}{\alpha_{t-1}^i + \alpha_d} + 1 \tag{8}$$

where

$$\alpha_d = \frac{\sigma_d^2}{\sigma_r^2} \tag{9}$$
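
For readers who want the intermediate step, Equation 8 follows from Equation 4 by multiplying both sides by the noise variance and substituting the definitions in Equations 6 and 9. The derivation below is a reader's sketch rather than part of the original text.

```latex
\frac{\sigma_r^2}{(\sigma_{t+1}^i)^2}
  = \frac{\sigma_r^2}{(\sigma_t^i)^2 + \sigma_d^2} + 1
\quad\Longrightarrow\quad
\frac{1}{\alpha_t^i} = \frac{1}{\alpha_{t-1}^i + \alpha_d} + 1,
\qquad
\alpha_{t-1}^i = \frac{(\sigma_t^i)^2}{\sigma_r^2},
\quad
\alpha_d = \frac{\sigma_d^2}{\sigma_r^2}.
```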

The learning model has four free parameters: the noise variance, $\sigma_r^2$, the drift variance, $\sigma_d^2$, and the initial values of the estimated reward, $R_0$, and of the uncertainty in that estimate, $\sigma_0^2$. In practice, only three of these parameters are identifiable from behavioral data, and we will find it useful to reparameterize the learning model in terms of $R_0$ and an initial, $\alpha_1$, and asymptotic, $\alpha$, learning rate. In particular, the initial value of the learning rate relates to $\sigma_0$, $\sigma_r$ and $\sigma_d$ as

$$\alpha_1 = \frac{\sigma_0^2 + \sigma_d^2}{\sigma_r^2} \tag{10}$$

The asymptotic value of the learning rate, which corresponds to the steady-state value of $\alpha_t^i$ if option $i$ is played forever, relates to $\alpha_d$ (and hence $\sigma_d$ and $\sigma_r$) as

$$\alpha = \frac{1}{2}\left(-\alpha_d + \sqrt{\alpha_d^2 + 4\alpha_d}\right) \tag{11}$$
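
Equation 11 is the fixed point of the recursion in Equation 8. The short Python check below iterates Equation 8 and compares the result with the closed form. The function names and the value of the drift ratio are illustrative assumptions.

```python
import math


def asymptotic_learning_rate(alpha_d):
    """Steady-state learning rate from Equation 11."""
    return 0.5 * (-alpha_d + math.sqrt(alpha_d ** 2 + 4.0 * alpha_d))


def iterate_learning_rate(alpha_start, alpha_d, n_trials):
    """Apply Equation 8 repeatedly, as if one option were played on every trial."""
    alpha = alpha_start
    for _ in range(n_trials):
        alpha = 1.0 / (1.0 / (alpha + alpha_d) + 1.0)
    return alpha


alpha_d = 0.1                                   # illustrative sigma_d^2 / sigma_r^2
iterated = iterate_learning_rate(0.9, alpha_d, 50)
closed_form = asymptotic_learning_rate(alpha_d)
```

Setting the learning rate equal on both sides of Equation 8 gives the quadratic $\alpha^2 + \alpha_d\,\alpha - \alpha_d = 0$, whose positive root is Equation 11; the iteration converges to the same value from any positive starting point.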

While this choice to parameterize the learning equations in terms of $\alpha_1$ and $\alpha$ is somewhat arbitrary, we feel that the learning rate parameterization has the advantage of being slightly more intuitive and leads to parameter values between 0 and 1 which are (at least for us) easier to interpret.

Decision component

Once the payoffs of each option, $R_t^i$, have been estimated from the outcomes of the forced-choice trials, the model makes a decision using a simple logistic choice rule:

$$p(\text{choose right}) = \frac{1}{1 + \exp\left(\dfrac{\Delta R + A\,\Delta I + B}{\sigma}\right)} \tag{12}$$

where $\Delta R$ ($= R_t^{\text{left}} - R_t^{\text{right}}$) is the difference in expected reward between left and right options and $\Delta I$ is the difference in information between left and right options (which we define as +1 when left is more informative, −1 when right is more informative, and 0 when both options convey equal information in the [2 2] condition). The three free parameters of the decision process are: the information bonus, $A$, the spatial bias, $B$, and the decision noise, $\sigma$. We assume that these three decision parameters can take on different values in the different horizon and uncertainty conditions (with the proviso that $A$ is undefined in the [2 2] information condition since $\Delta I = 0$). Thus the decision component of the model has 10 free parameters ($A$ in the two horizon conditions, and $B$ and $\sigma$ in the 4 horizon x uncertainty conditions). Directed exploration is then quantified as the change in information bonus with horizon, while random exploration is quantified as the change in decision noise with horizon.
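
The choice rule in Equation 12 can be written as a small Python function. This is a sketch for illustration; the function name `p_choose_right` and the parameter values in the example are assumptions, not fitted values from the paper.

```python
import math


def p_choose_right(R_left, R_right, delta_I, A, B, sigma):
    """Probability of choosing the right option under Equation 12.

    delta_I : +1 if left is more informative, -1 if right is, 0 if equal
    A       : information bonus
    B       : spatial bias
    sigma   : decision noise
    """
    delta_R = R_left - R_right
    return 1.0 / (1.0 + math.exp((delta_R + A * delta_I + B) / sigma))


# Equal estimated rewards, right option more informative, positive information bonus
p_info = p_choose_right(50.0, 50.0, delta_I=-1, A=5.0, B=0.0, sigma=5.0)

# Right option better by 20 points, with low and high decision noise
p_low_noise = p_choose_right(40.0, 60.0, delta_I=0, A=0.0, B=0.0, sigma=2.0)
p_high_noise = p_choose_right(40.0, 60.0, delta_I=0, A=0.0, B=0.0, sigma=20.0)
```

A positive information bonus pushes choices toward the more informative option, while a larger decision noise pulls choice probabilities toward 0.5 regardless of which option is more informative, which is how the two parameters separate directed from random exploration.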

Model fitting

Hierarchical Bayesian model

Between the learning and decision components of the model, each subject’s behavior is described by 13 free parameters, all of which are allowed to vary between TMS conditions. These parameters are: the initial mean, $R_0$, the initial learning rate, $\alpha_1$, the asymptotic learning rate, $\alpha$, the information bonus, $A$, in both horizon conditions, the spatial bias, $B$, in the four horizon x uncertainty conditions, and the decision noise, $\sigma$, in the four horizon x uncertainty conditions (Table 2, Figure 8).

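Table 2. Model parameters, priors, hyperparameters and hyperpriors.

| Parameter | Prior | Hyperparameters | Hyperpriors |
| --- | --- | --- | --- |
| prior mean, $R_{0\tau s}$ | $R_{0\tau s} \sim \mathrm{Gaussian}(\mu_{R_0\tau}, \sigma_{R_0\tau})$ | $\theta_{R_0\tau} = (\mu_{R_0\tau}, \sigma_{R_0\tau})$ | $\mu_{R_0\tau} \sim \mathrm{Gaussian}(50, 14)$; $\sigma_{R_0\tau} \sim \mathrm{Gamma}(1, 0.001)$ |
| initial learning rate, $\alpha_{1\tau s}$ | $\alpha_{1\tau s} \sim \mathrm{Beta}(a_{\alpha_1\tau}, b_{\alpha_1\tau})$ | $\theta_{\alpha_1\tau} = (a_{\alpha_1\tau}, b_{\alpha_1\tau})$ | $a_{\alpha_1\tau} \sim \mathrm{Uniform}(0.1, 10)$; $b_{\alpha_1\tau} \sim \mathrm{Uniform}(0.5, 10)$ |
| asymptotic learning rate, $\alpha_{\tau s}$ | $\alpha_{\tau s} \sim \mathrm{Beta}(a_{\alpha\tau}, b_{\alpha\tau})$ | $\theta_{\alpha\tau} = (a_{\alpha\tau}, b_{\alpha\tau})$ | $a_{\alpha\tau} \sim \mathrm{Uniform}(0.1, 10)$; $b_{\alpha\tau} \sim \mathrm{Uniform}(0.1, 10)$ |
| information bonus, $A_{\tau shu}$ | $A_{\tau shu} \sim \mathrm{Gaussian}(\mu_{A\tau hu}, \sigma_{A\tau hu})$ | $\theta_{A\tau hu} = (\mu_{A\tau hu}, \sigma_{A\tau hu})$ | $\mu_{A\tau hu} \sim \mathrm{Gaussian}(0, 100)$; $\sigma_{A\tau hu} \sim \mathrm{Gamma}(1, 0.001)$ |
| spatial bias, $B_{\tau shu}$ | $B_{\tau shu} \sim \mathrm{Gaussian}(\mu_{B\tau hu}, \sigma_{B\tau hu})$ | $\theta_{B\tau hu} = (\mu_{B\tau hu}, \sigma_{B\tau hu})$ | $\mu_{B\tau hu} \sim \mathrm{Gaussian}(0, 100)$; $\sigma_{B\tau hu} \sim \mathrm{Gamma}(1, 0.001)$ |
| decision noise, $\sigma_{\tau shu}$ | $\sigma_{\tau shu} \sim \mathrm{Gamma}(k_{\sigma\tau hu}, \lambda_{\sigma\tau hu})$ | $\theta_{\sigma\tau hu} = (k_{\sigma\tau hu}, \lambda_{\sigma\tau hu})$ | $k_{\sigma\tau hu} \sim \mathrm{Exp}(0.1)$; $\lambda_{\sigma\tau hu} \sim \mathrm{Exp}(10)$ |

Figure 8. Graphical representation of the model.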

Each variable is represented by a node, with edges denoting the dependence between variables. Shaded nodes correspond to observed variables, that is, the free choices, $c_{\tau shug}$, forced-trial rewards, $\mathbf{r}_{\tau shug}$, and forced-trial choices, $\mathbf{a}_{\tau shug}$. Unshaded nodes correspond to unobserved variables whose values are inferred by the model.

Each of the free parameters is fit to the behavior of each subject using a hierarchical Bayesian approach (Lee and Wagenmakers, 2014). In this approach to model fitting, each parameter for each subject is assumed to be sampled from a group-level prior distribution whose parameters, the so-called ‘hyperparameters’, are estimated using a Markov Chain Monte Carlo (MCMC) sampling procedure. The hyperparameters themselves are assumed to be sampled from ‘hyperprior’ distributions whose parameters are defined such that these hyperpriors are broad. For notational convenience, we refer to the hyperparameters that define the prior for variable $X$ as $\theta_X$. In addition we use subscripts to refer to the dependence of both parameters and hyperparameters on TMS stimulation condition, $\tau$, horizon condition, $h$, uncertainty condition, $u$, subject, $s$, and game, $g$.

The particular priors and hyperpriors for each parameter are shown in Table 2. For example, we assume that the prior mean, $R_{0\tau s}$, for each stimulation condition $\tau$ and subject $s$, is sampled from a Gaussian prior with mean $\mu_{R_0\tau}$ and standard deviation $\sigma_{R_0\tau}$. These prior parameters are sampled in turn from their respective hyperpriors: $\mu_{R_0\tau}$ from a Gaussian distribution with mean 50 and standard deviation 14, and $\sigma_{R_0\tau}$ from a Gamma distribution with shape parameter 1 and rate parameter 0.001.
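
To make the two-level structure concrete, the following Python sketch draws from the hierarchy described for the prior mean: first the group-level hyperparameters, then one value per subject. It is a forward simulation of the prior only, not the fitting procedure, and the function name `sample_prior_mean_R0` is hypothetical. It follows the text in treating the second Gaussian argument as a standard deviation and the Gamma arguments as shape and rate.

```python
import random


def sample_prior_mean_R0(n_subjects, rng):
    """Draw subject-level R0 values from the hierarchy described for R0.

    Hyperpriors as stated in the text: mu ~ Gaussian(50, 14) and
    sigma ~ Gamma(shape=1, rate=0.001).
    """
    mu = rng.gauss(50.0, 14.0)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate
    sigma = rng.gammavariate(1.0, 1.0 / 0.001)
    subject_R0 = [rng.gauss(mu, sigma) for _ in range(n_subjects)]
    return mu, sigma, subject_R0


rng = random.Random(0)
mu, sigma, subject_R0 = sample_prior_mean_R0(n_subjects=26, rng=rng)
```

Because the Gamma hyperprior has a mean of 1000, prior draws of the group-level standard deviation are typically very large, which is one way to see that these hyperpriors are broad.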

Model fitting using MCMC

The model was fit to the data using a Markov Chain Monte Carlo approach implemented in the JAGS package (Plummer, 2003) via the MATJAGS interface (psiexp.ss.uci.edu/research/programs_data/jags/). This package approximates the posterior distribution over model parameters by generating samples from this posterior distribution given the observed behavioral data.

In particular, we used 4 independent Markov chains to generate 4000 samples from the posterior distribution over parameters (1000 samples per chain). Each chain had a burn-in period of 500 samples, which were discarded to reduce the effects of initial conditions, and posterior samples were acquired at a thin rate of 1. Convergence of the Markov chains was confirmed post hoc by eye. Code and data to replicate our analysis and reproduce our figures are provided as part of the Supplementary Materials.
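
Convergence in the paper was assessed by eye. As a numerical complement that a reader could apply to the same kind of output, the sketch below computes the classic Gelman-Rubin potential scale reduction factor for one parameter across chains. This is not a step the authors report, the function name `gelman_rubin` is an illustrative choice, and the simulated chains stand in for real JAGS output.

```python
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter.

    chains : list of equal-length lists of post-burn-in samples, one per chain
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Simulated stand-in for sampler output: 4 chains, 500 burn-in + 1000 kept samples
rng = random.Random(0)
raw_chains = [[rng.gauss(0.0, 1.0) for _ in range(1500)] for _ in range(4)]
kept = [chain[500:] for chain in raw_chains]     # discard the burn-in period
r_hat = gelman_rubin(kept)
```

Values close to 1 indicate that the chains agree with one another; values well above 1 suggest that the chains have not yet mixed.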

Funding Statement

No external funding was received for this work.

Contributor Information

Robert C Wilson, Email: bob@arizona.edu.

Michael J Frank, Brown University, United States.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Formal analysis, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing.

Resources, Supervision, Project administration.

Conceptualization, Software, Formal analysis, Supervision, Visualization, Writing—original draft, Writing—review and editing.

Ethics

Human subjects: All participants were informed about potential risks connected to TMS and signed a written consent form. The study was approved by the University of Social Sciences and Humanities ethics committee.

Additional files

Transparent reporting form
DOI: 10.7554/eLife.27430.012

Major datasets

The following dataset was generated:

Zajkowski W, Kossut M, Wilson RC. A causal role for right frontopolar cortex in directed, but not random, exploration. 2017. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CZT6EE. Publicly accessible via the Harvard Dataverse website (https://dx.doi.org/10.7910/DVN/CZT6EE).

References

  1. Aston-Jones G, Cohen JD. Adaptive gain and the role of the locus coeruleus-norepinephrine system in optimal performance. The Journal of Comparative Neurology. 2005;493:99–110. doi: 10.1002/cne.20723.
  2. Auer P, Cesa-Bianchi N, Fischer P. Finite-time analysis of the multiarmed bandit problem. Machine Learning. 2002;47:235–256. doi: 10.1023/A:1013689704352.
  3. Badre D, Doll BB, Long NM, Frank MJ. Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron. 2012;73:595–607. doi: 10.1016/j.neuron.2011.12.025.
  4. Berlyne DE. Curiosity and exploration. Science. 1966;153:25–33. doi: 10.1126/science.153.3731.25.
  5. Boorman ED, Behrens TE, Woolrich MW, Rushworth MF. How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron. 2009;62:733–743. doi: 10.1016/j.neuron.2009.05.014.
  6. Cavanagh JF, Bismark AJ, Frank MJ, Allen JJ. Larger Error Signals in Major Depression are Associated with Better Avoidance Learning. Frontiers in Psychology. 2011;2:331. doi: 10.3389/fpsyg.2011.00331.
  7. Cohen JD, McClure SM, Yu AJ. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society B: Biological Sciences. 2007;362:933–942. doi: 10.1098/rstb.2007.2098.
  8. Costa A, Oliveri M, Barban F, Bonnì S, Koch G, Caltagirone C, Carlesimo GA. The right frontopolar cortex is involved in visual-spatial prospective memory. PLoS ONE. 2013;8:e56039. doi: 10.1371/journal.pone.0056039.
  9. Costa A, Oliveri M, Barban F, Torriero S, Salerno S, Lo Gerfo E, Koch G, Caltagirone C, Carlesimo GA. Keeping memory for intentions: a cTBS investigation of the frontopolar cortex. Cerebral Cortex. 2011;21:2696–2703. doi: 10.1093/cercor/bhr052.
  10. Daw ND, O'Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006;441:876–879. doi: 10.1038/nature04766.
  11. Domenech P, Koechlin E. Executive control and decision-making in the prefrontal cortex. Current Opinion in Behavioral Sciences. 2015;1:101–106. doi: 10.1016/j.cobeha.2014.10.007.
  12. Donoso M, Collins AG, Koechlin E. Human cognition. Foundations of human reasoning in the prefrontal cortex. Science. 2014;344:1481–1486. doi: 10.1126/science.1252254.
  13. Frank MJ, Doll BB, Oas-Terpstra J, Moreno F. Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nature Neuroscience. 2009;12:1062–1068. doi: 10.1038/nn.2342.
  14. Gittins JC, Jones DM. A dynamic allocation index for the discounted multiarmed bandit problem. Biometrika. 1979;66:561. doi: 10.1093/biomet/66.3.561.
  15. Gittins JC. Resource allocation in speculative chemical research. Journal of Applied Probability. 1974;11:255. doi: 10.1017/S0021900200036718.
  16. Gratton C, Lee TG, Nomura EM, D'Esposito M. The effect of theta-burst TMS on cognitive control networks measured with resting state fMRI. Frontiers in Systems Neuroscience. 2013;7:124. doi: 10.3389/fnsys.2013.00124.
  17. Hanlon CA, Dowdle LT, Correia B, Mithoefer O, Kearney-Ramos T, Lench D, Griffin M, Anton RF, George MS. Left frontal pole theta burst stimulation decreases orbitofrontal and insula activity in cocaine users and alcohol users. Drug and Alcohol Dependence. 2017;178:310–317. doi: 10.1016/j.drugalcdep.2017.03.039.
  18. Hills TT, Todd PM, Lazer D, Redish AD, Couzin ID, Cognitive Search Research Group. Exploration versus exploitation in space, mind, and society. Trends in Cognitive Sciences. 2015;19:46–54. doi: 10.1016/j.tics.2014.10.004.
  19. Huang YZ, Edwards MJ, Rounis E, Bhatia KP, Rothwell JC. Theta burst stimulation of the human motor cortex. Neuron. 2005;45:201–206. doi: 10.1016/j.neuron.2004.12.033.
  20. Kalman RE. A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering. 1960;82:35–45. doi: 10.1115/1.3662552.
  21. Kidd C, Hayden BY. The psychology and neuroscience of curiosity. Neuron. 2015;88:449–460. doi: 10.1016/j.neuron.2015.09.010.
  22. Koechlin E, Hyafil A. Anterior prefrontal function and the limits of human decision-making. Science. 2007;318:594–598. doi: 10.1126/science.1142995.
  23. Krueger PM, Wilson RC, Cohen JD. Strategies for exploration in the domain of losses. Judgment and Decision Making. 2017;12:104–117.
  24. Lee MD, Wagenmakers E. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press; 2014.
  25. Mansouri FA, Buckley MJ, Mahboubi M, Tanaka K. Behavioral consequences of selective damage to frontal pole and posterior cingulate cortices. PNAS. 2015;112:E3940–E3949. doi: 10.1073/pnas.1422629112.
  26. Mehlhorn K, Newell BR, Todd PM, Lee MD, Morgan K, Braithwaite VA, Hausmann D, Fiedler K, Gonzalez C. Unpacking the exploration–exploitation tradeoff: A synthesis of human and animal literatures. Decision. 2015;2:191–215. doi: 10.1037/dec0000033.
  27. Nassar MR, Wilson RC, Heasly B, Gold JI. An approximately Bayesian delta-rule model explains the dynamics of belief updating in a changing environment. Journal of Neuroscience. 2010;30:12366–12378. doi: 10.1523/JNEUROSCI.0822-10.2010.
  28. Peirce JW. PsychoPy--Psychophysics software in Python. Journal of Neuroscience Methods. 2007;162:8–13. doi: 10.1016/j.jneumeth.2006.11.017.
  29. Plummer M. JAGS: a program for analysis of bayesian graphical models using gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing. 2003;124:125.
  30. Pollmann S. Frontopolar resource allocation in human and nonhuman primates. Trends in Cognitive Sciences. 2016;20:84–86. doi: 10.1016/j.tics.2015.11.006.
  31. Raja Beharelle A, Polanía R, Hare TA, Ruff CC. Transcranial stimulation over frontopolar cortex elucidates the choice attributes and neural mechanisms used to resolve exploration-exploitation trade-offs. Journal of Neuroscience. 2015;35:14544–14556. doi: 10.1523/JNEUROSCI.2322-15.2015.
  32. Rescorla RA, Wagner AR. A theory of pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory. 1972;2:64–99.
  33. Rossi S, Hallett M, Rossini PM, Pascual-Leone A, Safety of TMS Consensus Group. Safety, ethical considerations, and application guidelines for the use of transcranial magnetic stimulation in clinical practice and research. Clinical Neurophysiology. 2009;120:2008–2039. doi: 10.1016/j.clinph.2009.08.016.
  34. Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593–1599. doi: 10.1126/science.275.5306.1593.
  35. Somerville LH, Sasse SF, Garrad MC, Drysdale AT, Abi Akar N, Insel C, Wilson RC. Charting the expansion of strategic exploratory behavior during adolescence. Journal of Experimental Psychology: General. 2017;146:155–164. doi: 10.1037/xge0000250.
  36. Thompson WR. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 1933;25:285–294. doi: 10.1093/biomet/25.3-4.285.
  37. Volman I, Roelofs K, Koch S, Verhagen L, Toni I. Anterior prefrontal cortex inhibition impairs control over social emotional actions. Current Biology. 2011;21:1766–1770. doi: 10.1016/j.cub.2011.08.050.
  38. Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General. 2014;143:2074–2081. doi: 10.1037/a0038199.
  39. Wischnewski M, Schutter DJ. Efficacy and time course of theta burst stimulation in healthy humans. Brain Stimulation. 2015;8:685–692. doi: 10.1016/j.brs.2015.03.004.

Decision letter

Editor: Michael J Frank

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]

Thank you for submitting your work entitled "A causal role for right frontopolar cortex in directed, but not random, exploration" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The reviewers have opted to remain anonymous.

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work in the current state will not be considered further for publication in eLife.

All involved found the work to have great merit and to contribute to the literature on RLPFC and exploration. In our view this is perhaps the clearest demonstration to date that the RLPFC is involved in directed, uncertainty-guided exploration, in that it is the first to imply causality. However, given the state of the literature with other studies (cited in your manuscript) that show RLPFC activation during exploration, that it codes for uncertainty and/or the value of alternative actions, together with an existing TDCS study manipulating it and affecting exploration (albeit not in a way that clearly implicates uncertainty), we felt that the bar for establishing causality in your study needs to be quite high. The reviewers agreed that given the small sample and somewhat marginal statistics, it would be more reassuring if the results held up in a larger N study (or a separate independent replication). Moreover, while the findings here are compelling (e.g. the selectivity to horizon 6 directed exploration), they would be more so if you had a control site of stimulation (e.g. DLPFC or IFG) to establish specificity of the RLPFC site. (One of the reviewers noted in the consultation session that RLPFC stimulation may cause discomfort relative to vertex stimulation, which could differentially impact conditions that may require differences in effort).

Reviewer 1 also had concerns regarding potential power differences to detect effects in directed vs. random exploration.

If you feel strongly that you can address these concerns we could consider a resubmission. But because the nature of the concerns requires new data collection, and it is unclear whether the results of new studies will provide more clarity, we are rejecting the paper as it stands. We would understand if you chose to submit this study as it is elsewhere.

Reviewer #1:

This manuscript reports the results of a TMS study in which participants are stimulated with theta-burst TMS while participating in a one armed bandit gambling task aimed at distinguishing directed from random exploration. The authors hypothesize that frontopolar cortex is involved in directed but not random exploration. Using both model-based and model-free analyses the authors report that frontopolar cortex inhibition impacts on directed but not random exploration, allowing the authors to conclude that this structure plays a specific role in directed exploration.

Overall, the study is an interesting one that identifies a potentially important finding. The notion that frontopolar cortex is especially involved in directed exploration is highly plausible, and the results do indeed provide some indication of this possibility. However, I do have several major concerns which I detail below.

1) One major concern is the possibility that there is a substantial difference in power to detect the effects of TMS on the two forms of exploration, due to a possibly large difference in the number of behavioral choices that index these two forms of exploration on the first trial of each block. While directed exploration in the vertex treatment in the [1 3] condition perhaps occurs frequently (the number of trials in which this behavior is found is not reported, but I am inferring this from the high probability of directed exploration reported in the horizon 6 condition), it seems natural to expect that there would be far fewer instances of "random" exploration as defined by choice of the lower valued option in the [2 2] condition – this appears to be reflected in the much lower reported probabilities of random exploration in that condition. If there are many fewer trials of random exploration in the first place, this ought to make an effect of TMS stimulation on random exploration much harder to detect. Therefore, one trivial account for the authors' double dissociation is that it occurs as a result of a difference in the experimental power to detect these two effects in the paradigm. The claim the authors have about an effect of TMS on directed exploration per se seems well supported in my opinion, but the claim for the specificity of the effect to random exploration seems a lot weaker.

2) Another concern is that for a behavioral and TMS study the use of such a small sample size of only 15 participants seems hard to justify, especially given that the authors are reporting effects that are just barely reaching significance at p<0.05. Given the concerns raised above about power to detect effects on random exploration, and given that there are a very small number of trials per subject enabling the authors to test their claimed effects (as they are throwing away most of the trials per block and focusing only on the first), it would not be unreasonable to expect the authors to obtain a larger sample size.

3) A more generic concern with TMS over frontopolar cortex is that it is unclear with this stimulation protocol how diffuse the effect of the TMS stimulation has been, and to what extent the stimulation protocol has also impacted adjacent regions of frontal cortex. This is an inherent limitation of this technique of course, but there are ways to ameliorate concerns in this regard such as by measuring effects of the stimulation protocol with fMRI. The authors could discuss this limitation and ideally bolster their claims about the degree to which these effects can be specifically attributed to effects of stimulation on frontopolar cortex per se.

4) Could the apparent effect on directed exploration be driven by other more prosaic possibilities, such as an impairment in the ability to flexibly change task set (e.g. from a short to long horizon) across blocks, alterations in the capacity to attend to the task cues indicating the horizon length, or even an impaired capacity to incorporate knowledge of task instructions, rather than by an effect on directed exploration per se?

5) Can the authors discriminate between different ways in which directed exploration could be implemented computationally on this task? For instance one could imagine a Bayesian implementation in which a representation of uncertainty over the options is computed and used to direct exploration toward the more uncertain options, or else one could simply use a heuristic strategy of just counting the number of samples of each option to try to ensure each option has been sampled an equivalent number of times.

6) Although the authors cite Wilson et al. (2014) to describe their modeling strategies, it would be important to reproduce details of exactly how they implemented the model fitting etc. in the current paradigm, as these analyses are central to the current paper and the reader shouldn't be required to go searching for another paper to understand precisely what was done.

7) I wonder whether more use can be made of the subsequent trials in each block. It seems a shame to throw these trials away, even if the utility of the trials for distinguishing these constructs drops off over repeated trials within a block it seems plausible to me that the 2nd and 3rd trials at the very least would contain useful information.

Reviewer #2:

Zajkowski and colleagues present a study showing that continuous theta burst stimulation to right frontopolar cortex, but not the vertex, selectively reduces directed exploration, but not random exploration. I commend the authors for their experimental approach, combining a carefully designed experimental paradigm and computational modeling of behavior with a transient causal manipulation, such as cTBS. While the results look straightforward, and I do believe they represent an advance on current knowledge in the field, I do not think they represent such a significant advance to merit publication in eLife (or a similar high impact journal of broad interest), but would be appropriate for a more specialized journal in the field. Rather than advancing thinking on this topic in some new way, developing a new methodology, or resolving a debate, I believe the results essentially confirm what could be inferred to be likely from the existing fMRI (Daw et al., 2006; Badre et al., 2012) and stimulation (TDCS) literature (Beharelle et al., 2015) on the RFPC and exploration/exploitation. Furthermore, the experimental paradigm and modeling results have been published (Wilson, et al., 2014). I do not mean to discourage the authors, who I think have conducted a genuinely interesting study by combining approaches in an unusual way, and confirming their main hypothesis. I simply do not believe the paper is best suited for a journal of the caliber of eLife, but I of course leave this up to the editor's discretion. I have added a few comments below that I hope will be helpful to the authors.

In the Introduction random exploration is framed as simply increasing decision noise, and directed exploration as information seeking. But is that really the critical distinction? In the real world random exploration is likely to occur when the environmental statistics have changed very rapidly and/or the animal has inferred (for whatever reason) their prior causal model (or even set of models) is (are) no longer tenable. In these circumstances their exploratory behavior is likely still characterized as information seeking, even if it manifests formally as an increase in decision noise. It seems to me, therefore, that the key distinction between directed and undirected exploration is that animals no longer know which options to explore. Can the authors clarify their view, and perhaps modify the Introduction and/or Discussion as needed?

What were the instructions to participants? Do they necessarily understand that the bandit means are constant and independent? Is there any evidence they weight more recent past samples more strongly than more distant samples? Would this change the estimates of the means in any meaningful way?

Given the demonstrated effects of cTBS on the hemodynamic signal measured in control networks (Gratton et al., 2013), how specific is the effect of stimulation to RFPC? To address this question, I would have liked to see the investigators target another frontal comparison brain region, in addition or instead of the vertex.

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for resubmitting your work entitled "A causal role for right frontopolar cortex in directed, but not random, exploration" for further consideration at eLife. Your revised article has been favorably evaluated by Sabine Kastner as Senior Editor, Michael Frank as Reviewing editor and two reviewers.

The manuscript has been improved, especially given the doubled sample size, and the model-based and model-free analyses are sophisticated, comprehensive, and generally compelling. However, there are some remaining issues that need to be addressed before acceptance, as outlined below:

1) Why do the authors binarize relative information such that it is coded as +1 when the left gamble is more informative and -1 when the right gamble is? Based on Badre, Doll, Frank, et al. I would have thought that the estimated relative uncertainty between options would be more appropriate to quantitatively test the impact of stimulation on directed exploration. Or is variance in this quantity negligible across the critical choices in this task? Related to this question, is this quantity matched across conditions and do all subjects see identical or different schedules?

2) Although I am not requesting the authors conduct another experiment, a second stimulation site within prefrontal cortex would make for an important comparison for future studies. My suggestion is in part due to the quite severe discomfort frequently caused by TMS stimulation to FPC and neighboring regions due to the underlying facial musculature, as compared to say the vertex. Any differences between stimulation sites could in theory be due to differences in discomfort or subsequent distraction produced by the stimulation sites. Here, this difference could conceivably interact with the comparison between horizon 6 and horizon 1 in the unequal condition if this horizon 6 condition is in fact more cognitively demanding. Note this is not a concern in the cited tDCS study by Raja Beharelle et al. because tDCS does not stimulate the facial muscles and because the excitation and inhibition respectively following anodal and cathodal tDCS provides for an internal control. Can the authors provide some evidence that horizon 6 in the unequal condition is not the most cognitively demanding for their subjects, for instance by analysing RTs? Are there existing data that address this concern by comparing stimulation of FPC and other PFC regions using cTBS?

3) The trend of an effect of RFPC stimulation on the information bonus for horizon 1, although smaller than that of horizon 6, seems problematic for an interpretation purely based on directed exploration, since there is no opportunity to exploit the newly acquired information for horizon 1. The authors suggest subjects may become less information-seeking in both conditions (consistent with risk or ambiguity aversion in horizon 1 and reduced directed exploration in horizon 6), but this begs the question of what process or mechanism underlies this decrease in both horizons. Given the broader literature on the role of FPC, one interpretation would be that stimulation has disrupted the FPC's ability to faithfully encode the parameters of a "pending" option that they may choose in the future (e.g. Koechlin and Hyafil, Science, 2007) – in this task this could be seen as the option that has not been selected as frequently or attended to recently during forced choices. However I am sure there are other plausible interpretations. How do the authors interpret this effect across horizons in the unequal condition with respect to the broader literature on FPC?

eLife. 2017 Sep 15;6:e27430. doi: 10.7554/eLife.27430.016

Author response


[Editors’ note: the author responses to the first round of peer review follow.]

Reviewer #1:

[…] 1) One major concern is the possibility that there is a substantial difference in power to detect the effects of TMS on the two forms of exploration due to perhaps a big difference in the number of behavioral choice that index these two forms of exploration on the first trial of each block. While directed exploration in the vertex treatment in the [1 3] condition perhaps occurs frequently (the number of trials in which this behavior is found are not reported, but I am inferring this from the high probability of directed exploration reported in the horizon 6 condition), it seems natural to expect that there would be much fewer instances of "random" exploration as defined by choice of the lower valued option in the [2 2] condition – this appears to be reflected in the much lower reported probabilities of random exploration in that condition. If there are many fewer trials of random exploration in the first place this ought to make an effect of random exploration following TMS stimulation much harder to detect. Therefore, one trivial account for the authors' double dissociation is that it occurs as a result of a difference in the experimental power to detect these two effects in the paradigm. The claim the authors have about an effect of TMS on direct exploration per se seems well supported in my opinion, but the claim for the specificity of the effect to random exploration seems a lot weaker.

This is an important point. Put simply, do we find no effect on random exploration because our experiment is underpowered to detect effects on random exploration? We believe that we do have sufficient power to detect an effect on random exploration (if it were there) and we try to show this using both a model-free and model-based approach.

For the model-free approach we consider the size of the horizon effect for directed and random exploration in the control condition. This horizon effect is essentially the effect we are trying to remove with TMS and the idea is that, if the horizon effect size is smaller for random than directed exploration, there would be a difference in power to detect changes to the horizon effect. Fortunately the horizon effects are of nearly equal size in this study (in the vertex condition, Cohen’s d for directed = 0.71; for random = 0.68). These numbers are largely in line with those from purely behavioral subjects (the 60 undergraduates from Somerville et al. 2016), where we find d = 0.75 for directed and a slightly larger d = 1.18 for random. Thus, if TMS were to reduce the horizon effect by 50% we would have essentially equal power to detect both effects (note that we have the same number of trials in the [2 2] condition, for measuring p(low mean) and random exploration, and in the [1 3] condition, for measuring p(high info) and directed exploration).
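To make the model-free quantities concrete, the following is a minimal sketch of how p(high info), p(low mean) and a horizon effect size could be computed from first-free-choice data. The trial record layout (`n_left`, `n_right`, `mean_left`, `mean_right`, `choice`) and all function names are hypothetical, and the effect size shown is a paired-samples Cohen's d (mean of the within-subject differences divided by their standard deviation), which is an assumption: the response does not state which variant of Cohen's d was used.

```python
from statistics import mean, stdev

# Hypothetical record for the first free choice of one game:
#   n_left, n_right       : number of forced-choice samples per option, e.g. 1 and 3, or 2 and 2
#   mean_left, mean_right : observed mean reward per option over the forced trials
#   choice                : "left" or "right"

def p_high_info(trials):
    """Fraction of unequal-information ([1 3]) games where the less-sampled option was chosen."""
    unequal = [t for t in trials if t["n_left"] != t["n_right"]]
    hits = sum((t["choice"] == "left") == (t["n_left"] < t["n_right"]) for t in unequal)
    return hits / len(unequal)

def p_low_mean(trials):
    """Fraction of equal-information ([2 2]) games where the lower-mean option was chosen."""
    equal = [
        t for t in trials
        if t["n_left"] == t["n_right"] and t["mean_left"] != t["mean_right"]
    ]
    hits = sum((t["choice"] == "left") == (t["mean_left"] < t["mean_right"]) for t in equal)
    return hits / len(equal)

def horizon_effect_d(h1_values, h6_values):
    """Paired Cohen's d for the horizon effect (horizon 6 minus horizon 1), one value per subject."""
    diffs = [h6 - h1 for h1, h6 in zip(h1_values, h6_values)]
    return mean(diffs) / stdev(diffs)

# Toy usage with made-up numbers (not data from the study).
toy_trials = [
    {"n_left": 1, "n_right": 3, "mean_left": 40, "mean_right": 60, "choice": "left"},
    {"n_left": 3, "n_right": 1, "mean_left": 40, "mean_right": 60, "choice": "left"},
    {"n_left": 2, "n_right": 2, "mean_left": 40, "mean_right": 60, "choice": "left"},
    {"n_left": 2, "n_right": 2, "mean_left": 70, "mean_right": 50, "choice": "left"},
]
print(p_high_info(toy_trials), p_low_mean(toy_trials))
print(horizon_effect_d([0.4, 0.5, 0.6, 0.5], [0.6, 0.6, 0.8, 0.8]))
```

Computing the same effect size for both measures in the control condition is what allows the comparison of power made above: similar d values with equal trial counts imply similar sensitivity to a given proportional reduction.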

For the model-based approach, we can fit the decision noise in the [1 3] uncertainty condition in addition to the [2 2] condition. This gives us an independent estimate of decision noise and gives us another chance to see an effect of TMS on random exploration. In addition, in our new model-based analysis, we use hierarchical Bayesian model fitting to compute posterior distributions over all model parameters given the data (see reviewer #1 response #6 for more details on this model). As shown by the posterior distributions (Figure 4, main text) we see no effect of TMS on decision noise in any of the four uncertainty x horizon conditions, but we do see an effect on information bonus in horizon 6.

2) Another concern is that for a behavioral and TMS study the use of such a small sample size of only 15 participants seems hard to justify, especially given that the authors are reporting effects that are just barely reaching significance at p<0.05. Given the concerns raised above about power to detect effects on random exploration, and given that there are a very small number of trials per subject enabling the authors to test their claimed effects (as they are throwing away most of the trials per block and focusing only on the first), suggest that it would not be unreasonable to expect the authors to obtain a larger sample size.

We agree that N = 15 was not ideal. We have now run an additional 16 subjects and our results hold (see Author response image 1).

Author response image 1. No difference in effects between original and replication experiments.

In each panel we plot the model-free measures of directed and random exploration and how they change between stimulation conditions. For example, in Panel A, we plot p(high info) in horizon 1 for vertex stimulation (x-axis) and RFPC stimulation (y-axis). Each point in this plot is a single subject and the diagonal line represents equality. Participants below the diagonal line have a smaller value of p(high info) in the RFPC stimulation condition. From this we can clearly see that there is no effect of RFPC stimulation on directed exploration in horizon 1 (panel A), or random exploration in either horizon (B, D). However, there is a strong effect of RFPC stimulation on directed exploration in horizon 6 with the majority of points lying below the diagonal (C). Moreover, both the original and replication datasets point to the same conclusions in all four panels.

In addition, we have included two new analyses: a model-based Bayesian analysis (results of which are shown in Figure 4), as well as a model-free analysis of later trials. Both of these analyses point to the same conclusion – inhibition of RFPC leads to selective inhibition of directed exploration in horizon 6.

The model-free analysis of later trials is presented in the main paper in Figure 6 in its own section. In this analysis we compute p(high info) and p(low mean) for all trials in the horizon 6 game to see whether behavior on the later trials is affected by stimulation of frontal pole. For directed exploration we find some evidence that the reduction in p(high info) on the first trial continues into the second (post hoc, one-sided t-test on the second trial, t(24) = 1.61; p = 0.06), Figure 6 panels A and C. While this is a marginal result, it is consistent with our hypothesis and provides more support for frontal pole playing a role in directed exploration. For random exploration we find no effect of RFPC stimulation on any trial. This is consistent with the idea that frontal pole is not involved in random exploration.

For completeness we reproduce the particular section of text here:

“The effect of RFPC stimulation on later trials

Our analyses so far have focused on just the first free choice and have ignored the remaining five choices in the horizon 6 games. […] Thus, the analysis of later trials provides additional, albeit modest, support for the idea that RFPC stimulation selectively disrupts directed but not random exploration at long horizons.”

3) A more generic concern with TMS over frontopolar cortex is that it is unclear with this stimulation protocol how diffuse the effect of the TMS stimulation has been, and to what extent the stimulation protocol has also impacted adjacent regions of frontal cortex. This is an inherent limitation of this technique of course, but there are ways to ameliorate concerns in this regard such as by measuring effects of the stimulation protocol with fMRI. The authors could discuss this limitation and ideally bolster their claims about the degree to which these effects can be specifically attributed to effects of stimulation on frontopolar cortex per se.

We agree that this is an important point and would be an important follow-up study. We have added the following to the Discussion to address this point:

“While the present study does allow us to conclude that directed and random exploration rely on different neural systems, the limited spatial specificity of TMS limits our ability to say exactly what those systems are. […] Future work combining cTBS with neuroimaging will be necessary to shed light on these issues.”

4) Could the apparent effect on directed exploration be driven by other more prosaic possibilities such as an impairment in the ability to flexibly change task set (e.g. from a short to long horizon) across blocks or alterations in the capacity to attend to the task cues indicating the horizon length or even the capacity to incorporate knowledge of task instructions could be impacted instead of directed exploration per se.

This is an interesting idea that we believe we can rule out. To paraphrase, the idea is that RFPC stimulation inhibits the ability to adapt to horizon in general (e.g. by causing subjects to ignore relevant task cues) rather than causing a specific deficit in directed exploration. Such a general deficit would predict that the horizon effect on random exploration would also be abolished with RFPC stimulation and this is something we do not see at all in three separate analyses.

First, in the model-free analysis (Figure 3B) we see that p(low mean) increases with horizon even in the RFPC condition and that RFPC stimulation has no effect on this measure of random exploration.

Second, this model-free result also holds for the later trials, in which we see no stimulation-based change in p(low mean) over the course of horizon 6 games (Figure 6B, D). Of course, these later trial results are subject to the reward-information confound and so should not be overinterpreted, but they do at least point to the same conclusion that RFPC stimulation does not change the horizon dependence of random exploration.

Third, our model-based analysis points to the same conclusion that there is no change in decision noise with stimulation condition (Figure 4).

5) Can the authors discriminate between different ways in which directed exploration could be implemented computationally on this task? For instance one could imagine a Bayesian implementation in which a representation of uncertainty over the options is computed and used to direct exploration toward the more uncertain options, or else one could simply use a heuristic strategy of just counting the number of samples of each option to try to ensure each option has been sampled an equivalent number of times.

Unfortunately the vanilla Horizon Task used here is not well suited to addressing this question. The reason is that uncertainty on the first free choice is not parametrically modulated – there either is a difference in uncertainty (in the [1 3] condition) or else there is no difference in uncertainty (in the [2 2] condition). While one could try to look at this with a model-based analysis of the later trials, such an analysis is deeply affected by the reward-information confound which makes interpreting results of such an analysis difficult.

In an ongoing set of experiments, we have performed a (purely behavioral) version of the task with parametric modulation of uncertainty. This reveals that the information bonus does appear to scale with uncertainty in a more Bayesian manner, but more analysis needs to be done to be sure and the result requires internal replication (much easier with pure behavior than TMS!) before we publish.

6) Although the authors cite Wilson et al. (2014) to describe their modeling strategies, it would be important to reproduce details of exactly how they implemented the model fitting etc. in the current paradigm, as these analyses are central to the current paper and the reader shouldn't be required to go searching for another paper to understand precisely what was done.

This is a fair point and we have now included much more detail on the model. In addition we have expanded the model to include a learning component and fit the model in a different (and more rigorous) hierarchical Bayesian manner. We describe the model at two different points in the text and provide code to implement the model in the Supplementary Material. In the Results section, we highlight the salient points to try to convey the main intuition in the subsection “RFPC stimulation selectively inhibits directed exploration on the first free-choice”. In the Materials and methods section, we go into all the gory details. As this text is extensive, we do not quote it here.

7) I wonder whether more use can be made of the subsequent trials in each block. It seems a shame to throw these trials away, even if the utility of the trials for distinguishing these constructs drops off over repeated trials within a block it seems plausible to me that the 2nd and 3rd trials at the very least would contain useful information.

I have wondered about this too, and for quite a while! In trying to model the later trials, it quickly becomes apparent that the reward-information confound is very real and introduces very strong correlations between the fitted parameter values that make interpretation of the results essentially impossible.

Despite this difficulty in interpreting the model-based parameters, the model-free statistics (while still being confounded) are at least more straightforward. As mentioned above (response #2), we include this model-free analysis of later trials in a separate section of the Results, along with appropriate health warnings about the reward information confound.

Reviewer #2:

Zajkowski and colleagues present a study showing that continuous theta burst stimulation to right frontopolar cortex, but not the vertex, selectively reduces directed exploration, but not random exploration. I commend the authors for their experimental approach, combining a carefully designed experimental paradigm and computational modeling of behavior with a transient causal manipulation, such as cTBS.

We thank the reviewer for the positive comments and helpful feedback. We hope this revision will change your mind about the “importance” of the findings, but regardless of whether the paper is accepted to eLife, your comments have greatly improved the paper!

While the results look straightforward, and I do believe they represent an advance on current knowledge in the field, I do not think they represent such a significant advance to merit publication in eLife (or a similar high impact journal of broad interest), but would be appropriate for a more specialized journal in the field. Rather than advancing thinking on this topic in some new way, developing a new methodology, or resolving a debate, I believe the results essentially confirm what could be inferred to be likely from the existing fMRI (Daw et al., 2006; Badre et al., 2012) and stimulation (TDCS) literature (Beharelle et al., 2015) on the RFPC and exploration/exploitation. Furthermore, the experimental paradigm and modeling results have been published (Wilson, et al., 2014). I do not mean to discourage the authors, who I think have conducted a genuinely interesting study by combining approaches in an unusual way, and confirming their main hypothesis. I simply do not believe the paper is best suited for a journal of the caliber of eLife, but I of course leave this up to the editor's discretion. I have added a few comments below that I hope will be helpful to the authors.

While we acknowledge that such judgments of “importance” are often a matter of taste and perspective (all things look big when viewed up close!), we respectfully disagree with this point and believe our study represents a major update to current thinking. In particular, by showing that RFPC stimulation selectively inhibits directed exploration we show that “exploration” is not a unitary process; rather, it is a dual process in which directed and random exploration rely on (at least partially) dissociable neural systems.

That exploration is a dual process is absolutely not something one would have concluded from previous work. For example, Daw and Badre see similar activations despite defining exploration in very different ways (choosing the low-value option for Daw and, loosely, choosing high-information options for Badre). The reason the activations are similar is that both tasks have a reward-information confound: after making just a few free choices, the high-information options are the low-value options. This means that every single exploration-related activation in those studies now has a big question mark on it – is it an activation related to directed exploration, random exploration or both? The same can be said of the Beharelle finding, which is beautiful in how it shows opposite effects for anodal and cathodal stimulation, but which cannot dissociate directed and random exploration because of the nature of the behavioral task. To be clear, we do not mean to attack previous work here – these are all incredibly important studies. However, our findings do open them up to reinterpretation.

We have tried to emphasize this dual-process interpretation in the Discussion:

“In this work we used continuous theta-burst transcranial magnetic stimulation (cTBS) to investigate whether right frontopolar cortex (RFPC) is causally involved in directed and random exploration. […] This is consistent with the idea that the levels of directed and random exploration are set by the strength of an exploratory drive that varies as an individual difference between people.”

In the Introduction random exploration is framed as simply increasing decision noise, and directed exploration as information seeking. But is that really the critical distinction? In the real world random exploration is likely to occur when the environmental statistics have changed very rapidly and/or the animal has inferred (for whatever reason) their prior causal model (or even set of models) is (are) no longer tenable. In these circumstances their exploratory behavior is likely still characterized as information seeking, even if it manifests formally as an increase in decision noise. It seems to me, therefore, that the key distinction between directed and undirected exploration is that animals no longer know which options to explore. Can the authors clarify their view, and perhaps modify the Introduction and/or Discussion as needed?

This is a really interesting idea and one that would be worth investigating in its own right. We have added a few sentences to the Discussion suggesting that random exploration may be a “model-free” method of exploration that works especially well when the model is unknown.

“With the above caveats that our results may not be entirely due to disruption of frontal pole, the interpretation that RFPC plays a role in directed, but not random, exploration is consistent with a number of previous findings. […] Indeed, the ability to explore effectively in a model-free manner, may be an important function of random exploration as it allows us to explore even when our model of the world is wrong.”

What were the instructions to participants? Do they necessarily understand that the bandit means are constant and independent?

The instructions were a direct Polish translation of the original instructions used by Wilson et al. (2014). These instructions clearly state that the average reward from each bandit is constant in each game and that the variability is constant over the entire game. For reference see the supplementary material of the original paper. If you feel it would be important for this paper, we would be happy to include them as Supplementary Material.

Is there any evidence they weight more recent past samples more strongly than more distant samples? Would this change the estimates of the means in any meaningful way?

This is a great question and one that has pushed us to update the model. In particular, we have now modeled the learning process (i.e. the process by which participants infer the mean of each option from the forced trials) using a Kalman filter. This model assumes that participants learn the mean reward for each option using a delta-rule update equation

$$R^i_{t+1} = R^i_t + \alpha^i_t \left(r_t - R^i_t\right) \qquad (*)$$

where $\alpha^i_t$ is the time-varying learning rate. The time dependence of the learning rate is determined by the Kalman filter equations (see Materials and methods for a full description of the model) and can be parameterized by two parameters: the initial learning rate $\alpha_0$ and the asymptotic learning rate $\alpha_{\mathrm{inf}}$. Crucially, equation (*) allows for potentially uneven weighting of the rewards depending on the values of $\alpha_0$ and $\alpha_{\mathrm{inf}}$. Our previous model, with equal weighting given to all points, corresponds to the case of $\alpha_0 = 1$, $\alpha_{\mathrm{inf}} = 0$. Models with $\alpha_0 < 1$ and $\alpha_{\mathrm{inf}} > 0$ have a recency bias, weighting more recent rewards more strongly.

The posterior distributions over the group average values of $\alpha_0$ and $\alpha_{\mathrm{inf}}$ are shown in Figure 3 in the main paper. In particular, $\alpha_0 \approx 0.6$ and $\alpha_{\mathrm{inf}} \approx 0.45$, suggesting quite a pronounced recency effect. Importantly, however, neither of these parameters changes between stimulation conditions, and including this learning term in the model does not change the effect of TMS on directed exploration (information bonus in horizon 6).
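As an illustration of how two parameters can pin down the whole learning-rate sequence in equation (*), here is a minimal sketch for a single, repeatedly sampled option. It is not the authors' code (their full model is described in the Materials and methods and provided with the paper). It assumes a standard Kalman filter tracking a random-walk mean, with variances expressed relative to the observation noise; under that assumption the gain obeys a simple recursion whose fixed point is the asymptotic learning rate. The function names and the `drift` variable are hypothetical.

```python
def learning_rates(alpha_0, alpha_inf, n_steps):
    """Learning rates alpha_0, alpha_1, ... for one option sampled n_steps times.

    Assumes Kalman-filter gain dynamics with unit observation noise:
        alpha_{t+1} = (alpha_t + drift) / (1 + alpha_t + drift),
    where drift = alpha_inf**2 / (1 - alpha_inf) makes alpha_inf the fixed point.
    """
    if not 0.0 <= alpha_inf < 1.0:
        raise ValueError("alpha_inf must lie in [0, 1)")
    drift = alpha_inf ** 2 / (1.0 - alpha_inf)
    rates = [alpha_0]
    for _ in range(n_steps - 1):
        prior = rates[-1] + drift  # relative uncertainty before the next sample
        rates.append(prior / (1.0 + prior))
    return rates

def delta_rule_mean(rewards, alpha_0, alpha_inf, r_init=0.0):
    """Apply R <- R + alpha_t * (r_t - R) over a sequence of rewards for one option."""
    estimate = r_init
    for alpha, reward in zip(learning_rates(alpha_0, alpha_inf, len(rewards)), rewards):
        estimate += alpha * (reward - estimate)
    return estimate

# alpha_0 = 1, alpha_inf = 0 gives rates 1, 1/2, 1/3, ..., i.e. a plain running average.
print(learning_rates(1.0, 0.0, 4))
print(delta_rule_mean([40, 60, 50], alpha_0=1.0, alpha_inf=0.0))

# Values close to the reported group averages: the rate settles near alpha_inf,
# so recent rewards keep a substantial weight.
print(learning_rates(0.6, 0.45, 6))
```

With $\alpha_0 = 1$ and $\alpha_{\mathrm{inf}} = 0$ the sketch reduces to equal weighting of all samples, matching the special case described above; any $\alpha_{\mathrm{inf}} > 0$ keeps the learning rate from decaying to zero, which is what produces the recency bias.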

Given the demonstrated effects of cTBS on the hemodynamic signal measured in control networks (Gratton et al., 2013), how specific is the effect of stimulation to RFPC? To address this question, I would have liked to see the investigators target another frontal comparison brain region, in addition or instead of the vertex.

We agree that our inability to nail down the specificity of the effect is an important limitation of this work. Unfortunately we currently lack the resources to run a study looking at stimulation of other areas and have instead focused our efforts on increasing the sample size of the current study. Likewise, combining TMS with fMRI will be important in future work to more precisely characterize the effects of the perturbation. We have acknowledged both of these limitations in the Discussion as follows:

“While the present study does allow us to conclude that directed and random exploration rely on different neural systems, the limited spatial specificity of TMS limits our ability to say exactly what those systems are. […] Future work combining cTBS with neuroimaging will be necessary to shed light on these issues.”

[Editors' note: the author responses to the re-review follow.]

1) Why do the authors binarize relative information such that it is coded as +1 when the left gamble is more informative and -1 when the right gamble is? Based on Badre, Doll, Frank, et al. I would have thought that the estimated relative uncertainty between options would be more appropriate to quantitatively test the impact of stimulation on directed exploration. Or is variance in this quantity negligible across the critical choices in this task? Related to this question, is this quantity matched across conditions and do all subjects see identical or different schedules?

There are a few thoughts behind binarizing information. First, binary information matches the task design, in which there is only one unequal information condition and, from a normative perspective, no gradations in uncertainty. Related to this, and as the reviewer rightly intuits, the single unequal uncertainty condition means that the variance in relative uncertainty between options is relatively small, so there is very little difference between the binarized and continuous definitions of information. Because of this we have decided to stick with the binarized version in the paper so as to avoid overinterpreting the data.
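For readers who want to see where the binarized information term enters, the following is a rough sketch of a logistic choice rule with an information bonus and decision noise, in the spirit of the model described in Wilson et al. (2014). The exact functional form, the sign convention (probabilities are for choosing the left option) and all names (`binarized_info`, `p_choose_left`, `info_bonus`, `noise`, `bias`) are assumptions made for illustration rather than the fitted model itself.

```python
import math

def binarized_info(n_left, n_right):
    """+1 if the left option is more informative (fewer samples), -1 if the right is, 0 if equal."""
    if n_left < n_right:
        return 1
    if n_left > n_right:
        return -1
    return 0

def p_choose_left(mean_left, mean_right, delta_info, info_bonus, noise, bias=0.0):
    """Logistic choice probability with an information bonus and decision noise.

    delta_info may be the binarized value above or a continuous uncertainty difference.
    """
    delta_q = (mean_left - mean_right) + info_bonus * delta_info + bias
    # 0.5 * (1 + tanh(x / 2)) equals 1 / (1 + exp(-x)) but cannot overflow.
    return 0.5 * (1.0 + math.tanh(delta_q / (2.0 * noise)))

# Toy usage with made-up parameter values (not fitted estimates).
d_info = binarized_info(n_left=1, n_right=3)
print(p_choose_left(50, 50, d_info, info_bonus=10.0, noise=5.0))  # bonus favors the left option
print(p_choose_left(60, 40, 0, info_bonus=10.0, noise=20.0))      # high noise pulls toward 0.5
```

In this form a larger information bonus shifts choices toward the more informative option (directed exploration), while a larger noise parameter flattens the choice curve regardless of information (random exploration). Passing a continuous uncertainty difference as `delta_info` gives the variant in which the bonus scales with uncertainty.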

More generally, the parametric effect of uncertainty in this task is a key question and is something we are looking at behaviorally in ongoing experiments with different numbers of forced trials. Such explicit manipulation of information leads to much more variance in the uncertainties allowing us to compute parametric effects of uncertainty with more confidence. In brief, these results do suggest a linear effect of uncertainty as seen in previous work.

Of course, it is possible to fit the continuous model to the data in this paper, and when we do so we come to the same conclusions as with the binarized model – a selective effect of RFPC stimulation on directed exploration (see Author response images 2 and 3).

Author response image 2. Effect of TMS on information bonus in model with bonus proportional to uncertainty.

Author response image 3. Effect of TMS on decision noise in model in which bonus is a linear function of uncertainty.

Finally, as to whether participants received exactly the same schedule of trials, unfortunately this was not perfectly controlled in either direction. The first 16 subjects (the initial group) were run with the same random seed, while the remaining subjects (the replication group) were run with unique random seeds. Given that the results replicate between groups, we do not think this is a major issue, although we now include the following text in the Materials and methods section:

“Finally, the random seeds were not perfectly controlled between subjects. […] Despite this limitation we saw no evidence of different behavior across the two groups.”

2) Although I am not requesting the authors conduct another experiment, a second stimulation site within prefrontal cortex would make for an important comparison for future studies. My suggestion is in part due to the quite severe discomfort frequently caused by TMS stimulation to FPC and neighboring regions due to the underlying facial musculature, as compared to say the vertex. Any differences between stimulation sites could in theory be due to differences in discomfort or subsequent distraction produced by the stimulation sites. Here, this difference could conceivably interact with the comparison between horizon 6 and horizon 1 in the unequal condition if this horizon 6 condition is in fact more cognitively demanding. Note this is not a concern in the cited tDCS study by Raja Beharelle et al. because tDCS does not stimulate the facial muscles and because the excitation and inhibition respectively following anodal and cathodal tDCS provides for an internal control. Can the authors provide some evidence that horizon 6 in the unequal condition is not the most cognitively demanding for their subjects, for instance by analysing RTs?

We agree that other types and locations of stimulation will be an important avenue for future work, and this is something that I (RCW) am planning to pursue once TMS becomes available at UA.

The point about pain is also important. As we understand it, the idea is that RFPC stimulation can be painful. Pain is distracting which leads to worse performance, especially when a task is cognitively demanding. Thus if directed exploration in horizon 6 is the most cognitively demanding component of the task, then distraction from pain could cause the effect.

While we cannot rule this interpretation out entirely, two results suggest that simple distraction is likely not to blame.

First, one prediction of the distraction hypothesis is that people should perform worse overall when distracted by pain. In the model-free analysis this should show up as increased p(low mean) with stimulation of frontal pole. In the model-based analysis, distraction should manifest as increased decision noise in both [1 3] and [2 2] conditions. In both analyses we see no effect of RFPC stimulation (Figures 3B and 4). This effectively puts an upper bound on how distracting the pain could be – the distraction effect must be small enough to cause no change in the ability to pick out the high reward option.

Of course, the above analysis says nothing about the lower bound, and it could still be the case that, while the pain is not distracting enough to affect computing the mean reward, it is distracting enough to affect the computation of the information bonus. This could be the case if computing the information bonus were harder than computing the mean. Evidence for this increased computational load could come from reaction times. Specifically, if computing the bonus is difficult, then RTs should be longer in the [1 3] condition in horizon 6 than in horizon 1. As shown in Author response image 4, this is not the case: there is no effect of horizon on RT for the first free choice (F = 1.32, p = 0.26). Thus, computing the information bonus is not a time-consuming process, suggesting that it is not any more taxing than computing the difference in means between options.

Author response image 4.

Together with the null effect on p(low mean), we believe that these results provide good evidence that our effects are driven by neural changes (presumably in RFPC, although this is impossible to verify without neuroimaging) rather than by a response to pain.

Are there existing data that address this concern by comparing stimulation of FPC and other PFC regions using cTBS?

A Google Scholar search for “cTBS frontal pole” found only one paper that reported pain measures. None that we could find directly compared pain from stimulation to FPC and other areas of PFC.

Hanlon, C. A., Dowdle, L. T., Correia, B., Mithoefer, O., Kearney-Ramos, T., Lench, D., […] and George, M. S. (2017). Left frontal pole theta burst stimulation decreases orbitofrontal and insula activity in cocaine users and alcohol users. Drug and Alcohol Dependence.

This study compared cTBS of the frontal pole with a sham stimulation of muscles with electrodes. The study found that participants could not distinguish TMS from sham stimulation. More importantly for our purposes, they also found that pain subsided quickly: “Subjective reports indicated that the painfulness of the protocol subsided after the first 15-30 s”.

The following other studies uncovered by the same search did not report measures of pain / discomfort.

Costa, A., Oliveri, M., Barban, F., Torriero, S., Salerno, S., Lo Gerfo, E., […] and Carlesimo, G. A. (2011). Keeping memory for intentions: a cTBS investigation of the frontopolar cortex. Cerebral Cortex, 21(12), 2696-2703.

Costa, A., Oliveri, M., Barban, F., Bonnì, S., Koch, G., Caltagirone, C., and Carlesimo, G. A. (2013). The right frontopolar cortex is involved in visual-spatial prospective memory. PLoS One, 8(2), e56039.

Rahnev, D., Nee, D. E., Riddle, J., Larson, A. S., and D’Esposito, M. (2016). Causal evidence for frontal cortex organization for perceptual decision making. Proceedings of the National Academy of Sciences, 113(21), 6059-6064.

3) The trend of an effect of RFPC stimulation on the information bonus for horizon 1, although smaller than that of horizon 6, seems problematic for an interpretation purely based on directed exploration, since there is no opportunity to exploit the newly acquired information for horizon 1. The authors suggest subjects may become less information-seeking in both conditions (consistent with risk or ambiguity aversion in horizon 1 and reduced directed exploration in horizon 6), but this begs the question of what process or mechanism underlies this decrease in both horizons. Given the broader literature on the role of FPC, one interpretation would be that stimulation has disrupted the FPC's ability to faithfully encode the parameters of a "pending" option that they may choose in the future (e.g. Koechlin and Hyafil, Science, 2007) – in this task this could be seen as the option that has not been selected as frequently or attended to recently during forced choices. However I am sure there are other plausible interpretations. How do the authors interpret this effect across horizons in the unequal condition with respect to the broader literature on FPC?

We have dug into this point more and can now include more detail. What we believe is going on here is a tradeoff between the mean of the prior, R0, and the information bonus A. While this tradeoff does not affect our conclusions that RFPC stimulation selectively affects directed exploration, we believe that the tradeoff does suggest caution when interpreting the horizon 1 result.

In particular, note that in Figure 4, in addition to the information bonus going down in both horizons, the prior mean goes up, suggesting a possible tradeoff between the information bonus parameter and the mean of the prior. Such a tradeoff is to be expected in this task because the prior has a larger effect on the more uncertain option, i.e., the option chosen once in the [1 3] condition. This larger effect of the prior means that increasing R0 can have a similar effect to an information bonus by increasing the relative value of the uncertain option (in RL terms, this would be exploration by optimistic initialization). Thus, in the context of this task, the model contains an inherent tradeoff between the information bonus and the mean of the prior.
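The reason the prior weighs more heavily on the option seen once can be shown with a small Python sketch. It assumes a stationary Gaussian prior with mean R0 and Gaussian reward noise, which is a simplification of the learning model in the paper; the function names and the values of `prior_sd` and `noise_sd` are hypothetical and chosen only for illustration.

```python
def prior_weight(n_obs, prior_sd=20.0, noise_sd=8.0):
    """Weight placed on the prior mean after n_obs observations
    (Gaussian prior, Gaussian noise; illustrative values)."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = n_obs / noise_sd**2
    return prior_precision / (prior_precision + data_precision)

def posterior_mean(observations, r0, prior_sd=20.0, noise_sd=8.0):
    """Precision-weighted average of the prior mean r0 and the sample mean."""
    w = prior_weight(len(observations), prior_sd, noise_sd)
    sample_mean = sum(observations) / len(observations)
    return w * r0 + (1.0 - w) * sample_mean

# Both options paid 50 points on every forced trial in this toy example.
uncertain = [50.0]              # option seen once in the [1 3] condition
certain = [50.0, 50.0, 50.0]    # option seen three times

# Value advantage of the uncertain option for two settings of the prior mean.
advantage_neutral = posterior_mean(uncertain, 50.0) - posterior_mean(certain, 50.0)
advantage_optimistic = posterior_mean(uncertain, 70.0) - posterior_mean(certain, 70.0)
```

With a prior mean equal to the observed rewards the two options are valued equally, whereas an optimistic prior lifts the once-seen option more than the thrice-seen one. In a choice rule this is hard to tell apart from a positive information bonus, which is the tradeoff described above.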

In practice, the tradeoff between R0 and A shows up as correlations in the posteriors. This is shown in the updated Figure 5 in the manuscript, where we plot samples from the posterior over the change in R0 between stimulation conditions (R0(vertex) – R0(RFPC)) against the change in information bonus (A(vertex) – A(RFPC)). In both horizon 1 (panel A) and horizon 6 (panel B) there is a tradeoff between the two parameters. However, while the tradeoff affects the interpretation of the horizon 1 and horizon 6 results taken alone, it does not affect the interpretation of the horizon-based change in information bonus (panel C).
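A summary of this kind can be computed from posterior samples roughly as follows. This is a minimal Python sketch, assuming the samples for each parameter and stimulation condition are available as equal-length lists; the function names are hypothetical, and the numbers in the usage lines are toy values for illustration only, not results from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def fraction_below_zero(samples):
    """Share of posterior samples that fall below zero."""
    return sum(s < 0 for s in samples) / len(samples)

def summarize_tradeoff(r0_vertex, r0_rfpc, a_vertex, a_rfpc):
    """Sample-wise vertex-minus-RFPC differences and their correlation."""
    d_r0 = [v - r for v, r in zip(r0_vertex, r0_rfpc)]
    d_a = [v - r for v, r in zip(a_vertex, a_rfpc)]
    return {
        "correlation": pearson(d_r0, d_a),
        "frac_d_r0_below_zero": fraction_below_zero(d_r0),
        "frac_d_a_below_zero": fraction_below_zero(d_a),
    }

# Toy samples only, to show the call; these are not data from the paper.
summary = summarize_tradeoff(
    r0_vertex=[1.0, 2.0, 3.0, 4.0], r0_rfpc=[2.0, 2.0, 2.0, 2.0],
    a_vertex=[4.0, 3.0, 2.0, 1.0], a_rfpc=[2.0, 2.0, 2.0, 2.0],
)
```

A strong correlation between the two sets of differences indicates that the data constrain a combination of R0 and A better than either parameter alone, which is why single-horizon estimates call for caution.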

In addition to including this new figure, we have addressed this point in the manuscript with the following text:

“In addition to the effect on the information bonus in horizon 6, there is also a hint of an effect on the information bonus in horizon 1 (85% samples less than zero) and on the prior mean R0 (88% samples above zero). […] Taken together, this suggests that we can be fairly confident in our claim that RFPC stimulation has a selective effect on directed exploration.”

Associated Data

    Supplementary Materials

    Transparent reporting form
    DOI: 10.7554/eLife.27430.012

    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
