PLOS Biology. 2023 Jan 30;21(1):e3001985. doi: 10.1371/journal.pbio.3001985

Neural responses in macaque prefrontal cortex are linked to strategic exploration

Caroline I Jahn 1,2,3,*,#, Jan Grohn 1,*,#, Steven Cuell 1, Andrew Emberton 4, Sebastien Bouret 2, Mark E Walton 1, Nils Kolling 5,6,‡,*, Jérôme Sallet 1,6,‡,*
Editor: Thorsten Kahnt
PMCID: PMC9910800  PMID: 36716348

Abstract

Humans have been shown to explore strategically: they can identify situations in which gathering information about distant and uncertain options is beneficial for the future. Because primates rely on scarce resources when they forage, they are also thought to explore strategically, but whether they use the same strategies as humans, and the neural bases of strategic exploration in monkeys, are largely unknown. We designed a sequential choice task to investigate whether monkeys mobilize strategic exploration based on whether information can improve subsequent choices, and to ask the novel question of whether monkeys adjust their exploratory choices based on the contingency between choice and information, by sometimes providing counterfactual feedback about the unchosen option. We show that monkeys decreased their reliance on expected value when exploration could be beneficial, but this was not mediated by changes in the effect of uncertainty on choices. We found strategic exploratory signals in anterior and mid-cingulate cortex (ACC/MCC) and dorsolateral prefrontal cortex (dlPFC). This network was most active when a low value option was chosen, which suggests a role in counteracting expected value signals when exploration away from value should be considered. Such strategic exploration was abolished when counterfactual feedback was available. Learning from counterfactual outcomes was associated with the recruitment of a different circuit centered on the medial orbitofrontal cortex (OFC), where we showed that monkeys represent chosen and unchosen reward prediction errors. Overall, our study shows how ACC/MCC-dlPFC and OFC circuits together could support the fullest exploitation of available information and drive behavior towards finding more information through exploration when it is beneficial.


Humans have been shown to strategically explore; non-human primates are also thought to strategically explore, but do they use the same strategies and neural bases? This study reveals that monkeys strategically explore to improve future outcomes; such strategic exploration is linked to distinct signals in their frontal cortex.

Introduction

In many species, most behaviors, including foraging, can be accounted for by simple mechanisms—approach/avoidance of an observed and immediately available source of food—that require no mental representations. Exploration is, by definition, a non-value-maximizing strategy [1,2], so in those models exploration is often reduced to a random process, where noise in behavior can lead animals to change behavior by chance [1,3–6]. However, in species relying upon spatially and temporally scattered resources, such as fruits, exploration is thought to be aimed at gathering information about the environment in order to form a mental representation of the world. Work in monkeys and humans has shown that primates are sensitive to the novelty of an option when deciding to explore [7–9]. They sample novel options until they have formed a representation of their values relative to the available options. This work clearly showed that monkeys have a representation of uncertainty and actively explore to reduce it. Similar results have been shown in humans, whose exploration is driven by the uncertainty about the options [10,11]. However, it is still unknown whether monkeys have a specific representation of potential future actions and outcomes that enables them to organize their behavior over longer temporal or spatial scales. We designed a novel paradigm in monkeys—based on work in humans—to assess whether and how monkeys engage in strategic exploration, that is, exploring only when it is useful for the future. Strategic exploration enables an animal to adapt to a specific context and is essential to maximize rewards over longer temporal and spatial scales. For frugivorous animals such as primates, it might be critical for survival.

Humans have been shown to strategically explore [12–15], but there is little evidence in other species. Inspired by Wilson and colleagues’ “horizon” exploration task [12], we developed a task to investigate whether monkeys mobilize strategic exploration based on whether information can improve subsequent choices. Importantly, non-human primate models provide insights into the evolutionary history of cognitive abilities and of the neuronal architecture supporting them [16]. Given rhesus monkeys’ ecology (including feeding), they should also be able to use strategic exploration, but the extent to which they can mobilize it might differ from that of humans. Based on the similarities in the circuits supporting cognitive control and decision-making processes in humans and macaques [17,18], one could further hypothesize that the same neurocognitive processes (the same computational model) might be recruited, but not to the same extent (different weights).

As in Wilson and colleagues’ original study, we manipulated whether the information could be used for future choices by changing the choice horizon [12]. By comparing exploration in both conditions, we could test whether the animals reduced their reliance on value estimates (random exploration) and increased their preference for more uncertain options (directed exploration) when gathering information was useful for future choices in the long horizon [12]. In addition, we manipulated the contingency between the choice and the information by varying the type of feedback that monkeys received. In the complete feedback condition, information was freely available, and we could probe whether monkeys decreased their exploration compared to the classic partial feedback condition. In humans, providing complete feedback decreases decision noise [19] and improves learning [20–23], both of which are consistent with reduced exploration. A strategic explorer would only actively explore—and forgo immediate rewards—when it is useful for the future (long horizon) and when it is the only way to obtain information (partial feedback). In addition to behavioral data, neural data were collected using fMRI to probe the neural substrates of strategic exploration. Our analysis was focused on regions previously identified in fMRI studies on reward valuation and cognitive control in monkeys [24–29]. Finally, we took advantage of the different feedback conditions to explore how monkeys update their expectations based on new information. Specifically, we investigated the behavioral and neural consequences of feedback about the outcome of their choice and—in the complete feedback condition—of counterfactual feedback about the alternative.

We found that rhesus monkeys engaged in strategic exploration by decreasing their reliance on expected values (random exploration) when it was useful for the future (long horizon) and when active sampling was the only way to obtain information (partial feedback). Neurally, we found prefrontal strategic exploration signals in the anterior and mid-cingulate cortex (ACC/MCC) and dorsolateral prefrontal cortex (dlPFC). However, we did not find a significant modulation by horizon or feedback type of the effect of uncertainty (directed exploration) on choices. When making choices in a sequence (long horizon), we found evidence that macaques used counterfactual feedback to guide their choices. Complementing this activity at the time of decision, in the complete feedback condition we found overlapping chosen and unchosen outcome prediction error signals in the orbitofrontal cortex (OFC) at the time of receiving the outcome. The counterfactual prediction errors in the OFC are particularly interesting as they point to the neural system that allowed the macaques to forgo making exploratory choices in the complete feedback condition, which could also change how the MCC-dlPFC network represented the value of the chosen option.

Results

Probing strategic exploration in monkeys

Three monkeys performed a sequential choice task inspired by Wilson and colleagues [12]. In this paradigm, called the horizon task, monkeys were presented with one choice (short horizon) or a sequence of four choices (long horizon) between two options (Fig 1A). Each option belonged to one side of the screen and had a corresponding touch pad located under the screen (see Materials and methods for details). Both types of choice sequence (long and short horizon) started with an “observation phase” during which monkeys saw four pieces of information randomly drawn from both options and reflecting the outcome distribution of each option. They received at least one piece of information per option (Fig 1B). Each piece of information was presented exactly like subsequent choice outcomes, as a bar length (equivalent to 0 to 10 drops of juice) drawn from the option’s outcome distribution. The animals had learned that the length of the orange bar on a yellow background indicated the number of drops of juice associated with that specific option on a given trial (Fig 1B). One option was associated with a larger reward (more drops of juice) on average than the other. The means of the distributions were fixed within a sequence but unknown to the monkey. Monkeys only received the reward associated with the option they chose at the end of each choice.

Fig 1. Task and model.


(A) During the task, we manipulated whether the information could be used in the future by including both long and short horizon sequences. In both trial types, monkeys initially received four samples (“observations”) from the unknown underlying reward distributions. In short horizon trials, they then made a one-off decision between the two options presented on screen (“choice”). In long horizon trials, they could make four consecutive choices between the two options (fixed reward distributions). On the first choice (highlighted), the information content was equivalent between short and long horizon trials (same number of observations), whereas the information context was different (learning and updating is only beneficial in the long horizon trials). (B) Example short and long horizon trials. The monkeys first received some information about the reward distributions associated with choosing the left and right option. The length of the orange bar indicates the number of drops of juice they could have received (0–10 drops). The horizon length of the trial is indicated by the size of the grey area below the four initial samples. The monkeys then made one (short horizon) or four (long horizon) subsequent choices. As monkeys progressed through the four choices, more information about the distributions was revealed. Displayed here is a partial information trial where only information about the chosen option is revealed. (C) Ideal observer model for the options of the example trial shown in B (color code corresponds to the side of the option). The distributions correspond to the probability of observing the next outcome for each option. The expected value corresponds to the peak of the distribution and the uncertainty to the variance. Thick lines correspond to post-outcome estimates and thin lines to pre-outcome estimates (from the previous trial). (D) We also modulated the contingency between choice and information by including different feedback conditions. In the partial feedback condition, monkeys only received feedback for the chosen option. In contrast, in the complete feedback condition, they received feedback about both options after active choices (not in the observation phase). (E) Example partial and complete feedback trials (both short horizon). Here, the observation phase shown in (B) is broken up into the components the monkeys saw on screen during the experiment. Initially, the samples were displayed on screen, with a red circle in the center indicating that the monkeys could not yet respond. After a delay, the circle disappeared, and the monkeys could choose an option. After they responded, the chosen side was highlighted (red outline). After another delay, the outcome was revealed. In the partial feedback condition (top), only the outcome for the chosen option was revealed. In contrast, in the complete feedback condition (bottom), both outcomes were revealed. After another delay, the reward for the chosen option was delivered in both conditions.

First, we manipulated whether the information gathered during the first choice could be useful in the future. During a session, we varied the number of times monkeys could choose between the options (horizon length). The horizon length was visually cued (Fig 1A and 1B). On short horizon trials, the information provided by the outcome of the choice could only be used for the current choice and was then worthless going forward. On long horizon trials, it could be used to guide a sequence of four choices. Second, we manipulated the contingency between choice and information by varying the type of feedback monkeys received after their active choices (the observation phase was identical for partial and complete feedback conditions). In the partial feedback condition, they only saw the outcome for the option they chose. In the complete feedback condition, they saw the outcome of both the option they chose and the alternative option (Fig 1D and 1E). In the latter case, the information about the options could be learned from the counterfactual outcomes—the outcome that would have been obtained had a different choice been made. This type of feedback is sometimes referred to as “hypothetical” [30] or “fictive” feedback [31]. The feedback condition was not cued but was fixed during a session.

To assess monkeys’ sensitivity to the expected value of and the uncertainty about the options, we set up an ideal observer Bayesian model (see Materials and methods for model details), which estimates the probability of observing the next outcome given the current information (Fig 1C). This model uses only the visual information available on the screen to infer the true underlying mean value of each option; it uses neither the horizon nor the feedback type, as those are irrelevant for this inference. We extracted the expected value (peak of the probability distribution of the next observation, i.e., the most likely next outcome) and the uncertainty (variance) of each option from the model to evaluate monkeys’ sensitivity to these variables. If monkeys did not engage in strategic exploration, the effect of expected value should be unaffected by the manipulations of horizon and feedback, as was the case for the model.
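To make the logic concrete, the sketch below implements a generic Gaussian ideal observer of the kind described above. The paper’s exact likelihood, priors, and parameters are specified in Materials and methods; the values here are illustrative assumptions only.

```python
import numpy as np

def ideal_observer(samples, prior_mean=5.0, prior_var=25.0, noise_var=4.0):
    """Gaussian ideal-observer sketch: infer one option's underlying mean
    from the outcomes seen so far, then return the peak and variance of the
    predictive distribution for the next outcome. All parameters are
    illustrative assumptions, not the paper's fitted values."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Conjugate Gaussian update of the belief about the option's true mean.
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + samples.sum() / noise_var)
    # Predictive distribution of the next outcome: belief plus outcome noise.
    expected_value = post_mean            # peak of the predictive distribution
    uncertainty = post_var + noise_var    # its variance
    return expected_value, uncertainty

# Example observation phase: three samples from one option, one from the other.
print(ideal_observer([6, 7, 5]))  # well-sampled option: lower uncertainty
print(ideal_observer([4]))        # single sample: higher uncertainty
```

Note how the option with fewer samples retains a wider predictive distribution; this variance is the uncertainty regressor used in the choice analyses below.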

The horizon length and the type of feedback modulate monkeys’ exploration

We first focused our analysis on the first choice of the trial, as the information about the reward probabilities of the two options was identical across horizons and feedback conditions, such that choices should only be affected by the contextual manipulations (horizon and feedback type). If monkeys were sensitive to whether the information could be used in the future, they would explore more in the long compared to the short horizon. This is because information obtained early in a trial can only be beneficial for subsequent choices in long horizon trials. Moreover, exploration should only occur when obtaining information is contingent on choosing an option, i.e., in the partial feedback condition (Fig 2A).

Fig 2. First choice.


(A) In our experimental design, on the first choice of a horizon, directed exploration is only sensible in long horizon trials in the partial feedback condition. This is because in short horizon trials, the information gained by exploring is of no use for subsequent choices, so a rational decision-maker would only choose based on the expected value of the options. Moreover, in the complete feedback condition, all information is obtained regardless of which option is chosen, so an ideal observer would again always choose the option with the highest expected value. (B) The proportion of trials in which the monkeys chose the option with the higher expected value is above chance level (0.5) across both feedback conditions and horizons. Mean across sessions (partial feedback: 41 sessions, complete feedback: 40 sessions). (C) Monkeys’ choices are sensitive to nuanced differences in expected value. Mean across all sessions (81 sessions). (D) According to the logistic regression model predicting monkeys’ first choices in a horizon (see main text and methods for details), monkeys’ first choices are less driven by expected value in the partial than in the complete feedback condition. Within the partial feedback condition, they are less driven by expected value in long than in short horizon trials. No such difference was found in the complete feedback condition. This is evidence that monkeys deliberately modulate their exploration behavior to explore more on partial feedback long horizon trials, where exploration is sensible (see (A)). Error bars indicate standard error of the mean in B and C and standard deviation in D. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

We first ensured that monkeys’ choices were influenced by the expected value computed by the Bayesian model. We looked at the accuracy (defined as choosing the option with the highest expected value according to the model) during the first choice. For the two horizon lengths and in both feedback conditions, accuracy was above chance level (t test against a mean of 0.5; partial feedback short horizon: t (40) = 10, p < 0.001, partial feedback long horizon: t (40) = 8.930, p < 0.001, complete feedback short horizon: t (39) = 7.9, p < 0.001, complete feedback long horizon: t (39) = 8.963, p < 0.001, Fig 2B). Therefore, monkeys used the information provided by the observations on each trial to guide their choices. Monkeys also adjusted their choices to variations in expected value, as can be seen when pooling together both feedback conditions and horizon lengths (Fig 2C; see statistical significance in Fig 2D).

Although choices were guided by the expected value of the options above chance level, monkeys still sometimes chose the less valuable option in both conditions and horizons (Fig 2B and 2C). We examined whether monkeys were less driven by expected value on partial feedback long horizon trials, as exploration is only a relevant strategy on these trials (Fig 2A). To test this hypothesis, we ran a single logistic regression predicting responses during first choices in the partial and complete conditions with the following regressors: the expected value according to our Bayesian model, the uncertainty according to our Bayesian model, the horizon (short/long), and the interactions of expected value and uncertainty with horizon. To the same model, we added two potential biases: a side bias and a tendency to repeat the same action. We allowed regressors to vary by condition (partial or complete feedback) and by monkey, and modelled sessions as random effects for each monkey, with all regressors included as random slopes. We confirmed that in both feedback conditions, monkeys tended to choose the option with the highest expected value (p < 0.000001 in the partial condition and p < 0.000001 in the complete; one-sided test, based on samples drawn from the Bayesian posterior; see Materials and methods). We identified that monkeys relied more on the difference in expected value in the complete than in the partial feedback condition (p = 0.0024; one-sided test), and more in the short than in the long horizon in the partial condition only (p = 0.0163 in the partial condition and p = 0.6598 in the complete; one-sided test). Thus, animals engaged in strategic exploration by reducing their reliance on expected value. In other words, animals strategically modulated the degree to which they used random exploration depending on both the horizon length and the feedback type (S3A Fig).
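The structure of this analysis can be illustrated with a simplified, non-hierarchical version fitted to simulated data (the actual model is a hierarchical Bayesian logistic regression with session- and monkey-level random effects; see Materials and methods). All variable names and parameter values below are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000

# Simulated trial-level variables (stand-ins for the Bayesian-model outputs).
dv   = rng.normal(0, 1, n)             # right-minus-left expected value
dunc = rng.normal(0, 1, n)             # right-minus-left uncertainty
long = rng.integers(0, 2, n)           # 1 = long horizon, 0 = short
side = np.ones(n)                      # constant column -> side bias
prev = rng.integers(0, 2, n) * 2 - 1   # previous choice (+1 right, -1 left)

# Generate choices with a weaker value weight in the long horizon
# (the "random exploration" signature reported in the paper).
logit = (1.2 - 0.4 * long) * dv - 0.2 * dunc + 0.1 * prev
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([side, long, dv, dunc, long * dv, long * dunc, prev])
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)  # negative dv-by-horizon weight ~ more random exploration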

We next looked at the effect of uncertainty. Exploratory behaviors should be sensitive to how much they can reduce uncertainty, i.e., the animals should optimally pick the most uncertain option when they explore [12]. We found that monkeys were sensitive to uncertainty overall, avoiding options that were more uncertain in both the partial and the complete feedback conditions (p = 0.0081 in the partial condition and p = 0.00025 in the complete; one-sided test) (see S1 Fig for the full model fit and the posteriors for each individual subject). This risk aversion was driven by the difference in the number of samples presented: when we restricted our analysis to the trials where monkeys received two samples of each option, they showed a small preference for the more uncertain option (p = 0.077 in the partial condition and p = 0.066 in the complete; p = 0.02 when combined; one-sided test) (see S2 Fig for the full model fit and the posteriors for each individual subject). However, we found no statistically reliable difference in the sensitivity to uncertainty across the experimental conditions. We also ran a second model that used the number of available samples (which is mathematically equivalent to the model used by Wilson and colleagues [12]) rather than uncertainty and found identical results, both for the effects of expected value and for the absence of an effect on the sensitivity to the number of available samples (S3B and S4 Figs for full model fits and the posteriors for each individual subject). Therefore, uncertainty did not play a key role in strategic exploration in our task. This indicates that our macaques did not use directed exploration to strategically guide their choices (S3 Fig).

Finally, we checked whether the decision variables were stable over trials and across sessions. First, in the above regression model, we added interaction terms with the trial number. We found no significant interaction with our regressors of interest (expected value, expected value interaction with horizon, uncertainty, and uncertainty interaction with horizon). Second, we fitted each session separately and looked for a linear trend in the session number. We found no linear or other clear trend in the regressors over sessions. Overall, we found no evidence that monkeys’ decision variables changed throughout the recordings.

Monkeys learn from chosen and counterfactual feedback

We next assessed whether monkeys used the information they collected during previous choices to update their subsequent choices, and how the nature of the feedback affected this process. To this end, we focused our analysis on choices from long horizon trials. On such trials, monkeys’ accuracy (defined as choosing the option with the highest expected value according to the model) was always above chance level (t test against a mean of 0.5; all p < 10^−10) and increased as they progressed through the sequence (t test against a mean of 0 of the distribution of regression coefficients of trial number onto accuracy (both z-scored) for each session; partial feedback condition: t (40) = 11.3653, p < 0.001, complete feedback condition: t (39) = 5.6590, p < 0.0001) (Fig 3A). We inferred that this improvement was due to the use of the information collected during the choices. To examine this, we isolated the change in expected value compared to the initial “observation phase” (see Materials and methods). We found that monkeys were sensitive to the change in expected value both for the chosen option (in the partial and complete feedback conditions) and the unchosen option (counterfactual feedback, in the complete feedback condition only) (Fig 3B and 3C; see statistical significance in Fig 3E). Monkeys displayed a significant tendency to choose the same option (t test against a mean of 0.5; all p < 10^−6), which sharply increased after the first trial (paired t test between the first choice and the subsequent choices; all p < 10^−10) and kept increasing after the first choice (t test against a mean of 0 of the distribution of regression coefficients of trial number onto the probability of choosing the same option (both z-scored) for each session; partial feedback condition: t (40) = 5.3026, p < 0.001, complete feedback condition: t (39) = 3.1265, p = 0.0033) (Fig 3D).

Fig 3. Behavioral update.


(A) As monkeys progressed through the long horizon, they were more likely to choose the option with the higher expected reward in both the partial and complete feedback conditions. Mean across sessions (partial feedback: 41 sessions, complete feedback: 40 sessions). (B) Monkeys were sensitive to changes in the expected value compared to the baseline expected value they experienced during the observation phase both for the chosen option (mean across all sessions (81 sessions)) and (C) the unchosen option (mean across all complete feedback sessions (40 sessions)). (D) Monkeys were also more likely to repeat their choice as they progressed through the long horizon. Mean across sessions (partial feedback: 41 sessions, complete feedback: 40 sessions). (E) Results of the single logistic regression model predicting second, third, and fourth choices in the long horizon. In both the partial and complete feedback conditions, monkeys were sensitive to the expected value at observation but more so in the complete than the partial feedback condition (left). Monkeys tended to repeat previous choices in both conditions but more so in the partial than in the complete feedback condition (center left). In both conditions, monkeys were sensitive to the change in expected value compared to the observation phase with no significant difference between conditions (center right). In the complete feedback condition, monkeys were also sensitive to the change compared to baseline of the additional information they received. Error bars represent standard error of the mean in A-D and standard deviation in E. *p < 0.05, **p < 0.01, and ***p < 0.001. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

We investigated the determinants of these effects by performing a single logistic regression for all non-first choices with the following regressors: the expected value and uncertainty during the observation phase (which served as baselines for subsequent choices) and the changes from these baselines as new information was revealed while monkeys progressed through the horizon (see the schematic model below). We also added to the same model three potential biases in choices: a side bias, a tendency to repeat the same action, and a bias for choosing the option most often chosen (see S5 Fig for the full model fit and the posteriors for each individual subject). Just as with the previous regression model for first choices, we again allowed regressors to vary by condition and monkey and modelled sessions as random effects. We confirmed that monkeys remained sensitive to the difference in expected value from the observation phase that had guided the first choice (p < 0.000001 in the partial condition and p < 0.000001 in the complete; one-sided test). Consistent with the choice behavior on the first choice, monkeys relied more on this difference in the complete than in the partial feedback condition in subsequent choices (p = 0.0192, one-sided test; Fig 3E). Monkeys were biased towards repeating the same choice (p < 0.000001 in the partial condition and p < 0.000001 in the complete; one-sided test), but this bias was more pronounced in the partial feedback condition (p = 0.0018, one-sided test; Fig 3E), as can already be seen in Fig 3D. Monkeys also preferred to choose the option they had chosen most often (p < 0.000001 in the partial condition and p < 0.000001 in the complete; one-sided test), which explained the increase in repetition bias over time, but this was not affected by the feedback type (partial > complete: p = 0.309) (S5 Fig). Monkeys were sensitive to the change in expected value when the information was related to the chosen option (p < 0.000001 in the partial condition and p < 0.000001 in the complete; one-sided test), with no statistical difference between the partial and complete feedback conditions (partial > complete: p = 0.6913). Finally, in the complete feedback condition, monkeys were sensitive to the change in expected value obtained from the counterfactual feedback (p < 0.000001; one-sided test; Fig 3E).
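In schematic form (our notation, not the paper’s), this model predicts each non-first choice as:

```latex
\mathrm{logit}\,P(\text{choose right}_t) =
    \beta_{\mathrm{side}}
  + \beta_{V}\,\Delta V^{\mathrm{obs}}
  + \beta_{U}\,\Delta U^{\mathrm{obs}}
  + \beta_{\mathrm{ch}}\,\Delta V^{\mathrm{chosen}}_{t}
  + \beta_{\mathrm{unch}}\,\Delta V^{\mathrm{unchosen}}_{t}
  + \beta_{\mathrm{rep}}\,c_{t-1}
  + \beta_{\mathrm{freq}}\,f_{t}
```

where $\Delta V^{\mathrm{obs}}$ and $\Delta U^{\mathrm{obs}}$ are the right-minus-left expected value and uncertainty from the observation phase, $\Delta V^{\mathrm{chosen}}_{t}$ and $\Delta V^{\mathrm{unchosen}}_{t}$ are the changes from those baselines driven by chosen and counterfactual feedback (the latter defined only in the complete feedback condition), $c_{t-1}$ codes the previous choice, and $f_{t}$ the most frequently chosen option.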

Overall, we found that on top of being more sensitive to the expected value difference during the initial evaluation, monkeys were less likely to be biased towards repeating the same action when they had counterfactual feedback to further guide their choices in the complete feedback condition. They were able to learn about the options, using both the chosen and the counterfactual feedback when it was available.

Strategic exploration signals in ACC/MCC and dlPFC

To identify brain areas associated with strategic exploration, we ran a two-level multiple regression analysis using a general linear model (GLM). For each individual session, we used a fixed-effects model. To combine sessions and monkeys, we used random effects as implemented in the FMRIB’s Local Analysis of Mixed Effects (FLAME) 1 + 2 procedure from the FMRIB Software Library (FSL). We focused our analysis on regions previously identified in fMRI studies on reward valuation and cognitive control in monkeys [24–29]. Thus, to look only at the regions we were interested in and to increase the statistical power of our analysis, we only analyzed data in a volume of interest (VOI) covering frontal cortex and striatum (previously used by Grohn and colleagues [29]). We used data from 75 (41 partial feedback and 34 complete feedback) of the 81 (41 partial feedback and 40 complete feedback) sessions we had acquired (fMRI data from 6 sessions were corrupted and unrecoverable). Details of all regressors included in the model can be found in the Materials and methods section. In addition to the analysis in the VOI, we examined activity in functionally and anatomically defined regions of interest (ROIs). These ROIs were not chosen a priori but were selected based on the activity in the VOI. The goal of these analyses was either (i) to examine the effect of a variable different from the one used to define the ROI, which is an independent test, so we could assess the statistical significance of that variable’s effect on activity in the ROI, or (ii) to illustrate an effect revealed in the VOI, which is not an independent test, so we did not perform any statistical analysis.
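Conceptually, the first level of such an analysis reduces to an ordinary least-squares fit per session; a minimal sketch with toy data is shown below. This is an illustration only: the actual pipeline used FSL’s FLAME 1 + 2 machinery, which additionally prewhitens the data, convolves regressors with a haemodynamic response function, and models mixed-effects variance at the second level.

```python
import numpy as np

def first_level_glm(Y, X):
    """Fixed-effects first level sketch: OLS fit of a design matrix X
    (time x regressors) to voxel time series Y (time x voxels);
    returns betas and t-statistics per regressor and voxel."""
    beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof          # residual variance per voxel
    var_beta = np.outer(np.diag(np.linalg.inv(X.T @ X)), sigma2)
    return beta, beta / np.sqrt(var_beta)

# Toy dimensions: 200 volumes, 3 regressors, 50 voxels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([[1.0], [0.0], [-0.5]])
Y = X @ true_w @ np.ones((1, 50)) + rng.normal(size=(200, 50))

betas, tstats = first_level_glm(Y, X)
print(tstats.shape)  # (3, 50): one t-statistic per regressor and voxel
```

The per-session t-statistics produced this way are what the ROI analyses described below extract and compare across sessions.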

To examine how monkeys use the initial information displayed during the observation phase of the task differently depending on the horizon and the feedback condition, we examined the brain activity when the stimuli were presented on the first choice (“wait” period; Fig 1D). Crucially, there was no difference in the visual inputs between the partial and the complete feedback conditions, as the nature of the feedback was not cued and was fixed for blocks of sessions, and monkeys only received the counterfactual feedback after an active choice (not in the observation phase). We first investigated the main effects of our two manipulations: the overall effect of the horizon and feedback type on brain activity.

We combined all sessions and looked for evidence of different activations in the long and short horizons. We found significantly greater activity for the long horizon in 3 clusters (cluster p < 0.05, cluster-forming threshold of z > 2.3; Fig 4A, see S1 Table for coordinates of cluster peaks). One cluster was centered on the pregenual anterior cingulate cortex (pgACC) and the striatum, and two clusters, one in each hemisphere, were centered on the dlPFC and extended into the lateral orbitofrontal cortex (lOFC, area 47/12o; see Materials and methods for more details about OFC subdivisions). In an independent test, we placed ROIs by calculating the functional and anatomical overlap for Brodmann areas 24, 46, and 47/12o and extracted the t-statistics of the regressor to examine the effect of the contingency between choice and information (feedback condition). We observed no effect of the feedback type in the ACC (p = 0.19) and lOFC (p = 0.53), but we found a main effect of feedback type in the dlPFC (two-way ANOVA, F(144, 147) = 4.86, p = 0.029) and no interaction anywhere (ACC: p = 0.29, dlPFC: p = 0.9, and lOFC: p = 0.78). This revealed that a subpart of the pgACC and the lOFC were sensitive to the horizon length, while the dlPFC showed an additive sensitivity to the horizon length and the feedback type, such that it was most activated in the long horizon and partial feedback condition, when exploration is beneficial.

Fig 4. First choice neural results.


(A) When combining partial and complete feedback sessions, we found clusters with greater activity in the long horizon than in the short horizon in the pgACC, the dlPFC, and the lateral OFC. Cluster p < 0.05, cluster-forming threshold of z > 2.3. (B) We placed ROIs (in yellow) in the overlap of the functional cluster and anatomical region and extracted t-statistics for the difference between long horizon and short horizon. Mean across sessions (partial feedback: 40 sessions, complete feedback: 34 sessions). (C) We looked for differences in how the contingency between choice and information (complete vs. partial feedback) modulates the initial information that was presented before first choices. Within our VOI, we found clusters of activity in MCC both for the main effect of feedback type and a greater sensitivity to expected value in the complete feedback condition. We also found a cluster of activity in dlPFC for a greater sensitivity to expected value in the complete feedback condition. (D) We placed an ROI (in yellow) in the part of MCC that is activated by the main effect of feedback type and extracted the t-statistics of the regressor for every session. We found that the effect we observe in the VOI is driven by increased activity in the complete feedback condition, whereas there is no activity in the partial feedback condition. Mean across sessions (partial feedback: 40 sessions, complete feedback: 34 sessions). (E) We also placed ROIs (in yellow) in the parts of MCC and dlPFC where we found significant clusters in the VOI for the interaction of feedback type and expected value and extracted the t-statistics for the expected value regressor of every session. Plotting these regressors separately for feedback type reveals that both MCC and dlPFC were more active when an option with high expected value was chosen in the complete feedback condition, whereas they were more active when an option with low expected value was chosen in the partial feedback condition. Mean across sessions (partial feedback: 40 sessions, complete feedback: 34 sessions). Error bars represent standard error of the mean. *p < 0.05. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. dlPFC, dorsolateral prefrontal cortex; MCC, mid-cingulate cortex; OFC, orbitofrontal cortex; pgACC, pregenual anterior cingulate cortex; ROI, region of interest; VOI, volume of interest.

We next examined the effect of the feedback type in our VOI. We found one cluster around the MCC where activity differed significantly between the complete and partial feedback conditions during stimulus presentation on the first choice (Fig 4C, yellow contrast; see S1 Table for coordinates of cluster peaks). To examine this effect further, and although it is not an unbiased test, we defined an ROI by taking the overlap between our functionally defined cluster and Brodmann area 24′. Extracting the t-statistics of each session for the regressor from this ROI revealed that the MCC is more active at the time of choice in the complete feedback condition but not in the partial feedback condition (Fig 4D). We found no interaction between horizon length and feedback type in our VOI. Thus, a subpart of the MCC distinct from the one sensitive to the horizon length was sensitive to the type of feedback.

Behaviorally, we observed that strategic exploration was implemented by decreasing the influence of expected value on choices. We therefore next looked for evidence of stronger expected value signals in the complete feedback condition compared to the partial feedback condition. We tested the expected value of the chosen option, the unchosen option, and the difference in expected values between the chosen and unchosen options. We only found activity related to the expected value of the chosen option. We found two clusters of activity, bilaterally in the MCC (area 24′) and in the left dlPFC (area 46), that were modulated by the contingency between choice and information (Fig 4C; see S1 Table for coordinates of cluster peaks). We again placed two ROIs by calculating the functional and anatomical overlap for Brodmann areas 24′ and 46 and extracted the t-statistics of the regressor. Although this is not an unbiased test, we can see that in the partial feedback condition the MCC and dlPFC seemed to be active when an option with a low expected value was chosen, whereas in the complete feedback condition they were more active when an option with a high expected value was chosen (Fig 4E for illustration). We found, however, no difference in the strength of this sensitivity between short and long horizons. Thus, we found that the availability of the counterfactual feedback in the complete feedback condition decreased—and potentially even inverted—the sensitivity of the MCC and dlPFC to the expected value of the chosen option. We conducted additional exploratory brain–behavior correlations but found no significant relationships to behavioral sensitivity (see “Author Response” file within the Peer Review History tab for additional details).

Finally, we looked for signals related to the expected outcome of the chosen option that were common to both feedback conditions. Consistent with previous studies [32–35], when we combined the partial and complete feedback sessions and took all trials in the “wait” period, we found a large activation related to the expected value of the chosen option (which is the same as the chosen action in our task) spanning the motor cortex/somatosensory cortex, the dlPFC, the OFC, and the striatum, as well as an inverted signal in the visual areas at the whole-brain level (without mask; S6A Fig). We also found a clear representation of the uncertainty about the chosen option on the first choice (when the magnitude of the uncertainty about the chosen option is equivalent in the partial and complete feedback conditions, as no counterfactual feedback has yet been provided) in the right medial prefrontal cortex (24c and 9m), extending bilaterally into the frontal pole (10mr) (S6B Fig). We conducted additional exploratory brain–behavior correlations but found no significant relationships to behavioral sensitivity (see “Author Response” file within the Peer Review History tab for additional details).

Overall, we found that the pgACC and MCC reflected the horizon length and the type of feedback, respectively. The dlPFC was linearly modulated by both, with the strongest activation in the long horizon and partial feedback condition, when exploration is beneficial. Additionally, the feedback type modulated the effect of the chosen expected value on the activity of the MCC and the dlPFC, such that they were more active for low value choices only when obtaining information was contingent on choosing an option.

Chosen and counterfactual outcome prediction error signals in the OFC

We next examined brain activity when the outcome of the choice was revealed (“outcome” period in Fig 1D) and monkeys were updating their beliefs about the options. After the first choice, the sequences of events played out differently in the partial and complete feedback conditions; we therefore analyzed each feedback condition separately. At outcome, the partial feedback condition closely resembles previously reported results from fMRI studies in monkeys [27,29]. We looked for brain regions whose activity was modulated by the magnitude of the outcome prediction error, i.e., the difference between the outcome and the expectation (S7A Fig). Consistent with these studies, we found the expected clusters of activity in the medial prefrontal cortex and bilaterally in the motor cortex in our VOI (see S7 Fig for outcome-only–related activity). When we time-locked our search to the onset of the reward (1 s after the display of the outcome, in a different GLM), we also found the classic prediction error–related activity in the ventral striatum at the whole-brain level (S7C Fig).
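In standard notation (ours, not the paper’s), the two prediction errors examined in this and the following analyses are simply:

```latex
\delta^{\mathrm{chosen}}_t = r^{\mathrm{chosen}}_t - \hat{V}^{\mathrm{chosen}}_t,
\qquad
\delta^{\mathrm{unchosen}}_t = r^{\mathrm{unchosen}}_t - \hat{V}^{\mathrm{unchosen}}_t,
```

where $r_t$ is the outcome displayed on the screen and $\hat{V}_t$ is the ideal observer’s pre-outcome expected value; the counterfactual error $\delta^{\mathrm{unchosen}}_t$ is defined only in the complete feedback condition, where the unchosen outcome is shown.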

We then turned to the complete feedback condition, in which the outcomes of the chosen and the unchosen option were presented simultaneously after the first active choice, in order to examine the neural substrates involved in learning from counterfactual feedback and the extent to which they overlap with learning from chosen feedback. We looked in our VOI for brain regions whose activity was modulated by the prediction error for the chosen option and for the unchosen option. We found a cluster of activity around the lOFC (area 47/12o) that was negatively modulated by the prediction error for the chosen option and a cluster of activity around the medial orbitofrontal cortex (mOFC, area 14) that was negatively modulated by the prediction error for the unchosen option (Fig 5A; see S1 Table for coordinates of cluster peaks). These clusters intersected in the central part of the OFC (cOFC, area 13). Prediction error activity should show effects of both outcome and expectation, with opposite signs. To independently test whether the observed effects were prediction errors, rather than being driven by the outcome or the expectation alone, we extracted the t-statistics for both outcome and expectation in ROIs defined by their outcome-related activity only and looked for a modulation by the expectations (S8 Fig). Again, we defined ROIs based on the overlap between the functional modulation by the magnitude of the chosen outcome and anatomy. For the chosen outcome, we found that the lOFC did not show a significant positive modulation by the expectation of the chosen outcome (p = 0.1083) (Fig 5C). We found that the somatosensory cortex (area 3) showed a strong positive chosen outcome signal as well as a positive modulation by the chosen expectation (t (33) = 2.5246, p = 0.017), and the ventrolateral prefrontal cortex (vlPFC, area 45) showed no sensitivity to the chosen expectation (p = 0.95) (S6B Fig). Using the same procedure with the unchosen outcome, we found that the cOFC showed a positive modulation by the expectation about the unchosen outcome (t (33) = 2.2617, p = 0.0304), as well as a negative modulation by the chosen outcome (t (33) = −2.8761, p = 0.007) and a positive modulation by the expectation about the chosen outcome (t (33) = 2.5560, p = 0.0154). We found a similar pattern in the mOFC (unchosen expectation: t (33) = 2.5130, p = 0.017; chosen outcome: t (33) = −2.2455, p = 0.0316; chosen expectation: t (33) = 2.8729, p = 0.0071). The ventromedial prefrontal cortex (area 10m according to the atlas we used [36], but it has also been called 14m [37]) showed a negative modulation of its activity by both the unchosen and the chosen (t (33) = −3.079, p = 0.004) outcomes but no sensitivity to the expectations. Overall, we found that the cOFC and mOFC both showed prediction error–related activity for both the chosen and the unchosen outcomes, with the same sign.

Fig 5. Prediction error neural results.


(A) In complete feedback sessions only, we found clusters with inverted prediction error activity in the central part of the OFC (area 13), extending into the lOFC (area 47/12o). We also found inverted prediction error activity in the cOFC (area 13) and mOFC (area 14) for the unchosen, counterfactual reward. (B) Brain–behavior correlational analysis between the prediction error signal in the mOFC (t-statistic) and the session-specific t-statistic of the behavioral effect of the change in expected value on choices (estimated with a separate GLM for each session). (C) We placed ROIs (in yellow) in the overlap of the functional cluster modulated by the magnitude of the chosen outcome and the anatomical region. We extracted t-statistics for reward and expectation, both for the chosen and the unchosen option. Prediction error activity should evoke both a reward and an expectation response with opposite signs. We did not find evidence for outcome expectation of the chosen option. Mean across complete feedback sessions (34 sessions). (D) When defining the ROIs (in yellow) according to the response to the magnitude of the unchosen outcome, we find evidence for a classic reward prediction error and a counterfactual prediction error about the unchosen option both in cOFC and mOFC: We observe activity related both to the obtained and the unobtained reward, and also activity related to both the chosen and unchosen outcome expectation. Mean across complete feedback sessions (34 sessions). Error bars represent standard error of the mean. *p < 0.05, **p < 0.01. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. cOFC, central OFC; GLM, general linear model; lOFC, lateral OFC; mOFC, medial OFC; OFC, orbitofrontal cortex; ROI, region of interest.

To test the OFC prediction error effects even further, we ran an exploratory correlational analysis between the prediction error signal (t-statistic) in the ROIs and the session-specific t-statistic of the behavioral effect of the change in expected value on choices (estimated with a separate GLM for each session, with the same regressors as in Fig 3E). We wanted to see whether the strength of the counterfactual outcome prediction error in the brain is predictive of how much an animal uses it in a particular session. Only in the mOFC (and not the cOFC) did we see the expected—albeit modest—correlation between increased negative counterfactual prediction error signals and increased behavioral impact of the counterfactual information (Fig 5B; β = −0.1701 ± 0.0971, t (32) = −1.7507, p = 0.0448, one-sided test).

Discussion

Weighing exploration to gather new information against exploitation of current knowledge is a key consideration for humans and animals alike. Inspired by recent work carefully dissociating value-driven exploration from a simple lack of exploitation [12], we designed the horizon task to examine the behavioral and neural correlates of goal-directed evaluation of strategic exploration in rhesus monkeys. While strategic, value-driven exploration is important to optimize behavior over time, it is equally important to be able to learn from observations related to choices not taken. In particular, being able to process counterfactual information during learning is key to restricting exploration to only those situations in which active sampling is necessary.

Strategic exploration as a reduction of the effect of expected value on choices

We know that monkeys can seek information before committing to a choice or to increase confidence about their decision [38–40]. However, here we showed that monkeys could identify situations in which a strategic exploratory choice would lead to gaining information that would be beneficial for future decisions. Indeed, their choices were least influenced by expected value in the long horizon partial feedback condition, which is when there should be a drive to explore. This suggests that monkeys had a representation of the significance of the information and used it to plan future actions. Our results demonstrate that they could discern both whether information will be useful in the future (greater exploration in the long horizon) and whether choosing an option is instrumental to getting information about it (greater exploration in the partial feedback condition).

Exploration during value-based decision-making has been conceived of in different ways in the past. A simple way to account for exploration is the “epsilon-greedy” strategy, in which a small fraction of choices is allocated to options other than the most rewarded one [1,3]. Along the same lines, another way to formalize exploratory choices is through the noise or (inverse) temperature in the softmax choice rule, which predicts that exploratory choices are more frequent when the expected values of the options are close [1,3–6]. This process is also called random exploration because relaxing the effect of expected value on choices allows for stumbling upon better options by chance [12,15]. This form of exploration is negatively correlated with accuracy. Therefore, without varying other features, such as the usefulness of information for the future and the contingency between choice and information, it is impossible to know whether monkeys made a mistake or were exploring a less rewarded option to obtain information about it.
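For concreteness, the two classic choice rules mentioned above can be sketched as follows (parameter values are illustrative, not fitted to the present data):

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(values, eps=0.1):
    """With probability eps pick at random; otherwise pick the best option."""
    if rng.random() < eps:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

def softmax_choice(values, beta=2.0):
    """Softmax choice rule: a lower inverse temperature beta gives noisier,
    more 'random' exploration; exploratory choices are most frequent when
    the option values are close together."""
    v = np.asarray(values, dtype=float)
    p = np.exp(beta * (v - v.max()))  # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(v), p=p))

print(epsilon_greedy([5.0, 4.0]), softmax_choice([5.0, 4.0]))
```

Note that in both rules the deviation from value maximization is pure noise, which is exactly why, without the horizon and feedback manipulations, an apparent exploratory choice cannot be distinguished from a lapse.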

Here, we show that monkeys, like humans, can perform sophisticated choices that take into account the prospective value of discovering new information about the options [41]. Our results reveal that foraging behaviors in macaques do not rely only on simple heuristics (e.g., win-stay/lose-shift) but are also based on strategic exploration. To some extent, this mirrors the anticipatory switch to exploitative behavior once enough information has been learned about the options, even when the expected outcome has not yet been obtained [42]. It also adds to recent work showing that complex sociocognitive processes thought to be uniquely human, such as mentalizing or recursive reasoning, can be identified in rhesus monkeys [43,44]. However, contrary to human behavior, our monkeys only adapted by reducing their reliance on expected value (i.e., exploitative value) on choices, which corresponds to the degree to which they use random exploration. Humans also increase their preference for the most uncertain option when exploration is useful for the future, which has been referred to as directed exploration [12]. Our results can also be compared to recent work showing that monkeys choose novel options until they have an estimate of their value compared to the other options [7–9], which can be interpreted as exploration for uncertainty reduction. Here, we found that monkeys did not seek to reduce overall uncertainty. This is perhaps because the number of reward samples was negatively correlated with uncertainty, and thus no option was novel in the sense that there was no information about it. Uncertainty-related behavior was modulated neither by the feedback type nor by the horizon, hinting that uncertainty (rather than novelty) was not a driver of strategic exploration. This suggests species specificity in exploratory strategy. In the future, a variation of our task could be used to test the effect of novelty on strategic exploration by offering zero information about one option.

Use of counterfactual feedback in subsequent choices

We next investigated whether and how the availability of the counterfactual feedback impacted monkeys’ subsequent choices in the long horizon. First, we found that having more information about the options in the complete feedback condition improved accuracy. In general, monkeys were more sensitive to the initial expected values of the options when there was no contingency between choice and information, in the complete feedback condition, but they then utilized the feedback about the chosen option to the same degree in both conditions. However, in the complete feedback condition, monkeys additionally used the counterfactual feedback about the unchosen option to update their preference. Our results confirm that rhesus macaques are sensitive not only to direct reinforcers (i.e., the reward they obtain) but also to counterfactual information [28,30,31,45]. Here, we clearly demonstrate that monkeys can learn directly from counterfactual feedback, via a prediction error. We also demonstrated that this learning is associated with OFC activity. Identifying that an alternative action could have led to a better outcome and acting upon it has been shown to modulate OFC activity in rodents [46], suggesting that this ability was present in the last common ancestor of primates and rodents 100 million years ago.

The availability of counterfactual feedback also helped compensate for the repetition bias that monkeys displayed during the performance of the sequence. This form of engagement can also be considered in terms of the exploration/exploitation trade-off, where exploiting corresponds to staying with the current or default option: Monkeys committed to an option at the beginning of the trial and only changed option if there was sufficient evidence that it was worth it. Consistently, humans and animals show a tendency to overexploit compared to the optimal policy in various tasks [47–49]. In general, there seems to be a cost associated with switching from the ongoing representation or strategy to a new one [50,51]. In our task, switching options also requires some physical effort, as the monkeys are positioned in the sphinx position in the scanner. Additional information about the options, particularly about the alternative option, seems to encourage the reevaluation of the default strategy of persevering with the current option, enhance behavioral flexibility, and increase the willingness to bear the cost associated with the physical resetting required by switching targets.

Strategic exploration signals in ACC/MCC and dlPFC

Using fMRI, we investigated the neural correlates of assessing the possibility of using the information collected during a choice in the future, manipulated through horizon length, as well as of assessing the contingency between choice and information, manipulated through the availability of the counterfactual outcome. Modulation of activity associated with exploratory behavior in an uncertain environment has been recorded in humans and monkeys in both the ACC and the MCC [42,52–54], but here we found interesting anatomical distinctions. We found that the pgACC was more active when the information could be used in the future, in the long horizon. In humans, pgACC activity has been shown to scale with the use of prospective value to guide choices [41]. Thus, the pgACC might be critical for organizing behavior in the long run, beyond the immediate choice. The activity of a separate anatomical region, the MCC, was modulated by the feedback type. The MCC has been shown to encode the decision to obtain information about the state of the world [55] and to integrate information about the feedback to adapt behavior [56]. Here, we show that activity in the MCC was modulated prospectively by the feedback type. This activation was greater when more information was going to be provided, i.e., in the complete feedback condition. Thus, the MCC could be involved in anticipating more learning or in regulating exploration prospectively based on the feedback that will be received. Critically, the dlPFC displayed an additive effect of the usefulness of exploration for the future and the contingency between choice and information. It was most active when both held and exploration was sensible. Moreover, in the complete feedback condition, the MCC and the dlPFC were more active when the expected value of the chosen option was high. Such modulation is in line with studies in monkeys and humans showing that neuronal activity in the MCC and the dlPFC correlates with action values [57–60]. When the unchosen outcome was not available, the MCC and dlPFC were more active when the expected value of the chosen option was low, which is consistent with the pursuit of an exploration strategy. Overall, the coordinated roles of the ACC and MCC contribute to the regulation of exploratory/exploitative behaviors, not only in rhesus macaques but also in humans [61].

Computational modelling of the activity in ACC/MCC and dlPFC suggests that ACC/MCC could regulate decision variables in the dlPFC based on the strategic assessment for exploration [62]. Noradrenaline has been shown to modulate the noise in the decision process that could fuel random exploration, potentially providing a mechanism for the modulation of exploratory activity [13,49,63,64]. Importantly, ACC/MCC also interacts more generally with other frontal lobe regions as well as with monoamine systems, in particular the noradrenergic system, making the latter a feasible mechanism for changing exploratory behaviors [65]. Specifically, a network consisting of the MCC, the dlPFC, and, potentially, the locus coeruleus could support the relaxation of the effect of expected value on choices based on the context. Altogether, these results illustrate how ACC/MCC and dlPFC might dynamically switch modes to pursue different goals depending on task demands [42,52,54]. Future studies will aim to test whether mode switching depends on noradrenergic inputs and what causal role both regions play in moving into and out of strategic exploration.

Update signals for chosen and counterfactual outcomes in OFC

Being able to process counterfactual information during learning is key to restricting costly exploration to only those situations in which active sampling is necessary. Doing so requires an ability to process abstract information and learn from it similarly to experienced outcomes, without confusing the two, which our monkeys achieved. Neurally, we found classic activations in the partial feedback condition in response to the magnitude of the outcome and the prediction error of the chosen option. At reward delivery, we observed a prediction error signal in the ventral striatum, as previously reported in neurophysiological studies [66]. We also observed prediction error activity at outcome in the MCC, as shown previously in neurophysiological recordings [67]. We further found that the prediction error for the chosen outcome modulated the activity of the OFC, but closer examination showed that the lOFC was mostly sensitive to the chosen outcome. Previous studies have shown lOFC involvement in learning and using choice–outcome associations to guide behavior [68–70], and causal studies have demonstrated its role in credit assignment [71–74]. Here, in the presence of two outcomes (two stimuli), OFC could be crucial for integrating the information specifically related to the chosen option. We also revealed that the cOFC and mOFC carried clear chosen prediction error signals.

However, our results go beyond chosen prediction error signals and add two dimensions to our understanding of the neural processing of counterfactual information during exploration and learning. Firstly, we were able to map out, for the first time in monkeys, counterfactual prediction error signals in the cOFC and mOFC. Importantly, by using fMRI, we could establish their specificity within the prefrontal cortex. In particular, we found signals for both the counterfactual and the chosen outcomes, but not for the expectations, in the vmPFC (10m/14m). This adds to previous reports of modulation of activity by counterfactual outcomes in gambling tasks in macaque lateral prefrontal cortex, MCC, and OFC [30,31]. Secondly, we found in the mOFC a relationship between the strength of the counterfactual prediction error signal and the extent to which the counterfactual outcomes influenced future choices (Fig 5). Encoding of the counterfactual outcome has also been observed in human mOFC [23,75,76], and lesions of the mOFC in patients have been associated with an inability to use counterfactual information to guide future decisions [77]. These results are compatible with the proposed broader role of the mOFC in representing abstract values [78,79]. Here, we show that it represents the comparison of the obtained counterfactual information with the expected counterfactual information. We found that the representations of the prediction errors for the chosen and unchosen outcomes had the same sign at the time of outcome, which leads us to postulate that this update mechanism is independent of the frame of the decision [23,79,80].

Having identified the orbitofrontal source of counterfactual prediction errors in macaques opens up further possibilities to directly interfere with the neural processes in each system and observe the effects on this complex adaptation of the animals' exploratory strategy. Furthermore, knowing how the brains of non-human primates might solve this complex sequential exploration task also sheds light on the behavioral and neural building blocks of reward exploration, learning, and credit assignment.

Conclusions

Here, we showed that monkeys are able to assess both the contingency between choice and information and the utility of information for the future when making strategic exploratory decisions. Different subparts of the ACC and MCC were related to the assessment of these variables for strategic exploration, and the dlPFC represented them both additively, such that it was most active when exploration was beneficial. Only when exploring was the only way to obtain information did MCC and dlPFC show increased activity with less exploitative choices. This suggests a role in suppressing expected value signals when value-guided exploration should be considered. Importantly, being able to process counterfactual information is key to limiting costly exploration to when it is necessary. We showed that monkeys can do this, potentially by representing chosen and unchosen reward prediction errors in the central and medial OFC. Furthermore, the strength of this signal in the mOFC correlated with the extent to which counterfactual outcomes influenced future decisions. Overall, our study shows how ACC/MCC-dlPFC and OFC circuits together might support exploitation of available information to the fullest and drive behavior towards finding more information when it is beneficial.

Materials and methods

Ethics statement

Animals were kept on a 12-h light–dark cycle, with access to water for 12 to 16 h on testing days and with ad libitum water otherwise. Feeding, social housing, and environmental enrichment followed guidelines of the Biomedical Sciences Services of the University of Oxford. All procedures were conducted under licenses from the United Kingdom (UK) Home Office in accordance with The Animals (Scientific Procedures) Act 1986 and the European Union guidelines (EU Directive 2010/63/EU).

Monkeys and task

Three male rhesus monkeys were involved in the experiment (Monkey M: 14 kg, 7 years old, monkey S: 12 kg, 7 years old, and monkey E: 11 kg, 7 years old). During the task, monkeys sat in the sphinx position in a primate chair (Rogue Research, Petaluma, CA) in a 3T clinical horizontal bore MRI scanner. They faced an MRI-compatible screen (MRC, Cambridge) placed 30 cm in front of the animal. Visual stimuli were projected on the screen by an LX400 projector (Christie Digital Systems). Monkeys were surgically implanted under anesthesia with an MRI-compatible cranial implant (Rogue Research) in order to prevent head movements during data acquisition. Two custom-built infrared sensors were placed in front of their left and right hands that corresponded to the stimuli on the screen. Blackcurrant juice rewards were delivered from a tube positioned between the monkey’s lips. The behavioral paradigm was controlled using Presentation software (Neurobehavioral Systems, CA, USA).

The task consisted of making choices between two options by responding on either the left or right touch sensor to select the left or right stimulus, respectively. A trial consisted of a given number of choices (determined by the horizon length) between these two options (Fig 1A). Each option corresponded to one side for the entire trial (Fig 1B). After each choice, monkeys received a reward associated with the chosen option (Fig 1E). The reward was between 0 and 10 drops (0.5 mL of juice per drop) and was sampled from a Gaussian distribution with a standard deviation of 1.5 and a mean between 3 and 7. The means of the underlying distributions were different for the two options and remained the same during a trial, such that one option was always better than the other. After each choice, monkeys also received visual feedback about the reward (Fig 1E). This feedback took the form of an orange rectangle displayed in a yellow rectangular window, such that the wider the orange rectangle, the greater the amount of juice (Fig 1B). It remained on the screen for the remainder of the trial.
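For concreteness, here is a minimal Python sketch of this reward scheme. Drawing the option means uniformly within [3, 7] and rounding/clipping outcomes to whole drops in [0, 10] are our assumptions; the text states only the bounds and the Gaussian parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_option_means(lo=3.0, hi=7.0):
    """Underlying means of the two options, fixed for the whole trial.
    Uniform sampling within [3, 7] is our assumption."""
    return rng.uniform(lo, hi, size=2)

def sample_reward(mean, sd=1.5):
    """One outcome in drops: Gaussian around the option mean. Rounding and
    clipping to 0-10 drops are our assumptions; the text states the bounds."""
    return int(np.clip(np.round(rng.normal(mean, sd)), 0, 10))

means = sample_option_means()           # one mean per option, constant over the trial
reward_drops = sample_reward(means[0])  # outcome of choosing option 0
juice_ml = 0.5 * reward_drops           # 0.5 mL of juice per drop
```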

At the beginning of each trial, prior to making their first choice, monkeys received 4 informative observations in total, each consisting of information about the reward they would have received had they chosen that option (Fig 1B). These were displayed in the same manner as reward feedback and also remained on screen for the duration of the trial. For each informative observation, a non-informative observation was presented for the other option (Fig 1B). The non-informative observation was a white rectangle crossed by black diagonals. Half of the trials started with an equal amount of information about the two options (2 informative and 2 non-informative observations for each option), and the other half with an unequal amount (3 informative and 1 non-informative observations). The order and side were randomly determined.

A critical parameter was the number of choices in each trial (horizon length). In short horizon trials, monkeys were only allowed 1 choice before a new trial with new stimuli started, whereas in long horizon trials, they were allowed to make 4 choices between the options. Horizon conditions were blocked (5 consecutive trials of equal horizons) and alternated in the session. A second key manipulation was whether feedback was received only for the option they chose (partial feedback condition) or whether they received information about both the reward they received for the chosen option and the reward they would have received for selecting the alternative option (complete feedback condition) (Fig 1E).

A trial would proceed as follows (Fig 1E; timings in Table 1): After an inter-trial interval during which the screen was black, the stimuli were displayed, consisting of a large grey rectangle and the 4 horizontal bars of feedback information (Fig 1B and 1E). The length of the grey rectangle corresponded to the length of the horizon, with each line corresponding to a choice, simulated or actual. Informative or non-informative stimuli were displayed on the first four lines. After the display of the stimuli, a red dot at the center of the screen disappeared, and monkeys were then allowed to choose between the two options by touching the corresponding sensor (in less than 5,000 ms, or the trial restarted). A red rectangular frame appeared around the line on the side of the chosen option. After a delay, the outcome—the reward feedback—was displayed inside the rectangle. In the complete feedback condition only, the reward that would have been gained on the other side (informative stimulus) was also displayed at the same moment. After an additional delay, a white star appeared on the screen, and the reward was delivered. After the end of the reward delivery, the star disappeared. In short horizon blocks, a new trial started after the inter-trial interval delay. In long horizon trials, the red dot appeared and then monkeys could choose among the options. The events leading to the reward were similar to those for the first choice, but the delays were shorter. At the end of the fourth choice, a new trial started. The feedback condition monkeys were in was not explicitly cued but instead fixed both within and across several sessions (6 to 10 consecutive sessions). Sessions after a switch from one feedback condition to the other were included in the analysis since it took only one choice for monkeys to identify the feedback condition.

Table 1. Timings.

| Choice | ITI | Go delay | Outcome delay | Reward delay | ICI |
| --- | --- | --- | --- | --- | --- |
| First | 4,000 ± 1,000 ms | 1,500 ± 500 ms (1,500 ± 200 ms for monkey E) | 3,500 ± 500 ms | 1,000 ms | 2,500 ± 500 ms |
| Second to fourth | N/A | 1,500 ± 500 ms | 1,500 ± 500 ms | | |

ICI, Inter-choice interval; ITI, Inter-trial interval.

Monkey M performed 14 sessions in the partial feedback condition and 13 (2 were corrupted and unrecoverable for fMRI analysis) in the complete feedback condition; monkey S performed 13 and 12 (3 corrupted sessions) sessions in each condition, respectively; and monkey E performed 14 and 15 (1 corrupted session) sessions in each condition. Sessions with fewer than 50 trials completed for the horizon task or with more than 80% bias toward one side were removed from the analyses.
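For illustration, these inclusion criteria could be applied per session as follows; the trial table and its `side` column are hypothetical stand-ins for the actual data format.

```python
import pandas as pd

def keep_session(trials: pd.DataFrame) -> bool:
    """Apply the stated inclusion criteria to one session's trials.
    Assumes a 'side' column with values 'left'/'right' (hypothetical format)."""
    if len(trials) < 50:                      # fewer than 50 completed trials
        return False
    p_right = (trials["side"] == "right").mean()
    return max(p_right, 1 - p_right) <= 0.80  # no more than 80% bias to one side
```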

Training

All monkeys went through the same training procedure, which lasted several months in a testing room mimicking the actual scanner room: First, they learned the meaning of the informative observation stimuli by choosing between a rewarded (1 to 10 drops) and a non-rewarded (0 drops) observation stimulus and then between differently rewarded (0 to 10 drops) observation stimuli. They next learnt to associate an option with an expected value by choosing between a non-rewarded option (0 drops) and a rewarded option, and then between rewarded options in the long horizon and partial feedback condition. We then introduced blocks of short and long horizon trials. Monkeys were then tested in the scanner room. They all had previous experience of awake behaving testing in the scanner. We discarded the first scanning session and then analyzed the following ones if they met our inclusion criteria in terms of number of trials and spatial bias. Monkeys M and S were introduced to the complete feedback condition during the training procedure; monkey E experienced it for the first time during testing.

Bayesian expectation model

We analyzed the behavior using an ideal Bayesian model, which estimated the most likely next outcome given the previous observations about the options. Outcomes were randomly drawn from a distribution of mean μ and fixed standard deviation. P(x|μ) is the probability that an outcome x would be observed given that it came from a distribution of mean μ. Since outcomes were independently drawn from a distribution of mean μ, the probability of observing a set of outcomes {x1…xn} was:

$$P(\{x_1, \ldots, x_n\} \mid \mu) = \prod_{i=1}^{n} P(x_i \mid \mu) \tag{1}$$

Using Bayes’ rule, we computed the probability that this observation was generated by a distribution of mean μ:

$$P(\mu \mid \{x_1, \ldots, x_n\}) = \frac{P(\{x_1, \ldots, x_n\} \mid \mu)\,P(\mu)}{P(\{x_1, \ldots, x_n\})} = \frac{\prod_{i=1}^{n} P(x_i \mid \mu)}{\sum_{j=1}^{N} \prod_{i=1}^{n} P(x_i \mid \mu_j)} \tag{2}$$

For each observation, we computed the probability of a new observation:

$$P(x_{n+1} \mid \{x_1, \ldots, x_n\}) = \sum_{j=1}^{N} P(x_{n+1} \mid \mu_j)\,P(\mu_j \mid \{x_1, \ldots, x_n\}) \tag{3}$$

Thus, we can compute the probability distribution of the future outcomes given a set of observations (Fig 1C shows how the distributions change with new observations).

In our model, the expected value (EV) of an option is the mean of the predictive distribution of future outcomes given the set of observations. The uncertainty (U) about what the next outcome will be is represented by the variance of that distribution. In general, the more informative observations the subject has access to for an option, the closer the expected value is to the actual mean of the underlying distribution and the smaller the uncertainty about this quantity. The weight of the expected value controls a specific form of exploration: reward-based exploration. Reducing this parameter allows exploring options by relaxing the tendency to choose the most rewarded option.
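As an illustration of Eqs 1–3 and of how EV and U are read out, here is a minimal Python sketch using a discrete grid of candidate means under a uniform prior; the grid range and resolution are our choices, not specified in the paper.

```python
import numpy as np
from scipy.stats import norm

SD = 1.5                            # fixed outcome standard deviation
mu_grid = np.linspace(0, 10, 101)   # candidate means mu_j (grid is our choice)

def posterior_over_means(observations):
    """Eq 2 with a uniform prior: P(mu_j | x_1..x_n) on the discrete grid."""
    likelihood = np.prod([norm.pdf(x, mu_grid, SD) for x in observations], axis=0)
    return likelihood / likelihood.sum()

def ev_and_uncertainty(observations):
    """EV = mean of the predictive distribution of the next outcome (Eq 3);
    U = its variance (outcome noise plus uncertainty about mu)."""
    post = posterior_over_means(observations)
    ev = np.sum(mu_grid * post)
    u = SD**2 + np.sum(post * (mu_grid - ev) ** 2)
    return ev, u

ev, u = ev_and_uncertainty([4, 6, 5])  # e.g., three informative observations
```

With more observations, the posterior over means sharpens, so EV approaches the true mean and U shrinks toward the irreducible outcome variance, exactly the qualitative behavior described above.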

Choice model fit

We first focused our analysis on the first choice of the trial because it was similar in terms of information content (4 informative observations) across horizon lengths and feedback conditions. Unlike on subsequent choices in the long horizon, the expected value and the uncertainty about the expected value (which decreases with the number of informative observations, from 1 to 3) associated with each option were uncorrelated on the first choice. Indeed, in the partial feedback condition, if the option with the higher expected value is chosen more often, the uncertainty about its expected value specifically decreases, inducing a correlation between expected value and uncertainty. For these first trials t, we model the probability of picking the option presented on the right side of the screen as

$$P(\mathrm{right}_t) = \sigma\!\left(b_{SB} + b_{RB}\,RB_t + b_{horizon}\,horizon_t + b_{EV}\,EV_t + b_{U}\,U_t + b_{EV \times horizon}\,EV_t\,horizon_t + b_{U \times horizon}\,U_t\,horizon_t\right) \tag{4}$$

using logistic regression. Here, σ is the sigmoid function, RB_t is a categorical predictor that controls for a repetition bias, EV_t and U_t denote the differences between the expected value/uncertainty of the options on the right and left side of the screen, horizon_t is a categorical predictor for whether trial t is a short or long horizon trial, and b_SB is a constant capturing the side bias.
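To make the structure of Eq 4 concrete, here is a sketch that simulates first-choice data and fits the same logistic regression with statsmodels; the ±1 codings and simulated coefficients are illustrative assumptions, and the actual fits used the hierarchical Bayesian procedure described below.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

# Simulated first-choice data; names follow Eq 4. The +/-1 codings for the
# categorical predictors are illustrative assumptions.
df = pd.DataFrame({
    "rb": rng.choice([-1.0, 1.0], n),       # repetition bias RB_t
    "horizon": rng.choice([-1.0, 1.0], n),  # horizon_t (e.g., +1 short, -1 long)
    "d_ev": rng.normal(0, 1, n),            # EV_t: right-minus-left expected value
    "d_u": rng.normal(0, 1, n),             # U_t: right-minus-left uncertainty
})
df["ev_x_hor"] = df["d_ev"] * df["horizon"]
df["u_x_hor"] = df["d_u"] * df["horizon"]

# Generate choices from the model itself so the fit has a known ground truth
logit = 0.2 + 1.5 * df["d_ev"] + 0.3 * df["d_u"] - 0.4 * df["ev_x_hor"]
df["chose_right"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X = sm.add_constant(df[["rb", "horizon", "d_ev", "d_u", "ev_x_hor", "u_x_hor"]])
fit = sm.Logit(df["chose_right"], X).fit(disp=0)  # sigma is the logistic link
print(fit.params)  # the intercept ('const') plays the role of b_SB
```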

For the remaining trials (second, third, and fourth choice in the long horizon), we are interested in whether the animals change their behavior as new information becomes available. We model these trials as

$$P(\mathrm{right}_t) = \sigma\!\left(b_{SB} + b_{RB}\,RB_t + b_{\Delta Chosen}\,\Delta Chosen_t + b_{baselineEV}\,baselineEV_t + b_{\Delta EV_{chosen}}\,\Delta EV_{chosen,t} + b_{baselineU}\,baselineU_t + b_{\Delta U_{chosen}}\,\Delta U_{chosen,t} + b_{\Delta EV_{unchosen}}\,\Delta EV_{unchosen,t} + b_{\Delta U_{unchosen}}\,\Delta U_{unchosen,t}\right) \tag{5}$$

For this logistic regression, we used an additional bias, ΔChosen, which corresponds to the number of times the option on the right was chosen during the trial. Here, baselineEV_t and baselineU_t are the differences between the expected value/uncertainty of the right and the left option at the first trial within the horizon. As such, these regressors capture the impact that the initial information displayed on screen has on subsequent choices. ΔEV_chosen,t and ΔU_chosen,t capture the difference between the initial baseline and the information presented at the current trial based on the choices the animal has made; i.e., these regressors capture the update of outcome expectation and uncertainty between the right and the left option, relative to the first choice, based on the rewards the animals subsequently experienced.

In our complete feedback condition, the animals can also learn about the reward they would have gotten had they chosen the other option. This is not captured by ΔEV_chosen,t and ΔU_chosen,t, as these regressors only take the experienced (i.e., obtained) reward into account. To see how the unobtained reward affects choices, we included the regressors ΔEV_unchosen,t and ΔU_unchosen,t. These regressors are computed as the difference between the full outcome expectation and uncertainty (based on both the obtained and unobtained reward) and the outcome expectation and uncertainty based on the obtained reward only. Just as with ΔEV_chosen,t and ΔU_chosen,t, these regressors are likewise constructed as the difference between the right and left option, with the baseline subtracted.

To fit these models, we used STAN (https://mc-stan.org) and brms with the default priors [81,82]. For each model, we ran 12 chains, each with 1,000 iterations after a warm-up of 1,000 samples. We allowed all regressors to vary by condition (partial and complete) and animal (3 animals) as fixed effects. We modelled testing sessions as random effects with different Gaussians for each animal; i.e., for each regressor and each animal, we estimated the Gaussian distribution from which session-level regressors are most likely drawn. Group-level estimates of the coefficients were obtained by averaging across animals and/or conditions. To determine statistical significance, we computed the proportion of posterior samples greater/smaller than 0.
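The models themselves were fit with STAN/brms in R; purely to illustrate the significance computation, a one-sided posterior p-value can be obtained from the draws as follows (hypothetical samples shown).

```python
import numpy as np

def one_sided_p(samples):
    """Fraction of posterior draws on the smaller side of zero."""
    samples = np.asarray(samples)
    return min(np.mean(samples > 0), np.mean(samples < 0))

# e.g., 12 chains x 1,000 post-warmup iterations of one group-level coefficient
draws = np.random.default_rng(2).normal(loc=0.3, scale=0.15, size=12_000)
print(one_sided_p(draws))  # small values indicate a coefficient reliably != 0
```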

MRI data acquisition and pre-processing

Imaging data were acquired using a 3T clinical MRI scanner and an 8-cm-diameter four-channel phased-array receiver coil in conjunction with a radial transmission coil (Windmiller Kolster Scientific, Fresno, CA). Structural images were collected under general anesthesia, using a T1-weighted MP-RAGE sequence (resolution = 0.5 × 0.5 × 0.5 mm, repetition time (TR) = 2.05 s, echo time (TE) = 4.04 ms, inversion time (TI) = 1.1 s, flip angle = 8°). Three structural images per subject were averaged. Intramuscular injection of 10 mg/kg ketamine, 0.125 to 0.25 mg/kg xylazine, and 0.1 mg/kg midazolam were used to induce anesthesia. fMRI data were collected while the subjects performed the task with a gradient-echo T2* echo planar imaging (EPI) sequence (resolution = 1.5 × 1.5 × 1.5 mm, interleaved slice acquisition, TR = 2.28 s, TE = 30 ms, flip angle = 90°). To help image reconstruction, a proton density–weighted image was acquired using a gradient-refocused echo (GRE) sequence (resolution = 1.5 × 1.5 × 1.5 mm, TR = 10 ms, TE = 2.52 ms, flip angle = 25°) at the end of the session.

fMRI data were pre-processed according to a dedicated non-human primate fMRI pre-processing pipeline [29,83,84] combining FSL, Advanced Normalization Tools (ANTs), and Magnetic Resonance Comparative Anatomy Toolbox (MrCat; https://github.com/neuroecology/MrCat) tools. In brief, T2* EPI data were reconstructed using an offline SENSE algorithm (Offline_SENSE GUI, Windmiller Kolster Scientific, Fresno, CA). Time-varying spatial distortions due to body movement were corrected by non-linear registration (restricted to the phase encoding direction) of each slice and each volume of the time series to a reference low noise EPI image for each session. The distortion corrected and aligned session-wise images were first registered to the animal structural image and then to a group-specific template in CARET macaque F99 space. Finally, the functional images were temporally filtered (high-pass temporal filtering, 3-dB cutoff of 100 s) and spatially smoothed (Gaussian spatial smoothing, full-width half maximum of 3 mm).
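For orientation, here is a rough Python sketch of the last two steps only; the actual pipeline used FSL's tools, whose high-pass filter is a Gaussian-weighted running-line smoother rather than the simple trend subtraction shown here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_filter1d

TR, VOXEL_MM, FWHM_MM = 2.28, 1.5, 3.0

def smooth_volume(vol):
    """Spatial smoothing: Gaussian with 3 mm FWHM, converted to sigma in voxels
    via FWHM = 2*sqrt(2*ln 2)*sigma."""
    sigma_vox = FWHM_MM / (2 * np.sqrt(2 * np.log(2))) / VOXEL_MM
    return gaussian_filter(vol, sigma=sigma_vox)

def highpass(ts, cutoff_s=100.0):
    """Crude stand-in for FSL's high-pass filter: subtract a slow Gaussian
    trend (sigma = cutoff/2, expressed in volumes) and keep the mean."""
    sigma_vols = (cutoff_s / 2.0) / TR
    trend = gaussian_filter1d(ts, sigma_vols, axis=-1)
    return ts - trend + ts.mean(axis=-1, keepdims=True)
```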

fMRI analysis

We conducted our fMRI analysis using a hierarchical GLM in FSL. Specifically, we first fitted each individual session (in session space) using FSL's fMRI Expert Analysis Toolbox (FEAT). We then warped the session-level whole-brain maps into F99 standard space, before combining them using FEAT's FLAME 1 + 2 random effects procedure. Here, we used contrasts to obtain separate estimates for the partial and complete sessions, their difference, and their average. To determine statistical significance, we used a cluster-based approach with standard thresholding criteria of z > 2.3 and p < 0.05. To increase power, we ran this cluster correction only in an a priori mask of the frontal cortex that was previously used in Grohn and colleagues [29].

On the session level, we included 58 regressors for the partial feedback sessions and 73 (including the same 58 regressors as in the partial feedback condition) for the complete feedback sessions. On top of these regressors, we also included nuisance regressors that indexed head motion and volumes with excessive noise. All regressors were convolved with an HRF modelled as a gamma function (mean lag = 3 s, standard deviation = 1.5 s) convolved with a 1-s boxcar function.
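For illustration, a minimal Python sketch of such a regressor: a gamma-shaped HRF with a 3-s mean lag and 1.5-s standard deviation (shape and scale derived from those two moments), applied to 1-s boxcars. The time grid, event onsets, and parametric values are hypothetical.

```python
import numpy as np
from scipy.stats import gamma

DT = 0.1                  # time grid resolution in s (our choice)
t = np.arange(0, 20, DT)

# Gamma HRF with mean lag m = 3 s and SD s = 1.5 s: shape = (m/s)^2, scale = s^2/m
m, s = 3.0, 1.5
hrf = gamma.pdf(t, a=(m / s) ** 2, scale=s**2 / m)
hrf /= hrf.sum()

def make_regressor(onsets, values, duration=1.0, total_s=600.0):
    """1-s boxcars at (hypothetical) event onsets, scaled by a parametric
    value, then convolved with the gamma HRF."""
    n = int(total_s / DT)
    stick = np.zeros(n)
    for onset, v in zip(onsets, values):
        i = int(onset / DT)
        stick[i:i + int(duration / DT)] += v
    return np.convolve(stick, hrf)[:n]

reg = make_regressor([10.0, 35.5, 70.2], values=[4.0, 7.0, 2.0])
```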

The two main periods of the task we were interested in were when the stimuli first appeared on screen and when the outcome appeared on subsequent choices in the long horizon trials. At stimulus onset, we included a constant, regressors for the expected value of the chosen and unchosen options, and regressors for the uncertainty of the chosen and unchosen option. To allow us to examine the effects of these five regressors on first choices in short and long horizons and on subsequent choices within the long horizon, we split up each regressor by horizon and choice number (first choice short horizon, first choice long horizon, second choice long horizon, third choice long horizon, and fourth choice long horizon) for a total of 25 regressors. At outcome, we included another constant, the expected value of the chosen and unchosen options, the reward obtained on this trial, the absolute value of the prediction error on this trial (|reward − expected value|), and the update in uncertainty on this trial. Again, all of these regressors were split up by horizon and choice number, for a total of 30 regressors at outcome. On top of these regressors of interest, we also included 3 control regressors: the log response time at stimulus onset, a constant at decision, and the response side (left or right) at decision. In the complete feedback condition, we included additional regressors: At outcome, we added regressors for the reward of the unchosen option, the absolute prediction error for the unchosen option, and the update in uncertainty for the unchosen option. Splitting these regressors up by horizon and choice within a horizon yields an additional 15 regressors.

Having split up all regressors this way by choice horizon and number of choices within a horizon, we used planned contrasts to combine them again to answer our questions of interest. At stimulus onset, we were only interested in first choices, as this allowed us to compare whether the animals represented expected value and uncertainty differently depending on condition (partial or complete feedback) and/or choice horizon (long and short). We thus constructed contrasts adding up and subtracting the first choices on long and short horizons for the constant, the expected value, and the uncertainty. At outcome, we were interested in reward effects and updates to the expected value of stimuli. As this should happen not just after first choices in a horizon but after all choices, we used contrasts to construct (weighted) averages of our regressors combining all choices within horizons. Moreover, to look at the effect of (signed) prediction errors, we used contrasts that subtract the expectation from the reward.
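As a schematic example of such planned contrasts, suppose one behavioral regressor has been split into five copies; contrast vectors over those copies might then look like this (the ordering convention is ours, for illustration only).

```python
import numpy as np

# One behavioral regressor (e.g., chosen EV) split into 5 copies, ordered as
# [short_1, long_1, long_2, long_3, long_4] (our bookkeeping convention)

c_first_sum = np.array([1, 1, 0, 0, 0]) / 2   # average over horizons, first choices
c_first_diff = np.array([1, -1, 0, 0, 0])     # short minus long, first choices
c_long_avg = np.array([0, 1, 1, 1, 1]) / 4    # average over all long-horizon choices

# Signed prediction error at outcome: reward minus expectation. With the design
# matrix ordered [reward copies..., expectation copies...], the contrast is:
c_signed_pe = np.concatenate([c_long_avg, -c_long_avg])
```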

To visualize the cluster-corrected effects we find in our mask of the frontal cortex, we use an atlas of the macaque brain [36] to identify the regions where we observe activity. We then create ROIs by calculating the overlap of the anatomical region according to the atlas (dilated with a kernel of 3 × 3 × 3 voxels) and the functional activation we found. By extracting the average t-statistics in this region, we are able to visualize the effects we found and also examine the individual components that contributed to the effects (e.g., the reward and outcome expectation for prediction errors).
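A minimal sketch of this ROI construction, assuming boolean atlas/cluster volumes and a t-statistic map as inputs (array-based stand-ins for the actual image files):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def roi_mean_tstat(atlas_region, functional_cluster, t_map):
    """ROI = overlap of the anatomical region (dilated with a 3x3x3 kernel)
    and the significant functional cluster; return the mean t-statistic."""
    dilated = binary_dilation(atlas_region, structure=np.ones((3, 3, 3)))
    roi = dilated & functional_cluster
    return t_map[roi].mean()

# Toy volumes standing in for atlas label, cluster mask, and t-map
atlas = np.zeros((10, 10, 10), bool); atlas[4:6, 4:6, 4:6] = True
cluster = np.zeros_like(atlas); cluster[5:8, 5:8, 5:8] = True
t_map = np.random.default_rng(3).normal(size=atlas.shape)
print(roi_mean_tstat(atlas, cluster, t_map))
```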

To best describe the localization of orbitofrontal activities, we considered 3 orbital subdivisions based on their respective position on the orbital surface. Lateral to the lateral orbitofrontal sulcus is the lateral OFC; medial to the medial orbitofrontal sulcus is the medial OFC. In between the two sulci is a region we referred to as the central OFC. Such parcellation resembles subdivisions considered in humans and rodents [71,85,86], although alternative labels have been proposed [70].

To best describe the localization of cingulate activities, we considered a dissociation between anterior and mid-cingulate subdivisions as proposed by Vogt and colleagues [87,88].

Supporting information

S1 Fig. Full fit of the model predicting choosing the right option on screen on first choices (shown in Fig 2D and described in detail in the Materials and methods section).

(A) Predictors are, from left to right: intercept (i.e., a side bias), repetition bias (RB), expected value difference between right and left according to our Bayesian model (EV), uncertainty difference between right and left according to our Bayesian model (U), horizon length (short horizon is positive, long horizon is negative), the interaction between horizon and expected value (horizonXER), and the interaction between horizon and uncertainty (horizonXU). The distributions are the posteriors of the parameter estimates, shown both for each monkey individually and averaged over animals. Fits from the partial feedback sessions are shown on the left, and from the complete feedback sessions on the right. (B) Data from the same fit as in (A) but now summed over both partial and complete feedback sessions. (C) Data from the same fit as in (A) but now showing the difference between partial and complete feedback sessions. (D) One-sided p-values for all parameters, computed as the number of samples of the posterior greater than 0. To compute the p-value for effects smaller than 0, the p-values in the table can be subtracted from 1. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

(TIFF)

S2 Fig. The same model as in S1 Fig but only fit to trials during which the available choices on screen were the same on each side (2 and 2).

All conventions are the same as in S1 Fig. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

(TIFF)

S3 Fig. Equivalency between our linear regression and the framing in terms of random and directed exploration.

(A) With the uncertainty regressor. We find that monkeys modulate their sensitivity to the expected value depending on the horizon and the feedback type; this is equivalent to modulating the random exploration parameter (the softmax noise), which is the inverse of the expected value regressor. However, we find no modulation of the effect of uncertainty by the horizon or the feedback type; this is equivalent to a constant directed exploration parameter (the uncertainty bonus), which is the uncertainty regressor divided by the expected value regressor. (B) Same as A but with the number of available informative observations rather than the uncertainty. Error bars indicate standard deviation. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

(TIFF)

S4 Fig. The same model as in S1 Fig but using the number of available informative observations on each side rather than the uncertainty.

All conventions are the same as in S1 Fig. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

(TIFF)

S5 Fig. Full fit of the model predicting choosing the right option on screen during subsequent choices in the long horizon (choices 2–4; shown in Fig 3E and described in detail in the Materials and methods section).

(A) Predictors are, from left to right: intercept (i.e., a side bias); repetition bias (RB); the change in expected value between the right and left option revealed by choices made during this horizon, compared to the initial expected value for this horizon, i.e., the baseline (deltaERchosen); the change in expected value between the right and left option revealed by feedback about the unchosen option, compared to the initial expected value for this horizon (deltaERcounterfactual); the difference in initial expected value between the right and left option available, i.e., the expected value difference at first choice (baselineER); the change in uncertainty between the right and left option revealed by choices made during this horizon, compared to the initial uncertainty for this horizon (deltaUchosen); the change in uncertainty between the right and left option revealed by feedback about the unchosen option, compared to the initial uncertainty for this horizon (deltaUcounterfactual); the difference in initial uncertainty between the right and left option available, i.e., the uncertainty difference at first choice (baselineU); and the difference between how often the right option has been chosen over the left option during this horizon (deltaChosen). All other conventions are the same as in S1 Fig, also for panels B–D. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

(TIFF)

S6 Fig

(A) Without a mask, and taking the activity before the choice in all trials (not just first-choice trials), we observed large activations related to the expected value of the chosen option (which is the same as the chosen action in our task) spanning the motor cortex/somatosensory cortex, the dlPFC, the OFC, and striatum, as well as an inverted signal in the visual areas (cluster p < 0.05, cluster-forming threshold of z > 2.3). (B) In the partial and complete feedback conditions in our VOI, and taking the activity before the first choice only, we found 1 cluster of activity related to the inverse of the magnitude of the uncertainty about the chosen option in the right medial prefrontal cortex (24c and 9m) that extended bilaterally into the frontal pole (10mr) (cluster p < 0.05, cluster-forming threshold of z > 2.3). Data to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. dlPFC, dorsolateral prefrontal cortex; OFC, orbitofrontal cortex; VOI, volume of interest.

(TIFF)

S7 Fig. Outcome prediction error and magnitude in the partial feedback condition.

(A) In the partial feedback condition and at the time of outcome, we found 3 clusters of activity that were positively modulated by the chosen option prediction error in the medial prefrontal cortex and bilaterally in the somatosensory and motor cortex in our VOI (Cluster p < 0.05, cluster forming threshold of z > 2.3). (B) We found the same 3 clusters when we looked for a positive modulation by the magnitude of the chosen outcome. We additionally found 1 cluster of activity in the right lateral OFC that was negatively modulated by the magnitude of the chosen outcome. (C) When we time-locked our search to the onset of the reward (1 s after the display of the outcome, with a different GLM), we found the same clusters as in A, as well as the classic prediction error related activity in the ventral striatum and a negative prediction error in visual areas (see full map at https://doi.org/10.5281/zenodo.7464572) at the whole brain level. Data to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. GLM, general linear model; OFC, orbitofrontal cortex; VOI, volume of interest.

(TIFF)

S8 Fig. Chosen and unchosen outcome magnitude in the complete feedback condition.

(A) In complete feedback sessions only, we found clusters for inverted chosen outcome magnitude activity in the right lOFC (47/12o) and bilaterally in the vlPFC and 2 clusters in the somatosensory/motor cortex [3]. (B) We found a cluster of activity for the inverted unchosen outcome magnitude in the cOFC and mOFC and the vlPFC. Data to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. cOFC, central orbitofrontal cortex; lOFC, lateral orbitofrontal cortex; mOFC, medial orbitofrontal cortex; vlPFC, ventrolateral prefrontal cortex.

(TIFF)

S1 Table. Tables showing the peaks of all significant clusters found within our frontal masks that are reported in the main text.

Coordinates are given in the F99 standard space.

(TIFF)

Acknowledgments

We thank Drs. Kevin Marche, Lea Roumazeilles, and Urs Schuffelgen, as well as Kelly Simpson for technical assistance during the data acquisition and the animal housing facility staff for their care of the animals. We thank the Motivation Brain and Behavior lab, the Rushworth lab, the Walton lab, Dr. Nadescha Trudel, and Dr. Vasilisa Skvortsova for insightful conversations that helped shape this manuscript.

Abbreviations

ACC/MCC: anterior and mid-cingulate cortex
ANTs: Advanced Normalization Tools
cOFC: central orbitofrontal cortex
dlPFC: dorsolateral prefrontal cortex
EPI: echo planar imaging
FEAT: fMRI Expert Analysis Toolbox
FLAME: FMRIB’s Local Analysis of Mixed Effects
FSL: FMRIB Software Library
GLM: general linear model
GRE: gradient-refocused echo
ICI: inter-choice interval
lOFC: lateral orbitofrontal cortex
ITI: inter-trial interval
mOFC: medial orbitofrontal cortex
MRC: MRI-compatible screen
MrCat: Magnetic Resonance Comparative Anatomy Toolbox
OFC: orbitofrontal cortex
pgACC: pregenual anterior cingulate cortex
ROI: region of interest
vlPFC: ventrolateral prefrontal cortex
VOI: volume of interest

Data Availability

The behavioral data, code to reproduce the figures shown in the manuscript and supplementary materials, and statistical fMRI maps are available at: https://doi.org/10.5281/zenodo.7464572.

Funding Statement

This research was supported by the Université Paris Descartes (doctoral and mobility grants to CIJ), the Medical Research Council UK (MR/K501256/1 and MR/N013468/1 to JG), St John’s College, Oxford (JG), the Wellcome Trust (096587/Z/11/Z to SC, 090051/Z/09/Z and 202831/Z/16/Z to MEW, WT1005651MA to JS and the Wellcome Centre for Integrative Neuroimaging: 203139/Z/16/Z), the BBSRC (AFL Fellowship: BB/R01803/1 to NK), as well as the LabEx CORTEX of the Université de Lyon (ANR-11-LABX-0042 to JS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Daw ND, O’Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006 Jun;441(7095):876–9. doi: 10.1038/nature04766
2. Ebitz RB, Albarran E, Moore T. Exploration Disrupts Choice-Predictive Signals and Alters Dynamics in Prefrontal Cortex. Neuron. 2018 Jan;97(2):450–461.e9. doi: 10.1016/j.neuron.2017.12.007
3. Watkins CJCH. Learning from delayed rewards. 1989.
4. Thompson WR. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika. 1933;25(3/4):285–94.
5. Bridle JS. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: Neurocomputing. Berlin, Heidelberg: Springer; 1990. p. 227–36. (NATO ASI Series). doi: 10.1007/978-3-642-76153-9_28
6. Payzan-LeNestour E, Bossaerts P. Risk, Unexpected Uncertainty, and Estimation Uncertainty: Bayesian Learning in Unstable Settings. PLoS Comput Biol. 2011 Jan 20;7(1):e1001048. doi: 10.1371/journal.pcbi.1001048
7. Costa VD, Mitz AR, Averbeck BB. Subcortical Substrates of Explore-Exploit Decisions in Primates. Neuron. 2019 Aug 7;103(3):533–545.e5. doi: 10.1016/j.neuron.2019.05.017
8. Costa VD, Averbeck BB. Primate Orbitofrontal Cortex Codes Information Relevant for Managing Explore–Exploit Tradeoffs. J Neurosci. 2020 Mar 18;40(12):2553–61. doi: 10.1523/JNEUROSCI.2355-19.2020
9. Hogeveen J, Mullins TS, Romero JD, Eversole E, Rogge-Obando K, Mayer AR, et al. The neurocomputational bases of explore-exploit decision-making. Neuron. 2022 Jun 1;110(11):1869–1879.e5. doi: 10.1016/j.neuron.2022.03.014
10. Badre D, Doll BB, Long NM, Frank MJ. Rostrolateral Prefrontal Cortex and Individual Differences in Uncertainty-Driven Exploration. Neuron. 2012 Feb 9;73(3):595–607. doi: 10.1016/j.neuron.2011.12.025
11. Cavanagh JF, Figueroa CM, Cohen MX, Frank MJ. Frontal Theta Reflects Uncertainty and Unexpectedness during Exploration and Exploitation. Cereb Cortex. 2012 Nov 1;22(11):2575–86. doi: 10.1093/cercor/bhr332
12. Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans Use Directed and Random Exploration to Solve the Explore–Exploit Dilemma. J Exp Psychol Gen. 2014 Dec;143(6):2074–81. doi: 10.1037/a0038199
13. Warren CM, Wilson RC, van der Wee NJ, Giltay EJ, van Noorden MS, Cohen JD, et al. The effect of atomoxetine on random and directed exploration in humans. PLoS ONE. 2017 Apr 26;12(4):e0176034. doi: 10.1371/journal.pone.0176034
14. Zajkowski WK, Kossut M, Wilson RC. A causal role for right frontopolar cortex in directed, but not random, exploration. eLife [Internet]. 2017 [cited 2018 Mar 19];6. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5628017/
15. Wilson RC, Bonawitz E, Costa VD, Ebitz RB. Balancing exploration and exploitation with information and randomization. Curr Opin Behav Sci. 2021 Apr 1;38:49–56. doi: 10.1016/j.cobeha.2020.10.001
16. Friedrich P, Forkel SJ, Amiez C, Balsters JH, Coulon O, Fan L, et al. Imaging evolution of the primate brain: the next frontier? Neuroimage. 2021 Mar 1;228:117685. doi: 10.1016/j.neuroimage.2020.117685
17. Neubert FX, Mars RB, Sallet J, Rushworth MFS. Connectivity reveals relationship of brain areas for reward-guided learning and decision making in human and monkey frontal cortex. PNAS. 2015 May 19;112(20):E2695–704. doi: 10.1073/pnas.1410767112
18. Sallet J, Mars RB, Noonan MP, Neubert FX, Jbabdi S, O’Reilly JX, et al. The Organization of Dorsal Frontal Cortex in Humans and Macaques. J Neurosci. 2013 Jul 24;33(30):12255–74. doi: 10.1523/JNEUROSCI.5108-12.2013
19. Findling C, Skvortsova V, Dromnelle R, Palminteri S, Wyart V. Computational noise in reward-guided learning drives behavioral variability in volatile environments. Nat Neurosci. 2019 Dec;22(12):2066–77. doi: 10.1038/s41593-019-0518-9
20. Palminteri S, Khamassi M, Joffily M, Coricelli G. Contextual modulation of value signals in reward and punishment learning. Nat Commun. 2015 Aug 25;6(1):8096. doi: 10.1038/ncomms9096
21. Palminteri S, Lefebvre G, Kilford EJ, Blakemore SJ. Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing. PLoS Comput Biol. 2017 Aug 11;13(8):e1005684. doi: 10.1371/journal.pcbi.1005684
22. Bavard S, Lebreton M, Khamassi M, Coricelli G, Palminteri S. Reference-point centering and range-adaptation enhance human reinforcement learning at the cost of irrational preferences. Nat Commun. 2018 Oct 29;9(1):4503. doi: 10.1038/s41467-018-06781-2
23. Pischedda D, Palminteri S, Coricelli G. The Effect of Counterfactual Information on Outcome Value Coding in Medial Prefrontal and Cingulate Cortex: From an Absolute to a Relative Neural Code. J Neurosci. 2020 Apr 15;40(16):3268–77. doi: 10.1523/JNEUROSCI.1712-19.2020
24. Premereur E, Janssen P, Vanduffel W. Functional MRI in Macaque Monkeys during Task Switching. J Neurosci. 2018 Dec 12;38(50):10619–30. doi: 10.1523/JNEUROSCI.1539-18.2018
25. Nakahara K, Hayashi T, Konishi S, Miyashita Y. Functional MRI of Macaque Monkeys Performing a Cognitive Set-Shifting Task. Science. 2002 Feb 22;295(5559):1532–6. doi: 10.1126/science.1067653
26. Ford KA, Gati JS, Menon RS, Everling S. BOLD fMRI activation for anti-saccades in nonhuman primates. Neuroimage. 2009 Apr 1;45(2):470–6. doi: 10.1016/j.neuroimage.2008.12.009
27. Kaskan PM, Costa VD, Eaton HP, Zemskova JA, Mitz AR, Leopold DA, et al. Learned Value Shapes Responses to Objects in Frontal and Ventral Stream Networks in Macaque Monkeys. Cereb Cortex. 2017 May 1;27(5):2739–57. doi: 10.1093/cercor/bhw113
28. Fouragnan EF, Chau BKH, Folloni D, Kolling N, Verhagen L, Klein-Flügge M, et al. The macaque anterior cingulate cortex translates counterfactual choice value into actual behavioral change. Nat Neurosci. 2019 May;22(5):797–808. doi: 10.1038/s41593-019-0375-6
29. Grohn J, Schüffelgen U, Neubert FX, Bongioanni A, Verhagen L, Sallet J, et al. Multiple systems in macaques for tracking prediction errors and other types of surprise. PLoS Biol. 2020 Oct 30;18(10):e3000899. doi: 10.1371/journal.pbio.3000899
30. Abe H, Lee D. Distributed Coding of Actual and Hypothetical Outcomes in the Orbital and Dorsolateral Prefrontal Cortex. Neuron. 2011 May 26;70(4):731–41. doi: 10.1016/j.neuron.2011.03.026
31. Hayden BY, Pearson JM, Platt ML. Fictive Reward Signals in the Anterior Cingulate Cortex. Science. 2009 May 15;324(5929):948–50. doi: 10.1126/science.1168488
32. Lopez-Persem A, Roumazeilles L, Folloni D, Marche K, Fouragnan EF, Khalighinejad N, et al. Differential functional connectivity underlying asymmetric reward-related activity in human and nonhuman primates. Proc Natl Acad Sci. 2020 Nov 10;117(45):28452–62. doi: 10.1073/pnas.2000759117
33. Lau B, Glimcher PW. Value Representations in the Primate Striatum during Matching Behavior. Neuron. 2008 May 8;58(3):451–63. doi: 10.1016/j.neuron.2008.02.021
34. Hunt LT, Malalasekera WMN, de Berker AO, Miranda B, Farmer SF, Behrens TEJ, et al. Triple dissociation of attention and decision computations across prefrontal cortex. Nat Neurosci. 2018 Oct;21(10):1471–81. doi: 10.1038/s41593-018-0239-5
35. Ballesta S, Padoa-Schioppa C. Economic Decisions through Circuit Inhibition. Curr Biol. 2019 Nov 18;29(22):3814–3824.e5. doi: 10.1016/j.cub.2019.09.027
36. Reveley C, Gruslys A, Ye FQ, Glen D, Samaha J, Russ BE, et al. Three-Dimensional Digital Template Atlas of the Macaque Brain. Cereb Cortex. 2017 Sep 1;27(9):4463–77. doi: 10.1093/cercor/bhw248
37. Mackey S, Petrides M. Quantitative demonstration of comparable architectonic areas within the ventromedial and lateral orbital frontal cortex in the human and the macaque monkey brains. Eur J Neurosci. 2010;32(11):1940–50. doi: 10.1111/j.1460-9568.2010.07465.x
38. Hampton RR, Zivin A, Murray EA. Rhesus monkeys (Macaca mulatta) discriminate between knowing and not knowing and collect information as needed before acting. Anim Cogn. 2004 Oct 1;7(4):239–46. doi: 10.1007/s10071-004-0215-1
39. Tu HW, Pani AA, Hampton RR. Rhesus monkeys (Macaca mulatta) adaptively adjust information seeking in response to information accumulated. J Comp Psychol. 2015;129(4):347–55. doi: 10.1037/a0039595
40. Bosc M, Bioulac B, Langbour N, Nguyen TH, Goillandeau M, Dehay B, et al. Checking behavior in rhesus monkeys is related to anxiety and frontal activity. Sci Rep. 2017 Mar 28;7(1):45267. doi: 10.1038/srep45267
41. Kolling N, Scholl J, Chekroud A, Trier HA, Rushworth MFS. Prospection, Perseverance, and Insight in Sequential Behavior. Neuron. 2018 Sep 5;99(5):1069–1082.e7. doi: 10.1016/j.neuron.2018.08.018
42. Procyk E, Tanaka YL, Joseph JP. Anterior cingulate activity during routine and non-routine sequential behaviors in macaques. Nat Neurosci. 2000 May;3(5):502–8. doi: 10.1038/74880
43. Ferrigno S, Cheyette SJ, Piantadosi ST, Cantlon JF. Recursive sequence generation in monkeys, children, U.S. adults, and native Amazonians. Sci Adv. 2020;6:eaaz1002. doi: 10.1126/sciadv.aaz1002
44. Roumazeilles L, Schurz M, Lojkiewiez M, Verhagen L, Schüffelgen U, Marche K, et al. Social prediction modulates activity of macaque superior temporal cortex. Sci Adv. 2021 Sep 17;7(38):eabh2392. doi: 10.1126/sciadv.abh2392
45. Wang MZ, Hayden BY. Monkeys are curious about counterfactual outcomes. Cognition. 2019 Aug 1;189:1–10. doi: 10.1016/j.cognition.2019.03.009
46. Steiner AP, Redish AD. Behavioral and neurophysiological correlates of regret in rat decision-making on a neuroeconomic task. Nat Neurosci. 2014 Jul;17(7):995–1002. doi: 10.1038/nn.3740
47. Hayden BY, Pearson JM, Platt ML. Neuronal basis of sequential foraging decisions in a patchy environment. Nat Neurosci. 2011 Jul;14(7):933–9. doi: 10.1038/nn.2856
48. Kolling N, Behrens TEJ, Mars RB, Rushworth MFS. Neural Mechanisms of Foraging. Science. 2012 Apr 6;336(6077):95–8. doi: 10.1126/science.1216930
49. Kane GA, Vazey EM, Wilson RC, Shenhav A, Daw ND, Aston-Jones G, et al. Increased locus coeruleus tonic activity causes disengagement from a patch-foraging task. Cogn Affect Behav Neurosci. 2017 Sep 12. doi: 10.3758/s13415-017-0531-y
50. Shima K, Tanji J. Role for Cingulate Motor Area Cells in Voluntary Movement Selection Based on Reward. Science. 1998 Nov 13;282(5392):1335–8. doi: 10.1126/science.282.5392.1335
51. Kennerley SW, Walton ME, Behrens TEJ, Buckley MJ, Rushworth MFS. Optimal decision making and the anterior cingulate cortex. Nat Neurosci. 2006 Jul;9(7):940–7. doi: 10.1038/nn1724
52. Quilodran R, Rothé M, Procyk E. Behavioral shifts and action valuation in the anterior cingulate cortex. Neuron. 2008 Jan 24;57(2):314–25. doi: 10.1016/j.neuron.2007.11.031
53. Amiez C, Sallet J, Procyk E, Petrides M. Modulation of feedback related activity in the rostral anterior cingulate cortex during trial and error exploration. Neuroimage. 2012 Nov 15;63(3):1078–90. doi: 10.1016/j.neuroimage.2012.06.023
54. Achterberg J, Kadohisa M, Watanabe K, Kusunoki M, Buckley MJ, Duncan J. A One-Shot Shift from Explore to Exploit in Monkey Prefrontal Cortex. J Neurosci. 2022 Jan 12;42(2):276–87. doi: 10.1523/JNEUROSCI.1338-21.2021
55. Stoll FM, Fontanier V, Procyk E. Specific frontal neural dynamics contribute to decisions to check. Nat Commun. 2016 Jun 20;7:11990. doi: 10.1038/ncomms11990
56. Procyk E, Wilson CRE, Stoll FM, Faraut MCM, Petrides M, Amiez C. Midcingulate Motor Map and Feedback Detection: Converging Data from Humans and Monkeys. Cereb Cortex. 2016 Feb 1;26(2):467–76. doi: 10.1093/cercor/bhu213
57. Kennerley SW, Wallis JD. Evaluating choices by single neurons in the frontal lobe: outcome value encoded across multiple decision variables. Eur J Neurosci. 2009;29(10):2061–73. doi: 10.1111/j.1460-9568.2009.06743.x
58. Luk CH, Wallis JD. Choice Coding in Frontal Cortex during Stimulus-Guided or Action-Guided Decision-Making. J Neurosci. 2013 Jan 30;33(5):1864–71. doi: 10.1523/JNEUROSCI.4920-12.2013
59. Basten U, Biele G, Heekeren HR, Fiebach CJ. How the brain integrates costs and benefits during decision making. PNAS. 2010 Dec 14;107(50):21767–72. doi: 10.1073/pnas.0908104107
60. Philiastides MG, Biele G, Heekeren HR. A mechanistic account of value computation in the human brain. PNAS. 2010 May 18;107(20):9430–5. doi: 10.1073/pnas.1001732107
61. Domenech P, Rheims S, Koechlin E. Neural mechanisms resolving exploitation-exploration dilemmas in the medial prefrontal cortex. Science. 2020 Aug 28;369(6507):eabb0184. doi: 10.1126/science.abb0184
62. Khamassi M, Quilodran R, Enel P, Dominey PF, Procyk E. Behavioral Regulation and the Modulation of Information Coding in the Lateral Prefrontal and Cingulate Cortex. Cereb Cortex. 2015 Sep 1;25(9):3197–218. doi: 10.1093/cercor/bhu114
63. Aston-Jones G, Cohen JD. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu Rev Neurosci. 2005;28:403–50. doi: 10.1146/annurev.neuro.28.061604.135709
64. Jahn CI, Gilardeau S, Varazzani C, Blain B, Sallet J, Walton ME, et al. Dual contributions of noradrenaline to behavioural flexibility and motivation. Psychopharmacology (Berl). 2018 Sep;235(9):2687–702. doi: 10.1007/s00213-018-4963-z
65. Tervo DGR, Proskurin M, Manakov M, Kabra M, Vollmer A, Branson K, et al. Behavioral variability through stochastic choice and its gating by anterior cingulate cortex. Cell. 2014 Sep 25;159(1):21–32. doi: 10.1016/j.cell.2014.08.037
66. Schultz W, Dayan P, Montague PR. A Neural Substrate of Prediction and Reward. Science. 1997 Mar 14;275(5306):1593–9. doi: 10.1126/science.275.5306.1593
67. Matsumoto M, Matsumoto K, Abe H, Tanaka K. Medial prefrontal cell activity signaling prediction errors of action values. Nat Neurosci. 2007 May;10(5):647–56. doi: 10.1038/nn1890
68. Buckley MJ, Mansouri FA, Hoda H, Mahboubi M, Browning PGF, Kwok SC, et al. Dissociable Components of Rule-Guided Behavior Depend on Distinct Medial and Prefrontal Regions. Science. 2009 Jul 3;325(5936):52–8. doi: 10.1126/science.1172377
69. Kennerley SW, Behrens TEJ, Wallis JD. Double dissociation of value computations in orbitofrontal and anterior cingulate neurons. Nat Neurosci. 2011 Dec;14(12):1581. doi: 10.1038/nn.2961
70. Rudebeck PH, Murray EA. The Orbitofrontal Oracle: Cortical Mechanisms for the Prediction and Evaluation of Specific Behavioral Outcomes. Neuron. 2014 Dec 17;84(6):1143–56. doi: 10.1016/j.neuron.2014.10.049
71. Izquierdo A, Suda RK, Murray EA. Bilateral Orbital Prefrontal Cortex Lesions in Rhesus Monkeys Disrupt Choices Guided by Both Reward Value and Reward Contingency. J Neurosci. 2004 Aug 25;24(34):7540–8. doi: 10.1523/JNEUROSCI.1921-04.2004
72. Noonan MP, Walton ME, Behrens TEJ, Sallet J, Buckley MJ, Rushworth MFS. Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex. Proc Natl Acad Sci. 2010 Nov 23;107(47):20547–52. doi: 10.1073/pnas.1012246107
73. Walton ME, Behrens TEJ, Buckley MJ, Rudebeck PH, Rushworth MFS. Separable Learning Systems in the Macaque Brain and the Role of Orbitofrontal Cortex in Contingent Learning. Neuron. 2010 Mar 25;65(6):927–39. doi: 10.1016/j.neuron.2010.02.027
74. Folloni D, Fouragnan E, Wittmann MK, Roumazeilles L, Tankelevitch L, Verhagen L, et al. Ultrasound modulation of macaque prefrontal cortex selectively alters credit assignment–related activity and behavior. Sci Adv. 2021;7(51):eabg7700. doi: 10.1126/sciadv.abg7700
75. Coricelli G, Critchley HD, Joffily M, O’Doherty JP, Sirigu A, Dolan RJ. Regret and its avoidance: a neuroimaging study of choice behavior. Nat Neurosci. 2005 Sep;8(9):1255–62. doi: 10.1038/nn1514
76. Tobia MJ, Guo R, Schwarze U, Boehmer W, Gläscher J, Finckh B, et al. Neural systems for choice and valuation with counterfactual learning signals. Neuroimage. 2014 Apr 1;89:57–69. doi: 10.1016/j.neuroimage.2013.11.051
77. Camille N, Coricelli G, Sallet J, Pradat-Diehl P, Duhamel JR, Sirigu A. The Involvement of the Orbitofrontal Cortex in the Experience of Regret. Science. 2004 May 21;304(5674):1167–70. doi: 10.1126/science.1094550
78. Chib VS, Rangel A, Shimojo S, O’Doherty JP. Evidence for a Common Representation of Decision Values for Dissimilar Goods in Human Ventromedial Prefrontal Cortex. J Neurosci. 2009 Sep 30;29(39):12315–20. doi: 10.1523/JNEUROSCI.2575-09.2009
79. Boorman ED, Rushworth MF, Behrens TE. Ventromedial Prefrontal and Anterior Cingulate Cortex Adopt Choice and Default Reference Frames during Sequential Multi-Alternative Choice. J Neurosci. 2013 Feb 6;33(6):2242–53. doi: 10.1523/JNEUROSCI.3022-12.2013
80. Lopez-Persem A, Domenech P, Pessiglione M. How prior preferences determine decision-making frames and biases in the human brain. eLife. 2016 Nov 19;5:e20317. doi: 10.7554/eLife.20317
81. Bürkner PC. brms: An R Package for Bayesian Multilevel Models Using Stan. J Stat Softw. 2017 Aug 29;80:1–28.
82. Stan Development Team. Stan Modeling Language Users Guide and Reference Manual. 2021.
83. Bongioanni A, Folloni D, Verhagen L, Sallet J, Klein-Flügge MC, Rushworth MFS. Activation and disruption of a neural mechanism for novel choice in monkeys. Nature. 2021 Mar;591(7849):270–4. doi: 10.1038/s41586-020-03115-5
84. Khalighinejad N, Bongioanni A, Verhagen L, Folloni D, Attali D, Aubry JF, et al. A Basal Forebrain-Cingulate Circuit in Macaques Decides It Is Time to Act. Neuron. 2020 Jan 22;105(2):370–384.e8. doi: 10.1016/j.neuron.2019.10.030
85. Kahnt T, Chang LJ, Park SQ, Heinzle J, Haynes JD. Connectivity-Based Parcellation of the Human Orbitofrontal Cortex. J Neurosci. 2012 May 2;32(18):6240–50. doi: 10.1523/JNEUROSCI.0257-12.2012
86. Cerpa JC, Marchand AR, Coutureau E. Distinct regional patterns in noradrenergic innervation of the rat prefrontal cortex. J Chem Neuroanat. 2019 Mar 1;96:102–9. doi: 10.1016/j.jchemneu.2019.01.002
87. Palomero-Gallagher N, Vogt BA, Schleicher A, Mayberg HS, Zilles K. Receptor architecture of human cingulate cortex: evaluation of the four-region neurobiological model. Hum Brain Mapp. 2009 Aug;30(8):2336–55. doi: 10.1002/hbm.20667
88. van Heukelum S, Mars RB, Guthrie M, Buitelaar JK, Beckmann CF, Tiesinga PHE, et al. Where is Cingulate Cortex? A Cross-Species View. Trends Neurosci. 2020 May 1;43(5):285–99. doi: 10.1016/j.tins.2020.03.007

Decision Letter 0

Kris Dickson, PhD

27 Jun 2022

Dear Dr Jahn,

Thank you for submitting your manuscript entitled "Strategic exploration in the macaque’s prefrontal cortex." for consideration as a Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. After your manuscript has passed the checks, it will be sent out for review. To provide the metadata for your submission, please log in to Editorial Manager (https://www.editorialmanager.com/pbiology) within two working days, i.e. by Jun 29 2022 11:59PM.

If your manuscript has been previously peer-reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time.

If you would like us to consider previous reviewer reports, please edit your cover letter to let us know and include the name of the journal where the work was previously considered and the manuscript ID it was given. In addition, please upload a response to the reviews as a 'Prior Peer Review' file type, which should include the reports in full and a point-by-point reply detailing how you have or plan to address the reviewers' concerns.

During the process of completing your manuscript submission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Kris

Kris Dickson, Ph.D. (she/her)

Neurosciences Senior Editor/Section Manager

PLOS Biology

kdickson@plos.org

Decision Letter 1

Kris Dickson, PhD

14 Aug 2022

Dear Dr Jahn,

Thank you for your patience while your manuscript "Strategic exploration in the macaque’s prefrontal cortex." was peer-reviewed at PLOS Biology. It has now been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by several independent reviewers.

In light of the reviews, which you will find at the end of this email, we would like to invite you to revise the work to thoroughly and comprehensively address the reviewers' reports.

Given the extent of revision needed, we cannot make a decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is likely to be sent for further evaluation by all or a subset of the reviewers.

We hope to receive your revised manuscript within 3 months. If you feel you might need additional time to comprehensively address the reviewer concerns or you have additional questions, please email us at plosbiology@plos.org.

At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may withdraw it.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Revised Article with Changes Highlighted" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Kris

Kris Dickson, Ph.D. (she/her)

Neurosciences Senior Editor/Section Manager

PLOS Biology

kdickson@plos.org

------------------------------------

REVIEWS:

Reviewer #1: In this interesting new work, Jahn and colleagues studied the behavioral and neural correlates of strategic decision-making in macaques. Through an innovative adaptation of the horizon task, previously used in human research by another group of researchers, the authors were able to study how monkeys strategically made different, well-defined choices constrained by short vs. long decision horizon manipulations. Then, the authors used this task to examine the neural (fMRI) correlates of these behaviors.

While behavioral and neural bases of exploration vs. exploitation strategic decisions in rhesus macaques have been studied frequently through multiple approaches in the past, the use of the horizon task provided a novel and unique context to examine these strategic behaviors. In particular, this study could rely on the 'functional utility' of exploration decisions in a more constrained manner than some of the other existing studies due to the need to continue choosing sequentially in the long horizon condition. Furthermore, the use of fMRI allowed more holistic perspectives on how different decision variables are represented in this task.

Overall, the use of a novel task design combined with the careful analyses of decision variables at both behavioral and neural levels makes this research influential for future research in the domain of strategic decision-making. The behavioral and neural results are well-grounded in the model used and reveal an important change in expected value representation to promote exploration when there is a strategic benefit for doing so.

I believe this is a strong study with very interesting results. I did not find any major issues with this work but have some (mostly minor) comments that could further improve the manuscript and enhance its broader impact.

1. A rather big portion of the Introduction is taken up by describing the task logic, methods, and results of the current work. Much of this information is repeated at the beginning of the Results section. I suggest the authors cover more literature background and research motivation in the Introduction beyond introducing the horizon task and its relation to what the authors did in this manuscript. Doing so will more strongly position this work in light of the big-picture importance of studying strategic decision-making in the brain, and will help readers, in addition to the Discussion section, understand why this work provides novel insights into strategic decision-making beyond other existing studies.

2. Strategic decisions frequently have idiosyncratic components, as different monkeys may employ slightly different strategies for solving the same task even when their overall strategies converge on average. It might thus be worthwhile to examine whether there were individual differences in strategies over the course of a session or the course of the entire data collection (e.g., were these decision variables stationary or non-stationary?). Given that there is evidence of individual differences in Supp Figures 1-3, it would be informative to go deeper into how such individual differences might be reflected, potentially differently, across ACC/MCC, dlPFC, and OFC BOLD activations with respect to their functional roles suggested in this manuscript.

3. In P. 9, "However, we found no statistically reliable difference in the sensitivity to the uncertainty across the experimental conditions. Therefore, uncertainty did not play a key role in strategic exploration in our task." I am a bit surprised that there were no clear effects of uncertainty in monkeys' behaviors, as reducing uncertainty should be a major functional driver of decisions that are central to exploration. Unless I missed it, I don't see a place where the authors examined the effects of uncertainty at the neural activation level and how it might have influenced the neural representation of strategic decision variables in the brain regions examined. It would be surprising if uncertainty (variance) played no role in ACC/MCC, dlPFC, or OFC activations on signaling expected value or chosen/unchosen outcomes. If there are any effects, these modulations might further provide neural mechanistic insights into the computations performed by these brain regions for explore vs. exploit decisions. Either way, I suggest the authors discuss in some detail the effects of uncertainty (behavioral and neural) in the Discussion in relation to the main findings.

4. Throughout the manuscript, while the innovative and purposeful use of the horizon task was well discussed, the big-picture importance of the functional utility of exploring vs. exploiting in the horizon manipulation, compared to other existing explore vs. exploit decision paradigms (e.g., a two-armed bandit with fixed probabilities or with a dynamic reward schedule), was not communicated as effectively as possible. Perhaps the authors should make this point clearly so that readers can better grasp why this particular study was able to demonstrate certain aspects of behavioral and neural strategic decision-making in monkeys compared to other existing work.

5. The differentiated or common roles of different brain regions in tracking decision variables were interesting. Were there any differences in functional connectivity patterns among brain regions for exploration vs. exploitation, or in long vs. short horizons? Such network-level analyses may provide additional insight into how different brain regions work together to compute strategic choices.

6. Fig. 1B, in the distributions of Expected Value and Uncertainty, there are some thin lines that should be deleted (second and third columns of the distribution graphs) - I believe these are remnants left over from making the figure.

7. P. 11, the results section at the top of this page has some sentences that should refer to Fig. 3E, rather than Fig. 3D.

Reviewer #2: Jahn CI et al., Strategic exploration in the macaque's prefrontal cortex

The authors describe data from a study in which they trained monkeys on a primate variation of the horizon task to study the explore-exploit tradeoff, and carried out fMRI while the animals executed the task. They found that animals relied more on the expected value in short-horizon than long-horizon choices. Their choices did not, however, depend on the uncertainty of the options. Comparisons of different conditions showed that there was increased activation in areas 24, 46, and 47/12 for long-horizon vs. short-horizon trials, and increased activation in area 46 for partial vs. complete feedback.

This manuscript addresses important questions about the behavioral and neural mechanisms that underlie exploration vs. exploitation. The behavioral paradigm is well-controlled and is based on the horizon task developed previously by Robert Wilson. This task allows for explicitly testing specific hypotheses about the factors that underlie exploratory behavior. Overall, this is an interesting manuscript. I do have several comments that should be addressed.

Comments

1. The abstract and introduction conflate foraging and the explore-exploit tradeoff. Foraging and the explore-exploit tradeoff, however, refer to different behavioral/theoretical processes. Stephens and Krebs (reference 1) developed the marginal value theorem to account for foraging. Within this framework, the choice to leave a patch results in a random sample from a known distribution of other patches. Thus, there is no learning within the theoretical framework that has been developed to describe foraging. The explore-exploit tradeoff, on the other hand, comes from the reinforcement learning literature and refers specifically to the process being studied in this manuscript. In the explore-exploit tradeoff, the goal is specifically to learn, or gather more information, about some probability distribution. While there is of course an intuitive or folk similarity between the terms exploration and foraging, they do not refer to the same phenomenon as studied in the literature. If you would like to stick with the term foraging, I would suggest stating that you are not using it in the way in which it was originally defined. I would, however, suggest removing references to foraging.

2. A number of papers, some of them recent, should be cited. The work of Vincent Costa in monkeys and recently in humans (Hogeveen J et al., Neuron, 2022; Costa VD and Averbeck, J Neurosci, 2020; Costa VD et al., Neuron, 2019). Work by Becket Ebitz (Ebitz RB et al., Neuron, 2018), and work by Michael Frank (Badre D et al., Neuron, 2012; Cavanagh JF et al., Cer Cortex, 2012), for example.

3. It would be useful to report ANOVA results for the data shown in Fig. 2B. A direct examination of main and interaction effects of feedback and horizon would be useful. Means from each session could be entered into the analysis. It's also somewhat surprising that the regression weight for expected value is higher in the complete feedback condition (Fig. 2D) but the probability of choosing the highest EV option is not higher in 2B. Perhaps some of the other variables from the logistic regression are driving this. Are there strong correlations, for example, between repetition bias and EV?

4. In the horizon task, exploration is divided into directed and undirected (or random) exploration. In the current manuscript, the term strategic exploration is used. However, there are no effects of uncertainty on exploration, and therefore the monkeys are only using undirected exploration. Is random exploration strategic? Does "strategic" refer to the increase in exploration in the long horizon of the partial feedback condition? Do these effects come out of the ANOVA (comment 3)? It would be useful to explicitly discuss directed and undirected exploration, and how these relate to the current results. Avoiding uncertainty, which the animals show, is difficult to fit within this framework. Perhaps some mention of this would be useful.

5. The figure legends state that error bars are S.E.M. but they do not give the N.

6. F- and t-statistics and N are given for the imaging data, but not for the behavioral data. More detailed statistics should be given for the behavioral data.

7. The Bayesian statistics are fine. But what if a t-test were used on the parameter comparisons for the logistic regression? Can you report t and p values for these comparisons? It's not clear what is gained by using the Bayesian framework in this situation.

Reviewer #3: This manuscript describes an attempt to implement the horizon task pioneered by Wilson and colleagues (2013) in rhesus macaques. An operating assumption is that monkeys, like humans, would use directed exploration, guided by uncertainty, in deciding whether to sample one option over another. To test this assumption, the authors adapted the version of the task tested in humans. The monkeys first observed visual cue presentations that signaled information about the potential value of choosing either cue. Equal (2 observations of visual feedback about each cue) or unequal information (3 observations of visual feedback for one cue and 1 observation of the other cue) was provided about each of the two cues during this observation phase. No primary rewards were experienced during the observation phase. The monkeys were then visually signaled, by the height of two gray vertical bars on the left and right side of the screen, that they would have either one (short horizon) or four (long horizon) opportunities to choose between the two options following the observation phase. After each choice, the monkeys received between 0 and 10 drops of juice, determined by the length of visual feedback associated with their choice. The monkeys completed the task inside an MRI scanner to enable assessments of brain activity related to encoding of the choice horizon and expected value. One additional manipulation was that following their choices the monkeys were provided with either partial feedback (i.e. they only received visual feedback about the option they chose) or complete feedback (i.e. they were shown visual feedback about both the chosen and unchosen option, but only received primary reward related to the option they chose). The partial versus complete feedback manipulation occurred across the individual test sessions, whereas the horizon and information manipulations occurred within a session. The authors used a Bayesian model to quantify the expected value and uncertainty associated with choosing each option prior to each instance of visual feedback in the observation and choice phases of each trial. The authors focus on a reduced reliance on the expected value in deciding which option to choose as indicative of the monkeys' use of "strategic exploration" to decide when to explore, predicting increased exploration in long versus short horizons when the monkeys received partial feedback, whereas exploration should not differ between the horizon conditions when complete feedback is provided. Although the authors believe they have sufficient behavioral and neural evidence to support this hypothesis, their claims are not well supported. I have serious doubts about whether the monkeys were even sensitive to the horizon manipulation, and more fundamental behavioral analyses that hew closely to the data and match those performed previously in humans are needed before interpreting the reported fMRI contrasts or claiming to have identified differences in humans' versus non-human primates' use of strategic exploration. My major concerns are outlined below:

1. The authors need to provide more direct evidence, replicating analyses from Wilson et al. (2013), that demonstrates the monkeys were sensitive to the horizon manipulation. While their accuracy is above chance, it is not apparent that the accuracy measures directly differ by the horizon or feedback conditions based on the error bars in the plot provided in Figure 2. The horizon task is set up to measure directed versus random exploration explicitly, by assessing how either the selection of the more informative, uncertain option varies as a function of the difference in the expected value of the two options based on the observation phase when unequal information is provided (i.e. the number of forced choice trials in the observation), or how the preference for a particular choice option (i.e. right or left) varies with the difference in the expected value when equal information is provided. When a non-linear choice function (e.g. a sigmoid) is fit to the behavior in this way, the slope and intercept of the function can be used to quantify random exploration (slope) and how directed exploration is shaped by an information bonus. Given that the manuscript potentially describes the first implementation of the horizon task in nonhuman primates, these fundamental analyses, which build a bridge to human studies and do not rely on the implementation of a particular computational model, must be added to the main body of the manuscript. Especially critical is fitting these functions to choices on the first trial in the partial feedback sessions when equal or unequal information is presented. But I suspect, given the result from the Bayesian model that the monkeys avoided uncertainty, that these analyses will reveal that the monkeys are not using directed exploration at all, that the information bonuses are equivalent across the two horizons, and that the documented "strategic exploration" is simply an increase in random exploration denoted by a smaller slope in the long vs. short horizon condition, stemming from the animals being less exploitative. If this is the case, then it casts the fMRI analyses in a very different light, especially the contrast of the activity as a function of horizon. If this contrast is effectively highlighting random rather than directed exploration, this might be one reason that the cingulate cortex rather than the frontal pole of the prefrontal cortex is activated differently under the two conditions. To summarize, the authors need to show, using some behavioral metric that does not rely on the Bayesian regression and using only the first choice trials, where value and uncertainty are decoupled by the forced choice manipulation, that the animals were sensitive to the horizon manipulation.

2. The comparison of the partial and complete feedback conditions is problematic for isolating the processing of counterfactual feedback. This is because in the complete feedback sessions, where feedback is always provided during the forced and free choice trials for both options, the counterfactual feedback is confounded with complete (i.e. equal) information, whereas in the partial feedback sessions the amount of information provided is either equal or unequal. Instead of a direct comparison between the partial and complete feedback sessions, the authors should be contrasting a randomly sampled subset of trials from the complete feedback sessions with trials in which equal amounts of partial feedback were provided about the two cues in the observation phase. Ideally, the exact same feedback would be provided, but I recognize this might be impossible. This contrast and the identified neural activity could then be compared to the currently missing contrast of equal and unequal feedback in the partial feedback sessions, to better isolate the effects due to counterfactual feedback versus uncertainty.

3. The use of one-tailed statistical tests is not defensible, and many of the key effects, particularly those relating to the horizon manipulations, would not survive appropriate corrections for multiple comparisons.

Decision Letter 2

Kris Dickson, PhD

16 Dec 2022

Dear Dr Jahn,

Thank you for your patience while we considered your revised manuscript "Strategic exploration in the macaque’s prefrontal cortex." for publication as a Research Article at PLOS Biology. This revised version of your manuscript has been evaluated by the PLOS Biology editors, the Academic Editor and the original reviewers.

Based on the reviews and on our Academic Editor's assessment of your revision, we are likely to accept this manuscript for publication. Given that our readership might have similar questions to those raised by the reviewers, we would like you to do some rewriting of your manuscript to incorporate the point-by-point responses you'd provided to Reviewer 1 into the body of your manuscript. We also ask that you consider a title change to something like:

The macaque prefrontal cortex supports strategic exploration (If you don't feel this is too strongly worded)

OR

Neural responses in macaque prefrontal cortex are linked to strategic exploration

Please also make sure to fully address the data and other policy-related requests at the bottom of this email. Failure to do so will result in delays in moving your submission forward.

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

- a track-changes file indicating any changes that you have made to the manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Press*

Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Kris

Kris Dickson, Ph.D., (she/her)

Neurosciences Senior Editor/Section Manager,

kdickson@plos.org,

PLOS Biology

------------------------------------------------------------------------

ETHICS STATEMENT:

Please update your ethics statement with the required additional details as outlined in

https://journals.plos.org/plosbiology/s/animal-research#loc-non-human-primates

Specifically, non-human primate studies must be performed in accordance with the recommendations of the Weatherall report “The use of non-human primates in research”. Manuscripts describing research involving non-human primates must include details of animal welfare, including information about housing, feeding, and environmental enrichment, and steps taken to minimize suffering, including use of anesthesia and method of sacrifice if appropriate.

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

We appreciate the deposition of your data on Gitfront but note that we cannot accept sole deposition of data to a non-static site (e.g. no personal sites and generally no institutional sites) (https://journals.plos.org/plosbiology/s/data-availability). We require deposition of all summary data to a static site, like Zenodo, FigShare or OSF. GitHub and similar sites can, however, be used for depositing code.

Note that we do not require all raw data to be deposited on such a static site. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in order to allow our readership to reproduce the figures in your paper:

1) Please ensure that these data files are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. If using excel, multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Please also provide the updated accession code so that we may view your data before publication.

3) Please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it:

Fig2A-D; Fig3A-E; Fig4B,D,E; Fig5B-D

Supplemental figures: S1A,B; S2A-C; S3A,B; S4A-C; S5A-C;

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

4) Please also ensure that figure legends in your manuscript include information on where this underlying data can be found (e.g. “The underlying data supporting Fig X, panel Y can be found in file Z.”), and please also ensure that your supplemental data file/s has a legend.

5) Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

DATA NOT SHOWN?

- Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please check your submission to ensure no such statements are included. If there are such statements, either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).

------------------------------------------------------------------------

Reviewer remarks:

Do you want your identity to be public for this peer review?

Reviewer #1: Yes: Steve W. C. Chang

Reviewer #2: No

Reviewer #3: No

Reviewer #1: The authors did a very thorough job at addressing all of my points by 1) performing a number of new analyses to support their responses, 2) adding new information to the manuscript, and 3) modifying their interpretations more broadly in the context of explore vs. exploit decision-making. Their responses were detailed and constructed based on evidence from their data (even when not included in the manuscript). I have no more comments.

Reviewer #2: The authors have addressed my concerns. I have no further comments.

Reviewer #3: The authors have satisfactorily addressed my previous concerns.

Decision Letter 3

Kris Dickson, PhD

3 Jan 2023

Dear Dr Jahn,

Thank you for the submission of your revised Research Article "Neural responses in macaque prefrontal cortex are linked to strategic exploration" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Thorsten Kahnt, I am pleased to say that we can in principle accept your manuscript for publication. In your final version, we do ask that you either include the two additional statements addressing Dr Chang's points (i.e. "Based on the reviewer's request...") and include a statement at the end of these paragraphs referring readers to the supplemental "response to reviewers" file that you will also include with the final submission, or minimally that you include a pared down version of these two statements: "We conducted additional exploratory brain-behavior correlations but found no significant relationships to behavioral sensitivity (see supplemental "Response to Reviewers" file for additional details)." Please also address any remaining formatting and reporting issues that will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed these requested changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

With best wishes for 2023, 

Kris

Kris Dickson, Ph.D., (she/her)

Neurosciences Senior Editor/Section Manager

PLOS Biology

kdickson@plos.org

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Full model fit of the model predicting choice of the right option on screen on first choices (shown in Fig 2D and described in detail in the Materials and methods section).

    (A) Predictors are, from left to right: intercept (i.e., a side bias), repetition bias (RB), the expected value difference between right and left according to our Bayesian model (EV), the uncertainty difference between right and left according to our Bayesian model (U), horizon length (short horizon is positive, long horizon is negative), the interaction between horizon and expected value (horizonXER), and the interaction between horizon and uncertainty (horizonXU). The distributions are the posteriors of the parameter estimates, shown both for each monkey individually and averaged over animals. Fits from the partial feedback sessions are shown on the left, and from the complete feedback sessions on the right. (B) Data from the same fit as in (A) but now summed over both partial and complete feedback sessions. (C) Data from the same fit as in (A) but now we computed the difference between partial and complete feedback sessions. (D) One-sided p-values for all parameters are computed as the proportion of posterior samples greater than 0 (a minimal code sketch of this computation follows this figure entry). To compute the p-value for effects smaller than 0, the p-values in the table can be subtracted from 1. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

    (TIFF)
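
    As a concrete illustration of the computation described in panel D, the following minimal Python sketch (our illustration, not the authors' analysis code; the posterior draws below are simulated stand-ins with hypothetical means and spreads) derives the one-sided p-value for each regression weight as the proportion of posterior samples above zero:

    import numpy as np

    rng = np.random.default_rng(0)
    n_draws = 4000

    # Hypothetical posterior draws for each regression weight; in practice these
    # would come from the fitted Bayesian multilevel logistic regression.
    posterior = {
        "Intercept":  rng.normal(0.05, 0.10, n_draws),
        "RB":         rng.normal(0.40, 0.08, n_draws),
        "EV":         rng.normal(1.20, 0.15, n_draws),
        "U":          rng.normal(-0.30, 0.12, n_draws),
        "horizon":    rng.normal(0.10, 0.09, n_draws),
        "horizonXER": rng.normal(0.25, 0.10, n_draws),
        "horizonXU":  rng.normal(0.02, 0.11, n_draws),
    }

    for name, draws in posterior.items():
        # Proportion of posterior mass above zero; subtracting from 1 gives the
        # p-value for effects smaller than 0, as noted in the legend.
        p_greater = np.mean(draws > 0.0)
        print(f"{name:>10}: P(beta > 0) = {p_greater:.3f}")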

    S2 Fig. The same model as in S1 Fig but fit only to trials in which the amount of information available about the options on screen was equal on each side (2 and 2 observations).

    All conventions are the same as in S1 Fig. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

    (TIFF)

    S3 Fig. Equivalency between our linear regression and the framing in terms of random and directed exploration.

    (A) With the uncertainty regressor. We find that monkeys modulate their sensitivity to the expected value depending on the horizon and the feedback type; this corresponds to the random exploration parameter, the softmax noise, which is the inverse of the expected value regression weight. However, we find no modulation of the uncertainty effect by either the horizon or the feedback type; this corresponds to the directed exploration parameter, the uncertainty bonus, which is the uncertainty weight divided by the expected value weight (the mapping is written out after this figure entry). (B) Same as (A) but with the amount of available information rather than the uncertainty. Error bars indicate standard deviation. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

    (TIFF)
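
    The equivalence described in the legend can be written out explicitly. In our notation (a sketch assuming a standard softmax over the right-minus-left differences in expected value and uncertainty; the symbols below are ours, not taken from the paper), the logistic regression and the softmax formulation are the same model:

    \[
      P(\text{right}) = \sigma\big(\beta_{EV}\,\Delta EV + \beta_{U}\,\Delta U\big)
                      = \sigma\!\left(\frac{\Delta EV + b\,\Delta U}{T}\right),
      \qquad T = \frac{1}{\beta_{EV}}, \qquad b = \frac{\beta_{U}}{\beta_{EV}}.
    \]

    Random exploration (a larger softmax noise T) therefore appears as a smaller expected value weight, and directed exploration (the uncertainty bonus b) as the ratio of the uncertainty weight to the expected value weight, consistent with the legend.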

    S4 Fig. The same model as in S1 Fig but using the amount of available information on each side rather than the uncertainty.

    All conventions are the same as in S1 Fig. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

    (TIFF)

    S5 Fig. Full model fit of the model predicting choice of the right option on screen during subsequent choices in the long horizon (choices 2–4; shown in Fig 3E and described in detail in the Materials and methods section).

    (A) Predictors are, from left to right: intercept (i.e., a side bias), repetition bias (RB), the change in expected value between the right and left option revealed by choices made during this horizon, compared to the initial expected value for this horizon, i.e., the baseline (deltaERchosen), the change in expected value between the right and left option revealed by feedback about the unchosen option, compared to the initial expected value for this horizon (deltaERcounterfactual), the difference in initial expected value between the right and left option available, i.e., the expected value difference at first choice (baselineER), the change in uncertainty between the right and left option revealed by choices made during this horizon, compared to the initial uncertainty for this horizon (deltaUchosen), the change in uncertainty between the right and left option revealed by feedback about the unchosen option, compared to the initial uncertainty for this horizon (deltaUcounterfactual), the difference in initial uncertainty between the right and left option available, i.e., the uncertainty difference at first choice (baselineU), and the difference in how often the right versus the left option has been chosen during this horizon (deltaChosen). A sketch of how the delta regressors are constructed follows this figure entry. All other conventions are the same as in S1 Fig, also for panels B-D. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

    (TIFF)
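
    To make the construction of the delta regressors concrete, here is a minimal Python sketch for one long-horizon game (our illustration, not the authors' code; the running estimates of the right-minus-left expected value are hypothetical numbers):

    import numpy as np

    # Hypothetical running estimates of EV(right) - EV(left) across the four
    # choices of a long horizon, tracked separately by feedback source.
    baseline_ev_diff = 0.8                                   # at the first choice
    ev_diff_chosen = np.array([0.8, 1.1, 0.9, 1.3])          # updated by chosen feedback only
    ev_diff_counterfactual = np.array([0.8, 0.8, 0.5, 0.4])  # updated by unchosen feedback only

    # Regressors for the subsequent choices (choices 2-4): the change relative
    # to the first-choice baseline, split by feedback source.
    deltaERchosen = ev_diff_chosen[1:] - baseline_ev_diff
    deltaERcounterfactual = ev_diff_counterfactual[1:] - baseline_ev_diff
    print(deltaERchosen, deltaERcounterfactual)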

    S6 Fig

    (A) Expected value of the chosen option. Without a mask, and when taking the activity before the choice in all trials (not just first choice trials), we observed large activations related to the expected value of the chosen option (which is the same as the chosen action in our task) spanning the motor cortex/somatosensory cortex, the dlPFC, the OFC, and the striatum, as well as an inverted signal in the visual areas (cluster p < 0.05, cluster-forming threshold of z > 2.3). (B) In the partial and complete feedback conditions in our VOI, and when taking the activity before the first choice only, we found 1 cluster of activity related to the inverse of the magnitude of the uncertainty about the chosen option in the right medial prefrontal cortex (24c and 9m) that extended bilaterally into the frontal pole (10mr) (cluster p < 0.05, cluster-forming threshold of z > 2.3). Data to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. dlPFC, dorsolateral prefrontal cortex; OFC, orbitofrontal cortex; VOI, volume of interest.

    (TIFF)

    S7 Fig. Outcome prediction error and magnitude in the partial feedback condition.

    (A) In the partial feedback condition and at the time of outcome, we found 3 clusters of activity in our VOI that were positively modulated by the chosen option prediction error, in the medial prefrontal cortex and bilaterally in the somatosensory and motor cortex (cluster p < 0.05, cluster-forming threshold of z > 2.3). (B) We found the same 3 clusters when we looked for a positive modulation by the magnitude of the chosen outcome. We additionally found 1 cluster of activity in the right lateral OFC that was negatively modulated by the magnitude of the chosen outcome. (C) When we time-locked our search to the onset of the reward (1 s after the display of the outcome, with a different GLM), we found the same clusters as in (A), as well as the classic prediction error-related activity in the ventral striatum and a negative prediction error signal in visual areas (see full map at https://doi.org/10.5281/zenodo.7464572) at the whole brain level. Data to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. GLM, general linear model; OFC, orbitofrontal cortex; VOI, volume of interest.

    (TIFF)

    S8 Fig. Chosen and unchosen outcome magnitude in the complete feedback condition.

    (A) In complete feedback sessions only, we found clusters of activity inversely modulated by the chosen outcome magnitude in the right lOFC (47/12o), bilaterally in the vlPFC, and in 2 clusters in the somatosensory/motor cortex [3]. (B) We found a cluster of activity inversely modulated by the unchosen outcome magnitude in the cOFC and mOFC, and in the vlPFC. Data to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. cOFC, central orbitofrontal cortex; lOFC, lateral orbitofrontal cortex; mOFC, medial orbitofrontal cortex; vlPFC, ventrolateral prefrontal cortex.

    (TIFF)

    S1 Table. Tables showing the peaks of all significant clusters found within our frontal masks that are reported in the main text.

    Coordinates are given in the F99 standard space.

    (TIFF)

    Attachment

    Submitted filename: Response_to_reviewers_final.docx

    Attachment

    Submitted filename: Response_to_editor_final.docx

    Data Availability Statement

    The behavioral data, code to reproduce the figures shown in the manuscript and supplementary materials, and statistical fMRI maps are available at: https://doi.org/10.5281/zenodo.7464572.

