eLife. 2021 Jul 13;10:e68943. doi: 10.7554/eLife.68943

Value signals guide abstraction during learning

Aurelio Cortese 1,2, Asuka Yamamoto 1,3, Maryam Hashemzadeh 4, Pradyumna Sepulveda 2, Mitsuo Kawato 1,5, Benedetto De Martino 2
Editors: Thorsten Kahnt 6, Michael J Frank 7
PMCID: PMC8331191  PMID: 34254586

Abstract

The human brain excels at constructing and using abstractions, such as rules or concepts. Here, in two fMRI experiments, we demonstrate a mechanism of abstraction built upon the valuation of sensory features. Human volunteers learned novel association rules based on simple visual features. Reinforcement-learning algorithms revealed that, with learning, high-value abstract representations increasingly guided participant behaviour, resulting in better choices and higher subjective confidence. We also found that the brain area computing value signals – the ventromedial prefrontal cortex – prioritised and selected latent task elements during abstraction, both locally and through its connection to the visual cortex. Such a coding scheme predicts a causal role for valuation. Hence, in a second experiment, we used multivoxel neural reinforcement to test for the causality of feature valuation in the sensory cortex, as a mechanism of abstraction. Tagging the neural representation of a task feature with rewards evoked abstraction-based decisions. Together, these findings provide a novel interpretation of value as a goal-dependent, key factor in forging abstract representations.

Research organism: Human

Introduction

‘All art is an abstraction to some degree.’ Henry Moore

Art is one of the best examples of abstraction, the unique ability of the human mind to organise information beyond the immediate sensory reality. Abstraction is by no means restricted to high-level cognitive behaviour such as art creation. It envelops every aspect of our interaction with the environment. Imagine that you are hiking in a park and need to cross a stream. Albeit deceptively simple, this scenario already requires the processing of a myriad of visual (and auditory, etc.) features. For an agent that operates directly on each feature in this complex sensory space, any meaningful behavioural trajectory (such as crossing the stream) would quickly involve intractable computations. This is well exemplified in reinforcement learning (RL), where in complex and/or multidimensional problems, classic RL algorithms rapidly collapse (Bellman, 1957; Kawato and Samejima, 2007; Sutton and Barto, 1998). If, on the other hand, the agent is able to first ‘abstract’ the current state to a lower dimensional manifold, representing only relevant features, behaviour becomes far more flexible and efficient (Ho et al., 2019; Konidaris, 2019; Niv, 2019). Attention (Farashahi et al., 2017; Leong et al., 2017; Niv et al., 2015) and, more generally, the ability to act upon subspaces, concepts, or abstract representations have been proposed as effective solutions to overcome the computational bottlenecks arising from sensory-level operations in RL (Cortese et al., 2019; Hashemzadeh et al., 2019; Ho et al., 2019; Konidaris, 2019; Wikenheiser and Schoenbaum, 2016). Abstractions can thus be thought of as simplified maps carved from a higher dimensional space, in which details have been removed or transformed in order to focus on a subset of interconnected features, that is, a higher order concept, category, or schema (Gilboa and Marlatte, 2017; Mack et al., 2016).

How are abstract representations constructed in the human brain? For flexible deployment, abstraction should depend on task goals. From a psychological or neuroeconomic point of view, task goals generally determine what is valuable (Kobayashi and Hsu, 2019; Liu et al., 2017; McNamee et al., 2013), such that if one needs to light a fire, matches are much more valuable than a glass of water. Hence, we hypothesised that valuation processes are directly related to abstraction.

Value representations have been linked to neural activity in the ventromedial prefrontal cortex (vmPFC) in the context of economic choices (McNamee et al., 2013; Padoa-Schioppa and Assad, 2006). More recently, the role of the vmPFC has also been extended to the computation of confidence (De Martino et al., 2013; Gherman and Philiastides, 2018; Lebreton et al., 2015; Shapiro and Grafton, 2020). While this line of work has been extremely fruitful, it has mostly focused on the hedonic and rewarding aspects of value rather than its broader functional role. In the field of memory, a large corpus of work has shown that the vmPFC is crucial for the formation of schemas or conceptual knowledge (Constantinescu et al., 2016; Gilboa and Marlatte, 2017; Kumaran et al., 2009; Mack et al., 2016; Tse et al., 2007), as well as for generalisation (Bowman and Zeithamova, 2018). The vmPFC also collates goal-relevant information from elsewhere in the brain (Benoit et al., 2014). Considering its connectivity pattern (Neubert et al., 2015), the vmPFC is well suited to serve a pivotal function in the circuit involving the hippocampal formation (HPC) and the orbitofrontal cortex (OFC), dedicated to extracting latent task information and regularities important for pursuing behavioural goals (Niv, 2019; Schuck et al., 2016; Stachenfeld et al., 2017; Viganò and Piazza, 2020; Wilson et al., 2014). Thus, the aim of this study is twofold: (i) to demonstrate that abstraction emerges during the course of learning, and (ii) to investigate how the brain, and specifically the vmPFC, uses valuation of low-level sensory features to forge abstract representations.

To achieve this, we leveraged a task in which human participants repeatedly learned novel association rules while their brain activity was recorded with fMRI. Reinforcement learning (RL) modelling allowed us to track participants’ valuation processes and to dissociate their learning strategies (at both the behavioural and neural levels) based on the degree of abstraction. Participants’ confidence in having performed the task well was positively correlated with their ability to abstract. In a second experiment, we studied the causal role of value in promoting abstraction through its directed effect on sensory cortices. To anticipate our results, we show that the vmPFC and its connection to the visual cortex construct abstract representations through a goal-dependent valuation process that is implemented as top-down control of sensory cortices.

Results

Experimental design

The goal of the learning task was to present a problem that could be solved with either of two strategies, depending on the sampled task-space dimensionality: a simple, slower strategy akin to pattern recognition, and a more sophisticated one that required abstraction to exploit the underlying structure. Participants (N = 33) learned the fruit preference of pacman-like characters formed by the combination of three visual features (colour, mouth direction, and stripe orientation, Figure 1A–B). The preference was governed by a combination of two features, selected randomly by our computer program for each block (Figure 1A–B). Learning the block rules essentially required participants to uncover hidden associations between features and fruits. Although participants were instructed that one feature was irrelevant, they did not know which. A block ended when a sequence of 8–12 (randomly set by our computer program) correct choices was detected, or upon reaching its upper limit (80 trials). Variable stopping criteria were used to prevent participants from learning that a fixed sequence was predictive of block termination. During each trial, participants could see the outcome after selecting a fruit: a green box appeared around the chosen fruit if the choice was correct (red otherwise). Additionally, participants were instructed that the faster they learned a block rule, the higher the reward. At the end of the session, a final monetary reward was delivered, commensurate with participant performance (see Materials and methods). Participants failing to learn the association in three or more blocks (i.e. reaching the block limit of 80 trials without having learned the association), and/or failing to complete more than 10 blocks in the allocated time, were excluded (see Materials and methods). All main results reported in the paper are from the included sample of N = 33 participants.
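To make the design concrete, the following minimal Python sketch illustrates the block-generative structure described above; the feature names, fruit labels, and the even split between association types are our illustrative assumptions, not the study's actual stimulus code.

```python
import itertools
import random

# Illustrative features with two levels each; the third, left-out feature
# is irrelevant to the block rule by construction.
FEATURES = {"colour": ["red", "green"],
            "mouth": ["left", "right"],
            "stripes": ["vertical", "horizontal"]}

def new_block_rule():
    """Pick two relevant features at random; map their four combinations
    onto two fruits, either symmetrically (2x2) or asymmetrically (3x1)."""
    relevant = random.sample(sorted(FEATURES), 2)
    combos = list(itertools.product(*(FEATURES[f] for f in relevant)))
    fruits = ["fruit1", "fruit1", "fruit2", "fruit2"]      # symmetric 2x2
    if random.random() < 0.5:
        fruits = ["fruit1", "fruit1", "fruit1", "fruit2"]  # asymmetric 3x1
    random.shuffle(fruits)
    return relevant, dict(zip(combos, fruits))

def preferred_fruit(stimulus, relevant, rule):
    """The correct fruit depends only on the two relevant features."""
    return rule[tuple(stimulus[f] for f in relevant)]
```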

Figure 1. Learning task and behavioural results.

(A) Task: participants learned the fruit preferences of pacman-like characters, which changed on each block. (B) Associations could form in three ways: colour – stripe orientation, colour – mouth direction, and stripe orientation – mouth direction. The left-out feature was irrelevant. Examples of the two types of fruit associations are shown. The four combinations arising from two features with two levels were divided into symmetric (2x2) and asymmetric (3x1) cases. f1-3: features 1 to 3; fruit:rule refers to the fruit as being the association rule. Both block types were included to prevent participants from learning rules by simple deduction. If all blocks had symmetric association rules and participants knew this, they could simply learn one feature-fruit association (e.g. green-vertical) and from there deduce all other combinations. Both the relevant features and the association types varied on a block-by-block basis. (C) Trial-by-trial ratio correct improved, as a measure of within-block learning. Dots represent the mean across participants, error bars indicate the SEM, and the shaded area represents the 95% CI (N = 33). Participant-level ratio correct was computed for each trial across all completed blocks. Source data are available in file Figure 1—source data 1. (D) Learning speed was positively correlated with time, among participants. Learning speed was computed as the inverse of the max-normalised number of trials taken to complete a block. Thin gray lines represent least square fits of individual participants, while the black line represents the group-average fit. The correlation was computed with group-averaged data points (N = 11). Average data points are plotted as coloured circles; the error bars are the SEM. (E) Confidence judgements were positively correlated with learning speed, among participants. Each dot represents data from one participant, and the thick line indicates the regression fit (N = 31 [2 missing data]). The experiment was conducted once (n = 33 biologically independent samples), **p<0.01.

Figure 1—source data 1. Csv: panel C.
Figure 1—source data 2. Csv: panel D.
Figure 1—source data 3. Csv: panel E.


Figure 1—figure supplement 1. Small (non-significant) trend-level influence of block/association type on learning speed.


Average learning speed was computed by pooling block-wise learning speed from all participants for each block or association type. None of the pairwise tests survived multiple comparison correction (FDR). Bars represent the population mean, error bars the SEM.
Figure 1—figure supplement 2. Behaviour analysis of excluded participants.


Excluded participants made more mistakes overall (lower ratio of correct responses). Wilcoxon rank sum test, z = 2.76, p = 0.006.

Behavioural accounts of learning

We verified that participants learned the task sensibly. Within blocks, performance was higher than chance as early as the second trial (Figure 1C, one-sample t-test against a mean of 0.5, trial 2: t32 = 4.13, P(FDR) < 10⁻³, trial 3: t32 = 2.47, P(FDR) = 0.014, all trials > 3: P(FDR) < 10⁻³). Considering the whole experimental session, learning speed (i.e. how quickly participants completed a given block) increased significantly across blocks (Figure 1D, N = 11 time points, Pearson’s r = 0.80, p = 0.003). These results confirmed not only that participants learned the task rule in each block, but also that they learned to use more efficient strategies. Notably, in this task, the only way to solve blocks faster was to use the correct subset of dimensions (the abstract representation). When, at the end of a session, participants were asked about their degree of confidence in having performed the task well, their self-reports correlated with their learning speed (N = 31 [2 missing data], robust regression slope = 0.024, t29 = 3.27, p = 0.003, Figure 1E), but not with the overall number of trials or the product of the proportion of successes (learning speed: Pearson’s r = 0.53, p = 0.002; total trials: r = −0.13, p = 0.47, test for difference in r: z = 2.71, p = 0.007; product of the proportion of successes: r = −0.06, p = 0.75, test for difference in r: z = 2.43, p = 0.015). We also confirmed that the block type (defined by the relevant features, e.g., colour-orientation) and association type (e.g. symmetric 2x2) did not systematically affect learning speed (Figure 1—figure supplement 1). Excluded participants (see Materials and methods) had overall lower performance (Figure 1—figure supplement 2), although some had comparable ratios correct.

Discovery of abstract representations

Was participants’ learning behaviour guided by the selection of accurate representations? To answer this question, we built upon a classic RL algorithm (Q-learning; Watkins and Dayan, 1992), in which state-action value functions (beliefs) used to predict future rewards are updated according to the task state of a given trial and the action outcome. In this study, task states were defined by the number of feature combinations that the agent may track; hence, we devised algorithms that differed in their state-space dimensionality. The first algorithm, called Feature RL, explicitly tracked all combinations of the three features, 2³ = 8 states (Figure 2A, top left). This algorithm is anchored at a low feature level because each combination of the three features results in a unique fingerprint – one simply learns direct pairings between visual patterns and fruits (actions). Conversely, the second algorithm, called Abstract RL, used a more compact, abstract state representation in which only two features are tracked. These compressed representations halve the explored state-space, 2² = 4 states (Figure 2A, top right). Importantly, in this task environment as many as three Abstract RL experts could run in parallel, one for each combination of two features.
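As an illustration of this difference in state-space dimensionality, here is a minimal sketch (our own, not the study's code) of how the two expert types could encode states and apply the standard Q-learning delta-rule update:

```python
import numpy as np

# Each feature is binary (0/1). Feature RL tracks all 2**3 = 8 patterns;
# an Abstract RL expert tracks only the 2**2 = 4 states of two features.
N_ACTIONS = 2  # two fruits

def feature_state(stim):
    """Full sensory state: a unique index for each of the 8 patterns."""
    f1, f2, f3 = stim
    return f1 * 4 + f2 * 2 + f3

def abstract_state(stim, kept):
    """Compressed state: an index over the two tracked features only."""
    return stim[kept[0]] * 2 + stim[kept[1]]

Q_feature = np.zeros((8, N_ACTIONS))                       # Feature RL
Q_abstract = [np.zeros((4, N_ACTIONS)) for _ in range(3)]  # one per pair

def q_update(Q, s, a, r, alpha):
    """Delta-rule (Q-learning) update; the paper's experts also carry a
    forgetting factor, omitted here for brevity. Returns the RPE."""
    rpe = r - Q[s, a]
    Q[s, a] += alpha * rpe
    return rpe
```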

Figure 2. Mixture of reinforcement learning (RL) experts and value computation.

(A) Outline of the representational spaces of each RL algorithm comprising the mixture-of-experts architecture. (B) Illustration of the model architecture. See Materials and methods for a formal description of the model. All experts had the same number of hyperparameters: the learning rate α (how much the latest outcome affected agent beliefs), the forgetting factor γ (how much prior RPEs influenced current decisions), and the RPE variance ν, modulating the sharpness with which the mixture-of-experts RL model should favour the best performing algorithm in the current trial. (C) The approach used for data analysis and model simulation. The model was first fitted to participant data with Hierarchical Bayesian Inference (Piray et al., 2019). Estimated hyperparameters were used to compute value functions of participant data, as well as to generate new, artificial choice data and to compute simulated value functions. (D) Averaged expected value across all states for the chosen action in each RL expert, as well as the responsibility signal for each model. Left: simulated data; right: participant empirical data. Dots represent individual agents (left) or participants (right). Bars indicate the mean and error bars depict the SEM. Statistical comparisons were performed with two-sided Wilcoxon signed rank tests. ***p<0.001. AbRL: Abstract RL, FeRL: Feature RL, AbRLw1: wrong-1 Abstract RL, AbRLw2: wrong-2 Abstract RL. (E) RPE variance was negatively correlated with learning speed (outliers removed, N = 29). Dots represent individual participant data. The thick line shows the linear regression fit. The experiment was conducted once (n = 33 biologically independent samples), * p<0.05.

Figure 2—source data 1. Csv: panel D, mean expected value, model.
Figure 2—source data 2. Csv: panel D, mean expected value, subjects.
Figure 2—source data 3. Csv: panel D, lambda, model.
Figure 2—source data 4. Csv: panel D, lambda, subjects.
Figure 2—source data 5. Csv: panel E.


Figure 2—figure supplement 1. Model comparison (accounting for model complexity).


Mixture-of-Experts RL (MoE-RL), Feature RL (FeRL), and Abstract RL (AbRL) were fit and compared to each other with Hierarchical Bayesian Inference [HBI] (Piray et al., 2019), at the single participant and block level. (A) Percentage of participants in which the target model best fitted the choice data. The percentage was computed for each block separately (in time, block 1, 2, etc.), then averaged. Bars represent the mean, error bars the SEM. (B) Percentage of blocks (all participants and blocks pooled) in which the target model best fitted the choice data. Bars represent the percentage. (C) Block-specific breakdown of the number of participants for which the MoE-RL, the FeRL, or the AbRL provided the best fit.

The above four RL algorithms were combined in a mixture-of-experts architecture (Frank and Badre, 2012; Jacobs et al., 1991; Sugimoto et al., 2012), Figure 2B and Materials and methods. The key intuition behind this approach was that at the beginning of a new block, the agent did not know which abstract representation was correct (i.e., which features were relevant). Thus, the agent needed to learn which representations were most predictive of reward, so as to exploit the best representation for action selection. Experts here denote the four learning algorithms (Feature RL and the three variants of Abstract RL). While all experts competed for action selection, their learning uncertainty (RPE: reward prediction error) determined the strength of their contribution (Doya et al., 2002; Sugimoto et al., 2012; Wolpert and Kawato, 1998). This architecture allowed us to track the value function of each RL expert separately, while a single, global action was selected on each trial.
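For intuition, a minimal sketch of MOSAIC-style responsibility weighting (Wolpert and Kawato, 1998; Doya et al., 2002): each expert's recent RPE is converted into a normalised weight via a Gaussian likelihood with variance ν. The exact functional form used in the paper may differ; this is an illustration, not the study's code.

```python
import numpy as np

def responsibilities(rpes, nu):
    """Turn each expert's latest reward prediction error into a weight:
    a Gaussian likelihood with variance nu, normalised across experts.
    Smaller nu -> sharper selection of the currently best expert."""
    lik = np.exp(-np.square(rpes) / (2.0 * nu))
    return lik / lik.sum()

def global_action_values(q_rows, lam):
    """Blend each expert's action values (for its own current state) by
    its responsibility lam, yielding one global action-value vector."""
    return lam @ q_rows

# Example: four experts (Feature RL + three Abstract RL), two fruits.
lam = responsibilities(np.array([0.1, 0.8, 0.7, 0.9]), nu=0.25)
q_rows = np.array([[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [0.5, 0.5]])
print(global_action_values(q_rows, lam))  # feeds the action-selection step
```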

Estimated hyperparameters (learning rate α, forgetting factor γ, RPE variance ν) were used to compute value functions of participant data, as well as to generate new, artificial choice data and value functions (Figure 2C, and Materials and methods). Simulations indicated that expected value and responsibility were highest for the appropriate Abstract RL, followed by Feature RL, with the two Abstract RLs based on irrelevant features lowest (Figure 2D). Participant empirical data displayed the same pattern, whereby the value function and responsibility signal of the appropriate Abstract RL were higher than those of the other RL algorithms (Figure 2D, right side). Note that the large difference between the appropriate Abstract RL and Feature RL arose because the appropriate Abstract RL was an ‘oracle’: it had access to the correct low-dimensional state from the beginning. The RPE variance (hyperparameter ν) adjusted the sharpness with which each RL’s (un)certainty was considered for expert weighting. Crucially, the variance ν was associated with participant learning speed, such that participants who learned block rules quickly were sharper in expert selection (Figure 2E, N = 29, robust regression slope = −1.02, t27 = −2.59, p = 0.015). These modelling results provided a first layer of support for the hypothesis that valuation is related to abstraction.

Behaviour shifts from Feature- to Abstraction-based reinforcement learning

The mixture-of-experts RL model revealed that participants who learned faster relied more on the value representations of the best RL model. Further, the modelling established that choices were mostly driven by either the appropriate Abstract RL or Feature RL, which had higher expected values (though the other Abstract RLs had mean values above the null level of 0.5) and higher responsibility λ. It is important to highlight, though, that the mixture-of-experts RL might not reflect the actual algorithmic computation used by participants in this task; rather, it provides a conceptual solution to the arbitration between representations/strategies. Model comparison showed that Abstract RL and Feature RL in many cases offered a more parsimonious description of participants’ behaviour. This is unsurprising, since Feature RL is a simple model and Abstract RL is an oracle model that knows which features are relevant (see Figure 2—figure supplement 1 for a direct model comparison between mixture-of-experts RL, Feature RL, and Abstract RL). Hence, we next sought to explicitly explain participant choices and learning according to either a Feature RL or an Abstract RL strategy. Given the task space (Figure 2A), the only way to solve a block rule faster was to use abstract representations. As such, we expected a shift from Feature RL toward Abstract RL to occur with learning.

Both algorithms had two hyperparameters: the learning rate α and the greediness β (inverse temperature, the strength with which expected values determine actions). Using the estimated hyperparameters, we generated new, synthetic data to evaluate how fast artificial agents implementing either Feature RL or Abstract RL solved the learning task (see Materials and methods). The simulations confirmed that Feature RL was slower and less efficient (Figure 3A), yielding lower learning speeds and a higher percentage of failed blocks.
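The following sketch shows what such a simulation could look like under our own simplifying assumptions (softmax action selection with greediness β, delta-rule updates with learning rate α, and a simplified reading of the Methods' state-noise injection):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta):
    """Greedier action selection for larger beta (inverse temperature)."""
    p = np.exp(beta * (q - q.max()))
    return p / p.sum()

def simulate_block(n_states, alpha, beta, correct, max_trials=80,
                   stop_streak=10, state_noise=0.0):
    """One agent, one block. Feature RL corresponds to n_states=8,
    Abstract RL to n_states=4. With probability state_noise the update
    hits a random state, our simplified reading of the noise injection.
    Returns the number of trials taken (a proxy for learning speed)."""
    Q = np.full((n_states, 2), 0.5)
    streak = 0
    for t in range(max_trials):
        s = rng.integers(n_states)
        a = rng.choice(2, p=softmax(Q[s], beta))
        r = 1.0 if a == correct[s] else 0.0
        s_upd = rng.integers(n_states) if rng.random() < state_noise else s
        Q[s_upd, a] += alpha * (r - Q[s_upd, a])
        streak = streak + 1 if r else 0
        if streak >= stop_streak:
            return t + 1
    return max_trials  # block failed

# e.g. compare mean trials taken for n_states=8 vs n_states=4:
# np.mean([simulate_block(8, 0.3, 5.0, rng.integers(2, size=8))
#          for _ in range(1000)])
```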

Figure 3. Feature RL versus Abstract RL use is related to learning speed, and the use of abstraction increases with experience.

(A) Simulated learning speed and % of failed blocks for both Abstract RL and Feature RL. To make simulations more realistic, arbitrary noise was injected into the simulation, altering the state (see Materials and methods). N = 100 simulations of 45 agents. Right plot: bars represent the mean, error bars the SEM. (B) The relationship between the block-by-block best-fitting model and participants' learning speed. Each dot represents one block from one participant, with data aggregated across all participants. Note that some dots fall beyond p = 1 or p = 0. This occurs because dots were scattered with noise in their x-y coordinates for better visualisation. (C) Between-participant correlations. Top: abstraction level vs learning speed. The abstraction level was computed as the average over all blocks completed by a given participant (code: Feature RL = 0, Abstract RL = 1). Bottom: confidence vs abstraction level. Dots represent individual participants (top: N = 33, bottom: N = 31, some dots are overlapping). (D) Learning rate was not symmetrically distributed across the two algorithms. (E) Greediness was not symmetrically distributed across the two algorithms. For both (D and E), each dot represents one block from one participant, with data aggregated across all participants. Histograms represent the distribution of data around the midline. (F) The number of participants for which Feature RL or Abstract RL best explained their choice behaviour in the first and last blocks of the experimental session. (G) Abstraction level computed separately with blocks from the first half (early) and latter half (late) of the session. (H) Participant counts for the best-fitting model in each block. The experiment was conducted once (n = 33 biologically independent samples), * p<0.05, ** p<0.01, *** p<0.001.

Figure 3—source data 1. Csv: panel A left, model simulations histogram of learning speed.
Figure 3—source data 2. Csv: panel A right, model simulations % failed blocks.
Figure 3—source data 3. Csv: panel B, scatter plot of model probabilities.
Figure 3—source data 4. Csv: panel B, violin plot of proportion Abstract RL.
Figure 3—source data 5. Csv: panel C.
Figure 3—source data 6. Csv: panel D.
Figure 3—source data 7. Csv: panel E.
Figure 3—source data 8. Csv: panels F and G.
Figure 3—source data 9. Csv: panel H.


Figure 3—figure supplement 1. Comparison of learning rate α and greediness β in Feature RL and Abstract RL best-fitting and worst-fitting blocks.


(A) The comparison was done across blocks for which the model provided the better fit. Here, the learning rate and greediness in Feature RL blocks were lower than in Abstract RL blocks. Statistical comparisons were performed with two-sided Wilcoxon rank-sum tests. Learning rate: z = −3.88, p = 0.0001; greediness: z = −8.69, p = 3.5 × 10⁻¹⁸. (B) The comparison was done across blocks for which the model provided the worse fit. The pattern of results reversed, with lower learning rate and greediness for the Abstract RL model. Statistical comparisons were performed with two-sided Wilcoxon rank-sum tests. Learning rate: z = 13.65, p = 2 × 10⁻⁴²; greediness: z = 12.91, p = 4.1 × 10⁻³⁸. This discrepancy likely arises because, in blocks best fit by Abstract RL, Feature RL requires very high learning rates / greediness to accommodate behaviour within fewer trials. Conversely, because blocks best fit by Feature RL also tend to have lower learning speed (more trials taken to complete), Abstract RL must display much lower learning rates, since the horizon is longer (but with few states). Bars correspond to the mean, error bars the SEM.
Figure 3—figure supplement 2. Abstraction index for single blocks and expected value for the chosen action in Abstract RL and Feature RL.


(A) Abstraction index was computed by allocating a value of 0 to blocks labelled as ‘Feature RL’ and 1 to blocks labelled as ‘Abstract RL’, for each participant. This was done for the first 11 blocks, which were completed by all participants. The plot indicates that, at the group level, there was a tendency towards increasingly higher abstraction later in time. Each dot represents the population mean, error bars the SEM. The least square line fit, robust regression, and p-value were computed on the 11 mean data points. Robust regression slope = 0.022, t31 = 2.34, p = 0.044. (B) Mean expected value computed from all blocks, when fitted with either Abstract RL or Feature RL. (C) Mean expected value computed from the best fitting algorithm on a given block. That is, Abstract RL (resp. Feature RL) refers to the expected value for the chosen action in blocks where Abstract RL (resp. Feature RL) was the best fitting algorithm. Two (B) and one (C) outliers were removed. Each coloured dot represents the average expected value for the chosen options for a single participant - in either Abstract RL (grey) or Feature RL (cyan). Shaded areas represent the density plot, the central white dot the median, the dark central bar the interquartile range, and thin dark lines the lower and upper adjacent values. (B) Two-sided t-test, t30 = 35.66, p = 4.03 × 10⁻²⁶. (C) Two-sided t-test, t31 = 2.31, p = 0.028.
Figure 3—figure supplement 3. Parameter recovery.


We first simulated choice data through the models, using the best-fitting parameters. Simulated data was then fed again to the fitting procedure using HBI, separately for each model, in the presence of noise (the update was sometimes not done for the real, correct state but rather for an alternative, random state). Parameters were recovered for each participant, block, and model. Recovered and original parameter values were then pooled across the models and plotted. Values were normalized in the interval [0, 1] for visualisation purposes (note that the normalisation does not affect the strength of the correlation).
Figure 3—figure supplement 4. Strategy analysis of excluded participants.


Excluded participants tended to rely more often on a Feature RL strategy. Binomial test, Feature RL: base rate of 0.44 calculated from the main participant group, P(57|103) = 0.029. Binomial test, Abstract RL: base rate of 0.52 calculated from the main participant group, P(41|103) = 0.014.

Model comparison at the single participant and block levels (Piray et al., 2019) provided a direct way to infer which algorithm was more likely to explain learning in any given block. Overall, similar proportions of blocks were classified as Feature RL and Abstract RL, indicating that participants used both learning strategies (binomial test applied to all blocks: proportion of Abstract RL = 0.47 vs. an equal rate of 0.5, P(212|449) = 0.26, Figure 3B; two-sided t-test of participant-level proportions: slightly but significantly below 0.5, t32 = −2.87, p = 0.007, Figure 3B inset).
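As an aside, the P(k|n) notation used throughout denotes a binomial test of k Abstract RL blocks out of n total; for example, the first test above can be reproduced with scipy:

```python
from scipy.stats import binomtest

# 212 Abstract RL blocks out of 449, against an equal base rate of 0.5
# (two-sided); this should reproduce the reported p-value of about 0.26.
print(binomtest(k=212, n=449, p=0.5).pvalue)
```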

As suggested by the simulations (Figure 3A), the strategy that best explained participant block data accounted for the distribution of learning speed measures in each block. Where learning proceeded slowly, Feature RL was consistently predominant (Figure 3B), while the reverse happened in blocks where participants displayed fast learning (Figure 3B). Among participants, the degree of abstraction (propensity to use Abstract RL) correlated with the empirical learning speed (N = 33, robust regression, slope = 0.52, t31 = 4.56, p = 7.64 × 10⁻⁵, Figure 3C top). Participant confidence in having performed the task well was also significantly correlated with the degree of abstraction (N = 31, robust regression, slope = 0.026, t29 = 2.69, p = 0.012, Figure 3C, bottom). In addition to the finding that confidence related to learning speed (Figure 1E), these results raise intriguing questions about the function of metacognition, as participants appeared to comprehend their own ability to construct and use abstractions (Cortese et al., 2020).

The two RL algorithms revealed a second aspect of learning. Considering all blocks regardless of fit (paired comparison), Feature RL had higher learning rates α than Abstract RL (two-sided Wilcoxon signed rank test against median 0, z = 14.33, p < 10⁻³⁰, Figure 3D). A similar asymmetry was found for greediness (Figure 3E, two-sided Wilcoxon signed rank test against median 0, z = 7.14, p < 10⁻¹⁰). Yet, considering only the model (Feature RL or Abstract RL) that provided the best fit on a given block, Feature RL displayed lower learning rates and greediness (Figure 3—figure supplement 1A). The pattern inverted entirely when considering the model that provided the worst fit: higher learning rates and greediness for Feature RL (Figure 3—figure supplement 1B). These differences can be explained intuitively as follows. In Feature RL, exploration of the task state-space takes longer: in short blocks (best fit by the Abstract RL strategy), a higher learning rate is necessary for the Feature RL agent to make larger updates on states that are infrequently visited. The results also suggest that action selection follows the same principle, being more deterministic in blocks best fit by Abstract RL (i.e. larger β for shorter blocks).

We predicted that the use of abstraction should increase with learning, because the brain has to construct abstractions in the first place and must initially rely on Feature RL. To test this hypothesis, we quantified the number of participants using a Feature RL or Abstract RL strategy in their first and last blocks. On their first block, most participants relied on Feature RL, while the pattern reversed in the last block (two-sided sign test, z = −2.77, p = 0.006, Figure 3F). Computing the abstraction level separately for early and late blocks (median split of the session) also revealed higher abstraction in late blocks (two-sided sign test, z = −2.94, p = 0.003, Figure 3G). These effects were complemented by two block-by-block analyses displaying an increase in abstraction from early to late blocks (Figure 3H, and Figure 3—figure supplement 2A).

Supporting the current modelling framework, the mean expected value of the chosen action was higher for Abstract RL (Figure 3—figure supplement 2B–C), and model hyperparameters could be recovered in the presence of noise (Figure 3—figure supplement 3; see Materials and methods) (Palminteri et al., 2017). Given the lower learning speed in excluded participants, the distribution of strategies was also different among them, with a higher ratio of Feature RL blocks (Figure 3—figure supplement 4).

The role of vmPFC in constructing goal-dependent value from sensory features

The computational approach confirmed that participants relied on both a low-level feature strategy and a more sophisticated abstract strategy (i.e. Feature RL and Abstract RL; Figures 2D and 3B). Besides showing that abstract representations were generally associated with higher expected value, the modelling approach allowed us to explicitly classify trials as belonging to either learning strategy. Here, our goal was to dissociate the neural signatures of these distinct learning strategies in order to show how abstract representations are constructed by the human brain.

First, we reasoned that an anticipatory value signal might emerge in the vmPFC at stimulus presentation (Knutson et al., 2005). We performed a general linear model (GLM) analysis of the neuroimaging data with regressors for ‘High-value’ and ‘Low-value’ trials, labelled by the block-level best fitting algorithm (Feature RL or Abstract RL), while controlling for other confounding factors such as time and strategy itself (see Materials and methods and Supplementary note 1 for the full GLM and regressor specification). As predicted, activity in the vmPFC strongly correlated with value magnitude (Figure 4A). That is, the vmPFC indexed the anticipated value constructed from pacman features at stimulus presentation time. We used this signal to functionally define, for ensuing analyses, the subregion of the vmPFC that was maximally related to task computations about value when pacman visual features were integrated. Concurrently, activity in the insular and dorsal prefrontal cortices increased under conditions of low expected value. This pattern of activity is consistent with previous studies on error monitoring and processing (Bastin et al., 2017; Carter et al., 1998; Figure 4—figure supplement 1).

Figure 4. Neural substrates of value construction during learning.

(A) Correlates of anticipated value at pacman stimulus presentation time. Trials were labelled according to a median split of the expected value for the chosen action, as computed by the best fitting model, Feature RL or Abstract RL, at the block level. Mass univariate analysis, contrast ‘High-value’ > ‘Low-value’. vmPFC peaks at [2 50 -10]. The statistical parametric map was z-transformed and plotted at p(FWE) < 0.05. (B) Psychophysiological interaction, using as seed a sphere (radius = 6 mm) centred around the participant-specific peak voxel, constrained within a 25 mm sphere centred around the group-level peak coordinate from the contrast in (A). The statistical parametric map was z-transformed and plotted at p(fpr) < 0.001 (one-sided, for the positive contrast - increased coupling). (C) The strength of the interaction between the vmPFC and VC was positively correlated with participants’ ability to learn block rules. Dots represent individual participant data points, and the line is the regression fit. The experiment was conducted once (n = 33 biologically independent samples), * p<0.05.

Figure 4—source data 1. Csv: panel C.


Figure 4—figure supplement 1. ‘Low value’ > ‘High value’ GLM contrast.


Neural correlates of (predicted) low value at visual stimulus presentation time. Trials were labelled according to a median split of the expected value for the chosen option, as computed by the best fitting model, at the participant and block level. The statistical parametric map was z-transformed, and false-positive rate (fpr) cluster-forming correction was applied. p(fpr) < 0.001, Z > 3.09.
Figure 4—figure supplement 2. Neuro-behavioural correlation between VC-vmPFC coupling and abstraction.


The strength of the interaction between the vmPFC and VC showed a weak positive association with the abstraction level across participants. Dots represent individual participant data points, and the line is the regression fit. The experiment was conducted once (n = 33 biologically independent samples), robust regression fit: N = 31, slope = 0.013, t29 = 1.56, p = 0.065 (one-sided).

In order for the vmPFC to construct goal-dependent value signals, it should receive relevant feature information from other brain areas, specifically from visual cortices given the nature of our task. Thus, we performed a psychophysiological interaction (PPI) analysis (Friston et al., 1997) to isolate regions in which functional coupling with the vmPFC at the time of stimulus presentation depended on the magnitude of expected value. Supporting the idea that the vmPFC based its predictions on the integration of visual features, only connectivity between the visual cortex (VC) and vmPFC was higher on trials that carried large expected value, compared to low-value trials (Figure 4B). Strikingly, the strength of this VC - vmPFC interaction was associated with the overall learning speed among participants (N = 31, robust regression, slope = 0.016, t29 = 2.55, p = 0.016, Figure 4C), such that participants with stronger modulation of the coupling between the vmPFC and VC also learned block rules faster. The strength of the vmPFC - VC coupling showed a non-significant trend with the level of abstraction (N = 31, robust regression, slope = 0.013, t29 = 1.56, p = 0.065 one-sided, Figure 4—figure supplement 2). However, this study was not optimised to detect between-subject correlations, which normally require larger samples; future work is required to confirm or falsify this result.
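Conceptually, a PPI reduces to testing an interaction regressor within a GLM. The simplified numpy sketch below illustrates the construction; a full PPI implementation would first deconvolve the seed signal to the neural level, a step omitted here, and the regressor layout is our own illustration rather than the study's pipeline.

```python
import numpy as np

def ppi_design(seed_ts, psych):
    """Simplified PPI design matrix. The interaction regressor is the
    product of the mean-centred vmPFC seed timecourse and the
    psychological variable (e.g. +1 for high-value, -1 for low-value
    trials), alongside the main effects and an intercept."""
    seed = seed_ts - seed_ts.mean()
    psych = psych - np.mean(psych)
    return np.column_stack([seed * psych, seed, psych, np.ones_like(seed)])

def ppi_effect(X, voxel_ts):
    """OLS fit; beta[0] is the value-dependent coupling (PPI) effect,
    which would be tested voxel-wise across the brain."""
    beta, *_ = np.linalg.lstsq(X, voxel_ts, rcond=None)
    return beta[0]
```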

A value-sensitive vmPFC subregion prioritises abstract elements

Having established that the vmPFC computes a goal-dependent value signal, we evaluated whether the activity level of this region was sensitive to the strategies that participants used. To do so, we used the same GLM introduced earlier and estimated two new statistical maps from the regressors ‘Abstract RL’ and ‘Feature RL’, while controlling for idiosyncratic features of the task, that is, high/low value and early/late trials (see Materials and methods and Supplementary note 1). We extracted the peak activity at the participant level, under Feature RL and Abstract RL conditions, in two regions-of-interest (ROI). Specifically, we focused on the vmPFC and the HPC, as both have been consistently linked with abstraction and with feature-based and conceptual learning. The HPC was defined anatomically (AAL atlas, Figure 5A top), while the vmPFC was defined as voxels sensitive to the orthogonal contrast ‘High value’ > ‘Low value’ from the same GLM (Figure 5A bottom). A linear mixed effects model (LMEM) with fixed effects ‘ROI’ and ‘strategy’ [LMEM: ‘y ~ ROI * strategy + (1|participants)’, y: ROI activity] revealed significant main effects of ‘ROI’ (t128 = 2.16, p = 0.033) and ‘strategy’ (t128 = 3.07, p = 0.003), and a significant interaction (t128 = −2.29, p = 0.024), illustrating different HPC and vmPFC recruitment (Figure 5B). Post-hoc comparisons showed that vmPFC activity levels clearly distinguished Feature RL from Abstract RL (LMEM: t64 = 2.94, p(FDR) = 0.009), while the HPC remained agnostic to the style of learning (LMEM: t64 = 0.62, p(FDR) = 0.54). Alternative explanations are unlikely: value labels were uncorrelated with strategy labels across trials, and task difficulty, as measured by reaction times, did not differ between strategies (Figure 5—figure supplement 2).
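The LMEM formula above is standard Wilkinson notation; as a sketch, an equivalent model can be fit with statsmodels (shown here on random stand-in data, since the extracted ROI activities are not reproduced in this section):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data: one row per participant x ROI x strategy, with 'y'
# holding the extracted peak activity (values here are synthetic).
rng = np.random.default_rng(1)
rows = [(p, roi, strat, rng.normal())
        for p in range(33)
        for roi in ("HPC", "vmPFC")
        for strat in ("FeatureRL", "AbstractRL")]
df = pd.DataFrame(rows, columns=["participant", "ROI", "strategy", "y"])

# Equivalent of 'y ~ ROI * strategy + (1|participants)': fixed effects for
# ROI, strategy, and their interaction, plus a per-participant random
# intercept.
fit = smf.mixedlm("y ~ ROI * strategy", df, groups=df["participant"]).fit()
print(fit.summary())
```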

Figure 5. Neural substrate of abstraction.

(A) Regions of interest for univariate and multivariate analyses. The HPC was defined through automated anatomical labelling (FreeSurfer). The vmPFC was functionally defined as the cluster of voxels found with the orthogonal contrast ‘High value’ > ‘Low value’, at P(unc) < 0.0001. (B) ROI activity levels corresponding to each learning mode were extracted from the contrasts ‘Feature RL’ > ‘Abstract RL’ and ‘Abstract RL’ > ‘Feature RL’. Coloured bars represent the mean, and error bars the SEM. (C) Multivariate (decoding) analysis in three regions of interest: VC, HPC, vmPFC. Binary decoding was performed for each feature (e.g. colour: red vs green), using trials from blocks labelled as Feature RL or Abstract RL. Coloured bars represent the mean, error bars the SEM, and grey dots represent individual data points (for each individual, taken as the average across all three classifications, i.e., of all features). Results were obtained from leave-one-run-out cross-validation. The experiment was conducted once (n = 33 biologically independent samples), * p<0.05, ** p<0.01. (D) Classification was performed for each feature pair (e.g. colour: red vs green), separately for blocks in which the feature in question was relevant or irrelevant to the block’s rule. The statistical map represents the strength of the reduction in accuracy between trials in which the feature was relevant compared to irrelevant, averaged over all features and participants. (E) Classification of the rule (2x2 blocks only). For each participant, classification was performed as fruit 1 vs fruit 2. In (D–E), statistical parametric maps were z-transformed, and false-positive rate (fpr) cluster-forming correction was applied. p(fpr) < 0.01, Z > 2.33.

Figure 5—source data 1. Csv: panel B.
Figure 5—source data 2. Csv: panel C.


Figure 5—figure supplement 1. Ratio correct in Feature RL and Abstract RL.


Mean ratio of correct choices to total choices, for trials labelled as Feature RL or Abstract RL (i.e., by the best fitting algorithm on a given block). A total of three (Abstract RL: two, Feature RL: one) outliers were removed. Each coloured dot represents the average across selected blocks for a single participant - in either Abstract RL (grey) or Feature RL (cyan). Shaded areas represent the density plot, the central white dot the median, the dark central bar the interquartile range, and thin dark lines the lower and upper adjacent values. Two-sided t-test, t29 = 3.23, p = 0.003 (**).
Figure 5—figure supplement 2. Value functions correlations and reaction time differences in Feature RL and Abstract RL trials.


(A) Fisher-transformed coefficients (Spearman ρ) of the correlation between high/low value and Feature RL / Abstract RL across trials. The coloured bar represents the population mean, the error bar the SEM, and dots individual participants’ data. The two labels used in the main GLM were uncorrelated. Wilcoxon signed rank test against median 0, z = 0.72, p = 0.47. (B) Reaction time (RT) data pooled over all participants for trials in blocks labelled as ‘Feature RL’ or ‘Abstract RL’. RT did not differ significantly between the two strategies. Wilcoxon signed rank test between the two distributions, z = 1.48, p = 0.14.
Figure 5—figure supplement 3. K-fold cross-validation in feature decoding, using Feature RL and Abstract RL trials separately.


The split of the data into training and test subsets was done 20 times (N = 20), by randomly selecting 80% of the Nmin trials across the two levels of the target feature and the two conditions (Feature RL and Abstract RL). Thus, in each fold, the same number of trials for each feature and condition was used to train the classifier, avoiding possible accuracy confounds due to varying numbers of training trials. As in the primary analysis reported in the main text, classification accuracy was significantly higher in Abstract RL than in Feature RL trials in both the HPC and vmPFC (two-sided Wilcoxon signed rank test, HPC: z = −4.21, p(FDR) < 0.001, vmPFC: z = −3.15, p(FDR) = 0.002), but not in the VC (z = −1.30, p(FDR) = 0.20). The difference in feature decodability was significantly larger in the HPC and vmPFC compared to the VC (LMEM model ‘y ~ ROI + (1|participants)’, y: difference in decodability, t97 = 2.52, p = 0.013).

The next question we asked was: can we retrieve feature information from HPC and vmPFC activity patterns? Even when abstracting and operating in the latent space, an agent is still bound to represent and use the features, because the rules are dictated by feature combinations. One possibility is that feature information is represented solely in sensory areas; what matters then is the connection with, and/or the readout by, the vmPFC or HPC. Accordingly, neither the HPC nor the vmPFC should represent feature information, regardless of the strategy used. Alternatively, feature-level information could also be represented in higher cortical regions under Abstract RL, to explicitly support (value-based) relational computations (Oemisch et al., 2019). To resolve this question, we applied multivoxel pattern analysis to classify basic feature information (e.g. colour: red vs green) in three regions of interest (VC, HPC, and vmPFC), separately for trials labelled as Feature RL or Abstract RL. We found that classification accuracy was significantly higher in Abstract RL trials than in Feature RL trials in both the HPC and vmPFC (two-sided t-test, HPC: t32 = −2.37, p(FDR) = 0.036, vmPFC: t32 = −2.51, p(FDR) = 0.036, Figure 5C), while the difference was of opposite sign in the VC (t32 = 1.61, p(FDR) = 0.12, Figure 5C). The increased feature decodability in Abstract RL was significantly larger in the HPC and vmPFC compared to the VC (LMEM model ‘y ~ ROI + (1|participants)’, y: difference in decodability, t97 = 3.37, p = 0.001). Due to the nature of the task, the number of trials in each category could vary, potentially confounding the analysis. A control analysis equating the number of training trials for each feature and condition replicated the original finding (Figure 5—figure supplement 3). These empirical results support the second hypothesis: in Abstract RL, features are represented in the neural circuitry incorporating the HPC and vmPFC, beyond a simple readout of sensory cortices. In Feature RL, representing feature-level information in sensory cortices alone should suffice, because each visual pattern maps onto a unique task state.
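For concreteness, the decoding analysis in each ROI amounts to a cross-validated binary classifier over voxel patterns. A sketch with scikit-learn follows; the choice of classifier and the synthetic demo data are our assumptions, not the study's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def feature_decoding_accuracy(X, y, runs):
    """Binary decoding of one feature (e.g. red vs green) from ROI voxel
    patterns X (trials x voxels), with leave-one-run-out cross-validation
    as in Figure 5C. The classifier here is illustrative."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X, y, groups=runs,
                           cv=LeaveOneGroupOut()).mean()

# Synthetic demo: 120 trials x 50 voxels, 6 runs. Accuracies would be
# computed separately for Feature RL and Abstract RL trials, then their
# difference compared across participants per ROI.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 50))
y = rng.integers(2, size=120)
runs = np.repeat(np.arange(6), 20)
print(feature_decoding_accuracy(X, y, runs))
```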

We expanded on this idea with two searchlight multivoxel pattern analyses. In short, we asked which brain regions are sensitive to feature relevance, and whether we could recover representations of the latent rule itself (the fruit preference). Besides the occipital cortex, a significant reduction in decoding accuracy was detected in the OFC, ACC, vmPFC, and dorsolateral PFC when a feature was irrelevant to the rule, compared to when it was relevant (Figure 5D). Multivoxel patterns in the dorsolateral PFC and lateral OFC further predicted the fruit class (Figure 5E).

Artificially injecting value in sensory representations with neurofeedback fosters abstraction

Our computational and neuroimaging results indicate that valuation serves a key function in abstraction. Two hypotheses on the underlying mechanism can be outlined here. On one hand, the effect of vmPFC value computations could remain localised within the prefrontal circuitry. For example, this could be achieved by representing and ranking incoming sensory information for further processing within the HPC-OFC circuitry. Alternatively, value computation could determine abstractions by directly affecting early sensory areas – that is, a top-down (attentional) effect to ‘tag’ relevant sensory information (Anderson et al., 2011). Work in rodents has reported strong top-down modulation of sensory cortices by OFC neurons implicated in value computations (Banerjee et al., 2020; Liu et al., 2020). We thus hypothesised that abstraction could result from a direct effect of value in the VC. Therefore, artificially adding value to a neural representation of a task-relevant feature should result in enhanced behavioural abstraction.

Decoded neurofeedback is a form of neural reinforcement based on real-time fMRI and multivoxel pattern analysis. It is the closest approximation to a non-invasive causal manipulation, with high specificity, and is administered without participant awareness (Lubianiker et al., 2019; Muñoz-Moldes and Cleeremans, 2020; Shibata et al., 2019). Such reinforcement protocols can reliably lead to persistent behavioural or physiological changes (Cortese et al., 2016; Koizumi et al., 2016; Shibata et al., 2011; Sitaram et al., 2017; Taschereau-Dumouchel et al., 2018). We used this procedure in a follow-up experiment (N = 22, a subgroup from the main experiment; see Materials and methods) to artificially attach value (rewards) to the neural representation in VC of a target task-related feature (Figure 6A). At the end of two training sessions, participants completed 16 blocks of the pacman fruit preference task, outside of the scanner. Task blocks were labelled as ‘relevant’ (eight blocks) if the feature tagged with value was relevant to the block rule, or ‘irrelevant’ otherwise (eight blocks).
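Schematically, the feedback loop maps the real-time decoder's output for the target feature onto disc size and hence onto reward. In the sketch below, the linear mapping and all parameter values are illustrative assumptions rather than the study's implementation.

```python
def disc_radius(target_likelihood, r_min=20.0, r_max=200.0):
    """Map the decoded likelihood (0..1) that the current VC multivoxel
    pattern matches the target feature onto the feedback disc radius
    (pixel values are hypothetical)."""
    return r_min + target_likelihood * (r_max - r_min)

def trial_reward(target_likelihood, max_reward=10.0):
    """Reward grows with disc size, pairing the target feature's neural
    representation with monetary gain (the value 'tagging')."""
    return max_reward * target_likelihood

print(disc_radius(0.75), trial_reward(0.75))
```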

Figure 6. Artificially adding value to a feature’s neural representation.

(A) Schematic diagram of the follow-up multivoxel neurofeedback experiment. During the neurofeedback procedure, participants were rewarded for increasing the size of a disc on the screen (max session reward 3000 JPY). Unbeknownst to them, disc size was changed by the computer program to reflect the likelihood of the target brain activity pattern (corresponding to one of the task features) measured in real time. (B) Blocks were subdivided based on the feature targeted by multivoxel neurofeedback as ‘relevant’ or ‘irrelevant’ to the block rules. Scatter plots replicate the finding from the main experiment, with a strong association between Feature RL / Abstract RL and learning speed. Each coloured dot represents a single block from one participant, with data aggregated from all participants. (C) Abstraction level was computed for each participant from all blocks belonging to: (1) left, the latter half of the main experiment (as in Figure 3G, but only selecting participants who took part in the multivoxel neurofeedback experiment); (2) centre, post-neurofeedback for the ‘relevant’ condition; (3) right, post-neurofeedback for the ‘irrelevant’ condition. Coloured dots represent participants. Shaded areas indicate the density plot. Central white dots show the medians. The dark central bar depicts the interquartile range, and dark vertical lines indicate the lower and upper adjacent values. (D) Bootstrapping the difference between model probabilities on each block, separately for ‘relevant’ and ‘irrelevant’ conditions. The experiment was conducted once (n = 22 biologically independent samples), * p<0.05, *** p<0.001.

Figure 6—source data 1. Csv: panel B, irrelevant blocks.
Figure 6—source data 2. Csv: panel B, relevant blocks.
Figure 6—source data 3. Csv: panel C.
Figure 6—source data 4. Csv: panel D.


Figure 6—figure supplement 1. Neurofeedback experiment results.


(A) Neurofeedback (nfb) scores from each block of training, for sessions 1 and 2. Black dots represent data from individual participants, as the score obtained within a block of nfb training. Large green dots represent the mean, and the error bars the SEM. The red line is a linear fit on the mean data points. Robust regression fits, session 1: slope = 0.75, t7 = 0.60, p = 0.57; session 2: slope = 1.67, t7 = 1.98, p = 0.088. (B) Average nfb scores attained in each session. The red circles represent individual participants, the grey bars group means, and the black error bars the SEM. (C) Correlation between the sum of mean nfb scores (session 1 + session 2) and the subsequent increase in abstraction. The increase in abstraction was calculated as the difference between the abstraction level in the ‘relevant’ blocks and the abstraction level at the beginning of the main learning task. Black circles represent individual participants, and the black line a linear fit. N = 22, Spearman’s rho = 0.39, p = 0.036 (one-sided test for positive correlation). Source data files for each figure are available with this manuscript.

Data from the ‘relevant’ and ‘irrelevant’ blocks were analysed separately. The same model-fitting procedure used in the main experiment established whether participant choices in the new blocks were driven by a Feature RL or an Abstract RL strategy. ‘Relevant’ blocks appeared to be associated with a behavioural shift toward Abstract RL, whereas there was no substantial qualitative effect in ‘irrelevant’ blocks (Figure 6B). To quantify this effect, we first applied a binomial test, finding that behaviour in blocks where the targeted feature was relevant displayed markedly increased abstraction (base rate 0.5, number of Abstract RL blocks given the total number of blocks; ‘relevant’: P(123|176) = 1.37 × 10⁻⁷, ‘irrelevant’: P(90|176) = 0.82). We then measured the abstraction level for each participant and directly compared it to the level attained by the same participants in the late blocks of the main experiment (from Figure 3G). Participants increased their use of abstraction in ‘relevant’ blocks, whereas no significant difference was detected in ‘irrelevant’ blocks (Figure 6C, two-sided Wilcoxon signed rank test, ‘relevant’ blocks: z = 2.44, p = 0.015, ‘irrelevant’ blocks: z = −1.55, p = 0.12, ‘relevant’ vs ‘irrelevant’: z = 4.01, p = 6.03 × 10⁻⁵). Finally, we measured the difference between model probabilities P(Feature RL) - P(Abstract RL) for each block, and bootstrapped the mean over blocks (with replacement) 10,000 times to generate a distribution for the ‘relevant’ and ‘irrelevant’ conditions. Replicating the results reported above, behaviour in ‘relevant’ blocks was more likely to be driven by Abstract RL (Figure 6D, perm. test p < 0.001), while Feature RL tended to appear more often in ‘irrelevant’ blocks. Participants were successful at increasing the disc size in the neurofeedback task (Figure 6—figure supplement 1A–B). Furthermore, those who were more successful were also more likely to display larger increases in abstraction in the subsequent behavioural test, compared to their initial level (Figure 6—figure supplement 1C).
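The bootstrap itself is straightforward; a sketch of the resampling step, assuming the block-wise probability differences have already been computed:

```python
import numpy as np

def bootstrap_mean(diffs, n_boot=10_000, seed=0):
    """Resample the block-wise P(Feature RL) - P(Abstract RL) differences
    with replacement and return the distribution of the mean; run once
    per condition ('relevant' / 'irrelevant')."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    return diffs[idx].mean(axis=1)

# e.g. boot = bootstrap_mean(relevant_diffs); the fraction of bootstrapped
# means falling on either side of zero then gives the significance level.
```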

A strategy shift toward abstraction, specific to blocks in which the target feature was tagged with reward, indicates that the effect of value in facilitating abstraction is likely to be mediated by a change in the early processing stage of visual information. In this experiment, by means of neurofeedback, value (in the form of external rewards) ‘primed’ a target feature. Hence, the brain used these ‘artificial’ values when constructing abstract representations by tagging certain sensory channels. Critically, this manipulation indicates that value tagging of early representation has a causal effect on abstraction and consequently on the learning strategy.

Discussion

The ability to generate abstractions from simple sensory information has been suggested as crucial to support flexible and adaptive behaviours (Cortese et al., 2019; Ho et al., 2019; Wikenheiser and Schoenbaum, 2016). Here, using computational modelling, we found that value predictions drove participants’ selection of the appropriate representation to solve the task. Participants explored and exploited the task's dimensionality through learning, as they shifted from a simple feature-based strategy to more sophisticated abstractions. The more participants used Abstract RL, the faster they became at solving the task. Note that in this task, structure, learning speed, and abstraction are linked: to learn faster, an agent must use Abstract RL, as other strategies would result in slower completion of task blocks.

These results build on the idea that efficient decision-making processes in the brain depend on higher-order, summarised representations of task states (Niv, 2019; Schuck et al., 2016). Further, abstraction likely requires a functional link between sensory regions and areas encoding value predictions about task states (Figure 4C: the VC-vmPFC coupling was positively correlated with participants’ learning speed). This is consistent with previous work demonstrating that estimating the reward value of individual features provides a reliable and adaptive mechanism in RL (Farashahi et al., 2017). We extend this notion by showing that the same mechanism may support the formation of abstract representations to be further used for learning computations, for example, the selection of the appropriate abstract representation.

An interesting question concerns whether the brain uses abstract representations in isolation - operating in a hypothesis-testing regime, that is, favouring the current best model - or whether representations may be used to update multiple internal models, with behaviour determined by their synthesis (as in the mixture-of-experts architecture). The latter implementation may not be the most efficient computationally - the brain would have to run multiple processes in parallel - but it would be very data efficient, since one data point can be used to update several models. Humans might (at least at the conscious level) engage with one hypothesis at a time. However, there is circumstantial evidence that multiple strategies might be computed in parallel but deployed one at a time (Domenech et al., 2018; Donoso et al., 2014). Furthermore, in many cases the brain may have access to only limited data points, while parallel processing is a major feature of neural circuits (Alexander and Crutcher, 1990; Lee et al., 2020; Spitmaan et al., 2020). In this study, we aimed to show that arbitration between feature and abstract learning may be achieved using a relatively simple algorithm (the mixture-of-experts RL), and then proceeded to characterise the neural underpinnings of these two types of learning (i.e. Feature RL and Abstract RL). Admittedly, in the present work the mixture-of-experts RL does not provide as solid an account of the data as the more parsimonious Feature RL and Abstract RL in isolation. Future work will need to establish the actual computational strategy employed by the human brain. Of particular importance will be examining how such strategies vary across circumstances (tasks, contexts, or goals).

There is an important body of work considering how the HPC is involved in the formation and updating of conceptual information (Bowman and Zeithamova, 2018; Kumaran et al., 2009; Mack et al., 2016; McKenzie et al., 2014). Likely, the role of the HPC is to store, index, and update conceptual/schematic memories (Mack et al., 2016; Tse et al., 2011; Tse et al., 2007). ‘Creation’ of new concepts or schemas may occur elsewhere. A good candidate could be the mPFC or the vmPFC in humans (Mack et al., 2020; Tse et al., 2011). Indeed, the vmPFC exhibits value signals directly modulated by cognitive requirements (Castegnetti et al., 2021). Our results expose a potential mechanism of how the vmPFC interacts with the HPC in the construction of goal-relevant abstractions. vmPFC-driven valuation of low-level sensory information serves to channel specific features/components to higher-order areas (e.g. the HPC, vmPFC, but also the dorsal prefrontal cortex, for instance). Congruent with this interpretation, we found that the vmPFC was more engaged in Abstract RL, while the HPC was equally active under both abstract and feature-based strategies (Figure 5B). When a feature was irrelevant to the rule, its decodability from activity patterns in OFC/DLPFC decreased (Figure 5D). These findings accord well with the role of prefrontal regions in constructing goal-dependent task states and abstract rules from relevant sensory information (Akaishi et al., 2016; Schuck et al., 2016; Wallis et al., 2001). Furthermore, we found that block rules were also encoded in multivoxel activity patterns within the OFC/DLPFC circuitry (Figure 5E; Bengtsson et al., 2009; Mian et al., 2014; Wallis et al., 2001).

How these representations are actually used remains an open question. This study nevertheless suggests that there is no single region of the brain that maintains a fixed task state. Rather, the configuration of elements that determines a state is continuously reconstructed over time. At first glance, this may appear costly and inefficient. But this approach would provide high flexibility in noisy and stochastic environments, and where temporal dependencies exist (as in most real-world situations). By continuously recomputing task states, the agent can make more robust decisions because these are related to the subset of most relevant, up-to-date information. Such a computational coding scheme shares strong analogies with HPC neural coding, whereby neurons continuously generate representations of alternative future trajectories (Kay et al., 2020), and replay past cognitive trajectories (Schuck and Niv, 2019).

One significant topic for discussion concerns the elements used to construct abstractions. We leveraged simple visual features (colour, or stripe orientation), rather than more complex stimuli or objects that can be linked conceptually (Kumaran et al., 2009; Zeithamova et al., 2019). Abstractions happen at several levels, from features, to exemplars, concepts/categories, and all the way to rules and symbolic representations. In this work, we effectively studied one of the lowest levels of abstraction. One may wonder whether the mechanism we identified here generalises to more complex scenarios. While our work cannot decisively support this, we believe it unlikely that the brain uses an entirely different strategy to generate new representations at different levels of abstraction. Rather, the internal source of the information being abstracted should differ, but the algorithm itself should be the same or, at the very least, highly similar. The fact that our PPI analysis showed a link between the vmPFC and VC may point to this distinguishing characteristic of our study. Learning through abstraction of simple visual features should be related to early VC. Features in other modalities, for example auditory, would involve functional coupling between the auditory cortex and the vmPFC. When learning about more complex objects or categories, we expect to see stronger reliance on the HPC (Kumaran et al., 2009; Mack et al., 2016). Future studies could incorporate different levels of complexity, or different modalities, within a similar design so as to directly test this prediction and dissect exact neural contributions. Depending on which type of information is relevant at any point in time, we suspect that different areas will be coupled with the vmPFC to generate value representations.

In our second experiment, we implemented a direct assay to test our (causal) hypothesis that valuation of features guides abstraction in learning. Artificially adding value in the form of reward to a feature representation in the VC resulted in increased use of abstractions. Thus, the facilitating effect of value on abstraction can be directly linked to changes in the early processing stage of visual information. Consonant with this interpretation, recent work in mice has elegantly reported how value governs a functional remapping in the sensory cortex by direct lateral OFC projections carrying RPE information (Banerjee et al., 2020), as well as by modulating the gain of neurons to irrelevant stimuli (Liu et al., 2020). While these considerations clearly point to a central role of the vmPFC and valuation in abstraction by controlling sensory representations, it remains to be investigated whether this effect results in more efficient construction of abstract representations, or in better selection of internal abstract models.

Given the complex nature of our design, there are some limitations to this work. For example, there is no applicable feature decoder with which to test actual feature representations (e.g. colour vs orientation), or the likelihood of one feature against another. In our task design, in every trial, all features were used to define the pacman characters. Furthermore, we did not have a localiser session in which features were presented in isolation (see Supplementary note 2 for further discussion). Future work could investigate how separate feature representations emerge on the path to abstractions, for example in the parietal cortex or vmPFC, and their relation to feature levels (e.g. for colour: red vs green) as reported here. We speculate that parallel circuits linking the prefrontal cortex and basal ganglia could track these levels of abstraction, possibly in hierarchical fashion (Badre and Frank, 2012; Cortese et al., 2019; Haruno and Kawato, 2006).

Some may point out that what we call ‘Abstract RL’ is, in fact, just attention-mediated enhanced processing. Yet, if top-down attention were the sole driver in Abstract RL, we contend that the pattern of results would have been different. For example, we would expect to see a marked increase in feature decodability in VC (Guggenmos et al., 2015; Kamitani and Tong, 2005). This was not the case here, with only a minimal, non-significant, increase. More importantly, the results of the decoded neurofeedback manipulation question this interpretation. Because decoded neurofeedback operates unconsciously (Muñoz-Moldes and Cleeremans, 2020; Shibata et al., 2019), value was added directly at the sensory representation level (limited to the targeted region), precluding alternative top-down effects. That is not to say that attention does not significantly mediate this type of abstract learning; however, we argue that attention is most likely an effector of the abstraction and valuation processes (Krauzlis et al., 2014). A simpler top-down attentional effect was indeed evident in the supplementary analysis comparing feature decoding in ‘relevant’ and ‘irrelevant’ cases (Figure 5D). Occipital regions displayed large effect sizes, irrespective of the learning strategy used to solve the task.

While valuation and abstraction appear tightly associated in reducing the dimensionality of the task space, what is the underlying mechanism? The degree of neural compression in the vmPFC has been shown to relate to features most predictive of positive outcomes, under a given goal (Mack et al., 2020). Similarly, the geometry of neural activity in generalisation may be key here. Neural activity in the PFC (and HPC) explicitly generates representations that are simultaneously abstract and high dimensional (Bernardi et al., 2020). An attractive view is that valuation may be interpreted as an abstraction in itself. Value could provide a simple and efficient way for the brain to operate on a dimensionless axis. Each point on this axis could index a certain task state, or even behavioural strategy, as a function of its assigned abstract value. Neuronal encoding of feature-specific value, or choice options, may help the system construct useful representations that can, in turn, inform flexible behavioural strategies (Niv, 2019; Schuck et al., 2016; Wilson et al., 2014).

In summary, this work provides evidence for a function of valuation that exceeds the classic view in decision-making and neuroeconomics. We show that valuation subserves a critical function in constructing abstractions. One may further speculate that valuation, by generating a common currency across perceptually different stimuli, may be an abstraction in itself, and that it is tightly linked to the concept of task states in decision-making. We believe this work not only offers a new perspective on the role of valuation in generating abstract thoughts, but also reconciles apparently disconnected findings in decision-making and memory literature on the role of the vmPFC. In this context, value is not a simple proxy of a numerical reward signal, but is better understood as a conceptual representation or schema built on-the-fly to respond to a specific behavioural demand. Thus, we believe our findings provide a fresh view of the invariable presence of value signals in the brain that play an important algorithmic role in development of sophisticated learning strategies.

Materials and methods

Participants

Forty-six participants with normal or corrected-to-normal vision were recruited for the main experiment (learning task). The sample size was chosen according to prior work and recommendations on model-based fMRI studies (Lebreton et al., 2019). Based on pilot data and the available scanning time in one session (60 min), we set the following exclusion conditions: failure to learn the association in three blocks or more (i.e. reaching a block limit of 80 trials without having learned the association), or failure to complete at least 11 blocks in the allocated time. Eleven participants were removed based on these predetermined conditions, two of whom did not go past the initial practice stages. Additionally, two more participants were removed due to head motion artifacts. Thus, 33 participants (22.4 ± 0.3 y.o.; eight females) were included in the main analyses. Of these, 22 participants (22.2 ± 0.3 y.o.; four females) returned for the follow-up experiment, based on their individual availability. All results presented up to Figure 5 are from the 33 participants who completed the learning task. All results pertaining to the neurofeedback manipulation are from the subset of 22 participants who were called back. Figure 1—figure supplement 2 reports a behavioural analysis of the excluded participants to investigate differences in performance or learning strategy compared to the 33 included participants.

All experiments and data analyses were conducted at the Advanced Telecommunications Research Institute International (ATR). The study was approved by the Institutional Review Board of ATR with ethics protocol numbers 18–122, 19–122, 20–122. All participants gave written informed consent.

Learning task

The task consisted of learning the fruit preferences of pacman-like characters. These characters had three features, each with two levels (colour: green vs red, stripe orientation: horizontal vs vertical, mouth direction: left vs right). On each trial, a character composed of a unique combination of the three features was presented. The experimental session was divided into blocks, for each of which a specific rule directed the association between features and preferred fruits. There were always two relevant features and one irrelevant feature, but these changed randomly at the beginning of each block. Blocks could thus be of three types: CO (colour-orientation), CD (colour-direction), and OD (orientation-direction). Furthermore, to avoid a simple logical deduction of the rule after one trial, we introduced the following pairings. The four possible combinations of two relevant features with two levels were paired with the two fruits in either a symmetric (2×2) or an asymmetric (3×1) fashion. The appearance of the two fruits was also randomly changed at the beginning of each new block (see Figure 1B; e.g., green-vertical: fruit 1, green-horizontal: fruit 2, red-vertical: fruit 1, red-horizontal: fruit 2; or green-vertical: fruit 2, green-horizontal: fruit 2, red-vertical: fruit 1, red-horizontal: fruit 2).

Each trial started with a black screen for 1 s, after which the character was presented for 2 s. Then, while the character remained at the centre of the screen, the two fruit options appeared below it, to the right and left. Participants had 2 s to indicate the preferred fruit by pressing a button (right button for the fruit on the right, left button for the fruit on the left). Upon registering a participant’s choice, a coloured square appeared around the chosen fruit: green if the choice was correct, red otherwise. The square remained for 1 s, after which the trial ended with a variable inter-trial interval (ITI), bringing the trial to a fixed 8 s duration.

Participants were simply instructed that they had to learn the correct rule for each block; the rule itself (relevant features + association type) was hidden. Each block contained up to 80 trials, but a block could end earlier if participants learned the target rule. Learning was measured as a streak of consecutive correct trials (between 8 and 12, determined randomly in each block). Participants were instructed that each correct choice added one point, while incorrect choices did not alter the balance. They were further told that the points obtained would be weighted by the speed of learning in that block: the faster the learning, the greater the net worth of the points. The end of a block was explicitly signalled by presenting the reward obtained on the screen. Monetary reward was computed at the end of each block according to the formula:

$$R = A \cdot \frac{pts}{tr} - (tr - mcs) \cdot a \tag{1}$$

where R is the reward obtained in that block, A the maximum available reward (150 JPY), pts the sum of correct responses, tr the number of trials, mcs the maximum length of a correct streak (12 trials), and a a scaling factor (a = 1.5). This formula ensures a time-dependent decay of the reward that approximately follows a quadratic fit. If participants completed a block in fewer than 12 trials and the computed amount exceeded 150 JPY, the reward was capped at 150 JPY.
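As a concrete illustration, the payoff rule in Equation (1) can be sketched in a few lines of Python. The function name and the example block outcomes below are ours, not part of the original task code, and the pts/tr reading of the accuracy term follows our reconstruction of Equation (1).

```python
def block_reward(pts, tr, A=150, mcs=12, a=1.5):
    """Sketch of Equation (1): monetary reward for one block.

    pts: number of correct responses; tr: number of trials taken;
    A: maximum reward (150 JPY); mcs: maximum correct-streak length;
    a: scaling factor.
    """
    R = A * (pts / tr) - (tr - mcs) * a
    return min(R, A)  # amounts above 150 JPY are capped

# Faster learning yields a larger payout (illustrative values):
print(block_reward(pts=12, tr=14))  # ~125.6 JPY
print(block_reward(pts=12, tr=40))  # 3.0 JPY
```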

The maximum terminal monetary reward over the whole session was set at 3,000 JPY. On average, participants earned 1251 ± 46 JPY (blocks in which participants failed to learn the association within the 80-trial limit were not rewarded). For each experimental session, there was a sequence of 20 blocks that was pre-generated pseudo-randomly, and on average, participants completed 13.6 ± 0.3 blocks. In the post-neurofeedback behavioural test, all participants completed 16 blocks, 8 of which had the targeted feature as relevant, while in the other half the targeted feature was irrelevant. The order was arranged pseudo-randomly such that in both halves of the session there were four blocks of each type. In the post-neurofeedback behavioural session, all blocks had only asymmetric pairings with preferred fruits.

For sessions done in the MR scanner, participants were instructed to use their dominant hand to press buttons on a dual-button response pad to register their choices. Concordance between responses and buttons was indicated on the display, and importantly, randomly changed across trials to avoid motor preparation confounds (i.e. associating a given preference choice with a specific button press).

The task was coded with PsychoPy v1.82.01 (Peirce, 2008).

Computational modelling part 1: mixture-of-experts RL model

We built on a standard RL model (Sutton and Barto, 1998) and prior work in machine learning and robotics to derive the mixture-of-experts architecture (Doya et al., 2002; Jacobs et al., 1991; Sugimoto et al., 2012). In this work, the mixture-of-experts architecture is composed of several ‘expert’ RL modules, each tracking a different representational space, and each with its own value function. In each trial, the action selected by the mixture-of-experts RL model is given by the weighted sum of the actions computed by the experts. The weight reflects the responsibility of each expert, which is computed from the SoftMax of the squared prediction error. In this section we define the general mixture-of-experts RL model, and in the next section we define each specific expert, based on the task-state representations being used.

Formally, the global action of the mixture-of-experts RL model is defined as:

$$A_t = \sum_{j=1}^{N} \lambda^{j}_{t}\, a^{j}_{t} \tag{2}$$

where N is the number of experts, λ the responsibility signal, and a the action selected by the jth expert. The responsibility signal λ is defined as:

$$\lambda^{j}_{t} = \frac{\exp\left(-\overline{RPE}^{\,j}_{t-1}/\nu\right)}{\sum_{k=1}^{N}\exp\left(-\overline{RPE}^{\,k}_{t-1}/\nu\right)} \tag{3}$$

where N is the same as above and ν is the RPE variance. The expert uncertainty $\overline{RPE}_t$ is defined as:

$$\overline{RPE}^{\,j}_{t} = \gamma\,\overline{RPE}^{\,j}_{t-1} + (1-\gamma)\left(RPE^{\,j}_{t}\right)^{2} \tag{4}$$

where γ is the forgetting factor that controls the strength of the impact of prior experience on the current uncertainty estimate. The most recent RPE is computed as:

$$RPE^{\,j}_{t} = O - Q^{j}(S_t, A_t) \tag{5}$$

where O is the outcome (reward: 1, no reward: 0), Q is the value function, S the state for the current expert, and A is the global action computed with Equation (2). The update to the value function can therefore be computed as:

$$\Delta Q^{j}(S_t, A_t) = \lambda^{j}_{t}\,\alpha\,RPE^{\,j}_{t} \tag{6}$$

where λ is the responsibility signal computed with Equation (3), α is the learning rate (assumed equal for all experts), and RPE is computed with Equation (5). Finally, for each expert, the action a at trial t is taken as the argmax of the value function, as follows:

$$a^{j}_{t} = \underset{a}{\operatorname{argmax}}\left[Q^{j}(S_t, a)\right] \tag{7}$$

where Q is the value function, S the state at the current trial, and a ranges over the two possible actions.

Hyperparameters estimated through likelihood maximisation were the learning rate α, the forgetting factor γ, and the RPE variance ν.
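To make the arbitration scheme concrete, below is a minimal Python sketch of Equations (2)-(7) for the binary-action case of this task. The class name, default hyperparameter values, and the reduction of the weighted action sum in Equation (2) to a responsibility-weighted vote over two actions are our illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

class MixtureOfExpertsRL:
    """Minimal sketch of the mixture-of-experts RL model (Equations 2-7)."""

    def __init__(self, n_states_per_expert, alpha=0.3, gamma=0.9, nu=0.5):
        self.alpha, self.gamma, self.nu = alpha, gamma, nu
        self.Q = [np.zeros((n, 2)) for n in n_states_per_expert]  # one Q-table per expert
        self.rpe_bar = np.zeros(len(n_states_per_expert))         # expert uncertainties

    def responsibilities(self):
        # Equation (3): SoftMax over the (negative) expert uncertainties
        w = np.exp(-self.rpe_bar / self.nu)
        return w / w.sum()

    def act(self, states):
        # Equation (7): each expert proposes its greedy action;
        # Equation (2): proposals are combined, weighted by responsibility
        lam = self.responsibilities()
        proposals = [int(np.argmax(Q[s])) for Q, s in zip(self.Q, states)]
        return int(np.argmax(np.bincount(proposals, weights=lam, minlength=2)))

    def update(self, states, action, outcome):
        lam = self.responsibilities()
        for j, (Q, s) in enumerate(zip(self.Q, states)):
            rpe = outcome - Q[s, action]                       # Equation (5)
            self.rpe_bar[j] = (self.gamma * self.rpe_bar[j]
                               + (1 - self.gamma) * rpe ** 2)  # Equation (4)
            Q[s, action] += lam[j] * self.alpha * rpe          # Equation (6)
```

Here `states` holds each expert’s current state index, for example one Feature RL expert over eight states and three Abstract RL experts over four states each, mirroring the experts defined in the next section.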

Computational modelling part 2: Feature RL and Abstract RLs

Each (expert) RL algorithm is built on a standard RL model (Sutton and Barto, 1998) to derive variants that differ in the number and type of states visited. Here, a state is defined as a combination of features. Feature RL has 2³ = 8 states, where each state was given by the combination of all three features (e.g. colour, stripe orientation, mouth direction: green, vertical, left). Abstract RL is designed with 2² = 4 states, where each state was given by the combination of two features.

Note that the number of states does not change across blocks; only the features used to determine them change. These learning models provide individual estimates of how participants’ action selection depended on the features they attended to and on their past reward history. Both RL models share the same underlying structure and are formally described as:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r - Q(s,a)\right) \tag{8}$$

where Q(s,a) in Equation (8) is the value function of selecting fruit-option a in pacman-state s. The value of the action selected in the current trial is updated based on the difference between the expected value and the actual outcome (reward or no reward). This difference is called the reward prediction error (RPE). The degree to which this update affects the expected value depends on the learning rate α. For larger α, more recent outcomes have a stronger impact; conversely, for small α, recent outcomes have little effect on expectations. Only the value function of the selected action, which is state-contingent in Equation (8), is updated. The expected values of the two actions are combined to compute the probability p of predicting each outcome using a SoftMax (logistic) choice rule:

$$P_{s_i,a_1} = \frac{1}{1 + \exp\left(-\beta\left[Q(s_i,a_1) - Q(s_i,a_2)\right]\right)} \tag{9}$$

The greediness hyperparameter β controls how strongly the difference between the two expected values for a1 and a2 influences choices.

Hyperparameters estimated through likelihood maximisation were the learning rate α, and the greediness (inverse temperature) β.
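A single expert then reduces to tabular Q-learning with a SoftMax read-out. The sketch below implements Equations (8)-(9); the trial format, function name, and default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def run_expert(trials, n_states, alpha=0.3, beta=5.0, seed=0):
    """One Feature RL (n_states=8) or Abstract RL (n_states=4) expert.

    trials: sequence of (state_index, rewarded_action) pairs.
    Returns the learned Q-table (n_states x 2 actions).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for s, rewarded in trials:
        # Equation (9): probability of choosing action 0 (a1)
        p0 = 1.0 / (1.0 + np.exp(-beta * (Q[s, 0] - Q[s, 1])))
        a = 0 if rng.random() < p0 else 1
        r = 1.0 if a == rewarded else 0.0
        Q[s, a] += alpha * (r - Q[s, a])  # Equation (8)
    return Q
```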

Procedures for model fitting, simulations, and hyperparameter recovery

Hierarchical Bayesian Inference (HBI) was used to fit the models to participant behavioural data, enabling precise estimates of hyperparameters at the block level for each participant (Piray et al., 2019). Hyperparameters were selected by maximising the likelihood of estimated actions, given the true actions. For the mixture-of-experts architecture, we fit the model on all participants block-by-block to estimate hyperparameters at the single-block and single-participant level. For the subsequent direct comparison between the Feature RL and Abstract RL models, we used HBI for concurrent model fitting and comparison on a single-block and single-participant basis. The model comparison provided the likelihood that each RL algorithm best explained participants’ choice data; that is, it served as a proxy for whether learning followed a Feature RL or an Abstract RL strategy. Because the fitting was done block-by-block, with a hierarchical approach considering all participants, blocks were first sorted by length, from longest to shortest, at the participant level. This ensured that each block of a given participant was as similar as possible in length to the corresponding blocks of all other participants, thus avoiding unwanted effects in the fitting due to block length. The HBI procedure was then implemented on all participant data, proceeding block-by-block.

We also simulated model action-selection behaviours to benchmark their similarity to human behaviour, and in the case of Feature RL vs Abstract RL comparisons, to additionally compare their formal learning efficiency. In the case of the mixture-of-experts RL architecture, we simply used estimated hyperparameters to simulate 45 artificial agents, each completing 100 blocks. The simulation allowed us to compute, for each expert RL module, the mean responsibility signal, and the mean expected value across all states for the chosen action. Additionally, we also computed the learning speed (time to learn the rule of a block) and compared it with the learning speed of human participants.

In the case of the simple Feature RL and Abstract RL models, we added noise to the state information in order to produce more realistic behaviour (from the perspective of human participants). In the empirical data, the action (fruit selection) in the first trial of a new block was always chosen at random because participants did not have access to the appropriate representations (states). In later trials, participants may have followed specific strategies. For model simulations, we therefore assumed that states were corrupted by a decaying noise function:

$$n_t = n_0\, t^{-1/rte} \tag{10}$$

where $n_t$ is the noise level at trial t, $n_0$ the initial noise level (randomly drawn from a uniform distribution within the interval [0.3 0.7]), and rte the decay rate, which was set to 3. This meant that in early trials of a block, there was a higher chance of basing the action on the wrong representation (at random), rather than following the appropriate value function. Actions in later trials had a decreasing probability of being chosen at random. This approach is a combination of the classic ε-greedy policy and the standard SoftMax action-selection policy in RL. Hyperparameter values were sampled from the maximum likelihood fits obtained from participant data. We simulated 45 artificial agents solving 20 blocks each. The procedure was repeated 100 times for each block with new random seeds. We used two metrics to compare the efficiency of the two models: learning speed (as above, the time taken to learn the rule of a block) and the fraction of failed blocks (blocks in which the rule was not learned within the 80-trial limit).
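For reference, the decaying state noise of Equation (10) is a one-liner; the printed values below are purely illustrative.

```python
import numpy as np

def state_noise(t, n0, rte=3):
    """Equation (10): probability of acting on a corrupted (random) state at trial t."""
    return n0 * t ** (-1.0 / rte)

rng = np.random.default_rng(1)
n0 = rng.uniform(0.3, 0.7)  # initial noise level, as in the simulations
for t in (1, 5, 20, 80):
    print(t, round(state_noise(t, n0), 3))  # noise decays over trials
```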

We performed a parameter recovery analysis for the simple Feature RL and Abstract RL models based on data from the main experiment. Parameter recovery analysis was done in order to confirm that fitted hyperparameters had sensible values and that the models themselves were a sensible description of human choice behaviour (Palminteri et al., 2017). Using the same noisy procedure described above, we generated one more simulated dataset, using the exact blocks that were presented to the 33 participants. The blocks from the simulated data were then sorted according to their length, and the hyperparameters α and β were fitted by maximising the likelihood of the estimated actions, given the true model actions. We report in Figure 3—figure supplement 3 the scatter plot and correlation between hyperparameters estimated from participant data and recovered hyperparameter values, showing good agreement, notwithstanding the noise in the estimation.

For data from the behavioural session after multivoxel neurofeedback, blocks were first categorised as to whether the targeted feature was relevant or irrelevant to the rule of a given block. We then applied the HBI procedure as described above to all participants, with all blocks of the same type (e.g. targeted feature relevant) ordered by length. This allowed us to compute, based on whether the targeted feature was relevant or irrelevant, the difference in frequency between the models. We resampled with replacement to produce distributions of mean population bias for each block type.
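The resampling step can be sketched as follows; the encoding of block-wise bias (e.g. +1 for Abstract RL, −1 for Feature RL) and the function name are our illustrative conventions.

```python
import numpy as np

def bootstrap_population_bias(block_bias, n_boot=10000, seed=0):
    """Resample block-wise model bias with replacement and return the
    distribution of the population mean (one value per bootstrap draw)."""
    rng = np.random.default_rng(seed)
    block_bias = np.asarray(block_bias, dtype=float)
    draws = rng.choice(block_bias, size=(n_boot, block_bias.size), replace=True)
    return draws.mean(axis=1)
```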

fMRI scans: acquisition and protocol

All scanning sessions employed a 3T MRI scanner (Siemens, Prisma) with a 64-channel head coil in the ATR Brain Activation Imaging Centre. Gradient T2*-weighted EPI (echoplanar) functional images with blood-oxygen-level-dependent (BOLD)-sensitive contrast and multi-band acceleration factor six were acquired (Feinberg et al., 2010; Xu et al., 2013). Imaging parameters: 72 contiguous slices (TR = 1 s, TE = 30 ms, flip angle = 60 deg, voxel size = 2×2×2 mm3, 0 mm slice gap) oriented parallel to the AC-PC plane were acquired, covering the entire brain. T1-weighted images (MP-RAGE; 256 slices, TR = 2 s, TE = 26 ms, flip angle = 80 deg, voxel size = 1×1×1 mm3, 0 mm slice gap) were also acquired at the end of the first session. For participants who joined the neurofeedback training sessions, the scanner was realigned to their head orientations with the same parameters for all sessions.

fMRI scans: standard and parametric general linear models

BOLD-signal image analysis was performed with SPM12 [http://www.fil.ion.ucl.ac.uk/spm/], running on MATLAB v9.1.0.96 (r2016b) and v9.5.0.94 (r2018b). fMRI data for the initial 10 s of each block were discarded due to possible unsaturated T1 effects. Raw functional images underwent realignment to the first image of each session. Structural images were co-registered to the mean EPI images and segmented into grey and white matter. Segmentation parameters were then used to normalise (MNI space) and bias-correct the functional images. Normalised images were smoothed using a Gaussian kernel of 7 mm full-width at half-maximum.

GLM1: regressors of interest included ‘High value’ and ‘Low value’ (trials labelled via a median split of the trial-by-trial expected value of the chosen option, computed from the best-fitting algorithm - Feature RL or Abstract RL), and ‘Feature RL’ and ‘Abstract RL’ (trials labelled according to the best-fitting algorithm at the block level). For all of these, we generated boxcar regressors at the onset of the visual stimulus (character) presentation, with 1 s duration. Contrasts of [1 −1] and [−1 1] were applied to the regressor pairs ‘High value’ - ‘Low value’ and ‘Feature RL’ - ‘Abstract RL’. Specific regressors of no interest included the time in the experiment: ‘early’ (all trials falling within the first half of the experiment) and ‘late’ (all trials falling in the second half). The early/late split was based on the total number of trials: starting from the first block, blocks were labelled ‘early’ until the cumulative trial count exceeded half of the total number of trials.

GLM2 (PPI): the seed was defined as a sphere (radius = 6 mm) centred on the individual peak voxel from the ‘High value’ > ‘Low value’ group-level contrast, within the vmPFC (peak coordinates [2 50 -10], radius 25 mm). The ROI mask was defined individually to account for possible differences among participants. Two participants were excluded from this analysis because they did not show a significant cluster of voxels in the bounding sphere (even at very lenient thresholds). The GLM for the PPI included three regressors (the PPI, the mean BOLD signal of the seed region, and the psychological interaction), as well as the nuisance regressors described below.

For all GLM analyses, additional regressors of no interest included a parametric regressor for reaction time, regressors for each trial event (fixation, fruit options presentation, choice, button press [left, right], ITI), block regressors, the six head motion realignment parameters, framewise displacement (FD) computed as the sum of the absolute values of the derivatives of the six realignment parameters, the TR-by-TR mean signal in white matter, and the TR-by-TR mean signal in cerebrospinal fluid.
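For instance, the framewise displacement regressor defined above (the sum of the absolute derivatives of the six realignment parameters) can be computed as in the following sketch; the array layout and function name are assumptions.

```python
import numpy as np

def framewise_displacement(motion):
    """FD per TR from a (n_TRs, 6) array of realignment parameters
    (three translations and three rotations)."""
    deltas = np.abs(np.diff(motion, axis=0))             # parameter derivatives
    return np.concatenate([[0.0], deltas.sum(axis=1)])   # FD is 0 at the first TR
```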

Second-level group contrasts from all models were calculated as one-sample t-tests against zero for each first-level linear contrast. Statistical maps were z-transformed and reported at a threshold of P(fpr) < 0.001 (Z > 3.09; false positive rate control, used as the cluster-forming threshold), unless otherwise specified. Statistical maps were projected onto a canonical MNI template with MRIcroGL [https://www.nitrc.org/projects/mricrogl/] or a glass-brain MNI template with Nilearn 0.7.1 [https://nilearn.github.io/index.html].

fMRI scans: pre-processing for decoding

As above, the fMRI data for the initial 10 s of each run were discarded due to possible unsaturated T1 effects. BOLD signals in native space were pre-processed in MATLAB v7.13 (R2011b) (MathWorks) with the mrVista software package for MATLAB [http://vistalab.stanford.edu/software/]. All functional images underwent 3D motion correction. No spatial or temporal smoothing was applied. Rigid-body transformations were performed to align functional images to the structural image for each participant. One region of interest (ROI), the HPC, was anatomically defined through cortical reconstruction and volumetric segmentation using the Freesurfer software [http://surfer.nmr.mgh.harvard.edu/]. Furthermore, VC subregions V1, V2, and V3 were also automatically defined based on a probabilistic map atlas (Wang et al., 2015). The vmPFC ROI was defined as the significant voxels from the GLM1 ‘High value’ > ‘Low value’ contrast at p(fpr) < 0.0001, within the OFC. All subsequent analyses were performed using MATLAB v9.5.0.94 (r2018b). Once ROIs were individually identified, time-courses of BOLD signal intensities were extracted from each voxel in each ROI and shifted by 6 s to account for the hemodynamic delay (assumed fixed). A linear trend was removed from time-courses, which were further z-score-normalised for each voxel in each block to minimise baseline differences across blocks. Data samples for computing individual feature identity decoders were created by averaging BOLD signal intensities of each voxel over two volumes, corresponding to the 2 s from stimulus (character) onset to fruit options onset.
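A compact sketch of this pipeline is given below (TR = 1 s, so volume indices equal seconds). For simplicity, the z-scoring here is applied over the whole run rather than per block, and the function layout is our assumption.

```python
import numpy as np

def prepare_decoder_samples(bold, onsets, hrf_shift=6):
    """bold: (n_TRs, n_voxels) array; onsets: stimulus-onset volume indices.
    Returns one sample per trial: the mean of the two volumes starting
    6 s after onset, after linear detrending and per-voxel z-scoring."""
    t = np.arange(bold.shape[0])
    slope, intercept = np.polyfit(t, bold, 1)            # per-voxel linear trend
    detrended = bold - (np.outer(t, slope) + intercept)
    z = (detrended - detrended.mean(0)) / detrended.std(0)
    return np.stack([z[on + hrf_shift:on + hrf_shift + 2].mean(0) for on in onsets])
```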

Decoding: multivoxel pattern analysis (MVPA)

All ROI-based MVP analyses followed the same procedure. We used sparse logistic regression (SLR) (Yamashita et al., 2008) to automatically select the most relevant voxels for the classification problem. This allowed us to construct several binary classifiers (e.g. feature identity: colour - red vs green, stripe orientation - horizontal vs vertical, mouth direction - right vs left).

Cross-validation was used for each MVP analysis to evaluate the predictive power of the trained (fitted) model. In the primary analysis (reported in Figure 5C), cross-validation was done with a leave-one-run-out scheme, whereby each run was iteratively held out as a test set and all other runs were used to train the algorithm. The final accuracy was taken as the average accuracy across runs. This approach is generally used because there may be subtle differences across runs: holding out one run as a test set ensures higher generalisability of the results while avoiding within-run information leaks. Yet, because of the nature of our task and experiment, leave-one-run-out cross-validation leads to other confounds due to varying numbers of training trials across classes (e.g. colour red vs green) or conditions (Feature RL vs Abstract RL blocks). To control for this idiosyncratic feature of our design, we performed a second cross-validation scheme. Here, we first merged the data from all blocks for each condition, and then computed the lowest bound of trial numbers for the minority class across conditions (e.g. if Feature RL had 128 trials labelled ‘green’ and 109 labelled ‘red’, while Abstract RL had 94 trials labelled ‘green’ and 99 labelled ‘red’, then the minority-class lowest bound was 94). In each fold (N folds = 20), a number of trials equivalent to 80% of the minority-class lowest bound was assigned to the training set from each class, and the remaining trials to the test set. The training samples were randomly chosen in each fold. Furthermore, for all MVP analyses, SLR classification was optimised using an iterative approach (Hirose et al., 2015). In each fold of the cross-validation, the feature-selection process was repeated 10 times. In each iteration, selected features (voxels) were removed from pattern vectors, and only features with unassigned weights were used for the next iteration. At the end of the cross-validation, test accuracies were averaged for each iteration across folds, in order to evaluate accuracy at each iteration. The number of iterations yielding the highest classification accuracy was then used for the final computation. Results (Figure 5C, Figure 5—figure supplement 3) report the cross-validated average of the best-yielding iteration.
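The balanced scheme can be sketched as follows; the fold construction and variable names are illustrative, and the SLR classifier itself is omitted.

```python
import numpy as np

def balanced_folds(labels, conditions, n_folds=20, frac=0.8, seed=0):
    """Yield (train, test) index arrays with class-balanced training sets.

    The per-class training size is 80% of the minority-class lowest bound
    computed across all class x condition cells, as described above.
    """
    rng = np.random.default_rng(seed)
    labels, conditions = np.asarray(labels), np.asarray(conditions)
    bound = min(np.sum((labels == c) & (conditions == d))
                for c in np.unique(labels) for d in np.unique(conditions))
    n_train = int(frac * bound)
    for _ in range(n_folds):
        train = np.concatenate([
            rng.choice(np.flatnonzero(labels == c), n_train, replace=False)
            for c in np.unique(labels)])
        yield train, np.setdiff1d(np.arange(labels.size), train)
```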

For the multivoxel neurofeedback experiment, we used the entire dataset to train the classifier in VC. Thus, each classifier resulted in a set of weights assigned to the selected voxels. These weights could be used to classify any new data sample and to compute a likelihood of its belonging to the target class.

Real-time multivoxel neurofeedback and fMRI pre-processing

As in previous studies (Cortese et al., 2017; Cortese et al., 2016; Shibata et al., 2011), during the multivoxel neurofeedback manipulation, participants were instructed to modulate their brain activity, in order to enlarge a feedback disc and maximise their cumulative reward. Brain activity patterns measured through fMRI were used in real time to compute the feedback score. Unbeknownst to participants, the feedback score, ranging from 0 to 100 (empty to full disc), represented the likelihood of a target pattern occurring in their brains at measurement time. Each trial started with an induction period of 6 s, during which participants viewed a cue (a small grey circle) that instructed them to modulate their brain activity. This period was followed by a 6 s rest interval, and then by a 2 s feedback, during which the disc appeared on the screen. Finally, each trial ended with a 6 s inter-trial interval (ITI). Each block was composed of 12 trials, and one session could last up to 10 blocks (depending on time availability). Participants did two sessions on consecutive days. Within a session, the maximum monetary bonus was 3000 JPY.

Feedback was calculated through the following steps. In each block, the initial 10 s of fMRI data were discarded to avoid unsaturated T1 effects. First, newly measured whole-brain functional images underwent 3D motion correction using Turbo BrainVoyager (Brain Innovation, Maastricht, Netherlands). Second, time-courses of BOLD signal intensities were extracted from each of the voxels identified in the decoder analysis for the target ROI (VC). Third, the time-course was detrended (removal of linear trends) and z-score-normalised for each voxel using the BOLD signal intensities measured up to that point. Fourth, the data sample used to calculate the target likelihood was created by taking the average BOLD signal intensity of each voxel over the 6 s (6 TRs) ‘induction’ period, as in previous studies (Cortese et al., 2016; Shibata et al., 2011). Finally, the likelihood of each feature level (e.g. right vs left mouth direction) being represented in the multivoxel activity pattern was calculated from the data sample using the weights of the constructed classifier.
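In schematic form, the per-trial feedback computation reduces to a few steps, as in the sketch below. The decoder parameters (`weights`, `bias`) would come from the offline SLR training; the normalisation bookkeeping shown here is a simplified assumption.

```python
import numpy as np

def feedback_score(induction_vols, weights, bias, voxel_mean, voxel_std):
    """induction_vols: (6, n_voxels) BOLD data from the induction period.
    Returns the disc size (0-100), i.e. the target-class likelihood."""
    sample = induction_vols.mean(axis=0)      # average over the 6 TRs
    z = (sample - voxel_mean) / voxel_std     # online z-scoring
    logit = z @ weights + bias                # linear (SLR) decoder output
    return 100.0 / (1.0 + np.exp(-logit))     # logistic likelihood -> score
```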

Data and code availability

Behavioural data, group-level maps of brain activation, and custom code used to generate results and figures are available at https://github.com/BDMLab/Cortese_et_al_2021, copy archived at swh:1:rev:3ac5090fe0af132364bbf92b9b0dff95919d60ee (Cortese et al., 2021).

Acknowledgements

We thank Kaori Nakamura and Yasuo Shimada for experimental assistance, and Drs. Hakwan Lau and Jessica Taylor for helpful comments on previous versions of this manuscript. Funding: JST ERATO (Japan, grant number JPMJER1801) to AC, AY, and MK; AMED (Japan, grant number JP18dm0307008) to AC and MK; the Chilean National Agency for Research and Development (ANID)/Scholarship Program/DOCTORADO BECAS CHILE/2017–72180193 to PS; the Royal Society and Wellcome Trust, Henry Dale Fellowship (102612/A/13/Z) to BDM.

Appendix 1

Supplementary note 1

Target and control regressors for main GLM

On average, Abstract RL blocks tended to occur later in the session (Figure 3F–G, Figure 3—figure supplement 2A) and to be associated with a small but significantly higher ratio of correct to incorrect responses (Figure 5—figure supplement 1). Moreover, although Abstract RL blocks were associated with higher expected value than Feature RL blocks (Figure 3—figure supplement 2C), at the trial level, value (high/low) and learning strategy (Feature RL or Abstract RL) were uncorrelated (Figure 5—figure supplement 2A), thus confirming the regressors’ orthogonality. The main analysis used for the value contrast and the strategy contrast thus included regressors for ‘early’, ‘late’, ‘High value’, ‘Low value’, ‘Feature RL’, and ‘Abstract RL’, such that the GLM explicitly controlled for the idiosyncratic features of the task. Other regressors of no interest were the motion parameters, mean white matter signal, mean cerebrospinal fluid signal, and block and constant regressors.

Supplementary note 2

Levels of multivoxel fMRI neurofeedback

It is worth noting that the neurofeedback procedure targeted one level of a feature, for example red colour, rather than colour overall. One might wonder why this approach would nevertheless work. Given previous work with fMRI-based decoded neurofeedback (1), the main driver of the effect was most likely a change in processing in VC, leading to increased functional representation of task features in PFC (particularly the vmPFC). Because in the current work feature levels were intrinsically coupled in task space (for example, if red-horizontal corresponded to fruit 1, then green-vertical did too), enhanced processing of red should also directly influence the paired colour.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Aurelio Cortese, Email: cortese.aurelio@gmail.com.

Benedetto De Martino, Email: benedettodemartino@gmail.com.

Thorsten Kahnt, Northwestern University, United States.

Michael J Frank, Brown University, United States.

Funding Information

This paper was supported by the following grants:

  • Japan Science and Technology Agency JPMJER1801 to Aurelio Cortese, Mitsuo Kawato.

  • Japan Agency for Medical Research and Development JP18dm0307008 to Aurelio Cortese, Mitsuo Kawato.

  • Chilean National Agency for Research and Development 72180193 to Pradyumna Sepulveda.

  • Wellcome Trust 102612/A/13/Z to Benedetto De Martino.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Software, Formal analysis, Investigation, Methodology.

Software, Formal analysis, Writing - review and editing.

Conceptualization, Software, Investigation, Writing - review and editing.

Conceptualization, Formal analysis, Supervision, Funding acquisition, Writing - original draft, Writing - review and editing.

Conceptualization, Formal analysis, Supervision, Validation, Writing - original draft, Project administration, Writing - review and editing.

Ethics

Human subjects: All experiments and data analyses were conducted at the Advanced Telecommunications Research Institute International (ATR). The study was approved by the Institutional Review Board of ATR with ethics protocol numbers 18-122, 19-122, 20-122. All participants gave written informed consent.

Additional files

Transparent reporting form

Data availability

All data generated or analysed during this study are included in the manuscript and supporting files. Source data files have been provided for Figures 1-6. Behavioural data, group-level maps of brain activation, and custom code used to generate results and figures are available at https://github.com/BDMLab/Cortese_et_al_2021, copy archived at https://archive.softwareheritage.org/swh:1:rev:3ac5090fe0af132364bbf92b9b0dff95919d60ee.

References

  1. Akaishi R, Kolling N, Brown JW, Rushworth M. Neural Mechanisms of Credit Assignment in a Multicue Environment. Journal of Neuroscience. 2016;36:1096–1112. doi: 10.1523/JNEUROSCI.3159-15.2016.
  2. Alexander GE, Crutcher MD. Functional architecture of basal ganglia circuits: neural substrates of parallel processing. Trends in Neurosciences. 1990;13:266–271. doi: 10.1016/0166-2236(90)90107-l.
  3. Anderson BA, Laurent PA, Yantis S. Value-driven attentional capture. PNAS. 2011;108:10367–10371. doi: 10.1073/pnas.1104047108.
  4. Badre D, Frank MJ. Mechanisms of hierarchical reinforcement learning in cortico-striatal circuits 2: evidence from fMRI. Cerebral Cortex. 2012;22:527–536. doi: 10.1093/cercor/bhr117.
  5. Banerjee A, Parente G, Teutsch J, Lewis C, Voigt FF, Helmchen F. Value-guided remapping of sensory cortex by lateral orbitofrontal cortex. Nature. 2020;585:245–250. doi: 10.1038/s41586-020-2704-z.
  6. Bastin J, Deman P, David O, Gueguen M, Benis D, Minotti L, Hoffman D, Combrisson E, Kujala J, Perrone-Bertolotti M, Kahane P, Lachaux JP, Jerbi K. Direct recordings from human anterior insula reveal its leading role within the Error-Monitoring network. Cerebral Cortex. 2017;27:1545–1557. doi: 10.1093/cercor/bhv352.
  7. Bellman R. Dynamic Programming. Princeton, New Jersey, USA: Princeton University Press; 1957.
  8. Bengtsson SL, Haynes JD, Sakai K, Buckley MJ, Passingham RE. The representation of abstract task rules in the human prefrontal cortex. Cerebral Cortex. 2009;19:1929–1936. doi: 10.1093/cercor/bhn222.
  9. Benoit RG, Szpunar KK, Schacter DL. Ventromedial prefrontal cortex supports affective future simulation by integrating distributed knowledge. PNAS. 2014;111:16550–16555. doi: 10.1073/pnas.1419274111.
  10. Bernardi S, Benna MK, Rigotti M, Munuera J, Fusi S, Salzman CD. The geometry of abstraction in the Hippocampus and prefrontal cortex. Cell. 2020;183:954–967. doi: 10.1016/j.cell.2020.09.031.
  11. Bowman CR, Zeithamova D. Abstract memory representations in the ventromedial prefrontal cortex and Hippocampus support concept generalization. The Journal of Neuroscience. 2018;38:2605–2614. doi: 10.1523/JNEUROSCI.2811-17.2018.
  12. Carter CS, Braver TS, Barch DM, Botvinick MM, Noll D, Cohen JD. Anterior cingulate cortex, error detection, and the online monitoring of performance. Science. 1998;280:747–749. doi: 10.1126/science.280.5364.747.
  13. Castegnetti G, Zurita M, De Martino B. How usefulness shapes neural representations during goal-directed behavior. Science Advances. 2021;7:eabd5363. doi: 10.1126/sciadv.abd5363.
  14. Constantinescu AO, O'Reilly JX, Behrens TEJ. Organizing conceptual knowledge in humans with a gridlike code. Science. 2016;352:1464–1468. doi: 10.1126/science.aaf0941.
  15. Cortese A, Amano K, Koizumi A, Kawato M, Lau H. Multivoxel neurofeedback selectively modulates confidence without changing perceptual performance. Nature Communications. 2016;7:13669. doi: 10.1038/ncomms13669.
  16. Cortese A, Amano K, Koizumi A, Lau H, Kawato M. Decoded fMRI neurofeedback can induce bidirectional confidence changes within single participants. NeuroImage. 2017;149:323–337. doi: 10.1016/j.neuroimage.2017.01.069.
  17. Cortese A, De Martino B, Kawato M. The neural and cognitive architecture for learning from a small sample. Current Opinion in Neurobiology. 2019;55:133–141. doi: 10.1016/j.conb.2019.02.011.
  18. Cortese A, Lau H, Kawato M. Unconscious reinforcement learning of hidden brain states supported by confidence. Nature Communications. 2020;11:4429. doi: 10.1038/s41467-020-17828-8.
  19. Cortese A, Yamamoto A, Hashemzadeh M, Sepulveda P. Cortese_et_al_2021. Software Heritage. 2021. swh:1:rev:3ac5090fe0af132364bbf92b9b0dff95919d60ee. https://archive.softwareheritage.org/swh:1:dir:88d680896aa54dc52629f4274001a6e529fb78fc;origin=https://github.com/BDMLab/Cortese_et_al_2021;visit=swh:1:snp:d5176536817595f8ae3061e468585b773abc696a;anchor=swh:1:rev:3ac5090fe0af132364bbf92b9b0dff95919d60ee
  20. De Martino B, Fleming SM, Garrett N, Dolan RJ. Confidence in value-based choice. Nature Neuroscience. 2013;16:105–110. doi: 10.1038/nn.3279.
  21. Domenech P, Redouté J, Koechlin E, Dreher JC. The Neuro-Computational architecture of Value-Based selection in the human brain. Cerebral Cortex. 2018;28:585–601. doi: 10.1093/cercor/bhw396.
  22. Donoso M, Collins AG, Koechlin E. Human cognition. Foundations of human reasoning in the prefrontal cortex. Science. 2014;344:1481–1486. doi: 10.1126/science.1252254.
  23. Doya K, Samejima K, Katagiri K, Kawato M. Multiple model-based reinforcement learning. Neural Computation. 2002;14:1347–1369. doi: 10.1162/089976602753712972.
  24. Farashahi S, Rowe K, Aslami Z, Lee D, Soltani A. Feature-based learning improves adaptability without compromising precision. Nature Communications. 2017;8:1768. doi: 10.1038/s41467-017-01874-w.
  25. Feinberg DA, Moeller S, Smith SM, Auerbach E, Ramanna S, Gunther M, Glasser MF, Miller KL, Ugurbil K, Yacoub E. Multiplexed echo planar imaging for sub-second whole brain FMRI and fast diffusion imaging. PLOS ONE. 2010;5:e15710. doi: 10.1371/journal.pone.0015710.
  26. Frank MJ, Badre D. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: computational analysis. Cerebral Cortex. 2012;22:509–526. doi: 10.1093/cercor/bhr114.
  27. Friston KJ, Buechel C, Fink GR, Morris J, Rolls E, Dolan RJ. Psychophysiological and modulatory interactions in neuroimaging. NeuroImage. 1997;6:218–229. doi: 10.1006/nimg.1997.0291.
  28. Gherman S, Philiastides MG. Human VMPFC encodes early signatures of confidence in perceptual decisions. eLife. 2018;7:e38293. doi: 10.7554/eLife.38293.
  29. Gilboa A, Marlatte H. Neurobiology of Schemas and Schema-Mediated Memory. Trends in Cognitive Sciences. 2017;21:618–631. doi: 10.1016/j.tics.2017.04.013.
  30. Guggenmos M, Thoma V, Haynes JD, Richardson-Klavehn A, Cichy RM, Sterzer P. Spatial attention enhances object coding in local and distributed representations of the lateral occipital complex. NeuroImage. 2015;116:149–157. doi: 10.1016/j.neuroimage.2015.04.004.
  31. Haruno M, Kawato M. Heterarchical reinforcement-learning model for integration of multiple cortico-striatal loops: fMRI examination in stimulus-action-reward association learning. Neural Networks. 2006;19:1242–1254. doi: 10.1016/j.neunet.2006.06.007.
  32. Hashemzadeh M, Hosseini R, Ahmadabadi MN. Exploiting generalization in the subspaces for faster Model-Based reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems. 2019;30:1635–1650. doi: 10.1109/TNNLS.2018.2869978.
  33. Hirose S, Nambu I, Naito E. An empirical solution for over-pruning with a novel ensemble-learning method for fMRI decoding. Journal of Neuroscience Methods. 2015;239:238–245. doi: 10.1016/j.jneumeth.2014.10.023.
  34. Ho MK, Abel D, Griffiths TL, Littman ML. The value of abstraction. Current Opinion in Behavioral Sciences. 2019;29:111–116.
  35. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixtures of local experts. Neural Computation. 1991;3:79–87. doi: 10.1162/neco.1991.3.1.79.
  36. Kamitani Y, Tong F. Decoding the visual and subjective contents of the human brain. Nature Neuroscience. 2005;8:679–685. doi: 10.1038/nn1444.
  37. Kawato M, Samejima K. Efficient reinforcement learning: computational theories, neuroscience and robotics. Current Opinion in Neurobiology. 2007;17:205–212. doi: 10.1016/j.conb.2007.03.004.
  38. Kay K, Chung JE, Sosa M, Schor JS, Karlsson MP, Larkin MC, Liu DF, Frank LM. Constant Sub-second cycling between representations of possible futures in the Hippocampus. Cell. 2020;180:552–567. doi: 10.1016/j.cell.2020.01.014.
  39. Knutson B, Taylor J, Kaufman M, Peterson R, Glover G. Distributed neural representation of expected value. Journal of Neuroscience. 2005;25:4806–4812. doi: 10.1523/JNEUROSCI.0642-05.2005.
  40. Kobayashi K, Hsu M. Common neural code for reward and information value. PNAS. 2019;116:13061–13066. doi: 10.1073/pnas.1820145116.
  41. Koizumi A, Amano K, Cortese A, Shibata K, Yoshida W, Seymour B, Kawato M, Lau H. Fear reduction without fear through reinforcement of neural activity that bypasses conscious exposure. Nature Human Behaviour. 2016;1:0006. doi: 10.1038/s41562-016-0006.
  42. Konidaris G. On the necessity of abstraction. Current Opinion in Behavioral Sciences. 2019;29:1–7. doi: 10.1016/j.cobeha.2018.11.005.
  43. Krauzlis RJ, Bollimunta A, Arcizet F, Wang L. Attention as an effect not a cause. Trends in Cognitive Sciences. 2014;18:457–464. doi: 10.1016/j.tics.2014.05.008.
  44. Kumaran D, Summerfield JJ, Hassabis D, Maguire EA. Tracking the emergence of conceptual knowledge during human decision making. Neuron. 2009;63:889–901. doi: 10.1016/j.neuron.2009.07.030.
  45. Lebreton M, Abitbol R, Daunizeau J, Pessiglione M. Automatic integration of confidence in the brain valuation signal. Nature Neuroscience. 2015;18:1159–1167. doi: 10.1038/nn.4064.
  46. Lebreton M, Bavard S, Daunizeau J, Palminteri S. Assessing inter-individual differences with task-related functional neuroimaging. Nature Human Behaviour. 2019;3:897–905. doi: 10.1038/s41562-019-0681-8.
  47. Lee H, GoodSmith D, Knierim JJ. Parallel processing streams in the hippocampus. Current Opinion in Neurobiology. 2020;64:127–134. doi: 10.1016/j.conb.2020.03.004.
  48. Leong YC, Radulescu A, Daniel R, DeWoskin V, Niv Y. Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron. 2017;93:451–463. doi: 10.1016/j.neuron.2016.12.040.
  49. Liu S, Ullman TD, Tenenbaum JB, Spelke ES. Ten-month-old infants infer the value of goals from the costs of actions. Science. 2017;358:1038–1041. doi: 10.1126/science.aag2132.
  50. Liu D, Deng J, Zhang Z, Zhang ZY, Sun YG, Yang T, Yao H. Orbitofrontal control of visual cortex gain promotes visual associative learning. Nature Communications. 2020;11:2784. doi: 10.1038/s41467-020-16609-7.
  51. Lubianiker N, Goldway N, Fruchtman-Steinbok T, Paret C, Keynan JN, Singer N, Cohen A, Kadosh KC, Linden DEJ, Hendler T. Process-based framework for precise neuromodulation. Nature Human Behaviour. 2019;3:436–445. doi: 10.1038/s41562-019-0573-y.
  52. Mack ML, Love BC, Preston AR. Dynamic updating of hippocampal object representations reflects new conceptual knowledge. PNAS. 2016;113:13203–13208. doi: 10.1073/pnas.1614048113.
  53. Mack ML, Preston AR, Love BC. Ventromedial prefrontal cortex compression during concept learning. Nature Communications. 2020;11:46. doi: 10.1038/s41467-019-13930-8.
  54. McKenzie S, Frank AJ, Kinsky NR, Porter B, Rivière PD, Eichenbaum H. Hippocampal representation of related and opposing memories develop within distinct, hierarchically organized neural schemas. Neuron. 2014;83:202–215. doi: 10.1016/j.neuron.2014.05.019.
  55. McNamee D, Rangel A, O'Doherty JP. Category-dependent and category-independent goal-value codes in human ventromedial prefrontal cortex. Nature Neuroscience. 2013;16:479–485. doi: 10.1038/nn.3337.
  56. Mian MK, Sheth SA, Patel SR, Spiliopoulos K, Eskandar EN, Williams ZM. Encoding of rules by neurons in the human dorsolateral prefrontal cortex. Cerebral Cortex. 2014;24:807–816. doi: 10.1093/cercor/bhs361.
  57. Muñoz-Moldes S, Cleeremans A. Delineating implicit and explicit processes in neurofeedback learning. Neuroscience and Biobehavioral Reviews. 2020;118:681–688. doi: 10.1016/j.neubiorev.2020.09.003.
  58. Neubert FX, Mars RB, Sallet J, Rushworth MF. Connectivity reveals relationship of brain areas for reward-guided learning and decision making in human and monkey frontal cortex. PNAS. 2015;112:2695–2704. doi: 10.1073/pnas.1410767112.
  59. Niv Y, Daniel R, Geana A, Gershman SJ, Leong YC, Radulescu A, Wilson RC. Reinforcement learning in multidimensional environments relies on attention mechanisms. Journal of Neuroscience. 2015;35:8145–8157. doi: 10.1523/JNEUROSCI.2978-14.2015.
  60. Niv Y. Learning task-state representations. Nature Neuroscience. 2019;22:1544–1553. doi: 10.1038/s41593-019-0470-8.
  61. Oemisch M, Westendorff S, Azimi M, Hassani SA, Ardid S, Tiesinga P, Womelsdorf T. Feature-specific prediction errors and surprise across macaque fronto-striatal circuits. Nature Communications. 2019;10:176. doi: 10.1038/s41467-018-08184-9.
  62. Padoa-Schioppa C, Assad JA. Neurons in the orbitofrontal cortex encode economic value. Nature. 2006;441:223–226. doi: 10.1038/nature04676.
  63. Palminteri S, Wyart V, Koechlin E. The Importance of Falsification in Computational Cognitive Modeling. Trends in Cognitive Sciences. 2017;21:425–433. doi: 10.1016/j.tics.2017.03.011.
  64. Peirce JW. Generating stimuli for neuroscience using PsychoPy. Frontiers in Neuroinformatics. 2008;2:10. doi: 10.3389/neuro.11.010.2008.
  65. Piray P, Dezfouli A, Heskes T, Frank MJ, Daw ND. Hierarchical Bayesian inference for concurrent model fitting and comparison for group studies. PLOS Computational Biology. 2019;15:e1007043. doi: 10.1371/journal.pcbi.1007043.
  66. Schuck NW, Cai MB, Wilson RC, Niv Y. Human orbitofrontal cortex represents a cognitive map of state space. Neuron. 2016;91:1402–1412. doi: 10.1016/j.neuron.2016.08.019.
  67. Schuck NW, Niv Y. Sequential replay of nonspatial task states in the human Hippocampus. Science. 2019;364:eaaw5181. doi: 10.1126/science.aaw5181.
  68. Shapiro AD, Grafton ST. Subjective value then confidence in human ventromedial prefrontal cortex. PLOS ONE. 2020;15:e0225617. doi: 10.1371/journal.pone.0225617.
  69. Shibata K, Watanabe T, Sasaki Y, Kawato M. Perceptual learning incepted by decoded fMRI neurofeedback without stimulus presentation. Science. 2011;334:1413–1415. doi: 10.1126/science.1212003.
  70. Shibata K, Lisi G, Cortese A, Watanabe T, Sasaki Y, Kawato M. Toward a comprehensive understanding of the neural mechanisms of decoded neurofeedback. NeuroImage. 2019;188:539–556. doi: 10.1016/j.neuroimage.2018.12.022.
  71. Sitaram R, Ros T, Stoeckel L, Haller S, Scharnowski F, Lewis-Peacock J, Weiskopf N, Blefari ML, Rana M, Oblak E, Birbaumer N, Sulzer J. Closed-loop brain training: the science of neurofeedback. Nature Reviews Neuroscience. 2017;18:86–100. doi: 10.1038/nrn.2016.164.
  72. Spitmaan M, Seo H, Lee D, Soltani A. Multiple timescales of neural dynamics and integration of task-relevant signals across cortex. PNAS. 2020;117:22522–22531. doi: 10.1073/pnas.2005993117.
  73. Stachenfeld KL, Botvinick MM, Gershman SJ. The Hippocampus as a predictive map. Nature Neuroscience. 2017;20:1643–1653. doi: 10.1038/nn.4650.
  74. Sugimoto N, Haruno M, Doya K, Kawato M. MOSAIC for multiple-reward environments. Neural Computation. 2012;24:577–606. doi: 10.1162/NECO_a_00246.
  75. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
  76. Taschereau-Dumouchel V, Cortese A, Chiba T, Knotts JD, Kawato M, Lau H. Towards an unconscious neural reinforcement intervention for common fears. PNAS. 2018;115:3470–3475. doi: 10.1073/pnas.1721572115.
  77. Tse D, Langston RF, Kakeyama M, Bethus I, Spooner PA, Wood ER, Witter MP, Morris RG. Schemas and memory consolidation. Science. 2007;316:76–82. doi: 10.1126/science.1135935.
  78. Tse D, Takeuchi T, Kakeyama M, Kajii Y, Okuno H, Tohyama C, Bito H, Morris RG. Schema-dependent gene activation and memory encoding in neocortex. Science. 2011;333:891–895. doi: 10.1126/science.1205274. [DOI] [PubMed] [Google Scholar]
  79. Viganò S, Piazza M. Distance and direction codes underlie navigation of a novel semantic space in the human brain. The Journal of Neuroscience. 2020;40:2727–2736. doi: 10.1523/JNEUROSCI.1849-19.2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Wallis JD, Anderson KC, Miller EK. Single neurons in prefrontal cortex encode abstract rules. Nature. 2001;411:953–956. doi: 10.1038/35082081. [DOI] [PubMed] [Google Scholar]
  81. Wang L, Mruczek RE, Arcaro MJ, Kastner S. Probabilistic maps of visual topography in human cortex. Cerebral Cortex. 2015;25:3911–3931. doi: 10.1093/cercor/bhu277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Watkins CJCH, Dayan P. Q-learning. Machine Learning. 1992;8:279–292. doi: 10.1007/BF00992698. [DOI] [Google Scholar]
  83. Wikenheiser AM, Schoenbaum G. Over the river, through the woods: cognitive maps in the hippocampus and orbitofrontal cortex. Nature Reviews. Neuroscience. 2016;17:513–523. doi: 10.1038/nrn.2016.56. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Wilson RC, Takahashi YK, Schoenbaum G, Niv Y. Orbitofrontal cortex as a cognitive map of task space. Neuron. 2014;81:267–279. doi: 10.1016/j.neuron.2013.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Wolpert DM, Kawato M. Multiple paired forward and inverse models for motor control. Neural Networks. 1998;11:1317–1329. doi: 10.1016/S0893-6080(98)00066-5. [DOI] [PubMed] [Google Scholar]
  86. Xu J, Moeller S, Auerbach EJ, Strupp J, Smith SM, Feinberg DA, Yacoub E, Uğurbil K. Evaluation of slice accelerations using multiband echo planar imaging at 3 T. NeuroImage. 2013;83:991–1001. doi: 10.1016/j.neuroimage.2013.07.055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Yamashita O, Sato MA, Yoshioka T, Tong F, Kamitani Y. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. NeuroImage. 2008;42:1414–1429. doi: 10.1016/j.neuroimage.2008.05.050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Zeithamova D, Mack ML, Braunlich K, Davis T, Seger CA, van Kesteren MTR, Wutz A. Brain Mechanisms of Concept Learning. Journal of Neuroscience. 2019;39:8259–8266. doi: 10.1523/JNEUROSCI.1166-19.2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Thorsten Kahnt
Reviewed by: Alireza Soltani

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This study combines a novel behavioral task, reinforcement learning modeling, functional imaging, and neurofeedback to show that learning to focus on what information is important for predicting choice outcomes (i.e., "abstraction") is guided by value signals. Because "abstraction" is a key process underlying flexible behavior, understanding its neural and computational basis is of major importance for cognitive neuroscience.

Decision letter after peer review:

Thank you for submitting your article "Value signals guide abstraction during learning" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Michael Frank as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Alireza Soltani (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Essential revisions:

All three reviewers agreed that your study is well-conducted and the results convincing. However, they also had specific questions and suggestions for improvement. Below is a list of 'essential' comments that we would expect you to address in a revised version of your manuscript. The reviewers also made additional comments in the individual critiques, which we would encourage you to consider when preparing your revision.

1. There were several questions about the modeling. Please add a formal model comparison (accounting for model complexity) and show how much the mixture-of-experts RL model improves the fit over purely abstract and/or feature RL models. In addition, please show the responsibility values for the mixture model. This information, in addition to 'mean expected value', is needed to draw conclusions on the importance of feature and abstract RL models.

2. Please add an analysis of the behavior of excluded subjects. Do they adopt a different strategy, and is that why they could not learn quickly/accurately enough?

3. Does VC-vmPFC coupling predict the abstraction level? This connection seems to be as important to the authors' claims as the discussed relationship between VC-vmPFC coupling and learning speed (Figure 4C).

4. Please discuss issues around efficiency and plausibility that result from running 4 models simultaneously. That is, would it not be better if the brain only implemented the most complex algorithm instead of this algorithm in addition to the 3 simpler models?

5. Could it be that participants are simply better (more correct choices) in the abstract blocks (which presumably are also the later blocks)? If so, what does that mean for the value contrast in vmPFC? Do they reflect performance or strategy?

6. Please plot performance separately for CO, CD and OD blocks as well as for 2x2 vs 3x1 blocks.

7. Please clarify how the variance over RPEs (v) was calculated.

8. Was the cross-validation done between runs? If not, it should be done between runs, if possible.

9. The specific difference between relevant and irrelevant features seems important. Please add Figure S6 into the main manuscript.

10. Please add the results of the neurofeedback experiment. Were participants successful at increasing the size of the disc? Was there a correlation between this success and subsequent performance on the association paradigm? Full results can be provided in the supplements but should be referenced in the main text.

Reviewer #1:

Overall, the question studied in this work is timely, interesting and important. More specifically, although previous modeling studies have focused on explaining how humans and other animals can learn informative abstract representations at the behavioral level, the underlying neural mechanisms remained poorly understood. Cortese and colleagues performed modeling analyses using a mixture-of-experts RL that consists of abstract and feature RL models, as well as behavioral analyses (analyses of choice in a learning task), to demonstrate that the performance of human subjects in a multi-dimensional learning task depends on their adopted level of abstraction. Supporting their modeling and behavioral analyses, the authors analyzed fMRI data to demonstrate that the connections between ventromedial prefrontal cortex (vmPFC), the brain area encoding value signals, and visual cortex (VC) can predict subjects' learning speed, which is an indicator of the adoption of abstract representations. Lastly, to demonstrate a causal relationship between VC and the adoption of abstract representations, the authors used a multivoxel neurofeedback procedure and showed that artificially adding value to features in VC results in an increase in the adoption of abstract representations.

Although the provided analyses are thorough and the results convincing, further quantitative analyses could be included to strengthen the main claims of the study. More specifically, it would be helpful to show that results from fitting the mixture-of-experts approach are fully consistent with analyses using purely Abstract and/or Feature RL models. Additionally, an analysis of excluded subjects' behavior is missing. This is important because failure in performing the task could indicate alternative (but unsuccessful) representations adopted by some subjects.

Comments for the authors:

(1.1) Page 6-8 and Figure 2: I could have missed this, but the authors don't seem to provide any formal comparison between the goodness-of-fit of the mixture-of-experts RL model and the pure Feature RL or Abstract RL models. For example, a simple Abstract RL model with only the informative features could capture the behavior of certain subjects. I ask this because the mixture-of-experts RL model contains more parameters than the Feature and Abstract RL models. I wonder, when accounting for the extra parameters, would the mixture-of-experts RL model still provide a better fit? Please clarify.

(1.2) Related to the previous point, if the mixture-of-experts RL model provides a better fit, how much of the captured variance is related to the Feature RL vs Abstract RL experts (e.g. Figure 3F, G)? Perhaps this could be answered by examining the weight assigned to each type of RL in this model (λ values).

(1.3) The authors don't show the values of the responsibility signals in the mixture-of-experts model. This information, in addition to 'mean expected value', is needed to draw conclusions on the importance of the Feature and Abstract RL models.

2) I feel that the decoding analysis can be further improved. For example, do the authors see any changes occurring as a result of experience in the task? Also, a relevant reference is a study by Oemisch et al., Nat. Comm 2019, in which the prevalence of feature-encoding neurons is examined.

3) How did the excluded subjects perform the task? Do they adopt a different strategy, and is that why they could not learn quickly/accurately enough? For example, did they learn about the value of different features and combine these values to make decisions (as in feature-based RL in Farashahi et al., 2017 or Farashahi et al., 2020, which is different from Feature RL and closer to Abstract RL)? Please comment.

4) Does VC-vmPFC coupling predict the abstraction level? This connection seems to be as important to authors' claims as the discussed relationship between VC-vmPFC coupling and learning speed (Figure 4C).

Reviewer #2:

Cortese and colleagues report two experiments in which human subjects made choices based on cues that had three distinct visual features. Only two of the three visual features were needed to make a correct choice. Hence participants could safely ignore one feature and learn based only on the two relevant features (a process the authors call abstraction). The authors modelled how behaviour and ventromedial prefrontal cortex activity shifted from processing all features to only the two relevant features and sought to elucidate the role of feature valuation in this process. In a second experiment, the authors used a real-time neurofeedback approach to tag visual representations of features and showed how this feature valuation process shapes the feature selection described in Experiment 1.

Past research has investigated the process by which humans and other animals learn to attend relevant features during reinforcement learning (e.g., Niv et al., 2015; Leong et al., 2017). These studies have outlined how reward shapes which features we pay attention to, and how attention shapes how we process reward. While a true account of how the process of "abstraction" might occur is still outstanding in my opinion (see below), this study adds some important insights about this process. A main point is that the authors show changing representations of features directly in vmPFC, which co-occur and interact with values. They also provide insight into the unique roles vmPFC and the hippocampus might have in this process, and how vmPFC value signals interact with sensory areas.

One particularly interesting aspect of this study is the use of neurofeedback to achieve reward-tagging of visual representations. This approach is noteworthy as it does not require pairing reward with the visual features themselves, but rather with the occurrence of neural representations that reflect said features. The behavioural effects of this manipulation on later learning were impressively strong: if the task required attending to features that were tagged with reward, behaviour was guided more strongly by appropriate selective learning; if the task required ignoring the features that were previously tagged with reward, the learning process was unchanged. This suggests that the process of selecting relevant features during learning interacts with a neural mechanism that tracks the values associated with these features. This conclusion is also supported by the fact that the same brain area that tracked the expected values of the stimuli during the task, vmPFC, was modulated by participants' level of feature selection, i.e. abstraction.

One weakness of this study is that the mechanism of abstraction remains unclear. The authors use a mixture of experts architecture of 4 different RL models: one RL model that tries to learn the appropriate action as a function of all visual features of the cues, and 3 models that try to learn based on the possible subsets of only two features.

I have some concerns about this approach. One concern is that the modelling presumes that participants concurrently run all 4 RL models and continuously decide which one is best. The whole purpose of using a lower dimensional model is that it is more efficient. Permanently using 4 models, including the highest dimensional one, seems to defeat the purpose for which the search for a best model was initiated in the first place. Arguably, such a scheme would also not necessarily predict that vmPFC should come to selectively represent only the most relevant features, since the model that requires processing all features needs to be kept up to date. It also does not shed light on how participants could ever truly stop paying attention to some features, as feature selection is achieved only by weighting the model with the lowest prediction errors (relative to the variance) most strongly in the action selection process. In other words: I am unsure whether the manuscript presents a reasonable account of how representations become transformed. Other models that do not suffer from these shortcomings, such as a function approximation model, might have added important insights to this study. Ideally, the presented model could also explain another interesting observation made by the authors: that performance improves over blocks, even though the relevant features change. This probably reflects that participants learned something more global about the dimensionality of the relevant space, but such a learning process is not accounted for by the authors. On the positive side, while such concurrent training of 4 models seems computationally inefficient, it is at least data efficient, as each experience is used to update all models at once. And the mixture-of-experts approach may be considered a tool to investigate feature selection, rather than a cognitive model. This should be clarified in the manuscript.
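To make the architecture under discussion concrete, a minimal sketch of such a mixture-of-experts learner might look as follows. The class names, the Gaussian responsibility rule with variance nu, and the forgetting scheme are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

class ExpertRL:
    """One expert: tabular Q-learning over a subset of binary cue features.
    A hypothetical simplification of the Feature RL / Abstract RL experts."""
    def __init__(self, feature_idx, n_actions, alpha=0.2, gamma=0.95):
        self.feature_idx = feature_idx        # which cue features this expert attends
        self.alpha = alpha                    # learning rate
        self.gamma = gamma                    # forgetting factor
        self.Q = np.zeros((2 ** len(feature_idx), n_actions))

    def state(self, features):
        # collapse the attended (binary) features into a single state index
        bits = [features[i] for i in self.feature_idx]
        return int(np.ravel_multi_index(bits, [2] * len(bits)))

    def update(self, features, action, reward):
        s = self.state(features)
        rpe = reward - self.Q[s, action]      # reward prediction error
        self.Q *= self.gamma                  # decay all values (forgetting)
        self.Q[s, action] += self.alpha * rpe
        return rpe

class MixtureOfExperts:
    """Arbitration by responsibility signals: experts whose predictions are
    more accurate (small |RPE| under a Gaussian of variance nu) gain weight."""
    def __init__(self, experts, nu=0.5):
        self.experts = experts
        self.nu = nu
        self.lam = np.ones(len(experts)) / len(experts)   # responsibilities

    def action_values(self, features):
        qs = np.array([e.Q[e.state(features)] for e in self.experts])
        return self.lam @ qs                  # responsibility-weighted action values

    def learn(self, features, action, reward):
        rpes = np.array([e.update(features, action, reward) for e in self.experts])
        self.lam *= np.exp(-rpes ** 2 / (2 * self.nu))
        self.lam /= self.lam.sum()
```

With three binary cue features, the four experts discussed above would be `ExpertRL([0, 1, 2], n_actions)` plus the three two-feature subsets `[0, 1]`, `[0, 2]`, and `[1, 2]`.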

Another weakness of the manuscript in my opinion is that the valuation process targeted in the neurofeedback experiment presupposes that visual features are predictive of reward. One important aspect of abstraction, however, is that they may not be, as the same feature could lead to different outcomes, for instance based on unobservable context.

Comments for the authors:

– It would be great to try to model how longer-term knowledge about rewarding features and dimensionality of the task influence performance. How does the change over blocks occur? How do biases, as introduced through the neurofeedback procedure, influence model selection in the mixture of experts approach?

– Figure 5: Could it be that participants are simply better (more correct choices) in the abstract blocks (which presumably are also the later blocks)? If so, wouldn't that mean that the high–low value contrast in vmPFC will necessarily be higher for abstract blocks, but that it could reflect performance rather than strategy?

– It would be interesting to see performance separately for CO, CD and OD blocks as well as for 2x2 vs 3x1 blocks. Is there a difference between 2x2 vs 3x1 blocks?

– Please consider avoiding the word "predict" when reporting regression analyses or other types of non-causal effects.

– I did not follow how the variance over RPEs (v) is calculated. It would be important to clarify that in the manuscript, and indicate how it changes as learning progresses.

Does a small variance imply that all models have similar RPEs? If so, I am not sure the statement that it is related to sharper model selection is the only way to view it. It seems it could also be related to more model similarity.

– Isn't the fact that the relevant AbRL has higher values and learns faster trivial, given the design of the task? Would there have been any possibility that these results would not have come out? If not, I believe all p values should be removed.

– It would be great to add the results from Figure S6 into the main manuscript. The specific difference between relevant and irrelevant features seems important.

– Decoding: was the cross-validation done between runs? If not, it should be done between runs, if possible.

– Neurofeedback: can you provide more information about how well participants performed, and how long the neurofeedback effect was present in the later task blocks (did it diminish over time)?

Reviewer #3:

The authors of this study aimed to demonstrate that abstract representations arise during the course of learning and to clarify the role of the vmPFC in this process. In a novel association learning paradigm, it was shown that participants used abstract representations more as the experiment went on, and that these representations resulted in enhanced performance and confidence. Using decoded neurofeedback, (implicit) attention to certain features was reinforced monetarily, and this led to these features being used more during the association task. They conclude that top-down control (vmPFC control of sensory cortices) guides the use of abstract representations.

The strengths of this paper include an objective, model based assessment of reinforcement learning, a strong and simple experimental paradigm incorporating variable stopping criteria, and the incorporation of decoded neurofeedback to determine if these representations could be covertly reinforced and affect behavior.

The weaknesses include a small sample, the lack of subjective evaluation of strategies/learning, and the omission of neurofeedback learning results.

Overall, the authors achieved their aims and the data support their conclusions.

This work will be of significance to computational psychologists, those who study abstraction and decision making, and those interested in the role of the vmPFC. One exciting implication of this work is that the use of certain features can be reinforced via decoded neurofeedback.

Comments for the authors:

I am not an expert in computational methods; therefore my comments are largely restricted to the neurofeedback study. The neurofeedback task is well designed, and the use of relevant and irrelevant features is a nice control condition. That the effects were observed only for relevant blocks, together with the finding of increased abstraction in the late blocks of the main experiment, strengthens their conclusions regarding causality.

While this is not the main focus of the manuscript, a supplement should contain the results of the neurofeedback experiment. Were participants successful at increasing the size of the disc? Was there a correlation between this success and subsequent performance on the association paradigm?

6 s of modulation seems short for neurofeedback studies; please justify this short modulation time.

Finally, I am curious as to whether subjects were interviewed regarding the strategies they were using during the association paradigm. Were they aware they were using abstraction?

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Value signals guide abstraction during learning" for further consideration by eLife. Your revised article has been evaluated by Michael Frank (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1. Please revise your paper such that a casual reader will not erroneously take away that the MoE model presents a solid account of the data. Right now, the MoE is still mentioned in the abstract and also presented prominently in one of the main figures. You can leave the model in the manuscript if you want, but please further tone down any claims related to it.

2. It would be important to mention that some of the excluded subjects had good overall performance and the distribution of strategies was different among them.

Reviewer #1:

The authors have adequately and thoroughly addressed my concerns and questions. The only remaining concern I have is related to point #1. Based on the presented results, it seems that the MoE model does not provide the best fit to the data. However, the authors clearly mention and discuss this limitation in the revised manuscript. I have no further comments or concerns.

Reviewer #2:

I thank the authors for their thorough response to our previous concerns. I have two main concerns left:

1. The model comparison seems to refute the MoE model. At the same time, it seems clear that neither the FeRL nor the AbRL model alone can truly capture participants' behavior, since participants switch from one model to the other during the course of behavior. I think this should be made very clear in the paper, and I wonder how useful including the MoE model is.

My main reason is as follows: the core benefit of the MoE model, its ability to flexibly mix the two strategies, is seemingly not implemented in a way that reflects participants' behavior. Would there be any way to improve the MoE model's flexibility? The fact that it provides a "proof of concept that an algorithmic solution to arbitrate between representations / strategies exists" alone does not convince me, since the arbitration itself seems not to capture behavior, and the mere existence of some algorithm is hardly surprising. In addition, there are the concerns about how realistic the MoE model is, which were raised under point 4.

I am also wondering whether the bad fit of the MoE model reflects how the fits were calculated: within each block, and then averaged (if I understood correctly)? Does that mean there was a new set of parameters per block? Have the authors tried to fit over the entirety of the experiment, using one set of parameters?

Relatedly, I believe that the change between strategies over time should be presented in one of the main figures, as this is an important point (e.g. by putting the rightmost graph from the figure shown in the point-by-point response in the main paper).

2. I am also not fully convinced by the explanations about exclusions. The fact that the excluded subjects showed a different distribution of strategies should not serve as a reason for exclusion, since the purpose of the paper is to elucidate, in an unbiased manner, the distribution as it exists in the general population. The reported accuracy also does not seem very low for some participants. To me it seems that including the overall high performing subjects (with e.g. avg % correct > 70%) would provide a more unbiased sample.

eLife. 2021 Jul 13;10:e68943. doi: 10.7554/eLife.68943.sa2

Author response


Essential revisions:

All three reviewers agreed that your study is well-conducted and the results convincing. However, they also had specific questions and suggestions for improvement. Below is a list of 'essential' comments that we would expect you to address in a revised version of your manuscript. The reviewers also made additional comments in the individual critiques, which we would encourage you to consider when preparing your revision.

1. There were several questions about the modeling. Please add a formal model comparison (accounting for model complexity) and show how much the mixture-of-experts RL model improves the fit over purely abstract and/or feature RL models. In addition, please show the responsibility values for the mixture model. This information, in addition to 'mean expected value', is needed to draw conclusions on the importance of feature and abstract RL models.

As per your suggestion, we have now included a formal model comparison. Using a hierarchical Bayesian approach (Piray et al. 2019), we performed concurrent fitting and comparison of all 3 models: mixture-of-experts RL (MoE-RL), Feature RL (FeRL), and Abstract RL (AbRL). We now show in the supplementary material (Figure 2—figure supplement 1) the model frequency (i.e., goodness of fit) in the sample. More specifically, for each block, we report the number of participants for whom the MoE-RL, FeRL, or AbRL best explains their learning/choice data. Furthermore, Figure 2D in the main text now includes the responsibility values for the mixture model, in the same format as the 'mean expected value'.

Note that the model comparison shows that the MoE-RL improves the fit over the purely abstract or feature RL models only in very few instances. It is important to remember that FeRL and AbRL are much simpler models, and that AbRL is an oracle model (i.e., the relevant dimension – unknown to the participant – is set by the experimenter). Our rationale in devising the MoE-RL model in the first part of the study was not to show that it is superior to the purely abstract or feature RL models. MoE-RL was introduced as a proof of concept that an algorithmic solution to the arbitration between strategies is possible, setting the stage for the second part of the work, which focused on a direct comparison of the two simpler models – Feature RL and Abstract RL (which were used for all the remaining neuroimaging data analyses). We have now amended the text in the manuscript to make this clearer.

Main text (pp. 8, lines 212 – 227):

“The mixture-of-expert RL model revealed that participants who learned faster relied more on the best RL model value representations. […] Hence, we next sought to explicitly explain participant choices and learning according to either Feature RL or Abstract RL strategy.”
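To illustrate the kind of per-block tally underlying this model-frequency analysis, here is a hedged fixed-effects sketch that penalises model complexity via BIC. The hierarchical Bayesian method of Piray et al. (2019) is more sophisticated (it pools evidence across subjects), and the function names and data layout below are assumptions for illustration only:

```python
import numpy as np

def bic(neg_log_lik, n_params, n_trials):
    """Bayesian information criterion; lower is better (penalises complexity)."""
    return 2.0 * neg_log_lik + n_params * np.log(n_trials)

def per_block_model_counts(fits, models=("MoE-RL", "FeRL", "AbRL")):
    """fits[subject][block][model] = (neg_log_lik, n_params, n_trials).
    Returns, per block, the number of subjects best fit by each model --
    an illustrative stand-in for the reported model-frequency analysis."""
    counts = {}
    for blocks in fits.values():
        for block, per_model in blocks.items():
            tally = counts.setdefault(block, {m: 0 for m in models})
            best = min(models, key=lambda m: bic(*per_model[m]))
            tally[best] += 1
    return counts
```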

2. Please add an analysis of the behavior of excluded subjects. Do they adopt a different strategy, and is that why they could not learn quickly/accurately enough?

Of the 13 subjects who were excluded, 2 had no recorded data, as the experiment never went past the preliminary stage. As per your request, we have now added the analysis of the 11 remaining excluded subjects (these also included 2 subjects who were removed for high head motion in the scanner). Briefly, excluded subjects displayed lower response accuracy in choosing the preferred fruit, as well as a significantly higher (resp. lower) proportion of blocks labelled as Feature RL (resp. Abstract RL). Based upon these results, it would appear that excluded participants tended to remain 'stuck' in a non-optimal learning regime (Feature RL). These results are now reported in Figure 1—figure supplement 2 and referenced in the main text.

Main text (pp. 24, lines 673 – 675):

“Figure 1—figure supplement 2 reports a behavioural analysis of the excluded participants to investigate differences in performance or learning strategy compared to the 33 included participants.”

3. Does VC-vmPFC coupling predict the abstraction level? This connection seems to be as important to the authors' claims as the discussed relationship between VC-vmPFC coupling and learning speed (Figure 4C).

We agree with the reviewer that the VC-vmPFC coupling with learning speed suggests that a similar coupling might also exist with abstraction level. We have conducted the suggested analysis – we do not find an effect that passes the statistical threshold, only a non-significant trend (robust linear regression, p = 0.065 one-sided). However, we should also highlight the exploratory nature of these between-subjects correlations, since our study was not optimised to detect between-subjects effects (which generally require a much larger number of subjects). We have now added the plot as Figure 4—figure supplement 2, reported the result in the main text, and mentioned this caveat.

Main text (pp. 12, lines 331 – 335):

“The strength of the vmPFC – VC coupling showed a non-significant trend with the level of abstraction (N = 31, robust regression, slope = 0.013, t29 = 1.56, p = 0.065 one-sided, Figure 4—figure supplement 2). […] Therefore, future work is required to confirm or falsify this result.”

4. Please discuss issues around efficiency and plausibility that result from running 4 models simultaneously. That is, would it not be better if the brain only implemented the most complex algorithm instead of this algorithm in addition to the 3 simpler models?

As one of the reviewers astutely noticed, while the mixture-of-experts model is not the most computationally thrifty model, it is very data efficient (i.e., the same data points can be used to update multiple models / representations in parallel). This model was introduced in the manuscript not necessarily as the most realistic model, but as a proof of concept of a cognitive architecture that can arbitrate between the abstract and feature-based learning strategies. This was to set the stage for comparing these 2 strategies in the neuroimaging data analysis that was the main scope of this work. We have now clarified this in the manuscript and added a formal model comparison in response to the first query. Note that the model comparison shows that the MoE-RL improves the fit over the purely abstract or feature RL models only in very few instances. It is important to remember that FeRL and AbRL are much simpler models, and that AbRL is an oracle model (i.e., the relevant dimension – unknown to the participant – is set by the experimenter). Our rationale in devising the MoE-RL model in the first part of the study was not to show that it is superior to the purely abstract or feature RL models. MoE-RL was introduced as a proof of concept that a simple algorithmic solution to the arbitration between strategies is possible.

We agree with the reviewer that more work needs to be done to establish the actual computational basis humans use to select the correct strategy in each circumstance. We share their feeling that humans might (at least at the conscious level) engage with one hypothesis at a time. However, there is circumstantial evidence that multiple strategies might be computed in parallel but deployed one at a time (Domenech et al. 2014, Koechlin 2018). Given how little we know about how different algorithmic architectures for solving this kind of problem are implemented by the brain, we agree that these are important issues that need to be discussed. We have made clear our goal to show that arbitration between feature and abstract learning can be achieved using a relatively simple algorithm (the MoE-RL), and have then proceeded to characterise the neural underpinnings of these two types of learning (i.e., FeRL and AbRL). These points are also elaborated in the discussion.

Main text (pp. 18 – 19, lines 513 – 529):

“An interesting and open question concerns whether the brain uses abstract representations in isolation – operating in a hypothesis-testing regime – i.e., favouring the current best model; or whether representations may be used to update multiple internal models, with behaviour determined by their synthesis (as in the mixture-of-experts architecture). […] Future work will need to establish the actual computational strategy employed by the human brain, further examining how it may also vary across circumstances.”

5. Could it be that participants are simply better (more correct choices) in the abstract blocks (which presumably are also the later blocks)? If so, what does that mean for the value contrast in vmPFC? Do they reflect performance or strategy?

The intuition is correct: in the Abstract RL blocks, participants tended to make more correct choices on average (now reported in Figure 5—figure supplement 1); therefore, Abstract RL blocks were associated with higher expected value compared with Feature RL blocks. As the reviewer correctly hinted, this is probably because abstract strategies were more frequent in late trials, in which performance is usually higher. This small but significant result is now reported in Figure 3—figure supplement 2. Importantly for the GLM analysis at the trial-by-trial level, value (high/low) and learning strategy (Feature RL or Abstract RL) were uncorrelated (Figure 5—figure supplement 2A), confirming the regressors' orthogonality and allowing us to include both regressors in the same GLM.

To recapitulate, we included regressors for 'early', 'late', 'High value', 'Low value', 'Feature RL', and 'Abstract RL', such that the GLM explicitly controlled for the idiosyncratic features of the task.

We are therefore confident that our GLM was able to correctly disentangle the contributions of all these parameters to the neural signal.
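For illustration, the orthogonality check described above reduces to correlating the two per-trial indicator vectors; the sketch below assumes hypothetical 0/1 regressors and is not the authors' actual GLM code:

```python
import numpy as np

def regressor_correlation(value_reg, strategy_reg):
    """Pearson correlation between, e.g., a high/low-value indicator and a
    Feature RL / Abstract RL indicator; a near-zero value supports including
    both regressors in the same GLM without collinearity problems."""
    return float(np.corrcoef(value_reg, strategy_reg)[0, 1])
```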

We have updated Supplementary note 1 – we refer to it in the main text to better explain the idiosyncrasies of the task / conditions and their controls in the main GLM analysis. The text is reported below.

Main text (pp. 12, lines 353 – 355):

“Having established that the vmPFC computes a goal-dependent value signal, we evaluated whether the activity level of this region was sensitive to the strategies that participants used. To do so, we used the same GLM introduced earlier, and estimated two new statistical maps from the regressors ‘Abstract RL’ and ‘Feature RL’ while controlling for idiosyncratic features of the task, i.e., high/low value and early/late trials (see Methods and Supplementary note 1).”

“Supplementary Note 1:

On average Abstract RL blocks tended to be later blocks (Figure 3F-G, Figure 3—figure supplement 2A), and to be associated with a slightly but significantly higher ratio of correct to incorrect responses (Figure 5—figure supplement 1). […] Other regressors of no interest were motion parameters, mean white matter signal, mean cerebro-spinal fluid signal, block, constant.”

6. Please plot performance separately for CO, CD and OD blocks as well as for 2x2 vs 3x1 blocks.

Thanks for the suggestion; we have done this. Figure S1 displays performance (learning speed) plotted separately for CO, CD, OD blocks, as well as for 2x2 and 3x1 blocks. There were no significant differences between these conditions.

7. Please clarify how the variance over RPEs (v) was calculated.

We apologize for our oversight – the variance over RPEs (v) was a hyperparameter estimated at the participant level, in each block. This has been clarified in the manuscript.

Main text (pp. 6, lines 177 – 179):

“Estimated hyperparameters (learning rate 𝛂, forgetting factor 𝛄, RPE variance 𝛎) were used to compute value functions of participant data, as well as to generate new, artificial choice data and value functions.”
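To illustrate how such hyperparameters can generate artificial choice data, here is a hedged sketch of a softmax Q-learner with forgetting; the softmax choice rule, its temperature beta, and the random cue sequence are assumptions rather than the authors' exact generative procedure:

```python
import numpy as np

def simulate_block(alpha, gamma, reward_fn, n_states=8, n_actions=2,
                   n_trials=80, beta=5.0, seed=0):
    """Generate artificial choices from a Q-learner with learning rate alpha
    and forgetting factor gamma; reward_fn(s, a) returns the trial reward."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    choices, rewards = [], []
    for _ in range(n_trials):
        s = rng.integers(n_states)                 # random cue on each trial
        p = np.exp(beta * Q[s]); p /= p.sum()      # softmax action probabilities
        a = int(rng.choice(n_actions, p=p))
        r = reward_fn(s, a)
        rpe = r - Q[s, a]                          # reward prediction error
        Q *= gamma                                 # forget all values slightly
        Q[s, a] += alpha * rpe
        choices.append(a); rewards.append(r)
    return choices, rewards, Q

# Example with a hypothetical rule: the correct action depends on cue parity
choices, rewards, Q = simulate_block(0.3, 0.98, lambda s, a: float(a == s % 2))
```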

8. Was the cross-validation done between runs? If not, it should be done between runs, if possible.

The cross-validation was done by repeatedly splitting the whole dataset into a training and a test group, at random (N = 20). We applied this procedure because the number of trials available for each class differed across conditions and best models (Feature RL and Abstract RL). For example, Feature RL may have had 128 trials labelled as 'green' and 109 as 'red', while Abstract RL had 94 labelled as 'green' and 99 as 'red'. We thus selected, in each fold, the number of trials representing 80% of the data in the condition with the lowest number of trials (in this example, 80% of 94). This procedure allowed us to avoid a situation in which different amounts of data are used to train the classifiers in different conditions, which would make comparisons or performance averages harder to interpret. Nevertheless, it has rightfully been pointed out that the standard procedure involves testing a classifier on data from a different run (leave-one-run-out cross-validation), such that even subtle differences across runs cannot be exploited by the algorithm.

Therefore, we have now implemented this procedure, recommended by the reviewer, as the primary analysis (reported in the new Figure 5C). Note that results from these two cross-validation approaches closely align, and we have now moved the original result to supplementary (Figure S10).
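A generic leave-one-run-out decoding loop of this kind can be written with scikit-learn's LeaveOneGroupOut; the sketch below uses placeholder arrays and a plain logistic-regression decoder, and omits the trial-count equating step described above, so it is an illustration rather than the actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

def run_wise_decoding(X, y, runs):
    """X: (n_trials, n_voxels) activity patterns; y: feature labels;
    runs: run index per trial. Training and test trials never share a run,
    so run-specific signal differences cannot inflate accuracy."""
    accuracies = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=runs):
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        accuracies.append(clf.score(X[test], y[test]))
    return float(np.mean(accuracies))
```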

Main text (pp. 13, lines 380 – 389):

“We found that classification accuracy was significantly higher in Abstract RL trials compared with Feature RL trials in both the HPC and vmPFC (two-sided t-test, HPC: t32 = -2.37, p(FDR) < 0.036, vmPFC: t32 = -2.51, p(FDR) = 0.036, Figure 5C), while the difference was of opposite sign in VC (t32 = 1.61, p(FDR) = 0.12, Figure 5C). […] A control analysis equating the number of training trials for each feature and condition replicated the original finding (Figure 5—figure supplement 3).”

Methods (pp. 32 – 33, lines 937 – 961):

“Cross-validation was used for each MVP analysis to evaluate the predictive power of the trained (fitted) model. […] Results (Figures 5C, Figure 5—figure supplement 3) report the cross-validated average of the best yielding iteration.”

9. The specific difference between relevant and irrelevant features seems important. Please add Figure S6 into the main manuscript.

We have inserted the previous Figure S6 into the main manuscript, as panels D and E in Figure 5.

10. Please add the results of the neurofeedback experiment. Were participants successful at increasing the size of the disc? Was there a correlation between this success and subsequent performance on the association paradigm? Full results can be provided in the supplements but should be referenced in the main text.

Participants were successful at increasing the size of the disc, with similar levels of performance attained in the first and second sessions. We also show the relationship between success in inducing the target pattern and the subsequent behavioural effect. There was a trend towards a positive correlation between the cumulative session-averaged amount of reward obtained during the NFB manipulation and the strength of the subsequent behavioural effect. Results are reported in the main text and in the supplementary material (Figure 6—figure supplement 1).

Main text (pp. 16, lines 463 – 466):

“Participants were successful at increasing the disk size in the neurofeedback task (Figure 6—figure supplement 1A-B). Furthermore, those who were more successful were also more likely to display larger increases in abstraction in the subsequent behavioural test (Figure 6—figure supplement 1C).”

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1. Please revise your paper such that a casual reader will not erroneously take away that the MoE model presents a solid account of the data. Right now, the MoE is still mentioned in the abstract and also presented prominently in one of the main figures. You can leave the model in the manuscript if you want, but please further tone down any claims related to it.

As per your suggestion, we have toned down all claims related to the MoE model. To this end, we have amended the abstract, introduction, results and discussion. More specifically, we have removed any reference to the MoE model in the abstract and in the summary of results in the discussion, to avoid giving the impression that the MoE model provides the best account of the data. We further acknowledge in the discussion that, with the current task, the MoE model does not present a strong account of the data and is not the main take-home message. Nevertheless, we have decided to keep the model in the manuscript and in Figure 2, as we still believe the MoE provides certain valuable pieces of information: (i) it provides a simple way to arbitrate between internal representations, (ii) it affords a data-efficient approach (one data point can be used to update multiple strategies in parallel), (iii) it shows that participants who are more confident in their performance also exhibit better selection of internal representations.

We report below excerpts of the main text that we have modified.

Abstract:

“Reinforcement-learning algorithms revealed that, with learning, high-value abstract representations increasingly guided participant behaviour, resulting in better choices and higher subjective confidence.”

Main text, introduction (pp., lines):

“Reinforcement learning (RL) and mixture-of-experts (Jacobs et al., 1991; Sugimoto et al., 2012) modelling allowed us to track participant valuation processes and to dissociate their learning strategies (both at the behavioural and neural levels) based on the degree of abstraction.”

Main text, results (pp., lines):

Section title: “Discovery of abstract representations”.

Main text, discussion (pp., lines):

“The ability to generate abstractions from simple sensory information has been suggested as crucial to support flexible and adaptive behaviours (Cortese et al., 2019; Ho et al., 2019; Wikenheiser and Schoenbaum, 2016). […] Of particular importance will be further examining how such strategies vary across circumstances (tasks, contexts, or goals).”

2. It would be important to mention that some of the excluded subjects had good overall performance and the distribution of strategies was different among them.

We agree that mentioning these aspects of the excluded subjects is important. We have thus added this information in the results, in the ‘Behavioural account of learning’ and ‘Behaviour shifts from Feature- to Abstraction-based reinforcement learning’ subsections.

To clarify, subjects were excluded independently of the distribution of strategies, which were computed at a later stage. The a priori criteria for exclusion were: failure to learn the association in 3 blocks or more (i.e., reaching a block limit of 80 trials without having learned the association), or failure to complete more than 10 blocks in the allocated time. These criteria were set to ensure a sufficient number of learning blocks for ensuing analyses.

To avoid any confusion, we have added this information to the main text, in results, subsection ‘Experimental design’.

Main text, results (pp., lines):

“Participants failing to learn the association in 3 blocks or more (i.e., reaching a block limit of 80 trials without having learned the association), and / or failing to complete more than 10 blocks in the allocated time, were excluded (see Methods). All main results reported in the paper are from the included sample of N = 33 participants.”

Main text, results (pp., lines):

“Excluded participants (see Methods) had overall lower performance (Figure 1 supplement 2), although some had comparable ratios correct.”

Main text, results (pp., lines):

“Given the lower learning speed in excluded participants, the distribution of strategies was also different among them, with a higher ratio of Feature RL blocks (Figure 3 supplement 4).”

Associated Data


    Supplementary Materials

    Figure 1—source data 1. Csv: panel C.
    Figure 1—source data 2. Csv: panel D.
    Figure 1—source data 3. Csv: panel E.
    Figure 2—source data 1. Csv: panel D, mean expected value, model.
    Figure 2—source data 2. Csv: panel D, mean expected value, subjects.
    Figure 2—source data 3. Csv: panel D, lambda, model.
    Figure 2—source data 4. Csv: panel D, lambda, subjects.
    Figure 2—source data 5. Csv: panel E.
    Figure 3—source data 1. Csv: panel A left, model simulations histogram of learning speed.
    Figure 3—source data 2. Csv: panel A right, model simulations % failed blocks.
    Figure 3—source data 3. Csv: panel B, scatter plot of model probabilities.
    Figure 3—source data 4. Csv: panel B, violin plot of proportion Abstract RL.
    Figure 3—source data 5. Csv: panel C.
    Figure 3—source data 6. Csv: panel D.
    Figure 3—source data 7. Csv: panel E.
    Figure 3—source data 8. Csv: panels F and G.
    Figure 3—source data 9. Csv: panel H.
    Figure 4—source data 1. Csv: panel C.
    Figure 5—source data 1. Csv: panel B.
    Figure 5—source data 2. Csv: panel C.
    Figure 6—source data 1. Csv: panel B, irrelevant blocks.
    Figure 6—source data 2. Csv: panel B, relevant blocks.
    Figure 6—source data 3. Csv: panel C.
    Figure 6—source data 4. Csv: panel D.
    Transparent reporting form

    Data Availability Statement

    Behavioural data, group-level maps of brain activation, and custom code used to generate results and figures are available at https://github.com/BDMLab/Cortese_et_al_2021 copy archived at swh:1:rev:3ac5090fe0af132364bbf92b9b0dff95919d60ee (Cortese et al., 2021).

    All data generated or analysed during this study are included in the manuscript and supporting files. Source data files have been provided for Figures 1-6. Behavioural data, group-level maps of brain activation, and custom code used to generate results and figures are available at https://github.com/BDMLab/Cortese_et_al_2021, copy archived at https://archive.softwareheritage.org/swh:1:rev:3ac5090fe0af132364bbf92b9b0dff95919d60ee.

