Author manuscript; available in PMC: 2025 Jul 20.
Published in final edited form as: Curr Opin Behav Sci. 2024 Jun;57:101361. doi: 10.1016/j.cobeha.2024.101361

Modelling cognitive flexibility with deep neural networks

Kai Sandbrink 1, Christopher Summerfield 1,*
PMCID: PMC7617834  EMSID: EMS206666  PMID: 40688099

Abstract

Neural networks trained with deep reinforcement learning can perform many complex tasks at similar levels to humans. However, unlike people, neural networks converge to a fixed solution during optimisation, limiting their ability to adapt to new challenges. In this opinion, we highlight three key new methods that allow neural networks to be posed as models of human cognitive flexibility. In the first, neural networks are trained in ways that allow them to learn complementary ‘habit’ and ‘goal’-based policies. In another, flexibility is ‘meta-learned’ during pre-training from large and diverse data, allowing the network to adapt ‘in context’ to novel inputs. Finally, we discuss work in which deep networks are meta-trained to adapt their behaviour to the level of control they have over the environment. We conclude by discussing new insights about cognitive flexibility obtained from the training of large generative models with reinforcement learning from human feedback.

Introduction

Natural environments place heterogeneous demands on an organism. Flat terrain gives way to rocky hillsides, plain sailing to choppy waters or chit-chat to strenuous political debate. Biological lifetimes are short, and organisms cannot learn to expect every challenge that a volatile world may throw up. In psychology and neuroscience, it has long been proposed that brains have evolved tailored mechanisms for cognitive flexibility, that is, computational processes designed to deal with a changeable world through on-the-fly task prioritisation [1–3]. Psychological theories of cognitive flexibility often invoke a dual-process framework in which resource-intensive control systems (housed in prefrontal cortex) can be mobilised to suppress habitual behaviours when environmental demands grow. For example, patients with prefrontal damage are mostly untroubled by routine tasks but falter when required to innovate [4]. In neural recording studies, task contexts that are prone to incur errors or conflict activate putative monitoring mechanisms in medial prefrontal or cingulate cortices [5,6]. Envisaging tabular implementations of reinforcement learning (RL) models, neuroscientists have equated cognitive flexibility with ‘model-based’ processes that exploit a state transition matrix to mentally simulate possible future outcomes (rather than relying on cached value estimates from experience history) [7,8]. Model-based inference may also require prefrontal circuits [9].

Cognitive flexibility is required in tasks where different contexts require different rules. In primates, rapid switching between tasks is facilitated by the formation of explicit neural codes for rules, and computational models that are equipped by hand with rule-coding neurons can account for empirical patterns of task switching and error-related adjustment [10]. If they are invariant to context, rule neurons can also allow agents to generalise over physically dissimilar stimuli with overlapping selection demands. Computational models that draw inspiration from neural architectures and algorithms have shown how abstract rule neurons can emerge during training, and support generalisation at test. Rule neurons emerge most readily when systems are endowed with computational modules that resemble the Prefrontal Cortex (PFC), and use motifs (such as selective gating) that are hallmarks of primate executive function [11].
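As a toy illustration of the gating motif (our own sketch, not a model from Ref. [11]), a one-hot ‘rule’ signal can select which stimulus-to-response pathway drives behaviour, so that the same physical stimulus maps onto different actions under different rules:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_rules, n_actions = 8, 2, 2

stimulus = rng.normal(size=n_features)                   # one stimulus, shared across rules
W = rng.normal(size=(n_rules, n_actions, n_features))    # a separate (hypothetical) readout per rule

def gated_response(stimulus, rule_id):
    """A one-hot 'rule neuron' gates which pathway is allowed to drive the response."""
    gate = np.eye(n_rules)[rule_id]                       # rule-coding activity
    logits = np.einsum('r,raf,f->a', gate, W, stimulus)   # only the selected rule's weights contribute
    return int(np.argmax(logits))

# The same stimulus can yield different actions under different rules.
print(gated_response(stimulus, rule_id=0), gated_response(stimulus, rule_id=1))
```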

However, the recent renaissance in connectionist accounts of perception and cognition [12,13] invites us to consider how notions of cognitive flexibility can be incorporated into computational models based on deep learning systems. At first glance, the two ideas might seem incompatible, because neural networks are trained to converge: to find a fixed point in parameter space at which the task is satisfied and beyond which no further policy change is warranted. So, can networks learn to be cognitively flexible by gradient descent? If so, what constraints on optimisation may be required? Or is the notion of a dedicated mechanism for mental flexibility largely chimeric, with flexible behaviour emerging naturally as undifferentiated neural networks are incrementally trained? In this review, we present three approaches that show how simple changes to inductive biases, at the level of the training distribution and the network architecture, can induce neural networks to exhibit surprisingly flexible behaviour.

Dual-process deep reinforcement learning

Deep learning systems rely on gradual tuning of network parameters to satisfy a cost function. Where the objective is to maximise scalar reward, a deep neural network is trained either to approximate the optimal value function [14] or to learn an optimal policy directly [15]. In the former approach, states are typically buffered in a form of ‘episodic’ memory and replayed in tandem with ongoing experience to stabilise training [16], a process often thought of as echoing the dialogue between hippocampus and neocortex [17]. Here, flexibility arises from the way fast and slow memory processes jointly contribute to value learning [18]. By contrast, the latter alternative, known as the policy gradient approach, uses a deep network to learn a policy π (a distribution over actions given states) that maximises reward, often complementing this ‘actor’ with a separate ‘critic’ network that estimates state values. In neuroscience, policy gradient networks have been used to model perceptual decisions [19] and task switching [20], but they have gained limited traction as general-purpose theories of cognition, perhaps because they are computationally intensive and technically demanding to implement.
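To make the actor-critic idea concrete, the sketch below (a minimal illustration of our own, not any specific model cited here) performs one advantage actor-critic update from a single transition; the network sizes, learning rate and dummy environment data are placeholders:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))   # policy network
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))          # state-value network
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def act(obs):
    """Sample an action from the current policy pi(a|s)."""
    return torch.distributions.Categorical(logits=actor(obs)).sample()

def update(obs, action, reward, next_obs, done, gamma=0.99):
    """One advantage actor-critic update from a single transition."""
    dist = torch.distributions.Categorical(logits=actor(obs))
    value = critic(obs).squeeze(-1)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * critic(next_obs).squeeze(-1)
    advantage = target - value
    policy_loss = -(dist.log_prob(action) * advantage.detach()).mean()   # policy-gradient ('actor') term
    value_loss = advantage.pow(2).mean()                                 # value-regression ('critic') term
    opt.zero_grad()
    (policy_loss + value_loss).backward()
    opt.step()

# Dummy transition to show the call signature; a real agent would loop over an environment.
obs, next_obs = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
update(obs, act(obs), torch.tensor([1.0]), next_obs, torch.tensor([0.0]))
```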

RL has roots in control engineering, the field that derives optimal policies for controlling a dynamical system (e.g. a plant) to meet a specified goal (e.g. maximise output and minimise cost). A canonical idea in control theory is that an optimal policy can be approximated via two quantities: one that specifies the cost q(x) for a given state x, and the other a divergence between the control dynamics p(x′|x, u) and the passive dynamics p(x′|x), which are the goal-conditioned and default transition matrices, respectively, for a given next state x′ and goal u [23]. The intuition is that the passive dynamics (e.g. when navigating, the transition matrix given by a random walk through the environment) should act as a prior for the (goal-dependent) control dynamics, so that in the absence of an externally imposed goal, an agent should revert to the default policy.
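In the notation above, the per-step cost in this framework [23] can be written schematically as the sum of the state cost and the divergence between controlled and passive dynamics:

$$
\ell(x, u) \;=\; q(x) \;+\; \mathrm{KL}\!\left[\, p(x' \mid x, u) \,\|\, p(x' \mid x) \,\right]
\;=\; q(x) \;+\; \sum_{x'} p(x' \mid x, u)\,\log\frac{p(x' \mid x, u)}{p(x' \mid x)},
$$

so that control incurs no extra cost whenever the agent simply follows its passive (default) dynamics.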

Elaborating on this theme, one promising theory proposes a dual-process model of cognition grounded in deep RL [21••]. The idea is that a reward-maximising policy is jointly implemented by two distinct networks, which respectively learn a ‘habit-based’ default policy π0 and a ‘goal-based’ control policy π (Figure 1a). For the networks to adopt these distinctive roles via training from random weights, the system learns to maximise reward under two regularisation constraints: one that keeps π0 simple, and the other that tethers π to π0, so that the two policies do not diverge excessively. The first regulariser ensures that the habit-based network π0 is relatively compressed (encoded with fewer parameters or bits) and thus learns a policy that is simple and general, and that applies in a variety of circumstances. Theoretical work has shown that many cognitive phenomena, including stochasticity of choice, perseveration and chunking, can be rationally explained as a tendency to learn generalisable default tendencies via policy compression [24]. Relatedly, information-theoretic models of the PFC have argued for a subsidiarity principle, whereby organisms rely on the simplest policy possible for task execution, recruiting additional (and anatomically more anterior) control structures only when demanded by the context [25].
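Written schematically (this is our notation, with illustrative trade-off coefficients α and β, not necessarily the exact objective of Ref. [21••]), the system maximises reward subject to the two regularisers:

$$
\max_{\pi,\,\pi_0}\;\; \mathbb{E}_{\pi}\Big[\sum_t r_t\Big]
\;-\; \alpha\,\mathbb{E}_{s}\Big[\mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\pi_0(\cdot \mid s)\big)\Big]
\;-\; \beta\, C(\pi_0),
$$

where C(π0) denotes a description-length (complexity) penalty that keeps the default policy simple, and the KL term tethers the control policy to it.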

Figure 1. Dual-process models of cognitive flexibility.


(a) Schematic diagram of the dual-process network architecture in Ref. [21••]. The lower pathway learns a default or habit-based policy via simplicity regularisation, whereas the upper pathway learns a policy that deviates from this to satisfy goals. (b) Behavioural data from the two-step task [22]: logistic regression weights describing the influence on the current-trial stage-1 choice (stay probability) of outcomes on the preceding five trials, corresponding to (left) model-free, (middle) model-based and (right) mixed RL. (c) Same plots as in (b) but for (left) π and (right) π0; the patterns match those previously described for model-based and model-free behaviour, respectively. (d) Same as panel (c) but with different weightings of the terms in the Minimum Description Length (MDL) objective (compare panel (b), right). Panels reproduced with permission from Ref. [21••].

The second proposed regulariser penalises novel behaviours, ensuring that π does not stray too far from the default π0. It thus keeps cognitive flexibility from running amok, by preventing the network from dramatically overfitting to each goal in turn. Consistent with this idea, recent RL models have proposed that, in addition to classical reward prediction errors (RPEs), the brain computes explicit penalties (termed ‘action prediction error’ (APE) signals) when actions deviate from the norm encoded by a habit-based policy [26,27]. One recent recording study has even provided evidence for value-free teaching signals that resemble APEs in the tail of the mouse striatum, a region that does not receive dopaminergic RPE signalling [28]. More generally, the proposed division of labour between π and π0 resembles that attributed to biological agents equipped with twin habit- and goal-based systems. Indeed, once trained, the twin-network deep RL system was shown to capture a range of canonical behavioural phenomena that are touted as evidence for dual-process cognition in psychology and neuroscience, such as patterns of ‘two-step’ planning behaviour (Figure 1b–d), conflict-based interference and the override of heuristic behaviours in classic judgement tasks [21••].
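As a toy illustration (our own simplification; the precise update rules in Refs. [26,27] differ in detail), an APE can be cast as the mismatch between the action actually taken and the action predicted by the habit policy, driving habit formation without any reward term:

```python
import numpy as np

n_actions = 3
habit_strengths = np.ones(n_actions) / n_actions    # habit policy pi0, initially uniform

def habit_update(action_taken, lr=0.1):
    """Update habits towards the taken action via an APE, with no reward term."""
    taken = np.eye(n_actions)[action_taken]
    ape = taken - habit_strengths                    # action prediction error
    habit_strengths[:] = habit_strengths + lr * ape  # repetition alone strengthens the habit
    return ape

for _ in range(20):
    habit_update(action_taken=0)                     # repeatedly taking action 0 builds a habit for it
print(habit_strengths.round(2))
```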

Deep neural network models of task learning and control expose a ubiquitous problem for intelligent systems: the brain needs constraints that trade off the merits of combining and separating task representations [29]. Neural networks trained via gradient descent from weights that are initially small in scale naturally learn to share task representations where possible (referred to as ‘low-dimensional’ neural coding), which confers robustness and supports behavioural generalisation [30,31]. On the one hand, such sharing is beneficial: a pupil studying both Italian and Spanish can exploit the shared orthography of words such as casa (house), luna (moon) and triste (sad) by learning representations that are common to the two linguistic tasks, and in machine learning, training on multiple auxiliary tasks promotes the acquisition of shared representations, which improves transfer to a target task [32]. However, representation sharing stands in tension with the need to learn distinct policies for divergent tasks: excessive policy compression (here, the over-generalisation of Italian words to Spanish) would lead the student to make errors where vocabulary diverges, such as cane and perro (dog). On the other hand, a bias towards representation sharing (encouraged, for example, by pressure to learn a control policy π that resembles the default π0, and thus to learn representations that are more generalisable) also offers a rational explanation for the costs of multitasking [33••], without having to invoke a nebulous notion of ‘mental resources’ [34].

Deep meta-reinforcement learning

A well-known limitation of deep learning as an account of natural behaviour is that it offers only weak inductive biases, and thus requires training with implausible volumes of data over biologically unfeasible timescales [35]. One method that addresses this issue is meta-learning [36]. In the context of RL, a neural network can be trained on a mixture of tasks spanning a broad distribution of triadic relationships among observations, actions and outcomes [37••,38]. This allows it to acquire a strong inductive bias for solving new, unseen tasks; indeed, there is a formal connection between this ‘meta-RL’ and Bayesian inference, with meta-training endowing the network with a ‘prior’ over tasks [39]. In combination with recurrent memory, meta-RL agents can learn a policy that adapts on the fly to entirely new tasks by ‘learning’ through the adjustment of activation dynamics over trials. They thus learn both ‘in context’ (through the adaptation of recent activity patterns) and ‘in weights’ (through the adjustment of tuneable parameters). Just as a neural network trained to respond ‘cat’ to many natural images of cats can generalise to previously unseen cats, a meta-RL system that learns a time-varying policy for solving 2-back and 4-back memory tasks can generalise this policy to solve a 3-back task on which it has not been trained, merely by inferring the correct task from the history of stimuli, actions and outcomes [40•].
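A minimal sketch of this agent interface (our own illustrative PyTorch code; layer sizes and names are placeholders rather than those of Refs. [37••,38]) shows the key design choice: the recurrent core receives the current observation together with the previous action and reward, so that after meta-training, adaptation to a new task can occur purely through the evolution of the hidden state while the weights stay fixed:

```python
import torch
import torch.nn as nn

class MetaRLAgent(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=48):
        super().__init__()
        self.n_actions = n_actions
        # input = observation + one-hot previous action + previous reward
        self.lstm = nn.LSTM(obs_dim + n_actions + 1, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs, prev_action, prev_reward, state=None):
        # obs: (B, obs_dim); prev_action: (B,) int; prev_reward: (B,)
        a_onehot = nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, a_onehot, prev_reward.unsqueeze(-1)], dim=-1).unsqueeze(1)
        h, state = self.lstm(x, state)        # 'in-context' adaptation lives in `state`
        h = h.squeeze(1)
        return self.policy_head(h), self.value_head(h), state

agent = MetaRLAgent(obs_dim=4, n_actions=2)
logits, value, state = agent(torch.randn(1, 4), torch.tensor([0]), torch.tensor([0.0]))
```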

The authors of one prominent paper (Wang and colleagues, Figure 2a,b) explicitly linked meta-RL to recurrent neural dynamics in the primate prefrontal cortex [37••], where longer time constants of neural integration may support flexible adaptation during sequential decision tasks [41]. Meta-RL may solve novel tasks by learning general-purpose maintenance and selection processes, perhaps implemented in the PFC by gating mechanisms that rely on fast synaptic plasticity. Indeed, the underlying architecture on which this meta-RL system was based, the long short-term memory (LSTM) network, bears strong similarities to some computational theories of PFC function [41,42]. In fact, one recent paper has shown that blocking plasticity in rodent orbitofrontal cortex (OFC) during meta-training on a bandit task disrupts the ability to deal with new contexts with different payout dynamics, consistent with the OFC supporting meta-learning [43••] (Figure 2c,d). Using simulations, Wang et al. [37••] showed that meta-RL networks can match the behaviour of biological systems on a variety of lab-based tasks that are typically thought to involve flexible control processes or planning, including assays of learning set [44] and multi-step inference [22]. This finding is striking because meta-RL agents are trained using model-free methods, and it shows how recurrent memory can be used to solve tasks that require flexible cognitive control.

Figure 2. Deep meta-RL models of cognitive flexibility.


(a) Overview of a meta-RL system. (left) The meta-RL agent is trained on a batch of individual episodes (black arrows separated by dotted red lines). The network is trained with a meta-gradient (dark red) that is backpropagated through all of the episodes in a batch and takes into account the gradients (light red) on each of the individual episodes. (right) This meta-gradient (dark-red arrow) is used to train a recurrent neural network with an actor-critic architecture, which outputs a policy (a distribution over actions) and a value estimate for the current state, given the current observation together with the action taken and the reward received on the previous time step. (b) The resulting learning curve (black arrows) takes on a characteristic shape: performance is low at the beginning of every individual episode (separated by dotted red lines), and the agent learns to adapt more efficiently and flexibly with each subsequent episode. The behaviour closely matches that of the studied mice. (c) Example behaviour of a meta-RL agent in a two-alternative choice session, reproduced from Hattori et al. [43••]. (d) Mean optimality score, which measures the optimality of the action policy in this task given the cumulative nature of reward availability; reproduced with permission from Ref. [43••].

Deep learning models of meta-control

In biological agents, the ability to adapt flexibly to changing circumstances is linked to brain mechanisms that monitor for and respond to incipient conflict or errors. Neural signals in the medial PFC respond during the execution of inappropriate actions, even before any external reward or supervision is administered [45,46], and may be involved in more general regulation of thought and emotion [47]. These signals predict subsequent adjustments to behaviour, as if one dimension of cognitive flexibility is the engagement of control processes to cope with heightened mental demand [48,49], with control allocation increasing when decisions are more impactful, that is, when efficacy is higher [50•]. However, one major challenge for an agent in a stochastic environment is to learn policies that adapt optimally to the level of control that the agent has over the world. For example, if a tennis player’s repeated service faults are due to insufficient practice, then the player should spend more time on court; if they are due to high winds or an uneven surface, then more training will not help. Interestingly, there is evidence that even rodents may solve this complex credit assignment problem. For example, when learning a novel noisy visual stimulus discrimination task, rodents initially seem to devote undue time to each judgement rather than making rapid guesses to maximise reward rate, as if they were allocating resources to an attempt to improve their policy. Indeed, over the course of the whole task, the rats receive more reward than they would have done under a policy that maximised instantaneous reward rate from the outset [51,52••].

In a similar vein, one recent study showed that vanilla deep policy gradient networks, implemented as LSTMs, struggle to meta-learn policies that adapt to the level of controllability in the environment. Networks were meta-trained on multiple versions of an ‘observe or bet’ paradigm [53–55] (Figure 3a), which forces the agent to decide on each trial between observing the outcome of a bandit without being rewarded (observe) or choosing among bandits without viewing any reward obtained (bet). On each block, there was a fixed probability that the action chosen on ‘bet’ trials was randomly perturbed (like a tennis player at the mercy of the elements; Figure 3b); intuitively, if this probability is 100%, then observation is pointless, because your choices never translate into their intended consequences (you have zero ‘efficacy’). Whilst standard meta-trained agents failed to learn to adapt their policy to the prevailing efficacy level (Figure 3c), networks that were additionally trained to predict their own efficacy via a single additional unit, thereby learning a ‘sense of agency’ [56], succeeded and mirrored human performance on the task [57••] (Figure 3d–e). The same results hold in a second task in which participants have access to an additional action (‘sleep’) that allows them to increase their efficacy, but with higher levels of environmental efficacy now corresponding to lower levels of control-seeking behaviour.
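The sketch below illustrates this idea in the spirit of Ref. [57••] (the code, names and loss weighting are our own assumptions, not the authors’ implementation): a recurrent actor gains a single extra read-out unit that predicts efficacy, and its auxiliary prediction loss is optimised alongside the usual reward-driven policy objective:

```python
import torch
import torch.nn as nn

class EfficacyAgent(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=48):
        super().__init__()
        self.lstm = nn.LSTMCell(obs_dim, hidden)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.efficacy_head = nn.Linear(hidden, 1)      # the single additional unit

    def forward(self, obs, hc=None):
        h, c = self.lstm(obs, hc)
        return self.policy_head(h), torch.sigmoid(self.efficacy_head(h)), (h, c)

agent = EfficacyAgent(obs_dim=4, n_actions=3)          # e.g. observe / bet-left / bet-right
logits, efficacy_pred, hc = agent(torch.randn(1, 4))

# Combined loss: the usual reward-driven policy term plus an auxiliary efficacy-prediction term.
true_efficacy = torch.tensor([[0.8]])                  # hypothetical block-level efficacy
policy_loss = torch.tensor(0.0)                        # placeholder for the standard RL loss
aux_loss = nn.functional.binary_cross_entropy(efficacy_pred, true_efficacy)
loss = policy_loss + 0.5 * aux_loss                    # 0.5 is an illustrative weighting
```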

Figure 3. Deep learning models of meta-control.


(a) Network architecture of the neural network model for self-control: a linear read-out of efficacy is trained from a learned recurrent state encoding, and the two are jointly passed as inputs to an actor that decides on a policy. (b) The system is particularly useful when the intended action is not (necessarily) the action that is ultimately executed and that determines the environmental outcome: this is the case both when there is internal motor noise and when there is external variability, such as wind affecting a tennis ball. These situations can be modelled as a two-step task. (c) As in a two-step task, the potential for discrepancies between intended and executed actions results in counter-intuitive update structures, in which negative feedback from the environment does not necessarily mean that the actor should downweight the intended action, but rather that it should work to improve its execution. (d) Policy over an episode for (left) a sample APE-trained model and (right) a non-APE-trained model. (e) Behaviour per efficacy level across five model instantiations, in terms of (left) reward per episode and (right) the number of trials on which the agent chooses to observe.

Analysis of neural coding in the networks able to exert meta-control revealed that they explicitly represented levels of efficacy along the first two principal components of neural activity in the LSTM layer, whereas the purely reward-driven networks did not. This differentiation may allow the networks to assign credit differentially depending on the source of an error (for example, failed execution rather than a poor choice), as humans can [58•]. This motivates the existence of dedicated neural systems, such as those in the medial PFC, that engage in error monitoring and serve to detect mismatches between intended and executed actions [59].
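One plausible version of this analysis (an assumed workflow, not the authors’ exact code) is to record the recurrent hidden state on each trial and ask whether the leading principal components separate the efficacy levels:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(600, 48))     # placeholder for recorded LSTM activity (trials x units)
efficacy_labels = np.repeat([0.0, 0.5, 1.0], 200)

pcs = PCA(n_components=2).fit_transform(hidden_states)
for level in np.unique(efficacy_labels):
    centroid = pcs[efficacy_labels == level].mean(axis=0)
    print(f"efficacy {level:.1f}: PC1/PC2 centroid = {centroid.round(2)}")
```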

Conclusions: cognitive flexibility in deep networks

The remarkable power of in-context learning has recently been demonstrated in transformer-based models trained on autoregressive problems, including learning to execute simple programmes with novel inputs [60–62] and generating plausible responses to natural language queries [63]. Large language models (LLMs) are typically fine-tuned with a method called reinforcement learning from human feedback (RLHF), a form of inverse RL that is used to nudge the model towards more helpful and less harmful replies [64]. Most implementations of RLHF use proximal policy optimisation, an RL method in which the updated policy is tethered to the existing one, so that fine-tuned language models do not stray too far from the pre-training distribution [65]. This is reminiscent of how π is regularised towards π0 in the dual-process networks described above.
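Schematically, the RLHF fine-tuning objective used in this line of work [64,65] maximises a learned reward while penalising divergence from the frozen, pre-trained reference model:

$$
\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\Big[\, r(x, y) \;-\; \beta \,\log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \,\Big],
$$

where r is a reward model trained on human preference judgements, π_ref is the pre-trained model, and the coefficient β plays the same tethering role as the regulariser that keeps π close to π0 in the dual-process networks.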

The deep neural network models described in this review are trained by trial and error. Nevertheless, they are able to recreate many behaviours that cognitive scientists have thought to require dedicated mechanisms for explicit planning and mental simulation, and have ascribed to specialised subsystems for goal-based cognition or model-based inference. LLMs provide the most vivid example of this phenomenon. RLHF is an offline training method that is applied before networks are frozen and deployed. Remarkably, however, its benefits seem to transfer widely across settings; for example, fine-tuning in English seems to generalise readily to other, lower-resource languages [66]. After fine-tuning, leading LLMs seem able to reliably solve (previously unseen) reasoning, maths and coding problems that are classical hallmarks of ‘system-2’ cognition. This shows that, rather than requiring explicit rollouts through a model of the world, these tasks can be solved by ‘habit-based’ fine-tuning of neural networks that have acquired rich semantic knowledge through diverse and voluminous pre-training. These feats of in-context learning (or meta-learning) thus invite us to revisit traditional claims about the computational basis of habit-based (stable) and goal-based (flexible) cognition. It turns out that expressive neural networks, trained by trial and error alone, can show much greater levels of cognitive flexibility than was previously thought possible.

The application of deep neural networks to modelling cognitive flexibility is in its infancy. There are several promising avenues for future research, including dual-process neural networks, deep meta-RL and networks capable of learning meta-control. To date, these approaches have been pursued independently, but they are potentially complementary, and could be combined. An interesting avenue for further research is to focus on how RL can be combined with generative modelling to produce high-performing models that learn from both observations and rewards [67,68]. Insights from the training of large generative models with RL may help us understand how richness of prior experience promotes cognitive flexibility in humans.

Acknowledgements

This work was funded by a European Research Council (ERC) Consolidator Award (725937) to C.S., a Wellcome Trust Discovery Award (227928/Z/23/Z) to C.S. and a Cusanuswerk Doctoral Research Fellowship (German Ministry of Education and Research) to K.J.S.

Footnotes

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

• of special interest

•• of outstanding interest

  • 1.Miller EK, Cohen JD. An integrative theory of prefrontal cortex function. Annu Rev Neurosci. 2001;24:167–202. doi: 10.1146/annurev.neuro.24.1.167. [DOI] [PubMed] [Google Scholar]
  • 2.Desimone R, Duncan J. Neural mechanisms of selective visual attention. Annu Rev Neurosci. 1995;18:193–222. doi: 10.1146/annurev.ne.18.030195.001205. [DOI] [PubMed] [Google Scholar]
  • 3.Egner T. Principles of cognitive control over task focus and task switching. Nat Rev Psychol. 2023;2:702–714. doi: 10.1038/s44159-023-00234-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Shallice T. From Neuropsychology to Mental Structure. Cambridge University Press; 1988. [Google Scholar]
  • 5.Carter CS, Braver TS, Barch DM, Botvinick MM, Noll D, Cohen JD. Anterior cingulate cortex, error detection, and the online monitoring of performance. Science. 1998;280:747–749. doi: 10.1126/science.280.5364.747. [DOI] [PubMed] [Google Scholar]
  • 6.Brown JW, Braver TS. Learned predictions of error likelihood in the anterior cingulate cortex. Science. 2005;307:1118–1121. doi: 10.1126/science.1105783. [DOI] [PubMed] [Google Scholar]
  • 7.Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8:1704–1711. doi: 10.1038/nn1560. [DOI] [PubMed] [Google Scholar]
  • 8.Dolan RJ, Dayan P. Goals and habits in the brain. Neuron. 2013;80:312–325. doi: 10.1016/j.neuron.2013.09.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Glascher J, Daw N, Dayan P, O’Doherty JP. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron. 2010;66:585–595. doi: 10.1016/j.neuron.2010.04.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Botvinick MM, Braver TS, Barch DM, Carter CS, Cohen JD. Conflict monitoring and cognitive control. Psychol Rev. 2001;108:624–652. doi: 10.1037/0033-295x.108.3.624. [DOI] [PubMed] [Google Scholar]
  • 11.Rougier NP, Noelle DC, Braver TS, Cohen JD, O’Reilly RC. Prefrontal cortex and flexible cognitive control: rules without symbols. Proc Natl Acad Sci USA. 2005;102:7338–7343. doi: 10.1073/pnas.0502455102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Doerig A, Sommers R, Seeliger K, Richards B, Ismael J, Lindsay G, Kording K, Konkle T, Van Gerven MAJ, Kriegeskorte N, et al. The Neuroconnectionist Research Programme. 2022 doi: 10.1038/s41583-023-00705-w. [DOI] [PubMed] [Google Scholar]
  • 13.Saxe A, Nelli S, Summerfield C. If deep learning is the answer, what is the question? Nat Rev Neurosci. 2021;22:55–67. doi: 10.1038/s41583-020-00395-8. [DOI] [PubMed] [Google Scholar]
  • 14.Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236. [DOI] [PubMed] [Google Scholar]
  • 15.Williams R. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn. 1992;8:229–256. [Google Scholar]
  • 16.Blundell C, Uria B, Pritzel A, Li Y, Ruderman A, Leibo JZ, Rae J, Wierstra D, Hassabis D. Model-Free Episodic Control. 2016.
  • 17.Kumaran D, Hassabis D, McClelland JL. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn Sci. 2016;20:512–534. doi: 10.1016/j.tics.2016.05.004. [DOI] [PubMed] [Google Scholar]
  • 18.Botvinick M, Ritter S, Wang JX, Kurth-Nelson Z, Blundell C, Hassabis D. Reinforcement learning, fast and slow. Trends Cogn Sci. 2019;23:408–422. doi: 10.1016/j.tics.2019.02.006. [DOI] [PubMed] [Google Scholar]
  • 19.Song HF, Yang GR, Wang X-J. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife. 2017;6:e21492. doi: 10.7554/eLife.21492. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Tsuda B, Tye KM, Siegelmann HT, Sejnowski TJ. A modeling framework for adaptive lifelong learning with transfer and savings through gating in the prefrontal cortex. Proc Natl Acad Sci USA. 2020;117:29872–29882. doi: 10.1073/pnas.2009591117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Moskovitz T, Miller K, Sahani M, Botvinick MM. A Unified Theory of Dual-Process Control. 2023 [•• This study shows that a deep learning system composed of dual neural networks, with two different regularisers, can capture a wide range of putatively ‘habit-based’ and ‘goal-based’ behaviours described in the psychological literature.] [Google Scholar]
  • 22.Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans’ choices and striatal prediction errors. Neuron. 2011;69:1204–1215. doi: 10.1016/j.neuron.2011.02.027. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Todorov E. Efficient computation of optimal actions. Proc Natl Acad Sci USA. 2009;106:11478–11483. doi: 10.1073/pnas.0710743106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Lai L, Gershman SJ. Psychology of Learning and Motivation. Elsevier; 2021. Policy compression: an information bottleneck in action selection; pp. 195–232. [Google Scholar]
  • 25.Koechlin E, Summerfield C. An information theoretical approach to prefrontal executive function. Trends Cogn Sci. 2007;11:229–235. doi: 10.1016/j.tics.2007.04.005. [DOI] [PubMed] [Google Scholar]
  • 26.Bogacz R. Dopamine role in learning and action inference. eLife. 2020;9:e53262. doi: 10.7554/eLife.53262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Miller KJ, Shenhav A, Ludvig EA. Habits without values. Psychol Rev. 2019;126:292–311. doi: 10.1037/rev0000120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Watabe-Uchida M, Uchida N. Multiple dopamine systems: weal and woe of dopamine. Cold Spring Harb Symp Quant Biol. 2018;83:83–95. doi: 10.1101/sqb.2018.83.037648. [DOI] [PubMed] [Google Scholar]
  • 29.Musslick S, Cohen JD. Rationalizing constraints on the capacity for cognitive control. Trends Cogn Sci. 2021;25:757–775. doi: 10.1016/j.tics.2021.06.001. [DOI] [PubMed] [Google Scholar]
  • 30.Flesch T, Juechems K, Dumbalska T, Saxe A, Summerfield C. Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron. 2022;110:1258–1270. doi: 10.1016/j.neuron.2022.01.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Bernardi S, Benna MK, Rigotti M, Munuera J, Fusi S, Salzman CD. The geometry of abstraction in the hippocampus and prefrontal cortex. Cell. 2020;183:954–967. doi: 10.1016/j.cell.2020.09.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Zhang Y, Yang Q. A Survey on Multi-Task Learning. 2021.
  • 33.Musslick S, Saxe A, Özcimder K, Dey B, Henselman G, Cohen JD. Multitasking Capability Versus Learning Efficiency in Neural Network Architectures. 2017. pp. 829–834. [•• This paper uses simulations with deep linear networks to expose how connectionist systems trade off the demands of solving multiple different tasks and generalising information across tasks through shared representations.]
  • 34.Franconeri SL, Alvarez GA, Cavanagh P. Flexible cognitive resources: competitive content maps for attention and memory. Trends Cogn Sci. 2013;17:134–141. doi: 10.1016/j.tics.2013.01.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Lake BM, Ullman TD, Tenenbaum JB, Gershman SJ. Building machines that learn and think like people. Behav Brain Sci. 2017;40:e253. doi: 10.1017/S0140525X16001837. [DOI] [PubMed] [Google Scholar]
  • 36.Binz M, Dasgupta I, Jagadish A, Botvinick M, Wang JX, Schulz E. Meta-Learned Models of Cognition. 2023. [DOI] [PubMed]
  • 37.Wang JX, Kurth-Nelson Z, Kumaran D, Tirumala D, Soyer H, Leibo JZ, Hassabis D, Botvinick M. Prefrontal cortex as a meta-reinforcement learning system. Nat Neurosci. 2018;21:860–868. doi: 10.1038/s41593-018-0147-8. [•• This paper introduces meta-RL as a theory of prefrontal function and shows how it can explain a range of psychological and neural effects in the domain of reward learning.] [DOI] [PubMed] [Google Scholar]
  • 38.Duan Y, Schulman J, Chen X, Bartlett PL, Sutskever I, Abbeel P. RL^2: fast reinforcement learning via slow reinforcement learning. arXiv. 2016 doi: 10.48550/arXiv.1611.02779. [DOI] [Google Scholar]
  • 39.Mikulik V, Delétang G, McGrath T, Genewein T, Martic M, Legg S, Ortega PA. Meta-trained agents implement Bayes-optimal agents. arXiv. 2020:2010.11223 [cs] [Google Scholar]
  • 40.Ortega PA, Wang JX, Rowland M, Genewein T, Kurth-Nelson Z, Pascanu R, Heess N, Veness J, Pritzel A, Sprechmann P, et al. Meta-learning of Sequential Strategies. 2019 [• This paper places memory-based meta-learning, such as meta-RL, within a Bayesian framework.] [Google Scholar]
  • 41.Hazy TE, Frank MJ, O’Reilly RC. Towards an executive without a homunculus: computational models of the prefrontal cortex/basal ganglia system. Philos Trans R Soc B. 2007;362:1601–1613. doi: 10.1098/rstb.2007.2055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Lloyd K, Becker N, Jones MW, Bogacz R. Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats. Front Comput Neurosci. 2012;6 doi: 10.3389/fncom.2012.00087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Hattori R, Hedrick NG, Jain A, Chen S, You H, Hattori M, Choi J-H, Lim BK, Yasuda R, Komiyama T. Meta-reinforcement learning via orbitofrontal cortex. Nat Neurosci. 2023;26:2182–2191. doi: 10.1038/s41593-023-01485-3. [•• This paper shows multiple lines of evidence that the OFC is critically involved in meta-learning in freely behaving mice.] [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Harlow H. The formation of learning sets. Psychol Rev. 1949;56:51–65. doi: 10.1037/h0062474. [DOI] [PubMed] [Google Scholar]
  • 45.Gehring WJ, Goss B, Coles MGH, Meyer DE, Donchin E. A neural system for error detection and compensation. Psychol Sci. 1993;4:385–390. [Google Scholar]
  • 46.Falkenstein M, Hoormann J, Christ S, Hohnsbein J. ERP components on reaction errors and their functional significance: a tutorial. Biol Psychol. 2000;51:87–107. doi: 10.1016/s0301-0511(99)00031-9. [DOI] [PubMed] [Google Scholar]
  • 47.Ochsner KN, Gross JJ. The cognitive control of emotion. Trends Cogn Sci. 2005;9:242–249. doi: 10.1016/j.tics.2005.03.010. [DOI] [PubMed] [Google Scholar]
  • 48.Gratton G, Coles MG, Donchin E. Optimizing the use of information: strategic control of activation of responses. J Exp Psychol Gen. 1992;121:480–506. doi: 10.1037//0096-3445.121.4.480. [DOI] [PubMed] [Google Scholar]
  • 49.Rabbitt PM. Errors and error correction in choice-response tasks. J Exp Psychol. 1966;71:264–272. doi: 10.1037/h0022853. [DOI] [PubMed] [Google Scholar]
  • 50.Frömer R, Lin H, Dean Wolf CK, Inzlicht M, Shenhav A. Expectations of reward and efficacy guide cognitive control allocation. Nat Commun. 2021;12:1030. doi: 10.1038/s41467-021-21315-z. [• This paper shows that people exert more control and allocate more effort in situations where they have greater efficacy over outcomes.] [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Masís JA, Musslick S, Cohen J. The value of learning and cognitive control allocation; Proceedings of the Annual Meeting of the Cognitive Science Society; 2021. [Google Scholar]
  • 52.Masís J, Chapman T, Rhee JY, Cox DD, Saxe AM. Rats Strategically Manage Learning during Perceptual Decision Making. 2020 doi: 10.7554/eLife.64978. [•• This paper describes evidence that rats choose behavioural strategies that facilitate their own learning, alongside those that maximise reward.] [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Tversky A, Edwards W. Information versus reward in binary choices. J Exp Psychol. 1966;71:680–683. doi: 10.1037/h0023123. [DOI] [PubMed] [Google Scholar]
  • 54.Navarro DJ, Newell BR, Schulze C. Learning and choosing in an uncertain world: an investigation of the explore–exploit dilemma in static and dynamic environments. Cogn Psychol. 2016;85:43–77. doi: 10.1016/j.cogpsych.2016.01.001. [DOI] [PubMed] [Google Scholar]
  • 55.Blanchard TC, Gershman SJ. Pure correlates of exploration and exploitation in the human brain. Cogn Affect Behav Neurosci. 2018;18:117–126. doi: 10.3758/s13415-017-0556-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Haggard P, Chambon V. Sense of agency. Curr Biol. 2012;22:R390–R392. doi: 10.1016/j.cub.2012.02.040. [DOI] [PubMed] [Google Scholar]
  • 57.Sandbrink K, Summerfield C. Learning the value of control with Deep RL; Proceedings of the 2023 Conference on Cognitive Computational Neuroscience; 2023. [•• This paper introduces the neural network models of meta-control.] [Google Scholar]
  • 58.Frömer R, Nassar MR, Bruckner R, Stürmer B, Sommer W, Yeung N. Response-based outcome predictions and confidence regulate feedback processing and learning. eLife. 2021;10:e62825. doi: 10.7554/eLife.62825. [• This paper demonstrates that humans process trials in which they managed to execute their intended action successfully differently from those where they did not based on online error monitoring.] [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Seidler RD, Kwak Y, Fling BW, Bernard JA. In: Progress in Motor Control. Richardson MJ, Riley MA, Shockley K, editors. Springer; 2013. Neurocognitive mechanisms of error-based motor learning; pp. 39–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Chan SCY, Santoro A, Lampinen AK, Wang JX, Singh A, Richemond PH, McClelland J, Hill F. Data Distributional Properties Drive Emergent In-Context Learning in Transformers. 2022 doi: 10.48550/ARXIV.2205.05055. [DOI] [Google Scholar]
  • 61.Chan SCY, Dasgupta I, Kim J, Kumaran D, Lampinen AK, Hill F. Transformers Generalize Differently from Information Stored in Context vs in Weights. 2022 doi: 10.48550/ARXIV.2210.05675. [DOI] [Google Scholar]
  • 62.Zhou H, Bradley A, Littwin E, Razin N, Saremi O, Susskind J, Bengio S, Nakkiran P. What Algorithms can Transformers Learn? A Study in Length Generalization. 2023 [Google Scholar]
  • 63.OpenAI. GPT-4 Technical Report. 2023
  • 64.Ziegler DM, Stiennon N, Wu J, Brown TB, Radford A, Amodei D, Christiano P, Irving G. Fine-Tuning Language Models from Human Preferences. 2019 doi: 10.48550/ARXIV.1909.08593. [DOI] [Google Scholar]
  • 65.Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal Policy Optimization Algorithms. 2017 [Google Scholar]
  • 66.Shi F, Suzgun M, Freitag M, Wang X, Srivats S, Vosoughi S, Chung HW, Tay Y, Ruder S, Zhou D, et al. Language Models are Multilingual Chain-of-Thought Reasoners. 2022 [Google Scholar]
  • 67.Wayne G, Hung C, Amos D, Mirza M, Ahuja A, Grabska-Barwinska A, Rae J, Mirowski P, Leibo JZ, Santoro A, et al. Unsupervised predictive memory in a goal-directed agent. arXiv. 2018 doi: 10.48550/arXiv.1803.10760. [DOI] [Google Scholar]
  • 68.Ha D, Schmidhuber J. World models. arXiv. 2018:1803.10122 [cs, stat] doi: 10.5281/zenodo.1207631. [DOI] [Google Scholar]
