Trends in Neurosciences. 2023 Mar;46(3):199–210. doi: 10.1016/j.tins.2022.12.006

Continual task learning in natural and artificial agents

Timo Flesch 1, Andrew Saxe 2, Christopher Summerfield 1
PMCID: PMC10914671  PMID: 36682991

Highlights

  • Both natural and artificial agents face the challenge of learning in ways that support effective future behaviour.

  • This may be achieved by different learning regimes, associated with distinct dynamics, and differing dimensionality and geometry of neural task representations.

  • Where two different tasks are learned, neural codes for task-relevant information may be factorised in neocortex.

  • Combinations of supervised and unsupervised learning mechanisms may help partition task knowledge and avoid catastrophic interference (i.e., overwriting existing knowledge).

Keywords: neural networks, representational geometry, Hebbian gating, machine learning, neuroimaging

Abstract

How do humans and other animals learn new tasks? A wave of brain recording studies has investigated how neural representations change during task learning, with a focus on how tasks can be acquired and coded in ways that minimise mutual interference. We review recent work that has explored the geometry and dimensionality of neural task representations in neocortex, and computational models that have exploited these findings to understand how the brain may partition knowledge between tasks. We discuss how ideas from machine learning, including those that combine supervised and unsupervised learning, are helping neuroscientists understand how natural tasks are learned and coded in biological brains.

Natural tasks

In the natural world, humans and other animals behave in temporally structured ways that depend on environmental context. For example, many mammals cycle systematically through daily activities such as foraging, grooming, napping, and socialising. Humans live in complex societies in which labour is shared among group members, with each adult performing multiple successive roles, such as securing resources, caring for young, or exchanging social information. In many settings, we can describe the behaviour of natural agents as comprising a succession of distinct tasks for which a desired outcome (reward) is achieved by taking actions (responses) to observations (stimuli) through the learning of latent causal processes (rules).

The nature of task-driven behaviour, and the way that tasks are represented and implemented in neural circuits, has been widely studied by cognitive scientists and neurobiologists. One important finding is that switching between distinct tasks incurs a cost in decision accuracy and latency [1]. This switch cost implies the existence of control mechanisms that ensure we remain ‘on task’, possibly protecting ongoing behavioural routines from interference [2,3]. In primates, there is good evidence that control signals originate in the prefrontal cortex (PFC) and encourage task-appropriate behaviours by biasing neural activity in sensory and motor regions [4]. For example, single cells in the PFC have been observed to respond to task rules [5]. Supportive evidence comes from human studies as well: patients with PFC damage tend to select tasks erroneously, leading to disinhibited or inappropriate behaviours [6].

This literature on task switching and control, however, has mostly overlooked the question of how tasks are acquired in the first place. How are tasks learned and dynamically represented in the PFC and interconnected regions? One key insight is that mutual interference among tasks can be mitigated when they are coded in independent subspaces of neural activity, such that the neural population vector evoked during task A is uncorrelated with that occurring during task B [7,8]. Over the past decade, evidence for this coding principle has emerged in domains as varied as skilled motor control [9], auditory prediction [10], memory processes [11,12], and visual categorisation [13,14]. However, the precise computational mechanisms by which tasks are encoded and implemented remain a matter of ongoing debate.

In this review, we discuss recent theories of task learning in cognitive science, neuroscience, and AI research. We focus on continual learning, that is, the need for both natural and artificial agents to continue to learn new tasks across the lifespan without catastrophically overwriting existing knowledge.

Rich and lazy learning

One way to study how tasks could be neurally encoded is to simulate learning in a simple class of computational model – a neural network trained with gradient descent. Neural networks uniquely allow researchers to form hypotheses about how neural codes form in biological brains, because their representations emerge through optimisation rather than being hand-crafted by the researcher [15]. One recent observation is that neural networks can learn to perform tasks in different regimes that are characterised by qualitatively diverging learning dynamics and distinct neural patterns at convergence [16,17]. In the lazy regime, which occurs when network weights are initialised with a broader range of values (e.g., higher connection strengths), the dimensionality of the input signals is rapidly expanded via random projections to the hidden layer such that learning is mostly confined to the readout weights, and error decreases exponentially [17–20]. By contrast, in the rich regime, which occurs when weights are initialised with low variance (weak connectivity), the hidden units learn highly structured representations that are tailored to the specific demands of the task, and the loss curve tends to pass through one or more saddle points before convergence [21–24]. We illustrate using a simple example – learning an ‘exclusive or’ (XOR) problem – in Figure 1A–D.
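The contrast can be made concrete in a few lines of code. The following minimal sketch (our illustration, not code from any of the cited studies) trains the same two-layer ReLU network on XOR from either small-variance or large-variance initial weights and measures how far the input-to-hidden weights travel. The hyperparameters, including the separate learning rates that keep both regimes stable, are illustrative assumptions only.

```python
# Rich vs lazy learning on XOR: small initial weight variance -> rich regime
# (hidden weights reorganise substantially); large variance -> lazy regime
# (learning is mostly confined to the readout).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])            # XOR targets

def train(gain, lr, hidden=50, steps=20000):
    W1 = rng.normal(0., gain / np.sqrt(2), (2, hidden))       # input -> hidden
    W2 = rng.normal(0., gain / np.sqrt(hidden), (hidden, 1))  # hidden -> output
    W1_init = W1.copy()
    for _ in range(steps):
        h = np.maximum(X @ W1, 0.)                # ReLU hidden layer
        err = h @ W2 - y
        W2 -= lr * h.T @ err / len(X)             # squared-error gradient steps
        W1 -= lr * X.T @ ((err @ W2.T) * (h > 0)) / len(X)
    drift = np.linalg.norm(W1 - W1_init) / np.linalg.norm(W1_init)
    loss = float(np.mean((np.maximum(X @ W1, 0.) @ W2 - y) ** 2))
    return loss, drift

for label, gain, lr in (("rich (small init)", 0.1, 0.1),
                        ("lazy (large init)", 1.5, 0.01)):
    loss, drift = train(gain, lr)
    print(f"{label}: final loss {loss:.4f}, relative change in W1 {drift:.2f}")
```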

Figure 1.

Rich and lazy learning in neural networks.

(A) The XOR (exclusive or) problem requires a network to produce response A when exactly one of the two input units is set to 1, and response B when both are 0 or both are 1. A linear classifier cannot learn to distinguish between the two classes. (B) Feedforward neural network architecture that can solve the XOR task. The inputs are mapped into a hidden layer with nonlinear outputs, and from there into a nonlinear response layer. (C) Effect of different initial weight variances on the training dynamics of the network shown in (B). We distinguish between rich learning (with small initial weight variance, light blue) and lazy learning (with large initial weight variance, dark blue). The change in the magnitude of the input-to-hidden (left) and hidden-to-output (middle) weights depends strongly on initialisation strength. In lazy-initialised networks, the input-to-hidden weights remain very close to their initial values and learning is confined to the readout weights. In rich-initialised networks, all weights adapt substantially. Moreover, rich-initialised networks learn much more slowly than lazy-initialised networks (right). (D) Learned input-to-hidden weights after rich (top) and lazy (bottom) learning. Under rich learning, the weights come to point towards the four input types. By contrast, under lazy learning, the weights point in arbitrary directions, effectively performing a random mapping into a higher-dimensional space.

Neural recordings have offered evidence for both rich and lazy coding schemes. One important observation is that the variables that define a task – observations, actions, outcomes, and rules – are often encoded jointly by single neurons. For example, when monkeys make choices on the basis of distinct cues, single cells tend to multiplex input and choice variables [25–27]. In another study, the dimensionality of neural codes recorded during performance of a dual memory task was found to approach its theoretical maximum, implying that neurons represent every possible combination of relevant variables [12]. While conjunctive codes (mixed selectivity) can theoretically emerge under both coding schemes, this finding is more consistent with ‘lazy’ learning, implying that brains encode tasks via high-dimensional representations that enmesh multiple task-relevant variables across the neural population.

However, there is also important evidence that neural systems learn representations that mirror the structure of the task, as might be predicted in the ‘rich’ regime. For example, it is often observed that neurons active in one task are silent during another, and vice versa. When macaques were trained to categorise morphed animal images according to two independent classification schemes, 29% of PFC neurons became active during a single scheme, whereas only 2% of neurons were active during both [28]. More recently, a similar finding was reported using modern two-photon imaging methods in the parietal cortex of mice trained to perform both a grating discrimination task and a T-maze task. Over half of the recorded neurons were active in at least one task, but a much smaller fraction was active in both tasks [29]. In other words, the brain learns to partition task knowledge across independent sets of neurons.

These schemes may have complementary costs and benefits. High-dimensional coding schemes maximise the number of discriminations that can be linearly read out from the network, allowing agents to rapidly learn a new decision rule for a task [30]. Low-dimensional coding schemes confer robustness through redundancy, because neurons exhibit overlapping tuning properties, and promote generalisation, because they tend to correspond to simpler input-output functions when the neural manifold extends in fewer directions [31,32]. The lazy regime promotes ‘conjunctive’ coding, whereby representations are entangled across task variables [33,34], whereas the rich regime encourages ‘compositional’ coding, in which task representations can be assembled from primitive building blocks [35–38]. Whether task codes are primarily high dimensional (and conjunctive) or low dimensional (and compositional) may depend on the species, recording site, and the nature of the task at hand [31,39].

One recent study from our group used a neural network model to explicitly compare the predictions of the rich and lazy learning schemes to neural signals recorded from the human brain [13]. We developed a task (similar to the one in [28]) that involved discriminating naturalistic images in two independent contexts. Human participants learned to make ‘plant/don’t plant’ decisions about quasi-naturalistic images of trees with continuously varying branch density and leaf density, whose growth success was determined by leaf density in one ‘garden’ (task A) and branch density in the other (task B) (Figure 2A). Neural networks could be trained to perform a stylised version of this task under either rich or lazy learning schemes by varying the initial connection strengths (Figure 2B). Multivariate methods used to visualise the representational dissimilarity matrix (RDM) and corresponding neural geometry for the network hidden layer under either scheme revealed that they made quite different predictions (Figure 2C). Under lazy learning, the network learned a high-dimensional solution whose RDM simply recapitulated the organisation of the input signals (into a grid defined by ‘leafiness’ and ‘branchiness’). This is expected, because randomly expanding the dimensionality of the inputs approximately preserves their similarity structure. However, under the rich scheme, the network compressed information along the dimension that was irrelevant in each context, so that the hidden layer represented the relevant input dimensions (leafiness and branchiness) on two neural planes lying at right angles in neural state space (Figure 2C). Strikingly, BOLD signals in the posterior parietal cortex (PPC) and dorsomedial prefrontal cortex (dmPFC) exhibited a similar dimensionality, and RDMs revealed a comparable geometric arrangement onto ‘orthogonal planes’, providing evidence in support of ‘rich’ task representations in the human brain (Figure 2D).
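For readers unfamiliar with the method, here is a minimal sketch (ours, not the study's analysis pipeline) of how an RDM is built from hidden-layer or voxel activity; the condition counts and the random stand-in data are placeholders.

```python
# An RDM summarises representational geometry: one activation vector per
# experimental condition, one dissimilarity per pair of conditions.
import numpy as np

rng = np.random.default_rng(1)
n_conditions = 50            # e.g., a 5 x 5 stimulus grid x 2 contexts
n_units = 200                # hidden units (or voxels)
acts = rng.normal(size=(n_conditions, n_units))   # stand-in for recorded activity

# Correlation-distance RDM: 1 - Pearson correlation between condition patterns.
z = acts - acts.mean(axis=1, keepdims=True)
z /= np.linalg.norm(z, axis=1, keepdims=True)
rdm = 1.0 - z @ z.T          # (n_conditions, n_conditions), zeros on the diagonal

# Model comparison: correlate the off-diagonal entries of two RDMs
# (e.g., a network's hidden-layer RDM against a brain-region RDM).
iu = np.triu_indices(n_conditions, k=1)
def rdm_similarity(rdm_a, rdm_b):
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]

print("self-similarity check:", rdm_similarity(rdm, rdm))   # -> 1.0
```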

Figure 2.

Structured task representations for context-dependent decision-making.

(A) Context-dependent decision-making task with human participants [13]. Stimuli were fractal images of trees that varied in their density of leaves (leafiness) and branches (branchiness). In each context/task, only one of the two dimensions was relevant, indicated by the reward/penalty participants would receive for ‘accepting’ the tree on a given trial. (B) Simplified version of the task described in (A), with images of Gaussian ‘blobs’ instead of trees. The mean of these blobs was varied in five steps along the x- and y-axes. In each context, only one of the two dimensions was relevant. A neural network (right) was trained to predict the context-specific feature value. (C) Hidden layer representations under lazy (left) and rich (right) learning. Under lazy learning, the network recapitulates the structure of the stimulus space. Under rich learning, it compresses along the task-irrelevant axes, forming ‘orthogonal’ task-specific representations of relevant features. (D) Visualisation of variance in human functional magnetic resonance imaging (fMRI) recordings from early visual cortex (EVC) and dorsolateral prefrontal cortex (DLPFC) explained by a model with free parameters for the compression rate, the distance between tasks, and the rotation of individual task representations. Clearly evident are task-agnostic grid-like representations in EVC and ‘orthogonal’ task-specific representations in DLPFC. Panels A–D are modified from [13].

Neural systems can thus learn both structured, low-dimensional task representations, and unstructured, high-dimensional codes. In artificial neural networks, the emerging regime depends on the magnitude of initial connection strengths in the network [15]. In the brain, these regimes may arise through other mechanisms such as pressure toward metabolic efficiency (regularisation), or architectural constraints that enforce unlearned nonlinear expansions. While it remains unclear when, how, and why either coding scheme might be adopted in the biological brain, neural theory is emerging that may help clarify this issue. One recent paper explored how representational structure is shaped by specific task demands [40]. Comparing recurrent artificial neural networks trained on different cognitive tasks, the authors found that those tasks that required flexible input-output mappings, such as the context-dependent decision task outlined previously, induced task-specific representations, similar to the ones observed under rich learning in feedforward networks. By contrast, for tasks that did not require such a flexible mapping, the authors observed completely random, unstructured representations. This suggests that representational geometry is not only determined by intrinsic factors such as initial connection strength, but flexibly adapts to the computational demands of specific tasks. Rich task-specific representations might therefore arise when there is a need to minimise interference between different tasks and perform flexible context-specific responses to the same stimuli [39,41].

The problem of continual learning

The natural world is structured so that different tasks tend to occur in succession. For example, many animals (such as bears and llamas) are able to both run and swim, but they cannot do so at the same time. Similarly, most humans can perform many different tasks (such as playing the violin and the trumpet) but may not do so simultaneously. This aspect of the world presents a well-known challenge for task learning, because when task B is learned after task A, there is a risk that knowledge of task A is erased – a phenomenon known as ‘catastrophic forgetting’ [42] (Figure 3A). Catastrophic forgetting can occur when a neural network is optimised through adjustment of a finite set of connections, because there are no guarantees that any weight configuration that solves a novel task B will also simultaneously solve an existing task A (Figure 3B). Building neural networks that avoid catastrophic forgetting and adapt continually to novel tasks in an open-ended environment is a grand challenge in AI research, where even powerful existing systems are often poor at adapting flexibly to new tasks that are introduced late in training [43–45]. Humans, by contrast, seem to have evolved ways to circumvent this problem, allowing people to solve new problems well into old age. How might the neural representation of tasks allow biological agents to learn continually?
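The phenomenon is easy to reproduce in a toy model. The sketch below (our illustration) fits a single weight vector to task A by gradient descent and then to task B: the second phase drives task-A error back up, because nothing anchors the weights to the old solution.

```python
# Catastrophic forgetting in miniature: sequential (blocked) training on two
# linear regression tasks that share one set of weights.
import numpy as np

rng = np.random.default_rng(2)
d = 20
X = rng.normal(size=(500, d))
w_A, w_B = rng.normal(size=d), rng.normal(size=d)   # ground-truth rules
y_A, y_B = X @ w_A, X @ w_B                         # targets for each task

def fit(w, targets, lr=0.1, epochs=200):
    for _ in range(epochs):
        w = w - lr * X.T @ (X @ w - targets) / len(X)   # squared-error gradient step
    return w

w = fit(np.zeros(d), y_A)                           # blocked phase 1: task A
print("after task A:  A-error", round(float(np.mean((X @ w - y_A) ** 2)), 4))
w = fit(w, y_B)                                     # blocked phase 2: task B
print("after task B:  A-error", round(float(np.mean((X @ w - y_A) ** 2)), 2),
      "| B-error", round(float(np.mean((X @ w - y_B) ** 2)), 4))
```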

Figure 3.

Continual learning in minds and machines.

(A) Performance under blocked (‘continual’) and interleaved training in humans and deep neural networks. Artificial neural networks (ANNs) suffer from catastrophic forgetting under blocked training (top) but reach ceiling performance under interleaved training (bottom). The opposite is true for humans, who perform worse under interleaved training. (B) Solution spaces in a neural network. Solutions for different tasks require different parameter configurations. Training on a second task moves the configuration out of the solution space for the first task. (C) Gating theory for continual learning from [49]. If the context signal inhibits task-irrelevant units (those that are relevant for the other task), they should not be affected by gradient updates. (D) Hidden layer representations without (left) and with (right) Hebbian gating signals. The standard neural network treats the first task like the second. The gated network learns two separate task representations. (E) Comparison of category boundary estimation error in humans (left) and a neural network trained either with standard error-corrective learning (middle) or with Hebbian gating and sluggishness (right), after blocked (dark) and interleaved (light) learning. The cost of interleaved training is captured by larger estimation errors under interleaved learning. Asterisks indicate significance of differences between blocked and interleaved, *P < 0.05, ***P < 0.001. Panels C, D, and E are reprinted from [49], with original data from [50].

One possibility is that synapses that encode existing knowledge can be protected during new learning. Evidence for this idea comes from two-photon microscopy, which can be used to track dendritic spine formation in the rodent brain. When mice are trained to perform two unrelated tasks, such as running forward and backwards, new spines form across different apical tuft branches. Remarkably, many of these new spines are maintained stably across the entire lifespan of the animal, despite the many further experiences to which the animal is exposed, as if they were being protected from further change [46]. In machine learning, methods that earmark synapses that are critical for performing current tasks, and explicitly protect them during learning of new tasks, have helped neural networks learn several types of image recognition task or multiple Atari video games in succession [47,48].
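A minimal sketch of this idea follows, in the spirit of elastic weight consolidation [47] but heavily simplified: a linear model, a diagonal-Fisher importance measure, and illustrative hyperparameters, all of which are our assumptions rather than the published recipe.

```python
# EWC-style weight protection: after task A, each weight gets an importance
# estimate; task-B training adds a quadratic penalty anchoring important
# weights to their task-A values.
import numpy as np

rng = np.random.default_rng(3)
d = 20
X = rng.normal(size=(500, d))
y_A, y_B = X @ rng.normal(size=d), X @ rng.normal(size=d)

def grad(w, y):                                   # squared-error gradient
    return X.T @ (X @ w - y) / len(X)

w = np.zeros(d)
for _ in range(200):                              # learn task A normally
    w -= 0.1 * grad(w, y_A)
w_anchor = w.copy()
importance = np.mean(X ** 2, axis=0)              # diagonal Fisher of a
                                                  # linear-Gaussian model
lam = 5.0                                         # protection strength
for _ in range(200):                              # learn task B under the penalty
    w -= 0.1 * (grad(w, y_B) + lam * importance * (w - w_anchor))

for name, y in (("A", y_A), ("B", y_B)):
    print(f"task {name} error:", round(float(np.mean((X @ w - y) ** 2)), 2))
```

Because the two toy tasks demand conflicting input-output mappings from the same inputs, the penalty trades task-B learning against task-A retention; in larger networks with spare capacity, protection can instead leave other weights free to absorb the new task.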

Another promising approach to continual learning capitalises on the fact that in mammals, new information tends to be rapidly encoded in hippocampal circuits and only slowly consolidated to neocortex [51]. Consolidation allows existing learning (task A) to be internally ‘replayed’ during acquisition of task B, removing the temporal structure from the learning problem by mentally interleaving tasks A and B. This allows the network to sample the two objectives simultaneously – and thus to gravitate towards a parameter setting that jointly solves both tasks [52]. This idea, known as ‘complementary learning systems’ (CLS) theory, influenced AI researchers designing the first deep neural networks to solve complex, dynamic control problems (such as Atari games) [53]. By introducing a virtual ‘buffer’ in which game states were stored and intermittently replayed, akin to the fast rehearsal of successive mental states during sharp-wave ripples observed in both rodents [54] and humans [55], the network was able to overcome the challenge presented by a nonstationary training distribution and to learn to perform dozens of games better than an expert human player [56]. More recently, new methods have been developed that prioritise replay of those states with the highest prediction error [57], just as biological replay seems to privilege rewarding events [58], or that replay samples from a generative model of the environment, allowing agents to incorporate plausible but counterfactual events and outcomes into a joint task schema [59].
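The core mechanism is simple to sketch (our illustration, with a linear learner standing in for the deep network): keep a buffer of task-A experiences and mix replayed samples into every task-B update, restoring an approximately i.i.d. training distribution.

```python
# Experience replay: stored task-A samples are interleaved with task-B data so
# that the learner never optimises for task B alone.
import numpy as np

rng = np.random.default_rng(4)
d = 20
X_A, X_B = rng.normal(size=(500, d)), rng.normal(size=(500, d))
y_A, y_B = X_A @ rng.normal(size=d), X_B @ rng.normal(size=d)

w = np.zeros(d)
for _ in range(200):                           # task A, ordinary training
    w -= 0.1 * X_A.T @ (X_A @ w - y_A) / len(X_A)

buf_X, buf_y = X_A[:100], y_A[:100]            # replay buffer stored during task A

for _ in range(200):                           # task B, each update mixes in replay
    idx = rng.integers(0, len(buf_X), size=500)    # upsample replay to match new data
    X_mix = np.vstack([X_B, buf_X[idx]])       # current experience + replayed memories
    y_mix = np.concatenate([y_B, buf_y[idx]])
    w -= 0.1 * X_mix.T @ (X_mix @ w - y_mix) / len(X_mix)

for name, Xt, yt in (("A", X_A, y_A), ("B", X_B, y_B)):
    print(f"task {name} error with replay:",
          round(float(np.mean((Xt @ w - yt) ** 2)), 2))
```

In this linear toy the shared weights can only compromise between the two tasks; a network with context inputs could in principle satisfy both, but the stabilising effect of replay on the training distribution is the same.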

The benefit of temporal structure

Complementary learning systems theory offers a clear story about how the brain learns tasks in a temporally structured world – by mentally mixing them up. Indeed, there is evidence from cognitive psychology that injecting variety into the training set helps people learn sports [60], foreign languages [61], and even to acquire abstract mathematical concepts [62] or recognise the painting styles of famous artists [63]. However, as every student knows, simply selecting training examples at random rarely makes a good curriculum. Similar principles are relevant in animal learning. When researchers want a novice animal to learn a complex task, they carefully design the series of training levels – for example, by first teaching a mouse to lick a waterspout, and only later to discriminate a tone for reward. Pigeons struggle to learn the concept of ‘same’ versus ‘different’ in visual discrimination if trained on random examples, but learn effectively if the training set starts small and grows gradually [64]; for related work in monkeys see [65]. Similarly, in education, teachers typically organise information into discrete blocks, so that, for example, students might study French before Spanish, but not both languages in the same lesson. Experimental animals may even spontaneously structure their own curricula, for example, learning independently to lick a waterspout and discriminate a tone at different times [66]. In other words, although there are good theoretical reasons why mixing exemplars during training should help, in practice animals and humans seem to learn more readily under temporally autocorrelated curricula.

Building on this intuition, one study used the leafy/branchy task described earlier to ask whether curricula that block or interleave tasks (or ‘gardens’, each associated with an independent visual discrimination rule) over trials facilitate learning [50]. Different groups of participants learned to perform the task purely by trial and error under either a blocked curriculum (involving hundreds of trials of consecutive practice for each task) or an interleaved curriculum (in which the task switched from trial to trial), before being tested without feedback on the interleaved task. Surprisingly, test performance was better for the groups that learned under the blocked curriculum, despite the fact that the other groups could in theory benefit from the shared structure of training and test. For example, the test involved many task switch trials, which the interleaved group had practised extensively, but the blocked group had experienced only once. Detailed behavioural analysis showed that rather than increasing psychophysical sensitivity, or decreasing generalised errors (lapses), the blocked curriculum helped participants learn to apply the two category boundaries independently – in other words, it facilitated effective partitioning of the two tasks [50] (Figure 3E).

Why does temporal structure assist learning in humans and other animals – but not in neural networks, which suffer catastrophic forgetting during blocked training? One possibility is that biological brains have evolved ways to partition knowledge during initial task acquisition, so that learning can occur in independent subspaces of neural population activity, thereby precluding interference between tasks A and B. Indeed, the orthogonal neural geometry of the BOLD signal observed for the trees task [13] occurred after blocked training – as if human participants, despite the temporal structure, had learned to represent the task in separate neural subspaces. As a proof of concept, methods that project learning updates (gradients) for new tasks into independent neural subspaces have been shown to help with continual learning in machine learning models, including recurrent networks [67,68]. However, implementing these methods often requires expensive computations, such as computing and maintaining an unwieldy projection matrix in memory.
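A minimal sketch of the projection idea (our illustration, in the spirit of [67,68]): estimate the input subspace task A used, build the complementary projector, and push every task-B gradient through it. The projector P is the kind of matrix the preceding paragraph notes can be expensive to compute and store.

```python
# Gradient projection for continual learning: task-B updates are confined to
# directions orthogonal to the input subspace task A occupied, so task A's
# input-output mapping is left exactly intact.
import numpy as np

rng = np.random.default_rng(5)
d = 20
X_A = rng.normal(size=(500, d)); X_A[:, 10:] = 0.   # task A occupies dims 0-9
X_B = rng.normal(size=(500, d))                     # task B uses all dims
y_A, y_B = X_A @ rng.normal(size=d), X_B @ rng.normal(size=d)

w = np.zeros(d)
for _ in range(200):                                # ordinary training on task A
    w -= 0.1 * X_A.T @ (X_A @ w - y_A) / len(X_A)

# Projector onto the null space of task A's input covariance.
U, S, _ = np.linalg.svd(X_A.T @ X_A / len(X_A))
used = U[:, S > 1e-8]                               # directions task A relied on
P = np.eye(d) - used @ used.T

for _ in range(200):                                # task B with projected updates
    g = X_B.T @ (X_B @ w - y_B) / len(X_B)
    w -= 0.1 * P @ g                                # step only where task A is blind

print("task A error:", round(float(np.mean((X_A @ w - y_A) ** 2)), 6))
print("task B error:", round(float(np.mean((X_B @ w - y_B) ** 2)), 3))
```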

To understand how the brain might partition new task learning across neurons, it is helpful to return to the simple neural network model of task learning. The widely studied case where inputs such as trees [13,50], animals [28], or coloured moving dots [26,69,70] are classified according to two independent category boundaries has XOR structure and thus can only be solved by networks equipped with nonlinear transformations, for example, with rectified linear units (ReLUs). When the task varies from trial to trial, the network is provided with ‘task inputs’ denoting whether the current task is A or B (e.g., motion or colour discrimination). At initialisation, the weights connecting each of the task inputs to the hidden layer units are entirely random, but over the course of training they become anticorrelated, so that a hidden unit that responds positively to task A will tend to respond negatively to task B and vice versa (Figure 3C). The additional rectification step – in which the ReLU sets the negative portion of an input to zero – means that hidden layer units become responsive to one task input unit or the other, but not both, consistent with the finding that tasks tend to be partitioned across neurons, for example, in mouse parietal cortex [29] or monkey PFC [28]. This partitioning allows tasks to be learned independently, and indeed if the task input weights are manually forced to be anticorrelated, the network has no problem learning both tasks A and B even when they are presented in successive blocks of training [49,71] (Figure 3D).
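The following sketch (our construction, following the verbal account above rather than any published code) shows how anticorrelated task weights plus a ReLU partition hidden units: each unit receives a strong positive drive from one context unit and an equally strong negative drive from the other, so the ‘wrong’ context silences it.

```python
# Context gating via anticorrelated task weights and rectification.
import numpy as np

rng = np.random.default_rng(6)
n_hidden, n_stim = 8, 3
W_stim = rng.normal(0.0, 0.5, (n_stim, n_hidden))   # stimulus -> hidden weights

g = 2.0                                             # gating strength
v = np.where(np.arange(n_hidden) < n_hidden // 2, g, -g)
W_task = np.vstack([v, -v])                         # row 0: task A input, row 1:
                                                    # task B input; columns are
                                                    # perfectly anticorrelated
x = rng.normal(size=n_stim)                         # an arbitrary stimulus
for name, ctx in (("task A", np.array([1.0, 0.0])),
                  ("task B", np.array([0.0, 1.0]))):
    h = np.maximum(x @ W_stim + ctx @ W_task, 0.0)  # ReLU hidden layer
    print(name, "-> active hidden units:", np.nonzero(h > 0)[0])
```

Running this shows that (with stimulus drive weaker than the gate) the first half of the hidden units is active only under task A and the second half only under task B, so gradient updates for each task land on disjoint sets of units.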

This idea builds upon earlier proposals that context-based gating may be a solution to continual learning and control [72–74]. For example, one early theory showed how abstract, rule-like representations could emerge in PFC through a gating-based scheme, allowing the network to perform a multidimensional rule-learning task [75]. Another recent theory proposes a gating-based solution to the stability-plasticity dilemma that relies on temporal synchrony to bind together elements of a currently relevant task [76].

Knowledge partitioning via Hebbian learning

The question remains, however, of how projection into partitioned hidden units might be achieved during online learning. A broad hint comes from considering the relationship between sensory input and the demands of natural tasks. In the real world, different tasks tend to occur in different contexts. For example, a bear might learn to walk on land and swim in water, and not vice versa. An adult in a professional role might work in the office, and drink in the local bar, but not vice versa. Sensory context thus – whether the backdrop is computers and desks, or beer and jukebox – offers strong clues about which tasks we should be performing [77]. Thus, a mechanism that learned to correlate task inputs that shared sensory features, and to orthogonalise those that did not, would allow knowledge to be partitioned by task. Fortunately, one popular learning algorithm neatly meets these requirements. An implementation of Hebbian learning called Oja’s rule strengthens connections between neurons with covarying activity [78]. It thus converges to the first principal component of mean-centred input signals and will inevitably group together those hidden units that are connected to commonly activated inputs. For example, in a setting where tasks A and B occur independently, Oja’s rule will tend to orthogonalise the weights from the two task input units to the hidden layer. This effect will be enhanced where tasks are accompanied by shared sensory features (such as land and water).
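For concreteness, here is Oja's rule itself in a minimal sketch (the input statistics are our illustrative assumption): a Hebbian term plus a decay term, whose fixed point is the first principal component of the mean-centred inputs.

```python
# Oja's rule [78]: Hebbian learning with a normalising decay term.
import numpy as np

rng = np.random.default_rng(7)
# Toy inputs with one dominant direction of covariance.
pc = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
X = rng.normal(size=(5000, 3)) + 3.0 * rng.normal(size=(5000, 1)) * pc
X -= X.mean(axis=0)                      # the rule assumes mean-centred input

w = rng.normal(size=3)
lr = 0.005
for x in X:
    y = w @ x                            # linear neuron output
    w += lr * y * (x - y * w)            # Hebbian term minus Oja's decay term

print("learned weight vector:     ", np.round(w, 3))   # ~ +/- first PC, unit norm
print("first principal component: ", np.round(pc, 3))
```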

To capture the empirically observed advantage of blocked over interleaved learning, however, one further assumption is required – namely, that the window of temporal integration for the inputs is longer than a single trial. In other words, we assume that the input to the network on any given trial contains a mixture of current and past information, so that decisions on any given trial may be biased by the recent trial history. Indeed, sequential effects across trials are a ubiquitous feature of behavioural data gathered in the lab [79–81]. For context-dependent decisions, choice history will have a different effect when tasks are blocked versus interleaved, because it effectively smooths the input signals, so that independence between tasks (and thus partitioning) is promoted when tasks are blocked but decreased when they are interleaved. Intuitively, where tasks A and B are interleaved over trials, each input contains a mixture of signals from both tasks, reducing their independence. This may also disrupt performance, because the previous task may bias responses on the current trial. Bringing these ideas together, one recent modelling study showed that the introduction of history effects and Hebbian updating together allows networks to learn in ways that avoid catastrophic interference, and also to capture the rich pattern of behaviour observed when human participants perform the trees task under blocked and interleaved conditions, including the advantage of blocking over interleaving, and the observation that this benefit stems from a more accurate estimate of the category boundary [49] (Figure 3E).
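A minimal sketch of the ‘sluggish’ context signal assumed in this account (parameter values are ours): leaky integration of the task cue over trials keeps the signal nearly pure under a blocked curriculum but blends the two tasks under interleaving.

```python
# Temporally smoothed ("sluggish") task inputs under two curricula.
import numpy as np

def sluggish_contexts(task_seq, alpha=0.6):
    """alpha = weight on the current trial's cue; 1 - alpha carries history."""
    c = np.zeros(2)
    out = []
    for t in task_seq:                     # t is 0 (task A) or 1 (task B)
        cue = np.eye(2)[t]
        c = alpha * cue + (1 - alpha) * c  # leaky integration of task cues
        out.append(c.copy())
    return np.array(out)

blocked = [0] * 50 + [1] * 50
interleaved = [0, 1] * 50
for name, seq in (("blocked", blocked), ("interleaved", interleaved)):
    C = sluggish_contexts(seq)
    # How pure is the context signal, on average, after a short warm-up?
    purity = np.abs(C[10:, 0] - C[10:, 1]).mean()
    print(f"{name}: mean |c_A - c_B| = {purity:.2f}  (1 = pure, 0 = fully mixed)")
```

Because the blocked signal stays close to one-hot, Hebbian grouping of hidden units by task remains possible; the interleaved signal is a persistent blend, so the two tasks' inputs are never statistically independent.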

Learning tasks with and without supervision

Machine learning models are often more likely to converge to optimal solutions when training data are sampled to be independent and identically distributed (i.i.d.) – in other words, with curricula that are as random as possible. Where the data distribution is stationary, i.i.d. sampling reduces bias during learning and has powered machine learning models towards superhuman performance in domains such as object recognition [82]. However, the data distribution in the natural world is not stationary: instead, it is highly autocorrelated within a single context, but shifts abruptly at context boundaries, such as when a penguin emerges from the sea onto the ice or when you leave your warm house and head out into the wintry street. Some machine learning methods treat this structure as a nuisance and have found clever ways to try and remove it [53]. However, the theory mentioned previously suggests that the brain has instead evolved to capitalise on this structure. It proposes that by using unsupervised learning methods – such as Hebbian learning – the brain learns to group ongoing experience into contexts and to partition neural resources to learn about each context independently. By orthogonalising task signals, behavioural routines can be stored in ways that minimise interference.

This idea taps into a longstanding theme in machine learning research – namely, that unsupervised and error-driven learning can work together to help agents solve natural tasks. In fact, early successes in deep learning used unsupervised pretraining methods to structure neural representations before supervised fine-tuning [83]. When deep convolutional networks are trained to solve the trees task from pixels alone, pretraining on the tree dataset using a beta-variational autoencoder (β-VAE) accelerates subsequent supervised learning. The β-VAE uses self-supervised methods to learn the latent factors in the data (i.e., leafiness and branchiness), thus structuring representations according to the two subsequent decision-relevant dimensions. In a similar vein, when human participants were first asked to arrange samples of trees by their similarity (without knowing the decision-relevant axes), those whose prior tendency was to organise by leafiness and branchiness received more benefit from blocked training [50]. In other words, learning the structure of the world can help both humans and neural networks organise information along task-relevant axes, and may be at the heart of biological solutions to continual learning.
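As a pointer to how the pretraining objective looks, here is a minimal sketch of a β-VAE loss (encoder and decoder networks are assumed given; the squared-error reconstruction term and the β value are illustrative choices): the KL term, up-weighted by β > 1, pressures the latent code towards independent factors such as leafiness and branchiness.

```python
# beta-VAE objective: reconstruction error plus a beta-weighted KL term.
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term up-weighted by beta (beta=1 is a plain VAE)."""
    recon = np.sum((x - x_recon) ** 2)              # reconstruction error
    # KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dimensions
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + beta * kl

# Tiny check: perfect-prior latents contribute nothing, so only the
# reconstruction error remains.
x, x_recon = np.ones(4), 0.9 * np.ones(4)
print(beta_vae_loss(x, x_recon, np.zeros(2), np.zeros(2)))   # ~ 0.04
```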

However, there is an important caveat to this theory. When we encounter a novel task, we rarely want to ignore everything we know about other tasks – in fact, past task knowledge can help as well as hinder current task learning. For example, a chef learning a new recipe will benefit from past cooking experience, or a programmer learning Python will probably benefit from past proficiency in MATLAB [84]. Thus, we often want to share representations between tasks, but strict partitioning of task knowledge into orthogonal subspaces reduces the positive as well as the negative effects of this transfer. In fact, knowing how to learn new tasks in a way that negotiates the trade-off between interference (negative transfer) and generalisation (positive transfer) is a key unsolved challenge in both neuroscience and AI research [43,85].

Answers to this question are only beginning to emerge, but one possibility is that the brain is particularly adept at factorising existing task knowledge into reusable subcomponents, which can then be recomposed to tackle novel challenges [86]. For example, when making predictions about sequences of dots presented on a ring, participants seem to combine primitives involving rotation and symmetry [87]. Neural signals seem to code independently for task factors, such as the position and identity of an object in a sequence [88,89]. By coding reusable task factors in independent subspaces of neural activity, they can be recombined in novel ways – for example, if a chef has learned to knead dough when baking bread and make a tomato passata when cooking spaghetti, these skills can be combined when making pizza for the first time.

One recent study showed that Hebbian learning may also contribute to compositional solutions to difficult transfer problems [35]. Human participants were trained to map coloured shapes onto spatial responses made with a mouse click, with each feature (e.g., colour) mapping onto a spatial dimension (e.g., the horizontal axis in Cartesian coordinates, or the radial axis in polar coordinates). Critically, they were trained with feedback on a single exemplar from each dimension (e.g., all red shapes and all coloured squares) and then asked to make inferences about the location associated with novel objects (e.g., blue triangles). As in the trees task, performance was improved when training of each dimension was blocked (e.g., all red items preceded all squares). Neural networks learned to perform perfectly on training trials but failed to transfer, unless they were equipped with a Hebbian learning step that helped them learn independently about colour and shape. With the combination of Hebbian and supervised learning, networks learned to perform the task in ways that closely resembled humans [35].

Concluding remarks

A renewed interest in connectionist models as theories of brain function [15] and the advent of high-throughput recording and multivariate analysis methods [32] have collectively reinvigorated research into task learning and its neural substrates. However, exactly how (and to what extent) neural representations form as biological agents learn new tasks remains a mystery (see Outstanding questions). Some theories propose that task-related neurons are supremely adaptive, especially in PFC, implying that new tasks automatically beget new geometries of representation [90]. Indeed, neural codes measured in BOLD signals have been shown to adjust rapidly when participants are taught new relations among objects or positions, and this occurs in both the medial temporal lobe and the frontoparietal network [91–94]. There is even one report that orientation selectivity in V1 can adjust as people learn to classify gratings over just a few hours of practice, as if the basic building blocks of vision were themselves quite labile [95]. However, neural signals recorded from experimental animals seem to change much more gradually with learning, and it is unclear if these plastic changes are a quirk of humans – or perhaps of BOLD signals. Understanding exactly how representations change in both hippocampus and neocortex during new task learning is currently a major outstanding challenge for 21st century neuroscience.

Outstanding questions.

Past learning can interfere with current task performance, but at other times it can be beneficial. How does the brain code for tasks in a way that trades off the costs and benefits of negative and positive transfer?

Given that neural circuits exhibit experience-dependent plasticity during learning, how can old learning be preserved?

What are the respective roles of different brain regions, including the hippocampus and neocortical areas, in facilitating continual learning?

Acknowledgments

This work was supported by generous funding from the European Research Council (ERC Consolidator award 725937) and Special Grant Agreement No. 945539 (Human Brain Project SGA) to C.S., a Sir Henry Dale Fellowship to A.S. from the Wellcome Trust and Royal Society (grant number 216386/Z/19/Z) and a Medical Science Graduate School Studentship to T.F. (Medical Research Council and Department of Experimental Psychology). A.S. is a CIFAR Azrieli Global Scholar in the Learning in Machines & Brains program.

Declarations of interests

T.F. is associated with Phytoform Labs, UK. A.S. is associated with FAIR (Meta, Inc). C.S. is affiliated with DeepMind, UK. The authors declare no other competing interests in relation to this work.

Contributor Information

Andrew Saxe, Email: a.saxe@ucl.ac.uk.

Christopher Summerfield, Email: christopher.summerfield@psy.ox.ac.uk.

References

  • 1. Monsell S. Task switching. Trends Cogn. Sci. 2003;7:134–140. doi: 10.1016/s1364-6613(03)00028-7.
  • 2. Botvinick M.M., et al. Conflict monitoring and cognitive control. Psychol. Rev. 2001;108:624–652. doi: 10.1037/0033-295x.108.3.624.
  • 3. Badre D. On Task: How Our Brain Gets Things Done. Princeton University Press; 2020.
  • 4. Miller E.K., Cohen J.D. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 2001;24:167–202. doi: 10.1146/annurev.neuro.24.1.167.
  • 5. Freedman D.J., Assad J.A. Neuronal mechanisms of visual categorization: an abstract view on decision making. Annu. Rev. Neurosci. 2016;39:129–147. doi: 10.1146/annurev-neuro-071714-033919.
  • 6. Shallice T., Burgess P.W. Deficits in strategy application following frontal lobe damage in man. Brain. 1991;114:727–741. doi: 10.1093/brain/114.2.727.
  • 7. Lewandowsky S., Li S.-C. Catastrophic interference in neural networks. In: Dempster F.N., Brainerd C.J., editors. Interference and Inhibition in Cognition. Elsevier; 1995:329–361.
  • 8. Willshaw D.J., et al. Non-holographic associative memory. Nature. 1969;222:960–962. doi: 10.1038/222960a0.
  • 9. Kaufman M.T., et al. Cortical activity in the null space: permitting preparation without movement. Nat. Neurosci. 2014;17:440–448. doi: 10.1038/nn.3643.
  • 10. Libby A., Buschman T.J. Rotational dynamics reduce interference between sensory and memory representations. Nat. Neurosci. 2021;24:715–726. doi: 10.1038/s41593-021-00821-9.
  • 11. Xie Y., et al. Geometry of sequence working memory in macaque prefrontal cortex. Science. 2022;375:632–639. doi: 10.1126/science.abm0204.
  • 12. Rigotti M., et al. The importance of mixed selectivity in complex cognitive tasks. Nature. 2013;497:585–590. doi: 10.1038/nature12160.
  • 13. Flesch T., et al. Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron. 2022. doi: 10.1016/j.neuron.2022.01.005.
  • 14. Failor S.W., et al. Learning orthogonalizes visual cortical population codes. bioRxiv. 2021. doi: 10.1101/2021.05.23.445338.
  • 15. Saxe A., et al. If deep learning is the answer, what is the question? Nat. Rev. Neurosci. 2021;22:55–67. doi: 10.1038/s41583-020-00395-8.
  • 16. Woodworth B., et al. Kernel and rich regimes in overparametrized models. arXiv. 2020. doi: 10.48550/arXiv.2002.09277.
  • 17. Chizat L., et al. On lazy training in differentiable programming. Advances in Neural Information Processing Systems. 2019. doi: 10.48550/arXiv.1812.07956.
  • 18. Jacot A., et al. Neural tangent kernel: convergence and generalization in neural networks. Advances in Neural Information Processing Systems. 2018:8571–8580. doi: 10.48550/arXiv.1806.07572.
  • 19. Arora S., et al. Stronger generalization bounds for deep nets via a compression approach. arXiv. 2018. doi: 10.48550/arXiv.1802.05296.
  • 20. Lee J., et al. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv. 2019. doi: 10.48550/arXiv.1902.06720.
  • 21. Saxe A.M., et al. A mathematical theory of semantic development in deep neural networks. Proc. Natl. Acad. Sci. U. S. A. 2019;116:11537–11546. doi: 10.1073/pnas.1820226116.
  • 22. Geiger M., et al. Scaling description of generalization with number of parameters in deep learning. J. Stat. Mech. 2020.
  • 23. Paccolat J., et al. Geometric compression of invariant manifolds in neural nets. arXiv. 2021. doi: 10.48550/arXiv.2007.11471.
  • 24. Saxe A., et al. Neural race reduction: dynamics of abstraction in gated networks. Proceedings of the 39th International Conference on Machine Learning. 2022:19287–19309.
  • 25. Raposo D., et al. A category-free neural population supports evolving demands during decision-making. Nat. Neurosci. 2014;17:1784–1792. doi: 10.1038/nn.3865.
  • 26. Mante V., et al. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature. 2013;503:78–84. doi: 10.1038/nature12742.
  • 27. Parthasarathy A., et al. Mixed selectivity morphs population codes in prefrontal cortex. Nat. Neurosci. 2017;20:1770–1779. doi: 10.1038/s41593-017-0003-2.
  • 28. Roy J.E., et al. Prefrontal cortex activity during flexible categorization. J. Neurosci. 2010;30:8519–8528. doi: 10.1523/JNEUROSCI.4837-09.2010.
  • 29. Lee J.J., et al. Task specificity in mouse parietal cortex. Neuron. 2022;110:2961–2969.e5. doi: 10.1016/j.neuron.2022.07.017.
  • 30. Fusi S., et al. Why neurons mix: high dimensionality for higher cognition. Curr. Opin. Neurobiol. 2016;37:66–74. doi: 10.1016/j.conb.2016.01.010.
  • 31. Gao P., et al. A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv. 2017. doi: 10.1101/214262.
  • 32. Gao P., Ganguli S. On simplicity and complexity in the brave new world of large-scale neuroscience. Curr. Opin. Neurobiol. 2015;32:148–155. doi: 10.1016/j.conb.2015.04.003.
  • 33. Kikumoto A., Mayr U. Conjunctive representations that integrate stimuli, responses, and rules are critical for action selection. Proc. Natl. Acad. Sci. U. S. A. 2020;117:10603–10608. doi: 10.1073/pnas.1922166117.
  • 34. Hommel B. The Theory of Event Coding (TEC): a framework for perception and action planning. Behav. Brain Sci. 2001;24:849–878. doi: 10.1017/s0140525x01000103.
  • 35. Dekker R., et al. Determinants of human compositional generalization. PsyArXiv. 2022. doi: 10.31234/osf.io/qnpw6.
  • 36. Ito T., et al. Compositional generalization through abstract representations in human and artificial neural networks. arXiv. 2022. doi: 10.48550/arXiv.2209.07431.
  • 37. Frankland S.M., Greene J.D. Concepts and compositionality: in search of the brain’s language of thought. Annu. Rev. Psychol. 2020;71:273–303. doi: 10.1146/annurev-psych-122216-011829.
  • 38. Dehaene S., et al. Symbols and mental programs: a hypothesis about human singularity. Trends Cogn. Sci. 2022;26:751–766. doi: 10.1016/j.tics.2022.06.010.
  • 39. Badre D., et al. The dimensionality of neural representations for control. Curr. Opin. Behav. Sci. 2021;38:20–28. doi: 10.1016/j.cobeha.2020.07.002.
  • 40. Dubreuil A., et al. The role of population structure in computations through neural dynamics. Nat. Neurosci. 2022;25:783–794. doi: 10.1038/s41593-022-01088-4.
  • 41. Musslick S., Cohen J.D. Rationalizing constraints on the capacity for cognitive control. Trends Cogn. Sci. 2021;25:757–775. doi: 10.1016/j.tics.2021.06.001.
  • 42. French R.M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999;3:128–135. doi: 10.1016/s1364-6613(99)01294-2.
  • 43. Parisi G., et al. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71. doi: 10.1016/j.neunet.2019.01.012.
  • 44. Hadsell R., et al. Embracing change: continual learning in deep neural networks. Trends Cogn. Sci. 2020;24:1028–1040. doi: 10.1016/j.tics.2020.09.004.
  • 45. Dohare S., et al. Continual backprop: stochastic gradient descent with persistent randomness. arXiv. 2021.
  • 46. Yang G., et al. Stably maintained dendritic spines are associated with lifelong memories. Nature. 2009;462:920–924. doi: 10.1038/nature08577.
  • 47. Kirkpatrick J., et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. U. S. A. 2017;114:3521–3526. doi: 10.1073/pnas.1611835114.
  • 48. Zenke F., et al. Continual learning through synaptic intelligence. arXiv. 2017. doi: 10.48550/arXiv.1703.04200.
  • 49. Flesch T., et al. Modelling continual learning in humans with Hebbian context gating and exponentially decaying task signals. 2022.
  • 50. Flesch T., et al. Comparing continual task learning in minds and machines. Proc. Natl. Acad. Sci. U. S. A. 2018;115:E10313–E10322. doi: 10.1073/pnas.1800755115.
  • 51. Alvarez P., Squire L.R. Memory consolidation and the medial temporal lobe: a simple network model. Proc. Natl. Acad. Sci. U. S. A. 1994;91:7041–7045. doi: 10.1073/pnas.91.15.7041.
  • 52. McClelland J.L., et al. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 1995;102:419–457. doi: 10.1037/0033-295X.102.3.419.
  • 53. Kumaran D., et al. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn. Sci. 2016;20:512–534. doi: 10.1016/j.tics.2016.05.004.
  • 54. Foster D.J. Replay comes of age. Annu. Rev. Neurosci. 2017;40:581–602. doi: 10.1146/annurev-neuro-072116-031538.
  • 55. Vaz A.P., et al. Replay of cortical spiking sequences during human memory retrieval. Science. 2020;367:1131–1134. doi: 10.1126/science.aba0672.
  • 56. Mnih V., et al. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236.
  • 57. Schaul T., et al. Prioritized experience replay. arXiv. 2015. doi: 10.48550/arXiv.1511.05952.
  • 58. Ambrose R.E., et al. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron. 2016;91:1124–1136. doi: 10.1016/j.neuron.2016.07.047.
  • 59. van de Ven G.M., Tolias A.S. Generative replay with feedback connections as a general strategy for continual learning. arXiv. 2018. http://arxiv.org/abs/1809.10635
  • 60. Goode S., Magill R.A. Contextual interference effects in learning three badminton serves. Res. Q. Exerc. Sport. 1986;57:308–314.
  • 61. Richland L.E., et al. Differentiating the contextual interference effect from the spacing effect. Proceedings of the 26th Annual Meeting of the Cognitive Science Society. 2004:1624.
  • 62. Rohrer D., et al. Interleaved practice improves mathematics learning. J. Educ. Psychol. 2015;107:900–908.
  • 63. Kornell N., Bjork R.A. Learning concepts and categories: is spacing the “enemy of induction”? Psychol. Sci. 2008;19:585–592. doi: 10.1111/j.1467-9280.2008.02127.x.
  • 64. Katz J.S., Wright A.A. Same/different abstract-concept learning by pigeons. J. Exp. Psychol. Anim. Behav. Process. 2006;32:80–86. doi: 10.1037/0097-7403.32.1.80.
  • 65. Antzoulatos E.G., Miller E.K. Differences between neural activity in prefrontal cortex and striatum during learning of novel abstract categories. Neuron. 2011;71:243–249. doi: 10.1016/j.neuron.2011.05.040.
  • 66. Kuchibhotla K.V., et al. Dissociating task acquisition from expression during learning reveals latent knowledge. Nat. Commun. 2019;10:2151. doi: 10.1038/s41467-019-10089-0.
  • 67. Zeng G., et al. Continual learning of context-dependent processing in neural networks. Nat. Mach. Intell. 2019;1:364–372.
  • 68. Duncker L., et al. Organizing recurrent network dynamics by task-computation to enable continual learning. Advances in Neural Information Processing Systems. 2020. https://proceedings.neurips.cc/paper/2020/file/a576eafbce762079f7d1f77fca1c5cc2-Paper.pdf
  • 69. Takagi Y., et al. Projections of non-invasive human recordings into state space show unfolding of spontaneous and over-trained choice. eLife. 2020;10:e60988. doi: 10.7554/eLife.60988.
  • 70. Brincat S.L., et al. Gradual progression from sensory to task-related processing in cerebral cortex. Proc. Natl. Acad. Sci. U. S. A. 2018;115:E7202–E7211. doi: 10.1073/pnas.1717075115.
  • 71. Russin J., et al. A neural network model of continual learning with cognitive control. arXiv. 2022. http://arxiv.org/abs/2202.04773
  • 72. Masse N.Y., et al. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proc. Natl. Acad. Sci. U. S. A. 2018;115:E10467–E10475. doi: 10.1073/pnas.1803839115.
  • 73. Tsuda B., et al. A modeling framework for adaptive lifelong learning with transfer and savings through gating in the prefrontal cortex. Proc. Natl. Acad. Sci. U. S. A. 2020;117:29872–29882. doi: 10.1073/pnas.2009591117.
  • 74. Cohen J.D., et al. On the control of automatic processes: a parallel distributed processing account of the Stroop effect. Psychol. Rev. 1990;97:332–361. doi: 10.1037/0033-295x.97.3.332.
  • 75. Rougier N.P., et al. Prefrontal cortex and flexible cognitive control: rules without symbols. Proc. Natl. Acad. Sci. U. S. A. 2005;102:7338–7343. doi: 10.1073/pnas.0502455102.
  • 76. Verbeke P., Verguts T. Learning to synchronize: how biological agents can couple neural task modules for dealing with the stability-plasticity dilemma. PLoS Comput. Biol. 2019;15:e1006604. doi: 10.1371/journal.pcbi.1006604.
  • 77. Bar M. Visual objects in context. Nat. Rev. Neurosci. 2004;5:617–629. doi: 10.1038/nrn1476.
  • 78. Oja E. Simplified neuron model as a principal component analyzer. J. Math. Biol. 1982;15:267–273. doi: 10.1007/BF00275687.
  • 79. Yu A., Cohen J. Sequential effects: superstition or rational behavior? In: Koller D., et al., editors. Advances in Neural Information Processing Systems. 2009:1873–1880.
  • 80. Cho R.Y., et al. Mechanisms underlying dependencies of performance on stimulus history in a two-alternative forced-choice task. Cogn. Affect. Behav. Neurosci. 2002;2:283–299. doi: 10.3758/cabn.2.4.283.
  • 81. Akaishi R., et al. Autonomous mechanism of internal choice estimate underlies decision inertia. Neuron. 2014;81:195–206. doi: 10.1016/j.neuron.2013.10.018.
  • 82. Krizhevsky A., et al. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012.
  • 83. Bengio Y., et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems. 2006. https://proceedings.neurips.cc/paper/2006/file/5da713a690c067105aeb2fae32403405-Paper.pdf
  • 84. Sheahan H., et al. Neural state space alignment for magnitude generalization in humans and recurrent networks. Neuron. 2021;109:1214–1226.e8. doi: 10.1016/j.neuron.2021.02.004.
  • 85. Musslick S., et al. On the rational boundedness of cognitive control: shared versus separated representations. PsyArXiv. 2020. https://psyarxiv.com/jkhdf/
  • 86. Behrens T.E.J., et al. What is a cognitive map? Organizing knowledge for flexible behavior. Neuron. 2018;100:490–509. doi: 10.1016/j.neuron.2018.10.002.
  • 87. Amalric M., et al. The language of geometry: fast comprehension of geometrical primitives and rules in human adults and preschoolers. PLoS Comput. Biol. 2017;13:e1005273. doi: 10.1371/journal.pcbi.1005273.
  • 88. Liu Y., et al. Human replay spontaneously reorganizes experience. Cell. 2019;178:640–652.e14. doi: 10.1016/j.cell.2019.06.012.
  • 89. Al Roumi F., et al. Mental compression of spatial sequences in human working memory using numerical and geometrical primitives. Neuron. 2021;109:2627–2639.e4. doi: 10.1016/j.neuron.2021.06.009.
  • 90. Duncan J. An adaptive coding model of neural function in prefrontal cortex. Nat. Rev. Neurosci. 2001;2:820–829. doi: 10.1038/35097575.
  • 91. Nelli S., et al. Neural knowledge assembly in humans and deep networks. bioRxiv. 2021. doi: 10.1101/2021.10.21.465374.
  • 92. Milivojevic B., et al. Insight reconfigures hippocampal-prefrontal memories. Curr. Biol. 2015;25:821–830. doi: 10.1016/j.cub.2015.01.033.
  • 93. Morton N.W., et al. Representations of common event structure in medial temporal lobe and frontoparietal cortex support efficient inference. Proc. Natl. Acad. Sci. U. S. A. 2020;117:29338–29345. doi: 10.1073/pnas.1912338117.
  • 94. Schapiro A.C., et al. Neural representations of events arise from temporal community structure. Nat. Neurosci. 2013;16:486–492. doi: 10.1038/nn.3331.
  • 95. Ester E.F., et al. Categorical biases in human occipitoparietal cortex. J. Neurosci. 2020;40:917–931. doi: 10.1523/JNEUROSCI.2700-19.2019.
