Preprint: arXiv:2402.08674v3 (Version 3). Posted 2024 Oct 15; originally published 2024 Feb 13.

Curriculum Effects and Compositionality Emerge with In-Context Learning in Neural Networks

Jacob Russin 1, Ellie Pavlick 2,*, Michael J Frank 3,*
PMCID: PMC10896373  PMID: 38410645

Abstract

Human learning embodies a striking duality: sometimes, we appear capable of following logical, compositional rules and benefit from structured curricula (e.g., in formal education), while other times, we rely on an incremental approach or trial-and-error, learning better from curricula that are unstructured or randomly interleaved. Influential psychological theories explain this seemingly disparate behavioral evidence by positing two qualitatively different learning systems—one for rapid, rule-based inferences and another for slow, incremental adaptation. It remains unclear how to reconcile such theories with neural networks, which learn via incremental weight updates and are thus a natural model for the latter type of learning, but are not obviously compatible with the former. However, recent evidence suggests that both metalearning neural networks and large language models are capable of “in-context learning” (ICL)—the ability to flexibly grasp the structure of a new task from a few examples given at inference time. Here, we show that networks capable of ICL can reproduce human-like learning and compositional behavior on rule-governed tasks, while at the same time replicating human behavioral phenomena in tasks lacking rule-like structure via their usual in-weight learning (IWL). Our work shows how emergent ICL can equip neural networks with fundamentally different learning properties than those traditionally attributed to them, and that these can coexist with the properties of their native IWL, thus offering a novel perspective on dual-process theories and human cognitive flexibility.

1. Introduction

Humans are capable of two qualitatively distinct kinds of learning [1, 2, 3, 4, 5, 6, 7, 8, 9]. The first involves slow, incremental adaptation to the environment through trial and error [2, 9, 6]. The second is much more advanced and involves rapid inference of rules or structure from information available in the environment or held in working memory [10, 11, 12, 13, 14]. For example, although it can famously take 10,000 hours to master the violin, when given a mandolin for the first time, an expert musician may rapidly infer the rules governing how each string is tuned.

Many findings from cognitive psychology and neuroscience support the idea that humans exhibit different learning and generalization behaviors in different domains [1, 15, 5, 16]. In tasks that are readily described by simple rules (e.g., tasks where a feature like color determines correct responses), humans learn efficiently from only a few examples, appearing to make rapid inferences about the latent structure governing the task [12, 17, 18]. They also appear capable of generalizing this structure compositionally, flexibly recombining familiar elements into novel combinations according to the inferred rules [19, 20, 21, 22, 23, 24, 25, 26, 27]. In such settings, people exhibit a blocking advantage, learning better when information is organized into blocks of related examples that make this underlying structure more salient [28, 19, 15, 5]. In contrast, when a task is not governed by simple rules, learning may require integrating across multiple task dimensions, and proceeds much more incrementally [1, 29, 5]. In these contexts, compositional generalization is not possible, and people show an interleaving advantage, learning better when trials are randomly shuffled over time in both laboratory [5, 30] and real-world contexts [31, 32].

Dual-process accounts [1, 4, 33, 5] explain these contrasting effects by positing two separate learning systems: a rule-based or symbolic system that is compositional and operates by testing explicit hypotheses, and a procedural or sub-symbolic system that learns more incrementally and can capture arbitrary associations, even in the absence of simple rules. Neural networks offer a natural framework for understanding the latter: standard networks operate by incrementally updating their weights [34, 35] and exhibit catastrophic forgetting when learning experiences are blocked, but not when they are interleaved [36, 37, 38]. However, it is less clear how they could explain the blocking advantage exhibited by humans on some tasks [15, 5]. Moreover, neural networks have traditionally been criticized for failing to account for human compositionality [20, 24, 39, 40, 41], as they do not explicitly represent rules or symbols [42, 12, 43]. Some biologically informed neural network models account for various aspects of dual-process theories [e.g., 44], such as models of prefrontal cortex (PFC) that emphasize the importance of dynamic activation-based representations for inferring rules and flexibly adapting to the current context [10, 45, 11, 46, 14, 47]. However, these models have not confronted how the emergence of such rule-based processing might relate to curriculum effects and compositionality.

We hypothesized that both compositionality and human curriculum effects might be reproduced by neural networks with greater cognitive flexibility. Recent advances in artificial neural networks have demonstrated surprising success on rule-governed tasks involving reasoning [48, 49, 50], analogy [51, 52], and compositionality [53, 54, 55]. Many of these capabilities are connected to the emergence of in-context learning (ICL), or the ability to learn new tasks from demonstrations or instructions given in context [48, 56, 57, 58]. For example, if demonstrations of a novel task are provided as contextual inputs (strawberry → red, banana → yellow), pretrained networks such as large language models (LLMs) can often readily perform the task on new inputs (plum → ??).
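
To make the mechanics of ICL concrete, the minimal sketch below (in Python; the fruit-color pairs echo the example above, and the arrow-delimited prompt format is our own illustrative choice) shows how a task can be specified entirely in the network's input, leaving the forward pass alone to produce the answer:

# Minimal sketch: the "training data" for ICL lives in the input itself.
demonstrations = [("strawberry", "red"), ("banana", "yellow")]
query = "plum"

# Concatenate study examples and the test query into a single context;
# no weights are updated, so any learning must happen in the forward pass.
prompt = " ".join(f"{x} -> {y}," for x, y in demonstrations) + f" {query} ->"
print(prompt)  # strawberry -> red, banana -> yellow, plum ->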

Importantly, ICL does not require updates to network weights. This stands in contrast with in-weight learning (IWL)—the usual form of learning in neural networks—which proceeds by backpropagating errors to update weights [34]. Instead, ICL takes place within the model’s activation dynamics, similar to the dynamics supporting working memory in neural network models of PFC [13, 14, 10, 59]. Thus, the emergence of a capacity for ICL results in a tradeoff [60, 56, 61, 62]: when ICL succeeds, fewer errors are accumulated, resulting in fewer updates to the weights. This tradeoff resembles one seen in humans, wherein working memory facilitates rapid acquisition of stimulus-response rules but suppresses prediction errors, leading to degraded reinforcement learning and less robust retention [9, 63].

Advanced ICL abilities have been shown to emerge in LLMs [48, 49], but can also be imparted more directly via metalearning, where a network is specifically trained to learn how to learn new tasks provided in context [64, 65, 66, 59, 47]. Metalearning networks that perform ICL through their activation dynamics have been shown to reproduce phenomena associated with the PFC [14, 10, 47], and human-like compositional generalizations [53], suggesting that emergent ICL algorithms can be more rule-like or compositional than the standard IWL used to train networks in the first place [26, 67].

In this work, we demonstrate how a single neural network capable of both ICL and IWL can simultaneously replicate the behavioral effects associated with each of the two systems posited in traditional theories [1, 4, 5], producing compositional generalization and the blocking advantage in rule-governed tasks, while exhibiting an interleaving advantage in tasks lacking such structure. Our theoretical framework can be summarized by three key principles (see Figure 1):

Figure 1:

Properties of in-context learning (ICL) and in-weight learning (IWL). ICL (blue) is the ability of a neural network to flexibly learn a new task from just a few study examples given in context, and to apply this knowledge to a novel test query (also given in context). ICL is carried out through the activation dynamics of the network (blue arrow), and can happen without weight updates. ICL can be compositional, and is shown here predicting the location of a blue alligator (x=3, y=2) by composing elements of the known locations of a blue elephant (x=3) and a red alligator (y = 2). ICL also exhibits a blocking advantage, learning better when related examples are blocked over time. IWL (red) is the usual form of learning in neural networks, wherein prediction errors are backpropagated to update weights. IWL is non-compositional, depicted here as failing to generalize to the blue alligator due to its reliance on a simple lookup table that ignores the compositional structure of the task. IWL exhibits an interleaving advantage, learning better when examples are randomly shuffled or interleaved due to the well-known problem of catastrophic forgetting.

  1. Standard IWL fails on compositional generalization problems, and shows an interleaving advantage due to catastrophic forgetting when trials are blocked.

  2. ICL can be endowed with inductive biases that produce compositional generalization and a blocking advantage.

  3. When ICL is possible, its properties dominate because few errors are made and IWL is suppressed. But when ICL is difficult, the learning properties of IWL dominate because more errors are made and backpropagated.

We test this theoretical framework by experimenting with metalearning neural networks on tasks based on those used in previous human studies [5, 19]. First, we show in a category-learning setting [5] that a single neural network capable of ICL and IWL produces both of the curriculum effects observed in humans—a blocking advantage in the presence of rule-like structure, and an interleaving advantage in the absence of such structure. Then, we show that when applied to a compositional task, the neural network produces the compositional generalization behaviors and the blocking advantage observed in humans on the same task [19]. Finally, we test existing pretrained LLMs on this compositional task and show that their emergent ICL algorithms exhibit both compositionality and a blocking advantage. Taken together, our findings show how two qualitatively distinct learning processes can coexist in a single neural network model, thus offering a unique perspective on how a dual-process architecture might emerge in a neural network.

2. Results

2.1. Curriculum effects in category-learning

We first consider whether the principles above can account for the curriculum effects observed in human category learning, before turning to compositionality in the next section. As reviewed above, in category learning, humans exhibit a blocking advantage when categories are governed by succinct rules, but an interleaving advantage when no such rules are readily available [5].

We designed a category-learning task directly based on this previous work [5], but suitable for use with metalearning neural networks (see Figure 2a–c). Stimuli varied along two feature dimensions (akin to line length and line orientation), each with 8 possible values, yielding 64 possible items. Each item was assigned to one of two categories, indicated by an arbitrary category label (e.g., ‘A’ or ‘B’). In the Rule-like condition, one of the two feature dimensions determined category membership (e.g., lines with shorter lengths are in category ‘A’ and lines with longer lengths are in category ‘B’), while in the Rotated condition, category membership was determined by both features. This simple rotation has been shown to challenge the search for a simple, verbalizable rule, and is thought to recruit the more incremental procedural learning system in humans [5, 1]. Networks were presented with 16 items from each category (32 total), and tested on the remaining held-out items. The 32 items used during learning were either Blocked, where items from one category were presented first, followed by the items from the other, or Interleaved, where items were randomly shuffled. Both rotation conditions were tested with both curriculum conditions, yielding a 2×2 design.
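
A compact sketch of this 2×2 design is given below (Python; the rule boundary, the diagonal boundary used for the rotated condition, and the naming are our own illustrative choices rather than the exact stimuli used in the experiments):

import random

def make_category_task(rule_like=True, blocked=True, seed=0):
    # 8 x 8 grid of items over two feature dimensions (64 items total).
    rng = random.Random(seed)
    items = [(f1, f2) for f1 in range(1, 9) for f2 in range(1, 9)]

    def label(item):
        f1, f2 = item
        if rule_like:
            return "A" if f1 <= 4 else "B"    # one dimension determines the category
        return "A" if f1 + f2 <= 9 else "B"   # rotated: both dimensions matter (illustrative boundary)

    category_a = [i for i in items if label(i) == "A"]
    category_b = [i for i in items if label(i) == "B"]
    study_a = rng.sample(category_a, 16)       # 16 study items per category (32 total)
    study_b = rng.sample(category_b, 16)

    study = [(i, "A") for i in study_a] + [(i, "B") for i in study_b]
    if blocked:
        pass                                   # one category's items first, then the other
    else:
        rng.shuffle(study)                     # randomly interleaved

    held_out = [(i, label(i)) for i in items if i not in study_a and i not in study_b]
    return study, held_out                     # 32 study items, 32 held-out test items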

Figure 2:

Category-learning experiments. The task is derived from a human study [5]. Networks were presented with multi-feature items along with their category labels, and tested on unseen items. See Appendix for details. (a) Curriculum conditions. Trials were either blocked by category label or randomly interleaved. (b) In the rule-like condition, category membership was determined by a simple rule that only depended on one of the two features (e.g., ‘A’ if length ≤ 4, ‘B’ otherwise). (c) In the rotated condition, category membership was determined by both feature dimensions. The original axes were rotated by 45 degrees and a category boundary was chosen in the new coordinate system. (d-f) Category-learning with in-weight learning (IWL) only. Randomly initialized networks were trained from scratch on the task. (d) The few-shot evaluation tested networks’ ability to learn the task from the 32 examples presented in context, before any weight updates were made. Unsurprisingly, randomly initialized networks without prior metalearning experience were incapable of utilizing the examples given in context to learn the task, regardless of condition. Values correspond to the average test accuracies shown in (f), but at step 0 (i.e., before any finetuning took place). (e) The same evaluation was conducted after “finetuning”, testing networks’ ability to learn the task through IWL. Without prior metalearning experience, the network was able to learn in weights, performing well on both the rule-like and the rotated tasks after training. However, performance was much worse in the blocked condition due to catastrophic forgetting (see f). Here, values correspond to the train accuracy (i.e., accuracy on the 32 train items) in (f) at the final timestep. (f) Accuracy and loss results over the course of IWL training in each of the four conditions. Accuracy is split by category to better visualize the effects of catastrophic forgetting in the blocked condition (top row). (g-i) Category learning with both in-weight and in-context learning (ICL). Networks first metalearned on a distribution of related tasks (not shown), and were subsequently finetuned on specific category-learning tasks from each condition. (g) After metalearning, the models exhibited strong ICL on the task, as shown by the high few-shot accuracy. ICL demonstrated a blocking advantage, and also showed improved performance in the rule-like compared to the rotated condition. (h) After finetuning, the network exhibited an interleaving advantage in the rotated condition, due to catastrophic forgetting when trials were blocked (see i). (i) Accuracy and loss results over the course of finetuning in each of the four conditions. When trials were blocked in the rule-like condition, ICL achieved near-perfect accuracy immediately, resulting in little loss and thus little IWL. When trials were interleaved, few-shot performance was worse (see g), but performance quickly recovered due to compensation by IWL. In the rotated condition, ICL failed, resulting in larger losses and increased IWL. This IWL resulted in catastrophic forgetting, as can be seen in the rapid decline in performance on ‘A’ items while training on ‘B’, and vice versa. No such catastrophic forgetting occurred when trials were interleaved (although test performance was not perfect).

2.1.1. IWL produces an interleaving advantage

In this category-learning setting, a network capable of IWL but not ICL exhibited an interleaving advantage, regardless of the presence or absence of rule-like structure. This is consistent with classic findings showing that standard learning in neural networks (i.e., IWL) benefits from random interleaving due to the well-known phenomenon of catastrophic forgetting [37, 38]. A randomly initialized network was trained from scratch on the categorization task in each of the four conditions (see methods for details). Because IWL requires slow, incremental updates, this network was not capable of few-shot learning in this setting (see Figure 2d) even in the rule-like condition, where a few examples should suffice for inference of the simple rule. Consistent with our theoretical framework (principle 1), the model performed better when trials were interleaved compared to when they were blocked (p < 10⁻³; see Figure 2e,f), in both the rule-like and rotated conditions (although slightly better in the rule-like condition). This interleaving advantage was due to catastrophic forgetting when trials were blocked, which can be seen in the dramatic decrease in the network’s performance on examples of the category trained during the previous block (e.g., performance on category ‘A’ decreases as category ‘B’ is trained in the second block). Thus, the default in-weight learning (IWL) behavior of neural networks can explain why an interleaving advantage would be observed in human category-learning [5]. However, a network capable of IWL alone cannot account for the blocking advantage that humans exhibit when categories are governed by rule-like structure [19, 15, 5].

2.1.2. ICL can produce a blocking advantage

Next, we endowed a network with ICL abilities by (meta)training it on a distribution of categorization tasks (see methods for details). Metalearning can induce ICL in deep neural networks [53, 57], and relatedly, has been shown to give rise to abstract generalizable representations in models of PFC [14, 47, 10]. These ICL abilities allowed the network to solve unseen tasks given in context through its activation dynamics, even when weights were frozen and no IWL was allowed to occur.

To ensure that the emergent ICL algorithm would have the desired properties (see principle 2), we (meta)trained it on a distribution of categorization tasks with 1) rule-like structure and 2) blocked curricula. We then evaluated the trained network in the few-shot setting, where the weights were frozen and the network had to learn new tasks from a few examples given in context (see methods for details; see Figure 2g). As predicted, when the model was endowed with an ICL algorithm familiar with rule-like category-learning problems, it could easily generalize to new rule-like problems, but struggled to solve tasks in context in the rotated condition (main effect of rotation: p < 10⁻³). Moreover, the emergent ICL algorithm exhibited a blocking advantage on unseen rule-like categorization problems (main effect of curriculum: p < 0.05). This blocking advantage emerged due to the metalearning distribution (see Appendix), but see Discussion for alternative explanations based on architectural constraints in human brains. In sum, these few-shot results suggest that it is possible to endow a network with an ICL algorithm that is sensitive to rule-like structure and to the learning curriculum: the network’s forward activation dynamics were capable of making inferences over the items provided in context, but were better at doing so when related items were blocked over time.

2.1.3. Concurrent ICL and IWL reproduce both curriculum effects

While the above explorations showed how IWL and ICL can produce different curriculum effects, we are now in a position to study how the two might interact in a single system capable of both. To do this, we took our network that developed ICL abilities through metalearning, and gave it unseen category-learning tasks, allowing it to learn by either ICL (via forward activation dynamics) or IWL (via error backpropagation). Here, we predicted that the dynamic interaction between IWL and ICL would qualitatively reproduce the full set of curriculum effects observed in the original study [5]: ICL would produce the blocking advantage in the presence of rule-like structure, while IWL would produce the interleaving advantage in the absence of such structure (see principle 3).

As we described above, when categories are governed by rule-like structure, ICL succeeds on the task and exhibits a blocking advantage in few-shot inference. But in the rotated task, where categories are not governed by rule-like structure, ICL struggles (Figure 2g). The resulting errors, when backpropagated, drive an increase in IWL, producing an interleaving advantage due to catastrophic forgetting (Figure 2i; interaction between curriculum and rotation: p < 10⁻³).

Thus, consistent with our predictions, we have shown that a single model capable of ICL and IWL can recapitulate the curriculum effects observed in human category-learning [5]. When the network is capable of making inferences over familiar rules, it can solve new tasks from a few samples given in context. However, when the environment does not afford such inferences or the network cannot make them, IWL can still compensate, allowing good performance. This IWL suffers from catastrophic forgetting, resulting in an interleaving advantage on the rotated task.

2.2. Curriculum effects in a compositional task

As noted above, one of the most impressive recent developments in neural networks has been the demonstration that ICL can give rise to compositionality [53, 55, 67], traditionally considered to be a major theoretical challenge to neural networks [20, 40]. Recent results have shown that while standard IWL in neural networks struggles to reproduce human-like compositional generalization behaviors [39, 68, 69], emergent ICL abilities in neural networks can appear to compose inferred rules in order to generalize to new inputs [49, 53, 54, 55]. Thus, a key goal of our framework is to leverage the distinction between ICL and IWL to provide a unified account of both the compositional generalization behaviors and the curriculum effects observed in humans. In particular, ICL should account for both the blocking advantage and for compositional generalization in tasks governed by rule-like structure, while IWL accounts for the interleaving advantage observed when such compositional generalization is challenging or impossible.

We focused our investigations on a recent study demonstrating compositional generalization in humans on a novel rule-governed task [see Figure 3a–c; 19]. Notably, this study showed that compositional generalization indeed depended on the curriculum, improving when related trials were blocked compared to interleaved—consistent with the idea that the mechanisms underlying compositionality can be linked to those responsible for producing the blocking advantage. This task therefore provides an excellent testbed for our metalearning neural networks, allowing us to replicate our original curriculum-related results from the category-learning task while also studying their connection to compositionality.

Figure 3:

Compositional task and results. The task is derived from a human study [19]. Networks were presented with the locations corresponding to particular cues (colored animals) and had to predict the locations of unseen cues. Cues were composed of sequences of tokens (e.g., ‘blue alligator’). See Appendix for details. (a) Curriculum conditions. In both the blocked and interleaved conditions, the 9 study examples always included one full row and one full column. In the blocked condition, the row was presented in full before the column, or vice-versa. In the interleaved condition, these 9 examples were randomly shuffled. (b) In the rule-like condition, locations varied systematically with the color and animal features (e.g., color determined x-coordinate and animal determined y-coordinate). (c) In our novel rotated condition, the original axes were rotated by 45 degrees, so that any change in either color or animal resulted in a change to both x- and y-coordinates. (d-f) Performance on the compositional task with in-weight learning (IWL) only, where again randomly initialized networks were trained from scratch. (d) The few-shot evaluation tested networks’ ability to solve the task in context based on the 9 study examples given in the input. Again, without prior metalearning, neural networks were incapable of solving the task in this way, regardless of condition. (e) Without prior metalearning experience, the network was still able to learn via IWL, performing well on both the rule-like and the rotated tasks after finetuning. IWL again exhibited an interleaving advantage due to catastrophic forgetting (see f). (f) Accuracy and loss results over the course of IWL training in each of the four conditions. Accuracy is again split by group, in this case corresponding to whether the cue was part of the row or the column (see a). Here results were similar to the category-learning case, where IWL exhibited catastrophic forgetting when trials were blocked, regardless of rotation condition. IWL also failed to generalize compositionally, failing on the 16 held-out test cues (green lines) in all conditions. (g-i) Experiments using networks capable of both in-weight and in-context learning (ICL). (g) After metalearning, the models again exhibited a blocking advantage, but also showed strong compositional generalization, as shown by the high few-shot test accuracy in the blocked condition. ICL again failed in the rotated condition. (h) After finetuning, the network exhibited an interleaving advantage in the rotated condition, due to catastrophic forgetting when trials were blocked (see i). (i) When trials were blocked in the rule-like condition, accuracy was near-perfect, resulting in little loss and thus little IWL. In the rotated condition, ICL failed, resulting in larger losses, increased IWL, and increased catastrophic forgetting, as can be seen in the rapid drop in accuracy on the first group (‘TrainA,’ shown in purple) while training on the second group (‘TrainB,’ shown in orange), and vice versa. No catastrophic forgetting occurred in the interleaved condition, but compositional generalization (green) was considerably worse than when trials were blocked.

In the original task, participants learned to pair colored animals with arbitrary xy-coordinates via trial-and-error. Importantly, the correct locations varied systematically with the two features: color determined the x-coordinate (each of 5 different colors was linked to one of 5 different x-values) while the animal determined the y-coordinate, or vice-versa. Participants saw only 9 of the 25 possible color-animal pairs as study examples; they had to make novel inferences on the 16 remaining pairs during testing (without feedback). This task can be seen as rule-based in that a simple rule (e.g., color = x, animal = y) governs the locations, and can be seen as compositional in that good test performance requires composition of knowledge about a particular color (e.g., ‘blue’ means x = 3) with knowledge about a particular animal (e.g., ‘alligator’ means y = 2) into a novel combination (e.g., ‘blue alligator’ means location is 3, 2).

The key experimental variable manipulated in the study was the curriculum—which 9 of the 25 cues were used as study examples, and the order in which they were presented (see Figure 3a–c). In the Blocked condition, all cues of a particular color (i.e., a single row/column) were presented before all the cues with a particular animal, or vice-versa. In the Interleaved condition, a single row and column were again chosen for study, but their order was randomly shuffled. (Note that the original study also tested two other related conditions, where sampling of items was “Aligned” or “Misaligned”; we simulated these cases and reproduced similar results in the Appendix, but here focus on the key blocked vs. interleaved contrast.)

The experimenters found that human compositional generalization performance depended on which curriculum was used: participants performed better in the blocked than the interleaved condition [19]. The original study did not manipulate the presence or absence of rule-like structure as the categorization task did [5], but we hypothesized that rotating the underlying coordinate grid (see Figure 3c) would cause a similar interleaving advantage to emerge. This is because when the underlying coordinate system is rotated, no simple rule (e.g., color = x, animal = y) is available. We therefore tested our metalearning models in both the original Rule-like setting, and in a Rotated version.
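
The sketch below gives one way to construct this task and its rotated variant (Python; the color and animal names, the assignment of color to the x-coordinate, and the integer scaling of the 45-degree rotation are illustrative assumptions, not the exact materials used in the experiments):

import random

COLORS = ["blue", "red", "green", "purple", "yellow"]
ANIMALS = ["alligator", "bear", "camel", "dog", "elephant"]

def make_compositional_task(blocked=True, rotated=False, seed=0):
    rng = random.Random(seed)
    locations = {}
    for x, color in enumerate(COLORS):            # color determines the x-coordinate
        for y, animal in enumerate(ANIMALS):      # animal determines the y-coordinate
            if rotated:
                # Rotate the grid by 45 degrees (and scale) so that both features
                # affect both coordinates while keeping integer locations.
                locations[(color, animal)] = (x + y, x - y)
            else:
                locations[(color, animal)] = (x, y)

    row_color = rng.choice(COLORS)                 # one full row ...
    column_animal = rng.choice(ANIMALS)            # ... plus one full column = 9 study cues
    row = [(row_color, a) for a in ANIMALS]
    column = [(c, column_animal) for c in COLORS if c != row_color]
    study = row + column
    if not blocked:
        rng.shuffle(study)                         # interleaved curriculum

    test = [cue for cue in locations if cue not in study]   # 16 held-out combinations
    return [(c, locations[c]) for c in study], [(c, locations[c]) for c in test]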

2.2.1. IWL is non-compositional and produces an interleaving advantage

As in the simulations with the categorization task, we first evaluated neural networks without ICL capabilities on the task by training them from scratch. Without ICL, performing the task in the few-shot setting was again impossible (see Figure 3d). The only way the network could learn was through IWL, which again exhibited an interleaving advantage due to catastrophic forgetting when trials were blocked (confirmed by a main effect of curriculum: p < 10⁻³; see Figure 3e,f). Because the network had no way of inferring rules from in-context examples, there was no observable difference between the rule-like task and the rotated task. Furthermore, in both versions of the task the network learned the study examples well when trials were interleaved, but performed poorly on test trials that required compositional generalization. Thus, in contrast to the categorization task where IWL showed good generalization performance (see Figure 2f), the compositional task allowed us to reproduce known failures in compositional generalization in networks capable only of standard IWL [20, 68, 69, 39, 40].

2.2.2. ICL can be compositional and can produce a blocking advantage

We then endowed the network with ICL abilities by metalearning on a distribution of tasks (see methods for details). After metalearning, the network’s emergent ICL algorithm generalized compositionally on unseen tasks, achieving good performance on color-animal combinations that were not included in the study examples. This generalization performance involved the composition of rules that could be inferred from the study examples (see Figure 3b). Furthermore, as in the previous simulations, metalearning endowed the network with an ICL algorithm that exhibited the same kind of blocking advantage observed in humans [19], performing better in the few-shot setting when trials were blocked compared to interleaved (main effect of curriculum on the rule-like task: p < 10⁻³).

These findings extend recent work [53] by showing that the ICL algorithm that emerges in metalearning neural networks can reproduce human-like compositional generalization performance and its associated blocking advantage in this experimental paradigm [19]. This is significant because it shows how neural networks, which have traditionally been criticized for lacking compositionality [20, 40], can through metalearning come to implement an ICL algorithm that is capable of human-like compositional generalization [26, 67].

2.2.3. ICL and IWL produce compositionality and both curriculum effects

Finally, we allowed IWL to occur in the network that was capable of ICL, and replicated the full set of human curriculum effects that we reproduced above in the category-learning setting [5]. As predicted, ICL failed in our novel rotated version of the task, leading to more errors and thus greater IWL (see Figure 3g). This increase in IWL led to the emergence of an interleaving advantage (see Figure 3h)—a testable prediction not evaluated in humans in the original study—whereas ICL again produced the blocking advantage in the original rule-like task (see Figure 3g; interaction between rotation and curriculum: p < 10⁻³). Taken together, our findings on the compositional task are again consistent with our theoretical framework (see principle 3), and show how the distinction between in-context and in-weight learning can offer a unified account of human compositional generalization capabilities and their dependence on the learning curriculum [19].

2.3. LLMs exhibit compositionality and a blocking advantage

So far, we have established that it is possible for an ICL algorithm to exhibit compositionality and a blocking advantage, and that a single neural network implementing this kind of ICL alongside its usual IWL will reproduce the full set of empirical results that we have been targeting. A separate question one can ask is why a network would develop an ICL algorithm with these particular properties in the first place. In our metalearning experiments, we used task distributions that promote these properties (see methods), but there may be more naturalistic distributions that could give rise to them.

Although the datasets used for LLM pretraining are developmentally unrealistic in many ways [70, 71, 72], they are more naturalistic in the sense that they are composed of natural language text, rather than content that is specifically relevant to our tasks. These corpora are not purposefully designed to encourage ICL or any of our hypothesized properties to emerge. Nevertheless, impressive ICL abilities do arise in these models, giving them the flexibility to accomplish many kinds of tasks in context [48, 49]. Given the scale and complexity of their pretraining datasets, it is unclear a priori what ICL properties LLMs should develop, but prior work has shown that their emergent ICL abilities can exhibit compositional generalization in some settings [54, 52, 55], and can also be sensitive to the order in which in-context examples are provided [73, 74].

We thus hypothesized that the properties of ICL assumed by our theoretical framework (i.e., compositionality and a blocking advantage, see principle 2) may emerge in LLMs. We tested this hypothesis with two pretrained LLMs on the same compositional task used above: Llama 2 [75] and GPT-3.5 [48, 76]. We evaluated the emergent ICL abilities of these models by presenting color-animal pairs from the compositional task only in context.1

Both LLMs showed impressive compositional generalization performance on the task (see Figure 4), even though they were only given the 9 study examples and had not been explicitly pretrained on variants of the task. This shows that the emergent ICL abilities in these models can produce the kinds of generalization behaviors that standard IWL in neural networks struggles to achieve (see test accuracy in Figure 3f).

Figure 4:

LLM results. Large language models (LLMs) are capable of in-context learning (ICL) on the text-based version of the compositional task based on the human study [19]. Both GPT-3.5 (solid lines) and Llama 2 (dashed lines) achieved good compositional generalization performance on the rule-like version of the task (blue), and also exhibited a blocking advantage, performing better when trials were blocked than interleaved (see Figure 3). ICL performance was much worse on the rotated task (red), consistent with our theoretical framework.

Notably, both LLMs also produced the blocking advantage in the rule-like version of the task (curriculum main effect: p < 10⁻³).2 This again shows that even though the ICL capability in the LLMs has not been specifically sculpted to produce this blocking advantage, it emerges nonetheless via large-scale next-token prediction on large corpora of text.

Finally, both LLMs performed poorly on the rotated task (rotation main effect: p < 10⁻³). This is also consistent with our theoretical framework (see principle 3), which predicts that ICL should be more difficult in the absence of rule-like structure because in-context inferences are more complex. IWL would be required to compensate for the failure of ICL on such tasks, as we showed in our metalearning experiments.

Thus, neural networks can come to implement an ICL algorithm with the properties of compositionality, a blocking advantage, and a preference for rule-like structure—even when their training does not specifically target these properties, but consists in next-token prediction on a large corpus of natural text.

3. Discussion

Influential theories in cognitive science posit two distinct systems to account for findings suggesting a duality in human learning [1, 2, 3, 4, 5, 6, 7, 8, 9]. Prominent theories leverage distinctions between controlled vs. automatic processing [77, 78, 6], model-based vs. model-free reinforcement learning [3, 79, 80], working memory in PFC vs. striatal synaptic learning [45, 29, 9, 63], system 2 vs. system 1 thinking [33], and rule-based vs. procedural learning [1, 5]. These theories explain why human learning exhibits different properties under different conditions. Here, we have focused on two such properties: 1) compositionality and 2) curriculum effects. Humans are capable of utilizing rule-like structure to generalize compositionally [19, 24, 12, 21, 22, 23], and of integrating over multiple dimensions and making arbitrary associations when no rule-like structure is present [81, 1, 5, 38]. In the former case, learning tends to benefit when related trials are blocked over time [19, 15, 5], while in the latter case it benefits when trials are interleaved [31, 32, 5, 30].

Our work shows how these phenomena can be explained by a single neural network capable of two qualitatively distinct learning processes. In particular, we have shown how metalearning can endow a network with a capacity to learn in context, and how this capacity can capture compositionality and the blocking advantage on tasks governed by rule-like structure. ICL operates by default, but can be unsuccessful on tasks lacking such structure, triggering error-driven IWL and producing an interleaving advantage due to catastrophic forgetting [37, 38]. This dynamic interaction between ICL and IWL is analogous to a tradeoff observed in humans: when working memory is used to learn new stimulus-response rules, reductions in neural prediction errors and incremental reinforcement learning are observed [63, 9]. Our theoretical framework offers a unified perspective on compositionality and curriculum effects, extending dual-process theories by showing how two distinct learning processes can coexist (and compete) within a single neural network.

3.1. Curriculum Effects

There has been some debate about whether humans learn better when related content is blocked or interleaved over time, with some studies finding a blocking advantage [28, 19, 15, 5] and others finding an interleaving advantage [31, 32, 5, 30]. There may be multiple factors that distinguish these cases [e.g., between-category and within-category similarity; 82], but one important variable may be the presence of rule-like structure: humans have been shown to exhibit a blocking advantage when the task is governed by succinct rules, and an interleaving advantage when the task does not afford such rules [19, 5]. These effects are explained by a dual-process account in which a rule-based learning system operates by an explicit hypothesis-testing strategy and a procedural learning system operates by incrementally integrating information over time [1, 5]. Our work offers a novel perspective on this dual-process account, showing how a similar duality can emerge in neural networks capable of both ICL and IWL.

In our framework, the interleaving advantage arises because of catastrophic forgetting [37], which is a natural property of IWL in neural networks due to their use of overlapping distributed representations [38]. Might this kind of forgetting explain the interleaving advantage observed in humans? The brain is thought to mitigate catastrophic forgetting through the use of sparse, pattern-separated representations in hippocampus [38, 83]. However, this effect is unlikely to be eliminated completely, so a similar principle may still underlie the modest interleaving advantage observed in humans [5]. Future work could directly investigate the extent to which the interleaving advantage observed in the absence of rule-like structure is due to this kind of forgetting.

The blocking advantage, on the other hand, does not emerge by default in standard neural networks, but a number of studies have explored the neural mechanisms that might underlie it. For example, a neural network model of rule-based inference and working memory in the PFC showed that blocking related trials over time can encourage abstract rule-like representations to emerge in the network’s activations [14]. More recent work [36] showed that a PFC-like neural network augmented with a gating mechanism and a bias for active maintenance produces a blocking advantage on a task involving cognitive maps [84]. Related work has shown how a neural network equipped with a specialized Hebbian gating mechanism [85] can reproduce a blocking advantage observed in humans on an analogous task [15]. A similar Hebbian mechanism was then used to explain the blocking advantage observed in the compositional task studied here [19]. Another recent study showed how the blocking advantage observed in humans on a next-state prediction task [28] was reproduced by a neural network model that actively maintained distinct contextual representations over time [86]. Overall, these studies emphasize how a blocking advantage can emerge when inferences are made through forward activation dynamics (i.e., in context), such as those made over items maintained in working memory in PFC.

Our theoretical account of the blocking advantage is broadly consistent with previous models of this effect, but has a number of advantages. First, we have shown how the blocking advantage can coexist with the interleaving advantage in a neural network. Furthermore, while our framework is consistent with previous models in suggesting that the blocking advantage is related to activation dynamics [e.g., working memory in PFC; 14, 36], we show how these dynamics can be metalearned in a transformer by training it on a distribution of related tasks [47], thus providing a conceptual link between these prior models and ongoing work investigating metalearning and cognitive flexibility in natural and artificial intelligence [64, 53, 87, 88, 52].

Indeed, we also observed a blocking advantage in LLMs, which have revolutionized artificial intelligence research [48, 49], and are arguably the most cognitively flexible systems ever built [49, 89]. These results show that a blocking advantage can emerge with ICL even when networks are trained on natural text rather than metalearning datasets specifically designed to promote it. Although it is difficult to know exactly why this blocking advantage emerges in the LLMs, we speculate that it is driven by distributional properties of the natural text corpora on which they are trained, such as the tendency for human writing to afford inferences best made by assimilating consecutive examples in a sequential, rather than haphazard, manner. However, further work is needed to better understand the sources of the blocking advantage in the LLMs, and the internal mechanisms responsible for producing it.

In particular, our work does not directly address whether the blocking advantage observed in humans emerges due to strong constraints imposed by neural architecture (e.g., recurrence, limitations in working memory capacity), rather than the statistical properties of the environment (e.g., the distributional properties of natural language). In our experiments, both the metalearning networks and the LLMs utilized the transformer architecture [90], which is not recurrent and does not have hard constraints in working memory capacity. The blocking advantage emerged in these models due to statistical properties of their training data. This was especially clear in the metalearning experiments, where we had full control over the data distribution and confirmed that it determines when the blocking advantage emerges (see Appendix). Consistent with these findings, it has been shown that the human blocking advantage depends on the extent to which the feature dimensions relevant to the rule-like structure of the task are represented in a strongly segregated manner [15], a factor that is likely to depend on an individual’s prior learning experiences. However, we think that the human blocking advantage is also likely to depend on key architectural features of the human brain, such as its recurrence and the mechanisms for gating and serial attention in PFC and basal ganglia [13, 29, 36]. These, in turn, might affect the distributional properties of natural language that is produced by humans and provided as training data for the LLMs. Further work is required to understand how architectural features interact with the distributional properties of a network’s training data, and how they might impact the emergence of ICL with specific properties.

3.2. Compositionality

Compositionality is thought to be a key property underlying human cognitive flexibility, permitting familiar rules or concepts to be combined in novel ways, thus facilitating a powerful form of generalization [20, 12, 67, 91]. Recent work has shown that although compositionality may not be a natural property of standard IWL in neural networks [20, 39, 69, 40], it can emerge as a property of an ICL algorithm [53, 67]. Our results build on this work, showing that it is possible to endow a neural network with an ICL algorithm that is capable of reproducing the compositional generalization behaviors observed in humans in a recent study [19], even when standard IWL fails (see test accuracy in Figure 3f). We showed that this kind of ICL algorithm can be metalearned by training on a distribution of related tasks, but also emerges in LLMs that are pretrained on large corpora of text (see Figure 4). While metalearning offers a clear understanding of how a neural network can come to implement an emergent compositional learning algorithm [53, 67], it is less clear why this property would emerge in LLMs trained on next-word prediction. One suggestion is that the large-scale pretraining of LLMs can itself be seen as a kind of metalearning [48, 87], where some subset of samples from the training distribution puts pressure on these models to learn how to compose novel concepts presented in context [67]. This is consistent with the hypothesis that human compositionality is metalearned—a conjecture that, while difficult to study, may yield specific empirical predictions [26, 53, 92, 93]. Finally, a key contribution of our work is that it builds on studies linking compositionality to curriculum effects in humans [19], providing a unified account of compositionality, curriculum effects, and their interaction by demonstrating that a single neural network capable of both ICL and IWL can reproduce them.

3.3. Metalearning as a tool for modeling cognition

Metalearning allows a neural network to learn how to learn new tasks through its forward activation dynamics [94, 53, 66, 47]. Increasingly used in cognitive modeling [64], metalearning is a versatile technique that allows the modeler to impart any inductive bias to a neural network so long as it can be implicitly specified by a dataset of input-output pairs [95]. We used metalearning to impart inductive biases for compositionality and the blocking advantage. As others have noted [64, 53, 95], researchers can deploy the metalearning approach to understand the consequences of these inductive biases while remaining agnostic about whether the metalearning process models the evolution of innate biases or their development within an individual’s lifetime. Regardless of their origin in humans, our simulations show how compositionality and curriculum effects can emerge in a single neural network through the interactions between ICL and IWL.

3.4. One network or two systems?

We have emphasized how two competing learning processes can coexist in a single neural network, with one taking place through activation dynamics (ICL), and the other taking place through changes to synaptic weights (IWL). Although the distinction between activation-based and weight-based dynamics is relevant to many aspects of computation in the brain [13, 96, 97, 47], we note that the functional roles we hypothesize for ICL and IWL may also map onto different brain regions. The organization of the human PFC, which has an intrinsic bias to robustly maintain information over longer timescales until it is actively updated [13, 98, 99], may encourage ICL abilities, along with their specific properties, to become localized to this area [100, 36]. Indeed, the PFC is known to be important for the flexible adaptation of behavior to the current context [46], and for the maintenance of task sets [101, 102] and goals [103].

While we used standard transformers [90] that did not contain any separate PFC-like system, we note that the ICL algorithms implemented in their activation dynamics can be seen as analogous to those observed in neural models of PFC trained across multiple tasks [14, 47, 10]. Recent work has shown that transformer architectures can mimic the frontostriatal gating mechanisms in these biological models when trained on human working memory tasks, and exhibit effective capacity limitations despite the lack of any inherent architectural constraint imposing such a limitation [104, 105]. Future work could use similar techniques to investigate whether emergent PFC-like computational mechanisms also explain ICL-related phenomena in our metalearning networks.

4. Methods

4.1. Task details

The inputs and outputs of both tasks were encoded into sequences of tokens appropriate for processing by standard transformer architectures [90]. In the category task, each of the two feature dimensions could take any of eight possible values (e.g., ‘length-1,’ ‘length-2,’ …), and each of the categories was associated with one arbitrary label (‘A’ or ‘B’). Each of these feature values and category labels was encoded as a separate token. Inputs to the model consisted of a set of study examples along with a single test query, all supplied to the model in context. 32 study examples were given, each of which consisted of an item-label pair, where the item contained two tokens corresponding to the values of the two feature dimensions. The query came after the study examples, and consisted of a single item without a category label.

In the compositional task, all of the colors and animals were encoded as separate tokens, as were the x- and y-coordinates. Again, inputs to the model consisted of a set of study examples and a single query. In this task, the study examples included 9 item-location pairs, where each item contained a color and an animal and each location contained an x- and a y-coordinate.
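
As a sketch of this serialization (Python; the exact token strings follow the Appendix description of the category task, but the helper function itself is ours), one episode can be encoded as follows:

def encode_category_episode(study_examples, query):
    """study_examples: list of ((length, orientation), label); query: (length, orientation)."""
    tokens = []
    for (length, orientation), label in study_examples:
        # Each study example is an item-label pair, separated from the next by '<sep>'.
        tokens += [f"length-{length}", f"orientation-{orientation}", ":", label, "<sep>"]
    length, orientation = query
    tokens += [f"length-{length}", f"orientation-{orientation}", ":"]   # the model predicts the label
    return tokens

# encode_category_episode([((6, 3), "A")], (7, 2))
# -> ['length-6', 'orientation-3', ':', 'A', '<sep>', 'length-7', 'orientation-2', ':']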

4.2. Model details

In all models, tokens were embedded using a dictionary of learnable vectors. In the metalearning experiments, these embeddings started out as arbitrary random vectors and were optimized by end-to-end backpropagation throughout training. In the LLMs, real English words were used, allowing the models to leverage semantic knowledge gained through pretraining. This is similar to the human participants in the original studies, who could leverage existing knowledge that color and animal are orthogonal feature dimensions, for example.

All metalearning experiments used the same transformer architecture [75, 90]. An informal hyperparameter search was conducted to find a suitable number of layers, hidden size, dropout, and learning rate. The size of the feedforward layers was always twice the hidden size. The best-performing model was selected based on validation accuracy for each task separately. In the category task, the best-performing model had 4 layers, 8 heads, a hidden size of 64, and no dropout. In the compositional task, the best-performing model had 12 layers, 8 heads, a hidden size of 64, and dropout of 0.1. Models were evaluated on exact-match accuracy using greedy decoding and teacher forcing.
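
For illustration, the reported compositional-task configuration might be instantiated with PyTorch's generic transformer layers as in the sketch below (the exact architecture, positional encoding, and output head used in the experiments are not specified here and may differ):

import torch.nn as nn

HIDDEN = 64
layer = nn.TransformerEncoderLayer(
    d_model=HIDDEN,               # hidden size 64
    nhead=8,                      # 8 attention heads
    dim_feedforward=2 * HIDDEN,   # feedforward size is twice the hidden size
    dropout=0.1,                  # 0.1 for the compositional task (none for the category task)
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=12)   # 12 layers (4 for the category task)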

In the LLM experiments, we evaluated GPT-3.5 [48, 76] and Llama 2 [75]. GPT-3.5 is an LLM pretrained on next-token prediction and finetuned to be more useful in a chat-based interface. We used the version of Llama 2 that has not been finetuned on instruction data. In GPT-3.5 (“gpt-3.5-turbo-instruct”), the temperature was set to 0.1, and five runs were performed. A maximum of 7 tokens was generated, and no post-processing was done except to strip extra spaces. Llama 2 is an open-source model with approximately 70 billion parameters. The model was run using resources from the Center for Computation and Visualization at Brown University. The model was quantized so it could fit onto two GPUs. A number of different prompts for each model were tested, but good performance was achieved with simple prompts containing only the study examples, and the choice of prompt did not qualitatively change the pattern of results across conditions.
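
For illustration, a query to GPT-3.5 with the settings reported above might look like the sketch below (the prompt format, the example cues, and the use of the OpenAI Python client are our assumptions; the exact prompts and evaluation code are not reproduced here):

from openai import OpenAI  # assumes the OpenAI Python client is installed and configured

# Hypothetical study examples and query in a text-based format; the real
# prompts contained the 9 study cues from one condition of the task.
study = [("blue elephant", "3 1"), ("blue alligator", "3 2")]
query = "red alligator"
prompt = "\n".join(f"{cue} : {loc}" for cue, loc in study) + f"\n{query} :"

client = OpenAI()
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    temperature=0.1,    # settings reported in the text
    max_tokens=7,
)
answer = response.choices[0].text.strip()   # strip extra spaces; no other post-processing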

4.3. Metalearning

We adopted a metalearning framework to induce an ICL algorithm to emerge within the activation dynamics of a neural network by training it on a distribution of tasks [47, 64, 53]. These task distributions encouraged the resulting ICL algorithm to have a preference for related trials to be blocked over time, and a tendency to generalize compositionally.

For the category-learning experiments, we trained our networks on a distribution of tasks with the same basic structure described above. Each individual task was sampled as follows: 2 feature dimensions were sampled uniformly without replacement from a set of 200 unique dimensions. Each of these dimensions had 8 possible values, making 64 possible items in the newly sampled task. One of two possible category labels was randomly assigned to each of the two categories. In each new task, 16 items from each category were randomly chosen to be included in the set of 32 study examples. The queries seen during metalearning could either be one of the 32 given in the context (“train”), or one of the remaining 32 (“test”). In our main experiments, all samples in the metalearning distribution used the rule-like task and the blocked condition. The network metalearned on 12,000 tasks sampled in this way, and was subsequently validated on a held-out set of 100 tasks with combinations of dimensions that had not been seen during training. A further 10 held-out tasks were used for testing and finetuning. During metalearning in the category setting, networks were trained for 20 epochs with the Adam optimizer [106], a learning rate of 0.0001, and a batch size of 256.
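
One way to sample a single episode from this distribution is sketched below (Python; the dimension and value naming, and the choice of rule boundary, are placeholders consistent with the description above rather than the exact generation code):

import random

N_DIMS, N_VALUES, N_PER_CATEGORY = 200, 8, 16

def sample_metalearning_episode(rng=random):
    # Two of the 200 unique feature dimensions define this episode's items.
    d1, d2 = rng.sample(range(N_DIMS), 2)
    label_a, label_b = rng.sample(["A", "B"], 2)    # random assignment of the two labels

    def make_example(v1, v2):
        item = (f"dim{d1}-{v1}", f"dim{d2}-{v2}")
        category = label_a if v1 < N_VALUES // 2 else label_b   # rule-like: dimension 1 decides
        return item, category

    examples = [make_example(v1, v2) for v1 in range(N_VALUES) for v2 in range(N_VALUES)]  # 64 items
    cat_a = [e for e in examples if e[1] == label_a]
    cat_b = [e for e in examples if e[1] == label_b]
    study = rng.sample(cat_a, N_PER_CATEGORY) + rng.sample(cat_b, N_PER_CATEGORY)  # blocked context
    query = rng.choice(examples)    # the query may be a study ("train") or held-out ("test") item
    return study, query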

We also constructed a metalearning distribution based on the design of the compositional task [19]. Again, each individual task in this distribution had the same structure as the compositional task presented above. The tasks were sampled as follows: First, the orders of the lists of five colors and five animals were shuffled, determining their corresponding orders in the 5×5 grid of locations. Then, the two features were randomly assigned to the x- and the y-coordinates (color = x and animal = y, or vice versa). In the rotated condition, this 5×5 grid was rotated by 45 degrees and scaled so that each coordinate of each cue landed on an integer. As in the category-learning setting, all samples in the metalearning distribution were rule-like and blocked. We again generated 12,000 tasks for metalearning, and used 100 held-out tasks with different 5×5 grids for validation. A further 10 held-out tasks were used for testing and finetuning. During metalearning in the compositional task setting, networks were trained for 500 epochs with the Adam optimizer [106], a learning rate of 0.001, and a batch size of 256.
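
Analogously, sampling one compositional task from this distribution might look like the sketch below (Python; the color and animal lists are placeholders, and the integer scaling of the rotation follows the same illustrative convention as before):

import random

def sample_compositional_metatask(colors, animals, rotated=False, rng=random):
    colors = rng.sample(colors, len(colors))      # shuffle the order of colors on the grid
    animals = rng.sample(animals, len(animals))   # shuffle the order of animals on the grid
    color_is_x = rng.random() < 0.5               # color = x and animal = y, or vice versa
    locations = {}
    for i, color in enumerate(colors):
        for j, animal in enumerate(animals):
            x, y = (i, j) if color_is_x else (j, i)
            if rotated:                           # 45-degree rotation, scaled to integers
                x, y = x + y, x - y
            locations[(color, animal)] = (x, y)
    return locations

# e.g., sample_compositional_metatask(["blue", "red", "green", "purple", "yellow"],
#                                     ["alligator", "bear", "camel", "dog", "elephant"])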

4.4. Finetuning

Once the network acquired an ICL algorithm through metalearning, it was subsequently evaluated on its ability to learn new unseen tasks from each condition. This evaluation was conducted in two ways. In the few-shot evaluation, the weights of the network were frozen, ensuring that all learning was due to ICL on the study examples given in context. In the finetuning evaluation, the model’s weights were not frozen, and any errors made were backpropagated to update weights. During finetuning, the model only received feedback on the study examples, thus emulating the experience of the human participants [19]. Note that this is unlike the metalearning phase, when the model learned how to generalize to queries not included in the study examples. This second IWL learning phase that the model underwent can be understood as ‘finetuning’ the model on a specific task, while the metalearning can be understood as ‘pretraining.’

During the finetuning phase, samples were either blocked or interleaved in two distinct but congruent ways. In the blocked condition, related items were blocked over the context, but they were also blocked over the gradient steps (i.e., the model was finetuned for N gradient steps on samples containing queries from one stimulus group, then was finetuned for N gradient steps on samples containing queries from the other group, and so on). Likewise, in the interleaving condition, items from each group were interleaved both over the context and over the gradient steps. In the main experiments, the curriculum condition was always consistent during finetuning—related items were either blocked over both the context and the gradient steps, or interleaved over both the context and the gradient steps. However, for the sake of completeness we experimented with all combinations and report these results in the Appendix.
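A minimal sketch of how finetuning samples could be ordered over gradient steps in the two curricula is given below; the alternation scheme and sampling with replacement within each block are illustrative assumptions, not a description of our exact batching code.

```python
import random

def order_finetuning_samples(group_a, group_b, curriculum,
                             steps_per_block, n_blocks, rng=random):
    if curriculum == "blocked":
        # Alternate blocks of gradient steps between the two stimulus groups.
        ordered = []
        for block in range(n_blocks):
            group = group_a if block % 2 == 0 else group_b
            ordered += [rng.choice(group) for _ in range(steps_per_block)]
        return ordered
    if curriculum == "interleaved":
        # Randomly mix samples from both groups over all gradient steps.
        pool = group_a + group_b
        return [rng.choice(pool) for _ in range(steps_per_block * n_blocks)]
    raise ValueError(f"unknown curriculum: {curriculum}")
```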

Acknowledgments

We would like to thank all members of the Language Understanding and Representation Lab and the Laboratory of Neural Computation and Cognition at Brown University, as well as the Analogy and Attentional Control groups for helpful discussions. MJF was supported by ONR grant N00014-23-1-2792. EP and JR were supported by NIH NIGMS COBRE grant #5P20GM103645-10.

A. Extended Methods

A.1. Aligned and misaligned curricula in compositional task

In addition to the blocked and interleaved curricula presented in the main text, the experimenters who designed the original task [19] also tested humans on two additional curriculum conditions (see Figure 5). In the aligned condition, cues from the middle row of the grid were presented first, followed by cues from the middle column (or vice versa). This condition was equivalent to the blocked condition, except that the row and column of the grid used for training were always the middle row and middle column. In the misaligned condition, cues from each of the two diagonals of the grid were presented one after another. All experiments reported in the main text only used the blocked and interleaved conditions, but we also assessed our metalearning networks and the large language models (LLMs) on the aligned and misaligned conditions (see Additional Results below).
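For reference, the study-example locations in these two conditions can be written down directly; the sketch below assumes a 0-indexed 5×5 grid, which is our own coordinate convention.

```python
def study_cue_locations(condition, grid_size=5):
    mid = grid_size // 2
    if condition == "aligned":
        # The middle row, followed by the middle column (9 cues in total).
        return ([(x, mid) for x in range(grid_size)] +
                [(mid, y) for y in range(grid_size) if y != mid])
    if condition == "misaligned":
        # One diagonal, followed by the other (again 9 cues in total).
        return ([(i, i) for i in range(grid_size)] +
                [(i, grid_size - 1 - i) for i in range(grid_size)
                 if i != grid_size - 1 - i])
    raise ValueError(f"unknown condition: {condition}")
```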

Figure 5:

Aligned and misaligned curriculum conditions. In the original experiment [19], the compositional task also included two additional curriculum conditions. In the aligned condition, participants saw the entire middle row of the grid before seeing the entire middle column (or vice versa). This condition is the same as the blocked condition, except that the row and column used for the study examples were constrained to be the middle row/column. In the misaligned condition, participants saw the cues from one of the diagonals of the grid, followed by cues from the other diagonal.

A.2. Text-based versions of the tasks

Text-based versions of both the category-learning task [5] and the compositional task [19] were developed for use with standard transformer architectures [90]. A space tokenizer was used in all metalearning experiments. In the category-learning task (see Figure 6a), each of the two dimensions of the stimuli (e.g., line orientation and line length) was coded with a separate set of 8 tokens (e.g., ‘length-1’, ‘length-2’, …; ‘orientation-1’, ‘orientation-2’, …), and each of the 2 category labels was coded with a separate token (‘A’ and ‘B’). Each of the 32 examples given to the model in context contained a stimulus item (i.e., a line with a particular length and orientation), paired with its associated category label via a colon. Examples were separated by a special ‘<sep>’ token (e.g., ‘length-6 orientation-3 : A <sep> length-7 orientation-2 : A …’). The query, consisting of a single item followed by a colon, was appended to this context of 32 study examples.
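A sketch of how such a sequence could be assembled is shown below; the helper names are hypothetical, and the spacing around the colon simply follows the example string above.

```python
def encode_item(item, label=None):
    # Render one stimulus as tokens, e.g. 'length-6 orientation-3 : A'.
    length, orientation = item
    tokens = [f"length-{length}", f"orientation-{orientation}", ":"]
    if label is not None:
        tokens.append(label)
    return " ".join(tokens)

def build_category_sequence(study_examples, query):
    # 32 labelled study examples joined by '<sep>', followed by the unlabelled query.
    context = " <sep> ".join(encode_item(item, label) for item, label in study_examples)
    return context + " <sep> " + encode_item(query)

# build_category_sequence([((6, 3), "A"), ((7, 2), "A")], (5, 1))
# -> 'length-6 orientation-3 : A <sep> length-7 orientation-2 : A <sep> length-5 orientation-1 :'
```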

In the compositional task (see Figure 6b), each of the two dimensions of the stimuli (i.e., the color and animal) was again coded with a separate set of 5 tokens. In the metalearning experiments, arbitrary strings (e.g., ‘color-1’, ‘color-2’) were used, but in the experiments with LLMs, we used a set of common English words for the colors (e.g., ‘red’, ‘green’, etc.) and the animals (‘bear’, ‘elephant’, etc.). The x and y coordinates of the locations corresponding to each cue were coded with natural numbers (e.g., ‘1’, ‘2’, …). Each of the 9 study examples contained a cue paired with its location; examples were again separated by a special token in the metalearning experiments (‘<sep>’), but were separated by a semicolon in the LLM experiments.
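An LLM prompt for the compositional task might then look like the following; the exact separators and spacing are illustrative assumptions rather than the verbatim prompt format.

```python
def build_llm_prompt(study_examples, query_cue):
    # Each study example pairs a (color, animal) cue with its (x, y) location;
    # examples are separated by semicolons, and the query cue is left unanswered.
    parts = [f"{color} {animal} : {x} {y}"
             for (color, animal), (x, y) in study_examples]
    color, animal = query_cue
    return " ; ".join(parts) + f" ; {color} {animal} :"

# build_llm_prompt([(("red", "bear"), (1, 3)), (("green", "bear"), (2, 3))],
#                  ("red", "elephant"))
# -> 'red bear : 1 3 ; green bear : 2 3 ; red elephant :'
```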

A.3. Metalearning details

Metalearning networks were trained on distributions of tasks and evaluated on tasks that were not seen during training. In both the metalearning and the finetuning phases, networks received text-based inputs like the ones described above and predicted the outputs corresponding to the test query (i.e., the 1-token category label following the test query in the category-learning task, or the 2-token reward location following the test query in the compositional task). Cross-entropy loss was computed at only these specific output tokens.
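Restricting the loss to the answer positions can be implemented with a simple mask, as in the generic PyTorch sketch below (not our exact training code).

```python
import torch
import torch.nn.functional as F

def masked_output_loss(logits, targets, output_mask):
    # logits:      (batch, seq_len, vocab) model outputs
    # targets:     (batch, seq_len) target token ids, aligned with the predictions
    # output_mask: (batch, seq_len) True only at the 1 (category label) or
    #              2 (reward location) tokens that answer the test query
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * output_mask).sum() / output_mask.sum()
```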

Figure 6:

Text-based versions of the category-learning task (a) and the compositional task (b). In the metalearning experiments, arbitrary strings were used for both tasks (not shown for the compositional task), while in the LLM experiments on the compositional task, common English colors and animals were used. Note that the study examples and query appear on new lines in the figure for illustration purposes, but in the experiments, examples were only separated by ‘<sep>’ or by semicolons.

In both task settings, 10 different runs of metalearning were performed with different random initializations. Unless otherwise specified (see Additional Experiments below), metalearning was always done with task distributions that had rule-like structure and blocked curricula. After metalearning was complete, models were evaluated in the few-shot setting on 100 validation tasks and were finetuned on 10 test tasks from each of the four conditions (Rule-like and Blocked, Rule-like and Interleaved, Rotated and Blocked, Rotated and Interleaved). Particular seeds were excluded if they did not pass a 90% accuracy threshold in the few-shot evaluation, because our main questions of interest pertained to models that had acquired a competence in learning in-context. In the IWL-only experiments (see Figure 2df and Figure 3df in the main text), the metalearning step was skipped, and 10 different randomly initialized networks were finetuned on 10 test tasks from each condition in the same way.

B. Statistical testing

B.1. IWL exhibits an interleaving advantage on the category task

We hypothesized that simple in-weight learning on the category-learning task would show an interleaving advantage, regardless of rotation. To test this hypothesis, we trained networks from scratch in each of the four conditions (Rule-like and Blocked, Rule-like and Interleaved, Rotated and Blocked, Rotated and Interleaved) using the same setup used for finetuning networks after metalearning. We analyzed the results using a generalized linear model (GLM) with a quasi-binomial family and the logit link function. This model predicted the train accuracy (successes vs. failures) across the four experimental conditions from the Rotation condition (Rule-like vs. Rotated), the Curriculum condition (Blocked vs. Interleaved), and their interaction (see Figure 2e in the main text). The results revealed a significant main effect of Curriculum (χ2 = 786.88, p < 2.2e−16) and a significant interaction between Rotation and Curriculum (χ2 = 31.18, p = 2.349e−08). The main effect of Rotation was not significant (χ2 = 0.0, p = 1.0). The significant interaction was due to a small but statistically significant difference in the final interleaved accuracy between the two rotation conditions, indicating that 100% accuracy was slightly harder to achieve on the rotated version of the task (see Table 1). We therefore tested the simple main effects of Curriculum in each of the Rotation conditions, and found a significant interleaving advantage in both the Rule-like (χ2 = 7e+16, p < 2.2e−16) and Rotated (χ2 = 393, p < 2.2e−16) conditions. Overall, these findings confirm our hypothesis that in the absence of in-context learning, in-weight learning shows an interleaving advantage that is largely robust to the rotation condition.
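These quasi-binomial GLMs can be approximated in Python with statsmodels, as in the sketch below. The counts are hypothetical per-run values (one row per seed and condition) rather than our data, and estimating the dispersion from the Pearson chi-square statistic (scale="X2") only mimics the quasi-binomial family; the χ2 values reported in the text come from analysis-of-deviance model comparisons, which this sketch does not perform.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrix

# Hypothetical per-run success/failure counts (one row per seed and condition).
df = pd.DataFrame({
    "rotation":   ["rule", "rule", "rotated", "rotated"] * 2,
    "curriculum": ["blocked", "interleaved"] * 4,
    "successes":  [160, 319, 160, 313, 158, 318, 162, 314],
    "failures":   [160,   1, 160,   7, 162,   2, 158,   6],
})

# Design matrix: Rotation, Curriculum, and their interaction (treatment coding).
X = dmatrix("rotation * curriculum", data=df, return_type="dataframe")
y = np.column_stack([df["successes"], df["failures"]])

# Binomial GLM with a logit link; scale="X2" estimates the dispersion from the
# Pearson chi-square statistic, approximating a quasi-binomial model.
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit(scale="X2")
print(fit.summary())
```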

Table 1:

IWL interleaving advantage on category-learning task.

Rotation Curriculum Successes Failures Accuracy
Rule-like Blocked 1600 1600 50.0%
Rule-like Interleaved 3200 0 100.0%
Rotated Blocked 1600 1600 50.0%
Rotated Interleaved 3134 66 97.9%

B.2. ICL exhibits a blocking advantage on the category task

To test the hypothesis that ICL would show a blocking advantage on the rule-like version of the category-learning task, we evaluated the few-shot test accuracy of metalearned networks on all four conditions (see Table 2 and Figure 2g in the main text). We again analyzed results using a GLM with a quasi-binomial family and logit link function. This test revealed a significant interaction between Rotation and Curriculum (χ2 = 4.91, p = 0.027) and a significant main effect of Rotation (χ2 = 29.86, p = 4.65e−08), but no significant main effect of Curriculum (χ2 = 1.16, p = 0.28). Due to the presence of a significant interaction, we tested the simple main effect of Curriculum in the Rule-like condition, which showed a significant blocking advantage (χ2 = 5.51, p = 0.019), confirming our hypothesis.

Table 2:

ICL blocking advantage on category-learning task.

Rotation Curriculum Successes Failures Accuracy
Rule-like Blocked 3172 28 99.1%
Rule-like Interleaved 2844 356 88.9%
Rotated Blocked 2269 931 70.9%
Rotated Interleaved 2009 1191 62.8%

B.3. Concurrent ICL and IWL reproduce both curriculum effects on the category task

Our main hypothesis was that a single network capable of both ICL and IWL would reproduce the interaction between Rotation and Curriculum observed in humans in the category-learning setting [5], with ICL producing a blocking advantage in the Rule-like task and IWL producing an interleaving advantage in the Rotated task. We evaluated metalearned networks’ few-shot test accuracy on the Rule-like task (see Figure 2g in the main text) and their finetuned train accuracy on the Rotated task (see Figure 2h in the main text), and performed the same GLM analysis (see Table 3). We observed significant main effects of both Rotation (χ2 = 52.23, p = 2.96e−13) and Curriculum (χ2 = 60.58, p = 7.07e−15), as well as a significant interaction (χ2 = 52.19, p = 5.03e−13). Analyses of the simple main effects revealed a significant blocking advantage in the Rule-like task (same as ICL result above; χ2 = 5.51, p = 0.019), and a significant interleaving advantage in the Rotated task (χ2 = 911, p < 2.2e−16). These analyses confirmed our hypothesis that it is possible for a single neural network model to reproduce the blocking advantage observed in humans on the Rule-like task (due to ICL) and the interleaving advantage observed in humans on the Rotated task (due to IWL).

Table 3:

Concurrent ICL and IWL on category-learning task.

Rotation Curriculum Successes Failures Accuracy
Rule-like Blocked 3172 28 99.1%
Rule-like Interleaved 2844 356 88.9%
Rotated Blocked 1976 1224 61.8%
Rotated Interleaved 3200 0 100.0%

B.4. IWL exhibits an interleaving advantage on the compositional task

The same analyses described above were performed for models in the compositional setting [19], where we hypothesized that the same curriculum effects would emerge for ICL and IWL on the Rule-like and Rotated versions of the task (see Figure 3 in the main text). To test our hypothesis that IWL would by default exhibit an interleaving advantage, we again trained networks from scratch on the task and evaluated their training accuracy in each of the four conditions (see Figure 3e in the main text). Again a GLM (quasi-binomial family, logit link function) revealed a main effect of Curriculum (χ2 = 11510.0, p < 2e−16), where IWL exhibited an interleaving advantage regardless of Rotation (see Table 4). A main effect of Rotation was also observed (χ2 = 103.9, p < 2e−16), but there was no interaction (χ2 = 0.0, p = 1).

Table 4:

IWL interleaving advantage on compositional task.

Rotation Curriculum Successes Failures Accuracy
Rule-like Blocked 361 449 44.6%
Rule-like Interleaved 810 0 100.0%
Rotated Blocked 411 399 50.7%
Rotated Interleaved 810 0 100.0%

B.5. ICL exhibits a blocking advantage on the compositional task

The same GLM analyses were performed to assess whether ICL performed better on the Rule-like version of the compositional task, and whether it exhibited a blocking advantage in the Rule-like task (see Figure 3g in the main text and Table 5 below). A model predicting performance on the task from Rotation, Curriculum, and their interaction showed a main effect of Rotation (χ2 = 134.7, p < 2e−16), but no main effect of Curriculum (χ2 = 0.035, p = 0.85), and no interaction (χ2 = 1.33, p = 0.25). A follow-up analysis revealed a simple main effect of Curriculum in the Rule-like condition (χ2 = 40.6, p = 1.9e−10), with ICL performing better when trials were blocked compared to interleaved. These analyses confirmed our hypothesis that ICL would demonstrate better performance and a blocking advantage in the Rule-like condition.

Table 5:

ICL blocking advantage on compositional task.

Rotation Curriculum Successes Failures Accuracy
Rule-like Blocked 480 0 100.0%
Rule-like Interleaved 112 368 23.3%
Rotated Blocked 2 478 0.42%
Rotated Interleaved 1 479 0.21%

B.6. Concurrent ICL and IWL reproduce both curriculum effects on the compositional task

Again our main hypothesis about curriculum effects in the compositional task was that a network capable of both ICL and IWL would exhibit an interaction between Rotation and Curriculum, where ICL would produce a blocking advantage on the Rule-like task and IWL would produce an interleaving advantage on the Rotated task (see Figure 3gi in the main text and Table 6 below). This hypothesis was confirmed with an analysis equivalent to the one performed for the category-learning task: the same GLM showed a significant main effect of Rotation (χ2 = 24.7, p = 6.6e−07), a significant main effect of Curriculum (χ2 = 17.1, p = 3.5e−05), and a significant interaction (χ2 = 71.4, p < 2.2e−16). Follow-up analyses revealed a significant simple main effect of Curriculum in the Rule-like task (χ2 = 40.61, p = 1.9e−10), where ICL showed a blocking advantage, and a significant simple main effect of Curriculum in the Rotated task (χ2 = 726.58, p < 2.2e−16), where IWL showed an interleaving advantage. Overall, these statistical analyses confirmed our hypothesis that a neural network capable of both ICL and IWL would reproduce human-like curriculum effects on the compositional task [19].

Table 6:

Concurrent ICL and IWL on compositional task.

Rotation Curriculum Successes Failures Accuracy
Rule-like Blocked 480 0 100.0%
Rule-like Interleaved 112 368 23.3%
Rotated Blocked 170 100 63.0%
Rotated Interleaved 270 0 100.0%

B.7. ICL in LLMs performs better on rule-like task and shows blocking advantage

An alternative method for inducing ICL in neural network models is to train them to predict the next token on large corpora of text [48, 49]. We evaluated the ICL abilities of two LLMs trained in this way – Llama 2 [75] and GPT-3.5 [48, 76] – by testing them on the compositional task (see Figure 4 in the main text). Statistical tests analogous to those performed for the metalearning experiments were conducted to evaluate our hypotheses that LLMs would demonstrate better ICL performance when the task was rule-like, and when related trials were blocked rather than interleaved (see Table 7 below). A GLM (quasi-binomial family, logit link function) revealed a significant main effect of Rotation (χ2 = 428.6, p < 2.2e−16), a significant main effect of Curriculum (χ2 = 85.63, p < 2.2e−16), and a significant interaction (χ2 = 62.2, p = 2.0e−13). Further testing was done for each model individually. Llama 2 showed a main effect of Rotation (χ2 = 377.5, p < 2.2e−16), a main effect of Curriculum (χ2 = 86.2, p < 2.2e−16), and a significant interaction (χ2 = 60.8, p = 4.0e−13). In the rule-like condition, Llama 2 showed a significant simple main effect of Curriculum (χ2 = 10.7, p = 0.001), performing better when related trials were blocked compared to interleaved. We collected less data on GPT-3.5, but the results of statistical testing were qualitatively similar. GPT-3.5 showed a significant main effect of Rotation (χ2 = 115.0, p < 2.2e−16), but no significant main effect of Curriculum (χ2 = 1.74, p = 0.63) and no significant interaction (χ2 = 3.0, p = 0.39). Follow-up analyses revealed a marginal but not statistically significant simple main effect of Curriculum in the rule-like condition (χ2 = 2.6, p = 0.1). Overall, our LLM experiments showed that, as expected, ICL in these models performs better on tasks governed by rule-like structure and when related trials are blocked over time.

Table 7:

LLM results on compositional task.

Model Rotation Curriculum Successes Failures Accuracy
Llama 2 Rule-like Blocked 601 39 93.91%
Llama 2 Rule-like Interleaved 478 98 82.99%
Llama 2 Rotated Blocked 151 489 23.59%
Llama 2 Rotated Interleaved 141 435 24.48%
GPT-3.5 Rule-like Blocked 77 3 96.25%
GPT-3.5 Rule-like Interleaved 63 17 78.75%
GPT-3.5 Rotated Blocked 0 80 0.00%
GPT-3.5 Rotated Interleaved 1 79 1.25%

C. Additional results

C.1. Aligned and misaligned curricula in compositional task

As noted above, we also tested our metalearned models on the aligned and misaligned curriculum conditions (see Figure 5) from the original experiment [19]. These conditions manipulate not only the order in which examples are presented, but also which particular examples are used for training. In both of these conditions, the rules governing the task remain the same (e.g., color determines the x-coordinate and animal determines the y-coordinate), but it may be more difficult to extract these rules from the study examples given in the misaligned condition. For example, if given only study examples from one of the two diagonals, it would be impossible to infer whether color determines the x-coordinate and animal the y-coordinate or vice versa. This is because the cues on the diagonal of the latent grid each vary in both the color and animal dimensions (and therefore in both the x and y coordinates). In the aligned condition, on the other hand, one of the two features is held constant within a block, allowing the participant to observe how the locations change as a single feature changes.

In the human experiments [19], participants showed greater compositional generalization on the aligned condition compared to the misaligned condition. We therefore hypothesized that ICL would show the same effect in both our metalearned networks and in the LLMs. We tested networks that were metalearned in the same setup described for our main experiments on the aligned and misaligned conditions. Consistent with our hypothesis, ICL in these networks exhibited better compositional generalization performance in the aligned condition than in the misaligned condition (see Figure 7). The LLMs also showed better generalization in the aligned condition compared to the misaligned condition (see Figure 8). Taken together, these results suggest that humans and neural networks capable of ICL succeed at generalizing compositionally in similar curriculum conditions.

Figure 7:

In-context learning results for metalearned networks on all four curriculum conditions used in the original human experiments [19]. In the main text we report results for the blocked and interleaved conditions due to space considerations and because that was our main contrast of interest. However, the same metalearning networks trained on the same distribution of tasks also produce the effects observed in human participants in the aligned and misaligned conditions, generalizing better in the aligned compared to the misaligned condition. The networks perform very poorly on the rotated version of the task, regardless of curriculum (not tested in the original human study), consistent with our hypotheses.

Figure 8:

ICL performance by LLMs on all four curriculum conditions used in the original human experiments [19]. Consistent with the human data, both LLMs generalize better in the rule-like task (blue) in the aligned condition than in the misaligned condition, and better in the blocked condition than in the interleaved condition. ICL performance was poor in the rotated task (red), regardless of curriculum condition.

C.2. Metalearning on rotated task

What we have called the Rule-like versions of our tasks are only “rule-like” in the sense that the rules governing the task are simpler to discover because the features determining the correct answers (e.g., line orientation and length, or color and animal identities) are intuitive to humans. The rotated versions of the two tasks are more difficult because the rules of the task are only succinctly describable in rotated feature spaces that are unintuitive. Likewise, the rule-like versions of the tasks are easier for the metalearning networks to solve in-context because the appropriate features are familiar from the task distributions on which they metalearned. If instead the networks were more familiar with these rotated feature spaces, we would expect in-context learning to be easier on tasks whose rules utilize similar feature spaces. To test this hypothesis, we trained metalearning networks on a distribution of rotated tasks in the compositional setting (see Figure 9). As expected, the network showed the opposite pattern of curriculum effects, exhibiting a blocking advantage in the rotated task rather than the rule-like task. The network also suffered from catastrophic forgetting in the Rule-like task rather than the Rotated task. Interestingly, catastrophic forgetting was not as severe as in the network trained on a distribution of rule-like tasks and finetuned on a rotated task. This meant that by the fourth block, the interleaving advantage observed in earlier blocks had disappeared. This may suggest that metalearning on the rotated version of the task equipped the network with an inductive bias to learn strategies that suffer from less interference when related trials are blocked over time.

In general, this experiment is consistent with our hypothesis that the blocking advantage should emerge on tasks where the properties of ICL dominate, and the interleaving advantage should emerge on tasks where properties of IWL dominate due to catastrophic forgetting. However, the finetuning results suggest that catastrophic forgetting in IWL can be more or less severe depending on the inductive biases imparted by metalearning.

Figure 9:

Results for a metalearning network trained on a distribution of rotated tasks. This network acquires the ability to solve new tasks in-context, but shows the opposite curriculum effects as the networks trained on a distribution of rule-like tasks (compare to Figure 3gi in the main text). In the few-shot evaluation, the network shows good performance on the rotated version of the task (red) and poor performance on the rule-like task (blue). Importantly, a blocking advantage was observed in the rotated, rather than the rule-like setting, consistent with our hypothesis that the blocking advantage emerges on tasks where ICL is possible. Finetuning results show that IWL promotes catastrophic forgetting on the rule-like version of the task, rather than the rotated task. Interestingly, this catastrophic forgetting was not as severe compared to the network trained on the rule-like task and finetuned on the rotated task (see Figure 3i in the main text), suggesting that the network was able to learn a strategy in the first block that did not interfere as much with the learning in the second block.

C.3. Blocking over context vs. gradient steps

The curriculum conditions in our experiments manipulate the order in which examples are presented over time. However, transformers afford multiple notions of time: they are given an entire sequence (which itself can be indexed by time/position) as input and are trained by backpropagating errors to incrementally update their weights for multiple gradient steps over training time. We can therefore differentiate whether related trials are blocked over the context (i.e., related study examples are blocked over the context within a given input sequence) or blocked over the gradient steps (i.e., sequences containing related test queries are used to compute and backpropagate losses in a blocked fashion over the gradient steps). In our main experiments, we collapse these two notions by maintaining congruence between them: in the “blocked” condition, related trials were blocked over both the context and the gradient steps, and in the “interleaved” condition, trials were randomly interleaved over both the context and the gradient steps. These congruent cases are more realistic when compared with the human situation, where recently seen items that might be held in working memory (and therefore supply a context for the current trial) are the same ones that have recently been used to update synaptic weights (assuming that synapses are updating in a relatively continuous manner over time).

For completeness and to facilitate understanding of our main results, we also performed experiments where these two notions of time are incongruent (i.e., blocked over the context and interleaved over the gradient steps, or interleaved over the context and blocked over the gradient steps). Here, we report such results from the same metalearning model trained on the compositional task (see Figure 3gi in the main text). After metalearning, the model was finetuned in the usual way but with these extra two incongruent curriculum conditions.

The results are presented alongside the original (congruent) results in Figure 10. In general, the results show that the key dynamics motivating our original theoretical framework hold across these incongruent conditions. In the metalearning model, ICL is successful when encountering new tasks that have familiar structure (in this case, rule-like tasks where related items are blocked over the context). This is shown by the high train (purple and orange) and test (green) accuracy, and the low loss (black) from the beginning of finetuning in the rule-like task when items are blocked over the context (middle column, top two plots). When ICL is successful, compositional generalization performance is good, and little loss is incurred and backpropagated, resulting in less IWL. This can be seen by the lack of catastrophic forgetting in these same two plots. On the other hand, when ICL is unsuccessful (in this case, when the task is rotated or when items are interleaved over the context), compositional generalization is poor, and large losses are backpropagated, resulting in increased IWL. This can be seen in the loss curves (black) in the rotated task (right column), and in the rule-like task when items are interleaved over the context (middle column, bottom two plots). This increased IWL results in catastrophic forgetting when related trials are blocked over the gradient steps. This can be seen in the sharp drops in accuracy on the items trained in the previous block (e.g., TrainA accuracy, shown in purple, drops during the second block as TrainB items are learned) in the rule-like task when items are interleaved over the context but blocked over the steps (middle column, third plot) or in the rotated task when items are blocked over the gradient steps (right column, first and third plots). When trials are randomly interleaved over the gradient steps, no catastrophic forgetting occurs, even in cases where ICL is unsuccessful. This can be seen in the rule-like task when trials are interleaved over both the context and the gradient steps (middle column, fourth plot), and in the rotated task when items are interleaved over the gradient steps (right column, second and fourth plots).

Figure 10:

Results of experiments investigating effects of blocking over the study examples given in the context vs. blocking over the gradient steps during finetuning. The finetuning results from Figure 3hi in the main text report the main cases of interest where trials were either blocked over both context and gradient steps or interleaved over both context and gradient steps. Here, we show these same results alongside the cases where the two are mismatched. When trials are blocked over the context but interleaved over the steps (second row), the model shows good compositional generalization in the rule-like task and does not suffer from catastrophic forgetting. When trials are interleaved over the context but blocked over the steps (third row), the model shows poor compositional generalization and suffers from catastrophic forgetting. These results demonstrate how this model generalizes compositionally only when trials are blocked over the context in the rule-like task, and how catastrophic forgetting occurs whenever large losses are incurred (inducing IWL) and trials are blocked over the gradient steps.

To summarize, these results show that compositional generalization performance is high when ICL is possible because the model is familiar with the structure of the task given in context (i.e. when the task is rule-like and related items are blocked over the context), and catastrophic forgetting happens when large losses are incurred (resulting in increased IWL) and related items are blocked over the gradient steps. Our theoretical framework assumes that in humans, both contextual information and synaptic weights are updated more-or-less continuously throughout learning in the experimental tasks we model. This implies that the distinction between these two notions of time, and the corresponding distinction between blocking/interleaving over the context vs. over the gradient steps, would collapse into a single notion.

Footnotes

1. In principle, these models should also show IWL properties like any other neural network, but it is highly expensive to finetune them, and our main questions here pertain to ICL.

2. Like our metalearned neural networks, the LLMs also showed the full pattern of curriculum effects described in the human study; see the Appendix for details.

Contributor Information

Jacob Russin, Department of Computer Science, Department of Cognitive and Psychological Sciences, Brown University.

Ellie Pavlick, Department of Computer Science, Brown University.

Michael J. Frank, Department of Cognitive and Psychological Sciences, Carney Institute for Brain Science, Brown University.

References

  • [1].Gregory Ashby F. and Todd Maddox W.. Human category learning 2.0. Annals of the New York Academy of Sciences, 1224(1):147–161, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Botvinick Matthew, Ritter Sam, Wang Jane X., Zeb Kurth-Nelson Charles Blundell, and Hassabis Demis. Reinforcement Learning, Fast and Slow. Trends in Cognitive Sciences, 23(5):408–422, May 2019. [DOI] [PubMed] [Google Scholar]
  • [3].Daw Nathaniel D., Niv Yael, and Dayan Peter. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12):1704–1711, November 2005. [DOI] [PubMed] [Google Scholar]
  • [4].Jonathan St B. T. Evans. Dual-processing accounts of reasoning, judgment, and social cognition. Annual Review of Psychology, 59:255–278, 2008. [DOI] [PubMed] [Google Scholar]
  • [5].Noh Sharon M., Yan Veronica X., Bjork Robert A., and Todd Maddox W.. Optimal sequencing during category learning: Testing a dual-learning systems perspective. Cognition, 155:23–29, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].O’Reilly Randall C., Nair Ananta, Russin Jacob L., and Herd Seth A.. How Sequential Interactive Processing Within Frontostriatal Loops Supports a Continuum of Habitual to Controlled Processing. Frontiers in Psychology, 11, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Mathias Sablé-Meyer, Lucas Benjamin, Watkins Cassandra Potier, He Chenxi, Roumi Fosca Al, and Dehaene. Two brain systems for the perception of geometric shapes, March 2024.
  • [8].Sloman Steven A.. The empirical case for two systems of reasoning. Psychological Bulletin, 119(1):3–22, 1996. [Google Scholar]
  • [9].Collins Anne G. E. and Frank Michael J.. Within- and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory. Proceedings of the National Academy of Sciences, 115(10):2502–2507, March 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Collins Anne G. E. and Frank Michael J.. Cognitive control over learning: Creating, clustering, and generalizing task-set structure. Psychological Review, 120(1):190–229, January 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [11].Kriete T., Noelle D. C., Cohen J. D., and O’Reilly R. C.. Indirection and symbol-like processing in the prefrontal cortex and basal ganglia. Proceedings of the National Academy of Sciences, 110(41):16390–16395, October 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Lake Brenden M., Ullman Tomer D., Tenenbaum Joshua B., and Gershman Samuel J.. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017./ed. [DOI] [PubMed] [Google Scholar]
  • [13].O’Reilly R. C. and Frank Michael J.. Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Neural Computation, 18(2):283–328, 2006. [DOI] [PubMed] [Google Scholar]
  • [14].Rougier Nicolas P., Noelle David C., Braver Todd S., Cohen Jonathan D., and O’Reilly Randall C.. Prefrontal cortex and flexible cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences, 102(20):7338–7343, May 2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Flesch Timo, Balaguer Jan, Dekker Ronald, Nili Hamed, and Summerfield Christopher. Comparing continual task learning in minds and machines. Proceedings of the National Academy of Sciences, 115(44):E10313–E10322, October 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Pesnot Lerousseau Jacques and Summerfield Christopher. Space as a scaffold for rotational generalisation of abstract concepts. eLife, 13:RP93636, April 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Lake Brenden, Salakhutdinov Ruslan, Gross Jason, and Tenenbaum Joshua. One shot learning of simple visual concepts. Proceedings of the Annual Meeting of the Cognitive Science Society, 33(33), 2011. [Google Scholar]
  • [18].Lake Brenden M., Salakhutdinov Ruslan, and Tenenbaum Joshua B.. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015. [DOI] [PubMed] [Google Scholar]
  • [19].Dekker Ronald B., Otto Fabian, and Summerfield Christopher. Curriculum learning for human compositional generalization. Proceedings of the National Academy of Sciences, 119(41):e2205582119, October 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Fodor Jerry A. and Pylyshyn Zenon W.. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1–2):3–71, March 1988. [DOI] [PubMed] [Google Scholar]
  • [21].Franklin Nicholas T. and Frank Michael J.. Compositional clustering in task structure learning. PLOS Computational Biology, 14(4):e1006116, April 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [22].Franklin Nicholas T. and Frank Michael J.. Generalizing to generalize: Humans flexibly switch between compositional and conjunctive structures during reinforcement learning. PLOS Computational Biology, 16(4):e1007720, April 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Frankland Steven M. and Greene Joshua D.. Concepts and Compositionality: In Search of the Brain’s Language of Thought. Annual Review of Psychology, 71(1):273–303, 2020. [DOI] [PubMed] [Google Scholar]
  • [24].Lake Brenden M., Linzen Tal, and Baroni Marco. Human few-shot learning of compositional instructions. In Goel Ashok K., Seifert Colleen M., and Freksa Christian, editors, Proceedings of the 41th Annual Meeting of the Cognitive Science Society, CogSci 2019: Creativity + Cognition + Computation, Montreal, Canada, July 24–27, 2019, pages 611–617. cognitivesciencesociety.org, 2019. [Google Scholar]
  • [25].Liu Rex G. and Frank Michael J.. Hierarchical clustering optimizes the tradeoff between compositionality and expressivity of task structures for flexible reinforcement learning. Artificial Intelligence, 312:103770, November 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Russin Jacob, Sam Whitman McGrath Ellie Pavlick, and Frank Michael J.. Is human compositionality metalearned? Commentary in Behavioral and Brain Sciences (forthcoming), 2024. [DOI] [PubMed] [Google Scholar]
  • [27].Schwartenbeck Philipp, Baram Alon, Liu Yunzhe, Mark Shirley, Muller Timothy, Dolan Raymond, Botvinick Matthew, Kurth-Nelson Zeb, and Behrens Timothy. Generative replay underlies compositional inference in the hippocampal-prefrontal circuit. Cell, 186(22):4885–4897.e14, October 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [28].Beukers Andre O., Collin Silvy H. P., Kempner Ross P., Franklin Nicholas T., Gershman Samuel J., and Norman Kenneth A.. Blocked training facilitates learning of multiple schemas. Communications Psychology, 2(1):1–17, April 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [29].Frank Michael J. and Badre David. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: Computational analysis. Cerebral Cortex (New York, N.Y.: 1991), 22(3):509–526, March 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [30].Richland Lindsey E., Finley Jason R., and Bjork Robert A.. Differentiating the Contextual Interference Effect from the Spacing Effect. Proceedings of the Annual Meeting of the Cognitive Science Society, 26(26), 2004. [Google Scholar]
  • [31].Goode Sinah and Magill Richard A.. Contextual Interference Effects in Learning Three Badminton Serves. Research Quarterly for Exercise and Sport, 57(4):308–314, December 1986. [Google Scholar]
  • [32].Landin Dennis K., Hebert Edward P., and Fairweather Malcolm. The Effects of Variable Practice on the Performance of a Basketball Skill. Research Quarterly for Exercise and Sport, 64(2):232–237, June 1993. [DOI] [PubMed] [Google Scholar]
  • [33].Kahneman Daniel. Thinking, Fast and Slow. Farrar, Straus and Giroux, 1 edition edition, October 2011. [Google Scholar]
  • [34].Rumelhart David E., Hinton Geoffrey E., and Williams Ronald J.. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986. [Google Scholar]
  • [35].Rumelhart David E., McClelland James L., and PDP Research Group, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. MIT Press, Cambridge, MA, USA, 1986. [Google Scholar]
  • [36].Russin Jacob, Zolfaghar Maryam, Park Seongmin A., Boorman Erie, and O’Reilly Randall C.. A Neural Network Model of Continual Learning with Cognitive Control. In Proceedings for the 44th Annual Meeting of the Cognitive Science Society, February 2022. [PMC free article] [PubMed] [Google Scholar]
  • [37].McCloskey M. and Cohen N. J.. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Bower G. H., editor, The Psychology of Learning and Motivation , Vol. 24, pages 109–164. Academic Press, San Diego, CA, January 1989. [Google Scholar]
  • [38].McClelland J. L., McNaughton B. L., and O’Reilly R. C.. Why There Are Complementary Learning Systems in the Hippocampus and Neocortex: Insights from the Successes and Failures of Connectionist Models of Learning and Memory. Psychological Review, 102(3):419–457, August 1995. [DOI] [PubMed] [Google Scholar]
  • [39].Lake Brenden M. and Baroni Marco. Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. In Dy Jennifer G.and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2879–2888. PMLR, 2018. [Google Scholar]
  • [40].Marcus Gary F.. Rethinking Eliminative Connectionism. Cognitive Psychology, 37(3):243–282, December 1998. [DOI] [PubMed] [Google Scholar]
  • [41].Marcus Gary. Deep learning: A critical appraisal, January 2018.
  • [42].Pinker Steven and Prince Alan. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28(1):73–193, March 1988. [DOI] [PubMed] [Google Scholar]
  • [43].Quilty-Dunn Jake, Porot Nicolas, and Mandelbaum Eric. The best game in town: The reemergence of the language-of-thought hypothesis across the cognitive sciences. Behavioral and Brain Sciences, 46:e261, January 2023. [DOI] [PubMed] [Google Scholar]
  • [44].Love Bradley C., Medin Douglas L., and Gureckis Todd M.. SUSTAIN: A Network Model of Category Learning. Psychological Review, 111(2):309–332, 2004. [DOI] [PubMed] [Google Scholar]
  • [45].Frank Michael J. and Claus Eric D.. Anatomy of a decision: Striato-orbitofrontal interactions in reinforcement learning, decision making, and reversal. Psychological Review, 113(2):300–326, April 2006. [DOI] [PubMed] [Google Scholar]
  • [46].Miller E. K. and Cohen J. D.. An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24:167–202, 2001. [DOI] [PubMed] [Google Scholar]
  • [47].Wang Jane X., Zeb Kurth-Nelson Dharshan Kumaran, Tirumala Dhruva, Soyer Hubert, Leibo Joel Z., Hassabis Demis, and Botvinick Matthew. Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience, 21(6):860–868, June 2018. [DOI] [PubMed] [Google Scholar]
  • [48].Brown Tom B., Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, Agarwal Sandhini, Ariel Herbert-Voss Gretchen Krueger, Henighan Tom, Child Rewon, Ramesh Aditya, Ziegler Daniel M., Wu Jeffrey, Winter Clemens, Hesse Christopher, Chen Mark, Sigler Eric, Litwin Mateusz, Gray Scott, Chess Benjamin, Clark Jack, Berner Christopher, Sam McCandlish Alec Radford, Sutskever Ilya, and Amodei Dario. Language Models are Few-Shot Learners, May 2020.
  • [49].Bubeck Sébastien, Chandrasekaran Varun, Eldan Ronen, Gehrke Johannes, Horvitz Eric, Kamar Ece, Lee Peter, Lee Yin Tat, Li Yuanzhi, Lundberg Scott, Nori Harsha, Palangi Hamid, Ribeiro Marco Tulio, and Zhang Yi. Sparks of Artificial General Intelligence: Early experiments with GPT-4, March 2023.
  • [50].Saparov Abulhair, Pang Richard Yuanzhe, Padmakumar Vishakh, Joshi Nitish, Kazemi Seyed Mehran, Kim Najoung, and He He. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples, November 2023.
  • [51].Musker Sam, Duchnowski Alex, Millière Raphaël, and Pavlick Ellie. Semantic Structure-Mapping in LLM and Human Analogical Reasoning, June 2024.
  • [52].Webb Taylor, Holyoak Keith J., and Lu Hongjing. Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9):1526–1541, September 2023. [DOI] [PubMed] [Google Scholar]
  • [53].Lake Brenden M. and Baroni Marco. Human-like systematic generalization through a meta-learning neural network. Nature, 623(7985):115–121, November 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [54].Press Ofir, Zhang Muru, Min Sewon, Schmidt Ludwig, Smith Noah A., and Lewis Mike. Measuring and Narrowing the Compositionality Gap in Language Models, October 2023.
  • [55].Zhou Denny, Schärli Nathanael , Hou Le, Wei Jason, Scales Nathan, Wang Xuezhi, Schuurmans Dale, Cui Claire, Bousquet Olivier, Le Quoc V., and Chi Ed H.. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, September 2022. [Google Scholar]
  • [56].Chan Stephanie C. Y., Santoro Adam, Lampinen Andrew K., Wang Jane X., Singh Aaditya, Richemond Pierre H., McClelland Jay, and Hill Felix. Data Distributional Properties Drive Emergent In-Context Learning in Transformers, May 2022.
  • [57].Johannes von Oswald Eyvind Niklasson, Schlegel Maximilian, Kobayashi Seijin, Zucchet Nicolas, Scherrer Nino, Miller Nolan, Sandler Mark, Arcas Blaise Agüera y, Vladymyrov Max, Pascanu Razvan, and Sacramento João. Uncovering mesa-optimization algorithms in Transformers, September 2023.
  • [58].Xie Sang Michael, Raghunathan Aditi, Liang Percy, and Ma Tengyu. An Explanation of In-context Learning as Implicit Bayesian Inference, July 2022.
  • [59].Wang Jane, Zeb Kurth-Nelson Hubert Soyer, Leibo Joel, Tirumala Dhruva, Munos Remi, Blundell Charles, Kumaran Dharshan, and Botvinick Matt. Learning to reinforcement learn. Proceedings of the Annual Meeting of the Cognitive Science Society, 39(0), 2017. [Google Scholar]
  • [60].Anand Suraj, Lepori Michael A., Merullo Jack, and Pavlick Ellie. Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting, May 2024.
  • [61].Chan Stephanie C. Y., Dasgupta Ishita, Kim Junkyung, Kumaran Dharshan, Lampinen Andrew K., and Hill Felix. Transformers generalize differently from information stored in context vs in weights, October 2022.
  • [62].Reddy Gautam. The mechanistic basis of data dependence and abrupt learning in an in-context classification task, December 2023.
  • [63].Rachel Rac-Lubashevsky Anna Cremer, Collins Anne G. E., Frank Michael J., and Schwabe Lars. Neural Index of Reinforcement Learning Predicts Improved Stimulus–Response Retention under High Working Memory Load. Journal of Neuroscience, 43(17):3131–3143, April 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [64].Binz Marcel, Dasgupta Ishita, Jagadish Akshay K., Botvinick Matthew, Wang Jane X., and Schulz Eric. Meta-Learned Models of Cognition. Behavioral and Brain Sciences, pages 1–38, November 2023. [DOI] [PubMed] [Google Scholar]
  • [65].Finn Chelsea, Abbeel Pieter, and Levine Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 1126–1135, Sydney, NSW, Australia, August 2017. [Google Scholar]
  • [66].Santoro Adam, Bartunov Sergey, Botvinick Matthew, Wierstra Daan, and Lillicrap Timothy. Meta-Learning with Memory-Augmented Neural Networks, 2016.
  • [67].Russin Jacob, Sam Whitman McGrath Danielle Williams, and Elber-Dorozko Lotem. From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks. In Forthcoming. 2024. [Google Scholar]
  • [68].Keysers Daniel, Nathanael Schärli Nathan Scales, Buisman Hylke, Furrer Daniel, Kashubin Sergii, Momchev Nikola, Sinopalnikov Danila, Stafiniak Lukasz, Tihon Tibor, Tsarkov Dmitry, Wang Xiao, Zee Marc van, and Bousquet Olivier. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data. In International Conference on Learning Representations, September 2019. [Google Scholar]
  • [69].Kim Najoung and Linzen Tal. COGS: A Compositional Generalization Challenge Based on Semantic Interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105, Online, November 2020. Association for Computational Linguistics. [Google Scholar]
  • [70].Frank Michael C.. Bridging the data gap between children and large language models. Trends in Cognitive Sciences, 0(0), August 2023. [DOI] [PubMed] [Google Scholar]
  • [71].Linzen Tal. How Can We Accelerate Progress Towards Human-like Linguistic Generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online, 2020. Association for Computational Linguistics. [Google Scholar]
  • [72].Warstadt Alex, Mueller Aaron, Choshen Leshem, Wilcox Ethan, Zhuang Chengxu, Ciro Juan, Mosquera Rafael, Paranjabe Bhargavi, Williams Adina, Linzen Tal, and Cotterell Ryan. Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora. In Warstadt Alex, Mueller Aaron, Choshen Leshem, Wilcox Ethan, Zhuang Chengxu, Ciro Juan, Mosquera Rafael, Paranjabe Bhargavi, Williams Adina, Linzen Tal, and Cotterell Ryan, editors, Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1–34, Singapore, December 2023. Association for Computational Linguistics. [Google Scholar]
  • [73].Chen Xinyun, Chi Ryan A., Wang Xuezhi, and Zhou Denny. Premise Order Matters in Reasoning with Large Language Models, May 2024.
  • [74].Lu Yao, Bartolo Max, Moore Alastair, Riedel Sebastian, and Stenetorp Pontus. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Muresan Smaranda, Nakov Preslav, and Villavicencio Aline, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. [Google Scholar]
  • [75].Touvron Hugo, Martin Louis, Stone Kevin, Albert Peter, Almahairi Amjad, Babaei Yasmine, Bashlykov Nikolay, Batra Soumya, Bhargava Prajjwal, Bhosale Shruti, Bikel Dan, Blecher Lukas, Cristian Canton Ferrer Moya Chen, Cucurull Guillem, Esiobu David, Fernandes Jude, Fu Jeremy, Fu Wenyin, Fuller Brian, Gao Cynthia, Goswami Vedanuj, Goyal Naman, Hartshorn Anthony, Hosseini Saghar, Hou Rui, Inan Hakan, Kardas Marcin, Kerkez Viktor, Khabsa Madian, Kloumann Isabel, Korenev Artem, Punit Singh Koura Marie-Anne Lachaux, Lavril Thibaut, Lee Jenya, Liskovich Diana, Lu Yinghai, Mao Yuning, Martinet Xavier, Mihaylov Todor, Mishra Pushkar, Molybog Igor, Nie Yixin, Poulton Andrew, Reizenstein Jeremy, Rungta Rashi, Saladi Kalyan, Schelten Alan, Silva Ruan, Eric Michael Smith Ranjan Subramanian, Xiaoqing Ellen Tan Binh Tang, Taylor Ross, Williams Adina, Jian Xiang Kuan Puxin Xu, Yan Zheng, Zarov Iliyan, Zhang Yuchen, Fan Angela, Kambadur Melanie, Narang Sharan, Rodriguez Aurelien, Stojnic Robert, Edunov Sergey, and Scialom Thomas. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023.
  • [76].Ouyang Long, Wu Jeff, Jiang Xu, Almeida Diogo, Wainwright Carroll L., Mishkin Pamela, Zhang Chong, Agarwal Sandhini, Slama Katarina, Ray Alex, Schulman John, Hilton Jacob, Kelton Fraser, Miller Luke, Simens Maddie, Askell Amanda, Welinder Peter, Christiano Paul, Leike Jan, and Lowe Ryan. Training language models to follow instructions with human feedback, March 2022.
  • [77].Shiffrin Richard M. and Schneider Walter. Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory. Psychological Review, 84(2):127–190, 1977. [Google Scholar]
  • [78].Rosa Angela Fabio Tindara Caprì, and Romano Martina. From Controlled to Automatic Processes and Back Again: The Role of Contextual Features. Europe’s Journal of Psychology, 15(4):773–788, December 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [79].Daw Nathaniel D., Gershman Samuel J., Seymour Ben, Dayan Peter, and Dolan Raymond J.. Model-based influences on humans’ choices and striatal prediction errors. Neuron, 69(6):1204–1215, March 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [80].Sutton R. S. and Barto A. G.. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, January 1998. [Google Scholar]
  • [81].Gregory Ashby F. and Todd Maddox W.. Human Category Learning. Annual Review of Psychology, 56(Volume 56, 2005):149–178, February 2005. [DOI] [PubMed] [Google Scholar]
  • [82].Carvalho Paulo F. and Goldstone Robert L.. Putting category learning in order: Category structure and temporal arrangement affect the benefit of interleaved over blocked study. Memory & Cognition, 42(3):481–495, April 2014. [DOI] [PubMed] [Google Scholar]
  • [83].O’Reilly Randall C., Bhattacharyya Rajan, Howard Michael D., and Ketz Nicholas. Complementary Learning Systems. Cognitive Science, 38(6):1229–1248, 2014. [DOI] [PubMed] [Google Scholar]
  • [84].Park Seongmin A., Miller Douglas S., Nili Hamed, Ranganath Charan, and Boorman Erie D.. Map Making: Constructing, Combining, and Inferring on Abstract Cognitive Maps. Neuron, 107(6):1226–1238.e8, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [85].Flesch Timo, Nagy David G., Saxe Andrew, and Summerfield Christopher. Modelling continual learning in humans with Hebbian context gating and exponentially decaying task signals. PLOS Computational Biology, 19(1):e1010808, January 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [86].Giallanza Tyler, Campbell Declan, and Cohen Jonathan D.. Toward the Emergence of Intelligent Control: Episodic Generalization and Optimization. Open Mind, 8:688–722, May 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [87].Sandbrink Kai and Summerfield Christopher. Modelling cognitive flexibility with deep neural networks. Current Opinion in Behavioral Sciences, 57:101361, June 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [88].Wang Jane X. Meta-learning in natural and artificial intelligence. Current Opinion in Behavioral Sciences, 38:90–95, April 2021. [Google Scholar]
  • [89].Mirchandani Suvir, Xia Fei, Florence Pete, Ichter Brian, Driess Danny, Montserrat Gonzalez Arenas Kanishka Rao, Sadigh Dorsa, and Zeng Andy . Large Language Models as General Pattern Machines, October 2023.
  • [90].Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Lukasz, and Polosukhin Illia. Attention is all you need. In Guyon Isabelle, Luxburg Ulrike von, Bengio Samy, Wallach Hanna M., Fergus Rob, Vishwanathan S. V. N., and Garnett Roman, editors, Adv. Neur. Inf. Proc. Sys. 30, pages 5998–6008, long beach, CA, USA, 2017. [Google Scholar]
  • [91].O’Reilly Randall C., Ranganath Charan, and Russin Jacob L.. The Structure of Systematicity in the Brain. Current directions in psychological science, 31(2):124–130, April 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [92].Piantadosi Steven and Aslin Richard. Compositional Reasoning in Early Childhood. PLOS ONE, 11(9):e0147734, September 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [93].Piantadosi Steven T., Palmeri Holly, and Aslin Richard. Limits on composition of conceptual operations in 9-month-olds. Infancy : the official journal of the International Society on Infant Studies, 23(3):310–324, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [94].Lake Brenden M. Compositional generalization through meta sequence-to-sequence learning. In Wallach H., Larochelle H., Beygelzimer A., Alché-Buc F., Fox E., and Garnett R., editors, Advances in Neural Information Processing Systems 32, pages 9788–9798. Curran Associates, Inc., 2019. [Google Scholar]
  • [95].McCoy R. Thomas, Grant Erin, Smolensky Paul, Griffiths Thomas L., and Linzen Tal. Universal linguistic inductive biases via meta-learning. Proceedings of the Annual Meeting of the Cognitive Science Society, 42(0), 2020. [Google Scholar]
  • [96].O’Reilly R. C., Munakata Yuko, Frank Michael J., Hazy Thomas E., and Contributors. Computational Cognitive Neuroscience. Wiki Book, 1st Edition, URL: http://ccnbook.colorado.edu, 2012. [Google Scholar]
  • [97].Miller Paul and Wang Xiao-Jing. Inhibitory control by an integral feedback signal in prefrontal cortex: A model of discrimination between sequential stimuli. Proceedings of the National Academy of Sciences, 103(1):201–206, January 2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [98].Hunt Laurence T. and Hayden Benjamin Y.. A distributed, hierarchical and recurrent framework for reward-based choice. Nature reviews. Neuroscience, 18(3):172–182, February 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [99].Cavanagh Sean E., Hunt Laurence T., and Kennerley Steven W.. A Diversity of Intrinsic Timescales Underlie Neural Computations. Frontiers in Neural Circuits, 14:615626, December 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [100].Russin Jacob, O’Reilly Randall C, and Bengio Yoshua. Deep learning needs a prefrontal cortex. In Bridging AI and Cognitive Science (BAICS) Workshop, ICLR 2020, page 11, 2020. [Google Scholar]
  • [101].Buchsbaum Bradley R., Greer Stephanie, Chang Wei-Li, and Berman Karen Faith. Meta-analysis of neuroimaging studies of the Wisconsin Card-Sorting task and component processes. Human Brain Mapping, 25(1):35–45, 2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [102].Milner Brenda. Effects of Different Brain Lesions on Card Sorting: The Role of the Frontal Lobes. Archives of Neurology, 9(1):90–100, July 1963. [Google Scholar]
  • [103].Shallice T. and Burgess P. W.. Deficits in strategy application following frontal lobe damage in man. Brain: A Journal of Neurology, 114 (Pt 2):727–741, April 1991. [DOI] [PubMed] [Google Scholar]
  • [104].Traylor Aaron, Merullo Jack, Frank Michael J., and Pavlick Ellie. Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks. Proceedings of the Annual Meeting of the Cognitive Science Society, 46(0), 2024. [Google Scholar]
  • [105].Soni Aneri, Traylor Aaron, Merullo Jack, Frank Michael J., and Pavlick Ellie. Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks. in prep.
  • [106].Kingma Diederik P. and Ba Jimmy. Adam: A Method for Stochastic Optimization. In Bengio Yoshua and LeCun Yann, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015. [Google Scholar]
