Proceedings of the National Academy of Sciences of the United States of America
2020 Dec 10;117(52):32970–32981. doi: 10.1073/pnas.2008852117

Transforming task representations to perform novel tasks

Andrew K Lampinen a,1, James L McClelland a
PMCID: PMC7777120  PMID: 33303652

Significance

An intelligent system should be able to adapt to a novel task without any data and achieve at least moderate success. Humans can often do so, while models often require immense datasets to reach human-level performance. We propose a general computational framework by which models can adapt to new tasks based only on their relationship to old tasks. Our approach is based on transforming learned task representations. Our approach allows models to perform well on novel tasks, even using novel relationships not encountered during training. This adaptation can substantially accelerate later learning. Our work could contribute to understanding the computational basis of intelligence, to cognitive modeling, and to building more flexible forms of artificial intelligence.

Keywords: cognitive science, artificial intelligence, transfer, zero-shot

Abstract

An important aspect of intelligence is the ability to adapt to a novel task without any direct experience (zero shot), based on its relationship to previous tasks. Humans can exhibit this cognitive flexibility. By contrast, models that achieve superhuman performance in specific tasks often fail to adapt to even slight task alterations. To address this, we propose a general computational framework for adapting to novel tasks based on their relationship to prior tasks. We begin by learning vector representations of tasks. To adapt to new tasks, we propose metamappings, higher-order tasks that transform basic task representations. We demonstrate the effectiveness of this framework across a wide variety of tasks and computational paradigms, ranging from regression to image classification and reinforcement learning. We compare to both human adaptability and language-based approaches to zero-shot learning. Across these domains, metamapping is successful, often achieving 80 to 90% performance, without any data, on a novel task, even when the new task directly contradicts prior experience. We further show that metamapping can not only generalize to new tasks via learned relationships, but can also generalize using novel relationships unseen during training. Finally, using metamapping as a starting point can dramatically accelerate later learning on a new task and reduce learning time and cumulative error substantially. Our results provide insight into a possible computational basis of intelligent adaptability and offer a possible framework for modeling cognitive flexibility and building more flexible artificial intelligence systems.


Adaptability is a key feature of biological intelligence—adaptation is necessary for a system to efficiently handle all of the vagaries of its environment (1). An advantage of neural networks over ordinary computer programs is that they can adapt by learning from training examples. Yet this is only a limited form of adaptability. An intelligent system should be able to transform its behavior on a task in accordance with a change in goals, and humans often exhibit this form of adaptability (2). For example, if we are told to try to lose at poker, we can perform quite well on our first try, even if we have always tried to win previously. If we are shown an object and told to find the same object in a new color or texture, we can do so. By contrast, this type of first-try adaptation is quite difficult for standard deep-learning models (2–4). How could models reuse their knowledge more flexibly?

We suggest that this ability to adapt can arise from exploiting the relationship between the adapted version of the task and the original. In this work, we propose a computational model of adaptation based on task relationships and demonstrate its success across a variety of domains, ranging from regression to classification to reinforcement learning. Our approach could provide insights into the flexibility of human cognition and allow for more flexible artificial intelligence systems.

Our model incorporates several key cognitive insights. First, to perform different tasks, it is useful for the system to constrain its behavior by an internal task representation (5). Prior work in machine learning and cognitive science has constructed task representations from a natural language instruction (6–8) or by learning to infer task representations from examples, a procedure called metalearning (9, 10). We extend these ideas, proposing that the model can adapt to a novel task by transforming its representation for a prior task into a representation for the new task, thereby exploiting the task relationship to perform the new task.

We refer to these transformations of task representations as metamappings. That is, metamappings are higher-order functions over tasks—functions that take a task as input and transform it to produce an adapted version of that task. Metamappings allow the model to adapt to a new task zero shot (i.e., without requiring any data from that new task), based on the relationship between the new task and prior tasks. We propose that metamapping is a powerful way to promote adaptation, because the task relationships it exploits are the fundamental conceptual structure on which systematic generalization can be predicated.

As a concrete example, our model is able to switch to losing at poker on its first try. To do so, it constructs a representation of poker from experience with trying to win the game. It then infers a “try-to-lose” metamapping, either from language or from examples of winning and losing at other games, such as blackjack. It then applies this metamapping to transform its representation of poker, thereby yielding a representation for losing at poker. This adapted task representation can then be used to perform the task of trying to lose at poker zero shot—that is, without any prior experience of losing at poker.

Our main contributions are 1) to propose metamapping as a computational framework for zero-shot adaptation to novel tasks and 2) to provide a parsimonious architecture for metamapping.

We demonstrate the success of metamapping across a variety of task domains, ranging from visual classification to reinforcement learning, and show that the model can even adapt using new metamappings not encountered during training. We further show that adapting by metamapping provides a useful starting point for later learning. This work proposes transforming a task representation to adapt zero shot. We consider related work and implications for cognitive science and artificial intelligence in Discussion.

Task Transformation via Metamappings

Basic Tasks Are Input–Output Mappings.

We take as a starting point the construal of basic tasks as mappings (functions) from inputs to outputs. For example, poker can be seen as a mapping from hands to bets (Fig. 1A), chess as a mapping of board positions to moves, and object recognition as a mapping from images to labels. This perspective is common in machine-learning approaches, which generally try to infer a task mapping from many input/output examples or metalearn how to infer it from fewer examples. We use the phrase “basic task” to refer to any elementary task a system performs (e.g., any card game), including both standard tasks (“poker”) and variants that can be produced by a transformation (“lose at poker”).

Fig. 1.

Performing and transforming tasks with a metamapping architecture. (A) Basic tasks are mappings from inputs to outputs, which can be generalized from examples. (D) Metamappings are mappings from tasks to other tasks, which can be generalized from examples. (B and E) The architecture performs basic tasks and metamappings from a task representation, which can be constructed from a language cue or examples. (C) The task representation is used to alter the parameters of a task network (Inset) which executes the appropriate task mapping. (F) The metamapping representation is used to parameterize the task network to transform a task representation. The transformed representation can then be used to perform the new task zero shot (Inset). Our architecture exploits a deep analogy between basic tasks and metamappings—both can be seen as mappings of inputs to outputs. This analogy is reflected in the parallels between A–C and D–F.

Tasks Can Be Transformed via Metamappings.

We propose metamappings as a computational approach to the problem of transforming a basic task mapping. A metamapping is a higher-order task, which takes a task representation as input and outputs a representation of the transformed version of the task. For example, we might have a “lose” metamapping (Fig. 1D) that would transform the representation of poker into a representation of losing at poker.

How can a metamapping be performed? We exploit an analogy between metamappings and basic task mappings—both are simply functions from inputs to outputs. Thus, to perform a metamapping we use approaches analogous to those we use for basic tasks. We infer a metamapping from examples (e.g., winning and losing at a set of example games) or natural language (e.g., “try to switch to losing”). We can then apply this metamapping to other basic tasks, to infer losing variations of those tasks. Importantly, the system can generalize to new metamappings—task transformations never seen in training—as well as to new basic tasks.

Model Architecture and Training Methods

We propose a class of architectures that can both perform basic tasks and adapt to task alterations via metamappings. In this section, we describe the general features of our architectures and their training. See SI Appendix, section A for details, including a formal model description, all hyperparameters, etc.

Constructing a Task Representation (Fig. 1B).

When humans perform a task, they need to know what the task is. In our model, we specify the task using a task representation, which we derive from language, from supporting examples of appropriate behavior, or from metamapping (see Transforming Task Representations via Metamappings). To construct a task representation from language, we process the language through a deep recurrent network (long short-term memory), as in other work (7, 8, 11). To construct a task representation from examples, as in other work (12), we process each example (i.e., an input and its corresponding target) to construct an appropriate representation of the example and then aggregate across those representations by taking an elementwise maximum, to combine examples in a nonlinear but order-invariant way. This aggregated representation then receives further processing to produce the task representation.
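As a concrete illustration, the following PyTorch-style sketch shows the two routes for constructing a task representation. The module names (ExampleEncoder, LanguageEncoder), layer sizes, and depths are illustrative assumptions rather than our exact implementation; the key elements are the elementwise-maximum aggregation over example encodings and the recurrent encoding of an instruction.

import torch
import torch.nn as nn

class ExampleEncoder(nn.Module):
    """Builds a task representation from (input, target) example pairs."""
    def __init__(self, in_dim, target_dim, hidden_dim, task_dim):
        super().__init__()
        self.per_example = nn.Sequential(
            nn.Linear(in_dim + target_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.post = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, task_dim))

    def forward(self, inputs, targets):
        # inputs: [n_examples, in_dim], targets: [n_examples, target_dim]
        per_ex = self.per_example(torch.cat([inputs, targets], dim=-1))
        pooled, _ = per_ex.max(dim=0)   # nonlinear but order-invariant aggregation
        return self.post(pooled)        # task representation

class LanguageEncoder(nn.Module):
    """Builds a task representation from a tokenized instruction."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, task_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, task_dim)

    def forward(self, token_ids):
        # token_ids: [1, seq_len]
        _, (h, _) = self.lstm(self.embed(token_ids))
        return self.out(h[-1, 0])       # task representation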

Performing a Task from Its Representation (Fig. 1C).

Once we have a task representation, we use it to perform the task. We allow a large part of the input processing (perception) and output decoding (action) to be shared across the tasks within each domain we consider, so that the task-specific computations can be relatively simple and abstract.* For example, if a human is playing card games, the cards will be identical whether the game is poker or bridge, and the task-specific computations will be performed over abstract features such as suit and rank relationships. We thus allow the system to learn a general basis of perceptual features over all tasks within a domain.

The system then uses these features in a task-specific way to perform task-appropriate behavior. Specifically, the model uses a HyperNetwork (13, 14), which takes as input the representation of a task. This network adapts the values of learned “default” connection weights, to make the network task sensitive (Fig. 1C, Inset). The adapted network transforms the perceptual features into task-appropriate output features, which can then be decoded to outputs via a shared output decoding network. The whole model (including the construction of the task representations) can be trained end to end, just as a standard metalearning system would be. Our approach outperforms an alternative architecture, in which the task representation is provided as another input to a feedforward task network (SI Appendix, Fig. S10).
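The following minimal sketch illustrates the hypernetwork idea: a task representation is mapped to adjustments of learned default weights of a small task network that transforms shared perceptual features into task-appropriate output features. The two-layer structure and sizes are illustrative assumptions, not our exact architecture.

import torch
import torch.nn as nn

class HyperTaskNetwork(nn.Module):
    def __init__(self, task_dim, feat_dim, hidden_dim):
        super().__init__()
        n_params = feat_dim * hidden_dim + hidden_dim * feat_dim
        self.hyper = nn.Linear(task_dim, n_params)   # generates weight adjustments
        # learned "default" connection weights, adapted per task
        self.w1 = nn.Parameter(torch.randn(feat_dim, hidden_dim) * 0.05)
        self.w2 = nn.Parameter(torch.randn(hidden_dim, feat_dim) * 0.05)
        self.feat_dim, self.hidden_dim = feat_dim, hidden_dim

    def forward(self, features, task_rep):
        deltas = self.hyper(task_rep)
        split = self.feat_dim * self.hidden_dim
        d1 = deltas[:split].view(self.feat_dim, self.hidden_dim)
        d2 = deltas[split:].view(self.hidden_dim, self.feat_dim)
        h = torch.relu(features @ (self.w1 + d1))
        return h @ (self.w2 + d2)        # task-appropriate output features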

Transforming Task Representations via Metamappings (Fig. 1 E and F).

We defined a metamapping to be a higher-order task, which takes as input a task representation and outputs a transformed task representation. Thus, we need a way of transforming the task representations constructed in Constructing a Task Representation. To do so, we exploit the functional analogy between basic tasks and metamappings. We infer a representation for a metamapping from examples of that metamapping or from a language description, just like we infer a basic task representation from examples or language. We use this metamapping representation to adapt the parameters of the task network to transform other task representations. This approach is analogous to how we used a representation of a basic task to adapt the task network to perform that task. (See SI Appendix, section G.1 for proof that a simpler vector-analogy metamapping approach is inadequate.)
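The sketch below, continuing the illustrative modules above, shows how the same machinery performs a metamapping: the example encoder builds a metamapping representation from (source, target) task-representation pairs, the hypernetwork-parameterized task network transforms a basic task representation, and the transformed representation is then used to perform the new task zero shot. The dimensions and placeholder tensors are for illustration only.

import torch

task_dim, feat_dim, hidden_dim = 64, 64, 128
encoder = ExampleEncoder(in_dim=task_dim, target_dim=task_dim,
                         hidden_dim=hidden_dim, task_dim=task_dim)
task_net = HyperTaskNetwork(task_dim=task_dim, feat_dim=feat_dim,
                            hidden_dim=hidden_dim)

# (source task rep, transformed task rep) pairs for other games, e.g.
# (win blackjack -> lose blackjack), (win hearts -> lose hearts), ...
source_reps = torch.randn(4, task_dim)
target_reps = torch.randn(4, task_dim)

# 1) Infer the "lose" metamapping representation from those example pairs.
mm_rep = encoder(source_reps, target_reps)

# 2) Apply the metamapping: the task network, parameterized by mm_rep,
#    transforms the representation of winning poker into losing poker.
win_poker_rep = torch.randn(feat_dim)
lose_poker_rep = task_net(win_poker_rep, mm_rep)

# 3) Perform the new task zero shot from the transformed representation.
hand_features = torch.randn(feat_dim)
bets = task_net(hand_features, lose_poker_rep)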

Homoiconicity.

Our architectures exploit the analogy between basic tasks and metamappings by using exactly the same networks (with exactly the same parameters) to infer and perform a metamapping as for inferring and performing a basic task. To allow this, the system embeds individual data points, task representations, and metamapping representations into a shared representational space. This means that all task- or metamapping-specific computations can be seen as operations on objects in this shared space and can be inferred using the same processes regardless of object type. (Note that sharing of the space is only enforced implicitly in that the same networks are processing different entities.) This approach is in keeping with the idea that humans have a single mind that implements computations of all types. Our approach is also inspired by the computational notion of homoiconicity. In a homoiconic programming language, programs can be manipulated just as data can. Our task representations are like programs that perform tasks, and our implementation is thus homoiconic in the sense that it operates on data and tasks in the same way.

Homoiconicity is parsimonious, in that it does not require adding new networks for each new type of computation. Furthermore, in many cases, functions have some common structure with the entities they act over. For example, both numbers and functions can have inverses. For another example, the set of linear maps over a vector space is itself a vector space. If the different levels of abstraction share structural features, sharing computation should improve generalization. Homoiconicity could also support the ability to build abstractions recursively on top of prior abstractions, as humans do in mathematical cognition (15–17). Although homoiconicity is not a necessary part of metamapping, we suggest that homoiconic approaches will be beneficial and verify this empirically (see Polynomials).

Classifying Task Representations.

In several domains, we also trained the model to classify task representations by relevant attributes (for example, whether a game was a variation of poker), again using the same architectural components. See SI Appendix, section A.3 for details. This may improve generalization by helping the model constrain its representation of the task space, but is not essential (SI Appendix, Fig. S11).

Training the Model.

We train the system in epochs, during which it receives one training step on each trained basic task and one training step on each trained metamapping, interleaved in a random order. To train the system to perform the basic tasks, we compute a task-appropriate loss at the output of the output decoding network and then minimize this loss with respect to the parameters in all networks. This includes the networks used to construct the task representation and even the representations of the examples or language input. That is, we train the system end to end to perform the basic tasks.

When constructing a task representation from examples, we do not allow the example network to see every example in the training batch. This forces the model to generalize in a standard metalearning fashion. Specifically, we separate the batch of examples into a support set which is provided to the example network and a probe set which is only passed through the task network to compute an output, from which a loss can be computed against a task target. For example, in a card game the system will have to construct a task representation from the support hands that will be useful for playing the probe hands. This approach encourages the task representations to capture the task structure, rather than just memorizing examples. We randomly split the training examples into support and probe sets on each step, so that over the course of training every training example would fill both roles. In this approach, the task representation is constructed anew at each training step. However, to stabilize learning in difficult domains, it can be useful to maintain a persistent task representation which updates slowly with each new set of examples (SI Appendix, section A.3).
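A minimal sketch of one basic-task training step with this support/probe split, assuming the illustrative encoder and task network modules above and a regression-style loss:

import torch

def basic_task_step(encoder, task_net, inputs, targets, optimizer, n_support=32):
    # Randomly split the batch: support examples build the task representation,
    # probe examples are only used to score it.
    perm = torch.randperm(inputs.shape[0])
    sup, probe = perm[:n_support], perm[n_support:]
    task_rep = encoder(inputs[sup], targets[sup])
    preds = task_net(inputs[probe], task_rep)
    loss = torch.mean((preds - targets[probe]) ** 2)   # task-appropriate loss
    optimizer.zero_grad()
    loss.backward()   # end to end: encoder and task network both receive gradients
    optimizer.step()
    return loss.item()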

Training the system to construct basic task representations from language is similar, except that a language description (e.g., “play poker”) is provided rather than examples. Thus, no support set is needed, so all examples can be used as probes.

To train the system to perform metamappings from examples, we start with a training set of example task representation pairs, where each pair consists of a source task representation and the corresponding transformed task representation. Again, on each training step, a subset of these examples is used as a support set to construct a metamapping representation. The remaining examples are used as a probe set to train the system to transform the source representation for each pair to its corresponding target. Specifically, we present the source task embedding as input to the task network and minimize an ℓ2 loss on the difference between the output embedding the task network produces and the task representation for the target transformed task. For example, suppose the system has been trained to play winning and losing variations of blackjack, hearts, and rummy. We might use the representations of winning and losing hearts and rummy as support-set examples to instantiate the metamapping, then input the task representation for winning blackjack as a probe, and try to match the output to the task representation for losing blackjack. Again, we randomly choose which examples are used as support or probes on each training step. On the next training step, we might use hearts and blackjack as examples and train the metamapping to generalize to losing at rummy.
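A parallel sketch of one metamapping training step, in which support pairs of (source, target) task representations instantiate the metamapping and probe pairs supply the ℓ2 training signal; the support-set size and loss details are illustrative assumptions.

import torch

def metamapping_step(encoder, task_net, source_reps, target_reps, optimizer,
                     n_support=2):
    perm = torch.randperm(source_reps.shape[0])
    sup, probe = perm[:n_support], perm[n_support:]
    # Instantiate the metamapping from the support pairs.
    mm_rep = encoder(source_reps[sup], target_reps[sup])
    # Transform the probe source representations and match their targets.
    transformed = task_net(source_reps[probe], mm_rep)
    loss = torch.mean(torch.sum((transformed - target_reps[probe]) ** 2, dim=-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()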

Training the system to perform metamappings from language is similar, except that again a language description (e.g., “switch to losing”) is provided rather than examples of the transformation. Thus, as when using language rather than examples to perform basic tasks, all pairs can be used as probes.

Evaluating Base-Task and Metamapping Performance.

After training, we can evaluate the model’s base-task performance using held-out examples unseen during training. To test generalization of a metamapping (e.g., try to lose), we can pass in the representation for a task that has never been used for any training on this metamapping (either as a support example or as a probe for generalization), for example, poker. We construct a metamapping representation using all of the training examples of the lose metamapping as a support set. We then apply the lose metamapping to the task representation of poker (i.e., pass it through the task network parameterized by the lose metamapping representation) to produce a transformed representation. We then actually perform the losing variation of poker with this transformed representation. Metamapping performance is always evaluated by zero-shot performance on held-out tasks that the system has never performed during training.

In metamapping, generalization is possible at different levels of abstraction. The paragraph above refers to basic generalization—applying a metamapping seen during training to a basic task that metamapping has not been applied to during training, to perform a held-out transformed version of that task. However, if the system has experienced sufficiently many metamappings during training, we can also test its ability to generalize to held-out metamappings. For example, if the system has been trained to switch various pairs of colors in a classification task (red for blue, green for yellow, etc.), it should be able to generalize to switching held-out pairs (red for yellow, green for blue, etc.) from an appropriate cue (examples or instructions). That is, even if a metamapping has never been encountered during training, we can construct a representation for it by providing a support set of transformation examples or a language instruction that is systematically related to those used for trained metamappings. We view this as an important part of intelligent adaptability—the system should be able not only to adapt to tasks via metamappings that it has directly experienced, but also to infer and use novel metamappings based on specific instructions or examples. We demonstrate this ability in the subset of our experimental domains where we can instantiate sufficiently many metamappings.

Experiments

Metamapping is an extremely general framework. Because the assumptions are simply that the basic tasks are mappings from inputs to outputs and that metamappings transform basic tasks, the approach can be applied to most paradigms of machine learning with minor modifications. We demonstrate our results in four experimental domains. We summarize the contributions of each domain in Table 1.

Table 1.

The contributions of our four experimental domains

Domain            Held-out MMs   Lang. comp.   Type            Input              Output
Polynomials       ✓              –             Regression      Vector (ℝ⁴)        Scalar (ℝ)
Cards             –              ✓             Regression      Binary features    Bet values (ℝ³)
Visual concepts   ✓              ✓             Classification  50×50 image        Label ({0,1})
RL                –              ✓             RL              91×91 image        Action Q values (ℝ⁴)

Our results span various computational paradigms and data types. A ✓ indicates that the domain includes held-out metamapping evaluation or a language-alone comparison, respectively. (Note that "Held-out MMs" refers to held-out metamappings, and "Lang. comp." refers to a comparison to language alone; see Language and Metamapping.)

Polynomials.

As a proof of concept, we first apply metamapping to polynomial regression (Fig. 2). We construct basic tasks that are polynomial functions (of degree ≤ 2) in four variables (i.e., functions from ℝ⁴ to ℝ). These polynomials can be inferred from a support set of (input, output) examples, where the input is a point in ℝ⁴ and the output is the evaluation of that polynomial at that point. For details, and to see that the system performs this simple metalearning regression problem extremely well, see SI Appendix, section B.1 and Fig. S4.

Fig. 2.

The polynomial task domain. A basic polynomial task consists of regressing a single polynomial; i.e., the inputs are points in ℝ⁴ and the outputs are the values of the polynomial at those points. These basic regression tasks can be transformed by various metamappings, such as multiplying by a constant or permuting their variables.

These basic tasks/polynomials can be transformed by various metamappings—we considered squaring a polynomial, permuting its variables, or adding or multiplying by a constant. We considered 36 metamappings in total, of which we trained the model to perform 20, and held out the remaining 16 to evaluate the model’s ability to generalize to held-out metamappings (see Evaluating Base-Task and Metamapping Performance). The held-out metamappings included some of the possible permutation, addition, and multiplication transformations. We used 60 example (source polynomial, transformed polynomial) mapping pairs as a training set for each metamapping and held out another 40 transformed polynomials per metamapping for evaluation. The source and transformed polynomials for all 60 example pairs were trained for each trained or held-out metamapping. This results in a total of 2,260 polynomials trained and 1,440 held out for evaluation. For the 20 trained metamappings, the 60 trained (source polynomial, transformed polynomial) pairs were used to train the metamapping and as the support set for evaluation. For the 16 held-out metamappings, these pairs were used only as the support set for evaluation.
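As an illustration of this domain, the following numpy sketch samples a polynomial task and produces (source, transformed) pairs under two of the metamapping families named above; the coefficient ranges and sampling details are assumptions rather than our exact protocol.

import numpy as np

rng = np.random.default_rng(0)

def sample_polynomial(n_vars=4):
    """Coefficients for the constant, linear, and quadratic terms."""
    const = rng.uniform(-1, 1)
    linear = rng.uniform(-1, 1, size=n_vars)
    quad = rng.uniform(-1, 1, size=(n_vars, n_vars))
    return const, linear, (quad + quad.T) / 2

def evaluate(poly, x):
    const, linear, quad = poly
    return const + linear @ x + x @ quad @ x

def multiply_by_constant(poly, c):            # one metamapping family
    const, linear, quad = poly
    return c * const, c * linear, c * quad

def permute_variables(poly, perm):            # another metamapping family
    const, linear, quad = poly
    return const, linear[perm], quad[np.ix_(perm, perm)]

# A basic task: (input, output) examples for one sampled polynomial.
poly = sample_polynomial()
xs = rng.uniform(-1, 1, size=(50, 4))
ys = np.array([evaluate(poly, x) for x in xs])

# A metamapping example pair: the same polynomial before and after "multiply by 3".
source, target = poly, multiply_by_constant(poly, 3.0)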

In Fig. 3, we show the success of our metamapping approach in this setting. We plot a normalized performance measure, 100% × (1 − loss/c), where the loss is the mean squared error, and c is the loss for a baseline model that always outputs zero. This measure is 0% for a model which outputs all zeros and 100% if the system performs perfectly. See SI Appendix, Table S4 for raw losses. Metamapping achieves good performance on the support-set examples that are used to instantiate the mapping, with 98.3% performance (bootstrap 95% CI across runs [97.3, 99.0]) on trained metamappings and 92.1% [91.3, 93.0] on held-out metamappings. More importantly, on polynomials never experienced during training, metamapping achieves 89.0% [89.3, 89.8] zero-shot performance on average based on a trained metamapping and 85.5% [85.1, 85.9] performance based on a held-out metamapping. We also show the performance the model obtains when it is scored on the new task using the untransformed source task representation (no adaptation). This baseline yields only 4.3 and 19.3% performance, respectively. In summary, metamapping is able to achieve good performance on a new task without any data, based only on its relationship to prior tasks. This success is consistent across all of the metamapping types we evaluated (SI Appendix, Fig. S6).
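For reference, the normalized performance measure can be computed with the following small helper (a model that always outputs zero scores 0%; a perfect model scores 100%):

def normalized_performance(mse_loss, zero_baseline_loss):
    # 100% x (1 - loss/c), where c is the loss of an always-zero baseline
    return 100.0 * (1.0 - mse_loss / zero_baseline_loss)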

Fig. 3.

Metamapping can adapt to a new polynomial zero shot, based on its relationship to prior polynomials. We plot performance (normalized, main text) on transformed polynomials via metamappings. The system performs well on support-set target tasks after adaptation. More importantly, it can perform probe target polynomials that it has never encountered before zero shot and does so substantially better than if it did not adapt (dotted lines). It generalizes well both on trained metamappings (purple) and on held-out metamappings (orange). (Dots show mean and lines show bootstrap 95% CI across five runs.)

We show in Fig. 4 that polynomial and metamapping representations are systematically organized and transform in systematic ways. In general, the transformed representations are close to the nominal targets where targets are known. (Note that even missing the nominal target does not necessarily mean the model is incorrect; just as we could write 2(x+1) instead of 2x+2, the model may have different representations for the same function.) The model is sample efficient at inferring both polynomials and metamappings (SI Appendix, Figs. S5 and S7).

Fig. 4.

Visualizing how metamappings systematically transform the model’s polynomial representations. A and B show two metamappings: (A) multiplying by 3 and (B) squaring. Each arrow shows how a single polynomial’s representation transforms under the metamapping. The arrows are colored by the constant term of the polynomial, and the representations are generally organized so that constant polynomials are around the outside, with constant value increasing clockwise, and the polynomials involving more variables are closer to the center. (A) Multiplying by 3 pushes polynomials away from the center, with the negative constant polynomials rotating counterclockwise as they become more negative and the positive constant polynomials rotating clockwise as they become more positive. The nonconstant polynomials extend in similar directions, but their trajectories are more complicated. (B) Squaring polynomials results in both rotation of the representation space and folding as the negative constants flip to being positive. Insets show that polynomial transformations align closely to their nominal targets (that is, the model’s representation of the target task). (Plots show the top two principal components (PC 1 and PC 2). Note that only 60 of the 1,200 polynomials shown were used for training each mapping. See SI Appendix, section F.2 for further representation analysis.)

Finally, our homoiconic approach significantly outperforms a nonhomoiconic baseline, which differs from the homoiconic architecture only in that separate example networks and hypernetworks are used for the basic tasks and metamappings (SI Appendix, Fig. S9). Sharing these networks improves generalization. Why is homoiconicity beneficial? We show that there is nontrivial overlap between the basic-task and metamapping representations (SI Appendix, Figs. S17–S19) and that some of this overlap reflects structural isomorphisms (SI Appendix, Fig. S20). While this may not fully explain the benefits of homoiconicity, it suggests that the model may exploit the shared structure between basic tasks and metamappings. By contrast, there is little alignment between the representations of numerical polynomial inputs and task representations, perhaps because there are fewer constraints encouraging such alignment.

Card Games.

We motivated our work in part by observations about human flexibility, so we next compare our model to human adaptation in a simple card game. Performing a basic task consists of receiving a hand of two cards and making a bet. The human (or model) plays against an opponent and wins (or loses) the bet if the human’s (or model’s) hand beats (loses to) the opponent’s.

We trained human participants to play one poker-like game with two-card hands (card rank 1 to 4, suit red or black). We evaluated their ability to play that game and then to switch strategy when told to try to lose. We evaluated on multiple trials without feedback, to get multiple “zero-shot” measurements from each participant. (Our protocol was approved by the Stanford University Institutional Review Board Panel on Human Subjects in Non-Medical Research, and all subjects provided informed consent. See SI Appendix, section C for details.)

We compare human adaptation to that of a metamapping model trained on poker and four other card games. The specific rules vary from game to game. We created eight variations of each game, by applying any subset of three transformations, each of which could be learned as a metamapping (SI Appendix, section B.2). The most dramatic transformation is switching from trying to win to trying to lose. This variation requires completely inverting the strategy. We trained the network on 36 of the 40 basic tasks; all losing variations of poker were held out. We used the learned task representations to train metamappings for each of the three transformations. Two of the metamappings were trained using all five games, but the lose metamapping was trained only on the games other than poker.

After training, the lose metamapping is applied to the task representation of poker, to transform it into a task representation of losing at poker. This representation is then used to play the losing variation of poker. This evaluation exactly matches the evaluation of the human participants.

Since rewards are observed only for the action taken, we must alter the representation of basic task examples. Instead of (input, target) examples, we use (state, (action, reward)) examples (SI Appendix, section A.4).

See Fig. 5 for the results. Human subjects are not optimal at the game (mean performance 64%, bootstrap 95% CI [57, 70]), but are adapting well, at least in the sense that performance is similar in the losing variation on average (losing phase mean performance 64%, bootstrap 95% CI [55, 72]). However, there is substantial intersubject variability in base task performance and adaptation. The evaluation hands were sampled in a stratified way in each phase, so this variability in adaptation is due to randomness either in participants’ behavior (e.g., because they are probability matching rather than optimizing bets) or in the way that their behavior changes between winning and losing phases. The metamapping model performs near optimally at the trained task and adapts quite well (mean 85%, 95% CI [79, 90]). In summary, the model performed differently than the human participants, but both the model and humans were able to switch from winning to losing zero shot. See SI Appendix, section F.3 for further analyses.

Fig. 5.

Comparing metamapping to human adaptation in simple card games. This plot shows performance in the two phases of the experiment: baseline testing on the basic game and after adapting to losing zero shot. Human participants are behaving suboptimally on average, but are achieving similar performance after adaptation, although there is substantial inter- and intrasubject variability. The model performs near optimally at baseline and by metamapping achieves around 90% performance at the new game. (We plot performance as expected earnings from the bets made, as a percentage of the expected earnings of an optimal policy. Thick lines are averages, and thin lines are five runs of the model and 19 individual participants.)

Visual Concepts.

We next applied metamapping to visual concepts, a long-standing cognitive paradigm (18). Past work has focused almost entirely on learning a concept from examples. However, adult humans can also understand some novel concepts without any examples at all. If you learn that “blickets” are red triangles and then are told that “zipfs are cyan blickets,” you will instantly be able to recognize a zipf without ever having seen an example. This zero-shot performance can be understood as applying a “switch-red-to-cyan” metamapping to the blicket classification function (Fig. 6). To capture this ability, we applied metamapping.

Fig. 6.

The visual concepts domain. Concepts consist of mappings from images to binary labels, e.g., 1 for images that are red triangles and 0 otherwise (for example, for a red circle or a yellow triangle). These concepts can be transformed by metamappings that alter their attributes, such as switching red to cyan.

We constructed stimuli by selecting from eight shapes, eight colors, and three sizes. We rendered each item at a random position and rotation within a 50×50 pixel image. We defined the basic concepts (basic tasks) as binary classifications of images (i.e., functions from images to {0,1}). We trained the system on all unidimensional concepts (i.e., one-vs.-all classification of each shape, color, and size) as basic tasks, so that it could learn all of the basic attributes. We also constructed composite basic tasks based on conjunctions, disjunctions, and exclusive disjunctions (XOR) of these attributes. For example, one composite concept might be “red and triangle.”

For each concept, we chose balanced datasets of examples (that is, there was a 50% chance that each stimulus was a member of the category), during both training and evaluation. We included negative examples that were only one alteration away from being a category member. These careful contrasts can encourage neural networks to extract more general concepts (19).
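The following sketch illustrates this balanced sampling with near-miss negatives at the attribute level, for one example concept ("red and triangle"); rendering to 50×50 images is omitted, and the attribute inventories beyond the counts and the attributes named in the text are placeholders.

import random

SHAPES = ["triangle", "circle", "square", "star", "plus", "heart", "moon", "diamond"]
COLORS = ["red", "cyan", "green", "yellow", "blue", "purple", "orange", "pink"]
SIZES = ["small", "medium", "large"]

def is_red_and_triangle(stim):
    return stim["color"] == "red" and stim["shape"] == "triangle"

def sample_example(concept, positive):
    stim = {"shape": "triangle", "color": "red", "size": random.choice(SIZES)}
    if not positive:
        # near miss: alter exactly one concept-relevant attribute
        if random.random() < 0.5:
            stim["color"] = random.choice([c for c in COLORS if c != "red"])
        else:
            stim["shape"] = random.choice([s for s in SHAPES if s != "triangle"])
    return stim, int(concept(stim))

# balanced batch: half positives, negatives one alteration away from membership
batch = [sample_example(is_red_and_triangle, positive=(i % 2 == 0)) for i in range(16)]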

In this domain we constructed both the basic task and metamapping representations from language rather than examples (Fig. 1 B and E), to show that metamapping can use this human-relevant cue. That is, there is no example network; instead a language network processes descriptions of tasks and metamappings to construct task and metamapping representations.

We trained the system on metamappings that switched one shape for another or one color for another. We sampled six composite concept transformation pairs that supported each mapping and another six with held-out targets for evaluation. However, our task sampling meant that each held-out example had a closely matched trained example, unlike the other experimental domains. See SI Appendix, section B.3 for details of sampling.

We varied the number of metamappings trained and evaluated the system on its ability to apply metamappings to trained source concepts to recognize the held-out target concepts. (Note that we exclude disjunctions from evaluation, because not adapting works fairly well on them.) Because there are many metamappings available, we were able to hold out one shape metamapping and one color metamapping for evaluation. The same basic concepts instantiating a held-out metamapping were trained as would be for a trained mapping, but the metamapping itself was not. This reduces possible confounds when evaluating metamapping generalization.

The model generalizes well (Fig. 7). On trained metamappings, its performance reaches close to ceiling with around 12 training metamappings. Furthermore, given enough training metamappings, it is able to generalize well to held-out metamappings from a language description of the held-out metamapping. This generalization improves rapidly as the number of metamappings trained increases. Although the average held-out metamapping performance is not perfect even at 32 training metamappings, it is perfect in 40% of the runs.

Fig. 7.

Applying metamapping to visual concepts, after training the model on varying numbers of metamappings. The model is able to generalize trained metamappings to perform new tasks zero shot. Furthermore, it can generalize to new metamappings once it experiences sufficiently many training metamappings. (Results are from 10 runs with each training set size. Error bars are bootstrap 95% CIs across runs.)

Reinforcement Learning.

We next apply our approach to reinforcement learning (RL). RL-like computations relate to neural activity (20, 21), and RL has driven recent artificial intelligence achievements in complex tasks like Go and StarCraft (22, 23). Furthermore, RL requires sophisticated adaptation, since actions have lasting consequences. Thus, RL is an important testing domain for metamapping.

Our RL tasks consist of simple two-dimensional games (Fig. 8), which take place in a 6×6 room with an additional impassable barrier of one square on each side. This grid is rendered at a resolution of 7 pixels per square to provide visual input to the agent. The agent receives egocentric input; i.e., its view is always centered on its position. This improves generalization (8). The agent can take four actions, corresponding to moving in the four cardinal directions. Invalid actions, such as trying to move onto the edge of the board, do not change the state.

Fig. 8.

Illustrative RL task state transitions. In the pick-up example (Top), the agent moves down and picks up the green object. In the push-off example (Bottom), the agent moves right and pushes the red object. Each image is precisely the visual input the agent would receive. Note that the agent is always centered (egocentric perspective).

The tasks the agent must perform relate to objects that are placed in the room. The objects can appear in 10 different colors. In any given task, the room only has two colors of objects in it. Each color of objects appears only with one other color, so there are in total five possible color pairs that can appear. In any given task, one of the present colors is “good” and the other is “bad.” On some tasks, the good and bad colors in a pair are switched.

There are two types of tasks, a “pick-up” task and a “push-off” task. In the pick-up task, the agent is rewarded for moving to the grid location of each good object, which then disappears, and is negatively rewarded for moving to the location of bad objects. In the push-off task, the agent is able to push an adjacent object by moving toward it, if there is no other object behind it. The agent is rewarded for pushing the good-colored objects off the edges of the board and negatively rewarded for pushing the bad-colored objects off. The two types of tasks (pick up and push off) are visually distinguishable, because the shapes of the objects used for them are different. However, which color is good or bad is not visually discernible and must be inferred from the example (state, (action, reward)) tuples used to construct the task representation.

There are in total (2 task types) × (5 color pairs) × (binary switching of good and bad colors) = 20 tasks. (See SI Appendix, section B.4 for details.) We trained the system on 18 tasks, holding out the switched color combinations of (red, blue) in both task types. That is, during training the agent was always positively rewarded for interacting with red objects and negatively rewarded for interacting with blue objects. We trained the system on the “switch-good-and-bad-colors” metamapping using the remaining four color pairs in both task types and then evaluated its ability to perform the held-out tasks zero shot based on this mapping. This evaluation is a difficult challenge, since the model was always negatively rewarded during training for interacting with the objects that it must interact with in the evaluation tasks.
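The following short sketch enumerates this task space and the held-out evaluation tasks; the color pairings other than (red, blue) are illustrative placeholders.

from itertools import product

TASK_TYPES = ["pick_up", "push_off"]
COLOR_PAIRS = [("red", "blue"), ("green", "purple"), ("yellow", "cyan"),
               ("pink", "white"), ("orange", "teal")]   # 5 fixed pairings (names illustrative)

tasks = [(task_type, pair, switched)
         for task_type, pair, switched in product(TASK_TYPES, COLOR_PAIRS, [False, True])]
assert len(tasks) == 20

# Held out: the switched (red, blue) combination in both task types, i.e.
# the tasks where blue is good and red is bad.
held_out = [t for t in tasks if t[1] == ("red", "blue") and t[2]]
train_tasks = [t for t in tasks if t not in held_out]
assert len(train_tasks) == 18 and len(held_out) == 2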

We evaluate the model for each task by requiring the training accuracy to be above a threshold and selecting an optimal stopping time when the other task is performed well. We also used two minor model modifications to stabilize learning: persistent task representations (discussed in Training the Model) and weight normalization. See SI Appendix, section A.4 for details. Despite the challenging setting, the model adapts well, achieving 88.0% of optimal rewards (mean, bootstrap 95% CI [75.0, 99.0]) on the held-out pick-up task and 71.7% (mean, bootstrap 95% CI [42.0, 94.6]) on the held-out push-off task. The results are plotted in Fig. 9, along with the results from the comparison models from the next section. (The model is also slower to complete generalization episodes [SI Appendix, Fig. S26]; perhaps humans, too, might be more hesitant in novel situations.)

Fig. 9.

Comparing RL adaptation performance when metamapping (MM) with task representations constructed from examples, when metamapping with task representations constructed from language, or when generalizing from language alone. Metamapping generalizes well with either type of task representation, while language alone generalizes poorly. (Chance refers to taking random actions.)

In SI Appendix, section F.7, we show that metamapping is able to extrapolate metamappings beyond the dimensions it has been trained on, to transform new dimensions. Specifically, when trained with the switch-good-and-bad metamapping applied to colors, it can generalize to switching shapes. This is further evidence for the flexibility and systematicity of metamapping.

Language and Metamapping.

Language is often key to human adaptation, and prior work on zero-shot performance has often used a task description as input (6–8). We showed in the visual concepts domain that language provides a suitable cue for basic tasks and metamappings; in this section we explore the relationship between language, examples, and metamapping further. We compare three approaches to zero-shot task performance in the RL domain: metamapping from examples (shown in the previous section), metamapping from language, and generalization from language alone.

First, we consider metamapping from language. We use language input both to generate task representations (e.g., “pick-up, red, blue, first” to indicate picking up objects, where the first color, red, is good) and as a cue for metamapping (“switch colors”). Applying this approach to the same training and hold-out setup used above for metamapping from examples yields comparable performance: 69.2% (mean, bootstrap 95% CI [49.5, 84.5]) on the pick-up task and 74.9% [60.9, 85.5] on the push-off task (Fig. 9). This shows (as with the visual concepts) that generating task representations from examples is not essential—language can support metamapping.

However, a model that generates task representations from language offers an alternative approach to performing a new task zero shot. If language descriptions systematically relate to tasks, the model should be able to generalize to new tasks from their description alone. If the system learns that “green, yellow, first” means that the objects will be green and yellow, and the first color (green) is good; that “green, yellow, second” means that yellow will be good; and that “red, blue, first” means that red will be good and blue bad; it could in principle generalize appropriately to “red, blue, second.” Indeed, this approach to zero-shot task performance has been demonstrated in prior work (7, 8). However, we find that transforming the task representation via a metamapping can provide a stronger basis for adapting, compared to systematic language alone.

To demonstrate this, we compare the example- and language-based metamapping approaches to generalizing from language alone, again using the same basic tasks to train the network to perform tasks from language, but without metamapping training (Fig. 9). Performing the new tasks from language alone results in very poor generalization performance: −92.8% (mean, bootstrap 95% CI [−96.3, −88.4]) on the pick-up task and −79.7% [−92.8, −59.1] on the push-off task. Metamapping provides much better generalization.

The direct comparison between language-based metamapping and language alone shows that metamapping is beneficial, but there are two mechanisms by which it could help. Metamapping at test time could be key to generalization, or metamapping training could simply improve the learning of the basic task representations, such that even language alone would allow good generalization in a metamapping trained model. However, language-alone generalization is not significantly improved even in the language-based metamapping model (SI Appendix, section F.6), suggesting that metamapping at test time is key to the benefits we observe.

We also compared metamapping to language alone in the cards and visual concepts domains. We summarize the results here; see SI Appendix, section F.6 for details. In the cards domain, the language-based model was not able to generalize well to the losing game, instead degrading to chance-level performance. In the visual concepts domain, by contrast, the language model generalizes comparably to metamapping. This may be due to the concept sampling—each evaluation concept had several closely related training concepts, unlike the other domains. Indeed, metamapping shows a greater advantage when new concepts are less similar to trained ones.

In summary, metamapping (from examples or language) outperforms or equals language alone in all our experiments. Metamapping is especially beneficial when the task space is sparsely sampled or generalization is challenging. We consider the advantage of metamapping further in Discussion.

Metamapping as a Starting Point for Later Learning.

Zero-shot adaptation by metamapping allows a model to perform a new task without any direct experience. However, as we have seen, zero-shot performance is not always as good as the ultimate performance after training on the task. Here, we show that even if zero-shot performance is not completely optimal, it makes learning much faster than starting from scratch. We also show that this learning can be done in a way that avoids interference with performance on prior tasks.

We return to the polynomials domain to demonstrate this. We reinstate a trained model and consider how it could learn on the held-out tasks once it encounters them. To do so, we optimize the representations of the new tasks to improve performance on those tasks, without allowing any of the network weights to change (SI Appendix, section A.5). This approach improves performance on the new tasks without interfering with prior knowledge (24) (cf. refs. 25 and 26). Thus, it provides a useful approach to learning after zero-shot adaptation, once the system is actually performing the new tasks.
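A minimal sketch of this procedure, assuming the illustrative task network from the earlier sketches; the optimizer choice, learning rate, and step count are assumptions.

import torch

def learn_new_task_rep(task_net, init_rep, inputs, targets, steps=100, lr=1e-2):
    rep = init_rep.clone().detach().requires_grad_(True)   # e.g., a metamapped representation
    for p in task_net.parameters():
        p.requires_grad_(False)                             # no network weights change
    opt = torch.optim.Adam([rep], lr=lr)
    for _ in range(steps):
        loss = torch.mean((task_net(inputs, rep) - targets) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return rep.detach()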

We evaluate a variety of starting points for initializing the new task representations. We compare initializing via metamappings to a variety of reasonable alternatives, such as small random values (the standard in machine learning), the embedding of an arbitrary trained task, and the centroid of all trained task representations. We plot learning curves from these different initializations in Fig. 10. Producing an initial task representation by metamapping results in much lower initial loss and faster learning than any other method.

Fig. 10.

Metamapping provides a good starting point for later learning. Shown are learning curves (geometric mean of loss on new tasks) while optimizing task representations on new tasks in the polynomials domain. Using metamapping as a starting point offers much lower initial loss and results in faster learning than alternative initializations. (Thin curves are five individual runs, thick curves are averages.)

To quantify this, we consider the cumulative loss over learning, i.e., the integral of the learning curves. This measures how much loss the model had to suffer to reach perfect behavior on the new tasks. Starting from a metamapping results in almost an order of magnitude less cumulative error (mean = 24.58, bootstrap 95% CI [17.71, 32.08]) than the next best initialization (centroid of trained task representations, mean = 192.89, bootstrap 95% CI [151.98, 234.53]). Metamapping provides a valuable starting point for future learning. (We also show this in the visual concepts domain in SI Appendix and show that a hypernetwork architecture is essential.)

Discussion

We have proposed metamappings as a computational mechanism for performing a novel task zero shot—without any direct experience on the task—based on the relationship between the novel task and prior tasks. We have shown that our approach performs well across a wide range of settings, often achieving 80 to 90% performance on a new task with no data on that task at all. With enough experience, as in the visual classification settings with enough training tasks, it can adapt perfectly. It can also adapt using novel relationships (held-out metamappings) that it has never encountered during training.

As noted in the Introduction, there are computational benefits to adaptivity. Its potential contributions to biological intelligence have been highlighted by Siegelmann (1), who proposes that there is a “hierarchy of computational powers” and that a particular system’s location in that hierarchy depends on its “particular level of richness and adaptability.” Because our work offers an additional perspective on adaptation, it would be interesting to explore the theoretical computational power of metamapping under different input and representation regimes.

As Siegelmann notes, for a model to be able to adapt, it must first be capable of performing a variety of related tasks (1). Thus, instead of learning parameters that execute a single task, our model learns to construct task representations from examples or language and to use those representations to perform appropriate behaviors. The key insight of this work is that those task representations are then available for transformation and that transforming task representations by metamappings can allow effective adaptation.

In our experiments, directly exploiting task relationships by metamapping allowed more systematic adaptation than indirectly exploiting them by generalizing through compositional language alone. Even when language alone generalized poorly, as in the RL domain, metamapping with language-based task representations resulted in strong generalization. This illustrates the value of a transformation-oriented perspective.

Why is transforming tasks according to task relationships so effective? We suggest that this is because metamapping constructs and uses an explicit cognitive operation that captures what is systematic in the task relationships. For example, “trying to lose” is systematic precisely insofar as the relationship between winning and losing is similar across different games. The metamapping approach gives primacy to these relationships. It thus directly exploits systematic structure where it exists in the cognitively meaningful relationships between tasks.

We also highlight the results showing that metamapping provides a useful starting point for later learning. While metalearning approaches (27) can construct a good starting point for learning new tasks, they do not use task relationships to offer a uniquely appropriate starting point for each novel task. Our results show that using a task relationship to adapt a prior task can substantially reduce the errors made along the way to mastering the new task. This could make deep learning more efficient. It could also be useful in settings like robotics, where mistakes during learning can be extremely costly (28).

Our results should not be taken as a suggestion that metamapping is the only possible mechanism for adaptation. We see intelligence as multifaceted, and any single model is a simplification. Metamapping may be useful as one tool for building models with greater flexibility.

Metamapping increases the adaptability of our models, although our present work has limitations that we discuss in Limitations and Future Directions. Our models can perform tasks from examples, from natural language, and from metamappings, which we have shown are an effective way to adapt zero shot. Thus, our work has many potential applications in machine learning and cognitive science.

Related Work in Machine Learning.

To allow zero-shot adaptation, we built on ideas from several areas of machine learning. First, there is a large body of prior work on allowing models to learn to behave more flexibly, for example by metalearning, that is, learning to learn from examples (9, 27, 29). Our approach to inferring tasks from examples draws on recent ideas like aggregating examples in a permutation-invariant way to produce a task representation (12).

Second, a range of work uses the idea of different timescales of weight adaptation—that is, even if some parameters of a network may need to be learned slowly, it may be useful to alter others much more rapidly (30). We have drawn particularly on the idea that the parameters of a network could be specified by another network in a single forward inference (13, 14). This approach has shown success in metalearning recently (10, 31) and improved our model’s adaptation (SI Appendix, Fig. S9).

There has been a variety of other work on zero-shot task performance. We compared to the zero-shot task performance from language alone. The idea of performing tasks from descriptions was proposed by Larochelle et al. (6). More recent work has considered zero-shot classification using language (32, 33) or performing tasks from language in RL (7, 8). Some of this latter work has even exploited relationships between tasks as a learning signal (11), but without transforming task representations. As discussed in the beginning of the Discussion, transforming task representations with metamappings directly exploits systematic relationships, allowing metamapping to outperform language alone in our experiments. To our knowledge none of the prior work has proposed task transformations to adapt to new tasks.

Other prior work has used similarity between tasks to help generate representations for a new task (34). Again, metamapping may be a stronger approach, since it can specify along which dimensions two tasks are related and the specific ways in which they differ, which a scalar similarity measure cannot.

Aspects of zero-shot adaptation have also been explored in model-based reinforcement learning. Work in model-based RL has partly addressed how to transfer knowledge between different reward functions (35). Metamapping can potentially be applied to this form of transfer as well; indeed, our RL experiments show that metamapping can offer a model-free alternative to model-based adaptation. Metamapping may also offer advantages that could complement model-based methods. Metamapping provides a principled way to infer a new reward estimator by transforming a prior one. It could also transform a transition function used in the planning model in response to environmental changes. Thus, exploring the relationship and synergies between metamapping and model-based RL methods provides an exciting direction for future work.

There has also been other recent interest in task representations. Achille et al. (36) proposed computing embeddings for visual tasks from the Fisher information of a task-tuned model. They show that this captures some interesting properties of the tasks, including some semantic relationships, and can help identify models that can perform well on a task. Other recent work has tried to learn representations for skills (37) or tasks (38) for exploration and representation learning, but without exploring zero-shot transformation of these skills.

Related Work in Cognitive Science.

Our work is related to several streams of research in cognitive science. Prior work has suggested that analogical transfer between structurally isomorphic domains may be a key component of “what makes us smart” (39). Analogical transfer is a kind of zero-shot mapping and has been demonstrated across various cognitive domains (18, 40). We hope our work stimulates further exploration of the conditions under which humans can adapt to task transformations zero shot. Different types of task relationships might be made accessible through culture or education—“relational concepts are not simply given in the natural world: they are culturally and linguistically shaped” (ref. 39, pp. 204–206).

Our work also touches on complex issues of compositionality, productivity, and systematicity. Fodor (41, 42) and Lake and Baroni (43) have advocated that cognition must use compositional representations to exhibit systematic and productive generalization. We see our work as part of an alternative approach to this issue, exploring how systematic, structured generalization can instead emerge from the structure of learning experience, without needing to be built in (44, 45). By focusing on task relationships, rather than building in compositional representations of tasks, our model can learn to exploit the shared structure in the concept of “losing” across a few card games to achieve 85% performance in losing a game it has never tried to lose before.

Crucially, the question of whether the model adapts according to compositional task structure is distinct from the question of whether the model’s representations exhibit compositional structure. Because the mapping from task representations to behavior is highly nonlinear, it is difficult to craft a definition of compositional representations that is either necessary or sufficient for generalization. For example, if “compositional” is taken to mean that Euclidean vector addition of the representations of two constant polynomials results in the representation of their sum, this is clearly untrue for our model (SI Appendix, Fig. S13). However, the nonlinear mapping from representations to behavior can allow for systematic generalization from nonlinear structure. Indeed, it appears that the constant polynomial representations may be approximately systematically arranged in a compressed polar coordinate system. This may support generalization better than a more intuitively compositional representational structure.
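
To make this notion of additive compositionality concrete, one could measure how far the representation of the sum of two constant polynomials lies from the sum of their individual representations. The sketch below shows such a test; toy_embed is a purely hypothetical stand-in for the trained model's task embedding.

    import numpy as np

    def additivity_gap(embed, p, q):
        """Relative distance between the representation of p + q and the
        sum of the representations of p and q (0 would mean perfectly
        additive, i.e., 'compositional' in the Euclidean sense above)."""
        z_sum = embed(p + q)
        z_add = embed(p) + embed(q)
        return np.linalg.norm(z_sum - z_add) / np.linalg.norm(z_sum)

    # Stand-in embedding with nonlinear, polar-like structure, for illustration only.
    def toy_embed(c):
        return np.array([np.cos(c), np.sin(c), np.tanh(c)])

    print(additivity_gap(toy_embed, 2.0, 3.0))  # large gap: not additively compositional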

Furthermore, there are a number of potential benefits to letting systematic behavior emerge, rather than attempting to build in compositional representations. First, the structure does not need to be hand engineered separately for each domain. Our system required no special knowledge about the domains beyond the basic tasks and the existence of relationships between them. The fact that some of these relationships corresponded to, e.g., permutations of variables in the polynomial domain did not need to be hard coded; instead, the model was able to discover the nature of this transformation from the data (in that it was able to generalize well to held-out permutations). Emergence may also allow for novel decompositions at test time. The ability of our model to perform well on held-out metamappings indicates that it has some promise in this regard. Future work should assess this capability of the model more fully.

We also believe that our approach can capture some of the recursive processing that Fodor (42) and others have emphasized. We have also been influenced by ideas in mathematical cognition about how concepts build upon more basic concepts (15–17). This recursive construction reflects the way that metamappings transform basic tasks—complex transformations are built upon simpler ones. If humans can handle an indefinite number of levels of abstraction, the advantage of using a shared representational space for all levels increases, since it eliminates the need to create a new space for each level. Relatedly, our shared workspace for data points, tasks, and metamappings connects to ideas like the global workspace theory of consciousness (46). The ability to reason about and explain concepts at different levels of abstraction can be accounted for parsimoniously by assuming a shared representational space. Exploring these connections would be an exciting future direction.

We found particular inspiration in Karmiloff-Smith’s (47) and Clark and Karmiloff-Smith’s (48) work on rerepresenting knowledge. It would be interesting to explore modeling the phenomena they considered, which they argued required that representations be “objects for further manipulation” (ref. 48, p. 509), as task representations are in metamapping.

Our work also relates to Fodor’s (49) ideas about the modularity of the mind. Indeed, our division of the architecture into input and output systems, with the flexible, task-specific computations in the middle, may seem very reminiscent of the modularity that he advocated. However, we chose this implementation for simplicity—we believe that in reality processes such as perception can be influenced by the task, as well as contextual constraints (50).

Reciprocally, we believe that higher-level computations are influenced and constrained by the modalities in which they are supported. This computational feature can emerge in our model; despite the fact that different types of data and tasks are embedded in a shared latent space, the model generally learns to organize distinct types of inputs into somewhat distinct regions of this space. This means that the task-specific processing can potentially exploit domain-specific features of the input, as, for example, humans do when they use gestures to think and learn in spatial contexts like mathematical reasoning (51). At the same time, the shared space can allow a graded overlap in the structure that is shared across different entities, insofar as they are related to each other. For example, in the polynomial domain there is more overlap between polynomial representations and metamapping representations than between either type of representation and the representations of numerical inputs. Using a shared space allows the model to discover what should be shared and what should be separated—that is, modularity “may not be built in [but] may result from the relationship among representations” (ref. 52, p. 231).

Finally, our approach relates to work on cognitive control (5). The “default” task-network weights could be used to model more automatic processing. This processing can be overridden by task-specific constraints set by the HyperNetwork, when conditioned on an appropriate task representation. We present a simple demonstration of these ideas in SI Appendix, section F.9. Metamapping itself could also be relevant; for example, an imperfect metamapping might capture some failures of control.

Limitations and Future Directions.

Although we believe our approach is promising, the present work has limitations. We have explored metamapping within a limited range of settings. While we used one particular model, metamapping could potentially be useful in any metalearning approach that uses task representations (10). Furthermore, we have demonstrated our model only within relatively simple, small domains. The model adapts quite well, but does not always achieve perfect fidelity of adaptation. One factor that may contribute is the relatively limited range of experience of the model—our models lack the rich lifetime of experience that our human participants have. Furthermore, recent work shows that more realistic and embodied environments can improve generalization (8). Thus, evaluating our approach in richer, more realistic settings will be an important future direction.

Another important limitation is that our approach requires the imposition of structured training to provide the network with experience of the relationships between tasks. However, we suggest that identifying task relationships is useful for building more flexible intelligent systems and that exposure to task relationships is an important part of human experience. A long-term goal would be to create a system that learns to identify task relationships for itself from such experience.

Our work suggests many other possibilities. For simplicity we considered using language, examples, and metamapping to infer task representations in this work. However, it would likely be beneficial to use multiple constraints to both infer and adapt task representations. Furthermore, we considered language as input, but producing language as output (in the form of explanations) can improve understanding and generalization in both humans (53) and neural networks (54). Adding language output would likely improve performance and better capture the structure of human behavior.

In addition, we did not thoroughly explore robustness and the effect of noise. We showed that our approach is reasonably robust to sample-size variability (SI Appendix, Figs. S5 and S7), but there is room for further exploration. For example, how would input noise affect the computations? How would errors compound if multiple metamappings were applied sequentially?

Our model architecture also has limitations; cognitive tasks often require more complex processing than our model allows. Replacing the feedforward task network with a recurrent or attentional network—or a network with external memory (55)—would increase the flexibility of the model. It will be important to incorporate these ideas in future work.

Conclusions

An intelligent system should be able to adapt to novel tasks zero shot, based on the relationship between the novel task and prior tasks. We have proposed a computational implementation of this ability. Our approach is based on constructing task representations and learning to transform those task representations through metamappings. We have also proposed a homogeneous implementation that reuses the same architectures for both basic tasks and metamappings. We see our proposal as a logical development from the fundamental idea of metalearning—that tasks themselves can be seen as data points in a higher-order task of learning to learn. This insight leads to the idea of transforming task representations just like we transform data.

Metamapping is an extremely general approach—we have shown that it performs well across several domains and computational paradigms, with task representations constructed from either examples or language. Metamapping is able to perform well at new tasks zero shot, even when the new task directly contradicts prior learning. It is generally able to adapt more effectively after experiencing fewer tasks than approaches relying on language alone and sometimes seems to exhibit more systematic behavior. We suggest that this is because task relationships better capture the underlying conceptual structure. Metamapping provides a valuable starting point for later learning, one that can substantially reduce both time to learn a new task and cumulative errors made in learning. Our results thus provide a possible mechanism for an advanced form of cognitive adaptability and illustrate the role it may play in future learning. We hope our work will lead to a better understanding of human cognitive flexibility and to the development of artificial intelligence systems that can learn and adapt more flexibly.

Supplementary Material

Supplementary File

Acknowledgments

A.K.L. was supported by a National Science Foundation Graduate Research Fellowship. We appreciate helpful suggestions from Noah Goodman, Surya Ganguli, Felix Hill, Steven Hansen, Erin Bennett, Katherine Hermann, Arianna Yuan, Andrew Nam, Effie Li, and the anonymous reviewers.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission.

*Of course, with different input types, this type of processing will be different. While the core model components are similar across experiments, the input and output systems can therefore differ.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2008852117/-/DCSupplemental.

Data Availability.

All study data are included in this article and SI Appendix.

References

  • 1.Siegelmann H. T., Turing on super-Turing and adaptivity. Prog. Biophys. Mol. Biol. 113, 117–126 (2013). [DOI] [PubMed] [Google Scholar]
  • 2.Lake B. M., Ullman T. D., Tenenbaum J. B., Gershman S. J., Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017). [DOI] [PubMed] [Google Scholar]
  • 3.Marcus G., Deep learning: A critical appraisal. arXiv:1801.00631 (2 January 2018).
  • 4.Russin J., O’Reilly R. C., Bengio Y., “Deep learning needs a pre-frontal cortex” in ICLR Workshop on Bridging AI and Cognitive Science. https://baicsworkshop.github.io/. Accessed 26 April 2020.
  • 5.Dunbar K., Cohen J. D., McClelland J. L., On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychol. Rev. 97, 332–361 (1990). [DOI] [PubMed] [Google Scholar]
  • 6.Larochelle H., Erhan D., Bengio Y., “Zero-data learning of new tasks” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, Fox D., Gomes C. P., Eds. (AAAI Press, Menlo Park, CA, 2008), pp. 645–651. [Google Scholar]
  • 7.Hermann K. M., et al. , Grounded language learning in a simulated 3D world. arXiv:1706.06551 (26 June 2017).
  • 8.Hill F., et al. , “Environmental drivers of generalization in a situated agent” in Proceedings of the 8th International Conference on Learning Representations. https://openreview.net/pdf?id=SklGryBtwr. Accessed 5 May 2020.
  • 9.Vinyals O., Blundell C., Lillicrap T., Kavukcuoglu K., Wierstra D., Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 29, 3630–3639 (2016). [Google Scholar]
  • 10.Rusu A. A., et al. , “Meta-Learning with latent embedding optimization” in Proceedings of the 7th International Conference on Learning Representations (International Conference on Learning Representations, 2019). [Google Scholar]
  • 11.Oh J., Singh S., Lee H., Kohli P., “Zero-shot task generalization with multi-task deep reinforcement learning” in Proceedings of the 34th International Conference on Machine Learning, Precup D., Teh Y. W., Eds. (Journal of Machine Learning Research, 2017), Vol. 70, pp. 2661–2670. [Google Scholar]
  • 12.Garnelo M., et al. , “Conditional neural processes” in Proceedings of the 35th International Conference on Machine Learning, Dy J. G., Krause A., Eds. (Journal of Machine Learning Research, 2018), Vol. 80, pp. 1704–1713. [Google Scholar]
  • 13.Ha D., Dai A., Le Q. V., “HyperNetworks” in Proceedings of the 5th International Conference on Learning Representations. https://openreview.net/references/pdf?id=BkXLhI7te. Accessed 15 June 2018.
  • 14.McClelland J. L., Putting knowledge in its place: A scheme for programming parallel processing structures on the fly. Cogn. Sci. 9, 113–146 (1985). [Google Scholar]
  • 15.Wilensky U., “Abstract meditations on the concrete and concrete implications for mathematics education” in Constructionism, Harel I., Papert S., Eds. (Ablex Publishing, 1991). https://ccl.northwestern.edu/papers/concrete/. Accessed 25 March 2020. [Google Scholar]
  • 16.Hazzan O., Reducing abstraction level when learning abstract algebra concepts. Educ. Stud. Math. 40, 71–90 (1999). [Google Scholar]
  • 17.Lampinen A. K., McClelland J. L., Different presentations of a mathematical concept can support learning in complementary ways. J. Educ. Psychol. 110, 664–682 (2018). [Google Scholar]
  • 18.Bourne L. E., Knowing and using concepts. Psychol. Rev. 77, 546–556 (1970). [Google Scholar]
  • 19.Hill F., Santoro A., Barrett D., Morcos A., Lillicrap T., “Learning to make analogies by contrasting abstract relational structure” in Proceedings of the 7th International Conference on Learning Representations https://openreview.net/pdf?id=SylLYsCcFm. Accessed 14 June 2019.
  • 20.Niv Y., Reinforcement learning in the brain. J. Math. Psychol. 53, 139–154 (2009). [Google Scholar]
  • 21.Dabney W., et al. , A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Silver D., et al. , Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). [DOI] [PubMed] [Google Scholar]
  • 23.Vinyals O., et al. , Grandmaster level in Starcraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019). [DOI] [PubMed] [Google Scholar]
  • 24.Reed S., de Freitas N., “Neural programmer-interpreters” in Proceedings of the 4th International Conference on Learning Representations https://arxiv.org/pdf/1511.06279.pdf. Accessed 5 September 2017.
  • 25.Rogers T. T., McClelland J. L., Semantic Cognition: A Parallel Distributed Processing Approach (MIT Press, 2004). [DOI] [PubMed] [Google Scholar]
  • 26.Lampinen A. K., McClelland J. L., One-shot and few-shot learning of word embeddings. arXiv:1710.10280 (27 October 2017).
  • 27.Finn C., Abbeel P., Levine S., “Model-agnostic meta-learning for fast adaptation of deep networks” in Proceedings of the 34th Annual Conference on Machine Learning, Precup D., Teh Y. W., Eds. (Journal of Machine Learning Research, 2017), Vol. 70, pp. 1126–1135. [Google Scholar]
  • 28.Turchetta M., Berkenkamp F., Krause A., Safe exploration in finite Markov decision processes with Gaussian processes. Adv. Neural Inf. Process. Syst. 29, 4312–4320 (2016). [Google Scholar]
  • 29.Ravichandran A., Bhotika R., Soatto S., Few-shot learning with embedded class models and shot-free meta training. arXiv:1905.04398 (10 May 2019).
  • 30.Hinton G. E., Plaut D. C., “Using fast weights to deblur old memories” in Proceedings of the 9th Annual Conference of the Cognitive Science Society (Lawrence Erlbaum Associates, Hillsdale, NJ, 1987), pp. 177–186. [Google Scholar]
  • 31.Li H., et al. , “LGM-Net: Learning to generate matching networks for few-shot learning” in Proceedings of the 36th International Conference on Machine Learning, Chaudhuri K., Salakhutdinov R., Eds. (Journal of Machine Learning Research, 2019), pp. 3825–3834. [Google Scholar]
  • 32.Socher R., Ganjoo M., Manning C. D., Ng A. Y., “Zero-shot learning through cross-modal transfer” in Advances in Neural Information Processing Systems 26, Burges C. J. C., Bottou L., Welling M., Ghahramani Z., Weinberger K. Q., Eds. (Neural Information Processing Systems Foundation, 2013), pp. 935–943. [Google Scholar]
  • 33.Xian Y., Lampert C. H., Schiele B., Akata Z., Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2251–2265 (2018). [DOI] [PubMed] [Google Scholar]
  • 34.Pal A., Balasubramanian V. N., “Zero-shot task transfer” in Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Gupta A., Hoiem D., Hua G., Tu Z., Eds. (IEEE, 2019), pp. 2189–2198. [Google Scholar]
  • 35.Laroche R., Barlier M., “Transfer reinforcement learning with shared dynamics” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, Singh S., Markovitch S., Eds. (AAAI Press, Palo Alto, CA, 2017), pp. 2147–2153. [Google Scholar]
  • 36.Achille A., et al. , Task2Vec: Task embedding for meta-learning. arXiv:1902.03545 (10 February 2019).
  • 37.Eysenbach B., Gupta A., Ibarz J., Levine S., “Diversity is all you need: Learning skills without a reward function” in Proceedings of the 7th International Conference on Learning Representations. https://openreview.net/pdf?id=SJx63jRqFm. Accessed 9 May 2019.
  • 38.Hsu K., Levine S., Finn C., “Unsupervised learning via meta-learning” in Proceedings of the 7th International Conference on Learning Representations https://openreview.net/pdf?id=r1My6sR9tX. Accessed 9 May 2019.
  • 39.Gentner D., “Why We’re So Smart” in Language in Mind: Advances in the Study of Language and Thought, Gentner D., Goldin-Meadow S., Eds. (MIT Press, Cambridge, MA, 2003), pp. 195–235. [Google Scholar]
  • 40.Gick M. L., Holyoak K. J., Analogical problem solving. Cogn. Psychol. 12, 306–355 (1980). [Google Scholar]
  • 41.Fodor J. A., Language, thought and compositionality. Mind Lang. 16, 1–15 (2001). [Google Scholar]
  • 42.Fodor J. A., LOT 2: The Language of Thought Revisited (Oxford University Press, 2008). [Google Scholar]
  • 43.Lake B. M., Baroni M., “Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks” in Proceedings of the 36th International Conference on Machine Learning, Chaudhuri K., Salakhutdinov R., Eds. (Journal of Machine Learning Research, 2018), pp. 2873–2882. [Google Scholar]
  • 44.McClelland J. L., et al. , Letting structure emerge: Connectionist and dynamical systems approaches to cognition. Trends Cognit. Sci. 14, 348–356 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Hansen S. S., Lampinen A., Suri G., McClelland J. L., Building on prior knowledge without building it in. Behav. Brain Sci. 40, e268 (2017). [DOI] [PubMed] [Google Scholar]
  • 46.Baars B. J., Global workspace theory of consciousness: Toward a cognitive neuroscience of human experience. Prog. Brain Res. 150, 45–53 (2005). [DOI] [PubMed] [Google Scholar]
  • 47.Karmiloff-Smith A., From meta-processes to conscious access: Evidence from children’s metalinguistic and repair data. Cognition 23, 95–147 (1986). [DOI] [PubMed] [Google Scholar]
  • 48.Clark A., Karmiloff-Smith A., The cognizer’s innards: A psychological and philosophical perspective on the development of thought. Mind Lang. 8, 487–519 (1993). [Google Scholar]
  • 49.Fodor J. A., The Modularity of Mind (MIT Press, 1983). [Google Scholar]
  • 50.McClelland J. L., Mirman D., Bolger D. J., Khaitan P., Interactive activation and mutual constraint satisfaction in perception and cognition. Cogn. Sci. 38, 1139–1189 (2014). [DOI] [PubMed] [Google Scholar]
  • 51.Goldin-Meadow S., The role of gesture in communication and thinking. Trends Cogn. Sci. 3, 419–429 (1999). [DOI] [PubMed] [Google Scholar]
  • 52.Tanenhaus M. K., Lucas M. M., Context effects in lexical processing. Cognition 25, 213–234 (1987). [DOI] [PubMed] [Google Scholar]
  • 53.Chi M. T., De Leeuw N., Chiu M. H., Lavancher C., Eliciting self-explanations improves understanding. Cogn. Sci. 18, 439–477 (1994). [Google Scholar]
  • 54.Mu J., Liang P., Goodman N., “Shaping visual representations with language for few-shot classification” in Visually Grounded Interaction and Language Workshop, NeurIPS (2019). https://vigilworkshop.github.io/2019. Accessed 13 December 2019.
  • 55.Graves A., et al. , Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016). [DOI] [PubMed] [Google Scholar]
